Compiler Transformations for High-Performance Computing

DAVID F. BACON, SUSAN L. GRAHAM, AND OLIVER J. SHARP

Computer Sc~ence Diviszon, Unwersbty of California, Berkeley, California 94720

In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimization for uniprocessors reduce the number of instructions executed by the program using transformations based on the analysis of scalar quantities and data-flow techniques. In contrast, optimization for high-performance superscalar, vector, and parallel processors maximize parallelism and memory locality with transformations that rely on tracking the properties of arrays using loop dependence analysis. This survey is a comprehensive overview of the important high-level program restructuring techniques for imperative languages such as C and Fortran. Transformations for both sequential and various types of parallel architectures are covered in depth. We describe the purpose of each traneformation, explain how to determine if it is legal, and give an example of its application. Programmers wishing to enhance the performance of their code can use this survey to improve their understanding of the optimization that compilers can perform, or as a reference for techniques to be applied manually. Students can obtain an overview of technology. Compiler writers can use this survey as a reference for most of the important optimizations developed to date, and as a bibliographic reference for the details of each optimization. Readers are expected to be familiar with modern computer architecture and basic program compilation techniques.

Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent Programming; D.3.4 [Programming Languages]: Processors—compilers; optimization; 1.2.2 [Artificial Intelligence]: Automatic Programming-program transformation

General Terms: Languages, Performance

Additional Key Words and Phrases: Compilation, dependence analysis, locality, multiprocessors, optimization, parallelism, superscalar processors, vectorization

INTRODUCTION As optimizing compilers become more Optimizing compilers have become an es- effective, programmers can become less sential component of modern high-perfor- concerned about the details of the under- mance computer systems. In addition to lying machine architecture and can em- translating the input program into ma- ploy higher-level, more succinct, and chine language, they analyze it and ap- more intuitive programming constructs ply various transformations to reduce its and program organizations. Simultane- running time or its size. ously, hardware designers are able to

This research has been sponsored in part by the Defense Advanced Research Projects Agency (DARPA) under contract DABT63-92-C-O026, by NSF grant CDA-8722788, and by an IBM Resident Study Program Fellowship to David Bacon, Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and]or specific permission. @ 01994 0360-0300/94/1200-0345 $03.50

ACM Computing Surveys, Vol. 26, No. 4, December 1994 346 * David F. Bacon et al.

CONTENTS imperative languages such as Fortran and C for high-performance architec- tures, including superscalar, vector, and various classes of multiprocessor

INTRODUCTION machines. 1. SOURCE LANGUAGE Most of the transformations we de- 2 TRANSFORMATION ISSUES scribe can be applied to anyofthe Algol- 2 1 Correctness 22 Scope family languages; many are applicable to 3 TARGET ARCHITECTURES functional, logic, distributed, and object- 31 Characterizing Performance oriented languages as well. These other 32 Model Architectures languages raise additional optimization 4 COMPILER ORGANIZATION issues that space does not permit us 5 DEPENDENCE ANALYSIS 51 Types of Dependences to cover in this survey. The references 52 Representing Dependences include some starting points for investi- 5,3 Loop Dependence Analysis gation of optimizations for LISP and 5.4 Subscript Analysis functional languages [Appel 1992; Clark 6. TRANSFORMATIONS and Peyton-Jones 1985; Kranz et al. 61 Data-Flow-Based Loop Transformations 6.2 Loop Reordering 1986], object-oriented languages [Cham- 6.3 Loop Restructuring bers and Ungar 1989], and the set-based 6.4 Loop Replacement Transformations language SETL [Freudenberger et al. 6.5 Memory Access Transformations 1983]. 6.6 Partial Evaluation 6.7 Redundancy Ehmmatlon We have also restricted the discussion 68 Procedure Call Transformations to higher-level transformations that re- 7. TRANSFORMATIONS FOR PARALLEL quire some program analysis. Thus we MACHINES exclude peephole optimizations and most 71 Data Layout machine-level optimizations. We use the 72 Exposing Coarse-Grained Parallelism 73 Computation PartltIOnmg term optimization as shorthand for opti- 74 Communication Optimization mizing transformation. 75 SIMD Transformations Finally, because of the richness of the 76 VLIWTransformatlons topic, we have not given a detailed de- 8 TRANSFORMATION FRAMEWORKS scription of intermediate program repre- 8.1 Unified Transformation sentations and analysis techniques. 8.2 Searching the Transformation Space 9. COMPILER EVALUATION We make use of a number of different 9.1 Benchmarks machine models, all based on a hypothet- 9.2 Code Characterlstlcs ical superscalar processor called S-DLX. 9.3 Compiler Effectiveness The Appendix details the machine mod- 94 Instruction-Level Parallehsm CONCLUSION els, presenting a simplified architecture APPENDIX: MACHINE MODELS and instruction set that we use when we A.1 Superscalar DLX need to discuss machine code. While all A 2 Vector DLX assembly language examples are com- A3 Multiprocessors ACKNOWLEDGMENTS mented, the reader will need to refer to REFERENCES the Armendix to understand some details of the’ ~xamples (such as cycle counts). We assume a basic familiarity with ~ro~am compilation issues. Readers un- employ designs that yield greatly im- ~am~liar with program flow analysis or proved performance because they need other basic compilation techniques may only concern themselves with the suit- consult Aho et al. [1986]. ability of the designas a compiler target, not with its suitability as a direct pro- 1. SOURCE LANGUAGE grammer interface. In this survey we describe transforma- All of the high-level examples in this tions that optimize programs written in survey are written in a language similar

ACM Computing Surveys, Vol. 26, No. 4, December 1994 Compiler Transformations “ 347 to Fortran 90, with minor variations and Programmers unfamiliar with Fortran extensions; most examples use only fea- should also note that all arguments are tures found in Fortran 77. We have cho- passed by reference. sen to use Fortran because it is the de When describing compilation for vector facto standard of the high-performance machines, we sometimes use array nota- engineering and scientific computing tion when the mapping to hardware reg- community. Fortran has also been the isters is clear. For instance, the loop input language for a number of research doalli=l,64 projects studying parallelization [Allen et a[i] = a[i] + c al. 1988a; Balasundaram et al. 1989; end do all Polychronopoulos et al. 1989]. It was cho- sen by these projects not only because of could be implemented with a scalar-vec- its ubiquity among the user community, tor add instruction, assuming a machine but also because its lack of pointers and with vector registers of length 64. This its static memory model make it more would be written in Fortran 90 array amenable to analysis. It is not yet clear notation as what effect the recent introduction of a[l :64] = a[l :64] + c pointers into Fortran 90 will have on optimization. or in vector machine assembly language The optimizations we have presented as are not specific to Fortran—in fact, many commercial compilers use the same in- LF F2, c(R30) ;Ioad c into register F2 termediate language and optimizer for ADDI R8, R30, #a ;Ioad addr. of a into R8 both C and Fortran. The presence of un- LV VI, R8 ;Ioad vector a[l :64] to restricted pointers in C can reduce oppor- V1 tunities for optimization because it is ADDSV VI, F2, VI ;add scalar to vector Sv VI. R8 ;store vector in a[l :64] impossible to determine which variables may be referenced by a pointer. The pro- cess of determining which references may Array notation will not be used when point to the same storage locations is loop bounds are unknown, because there called [Banning 1979; Choi is no longer an obvious correspondence et al. 1993; Cooper and Kennedy 1989; between the source code and the fixed- Landi et al. 1993]. length vector operations that perform the The only changes to Fortran in our computation. To use vector operations, examples are that array subscripting is the compiler must perform the transfor- denoted by square brackets to distin- mation called strip-mining, which is dis- guish it from function calls; and we use cussed in Section 6.2.4. do all loops to indicate textually that all the iterations of a loop may be executed 2. TRANSFORMATION ISSUES concurrently. To make the structure of For a compiler to apply an optimization the computation more explicit, we will to a program, it must do three things: generally express loops as iterations rather than in Fortran 90 array notation. (1) Decide upon a part of the program to We follow the Fortran convention that optimize and a particular transfor- arrays are stored contiguously in mem- mation to apply to it. ory in column-major form. If the array is (2) Verify that the transformation either traversed so as to visit consecutive loca- does not change the meaning of the tions in the linear memory, the first (that program or changes it in a restricted is, leftmost) subscript varies the fastest. way that is acceptable to the user. In traversing the two-dimensional array (3) Transform the program. 
declared as a[n, m], the following array locations are contiguous: a[n – 1, 3], In this survey we concentrate on the a[n, 31, a[l, 41, a[2, 41. last step: transformation of the program.

ACM Computmg Surveys, Vol. 26, No 4, December 1994 348 ● David F. Bacon et al.

However, in Section 5 we introduce de- tions in the two executions produces the pendence analysis techniques that are same result. used for deciding upon and verifying many loop transformations. Nondeterminism can be introduced by Step (1) is the most difficult and poorly language constructs (such as Ada’s se- understood; consequently compiler de- lect statement), or by calls to system or sign is in many ways a black art. Be- library routines that return information cause analysis is expensive, engineering about the state external to the program issues often constrain the optimization (such as Unix time( ) or read()). strategies that the compiler can perform. In some cases it is straightforward to Even for relatively straightforward cause executions to be identical, for in- uniprocessor target architectures, it is stance by entering the same inputs, In possible for a sequence of optimizations other cases there may be support for de- to slow a program down. For example, an terminism in the programming environ- attempt to reduce the number of instruc- ment—for instance, a compilation option tions executed may actually degrade per- that forces select statements to evaluate formance by making less efficient use of their guards in a deterministic order. As the cache. a last resort, the programmer may have As processor architectures become to replace nondeterministic system calls more complex, (1) the number of dimen- temporarily with deterministic opera- sions in which optimization is possible tions in order to compare two versions. increases and (2) the decision process is To illustrate many of the common greatly complicated. Some progress has problems encountered when trying to op- been made in systematizing the applica- timize a program and yet maintain cor- tion of certain families of transforma- rectness, Figure l(a) shows a subroutine, tions, as discussed in Section 8. However, Figure l(b) is a transformed version of it. all optimizing compilers embody a set of The new version may seem to have the heuristic decisions as to the transforma- same semantics as the original, but it tion orderings likely to work best for the violates Definition 2.1.1 in the following target machine(s). ways:

a Overflow. If b[k] is a very large num- 2.1 Correctness ber, and a[l ] is negative, then chang- When a program is transformed by the ing the order of the additions so that compiler, the meaning of the program C = b [k] + 100000.0 is executed before should remain unchanged. The easiest the loop body could cause an overflow way to achieve this is to require that the to occur in the transformed program transformed program perform exactly the that did not occur in the original. Even same operations as the original, in ex- if the original program would have actly the order imposed by the semantics overflowed, the transformation causes of the language. However, such a strict the exception to happen at a different interpretation leaves little room for im- point. This situation complicates de- provement. The following is a more bugging, since the transformation is practical definition. not visible to the programmer. Finally, if there had been a print statement be- Definition 2.1.1 A transformation is tween the assignment and use of C, legal if the original and the transformed the transformation would actually programs produce exactly the same out- change the output of the program. put for all identical executions. * Different results. Even if no overflow Two executions of a program are iden- occurs, the values that are left in the tical executions if they are supplied with array a may be slightly different be- the same input data and if every corre- cause the order of the additions has sponding pair of nondeterministic opera- been changed. The reason is that float-

ACM Computmg Surveys, Vol 26, No 4, December 1994 Compiler Transformations ● 349

subroutine tricky (a, b,n, m,k) ● Memory fault. If k > m but n < 1, the integer n, m, k reference to b[k] is illegal. The refer- real a[m], b[m] ence would not be evaluated in the original program because the loop body doi=l, n is never executed, butit would occurin a[i] = b[k] + a[i] + 100000.0 the call shown in Figure l(c) to the end do transformed code in Figure l(b). return e Different results. a and b may be com- (a) original program pletely or partially aliased to one an- other, changing the values assigned to a in the transformed program. Figure subroutine tricky(a,b,n,m,k) l(d) shows how this might occur: If the integer n, m, k call is to the original subroutine, b[k] real a[m], b[m], C is changed when i = 1, since k = n and b [n] is aliased to a[l ]. In the trans- C = b[k] + 100000.0 formed version, the old value of b [k] is doi=n, i,-1 used for all i, since it is read before the a[i] = a[i] + C loop. Even if the reference to b[k] were end do moved back inside the loop, the trans- return formed version would still give differ- (b) transformed program ent results because the loop traversal order has been reversed.

k = m+l As a result of these problems, slightly n=O different definitions are used in practice. call tricky(a,b,n,m,k) When bitwise identical results are de- (c) possible call to tricky sired, the following definition is used:

Definition 2.1.2 A transformation is equivalence (a[l], b[n]) legal if, for all semantically correct pro- gram executions, the original and the k=n transformed programs produce exactly call tricky(a,b,n,m,k) the same output for identical executions. (d) possible call to tricky Languages typically have many rules Figure 1. Incorrect program transformations. that are stated but not enforced; for in- stance in Fortran, array subscripts must remain within the declared bounds, but ing-point numbers are approximations this rule is generally not enforced at ofrealnumbers ,andtheorder inwhich compile- or run-time. A program execu- the approximations are applied(round- tion is correct if it does not violate the ing) can affect the result. However, for rules of the language. Note that correct- a sequence of commutative and asso- ness is a property of a program execu- ciative integer operations, if no order tion, not of the program itself, since the of evaluation can cause an exception, same program may execute correctly un- then all evaluation orders are equiva- der some inputs and incorrectly under lent. We call operations that are alge- others. braically but not computationally Fortran solves the problem of parame- commutative (or associative) semicom- ter aliasing by declaring calls that alias mutative (or semiassociative) opera- scalars or parts of arrays illegal, as in tions. These issues do not arise with Figure l(d). In practice, many programs Boolean operations, since they do not make use of aliasing anyway. While this cause exceptions or compute approxi- may improve performance, it sometimes mate values. leads to very obscure bugs.

ACM Computing Surveys, Vol. 26, No. 4, December 1994 350 “ David F. Bacon et al.

For languages and standards that de- niques. The advantage for analysis is fine a specific semantics for exceptions, that there is only one entry point, so meeting Definition 2.1.2 tightly con- control transfer need not be considered strains the permissible transforma- in tracking the behavior of the code. tions. If the compiler may assume that * Innermost loop. To target high-perfor- exceptions are only generated when the mance architectures effectively, com- program is semantically incorrect, a pilers need to focus on loops. Most of transformed program can produce differ- the transformations discussed in this ent results when an exception occurs and survey are based on loop manipulation. still be legal. When exceptions have a Many of them have been studied or well-defined semantics, as in the IEEE widely applied only in the context of floating-point standard [American Na- the innermost loop. tional Standards Institute 1987], many transformations will not be legal under 0 Perfect loop nest. A loop nest is a set of this definition. loops one inside the next. The nest is However, demanding bitwise identical called a perfect nest if the body of ev- results may not be necessary. An alterna- ery loop other than the innermost con- tive correctness criterion is: sists of only the next loop in the nest. Because a perfect nest is more easily Definition 2.1.3 A transformation is summarized and reorganized, several legal if, for all semantically correct ex- transformations apply only to perfect ecutions of the original program, the nests. original and the transformed programs e General loop nest. Any loop nesting, perform equivalent operations for identi- perfect or not. cal executions. All permutations of semi- commutative operations are considered ● Procedure. Some optimizations, mem- equivalent. ory access transformations in particu- lar, yield better improvements if they Since we cannot predict the degree to are applied to an entire procedure at which transformations of semicommuta- once. The compiler must be able to tive operations change the output, we manage the interactions of all the ba- must use an operational rather than an sic blocks and control transfers within observational definition of equivalence. the procedure. The standard and rather Generally, in practice, programmers ob- confusing term for procedure-level op- serve whether the numeric results differ timization in the literature is global by more than a certain tolerance, and if optimization. they do, force the compiler to employ 0 InterProcedural. Considering several Definition 2.1.2. proc~dures together often exp~ses more opportunities for optimization; in par- 2.2 Scope ticular, procedure call overhead is of- ten significant, and can sometimes be Optimizations can be applied to a pro- reduced or eliminated with inter- gram at different level~ of granularity. procedural analysis. As the scope of the transformation is en- larged, the cost of analysis generally in- We will generally confine our attention creases. Some useful gradations of to optimizations beyond the basic block complexity are: level. * Statement. Arithmetic expressions are the main source of potential optimiza- tion within a statement. 3. TARGET ARCHITECTURES 0 Basic block (straight-line code). This is In this survey we discuss compilation the focus of early optimization tech- techniques for high-performance archi-

ACM Computmg Surveys, Vol. 26, No 4, December 1994 Compiler Transformations ● 351 tectures, which for our purposes are su- ware detects that the result of a vector perscalar, vector, SIMD, shared-memory multiply is used by a vector add, and multiprocessor, and distributed-memory uses a strategy called chaining in which multiprocessor machines. These architec- the results from the multiplier’s pipeline tures have in common that they all use are fed directly into the adder. Other parallelism in some form to improve compound operations are sometimes performance. implemented as well. The structure of an architecture dic- Use of such compound operations may tates how the compiler must optimize allow an application to run up to twice as along a number of different (and some- fast. It is therefore important to organize times competing) axes. The compiler must the code so as to use multiply-adds wher- attempt to: ever possible. e maximize use of computational re- sources (processors, functional units, 3.1 Characterizing Performance vector units), . minimize the number of operations There are a number of variables that we performed, have used to help describe the perfor- mance of the compiled code quantita- ● minimize use of memory bandwidth (register, cache, network), and tively: - minimize the size of total memory e S is the hardware speed of a single required. processor in operations per second. While optimization for scalar CPUS has Typically, speed is measured either in concentrated on minimizing the dynamic millions of instructions per second instruction count (or more precisely, the (MIPS) or in millions of floating-point number of machine cycles required), operations per second (megaflops). CPU-intensive applications are often at e P is the number of processors. least as dependent upon the performance 0 F is the number of operations executed of the memory system as they are on the by a program. performance of the functional units. 0 In particular, the distance in memory T is the time in seconds to run a pro- between consecutively accessed elements gram. of an array can have a major perfor- e U = F/ST is the utilization of the ma- mance impact. This distance is called the chine by a program; a utilization of 1 is stride. If a loop is accessing every fourth ideal, but real programs typically have element of an array, it is a stride-4 loop. significantly lower utilizations. A su- If every element is accessed in order, it is perscalar machine can issue several in- a stride-1 loop. Stride-1 access is desir- structions per cycle, but if the program able because it maximizes memory local- does not contain a mixture of opera- ity and therefore the efficiency of the tions that matches the mixture of func- cache, translation lookaside buffer (TLB), tional units, utilization will be lower. and paging systems; it also eliminates Similarly, a vector machine cannot be bank conflicts on vector machines. utilized fully unless the sizes of all of Another key to achieving peak perfor- the arrays in the program are an exact mance is the paired use of multiply and multiple of the vector length. add operations in which the result from 0 Q measures reuse of operands that are the multiplier can be fed into the adder. stored in memory. It ii the ratio of the For instance, the IBM RS/6000 has a number of times the operand is refer- multiply-add instruction that uses the enced during computation to the num- multiplier and adder in a pipelined fash- ber of times it is loaded into a register. 
ion; one multiply-add can be issued each As the value of Q rises, more reuse is cycle. The Cray Y-MP C90 does not have being achieved, and less memory band- a single instruction; instead the hard- width is consumed. Values of Q below

ACM Computing Surveys, Vol. 26, No. 4, December 1994 352 e David F. Bacon et al.

1 indicate that redundant loads are level intermediate language, and object occurring. code. First, optimizations are applied to a 0 Qc is an analogous quantity that mea- high-level intermediate language (HIL) sures reuse of a cache line in memory. that is semantically very close to the It is the ratio of the number of words original source program. Because all the that are read out of the cache from semantic information of the source m-o- that particular line to the number of gram is available, higher-level tran~for- times the cache fetches that line from mations are easier to apply. For instance, memory. array references are clearly distinguish- able, instead of being a sequence of low- level address calculations. 3.2 Model Architectures Next, the program is translated to In the Appendix we present a series of low-level intermediate form (LIL), essen- model architectures that we will use tially an abstract machine language, and throughout this survey to demonstrate optimized at the LIL level. Address com- the effect of’ various compiler transforma- putations provide a major source of tions. The architectures include a super- potential optimization at this level. For scalar CPU (S-DLX), a vector CPU instance, two references to a[5, 3] and (V-DLX), a shared-memory multiproces- a[7, 3] both require the computation of sor (sMX), and a distributed-memory the address of column 3 of array a. multiprocessor (dMX). We assume that Finally, the program is translated to the reader is familiar with basic princi- object code, and machine-specific opti- ples of modern computer architecture, in- mization are performed. Some optimiza- cluding RISC design, pipelining, caching, tion, like code collocation, rely on profile and instruction-level parallelism. Our information obtained by running the generic architectures are based on DLX, program. These optimizations are of- an idealized RISC architecture intro- ten implemented as binary-to-binary duced by Hennessey and Patterson translations. [1990]. The high-level optimization phase be- gins by doing all of the large-scale re- structuring that will significantly change 4. COMPILER ORGANIZATION the organization of the program. This in- Figure 2 shows the design of a hypotheti- cludes various procedure-restructuring cal compiler for a superscalar or vector optimlizations, as well as scalarization, machine. It includes most of the general- which converts data-parallel constructs purpose transformations covered in this into loops. Doing this restructuring at survey. Because compilers for parallel the beginning allows the rest of the com- machines are still a very active area of piler to work on individual procedures in research, we have excluded these ma- relative isolation, chines from the design. Next, high-level data-flow optimiza- The purpose of this compiler design is tion, partial evaluation, and redundancy to give the reader (1) an idea of how the elimination are performed. These opti- various types of transformations fit to- mization simplify the program as much gether and (2) some of the ordering is- as possible, and remove extraneous code sues that arise. This organization is by that could reduce the effectiveness of no means definitive: different architec- subsequent analysis. tures dictate different designs, and opin- The main focus in compilers for high- ions differ on how best to order the performance architectures is loop opti- transformations. mization. 
A sequence of transformations Optimization takes place in three dis- is performed to convert the loops into a tinct phases, corresponding to three dif- form that is more amenable to optimiza- ferent representations of the program: tion: where possible, perfect loop nests high-level intermediate language, low- are created; subscript expressions are

ACM Computmg Surveys, Vol 26, No 4, December 1994 Compiler Transformations ● 353

F % Source Program Perfect Loop Ne.os lnductton Vartables I Loop Reordering Loop Intercharrga Loop Skewing Loop Reversal Loop Tthng Loop Coalescing Strip Mining Cycle Shnnkma “FTail-recursion Ehmmatlon Procedure Cull Graph Procedure Anwkmons 1

t Low-level IL Generation High-Level Dataflow Optimization ConstantPropagationand Folding Low-lsvel Inferresdlate Copy Props ation Language (UL) Common.%%sxpreswonElimination Low-level Dataflow Optimization Loop-invariant Code MotIon Constant Propagation and Folding Loop Unswtfching Copy Propagation Stranoth Reduction Common Subaxpression Elimination Loq-mvanant Code Motion of lnd.Var. Exprs. Reassociation Elimination

Partial Evaluation Algebralc Slmplficatron Low-level Redundancy Eliminatiorr Short-circuiting &%%%%%X:?duct’On =

Redundancy Elimination UnreachableCode Elimmation UsaleasCede Elimmation Dead Variable Eltminat!on

! LooIz Preparation I I Cross-Call Re Ister Allocation Loop Distribution = Loop Normalization Assemblv L@muaee 1 Loop Peeling Scalar Expanaion Micro-optimization Peephok Optimization Depeedsnce Vectors t Superoptlmlzatlon Loop Preparation II 1 ForwardSubstation Array Padding Array Alignment Idiom and ReductionRecognition LOODCollaosmo

Erectable Progrom 1 Instruction Ceche Optimization Code Co-locatiom Displacement Minimization Cache Conflict Avoidance

t Final Opttmtzed Co&

Pigure 2. Organization of a hypothetical optimizing compiler.

ACM Computmg Surveys, Vol. 26, No. 4, December 1994 354 “ David F. Bacon et al.

rewritten in terms of the induction vari- There is a control dependence between ables; and so on. Then the loop iterations statement 1 and statement 2, written S1 are reordered to maximize parallelism 4s2, when statement S1 determines and locality. A postprocessing phase or- whether S’z will be executed. For exam- ganizes the resulting loops to minimize ple: loop overhead. 1 After the , the pro- if (a = 3) then 2 b=10 gram is converted to low-level intermedi- end if ate language (LIL). Many of the data-flow and redundancy elimination optimiza- Two statements have a data dependence tion that were applied to the HIL are if they cannot be executed simultane- reapplied to the LIL in order to eliminate ously due to conflicting uses of the same inefficiencies in the component expres- variable. There are three types of data sions generated from the high-level con- dependence: flow dependence (also structs. Additionally, induction variable called true dependence), antidependence, optimization are performed, which are and output dependence. Sb has a flow often important for high performance on dependence on S~ (denoted by S~ - S’d) uniprocessor machines. A final LIL opti- when S~ must be executed first because mization pass applies procedure call it writes a value that is read by S’d. For optimizations. example: Code generation converts LIL into assembly language. This phase of the 3 a=c=lo compiler is responsible for instruction 4 d=2*a+c selection, , and SG has an antidependence on S~ (denoted . The compiler applies by S~ * Sc ) when Sc writes a variable low-level optimizations during this phase that is read by S~: to improve further the performance of the code. 5 e= f*4+g Finally, object code is generated by the 6 g=z.h assembler, and the object files are linked into an executable. After profiling, pro- An antidependence does not constrain file-based cache optimization can be ap- execution as tightly as a flow depen- plied to the executable program itself. dence. As before, the code will execute correctly if So is delayed until after S~ 5, DEPENDENCE ANALYSIS completes. An alternative solution is to Among the various forms of analysis used use two memory locations g~ and g~ to by optimizing compilers, the one we rely hold the values read in S~ and written in on most heavily in this survey is depen- SG, respectively. If the write by SG com- dence analysis [Banerjee 1988b; Wolfe pletes first, the old value will still be 1989b]. This section introduces depen- available in g~. dence analysis, its terminology, and the An output dependence holds when both underlying theory. statements write the same variable: A dependence is a relationship be- 7 a=b*c tween two computations that places 8 a=d i-e constraints on their execution order. De- pendence analysis identifies these con- We denote this condition by writing ST straints, which are then used to deter- ~ Sg. Again, as with an antidependence, mine whether a particular transforma- storage replication can allow the state- tion can be applied without changing the ments to execute concurrently. In this semantics of the computation. short example there is no intervening use of a and no control transfer between 5.1 Types of Dependence the two assignments, so the computation There are two kinds of dependence: con- in ST is redundant and can actually be trol dependence and data dependence. eliminated.

ACM Computing Surveys, Vol. 26, No 4, December 1994 Compiler Transformations ● 355

The fourth possible relationship, called doi=2, n an input dependence, holds when two ac- 1 a[i] = a[il + c cesses to the same memory location are 2 b[i] = a[i-1] * b [i] both reads. Although an input depen- end do dence imposes no ordering constraints, Figure 3. Loop-carried dependence. the compiler can make note of it for the purposes of optimizing data placement on multiprocessors. We denote an unspecified type of de- pendence by SI = S2. Another common 5.3 Loop Dependence Analysis notation for _dependences uses S~ 8S4 for fjl~ + S4, S58SG for S5 ++ SG, and S7 ~“sa So far we have examined dependence in for S7 + Sa, the context of straight-line code with con- In the case of data dependence, when ditionals—analyzing loops is a more we write X * Y we are being somewhat complicated problem. In straight-line imprecise: the individual variable refer- code, each statement is executed at most ences within a statement generate the once, so the dependence arcs described so dependence, not the statement as a far capture all the possible dependence whole. In the output dependence example relationships. In loops, each statement above, b, C, d, and e can all be read from may be executed many times, and for memory in any order, and the results of many transformations it is necessary b * c and d + e can be executed as soon as to describe dependence that exist be- their operands have been read from tween iterations, called loop-carried memory. S7 * S~ actually means that dependence. the store of the value b * c into a must A simple example of loop-carried de- precede the store of the value d + e into pendence is shown in Figure 3. There is a. When there is a potential ambiguity, no dependence between S’l and S2 with- we will distinguish between different in any single iteration of the loop, but variable references within statements. there is one between two successive iter- ations. When i = k, S2 reads the value of a[k – 1] written by S1 in iteration k – 1. To compute dependence information for 5.2 Representing Dependence loops, the key problem is to understand the use of arrays; scalar variables are To capture the dependence information relatively easy to manage. To track array for a piece of code, the compiler creates a behavior, the compiler must analyze the dependence graph; typically each node in subscript expressions in each array the graph represents one statement. An reference. arc between two nodes indicates that To discover whether there is a depen- there is a dependence between the com- dence in the loop nest, it is sufficient to putations they represent. determine whether any of the iterations Because it can be cumbersome to ac- can write a value that is read or written count for both control and data depen- by any of the other iterations. dence during analysis, sometimes Depending on the language, loop incre- compilers convert control dependence ments may be arbitrary expressions. into data dependence using a technique However, the dependence analysis algo- called if-conversion [Allen et al. 1983]. rithms may require that the loops have If-conversion introduces additional only unit increments. When they do not, boolean variables that encode the condi- the compiler may be able to normalize tional predicates; every statement whose them to fit the requirements of the anal- execution depends on the conditional is ysis, as described in Section 6.3.6. For then modified to test the boolean vari- the remainder of this section we will as- able. 
In the transformed code, data de- sume that all loops are incremented by 1 pendence subsumes control dependence. for each iteration.

ACM Computing Surveys, Vol. 26, No 4, December 1994 356 e David F. Bacon et al.

do il = 11, u1

do id = [d, Ud 1 a[fl(il, . ...id). fm($l, ($l, . .. ..d)] = .. .

2 . ..= a[gl(il )..., id), . . ..9m(& !... , id)] end do

end do end do

Figure 4. GeneraI loop nest.

Figure 4 shows a generalized perfect In other words, there is a dependence nest of d loops. The body of the loop nest when the values of the subscripts are the reads and writes elements of the m- same in different iterations. If no such 1 dimensional array a. The function f, and and J exist, the two references are inde- g, map the current values of the 100P pendent across all iterations of the loop. iteration variables to integers that index In the case of an output dependence the ith dimension of a. The generalized caused by the same write in different loop can give rise to any type of data iterations, the condition is simply Vp: dependence: for instance, two different fp(~) = fp(w. iterations may write the same element of For example, suppose that we are at- a, creating an output dependence. tempting to describe the behavior of the An iteration can be uniquely named by loop in Figure 5(a). Each iteration of the a vector of d elements 1 = (ii, . . . , id), inner loop writes the element a[i, j]. There where each index falls within the itera- is a dependence if any other iteration tion range of its corresponding loop in reads or writes that same element. In the nesting (that is, 1P < iP < u ). The this case, there are many pairs of itera- outermost loop corresponds to t L e left- tions that depend on each other. Con- most index. sider iterations 1 = (1,3) and J = (2, 2). We wish to discover what loop-carried Iteration 1 occurs first, and writes the dependence exist between the two refer- value a[l, 3]. This value is read in itera- ences to a, and to describe somehow those tion J, so there is a flow dependence dependence that exist. Clearly, a refer- from iteration 1 to iteration J. Extend- ence in iteration J can depend only on ing the notation for dependence, we another reference in iteration 1 that was write I + J. executed before it, not after it. We for- When X - Y, we define the depen- malize the notion of “before” with the < dence distance as Y–X=(yl– relation: xl, ..., y~ – x~). In Figure 5(a), the de- pendence distance J – I = (1,– 1).When I< Jiff3p:(iP

ACM Computmg Surveys, Vol. 26, No, 4, December 1994 Compiler Transformations “ 357

doi=2, n We will use the general term depen- do j = 1, n-1 dence vector to encompass both distance a[i, j] = a[i, j] + a[i-l, j+l] and direction vectors. In Figure 5 the end do direction vector for loop (a) is ( <, >), end do and the direction vectors for loop (b) are (a) {(1, -l)} {(=, <),(<, =),(<, >)}. Notethatadi- rection vector entry of < corresponds to a distance vector entry that is greater doi=l, n than zero. do j = 2, n-1 The dependence behavior of a loop is a[jl=(a[jl + a[j-11 + a[j+il)/3 described by the set of dependence vec- end do tors for each pair of possibly conflicting end do references. These can be summarized into (b) {(0,1),(1,0),(1,-1)} a single loop direction vector, at the ex- pense of some loss of information (and Figure 5. Distance vectors. potential for optimization). The depen- dence of the loop in Figure 5(b) can be summarized as ( < , *). The symbol # V (the first nonzero element of the dis- denotes both a < and > direction. and tance vector must be positive). A nega- * denotes < , > and = . tive element in the distance vector means A particular dependence between that the dependence in the corresponding statements S’l and S2 is denoted by writ- loop is on a higher-numbered iteration. If ing the dependence vector above the ar- the first nonzero element was negative, (<) this would indicate a dependence on a row, for example SI -+ S2. future iteration, which is impossible. Burke and Cytron [1986] present a The reader should note that distance framework for dependence analysis that vectors describe dependence among iter- defines a hierarchy of dependence vec- ations, not among array elements. The tors and allows flow and antide~en- operations on array elements create the dences to be treated symmetrically: An dependence, but the distance vectors de- antidependence is simply a flow depen- scribe dependence among iterations. For dence with an impossible dependence instance, the loop nest that updates the vector (V + O). one-dimensional array a in Figure 5(b) has dependence described by the set of 5.4 Subscript Analysis two-element distance vectors {(O, 1), (1, O), In discussing the analysis of loop-carried (1, - l)}. dependence, we omitted an important de- In some cases it is not possible to de- tail: how the compiler decides whether termine the exact dependence distance at two array references might refer to the compile-time, or the dependence distance same element in different iterations. In may vary between iterations; but there is examining a loop nest, first the compiler enough information to partially charac- tries to prove that different iterations are terize the dependence. A direction vector independent by applying various tests to (introduced by Wolfe [1989b]) is com- the subscript expressions. These tests monly used to describe such depen- rely on the fact that the expressions are dence. almost always linear. When dependence For a dependence I * J, we define the are found, the compiler tries to describe direction vector W = (w ~, ..., w~) where them with a direction or distance vector. If the subscript expressions are too com- if iP < jp < plex to analyze, the compiler assumes . if ip = jp the statements are fully dependent on ‘P= — one another such that no change in exe- > ifip >jP. I cution order is permitted.

ACM Computing Surveys, Vol 26, No 4, December 1994 358 ● David F. Bacon et al.

There are a large variety of tests, all of do i = 1, n-l which can prove independence in some a[i, i] = a[i, i+l] cases. It is infeasible to solve the problem end do directly, even for linear subscript expres- Figure 6. Coupled subscripts. sions, because finding dependence is equivalent to the NP-complete problem of finding integer solutions to systems of the presence of coupled subscript linear Diophantine equations [Banerjee expressions. et al. 1979]. Two general and approxi- Section 9.2 discusses a number of stud- mate tests are the GCD [Towle 1976] and ies that examine the subscript ex- Banerjee’s inequalities [Banerjee 1988a]. pressions in scientific applications and Additionally, there are a large number evaluate the effectiveness of different of exact tests that exploit some subscript dependence tests. characteristics to determine whether a particular type of dependence exists. One 6. TRANSFORMATIONS of the less expensive exact tests is the Single-Index Test [Banerjee 1979; Wolfe This section catalogs general-purpose 1989b]. The Delta Test [Goff et al. 1991] program transformations; those transfor- is a more general strategy that examines mations that are only applicable on par- some combinations of dimensions. Other allel machines are discussed in Section 7. tests that are more expensive to evaluate The primary emphasis is on loops, since consider the multiple subscript dimen- that is generally where most of the exe- sions simultaneously, such as the A-test cution time is spent. For each transfor- [Li et al. 1990], multidimensional GCD mation, we provide an example, discuss [Banerjee 1988a], and the power test the benefits and shortcomings, identify [Wolfe and Tseng 1992]. The omega test any variants, and provide citations. [Pugh 1992] uses a linear programming A major goal of optimizing compilers algorithm to solve the dependence equa- for high-performance architectures is to tions. The SUIF compiler project at Stan- discover and exploit parallelism in loops. ford has had success applying a series of We will indicate when a loop can be exe- exact tests, starting with the cheaper cuted in parallel by using a do all loop ones [Maydan et al. 1991]. They use a instead of a do loop. The iterations of a general algorithm (Fourier-Motzkin vari- do all loop can be executed in any order, able elimination [Dantzig and Eaves or all at once. 1974]) as a backup. A standard reference on compilers in One advantage of the multiple-dimen- general is the “Red Dragon” book, to sion exact tests is their ability to handle which we refer for some of the most com- coupled subscript expressions [Goff et al. mon examples [Aho et al. 1986]. We draw 1991]. Two expressions are coupled if the also on previous summaries [Allen and same variable appears in both of them. Cocke 1971; Kuck 1977; Padua and Wolfe Figure 6 shows a loop that has no depen- 1986; Rau and Fisher 1993; Wolfe 1989b]. dence between iterations. The state- Because some transformations were al- ment reads the values to the right of the ready familiar to programmers who ap- diagonal and updates the diagonal. A test plied them manually, often we cite only that only examines one dimension of the the work of researchers who have subscript expressions at a time, however, systematized and automated the imple- will not detect the independence because mentation of these transformations. 
Ad- there are many pairs of iterations where ditionally, we omit citations to works that the expression i in one is equal to the are restricted to basic blocks when global expression i + 1 in the other. The more optimization techniques exist. Even so, general exact tests would discover that the origin of some optimizations is murky. the iterations are independent despite For instance, Ershov’s ALPHA compiler

ACM Computing Surveys, Vol 26, No 4, December 1994 Compiler Transformations o 359

[Ershov 1966] performed interprocedural doi=l, n constant propagation, albeit in a limited a[i] = a[i] + c*i form, in 1964! end do (a) original loop

6.1 Data-Flow-Based Loop Transformations

A number of classical loop optimizations T=c are based on data-flow analysis, which doi=l, n tracks the flow of data through a pro- a[i] = a[i] + T gram’s variables [Muchnick and Jones T=T+c 1981]. These optimizations are summa- end do rized by Aho et al. [1986]. (b) after strength reduction

Figure 7. Strength reduction example, 6.1.1 Loop-Based Strength Reduction

Reduction in strength replaces an ex- pression in a loop with one that is equiv- Table 1. Strength Reductions (the variable c IS loop invariant; x may vary between Iterations) alent but uses a less expensive operator [Allen 1969; Allen et al. 1981]. Figure Expression Initialization Use Update 7(a) shows a loop that contains a multi- cxi T=c T T=T+c plication. Figure 7(b) is a transformed c’ T=c T T=Txc version of the loop where the multiplica- (-1)’ T=–1 T T=–T tion has been replaced by an addition. z/c T = ~/C XXT Table 1 shows how strength reduction can be applied to a number of operations. The operation in the first column is as- Figure 8(a–c). This special case is most sumed to appear within a loop that iter- important on architectures in which inte- ates over i from 1 to n. When the loop is ger multiply operations take more cycles transformed, the compiler initializes a than integer additions. (Current exam- temporary variable T with the expression ples include the SPARC [Sun Microsys- in the second column, The operation tems 1991] and the Alpha [Sites 1992].) within the loop is replaced by the expres- Strength reduction may also make other sion in the third column, and the value of optimizations possible, in particular the T is updated each iteration with the value elimination of induction variables, as is in the fourth. shown in the next section. A variable whose value is derived from The term strength reduction is also the number of iterations that have been applied to operator substitution in a non- executed by an enclosing loop is called an loop context, like replacing x x 2 with induction variable. The loop control vari- x + x. These optimizations are covered in able of a do statement is the most com- Section 6.6.7. mon kind of induction variable, but other variables may also be induction 6.1.2 induction Variable Elimination variables. The most common use of strength re- Once strength reduction has been per- duction, often implemented as a special formed on induction variable expres- case, is strength reduction of induction sions, the compiler can often eliminate variable expressions [Aho et al. 1986; the original variable entirely. The loop Allen 1969; Allen et al. 1981; Cocke and exit test must then be expressed in terms Schwartz 1970]. of one of the strength-reduced induction Strength reduction can be applied to variables; the optimization is called lin- products involving induction variables by ear function test replacement [Allen 1969; converting them to references to an Aho et al. 1986]. The replacement not equivalent running sum, as shown in only reduces the number of operations in

ACM Computmg Surveys, Vol. 26, No. 4, December 1994 Davidl? Bacon et al.

do i= l,n a[i] = a[i] + sqrt(x) n doi= 1, end do a[il = a[i] + c (a) original loop end do (a) original code if (n > O) C = sqrt(x) do i = l,n LF F4 , C(R30) ;load c Into F4 a[i] = a[i] + C LU R8 , n(R30) ;load n into R8 end do #1 ;set i (R9) to 1 LI R9 , (b) after code motion ADDI R12, R30, #a ;Ri2=address(a[l]) Loop:14ULTI R1O, R9, #4 ;RiO=i*4 Figure9. Loop-invariant code motion. ADDI RIO,R12,R1O ;RiO=address(a[i+l]) LF F5, -4(R1O) ;load a[i] into F5 ADDF F5, F5, F4 ;a[i] :=a[i]+c SF -4(R1O), F5 ;store new a[il SLT Ril, R9, R8 ;Ril = i

LF F4, c(R30) ;load c into F4 LU R8, n(R30) ;load n into R8 6.1.3 Loop-lnvarianf Code Motion LI R9, #l ;set i (R9) to 1 When a computation appears inside a ADDI RI0,R30, #a ;RIO=address(a[i]) 100P, but its result does not change be- Loop : LF F5, (R1O) ;load a[i] into F5 tween iterations, the compiler can move ADDF F5, F5, F4 ;a[i]:=a[i]+c that computation outsidethe loop [Ahoet SF (R1O), F5 ;store new a[i] al, 1986; Cocke and Schwartz 1970]. ADDI R9, R9, #1 ;i=i+l Code motion canbe applied at a high ADDI R1O, Ri0,#4 ;RIO=address(a[i+l]) level to expressions in the source code, or SLT Rll, R9, R8 ;Rll = i

ACM Computmg Sumeys,Vol, 26,N0 4, December 1994 Compiler Transformations ● 361

do i=2, n Conditionals that are candidates for a[i] = a[i] + c unswitching can be detected during the if (x < 7) then analysis for code motion, which identifies b[i] = a[i] * c [i] loop-invariant values. else In Figure 10(a) the variable x is loop b[i] = a[i-1] * b[i-1] invariant, allowing the loop to be end if unstitched and the true branch to be end do executed in parallel, as shown in Figure (a) original loop 10(b). Note that, as with loop-invariant code motion, if there is any chance that the condition evaluation will cause an if (n > 1) then exception, it must be guarded by a test if (x < 7) then that the loop will be executed. do all i=2, n In a loop nest where the inner loop has a[i] = a[i] + c unknown bounds, if code is generated b[i] = a[i] * c[i] straightforwardly there will be a test be- end do all fore the body of the inner loop to deter- else mine whether it should be executed at do i=2, n all. The test for the inner loop will be a[il = a[i] + c repeated every time the outer loop is exe- b[i] = a[i-1] * b[i-1] cuted. When the compiler uses an inter- end do mediate representation of the program end if that exposes the test explicitly, unswitch- end if ing can be applied to move the test out- (b) after unswitching side of the outer loop. The RS/6000 XL C/Fortran compiler uses unswitching for Figure IO. . this purpose [0’Brien et al. 1990].

6.2 Loop Rerxdering more general term referring to any transfo~mation that moves a- comput~- In this section we describe transforma- tion to an earlier point in the program tions that change the relative order of [Aho et al. 1986]. While loop-invariant execution of the iterations of a loop nest code motion is one particularly common or nests. These transformations are pri- form of hoisting, there are others: an marily used to expose parallelism and expression can be evaluated earlier to improve memory locality. reduce register pressure or avoid arith- Some compilers only apply reordering metic unit latency, or a load instruction transformations to perfect loop nests (see might be moved upward to reduce the Section 6.2.7). To increase the opportuni- effect of memory access latency. ties for optimization, such a compiler can sometimes apply loop distribution to ex- tract perfect loop nests from an imperfect 6.1.4 Loop Unswitching nesting. Loop unswitching is applied when a loop The compiler determines whether a contains a conditional with a loop- loop can be executed in parallel by exam- invariant test condition. The loop is then ining the loop-carried dependence. The replicated inside each branch of the obvious case is when all the dependence conditional, saving the overhead of condi- distances for the loop are O (direction tional branching inside the loop, reduc- =), meaning that there are no depen- ing the code size of the loop body, and dence carried across iterations by the possibly enabling the parallelization of a loop. Figure n(a) is an example; the dis- branch of the conditional [Allen and tance vector for the loop is (O, 1), so the Cocke 1971]. outer loop is parallelizable.

ACM Computing Surveys, Vol. 26, No. 4, December 1994 362 - David F. Bacon et al.

doi=l, n double precision a[*] doj=2, n a[i, j] = a[i, j-1] + c do i = 1, 1024* stride, stride end do a[i] = a[i] + c end do end do (a) outer loop is parallelizable (a) loop with varying stride

doi=i, n I etride \ Cache I TLB I Relative \ doj=l, n Misses Misses Speed (%) a[i, j] = a[i-l, j] + a[i-l, j+l] 1 64 2 100 end do 2 128 4 83 end do (b) inner loop is parallelizable.

Figure Il. Dependence conditions forparallelizing t I 1 loops. 16 I 1024 I 32 I 23 64 I 1024 I 128 I 19 256 I 1024 512 12 More generally, the pth loop in a loop 512 I 1024 1024 8 nest is parallelizable if for every distance (b) effect of stride vector V=(vl, . . ..vP. v~), v~), Figure 12. Predicted effect of stride on perfor- up =ov3qo. mance of an IBM RS/6000 for the above loop. Array elements are double precision (8 bytes): miss In Figure 1 l(b), the distance vectors rates are per 1024 iterations. Beyond stride 16, are {(1, 0), (1, – l)}, so the inner loop is TLB misses dominate [IBM 1992], parallelizable. Both references on the right-hand side of the expression read elements of a from row i — 1, which was loop nest to increase the granularity of written during the previous iteration of each iteration and reduce the number the outer loop. Therefore the elements of of barrier synchronizations; row i may be calculated and written in any order. . reduce stride, ideally to stride 1; and ● increase the number of loop-invariant 6.2.1 expressions in the inner loop.

Loop interchange exchanges the position Care must be taken that these benefits of two loops in a perfect loop nest, gener- do not cancel each other out, For in- ally moving one of the outer loops to the stance, an interchange that improves innermost position [Allen and Kennedy register reuse may change a stride-1 ac- 1984; 1987; Wolfe 1989b]. Interchange is cess pattern to a stride-n access pattern one of the most powerful transformations with much lower overall performance due and can improve performance in many to increased cache misses. Figure 12 ways. demonstrates the dramatic effect of dif- Loop interchange may be performed to: ferent strides on an IBM RS /6000. In Figure 13(a), the inner l’oop accesses e enable vectorization by interchanging array a with stride n (recall that we are an inner, dependent loop with an outer, assuming the Fortran convention of col- independent loop; umn-maj or storage order). By inter- * improve vectorization by moving the changing the loops, we convert the inner independent loop with the largest range loop to stride-1 access, as shown in into the innermost position; Figure 13(b). e improve parallel performance by mov- For a large array in which less than ing an in-depende-nt loop outwa~d in a one column fits in the cache, this opti-


    do i = 1, n
      do j = 1, n
        total[i] = total[i] + a[i, j]
      end do
    end do
    (a) original loop nest

    do j = 1, n
      do i = 1, n
        total[i] = total[i] + a[i, j]
      end do
    end do
    (b) interchanged loop nest

Figure 13. Loop interchange.

    do i = 2, n
      do j = 1, n-1
        a[i, j] = a[i-1, j+1]
      end do
    end do
    (a)

Figure 14. Original loop (a); original traversal order (b); traversal order after interchange (c). [iteration-space diagrams]

For a large array in which less than one column fits in the cache, this optimization reduces the number of cache misses on a from n^2 to n^2 * elementsize/linesize, or n^2/16 with 4-byte elements and the 64-byte lines of S-DLX. However, the original loop allows total[i] to be placed in a register, eliminating the load/store operations in the inner loop (see scalar replacement in Section 6.5.4). So the optimized version increases the number of load/store operations for total from 2n to 2n^2. If a fits in the cache, the original loop is better.

On a vector architecture, the transformed loop enables vectorization by eliminating the dependence on total[i] in the inner loop.

Interchanging loops is legal when the altered dependences are legal and when the loop bounds can be switched. If two loops p and q in a perfect loop nest of d loops are interchanged, each dependence vector V = (v_1, ..., v_p, ..., v_q, ..., v_d) in the original loop nest becomes V' = (v_1, ..., v_q, ..., v_p, ..., v_d) in the transformed loop nest. If V' is lexicographically positive, then the dependence relationships of the original loop are satisfied.

A loop nest of just two loops can be interchanged unless it has a dependence vector of the form (<, >). Figure 14(a) shows a loop nest with the dependence (1, -1), giving rise to the loop-carried dependence shown in Figure 14(b). The order in which the iterations are executed is shown by the dotted line. The traversal order after interchange is shown in Figure 14(c): some iterations are executed before iterations that they depend on, so the interchange is illegal.

Switching the loop bounds is straightforward when the iteration space is rectangular, as in the loop nest in Figure 13. In this case the bounds of the inner loop are independent of the indices of the containing loop, and the two can simply be exchanged. When the iteration space is not rectangular, computing the bounds is more complex. Triangular spaces are often used by programmers, and trapezoidal spaces are introduced by loop skewing (discussed in the next section). Further techniques are necessary to manage imperfectly nested loops. Some of the variations are discussed in detail by Wolfe [1989b] and Wolf and Lam [1991].

6.2.2 Loop Skewing

Loop skewing is an enabling transformation that is primarily useful in combination with loop interchange [Lamport 1974; Muraoka 1971; Wolfe 1989b]. Skewing was invented to handle wavefront computations, so called because the updates to the array propagate like a wave across the iteration space.


    do i = 2, n-1
      do j = 2, m-1
        a[i, j] = (a[i-1, j] + a[i, j-1] + a[i+1, j] + a[i, j+1]) / 4
      end do
    end do
    (a) original code; dependences {(1, 0), (0, 1)}

    (b) original iteration space   (c) skewed iteration space   [iteration-space diagrams]

    do i = 2, n-1
      do j = i+2, i+m-1
        a[i, j-i] = (a[i-1, j-i] + a[i, j-1-i] + a[i+1, j-i] + a[i, j+1-i]) / 4
      end do
    end do
    (d) skewed code; dependences {(1, 1), (0, 1)}

    do j = 4, n+m-2
      do i = max(2, j-m+1), min(n-1, j-2)
        a[i, j-i] = (a[i-1, j-i] + a[i, j-1-i] + a[i+1, j-i] + a[i, j+1-i]) / 4
      end do
    end do
    (e) skewed and interchanged code; dependences {(1, 0), (1, 1)}

Figure 15. Loop skewing.

In Figure 15(a) we show a typical wavefront computation. Each element is computed by averaging its four nearest neighbors. While neither of the loops is parallelizable in its original form, each array diagonal ("wavefront") could be computed in parallel. The iteration space and dependences are shown in Figure 15(b), with the dotted lines indicating the wavefronts.

Skewing is performed by adding the outer loop index multiplied by a skew factor, f, to the bounds of the inner iteration variable, and then subtracting the same quantity from every use of the inner iteration variable inside the loop. Because it alters the loop bounds but then adjusts the uses of the corresponding index variables to compensate, skewing does not change the meaning of the program and is always legal.

The original loop nest in Figure 15(a) can be interchanged, but neither loop can be parallelized, because there is a dependence on both the inner loop (0, 1) and on the outer loop (1, 0). The result when f = 1 is shown in Figure 15(c-d). The transformed code is equivalent to the original, but the effect on the iteration space is to align the diagonal wavefronts of the original loop nest so that for a given value of j, all iterations in i can be executed in parallel. To expose this parallelism, the skewed loop nest must also be interchanged. After skew and interchange, the loop nest has distance vectors {(1, 0), (1, 1)}. The first dependence allows the inner loop to be parallelized because the corresponding dependence distance is 0. The second dependence allows the inner loop to be parallelized because it is a dependence on previous iterations of the outer loop.

Skewing can expose parallelism for a nest of two loops with a set of distance vectors {V_k} only if a suitable skew factor can be found. When skewing by f, the original distance vector (v_1, v_2) becomes (v_1, f*v_1 + v_2). For any dependence where v_2 < 0, the goal is to find f such that f*v_1 + v_2 >= 1. The correct skew factor f is computed by taking the maximum of f_k = ceil((1 - v_2)/v_1) over all the dependences [Kennedy et al. 1993].

Interchanging of skewed loops is complicated by the fact that their bounds depend on the iteration variables of the enclosing loops. For two loops with bounds i1 = l1, u1 and i2 = l2, u2, where l2 and u2 are expressions independent of i1, the skewed inner loop has bounds i2 = f*i1 + l2, f*i1 + u2. After interchange, the bounds are

    do i2 = f*l1 + l2, f*u1 + u2
      do i1 = max(l1, ceil((i2 - u2)/f)), min(u1, floor((i2 - l2)/f))
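As a rough C sketch of skew (f = 1) followed by interchange, analogous to Figure 15(e) but for a simpler two-neighbor wavefront (everything below is illustrative and assumes a row-major n x m array):

    static int imax(int x, int y) { return x > y ? x : y; }
    static int imin(int x, int y) { return x < y ? x : y; }

    void wavefront(double *a, int n, int m)
    {
        /* original form: neither loop is parallel
         *   for (int i = 1; i < n; i++)
         *     for (int j = 1; j < m; j++)
         *       a[i*m + j] = 0.5 * (a[(i-1)*m + j] + a[i*m + (j-1)]);
         */
        for (int w = 2; w <= (n - 1) + (m - 1); w++) {          /* w = i + j: one wavefront */
            for (int i = imax(1, w - (m - 1)); i <= imin(n - 1, w - 1); i++) {
                int j = w - i;                                  /* recover the column index */
                a[i*m + j] = 0.5 * (a[(i-1)*m + j] + a[i*m + (j-1)]);
            }
        }
    }

For a fixed wavefront w, every i iteration reads only values from wavefront w-1, so the inner loop is parallelizable.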


An alternative method for handling wavefront computations is supernode partitioning [Irigoin and Triolet 1988].

6.2.3 Loop Reversal

Reversal changes the direction in which the loop traverses its iteration range [Wedel 1975]. It is often used in conjunction with other iteration space reordering transformations because it changes the dependence vectors [Wolfe 1989b].

As an optimization in its own right, reversal can reduce loop overhead by eliminating the need for a compare instruction on architectures without a compound compare-and-branch (such as the Alpha [Sites 1992]). The loop is reversed so that the iteration variable runs down to zero, allowing the loop to end with a branch-if-not-equal-to-zero instruction (BNEZ, on S-DLX).

Reversal can also eliminate the need for temporary arrays in implementing Fortran 90 array statements (see Section 6.4.3).

If loop p in a nest of d loops is reversed, then for each dependence vector V, the entry v_p is negated. The reversal is legal if each resulting vector V' is lexicographically positive, that is, when v'_p = 0 or there exists q < p such that v'_q > 0. For instance, the inner loop of a loop nest with direction vectors {(<, =), (<, >)} can be reversed, because the resulting dependences are all still lexicographically positive.

Figure 16 shows how reversal can enable loop interchange: the original loop nest (a) has the distance vector (1, -1) that prevents interchange because the resulting distance vector (-1, 1) is not lexicographically positive; the reversed loop nest (b) can legally be interchanged.

    do i = 1, n
      do j = 1, n
        a[i, j] = a[i-1, j+1] + 1
      end do
    end do
    (a) original loop nest: distance vector (1, -1). Interchange is not possible.

    do i = 1, n
      do j = n, 1, -1
        a[i, j] = a[i-1, j+1] + 1
      end do
    end do
    (b) inner loop reversed: distance vector (1, 1). Loops may be interchanged.

Figure 16. Loop reversal.

    do i = 1, n
      a[i] = a[i] + c
    end do
    (a) original loop

    TN = (n/64)*64
    do TI = 1, TN, 64
      a[TI:TI+63] = a[TI:TI+63] + c
    end do
    do i = TN+1, n
      a[i] = a[i] + c
    end do
    (b) after strip mining

    R9 = address of a[TI]
        LV    V1, R9        ; V1 <- a[TI:TI+63]
        ADDSV V1, F8, V1    ; V1 <- V1 + c (F8 = c)
        SV    V1, R9        ; a[TI:TI+63] <- V1
    (c) vector assembly code for the update of a[TI:TI+63]

Figure 17. Strip mining.

6.2.4 Strip Mining

Strip mining is a method of adjusting the granularity of an operation, especially a parallelizable operation [Abu-Sufah et al. 1981; Allen 1983; Loveman 1977].

An example is shown in Figure 17. The strip-mined computation is expressed in array notation, and is equivalent to a do all loop. Cleanup code is needed if the iteration length is not evenly divisible by the strip length.

One of the most common uses of strip mining is to choose the number of independent computations in the innermost loop of a nest.

On a vector machine, for example, the serial loop can then be converted into a series of vector operations, each vector comprising a single "strip" [Cray Research 1988]. Strip mining is also used for SIMD compilation [Weiss 1991], for combining send operations in a loop on distributed-memory multiprocessors [Hiranandani et al. 1992], and for limiting the size of compiler-generated temporary arrays [Abu-Sufah 1979; Wolfe 1989b].

Strip mining often requires other transformations to be performed first. Loop distribution (see Section 6.2.7) can expose simple loops from within an original loop nest that is too complex to strip mine. Loop interchange can be used to move a parallelizable loop into the innermost position of a loop nest or to maximize the length of the strip.

The examples given above demonstrate how strip mining can create a larger unit of work out of smaller ones. The transformation can also be used in the reverse direction, as shown in Section 7.4.
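A plain C sketch of strip mining, corresponding to Figure 17(b) (the strip length and function name are illustrative):

    #define STRIP 64

    void add_scalar(double *a, double c, int n)
    {
        int i;
        for (i = 0; i + STRIP <= n; i += STRIP)
            for (int k = i; k < i + STRIP; k++)   /* one strip: independent, vectorizable */
                a[k] = a[k] + c;
        for (; i < n; i++)                        /* cleanup for leftover iterations */
            a[i] = a[i] + c;
    }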

    do i = 1, n
    1 a[i+k] = b[i]
    2 b[i+k] = a[i] + c[i]
    end do
    (a) because of the write to a, S1 δ S2; because of the write to b, S2 δ S1.

    do TI = 1, n, k
      do all i = TI, TI+k-1
    1   a[i+k] = b[i]
    2   b[i+k] = a[i] + c[i]
      end do all
    end do
    (b) k iterations can be performed in parallel because that is the minimum dependence distance.

    (c) iteration space when n = 6 and k = 2  [diagram]

Figure 18. Cycle shrinking.

6.2.5 Cycle Shrinking

Cycle shrinking is essentially a specialization of strip mining. When a loop has dependences that prevent it from being executed in parallel (that is, converted to a do all), the compiler may still be able to expose some parallelism if the dependence distance is greater than one. In this case cycle shrinking will convert a serial loop into an outer serial loop and an inner parallel loop [Polychronopoulos 1987a]. Cycle shrinking is primarily useful for exposing fine-grained parallelism.

For instance, in Figure 18(a), a[i+k] is written in iteration i and read in iteration i+k; the dependence distance is k. Consequently the first k iterations can be performed in parallel provided that none of the subsequent iterations is allowed to begin until the first k are complete. The same is then done for the next k iterations, as shown in Figure 18(b). The iteration space dependences are shown in Figure 18(c): each group of k iterations is dependent only on the previous group.

The result is potentially a speedup by a factor of k, but k is likely to be small (usually 2 or 3), so this optimization is normally limited to exposing parallelism that can be exploited at the instruction level (for instance by loop unrolling). Note that k must be constant within the loop and must at least be known to be positive at compile time.
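A hedged C sketch of cycle shrinking for the loop of Figure 18, assuming the dependence distance K is a compile-time constant and that an OpenMP-style pragma is available for the inner block (names and the pragma are illustrative, and a must have at least n+K elements):

    #define K 4

    void shrink(double *a, double *b, const double *c, int n)
    {
        for (int t = 0; t < n; t += K) {              /* outer loop stays serial */
            int hi = (t + K < n) ? t + K : n;
            #pragma omp parallel for                  /* each block of K iterations is independent */
            for (int i = t; i < hi; i++) {
                a[i + K] = b[i];                      /* writes land K iterations ahead ...   */
                b[i + K] = a[i] + c[i];               /* ... so no conflict within the block  */
            }
        }
    }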


6.2.6 Loop Tiling

Tiling is the multidimensional generalization of strip mining. Tiling (also called blocking) is primarily used to improve cache reuse (Q_C) by dividing an iteration space into tiles and transforming the loop nest to iterate over them [Abu-Sufah et al. 1981; Gannon et al. 1988; Lam et al. 1991; Wolfe 1989a]. However, it can also be used to improve processor, register, TLB, or page locality.

The need for tiling is illustrated by the loop in Figure 19(a) that assigns a the transpose of b. With the j loop innermost, access to b is stride-1, while access to a is stride-n. Interchanging does not help, since it makes access to b stride-n. By iterating over subrectangles of the iteration space, as shown in Figure 19(b), the loop uses every cache line fully. The inner two loops of a matrix multiply have this structure; tiling is critical for achieving high performance in dense matrix multiplication.

A pair of adjacent loops can be tiled if they can legally be interchanged. After tiling, the outer pair of loops can be interchanged to improve locality across tiles, and the inner loops can be exchanged to exploit inner-loop parallelism or register locality.

    do i = 1, n
      do j = 1, n
        a[i, j] = b[j, i]
      end do
    end do
    (a) original loop

    do TI = 1, n, 64
      do TJ = 1, n, 64
        do i = TI, min(TI+63, n)
          do j = TJ, min(TJ+63, n)
            a[i, j] = b[j, i]
          end do
        end do
      end do
    end do
    (b) tiled loop

Figure 19. Loop tiling.
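A C sketch of the same tiling (tile size and names are illustrative; C is row-major, so the stride argument is mirrored but the structure is identical):

    #define TILE 64

    void transpose(double *a, const double *b, int n)
    {
        for (int ii = 0; ii < n; ii += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                /* visit one TILE x TILE subrectangle at a time so the
                 * cache lines of both a and b are fully used */
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int j = jj; j < jj + TILE && j < n; j++)
                        a[i * n + j] = b[j * n + i];
    }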

    do i = 1, n
      a[i] = a[i] + c
      x[i+1] = x[i]*7 + x[i+1] + a[i]
    end do
    (a) original loop

    do all i = 1, n
      a[i] = a[i] + c
    end do all
    do i = 1, n
      x[i+1] = x[i]*7 + x[i+1] + a[i]
    end do
    (b) after loop distribution

Figure 20. Loop distribution.

6.2.7 Loop Distribution

Distribution (also called loop fission) breaks a single loop into many. Each of the new loops has the same iteration space as the original, but contains a subset of the statements of the original loop [Kuck 1977; Kuck et al. 1981; Muraoka 1971].

Distribution is used to

● create perfect loop nests;
● create subloops with fewer dependences;
● improve instruction cache and instruction TLB locality due to shorter loop bodies;
● reduce memory requirements by iterating over fewer arrays; and
● increase register reuse by decreasing register pressure.

Figure 20 is an example in which distribution removes dependences and allows part of a loop to be run in parallel.

Distribution may be applied to any loop, but all statements belonging to a dependence cycle (called a π-block [Kuck 1977]) must be placed in the same loop, and if S1 δ S2 in the original loop, then the loop containing S1 must precede the loop that contains S2. If the loop contains control flow, applying if-conversion (see Section 5.2) can expose greater opportunities for distribution. An alternative is to use a control dependence graph [Kennedy and McKinley 1990].

A specialized version of this transformation is distribution by name, originally called horizontal distribution of name partition [Abu-Sufah et al. 1981]. Rather than performing full dependence analysis on the loop, the statements are partitioned into sets that reference mutually exclusive variables. These statements are guaranteed to be independent. When the arrays in question are large, distribution by name can increase cache locality.

Note that the loop in Figure 20 cannot be distributed using fission by name, since both statements reference a.
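A C sketch of the distribution in Figure 20 (names are illustrative; the recurrence on x stays serial while the update of a becomes freely parallelizable):

    void distribute(double *a, double *x, double c, int n)
    {
        /* original:
         *   for (int i = 0; i < n; i++) {
         *       a[i] = a[i] + c;
         *       x[i + 1] = x[i] * 7 + x[i + 1] + a[i];
         *   }
         */
        for (int i = 0; i < n; i++)      /* no loop-carried dependence: parallel/vector */
            a[i] = a[i] + c;
        for (int i = 0; i < n; i++)      /* serial recurrence on x in its own loop */
            x[i + 1] = x[i] * 7 + x[i + 1] + a[i];
    }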

6.2.8 Loop Fusion

The inverse transformation of distribution is fusion (also called jamming) [Ershov 1966]. It can improve performance by

● reducing loop overhead;
● increasing instruction parallelism;
● improving register, vector [Wolfe 1989b], data cache, TLB, or page [Abu-Sufah 1979] locality; and
● improving the load balance of parallel loops.

Figure 21. Two loops containing S1 and S2 cannot be fused when S2 δ S1 in the fused loop. [iteration-space diagram]

In Figure 20, distribution enables the parallelization of part of the loop. However, fusing the two loops improves register and cache locality since a[i] need only be loaded once. Fusion also increases instruction parallelism by increasing the ratio of floating-point operations to integer operations in the loop and reduces loop overhead by a factor of two. With large n, the distributed loop should run faster on a vector machine, while the fused loop should be better on a superscalar machine.

For two loops to be fused, they must have the same loop bounds; when the bounds are not identical, it is sometimes possible to make them identical by peeling (described in Section 6.3.5) or by introducing conditional expressions into the body of the loop. Two loops with the same bounds may be fused if there do not exist statements S1 in the first loop and S2 in the second such that they have a dependence S2 δ S1 in the fused loop. The reason this would be incorrect is that before fusing, all instances of S1 execute before any S2. After fusing, corresponding instances are executed together. If any instance of S1 has a dependence on (i.e., must be executed after) any subsequent instance of S2, the fusion alters execution order illegally, as shown in Figure 21.
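A small C sketch of a legal fusion (illustrative names; the only dependence in the fused body runs forward from the first statement to the second, so the transformation is safe):

    void fuse(double *a, double *b, double c, int n)
    {
        /* before fusion:
         *   for (int i = 0; i < n; i++) a[i] = a[i] + c;
         *   for (int i = 0; i < n; i++) b[i] = b[i] * a[i];
         */
        for (int i = 0; i < n; i++) {
            a[i] = a[i] + c;       /* a[i] stays in a register ... */
            b[i] = b[i] * a[i];    /* ... for immediate reuse here */
        }
    }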

6.3 Loop Restructuring

This section describes loop transformations that change the structure of the loop, but leave the computations performed by an iteration of the loop body and their relative order unchanged.

6.3.1 Loop Unrolling

Unrolling replicates the body of a loop some number of times called the unrolling factor (u) and iterates by step u instead of step 1. The benefits of unrolling have been studied on several different architectures [Dongarra and Hinds 1979]; it is a fundamental technique for generating the long instruction sequences required by VLIW machines [Ellis 1986].

Unrolling can improve the performance by

● reducing loop overhead;
● increasing instruction parallelism; and
● improving register, data cache, or TLB locality.

In Figure 22, we show all three of these improvements in an example. Loop overhead is cut in half because two iterations are performed before the test and branch at the end of the loop. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated. If array elements are assigned to registers (either directly or by using scalar replacement, as described in Section 6.5.4), register locality will improve because a[i] and a[i+1] are used twice in the loop body, reducing the number of loads per iteration from 3 to 2.


    do i = 2, n-1
      a[i] = a[i] + a[i-1] * a[i+1]
    end do
    (a) original loop

    do i = 2, n-2, 2
      a[i] = a[i] + a[i-1] * a[i+1]
      a[i+1] = a[i+1] + a[i] * a[i+2]
    end do
    if (mod(n-2, 2) = 1) then
      a[n-1] = a[n-1] + a[n-2] * a[n]
    end if
    (b) loop unrolled twice

Figure 22. Loop unrolling.

If the target machine has double- or multiword loads, unrolling often allows several loads to be combined into one.

The if statement at the end of Figure 22(b) is the loop epilogue that must be generated when it is not known at compile time whether the number of iterations of the loop will be an exact multiple of the unrolling factor u. If u > 2, the loop epilogue is itself a loop.

Unrolling has the advantage that it can be applied to any loop, and can be done profitably at both the high and the low levels. Some compilers also perform loop rerolling prior to unrolling because programs often contain loops that were unrolled by hand for a different target architecture.

Figure 23(a-c) shows the effects of unrolling in more detail. The source code in Figure 23(a) is translated to the assembly code in Figure 23(b), which takes 6 cycles per result on S-DLX. After unrolling 3 times, the code requires 8 cycles per iteration, or 2 2/3 cycles per result, as shown in Figure 23(c). The original loop stalls for one cycle waiting for the load, and for two cycles waiting for the ADDF to complete. In the unrolled loop some of these cycles are filled.

Most compilers for high-performance machines will unroll at least the innermost loop of a nesting. Outer loop unrolling is not as universal because it yields replicated instances of the inner loops. To avoid the additional control overhead, the compiler can often fuse the copies back together, yielding the same loop structure that appeared in the original code. This combination of transformations is sometimes referred to as unroll-and-jam [Callahan et al. 1988].

Loop quantization [Nicolau 1988] is another approach to unrolling that avoids replicated inner loops. Rather than creating multiple copies and then subsequently eliminating them, quantization adds additional statements to the innermost loop directly. The iteration ranges are changed, but the structure of the loop nest remains the same. However, unlike straightforward unrolling, quantization changes the order of execution of the loop and is not always a legal transformation.

6.3.2 Software Pipelining

Another technique to improve instruction parallelism is software pipelining [Lam 1988]. In hardware pipelining, instruction execution is broken into stages, such as Fetch, Execute, and Write-back. The first instruction is fetched in the first clock cycle. In the next cycle, the second instruction is fetched while the first is executed, and so on. Once the pipeline has been filled, the machine will complete 1 instruction per cycle.

In software pipelining, the operations of a single loop iteration are broken into s stages, and a single iteration performs stage 1 from iteration i, stage 2 from iteration i-1, etc. Startup code must be generated before the loop to initialize the pipeline for the first s-1 iterations, and cleanup code must be generated after the loop to drain the pipeline for the last s-1 iterations.

Figure 23(d) shows how software pipelining improves the performance of a simple loop. The depth of the pipeline is s = 3. The software-pipelined loop produces one new result every iteration, or 4 cycles per result.
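A source-level C sketch of unrolling with an explicit epilogue (the unrolling factor and names are illustrative; a production compiler would also reschedule the unrolled body):

    void saxpy_unrolled(float *y, const float *x, float a, int n)
    {
        int i;
        for (i = 0; i + 3 < n; i += 4) {   /* main body: four results per trip */
            y[i]     += a * x[i];
            y[i + 1] += a * x[i + 1];
            y[i + 2] += a * x[i + 2];
            y[i + 3] += a * x[i + 3];
        }
        for (; i < n; i++)                 /* epilogue: leftover iterations */
            y[i] += a * x[i];
    }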


    do i = 1, n
      a[i] = a[i] + c
    end do
    (a) the initial loop

    (b) S-DLX assembly code for the initial loop (6 cycles per result)
    (c) after unrolling 3 times and rescheduling (loop prologue and epilogue omitted)
    (d) after software pipelining (s = 3)
    (e) after combining software pipelining (s = 3) with unrolling (u = 2)

Figure 23. Increasing instruction parallelism in loops. [assembly listings for panels (b)-(e) appear in the original figure]


Finally, Figure 23(e) shows the result of combining software pipelining (s = 3) with unrolling (u = 2). The loop takes 5 cycles per iteration, or 2 1/2 cycles per result. Unrolling alone achieves 2 2/3 cycles per result. If software pipelining is combined with unrolling by u = 3, the resulting loop would take 6 cycles per iteration, or two cycles per result, which is optimal because only one memory operation can be initiated per cycle.

Figure 24 illustrates the difference between unrolling and software pipelining: unrolling reduces overhead, while pipelining reduces the startup cost of each iteration.

Figure 24. Loop unrolling vs. software pipelining: (a) loop unrolling; (b) software pipelining. [timing diagrams of overlapped execution]

Perfect pipelining combines unrolling by loop quantization with software pipelining [Aiken and Nicolau 1988a; 1988b]. A special case of software pipelining is predictive commoning, which is applied to memory operations. If an array element written in iteration i-1 is read in iteration i, then the first element is loaded outside of the loop, and each iteration contains one load and one store. The IBM RS/6000 XL C/Fortran compiler [O'Brien et al. 1990] performs this optimization.

If there are no loop-carried dependences, the length of the pipeline is the length of the dependence chain. If there are multiple, independent dependence chains, they can be scheduled together subject to resource availability, or loop distribution can be applied to put each chain into its own loop. The scheduling constraints in the presence of loop-carried dependences and conditionals are more complex; the details are discussed in Aiken and Nicolau [1988a] and Lam [1988].

6.3.3 Loop Coalescing

Coalescing combines a loop nest into a single loop, with the original indices computed from the resulting single induction variable [Polychronopoulos 1987b; 1988]. Coalescing can improve the scheduling of the loop on a parallel machine and may also reduce loop overhead.

In Figure 25(a), for example, if n and m are slightly larger than the number of processors P, then neither of the loops schedules as well as the outer parallel loop, since executing the last n - P iterations will take the same time as the first P. Coalescing the two loops ensures that P iterations can be executed every time except during the last (nm mod P) iterations, as shown in Figure 25(b).

    do all i = 1, n
      do all j = 1, m
        a[i, j] = a[i, j] + c
      end do all
    end do all
    (a) original loop

    do all T = 1, n*m
      i = (T-1) / m + 1
      j = MOD(T-1, m) + 1
      a[i, j] = a[i, j] + c
    end do all
    (b) coalesced loop

    real TA[n*m]
    equivalence (TA, a)
    do all T = 1, n*m
      TA[T] = TA[T] + c
    end do all
    (c) collapsed loop

Figure 25. Loop coalescing vs. collapsing.

Coalescing itself is always legal since it does not change the iteration order of the loop. The iterations of the coalesced loop may be parallelized if all the original loops were parallelizable. A criterion for parallelizability in terms of dependence vectors is discussed in Section 6.2.
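A C sketch of coalescing the same nest into one flat parallel loop (the pragma and names are illustrative assumptions, not part of the original example):

    void coalesced(double *a, double c, int n, int m)
    {
        #pragma omp parallel for            /* one flat space of n*m independent updates */
        for (int t = 0; t < n * m; t++) {
            int i = t / m;                  /* recover the original row index */
            int j = t % m;                  /* recover the original column index */
            a[i * m + j] = a[i * m + j] + c;
        }
    }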

The complex subscript calculations introduced by coalescing can often be simplified to reduce the overhead of the coalesced loop [Polychronopoulos 1987b].

6.3.4 Loop Collapsing

Collapsing is a simpler, more efficient, but less general version of coalescing in which the number of dimensions of the array is actually reduced. Collapsing eliminates the overhead of multiple nested loops and multidimensional array indexing.

Collapsing is used not only to increase the number of parallelizable loop iterations, but also to increase vector lengths and to eliminate the overhead of a nested loop (also called the Carry optimization [Allen and Cocke 1971; IBM 1991]). The collapsed version of the loop discussed in the previous section is shown in Figure 25(c).

Collapsing is best suited to loop nests that iterate over memory with a constant stride. When more complex indexing is involved, coalescing may be a better approach.

6.3.5 Loop Peeling

When a loop is peeled, a small number of iterations are removed from the beginning or end of the loop and executed separately. If only one iteration is peeled, a common case, the code for that iteration can be enclosed within a conditional. For a larger number of iterations, a separate loop can be introduced. Peeling has two uses: for removing dependences created by the first or last few loop iterations, thereby enabling parallelization; and for matching the iteration control of adjacent loops to enable fusion.

    do i = 2, n
      b[i] = b[i] + b[2]
    end do
    do all i = 3, n
      a[i] = a[i] + c
    end do all
    (a) original loops

    if (2 <= n) then
      b[2] = b[2] + b[2]
    end if
    do all i = 3, n
      b[i] = b[i] + b[2]
      a[i] = a[i] + c
    end do all
    (b) after peeling one iteration from the first loop and fusing the resulting loops

Figure 26. Loop peeling.

The loop in Figure 26(a) is not parallelizable because of a flow dependence between iteration i = 2 and iterations i = 3 ... n. Peeling off the first iteration allows the rest of the loop to be parallelized and fused with the following loop, as shown in Figure 26(b).

Since peeling simply breaks a loop into sections without changing the iteration order, it can be applied to any loop.

6.3.6 Loop Normalization

Normalization converts all loops so that the induction variable is initially 1 (or 0) and is incremented by 1 on each iteration [Allen and Kennedy 1987]. This transformation can expose opportunities for fusion and simplify inter-loop dependence analysis, as shown in Figure 27. It can also help to reveal which loops are candidates for peeling followed by fusion.

    do i = 1, n
      a[i] = a[i] + c
    end do
    do i = 2, n+1
      b[i] = a[i-1] * b[i]
    end do
    (a) original loops

    do i = 1, n
      a[i] = a[i] + c
    end do
    do i = 1, n
      b[i+1] = a[i] * b[i+1]
    end do
    (b) after normalization, the two loops can be fused

Figure 27. Loop normalization.

The most important use of normalization is to permit the compiler to apply subscript analysis tests, many of which require normalized iteration ranges.
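A C sketch of the peeling-plus-fusion of Figure 26 (0-based indices and names are illustrative; the peeled element is the one that created the dependence):

    void peeled(double *a, double *b, double c, int n)
    {
        if (n >= 2)
            b[1] = b[1] + b[1];             /* peeled first iteration */
        #pragma omp parallel for            /* b[1] is now read-only in the loop */
        for (int i = 2; i < n; i++) {
            b[i] = b[i] + b[1];
            a[i] = a[i] + c;
        }
    }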


6.3.7 Loop Spreading

Spreading takes two serial loops and moves some of the computation from the second to the first so that the bodies of both loops can be executed in parallel [Girkar and Polychronopoulos 1988].

An example is shown in Figure 28: the two loops in the original program (a) cannot be fused because they have different bounds, and there would be a dependence S2 δ S1 in the fused loop due to the write to a in S1 and the read of a in S2.

    do i = 1, n/2
    1 a[i+1] = a[i+1] + a[i]
    end do
    do i = 1, n-3
    2 b[i+1] = b[i+1] + b[i] + a[i+3]
    end do
    (a) original loops

    do i = 1, n/2
      COBEGIN
        a[i+1] = a[i+1] + a[i]
        if (i > 3) then
          b[i-2] = b[i-2] + b[i-3] + a[i]
        end if
      COEND
    end do
    do i = n/2-3, n-3
      b[i+1] = b[i+1] + b[i] + a[i+3]
    end do
    (b) after spreading

Figure 28. Loop spreading.

By executing the statement S2 three iterations later within the new loop, it is possible to execute the two statements in parallel, which we have indicated textually with the COBEGIN/COEND compound statement in Figure 28(b).

The number of iterations by which the body of the second loop must be delayed is the maximum dependence distance between any statement in the second loop and any statement in the first loop, plus 1. Adding one ensures that there are no dependences within an iteration. For this reason, there must not be any scalar dependences between the two loop bodies.

Unless the loop bodies are large, spreading is primarily beneficial for exposing instruction-level parallelism. Depending on the amount of instruction parallelism achieved, the introduction of a conditional may negate the gain due to spreading. The conditional can be removed by peeling the first few (in this case 3) iterations from the first loop, at the cost of additional loop overhead.

6.4 Loop Replacement Transformations

This section describes loop transformations that operate on whole loops and completely alter their structure.

6.4.1 Reduction Recognition

A reduction is an operation that computes a scalar value from an array. Common reductions include computing either the sum or the maximum value of the elements in an array. In Figure 29(a), the sum of the elements of a is accumulated in the scalar s. The dependence vector for the loop is (1), or (<). While a loop with direction vector (<) must normally be executed serially, reductions can be parallelized if the operation performed is associative. Commutativity provides additional opportunities for reordering.

In Figure 29(b), the reduction has been vectorized by using vector adds (the inner do all loop) to compute TS; the final result is computed from TS using a scalar loop. For semicommutative and semiassociative operators like floating-point multiplication, the validity of the transformation depends on the language semantics and the programmer's intent, as described in Section 2.1.


    do i = 1, n
      s = s + a[i]
    end do
    (a) a sum reduction loop

    real TS[64]
    TS[1:64] = 0.0
    do TI = 1, n, 64
      TS[1:64] = TS[1:64] + a[TI:TI+63]
    end do
    do TI = 1, 64
      s = s + TS[TI]
    end do
    (b) loop transformed for vectorization

Figure 29. Reduction recognition.

Provided that bitwise identical results are not required, the partial sums can be computed in parallel.

Maximum parallelism is achieved by computing the reduction with a tree: pairs of elements are summed, then pairs of these results are summed, and so on. The number of serial steps is reduced from O(n) to O(log n).

Operations such as and, or, min, and max are truly associative, and their reduction can be parallelized under all circumstances.

6.4.2 Loop Idiom Recognition

Parallel architectures often provide specialized hardware that the compiler can take advantage of. Frequently, for example, SIMD machines support reduction directly in the processor interconnection network. Some parallel machines, such as the Connection Machine CM-2 [Thinking Machines 1989], include hardware not only for reduction but for parallel prefix operations, allowing a loop with a body of the form a[i] = a[i-1] + a[i] to be parallelized. The parallelization of a more general class of linear recurrences is described by Chen and Kuck [1975], by Kuck [1977; 1978], and by Wolfe [1989b]. Blelloch [1989] describes compilation strategies for recognizing and exploiting parallel prefix operations.

The compiler can recognize and convert other idioms that are specially supported by hardware or software. For instance, the CMAX Fortran preprocessor converts loops implementing vector and matrix operations into calls to assembly-coded BLAS (Basic Linear Algebra Subroutines) [Sabot and Wholey 1993]. The VAX Fortran compiler converts string copy loops into block transfer instructions [Harris and Hobbs 1994].

6.4.3 Array Statement Scalarization

When a loop is expressed in array notation, the compiler can either convert it into vector operations or scalarize it into one or more serial loops [Wolfe 1989b]. However, the conversion is not completely straightforward because array notation requires that the operation be performed as if every value on the right-hand side and every subexpression on the left-hand side were computed before any assignments are performed.

The example in Figure 30 shows a computation in array notation (a), its "obvious" (but incorrect) conversion to serial form (b), and a correct conversion (c). The reason that Figure 30(b) is not correct is that in the original code, every element of a is to be incremented by the value of the previous element. The increments are to happen as if they were all performed simultaneously; in the incorrect version, each element is incremented by the updated value of the previous element.

The general solution is to introduce a temporary array T and to have a separate loop that writes the values back into a, as shown in Figure 30(c). The temporary array can then be eliminated if the two loops can be legally fused, namely, when there is no dependence S2 δ S1 in the fused loop, where S1 is the assignment to the temporary, and S2 is the assignment to the original array.
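Anticipating the loop-reversal trick shown in Figure 30(d) below, a C sketch of a correct scalarization of a(2:n-1) = a(2:n-1) + a(1:n-2) without a temporary (0-based indexing and names are illustrative):

    void scalarize(double *a, int n)
    {
        /* running the loop backwards preserves the "all reads happen
         * before any store" semantics of the array statement */
        for (int i = n - 2; i >= 1; i--)
            a[i] = a[i] + a[i - 1];
    }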


    a[2:n-1] = a[2:n-1] + a[1:n-2]
    (a) initial array language expression

    do i = 2, n-1
      a[i] = a[i] + a[i-1]
    end do
    (b) incorrect scalarization

    do i = 2, n-1
    1 T[i] = a[i] + a[i-1]
    end do
    do i = 2, n-1
    2 a[i] = T[i]
    end do
    (c) correct scalarization

    do i = n-1, 2, -1
      a[i] = a[i] + a[i-1]
    end do
    (d) reversing both loops allows fusion and eliminates the need for temporary array T

    a[2:n-1] = a[2:n-1] + a[1:n-2] + a[3:n]
    (e) array expression requiring a temporary

Figure 30. Array statement scalarization.

In this case there is an antidependence, but it can be removed by reversing the loops, enabling fusion, and eliminating the temporary, as shown in Figure 30(d). However, the array language statement in Figure 30(e) requires a temporary since an antidependence exists regardless of the direction of the loop.

6.5 Memory Access Transformations

High-performance applications are as frequently memory limited as they are compute limited. In fact, for the last fifteen years CPU speeds have doubled every three to five years, while DRAM speeds have doubled about once every decade (DRAMs, or dynamic random access memory chips, are the low-cost, low-power memory chips used for main memory in most computers).

As a result, optimization of the use of the memory system has become steadily more important. Factors affecting memory performance include:

● Reuse, denoted by Q and Q_C, the ratio of uses of an item to the number of times it is loaded (described in Section 3);
● Parallelism. Vector machines often divide memory into banks, allowing vector registers to be loaded in a parallel or pipelined fashion. Superscalar machines often support double- or quadword load and store instructions;
● Working Set Size. If all the memory elements accessed inside of a loop do not fit in the data cache, then items that will be accessed in later iterations may be flushed, decreasing Q_C. If more variables are simultaneously live than there are available registers (that is, the register pressure is high), then loads and stores will have to spill values into memory, decreasing Q. If more pages are accessed in a loop than there are entries in the TLB, then the TLB may thrash.

Since the registers are the top of the memory hierarchy, efficient register usage is absolutely crucial to high performance. Until the late seventies, register allocation was considered the single most important problem in compiler optimization for which there was no adequate solution. The introduction of techniques based on graph coloring [Chaitin 1982; Chaitin et al. 1981; Chow and Hennessy 1990] yielded very efficient global (within a procedure) register allocation. Most compilers now make use of some variant of graph coloring in their register allocator.

In general, memory optimizations are inhibited by low-level coding that relies on particular storage layouts. Fortran EQUIVALENCE statements are the most obvious example. C and Fortran 77 both specify the storage layout for each type of declaration. Programmers relying on a particular storage layout may defeat a wide range of important optimizations.


Optimizations covered in other sections that also can improve memory system performance are loop interchange (6.2.1), loop tiling (6.2.6), loop unrolling (6.3.1), loop fusion (6.2.8), and various optimizations that eliminate register saves at procedure calls (6.8).

6.5.1 Array Padding

    real a[8,512]
    do i = 1, 512
      a[1, i] = a[1, i] + c
    end do
    (a) original code

    real a[9,512]
    do i = 1, 512
      a[1, i] = a[1, i] + c
    end do
    (b) padding a eliminates memory bank conflicts on V-DLX

Figure 31. Array padding.

Padding is a transformation whereby unused data locations are inserted between the columns of an array or between arrays. Padding is used to ameliorate a number of memory system conflicts, in particular:

● bank conflicts on vector machines with banked memory [Burnett and Coffman, Jr. 1970];
● cache set or TLB set conflicts;
● cache miss jamming [Bacon et al. 1994]; and
● false sharing of cache lines on shared-memory multiprocessors.

Bank conflicts on vector machines like the Cray and our hypothetical V-DLX can be caused when the program indexes through an array dimension that is not laid out contiguously in memory, leading to a nonunit stride whose value we will define to be s. If s is an exact multiple of the number of banks (B), the bandwidth of the memory system will be reduced by a factor of B (8, on V-DLX) because all memory accesses are to the same bank. The number of memory banks is usually a power of two.

In general, if an array will be accessed with stride s, the array should be padded by the smallest p such that gcd(s + p, B) = 1. This will ensure that B successive accesses with stride s + p will all address different banks. An example is shown in Figure 31(a): the loop accesses memory with stride 8, so all memory references will be to the first bank. After padding, successive iterations access memory with stride 9, so they go to successive banks, as shown in Figure 31(b).

Cached memory systems, especially those that are set-associative, are less sensitive to low power-of-two strides. However, large power-of-two strides will cause extremely poor performance due to cache set and TLB set conflicts. The problem is that addresses are mapped to sets simply by using a range of bits from the middle of the address. For instance, on S-DLX, the low 6 bits of an address are the byte offset within the line, and the next 8 bits are the cache set. Figure 12 shows the effect of stride on the performance of a cache-based superscalar machine.

A number of researchers have noted that other strides yield reduced cache performance. Bailey [1992] has quantified this by noting that because many words fit into a cache line, if the number of sets times the number of words per set is almost exactly divisible by the stride (say by n +/- e), then for a limited number of references r, every nth memory reference will map to the same set. If the ceiling of r/n is greater than the associativity of the cache, then the later references will evict the lines loaded by the earlier references, precluding reuse.

Set and bank conflicts can be caused by a bad stride over a single array, or by a loop that accesses multiple arrays that all align to the same set or bank. Thus padding can be inserted between columns of an array (intraarray padding), or between arrays (interarray padding).
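A C sketch of intraarray padding (the sizes, pad amount, and names are illustrative assumptions): an extra, unused column keeps successive rows from mapping to the same cache set or bank when the leading dimension is a power of two.

    #define N   512
    #define PAD 1

    /* original declaration: double a[N][N];            */
    double a[N][N + PAD];   /* padded leading dimension */

    void add_column(double c, int j)
    {
        for (int i = 0; i < N; i++)     /* stride is now N+PAD doubles, not N */
            a[i][j] = a[i][j] + c;
    }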


A further performance artifact, called cache miss jamming, can occur on machines that allow processing to continue during a cache miss: if cache misses are spread nonuniformly across the loop iterations, the asynchrony of the processor will not be exploited, and performance will be reduced. Jamming typically occurs when several arrays are accessed with the same stride and when all have the same alignment relative to cache line boundaries (that is, the same low address bits). Bacon et al. [1994] describe cache miss jamming in detail and present a unified framework for inter- and intraarray padding to handle set conflicts and jamming.

The disadvantages of padding are that it increases memory consumption and makes the subscript calculations for operations over the whole array more complex, since the array has "holes." In particular, padding reduces the benefits of loop collapsing (see Section 6.3.4).

6.5.2 Scalar Expansion

Loops often contain variables that are used as temporaries within the loop body. Such variables will create an antidependence from S2 to S1 from one iteration to the next, and will have no other loop-carried dependences. Allocating one temporary for each iteration removes the dependence and makes the loop a candidate for parallelization [Padua et al. 1980; Wolfe 1989b], as shown in Figure 32. If the final value of c is used after the loop, c must be assigned the value of T[n].

    do i = 1, n
      c = b[i]
      a[i] = a[i] + c
    end do
    (a) original loop

    real T[n]
    do all i = 1, n
      T[i] = b[i]
      a[i] = a[i] + T[i]
    end do all
    (b) after scalar expansion

Figure 32. Scalar expansion.

Scalar expansion is a fundamental technique for vectorizing compilers, and was performed by the Burroughs Scientific Processor [Kuck and Stokes 1982] and Cray-1 [Russell 1978] compilers. An alternative for parallel machines is to use private variables, where each processor has its own instance of the variable; these may be introduced by the compiler (see Section 7.1.3) or, if the language supports private variables, by the programmer.

If the compiler vectorizes or parallelizes a loop, scalar expansion must be performed for any compiler-generated temporaries in the loop. To avoid creating unnecessarily large temporary arrays, a vectorizing compiler can perform scalar expansion after strip mining, expanding the temporary to the size of the vector strip.

Scalar expansion can also increase instruction-level parallelism by removing dependences.

6.5.3 Array Contraction

After transformation of a loop nest, it may be possible to contract scalars or arrays that have previously been expanded. It may also be possible to contract other arrays due to interchange or the use of redundant storage allocation by the programmer [Wolfe 1989b].

If the iteration variable of the pth loop in a loop nest is being used to index the kth dimension of an array x, then dimension k may be removed from x if (1) loop p is not parallel, (2) all distance vectors V involving x have v_p = 0, and (3) x is not used subsequently (that is, x is dead after the loop). The latter two conditions are true for compiler-expanded variables unless the loop structure of the program was changed after expansion. In particular, loop distribution can inhibit array contraction by causing the second condition to be violated.
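A C sketch of the scalar expansion of Figure 32 (the temporary array, pragma, and names are illustrative; on a parallel machine a loop-local private variable achieves the same effect):

    void expand(double *a, const double *b, double *T, int n)
    {
        /* original, with a shared temporary t:
         *   double t;
         *   for (int i = 0; i < n; i++) { t = b[i]; a[i] = a[i] + t; }
         */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            T[i] = b[i];          /* each iteration has its own temporary */
            a[i] = a[i] + T[i];
        }
    }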


    real T[n, n]
    do i = 1, n
      do all j = 1, n
        T[i, j] = a[i, j]*3
        b[i, j] = T[i, j] + b[i, j]/T[i, j]
      end do all
    end do
    (a) original code

    real T[n]
    do i = 1, n
      do all j = 1, n
        T[j] = a[i, j]*3
        b[i, j] = T[j] + b[i, j]/T[j]
      end do all
    end do
    (b) after array contraction

Figure 33. Array contraction.

Contraction reduces the amount of storage consumed by compiler-generated temporaries, as well as reducing the number of cache lines referenced. Other methods for reducing storage consumption by temporaries are strip mining (see Section 6.2.4) and dynamic allocation of temporaries, either from the heap or from a static block of memory reserved for temporaries.

6.5.4 Scalar Replacement

Even when it is not possible to contract an array into a scalar, a similar optimization can be performed when a frequently referenced array element is invariant within the innermost loop or loops. In this case, the array element can be loaded into a scalar (and presumably therefore a register) before the inner loop and, if it is modified, stored after the inner loop [Callahan et al. 1990].

Replacement multiplies Q for the array element by the number of iterations in the inner loop(s). It can also eliminate unnecessary subscript calculations, although that optimization is often done by loop-invariant code motion (see Section 6.1.3). Loop interchange can be used to enable or improve scalar replacement; Carr [1993] examines the combination of scalar replacement, interchange, and unroll-and-jam in the context of cache optimization.

    do i = 1, n
      do j = 1, n
        total[i] = total[i] + a[i, j]
      end do
    end do
    (a) original code

    do i = 1, n
      T = total[i]
      do j = 1, n
        T = T + a[i, j]
      end do
      total[i] = T
    end do
    (b) after scalar replacement

Figure 34. Scalar replacement.

An example of scalar replacement is shown in Figure 34; for a discussion of the interaction between replacement and loop interchange, see Section 6.2.1.
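The same transformation in a C sketch (names are illustrative):

    void row_sums(double *total, const double *a, int n)
    {
        for (int i = 0; i < n; i++) {
            double t = total[i];          /* load once before the inner loop */
            for (int j = 0; j < n; j++)
                t += a[i * n + j];
            total[i] = t;                 /* store once after the inner loop */
        }
    }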

6.5.5 Code Collocation

Code collocation improves memory access behavior by placing related code in close proximity. The earliest work rearranged code (often at the granularity of a procedure) to improve paging behavior [Ferrari 1976; Hatfield and Gerald 1971].

More recent strategies focus on improving cache behavior by placing the most frequent successor to a basic block (or the most frequent callee of a procedure) immediately adjacent to it in instruction memory [Hwu and Chang 1989; Pettis and Hansen 1990].

An estimate is made of the frequency with which each arc in the control flow graph will be traversed during program execution (using either profiling information or static estimates). Procedures are grouped together using a greedy algorithm that always takes the pair of procedures (or procedure groups) with the largest number of calls between them. Within a procedure, basic blocks can be grouped in the same way (although the direction of the control flow must be taken into account), or a top-down algorithm can be used that starts from the procedure entry node. Basic blocks with a frequency estimate of zero can be moved to a separate page to increase locality further. However, accessing that page may require long displacement jumps to be introduced (see the next subsection), creating the potential for performance loss if the basic blocks in question are actually executed.

Procedure inlining (see Section 6.8.5) can also affect code locality, and has been studied both in conjunction with [Hwu and Chang 1989] and independently of [McFarling 1991] code positioning. Inlining often improves performance by reducing overhead and increasing locality, but if a procedure is called more than once in a loop, inlining will often increase the number of cache misses because the procedure body will be loaded more than once.

6.5.6 Displacement Minimization

The target of a branch or a jump is usually specified relative to the current value of the program counter (PC). The largest offset that can be specified varies among architectures; it can be as few as 4 bits. If control is transferred to a location outside of the range of the offset, a multi-instruction sequence or long-format instruction is required to perform the jump. For instance, the S-DLX instruction BEQZ R4, error is only legal if error is within 2^15 bytes. Otherwise, the instruction must be replaced with:

    BNEZ R4, cont        ;reversed test
    LI   R8, error       ;get low bits
    LUI  R8, error>>16   ;get high bits
    JR   R8              ;jump to target
    cont:

This sequence requires three extra instructions. Given the cost of long displacement jumps, the code should be organized to keep related sections close together in memory, in particular those sections executed most frequently [Szymanski 1978].

Displacement minimization can also be applied to data. For instance, a base register may be allocated for a Fortran common block or group of blocks:

    common /big/ q, r, x[20000], y, z

If the array x contains word-sized elements, the common block is larger than the amount of memory indexable by the offset field in the load instruction (2^16 bytes on S-DLX). To address y and z, multiple-instruction sequences must be used in a manner analogous to the long-jump sequences above. The problem is avoided if the layout of big is:

    common /big/ q, r, y, z, x[20000]

6.6 Partial Evaluation

Partial evaluation refers to the general technique of performing part of a computation at compile time. Most of the classical optimizations based on data-flow analysis are either a form of partial evaluation or of redundancy elimination (described in Section 6.7). Loop-based data-flow optimizations are described in Section 6.1.

6.6.1 Constant Propagation

Constant propagation [Callahan et al. 1986; Kildall 1973; Wegman and Zadeck 1991] is one of the most important optimizations that a compiler can perform, and a good optimizing compiler will apply it aggressively. Typically programs contain many constants; by propagating them through the program, the compiler can do a significant amount of precomputation. More important, the propagation reveals many opportunities for other optimizations. In addition to obvious possibilities such as dead-code elimination, loop optimizations are affected because constants often appear in their induction ranges. Knowing the range of the loop, the compiler can be much more accurate in applying the loop optimizations that, more than anything else, determine performance on high-speed machines.


    n = 64
    c = 3
    do i = 1, n
      a[i] = a[i] + c
    end do
    (a) original code

    do i = 1, 64
      a[i] = a[i] + 3
    end do
    (b) after constant propagation

Figure 35. Constant propagation.

    t = i*4
    s = t
    print *, a[s]
    r = t
    a[r] = a[r] + c
    (a) original code

    t = i*4
    print *, a[t]
    a[t] = a[t] + c
    (b) after copy propagation

Figure 36. Copy propagation.

Figure 35 shows a simple example of constant propagation. On V-DLX, the resulting loop can be converted into a single vector operation because the loop is the same length as the hardware vector registers. The original loop would have to be strip mined before vectorization (see Section 6.2.4), increasing the overhead of the loop.

6.6.2 Constant Folding

Constant folding is a companion to constant propagation: when an expression contains an operation with constant values as operands, the compiler can replace the expression with the result. For example, x = 3**2 becomes x = 9. Typically constants are propagated and folded simultaneously [Aho et al. 1986].

Note that constant folding may not be legal (under Definition 2.1.2 in Section 2.1) if the operations performed at compile time are not identical to those that would have been performed at run time; a common source of such problems is the rounding of floating-point numbers during input or output between phases of the compilation process [Clinger 1990; Steele and White 1990].

6.6.3 Copy Propagation

Optimizations such as induction variable elimination (6.1.2) and common-subexpression elimination (6.7.4) may cause the same value to be copied several times. The compiler can propagate the original name of the value and eliminate redundant copies [Aho et al. 1986].

Copy propagation reduces register pressure and eliminates redundant register-to-register move instructions. An example is shown in Figure 36.

6.6.4 Forward Substitution

Forward substitution is a generalization of copy propagation. The use of a variable is replaced by its defining expression, which must be live at that point. Substitution can change the dependence relation between variables [Wolfe 1989b] or improve the analysis of subscript expressions in loops [Allen and Kennedy 1987; Kuck et al. 1981].

For instance, in Figure 37(a) the loop cannot be parallelized because an unknown element of a is being written. After forward substitution, as shown in Figure 37(b), the subscript expression is in terms of the loop-bound variable, and it is straightforward to determine that the loop can be implemented as a parallel reduction (described in Section 6.4.1).

The use of variables like np1 is a common Fortran idiom that was developed when compilers did not perform aggressive optimization. The idiom is recommended as "good programming style" in a number of Fortran programming texts!


    np1 = n+1
    do i = 1, n
      a[np1] = a[np1] + a[i]
    end do
    (a) original code

    do all i = 1, n
      a[n+1] = a[n+1] + a[i]
    end do all
    (b) after forward substitution

Figure 37. Forward substitution.

    x * 0 = 0        x + 0 = x
    0 / x = 0        x / 1 = x
    x * 1 = x

Figure 38. Algebraic identities used in expression simplification.

    Expression         Reduced Expression     Data types
    x * 2              x + x
    i * 2**c           i << c                 integer
    (a, 0) + (b, 0)    (a + b, 0)             complex
    len(s1 & s2)       len(s1) + len(s2)      string

Table 2. Identities used in strength reduction (& is the string concatenation operator).

Forward substitution is generally performed on array subscript expressions at the same time as loop normalization (Section 6.3.6). For efficient subscript analysis techniques to work, the array subscripts must be linear functions of the induction variables.

6.6.5 Reassociation

Reassociation is a technique for increasing the number of common subexpressions in a program [Cocke and Markstein 1980; Markstein et al. 1994]. It is generally applied to address calculations within loops when performing strength reduction on induction variable expressions (see Section 6.1.1). Address calculations generated by array references consist of several multiplications and additions. Reassociation applies the associative, commutative, and distributive laws to rewrite these expressions in a canonical sum-of-products form.

Forward substitution is usually performed where possible in the address calculations to increase the number of potential common subexpressions.

6.6.6 Algebraic Simplification

The compiler can simplify arithmetic expressions by applying algebraic rules to them. A particularly useful example is the set of algebraic identities. For instance, the statement x = (y*1 + 0)/1 can be transformed into x = y if x and y are integers. Figure 38 illustrates some of the commonly applied rules.

Floating-point arithmetic can be problematic to simplify; for example, if x is an IEEE floating-point number with value NaN (not a number), then x * 0 = x, instead of 0.

6.6.7 Strength Reduction

The identities in Table 2 are called strength reductions because they replace an expensive operator with an equivalent less expensive operator. Section 6.1.1 discusses the application of strength reduction to operations that appear inside a loop.

The two entries in the table that refer to multiplication, x * 2 = x + x and i * 2**c = i << c, can be generalized. Multiplication by any integer constant can be performed using only shift and add instructions [Bernstein 1986]. Other transformations are conditional on the values of the affected variables; for example, i / 2**c = i >> c if and only if i is nonnegative [Steele 1977].

It is also possible to convert exponentiation to multiplication in the evaluation of polynomials, using the identity

    a_n x^n + a_{n-1} x^(n-1) + ... + a_1 x + a_0 = (a_n x^(n-1) + a_{n-1} x^(n-2) + ... + a_1) x + a_0.
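A small C sketch of strength reduction of multiplication by a constant (the constant and function name are illustrative; most compilers perform this rewriting automatically):

    static inline int times10(int i)
    {
        /* i*10 = i*8 + i*2 = (i << 3) + (i << 1) */
        return (i << 3) + (i << 1);
    }

As noted above, the corresponding division rewrite i >> c is valid only for nonnegative i.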


6.6.8 I/O Format Compilation

Most languages provide fairly elaborate facilities for formatting input and output. The formatting specifications are in effect a formatting sublanguage that is generally "interpreted" at run time, with a correspondingly high cost for character input and output.
Formatted writes can be converted almost directly into calls to the run-time routines that implement the various format styles. These calls are then likely candidates for inline substitution. Figure 39 shows two I/O statements and their compiled equivalents. Idiom recognition has been performed to convert the implied do loop into an aggregate operation.
Note that in Fortran, a format statement is analogous to a procedure definition, which may be invoked by any number of read or write statements. The same trade-off as with procedure inlining applies: the formatted I/O can be expanded inline for higher efficiency, or encapsulated as a procedure for code compactness (inlining is described in Section 6.8.5).
Format compilation is done by the VAX Fortran compiler [Harris and Hobbs 1994] and by the Gnu C compiler [Free Software Foundation 1992]. Format compilation is further complicated in C by the fact that printf and scanf are library functions and may be redefined by the programmer.

      write(6,100) c[i]
      read(7,100) (d(j), j = 1, 100)
  100 format (A1)
(a) original code

      call putchar(c[i], 6)
      call fgets(d, 100, 7)
(b) after format compilation

Figure 39. Format compilation.
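An analogous C-level illustration (our sketch, not one of the survey's figures): a compiler that understands simple format strings can replace an interpreted printf call with direct character output.

    #include <stdio.h>

    /* before: the format string "%c\n" is interpreted at run time */
    void emit(char c) {
        printf("%c\n", c);
    }

    /* after format compilation: the format has been compiled away */
    void emit_compiled(char c) {
        putchar(c);
        putchar('\n');
    }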

6.6.9 Superoptimization

A superoptimizer [Massalin 1987] represents the extreme of optimization, seeking to replace a sequence of instructions with the optimal alternative. It does an exhaustive search, beginning with a single instruction. If all single-instruction sequences fail, two-instruction sequences are searched, and so on.
A randomly generated instruction sequence is checked by executing it with a small number of test inputs that were run through the original sequence. If it passes these tests, a thorough verification procedure is applied.
Superoptimization at compile time is practical only for short sequences (on the order of a dozen instructions). It can also be used by the compiler designer to find optimal sequences for commonly occurring idioms. It is particularly useful for eliminating conditional branches in short instruction sequences, where the pipeline stall penalty may be larger than the cost of executing the operations themselves [Granlund and Kenner 1992].
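For example, a branch-free instruction sequence of the kind a superoptimizer might find can be written at the C level as follows (our illustration; it assumes two's-complement integers and a comparison that yields 0 or 1):

    /* straightforward minimum: contains a conditional branch */
    int min_branch(int x, int y) {
        return (x < y) ? x : y;
    }

    /* branch-free equivalent: the mask is all ones when x < y, else zero */
    int min_branchless(int x, int y) {
        int mask = -(x < y);            /* 0 or -1 */
        return y ^ ((x ^ y) & mask);    /* selects x when mask is -1 */
    }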

6.7 Redundancy Elimination

There are many optimizations that improve performance by identifying redundant computations and removing them [Morel and Renvoise 1979]. We have already covered one such transformation, loop-invariant code motion, in Section 6.1.3. There a computation was being performed repeatedly when it could be done once.
Redundancy-eliminating transformations remove two other kinds of computations: those that are unreachable and those that are useless. A computation is unreachable if it is never executed; removing it from the program will have no semantic effect on the instructions executed. Unreachable code is created by programmers (most frequently with conditional debugging code), or by transformations that have left "orphan" code behind. A computation is useless if none of the outputs of the program are dependent on it.

6.7.1 Unreachable-Code Elimination

Most compilers perform unreachable-code elimination [Allen and Cocke 1971; Aho et al. 1986]. In structured programs, there are two primary ways for code to become unreachable. If a conditional predicate is known to be true or false, one branch of the conditional is never taken, and its code can be eliminated. The other common source of unreachable code is a loop that does not perform any iterations.
In an unstructured program that relies on goto statements to transfer control, unreachable code is not obvious from the program structure but can be found by traversing the control flow graph of the program.
Both unreachable and useless code are often created by constant propagation, described in Section 6.6.1. In Figure 40(a), the variable debug is a constant. When its value is propagated, the conditional expression becomes if (0 > 1). This expression is always false, so the body of the conditional is never executed and can be eliminated, as shown in Figure 40(b). Similarly, the body of the do loop is never executed and is therefore removed.
Unreachable-code elimination can in turn allow another iteration of constant propagation to discover more constants; for this reason some compilers perform constant propagation more than once.
Unreachable code is also known as dead code, but that name is also applied to useless code, so we have chosen to use the more specific term.
Sometimes considered as a separate step, redundant-control elimination removes control constructs such as loops and conditionals when they become redundant (usually as a result of constant propagation). In Figure 40(b), the loop and conditional control expressions are not used, and we can remove them from the program, as shown in (c).

      integer c, n, debug
      debug = 0
      n = 0
      a = b+7
      if (debug > 1) then
        c = a+b+d
        print *, 'Warning -- total is ', c
      end if
      call foo(a)
      do i=1, n
        a[i] = a[i] + c
      end do
(a) original code

      integer c, n, debug
      debug = 0
      n = 0
      a = b+7
      if (0 > 1) then
      end if
      call foo(a)
      do i=1, 0
      end do
(b) after constant propagation and unreachable code elimination

      integer c, n, debug
      debug = 0
      n = 0
      a = b+7
      call foo(a)
(c) after redundant control elimination

      integer c, n, debug
      a = b+7
      call foo(a)
(d) after useless code elimination

      a = b+7
      call foo(a)
(e) after dead variable elimination

Figure 40. Redundancy elimination.

6.7.2 Useless-Code Elimination

Useless code is often created by other optimizations, like unreachable-code elimination. When the compiler discovers that the value being computed by a statement is not necessary, it can remove the code.

This can be done for any nonglobal variable that is not live immediately after the defining statement. Live-variable analysis is a well-known data-flow problem [Aho et al. 1986]. In Figure 40(c), the values computed by the assignment statements are no longer used; they have been eliminated in (d).

6.7.3 Dead-Variable Elimination

After a series of transformations, particularly loop optimizations, there are often variables whose value is never used. The unnecessary variables are called dead variables; eliminating them is a common optimization [Aho et al. 1986]. In Figure 40(d), the variables c, n, and debug are no longer used and can be removed; (e) shows the code after the variables are pruned.

6.7.4 Common-Subexpression Elimination

In many cases, a set of computations will contain identical subexpressions. The redundancy can arise in both user code and in address computations generated by the compiler. The compiler can compute the value of the subexpression once, store it, and reuse the stored result [Aho et al. 1977; 1986; Cocke 1970]. Common-subexpression elimination is an important transformation and is almost universally performed. While it is generally a good idea to perform common-subexpression elimination wherever possible, the compiler must consider the current register pressure and the cost of recomputing. If storing the temporary value(s) forces additional spills to memory, the transformation can actually deoptimize the program.
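A small C sketch of the transformation (ours, not one of the survey's figures), using the kind of address-like subexpression the text mentions:

    /* before CSE: the subexpression i*stride + j is evaluated twice */
    void update(double *a, int i, int j, int stride, double c) {
        a[i*stride + j] = a[i*stride + j] + c;
    }

    /* after CSE: the common subexpression is computed once and reused */
    void update_cse(double *a, int i, int j, int stride, double c) {
        int t = i*stride + j;
        a[t] = a[t] + c;
    }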

6.7.5 Short Circuiting

Short circuiting is an optimization that can be performed on boolean expressions. It is based on the observation that the value of many binary boolean operations can be determined from the value of the first operand [Arden et al. 1962]. For example, the control expression in

      if ((a = 1) and (b = 2)) then
        c = 5
      end if

is known to be false if a does not equal 1, regardless of the value of b. Short circuiting would convert this expression to:

      if (not (a = 1)) goto 10
      if (not (b = 2)) goto 10
      c = 5
  10  continue

Note that if any of the operands in the boolean expression have side-effects, short circuiting can change the results of the evaluation. The alteration may or may not be legal, depending on the language semantics. Some language definitions, including C and many dialects of Lisp, address this problem by requiring short circuiting of boolean expressions.

6.8 Procedure Call Transformations

The optimizations described in the next several sections attempt to reduce the overhead of procedure calls in one of four ways:

● eliminating the call entirely;
● eliminating execution of the called procedure's body;
● eliminating some of the entry/exit overhead; and
● avoiding some steps in making a procedure call when the behavior of the called procedure is known or can be altered.

6.8.1 A Calling Convention for S-DLX

To demonstrate the procedure call optimizations, we first define a calling convention for S-DLX. Table 3 shows how the registers are used.
In general, each called procedure is responsible for ensuring that the values in registers R16-R25 are preserved across the call. The stack begins at the top of memory and grows downward. There is no explicit frame pointer; instead, the stack pointer is decremented by the size s of the procedure's frame at entry and left unchanged during the call.


Table 3. S-DLX Registers and Their Usage

Number      Usage
R0          Always zero; writes are ignored
R1          Return value when returning from a procedure call
R2..R7      The first six words of the arguments to the procedure call
R8..R15     8 caller-save registers, used as temporary registers by the callee
R16..R25    10 callee-save registers; these registers must be preserved across a call
R26..R29    Reserved for use by the operating system
R30         Stack pointer
R31         Return address during a procedure call
F0..F3      The first four floating-point arguments to the procedure call
F4..F17     14 caller-save floating-point registers
F18..F31    14 callee-save floating-point registers

The value R30 + s serves as a virtual frame pointer that points to the base of the stack frame, avoiding the use of a second dedicated register. For languages that cannot predict the amount of stack space used during execution of a procedure, an additional general-purpose register can be used as a frame pointer, or the size of the frame can be stored at the top of the stack (R30 + 0).
On entering a procedure, the return address is in R31. The first six words of the procedure arguments appear in registers R2-R7, and the rest of the argument data is on the stack. Figure 41 shows the layout of the stack frame for a procedure invocation. A similar convention is followed for floating-point registers, except that only four are reserved for arguments.
Calling a procedure is a four-step process:
(1) The values of any of the registers R1-R15 that contain live values are saved. If the values of any global variables that might be used by the callee are in a register and have been modified, the copy of those variables in memory is updated.
(2) The arguments are stored in the designated registers and, if necessary, on the stack.
(3) A linked jump is made to the target procedure; the CPU leaves the address of the next instruction in R31.
(4) Upon return, the saved registers are restored, and the registers holding global variables are reloaded.
Execution of a procedure consists of six steps:
(1) Space is allocated on the stack for the procedure invocation.
(2) The values of registers that will be modified during procedure execution (and that must be preserved across the call) are saved on the stack. If the procedure makes any procedure calls itself, the saved registers should include the return address, R31.
(3) The procedure body is executed.
(4) The return value (if any) is stored in R1, and the registers that were saved in step 2 are restored.
(5) The frame is removed from the stack.
(6) Control is transferred to the return address.
To demonstrate the structure of a procedure and the calling convention, Figure 42 shows a simple function and its compiled code. The function (foo) and the function that it calls (max) each take two integer arguments, so they do not need to pass arguments on the stack. The stack frame for foo is three words, which are used to save the return address (R31) and register R16 during the call to max, and to hold the local variable d.


Figure 41. Stack frame layout. The caller's stack frame sits at high addresses and holds the arguments (arg n ... arg 1); below it, the current frame (of the given frame size) holds locals and temporaries, saved registers, and the arguments for called procedures, with SP pointing at its low end. The stack grows downward toward low addresses.

R31 must be preserved because it is overwritten by the jump-and-link (JAL) instruction; R16 must be preserved because it is used to hold c across the call.
The procedure first allocates the 12 bytes for the stack frame and saves R31 and R16; then the parameters are loaded into registers. The value of d is calculated in the temporary register R9. Then the addresses of the arguments are stored in the argument registers, and a jump to max is made. On return from max, the return value has been computed into the return value register (R1). After removing the stack frame and restoring the saved registers, the procedure jumps back to its caller through R31.

6.8.2 Leaf Procedure Optimization

A leaf procedure is one that does not call any other procedures; the name comes from the fact that these procedures are leaves in the static call graph. The simplest optimization for leaf procedures is that they do not need to save and restore the return address (R31). Additionally, if the procedure does not have any local variables allocated to memory, the compiler does not need to create a stack frame.
Figure 43(a) shows the function max (called by the previous example function foo), its original compiled code (b), and the code after leaf procedure optimization (c). After eliminating the save/restore of R31, there is no need to allocate a stack frame. Eliminating the code that deallocates the frame also allows the function to return directly if x < y.

6.8.3 Cross-Call Register Allocation

Separate compilation reduces the amount of information available to the compiler about called procedures. However, when both callee and caller are available, the compiler can take advantage of the register usage of the callee to optimize the call.
If the callee does not need (or can be restricted not to use) all the temporary registers (R8-R15), the caller can leave values in the unused registers throughout execution of the callee. Additionally, move instructions for parameters can be eliminated.
To perform this optimization on a procedure f, registers must first have been allocated for each of the procedures that f calls. This can be done by performing register allocation on procedures ordered by a depth-first postorder traversal of the call graph [Chow 1988].
For example, in Figure 43(c) max uses only R8-R10. Figure 44 shows foo after cross-call register allocation. R31 is saved in R11 instead of on the stack, and the return is a jump to the saved address in R11. Additionally, R12 is used instead of R16, allowing the save and restore of R16 to be eliminated. Only d remains in the stack frame.


      integer function foo(c, b)
      integer c, b
      integer d, e
      d = c+b
      e = max(b, d)
      foo = e+c
      return
      end
(a) source code for function foo

foo: SUBI R30, #12      ;adjust SP
     SW   8(R30), R31   ;save retaddr
     SW   4(R30), R16   ;save R16
     LW   R16, (R2)     ;R16=c
     LW   R8, (R3)      ;R8=b
     ADD  R9, R16, R8   ;R9=d=c+b
     SW   (R30), R9     ;save d
     MOV  R2, R3        ;arg1=addr(b)
     ADDI R3, R30, #0   ;arg2=addr(d)
     JAL  max           ;call max; R1=e
     ADD  R1, R1, R16   ;R1=e+c
     LW   R16, 4(R30)   ;restore R16
     LW   R31, 8(R30)   ;restore retaddr
     ADDI R30, #12      ;restore SP
     JR   R31           ;return
(b) compiled code for foo

Figure 42. Function foo and its compiled code.

      integer function max(x, y)
      integer x, y
      if (x > y) then
        max = x
      else
        max = y
      end if
      return
      end
(a) source code for function max

max: SUBI R30, #4       ;adjust SP
     SW   (R30), R31    ;save retaddr
     LW   R8, (R2)      ;R8=x
     LW   R9, (R3)      ;R9=y
     SGT  R10, R8, R9   ;R10=(x > y)
     BEQZ R10, Else     ;x <= y
     MOV  R1, R8        ;max=x
     J    Ret
Else:MOV  R1, R9        ;max=y
Ret: LW   R31, (R30)    ;restore retaddr
     ADDI R30, #4       ;restore SP
     JR   R31           ;return
(b) original compiled code of max

max: LW   R8, (R2)      ;R8=x
     LW   R9, (R3)      ;R9=y
     SGT  R10, R8, R9   ;R10=(x > y)
     BEQZ R10, Else     ;x <= y
     MOV  R1, R8        ;max=x
     JR   R31           ;return
Else:MOV  R1, R9        ;max=y
     JR   R31           ;return
(c) max after leaf optimization

Figure 43. Leaf procedure optimization.

foo: SUBI R30, #4       ;adjust SP
     MOV  R11, R31      ;save retaddr
     LW   R12, (R2)     ;R12=c
     LW   R8, (R3)      ;R8=b
     ADD  R9, R12, R8   ;R9=d
     SW   (R30), R9     ;save d
     MOV  R2, R3        ;arg1=addr(b)
     ADDI R3, R30, #0   ;arg2=addr(d)
     JAL  max           ;call max
     ADD  R1, R1, R12   ;R1=e+c
     ADDI R30, #4       ;restore SP
     JR   R11           ;return

Figure 44. Cross-call register allocation for foo.


6.8.4 Parameter Promotion

When a parameter is passed by reference, the address calculation is done by the caller, but the load of the parameter value is done by the callee. This wastes an instruction, since most address calculations can be handled with the offset(Rn) format of load instructions.
More importantly, if the operand is already in a register in the caller, it must be spilled to memory and reloaded by the callee. If the callee modifies the value, it must then be stored. Upon return to the caller, if the compiler cannot prove that the callee did not modify the operand, it must be loaded again. Thus, as many as two unnecessary loads and two unnecessary stores can be introduced.
An unmodified reference parameter can be passed by value, and a modified reference parameter can be passed by value-result. Figure 45(a) shows max after this transformation has been applied. Figure 45(b) shows the corresponding modified form of foo.
Since d can now be held in a register, there is no longer a need for a stack frame, so frame collapsing can be applied, and the stack frame is eliminated. In general, when the compiler can statically identify all the callers of a leaf procedure, it can expand their stack frames to include enough space for both procedures. The leaf procedure simply uses the caller's stack frame without doing any new allocation of its own.
Parameter promotion is particularly important for languages such as Fortran, in which all argument passing is by reference. However, the argument-passing semantics of value-result are not the same as for arguments passed by reference, in particular in the event of aliasing. Interprocedural analysis may be necessary to determine the legality of the transformation.

max: SGT  R10, R2, R3   ;R10=(x > y)
     BEQZ R10, Else     ;x <= y
     MOV  R1, R2        ;max=x
     JR   R31           ;return
Else:MOV  R1, R3        ;max=y
     JR   R31           ;return
(a) max after parameter promotion: x and y are passed by value in R2 and R3

foo: MOV  R11, R31      ;save retaddr
     LW   R12, (R2)     ;R12=c
     LW   R2, (R3)      ;R2=b
     ADD  R3, R12, R2   ;R3=d
     JAL  max           ;call max
     ADD  R1, R1, R12   ;R1=e+c
     JR   R11           ;return
(b) foo after parameter promotion on max

Figure 45. Parameter promotion.
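The effect of parameter promotion can also be seen at the source level; in this C sketch (ours, not one of the survey's figures) pass-by-pointer stands in for Fortran's pass-by-reference:

    /* before: x and y are passed by reference, so the caller must
       spill them to memory and the callee must load them */
    int max_ref(int *x, int *y) {
        return (*x > *y) ? *x : *y;
    }

    /* after parameter promotion: the unmodified parameters
       are passed by value in registers */
    int max_val(int x, int y) {
        return (x > y) ? x : y;
    }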
6.8.5 Procedure Inlining

Procedure inlining (also known as procedure integration) replaces a procedure call with a copy of the body of the called procedure [Allen and Cocke 1971; Ball 1979; Scheifler 1977]. Each occurrence of a formal parameter is replaced with a version of the corresponding actual parameter, modified to reflect the calling convention of the language. Renaming may also be required for the local variables of the inlined procedure if they conflict with the calling procedure's variable names or if the procedure is inlined more than once in the same caller.
Inlining can almost always be performed, except when the procedure in question is recursive. Even recursive routines may benefit from a finite number of inlining steps, particularly when a constant argument allows the compiler to determine the number of recursive calls that will be performed. For Fortran programs, incompatible common-block usages between caller and callee can make inlining more complex and in practice often prevent it.
When a call is inlined, all the overhead for the invocation is eliminated. The stack frames for the caller and callee are allocated together, and the transfer of control is eliminated. This is particularly important for the return (JR R31), since a jump through a register may incur a higher pipeline penalty than a jump to a fixed address.


      do i=1, n
        call f(a, i)
      end do

      subroutine f(x, j)
      dimension x[*]
      x[j] = x[j] + c
      return
(a) original code

      do all i=1, n
        a[i] = a[i] + c
      end do all
(b) after inlining

Figure 46. Procedure inlining.

foo: LW   R12, (R2)     ;R12=c
     LW   R2, (R3)      ;R2=b
     ADD  R3, R12, R2   ;R3=d
max: SGT  R10, R2, R3   ;R10=(x > y)
     BEQZ R10, Else     ;x <= y
     MOV  R1, R2        ;max=x
     J    Ret           ;"return" to foo
Else:MOV  R1, R3        ;max=y
Ret: ADD  R1, R1, R12   ;R1=e+c
     JR   R31           ;return

Figure 47. max inlined into foo.

Another reason for inlining is to improve compiler analysis and optimization. In many compilers, a loop containing a procedure call cannot be parallelized because its read-write behavior is unknown. After the call is inlined, the compiler may be able to prove loop independence, thereby allowing vectorization or parallelization. Additionally, register usage may be improved, constants propagated more accurately, and more redundant operations eliminated.
An alternative to inlining is to perform interprocedural analysis. The advantage of interprocedural analysis is that it can be applied uniformly, since it does not cause code expansion the way inlining does. However, interprocedural analysis can be costly and increases the complexity of the compiler.
Inlining also affects the instruction cache behavior of the program [McFarling 1991]. The change can be favorable, because locality is improved by eliminating the transfer of control. On the other hand, if a loop body is made much larger, it may no longer fit in the cache and therefore cause additional memory accesses. Furthermore, if the loop contains multiple calls to the same procedure, multiple copies of the procedure will be loaded into the cache.
The primary disadvantage of inlining is that it increases code size, in the worst case exponentially. However, in practice it is simple to control the size increase by selective application of inlining (for example, to small leaf procedures, or to procedures that are only called from a few places). The result can be a dramatic improvement in execution speed. Figure 46 shows a source-level example of inlining; Figure 47 shows the assembler output after the procedure max is inlined into foo.
Ignoring cache effects, assume the following: t_p is the time to execute the entire procedure, t_b is the time to execute just the body of the procedure, n is the number of times it is called, and T is the total execution time of the program. Then

    t_s = n(t_p - t_b)

is the time saved by inlining, and

    I = t_s / T

is the fraction of the total run time saved by inlining.
Some recent studies [Cooper et al. 1991; 1992] have demonstrated that realizing the theoretical benefits of inlining may require some modification of the compiler, because inlined code has different characteristics than human-written code. Peculiarities of a language or a compiler may reduce the effectiveness of inlining or produce unexpected results.

Cooper et al. examined several commercial Fortran compilers and uncovered the following potential problems: (1) increased demand for registers; (2) larger numbers of local variables, sometimes exceeding a compiler's limit for the number of analyzable variables; and (3) loss of information due to the Fortran convention that procedure parameters may be assumed to be unaliased.

6.8.6 Procedure Cloning

Procedure cloning [Cooper et al. 1993] is a technique for improving optimization across procedure call boundaries. The call sites of the procedure being cloned are divided into groups, and a specialized version of the procedure is created for each group. The specialization often provides opportunities for better optimization, particularly due to improved constant propagation.
In Figure 48, cloning procedure f with p replaced by the constant 2 allows reduction in strength. The real-valued exponentiation is replaced by a multiplication, which is usually at least 10 times faster.

      call f(a, n, 2)

      subroutine f(x, n, p)
      real x[*]
      integer n, p
      do i=1, n
        x[i] = x[i]**p
      end do
(a) original code

      call F_2(a, n)

      subroutine F_2(x, n)
      real x[*]
      integer n
      do i=1, n
        x[i] = x[i]*x[i]
      end do
(b) after cloning

Figure 48. Procedure cloning.

6.8.7 Loop Pushing

Loop pushing (also called loop embedding [Hall et al. 1991]) moves a loop nest from the caller to a cloned version of the called procedure. If a compiler does not perform vectorization or parallelization across procedure calls directly, pushing is a less general way of achieving a similar effect.
For example, pushing is done by the CMAX Fortran preprocessor for the Thinking Machines CM-5. CMAX converts Fortran 77 programs to Fortran 90, attempting to discover data-parallel operations in the process [Sabot and Wholey 1993].
Pushing not only allows the parallelization of the loop in Figure 49, it also eliminates the overhead of all but one of the procedure calls.
If there are other statements in the loop, distribution is a prerequisite to pushing. In this case, however, the dependence analysis for distribution must be interprocedural. If there are no other statements in the loop, the transformation is always legal.

      do i=1, n
        call f(x, i)
      end do

      subroutine f(a, j)
      real a[*]
      a[j] = a[j] + c
      return
(a) original loop and procedure

      call F_2(x)

      subroutine F_2(a)
      real a[*]
      do all i=1, n
        a[i] = a[i] + c
      end do all
      return
(b) after loop pushing

Figure 49. Loop pushing.

Procedure inlining (see Section 6.8.5) is a different way of achieving a very similar effect, and does not require interprocedural analysis.

6.8.8 Tail Recursion Elimination

Tail recursion is a particularly common form of recursion. A function is recursive if it invokes itself, directly or indirectly. It is tail recursive if its last act is to call itself and return the value of the recursive call without performing any further processing.
When a function is tail recursive, it is unnecessary to invoke a separate instance with its own stack frame. The recursion can be eliminated; the current invocation will not be using its frame any longer, so the call can be replaced by a jump to the top of the procedure. Figure 50 shows an example of a tail-recursive function (a) and the result after the recursion is eliminated (b). A function that is not tail recursive is shown in Figure 50(c): it uses the result of the recursive call as an operand to the addition, so there is computation that must be performed after the recursive call returns.
Recursive programs can be transformed automatically into tail-recursive versions that can be executed iteratively [Burstall and Darlington 1977], but this is not commonly performed by existing compilers for imperative languages. Some languages prevent tail recursion elimination by requiring clean-up code to be executed after a procedure is finished. The semantics of C++, for example, demand that before a procedure returns it must call a destructor on each stack-allocated local object variable. Other languages, like Scheme [Rees et al. 1986], require that all tail-recursive procedures be identified and that they be executed without consuming stack space proportional to the recursion depth.

      recursive logical function inarray(a, x, i, n)
      real x, a[n]
      integer i, n

      if (i > n) then
        inarray = .FALSE.
      else if (a[i] = x) then
        inarray = .TRUE.
      else
        inarray = inarray(a, x, i+1, n)
      end if
      return
(a) A tail-recursive procedure

      logical function inarray(a, x, i, n)
      real x, a[n]
      integer i, n

  1   if (i > n) then
        inarray = .FALSE.
      else if (a[i] = x) then
        inarray = .TRUE.
      else
        i = i+1
        goto 1
      end if
      return
(b) After tail recursion elimination

      recursive integer function sumarray(a, x, i, n)
      real x, a[n]
      integer i, n

      if (i = n) then
        sumarray = a[i]
      else
        sumarray = a[i] + sumarray(a, x, i+1, n)
      end if
      return
(c) A procedure that is not tail-recursive

Figure 50. Tail recursion elimination.

6.8.9 Function Memoization

Memoization is an optimization that is applied to side-effect-free procedures (that is, procedures that do not change the state of the program, also called referentially transparent). In such cases it is possible to cache the results of recent invocations. When the procedure is called again with the same arguments, the cached result is used instead of recomputing it [Abelson and Sussman 1985; Michie 1968].
Figure 51 shows a simple example of memoization. If f is often called with the same arguments and f also takes a nontrivial amount of time to run, then memoization will substantially increase performance. If not, it will degrade performance and consume memory.


      y = f(i)
(a) original function call

      logical f_UNCACHED[n]
      real f_CACHE[n]
      do i=1, n
        f_UNCACHED[i] = .true.
      end do
      ...
      if (f_UNCACHED[i]) then
        f_CACHE[i] = f(i)
        f_UNCACHED[i] = .false.
      end if
      y = f_CACHE[i]
(b) code augmented for memoization

Figure 51. Function memoization.

This example assumes that f's parameter is confined to the range 1...n, and that n is not extremely large. A more sophisticated memoization scheme would hash the arguments and use a cache size that makes a sensible trade-off between reuse and memory consumption. For functions that return dynamically allocated objects, storage management must be considered as well.
For c calls and a hit rate of r, the time to execute the c calls is

    T = c(r t_h + (1 - r) t_m)

where t_h is the time to retrieve the memoized result from the cache (a hit), and t_m is the time to call f plus the overhead of discovering that it is not in the cache (a miss). With a high hit rate and a large difference between t_h and t_m, memoization can significantly improve performance.

7. TRANSFORMATIONS FOR PARALLEL MACHINES

All the transformations discussed so far are applicable to a wide variety of computer organizations. In this section we describe transformations that are specific to parallel architectures.
Highly efficient parallel applications have been developed by manually applying the transformations discussed here to sequential code. Duplicating these successes by automatically transforming the code within a compiler is an active area of research. While a great deal of work has been done, automatic parallelization of full-scale applications with existing compilers does not consistently yield significant speedups. Experiments that measure the effectiveness of parallelizing compilers are presented in Section 9.3.
Because the parallelization of sequential code is proving to be difficult, languages like HPF Fortran [High Performance Fortran Forum 1993] provide constructs that allow the programmer to declare data decomposition strategies and to expose parallelism. The compiler is then responsible for generating communication and synchronization code, using various optimizations (discussed below) to reduce the resulting overhead. As of yet, however, there are few real applications that have been written in these languages, and there is little experimental data.
In this section we make extensive use of our model shared-memory and distributed-memory multiprocessors sMX and dMX, which are described fully in the Appendix. Both machines are constructed using S-DLX processors. The programming environment initializes the global variable Pnum to be the number of processors in the machine and Pid to be the number of the local processor (starting with 0).

7.1 Data Layout

One of the key decisions that a compiler must make in targeting a multiprocessor is the decomposition of data across the processors. The need to distribute data and manage it is obvious in a distributed-memory programming model, where communication is explicit. The performance of code under a shared-memory model is also dependent on avoiding unnecessary communication: the fact that communication is performed implicitly does not eliminate its cost.

ACM Computmg Surveys, Vol 26, No 4, December 1994 Compiler Transformations ● 393 avoiding unnecessary communication: doallj=l, n the fact that communication is performed doalli=l, n a[i, j] = a[i, j] + c implicitly does not eliminate its cost. end do all end do all 7.1.1 Regular Array Decomposition (a) original loop

The most common strategy for decompos- ing an array on a parallel machine is to 12345878 map each dimension to a set of proces- 2 sors in a regular pattern. The four com- 3 monly used patterns are discussed in the 40123 5 following sections. s

7

8 “!IIII Serial Decomposition (b) ( block, sertd) decornpos,t;on of a on four Serial decomposition is the degenerate processors case, where an entire dimension is allo- cated to a single processor. Figure 52(a) call FORK(Pnum) shows a simple loop nest that adds c to do j = (n/Pnuro)*Pid+i, rnin( (n/Pnum)*(Pid+i) , n) each element of a two-dimensional array, doi=l, n the decomposition of the array onto pro- a[i, j] = a[i, j] + c cessors (b), and the loop transformed into end do code for sMX (c). Each column is mapped end do call JOIN() serially, meaning that all the data in (c) corresponding ( block, serial) scheduling of that column will be placed on a single the loop, using block size n/Pnurn processor. As a result, the loop that iter- ates over the columns (the inner loop) is unchanged. *2345678 1

2 o 1 3 Block Decomposition 14

A block decomposition divides the ele- 8 6 ments into one group of adj scent ele- 2 3 1 ments per processor. Using the previous 8 E13 example, the rows of array a are divided (d) ( block, block) decomposition of a on four between the four processors, as shown in processors Figure 52(b). Because the rows have been divided across the processors, the outer Figure 52. Serial and block decompositions of an loop in Figure 52(c) has been modified so array on the ehared-memory multiprocessor sMX. that each processor iterates only over the local portion of each row. Figure 52(d) shows the decomposition across the processors, each of which has if block scheduling is applied across both its own private address space. the horizontal and vertical dimensions. As a result, naming a nonlocal array A major difference between compila- element requires a mapping function tion for shared- and distributed-memory from the original, global name space to a machines is that shared-memory ma- processor number and local array indices chines provide a global name space, so within that processor. The examples that any particular array element can be shown here are all for shared memory, named by any processor using the same and therefore sidestep these complexi- subscripts as in the sequential code. On ties, which are discussed in Section 7.3. the other hand, on a distributed-memory However, the same decomposition princi- machine, arrays must usually be divided ples are used for distributed memory.
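As an illustration of such a mapping function, the following C sketch (ours, not code from the survey; it assumes 0-based indexing and that n is a multiple of Pnum) maps a global element index to an owning processor and a local index for block and cyclic decompositions:

    /* block decomposition: element i (0-based) of an n-element array */
    int block_owner(int i, int n, int Pnum) {
        return i / (n / Pnum);      /* processor that holds element i */
    }
    int block_local(int i, int n, int Pnum) {
        return i % (n / Pnum);      /* index within that processor's piece */
    }

    /* cyclic decomposition: successive elements go to successive processors */
    int cyclic_owner(int i, int Pnum) { return i % Pnum; }
    int cyclic_local(int i, int Pnum) { return i / Pnum; }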


(a) (cyclic, serial) decomposition of a on four processors

      call FORK(Pnum)
      do j = Pid+1, n, Pnum
        do i=1, n
          a[i, j] = a[i, j] + c
        end do
      end do
      call JOIN()
(b) corresponding (cyclic, serial) scheduling of loop

Figure 53. Cyclic decomposition on sMX.

      do j=1, n
        do i=1, j
          total = total + a[i, j]
        end do
      end do
(a) loop

(b) elements of a read by the loop

Figure 54. Triangular iteration space.

The advantage of block decomposition is that adjacent elements are usually on the same processor. Because a loop that computes a value for a given element often uses the values of neighboring elements, the blocks have good locality and reduce the amount of communication.

Cyclic Decomposition

A cyclic decomposition assigns successive elements to successive processors. A cyclic allocation of columns to processors for the sequential code in Figure 52(a) is shown in Figure 53(a). Figure 53(b) shows the transformed code for sMX.
Cyclic decomposition has the opposite effect of blocking: it has poor locality for neighbor-based communication, but spreads load more evenly. Cyclic decomposition works well for loops like the one in Figure 54, as long as the array is large enough so that each processor handles many columns. A common algorithm that results in this type of load balance is LU factorization of matrices. A block decomposition would work poorly because the iterations most costly to execute would be clustered together and computed by a single processor. With a cyclic decomposition, the expensive iterations are spread across the processors.

Block-Cyclic Decomposition

A block-cyclic decomposition combines the two strategies; the elements are divided into many adjacent groups, usually an integer multiple of the number of processors available. Each group of elements is assigned to a processor cyclically. Figure 55(a) shows an array whose columns have been allocated to two processors in a block-cyclic fashion, and Figure 55(b) gives the code to implement the mapping on sMX. Block-cyclic is a compromise between the locality of block decomposition and the load balancing of cyclic decomposition.

Irregular Computations

Some loops, like the one shown in Figure 56, have highly irregular execution behavior. When the variance of execution times is high, none of the static regular decomposition strategies may be sufficient to yield good performance. A variety of dynamic algorithms make scheduling decisions at run time based on the historical behavior of the computation [Lucco 1992; Polychronopoulos and Kuck 1987; Tang and Yew 1990].


(a) (block-cyclic, serial) decomposition of a on two processors

      call FORK(Pnum)
      do k = 1, n, 2*Pnum
        do j = k + 2*Pid, k + 2*Pid + 1
          do i=1, n
            a[i, j] = a[i, j] + c
          end do
        end do
      end do
      call JOIN()
(b) corresponding (block-cyclic, serial) scheduling of loop, using block size 2

Figure 55. Block-cyclic decomposition on sMX.

      do i=1, n
        if (mask[i] = 1) then
          a[i] = expensive_function()
        end if
      end do
(a) example loop

(b) iteration versus cost for the above loop

Figure 56. Irregular execution behavior.

7.1.2 Automatic Decomposition and Alignment

The decomposition of an array determines how the elements are distributed across a set of processors; in a block decomposition, for example, the array might be divided into contiguous 10 x 10 subarrays. The alignment of the array establishes the specific elements that are in each of those subarrays.
Languages like HPF Fortran rely on the programmer to declare how arrays are to be aligned and what style of decomposition to use; the language includes annotations like BLOCK and CYCLIC. A variety of research projects have investigated whether the strategy can be determined automatically without programmer assistance. Programmers can find it difficult to choose a good set of declarations, particularly since the optimal decomposition can change during the course of program execution.
The general strategy for automating decomposition and alignment is to express the behavior of the program in a representation that either captures communication explicitly or allows it to be computed. The compiler applies operations that reflect different decomposition and alignment choices; the goal is to maximize parallelism and minimize communication. The approaches include the Alignment-Distribution Graph [Chatterjee et al. 1993a; 1993b], the Component Affinity Graph [Li and Chen 1990; 1991], constraints [Gupta 1992], and affine transformations [Anderson and Lam 1993].
In addition to presenting an algorithm to find the optimal mapping of arrays to processors for one- and two-dimensional arrays, Mace [1985; 1987] shows that finding optimal decompositions is an NP-complete problem.

7.1.3 Scalar Privatization

The goal behind privatization is to increase the amount of parallelism and to avoid unnecessary communication. When a scalar is used within a loop solely as a scratch variable, each processor can be given a private copy so the use of the scalar need not involve any communication. The transformation is safe if there are no loop-carried dependences involving the scalar; before the scalar is used in the loop body, its value is always updated within the same iteration.


      do i=1, n
        c = b[i]
        a[i] = a[i] + c
      end do

Figure 57. Scalar value suitable for privatization.

Figure 57 shows a loop that could benefit from privatization. The scalar value c is simply a scratch value; if it is privatized, the loop is revealed to have independent iterations. If the arrays are allocated to separate processors before the loop is executed, no further communication is necessary. In this very simple example the variable could also have been eliminated entirely, but in more complex loops temporary variables are repeatedly updated and referenced.
Scalar privatization is analogous to scalar expansion (see Section 6.5.2); in both cases, spurious dependences are eliminated by replicating the variable. The transformation is widely incorporated into parallel compilers [Allen et al. 1988b; Cytron and Ferrante 1987].

7.1.4 Array Privatization

Array privatization is similar to scalar privatization; each processor is given its own copy of the entire array. The effect of privatizing an array is to reduce communication requirements and to expose additional parallelism by removing unnecessary dependences caused by storage reuse.
Parallelization studies of real application programs have found that privatization is necessary for high performance [Eigenmann et al. 1991; Singh and Hennessy 1991]. Despite its importance, verifying that privatization of an array is legal is much more difficult than the equivalent test on a scalar value.
Traditionally, data-flow analysis [Muchnick and Jones 1981] treats an entire array as a single value. This lack of precision prevents most of the loop optimizations discussed in this survey and led to the development of dependence analysis. Feautrier [1988; 1991] extends dependence analysis to support array expansion, essentially the same optimization as privatization. The analysis is expensive because it requires the use of parametric integer programming. Other privatization research has extended data-flow analysis to consider the behavior of individual array elements [Li 1992; Maydan et al. 1993; Tu and Padua 1993].
In addition to determining whether privatization is legal, the compiler must also check whether the values that are left in an array are used by subsequent computations. If so, after the array is privatized the compiler must add code to preserve those final values by performing a copy-out operation from the private copies to a global array.

7.1.5 Cache Alignment

On shared-memory processors like sMX, when one processor modifies storage, the cache line in which it resides is flushed by any other processors holding a copy. The flush is necessary to ensure proper synchronization when multiple processors are updating a single shared variable. If several processors are continually updating different variables or array elements that reside in the same cache line, however, the constant flushing of cache lines may severely degrade performance. This problem is called false sharing.
When false sharing occurs between parallel loop iterations accessing different columns of an array or different structures, it can be addressed by aligning each of the columns or structures to a cache line boundary. False sharing of scalars can be addressed by inserting padding between them so that each shared scalar resides in its own cache line (padding is discussed in Section 6.5.1). A further optimization is to place shared variables and their corresponding lock variables in the same cache line, which will cause the lock acquisition to act as a prefetch of the data [Torrellas et al. 1994].


7.2 Exposing Coarse-Grained Parallelism

Transferring data can be an expensive operation on a machine with asynchronously executing processors. One way to avoid the inefficiency of frequent communication is to divide the program into subcomputations that will execute for an extended period of time without generating message traffic. The following optimizations are designed to identify such computations or to help the compiler create them.

7.2.1 Procedure Call Parallelization

One potential source of coarse-grained parallelism in imperative languages is the procedure call. In order to determine whether a call can be performed as an independent task, the compiler must use interprocedural analysis to examine its behavior.
Because of its importance to all forms of optimization, interprocedural analysis has received a great deal of attention from compiler researchers. One approach is to propagate data-flow and dependence information through call sites [Banning 1979; Burke and Cytron 1986; Callahan et al. 1986; Myers 1981]. Another is to summarize the array access behavior of a procedure and use that summary to evaluate whether a call to it can be executed in parallel [Balasundaram 1990; Callahan and Kennedy 1988a; Li and Yew 1988; Triolet et al. 1986].

7.2.2 Split

A more comprehensive approach to program decomposition summarizes the data usage behavior of an arbitrary block of code in a symbolic descriptor [Graham et al. 1993]. As in the previous section, the summary identifies independence between computations (between procedure calls or between different loop nests). Additionally, the descriptor is used in applying the split transformation.
Split is unusual in that it is not applied in isolation to a computation like a loop nest. Instead, split considers the relationship of pairs of computations. Two loop nests, for example, may have dependences between them that prevent the compiler from executing both simultaneously on different processors. However, it is often the case that only some of the iterations conflict. Split uses memory summarization to identify the iterations in one loop nest that are independent of the other one, placing those iterations in a separate loop. Applying split to an application exposes additional sources of concurrency and pipelining.

7.2.3 Graph Partitioning

Data-flow languages [Ackerman 1982; McGraw 1985; Nikhil 1988] expose parallelism explicitly. The program is converted into a graph that represents basic computations as nodes and the movement of data as arcs between nodes. Data-flow graphs can also be used as an internal representation to express the parallel structure of programs written in imperative languages.
One of the major difficulties with data-flow languages is that they expose parallelism at the level of individual arithmetic operations. As it is impractical to use software to schedule such a small amount of work, early projects focused on developing architectures that embed data-flow execution policies into hardware [Arvind and Culler 1986; Arvind et al. 1980; Dennis 1980]. Such machines have not proven to be successful commercially, so researchers began to develop techniques for compiling data-flow languages on conventional architectures. The most common approach is to interpret the data-flow graph dynamically, executing a node representing a computation when all of its operands are available. To reduce scheduling overhead, the data-flow graph is generally transformed by gathering many simple computations into larger blocks that are executed atomically [Anderson and Hudak 1990; Hudak and Goldberg 1985; Mirchandaney et al. 1988; Sarkar 1989].

7.3 Computation Partitioning

In addition to decomposing data as discussed in Section 7.1, parallel compilers

must allocate the computations of a program to different processors. Each processor will generally execute a restricted set of iterations from a particular loop. The transformations in this section seek to enforce correct execution of the program without introducing unnecessary overhead.

7.3.1 Guard Introduction

After a loop is parallelized, not all the processors will compute the same iterations or send the same slices of data. The compiler can ensure that the correct computations are performed by introducing conditional statements (known as guards) into the program [Callahan and Kennedy 1988b].
Figure 58(a) shows an example of a sequential loop that we will transform for parallel execution. Figure 58(b) shows the parallel version of the loop. The code computes upper and lower bounds that the guards use to prevent computation on array elements that are not stored locally within the processor. The bounds depend on the size of the array, the number of processors, and the Pid of the processor.
We have made a number of simplifying assumptions: the value of n is an even multiple of Pnum, and the arrays are already distributed across the processors with a block decomposition. We have also chosen an example that does not require interprocessor communication. To parallelize the loops encountered in practice, compilers must introduce communication statements and much more complex guard and induction variable expressions.
dMX does not have a global name space, so each processor's instance of an array is separate. In Figure 58(b), the compiler could take advantage of the isolation of each processor to remap the array elements. Instead of the original array of size n, each processor could declare a smaller array of n/Pnum elements and modify the loop to iterate from 1 to n/Pnum. Although the remapping would simplify the loop bounds, we preserve the original element indices to maintain a closer relationship between the sequential code and its transformed version.

      do i=1, n
        a[i] = a[i] + c
        b[i] = b[i] + c
      end do
(a) original loop

      LBA = (n/Pnum)*Pid + 1
      UBA = (n/Pnum)*(Pid + 1)
      LBB = (n/Pnum)*Pid + 1
      UBB = (n/Pnum)*(Pid + 1)
      do i=1, n
        if (LBA <= i .and. i <= UBA) a[i] = a[i] + c
        if (LBB <= i .and. i <= UBB) b[i] = b[i] + c
      end do
(b) parallelized loop with guards introduced

      LB = (n/Pnum)*Pid + 1
      UB = (n/Pnum)*(Pid + 1)
      do i=1, n
        if (LB <= i .and. i <= UB) then
          a[i] = a[i] + c
          b[i] = b[i] + c
        end if
      end do
(c) after guard combination

      LB = (n/Pnum)*Pid + 1
      UB = (n/Pnum)*(Pid + 1)
      do i=LB, UB
        a[i] = a[i] + c
        b[i] = b[i] + c
      end do
(d) after bounds reduction

Figure 58. Computation partitioning.

7.3.2 Redundant Guard Elimination

In the same way that the compiler can use code hoisting to optimize computation, it can reduce the number of guards

by hoisting them to the earliest point where they can be correctly computed. Hoisting often reveals that identical guards have been introduced and that all but one can be removed. In Figure 58(b), the two guard expressions are identical because the arrays have the same decomposition. When the guards are hoisted to the beginning of the loop, the compiler can eliminate the redundant guard. The second pair of bound computations becomes useless and is also removed. The final result is shown in Figure 58(c).

7.3.3 Bounds Reduction

In Figure 58(c), the guards control which iterations of the loop perform computation. Since the desired set of iterations is a contiguous range, the compiler can achieve the same effect by changing the induction expressions to reduce the loop bounds [Koelbel 1990]. Figure 58(d) shows the result of the transformation.

7.4 Communication Optimization

An important part of compilation for distributed-memory machines is the analysis of an application's communication needs and the introduction of explicit message-passing operations into it. Before a statement is executed on a given processor, any data it relies on that is not available locally must be sent. A simple-minded approach is to generate a message for each nonlocal data item referenced by the statement, but the resulting overhead will usually be unacceptably high. Compiler designers have developed a set of optimizations that reduce the time required for communication.
The same fundamental issues arise in communication optimization as in optimization for memory access (described in Section 6.5): maximizing reuse, minimizing the working set, and making use of available parallelism in the communication system. However, the problems are magnified: while the ratio of one memory access to one computation on S-DLX is 16:1, the ratio of the number of cycles required to send one word to the number of cycles required for a single operation on dMX is about 500:1.
Like vectors on vector processors, communication operations on a distributed-memory machine are characterized by a startup time, t_s, and a per-element cost, t_b, the time required to send one byte once the message has been initiated. On dMX, t_s = 10 μs and t_b = 100 ns, so sending one 10-byte message costs 11 μs while sending two 5-byte messages costs 21 μs. To avoid paying the startup cost unnecessarily, the optimizations in this section combine data from multiple messages and send them in a single operation.

7.4.1 Message Vectorization

Analysis can often determine the set of data items transferred in a loop. Rather than sending each element of an array in an individual message, the compiler can group many of them together and send them in a single block transfer. Because this is analogous to the way a vector processor interacts with memory, the optimization is called message vectorization [Balasundaram et al. 1990; Gerndt 1990; Zima et al. 1988].
Figure 59(a) shows a sample loop that multiplies each element of array a by the mirror element of array b. Figure 59(b) is an inefficient parallel version for processors 0 and 1 on dMX. To simplify the code, we assume that the arrays have already been allocated to the processors using a block decomposition: the lower half of each array is on processor 0 and the upper half on processor 1.
Each processor begins by computing the lower and upper bounds of the range that it is responsible for. During each iteration, it sends the element of b that the other processor will need and waits to receive the corresponding message. Note that Fortran's call-by-reference semantics convert the array reference implicitly into the address of the corresponding element, which is then used by the low-level communication routine to extract the bytes to be sent or received.


When the message arrives, the iteration proceeds.
Figure 59(c) is a much more efficient version that handles all communication in a single message before the loop begins executing. Each processor computes the upper and lower bounds for both itself and for the other processor so as to place the incoming elements of b properly.
Message-passing libraries and network hardware often perform poorly when their internal buffer sizes are exceeded. The compiler may be able to perform additional communication optimization by using strip mining to reduce the message length, as shown in Figure 59(d).

      do i=1, n
        a[i] = a[i] + b[n+1-i]
      end do
(a) original loop

      LB = Pid * (n/2) + 1
      UB = LB + (n/2)
      otherPid = 1 - Pid
      do i=LB, UB
        call SEND(b[i], 4, otherPid)
        call RECEIVE(b[n+1-i], 4)
        a[i] = a[i] + b[n+1-i]
      end do
(b) parallel loop

      LB = Pid * (n/2) + 1
      UB = LB + (n/2)
      otherPid = 1 - Pid
      otherLB = otherPid * (n/2) + 1
      otherUB = otherLB + (n/2)
      call SEND(b[LB], (n/2)*4, otherPid)
      call RECEIVE(b[otherLB], (n/2)*4, otherPid)
      do i=LB, UB
        a[i] = a[i] + b[n+1-i]
      end do
(c) parallel loop with vectorized messages

      do j = LB, UB, 256
        call SEND(b[j], 256*4, otherPid)
        call RECEIVE(b[otherLB+(j-LB)], 256*4, otherPid)
        do i = j, j+255
          a[i] = a[i] + b[n+1-i]
        end do
      end do
(d) after strip mining messages (assuming array size is a multiple of 256)

Figure 59. Message vectorization.

7.4.2 Message Coalescing

Once message vectorization has been performed, the compiler can further reduce the frequency of communication by grouping together messages that send overlapping or adjacent data. The Fortran D compiler [Tseng 1993] uses Regular Sections [Callahan and Kennedy 1988a], an array summarization strategy, to describe the array slice in each message. When two slices being sent to the same processor overlap or cover contiguous ranges of the array, the associated messages are combined.

7.4.3 Message Aggregation

Sending a message is generally much more expensive than performing a block copy locally on a processor. Therefore it is worthwhile to aggregate messages being sent to the same processor even if the data they contain is unrelated [Tseng 1993]. A simple aggregation strategy is to identify all the messages directed at the same target processor and copy the various strips of data into a single buffer. The target processor performs the reverse operation.
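A minimal sketch of this aggregation strategy, written against the dMX message library described in the Appendix; the two source arrays x and y, the strip lengths, the packing buffer buf, and the variable target are assumptions made for illustration only.

      ! Sender: two unrelated strips bound for the same processor are
      ! packed into one buffer and shipped as a single message.
      real x(100), y(50), buf(150)
      buf(1:100)   = x(1:100)
      buf(101:150) = y(1:50)
      call SEND(buf, 150*4, target)

      ! Receiver: one RECEIVE followed by a local unpack (the reverse copy).
      real x(100), y(50), buf(150)
      call RECEIVE(buf, 150*4)
      x(1:100) = buf(1:100)
      y(1:50)  = buf(101:150)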

7.4.4 Collective Communication

Many parallel architectures and message-passing libraries offer special-purpose communication primitives such as broadcast, hardware reduction, and scatter-gather. Compilers can improve performance by recognizing opportunities to exploit these operations, which are often highly efficient. The idea is analogous to idiom and reduction recognition on sequential machines. Li and Chen [1991] present compiler techniques that rely on pattern matching to identify opportunities for using collective communication.
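As an illustration, a loop of point-to-point SENDs from one processor can be replaced by the BROADCAST primitive of the dMX library defined in the Appendix; the array coeff, its size, and the assumption that receivers still post a RECEIVE are for illustration only.

      ! Point-to-point version: processor 0 sends the same block to every
      ! other processor, paying the message startup cost Pnum-1 times.
      if (Pid .eq. 0) then
        do p = 1, Pnum-1
          call SEND(coeff, 64*4, p)
        end do
      else
        call RECEIVE(coeff, 64*4)
      end if

      ! Collective version: a single BROADCAST replaces the loop of SENDs
      ! (receivers are assumed to still post a RECEIVE).
      if (Pid .eq. 0) then
        call BROADCAST(coeff, 64*4)
      else
        call RECEIVE(coeff, 64*4)
      end if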


7.4.5 Message Pipelining

Another important optimization is to pipeline parallel computations by overlapping communication and computation. Studies have demonstrated that many applications perform very poorly without pipelining [Rogers 1991]. Many message-passing systems allow the processor to continue executing instructions while a message is being sent. Some support fully nonblocking send and receive operations. In either case, the compiler has the opportunity to arrange for useful computation to be performed while the network is delivering messages.

A variety of algorithms have been developed to discover opportunities for pipelining and to move message transfer operations so as to maximize the amount of resulting overlap [Rogers 1991; Koelbel and Mehrotra 1991; Tseng 1993].

7.4.6 Redundant Communication Elimination

To avoid sending messages wherever possible, the compiler can perform a variety of transformations to eliminate redundant communication. Many of the optimizations covered earlier can also be used on messages. If a message is sent within the body of a loop but the data does not change from one iteration to the next, the SEND can be hoisted out of the loop. When two messages contain the same data, only one need be sent.

Messages offer further opportunities for optimization. If the contents of a message are subsumed by a previous communication, the message need not be sent; this situation is often created when SENDs are hoisted in order to maximize pipelining opportunities. If a message contains data, some of which has already been sent, the overlap can be removed to reduce the amount of data transferred. Another possibility is that a message is being sent to a collection of processors, some of which previously received the data. The list of recipients can then be pruned, reducing the amount of communication traffic. These optimizations are used by the PTRAN II compiler [Gupta et al. 1993] to reduce overall message traffic.
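A minimal sketch of the loop-invariant SEND hoisting just mentioned, again using the dMX SEND call from the Appendix; the array c, the scalar s, the trip count, and otherPid are assumed for illustration only.

      ! Before: c is not modified inside the loop, yet it is re-sent on
      ! every iteration.
      do i = 1, 100
        call SEND(c, 64*4, otherPid)
        s = s + c(1) * i
      end do

      ! After: the loop-invariant SEND is hoisted, so the data crosses
      ! the network only once.
      call SEND(c, 64*4, otherPid)
      do i = 1, 100
        s = s + c(1) * i
      end do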
7.5 SIMD Transformations

SIMD architectures exhibit much more regular behavior than MIMD machines, eliminating many problematic synchronization issues. In addition to the alignment and decomposition strategies for distributed-memory systems (see Section 7.1.2), the regularity of SIMD interconnection networks offers additional opportunities for optimization. The compiler can use very accurate cost models to estimate the performance of a particular layout.

Early SIMD compilation work targeted IVTRAN [Millstein and Muntz 1975], a Fortran dialect for the Illiac IV that provided layout and alignment declarations. The compiler provided a parallelizing module called the Paralyzer [Presberg and Johnson 1975] that used an early form of dependence analysis to identify independent loops and applied linear transformations to optimize communication.

The Connection Machine Convolution Compiler [Bromley et al. 1991] targets the topology of the SIMD architecture explicitly with a pattern-matching strategy. The compiler focuses on computations that update array elements based on their neighbors. It finds the pattern of neighbors that are needed to compute a given element, called the stencil. The cross, a common stencil, represents a computation that updates each array element using the values of its neighbors to the north, east, west, and south. Stencils are aggregated into larger patterns, or multistencils, using optimizations analogous to loop unrolling and strip mining. The multistencils are mapped to the hardware so as to minimize communication.

SIMD compilers developed at Compass [Knobe and Natarajan 1993; Knobe et al. 1988; 1990] construct a preference graph based on the computations being performed. The graph represents alignment requests that would yield the least communication overhead. The compiler satisfies every request when possible, using a greedy algorithm to find a solution when preferences are in conflict.
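For concreteness, a cross-stencil update of the kind these compilers recognize can be written as the following loop nest; the array names, the 0.25 weighting, and the interior-only bounds are assumptions for illustration.

      ! Cross stencil: each interior element is recomputed from its
      ! north, south, east, and west neighbors.
      integer, parameter :: n = 64
      real a(n, n), anew(n, n)
      integer i, j
      do j = 2, n-1
        do i = 2, n-1
          anew(i, j) = 0.25 * (a(i-1, j) + a(i+1, j) &
                             + a(i, j-1) + a(i, j+1))
        end do
      end do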


Wholey [1992] begins by computing a detailed cost function for the computation. The function is the input to a hill-climbing algorithm that searches the space of possible allocations of data to processors. It begins with two processors, then four, and so forth; each time the number of processors is doubled, the algorithm uses the most efficient allocation from the previous stage as a starting point.

In addition to general-purpose mapping algorithms, the compiler can use specific transformations like loop flattening [von Hanxleden and Kennedy 1992] to avoid idle time due to nonuniform computations.
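Loop flattening is not illustrated elsewhere in this excerpt, so the following is a minimal sketch of the idea under assumed names: a nest whose inner trip count m(i) varies from row to row is collapsed into a single loop, so processors assigned short rows do not sit idle.

      ! Nonuniform nest: the inner trip count m(i) differs from row to row.
      do i = 1, n
        do j = 1, m(i)
          b(i, j) = 2.0 * b(i, j)
        end do
      end do

      ! Flattened form: a single loop over all (i, j) pairs.  start(i) is
      ! assumed to hold the prefix sum m(1) + ... + m(i-1), with start(1) = 0.
      total = start(n) + m(n)
      do k = 1, total
        i = 1
        do while (k .gt. start(i) + m(i))   ! recover the row for iteration k
          i = i + 1                         ! (linear scan kept only for clarity)
        end do
        j = k - start(i)
        b(i, j) = 2.0 * b(i, j)
      end do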
7.6 VLIW Transformations

The Very Long Instruction Word (VLIW) architecture is another strategy for executing instructions in parallel. A VLIW processor [Colwell et al. 1988; Fisher et al. 1984; Floating Point Systems 1979; Rau et al. 1989] exposes its multiple functional units explicitly; each machine instruction is very large (on the order of 512 bits) and controls many independent functional units. Typically the instruction is divided into 32-bit pieces, each of which is routed to one of the functional units.

With traditional processors, there is a clear distinction between low-level scheduling done via instruction scheduling and high-level scheduling between processors. This distinction is blurred in VLIW architectures. They rely on the compiler to expose the parallel operations explicitly and schedule them at compile time. Thus the task of generating object code incorporates high-level resource allocation and scheduling decisions that occur only between processors on a more conventional parallel architecture.

Trace scheduling [Ellis 1986; Fisher 1981; Fisher et al. 1984] is an analysis strategy that was developed for VLIW architectures. Because VLIW machines require more parallelism than is typically available within a basic block, compilers for these machines must look through branches to find additional instructions to execute. The multiple paths of control require the introduction of speculative execution: computing some results that may turn out to be unused (ideally at no additional cost, by functional units that would otherwise have gone unused).

The speculation can be handled dynamically by the hardware [Sohi and Vajapeyam 1990], but that complicates the design significantly. The trace-scheduling alternative is to have the compiler do it by identifying paths (or traces) through the control-flow graph that might be taken. The compiler guesses which way the branches are likely to go and hoists code above them accordingly. Superblock scheduling [Hwu et al. 1993] is an extension of trace scheduling that relies on program profile information to choose the traces.

8. TRANSFORMATION FRAMEWORKS

Given the many transformations that compiler writers have available, they face a daunting task in determining which ones to apply and in what order. There is no single best order of application; one transformation can permit or prevent a second from being applied, or it can change the effectiveness of subsequent changes. Current compilers generally use a combination of heuristics and a partially fixed order of applying transformations.

There are two basic approaches to this problem: unifying the transformations in a single mechanism, and applying search techniques to the transformation space. In fact there is often a degree of overlap between these two approaches.

8.1 Unified Transformation

A promising strategy is to encode both the characteristics of the code being transformed and the effect of each transformation; then the compiler can quickly search the space of possible sets of transformations to find an efficient solution.


One framework that is being actively investigated is based on unimodular matrix theory [Banerjee 1991; Wolf and Lam 1991]. It is applicable to any loop nest whose dependences can be described with a distance vector; a subset of the loops that require a direction vector can also be handled. The transformations that a unimodular matrix can describe are interchange, reversal, and skew.

The basic principle is to encode each transformation of the loop in a matrix and apply it to the dependence vectors of the loop. The effect on the dependence pattern of applying the transformation can be determined by multiplying the matrix and the vector. The form of the product vector reveals whether the transformation is valid. To model the effect of applying a sequence of transformations, the corresponding matrices are simply multiplied.

Figure 60 shows a loop nest that can be transformed with unimodular matrices:

      do i = 2, 10
        do j = 1, 10
          a[i, j] = a[i-1, j] + a[i, j]
        end do
      end do

Figure 60. Unimodular transformation example.

The distance vector that describes the loop is D = (1, 0), representing the dependence of iteration i on i-1 in the outer loop. Because of the dependence, it is not legal to reverse the outer loop of the nest. The reversal transformation is

      R = [ -1  0 ]
          [  0  1 ]

The product is

      R D = P1 = (-1, 0)

This demonstrates that the transformation is not legal, because P1 is not lexicographically positive.

We can also test whether the two loops can be interchanged; the interchange transformation is

      I = [ 0  1 ]
          [ 1  0 ]

Applying that to D yields P2 = (0, 1). In this case, the resulting vector is lexicographically positive, showing that the transformation is legal.

Any loop nest whose dependences are all representable by a distance vector can be transformed via skewing into a canonical form called a fully permutable loop nest. In this form, any two loops in the nest can be interchanged without changing the loop semantics. Once in this canonical form, the compiler can decompose the loop into the granularity that matches the target architecture [Wolf and Lam 1991].

Sarkar and Thekkath [1992] describe a framework for transforming perfect loop nests that includes unimodular transformations, tiling, coalescing, and parallel loop execution. The transformations are encoded in an ordered sequence. Rules are provided for mapping the dependence vectors and loop bounds, and the transformation to the loop body is described by a template.

Pugh [1991] describes a more ambitious (and time-consuming) technique that can transform imperfectly nested loops and can do most of the transformations possible through a combination of statement reordering, interchange, fusion, skewing, reversal, distribution, and parallelization. It views the transformation problem as that of finding the best schedule for a set of operations in a loop nest. A method is given for generating and testing candidate schedules.
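The legality test used in the Figure 60 example above (multiply the transformation matrix by each distance vector and check that the result is lexicographically positive) is easy to state as code. The following small program is only a sketch of that check for a 2-deep nest, with the reversal and interchange matrices hard-wired as data; it is not taken from any of the cited systems.

      program unimodular_check
        integer :: d(2), r(2,2), t(2,2), p(2)
        data d / 1, 0 /            ! distance vector D = (1, 0)
        data r / -1, 0, 0, 1 /     ! reversal of the outer loop (column order)
        data t /  0, 1, 1, 0 /     ! loop interchange (column order)

        p = matmul(r, d)
        print *, 'reversal:    P =', p, '  legal =', lexpos(p)
        p = matmul(t, d)
        print *, 'interchange: P =', p, '  legal =', lexpos(p)
      contains
        logical function lexpos(v)
          integer, intent(in) :: v(2)
          ! Lexicographically positive: the first nonzero component is positive.
          if (v(1) /= 0) then
            lexpos = v(1) > 0
          else
            lexpos = v(2) > 0
          end if
        end function lexpos
      end program unimodular_check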

8.2 Searching the Transformation Space

Wang [1991] and Wang and Gannon [1989] describe a parallelization system that uses heuristic search techniques from artificial intelligence to find a program transformation sequence. The target machine is represented by a set of features that describe the type, size, and speed of the processors, memory, and interconnect. The heuristics are organized hierarchically. The main functions are: description of parallelism in the program and in the machine; matching of program parallelism to machine parallelism; and control of restructuring.

9. COMPILER EVALUATION

Researchers are still trying to find a good way to evaluate the effectiveness of compilers. There is no generally agreed-upon way to determine the best possible performance of a particular program on a particular machine, so it is difficult to determine how well a compiler is doing. Since some applications are better structured than others for a given architecture or a given compiler, measurements for a particular application or group of applications will not necessarily predict how well another application will fare.

Nevertheless, a wide variety of measurement studies do exist that seek to evaluate how applications behave, how well they are being compiled, and how well they could be compiled. We have divided these studies into several groups.

9.1 Benchmarks

Benchmarks have received by far the most attention since they measure delivered performance, yielding results that are used to market machines. They were originally developed to measure machine speed, not compiler effectiveness. The Livermore Loops [McMahon 1986] is one of the early benchmark suites; it sought to compare the performance of supercomputers. The suite consists of a set of small loops based on the most time-consuming inner loops of scientific codes.

The SPEC benchmark suite [Dixit 1992] includes both scientific and general-purpose applications intended to be representative of an engineering/scientific workload. The SPEC benchmarks are widely used as indicators of machine performance, but are essentially uniprocessor benchmarks.

As architectures have become more complex, it has become obvious that the benchmarks measure the combined effectiveness of the compiler and the target machine. Thus two compilers that target the same machine can be compared by using the SPEC ratings of the generated code.

For parallel machines, the contribution of the compiler is even more important, since the difference between naive and optimized code can be many orders of magnitude. Parallel benchmarks include SPLASH (Stanford Parallel Applications for Shared Memory) [Singh et al. 1992] and the NASA Numerical Aerodynamic Simulation (NAS) benchmarks [Bailey et al. 1991].

The Perfect Club [Berry et al. 1989] is a benchmark suite of computationally intensive programs that is intended to help evaluate serial and parallel machines.

9.2 Code Characteristics

A number of studies have focused on the applications themselves, examining their source code for programmer idioms or profiling the behavior of the compiled executable.

Knuth [1971] carried out an early and influential study of Fortran programs. He studied 440 programs comprising 250,000 lines (punched cards). The most important effect of this study was to dramatize the fact that the majority of the execution time of a program is usually spent in a very small proportion of the code. Other interesting statistics are that 95% of all the do loops incremented their index variable by 1, and 40% of all do loops contained only one statement.

Shen et al. [1990] examined the subscript expressions that appeared in a set of Fortran mathematical libraries and scientific applications. They applied various dependence tests to these expressions. The results demonstrate how the tests compare to one another, showing that one of the biggest problems was unknown variables. These variables were caused by procedure calls and by the use of an array element as an index value into another array. Coupled subscripts also caused problems for tests that examine a single array dimension at a time (coupled subscripts are discussed in Section 5.4).


A study by Petersen and Padua [1993] evaluates approximate dependence tests against the omega test [Pugh 1992], finding that the latter does not expose substantially more parallelism in the Perfect Club benchmarks.

9.3 Compiler Effectiveness

As we mentioned above, researchers find it difficult to evaluate how well a compiler is doing. They have come up with four approaches to the problem:

● examine the compiler output by hand to evaluate its ability to transform code;
● measure the performance of one compiler against another;
● compare the performance of executables compiled with full optimization against little or no optimization; and
● compare an application compiled for parallel execution against the sequential version running on one processor.

Nobayashi and Eoyang [1989] compare the performance of supercomputer compilers from Cray, Fujitsu, Alliant, Ardent, and NEC. The compilers were applied to various loops from the Livermore and Argonne test suites that required restructuring before they could be computed with vector operations.

Carr [1993] applies a set of loop transformations (unroll-and-jam, loop interchange, tiling, and scalar replacement) to various scientific applications to find a strategy that optimizes cache utilization. Wolf [1992] uses inner loops from the Perfect Club and from one of the SPEC benchmarks to compare the performance of hand-optimized to compiler-optimized code. The focus is on improving locality to reduce traffic between memory and the cache.

Relatively few studies have been performed to test the effectiveness of real compilers in parallelizing real programs. However, the results of those that have are not encouraging.

One study of four Perfect benchmark programs compiled on the Alliant FX/8 produced speedups between 0.9 (that is, a slowdown) and 2.36 out of a potential 32; when the applications were tuned by hand, the speedups ranged from 5.1 to 13.2 [Eigenmann et al. 1991]. In another study of 12 Perfect benchmarks compiled with the KAP compiler for a simulated machine with unlimited parallelism, 7 applications had speedups of 1.4 or less; two applications had speedups of 2-4; and the rest were sped up 10.3, 66, and 77. All but three of these applications could have been sped up by a factor of 10 or more [Padua and Petersen 1992].

Lee et al. [1985] study the ability of the Parafrase compiler to parallelize a mixture of small programs written in Fortran. Before compilation, while loops were converted to do loops, and code for handling error conditions was removed. With 32 processors available, 4 out of 15 applications achieved 30% efficiency, and 2 achieved 10% efficiency; the other 9 out of 15 achieved less than 10% efficiency. Out of 89 loops, 59 were parallelized, most of them loops that initialized array data structures. Some coding idioms that are amenable to improved analysis were identified.

Some preliminary results have also been reported for the Fortran D compiler [Hiranandani et al. 1993] on the Intel iPSC/860 and Thinking Machines CM-5 architectures.

9.4 Instruction-Level Parallelism

In order to evaluate the potential gain from instruction-level parallelism, researchers have engaged in a number of studies to measure an upper bound on how much parallelism is available when an unlimited number of functional units is assumed. Some of these studies are discussed and evaluated in detail by Rau and Fisher [1993].

Early studies [Tjaden and Flynn 1970] were pessimistic in their findings, measuring a maximum level of parallelism on the order of two or three, a result that was called the Flynn bottleneck.

The main reason for the low numbers was that these studies did not look beyond basic blocks.

Parallelism can be exploited across basic block boundaries, however, on machines that use speculative execution. Instead of waiting until the outcome of a conditional branch is known, these architectures begin executing the instructions at either or both potential branch targets; when the conditional is evaluated, any computations that are rendered invalid or useless must be discarded.

When the basic block restriction is relaxed, there is much more parallelism available. Riseman and Foster [1972] assumed replicated hardware to support speculative execution. They found that there was significant additional parallelism available, but that exploiting it would require a great deal of hardware. For the programs studied, on the order of 2^16 arithmetic units would be necessary to expose 10-way parallelism.

A subsequent study by Nicolau and Fisher [1984] sought to measure the parallelism that could be exploited by a VLIW architecture using aggressive compiler analysis. The strategy was to compile a set of scientific programs into an intermediate form, simulate execution, and apply a scheduling algorithm retroactively that used branch and address information revealed by the simulation. The study revealed significant amounts of parallelism: a factor of tens or hundreds.

Wall [1991] took trace data from a number of applications and measured the amount of parallelism under a variety of models. His study considered the effect of architectural features such as varying numbers of registers, branch prediction, and jump prediction. It also considered the effects of alias analysis in the compiler. The results varied widely, yielding as much as 60-way parallelism under the most optimistic assumptions. The average parallelism on an ambitious combination of architecture and compiler support was 7.

Butler et al. [1991] used an optimizing compiler on a group of programs from the SPEC89 suite. The resulting code was executed on a variety of simulated machines. The range includes unrealistically omniscient architectures that combine data-flow strategies for scheduling with perfect branch prediction, as well as restricted models with limited hardware. Under the most optimistic assumptions, they were able to execute 17 instructions per cycle; more practical models yielded between 2.0 and 5.8 instructions per cycle. Without looking past branches, few applications could execute more than 2 instructions per cycle.

Lam and Wilson [1992] argue that one major reason why studies of instruction-level parallelism differ markedly in their results is branch prediction. The studies that perform speculative execution based on prediction yield much less parallelism than the ones that have an oracle with perfect knowledge about branches. The authors speculate that executing branches locally will not yield large degrees of parallelism; they conduct experiments on a number of applications and find similar results to other studies (between 4.2 and 9.2). They argue that compiler optimizations that consider the global flow of control are essential.

Finally, Larus [1993] takes a different approach than the others, focusing strictly on the parallelism available in loops. He examines six codes (three integer programs from SPEC89 and three scientific applications), collects a trace annotated with semantic information, and runs it through a simulator that assumes a uniform shared-memory architecture with infinite processors. Two of the scientific codes, the ones that were primarily array manipulators, showed a potential speedup in the range of 65-250. The potential speedup of the other codes was 1.7-3, suggesting that the techniques developed for parallelizing scientific codes will not work well in parallelizing other types of programs.

CONCLUSION

We have described a large number of transformations that can be used to improve program performance on sequential and parallel machines. Numerous studies have demonstrated that these transformations, when properly applied, are sufficient to yield high performance. However, current optimizing compilers lack an organizing principle that allows them to choose how and when the transformations should be applied. Finding a framework that unifies many transformations is an active area of research. Such a framework could simplify the problem of searching the space of possible transformations, making it easier and therefore cheaper to build high-quality optimizing compilers. A key part of this problem is the development of a cost function that allows the compiler to evaluate the effectiveness of a particular set of transformations.

Despite the absence of a strategy for unifying transformations, compilers have proven to be quite successful in targeting sequential and superscalar architectures. On parallel architectures, however, most high-performance applications currently rely on the programmer rather than the compiler to manage parallelism. Compilers face a large space of possible transformations to apply, and parallel machines exact a very high cost of failure when optimization algorithms do not discover a good reorganization strategy for the application.

Because efforts to parallelize traditional languages automatically have met with little success, the focus of research has shifted to compiling languages in which the programmer shares the responsibility for exposing parallelism and choosing a data layout.

Another developing area is the optimization of nonscientific applications. Like most researchers working on high-performance architectures, compiler writers have focused on code that relies on loop-based manipulation of arrays. Other kinds of programs, particularly those written in an object-oriented programming style, rely heavily on pointers. Optimization of pointer manipulation and of object-oriented languages is emerging as another focus of compiler research.

APPENDIX: MACHINE MODELS

A.1 Superscalar DLX

A superscalar processor has multiple functional units and can issue more than one instruction per clock cycle. Current examples of superscalar machines are the DEC Alpha [Sites 1992], HP PA-RISC [Hewlett-Packard 1992], IBM RS/6000 [Oehler and Blasgen 1991], and Intel Pentium [Alpert and Avnon 1993].

S-DLX is a simplified superscalar RISC architecture. It has four independent functional units for integer, load/store, branch, and floating-point operations. In every cycle, the next two instructions are considered for execution. If they are for different functional units, and there are no dependences between the instructions, they are both initiated. Otherwise, the second instruction is deferred until the next cycle. S-DLX does not reorder the instructions.

Most operations complete in a single cycle. When an operation takes more than one cycle, subsequent instructions that use results from multicycle instructions are stalled until the result is available. Because there is no instruction reordering, when an instruction is stalled no instructions are issued to any of the functional units.

S-DLX has a 32-bit word, 32 general-purpose registers (GPRs, denoted by Rn), and 32 floating-point registers (FPRs, denoted by Fn). The value of R0 is always 0. The FPRs can be used as double-precision (64-bit) register pairs. For the sake of simplicity we have not included the double-precision floating-point instructions.

Memory is byte addressable with a 32-bit virtual address. All memory references are made by load and store instructions between memory and the registers. Data is cached in a 64-kilobyte 4-way set-associative write-back cache composed of 1024 64-byte cache lines. Figure 61 is a block diagram of the architecture and its primary datapaths.

Table A.1 describes the instruction set. All instructions are 32 bits and must be word-aligned.


[Figure 61 (block diagram): branch, integer, load/store, and floating-point units; 32 integer and 32 floating-point registers; address translation; and a 64 KB, 4-way set-associative cache with 1024 lines of 64 bytes each, connected to main memory over 32- and 64-bit datapaths.]

Figure 61. S-DLX functional diagram.

Table A.1. The S-DLX Instruction Set

Example Instr.      Name                   Meaning                        Similar instructions
LW R1, 30(R2)       Load word              R1 ← Memory[30+R2]             Load float (LF)
SW 500(R4), R3      Store word             Memory[500+R4] ← R3            Store float (SF)
LI R1, #666         Load immediate         R1 ← 666
LUI R1, #666        Load upper immediate   R1(16..31) ← 666
MOV R1, R2          Move register          R1 ← R2
ADD R1, R2, R3      Add                    R1 ← R2 + R3                   Subtract (SUB)
MULT R1, R2, R3     Multiply               R1 ← R2 × R3                   Divide (DIV)
ADDI R1, R2, #3     Add immediate          R1 ← R2 + 3                    SUBI, MULTI, DIVI
SLL R1, R2, R3      Shift left logical     R1 ← R2 << R3
...                 ...                    ...                            ...
..., F2, F3         Multiply-add float     F1 ← F1 + (F2 × F3)
EQF F1, F2          Test equal float       If (F1 = F2) FPCR ← 1,         LTF, GTF, LEF, GEF, NEF
                                           else FPCR ← 0

The immediate operand field is 16 bits. The address for load and store instructions is computed by adding the 16-bit immediate operand to the register. To create a full 32-bit constant, the low 16 bits must first be set with a load immediate (LI) instruction that clears the high 16 bits; then the high 16 bits must be set with a load upper immediate (LUI) instruction.

The program counter is the special register PC. Jumps and branches are relative to PC + 4; jumps have a 26-bit signed offset, and branches have a 16-bit signed offset.


Integer branches test the GPRs for zero; floating-point branches test a special floating-point condition register (FPCR).

There is no branch delay slot. Because the number of instructions executed per cycle varies on a superscalar machine, it does not make sense to have a fixed number of delay slots. The number of delay slots is also heavily dependent on the pipeline depth, which may vary from one chip generation to the next. Instead, static branch prediction is used: forward branches are always predicted as not taken, and backward branches are predicted as taken. If the prediction is wrong, there is a one-cycle delay.

Floating-point divides take 14 cycles, and all other floating-point operations take four cycles, except when the result is used by a store operation, in which case they take three cycles. A load that hits in the cache takes two cycles; if it misses in the cache it takes 16 cycles. Integer multiplication takes two cycles. All other instructions take one cycle. When a load or store is followed by an integer instruction that modifies the address register, the instructions may be issued in the same cycle. If a floating-point store is followed by an operation that modifies the register being stored, the instructions may also be issued in the same cycle.

A.2 Vector DLX

For vectorizing transformations, we will use a version of DLX extended to include vector support. This new architecture, V-DLX, has eight vector registers, each of which holds a vector consisting of up to 64 floating-point numbers. The vector functional units perform all their operations on data in the vector and scalar registers.

We discuss only the functional units that perform floating-point addition, multiplication, and division, though vector machines typically have units to perform integer and logical operations as well. V-DLX issues only one scalar or one vector instruction per cycle, but nonconflicting scalar and vector instructions can overlap each other.

A special register, the vector length register (VLR), controls the number of quantities that are loaded, operated upon, or stored in any vector instruction. By software convention, the VLR is normally set to 64, except when handling the last few iterations of a loop.

The vector operations are described in Table A.2. They include arithmetic operations on vector registers and load/store operations between vector registers and memory. Vector operations take place either between two vector registers or between a vector register and a scalar register. In the latter case, the scalar value is extended across the entire vector. All vector computations have a vector register as the destination.

Table A.2. V-DLX Vector Instructions

Example Instr.       Name                        Meaning                               Similar
LV V1, R1            Load vector register        V1 ← VLR words at Memory[R1]
LVWS V1, (R1, R2)    Load vector with stride     V1 ← every R2th word from Memory[R1]
SV V1, R1            Store vector register       Memory[R1] ← VLR words from V1
SUBV V1, V2, V3      Vector-vector subtraction   V1[1..VLR] ← V2[1..VLR] - V3[1..VLR]
SUBVS V1, V2, F1     Vector-scalar subtraction   V1[1..VLR] ← V2[1..VLR] - F1
SUBSV V1, F1, V2     Scalar-vector subtraction   V1[1..VLR] ← F1 - V2[1..VLR]          DIVSV
SVLR R1              Set vector length register  VLR ← R1

The speed of a vector operation depends on the depth of the pipeline in its implementation. The first result appears after some number of cycles (called the startup time). After the pipeline is full, one result is computed per clock cycle. In the meantime, the processor can continue to execute other instructions.
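The VLR convention described above is what a vectorizing compiler's strip-mined loops look like at the source level. The following fragment is a sketch of that strip mining for the 64-element vector registers of V-DLX; the array names and sizes are assumed, and the length computation handles the final short strip.

      ! Strip mine a long loop into chunks of at most 64 iterations,
      ! matching the V-DLX vector register length; each inner loop is a
      ! candidate for a single vector instruction sequence.
      integer, parameter :: n = 1000
      real a(n), b(n)
      integer i, is, len
      do is = 1, n, 64
        len = min(64, n - is + 1)     ! last strip may be shorter (VLR < 64)
        do i = is, is + len - 1
          a(i) = a(i) + b(i)
        end do
      end do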


Table A.3 gives the startup times for the vector operations in V-DLX. These times should not be compared directly to the cycle times for operations on S-DLX because vector machines typically have higher clock speeds than microprocessors, although this gap is closing.

Table A.3. Startup Times in Cycles on V-DLX

Operation       Start-Up Time
...             ...
Vector Load     12

A large factor in the speed of vector architectures is their memory system. V-DLX has eight memory banks. After the load latency, data is supplied at the rate of one word per clock cycle, provided that the stride with which the data is accessed does not cause bank conflicts (see Section 6.5.1).

Current examples of vector machines are the Cray C-90 [Oed 1992] and IBM ES 9000 Model 900 VF [Gibson and Rao 1992].

A.3 Multiprocessors

When multiple processors are employed to execute a program, many additional issues arise. The most obvious issue is how much of the underlying machine architecture to expose to the programmer. At one extreme, the programmer can make explicit use of hardware-supported operations by programming with locks, fork/join primitives, barrier synchronizations, and message send and receive. These operations are typically provided as system calls or library routines. System call semantics are usually not defined by the language, making it difficult for the compiler to optimize them automatically.

High-level languages for large-scale parallel processing provide primitives for expressing parallelism in one of two ways: control parallel or data parallel. Fortran 90 array section expressions are examples of explicitly data-parallel operations. APL [Iverson 1962] also contains a wide variety of data-parallel array operators. Examples of control-parallel operations are cobegin/coend blocks and doacross loops [Cytron 1986].

For both shared- and distributed-memory multiprocessors, we assume that the programming environment initializes the variables Pnum to be the total number of processors and Pid to be the number of the local processor. Processors are numbered starting with 0.

A.3.1 Shared-Memory DLX Multiprocessor

sMX is our prototypical shared-memory parallel architecture. It consists of 16 S-DLX processors and 512 MB of shared main memory. The processors and memory system are connected together by a bus. Each processor has an intelligent cache controller that monitors the bus (a snoopy cache). The caches are the same as on S-DLX, except that they contain 256 KB of data. The bandwidth of the bus is 128 megabytes/second.

The processors share the memory units; a program running on a processor can access any memory element, and the system ensures that the values are maintained consistently across the machine. Without caching, consistency is easy to maintain; every memory reference is handled by the main memory controller. However, performance would be very poor because memory latencies are already too high on sequential machines to run well without a cache; having many processors share a common bus would make the problem much worse by increasing memory access latency and introducing bandwidth restrictions. The solution is to give each processor a cache that is smart enough to resolve reference conflicts.

A snoopy cache implements a sharing protocol that maintains consistency while still minimizing bus traffic. A processor that modifies a cache line invalidates all other copies and is said to own that cache line. There are a variety of cache coherency protocols [Stenstrom 1990; Eggers and Katz 1989], but the details are not relevant to this survey.


From the compiler writer's perspective, the key issue is the time it takes to make a memory reference. Table A.4 summarizes the latency of each kind of memory reference in sMX.

Table A.4. Memory Reference Latency in sMX

Type of Memory Reference                  Cycles
Read value available in local cache       1
Read value owned by other processor       21
Read value nobody owns                    20
Write value owned locally                 1
Write value owned by other processor      18
Write value nobody owns                   22

Typically, shared-memory multiprocessors implement a number of synchronization operations. FORK(n) starts identical copies of the current program running on n processors; because it must copy the stack of the forking processor to a private memory for each of the n-1 other processors, it is a very expensive operation. JOIN() desynchronizes with the previous FORK and makes the processor available for other FORK operations (except for the original forking processor, which proceeds serially). BARRIER() performs a barrier synchronization, which is a synchronization point in the program where each processor waits until all processors have arrived at that point.

A.3.2 Distributed-Memory DLX Multiprocessor

dMX is our hypothetical distributed-memory multiprocessor. The machine consists of 64 S-DLX processors (numbered 0 through 63) connected in an 8 x 8 mesh. The network bandwidth of each link in the mesh is 10 MB per second. Each processor has an associated network processor that manages communication; the network processor has its own pool of memory and can communicate without involving the CPU. Having a separate processor manage the network allows applications to send a message asynchronously and continue executing while the message is sent. Messages that pass through a processor en route to some other destination in the mesh are handled by the network processor without interrupting the CPU.

The latency of a message transfer is ten microseconds plus 1 microsecond per 10 bytes of message (we have simplified the machine by declaring that all remote processors have the same latency). Communication is supported by a message library that provides the following calls:

● SEND(buffer, nbytes, target)
  Send nbytes from buffer to the processor target. If the message fits in the network processor's memory, the call returns after the copy (1 microsecond/10 bytes). If not, the call blocks until the message is sent. Note that this raises the potential for deadlock, but we will not concern ourselves with that in this simplified model.

● BROADCAST(buffer, nbytes)
  Send the message to every other processor in the machine. The communication library uses a fast broadcasting algorithm, so the maximum latency is roughly twice that of sending a message to the furthest edge of the mesh.

● RECEIVE(buffer, nbytes)
  Wait for a message to arrive; when it does, up to nbytes of it will be put in the buffer. The call returns the total number of bytes in the incoming message; if that value is greater than nbytes, RECEIVE guarantees that subsequent calls will return the rest of that message first.
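As a small illustration of this library and of the Pid/Pnum convention introduced above, the following sketch has processor 0 distribute a scalar to all processors and then has each processor exchange one boundary strip with a neighbor; the array names, the strip size, and the even/odd neighbor pairing are assumptions made for illustration only.

      ! Sketch of dMX message-library usage; assumes the environment has
      ! set Pid (this processor) and Pnum (total processors) as described.
      real bound, edge(64), ghost(64)
      integer neighbor

      ! Processor 0 broadcasts a scalar; all other processors receive it.
      if (Pid .eq. 0) then
        call BROADCAST(bound, 4)
      else
        call RECEIVE(bound, 4)
      end if

      ! Pair up even/odd neighbors (assumed numbering) and swap one
      ! 64-word boundary strip in each direction.
      if (mod(Pid, 2) .eq. 0) then
        neighbor = Pid + 1
      else
        neighbor = Pid - 1
      end if
      call SEND(edge, 64*4, neighbor)
      call RECEIVE(ghost, 64*4)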
ACKNOWLEDGMENTS

We thank Michael Burke, Manish Gupta, Monica Lam, and Vivek Sarkar for their assistance with various parts of this survey. We thank John Boyland, James Demmel, Alex Dupuy, John Hauser, James Larus, Dick Muntz, Ken Stanley, the members of the IBM Advanced Development Technology Institute, and the anonymous referees for their valuable comments on earlier drafts.

REFERENCES

ABELSON, H. AND SUSSMAN, G. J. 1985. Structure and Interpretation of Computer Programs. MIT Press, Cambridge, Mass.


ABU-SUFAH, W 1979 Improving the performance of Reduction of operator strength, In Program virtual memory computers. Ph D., thesis, Tech Flow Analysis: Theory and ApphcatLons, S. S. Rep 78-945, Univ of Illinois at Urbana- Muchnik and N. D. Jones, Eds., Prentice-Hall, Champaign Englewood Cliffs, N, J,, 79-101, ABU-SUFAEL W., Kuzw, D ,J., AND LAWRIE, D 1981 ALLEN, J R., KENNEDY, K., PORTERFIELD, C., AND On the performance enhancement of paging WARREN, J 1983 Conversion of control depen- systems through program analysis and trans- dence to data dependence In Conference Record formations IEEE Trans Comput C-30, 5 of the 10th ACM Symposium on Principles of (May), 341-356 Programming Languages (Austin, Tex., Jan,). ACM Press, New York, 177-189. ACKRRiWMJ, W B. 1982. Data flow languages Com- puter 15, 2 (Feb ), 15-25 Z%PERT. D AND AVNON, D 1993 Architecture of the Pentium microprocessor IEEE Micro 13, 3 AHo, A. V,, JOHNSON, S C.. AND ULLMAN, J D 1977 (June), 11-21 Code generation for expressions with common subexpressions J ACM 24, 1 (Jan), 146–160 AMERICAN NATIONAI, STANDARDS INSTITUTE 1987 An American National standard, IEEE standard AHo, A V., SETHI, R, AND ULLMAN, J D. 1986 for binary floating-point arithmetic SIGPLAN Compder~. Pnnccples, Techruqucs, and Tools Not 22, 2 (Feb.), 9-25 Addmon-Wesley, Reading, Mass ANDERSON, J. M. AND LAM, M. S 1993 Global opti- AIKDN, A. AND NICOLAU, A. 1988a Optimal loop mlzatlons for parallelism and locahty on scal- parallehzatlon. In Proceechngs of the SIGPLAN able parallel machmes. In Proceedz ngs of Conference on Progrczmmlng Language Desgn the SIGPLAN Conference on Programming and Implementutz on (Atlanta, Ga , June) SIG- Language Design and Implementation (Al- PLAN Not 23, 7 (July), 308-317 buquerque, New Mexico, June) SIGPL4N Not. AIKEN, A. AND NICOLAU. A, 1988b. Perfect pipehn- 28, 6, 112-125 mg: A new loop parallelization technique. In ANDERSON, S. AND HUDAK, P 1990 Compilation of Proceedings of the 2nd European Symposrum Haskell array comprehensions for scientific on Programmmg Lecture Notes in Computer computing. In Proceedings of the SIGPLAN Science, vol 300 Springer-Verlag, Berlin, Conference on Programming Language Design 221-235. and Implementation (White Plains, N.Y , June). ALLEN, J, R, 1983. Dependence analysis for sub- SIGPLAN Not 25, 6, 137-149 scripted variables and Its apphcatlon to pro- APPEL, A. W. 1992. Compdmg wLth Contmuatzons. gram transformations. Ph.D. thesis, Computer Cambridge Umverslty Press, Cambridge, Eng- Science Dept., Rice Umverslty, Houston, Tex land. ALLEN, F. E, 1969. Program optlmlzatlon. In An- ARDEN, B. W., GALLER, B, A,, AND GRAHAM, R, M, nual Rev~ew m Automatw Programmmg 5. In- 1962. An algorithm for translating Boolean ex- ternational Tracts in Computer Science and pressions, J, ACM 9, 2 (Apr,), 222-239, Technology and then- Apphcatlon. vol. 13 Perg- AKWND AND CULLER, D. E. 1986. Dataflow architec- amon Press, Oxford, England, 239–307 tures In Annual Reznew of Computer Sctence. ALLEN, F. E. AND COCKE, J. 1971. A catalogue of Vol. 1. J. F, Traub, B. J. Grosz, B, W. Lampson, optimizing transformations. In Design and 0p- and N, J. Nilsson, Eds, Annual Rewews, Palo tmawatbon of Compzlers, R Rustm, Ed Pren- Mto, Cahf, 225-253 tice-Hall, Englewood Chffs, N. J., 1–30. ,&VZND, KAT13AIL, V , AND PINGALI, K. 1980. A ALLEN, J. R. AND KENNEDY, K. 1987, Automatic dataflow architecture with tagged tokens. Tech. translation of Fortran programs to vector form. Rep TM-174, Laboratory for Computer Sci- ACM Trans. Program Lang. Syst. 
9, 4 (Oct.). ence, Massachusetts Institute of Technology, 491-542, Cambridge, Mass. ALLEN, J. R. AND KENNEDY, K. 1984, Automatic loop BACON, D. F., CHOW, J.-H., Ju, D R, MUTHUKUMAR, interchange In Proceedings of the SIGPLAN K , AND SARKAR, V. 1994. A compiler framework Symposium on Compiler Construction for restructuring data declarations to enhance (Montreal, Quebec, June). SIGPJ5AN Not. 19, cache and TLB effectiveness. In Proceedings of 6, 233-246. CASCON ’94 (Toronto, Ontario, Oct.). ALLEN, F, E., BURKE, M,, CHARLES, P., CYTRON, R., BAILEY, D. H, 1992, On the behawor of cache memo- AND FERRANTE, J. 1988a. An overwew of the ries with stnded data access, Tech. Rep. RNR- PTRAN analysis system for multiprocessing, J. 92-015, NASA Ames Research Center, Moffett Parall Distrib. Comput. 5, 5 (Oct.), 617-640. Field, Calif., (May), ALLEN, F. E,, BURKE, M., CYTRON, R., FERKANTE, J., BAILEY, D. H., BARSZCZ, E., BARTON, J. T., HSIEH, W., AND SARKAR, V. 1988b. A framework BROWNING, D. S., CARTER, R. L., DAGUM, L,, for determining useful parallelism. In Proceed- FATOOHI, R. A., FR~DERICKSON, P. O., LASINSKZ, ings of the ACM International Conference on T. A., SCHREIBER, R. S., SIMON, H. D., Supercomputu-zg (St. Malo, France. July). ACM VENKATAKRISHNAN, V., AND WEERATUNGA, S. K. Press, New York, 207–215. 1991. The NAS parallel benchmarks. Int. J. ALLEN, F. E , COCKE, J., AND KENNEDY, K. 1981, Supercomput. Appl, 5, 3 (Fall), 63-73,


BALASUNDARAM, V. 1990. A mechanism for keeping BLELLOCH, G, 1989, Scans as primitive parallel op- useful internal information in parallel pro- erations. IEEE Trans. Comput. C-38, 11 (Nov.), gramming tools: The Data Access Descriptor, 1526-1538. J. Parall. DtstrLb. Comput. 9, 2 (June), BROMLEY, M., HELLER, S., MCNERNEY, T., AND 154-170. STEELE, G. L., JR. 1991. Fortran at ten gi BALASUNDARAM, V., Fox, G., KENNEDY, K., AND gaflops: The Connection Machine convolution KREMER, U. 1990. An interactive environment compiler. In Proceedings of the SIGPLAN Con- for data partitioning and distribution. In Pro- ference on Programmmg Language Destgn and ceedings of the 5th Dlstnbuted Memory Corn- Implementation (Toronto, Ontano, June). SIG- puter Conference (Charleston, South Carolina, PLAN Not. 26, 6, 145-156. Apr.). IEEE Computer Society Press, Los BURKE, M. AND CYTRON, R. 1986. Interprocedural Alamitos, Calif. dependence analysis and parallelization. In BALASUNDARAM, V., KENNEDY, K., KREMER, U., Proceedings of the SIGPLAN Symposium on MCKINLEY, K., AND SUBHLOK, J. 1989. The Compiler Construction (Palo Alto, Calif., June). ParaScope editor: An interactive parallel pro- SZGPLANNot. 21, 7 (July), 162-175. Extended gramming tool. In Proceedings of Superconz- version available as IBM Thomas J. Watson putmg ’89 (Reno, Nev., Nov.). ACM Press, New Research Center Tech. Rep. RC 11794. York, 540-550. BURNETT, G. J. AND COFFMAN, E. G., JR. 1970. A BALL, J. E. 1979. Predicting the effects of optimiza- study of interleaved memory systems. In Pro- tion on a procedure body. In Proceedings of the ceedings of the Spring Joint AFIPS Computer SIGPLAN Symposium on Compiler Construc- Conference. vol. 36. AFIPS, Montvale, N.J., tion (Denver, Color., Aug.). SIGPLAN Not. 14, 467-474. 8, 214-220. BURSTALL, R. M. AND DARLINGTON, J. 1977. A trans- formation system for developing recursive pro- BANERJEE, U. 1991. Unimodular transformations of grams. J. ACM 24, 1 (Jan.), 44-67. double loops. In Aduances in Languages and Compilers for Parallel Processing, A. Nicolau, BUTLER) M., YEH, T., PATT, Y., ALsuP, M., SCALES, Ed. Research Monographs in Parallel and Dis- H., AND SHEBANOW, M. 1991, Single instruction tributed Computing, MIT Press, Cambridge, stream parallelism is greater than two. In Pro- Mass., Chap. 10. ceedings of the 18th Annual International Sym- posium on Computer Architecture (Toronto, BANERJEE, U. 1988a. Dependence Analysis for Su- Ontario, May). SIGARCH Comput. Arch. Neuw percomputing. Kluwer Academic Publishers, 19, 3, 276-286. Boston, Mass. CALLAHAN, D. AND KENNEDY, K. 1988a. Analysis of BANEKJEE, U. 1988b. An introduction to a formal interprocedural side-effects in a parallel pro- theory of dependence analysis. J. Supercom- gramming environment J. Parall. Distrib put. 2, 2 (Oct.), 133-149. Cornput. 5, 5 (Oct.), 517-550. BANERJEE, U. 1979. Speedup of ordinary programs, CALLAHAN, D. AND KENNEDY, K. 1988b. Compihng Ph.D. thesis, Tech. Rep. 79-989, Computer Sci- programs for distributed-memory multiproces- ence Dept., Univ. of Illinois at Urbana- sors. J. Supercompzd. 2, 2 (Oct.), 151–169. Champaign. CALLAHAN, D., CARR, S., AND KENNEDY, K. 1990. BANERJEE, U., CHEN, S. C., KUCK, D. J., AND TOWLE, Improving register allocation for subscripted R. A. 1979. Time and parallel processor bounds variables. In Proceedings of the SIGPLAN Con- for Fortran-like loops. IEEE Trans. Comput. ference on Programming Language Design and C-28, 9 (Sept.), 660-670. Implementation (White Plains, N.Y., June). BANNING, J. P. 1979. 
An efficient way to find the SIGPLAN Not. 25, 6, 53-65. side-effects of procedure calls and the aliases of CALLAHAN,D , COCKE, J., AND KENNEDY, K. 1988. variables. In Conference Record of the 6th ACM Estimating interlock and improving balance for Symposium on Principles of Programming Lan - pipelined architectures J. Parall. Distrib. guages (San Antonio, Tex., Jan.). ACM Press, Comput. 5, 4 (Aug.), 334-358. 29-41. CALLAHAN, D., COOPER, K., KENNEDY, K., AND BERNSTEIN, R. 1986. Multiplication by integer con- TORCZON, L. 1986. Interprocedural constant stants. Softw. Pratt. Exper. 16, 7 (July), propagation. In Proceedings of the SIGPLAN 641-652. Sympos~um on CompJer Construction (Palo Alto, Cahf., June). SIGPLAN Not. 21, 7 (July), BERRY, M., CHEN, D., Koss, P., KUCK, D., Lo, S., 152-161. PANG, Y., POINTER, L., ROLOFF, R., SAMEH, A., CLEMENTI, E , CHIN. S , SCHNEI~ER, D , Fox, G , CARR, S. 1993. Memory-hierarchy management. ME SSINA, P., WALKER, D., HSIUNG, C., Ph.D. thesis, R]ce Umversity, Houston, Tex. SCHWARZMEIER, J., LUE, K., ORSZAG, S., S~IDL, CHAITIN, G. J. 1982. Register allocation and spilling F., JOHNSON, O., GOODRUM, R., AND MARTIN, J. via graph coloring. In Proceedings of the SIG- 1989. The Perfect Club benchmarks: Effective PLAN Symposium on Compiler Construction performance evaluation of supercomputers. Znt. (Boston, Mass., June). SIGPLAN Not 17, 6, J. Supercomput. Appl. 3, 3 (Fall), 5-40. 98-105.


CFIMTIN, G. J,, AUSLANDER, M. A., CHANDRA, A. K., 221-228. Also available as IBM Tech Rep. RC COCKE, J., HOPKINS, M. E., AND MARKSTEIN, 8111, Feb. 1980. P. W. 1981. Register allocation via coloring. COCKE, J. AND SCHWARTZ, J. T. 1970. Programming Comput. Lang. 6, 1 (Jan.), 47-57. languages and their compders (preliminary CHAMBERS, C. .MXD UNGAR, D. 1989. Customization: notes). 2nd ed. Courant Inst. of Mathemalzcal Optimizing compiler technology for SELF, a Sciences, New York Umverslty, New York. dynamically-typed object-oriented program- COLWELL, R. P,, NIX, R, P,, ODONNEL, J. J., ming language. In proceedings of the SIG- PAPWORTH, D. B., AND RODMAN, P, K. 1988. A PLAN Conference on Programming Language VLIW architecture for a trace scheduling com- Design and Implementation (Portland, Ore., piler. IEEE Trans. Comput. C-37, 8 (Aug.), June). SIGPLAN Not. 24, 7 (July), 146-160. 967-979. CEATTERJEE, S., GILBERT, J. R., AND SCHREIBER, R. COOPER, K, D, AND KENNEDY, K, 1989, Fast inter- 1993a. The alignment-distribution graph. In procedural ahas analysie, In Conference Record Proceedings of the 6th International Workshop of the 16th ACM Symposmm on Prmc~ples of on Languages and Compilers for Parallel Com- Programmmg Languages (Austin, Tex., Jan.). puting. Lecture Notes in Computer Science, ACM Press, New York, 49-59. vol. 768. 234–252. COOPER, K, D., HALL, M. W,, AND KENNEDY, K. 1993. CHATTERJEE, S., GILBERT, J. R., SCHREIBER, R., AND A methodology for procedure cloning. Comput. TENG, S.-H. 1993b. Automatic array alignment Lang. 19, 2 (Apr.), 105-117. in data-parallel programs. In Conference Record COOPER, K. D., HALL, M. W., AND TORCZON, L. 1992. of the 20th ACM Sympo.mum on Prmczples of Unexpected side effects of inline substitution Programmmg Languages (Charleston, S. Car- A case study. ACM Lett. Program. Lang. Syst olina, Jan.). ACM Press, New York, 16–28. 1, 1 (Mar.), 22-32. CHEN, S. C. AND KUCK, D. J. 1975. Time and paral- COOPER, K D., HALL, M. W., AND TORCZON, L 1991. lel processor bounds for linear recurrence sys- An experiment with inline substitution. Softw tems. IEEE Trans. Comput. C-24 7 (July), Pratt. Exper. 21, 6 (June), 581-601 701-717. CRAY RESEARCH 1988 CFT77 Reference Manual. CHOI, J.-D., BURKE, M., AND CARINI, P. 1993. Effi- Publication SR-0018-C. Cray Research, Inc.. cient flow-sensitive interprocedural computa- Eagan, Minn. tion of pointer-induced aliases and side effects. CYTRON, R. 1986. Doacross: Beyond vectorizatlon In Conference Record of the 20th ACM Sympo- for multiprocessors. In Proceedings of the Inter- sium on Principles of Programming Languages national Conference on Parallel Processing (St. (Charleston, S. Carolina, Jan.). ACM Press, Charles, Ill., Aug.). IEEE Computer Society, New York, 232-245. Washington, D. C., 836-844. CHOW, F C 1988. Minimizing register usage CYTRON, R. AND FERRANTE, J. 1987. What’s in a penalty at procedure calls. In Proceedings of name? -or- the value of renaming for paral- the SIGPLAN Conference on Programming lehsm detection and storage allocation. Pro- Language Design and Implementation (Atlanta, ceedings of the Internat~onal Conference on Ga., June). SIGPLAN Not. 23, 7 (July), 85-94. Parallel Procewng (University Park, Penn., CHOW, F. C. AND HENNESSEY, J. L. 1990. The prior- Aug.), Pennsylvania State University Press, ity-based coloring approach to register alloca- University Park. Pa., 19-27. tion. ACM Trans. Program. Lang. Syst. 12, 4 DANTZIG, G. B. AND EAVES, B. C. 1974. Fourier- (Oct.), 501-536. 
Motzkin ehmination and sits dual with applica- CLARKR, C. D. AND PEYTON-JONES, S. L. 1985. Strict- tion to integer programming. In Combinatorial ness analysis—a practical approach. In Func- Programming: Methods and Applications tional Programming Languages and Computer (Versailles, France, Sept.). B. D. Roy, Ed. D. Architecture. Lecture Notes in Computer Sci- Reidel, Boston, Mass., 93-102.

ence, vol. 201. Springer-Verlag, Berlin, 35–49. DENNIS, J B 1980 Data flow supercomputers CLINGER, W. D. 1990 How to read floating point Computer 13, 11 (Nov.), 48-56. numbers accurately. In proceedings of the SIG- DIXIT, K. M. 1992. New CPU benchmarks from PLAN Conference on Programmmg Language SPEC. In Digest of Papers, Spring COMPCON Deslogn and Implementation (White Plains, 1992, 37th IEEE Computer Society Interna- NY., June). SIGPLAN Not. 25, 6, 92-101. tional Conference (San Francisco, Calif., Feb.). COCKE, J. 1970. Global common subexpression elim- IEEE Computer Society Press, Los Alamitos, ination. In proceedings of the ACM Symposmm Calif., 305-310 on Compder OptLmizatzon (July). SIGPLAN DONGARRA, J. AND HIND, A. R. 1979. Unrolling loops Not. 5, 7, 20–24. m Fortran. Softzo, Pratt. Exper. 9, 3 (Mar.), 219-226. COCKR, J. AND MARKSTEIN, P. 1980. Measurement of program improvement algorithms. In Proceed- EGGERS, S. J. AND KATZ, R. H. 1989 Evaluating the ings of the IFIP Congress (Tokyo, Japan, Oct.). performance of four snooping cache coherency North-Holland, Amsterdam, Netherlands, protocols. In proceedings of the 16th Annual


Received November 1993; final revision accepted October 1994