
Interprocedural Transformations for Parallel Code Generation

Mary W. Hall    Ken Kennedy    Kathryn S. McKinley

Department of Computer Science, Rice University, Houston, TX

Abstract

We present a new approach that enables optimization of procedure calls and loop nests containing procedure calls. We introduce two interprocedural transformations that move loops across procedure boundaries, exposing them to traditional optimizations on loop nests. These transformations are incorporated into a code generation algorithm for a shared-memory multiprocessor. The code generator relies on a machine model to estimate the expected benefits of loop parallelization and parallelism-enhancing transformations. Several transformation strategies are explored, and one that minimizes total execution time is selected. Efficient support of this strategy is provided by an existing interprocedural compilation system. We demonstrate the potential of these techniques by applying this code generation strategy to two scientific application programs.

Introduction

Modern computer architectures such as pipelined, superscalar, VLIW and multiprocessor machines demand sophisticated compilers to exploit their performance potential. To expose parallelism and computation for these architectures, the compiler must consider a statement in light of its surrounding context. Loops provide a proven source of both context and parallelism. Loops with significant amounts of computation are prime candidates for compilers seeking to make effective utilization of the available resources. Given that increased modularity is encouraged to manage program computation and complexity, it is natural to expect that programs will contain many procedure calls, and procedure calls in loops, and the ambitious compiler will want to optimize them.

Unfortunately, most conventional compiling systems abandon parallelizing optimizations on loops containing procedure calls. Two existing compilation technologies are used to overcome this problem: interprocedural analysis and interprocedural transformation.

Interprocedural analysis applies dataflow analysis techniques across procedure boundaries to enhance the effectiveness of dependence testing. A sophisticated form of interprocedural analysis called regular section analysis makes it possible to parallelize loops with calls by determining whether the side effects to arrays as a result of each call are limited to nonintersecting subarrays on different loop iterations.

Interprocedural transformation is the process of moving code across procedure boundaries, either as an optimization or to enable other optimizations. The most common form of interprocedural transformation is procedure inlining. Inlining substitutes the body of a called procedure for the procedure call and optimizes it as a part of the calling procedure.

Even though regular section analysis and inlining are frequently successful, each of these methods has its limitations. Compilation time and space considerations require that regular section analysis summarize array side effects. In general, summary analysis for loop parallelization is less precise than the analysis of inlined code. On the other hand, inlining can yield an explosion in code size, which may disastrously increase compile time and seriously inhibit separate compilation. Furthermore, inlining may cause a loss of precision in dependence analysis, due to the complexity of subscripts that result from array parameter reshapes. For example, when the dimension size of a formal array parameter is also passed as a parameter, translating references of the formal to the actual can introduce multiplications of unknown symbolic values into subscript expressions. This situation occurs when inlining is used on the SPEC benchmark program matrix300.

In this paper, a hybrid approach is developed that overcomes some of these limitations. We introduce a pair of new interprocedural transformations: loop embedding, which pushes a loop header into a procedure called within the loop, and loop extraction, which extracts the outermost loop from a procedure body into the calling procedure. These transformations expose such loops to intraprocedural optimizations. In this paper, the intraprocedural optimizations considered are loop fusion, loop interchange and loop distribution. However, many other transformations that require loop nests will also benefit from embedding and extraction. Some examples are loop skewing and memory hierarchy optimizations such as unroll-and-jam.

As a motivating example, consider the Fortran code in Example 1(a). The J loop in S may safely be made parallel, but the outer I loop in subroutine P may not be. However, the amount of computation in the J loop is small relative to the I loop and may not be sufficient to make parallelization profitable. If the I loop is embedded into subroutine S, as shown in (b), the inner and outer loops may be interchanged, as shown in (c). The resulting parallel outer J loop now contains plenty of computation. As an added benefit, procedure call overhead has been reduced.

      SUBROUTINE P              SUBROUTINE P             SUBROUTINE P
      REAL A(N,N)               REAL A(N,N)              REAL A(N,N)
      INTEGER I
      DO I = 1, N
        CALL S(A, I)            CALL S(A)                CALL S(A)
      ENDDO

      SUBROUTINE S(F, I)        SUBROUTINE S(F)          SUBROUTINE S(F)
      REAL F(N,N)               REAL F(N,N)              REAL F(N,N)
      INTEGER I, J              INTEGER I, J             INTEGER I, J
      DO J = 1, N               DO I = 1, N              PARDO J = 1, N
        F(J,I) = F(J,I)           DO J = 1, N              DO I = 1, N
      ENDDO                         F(J,I) = F(J,I)          F(J,I) = F(J,I)
                                  ENDDO                    ENDDO
                                ENDDO                    ENDPARDO

  (a) before transformation   (b) loop embedding       (c) loop interchange

                                 Example 1

Loop embedding and loop extraction provide many of the optimization opportunities of inlining without its significant costs. Code growth of individual procedures is nominal, so compilation time is not seriously affected. Overall program growth is also moderate, because multiple callers may invoke the same optimized procedure body. In addition, the compilation dependences among procedures are reduced, since the compiler controls the small amount of code movement across procedures and can easily determine if an editing change of one procedure invalidates other procedures.

Our approach to interprocedural optimization is fundamentally different from previous research in that the application of interprocedural transformations is restricted to cases where it is determined to be profitable. This strategy, called goal-directed interprocedural optimization, avoids the costs of interprocedural optimization when it is not necessary. Interprocedural transformations are applied as dictated by a code generation algorithm that explores possible transformations, selecting a choice that minimizes total execution time. Estimates of execution time are provided by a machine model which takes into account the overhead of parallelization. The code generator is part of an interprocedural compilation system that efficiently supports interprocedural analysis and optimization by retaining separate compilation of procedures.

The remainder of this paper is organized into five major sections, related work and conclusions. The next section provides the technical background for the rest of the paper. Following it, a compilation system is described which is powerful enough to support interprocedural optimization but also retains the advantages of a separate compilation system. The two subsequent sections explain the interprocedural and intraprocedural transformations in more detail, and present a code generation algorithm that uses them to parallelize programs for a shared-memory multiprocessor. The final section describes an experiment where this approach was applied to the Perfect Benchmark programs spec77 and ocean.

Technical Background

Dependence Analysis

Dependence analysis and testing have been widely researched, and in this paper a working knowledge of these is assumed. In particular, the reader should be familiar with dependence graphs, where dependence edges are characterized with such information as dependence type and hybrid direction/distance vectors. The dependence graph specifies a conservative approximation of the partial order of memory accesses necessary to preserve the semantics of a program. The safe application of program transformations is based on preserving this partial order.

Augmented Call Graph

The program representation for interprocedural transformations requires an augmented call graph to describe the calling relationship among procedures and specify loop nests. The code generation algorithm considers loops containing procedure calls and loops adjacent to procedure calls. For this purpose, the program's call graph, which contains the usual procedure nodes and call edges, is augmented to include special loop nodes and nesting edges. If a procedure p contains a loop l, there will be a nesting edge from the procedure node representing p to the loop node representing l. If a loop l contains a call to a procedure p, there will be a nesting edge from l to p. Any inner loops are also represented by loop nodes and are children of their outer loop. The outermost loop of each routine is marked enclosing if all the other statements in the procedure fall inside the loop. Figure 1(a) shows the augmented call graph for the program from Example 1.

Regular Section Analysis

A regular section describes the side effects to the substructures of an array. Sections represent a restricted set of the most commonly occurring array access patterns: single elements, rows, columns, grids and their higher-dimensional analogs. This restriction on the shapes assists in making the implementation efficient.

The representation of the dimensions of a particular array variable may take one of three forms: an invocation-invariant expression, representing a single element; a range, consisting of a lower bound, an upper bound and a step size; or a special element signifying that all of this dimension may be affected. Sections are separated into modified and referenced sets. The sections for Example 1 are shown in Figure 1(b).

By using sections, the problem of locating dependences on procedure calls is simplified to the problem of finding dependences on ordinary statements. The modified and referenced subsections for the call appear to the dependence analyzer like the left- and right-hand sides of an assignment, respectively. For single-element subsections, dependence testing is the same as it would be for any other variable access. For subsections that contain one or more dimensions with ranges, the dependence analyzer simulates do loops for each of the range dimensions, with the lower bound, upper bound and step size of the loop corresponding to those of the range. Sections are necessarily an approximation of actual accesses. To assist conservative dependence testing, they are marked exact or inexact to indicate whether they are an approximation.

Regular sections enable dependence analysis to determine if loops containing calls are parallel. Sections are also currently used to determine the safety of intraprocedural transformations on a loop nest containing calls. In this paper, sections are extended to enable the code generator to determine the safety of interprocedural transformations. We introduce an annotation to a section, called a slice. Slices resemble data access descriptors, but they are not as detailed. A slice identifies the section of an array accessed and the order of that access in terms of a particular loop's index expression. Symbolic slices are stored only for the outermost loop of a procedure. They are also marked as exact or inexact. Figure 1(c) illustrates the slice annotations for the program in Example 1.

[Figure 1: (a) the augmented call graph, (b) the modified (Mod) and referenced (Ref) sections, and (c) the slices for the program in Example 1.]

Support for Interprocedural Optimization

In this section, we present the compilation system of the ParaScope Programming Environment. This system was designed for the efficient support of interprocedural analysis and optimization. The tools in ParaScope cooperate to enable the compilation system to perform interprocedural analysis without direct examination of source code. This information is then used in code generation to make decisions about interprocedural optimizations. The code generator only examines the dependence graph for the procedure currently being compiled, not the graph for the entire program. In addition, ParaScope employs recompilation analysis after program changes to minimize program reanalysis.

The ParaScope Compilation System

Interprocedural analysis in the ParaScope compilation system consists of two principal phases. The first takes place prior to compilation. At the end of each editing session, the immediate interprocedural effects of a procedure are determined and stored. For example, this information includes the array sections that are locally modified and referenced in the procedure. The procedure's calling interface is also determined in this phase. It includes descriptions of the calls and loops in the procedure and their relative positions. In this way, the information needed from each module of source code is available at all times and need not be derived on every compilation.

Interprocedural optimization is orchestrated by the program compiler, a tool that manages and provides information about the whole program. The program compiler begins by building the augmented call graph described above. The program compiler then traverses the augmented call graph, performing interprocedural analysis and subsequently code generation. Conceptually, program compilation consists of three principal phases: interprocedural analysis, dependence analysis, and planning and code generation.

Interprocedural analysis. The program compiler calculates interprocedural information over the augmented call graph. First, the information collected during editing is recovered from the database and associated with the appropriate nodes and edges in the call graph. This information is then propagated in a top-down or bottom-up pass over the nodes in the call graph, depending on the interprocedural problem. Section analysis is performed at this time. Interprocedural constant propagation and symbolic analysis are also performed, as these greatly increase the precision of subsequent dependence analysis.

Dependence analysis. Interprocedural information is then made available to dependence analysis, which is performed separately for each procedure. Dependence analysis results in a dependence graph. Edges in the dependence graph connect statements that form the source and sink of a dependence.
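The call-site scheme described earlier, in which modified sections act like the left-hand side of an assignment and referenced sections like the right-hand side, can be sketched as follows. The dictionary shapes and the overlap test are illustrative stand-ins, not ParaScope's interface.

```python
def may_overlap(sec1, sec2):
    """Stand-in for a real section intersection test: conservatively assume
    two sections of the same array may overlap."""
    return sec1["array"] == sec2["array"]

def dependences(stmts):
    """stmts: list of dicts with 'mod' and 'ref' section lists, in program order.
    A call site contributes its modified sections as pseudo-writes and its
    referenced sections as pseudo-reads. Returns (kind, i, j) triples."""
    deps = []
    for i, s1 in enumerate(stmts):
        for j in range(i + 1, len(stmts)):
            s2 = stmts[j]
            for kind, writes, reads in (("flow", s1["mod"], s2["ref"]),
                                        ("anti", s1["ref"], s2["mod"]),
                                        ("output", s1["mod"], s2["mod"])):
                if any(may_overlap(x, y) for x in writes for y in reads):
                    deps.append((kind, i, j))
    return deps

call = {"mod": [{"array": "A"}], "ref": [{"array": "A"}]}   # e.g. CALL S(A, I)
write = {"mod": [{"array": "A"}], "ref": []}                # e.g. an assignment to A
print(dependences([call, write]))   # → [('anti', 0, 1), ('output', 0, 1)]
```

A precise implementation would intersect the actual section descriptors, as in the earlier sketch, rather than comparing array names.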

If the source or sink of a dependence is a call site, a section annotates it. The section may more accurately describe the portion of the array involved in the dependence. Dependence analysis also distinguishes parallel loops in the augmented call graph. Dependence analysis is separated from code generation for an important reason: it provides the code generator knowledge about each procedure without reexamining its source or dependence graph.

Planning and code generation. The final phase of the program compiler determines where interprocedural optimization is profitable. When more than one option for interprocedural transformation exists, it selects the most profitable option. Planning is important to interprocedural optimization, since unnecessary optimizations may lead to significant compile-time costs without any execution-time benefit. Determining the profitability of transformations requires a machine model. To determine the safety of transformations, the dependence graph and sections are sufficient. Once profitable transformations are located, they are applied and parallelism is introduced in the transformed program.

The relationship among the compilation phases is depicted in Figure 2. Each step adds annotations to the call graph that are used by the next phase. Following program transformation, each procedure is separately compiled. Interprocedural information for a procedure is provided to the compiler to enhance intraprocedural optimization.

[Figure 2: Flow of information for interprocedural transformations. RSD analysis produces RSDs and slices; dependence analysis produces dependence graphs and marked parallel loops; these annotations on the augmented call graph feed code generation.]

Recompilation Analysis

A unique part of the ParaScope compilation system is its recompilation analysis, which avoids unnecessary recompilation after editing changes to the program. Recompilation analysis tests that interprocedural facts used to optimize a procedure have not been invalidated by editing changes. To extend recompilation analysis for interprocedural transformations, a few additions are needed. When an interprocedural transformation is performed, a description of the interprocedural transformations annotates the nodes and edges in the augmented call graph. On subsequent compilations, this information indicates to the program compiler that the same tests used initially to determine the safety of the transformations should be reapplied.

To determine if interprocedural transformations are still safe, the new and old sections are first compared, in most cases avoiding examination of the dependence graph. This means that dependence analysis is only applied to procedures where it is no longer valid, allowing separate compilation to be preserved. The recompilation process after interprocedural transformations have been applied is described in more detail elsewhere.

Interprocedural Transformation

We introduce two new interprocedural transformations: loop extraction and loop embedding. These expose the loop structure to optimization without incurring the costs of inlining. The movement of a single loop header is detailed below. Moving additional statements that precede or are enclosed by a loop is a straightforward generalization of these two transformations and, for simplicity, is not described. This section also describes the additional information needed to perform the applicability and safety tests for loop fusion and loop interchange across call boundaries. All of these are used in our code generation algorithm. The code generation algorithm also uses loop distribution, but does not apply it across call boundaries; therefore, it may be performed with no additional information. Loop distribution is discussed in detail later in the paper.

Loop Extraction

Loop extraction moves an enclosing loop of a procedure p outward into one of its callers. This optimization may be thought of as partial inlining. The new version of p no longer contains the loop. The caller now contains a new loop header surrounding the call to p. The index variable of the loop, originally a local in p, becomes a formal parameter and is passed at the call. The calling procedure creates a new variable to serve as the loop index, avoiding name conflicts. It is always safe to extract an outer enclosing loop from a procedure.

Example 2(a) contains a loop with two calls to procedure S, and (b) contains the result after loop extraction. Note that (b) has an additional variable declaration for the loop index J in P. It is included in the actual parameter list for S. In this example, the J loops may now be fused and interchanged to improve performance.

Loop Embedding

Loop embedding moves a loop that contains a procedure call into the called procedure, and is the dual of loop extraction. The new version of the called procedure requires a new local variable for the loop's index variable. If a name conflict exists, a new name for the loop's index variable must be created. This transformation is illustrated in Example 1.
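The mechanics of embedding can be sketched on a toy intermediate representation; the node shapes below are invented for illustration and are not ParaScope's implementation. Embedding moves a caller's loop header into the callee, leaving a bare call behind; extraction is the inverse motion.

```python
# Toy IR: a procedure is a dict with a body of nodes.
# A loop node is ("loop", index_var, body); a call node is ("call", proc_name, actuals).

def embed_loop(caller_body, callee):
    """Move a caller loop whose body is a single call into the called procedure."""
    new_body = []
    for node in caller_body:
        if node[0] == "loop" and len(node[2]) == 1 and node[2][0][0] == "call":
            kind, index, (call,) = node
            # The callee gains the loop header around its old body; a real
            # implementation would also rename the index on a name conflict.
            callee["body"] = [("loop", index, callee["body"])]
            new_body.append(call)          # the bare call replaces the loop
        else:
            new_body.append(node)
    return new_body

callee = {"name": "S", "formals": ["F"], "body": [("stmt", "F(J,I) = F(J,I)")]}
caller = [("loop", "I", [("call", "S", ["A"])])]
print(embed_loop(caller, callee))   # → [('call', 'S', ['A'])]
print(callee["body"])               # → [('loop', 'I', [('stmt', 'F(J,I) = F(J,I)')])]
```

This mirrors the movement of the I loop from P into S in Example 1; the complications below (index variables appearing in actual parameters, reshaped arrays) are what a full implementation must add.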

      SUBROUTINE P(A)             SUBROUTINE P(A)
      REAL A(N,N), B(N,N)         REAL A(N,N), B(N,N)
      INTEGER I                   INTEGER I, J
      DO I = 1, N                 DO I = 1, N
        CALL S(A, I)                DO J = 1, N
        CALL S(B, I)                  CALL S(A, I, J)
      ENDDO                         ENDDO
                                    DO J = 1, N
                                      CALL S(B, I, J)
                                    ENDDO
                                  ENDDO

      SUBROUTINE S(F, I)          SUBROUTINE S(F, I, J)
      REAL F(N,N)                 REAL F(N,N)
      INTEGER I, J                INTEGER I, J
      DO J = 1, N
        F(J,I) = F(J,I)           F(J,I) = F(J,I)
      ENDDO

  (a) before transformation     (b) loop extraction

                  Example 2

If the index variable of the loop to be embedded appears in an actual parameter in the call, this parameter is no longer correctly defined. To remedy this problem, the formals that depend on it must be assigned and computed in the newly embedded loop. In the simplest case, an index variable i is passed to a formal f. Here, f should be assigned i on every iteration of the embedded loop, prior to the rest of the loop body. If an actual is an array reference whose subscript expression contains the loop index variable, the actual passed at the call becomes simply the array name. In the called procedure, the original subscript expression for each dimension of the actual is added to the subscript expression for the corresponding dimension of the formal at each reference to the formal. If the array parameter is reshaped across the call, this translation is more complicated. The array formal is replaced by a new array with the same shape as the actual. The references to the variable are translated by linearizing the formal's subscript expressions and then converting to the dimensions of the new array. Finally, the subscript expressions for each dimension of the actual are added to those for the translated reference. This method is also the one used in inlining.

Procedure Cloning

Procedures optimized with embedding or extraction may have multiple callers, and an optimization valid for one caller may not be valid for another. To avoid significant code growth, multiple callers should share the same version of the optimized procedure whenever possible. This technique of generating multiple copies of a procedure and tailoring the copies to their calling environments is called procedure cloning.

Dependence Updates

Because our code generator only applies loop extraction and loop embedding after safety and profitability are ensured, an update of local dependence information is not necessary. However, if further optimization is desired, updating the dependence information is straightforward.

Loop Fusion

Loop fusion places the bodies of two adjacent loops with the same number of iterations into a single loop. When several procedure calls appear contiguously, or loops and calls are adjacent, it may be possible to extract the outer loop from the called procedures, exposing loops for fusion and further optimization. In the algorithm checkFusion, we consider fusion for an ordered set S = {s1, ..., sp}, where si is either a call or a loop. There cannot be any intervening statements between si and si+1, and each call must contain an enclosing loop which is being considered for fusion.

Fusion is safe for two loops l1 and l2 if it does not result in values flowing from the statements in l2 back into the statements in l1 in the resultant loop, and vice versa. The simple test for safety performs dependence testing on the loop bodies as if they were in a single loop. Each forward dependence originally between l1 and l2 is tested. Fusion is unsafe if any dependences are reversed, becoming backward loop-carried dependences in the fused loop.

This test requires the inspection of the dependence source and sink variable references in l1 and l2. If one or more of the loops is inside a call, the variable references are represented instead as the modified and referenced sections for the call. The slices that annotate the sections correspond to the loops being considered for fusion and are tested identically to variable references (see the discussion of slices above). Unfortunately, while variable references are always exact, a section and its slice are not. If the slice is not exact, fusion is conservatively assumed to be unsafe. To be more precise would require the inspection of the dependence graphs for each called procedure, possibly a significant overhead.

    checkFusion(S)
    Input:  S = {s1, ..., sp}, where si is a call or a loop
            and si is adjacent to si+1
    Output: returns true if fusion is safe for all si

    F = {s1}
    for i = 2 to p
        let li = the loop header of si
        if the number of iterations of li differs from that of F then
            return false
        for each forward dependence (srcF, sink_si)
            if srcF or sink_si is not exact then
                return false
            if (srcF, sink_si) becomes backward loop-carried then
                return false
        endfor
        F = F + {si}
    endfor
    return true

Loop Interchange

Loop interchange of two nested loops exchanges the loop headers, changing the order in which the iteration space is traversed.

It is used to introduce parallelism or to adjust the granularity of parallelism. In particular, when a loop containing calls is not parallel, or parallelizing the loop is not profitable, it may be possible to move parallel loops in the called procedures outward using loop interchange, as in the earlier examples. The safety of loop interchange may be determined by inspecting the distance/direction vector to ensure that no existing dependence is reversed after interchange.

Our algorithm considers loop interchange only when a perfect nest can be created via loop extraction, embedding, fusion and distribution. If a loop contains more than one call, it may be possible to fuse the outer enclosing loops of the calls to create a perfect nest. Even if there are multiple statements and calls, it may be possible to use loop distribution to create a perfect nest. If a perfect nest may be safely created, testing the safety of interchange simply requires inspection of the direction vectors and slices for dependences between calls or statements in the nest.

Interprocedural Parallel Code Generation

In this section, we present an algorithm for the interprocedural parallel code generation problem. This algorithm moves loops across procedure boundaries when other transformations, such as loop fusion, interchange and distribution, may be applied to the resulting loop nests to introduce or improve single-level loop parallelism. The goal of this algorithm is to apply only transformations which are proven to minimize execution time for a particular code segment. To determine the minimum execution time of a code segment, a simple machine model is used. This model includes the cost of arithmetic and conditional statements, as well as operations such as parallel loops, sequential loops and procedure call overhead. Both Polychronopoulos and Sarkar have used similar machine models in their research.

Machine Model and Performance Estimation

A cost model is needed to compare the costs of various execution options. First, a method for estimating the cost of executing a sequential loop is presented. Consider the following perfect loop nest, where ub1, ..., ubn are constants and B is the loop body:

      DO i1 = 1, ub1
        ...
          DO in = 1, ubn
            B
          ENDDO
        ...
      ENDDO

In order to estimate the cost of running this loop on a single processor, a method for estimating the running time of the loop body is needed. If B consists of straight-line code, we simply sum the time to execute each statement in the sequence. To handle control flow, we assume a probability for each branch and compute the weighted mean of the branches. Once the sequential running time of the loop body tB is computed, the running time for the inner loop is given by the formula

    ubn * (tB + o)

where o is the sequential loop overhead. The running time for the entire loop nest is then given by the following:

    ub1 * ( ... (ubn * (tB + o) + o) ... + o)

In order to estimate the running time of a parallel loop, we need to take into account any overhead introduced by the parallel loop. Our experiments on uniform shared-memory machines indicate that this overhead consists of a fixed cost cs of starting the parallel execution and a cost cf of forking and synchronizing each parallel process. If there are P parallel processors, an estimate of the cost of executing the inner loop of the above example in parallel is given by the equation

    ceil(ubn / P) * (tB + o) + cs + cf * P

This formula assumes that the iterations are divided into nearly equal blocks at startup time and that the overhead of an iteration, o, remains the same. Given a perfect loop nest where just one loop is being considered for parallel execution, these two formulae may be generalized to compute the expected sequential and parallel execution time. If the parallel execution time is less than the sequential execution time, it is profitable to run the loop in parallel.

To enable the parallel code generator to compare the costs of different transformation choices, we introduce the following cost function:

    cost(L, how, B), where
      L = {l1, ..., ln}, a perfect loop nest
      how indicates whether ln is parallel or sequential
      B = the loop body

The function cost estimates the running time of a loop nest l1, ..., ln, where the inner loop ln is specified as either parallel or sequential and all outer loops are sequential. The loop body B may contain any types of statements, including calls and inner loop nests.

Code Generation Algorithm

The goal of our interprocedural parallel code generation algorithm is to introduce effective loop parallelism for programs which contain procedure calls and loops. This algorithm applies the following transformations: loop fusion, loop interchange, loop distribution, loop embedding, loop extraction and loop parallelization. These transformations are applied at call sites and for loop nests containing call sites. The algorithm seeks a minimum-cost, single-loop parallelization based on performance estimates.

Potential loop and call sequences that may benefit from these interprocedural transformations are adjacent procedure calls, loops adjacent to calls, and loop nests containing calls. To find candidates for interprocedural optimization, the augmented call graph is traversed in a top-down pass. If a candidate benefits from interprocedural transformation, the transformations are performed and no further optimization of that call sequence is attempted. Additional candidates for optimization may be created by using judicious code motion and loop coalescing (combining nested loops into a single loop).

BestCost Algorithm

BestCost considers L = {l1, ..., ln}, a perfect loop nest with body S = {s1, ..., sp}, where ln is the innermost loop and L may be the empty set. S consists of at least one call and may also contain other statements, such as loops, control flow and assignments.

The BestCost algorithm makes use of loop parallelization, fusion, interchange, extraction and embedding (loop distribution is excluded) to determine a tuple <t, T> such that t is the best execution time and T specifies the transformations needed to obtain this time. Unfortunately, finding the best ordering of a loop nest via loop interchange requires that all n! possible permutations be considered. Therefore, to restrict the search space and simplify this presentation, BestCost only considers loop interchange of ln, the innermost nest, and lf, the result of fusing S. However, opportunities to test various interchange strategies are pointed out in the text.

    BestCost(S, L)
    Input:  a set of statements S = {s1, ..., sp} in a perfect loop nest L = {l1, ..., ln}
    Output: a tuple <t, T>, where t = the minimum execution time and
            T = the set of transformations that result in t

    <t, T> = <cost(L, sequential, S), {}>
    if L = {} then
        if checkFusion(S) and fused loop lf is parallel then
            <t, T> = min(<cost(lf, parallel, body(lf)), {fuse, make lf parallel}>, <t, T>)
        return <t, T>
    endif
    for i = 1, n
        if li is parallel then
            <t, T> = min(<cost({l1, ..., li}, parallel, body(li)), {make li parallel}>, <t, T>)
            if i != n then return <t, T>
        endif
    endfor
    if checkFusion(S) then
        if fused loop lf is parallel then
            if checkInterchange(ln, lf) and lf is parallel after interchange then
                <t, T> = min(<cost({l1, ..., ln-1, lf}, parallel, ln + body(lf)),
                             {fuse, interchange, make lf parallel}>, <t, T>)
            else
                <t, T> = min(<cost({l1, ..., ln, lf}, parallel, body(lf)),
                             {fuse, make lf parallel}>, <t, T>)
        else if ln is parallel and checkInterchange(ln, lf) and ln is parallel after interchange then
            <t, T> = min(<cost({l1, ..., ln-1, lf, ln}, parallel, body(ln)),
                         {fuse, interchange, make ln parallel}>, <t, T>)
        endif
    return <t, T>

The sequential execution time is computed first (T = {}). If there is no surrounding loop nest (L = {}), S may be a group of adjacent calls and loops that can be fused. If fusion of all members of S is possible and produces a parallel loop, its execution time is computed and compared with the sequential time. If L is not empty, other transformations are considered as follows.

First, the outermost parallel loop of L is sought: if any of l1, ..., ln-1 are parallel, BestCost returns. Loop interchange outward of any of these parallel loops could also be considered. Otherwise, if all of S fuses into lf, three transformations on lf and ln are considered:

1. interchanging a parallel lf with ln, to make a parallel loop with increased granularity;
2. a parallel lf in its current position;
3. interchanging ln and lf, to introduce inner-loop parallelism.

Case 1 is illustrated in the earlier examples. Further interchanging of lf, to enable a more outer loop to be parallel, may also be tested here.

Embedding versus Extraction

To apply the set of transformations specified by <t, T>, the loops involved may need to be placed in the same routine. In particular, if T specifies interchange or fusion across a call, then one of embedding or extraction must be applied. If there is only one call, then embedding loop ln into the called procedure is preferable, because it reduces procedure call overhead. If there is more than one call and T requires fusion, extraction from all the calls is performed. Fusion, interchange and parallelization may then be performed on the transformed loops.

Lo op Distribution

and compared to the sequential cost using the function

min The function min assigns the minimum of the If BestCostL S cannot intro duce parallelism then

two times and T the corresp onding program transfor it may b e p ossible to use lo op distribution to do so
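Before turning to distribution, note that the <τ, T> bookkeeping BestCost performs reduces to keeping the minimum-time tuple over candidate (time, transformations) pairs, starting from the sequential cost. The sketch below illustrates only that selection; the cost numbers and candidate transformation lists are hypothetical, and this is not the compiler's actual implementation.

```python
# Illustrative sketch of BestCost's <tau, T> selection (hypothetical values,
# not the paper's code): each candidate pairs an estimated execution time
# with the transformation list that achieves it; the minimum-time tuple
# wins, starting from the sequential cost with T = [].

def best_cost(candidates, sequential_time):
    """candidates: iterable of (time, transformations) tuples."""
    best = (sequential_time, [])           # <tau, T> for the untransformed nest
    for time, transforms in candidates:
        if time < best[0]:                 # min keyed on tau, as in the figure
            best = (time, transforms)
    return best

# Hypothetical cost estimates for one loop nest containing a call:
candidates = [
    (40.0, ["fuse", "make l_f parallel"]),
    (25.0, ["fuse", "interchange", "make l_f parallel"]),
]
tau, T = best_cost(candidates, sequential_time=100.0)
```

Here the fuse-plus-interchange candidate wins because its estimated time beats both the sequential cost and the fusion-only candidate.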

Loop distribution seeks parallelism by separating independent (parallel) and sequential statements in L. For example, loop distribution may create loop nests of adjacent calls and loops, which BestCost can optimize.

Ordered partitions. Loop distribution is safe if the partition of statements into new loops preserves all of the original dependences. Dependences are preserved if any statements involved in a cycle of dependences (a recurrence) are placed in the same loop partition. The dependences between the partitions then form an acyclic graph that can always be ordered using topological sort.

By first choosing a safe partition with the finest possible granularity and then grouping partitions, larger partitions may be formed. Any one of these groupings may expose the optimal parallelization of the loop. Unfortunately, there exists an exponential number of possible groupings.

To limit the search space, statement order is fixed based on a topological sort of all the dependences for L. Ambiguities are resolved in favor of placing parallel partitions adjacent to each other. The advantage of this ordering is that loop-carried antidependences may be broken, allowing parallelism to be exposed.

Grouping partitions via dynamic programming. A dynamic programming solution is used to compute the best grouping for the finest granularity ordered partitions. This algorithm is similar to techniques for calculating the shortest path between two points in a graph. The algorithm is O(NM^3); N is the number of perfectly nested loops, M is the maximum number of partitions and is less than or equal to the number of statements in the loop. Both N and M are typically small numbers.

Input:
    L  = {l_1, ..., l_n}     perfect loop nest
    S  = {s_1, ..., s_p}     ordered body of L
    IT = {it_1, ..., it_n}   number of loop iterations
    time^(i)_{jk} = BestCost({s_j, ..., s_k}, {l_i, ..., l_n})

Output:
    opt^(i)_{jk} = min( time^(i)_{jk}, min over j <= r < k of ( opt^(i)_{jr} + opt^(i)_{r+1,k} ) )
                 = best execution time for l_i
    D^(i)_{jk}   = grouping of partitions at l_i with best execution time

for i = n, 1
    partition {s_1, ..., s_p} into pi_1, ..., pi_m
    for delta = 0, m - 1
        for j = 1, m - delta
            opt^(i)_{j,j+delta} <- min( time^(i)_{j,j+delta},
                                        it_{i+1} * time^(i+1)_{map(j),map(j+delta)} )
            if time^(i)_{j,j+delta} <= it_{i+1} * time^(i+1)_{map(j),map(j+delta)} then
                D^(i)_{j,j+delta} <- {{pi_j, ..., pi_{j+delta}}}
            else
                D^(i)_{j,j+delta} <- D^(i+1)_{map(j),map(j+delta)}
            endif
            for k = 0, delta - 1
                if opt^(i)_{j,j+delta} > opt^(i)_{j,j+k} + opt^(i)_{j+k+1,j+delta} then
                    opt^(i)_{j,j+delta} <- opt^(i)_{j,j+k} + opt^(i)_{j+k+1,j+delta}
                    D^(i)_{j,j+delta}  <- D^(i)_{j,j+k} U D^(i)_{j+k+1,j+delta}
                endif
            endfor
        endfor
    endfor
endfor

Figure: Grouping via dynamic programming.
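The O(M^3) interval combination used to group ordered partitions has the same shape as matrix-chain dynamic programming. Below is a minimal sketch for a single loop level; the cost function and its numbers are hypothetical stand-ins for the compiler's time estimates, not the paper's implementation.

```python
# Sketch of grouping M ordered partitions at one loop level (hypothetical
# cost model, not the paper's code). opt[j][k] holds the best execution
# time for partitions j..k, either kept as one loop or split between two
# adjacent groups, mirroring the O(M^3) combination step.

def group_partitions(cost, m):
    """cost(j, k): estimated time when partitions j..k run as one loop."""
    opt = [[0.0] * m for _ in range(m)]
    for j in range(m):
        opt[j][j] = cost(j, j)                # finest partition: singletons
    for delta in range(1, m):                 # interval size minus one
        for j in range(m - delta):
            k_hi = j + delta
            best = cost(j, k_hi)              # keep j..k_hi as a single group
            for k in range(j, k_hi):          # or split between k and k+1
                best = min(best, opt[j][k] + opt[k + 1][k_hi])
            opt[j][k_hi] = best
    return opt[0][m - 1]                      # best time for the whole body

# Toy cost: each partition alone costs 10.0; running several partitions as
# one (parallel) loop halves their combined cost -- illustrative only.
times = [10.0, 10.0, 10.0]
def cost(j, k):
    total = sum(times[j:k + 1])
    return total / 2 if k > j else total

best_time = group_partitions(cost, 3)
```

With these toy numbers the best choice is to merge all three partitions (estimated time 15.0) rather than split them (20.0). In the compiler this decision is repeated per loop level, reusing the inner level's solutions through the map.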

The dynamic programming solution appears in the accompanying figure. The algorithm begins by finding the finest partition for the inner loop l_n that satisfies its own dependences and the ordering constraints. On subsequent iterations, the initial partition is further constrained by including the dependences for the next outer loop. Since an inner loop may have more partitions than its enclosing loop, a map is constructed that correlates a statement's partition for the previous and current iteration: map(j) returns the partition from l_{i+1} that corresponds to pi_j in l_i.

For each loop level, BestCost calculates the best execution time of each possible grouping of partitions. The grouping algorithm first tests the finest partition and then each pair of adjacent partitions. Increasingly larger groupings of partitions are tested for a particular loop level. At each level, the minimal execution time for each grouping analyzed is stored. The minimal grouping time is taken from the grouping at this level as well as that of the previous inner loops. This strategy allows inner loop distributions to be used within an outer loop distribution to minimize overall execution time. On completion, the best execution time for the grouping of the entire loop nest is determined.

Each time the algorithm locates a grouping of partitions that improves execution time, a set D is constructed to describe how partitions are grouped together. For a loop l_i, D^(i)_{1m} provides the best grouping of partitions at loop l_i. Upon termination of the algorithm, D^(1)_{1m} indicates the final grouping with the minimal cost. Implicit in D is also a description of any additional transformations specified by BestCost.

Improvements. To leverage the dynamic programming solution, the distribution algorithm generates partitions based on a fixed statement order that satisfies all the dependences. A correct and less restrictive statement order uses only the dependences for the particular loop nest being distributed. In general, this ordering causes the map between solutions for adjacent loop partitions to be useless. It provides a single best solution for each nesting level of distribution instead of one overall best solution. In practice, experimentation will be needed to differentiate these strategies.

Experimental Validation

This section presents significant performance improvements due to interprocedural transformation on two scientific programs, spec and ocean, taken from the Perfect Benchmarks. Spec contains … noncomment lines and is a fluid dynamics weather simulation that uses Fast Fourier Transforms and rapid


elliptic problem solvers. Ocean has … noncomment lines and is a 2-D fluid dynamics ocean simulation that also uses Fast Fourier Transforms.

To locate opportunities for transformations, we browsed the dependences in the program using the ParaScope Editor. Using other ParaScope tools, we determined which procedures in the program contained procedure calls. We examined the procedures containing calls, looking for interesting call structures. We located adjacent calls, loops adjacent to calls, and loops containing calls which could be optimized.

The rest of this section describes our experiences executing these programs on a …-processor Sequent Symmetry S. Since the optimizations used and the experimental methodology differed slightly for each program, they are described separately.

Optimizing spec

In spec, loops containing calls were common. Overall, transformations were applied to … such loops. Embedding and interchange were applied to loops which contained calls to a single procedure. The remaining loops, which contained multiple procedure calls, were optimized using extraction, fusion, and interchange. These loops were found in procedures del, gloop, and gwater.

For the transformed loops, performance was measured among three possibilities: (1) no parallelization of loops containing procedure calls, (2) parallelization using interprocedural information, and (3) interprocedural information and transformations. To obtain these versions, the steps illustrated in the figure below were performed.

The Original version contains directives to parallelize the loops in the leaf procedures that are invoked by the loops of interest. The IPinfo version parallelizes the loops containing calls. For the IPtrans version, we performed interprocedural transformation followed by outer loop parallelization. The parallel loops in each version were also blocked to allow multiple consecutive iterations to execute on the same processor without synchronization. (The compiler default is to create a separate process for each iteration of a parallel loop.)

    Processors: …
                Time in optimized portion    Speedup
    Original    … s                          …
    IPinfo      … s                          …
    IPtrans     … s                          …

    Processors: …
                Time in optimized portion    Speedup
    Original    … s                          …
    IPinfo      … s                          …
    IPtrans     … s                          …

The results reported above are the best execution times in seconds for the optimized portions of each version. The speedups are compared against the execution time in the optimized portion of the program on a single processor, which was … s. This accounted for more than … percent of the total sequential execution time.

With seven processors, the results are similar for all three versions, since each program version provided adequate parallelism and granularity for seven processors. On … processors, IPinfo was slower than the original program because the parallel outer loops had insufficient parallelism (only … to … iterations). The parallel inner loops of Original were better matched to the number of processors because they had at least … iterations. The interprocedural transformation version, IPtrans, demonstrated the best performance, a speedup of …, because it combined the amount of parallelism in Original with increased granularity. The interprocedural transformations resulted in a … percent improvement in execution time over Original in the optimized portion.

Parallelizing just these loops resulted in a speedup for the entire program of about … on … processors and … on … processors. Higher speedups might result from parallelizing the entire application.

Optimizing ocean

There were … places in the main routine of ocean where we extracted and fused interprocedurally adjacent loops. They were divided almost evenly between adjacent calls and loops adjacent to calls. In all cases where a loop was adjacent to a call, the loop was one-dimensional while the loop in the called procedure was two-dimensional. Prior to fusion, we coalesced the two-dimensional loop into a one-dimensional loop by linearizing the subscript expressions of its array references. The resulting fused loops consisted of between … and … parallel loops from the original program, thus increasing the granularity of parallelism.

To measure performance improvements due to interprocedural transformation, we performed steps similar to those in the figure below. Directives forced the parallelization and blocking of the individual loops in the Original version and the fused loops in IPtrans. The execution times were measured for the entire program and just the optimized portion. The optimized execution times are shown below.

    Processors: …
                Time in optimized portion    Speedup
    Original    … s                          …
    IPtrans     … s                          …

The speedups are relative to the time in the optimized portion of the sequential version of the program, which was … seconds. The optimized code accounted for about … percent of total program execution time. For the whole program, the parallelized versions achieve a speedup of about … over the sequential execution time.

Note that IPtrans achieved a … percent improvement over Original in the optimized portion. This improvement resulted from increasing the granularity of parallel loops and reducing the amount of synchronization. It is also possible that fusion reduced the cost of memory accesses. Often the fused loops were iterating over

Figure: Stages of preparing program versions for the experiment. (spec is blocked; directives on the inner loops yield the Original version, directives on the outer loops yield IPinfo, and interprocedural transformation followed by directives on the outer loops yields IPtrans.)

the same elements of an array. These groups of loops were not the only opportunities for interprocedural fusion; there were many other cases where fusion was safe, but the numbers of iterations were not identical. Using a more sophisticated fusion algorithm might result in even better execution time improvements.

Related Work

While the idea of interprocedural optimization is not new, previous work on interprocedural optimization for parallelization has limited its consideration to inline substitution and interprocedural analysis of array side effects. The various approaches to array side-effect analysis must make a tradeoff between precision and efficiency. The section analysis used here loses precision because it only represents a few array substructures and it merges the sections for all references to a variable into a single section. However, these properties make it efficient enough to be widely used by code generation. In addition, experiments with regular section analysis on the LINPACK library demonstrated a … percent reduction in parallelism-inhibiting dependences, allowing … loops containing calls to be parallelized. Comparing these numbers against published results of more precise techniques, there was no benefit to be gained by the increased precision of the other techniques.

Sections inspired a similar but more detailed array summary analysis, data access descriptors, which stores access orders and expresses some additional shapes. In fact, the slice annotation to sections could be obviated by using some of the techniques in Huelsbergen et al. for determining exact array descriptors for use in dependence testing. However, slices are appealing due to our existing implementation and their simplicity.

Conclusions

This paper has described a compilation system, introduced two interprocedural transformations, loop embedding and loop extraction, and proposed a parallel code generation strategy. The usefulness of this approach has been illustrated on the Perfect Benchmark programs spec and ocean. Taken as a whole, the results indicate that providing freedom to the code generator becomes more important as the number of processors increases. Effectively utilizing more processors requires more parallelism in the code. This behavior was particularly observed in spec, where the benefits of interprocedural transformation increased with the number of processors.

Although it may be argued that scientific programs structured in a modular fashion are rare in practice, we believe that this is an artifact of the inability of previous compilers to perform interprocedural optimizations of the kind described here. Many scientific programmers would like to program in a more modular style but cannot afford to pay the performance penalty. By providing compiler support to effectively optimize procedures containing calls, we encourage the use of modular programming, which in turn will make these transformations applicable on a wider range of programs.

Acknowledgments

We are grateful to Paul Havlak, Chau-Wen Tseng, Linda Torczon, and Jerry Roth for their contributions to this work. Use of the Sequent Symmetry S was provided by the Center for Research on Parallel Computation under NSF Cooperative Agreement CDA-….

References

F. Allen and J. Cocke. A catalogue of optimizing transformations. In J. Rustin, editor, Design and Optimization of Compilers. Prentice-Hall.

J. R. Allen, D. Callahan, and K. Kennedy. Automatic decomposition of scientific programs for parallel execution. In Proceedings of the Fourteenth Annual ACM Symposium on the Principles of Programming Languages, Munich, Germany, January.

J. R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, October.

R. Allen and S. Johnson. Compiling C for vectorization, parallelization and inline expansion. In Proceedings of the SIGPLAN Conference on Program Language Design and Implementation, Atlanta, GA, June.

V. Balasundaram and K. Kennedy. A technique for summarizing data access and its use in parallelism enhancing


transformations. In Proceedings of the SIGPLAN Conference on Program Language Design and Implementation, Portland, OR, June.

V. Balasundaram, K. Kennedy, U. Kremer, K. S. McKinley, and J. Subhlok. The ParaScope Editor: An interactive parallel programming tool. In Proceedings of Supercomputing, Reno, NV, November.

U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Boston, MA.

P. Briggs, K. Cooper, M. W. Hall, and L. Torczon. Goal-directed interprocedural optimization. Technical Report TR…, Dept. of Computer Science, Rice University, December.

M. Burke and R. Cytron. Interprocedural dependence analysis and parallelization. In Proceedings of the SIGPLAN Symposium on Compiler Construction, Palo Alto, CA, June.

D. Callahan, J. Cocke, and K. Kennedy. Estimating interlock and improving balance for pipelined machines. Journal of Parallel and Distributed Computing, August.

D. Callahan, K. Cooper, R. Hood, K. Kennedy, and L. Torczon. ParaScope: A parallel programming environment. The International Journal of Supercomputer Applications, Winter.

D. Callahan and K. Kennedy. Analysis of interprocedural side effects in a parallel programming environment. In Proceedings of the First International Conference on Supercomputing, Springer-Verlag, Athens, Greece, June.

K. Cooper, M. W. Hall, and L. Torczon. An experiment with inline substitution. Software: Practice and Experience, June.

K. Cooper, K. Kennedy, and L. Torczon. The impact of interprocedural analysis and optimization in the IR^n programming environment. ACM Transactions on Programming Languages and Systems, October.

K. Cooper, K. Kennedy, and L. Torczon. Interprocedural optimization: Eliminating unnecessary recompilation. In Proceedings of the SIGPLAN Symposium on Compiler Construction, Palo Alto, CA, June.

G. Cybenko, L. Kipp, L. Pointer, and D. Kuck. Supercomputer performance evaluation and the Perfect benchmarks. In Proceedings of the ACM International Conference on Supercomputing, Amsterdam, The Netherlands, June.

J. Ferrante, K. Ottenstein, and J. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, July.

G. Goff, K. Kennedy, and C. Tseng. Practical dependence testing. In Proceedings of the SIGPLAN Conference on Program Language Design and Implementation, Toronto, Canada, June.

M. W. Hall. Managing Interprocedural Optimization. PhD thesis, Rice University, April.

P. Havlak and K. Kennedy. Experience with interprocedural analysis of array side effects. In Proceedings of Supercomputing, New York, NY, November.

L. Huelsbergen, D. Hahn, and J. Larus. Exact dependence analysis using data access descriptors. In Proceedings of the International Conference on Parallel Processing, St. Charles, IL, August.

L. Huelsbergen, D. Hahn, and J. Larus. Exact dependence analysis using data access descriptors. Technical Report, Dept. of Computer Science, University of Wisconsin, Madison, July.

C. A. Huson. An inline subroutine expander for Parafrase. Master's thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign.

K. Kennedy and K. S. McKinley. Loop distribution with arbitrary control flow. In Proceedings of Supercomputing, New York, NY, November.

K. Kennedy, K. S. McKinley, and C. Tseng. Analysis and transformation in the ParaScope Editor. In Proceedings of the ACM International Conference on Supercomputing, Cologne, Germany, June.

K. Kennedy, K. S. McKinley, and C. Tseng. Interactive parallel programming using the ParaScope Editor. IEEE Transactions on Parallel and Distributed Systems, July.

D. Kuck. The Structure of Computers and Computations, Volume 1. John Wiley and Sons, New York, NY.

D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. J. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth Annual ACM Symposium on the Principles of Programming Languages, Williamsburg, VA, January.

Z. Li and P. Yew. Efficient interprocedural analysis for program restructuring for parallel programs. In Proceedings of the ACM SIGPLAN Symposium on Parallel Programming: Experience with Applications, Languages and Systems (PPEALS), New Haven, CT, July.

Z. Li and P. Yew. Interprocedural analysis and program restructuring for parallel programs. Technical Report …, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, January.

R. McNaughton and H. Yamada. Regular expressions and state graphs for automata. IRE Transactions on Electronic Computers.

Y. Muraoka. Parallelism Exposure and Exploitation in Programs. PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, February. Report No. ….

C. Polychronopoulos. On Program Restructuring, Scheduling and Communication for Parallel Processor Systems. PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, August.

V. Sarkar. Partitioning and Scheduling Parallel Programs for Multiprocessors. The MIT Press, Cambridge, MA.

R. Triolet, F. Irigoin, and P. Feautrier. Direct parallelization of CALL statements. In Proceedings of the SIGPLAN Symposium on Compiler Construction, Palo Alto, CA, June.

M. J. Wolfe. Loop skewing: The wavefront method revisited. International Journal of Parallel Programming, August.

M. J. Wolfe. Optimizing Supercompilers for Supercomputers. The MIT Press, Cambridge, MA.