RICE UNIVERSITY

Automatic and Interactive Parallelization

by

Kathryn S. McKinley

A Thesis Submitted
in Partial Fulfillment of the
Requirements for the Degree

Doctor of Philosophy

Approved, Thesis Committee:

Ken Kennedy
Noah Harding Professor, Chair
Computer Science

Keith D. Cooper
Associate Professor
Computer Science

Don H. Johnson
Professor
Electrical and Computer Engineering

Danny C. Sorensen
Professor
Mathematical Sciences

Houston, Texas

March

Automatic and Interactive Parallelization

Copyright © Kathryn S. McKinley

Abstract

The goal of this dissertation is to give programmers the ability to achieve high performance by focusing on developing parallel algorithms, rather than on architecture-specific details. The advantages of this approach also include program portability and legibility. To achieve high performance, we provide automatic compilation techniques that tailor parallel algorithms to shared-memory multiprocessors with local caches and a common bus. In particular, the compiler maps complete applications onto the specifics of a machine, exploiting both parallelism and memory.

To optimize complete applications, we develop novel, general algorithms to transform loops that contain arbitrary conditional control flow. In addition, we provide new interprocedural transformations which enable optimization across procedure boundaries. These techniques provide the basis for a robust automatic parallelizing algorithm that is applicable to complete programs.

The algorithm for automatic parallel code generation takes into consideration the interaction of parallelism and data locality, as well as the overhead of parallelism. The algorithm is based on a simple cost model that accurately predicts cache line reuse from multiple accesses to the same memory location and from consecutive accesses. The optimizer uses this model to improve data locality. It also uses the model to discover and introduce effective parallelism that complements the benefits of data locality. The optimizer further improves the effectiveness of parallelism by seeking to increase its granularity. Parallelism is introduced only when granularity is sufficient to overcome its associated costs.

The algorithm for parallel code generation is shown to be efficient, and several of its component algorithms are proven optimal. The efficacy of the optimizer is illustrated with experimental results. In most cases it is very effective, and either achieves or improves the performance of hand-crafted parallel programs. When performance is not satisfactory, we provide an interactive parallel programming tool which combines compiler analysis and algorithms with human expertise.

Acknowledgments

Ken Kennedy provided me with the three most important elements of support in graduate school: intellectual, political, and financial. In addition, Ken, Keith Cooper, and Linda Torczon fostered a research atmosphere and working environment whose benefits are untold. I would also like to recognize the other members of my committee: Keith Cooper, Don Johnson, and Danny Sorensen. Keith has been an endless source of encouragement and wisdom throughout my graduate career. Don Johnson gave me my first taste of research and hooked me for life.

I am fortunate that many of my fellow graduate students and friends supported my research intellectually, emotionally, and with implementations. I would especially like to thank Chau-Wen Tseng, Marina Kalem, Mary Hall, Paul Havlak, Nat McIntosh, Preston Briggs, Ben Chase, and the entire compiler group.

I am extremely grateful to my entire family. As always, my parents were a constant source of love and encouragement. To my husband, Scotty Strahan: I hope I am the rock for you that you have been for me.

A little Madness in the Spring
Is wholesome even for the King.
— Emily Dickinson

Contents

Abstract
Acknowledgments
List of Illustrations

Introduction
Automatic parallelization
Interactive parallelization
Overview

Technical Background
Dependence Analysis
Interprocedural dependence analysis
Augmented call graph

Interactive Parallel Programming
Introduction
Work Model
Transformations
Reordering transformations
Dependence breaking transformations
Memory hierarchy transformations
Miscellaneous transformations
Transformation algorithms
Loop interchange
Loop skewing
Loop distribution
Unroll and jam
Incremental analysis after edits
User and compiler interaction
Related work
Discussion

Loop Transformations with Arbitrary Control Flow
Motivation
Loop distribution
Mechanics
Restructuring
Code generation
Other transformations
Loop skewing
Loop reversal
Loop permutation
Strip mining
Privatization
Scalar expansion
Loop fusion
Loop peeling
Related work
Discussion

Interprocedural Transformations
Introduction
Technical background
Augmented call graph
Interprocedural section analysis
Support for interprocedural optimization
The ParaScope compilation system
Recompilation analysis
Interprocedural transformation
Loop extraction
Loop embedding
Intraprocedural transformations
Loop fusion
Loop permutation
Experimental results
Spec
Ocean
Related work
Discussion

Optimizing for Parallelism and Data Locality
Introduction
Memory and language model
Tradeoffs in optimization
Optimizing data locality
Sources of data reuse
Simplifying assumptions
Loop cost
Reference groups
Loop cost algorithm
Imperfectly nested loops
Loop permutation
Memory order
Permuting to achieve memory order
Data locality experimental results
Matrix multiply
Stencil computations: Jacobi and SOR
Erlebacher
Parallelism
Performance estimation
Introducing parallelism
Strip mining
Parallelization algorithm
Optimization algorithm
Experimental results
Matrix multiply
Dmxpy
Related work
Discussion

An Automatic Parallel Code Generator
Introduction
Parallel code generation
Driving code generation
Procedure cloning
Loop-based optimization
Partitioning for loop distribution and loop fusion
Simple partition algorithm
Merging the solutions
Discussion
Loop fusion
Loop distribution
Integrating interprocedural transformations
Selecting the appropriate interprocedural transformation
Extensions to procedure cloning
Discussion

Experimental Results
Introduction
Methodology
Ask and ye shall receive
Original parallel versions and nearby sequential versions
Creating an automatically parallelized version
Execution environment
Results
Parallel code generation statistics
Discussion

Conclusions

A  Description of Test Suite Programs
A.1 Banded Linear Systems
A.2 BTN Unconstrained Optimization
A.3 Direct Search Method
A.4 Erlebacher
A.5 Interior Point Method
A.6 Linpackd Benchmark
A.7 Multidirectional Search Method
A.8 D Seismic Inversion
A.9 Optimal Control
A.10 Two-Point Boundary Problems

Bibliography

Illustrations

PED User Interface
Effect of loop skewing on dependences and iteration space
Effect of unroll and jam on iteration space
Control and data dependence graphs for distribution example
Sections and data access descriptors
Information flow for interprocedural transformations
Stages of preparing program versions for experiment
Memory and parallelism tradeoffs
Stencil computation: Jacobi
Stencil computation: Successive Over-Relaxation (SOR)
Erlebacher: forward and backward sweeps in Z dimension
Parallel loop training set
Counterexample for the greedy algorithm
Partition graph G_o
Dividing G_o
Fusing G_s, G_p and merging the result

Chapter 1

Introduction

Many program transformations that introduce parallelism into sequential scientific Fortran programs have proven effective in improving performance on vector and shared-memory multiprocessor hardware. For advanced parallel architectures, obtaining the best performance often requires the program to be modified for the particular features of the underlying architecture. Currently, users must modify their programs for each architecture of interest to achieve high performance. Not only are programmers required to understand architecture-specific details, their programs are usually not portable once they have been modified in this fashion. To address these problems, this dissertation seeks to determine the following:

    Does there exist a machine-independent parallel programming style from
    which compilers can produce parallel programs with acceptable or
    excellent performance on shared-memory multiprocessors with local
    caches and a common bus?

Clearly, we are not attempting to solve the "dusty deck" problem, where a program developed using a sequential algorithm is automatically transformed to a parallel one. This problem is inherently more difficult because programs may need significant technical expertise or algorithmic restructuring for good parallel performance. In fact, this problem has not been solved even for uniprocessor vector machines [KKLWb, CDL].

A lesson to be learned from vectorization is that programmers rewrote their programs in a portable, vectorizable style based on feedback from vectorizing compilers [CKK, Wolc]. Compilers were then able to take these programs and generate machine-dependent vector code with excellent results. We are testing this same thesis for the harder problem of shared-memory parallel machines.

Vectorization achieves high performance by simply utilizing parallelism on a single statement for a single loop level. On any architecture where parallelism exacts a higher cost, larger regions of parallelism, i.e., higher granularity, must be discovered and exploited to achieve high performance. Because successes in this arena were few, we choose to explore parallel code generation for shared-memory multiprocessors with local caches and a common bus. We believe the solution to this problem to be a first step in compiling for more advanced parallel architectures.

We advocate that one machine-independent program version be developed in a sequential language such as Fortran, the most widely used programming language in the scientific community. Our compiler would then apply ambitious algorithms to customize the program for a shared-memory multiprocessor.

Automatic parallelization

Previous automatic parallel code generation algorithms for shared-memory multiprocessors are, for the most part, ad hoc and have not yet established an acceptable level of success. Although many transformations and combinations of transformations have been shown to parallelize interesting example loops, an effective overall parallelization strategy for complete applications has not been forthcoming [ABC+, ACK, KKLWa, Wola, WL]. The automatic parallelization problem is very difficult for a variety of reasons.

One important reason is that the theoretical statement of seeking all possible parallelism does not work well in practice. In practice, parallelism incurs overhead. If this overhead is not taken into account, parallelization can degrade performance rather than enhance it. Similarly, parallelism introduced without regard to its effect on the performance of the memory subsystem can degrade performance. Another reason parallelization is difficult to discover in complete applications is that it requires precise array analysis in the presence of procedure calls. Until recently, this analysis was not available [CKb, HK, HK].

We have developed a new interprocedural approach for automatic parallel code generation for complete applications. Two important components of this algorithm are generalized and interprocedural transformations that attack the problems found in real programs.

Generalized transformations for loops containing conditional control flow. Much previous work cannot apply parallelizing transformations when loops contain conditional control flow. In this thesis, a broad selection of loop transformations is extended to deal with conditional branches using the control dependence representation. In particular, a new algorithm for performing loop distribution is shown to be optimal for a legal partitioning of the statements into new loops.

Interprocedural transformations. We introduce two new interprocedural transformations, loop embedding and loop extraction, that expose loop nests to other optimizations without incurring costs associated with procedure inlining. We present a strategy for determining the benefits and safety of these two transformations when combined with other loop-based optimizations. The compound transformations are judiciously applied when performance is expected to improve. The recompilation system and analysis needed to perform and test these optimizations is shown to be efficient.

These algorithms enable automatic parallelization of complete applications.

Three important factors in optimizing for parallel architectures are granularity, parallelism, and data locality. Parallelism is usually most effective when it achieves the highest possible granularity, the amount of work per parallel task. The granularity of parallelism must also be sufficient to overcome the overhead of parallelism, such as processor synchronization costs. We only perform loops in parallel when performance estimation determines there is enough granularity to improve execution time.

To address data locality, we present a simple cost model for determining cache line reuse. It computes reuse due to accesses to consecutive memory locations on a particular cache line, and reuse due to multiple accesses to the same memory location on a cache line. The cost model is used to order loops in a nest to improve data locality and to discover and exploit parallelism. This optimization strategy produces data locality at the innermost loops and parallelism at the outermost loop. Each is placed where it is most likely to be effective. Experimental results validate this approach. They indicate that the cost model is accurate and effective for driving optimization, even for scalar machines.
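The flavor of such a cache-line cost model can be sketched in a few lines. This is an illustrative simplification, not the dissertation's exact formulation; the parameters `trip` (loop trip count) and `cls` (cache line size in array elements) are assumptions for the sketch.

```python
# Estimate the number of cache lines one array reference touches as a
# function of which loop is innermost: an invariant reference reuses one
# line, a stride-1 reference reuses each line cls times, and a large-stride
# reference touches a new line every iteration.

def lines_touched(stride, trip, cls):
    """Cache lines fetched by one reference across trip iterations."""
    if stride == 0:           # loop-invariant: same line reused every iteration
        return 1
    if abs(stride) < cls:     # consecutive accesses share a cache line
        return trip * abs(stride) / cls
    return trip               # large stride: a new line on every iteration

# A(I,J) in column-major Fortran: stride 1 in I, stride N (say 100) in J.
trip, cls = 100, 8
cost_inner_I = lines_touched(1, trip, cls)    # 12.5 lines
cost_inner_J = lines_touched(100, trip, cls)  # 100 lines
# The model prefers the loop order whose innermost loop has the lower
# cost, i.e., I innermost here.
```

Summing such per-reference costs over a loop nest, for each candidate innermost loop, gives the ordering criterion described above.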

This strategy provides the core of the optimizer. Algorithms for applying additional loop transformations are also described. In particular, a new unified algorithm for performing loop fusion and distribution is presented which achieves maximal granularity under certain constraints. Several component algorithms are shown to be optimal. The optimization algorithms are based on theoretical and practical considerations. All of the algorithms are incorporated into a cohesive interprocedural parallel code generation algorithm.

Interactive parallelization

Unfortunately, automatic parallelization is unlikely to yield excellent parallel performance in every case, for a variety of reasons. For example, the algorithm may be unsuitable for parallel execution. One difficulty which often arises is that important constants and symbolic values are unknown at compile time. Therefore, in addition to improved automatic parallelization via advanced compiler techniques, we combine compiler strategies and human insight in an interactive parallel programming tool.

Our tool, the ParaScope Editor, is intended to provide all the analysis and optimization capabilities of the parallelizer in an intelligent editor. It provides a large collection of parallelism-enhancing transformations that have proven effective, such as loop interchange, loop fusion, and strip mining. It contains a user-assertion facility for communication between the user and the compiler, as well as an advanced text and structure editor for Fortran with functions such as searching and view filtering. In order to make the ParaScope Editor (Ped) truly interactive, updates after user changes such as edits, transformations, and assertions must be quick and precise. We describe fast incremental algorithms for precise updates after user or compiler changes. They are implemented in Ped and have proven themselves efficient in practice [HHK+].

Overview

In Chapter 2, we begin by describing the analysis required to perform effective program parallelization. We then serve two purposes by discussing the ParaScope Editor in Chapter 3. The first is a general description of interactive parallel programming and the supporting implementation. However, we also introduce the loop-based transformations which form a basis for parallel code generation. We detail several of the transformation algorithms and present new incremental update algorithms for them. These algorithms serve as an introduction to the analysis and representations necessary to support automatic as well as interactive parallelization.

The next two chapters are devoted to new algorithms that make the parallelization of complete applications viable. Throughout the rest of the dissertation, Chapters 6 and 7 then develop an integrated automatic parallel code generation strategy. This optimizer is tested experimentally to determine if it provides support for machine-independent parallel programming.

The experiment compares a good hand-coded parallel program to one derived by hand-simulating our automatic algorithm on a nearby sequential version. Therefore, parallelism is known to exist, and we measure the ability of our automatic techniques to uncover this parallelism. The results of our experiment indicate that, given a few assertions, the automatically generated versions usually perform as well as or better than hand-coded versions. These results do not completely prove the thesis statement, but provide very promising support for it.

Chapter 2

Technical Background

For the most part, this dissertation focuses on exploiting existing analysis to perform effective optimizing transformations. To understand these optimizations requires knowledge of the tenets on which they are based. We therefore begin with an overview of the analysis required for program parallelization and transformation.

Dependence Analysis

Dependences describe a partial order between statements that must be maintained to preserve the meaning of a program with sequential semantics. A dependence between statements S1 and S2, denoted S1 δ S2, indicates that S1, the source, must be executed before S2, the sink. There are two types of dependence: data dependence and control dependence.

Data dependence

A data dependence S1 δ S2 indicates that S1 and S2 read or write a common memory location in a way that requires their execution order to be preserved [Ber]. There are four types of data dependence [Kuc]:

True (flow) dependence occurs when S1 writes a memory location that S2 later reads.

Anti dependence occurs when S1 reads a memory location that S2 later writes.

Output dependence occurs when S1 writes a memory location that S2 later writes.

Input dependence occurs when S1 reads a memory location that S2 later reads. Input dependences do not restrict statement order.
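The four classes follow mechanically from the kinds of the two accesses. A minimal sketch (illustrative, not from the dissertation):

```python
# Classify the dependence from the earlier access (source) to the later
# access (sink) on one memory location, by the read/write kind of each.

def classify(src_access, sink_access):
    """src_access and sink_access are 'read' or 'write'."""
    kinds = {
        ('write', 'read'):  'true (flow)',
        ('read',  'write'): 'anti',
        ('write', 'write'): 'output',
        ('read',  'read'):  'input',   # does not restrict statement order
    }
    return kinds[(src_access, sink_access)]

# S1: A = ...  followed by  S2: ... = A  is a true (flow) dependence:
assert classify('write', 'read') == 'true (flow)'
```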

Control dependence

Intuitively, a control dependence S1 δc S2 indicates that the execution of S1 directly determines whether S2 will be executed. The control flow graph Gf represents the flow of execution in the program. The following formal definitions of control dependence and the postdominance relation, computed on Gf, are taken from the literature [FOW, CFS].

Definition: x is postdominated by y in Gf if every path from x to the exit node of Gf contains y.

Definition: Given two statements x, y in Gf, y is control dependent on x if and only if:

(1) there exists a non-null path p from x to y such that y postdominates every node between x and y on p, and

(2) y does not postdominate x.

Based on these definitions, a control dependence graph Gcd can be built with the control dependence edges (x, y, l), where l is the label of the first edge on the path from x to y. Additionally, if Gf is structured, rooted, and acyclic, the resulting Gcd is a tree, where "structured" has its usual meaning as originally formulated by Böhm and Jacopini [BJ]. If Gf is unstructured, rooted, and acyclic, the resulting Gcd is a dag [CFS].
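The definition above can be exercised on a tiny example. The following is a hedged sketch, not the dissertation's implementation: postdominators are computed by brute-force path enumeration on a small acyclic CFG, and the control dependence test uses the standard equivalent form "y postdominates some successor of x but not x itself."

```python
# y is control dependent on x iff taking one edge out of x commits
# execution to reaching y, while another choice at x may avoid y.

def postdominators(succ, exit_node):
    nodes = list(succ)
    def paths_to_exit(n):
        if n == exit_node:
            return [[n]]
        return [[n] + p for s in succ[n] for p in paths_to_exit(s)]
    # A node's postdominators are the nodes on every path from it to exit.
    return {n: set.intersection(*(set(p) for p in paths_to_exit(n)))
            for n in nodes}

def control_dependent(succ, exit_node, x, y):
    pdom = postdominators(succ, exit_node)
    return (y not in pdom[x]) and any(y in pdom[s] for s in succ[x])

# CFG for: IF (c) THEN S1; S2 -- the branch either enters S1 or skips to S2.
succ = {'entry': ['S1', 'S2'], 'S1': ['S2'], 'S2': ['exit'], 'exit': []}
assert control_dependent(succ, 'exit', 'entry', 'S1')      # S1 depends on the branch
assert not control_dependent(succ, 'exit', 'entry', 'S2')  # S2 always executes
```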

Loop-carried and loop-independent dependence

Because scientific Fortran programs spend most of their time executing loops [Knu], this thesis focuses on executing loops in parallel. Dependence analysis determines which loops in the program may be run safely in parallel. A dependence between iterations of a loop is called loop-carried and prevents the iterations of a loop from being executed in parallel [All, AK]. Consider the following loop:

      DO I = 1, N
S1       A(I) = ...
S2       ...  = A(I)
S3       ...  = A(I-1)
      ENDDO

The true dependence S1 δ S2 is called loop-independent because it exists regardless of the surrounding loops. Loop-independent dependences, whether data or control, occur within a single iteration of a loop and do not inhibit a loop from running in parallel. For example, if S1 δ S2 were the only dependence in the loop, the iterations of this loop could be run in parallel, because statements executed on each iteration only affect other statements in the same iteration, and not in any other iterations. However, loop-independent dependences do affect statement order within a loop iteration. Interchanging statements S1 and S2 violates the loop-independent dependence and changes the meaning of the program.

By comparison, the true dependence S1 δ S3 is loop-carried, because the source and sink of the dependence occur on different iterations of the loop: S3 reads the memory location that was written by S1 on the previous iteration. Loop-carried dependences inhibit loop iterations from executing in parallel without explicit synchronization. When there are nested loops, the level of any carried dependence is the outermost loop on which it first arises [All, AK].
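The distinction can be checked mechanically by recording which iteration touches each location. A sketch of that bookkeeping for the loop above (illustrative; `dependences` is a hypothetical helper, not part of any compiler described here):

```python
# Simulate the loop, recording for each location the iteration on which S1
# writes it; each read then yields a dependence whose distance is the gap
# in iterations between the write and the read.

def dependences(n):
    writes = {}                          # location -> iteration written by S1
    deps = set()
    for i in range(1, n + 1):
        writes[('A', i)] = i             # S1: A(I) = ...
        if ('A', i) in writes:           # S2: ... = A(I)
            deps.add(('S1', 'S2', i - writes[('A', i)]))
        if ('A', i - 1) in writes:       # S3: ... = A(I-1)
            deps.add(('S1', 'S3', i - writes[('A', i - 1)]))
    return deps

# Distance 0 is loop-independent; distance > 0 is loop-carried.
assert ('S1', 'S2', 0) in dependences(5)   # same iteration
assert ('S1', 'S3', 1) in dependences(5)   # carried, distance 1
```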

Dependence testing

Determining the existence of data dependence between array references is more difficult than for scalars, because the subscript expressions must be considered. The process of differentiating between two subscripted references in a loop nest is called dependence testing [Ban, Wolb, GKT]. To illustrate, consider the problem of determining whether or not there exists a dependence from statement S1 to S2 in the following loop nest:

      DO i1 = L1, U1
        DO i2 = L2, U2
          ...
          DO in = Ln, Un
S1          A(f1(i1,...,in), ..., fm(i1,...,in)) = ...
S2          ... = A(g1(i1,...,in), ..., gm(i1,...,in))
          ENDDO
          ...
        ENDDO
      ENDDO

Let α and β be vectors of n integer indices within the ranges of the upper and lower bounds of the n loops. There is a dependence from S1 to S2 if and only if there exist α and β such that α is lexicographically less than or equal to β and the following system of dependence equations is satisfied:

    f_k(α) = g_k(β),    for all k, 1 ≤ k ≤ m
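Stated operationally, the system above can be checked by exhaustive enumeration over small bounds. Real dependence tests (GCD, Banerjee, and the like cited above) decide the system symbolically; the brute-force sketch below only states the problem, under the assumption of known constant bounds and a single subscript dimension:

```python
from itertools import product

def has_dependence(bounds, f, g):
    """bounds: [(L1,U1), ...]; f, g: subscript functions of an index vector.
    True iff some alpha <= beta (lexicographically) satisfies f(alpha) == g(beta)."""
    iters = list(product(*[range(lo, hi + 1) for lo, hi in bounds]))
    return any(f(a) == g(b)
               for a in iters for b in iters if a <= b)  # tuple <= is lexicographic

# S1: A(2*i) = ...   S2: ... = A(2*i+1)  -- even vs. odd, never equal:
assert not has_dependence([(1, 10)], lambda v: 2 * v[0], lambda v: 2 * v[0] + 1)
# S1: A(i+1) = ...   S2: ... = A(i)      -- a carried dependence exists:
assert has_dependence([(1, 10)], lambda v: v[0] + 1, lambda v: v[0])
```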

Distance and direction vectors

Distance and direction vectors may be used to characterize data dependences by their access pattern between loop iterations. If there exists a data dependence for α = (α1, ..., αn) and β = (β1, ..., βn), then the distance vector D = (D1, ..., Dn) is defined as β − α. The direction vector d = (d1, ..., dn) of the dependence is defined by the equation:

    d_i =  "<"  if α_i < β_i
           "="  if α_i = β_i
           ">"  if α_i > β_i

Dependence distances and directions are represented as a vector whose elements, displayed left to right, represent the dependence from the outermost to the innermost loop in the nest. By definition, all distance and direction vectors are lexicographically positive. We use (δ1, ..., δn) to represent a distance or direction vector, where δi is the dependence distance or direction for the loop at level i. For example, consider the following loop nest:

      DO I = 1, N
        DO J = 1, M
          DO K = 1, L
            A(I+1, J, K-1) = A(I, J, K) + C
          ENDDO
        ENDDO
      ENDDO

The distance and direction vectors for the true dependence between the definition and use of array A are (1, 0, −1) and (<, =, >), respectively. Since several different values of α and β may satisfy the dependence equations, a set of distance and direction vectors may be needed to completely describe the dependences arising between a pair of array references.
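The two definitions translate directly into code. The sketch below uses a hypothetical iteration pair chosen so that a write of A(I+1, J, K-1) and a read of A(I, J, K) touch the same element, yielding distance (1, 0, −1):

```python
# Distance is beta - alpha elementwise; direction compares alpha_i to
# beta_i, outermost loop first.

def distance(a, b):
    return tuple(bi - ai for ai, bi in zip(a, b))

def direction(a, b):
    return tuple('<' if ai < bi else '=' if ai == bi else '>'
                 for ai, bi in zip(a, b))

alpha = (1, 1, 2)    # write at (I,J,K) = (1,1,2) touches A(2, 1, 1)
beta  = (2, 1, 1)    # read  at (I,J,K) = (2,1,1) touches A(2, 1, 1)
assert distance(alpha, beta)  == (1, 0, -1)
assert direction(alpha, beta) == ('<', '=', '>')
```

Note that the first nonzero distance entry is positive, so the vector is lexicographically positive, as the definition requires.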

Distance vectors, first used by Kuck and Muraoka [KMC, Mur], specify the number of loop iterations between two accesses to the same memory location. Direction vectors, introduced by Wolfe [Wol], summarize distance vectors and are therefore less precise. However, there are situations where direction vectors may be computed but distance vectors cannot be.

Both may be used to calculate loop-carried dependences. Additionally, direction vectors are sufficient to determine the safety and profitability of loop interchange [AK, Wol]. Distance vectors are often required by other transformations that exploit parallelism [Banb, KMTa, Lam, WL, Wol] and improve data locality [CCK, KMTa, GJG]. Data dependence also characterizes reuse of individual memory locations [CCK].

Interprocedural dependence analysis

The presence of procedure calls complicates the process of analyzing dependences. Without interprocedural analysis, worst-case assumptions must be made in the presence of procedure calls. Conventional interprocedural analysis discovers constants, aliasing, flow-insensitive side effects such as ref and mod, and flow-sensitive side effects such as use and kill [CCKT, CKTa]. However, parallelization is limited because arrays are treated as monolithic objects, making it impossible to determine whether two references to an array actually access the same memory location.

Array sections

To provide more precise analysis, array accesses can be summarized in terms of regular sections or data access descriptors that describe subsections of arrays, such as rows, columns, and rectangles [BK, CKb, HK]. Local symbolic analysis and interprocedural constants are required to build accurate sections. Once constructed, sections may be quickly intersected during interprocedural analysis and dependence testing to determine whether dependences exist. This analysis is described in more detail in Chapter 5.
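The intersection step is cheap precisely because sections are so simple. A hedged sketch, assuming rectangular sections represented as one (lo, hi) bound per dimension (strides and the richer descriptors of the cited work are omitted):

```python
# Intersect two rectangular array sections; an empty intersection in any
# dimension proves the two accesses are disjoint, so no dependence exists.

def intersect(sec1, sec2):
    """sec1, sec2: list of (lo, hi) per dimension; None if disjoint."""
    out = []
    for (lo1, hi1), (lo2, hi2) in zip(sec1, sec2):
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo > hi:
            return None            # disjoint in this dimension
        out.append((lo, hi))
    return out

# Callee writes rows 1..50 of A; caller reads rows 51..100: disjoint, so
# the call need not serialize an enclosing loop.
assert intersect([(1, 50), (1, 100)], [(51, 100), (1, 100)]) is None
assert intersect([(1, 60), (1, 100)], [(51, 100), (1, 100)]) == [(51, 60), (1, 100)]
```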

Augmented call graph

The program representation for our work on whole-program optimization requires an augmented call graph to describe the calling relationship among procedures and specify loop nests. For this purpose, the program's call graph, which contains the usual procedure nodes and call edges, is augmented to include loop nodes and nesting edges. The loop nodes contain loop header information. If a procedure p contains a loop l, there will be a nesting edge from the procedure node representing p to the loop node representing l. If a loop l contains a call to a procedure p, there will be a nesting edge from l to p. Any inner loops are also represented by loop nodes and are children of their outer loop. The outermost loop of each routine is marked enclosing if all the other statements in the procedure fall inside the loop. Each loop is also marked as sequential or parallel. A loop with no loop-carried dependences (i.e., all the direction vectors contain "=" for the loop) is parallel, and all others sequential.
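A small sketch of this representation (the class and field names are illustrative, not the dissertation's implementation):

```python
# An augmented call graph: procedure nodes and loop nodes linked by
# nesting edges, with per-loop markings.

class Node:
    def __init__(self, kind, name):
        self.kind, self.name = kind, name   # 'proc' or 'loop'
        self.children = []                  # nesting (and call) edges
        self.enclosing = False              # outermost loop encloses routine?
        self.parallel = False               # no loop-carried dependences?

    def nest(self, child):
        self.children.append(child)
        return child

# Procedure P contains loop L1; L1 calls Q; Q's outermost loop L2
# encloses the whole routine and carries no dependences.
p  = Node('proc', 'P')
l1 = p.nest(Node('loop', 'L1'))
q  = l1.nest(Node('proc', 'Q'))
l2 = q.nest(Node('loop', 'L2'))
l2.enclosing = True
l2.parallel = True

# Walking the nesting edges recovers the interprocedural loop structure.
assert [c.name for c in p.children] == ['L1']
assert q.children[0].enclosing and q.children[0].parallel
```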

Chapter 3

Interactive Parallel Programming

The ParaScope Editor is a new kind of exploratory parallel programming tool for developing scientific Fortran programs. It is able to compensate, in many cases, for the deficiencies of automatic parallelizers by bringing user expertise and compiler technology to bear on program parallelization. It assists the knowledgeable user by displaying the results of sophisticated program analyses and by providing a set of powerful interactive transformations. After an edit or parallelism-enhancing transformation, the ParaScope Editor incrementally updates both the analyses and source quickly. These fast updates are useful in both batch and automatic systems. This chapter focuses on these abilities and introduces the transformations, and the analysis they require, that are used throughout the thesis.

Introduction

The ParaScope Editor helps users interactively transform a sequential Fortran program into a parallel program with explicit parallel constructs, such as those in PCF Fortran [Lea]. In a language like PCF Fortran, the principal mechanism for the introduction of parallelism is the parallel loop, which specifies that its iterations may be run in parallel according to any schedule. The fundamental problem introduced by such languages is the possibility of nondeterministic execution. For example, consider converting the following sequential loop into a parallel loop:

      DO I = 1, N
         A(INDEX(I)) = A(INDEX(I)) + ...
      ENDDO

Dependence analysis conservatively assumes that INDEX(I) for a particular iteration may equal INDEX(I) for a later iteration. Therefore, there may be a loop-carried dependence on A, and an automatic parallelizer would not execute this loop in parallel. Unfortunately, a parallelizer is often forced to make conservative assumptions about whether dependences exist. These assumptions may arise because of complex subscripts, as above, or the use of unknown symbolics. As a result, automatic systems miss loops that could be parallelized. For example, it may be that INDEX(I) is a permutation, allowing the loop to be safely performed in parallel. This weakness has led previous researchers to conclude that automatic systems by themselves are not powerful enough to find all of the parallelism in a program.
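The user's reasoning here, which the compiler cannot perform at compile time, amounts to a runtime-checkable property. A minimal sketch:

```python
# If INDEX contains no repeated value (e.g., it is a permutation), every
# iteration of the loop above touches a distinct element of A, so the loop
# carries no dependence and its iterations may run in parallel.

def carries_dependence(index):
    """True if two iterations may touch the same A(INDEX(I))."""
    return len(set(index)) != len(index)

assert not carries_dependence([3, 1, 4, 2])   # a permutation: parallel is safe
assert carries_dependence([1, 2, 2, 4])       # repeated subscript: serialize
```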

However, the analysis performed by automatic systems can be extremely useful to the programmer during the parallelization process. The ParaScope Editor (Ped) is based upon this observation. It is designed to support an interactive parallelization process in which the user examines a particular loop and its dependences. To safely parallelize a loop, the user must either determine that each dependence shown is not valid (because of some overly conservative assumption made by the system), or transform the loop to satisfy valid dependences. After each transformation, Ped reconstructs the dependence graph so that the user may determine the level of success achieved and apply additional transformations if desired.

A tool with this much functionality is bound to be complex. Ped incorporates a complete source editor and supports dependence analysis, dependence display, and a large variety of program transformations to enhance parallelism. We describe in detail elsewhere the usage, user interface, and motivation of the ParaScope Editor [BKK+, FKMW, KMTb]. We also cover elsewhere the types of analyses and representations needed to support this tool and automatic parallelization (see Chapter 2) [KMTa]. In this chapter, we focus on efficient algorithms for incremental updates after a transformation or edit. All of these algorithms are implemented in Ped.

We begin with an overview of the existing work model and a description of the transformation process. These descriptions include the mechanisms for communication between the user and Ped, and an example Ped session. The incremental algorithms for determining safety and profitability, and for performing the update of dependence information and source, for four important transformations are also detailed. The transformations are loop interchange, loop skewing, loop distribution, and unroll and jam. We discuss related work and conclude.

Work Model

This thesis exploits loop-level parallelism, which comprises most of the usable parallelism in scientific codes when synchronization costs are considered [CSY]. In the work model best supported by Ped, the user first selects a loop for parallelization. Ped then displays all of its loop-carried data dependences. Other dependences, such as control dependences, may also be displayed at the option of the user. The user may sort, filter, or mark the dependences. This mechanism allows users to mark as rejected those dependences that are due to overly conservative dependence analysis, so that the transformations will ignore them. If dependence testing is exact and proves a dependence to exist, the dependence is premarked as proven. Otherwise, the dependence is premarked as pending, signifying to the user that it may be the result of overly conservative analysis. Additionally, the user may mark pending dependences as accepted, indicating that the dependence does in fact occur. A similar facility is provided for variable classification [FKMW]. These provide users with a powerful mechanism for experimenting with different parallelization strategies.

Ped's user interface is shown in the figure below. The figure shows a black and white screen dump of a color Ped session. The program pane in the top half of the window displays a loop from a parallel direct search program produced by Virginia Torczon. The outer loop is selected by clicking the mouse on the loop icon in the leftmost column. The selection causes the header and all the enclosed statements to be a different color than the other text (in this black and white picture it is not detectable). Buttons across the top of the editing pane invoke various Ped features, such as transformations, program analysis, view filtering, and editing.

In Ped, color is used to convey points of interest, focus, or special meaning. The dependence pane is in the middle of the window and shows dependences carried by the selected loop. The output dependence on SINDEX(I,J) is selected, which causes it to be highlighted in the dependence pane. The dependence is also reflected in the text pane by an arrow from the highlighted source reference to the highlighted sink reference. In this example the dependence has its source and sink at the same reference, so only one reference is highlighted. If the end points of the dependence span the width of the screen, one end point is brought into view. To view the other end point, the user need only select it in the dependence pane, and Ped will then bring it into view in the text pane. The labels across the top of the dependence pane may be selected to sort by that characteristic. They may also be used in filter and marking queries on dependences.

The variable display at the bottom of the Ped window presents each variable that participates in the loop. It also presents the classification of the variable if the loop were run in parallel. The user interface for all three of these displays is unified, requiring the user to learn only one simple paradigm.

Figure: Ped user interface

Transformations

Ped provides a variety of interactive structured transformations that enhance or expose parallelism in programs. If the user has made assertions about dependences and variables, the transformations take these into account. These transformations are applied according to a power steering paradigm: the user specifies the transformation to be made, and the system provides advice and carries out the mechanical details. The user is therefore relieved of the responsibility of making tedious and error-prone program changes.

Ped evaluates each transformation invoked according to three criteria: applicability, safety, and profitability. A transformation is applicable if it can be mechanically performed; for example, loop interchange is inapplicable for a single loop. A transformation is safe if it preserves the meaning of the original sequential program. Some transformations are always safe; others require a specific dependence pattern. Finally, Ped classifies a transformation as profitable if it can determine that the transformation directly or indirectly improves the parallelism of the resulting program.

To perform a transformation, the user makes a program selection and invokes the desired transformation. If the transformation is inapplicable, Ped responds with a diagnostic message. If the transformation is safe, Ped advises the user as to its profitability. For parameterized transformations, Ped may also suggest a parameter value. The user may then apply the transformation. For example, see loop skewing and unroll and jam in the sections below.

If the transformation is unsafe or unprofitable, Ped responds with a warning explaining the cause. In these cases the user may decide to override the system advice and apply the transformation anyway. For example, if a user decides to parallelize a loop with loop-carried dependences, Ped will warn the user of the dependences but allow the loop to be made parallel. This override ability is extremely important in an interactive tool, since it allows the user to apply knowledge unavailable to the tool. The program's abstract syntax tree (AST) and dependence information are automatically updated after each transformation to reflect the transformed source.

Ped supports a large set of transformations that have proven useful for introducing, discovering, and exploiting parallelism. Ped also supports transformations for improving data locality. Each transformation is briefly introduced below. Many are found in the literature [AC, AK, CCK, KM, KMTb, KKLW, Lov, Wol]. In Ped, their novel aspect is the analysis of their applicability, safety, and profitability, and the incremental updates of source and dependence information. We classify the transformations implemented in Ped as follows:

Reordering Transformations
    Loop Distribution        Loop Interchange
    Loop Skewing             Loop Reversal
    Loop Fusion              Statement Interchange
    Unroll and Jam

Dependence Breaking Transformations
    Privatization            Scalar Expansion
    Array Renaming           Loop Peeling
    Loop Splitting           Alignment

Memory Hierarchy Transformations
    Strip Mining             Scalar Replacement
    Loop Unrolling

Miscellaneous Transformations
    Sequential/Parallel      Loop Bounds Adjusting
    Statement Addition       Statement Deletion

Reordering transformations

Reordering transformations change the order in which statements are executed, either within or across loop iterations. They are safe if all the dependences in the original program are preserved. Reordering transformations are used to expose or enhance loop-level parallelism. They are often performed in concert with other transformations to structure computations in a way that allows useful parallelism to be introduced. They may also be used to optimize data locality.

Loop distribution partitions independent statements inside a loop into multiple loops with identical headers. It is used to separate statements that may be parallelized from those that must be executed sequentially [KM, KMTa, Kuc]. The partitioning of the statements is targeted to vector or parallel hardware, as specified by the user.

Loop interchange interchanges the headers of two perfectly nested loops, changing the order in which the iteration space is traversed. When loop interchange is safe, it can be used to adjust the granularity of parallel loops [AK, KMTa, Wolb].

Loop skewing adjusts the iteration space of two perfectly nested loops by shifting the work per iteration in order to expose parallelism. When possible, Ped computes and suggests the optimal skew degree. Loop skewing may be used with loop interchange in Ped to expose wavefront parallelism [KMTa, Wol].

Loop reversal reverses the order of execution of loop iterations.

Loop fusion can increase the granularity of parallel regions and promote reuse by fusing two contiguous loops when dependences are not violated [AC, KKP+].

Statement interchange interchanges two adjacent independent statements.

Unroll and jam increases the potential candidates for scalar replacement and pipelining by unrolling the body of an outer loop in a loop nest and fusing the resulting inner loops [AC, CCK, CCK, KMTa].

Dep endence breaking transformations

Dep endence breaking transformations are used to satisfy sp ecic dep endences that

inhibit parallelism They may intro duce new storage to eliminate storagerelated

anti or output dep endences or convert lo opcarried dep endences to lo opindep endent

dep endences often enabling the safe application of other transformations If all the

dep endences carried on a lo op are eliminated the lo op may then b e run in parallel

Privatization makes an array or scalar variable local to a parallel loop, eliminating dependences on the variable between loop iterations.

Scalar expansion transforms a scalar variable into a one-dimensional array. It breaks output and anti dependences which may be inhibiting parallelism [KKLWa].

Array renaming, also known as node splitting [KKLWa], is used to break anti dependences by copying the source of an anti dependence into a newly introduced temporary array and renaming the sink to the new array [AK]. Loop distribution may then be used to separate the copying statement into a separate loop, allowing both loops to be parallelized.

Loop peeling peels off the first or last k iterations of a loop, as specified by the user. It is useful for breaking dependences which arise on the first or last k iterations of the loop [AC].

Loop splitting, or index set splitting, separates the iteration space of one loop into two loops, where the user specifies at which iteration to split. For example, if DO I = 1, N is split at iteration k, the following two loops result: DO I = 1, k and DO I = k+1, N. Loop splitting is useful in breaking crossing dependences, dependences that cross a specific iteration [AK].

Alignment moves instances of statements from one iteration to another to break loop-carried dependences [Cal].
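As a concrete illustration of one of these, the Python sketch below (illustrative code, not from Ped) shows scalar expansion: the scalar t carries output and anti dependences between iterations, and expanding it into an array makes the iterations independent:

```python
# Scalar expansion illustrated: the temporary t serializes the loop;
# expanding t into an array removes the loop-carried output and anti
# dependences, so the iterations could then run in parallel.
def before(a, n):
    b = [0] * n
    for i in range(n):
        t = a[i] * 2      # every iteration reuses the same scalar t
        b[i] = t + 1
    return b

def after(a, n):
    b = [0] * n
    t = [0] * n           # t expanded to a one-dimensional array
    for i in range(n):    # iterations now touch disjoint t[i]
        t[i] = a[i] * 2
        b[i] = t[i] + 1
    return b

assert before([3, 1, 4], 3) == after([3, 1, 4], 3)
```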

Memory hierarchy transformations

Memory optimizing transformations adjust a loop's balance between computations and memory accesses to make better use of the memory hierarchy and functional pipelines. These transformations have proven to be extremely effective for both scalar and parallel machines.

Strip mining takes a loop with a step size of 1 and changes the step size to a new user-specified step size greater than 1. A new inner loop is inserted which iterates over the new step size. If the minimum distance of the dependences in the loop is at least the new step size, the resultant inner loop may be parallelized. Used alone, the order of the iterations is unchanged, but used in concert with loop interchange the iteration space may be tiled [Wola] to utilize memory bandwidth and cache more effectively [CK].

Scalar replacement takes array references with consistent dependences and replaces them with scalar temporaries that may be allocated into registers [CCK]. It improves the performance of the program by reducing the number of memory accesses required.

Loop unrolling decreases loop overhead and increases potential candidates for scalar replacement by unrolling the body of a loop [AC, KMTa].
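To make the strip mining description concrete, this small Python sketch (illustrative bounds and strip size, not Ped's code) shows that strip mining alone leaves the iteration order unchanged:

```python
# Strip mining: a single loop of step 1 becomes an outer loop stepping
# by the strip size and an inner loop over each strip. Used alone, the
# visit order is identical to the original loop.
def original_order(n):
    return [i for i in range(n)]

def strip_mined_order(n, strip):
    order = []
    for lo in range(0, n, strip):                # outer loop, step = strip
        for i in range(lo, min(lo + strip, n)):  # inner loop over one strip
            order.append(i)
    return order

assert strip_mined_order(10, 4) == original_order(10)
```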

Miscellaneous transformations

Finally, Ped has a few miscellaneous transformations.

Sequential/Parallel converts a sequential DO loop into a parallel loop, and vice versa.

Loop bounds adjusting adjusts the upper and lower bounds of a loop by a constant. It is used in preparation for loop fusion.

Statement addition adds an assignment statement.

Statement deletion deletes an assignment statement.

Transformation algorithms

The incremental update algorithms for the transformations serve a critical function: they update the code and dependence information quickly and immediately, allowing users to understand the changes, see the effects, and continue the transformation process without reanalyzing the entire program. Although many of the algorithms for applying these transformations have appeared elsewhere, our implementation gives profitability advice and performs incremental updates of dependence information.

Rather than describe all these phases for each transformation, we have chosen to examine only a few interesting transformations in detail. We discuss loop interchange, loop skewing, loop distribution, and unroll and jam. The purpose, mechanics, and safety of these transformations are presented, followed by their profitability estimates, user advice, and incremental dependence update algorithms.

Loop interchange

Loop interchange has been used extensively in vectorizing and parallelizing compilers to adjust the granularity of parallel loops and to expose parallelism [AK, KKLW, Wol]. Ped interchanges pairs of adjacent loops; loop permutations may be performed as a series of pairwise interchanges. Ped supports interchange of triangular or skewed loops. It also interchanges hexagonal loops that result after skewed loops are interchanged.

Safety

Loop interchange is safe if it does not reverse the order of execution of the source and sink of any dependence. Ped determines this by examining the direction vectors for all dependences carried on the outer loop. If any dependence has a direction vector of the form (<, >), interchange is unsafe. These dependences are called interchange preventing. They are precomputed and recorded in a flag in the dependence edge. Each dependence edge carried on the outer loop is examined; if any one of these has the interchange preventing flag set, Ped advises the user that interchange is unsafe.

Profitability

Ped judges the profitability of loop interchange by calculating which of the loops will be parallel after the interchange. A dependence carried on the outer loop will move inward if it has a direction vector of the form (<, =). These dependences are called interchange sensitive. They are also precomputed and stored in a flag on each dependence edge. Ped examines each dependence edge on the outer loop to determine where it will be following interchange. It then checks for dependences carried on the inner loop as well; they move outward following interchange. Depending on the result, Ped advises the user that neither, one, or both of the loops will be parallel after interchange.

Update

Updates after loop interchange are very quick. Dependence edges on the interchanged loops are moved directly to the appropriate loop level based on their interchange sensitive flags. All the dependences in the loop nest then have the elements in their direction vector corresponding to the interchanged loops swapped (e.g., (<, =) becomes (=, <)). Finally, the interchange flags are recalculated for dependences in the loop nest.
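These checks and updates can be sketched as follows; the direction-vector encoding and function names here are illustrative, not Ped's actual data structures:

```python
# Sketch of the interchange checks on direction vectors. A direction
# vector pairs the outer and inner loop directions, each one of
# '<', '=', '>'.

def interchange_preventing(dv):
    # A (<, >) dependence would be reversed by interchange: unsafe.
    return dv == ('<', '>')

def interchange_sensitive(dv):
    # A (<, =) dependence carried on the outer loop moves inward.
    return dv == ('<', '=')

def safe_to_interchange(outer_deps):
    return not any(interchange_preventing(dv) for dv in outer_deps)

def update_after_interchange(deps):
    # Swap the two direction-vector components for every dependence in
    # the nest; the flags above can then be recomputed.
    return [(d2, d1) for (d1, d2) in deps]
```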

Loop skewing

Loop skewing is a transformation that changes the shape of the iteration space to expose parallelism across a wavefront [IT, Lam, Mur, Wol]. It can be applied in conjunction with loop interchange, strip mining, and loop reversal to obtain effective loop-level parallelism in a loop nest [Banb, WL, Wola]. All of these transformations are supported in Ped.

Figure: Effect of loop skewing on dependences and iteration space (before skew, after skew)

Loop skewing is applied to a pair of perfectly nested loops that both carry dependences even after loop interchange. Loop skewing adjusts the iteration space of these loops by shifting the work per iteration, changing the shape of the iteration space from a rectangle to a parallelogram, as illustrated in the figure above. Skewing changes dependence distances for the inner loop so that all dependences are carried on the outer loop after loop interchange. The inner loop can then be safely parallelized.

Loop skewing of degree f is performed by adding f times the outer loop index variable to the upper and lower bounds of the inner loop, followed by subtracting the same amount from each occurrence of the inner loop index variable in the loop body. In the example below, the loop nest on the right results when the J loop in the left loop nest is skewed by degree 1 with respect to loop I.

    DO I = 1, N                          DO I = 1, N
      DO J = 1, N                          DO J = 1+I, N+I
        A(I,J) = A(I-1,J) + A(I,J-1)         A(I,J-I) = A(I-1,J-I) + A(I,J-I-1)
      ENDDO                                ENDDO
    ENDDO                                ENDDO

The figure above illustrates the iteration space for this example. For the original loop, dependences with distance vectors (1,0) and (0,1) prevent either loop from being safely parallelized. In the skewed loop, the distance vectors for the dependences are transformed to (1,1) and (0,1). There are no longer any dependences within each column of the iteration space, so parallelism is exposed. However, introducing the parallelism on the I loop requires a loop interchange.
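The shape change can be checked concretely. The sketch below (in Python rather than Fortran, with illustrative bounds) runs a stencil loop nest before and after skewing by degree 1 and confirms that both orders compute the same values:

```python
# Illustrative check that skewing by degree 1 preserves a loop nest's
# result: A(I,J) = A(I-1,J) + A(I,J-1) over a small iteration space.
N = 6

def run_original():
    A = [[1] * (N + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            A[i][j] = A[i - 1][j] + A[i][j - 1]
    return A

def run_skewed():
    A = [[1] * (N + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):
        # Inner bounds shifted by i; index occurrences become j - i.
        for j in range(1 + i, N + i + 1):
            A[i][j - i] = A[i - 1][j - i] + A[i][j - i - 1]
    return A

assert run_original() == run_skewed()
```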

Safety

Loop skewing is always safe because it does not change the order in which array memory locations are accessed; it only changes the shape of the iteration space.

Profitability

To determine if skewing is profitable, Ped ascertains whether skewing will expose parallelism that can be made explicit using loop interchange, and suggests the minimum skew amount needed to do so. This analysis requires that all dependences carried on the outer loop have precise distance vectors. Skewing is only profitable if there exist:

    1. dependences on the inner loop, and

    2. at least one dependence on the outer loop with a distance vector (d1, d2) where d2 ≤ 0.

The interchange preventing or interchange sensitive dependences in case 2 prevent the application of loop interchange to move all dependences to the outer loop. If they do not exist, at least one loop may already be safely parallelized, possibly by using loop interchange. The purpose of loop skewing is to change the distance vector to (d1, d2') where d2' > 0. In terms of the iteration space, loop skewing is needed to transform dependences that point down or downwards to the left into dependences that point downwards to the right. Followed by loop interchange, these dependences will remain on the outer loop, allowing the inner loop to be safely parallelized.

To compute the skew degree, we first consider the effect of loop skewing on each dependence. When skewing the inner loop with respect to the outer loop by an integer degree f, the original distance vector (d1, d2) becomes (d1, f·d1 + d2). So for any dependence where d2 ≤ 0, we want f such that f·d1 + d2 ≥ 1. To find the minimal skew degree, we compute

    ⌈(1 - d2) / d1⌉

for each dependence, taking the maximum over all the dependences; this is suggested as the skew degree.
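A direct rendering of this computation (illustrative code, not Ped's implementation):

```python
import math

def minimal_skew_degree(outer_deps):
    """Suggest the minimal skew degree for dependences carried on the
    outer loop, given as (d1, d2) distance vectors with d1 > 0.

    Skewing by f turns (d1, d2) into (d1, f*d1 + d2); the new inner
    distance must be at least 1 for every dependence with d2 <= 0.
    """
    return max(math.ceil((1 - d2) / d1)
               for (d1, d2) in outer_deps if d2 <= 0)

# The (1, 0) dependence of the earlier stencil example needs f = 1;
# a (1, -2) dependence would need f = 3.
```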

Update

Updates after loop skewing are also very fast. After skewing by degree f, the incremental update algorithm changes the original distance vectors (d1, d2) for all dependences in the nest to (d1, f·d1 + d2), and then updates their interchange flags.

Loop distribution

Loop distribution separates independent statements inside a single loop into multiple loops with identical headers [AK, KKP+]. It is used to expose partial parallelism by separating statements which may be parallelized from those that must be executed sequentially. It is a cornerstone of vectorization and parallelization.

In Ped, the user can specify whether distribution is for the purpose of vectorization or parallelization. If the user specifies vectorization, then each statement is placed in a separate loop when possible. If the user specifies parallelization, then statements are grouped together into the fewest loops such that the most statements can be made parallel and the original statement order is maintained. The user is presented with a partition of the statements into new loops, as well as an indication of which loops are parallelizable. The user may then apply or reject the distribution partition.

Safety

To maintain the meaning of the original loop, the partition must not put statements that are involved in recurrences into different loops [KM, KKP+]. Recurrences are calculated by finding strongly connected regions in the subgraph composed of loop-independent dependences and dependences carried on the loop to be distributed. Statements not involved in recurrences may be placed together or in separate loops, but the order of the resulting loops must preserve all other data and control dependences. Ped always computes a partition which meets these criteria.
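The recurrence computation can be sketched as follows (an illustrative Python fragment, not Ped's code): statements are nodes, the relevant dependences are edges, and each strongly connected component must stay together in one loop.

```python
def recurrence_groups(n_stmts, dep_edges):
    """Group statements by strongly connected components of the
    dependence subgraph (loop-independent dependences plus those
    carried on the loop being distributed).

    Iterative Kosaraju-style SCC; statements are 0..n_stmts-1 and
    dep_edges is a list of (src, sink) pairs.
    """
    succ = [[] for _ in range(n_stmts)]
    pred = [[] for _ in range(n_stmts)]
    for s, t in dep_edges:
        succ[s].append(t)
        pred[t].append(s)

    # First pass: record finish order on the forward graph.
    order, seen = [], [False] * n_stmts
    for root in range(n_stmts):
        if seen[root]:
            continue
        stack = [(root, iter(succ[root]))]
        seen[root] = True
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if not seen[nxt]:
                    seen[nxt] = True
                    stack.append((nxt, iter(succ[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Second pass: sweep the reverse graph in reverse finish order.
    comp = [None] * n_stmts
    for root in reversed(order):
        if comp[root] is not None:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for nxt in pred[node]:
                if comp[nxt] is None:
                    comp[nxt] = root
                    stack.append(nxt)
    # Statements sharing a component id form one recurrence group.
    return comp
```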

If there is control flow in the original loop, the partition may cause decisions that occur in one loop to be used in a later loop. These decisions correspond to loop-independent control dependences that cross between partitions. We use the method described earlier to insert new arrays, called execution variables, that record these crossing decisions. Given a partition, this algorithm introduces the minimal number of execution variables necessary to effect the partition, even for loops with arbitrary control flow.

Profitability

Currently, Ped does not change the order of statements in the loop during partitioning. This simplification improves the recognizability of the resulting program, but may reduce the parallelism uncovered. In particular, statements that fall lexically between statements in a recurrence will be put into the same partition as the recurrence. In addition, when the source of a dependence lexically follows the sink, these statements will be placed in the same partition. A more flexible partitioning algorithm that allows statements to be reordered is described later.

When distributing for vectorization, statements not involved in recurrences are placed in separate loops. When distributing for parallelization, they are partitioned as follows: a statement is added to the preceding partition only if it does not cause that partition to be sequentialized; otherwise it begins a new partition. Consider distributing the loop on the left for parallelization:

    DO I = 1, N                  PARALLEL DO I = 1, N
      S1   A(I) = ...              S1   A(I) = ...
      S2   ... = A(I-1)          ENDDO
    ENDDO                        PARALLEL DO I = 1, N
                                   S2   ... = A(I-1)
                                 ENDDO

This loop contains only the loop-carried true dependence from S1 to S2. Since there are no recurrences, S1 and S2 begin in separate partitions. S1 is placed in a parallel partition; then S2 is considered. The addition of S2 to the partition would instantiate the loop-carried true dependence, causing the partition to be sequential. Therefore S2 is placed in a separate loop, and both loops may be made parallel, as seen on the right above.

Update

Updates can be performed quickly on the existing dependence graph after loop distribution. Data and control dependences between statements in the same partition remain unchanged. Data dependences between statements placed in separate partitions are converted from loop-carried dependences into loop-independent dependences, as in the above example.

Loop-independent control dependences that cross partitions are deleted and replaced as follows. First, loop-independent data dependences are introduced between the definitions and uses of the execution variables representing the crossing decision. A control dependence is then inserted from the test on the execution variable to the sink of the original control dependence. The update algorithm is explained more thoroughly later.

Unroll and jam

Unroll and jam is a transformation that unrolls an outer loop in a loop nest, then jams, or fuses, the resulting inner loops [AC, CCK]. Unroll and jam can be used to convert dependences carried by the outer loop into loop-independent dependences or dependences carried by some inner loop. It brings two accesses to the same memory location closer together, and can significantly improve performance by enabling reuse of either registers or cache. When applied in conjunction with scalar replacement on scientific codes, unroll and jam has resulted in integer factor speedups, even for single processors [CCK]. Unroll and jam may also be applied to imperfectly nested loops or loops with complex iteration spaces. The figure below shows an example iteration space before and after unroll and jam.

Before performing unroll and jam of degree k on a loop with a step of 1, we may need to make the total number of iterations divisible by k+1 by separating the first few iterations of the loop into a preloop. We then create k additional copies of the loop body. All occurrences of the loop index variable in the i-th new loop body must be incremented by i. The step of the loop is then increased to k+1. Consider the following:

before DO I

DO J

CI J

DO K

CI J CI J AI K BK J

ENDDO

ENDDO

ENDDO

after DO I

DO J

CI J

CI J

DO K

CI J CI J AI K BK J

CI J CI J AI K BK J

ENDDO

ENDDO ENDDO

Figure: Effect of unroll and jam on iteration space (before unroll and jam, after unroll and jam)

In the above matrix multiply example, loop I is unrolled and jammed by one to bring together references to B(K,J), resulting in the second loop nest. Unroll and jam may also be performed on loop J to bring together references to A(I,K).
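The transformation can be checked on a small case. The Python sketch below (illustrative, and assuming an even trip count so no preloop is needed) computes the product both ways:

```python
# Illustrative check that unroll-and-jam of the I loop by one (step 2)
# preserves the matrix multiply result. Assumes n is even.
def matmul(A, B, n):
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_unroll_jam(A, B, n):
    C = [[0.0] * n for _ in range(n)]
    for i in range(0, n, 2):          # outer loop unrolled by one
        for j in range(n):
            for k in range(n):        # jammed inner loops
                C[i][j] += A[i][k] * B[k][j]        # B[k][j] is reused
                C[i + 1][j] += A[i + 1][k] * B[k][j]
    return C

n = 4
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[(i + j) % 3 for j in range(n)] for i in range(n)]
assert matmul(A, B, n) == matmul_unroll_jam(A, B, n)
```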

Safety

To determine safety, an alternative formulation of unroll and jam is used. Unroll and jam is equivalent to strip mining the outer loop by the unroll degree, interchanging the strip mined loop to the innermost position, and then completely unrolling the strip mined loop. Since strip mining and loop unrolling are always safe, we only need to determine whether we can safely interchange the strip mined loop to the innermost position.

Ped determines this requirement by searching for interchange preventing dependences on the outer loop. Unroll and jam is unsafe if any dependence carried by the outer loop has a direction vector of the form (<, >). Even if such a dependence is found, unroll and jam is still safe if the unroll degree is less than the distance of the dependence on the outer loop, since this dependence would remain carried by the outer loop. Ped will either warn the user that unroll and jam is unsafe or provide a range of safe unroll degrees.
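This check can be sketched as follows (illustrative code; the dependence representation is an assumption, not Ped's):

```python
def max_safe_unroll_degree(outer_deps):
    """Largest unroll degree for which unroll-and-jam stays safe.

    outer_deps: (direction_vector, outer_distance) pairs for
    dependences carried by the outer loop. An interchange-preventing
    (<, >) dependence only permits unroll degrees strictly less than
    its outer distance; other dependences impose no limit (None).
    """
    limits = [dist - 1 for (dv, dist) in outer_deps if dv == ('<', '>')]
    return min(limits) if limits else None  # None: any degree is safe

# A (<, >) dependence with outer distance 3 allows degrees 1 and 2.
```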

Unroll and jam of imperfectly nested loops changes the execution order of the imperfectly nested statements with respect to the rest of the loop body. Dependences carried on the unrolled loop with distance less than or equal to the unroll degree are converted into loop-independent dependences. If any of these dependences cross between the imperfectly nested statements and the statements in the inner loop, they inhibit unroll and jam; specifically, the intervening statements cannot be moved and prevent fusion of the inner loops.

Profitability

Balance describes the ratio between computation and memory access rates [CCK]. Unroll and jam is profitable if it brings the balance of a loop closer to the balance of the underlying machine. Ped automatically calculates the optimal unroll and jam degree for a loop nest, including loops with complex iteration spaces [CCK].

Update

An algorithm for the incremental update of the dependence graph after unroll and jam is described elsewhere [CCK]. However, we chose a different strategy. Since no global dataflow or symbolic information is changed by unroll and jam, Ped rebuilds the scalar dependence graph for the loop nest and refines it with dependence tests. This update strategy proved much simpler to implement and is very quick in practice.

Incremental analysis after edits

Editing is fundamental for any program development tool because it is the most flexible means of making program changes. The ParaScope Editor therefore provides advanced editing features. When editing, the user has complete access to the functionality of the hybrid text and structure editor underlying Ped, including simple text entry, template-based editing, search and replace functions, intelligent and customizable view filters, and automatic syntax and type checking.

Rather than reanalyze immediately after each edit, Ped waits for a reanalyze command from the user. The user may thus avoid analyzing intermediate stages of the program that may be illegal or simply uninteresting. The transformations, the dependence display, and the variable display are disabled during an editing session because they rely on dependence information that may be invalidated by the edits. Once the user prompts Ped, the dependence driver invokes syntax and type checking. If errors are detected, the user is warned; otherwise reanalysis proceeds.

Unfortunately, incremental dependence analysis after edits is a very difficult problem. Precise dependence analysis requires utilization of several different kinds of information. In order to calculate precise dependence information, Ped may need to incrementally update the control flow graph, control dependence graph, static single assignment (SSA) graph [CFR+], and call graphs, as well as recalculate scalar live range, constant, symbolic, interprocedural, and dependence testing information.

Several algorithms for performing incremental analysis are found in the literature, for example dataflow analysis [RP, Zad], interprocedural analysis [Bur, RC], and interprocedural recompilation analysis [BCKT], as well as dependence analysis [Ros]. However, few of these algorithms have been implemented and evaluated in an interactive environment. Rather than tackle all these problems at once, we chose a simple yet practical strategy for the current implementation of Ped. First, the scope of each program change is evaluated. Incremental analysis is applied only when it may be profitable; otherwise batch dependence analysis is invoked. Ped will apply incremental dependence analysis when the following situations are detected.

No update needed

Many program edits fall into this category. It is trivial using a structure editor to determine that changes to comments or whitespace do not require reanalysis. Other more interesting cases include changes to arithmetic expressions that do not disturb control flow or symbolic analysis. For instance, changing the constant added in an assignment such as A(I) = B(I) + c does not affect dependence information one whit.

Delete dependence edges

Removal of an array reference may be handled simply by deleting all edges involving that reference.

Add dependence edges

Addition of an array reference may be handled by scanning the loop nest for occurrences of the same variable, performing dependence tests between the new reference and any other references, and adding the necessary dependence edges.

Redo dependence testing

Changes to loop bounds or array subscript expressions require dependence testing to be performed on all affected array variables.

Redo local symbolic analysis

Some types of program changes do not affect the scalar dependence graph, but may require symbolic analysis to be reapplied. For instance, changing the increment in an assignment such as J = J + 1, where J is an auxiliary induction variable, requires redoing symbolic analysis and dependence testing.

Redo local dependence analysis

Changes such as the modification of control flow, or of variables involved in symbolic analysis, require significant updates best handled by redoing dependence analysis. However, the nature of the change may allow the reanalysis to be limited to the current loop nest or procedure. In these cases the entire program does not need to be reanalyzed.
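These cases amount to a dispatch on the kind of edit. A schematic version follows; the category names and granularity are illustrative, not Ped's actual interface:

```python
# Hypothetical dispatch from edit kind to incremental-update strategy,
# mirroring the cases described above.
UPDATE_ACTIONS = {
    "comment_or_whitespace":    "none",
    "constant_in_expression":   "none",
    "remove_array_reference":   "delete_edges",
    "add_array_reference":      "add_edges",
    "loop_bounds_or_subscript": "redo_dependence_tests",
    "auxiliary_induction_var":  "redo_symbolic_analysis",
    "control_flow":             "redo_local_dependence_analysis",
}

def plan_update(edit_kind):
    # Anything unrecognized falls back to batch dependence analysis.
    return UPDATE_ACTIONS.get(edit_kind, "batch_reanalysis")
```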

User and compiler interaction

Once the programmer begins changing the source for the purpose of optimization, version control begins to play an important role. Is the new version now machine dependent, or is it a better machine independent version, or is it a bug fix? The automation of version control is still an open question, and in the current implementation version control is left to the programmer. In order to facilitate a single machine-independent program version, we propose the following compiler-controlled approach.

In this approach, the compiler would first perform automatic parallelization as a source-to-source transformation, producing a machine specific version. If the user is satisfied with the program's resulting performance, the user need not intervene at all. If the user is unsatisfied, the compiler communicates with the user in the interactive tool, in this case Ped. The compiler would mark the loops in the original version that it was unable to parallelize, or to parallelize well. It would also rank the loops and subroutines by their effects on execution time, using performance estimation or runtime profiling. If the user wants to maintain portability, the onus shifts to the user to make assertions and improvements in an architecture independent manner. The user is assisted with the hints and functionality currently provided by Ped, such as dependence display and transformations. In addition, users would be able to invoke the compiler's optimizing and parallelizing algorithms in Ped to determine the effects of their changes, providing almost an interactive compiler.

Related work

Several other research groups are also developing advanced parallel programming tools. Ped's analysis and transformation capabilities compare favorably to automatic parallelization systems such as Parafrase, Ptran, and of course PFC. Our work on interactive parallelization bears similarities to Sigmacs, Pat, and Superb.

Ped has been greatly influenced by the Rice Parallel Fortran Converter (PFC), which has focused on the problem of automatically vectorizing and parallelizing sequential Fortran [AK]. PFC has a mature dependence analyzer which performs data dependence analysis, control dependence analysis, interprocedural constant propagation [CCKT], interprocedural side-effect analysis of scalars [CKTa], and interprocedural array section analysis [CKb, HK]. Ped expands on PFC's analysis and transformation capabilities and makes them available to the user in an interactive environment. Because of its mature analysis and implementation, PFC is available as a dependence information server for the ParaScope Editor. On demand, the information provided by PFC is converted into the internal representations of the ParaScope Editor. This functionality enables the use of PFC's more advanced analysis in Ped.

Parafrase was the first automatic vectorizing compiler [KKLW]. It supports program analysis and performs a large number of program transformations to improve parallelism. In Parafrase, program transformations are structured in phases and are always applied where applicable. Batch analysis is performed after each transformation phase to update the dependence information for the entire program. Parafrase-2 adds scheduling and improved program analysis and transformations [PGH+]. More advanced interprocedural and symbolic analysis is planned [HP]. Parafrase uses Faust as a front end to provide interactive parallelization and graphical displays [GGGJ].

Ptran is also an automatic parallelizer with extensive program analysis. It computes the SSA and program dependence graphs and performs constant propagation and interprocedural analysis [CFR+, FOW]. Ptran introduces both task and loop parallelism, but the only other program transformations are variable privatization and loop distribution [ABC+, Sar].

Sigmacs, a programmable interactive parallelizer in the Faust programming environment, computes and displays call graphs, process graphs, and a statement dependence graph [GGGJ, SG]. In a process graph, each node represents a task or a process, which is a separate entity running in parallel. The call and process graphs may be animated dynamically at run time. Sigmacs also performs several interactive program transformations and is planning on incorporating automatic updates of dependence information.

Pat is also an interactive parallelization tool [SA, SA]. Its dependence analysis is restricted to Fortran programs where only one write occurs to each variable in a loop. In addition, Pat uses simple dependence tests that do not calculate general distance or direction vectors. Hence, it is incapable of applying loop-level transformations such as loop interchange and skewing. However, Pat does support replication and alignment, insertion and deletion of assignment statements, and loop parallelization for a single loop. It can also insert synchronization to protect specific dependences. Pat divides analysis into scalar and dependence phases, but does not perform symbolic or interprocedural analysis. The incremental dependence update that follows transformations is simplified due to its austere analysis [SAS].

Superb interactively converts sequential programs into data-parallel SPMD programs that can be executed on the Suprenum distributed-memory multiprocessor [ZBG]. Superb provides a set of interactive program transformations, including transformations that exploit data parallelism. The user specifies a data partitioning; then node programs with the necessary send and receive operations are automatically generated. Algorithms are also described for incremental update of use-def and def-use chains following structured program transformations [KZBG].

Discussion

The ParaScope Editor provides a complementary strategy to back up automatic parallelization. In an integrated approach that makes the compiler algorithms available as well as the individual transformations, the user may make assertions and see the results in the automatically generated version. It also enables users to experiment with different mixtures of transformations without reanalyzing the entire program between transformations.

Our experience with the ParaScope Editor has shown that dependence analysis can be used in an interactive tool with ample efficiency [HHK+]. This efficiency is due to fast yet precise dependence analysis algorithms and a dependence representation that makes it easy to find dependences and to reconstruct them after a change. To our knowledge, Ped is the first tool to offer general editing with dependence reconstruction along with a substantial collection of useful program transformations.

Chapter

Loop Transformations with Arbitrary Control Flow

Previous code generation techniques and program transformations have known limitations dealing with control flow. Many of these transformations are loop based and are not applicable when there exists control flow such as branching within a loop or exit branches out of a loop. Because most programs contain meaningful control flow in loops, this limitation is a serious flaw. For truly effective parallel code generation, this problem must be addressed.

In this chapter, we extend a broad selection of transformations from Section to deal with arbitrary control flow, thus allowing an integrated transformation system for parallel code generation or interactive parallelization that is not inhibited by control flow. Several transformations are inhibited by some type of control flow, and others are easily extended. For example, loop permutation is safe when branches are internal and is inhibited by exit branches. Strip mining, on the other hand, is safe regardless of the type of branching. Two particularly important transformations, loop distribution and loop fusion, require more sophisticated algorithms that leverage the control dependence graph.

In the next section, we give a motivating example, an introduction to the problems with previous work, a few definitions, and our general approach. Due to its widespread use and the new and optimal results presented here, we begin with the algorithm for loop distribution when loops contain arbitrary control flow. The following sections detail a variety of other transformations, including loop skewing, loop reversal, loop permutation, strip mining, privatization, scalar expansion, loop fusion, and loop peeling.

Motivation

To motivate our treatment of conditional control flow, we first consider loop distribution. Loop distribution breaks up a single loop into two or more loops, each of which iterates over a disjoint subset of the statements in the body of the original loop. The usefulness of this transformation derives from its ability to convert a large loop whose iterations cannot be run in parallel into multiple loops, many of which can be parallelized. Consider the following code:

      DO I = 1, N
        A(I+1) = B(I) + C
        D(I) = A(I) + E
      ENDDO

If we wish to retain the original meaning of this code fragment, the iterations cannot be run in parallel without explicit synchronization, lest a value of A(I) be fetched before the previous iteration has a chance to store it. However, if the loop is distributed, each of the resulting loops can be run in parallel:

      DOALL I = 1, N
        A(I+1) = B(I) + C
      ENDDO
      DOALL I = 1, N
        D(I) = A(I) + E
      ENDDO
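The equivalence of the two forms can be checked with a small simulation (a Python sketch; the operands A(I+1), B(I)+C, and A(I)+E follow a plausible reconstruction of the fragment above and are illustrative):

```python
# Sketch: loop distribution turns the loop-carried dependence on A into a
# dependence between two loops, each of which is then fully parallel.
def original(B, C, E, n):
    A = [0.0] * (n + 2)
    D = [0.0] * (n + 2)
    for i in range(1, n + 1):
        A[i + 1] = B[i] + C
        D[i] = A[i] + E
    return A, D

def distributed(B, C, E, n):
    A = [0.0] * (n + 2)
    D = [0.0] * (n + 2)
    for i in range(1, n + 1):      # DOALL: iterations are independent
        A[i + 1] = B[i] + C
    for i in range(1, n + 1):      # DOALL: A is now fully computed
        D[i] = A[i] + E
    return A, D

n = 8
B = list(range(n + 2))
assert original(B, 1.5, 2.5, n) == distributed(B, 1.5, 2.5, n)
```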

The presence of conditionals complicates distribution. Consider, for example, the following loop:

      DO I = 1, N
        IF (A(I) .EQ. 0) THEN
          A(I+1) = B(I) + C
          D(I) = A(I) + E
        ENDIF
      ENDDO

In order to place the first assignment in the first loop and the second assignment in the second loop, the result of the IF statement must be known in both loops. The IF cannot be replicated in both loops, because the first assignment changes the value of A. One solution to this problem is to convert all IF statements to conditional assignment statements, as follows:

      DO I = 1, N
        P(I) = A(I) .EQ. 0
        IF (P(I)) A(I+1) = B(I) + C
        IF (P(I)) D(I) = A(I) + E
      ENDDO

The resulting loop can be distributed by considering only data dependence, because the control dependence has been converted to a data dependence involving the logical array P. This approach, called if-conversion [AKPW, All], has been used successfully in a variety of vectorization systems, which incorporate several other transformations as well [AK, SK, KKLW]. However, if-conversion has several drawbacks. If vectorization or parallelization fails, it is not easy to reconstruct efficient branching code. In addition, if-conversion may cause significant increases in the code space to hold conditionals.
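A minimal sketch of if-conversion on the fragment above (a Python stand-in for the Fortran; the operand details are a reconstruction, not verbatim). Recording the predicate P(I) before the guarded assignments is what makes replicating the test unnecessary, even though the first assignment changes A:

```python
# Sketch: if-conversion replaces the control dependence on the IF with a
# data dependence on the predicate array P, yielding straight-line code.
def original(A, B, D, C, E, n):
    for i in range(1, n + 1):
        if A[i] == 0:
            A[i + 1] = B[i] + C
            D[i] = A[i] + E
    return A, D

def if_converted(A, B, D, C, E, n):
    P = [False] * (n + 1)
    for i in range(1, n + 1):
        P[i] = (A[i] == 0)       # record the branch decision first
        if P[i]:
            A[i + 1] = B[i] + C  # guarded assignments; the guard is data,
        if P[i]:
            D[i] = A[i] + E      # not control, so the loop can distribute
    return A, D

n = 5
A = [0, 0, 9, 0, 0, 9, 0]
B = [0, 1, 2, 3, 4, 5]
D = [0.0] * 7
assert original(list(A), B, list(D), 10, 1, n) == \
       if_converted(list(A), B, list(D), 10, 1, n)
```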

For these reasons, research in automatic parallelization has concentrated on an alternative approach that uses control dependences to model control flow [FOW, ABC+, ABC+]. Our approach uses both data and control dependence graphs, as defined in Sections and . For our purposes, it is useful to classify the type of control flow in a loop nest, and its correspondence in the control dependence graph, as either

    internal branching, or

    exit branching.

Internal branching consists of conditional control flow that affects only the statements executed on a particular iteration of the loop. These are loop-independent control dependences. Exit branching is conditional control flow which terminates the execution of the loop. These are loop-carried control dependences. Internal branching may utilize structured or unstructured constructs. Exit branching can only be formed using unstructured control flow (gotos in Fortran).

Exit branches are inherently sequential, because they give rise to loop-carried dependences. Although the main focus of this dissertation is the use of transformations in a parallelizing environment, many of the transformations below are also useful for scalar compilation and data locality optimizations. Therefore, the extensions and limitations necessary for both internal and exit branching are included in the discussion below.

Loop distribution

Loop distribution is an integral part of transforming a sequential program into a parallel one. It was introduced by Muraoka [Mur] and is used extensively in parallelization, vectorization, and memory management. For loops with control flow, previous methods for loop distribution have significant drawbacks. We present a new algorithm for loop distribution in the presence of control flow modeled by a control dependence graph. This algorithm is shown optimal in that it generates the minimum number of new arrays and tests possible. We also present a code generation algorithm that produces code for the resulting program without replicating statements or conditions. These algorithms are very general and can be used in automatic or interactive parallelization systems.

This section presents a method for performing loop distribution in the presence of control flow based on control dependences. Control dependences may be used like data dependences for determining the placement of statements in loops. However, when there exists a control dependence between statements that crosses their new respective loop bodies, correct code generation requires recording the results of evaluating the predicate in a logical array and testing the logical array in the second loop.

Our approach is optimal in the sense that it introduces the fewest possible new logical arrays and tests. In particular, it introduces one array for each conditional node upon which some node in another loop in the distribution depends. We also present an algorithm for generating code for the body of a loop after distribution. The algorithms are very fast, both asymptotically and practically. This algorithm is also described elsewhere [KM].

Mechanics

Loop distribution may be separated into a three-stage process: (1) the statements in the loop body are partitioned into groups to be placed in different output loops; (2) the control and data dependence graphs are restructured to effect the new loop organization; and (3) an equivalent program is generated from the dependence graphs. To perform loop distribution without changing the original meaning of the loop, the placement of statements into new loops must preserve the data and control dependences of the original. The method we present is designed to work on any partition that is legal, i.e., any partition that preserves the control and data dependences.

A partition can preserve all dependences if and only if there exists no dependence cycle spanning more than one output loop [KKP+, AK]. If there is a cycle involving control and/or data dependences, it must be entirely contained within a single partition. This condition is both necessary and sufficient.² Consider what must be done to generate code given a partitioning into loops: some linear order for the loops must be chosen. If we treat each output loop as a single node and define dependence between loops to be inherited in the natural way from control and data dependences between statements, then the resulting graphs will be acyclic if and only if each original recurrence is confined to a single loop. Since an acyclic graph can always be ordered using topological sort, and a cyclic graph can never be ordered, the condition is established.

² Loops with exit branches are an exception to this condition. The necessary extensions are discussed at the end of Section .
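This legality test can be carried out mechanically: contract each output loop to a node, inherit the cross-loop dependence edges, and attempt a topological sort. A sketch (the dependence edges and partition numbering below are illustrative, loosely following the example in this chapter):

```python
# Sketch: a partition is legal iff no control/data dependence cycle spans
# two output loops, i.e. the inherited inter-loop graph is acyclic.
from collections import defaultdict, deque

def loop_order(edges, partition):
    # edges: dependences between statements; partition: stmt -> loop id.
    succs = defaultdict(set)
    indeg = defaultdict(int)
    loops = set(partition.values())
    for src, dst in edges:
        a, b = partition[src], partition[dst]
        if a != b and b not in succs[a]:   # inherit inter-loop edge once
            succs[a].add(b)
            indeg[b] += 1
    # Kahn's topological sort; failing to order every loop means a cycle.
    ready = deque(sorted(l for l in loops if indeg[l] == 0))
    order = []
    while ready:
        l = ready.popleft()
        order.append(l)
        for m in sorted(succs[l]):
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return order if len(order) == len(loops) else None  # None: illegal

# Four partitions {S1,S2,S3}, {S4,S5}, {S6,S7}, {S8,S9}; edges illustrative.
part = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 2, 7: 2, 8: 3, 9: 3}
deps = [(1,2), (1,3), (1,4), (1,5), (2,4), (5,6), (5,7), (5,8), (6,7), (8,9)]
assert loop_order(deps, part) == [0, 1, 2, 3]
assert loop_order([(1, 4), (4, 1)], part) is None   # cycle spans two loops
```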

In the algorithms presented below, the nodes in both the control and data dependence graphs usually represent a single statement. Exceptions to the single-statement-per-node rule are inner loops and irreducible regions; all of their statements are represented with a single node.

Because our algorithm accepts any legal partition as input, it is as general as possible. It can be used for vectorization, which seeks a partition of the finest possible granularity, or for MIMD parallelization, which seeks the coarsest possible granularity without sacrificing parallelism. We discuss a partitioning strategy in Section for shared-memory parallel code generation. In the discussion here, we assume a legal partition is provided.

Restructuring

In the original program, control decisions are made and used in the same loop on the same iteration, but a partition may specify that decisions made in one loop be used in another. This problem is illustrated below by Example. Its corresponding G_cd and data dependence graph are shown in Figure.

Example:

      DO I = 1, N
S1      IF (A(I) .GT. T) THEN
S2        A(I) = I
        ELSE
S3        T = T + 1
S4        F(I) = A(I) + T
S5        IF (B(I) .NE. 0) THEN
S6          U = A(I) + B(I)
          ELSE
S7          U = A(I) + U
S8          C(I) = B(I) + C(I)
          ENDIF
        ENDIF
S9      D(I) = D(I) + C(I)
      ENDDO

Figure: Control and data dependence graphs for distribution example. (a) G_cd. (b) Data dependence graph. [The graph drawings could not be recovered from the text. In (a), control dependence edges are labeled with branch values t and f, and the four partitions are enclosed by dashed lines; in (b), loop-carried data dependence edges are labeled lc.]

The data dependence graph in Figure (b) shows true dependences with solid lines and anti dependences with dashed lines. Loop-carried edges are labeled with lc. In this example, output dependences are redundant and are not included. Given the data and control dependences in Figure, the statements may be placed in four partitions: {S1, S2, S3}, {S4, S5}, {S6, S7}, and {S8, S9}. This particular partition is chosen solely for exposition of the algorithm, and in Figure (a) it is superimposed on G_cd such that each partition is enclosed by dashed lines.

Given this partition, some statements are no longer in the same loop with statements upon which they are control dependent. For example, S4 is control dependent on S1, but S1 and S4 are not in the same partition. In Figure, the G_cd edges that cross partitions represent decisions made in one loop and used in a later loop. There may be a chain of decisions on which a node n is control dependent, but given a legal partition, all of n's immediate predecessors and ancestors in G_cd are guaranteed either to be in n's partition or in an earlier one. Therefore, the execution of n may be determined solely from the execution of n's predecessors. We introduce execution variables to compute and store decisions that cross partitions in G_cd, for both structured and unstructured code.

Execution variables

Execution variables are only needed for branch nodes, because they correspond to control decisions in the original program. Any node in G_cd that has a successor must be a branch node, but only branch nodes with at least one successor in a different partition are of interest here. For each branch in this restricted set, a unique execution variable is created. Only one execution variable is created, regardless of the number of successors or the number of different partitions to which the successors belong. The execution variable is assigned the value of the test at the branch, capturing the branch decision. Later, this variable will be tested to determine control flow in a subsequent partition. Hence, the creation of an execution variable replaces control dependences between partitions with data dependences. Execution variables are arrays with one value for each iteration of the loop, because each iteration can give rise to a different control decision. If desired, loop-invariant decisions can be detected [AC] and represented with scalar execution variables.

All previous techniques, whether they are G_cd based or not, use Boolean logic when introducing arrays to record branch decisions. These methods require either testing and recording the path taken in previous loops or introducing additional arrays. In Example, in the loop with statements {S6, S7}, either S6 or S7 or neither may execute on a given iteration. Because there are three possibilities, the correct decision cannot be made with a single Boolean variable. For example, if S1 takes the true branch, then neither S6 nor S7 should execute. If just S5's decision is stored, then one of S6 or S7 will mistakenly be executed, because the branch-recording array for S5 must be either true or false, regardless of S1's decision.

Given this drawback, we have formulated execution variables to have three possible values: true, false, and ⊥, which represents undefined. Every execution variable is initialized to ⊥ at the beginning of the loop in which it will be assigned, indicating that the branch has not yet been executed. Because of the existence of a "not executed" value, the control dependent successors in different partitions need only test the value of the execution variables for their immediate predecessors; they do not need to test the entire path of their control dependence ancestors. This condition holds whether the control flow is unstructured or structured. Execution variables completely capture the control decision at a node, making them extremely powerful.
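The three-valued behavior can be sketched directly, with ⊥ modeled as Python's None; the statement names follow the example, and the branch predicates are stand-ins:

```python
# Sketch: execution variables take True, False, or None (⊥, "not executed").
# A guard on ev5 fires only when ev5 holds an actual branch decision, so
# when S1 takes the true branch neither S6 nor S7 executes -- and no test
# of ev1 is needed in their loop.
def run(a_gt_t, b_ne_0):
    ev5 = None                      # initialized to bottom each iteration
    executed = []
    if not a_gt_t:                  # false branch of S1 reaches S5
        ev5 = b_ne_0                # S5 records its decision
    if ev5 is True:
        executed.append("S6")
    elif ev5 is False:
        executed.append("S7")
    return executed

assert run(True, True) == []        # S1 true: ev5 stays bottom
assert run(False, True) == ["S6"]
assert run(False, False) == ["S7"]
```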

Algorithm: Execution variable and guard creation

Input:  partitions, G_cd, statement order
Output: modified G_cd with execution variables

for each partition P
    for each n ∈ P, in order
        if ∃ an edge n →_l o ∈ G_cd where o ∉ P
            insert "ev_n(i) = ⊥" into P at top
            let test be n's branch condition
            if ∃ n →_l m where m ∈ P
                replace n with:
                    ev_n(i) = test
                    if (ev_n(i) .EQ. true)
            else
                replace n with "ev_n(i) = test"
            endif
            for each P_k ≠ P containing a successor of n
                /* build guards and modify G_cd */
                for each l where n →_l p with p ∈ P_k
                    create new statement N: "if (ev_n(i) .EQ. l)"
                    add N to P_k            /* N is new and unique */
                    insert data dependences for ev_n
                    for each n →_l q where q ∈ P_k
                        /* update control dependences */
                        delete n →_l q from G_cd
                        add N →_true q to G_cd
                    endfor
                endfor
            endfor
        endif
    endfor
endfor
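A compact executable rendition of the algorithm's core bookkeeping (the graph encoding and names are ours; labels are Booleans for the two-label case):

```python
# Sketch: create one execution variable per branch node with a successor
# outside its partition, and one guard per distinct (label, partition) pair.
from collections import defaultdict

def restructure(cd_edges, partition):
    # cd_edges: list of (branch_node, label, successor) control dependences.
    succ = defaultdict(list)
    for n, lab, s in cd_edges:
        succ[n].append((lab, s))
    ev_vars, guards = set(), []
    for n, out in succ.items():
        if any(partition[s] != partition[n] for _, s in out):
            ev_vars.add(n)              # ev_n records n's branch decision
            seen = set()
            for lab, s in out:
                pk = partition[s]
                if pk != partition[n] and (lab, pk) not in seen:
                    seen.add((lab, pk))
                    guards.append((n, lab, pk))  # "if ev_n(i) .EQ. lab"
    return ev_vars, guards

# The example's partition and its cross-iteration control dependences.
part = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 2, 7: 2, 8: 3, 9: 3}
cd = [(1, True, 2), (1, False, 3), (1, False, 4), (1, False, 5),
      (5, True, 6), (5, False, 7), (5, False, 8)]
evs, gs = restructure(cd, part)
assert evs == {1, 5}                    # S1 and S5 need execution variables
assert set(gs) == {(1, False, 1), (5, True, 2), (5, False, 2), (5, False, 3)}
```

The four guards correspond to the four tests of EV1 and EV5 in the distributed code shown later in this section.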

Restructuring

The restructuring Algorithm creates and inserts execution variables and guards, given a distribution partition. It also updates the control and data dependence graphs to reflect the changes it makes. The algorithm is applied in partition order, and within a partition in statement order over G_cd; statement order is the original lexical order. The algorithm can be subdivided into three parts. First, execution variables for a branch node n are created where needed. Next, guard expressions are inserted for any nodes control dependent on n. Then the control and data dependences are updated, reflecting the new guards and execution variables.

The need for an execution variable for n is determined by considering n's immediate successors. If there is an outgoing edge from n to a node that is not in n's partition, an execution variable is created. In Example, execution variables are needed for S1 and S5. The initialization of the execution variable is inserted at the beginning of n's partition, ensuring it will always be executed. Next, an assignment of the execution variable to n's test is inserted in node n. If n has successors in its partition, its branch is changed to test the execution variable. Otherwise, its branch is deleted.

For each partition P_k that contains a successor of n, a guard on n's execution variable is built. Here the successors of n are also considered in statement order. A guard is built for every distinct label from n into P_k. Each guard compares n's execution variable ev_n(I) to the distinct label l. All of n's successors in G_cd in P_k on label l are severed from n and connected to the newly created corresponding guard. Our examples have only two labels, true and false, but any number of branch targets can be handled.

Consider Example: S5 has successors in two partitions, {S6, S7} and {S8, S9}. The successors in {S6, S7} are on different branches. S6 is on the true branch, so the guard expression created is ev5(i) .EQ. true. S7 is on the false branch, so its guard expression is ev5(i) .EQ. false. The old edges (S5, S6) and (S5, S7) are deleted from G_cd, and new edges attaching S6 and S7 to their corresponding guards are created. Similarly, a guard is created for (S5, S8) and connected to S8.

The following simple optimization is included in the algorithm and examples, but for clarity does not appear in the statement of the algorithm. Determining whether the initialization of an execution variable is necessary can be accomplished when an execution variable is created for a node n. If n is not control dependent on any other node, i.e., it is a root in the control dependence graph, then there is no need for initialization to be inserted. During guard creation for the successors of this node, the execution variable is known to have a value other than ⊥. Therefore, if control flow is structured, only one guard is needed for each successor partition, instead of one for each label.

After restructuring is applied, each partition has a correct G_cd, a correct data dependence graph, and possibly some new statements: execution variable assignments and guards. At this point, the code for the distribution partition can be generated. We use a simple code generation algorithm, which is described in Section. Given the distribution in Figure for Example, restructuring and code generation result in the following code:

      DO I = 1, N
        EV1(I) = A(I) .GT. T
S1      IF (EV1(I) .EQ. .TRUE.) THEN
S2        A(I) = I
        ELSE
S3        T = T + 1
        ENDIF
      ENDDO
      DO I = 1, N
        EV5(I) = ⊥
        IF (EV1(I) .EQ. .FALSE.) THEN
S4        F(I) = A(I) + T
S5        EV5(I) = B(I) .NE. 0
        ENDIF
      ENDDO
      DO I = 1, N
S6      IF (EV5(I) .EQ. .TRUE.) U = A(I) + B(I)
S7      ELSE IF (EV5(I) .EQ. .FALSE.) U = A(I) + U
      ENDDO
      DO I = 1, N
S8      IF (EV5(I) .EQ. .FALSE.) C(I) = B(I) + C(I)
S9      D(I) = D(I) + C(I)
      ENDDO

The advantages of three-valued logic are illustrated by the concise guards for S6 and S7. As shown in Section, ev5(i) must be explicitly tested for true or false, because if S1 evaluated to true, then ev5(i) will be ⊥ and neither S6 nor S7 should execute. Not only do we avoid testing ev1(i) here; if S4 and S5 were in S1's partition, there would be no need to store S1's decision at all, even though S6 and S7 are indirectly dependent on S1 and S1 remains in a different loop.

Optimality

Given a distribution, this section proves that our algorithm creates the minimal number of execution variables needed to track control decisions affecting statement execution in other loops. It also establishes that the algorithm produces the minimal number of guards on the values of an execution variable required to correctly execute the distributed code. Therefore, our algorithm is optimal for a given distribution partition.

Lemma: Each execution variable represents a unique decision that must be communicated between two loops.

Proof: An execution variable is created only when a decision in one partition directly affects the execution of a statement in another partition, as specified by G_cd. The definition of G_cd guarantees that no decision node subsumes another, and therefore any decisions represented by execution variables are unique. □

The restructuring algorithm creates the minimal number of guards on the values of an execution variable required to correctly determine execution. Let

    p = the number of distinct partitions P_i, and
    m = the number of distinct branch labels l_j

that contain successors of node n. There are at most k tests on the value of an execution variable ev_n, where

    k = Σ_{i=1}^{p} Σ_{j=1}^{m} (l_j → P_i)

and (l_j → P_i) is 1 if some successor of n in partition P_i is reached on label l_j, and 0 otherwise. k is the sum of distinct labels into every distinct partition and is bounded by the number of n's successors that are in separate partitions P_i.

Theorem: The number of guards that test an execution variable is the minimum required to preserve correctness for the given distribution.

Proof: By contradiction. If there exists a version of the distribution with fewer guards, then guards would be produced that were either unnecessary or redundant. If there were unnecessary guards, then the Lemma would be violated. If there were redundant guards, then there would be multiple guards for nodes in the same partition with the same label. However, the algorithm produces at most one guard per label used in a partition. □

Exit branches

Because exit branches determine the number of iterations that are executed for an entire loop, they are somewhat sequential in nature. It is possible to perform distribution on such loops in a limited form by placing all exit branches in the first partition. Of course, any other statements involved in recurrences with these statements must also be in the first partition. This forces the number of iterations to be completely determined by the first loop. If there are any statements left, any legal partitioning of them may be performed. The control dependences for each of the subsequent partitions can be satisfied with execution variables, as described above. However, during code generation their loop bounds must be adjusted. If an exit branch was taken, any statements preceding it in the original loop must execute the same number of times as the first loop; later statements must execute one less time than the first loop. Otherwise, when no exit branch is taken, all loops must execute the same number of times as the first loop.

Code generation

To review, there are three phases to distribution in the presence of control flow. The first step determines a partitioning based on data and control dependences. The second step inserts execution variables and guards to effect the partition and updates the control and data dependences. The third step is code generation.

In step two, the only changes to the data dependence graph are the addition of edges that connect the definitions of execution variables to their uses. A G_cd is built for each new loop during this phase. In each new loop's G_cd there are no control dependences between guards. However, there may be relationships between execution variables that can be exploited during code generation.

Now we consider code generation for unstructured or structured control flow without exit branches; Section outlines the extensions for exit branches. Because the data and control dependence graphs, as well as the program statements, are correct on entry to the code generation phase, a variety of code generation algorithms could be used. For example, any of the code generation algorithms based on the program dependence graph could be used in conjunction with the above algorithm [FM, FMS, CFS, BB]. The very simple code generation scheme described here has been implemented in the ParaScope Editor [KMTa].

When transformations are applied in an interactive environment, it is important to retain as much similarity to the original program as possible. The programmer can more easily recognize and understand transformed code when it resembles the original. For this reason, although partitioning may cause statements to change order, the original statement order and control structure within a partition is maintained. If the original loop is structured, the resulting code will be structured. If the original loop was unstructured and difficult to understand, so most likely will be the distributed loop.

To maintain the original statement ordering, an ordering number is computed and stored in order(n). All the nodes in G_cd are numbered relative to their original lexical order, from one to the number of nodes. All of the execution variable initialization nodes are numbered zero, so they will always be generated before any other node in their partition. The newly created guard nodes have an order number and a relative number, rel(n). Their order numbers are the number of the node whose execution variable appears in the guard expression. Their relative numbers, rel(n), are the number of the guard's lowest numbered successor. Both of these numbers can be computed when the guard is created. To simplify the discussion, branches are assumed to have only two label values, true and false, but the algorithm may be easily extended for multivalued branches.

The rest of our discussion is divided into three parts. First, relabeling, which corrects and renames statement labels, is described. The code generation discussion is then separated into one section for structured and one for unstructured code.

Label renaming

A distribution partition may specify that the destination of a goto, that is, a labeled statement, be in a different loop from the goto. Replication and label renaming of gotos of this type must be performed to compensate for this, after restructuring and before code generation. Renaming is easily accomplished by replacing the destination of a goto that is no longer in the same loop with an existing label or a new label l_Pj, which may require a continue. The new destination has the same relative ordering as the original label. Often this will be the last statement in the partition. Reuse of labels is done whenever possible.
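The search for a new destination can be sketched as follows; this is a hypothetical Python helper, with statement labels modeled by their statement numbers.

```python
def rename_label(goto_target, partition):
    """Pick a new destination for a goto whose original target left the
    partition: the first statement remaining in the partition whose
    original statement number exceeds the target's (a sketch)."""
    for stmt in sorted(partition):
        if stmt > goto_target:
            return stmt
    return None  # no such statement: fall through to the loop bottom

# With partition {S1, S2, S3, S6}, a goto whose target S4 left the
# partition is redirected to S6.
assert rename_label(4, [1, 2, 3, 6]) == 6
```

When `None` is returned, a labeled CONTINUE at the bottom of the new loop would serve as the destination, matching the text's observation that the new target is often the last statement in the partition.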

Example:

      DO I = 1, N
S1       IF (p1) GOTO S4
S2       ...
S3       GOTO S6
S4       IF (p4) GOTO S6
S5       ...
S6       CONTINUE
      ENDDO

[The figure also contained this loop's control dependence graph G_cd, which did not survive extraction; goto targets are shown symbolically by statement name above.]

Consider the example above with a distribution partition {S1, S2, S3, S6} and {S4, S5}. The destination of S1's goto, S4, is not in the same partition as S1; therefore the goto's label must be renamed. In this case, the new destination of S1's jump must not interfere with the execution of S6. To determine the destination and new label, the statement number of the original labeled statement (in this case 4) is compared to each statement in the partition following S1, in order. When a statement number greater than the original is found, S6 in our example, its label is used or a new one is created for it. Any empty jumps are deleted. A straightforward relabeling of the first partition in the example after restructuring results in the following:

      DO I = 1, N
         EV1(I) = p1
S1       IF (EV1(I) .EQ. TRUE) GOTO S6
S2       ...
S6       CONTINUE
      ENDDO

Structured code generation

When G_cd is a tree, code generation is relatively simple [FM, FMS, BB]. This discussion emphasizes properly selecting and inserting the appropriate control structures for newly created guards. Other G_cd code generation algorithms must select and create control structures for all branches. Because we use the original control structures for all but the newly created guards, only the guards are of interest here. When the guards are created, they are identified by setting guard(n) to true. For all other nodes, guard(n) evaluates to false. With structured control flow, the only two control structures that need be inserted when generating guards are if-then and if-then-else.

Our algorithm for code generation from structured or unstructured code appears below. It considers each partition and its nodes based on their order number, from lowest to highest. If a node n is not a guard node, it is generated with its original control structure, followed by any descendants, using depth-first recursion on G_cd. Given a tree G_cd and that all control dependences are satisfied, the ancestors of a node n in G_cd must be generated before n is. If the node is a guard node, the control structure for it must be selected and created. This work is done in the procedure genguard.

Algorithm: Code generation after distribution

Codegen(n)
Input:    n is a statement node
          G_cd, ordered partitions, order(n), rel(n),
          guard(n), goto(n)
Output:   the distributed loops
Algorithm:
    for each partition P
        gen(DO)                        -- the original loop header
        while there is an n in P
            choose n with smallest order(n) and,
                if goto(n) and n is not an only predecessor,
                with greatest rel(n); otherwise smallest rel(n)
            done = false
            delete n from P
            if guard(n)
                genguard(n)
            else
                gen(n)
                gensuccessors(n, all)  -- all matches any branch label
            endif
        endwhile
    endfor

gensuccessors(n, l)
Input:    n is a statement node
          l is a label
Algorithm:
    while done = false and (n ->l m) in G_cd and m in P
        choose m with smallest order(m)
        if there exists p -> m where p <> n
            -- in structured code m has one predecessor, so this never occurs
            done = true
        else
            delete m from P
            gen(m)
            gensuccessors(m, all)
        endif
    endwhile

Algorithm (continued)

genguard(n)
Input:    n is a statement node
Algorithm:
    -- generate unstructured code
    if there exists p -> rel(n) where p <> n and p is still in P
        let L be the statement label of node rel(n)
        gen(IF n GOTO L)
    -- generate structured constructs:
    -- the original conditional was structured
    else if n ->true q and n ->false r, where order(q) < order(r)
        gen(IF n THEN)
        gensuccessors(n, true)
        gen(ELSE)
        gensuccessors(n, false)
        gen(ENDIF)
    -- the original conditional was structured
    else if there exists o where order(o) = order(n),
            with n chosen such that rel(n) < rel(o)
        gen(IF n THEN)
        gensuccessors(n, true)
        delete o from P
        gen(ELSE IF o THEN)
        gensuccessors(o, true)
        gen(ENDIF)
    -- the original could be unstructured or structured
    else
        gen(IF n THEN)
        gensuccessors(n, true)
        gen(ENDIF)
    endif
end

If the guard node has true and false branches, an if-then-else is generated, where the conditional is the guard expression. Each successor on the true branch and its descendants are generated recursively, in order. The false successors are generated similarly under the else. If there are two guards with the same order number, they are ordered by their relative number, and an if-then-else-if-then is generated. The first guard expression becomes the first conditional, and its successors and their descendants are generated in the corresponding then. The second guard expression conditions the else-if-then and is followed by its descendants. Otherwise, the guard is the only node with this order number, and an if-then is generated for the guard and its descendants.
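The selection among the three structured constructs can be sketched as follows; this is a hypothetical Python fragment in which the Guard fields mirror order(n), rel(n), and the true and false successor lists.

```python
from collections import namedtuple

# Hypothetical guard descriptor for the structured cases.
Guard = namedtuple("Guard", "order rel true_succs false_succs")

def select_construct(g, other_guards):
    """Sketch of choosing the construct for a newly created guard g;
    other_guards are the guards still awaiting generation."""
    if g.true_succs and g.false_succs:
        return "if-then-else"
    # Two guards sharing an order number are ordered by rel() and
    # become a single if-then-else-if-then.
    if any(o.order == g.order for o in other_guards):
        return "if-then-else-if-then"
    return "if-then"

g1 = Guard(order=1, rel=2, true_succs=["S2"], false_succs=["S3"])
assert select_construct(g1, []) == "if-then-else"
g2 = Guard(order=1, rel=2, true_succs=["S2"], false_succs=[])
g3 = Guard(order=1, rel=4, true_succs=["S4"], false_succs=[])
assert select_construct(g2, [g3]) == "if-then-else-if-then"
assert select_construct(g2, []) == "if-then"
```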

In the example below, the control dependence graph is a long, narrow tree. After applying the above algorithms to this loop, the code following it results. The first loop shows the dead branch optimization. The second loop illustrates that it is possible to generate correct code without adding control dependences between guards. More efficient code could be generated by noticing, in the second loop nest, that if EV1(I) is true then neither EV3(I) nor EV5(I) can be true, and similarly, if EV3(I) is true then EV5(I) cannot be true. This code would not have fewer tests, but it would be more efficient and have a different structure.

Example:

      DO I = 1, N
S1       IF (p1) THEN
S2          ...
         ELSE
S3       IF (p3) THEN
S4          ...
         ELSE
S5       IF (p5) THEN
S6          ...
         ELSE
S7          ...
         ENDIF
         ENDIF
         ENDIF
      ENDDO

[The figure also contained this loop's control dependence graph G_cd, a long narrow tree over S1 through S7, which did not survive extraction.]

      DO I = 1, N
         EV3(I) = FALSE
         EV5(I) = FALSE
S1       EV1(I) = p1
         IF (EV1(I) .EQ. FALSE) THEN
S3          EV3(I) = p3
            IF (EV3(I) .EQ. FALSE) THEN
S5             EV5(I) = p5
               IF (EV5(I) .EQ. FALSE) THEN
S7                ...
               ENDIF
            ENDIF
         ENDIF
      ENDDO
      DO I = 1, N
         IF (EV1(I) .EQ. TRUE) S2
         IF (EV3(I) .EQ. TRUE) S4
         IF (EV5(I) .EQ. TRUE) S6
      ENDDO
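The behavior of this transformation can be checked with a small Python analogue (hypothetical; the predicates p1, p3, p5 are modeled as boolean arrays). Distribution moves statement instances between the two loops, but exactly the same instances execute.

```python
def nested_if_loop(p1, p3, p5, n):
    # Original loop: S2/S4/S6/S7 selected by a chain of structured ifs.
    out = []
    for i in range(n):
        if p1[i]:
            out.append(("S2", i))
        elif p3[i]:
            out.append(("S4", i))
        elif p5[i]:
            out.append(("S6", i))
        else:
            out.append(("S7", i))
    return out

def distributed(p1, p3, p5, n):
    ev1, ev3, ev5 = [False] * n, [False] * n, [False] * n
    out = []
    for i in range(n):            # first loop: dead branches optimized away
        ev1[i] = p1[i]
        if not ev1[i]:
            ev3[i] = p3[i]
            if not ev3[i]:
                ev5[i] = p5[i]
                if not ev5[i]:
                    out.append(("S7", i))
    for i in range(n):            # second loop: guards are independent
        if ev1[i]:
            out.append(("S2", i))
        if ev3[i]:
            out.append(("S4", i))
        if ev5[i]:
            out.append(("S6", i))
    return out
```

The two versions agree up to the reordering that distribution introduces, so comparing sorted instance lists shows their equivalence.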

Unstructured code generation

We can avoid the usual problems when generating code with a dag G_cd for unstructured control flow by using the original structure and computing some additional information about the origin of the new guards. This information can be computed during code generation or when the guards are created. If a guard is the only predecessor of its successors, the ordering and structure selection for structured control flow can be used. For guards that have successors with multiple predecessors, gotos are generated.

The key insight is that although a node can be control dependent on many nodes, only one of these dependences may be from a structured construct. Observe that in a connected subpart of G_cd, when guards are created from gotos outside the partition into the subpart, the guards with the highest order numbers will be generated first. One or two gotos may result. When a goto will result in a guarded goto and a structured construct, care is taken to generate the goto first. In this case, the node with the larger relative number between the two guards will be selected, and a goto for it is generated.

The recursive generation of successors and their descendants must choose the lowest numbered successor to generate first. In structured code this is guaranteed to be the true branch, but with an if-goto the false branch is lower. In structured code, the generation of successors is immediately preceded by their one and only predecessor. In unstructured code, to ensure all control dependences are satisfied, the recursion must cease if a node has other predecessors that have not yet been generated. When there are multiple gotos, this situation may arise.

Returning to the earlier example and applying code generation results in the code below.

      DO I = 1, N
         EV1(I) = p1
S1       IF (EV1(I) .EQ. TRUE) GOTO S6
S2       ...
S6       CONTINUE
      ENDDO
      DO I = 1, N
         IF (EV1(I) .EQ. FALSE) GOTO P2
         IF (EV1(I) .EQ. TRUE) THEN
S4          IF (p4) GOTO P2
S5          ...
P2          CONTINUE
         ENDIF
      ENDDO

Notice that when the second partition is generated, the goto is generated first. The guards for S4 and S5 have the same order number (that of S1), but because S1 was a goto, the jump guarding S5 is generated first. Then S4's guard, S4, and S5 are generated. Here and in the next example there are jumps into structured constructs. Although these jumps are nonstandard Fortran, some compilers accept them, and regardless, they can be implemented with gotos.

Finally, consider the example below with a distribution partition {S1, S2} and {S3, S4, S5, S6}, where nodes are control dependent on more than one predecessor.

Example:

      DO I = 1, N
S1       IF (p1) GOTO 10
S2       IF (p2) THEN
S3          ...
S4          ...
         ELSE
S5          ...
S6          ...
         ENDIF
   10    CONTINUE
      ENDDO

[The figure also contained this loop's control dependence graph G_cd, which did not survive extraction; the exit label is shown schematically as 10.]

Distribution, restructuring, label renaming, and code generation performed on the above results in the following code:

      DO I = 1, N
         EV2(I) = FALSE
S1       EV1(I) = p1
         IF (EV1(I) .EQ. TRUE) GOTO P1
S2       EV2(I) = p2
P1       CONTINUE
      ENDDO
      DO I = 1, N
         IF (EV1(I) .EQ. TRUE) GOTO 10
         IF (EV2(I) .EQ. TRUE) THEN
S3          ...
S4          ...
         ELSE IF (EV2(I) .EQ. FALSE) THEN
S5          ...
S6          ...
         ENDIF
   10    CONTINUE
      ENDDO

Other transformations

We now describe the extensions and algorithms for a selection of other loop transformations.

Loop skewing

Loop skewing is always safe, regardless of the type of control flow, because it does not change the order in which array memory locations are accessed. It only changes the shape of the iteration space (see the earlier discussion of the iteration space).
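A minimal Python sketch (hypothetical; skew amounts and bounds are illustrative) shows why skewing is always safe: the visited iteration points, and their order, are unchanged.

```python
def original_pairs(n, m):
    # Rectangular iteration space, visited row by row.
    return [(i, j) for i in range(n) for j in range(m)]

def skewed_pairs(n, m):
    # Inner loop skewed by the outer index: jj runs from i to i+m-1,
    # and the body recovers j as jj - i. Same points, same order.
    pairs = []
    for i in range(n):
        for jj in range(i, i + m):
            pairs.append((i, jj - i))
    return pairs

assert skewed_pairs(3, 4) == original_pairs(3, 4)
```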

Loop reversal

Loop reversal reverses the order of loop iterations. In the absence of control flow, it is safe if there are no loop-carried data dependences. This safety test is easily extended to handle control dependences. If the control dependences are loop-independent, i.e., they are completely internal to the loop nest, then they do not inhibit loop reversal. However, if they are loop-carried, i.e., exit branches, then loop reversal is not safe.
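The extended safety test can be sketched as a small predicate (hypothetical Python; direction entries are the dependence direction-vector entries for the candidate loop):

```python
def reversal_safe(data_direction_entries, has_exit_branch):
    """Reversal is safe when the loop carries no data dependence
    (every direction entry for this loop is '=') and carries no
    control dependence (no exit branch)."""
    return (not has_exit_branch
            and all(d == "=" for d in data_direction_entries))

assert reversal_safe(["=", "="], has_exit_branch=False)
assert not reversal_safe(["=", "<"], has_exit_branch=False)
assert not reversal_safe(["=", "="], has_exit_branch=True)
```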

Loop permutation

Because loop permutations may be performed as a series of pairwise interchanges, the rest of this discussion is simplified by focusing on loop interchange. Loop interchange is safe if it does not reverse the order of execution of the source and sink of any data dependence. By examining the direction vectors for all dependences carried on the outer loop, one may determine if there exists any data dependence with a direction vector of the form (<, >), which would be reversed by loop interchange. These data dependences are called interchange-preventing.

First, consider nested loops containing internal branching. The tests for loop permutation are only concerned with preserving the original flow of values, and although internal branching affects this flow, data dependence fully characterizes it. Therefore, the existing safety test suffices for this case.

An exit branch completely out of a nest inhibits permutation of any loops in the nest. Such permutation might result in executing too many iterations of some loops and too few of others. In a more restrictive setting, where the exit branch is only out of an inner subset of the nest, permutation on the outer subset may be possible. A simple way to understand this effect is to think of it as a direction vector. Consider exit branches out of the k inner loops of a perfect loop nest of depth n to have a control dependence direction vector whose first n-k entries are "=" and whose remaining k entries are carried ("<"). This direction vector indicates that loop levels n-k+1 to n may not be moved, but that levels 1 to n-k are free to be permuted.
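The interchange-preventing test for a pair of loops can be sketched directly (hypothetical Python; each dependence is a direction vector over the two loops being interchanged):

```python
def interchange_safe(direction_vectors):
    """Interchange of two loops is unsafe if any dependence has a
    direction vector of the form ('<', '>'), since interchange would
    reverse the order of its source and sink."""
    return all(not (v[0] == "<" and v[1] == ">") for v in direction_vectors)

assert interchange_safe([("<", "<"), ("=", "<")])
assert not interchange_safe([("<", ">")])
```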

Strip mining

Because strip mining does not change the loop body or the iteration space, no additional mechanisms are needed when the loop body contains any type of control flow.
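A quick Python sketch (hypothetical strip size) confirms that strip mining only reorganizes the loop control, not the iteration space or its order:

```python
def strip_mined_indices(n, strip):
    """Split one loop of n iterations into strips of the given size;
    min() handles a final partial strip when strip does not divide n."""
    idx = []
    for lo in range(0, n, strip):
        for i in range(lo, min(lo + strip, n)):
            idx.append(i)
    return idx

assert strip_mined_indices(10, 4) == list(range(10))
```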

Privatization

Any variable can be made private to a loop if it is defined and used only within the loop, and it is always defined before it is used. To determine if these conditions hold for scalar variables requires general dataflow information. When testing these conditions for arrays, dataflow and dependence information are required. General dataflow analysis is sophisticated enough to determine these conditions for scalars in loops containing control flow [CF].

Scalar expansion

In vectorization, scalar expansion is often preferable to privatization. However, it is not clear in many cases which of the two is preferred. Scalar expansion has similar but slightly less restrictive constraints than scalar privatization. Scalar expansion also requires that the variable be defined before it is used on all iterations of the loop, but it may still be live outside of the loop. Consider the following example of scalar expansion.

      DO I = 1, N        becomes      DO I = 1, N
         T = A(I)                        T$(I) = A(I)
         ... = T                         ... = T$(I)
      ENDDO                           ENDDO
      ... = T                         T = T$(N)
                                      ... = T

Notice that the value stored on the last iteration of the loop, T$(N), must be stored back into the scalar T. If there is no use of T outside the loop, this is not necessary. Again, dataflow analysis is able to determine these conditions in the loop, and if the last value is always stored back, live analysis of T is unnecessary.
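The same expansion with a store-back can be sketched in Python (hypothetical; the expanded array T$ becomes the list tx):

```python
def scalar_version(a, n):
    t = None
    uses = []
    for i in range(n):
        t = a[i]            # T = A(I): defined before used, every iteration
        uses.append(t)      # a use of T inside the loop
    return t, uses          # T is live after the loop

def expanded_version(a, n):
    tx = [None] * n         # T expanded into an array, one slot per iteration
    uses = []
    for i in range(n):
        tx[i] = a[i]
        uses.append(tx[i])
    t = tx[n - 1]           # store back the last value, since T is live after
    return t, uses

assert scalar_version([3, 1, 4], 3) == expanded_version([3, 1, 4], 3)
```

With the references to T replaced by per-iteration slots, the loop iterations no longer compete for one memory location and can run in parallel.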

Loop fusion

Loop fusion places the bodies of two adjacent loops with the same number of iterations into a single loop [AC]. Fusion is safe for two loops l1 and l2 if it does not result in values flowing from statements originally in l2 back into statements originally in l1, and vice versa. The simple test for safety performs dependence testing on the loop bodies as if they were in a single loop. Each forward dependence originally between l1 and l2 is tested. Fusion is unsafe if any dependences are reversed, becoming backward loop-carried dependences in the fused loop.

If either loop contains internal branching, then the same test for safety is correct, because the control dependences have the same effect in the fused code as in the original.

(Footnote: Privatization could also utilize a store back, allowing it to be applicable even when loop definitions reach outside the loop.)
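The fusion safety test can be sketched as a predicate (hypothetical Python; each forward dependence is a pair of the source iteration in l1 and the sink iteration in l2, viewed as if the bodies were already in one loop):

```python
def fusion_safe(forward_deps):
    """Fusion is unsafe if any forward dependence from l1 to l2 would be
    reversed, i.e. the sink iteration would execute before the source
    iteration in the fused loop, becoming a backward carried dependence."""
    return all(src <= sink for src, sink in forward_deps)

assert fusion_safe([(0, 0), (1, 2)])      # forward or loop-independent: safe
assert not fusion_safe([(2, 1)])          # would reverse: fusion unsafe
```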

Exit branches

Fusion is unsafe if there are exits out of the first loop which bypass execution of the second, so that the second loop header is control dependent upon the exit branches. The control dependences indicate that every exit test must execute before any of the statements in the second loop may be determined to execute.

If there are no exit branches out of the first loop and there is an exit branch in the second loop, it is possible to fuse correctly by restructuring the fused loops based on control dependence. The test for safety in this case need only determine if the data dependences prevent fusion. The restructuring step needs to replicate the control dependences for the second loop in the fused loop. Consider the fusion example below and its corresponding control dependence graph. Notice that the exit branches in the second loop result in a cyclic control dependence on the header. To maintain this control dependence structure for the statements in the second loop, an execution variable ev is inserted and the code is slightly restructured. A scalar execution variable ev records which exit branch, if any, is taken. Outside the loop it is initialized to 0, which means that no exit branch has been taken. In the fused loop, the execution variable is assigned the label of the branch at the point an exit branch would be taken. The exit branch (goto label) is replaced with a goto to the end of the loop (a continue in our example). The restructuring is completed by mimicking the original control dependence structure on the do. An if statement is inserted which dominates the execution of all the statements originating in the second nest. The test for the if is false when an exit branch was taken on a previous iteration.
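This restructuring can be illustrated with a Python analogue (hypothetical; the dissertation's example is Fortran). The exit branch of the second loop becomes an execution-variable assignment, and the guard on the second loop's statements mimics the original control dependence on its header.

```python
def unfused(a, n):
    # First loop has no exit branch; the second contains one.
    log = []
    for i in range(n):
        log.append(("S1", i))
    for i in range(n):
        log.append(("S3", i))
        if a[i] < 0:              # exit branch out of the second loop
            return log
    return log

def fused(a, n):
    ev = 0                        # 0 means no exit branch taken yet
    log = []
    for i in range(n):
        log.append(("S1", i))
        if ev == 0:               # guard replicating the second DO's control dep.
            log.append(("S3", i))
            if a[i] < 0:
                ev = 1            # record the branch; keep iterating for S1
    return log
```

Fusion interleaves the statement instances, so the two versions agree up to that reordering; comparing sorted logs shows the same instances execute.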

Loop peeling

Peeling takes the statements in one or more iterations of a loop, executes them outside of the loop, and adjusts the loop bounds accordingly. Peeling may be performed on the first or last iteration. A slightly more general form of peeling, index set splitting, places a number of peeled iterations in a preloop or a postloop. The considerations that arise from all of these are basically equivalent, so consider peeling the first iteration of a loop, as seen below.

      DO I = lb, ub, ss        becomes      S(lb)
         S(I)                               DO I = lb+ss, ub, ss
      ENDDO                                    S(I)
                                            ENDDO

Example: Fusion

      DO I = 1, N
S1       ...
S2       ...
      ENDDO
      DO I = 1, N
S3       ...
S4       ...
S5       IF (test1) GOTO ex1
S6       IF (test2) GOTO ex2
      ENDDO
S8    ...
ex1 S9   ...
ex2 S10  ...

becomes

      EV = 0
      DO I = 1, N
S1       ...
S2       ...
         IF (EV .EQ. 0) THEN
S3          ...
S4          ...
            IF (test1) THEN
               EV = ex1
               GOTO 10
            ENDIF
            IF (test2) THEN
               EV = ex2
               GOTO 10
            ENDIF
         ENDIF
   10    CONTINUE
      ENDDO
      IF (EV .NE. 0) GOTO EV
S8    ...
ex1 S9   ...
ex2 S10  ...

Control dependence graph

[Figure: the control dependence graph for the fusion example, showing the two DO headers, statements S1 through S6, and the cyclic control dependence that the exit branches induce on the second header. The graph rendering did not survive extraction.]

Peeling the first iteration replicates every statement in the loop, with every occurrence of the induction variable replaced by the loop's lower bound. The new loop lower bound is the original lower bound plus the step size.

Peeling is always legal for loops with or without control flow because it does not change the order of statement execution. However, because peeling replicates statements, some care must be taken when labeled statements (due to branching) are present. Each peeled statement that is labeled must receive a new, unused label, and all references to that label in the peeled statements must also be changed to reflect the new label. Of course, the destinations of exit branches should remain as they were.
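The steps above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: statements are modeled as dicts with an optional label and branch target, and the label pool starting at 1000 is an assumption of the sketch.

```python
# Sketch of first-iteration peeling with label renaming (the statement
# model -- dicts with optional "label" and branch "target" fields -- is
# ours, not the dissertation's implementation).
import re

def peel_first_iteration(body, ivar, lb, step, loop_labels):
    """Return (peeled statements, new loop lower bound)."""
    fresh = iter(range(1000, 2000))        # source of unused labels
    renamed = {}                           # old label -> fresh label

    # Every labeled, replicated statement gets a new, unused label.
    for s in body:
        if s.get("label") in loop_labels:
            renamed[s["label"]] = next(fresh)

    peeled = []
    for s in body:
        t = dict(s)
        # Replace each occurrence of the induction variable by the loop's
        # lower bound (word-boundary match, so "I" does not hit "IF").
        t["text"] = re.sub(rf"\b{ivar}\b", str(lb), t["text"])
        if t.get("label") in renamed:
            t["label"] = renamed[t["label"]]
        # Redirect branches to peeled labels; exit branches (targets
        # outside the loop) keep their original destinations.
        if t.get("target") in renamed:
            t["target"] = renamed[t["target"]]
        peeled.append(t)

    # The new loop's lower bound is the original plus the step size.
    return peeled, lb + step
```

Branches whose targets lie outside `loop_labels` are treated as exit branches and left untouched, matching the rule stated above.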

Related work

One approach taken in automatic vectorizers when loops contain conditional control flow is to convert control dependences into data dependences using a technique called if-conversion [All, AKPW]. If-conversion is theoretically appealing because it allows for a unified treatment of data dependence without control dependences, and it has been used successfully in a number of vectorization systems [KKLW, SK]. However, it has several drawbacks.

In its original form, if-conversion is intractable and may introduce an exponential number of new logical arrays to record control flow decisions. Unfortunately, even when applied in more limited circumstances, it results in substantial increases in code space used for holding the results of conditionals. In addition, after if-conversion has been performed, it is not easy to reconstruct the original program, or even efficient branching code, if vectorization fails. Another concern is that the transformed code no longer resembles the original. Even though many other transformations have this problem to some degree, if-conversion typically obliterates the original program. These drawbacks are exacerbated in an interactive environment and are significant enough that other solutions have been sought.
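The essence of if-conversion can be illustrated with a hypothetical example (ours, not taken from the dissertation): the branch is replaced by a logical guard array, so the control dependence on the test becomes a data dependence on the guard.

```python
# Illustration of if-conversion semantics: the branch in `original` is
# replaced in `if_converted` by a logical guard array, turning control
# dependence into data dependence. Hypothetical example, not the
# dissertation's code.

def original(a, b):
    # DO I: IF (b(i) > 0) THEN a(i) = a(i) + 1
    for i in range(len(a)):
        if b[i] > 0:
            a[i] = a[i] + 1
    return a

def if_converted(a, b):
    # Pass 1: record every control flow decision in a guard array.
    guard = [b[i] > 0 for i in range(len(a))]
    # Pass 2: each statement executes under its guard -- branch-free,
    # and hence a candidate for vectorization.
    for i in range(len(a)):
        a[i] = a[i] + 1 if guard[i] else a[i]
    return a
```

The guard array is exactly the extra storage the text criticizes: one logical value per iteration per conditional, with the original branch structure discarded.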

An alternative approach when control flow is present uses explicit control and data dependences. In his dissertation, Towle develops techniques for vectorizing programs with control flow using loop distribution, scalar expansion, and replication, based on data and control dependence information [Tow]. However, his definition of control dependence embeds, but does not extract, the essential control relationship between two statements: that is, whether the execution of one statement directly determines the execution of another.

Control dependence as formulated by Ferrante, Ottenstein, and Warren is clean, complete, and extracts this essential relationship [FOW]. They include control and data dependences in the program dependence graph (pdg), and our approach uses the same basis. Their paper also discusses several optimizing transformations performed on the pdg: node splitting, code motion, loop fusion, and loop peeling. However, their algorithms are applicable only for structured control flow. Neither Towle nor Ferrante et al. present loop distribution.

Ferrante, Mace, and Simons present related algorithms whose goals are to avoid replication and branch variables when possible [FM, FMS]. Their code generation algorithms convert parallel programs into sequential ones and, like ours, are based on G_cd. They discuss three transformations that restructure control flow: loop fusion, dead code elimination, and branch deletion.

Callahan and Kalem present two methods for generating loop distributions in the presence of control flow [CKa]. The first, which works for structured or unstructured control flow, replicates the control flow of the original loop in each of the new loops by using G_f. Branch variables are inserted to record decisions made in one loop and used in other loops. An additional pass then trims the new loops of any empty control flow. Dietz uses a very similar approach [Die]. These approaches have some of the same drawbacks as if-conversion, i.e., the proliferation of unnecessary guards.

Callahan and Kalem's second method, which works only for structured control flow, uses G_f, G_cd, and Boolean execution variables. Their execution variables indicate if a particular node in G_f is reached, and they are created for edges in G_cd that cross between partitions. Their execution variables are assigned true at the successor, indicating that the successor will execute, rather than recording the decision made at the predecessor. Also, one execution variable may be needed for every successor in the descendant partition. Because their code generation algorithm is based on G_f rather than G_cd, the proof of how an execution variable is used is much more difficult and is not given. Towle [Tow] and Baxter and Bauer [BB] use similar approaches for inserting conditional arrays.

The Stardent compiler distributes loops with structured control flow by keeping groups of statements with the same control flow constraints together [All]. For example, all the statements in the true branch of a block IF must stay together, so only the outer level of IF nests can be considered. This limits the effectiveness of distribution because partitions are artificially made larger, possibly by grouping parallel statements with sequential ones.

Discussion

In summary, although much attention has been paid to modeling and understanding control flow in other work, a general formulation of parallelism-enhancing transformations with arbitrary control flow was not available until now. Using control and data dependences, we have presented new and generalized versions of many important loop transformations. In particular, the algorithm for loop distribution was shown optimal and represents a significant improvement over previous algorithms.

I myself have never been able to find out precisely what feminism is: I only know that people call me a feminist whenever I express sentiments that differentiate me from a doormat. — Rebecca West

Chapter

Interprocedural Transformations

Striving for a large granularity of parallelism has a natural consequence: the compiler must look for parallelism in regions of the program that span multiple procedures. This kind of optimization is called whole program, or interprocedural, analysis and transformation. This chapter presents a new approach that enables compiler optimization of procedure calls and loop nests containing procedure calls. We introduce two interprocedural transformations, loop extraction and loop embedding, that move loops across procedure boundaries, exposing them to loop nest optimizations. We also describe the efficient support of these transformations using the interprocedural compilation system in the ParaScope parallel programming environment. These transformations are shown effective in practice on existing applications programs.

Introduction

To expose parallelism and computation for parallel architectures, the compiler must consider a statement in light of its surrounding context. Loops provide a proven source of both context and parallelism. Loops with significant amounts of computation are prime candidates for compilers seeking to make effective utilization of the available resources. Good software engineering practices encourage modularity as a way to manage program computation and complexity, and increasingly programmers are using a modular programming style. Therefore, it is natural to expect that programs will contain many procedure calls, and procedure calls in loops, and to ensure high performance compilers will need to optimize them.

Unfortunately, most conventional compilation systems abandon parallelizing optimizations on loops containing procedure calls. Two existing compilation technologies are used to overcome this problem: interprocedural analysis and interprocedural transformation.

Interprocedural analysis applies dataflow analysis techniques across procedure boundaries to enhance the effectiveness of dependence testing. Regular section analysis is a sophisticated form of interprocedural analysis which makes it possible to parallelize loops with calls (see Section ). It determines if the side effects to arrays as a result of each call are limited to nonintersecting subarrays on different loop iterations [CKb, HK].

Interprocedural transformation is the process of moving code across procedure boundaries, either as an optimization or to enable other optimizations. The most common form of interprocedural transformation is procedure inlining. Inlining substitutes the body of a called procedure for the procedure call and optimizes it as part of the calling procedure [AC].

Even though regular section analysis and inlining are frequently successful at enabling optimization, each of these methods has its limitations [HK, LYa, Hus]. Compilation time and space considerations require that regular section analysis summarize array side effects. In general, summary analysis for loop parallelization is less precise than the analysis of inlined code. On the other hand, inlining can yield an increase in code size, which may disastrously increase compile time and seriously inhibit separate compilation [CHT, RG]. Furthermore, inlining may cause a loss of precision in dependence analysis due to the complexity of subscripts that result from array parameter reshapes. For example, when the dimension size of a formal array parameter is also passed as a parameter, translating references of the formal to the actual can introduce multiplications of unknown symbolic values into subscript expressions. This situation occurs when inlining is used on the spec benchmark program matrix [BCHT].

In this chapter, a hybrid approach is developed that overcomes some of these limitations. We introduce a pair of new interprocedural transformations: loop embedding, which pushes a loop header into a procedure called within the loop, and loop extraction, which extracts the outermost loop from a procedure body into the calling procedure. However, because there is a cost for interprocedural transformations, our strategy applies them only when performance benefits are expected to result.

The performance benefit of these transformations comes from using the exposed loops in high-payoff intraprocedural loop optimizations. Any intraprocedural transformation that requires loop nests may be applicable on those provided by loop embedding and extraction. Additionally, testing the safety and profitability of some of the loop transformations across procedure boundaries requires no extension to the tests discussed in Chapters and . Extensions are needed for transformations that require dependence distance information, such as loop permutation. The intraprocedural optimizations which are extended in this chapter are loop fusion and loop permutation. These results easily generalize for other transformations, such as loop skewing [Wol] and unroll-and-jam [CCK].

As a motivating example, consider the Fortran code in Example (a). The J loop in subroutine S may safely be made parallel, but the outer I loop in subroutine P may not be. However, the amount of computation in the J loop is small relative to the I loop and may not be sufficient to make parallelization profitable. If the I loop is embedded into subroutine S, as shown in (b), the inner and outer loops may be interchanged, as shown in (c). The resulting parallel outer J loop now contains plenty of computation. As an added benefit, procedure call overhead has been reduced.

Loop embedding and loop extraction provide many of the optimization opportunities of inlining without its significant costs. Code growth of individual procedures is nominal, so compilation time is not seriously affected. Overall program growth is also moderate because multiple callers may invoke the same optimized procedure body. In addition, the compilation dependences among procedures are reduced, since the compiler controls the small amount of code movement across procedures and can easily determine if an editing change of one procedure invalidates other procedures.

Our approach to interprocedural optimization is fundamentally different from previous research in that the application of interprocedural transformations is restricted

Example : Loop embedding

(a) before transformation:

    SUBROUTINE P
    REAL A(N,N)
    INTEGER I
    DO I
       CALL S(A, I)
    ENDDO

    SUBROUTINE S(F, I)
    REAL F(N,N)
    INTEGER I, J
    PARALLEL DO J
       F(J,I) = F(J,I) ...
    ENDDO

(b) loop embedding:

    SUBROUTINE P
    REAL A(N,N)
    CALL S(A)

    SUBROUTINE S(F)
    REAL F(N,N)
    INTEGER I, J
    DO I
       PARALLEL DO J
          F(J,I) = F(J,I) ...
       ENDDO
    ENDDO

(c) loop interchange:

    SUBROUTINE P
    REAL A(N,N)
    CALL S(A)

    SUBROUTINE S(F)
    REAL F(N,N)
    INTEGER I, J
    PARALLEL DO J
       DO I
          F(J,I) = F(J,I) ...
       ENDDO
    ENDPARDO

to cases where it is expected to be profitable. This strategy, called goal-directed interprocedural optimization, avoids the costs of interprocedural optimization when it does not enable other performance-enhancing optimizations [BCHT]. Interprocedural transformations are applied as dictated by a code generation algorithm that explores possible transformations, selecting a choice that introduces parallelism and exploits data locality.

The code generator is part of an interprocedural compilation system that efficiently supports interprocedural analysis and optimization by retaining separate compilation of procedures. We first explored this type of system using a simple performance-estimation-based parallel code generation algorithm [HKT]. This chapter provides a more general framework that is integrated into a more sophisticated parallelization algorithm, discussed in Chapter . We also present experimental results to illustrate the efficacy of these transformations on application programs.

Technical background

Augmented call graph

The program representation for interprocedural transformations requires an augmented call graph, G_ac, which describes the calling relationships among procedures and loop nests. The details of the G_ac are presented in Section . Figure (a) shows an abbreviated version of the augmented call graph G_ac for the program from Example , where the solid line is a call edge and the dashed lines are nesting edges.
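The two edge kinds can be pictured with a toy model (class and method names are ours, purely illustrative): call edges link a call site to the callee, and nesting edges record which loops and calls each node contains.

```python
# Minimal sketch of an augmented call graph (illustrative toy model, not
# ParaScope's representation): call edges plus dashed "nesting" edges
# from a procedure or loop to the loops and calls it contains.

class AugmentedCallGraph:
    def __init__(self):
        self.call_edges = set()      # (call-site node, callee procedure)
        self.nest_edges = set()      # (outer node, nested node)

    def add_call(self, site, callee):
        self.call_edges.add((site, callee))

    def add_nest(self, outer, inner):
        self.nest_edges.add((outer, inner))

# Shape of the motivating example: P nests loop I, which contains the
# call to S; S nests the parallel loop J.
g = AugmentedCallGraph()
g.add_nest("P", "loop I")
g.add_call("loop I", "S")
g.add_nest("S", "loop J")
```

With both edge kinds in one graph, the code generator can see, without reopening any source, that a loop in the caller directly surrounds a loop in the callee.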

Interprocedural section analysis

A regular section describes the side effects to the substructures of an array. Sections represent a restricted set of the most commonly occurring array access patterns: single elements, rows, columns, grids, and their higher dimensional analogs. This restriction on the shapes assists in making the implementation efficient [HK]. The representation of the dimensions of a particular array variable may take one of three forms:

an invocation-invariant expression, representing a single element;

a range, consisting of a lower bound, an upper bound, and a step size; or

the special element, signifying that all of this dimension may be affected.
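The three dimension forms can be modeled concretely. This is a sketch under our own naming (Elem, Range, ALL, Section are illustrative, not ParaScope identifiers):

```python
# Sketch of a regular-section representation for one array: each dimension
# is a single element, a (lb, ub, step) range, or "all" (the special
# element). Names are ours, not ParaScope's.

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Elem:          # invocation-invariant single element, e.g. the I in A(J, I)
    expr: str

@dataclass(frozen=True)
class Range:         # lower bound, upper bound, step size
    lb: str
    ub: str
    step: str

ALL = "all"          # the special element: whole dimension may be affected

Dim = Union[Elem, Range, str]

@dataclass(frozen=True)
class Section:
    dims: tuple      # one Dim per array dimension

# A plausible modified section for A at a call inside an I loop:
# the J dimension swept as a range, the second dimension fixed at I.
mod_A = Section(dims=(Range("1", "N", "1"), Elem("I")))
```

Restricting each dimension to one of these three shapes is what keeps merging and comparing sections cheap.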

[Figure : Sections and data access descriptors for Example . (a) The augmented call graph G_ac: P with its I loop calling S with its J loop. (b) The referenced and modified sections of A. (c) The dad annotations, e.g. A(J, I).]

Sections are separated into modified and referenced sets. The sections for Example are shown in Figure (b).

By using sections, the problem of locating dependences on procedure calls is simplified to the problem of finding dependences on ordinary statements. The modified and referenced subsections for the call appear to the dependence analyzer like the left- and right-hand sides of an assignment, respectively. For single-element subsections, dependence testing is the same as it would be for any other variable access. For subsections that contain one or more dimensions with ranges, the dependence analyzer simulates DO loops for each of the range dimensions, with the lower bound, upper bound, and step size of the loop corresponding to those of the range.
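The simulation idea can be made concrete with a deliberately naive sketch (ours; the real analyzer reasons symbolically rather than enumerating): each range dimension is expanded like a DO loop, and a dependence may exist when a modified section and another access touch a common element.

```python
# Toy sketch of section-based dependence testing: every range dimension is
# "simulated" as a DO loop by enumerating it, and a potential dependence
# exists when a modified and a referenced section overlap. Illustrative
# only -- the real test is symbolic, not enumerative.

from itertools import product

def touched(section):
    """Expand a section -- a tuple whose entries are either a single index
    (int) or an (lb, ub, step) range -- into the set of element tuples."""
    dims = []
    for d in section:
        if isinstance(d, tuple):               # range dimension: simulate DO
            lb, ub, step = d
            dims.append(range(lb, ub + 1, step))
        else:                                  # single-element dimension
            dims.append([d])
    return set(product(*dims))

def sections_conflict(mod, ref):
    # The modified/referenced sections play the roles of the left- and
    # right-hand sides of an assignment.
    return not touched(mod).isdisjoint(touched(ref))

# Column 2 written, column 3 read: the sections are disjoint.
mod = ((1, 4, 1), 2)      # A(1:4, 2)
ref = ((1, 4, 1), 3)      # A(1:4, 3)
```

Disjoint sections mean the call-carried accesses cannot conflict, which is exactly the condition that lets a loop containing the call be parallelized.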

Regular sections enable dependence analysis to determine if loops containing calls are parallel, and they are sufficient to determine the safety of intraprocedural transformations on a loop nest containing calls. However, a more precise version of sections is needed to determine the safety of intraprocedural transformations which involve loops in different procedures, before loop embedding or extraction places them in the same procedure. These are similar to data access descriptors, or dads, which provide detailed information about references and how the loops in a called procedure access them [BK]. Our version of dads is a little more precise because we have the loop header information in G_ac, but for the purposes of this discussion dads evoke the appropriate meaning.

A dad identifies the section of an array accessed and the order of that access, in terms of each enclosing loop's index expression. It also indicates the relative ordering of the accesses. We consider dads as annotations of sections. In addition, the sections are marked as exact or inexact for the purposes of dependence testing used in determining the safety of intraprocedural transformations in the caller. The regular section information is sufficient to test dependence on loops containing calls. To test dependence on the loops in the call at the call site demands that the dad be exact, in the following sense. An exact referenced or modified section must be described in terms of a constant or any surrounding loops. It must also meet one of the following criteria: either it is not the result of a merge, or, if it is the result of a merge, either the merge was between accesses that overlap exactly and completely, or the accesses are completely disjoint. Figure (c) illustrates the dad annotations for the program in Example .

Support for interprocedural optimization

In this section, we present the compilation system of the ParaScope Programming Environment [CCH+, CKTa]. This system was designed for the efficient support of interprocedural analysis and optimization. The tools in ParaScope cooperate to enable the compilation system to perform interprocedural analysis without direct examination of source code. This information is then used in code generation to make decisions about interprocedural optimizations. The code generator only examines the dependence graph for the procedure currently being compiled, not the graph for the entire program. In addition, ParaScope employs recompilation analysis after program changes to minimize program reanalysis [CKTb]. This system was originally intended for scalar compilation; this section extends the ParaScope system to support parallel code generation.

The ParaScope compilation system

Interprocedural analysis in the ParaScope compilation system consists of two principal phases. The first takes place prior to compilation. At the end of each editing session, the immediate interprocedural effects of a procedure are determined and stored. For example, this information includes the array sections of global variables and call-by-reference formal parameters that are locally modified or referenced in the procedure. The procedure's calling interface is also determined in this phase. It includes descriptions of the calls and loops in the procedure and their relative positions. In this way, the information needed from each module of source code is available at all times and need not be derived on every compilation.

[Figure : Information flow for interprocedural transformations. RSD analysis, dependence analysis, and code generation each annotate the augmented call graph in turn, adding RSDs, dependence graphs with marked parallel loops, and slices.]

Interprocedural optimization is orchestrated by the program compiler, a tool that manages and provides information about the whole program [CKTa, Hal]. The program compiler first builds the augmented call graph described in Section . The program compiler then traverses the augmented call graph, performing interprocedural analysis and subsequently code generation. Conceptually, program compilation consists of three principal phases: interprocedural analysis, dependence analysis, and planning and code generation.

Interprocedural analysis

The program compiler calculates interprocedural information over the augmented call graph. First, the information collected during editing is recovered from the database and associated with the appropriate nodes and edges in the call graph. This information is then propagated in a top-down or bottom-up pass over the nodes in the call graph, depending on the interprocedural problem. Section analysis is performed at this time. Interprocedural constant propagation and symbolic analysis are also performed, as these greatly increase the precision of subsequent dependence analysis.

Dependence analysis

Interprocedural information is then made available to dependence analysis, which is performed separately for each procedure. Dependence analysis yields dependence edges that are placed in the dependence graph. If the source or sink of a dependence is a call site, a section annotates it. The section may more accurately describe the portion of the array involved in the dependence. Dependence analysis also distinguishes parallel loops in the augmented call graph. Dependence analysis is separated from code generation for an important reason: it provides the code generator knowledge about each procedure without reexamining its source or dependence graph.

Planning and code generation

The final phase of the program compiler determines where interprocedural optimization is estimated to be profitable. Planning is important to interprocedural optimization, since unnecessary transformations may lead to significant compile-time costs without any execution-time benefit. To determine the safety of transformations, the dependence graph and sections are sufficient.

The relationship among the compilation phases is depicted in Figure . Each step adds annotations to the call graph that are used by the next phase. Following program transformation, each procedure is separately compiled. Interprocedural information for a procedure is provided to the compiler to enhance intraprocedural optimization.

Procedure cloning

Procedures optimized with loop embedding or extraction may have multiple callers, and an optimization valid for one caller may not be valid for another. To avoid code growth, multiple callers should share the same version of the optimized procedure whenever possible. This technique of generating multiple copies of a procedure and tailoring the copies to their calling environments is called procedure cloning [CKTa, CHK].

Recompilation analysis

A unique part of the ParaScope compilation system is its recompilation analysis, which avoids unnecessary recompilation after program edits. Recompilation analysis tests that interprocedural facts used to optimize a procedure have not been invalidated by editing changes [CKTb, BC, BCKT]. To extend recompilation analysis for interprocedural transformations, a few additions are needed. When an interprocedural transformation is performed, a description of the interprocedural transformations annotates the nodes and edges in the augmented call graph. On subsequent compilations, this information indicates to the program compiler that the same tests used initially to determine the safety of the transformations should be reapplied.

To determine if interprocedural transformations are still safe, the new and old sections are first compared, in most cases avoiding examination of the dependence graph. As a result, dependence analysis is only applied to procedures where it is no longer valid, allowing separate compilation to be preserved. The recompilation process after interprocedural transformations have been applied is described in more detail elsewhere [Hal].

Interprocedural transformation

Loop extraction and loop embedding expose the loop structure to optimization without incurring the costs of inlining. Just as inlining is always safe, these transformations are always safe. The mechanics of performing the movement of a loop header are detailed below. If moving additional statements is desired, it may be performed with the techniques developed for inlining.

Loop extraction

Loop extraction moves a loop that encloses the body of its procedure p outward into one of its callers. This optimization may be thought of as partial inlining. The new version of p no longer contains the loop. The caller now contains a new loop header surrounding the call to p. The index variable of the loop, originally a local in p, becomes a formal parameter and is passed at the call. The calling procedure creates a new variable to serve as the loop index, avoiding name conflicts. It is always safe to extract an outer enclosing loop from a procedure. Example (a) contains a loop with two calls to procedure S, and (b) contains the result after loop extraction. Note that (b) has an additional variable declaration for the loop index J in P. It is included in the actual parameter list for S. The J loops may now be fused and interchanged to improve performance, as in Example (c).

Loop embedding

Loop embedding moves a loop that contains a procedure call into the called procedure, and is the dual of loop extraction. The new version of the called procedure requires a

Example

(a) before transformation:

    SUBROUTINE P(A)
    REAL A(N,N), B(N,N)
    INTEGER I
    DO I
       CALL S(A, I)
       CALL S(B, I)
    ENDDO

    SUBROUTINE S(F, I)
    REAL F(N,N)
    INTEGER I, J
    DO J
       F(J,I) = F(J,I) ...
    ENDDO

(b) loop extraction:

    SUBROUTINE P(A)
    REAL A(N,N), B(N,N)
    INTEGER I, J
    DO I
       DO J
          CALL S(A, I, J)
       ENDDO
       DO J
          CALL S(B, I, J)
       ENDDO
    ENDDO

    SUBROUTINE S(F, I, J)
    REAL F(N,N)
    INTEGER I, J
    F(J,I) = F(J,I) ...

(c) loop fusion and interchange:

    SUBROUTINE P(A)
    REAL A(N,N), B(N,N)
    INTEGER I, J
    DO J
       DO I
          CALL S(A, I, J)
          CALL S(B, I, J)
       ENDDO
    ENDDO

    SUBROUTINE S(F, I, J)
    REAL F(N,N)
    INTEGER I, J
    F(J,I) = F(J,I) ...

new local variable for the loop's index variable. If a name conflict exists, a new name for the loop's index variable must be created. This transformation is illustrated in Example .

If the index variable of the loop to be embedded appears in an actual parameter at the call site, this parameter is no longer correctly defined. To remedy this problem, the formal parameters in the call that depend on it must be assigned and computed in the newly embedded loop. In the simplest case, an index variable i is passed to a formal f. Here, f should be assigned i on every iteration of the embedded loop, prior to the rest of the loop body.

If an actual is an array reference whose subscript expression contains the loop index variable, the actual passed at the call becomes simply the array name. In the called procedure, the original subscript expression for each dimension of the actual is added to the subscript expression for the corresponding dimension of the formal at each reference to the formal. If the array parameter is reshaped across the call, this translation is more complicated. The array formal is replaced by a new array with the same shape as the actual. The references to the variable are translated by linearizing the formal's subscript expressions and then converting to the dimensions of the new array [BC]. Finally, the subscript expressions for each dimension of the actual are added to those for the translated reference. This method is also the one used in our implementation of inlining.
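The linearize-then-convert step for reshaped parameters can be sketched numerically. This toy version (ours, not the compiler's) assumes 1-based, column-major arrays and constant subscripts; the actual transformation manipulates symbolic subscript expressions.

```python
# Toy numeric sketch of the reshaped-parameter translation: linearize the
# formal's reference, re-express it in the actual's shape, then add the
# actual's per-dimension subscripts. Assumes 1-based, column-major
# (Fortran-style) constant subscripts; the real compiler works symbolically.

def linearize(subs, dims):
    """Column-major linear offset of a 1-based subscript tuple."""
    off, stride = 0, 1
    for s, d in zip(subs, dims):
        off += (s - 1) * stride
        stride *= d
    return off

def delinearize(off, dims):
    """Inverse of linearize, using the new array's dimensions."""
    subs = []
    for d in dims:
        subs.append(off % d + 1)
        off //= d
    return tuple(subs)

def translate(formal_subs, formal_dims, actual_dims, actual_subs):
    # 1. Linearize the formal's subscripts, 2. convert the offset to the
    # actual's shape, 3. add the actual's subscript offsets per dimension.
    off = linearize(formal_subs, formal_dims)
    base = delinearize(off, actual_dims)
    return tuple(b + (a - 1) for b, a in zip(base, actual_subs))
```

For example, a formal F(10) referenced as F(3), where the actual is A(10,10) passed as A(1,2), denotes the same storage as A(3,2), and `translate` reproduces that.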

Dependence updates

Because our code generator only applies loop extraction and loop embedding after safety and profitability are ensured, an update of local dependence information may not be necessary. However, if further optimization is desired, updating the dependence information is straightforward: the dependence information simply moves and translates with the loop being moved.

Embedding versus extraction

There are several factors which affect the choice between embedding and extraction during the optimization process. All things being equal, embedding loops needed for optimizations into the called procedure is preferable because it reduces procedure call overhead. However, if several loops originating from different call sites are needed to perform an optimization, extraction is required, as illustrated in Example . If an optimization uses a loop in the call and more than one loop of the caller, then loop extraction is also preferred. On the other hand, if the optimization involves the inner loop of the caller and more than one loop in the called procedure, loop embedding is preferred. The other option, for these and other more complex circumstances, is to perform loop embedding or extraction multiple times to adjoin the necessary loops.

Intraprocedural transformations

The following two sections discuss how to test for the safety of intraprocedural transformations across procedure boundaries. The tests are needed when the requisite loops are not in the same procedure, but may be placed together via embedding or extraction.

Loop fusion

When several procedure calls appear contiguously, or loops and calls are adjacent, it may be possible to extract the outer loop from the called procedures. Once loops are exposed, fusion and other optimizations may be performed, as illustrated by Example . In the algorithm checkFusion, we consider fusion of {s1, s2}, where s_i is either a call or a loop. Loop fusion is restricted in this setting in that there may not be any intervening statements between s1 and s2.

The test for fusion between two loops l1 and l2 requires the inspection of the dependence source and sink variable references in l1 and l2. If one or more of the loops is inside a call, the variable references are represented instead as the modified and referenced sections for the call. The section and its dad correspond to the loops being considered for fusion, and they are tested identically to variable references (see Section ). Unfortunately, while variable references are always exact, a section is not. If a section for a particular array is not exact and a potential dependence exists between the loop nests, fusion is conservatively assumed to be unsafe. There exists a potential dependence if an array is referenced in both nests and at least one access is a write. A more precise test could be performed by inspecting the dependence graphs for each called procedure. In practice, the more precise test may be no more successful and could introduce significant overhead.
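The fusion test just described can be sketched as executable code. The data model is ours (statements carry an iteration count; each forward dependence is marked exact or inexact and whether fusion would make it backward loop-carried):

```python
# Executable sketch of the interprocedural fusion test (our toy data
# model, not ParaScope's): fusion is safe when the headers match, every
# forward dependence between the two statements is exact, and none would
# become backward loop-carried after fusion.

def check_fusion(s1, s2, forward_deps):
    """Return True if fusing the loops of adjacent s1 and s2 is safe.

    forward_deps: dicts {src_exact, sink_exact, becomes_backward}
    describing the forward dependences from s1 into s2.
    """
    if s1["iterations"] != s2["iterations"]:
        return False                  # headers must agree to fuse
    for dep in forward_deps:
        # Sections standing in for variable references must be exact
        # to reason precisely about the dependence.
        if not (dep["src_exact"] and dep["sink_exact"]):
            return False
        # Fusion must not turn a forward dependence into a backward
        # loop-carried one.
        if dep["becomes_backward"]:
            return False
    return True
```

Any inexact section with a potential dependence makes the test answer false, mirroring the conservative assumption in the text.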

Loop permutation

Loop permutation of a loop nest rearranges the loop headers, changing the order in which the iteration space is traversed. As with fusion, the distance/direction vectors for the loops in the caller being considered in the permutation must be computable

Algorithm: Interprocedural fusion test

checkFusion(s1, s2)

Input: s1, s2, where si is a call or a loop, and s1 is adjacent to s2

Output: returns true if fusion is safe

Algorithm:
  let l1 = the loop header of s1
  let l2 = the loop header of s2
  if the number of iterations of l1 differs from that of l2
    return false
  for each forward dependence (src_s1, sink_s2)
    if src_s1 or sink_s2 is not exact
      return false
    if (src_s1, sink_s2) becomes backward loop-carried
      return false
  endfor
  return true

at the call. Again, the sections involved in dependences must be exact, or the test conservatively assumes the transformation to be unsafe. Conversely, in the call, the distance/direction vectors for the surrounding loops could be made available when the call is being optimized. This option is less appealing because other optimizations are inhibited in the caller. Regardless, if there are loops in either the caller or the called routine that do not carry any dependences, the augmented call graph reflects it, and many permutations can be shown safe without additional dependence testing.

Experimental results

This section presents significant performance improvements due to interprocedural transformation on two scientific programs, spec and ocean, taken from the Perfect Benchmarks [CKPK]. To locate opportunities for transformations, we browsed the dependences in the program using the ParaScope Editor [BKK+, KMTa, KMTb]. Using other ParaScope tools, we determined which procedures in the program contained procedure calls. We examined the procedures containing calls, looking for interesting call structures. We located adjacent calls, loops adjacent to calls, and loops containing calls which could be optimized; the entire application was not parallelized. The original and optimized programs were executed on a Sequent Symmetry S multiprocessor. Since the optimizations used differed slightly for each program, they are described separately.

Figure: Stages of preparing program versions for the experiment. Original: directives on inner loops. IPinfo: interprocedural information, directives on outer loops. IPtrans: interprocedural transformation of the program.

Spec

Spec is a fluid dynamics weather simulation that uses Fast Fourier Transforms and rapid elliptic problem solvers. In spec, loops containing calls were common, and transformations were applied to such loops. Embedding and interchange were applied to loops which contained calls to a single procedure. The remaining loops, which contained multiple procedure calls, were optimized using extraction, fusion, and interchange. These loops were found in procedures del, gloop, and gwater.

For the transformed loops, performance was measured among three possibilities: (1) no parallelization of loops containing procedure calls, (2) parallelization using interprocedural information, and (3) interprocedural information and transformations. To obtain these versions, the steps illustrated in the figure above were performed. The Original version contains directives to parallelize the loops in the leaf procedures that are invoked by the loops of interest. The IPinfo version parallelizes the loops containing calls. For the IPtrans version, we performed interprocedural transformation followed by outer loop parallelization. The parallel loops in each version were also strip mined to allow multiple consecutive iterations to execute on the same processor without synchronization. (The compiler default is to schedule each iteration of a parallel loop separately, incurring additional overhead.)

             processors                processors
             time in optimized         time in optimized
             portion      speedup      portion      speedup
Original        s                         s
IPinfo          s                         s
IPtrans         s                         s

The results reported above are the best execution times, in seconds, for the optimized portions of each version. The speedups are compared against the execution time of the optimized portion of the program on a single processor. This portion accounted for a substantial fraction of the total sequential execution time. With seven processors, the results are similar for all three versions, since each program version provided adequate parallelism and granularity for seven processors. On a larger number of processors, IPinfo was slower than the original program because the parallel outer loops had too few iterations. The parallel inner loops of Original were better matched to the number of processors because they had more iterations. The interprocedural transformation version, IPtrans, demonstrated the best performance because it combined the amount of parallelism in Original with increased granularity; the interprocedural transformations improved execution time over Original in the optimized portion. Parallelizing just these loops also yielded a speedup for the entire program.

Ocean

Ocean is a fluid dynamics ocean simulation that uses Fast Fourier Transforms. There were several places in the main routine of ocean where we extracted and fused interprocedurally adjacent loops. They were divided almost evenly between adjacent calls and loops adjacent to calls. In all cases where a loop was adjacent to a call, the loop and the loop in the called procedure differed in nesting depth. Prior to fusion, we coalesced the deeper loop nest into a single loop by linearizing the subscript expressions of its array references. Each resulting fused loop combined several parallel loops from the original program, thus increasing the granularity of parallelism.

To measure performance improvements due to interprocedural transformation, we performed steps similar to those in the figure above. Directives forced the parallelization and blocking of the individual loops in the Original version and the fused loops in IPtrans. The execution times were measured for the entire program and for just the optimized portion. The optimized execution times are shown below.

             processors
             time in optimized
             portion      speedup
Original        s
IPtrans         s

The speedups are relative to the time in the optimized portion of the sequential version of the program. The optimized code accounted for a large share of total program execution time, and for the whole program the parallelized versions achieve a speedup over the sequential execution time.

Note that IPtrans achieved a further improvement over Original in the optimized portion. This improvement resulted from increasing the granularity of parallel loops and reducing the amount of synchronization. It is also possible that fusion reduced the cost of memory accesses: often the fused loops were iterating over the same elements of an array. These groups of loops were not the only opportunities for interprocedural fusion; there were many other cases where fusion was safe but the numbers of iterations were not identical. Using a more sophisticated fusion algorithm might result in even better execution time improvements.

Related work

While the idea of interprocedural optimization is not new, previous work on interprocedural optimization for parallelization has limited its consideration to inline substitution [AJ, CHT, Hus] and interprocedural analysis of array side effects [BK, BC, CKb, HK, HHLa, HHLb, LYa, LYb, TIF]. The various approaches to array side-effect analysis must make a tradeoff between precision and efficiency. Section analysis, used here, loses precision because it only represents a selection of array substructures, and it merges the sections for all references to a variable into a single section. However, these properties make it efficient enough to be widely used by code generation. In addition, experiments with regular section analysis on the Linpack library demonstrated a sizable reduction in parallelism-inhibiting dependences, allowing loops containing calls to be parallelized [HK]. Comparing these numbers against published results of more precise techniques, there was no benefit to be gained by the increased precision of the other techniques [LYa, LYb, TIF].

Discussion

The usefulness of this approach has been illustrated on the Perfect Benchmark programs spec and ocean. Taken as a whole, the results indicate that providing freedom to the code generator becomes more important as the number of processors increases. Effectively utilizing more processors requires more parallelism in the code. This behavior was particularly evident in spec, where the benefits of interprocedural transformation increased with the number of processors.

Although it may be argued that scientific programs structured in a modular fashion are rare in practice, we believe that this is an artifact of the inability of previous compilers to perform interprocedural optimizations of the kind described here. Increasing numbers of scientific programmers are using a modular programming style and cannot afford to pay a performance penalty. By providing compiler support to efficiently optimize procedures containing calls, we encourage the use of modular programming, which in turn will make these transformations applicable to a wider range of programs. These techniques enable a desirable programming style which uses procedures that can be effectively parallelized.

This chapter and the previous one provide algorithmic support for applying transformations to entire applications. In particular, program optimization is enabled for loops containing control flow, and is not inhibited when loop nests span procedure boundaries. We now turn to the proper application of these transformations to effect excellent parallel performance.

Chapter

Optimizing for Parallelism and Data Locality

Previous research has used program transformation to introduce parallelism and to exploit data locality. Unfortunately, these two objectives have usually been considered independently. This chapter explores the tradeoffs between effectively utilizing parallelism and the memory hierarchy on shared-memory multiprocessors. We present a simple but surprisingly accurate memory model to determine cache line reuse, from both multiple accesses to the same memory location and from consecutive memory accesses. The model is used in memory optimizing and loop parallelization algorithms that effectively exploit data locality and parallelism in concert. We demonstrate the efficacy of this approach with very encouraging experimental results. This algorithm forms the core of our parallel code generation strategy.

Introduction

Transformations to exploit parallelism and to improve data locality are two of the most valuable compiler techniques in use today. Independently, each of these optimizations has been shown to result in dramatic improvements. This chapter seeks to combine the benefits of both by using a simple memory model to drive optimizations for data locality and parallelism. By unifying the treatment of these optimizations, we are able to place loops with data reuse on inner loops and to introduce parallelism for outer loops. Our strategy produces data locality at the innermost loops, where it is most likely to be exploited by the hardware, and places parallelism at the outermost loop, where it is most effective. If these two goals conflict, we present an algorithm that usually reaps the benefits of both.

Optimizing data locality is necessarily both architecture and language dependent. However, the reuse of memory locations and the consecutive access of adjacent memory locations form the foundation of most memory hierarchy optimizations. Reuse of a particular memory reference for arrays can be discovered using data-dependence analysis. However, reuse from consecutive accesses, often called unit stride access, is a significant source of reuse that can easily be determined when the storage order of arrays and the cache line size are known. In this chapter we introduce a simple model for estimating the cost, in memory references, of executing a given loop nest. The principal advantage of this model over previous models is that it takes into account cache reuse due to consecutive accesses to the same cache line. We show how this model can be used to exploit data locality at multiple levels via loop permutation.

Our algorithm first uses the memory model to find a loop organization that exploits data locality. It then seeks to parallelize the outermost loop, or a parallel loop that can be positioned outermost. Given sufficient iterations, it then strip mines the loop into two loops, such that one loop is used to achieve locality and the other is used to introduce parallelism.

Matrix multiply example

As an example of this process, consider the ubiquitous matrix multiply:

      DO J = 1, N
        DO K = 1, N
          DO I = 1, N
            C(I,J) = C(I,J) + A(I,K) * B(K,J)

Assuming arrays are stored such that columns of the arrays are in consecutive memory locations (i.e., column-major order), this loop organization exploits data locality in the following manner. The consecutive accesses on the inner I loop to C(I,J) and A(I,K) provide an opportunity for cache line reuse when the cache line size is greater than one element. There is also a loop-invariant reuse of B(K,J) on the I loop. Additionally, the J and the I loops can be parallel. However, if the number of processors P is less than the number of iterations of either loop, it is not profitable to utilize both levels of parallelism at once, due to additional scheduling overhead. A better execution time would result from maximizing the granularity of one level of the parallelism and then matching it to the machine. If N does not exceed P, selecting J to be executed in parallel preserves data locality and introduces a single level of parallelism with maximum granularity:

      PARALLEL DO J = 1, N
        DO K = 1, N
          DO I = 1, N
            C(I,J) = C(I,J) + A(I,K) * B(K,J)

However, if the number of loop iterations is greater than the number of processors (N > P), it is often useful to combine independent iterations into a single parallel task to achieve granularity that matches the machine. The parallel loop is strip mined by the number of processors, where the strip size is SS = ceil(N/P). We call the J loop the strip, and the JJ loop, which walks between strips, the iterator:

      PARALLEL DO JJ = 1, N, SS
        DO J = JJ, MIN(JJ+SS-1, N)
          DO K = 1, N
            DO I = 1, N
              C(I,J) = C(I,J) + A(I,K) * B(K,J)
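The strip-mining bounds above can be sketched as follows. This is a minimal illustration of SS = ceil(N/P) and the iterator/strip decomposition, using 1-based bounds as in the Fortran; it is not the compiler's code generator.

```python
import math

# Compute the (first, last) iteration bounds of each strip when an
# N-iteration parallel loop is strip mined for P processors.
def strips(n, p):
    ss = math.ceil(n / p)                     # strip size SS
    return [(jj, min(jj + ss - 1, n))         # J runs JJ .. MIN(JJ+SS-1, N)
            for jj in range(1, n + 1, ss)]    # JJ is the iterator
```

For N = 10 and P = 4, the strips are (1,3), (4,6), (7,9), (10,10): every iteration is covered exactly once, and no strip exceeds SS iterations.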

The parallel JJ loop carves up the data space nicely, but if each processor's cache is still not large enough to contain all of array A, tiling the loop nest further improves performance by providing reuse of A. Tiling combines strip mining and loop interchange to promote reuse across a loop nest [IT, Wola]. For matrix multiply, the loop nest may be tiled by strip mining the K loop by TS and then interchanging it with J:

      PARALLEL DO JJ = 1, N, SS
        DO KK = 1, N, TS
          DO J = JJ, MIN(JJ+SS-1, N)
            DO K = KK, MIN(KK+TS-1, N)
              DO I = 1, N
                C(I,J) = C(I,J) + A(I,K) * B(K,J)

Here, TS is selected based on the cache size. This organization moves the reuses of A(1:N, KK:KK+TS-1) on the J loop closer together in time, making it more likely that the data will still be in cache. This optimization approach may be divided into three phases:

(1) optimizing to improve data locality,

(2) finding and positioning a parallel loop, and

(3) performing low-level memory optimizations, such as tiling for cache and placing references in registers [LRW, CCK].

This chapter focuses on the first two phases. We advocate that the first two phases be followed by a low-level memory optimizing phase, but do not address it here.

Memory and language model

Because we are evaluating reuse, we require some knowledge of the memory hierarchy. However, because our model is very simple, only minimal knowledge of the cache is required: the compiler must know the cache line size, cls. The size, set associativity, and replacement policy of the cache are not important here. In addition, we assume a write-back cache and ignore non-unique write references. If the cache is write-through, these writes should be included.

In addition, we only concern ourselves with memory accesses caused by array references, since they dominate memory accesses in scientific Fortran codes. We also assume that arrays are stored in column-major order, where unit stride accesses in the first array dimension translate into contiguous memory accesses. Our results are also valid for row-major arrays, such as those found in C, with only minor changes.

Tradeoffs in optimization

This section illustrates, with an experiment, the influence of memory reuse and parallelism granularity on speedup. As expected, it indicates that the best performance is possible only when both are utilized effectively in concert. It also shows that when both cannot be achieved at once, there are situations where favoring one or the other results in the best execution time; neither always dominates. To illustrate, we phrase the following question:

Given enough computation to make parallelism profitable, what is the effect of reuse, and how should it affect the optimization strategy?

The figure below presents the results of executing different parallel versions of the following loop nest on the processors of a Sequent Symmetry S, with increasing amounts of total work:

      DO J = 1, N
        DO I = 1, M
          DO H = 1, L
            C(I,J) = C(I,J) + A(I,J) * B(I,J)

The total amount of work is increased by varying the upper bounds N and M from 2 to P, the number of processors. We consider positioning I or J as the outer parallel loop in the nest. In the figure, the best version of this loop nest has an outer parallel J loop with 18 iterations (N = 18), and total work is increased by varying M from 2 to 18. Each of the processors accesses distinct columns of each array. This organization exploits cache line reuse on each processor and results in linearly scalable speedup.

When the J loop is outermost and the number of its parallel iterations is varied from 2 to 18 along with P, while the I loop contains 18 iterations, the total amount of

Figure: Memory and parallelism tradeoffs. Speedup versus total work for four versions: best (J out, N=18, M=2:18), gran (J out, N=2:18, M=18), mem (I out, M=18, N=2:18), and worst (I out, M=2:18, N=18).

work increases, but the work per processor remains the same. This organization is illustrated by the gran line. In this case the speedup scales with the number of parallel iterations, but cache line reuse is still facilitated on each processor.

If instead the I loop is made outermost and parallel, then processors must compete for the cache line which contains C(I,J) in order to write it. This competition is called false sharing. In addition, multiple processors require the cache lines containing A(I,J) and B(I,J), increasing network contention and total memory utilization. When the number of parallel iterations of the I loop as outermost varies from 2 to 18 along with P, and the J loop contains 18 iterations, the worst line indicates the performance. If the number of parallel iterations of I is held at 18 while the J loop is varied from 2 to 18, the mem line results.

Compare the pair of lines best and mem. The factor of two difference is due to the benefit of cache line reuse in best, and the limitations of false sharing and increased bus and memory utilization in mem. The same comparison holds for the gran and worst lines. These results indicate that the parallelizing algorithm must recognize reuse and false sharing to be effective.

Now compare the pair of crossing lines, gran and mem. These computations differ only by an interchange. An optimization strategy that only used loop interchange would be forced to pick between the two. To obtain the best performance for this example, the J loop would be outermost when N is sufficiently large; otherwise the I loop should be outermost. In addition, this crossover point would need to be determined for each computation, a daunting task. Our approach instead combines loop interchange and strip mining in a parallelization strategy that minimizes false sharing and exploits data reuse.

Optimizing data locality

In this section we describe two sources of data reuse, then we incorporate both in a simple yet realistic cost model. In subsequent sections, this cost model is used to guide optimizations for improving data locality and exploiting parallelism.

Sources of data reuse

We first consider the two major sources of data reuse:

(1) multiple accesses to the same memory location, and

(2) accesses to consecutive memory locations (i.e., stride-one or unit stride access).

Multiple accesses to the same memory location may arise from either a single array reference or multiple array references. These accesses are loop-independent if they occur in the same loop iteration, and are loop-carried if they occur on different loop iterations. Wolf and Lam call this temporal reuse [WL]. The most obvious source of temporal reuse is from loop-invariant references. For instance, consider the reference to A(J) in the following loop nest. It is invariant with respect to the I loop and is reused by each iteration:

      DO J = 1, N
        DO I = 1, N
          S = S + A(J) + B(I) + C(J,I)

A second source of data reuse is caused by multiple accesses to consecutive memory locations. For instance, each cache line is reused multiple times on the inner I loop for B(I) in the above example. Wolf and Lam call this spatial reuse [WL]. The actual amount of reuse depends on the size of B(I) relative to the cache line size and on the pattern of intervening references. For the rest of this chapter, we assume for simplicity that the cache line size is expressed as a multiple of the number of array elements. For reasonably large computations, references such as C(J,I) do not provide any reuse on the I loop, because the desired cache lines have been flushed by intervening memory accesses.

Previous researchers have studied techniques for improving locality of accesses for registers, cache, and pages [AS, CCK, WL, GJG]. We concentrate on improving the locality of accesses for cache; i.e., we attempt to increase the locality of access to the same cache line. Empirical results show that improving spatial reuse can be significantly more effective than techniques that consider temporal reuse alone [KMT]. In addition, consecutive memory access results in reuse at all levels of the memory hierarchy except for registers.

Simplifying assumptions

To simplify analysis, we make two assumptions. First, our loop cost function assumes that reuse occurs only across iterations of the innermost loop. This assumption decreases precision, but greatly simplifies analysis, since it allows the number of cache line accesses to be calculated independent of the permutation of all outer loops. This assumption is accurate if the inner loop contains a sufficiently large number of memory accesses to completely flush the cache after executing all of its iterations. We show later that our optimizations to improve locality can select a desirable permutation of outer loops even with this restriction.

Cache interference refers to the situation where two memory locations are mapped to the same cache line, eliminating an opportunity to exploit reuse for one of the references. Our second assumption is that cache interferences occur rarely, for small numbers of inner loop iterations, compared to the total number of distinct cache lines accessed in those iterations. In other words, we expect very few interferences for each cache line being reused, since the cache line is only needed for a small number of consecutive inner loop iterations. Lam et al. show that this assumption may not hold if cache lines must remain live for longer periods of time; considerable interference may take place when loops are tiled to increase reuse across outer loops [LRW].

Loop cost

Given these assumptions, we present a loop cost function, LoopCost, based on our memory model. Its goal is to estimate the total number of cache lines accessed when a candidate loop l is positioned as the innermost loop. The result is used to guide loop permutation to improve data locality. The estimate is computed in two steps. First, references that will access the same cache line, in the same or different iterations of the l loop, are combined using RefGroup. Second, the number of cache lines accessed by all groups is calculated using LoopCost.

Reference groups

The goal of the RefGroup algorithm is to partition the variable references in the program text into reference groups, such that all references in a group access the same memory locations and, consequently, the same cache line. Wolf and Lam call these groups equivalence classes exhibiting group-temporal reuse. The partition process is particularly simple here, because we only consider reuse for each loop when it is positioned innermost.

Two references are in the same reference group for loop l if they actually access some common memory location (a data dependence exists between them) and the

Algorithm: Determine reference groups

RefGroup(Refs, DG, l)

Input:
  Refs = {Ref_1, ..., Ref_n}, references
  DG = {<Ref_i, delta, Ref_j>, ...}, the dependence graph
  l, candidate innermost loop

Output:
  {RefGroup_1, ..., RefGroup_m}, reference groups for l

Algorithm:
  m = 0
  while Refs is not empty do
    m = m + 1
    RefGroup_m = {r}, where r in Refs
    Refs = Refs - {r}
    for each <r, delta, r'> or <r', delta, r> in DG s.t. r' in Refs
      if (delta_l is a constant d_l) and (d_l is the only nonzero entry in delta)
        RefGroup_m = RefGroup_m + {r'}
        Refs = Refs - {r'}
      endif
    endfor
  endwhile

reuse occurs on l if it is positioned as the innermost loop. The common accesses then occur either on the same iteration of l, or across d_l iterations of l.

More formally, we define RefGroup as follows.

Definition: Two references Ref_1 and Ref_2 belong to the same reference group with respect to loop l if and only if

(1) Ref_1 delta Ref_2, and

(2) either (a) delta is a loop-independent dependence, or (b) the entry in delta corresponding to loop l is a constant d_l (d_l may be zero) and all other entries are zero.

Jacobi example

For instance, consider the following Jacobi iteration example:

      DO I = 2, N-1
        DO J = 2, N-1
          A(J,I) = B(J,I) + B(J-1,I) + B(J+1,I)
     &           + B(J,I-1) + B(J,I+1)

Data dependences connect all references to B. The reference groups for the I loop are

    {A(J,I)}    {B(J,I-1), B(J,I), B(J,I+1)}
    {B(J-1,I)}  {B(J+1,I)}

The reference groups for the J loop are

    {A(J,I)}    {B(J-1,I), B(J,I), B(J+1,I)}
    {B(J,I-1)}  {B(J,I+1)}

RefGroup is shown above. Its efficiency may be improved by pruning all identical array references, since they access the same memory location on each iteration and always fall in the same reference group.
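The partitioning can be sketched as follows. This is an illustrative model, not the thesis's code: the dependence-graph representation (triples with a distance vector indexed by loop position) and the fixed-point grouping are assumptions made for the sketch.

```python
# Sketch of RefGroup: partition references into groups that reuse the same
# cache line when loop `l` is innermost. A dependence is (ref_a, ref_b, dist).

def ref_group(refs, deps, l):
    groups, remaining = [], set(refs)
    while remaining:
        group = {remaining.pop()}
        changed = True
        while changed:                 # grow the group to a fixed point
            changed = False
            for a, b, dist in deps:
                # group-forming condition: the only nonzero entry of the
                # distance vector (if any) is a constant on loop l
                if any(dist[i] != 0 for i in range(len(dist)) if i != l):
                    continue
                for x, y in ((a, b), (b, a)):
                    if x in group and y in remaining:
                        remaining.discard(y)
                        group.add(y)
                        changed = True
        groups.append(group)
    return groups
```

On the Jacobi example, with loops indexed (I, J) = (0, 1) and the dependences among the B references, this yields the four groups listed above for each candidate innermost loop.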

Loop cost algorithm

After the reference groups for loop l are computed with RefGroup, the algorithm RefCost is applied to estimate the total number of cache lines that would be accessed by each reference group if l were the innermost loop. Once again, the task is simplified because we only consider reuse between iterations of l.

RefCost works by considering one array reference, Ref, from each reference group; these representative references are classified as loop-invariant, consecutive, or non-consecutive with respect to loop l. Loop-invariant array references have subscripts that do not vary with l; they require only one cache line for all iterations of l. Consecutive array accesses vary with l only in the first subscript dimension; they access a new cache line every cls iterations, resulting in trip/cls cache line accesses, assuming l performs trip iterations. Fewer cache lines are reused for non-unit strides. Non-consecutive array accesses vary with l in some other subscript dimension; they access a different cache line each iteration, yielding a total of trip cache line accesses.

Once RefCost is computed, the algorithm LoopCost calculates the total number of cache lines accessed by all references when l is the innermost loop. It simply sums RefCost for all reference groups, then multiplies the result by the trip counts of all the remaining loops. This calculation will underestimate the number of cache lines accessed on the inner loop if the distances of the dependences for a particular RefGroup set are greater than cls. Slight underestimates also occur because the exact alignment of arrays in memory is not known until run time. LoopCost will overestimate the number of cache lines if there is additional reuse across an outer loop.

LoopCost is expressed more formally in the algorithm below, for the following loop nest containing one array reference from each reference group RefGroup_1, ..., RefGroup_m:

      do i_1 = lb_1, ub_1, s_1
        do i_2 = lb_2, ub_2, s_2
          ...
            do i_n = lb_n, ub_n, s_n
              Ref_1(f_1(i_1,...,i_n), ..., f_j(i_1,...,i_n))
              ...
              Ref_m(g_1(i_1,...,i_n), ..., g_k(i_1,...,i_n))

Note that LoopCost can be used to calculate cache line accesses even for array references with complex subscript expressions. For instance, it determines that A(I+J, N) results in consecutive memory accesses with respect to both the I and J loops.

(Footnote: Of course, loop-invariant references should eventually be put in registers.)

Algorithm: Determine inner loop cost

LoopCost(L, R, cls)

Input:
  L = {l_1, ..., l_n}, a loop nest with headers lb, ub, s
  R = {Ref_1, ..., Ref_m}, representatives from each reference group
  trip_l = (ub_l - lb_l + s_l) / s_l
  cls, the cache line size
  appear(f), the set of index variables that appear in
    the subscript expression f
  coeff(i_l, f), the coefficient of the index variable i_l in the
    subscript f (it may be zero)

Output:
  LoopCost(l), the number of cache lines accessed with l as innermost loop

Algorithm:

  LoopCost(l) = sum over k = 1..m of
      ( RefCost(Ref_k(f_1(i_1,...,i_n), ..., f_j(i_1,...,i_n)))
        * product over h != l of trip_h )

  RefCost(Ref_k) =
      1              if (i_l not in appear(f_1)) and ... and
                        (i_l not in appear(f_j))              [loop invariant]
      trip_l / cls   if (i_l in appear(f_1)) and
                        (|coeff(i_l, f_1)| = 1) and (|s_l| = 1) and
                        (i_l not in appear(f_2)) and ... and
                        (i_l not in appear(f_j))              [unit stride]
      trip_l         otherwise                                [no reuse]
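A compact executable model of RefCost and LoopCost follows. It is illustrative only: the subscript representation (one coefficient dict per subscript dimension, mapping loop index to coefficient) is an assumption made for the sketch, not the thesis's data structure.

```python
# ref: list of subscript dicts, each mapping loop index -> coefficient,
# ordered from the first (stride-one) dimension outward.

def ref_cost(ref, l, trips, steps, cls):
    if all(sub.get(l, 0) == 0 for sub in ref):
        return 1                      # loop invariant: one line for all of l
    first, rest = ref[0], ref[1:]
    if (abs(first.get(l, 0)) == 1 and abs(steps[l]) == 1
            and all(sub.get(l, 0) == 0 for sub in rest)):
        return trips[l] / cls         # unit stride: one line per cls iterations
    return trips[l]                   # no reuse: one line per iteration

def loop_cost(refs, l, trips, steps, cls):
    outer = 1
    for h, t in enumerate(trips):     # multiply trips of all other loops
        if h != l:
            outer *= t
    return sum(ref_cost(r, l, trips, steps, cls) for r in refs) * outer
```

For matrix multiply with loops (J, K, I) and any cls greater than one, the costs decrease from J to K to I, reproducing the memory order (J, K, I) derived for that example.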

Imperfectly nested loops

Because of their simplicity, both RefGroup and LoopCost can also be applied to imperfectly nested loops. Consider the following example, where the first definition of A(J) is imperfectly nested:

      DO J = 1, N
        A(J) = ...
        DO I = 1, N
          A(J) = A(J) ...

RefGroup would place all references to A(J) in the same reference group. When we apply RefCost to calculate the number of cache lines accessed by a reference group, we need to select the most deeply nested member of the group. LoopCost then multiplies the result by the trip counts of all the loops that actually enclose the reference.

Loop permutation

The previous section presented our cost model for evaluating the data locality of a given loop structure with respect to cache. In this section we show how the cost model guides loop permutation to restructure a loop nest for better data locality.

A naive optimization algorithm would simply generate all legal loop permutations and select the permutation that yields the best estimated data locality using LoopCost. Unfortunately, generating all possible loop permutations takes time that is exponential in the number of loops, and can be expensive in practice. It becomes increasingly unappealing when transformations such as strip mining introduce even larger search spaces.

Instead of testing all possible permutations, we show how our cost model allows us to design an algorithm that directly computes a preferred loop permutation.

Memory order

The locality-evaluating function LoopCost does not calculate data reuse on outer loops; however, we can still restructure programs to exploit outer loop reuse. The key insight is that if loop l_1 causes more reuse than loop l_2 when both are considered as innermost loops, l_1 will also promote more reuse than l_2 when both loops are placed at the same outer loop position.

LoopCost can thus be considered a measure of the reuse carried by a loop. This allows us to select a desired permutation of the loops, called memory order, that yields the best estimated data locality. We simply rank each loop l using LoopCost, ordering the loops from outermost to innermost, (l_1, ..., l_n), such that LoopCost(l_{i-1}) >= LoopCost(l_i).

Memory order algorithm

The algorithm MemoryOrder is defined as follows: it computes LoopCost for each loop, sorts the loops in order of decreasing cache line accesses (i.e., increasing reuse), and returns this loop permutation.
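MemoryOrder then reduces to a sort. The sketch below takes the per-loop LoopCost values as given (an assumption; in the compiler they come from the cost function above):

```python
# Rank loops outermost-to-innermost by decreasing LoopCost: decreasing
# cache-line accesses means increasing reuse toward the innermost position.
def memory_order(loops, cost):
    return sorted(loops, key=lambda l: cost[l], reverse=True)
```

With costs reflecting the matrix-multiply example, the loop with the highest cost (least reuse) is placed outermost and the cheapest loop innermost.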

Example

As an example, recall matrix multiply. We compute memory order in terms of the cache line size cls. The reference groups for matrix multiply put the two references to C(I,J) in the same group on all the loops, and A(I,K) and B(K,J) are placed in separate groups. LoopCost computes the relative reuse on each of the loops as seen below.

                LoopCost as innermost loop

    references      J                K                  I
    C(I,J)          n * n^2          1 * n^2            (n/cls) * n^2
    A(I,K)          1 * n^2          n * n^2            (n/cls) * n^2
    B(K,J)          n * n^2          (n/cls) * n^2      1 * n^2

    total           2n^3 + n^2       (1 + 1/cls)n^3 + n^2   (2/cls)n^3 + n^2

The algorithm MemoryOrder uses these costs to compute a preferred loop ordering of J, K, I from outermost to innermost. The same result is obtained by previous researchers [AK, WL].
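The ranking step can be sketched in a few lines of Python (a stand-in for the thesis pseudocode; the LoopCost totals below use assumed values n = 100 and cls = 4 for illustration):

```python
# MemoryOrder sketch: rank loops by the cache lines each would touch if it
# were placed innermost (its LoopCost), then order them outermost-to-
# innermost by decreasing cost, i.e. increasing reuse.
def memory_order(loops, loop_cost):
    """loops: loop names; loop_cost: name -> estimated cache line accesses."""
    return sorted(loops, key=lambda l: loop_cost[l], reverse=True)

# Matrix multiply column totals, with n = 100 and cls = 4 (assumed values).
n, cls = 100, 4
cost = {"J": 2 * n**3 + n**2,
        "K": (1 + 1 / cls) * n**3 + n**2,
        "I": (2 / cls) * n**3 + n**2}
print(memory_order(["I", "J", "K"], cost))  # -> ['J', 'K', 'I']
```

Any cls > 1 produces the same J, K, I ranking, since only the relative order of the column totals matters.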

Permuting to achieve memory order

We must now decide whether the desired memory order is legal. If it is not, we must select some legal loop permutation close to memory order. Determining whether a loop permutation is legal is straightforward. We permute the entries in the distance or direction vector for every true, anti, and output dependence to reflect the desired loop permutation. The loop permutation is illegal if and only if the first nonzero entry of some vector is negative, indicating that the execution order of a data dependence has been reversed [AK, Bana, Banb, WL].
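This legality test can be sketched as follows (a Python illustration, not the thesis implementation; directions are encoded as signed integers, with '<' positive, '=' zero, and '>' negative):

```python
def is_legal_permutation(dep_vectors, perm):
    """dep_vectors: one distance/direction vector per dependence, one entry
    per loop, outermost first. perm: tuple of original loop positions in
    their proposed new order, outermost first."""
    for v in dep_vectors:
        permuted = [v[p] for p in perm]
        for entry in permuted:
            if entry > 0:      # dependence still carried forward: this vector is fine
                break
            if entry < 0:      # execution order of the dependence reversed
                return False
            # entry == 0: keep scanning for the first nonzero entry
    return True

# Dependence (1, 0) is carried by the outer loop; interchanging the loops
# keeps the first nonzero entry positive, so the interchange is legal.
print(is_legal_permutation([(1, 0)], (1, 0)))   # -> True
# Dependence (1, -1): interchange makes the leading entry negative: illegal.
print(is_legal_permutation([(1, -1)], (1, 0)))  # -> False
```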

In many cases the loop permutation calculated by MemoryOrder is legal and we are finished. However, if the desired memory order is prevented by data dependences, we use a simple heuristic for calculating a legal loop permutation near memory order.

The algorithm for determining this organization takes O(max(D, n^2)) time in the worst case, where n is the depth of the nest and D is the number of dependences: a definite improvement over considering all legal permutations, which is exponential in n. The algorithm is guaranteed to find a legal permutation with the desired inner loop if one exists.

Permutation algorithm

Given a memory ordering {i_σ1, i_σ2, ..., i_σn} of the loops {i_1, i_2, ..., i_n}, where i_σ1 has the least reuse and i_σn has the most, we can test whether it is a legal permutation directly

Algorithm: Determine the closest permutation to memory order

    NearbyPermutation(O, DV, L)

    Input:     O = {i_1, i_2, ..., i_n}, the original loop ordering
               DV, the set of original legal direction vectors
               L = {i_σ1, i_σ2, ..., i_σn}, a permutation of O
    Output:    P, a nearby permutation of O
    Algorithm:
               P = {}; k = 0; m = n
               while L != {}
                   for j = 1, m
                       l = the j-th loop in L
                       if direction vectors for {p_1, ..., p_k, l} are legal
                           P = {p_1, ..., p_k, l}
                           L = L - {l}; k = k + 1; m = m - 1
                           break for
                       endif
                   endfor
               endwhile

by performing the equivalent permutation on the elements of the direction vectors. If the result is a legal set of direction vectors, the loops are permuted accordingly.

Otherwise, we attempt to achieve a nearby permutation with the algorithm NearbyPermutation. The algorithm builds up a legal permutation in P by first testing to see if the loop i_σ1 is legal in the outermost position. If it is legal, it is added to P and removed from L. If it is not legal, the next loop in L is tested. Once a loop l is positioned, the process is repeated starting from the beginning of L - {l} until L is empty. The following theorem holds for the NearbyPermutation algorithm.

Theorem: If there exists a legal permutation where σ_n is the innermost loop, then NearbyPermutation will find a permutation where σ_n is innermost.

The proof by contradiction of the theorem proceeds as follows. Given an original set of legal direction vectors, each step of the for is guaranteed to find a loop which results in a legal direction vector; otherwise the original was not legal [AK, Bana]. In addition, if any loop σ_1 through σ_{n-1} may be legally positioned prior to σ_n, it will be.
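A minimal executable sketch of NearbyPermutation, under the same signed-integer encoding of direction vectors used above (an illustration, not the thesis code):

```python
def legal_prefix(dep_vectors, prefix):
    # The partial permutation is legal if no vector's leading nonzero entry,
    # taken over the loop positions chosen so far, is negative.
    for v in dep_vectors:
        for pos in prefix:
            if v[pos] > 0:
                break
            if v[pos] < 0:
                return False
    return True

def nearby_permutation(dep_vectors, memory_order):
    """memory_order: loop positions ranked best-outermost-first."""
    P, L = [], list(memory_order)
    while L:
        for l in L:
            if legal_prefix(dep_vectors, P + [l]):
                P.append(l)       # place l, then rescan L from the start
                L.remove(l)
                break
        else:
            raise ValueError("original direction vectors were not legal")
    return P

# Loops 0, 1, 2 with dependence vector (0, 1, -1): loop 2 cannot legally be
# outermost, so the algorithm settles for [1, 2, 0], keeping the desired
# loop 0 innermost, as the theorem promises.
print(nearby_permutation([(0, 1, -1)], [2, 1, 0]))  # -> [1, 2, 0]
```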

This characteristic is important because most data reuse occurs on the innermost loop and is due to spatial reuse, so positioning the inner loop correctly will yield the best data locality.

Data locality experimental results

We tested the algorithm for optimizing data locality independently and report some of these results here.

Matrix multiply

We executed all possible loop permutations of matrix multiply for several problem sizes on a variety of uniprocessors to determine the accuracy of MemoryOrder in predicting the best loop permutations. In the table below, the permutations are ordered from the most desirable to the least, based on the ranking computed by MemoryOrder. On all the processors, memory order (JKI) produced the best results in all but two cases. On all the processors but the Sequent, the entire ranking generally served to accurately predict relative performance. These results illustrate that LoopCost is effective in predicting relative reuse on outer loops as well as inner loops.

Table: Matrix multiply (in seconds)

                            Loop Permutation
    Processor           JKI    KJI    JIK    IJK    KIJ    IKJ
    Sequent (Weitek)
    Sun Sparc
    Intel i860
    IBM RS/6000
    Sun Sparc
    Intel i860
    IBM RS/6000
    Sun Sparc
    Intel i860
    IBM RS/6000

The disparity in execution times between permutations became greater as the processor speed increased. On the individual processors, execution times varied by significant factors on the Sparc and the i860, and dramatically on the RS/6000. These results indicate that data locality should be the overwhelming force driving scalar compilers today.

Stencil computations: Jacobi and SOR

Stencil computations such as Jacobi and SOR are finite difference techniques frequently used to solve partial differential equations [BHMS]. Jacobi runs completely in parallel, while SOR causes a computational wavefront to sweep diagonally through the array. Both kernels were written using two-dimensional arrays. We created and measured the execution time of the following program versions (all of the actual programs appear in the figures below).

Memory Order: Loops are ordered according to the algorithm MemoryOrder. Both temporal and spatial reuse are exploited. Neither program required the algorithm NearbyPermutation.

Poor Order: Loops are permuted in exactly the opposite manner as memory order. It is provided merely to show the worst-case performance if data locality is not taken into account.

Figure: Stencil computation Jacobi

    Memory Order

        DO I = 2, N-1
          DO J = 2, N-1
            A(J,I) = B(J,I) + B(J-1,I) + B(J+1,I) + B(J,I-1) + B(J,I+1)

    Poor Order

        DO J = 2, N-1
          DO I = 2, N-1
            A(J,I) = B(J,I) + B(J-1,I) + B(J+1,I) + B(J,I-1) + B(J,I+1)

    1-D Tiles

        DO JJ = 2, N-1, TILE
          DO I = 2, N-1
            DO J = JJ, MIN(JJ+TILE-1, N-1)
              A(J,I) = B(J,I) + B(J-1,I) + B(J+1,I) + B(J,I-1) + B(J,I+1)

    2-D Tiles

        DO II = 2, N-1, TILE
          DO JJ = 2, N-1, TILE
            DO I = II, MIN(II+TILE-1, N-1)
              DO J = JJ, MIN(JJ+TILE-1, N-1)
                A(J,I) = B(J,I) + B(J-1,I) + B(J+1,I) + B(J,I-1) + B(J,I+1)

1-D and 2-D Tiles: The loops are first placed in memory order; then one or both loops are tiled in order to exploit reuse on the outer loop. This version shows that tiling can degrade performance if insufficient reuse exists on outer loops.

Memory order can be easily computed for both kernels. Temporal reuse is identical for both loops, since each results in the same number of reference groups: four for Jacobi, three for SOR. The J loop has much lower LoopCost, since all reference groups yield consecutive accesses when J is considered as the candidate innermost loop. In comparison, all references result in nonconsecutive accesses with I innermost. Spatial reuse from consecutive accesses thus dominates when considering the proper loop permutation for good data locality.
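The group counts quoted above can be reproduced with a small sketch (a Python stand-in: two references share a group when their subscripts are identical except for a small constant difference in the candidate loop's index; the distance threshold of 2 is an assumption):

```python
# Count reference groups for a candidate innermost loop. refs are
# (array, dJ, dI) subscript offsets; loop_pos is 1 to group along the J
# subscript, 2 to group along the I subscript.
def reference_groups(refs, loop_pos):
    groups = []
    for ref in refs:
        placed = False
        for g in groups:
            r = g[0]
            other = 3 - loop_pos   # the subscript position not being varied
            if (r[0] == ref[0] and r[other] == ref[other]
                    and abs(r[loop_pos] - ref[loop_pos]) <= 2):
                g.append(ref)
                placed = True
                break
        if not placed:
            groups.append([ref])
    return len(groups)

# Jacobi: A(J,I) plus the five-point stencil over B.
jacobi = [("A", 0, 0), ("B", 0, 0), ("B", -1, 0), ("B", 1, 0),
          ("B", 0, -1), ("B", 0, 1)]
# SOR: the same stencil over A, with A(J,I) both read and written.
sor = [("A", 0, 0), ("A", 0, 0), ("A", -1, 0), ("A", 1, 0),
       ("A", 0, -1), ("A", 0, 1)]
print(reference_groups(jacobi, 1), reference_groups(jacobi, 2))  # -> 4 4
print(reference_groups(sor, 1), reference_groups(sor, 2))        # -> 3 3
```

Both candidate loops yield the same group counts, which is why the choice between them hinges on spatial (consecutive-access) reuse rather than temporal reuse.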

The execution times measured for these program versions appear in the tables below. Permuting loops to achieve memory order for these two kernels shows impressive improvements compared to their poorly ordered counterparts; the speedups for both programs are substantial on every processor. The RS/6000 is especially sensitive to consecutive accesses: for Jacobi, its performance increased by a large factor in memory order.

Figure: Stencil computation
Successive Over-Relaxation (SOR)

    Memory Order

        DO I = 2, N-1
          DO J = 2, N-1
            A(J,I) = A(J,I) + A(J-1,I) + A(J+1,I) + A(J,I-1) + A(J,I+1)

    Poor Order

        DO J = 2, N-1
          DO I = 2, N-1
            A(J,I) = A(J,I) + A(J-1,I) + A(J+1,I) + A(J,I-1) + A(J,I+1)

    1-D Tiles

        DO JJ = 2, N-1, TILE
          DO I = 2, N-1
            DO J = JJ, MIN(JJ+TILE-1, N-1)
              A(J,I) = A(J,I) + A(J-1,I) + A(J+1,I) + A(J,I-1) + A(J,I+1)

    2-D Tiles

        DO II = 2, N-1, TILE
          DO JJ = 2, N-1, TILE
            DO I = II, MIN(II+TILE-1, N-1)
              DO J = JJ, MIN(JJ+TILE-1, N-1)
                A(J,I) = A(J,I) + A(J-1,I) + A(J+1,I) + A(J,I-1) + A(J,I+1)

Empirical results show that neither Jacobi nor SOR possesses sufficient reuse at outer loops to justify tiling; spatial reuse is the most important factor for these stencil computations. In fact, tiling degrades performance. The tiled versions of each kernel begin to recover only as tile sizes become quite large.

Table: Performance of Jacobi (in seconds)

                                     1-D Tiles        2-D Tiles
    Processor       Memory   Poor    (by tile size)   (by tile size)
    Sun Sparc
    Intel i860
    IBM RS/6000

Table: Performance of SOR (in seconds)

                                     1-D Tiles        2-D Tiles
    Processor       Memory   Poor    (by tile size)   (by tile size)
    Sun Sparc
    Intel i860
    IBM RS/6000

Erlebacher

Erlebacher is a small benchmark program, written by Thomas Eidson of ICASE, that performs Alternating-Direction-Implicit (ADI) integration. It performs vectorized tridiagonal solves in each dimension, resulting in computational wavefronts across all three dimensions. Our results are for the forward and backward sweeps in the Z dimension. The program versions we used are as follows.

Vector Order: All sequential loops (those carrying dependences) are placed outermost. Inner loops may all be executed in parallel. Within the parallel and sequential loop nests, each is ordered for data locality.

Parallel Order: All parallel loops are placed outermost. Inner loops are sequential and carry dependences corresponding to reuse. Within each group, loops are ordered for data locality. This is the loop permutation selected by optimizations that attempt to exploit temporal reuse without considering spatial reuse (consecutive accesses).

Memory Order: The loops are ordered according to descending LoopCost. As we have shown, both temporal and spatial reuse are considered when choosing memory order. All of the loop nests in Erlebacher were fully permutable; none required the algorithm NearbyPermutation.

Hand-coded: The loops are ordered according to the original source, as provided by Thomas Eidson.

In addition, for each loop strategy we also have fused and separate versions of the program.

Alone: All the statements remain in the same nest as originally written, resulting in single-statement loop nests.

Fuse: Each single-statement loop nest is first permuted into the desired loop order. We then fuse all adjacent loop nests where legal. This version demonstrates the effects of exploiting reuse through loop fusion.

The figure below demonstrates the vector, parallel, memory order, and hand-coded versions for two of the loop nests found in the solution stage for the Z dimension. For this example, parallel order exploits the temporal reuse represented by dependences carried on the K loop. However, as seen in the performance table, it results in the worst performance because it eliminates consecutive accesses for the references to array F.

Both vector and memory order place I as the innermost loop, resulting in cache line reuse for references to array F. However, memory order also selects the middle loop that yields the most reuse. In the first loop nest, the J loop is preferred because both A(K) and B(K) become loop-invariant references, overcoming the savings derived for the K loop from putting F(I,J,K) and F(I,J,K-1) in the same reference group.

Table: Performance of Erlebacher (in seconds)

                        Vector Order   Parallel Order   Memory Order   Hand-
    Processor           Alone   Fuse   Alone   Fuse     Alone   Fuse   coded
    Motorola
    Intel i860
    Sequent (Weitek)
    Sun Sparc
    Intel i860
    IBM RS/6000

Figure: Erlebacher forward and backward sweeps in Z dimension

    { Vector Order: outer loops sequential, inner loops parallel }

        DO K = 2, N
          DO J = 1, N
            DO I = 1, N
              F(I,J,K) = (F(I,J,K) - A(K)*F(I,J,K-1)) * B(K)
        DO K = N-1, 1, -1
          DO J = 1, N
            DO I = 1, N
              F(I,J,K) = F(I,J,K) - C(K)*F(I,J,K+1) - E(K)*F(I,J,N)

    { Parallel Order: outer loops parallel, inner loops sequential }

        DO J = 1, N
          DO I = 1, N
            DO K = 2, N
              F(I,J,K) = (F(I,J,K) - A(K)*F(I,J,K-1)) * B(K)
        DO J = 1, N
          DO I = 1, N
            DO K = N-1, 1, -1
              F(I,J,K) = F(I,J,K) - C(K)*F(I,J,K+1) - E(K)*F(I,J,N)

    { Memory Order: in order of decreasing LoopCost }

        DO K = 2, N
          DO J = 1, N
            DO I = 1, N
              F(I,J,K) = (F(I,J,K) - A(K)*F(I,J,K-1)) * B(K)
        DO J = 1, N
          DO K = N-1, 1, -1
            DO I = 1, N
              F(I,J,K) = F(I,J,K) - C(K)*F(I,J,K+1) - E(K)*F(I,J,N)

    { Hand-coded }

        DO K = 2, N
          DO J = 1, N
            DO I = 1, N
              F(I,J,K) = (F(I,J,K) - A(K)*F(I,J,K-1)) * B(K)
        DO J = 1, N
          DO K = N-1, 1, -1
            DO I = 1, N
              F(I,J,K) = F(I,J,K) - C(K)*F(I,J,K+1) - E(K)*F(I,J,N)

In comparison, in the second loop nest, K is the preferred middle loop. By combining F(I,J,K) and F(I,J,K+1) in the same reference group and making F(I,J,N) a loop-invariant reference, the K loop provides more savings than the J loop can by making both C(K) and E(K) loop-invariant. This results in a measurable improvement on the i860, showing that simply selecting the correct innermost loop is not sufficient to yield the maximum data locality.

Our results show that data locality grows in importance with processor speeds. Loop fusion provides additional improvements in execution time, but selecting the correct loop permutation for good data locality again yields the greatest benefit.

Parallelism

In the following two subsections, parallelism is evaluated and exploited. We first present a performance estimator that evaluates the potential benefit of parallelism. A parallel code generation strategy then uses performance estimation and the cost model developed in the previous section, with other transformations, to combine effective parallelism and memory order, making tradeoffs as necessary.

Performance estimation

This section uses performance estimation to quantify the effects of parallelism on execution time. Our performance estimator predicts the cost of parallel and sequential performance using a loop model and a training set approach.

The goal of our performance estimator is to assist in code generation for both shared and distributed memory multiprocessors [BFKK, KMM]. Modeling the target machines at an architectural level would require calculating an analytical model for each supported architecture. Instead, our performance estimator uses a training set to characterize each architecture in a machine-independent fashion. A training set is a group of kernel computations that are compiled, executed, and timed on each target machine. They measure the cost of operations such as multiplication, branching, intrinsics, and loop overhead. These costs are then made available to the performance estimator via a table of data. Note that the training sets for the performance estimator only measure access times to data in registers or the closest cache.

Of particular interest is the estimation of parallel loops. Given sufficient parallel granularity, using all available processors results in the best execution time. Estimating the cost in this circumstance may be modeled by determining the following:

    c_s     cost of starting parallel execution
    c_f     cost of forking and synchronizing a parallel process
    P       number of processors
    b       number of iterations of the parallel loop
    t(B)    cost of the loop body

If the loop bounds are unknown, a guess is used that is based on the declared dimension of the arrays accessed in the loop. With these parameters, the performance of a parallel loop with sufficient work may be estimated by

    ceil(b / P) * t(B) + c_s + c_f * P.

However, if the amount of work is not sufficient, parallel loop execution is more difficult to model. Instead of an equation, a table is used to indicate the appropriate number of processors for the best performance. The model and the table are generated using a training set.

The sample training set for determining parallel loop overhead begins by varying the total amount of work. For each unit of work, the number of processors is varied from one to the total available. The number of processors which minimizes the execution time of this work is selected. The result of a training run for parallel loops on the Sequent Symmetry S81 appears in the figure below.

This particular training run repeatedly performed a single scalar operation that executed for a fixed number of microseconds, which represents one unit of work in the figure. Each of the contour lines indicates a particular execution time. The single line cutting across the contour lines represents the minimum execution time for executing a particular work load and the appropriate number of processors. When total work is below a threshold, a table determines the appropriate number of processors and approximate execution time. Once the total work is over the threshold, the parallel loop model is used. The estimator provides a single cost function for evaluating loops that chooses between the techniques based on total work and number of loop iterations:

    Estimate(l, how) returns <ε, np>, where

        l is a loop with body B,
        how indicates whether l may be run in parallel.

Figure: Parallel loop training set (interpolated contours of execution time in microseconds, plotted against total work and number of processors, 1 to 19; a single line crossing the contours marks the minimum execution time for each work load)

This function returns a tuple <ε, np>, where the estimate ε is the minimal execution time and np is the number of processors necessary to obtain the estimate, based on whether the loop is parallel. Note that if the loop is sequential, or if it is not profitable to run it in parallel, the sequential running time and np = 1 are returned.
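The two-regime estimator can be sketched as follows. Every constant here (c_s, c_f, the work threshold, and the table entries) is an invented placeholder for illustration, not a measured Sequent value:

```python
import math

# Parallel-loop cost model sketch: above a work threshold the closed form
# ceil(b/P)*t(B) + c_s + c_f*P applies; below it, a training-set table
# supplies the best processor count and time.
C_S, C_F, WORK_THRESHOLD = 100.0, 10.0, 400.0
TABLE = {  # total work -> (best processor count, measured time); assumed data
    100.0: (2, 80.0), 200.0: (4, 90.0), 300.0: (6, 95.0),
}

def estimate(b, t_body, procs, parallel):
    """Return (time, np): minimal estimated time and processors used."""
    seq = b * t_body
    if not parallel:
        return seq, 1
    work = b * t_body
    if work >= WORK_THRESHOLD:
        par = math.ceil(b / procs) * t_body + C_S + C_F * procs
        np = procs
    else:
        key = min(TABLE, key=lambda w: abs(w - work))  # nearest table entry
        np, par = TABLE[key]
    if par >= seq:          # parallelism not profitable: run sequentially
        return seq, 1
    return par, np

print(estimate(1000, 1.0, 8, True))  # -> (305.0, 8)
```

With only 4 units of work the same call falls back to the sequential time and np = 1, matching the behavior described above.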

Introducing parallelism

The key to introducing parallelism is to maintain memory order during parallelization by using strip mining and loop shifting (loop shifting moves an inner loop outward across one or more loops). Strip mining performs two functions in parallelization. First, it preserves cache line reuse in parallel execution; without strip mining, consecutive iterations may be scheduled on different processors, denying cache line reuse. Second, because strip mining results in two loops, the parallel iterator loop may be shifted outward to maximize granularity, while the sequential strip remains in place, providing the data locality introduced using memory order. To illustrate this point, consider the subroutine dmxpy from Linpackd, written in memory order [DBMS]:

    DO J = JMIN, N
      DO I = 1, N
        Y(I) = Y(I) + X(J) * M(I,J)

The J loop is not parallel. The I loop can be parallel. Both contain reuse. A simple parallelization that maximizes granularity would interchange the two loops and make the I loop parallel without strip mining. Unfortunately, with this organization the parallel loop may be scheduled such that consecutive iterations are assigned to different processors, causing false sharing of Y and eliminating cache line reuse for consecutive accesses to X and M. In addition, cache lines containing the same array elements would be required at multiple processors, increasing total memory and bus utilization.

We instead strip mine a parallel loop by strip size SS = ceil(N/P) to provide reuse on the strip, and parallelize the resultant iterator. If the parallel loop is outermost, as in matrix multiply, parallelization is complete. If not, we use loop shifting to move the parallel iterator to its outermost legal position, maximizing its granularity. Applying this strategy to dmxpy, we begin with the memory-ordered loop nest. The I loop is the only parallel loop, and it contains reuse; therefore it is strip mined. The parallel iterator is not outermost, but it is legally shifted to the outermost position. The compiler shifts the loop, resulting in maximum granularity and data locality, as illustrated below.

    PARALLEL DO II = 1, N, SS
      DO J = JMIN, N
        DO I = II, MIN(II+SS-1, N)
          Y(I) = Y(I) + X(J) * M(I,J)
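A Python stand-in for the two dmxpy versions confirms that the strip-mined, shifted nest computes exactly the same result as the original (SS and the data are arbitrary, and JMIN is taken as the first column for simplicity):

```python
def dmxpy(n, y, x, m):
    # Original memory-ordered nest: J outer (sequential), I inner.
    for j in range(n):
        for i in range(n):
            y[i] += x[j] * m[i][j]
    return y

def dmxpy_strip(n, y, x, m, ss):
    # Strip-mined version: the parallel iterator II is shifted outermost;
    # each II iteration touches a disjoint slice of Y and could run on its
    # own processor without false sharing.
    for ii in range(0, n, ss):                 # PARALLEL DO II = 1, N, SS
        for j in range(n):                     # DO J = JMIN, N
            for i in range(ii, min(ii + ss, n)):
                y[i] += x[j] * m[i][j]
    return y

n, ss = 6, 2
x = [float(i + 1) for i in range(n)]
m = [[float(i * n + j) for j in range(n)] for i in range(n)]
a = dmxpy(n, [0.0] * n, x, m)
b = dmxpy_strip(n, [0.0] * n, x, m, ss)
print(a == b)  # -> True
```

Per element of Y, both versions accumulate the J terms in the same order, so the results match exactly even in floating point.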

Strip mining

If a loop is selected to be performed in parallel, it is strip mined if it contains any reuse. Given sufficient iterations, strip mining exploits data locality and parallelism by using ceil(N/P) as the strip size, where N is the number of iterations. Assuming cls <= P, the iteration space is sufficiently large if cls * P <= N. If

    P <= N < cls * P,

strip mining by ceil(N/P) produces strips of fewer than cls iterations and may result in false sharing. However, the granularity of the parallel loop does match P, and some reuse will occur; in this case we still strip mine by ceil(N/P). However, if N < P, strip mining may provide reuse, but at the cost of drastically reducing the granularity of parallelism. This tradeoff is very machine specific; we choose not to strip mine when N < P.
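The strip-size policy reduces to a few lines (a sketch; the flag marking strips shorter than a cache line is our addition for illustration):

```python
import math

# Strip-size policy from the text: with N parallel iterations, P processors,
# and cache line size cls (in elements), strip mine by ceil(N/P) whenever
# N >= P, and skip strip mining when N < P.
def strip_size(n_iters, procs, cls):
    if n_iters < procs:
        return None                     # don't strip mine: too few iterations
    ss = math.ceil(n_iters / procs)
    # A strip shorter than cls may still cause false sharing, but the
    # granularity matches P, so the loop is strip mined anyway; the flag
    # tells the caller which regime applies.
    return ss, ss >= cls

print(strip_size(1024, 8, 4))   # -> (128, True): ample reuse per strip
print(strip_size(16, 8, 4))     # -> (2, False): strips shorter than a line
print(strip_size(4, 8, 4))      # -> None: N < P, leave unstripped
```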

When memory order is computed, the loops are marked to indicate whether they contain any reuse. If there is reuse, the strip mining algorithm uses the above equations to select a strip size that maximizes granularity and reuse. If there is no reuse, the strip mining algorithm does not perform strip mining, giving more flexibility to the scheduler.

Parallelization algorithm

For memory-ordered loop nests that are not parallel on the outermost loop, the Parallelization algorithm uses loop shifting to introduce parallelism. It uses loop shifting, rather than a general loop permutation algorithm, in order to minimize the effect of parallelization on data locality. It performs strip mining, when the loop contains reuse, before shifting for the same reason. In the worst case it is O(n^2) time.

The algorithm below introduces parallelism into memory order. It begins by testing whether the outermost loop is parallel. In the first iteration of the for k (k = j),

Algorithm: Introduce parallelism

    Parallelization(L)

    Input:     L = {σ_1, ..., σ_n}, a legal permutation
    Output:    T, a parallelizing transformation
    Algorithm:
               T = {}
               for j = 1, n
                   for k = j, n
                       if σ_k legal at position j and σ_k parallel
                           T = { StripMine(σ_k);
                                 shift the σ_k iterator to j; parallelize it }
                           return T
                       elseif σ_k legal at j and σ_j becomes parallel
                           T = { StripMine(σ_k); shift the σ_k iterator to j;
                                 StripMine(new σ_{j+1});
                                 parallelize the σ_{j+1} iterator }
                       endif
                   endfor
                   if T != {} return T
               endfor

the first if tests whether the outermost loop is parallel. Trivially, a shift of loop σ_j to position j is always legal.

If the loop is parallel, it is strip mined and parallelized, and the algorithm returns. If the loop is not parallel, a legal shift of an inner loop to position j, which is parallel at position j, is sought. If a parallel loop is found that can be shifted outermost to j, it is strip mined, parallelized, and shifted, and the algorithm returns. Otherwise, a shift to position j may cause the next inner loop (i.e., the loop originally positioned at j) to be parallel. This situation is detected in the elseif. Because it is more desirable to parallelize a loop at position j than at j+1, all other shifts to position j are considered before this parallelization is returned at the completion of the for k.
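The control structure of this search can be sketched with the dependence tests abstracted into callables (purely illustrative: legal_at and parallel_at stand in for the compiler's dependence analysis, and transformations are recorded as strings):

```python
# Parallelization sketch: find the outermost position j that can host a
# parallel loop, preferring a direct shift of a parallel loop over a shift
# that merely makes the loop at j+1 parallel.
def parallelize(loops, legal_at, parallel_at):
    """loops: memory-ordered loop names.
    legal_at(k, j): may loops[k] be shifted outward to position j?
    parallel_at(k, j): is loops[k] parallel once placed at position j?"""
    pending = None
    for j in range(len(loops)):
        for k in range(j, len(loops)):
            if legal_at(k, j) and parallel_at(k, j):
                return ["strip-mine " + loops[k],
                        "shift %s to %d" % (loops[k], j),
                        "parallelize " + loops[k]]
            if legal_at(k, j) and parallel_at(j, j + 1):
                # loops[j] becomes parallel at position j+1; keep looking for
                # a parallel loop at j itself before settling for this.
                pending = ["strip-mine " + loops[k],
                           "shift %s to %d" % (loops[k], j),
                           "parallelize " + loops[j]]
        if pending:
            return pending
    return []

# dmxpy-like nest (J, I): J is sequential, I is parallel and may be shifted
# outermost, so the first clause fires at j = 0, k = 1.
t = parallelize(["J", "I"],
                legal_at=lambda k, j: True,
                parallel_at=lambda k, j: k == 1)
print(t)  # -> ['strip-mine I', 'shift I to 0', 'parallelize I']
```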

As written, Parallelize does not detect when strip mining results in a strip size of less than cls, or when strip mining is not performed due to insufficient parallel iterations. As we saw above, these conditions are unavoidable in some cases, and the best possible performance is gained even when they hold. However, we extend Parallelize as follows to seek a better parallelization for which neither condition holds.

If StripMine returns with a strip size of less than cls, or does not strip mine due to insufficient parallel iterations, then the number of parallel iterations PI and the size of the strip SS are recorded, and the for k loop continues instead of returning. If the for k finds a parallelization where neither condition holds, it returns. Otherwise, at the completion of the for k, it selects the parallelization with the largest pair (PI, SS).

Optimization algorithm

The optimization driver for exploiting data reuse and introducing parallelism appears below. It combines the component algorithms described in the previous sections and is also O(n^2) time.

It first calls MemoryOrder to optimize data locality via loop permutation. It then determines whether the loop contains sufficient computation to pursue parallelism. If it does, the memory-ordered loop nest is provided to the algorithm Parallelize. If needed, Parallelize uses strip mining and loop shifting to introduce loop-level parallelism.

The search space in Parallelize is constrained to meet our goal of perturbing the memory order as little as possible. If parallelism is not discovered and would be

Algorithm: Optimizing for parallelism and data locality

    SimpleOptimizer(L)

    Input:     L = {l_1, ..., l_n}
    Output:    T, an optimization of L
    Algorithm:
               O = MemoryOrder(L)
               <ε, np> = Estimate(O, parallel)
               if np > 1                    /* parallelism is profitable */
                   T = Parallelize(O)
               endif
               perform { O; T }

profitable, other optimization strategies that consider all loop permutations, loop skewing [WL], or loop distribution (see the next chapter) should be explored.

Experimental results

The overall parallelization strategy was also tested by applying it by hand to two kernels. The results of these experiments, and those for data locality, are very promising.

Matrix multiply

The speedups of a parallel, tiled matrix multiply on two processor counts of a Sequent Symmetry S81, for two array sizes, are presented in the table below. We ran a sequential version with the loops in memory order (JKI), a sequential tiled version, and the identically tiled parallel version. The parallel version uses the same tiling and is the same version presented earlier. Besides tiling, no other low-level memory optimizations were used. The speedups were basically linear for both matrix sizes when comparing the two tiled versions.

Table: Speedups for Parallel Matrix Multiply

                             speedup of parallel JKI tiled
                      processors                      processors
    size    over sequential  over sequential   over sequential  over sequential
            JKI              JKI tiled         JKI              JKI tiled

Dmxpy

The subroutine dmxpy from Linpack was optimized using these algorithms, as illustrated above. In scientific programs there are many instances of this type of doubly nested loop, which iterates over vectors and/or matrices, where only one loop is parallel and it is best ordered at the innermost position. These loops may be an artifact of a vectorizable programming style. They appear frequently in the Perfect benchmarks [CKPK], the Level 2 BLAS [DCHH], and the Livermore loops [McM].

The table below illustrates the performance benefits of the organization of dmxpy generated by our algorithm. For comparison, the performance when the I strip is not returned to its best memory position, and with a parallel inner I loop, were also measured.

Table: Dmxpy speedups

                                 loop organization (I loop parallel)
                                 II,J,I       II,I,J       J,I
    speedup over sequential J,I

Related work

Our work bears the most similarity to research by Wolf and Lam [WL]. They develop an algorithm that estimates all temporal and spatial reuse for a given loop permutation, including reuse on outer loops. This reuse is represented as a localized vector space. Vector spaces representing reuse for individual and multiple references are combined to discover all loops L carrying some reuse. They then exhaustively evaluate all legal loop permutations where some subset of L is in the innermost position and select the one with the best estimated locality.

Wolf and Lam's algorithm for selecting a loop permutation is potentially more precise and powerful than the one presented here. It directly calculates reuse across outer loops and can suggest loop skewing and reversal to achieve reuse; however, how often these transformations are needed is yet to be determined. Skewing in particular is undesirable because it reduces spatial reuse.

Gannon et al. also formulate the dependence testing problem to give reuse and volumetric information about array references [GJG]. This information is then used to tile and interchange the loop nests for cache, after which parallelism is inserted at the outermost possible position. They do not consider how the parallelism affects the volumetric information, nor whether interchange would improve the granularity of parallelism.

Porterfield presents a formula that approximates the number of cache lines accessed for a loop nest, but it is restricted to a cache line size of one and loops with uniform dependences [Por]. Ferrante et al. present a more general formula that also approximates the number of cache lines accessed and is applicable across a wider range of loops [FST]. However, they first compute an estimate for every array reference and then combine them, trying not to do dependence testing. Like Wolf and Lam, they exhaustively search for a loop permutation with the lowest estimated cost.

Many algorithms have been proposed in the literature for introducing parallelism into programs. Callahan et al. use the metric of minimizing barrier synchronization points, via loop distribution, fusion, and interchange, for introducing parallelism [ACK, Cal]. Wolf and Lam [WL] introduce all possible parallelism via the unimodular transformations loop interchange, skewing, and reversal. Neither of these techniques tries to map the parallelism to a machine or to take into account data locality, nor is any loop bound information considered. Banerjee also considers introducing parallelism via unimodular transformations, but only for doubly nested loops [Banb]. Banerjee does, however, consider loop bound information.

Because we accept some imprecision, our algorithms are simpler and may be applied to computations that have not been fully characterized in Wolf and Lam's unimodular framework. For instance, we can support imperfectly nested loops, multiple loop nests, and imprecise data dependences. We believe that this approximation is a very reasonable one, especially in view of the fact that we intend to use a scalar cache tiling method as a final step in the code generation process [CCK]. In addition, the algorithms presented here are O(n^2) time in the worst case, where n is the depth of the loop nest, and are a considerable improvement over work which compares all legal permutations and then picks the best, taking exponential time. Our approach has appeared elsewhere [KM].

Discussion

We have addressed the problem of choosing the best loop ordering in a nest of loops for exploiting data locality and for generating parallel code for shared-memory multiprocessors. As our experimental results bear out, the key issue in loop order selection is achieving effective use of the memory hierarchy, especially cache lines. Our approach improves data locality, provides the highest granularity of parallelism, and properly positions loops for low-level memory-optimizing transformations. When possible, the benefits of parallelism and data locality are therefore both exploited.

The next chapter incorporates this algorithm into a more general interprocedural approach for generating parallel code.

Chapter

An Automatic Parallel Code Generator

In this chapter we present a parallel code generation algorithm for shared-memory multiprocessors. We use the results of the preceding chapters to design an interprocedural algorithm for complete application programs. The key parallelization component is the algorithm for improving data locality and introducing parallelism developed in the previous chapter. This chapter presents a new technique that performs loop fusion to enhance granularity. When necessary, partial parallelism is exploited using loop distribution. A general, unified treatment of fusion and distribution for loop nests is described and shown optimal under certain constraints. The result is a cohesive, loop-based, interprocedural parallelization algorithm which builds on and extends the work in previous chapters.

Introduction

The goal of the optimization algorithm presented in this chapter is to introduce parallelism in a way that minimizes execution time over the entire program. The major components used by the parallelizer to discover and exploit parallelism and data locality are as follows.

Loop-based intraprocedural transformations:

    loop permutation
    strip mining
    loop fusion
    loop distribution

Interprocedural transformations:

    loop embedding
    loop extraction
    procedure inlining
    procedure cloning

Performance estimation

The purpose of this work is to improve execution time by exploiting and discovering parallelism and improving data locality. Exploiting parallelism takes three forms: (1) assuring sufficient granularity to make parallelism profitable, (2) maximizing granularity, making each parallel task as large as possible, and (3) matching the number of parallel iterations to the machine. Assuring sufficient granularity and matching it to the machine is dependent on the architecture. Most previous research focuses on discovering parallelism and/or maximizing its granularity without regard to data locality [Cal, WL, ABC+].

Our approach addresses all of these concerns. The key component is the combination of loop interchange and strip mining developed in the previous chapter, which considers all these factors. Sufficient granularity is assured using performance estimation, described earlier. In this chapter we present a loop fusion algorithm to further increase granularity. In addition, we use loop distribution to discover parallelism when necessary. A unified treatment of fusion and distribution shows the problems to be identical. This algorithm is shown to be optimal for a restricted problem scope.

This chapter uses the extensions developed earlier to design an algorithm that considers loop-based transformations even in the presence of procedure calls. The loop-based transformations determine which, if any, interprocedural transformations are necessary. The interprocedural transformations are called enablers and are applied using goal-directed interprocedural optimization [BCHT]. The interprocedural transformations are applied only when they are expected to improve execution time, i.e., when they are required to perform a loop-based parallelization of a loop nest that spans procedures.

Parallel code generation

Driving code generation

The driver for optimizing a program appears below as the algorithm Driver. Driver optimizes routines and loops in reverse postorder using the augmented call graph G_ac, guaranteeing that a procedure or loop is optimized only when the context of its calling procedures and outer loops is known. This formulation is general enough for performing many whole-program optimizations on programs without recursion, for which G_ac is a DAG. It is used here to introduce a single level of high-granularity parallelism and to exploit the memory hierarchy. We illustrate the algorithm by optimizing the following example.

Example:

    PROCEDURE C                    edges in G_ac:
      CALL S                         (C, S)  call edge
      DO I = 1, N                    (C, I)  loop edge
        CALL S                       (I, S)  call edge
      ENDDO                          (S, J)  loop edge

    PROCEDURE S
      DO J = 1, N
        ...
      ENDDO

Driver begins with C and marks it visited. It then tests whether all of procedure C's predecessors have been optimized; assume none of them have. Driver then tries to recursively optimize each of C's successors in the forall. The first successor is the call to S. However, not all of S's predecessors have been visited, so Driver proceeds to the next successor, the I loop, whose predecessors have all been visited. Driver then optimizes the I loop, using Optimize to specify a loop-based parallelization. Assume Optimize simply parallelizes the I loop. The procedure Transform applies this parallelization and marks all the descendants of the I loop, S and J, as optimized. Note that the descendants are not marked visited.

Driver continues to seek optimization candidates among the descendants of the I loop, to ensure that all paths to a procedure are optimized if possible. Driver is called again on procedure S; this time all of S's predecessors have been visited. It checks whether all the predecessors of S have been optimized as well as visited. Of S's predecessors, C and the I loop, only the I loop has been optimized. Therefore there is a calling sequence to S that does not contain parallelism, and S should be, and is, considered for optimization. The only successor of S is the J loop, and it is optimized next using Optimize. If a parallelization is specified, it is performed. There are no more edges or unvisited nodes; consequently, the algorithm terminates.
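The traversal order in this walkthrough can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: the Node class is an invented data structure, and the action on loop nodes is a stand-in for the real Optimize/Transform pair, which decides whether and how to parallelize.

```python
class Node:
    """A node of the augmented call graph G_ac (procedure or loop)."""
    def __init__(self, name, kind):
        self.name = name            # e.g. "C", "S", "I", "J"
        self.kind = kind            # "proc" or "loop"
        self.preds, self.succs = [], []
        self.visited = self.optimized = False

def edge(u, v):
    u.succs.append(v)
    v.preds.append(u)

def driver(n, order):
    # Reverse postorder: a node is processed only after all of its
    # predecessors have been visited, so its calling context is known.
    if any(not p.visited for p in n.preds):
        return
    n.visited = True
    order.append(n.name)
    if n.kind == "proc" and n.preds and all(p.optimized for p in n.preds):
        return  # parallelism already introduced in every calling context
    for m in n.succs:
        if m.kind == "loop":
            m.optimized = True      # stand-in for Optimize/Transform
        driver(m, order)
```

On the example graph (C calls S, C contains loop I, I calls S, S contains loop J), driver visits C, skips S on the first attempt because the I loop has not yet been visited, optimizes I, and only then reaches S and J.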

Discussion

This section details the Driver algorithm more formally. Driver is initially called on the program's root node, and each node in G_ac is initialized to be unvisited and unoptimized. G_ac is traversed in reverse postorder such that a node n is visited only once all its predecessors are visited. Note that only procedure nodes may have more than one immediate predecessor. If n is a procedure and all its predecessors are marked as optimized, then parallelism has been introduced in each calling context and the procedure does not need to be optimized for additional parallelism.

Algorithm: Driver for parallel code generation

    Driver(n)
    Input:  n, a node in G_ac
    Algorithm:
      if any predecessor of n is not visited, return
      mark n visited
      if n is a procedure and all predecessors of n are marked optimized, return
      forall m where edge (n, m) ∈ G_ac, in topological order
        if m is a procedure
          Driver(m)
          if m is visited and there exist edges (m, s_1), ..., (m, s_k) in G_ac
              such that the s_i are optimized
            T ← Fusion(s_1, ..., s_k)
            Transform(m, {s_1, ..., s_k}, T)
          endif
        else
          if m is an outer loop of procedure n
            let L include the entire looping structure rooted at m
            T ← Optimize(L)
            Transform(n, m, T)
          endif
          Driver(m)
        endif
      endforall

Loop nests are optimized as a whole when the outermost loop of the nest is encountered in the loop-handling branch of the algorithm. If m is the outermost loop of a nesting structure L, then L is constructed and optimization is performed on it. If optimization is successful, the transformation is applied and all the affected nodes are marked as optimized by the subroutine Transform, which appears below. However, the algorithm continues to recurse over all successors of a node, regardless of node type, until all nodes have been visited, or all the callers of a procedure have been optimized, or the loop or procedure itself has been optimized.

When m is a procedure and m is visited, optimization has just been performed on all the loops and calls it contains. If any of these successors of m are optimized, they are now considered for fusion. This feature allows fusion of loops that are not enclosed by an outer loop. Fusion of loops that are enclosed by an outer loop, and all other loop-based transformations, are determined by the Optimize algorithm. For example, consider a procedure P containing two adjacent loops l_1 and l_2. The procedure P is visited, followed by l_1 and l_2. Both are determined to be parallel and are marked as optimized. Then Fusion(l_1, l_2) is considered, to maximize granularity and to reduce synchronization and loop overhead.

In order to simplify the discussion, this version of Driver does not perform any interprocedural transformations; the parallelizer is extended later in this chapter to perform them. This version of the parallelizer does, and must, perform procedure cloning for correctness.

Procedure cloning

Procedure cloning is required when optimizations along distinct call paths want to use different versions of the same procedure [CKT, CHK]. In our algorithm this situation only occurs when a procedure is called more than once.

Consider again the example above. Procedure S may be called once directly and once from inside the I loop. The second call to S is parallelized by making the I loop parallel, and the first call by making the J loop parallel. In this case two versions of S are needed. The original sequential version of S is required for the parallel I loop. Another version, called the clone, is needed in which to specify that the J loop be performed in parallel.

To determine if cloning is necessary, the parallelizer keeps track of whether or not parallelism has been introduced in callers using the WalkMark algorithm over G_ac. The subroutine Transform, below, determines if a clone is necessary.

Remember, only nodes whose predecessors have all been visited, and where at least one predecessor is unoptimized, are candidates for optimization. If Optimize returns a parallelizing transformation for a candidate, Transform is called to apply it to the appropriate procedure n. If none of n's predecessors are optimized, a clone is unnecessary and the transformation is performed directly on the procedure n. If any of n's predecessors are marked optimized, then a clone is needed on which to apply the current optimizing transformation. If a clone already exists, then other loop nests in the procedure have been optimized, and the new optimization is also applied to this existing clone. Otherwise a clone is created and the optimization is applied to the cloned procedure. Only two versions of any procedure are required using this simplified algorithm.

Algorithm: Applying transformations and cloning

    Transform(n, L, T)
    Input:  n, a procedure node in G_ac
            L, a set of loops in n
            T, a transformation to perform on L
    Algorithm:
      if T = ∅, return
      if a predecessor of n is marked optimized
        if n_clone = ∅, create a clone n_clone of procedure n
        Apply(n_clone, T)
      else
        Apply(n, T)
      endif
      forall l ∈ L: WalkMark(l)

    WalkMark(n)
    Input:  n, a node in G_ac
    Algorithm:
      if n is marked optimized, return
      mark n optimized
      forall m where edge (n, m) ∈ G_ac, in topological order
        WalkMark(m)
      endforall
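The cloning decision in Transform can be sketched as below. The data structures and names here are assumptions for illustration, not the thesis implementation; the point is only the branch on whether some caller has already been parallelized.

```python
def transform(proc, transformation, clones):
    """Apply a parallelizing transformation to procedure `proc`,
    cloning first when some caller was already parallelized.

    proc:   dict with 'name', 'preds' (callers, each carrying an
            'optimized' flag), and 'applied' (transformations so far)
    clones: dict mapping a procedure name to its single clone
    """
    if transformation is None:
        return None
    if any(p['optimized'] for p in proc['preds']):
        # Some calling context already runs in parallel: keep the
        # original sequential body and apply the change to a clone.
        clone = clones.setdefault(
            proc['name'],
            {'name': proc['name'] + '_clone', 'applied': []})
        clone['applied'].append(transformation)
        return clone
    proc['applied'].append(transformation)
    return proc
```

Because every subsequent transformation reuses the same clone, at most two versions of any procedure arise, matching the claim above.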

Loop-based optimization

The procedure Optimize, given below, determines an optimizing transformation to parallelize an arbitrary loop nest L = {l_1, ..., l_n}. L describes a loop nesting structure rooted at l_1 that may contain constructs such as imperfectly nested loops, procedure calls, and control flow. The loop-level transformations Optimize considers are loop permutation, strip mining, loop fusion, and loop distribution.

Optimize first calls OrderParallelize, which performs loop permutation and strip mining using the algorithms developed earlier for improving data locality and exploiting parallelism. OrderParallelize begins by improving data locality using MemoryOrder. Given sufficient granularity, it then inserts parallelism into the resultant nest using the algorithm Parallelize. MemoryOrder and Parallelize are defined in an earlier chapter. If OrderParallelize has been successful at introducing parallelism at some level, fusion of loops in the resultant nest is considered. This algorithm is discussed in detail below.

Algorithm: Optimizing a loop nest

    Optimize(L)
    Input:  L = {l_1, ..., l_n}, a loop nest
    Output: T, a parallelizing transformation
    Algorithm:
      T ← OrderParallelize(L)
      if T ≠ ∅, return T
      BD ← BreakDependences(L)
      if BD ≠ ∅
        T ← OrderParallelize(BD(L))
        if T ≠ ∅, return {T, BD}
      endif
      T ← Distribution(BD(L))
      return {T, BD}

    OrderParallelize(L)
    Input:  L = {l_1, ..., l_n}, a loop nest
    Output: T, a parallelizing transformation
    Algorithm:
      O ← MemoryOrder(L)
      np ← Estimate(O(L), parallel)
      if np indicates parallelism is not profitable, return {O}
      T ← Parallelize(O(L))
      if T ≠ ∅   /* parallelism found */
        F ← Fusion(T(O(L)))
        return {O, T, F}
      endif
      return ∅

If the first step is unsuccessful, the next step attempts to satisfy as many dependences as possible with BreakDependences. The literature includes a collection of transformations that are used to satisfy specific dependences that inhibit parallelism. They include loop peeling, scalar expansion [KKLW], array renaming [AK, KKLW], alignment and replication [Cal], and loop splitting [AK].

These transformations may introduce new storage to eliminate storage-related anti or output dependences, or convert loop-carried dependences to loop-independent dependences, often enabling the safe application of other transformations. If all the dependences carried on a loop are eliminated, the loop may then be run in parallel. If BreakDependences has some success, OrderParallelize is called again on the result. If neither step is successful, loop distribution to introduce partial parallelism is considered, as discussed in the next section.

The Optimize algorithm is able to introduce parallelism into loops which contain procedure calls and unstructured control flow. This important ability stems from the analysis and transformations described in earlier chapters.
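As an illustration of one of the BreakDependences transformations, the sketch below shows scalar expansion on an invented loop body: giving each iteration its own copy of a temporary removes the storage-related anti and output dependences carried by the scalar, so the iterations become independent and eligible for parallel execution.

```python
def sum_of_squares_scalar(a, b):
    """Original loop: the scalar t is rewritten on every iteration,
    so output and anti dependences on t serialize the loop."""
    out = [0.0] * len(a)
    for i in range(len(a)):
        t = a[i] + b[i]
        out[i] = t * t
    return out

def sum_of_squares_expanded(a, b):
    """After scalar expansion: each iteration owns t[i], so no
    iteration reads or writes another iteration's temporary and the
    loop may be run in parallel."""
    n = len(a)
    t = [0.0] * n               # the expanded scalar
    out = [0.0] * n
    for i in range(n):
        t[i] = a[i] + b[i]
        out[i] = t[i] * t[i]
    return out
```

Both versions compute the same values; only the dependence structure changes.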

Partitioning for loop distribution and loop fusion

In this section we present a new algorithm which maximizes parallelism and minimizes the number of loops. It unifies the treatment of loop distribution and fusion. The solution is shown optimal for a single loop under certain constraints.

Loop distribution is safe if the partition of statements into new loops preserves all of the original dependences. Dependences are preserved if all statements involved in a dependence recurrence are placed in the same loop. The dependences between the partitions then form a directed acyclic graph that can always be ordered using topological sort [AK, KKP+].

By first choosing a safe partition of the loops with the finest possible granularity, and then fusing partitions back together, larger partitions may be formed. This transforms the loop distribution problem into one of loop fusion, a problem thought to be very hard. In fact, Goldberg and Paige prove a nearby fusion problem NP-hard, but their result is not applicable here [GP]. The algorithm we present is linear.

In the loop fusion problem for parallelism, the partitions begin as separate loops that are either parallel or sequential and may or may not be legally fused together. In loop distribution, it is always legal to fuse the loops back together and run the original loop sequentially. We would like to utilize all the parallelism and have the fewest loops. Callahan refers to these criteria as maximal parallelism with minimum barrier synchronization [Cal].

The loop distribution and loop fusion problem is a graph partitioning problem on a directed acyclic graph (DAG). Each node in the graph represents a sequential or parallel loop containing a set of statements. There are data dependence edges, some of which are fusion-preventing. Fusion-preventing edges exist between nodes that cannot be fused together without changing the loops' semantics. Fusion-preventing edges also exist between parallel nodes that, when fused, would force the resultant loop to be sequential. We seek to minimize the number of partitions (loops) subject to the constraint that a particular partition contains only one type of node, i.e., undirected fusion-preventing edges are implicit between sequential and parallel nodes. A more formal description follows.

Partitioning problem

Goal: group parallel loops together and sequential loops together such that parallelism is maximized while minimizing the number of partitions.

Given a DAG with:
    nodes: parallel and sequential loops
    edges: data dependence edges, some of which are fusion-preventing edges

Rules:
    1. cannot fuse two nodes with a fusion-preventing edge between them
    2. cannot change the relative ordering of two nodes connected by an edge
    3. cannot fuse sequential and parallel nodes

Callahan presents a greedy algorithm for a closely related problem that omits rule 3 [Cal, ACK]. His work also tries to partition a graph into minimal sets, but his model of parallelism includes loop-level and fork-join task parallelism. For example, consider the graph in the figure below.

Callahan's greedy algorithm partitions this graph into {P_1, S_1} and {P_2}, and places a barrier synchronization between the partitions. S_1 and P_1 run in parallel with each other, the iterations of P_1 may be performed in parallel, and P_2 is performed in parallel once they both complete. Callahan's formulation of the loop distribution problem ignores the node type, enabling the greedy algorithm to provably minimize the number of loops and maximize parallelism for a single level [Cal, ACK]. Our model of parallelism differs in that it considers only loop-level parallelism.

Figure: Counter example for the greedy algorithm
(The figure shows a DAG with a sequential loop S_1 and a parallel loop P_1, each with a dependence edge to a parallel loop P_2.)

If the available parallelism is loop-level and the cost of loop overhead is higher than the cost of barrier synchronization, our model will be an improvement.

Consider using the greedy algorithm for the loop distribution and loop fusion problem restricted to loop-level parallelism. In the example in the figure, the loops S_1 and P_1 are of different node types and cannot be placed in the same partition; therefore one of P_1 or S_1 must be selected first. If S_1 is selected first, the greedy partition is {S_1}, {P_1, P_2}. If P_1 is selected first, the greedy partition is {P_1}, {S_1}, {P_2}. The greedy algorithm is foiled because it cannot determine which node to select first.

Note that if the graph consists of just sequential nodes or just parallel nodes, then this is equivalent to Callahan's problem formulation. Therefore the greedy algorithm is optimal in the number of loop nests created when the nodes are only of one type. Our algorithm is based on this observation.
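The counter example can be checked mechanically. The sketch below is a simplified, assumption-laden stand-in for the greedy partitioner (not Callahan's exact algorithm): scanning partitions from the most recent, it fuses a node into the first type-compatible partition it can reach without passing a partition that holds one of the node's predecessors.

```python
def greedy(order, edges, fp, kind):
    """order: nodes in a topological order; edges: dependence pairs
    (u, v); fp: set of fusion-preventing pairs; kind: node -> 'S'/'P'.
    Returns the partitions (list of sets) in emitted order."""
    parts, placed = [], {}
    for n in order:
        preds = [u for (u, v) in edges if v == n]
        target = None
        for i in range(len(parts) - 1, -1, -1):
            same_type = all(kind[m] == kind[n] for m in parts[i])
            no_fp = all((m, n) not in fp and (n, m) not in fp
                        for m in parts[i])
            if same_type and no_fp:
                target = i
                break
            if any(placed.get(u) == i for u in preds):
                break  # cannot move n above a predecessor's partition
        if target is None:
            parts.append({n})
            placed[n] = len(parts) - 1
        else:
            parts[target].add(n)
            placed[n] = target
    return parts
```

On the counter-example graph (edges S_1→P_2 and P_1→P_2), selecting S1 first yields two partitions, while selecting P1 first yields three, exactly as described above.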

Simple partition algorithm

Our solution divides the problem into two parts, a sequential graph and a parallel graph. The greedy algorithm then minimizes the number of partitions for each of the graphs and maximizes parallelism in the parallel graph. Because parallel nodes and sequential nodes cannot be placed in the same partition without violating the maximum-parallelism constraint, merging the solutions results in a partitioning that maximizes parallelism and minimizes the number of loops. The remainder of this section describes how to correctly and efficiently construct the separate graphs and merge the reduced, partitioned solutions back together.

Figure: Partition graph G_o
(The figure shows a DAG with sequential nodes S_1, S_2, S_6, S_8 and parallel nodes P_3, P_4, P_5, P_7.)

Correctness of problem division

To separate the problem into two parts, the essential relationships in the original graph G_o must be preserved in the two component graphs, the sequential graph G_s and the parallel graph G_p, without introducing unnecessary constraints. First, all the ordering edges between two nodes of the same type must be preserved in the component graphs. In addition, G_o may represent relationships that prevent nodes of the same type from being in the same partition even though they do not have an edge between them. Consider G_o in the figure above. Although an edge does not directly connect S_1 and S_6, they may not be fused together without including P_4, and that fusion would violate the maximal parallelism constraint. These transitive fusion-preventing relationships are required between two nodes of the same type that are connected by a path of nodes of a different type. No other edges are required. G_s also contains the edges from G_o that connect sequential nodes; likewise for G_p.

The simplest way to preserve all the fusion-preventing relationships of G_o in the component graphs is to compute a modified transitive closure on G_o before pulling them apart. A fusion-preventing edge is added between two sequential nodes that cannot be placed in the same partition because there exists a path between them that contains at least one parallel node. Similarly, a fusion-preventing edge is added between two parallel nodes connected by a path containing a sequential node. We now show how to divide G_o and compute the necessary transitive fusion-preventing edges in linear time.

Computing a transitive closure on a DAG takes O(NE) time and space, where N is the number of nodes in G_o and E is the number of edges [AHU]. Of course, transitive closure introduces additional edges that are unnecessary. Applying it to the graph in the figure would result in the following fusion-preventing edges: (S_1,S_6), (S_2,S_6), (S_1,S_8), (S_2,S_8), (P_4,P_7). The two edges (S_1,S_8) and (S_2,S_8) are redundant because of the original ordering edge (S_6,S_8) and the two other fusion-preventing edges (S_1,S_6) and (S_2,S_6).

Efficient problem division

Using the following definition, we can determine the minimal number of fusion-preventing edges needed between parallel nodes to preserve correctness. Similarly, the minimal number of edges between sequential nodes can be determined.

Algorithm: Add transitive fusion-preventing edges to the partition graph

    FindParallelTFPedges(n)
    Input:  n ∈ G_o, a node in the original graph
    Output: G_t, the partition graph with transitive fusion-preventing edges
    Algorithm:
      if any predecessor of n is not visited, return
      mark n visited   /* reverse postorder walk */
      if n ∈ Snodes
        Paths(n) ← ∪ Paths(t), over all edges (t, n) ∈ G_o
      else
        Paths(n) ← {n}
        forall (t, n) ∈ G_o s.t. t ∈ Snodes
          forall pnode_i ∈ Paths(t)
            addFusionPreventingEdge(pnode_i, n)
          endforall
        endforall
      endif
      forall (n, m) ∈ G_o: FindParallelTFPedges(m)

Definition: Two parallel nodes pnode_i and pnode_j require a transitive fusion-preventing edge in their component graph if and only if there exists a path (pnode_i, snode_k+, pnode_j) in G_o such that every node n on the path with n ≠ pnode_i and n ≠ pnode_j satisfies n ∈ Snodes, where Snodes is the set of sequential nodes and Pnodes is the set of parallel nodes in the original graph G_o.

Intuitively, there must exist a path between the two parallel nodes that contains at least one sequential node and no other parallel nodes.

Based on this definition, FindParallelTFPedges, above, computes the necessary transitive fusion-preventing edges that must be inserted between parallel nodes. The corresponding algorithm FindSequentialTFPedges is specified similarly. FindParallelTFPedges formulates this problem like a dataflow problem, except that the solutions along the edges differ depending on the types of nodes an edge connects.

FindParallelTFPedges recursively walks the nodes in reverse postorder, so that a node is never visited until all its predecessors have been visited. For a node n it computes a set of parallel nodes Paths(n) such that pnode ∈ Paths(n) if there exists a path from pnode to n that contains sequential nodes and does not contain any other parallel nodes. Paths(n) for a sequential node is the union of the Paths of all of n's predecessors; Paths(n) for a parallel node is the node itself, {n}.

If n is a sequential node, no fusion-preventing edges need be added. If n is a parallel node, fusion-preventing edges are added from each member of Paths(t) to n, where t is a sequential node and a predecessor of n.
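A minimal sketch of this Paths computation follows. The data shapes are assumptions; the thesis formulates the walk recursively over G_o, while this version simply iterates a precomputed topological order, which visits every predecessor before its successors.

```python
from collections import defaultdict

def parallel_tfp_edges(topo, succs, is_seq):
    """topo: nodes in topological order; succs: node -> successor list;
    is_seq: node -> True for sequential nodes.
    Returns the transitive fusion-preventing pairs (p, q) between
    parallel nodes: q is reachable from p through sequential nodes."""
    preds = defaultdict(list)
    for n in topo:
        for m in succs.get(n, []):
            preds[m].append(n)
    paths, tfp = {}, set()
    for n in topo:
        if is_seq[n]:
            # parallel nodes reaching n through sequential nodes only
            paths[n] = set()
            for t in preds[n]:
                paths[n] |= paths[t]
        else:
            for t in preds[n]:
                if is_seq[t]:       # path has >= 1 sequential node
                    for p in paths[t]:
                        tfp.add((p, n))
            paths[n] = {n}          # a parallel node's Paths is itself
    return tfp
```

For the chain P1 → S2 → P3, the single edge (P1, P3) is produced; a direct parallel-to-parallel edge such as P1 → P4 adds nothing, since no sequential node lies between.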

Performing FindParallelTFPedges and FindSequentialTFPedges results in a partition graph G_t that includes all the necessary transitive relationships between parallel nodes and those between sequential nodes. This graph can now easily be separated by placing all the parallel nodes and all the edges between parallel nodes in one graph, G_p, and all the sequential nodes and the edges between them in G_s. The algorithm PullApart, below, provides the specifics.

Algorithm: Place sequential and parallel nodes into separate graphs

    PullApart(n)
    Input:  n ∈ G_t, a node in the partition graph with transitive
            fusion-preventing edges
    Output: G_p, partition graph for parallel nodes
            G_s, partition graph for sequential nodes
    Algorithm:
      mark n visited
      if n ∈ Snodes
        add n to G_s
        forall (n, m) s.t. m ∈ Snodes: add (n, m) to G_s
      else   /* n ∈ Pnodes */
        add n to G_p
        forall (n, m) s.t. m ∈ Pnodes: add (n, m) to G_p
      endif
      forall (n, m) s.t. m unvisited: PullApart(m)

If this algorithm is applied to the example graph G_o, the transitive graph and the component graphs that result appear in the figure below. The fusion-preventing edges it adds are (S_1,S_6), (S_2,S_6), and (P_4,P_7). Callahan's greedy algorithm may now be applied to the component graphs to obtain a minimal solution for each. The minimal solution for the example places S_1 and S_2 in the same partition, S_6 and S_8 in the same partition, P_3 and P_4 in the same partition, and P_7 and P_5 in the same partition. This partitioning is illustrated in the figure for G_s and G_p.
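The division itself is simple set filtering, as in this sketch (names and data shapes assumed for illustration):

```python
def pull_apart(nodes, edges, is_seq):
    """Split the partition graph G_t into G_s (the sequential nodes
    and the edges among them) and G_p (the parallel nodes and the
    edges among them). Each graph is a (nodes, edges) pair."""
    s_nodes = {n for n in nodes if is_seq[n]}
    p_nodes = set(nodes) - s_nodes
    g_s = (s_nodes, {(u, v) for (u, v) in edges
                     if u in s_nodes and v in s_nodes})
    g_p = (p_nodes, {(u, v) for (u, v) in edges
                     if u in p_nodes and v in p_nodes})
    return g_s, g_p
```

Cross-type edges are deliberately dropped here; they are reinstated when the two solutions are merged.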

Merging the solutions

Merging the separate solutions is a straightforward mapping of the edges in G_o onto the nodes in G_s and G_p to form a merged graph G_m. The merged graph is formed by placing all the edges and nodes of G_s and G_p into G_m, and then adding all the edges in G_o where one endpoint is in G_s and the other is in G_p. Because the construction of G_s and G_p assures they are both DAGs, and G_o is a DAG, G_m will be a DAG. Therefore it can be topologically sorted into a linear ordering. The figure below presents the merged graph G_m and a linear ordering for our example.
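The merge step can be sketched with Python's standard graphlib; the data shapes (partition ids, a node-to-partition map) are assumptions for illustration.

```python
from graphlib import TopologicalSorter

def merge_and_order(partitions, orig_edges, part_of):
    """Order the partitions of the merged graph G_m: keep every
    partition from G_s and G_p and re-add each G_o edge whose
    endpoints fall in different partitions. Returns one legal
    linear ordering of the partitions."""
    ts = TopologicalSorter({p: set() for p in partitions})
    for u, v in orig_edges:
        if part_of[u] != part_of[v]:
            ts.add(part_of[v], part_of[u])  # v's partition comes later
    return list(ts.static_order())
```

graphlib raises CycleError if the input is not a DAG, which by the argument above cannot happen for G_m.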

Discussion

The driver for partitioning first inserts the required fusion-preventing edges. The problem is then divided into two parts and the greedy algorithm is performed on each; the resulting solutions are minimal for each part. These solutions are then merged back together to form one overall solution that produces the minimal number of loops achievable without sacrificing parallelism.

These algorithms all take O(NE) time and space, making them practical for use in a compiler. This algorithmic approach may be applied to other graph partitioning

Figure: Dividing G_o
(The figure shows the transitive graph G_t, the sequential component G_s with nodes S_1, S_2, S_6, S_8, and the parallel component G_p with nodes P_3, P_4, P_5, P_7.)

problems as well. The separation of concerns lends itself to other problems that need to sort or partition items of different types while maintaining transitive relationships. In addition, the structure of the algorithm enables different algorithms to be used for partitioning or sorting the component graphs.

The following two sections briefly describe how to use the partitioning algorithm to perform loop fusion and loop distribution.

Loop fusion

In Algorithm Driver, candidates for loop fusion are discovered at step . The candidate nests have been optimized and may be parallel or sequential loops. In addition, candidates for loop fusion are discovered in Algorithm OrderParallelize. Both algorithms need to fuse nests that do not have a common outer loop, and may fuse loops that do have a common outer loop. Therefore, fusion of outer distinct loops is attempted first. Fusion is then considered for any candidates created by the outer fusion, as well as for candidates present in the original loop structure.

Fusion is always considered last because it may interfere with loop permutation. Permutation is multiplicative and is therefore usually more effective than fusion, which is additive, at creating a larger granularity of parallelism. However, fusion does increase granularity and reduce loop overhead. Fusion also has both good and bad consequences for cache line and register reuse. The merged references may increase data reuse due to completely intersecting references; fusion may also thwart reuse by causing the data for a single iteration to overflow the cache.
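The additive effect of fusion can be sketched as follows; this is a minimal Python illustration of the idea (array and function names are made up here, not the thesis's Fortran setting). The fused loop does the work of both original bodies per iteration, so references to a[i] and b[i] are reused while still in cache or a register.

```python
def loop1(a, b):
    for i in range(len(a)):
        b[i] = a[i] * 2.0          # first loop: reads a[i], writes b[i]

def loop2(a, b, c):
    for i in range(len(a)):
        c[i] = a[i] + b[i]         # second loop: re-reads a[i] and b[i]

def fused(a, b, c):
    for i in range(len(a)):
        b[i] = a[i] * 2.0          # a[i] and b[i] are touched once ...
        c[i] = a[i] + b[i]         # ... and reused immediately

a = [1.0, 2.0, 3.0]
b1, c1 = [0.0] * 3, [0.0] * 3
loop1(a, b1)
loop2(a, b1, c1)

b2, c2 = [0.0] * 3, [0.0] * 3
fused(a, b2, c2)

assert (b1, c1) == (b2, c2)        # fusion preserves the results
```

Legality is not free, of course: fusion is only valid when no fusion-preventing dependence runs between the two loop bodies, which is exactly what the partitioning algorithm's edges encode.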

[Figure : Fusing G_s and G_p, and merging the result into G_m.]

The resulting partition is {S1, S2}, {S6, S8}, {P3, P4}, {P5, P7}.

Algorithm : Partition Algorithm

Partition(G_o)
Input:   G_o = (N, E)
         N: parallel and sequential loops
         E: data dependence and fusion-preventing edges
Output:  T, a parallelizing transformation

Algorithm:
  forall n in G_o, mark n unvisited
  while there is an n in G_o marked unvisited that has no predecessors
    FindParallelTFPedges(n)
  endwhile
  forall n in G_o, mark n unvisited
  while there is an n in G_o marked unvisited that has no predecessors
    FindSequentialTFPedges(n)
  endwhile
  forall n in G_t, mark n unvisited
  while there is an n in G_t marked unvisited that has no predecessors
    PullApart(n)
  endwhile
  Greedy(G_p)
  Greedy(G_s)
  Merge(G_p, G_s, G_o)
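The greedy phase can be sketched as follows, under simplifying assumptions: nodes are loops tagged 'P' (parallel) or 'S' (sequential), and a set of fusion-preventing edges keeps incompatible loops apart. Two loops of the same type may join one cluster when no fusion-preventing edge connects their clusters. The names and the data are illustrative only (they mirror the figure above), and the full algorithm additionally rejects merges that would create a cycle among clusters, which this sketch omits.

```python
from itertools import combinations

def greedy_fuse(nodes, fp_edges, kind):
    """nodes: name -> 'P' or 'S'; fp_edges: fusion-preventing pairs."""
    # each node starts in its own cluster
    cluster = {n: frozenset([n]) for n in nodes}
    changed = True
    while changed:
        changed = False
        for a, b in combinations(nodes, 2):
            if nodes[a] != kind or nodes[b] != kind:
                continue
            ca, cb = cluster[a], cluster[b]
            if ca == cb:
                continue
            # a fusion-preventing edge between the clusters blocks the merge
            if any((x, y) in fp_edges or (y, x) in fp_edges
                   for x in ca for y in cb):
                continue
            merged = ca | cb
            for n in merged:
                cluster[n] = merged
            changed = True
    return {cluster[n] for n in nodes if nodes[n] == kind}

# the example from the figure: fusion-preventing edges fully separate
# {S1, S2} from {S6, S8} and {P3, P4} from {P5, P7}
nodes = {'S1': 'S', 'S2': 'S', 'S6': 'S', 'S8': 'S',
         'P3': 'P', 'P4': 'P', 'P5': 'P', 'P7': 'P'}
fp = {('S1', 'S6'), ('S1', 'S8'), ('S2', 'S6'), ('S2', 'S8'),
      ('P3', 'P5'), ('P3', 'P7'), ('P4', 'P5'), ('P4', 'P7')}
seq = greedy_fuse(nodes, fp, 'S')   # {S1, S2} and {S6, S8}
par = greedy_fuse(nodes, fp, 'P')   # {P3, P4} and {P5, P7}
```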

Loop distribution

In Algorithm Optimize, loop distribution is considered at step  to introduce parallelism when other techniques have failed. Loop distribution seeks parallelism by separating independent parallel and sequential statements in L. If the loop nest contains only a single loop, the partitioning algorithm can be applied to the finest division of the statements in the loop. Although the partitioning algorithm yields maximal theoretical parallelism, it may not be profitable to perform all of the resultant loops in parallel. In this case, all unprofitable parallel loops should be marked sequential and the partitioning algorithm should be applied again.

If there are multiple nests, loop distribution begins by finding the finest division of statements at the outermost loop level. If OrderParallelize can parallelize any of these, then the partitioning algorithm is applied to increase their granularity. For divided loop nests that OrderParallelize cannot parallelize, the outer loop is specified as sequential and this algorithm is applied recursively to the inner nest of loops. Otherwise, the nest is performed sequentially.
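The finest division mentioned above is the classical one: statements that participate in a dependence cycle (a recurrence) must stay in the same loop, so each strongly connected component of the statement-level dependence graph becomes a candidate loop. A textbook Tarjan-style pass suffices to compute it; this sketch is not the thesis code, and the statement names are illustrative.

```python
def sccs(graph):
    """Strongly connected components of {node: [successors]},
    returned in reverse topological order of components."""
    index, low, stack, on, out, n = {}, {}, [], set(), [], [0]

    def dfs(v):
        index[v] = low[v] = n[0]
        n[0] += 1
        stack.append(v)
        on.add(v)
        for w in graph[v]:
            if w not in index:
                dfs(w)
                low[v] = min(low[v], low[w])
            elif w in on:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of a component
            comp = set()
            while True:
                w = stack.pop()
                on.discard(w)
                comp.add(w)
                if w == v:
                    break
            out.append(comp)

    for v in graph:
        if v not in index:
            dfs(v)
    return out

# S2 -> S3 -> S2 is a recurrence, so S2 and S3 must share a loop;
# S1 and S4 may each be distributed into their own loop
g = {'S1': ['S2'], 'S2': ['S3'], 'S3': ['S2', 'S4'], 'S4': []}
parts = sccs(g)
```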

Integrating interprocedural transformations

We now consider integrating the interprocedural transformations of loop embedding, loop extraction, and procedure inlining into Algorithm Driver, the driver for parallel code generation. In Chapter , we developed additional testing mechanisms in the caller for optimizing nesting structures that cross procedure boundaries. Potential loop and call sequences that may benefit from embedding or extraction are adjacent procedure calls, loops adjacent to calls, and loop nests containing calls. Another candidate for embedding and extraction is a looping structure that contains an outermost loop that encloses the body of the called procedure. For example, two adjacent procedure calls may both contain parallel enclosing loops. If these loops may be fused legally and profitably, fusing them is accomplished by first performing loop extraction on both of the procedures. A candidate for procedure inlining in this setting contains loops but does not contain an enclosing loop.

Candidates for interprocedural optimization are discovered in a traversal of the augmented call graph at steps  and  in Algorithm . At step , independent loops and procedures have been optimized, and they are now considered for fusion. As illustrated in Section , the fusion algorithm is capable of testing and determining fusion of candidate loops and calls. Therefore, no additional mechanisms are required in this case. If a fusion is specified here, loop extraction is the appropriate interprocedural transformation to enable it.

In step , a loop structure rooted at m is created. If no attempt at interprocedural transformation is desired, or if there are no calls within the loop structure, m consists of just the loops in the current procedure. Even if no interprocedural transformations are considered, loops containing calls will still be parallelized when profitable. Remember that when interprocedural transformations are considered, they are still only applied if necessary to enable a parallelism-enhancing transformation.

If we wish to consider only loop embedding and extraction, then the loop structure for any calls in m that contain enclosing loops is exposed to optimization by including the enclosing loops in L and ignoring the call site. L is the loop structure on which Optimize is called. The extensions for array sections described in Chapter  place annotations at the call and loops for the accessed arrays, allowing them to be treated as normal array references. By using the annotations, the optimization routine is thus permitted to optimize across procedure boundaries when and if necessary. Similarly, if procedure inlining is a desired option, the entire loop structure in a call can be considered at the caller via this method.

Optimize does not specify explicitly the interprocedural transformations that are required to perform the optimizing transformation T that it returns. The interprocedural transformations are instead implicit in T, given the original program's looping structure.

Selecting the appropriate interprocedural transformation

To apply the set of transformations specified by T, the loops involved may need to be placed in the same routine. In particular, if T specifies a transformation across a procedure boundary, an interprocedural transformation is required. For example, if T involves imperfect nesting structures in the caller or the called procedure, then procedure inlining is required to perform T.

If the nesting structures involved in T are perfect in the caller, or are perfect enclosing loops in the called procedure (in this case, the perfect loop may contain only calls and no other statements), one of loop embedding or loop extraction is preferred. They are preferable because they do not have the other potential costs of inlining. For example, if there is only one call and its loops are involved in T, then embedding the loop into the called procedure is selected, because it reduces procedure call overhead and does not have inlining's other effects. Additionally, if T specifies a distribution which results in a single call in a nest, embedding is performed here as well. Otherwise, if there is more than one call involved, extraction is required to place the loops from all the involved calls in the same procedure. Fusion, permutation, etc. may then be performed in the caller.
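The selection rules above reduce to a small decision procedure. The following sketch uses illustrative names (not the thesis interface) to make the case analysis explicit.

```python
def select_transform(perfect_nests, num_calls_involved):
    """Choose the interprocedural transformation enabling T.

    perfect_nests: True if the nesting structures in T are perfect in
      the caller, or are perfect enclosing loops in the callee.
    num_calls_involved: number of calls whose loops participate in T.
    """
    if not perfect_nests:
        return 'inline'      # imperfect nesting forces procedure inlining
    if num_calls_involved == 1:
        return 'embed'       # push the caller's loop into the callee
    return 'extract'         # hoist loops from all callees, then
                             # fuse/permute them in the caller

assert select_transform(False, 1) == 'inline'
assert select_transform(True, 1) == 'embed'
assert select_transform(True, 2) == 'extract'
```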

Extensions to procedure cloning

Interprocedural transformations may induce additional cloning. Remember that, given a single level of parallelism, a procedure may be performed sequentially in one calling sequence and in parallel in another. For example, if a procedure is sequential, it may be called in a parallel loop or a sequential one; it is correct in either setting. If the caller requires some of its called procedure's loops to be parallel, a sequential version minus the extracted loops is needed. Clones are needed if a procedure is called in more than one parallelization setting, where the settings require different versions of the procedure. We reuse clones when settings are identical and create them when required. The four potential versions of a procedure are:

1. a sequential version (only one is required);

2. a sequential version that has loops extracted from it (as many versions are required as there are different numbers of loops extracted for parallelizing distinct callers);

3. a parallel version (only one is required); and

4. a parallel version that has loops embedded into it (as many versions are required as there are callers who embed different loops or different numbers of loops).
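The reuse-or-create policy amounts to memoizing code generation on the parallelization setting. A minimal sketch, with an invented setting key and a stand-in for actual code generation:

```python
clones = {}

def get_clone(proc, setting):
    """Return the clone of `proc` for a parallelization `setting`,
    creating it only on first request. Setting keys are illustrative,
    e.g. ('seq',), ('par',), ('seq-extracted', num_loops),
    ('par-embedded', loop_ids)."""
    key = (proc, setting)
    if key not in clones:
        clones[key] = f"{proc}@{len(clones)}"   # stand-in for codegen
    return clones[key]

a = get_clone('solve', ('par',))
b = get_clone('solve', ('par',))            # identical setting: reused
c = get_clone('solve', ('seq-extracted', 2))
assert a == b and a != c
assert len(clones) == 2                     # one clone per distinct setting
```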

Discussion

This chapter has presented a general interprocedural algorithm for determining and performing loop-based parallelization. It builds on and extends the algorithms and techniques developed in previous chapters. This algorithm has a few drawbacks: it considers neither multiple levels of parallelism nor important transformations such as loop skewing, loop reversal, and alignment. However, the framework is general enough to support the addition of these types of transformations. The algorithm does perform loop permutation, strip mining, loop fusion, loop parallelization, loop distribution, and interprocedural transformations. As we show in Chapter , it is very effective in practice.

Chapter 

Experimental Results

In this chapter, we describe an experiment to test the efficacy of the parallel code generation algorithm developed in the previous chapters. A collection of programs hand-coded for parallel machines was obtained for this experiment. From these parallel programs, two additional versions were obtained: a nearby sequential version and an automatically parallelized version. The automatically parallelized version was obtained from the nearby sequential version via the parallel code generation algorithm from Chapter . Using these program versions, we measure the ability of automatic compiler techniques to uncover parallelism that is available in the program. Based on the results of this experiment, we are guardedly optimistic. In many cases, the analysis and algorithms presented in this thesis relieve the programmer of the burden of explicit parallel programming for a variety of shared-memory parallel machines.

Introduction

A lesson to be learned from vectorization is that programmers rewrote their programs in a vectorizable style, based on feedback from their vectorizing compilers [CKK, Wolc]. Compilers were then able to take these programs and generate machine-dependent vector code with excellent results. We are testing this same thesis for shared-memory parallel machines. The experiment described below considers the automatic parallelization of sequential program versions where parallelism is known to exist. By measuring the ability of our automatic techniques to uncover this parallelism, we are also testing whether a machine-independent parallel programming style exists. This style would allow compilers to perform machine-specific optimization with excellent results.

We designed the following experiment to measure the efficacy of our automatic parallel code generator. A variety of programs written for parallel machines were assembled. Each of these programs was transformed into a nearby sequential version. For each program, a nearby sequential version was easily created by eliminating all the compiler directives. In all the programs, this process resulted in an appropriate machine-independent sequential program version.

On the nearby sequential version, we then simulated by hand our parallel code generation algorithm, using the advanced analysis and transformations provided by PFC and the ParaScope Editor (Ped). We ran and compared all three versions on a Sequent Symmetry S with  processors. We measured execution times for each version: for the entire application, for the portions the user was better able to parallelize, and for the portions our algorithm was better able to parallelize.

Methodology

Ask and ye shall receive

We solicited programs from scientists at Argonne National Laboratory and from users of the Sequent and Intel iPSC at Rice. The application programs that were volunteered had been written to run on the following parallel machines: the Sequent Symmetry S with  processors, the Alliant FX with  processors, and the Intel iPSC with  processors. The authors are numerical scientists at Rice University, Argonne National Laboratory, ICASE (Institute for Computer Applications in Science and Engineering), George Mason University, Princeton University, and the University of Tennessee. All are associated with the Center for Research on Parallel Computation.

The problems inherent to any program test set also arise here. In particular, it may be that only well-structured codes were volunteered; maybe the authors of poorly structured ones were too embarrassed to expose their codes to a critical eye. Fortunately, this furthers our arguments for a modular, machine-independent programming style rather than frustrating us during the experiments. By collecting programs rather than writing them ourselves, we avoided the pitfall of writing a test suite to match the abilities of our techniques. No screening process was performed; we used all the programs that were submitted. Table  contains the name, the abbreviation we use to refer to it, the total number of lines, and the authors of the nine programs in the test suite. They are described in more detail in Appendix A.

Original parallel versions and nearby sequential versions

For each of the programs that were written for the Sequent, this version became the original parallel version. For the programs written for other architectures, any parallelization directives were modified to reflect the equivalent Sequent directives. The nearby sequential version of each program was created by simply deleting all the parallel directives. In Erlebacher, the parallelism was not made explicit; here, a naive parallelization of outer loops was performed to create the parallel version.

Table : Program Test Suite

  Abbreviation   Program                            Lines   Authors
  Interior       Interior Point Method                      Guangye Li, Irv Lustig
  Direct         Direct Search Methods                      Virginia Torczon
  Multi          Multidirectional Search Methods            Virginia Torczon
  Erlebacher     ADI Integration                            Thomas Eidson
  Seismic        D Seismic Inversion                        Michael Lewis
  BTN            BTN Unconstrained Optimization             Stephen Nash
  Banded         Banded Linear Systems                      Stephen Wright
  ODE            Two-Point Boundary Problems                Stephen Wright
  Control        Optimal Control                            Stephen Wright
  Linpackd       Linpackd benchmark                         Jack Dongarra

Creating an automatically parallelized version

To create an automatically parallelized program, the nearby sequential program was first imported into the ParaScope Programming Environment [CCH+, HHK+]. As a result of importing the program, each procedure in the program was placed in a separate module. Also, a program composition was automatically created that describes the entire program, and the call graph was built. At this stage, the Program Composition Editor flagged modules that were incorrect. Ped then revealed a few minor semantic errors, which were corrected; for example, in one program, a procedure with many parameters had used the same name twice. Program analysis was also performed automatically.

However, to overcome gaps in the current implementation of program analysis, we used the Program Composition Editor to import dependence information from PFC. PFC is the Rice system (see Section , [AK]). PFC's analysis is more mature and includes important features not yet implemented in Ped. It performs advanced dependence tests, which include symbolic dependence tests, and it computes interprocedural constants, interprocedural symbolics, and interprocedural mod and ref information for simple array sections [GKT, HK, HK]. PFC produces a file of dependence information that is converted into Ped's internal representations.

In Ped, we used the call graph, program analysis, and the transformations that Ped provides to meticulously apply Driver, the parallelization algorithm from Chapter , to each of the programs by hand. As we discussed in Section , the implementation of transformation algorithms in Ped includes the correctness tests, but does not assist in choosing when or how to apply them. The application of the transformations was completely driven by the Driver algorithm. To perform Driver, the augmented call graph G_ac was easily derived from the call graph. The transformations were attempted as specified by the algorithm, and applied only when Ped assured their correctness. Optimization diaries were kept for each program.

Execution environment

For our experiments, we used a -processor Sequent Symmetry S that was provided by the Center for Research on Parallel Computation at Rice University under NSF Cooperative Agreement CDA. We selected the Sequent for several reasons. The Sequent has a simple parallel architecture which does not include vector hardware, allowing our experiments to focus solely upon medium-grain parallelism. Each processor has its own Kbyte, two-way set-associative cache, and the cache line size is  words. In addition, the Sequent has a very flexible compiler that allows the program to completely specify parallelism and does not introduce all available parallelism [Ost]. These features gave our algorithms complete control over the parallelization process.

To introduce parallelism into the programs, we used the parallel loop compiler directives suggested by the Sequent's user manual [Ost]. To compile and run all the program versions, we used the version of Sequent's Fortran (ATS) compiler for multiprocessing, with optimization at its highest level (O). An additional option instructed the compiler to use the Weitek floating-point accelerator. In a few programs, compiler bugs prevented the highest level of optimization and use of the Weitek chip at the same time. In these programs, the Weitek floating-point accelerator was used and optimization was suppressed.

We measured execution times for each program version: for the entire application, for the portions the user was better able to parallelize, and for the portions our algorithm was better able to parallelize. In programs where the original parallel version and the automatically parallelized version do not differ, there were no differing portions. For example, if a loop is optimized the same way in both parallel versions, the individual execution time for that loop is not distinguished. However, if the automatic version parallelized a loop and the original did not, the execution time for that loop is measured in all versions. Execution times for the differing optimized portions were measured using the microsecond clock, getusclk. The elapsed times for the entire applications were measured in seconds using secnds.

Results

In Table , we present the speedup results for the different parallel programs over their sequential counterparts. The results are divided into three categories, with two versions each. The two versions are:

  hand: the original, user hand-coded parallel version
  auto: the automatically parallelized version

The three categories are:

  Entire Application: measures the speedup over the entire application. It also indicates the percent change between the hand-coded and automatic versions.

  Degradations: measures the speedup in regions where the hand-coded version exploited more parallelism than the automatic version.

  Improvements: measures the speedup in regions where the automatically parallelized version exploited more parallelism than the original version.
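For concreteness, the two reported metrics can be sketched as follows (illustrative function names; the sign convention assumed here makes a positive percent change mean the automatic version is faster).

```python
def speedup(seq_time, par_time):
    """Speedup of a parallel version over the sequential version."""
    return seq_time / par_time

def percent_change(hand_time, auto_time):
    """Percent change between hand-coded and automatic versions."""
    return 100.0 * (hand_time - auto_time) / hand_time

assert speedup(80.0, 10.0) == 8.0          # 8x faster than sequential
assert percent_change(20.0, 15.0) == 25.0  # auto 25% faster than hand
```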

In Table , a blank entry means that no program or program subpart fell in that category. For example, Linpackd did not have an original parallel program version; therefore, all the hand-coded slots are left blank. In some cases, differences arose between versions in inner loops. When this situation occurred, the performance of the outer enclosing loop was measured, in order to disrupt the execution as little as possible. The speedups of these optimized versions are therefore actually under-reported. All these programs were complete applications which read or computed initial data, computed, and printed results. Therefore, linear speedups on the entire application were not expected and did not occur.

As can be seen in the percent change column for the entire application category, except for one program, all the automatically generated programs performed the same as the hand-coded parallel version or improved on it. In three programs, Interior, BTN, and Multi, the users found more parallelism than our automatic techniques. In Interior, these degradations did not have much effect on the overall application. If we

[Table : Speedups over sequential versions. Columns: Name; Entire Application (hand, auto, % change); Degradations (hand, auto); Improvements (hand, auto). Rows: Seismic, Erlebacher, BTN, Interior, Direct, ODE, Control†, Banded†, Multi, Linpackd (hand N/A). Legend: † =  processors; - = result not obtainable; machine is a -processor Sequent. The numeric entries did not survive extraction.]

look at the table containing the execution times (Table ), it is apparent that both the degradations and improvements affected only a small part of the overall execution time.

In BTN and Multi, the user found parallelism by using critical sections in loops, which we were unable to analyze properly. In BTN, this parallelism was actually overwhelmed by the overhead of the critical section, resulting in improved performance when executed sequentially. In Multi, the parallelism was sufficient to ameliorate the overhead of the critical section, resulting in improved performance for the hand-coded version. In the Banded program, the automatic techniques were unsuccessful in finding any parallelism; the reasons for this failure are discussed in Appendix A. Other than these programs, our algorithms either improved performance over the hand-coded version or performed equally as well as the hand-coded version.

When we consider the improvements category, when our algorithms chose a different optimization strategy from the user, they were always an improvement. This improvement was at least a factor of  and at best a factor of .

[Table : Execution times in seconds. Columns: Name; Application (seq, hand, auto); Degradations (seq, hand, auto); Improvements (seq, hand, auto). Rows: Seismic, Erlebacher, BTN, Interior, Direct, ODE, Control, Banded, Multi, Linpackd. Legend: - = result not obtainable. The numeric entries did not survive extraction.]

Parallel code generation statistics

Transformations

Besides loop parallelization, the most effective and most often applied code transformation was loop permutation to improve data locality. Outer loop parallelization was also enabled frequently by the memory ordering. In some cases, the loops needed to be strip mined such that memory reuse and parallelization were compatible; for example, loop permutation and strip mining were needed in Linpackd and Erlebacher. In this experiment, all of the transformation portions of the automatic parallelization algorithm were exercised, except for loop distribution and loop embedding. The performance estimator also was used in several instances to inhibit parallelization; an example of these decisions was found in the program ODE. Of particular interest in the program Seismic were opportunities to perform interprocedural loop extraction and loop fusion, which resulted in excellent improvements.

Analysis

The analysis provided by PFC was accurate and, for the most part, bug free. Some analysis beyond the current implementation was needed to parallelize these programs. Regular section analysis proved to be a very important feature of the current system, but a few improvements are needed. Flow-sensitive summary information about array accesses is needed to determine array kills. There were at least two programs that would have benefited from this analysis; in one of these, an array could have been made private. Currently, PFC performs symbolic analysis when the symbolic term is a constant. The analysis also needs to perform a more general and sophisticated symbolic test when the symbolic term is unknown and loop invariant. This feature would allow it to better deal with linearized arrays. However, a better solution to this problem is to reward nicely structured, multidimensional array references with excellent performance; programmers will then have an incentive to program multidimensionally.

Assertions

Five of the programs in this test suite used index arrays that were permutations of the index set [McK]. Several were monotonic non-decreasing with a well-defined pattern. In three programs, automatic parallelization would not have been possible without using an assertion and the testing techniques developed in our earlier research [McK]. The other two used index arrays in a way that did not interfere with parallelization.

Discussion

Our results are very promising. They are a clear indication that a clean, modular parallel programming style in Fortran is suitable for portable parallel programming of shared-memory machines, given sufficient compiler technology.

Chapter 

Conclusions

In this research, we undertook to prove an ambitious thesis:

Automatic compiler techniques can produce parallel programs with acceptable or excellent performance on shared-memory multiprocessors.

In this chapter, we summarize what was achieved in pursuit of supporting the thesis. Both the successes and limitations are presented. In addition, the implications of this work for programmers, other architectures, and future work are described.

Due to the limited success of other researchers attacking this problem [EB, SH, Sar], we were not confident when we began that acceptable performance would result from automatic techniques. Therefore, we concentrated much effort on designing and implementing Ped. During this process, we developed fast incremental update algorithms for many transformations. These algorithms are useful in both interactive and batch systems because of their speed and precision. Implementing and designing Ped also provided insight into the analysis and transformations, and a testbed for experimenting with different automatic techniques. Indeed, Ped is proving to be a valuable platform for compiling for other parallel architectures as well [HKK+, DKMC, HK], illustrating the usefulness of this type of tool for developing compilers for new types of architectures. At the same time, however, we pursued more advanced and general compiler techniques for shared-memory multiprocessors.

We first focused on generalizing existing compiler methods to handle conditional control flow in loops and loop nests that span procedure calls. We developed techniques for loop transformations, such as loop fusion, when loops contain arbitrary control flow. In particular, the algorithm for performing a given loop distribution in the presence of arbitrary control flow is proven optimal.

We also introduce a new approach which enables optimization across procedure call boundaries without paying the penalties of procedure inlining. Two new transformations, loop embedding and loop extraction, move loops across call boundaries, making them available to loop-level optimizations. These transformations are applied judiciously, using a goal-directed optimization strategy: the transformations are only applied when they enable performance-enhancing optimizations. These interprocedural transformations, together with the transformations for loops containing conditional control flow, provide a general platform for automatic parallelization algorithms for entire applications.

The most significant contribution of this thesis, and the core of automatic parallelization, is the algorithm that combines introducing parallelism and improving data locality. The algorithm for improving data locality is based on a simple cost model that accurately predicts cache line reuse from multiple accesses to the same memory location and from consecutive memory accesses. This algorithm is shown effective for uniprocessors as well. Given sufficient granularity, parallelism is then introduced. The algorithm which combines parallelism and data locality uses the cost model to introduce parallelism that complements data locality. This algorithm forms the core of the parallelizing compiler and is shown conclusively to be very effective in practice, illustrating the necessity of considering data locality during parallelization.
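A simplified sketch of the flavor of such a cost model follows (illustrative Python, not the thesis's exact formulation; the three-way classification and the cache line size are assumptions). Each array reference, considered against a candidate innermost loop, is loop-invariant, unit-stride, or non-unit-stride, and is charged accordingly:

```python
def ref_cost(stride, trip_count, line_size=4):
    """Estimated cache misses for one array reference over a loop.
    stride: elements between accesses on consecutive iterations
    (0 means the reference is invariant in this loop)."""
    if stride == 0:
        return 1                                     # loaded once, then reused
    if abs(stride) < line_size:
        return trip_count * abs(stride) / line_size  # one miss per cache line
    return trip_count                                # a miss on every iteration

def loop_cost(strides, trip_count):
    """Total estimated misses if this loop is made innermost."""
    return sum(ref_cost(s, trip_count) for s in strides)

# For a row-major array a[i][j], making j innermost gives stride 1,
# making i innermost gives stride N: the model prefers j innermost.
N = 100
assert loop_cost([1], N) < loop_cost([N], N)
```

An optimizer in this style evaluates each loop of a nest as a candidate innermost loop and orders the nest so the cheapest loop is innermost; a coarse-granularity parallel loop is then chosen from the remaining outer loops, so that parallelism complements rather than destroys the locality just obtained.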

The automatic parallelizing compiler further enhances the granularity of parallelism using loop fusion. Also, when necessary, the compiler achieves partial parallelism using loop distribution. The loop distribution and loop fusion problems are shown to be duals. A general algorithm for both determines maximal partial parallelism with the minimum number of loops for a collection of candidate sequential and parallel loops. This formulation is shown to maximize parallelism.
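The partitioning flavor of the fusion problem can be sketched greedily (illustrative Python; the thesis's actual algorithm is graph-based, and this simplification only fuses order-adjacent loops of the same type):

```python
def fuse(loops, blocked):
    """Greedy sketch of typed loop fusion.
    loops: ordered list of (name, 'P'|'S') - parallel or sequential.
    blocked: pairs (a, b) that a fusion-preventing dependence keeps
    in separate loops.
    Returns the candidate loops partitioned into fused groups."""
    groups = []
    for name, kind in loops:
        last = groups[-1] if groups else None
        if (last is not None and last['kind'] == kind
                and all((m, name) not in blocked for m in last['members'])):
            last['members'].append(name)   # fuse into the current group
        else:
            groups.append({'kind': kind, 'members': [name]})
    return groups

# Two parallel loops fuse; the sequential loop and the parallel loop
# after it each start a new group: three loops instead of four.
groups = fuse([('L1', 'P'), ('L2', 'P'), ('L3', 'S'), ('L4', 'P')],
              blocked=set())
```

Keeping parallel loops and sequential loops in separate groups is what makes fusion the dual of distribution: distribution splits a mixed loop into such groups, while fusion merges compatible groups back into the minimum number of loops.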

Assuming a few assertions that describe index arrays and the range of scalar values, the complete parallel code generation algorithm is then shown effective in practice via experimental results. This result illustrates very promising support of the thesis for shared-memory machines with a local cache and a common bus.

In collecting the test suite, we solicited programs from researchers and then used all the programs we received in our experiment. These programs were carefully hand-coded for good parallel performance, and many are currently the best known parallel algorithms. The fact that we were able to improve carefully hand-coded programs designed to exploit parallelism indicates that the details of parallel execution are better left to the compiler.

Many of the parallel loops in the test suite contained procedure calls and control flow. The modular program style that programmers are using to manage complexity proves the need for the generalized parallelization techniques developed in the thesis. In addition, the improvements gained over the hand-coded programs are mostly due to the component algorithm for optimizing data locality in concert with parallelism, although loop fusion did prove useful in several programs. We believe our experimental results provide strong evidence for the effectiveness of this approach. With this method, the programmer is permitted to pay more attention to the correctness of a calculation and less to the explicit loop structure required to achieve high performance.

In the test suite, the vast majority of loops that programmers specified as parallel were able to be detected as parallel by analysis. Most that were not contained unordered critical sections. Programs that use synchronization in order to perform loops in parallel, such as doacross-style parallelism and critical regions, are not handled by our approach [Cyt, Sub, HKT]. However, these loops are an important source of parallelism that should be addressed.

We have not considered the more challenging issues which arise on shared-memory machines with nonuniform access time, such as the TC Butterfly. Nor have we considered distributed-memory architectures, such as the Intel Hypercube. However, other research has shown that many of the same solutions prove effective on both shared-memory and distributed-memory machines [LS]. For example, the data locality algorithm will be used in a compiler to improve distributed-memory performance [HKT]. Compiling for these architectures is more difficult, but we believe future work will show our techniques to be applicable and that they will serve as a stepping stone in compilation for these machines.

"This is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning." - Winston Churchill, Nov., after the Battle of Egypt

Appendix A

Description of Test Suite Programs

Although most of the programs we used are from algorithms that are being used in research and have been published in the literature, they do not come from a single test suite. Therefore, we briefly describe each below. Except for Linpackd, all the programs were written to run on a parallel machine. We were unable to obtain a parallel version of the Linpackd benchmark, but included it anyway because of its importance in the numerical community and its well-known algorithms. In the "Implementation details" section for each program, we briefly describe the creation of the different program versions. We also indicate any assumptions or changes that were made to the programs to improve parallelization. As expected, the current analysis was lacking in a few cases. Whenever analysis was required beyond the current implementation, it is noted. Otherwise, all the parallelism detection and transformations were based on the current analysis. The programs are ordered alphabetically.

A Banded Linear Systems

This program is a partitioned Gaussian elimination algorithm with partial pivoting [Wria]. The system is assumed to be nonsingular. Hence, the submatrices in the chosen partitioning may be rank-deficient, and this makes the algorithm more complex than those which have been proposed for diagonally dominant and symmetric positive definite systems. It is suitable for multiprocessors with small to moderate numbers of processors. It was written by Stephen Wright at Argonne National Laboratory.

Implementation details

The hand-coded parallel version was written for an Alliant FX using Alliant compiler directives. This version was used for the sequential version. The parallelism consisted of three parallel loops containing a single procedure call each. Using the Sequent directives on those loops does not work. In attempting the automatic parallelization, analysis was complicated by index variables used to perform array linearization based on the program input. With index array assertions and advanced symbolic propagation to differentiate linearized subscripts, dependence analysis will be able to determine independence. However, the program also used offsets into a row at a call site and then subscripted it with negative indices. This practice is not legal Fortran and will thwart even advanced dependence analysis. It is most likely the bug responsible for the failure of the hand-coded and automatic versions.

A BTN Unconstrained Optimization

This program solves unconstrained nonlinear optimization problems based on a block truncated-Newton method [NS, NS]. It was written by Stephen Nash and Ariela Sofer at George Mason University. Truncated-Newton methods obtain the search direction by approximately solving the Newton equations via some iterative method. The method used here is a block version of the Lanczos method, which is numerically stable for nonconvex problems. This program also uses a parallel derivative checker.

Implementation details

This program was written for execution on the Sequent and therefore required no modifications for parallel execution. The sequential version was easily created by eliminating directives. There were two interesting parallel loops that used a critical section to update a shared variable. Using our analysis and the automatic parallel code generator, we were unable to parallelize these loops, for this and other reasons. However, as can be seen in the results section, the critical section formed a bottleneck and actually degraded performance beyond that of the sequential performance.

A Direct Search Method

This program is a derivative-free parallel algorithm for the nonlinear unconstrained optimization problem [DT]. It was written by Virginia Torczon at Rice University. It searches based on the previous function values, where the function is continuous on a compact level set. A special feature of the algorithm embodied in this parallel program is that it is easily modified to take advantage of any number of processors and to adapt to any ratio of communication cost to function evaluation cost. The parallelism in this version scales with the problem size, but not the number of processors.

Implementation details

This program was written for execution on the Sequent and therefore required no modifications for parallel execution. The sequential version was easily created by eliminating directives.

To automatically parallelize this program required an assertion that an index array used to subscript the data is a permutation array. The four parallel loops could then be identified as such by our tools. Without this assertion, no parallelism could be detected. With the assertion, the critical four loops could be identified as parallel.
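To see why the permutation assertion matters: if an index array is a permutation, the iterations of a loop that writes a[idx[i]] touch pairwise distinct locations, so no output dependence can exist. A small illustrative check (hypothetical names, not the thesis's dependence tester):

```python
def writes_are_independent(idx):
    """A loop 'for i: a[idx[i]] = f(i)' carries no output dependence
    exactly when idx never repeats a value, i.e. it is injective; for
    an index array covering 0..n-1, injective means permutation."""
    return len(set(idx)) == len(idx)

assert writes_are_independent([2, 0, 3, 1])      # permutation: parallelizable
assert not writes_are_independent([0, 1, 1, 2])  # repeated index: dependence
```

Dependence analysis alone cannot prove this property for an array whose contents are only known at run time, which is why a user assertion is required.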

The algorithm this program embodies is fully scalable, even though it is not reflected in the results shown earlier. The results are limited because the problem size was constrained to fit on the Sequent. The theoretical speedup was therefore limited. In addition, the function evaluations available for this study were very small, which allowed the overhead of the parallel constructs to become a factor, further degrading performance.

A Erlebacher

Erlebacher is a tridiagonal solver for the calculation of derivatives, written by Thomas Eidson at ICASE, NASA-Langley.

Implementation details

The hand-coded version of Erlebacher provided to us was a sequential version intended for an Intel Hypercube target. We used this as the nearby sequential version and hand-performed parallelization on this version. To create the user-parallelized version, we performed a naive parallelization of the outermost parallel loops.

A Interior Point Method

This program implements a primal-dual predictor-corrector interior point method to solve multicommodity flow problems [LL]. It was written by Guangye Li at Rice University and Irv Lustig at Princeton. This problem is a well-known application of linear programming. The block structure of the constraint matrix is exploited via parallel computation. The bundling constraints require the Cholesky factorization of a dense matrix, where a method that exploits parallelism for the dense Cholesky factorization is used.

Implementation details

This program was written for execution on the Sequent and therefore required no modifications for parallel execution. The sequential version was easily created by eliminating directives. During the automatic parallelization process, two points of interest were encountered. The first was a small bug, where a parameter was declared twice in a subroutine header, that was pointed out by the type checker. The other was that debugging I/O was still present in a procedure called by a parallel loop. Dr. Li indicated the I/O was for development purposes and could be ignored.

A Linpackd Benchmark

The Linpackd benchmark is a representative set of linear algebra routines that are widely used by scientists and engineers to perform numerical operations on matrices [DBMS]. We used a matrix size for our experiment. This program was written by Jack Dongarra at the University of Tennessee.

Modifications or assertions

This program was originally a sequential version. From this version, we derived the automatically parallelized version. We performed dead code elimination by hand, using constant propagation to delete some special-case code for non-unit stride accesses.

A Multidirectional Search Method

The parallel multidirectional search method is a more powerful and general version of the direct search method described above [DT]. It differs in that the parallelism available in the algorithm is proportional to both the size of the problem and the size of the search space. The search space is based on the number of processors. This algorithm does not just enhance a sequential algorithm; it provides a more ambitious and effective search strategy based on the number of processors, and is fully scalable.

Implementation details

This program contained a single parallel loop. Within the loop, the programmer used an unordered critical section to test for convergence. This construct could not be analyzed with existing techniques and caused parallelization to fail. More advanced techniques are needed to analyze this programming style.

A D Seismic Inversion

This program checks the adjointness of two routines, which apply a linear operator DW and its adjoint. DW and its adjoint come from D seismic inversion for oil exploration. The operator DW is the derivative of the incoherency, or differential semblance, with respect to the background sound velocity. This program was written by Michael Lewis at Rice University.

Implementation details

This program was written for execution on the Sequent and therefore required no modifications for parallel execution. The sequential version was easily created by eliminating directives. The automatically parallelized version of this program employed interprocedural loop fusion to improve performance.

A Optimal Control

This program computes solutions for linear-quadratic optimal control problems that arise from applying Newton's method or two-metric gradient projection methods to nonlinear problems [Wrib]. It is a decomposition of the domain of the problem and is related to multiple shooting methods for two-point boundary value problems.

Implementation details

This code was written for an Alliant FX using Alliant compiler directives. This version worked as the sequential version. Using the Sequent parallel loop directives allowed the original parallel version to be obtained. In order to automatically parallelize these loops, array kill analysis and one user assertion about the value of a symbolic were required.

A Two-Point Boundary Problems

This program uses finite differences to solve two-point boundary value ODEs [Wri]. It was written by Stephen Wright at Argonne National Laboratory. It uses a structured orthogonal factorization technique to solve the system in an effective, stable, and parallel manner.

Implementation details

This code was written for an Alliant FX using Alliant compiler directives. This version worked perfectly as the sequential version. The parallelism consisted of three parallel loops containing a single procedure call each. Using the Sequent parallel loop directives allowed the original parallel version to be obtained. In order to automatically parallelize these loops, array kill analysis and better symbolic dependence testing are needed than currently exist in PFC. The symbolic analysis was needed to analyze array linearizations. However, both are well within the scope of an advanced dependence analyzer.

Bibliography

[ABC+] F. Allen, M. Burke, P. Charles, R. Cytron, and J. Ferrante. An overview of the PTRAN analysis system for multiprocessing. In Proceedings of the First International Conference on Supercomputing, Athens, Greece, June. Springer-Verlag.

[ABC+] F. Allen, M. Burke, P. Charles, J. Ferrante, W. Hsieh, and V. Sarkar. A framework for detecting useful parallelism. In Proceedings of the Second International Conference on Supercomputing, St. Malo, France, July.

[AC] F. Allen and J. Cocke. A catalogue of optimizing transformations. In J. Rustin, editor, Design and Optimization of Compilers. Prentice-Hall.

[ACK] J. R. Allen, D. Callahan, and K. Kennedy. Automatic decomposition of scientific programs for parallel execution. In Proceedings of the Fourteenth Annual ACM Symposium on the Principles of Programming Languages, Munich, Germany, January.

[AHU] A. V. Aho, J. E. Hopcroft, and J. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA.

[AJ] R. Allen and S. Johnson. Compiling C for vectorization, parallelization, and inline expansion. In Proceedings of the SIGPLAN Conference on Program Language Design and Implementation, Atlanta, GA, June.

[AK] J. R. Allen and K. Kennedy. Automatic loop interchange. In Proceedings of the SIGPLAN Symposium on Compiler Construction, Montreal, Canada, June.

[AK] J. R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, October.

[AKPW] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Conference Record of the Tenth Annual ACM Symposium on the Principles of Programming Languages, Austin, TX, January.

[All] J. R. Allen. Dependence Analysis for Subscripted Variables and Its Application to Program Transformations. PhD thesis, Dept. of Computer Science, Rice University, April.

[All] J. R. Allen. Private communication, February.

[AS] W. Abu-Sufah. Improving the Performance of Virtual Memory Computers. PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign.

[Ban] U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Boston, MA.

[Bana] U. Banerjee. A theory of loop permutations. In D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing. The MIT Press.

[Banb] U. Banerjee. Unimodular transformations of double loops. In Advances in Languages and Compilers for Parallel Computing, Irvine, CA, August. The MIT Press.

[BB] W. Baxter and H. R. Bauer III. The program dependence graph and vectorization. In Proceedings of the Sixteenth Annual ACM Symposium on the Principles of Programming Languages, Austin, TX, January.

[BC] M. Burke and R. Cytron. Interprocedural dependence analysis and parallelization. In Proceedings of the SIGPLAN Symposium on Compiler Construction, Palo Alto, CA, June.

[BCHT] P. Briggs, K. Cooper, M. W. Hall, and L. Torczon. Goal-directed interprocedural optimization. Technical Report TR, Dept. of Computer Science, Rice University, December.

[BCKT] M. Burke, K. Cooper, K. Kennedy, and L. Torczon. Interprocedural optimization: Eliminating unnecessary recompilation. Technical Report TR, Dept. of Computer Science, Rice University, July.

[Ber] A. J. Bernstein. Analysis of programs for parallel processing. IEEE Transactions on Electronic Computers, October.

[BFKK] V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer. A static performance estimator in the Fortran D programming system. In J. Saltz and P. Mehrotra, editors, Languages, Compilers, and Run-Time Environments for Distributed Memory Machines. North-Holland, Amsterdam, The Netherlands.

[BHMS] M. Bromley, S. Heller, T. McNerney, and G. Steele, Jr. Fortran at ten gigaflops: The Connection Machine convolution compiler. In Proceedings of the SIGPLAN Conference on Program Language Design and Implementation, Toronto, Canada, June.

[BJ] C. Böhm and G. Jacopini. Flow diagrams, Turing machines, and languages with only two formation rules. Communications of the ACM, May.

[BK] V. Balasundaram and K. Kennedy. A technique for summarizing data access and its use in parallelism enhancing transformations. In Proceedings of the SIGPLAN Conference on Program Language Design and Implementation, Portland, OR, June.

[BKK+] V. Balasundaram, K. Kennedy, U. Kremer, K. S. McKinley, and J. Subhlok. The ParaScope Editor: An interactive parallel programming tool. In Proceedings of Supercomputing, Reno, NV, November.

[Bur] M. Burke. An interval-based approach to exhaustive and incremental interprocedural dataflow analysis. ACM Transactions on Programming Languages and Systems, July.

[Cal] D. Callahan. A Global Approach to Detection of Parallelism. PhD thesis, Dept. of Computer Science, Rice University, March.

[CCH+] D. Callahan, K. Cooper, R. Hood, K. Kennedy, and L. Torczon. ParaScope: A parallel programming environment. International Journal of Supercomputing Applications, Winter.

[CCK] D. Callahan, J. Cocke, and K. Kennedy. Estimating interlock and improving balance for pipelined machines. Journal of Parallel and Distributed Computing, August.

[CCK] D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In Proceedings of the SIGPLAN Conference on Program Language Design and Implementation, White Plains, NY, June.

[CCKT] D. Callahan, K. Cooper, K. Kennedy, and L. Torczon. Interprocedural constant propagation. In Proceedings of the SIGPLAN Symposium on Compiler Construction, Palo Alto, CA, June.

[CDL] D. Callahan, J. Dongarra, and D. Levine. Vectorizing compilers: A test suite and results. In Proceedings of Supercomputing, Orlando, FL, November.

[CF] R. Cytron and J. Ferrante. What's in a name? Or, the value of renaming for parallelism detection and storage allocation. In Proceedings of the International Conference on Parallel Processing, St. Charles, IL, August.

[CFR+] R. Cytron, J. Ferrante, B. Rosen, M. Wegman, and K. Zadeck. An efficient method of computing static single assignment form. In Proceedings of the Sixteenth Annual ACM Symposium on the Principles of Programming Languages, Austin, TX, January.

[CFS] R. Cytron, J. Ferrante, and V. Sarkar. Experiences using control dependence in PTRAN. In D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing. The MIT Press.

[CHK] K. Cooper, M. W. Hall, and K. Kennedy. Procedure cloning. In Proceedings of the IEEE International Conference on Computer Language, Oakland, CA, April.

[CHT] K. Cooper, M. W. Hall, and L. Torczon. An experiment with inline substitution. Software-Practice and Experience, June.

[CKa] D. Callahan and M. Kalem. Control dependences. Supercomputer Software Newsletter, Dept. of Computer Science, Rice University, October.

[CKb] D. Callahan and K. Kennedy. Analysis of interprocedural side effects in a parallel programming environment. In Proceedings of the First International Conference on Supercomputing, Athens, Greece, June. Springer-Verlag.

[CK] S. Carr and K. Kennedy. Blocking linear algebra codes for memory hierarchies. In Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing, Chicago, IL, December.

[CKK] D. Callahan, K. Kennedy, and U. Kremer. A dynamic study of vectorization in PFC. Technical Report TR, Dept. of Computer Science, Rice University, July.

[CKPK] G. Cybenko, L. Kipp, L. Pointer, and D. Kuck. Supercomputer performance evaluation and the Perfect benchmarks. In Proceedings of the ACM International Conference on Supercomputing, Amsterdam, The Netherlands, June.

[CKTa] K. Cooper, K. Kennedy, and L. Torczon. The impact of interprocedural analysis and optimization in the IRn programming environment. ACM Transactions on Programming Languages and Systems, October.

[CKTb] K. Cooper, K. Kennedy, and L. Torczon. Interprocedural optimization: Eliminating unnecessary recompilation. In Proceedings of the SIGPLAN Symposium on Compiler Construction, Palo Alto, CA, June.

[CSY] D. Chen, H. Su, and P. Yew. The impact of synchronization and granularity on parallel systems. In Proceedings of the International Symposium on Computer Architecture, Seattle, WA, May.

[Cyt] R. Cytron. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the International Conference on Parallel Processing, St. Charles, IL, August.

[DBMS] J. Dongarra, J. Bunch, C. Moler, and G. Stewart. LINPACK Users' Guide. SIAM Publications, Philadelphia, PA.

[DCHH] J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson. An extended set of Fortran basic linear algebra subprograms. ACM Transactions on Mathematical Software, March.

[Die] H. Dietz. Finding large-grain parallelism in loops with serial control dependences. In Proceedings of the International Conference on Parallel Processing, August.

[DKMC] E. Darnell, K. Kennedy, and J. Mellor-Crummey. Automatic software cache coherence through vectorization. In Proceedings of the ACM International Conference on Supercomputing, Washington, DC, July.

[DT] J. E. Dennis, Jr. and V. Torczon. Direct search methods on parallel machines. SIAM Journal of Optimization, November.

[EB] R. Eigenmann and W. Blume. An effectiveness study of parallelizing compiler techniques. In Proceedings of the International Conference on Parallel Processing, St. Charles, IL, August.

[FKMW] K. Fletcher, K. Kennedy, K. S. McKinley, and S. Warren. The ParaScope Editor: User interface goals. Technical Report TR, Dept. of Computer Science, Rice University, May.

[FM] J. Ferrante and M. Mace. On linearizing parallel code. In Conference Record of the Twelfth Annual ACM Symposium on the Principles of Programming Languages, New Orleans, LA, January.

[FMS] J. Ferrante, M. Mace, and B. Simons. Generating sequential code from parallel code. In Proceedings of the Second International Conference on Supercomputing, St. Malo, France, July.

[FOW] J. Ferrante, K. Ottenstein, and J. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, July.

[FST] J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, Fourth International Workshop, Santa Clara, CA, August. Springer-Verlag.

[GGGJ] V. Guarna, D. Gannon, Y. Gaur, and D. Jablonowski. Faust: An environment for programming parallel scientific applications. In Proceedings of Supercomputing, Orlando, FL, November.

[GJG] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformations. In Proceedings of the First International Conference on Supercomputing, Athens, Greece, June. Springer-Verlag.

[GJG] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, October.

[GKT] G. Goff, K. Kennedy, and C. Tseng. Practical dependence testing. In Proceedings of the SIGPLAN Conference on Program Language Design and Implementation, Toronto, Canada, June.

[GP] A. Goldberg and R. Paige. Stream processing. In Conference Record of the ACM Symposium on Lisp and Functional Programming, August.

[Hal] M. W. Hall. Managing Interprocedural Optimization. PhD thesis, Dept. of Computer Science, Rice University, April.

[HHK+] M. W. Hall, T. Harvey, K. Kennedy, N. McIntosh, K. S. McKinley, J. D. Oldham, M. Paleczny, and G. Roth. Experiences using the ParaScope Editor: an interactive parallel programming tool. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, CA, May.

[HHLa] L. Huelsbergen, D. Hahn, and J. Larus. Exact dependence analysis using data access descriptors. In Proceedings of the International Conference on Parallel Processing, St. Charles, IL, August.

[HHLb] L. Huelsbergen, D. Hahn, and J. Larus. Exact dependence analysis using data access descriptors. Technical Report, Dept. of Computer Science, University of Wisconsin at Madison, July.

[HK] P. Havlak and K. Kennedy. Experience with interprocedural analysis of array side effects. In Proceedings of Supercomputing, New York, NY, November.

[HK] P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, July.

[HK] R. v. Hanxleden and K. Kennedy. Relaxing SIMD control flow constraints using loop transformations. In Proceedings of the SIGPLAN Conference on Program Language Design and Implementation, San Francisco, CA, June.

[HKK+] S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, and C. Tseng. An overview of the Fortran D programming system. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, Fourth International Workshop, Santa Clara, CA, August. Springer-Verlag.

[HKT] S. Hiranandani, K. Kennedy, and C. Tseng. Compiler optimizations for Fortran D on MIMD distributed-memory machines. In Proceedings of Supercomputing, Albuquerque, NM, November.

[HKT] S. Hiranandani, K. Kennedy, and C. Tseng. Evaluation of compiler optimizations for Fortran D on MIMD distributed-memory machines. In Proceedings of the ACM International Conference on Supercomputing, Washington, DC, July.

[HP] M. Haghighat and C. Polychronopoulos. Symbolic dependence analysis for high-performance parallelizing compilers. In Advances in Languages and Compilers for Parallel Computing, Irvine, CA, August. The MIT Press.

[Hus] C. A. Huson. An inline subroutine expander for Parafrase. Master's thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign.

[IT] F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the Fifteenth Annual ACM Symposium on the Principles of Programming Languages, San Diego, CA, January.

[KKLWa] D. Kuck, R. Kuhn, B. Leasure, and M. J. Wolfe. Analysis and transformation of programs for parallel computation. In Proceedings of COMPSAC, the International Computer Software and Applications Conference, Chicago, IL, October.

[KKLWb] D. Kuck, R. Kuhn, B. Leasure, and M. J. Wolfe. The structure of an advanced retargetable vectorizer. In Proceedings of COMPSAC, the International Computer Software and Applications Conference, Chicago, IL, October.

[KKLW] D. Kuck, R. Kuhn, B. Leasure, and M. J. Wolfe. The structure of an advanced retargetable vectorizer. In Supercomputers: Design and Applications. IEEE Computer Society Press, Silver Spring, MD.

[KKP+] D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. J. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth Annual ACM Symposium on the Principles of Programming Languages, Williamsburg, VA, January.

[KM] K. Kennedy and K. S. McKinley. Loop distribution with arbitrary control flow. In Proceedings of Supercomputing, New York, NY, November.

[KM] K. Kennedy and K. S. McKinley. Optimizing for parallelism and data locality. In Proceedings of the ACM International Conference on Supercomputing, Washington, DC, July.

[KMC] D. Kuck, Y. Muraoka, and S. Chen. On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup. IEEE Transactions on Computers, December.

[KMM] K. Kennedy, N. McIntosh, and K. S. McKinley. Static performance estimation in a parallelizing compiler. Technical Report TR, Dept. of Computer Science, Rice University, December.

[KMTa] K. Kennedy, K. S. McKinley, and C. Tseng. Analysis and transformation in the ParaScope Editor. In Proceedings of the ACM International Conference on Supercomputing, Cologne, Germany, June.

[KMTb] K. Kennedy, K. S. McKinley, and C. Tseng. Interactive parallel programming using the ParaScope Editor. IEEE Transactions on Parallel and Distributed Systems, July.

[KMT] K. Kennedy, K. S. McKinley, and C. Tseng. Improving data locality. Technical Report TR, Dept. of Computer Science, Rice University, March.

[Knu] D. Knuth. An empirical study of FORTRAN programs. Software-Practice and Experience.

[Kuc] D. Kuck. The Structure of Computers and Computations. John Wiley and Sons, New York, NY.

[KZBG] U. Kremer, H. Zima, H.-J. Bast, and M. Gerndt. Advanced tools and techniques for automatic parallelization. Parallel Computing.

[Lam] L. Lamport. The parallel execution of DO loops. Communications of the ACM, February.

[Lea] B. Leasure, editor. PCF Fortran Language Definition. The Parallel Computing Forum, Champaign, IL, August.

[LL] I. J. Lustig and G. Li. An implementation of a parallel primal-dual interior point method for multicommodity flow problems. Technical Report CRPC-TR, Center for Research on Parallel Computation, Rice University, January.

[Lov] D. Loveman. Program improvement by source-to-source transformations. Journal of the ACM, January.

[LRW] M. Lam, E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April.

[LS] C. Lin and L. Snyder. A comparison of programming models for shared memory multiprocessors. In Proceedings of the International Conference on Parallel Processing, St. Charles, IL, August.

[LYa] Z. Li and P. Yew. Efficient interprocedural analysis for program restructuring for parallel programs. In Proceedings of the ACM SIGPLAN Symposium on Parallel Programming: Experience with Applications, Languages, and Systems (PPEALS), New Haven, CT, July.

[LYb] Z. Li and P. Yew. Interprocedural analysis and program restructuring for parallel programs. Technical Report, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, January.

[McK] K. S. McKinley. Dependence analysis of arrays subscripted by index arrays. Technical Report TR, Dept. of Computer Science, Rice University, December.

McM F McMahon The Livermore Fortran Kernels A computer test of the

numerical p erformance range Technical Rep ort UCRL Lawrence

Livermore National Lab oratory

[Mur] Y. Muraoka. Parallelism Exposure and Exploitation in Programs. PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, February. Report No.

[NS] S. G. Nash and A. Sofer. A general-purpose parallel algorithm for unconstrained optimization. SIAM Journal of Optimization, November.

[NS] S. G. Nash and A. Sofer. BTN: software for parallel unconstrained optimization. ACM TOMS, to appear.

[Ost] A. Osterhaug, editor. Guide to Parallel Programming on Sequent Computer Systems. Sequent Technical Publications, San Diego, CA.

[PGH+] C. Polychronopoulos, M. Girkar, M. Haghighat, C. Lee, B. Leung, and D. Schouten. The structure of Parafrase: an advanced parallelizing compiler for C and Fortran. In D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing. The MIT Press.

[Por] A. Porterfield. Software Methods for Improvement of Cache Performance. PhD thesis, Dept. of Computer Science, Rice University, May.

[RC] B. Ryder and M. Carroll. An incremental algorithm for software analysis. In Proceedings of the Second ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, Palo Alto, CA, December.

[RG] S. Richardson and M. Ganapathi. Interprocedural optimization: Experimental results. Software: Practice and Experience, February.

[Ros] C. Rosene. Incremental Dependence Analysis. PhD thesis, Dept. of Computer Science, Rice University, March.

[RP] B. Ryder and M. Paull. Incremental data flow analysis algorithms. ACM Transactions on Programming Languages and Systems, January.

[SA] K. Smith and W. Appelbe. PAT: an interactive Fortran parallelizing assistant tool. In Proceedings of the International Conference on Parallel Processing, St. Charles, IL, August.

[SA] K. S. Smith and W. Appelbe. An interactive conversion of sequential to multitasking Fortran. In Proceedings of the ACM International Conference on Supercomputing, Crete, Greece, June.

[Sar] V. Sarkar. PTRAN: the IBM parallel translation system. In Proceedings of the International Workshop on Compilers for Parallel Computers, Paris, France, December. To appear as a chapter in Parallel Functional Programming Languages and Compilers, editor B. Szymanski, ACM Press.

[SAS] K. Smith, W. Appelbe, and K. Stirewalt. Incremental dependence analysis for interactive parallelization. In Proceedings of the ACM International Conference on Supercomputing, Amsterdam, The Netherlands, June.

[SG] B. Shei and D. Gannon. SIGMACS: a programmable programming environment. In Advances in Languages and Compilers for Parallel Computing, Irvine, CA, August. The MIT Press.

[SH] J. Singh and J. Hennessy. An empirical investigation of the effectiveness and limitations of automatic parallelization. In Proceedings of the International Symposium on Shared Memory Multiprocessors, Tokyo, Japan, April.

[SK] R. G. Scarborough and H. G. Kolsky. A vectorizing Fortran compiler. IBM Journal of Research and Development, March.

[Sub] J. Subhlok. Analysis of Synchronization in a Parallel Programming Environment. PhD thesis, Dept. of Computer Science, Rice University, August.

[TIF] R. Triolet, F. Irigoin, and P. Feautrier. Direct parallelization of CALL statements. In Proceedings of the SIGPLAN Symposium on Compiler Construction, Palo Alto, CA, June.

[Tow] R. A. Towle. Control and Data Dependence for Program Transformation. PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, March.

[WL] M. E. Wolf and M. Lam. Maximizing parallelism via loop transformations. In Proceedings of the Third Workshop on Languages and Compilers for Parallel Computing, Irvine, CA, August.

[WL] M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, Toronto, Canada, June.

[Wol] M. J. Wolfe. Optimizing Supercompilers for Supercomputers. PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, October.

[Wol] M. J. Wolfe. Loop skewing: The wavefront method revisited. International Journal of Parallel Programming, August.

[Wola] M. J. Wolfe. More iteration space tiling. In Proceedings of Supercomputing, Reno, NV, November.

[Wolb] M. J. Wolfe. Optimizing Supercompilers for Supercomputers. The MIT Press, Cambridge, MA.

[Wolc] M. J. Wolfe. Semi-automatic domain decomposition. In Proceedings of the th Conference on Hypercube Concurrent Computers and Applications, Monterey, CA, March.

[Wria] S. J. Wright. Parallel algorithms for banded linear systems. SIAM Journal of Scientific and Statistical Computation, July.

[Wrib] S. J. Wright. Partitioned dynamic programming for optimal control. SIAM Journal of Optimization, November.

[Wri] S. J. Wright. Stable parallel algorithms for two-point boundary value problems. SIAM Journal of Scientific and Statistical Computation, to appear.

[Zad] F. Zadeck. Incremental data flow analysis in a structured program editor. In Proceedings of the SIGPLAN Symposium on Compiler Construction, Montreal, Canada, June.

[ZBG] H. Zima, H.-J. Bast, and M. Gerndt. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing.