appeared in IEEE Computer, Volume 28, Number 10, October 1995

An Overview of the PARADIGM Compiler for
Distributed-Memory Multicomputers

Prithviraj Banerjee, John A. Chandy, Manish Gupta†, Eugene W. Hodges IV, John G. Holm,
Antonio Lain, Daniel J. Palermo, Shankar Ramaswamy, and Ernesto Su

Center for Reliable and High-Performance Computing
University of Illinois at Urbana-Champaign
W. Main St., Urbana, IL
banerjee@crhc.uiuc.edu

Abstract

Distributed-memory multicomputers such as the Intel Paragon, the IBM SP, and the Thinking Machines CM-5 offer significant advantages over shared-memory multiprocessors in terms of cost and scalability. Unfortunately, extracting all the computational power from these machines requires users to write efficient software for them, which is a laborious process. The PARADIGM compiler project provides an automated means to parallelize programs written using a serial programming model for efficient execution on distributed-memory multicomputers. In addition to performing traditional compiler optimizations, PARADIGM is unique in that it addresses many other issues within a unified platform: automatic data distribution, communication optimizations, support for irregular computations, exploitation of functional and data parallelism, and multithreaded execution. This paper describes the techniques used and provides experimental evidence of their effectiveness on the Intel Paragon, the IBM SP, and the Thinking Machines CM-5.

Keywords: compilers, distributed-memory multicomputers, parallelizing compilers, parallel processing.

This research was supported in part by the Office of Naval Research, by the National Aeronautics and Space Administration (NASA NAG), and in part by an AT&T graduate fellowship, a Fulbright/MEC fellowship, an IBM graduate fellowship, and an ONR graduate fellowship. We are also grateful to the National Center for Supercomputing Applications, the San Diego Supercomputing Center, and the Argonne National Laboratory for providing access to their machines.

† Currently working at the IBM T. J. Watson Research Center, Yorktown Heights, NY.

A preliminary version of this paper appeared in the First International Workshop on Parallel Processing, Bangalore, India.

Introduction

Distributed-memory multicomputers can provide the high levels of performance required to solve the Grand Challenge computational science problems. Multicomputers such as the Intel Paragon, the IBM SP-1/SP-2, and the Thinking Machines CM-5 offer significant advantages over shared-memory multiprocessors in terms of cost and scalability. Unfortunately, extracting all the computational power from these machines requires users to write efficient software for them, which is a laborious process. One major reason for this difficulty is the absence of a global address space. As a result, the programmer has to manually distribute computations and data across processors and manage communication explicitly. The PARADIGM (PARAllelizing compiler for DIstributed-memory General-purpose Multicomputers) project at the University of Illinois addresses this problem by developing an automated means to parallelize and optimize sequential programs for efficient execution on distributed-memory multicomputers.

To understand the complexity of writing programs in a message-passing model, it is useful to examine the parallel code generated by the compiler, which roughly corresponds to what an experienced programmer might write, for a small example. Figure (a) shows an example serial program for Jacobi's iterative method for solving systems of linear equations, and Figure (b) shows a highly efficient parallel program for an Intel Paragon with a variable number of processors. From this example it is apparent that if a programmer were required to manually parallelize even a moderately sized application, the effort would be tremendous. Furthermore, coding with explicit communication operations commonly results in errors which are notoriously hard to find. By automating the parallelization process, it will become possible to offer high levels of performance to the scientific computing community at large.

Some of the other major research efforts in this area include Fortran D, Fortran 90D, and the Superb compiler. In order to standardize parallel programming with data distribution directives, High Performance Fortran (HPF) has also been developed in a collaborative effort between researchers in industry and academia. A number of commercial HPF compilers are beginning to appear, including products from Applied Parallel Research, Convex, Cray, Digital, IBM, The Portland Group, Silicon Graphics, Thinking Machines, and others.

What sets the PARADIGM project apart from other compiler efforts for distributed-memory multicomputers is the broad range of research topics being addressed.

Figure: Example program — Jacobi's iterative method. (a) Serial version: a simple relaxation kernel that repeatedly updates each interior element of an array from its four neighbors. (b) Parallel version for a variable number of processors with a minimum mesh configuration: the compiler-generated code configures the processor mesh (mgetnum, mgridinit, mgridcoord, mgridrel), computes local block bounds, packs and unpacks border elements, and exchanges them with neighboring processors through explicit csend/crecv message-passing calls.
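For concreteness, the serial input to the compiler is essentially the following relaxation kernel. This is a minimal sketch: the array size np, the iteration count ncycles, the initialization, and the 0.25 averaging are placeholder choices, since the constants of the published listing are not reproduced above.

      program jacobi
c     minimal sketch of the serial Jacobi relaxation kernel of
c     Figure (a); np, ncycles, and the initialization are
c     placeholder values chosen only to make the sketch runnable
      integer np, ncycles
      parameter (np = 64, ncycles = 10)
      real A(np, np), B(np, np)
      integer i, j, k
c     arbitrary initialization
      do j = 1, np
         do i = 1, np
            B(i, j) = real(i + j)
         end do
      end do
      do k = 1, ncycles
c        update each interior element from its four neighbors
         do j = 2, np - 1
            do i = 2, np - 1
               A(i, j) = 0.25 * (B(i-1, j) + B(i+1, j) +
     &                           B(i, j-1) + B(i, j+1))
            end do
         end do
c        copy the new values back for the next sweep
         do j = 2, np - 1
            do i = 2, np - 1
               B(i, j) = A(i, j)
            end do
         end do
      end do
      write (*, *) 'B(2,2) after', ncycles, ' sweeps:', B(2, 2)
      end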

In addition to performing traditional compiler optimizations to distribute computations and to reduce communication overheads, the research in the PARADIGM project aims at combining the following aspects: performing automatic data distribution for regular computations, optimizing communication for regular computations, supporting irregular computations using a combination of compile-time analysis and run-time support, exploiting functional and data parallelism simultaneously, and generating multithreaded, message-driven code to hide communication latencies. Current efforts in the project aim at integrating all of these capabilities into the PARADIGM framework. In this article we briefly describe the techniques used and provide experimental results to demonstrate their utility.

Sidebar: Distributed-Memory Compiler Terminology

Data Parallelism: Parallelism that exists by performing similar operations simultaneously across different elements of a data set (SPMD: Single Program, Multiple Data).

Functional Parallelism: Parallelism that exists by performing potentially different operations on different data sets simultaneously (MPMD: Multiple Program, Multiple Data).

Regular Computations: Computations that typically use dense, regular matrix structures; regular accesses can usually be characterized using compile-time analysis.

Irregular Computations: Computations that typically use sparse, irregular matrix structures; irregular accesses are usually input-data dependent, requiring run-time analysis.

Data Partitioning: The physical distribution of data across the processors of a parallel machine in order to efficiently use available memory and improve the locality of reference in parallel programs.

Global Index/Address: Index used to access an element of an array dimension when the entire dimension is physically allocated on the processor; equivalent to the index used in a serial program.

Local Index/Address: Index pair (processor, index) used to access an element of an array dimension when the dimension is partitioned across multiple processors; the local index can also refer to just the index portion of the pair.

Owner Computes Rule: States that all computations modifying the value of a data element are to be performed by the processor to which the element is assigned by the data partitioning.

User-Level Thread: A context of execution under user control which has its own stack and registers.

Multithreading: A set of user-level threads that share the same user data space and cooperatively execute a program.

Figure: PARADIGM compiler overview. A sequential FORTRAN 77 or HPF program is processed by Parafrase-2 (program analysis and dependence passes), followed by automatic data partitioning, irregular pattern analysis, and task graph synthesis. Regular-pattern static analysis and optimizations, irregular-pattern run-time support, and task allocation and scheduling produce SPMD or MPMD code, optionally transformed for multithreading with its run-time support. A generic library interface and code generation phase emits the optimized parallel program.

Compiler Framework

The figure above shows a functional illustration of how we envision the complete PARADIGM compilation system. The compiler accepts either a sequential FORTRAN 77 or High Performance Fortran (HPF) program and produces an explicit message-passing FORTRAN 77 program with calls to the selected message-passing library and our run-time system. The following are brief descriptions of the major phases in the parallelization process.

Program Analysis: Parafrase-2 is used as a preprocessing platform to parse the sequential program into an intermediate representation and to analyze the code to generate flow, dependence, and call graphs. Various code transformations, such as constant propagation and induction variable substitution, are also performed at this stage.

Automatic Data Partitioning: For regular computations, the data distribution is determined automatically by the compiler. It configures the machine into an abstract multidimensional mesh of processors and decides how program data is to be distributed on the mesh. Estimates of the time spent in computation and communication drive the selection of the data distribution. High-level communication operations and other communication optimizations performed in the compiler are reflected in the estimates in order to correctly determine the best distribution.

Regular Computations: Using the owner computes rule, the compiler divides computation across processors according to the selected data distribution and generates interprocessor communication for required nonlocal data. To avoid the overhead of computing ownership at run time, static analysis is used to partition loops at compile time (loop bounds reduction), as sketched below. In addition, a number of optimizations are performed to reduce the overhead of communication.
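As an illustration of loop bounds reduction, consider a one-dimensional iteration space of n elements block-distributed over p processors. Rather than guarding every iteration with an ownership test, each processor computes the bounds of the block it owns and iterates only over that range. The sketch below is illustrative only; n, p, and the block-size formula are assumptions, not PARADIGM's generated code.

      program bounds
c     illustrative sketch of loop bounds reduction for a BLOCK
c     distribution: each processor iterates only over the indices
c     it owns instead of testing ownership on every iteration
      integer n, p, b, myp, lo, hi
      parameter (n = 100, p = 4)
c     block size (the last processor may own a partial block)
      b = (n + p - 1) / p
      do myp = 0, p - 1
         lo = myp * b + 1
         hi = min((myp + 1) * b, n)
         write (*, *) 'processor', myp, ' executes i =', lo, ' to', hi
      end do
      end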

Irregular Computations: In many important applications, compile-time analysis is insufficient when communication patterns are data dependent and known only at run time. A subset of these applications has the interesting property that the communication pattern repeats across several steps. PARADIGM approaches these problems through a combination of flexible irregular run-time support and compile-time analysis. Novel features in our approach are the exploitation of spatial locality and the overlapping of computation and communication.

Functional Parallelism: Recent research has shown the benefits of simultaneous exploitation of functional and data parallelism for some applications. Such applications can be viewed as a graph composed of a set of data-parallel tasks with precedence relationships which describe the functional parallelism that exists among the tasks. Using this task graph, PARADIGM exploits functional parallelism by determining the number of processors to allocate for each data-parallel task and scheduling the tasks such that the overall execution time is minimized. The techniques used for regular and irregular data-parallel compilation are used to generate code for each of the data-parallel tasks.

Multithreading: Message-passing programs normally send messages asynchronously and block when waiting for messages, resulting in lower efficiency. One solution is to run multiple threads on each processor to overlap computation and communication. Multithreading allows one thread to utilize the unused cycles which would otherwise be wasted waiting for messages. Compiler transformations are used to convert message-passing code into a message-driven model, thereby simplifying the multithreading run-time system. Multithreading is most beneficial for programs with a high percentage of idle cycles, such that the overhead of switching between threads can be hidden.

Generic Library Interface: Support for specific communication libraries is provided through a generic library interface. For each supported library, abstract functions are mapped to corresponding library-specific code generators at compile time. Library interfaces have been implemented for Thinking Machines CMMD, Parasoft Express, MPI, Intel NX, PVM, and PICL. Execution tracing, as well as support for multiple platforms, is also provided in Express, PVM, and PICL. The portability of this interface allows the compiler to easily generate code for a wide variety of machines.

In the remainder of this paper, each of the major areas within the compiler is described in more detail. The sections that follow outline the techniques used in automatic data partitioning, describe the analysis and optimizations performed for regular computations, present our approach for irregular computations, explore the simultaneous use of functional and data parallelism, and report on multithreaded message-driven code.

Sidebar: Data Distribution

Figure: Examples of data distributions for a two-dimensional array — (a) block, (b) cyclic(k), (c) (block, block), and (d) (cyclic(k), cyclic(k)), with each array element labeled by the processor (0-3) that owns it.

Arrays are physically distributed across processors to efficiently use available memory and improve the locality of reference in parallel programs. In High Performance Fortran, either the programmer or the compiler must specify the distribution of program data. Several examples of data distributions are shown for a two-dimensional array; each dimension of an array can be given a specific distribution. Blocked and cyclic distributions are actually two extremes of a general cyclic(k) distribution, commonly referred to as block-cyclic, where k is the block size. A block distribution is equivalent to a block-cyclic distribution in which the block size is the size of the original array divided by the number of processors, cyclic(N/P). A cyclic distribution is simply a block-cyclic distribution with a block size of one, cyclic(1).
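The mapping implied by these distributions is mechanical. The following sketch, with n, p, and k chosen arbitrarily for illustration, enumerates the processor and local index that a cyclic(k) distribution assigns to each global index of a one-dimensional array; setting k = N/P or k = 1 covers the block and cyclic cases, respectively.

      program gl2loc
c     illustrative sketch: map each global index g (1..n) of a
c     one-dimensional array to its (processor, local index) under a
c     cyclic(k) distribution over p processors numbered 0..p-1;
c     k = n/p gives a block distribution and k = 1 a cyclic one
      integer n, p, k, g, blk, proc, loc
      parameter (n = 16, p = 4, k = 2)
      do g = 1, n
c        block number, owning processor, and local index within the
c        processor's own sequence of blocks
         blk  = (g - 1) / k
         proc = mod(blk, p)
         loc  = (blk / p) * k + mod(g - 1, k) + 1
         write (*, *) g, ' -> processor', proc, ' local index', loc
      end do
      end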

Figure: Automatic data partitioning overview. The sequential program's internal Parafrase-2 representation is processed by detector, driver, and solver modules in successive phases (array alignment, block/cyclic distribution, block size selection, and mesh configuration), guided by computational and communication cost estimators, to produce the data distribution specifications.

Automatic Data Partitioning

Determining the best data partitioning for a given application is a difficult task that requires careful examination of numerous tradeoffs. Since communication tends to be more expensive relative to local computation, a partitioning should be selected to maintain high data locality for each processor. Excessive communication can easily offset any gains made through the use of available parallelism in the program. At the same time, the partitioning should also evenly distribute the workload among the processors, making full use of the parallelism present in the computation. Since the programmer may not be (and should not have to be) aware of all the interactions between distribution decisions and compiler optimizations, automatic data partitioning:

- reduces the burden on the programmer,
- improves program portability and machine independence, and
- improves the selection of data distributions.

In the compiler, data partitioning decisions are made in a number of distinct phases, as illustrated in the figure above. Often there is a tradeoff between minimizing interprocessor communication and exploiting all available parallelism; the communication and the computational costs imposed by the underlying architecture must both be taken into account. These costs are generated using architectural parameters for each target machine. With the exception of the architecture-specific costs, the partitioning algorithm is machine independent.

Below are brief descriptions of each of the major phases performed during the partitioning pass. Each phase involves identification of data distribution preferences by a detector module, assignment of costs by a driver to quantify the estimated performance impact of those preferences, and the resolution of any conflicts by a solver.

Array Alignment: The alignment pass identifies which array dimensions should be mapped to the same processor mesh dimension. The alignment preferences between two arrays can be between different pairings of dimensions (interdimensional alignment) as well as by an offset or stride within a given pair of dimensions (intradimensional alignment). Currently, only interdimensional alignment is performed in the partitioning pass.

Block/Cyclic Distribution: Once array alignment has been performed, the distribution pass determines whether each array dimension should be distributed in a blocked or cyclic manner. Array dimensions are first classified by their communication requirements. If the communication in a mesh dimension is recognized as a nearest-neighbor pattern, it indicates the need for a blocked distribution. For dimensions that are only partially traversed (less than a certain threshold), a cyclic distribution may be more desirable for load balancing. Using alignment information from the previous phase, array dimensions that cross-reference each other are assigned the same kind of partitioning to ensure the intended alignment.

Block Size Selection: When a cyclic distribution is chosen, the compiler is able to make further adjustments to the block size, giving rise to block-cyclic partitionings. Since a cyclic distribution is chosen in the previous phase only to improve the load balance for partially traversed array dimensions, a closer examination of the communication costs must be performed. This analysis is sometimes needed when arrays are used to simulate record-like structures (not supported directly in FORTRAN 77) or when lower-dimensional arrays play the role of higher-dimensional arrays.

Mesh Configuration: After all the distribution parameters have been determined, the cost estimates are functions of only the number of processors in each mesh dimension. For each set of aligned array dimensions, the compiler determines if there are any parallel operations performed. If no parallelism exists in a given dimension, it is collapsed onto a single processor. If there is only one dimension that has not been collapsed, all processors are assigned to this dimension. In the case of multiple dimensions of parallelism, the compiler determines the best arrangement of processors by evaluating the cost expression to estimate execution time for each feasible configuration.
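The last step amounts to a small search. The sketch below enumerates the feasible two-dimensional mesh shapes for a fixed processor count and keeps the one minimizing a stand-in cost expression; both the cost formula and its constants are placeholders, not PARADIGM's estimator.

      program meshcfg
c     illustrative sketch: enumerate feasible 2-D mesh configurations
c     p1 x p2 (with p1*p2 = p) and keep the one with the lowest
c     estimated execution time; the cost function is a placeholder
      integer p, p1, p2, b1, b2
      real cost, best
      parameter (p = 16)
      best = 1.0e30
      b1 = 0
      b2 = 0
      do p1 = 1, p
         if (mod(p, p1) .eq. 0) then
            p2 = p / p1
c           placeholder estimate: computation shrinks with p1*p2,
c           communication grows with the mesh perimeter
            cost = 1.0e6 / real(p1 * p2) + 5.0e3 * real(p1 + p2)
            if (cost .lt. best) then
               best = cost
               b1 = p1
               b2 = p2
            end if
         end if
      end do
      write (*, *) 'selected mesh:', b1, ' x', b2, ' cost', best
      end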

At this point, the distribution information is passed on to direct the remainder of the parallelization process in the compiler. The user may also desire to generate an HPF program containing the directives which specify the selected partitioning. This technique allows the partitioning pass to be used as an independent tool while remaining integrated with the compilation system. Furthermore, being integrated ensures that the partitioning pass is always aware of the optimizations performed by the compiler. For more complex programs it is also possible to further improve performance by redistributing data at selected points in the program. The static partitioner is currently being extended to automatically determine when such dynamic data partitionings are useful.

Regular Computations

For regular computations, in which the communication pattern can be determined at compile time, PARADIGM uses static analysis to partition computation across processors and to generate optimized interprocessor communication. To do this analysis efficiently, the compiler needs a mechanism to describe partitioned data and iteration sets. Processor Tagged Descriptors (PTDs) are used to provide a uniform representation of these sets for every processor. Operations on PTDs are extremely efficient, simultaneously capturing the effect on all processors in a given dimension.

PTDs, however, are not general enough to handle the most complicated array distributions, references, and loop bounds that are occasionally found in real codes (see the figure below). For these complex cases, PARADIGM represents partitioned data and iteration sets by symbolic linear inequalities and generates loops to scan these regions using a technique known as Fourier-Motzkin projection. To implement Fourier-Motzkin projection, Mathematica, a powerful off-the-shelf symbolic analysis system, is linked with the compiler to provide a high level of symbolic support as well as rapid prototyping. Thus, by using the efficient PTD representation for the simplest and most frequent cases and a more general inequality-based representation for the difficult cases, PARADIGM is able to compile a larger proportion of programs without jeopardizing compilation speed.

Figure: Example loop with complex array references and distributions — a triangular loop nest (the bounds of the inner DO j loop depend on i) in which array A, aligned through an HPF ALIGN directive with a template T distributed (CYCLIC, BLOCK) onto a processor mesh, is assigned from array B, distributed (CYCLIC, CYCLIC) onto the same mesh, through strided subscript expressions in i and j.

The performance of the resulting parallel program also greatly depends on how well its interprocessor communication has been optimized. Since the start-up cost of communication (overhead) tends to be several orders of magnitude greater than either the per-element computation cost or the per-byte transmission cost (rate), frequent communication can easily dominate the execution time. A linear point-to-point transfer cost of a message of m bytes is used as a basis for the communication model:

    transfer(m) = overhead + rate * m

where the values for overhead and rate are empirically measured for a given machine.
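The imbalance between the start-up and per-byte terms is what makes message combining pay off. The following sketch evaluates the model for n per-element messages versus one combined message of the same total size; the overhead, rate, and message parameters are invented for illustration rather than measured values.

      program commcost
c     illustrative comparison of n per-element messages versus one
c     vectorized message under transfer(m) = overhead + rate*m;
c     overhead and rate below are made-up values (microseconds)
      real overhead, rate, bytes, tsingle, tvect
      integer n
      parameter (n = 1000)
      overhead = 100.0
      rate = 0.05
      bytes = 4.0
c     n separate single-element messages pay the start-up cost n times
      tsingle = real(n) * (overhead + rate * bytes)
c     one combined message pays it once for the same data volume
      tvect = overhead + rate * (real(n) * bytes)
      write (*, *) 'per-element (us):', tsingle
      write (*, *) 'vectorized  (us):', tvect
      end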

Several optimizations are employed that combine messages in different ways to amortize the start-up cost, thereby reducing the total amount of communication overhead in the program. In loops where there are no cross-iteration dependencies, parallelism is extracted by independently executing groups of iterations on separate processors. For independent references, message coalescing, message vectorization, and message aggregation are used to reduce the overhead associated with frequent communication. For references within loops that contain cross-iteration dependencies, coarse-grain pipelining is employed to optimize communication across loops while balancing the overhead with the available parallelism.

Message Coalescing: Redundant communication for different references to the same data is unnecessary if the data has not been modified between uses. When statically analyzing individual references, redundant communication is detected and coalesced into a single message, allowing the data to be reused rather than communicated for every reference. For different sections of a given array, individual elements are coalesced by performing a union of the different sections, thereby ensuring that overlapping data elements are communicated only once. Since coalescing will either eliminate entire communication operations or reduce the size of messages containing array sections, it is always beneficial.

Figure: Optimizations used to reduce the overhead associated with frequent communication — (a) message vectorization and (b) message aggregation, each illustrated before and after for array elements communicated between processors P1 and P2.

Message Vectorization: Nonlocal elements of an array that are indexed within a loop nest can also be vectorized into a single larger message instead of being communicated individually (see part (a) of the figure above). Dependence analysis is used to determine the outermost loop at which the combining can be applied. The element-wise messages are combined, or vectorized, as they are lifted out of the enclosing loop nests to the selected vectorization level. Vectorization reduces the total number of communication operations (and hence the total overhead) while increasing the message length.

Message Aggregation: Multiple messages to be communicated between the same source and destination can also be aggregated into a single larger message. Communication operations are first sorted by their destinations during the analysis. Messages with identical source and destination pairs are then combined into a single communication operation (see part (b) of the figure above). Aggregation can be performed on communication operations of individual data references as well as on vectorized communication operations. The gain from aggregation is similar to vectorization in that the total overhead is reduced at the cost of increasing the message length.

Figure: Comparison of message coalescing, vectorization, and aggregation on the CM-5. Execution time, normalized to the serial code and separated into busy and overhead components, is shown for ADI, EXPL, and Jacobi (including a 1-D distributed Jacobi variant). SERIAL is the original unmodified serial code; COAL is the statically partitioned parallel program with message coalescing; VECT adds message vectorization to COAL; AGGR adds message aggregation to COAL; ALL adds both vectorization and aggregation to COAL.

To illustrate the efficacy of these optimizations, the performance of several program fragments executed on the CM-5 is shown in the figure above. The automatic data distribution pass selected linear 1-D partitionings for both ADI Integration and Explicit Hydrodynamics (two Livermore kernels), and a 2-D partitioning for Jacobi's iterative method, similar to that previously shown in the example program.

For comparison purposes, the reported execution times have been normalized to the serial execution of the corresponding program and are further separated into two quantities:

- Busy: time spent performing the actual computation.
- Overhead: time spent executing code related to computation partitioning and communication.

The relative effectiveness of each optimization can be seen by examining the amount of overhead eliminated as the optimizations are incrementally applied.

Figure: Pipelined execution of recurrences — (a) sequential, (b) fine-grain, and (c) coarse-grain execution of a row-partitioned computation on processors P0, P1, and P2, with an inset showing speedup as a function of granularity and the optimal granularity lying between the fine and coarse extremes.

It is also interesting to notice that an additional run of a 1-D partitioned version of Jacobi shows a higher overhead compared to the compiler-selected 2-D version. This shows the effectiveness of the automatic data partitioning pass, since it was able to select the best distribution despite the small differences in performance. For larger machine sizes and more complex programs, the utility of automatic data distribution will be even more apparent, as the communication costs become greater for inferior data distributions.

Coarse-Grain Pipelining: In cases where there are cross-iteration dependencies due to recurrences, it is not possible to immediately execute every iteration in parallel. Often, however, there is the opportunity to overlap parts of the loop execution, synchronizing to ensure that the data dependencies are enforced.

To illustrate this technique, assume an array is block-partitioned by rows and dependencies exist from the previous row and previous column. In Figure (a), each processor performs an operation on every element of the rows it owns before sending the border row to the waiting processor, thereby serializing execution of the entire computation. Instead, in Figure (b), the first processor can compute the elements of one partitioned column and then send the border element of that column to the next processor, such that it can begin its computation immediately. Ideally, if communication has zero overhead, this is the most efficient form of computation, since no processor will wait unnecessarily. However, as discussed earlier, the cost of performing numerous single-element communications can be quite expensive compared to the small strips of computation. To address this problem, this overhead can be reduced by increasing the granularity of the communication (see Figure (c)). An analytic pipeline model has been developed, using estimates of computation and communication, to allow the compiler to automatically select a granularity that results in near-optimal performance.
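The granularity selection can be pictured with a toy version of such a model: with p pipeline stages, n columns per processor, a per-column compute time, and strips of width w, execution takes roughly (n/w + p - 1) strip steps, each costing the strip's computation plus one message. The sketch below minimizes this expression over w; the model and all of its constants are illustrative stand-ins, not the compiler's analytic pipeline model.

      program pipemdl
c     toy model of coarse-grain pipelining: pick the strip width w
c     minimizing (n/w + p - 1) * (w*tc + overhead + rate*w*bytes);
c     every constant here is an illustrative placeholder
      integer p, n, w, bestw
      real tc, overhead, rate, bytes, steps, tstep, t, best
      parameter (p = 8, n = 512)
      tc = 2.0
      overhead = 100.0
      rate = 0.05
      bytes = 4.0
      best = 1.0e30
      bestw = 1
      do w = 1, n
         if (mod(n, w) .eq. 0) then
            steps = real(n / w + p - 1)
            tstep = real(w) * tc + overhead + rate * real(w) * bytes
            t = steps * tstep
            if (t .lt. best) then
               best = t
               bestw = w
            end if
         end if
      end do
      write (*, *) 'selected strip width', bestw, ' modeled time', best
      end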

Irregular Computations

In many important applications, compile-time analysis is insufficient when the required communication patterns are data dependent and thus are only known at run time. For example, the computation of airflow and surface stress over an airfoil may utilize an irregular finite element grid such as the one shown in the figure. To efficiently run such irregular applications on a massively parallel multicomputer, run-time compilation techniques can be used. The dependency structure of the program is analyzed in a preprocessing step before the actual computation occurs. If the same structure of a computation is maintained across several steps, this preprocessing can be reused, amortizing its cost. In practice this concept is implemented with two sequences of code: an inspector for preprocessing and an executor for performing the actual computation.

Figure: Finite element airfoil grid.
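A minimal sketch of the inspector/executor structure, from a single processor's point of view, is shown below. The processor owns a contiguous range of grid-node values and sweeps over edges given by indirection arrays; the inspector records once which referenced nodes are off-processor, and the executor reuses that schedule on every step. All of the data, the ownership range, and the simulated gather are invented for illustration and stand in for what the irregular run-time support actually manages.

      program insexec
c     illustrative inspector/executor sketch for one processor that
c     owns grid nodes lo..hi of a value array x; edges reference
c     nodes through indirection arrays, and off-processor references
c     are collected once by the inspector and reused by the executor;
c     all data below are invented for illustration
      integer nedge, lo, hi, maxfch
      parameter (nedge = 5, lo = 1, hi = 4, maxfch = 16)
      integer e1(nedge), e2(nedge)
      integer fetch(maxfch), nfetch, i, j, k, step
      real x(8), buf(maxfch), sum
      data e1 /1, 2, 3, 4, 2/
      data e2 /2, 5, 6, 1, 7/
      data x /1., 2., 3., 4., 5., 6., 7., 8./
c     --- inspector: build the list of distinct off-processor nodes ---
      nfetch = 0
      do i = 1, nedge
         k = e2(i)
         if (k .lt. lo .or. k .gt. hi) then
            do j = 1, nfetch
               if (fetch(j) .eq. k) goto 10
            end do
            nfetch = nfetch + 1
            fetch(nfetch) = k
 10         continue
         end if
      end do
      write (*, *) 'inspector: fetch', nfetch, ' nonlocal nodes'
c     --- executor: reuse the schedule on every computation step ---
      do step = 1, 3
c        gather nonlocal values (a real run would communicate here)
         do j = 1, nfetch
            buf(j) = x(fetch(j))
         end do
c        sweep over local edges using local or gathered values
         sum = 0.
         do i = 1, nedge
            k = e2(i)
            if (k .ge. lo .and. k .le. hi) then
               sum = sum + x(e1(i)) * x(k)
            else
               do j = 1, nfetch
                  if (fetch(j) .eq. k) sum = sum + x(e1(i)) * buf(j)
               end do
            end if
         end do
         write (*, *) 'executor step', step, ' result', sum
      end do
      end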

The preprocessing step performed by the inspector can be very complex: the unstructured grid is partitioned, the resulting communication patterns are optimized, and global indices are translated into local indices. During the executor phase, elements are communicated based on this preprocessing analysis. In order to simplify the implementation of inspectors and executors, irregular run-time support (IRTS) is used to provide primitives for these operations.

Figure: Edge redistribution for Rotor on the IBM SP — time (in seconds) versus number of processors for (a) the schedule (inspector) phase and (b) the redistribute (executor) phase, comparing CHAOS/PARTI, PILAR with enumeration, and PILAR with intervals.

There are several ways to improve a state-of-the-art IRTS such as CHAOS/PARTI. The internal representation of communication patterns in such systems is somewhat restricted: they represent irregular patterns that are completely enumerated, or regular block patterns. Neither optimizes regular and irregular accesses together, nor efficiently supports the small regular blocks that arise in irregular applications written for the exploitation of spatial cache locality. Moreover, systems such as CHAOS/PARTI do not provide nonblocking communication primitives, which can further increase performance.

All of these problems are addressed in the Parallel Irregular Library with Application of Regularity (PILAR), PARADIGM's IRTS for irregular computations. PILAR is written in C++ to easily support different internal representations of communication patterns, allowing for efficient handling of a wide range of applications, from fully irregular to regular, using a common framework. PILAR uses intervals for the description of small regular blocks, as discussed earlier, and enumeration for patterns with little or no regularity. The object-oriented nature of the library simplifies the implementation of new representations as well as the interactions among objects that have different internal representations.

An experiment is performed to evaluate the effectiveness of PILAR in exploiting spatial regularity in irregular applications. The overhead of redistributing the edges of an unstructured grid is measured after a partitioner has assigned nodes to processors. We assume a typical CSR (Compressed Sparse Row, or Harwell-Boeing) initial layout, in which the edges of a given grid node are contiguous in memory. Redistribution is done in two phases: the first phase (inspector) computes a schedule that captures the redistribution of the edges and sorts the new global indices; the second phase (executor) redistributes the array with the previously computed schedule using a global data exchange primitive.

The experiment uses a large unstructured grid (Rotor) from NASA. A large ratio between the maximum degree and the average degree of a node in this grid would cause a two-dimensional matrix representation of the edges to be very inefficient, so multidimensional optimizations in CHAOS, or in PILAR with enumeration, cannot be used. The performance of CHAOS/PARTI is compared against PILAR with both enumeration and intervals during the two phases of the redistribution. Results for the IBM SP appear in the figure above, which clearly shows the benefit of using the more compact interval representation. Further experiments also show that only three edges per grid node are required to benefit from an interval-based representation on the SP.

Even with adequate IRTS, the generation of efficient inspector/executor code for irregular applications is fairly complex. In PARADIGM, compiler analysis for irregular computations will be used to detect reuse of preprocessing, insert communication primitives, and highlight opportunities to exploit spatial locality. After performing this analysis, the compiler will generate inspector/executor code with embedded calls to PILAR routines.

Functional and Data Parallelism

The efficiency of data-parallel execution tends to drop off for larger numbers of processors for a given problem size, or for smaller problem sizes for a given number of processors. By exploiting functional parallelism in addition to data parallelism, the overall execution efficiency of a program can sometimes be improved. A task graph known as a Macro Dataflow Graph (MDG) is used to represent both the functional and data parallelism available in a program. The MDG for a given program is a weighted directed acyclic graph (DAG), with nodes representing data-parallel routines in the program and edges representing precedence constraints among these routines. In the MDG, data parallelism is implicit in the weight functions of the nodes, while functional parallelism is captured by the precedence constraints among nodes.

Figure: Example of functional parallelism — (a) a macro dataflow graph with three nodes N1, N2, and N3, together with their execution times and efficiencies as functions of the number of processors; (b) allocation and scheduling on four processors: a pure data-parallel (SPMD) schedule with t1 = 16.6 s versus an MPMD schedule with t2 = 15.3 s in which N2 and N3 run concurrently.

The weights of the nodes and edges are based on the processing and data redistribution costs. The processing cost is the computation and communication time required for the execution of a data-parallel routine and depends on the number of processors used to execute the routine. Scheduling may make it necessary to redistribute an array between the execution of a pair of routines; the time required for this data redistribution depends on the number of processors as well as the data distributions used by the routines.

To determine the best execution strategy for a given program, an allocation and scheduling approach is used on the MDG. Allocation determines the number of processors to use for each node, while scheduling results in an execution scheme for the allocated nodes on the target multicomputer. Figure (a) shows an MDG with three nodes, N1, N2, and N3, along with the processing costs and efficiencies of the nodes as a function of the number of processors they use. For this example, assume there are no data redistribution costs among the three routines. Given a four-processor system, two schemes of execution for the program are shown pictorially in Figure (b). The first scheme exploits pure data parallelism, with all routines using four processors. The second scheme exploits both functional and data parallelism, with routines N2 and N3 executing concurrently and using two processors each. As shown by this example, good allocation and scheduling can decrease the program execution time.
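The comparison the scheduler makes can be mimicked with hypothetical processing-cost functions: the pure data-parallel finish time is t1(4) + t2(4) + t3(4), while the mixed schedule costs t1(4) + max(t2(2), t3(2)), and imperfect efficiency decides which wins. The sketch below uses an invented cost model (not the values plotted in the figure) purely to illustrate the arithmetic.

      program mdgsched
c     illustrative comparison of a pure data-parallel (SPMD) schedule
c     with an MPMD schedule in which N2 and N3 run concurrently on
c     half of the machine each; the cost model and its constants are
c     invented placeholders, not the values in the figure
      real t1, t2, t3, tspmd, tmpmd, tnode
c     SPMD: every routine uses all 4 processors, one after another
      t1 = tnode(10.0, 4)
      t2 = tnode(8.0, 4)
      t3 = tnode(8.0, 4)
      tspmd = t1 + t2 + t3
c     MPMD: N1 on 4 processors, then N2 and N3 on 2 processors each
      t2 = tnode(8.0, 2)
      t3 = tnode(8.0, 2)
      tmpmd = t1 + max(t2, t3)
      write (*, *) 'SPMD finish time:', tspmd
      write (*, *) 'MPMD finish time:', tmpmd
      end

      real function tnode(tserial, p)
c     placeholder processing-cost model: parallel time improves more
c     slowly than 1/p because efficiency drops as processors are added
      real tserial, eff
      integer p
      eff = 1.0 / (1.0 + 0.15 * real(p - 1))
      tnode = tserial / (real(p) * eff)
      return
      end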

The allocation and scheduling algorithms in PARADIGM are based on the mathematical forms of the processing and data redistribution cost functions: it can be shown that they belong to a class of functions known as posynomials. This property is used to formulate the problem with a form of convex programming for optimal allocation. After allocation, a list scheduling policy is used for scheduling the nodes on a given system. The finish time obtained using this scheme has been

shown to be within a constant factor of the optimal finish time in theory; in practice this factor is small.

Figure: SPMD/MPMD performance comparison — speedup versus number of processors (32, 64, and 128) for Strassen's matrix multiplication (256x256 matrices) and a computational fluid dynamics code (128x128 mesh) on (a) the CM-5 and (b) the Paragon.

In the figure, the performance of the allocation and scheduling approach is compared to that of a pure data-parallel approach. Speedups are computed for the Paragon and the CM-5 for a pair of applications; performance using the allocation and scheduling approach is identified as MPMD, and performance for the pure data-parallel scheme as SPMD. The first application shown is Strassen's matrix multiplication algorithm; the second is a computational fluid dynamics code using a spectral method. For machines with a large number of processors, the performance of the MPMD execution relative to SPMD is improved by a factor of about two to three. These results demonstrate the utility of the allocation and scheduling approach.

Multithreading

When the resulting parallel program has a high percentage of idle cycles, multithreading can be used to further improve performance. By running multiple threads on each processor, one of the threads can utilize the cycles which would otherwise be wasted waiting for messages. To support multithreaded execution, message-passing code is first generated by the compiler for a number of virtual processors greater than the number of physical processors in a given machine. Multiple virtual processors are then mapped onto physical processors, resulting in multiple threads of execution for each physical processor.

In order to execute multithreaded code efficiently, compiler transformations are used to convert message-passing code into a message-driven model, thereby simplifying the multithreading run-time system (MRTS). The transformation required is simple for code without conditionals, but becomes more complex when conditionals and loops are included. Although this section only presents the transformation for converting while loops to message-driven code, similar transformations can be performed on other control structures.

Figure: Transformation of the while statement — (a) original control flow graph: code A, then a while loop whose body contains code B, a blocking receive, and code C, followed by code D; (b) transformed control flow graph, in which the loop is split into routines main1 and main2 as described below.

The transformation of the while loop is shown in the figure above: part (a) shows the control flow graph of a message-passing program containing a receive in a while loop, and part (b) shows the transformed code. In the transformed code, main1 is constructed such that code A is executed, followed by the while condition check. If the while loop condition is true, code B is executed, the receive is started, the routine enables main2, and main1 returns to the MRTS to execute other threads. If the while loop condition is false, code D is executed and the thread ends. If main2 is enabled, it will receive its message and execute code C, which is the code after the receive inside the while loop. At this point the original code would check for loop completion; therefore the transformed code must also perform this check, and if it is true, main2 enables another invocation of itself and returns to the MRTS. Otherwise, code D is executed and the thread ends. Multiple copies of main1 and main2 can be executed on the processors to increase the degree of multithreading.
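This control-flow restructuring can be mimicked in miniature: each thread becomes a small state machine, a blocking receive becomes "return to the scheduler and wait to be re-enabled", and a trivial scheduler delivers messages and resumes whichever continuation is ready. Everything in the sketch below (the scheduler loop, mailbox counters, thread states, and trip count) is an invented stand-in for the MRTS, intended only to show the shape of message-driven execution.

      program msgdrv
c     toy simulation of the message-driven form of a while loop:
c     instead of blocking in a receive, each thread returns to a tiny
c     scheduler and is resumed when its message arrives; the
c     scheduler, mailboxes, states, and trip count are all invented
c     for illustration and are not PARADIGM's MRTS
      integer nthr, niter
      parameter (nthr = 2, niter = 3)
      integer state(nthr), count(nthr), mail(nthr)
      integer t, done
c     state 1 = run main1 next, 2 = main2 waiting for a message,
c     state 3 = thread finished
      do t = 1, nthr
         state(t) = 1
         count(t) = 0
         mail(t) = 0
      end do
      done = 0
      do while (done .lt. nthr)
c        deliver one message to every thread still running
         do t = 1, nthr
            if (state(t) .ne. 3) mail(t) = mail(t) + 1
         end do
c        resume whichever continuations are enabled
         do t = 1, nthr
            if (state(t) .eq. 1) then
c              main1: code A, then the while condition check
               write (*, *) 'thread', t, ': main1, code A'
               if (count(t) .lt. niter) then
c                 code B, start the receive, enable main2, return
                  write (*, *) 'thread', t, ': code B'
                  state(t) = 2
               else
                  write (*, *) 'thread', t, ': code D, end'
                  state(t) = 3
                  done = done + 1
               end if
            else if (state(t) .eq. 2 .and. mail(t) .gt. 0) then
c              main2: message arrived; code C, then the loop check
               mail(t) = mail(t) - 1
               count(t) = count(t) + 1
               write (*, *) 'thread', t, ': code C, iter', count(t)
               if (count(t) .lt. niter) then
c                 loop again: code B, start receive, enable main2
                  write (*, *) 'thread', t, ': code B'
               else
                  write (*, *) 'thread', t, ': code D, end'
                  state(t) = 3
                  done = done + 1
               end if
            end if
         end do
      end do
      end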

This transformation was performed on the following four scientific applications, written in the SPMD programming model with blocking receives (in the message-driven model, receive operations are used to switch between threads and therefore must return control to the multithreading run-time system):

- GS: Gauss-Seidel iterative solver
- QR: Givens QR factorization of a dense matrix
- IMPL-2D: 2-D distribution of Implicit Hydrodynamics (Livermore kernel 23)
- IMPL-1D: 1-D distribution of Implicit Hydrodynamics (Livermore kernel 23)

All were run with large matrices on the CM-5.

Figure: Speedup of message-driven threads on the CM-5 — (a) Gauss-Seidel solver (2048x2048 matrix), (b) Givens QR factorization (512x512 matrix), (c) IMPL-2D Livermore kernel 23 (1024x1024 matrix), and (d) IMPL-1D Livermore kernel 23 (1024x1024 matrix), comparing SPMD code with message-driven code using 1, 2, 4, and 8 threads per processor.

The figure above compares the speedup of SPMD code with that of message-driven code with varying numbers of threads per processor. For the Gauss-Seidel and Givens QR applications, the amount of available parallelism inhibits the speedup; on the other hand, cache effects produce superlinear speedup for Implicit Hydrodynamics. The message-driven versions of the code outperform the SPMD versions in all cases except IMPL-1D, where multithreading causes a significant increase in communication costs. For the other applications, improvement is seen when two to eight threads are used, but the exact number of threads that produces the maximum speedup varies. This indicates that the number of threads required for optimal speedup is somewhat application dependent.

Conclusions

PARADIGM is a flexible parallelizing compiler for multicomputers. It can automatically distribute program data and perform a variety of communication optimizations for regular computations, as well as provide support for irregular computations using compilation and run-time techniques. For programs which have functional parallelism, the compiler can increase performance through proper resource allocation and scheduling. PARADIGM can also use multithreading to further increase the efficiency of codes by overlapping communication and computation. We believe that all these methods are useful for compiling a wide range of applications for distributed-memory multicomputers.

References

S. Hiranandani, K. Kennedy, and C. Tseng, "Compiling Fortran D for MIMD Distributed-Memory Machines," Communications of the ACM, Aug.

Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka, "Fortran 90D/HPF Compiler for Distributed Memory MIMD Computers: Design, Implementation, and Performance Results," in Proceedings of the ACM International Conference on Supercomputing, Tokyo, Japan, July.

B. Chapman, P. Mehrotra, and H. Zima, "Programming in Vienna Fortran," Scientific Programming, Aug.

C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr., and M. Zosel, The High Performance Fortran Handbook. Cambridge, MA: The MIT Press.

C. D. Polychronopoulos, M. Girkar, M. R. Haghighat, C. L. Lee, B. Leung, and D. Schouten, "Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing, and Scheduling Programs on Multiprocessors," in Proceedings of the International Conference on Parallel Processing, St. Charles, IL, Aug.

M. Gupta and P. Banerjee, "PARADIGM: A Compiler for Automated Data Partitioning on Multicomputers," in Proceedings of the ACM International Conference on Supercomputing, Tokyo, Japan, July.

D. J. Palermo, E. Su, J. A. Chandy, and P. Banerjee, "Compiler Optimizations for Distributed Memory Multicomputers used in the PARADIGM Compiler," in Proceedings of the International Conference on Parallel Processing, St. Charles, IL, Aug.

A. Lain and P. Banerjee, "Exploiting Spatial Regularity in Irregular Iterative Applications," in Proceedings of the International Parallel Processing Symposium, Santa Barbara, CA, Apr.

S. Ramaswamy, S. Sapatnekar, and P. Banerjee, "A Convex Programming Approach for Exploiting Data and Functional Parallelism on Distributed Memory Multicomputers," in Proceedings of the International Conference on Parallel Processing, St. Charles, IL, Aug.

J. G. Holm, A. Lain, and P. Banerjee, "Compilation of Scientific Programs into Multithreaded and Message Driven Computation," in Proceedings of the Scalable High Performance Computing Conference, Knoxville, TN, May.

E. Su, D. J. Palermo, and P. Banerjee, "Processor Tagged Descriptors: A Data Structure for Compiling for Distributed-Memory Multicomputers," in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Montreal, Canada, Aug.

C. Ancourt and F. Irigoin, "Scanning Polyhedra with DO Loops," in Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, Williamsburg, VA, Apr.

R. Ponnusamy, J. Saltz, and A. Choudhary, "Runtime-Compilation Techniques for Data Partitioning and Communication Schedule Reuse," in Proceedings of Supercomputing, Portland, OR, Nov.