
Compiler and Runtime Support for Programming in Adaptive Parallel Environments

Guy Edjlali, Gagan Agrawal, Alan Sussman, Jim Humphries, and Joel Saltz

UMIACS and Department of Computer Science
University of Maryland
College Park, MD, USA

{edjlali, gagan, als, humphries, saltz}@cs.umd.edu

Abstract

For better utilization of computing resources, it is important to consider parallel programming environments in which the number of available processors varies at runtime. In this paper, we discuss runtime support for data parallel programming in such an adaptive environment. Executing programs in an adaptive environment requires redistributing data when the number of processors changes, and also requires determining new loop bounds and communication patterns for the new set of processors. We have developed a runtime library to provide this support. We discuss how the runtime library can be used by compilers of HPF-like languages to generate code for an adaptive environment. We present performance results for a Navier-Stokes solver and a multigrid template run on a network of workstations and an IBM SP. Our experiments show that if the number of processors is not varied frequently, the cost of data redistribution is not significant compared to the time required for the actual computation. Overall, our work establishes the feasibility of compiling HPF for a network of non-dedicated workstations, which are likely to be an important resource for parallel programming in the future.

1 Introduction

In most existing parallel programming systems, each parallel program (or job) is assigned a fixed number of processors in a dedicated mode. Thus, the job is executed on a fixed number of processors, and its execution is not affected by other jobs on any of those processors. This simple model often results in relatively poor use of the available resources. A more attractive model would be one in which a particular parallel program could use a large number of processors when no other job is waiting for resources, and use a smaller number of processors when other jobs need resources. Setia et al. have shown that such a dynamic scheduling policy results in better utilization of the available processors.

There has been an increasing trend toward using a network of workstations for the parallel execution of programs. A workstation usually has an individual owner, or a small set of users, who would like to have sole use of the machine at certain times. However, when the individual users of workstations are not logged in, these workstations can be used for executing a parallel application. When the individual user of a workstation returns, the application must be adjusted either not to use the workstation at all, or to use very few cycles on the workstation. The idea is that the individual user of the workstation does not want the execution of a large parallel application to slow down the processes he or she wants to execute.

(This work was supported by ARPA under contract No. NAG and by NSF under grant No. ASC. The authors assume all responsibility for the contents of the paper.)

We refer to a parallel programming environment in which the number of processors available for a given application varies with time as an adaptive parallel programming environment. The major difficulty in using an adaptive parallel programming environment is in developing applications for execution in such an environment. In this paper, we address this problem for distributed memory parallel machines and networks of workstations, neither of which support shared memory. In these machines, communication between processors has to be explicitly scheduled by a compiler or by the user.

A commonly used model for developing parallel applications is the data parallel programming model, in which parallelism is achieved by dividing large data sets between processors and having each processor work only on its local data. High Performance Fortran (HPF), a language proposed by a consortium from industry and academia and being adopted by a number of vendors, targets the data parallel programming model. In compiling HPF programs for execution on distributed memory machines, the two major tasks are dividing work (loop iterations) across processors and detecting, inserting, and optimizing communication between processors. To the best of our knowledge, all existing work on compiling data parallel applications assumes that the number of processors available for execution does not vary at runtime. If the number of processors varies at runtime, runtime routines need to be inserted for determining work partitioning and communication during the execution of the program.

We have developed a runtime library for developing data parallel applications for execution in an adaptive environment. There are two major issues in executing applications in an adaptive environment:

1. Redistributing data when the number of available processors changes during the execution of the program, and

2. Handling work distribution and communication detection, insertion, and optimization when the number of processors on which a given parallel loop will be executed is not known at compile-time.

Executing a program in an adaptive environment can potentially incur a high overhead. If the number of available processors is varied frequently, then the cost of redistributing data can become significant. Since the number of available processors is not known at compile-time, work partitioning and communication need to be handled by runtime routines. This can result in a significant overhead if the runtime routines are not efficient, or if the runtime analysis is applied too often.

Our runtime library, called Adaptive Multiblock PARTI (AMP), includes routines for handling the two tasks we have described. This runtime library can be used by compilers for data parallel languages, or it can be used by a programmer parallelizing an application by hand. In this paper, we describe our runtime library and also discuss how it can be used by a compiler. We restrict our work to data parallel languages in which parallelism is specified through parallel loop constructs like forall statements and array expressions. We present experimental results on two applications parallelized for adaptive execution by inserting our runtime support by hand. Our experimental results show that if the number of available processors does not vary frequently, the cost of redistributing data is not significant as compared to the total execution time of the program. Overall, our work establishes the feasibility of compiling HPF-like data parallel languages for a network of non-dedicated workstations.

The rest of this paper is organized as follows. In Section 2, we discuss the programming model and the model of execution we are targeting. In Section 3, we describe the runtime library we have developed. We briefly discuss how this runtime library can be used by a compiler in Section 4. In Section 5, we present experimental results obtained by using the library to parallelize two applications and running them on a network of workstations and an IBM SP. In Section 6, we compare our work with other efforts on similar problems. We conclude in Section 7.

2 Model for Adaptive Parallelism

In this section, we discuss the programming model and the model of program execution that our runtime library targets. We call a parallel programming system in which the number of available processors varies during the execution of a program an adaptive programming environment. We refer to a program executed in such an environment as an adaptive program. These programs should adapt to changes in the number of available processors. The number of processors available to a parallel program changes when users log in or out of individual workstations, or when the load on processors changes for various reasons, such as from other parallel jobs in the system. We refer to the activity of a program adjusting to a change in the number of available processors as remapping.

We have chosen our model of program execution with two main concerns:

1. We want a model which is practical for developing and running common scientific and engineering applications, and

2. We want to develop adaptive programs that are portable across many existing parallel programming systems. This implies that the adaptive programs, and the runtime support developed for them, should require minimal operating system support.

We restrict our work to parallel programs using the Single Program Multiple Data (SPMD) model of execution. In this model, the same program text is run on all the processors, and parallelism is achieved by partitioning data structures (typically arrays) between processors. This model is frequently used for scientific and engineering applications, and most of the existing work on developing languages and compilers for programming parallel machines uses the SPMD model. An example of a simple data parallel program that can be easily transformed into a parallel program executable in SPMD mode is shown in Figure 1. The only change required to turn this program into an SPMD parallel program for a static environment would be to change the loop bounds of the forall loop appropriately, so that each processor only executes on the part of array A that it owns, and then to determine and place the communication between processors for array B.

    Real A(N,N), B(N,N)
    Do Time_step = 1 to num_steps
      Forall (i = 1:N, j = 1:N)
        A(i,j) = B(j,i) + A(i,j)
      EndForall
      ... More Computation involving A, B ...
    Enddo

        Figure 1: Example of a Data-Parallel Program

We are targeting an environment in which a parallel program must adapt according to the system load. A program may be required to execute on a smaller number of processors because an individual user logs in on a workstation, or because a new parallel job requires resources. Similarly, it may be desirable for a parallel program to execute on a larger number of processors because a user on a workstation has logged out, or because another parallel job executing in the parallel system has finished. In such scenarios, it is acceptable if:

1. The adaptive program does not remap immediately when the system load changes, and

2. When the program remaps from a larger number of processors to a smaller number of processors, it may continue to use a small number of cycles on the processors it no longer uses for computation.

This kind of flexibility can significantly ease remapping of data parallel applications with minimal operating system support. If an adaptive program has to be remapped from a larger number of processors to a smaller number of processors, this can be done by redistributing the distributed data so that processors which should no longer be executing the program do not own any part of the distributed data. The SPMD program will continue to execute on all processors. We refer to a process that owns distributed data as an active process, and to a process from which all data has been removed as a skeleton process. A processor running an active process is referred to as an active processor, and similarly a processor running a skeleton process is referred to as a skeleton processor. A skeleton processor will still execute each parallel loop in the program. However, after evaluating the local loop bounds to restrict execution to local data, a skeleton processor will determine that it does not need to execute any iterations of the parallel loop. All computations involving writing into scalar variables will continue to be executed on all processors. The parallel program will therefore use some cycles on the skeleton processors, in the evaluation of loop bounds for parallel loops and in the computations involving writing into scalar variables. However, for data parallel applications involving large arrays, this is not likely to cause any noticeable slowdown for other processes executing on the skeleton processors.

This model substantially simplifies remapping when a skeleton processor again becomes available for executing the parallel program. A skeleton processor can be made active simply by redistributing the data so that this processor owns part of the distributed data. New processes do not need to be spawned when skeleton processors become available; hence no operating system support is required for remapping to start execution on a larger number of processors. In this model, a maximal possible set of processors is specified before starting execution of a program. The program text is executed on all these processors, though some of them may not own any portions of the distributed data at any given point in the program execution. We believe that this is not a limitation in practice, since the set of workstations or processors of a parallel machine that can possibly be used for running an application is usually known in advance.

In Figure 2, we show three different states of five processors (workstations) executing a parallel program using our model. In the initial state, the program data is spread across all five processors. In the second state, two users have logged in on two of the workstations, so the program data is remapped onto the remaining three processors. After some time, those users log off and another user logs in on a different workstation; the program adapts itself to this new configuration by remapping the program data onto the other four processors.

If an adaptive program needs to be remapped while it is in the middle of a parallel loop, much effort may be required to ensure that all computations restart at the correct point on all the processors after remapping. The main problem is ensuring that each iteration of the parallel loop is executed exactly once, either before or after the remapping. Keeping track of which loop iterations have been completed before the remapping, and executing only those that have not already been completed after the remapping, can be expensive. However, if the program is allowed to execute for a short time after detecting that remapping needs to be done, the remapping can be substantially simplified. Therefore, in our model, the adaptive program is marked with remap points. These remap points can be specified by the programmer, if the program is parallelized by hand, or may be determined by the compiler, if the program is compiled from a single program specification (e.g., using an HPF compiler).

    [Figure: three states of five workstations (P0-P4) at times t0, t1, and t2, showing active processes, skeleton processes, and user processes as users 1, 2, and 3 log in and out.]

        Figure 2: An Adaptive Programming Environment

We allow remapping only when the program is not executing a data parallel loop. The local loop bounds of a data parallel loop are likely to be modified when the data is redistributed, since a processor is not likely to own exactly the same data both before and after remapping. We further discuss how the compiler can determine the placement of remap points in Section 4.

At each remap point, the program must determine if there is a reason to remap. We assume a detection mechanism that determines if load needs to be shifted away from any of the processors which are currently active, or if any of the skeleton processors can be made active. This detection mechanism is the only operating system support our model assumes. All the processors synchronize at the remap point, and if the detection mechanism determines that remapping is required, data redistribution is done.

Two main considerations arise in choosing remap points. If the remap points are too far apart, that is, if the program takes too much time between remap points, this may not be acceptable to the users of the machines. If the remap points are too close together, the overhead of using the detection mechanism may start to become significant.

Our model for adaptive parallel programming is closest to the one presented by Prouty et al. They also consider data parallel programming in an adaptive environment, including a network of heterogeneous workstations. The main difference in their approach is that the responsibility for data repartitioning is given to the application programmer. We have concentrated on developing runtime support that can perform data repartitioning, work partitioning, and communication after remapping. Our model satisfies the three requirements stated by Prouty et al., namely withdrawal (the ability to withdraw computation from a processor within a reasonable time), expansion (the ability to expand into newly available processors), and redistribution (the ability to redistribute work onto a dynamic number of processors so that no processor becomes a bottleneck).

3 Runtime Support

In this section, we discuss the runtime library we have developed for adaptive programs. The runtime library has been built on top of an existing runtime library for structured and block structured applications. This library is called Multiblock PARTI, since it was initially used to parallelize multiblock applications. We have developed our runtime support for adaptive parallelism on top of Multiblock PARTI because this runtime library provides much of the runtime support required for forall loops and array expressions in data parallel languages like HPF. This library was also integrated with the HPF/Fortran 90D compiler developed at Syracuse University. We first discuss the functionality of the existing library, and then present the extensions that were implemented to support adaptive parallelism. We refer to the new library with extensions for adaptive parallelism as Adaptive Multiblock PARTI (AMP).

3.1 Multiblock PARTI

This runtime library can be used to optimize communication and partition work for HPF codes in which data distributions, loop bounds, and/or strides are unknown at compile-time, and in which indirection arrays are not used. Consider the problem of compiling a data parallel loop, such as a forall loop in HPF, for a distributed memory parallel machine or a network of workstations. If all loop bounds and strides are known at compile-time, and if all information about the data distribution is also known, then the compiler can perform work partitioning and can also determine the sets of data elements to be communicated between processors. However, if all this information is not known, then these tasks may not be possible to perform at compile-time. Work partitioning and communication generation become especially difficult if there are symbolic strides, or if the data distribution is not known at compile-time. In such cases, runtime analysis can be used to determine the work partitioning and generate communication. The Multiblock PARTI library has been developed to provide the required runtime analysis routines.

In summary, the runtime library has routines for three sets of tasks:

1. Defining data distributions at runtime; this includes maintaining a distributed array descriptor (DAD), which can be used by the communication generation and work partitioning routines,

2. Performing communication when the data distribution, loop bounds, and/or strides are unknown at compile-time, and

3. Partitioning work (loop iterations) when the data distribution, loop bounds, and/or strides are unknown at compile-time.

A key consideration in using runtime routines for work partitioning and communication is to keep the overhead of the runtime analysis low. For this reason, the runtime analysis routines must be efficient, and it should be possible to reuse the results of runtime analysis whenever possible. In this library, communication is performed in two phases. First, a subroutine is called to build a communication schedule that describes the required data motion, and then another subroutine is called to perform the data motion (sends and receives on a distributed memory parallel machine) using a previously built schedule. Such an arrangement allows a schedule to be used multiple times in an iterative code.
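To make the two-phase pattern concrete, the following C sketch separates a schedule, which only records the required data motion, from the routine that carries it out. The data structures and names here are hypothetical illustrations, not the actual Multiblock PARTI interface.

    #include <stdlib.h>

    /* One exchange with a peer: which local elements to send to it and
     * which local locations to fill from it (hypothetical layout).      */
    typedef struct {
        int  peer;                 /* logical number of the peer process */
        int  nsend, nrecv;
        int *send_idx, *recv_idx;  /* local element indices              */
    } Exchange;

    typedef struct {
        int       nexch;
        Exchange *exch;
    } Schedule;

    /* Phase 1 (analysis): build a schedule from the distribution
     * descriptors and section bounds; this is the expensive step and is
     * omitted here.                                                     */
    Schedule *build_schedule(void);

    /* Phase 2 (data motion): pack and exchange according to a
     * previously built schedule; only the packing step is spelled out.  */
    void data_move(const Schedule *s, const double *src)
    {
        for (int e = 0; e < s->nexch; e++) {
            const Exchange *x = &s->exch[e];
            double *buf = malloc((size_t)x->nsend * sizeof *buf);
            for (int k = 0; k < x->nsend; k++)
                buf[k] = src[x->send_idx[k]];
            /* ... send buf to x->peer, receive x->nrecv values, and
             * unpack them into the local destination array at the
             * positions listed in x->recv_idx ...                       */
            free(buf);
        }
    }

The point of the split is that the schedule is built once, before an iterative loop, while data_move runs on every time step, so the analysis cost is amortized over the whole computation.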

To illustrate the functionality of the runtime routines for communication analysis, consider a single statement forall loop, as specified in HPF. This is a parallel loop in which the loop bounds and strides associated with any loop variable cannot be functions of any other loop variable. If there is only a single array on the right hand side, and all subscripts are affine functions of the loop variables, then this forall loop can be thought of as copying a rectilinear section of data from the right hand side array into the left hand side array, potentially involving changes of offsets and strides and index permutation. We refer to such communication as a regular section move. The library includes a regular section move routine, Regular_Section_Move_Sched, that can analyze the communication associated with a copy from a right hand side array to a left hand side array when the data distribution, loop bounds, and/or strides are not known at compile-time.

A regular section move routine can be invoked for analyzing the communication associated with any forall loop, but this may result in unnecessarily high runtime overheads, in both execution time and memory usage. Communication resulting from loops in many real codes has much simpler features that make it easier and less time-consuming to analyze. For example, in many loops in mesh-based codes, only ghost (or overlap) cells need to be filled along certain dimensions. If the data distribution is not known at compile-time, the analysis for communication can be much simpler if it is known that only overlap cells need to be filled. The Multiblock PARTI library includes a communication routine, Overlap_Cell_Fill_Sched, which computes a schedule that is used to direct the filling of overlap cells along a given dimension of a distributed array.

    Real A(N,N), B(N,N), Temp(N,N)
    DAD D                              ! DAD for A and B
    SCHED Sched
    Num_Proc = Get_Number_of_Processors()
    D = Create_DAD(Num_Proc)
    Sched = Compute_Transpose_Sched(D)
    Lo_Bnd1 = Local_Lower_Bound(D, 1)
    Lo_Bnd2 = Local_Lower_Bound(D, 2)
    Up_Bnd1 = Local_Upper_Bound(D, 1)
    Up_Bnd2 = Local_Upper_Bound(D, 2)
    Do Time_step = 1 to num_steps
      Data_Move(B, Temp, Sched)
      Forall (i = Lo_Bnd1:Up_Bnd1, j = Lo_Bnd2:Up_Bnd2)
        A(i,j) = Temp(i,j) + A(i,j)
      EndForall
      ... More Computation involving A, B ...
    Enddo

        Figure 3: Example SPMD Program Using Multiblock PARTI

The schedules produced by Overlap_Cell_Fill_Sched and Regular_Section_Move_Sched are employed by a routine called Data_Move, which carries out both interprocessor communication (sends and receives) and intraprocessor data copying.
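As a minimal sketch of the kind of information an overlap fill schedule must record, the following C fragment computes, for a 1-D block distribution, the global indices a processor sends to its right neighbour so that the neighbour can fill its left overlap cells of width w. The helper names and the simple block layout are assumptions made for illustration, not the library's own code.

    /* Inclusive global index range; empty if lo > hi.                   */
    typedef struct { int lo, hi; } Range;

    /* Block of n elements (1-based) owned by logical processor p out of
     * nprocs, with any remainder spread over the first few processors.  */
    static Range owned_block(int n, int nprocs, int p)
    {
        int base = n / nprocs, extra = n % nprocs;
        Range r;
        r.lo = p * base + (p < extra ? p : extra) + 1;
        r.hi = r.lo + base + (p < extra ? 1 : 0) - 1;
        return r;
    }

    /* Global indices processor p must send to processor p+1 so that p+1
     * can fill its left overlap (ghost) cells of width w.               */
    Range overlap_send_right(int n, int nprocs, int p, int w)
    {
        Range empty = { 1, 0 };
        if (p >= nprocs - 1)                  /* no right neighbour      */
            return empty;
        Range own = owned_block(n, nprocs, p);
        if (own.lo > own.hi)                  /* owns no data            */
            return empty;
        Range s = { own.hi - w + 1, own.hi };
        if (s.lo < own.lo)                    /* block narrower than w   */
            s.lo = own.lo;
        return s;
    }

The receive side is symmetric, and a second schedule handles the overlap cells on the opposite edge; because only such contiguous edge ranges are involved, this analysis is much cheaper than the general regular section analysis.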

The final form of support provided by the Multiblock PARTI library is to distribute loop iterations and transform global distributed array references into local references. In distributed memory compilation, the owner computes rule is often used for distributing loop iterations. Owner computes means that a particular loop iteration is executed by the processor owning the left-hand side array element written into during that iteration. Two routines, Local_Lower_Bound and Local_Upper_Bound, are provided by the library for transforming loop bounds, returning respectively the local lower and upper bounds of a given dimension of the referenced distributed array, based upon the owner computes rule.
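As a rough sketch of the computation such bound routines perform, assuming a simple 1-D block distribution (the actual library supports more general distributions through the DAD), the following C function maps the global index range 1..n onto the block owned by logical processor p. A processor that owns no elements, including a skeleton processor outside the active set, gets an empty range (lower bound greater than upper bound) and therefore executes no iterations of the loop.

    /* Local loop bounds under a 1-D block distribution of global indices
     * 1..n over nprocs active processors (hypothetical sketch).         */
    typedef struct { int lo, hi; } Bounds;     /* empty if lo > hi       */

    Bounds local_bounds(int n, int nprocs, int p)
    {
        Bounds b = { 1, 0 };                   /* empty range by default */
        if (p < 0 || p >= nprocs)              /* not an active processor */
            return b;
        int base = n / nprocs, extra = n % nprocs;
        int size = base + (p < extra ? 1 : 0);
        if (size == 0)                         /* more processors than rows */
            return b;
        b.lo = p * base + (p < extra ? p : extra) + 1;
        b.hi = b.lo + size - 1;
        return b;
    }

A forall loop rewritten to iterate from b.lo to b.hi then restricts each processor, following the owner computes rule, to the iterations whose left hand side elements it owns.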

An example of using the library routines to parallelize the program from Figure 1 is shown in Figure 3. The library routines are used for determining the work partitioning (loop bounds) and for determining and optimizing the communication between processors. In this example, the data distribution is known only at runtime, and therefore the distributed array descriptor (DAD) is filled in at runtime. Work partitioning and communication are determined at runtime using the information stored in the DAD. The function Compute_Transpose_Sched is shorthand for a call to the Regular_Section_Move_Sched routine with the parameters set to perform a transpose of a two-dimensional distributed array. The schedule generated by this routine is then used by the Data_Move routine for transposing the array B and storing the result in the array Temp. The functions Local_Lower_Bound and Local_Upper_Bound are used to partition the data parallel loop across processors, using the DAD. The sizes of the arrays A, B, and Temp on each processor depend upon the data distribution and are known only at runtime. Therefore, arrays A, B, and Temp are allocated at runtime. The calls to the memory management routines are not shown in the figure. The code could be optimized further by writing specialized routines to perform the transpose operation, but the library routines are also applicable to more general forall loops.

The Multiblock PARTI library is currently implemented on the Intel iPSC/860 and Paragon, the Thinking Machines CM-5, the IBM SP, and the PVM message passing environment for a network of workstations. The design of the library is architecture independent, and therefore it can be easily ported to any distributed memory parallel machine, or to any environment that supports message passing (e.g., Express).

3.2 Adaptive Multiblock PARTI

The existing functionality of the Multiblock PARTI library was useful for developing adaptive programs in several ways. If the number of processors on which a data parallel loop is to be executed is not known at compile-time, it is not possible for the compiler to analyze the communication, and in some cases even the work partitioning. This holds true even if all other information, such as loop bounds and strides, is known at compile-time. Thus, runtime routines are required for analyzing communication and work partitioning in a program written for adaptive execution, even if the same program written for static execution on a fixed number of processors did not require any runtime analysis.

Several extensions to the existing library were required to provide the functionality needed for adaptive programs. When the set of processors on which the program executes changes at runtime, all active processors must obtain information about which processors are active and how the data is distributed across the set of active processors. To deal with only some of the processors being active at any time during execution of the adaptive program, the implementation of Adaptive Multiblock PARTI uses the notion of a physical numbering and a logical numbering of processors. If p is the number of processors that can possibly be active during the execution of the program, each such processor is assigned a unique physical processor number between 0 and p-1 before starting program execution. If we let c be the number of processors that are active at a given point during execution of a program, then each of these active processors is assigned a unique logical processor number between 0 and c-1. The mapping between physical processor numbers and logical processor numbers for the active processors is updated at remap points. The use of a logical processor numbering is similar in concept to the scheme used for processor groups in the Message Passing Interface standard (MPI).

Information about data distributions is available at each processor in the Distributed Array Descriptors (DADs). However, a DAD only stores the total size in each dimension of a distributed array. The exact part of the distributed array owned by an active processor can be determined using the logical processor number. Each processor maintains information about which physical processor corresponds to each logical processor number at any time. The mapping from logical processor numbers to physical processors is used for communicating data between processors.
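A minimal sketch of how such a numbering might be maintained is shown below; the data structure and routine names are hypothetical, since the paper does not describe AMP's internal layout at this level. Each process keeps a table mapping logical numbers to physical numbers, plus its own position in that table, and the table is rebuilt consistently at every remap point.

    #define MAX_PROCS 64   /* maximal possible processor set, fixed at startup */

    typedef struct {
        int nactive;                    /* c: number of active processors  */
        int phys_of[MAX_PROCS];         /* phys_of[l] = physical number    */
        int my_logical;                 /* -1 on a skeleton process        */
    } ProcMap;

    /* Rebuild the map from the list of active physical numbers agreed
     * upon by the detection mechanism at a remap point.                  */
    void update_map(ProcMap *m, const int *active_phys, int nactive,
                    int my_physical)
    {
        m->nactive = nactive;
        m->my_logical = -1;
        for (int l = 0; l < nactive; l++) {
            m->phys_of[l] = active_phys[l];
            if (active_phys[l] == my_physical)
                m->my_logical = l;
        }
    }

    /* Communication routines translate a logical destination into the
     * physical processor to which a message is actually sent.            */
    int physical_of(const ProcMap *m, int logical)
    {
        return m->phys_of[logical];
    }

The DAD together with a logical number then determines which block of a distributed array a processor owns, while phys_of determines where the corresponding messages are actually sent.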

In summary, the additional functionality implemented in AMP, over that available in Multiblock PARTI, is as follows:

1. Routines for consistently updating the logical processor numbering when it has been detected that redistribution is required,

2. Routines for redistributing data at remap points, and

3. Modified communication analysis and data move routines that incorporate information about the logical processor numbering.

The communication required for redistributing data at a remap point depends upon the logical processor numberings both before and after the redistribution. Therefore, after it has been decided that remapping is required, all processors must obtain the new logical processor numbering. The detection routine, after determining that data redistribution is required, decides upon a new logical numbering for the processors which will be active. The detection routine informs all the processors which were either active before remapping or will be active after remapping of the new logical numbering. It also informs the processors which will be active after remapping about the existing logical numbering (processors that are active both before and after remapping will already have this information). These processors need this information for determining which portions of the distributed arrays they will receive from which physical processors.

The communication analysis required for redistributing data was implemented by modifying the Multiblock PARTI Regular_Section_Move_Sched routine. The new routine takes both the new and the old logical numberings as parameters. The analysis for determining the data to be sent by each processor is done using the new logical numbering, since data will be sent to processors with the new logical numbering, and the analysis for determining the data to be received is done using the old logical numbering, since data will be received from processors with the old logical numbering.
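For a single dimension of a block-distributed array, this old-versus-new analysis reduces to intersecting block ranges, as in the following C sketch; the helper names and the simple block layout are assumptions made for illustration, not the routine's actual implementation.

    /* Inclusive global index interval; empty if lo > hi.                */
    typedef struct { int lo, hi; } Interval;

    /* Block (1-based) owned by logical processor p out of nprocs.       */
    static Interval block(int n, int nprocs, int p)
    {
        int base = n / nprocs, extra = n % nprocs;
        Interval r;
        r.lo = p * base + (p < extra ? p : extra) + 1;
        r.hi = r.lo + base + (p < extra ? 1 : 0) - 1;
        return r;
    }

    static Interval intersect(Interval a, Interval b)
    {
        Interval r = { a.lo > b.lo ? a.lo : b.lo,
                       a.hi < b.hi ? a.hi : b.hi };
        return r;                              /* lo > hi when disjoint  */
    }

    /* Send analysis (new numbering): what old logical processor p_old,
     * one of c_old active processors, must send to new logical q.       */
    Interval send_to(int n, int c_old, int p_old, int c_new, int q)
    {
        return intersect(block(n, c_old, p_old), block(n, c_new, q));
    }

    /* Receive analysis (old numbering): what new logical processor p_new
     * will receive from old logical processor r.                        */
    Interval recv_from(int n, int c_new, int p_new, int c_old, int r)
    {
        return intersect(block(n, c_new, p_new), block(n, c_old, r));
    }

Looping q over the new active processors on the send side, and r over the old active processors on the receive side, yields the message sets; the logical-to-physical mappings before and after the remap then supply the actual sources and destinations.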

    Compute Initial DAD, Sched, and Loop Bounds
    Do Time_step = 1 to num_steps
      If (Detection()) call Remap()
      Data_Move(B, Temp, Sched)
      Forall (i = Lo_Bnd1:Up_Bnd1, j = Lo_Bnd2:Up_Bnd2)
        A(i,j) = Temp(i,j) + A(i,j)
      EndForall
      ... More Computation involving A, B ...
    Enddo

    Remap()
      Real New_A, New_B
      New_NProc = Get_No_of_Proc_and_Numbering()
      New_D = Create_DAD(New_NProc)
      Redistribute_Data(A, New_A, D, New_D)
      Redistribute_Data(B, New_B, D, New_D)
      D = New_D ; A = New_A ; B = New_B
      Sched = Compute_Transpose_Sched(D)
      Lo_Bnd1 = Local_Lower_Bound(D, 1)
      Lo_Bnd2 = Local_Lower_Bound(D, 2)
      Up_Bnd1 = Local_Upper_Bound(D, 1)
      Up_Bnd2 = Local_Upper_Bound(D, 2)
    End

        Figure 4: Adaptive SPMD Program Using AMP

Modifications to the Multiblock PARTI communication functions were also required to incorporate information about the logical processor numbering. This is because the data distribution information in a DAD only determines which logical processor owns what part of a distributed array. To actually perform communication, these functions must use the mapping between the logical and physical processor numberings.

Figure 4 shows the example from Figure 3, parallelized using AMP. The only difference from the non-adaptive parallel program is the addition of the detection and remap calls at the beginning of the time step loop. The initial computation of the loop bounds and communication schedule is the same as in Figure 3. The remap point is the beginning of the time step loop. If remapping is to be performed at this point, the function Remap is invoked. Remap determines the new logical processor numbering, once it is known which processors are available, and creates a new Distributed Array Descriptor (DAD). The Redistribute_Data routine redistributes the arrays A and B using both the old and new DADs. After redistribution, the old DAD can be discarded. The new communication schedule and loop bounds are determined using the new DAD. We have not shown the details of the memory allocation and deallocation for the data redistribution.

4 Compilation Issues

The examples shown previously illustrate how AMP can be used by application programmers to develop adaptive programs by hand. We now briefly describe the major issues in compiling programs written in an HPF-like data parallel language for an adaptive environment. We also discuss some issues in expressing adaptive programs in High Performance Fortran. As we stated earlier, our work is restricted to data parallel languages in which parallelism is specified explicitly. Incorporating adaptive parallelism into compilation systems in which parallelism is detected automatically is beyond the scope of this paper.

In previous work, we successfully integrated the Multiblock PARTI library with a prototype Fortran 90D/HPF compiler developed at Syracuse University. Routines provided by the library were inserted for analyzing work partitioning and communication at runtime whenever compile-time analysis was inadequate. This implementation can be extended to use Adaptive Multiblock PARTI and compile HPF programs for adaptive execution. The major issues in compiling a program for adaptive execution are determining remap points, inserting appropriate actions at remap points, and ensuring reuse of the results of runtime analysis to minimize the cost of such analysis.

4.1 Remap Points

In our model of execution of adaptive programs, remapping is considered only at certain points in the program text. If our runtime library is to be used, a program cannot be remapped inside a data parallel loop. The reason is that the local loop bounds of a data parallel loop are determined based upon the current data distribution, and in general it is very difficult to ensure that all iterations of the parallel loop are executed by exactly one processor, either before or after remapping.

There are at least two possibilities for determining remap points: they may be specified by the programmer in the form of a directive, or they may be determined automatically by the compiler. In the data parallel language HPF, parallelism can only be specified explicitly through certain constructs (e.g., the forall statement, the forall construct, or the independent statement). Inside any of these constructs, the only functions that can be called are those explicitly marked as pure functions. Thus, it is simple to determine solely from the syntax which points in the program are not inside any data parallel loop, and therefore can be remap points. Making all such points remap points may, however, lead to a large number of remap points, which may occur very frequently during program execution and may lead to significant overhead from employing the detection mechanism and synchronizing all processors at each remap point.

Alternatively, a programmer may specify certain points in the program to be remap points through an explicit directive. This, however, makes adaptive execution less transparent to the programmer.

Once the remap points are known to the compiler, it can insert calls to the detection mechanism at those points. The compiler also needs to insert a conditional based on the result of the detection mechanism, so that if the detection mechanism determines that remapping needs to be done, then calls are made both for building new Distributed Array Descriptors and for redistributing the data as specified by the new DADs. The resulting code looks very similar to the code shown in the example from Section 3, except that the compiler will not explicitly regenerate schedules after a remap. The compiler generates schedules wherever they will be needed, and relies on the runtime library to cache schedules that may be reused, as described in the next section.

4.2 Schedule Reuse in the Presence of Remapping

As we discussed in Section 3, a very important consideration in using runtime analysis is the ability to reuse the results of runtime analysis whenever possible. This is relatively straightforward if a program is parallelized by inserting the runtime routines by hand. When the runtime routines are automatically inserted by a compiler, an approach based upon additional runtime bookkeeping can be used. In this approach, all schedules generated are stored in hash tables by the runtime library, along with their input parameters. Whenever a call is made to generate a schedule, the input parameters specified for this call are matched against those for all existing schedules. If a match is found, the stored schedule is returned by the library. This approach was successfully used in the prototype HPF/Fortran 90D compiler that used the Multiblock PARTI runtime library. Our previous experiments have shown that saving schedules in hash tables and searching for existing schedules results in only a small overhead, as compared to a hand implementation that reuses schedules optimally.

This approach extends easily to programs which include remapping. One of the parameters to the schedule call is the Distributed Array Descriptor (DAD). After remapping, a call for building a new DAD for each distributed array is inserted by the compiler. For the first execution of any parallel loop after remapping, no schedule having the new DADs as parameters will be available in the hash table. New schedules for communication will therefore be generated. The hash tables storing schedules can also be cleared after remapping, to reduce the amount of memory used by the library.
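A minimal sketch of this kind of bookkeeping is shown below; the key layout and names are hypothetical, since the library hashes its full set of input parameters. A lookup precedes every schedule build, and the table is flushed after remapping.

    #include <stdlib.h>
    #include <string.h>

    /* Parameters that determine a schedule (sketch only).  Keys must be
     * zeroed with memset before being filled, so that padding bytes
     * compare and hash consistently.                                    */
    typedef struct {
        const void *dad_src, *dad_dst;   /* distributed array descriptors */
        int lo[4], hi[4], stride[4];     /* section bounds per dimension  */
    } SchedKey;

    typedef struct CacheEntry {
        SchedKey           key;
        void              *sched;        /* previously built schedule     */
        struct CacheEntry *next;
    } CacheEntry;

    #define NBUCKETS 256
    static CacheEntry *buckets[NBUCKETS];

    static unsigned hash_key(const SchedKey *k)     /* FNV-1a over bytes  */
    {
        const unsigned char *p = (const unsigned char *)k;
        unsigned h = 2166136261u;
        for (size_t i = 0; i < sizeof *k; i++)
            h = (h ^ p[i]) * 16777619u;
        return h % NBUCKETS;
    }

    void *lookup_schedule(const SchedKey *k)
    {
        for (CacheEntry *e = buckets[hash_key(k)]; e != NULL; e = e->next)
            if (memcmp(&e->key, k, sizeof *k) == 0)
                return e->sched;                    /* reuse              */
        return NULL;                                /* must build anew    */
    }

    void insert_schedule(const SchedKey *k, void *sched)
    {
        unsigned b = hash_key(k);
        CacheEntry *e = malloc(sizeof *e);
        e->key = *k;
        e->sched = sched;
        e->next = buckets[b];
        buckets[b] = e;
    }

    /* Called after remapping: the old DADs are gone, so cached schedules
     * can never match again and their entries may be discarded.          */
    void clear_schedule_cache(void)
    {
        for (int b = 0; b < NBUCKETS; b++)
            while (buckets[b] != NULL) {
                CacheEntry *e = buckets[b];
                buckets[b] = e->next;
                free(e);        /* the schedules themselves are released
                                   separately by the library              */
            }
    }

Since the new DADs created after a remap never match a cached key, the first execution of each parallel loop after remapping rebuilds its schedules, exactly as described above.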

4.3 Relationship to HPF

In HPF, the Processors directive can be used to declare a processor arrangement. An intrinsic function, Number_of_Processors, is also available for determining the number of physical processors available at runtime. HPF allows the use of the intrinsic function Number_of_Processors in the specification of a processor arrangement. Therefore, it is possible to write HPF programs in which the number of physical processors available is not known until runtime. The Processors directive can appear only in the specification part of a scoping unit (i.e., a subroutine or the main program). There is no mechanism available for changing the number of processors at runtime.

To the best of our knowledge, existing work on compiling data parallel languages for distributed memory machines assumes a model in which the number of processors is statically known at compile-time. Therefore, several components of our runtime library are also useful for compiling HPF programs in which a processor arrangement has been specified using the intrinsic function Number_of_Processors. HPF also provides Redistribute and Realign directives, which can be used to change the distribution of arrays at runtime. Our redistribution routines would be useful for implementing these directives in an HPF compiler.

5 Experimental Results

To study the performance of the runtime routines, and to determine the feasibility of using an adaptive environment for data parallel programming, we have experimented with a multiblock Navier-Stokes solver template and a multigrid template. The multiblock template was extracted from a computational fluid dynamics application that solves the thin-layer Navier-Stokes equations over a 3-D surface (multiblock TLNS3D). The sequential Fortran code was developed by Vatsa et al. at NASA Langley Research Center. The multiblock template was designed to include portions of the entire code that are representative of the major computation and communication patterns of the original code. The multigrid code we experimented with was developed by Overman et al. at NASA Langley. In earlier work, we hand parallelized these codes using Multiblock PARTI, and also parallelized Fortran D versions of these codes using the prototype HPF/Fortran 90D compiler.

    [Table: number of processors, time per iteration, and the cost of remapping to each other processor configuration; numeric values not reproduced.]

        Figure 5: Cost of Remapping (in ms), Multiblock Code on a Network of Workstations

In both of these codes, the major computation is performed inside a sequential time step loop. For each of the parallel loops in the major computational part of the code, the loop bounds and communication patterns do not change across iterations of the time step loop when the code is run in a static environment. Thus, communication schedules can be generated before the first iteration of the time step loop, and can be used for all time steps in a static environment.

We modified the hand parallelized versions of these codes to use the Adaptive Multiblock PARTI routines. For both codes, we chose the beginning of an iteration of the time step loop as the remapping point. If remapping is done, the data distribution changes, and the schedules used for previous time steps can no longer be used. For our experiments, we used two parallel programming environments. The first was a network of Sun workstations using PVM for message passing. The second environment was an IBM SP.

In demonstrating the feasibility of using an adaptive environment for parallel program execution, we considered the following factors:

1. the time required for remapping and computing a new set of schedules, as compared to the time required for each iteration of the time step loop,

2. the number of time steps that the code must execute after remapping to a greater number of processors to effectively amortize the cost of remapping, and

3. the effect of skeleton processes on the performance of their host processors.

On the network of Sun workstations, we considered executing the program on several different numbers of workstations at any time. Remapping was possible from any of these configurations to any other configuration. We measured the time required for one iteration of the time step loop, and the cost of remapping from one configuration to another.

    [Table: number of processors, time per iteration, and the cost of remapping to each other processor configuration; numeric values not reproduced.]

        Figure 6: Cost of Remapping (in ms), Multiblock Code on the IBM SP

    [Table: number of processors, time per iteration, and the cost of remapping to each other processor configuration; numeric values not reproduced.]

        Figure 7: Cost of Remapping (in ms), Multigrid Code on the IBM SP

    [Table: number of processors and the number of time steps needed to amortize the cost of remapping to each larger configuration; numeric values not reproduced.]

        Figure 8: Number of Time Steps for Amortizing the Cost of Remapping, Multiblock Code on a Network of Sun Workstations

The experiments were conducted at a time when none of the workstations had any other jobs executing. The time required per iteration for each configuration, and the time required for remapping from one configuration to another, are presented in Figure 5. In this table, the second column shows the time per iteration, and the remaining columns show the time for remapping to each of the other processor configurations. The remapping cost includes the time required for redistributing the data and the time required for building a new set of communication schedules. The speedup of the template is not very high, because it has a high communication to computation ratio and communication using PVM is relatively slow. These results show that the time required for remapping for this application is at most the time required for a few time steps.

Note that on a network of workstations connected by an Ethernet, it takes much longer to remap from a larger number of processors to a smaller number of processors than from a smaller number to a larger number; for example, the time required for remapping from many processors onto a single processor is significantly higher than the time required for remapping from a single processor onto many processors. This is because if several processors try to send messages simultaneously on an Ethernet, contention occurs, and none of the messages may actually be sent, leading to significant delays overall. In contrast, if a single processor is sending messages to many other processors, no such contention occurs.

We performed the same experiment on the IBM SP; the results are shown in Figure 6. The program could execute on any of several processor configurations, and we considered remapping from any of these configurations to any other configuration. The template obtains significantly better speedup, and the time required for remapping is much smaller. The superlinear speedup noticed in going from one processor to multiple processors occurs because, on one processor, all the data cannot fit into the main memory of the machine. In Figure 7, we show the results for the multigrid template. Again, the remapping time for this code is reasonably small.

Another interesting tradeoff occurs when additional processors become available for running the program. Running the program on a greater number of processors can reduce the time required for completing the execution of the program, but at the same time, remapping the program onto a new set of processors causes additional overhead for moving data. A useful factor to determine is the number of iterations of the time step loop that must still be executed so that it will be profitable to remap from a smaller to a greater number of processors. Using the timings from Figure 5, we show the results in Figure 8. This figure shows that if the program will continue to run for several more time steps, remapping from almost any configuration to any other larger configuration is likely to be profitable. Since the remapping times are even smaller on the SP, the number of iterations required for amortizing the cost of remapping will be even smaller there.
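A rough rule of thumb for reading these tables (our back-of-the-envelope estimate, not a figure from the measurements): if one iteration takes t_old seconds on the current configuration and t_new seconds on the larger one, and remapping costs R seconds, then moving to the larger configuration pays off when roughly

    remaining iterations > R / (t_old - t_new)

iterations of the time step loop are still to be executed.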

In our model of adaptive parallel programming, a program is never completely removed from any processor. A skeleton process steals some cycles on its host processor, which can potentially slow down other processes that want to use the processor (e.g., those of a workstation user who has just logged in). The skeleton processes do not perform any communication, and do not synchronize except at the remap points. In our examples, the remap point is the beginning of an iteration of the time step loop. We measured the time required per iteration on the skeleton processors. Our experiments show that the execution time per iteration on the skeleton processors is only a very small fraction of the execution time on the active processors, for both the multiblock code and the multigrid code, on both the IBM SP and the Sun workstations. We therefore expect that a skeleton process will not significantly slow down any other job run on that processor, assuming that the skeleton process gets swapped out by the operating system when it reaches a remap point.

6 Related Work

In this section, we compare our approach to other efforts on similar problems.

Condor is a system that supports transparent migration of a process, through checkpointing, from one workstation to another. It also performs detection to determine whether the user of the workstation on which a process is being executed has returned, and it looks out for other idle workstations. However, this system does not support parallel programs; it considers only programs that will be executed on a single processor.

Several researchers have addressed the problem of using an adaptive environment for executing parallel programs. However, most of these efforts consider a task parallel model or a master-slave model. In a version of PVM called Migratable PVM (MPVM), a process (or task) running on one machine can be migrated to other machines or processors. However, MPVM does not provide any mechanism for redistributing data across the remaining processors when a data parallel program has to be withdrawn from one of the processors.

Another system, called User Level Processes (ULP), has also been developed. This system provides lightweight user level tasks. Each of these tasks can be migrated from one machine to another, but again, there is no way of achieving load balance when a parallel program needs to be executed on a smaller number of processors. Piranha is a system developed on top of Linda. In this system, the application programmer has to write functions for adapting to a change in the number of available processors. Programs written in this system use a master-slave model, and the master coordinates the relocation of the slaves. There is no clear way of writing data parallel applications for adaptive execution in any of these systems.

Data Parallel C and its compilation system have been designed for load balancing on a network of heterogeneous machines. The system requires continuous monitoring of the progress of the programs executing on each machine. Experimental results have shown that this involves a significant overhead, even when no load balancing is required.

7 Conclusions and Future Work

In this paper, we have addressed the problem of developing applications for execution in an adaptive parallel programming environment, meaning an environment in which the number of available processors varies at runtime. We have defined a simple model for programming and program execution in such an environment. In the SPMD model supported by HPF, the same program text is run on all the processors; remapping a program to include or exclude processors only involves remapping the parallel data used in the program. The only operating system support required in our model is for detecting the availability, or lack of availability, of processors. This makes it easy to port applications developed using this model onto many parallel programming systems.

We have presented the features of Adaptive Multiblock PARTI, which provides runtime support that can be used for developing adaptive parallel programs. We have described how the runtime library can be used by a compiler to compile programs written in HPF-like data parallel languages for adaptive execution. We have presented experimental results for a hand parallelized Navier-Stokes solver template and a multigrid template, run on a network of workstations and an IBM SP. Our experimental results show that adaptive execution of a parallel program can be provided at relatively low cost, as long as the number of available processors does not vary frequently.

Acknowledgements

We would like to thank V. Vatsa and M. Sanetrik at NASA Langley Research Center for providing access to the multiblock TLNS3D application code. We would also like to thank John Van Rosendale at ICASE and Andrea Overman at NASA Langley for making their sequential and hand parallelized multigrid code available to us.

References

Gagan Agrawal, Alan Sussman, and Joel Saltz. Compiler and runtime support for structured and block structured applications. In Proceedings of Supercomputing, IEEE Computer Society Press, November.

Gagan Agrawal, Alan Sussman, and Joel Saltz. Efficient runtime support for parallelizing block structured applications. In Proceedings of the Scalable High Performance Computing Conference (SHPCC), IEEE Computer Society Press, May.

Gagan Agrawal, Alan Sussman, and Joel Saltz. An integrated runtime and compile-time approach for parallelizing structured and block structured applications. IEEE Transactions on Parallel and Distributed Systems, to appear. Also available as University of Maryland Technical Reports CS-TR and UMIACS-TR.

R. Bjornson. Linda on Distributed Memory Multiprocessors. PhD thesis, Yale University.

Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, S. Ranka, and M.-Y. Wu. Compiling Fortran 90D/HPF for distributed memory MIMD computers. Journal of Parallel and Distributed Computing, April.

Jeremy Casas, Ravi Konuru, Steve W. Otto, Robert Prouty, and Jonathan Walpole. Adaptive load migration systems for PVM. In Proceedings of Supercomputing, IEEE Computer Society Press, November.

Message Passing Interface Forum. MPI: A message-passing interface standard. Technical Report, Computer Science Department, University of Tennessee, April. Also appears in the International Journal of Supercomputer Applications.

Al Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM user's guide and reference manual. Technical Report ORNL/TM, Oak Ridge National Laboratory, May.

David Gelernter and David Kaminsky. Supercomputing out of recycled garbage: Preliminary experience with Piranha. In Proceedings of the Sixth International Conference on Supercomputing, ACM Press, July.

Michael Gerndt. Updating distributed variables in local computations. Concurrency: Practice and Experience, September.

P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, July.

Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, August.

C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr., and M. Zosel. The High Performance Fortran Handbook. MIT Press.

M. Litzkow and M. Solomon. Supporting checkpointing and process migration outside the Unix kernel. In Usenix Winter Conference.

Vijay K. Naik, Sanjeev Setia, and Mark Squillante. Performance analysis of job scheduling policies in parallel supercomputing environments. In Proceedings of Supercomputing, IEEE Computer Society Press, November.

N. Nedeljkovic and M. J. Quinn. Data-parallel programming on a network of heterogeneous workstations. Concurrency: Practice and Experience.

Andrea Overman and John Van Rosendale. Mapping robust parallel multigrid algorithms to scalable memory architectures. In Proceedings of the Copper Mountain Conference on Multigrid Methods, April.

R. Konuru, J. Casas, R. Prouty, and J. Walpole. A user-level process package for PVM. In Proceedings of the Scalable High Performance Computing Conference (SHPCC), IEEE Computer Society Press, May.

R. Prouty, S. Otto, and J. Walpole. Adaptive execution of data parallel computations on networks of heterogeneous workstations. Technical Report CSE, Oregon Graduate Institute of Science and Technology.

Sanjeev Setia. Scheduling on Multiprogrammed, Distributed Memory Parallel Machines. PhD thesis, University of Maryland, August.

Alan Sussman, Gagan Agrawal, and Joel Saltz. A manual for the Multiblock PARTI runtime primitives. Technical Report CS-TR and UMIACS-TR, University of Maryland, Department of Computer Science and UMIACS, December.

V. N. Vatsa, M. D. Sanetrik, and E. B. Parlette. Development of a flexible and efficient multigrid-based multiblock flow solver. AIAA paper, in Proceedings of the Aerospace Sciences Meeting and Exhibit, January.

Hans P. Zima and Barbara Mary Chapman. Compiling for distributed-memory systems. Proceedings of the IEEE, February. In Special Section on Languages and Compilers for Parallel Machines.