Preprint, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, Ill., January

Language Constructs for Modular Parallel Programs

Ian Foster

Mathematics and Computer Science Division
Argonne National Laboratory
Argonne, IL, USA

Abstract

We describe programming language constructs that facilitate the application of modular design techniques in parallel programming. These constructs allow us to isolate resource management and processor scheduling decisions from the specification of individual modules, which can themselves encapsulate design decisions concerned with concurrency, communication, process mapping, and data distribution. This approach permits development of libraries of reusable parallel program components and the reuse of these components in different contexts. In particular, alternative mapping strategies can be explored without modifying other aspects of program logic. We describe how these constructs are incorporated in two practical parallel programming languages, PCN and Fortran M. Compilers have been developed for both languages, allowing experimentation in substantial applications.

Keywords: modularity, parallel programming, programming languages, program composition, code reuse, virtual computer

Introduction

In sequential programming, modular and object-oriented design and programming techniques are well understood and widely used to reduce complexity, permit separate development of components, and encourage reuse. In parallel programming, the situation is less advanced. Parallel programs are, for the most part, developed in an ad hoc fashion, as monolithic entities that cannot easily be adapted to changing circumstances. Code reuse is rare outside the specialized context of single-program multiple-data (SPMD) programming, in which the same program is run on every processor and libraries can be called to perform common global operations.

This paper presents programming language constructs that permit the benefits of modularity to be realized in parallel programs. The central ideas are as follows. First, process and data placement decisions within an individual module are specified with respect to a virtual computer; the embedding of this virtual computer within a physical or another virtual computer is specified when the module is invoked. This approach permits resource management and locality decisions to be separated from the specifications of individual modules and developed in a hierarchical fashion by using stepwise refinement techniques. Second, virtual computer constructs are incorporated into compositional programming languages, in which concurrency is specified with explicit parallel constructs and interactions between concurrent processes are restricted so that neither physical location nor execution schedule affects the result of a computation. Languages with this property include Strand and PCN, in which processes interact by reading and writing shared single-assignment variables, and Fortran M, in which interactions occur via single-reader, single-writer virtual channels. Compositional languages permit design decisions concerned with mapping, data distribution, communication, and scheduling to be made separately and to be modified without changing other aspects of a design.

In this paper, we show how virtual computer constructs are integrated into two compositional parallel programming languages, PCN and Fortran M. PCN is a C-like language that integrates ideas from concurrent and imperative programming. Fortran M is a small set of extensions to Fortran. In each case, virtual computers and related constructs have been introduced in a manner that is consistent with the base language, hence simplifying both comprehension and compilation. Both PCN and Fortran M have been used to develop a range of substantial parallel applications; these provide an empirical basis for evaluation of the concepts.

In summary, the contributions of this paper are as follows:

- The definition of programming language constructs that facilitate the modular construction of parallel programs
- The instantiation of these constructs in practical parallel programming languages
- Empirical evaluation in a range of applications

The next three sections of this paper address each of these issues in turn, after which we review related work and present our conclusions.

Modularity and Parallel Programs

We use the term modularity to refer to two related program-structuring techniques:

- Stepwise refinement: the decomposition of a complex problem into simpler subproblems, each solved by a separate module with a well-defined interface
- Modular decomposition: the isolation, in separate modules, of design decisions that are difficult, likely to change, or common to several program components

The first of these techniques is more naturally applied top down in the design process and, if care is taken to ensure generality, produces modules that can be reused in different contexts. The second is more naturally applied bottom up and can reduce the cost of program modifications. Both techniques are central to object-oriented design.

It is instructive to examine how these techniques can be applied in sequential and parallel programming. For illustrative purposes, we consider a simple example: the convolution operation used, for example, in motion estimation algorithms in image processing. Each pair of images is first processed by using a two-dimensional fast Fourier transform (FFT). The resulting matrices are then multiplied, and an inverse FFT is applied to the result. The various operations and the dependencies between them are illustrated in Figure 1.

[Figure 1: Convolution Algorithm Structure. Two FFT modules process the input images; their outputs are multiplied (MUL), and an inverse FFT (FFT^-1) is applied to the product.]

In a sequential implementation, stepwise refinement may identify FFT, matrix multiplication, input, and output modules. The data structures to be input to and output from each module are specified in its interface; internal data structures and algorithmic details are encapsulated and not visible to other components. The interface specifies the size and perhaps the type of the data structures to be operated on, hence allowing the module to be used in different contexts. For example, the same FFT module can be called three times in the convolution example. The interface design may also address efficiency issues, for example by specifying that data is to be passed by reference rather than by value.

The same decomposition can be used in a parallel implementation. Now, however, we need to encapsulate not only data and computation but also concurrency, communication, process mapping, and data distribution. If we do not encapsulate this information, two modules mapped to the same processor might interfere, for example if one received messages intended for the other. When composing two modules, we prefer not to be aware of how each maps computation and data. The mappings used in different modules can be made conformant to improve performance but should not affect correctness. Again, efficiency issues need to be addressed in the interface design, for example by allowing for the exchange of distributed data structures between parallel modules.

In a sequential implementation, modular decomposition may identify the representation of matrices as common to several program components. Hence, we define a module that provides the required matrix operations (read an element, write an element, etc.). The rest of the program uses these functions to access matrices; alternative matrix representations (e.g., linear storage or a linked-list representation of sparse matrices) can be substituted without changes to other components.

A design decision of equal significance in a parallel implementation is how computation and data are mapped to processors. For example, the various phases of the convolution problem illustrated in Figure 1 can be organized so as to execute (a) on disjoint sets of processors, as a pipeline; (b) in sequence, for each input data set; or (c) as some hybrid of these two approaches (Figure 2). We wish to localize the changes required to explore these alternatives, which do not affect the basic structure of the algorithm but may have different performance characteristics. These changes concern the allocation of computational resources (processors) to modules and the scheduling of components mapped to the same processor.

[Figure 2: Alternative Mappings (a), (b), and (c) for the Parallel Convolution Algorithm.]

We have found that the design issues encountered in the convolution example are common to many parallel programming problems. In summary, the most important issues are as follows:

- Parallel modules need to encapsulate not only computation and data, as in sequential programming, but also communication, concurrency, process mapping, and data distribution.
- Resource allocation and scheduling decisions need to be isolated from module specifications.
- Interfaces between parallel modules need to allow for the sharing of distributed data structures.

We now present the programming language concepts that we use to address these issues.

Virtual Computers. We remove resource allocation decisions from individual modules by specifying process and data mapping within a module relative to a virtual computer. The shape, size, and mapping of this virtual computer to physical processors are specified externally to the module. This mapping need not be one to one: several virtual computer nodes can be located on a single physical processor, or a virtual computer can correspond to a subset of a physical computer. For example, a parallel FFT module might be specified with respect to a one-dimensional array of processors, with size specified in its interface. Only the virtual computer specification in the calling program need be changed to explore the different mappings illustrated in Figure 2.
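The essence of this idea can be sketched as follows. The sketch is in Python purely for illustration (it is not PCN or Fortran M code), and the class and function names are hypothetical: a virtual computer is characterized by its size and by an embedding function mapping its nodes to nodes of a parent computer, and the physical location of any node is obtained by composing embeddings up the hierarchy.

```python
# Illustrative sketch only: a virtual computer as (size, parent, embedding).
class VirtualComputer:
    def __init__(self, size, parent=None, embed=None):
        self.size = size          # number of nodes in this virtual computer
        self.parent = parent      # enclosing virtual or physical computer
        # node -> parent-node mapping; identity if none supplied
        self.embed = embed if embed is not None else (lambda i: i)

    def physical_node(self, i):
        """Resolve node i through the chain of embeddings."""
        assert 0 <= i < self.size
        if self.parent is None:   # no parent: this is the physical machine
            return i
        return self.parent.physical_node(self.embed(i))

# A physical machine with 8 processors.
machine = VirtualComputer(8)
# A 4-node virtual computer embedded on processors 4..7 of the machine.
sub = VirtualComputer(4, parent=machine, embed=lambda i: 4 + i)
# A 2-node computer embedded on nodes 1..2 of that virtual computer;
# the module using it need not know where it ultimately executes.
subsub = VirtualComputer(2, parent=sub, embed=lambda i: 1 + i)
```

In this sketch, a module specifies mappings only in terms of its own (innermost) virtual computer; changing an outer embedding relocates the whole module without modifying it.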

Virtual computers control both resource allocation and locality. The size of a program component's virtual computer determines in part the computational resources available to that component. (The number of other components mapped to the same processors is also a factor.) The location of the virtual computer determines whether interactions with other components require inexpensive intraprocessor or expensive interprocessor communication. For example, compared with strategy (a), strategy (b) in Figure 2 permits different phases of the convolution operation to communicate inexpensively, because all operate on the same processors rather than on disjoint sets of processors. On the other hand, it increases intramodule communication costs, because each module is distributed over many processors.

We define three operations on virtual computers: process mapping, data distribution, and embedding.

Process mapping is the mechanism by which a process is created on, or migrated to, a particular processor. Language constructs are provided to specify that a particular process is to initiate or continue execution on a particular node of a virtual computer. For example, an FFT module might create one process on each node of the virtual computer in which it executes.

Data distribution is the mechanism by which data structures are distributed over processors. Language constructs are provided to define the type and size of a distributed data structure and the mapping of its elements to the nodes of a virtual computer. For example, an FFT module might specify a blocked distribution of its input, output, and work arrays.

Embedding is the mechanism by which virtual computers are defined. An embedding function specifies the size and type of a new virtual computer and how its nodes are mapped to the nodes of an existing physical or virtual computer. Language constructs are provided for specifying embedding functions. In the convolution example, an embedding function would be used to define the subcomputers in which the various modules (FFT, etc.) execute.

Embedding functions allow the programmer to develop resource management and locality specifications in a hierarchical fashion. For example, assume that our convolution program is defined to execute in a two-dimensional virtual computer. This convolution code may itself be embedded in a larger program. When calling the convolution code, it suffices to create a two-dimensional virtual computer in order to specify how the convolution code, and hence its constituent modules, is mapped to physical processors.

Scheduling. It is often necessary for correctness, or useful for efficiency, for components mapped to the same processor to execute in an interleaved fashion. For example, in the pipelined mapping of Figure 2, concurrent execution of different pipeline stages can improve efficiency by allowing overlapping of computation and communication: one can execute while others wait for data.

A programmer can control scheduling explicitly by merging code for different components into a single thread of control for each processor. However, this approach typically requires an intimate and complex intermingling of the different components, which not only compromises modularity but also hinders exploration of alternative scheduling strategies. Unfortunately, scheduling is an aspect of a parallel algorithm that is frequently changed. Execution on a different machine, or minor changes to other components of an algorithm, can change the temporal relationships between components and hence change the optimal schedule.

We address this problem by permitting the programmer to specify components as separate processes that execute in a data-driven manner. The execution schedule is then determined by the availability of data and by the scheduling algorithm used to select executable processes. This approach is, of course, well known in actors, dataflow, and parallel and concurrent logic programming.
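Data-driven execution of components mapped to the same processor can be sketched as follows (in Python, for illustration only; stage names are hypothetical). Two pipeline stages run as separate threads, and each executes exactly when its input becomes available; no explicit schedule is written by the programmer.

```python
# Illustrative sketch of data-driven scheduling: two stages on one
# "processor" interleave automatically, driven by data availability.
import queue
import threading

a_to_b = queue.Queue()   # channel connecting the two stages
results = []

def stage_a():
    for x in range(5):
        a_to_b.put(x * x)    # produce data for stage B
    a_to_b.put(None)         # end-of-stream marker

def stage_b():
    while True:
        x = a_to_b.get()     # blocks until stage A provides data
        if x is None:
            break
        results.append(x + 1)

threads = [threading.Thread(target=stage_a), threading.Thread(target=stage_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results is the same regardless of how the threads happen to interleave
```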

Interfaces. Concurrently executing program components must be able to share data. As noted above, a mechanism is required that supports encapsulation, that is independent of resource allocation and scheduling decisions, and that is efficient on parallel computers. We achieve this as follows.

Logical Resources. In order that modules do not interfere in unexpected ways, interactions between program components are restricted to language objects passed via interfaces. Hence, concurrently executing processes cannot operate on shared global variables or interact via physical resources such as processor or memory addresses.

Schedule Independence. In order that scheduling decisions can be changed without modifying other program components, we restrict operations on shared data to those for which execution order does not affect the result of a computation. For example, two processes may share a channel on which one performs nonblocking sends while the other performs blocking receives. Or several processes may share a single-assignment variable that one writes and several read. In both cases, the readers block until data is provided by the writer; hence, the schedule does not affect the result of a computation.
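The schedule-independence property of a single-assignment variable can be sketched as follows (a minimal Python illustration, not PCN's implementation): reads suspend until the single permitted write occurs, and a second write is an error, so every reader observes the same value no matter when it runs.

```python
# Illustrative sketch of a single-assignment variable: write-once,
# with reads that suspend until the value is available.
import threading

class SingleAssignment:
    def __init__(self):
        self._written = threading.Event()
        self._value = None

    def write(self, value):
        if self._written.is_set():
            raise RuntimeError("single-assignment variable written twice")
        self._value = value
        self._written.set()      # wake any suspended readers

    def read(self):
        self._written.wait()     # suspend until the writer supplies a value
        return self._value

x = SingleAssignment()
seen = []
readers = [threading.Thread(target=lambda: seen.append(x.read()))
           for _ in range(3)]
for t in readers:
    t.start()                    # all three readers suspend here
x.write(42)                      # the single write releases them
for t in readers:
    t.join()
```

Whatever order the three readers are scheduled in, each observes 42; this is the sense in which the execution schedule cannot affect the result.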

Global Address Space. In order that mapping decisions can be changed without modifying other program components, we make no distinction at the language level between local and remote references to objects. Hence, a process can operate on a resource regardless of its location. This requires some form of global address space, as provided, for example, by virtual channels and single-assignment variables.

Concurrency. Concurrent module interfaces are achieved by permitting shared data structures, for example arrays of virtual channels or single-assignment variables, to be distributed over processors and accessed concurrently by program components.

The first three of these requirements are satisfied by compositional languages such as PCN and Fortran M. The integration of data distribution constructs satisfies the fourth requirement.

Language Constructs for Modularity

The programming language concepts introduced in the preceding section have been incorporated in the PCN and Fortran M programming languages. In each case, it has proved possible to introduce the concepts in a manner that is consistent with other language concepts.

Program Composition Notation

Program Composition Notation (PCN) is a high-level concurrent programming language with a C-like syntax that combines features of concurrent logic languages and imperative languages. Programs are constructed by using three composition operators (parallel, sequential, and choice) to compose procedure calls, other compositions, or primitive operations. Concurrently executing processes interact by reading and writing single-assignment variables. Such variables are initially undefined and can be written at most once; an attempt to read an undefined variable suspends until the variable is written.

In both PCN and Fortran M, we specify that the virtual computer in which a process executes is by default the same as that process's parent, unless specified otherwise when the process is created. Information about this virtual computer can be obtained by inquiry functions: in PCN, these are topology and nodes, which return a representation of its type (e.g., a term such as mesh(M,N)) and its size, respectively. Mapping, data distribution, and embedding operations within the process are performed relative to this implicit virtual computer. An alternative approach, which also has its merits, is to represent the virtual computer as a data structure that can be manipulated in the language. This approach is adopted in CC++.

Mapping in PCN is specified by using the infix operator @ to apply a mapping function to a process or block. A mapping function returns a node number within the current virtual computer. For example, in the following, the @ operator is used to locate a call to process myproc on node (i,j) of an M x N mesh. The first code fragment is a parallel block that invokes procedure myproc on node mesh_node(i,j) of the current virtual computer. The second code fragment is a definition for the mesh_node function. This uses a PCN choice composition operator (a form of guarded command) to specify actions to be performed if the current topology is a mesh and the mesh location is valid (action: return the appropriate index within the mesh) or otherwise (action: return an error value). The infix operator ?= is a term-matching primitive that tests and decomposes the term returned by topology. All variables in PCN are local to the procedure in which they occur; single-assignment variables are not explicitly declared.

    {|| myproc() @ mesh_node(i,j) }

    function mesh_node(i,j)
    {? topology() ?= mesh(M,N), i < M, j < N ->
         return(M*i + j),
       default ->
         return(-1)
    }

Data distribution is supported in the form of blocked distributions of arrays of single-assignment variables. In a blocked distribution, each processor is allocated a contiguous block of array elements. Elements of a distributed array are accessed in the same manner as ordinary array elements. More complex distributions and data structures, such as those supported in data-parallel programming languages, can be integrated in the same manner but are not supported in the current PCN compiler.
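The index arithmetic behind a blocked distribution can be sketched as follows (an illustrative Python rendering; the actual PCN runtime details differ): with n elements on p nodes, each node owns a contiguous block of ceil(n/p) elements, so the owner and local offset of an element follow from integer division.

```python
# Sketch of blocked-distribution index arithmetic (illustration only).
def block_owner(i, n, p):
    """Node that owns element i of an n-element array on p nodes."""
    block = -(-n // p)      # ceil(n / p): elements per node
    return i // block

def local_index(i, n, p):
    """Offset of element i within its owner's contiguous block."""
    block = -(-n // p)
    return i % block

# 10 elements over 4 nodes: blocks of 3 elements; the last node holds 1.
owners = [block_owner(i, 10, 4) for i in range(10)]
```

Because owner and offset are computed locally in constant time, a process can locate any element of a distributed array without consulting a central directory.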

A keyword port is used to declare these distributed arrays. For example, the following procedure declares a distributed array P of single-assignment variables, with one element on each node of the current virtual computer. The syntax i over 0..nodes()-1 is a parallel enumerator, used here to create nodes() instances of the ring_node process. The location function node(i) is assumed to return its argument; it is used to place each process on a separate node in the current virtual computer. Elements of P are used to connect these processes in a unidirectional ring; as P is distributed, the newly created processes can access elements efficiently, without the need to communicate with a central location.

    ring()
    port P[nodes()];
    {|| i over 0..nodes()-1 :
         ring_node(P[i], P[(i+1) mod nodes()]) @ node(i)
    }
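The behavior of such a ring can be sketched as follows (in Python threads and queues, purely for illustration; ring_node here is a hypothetical stand-in for the process above): node i consumes from its own channel and produces on the channel of node (i+1) mod n, so a token injected at node 0 circulates once around the ring.

```python
# Illustrative sketch of a unidirectional ring of communicating processes.
import queue
import threading

n = 4
chans = [queue.Queue() for _ in range(n)]   # one channel per ring node
visits = []

def ring_node(i):
    token = chans[i].get()                  # wait for the token to arrive
    visits.append((i, token))
    if token < n:                           # pass it on until it has gone round
        chans[(i + 1) % n].put(token + 1)

threads = [threading.Thread(target=ring_node, args=(i,)) for i in range(n)]
for t in threads:
    t.start()
chans[0].put(1)                             # inject the token at node 0
for t in threads:
    t.join()
```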

Embedding is supported by the infix operator in, used to annotate a process call or block with an embedding function that returns a representation of the new virtual computer. For example, assume an embedding function subarray(start,size), which embeds a 1-D computer of specified size at location start in a parent computer of the same type but larger size. The following code fragment uses this function to invoke two of the FFT processes required for the convolution problem, each on a separate set of n processors.

    call fft(...) in subarray(0,n)
    call fft(...) in subarray(n,n)
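The subarray embedding amounts to a simple index translation, which can be sketched as follows (a hypothetical Python rendering, not the PCN definition): subarray(start,size) maps node i of a new size-node 1-D virtual computer to node start+i of the parent.

```python
# Illustrative sketch of subarray(start, size) as index translation.
def subarray(start, size):
    """Embedding: node i of a size-node 1-D computer -> parent node start+i."""
    def embed(i):
        assert 0 <= i < size
        return start + i
    return embed

first_fft = subarray(0, 4)    # nodes 0..3 of the parent computer
second_fft = subarray(4, 4)   # nodes 4..7: a disjoint set of processors
```

With these two embeddings, the two FFT invocations occupy disjoint sets of parent nodes, as required for the pipelined mapping.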

The PCN extensions that support virtual computers, data distribution, and embedding have been incorporated into a PCN compiler. A programmable source-to-source transformation system is used to translate the extended language into PCN, as well as calls to a runtime library that manages virtual computers and handles requests for distributed data. This compiler has supported extensive experimentation with the language extensions, as described in the Experiences section below. The compiler is designed to optimize process mapping and access to distributed data, both of which take constant time, irrespective of computer size. Execution of an embedding function takes time proportional to the square of the size of the new virtual computer; however, this cost is distributed over the physical processors on which the virtual computer is to be created and has not proved excessive on large parallel computers.

Fortran M

Fortran M is a small set of extensions to Fortran designed to support the modular construction of scientific and engineering applications. A primary goal in designing Fortran M was to provide a minimal set of extensions that were consistent with Fortran concepts; the temptation to "fix" Fortran was avoided. We describe Fortran M for two reasons. First, it shows how our concepts can be incorporated into a programming language with characteristics very different from those of PCN. Second, it allows us to provide a more comprehensive illustration of how distributed arrays are integrated with virtual computers.

Fortran M programs use parbegin/parend constructs to create processes that interact by sending and receiving messages on typed ports that reference single-writer, single-reader channels. Channel operations are modeled on Fortran file I/O constructs. Mapping is specified with respect to virtual computers; in keeping with Fortran concepts, virtual computers are N-dimensional arrays of virtual processors. As in High Performance Fortran (HPF), a PROCESSORS declaration is used to define the size and shape of the processor array that is to apply in a particular process, and the inquiry functions NumberOfProcessors and ProcessorShape permit a process to determine the size and shape of the computer in which it executes.

Mapping is specified by using the LOCATION annotation. An annotation of the form LOCATION(I1,...,In) appended to a process call indicates that the call is to execute on node (I1,...,In) of the current virtual computer. An executable RELOCATE statement is also defined, to request process migration; however, this is not currently supported in the Fortran M compiler, because of the portability problems associated with process migration.

The following code fragment implements the ring structure for which PCN code was presented in the preceding section. The code declares arrays of inports and outports, creates channels, and then invokes the ringnode processes, passing the ports as arguments. Notice the PROCESSORS declaration and LOCATION annotation.

    process ring(p)
    PROCESSORS(p)
    inport (integer) pi(p)
    outport (integer) po(p)
    do i = 1, p
      channel(in=pi(i), out=po(mod(i,p)+1))
    enddo
    processdo i = 1, p
      processcall ringnode(pi(i), po(i)) LOCATION(i)
    endprocessdo
    end

Data distribution is specified by using data distribution statements based on those in HPF, with the difference that these statements are interpreted with respect to the current virtual computer rather than an entire machine. For example, the following statements can be added to the previous program to specify that arrays pi and po are to be distributed:

    DISTRIBUTE pi(BLOCK)
    ALIGN po WITH pi

Embedding is specified with the SUBMACHINE annotation. When appended to a process call, this indicates that the call is to execute in a submachine of specified size and shape. Fortran array constructors can be used to define the new virtual machine. For example, the following code fragment invokes parallel FFT processes in the ith row and the ith column of an N x N processor array, respectively. Processes and distributed arrays created within each FFT process are distributed only over the processors on which that process executes.

    processes
      processcall fft(...) SUBMACHINE(i, 1:N)
      processcall fft(...) SUBMACHINE(1:N, i)
    endprocesses

A Fortran M compiler has been constructed that translates Fortran M programs into Fortran plus calls to a communication and process management library. This compiler is operational on both networks of workstations and parallel computers and has been used in a variety of programming experiments, as described in the Experiences section below.

Experiences

The language constructs described in this paper have been applied to a wide range of scientific and engineering problems, in areas as diverse as atmospheric modeling, computational biology, computational chemistry, and image processing. Here we summarize some of these investigations, which provide insights into the strengths and weaknesses of the approach.

Parallel Solvers

We first describe a study of parallel algorithms for numerical methods used to solve partial differential equations in spherical geometry, in which PCN was used to prototype algorithm variants. Three different numerical methods were considered: a spectral transform method on a regular latitude-longitude grid, a finite difference method on overlapping stereoscopic grids, and a control volume method on an icosahedral-hexagonal grid. In each case, the resulting parallel algorithms are complex, and there was a need to explore algorithmic variants. We use the simplest of the three examples to illustrate the approach. The spectral transform algorithm comprises three components. It operates on a latitude-longitude grid, with the bulk of the computation being performed independently at each grid point; communication is encapsulated in an FFT performed in each latitude and a summation performed in each longitude. The PCN implementation is structured as follows. The routine spectral is executed in a two-dimensional mesh virtual computer created by a call to spectral_mesh. An FFT module is invoked in each row of this computer and a summation module in each column, by using embedding functions row and col, respectively. A modified version of the sequential code (compute) is executed on each processor. The missing arguments represent the port variables used for communication between the different modules.

    main(lat,lon)
    {|| spectral(lat,lon) in spectral_mesh(lat,lon) }

    spectral(lat,lon)
    {|| i over 0..lat-1 : fft(...) in row(i),
        i over 0..lon-1 : sum(...) in col(i),
        i over 0..lat*lon-1 : compute(...) @ node(i)
    }

Our approach proved to have two major benefits in this case. First, it proved possible to reuse FFT and summation modules. Both were available as PCN programs and could be called directly, without reengineering, to execute in different contexts. In effect, no new parallel code had to be written beyond that shown above; only the sequential Fortran code required modification to execute in the parallel harness. Second, experimentation with alternative mapping strategies was facilitated. For example, on a two-dimensional mesh computer such as the Intel Delta, it is possible to map each row of the spectral mesh to either a row or a submesh of the computer. Both have potential performance advantages: the former results in simpler communication patterns, while the latter makes more efficient use of available wires. We were able to experiment with both approaches. These benefits were also observed when implementing the finite difference and control volume algorithms.

The study also revealed deficiencies in the PCN approach and tools. The use of two languages (PCN for the parallel harness and Fortran for sequential computation) proved off-putting to the applied mathematicians who assisted in the project. Also, although the PCN scheduler schedules computational tasks and remote reads correctly, a lack of preemption meant that remote read operations were sometimes delayed for extended periods, awaiting completion of computationally intensive operations. As a short-term solution, we decomposed the compute process into many lightweight processes. This is not as expensive as it might sound, because of the low cost of switching between the lightweight threads used in PCN. For example, the control volume code achieves good parallel efficiency relative to a sequential Fortran code on Intel Delta processors (depending on problem size), even when decomposed in this way. In the long term, some form of preemption is required.

Earth System Models

In the second application that we consider, our constructs are applied in a multidisciplinary framework. An earth system model integrates models of the atmosphere, ocean, biosphere, etc. The developer of a parallel earth system model encounters challenging encapsulation issues, as component models may themselves be parallel programs. The ability to explore alternative mapping strategies is also important, since it can sometimes be desirable to execute the component models on different processors, or even on different computers; in other situations, it may be preferable to multiprocess models on the same processors.

In order to explore the application of our approach to this class of problems, we have developed a Fortran M framework code that couples simple atmosphere and ocean models. Arrays of channels are used to transfer data between the two parallel models, and virtual computers are used to control process placement. This framework code has been used to explore performance issues. For example, the simple form of compositional data structure used in Fortran M, the channel, requires that data be copied when it is transferred from one module to another, even if these two modules execute on the same processor. An early concern was that high copying costs would necessitate that the portability obtained by a Fortran M implementation be sacrificed in order to permit direct sharing of data. However, experiments showed that the cost of a modular decomposition based on Fortran M concepts was small as a fraction of total execution time on a Sun workstation. Much of this overhead was due to the relatively high cost of switching between Fortran M processes (currently implemented as Unix processes) and the use of TCP/IP sockets for interprocess communication. Experiments with a prototype Fortran M compiler that uses Sun lightweight processes (threads) and shared-memory operations show that channel overhead can be reduced significantly. For example, Unix process switch cost, defined here as the time required to send a null message between two processes on the same processor, is on the order of milliseconds; with threads, this is reduced substantially, and greater reductions appear possible if specialized thread packages are used.

Education

PCN and Fortran M have been used to teach parallel programming to graduates and undergraduates in both computing and the sciences. Students were provided with libraries of modules implementing global operations, mesh structures, load balancing, and the like, which they then used in programming projects. This approach proved effective at two levels. Students were able to use some modules unchanged, without a deep understanding of their implementation. Other modules served as frameworks that they modified to suit a particular purpose. The sophistication of the programs developed by students was considerably higher when module libraries were used. Not surprisingly, the computing students showed a strong preference for PCN, while the science students preferred Fortran M.

Related Work

Several parallel languages and programming environments have been developed to support the modular construction of parallel programs. Borkar et al. propose that parallel programs be constructed by plugging together cells, in a manner analogous to VLSI design. They use this technique to generate efficient programs for the iWarp systolic processor; occam has been used for similar purposes. The target hardware limits the programs that can be specified in these systems: in the iWarp work, the contiguous submesh is the only virtual computer supported, and the number of channels is limited. Griswold et al. propose process ensembles as a means of organizing data, computation, and communication. However, they do not consider hierarchies of virtual computers. Browne et al. propose a compositional calculus for specifying interconnections between software chips. The integration of this calculus into a programming notation is not discussed, and the notion of virtual computer is absent.

Some recent work on parallel message-passing libraries has explored the use of process groups and communication contexts to support the encapsulation of communication operations in parallel libraries. Chien and Dally's Concurrent Aggregates (CA) language allows the definition of homogeneous collections of objects called aggregates; messages addressed to an aggregate are routed to one of its members. As in this paper, concurrent structures can be defined and composed with other structures to build concurrent programs. However, resource allocation and locality are not addressed.
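The routing behavior described above can be illustrated with a minimal sketch, assuming a round-robin choice of member (CA itself leaves the choice of member to the implementation, and the class and method names here are hypothetical, not CA syntax):

```python
import itertools

class Aggregate:
    """A homogeneous collection of objects; a message sent to the
    aggregate is delivered to exactly one of its members. Illustrative
    sketch of the idea described above, not actual CA."""

    def __init__(self, members):
        self.members = list(members)
        self._next = itertools.cycle(range(len(self.members)))

    def send(self, message):
        # Route the message to one member; round-robin is one possible
        # policy, chosen here only for determinism.
        member = self.members[next(self._next)]
        return member(message)

# Three interchangeable members, each tagged with its index.
counters = Aggregate([lambda m, i=i: (i, m) for i in range(3)])
replies = [counters.send("ping") for _ in range(4)]
```

A sender addresses `counters` as a single entity; which member replies is an internal decision of the aggregate, which is the encapsulation the paragraph above describes.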

Virtual computers have been used to achieve portability by hiding information concerning the size and topology of a physical computer. Martin, Hudak, and Taylor have investigated notations for specifying process mapping on a potentially infinite processing surface. In data-parallel languages, data distribution is specified with respect to a virtual computer, as proposed here; however, hierarchies cannot be defined. While these systems succeed in decoupling mapping or data distribution from other aspects of a parallel algorithm, they do not permit resource allocation decisions to be isolated from module definitions. For example, it is not straightforward to invoke the same module on processor subsets that may be overlapping or of different sizes.
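The capability whose absence is criticized above can be sketched as follows, assuming a hierarchy of virtual computers represented as (possibly overlapping) lists of node identifiers. All names are illustrative, not PCN or Fortran M syntax.

```python
class VirtualComputer:
    """A virtual computer is a list of node identifiers; subsets form a
    hierarchy, and children may overlap or differ in size."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def __len__(self):
        return len(self.nodes)

    def subset(self, indices):
        # A child virtual computer defined relative to this parent.
        return VirtualComputer(self.nodes[i] for i in indices)

def ring_module(vc, work):
    """A module written against a virtual computer, independent of the
    computer's size or of how its nodes map to physical processors."""
    return {node: work(i, len(vc)) for i, node in enumerate(vc.nodes)}

# The same module runs unchanged on overlapping subsets of different sizes.
root = VirtualComputer(range(8))
left = root.subset(range(0, 5))    # nodes 0..4
right = root.subset(range(3, 8))   # nodes 3..7, overlapping left
a = ring_module(left, lambda i, n: i / n)
b = ring_module(right, lambda i, n: i / n)
```

Because `ring_module` sees only its virtual computer, the decision to run it on `left`, `right`, or both is made entirely outside the module definition.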

Some of the ideas developed in this paper have also been applied to CC++, extensions to C++ designed to support compositional parallel programming. In CC++, virtual computers are represented explicitly as arrays of pointers to processor objects. Mapping is achieved by invoking a function in a processor object. Data distribution and embedding are not supported directly but can be specified as CC++ functions. CC++ mapping statements have semantic content, as functions can access data contained in a processor object; hence mapping is not independent of other aspects of a design.
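The point about semantic content can be illustrated with a sketch (in Python rather than CC++, with hypothetical names): because a mapping function may consult state held in a processor object, the mapping decision becomes entangled with other program logic.

```python
class ProcessorObject:
    """Illustrative stand-in for a CC++ processor object: it holds
    local state, and mapping is performed by invoking a function on a
    chosen processor object (names are assumptions, not CC++ syntax)."""

    def __init__(self, pid):
        self.pid = pid
        self.local_load = 0  # state that a mapping decision may consult

    def run(self, task):
        self.local_load += 1
        return task(self.pid)

# A "virtual computer" as an array of references to processor objects.
vc = [ProcessorObject(p) for p in range(4)]

def map_task(vc, task):
    # The mapping decision reads processor-object state (local_load),
    # so mapping is not independent of the rest of the design.
    target = min(vc, key=lambda p: p.local_load)
    return target.run(task)

results = [map_task(vc, lambda pid: pid) for _ in range(8)]
```

By contrast, the constructs advocated in this paper keep such decisions outside module definitions, so that a module never observes where it runs.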

Conclusions

Research in sequential programming has demonstrated the value of modular design. In this paper we have described programming language concepts that facilitate the application of modular design principles in parallel programming. We have also shown how these concepts can be realized in practical parallel programming languages, and we have reported experiences using these languages to solve large-scale programming problems. These experiences show that the familiar benefits of modular programming can be achieved in a parallel context. Modules implementing useful parallel algorithms can be reused. Existing applications can be integrated into larger systems without changes to their internal structure or implementation. Major structural changes concerning the mapping of computation and data to processors do not require changes to component modules.

A significant aspect of this work is that it is now possible to build truly modular, and hence reusable, parallel libraries. In a manner analogous to, but more flexible than, VLSI cell libraries, we can define libraries of parallel program components with specified internal logic and external interface, but encapsulating no resource allocation or scheduling decisions. These components can be used unchanged in a wide variety of contexts, or modified by the user to solve related problems. We are currently constructing libraries of this sort in several areas.

We have assumed in this paper that modules are implemented with the same technology, for example PCN or Fortran M. However, modules can also be constructed using different techniques (e.g., message-passing libraries or data-parallel languages), as long as implementations do not encapsulate mapping or scheduling decisions.

Acknowledgments

This work was supported by the Office of Scientific Computing, U.S. Department of Energy, under Contract W-31-109-Eng-38. I am grateful to Robert Olson and Steven Tuecke for their work on the PCN and Fortran M compilers, and to Mani Chandy and Carl Kesselman for numerous discussions.

References

G. Agha, Actors, MIT Press, 1986.

V. Bala and S. Kipnis, "Process groups: A mechanism for the coordination of and communication among processes in the Venus collective communication library," Preprint, IBM T. J. Watson Research Center.

G. Booch, Object-Oriented Design, Benjamin/Cummings.

S. Borkar et al., "iWarp: An integrated solution to high-speed parallel computing," in Proc. Supercomputing Conf.

J. Browne, J. Werth, and T. Lee, "Intersection of parallel structuring and reuse of software components," in Proc. Intl. Conf. on Parallel Processing, Penn State Press.

D. C. Cann, J. T. Feo, and T. M. DeBoni, "Sisal: High performance applicative computing," in Proc. Symp. Parallel and Distributed Processing, IEEE Press.

K. M. Chandy and I. Foster, "A deterministic notation for cooperating processes," Preprint, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, Ill.

K. M. Chandy and C. Kesselman, "Compositional parallel programming in CC++," Technical Report, Caltech.

K. M. Chandy and S. Taylor, An Introduction to Parallel Programming, Jones and Bartlett, 1992.

I. Chern and I. Foster, "Design and parallel implementation of two methods for solving PDEs on the sphere," in Parallel Computational Fluid Dynamics, Elsevier Science Publishers B.V.

A. Choudhary and Ponnusamy, "Parallel implementation and evaluation of a motion estimation system algorithm using several data decomposition strategies," J. Parallel and Distributed Computing.

A. Chien and W. Dally, "Concurrent Aggregates," in Proc. ACM Symp. on Principles and Practice of Parallel Programming.

J. Dongarra, R. van de Geijn, and D. Walker, "A look at scalable dense linear algebra libraries," in Proc. Scalable High Performance Computing Conf., Williamsburg, Virginia.

I. Foster and K. M. Chandy, "Fortran M: A language for modular parallel programming," Preprint, Argonne National Laboratory, Argonne, Ill.

I. Foster, C. Kesselman, and S. Taylor, "Concurrency: Simple concepts and powerful tools," The Computer Journal.

I. Foster, R. Olson, and S. Tuecke, "Productive parallel programming: The PCN approach," Scientific Programming.

I. Foster and S. Taylor, "A compiler approach to scalable concurrent program design," ACM Trans. Prog. Lang. Syst., to appear.

G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu, "Fortran D language specification," Technical Report, Dept. of Computer Science, Rice Univ., Houston, Texas.

W. Griswold, G. Harrison, D. Notkin, and L. Snyder, "Port ensembles: A communication abstraction for non-shared memory parallel programming," in Proc. Intl. Conf. on Parallel Processing, Penn State Press.

High Performance Fortran Forum, High Performance Fortran Language Specification, Technical Report, Rice University, November.

P. Henderson, Functional Programming, Prentice-Hall, Englewood Cliffs, N.J.

C. Hoare, "Communicating sequential processes," Commun. ACM, vol. 21, no. 8, pp. 666-677, 1978.

P. Hudak, "Para-functional programming," IEEE Computer, 1986.

Inmos Limited, occam Reference Manual, Prentice Hall.

M. Lemke and D. Quinlan, "A parallel C++ array class library for architecture-independent development of structured grid applications," ACM SIGPLAN Notices.

A. Martin, "The torus: An exercise in constructing a processing surface," in Proc. Conf. on VLSI, Caltech.

D. Parnas, "On the criteria to be used in decomposing systems into modules," Commun. ACM, vol. 15, no. 12, pp. 1053-1058, 1972.

A. Skjellum and A. Leung, "Zipcode: A portable multicomputer communication library atop the Reactive Kernel," in Proc. Distributed Memory Concurrent Computing Conf., IEEE Press.

S. Taylor, Parallel Logic Programming Techniques, Prentice Hall, 1989.

Thinking Machines Corporation, CM Fortran Reference Manual, Thinking Machines, Cambridge, Mass.

N. Wirth, "Program development by stepwise refinement," Commun. ACM, vol. 14, pp. 221-227, 1971.