
Experience with Parallel Programming Using Code Templates

Ajit Singh (1), Jonathan Schaeffer (2), Duane Szafron (2)

asingh@etude.uwaterloo.ca, jonathan@cs.ualberta.ca, duane@cs.ualberta.ca

(1) University of Waterloo, Dept. of Electrical and Computer Eng., Waterloo, Ontario, Canada N2L 3G1

(2) University of Alberta, Dept. of Computing Science, Edmonton, Alberta, Canada T6G 2H1

Abstract

For almost a decade we have been working at developing and using template-based models for parallel computing. Template-based models separate the specification of the parallel structuring aspects from the application code that is to be parallelized. A user provides the application code and specifies the parallel structure of the application using high-level icons called templates. The parallel programming system then generates the code necessary for parallelizing the application. The goal here is to provide a mechanism for quick and reliable development of coarse-grain parallel applications that employ frequently occurring parallel structures. Our initial template-based system, FrameWorks, was positively received, but had a number of shortcomings. The Enterprise parallel programming environment evolved out of this work. Now, after several years of experience with the system, its shortcomings are becoming evident. Controlled experiments have been conducted to assess the usability of our system in comparison with other systems. This paper outlines our experiences in developing and using these systems. A list of desirable characteristics of template-based models is given. The FrameWorks and Enterprise systems are discussed in the context of these characteristics and the results of our usability experiments. Many of our observations are relevant to other parallel programming systems, even though they may be based on different assumptions. Although template-based models have the potential for simplifying the complexities of parallel programming, they have yet to realize these expectations for high-performance applications.


1 Introduction

Along with the growing interest in parallel and distributed computing, there has been a corresponding increase in the development of models, tools, and systems for parallel programming. Consequently, practitioners in the area are now faced with a somewhat difficult challenge: how to select parallel programming tools that will be appropriate for their applications. There is no easy answer. The decision is a function of many parameters, including some that are specific to the user and their computing environment. These include the type of parallelism available in the application (for example, fine- or coarse-grained, data parallel or not, pipeline or master-slave), the target architectures (for example, shared or distributed memory), language constraints, and performance expectations. Other parameters are specific to the tool and its capabilities, including its feature set, portability, and usability (ease of use, flexibility, expressive power).

As is evident from the formation of user groups such as the Parallel Tools Consortium, there is a concern in the community about the lack of post-development analysis and evaluation for the various tools and technologies that are being proposed. Typically, researchers envision a new tool or technology, develop it and, depending on their initial experiences, report it in the literature. With few exceptions, long-term experiences with parallel programming systems and their relationships with similar systems are hardly ever reported.

Many different approaches have been taken towards the development of parallel programming models. A new parallel language is one approach: for example, a procedural language such as Orca or a functional language like Sisal. However, practical considerations, such as legacy code and the demand for C-based languages, often make this an impractical choice. Alternatives that allow the programmer to take advantage of the existing code and expertise in common sequential languages have found much wider acceptance. One such approach includes providing libraries for parallelization; PVM, P4, and MPI are examples. Another approach is to extend an existing sequential language with directives (High Performance Fortran) or keywords (for example, Mentat and PAMS).

A relatively new alternative has begun to emerge that allows a programmer to benefit from the existing code and knowledge of a sequential program, while minimizing the modifications that are required for parallelization. The programmer provides a specification of the parallel structuring aspects of the application in the form of code annotations. One interesting approach to code annotation is to recognize that there are commonly occurring parallel techniques. A parallel programming tool can support these techniques by providing algorithmic skeletons or templates that capture the parallel behavior. The user provides the sequential application code and selects the templates required to parallelize the application, such as in PIE and HeNCE. The system then generates the necessary parallel code. Template-based models separate the specification of the parallel structuring aspects, such as synchronization, communication, and process-to-processor mapping, from the application code that is to be parallelized. A template implements commonly occurring parallel interactions in an application-independent manner. The goal here is to provide an easy approach for the initial development and restructuring of coarse-grain parallel applications that relies on commonly used parallelization techniques.

This paper discusses our long-term experiences with two template-based parallel programming systems for coarse-grained parallelism. Our research began when we used templates to experiment with different parallel structures for an application. We quickly realized that the approach was more general and could be used to build a larger class of parallel applications. Building on this success, the FrameWorks parallel programming tool was developed. Our initial experience with FrameWorks was encouraging. However, for a number of reasons described later in this paper, it was not possible to evolve the system beyond a certain point. Consequently, an entirely new project, called Enterprise, was initiated. Enterprise is a template-based parallel programming environment which offers a much wider range of related tools for parallel program design, coding, debugging, and performance tuning. It has been publicly available for several years (http://web.cs.ualberta.ca/~enter).

Several other parallel programming systems have relied on techniques that are similar to the approach used by us. Many of our results and experiences with FrameWorks and Enterprise are applicable to such systems, as well as to other high-level parallel programming systems.

Before we delve into details, it is useful to clarify a couple of points regarding our use of the term template. In the past, techniques based on the use of application-independent common parallel structures have often been described under different names, such as algorithmic skeletons, model programs based on parallel programming paradigms, and parallel program archetypes. In addition to us, some other researchers have also recognized or used the term template to refer to such techniques. Although the underlying details of these techniques vary significantly, they all have the common goal of specifying commonly occurring parallel structures in the form of application-independent and reusable code. For the last ten years, we have used the term template-based to refer to this technique. At this point, it should also be pointed out that our use of the term template here is quite distinct from C++ templates. Our usage is restricted to the context of parallel programming, where it is used to denote a prepackaged set of application-independent characteristics. This has no intended relationship with the C++ templates, which are used to build generic classes in sequential programs.

In this paper, we look at template-based parallel programming models from two viewpoints. First, as the designers, we can address the difficulties in the design and implementation of these tools. Second, we have had considerable interaction with users developing template-based parallel applications. Controlled experiments, which compared Enterprise with a number of tools including PVM, give insights about the strengths and weaknesses of the template-based approach. The result is that although template-based models have tremendous potential for bridging the gap between sequential and parallel code, there still remain a number of shortcomings that must be addressed before the technology will be widely used.

Section 2 describes the template-based approach. Section 3 explains the distinctions between this approach and other high-level techniques for parallel programming. Section 4 outlines the objectives for an ideal template-based parallel programming tool and discusses their significance. Section 5 briefly describes FrameWorks and its shortcomings. These problems with FrameWorks led to the design of Enterprise, as described in Section 6. Section 7 describes our experiences with template-based models and the lessons learned. Extending the template model to other aspects of parallel programming is discussed in Section 8. Section 9 describes the requirements for future template-based tools. Finally, Section 10 presents our conclusions.

This paper may seem to be overly critical of template-based approaches. Our intent is not to discourage research in this area. Rather, we believe that far too many papers in the literature are long on praise and short on criticism. It is our hope that the issues discussed in this paper can be seriously tackled by the research community, so that the full potential of template-based tools can be realized.

2 Template-based Programming

In the context of parallel programming, a template represents a prepackaged set of characteristics which can fully or partially specify the nature of the communication, synchronization, and processor bindings of an entity. Templates implement various types of interactions found in parallel systems, but with the key components, the application-specific procedures, unspecified. A user provides the application-specific procedures and the tool provides the glue to bind it all together. The templates abstract commonly occurring structures and characteristics of parallel applications. The objective here is to allow users to develop parallel applications in a rapid and easy manner.


For example, consider a graphics animation program, Animation, consisting of three modules: Generate, Geometry, and Display. It takes a sequence of graphical images, called frames, and animates them. Generate computes the location and motion of each object for a frame. It then calls Geometry to perform actions such as viewing transformations, projection, and clipping. Finally, the frame is processed by Display, which performs hidden-surface removal and anti-aliasing. Then it stores the frame on the disk. After this, Generate continues with the computation of the next frame, and the whole process is repeated. Figure 1 shows the structure of a sequential version of the animation program.

(Figure 1 should be placed here)

A simple way to parallelize this application would be to let the three modules work in a pipelined manner on different processors. After computing a frame, Generate passes it to Geometry for processing and starts working on the next frame. Similarly, Geometry passes its output to Display and then receives its next frame from Generate. Therefore, all three modules work in parallel on different frames (see Figure 2(a)). Now, if Display takes much longer to do its processing as compared to Generate and Geometry (which is generally the case; in reality, hidden-surface removal and anti-aliasing require much more time than the other components of the program), more than one instance of Display can be initiated. This is possible because the processing of each frame is independent. Similarly, if the performance of Geometry is to be improved, several instances of it may be initiated as well. This situation is shown in Figure 2(b), where Geometry and Display have several active instances.

(Figure 2 should be placed here)

This parallel version of Animation contains two of the commonly-used structures for parallel computing, namely the pipeline and the replication. Consider parallelizing this application on, for example, a network of workstations. Parallel program development would require a significant amount of time and effort if a low-level tool was used, for example, Unix sockets or a message-passing library such as MPI. Further, the parallelism would be explicit in the code, increasing the complexity of the program. Each time the programmer wanted to experiment with a different parallel structure for the application, additional programming effort would be required to rewrite the code. Moreover, such an effort would be replicated, knowingly or unknowingly, by other programmers while writing other applications.

Template-based parallel programming systems provide skeletons (templates) of implementations of such parallel structures. A user simply provides sequential modules of code and selects the appropriate templates to structure the parallel application. As explained later, the templates of FrameWorks or Enterprise can be used to quickly generate the parallel structures shown in Figure 2. The procedural relationships in the diagram indicate that the three modules interact in a pipeline manner and that Geometry and Display can have multiple instances that execute independently from each other. The choice of template indicates which communication pattern the system automatically generates. The resulting parallel program automatically spawns the processes on available processors, establishes the communication links, and ensures the proper communication and synchronization. From the user's point of view, all the coding is sequential; all the parallel aspects are generated by the system. By separating the application-specific code from the parallel implementation, template-based development tools aim to decrease program development time and reduce the number of program errors due to parallelization.

In addition to FrameWorks and Enterprise, there are several other template-based parallel systems in the literature. Typically, these systems differ on several dimensions, including the selection of templates available to the user, restrictions on the code associated with templates, restrictions on the data that can be passed between templates, and correctness properties (such as freedom from deadlock) of the generated program. It is the approach to these issues that distinguishes one system from another.

3 Templates Versus Other High-level Techniques

Several different high-level models have been used for the design of parallel programming tools. This section compares the important properties of template-based systems to those of other well-known high-level techniques for building parallel applications.

A template encapsulates certain behavior in a parallel environment. A programmer using a template is concerned only with its specified behavior. The actual implementation may vary from environment to environment, depending on, among other things, the architecture and the operating system. In some ways, this is analogous to programming with abstract data types, which provide well-defined means for manipulating data structures while hiding all the underlying implementation details from the user.

Macros and message-passing libraries are popular implementations of high-level parallel models. However, the separation of application code and parallelization code is a key difference between templates and these methods. For example, the programmer must explicitly insert macros or library functions in the application code. On the other hand, templates are non-intrusive: there need not be any reference in the user's sequential code to the templates. This has important implications both for new parallel program development and for the restructuring of existing parallel applications.

Application-specific parallel libraries provide a second form of high-level abstraction for parallel programming. For example, PBLAS implements library routines for parallel applications based on linear algebra. These routines hide the underlying details of the parallel solution from the user. The user only needs to supply the data for a particular instance of the problem. There are two fundamental differences between application-specific libraries and templates. First, libraries provide an application-specific parallel solution; templates are application-independent. The application-independent nature of templates has also been emphasized by other researchers. Second, a programmer using templates has the freedom to choose between different parallel solutions to a problem; application-specific libraries usually provide a single solution.

New programming languages are a third technique for supporting high-level abstractions for parallel programming. Although the approach has some advantages, a serious disadvantage is that a programmer cannot make use of the existing code for the sequential version of an application. Some argue that parallel applications should be written from scratch. However, this argument is not consistent with the way complex tasks are usually solved. Initially, the emphasis is on finding a sequential solution to the task. It is only when the solution begins to take a significant amount of execution time that people start thinking about parallelizing the application. However, by this time, a large investment has been made in the sequential solution. In a template-based system, the programmer can often reuse the existing sequential legacy code.

While developing parallel applications, programmers often think in terms of certain high-level abstractions, such as master-slave, pipeline, or divide-and-conquer. Refinement of these abstractions to low-level primitives is postponed until the implementation phase. Template-based systems attempt to directly support these abstractions. The user specifies the required abstractions and the system generates the required code. To achieve the desired behavior, the system may have to insert code at many places in the user's sequential code. This is an important difference from techniques such as macro calls, where the expanded code is localized at the point of the macro call.

The concept of templates is consistent with Simon's views on the chunking of knowledge. According to this view, people do not generally think in terms of individual low-level operations while solving complex tasks. Rather, they organize their thoughts in terms of strategies, which consist of chunks of low-level operations structured in certain ways.


While templates encourage code reuse, they do not eliminate the need to rewrite sequential code to adapt it to a parallel environment. As with any other parallel tool, some code rewriting or restructuring may be necessary to expose the parallelism, satisfy the programming constraints of the tool, or achieve improved performance.

4 Desirable Characteristics of Template-Based Models

As we gain more insight into how programmers develop parallel applications and how different template-based systems can be built, we get a better understanding of the characteristics that should be, or could be, present in template-based systems. In this section, we outline what we feel are the important characteristics of the ideal template-based model. No tool presently exists that supports all of these features. The list is used in this paper to serve as a benchmark for analyzing FrameWorks, Enterprise, and other systems. In the following discussion, each of the characteristics is given a short name, which is shown inside parentheses. These names are used throughout the paper to refer to the corresponding characteristics.

4.1 Structuring the Parallelism

Template-based systems should place the fewest possible restrictions on how the user can structure the parallelism in an application. The most important structural properties are:

Separation of Specification (Separation). This is the central feature of a template-based system. It means that it should be possible to specify the templates (i.e., the parallelization aspects of the application) separately from the application code. This characteristic is crucial for rapid prototyping and performance tuning of a parallel application. It also allows the application code and its parallelization structures to be evolved in a semi-independent manner.

Hierarchical Resolution of Parallelism (Hierarchy). This allows the refinement of a component in a parallel application graph by expanding it using the same model. That is, templates can include other templates. Therefore, there is no need to have separate models for programming-in-the-large and programming-in-the-small.

Mutually Independent Templates (Independence). It is not sufficient to define some templates that can be used with other templates. The meanings of all templates should be context-insensitive, so that they can be used with any other templates.

Extendible Repertoire of Templates (Extendible). It should be possible for a user to extend the set of templates available.

Large Collection of Useful Templates (Utility). The system should be useful over a wide range of applications.

Open Systems (Open). It should be possible for the programmer to include lower-level mechanisms, such as explicit message passing, in their application. The absence of such a feature results in a closed system, where the only applications that can be developed are those whose required parallel structures match the templates. This is a very difficult requirement, as it has significant implications for application development, debugging, and performance tuning.

4.2 Programming

Templates may impose constraints on how users write sequential code:

Program Correctness (Correctness). The system should offer some guaranteed properties of correctness. For example, absence of deadlocks, deterministic execution, and fault tolerance are some desirable correctness features.

Programming Language (Language). The system should build on an existing, commonly-used language. Ideally, there should be no changes to the syntax or semantics of the language. This facilitates reuse of existing sequential code and makes it possible to take advantage of existing expertise in sequential programming.

Language Non-Intrusiveness (Non-Intrusiveness). A system may satisfy the Language objective, but force the user to change sequential code to accommodate limitations in the parallel programming model. For example, to develop a parallel application using a message-passing library, the user may have to appropriately restructure the code and insert calls to the message-passing library in the code. The only way to properly eliminate this problem, and also satisfy the Language constraint, is to have a compiler that automatically parallelizes the code. Unfortunately, for coarse-grained applications, the required compiler technology does not exist.


4.3 User Satisfaction

The system must satisfy a number of performance constraints, both at program development time and at run-time. These include:

Execution Performance (Performance). The maximum performance possible, subject to the combination of templates chosen by the user, should be achievable. There will always be limitations to the achievable performance. The complexity and interdependence of components external to the system (communication subsystem, operating system, network, etc.) make it very difficult to abstract and still attain the highest possible performance.

Support Tools (Support). The system should provide a complete set of design, coding, debugging, and monitoring tools that support the template-based model. These tools must support the same level of abstraction as the programming model.

Tool Usability (Usability). The ideal tool should have a high degree of usability. It should be easy to learn and easy to use. Usability assessments have been neglected in the literature.

Application Portability (Portability). The tool should allow the user to port applications to a number of different architectures. Some performance losses may be expected for a poorly-chosen architecture, but the program should still run.

5 Outline of FrameWorks

This section provides a brief overview of the FrameWorks model and system. A more complete description can be found in our earlier papers.

FrameWorks represents our initial attempt at developing a template-based system. In the FrameWorks model, an application consists of a fixed number of modules, which are written using an extended version of a high-level language (C). A module consists of a set of procedures, exactly one of which is specified as the entry procedure. The entry procedure of a module can be called by other modules in the application, in a manner similar to local C procedure calls. A module may also have local procedures, which may be called only from within the module. There are no common variables among the modules. Each application has one main module that contains the main procedure. The main module may or may not have an entry procedure.


5.1 The Interconnection Structure

FrameWorks provides a set of templates for specifying the interconnection among communicating modules. A module's complete interconnection with other modules can be described by a tuple:

(input-template, output-template, body-template)

For each type of template, the user must select one of the choices available and specify the input and output links for each module. This information is used by the system to generate an expanded version of a module, containing the low-level code for parallel synchronization, scheduling, and communication. To distinguish an original module from its expanded version, the latter is referred to as a process.

Input templates describe the interface through which a process receives its input. There are three options for input templates: initial, in-pipeline, and assimilator (Figure 3(a)). A process with an initial template does not receive any input from other modules. This template is used only by the main module of the application. A process using an in-pipeline template receives its input from any of its input processes and serves them in a first-come-first-served manner. In the case of an assimilator template, the process takes exactly one input from each of its input processes before calling the entry procedure of the enclosed module.

(Figure 3 should be placed here)

Similarly, there are three output templates: out-pipeline, manager, and terminal (Figure 3(b)). A process with an out-pipeline template can call any of its output processes. A manager template is used for executing a fixed number of copies of each of its output processes. A process whose output is marked as terminal does not call any other process.

A body template is used to assign additional characteristics to a module, which modify the module's execution behavior in the distributed environment. The use of a body template is optional. There are two choices for the body template: executive and contractor (Figure 3(c)). The executive template causes the process to have its input, output, and error streams directed to the user's terminal. The contractor template is useful for computationally intensive processes of an application, by dynamically utilizing idle processors at run time. When a module's body is declared as contractor, the run-time environment executes a variable number of replicas of the given module. Each of these replicas is known as an employee of the contractor. A contractor process hires an unspecified number of employee processes to get the job done. The designer of the application does not take part in the hiring and firing of employee processes; the user simply specifies that the given process should function as a contractor. Process management is performed by the run-time environment and is transparent to the designer.
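Putting the three template types together for the Animation example, one plausible set of tuples (our reading of the model, not taken verbatim from the FrameWorks documentation) would be:

```
Generate : (initial,     out-pipeline, executive)
Geometry : (in-pipeline, out-pipeline, contractor)
Display  : (in-pipeline, terminal,     contractor)
```

Here, Generate takes no input and drives the pipeline, while declaring Geometry and Display as contractors lets the run-time system replicate them on idle processors.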

5.2 Communication Among Modules

Modules communicate with each other using programmer-specified structured messages called frames. A frame is similar to a C structure, except that pointer-type variables are not allowed. For each link between two modules, the programmer specifies two frames: an input frame, a structure containing all the input parameters needed for a call, and, if necessary, an output frame, containing all the reply (or output) values returned. Execution of an application is initiated by the main procedure of the main module. Modules interact with each other using FrameWorks call statements:

call name (input-frame);

or

output-frame = name (input-frame);

where name is the name of the module called.

The two forms of FrameWorks calls shown above operate in the non-blocking and blocking modes, respectively. The non-blocking mode implies that the calling module will not wait for the completion of the call. Instead, it will continue with its own execution as soon as the called module has received the data. In the blocking mode, the calling module waits until a reply frame is returned via another FrameWorks construct, the reply statement. Within the called module, if the statement

reply (output-frame);

is encountered, the data in output-frame is returned to the calling module. After the called module returns, it starts waiting to serve another incoming call.
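As a sketch in FrameWorks-extended C (our own illustration, with hypothetical frame and helper names), Generate's inner loop using the non-blocking form, and a called module's reply, might look like:

```c
/* In Generate: hand each frame to Geometry without waiting. */
while (next_frame(&geom_in))      /* geom_in: input frame (no pointers) */
    call Geometry (geom_in);      /* non-blocking; continue immediately */

/* A caller that needed the result would use the blocking form:
       geom_out = Geometry (geom_in);
   and inside Geometry's entry procedure: */
reply (geom_out);                 /* return the output frame to the caller */
```

Since the calls look like ordinary procedure calls, the change from the sequential version is small, though not zero.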

(Figure 4 should be placed here)

As an example, to structure the graphics application discussed in Section 2, the user simply attaches input and output templates to the three modules, as shown in Figure 4(a). The only required change to the source code is to extend the normal procedure calls to Geometry and Display into FrameWorks non-blocking call statements. To replicate the execution of Geometry or Display, one simply attaches the contractor template to these modules (Figure 4(b)). No further modifications to the code are required. The system creates and manages the execution of a variable number of instances of the Geometry and Display modules. The exact number of instances employed depends on the workload of these modules, as well as the availability of lightly loaded processors on the network.

5.3 Experience and Lessons of FrameWorks

FrameWorks was a prototype system used to demonstrate the feasibility of template-based concepts. Our experience with FrameWorks indicated that, in the case of existing sequential applications, reasonable performance gains could be obtained using the simple modifications needed to create a coarse-grain parallel version of the applications. In several cases, partitioning of the complete application into modules was possible while keeping most of the code of the sequential version intact. In some cases, however, more efficient partitioning of modules required a significant amount of restructuring. In the case of applications that were designed with FrameWorks in mind, the amount of work to switch between the sequential and parallel versions was quite small. In such cases, experimenting with different templates often required either no modifications or only a small number of modifications within the modules.

Although the initial experience with FrameWorks was encouraging, several problems with the model and the system gradually became apparent. The major limitations included:

- The parallelism was expressed both in the code (call and reply) and in the graphical specification, violating the separation and non-intrusiveness objectives. The consequence was that changes in the template specification had to be mirrored in the code, increasing the chance of user error.

- FrameWorks required the user to specify as many as three templates to fully describe the parallel structure of a process. There were some subtle constraints on how these templates could be combined, eliminating illegal and impractical combinations. Users often found these constraints confusing (poor usability).

- The blocking version of the call primitive is a source of inefficiency. In this case, the calling module is blocked waiting for the reply frame even though it may not immediately need it to proceed with its computation, resulting in decreased performance.

- The call and reply primitives can use only a single frame as a parameter for exchanging data. Frames are limited to non-pointer data, restricting the parameter-passing possibilities. Since sequential C programs often use pointers for passing data to functions, these restrictions often required significant modifications to the sequential code to support the FrameWorks method of parameter passing, failing the non-intrusiveness objective.

- For its time, FrameWorks was quite novel in its approach toward structuring parallel applications. After its publication, we had several requests for the system from other researchers and practitioners. However, the FrameWorks system was not easily portable. The main reason for this was its dependence on a home-grown message-passing library and user-interface management tools. Although these tools helped us quickly develop the prototype system, porting FrameWorks to a new system meant installing all the tools and libraries it used. Some of the tools, in turn, depended on other locally developed research tools. These constraints made the job of porting FrameWorks to other sites very difficult, violating the portability objective.

Enterprise Parallel Programming System

Enterprise is not just a parallel programming tool; it is a parallel programming environment. It is a complete tool set for parallel program design, coding, compiling, executing, debugging, and profiling. A detailed description can be found elsewhere.

Improvements in Enterprise over FrameWorks

Enterprise represents an advancement over FrameWorks in several ways:

- Enterprise combines the three-part templates of FrameWorks into single units, called assets, that represent all the useful cases. This eliminates the issue of illegal or impractical combinations of partial templates (improving usability). Enterprise also introduces some new templates (improved utility).

- In Enterprise, the use of the FrameWorks call and reply keywords was eliminated. By using a precompiler, Enterprise automatically differentiates between a procedure call and a module call based on the application graph, called the asset diagram. In effect, all the parallel specifications are in the asset diagram, not in the user code. This creates an orthogonal relationship between the application code (programming model) and the asset diagram (meta-programming model). Enterprise largely satisfies the separation objective.


- Enterprise allows templates to be hierarchically combined to form a parallel program almost without limitation, satisfying the hierarchy objective.

- A useful debugging feature is that Enterprise programs can be run sequentially or in parallel, often without changes to the code or asset diagram, and without recompiling. Also, the events in a parallel program execution can be logged so that the program can be deterministically replayed.

- An analysis of the operational model of FrameWorks templates proved that a template would not cause a deadlock due to interactions within its components. The analysis also showed, however, that deadlock is still possible in an application where modules make blocking calls to one another in a cyclic manner. Use of the assimilator template was also shown to cause a deadlock under some situations. Learning from this, Enterprise eliminated the assimilator template. It also restricted the application graph to be only tree-structured. This eliminated the possibility of an Enterprise application getting into a deadlock situation, either due to its internal operation or due to cycles in the application's call graph. The user can, however, still write code that causes a deadlock. For example, an asset may be in an infinite loop due to some programming error, thus resulting in an indefinite wait for the entire application. These deadlock properties contribute towards the correctness objective.

- In FrameWorks, when a module call is made that returns a result, the caller is blocked until the callee replies. Enterprise uses futures to let the caller proceed concurrently until it needs to access the results. In effect, Enterprise uses compiler technology to postpone synchronization as long as possible. The result is improved performance. Enterprise has the synchronization implicit in the code; in FrameWorks it is explicit.

- Unlike FrameWorks, Enterprise module calls are not restricted to a single parameter. Moreover, Enterprise uses its precompiler to take care of marshaling and unmarshaling of parameters. This eliminates the need for frames and allows parallel procedure calls to look like sequential procedure calls. Further, Enterprise allows pointers to be passed as parameters, although the system does not support passing pointer data that itself contains pointer data. This considerably improves the non-intrusiveness of the system.

- FrameWorks used analogies to illustrate the operations of its templates, for example, manager, master, slave, and contractor. However, it was not quite consistent in its approach. Often it mixed these with somewhat unclear terminologies, such as in-pipeline or assimilator. Enterprise relies on a single, consistent analogy of a human organization to apply, document, and explain parallel structures. Human organizations are excellent examples of parallel systems. The analogies are intended to reduce the perceived difficulty of learning parallel programming, improving the usability of the system.

- Enterprise has been implemented with the portability objective in mind. The system is implemented on top of existing, easily accessible technology. Its user interface supports X-Windows and was written in Smalltalk. The precompiler was built using the Sage tools. The run-time library can use any one of three message-passing kernels: PVM, ISIS, and NMP. All these systems are available on a large number of platforms.

Enterprise Programming Model

Consider a call from a module A to a module B:

    Result = B(Param1, Param2, ..., ParamN);
    /* some other code */
    Value = Result;

The sequential semantics of such a call is that A calls B, passing it N parameters, and then blocks waiting for the return value from B before resuming execution. Enterprise preserves the effects of the sequential semantics but allows A and B to execute concurrently. When A calls B, the parameters to B are packaged into a message (marshaled) and sent to the process that executes B. After calling B, A continues with its execution until it tries to access Result to calculate Value. If B has not yet completed execution, then A blocks until B returns the Result. These so-called futures significantly increase the concurrency without requiring any additional specification from the user. In effect, a future is the synchronization primitive in Enterprise. For many applications, the sequential code looks identical to the parallel code and has equivalent semantics.

Enterprise allows pointer-type parameters in module calls. The macros IN, OUT, and INOUT can be used to designate input, output, and input/output type parameters. For example, consider the following program segment, where A calls B:


    int Data[...], Result;

    Result = B(Data, INOUT);
    /* some other code */
    Value = Data[...];

The second parameter, INOUT, indicates that the items of parameter Data are to be used for input as well as output. Here, the module call to B sends the elements of Data to B. When B finishes executing, it copies the elements back to A, overwriting A's data locations. A will block when it accesses Data if the call to B has not yet returned. It should be noted that, due to weak typing in C, it is not always possible to deduce the length of a pointer-type argument. Therefore, an additional parameter indicating its length is necessary.

In fact, these macros are not necessary. If they were not included in Enterprise, then all pointer parameters would be treated as INOUT, preserving the sequential semantics. However, performance would be lower, since all pointer data would be copied both on asset call and return. The macros have been included so that programmers can give the system important guidance to improve communication efficiency.

Enterprise Meta-Programming Model

The meta-programming model of Enterprise consists of templates, called assets, and a few basic operations that are used to combine different assets to achieve the desired parallel structures. As in FrameWorks, sequential code is attached to assets to get a complete parallel application. Enterprise currently supports the assets whose icons are given in Figure.

Figure should be placed here

Enterprise: It represents a program and is analogous to an entire business organization. By default, every enterprise asset contains a single individual. A developer can transform this individual into a line, department, or division, thus facilitating hierarchical structuring and refinement.

Individual: It represents a slave in traditional parallel programming terminology and is analogous to a person in an organization. It does not contain any other assets. In terms of Enterprise's programming component, it represents a procedure that executes sequentially. An individual has source code and a unique name. When an individual is called, it executes its sequential code to completion. Any subsequent call to that individual must wait until the previous call is finished. If a developer entered all the code for a program into a single individual, the program would execute sequentially.

Line: A line is analogous to an assembly or processing line; it is usually called a pipeline in the literature. It contains a fixed number of heterogeneous assets in a specified order. The assets in a line need not necessarily be individuals; they can also be other lines, departments, or divisions. Each asset in the line refines the work of the previous one and contains a call to the next. For example, a line might consist of an individual that takes an order, a department that fills it, and an individual that addresses the package and mails it. The first asset in a line is a receptionist. A subsequent call to the line waits only until the receptionist has finished its task for the previous call, not until the entire line is finished.

Department: A department represents a master-slave relationship in traditional parallel computing terminology and is analogous to a department in an organization. It contains a fixed number of heterogeneous assets and a receptionist that directs each incoming communication to the appropriate asset. All assets execute in parallel.

Division: It represents a divide-and-conquer computation and contains a hierarchical collection of individual assets among which the work is distributed. When created, a division contains a receptionist and a representative that represents a leaf node. Divisions are the only recursive assets in Enterprise. Programmers can increase a division's breadth by replicating the representative. The depth of recursion can be increased one level at a time by transforming the representative leaf node into a division. This approach lets developers specify arbitrary fan-out at each level.

Service: It represents a monitor and is analogous to any asset in an organization that is not consumed by use and whose order of use is not important. It cannot contain or call any asset, but other assets can call it. A wall clock is an example of a service; anyone can query it to find the time, and the order of access is not important.

Enterprise provides a small set of building blocks from which users can construct complex programs using a simple mechanism. The user begins by representing a program as a single enterprise asset containing a single individual. This one-person business represents a sequential program. Four basic operations are used to transform this sequential program into a parallel one: asset expansion, asset transformation, asset addition, and asset replication. Using the analogy, the simple business grows into a possibly complex organization.


The initial enterprise asset can be expanded to reveal its internal structure: a single individual. The individual asset can then be transformed into a composite asset, like a department, line, or division, and the composite assets can be expanded to reveal their default components. Component assets can be added to lines and departments. If there are more calls to an asset than it can handle in a reasonable time, the asset can be replicated to produce multiple identical copies. If a call to a replicated asset has not returned by the time a subsequent call is made to the asset, one of the replicas transparently handles the call. Finally, component assets at any level can be replicated and expanded, so a program can consist of a hierarchy of assets to an arbitrary level.

Figure should be placed here

The graphics example of Section can be parallelized in Enterprise with minimal changes to the original source code (mostly pointer parameters). Contrast the Enterprise asset diagram in Figure with the FrameWorks diagram in Figure, and note the hierarchical composition. Figure a shows Animation as a single organization, or enterprise. Expanding the icon reveals its inner structure: a line of assets (Figure b). Expanding that shows that the line consists of three individuals, Generate, Geometry, and Display, one of which is replicated up to eight times (Figure c). The diagrams are easily modified. For example, to replicate Geometry, the user need only select replication from Geometry's menu and specify the number of copies. This new parallel program will then run without any additional changes to the user's code.

Consider the parallel polynomial multiplication example of Figure a. The program can be described as a line of two assets: a receptionist, PolyMult, and a division, Mult. PolyMult reads in the coefficients of the polynomials to be multiplied. It then calls Mult to recursively do the multiplication. The code for Mult is shown in Figure a. To make this program run properly using Enterprise, the pointer parameters must be followed by an additional size parameter, with an INOUT designation (not shown). These small changes violate the non-intrusiveness requirement.

Figure b shows the parallel structure of the program. Inside the double-line rectangle is the expansion of the enterprise asset. Inside the dashed-line asset is the expansion of the line, consisting of the two assets PolyMult and Mult. The asset Mult has been expanded as a division. Inside the division is another division. Here, the division is replicated three times because of the three recursive calls to Mult. The diagram shows the depth of recursion: two levels. Figure c shows the result of expanding the diagram into a standard call graph showing all of the processes, including the hidden Enterprise processes. The simple diagram of Figure c corresponds to a complex structure of processes.


Figure should be placed here

An interesting property illustrated by the PolyMult example is that Enterprise can execute an asset sequentially or in parallel at run-time. Mult is recursive, and if there are processes available to do the recursion in parallel, it is done in parallel. Once the recursion reaches the depth of the asset diagram tree, subsequent calls are processed sequentially.

Enterprise eliminates the possibility of certain types of common parallelization errors. For example, as mentioned earlier, it eliminates cyclic calls, thus protecting users against a common cause of deadlocks. Errors such as waiting for a message (such as a reply) that will never be sent are checked by the system. Similarly, a missing-connection error is prevented by checking for asset calls to nonexistent assets. Also, the system handles the packing and unpacking of parameters for message communication, thus eliminating a common source of error in parallel programming.

The preceding has suggested that Enterprise comes close to satisfying the separation-of-specification objective. While largely true, there are two important places where this is violated. First, assets correspond to procedure/function calls in the user code. Changes in the asset diagram (for example, adding a new asset) must be reflected in the code, and vice versa. Second, the asset diagram may force the user to restructure their code to achieve the desired parallelism. For example, replicating an asset isn't beneficial if there is only a single call to that asset. To maximize performance, the user might have to rewrite the code so that the asset gets called many times, by dividing the work of one call into multiple computationally smaller calls.

Lessons and Experiences

There are several parallel programming systems that employ techniques similar to those found in FrameWorks/Enterprise. All these systems can be viewed as template-based. This section presents a critical evaluation of template-based parallel programming tools. It is intended to illustrate the large gap between current technology and what has to be improved before it can gain wide acceptance. Emphasis is placed on areas that require further research work.

Separation of Specification

The significance of separating sequential application program components from how these components interact has long been recognized. In early systems, component interaction was specified in separate text files. The advent of workstation technology and graphical user interfaces (GUIs) greatly enhanced the ease, efficiency, and effectiveness of specifying parallel structures.

Many of the systems that employ a separation of specifications between parallel structuring and application code are based on the dataflow model. Example systems are CODE, DGL, LGDF, and Paralex. Typically, in these systems, the programmer describes the dataflow using a graph, where nodes represent processes or programs and links represent the flow of data between nodes. A node can begin execution when all the links incident to that node have their inputs available. Some of these models also provide hierarchical resolution of parallelism; others don't. In a pure dataflow model, it is difficult to describe loops, self-loop arcs, and static/dynamic node replication. For this reason, some systems modify the model to introduce these additional constructs. For example, CODE supports replicated nodes.

Several models based on control flow that address the separation objective have emerged. Example systems include CAPER, PIE, and the Parallel Utilities Library (PUL). The PIE system (Programming and Instrumentation Environment) supports implementation templates for master-slave, recursive master-slave, heap, pipeline, and systolic multidimensional pipeline. In PIE, a template can have another implementation template as part of it, thus facilitating the hierarchical resolution of parallelism. As another example, the parallel language PAL is a procedural language with a language construct called a molecule. A molecule can be used to define one of several types of parallel computation: SIMD, sequential, pipelined, dataflow, etc. The PUL system provides high-level templates for task as well as data parallelism. It also supports templates for parallel I/O.

Separate-specification-based parallel computation models are also not limited to procedural programming languages. For example, Cole's algorithmic skeletons and PL are designed using the functional programming model. Similarly, Strand takes a comparable approach in designing its templates.

Some researchers advocate using application-independent parallel program skeletons not only for building parallel applications but also for educating programmers or documenting the solution strategies in parallel computing. A skeleton for parallel computing is defined as a class of algorithms that solve different problems but have the same control structure. A parallel program archetype is a program design strategy for a class of parallel problems, along with the associated program designs and example implementations. In both of these works, there is an added emphasis on enhancing developers' understanding of common classes of parallel problems.

The separation of specification between parallel structuring (the meta-programming model) and the user code is important, as it allows relatively independent evolution of the two. However, complete separation is currently not possible, as the meta-program needs to connect to the user code at some point. This is true of Enterprise as well as of all other similar systems that we are aware of.

Enterprise attempted to preserve the semantics of sequential C. Again, it was not possible to completely achieve this. There are semantic differences between the programming and meta-programming models. Also, the distributed memory model and futures force some subtle changes in semantics that can confuse some users. For example, in a MIMD model, each process has its own copy of the variables that may be declared as global variables in the original sequential program. This means that all information required by an asset, including access to the caller's global variables, must be added as extra parameters to the asset call. Depending on the application, this could require a major restructuring of the user's code. This is a source of many programming errors by first-time Enterprise users.

The above points illustrate that there are flaws in the Enterprise model. Similar weaknesses exist in other template-based models. All coarse-grained distributed-memory systems that we are aware of require the user to make some changes to their sequential code for parallelization. The ideal orthogonal relationship between sequential code and parallel specifications is hard to achieve, since the needs of the programming model and the meta-programming model are sometimes contradictory.

Trade-offs

An important weakness of any template-based model is that not all parallel algorithms can be readily expressed using the available repertoire of templates, affecting the utility of the system. For example, an algorithm that relies on a group of processes using peer-to-peer communication, as used, for example, in mesh problems, cannot be supported using current Enterprise assets. Although this problem can be alleviated somewhat as new assets are designed, it will never really disappear. There are two main reasons for this. First, it is probably impossible to predetermine a set of templates that can represent an arbitrary communication topology without reducing the level of the templates to a connect-the-dots approach. Second, there are certain trade-offs involved in designing a high-level parallel programming system. The difficulty in supporting peer-to-peer communication lies not with the implementation, but rather with the conflict that, in such a system, it may no longer be possible to offer any correctness guarantees, such as the absence of deadlocks.


Performance

Often, a solution generated by a high-level tool such as Enterprise may not achieve the same performance as a solution hand-crafted by an expert using a low-level communication library such as PVM. There are several reasons for this performance degradation:

- An Enterprise template may include a hidden process. For example, the department asset is implemented via a representative process that manages the various assets in the department. Although the additional process means that there is some performance overhead, its presence is desirable because it avoids splitting the code between several interacting processes and duplicating this code in all the assets, which would result in a poorly engineered system that would be difficult to understand and maintain. However, a particular instance of this type of application may be hand-crafted by a user without a representative process, thus achieving better efficiency.

- Asset calls sometimes transfer more data than necessary. The user knows exactly how much data to pass and can optimize a program to minimize it. Enterprise does not have the same intimate knowledge of the application as the user and will always err on the side of transferring too much.

- The Enterprise-generated code includes a lot of error checking. A hand-crafted application may want to eliminate most of it.

- Enterprise provides facilities for collecting debugging and performance monitoring information. Even if these facilities are not used, they still create a small run-time overhead.

- Being a high-level system, Enterprise deals with general structures rather than specific instances. For example, the divide-and-conquer asset (division) uses generalized code that is valid for any specified values of depth and width. This results in some overhead in the form of extra code.

- A user can use PVM to construct arbitrary communication graphs, exploiting communication shortcuts to improve performance. This is not possible in Enterprise or FrameWorks.

- Enterprise is built on PVM. Therefore, even though it may be possible to apply certain performance enhancements to Enterprise, the performance of an Enterprise application cannot exceed that of the best possible PVM implementation.


In summary, the trade-off is better software engineering in exchange for possibly slower execution performance. The degree to which other similar systems would suffer this performance loss depends on whether these systems share similar characteristics.

Although speedup is only one factor in judging the utility of a parallel programming system, there is a segment of the parallel programming community that demands near-peak performance from their applications. However, with the availability of relatively inexpensive multiprocessor machines and the widespread use of networked single-processor workstations, more and more people are turning towards parallel computing. For such users, a shorter learning curve and ease of program design, development, and debugging are just as important as speedup. A tool that quickly achieves a performance improvement, even though it may provide less than peak performance, may be quite acceptable. The debate over peak performance is akin to a similar debate in the sequential programming world over the use of high-level languages and fourth-generation tools instead of highly efficient assembly-language programming. Hardly anyone now questions the utility of high-level language compilers.

Usability

A motivation for developing FrameWorks and Enterprise was to construct a parallel programming system with a high degree of usability. The system should be easy to learn, easy to use, and, because of the high-level templates, capable of constructing correct parallel programs quickly. Our experience with Enterprise, as well as feedback from the user community, seemed to support these claims. Still, it was felt that some comparative assessment of Enterprise should be made to determine how well it fared against, for example, a low-level message-passing library.

We conducted a controlled experiment in consultation with a cognitive psychologist. Half of the graduate students in a parallel/distributed computing class solved a problem using Enterprise, while the rest used NMP, a PVM-like library of message-passing primitives. The student accounts were monitored to collect statistics on the number of compiles, program executions, editing sessions, and login hours. When the students submitted their assignments for grading, the quality of their solution (speedup) was measured and the number of lines of code written was counted. Full details of the experiment can be found elsewhere.

The results of this experiment were a mixed bag of expected results as well as surprises. The statistics support our initial expectation that students would do less work with Enterprise but get a more efficient solution with NMP. Enterprise students wrote fewer lines of code than the NMP students, in addition to doing fewer edits, compiles, and test runs. However, the NMP solutions ran faster. One surprising result was that even though Enterprise users wrote less code, they had more login hours than the NMP students. A detailed examination of the logged data revealed three main causes for this:

- Better tool support: Enterprise users frequently used the animation feature of the system to replay a computation. This tool provided useful run-time information but was quite slow to run.

- Usability: The Enterprise compiler preprocessed the user's code several times before generating the C code to be compiled. Consequently, compilations were several times slower, something that all users found frustrating.

- Performance: Since NMP performance was better, Enterprise users spent more time trying to improve the performance of their solution.

A second experiment was conducted to assess three tools: Enterprise, PVM, and PAMS, a commercially available tool that allows loop iterations to be done in parallel. A graduate class of students was divided into three groups, each group using a different parallel programming tool to do each assignment (related to sorting and tree searching). The students were asked to evaluate the tool used.

As expected, PVM solutions produced the best performance; on two of the assignments, it was significantly better than Enterprise/PAMS, with Enterprise and PAMS producing slower but comparable results. The code inserted into the sequential program by PVM users averaged substantially more lines than the code inserted by Enterprise/PAMS users. Superficially, it seems like an obvious trade-off: better performance for more programming effort expended. However, things were not as they seemed. Enterprise/PAMS users spent more login time working on their assignments, typically several additional hours. Again, a seemingly paradoxical result appears: students using the high-level tools wrote less code but spent more time developing it. Why? Gradually, three reasons emerged:

- Performance: Graduate students are highly competitive. Before the start of the experiment, they were warned that some tools might perform significantly better than others on a particular assignment. They were encouraged to be competitive, to get the best speedups within the group that was using the same tool, and not to compete with students using the other tools. Despite this, many of the Enterprise/PAMS students tried very hard to get PVM-like speedups. They tried numerous clever ways of circumventing the programming model but were rarely rewarded with better performance. Enterprise and PAMS students said that for each assignment there was an obvious way to parallelize the program, and this they could do quickly and easily. However, after their initial success, they found it very hard to improve performance.

Understanding: Many students had difficulty grasping the notion that Enterprise would "take care of everything for you". Templates hide a lot of detail from the user. If an asset made a call to another asset, even though the code and semantics of the call look sequential, the students knew it was being done in parallel. They felt they needed to understand how Enterprise worked, which of course defeats part of the purpose of a high-level tool.

Language: Both PAMS and Enterprise make subtle changes to the semantics of the host programming language (C). Even though these semantic differences were properly documented in the manuals, this was still a source of confusion for some students. Programming in PVM, in contrast, was as easy as writing sequential code to many students. Even though they had to write more PVM code, the students found that they needed to know only a handful of PVM routines, and once these routines were learned, writing parallel code was easy.

The data suggests that the students most dissatisfied with Enterprise/PAMS were the ones who did their first assignment with PVM. In PVM, the user has complete control over the parallelism and can do whatever is desired. When these users tried Enterprise/PAMS, they quickly became frustrated at the lack of control they had. In the final class evaluation of the tools, lack of user control over the parallelism was cited as the biggest disadvantage of Enterprise/PAMS. We could summarize the implications of these experiments as follows:

- These experiments, and feedback from WWW users, demonstrate that the Enterprise model and its support tools can be used to develop parallel programs. All too often, research tools are evaluated solely by the research group that developed the technology. There is a concern in the parallel programming community that the functionality and usability of parallel programming environments is often never validated.

- If the goal of a user is to quickly generate an initial version of a parallel application, Enterprise and PAMS can be termed easy-to-use systems as compared to PVM. However, if the goal is a parallel solution where performance is the overriding concern, communication libraries may provide a better alternative. For reasons outlined earlier, it is very hard for Enterprise to generate code that is as efficient as a handcrafted solution.

- Some users, particularly those who have worked with the low-level tools, do not like to lose the control and flexibility that such tools provide. In systems like FrameWorks or Enterprise, where a user must develop a parallel program using only the high-level constructs provided by the system, the lack of openness may be counterproductive. A possible solution might be more open and extendible systems, where a user may use templates if desired, but can also access low-level primitives for efficiency and flexibility. We discuss the issue of open and extendible systems further below.

- A high-level system should not introduce changes in the semantics of the underlying sequential language. In trying to preserve sequential compatibility, both Enterprise and PAMS introduce subtle changes to the semantics. These changes make the systems harder to learn and understand, and therefore make it more difficult to develop and debug applications. Subtle changes were harder for the students to deal with; obvious changes, such as new keywords or library calls, were easier, since they would be more explicit in the code.
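One way to see why these subtle semantic changes trip users up is to look at what an Enterprise-style call actually does. The sketch below is purely illustrative (all names are ours, not Enterprise's): it simulates, sequentially, the future-like behaviour in which a call keeps its sequential syntax but the caller only blocks when the result is first used.

```c
#include <assert.h>

/* Illustrative simulation only: Enterprise keeps the syntax of a
 * plain C call, but runs the callee concurrently and blocks the
 * caller the first time the result is touched (a "future"). */
typedef struct {
    int ready;   /* has the "remote" call completed? */
    int arg;     /* argument shipped to the callee */
    int value;   /* result, valid once ready != 0 */
} Future;

/* Stand-in for an asset call: in a real system the work would be
 * sent to another process and the caller would continue at once. */
Future async_square(int x) {
    Future f = { 0, x, 0 };
    return f;
}

/* First use of the result: a real system would block here until the
 * reply message arrives; we just run the deferred computation. */
int touch(Future *f) {
    if (!f->ready) {
        f->value = f->arg * f->arg;
        f->ready = 1;
    }
    return f->value;
}

int demo(void) {
    Future r = async_square(7);  /* looks sequential, "runs" in parallel */
    /* ... the caller is free to do other work here ... */
    return touch(&r);            /* the subtle change: blocking happens
                                    at first use, not at the call site */
}
```

The confusion the students reported stems from exactly this gap: the call site reads like ordinary C, but the point at which execution can block has silently moved.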

Template-Based Models and Low-Level Communication Libraries

Enterprise has a simple interface that allows it to use a variety of communication packages: PVM, ISIS and NMP. Enterprise can be viewed as a software layer on top of, for example, PVM. The question arises as to what the user gains and loses by moving to a higher level of abstraction in their code.

There are two main goals of the Enterprise system: to create a high-level programming environment that is easy to use, and to promote code reuse by encapsulating parallel programming code into templates. For example, Enterprise's model allows the user to achieve a near-complete separation of the parallel specification from the application code. There is nothing in the user's code that indicates it is intended for parallel execution, other than optional parameter macros for performance. The use of a precompiler allows the Enterprise system to automatically insert communication, parameter packing and synchronization code into the user's application. In contrast, with PVM the user must explicitly address these issues by inserting PVM library calls into the code, violating the nonintrusiveness objective. It is the user's responsibility to structure the code so that a compiler flag can be used to selectively include/exclude the parallel code.
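The include/exclude discipline mentioned above is typically achieved with a preprocessor flag. A minimal sketch, assuming a -DPARALLEL build flag of our choosing; the PVM routines named (pvm_initsend, pvm_pkint, pvm_send) are real PVM 3 calls, but the surrounding structure is our illustration, not Enterprise output:

```c
#ifdef PARALLEL
#include <pvm3.h>    /* PVM 3 message-passing library */
#endif

/* Every PVM call is guarded by a compile-time flag so that the same
 * source still builds and runs as a plain sequential program when
 * compiled without -DPARALLEL. */
int process_item(int item, int dest_tid) {
    int result = item * item;        /* the application computation */
#ifdef PARALLEL
    pvm_initsend(PvmDataDefault);    /* start a new outgoing message */
    pvm_pkint(&result, 1, 1);        /* pack the result into it */
    pvm_send(dest_tid, 0);           /* ship it to the destination task */
#else
    (void)dest_tid;                  /* sequential build: no messaging */
#endif
    return result;
}
```

Enterprise's precompiler removes the need for this manual bookkeeping; with raw PVM, every such guard is the user's responsibility.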

Enterprise offers the user additional benefits. For example, the model allows for the hierarchical use of the templates, thus ensuring deadlock-free structuring of applications. Also, the user has the assurance that the generated code for the specified structures is correct. Both points contribute to the correctness objective.

In moving to a higher-level model such as Enterprise, the user has lost something. Most noticeable is the possible decreased performance. Message-passing libraries allow for more flexibility; the user can easily tune a system to maximize performance. Further, these libraries have a large support infrastructure that has resulted in them being made available on most major platforms (excellent portability).

The choice between PVM and a higher-level tool is not easy. The decision can be simplified to a trade-off between execution performance and software engineering. High-level parallel programming tools have the potential to enable users to build parallel applications more quickly and reliably. In return, they may have to accept slightly worse performance.

Expanding the Role of Templates

Most template-based parallel programming models use their templates to represent control flow. However, there are several more areas where the application of the template-based approach holds promise:

Templates for parallel I/O: There are a number of commonly occurring parallel I/O access patterns. These patterns can be abstracted into a set of useful templates. An Enterprise-based implementation involves the user annotating (through the asset diagram) each file with an appropriate template. For example, the user can designate a file to be a diary or a newspaper, again using analogies to describe the data access patterns.
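As a flavour of what such an I/O template could encapsulate, consider the offset arithmetic for one commonly occurring pattern: a file split into disjoint per-process segments. This sketch is entirely ours; the diary and newspaper templates above are described only by analogy, and this code does not claim to implement them.

```c
/* One common parallel I/O pattern: each of nprocs processes reads a
 * disjoint, contiguous segment of a file.  A template can hide the
 * offset arithmetic below so the user never writes it.
 * (Illustrative only; names and pattern are our invention.) */
typedef struct {
    long offset;  /* where this process starts reading */
    long length;  /* how many bytes it owns */
} Segment;

Segment segment_for(long file_size, int nprocs, int me) {
    long base  = file_size / nprocs;
    long extra = file_size % nprocs;  /* first `extra` procs get one more */
    Segment s;
    s.length = base + (me < extra ? 1 : 0);
    s.offset = me * base + (me < extra ? me : extra);
    return s;
}
```

The point of the template is that this fiddly, error-prone arithmetic is written once, inside the template, rather than by every application programmer.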

Templates for shared memory: Work is proceeding on enhancing Enterprise with distributed shared memory. Users specify the shared memory and its access templates via the user interface, and the compiler analyzes the user's code to insert locks in the appropriate places. Templates correspond to different access protocols, including facilities to preserve sequential semantics, guarantee deterministic execution, or allow for chaotic results.
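The kind of code such a compiler pass would emit can be sketched with POSIX threads: the user writes the plain update, and the generated code brackets it with a lock. The function and variable names here are ours, not Enterprise's.

```c
#include <pthread.h>

/* Sketch of compiler-generated locking: the user writes
 *     counter += amount;
 * against a declared shared variable, and the compiler brackets the
 * access with a mutex.  (Illustrative names, not Enterprise output.) */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

void add_to_counter(long amount) {
    pthread_mutex_lock(&counter_lock);    /* inserted by the compiler */
    counter += amount;                    /* the user's original statement */
    pthread_mutex_unlock(&counter_lock);  /* inserted by the compiler */
}

long read_counter(void) {
    long v;
    pthread_mutex_lock(&counter_lock);
    v = counter;
    pthread_mutex_unlock(&counter_lock);
    return v;
}
```

Different access templates would change what is generated here, for example serializing all accesses to preserve sequential semantics, or omitting synchronization entirely when chaotic results are acceptable.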

Templates for data parallelism: Templates can be used to describe alignment and distribution of data on processors, as in HPF. The system can then generate SPMD code for the fine-grain data-parallel solution for the given function or segment of code. However, this approach is not suitable for applications that require redistribution or realignment of data during execution.
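The SPMD code generated from a block-distribution template boils down to each process computing the bounds of its own slice of the data. A sequential sketch (we simulate the processes in a loop; the bounds arithmetic is standard block distribution, not Enterprise output):

```c
/* Sketch of SPMD code a data-parallel template could generate for a
 * block distribution, as in an HPF BLOCK directive: each of nprocs
 * processes handles one contiguous slice of the array.  For
 * illustration, all "processes" run in a single loop here. */
void block_bounds(int n, int nprocs, int me, int *lo, int *hi) {
    int chunk = (n + nprocs - 1) / nprocs;      /* ceiling division */
    *lo = me * chunk;
    *hi = (*lo + chunk < n) ? *lo + chunk : n;  /* clamp the last block */
}

long spmd_sum(const int *a, int n, int nprocs) {
    long total = 0;
    for (int me = 0; me < nprocs; me++) {  /* one iteration per "process" */
        int lo, hi;
        block_bounds(n, nprocs, me, &lo, &hi);
        for (int i = lo; i < hi; i++)      /* each process sums its block */
            total += a[i];                 /* a reduction would combine these */
    }
    return total;
}
```

Redistribution during execution is exactly what this scheme cannot express: the bounds are fixed once by the template, which is why the text notes the approach is unsuitable for applications that realign data at run time.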


A Next-Generation Tool

Templates represent a powerful abstraction mechanism. We believe templates have the potential to make as strong an impact on the art of parallel programming as macros and code libraries. However, from our experiences with FrameWorks and Enterprise, we have learned a number of lessons that must be remembered when developing new template-based tools:

Open systems: Enterprise provides a high-level parallel programming model that the user must use. There are no facilities for the user to step back from the model to access lower-level primitives, to achieve better performance, or to accommodate an application for which a suitable template is not available. For example, even though Enterprise generates PVM code, this code is hidden from the user. There is no easy way to use Enterprise to generate a correct PVM program and then incrementally tune this program to achieve better performance. A high-level template-based tool must allow the user the possibility of accessing lower-level primitives. Also, it should be possible to develop an application partially with the use of templates and partially by using low-level communication primitives.

Extendibility: FrameWorks and Enterprise support a fixed number of templates. It is difficult to add new templates to the system. An important step towards enhancing the utility of a template-based model would be to design a system that provides a standard interface for attaching templates to the user code. In such a system, it may be possible for the user to develop new templates. As long as the templates are mutually independent, it should be possible to integrate them into the rest of the system. This would result in a system that is extendible and can support a large number of templates.

Portability: It is imperative to continue building on top of existing, established technology. Some de facto standards seem to be emerging. For example, PVM (and possibly MPI soon) is currently adequate as the lowest-level building block. PICL seems to be a popular choice for parallel program instrumentation. Given the significant effort required to build a parallel programming system, it seems foolhardy to continue to invent when one can reuse. Also, the widespread availability of these tools on different platforms enhances the portability of the system.

Language: Many parallel programming tools make subtle changes to the semantics of an existing sequential language. We believe this is a mistake. Changing a programming language's semantics can increase the user's learning curve and result in difficulties in understanding and debugging parallel code.


Importance of compiler technology: Our research would greatly benefit from better compiler technology. For example: some of the semantic confusion in Enterprise could be eliminated; static analysis of the code can do a better automatic job of code reorganization to improve concurrency and delay synchronization, thereby improving performance; compilers can uncover data dependencies, possibly uncovering programming errors at compile time rather than at run time; and flow-control analysis can identify communication patterns that can assist in the initial process-to-processor mapping. Orca, for example, uses compile-time analysis to help distribute the data.

Trade-offs: Should we build a tool for the inexperienced user or the experienced user? For example, it is conceivable to build an open and extendible system, as outlined in the open-systems and extendibility items above. However, in such a system it may no longer be possible to give the correctness guarantees that Enterprise offers. The requirements of users vary with their skill and experience levels. For the former, simplicity of the model and ease of use are the most important considerations. For the latter, performance is often the only metric that matters.

Conclusions

Who are the potential users of parallel computing technology? There will always be a user community that uses parallel computing to squeeze even the last nanosecond of performance out of a machine. We claim this group is a very small percentage of the potential user community. Local area networks of workstations are commonplace, and the popularity of low-cost multiprocessor shared-memory machines is rapidly growing. However, few people take advantage of the parallelism in these architectures. Many people want their programs to run faster, but are unwilling to invest the time necessary to achieve this.

Consider the sequential world. With a simple optimization flag on a compile (-O), the user can get increased performance. To further improve performance, the user must read the compiler manual page to find any other optimization options that might be applicable (for example, inlining functions). If better performance is still required, users will take the next step and use a tool to analyze their programs, such as an execution profiler. They will use this feedback to modify their code.

For most users, sequential program improvement stops at the compiler level. Ideally, the same should be true for coarse-grained parallel program development, such as is seen with vectorizing compilers. Given that compilation techniques are still in their infancy for coarse-grained applications, the next logical step is to provide a tool that allows users to parallelize their application with minimal effort. Template-based models offer real prospects of making this a reality.

Rather than putting forward yet another model for building parallel applications, this paper was meant to consolidate an existing approach to parallel programming. Usability experiments with Enterprise have added a new dimension to our understanding of how programmers with little or no experience in parallel computing build their parallel applications. There are a number of researchers working on similar high-level systems for parallel programming. A strong research interest in this area is evident from the fact that recently a mailing list was started on the Internet to discuss issues specific to skeleton- or template-based approaches; the list has a large membership (skeletons@dcs.ed.ac.uk). We hope that our experience in developing two such models into working systems, as well as the results of our experiments in estimating the usability of parallel programming systems, will be useful to researchers and practitioners in this area.

Perhaps the most damning comment on the state of the art of parallel programming tools for coarse-grained parallelism is the continued widespread popularity of PVM/MPI. No matter how clever the implementation, template-based parallel programming tools cannot achieve the performance of handcrafted solutions. However, there comes a point where the software engineering considerations gained by using a high-level tool outweigh the incremental gain in performance. Unfortunately, we are not yet at that stage; the gap between the performance possible from a tool such as FrameWorks/Enterprise versus PVM is large enough that serious users will continue to program in PVM for the foreseeable future.

We have identified several areas in the preceding sections where effort is necessary to enhance the usability of template-based systems. Work on several of these issues is in progress. Template-based techniques alone may not be enough to provide an easy-to-use, high-level parallel programming system that supports quick prototyping and restructuring of parallel applications, and supports code reuse. However, we believe that template-based techniques will play a significant role in building the ideal parallel programming systems of the future.

Acknowledgments

The constructive comments from Ian Parsons, Greg Wilson and Stephen Siu are appreciated. This research was conducted using grants from the Natural Sciences and Engineering Research Council of Canada (OGP grants) and IBM Canada Ltd.


References

H. Bal, M. Kaashoek and A. Tanenbaum. Orca: A Language for Parallel Programming of Distributed Systems. IEEE Transactions on Software Engineering.

J. Feo, D. Cann and R. Oldehoeft. A Report on the Sisal Language Project. Journal of Parallel and Distributed Computing.

G. Geist and V. Sunderam. Network-Based Concurrent Computing on the PVM System. Concurrency: Practice and Experience.

R. Butler and E. Lusk. Monitors, Messages, and Clusters: The p4 Parallel Programming System. Parallel Computing.

D. Walker. The Design of a Standard Message Passing Interface for Distributed Memory Concurrent Computers. Parallel Computing.

D.B. Loveman. High Performance Fortran. IEEE Parallel and Distributed Technology, February.

A. Grimshaw, W.T. Strayer and P. Narayan. Dynamic, Object-Oriented Parallel Processing. IEEE Parallel and Distributed Technology.

W. Karpoff and B. Lake. PARDO: A Deterministic, Scalable Programming Paradigm for Distributed Memory Parallel Computer Systems and Workstation Clusters. In Supercomputing Symposium, Calgary.

M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press, Cambridge, Mass.

Z. Segall and L. Rudolph. PIE: A Programming and Instrumentation Environment for Parallel Processing. IEEE Software.

A. Beguelin, J. Dongarra, G. Geist, R. Manchek and V. Sunderam. Graphical Development Tools for Network-Based Concurrent Computing. In Supercomputing.

M. Green and J. Schaeffer. FrameWorks: A Distributed Computer Animation System. In Canadian Information Processing Society, Edmonton.


A. Singh, J. Schaeffer and M. Green. Structuring Distributed Algorithms in a Workstation Environment: The FrameWorks Approach. In International Conference on Parallel Processing, volume II.

A. Singh, J. Schaeffer and M. Green. A Template-Based Tool for Building Applications in a Multicomputer Network Environment. In D. Evans, G. Joubert and F. Peters, editors, Parallel Computing. North-Holland, Amsterdam.

A. Singh, J. Schaeffer and M. Green. A Template-Based Approach to the Generation of Distributed Applications Using a Network of Workstations. IEEE Transactions on Parallel and Distributed Systems, January.

P. Iglinski, S. MacDonald, D. Novillo, I. Parsons, J. Schaeffer, D. Szafron and D. Woloschuk. Enterprise User Manual. Technical Report, Department of Computing Science, University of Alberta.

G. Lobe, D. Szafron and J. Schaeffer. The Enterprise User Interface. In TOOLS: Technology of Object-Oriented Languages and Systems.

S. MacDonald, D. Szafron and J. Schaeffer. An Object-Oriented Run-Time System for Parallel Applications. In TOOLS: Technology of Object-Oriented Languages and Systems, to appear.

J. Schaeffer and D. Szafron. Software Engineering Considerations in the Construction of Parallel Programs. In High Performance Computing: Technology and Applications. Elsevier Science Publishers B.V., Netherlands.

D. Szafron and J. Schaeffer. An Experiment to Measure the Usability of Parallel Programming Systems. Concurrency: Practice and Experience.

J. Schaeffer, D. Szafron, G. Lobe and I. Parsons. The Enterprise Model for Developing Distributed Applications. IEEE Parallel and Distributed Technology.

B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti and M. Vanneschi. P3L: A Structured High-Level Parallel Language, and its Structured Support. Concurrency: Practice and Experience.


A. Bartoli, P. Corsini, G. Dini and C.A. Prete. Graphical Design of Distributed Applications Through Reusable Components. IEEE Parallel and Distributed Technology.

J.C. Browne, M. Azam and S. Sobek. CODE: A Unified Approach to Parallel Programming. IEEE Software, July.

J.C. Browne, S. Hyder, J. Dongarra, K. Moore and P. Newton. Visual Programming and Debugging for Parallel Computing. IEEE Parallel and Distributed Technology.

L. Schäfers, C. Scheidler and O. Krämer-Fuhrmann. TRAPPER: A Graphical Programming Environment for Industrial High-Performance Applications. In Parallel Architectures and Languages Europe.

P. Brinch Hansen. Studies in Computational Science: Parallel Programming Paradigms. Prentice-Hall, Inc.

P. Brinch Hansen. Search for Simplicity: Essays in Parallel Programming. IEEE Computer Society Press.

K.M. Chandy. Concurrent Program Archetypes. In International Parallel Programming Symposium, Keynote Address.

M. Danelutto and S. Pelagatti. Parallel Implementation of FP Using a Template-Based Approach. In Proceedings of the International Workshop on Implementation of Functional Languages, September.

S. Leffler, M. McKusick, M. Karels and J. Quarterman. The Design and Implementation of BSD Unix. Addison-Wesley Publishing Company, Inc.

G. Andrews, R.A. Olsson, M.A. Coffin, I. Elshoff, K. Nilsen, T. Purdin and G. Townsend. An Overview of the SR Language and Implementation. ACM Transactions on Programming Languages and Systems.

H.A. Simon and W.G. Chase. Skill in Chess. American Scientist, August.

A. Singh. A Template-Based Approach to Structuring Distributed Algorithms Using a Network of Workstations. PhD thesis, Dept. of Computing Science, University of Alberta.


R. Halstead. MultiLisp: A Language for Concurrent Symbolic Computation. ACM Transactions on Programming Languages and Systems.

D. Gannon, J. Lee, B. Shei, S. Sarukkai, S. Narayana and N. Sundaresan. SIGMA II: A Tool Kit for Building Parallelizing Compilers and Performance Analysis Systems. In Programming Environments for Parallel Computing. North-Holland, Netherlands.

K. Birman, A. Schiper and P. Stephenson. Lightweight Causal and Atomic Group Multicast. ACM Transactions on Computer Systems.

T. Marsland, T. Breitkreutz and S. Sutphen. A Network Multiprocessor for Experiments in Parallelism. Concurrency: Practice and Experience.

J.C. Browne, A. Tripathi, S. Fedak, A. Adiga and R. Kapur. A Language for Specification and Programming of Reconfigurable Parallel Structures. In International Conference on Parallel Processing.

T.G. Lewis and W.G. Rudd. Architecture of the Parallel Programming Support Environment. In IEEE COMPCON.

R. Jagannathan, A.R. Downing, W.T. Zaumen and R.K.S. Lee. Dataflow-Based Technology for Coarse-Grain Multiprocessing on a Network of Workstations. In International Conference on Parallel Processing, August.

D.C. DiNucci and R.G. Babb II. The LGDF Parallel Programming Model. In IEEE COMPCON.

O. Babaoglu, L. Alvisi, A. Amoroso and R. Davoli. Paralex: An Environment for Parallel Programming in Distributed Systems. Technical Report UBLCS, Department of Mathematics, University of Bologna, Italy.

B. Sugla, J. Edmark and B. Robinson. An Introduction to the CAPER Application Programming Environment. In International Conference on Parallel Processing, August.

L. Clarke, R. Fletcher, S. Trewin, R. Bruce and S. Chapple. Reuse, Portability and Parallel Libraries. In Programming Environments for Massively Parallel Distributed Systems. Birkhäuser Verlag, Basel, Switzerland.


Z. Xu and K. Hwang. Molecule: A Language Construct for Layered Development of Parallel Programs. IEEE Transactions on Software Engineering, May.

I. Foster and S. Taylor. Strand: A Practical Parallel Programming Tool. In North American Conference on Logic Programming. MIT Press.

D. Szafron and J. Schaeffer. Experimentally Assessing the Usability of Parallel Programming Systems. In Programming Environments for Massively Parallel Distributed Systems. Birkhäuser Verlag, Basel, Switzerland.

I. Parsons, R. Unrau, J. Schaeffer and D. Szafron. A Template Approach to Parallel I/O. Parallel Computing, to appear.

D. Novillo. High-Level Representations for Distributed Shared Memory. Technical Report, Department of Computing Science, University of Alberta.

S. Siu. Openness and Extensibility in Design-Pattern-Based Parallel Programming Systems. Master's thesis, Electrical and Computer Engineering Dept., University of Waterloo.

G. Geist, M. Heath, B. Peyton and P. Worley. PICL: A Portable Instrumented Communication Library. Technical Report ORNL/TM, Mathematical Sciences Section, Oak Ridge National Laboratory.

H. Bal and M. Kaashoek. Object Distribution in Orca Using Compile-Time and Run-Time Techniques. In Object-Oriented Programming Systems, Languages and Applications (OOPSLA).


/* Definition of structures used by functions.
   geometry.h contains the structures for poly_tbl, obj_tbl, etc. */
#include "geometry.h"
#define MAXIMAGES

struct generate_geometry {
    int image_number;
    struct obj obj_tbl[MAXOBJ];
};

struct geometry_display {
    int image_number, n_poly;
    struct polygon poly_tbl[MAXPOLY];
};

main()                            /* Generate */
{
    struct generate_geometry work;
    int image;

    for (image = 0; image < MAXIMAGES; image++) {  /* loop through images */
        ComputeObjects(work);     /* Modeling and motion computation */
        Geometry(work);           /* Send further processing to Geometry */
    }
}

Geometry(struct generate_geometry work)
{
    struct geometry_display frame;

    DoConversion(work, frame);    /* View transformation on the image */
    Display(frame);               /* Send data to Display for further processing */
}

Display(struct geometry_display frame)
{
    DoHidden(frame);              /* Hidden surface removal and antialiasing */
    WriteImage(frame);            /* Store image on disk */
}

Figure: Structure of the Animation Application.


Figure: Potential Parallelizations of Animation. (a) A parallel version: Generate feeds Geometry, which feeds Display. (b) A parallel version with replications: the Geometry and Display stages are replicated.

Figure: FrameWorks Templates. (a) Input templates: Initial, In-pipeline, Assimilator. (b) Output templates: Terminal, Out-pipeline, Manager. (c) Body templates: Executive, Contractor.


Figure: Parallel Versions of the Animation Program Using FrameWorks. (a) Line: Generate, Geometry and Display form a pipeline. (b) Line with replications: the Geometry and Display stages are replicated.

Figure: Enterprise Assets.


Figure: The Animation Program in Enterprise.


/* Multiply two polynomials together, with coefficients in the Pointer1
   and Pointer2 arrays, and put the product coefficients in the
   Answer_Pointer array. */
Mult(Pointer1, Pointer2, N, Answer_Pointer)
{
    localvars Result1, Result2, Result3, Cross1, Cross2;

    if (N == 1)
        Answer_Pointer[0] = Pointer1[0] * Pointer2[0];
    else {
        /* Multiply the low and high order terms */
        Mult(Pointer1, Pointer2, N/2, Result1);
        Mult(Pointer1 + N/2, Pointer2 + N/2, N/2, Result2);

        /* Low and high crossover terms */
        Cross1 = CrossOverTerms(Pointer1, Pointer2, N);
        Cross2 = CrossOverTerms(Pointer2, Pointer1, N);
        Mult(Cross1, Cross2, N/2, Result3);

        /* Sequentially combine results to give the answer */
        Combine(Result1, Result2, Result3, N, Answer_Pointer);
    }
    return;
}

(a) Polynomial Multiplication Pseudo-Code.

Figure: Polynomial Multiplication in Enterprise.
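The figure leaves CrossOverTerms and Combine unspecified. One concrete sequential reading of this three-way divide-and-conquer scheme is Karatsuba multiplication; the sketch below is our interpretation, not code from the paper (n must be a power of two, and the output array must hold 2n-1 coefficients):

```c
#include <stdlib.h>
#include <string.h>

/* Karatsuba-style polynomial multiplication: split each input into
 * low and high halves, recursively form three sub-products, and
 * combine them.  This is one possible concrete realization of the
 * Mult/CrossOverTerms/Combine structure in the figure. */
void poly_mult(const int *a, const int *b, int n, int *out) {
    if (n == 1) { out[0] = a[0] * b[0]; return; }
    int h = n / 2;
    int *r1 = calloc(2 * n, sizeof(int));   /* low  half * low  half */
    int *r2 = calloc(2 * n, sizeof(int));   /* high half * high half */
    int *r3 = calloc(2 * n, sizeof(int));   /* crossover product */
    int *sa = calloc(h, sizeof(int));       /* a_low + a_high */
    int *sb = calloc(h, sizeof(int));       /* b_low + b_high */
    for (int i = 0; i < h; i++) {
        sa[i] = a[i] + a[i + h];
        sb[i] = b[i] + b[i + h];
    }
    poly_mult(a, b, h, r1);                 /* the three recursive calls */
    poly_mult(a + h, b + h, h, r2);
    poly_mult(sa, sb, h, r3);
    memset(out, 0, (2 * n - 1) * sizeof(int));
    for (int i = 0; i < 2 * h - 1; i++) {   /* "Combine" step */
        out[i]     += r1[i];
        out[i + n] += r2[i];
        out[i + h] += r3[i] - r1[i] - r2[i];
    }
    free(r1); free(r2); free(r3); free(sa); free(sb);
}
```

In the Enterprise version of the figure, the three recursive Mult calls are the parallel opportunity: each can proceed in a separate asset, with Combine forcing synchronization when their results are first used.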