Parallel Functional Programming with Skeletons: the ocamlp3l experiment

Marco Danelutto (1), Roberto Di Cosmo (2), Xavier Leroy (3), Susanna Pelagatti (1)

(1) Dipartimento di Informatica, Universita di Pisa, Pisa, Italy
(2) LIENS-DMI, Ecole Normale Superieure, Paris, France
(3) INRIA Rocquencourt, Domaine de Voluceau, Rocquencourt, France

Abstract

Writing parallel programs is not easy, and debugging them is usually a nightmare. To cope with these difficulties, a structured approach to parallel programs using skeletons and template-based compiler techniques has been developed over the past years by several researchers, including the P3L group in Pisa.

This approach is based on the use of a set of primitive forms that are just functionals implemented via templates exploiting the underlying parallelism, so it is natural to ask whether marrying a real functional language like Ocaml with the P3L skeletons can be the basis of a powerful parallel programming environment. We show that this is the case: our prototype, written entirely in Ocaml using a limited form of closure passing, allows a very simple and clean programming style, shows real speedup over a network of workstations and, as an added fundamental bonus, allows logical debugging of parallel programs in a sequential framework without changing the user code.

Key words: skeletons, functional languages, closures, parallelism.

(The ocamlp3l project has been partially funded by the Galileo bilateral France-Italy project. Further information on the ocamlp3l, Ocaml and p3l projects can be found at the URLs http://www.di.unipi.it/~marcod/ocamlp3l/ocamlp3l.html, http://pauillac.inria.fr/ocaml and http://www.di.unipi.it/~susanna/p3l.html.)

Introduction

Functional programming languages have greatly improved since their appearance in the sixties: the performance achieved in functional program execution is now comparable to the performance achieved using classical imperative programming languages. In addition, modern functional programming languages have advanced features like strong typing and advanced module systems that improve both programmer productivity and code quality and maintainability.

But what about parallel functional programming? Since the very beginning it has been claimed that functional programming languages are implicitly parallel, mainly due to the possibility of using eager evaluation strategies for all the strict functions appearing in a program. Eager evaluation strategies, in conjunction with the Church-Rosser property sported by pure functional languages, allow functional language compilers or interpreters to schedule in parallel the evaluation of all the arguments of a strict function call. Both the possibility of automatically exploiting this kind of implicit parallelism (through parallel graph reduction or other techniques) and the possibility of providing the user with ways to annotate (somehow) functional programs in order to drive parallelism exploitation have been explored, with different results.

After Cole's work on skeletons, a new research track has been initiated, concerning parallel functional programming with skeletons. As an example, Bratvold showed how skeletons can be looked for and exploited within an ML dialect, and Darlington's group at Imperial College in London started considering plain functional skeletons and came up with the definition of a skeleton functional coordination language.

Some of the authors of this paper developed, at the University of Pisa, a skeleton parallel programming language, p3l. p3l is actually an imperative programming language, although the skeleton framework used is completely functional. The p3l compiler uses an original, template-based parallelism exploitation technique, achieving very good performance as well as programmability and portability features. The authors started looking at the possibility of exporting the template-based skeleton parallelism exploitation techniques to the functional world.

In the meanwhile, the Objective Caml functional programming language has been developed at INRIA Rocquencourt, in France. Objective Caml (ocaml in the sequel) is a functional language of the ML family. It supports both functions as first-class values and full imperative features, in particular arrays modifiable in place. This combination of features makes it well adapted to skeleton-based programming: higher-order functions (functions taking user-provided functions as arguments) can be suitably used to model and implement skeletons, while the imperative features can be exploited to provide a parallel implementation of skeletons. Other useful features of ocaml include a powerful module system, allowing several implementations of the skeletons to be substituted for one another without changing the user code, and a built-in marshaler, allowing transmission of arbitrary data structures over byte streams, based on the same structural information used by the ocaml garbage collector.

The goal of the authors at the beginning of the ocamlp3l project was to assess the merits and the feasibility of integrating the p3l language inside ocaml. This gave ocamlp3l, a programming environment that allows writing parallel programs in ocaml according to the skeleton model supported by the parallel language p3l. It provides seamless integration of parallel programming and functional programming, and advanced features like sequential logical debugging of parallel programs and strong typing, useful both in teaching parallel programming and in building full-scale applications. In addition, we wanted the skeleton implementation to run on widely available computer systems.

During the implementation of the system, it turned out that we could get more than that. In our implementation the user code containing the skeleton expressions can be linked with different modules in order to get either a parallel executable, running on a network of workstations, or a sequential executable, running on a single workstation. Therefore, users can perform logical program debugging using the sequential version of the program and then move to the parallel version by just changing a linker option.

The high level of abstraction provided by functional programming, coupled with the ability to send closures over an I/O channel, provided the key to an elementary and robust runtime system that consists of a very limited number of lines of code.

Skeletons

The skeleton parallel programming model supports structured parallel programming. Using this model, the parallel structure/behavior of an application has to be expressed by using skeletons picked up from a set of predefined ones, possibly in a nested way. Each skeleton models a typical pattern of parallel computation and it is parametric in the computation performed in parallel. Skeletons can be understood as second order functionals that model the parallel computation coming out from the application of a given parallelism exploitation pattern to the parameter functions.

As an example, a common skeleton is the pipeline one. Pipeline is a stream parallel skeleton: the parallelism exploited comes from performing in parallel computations relative to different input data sets appearing on the input data stream. In particular, the pipeline skeleton models the parallelism coming from the parallel computation of the different stages of a function onto different input data items appearing on the input data stream. In other words, pipeline(f_1, ..., f_n) computes the function f(x) = f_n(f_{n-1}(... f_1(x) ...)) over all the data items x_1, ..., x_k appearing on the input data stream. Parallelism comes from computing in parallel f_1(x_i), f_2(f_1(x_{i-1})), ..., f_n(f_{n-1}(... f_1(x_{i-n+1}) ...)).

In general, a skeleton programming model provides a set S of skeletons that the programmer can use to model the parallel behavior of an application. Simpler skeleton models, such as the original models by Cole and Darlington, do not allow skeleton composition. Therefore the programmer must simply pick up the best skeleton from the set S and then provide the function parameters as sequential code. As an example, a program exploiting pipeline parallelism can be expressed by using the pipeline skeleton and by providing sequential code for the f_i appearing in the pipeline call.

Other skeleton models, such as p3l, allow full skeleton composition. In this case, a sequential skeleton is included in S and the function parameters of any other skeleton must also be skeletons. Therefore, programmers can either provide the function parameters of a pipeline by using the sequential skeleton to encapsulate sequential code, or use other skeletons, including the pipeline one, to express any of the pipeline stages.

In ocamlp3l we allow skeletons to be arbitrarily composed and we provide the user with both stream parallel and data parallel skeletons, data parallelism being the one coming from performing in parallel the computations relative to a single input data item. The skeleton set of ocamlp3l contains most of the skeletons appearing in p3l, although some of the ocamlp3l skeletons are actually simplified versions of those appearing in p3l. Here we will denote by f the function computed by skeleton F on a single data item of its input data stream.

Farm skeleton. The farm skeleton, written farm, computes in parallel a function f over the different data items appearing in its input stream. From a functional viewpoint, given a stream of data items x_1, ..., x_n, farm(F) computes f(x_1), ..., f(x_n), F being the skeleton parameter. Parallelism is exploited by having k independent, parallel processes computing f on different items of the input stream.

Pipeline skeleton. The pipeline skeleton, denoted by the infix operator |||, performs in parallel the computations relative to different stages of a function over different data items of the input stream. Functionally, F_1 ||| F_2 ||| ... ||| F_n computes f_n(... f_2(f_1(x_i)) ...) over all the data items x_i belonging to the input stream. Parallelism is exploited by having n independent, parallel processes: each process computes a function f_i over the data items produced by the process computing f_{i-1} and delivers its results to the process computing f_{i+1}.

Map skeleton. The map skeleton, written mapvector, computes in parallel a function over all the data items of a vector, generating a new vector. Therefore, for each vector X appearing in the input data stream, mapvector(F, n) computes the function f over all the items of the vector, using n different, parallel processes computing f over distinct vector items.

Reduce skeleton. The reduce skeleton, written reducevector, folds a binary, associative function over all the data items of a vector. Therefore, reducevector(F, n) computes x_1 f x_2 f ... f x_n out of the vector x_1, ..., x_n, for each one of the vectors appearing on the input data stream. The computation is performed using n different, parallel processes computing f.

We also included a sequential skeleton, written seq, that is only used to transform a plain function into a skeleton, in such a way that sequential code can be used as a skeleton parameter. It is worthwhile to point out that the choice of the skeletons of ocamlp3l pretends to be neither complete nor definitive. Our goal was to investigate the feasibility of integrating the skeleton programming model within a functional programming model; therefore, new skeletons can be added to ocamlp3l in the future.

An example

In order to understand how an ocamlp3l application can be written, let us suppose that we want to develop a parallel application plotting the Mandelbrot set. Provided that we have defined ocaml functions:

- computing the color of a pixel in the set out of its coordinates,
- displaying a color pixel of given coordinates on the display,
- opening the graphic display,
- closing the graphic display after a user acknowledge,
- generating a pixel coordinate record each time the function is called and raising an End_of_file exception when there are no more pixels to generate (this function keeps its state in a global, mutable variable),

then we can write the application as follows:

(* link skeleton implementation code *)
open Parp3l
open Nodecode
open Template
(* suitable plain ocaml sequential code *)
let colorpixel coords = ...
let displaypixel (coords, col) = ...
let opendisplay () = ...
let closedisplay () = ...
let generateapixel () = ...
let dummyfun () = ...
(* set up the parallel program structure *)
let mandelprogram =
  startstop
    (generateapixel, dummyfun)
    (displaypixel, opendisplay, closedisplay)
    (farm (seq colorpixel, ...))
in pardo mandelprogram;;

where:

- the open directives just tell the ocaml compiler to compile and link other ocaml modules (see the following sections);
- the startstop expression sets up a framework suitable for generating an input stream to the skeleton program (first parameter couple: the first function generates each item of the input stream, the second function initializes the input stream), for processing the output stream produced by the skeleton program (second parameter tuple: the first item is the function processing each output stream item, the second and the third are used to initialize and finalize the output stream handling process) and for evaluating a given skeleton program (the third parameter of startstop provides the actual skeleton program);
- the pardo call initiates the program evaluation according to the semantics defined by the modules included with the open directives. In the case of the code shown above, the farm skeleton is evaluated in parallel, setting up a network of independent processes computing the color of the pixels, plus a scheduler process and a process collecting and displaying colored pixels on the screen. In case the user included the sequential skeleton implementation module with the directive open Seqp3l instead, the farm skeleton is computed within a single, sequential process.

ocamlp3l implementation

We implemented a prototype compiler generating executable code out of the ocamlp3l source code. The prototype compiler is entirely written in ocaml and is made out of a set of modules. Depending on the set of modules linked with the user code, the compiler either produces code that can be run on a single workstation or code that can be run in parallel onto a network of workstations. In both cases, the code can either be ocaml bytecode or native code (see Figure 1).

[Figure 1: Overall design of the ocamlp3l prototype compiler. The same ocamlp3l source code, linked with either the sequential semantics module or the parallel semantics module, yields a functional application for code debugging or a high performance parallel executable.]

In case of sequential execution, the modules linked to the user code contain second order functions for each one of the skeletons used. These functions actually implement the sequential semantics of the skeletons, i.e. they sequentially compute the result of applying the skeleton to the parameters (see Figure 2, left).

In case of parallel execution, the modules linked to the user code contain second order functions for each one of the skeletons that build an abstract skeleton tree out of the user code. The skeleton tree is traversed and a distributed process network is logically assigned for the execution of each skeleton; therefore, a logical process graph is obtained. Then this graph is traversed and logical channels (actually Unix sockets) are assigned as the input/output channels of each one of the processes. Finally, a closure is derived for each one of the processes in the graph. These closures are computed by looking at the functions stored in the template library; each one of these functions models the behavior of one of the processes implementing a skeleton (e.g. there are template process functions modeling a pipeline stage, a farm worker, a farm emitter process, etc.). When the parallel program is eventually run onto a workstation network, these closures are used to specialize each one of the processes executed on a workstation, in such a way that the collection of processes participating in the program execution actually implements the logical process network derived from the skeleton tree (see Figure 2, right).
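The skeleton tree mentioned here can be pictured as an ordinary ocaml variant type. The declaration below is only our sketch of the idea: the constructor names and the information recorded at each node are illustrative assumptions, not the actual ocamlp3l data structures, and typing the tree with a single item type 'a is a simplification.

(* Hypothetical shape of the skeleton tree built from the user's skeleton    *)
(* expressions. A later pass assigns to every node the processes that        *)
(* implement it and the Unix socket channels connecting them.                *)
type 'a skel_tree =
  | Seq of ('a -> 'a)                   (* sequential user code               *)
  | Farm of 'a skel_tree * int          (* worker skeleton, parallelism degree *)
  | Pipe of 'a skel_tree list           (* pipeline stages                    *)
  | Mapvector of 'a skel_tree * int     (* per-item skeleton, degree          *)
  | Reducevector of 'a skel_tree * int  (* binary operation skeleton, degree  *)

(* Once assigned, a process is denoted by a unique identifier together with  *)
(* the numbers of its input and output channels.                             *)
type process_node = { id : int; inputs : int list; outputs : int list }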

More details on the ocamlp3l compiler can be found in the ocamlp3l technical report; in the following sections we will discuss the main features of the prototype compiler.

[Figure 2: Compiler sketch: sequential code generation (left), parallel code generation (right). On the left, the user code (usercode.ml, e.g. let prog = farm(seq(f),5) ||| farm(seq(g),3);; ... pardo(...prog...);;) is linked with seqp3l.ml, whose definitions (let (|||) f g = fun x -> g(f x);; let farm (f,n) = f;; let seq f = f;;) yield bytecode or native code computing the result of the skeleton program sequentially. On the right, the same user code is linked with parp3l.ml, template.ml and nodecode.ml, which generate the skeleton tree, associate channel/process information to each node according to the template library, and generate the code to be rsh-ed on the network for parallel execution: each process opens its sockets, waits for its closure, computes it until end-of-input, then shuts down.]

Closure passing as distributed higher order parameterization

In order to implement parallel skeleton execution, we chose to use an SPMD (Single Program Multiple Data) approach: all the nodes of the network run the same program, namely a process template interpreter. Each one of the nodes is specialized, by information provided by a special root node during the initialization process, with the process template code that it must run. The specialization information includes the closure of the function computed by the node as well as the communication channels the process must use. Overall, the specialized node collection implements a process network realizing the parallel execution of the skeletons provided in the user code.

Therefore, the root node performs the following tasks:

- it executes the skeleton expressions of the user program; as a consequence, a structure describing the process network is built, which is used to compute the configuration information for each node in the process network;
- it maps virtual nodes onto the pool of available machines;
- it initializes a socket connection with all the participating nodes;
- it gets a local port address from each of them;
- it sends out to each node the addresses of its connected neighbors (these three steps provide an implementation of a centralized, deadlock free algorithm interconnecting the other nodes according to the process network specified by the skeleton expression);
- it sends out the specialization information (the function it must perform) to each node.

This very last task requires a sophisticated operation: sending a function (or a closure) over a communication channel. The next section discusses the implementation of this operation.

On the other side, all the nodes different from the root one simply wait for a connection to come in from the root node, then send out the address of the local socket port they allocate for further communication, wait for the list of neighbors and for the specialization function, and then simply perform the specialization function until termination.
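The behavior of a non-root node can thus be summarized by a small driver loop. The sketch below is ours: it abstracts away the real wire protocol (port negotiation, neighbor lists, error handling) and only shows the "accept a connection, receive a closure, run it until termination" structure; the function name is invented for the example.

(* SPMD worker driver: every node runs this same code and is specialized at  *)
(* run time by the closure received from the root node.                      *)
let node_server port =
  let sock = Unix.socket Unix.PF_INET Unix.SOCK_STREAM 0 in
  Unix.bind sock (Unix.ADDR_INET (Unix.inet_addr_any, port));
  Unix.listen sock 1;
  let (conn, _) = Unix.accept sock in                (* wait for the root node *)
  let ic = Unix.in_channel_of_descr conn in
  (* receive the specialization closure (see the next section on marshaling) *)
  let (behave : unit -> unit) = Marshal.from_channel ic in
  behave ();                                      (* compute until termination *)
  Unix.close conn;
  Unix.close sock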

Marshaling of function closures

Most garbage-collected languages provide a built-in marshaler (also called serialization) that takes any data structure and encodes it as a sequence of bytes suitable for transmission over a communication channel; the symmetric unmarshaling primitive then rebuilds the data structure at the receiving end. Marshaling exploits the same run-time type information that guides the garbage collector.

In functional languages, marshaling is often restricted to pure data structures that do not contain any function closures. Marshaling function closures is delicate because it is unclear how to transmit the code pointers embedded in the closures. In distributed programming, the usual solution is not to transmit the code of functions, but to rebuild a proxy function on the client side; the proxy transparently sends its argument back to the originating site, where the function application is evaluated, and its result is sent back to the caller. This is the traditional remote procedure call, or network objects. However, this solution is not appropriate for us, as it prevents any parallelism: all function evaluations are performed sequentially on the originating site.

To allow parallel evaluation in a general distributed setting, actual code mobility is required: the code of the function must be transmitted on the wire. This puts strong constraints on the code, which must be fully position-independent and relocatable. In particular, all branches must be PC-relative and references to global variables must be avoided by putting the global variables in the function closure. Some bytecodes have been designed that fulfill these requirements. However, neither the ocaml bytecode nor the native code generated by ocamlopt is fully relocatable. Workarounds involving sending relocation information along with the code have been proposed, but are quite complex. The size of the transmitted code is also a concern in many applications.

For ocamlp3l, a much simpler solution exists. We are writing SPMD applications instead of general distributed applications: the same program text is running on all communicating machines. The application is statically linked and we disallow dynamic loading of object files. Hence we can simply send a code address on the wire (as an offset from the beginning of the code section), without sending the piece of code pointed to by the address, and be certain that this code address will point to the correct piece of code in the receiver. To guard against code mismatches, the sender transmits an MD5 checksum of its own code along with the code address: this allows the receiver to check quickly that it is running the same code as the sender. (This check proved very useful to guard against inconsistencies when accessing the same NFS-mounted file system from different workstations.) We extended the standard ocaml marshaler to support this form of closure sharing; the changes are now integrated in the current ocaml distribution.

Like all code mobility schemes, our marshaling of closures requires that the sender and the receiver run the same instruction set. This is no problem if the application is compiled with the ocaml bytecode compiler, since the bytecode is platform-independent. If the application is compiled with the native-code compiler, communication is restricted to machines having the same processor architecture. In other terms, we restrict the SPMD paradigm to "single binary executable program, multiple data".

To summarize, the possibility of sending closures in the implementation allowed us to obtain a form of higher order distributed parameterization that keeps the runtime code to a minimum size (the source code of the full system is less than a few Kbytes).
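In current ocaml distributions this facility is exposed through the Closures flag of the standard Marshal module. The snippet below shows the general usage pattern; the channel plumbing is our own example and, as explained above, it only works when sender and receiver are instances of exactly the same executable.

(* Sending a closure to a peer running the same (statically linked) program: *)
(* with the Closures flag the marshaler writes code pointers plus a checksum  *)
(* of the code, instead of the code itself.                                   *)
let send_closure (oc : out_channel) (f : int -> int) =
  Marshal.to_channel oc f [Marshal.Closures];
  flush oc

let receive_closure (ic : in_channel) : int -> int =
  Marshal.from_channel ic

(* Example: call send_closure oc (fun x -> x * x) on the root node, then     *)
(* apply (receive_closure ic) 3 on the remote node.                          *)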

Communication and process support

According to the initial goals of the ocamlp3l project, we looked for a simple, portable and reliable communication system. We were interested in coming up with a solution that can actually be used on widely available, low cost systems. Eventually, we chose to use plain TCP/IP sockets. First of all, this choice allows both the Unix world and the Windows world to be addressed. Second, no particular customization of the support is needed to match the ocamlp3l features. Finally, the point-to-point, connection-oriented stream model provided by Unix sockets is perfect to model the data streams of ocamlp3l. On the down side, the adoption of Unix sockets presents an evident disadvantage, which is the low performance achieved in communications. This disadvantage, however, simply implies that parallelism can be usefully exploited only when the communication/computation ratio is small; it does not affect the overall features of the prototype.

As far as the process model is concerned, all we need is a mechanism allowing an instance of the template interpreter to be run onto different workstations belonging to a local area network. The Unix rsh mechanism matches this requirement, and a similar mechanism can be used within the Windows environment. Note that, as processes are generated and run on different machines just at the beginning of the ocamlp3l program execution, any consideration about the performance of rsh-ing processes is irrelevant.
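As a concrete illustration, the root node could start one template interpreter per machine along the following lines; the executable path is purely hypothetical.

(* Spawn one template interpreter instance on a remote host via rsh.        *)
let spawn_interpreter (host : string) =
  Unix.create_process "rsh"
    [| "rsh"; host; "/usr/local/ocamlp3l/interpreter" |]
    Unix.stdin Unix.stdout Unix.stderr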

As an alternative to TCP/IP sockets, we are also considering implementing ocamlp3l on top of the MPI library for message passing in distributed-memory multiprocessors. On networks of workstations, MPI is implemented on top of TCP/IP sockets and therefore has no advantages over our hand-written implementation. However, vendors of distributed-memory supercomputers (such as the Cray T3E) and of SGI's and Digital's clusters of multiprocessors provide implementations of MPI that take advantage of the fast custom interconnection networks of those machines; ocamlp3l over MPI would thus benefit from this custom communication hardware without sacrificing portability. Another interesting feature of MPI is the group communication primitives (broadcast, scatter, gather, reduce) that can be implemented very efficiently on certain communication topologies.

Template implementation

Within ocamlp3l, the skeleton instances of a program are implemented by using implementation templates, i.e. process networks implementing the parallel semantics of the skeleton, following the approach adopted within the p3l compiler. Figure 3 shows the implementation templates used. Each white circle represents a whole process network computing a skeleton; in case the skeleton is seq(F), the network has a single process computing the sequential code of the function f over all the data items appearing on the input stream. Each black circle represents a process generated by the ocamlp3l implementation, aimed at either coordinating the parallel computation or at merging/splitting the data streams according to the skeleton semantics. Finally, arrows represent the communication channels (hosted by Unix sockets) implementing the data streams.

Within ocamlp3l, these implementation templates can be nested to any depth, reflecting the full composability of the skeletons provided to the programmer. Each template appearing in ocamlp3l:

- is parametric in the parallelism degree exploited. As an example, the farm template may accommodate any number of worker processes;
- is parametric in the skeleton computed as the body of the skeleton. As an example, the farm template is specialized by the skeleton representing the farm worker;
- provides a set of process templates, i.e. parametric process specifications that can be instantiated to get the real process code building out each one of the processes participating in the parallel evaluation of an ocamlp3l program. Such process templates are provided as functions in one of the library files (template.ml) of ocamlp3l. Parameters to a process template include the input and output stream specifications as well as the user functions to be computed on the items appearing on the input stream in order to get the data items that have to be delivered onto the output stream.

[Figure 3: Templates used to implement the ocamlp3l skeletons. The pipeline template chains the stage networks f1, ..., fn; the farm template adds a process scheduling data items to the workers and a process collecting and delivering their results; the map template adds processes decomposing each vector, scheduling the items to the workers and rebuilding the result vector; the reduce template schedules pairs (either from the original vector or from the feedback of partial results) to the workers and delivers the complete results.]
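To give the flavor of what such a process template can look like, here is a drastically simplified farm emitter/worker pair written against ordinary ocaml channels. The real templates work on the socket-based streams set up by the compiler and handle many more details (end-of-stream propagation, result collection, ordering), so the code below is only our illustration of the idea.

(* Simplified farm process templates. The emitter schedules tasks to the     *)
(* workers in round-robin order; each worker applies the user function f to  *)
(* every task it receives. None is used as the end-of-stream mark.           *)
let farm_emitter (next_task : unit -> 'a option) (workers : out_channel array) =
  let n = Array.length workers in
  let i = ref 0 in
  let rec loop () =
    match next_task () with
    | None ->
        Array.iter (fun oc -> Marshal.to_channel oc None []; flush oc) workers
    | Some task ->
        let oc = workers.(!i mod n) in
        Marshal.to_channel oc (Some task) []; flush oc;
        incr i; loop ()
  in loop ()

let farm_worker (f : 'a -> 'b) (ic : in_channel) (oc : out_channel) =
  let rec loop () =
    match (Marshal.from_channel ic : 'a option) with
    | None -> ()                                   (* end of the input stream *)
    | Some task -> Marshal.to_channel oc (f task) []; flush oc; loop ()
  in loop ()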

Template based compilation

The whole compilation process transforming an ocamlp3l skeleton program into the parallel process network implementing the program can be summarized in three steps (see Figure 2, right):

- first, the skeleton code is transformed into a skeleton tree data structure, recording all the significant details of the skeleton nesting supplied by the user code;
- then, the skeleton tree is traversed and processes are assigned to each skeleton according to the implementation templates. During this phase, processes are denoted by their input/output channels, identified via a unique number;
- finally, once the number and the kind of the parallel processes building out the skeleton code implementation are known, code is generated that either delivers the proper closures, derived by using the process templates, to the template interpreter instances running on distinct workstations (root node), or waits for a closure and repeatedly computes this closure on the proper input and output channels until an EndOfFile mark is received (non-root nodes). Closures are sent to the various template interpreter nodes in a round-robin way; this policy will be changed in the next versions of ocamlp3l in order to achieve a better load balancing.

Program development with ocamlp3l

In order to develop a new parallel application using ocamlp3l, the user is expected to perform the following steps:

- develop skeleton code modeling the application at hand. This just requires a full understanding of the skeleton semantics, and usually allows the user to reuse portions of existing applications written in plain ocaml;
- test the functionality of the new application by supplying relevant input data items and looking at the results computed using the sequential skeleton semantics. In case of problems, the user may run the sequential debugging tools to overcome the problem;
- link the parallel skeleton semantics module and run the application onto the workstation network. Provided that the application was sequentially correct, no new errors will be found at this step (assuming the runtime support is correct);
- look at the performance results achieved by running the application on the number of processing nodes available, and possibly modify the program, either by adjusting the significant performance parameters (such as the parallelism degree of the farm, mapvector and reducevector skeletons) or by changing the skeleton structure used to exploit parallelism (insert farms, split pipeline stages, etc.).

Therefore, the ocamlp3l user developing a parallel application is not involved in any of the error prone, boring and heavy activities usually related to parallel programming: process management, explicit communication handling, buffer management, etc. These details are completely hidden within the compiler libraries and code.

Performance debugging (or tuning) is the activity a user is supposed to perform in order to get the last MIPS out of his code running on the parallel machine at hand. In order to perform performance debugging, the programmer must look out for bottlenecks in the computation of his program and take actions aimed at removing such bottlenecks. An ocamlp3l programmer may look at the times spent in the computation of the different stages of his skeleton program and try to understand whether some stage behaves as a bottleneck. Once the stage has been individuated, he can perform different actions:

- if the stage is already parallel (e.g. it is a farm), the user can augment the parallelism degree of the stage (e.g. put a bigger int as the second parameter of the farm call);
- if the stage is not parallel, the user can figure out strategies to parallelize it by using the proper skeletons.
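With the Mandelbrot code shown earlier, for instance, both kinds of adjustment are one-line changes to the skeleton expression. The parallelism degrees below are arbitrary and the extra stage name is made up for the example.

(* Enlarge the parallelism degree of an existing farm stage ...              *)
let mandelskel_v1 = farm (seq colorpixel, 4)
let mandelskel_v2 = farm (seq colorpixel, 16)
(* ... or parallelize a stage that used to be sequential inside a pipeline.  *)
let mandelskel_v3 = seq preprocess ||| farm (seq colorpixel, 16)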

In both cases, the user is only asked to modify the parallelization strategies of the program, either in the quality or in the quantity of the parallelism exploited. He is not asked to modify process or communication code, which is what normally happens when using other parallel programming approaches.

Preliminary results

The preliminary results we obtained concern both programmability and performance.

[Figure 4: Bytecode application scalability (farm/map template): measured elapsed time (secs) versus number of PEs, for three differently sized datasets, together with the ideal scalability curves.]

[Figure 5: Native code application scalability (farm/map template): measured elapsed time (secs) versus number of PEs, for three differently sized datasets, together with the ideal scalability curves.]

Programmability. Parallel applications can be easily developed in ocamlp3l. Once a programmer has a clear idea of the skeletons available, it takes a very short time to transform a sequential code into a skeleton program. The times involved in designing a parallel application when the programmer has explicitly to deal with communications, synchronization and process handling are orders of magnitude higher. The major activity a user has to perform when developing ocamlp3l applications is performance debugging/tuning. However, changing either the parallel structure or the parallelism degree of an ocamlp3l application is a very simple and fast task; therefore, the user is both encouraged to experiment with different parallel structures of his application and supported in the performance tuning activity. Overall, we experienced times of minutes to transform sequential ocaml code into running parallel ocamlp3l applications, and times of minutes to hours to perform performance tuning of an ocamlp3l application.

[Figure 6: Scalability (farm/map template), native code, low communication/computation ratio: measured elapsed time (secs) versus number of PEs, together with the ideal scalability curve.]

We have developed several test applications in order to

co de low communicationcomputation ratio

validate the ocamlpl prototype and to evaluate its p er

formance gures These applications include wellknown

applications such as Mandelbrot set computation and ma

trix multiplication as well as more complex applications in

the same applications compiled to native co de Scalability

particular a protein folding application currently developed

is close to ideal one ie the sp eedup is almost linear al

by a research team at the University of La Plata in Ar

though in the bytecode runs the measured times are closer

gentina This application computes the threedimensional

to ideal ones than in native co de runs This is due to the fact

folding shap e of protein chains

that in the second case lo cal computations are p erformed

faster than in the rst case while interprocess communi

Performance Preliminary results concerning p erfor

cation always take the same time therefore communica

mance are encouraging We measured actual sp eedups when

tioncomputation ratio is higher However if we decrease

running the parallel co de on small workstation networks

the communicationcomputation ratio by augmenting the

These sp eedups turned out to b e almost linear when the

data set size we get closer idealmeasured curves even in

communicationcomputation ratio of each pro cess b elong

case of native co de compiled applications see Figure rel

ing to the pro cess network implementing the ocamlpl pro

ative to the same application of Figures and run onto

gram was suitably small Figure and show the scala

a larger dataset All these Figures plot completion times

bility of a very simple application just exploiting a farm

relative to application run on a network of up to four ho

skeleton Both Figures plot application completion time in

mogeneous same pro cessor a Mhz Pentium I I same

seconds relative to the same application op erating on dif

memory cache and disk size and sp eed etc machines

ferently sized datasets the upp er curves referring to bigger

We also measured the introduced by ocamlpl

datasets Figure plots completion time of application

mo dules First of all we run a simple farm application com

co de compiled to bytecode whereas Figure is relative to 70 2.45 "farm_worker_3pe.dat" "excess_25-6-12-24-bytecode.dat" "excess_50-6-12-24-native.dat" 65 2.4

60 2.35 55

2.3 50

45 2.25

40

Elapsed time (secs) Elapsed time (secs) 2.2

35 2.15 30

2.1 25

20 2.05 2 4 6 8 10 12 14 1 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3

Number of workers (running on 3 PEs) Number of interpreters (on 3 PEs)

Figure Instances of the template interpreter p er no de vs

Figure Number of workers in a farmmap template vs

completion time native and bytecode

completion time native co de

We also measured the overhead introduced by the ocamlp3l modules. First of all, we ran a simple farm application compiled by linking the parallel semantics module, with a parallelism degree specified in the farm code. Then we ran the same application compiled by linking the sequential semantics module. Finally, we ran a plain ocaml application computing the same algorithm and using a list to emulate the task stream. All three applications were run onto the same machine. We observed that in the bytecode runs the parallel semantics module introduced an overhead of roughly ...%, while in the native code runs it introduced a ...% overhead, with respect to the runs using the sequential module. No sensible overhead was introduced by the sequential semantics module with respect to the hand-written ML application code.

Finally, we evaluated the effect of "excess parallelism" on completion time. We considered a simple application just exploiting farm parallelism and we measured the completion time in two cases:

- varying the parallelism degree of the farm template, keeping the number of PEs used to execute the application constant (Figure 7);
- varying the number of instances of the template interpreter process per PE, keeping the number of PEs used to execute the application constant (Figure 8).

In the former case, the excess parallelism does not increase the application completion time, and slightly smaller completion times have been measured (the gain is below ...%, however). In the latter case, we observed a better behavior: the maximum decrease in completion time was around ...%.

that apparently lo oks like to b e close to our work The mes

In the former case the excess parallelism do es not in

sage passing is p erformed by interfacing the MPI library

crease the application completion time and slightly smaller

with ocaml rather than using so ckets and the skeletons

completion times have b een measured the gain is b elow

taken into account are slightly dierent from ours How

however In the latter case we observed a b etter b ehavior

ever the imp ortant dierence is that these skeletons cannot

the maximum decrease in completion time was around

b e nested On the one hand this allows to implement the

The p erformance results discussed here are relative to a

skeletons by a simple library directly calling MPI On the

very small workstation network and to simple application

other hand the expressive p ower of the language is much

co de Indeed we actually used in this applications the tem

lower than the expressive p ower of ocamlpl

plate which implements farm map and reduce skeletons ie

the most critical one and we fully exp erimented the whole

Related work

Many researchers are currently working on skeletons and most of them are building some kind of parallel implementation.

In particular, Darlington's group at Imperial College in London is actively working on skeletons. They have explored the problems relative to implementing a skeleton programming system, but the approach taken uses an imperative language as the implementation language, at least for the code implementing the processes running on the parallel machine nodes. Bratvold takes into account plain ML programs and looks for skeletons within them, compiling these skeletons by using process networks that look like implementation templates; however, both the final target language and the implementation language are imperative. Finally, Serot presents an embedding of skeletons within ocaml that apparently looks close to our work. The message passing is performed by interfacing the MPI library with ocaml, rather than using sockets, and the skeletons taken into account are slightly different from ours. However, the important difference is that these skeletons cannot be nested. On the one hand, this allows the skeletons to be implemented by a simple library directly calling MPI; on the other hand, the expressive power of the language is much lower than the expressive power of ocamlp3l.

Conclusions

In this paper we showed how a skeleton parallel programming model, such as the one provided by p3l, can be successfully merged with a functional programming environment such as the one provided by ocaml. In particular, we discussed how skeletons can be embedded within ocaml as second order functions and how modules implementing both the sequential and the parallel skeleton semantics can be supplied that allow users to write, functionally debug and run in parallel skeleton applications, using ocaml to express sequential computations and data types. The whole process preserved the strong typing properties of ocaml.

At the moment, the prototype ocamlp3l implementation runs under Linux and uses sockets to implement communication, and the preliminary results show that the embedding of skeletons within the ocaml programming environment is feasible and effective. In the near future we want to adopt a more efficient communication layer, possibly by using MPI instead of the Unix socket library. At the same time, we are porting the system onto the ubiquitous Windows systems, for didactic purposes. We also wish to extend the set of skeletons supported, in particular with data-parallel skeletons operating on matrices. Finally, we are currently trying to restructure the code of ocamlp3l in such a way that new skeletons may be easily added: at the moment, the insertion of a new skeleton requires both modifying some fundamental data structures and adding the proper code to the process templates library and to the parallel and sequential semantics modules.

References

P. Au, J. Darlington, M. Ghanem, Y. Guo, H. To and J. Yang. Co-ordinating heterogeneous parallel computation. In L. Bouge, P. Fraigniaud, A. Mignotte and Y. Robert, editors, Europar. Springer-Verlag.

B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti and M. Vanneschi. P3L: a structured high level programming language and its structured support. Concurrency: Practice and Experience.

B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti and M. Vanneschi. Summarising an experiment in parallel programming language design. In B. Hertzberger and G. Serazzi, editors, High-Performance Computing and Networking, Lecture Notes in Computer Science. Springer-Verlag.

A. Birrell, G. Nelson, S. Owicki and E. Wobber. Network objects. Technical report, DEC Systems Research Center.

G. E. Blelloch and J. Greiner. Parallelism in sequential functional languages. In Functional Programming and Computer Architecture.

T. Bratvold. Skeleton-Based Parallelisation of Functional Programs. PhD thesis, Heriot-Watt University.

L. Cardelli. Amber. In Combinators and Functional Programming Languages, Lecture Notes in Computer Science. Springer-Verlag.

M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computations. Research Monographs in Parallel and Distributed Computing. Pitman.

M. Danelutto, R. Di Cosmo, X. Leroy and S. Pelagatti. OcamlP3l: a functional parallel programming system. Technical report, LIENS, ENS Paris. Available on the Web at http://www.dmi.ens.fr/EDITION/preprints/Index.liens.html.

M. Danelutto, R. Di Meglio, S. Orlando, S. Pelagatti and M. Vanneschi. A methodology for the development and the support of massively parallel programs. Future Generation Computer Systems.

M. Danelutto and S. Pelagatti. Parallel implementation of FP using a template-based approach. In Proceedings of the International Workshop on Implementation of Functional Languages, Nijmegen, The Netherlands.

J. Darlington, A. J. Field, P. Harrison, P. H. J. Kelly, D. W. N. Sharp, Q. Wu and L. While. Parallel programming using skeleton functions. In A. Bode, M. Reeve and G. Wolf, editors, PARLE: Parallel Architectures and Languages Europe, LNCS. Springer-Verlag.

J. Darlington, Y. Guo, H. W. To, Q. Wu, J. Yang and M. Kohler. A uniform functional interface to parallel imperative languages. In Third Parallel Computing Workshop (PCW). Fujitsu Laboratories Ltd.

J. Darlington, Y. Guo, H. W. To and J. Yang. Parallel skeletons for structured composition. In Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM Press.

M. Feeley and J. S. Miller. A parallel virtual machine for efficient Scheme compilation. In Proceedings of the ACM Conference on LISP and Functional Programming, Nice, France.

C. Fournet, G. Gonthier and L. Maranget. The join calculus language. Software documentation and related papers available on the Web, http://join.inria.fr.

P. Hartel and K. Langendoen. Benchmarking implementations of lazy functional languages. In Functional Programming and Computer Architecture, Copenhagen.

P. Hudak. Para-functional programming in Haskell. In B. K. Szymansky, editor, Parallel Functional Languages and Compilers. Addison-Wesley.

S. L. Peyton Jones. Parallel implementations of functional programming languages. The Computer Journal.

F. C. Knabe. Language Support for Mobile Agents. PhD thesis, Carnegie Mellon University.

X. Leroy, J. Vouillon and D. Doligez. The Objective Caml system. Software and documentation available on the Web, http://pauillac.inria.fr/ocaml/.

MPI Forum. Document for a standard message-passing interface. Technical report, University of Tennessee.

A. Ohori and K. Kato. Semantics for communication primitives in a polymorphic language. In Symposium on Principles of Programming Languages. ACM Press.

L. C. Paulson. ML for the Working Programmer. Cambridge University Press.

J. Serot. Embodying parallel functional skeletons: an experimental implementation on top of MPI. In C. Lengauer and M. Griebl, editors, Proceedings of EuroPar Parallel Processing, LNCS. Springer-Verlag.

C. Walinsky and D. Banerjee. A data-parallel FP compiler. Journal of Parallel and Distributed Computing.