This paper appeared in: First International Conference of the Austrian Center for Parallel Computation, Salzburg, Austria. Springer-Verlag, Lecture Notes in Computer Science.

Modula-2* and its Compilation

Michael Philippsen and Walter F. Tichy

Universität Karlsruhe

email: philippsen@ira.uka.de

Abstract

Modula-2*, an extension of Modula-2, is a programming language for writing highly parallel programs in a machine-independent, problem-oriented way. The novel attributes of Modula-2* are that programs are independent of the number of processors, independent of whether memory is shared or distributed, and independent of the control modes (SIMD or MIMD) of a parallel machine.

This article briefly describes Modula-2* and discusses its major advantages over the data-parallel programming model. We also present the principles of translating Modula-2* programs to MIMD and SIMD machines and discuss the lessons learned from our first compiler targeting the Connection Machine. We conclude with important architectural principles required of parallel computers to allow for efficient compiled programs.

Introduction

Highly parallel machines with thousands and tens of thousands of processors are now being manufactured and used commercially. These machines are of rapidly growing importance for high-speed computation. They have also initiated a major shift within Computer Science from the sequential to the parallel computer. One of the major problems we face in the use of these new machines is programmability: how to write, with no more than ordinary effort, programs that bring the raw power of a parallel computer to bear on a problem?

Two major approaches to the programming problem can be distinguished. The first is to automatically parallelize sequential software. Although there is overwhelming economic justification for it, this approach will meet with only limited success in the short to medium term. The goal of automatically producing parallel programs can only, if ever, be achieved by program transformations that start with the problem specification and not with a sequential implementation. In a sequential program, too many opportunities for parallelism have been hidden or eliminated.

The second approach is to write programs that are explicitly parallel. We claim that only minor extensions of existing programming languages are required to express highly parallel programs. Thus, programmers will need only moderate additional training, mainly in the area of parallel algorithms and their analysis. This area, fortunately, is well developed (see, for instance, the textbooks listed in the references). In compiler technology, however, new techniques must be found to map machine-independent programs to existing architectures, while at the same time parallel machine architecture must evolve to efficiently support the features that are required for problem-oriented programming styles.

We take the approach of expressing parallelism explicitly, but in a machine-independent way. In the next section we analyze the problems that plague most parallel programming languages today. The following section then presents Modula-2*, an extension of Modula-2 for the explicit formulation of highly parallel programs. The extension is small and easy to learn, but provides a programming model that is far more general and machine-independent than other proposals. Next, we discuss compilation techniques for targeting MIMD and SIMD machines and report on experience with our first Modula-2* compiler for the Connection Machine. We conclude with properties of parallel machine architectures that would improve the efficiency of high-level parallel programs.

Related Work

Most current programming languages for parallel and highly parallel machines, including *Lisp, C*, MPL, VAL, Sisal, Occam, Ada, FORTRAN, Blaze, Dino, and Kali, suffer from some or all of the following problems:

- Whereas the number of processors of a parallel machine is fixed, the problem size is not. Because most of the known parallel languages do not support the virtual processor concept, the programmer has to write explicit mappings for adapting the process structure of each program to the available processors. This is not only a tedious and repetitive task, but also one that makes programs non-portable.

- Colocating data with the processors that operate upon the data is critical for the performance of parallel machines. Poor colocation results in high communication costs and poor performance. Good colocation is highly dependent on the topology of the communication network and must at present be programmed by hand. It is a primary source of machine dependence.

- All parallel machines provide facilities for inter-process communication, most of them by means of a message passing system. Nearly all parallel languages support only low-level send and get communication commands. Programming communication with these primitives, especially if only nearest-neighbor communication is available, is a time-consuming and error-prone task.

- There are several control modes for parallel machines, including MIMD, SIMD, dataflow, and systolic modes. Any extant parallel language targets exactly one of those control modes. Whatever the choice, it severely limits portability as well as the space of solutions.

Modula-2* provides solutions to the basic problems mentioned above. The language abstracts from the memory organization and from the number of physical processors. The mapping of data to processors is performed by the compiler, optionally supported by high-level directives provided by the programmer. Communication is not directly visible. Instead, reading and writing in a virtually shared address space subsumes communication. (A shared memory, however, is not required.) Parallelism is explicit, and the programmer can choose among synchronous and asynchronous execution modes at any level of granularity. Thus, programs can use SIMD mode where proper synchronization is difficult, or use MIMD mode where synchronization is simple or infrequent. The two modes can even be intermixed freely.

The data-parallel approach discussed in the literature and exemplified in languages such as *Lisp, C*, and MPL is currently quite successful, because it has reduced the machine dependence of parallel programs. Data parallelism extends a synchronous, SIMD model with a global name space, which obviates the need for explicit message passing between processing elements. It also makes the number of virtual processing elements a function of the problem size, rather than a function of the target machine.

The data-parallel approach has three major advantages. It is a natural extension of sequential programming: the only parallel instruction, a synchronous forall statement, is a simple extension of the well-known for statement and is easy to understand. Debugging data-parallel programs is not much more difficult than debugging sequential programs, because there is only a single locus of control, which dramatically simplifies the state space of a program compared to that of an MIMD program with thousands of independent loci of control. Finally, there is a wide range of data-parallel algorithms: most parallel algorithms in textbooks are data-parallel. According to Fox, the large majority of the existing parallel applications he examined fall in the class of synchronous, data-parallel programs. Furthermore, systolic algorithms as well as vector algorithms are special cases of data-parallel algorithms.

But data-parallelism, at least as defined by current languages, has some drawbacks. It is a synchronous model: even if the problem is not amenable to a synchronous solution, there is no escape. In particular, parallel programs that interact with stochastic events are awkward to write and run inefficiently. There is no nested parallelism: once a parallel activity has started, the involved processes cannot start up additional parallel activity. A parallel operation simply cannot expand itself and involve more processes. This property seriously limits parallel searches in irregular search spaces, for example. The effect is that data-parallel programs are strictly bimodal: they alternate between a sequential and a parallel mode, where the maximal degree of parallelism is fixed once the parallel mode is entered. To change the degree of parallelism, the program first has to stop all parallel activity and return to the sequential mode. Finally, the use of procedures to structure a parallel program in a top-down fashion is severely limited. The problem here is that it is not possible to call a procedure in parallel mode when the procedure itself invokes parallel operations; this is a consequence of the lack of nested parallelism. Procedures cannot allocate local data and spawn data-parallel operations on it unless they are called from a sequential program. Thus, procedures can only be used in about half of the cases where they would be desirable. They also force the use of global data structures on the programmer.

When designing Modula-2*, we wanted to preserve the main advantages of data-parallel languages while avoiding the above drawbacks. The following list contains the main advances of Modula-2* over data-parallel languages:

- The programming model of Modula-2* is a strict superset of data-parallelism. It allows both synchronous and asynchronous parallel programs.

- Modula-2* is problem-oriented in the sense that the programmer can choose the degree of parallelism and mix the control mode (SIMD-like or MIMD-like) as needed by the intended algorithm.

- Parallelism may be nested at any level.

- Procedures may be called from sequential or parallel contexts and can generate parallel activity without any restrictions.

- Modula-2* is translatable effectively for both SIMD and MIMD architectures.

The Language Modula-2*

Modula-2 has been chosen as the base for a parallel language because of its simplicity. There is no reason why similar extensions could not be added to other imperative languages such as FORTRAN or Ada. The necessary extensions were surprisingly small. They consist of synchronous and asynchronous versions of a forall statement, plus simple, optional declarations for mapping array data onto processors in a machine-independent fashion. An interconnection network is not directly visible in the language. We assume a shared address space among all processors, though not necessarily shared memory. There are no explicit message passing instructions; instead, reading and writing locations in the shared address space subsume message passing. This approach simplifies programming dramatically and assures network independence of programs. The burden of distinguishing between local and non-local references, and of substituting explicit message passing code for the latter, is placed on an optimizing compiler. The programmer can influence the distribution of data with a few simple declarations, but these are only hints to the compiler with no effect on the semantics of the program whatsoever.

Overview of the forall statement

The forall statement creates a set of processes that execute in parallel. In the asynchronous form, the individual processes operate concurrently and are joined at the end of the forall statement: the asynchronous forall simply terminates when the last of the created processes terminates. In the synchronous form, the processes created by the forall operate in unison until they reach a branch-point, such as an if or case statement. At branch-points, the set of processes partitions into two or more subsets. Processes within a single subset continue to operate in unison, but the subsets are not synchronized with respect to each other. Thus, the union of the subsets operates in MSIMD mode (Multiple SIMD: few, but more than one, instruction streams operate on many data streams; a compromise between SIMD and MIMD). A statement causing a partition into subsets terminates when all its subsets terminate, at which point the subsets rejoin to continue with the following statement.

Variants of both the synchronous and the asynchronous form of the forall statement have been introduced by previously proposed languages such as Blaze, C*, Occam, Sisal, VAL, *Lisp, and others. Note also that vector instructions are simple instances of the synchronous forall.

None of the languages mentioned above includes both forms of the forall statement, even though both are necessary for writing readable and portable parallel programs. The synchronous form is often easier to handle than the asynchronous form because it avoids synchronization hazards. However, the synchronous form may be overly constraining and may lead to poor machine utilization. The combination of synchronous and asynchronous forms in Modula-2* actually permits the full range of parallel programming styles between SIMD and MIMD.

The syntax of the forall is as follows (we use the EBNF syntax notation of the Modula-2 language definition, with keywords in upper case, "|" denoting alternation, "[ ]" optionality, and "( )" grouping of the enclosed sentential forms):

  FORALL ident ":" SimpleType IN (PARALLEL | SYNC)
    StatementSequence
  END

The identifier introduced by the forall statement is local to the statement and serves as a runtime constant for every process created by the forall. SimpleType is an enumeration or a subrange. The forall creates as many processes as there are elements in SimpleType and initializes the runtime constant of each process to a unique value in SimpleType. The created processes all execute the statements in StatementSequence.
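As a minimal sketch of how the statement is instantiated (the type Direction, the bounds, and the statement placeholders SS1 and SS2 below are illustrative assumptions, not taken from the paper's examples), a forall can range over an enumeration as well as over a subrange:

  TYPE Direction = (north, east, south, west);

  FORALL d : Direction IN SYNC       (* four processes, one per enumeration value, in unison *)
    SS1
  END;

  FORALL k : [1..1000] IN PARALLEL   (* 1000 asynchronous processes; k is a runtime constant in each *)
    SS2
  END

In both cases the identifier is local to the statement, cannot be changed by the body, and disappears when the forall terminates.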

The asynchronous forall

The created processes execute StatementSequence concurrently, without any implicit intermediate synchronization. The execution of the forall terminates when all created processes have finished. Thus, the asynchronous forall contains only one synchronization point, at the end. Any additional synchronization must be programmed explicitly with semaphores and the operations WAIT and SIGNAL.

In the following example, an asynchronous forall statement implements a vector addition.

  FORALL i : [0..N-1] IN PARALLEL
    z[i] := x[i] + y[i]
  END

Since no two processes created by the forall access the same vector element, no temporal ordering of the processes is necessary. The N processes may execute at whatever speed. The forall terminates when all processes created by it have terminated.

A more complicated example, illustrating recursive process creation, is the following. Procedure ParSearch searches a directed, possibly cyclic graph in parallel fashion. It can best be understood by comparing it with depth-first search, except that ParSearch runs in parallel. It starts with a root of the graph and visits nodes in the graph in a parallel and largely unpredictable fashion.

  PROCEDURE ParSearch (v : NodePtr);
  BEGIN
    IF Marked(v) THEN RETURN END;
    FORALL s : [1..v^.successors] IN PARALLEL
      ParSearch(succ(v, s))
    END;
    visit(v)
  END ParSearch;

The procedure ParSearch simply creates as many processes as a given node has successors and starts each process with an instance of ParSearch. Before visiting a node, ParSearch has to test whether the node has already been visited and marked. Since multiple processes may reach the same node simultaneously, testing and setting the mark is done in a critical section, implemented with a semaphore associated with each node, by the procedure Marked. (If the graph is a tree, no marking is necessary.)
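The paper does not show Marked itself; the following is a minimal sketch of one way to write it, assuming each node record carries a boolean mark and a per-node semaphore (the field names mark and mutex are our assumptions):

  PROCEDURE Marked (v : NodePtr) : BOOLEAN;
  VAR old : BOOLEAN;
  BEGIN
    WAIT(v^.mutex);      (* enter the critical section guarding this node *)
    old := v^.mark;      (* test the mark ... *)
    v^.mark := TRUE;     (* ... and set it, atomically with respect to the other processes *)
    SIGNAL(v^.mutex);
    RETURN old
  END Marked;

Whatever the exact representation, the essential point is that the test and the update happen under the same per-node semaphore, so at most one of the simultaneously arriving processes sees FALSE and goes on to search from that node.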

The synchronous forall

The processes created by a synchronous forall execute every single statement of StatementSequence in unison. To illustrate this mode, its semantics for selected statements is described in some detail below.

- A statement sequence is executed in unison by executing all its statements in order and in unison.

- In the case of branching statements such as IF C THEN SS1 ELSE SS2 END, the set of participating processes divides into disjoint and independently operating subsets, each of which executes one of the branches (SS1 or SS2 in the example) in unison. Note that, in contrast to other data-parallel languages, no assumption about the relative speeds or relative order of the branches may be made. The execution of the entire statement terminates when all processes of all subsets have finished.

- In the case of loop statements such as WHILE C DO SS END, the set of processes for any iteration divides into two disjoint subsets, namely the active and the inactive ones (with respect to the loop statement). Initially, all processes entering the loop are active. Every iteration starts with the synchronous evaluation of the loop condition C by all active processes. The processes for which C evaluates to FALSE become inactive. The rest forms the active subset, which executes the statement sequence SS in unison. The execution of the whole loop statement terminates when the subset of active processes becomes empty.

Hence, synchronous parallel operation closely resembles the lockstep operation of SIMD machines, with an important generalization for parallel branches.
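A minimal sketch of the loop rule above (the array A, the bound eps, and the halving step are illustrative assumptions, not taken from the paper): each process iterates in unison with the other still-active processes and drops out as soon as its own condition fails.

  FORALL i : [0..N-1] IN SYNC
    WHILE A[i] > eps DO     (* condition evaluated synchronously by all active processes *)
      A[i] := A[i] / 2.0    (* loop body executed in unison by the active subset *)
    END
  END

The while statement as a whole terminates only when the subset of active processes has become empty, exactly as described above.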

As an example, consider the computation of all postfix sums of a vector V with elements V[0] ... V[N]. The program should place into V[i] the sum of all elements V[i] ... V[N]. A recursive doubling technique computes all postfix sums in O(log N) time, where N is the length of the vector.

Figure 1 illustrates the process. The program operates by computing partial sums of length s = 2^j, where j counts the iterations. The inner forall creates N+1 processes. Note that there is a one-to-one mapping between process numbers and elements of the vector. In each iteration, the length of the partial sums is doubled by parallel summation of neighboring sums. The if statement inside the forall disables all processes that must not participate in the computation during a given iteration.

  VAR V : ARRAY [0..N] OF REAL;
  VAR s : CARDINAL;
  BEGIN
    s := 1;
    WHILE s <= N DO
      FORALL i : [0..N] IN SYNC
        IF i+s <= N THEN
          V[i] := V[i] + V[i+s]
        END
      END;
      s := s*2
    END
  END

[Figure 1: Computing postfix sums of a vector. For N = 10 the figure shows the contents of V after each iteration: V[i] first holds the sum over indices i..i+1, then i..i+3, then i..i+7, and finally i..10, with every range truncated at index 10.]

Allocation of array data

Colocation of data with the processors that access the data is important for parallel machines without uniform access time to memory locations. Poor alignment of data and processors may cause excessive communication overhead. We therefore provide a simple, machine-independent construct for controlling the allocation of array data. This construct is optional and does not change the meaning of a program; it affects only performance. A compiler for a machine with uniform memory access time may ignore the construct.

The allocation of array data to processors is controlled with one allocator per dimension. The modified declaration syntax for arrays is as follows:

  ArrayType = ARRAY SimpleType [allocator] {"," SimpleType [allocator]} OF type
  allocator = LOCAL | SPREAD | CYCLE | RANDOM | SBLOCK | CBLOCK

Array elements whose indices differ only in dimensions that are marked LOCAL are associated with the same processor. This facility is used to avoid distribution of data in a given dimension.

Dimensions with allocator SPREAD are divided into segments, one for each of the available processors. A vector with n elements is assigned to P processors by allocating a segment of length ⌈n/P⌉ to each processor. While utilizing all available processors, SPREAD minimizes the cost of nearest-neighbor communication.

Dimensions with allocator CYCLE are distributed in a round-robin fashion over the available processors. Given P processors, the elements of a vector whose indices are identical modulo P are associated with the same processor. In contrast to SPREAD, CYCLE maximizes the cost of nearest-neighbor communication: neighboring array elements are always in different processors, leading to better processor utilization if a parallel algorithm operates on subsegments of a vector at a time.

Dimensions with allocator RANDOM are distributed randomly over the available processors. In contrast to CYCLE, RANDOM leads to a better processor utilization if a parallel algorithm accesses the dimension in a random pattern.

If SPREAD, CYCLE, or RANDOM applies to several successive dimensions, then these dimensions are unrolled into one pseudo-vector with a length that is the product of the lengths of the individual dimensions. This scheme idles fewer processors than applying SPREAD, CYCLE, or RANDOM to the individual dimensions.

Allocators SBLOCK and CBLOCK apply SPREAD and CYCLE, respectively, to each dimension individually. For two successive dimensions, SBLOCK has the effect of creating rectangular subarrays and assigning those to the available processors. With this arrangement, nearest-neighbor communication in all dimensions is best supported when the interconnection network can be configured into the same number of dimensions as the arrays.

CBLOCK for two dimensions also creates two-dimensional subarrays, but the rows and columns of these subarrays are then distributed independently, in a round-robin fashion, over the processor grid. Again, SBLOCK minimizes nearest-neighbor communication, while CBLOCK allows high processor utilization if smaller subarrays are processed in parallel.
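To make the allocators concrete, the following declarations sketch typical uses (the array names, bounds, and element types are illustrative, not from the paper):

  VAR V : ARRAY [0..N-1] SPREAD OF REAL;
      (* contiguous segments of ⌈N/P⌉ elements, one segment per processor *)
  VAR W : ARRAY [0..N-1] CYCLE OF REAL;
      (* round-robin: elements with equal index modulo P share a processor *)
  VAR G : ARRAY [0..N-1] SBLOCK, [0..M-1] SBLOCK OF REAL;
      (* rectangular subarrays, good for nearest-neighbor access in both dimensions *)
  VAR R : ARRAY [0..N-1] SPREAD, [0..2] LOCAL OF REAL;
      (* rows are distributed, but the three components of each row stay on one processor *)

None of these annotations changes the meaning of the program; a compiler for a uniform-memory machine may simply ignore them.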

Implementing Modula-2*

When discussing the principles of compiling Modula-2*, we first present the more difficult case of compiling for MIMD and then introduce the simplifications for SIMD. The latter have actually been implemented in our CM compiler. Although this first compiler does not contain sophisticated optimizations, it helped us in understanding the main sources of optimization when compiling for such machines.

Asynchronous forall

Asynchronous forall, MIMD implementation

A straightforward approach to implementing the forall is to create the required number of lightweight processes, or threads. Threads of a forall share the same program (the enclosed statement sequence), but have their own stacks. The resulting runtime structure is a tree of stacks.

On a distributed memory machine it pays to replicate the entire program to all processors. Code replication can be accomplished quickly during startup, either using a broadcast facility or recursive doubling.

The main problems when compiling forall statements are thread creation, thread termination, and load balancing. All of these problems must be solved in a parallel fashion. Sequential implementations would cause a serious bottleneck and, for algorithms with fine granularity, result in essentially sequential programs. Note also that foralls may be nested, so there may be several new sets of threads being created simultaneously.

The process reaching a forall, called the spawner, must create the set of threads prescribed by the forall. The spawner actually creates only the initial thread of the forall, called the leader. (A simple optimization is to let the spawner take on the role of the leader.) The leader then replicates itself. All threads created keep replicating themselves again and again until the required number is obtained. This method is another variant of recursive doubling. A small number of parameters controls the replication. When a thread has replicated itself a sufficient number of times, it simply jumps to the beginning of the forall's code sequence and begins execution.

Synchronization and thread termination at the forall's end follow the same pattern. Each thread has a semaphore for receiving termination signals from other threads. A thread that reaches the end of its forall first waits for termination signals from all the threads it spawned during the replication process, then signals its creating thread and destroys itself. If all n threads of a forall terminate at about the same time, then the leader learns about the combined termination in time proportional to O(log n), signals the spawner, and kills itself (or simply resumes the role of the spawner).
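A sketch of this termination protocol in Modula-2*-like pseudocode follows; the thread descriptor and its fields (spawned, done, creator) are our assumptions, introduced only to make the signalling pattern explicit:

  PROCEDURE FinishForall (self : ThreadPtr);
  VAR k : CARDINAL;
  BEGIN
    FOR k := 1 TO self^.spawned DO   (* one signal per thread created during replication *)
      WAIT(self^.done)
    END;
    SIGNAL(self^.creator^.done);     (* propagate termination towards the leader *)
    (* the thread now destroys itself; the leader in turn signals the spawner *)
  END FinishForall;

Because every thread waits only for the threads it created itself, the signals travel up the same tree that was built during replication, which is why combined termination is detected in O(log n) steps when all threads finish at about the same time.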

The problem of load balancing is to distribute threads over the available processors so that the load on the processors is equalized, the threads are colocated with their data, and coscheduling of threads within the same forall instance becomes possible. Again, a centralized solution must be avoided. One possibility is for processors to keep a running total of ready processes and the overall average. The overall average can be updated periodically, say at the end of a timeslice, by another recursive doubling technique in which all processors participate. Newly created threads are moved between neighboring processors depending on the current load in comparison to the average. Under certain circumstances, migrating a long-running thread, including its data, to another processor may be advantageous. In addition, static compiler analysis can indicate preferred processors for colocating data and threads.

Coscheduling of threads in the same forall is necessary to avoid the delays inherent in context swaps when the threads communicate. Without coscheduling, communicating threads may enter a situation where they execute alternatingly, or in coroutine fashion, instead of in parallel. Coscheduling can be accomplished by increasing the thread priority with the nesting depth of foralls, or by providing special mechanisms for task forces, i.e. for scheduling groups of threads simultaneously.

Obviously, thread creation, termination, and load balancing must be as fast as possible. Various optimizations for bulk thread generation are feasible, but will not be discussed here for lack of space.

The above techniques have not been implemented in our first compiler, since the CM is a SIMD machine. However, work has started on a Modula-2* compiler targeting a Transputer cluster, where the techniques will be used. We are also exploring special hardware facilities to speed up these tasks.

Asynchronous forall, SIMD implementation

The synchronous nature of a SIMD machine, coupled with the broadcast bus from the frontend, makes all three of thread creation, termination, and load balancing operate in constant time or nearly constant time. For generality, assume nested foralls: m threads each execute a forall statement, each creating n new threads. Thus, the number of threads to be created is t = n*m. If n is not uniform for all the m spawners, then an O(log m)-time summation (instead of a constant-time multiplication) must be performed to compute t.
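The non-uniform case arises naturally with nested foralls whose inner bound depends on the outer index; in the following sketch (the array len and the procedure work are assumptions), each of the m outer threads creates a different number of inner threads, so t can only be obtained by summing over all spawners:

  FORALL i : [0..m-1] IN PARALLEL         (* m spawners *)
    FORALL j : [0..len[i]-1] IN PARALLEL  (* len[i] new threads per spawner *)
      work(i, j)
    END
  END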

Once t is known, t stacks are created by assigning to each of the p processors a segment of ⌈t/p⌉ stacks. This operation takes constant time and balances the load perfectly. Process termination also takes constant time, since there is no synchronization overhead. However, it may be necessary to provide each thread with some initial data, such as its number, during creation. Spreading this information again takes logarithmic time, but as demonstrated by the Connection Machine, special instructions for spreading data are so fast that in practice they can be regarded as constant.

What remains to be discussed is the scheduling of instructions. Since the asynchronous forall prescribes no scheduling of the threads at all, the compiler writer can choose one that works well on a given SIMD or MSIMD machine. We briefly describe the implementation we chose for the Connection Machine (CM). We assume initially that the number of available processors equals the number of threads.

Activity Bits: The central idea of control flow on SIMD computers is deactivation and reactivation of processors, controlled by an activity bit associated with each processor. When the activity bit is off, the processor does not execute the instructions issued by the frontend. This facility is sufficient for simulating the usual control flow constructs in a parallel context. All that is needed is a stack of activity bits for each thread. The top of each activity stack is stored into the activity bit of a processor. Suitable manipulation of the activity bits turns threads on and off, as required by the instruction stream issuing from the frontend.

There are two small extensions of the usual control flow mechanism for SIMD machines. They are needed for recursion and for exit and return statements. First consider parallel loops, i.e. loops within a forall. On a SIMD machine, the frontend repeatedly issues the instructions for the loop body until the termination conditions of all threads executing the loop are met. The usual technique is to evaluate a thread's termination condition directly into its activity bit. Before each iteration, the frontend tests whether there are any positive activity bits left. If not, the loop terminates. An exit statement may also terminate a loop, by turning off the activity bit of the corresponding thread. However, since an exit statement may be nested several levels deep within a loop, it must not only set the topmost activity bit to false, but also all those that have been stacked since the last loop was entered. Similar considerations apply to the return statement.

Consider the following example.

  FORALL i : [0..N-1] IN PARALLEL
    LOOP
      IF ODD(i) THEN EXIT END;
      SS
    END
  END

When control flow reaches the exit, two activity bits have been stacked for each thread: one for the loop and one for the if statement. To prevent a thread that has already executed the exit from being reactivated after the if, its top two activity bits must be set to FALSE.

Recursion termination is similar to loop termination. If a recursive call occurs inside a parallel if or case, then the frontend must sense whether there is any active thread left in a branch. If not, then the branch terminates. Without this provision, unbounded recursion would ensue.

Parallel Procedure Call: Because procedures can be called from both sequential and parallel contexts, each procedure must be compiled twice: once for executing entirely on the frontend in sequential mode, and a second time for executing within a forall statement. The difference is that in the parallel version the procedure call and return instructions are executed only on the frontend. Thus, we need two types of stacks. On the frontend, we stack return addresses. On the stacks associated with the parallel threads, we store parameters and local data. This division is a direct consequence of SIMD and would occur even if the frontend and the parallel processors had the same instruction set. (On the CM the instruction sets differ, and so the sequential and parallel versions are completely different.)

Our compiler relies on a minor language restriction: procedures may not be nested within each other. The reason is that up-level addressing is quite expensive. Since it is in general unpredictable in what context a procedure is called, each memory access would have to distinguish at runtime whether it references data on the frontend or on the parallel processors.

Processor Virtualization: Simulating more threads than there are processors available is called processor virtualization. In SIMD mode it is not possible to simply create new processes on demand and let the operating system schedule them. Instead, the frontend has to issue the instructions implementing the body of a forall in a loop. The number of iterations of this loop is given by the ratio of threads to available processors.

The PARIS instruction set of the CM provides automatic processor virtualization. This means that processor virtualization is transparent to the programmer. The firmware simulates as many threads as required. The maximum number of threads is only limited by the available memory, because the local memory of each processor must be shared out among the assigned threads.

Our Modula-2* compiler uses the automatic processor virtualization. However, this virtualization is quite expensive. The main reason is that it actually implements synchronous virtualization, which requires many temporary variables. In essence, this virtualization wraps every single instruction into a virtualizing loop, even though a loop around the entire body of a forall would suffice, since the asynchronous forall prescribes no scheduling of threads. The latter simulation would obviously be much more efficient.

Synchronous forall

Synchronous forall, MIMD implementation

The synchronous forall requires many more synchronization points than the asynchronous form. There must be a synchronization point between every two statements inside a forall and, in the case of the assignment, even within a single statement. A parallel assignment of the form L := R means that the value of R is evaluated synchronously and stored in a temporary. Similarly, the address represented by L is evaluated synchronously and stored in a temporary. Only after both of these parallel evaluations have completed can the assignment be made. Otherwise, interference is possible, as in the assignment A[i] := A[i+1].
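What the synchronous semantics require for such an assignment can be pictured as the following sketch (the bounds are illustrative and TMP is a hypothetical compiler-generated temporary, not something the programmer writes): every right-hand side is read before any left-hand side is written.

  (* synchronous A[i] := A[i+1] behaves like the two phases below *)
  FORALL i : [0..N-2] IN SYNC
    TMP[i] := A[i+1]      (* phase 1: evaluate all right-hand sides in unison *)
  END;
  FORALL i : [0..N-2] IN SYNC
    A[i] := TMP[i]        (* phase 2: perform all assignments *)
  END

Without the intermediate synchronization, a fast process could overwrite A[i+1] before its neighbor has read it, which is exactly the interference the temporary rules out.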

A synchronization point is implemented with a scheme similar to the one used to terminate an asynchronous forall, except that now the threads do not terminate but wait for a signal to proceed. First, a logarithmic reduction informs the leader that all threads in the process have reached the synchronization point. Then a logarithmic doubling process sends signals back out to the threads to continue.

Clearly, synchronization points are expensive. We are currently investigating methods to eliminate them where possible. For instance, the synchronization point inside an assignment is not necessary if the left and right hand sides do not interfere. Furthermore, by scheduling processes in a certain fashion, the overlaps may be reduced greatly. Even synchronization points between statements can be eliminated if there are no dependencies. Much of the dependency analysis developed for parallelizing compilers applies here.

Synchronous forall, SIMD implementation

The SIMD implementation of the synchronous forall was simple on the CM: the built-in virtualization does the job. However, this virtualization cannot take advantage of the optimizations described above. Instead, it must make conservative assumptions. The resulting virtualization is far from efficient. An optimizing compiler could produce a much faster virtualization in the majority of cases. Consider the following example.

  FORALL i : [0..N-1] IN SYNC
    A[i] := A[i] + 1
  END

Below are two possible virtualizations on p processors, expressed in Modula-2*. The first is the conservative virtualization as performed by PARIS; the second is the optimized version.

  s := CEILING(N/p);
  FORALL j : [0..p-1] IN PARALLEL
    FOR i := j*s TO MIN(j*s+s-1, N-1) DO
      TMP[i] := A[i];
      TMP[i] := TMP[i] + 1
    END
  END;
  FORALL j : [0..p-1] IN PARALLEL
    FOR i := j*s TO MIN(j*s+s-1, N-1) DO
      A[i] := TMP[i]
    END
  END

  s := CEILING(N/p);
  FORALL j : [0..p-1] IN PARALLEL
    FOR i := j*s TO MIN(j*s+s-1, N-1) DO
      reg := A[i];
      reg := reg + 1;
      A[i] := reg
    END
  END

The optimized version exploits the fact that only one temporary location is required. By using a single register for it on every processor, the number of writes to memory is reduced to one third of the unoptimized version. Furthermore, no synchronization is necessary: on a SIMD machine this means that the two loops can be merged; on a MIMD machine, we save the synchronization point. Furthermore, if the individual processors have a vector capability, the computation in each processor can even be interleaved.

While implementing the synchronous forall for the CM, we have identified the main sources of optimization in compiling for massively parallel machines. We have started to include these optimizations in the next compilers, for MasPar, CM, and Transputer, including the necessary data-dependence analysis.

Recommendations for Parallel Machine Architectures

The following list itemizes some broad requirements that parallel machine architectures should fulfill to allow for efficient compiled programs. These requirements are likely to be encountered when designing translation schemes for parallel imperative languages.

- Hardware support for fast process creation and synchronization.

- Shared address space: All processors should be able to generate addresses for the entire memory on the system. In particular, the frontend's memory should be part of that address space. (A shared address space does not imply shared memory.) A source of great difficulty in our compiler were the many different types of addresses: the compiler has to distinguish between local addresses, global addresses, addresses in the frontend, general communication addresses, and communication addresses on a grid. Optimizing for all these cases is often impossible, even with detailed interprocedural analysis. Furthermore, parallel pointers are quite expensive to implement without a shared address space; one basically has to simulate the shared address space in software.

- Uniform communication mechanism: Most parallel machines today provide one set of instructions for accessing local memory, a second one for accessing memory in direct neighbors, and a third set for accessing distant memory units. The differences in speed are significant and therefore require that the compiler detect the faster cases. However, it is often impossible to know statically for which case to optimize. For instance, we found that in most cases it was impossible to determine in the compiler whether a procedure would access local or non-local memory. The generated code thus has to check all three cases at runtime. Such a simple and frequently repeated case analysis could be done much more efficiently in hardware.

- Autonomous addressing capability: An autonomous addressing capability means that each processor can generate its own address for accessing memory. The Connection Machine does not have such a facility; on the CM, each processor must use the same address. The lack of autonomous addressing not only makes many applications awkward to write, but also precludes certain optimizations in processor virtualization.

- Single instruction set: SIMD machines today typically have different instruction sets for the frontend and the parallel processors. This property implies that the code generator of the compiler has to be written twice. Also, each procedure has to be translated twice, doubling code size. A speed differential between frontend and parallel processors, however, does not appear to be a major problem.

- Small instruction set: The CM offers a very large number of PARIS instructions, only a few of which a compiler can actually generate. A study determining the most frequently used instructions in parallel programs is sorely needed.

Conclusion

Ease of programming as well as portability of programs will be of overwhelming importance for the acceptance of highly parallel machines. Modula-2* supports both: few extensions of a sequential programming language suffice for writing highly parallel, problem-oriented programs, and compilers that can generate efficient code for a wide range of parallel machines appear feasible. Improvements in hardware architecture, operating systems, programming languages, and compiler technology should eventually render the current practice of machine-dependent parallel programming as obsolete as machine-dependent sequential programming.

References

Selim G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, Englewood Cliffs, New Jersey.

American National Standards Institute, Inc., Washington, DC. ANSI Programming Language Fortran Extended (Fortran), ANSI X3.

Henry E. Bal, Jennifer S. Steiner, and Andrew S. Tanenbaum. Programming languages for distributed computing systems. ACM Computing Surveys.

Geoffrey C. Fox. What have we learnt from using real parallel machines to solve real problems? In Proc. of the Third Conference on Hypercube Concurrent Computers and Applications, Pasadena, CA. ACM Press, New York.

Alan Gibbons and Wojciech Rytter. Efficient Parallel Algorithms. Cambridge University Press.

W. Daniel Hillis and Guy L. Steele. Data parallel algorithms. Communications of the ACM.

Charles Koelbel and Piyush Mehrotra. Supporting shared data structures on distributed memory architectures. In Proc. of the 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP).

Ralf Kretzschmar. Ein Modula-2*-Compiler für die Connection Machine CM. Master's thesis, University of Karlsruhe, Department of Informatics.

James McGraw, Stephen Skedzielewski, Stephen Allan, Rod Oldehoeft, John Glauert, Chris Kirkham, Bill Noyce, and Robert Thomas. SISAL Language Reference Manual. Lawrence Livermore National Laboratory.

James R. McGraw. The VAL language: Description and analysis. ACM Transactions on Programming Languages and Systems.

Piyush Mehrotra and John Van Rosendale. The BLAZE language: A parallel language for scientific programming. Parallel Computing.

Michael Metcalf and John Reid. Fortran Explained. Oxford Science Publications.

John K. Ousterhout, Donald A. Scelza, and Pradeep S. Sindhu. Medusa: An experiment in distributed operating system structure. Communications of the ACM.

INMOS Limited. Occam Programming Manual. Prentice Hall, Englewood Cliffs, New Jersey.

M. Rosing, R. Schnabel, and R. Weaver. DINO: Summary and example. In Proc. of the Third Conference on Hypercube Concurrent Computers and Applications, Pasadena, CA. ACM Press, New York.

Thinking Machines Corporation, Cambridge, Massachusetts. *Lisp Reference Manual.

Thinking Machines Corporation, Cambridge, Massachusetts. C* Programming Guide.

Walter F. Tichy. Parallel matrix multiplication on the Connection Machine. International Journal of High Speed Computing.

US Government, Ada Joint Program Office. ANSI/MIL-Std: Reference Manual for the Ada Programming Language.

Niklaus Wirth. Programming in Modula-2. Third corrected edition. Springer-Verlag, Berlin, Heidelberg, New York.