Models for Parallel Computing: Review and Perspectives

Christoph Kessler and Jörg Keller, Journal Article

N.B.: When citing this work, cite the original article. Original Publication: Christoph Kessler and Jörg Keller, Models for Parallel Computing: Review and Perspectives, Mitteilungen - Gesellschaft für Informatik e.V., Parallel-Algorithmen und Rechnerstrukturen, 2007. 24, pp. 13-29.

Postprint available at: Linköping University Electronic Press

http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-40734


Models for Parallel Computing

Review and Perspectives

Christoph Kessler and Jörg Keller

PELAB, Dept. of Computer Science (IDA)
Linköping University, Sweden
chrke@ida.liu.se

Dept. of Mathematics and Computer Science
FernUniversität in Hagen, Germany
joerg.keller@FernUni-Hagen.de

Abstract. As parallelism on different levels becomes ubiquitous in today's computers, it seems worthwhile to provide a review of the wealth of models for parallel computation that have evolved over the last decades. We refrain from a comprehensive survey and concentrate on models with some practical relevance, together with a perspective on models with potential future relevance. Besides presenting the models, we also refer to languages, implementations, and tools.

Key words: Parallel Computational Model, Survey, Parallel Programming Language, Parallel Cost Model

Introduction

Recent desktop and high-performance processors provide multiple hardware threads, technically realized by hardware multithreading and multiple processor cores on a single chip. Very soon, programmers will be faced with hundreds of hardware threads per processor chip. As exploitable instruction-level parallelism in applications is limited and the processor clock frequency cannot be increased any further due to power consumption and heat problems, exploiting thread-level parallelism becomes unavoidable if further improvement in processor performance is required, and there is no doubt that our requirements and expectations of machine performance will increase further. This means that parallel programming will actually concern a majority of application and system programmers in the foreseeable future, even in the desktop and embedded domain, and no longer only a few working in the comparably small HPC market niche.

A model of parallel computation consists of a parallel programming model and a corresponding cost model. A parallel programming model describes an abstract parallel machine by its basic operations (such as arithmetic operations, spawning of tasks, reading from and writing to shared memory, or sending and receiving messages), their effects on the state of the computation, the constraints of when and where these can be applied, and how they can be composed. In particular, a parallel programming model also contains, at least for shared memory programming models, a memory model that describes how and when memory accesses can become visible to the different parts of a parallel computer. The memory model sometimes is given implicitly. A parallel cost model associates a cost (which usually describes parallel execution time and resource occupation) with each basic operation, and describes how to predict the accumulated cost of composed operations up to entire parallel programs. A parallel programming model is often associated with one or several parallel programming languages or libraries that realize the model.

Parallel algorithms are usually formulated in terms of a particular parallel programming model. In contrast to sequential programming, where the von Neumann model is the predominant programming model (notable alternatives are e.g. dataflow and declarative programming), there are several competing parallel programming models. This heterogeneity is partly caused by the fact that some models are closer to certain existing hardware architectures than others, and partly because some parallel algorithms are easier to express in one model than in another one.

Programming models abstract to some degree from details of the underlying hardware, which increases portability of parallel algorithms across a wider range of parallel programming languages and systems.

In the following, we give a brief survey of parallel programming models and comment on their merits and perspectives from both a theoretical and practical view. While we focus on the, in our opinion, most relevant models, this paper is not aimed at providing a comprehensive presentation nor a thorough classification of parallel models and languages. Instead, we refer to survey articles and books in the literature, such as those by Bal et al., Skillicorn, Giloi, Maggs et al., Skillicorn and Talia, Hambrusch, Lengauer, and Leopold.

Model Survey

Before going through the various parallel programming models, we first recall two fundamental issues in parallel program execution that occur in implementations of several models.

Parallel Execution Styles

There exist several different parallel execution styles, which describe different ways how the parallel activities (e.g. processes, threads) executing a parallel program are created and terminated from the programmer's point of view. The two most prominent ones are fork-join-style and SPMD-style parallel execution.

Execution in fork-join style spawns parallel activities dynamically at certain points (fork) in the program that mark the beginning of parallel computation, and collects and terminates them at another point (join). At the beginning and the end of program execution, only one activity is executing, but the number of parallel activities can vary considerably during execution and thereby adapt to the currently available parallelism. The mapping of activities to physical processors needs to be done at runtime, by the operating system, by a thread package, or by the language's runtime system. While it is possible to use individual calls for fork and join functionality, as in the pthreads package, most parallel programming languages use scoping of structured programming constructs to implicitly mark fork and join points; for instance, in OpenMP they are given by the entry and exit from an omp parallel region, and in Cilk they are given by the spawn construct that frames a function call.
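As an illustration of implicit fork and join points by scoping, the following minimal C/OpenMP sketch (our own example, not from the article) forks a team of threads at the entry of the parallel region and joins them at its exit:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("sequential part: one activity\n");

        #pragma omp parallel          /* fork: a team of threads is spawned here */
        {
            int id = omp_get_thread_num();
            printf("parallel region executed by thread %d\n", id);
        }                             /* join: implicit barrier, the team is terminated */

        printf("sequential part again: one activity\n");
        return 0;
    }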

Nested parallelism can be achieved with fork-join style execution by statically or dynamically nesting such fork-join sections within each other, such that e.g. an inner fork spawns several parallel activities from each activity spawned by an outer fork, leading to a higher degree of parallelism.

Execution in SPMD style (single program, multiple data) creates a fixed number p (usually known only from program start) of parallel activities (physical or virtual processors) at the beginning of program execution, i.e. at entry to main, and this number will be kept constant throughout program execution, i.e. no new parallel activities can be spawned. In contrast to fork-join style execution, the programmer is responsible for mapping the parallel tasks in the program to the p available processors. Accordingly, the programmer has the responsibility for load balancing, while it is provided automatically by the dynamic scheduling in the fork-join style. While this is more cumbersome at program design time and limits flexibility, it leads to reduced runtime overhead, as dynamic scheduling is no longer needed. SPMD is used e.g. in the MPI message passing standard and in some parallel programming languages such as UPC.
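A minimal SPMD sketch in C with MPI (our own example): all p processes run the same program, and the programmer maps work to processes explicitly via the process rank:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, p;
        MPI_Init(&argc, &argv);                 /* the number of processes is fixed at startup */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my index in 0..p-1 */
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* explicit mapping of work to processes: rank r handles iterations r, r+p, r+2p, ... */
        int n = 1000;
        long local_sum = 0;
        for (int i = rank; i < n; i += p)
            local_sum += i;

        long global_sum = 0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %ld (computed by %d processes)\n", global_sum, p);

        MPI_Finalize();
        return 0;
    }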

Nested parallelism can be achieved with SPMD style as well, namely if a group of p processors is subdivided into s subgroups of p_i processors each, where sum_{i=1..s} p_i <= p. Each subgroup takes care of one subtask in parallel. After all subgroups are finished with their subtask, they are discarded and the parent group resumes execution. As group splitting can be nested, the group hierarchy forms a tree at any time during program execution, with the leaf groups being the currently active ones.

This schema is analogous to exploiting nested parallelism in fork-join style if one regards the original group G of p processors as one p-threaded process which may spawn s new p_i-threaded processes G_i, 1 <= i <= s, such that the total number of active threads is not increased. The parent process waits until all child processes (subgroups) have terminated and reclaims their threads.

Parallel Random Access Machine

The Parallel Random Access Machine (PRAM) model was proposed by Fortune and Wyllie as a simple extension of the Random Access Machine (RAM) model used in the design and analysis of sequential algorithms. The PRAM assumes a set of processors connected to a shared memory. There is a global clock that feeds both processors and memory, and execution of any instruction (including memory accesses) takes exactly one unit of time, independent of the executing processor and the (possibly accessed) memory address. Also, there is no limit on the number of processors accessing shared memory simultaneously.

The memory model of the PRAM is strict consistency, the strongest consistency model known, which says that a write in clock cycle t becomes globally visible to all processors at the beginning of clock cycle t+1, not earlier and not later.

The PRAM model also determines the effect of multiple processors writing or reading the same memory location in the same clock cycle. An EREW PRAM allows a memory location to be exclusively read or written by at most one processor at a time, the CREW PRAM allows concurrent reading but exclusive writing, and CRCW even allows simultaneous write accesses by several processors to the same memory location in the same clock cycle. The CRCW model specifies in several submodels how such multiple accesses are to be resolved, e.g. by requiring that the same value be written by all (Common CRCW PRAM), by the priority of the writing processors (Priority CRCW PRAM), or by combining all written values according to some global reduction or prefix computation (Combining CRCW PRAM). A somewhat restricted form is the CROW PRAM, where each memory cell may only be written by one processor at all, which is called the cell's owner. In practice, many CREW algorithms really are CROW.
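As a small illustration of PRAM-style cost analysis (our own worked example, not from the article): summing n values by a balanced binary tree of pairwise additions on an EREW PRAM with p = n/2 processors requires no concurrent accesses and takes

    T(n) = \Theta(\log n),    W(n) = \sum_{k=1}^{\log n} n/2^k = n - 1 = \Theta(n),

i.e. the total number of operations W(n) matches the sequential algorithm (the algorithm is work-optimal), whereas the processor-time product p * T(n) = \Theta(n \log n) overestimates the actual work because most processors are idle in the later rounds.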

Practical Relevance. The PRAM model is unique in that it supports deterministic parallel computation, and it can be regarded as one of the most programmer-friendly models available. Numerous algorithms have been developed for the PRAM model, see e.g. the book by JaJa. Also, it can be used as a first model for teaching parallel algorithms, as it allows students to focus on pure parallelism only, rather than having to worry about data locality and communication efficiency already from the beginning.

The PRAM model, especially its cost model for shared memory access, has however been strongly criticized for being unrealistic. In the shadow of this criticism, several architectural approaches demonstrated that a cost-effective realization of PRAMs is nevertheless possible, using hardware techniques such as multithreading and smart combining networks; examples are the NYU Ultracomputer, the SB-PRAM by Wolfgang Paul's group in Saarbrücken, XMT by Vishkin, and ECLIPSE by Forsell. A fair comparison of such approaches with current clusters and cache-based multiprocessors should take into consideration that the latter are good for special-purpose, regular problems with high data locality, while they perform poorly on irregular computations. In contrast, the PRAM is a general-purpose model that is completely insensitive to data locality.

Partly as a reaction to the criticism about practicality, variants of the PRAM model have been proposed within the parallel algorithms theory community. Such models relax one or several of the PRAM's properties. These include asynchronous PRAM variants, the hierarchical PRAM (H-PRAM), the block PRAM, the queuing PRAM (QPRAM), and the distributed PRAM (DRAM), to name only a few. Even the BSP model, which we will discuss later, can be regarded as a relaxed PRAM, and actually was introduced to bridge the gap between idealistic models and actual parallel machines. On the other hand, an even more powerful extension of the PRAM was also proposed in the literature, the BSR (Broadcast with selective reduction).

Implementations. Several PRAM programming languages have been proposed in the literature, such as Fork. For most of them there is, beyond a compiler, a PRAM simulator available, and sometimes additional tools such as trace file visualizers or libraries for central PRAM data structures and algorithms. Also, methods for translating PRAM algorithms automatically to other models, such as BSP or message passing, have been proposed in the literature.

Unrestricted Message Passing

A distributed memory machine, sometimes called a message-passing multicomputer, consists of a number of RAMs that run asynchronously and communicate via messages sent over a communication network. Normally it is assumed that the network performs message routing, so that a processor can send a message to any other processor without consideration of the particular network structure. In the simplest form, a message is assumed to be sent by one processor executing an explicit send command and received by another processor with an explicit receive command (point-to-point communication). Send and receive commands can be either blocking, i.e. the processors get synchronized, or non-blocking, i.e. the sending processor puts the message in a buffer and proceeds with its program, while the message passing subsystem forwards the message to the receiving processor and buffers it there until the receiving processor executes the receive command. There are also more complex forms of communication that involve a group of processors, so-called collective communication operations, such as broadcast, multicast, or reduction operations. Also, one-sided communications allow a processor to perform a communication operation (send or receive) without the processor addressed being actively involved.

The cost model of a message-passing multicomputer consists of two parts. The operations performed locally are treated as in a RAM. Point-to-point non-blocking communications are modelled by the LogP model, named after its four parameters. The latency L specifies the time that a message of one word needs to be transmitted from sender to receiver. The overhead o specifies the time that the sending processor is occupied in executing the send command. The gap g gives the time that must pass between two successive send operations of a processor, and thus models the processor's bandwidth to the communication network. The processor count P gives the number of processors in the machine. Note that by distinguishing between L and o, it is possible to model programs where communication and computation overlap. The LogP model has been extended to the LogGP model by introducing another parameter G that models the bandwidth for longer messages.
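For illustration, a small worked example under the LogP definitions above (our own, not from the article): a single one-word message is completely received at time

    T_msg = o + L + o,

accounting for send overhead, network latency, and receive overhead, and a processor that issues k consecutive one-word sends to distinct destinations is busy for roughly

    T_send(k) = o + (k-1) * max(g, o),

since successive sends must be spaced by at least the gap g; the last of these messages arrives at about T_send(k) + L + o, provided the receivers are not a bottleneck.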

Practical Relevance. Message passing models such as CSP (communicating sequential processes) have been used in the theory of concurrent and distributed systems for many years. As a model for parallel computing, it became popular with the arrival of the first distributed memory parallel computers in the late 1980s. With the definition of vendor-independent message-passing libraries, message passing became the dominant programming style on large parallel computers.

Message passing is the least common denominator of portable parallel programming. Message passing programs can quite easily be executed even on shared memory computers, while the opposite is much harder to perform efficiently. As a low-level programming model, unrestricted message passing gives the programmer the largest degree of control over the machine resources, including scheduling, buffering of message data, overlapping communication with computation, etc. This comes at the cost of code being more error-prone and harder to understand and maintain. Nevertheless, message passing is, as of today, the most successful parallel programming model in practice. As a consequence, numerous tools, e.g. for performance analysis and visualization, have been developed for this model.

Implementations. Early vendor-specific libraries were replaced in the early 1990s by portable message passing libraries such as PVM and MPI. MPI was later extended in the MPI-2 standard by one-sided communication and fork-join style dynamic process creation. MPI interfaces have been defined for Fortran, C, and C++. Today there exist several widely used implementations of MPI, including open-source implementations such as Open MPI.

A certain degree of structuring in MPI programs is provided by the hierarchical group concept for nested parallelism and the communicator concept that allows to create separate communication contexts for parallel software components. A library for managing nested parallel multiprocessor tasks on top of this functionality has been provided by Rauber and Rünger. Furthermore, MPI libraries for specific distributed data structures such as vectors and matrices have been proposed in the literature.

Bulk-Synchronous Parallelism

The bulk-synchronous parallel (BSP) model, proposed by Valiant in 1990 and slightly modified by McColl, enforces a structuring of message passing computations as a (dynamic) sequence of barrier-separated supersteps, where each superstep consists of a computation phase operating on local variables only, followed by a global interprocessor communication phase. The cost model involves only three parameters: the number of processors p, the point-to-point network bandwidth g, and the message latency resp. barrier overhead L, such that the worst-case asymptotic cost for a single superstep can be estimated as w + h·g + L if the maximum local work w per processor and the maximum communication volume h per processor are known. The cost for a program is then simply determined by summing up the costs of all executed supersteps.

An extension of the BSP model for nested parallelism by nestable supersteps was proposed by Skillicorn.

A variant of BSP is the CGM (coarse grained multicomputer) model proposed by Dehne, which has the same programming model as BSP, but an extended cost model that also accounts for aggregated messages in coarse-grained parallel environments.

Practical Relevance. Many algorithms for the BSP model have been developed in the 1990s in the parallel algorithms community. Implementations of BSP on actual parallel machines have been studied extensively. In short, the BSP model, while still simplifying considerably, allows to derive realistic predictions of execution time and can thereby guide algorithmic design decisions and balance trade-offs. Up to now, the existing BSP implementations have been used mainly in academic contexts, see e.g. Bisseling.

Implementations. The BSP model is mainly realized in the form of libraries, such as BSPlib or PUB, for an SPMD execution style.
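The following C sketch of a single superstep assumes the classic BSPlib primitives (bsp_begin, bsp_pid, bsp_push_reg, bsp_put, bsp_sync); it is our own minimal illustration, the surrounding bsp_init call in main and all error handling are omitted:

    #include <bsp.h>    /* BSPlib header */

    void spmd_part(void)
    {
        bsp_begin(bsp_nprocs());          /* start SPMD execution on all available processors */
        int p = bsp_nprocs(), pid = bsp_pid();

        int local = pid * pid;            /* computation phase: local work only */
        int incoming = -1;
        bsp_push_reg(&incoming, sizeof(int));
        bsp_sync();                       /* registration becomes effective in the next superstep */

        /* communication phase: send my value to my right neighbor (h = 1 here) */
        bsp_put((pid + 1) % p, &local, &incoming, 0, sizeof(int));
        bsp_sync();                       /* barrier ends the superstep; cost roughly w + h*g + L */

        bsp_end();
    }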

NestStep is a parallel programming language providing a partitioned global address space and SPMD execution with hierarchical processor groups. It provides deterministic parallel program execution with a PRAM-similar programming interface, where BSP-compliant memory consistency and synchronicity are controlled by the step statement that represents supersteps. NestStep also provides nested supersteps.

Asynchronous Shared Memory and Partitioned Global Address Space

In the shared memory model, several threads of execution have access to a common memory. So far, it resembles the PRAM model. However, the threads of execution run asynchronously, i.e. all potentially conflicting accesses to the shared memory must be resolved by the programmer, possibly with the help of the system underneath. Another notable difference to the PRAM model is the visibility of writes. In the PRAM model, each write to the shared memory was visible to all threads in the next step (strict consistency). In order to simplify efficient implementation, a large number of weaker consistency models have been developed, which however shift the responsibility to guarantee correct execution to the programmer. We will not go into details but refer to the literature.

The cost model of shared memory implementations depends on the realization. If several processors share a physical memory via a bus, the cost model resembles that of a RAM, with the obvious practical modifications due to caching and consideration of synchronization cost. This is called symmetric multiprocessing (SMP) because both the access mechanism and the access time are the same independently of address and accessing processor. Yet, if there is only a shared address space that is realized by a distributed memory architecture, then the cost of a shared memory access strongly depends on how far the accessed memory location is positioned from the requesting processor. This is called non-uniform memory access (NUMA). The differences in access time between placements in local cache, local memory, or remote memory may well be an order of magnitude for each level. In order to avoid remote memory accesses, caching of remote pages in the local memory has been employed and been called cache-coherent NUMA (CC-NUMA). A variant where pages are dynamically placed according to access has been called cache-only memory access (COMA). While NUMA architectures create the illusion of a shared memory, their performance tends to be sensitive to access patterns and artefacts like thrashing, much in the same manner as uniprocessor caches require consideration in performance tuning.

In order to give the programmer more control over the placement of data structures and the locality of accesses, partitioned global address space (PGAS) languages provide a programming model that exposes the underlying distributed memory organization to the programmer. The languages provide constructs for laying out the program's data structures in memory and to access them.

A recent development is transactional memory (see e.g. recent surveys), which adopts the transaction concept known from database programming as a primitive for parallel programming of shared memory systems. A transaction is a sequential code section enclosed in a statement such as atomic that should either fail completely or commit completely to shared memory as an atomic operation, i.e. the intermediate state of an atomic computation is not visible to other processors in the system. Hence, instead of having to hardcode explicit synchronization into the program, e.g. by using mutex locks or semaphores, transactions provide a declarative solution that leaves the implementation of atomicity to the language's runtime system or to the hardware. For instance, an implementation could use speculative parallel execution. It is also possible to nest transactions.

If implemented by a speculative parallel execution approach, atomic transactions can avoid the sequentialization effect of mutual exclusion, with its resulting idle times, as long as conflicts occur only seldom.

A related approach is the design of lock-free parallel data structures such as shared hash tables, FIFO queues, and skip lists. The realization on shared memory platforms typically combines speculative parallel execution of critical sections in update operations on such data structures with hardware-supported atomic instructions such as fetch&add, compare&swap, or load-linked/store-conditional.
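As an illustration of the latter (our own minimal sketch, using the GCC __sync atomic builtins as a stand-in for the hardware instructions named above): a lock-free counter increment via fetch&add, and a compare&swap-based push onto a singly linked stack:

    #include <stdlib.h>

    typedef struct node { struct node *next; int value; } node_t;

    static node_t *top = NULL;       /* shared stack top */
    static long    counter = 0;      /* shared counter   */

    void increment(void)
    {
        __sync_fetch_and_add(&counter, 1);            /* atomic fetch&add, no lock needed */
    }

    void push(int value)
    {
        node_t *n = malloc(sizeof(node_t));
        n->value = value;
        node_t *old;
        do {                                          /* speculatively link, then try to swing top */
            old = top;
            n->next = old;
        } while (!__sync_bool_compare_and_swap(&top, old, n));   /* atomic compare&swap */
    }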

Practical Relevance. Shared memory programming has become the dominant form of programming for small-scale parallel computers, notably SMP systems. As large-scale parallel computers have started to consist of clusters of SMP nodes, shared memory programming on the SMPs also has been combined with message passing concepts. CC-NUMA systems seem like a double-edged sword, because on the one hand they allow to port shared memory applications to larger installations. However, to get sufficient performance, the data layout in memory has to be considered in detail, so that programming still resembles distributed memory programming, only that control over placement is not explicit. PGAS languages that try to combine both worlds currently gain some attraction.

Implementations. POSIX threads (pthreads) were up to now the most widely used parallel programming environment for shared memory systems. However, pthreads are usually realized as libraries extending sequential programming languages such as C and Fortran, whose compilers only know about a single target processor and which thus lack a well-defined memory model. Instead, pthreads programs inherit memory consistency from the underlying hardware, and thus must account for worst-case scenarios, e.g. by inserting flush (memory fence) operations at certain places in the code.

Cilk is a shared-memory parallel language for algorithmic multithreading. By the spawn construct, the programmer specifies independence of dynamically created tasks, which can be executed either locally on the same processor or on another one. Newly created tasks are put on a local task queue. A processor will execute tasks from its own local task queue as long as it is non-empty. Processors becoming idle steal a task from a randomly selected victim processor's local task queue. As tasks are typically lightweight, load balancing is completely taken care of by the underlying runtime system. The overhead of queue access is very low unless task stealing is necessary.
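A minimal sketch in Cilk-5-style syntax (our own example, using the cilk, spawn and sync keywords of the MIT Cilk language described above): each spawn creates a task that an idle processor may steal, and sync waits for all spawned children:

    cilk int fib(int n)
    {
        if (n < 2) return n;
        int x, y;
        x = spawn fib(n - 1);   /* child task, may be stolen by an idle processor */
        y = spawn fib(n - 2);   /* executed locally unless stolen                 */
        sync;                   /* wait for both children before using x and y    */
        return x + y;
    }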

OpenMP is gaining popularity with the arrival of multi-core processors and may eventually replace pthreads completely. OpenMP provides structured parallelism in a combination of SPMD and fork-join styles. Its strengths are in the work-sharing constructs that allow to distribute loop iterations according to one of several predefined scheduling policies. OpenMP has a weak memory consistency model; that is, flush operations may be required to guarantee that stale copies of a shared variable will no longer be used after a certain point in the program. The forthcoming OpenMP 3.0 standard supports a task-based programming model, which makes it more appropriate for fine-grained nested parallelism than its earlier versions. A drawback of OpenMP is that it has no good programming support yet for controlling data layout in memory.
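A short C example of an OpenMP work-sharing construct (ours, not from the article): the loop iterations are distributed over the thread team according to the selected scheduling policy:

    void scale(double *a, const double *b, double s, int n)
    {
        /* fork a team, then share the n iterations among its threads;
           schedule(static) assigns contiguous chunks, while schedule(dynamic,c)
           would hand out chunks of c iterations on demand */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            a[i] = s * b[i];
    }   /* implicit barrier and join at the end of the parallel region */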

In PGAS languages, accesses to remote shared data are either handled by separate access primitives, as in NestStep, or the syntax is the same and the language's runtime system automatically traps to remote memory access if non-local data have been accessed, as in UPC (Unified Parallel C). Loops generally take data affinity into account, such that iterations should preferably be scheduled to processors that have most of the required data stored locally, to reduce communication.
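A minimal UPC sketch (our own example) of such an affinity-aware loop: the fourth clause of upc_forall assigns each iteration to the thread with affinity to the corresponding array element, so with the default cyclic layout the accesses below are local:

    #include <upc.h>

    #define N 1000
    shared double a[N], b[N];   /* shared arrays, elements distributed over the threads */

    void add_one(void)
    {
        int i;
        /* affinity expression &a[i]: iteration i runs on the thread that owns a[i] */
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] = b[i] + 1.0;
    }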

The Linda system provides a shared memory via the concept of tuple spaces, which is much more abstract than linear addressing and partly resembles access to a relational database.

A programming language for transactional programming called Atomos has been proposed by Carlstrom et al. As an example of a library of lock-free parallel data structures, we refer to NOBLE.

In parallel language design, there is some consensus that extending sequential programming languages by a few new constructs is not the cleanest way to build a sound parallel programming language, even if this promises the straightforward reuse of major parts of available sequential legacy code. Currently, three major new parallel languages are under development for the high-performance computing domain: Chapel, promoted by Cray; X10, promoted by IBM; and Fortress, promoted by Sun. It remains to be seen how these will be adopted by programmers and whether any of these will be strong enough to become the dominant parallel programming language in a long-range perspective. At the moment, it rather appears that OpenMP and perhaps UPC are ready to enter mainstream computing platforms.

Data Parallel Models

Data parallel models include SIMD (Single Instruction Multiple Data) and vector computing, data parallel computing, systolic computing, cellular automata, VLIW computing, and stream data processing.

Data parallel computing involves the element-wise application of the same scalar computation to several elements of one or several operand vectors (which usually are arrays or parts thereof), creating a result vector. All element computations must be independent of each other and may therefore be executed in any order, in parallel, or in a pipelined way. The equivalent of a data parallel operation in a sequential programming language is a loop over the element-wise operation, scanning the operand and result vectors. Most data parallel languages accordingly provide a data parallel loop construct such as forall. Nested forall loops scan a multidimensional iteration space, which maps array accesses in its body to possibly multidimensional slices of the involved operand and result arrays. The strength of data parallel computing is the single state of the program's control, making it easier to analyze and to debug than task-parallel programs, where each thread may follow its own control flow path through the program. On the other hand, data parallel computing alone is quite restrictive; it fits well for most numerical computations, but in many other cases it is too inflexible.

A special case of data parallel computing is SIMD computing or vector computing. Here, the data parallel operations are limited to predefined SIMD/vector operations such as element-wise addition. While any SIMD/vector operation could be rewritten by a data parallel loop construct, the converse is not true: the element-wise operation to be applied in a data parallel computation may be much more complex (e.g. contain several scalar operations and even non-trivial control flow) than what can be expressed by available vector/SIMD instructions. In SIMD/vector programming languages, SIMD/vector operations are usually represented using array notation such as a(1:n) = b(1:n) + c(1:n).
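At the instruction-set level, such element-wise operations appear as SIMD instructions on short vectors. A minimal C sketch using the x86 SSE intrinsics (our own example; it assumes n is a multiple of 4 and the SSE instruction set is available):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* a(1:n) = b(1:n) + c(1:n), four single-precision elements per instruction */
    void vadd(float *a, const float *b, const float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 vb = _mm_loadu_ps(&b[i]);              /* load 4 floats       */
            __m128 vc = _mm_loadu_ps(&c[i]);
            _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));     /* add and store 4 results */
        }
    }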

Systolic computing is a pipelining-based parallel computing model involving a synchronously operating processor array (a so-called systolic processor array), where processors have local memory and are connected by a fixed, channel-based interconnection network. Systolic algorithms have been developed mainly in the 1980s, mostly for specific network topologies such as meshes, trees, pyramids, or hypercubes, see e.g. Kung and Leiserson. The main motivation of systolic computation is that the movement of data between processors is typically on a nearest-neighbor basis (so-called shift communications), which has shorter latency and higher bandwidth than arbitrary point-to-point communication on some distributed memory architectures, and that no buffering of intermediate results is necessary, as all processing elements operate synchronously.

A cellular automaton (CA) consists of a collection of finite state automata stepping synchronously, each automaton having as input the current state of one or several other automata. This neighbour relationship is often reflected by physical proximity, such as arranging the CA as a mesh. The CA model is a model for computations with structured communication. Also, systolic computations can be viewed as a kind of CA computation. The restriction of nearest-neighbor access is relaxed in the global cellular automaton (GCA), where the state of each automaton includes an index of the automaton to read from in this step. Thus, the neighbor relationship can be time-dependent and data-dependent. The GCA is closely related to the CROW PRAM, and it can be efficiently implemented in reconfigurable hardware, i.e. field programmable gate arrays (FPGAs).

In a very long instruction word (VLIW) processor, an instruction word contains several assembler instructions. Thus, there is the possibility that the compiler can exploit instruction-level parallelism (ILP) better than a superscalar processor, by having knowledge of the program to be compiled. Also, the hardware to control execution in a VLIW processor is simpler than in a superscalar processor, thus possibly leading to a more efficient execution. The concept of explicitly parallel instruction computing (EPIC) combines VLIW with other architectural concepts, such as predication to avoid conditional branches, known from SIMD computing.

In stream processing, a continuous stream of data is processed, each element undergoing the same operations. In order to save memory bandwidth, several operations are interleaved in a pipelined fashion. As such, stream processing inherits concepts from vector and systolic computing.

Practical Relevance. Vector computing was the paradigm used by the early vector supercomputers in the 1970s and 1980s and is still an essential part of modern high-performance computer architectures. It is a special case of the SIMD computing paradigm, which involves SIMD instructions as basic operations in addition to scalar computation. Most modern high-end processors have vector units extending their instruction set by SIMD/vector operations. Even many other processors nowadays offer SIMD instructions that can apply the same operation to multiple subwords simultaneously if the subword-sized elements of each operand and result are stored contiguously in the same word-sized register. Systolic computing has been popular in the 1980s in the form of multi-unit coprocessors for high-performance computations. CA and GCA have found their niche for implementations in hardware. Also, with the relation to CROW PRAMs, the GCA could become a bridging model between high-level parallel models and efficient configurable hardware implementation. VLIW processors became popular in the 1980s, were pushed aside by the superscalar processors during the 1990s, but have seen a rebirth with Intel's Itanium processor. VLIW is today also a popular concept in high-performance processors for the digital signal processing (DSP) domain. Stream processing has current popularity because of its suitability for real-time processing of digital media.

Implementations. APL is an early SIMD programming language. Other SIMD languages include Vector C and C*. Fortran 90 supports vector computing and even a simple form of data parallel computing. With the HPF extensions, it became a full-fledged data parallel language. Other data parallel languages include ZPL, NESL, Dataparallel C, and Modula-2*.

CA and GCA applications are mostly programmed in hardware description languages. Besides proposals for own languages, the mapping of existing languages like APL onto those automata has been discussed. As the GCA is an active new area of research, there are no complete programming systems yet.

Clustered VLIW DSP processors, such as the TI C6x family, allow parallel execution of instructions, yet apply additional restrictions on the operands, which is a challenge for optimizing compilers.

Early programming support for stream processing was available in the Brook language and the Sh library (Univ. of Waterloo). Based on the latter, RapidMind provides a commercial development kit with a stream extension for C++.

Task-Parallel Models and Task Graphs

Many applications can be considered as a set of tasks, each task solving part of the problem at hand. Tasks may communicate with each other during their existence, or may only accept inputs as a prerequisite to their start and send results to other tasks only when they terminate. Tasks may spawn other tasks in a fork-join style, and this may be done even in a dynamic and data-dependent manner. Such collections of tasks may be represented by a task graph, where nodes represent tasks and arcs represent communication, i.e. data dependencies. The scheduling of a task graph involves ordering of tasks and mapping of tasks to processing units. Goals of the scheduling can be minimization of the application's runtime or maximization of the application's efficiency, i.e. of the utilization of the machine resources. Task graphs can occur at several levels of granularity.
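To make the scheduling step concrete, the following simplified C sketch (our own illustration, not an algorithm from the article) performs greedy list scheduling of a small task graph given in topological order: each task is placed on the processing unit that allows the earliest start, respecting its data dependencies; communication costs and more sophisticated heuristics are ignored here:

    #include <stdio.h>

    #define T 6   /* number of tasks      */
    #define P 2   /* number of processors */

    /* tasks are numbered topologically; pred[i][j] != 0 means task j precedes task i */
    static const int    pred[T][T] = { {0}, {1,0}, {1,0}, {0,1,1,0}, {0,0,1,0,0}, {0,0,0,1,1,0} };
    static const double work[T]    = { 2, 3, 1, 4, 2, 1 };

    int main(void)
    {
        double ready[P] = {0};     /* time at which each processor becomes free */
        double finish[T];          /* computed finish time of each task         */

        for (int i = 0; i < T; i++) {
            double est = 0;                        /* earliest start: all predecessors done */
            for (int j = 0; j < i; j++)
                if (pred[i][j] && finish[j] > est) est = finish[j];

            int best = 0;                          /* pick the processor that can start first */
            for (int q = 1; q < P; q++)
                if (ready[q] < ready[best]) best = q;

            double start = est > ready[best] ? est : ready[best];
            finish[i] = start + work[i];
            ready[best] = finish[i];
            printf("task %d -> processor %d, start %.1f, finish %.1f\n", i, best, start, finish[i]);
        }
        return 0;
    }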

While a superscalar processor must detect data and control flow dependencies from a linear stream of instructions, dataflow computing provides the execution machine with the application's dataflow graph, where nodes represent single instructions or basic instruction blocks, so that the underlying machine can schedule and dispatch instructions as soon as all necessary operands are available, thus enabling better exploitation of parallelism. Dataflow computing is also used in hardware/software co-design, where computationally intensive parts of an application are mapped onto reconfigurable custom hardware, while other parts are executed in software on a standard microprocessor. The mapping is done such that the workloads of both parts are about equal and that the complexity of the communication overhead between the two parts is minimized.

Task graphs also occur in grid computing, where each node may already represent an executable with a runtime of hours or days. The execution units are geographically distributed collections of computers. In order to run a grid application on the grid resources, the task graph is scheduled onto the execution units. This may occur prior to or during execution (static vs. dynamic scheduling). Because of the wide distances between nodes, with correspondingly restricted communication bandwidths, scheduling typically involves clustering, i.e. mapping tasks to nodes such that the communication bandwidth needed between nodes is minimized. As a grid node more and more often is a parallel machine itself, tasks can also carry parallelism, so-called malleable tasks. Scheduling a malleable task graph involves the additional difficulty of determining the amount of execution units allocated to parallel tasks.

Practical Relevance. While dataflow computing in itself has not become a mainstream in programming, it has seriously influenced parallel computing, and its techniques have found their way into many products. Hardware/software co-design has gained some interest by the integration of reconfigurable hardware with microprocessors on single chips. Grid computing has gained considerable attraction in the last years, mainly driven by the enormous computing power needed to solve grand challenge problems in natural and life sciences.

Implementations. A prominent example for parallel dataflow computation was the MIT Alewife machine with the Id functional programming language.

Hardware/software co-design is frequently applied in digital signal processing, where there exist a number of multiprocessor systems-on-chip (DSP-MPSoC). The Mitrion-C programming language from Mitrionics serves to program the SGI RASC appliance that contains FPGAs and is mostly used in the field of bioinformatics.

There are several grid middlewares, most prominently Globus and Unicore, for which a multitude of schedulers exists.

General Parallel Programming Methodologies

In this section, we briefly review the features, advantages, and drawbacks of several widely used approaches to the design of parallel software. Many of these actually start from an existing sequential program for the same problem, which is more restricted but of very high significance for the software industry, which has to port a host of legacy code to parallel platforms in these days, while other approaches encourage a radically different parallel program design from scratch.

Foster's PCAM Method

Foster suggests that the design of a parallel program should start from an existing (possibly sequential) algorithmic solution to a computational problem by partitioning it into many small tasks and identifying dependences between these that may result in communication and synchronization, for which suitable structures should be selected. These first two design phases, partitioning and communication, are for a model that puts no restriction on the number of processors. The task granularity should be as fine as possible in order to not constrain the later design phases artificially. The result is a more or less scalable parallel algorithm in an abstract programming model that is largely independent from a particular parallel target computer. Next, the tasks are agglomerated to macrotasks (processes) to reduce internal communication and synchronization relations within a macrotask to local memory accesses. Finally, the macrotasks are scheduled to physical processors to balance load and further reduce communication. These latter two steps, agglomeration and mapping, are more machine-dependent because they use information about the number of processors available, the network topology, the cost of communication etc. to optimize performance.

Incremental Parallelization

For many scientific programs, almost all of their execution time is spent in a fairly small part of the code. Directive-based parallel programming languages such as HPF and OpenMP, which are designed as a semantically consistent extension to a sequential base language such as Fortran and C, allow to start from sequential source code that can be parallelized incrementally. Usually, the most computationally intensive inner loops are identified (e.g. by profiling) and parallelized first by inserting some directives, e.g. for loop parallelization. If performance is not yet sufficient, more directives need to be inserted, and even rewriting of some of the original code may be necessary to achieve reasonable performance.

Automatic Parallelization

Automatic parallelization of sequential legacy code is of high importance to industry but notoriously difficult. It occurs in two forms: static parallelization by a smart compiler, and run-time parallelization with support by the language's runtime system or the hardware.

Static parallelization of sequential code is, in principle, an undecidable problem, because the dynamic behavior of the program (and thereby the exact data dependences between statement executions) is usually not known at compile time, but solutions for restricted programs and specific domains exist. Parallelizing compilers usually focus on loop parallelization, because loops account for most of the computation time in programs and their dynamic control structure is relatively easy to analyze. Methods for data dependence testing of sequential loops and loop nests often require index expressions that are linear in the loop variables. In cases of doubt, the compiler will conservatively assume dependence, i.e. non-parallelizability, of all loop iterations. Domain-specific methods may look for special code patterns that represent typical computation kernels in a certain domain, such as reductions, prefix computations, data-parallel operations etc., and replace these by equivalent parallel code. These methods are in general less limited by the given control structure of the sequential source program than loop parallelization, but still rely on careful program analysis.
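A tiny C example of what such dependence analysis distinguishes (our own illustration): the first loop has no cross-iteration dependence and can be parallelized, while the second carries a flow dependence from iteration i-1 to i and cannot be run in parallel as written:

    /* independent iterations: a parallelizing compiler may turn this into a parallel loop */
    for (int i = 1; i < n; i++)
        a[i] = b[i] + c[i];

    /* loop-carried flow dependence on a[i-1]: conservatively kept sequential */
    for (int i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];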

Run-time parallelization techniques defer the analysis of data dependences and the determination of the parallel schedule of the computation to run time, when more information about the program (e.g. values of input-dependent variables) is known. The necessary computations, prepared by the compiler, will cause run-time overhead, which is often prohibitively high if it cannot be amortized over several iterations of an outer loop whose dependence structure does not change between its iterations. Methods for run-time parallelization of irregular loops include doacross parallelization, the inspector-executor method for shared and for distributed memory systems, the privatizing DOALL test, and the LRPD test. The latter is a software implementation of speculative loop parallelization.

In speculative parallelization of loops, several iterations are started in parallel, and the memory accesses are monitored such that potential misspeculation can be discovered. If the speculation was wrong, the iterations that were erroneously executed in parallel must be rolled back and re-executed sequentially. Otherwise, their results will be committed to memory.

Thread-level parallel architectures can implement general thread-level speculation, e.g. as an extension to the cache coherence protocol. Potentially parallel tasks can be identified by the compiler or on the fly during program execution. Promising candidates for speculation on data or control independence are the parallel execution of loop iterations, or the parallel execution of a function call together with its continuation (the code following the call). Simulations have shown that thread-level speculation works best for a small number of processors.

Skeleton-Based and Library-Based Parallel Programming

Structured parallel programming, also known as skeleton programming, restricts the many ways of expressing parallelism to compositions of only a few predefined patterns, so-called skeletons. Skeletons are generic, portable, and reusable basic program building blocks for which parallel implementations may be available. They are typically derived from higher-order functions as known from functional programming languages. A skeleton-based parallel programming system, like P3L, SCL, eSkel, MuesLi, or QUAFF, usually provides a relatively small, fixed set of skeletons. Each skeleton represents a unique way of exploiting parallelism in a specifically organized type of computation, such as data parallelism, task farming, parallel divide-and-conquer, or pipelining. By composing these, the programmer can build a structured, high-level specification of parallel programs. The system can exploit this knowledge about the structure of the parallel computation for automatic program transformation, resource scheduling, and mapping. Performance prediction is also enhanced by composing the known performance prediction functions of the skeletons accordingly. The appropriate set of skeletons, their degree of composability, generality, and architecture independence, and the best ways to support them in programming languages have been intensively researched in the 1990s and are still issues of current research.
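A minimal sketch of the idea in C (our own example, not taken from any of the systems named above): a data-parallel "map" skeleton as a generic, reusable building block, here instantiated with OpenMP as one possible parallel implementation hidden behind the generic interface:

    /* generic map skeleton: apply f element-wise; the parallel implementation is hidden here */
    void map(double *out, const double *in, long n, double (*f)(double))
    {
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            out[i] = f(in[i]);
    }

    static double square(double x) { return x * x; }

    void use(double *y, const double *x, long n)
    {
        map(y, x, n, square);   /* the user composes skeletons, not threads and messages */
    }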

Composition of skeletons may be either non-hierarchical, by sequencing using temporary variables to transport intermediate results, or hierarchical, by conceptually nesting skeleton functions, that is, by building a new, hierarchically composed function by (virtually) inserting the code of one skeleton as a parameter into that of another one. This enables the elegant compositional specification of multiple levels of parallelism. In a declarative programming environment, such as in functional languages or separate skeleton languages, hierarchical composition gives the code generator more freedom of choice for automatic transformations and for efficient resource utilization, such as the decision of how many parallel processors to spend at which level of the compositional hierarchy. Ideally, the cost estimations of the composed function could be composed correspondingly from the cost estimation functions of the basic skeletons. While non-nestable skeletons can be implemented by generic library routines, nestable skeletons require, in principle, a static preprocessing that unfolds the skeleton hierarchy, e.g. by using C++ templates or C preprocessor macros.

The exploitation of nested parallelism specified by such a hierarchical composition is quite straightforward if a fork-join mechanism for recursive spawning of parallel activities is applicable. In that case, each thread executing the outer skeleton spawns a set of new threads that execute the inner skeleton in parallel as well. This may result in very fine-grained parallel execution and shifts the burden of load balancing and scheduling to the runtime system, which may incur tremendous space and time overhead. In an SPMD environment like MPI, UPC, or Fork, nested parallelism can be exploited by suitable group splitting.

Parallel programming with skeletons may be seen in contrast to parallel programming using parallel library routines. Domain-specific parallel subroutine libraries, e.g. for numerical computations on large vectors and matrices, are available for almost any parallel computer platform. Both for skeletons and for library routines, reuse is an important purpose. Nevertheless, the usage of library routines is more restrictive, because they exploit parallelism only at the bottom level of the program's hierarchical structure, that is, they are not compositional, and their computational structure is not transparent for the programmer.

Conclusion

At the end of this review of parallel programming models, we may observe some current trends and speculate a bit about the future of parallel programming models.

As far as we can foresee today, the future of computing is parallel computing, dictated by physical and technical necessity. Parallel computer architectures will be more and more hybrid, combining hardware multithreading, many cores, SIMD units, accelerators, and on-chip communication systems, which requires the programmer and the compiler to solicit parallelism, orchestrate computations, and manage data locality at several levels in order to achieve reasonable performance. A perhaps extreme example for this is the Cell BE processor.

Because of their relative simplicity, purely sequential languages will remain for certain applications that are not performance-critical, such as word processors. Some will have standardized parallel extensions and be slightly revised to provide a well-defined memory model for use in parallel systems, or disappear in favor of new, truly parallel languages. As parallel programming leaves the HPC market niche and goes mainstream, simplicity will be pivotal, especially for novice programmers. We foresee a multi-layer model with a simple, deterministic high-level model that focuses on parallelism and portability, while it includes transparent access to an underlying lower-level model layer with more performance tuning possibilities for experts. New software engineering techniques such as aspect-oriented and view-based programming and model-driven development may help in managing complexity.

Given that programmers were mostly trained in a sequential way of algorithmic thinking for the last decades, migration paths from sequential programming to parallel programming need to be opened. To prepare coming generations of students better, undergraduate teaching should encourage a massively parallel access to computing, e.g. by taking up parallel time, work, and cost metrics in the design and analysis of algorithms early in the curriculum.

From an industry perspective, tools that allow to more or less automatically port sequential legacy software are of very high significance. Deterministic and time-predictable parallel models are useful e.g. in the real-time domain. Compiler and tool technology must keep pace with the introduction of new parallel language features. Even the most advanced parallel programming language is doomed to failure if its compilers are premature at its market introduction and produce poor code, as we could observe in the 1990s for HPF in the high-performance computing domain, where HPC programmers instead switched to the lower-level MPI as their main programming model.

Acknowledgments. Christoph Kessler acknowledges funding by Ceniit at Linköpings universitet, by Vetenskapsrådet (VR), by SSF RISE, by Vinnova (SafeModSim), and by the CUGS graduate school.

References

Ferri Abolhassan, Reinhard Drefenstedt, Jörg Keller, Wolfgang J. Paul, and Dieter Scheerer. On the physical design of PRAMs. The Computer Journal, December.
Ali-Reza Adl-Tabatabai, Christos Kozyrakis, and Bratin Saha. Unlocking concurrency: multicore programming with transactional memory. ACM Queue, Dec./Jan.
Sarita V. Adve and Kourosh Gharachorloo. Shared Memory Consistency Models: a Tutorial. IEEE Computer.
Anant Agarwal, Ricardo Bianchini, David Chaiken, Kirk L. Johnson, David Kranz, John Kubiatowicz, Beng-Hong Lim, David Mackenzie, and Donald Yeung. The MIT Alewife machine: Architecture and performance. In Proc. Int. Symp. Computer Architecture.
A. Aggarwal, A. K. Chandra, and M. Snir. Communication complexity of PRAMs. Theoretical Computer Science.
Albert Alexandrov, Mihai F. Ionescu, Klaus E. Schauser, and Chris Scheiman. LogGP: Incorporating long messages into the LogP model for parallel computation. Journal of Parallel and Distributed Computing.
Eric Allen, David Chase, Joe Hallett, Victor Luchangco, Jan-Willem Maessen, Sukyoung Ryu, Guy L. Steele Jr., and Sam Tobin-Hochstadt. The Fortress language specification, March. http://research.sun.com/projects/plrg/Publications/
Bruno Bacci, Marco Danelutto, Salvatore Orlando, Susanna Pelagatti, and Marco Vanneschi. P3L: A structured high level programming language and its structured support. Concurrency: Practice and Experience.
Bruno Bacci, Marco Danelutto, and Susanna Pelagatti. Resource Optimisation via Structured Parallel Programming. April.
Henri E. Bal, Jennifer G. Steiner, and Andrew S. Tanenbaum. Programming Languages for Distributed Computing Systems. ACM Computing Surveys, September.
R. Bisseling. Parallel Scientific Computation: A Structured Approach using BSP and MPI. Oxford University Press.
Guy Blelloch. Programming Parallel Algorithms. Comm. ACM, March.
Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. In Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming.
Hans-J. Boehm. Threads cannot be implemented as a library. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation.
Olaf Bonorden, Ben Juurlink, Ingo von Otte, and Ingo Rieping. The Paderborn University BSP (PUB) Library. Parallel Computing.
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. In SIGGRAPH: ACM SIGGRAPH Papers, New York, NY, USA. ACM Press.

William W. Carlson, Jesse M. Draper, David E. Culler, Kathy Yelick, Eugene Brooks, and Karen Warren. Introduction to UPC and language specification. Technical Report CCS-TR (second printing), IDA Center for Computing Sciences, May.
Brian D. Carlstrom, Austen McDonald, Hassan Chafi, JaeWoong Chung, Chi Cao Minh, Christoforos E. Kozyrakis, and Kunle Olukotun. The Atomos transactional programming language. In Proc. Conf. Prog. Lang. Design and Impl. (PLDI). ACM, June.
Nicholas Carriero and David Gelernter. Linda in context. Commun. ACM.
Bradford L. Chamberlain, David Callahan, and Hans P. Zima. Parallel programmability and the Chapel language. Submitted.
Murray Cole. Bringing skeletons out of the closet: A pragmatic manifesto for skeletal parallel programming. Parallel Computing.
Murray I. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. Pitman and MIT Press.
Richard Cole and Ofer Zajicek. The APRAM: Incorporating Asynchrony into the PRAM model. In Proc. Annual ACM Symp. Parallel Algorithms and Architectures.
David E. Culler, Richard M. Karp, David A. Patterson, Abhijit Sahay, Klaus E. Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel computation. In Proc. Principles and Practice of Parallel Programming.
J. Darlington, A. J. Field, P. G. Harrison, P. H. B. Kelly, D. W. N. Sharp, and Q. Wu. Parallel Programming Using Skeleton Functions. In Proc. Conf. Parallel Architectures and Languages Europe. Springer LNCS.
J. Darlington, Y. Guo, H. W. To, and J. Yang. Parallel skeletons for structured composition. In Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming. ACM Press, July. SIGPLAN Notices.
K. M. Decker and R. M. Rehmann, editors. Programming Environments for Massively Parallel Distributed Systems. Birkhäuser, Basel, Switzerland. Proc. IFIP WG Working Conf. at Monte Verita, Ascona, Switzerland, April.
F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. In Proc. ACM Symp. on Computational Geometry.
Beniamino di Martino and Christoph W. Keßler. Two program comprehension tools for automatic parallelization. IEEE Concurrency, Spring.
Dietmar W. Erwin and David F. Snelling. UNICORE: A grid computing environment. In Proc. Intl. Conference on Parallel Processing (Euro-Par), London, UK. Springer-Verlag.
Vijay Saraswat et al. Report on the experimental language X10, draft. White paper, http://www.research.ibm.com/comm/research_projects.nsf/pages/x10.index.html, February.
Joel Falcou and Jocelyn Serot. Formal semantics applied to the implementation of a skeleton-based parallel programming library. In Proc. ParCo. IOS Press.
Martti Forsell. A scalable high-performance computing solution for networks on chips. IEEE Micro, September.
S. Fortune and J. Wyllie. Parallelism in random access machines. In Proc. Annual ACM Symp. Theory of Computing.

Ian Foster. Designing and Building Parallel Programs. Addison-Wesley.
Ian Foster. Globus Toolkit version 4: Software for service-oriented systems. In Proc. IFIP Intl. Conf. Network and Parallel Computing, LNCS. Springer.
Ian Foster, Carl Kesselman, and Steven Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. Intl. J. High Performance Computing Applications.
Phillip B. Gibbons. A More Practical PRAM Model. In Proc. Annual ACM Symp. Parallel Algorithms and Architectures.
W. K. Giloi. Parallel Programming Models and Their Interdependence with Parallel Architectures. In Proc. Int. Conf. Massively Parallel Programming Models. IEEE Computer Society Press.
Sergei Gorlatch and Susanna Pelagatti. A transformational framework for skeletal programs: Overview and case study. In Jose Rolim et al., editor, IPPS/SPDP Workshops Proceedings, IEEE Int. Parallel Processing Symp. and Symp. Parallel and Distributed Processing. Springer LNCS.
Allan Gottlieb. An overview of the NYU Ultracomputer project. In J. J. Dongarra, editor, Experimental Parallel Computing Architectures. Elsevier Science Publishers.
Susanne E. Hambrusch. Models for Parallel Computation. In Proc. Int. Conf. Parallel Processing, Workshop on Challenges for Parallel Processing.
Philip J. Hatcher and Michael J. Quinn. Data-Parallel Programming on MIMD Computers. MIT Press.
Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural support for lock-free data structures. In Proc. Int. Symp. Computer Architecture.
Heywood and Leopold. Dynamic randomized simulation of hierarchical PRAMs on meshes. In Proc. Aizu International Symposium on Parallel Algorithms/Architecture Synthesis. IEEE Computer Society Press.
Jonathan M. D. Hill, Bill McColl, Dan C. Stefanescu, Mark W. Goudreau, Kevin Lang, Satish B. Rao, Torsten Suel, Thanasis Tsantilas, and Rob Bisseling. BSPlib: the BSP Programming Library. Parallel Computing.
Rolf Hoffmann, Klaus-Peter Völkmann, Stefan Waldschmidt, and Wolfgang Heenes. GCA: Global Cellular Automata, a Flexible Parallel Model. In PaCT: Proc. International Conference on Parallel Computing Technologies, London, UK. Springer-Verlag.
Kenneth E. Iverson. A Programming Language. Wiley, New York.
Joseph JaJa. An Introduction to Parallel Algorithms. Addison-Wesley.
Johannes Jendrsczok, Rolf Hoffmann, Patrick Ediger, and Jörg Keller. Implementing APL-like data parallel functions on a GCA machine. In Proc. Workshop Parallel Algorithms and Computing Systems (PARS).
Jörg Keller, Christoph Kessler, and Jesper Träff. Practical PRAM Programming. Wiley, New York.
Ken Kennedy, Charles Koelbel, and Hans Zima. The rise and fall of High Performance Fortran: an historical object lesson. In Proc. Int. Symposium on the History of Programming Languages (HOPL III), June.
Christoph Kessler. Managing distributed shared arrays in a bulk-synchronous parallel environment. Concurrency: Practice and Experience.
Christoph Kessler. Teaching parallel programming early. In Proc. Workshop on Developing Computer Science Education: How Can It Be Done?, Linköpings universitet, Sweden, March.
Christoph Kessler and Andrzej Bednarski. Optimal integrated code generation for VLIW architectures. Concurrency and Computation: Practice and Experience.
Christoph W. Keßler. NestStep: Nested Parallelism and Virtual Shared Memory for the BSP model. The Journal of Supercomputing.

Christoph W. Kessler. A practical access to the theory of parallel algorithms. In Proc. ACM SIGCSE Symposium on Computer Science Education, March.
Christoph W. Keßler and Helmut Seidl. The Fork Parallel Programming Language: Design, Implementation, Application. Int. J. Parallel Programming, February.
Herbert Kuchen. A skeleton library. In Proc. Euro-Par.
H. T. Kung and C. E. Leiserson. Algorithms for VLSI processor arrays. In C. Mead and L. Conway, editors, Introduction to VLSI Systems. Addison-Wesley.
Christian Lengauer. A personal, historical perspective of parallel programming for high performance. In Günter Hommel, editor, Communication-Based Systems (CBS). Kluwer.
Claudia Leopold. Parallel and Distributed Computing: A survey of models, paradigms and approaches. Wiley, New York.
K.-C. Li and H. Schwetman. Vector C: A Vector Processing Language. J. Parallel and Distrib. Comput.
L. Fava Lindon and S. G. Akl. An optimal implementation of broadcasting with selective reduction. IEEE Trans. Parallel Distrib. Syst.
B. M. Maggs, L. R. Matheson, and R. E. Tarjan. Models of Parallel Computation: a Survey and Synthesis. In Proc. Annual Hawaii Int. Conf. System Sciences, January.
W. F. McColl. General Purpose Parallel Computing. In A. M. Gibbons and P. Spirakis, editors, Lectures on Parallel Computation: Proc. ALCOM Spring School on Parallel Computation. Cambridge University Press.
Wolfgang J. Paul, Peter Bach, Michael Bosch, Jörg Fischer, Cédric Lichtenau, and Jochen Röhrig. Real PRAM programming. In Proc. Int. Euro-Par Conf., August.
Susanna Pelagatti. Structured Development of Parallel Programs. Taylor & Francis.
Michael Philippsen and Walter F. Tichy. Modula-2* and its Compilation. In Proc. Int. Conf. of the Austrian Center for Parallel Computation. Springer LNCS.
Thomas Rauber and Gudula Rünger. Tlib: a library to support programming with hierarchical multiprocessor tasks. J. Parallel and Distrib. Comput., March.

Lawrence Rauchwerger and David Padua. The Privatizing DOALL Test: A Run-Time Technique for DOALL Loop Identification and Array Privatization. In Proc. ACM Int. Conf. Supercomputing. ACM Press, July.
Lawrence Rauchwerger and David Padua. The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation. ACM Press, June.
J. Rose and G. Steele. C*: an Extended C Language for Data Parallel Programming. Technical Report PL, Thinking Machines Inc., Cambridge, MA.
D. B. Skillicorn. Models for Practical Parallel Computation. Int. J. Parallel Programming.
D. B. Skillicorn. miniBSP: a BSP Language and Transformation System. Technical report, Dept. of Computing and Information Sciences, Queen's University, Kingston, Canada, Oct. http://www.qucis.queensu.ca/home/skill/
David B. Skillicorn and Domenico Talia, editors. Programming Languages for Parallel Processing. IEEE Computer Society Press.
David B. Skillicorn and Domenico Talia. Models and Languages for Parallel Computation. ACM Computing Surveys, June.
Lawrence Snyder. The design and development of ZPL. In Proc. ACM SIGPLAN Third Symposium on History of Programming Languages (HOPL-III). ACM Press, June.
Håkan Sundell and Philippas Tsigas. NOBLE: A non-blocking inter-process communication library. Technical Report, Dept. of Computer Science, Chalmers University of Technology and Göteborg University, Göteborg, Sweden.
Leslie G. Valiant. A Bridging Model for Parallel Computation. Comm. ACM, August.
Fredrik Warg. Techniques to reduce thread-level speculation overhead. PhD thesis, Dept. of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden.
Xingzhi Wen and Uzi Vishkin. PRAM-on-chip: first commitment to silicon. In SPAA: Proc. Annual ACM Symposium on Parallel Algorithms and Architectures, New York, NY, USA. ACM.
Wayne Wolf. Guest editor's introduction: The embedded systems landscape. Computer.