On the Coexistence of Shared-Memory and Message-Passing in the Programming of Parallel Applications

J. Cordsen¹ and W. Schröder-Preikschat²

¹ GMD FIRST, Rudower Chaussee, Berlin, Germany
² University of Potsdam, Am Neuen Palais, Potsdam, Germany

Abstract. Interoperability in non-sequential applications requires communication to exchange information using either the shared-memory or the message-passing paradigm. In the past, the communication paradigm in use was determined by the architecture of the underlying computing platform: shared-memory computing systems were programmed to use shared-memory communication, whereas distributed-memory architectures were running applications communicating via message-passing. Current trends in the architecture of parallel machines are based on shared-memory and distributed-memory. For scalable parallel applications, in order to maintain transparency and efficiency, both communication paradigms have to coexist. Users should not be obliged to know when to use which of the two paradigms. On the other hand, the user should be able to exploit either of the paradigms directly in order to achieve the best possible solution.

The paper presents the VOTE communication support system. VOTE provides coexistent implementations of shared-memory and message-passing communication. Applications can change the communication paradigm dynamically at runtime and are thus able to employ the underlying computing system in the most convenient and application-oriented way. The presented case study and detailed performance analysis underpin the applicability of the VOTE concepts for high-performance computing.

1 Introduction

Future parallel machines will exhibit a hybrid architecture based on shared memory and distributed memory. This line is already followed by all state-of-the-art special-purpose parallel computers. The same holds for parallel machines based on workstation clusters: actual high-performance workstations are shared-memory multiprocessor systems.

Beside the difficulty of developing scalable and efficient parallel programs at all, the problem of an application programmer is to exploit the various computer architectures in the most efficient way. Due to false sharing, relying on the shared-memory paradigm as introduced by a virtual shared-memory (VSM) system results in a decrease of end-user performance when running in a distributed environment. Vice versa, relying on a distributed-memory paradigm, i.e. message-passing, results in a decrease of end-user performance when running in a shared-memory environment. Thus, in the attempt of developing scalable and portable parallel application software, programmers are confronted with a tradeoff.

The purpose of the VOTE system is to support application programmers confronted with the tradeoff between shared-memory and message-passing communication. Shared-memory communication is physically limited to operate locally only. Therefore, to implement a global shared-memory communication facility, VOTE supports a highly efficient core system. The core system implements a global address space by means of VSM. By default, a multiple reader/single writer, invalidation-based sequential consistency maintenance is provided. A flexible set of functions is available, allowing many powerful optimizations of programs running under control of sequential consistency. As a step beyond, VOTE supports message-passing communication (MPI standard) through a functional enrichment of its VSM system. Thus VOTE provides to the application programmer architectural transparency by means of sequential consistency as well as a message-passing communication system.

The rest of this paper is organized as follows. Section 2 deals with the principal problem of supporting both communication paradigms. Afterwards, Section 3 covers an application case study, demonstrating an implementation of successive overrelaxation that dynamically changes between the two communication paradigms. Section 4 discusses the achieved runtime results. Related work is presented in Section 5, and Section 6 concludes the paper.

2 Communication Paradigms

Today's trend in the design of scalable high-performance computing systems is based on a cooperation between software and hardware to support both an efficient shared-memory and an efficient message-passing abstraction. The main problem of this cooperation is the incompatibility of the two views of consistency semantics as given by the two communication paradigms.

Message-passing is an explicit communication style lacking any measures of implicit consistency maintenance. Global effects, in the sense of making changes to some logically shared but physically distributed state, are achieved only when the respective parallel application program explicitly calls communication functions. In contrast, changing portions of the logically shared but physically local state works implicitly, by simply reading/writing the memory cells, i.e. program variables.
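As a minimal illustration of this contrast (plain MPI, independent of VOTE): under the shared-memory paradigm the assignment alone would already be the communication, whereas message-passing requires explicit calls to make the effect global.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, x = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        x = 42;   /* under VSM, this write alone would become globally visible */
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* here it must be sent */
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", x);
    }
    MPI_Finalize();
    return 0;
}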

The shared-memory communication paradigm is a single reader/single writer (SRSW) scheme, occasionally extended by cache-based shared-memory machines to a multiple reader/single writer scheme. Every modifying memory access works implicitly. However, since data replication usually takes place for performance reasons, this paradigm requires either invalidating or updating the data copies held in the caches or the memories. These measures of consistency maintenance implicitly influence the operation of all processes and generally impact the runtime behavior of a shared-memory parallel program.
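A minimal sketch of what invalidation-based multiple reader/single writer maintenance can look like at page granularity; the directory layout and the helpers send_invalidate, wait_acks, and grant_write are assumptions for illustration, not the VOTE implementation.

#define NPROCS 32                    /* assumed upper bound on processes */

typedef struct { int owner; unsigned copyset; } dir_entry;  /* bitmask of copy holders */

extern void send_invalidate(int proc, int page);   /* hypothetical transport calls */
extern void wait_acks(unsigned procs);
extern void grant_write(int proc, int page);

void write_fault(dir_entry *d, int writer, int page)
{
    int p;

    /* revoke every replica except the writer's own copy */
    for (p = 0; p < NPROCS; p++)
        if (p != writer && (d->copyset & (1u << p)))
            send_invalidate(p, page);
    wait_acks(d->copyset & ~(1u << writer));
    d->copyset = 1u << writer;       /* back to a single (writing) copy */
    d->owner   = writer;
    grant_write(writer, page);
}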

A paradigm shift from shared-memory to message-passing communication is almost straightforward. It makes public (i.e. global, shared) data private (i.e. local) and thus non-shared. The process shifting to the message-passing paradigm is deleted from the directory information maintained by the VSM system to keep track of potential data copy holders. Due to the paradigm shift, programming complexity increases significantly: running under control of the shared-memory paradigm, consistency was ensured by the VSM system, whereas in the message-passing case the programmer becomes responsible for maintaining a globally consistent view of the common data sets.

A paradigm shift from message-passing to shared-memory is not that straightforward. The main problem is the unification of the local data copies into a single data image. In other words, private data copies are made public or invalidated, and the directory information is updated. In most cases this task cannot be performed automatically, because of the lack of information about the order of modifications done by the individual processes. In the current implementation of VOTE it is therefore up to the application program to announce the locality of consistent data.
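A sketch of what such an announcement amounts to at the directory level; the helper names dir_set_owner and dir_drop_copies are hypothetical and only illustrate the bookkeeping described above.

extern void dir_set_owner(void *addr, unsigned len, int proc);   /* hypothetical */
extern void dir_drop_copies(void *addr, unsigned len, int keep); /* hypothetical */

/* Called by the one process whose private copy of [addr, addr+len) is to
   become the authoritative VSM image again. */
void announce_locality(int self, void *addr, unsigned len)
{
    dir_set_owner(addr, len, self);    /* directory: self now owns the region */
    dir_drop_copies(addr, len, self);  /* stale private copies elsewhere are forgotten */
}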

3 Case Study: Successive Overrelaxation

This section presents two implementations of a successive overrelaxation algorithm based on the red-black technique. This kind of application is a typical representative of parallel numerical algorithms operating on dense data structures. The amount of data being shared is small and increases only with the number of computing nodes. Data sharing is performed only by a pair of processes; that is, each process communicates with two other processes, the predecessor and the successor. Merely the first and the last process communicate with a single process only.

The following subsections present two implementations of an SPMD-based (single program, multiple data) successive overrelaxation program. The first code communicates via shared memory and uses a global barrier synchronization. The second code takes full advantage of the communication support system VOTE; that is, it changes the communication paradigm from shared-memory to message-passing and vice versa.

3.1 Shared-Memory Communication

The code example in Fig. 1 uses shared-memory communication and is a very compact implementation of a successive overrelaxation. The red-black technique is used to avoid overwriting the old value of a matrix element before it is used for the computation of the new values of the respective neighbor. Therefore, computation is done only on every other row per iteration. This approach requires two global barrier synchronizations instead of one.

void sor(float r[M][N], float b[M][N], int f, int t)
{
    int j, k;

    while (!converged(r, b, f, t)) {
        /* update b from r; stencil indices and coefficients reconstructed,
           the digits were lost in the source text */
        for (j = f; j < t; j += 2) {
            for (k = 1; k < N-1; k++)
                b[j][k] = 0.25f * (r[j][k-1] + r[j][k+1] + r[j-1][k] + r[j+1][k]);
            if (j+1 >= t) break;
            for (k = 1; k < N-1; k++)
                b[j+1][k] = 0.25f * (r[j+1][k-1] + r[j+1][k+1] + r[j][k] + r[j+2][k]);
        }
        vote_barrier();
        /* update r from the freshly computed b values */
        for (j = f; j < t; j += 2) {
            for (k = 1; k < N-1; k++)
                r[j][k] = 0.25f * (b[j][k-1] + b[j][k+1] + b[j-1][k] + b[j+1][k]);
            if (j+1 >= t) break;
            for (k = 1; k < N-1; k++)
                r[j+1][k] = 0.25f * (b[j+1][k-1] + b[j+1][k+1] + b[j][k] + b[j+2][k]);
        }
        vote_barrier();
    }
}

Fig. 1. Shared-memory version

The algorithm continues executing the outer loop until the check of convergence evaluates to true.
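The paper does not show the convergence test, so the body below is a hypothetical sketch, assuming the matrix dimensions M and N and the tolerance EPS are compile-time constants as in the listings. Note that a real SPMD run would additionally need a global reduction (e.g. MPI_Allreduce) so that all processes agree on the outcome.

#include <math.h>

#define EPS 1.0e-4f                 /* assumed tolerance, not given in the paper */

int converged(float r[M][N], float b[M][N], int f, int t)
{
    float maxdiff = 0.0f;
    int j, k;

    for (j = f; j < t; j++)          /* scan only the rows owned by this process */
        for (k = 1; k < N-1; k++) {
            float d = fabsf(r[j][k] - b[j][k]);
            if (d > maxdiff) maxdiff = d;
        }
    return maxdiff < EPS;            /* local test; SPMD code needs a global reduction */
}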

Variables f and t implement a block distribution of the matrices r and b. These variables are calculated on a per-process basis and are a function of the size of a matrix, the total number of processes, and the logical identification of a process. The computation steps of the two loop nests produce data used as input for the processes with adjacent values of the variables f and t. When the barrier synchronization falls, that data is consumed by the communication partner.
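A minimal sketch of such a block distribution; the helper name block_range and the ceiling-division policy are assumptions, as the paper does not show this computation.

void block_range(int id, int procs, int rows, int *f, int *t)
{
    int chunk = (rows - 2 + procs - 1) / procs;  /* interior rows, divided evenly */

    *f = 1 + id * chunk;                         /* first row owned by process id */
    *t = *f + chunk;                             /* one past the last owned row */
    if (*t > rows - 1) *t = rows - 1;            /* the last block may be smaller */
}

A process would call, for instance, block_range(Id, Procs, M, &f, &t) once before entering sor.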

3.2 Message-Passing Communication

The second implementation, shown in Fig. 2, comprises three parts. The first part implements the switch from shared-memory to message-passing communication. This requires a barrier synchronization making sure that all processes are ready to leave the consistency maintenance of VOTE. The calls to MPI_Comm_rank and MPI_Comm_size compute the working set of a process from its logical identification and the total number of computing processes. Calling vote_access ensures that the processes get access to the memory regions with read/write permission. Note that overlapping memory regions are duplicated into different address spaces, with each of the duplicated regions being given read/write permission. As a side effect, VOTE becomes unable to guarantee VSM consistency maintenance. Finally, a barrier synchronization guarantees that all processes will have received private copies of the former VSM data before the computation, i.e. the second phase, is entered.

The second phase is a computation loop performing according to the successive overrelaxation scheme. Computation is identical to the shared-memory version, except that the global barrier synchronizations are replaced by sequences of MPI calls for communication.

void sor(float r[M][N], float b[M][N], int f, int t)
{
    int i, j, k, Id, Procs;
    MPI_Request req;                 /* renamed from "r" to avoid clashing with the matrix */
    MPI_Status  s;

    /* phase 1: leave VSM consistency maintenance */
    vote_barrier();
    MPI_Comm_rank(MPI_COMM_WORLD, &Id);
    MPI_Comm_size(MPI_COMM_WORLD, &Procs);
    if (Id != 0) i = f - 1; else i = f;          /* include predecessor's boundary row */
    if (Id != Procs-1) j = t; else j = t - 1;    /* include successor's boundary row */
    vote_access(b[i], b[j] + N, ReadWrite);      /* region bounds reconstructed */
    vote_access(r[i], r[j] + N, ReadWrite);
    vote_barrier();

    /* phase 2: compute, exchanging boundary rows via MPI
       (destination ranks and message tags reconstructed) */
    while (!converged(r, b, f, t)) {
        for (j = f; j < t; j += 2) {
            for (k = 1; k < N-1; k++)
                b[j][k] = 0.25f * (r[j][k-1] + r[j][k+1] + r[j-1][k] + r[j+1][k]);
            if (j+1 >= t) break;
            for (k = 1; k < N-1; k++)
                b[j+1][k] = 0.25f * (r[j+1][k-1] + r[j+1][k+1] + r[j][k] + r[j+2][k]);
        }
        if (Id != 0)       MPI_Isend(b[f],   N, MPI_FLOAT, Id-1, 0, MPI_COMM_WORLD, &req);
        if (Id != Procs-1) MPI_Recv (b[t],   N, MPI_FLOAT, Id+1, 0, MPI_COMM_WORLD, &s);
        if (Id != Procs-1) MPI_Isend(b[t-1], N, MPI_FLOAT, Id+1, 1, MPI_COMM_WORLD, &req);
        if (Id != 0)       MPI_Recv (b[f-1], N, MPI_FLOAT, Id-1, 1, MPI_COMM_WORLD, &s);
        for (j = f; j < t; j += 2) {
            for (k = 1; k < N-1; k++)
                r[j][k] = 0.25f * (b[j][k-1] + b[j][k+1] + b[j-1][k] + b[j+1][k]);
            if (j+1 >= t) break;
            for (k = 1; k < N-1; k++)
                r[j+1][k] = 0.25f * (b[j+1][k-1] + b[j+1][k+1] + b[j][k] + b[j+2][k]);
        }
        if (Id != 0)       MPI_Isend(r[f],   N, MPI_FLOAT, Id-1, 2, MPI_COMM_WORLD, &req);
        if (Id != Procs-1) MPI_Recv (r[t],   N, MPI_FLOAT, Id+1, 2, MPI_COMM_WORLD, &s);
        if (Id != Procs-1) MPI_Isend(r[t-1], N, MPI_FLOAT, Id+1, 3, MPI_COMM_WORLD, &req);
        if (Id != 0)       MPI_Recv (r[f-1], N, MPI_FLOAT, Id-1, 3, MPI_COMM_WORLD, &s);
    }

    /* phase 3: return to VSM consistency maintenance */
    if (Id != 0)       vote_inform(b[f-1], b[f-1] + N, NoAccess);
    if (Id != 0)       vote_inform(r[f-1], r[f-1] + N, NoAccess);
    if (Id != Procs-1) vote_inform(b[t],   b[t]   + N, NoAccess);
    if (Id != Procs-1) vote_inform(r[t],   r[t]   + N, NoAccess);
    vote_barrier();
}

Fig. 2. Message-passing version including switches

In particular, MPI_Isend and MPI_Recv are used to exchange the relevant data between the communicating processes. Sending data using MPI_Isend is non-blocking, while MPI_Recv blocks the caller until the data has been received. This procedure, the pair of a non-blocking send and a blocking receive, guarantees synchronization proper for ensuring a pairwise ordering of the memory operations of the processes producing and consuming the same data regions. Because, compared to the shared-memory solution, explicit global barrier synchronization is no longer required, the message-passing solution exhibits increased concurrency.
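One caveat: the listing above never completes its pending send requests, and a strict MPI program must eventually wait on them before reusing the buffers. An assumed helper capturing the pattern could look as follows (N, as in the listings, is the row length).

void exchange_rows(float *send_row, float *recv_row, int peer, int tag)
{
    MPI_Request req;
    MPI_Status  st;

    MPI_Isend(send_row, N, MPI_FLOAT, peer, tag, MPI_COMM_WORLD, &req);  /* non-blocking */
    MPI_Recv (recv_row, N, MPI_FLOAT, peer, tag, MPI_COMM_WORLD, &st);   /* blocking */
    MPI_Wait (&req, &st);   /* complete the send before the buffer is reused */
}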

The final phase switches back to the sequential consistency maintenance provided by the VSM system, and the block distribution of the shared data is re-established. This is done by removing the overlapping memory regions with the calls to vote_inform. These calls instruct VOTE to update its directory information, establishing a consistent mapping of the ownerships of memory regions. Since shared-memory processing is allowed to start only when the directory information has been updated entirely, global synchronization becomes necessary; therefore the switch concludes with a final barrier synchronization.

4 Performance Results

The discussion of the results obtained is somewhat difficult, since the performance of the program based on the shared-memory communication of VOTE could not be compared directly with the results of other VSM systems. Only very few VSM systems are available, and their performance results are not competitive with VOTE because they all run on top of fairly heavyweight microkernel-based operating systems. In contrast, VOTE is implemented as part of the parallel operating system family Peace and runs on the parallel MANNA computing system.

[Table 1. Comparison of access fault handling times: Mether (SunOS, MC processor), Munin (V, MC processor), Myoan (OSF/1, i860 XP), and VOTE (Peace, i860 XP); the columns give CPU clock (MHz), network bandwidth (MB/s), and handling time (ms).]

Table 1 summarizes the performance of handling a read access fault in various systems. Mether and Munin run on rather old hardware, but a comparison of Myoan and VOTE is fair because of the same type of CPU used in both systems. Myoan runs on the Intel Paragon machine, with a network throughput more than four times better than the throughput of the MANNA communication network. Yet the performance of VOTE is more than six times better than handling a read access fault in Myoan, although communication and data transfer are responsible for a large share of the total costs.

In the following, VOTE is compared with the performance of typical message-passing systems that run on the MANNA computing system. First, the overheads present in the successive overrelaxation algorithm are presented. Afterwards, the results achieved with VOTE are compared to both a synchronous and an asynchronous implementation on top of the Parix message-passing system.

4.1 VOTE, PVM, and MPI

On the MANNA computing system, the performance of the PVM and the MPI programming system differs quite a lot. The PVM system is designed and implemented explicitly to run on top of the Peace operating system. In contrast, the MPI package is a port of the abstract device implementation of MPI. This package was easily ported, but it shows up with poor runtime results.

Fig. 3 shows the communication overheads when the successive overrelaxation runs one hundred cycles of the outer loop.

[Fig. 3. Overheads in VOTE, PVM, and MPI: overhead in milliseconds (200 to 800) versus number of processes (2 to 16).]

The overheads are due to the necessary communication, with all three systems using an asynchronous function to send data and a blocking function to receive data. Each loop iteration requires both sending and receiving data four times. Only the two processes with the lowest and the highest logical identification do half the communication.

At first glance it becomes clear that the performance achieved through MPI is out of the question, mainly because the transmission of data always involves two communications across the network: a first message carries a header, and a second message transfers the data. The curves of PVM and VOTE have the same shape, with VOTE performing about one hundred milliseconds better, i.e. with noticeably less overhead. The explanation is quite simple: VOTE avoids the creation of copies of the data being transferred to a remote process.

Avoiding data copying becomes possible because VOTE controls both the message-passing functions and the management of memory resources. When data arrives, the status of the addressed process is checked. If the process is ready to receive, the data is written directly into the provided address space. Otherwise a copy is made and the data is intermediately buffered. In contrast, the PVM system always copies the data on both the sending and the receiving site.
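A sketch of the receive-path decision just described; the types and helper functions are assumptions for illustration, not the actual VOTE kernel interface.

#include <string.h>

typedef struct proc proc_t;
typedef struct { int dest; int len; char data[]; } msg_t;

extern proc_t *lookup(int dest);              /* hypothetical kernel helpers */
extern int     ready_to_receive(proc_t *p);
extern void   *posted_buffer(proc_t *p);
extern void    wakeup(proc_t *p);
extern void    enqueue_copy(proc_t *p, msg_t *m);

void deliver(msg_t *m)
{
    proc_t *p = lookup(m->dest);

    if (ready_to_receive(p)) {
        memcpy(posted_buffer(p), m->data, m->len);  /* straight into the user buffer */
        wakeup(p);
    } else {
        enqueue_copy(p, m);   /* receiver not ready: buffer an intermediate copy */
    }
}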

4.2 VOTE vs. Parix

For a comparison, a PowerXplorer running the Parix operating system has been chosen, because Parix is a communication and runtime library like Peace. This avoids comparing apples with oranges, as would be the case with a heavyweight microkernel approach on the one hand and a high-speed, lightweight runtime executive on the other hand.

The computing systems that were used to run VOTE and Parix have different hardware and thus different performance capabilities. VOTE runs on the i860 XP-based parallel MANNA machine, and Parix uses the PowerXplorer based on a PowerPC. The performance capabilities of the two computing systems show some significant differences. In the Parix environment, end-to-end communication between partners proceeds at a much lower rate than in the VOTE environment; for a given message size, a transfer takes considerably longer on top of Parix than on top of VOTE. On the other hand, computation is faster compared to the results achievable on top of VOTE: a PowerPC computes nearly twice as fast as an i860 XP processor, due to a higher clock speed and a better compiler. Having both a faster processor and a slower communication network results in bad speedup figures. Therefore, Fig. 4 shows runtime results and not speedup results.

[Fig. 4. Comparison of VOTE and Parix results: runtime in seconds (0 to 120) versus number of processes (2 to 16), for VOTE with shared-memory and with message-passing communication and for Parix with synchronous and with asynchronous communication.]

Fig. 4 presents four curves, which may be grouped into two pairs. The first pair consists of the curve of a sequentially consistent execution on top of VOTE and the runtime of the Parix implementation using a synchronous implementation of the send function. Observing the course of these two curves, the lack of scalability is already visible for very small numbers of processes. The second pair is given by the implementations that use asynchronous send functions. These curves show a better scalability and promise further runtime improvements for higher numbers of computing processes.

5 Related Work

In the sector of hardware-supported distributed shared-memory (DSM) systems, the Alewife and Flash multiprocessors are designed to support in hardware both a shared address space and a general message-passing interface. The two communication paradigms are implemented through a controller using the same FIFO pipelines to pass messages as well as memory operations. This implements a global ordering of events and guarantees the same memory consistency semantics for both communication paradigms. The communication interface of Alewife provides sequential consistency, and that of Flash provides release consistency.

The CarlOS system implements in software a message-driven, release-consistent global address space. As in Alewife and Flash, the two communication paradigms are operated through a single integrated solution. Like Flash, CarlOS implements the release consistency model. Therefore, the user interface requires that shared-memory accesses as well as calls to message-passing functions are attributed with information about the type and scope of consistency required.

VOTE takes a different approach. Coexistence of communication paradigms does not mean integrating both approaches into a single one. Rather, VOTE supports interfaces to the two approaches and allows the use of whichever is likely to be most efficient for the communication in question. The point, however, is that VOTE provides independent implementations of the two communication paradigms and thus is able to support their natural operation models with varying memory consistency models.

6 Conclusion

The future parallel computer architecture interconnects shared-memory multiprocessor systems on a networking, i.e. message-passing, basis. This architecture calls for two programming paradigms: shared-memory and message-passing. For scalable parallel applications, in order to maintain transparency and efficiency, both paradigms have to coexist. Users should not be obliged to know when to use which of the two paradigms. On the other hand, the user should not be prevented from exploiting either of the paradigms directly in order to achieve the best possible solution to his problems.

The VOTE system allows employment of both a shared-memory and a distributed-memory environment in a convenient way. VOTE offers a sequentially consistent VSM system with a highly efficient access fault handling scheme. A set of additional performance enhancement techniques implements many powerful optimizations for parallel applications based on sequentially consistent shared-memory communication. Beside the global address space and its implicit consistency maintenance scheme, message-passing communication functions are supported as well. A programmer is able to choose the communication paradigm best suited to match the specific application demands and to use the underlying computing system in the most convenient way.

The performance numbers presented in this paper prove a high quality standard of both the shared-memory and the message-passing communication performance of VOTE. VSM systems running on comparable hardware platforms are clearly outperformed by a factor of six, and message-passing packages running on the same platform run noticeably slower. These results indicate that the VOTE symbiosis of a transparent and efficient communication support system is a viable way to meet the programming challenges of the hybrid memory architecture of future parallel hardware platforms, i.e. networks of shared-memory multiprocessor systems.

References

1. L.M. Adams, H.F. Jordan: Is SOR Color-Blind? SIAM Journal on Scientific and Statistical Computing.
2. U. Brüning, W.K. Giloi, W. Schröder-Preikschat: Latency Hiding in Message-Passing Architectures. In: Proceedings of the International Parallel Processing Symposium (IPPS), Cancún, Mexico.
3. G. Cabillic, T. Priol, I. Puaut: MYOAN: An Implementation of the KOAN Shared Virtual Memory on the Intel Paragon. Technical Report, IRISA, Rennes, France.
4. J. Cordsen (ed.): The SODA Project. Studien der GMD, GMD.
5. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, J. Hennessy: Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. In: Proceedings of the 17th International Symposium on Computer Architecture, 1990.
6. W. Gropp, E. Lusk: An Abstract Device Definition to Support the Implementation of a High-Level Point-to-Point Message-Passing Interface. Technical Report, Argonne National Laboratory, Argonne, IL.
7. W. Gropp, E. Lusk, A. Skjellum: Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.
8. P.T. Koch, R.J. Fowler, E. Jul: Message-Driven Relaxed Consistency in a Software Distributed Shared Memory. In: Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI), 1994.
9. D. Kranz, K. Johnson, A. Agarwal, J. Kubiatowicz, B.-H. Lim: Integrating Message-Passing and Shared-Memory: Early Experience. In: Proceedings of the Symposium on Principles and Practice of Parallel Programming (PPoPP), 1993.
10. J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, J. Hennessy: The Stanford FLASH Multiprocessor. In: Proceedings of the 21st Annual International Symposium on Computer Architecture, 1994.
11. L. Lamport: How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Transactions on Computers, Vol. C-28, No. 9, 1979.
12. Parsytec: Parix Programmer's Manual. Parsytec Computer GmbH, Aachen, Germany.
13. A. Geist, A. Beguelin, J.J. Dongarra, W. Jiang, R. Manchek, V.S. Sunderam: PVM: User's Guide and Reference Manual. Technical Report, Oak Ridge National Laboratory, Oak Ridge, USA.
14. W. Schröder-Preikschat: The Logical Design of Parallel Operating Systems. Prentice Hall, 1994.

This article was processed using the LaTeX macro package with LLNCS style.