Design of the Munin Distributed Shared Memory System

John B. Carter

Department of Computer Science
University of Utah
Salt Lake City, UT

This research was supported in part by the National Science Foundation, by the IBM Corporation, by the Texas Advanced Technology Program, and by a NASA Graduate Fellowship.


Proposed running head: Design of the Munin Distributed Shared Memory System

Contact author: John B. Carter
Department of Computer Science
University of Utah
Salt Lake City, UT

Abstract

Software distributed shared memory (DSM) is a software abstraction of shared memory on a distributed memory machine. The key problem in building an efficient DSM system is to reduce the amount of communication needed to keep the distributed memories consistent. The Munin DSM system incorporates a number of novel techniques for doing so, including the use of multiple consistency protocols and support for multiple concurrent writer protocols. Due to these and other features, Munin is able to achieve high performance on a variety of numerical applications. This paper contains a detailed description of the design and implementation of the Munin prototype, with special emphasis given to its novel write-shared protocol. Furthermore, it describes a number of lessons that we learned from our experience with the prototype implementation that are relevant to the implementation of future DSMs.

List of Symbols

Boldface, italics, and typewriter font are used often.

Special symbols used: a number of mathematical symbols are used, including slash (/), left arrow (←), left brace ({), right brace (}), and ellipsis (...).

1 Introduction

A software distributed shared memory (DSM) system provides the abstraction of a shared address space spanning the processors of a distributed memory multiprocessor. This abstraction simplifies the programming of distributed memory multiprocessors and allows parallel programs written for shared memory machines to be ported easily. The basic idea behind software DSM, which we will refer to as simply DSM throughout the rest of this paper, is to treat the local memory of a processor as if it were a coherent cache in a shared memory multiprocessor. DSM systems combine the best features of shared memory and distributed memory multiprocessors. They support the relatively simple and portable programming model of shared memory on physically distributed memory hardware, which is more scalable and less expensive to build than shared memory hardware. DSM would thus seem to be an ideal vehicle for making parallel processing widely available. However, although many DSM systems have been proposed and implemented, DSM is not widely used. The reason for this is that it has proven difficult to achieve acceptable performance using DSM without requiring programmers to carefully restructure their shared-memory parallel programs to reflect the way that the DSM operates. For example, it is often necessary to decompose the shared data into small page-aligned pieces or to introduce new variables to reduce the amount of sharing. This restructuring can be as tedious and difficult as using message passing directly, or more so.

The challenge in building a DSM system is to achieve good performance over a wide range of programs without requiring programmers to restructure their shared memory parallel programs. The overhead of maintaining consistency in software and the high latency of sending messages make this difficult. The primary source of DSM overhead is the large amount of communication that is required to maintain consistency. Since DSMs use general purpose networks (e.g., Ethernet) and operating systems to communicate, the latency of each message between nodes is high. Given this high cost of interprocessor communication, the performance challenge for DSM is to reduce the amount of communication performed during the execution of a DSM program, ideally to the same level as the amount of communication performed during the execution of an equivalent message passing program. DSM has not gained wide acceptance because previous DSM systems did not achieve this level of communication.

The Munin DSM system incorporates several techniques that make DSM a more viable solution for distributed processing by substantially reducing the amount of communication required to maintain consistency, compared to previous DSM systems. The innovative communication-reducing features supported by Munin include the use of multiple consistency protocols, which keep each shared variable consistent with a protocol well suited to its expected or observed access pattern, and support for multiple concurrent writers, which addresses the problem of false sharing by reducing the amount of unnecessary communication performed keeping falsely shared data consistent.

To determine the value of Munin's novel features, we evaluated the performance of a suite of seven scientific applications implemented using message passing, Munin DSM, and a conventional Ivy-like DSM system. Munin's performance is close to that of message passing for four out of the seven applications studied, and within a moderate margin for the other three. Furthermore, for the five applications in which there was a moderate to high degree of sharing, the programs run under Munin achieved substantially higher speedups than their conventional DSM counterparts.

Munin's performance on a variety of application programs is evaluated elsewhere; this paper concentrates on the design of Munin, especially its major data structures and the protocols used to maintain memory consistency. Also included are a number of lessons that we garnered from our experience with the prototype implementation that are relevant to the implementation of future DSMs.

The remainder of this paper is organized as follows. Section 2 presents an overview of Munin's runtime system. Section 3 describes Munin's multiple consistency protocols and the algorithms used to implement them. We summarize Munin's performance in Section 4 and draw conclusions in Section 5.

2 Overview of the Munin Runtime System

The core of the Munin system is the runtime library that contains the fault handling support, synchronization, and other runtime mechanisms. It consists of C source code that is compiled into a library file linked into each Munin program. Each node of an executing Munin program consists of a collection of Munin runtime threads that handle consistency and synchronization operations, and one or more user threads performing the parallel computation. Munin programmers write parallel programs using threads, as they would on a uniprocessor or shared memory multiprocessor.

Figure 1 illustrates the organization of a Munin program during runtime. The Munin prototype was implemented on a collection of sixteen SUN 3/60 workstations running the V operating system. On each participating node, the Munin runtime is linked into the same address space as the user program, and thus can access user data directly. The two major data structures used by the Munin runtime are an object directory, which maintains the state of the shared data being used by local user threads, and the delayed-update queue (DUQ), which manages Munin's software implementation of release consistency. Munin installs itself as the default page fault handler for the Munin program, so that the underlying V kernel will forward all memory exceptions to it for handling.

Each Munin node interacts with the V kernel to communicate with the other Munin nodes over the Ethernet and to manipulate the virtual memory system as part of maintaining the consistency of shared memory. Although the Munin prototype was implemented on the V system, it could be made to run on any operating system that allows user programs to manipulate their own virtual memory mappings, handle page faults at user level, create and destroy processes on remote nodes, exchange messages with remote processes, and access uniprocessor synchronization support (P and V). All of these requirements are met by current operating systems such as Mach, Chorus, and Unix.

Figure 1: Munin runtime organization. On each SUN 3/60 node, the user code and data are linked with the Munin runtime (containing the object directory and the DUQ) on top of the V kernel; the nodes communicate over a 10 Mbps Ethernet.

Munin's object directory is structured as a hash table that maps a virtual address to an entry that describes the data located at that address. In Munin, all variables on the same page are treated as a single object with the same consistency protocol. Programmers can guide the placement of variables on to pages, but by default variables are placed on pages by themselves. The memory overhead of this solution is negligible for the applications studied, as there were few distinct variables. When a Munin runtime thread cannot find an object directory entry in the local hash table, it requests a copy from the node where the computation started, the so-called root node, which contains all of the shared data at the start of the program. The fields of an entry in the directory include: a lock that provides exclusive access to the entry for a handler thread; a start address and size that act as keys for looking up a shared data item's directory entry; a protocol that specifies how the fault handler should service access faults on the data item; various state bits that characterize the dynamic state of the data, such as whether it is present locally and whether it is writable; a copyset that represents a best guess of the set of processors with copies of the data; and a probable owner that represents a best guess of the owner of the data.
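To make this structure concrete, the following C sketch shows how such a directory entry might be declared. The field names, the bitmask copyset, and the 32-node limit are our assumptions, not Munin's actual declarations.

    #include <stddef.h>
    #include <stdint.h>

    typedef volatile int lock_t;      /* placeholder spinlock word        */

    typedef enum {
        PROT_CONVENTIONAL, PROT_READ_ONLY, PROT_MIGRATORY, PROT_WRITE_SHARED
    } protocol_t;

    typedef struct dir_entry {
        lock_t      lock;             /* exclusive access for a handler   */
        char       *start_addr;       /* lookup key: start of data item   */
        size_t      size;             /* lookup key: size of data item    */
        protocol_t  protocol;         /* how access faults are serviced   */
        unsigned    present  : 1;     /* dynamic state bits               */
        unsigned    writable : 1;
        uint32_t    copyset;          /* best guess: nodes with copies    */
        int         probable_owner;   /* best guess: current owner node   */
    } dir_entry_t;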

The variable's copyset is the set of all nodes that the local node believes have a copy of the data. Nodes are added to a variable's copyset in several ways. If a remote node has a copy of the variable when the local node first accesses it, the reply message containing the data includes the remote node in the original copyset. Similarly, a node is added to the copyset when the local node handles a request from that node, or when the local node has been informed that the other node is caching the data as a side effect of some operation. It is possible for the copyset to include nodes no longer caching the data, because they may have flushed their copy without informing the local node, and it is also possible that nodes that are caching copies of the data are not in the copyset, because another node satisfied their load request. If a copy of the data resides locally, the transitive closure of the copysets (the local copyset, plus the copysets of the nodes in the local copyset, and so on) is guaranteed to be a superset of the set of nodes that have a copy of the data, because an entry in a copyset is only removed when that node informs the other nodes that it is no longer caching the data.
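These bookkeeping rules can be summarized in a few lines of C. This is a minimal sketch reusing the dir_entry_t declaration above; the helper names and bitmask representation are assumptions.

    /* Merge the sender's copyset (carried on a data reply) and the
     * sender itself into the local copyset. */
    void copyset_on_data_reply(dir_entry_t *e, uint32_t reply_copyset, int sender)
    {
        e->copyset |= reply_copyset | (1u << sender);
    }

    /* Serving a remote request means the requester now caches the data. */
    void copyset_on_remote_request(dir_entry_t *e, int requester)
    {
        e->copyset |= 1u << requester;
    }

    /* Entries are removed only when a node announces that it has stopped
     * caching the data; silent flushes leave stale (but safe) bits. */
    void copyset_on_flush_notice(dir_entry_t *e, int node)
    {
        e->copyset &= ~(1u << node);
    }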

The probable owner is used to determine the identity of the Munin node that currently owns the data. The owner node is used by the conventional and migratory protocols to arbitrate the decision of which node has write access to the data. For the write-shared protocol, the owner node represents the copy of last resort that cannot be unilaterally purged from memory (e.g., as part of the update timeout mechanism) without finding another node willing to become the owner of last resort. This approach is analogous to the copy of last resort used in cache-only multiprocessors.

3 Multiple Consistency Protocols

Previous DSM systems have employed a single protocol to maintain the consistency of all shared data. The specific protocol varied from system to system (e.g., Ivy supported a page-based emulation of a conventional hardware protocol, while Emerald used object-oriented language support to handle shared object invocations), but each system treated all shared data identically. This led to a situation where some programs could be handled effectively by a given DSM system while others could not, depending on the way in which shared data was accessed by the program. To understand how shared memory programs characteristically access shared data, we studied the access behavior of a suite of shared memory parallel programs. The results of this study and others support the notion that using the flexibility of a software implementation to support multiple consistency protocols can improve the performance of DSM. They also suggest the types of access patterns that should be supported. The results of those studies most relevant to the design of efficient DSM are summarized as follows:

- A single mechanism cannot optimally support all data access patterns, and the most important distinction between the observed data access patterns is between those best handled via some form of invalidate protocol and those best handled via some form of update protocol.

- The number of characteristic sharing patterns is small, and most shared data can be characterized as being accessed in one of these ways.

- Synchronization variables are accessed in an inherently different way than data variables, and are more sensitive to increased access latency.

- The characteristic access pattern of individual variables does not change frequently during execution, so a static protocol selection policy suffices in most cases.

The first three results strongly suggest that a DSM system that supports a small number of consistency protocols will outperform conventional DSM systems that support a single static protocol. A small number of data access patterns characterize most accesses to shared data. Thus, it is feasible to support sufficient consistency protocols so that most individual shared variables can be kept consistent with a protocol that is well suited to the way that they are characteristically accessed, without the DSM system becoming overly complicated. Furthermore, the results indicate that, at the very least, three protocols should be supported: one invalidation-based, one update-based, and one for synchronization.

Munin supports four consistency protocols (conventional, read-only, migratory, and write-shared), plus a suite of synchronization protocols for locks, barriers, and condition variables. In addition to the consistency protocols provided, we provide a capability for users to install their own fault handlers and use Munin's facilities to create consistency protocols of their own. When writing a Munin program, the programmer annotates the declaration of shared variables with a sharing pattern to specify what protocol to use to keep it consistent, e.g., shared{write-shared} C-type variable-name. If a variable is not annotated, the conventional protocol is used. Incorrect annotations may result in inefficient performance, or in runtime errors that are detected by the Munin runtime system, but not in incorrect behavior, if there is sufficient synchronization to satisfy the requirements of release consistency.
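For illustration, a few hypothetical annotated declarations in the style described above; the exact keyword spellings are assumptions based on the example in the text:

    shared{read-only}    double coefficients[1024];  /* written only at init  */
    shared{migratory}    struct queue work_queue;    /* guarded by a lock     */
    shared{write-shared} double grid[256][256];      /* disjoint writers      */
    shared               int    progress_count;      /* unannotated: defaults
                                                        to conventional       */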

The consistency protocols all have roughly the same structure. Consistency operations generally are initiated when a user thread attempts to access data that is either not present or that has been protected by the runtime system so that accesses to it generate exceptions. The fault handler examines the exception message to determine the location and nature of the exception, which it uses to look up the data item in the object directory. It then performs the consistency operations appropriate to the data's protocol. After performing the consistency operations required to satisfy the fault, the fault handler resumes the user thread and waits to receive its next exception or remote request message. The remainder of this section describes the implementation of Munin's consistency mechanisms. More detailed descriptions, including pseudocode of the algorithms, can be found elsewhere.
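This common structure can be sketched as a lookup-then-dispatch loop body. The sketch reuses the dir_entry_t declaration above, and every helper named here is hypothetical; it illustrates only the flow described in the text.

    /* Hypothetical helpers, assumed provided by the runtime: */
    dir_entry_t *directory_lookup(char *addr);
    dir_entry_t *fetch_entry_from_root(char *addr);
    void acquire(lock_t *l);
    void release(lock_t *l);
    void handle_conventional(dir_entry_t *e, char *addr, int is_write);
    void handle_read_only(dir_entry_t *e, int is_write);
    void handle_migratory(dir_entry_t *e);
    void handle_write_shared(dir_entry_t *e, int is_write);
    void resume_faulted_thread(void);

    void munin_fault_handler(char *fault_addr, int is_write)
    {
        dir_entry_t *e = directory_lookup(fault_addr);  /* local hash table  */
        if (e == NULL)
            e = fetch_entry_from_root(fault_addr);      /* root has them all */

        acquire(&e->lock);                /* exclusive access to the entry   */
        switch (e->protocol) {            /* protocol-specific servicing     */
        case PROT_CONVENTIONAL: handle_conventional(e, fault_addr, is_write); break;
        case PROT_READ_ONLY:    handle_read_only(e, is_write);                break;
        case PROT_MIGRATORY:    handle_migratory(e);                          break;
        case PROT_WRITE_SHARED: handle_write_shared(e, is_write);             break;
        }
        release(&e->lock);

        resume_faulted_thread();          /* then await the next event       */
    }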

3.1 Conventional

Conventional shared variables are replicated on demand and are kept consistent using an invalidation-based protocol that requires a writer to be the sole owner before it can modify the data. When a thread attempts to write to replicated data, a message is transmitted to invalidate all other copies of the data. The thread that generated the miss blocks until all invalidation messages are acknowledged. We based Munin's conventional protocol on Ivy's distributed dynamic manager protocol. This protocol is typical of what existing DSM systems provide, and is the conventional DSM protocol evaluated in our performance study.

We incorporated a simplified version of the freezing mechanism from Mirage (sketched below), so that after a node acquires ownership of a conventional data item, it does not reply to requests from other nodes for a short, fixed period of time. This mechanism guarantees that the node performing the write makes progress even in the face of heavy sharing. The performance of the conventional DSM was largely unaffected by the choice of the freeze time, as long as it was above a small threshold. Without the timeout mechanism, we observed several phenomena that severely hurt the performance of conventional data when there was a high degree of sharing. When two or more threads concurrently modify a single page, a frequent occurrence, the data ping-pongs between the writers with little progress made between faults. The freezing mechanism partially alleviates this problem by ensuring that progress is made no matter how much sharing is occurring. There were even problems when there was only a single writer but multiple readers of a single data item. When there were a large number of readers, it was often the case that, by the time the writer finished invalidating all the replicas, one of the first nodes to be invalidated would have requested a new copy of the data to satisfy a read miss. Munin runtime threads have a higher scheduling priority than user threads, to ensure that remote data requests do not starve, so the reload request was satisfied before the writer was resumed. This choice of priorities resulted in the data being read protected on the original node before the writer was able to complete the write that caused the original write fault, which caused the writer to fault anew. This result convinced us that a freezing mechanism akin to that found in Mirage should be incorporated into future software DSMs that support ownership-based invalidate protocols.
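A minimal sketch of the freeze window follows, assuming a millisecond clock and that the runtime records when ownership was acquired. The constant is illustrative only; the prototype's actual freeze interval is not reproduced here.

    #define FREEZE_MSECS 50   /* illustrative value, not the prototype's */

    /* After acquiring ownership at owned_since, a node defers remote
     * transfer requests until the freeze window has elapsed, ensuring
     * the writer makes progress before the data migrates away again. */
    int may_surrender_ownership(long owned_since, long now)
    {
        return (now - owned_since) >= FREEZE_MSECS;
    }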

3.2 Read-Only

Read-only data is writable only during initialization, which allows it to be initialized along with the rest of the program. The consistency protocol simply consists of replication on demand. A runtime error is generated and the system debugger is invoked if a thread attempts to write to read-only data. As noted earlier, if the programmer does not specify that a particular variable is shared, it is replicated when the worker nodes are created. Thus, read-only objects could simply be not marked as shared. There are two reasons that programmers might wish to distinguish read-only shared data from non-shared data: (i) for debugging purposes, to detect unexpected writes to input data, and (ii) to conserve memory, by loading on demand only the portion of the read-only data that a given node requires.

3.3 Migratory

For migratory data, a single thread performs multiple accesses to the data, including one or more writes, before another thread accesses the data. This access pattern is typical of shared data that is accessed only inside a critical section or via a work queue. For this type of data, it is generally best to migrate the data to a processor as soon as it accesses it the first time, regardless of whether the first access is a read or a write. The consistency protocol for migratory data propagates the data to the next thread that accesses the data, provides the thread with read and write access (even if the first access is a read), and invalidates the original copy. This protocol avoids a write miss and a message to invalidate the old copy when the new thread first modifies the data. In addition, the Munin programmer can specify the logical connections between shared variables and the synchronization variables that protect them. This information is conveyed to the runtime system using the AssociateDataAndSynch call, which suggests that Munin include a copy of the specified shared variable in the message that passes ownership of the specified synchronization variable. This pragma is particularly useful for associating migratory data accessed within a critical section with the lock controlling the critical section. It reduces the number of faults and messages needed to migrate the data, as in Clouds.
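As an illustration, the following sketch associates a migratory record with the lock that protects it. The exact signature of AssociateDataAndSynch is an assumption; only the call's intent is taken from the text.

    struct account { long balance; };
    typedef volatile int lock_t;

    /* Assumed signature: associate a data region with a synchronization
     * variable so the data rides along with lock ownership transfers. */
    void AssociateDataAndSynch(void *data, unsigned size, void *synch);

    shared{migratory} struct account shared_account;
    lock_t account_lock;

    void setup(void)
    {
        /* Hint to Munin: piggyback shared_account on the message that
         * passes ownership of account_lock, so acquiring the lock also
         * delivers the data accessed inside the critical section.     */
        AssociateDataAndSynch(&shared_account, sizeof(shared_account),
                              &account_lock);
    }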

3.4 Write-Shared

Write-shared variables are frequently written by multiple threads concurrently, without intervening synchronization to order the accesses, because the programmer knows that each thread reads from and writes to independent portions of the data. Because of the way that the data is laid out in memory, write-shared data often exhibits a high degree of sharing at a coarse granularity (e.g., a cache line or page), but no sharing at a word granularity, a phenomenon known as false sharing. One example of false sharing is when two independent shared variables reside on the same page of memory, each being modified by a different processor. Another example is when a shared array is laid out contiguously in a single page of memory and different processors are modifying disjoint parts of the array. An intelligent compiler or careful user can alleviate the false sharing in the first case by allocating unrelated variables on distinct pages, at the expense of using extra memory. However, the false sharing in the second case is unavoidable, because the falsely shared data is part of a single contiguous array. We have observed that write-shared data is very common, and that its presence results in very poor DSM performance if it is handled by a conventional consistency protocol that communicates whenever a shared page is modified. False sharing is a particularly serious problem for DSM systems for two reasons: (i) the consistency units are large, so false sharing is very common, and (ii) the latencies associated with detecting modifications and communicating are large, so unnecessary faults and messages are particularly expensive. Any DSM system that expects to achieve acceptable performance must address the problem of false sharing.

The write-shared protocol is designed specifically to mitigate the effect of false sharing. It does so by supporting concurrent writers, by buffering writes until synchronization requires their propagation, by using an update-based consistency protocol, and by timing out and invalidating data that is not being used frequently. Unlike existing update protocols, the write-shared protocol buffers and combines update messages, as shown in Figure 2. The reason that Munin can buffer updates, rather than send them as soon as they are generated, is that it supports the release consistency memory model. A detailed description of the release consistency model is beyond the scope of this paper, but, roughly, a shared memory implementation based upon release consistency can delay the point at which memory must be consistent until a processor performs a release operation (e.g., releases a lock or arrives at a barrier). In the case of Munin, this means that updates to shared data can be buffered and combined between release points. A single processor often performs a series of writes to a shared block within a critical section. When this occurs, the write-shared protocol transmits a single update message, containing all of the changes performed within the critical section, to each node caching a copy of the data, rather than sending a stream of updates as each write occurs. This can lead to order of magnitude reductions in the number of messages required to maintain consistency, compared to a conventional pipelined invalidate protocol. For DSM, where messages are expensive but where large messages are not much more expensive than small messages, combining updates is important.

Figure 2: Buffering updates in Munin. P1 performs writes w(x), w(y), and w(z), stalls at the release, and transmits a single update message for (x, y, z), which P2 acknowledges.

The mechanism used to buffer and combine update messages, the delayed-update queue (DUQ), is illustrated in Figure 3. The initial copyset of a write-shared data item is empty on all nodes, including on the root node. Before a write-shared variable X is modified, the page on which it resides is mapped so that the first write will cause a page fault (symbolized in Figure 3 by the dashed box).

Read faults are handled as follows. If the data is present but has been read protected, such as the first time it is accessed after the node has received an update and the timeout mechanism has read protected it (see below), the handler simply remaps the page to be readable and marks the data as accessed as part of the update timeout mechanism. When a write-shared variable is first loaded on a node, it is made read-only, so any attempts to modify it are detected. The one exception to this is if it is the only copy of the data in the system, in which case it can be mapped read-write until another node requests a copy, at which time the original copy is made read-only.

Write faults are handled similarly, except when the data is being actively shared. In this case, the write-shared protocol invokes the DUQ mechanism, as illustrated in Figure 3(a). After determining that the accessed variable, call it X, is write-shared, the write fault handler makes a copy of X (X_twin) and puts the data item's directory entry on the DUQ. It then maps the original copy of X to be read-write and resumes the faulted thread. Since the original copy of X is no longer write protected, all subsequent writes to X proceed with no consistency overhead. The key feature of the write fault handler is that it only communicates with other nodes if the data is not present. If the data is already present, but mapped read-only or supervisor-only, the handler only performs local operations. The handler does not need to get exclusive write access, nor does it need to immediately propagate the modification to the other cached copies. This feature allows multiple nodes to concurrently modify a single data item without communicating for each write.
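This write-fault path can be sketched in a few lines of C. The struct is a simplified stand-in for the directory entry, extended with a twin pointer, and the helpers are hypothetical.

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char  *start_addr;   /* start of the data item            */
        size_t size;         /* its size in bytes                 */
        char  *twin;         /* copy-on-write twin; NULL if none  */
        int    present;      /* is the data cached locally?       */
    } ws_entry_t;

    void fetch_copy(ws_entry_t *e);              /* hypothetical helpers */
    void duq_enqueue(ws_entry_t *e);
    void map_read_write(char *addr, size_t size);

    /* First write to actively shared data: twin it, enqueue it on the
     * DUQ, and unprotect the page; later writes proceed fault-free.   */
    void handle_write_shared_fault(ws_entry_t *e)
    {
        if (!e->present) {           /* the only case that communicates  */
            fetch_copy(e);
            return;
        }
        if (e->twin == NULL) {       /* first write since the last flush */
            e->twin = malloc(e->size);
            memcpy(e->twin, e->start_addr, e->size);
            duq_enqueue(e);          /* remember to flush at a release   */
        }
        map_read_write(e->start_addr, e->size);
    }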

The server routine that handles remote requests for write-shared data is straightforward. It is irrelevant whether the requesting node faulted on a read or a write. Any node can respond to a request, not just the owner. If the node that satisfies a data request has a writable copy, but not a twin, of the data, the dirty copy of the data is sent to the requester and then remapped to be read-only, so subsequent changes to the data are detected and eventually purged.

The trickiest part of the write-shared protocol occurs when a local thread performs a release operation (releases a lock, arrives at a barrier, signals a condition variable, or terminates). At this time, all delayed modifications to write-shared data must be propagated to their remote copies before the local thread may proceed. These changes are propagated in phases, where each phase is responsible for propagating changes to a particular set of nodes. First, the DSM runtime determines if any updates have been buffered on the DUQ. If there are, the server walks down the DUQ and creates a differential encoding of every enqueued data item. In the example illustrated in Figure 3, it determines that X has been modified. It encodes the modifications by performing a word-by-word comparison of the data item X and its original contents X_twin. As it finds differences, it copies them to a buffer, prepending each sequence of updated words with their starting address and the run length. Because the twin is not needed after the modifications have been propagated, Munin uses the buffer that contained the twin to contain the encoding. By using eight bytes of scratch space in the message header and the portions of the twin array that have already been scanned, the encoding algorithm has the property that the diff is never larger than the twin, and it never overwrites parts of the twin that have not yet been examined as part of the encoding process. Thus, reuse of the twin buffer is safe.

(By detecting changes at the word level and not the byte level, we risk the possibility that two threads will modify different bytes of the same word, and the update mechanism will either signal a data race or overwrite one of the modifications. The applications that we examined had no byte-grained shared data, but if the user anticipates a problem, a runtime switch can change the granularity of comparison to the byte level, at the expense of increased encoding and decoding time.)

Figure 3: Delayed update queue operation. (a) On a write fault to X, the handler makes a copy-on-write twin, enqueues the item on the DUQ, and makes the original writable. (b) At a release, the modified copy is compared with the twin to encode a "diff," the update is sent to the replicas, and X is write protected if it is still replicated.
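To make the encoding concrete, here is a minimal C sketch of the word-by-word diff. It assumes word-aligned, word-sized data and emits (start offset, run length, modified words) records into a separate output buffer; the in-place reuse of the twin and the eight bytes of header scratch space are omitted for clarity.

    #include <stddef.h>
    #include <stdint.h>

    /* Encode the difference between a modified item and its twin as a
     * sequence of (start offset, run length, modified words) records.
     * Returns the number of words written to out. */
    size_t encode_diff(const uint32_t *data, const uint32_t *twin,
                       size_t nwords, uint32_t *out)
    {
        size_t o = 0;
        for (size_t i = 0; i < nwords; ) {
            if (data[i] == twin[i]) { i++; continue; }
            size_t start = i, run = 0;
            while (i < nwords && data[i] != twin[i]) { i++; run++; }
            out[o++] = (uint32_t)start;      /* starting word offset  */
            out[o++] = (uint32_t)run;        /* run length in words   */
            for (size_t j = start; j < start + run; j++)
                out[o++] = data[j];          /* the modified words    */
        }
        return o;
    }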

After encoding all of the data items enqueued on the DUQ, the server transmits a descriptor message, containing a list of the encoded data items, to the remote nodes sharing any of the encoded data. Each of these recipients determines if it needs to receive updates to any of the data described in the descriptors, requests the encoded data that it is still caching, and replies with an indication of (i) whether it is still caching each data item and (ii) its copyset for each data item. The updating node continues to transmit update messages to any node thought to be caching a copy of any of the encoded data, until it has sent updates to all such nodes. This process is guaranteed to terminate when either no new nodes are added to the copyset during a phase or all nodes have been updated. When a NACK is received, the node that sent the NACK is removed from the local copyset for the data items specified. After all the updates for a data item have been performed, X is write-protected, if it is still replicated, to ensure that subsequent writes are detected.

To incorporate updates, the receiving Munin worker thread examines the update message and determines if it is still caching any of the data specified. It is possible that the node has invalidated data that it once cached, as a result of the update timeout mechanism (see below). If it still has a copy of any of the data described in the update message, it requests the corresponding encodings and sends a reply message to the updating node. The reply message includes an indication of which of the encoded data items the node is still caching, and its copyset for each encoded item, whether it is still being cached or not. The receiving node then decodes the updates to extract the individual words that the sending node has modified. If the node is not caching any of the data described in the update message, it sends a reply message with NACKs and copysets for all of the data.

Incorporating an update normally entails simply traversing the encoding and copying the modified sequences into the local copy of the data. However, the user can specify that the runtime system detect dynamic data races, which is useful for debugging complex parallel programs with synchronization bugs. If a node receives an update for a variable that it has modified, a clean copy of the variable will be present. The system can detect data races by performing a three-way comparison of the received update, the dirty version, and the clean version as it decodes the encoded updates. If, while performing the three-way comparison, it finds that all three copies of a particular word differ, it has detected a data race on that word, and it generates an error message detailing what it has found.
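A sketch of update incorporation with the optional three-way comparison follows. The diff layout matches the encoding sketch above; report_data_race is a hypothetical helper.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    void report_data_race(uint32_t word_index);   /* hypothetical */

    /* Apply an encoded diff to the local copy.  A word is flagged as a
     * race only if the incoming value, the local (dirty) value, and the
     * clean twin all differ from one another. */
    void apply_diff(uint32_t *data, const uint32_t *twin,
                    const uint32_t *diff, size_t diff_words, bool check_races)
    {
        size_t o = 0;
        while (o < diff_words) {
            uint32_t start = diff[o++];
            uint32_t run   = diff[o++];
            for (uint32_t j = 0; j < run; j++, o++) {
                uint32_t w = start + j;
                if (check_races && twin != NULL &&
                    diff[o] != twin[w] &&      /* remote changed the word  */
                    data[w] != twin[w] &&      /* local changed it too     */
                    diff[o] != data[w])        /* and to a different value */
                    report_data_race(w);
                data[w] = diff[o];             /* install the update       */
            }
        }
    }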

This process is complicated somewhat by an update timeout mechanism. The goal of the timeout mechanism is to invalidate data that has grown stale, so as to avoid unnecessary future updates. The timeout mechanism ensures that updates to an object are only accepted for a limited period of time after it was last accessed. When an update is incorporated, the data is temporarily mapped to be supervisor-only, so that any access to it (read or write) by a local user thread will be detected. A timestamp is set in the data item's directory entry at this time. If the data is accessed before another update arrives, the subsequent fault simply remaps the data and resets a flag. However, if an update to the data is received and the node has not used the data since the last update, sufficient time has elapsed since the last update, and the data is not dirty, the update server invalidates the local copy. A copy of last resort for each write-shared variable is used to prevent all copies of the variable from being invalidated via the update timeout mechanism when several nodes send updates simultaneously. This copy cannot be invalidated unless the node is first able to find another node to take over this responsibility. The probable owner chain is used to find this copy of last resort; in the write-shared protocol, the owner holds this copy.
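The timeout decision can be summarized as a predicate, assuming the runtime tracks an accessed flag, a dirty flag, a last-update timestamp, and a copy-of-last-resort flag; the constant is illustrative only.

    #define UPDATE_TIMEOUT_MSECS 100   /* illustrative value only */

    /* Invalidate the local copy on an incoming update only if the data
     * has not been touched since the previous update, enough time has
     * passed, there are no local modifications, and this node does not
     * hold the copy of last resort. */
    int should_invalidate_on_update(int accessed_since_last_update,
                                    long last_update, long now,
                                    int dirty, int copy_of_last_resort)
    {
        return !accessed_since_last_update
            && (now - last_update) >= UPDATE_TIMEOUT_MSECS
            && !dirty
            && !copy_of_last_resort;
    }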

A technique similar to the delayed update queue was used by the Myrias SPS multiprocessor. It performed the copy-on-write and diff in hardware, but required a restricted form of parallelism to ensure correctness. Specifically, only one processor could modify a cache line at a time, and the only form of parallelism that could exploit this mechanism was a form of Fortran doall statement.

We considered implementing write detection by having the compiler add code to log writes to replicated data as part of the write, as is done in Emerald and Midway. However, although recent results indicate that compiler-based write detection can outperform VM-based detection, we chose not to explore this approach in the prototype, because we did not want to modify the compiler, and we are concerned with the portability constraints imposed by the requirement that DSM programmers use a special compiler. It is an attractive alternative for systems that do not support fast page fault handling, such as the iPSC/i860 hypercube. However, if the number of writes to a particular data item between DUQ flushes is high, as is often the case, this approach will perform relatively poorly, because each write to a shared variable is slower.

4 Performance Summary

We present a summary of Munin's performance here to illustrate the value of Munin's design. More detailed evaluations appear elsewhere.

Seven application programs were used in the evaluation of Munin: Matrix Multiply (MULT), Finite Differencing (DIFF), Traveling Salesman Problem (both fine-grained, TSP-F, and coarse-grained, TSP-C), Quicksort (QSORT), Fast Fourier Transform (FFT), and Gaussian Elimination with Partial Pivoting (GAUSS). Three different versions of each application were written: a Munin DSM version; a conventional DSM version that used Munin's conventional protocol to implement a sequentially consistent memory, and thus measured the performance of a conventional DSM system such as Ivy; and a message passing version. The message passing programs represent best case implementations of the parallel programs. The DSM runtime system used the same general purpose message passing facilities, provided by the operating system, that the message passing programs used directly. Great care was taken to ensure that the inner loops of each computation, the problem decomposition, and the major data structures for each version were identical.

To evaluate the impact of Munin's design, two basic comparisons were performed. In the first comparison, the performance of the Munin versions of the programs was compared to the performance of equivalent message passing programs. This comparison measures the extent to which programs written using the shared memory model and run under Munin can achieve performance comparable to programs written using explicit message passing. Table 1 presents the speedup achieved by each version of the program, and the percentage of the message passing speedup achieved by the Munin DSM version of the program, for sixteen processors. For four of the applications (MULT, DIFF, TSP-C, and FFT), the Munin versions achieve nearly all of the speedup of their hand-coded message-passing equivalents, despite the fact that no significant restructuring of the original shared-memory programs was performed while porting them to run under Munin. For the other three applications (TSP-F, QSORT, and GAUSS), the performance of the Munin variants falls noticeably below that of their message-passing equivalents. Support for explicit RPC brings the performance of these applications close to that of their message passing equivalents.

Table 1: Munin vs. Message Passing. For each application (MULT, DIFF, TSP-C, TSP-F, QSORT, FFT, and GAUSS), the table gives the message passing speedup, the Munin speedup, and how close the Munin version comes to the message passing version, on sixteen processors.

Table 2: Conventional DSM vs. Munin DSM. For each application, the table gives the Munin speedup, the conventional DSM speedup, and how close the conventional version comes to the Munin version, on sixteen processors.

In the second comparison, the performance of the Munin versions of the programs was compared to the performance of the conventional DSM versions. This comparison identifies the types of programs that stand to benefit the most from Munin's novel features, and the types of programs that are adequately supported with a conventional page-based, invalidation-style DSM. Table 2 presents the speedup achieved by each version of the program, and the percentage of the Munin DSM speedup achieved by the conventional DSM version of the program, for sixteen processors. For the programs with large grained sharing (MULT and TSP-C), the conventional versions achieved performance close to the speedup of their Munin counterparts. For DIFF, TSP-F, and GAUSS, the performance of the conventional versions was substantially lower than that of Munin. For QSORT, conventional performance was far below Munin performance, while for FFT, conventional performance was orders of magnitude worse.

These results are very promising, and argue that Munin represents a significant step towards making DSM useful on a much wider spectrum of programs and programming styles. Specifically, conventional DSM performs well on programs with relatively little sharing, or when the sharing is at a large granularity. However, its performance drops off quickly when there is much fine-grained sharing, and becomes unacceptable when there is much concurrent write sharing or false sharing.

5 Conclusions

This paper contains a detailed description of the design and implementation of the Munin prototype, with special emphasis given to its write-shared protocol. Munin contains a number of mechanisms and implementation strategies designed to improve the performance of DSM, including support for concurrent writers, the use of distributed ownership protocols, and the provision of a specialized library of synchronization operations. Many of these features are also relevant to the design of future scalable shared memory hardware.

We learned a number of lessons from our experience with the prototype implementation that designers of future DSM systems should consider, including the following:

- DSM can be made efficient without the use of unconventional programming languages, compilers, or operating system support.

- Organizing the DSM runtime as a library package works well. In particular, this organization improves performance by reducing the number of context switches and the amount of cross-domain memory copying required to maintain consistency.

- The implementation of delayed updates is subtle. The key to handling write-shared data efficiently is to minimize the amount of time spent purging the DUQ, so it is important to perform updates in parallel, using multiple threads or multicast.

- While it is easy to add consistency protocols if the server software is well designed, a small number of protocols is enough to handle the way that most shared data is accessed.

Munin's performance could be improved significantly by a number of factors: (a) lower latency OS operations (message passing, exception handling, and VM remapping), (b) a high bandwidth multicast network, and (c) fine-grained write detection hardware. In the prototype, OS overhead was a major source of overhead. In addition, while we carefully avoided the use of Ethernet's multicast capability, to better model how Munin would work on a scalable network technology, the presence of multicast would greatly reduce the time needed to perform updates in the infrequent but very expensive situations where a large number of nodes are write sharing data, as in FFT. Finally, fine-grained write detection hardware could eliminate the largest component of software overhead: diff creation. These issues are subjects of ongoing research at the University of Utah.

References

A. Agarwal and A. Gupta. Memory-reference characteristics of multiprocessor applications under MACH. In Proceedings of the Annual International Symposium on Computer Architecture, June.

M. Ahamad, P. W. Hutto, and R. John. Implementing and programming causal distributed shared memory. In Proceedings of the International Conference on Distributed Computing Systems, May.

H. E. Bal and A. S. Tanenbaum. Distributed programming with shared data. In Proceedings of the International Conference on Computer Languages, October.

J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Adaptive software cache management for distributed shared memory architectures. In Proceedings of the Annual International Symposium on Computer Architecture, May.

B. N. Bershad, M. J. Zekauskas, and W. A. Sawdon. The Midway distributed shared memory system. In COMPCON, February.

R. Bisiani and M. Ravishankar. PLUS: A distributed shared-memory system. In Proceedings of the Annual International Symposium on Computer Architecture, May.

J. B. Carter. Efficient Distributed Shared Memory Based On Multi-Protocol Release Consistency. PhD thesis, Rice University, August.

J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Techniques for reducing consistency-related communication in distributed shared memory systems. ACM Transactions on Computer Systems. To appear.

J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In Proceedings of the ACM Symposium on Operating Systems Principles, October.

J. S. Chase, F. G. Amador, E. D. Lazowska, H. M. Levy, and R. J. Littlefield. The Amber system: Parallel programming on a network of multiprocessors. In Proceedings of the ACM Symposium on Operating Systems Principles, December.

D. R. Cheriton and W. Zwaenepoel. The distributed V kernel and its performance for diskless workstations. In Proceedings of the ACM Symposium on Operating Systems Principles, October.

S. J. Eggers and R. H. Katz. A characterization of sharing in parallel programs and its application to coherency protocol evaluation. In Proceedings of the Annual International Symposium on Computer Architecture, May.

B. Fleisch and G. Popek. Mirage: A coherent distributed shared memory design. In Proceedings of the Symposium on Operating Systems Principles, December.

K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the Annual International Symposium on Computer Architecture, Seattle, Washington, May.

E. Jul, H. Levy, N. Hutchinson, and A. Black. Fine-grained mobility in the Emerald system. ACM Transactions on Computer Systems, February.

K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, November.

R. G. Minnich and D. J. Farber. The Mether system: A distributed shared memory for SunOS. In Proceedings of the Summer USENIX Conference, June.

Myrias Corporation. System overview. Edmonton, Alberta.

B. Nitzberg and V. Lo. Distributed shared memory: A survey of issues and algorithms. IEEE Computer, August.

U. Ramachandran, M. Ahamad, and Y. A. Khalidi. Unifying synchronization and data transfer in maintaining coherence of distributed shared memory. Technical Report GIT-CS, Georgia Institute of Technology, June.

S. K. Reinhardt, J. R. Larus, and D. A. Wood. Tempest and Typhoon: User-level shared memory. In Proceedings of the Annual International Symposium on Computer Architecture, April.

R. L. Sites and A. Agarwal. Multiprocessor cache analysis using ATUM. In Proceedings of the Annual International Symposium on Computer Architecture, June.

M. Stumm and S. Zhou. Algorithms implementing distributed shared memory. IEEE Computer, May.

J. E. Veenstra and R. J. Fowler. A performance evaluation of optimal hybrid cache coherency protocols. In Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems, September.

D. H. D. Warren and S. Haridi. The Data Diffusion Machine: A shared virtual memory architecture for parallel execution of logic programs. In Proceedings of the International Conference on Fifth Generation Computer Systems, December.

W.-D. Weber and A. Gupta. Analysis of cache invalidation patterns in multiprocessors. In Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems, April.

M. J. Zekauskas, W. A. Sawdon, and B. N. Bershad. Software write detection for distributed shared memory. In Proceedings of the First Symposium on Operating System Design and Implementation, November.