Evaluation of Architectural Support for Global Address-Based Communication in Large-Scale Parallel Machines

Arvind Krishnamurthy, Klaus E. Schauser†, Chris J. Scheiman†, Randolph Y. Wang, David E. Culler, and Katherine Yelick

Computer Science Division, University of California, Berkeley
{arvindk, culler, rywang, yelick}@cs.berkeley.edu

† Department of Computer Science, University of California, Santa Barbara
{schauser, chriss}@cs.ucsb.edu

Abstract

Large-scale parallel machines are incorporating increasingly sophisticated architectural support for user-level messaging and global memory access. We provide a systematic evaluation of a broad spectrum of current design alternatives based on our implementations of a global address language on the Thinking Machines CM-5, Intel Paragon, Meiko CS-2, Cray T3D, and Berkeley NOW. This evaluation includes a range of compilation strategies that make varying use of the network processor; each is optimized for the target architecture and the particular strategy. We analyze a family of interacting issues that determine the performance tradeoffs in each implementation, quantify the resulting latency, overhead, and bandwidth of the global access operations, and demonstrate the effects on application performance.

1 Introduction

In recent years, several architectures have demonstrated practical scalability beyond a thousand microprocessors, including the nCUBE/2, Thinking Machines CM-5, Intel Paragon, Meiko CS-2, and Cray T3D. More recently, researchers have also demonstrated high-performance communication in Networks of Workstations (NOW) using scalable switched local area network technology. While the dominant programming model at this scale is message passing, the primitives used are inherently expensive due to buffering and scheduling overheads. Consequently, these machines provide varying levels of architectural support for communication in a global address space via various forms of memory read and write.

We developed the Split-C language to allow experimentation with new communication hardware mechanisms by involving the compiler in the support for the global address operations. Global memory operations are statically typed, so the Split-C compiler can generate a short sequence of code for each potentially remote operation, as required by the specific target architecture. We have developed multiple highly optimized versions of this compiler, employing a range of code-generation strategies for machines with dedicated network processors. In this study, we use this spectrum of runtime techniques to evaluate the performance tradeoffs in architectural support for communication found in several of the current large-scale parallel machines.

We consider five important large-scale parallel platforms that have varying degrees of architectural support for communication: the Thinking Machines CM-5, Intel Paragon, Meiko CS-2, Cray T3D, and Berkeley NOW. The CM-5 provides direct user-level access to the network; the Paragon provides a network processor (NP) that is symmetric with the compute processor (CP); the Meiko and NOW provide an asymmetric network processor that includes the network interface (NI); and the T3D provides dedicated hardware which acts as a specialized NP for remote reads and writes.

Against these hardware alternatives, we consider a variety of implementation techniques for global memory operations, ranging from general-purpose active message handlers to specialized handlers executing directly on the NP or in hardware. This implementation exercise reveals several crucial issues, including protection, address translation, synchronization, responsiveness, and flow control, which must be addressed differently under the different regimes and contribute significantly to the effective communication costs in a working system.

Our investigation is largely orthogonal to the many architectural studies of distributed machines, which seek to avoid unnecessary communication by exploiting address translation hardware to allow consistent replication of blocks throughout the system, and to operating system studies, which seek the same end by extending virtual memory support. In those efforts, communication is caused by a single load or store instruction, and the underlying hardware or operating system mechanisms move the data transparently. We focus on what happens when the communication is necessary. So far, distributed shared memory techniques have scaled up from the tens of processors toward a hundred, but many leaders of the field suggest that the thousand-processor scale will be reached only by clusters of these machines in the foreseeable future. Our investigation overlaps somewhat with the cooperative shared memory work, which initiates communication transparently but allows remote memory operations to be serviced by programmable handlers on dedicated network processors. The study could in principle be performed with other compiler-assisted shared memory implementations, but these do not have the necessary base of highly optimized implementations on a range of hardware alternatives.

The rest of the paper is organized as follows. The next section provides background information for our study: we briefly survey the target architectures and the source language. We then sketch the basic implementation strategies, present a qualitative analysis of the issues that arise in our implementations and their performance impacts, and substantiate this analysis with a quantitative comparison using microbenchmarks and parallel applications. Finally, we draw conclusions.

Figure 1: Structure of the multiprocessor nodes.

2 Background

In this section we establish the background for our study, including the key architectural features of our candidate machines and an overview of Split-C.

2.1 Machines

We consider five machines, all constructed with commercial microprocessors and a scalable, low-latency interconnection network. The processor and network performance differs across the machines but, more importantly, they differ in the processor's interface to the network. They range from a minimal network interface on the CM-5 to a full-fledged processor on the Paragon. Figure 1 gives a sketch of the node architecture on each machine.

Thinking Machines CM-5. The CM-5 has the most primitive messaging hardware of the five machines. Each node contains a single Sparc processor and a conventional MBus-based memory system. (We ignore the vector units in both the CM-5 and Meiko machines.) The network interface unit provides user-level access to the network. Each message has a tag identifying it as a system message, an interrupting user message, or a non-interrupting user message that can be polled from the NI. The compute processor sends messages by writing to output FIFOs in the NI using uncached stores; it polls for messages by checking network status registers. Thus, the network is effectively a distributed set of queues. The queues are quite shallow, holding only three short messages. The network is a 4-ary fat tree.

Intel Paragon. In the Paragon, each node contains one or more compute processors (50 MHz i860 processors) and an identical CPU dedicated for use as a network processor. Our configuration has a single compute processor per node. The compute and network processors share memory over a cache-coherent memory bus. The network processor, which runs in system mode, provides communication through shared memory to user level on the compute processor. It is also responsible for constructing and interpreting message tags. Also attached to the memory bus are DMA engines and a network interface. The network interface provides a pair of relatively deep input and output FIFOs, which can be driven by either processor or by the DMA engines. The network is a 2-D mesh.

Meiko CS-2. The Meiko CS-2 node contains a special-purpose Elan network processor integrated with the network interface and DMA controller. The network processor is attached to the memory bus and is cache-coherent with the compute processor, which is a three-way superscalar SuperSparc processor. The network processor functions both as a processor and as a memory device, so the compute processor can issue commands to the network interface and get status back via a memory exchange instruction at user level. The network processor has a dedicated connection to the network; however, it has only modest processing power and no general-purpose cache, so instructions and data are accessed from main memory. The network is comprised of two 4-ary fat trees.

Figure 2: Handlers are executed on the CP.

Figure 3: Handlers are executed on the NP.

Figure 4: NPs are treated simply as network interfaces.

Figure 5: CP directly injects messages into the network.

Cray T3D. The Cray T3D has a sophisticated message unit, which is integrated into the memory controller to provide direct hardware support for remote memory operations. A node consists of a 150 MHz Alpha processor, memory, and a shell of support circuitry that provides global memory access and synchronization. A remote memory operation typically requires a short sequence of instructions to set up the destination processor number in an external address register, issue a memory operation, and then test for completion, rather than a simple load or store. The shell also provides a system-level bulk transfer engine, which can DMA large blocks of contiguous or strided data to or from remote memories. Processors are grouped in pairs that share a network interface and block-transfer engine, and all processor nodes are connected via a three-dimensional torus network.
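The following fragment sketches that instruction sequence for a remote write. The external register, completion flag, and remote segment base are hypothetical stand-ins for the T3D's actual shell interface and are named here only for illustration.

    /* Sketch of the three-step sequence described above: (1) name the
     * destination PE in an external register, (2) issue the memory
     * operation, (3) test for completion.  All names are assumptions. */
    #include <stdint.h>

    extern volatile uint32_t *EXT_PE_REGISTER;   /* assumed PE-select register   */
    extern volatile uint32_t *SHELL_DONE_FLAG;   /* assumed completion indicator */
    #define REMOTE_SEGMENT_BASE 0x80000000UL     /* assumed remote-mapped segment */

    static void remote_write(uint32_t pe, uint32_t remote_addr, uint32_t value)
    {
        *EXT_PE_REGISTER = pe;                   /* 1. destination PE into external register */
        volatile uint32_t *r =
            (volatile uint32_t *)(REMOTE_SEGMENT_BASE + remote_addr);
        *r = value;                              /* 2. issue the (remote) store */
        while (*SHELL_DONE_FLAG == 0)            /* 3. test for completion      */
            ;
    }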

Berkeley NOW. The Berkeley NOW is a cluster of UltraSparc workstations connected together by Myrinet. The CP is a four-way superscalar UltraSparc processor. The Myrinet NI is an I/O card that plugs into the standard SBus. It contains a CISC-based LANai network processor, DMA engines, and local memory (SRAM). The CP can access the NP's local memory through uncached loads and stores on memory-mapped addresses. The NP can access main memory through special DMA operations initiated over the I/O bus. The Myrinet network is composed of crossbar switches with eight bidirectional ports, and the switches can be linked to obtain arbitrary topologies.

2.2 Global address space language

Many parallel languages, including HPF, Split-C, CC++, Cid, and Olden, provide a global address space abstraction built from a combination of compiler and runtime support. The language implementations differ in the amount of information available at compile time and the amount of runtime support for moving and caching values. We consider the problem of implementing a minimalist language, Split-C, which focuses attention on the problems of naming, retrieving, and updating remote values.

A program is comprised of a thread of control on each processor from a common code image (SPMD). The threads execute asynchronously, but may synchronize through global accesses or barriers. Processors interact through reads and writes on shared data. The type system distinguishes local accesses from global accesses, although a global access may be to an address on the local processor.

Split-phase, or non-blocking, variants of read and write, called get and put, are provided to allow the long latency of a remote access to be masked. These operations are completed by an explicit sync operation. Another form of write, called store, avoids acknowledging the completion of a remote write, and instead increments a counter on the processor containing the target address. This operation supports efficient one-way communication and remote event notification. Bulk transfer within the global address space is provided in both blocking and non-blocking forms. In addition to read and write, atomic read-modify-write operations, such as fetch&op, are supported.
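To make these primitives concrete, the sketch below shows how a program might use them, written as plain C calls. The sc_* names are illustrative stand-ins for the Split-C operations, not the language's actual syntax.

    /* Schematic use of split-phase get, put, and one-way store.
     * All declarations below are assumed, self-contained stand-ins. */
    #include <stddef.h>

    typedef struct { int node; void *addr; } global_ptr;   /* global address = (node, local addr) */

    extern void sc_get(void *local_dst, global_ptr src, size_t n);        /* non-blocking read  */
    extern void sc_put(global_ptr dst, const void *local_src, size_t n);  /* non-blocking write */
    extern void sc_store(global_ptr dst, const void *local_src, size_t n);/* one-way write      */
    extern void sc_sync(void);               /* wait for this processor's outstanding gets/puts */
    extern void sc_store_sync(size_t bytes_expected);  /* wait until that many stored bytes arrive */

    void producer(global_ptr remote_slot)
    {
        int local_copy, my_value = 42;

        sc_get(&local_copy, remote_slot, sizeof(int)); /* split-phase read: issue now ...     */
        /* ... overlap independent computation here ...                                        */
        sc_sync();                                     /* ... complete it with an explicit sync */

        sc_store(remote_slot, &my_value, sizeof(int)); /* one-way write: no acknowledgment;
                                                          bumps a counter on the owning node  */
    }

    void consumer(void)
    {
        sc_store_sync(sizeof(int));                    /* block until the expected stored
                                                          bytes have arrived at this processor */
    }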

3 Implementations

Global memory operations fundamentally involve three processing steps: (1) the processor issues a request, (2) the request is serviced on a remote node by accessing memory, possibly updating state or triggering an event, and (3) a response or completion indication is delivered to the requesting node. Between these steps, information flows from the CP through network processors (NPs), network interfaces (NIs), and memory systems. Consequently, a basic issue in any implementation is choosing which set of hardware elements are involved in the transfer and where the message is handled. In this section we outline four possible strategies for implementing global memory operations.

Compute Processor as Message Handler (Proc). In our first approach, the message handlers are executed on the CP. In the simplest implementation of this strategy, the CP injects the message into the network through the network interface, and the CP on the remote node receives the message, executes the handler, and responds, as shown in Figure 2 for a remote read operation.

The same basic strategy can be employed with network processors, using them simply as smart network interfaces, as shown in Figure 4. The communication between a CP and an NP on a node occurs through a queue built in shared memory. The CP writes to the queue, and the NP, which is constantly polling the queue, retrieves the message from memory and injects it into the network. The message is received by the NP on the remote side, enqueued into shared memory, and eventually handled by the remote CP, whereupon a handler executes and initiates a response. In this implementation, the network processor's task is to move data between shared memory and the network, guarantee that the network is constantly drained of messages, and perform flow control by tracking the number of outstanding messages.
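The fragment below sketches the kind of single-producer, single-consumer queue this strategy relies on for handing requests from the CP to a polling NP. The message format, queue size, and ni_inject routine are invented for illustration; a real runtime would also apply the platform's memory-barrier and cache-line placement rules and track outstanding messages for network flow control.

    /* Minimal CP-to-NP request queue in shared memory (illustrative). */
    #include <stdint.h>

    #define QUEUE_SLOTS 64                     /* assumed queue depth */

    typedef struct { uint32_t dest, handler, arg0, arg1; } msg_t;

    typedef struct {
        volatile uint32_t head;                /* written only by the CP */
        volatile uint32_t tail;                /* written only by the NP */
        msg_t slot[QUEUE_SLOTS];
    } cp_np_queue;

    /* CP side: enqueue a request; a full queue doubles as a crude
     * flow-control window and forces the caller to retry later. */
    static int cp_enqueue(cp_np_queue *q, msg_t m)
    {
        if (q->head - q->tail >= QUEUE_SLOTS)
            return 0;
        q->slot[q->head % QUEUE_SLOTS] = m;
        q->head = q->head + 1;                 /* publish after the payload */
        return 1;
    }

    /* NP side: constantly poll, drain the queue, inject into the network. */
    extern void ni_inject(const msg_t *m);     /* assumed NI access routine */

    static void np_poll(cp_np_queue *q)
    {
        while (q->tail != q->head) {
            ni_inject(&q->slot[q->tail % QUEUE_SLOTS]);
            q->tail = q->tail + 1;
        }
    }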

Network Processor as Message Handler (NP). In our next approach, the message handlers are executed on the NPs, thereby reducing the involvement of the CPs (see Figure 3). The request initiation is similar to the base implementation: the CP uses shared memory to communicate the message to the NP, which injects the message into the network. The NP on the remote node receives the message, executes the corresponding handler, and initiates a response without involving the CP. The NP on the requesting node receives the response and updates state to indicate the completion of the remote request, again without involving the CP. This strategy streamlines message handling by eliminating the involvement of the compute processors.

Message Injection by the Compute Processor (Inject). In this approach, the CP on the requesting node directly injects messages into the network without involving the NP, as shown in Figure 5. The remote NP receives and executes the message before initiating the response, which is eventually received and handled by the NP on the requesting node. This approach streamlines message injection by eliminating the NP's involvement. However, in this strategy, since both the CP and the NP can inject messages into the network, the network interface must be protected to ensure mutually exclusive access.

Message Receipt by either Processor (Receive). Our next approach differs from the Inject strategy in one aspect: the compute processors are also allowed to receive and handle messages. As with the Inject strategy, the CP directly injects requests into the network. However, on the remote node, the request is serviced by either the CP or the NP, depending on which processor is available. Similarly, when the reply comes back, it is handled by either one of the two processors on the source node. In the Inject approach, during a global memory operation there are three points at which an NP is involved in a message send or receive, while the CP is involved only once. Since interfacing with the NI is typically expensive, this asymmetry in roles could potentially lead to a load imbalance during a global communication phase. The Receive strategy corrects this asymmetry by dynamically balancing the load between the CP and the NP.

We have implemented ten versions of the Split-C language. Each version is highly optimized for the underlying hardware. On the CM-5, which has no network processors, a sole version exists: all message handlers are executed directly by the compute processors. On the Meiko and the NOW, we have an implementation that executes the message handlers on the CP and another that executes them on the NP; the CP does not interact with the network directly in either case. On the Paragon, we have an implementation for each of the four different message handler placement strategies discussed in this section. On the T3D, we have implemented a sole version that handles all remote memory accesses using the combined memory controller and network interface.

4 Issues and Qualitative Analysis

In this section, we present a set of interacting issues that shape the implementation of global address operations. Our goal is to identify the hardware and software features that affect both correctness and performance. This provides a framework for understanding the performance measurements in the next section.

There are three dimensions to communication performance. The latency of an operation is the total time taken to complete the operation. If the program must wait until the operation completes, this is the most important figure of merit. However, if the program has other work to do, then the important measure is overhead, which is the processing time spent issuing and completing a communication event. If there are many concurrent communication events occurring at the same time, then the time for each event to get through the bottleneck in the communication system, called the occupancy or gap, may matter most, since it determines the effective communication bandwidth. Our perspective is influenced by the LogP model, although we use a different definition of latency, which includes overhead.
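The following sketch indicates how overhead and gap can be separated in a microbenchmark of this style; it is written in the spirit of LogP measurement rather than our actual harness, and sc_put, sc_sync, wall_time_us, and compute_delay are assumed helpers.

    /* Illustrative microbenchmarks for per-operation overhead and gap. */
    #include <stddef.h>

    typedef struct { int node; void *addr; } global_ptr;
    extern void   sc_put(global_ptr dst, const void *src, size_t n);
    extern void   sc_sync(void);
    extern double wall_time_us(void);
    extern void   compute_delay(int units);   /* busy-work long enough to drain the network */

    #define TRIALS 100000

    /* Overhead: time the CP itself spends issuing an operation.  Interleave
     * each issue with enough independent work that the network never
     * back-pressures, then subtract the cost of the work alone. */
    double measure_overhead_us(global_ptr dst, int value, int work)
    {
        double t0 = wall_time_us();
        for (int i = 0; i < TRIALS; i++) compute_delay(work);
        double t_work = wall_time_us() - t0;

        t0 = wall_time_us();
        for (int i = 0; i < TRIALS; i++) { sc_put(dst, &value, sizeof value); compute_delay(work); }
        sc_sync();
        return ((wall_time_us() - t0) - t_work) / TRIALS;
    }

    /* Gap: steady-state interval between operations issued back to back;
     * with a long burst this converges to the occupancy of the slowest
     * stage of the communication pipeline. */
    double measure_gap_us(global_ptr dst, int value)
    {
        double t0 = wall_time_us();
        for (int i = 0; i < TRIALS; i++)
            sc_put(dst, &value, sizeof value);
        sc_sync();
        return (wall_time_us() - t0) / TRIALS;
    }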

We divide the set of implementation issues into three categories: those that are determined primarily by the machine architecture, those that are determined by the particular language implementation, and those that are only determined by application usage characteristics.

4.1 Architectural issues

We begin with the family of architectural issues in a bottom-up fashion, starting with the basic factors involved in moving bits around. We then consider the processors themselves and work upwards toward system issues of protection and address translation.

4.1.1 Interface to the network processor and the network interface

The interface between a processor and the network is a notorious source of software and hardware costs. We consider two design goals in this section, minimizing overhead and minimizing latency, which suggest opposing designs.

To minimize latency, we should streamline the communication process by reducing the number of memory transfers between processors. One expects the latency to be minimized if requests are issued directly from the CP into the network, the remote operation is handled and serviced directly out of the network, and the response is given directly to the requesting processor.

To minimize the communication overhead for the CP, the solution is to offload as much work as possible to a separate NP. The offloaded work of transferring data into or out of the network involves checking various status indicators, writing or reading FIFOs, and checking that the transfer was successful. On all five machines this transfer is expensive. Thus, we would expect the reduction in overhead to be significant, as the cost of transfer to the network is traded off against communication with the NP.

Surprisingly, on the Paragon the cache coherency protocol between the CP and the NP results in at least four bus transfers for an eight-word processor-to-NP transfer, at a cost greater than a processor-to-NI transfer. On the Meiko, because the NI is integrated into the NP, the CP cannot directly send into the network. Communication between the processor and the NP involves a transfer request being written to the NP with a single exchange instruction, whereupon the NP pulls the message data directly out of the processor's cache and writes it into the network. The NOW configuration is similar to the Meiko's, with the NP having a dedicated connection to the NI. The difference is that the CP and the NP do not share memory. Instead, the CP and the NP communicate through a queue located in the NP's local memory, which is accessible to the CP through memory-mapped addresses. On the T3D, crossing the processor chip boundary is the major cost of all remote memory operations; however, the external shell is designed specifically to present the network as conveniently as possible to the processor.

A correctness issue also arises if the processor injects requests directly into the network and the NP is to handle requests and inject responses: the network interface must be protected to ensure mutually exclusive access. This issue does not arise on the Meiko or NOW, since they do not allow direct access to the NI from the processor. In our Paragon implementation, explicit locks are used to guarantee exclusive access.

4.1.2 Relative performance and capability

One of the stark differences in our target platforms is the processing power and capabilities of the NPs. One expects performance to improve if handlers are executed on the NP. However, if the NP is much slower than the processor, as on the Meiko, then it may be best for the NP to do only simple operations; it may be faster to have the NP pass complex requests to the processor than to execute the request itself. The optimal choice depends not only on the relative speed of the two processors, but also on their relative load: if the processor is busy, then the best performance may be achieved by executing the handler on the slower NP. Of course, the request must always be passed on to the remote processor if it demands capabilities not present in the NP. For example, the Meiko NP does not directly support floating-point or byte operations, the NOW NP has no floating-point support, and the T3D shell can only serve remote memory accesses and very limited synchronization operations.

4.1.3 Protection

Protection issues arise at each step of a global access operation. The network interface must be protected from incorrect use by applications. Messages sent by one application must arrive only at target remote processes for which it is authorized. Hosts must continue to extract messages to avoid network deadlock. Finally, handlers must only access storage resources that are part of the application. The traditional solution in LANs and early message-passing machines, such as the nCUBE/2, was to involve the operating system on sending and receiving every message. This requirement can be eliminated with more complete architectural support, described below, and with coarser system measures, such as gang scheduling of applications on partitions of the machine.

The Paragon has primitive hardware support for protection. It does not distinguish user and system messages or messages from different processes, and there is no safe user-level access to the NI. Instead of invoking the OS for each communication operation, protection is enforced by passing all messages through a shared buffer to the NP, which runs at system level and provides protection checks and resource arbitration, and continually drains the network. In all but the base implementation, remote operations are serviced directly on the NP, which performs protection checks on access to user addresses. In the implementations in which the CP directly injects requests into the NI, there is a protection loophole; these experiments reflect the performance that would be possible if the Paragon were to adopt measures similar to the CM-5 or Meiko.

On the CM-5, the NI attaches a tag to each message so that user and system messages can be distinguished. The NI state is divided into distinct user and system regions, so that the parallel application can only inject messages for other processes within the application. Gang scheduling on a subtree of the network is used to ensure that applications do not interfere with each other. Also, since remote services are only performed on the processor, the normal address translation mechanism enforces protection. The NOW also gang-schedules parallel jobs. It does not currently support time-slicing, so a parallel program's traffic is insulated from others'. The control program on the NP is protected from user tampering, and it performs protection checks on user-supplied addresses.

The Meiko protection mechanism is more sophisticated. A collection of communicating processes possesses a common communication capability. All messages are tagged with the capability, which is used to identify the communication context on the NP. This context includes the set of remote nodes to which the application is authorized to send messages and the virtual memory segment of the local process that is accessible to remote processes. Time-slicing the NP allows applications to make forward progress, so arbitrary user handlers may be run directly on the NP.

The T3D enforces protection entirely through the address translation mechanism. If an access is made to an authorized remote location, the virtual-to-physical translation will succeed, and the shell will issue a request to the designated physical location on the appropriate physical node. Parallel applications are gang-scheduled on a subcube of the machine, so they do not compete for network resources.

4.1.4 Address translation

A key issue in all of our implementations is how the address of a globally accessible location is translated. A global pointer is statically distinct from a local pointer, so the address translation is potentially a joint effort by the compiler-generated code, the remote handler, and the hardware. If remote operations are served by the CP, the virtual-to-physical translation is performed by the standard virtual memory mechanism, and page-fault handling is decoupled from communication. If remote operations are served by the NP, many alternatives arise.

On the Meiko, the user thread on the NP runs in the virtual address space of the user process. The NP contains its own TLB, and its page table is kept consistent with that used by the processor. Thus, the address translation on request issue is the same as for the Proc strategy.

On the Paragon, if the message handler is run on the NP, it runs in kernel mode. In the current version of OSF/AD, it executes directly in the physical address space. Thus, all accesses to user address space are translated and checked in software on the NP. In principle, a TLB could be used to accelerate this translation, but since the nodes are potentially time-sliced, the NP would still need to check that it contained a mapping for the target process of the message and adjust its mapping or emulate the context as appropriate. The remote page-fault issue arises as well, but the message is explicitly aborted by the handler.

On the NOW, since the NP is attached to the I/O bus, it can access main memory only through valid I/O space addresses. However, since the I/O bus supports only 32-bit addresses, only a portion of the user's address space can be mapped into the I/O space at any given time. Consequently, if an access is made to an address that is not currently part of the I/O space, the NP passes the request to the compute processor.

The T3D takes a completely different approach in that the virtual address is translated on the processor that issues the request. The page tables are set up so that the result of the address translation contains the index of a remote processor in a small external set of registers and a physical address on that node. It is possible that a valid address for the remote node causes an address fault on the node issuing the request, if the remote node has extended its address space beyond that of the requester. The language implementation avoids this problem by coordinating memory allocation. No paging is supported.

4.1.5 DMA support

All of the issues above come together in DMA support for bulk transfer operations. On the Paragon and Meiko, DMA offers much greater transfer bandwidth than small messages. The Paragon DMA engine operates on physical addresses and has a complex set of restrictions for correct operation, so it must be managed at kernel level. To improve overall network utilization, the NP fragments large transfers into page-sized chunks. The Meiko provides more sophisticated DMA support by allowing the user to specify arbitrary-sized transfers on virtual addresses. The DMA engine is part of the NP; it automatically performs address translation and fragments the transfer into smaller chunks, which are interleaved with other traffic. The T3D block transfer engine operates on physical addresses and is only accessible at kernel level, so the cost of a trap is paid on startup. On the NOW, a bulk transfer requires the participation of two DMA engines: the host DMA moves data between main memory and the NP's local memory, while the NI DMA is responsible for moving data between the NP's local memory and the network. It operates in the I/O address space and requires an additional copy to or from the user space.

4.2 Language implementation issues

We now turn to the family of issues at the language implementation level, given that the network processor can execute handlers and has specific architectural characteristics that can be fully exploited.

4.2.1 Generality of handlers

If the network processor runs in kernel mode, as on the Paragon, it can only run a fixed set of safe handlers. (This set might be enlarged by using sandboxing or software fault isolation techniques.) The protection and address translation capabilities of the Meiko NP make general handlers possible, but its poor performance makes it usable only for highly specialized handlers. Atomic handlers are challenging to implement efficiently on the NP: expensive locking may be required to ensure exclusive access to program state by the CP and the NP. Our approach is to always execute the complex atomic operations on the compute processor to avoid costly locking.

4.2.2 Synchronization

Non-blocking operations such as get, put, and store require some form of synchronization event to signal completion. This is easily implemented with counters, but if operations are issued by the processor and handled by the NP, then the counters must be maintained properly, with minimal cost for exclusive access. An efficient solution is to split the counter into two counters, using one counter for the compute processor's increments and the other for the network processor's decrements. No race condition can occur, since each processor can only write to one location and can read both. The sum of the two counters produces the desired counter value. Implementing this on the Meiko and the Paragon is best accomplished by having each half of the counter in a separate cache line to avoid false sharing.
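A minimal sketch of this split counter follows. The 64-byte alignment stands in for per-cache-line placement, and any memory barriers required by a particular platform are omitted.

    /* Split completion counter: single writer per half, readable by both. */
    #include <stdint.h>

    #define CACHE_LINE 64   /* assumed line size */

    typedef struct {
        volatile int64_t issued    __attribute__((aligned(CACHE_LINE))); /* written by CP only */
        volatile int64_t completed __attribute__((aligned(CACHE_LINE))); /* written by NP only */
    } split_counter;

    static void cp_issue(split_counter *c)    { c->issued    = c->issued    + 1; }
    static void np_complete(split_counter *c) { c->completed = c->completed - 1; }

    /* Either processor may read both halves; no lock is needed because
     * each half has a single writer.  The completed half is kept negative,
     * so the sum is the number of outstanding operations. */
    static int64_t outstanding(const split_counter *c)
    {
        return c->issued + c->completed;
    }

    /* CP-side sync: spin until every issued operation has completed. */
    static void cp_sync(split_counter *c)
    {
        while (outstanding(c) != 0)
            ;
    }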

4.2.3 Optimizing for specialized handlers

If some of the handlers are to be executed on the NP, they can be optimized for their task and the specific capabilities of the NP. On the Paragon, the translations of frequently accessed variables (e.g., the completion counters) can be cached for future use. Message formats are specialized for the NP handlers to minimize packet size. One-way operations, such as stores, can be treated specially to reduce the number of reverse acknowledgments needed for flow control. The Meiko allows for optimizations of a different sort, as the handler code can be mapped directly onto some specialized operations supported by the NP, such as remote atomic writes.

4.3 Application issues

Considering architectural and language implementation issues in isolation, one can construct a solution that attempts to minimize the latency, overhead, and gap for the individual global access operations, striking some balance between the three metrics. However, the effective performance of these operations depends on how they are actually used in programs. Two issues that have emerged clearly in this study are responsiveness and the frequency of remote events.

4.3.1 Responsiveness

The prompt handling of incoming messages is important for minimizing latency. One way to ensure messages are handled as they arrive is for the message to trigger an interrupt. Unfortunately, few commodity processors have fast interrupts, so where possible we utilize polling of the network interface. If the CP is responsible for polling and fails to do so because it is busy in compute-intensive operations, the effective latency can increase dramatically. If remote operations are handled on the NP, it can be responsive to these requests regardless of the activities of the processor.

4.3.2 Frequency of remote events

Remote events are operations where control information is transmitted to a remote process along with data. The simplest remote event we consider is the signaling store, which informs the remote processor how much data has been stored into it by incrementing a counter. Synchronization operations, such as fetch&add, and more general atomic procedures involve more extensive operations within the remote address space. Many event-driven applications use a distributed task queue model, where communication causes a new event to be posted on a queue.
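The sketch below illustrates a signaling store of this kind as an active-message-style handler plus the matching wait on the owning process. The handler signature is illustrative rather than any particular platform's interface.

    /* Signaling store: deposit the payload and bump a byte counter so the
     * owner can later wait for an expected amount of data. */
    #include <stddef.h>
    #include <string.h>

    static volatile size_t bytes_stored;      /* incremented only by the handler */

    /* Executed on the node that owns dst (on the CP or the NP, depending
     * on the implementation strategy). */
    void store_handler(void *dst, const void *payload, size_t nbytes)
    {
        memcpy(dst, payload, nbytes);         /* deposit the data           */
        bytes_stored += nbytes;               /* signal: one-way, no reply  */
    }

    /* The owning process waits for a known amount of stored data. */
    void store_sync(size_t expected)
    {
        while (bytes_stored < expected)
            ;                                 /* poll or compute elsewhere  */
    }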

If a program invokes very few remote operations, architectural support for communication has very little impact on performance. If a program is communication intensive and all remote operations are variants of read and write, which involve only the remote memory and do not interact with the remote processor, then, unsurprisingly, devoting hardware to serve these operations will improve performance. However, specific support for simple read and write operations does little to support remote events, such as stores, since they are relatively expensive to implement as multiple operations on the remote memory space. Thus, the latency, overhead, and gap of these remote event operations vary significantly across our platforms, but the effective impact on performance depends on how frequently they are used in applications, which also varies dramatically.

4.4 Summary

Each implementation strategy on each platform must address the issues raised in this section; it represents a particular point of balance in the opposing tradeoffs present. In this section, we have examined in qualitative terms how the available architectural support, the language implementation techniques, and the program usage characteristics influence the performance of global access operations. Given this framework, let us examine the performance obtained on each of the operations in isolation and the resulting application performance.

5 Performance Analysis

In this section, we present detailed measurements of our ten implementations to provide a quantitative assessment of the issues presented in the previous section. We divide our discussion into four parts. First, we examine the raw communication performance of active messages and Split-C primitives; we also study bulk synchronous communication patterns, where multiple processors simultaneously exchange messages. Next, we examine bulk transfers and the achieved bandwidth. Then we study a microbenchmark that illustrates the impact of attentiveness to the network. Finally, we examine complete applications written in Split-C.

5.1 Performance of Split-C primitives

Table 1 shows the round-trip latency, gap, and overhead for active messages as well as get, put, and store operations under the various implementations. The upper three groups test a single requester and a single remote server. The lower two involve communication among multiple processors.

Table 1: Basic Split-C operations for the different versions of Split-C (times in microseconds). Proc indicates the compute processor implementation, NP the network processor implementation, Inject the implementation where the compute processor directly injects messages, and Receive the one where it also directly receives responses. The columns cover CM-5 (Proc), Meiko (Proc, NP), Paragon (Proc, NP, Inject, Receive), T3D (NP), and NOW (Proc, NP); the rows report round-trip latency for active messages, reads, and writes, and overhead, gap, gap to 2, and gap for exchange for gets, puts, and stores.

Round-trip Latency. To evaluate system impact on latency, we consider three types of operations, all of which wait for an acknowledgement of completion. The first is a general active message, measured for the case of a null handler; the others are the blocking memory operations read and write. The latency measurements expose two of the issues raised earlier: total round-trip time is minimized by avoiding the use of NPs during injection, thereby reducing memory transfers, and the use of specialized hardware to support particular operations significantly improves their performance.

Both the CM-5 and the Paragon show that faster communication is possible when messages are directly injected into the network. We see that the CM-5 latency is small compared to the other architectures, since the CP directly injects and retrieves messages from the network. The effect of reducing memory transfers between the CP and the NP is most clearly seen on the Paragon, where each implementation improves on the previous versions. The advantage of Paragon-NP over Paragon-Proc is especially remarkable given the additional overhead of software address translation in Paragon-NP. The further improvement of Paragon-Inject implies that the benefit of directly injecting into the network outweighs the additional cost of the locks, which are required for mutually exclusive access to the NI. On the NOW, the NP strategy eliminates the message transfer between the NP and the CP; however, the NP can access the CP's memory only through expensive DMA operations issued over the I/O bus. The lower latencies for the NP implementation imply that eliminating the message transfer more than compensates for the cost of the DMA operation.

The Meiko-NP implementation has higher latencies for active messages and reads, since the NP is much slower than the CP. However, the write operation on Meiko-NP is much faster than the read, since the Meiko has hardware support for remote write. Similarly, on the T3D, reads and writes are much faster than active messages, since reads and writes are directly supported in hardware, while active message operations must be constructed from a sequence of remote memory operations. In contrast, the four Proc implementations of read and write on the CM-5, Meiko, Paragon, and NOW are built using active messages, so they take the time of a null active message plus the additional time needed to read or write.

The round-trip measurements on the Paragon also bring out protection and address translation issues. The Paragon-Proc and Paragon-NP implementations show no difference in latency for active messages, because user-supplied handlers cannot be executed on the NP due to inadequate protection. For the Paragon-NP and Inject versions, reads are slower than writes because a read reply requires an extra address translation step for storing the value read into a local variable. In the Receive implementation, the difference between the read and write costs disappears, since the replies are handled on the CP.

Overhead. One expects that the overhead when writing directly to the NI will be greater than if an NP is involved; surprisingly, this is not always the case. Contrary to expectation, the overhead costs for the NP and the Inject versions on the Paragon are similar, which implies that it is as efficient to write to the NI as to write to shared memory on this platform. As expected, the Paragon and the NOW implementations that use the NP for message handling have much less overhead, since the CP does not handle the reply. On the other hand, the Meiko-NP overhead is higher than the Meiko-Proc overhead because of an implementation detail: to avoid constant polling on the NP, the more expensive event mechanism is used instead of shared memory flags for handing off the message to the NP. On the CM-5, we observe that the sending and receiving overhead accounts for almost all of the round-trip latency, unlike the other implementations, which involve a coprocessor.

Comparing the store overhead to the other results helps reveal the underlying architecture. As expected, the CM-5 overhead for gets and puts is almost twice the overhead for stores, which do not require a reply. On the T3D, the overhead for gets and puts is almost half the latency of reads and writes, which means that pipelining two or three operations is sufficient to hide the latency. Unlike on the other platforms, stores on the T3D have a much higher overhead: since the store involves incrementing a remote counter, it cannot be mapped onto the T3D's hardware read/write primitives; instead, a general-purpose active message is used to implement stores.

Gap. The results for the gap expose the bottlenecks in the various systems. For the CM-5 and the T3D, the gap is the same as the overhead, which indicates that the sending processor is the bottleneck. The higher gap for the Meiko-NP and NOW-NP implementations shows us that the slower NP is the bottleneck. On the Paragon, the get operation has a higher gap than puts due to an extra software address translation made while handling the reply. This behavior means that the NP on the sending side is the bottleneck. In the Inject implementation, the cost difference between gets and puts disappears, implying that the NP on the remote node has become the bottleneck.

Two other observations can be made concerning the gap results for the Paragon. First, there is no substantial change in the gap between the Proc and NP implementations, in spite of removing the compute processor from the critical path. Second, the store gap is lower due to an implementation optimization that bunches together acknowledgments. Note that even though the language does not require acknowledgments for stores, they are sent to ensure flow control in the network for all versions, and availability of buffer space for the Proc version.

It is also interesting to note that the NOW-NP implementation trades off gap for latency by having the NP be responsible both for interfacing with the NI and for handling the messages. While this approach lowers latency by eliminating message transfers between the CP and the NP, it increases the load on the NP. If we view the various components of the system as different stages of a pipeline, the NP strategy eliminates some of the stages in the pipeline while increasing the time spent in the longest stage. Consequently, it improves latency at the expense of gap.

Gap to 2. We can further isolate the system bottleneck by modifying the gap microbenchmark to issue gets, puts, and stores to two remote nodes. If an operation can be issued more frequently when issued to two different nodes, then the bottleneck for the operation is the processing power of the remote node; otherwise it is send-side limited. For the CM-5 and the T3D, we notice that the operations are send-side limited. On the Paragon, the numbers support our earlier conjecture that the NP on the source node is the bottleneck in the NP version, while the NP on the destination node is the bottleneck in the Inject and Receive versions. For the Meiko-NP implementation, the remote NP is the bottleneck for gets, while the NP on the source node is the bottleneck for puts and stores. Similarly, for the NOW-NP implementation, we observe that the NP on the source node is the bottleneck for gets and puts, while the NP on the destination node is the bottleneck for stores.

Gap for Exchange. Our final microbenchmark measures the gap when two processors issue requests to each other simultaneously. This test exposes the issues relating to how the CP and NP divide up the workload involved in communication. As expected, for the Proc implementations on the Paragon and the Meiko, the cost is roughly twice the regular gap, since the compute processor experiences the overhead for its own message as well as for the remote request. Since the NP and Inject versions allow for the overlapping of resources, the gap costs increase by less than a factor of two. The Receive version, where the compute processor receives and handles messages to share the communication workload with the NP, performs better than the Inject and NP versions in spite of incurring the cost of using locks. It is interesting to observe that the Receive version has higher overhead and gap when a single node issues fetches from a remote node, but it performs better for bulk-synchronous communication patterns such as exchange.

Conclusion. The latency, overhead, and gap measurements of the Split-C primitives quantify the combined effects of the tradeoffs discussed in the previous section. In particular, we can observe the utility of having direct access to the network, of a fast NP, and of optimizing the global access primitives onto the available architectural support in the target platform. We are also able to observe the effect of factors like software address translation and protection checking.

Figure 6: Bandwidth of Split-C bulk get and store operations for our study platforms.

5.2 Bandwidth for bulk transfers

Figure 6 shows the bandwidth curves for the bulk store and bulk get operations on the different machines. With the exception of the CM-5, all of our architectures have a DMA engine to support bulk transfers. The Meiko NP implementation outperforms Meiko Proc for long messages, achieving a substantially higher peak store bandwidth. This occurs because the Proc version trades off bandwidth for latency by having the NP continuously poll for new messages from the CP. While this reduces the latency, it constantly takes resources from the NP and reduces the bandwidth for bulk transfers. The NP implementation, on the other hand, only schedules threads on the NP when needed. All the Paragon implementations use the same bulk transfer mechanism, since the device is complex to control and must be operated at kernel level, and they achieve the same peak bandwidth. The CM-5, lacking DMA support, achieves a much lower bulk bandwidth. The T3D provides two different mechanisms for bulk transfer, which differ in startup cost and peak bandwidth and thus would be employed in different regimes. The NOW throughput is limited by the SBus bandwidth.

5.3 Polling granularity

To study the impact of attentiveness on communication performance, we use a microbenchmark where each processor performs a simple compute-poll loop, with the computation granularity varied based on an input parameter; after each computation, the process may poll for messages. All processors take turns computing while the remaining processors request a single data item from the busy processor. The requesting processors need this data item to make progress. If the request is not serviced immediately, the requesters idle and only a single processor is busy computing at any given time. However, if the compute granularity is small, or if an NP is used to service requests, the responses come back immediately and all processors can work in parallel.
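The structure of this microbenchmark's inner loop is sketched below; ni_poll, work_remaining, and do_one_unit_of_work are assumed hooks standing in for the actual benchmark code.

    /* Compute-poll loop with a variable granularity parameter. */
    extern void ni_poll(void);            /* drain pending messages, run their handlers */
    extern int  work_remaining(void);
    extern void do_one_unit_of_work(void);

    void compute_with_polling(long granularity)
    {
        while (work_remaining()) {
            for (long i = 0; i < granularity && work_remaining(); i++)
                do_one_unit_of_work();    /* requests arriving now wait ...            */
            ni_poll();                    /* ... until this poll (unless an NP services
                                             them on our behalf)                       */
        }
    }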

Figures 7 and 8 show the impact of varying the compute granularity on the overall runtime. As expected, not polling or polling infrequently results in poor performance. The NP implementation always performs well, because the NP immediately services requests. For a wide range of granularities, polling on the compute processor performs just as well as the NP implementation. Only when polling occurs very frequently does the polling overhead become noticeable.

Figure 7: Run times per iteration for varying granularity between polls on the Meiko CS-2.

Figure 8: Run times per iteration for varying granularity between polls on the Paragon.

5.4 Split-C programs

Finally, we compare the performance of full Split-C applications under the NP and Proc implementations on the Meiko and the Paragon. Table 2 lists our benchmark programs along with the corresponding running times for the different versions of Split-C on the Meiko and Paragon. Figures 9 and 10 display the relative execution times as bar graphs. The programs were run on a Meiko CS-2 partition and a Paragon partition; note that the problem sizes were different on the two machines.

Table 2: Run times for various Split-C programs on the Meiko CS-2 and the Paragon. Run times in seconds for both the main processor and NP implementations are shown in the table, for each program and problem size.

    Program   Description
    radix     Radix sort
    sample    Sample sort
    p-ray     Ray tracer (tea-pot scene)
    sampleb   Sample sort using bulk transfers
    radixb    Radix sort using bulk transfers
    bitonic   Bitonic sort
    fftb      FFT using bulk transfers
    cannon    Cannon matrix multiply
    mm        Blocked matrix multiply
    fft       FFT using small transfers
    shell     Shell sort
    wator     N-body simulation of fish

Figure 9: Run times for our Split-C benchmark programs on the Meiko CS-2, normalized to the running time of the main processor implementation.

Figure 10: Run times for our Split-C benchmark programs on the Paragon, normalized to the running time of the main processor implementation.

On the Meiko, we observe that under the NP strategy radix and sample run slower, while mm, fft, shell, and wator run significantly faster. The remaining benchmarks generate similar timings. Radix and sample run slower because they are communication intensive and use remote events: radix uses stores to permute the data set on each pass, while sample uses an atomic remote push to move data to its destination processor. These primitives are substantially slower under the NP implementation.

Most of the benchmarks have similar run times under the two implementations. This occurs for two reasons. First, most of these programs do not overlap communication and computation to a large degree; instead, they run in phases separated by barriers. As a result, the NP cannot exploit its ability to handle communication while the compute processor computes. Second, much of the communication is done with bulk operations, and there is only a small difference in bulk bandwidth between the two Split-C implementations.

The program with the largest improvement is wator. In this program, each processor runs through a loop that reads a data point and then computes on that data. Since the Proc version only polls when it performs communication, any requests it receives while it is computing experience a long delay. The NP implementation, in contrast, can process requests immediately. This accounts for the large difference in runtimes. To avoid at least some of this delay, the programmer would have to add polls to the compute routine. Unfortunately, this program invokes the X library, the code for which is not readily accessible for inserting polls.

On the Paragon, almost all of the programs run at approximately the same rate under both the Proc and NP implementations of Split-C. The exceptions are the fine-grained, communication-intensive programs radix, sample, and fft, which run faster on the NP implementation because the underlying communication primitives are more efficient. The remaining exception is wator, which runs more efficiently under the NP implementation because of the improved responsiveness. As expected, programs that use bulk transfers do not show much change.

6 Discussion

The original motivation for an NP was to enable user-level protected communication without the limitations of gang scheduling found on machines such as the CM-5. Protection is realized by running part of the operating system on the NP, either in software, as on the Paragon, or in hardware, as on the Meiko. Having the NP helps decrease the overhead observed on the compute processor and thus may enable overlapping of computation and communication to a greater degree. On some of the machines, the NP also helps reduce the gap between messages and improve bulk transfers. An important advantage of the NP is that it improves responsiveness.

There are several factors that prevent us from realizing the full utility of the NP, including its limited speed, the protection functionality required of the NP, and address translation, as well as the synchronization and shared memory access costs between the main processor and the NP. For example, on the Meiko, the protection check, which involves table lookups, is expensive, since the NP does not have an on-chip cache. On the Paragon, the address translation has to be performed in software. Having the NP do the sending increases the observed latency, as one more step is involved. Just getting the information from the compute processor to the NP is already quite expensive, since it involves an elaborate shared memory protocol. Finally, if the NP is a specially designed chip, as in the case of the Meiko and the NOW, it is likely to be much slower than the compute processor.

Discussion

The original motivation for an NP was to enable user-level protected communication without the limitations of gang scheduling found on machines such as the CM-5. Protection is realized by running part of the operating system on the NP, either in software as on the Paragon or in hardware as on the Meiko. Having the NP helps decrease the overhead observed on the compute processor and thus may enable overlapping of computation and communication to a greater degree. On some of the machines the NP also helps reduce the gap between messages and improve bulk transfers. An important advantage of the NP is that it improves responsiveness.

There are several factors that prevent us from realizing the full utility of the NP, including its limited speed, the protection functionality required of the NP, and address translation, as well as the synchronization and shared-memory access cost between the main processor and the NP. For example, on the Meiko the protection check, which involves table lookups, is expensive since the NP does not have an on-chip cache. On the Paragon the address translation has to be performed in software. Having the NP do the sending increases the observed latency, as one more step is involved. Just getting the information from the compute processor to the NP is already quite expensive, since it involves an elaborate shared-memory protocol. Finally, if the NP is a specially designed chip, as in the case of the Meiko and the NOW, it is likely to be much slower than the compute processor.
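To illustrate why the hand-off itself is costly, the sketch below models, in plain C and in a single thread, the kind of shared-memory descriptor ring a compute processor (CP) might use to pass a remote-write request to a network processor (NP), together with the protection-table lookup and software address translation the NP performs before injecting a packet. The request_t layout, the segment table, and the function names are hypothetical illustrations, not the Meiko or Paragon data structures; on real machines each step also implies cache-line migration between the two processors and memory barriers, which is where much of the observed latency comes from.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical request descriptor passed from CP to NP through a
 * shared-memory ring; real machines use their own formats. */
typedef struct {
    volatile int valid;        /* flag the NP polls                  */
    int      dst_node;         /* destination node id                */
    uint32_t seg;              /* protection segment handle          */
    uint32_t offset;           /* offset within the segment          */
    uint64_t data;             /* payload for a remote write         */
} request_t;

#define RING 4
static request_t ring[RING];
static int cp_head = 0, np_tail = 0;

/* Toy protection/translation table the NP consults per request; on an
 * NP without an on-chip cache each lookup is a memory access. */
typedef struct { uint32_t limit; uint64_t base; } segment_t;
static segment_t segtab[2] = { { 4096, 0x10000000ull },
                               { 1024, 0x20000000ull } };

/* CP side: fill a descriptor, then publish it by setting the flag.
 * On real hardware a write barrier or cache-line flush sits here, and
 * the line must migrate to the NP, which is part of the hand-off cost. */
static int cp_post_write(int node, uint32_t seg, uint32_t off, uint64_t val) {
    request_t *r = &ring[cp_head];
    if (r->valid) return -1;                  /* ring full: flow control */
    r->dst_node = node; r->seg = seg; r->offset = off; r->data = val;
    r->valid = 1;                             /* publish last            */
    cp_head = (cp_head + 1) % RING;
    return 0;
}

/* NP side: poll, check protection, translate, then "inject" the packet. */
static void np_service(void) {
    request_t *r = &ring[np_tail];
    if (!r->valid) return;                    /* nothing pending         */
    if (r->seg < 2 && r->offset + sizeof(uint64_t) <= segtab[r->seg].limit) {
        uint64_t phys = segtab[r->seg].base + r->offset;  /* translation */
        printf("NP: write 0x%llx to node %d at 0x%llx\n",
               (unsigned long long)r->data, r->dst_node,
               (unsigned long long)phys);
    } else {
        printf("NP: protection fault on segment %u\n", r->seg);
    }
    r->valid = 0;                             /* free the slot           */
    np_tail = (np_tail + 1) % RING;
}

int main(void) {
    cp_post_write(3, 0, 128, 0xdeadbeef);     /* legal request           */
    cp_post_write(3, 1, 2048, 0x42);          /* exceeds segment limit   */
    np_service();
    np_service();
    return 0;
}

The points of cost in this sketch line up with the factors listed above: the flag hand-off corresponds to the shared-memory protocol between CP and NP, the segment lookup to the protection check, and the base-plus-offset computation to the software address translation.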

Conclusions

There has been a clear trend in the designs of large-scale parallel machines towards more sophisticated hardware support for communication: better user-level messaging capabilities and a greater emphasis on global address-based communication. In many cases this has led in the direction of dedicated network processors; however, there is a great deal of variation in how specialized these are to the communication task, how they interface to the processor and the network, the kind of synchronization support they provide, and the level of protection they offer.

In this study we evaluate the tradeoffs present in this large design space by implementing a simple global address programming language, Split-C, on a range of these architectures and by pursuing a family of implementation strategies, each fully optimized for the capabilities of the hardware under that strategy. We see quite substantial differences in the latency, overhead, and gap exhibited on the individual global access primitives, and the differences are in hindsight readily explained. On most of our applications the differences between the implementation strategies are less pronounced, partly because they tend to use primitives that were more uniform in performance across the strategies and partly because opposing tradeoffs tend to balance out.

The experience of the study and the measurements that it offers provide some clear design guidelines for the communication substructure of very large parallel machines, as well as identifying points where the conclusion is still unclear:

- Imposing a network processor between the application program and the network provides a very simple, although not necessarily inexpensive, means of addressing the complex requirements of protection, address translation, media arbitration, and flow control for communication. However, it is important that the interface between the processor and the network processor be efficient. This is not necessarily achieved by traditional bus-based cache-coherency protocols, since the idea is to move information from producer to consumer quickly rather than to hold data close to the processor that touches it. It does seem to be achievable by a more specialized network processor integrated with the network interface.

- There is an advantage to having the network processor be responsive to the network and service memory access requests without waiting for the processor. However, if the network processor is going to do more than act as an intermediary and sanitize the network interface, it needs to be powerful enough to do this job with performance competitive with the processor. In particular, if the network processor is to provide a protection model powerful enough to allow its use on general-purpose operations, it should be fast enough to be effective on those operations.

- The remote memory performance of the T3D and of certain aspects of the Meiko show that there are clear benefits to be obtained through hardware support for specific operations in the network processor and network interface. However, the application usage characteristics on these large machines will need to stabilize before it will be possible to determine how these advantages balance against design time, cost, or reductions in performance elsewhere in the system.

Acknowledgments

We would like to thank Lok Liu, Rich Martin, Mitch Ferguson, and Paul Kolano for all of their assistance and expertise. We also thank Tom Anderson for providing us with valuable comments. This work was supported in part by the Advanced Research Projects Agency of the Department of Defense under contracts DABT...C... and F...C..., by the Department of Energy under contract DE-FG...ER..., by the National Science Foundation, and by the California Micro Program. David Culler is supported by an NSF PFF Award (CCR-...), Klaus Schauser by an NSF CAREER Award (CCR-...), and Chris Scheiman by an NSF Postdoctoral Award (ASC-...). Computational resources were provided by NSF Infrastructure Grants CDA-... and CDA-....

References

C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. IEEE Computer.

T. E. Anderson, D. E. Culler, and D. A. Patterson. A Case for NOW (Networks of Workstations). IEEE Micro, February.

R. Arpaci, D. Culler, A. Krishnamurthy, S. Steinberg, and K. Yelick. Empirical Evaluation of the CRAY T3D: A Compiler Perspective. In International Symposium on Computer Architecture, June.

E. Barton, J. Cownie, and M. McLaren. Message passing on the Meiko CS-2. Parallel Computing, April.

B. Bershad, S. Savage, P. Pardyak, E. G. Sirer, D. Becker, M. Fiuczynski, C. Chambers, and S. Eggers. Extensibility, Safety and Performance in the SPIN Operating System. In Fifteenth ACM Symposium on Operating Systems Principles.

N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. Su. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, February.

M. C. Carlisle, A. Rogers, J. H. Reppy, and L. J. Hendren. Early experiences with Olden (parallel programming). In Languages and Compilers for Parallel Computing, International Workshop Proceedings, Springer-Verlag.

J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. In Proceedings of the ACM Symposium on Operating Systems Principles, November.

K. M. Chandy and C. Kesselman. Compositional C++: Compositional Parallel Programming. In International Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, August.

Cray Research, Incorporated. The CRAY T3D Hardware Reference Manual.

D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Proceedings of the Conference on Principles and Practice of Parallel Programming, San Diego, CA, May.

D. Culler, L. T. Liu, R. P. Martin, and C. Yoshikawa. LogP Performance Assessment of Fast Network Interfaces. IEEE Micro, February.

D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Supercomputing, Portland, Oregon, November.

W. Groscup. The Intel Paragon XP/S Supercomputer. In Proceedings of the Fifth ECMWF Workshop on the Use of Parallel Processors in Meteorology, November.

High Performance Fortran Forum. High Performance Fortran Language Specification, May.

Kendall Square Research. KSR Technical Summary.

J. Kubiatowicz and A. Agarwal. Anatomy of a Message in the Alewife Multiprocessor. In ACM International Conference on Supercomputing, July.

J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH Multiprocessor. In International Symposium on Computer Architecture, April.

C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. St. Pierre, D. S. Wells, M. C. Wong, S. Yang, and R. Zak. The Network Architecture of the CM-5. In Symposium on Parallel and Distributed Algorithms, June.

D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. L. Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. In Proceedings of the International Symposium on Computer Architecture.

K. Li and P. Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, November.

L. T. Liu and D. E. Culler. Evaluation of the Intel Paragon on Active Message Communication. In Intel Supercomputer Users Group Conference.

R. S. Nikhil. Cid: A Parallel, Shared-Memory C for Distributed-Memory Machines. In Languages and Compilers for Parallel Computing, International Workshop Proceedings, Springer-Verlag.

S. K. Reinhardt, J. R. Larus, and D. A. Wood. Tempest and Typhoon: User-Level Shared Memory. In International Symposium on Computer Architecture, April.

K. E. Schauser and C. J. Scheiman. Experience with Active Messages on the Meiko CS-2. In International Parallel Processing Symposium, April.

K. E. Schauser, C. J. Scheiman, J. M. Ferguson, and P. Z. Kolano. Exploiting the Capabilities of Communications Coprocessors. In International Parallel Processing Symposium, April.

R. L. Sites. Alpha Architecture Reference Manual. Digital Equipment Corporation.

T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Fifteenth ACM Symposium on Operating System Principles, December.

T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. In International Symposium on Computer Architecture.

R. Wahbe, S. Lucco, T. Anderson, and S. Graham. Efficient Software-Based Fault Isolation. In Fourteenth ACM Symposium on Operating System Principles.

M. J. Zekauskas, W. A. Sawdon, and B. N. Bershad. Software Write Detection for a Distributed Shared Memory. In First Symposium on Operating Systems Design and Implementation.