Design and Implementation of a Multipurpose

Cluster System Network Interface Unit

by

Boon Seong Ang

Submitted to the Department of Electrical Engineering and

Computer Science

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February

© Massachusetts Institute of Technology. All rights reserved.

Author ..........................................................

Department of Electrical Engineering and Computer Science

February

Certified by ......................................................

Arvind

Johnson Professor of Computer Science

Thesis Supervisor

Certified by ......................................................

Larry Rudolph

Principal Research Scientist

Thesis Supervisor

Accepted by ......................................................

A. C. Smith

Chairman, Departmental Committee on Graduate Students

Design and Implementation of a Multipurpose Cluster

System Network Interface Unit

by

Boon Seong Ang

Submitted to the Department of Electrical Engineering and Computer Science

on February in partial fulfillment of the

requirements for the degree of

Doctor of Philosophy

Abstract

Today, the interface between a high speed network and a high performance computation node is the least mature hardware technology in scalable, general purpose cluster computing. Currently, the one-interface-fits-all philosophy prevails. This approach performs poorly in some cases because of the complexity of modern memory hierarchy and the wide range of communication sizes and patterns. Today's message passing NIUs are also unable to utilize the best data transfer and coordination mechanisms due to poor integration into the computation node's memory hierarchy. These shortcomings unnecessarily constrain the performance of cluster systems.

Our thesis is that a cluster system NIU should support multiple communication interfaces layered on a virtual message queue substrate in order to streamline data movement both within each node as well as between nodes. The NIU should be tightly integrated into the computation node's memory hierarchy via the cache-coherent snoopy system bus so as to gain access to a rich set of data movement operations. We further propose to achieve the goal of a large set of high performance communication functions with a hybrid NIU microarchitecture that combines custom hardware building blocks with an off-the-shelf embedded processor.

These ideas are tested through the design and implementation of the StarT-Voyager NES, an NIU used to connect a cluster of commercial PowerPC based SMPs. Our prototype demonstrates that it is feasible to implement a multi-interface NIU at reasonable hardware cost. This is achieved by reusing a set of basic hardware building blocks and adopting a layered architecture that separates protected network sharing from software visible communication interfaces. Through different mechanisms, our MHz NIU (MHz processor core) can deliver very low latency for very short messages (under s), very high bandwidth for multi-kilobyte block transfers (MBytes/s bidirectional bandwidth), and very low processor overhead for multicast communication (each additional destination after the first incurs processor clocks).

We introduce the novel idea of supporting a large number of virtual message queues through a combination of hardware Resident message queues and firmware-emulated Non-resident message queues. By using the Resident queues as firmware-controlled caches, our implementation delivers hardware speed on the average while providing graceful degradation in a low cost implementation.

Finally, we also demonstrate that an off-the-shelf embedded processor complements custom hardware in the NIU, with the former providing flexibility and the latter performance. We identify the interface between the embedded processor and custom hardware as a critical design component and propose a command and completion queue interface to improve the performance and reduce the complexity of embedded firmware.

Thesis Supervisor: Arvind
Title: Johnson Professor of Computer Science

Thesis Supervisor: Larry Rudolph
Title: Principal Research Scientist


Acknowledgments

This dissertation would not have been possible without the encouragement, support, patience and cooperation of many people. Although no words can adequately express my gratitude, an acknowledgement is the least I can do.

First and foremost, I want to thank my wife, Wee Lee, and our families for standing by me all these years. They gave me the latitude to seek my calling, were patient as the years passed but I was no closer to enlightenment, and provided me a sanctuary to retreat to whenever my marathon-like graduate school career wore me thin. To you all, my eternal gratitude.

I am greatly indebted to my advisors, Arvind and Larry, for their faith in my abilities and for standing by me throughout my long graduate student career. They gave me the opportunity to co-lead a large systems project, an experience which greatly enriched my systems building skills. To Larry, I want to express my gratitude for all the fatherly/brotherly advice and the cheering sessions in the last leg of my graduate school apprenticeship. I would also like to thank the other members of my thesis committee, Frans and Anant, for helping to refine this work.

I want to thank Derek Chiou for our partnership through graduate school, working together on Monsoon, StarT, StarT-NG and StarT-Voyager. I greatly enjoyed bringing vague ideas to you and jointly developing them into well thought out solutions. This work on the StarT-Voyager NES is as much yours as it is mine. Thank you, too, for the encouragement and counselling you gave me all these years.

The graduate students and staff in the Computation Structures Group gave me a home away from home. Derek Chiou, Alex Caro, Andy Boughton, James Hoe, R. Paul Johnson, Andy Shaw, Shail Aditya Gupta, Xiaowei Shen, Mike Ehrlich, Dan Rosenband and Jan Maessen: thank you for the company in this long pilgrimage through graduate school. It was a pleasure working with all of you bright, hardworking, devoted and, at one time, idealistic people. I am also indebted to many of you, especially Derek and Alex, for coming to my rescue whenever I painted myself into a corner.

In my final years at MIT, I had the pleasure of working with the StarT-Voyager team: Derek, Mike, Dan, Andy Boughton, Jack Constanza, Brad Bartley and Wing-chung Ho. It was a tremendous educational experience laboring alongside the other talented team members, especially Derek, Mike and Dan. I also want to thank our friends at IBM, Marc, Pratap, Eknath, Beng Hong and Alan, for assisting us in the StarT-Voyager project.

To all of you I mentioned above, and many others that I missed, thank you for the lessons and memories of these years.

Contents

Introduction
Motivation
Proposed Cluster Communication Architecture
A Flexible Network Interface Unit (NIU) Design
Related Work
Contributions
Road Map

Design Requirements
Communication
Channel vs. Queues-and-network Model
Message Content Specification
Message Buffer Management
Message Reception Method
Communication Service Guarantee
Synchronization Semantics
Summary
Communication
Caching and Coherence Granularity
Memory Model
Invalidate vs. Update
CC-NUMA and S-COMA
Atomic Access and Operation-aware Protocol
Performance Enhancement Hints
Summary
System Requirements
Multitasking Model
Network Sharing Model
Fault Isolation
SMP Host System Restrictions
NIU Functionalities
Interface to Host
Interface to Network
Data Path and Buffering
Data Transport: Reliability and Ordering
Unilateral Remote Action
Cache Protocol Engine
Home Protocol Engine

A Layered Network Interface Macroarchitecture
Physical Network Layer
Two Independent Networks
Reliable Delivery Option
Ordered Delivery Option and Ordering-set Concept
Bounded Outstanding Packet Count
Programmable Send-rate Limiter
Virtual Queues Layer
Virtual Queue Names and Translation
Dynamic Destination Buffer Allocation
Reactive Flow-control
Decoupled Process Scheduling and Message Queue Activity
Transparent Process Migration
Dynamic Computation Resource Adjustment
Application Interface Layer: Message Passing
Basic Message
Express Message
DMA Transfer
TagOn Message
One-poll
Implications of Handshake Alternatives
Comparison with Coherent Network Interface
Application Interface Layer: Shared Memory
Support for Interface Extensions

StarT-Voyager NES Microarchitecture
StarT-Voyager NES Overview
Alternate Organizations
Using an SMP Processor as NIU Service Processor
Custom NIU ASIC with Integrated Programmable Core
Table-driven Protocol Engines
StarT-Voyager NES Execution Model
Interface between sP and NES Custom Functional Units
Option: Status and Command Registers
Option: Command and Completion Queues
Command Ordering and Data Dependence
Command Completion Notification
Option: Template Augmented Command and Completion Queues
NES Core Microarchitecture
Resident Basic Message
Resident Express Message
One-Poll
TagOn Capability
NES Reclaim
sP Bus Master Capability on SMP System Bus
Inter-node DMA
sP Serviced Space
Snooped Space
Mapping onto Microarchitecture
Physical Network Layer Implementation
Virtual Queues Layer Implementation
Application Interface Layer Implementation: Message Passing Interfaces
Application Interface Layer Implementation: Shared Memory Interfaces
NES Hardware Implementation
Design Flow

Evaluations
Evaluation Methodology
Processor Core Simulation
Memory System, NES and Network Simulation
Multiple Message Passing Mechanisms
Bandwidth
Latency
Processor Overhead
Multiple Message Queues Support
Performance Cost to Resident Message Queues
Comparison of Resident and Non-resident Basic Message Queues
Performance Limits of the sP
sP Handling of Micro-operations
sP Handling of Macro-operations

Conclusions and Future Work
What We Did
What We Learned
Future Work

Chapter

Intro duction

Today, the interface between a high speed network and a high performance computation node is the least mature hardware technology in scalable, general purpose cluster computing. Currently, the one-interface-fits-all philosophy prevails. This approach performs poorly in some cases because of the complexity of modern memory hierarchy and the wide range of communication sizes and patterns. Today's message passing NIUs are also unable to utilize the best data transfer and coordination mechanisms due to poor integration into the computation node's memory hierarchy. These shortcomings unnecessarily constrain the performance of cluster systems.

Our thesis is that a cluster system NIU should support multiple communication interfaces layered on a virtual message queue substrate in order to streamline data movement both within each node as well as between nodes. The NIU should be tightly integrated into the computation node's memory hierarchy via the cache-coherent system bus so as to gain access to a rich set of data movement operations. We further propose to achieve the goal of a large set of high performance communication functions with a hybrid NIU microarchitecture that combines a set of custom hardware basic building blocks with an off-the-shelf embedded processor.

Together, these features provide a cost effective solution for running mixed workloads encompassing parallel, distributed, client-server and sequential applications. This means achieving both good overall system utilization and high single application performance across applications with widely different communication requirements. Our proposed NIU architectural features address the seemingly opposing requirements of high performance, multiple communication interface support while catering to system-level issues of sharing, protection and job scheduling flexibility.

Motivation

Connecting a cluster of commercial Symmetric Multiprocessors with a high performance network is an attractive way of building large, general purpose computation platforms. By exploiting existing high volume commercial hardware and software as the building blocks, this approach both reduces cost and provides an evolutionary system upgrade path. It also offers other advantages, such as modular expansion and, with appropriate system software, higher availability because of multiple instances of similar resources.

The communication system is a key component of a cluster system and should have the following attributes. Realizing these goals will require support in the NIU.

A general purpose cluster system will encounter applications with a range of communication requirements, e.g. fine and coarse grain communication, and shared memory and message passing styles of communication. For both compatibility and performance reasons, it is important that a cluster system supports a range of communication interfaces.

High performance is important for the system to be able to support fine grain parallel processing and aggressive resource sharing across the cluster. Ideally, inter-node latency and bandwidth should be no more than a few times worse than those within a node.

Finally, system oriented support (efficient network sharing, full communication protection, and flexible job scheduling) is also critical, as system throughput is just as important as single application performance.

Today's NIUs, divided between the two camps of message passing NIUs and shared memory NIUs, fall short of these goals.

All NIUs today are designed with a one-interface-fits-all philosophy, presumably to keep hardware simple and fast. Should an application desire a communication interface different from that provided by the underlying hardware, software layers are used to synthesize it from the hardware supported operations. Unfortunately, neither type of NIU offers a communication instruction set sufficient for efficient software synthesis of all common communication interfaces.

The problem is particularly acute for message passing NIUs, as it is very difficult to synthesize shared memory in a way that is transparent to application programs. A number of methods have been attempted, but none have found wide acceptance. With the growing popularity of SMPs and SMP applications, this is a significant drawback.

Shared memory NIUs have fared better, since it is functionally simple to synthesize message passing on shared memory. Nonetheless, shared memory emulated message passing incurs more network trips than a direct implementation. This points to an interesting feature of shared memory systems: software has no direct control over data movement, which occurs indirectly in response to cache-misses. Although it simplifies programming, this feature also creates inefficiency, particularly for control oriented communication.

The work of Mellor-Crummey and Scott on shared memory implementations of mutex lock and barrier provides an interesting illustration. At a meta-level, the solution is to understand the behavior of the underlying coherence protocol and craft algorithms that coax it into communicating in a way that is close to what a direct message passing implementation would have done. Unfortunately, due to the lack of direct control over communication, even these clever lock and barrier implementations incur more network traffic than a message passing implementation of similar algorithms (see Heinlein¹).

¹ Heinlein's implementation is best viewed as a hybrid that blurs the line between message passing and shared memory. A large part of the message passing code implementing the lock and barrier runs on an embedded processor in the NIU, with application code interacting with this code through memory mapped interfaces. One could either view this as extending shared memory with a special lock and barrier protocol, or as message passing code offloaded onto the NIU.

The bottom line is that shared memory systems' communication instruction set is also inadequate, resulting in redundant data movement between nodes. Interestingly, while message passing NIUs provide good control over data movement between nodes, many of them are not well integrated into the node's memory hierarchy, resulting in inefficient data movement and control exchange within a node. Solving this problem requires the NIU to be located on the cache-coherent system bus so that it has access to a rich set of intra-node data movement operations. In addition, the NIU should offer multiple communication interfaces so that software has a sufficiently rich communication instruction set to synthesize its desired communication efficiently.

Supporting multiple communication interfaces requires revisiting the issue of network sharing and protection. Traditionally, message passing and shared memory systems handle this issue differently. Shared memory machines essentially skirt this issue by providing shared communication access without allowing direct network access. Although entities in different protection domains can communicate concurrently through shared memory accesses, with protection enforced by the normal virtual address translation mechanism, the fast network is used directly only by cache-coherence protocol traffic.

Message passing systems that permit direct user-level network access have to resolve a tougher network sharing protection problem. Aside from preventing illegal message transmission and reception, the protection scheme has to prevent deadlock and starvation that may arise from sharing network resources between otherwise independent jobs. In our opinion, this problem has never been solved satisfactorily in existing systems. Current solutions, discussed in Section , suffer from one or more drawbacks, including harsh restrictions on job scheduling policies, significant latency penalty, and constrained functionality.

[Figure: Our proposed layered cluster system communication architecture — the Application Interface Layer, above the Virtual Queues Layer, above the Physical Network Layer.]

Network sharing and protection get even more complex when both shared memory and message passing interfaces are supported over the same network. Interaction between the two breaks some solutions used in systems that support only one of them. For example, a solution employed in message passing systems to prevent network deadlocks is to rely on software to continually accept packets from the network. In

a system with both shared memory and message passing support, this no longer works because software may not get a chance to service messages: its processor may be stalled waiting for a cache-miss processing to complete. In turn, the cache-miss processing could be waiting for network resources held by message passing traffic to become available. Solving problems of this nature requires a comprehensive protected network sharing model.

Proposed Cluster Communication Architecture

We propose a layered communication architecture, illustrated in Figure , to meet the complex communication requirements of a cluster system. This design assumes that the NIU resides on the SMP node's memory bus, giving it the ability to participate in cache-coherence snooping operations. It also assumes an NIU with a programmable core, making it feasible to support a large and extendable set of communication functions.

The Physical Network Layer is a transport layer that provides reliable packet delivery over two logically independent networks. It is also responsible for regulating the flow of traffic through the network to avoid network congestion.

The Virtual Queues Layer implements the bulk of the protection scheme in our system. It operates on the abstraction of virtual message queues, using the packet transport services provided by the Physical Network layer to move each message from its transmit queue (TxQ) to its receive queue (RxQ). By controlling local queue access and transmit-to-receive-queue connectivity, the Virtual Queues layer provides system software with the mechanism to stitch virtual message queues into independent communication domains.
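The queue-stitching mechanism can be illustrated with a toy model. This is only a sketch: the class, method and queue names below are hypothetical, and the real layer realizes the connectivity table in NES hardware and sP firmware rather than software objects.

```python
# Toy model of the Virtual Queues layer's protection mechanism.
# All names are hypothetical; the NES implements this with hardware
# queue-translation tables and sP firmware, not Python objects.

class VirtualQueuesLayer:
    def __init__(self):
        self.connectivity = {}   # TxQ name -> set of permitted RxQ names
        self.rxq_buffers = {}    # RxQ name -> delivered messages

    def connect(self, txq, rxq):
        """System software stitches queues into a communication domain."""
        self.connectivity.setdefault(txq, set()).add(rxq)
        self.rxq_buffers.setdefault(rxq, [])

    def send(self, txq, rxq, payload):
        """Deliver a message only if the connectivity table permits it."""
        if rxq not in self.connectivity.get(txq, set()):
            return False   # cross-domain send: refused by the layer
        self.rxq_buffers[rxq].append(payload)
        return True

vq = VirtualQueuesLayer()
vq.connect("jobA.tx0", "jobA.rx0")
assert vq.send("jobA.tx0", "jobA.rx0", "hello")    # same domain: delivered
assert not vq.send("jobA.tx0", "jobB.rx0", "spy")  # outside domain: refused
```

The point of the model is that protection lives in the connectivity table, which only system software may edit; application code never names raw network resources.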

The Virtual Queues layer deals with the problem of network deadlock arising from dynamic receive queue buffer space allocation with a novel Reactive Flow-control scheme. This lazy queue-to-queue flow-control strategy incurs no flow-control cost when communication traffic is well-behaved, imposing flow-control only when a possible problem is detected (see Section for further details).
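The lazy trigger at the heart of this idea can be sketched in a few lines. All thresholds and names here are illustrative assumptions, not the NES's actual mechanism: the sketch only shows that no flow-control action occurs until occupancy crosses a high-water mark.

```python
# Sketch of reactive flow-control at a receive queue (hypothetical
# thresholds): well-behaved traffic pays no flow-control cost; only when
# the queue nears capacity does the receiver start throttling senders.

class ReactiveRxQueue:
    def __init__(self, capacity, high_water):
        self.capacity = capacity
        self.high_water = high_water
        self.occupancy = 0
        self.throttled = False   # True once flow-control has been imposed

    def enqueue(self):
        """Accept a packet; trigger flow-control only when nearly full."""
        if self.occupancy >= self.capacity:
            return "refused"     # would risk deadlock without flow-control
        self.occupancy += 1
        if self.occupancy >= self.high_water and not self.throttled:
            self.throttled = True          # lazily notify senders
            return "accepted+throttle"
        return "accepted"

rxq = ReactiveRxQueue(capacity=4, high_water=3)
assert rxq.enqueue() == "accepted"
assert rxq.enqueue() == "accepted"
assert rxq.enqueue() == "accepted+throttle"  # crossing the mark reacts
```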

Virtualization of message queue names in the Virtual Queues layer introduces a level of indirection which facilitates job migration and the contraction of the number of processors devoted to a parallel job. Furthermore, our design allows a message queue to remain active independent of the scheduling state of the process using it. Together, these features give system job scheduling unprecedented flexibility.

Finally, the Application Interface Layer focuses on the interface seen by application code. One could view this layer as providing wrappers around the message queue service of the Virtual Queues layer to form the communication instruction set. As an example, a wrapper for large messages marshals data from, and stores data into, user virtual address space, implementing virtual memory to virtual memory copy across the network. Another wrapper in our design implements an interface crafted to reduce end-to-end latency and message send/receive handshake overhead of very short messages.
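The wrapper idea, different interfaces assembled over one packet-send primitive, can be sketched as follows. This is a hedged illustration: the packet size and function names are assumptions, not the NES interfaces.

```python
# Two Application Interface layer "wrappers" over a single underlying
# packet-send primitive (all names and sizes hypothetical): a bulk-transfer
# wrapper that fragments a large buffer, and a short-message wrapper that
# sends its payload in one packet to minimize handshake overhead.

PACKET_PAYLOAD = 64  # bytes per packet; an assumed figure, not the NES's

def dma_wrapper(send_packet, data: bytes):
    """Virtual-memory-to-virtual-memory copy: marshal data into packets."""
    for off in range(0, len(data), PACKET_PAYLOAD):
        send_packet(data[off:off + PACKET_PAYLOAD])

def express_wrapper(send_packet, word: int):
    """Very short message: a single packet, no buffer marshalling."""
    send_packet(word.to_bytes(8, "little"))

sent = []
dma_wrapper(sent.append, bytes(200))
assert len(sent) == 4            # 200 bytes -> packets of 64+64+64+8
express_wrapper(sent.append, 42)
assert len(sent) == 5
```

Both wrappers share the same substrate, which is why supporting several of them need not multiply hardware cost.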

Cache-coherent distributed shared memory support can be viewed as yet another wrapper, albeit a sophisticated one. This wrapper observes transactions on the memory bus and, where necessary, translates them into request messages to remote nodes. It also services requests from other nodes, supplying data, initiating invalidation or update actions, or executing these actions on local caches to maintain data coherence in the shared address space.
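A minimal sketch of the observe-and-translate step, with all names hypothetical and the protocol reduced to a single read request (real coherence protocols carry far more state):

```python
# Hedged sketch of the shared-memory wrapper: a bus transaction observed
# by the NIU either completes locally or becomes a request message to the
# home node of the address. Names and the home mapping are hypothetical.

def on_bus_miss(addr, home_of, send_msg, local_node):
    """Translate an observed cache-miss into a coherence request message."""
    home = home_of(addr)
    if home == local_node:
        return "local"                      # served from local memory
    send_msg(home, ("read_request", addr))  # remote miss becomes a message
    return "remote"

msgs = []
res = on_bus_miss(0x1000,
                  home_of=lambda a: (a >> 12) % 4,   # assumed page-interleaved homes
                  send_msg=lambda n, m: msgs.append((n, m)),
                  local_node=0)
assert res == "remote" and msgs == [(1, ("read_request", 0x1000))]
```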

It is feasible for an NIU to support a fair number of wrappers because, although they export different communication abstractions, their implementations share many subcomponents. As such, the actual provision of multiple Application Interface layer abstractions can be achieved with a relatively small set of composable primitives.

By dividing the communication architecture into several layers with well defined responsibilities, this approach not only avoids the duplication of functions but also simplifies the internal design of each layer. For instance, with the Virtual Queues Layer responsible for network sharing issues, the design and implementation of each instance in the interface layer can be done in isolation, without concern about how other instances are using the shared network.

A Flexible Network Interface Unit (NIU) Design

The abstract communication architecture described in the previous section is tested in an actual implementation: the StarT-Voyager Network Endpoint Subsystem (NES)². The NES connects a cluster of commercial IBM PowerPC 604e-based SMPs to the Arctic network, a high performance packet switched Fat-Tree network.

Figure  illustrates how the StarT-Voyager NES replaces one of the processor cards in the normal SMP to interface directly to the SMP's cache-coherent 60X system bus. The diagram also shows the two main components of the NES: an NES Core containing custom logic, and an sP subsystem containing a PowerPC 604 processor used as an embedded processor. We refer to this processor as the Service Processor (sP) and the host SMP's processors as the Application Processors (aPs).

The NES microarchitecture attempts to achieve both performance and flexibility through the combination of custom hardware and embedded processor firmware. By ensuring that the common communication operations are fully supported in hardware, average performance is close to hardware speed. On the other hand, the sP firmware handles the infrequent corner cases and provides extension capabilities. This combination ensures that the hardware can be kept simple, and thus fast, despite

² These are the IBM RISC System Model P machines, first introduced in the fourth quarter of .

Figure : The top of the diagram shows an original SMP; the bottom shows the one used in the StarT-Voyager system, with a processor card replaced by the StarT-Voyager NES. (The figure depicts PowerPC 604e processor cards with 32kByte I- and D-caches and 512kByte Level-2 caches on a 64-bit data/66MHz 60X system bus, with memory controller, PCI bridge and DRAM banks on the motherboard; the NES comprises the NES Core custom logic, SRAM, FIFOs and network level translation, plus the sP subsystem of a 604 and its memory system, connected to the Arctic network.)

the NES's support for a large set of functionalities. Key to the effectiveness of this design is the interface between the sP and the NES Core, described in Section .

Implementation of multiple virtual queues provides a concrete example of this hybrid hardware/firmware implementation approach. The abstraction of a large number of active message queues is achieved with a small number of NES Core implemented hardware Resident queues and a large number of sP firmware implemented Non-resident queues. Both types of queues export identical interfaces to aP software. In addition, the Resident queues can be used as system software or sP firmware managed caches of the Non-resident queues, and switching a queue between Resident and Non-resident resources is transparent to code using the queue.
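The caching relationship between Resident and Non-resident queues can be modeled roughly as follows. This is a sketch under assumed names: the real binding and eviction are performed by system software and sP firmware, and the victim-selection policy here is arbitrary.

```python
# Toy model of the Resident/Non-resident queue scheme (hypothetical names):
# a small number of hardware Resident slots act as a firmware-managed cache
# over many virtual queues. A miss rebinds a slot to the requested queue,
# transparently to the code using the queue.

class QueueDirectory:
    def __init__(self, resident_slots):
        self.resident = {}                  # virtual queue id -> slot number
        self.free_slots = list(range(resident_slots))

    def access(self, qid):
        """Return (slot, hit) -- rebinding the queue to a slot on a miss."""
        if qid in self.resident:
            return self.resident[qid], True   # hardware-speed path
        if not self.free_slots:               # evict an arbitrary victim
            victim, slot = self.resident.popitem()
            self.free_slots.append(slot)
        slot = self.free_slots.pop()
        self.resident[qid] = slot             # "firmware" rebinds the slot
        return slot, False

d = QueueDirectory(resident_slots=2)
assert d.access("q0")[1] is False   # cold miss: serviced by firmware path
assert d.access("q0")[1] is True    # now served at hardware speed
d.access("q1"); d.access("q2")      # third queue evicts one of the first two
assert len(d.resident) == 2
```

Because both paths export the same interface, average-case accesses run at hardware speed while the slow path merely degrades gracefully, which is the property the text claims for the NES.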

The NES Core is designed as a collection of communication primitives. These are assembled by other NES custom hardware or sP firmware into the functions available to aP software, providing an efficient way to implement multiple abstractions while facilitating future extension. As an illustration of this design concept, the NES Core supports several different message passing mechanisms (Express, Basic, Express-TagOn, Basic-TagOn, and inter-node DMA) catering to messages of increasing size.

Although coherent shared memory is not a focus of this work, the NES Core provides sufficient hardware hooks so that sP firmware can implement various cache coherence protocols. This part of the design emphasizes flexibility for experimentation over absolute performance. With researchers constantly coming up with suggestions for modifying cache coherence protocols to improve performance, we believe that it is useful to have a platform which allows easy modification to its protocol, so that meaningful comparison with real workloads can be done.

The StarT-Voyager NES is designed to facilitate migration of performance critical portions of sP firmware into hardware. Major portions of the StarT-Voyager NES are implemented in FPGAs, with the base design occupying less than a third of the total FPGA space. Because the StarT-Voyager NES is designed in a modular fashion with well-defined functional units, moving a function from sP firmware to custom hardware involves the addition of a new functional unit without major perturbations to the existing design. With two levels of programmability, StarT-Voyager allows new functions to be implemented and fully debugged in firmware first; subsequently, portions that are critical to performance can be moved into FPGA hardware.

Related Work

This section compares our NIU design with some closely related work; we leveraged many of the ideas proposed in these works. Section  describes NIU design in general and includes references to a larger set of related work.

Synfinity-NUMA

A recent commercial product, the Fujitsu Synfinity-NUMA, is very similar to our design in that it employs a layered NIU architecture with each layer handling a different function. Like our communication architecture, the top layer deals with issues of interfacing to the applications. Their second layer is responsible for recovering from packet losses, and as such presents a level of service equivalent to our bottom layer. Their third layer provides a raw network transport service that has a low, but non-negligible, loss/error rate. They do not have a layer corresponding to our Virtual Queues layer. The network interface in Synfinity-NUMA is not programmable. Hence, although it supports both message passing and shared memory, there is no capability for any further communication interface extensions.

FLASH

The FLASH multiprocessor's communication element, the MAGIC chip, is the closest to our design from a capability perspective. A custom designed chip, MAGIC interfaces to the node processor, DRAM, PCI I/O bus, and a high performance network. It has a programmable core, called the PP, which coordinates data movement between the four components that it interfaces with. The PP also runs cache-coherence protocol code. Because MAGIC occupies an extremely strategic position in the compute node and is programmable, it can be programmed to implement any function that our design can offer.

Our design differs from MAGIC/FLASH in that we want to interface to commercial SMPs without replacing their memory controller and I/O bus bridge chip. As a result, we face different design constraints. In that respect DASH, the predecessor of FLASH, is more similar, but it only supports shared memory and has no programmability.

Our NIU design also takes a very different microarchitecture approach. In the FLASH design, all memory, I/O, and network communication is processed by MAGIC's programmable core. Obviously the performance of this programmable core is critical, and their research focuses on making this fast. We are interested in using an off-the-shelf microprocessor as the programmable element in our NIU, to lower development cost and capitalize on the rapid improvements and higher core performance of commercial microprocessors. To compensate for the weaknesses of our approach, such as slow off-chip access in today's microprocessors, we provide full NIU hardware support for the most common and simple communication operations. Furthermore, the NIU provides the embedded processor with a rich set of communication oriented primitives. These enable the embedded processor to orchestrate data transfer without directly touching data, and to coordinate a sequence of primitives without constantly monitoring their progress.

As a result of the difference in implementation approach, the programmability of the two designs is good for different things. For performance reasons, our NIU's programmable portion can only be used to service infrequent operations, but it can afford to execute more complex sequences of code than can MAGIC's PP on each occasion.

Typhoon

The Typhoon design is similar to ours in that they advocate using a commercial microprocessor to provide programmability in the NIU. The design mentions support for both message passing and shared memory, though there is no publicly available description of the message passing aspect of the machine. No machine beyond Typhoon-0, a stripped down version of the design, was built.

Typhoon-0 uses Myrinet, an I/O bus NIU, to provide message passing service; an existing SMP processor, one of the Sparc processors in their SUN SMP, as protocol engine; and custom bus snooper hardware which imposes cache-line granularity access permission checks on bus transactions to main memory (DRAM). Typhoon has little control over the design of most of its components since they come off-the-shelf. Its design does not address many of the microarchitectural issues that we studied.

Alewife

Alewife, an earlier experimental machine built at MIT, is similar to our work in that it supports both message passing and shared memory. Its communication support is programmable in an interesting way. Alewife uses a modified Sparc processor called Sparcle, which has hardware multithreading and fast interrupt support. This makes it feasible for its CMMU, the cache and memory management unit which also serves as the network interface unit, to interrupt Sparcle and have software take over some of its functions. For instance, this is used to handle corner cases in Alewife's cache-coherence protocol.

Alewife does not deal with issues of protection and sharing. These are investigated in a follow-on project, FUGU. Aside from adding small hardware extensions to Alewife, FUGU relies on Sparcle's fast interrupt and interrupt software to impose sharing protection.

The constraints we faced are very different from those faced in Alewife and FUGU. For example, Sparcle has no on-chip cache, and Alewife's CMMU essentially interfaces to Sparcle's L1 cache bus. This is much closer to the processor core than any possible point of interface in today's commercial SMPs. Sparcle's fast interrupt and multithreading support is also not available on today's commercial microprocessors, whose complex superscalar and speculative processing pipelines are slow to respond to interrupts. These differences make many of Alewife's solutions inapplicable to a cluster of unmodified SMPs.

Hamlyn

Hamlyn, a message passing interface architecture that was implemented on the Myrinet hardware, shares our work's goal of sharing a fast network between multiple users in a protected fashion without scheduling restriction. Through their choice of requiring the sender to specify the destination memory to write a message into, they avoid one of the major problems of sharing the network: dynamic receive queue buffer allocation. Whereas the NIU is responsible for receive buffer allocation in our architecture, this task is relegated to the message sender software in Hamlyn. Hamlyn shares another similarity with our design: support for several different message types targeting messages of varying granularity.

Remote Queues

Brewer and Chong proposed exposing message receive queues as Remote Queues to software in order to optimize message passing. The abstraction of message queues is a central part of our protection mechanism, and we leveraged their work.

Contributions

The main contribution of this work is the proposition that the NIU architecture for cluster systems can be endowed with a rich set of capabilities that provide sharing, protection, and flexibility without sacrificing low latency and high performance. This proposition is supported with a full, realistic NIU design and associated simulations. The following is a list of novel features explored in this work:

- A network sharing scheme based on multiple virtual message queues and a simple Reactive flow-control strategy. This scheme permits flexible network sharing while retaining full protection and low communication latency.

- A cost effective, high performance implementation of the multiple virtual message queue abstraction using caching concepts.

- A set of efficient message passing primitives crafted to achieve high performance over a wide range of message sizes in an SMP environment.

- An NIU microarchitecture that couples a commercial microprocessor with custom hardware to achieve high performance and flexibility. The custom hardware is organized around a basic set of communication primitives, assembled by other NIU hardware or embedded processor firmware into multiple communication interfaces.

- The actual construction of an NIU, the StarT-Voyager NES, which embodies most of the ideas proposed in this work.

The primary emphasis of this thesis work is demonstrating the feasibility of an NIU which directly supports multiple extendable communication interfaces. It does not attempt to provide a definitive study of whether this NIU architecture leads to better performance than alternate communication architectures, such as one that relies only on cache-coherent shared memory hardware. Furthermore, evaluation in this thesis focuses on the message passing aspect of the StarT-Voyager NES, as shared memory support is studied by other research group members.

Our evaluation of the StarT-Voyager NES shows that with an appropriate architecture, multi-interface support and flexible, protected sharing of the NIU and network are compatible with high performance and low latency. For example, a comparison with StarT-X, an I/O bus message passing NIU implemented in similar technology but without NIU programmability or protection for network sharing, shows that the StarT-Voyager NES delivers superior message passing performance: higher bandwidth for large transfers and lower latency for the shortest messages. Part of the advantage derives from being on the system memory bus, while the remainder comes from the multiple message passing mechanisms of StarT-Voyager, each of which has its own sweet spot.

We also show that only a small amount of fast memory and custom logic is required for virtual queue support. A large number of message queues can be supported using a Resident/Non-resident scheme: only a small number of queues are buffered in the NES while the rest are buffered in DRAM. Although message passing through Non-resident queues incurs longer latency and achieves lower throughput, the degradation is reasonable: less than a factor of five in the worst case.

(Footnote: Determining the exact size of the message queue working set is beyond the scope of this thesis, as it is dependent on node size and workload. In the StarT-Voyager test bed we provide sixteen transmit and sixteen receive resident queues, a number which should be sufficient for the needs of the operating system and the current, previous, and next user jobs.)

The Non-resident message queues, implemented with firmware on the sP, provide our first examples of the utility of the sP. As further study of the sP, we conducted a set of block transfer experiments, described later, with application code, sP code, and dedicated hardware taking on varying responsibilities. These experiments show that the sP is functionally extremely flexible. They also show that, when pushed to the extreme, sP performance is limited by context switch overhead when servicing fine-grain communication events, and by off-chip access when handling coarse-grain communication events.

Road Map

Chapter 1 provided a summary of this thesis: why we are interested in this work, the problems addressed, the solution proposed (both abstract and concrete), our contributions, and results.

Chapter 2 examines the communication requirements of cluster systems: what is needed from the NIU, current NIU design practices, and why these designs are inadequate.

Chapter 3 presents an abstract three-layered network interface architecture that meets the goals set forth in Chapter 2. In addition to describing each of the three layers, this chapter also explains the rationale behind the design choices, including comparing them with alternate design options.

Chapter 4 describes the StarT-Voyager NES, a concrete implementation of the architecture proposed in Chapter 3. The microarchitecture of the NES is first presented at the functional level. Next, the mapping of the abstract architecture of Chapter 3 onto the microarchitecture is described. Finally, the hardware mapping of the functional blocks into physical devices and the hardware design flow are presented.

Chapter 5 presents a quantitative evaluation of the NES. These evaluations are done on a simulator because the NES hardware was not available in time for this work. Using microbenchmarks, a series of experiments demonstrates the performance of both the fully hardware implemented Resident message queues and the sP implemented Non-resident message queues. A second series of experiments examines several different ways of implementing block transfers on the NES. They not only demonstrate the versatility of the NES's programmability, but also throw light on the performance potential and limits of the design.

Chapter 6 summarizes what we learned from this research and suggests several avenues for future work.

Chapter 2

Design Requirements

An NIU designed to support multiple communication abstractions and share a fast network between traffic in several protection domains has to overcome a number of challenges. One is implementing the multiple communication abstractions efficiently, i.e., achieving good performance while working within reasonable hardware cost and design complexity. A second challenge is sharing the network in a protected fashion without degrading performance, such as incurring longer latency. A third issue is achieving efficient interaction between the application processor and the NIU so as to keep communication overhead low; this is particularly challenging for message passing.

This chapter approaches these issues by examining the communication needs of cluster systems, the restrictions imposed by commercial SMPs, and the role of the NIU in meeting these requirements. The next two sections survey the current practices and promising new directions in message passing and shared memory communication, respectively. These functions constitute a core set of capabilities that our NIU has to support. Next, we examine system-level communication issues. Since our design targets commercial SMPs as the host nodes, it has to respect SMP imposed restrictions, discussed subsequently. The last section of this chapter explains what an NIU does to deliver these communication functions. We approach this task in an incremental fashion, and in the process highlight some existing NIUs that are representative of particular classes of NIU.

Message Passing Communication

Message passing is a broad term referring to communication involving explicit software send and receive actions. In addition to transporting data from a sender to a receiver, message passing often associates control implications with the events of sending or receiving a message. Occasionally, the term is also used to include Get and Put operations: unilateral remote data fetch and write actions explicitly requested by software on one node but completed without direct participation of software on the remote node.

Application code usually utilizes the message passing service of a library, which presents it with a convenient, portable interface. Any mismatch between this interface and the machine's native communication capability is hidden by the library code. As is to be expected, a good match between the library's message passing interface and the machine's capability reduces the cost of library emulation code.

Message passing libraries come in many forms. Some, crafted to be fast, offer slightly abstracted versions of the machine's underlying communication support. Examples include Active Messages, Fast Messages, and Intel's Virtual Interface Architecture. Others, such as NX, PVM, and MPI, offer many more functions meant to simplify the task of writing message passing parallel programs. The following are some common variations among the different message passing services.

Channel vs. Queues-and-network Model

Two connectivity models, channel and queues-and-network, are common among message passing libraries. Under the channel model, each channel connects exactly one sender to one receiver. In contrast to this one-to-one model, the queues-and-network model offers many-to-many connectivity, where each sender can send messages to a large number of destinations using the same transmit queue. Each receiver similarly can receive messages from a large number of senders through one receive queue. Whereas the message destination of each channel is fixed when it is set up, each message sent via a transmit queue specifies its destination.

[Figure: Two common models of communication: channels, and network-connected queues. A channel connects a sender to only one receiver. In contrast, messages can be sent to multiple receive queues through each send queue.]

The queues-and-network model is advantageous when the communication pattern is dynamically determined and involves many source-destination pairs. To connect s senders with r receivers, the channel model requires s × r channels, compared to s + r queues for the queues-and-network model. Furthermore, if messages are received by polling, the channel model requires each receiver to poll from up to s channels if the source of the next message is unknown.
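To make the resource comparison concrete, the following small sketch tallies both models; the node counts and function names are ours, chosen purely for illustration.

```python
# Illustrative count of connectivity resources under the two models.
def channel_model_resources(s: int, r: int) -> int:
    """Channel model: one dedicated channel per sender-receiver pair."""
    return s * r

def queues_and_network_resources(s: int, r: int) -> int:
    """Queues-and-network model: one transmit queue per sender plus
    one receive queue per receiver."""
    return s + r

# With 32 senders and 32 receivers, the channel model needs 1024
# channels while the queues-and-network model needs only 64 queues;
# a polling receiver must also check up to 32 channels versus a
# single receive queue.
print(channel_model_resources(32, 32))       # 1024
print(queues_and_network_resources(32, 32))  # 64
```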

The channel model is appropriate when the communication pattern is static and involves only a small number of source-destination pairs. Many traditional parallel programs involve only nearest neighbor communication and global barrier synchronization, which can be implemented with a reduction tree. In such cases, channels are only needed between nearest neighbors and neighboring nodes in the reduction tree.

The queues-and-network model is more flexible than the channel model, but comes at the price of more complex message buffer management, as we will see later. It is our design goal to support this model because it offers a superset of the functions of the channel model.

Message Content Specification

The content of a message can be specified in one of two ways: (i) by reference, with the starting address and transfer size, or (ii) by value, with the data copied explicitly into some special send buffer. Specification by reference can potentially reduce the number of times message data is copied. Zero copying is possible provided the NIU is able to access memory with user virtual addresses. This means that the NIU must be able to translate user virtual addresses into physical addresses, because the host SMP's system bus deals with physical addresses only. The alternative of making system calls to translate virtual addresses to physical addresses incurs unacceptably high overhead. Unfortunately, many NIUs are not equipped with this translation capability, so that when a library interface uses specification by reference, the library code ends up copying the message data.

Zero copying is not always an important goal. If the message contains very little data, data copying overhead is small, while directly writing the data to the NIU reduces message passing latency. In addition, specification by reference is advantageous only if the location of the data can be specified easily, e.g., if the message data already exists in contiguous memory locations or at some regular stride. Otherwise, the task of describing the data layout may be as expensive, if not more expensive, than assembling the data into the send buffer. If message content can be specified by reference only, application code may end up assembling data in normal memory locations, which nullifies the advantage of specification by reference.

A good design should support specification by value for short to medium sized messages, while adopting specification by reference for large messages.
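A hybrid interface along these lines might dispatch on message size. The sketch below is purely illustrative: the function name, the descriptor format, and the 256-byte cutoff are our assumptions, not part of any design described in this thesis.

```python
# Hypothetical hybrid send interface: copy small payloads by value,
# pass large payloads by reference (address + length) for zero-copy.
BY_VALUE_CUTOFF = 256  # bytes; an assumed, tunable threshold

def send(dest: int, payload: bytes):
    if len(payload) <= BY_VALUE_CUTOFF:
        # Specification by value: the data is copied explicitly into
        # the NIU send buffer; low latency for small messages.
        return ("by_value", dest, bytes(payload))
    # Specification by reference: hand the NIU a descriptor instead of
    # the data; this presumes the NIU can translate the user virtual
    # address into a physical address.
    return ("by_reference", dest, id(payload), len(payload))
```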

Message Buffer Management

Message passing requires buffer space for storing messages between a send and the corresponding receive. Buffer management is closely tied to the message passing interface design, with many possible divisions of responsibility between the NIU and the software using its message passing service. Furthermore, these choices have implications for network sharing: in a shared network, some buffer management choices are more prone to network deadlocks than others. We begin with a discussion of transmit buffer space management, followed by one on receive buffer space management.

Transmit Buffer Management

When transmit buffer space is unavailable, two interface design options are possible: (i) block the sender, or (ii) notify the sender about send failure. Blocking the sender means hardware stalls the sender so that software is unaware of the problem. It is less flexible than the second option, because a sender notified of a send failure can not only retry in a tight loop to emulate blocking, but also has the option of carrying out other computation or receive actions before attempting to send again. This added flexibility not only improves performance but is needed to prevent communication deadlocks.

Although the above paragraph uses the term "notify the sender of send failure", this behavior is typically achieved with software testing for sufficient transmit buffer space before sending a message. The sender could be notified of send failure through an exception/interrupt, but such a design is not used today because of implementation difficulty. An exception/interrupt based scheme must provide means for software recovery; at the very least, the exception must occur before software attempts to transmit yet another message. This is not easy to guarantee when the NIU is not integrated into the processor core. The high cost of interrupt/exception handling on most processors and OSs also makes its performance advantage unclear.

Programs that can estimate their maximum transmit buffering requirement can avoid the dynamic buffer space availability check overhead by allocating the maximum required space in the channel or transmit queue. Others that need to deal with the out-of-send-buffer problem must ensure that buffer space will eventually free up, to avoid getting into a deadlock. This is a classic resource allocation with dependence problem, and an application must ensure that no cycle forms in the dependence graph.
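The notify-on-failure interface and its use to emulate blocking can be sketched as follows; the class, its capacity, and the drain callback are hypothetical stand-ins, not the interface of any NIU discussed here.

```python
from collections import deque

class Endpoint:
    """Toy endpoint with a bounded transmit queue (capacity arbitrary)."""
    def __init__(self, capacity: int = 4):
        self.tx = deque()
        self.capacity = capacity

    def try_send(self, msg) -> bool:
        """Notify-on-failure: return False instead of stalling the sender."""
        if len(self.tx) >= self.capacity:
            return False
        self.tx.append(msg)
        return True

    def blocking_send(self, msg, drain):
        """Emulate a blocking send by retrying in a loop, servicing other
        work (e.g. receives) while waiting so that buffer space can free
        up and dependence cycles are avoided."""
        while not self.try_send(msg):
            drain(self)
```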

Receive Buffer Management

Receive buffer management is more complex. It interacts with network sharing, and the detailed requirements depend on the connectivity model. We first discuss receive buffer management for the channel model.

The simplest way to implement the channel model uses the same amount of buffer space in both the sender and receiver, but exposes only the buffer space of one side to software. It treats the destination buffer as an eagerly updated mirror of the source buffer. Whenever a message is enqueued into the channel, the source NIU can forward it to the destination with full confidence that there is sufficient destination buffer to accommodate it. The source buffer space provides transient buffering should the network be congested. This approach simplifies NIU design, as it never runs into an out-of-buffer situation; that problem is handled by software when it attempts to enqueue into the channel. However, the simplicity comes at the expense of less efficient buffer space utilization: only half the total amount of buffer space used to implement a channel is exposed to software. Any attempt to expose more of the combined buffer space to user code requires additional coordination between the source and destination NIUs.
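The eager-mirror scheme can be modeled with a pair of equal-sized buffers of which only one is visible to software; the class below is an illustrative sketch under that assumption, not the mechanism of any machine described here.

```python
class MirroredChannel:
    """One-to-one channel with equal sender- and receiver-side buffers.

    Software sees only the sender-side slots; of the 2 * slots of total
    buffer space, half is exposed. The NIU can always forward eagerly
    because the destination mirrors the same capacity."""
    def __init__(self, slots: int = 8):
        self.slots = slots   # exposed (sender-side) capacity
        self.source = []     # transient buffering at the source
        self.dest = []       # eagerly updated mirror at the destination

    def enqueue(self, msg) -> bool:
        if len(self.source) >= self.slots:
            return False     # software handles the out-of-buffer case
        self.source.append(msg)
        self.dest.append(msg)  # forward immediately: space is guaranteed
        return True

    def receive(self):
        self.source.pop(0)     # freeing a dest slot frees its mirror
        return self.dest.pop(0)
```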

Receive buffer space allocation in the queues-and-network model is more difficult. Typically, this is done dynamically at the time a message arrives. When an application is unable to estimate the maximum amount of receive queue space needed, a message may arrive at its destination to find its receive queue out of buffer space. Most systems deal with this situation in one of two ways: (i) the message is blocked, i.e., it continues to occupy the transient buffer in the network, or (ii) the message is dropped. A third option, of returning the message to the sender, is sometimes used, but requires appropriate support to ensure that (a) the sender has space to buffer the returned message before resending it again, and (b) there is a logically separate network dedicated to return traffic.

Blocking an incoming message when its destination queue is full causes blockage in the network, which may lead to deadlocks. Nevertheless, the approach has been adopted in some machines, e.g., the CM-5, where multiple logically independent networks and a software network usage discipline ensure that the blockage never develops into dependence cycles.

If messages are dropped when buffer space is unavailable at the destination, the message loss can either be exposed to user code or hidden by the message passing library with a recovery protocol. It is important to note that dropping the message prevents only those deadlocks due to dependence cycles involving transient shared system resources, such as buffers in network switches. The application still has to ensure that there are no dependence cycles involving privately owned resources, such as its private receive queue buffer space. Otherwise, dropping the packet merely converts a network deadlock into communication livelock, where repeated attempts to deliver a message fail, with the message dropped at the destination NIU.

The issue of dynamic receive buffer allocation can be legislated away with an NIU interface that requires a message sender to specify the destination buffer address for each message. In essence, the interface is equivalent to remote write initiated through a message transmit interface; VIA and Hamlyn, for instance, take this approach. However, this approach basically pushes the buffer management duties up one level. If that level is also hardware, such as cache-coherence protocol in the NIU, it is unclear that the problem has gotten any easier. As such, it is our goal to support dynamic receive buffer allocation even in the presence of network sharing.

Dynamic buffer allocation and network deadlocks in a shared network environment are a major problem in NIU design, which we revisit later. For now, it suffices to say that in a system where network resources are shared by multiple logically independent applications, use of these shared resources has to be done with care, to prevent dependences from building up between otherwise independent jobs.

Message Reception Method

The most common methods of receiving messages are by polling or via interrupts. Because interrupts are expensive in most systems today, polling is the recommended method for high performance systems when communication is frequent. Interrupts also introduce atomicity issues that force the use of mutex locks, which reception by polling avoids. Nevertheless, polling has its inconveniences and overhead. When timely servicing of messages is important, polling code has to be inserted into numerous parts of the receiver's code, making the resulting code difficult to read and debug.

Ideally, both reception by polling and by interrupt should be supported. The receiver should be able to dynamically opt for polling when it expects messages, and for interrupt when messages are expected to be rare. It is also useful for the sender to request that a particular message cause an interrupt, e.g., when the sender urgently needs the receiver's attention.
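Dynamic selection between the two reception methods might look like the following sketch. The NIU interface (poll, arm_interrupt) is hypothetical, and the interrupt is modeled as a stored callback rather than a real trap.

```python
class FakeNIU:
    """Stand-in for an NIU's status and receive registers (toy model)."""
    def __init__(self, msgs):
        self.msgs = list(msgs)
        self.handler = None   # interrupt handler, if armed

    def poll(self):
        return self.msgs.pop(0) if self.msgs else None

    def arm_interrupt(self, handler):
        self.handler = handler   # would be invoked on a future arrival

class Receiver:
    """Receiver that polls while messages are expected and falls back
    to interrupt-driven reception otherwise."""
    def __init__(self, niu):
        self.niu = niu

    def receive_phase(self, expecting: bool, on_message):
        if expecting:
            # Polling: cheap per message, but burns cycles checking.
            while (msg := self.niu.poll()) is not None:
                on_message(msg)
        else:
            # Interrupt: expensive per event, but no wasted polling.
            self.niu.arm_interrupt(on_message)
```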

While messages are typically received by application code, this may be preceded by NIU hardware or firmware preprocessing. An example is the proposed StarT system based on the MP processors, which treats each message in the system as a continuation composed of an instruction pointer (IP) and a list of parameters. Designed to be a multithreaded processor operating in a continuation-passing fashion, the MP includes an NIU that preprocesses each incoming continuation message, incorporating it into the local hardware-supported continuation stack/queue. The processor hardware maintains several priorities of continuations and supports a branch instruction that jumps to the highest priority continuation.

Processing in the NIU can also completely take care of a message. Examples include Put or Get operations, available on DEC's Memory Channel, and NIU-implemented barriers or locks. Aside from freeing the processor to concentrate on executing longer threads, shifting the servicing of incoming messages to the NIU ensures timely servicing: the NIU is always scheduled, is continually servicing short requests, and does not face suspension due to interrupts like a page fault.

Communication Service Guarantee

Application programmers are usually interested in whether message delivery is (i) reliable, i.e., lossless, and (ii) in-order, i.e., messages between each source-destination pair arrive in the order sent. Although some applications can tolerate message losses, most require reliable message delivery. In-order delivery is very useful to certain applications, as it helps in reasoning about distributed events; for instance, cache coherent distributed shared memory protocols that can assume in-order delivery of messages are simpler, as fewer scenarios are possible.

Messages may be lost for many reasons. Some networks do not guarantee lossless service. Optical networks, for example, are designed to operate at high speed with very low but non-negligible error rates. Networks may also adopt link-level flow control which drops a message when its destination port fails to make forward progress within a timeout interval, e.g., in Myrinet. Finally, some NIUs are designed to drop messages when their destination queues are full; hence, even if the network delivers messages reliably, the NIU may drop them.

Packets sent from one source to the same destination may not arrive in the order sent, due to several causes. Some networks provide multiple paths between a source-destination pair of nodes and adaptively route packets to achieve higher bandwidth or improve fault tolerance. Flow-control protocols can also disrupt ordering if messages are dropped and retried without regard for ordering.

Depending on the network and the NIU designs, providing lossless and in-order delivery guarantees may incur extra overhead in the form of protocols, usually some form of Sliding Window protocol, to recover from losses, reconstruct the ordering, or achieve both. A portable message passing library is therefore better off leaving both as options, so that applications which can tolerate some losses or out-of-order messages do not incur the corresponding protocol overhead. At the same time, a network and NIU design that delivers these properties without the overhead of these protocols is very useful.
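The receive side of such a recovery protocol can be sketched as reordering and de-duplicating packets by sequence number. The sketch below is an illustration only: it omits acknowledgements, sender timeouts, retransmission, and the bounded window of a real sliding window protocol.

```python
class InOrderReceiver:
    """Reconstructs in-order delivery from a network that may duplicate
    or reorder packets (loss recovery, via sender retransmission on
    timeout, is omitted here)."""
    def __init__(self):
        self.expected = 0   # next sequence number to deliver
        self.stash = {}     # out-of-order packets, keyed by seq

    def accept(self, seq: int, payload):
        """Returns the list of payloads that become deliverable."""
        delivered = []
        if seq < self.expected or seq in self.stash:
            return delivered        # duplicate: drop it
        self.stash[seq] = payload
        while self.expected in self.stash:
            delivered.append(self.stash.pop(self.expected))
            self.expected += 1
        return delivered
```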

Synchronization Semantics

Message passing send and receive actions often have associated control semantics. A receive action is often used as a mechanism to initiate action at the destination. The information received can also inform the receiver that the sender has reached a particular point in execution.

Some message passing libraries also support blocking send and receive actions. In a blocking send, the sender's execution is suspended until the message is received at the destination. In a blocking receive, the receiver's execution is suspended until a message arrives. The alternatives to blocking receive are either some way of indicating that there is no message for the receiver, or returning a default message. Some message passing libraries, e.g., MPI, offer even more semantics for the receive action, permitting the receiver to request messages from a particular sender or of a particular type.

Blocking send/receive and selective receive are high level semantics that are more appropriately implemented in software message passing layers, or possibly NIU firmware.

Summary

The above review of message passing practices shows that diversity abounds. The challenge for a general purpose system is to cater to these differences without compromising the performance of any particular feature. Many of the issues covered in this review are interrelated, with choices in one area having implications for other aspects. For example, the connectivity model (channel vs. queues-and-network) has implications for buffer management, which in turn interacts with network sharing and job scheduling policy. This apparent lack of orthogonality complicates the NIU design task substantially. A later section revisits some of these issues in a multitasking, shared network environment, and later chapters of this dissertation will show how our design meets these challenges.

Shared Memory Communication

This discussion focuses on coherent shared memory at the processor load/store instruction level, i.e., load/store instructions executed on different SMP nodes access the same logical memory location in a coherent fashion. Researchers have tried other ways of providing application programmers the abstraction of shared objects. These range from using special access routines, explicitly making a local transient copy of a shared object when access is needed, to automated modification of executables to replace shared memory load/store instructions with access routines. Nevertheless, supporting shared memory at the processor load/store instruction level has better performance potential and places fewer requirements on software.

Caching and Coherence Granularity

Coherent, transparent access to shared memory using load/store instructions is
commonly achieved in three ways: (i) the memory location is not cached; (ii) caching is
allowed, and state information is kept at the cacheline level to maintain coherence;
(iii) caching is allowed, and state information for coherence maintenance is kept at
the page level.

Under the first method, each load/store instruction results in an actual remote
operation. The Cray T3D and T3E and the Tera are some machines that support
only this kind of coherent shared memory. Its greatest attraction is implementation
simplicity: there is no cache coherence protocol to deal with. Its drawback is remote
access latency on every load/store access. The T3D and T3E provide special prefetch
buffers, in addition to relatively short remote access latency achieved with exotic
supercomputer-class technology, to reduce the effect of this latency. Tera uses special
processors which exploit multithreading parallelism to tolerate this latency.

When shared memory caching is allowed, there is the problem of a cached copy
becoming stale when another processor writes over part or all of the same cacheline.
The stale cached copies of the cacheline are said to have become incoherent.
The most common way of maintaining coherence among caches of a bus-based SMP
is through bus snooping techniques, which maintain bookkeeping at the cacheline
granularity. In distributed implementations, cache coherence is typically maintained
with a directory-based approach, which again keeps cacheline-granularity
state information. Compared to uncached shared memory, caching shared
memory takes advantage of data locality, both temporal and spatial, at the cost of
a fair amount of bookkeeping and design complexity. NIU access to the system bus is
needed to make this work across SMPs in a cluster.

The third approach is logically similar to the second one, but does bookkeeping at
the page level. This small difference translates into the big implementation advantage
of requiring no hardware beyond normal paging support and some form of message
passing capability. Its implementation through extensions to paging software is
relatively easy in most OSes because it uses an exported interface originally meant
for building third-party file systems. This approach was pioneered by Li, and has
been further refined in many subsequent implementations, the most notable being
TreadMarks. A variant of this approach, Cashmere, relies on non-coherent
shared memory hardware with remote write capability.
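The page-level approach can be summarized as a small state machine driven by page faults. The sketch below is illustrative only: a single-writer, invalidate-based protocol in the spirit of Li's shared virtual memory, with state names chosen to mirror page-table protections (the class and method names are assumptions, not any system's actual API).

```python
# Illustrative page-level coherence state machine (single-writer,
# invalidate-based). States mirror page-table protection:
# INVALID ~ no access, SHARED ~ read-only mapping,
# EXCLUSIVE ~ read-write mapping.
INVALID, SHARED, EXCLUSIVE = "INVALID", "SHARED", "EXCLUSIVE"

class Page:
    def __init__(self):
        self.state = {}  # node -> state; a node absent here is INVALID

    def read_fault(self, node):
        # Demote any exclusive writer to SHARED, then map read-only.
        for n, s in self.state.items():
            if s == EXCLUSIVE:
                self.state[n] = SHARED
        self.state[node] = SHARED

    def write_fault(self, node):
        # Invalidate all other copies; this node becomes sole writer.
        self.state = {node: EXCLUSIVE}

p = Page()
p.write_fault(0)   # node 0 writes: exclusive, read-write mapping
p.read_fault(1)    # node 1 reads: writer demoted, both now shared
```

In a real implementation the fault handlers would also fetch page contents and change the actual page-table protections; only the bookkeeping is shown here.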

A good shared memory system should be built from a combination of all three
techniques. Either user directives or dynamic monitoring of access characteristics can
help pick the right implementation technique for a particular region of memory.
Uncached access is the most efficient technique when there is little or no locality; in that
case, any attempt at caching simply incurs overhead without any payback. When
data is shared in a coarse-grained fashion, page-level coherence is adequate and
consumes few resources for bookkeeping state, coherence traffic, and coherence message
processing. For example, if the shared memory capability is used to simplify migration
of sequential jobs for load balancing reasons, page-level coherence is perfectly
adequate. Besides, most of the software infrastructure for implementing page-level
coherence is also needed for cluster-level virtual memory paging, so supporting it
incurs little additional cost.

Page-level coherence is inadequate when data is shared in a finer-grained fashion,
for which cacheline-granularity coherence maintenance offers better performance.
As described next, cacheline-granularity coherence can be implemented with several
methods, each with its own advantages and shortcomings. Ideally, one would like
to combine the advantages while avoiding the shortcomings. Our NIU is designed
to facilitate this research by providing hooks for implementing the major approaches
and new improvements that are being considered.

Memory Model

A memory model, sometimes referred to as a consistency model, defines how multiple
load/store operations to different memory locations appear to be ordered for code
running on different processors. Sequential Consistency is commonly regarded
as the most convenient memory model for programmers. A simplified but useful
way of thinking about Sequential Consistency is that in a system implementing this
consistency model, all memory access operations appear to occur in the same order to
all processors; it is as though all of them have been serialized. More specifically, two
processors will not see a set of memory access instructions as occurring in different
orders. Although this model is simple to understand, it is difficult to implement
efficiently in a distributed environment.

In response to this challenge, a multitude of weaker consistency models, e.g.
Lazy Consistency, Release Consistency, Entry Consistency, Scope Consistency,
Location Consistency, and DAG Consistency, have been proposed.
The main distinction between these consistency models and Sequential Consistency
is that they distinguish between normal memory access operations and special
order-imposing operations, which are typically related to synchronization. An imprecise but
useful way to think about these special operations is that they demarcate points in a
thread's execution where either the effects of memory operations it has executed must
become visible to other threads, or those performed by other threads will be examined
by this thread. By requiring the user to clearly identify these points, weaker memory
models allow an implementation to achieve higher performance through delaying
coherence actions and allowing greater overlap between operations. Update-based
coherence protocols, for instance, can be implemented efficiently for weak memory
models, but are inefficient under the Sequential Consistency memory model.
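The performance benefit of delaying coherence actions can be illustrated with a small sketch. The following is an assumed, much-simplified model of release consistency (names and structure are mine, not any system's API): ordinary stores stay in a local write buffer, and only the special release operation makes them globally visible, in one batch.

```python
# Sketch of why weak models help: under a release-consistency-style
# model, coherence actions for ordinary writes may be delayed and
# batched until a release point, instead of propagating one by one.
class ReleaseConsistentNode:
    def __init__(self, shared_mem):
        self.shared = shared_mem   # globally visible memory (a dict)
        self.write_buf = {}        # writes not yet made visible

    def store(self, addr, value):
        self.write_buf[addr] = value   # ordinary store: stays local

    def release(self):
        # Special order-imposing operation: flush buffered writes so
        # they become visible to other threads, in one batch.
        self.shared.update(self.write_buf)
        self.write_buf.clear()

mem = {}
node = ReleaseConsistentNode(mem)
node.store("x", 1)
node.store("x", 2)   # overwritten locally: only one update need travel
visible_before = dict(mem)
node.release()
```

Note how the two stores to "x" collapse into a single coherence action at the release, the kind of batching an update-based protocol can exploit under a weak model but not under Sequential Consistency.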

The question of memory models requires further research to understand the semantics
and implementation implications of different models. An experimental platform
like StarT-Voyager that is fast enough to run real programs, and able to implement
different memory models in comparable technology, will throw much light on this
issue.

Invalidate vs. Update

A coherence protocol can adopt either an invalidate or an update strategy. The former
deletes cache copies that are stale, so that future accesses will incur cache misses and
fetch the new copy. The latter sends the new data to existing cache copies to keep
them up to date. The two strategies are good for different data sharing patterns.
Invalidation is the best approach if a node accesses the same cacheline repeatedly
without another node writing any part of it. On the other hand, update is better if
there is repeated producer-consumer data communication between multiple nodes, or
if there are multiple writers to the same cacheline. The latter causes thrashing in an
invalidation-based scheme even if the writes are to different locations (false sharing).
Ideally, a coherence protocol should adaptively select the appropriate strategy for
a cacheline or page based on dynamically gathered access pattern information.
As a nearer-term goal, the user can provide directives to assist in this choice.
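One plausible shape for such an adaptive policy is sketched below. It is an assumption for illustration, not a protocol from the literature: a per-cacheline counter tracks how often remote nodes re-read the line after writes (the producer-consumer signature), and the strategy flips to update past an arbitrary threshold.

```python
# Illustrative adaptive policy: per-cacheline counters pick "update"
# when remote nodes keep re-fetching after writes (producer-consumer
# sharing), and "invalidate" otherwise. The threshold is arbitrary.
class AdaptiveLine:
    def __init__(self, threshold=3):
        self.rereads_after_write = 0
        self.threshold = threshold

    def on_remote_reread(self):
        # A remote node missed and re-fetched after our write.
        self.rereads_after_write += 1

    def on_local_only_streak(self):
        # Repeated local access with no remote writers: locality case.
        self.rereads_after_write = 0

    def policy(self):
        return ("update" if self.rereads_after_write >= self.threshold
                else "invalidate")

line = AdaptiveLine()
for _ in range(3):
    line.on_remote_reread()   # a consumer keeps fetching fresh data
```

User directives, the nearer-term goal mentioned above, would simply set the policy directly instead of inferring it from counters.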

CC-NUMA and S-COMA

In the older CC-NUMA (Cache Coherent Non-Uniform Memory Access) approach,
each paged-in shared memory cacheline has a unique system-wide physical
address, and remotely fetched cachelines can only reside in a traditional cache, where
the address tag and cache-set index furnish the global identity of a shared memory
location.

Under the S-COMA (Simple Cache-Only Memory Architecture) approach,
local main memory (DRAM) is used to cache remotely fetched cachelines. Space
in this cache is allocated at page granularity, but filled at cacheline granularity.
Because allocation is at page granularity, no address tag at individual cacheline
granularity is necessary. Instead, the function of address-tag matching in normal
caches is performed by a processor's conventional virtual memory address translation
mechanism when it maps a virtual page number to a physical page frame. Coherence
state is, however, still kept at the cacheline granularity to permit finer granularity
coherence control. Under S-COMA, the same logical memory location can be mapped
to different local DRAM addresses on different nodes. The correspondence between each
local DRAM address and its global shared memory address must be maintained, as it
is needed during cache miss processing.

The S-COMA approach can cheaply maintain a large cache of remotely fetched
data because it incurs no address-tag overhead and uses main-memory DRAM, which
is fairly cheap and plentiful, to store cache data. If, however, memory access to a
page is sparse, S-COMA makes poor use of memory, as a cached page of DRAM
may contain only a very small number of valid cachelines. In such cases a CC-NUMA
implementation is more efficient. Falsafi and Wood proposed an algorithm
for automatically choosing between the two.

S-COMA and CC-NUMA are two ends of a spectrum in which middle-ground
schemes are possible. By introducing a small number of address-tag bits for each
cacheline in S-COMA DRAM, each S-COMA DRAM page can be shared between
several physical page frames. At any one time each DRAM cacheline can only be
used by a particular page frame, but cachelines from several physical page frames
can use different cachelines of a DRAM page.

The number of address-tag bits is kept small by requiring the physical page frames
mapped to the same S-COMA DRAM page to have page frame addresses that differ
only in a small number of bits in a predetermined bit field. In addition, the memory
controller has to alias these pages to the same DRAM page by ignoring this bit field.
For example, a system may allow up to four physical page frames to map to the same
DRAM page, with the constraint that these page frames must have the same addresses
except in the most significant two bits. In this case, only two bits of address tag are
kept for each DRAM cacheline to identify the actual physical page frame using it.
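The four-frames-per-page example above can be worked through directly. The sketch below illustrates the arithmetic only; the 20-bit frame-number width and 64-line page are assumed parameters, not figures from the design.

```python
# Worked sketch of the address-tag arithmetic described above:
# up to four physical page frames, identical except in their two
# most significant bits, alias to one S-COMA DRAM page; each DRAM
# cacheline keeps a 2-bit tag naming which frame currently uses it.
FRAME_BITS = 20                   # width of the page frame number (assumed)
TAG_SHIFT = FRAME_BITS - 2        # tag field = top 2 bits of the frame

def split(frame_number):
    tag = frame_number >> TAG_SHIFT          # which of the 4 aliases
    dram_page = frame_number & ((1 << TAG_SHIFT) - 1)  # bit field ignored
    return tag, dram_page

class DramPage:
    def __init__(self, lines_per_page=64):
        self.tags = [None] * lines_per_page  # per-cacheline 2-bit tag

    def lookup(self, tag, line):
        return self.tags[line] == tag        # hit iff the tag matches

    def fill(self, tag, line):
        self.tags[line] = tag                # claim this line for a frame

# Frames 0x00123 and 0xC0123 differ only in their top two bits,
# so the memory controller aliases them to the same DRAM page:
t0, p0 = split(0x00123)
t1, p1 = split(0xC0123)
```

A lookup by one frame misses on a line filled by another, exactly the situation the per-cacheline tag exists to detect.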

This scheme, which we call NuCOMA (Non-Uniform COMA), combines S-COMA's
benefit of fully-associative mapping of virtual pages to physical page frames with the
ability to share a DRAM page between several virtual pages. Furthermore, the
address-tag overhead is small, unlike the case of a traditional cache. Another
interesting feature of NuCOMA is that when a single DRAM cacheline is shared between
two active physical address cachelines, NuCOMA does not require switching the
DRAM cacheline between the two cachelines, which can lead to thrashing.[1] Instead,
one cacheline is allowed to use the DRAM in the traditional S-COMA way, while the
other is treated like a CC-NUMA cacheline.

[1] Caches closer to the processor may or may not suffer from thrashing, depending on details of
their organization, such as associativity and the number of sets in the caches.

From an implementation perspective, CC-NUMA can be implemented with a subset
of the infrastructure needed for S-COMA. NuCOMA will require simple additional
hardware support beyond that used for S-COMA and CC-NUMA.

Atomic Access and Operation-aware Protocol

A shared memory implementation must include atomic access support. In systems
implementing weak memory models, synchronization actions, including atomic
accesses, must be dealt with specially because they take on additional semantics for
coherence maintenance. Furthermore, the coherence protocol for locations used as
synchronization variables should be different, because they are often highly contended
for and are accessed in a highly stylized manner. As an example, QOLB (Queue On
Lock Bit) is proposed as a means of avoiding unnecessary cache-miss fetches
for locations used as mutex locks. One can regard it as an instance of a class of
operation-aware protocols, which are cognizant of the semantics of, and operations on,
data values stored in the shared memory locations.

Operation-aware protocols may use local copies of a memory location to maintain
parts of a distributed data structure, instead of simply being a copy of a variable.
Furthermore, the coherence operations may perform functions other than simply
updating or invalidating a cache copy. For instance, in the case of a lock, a special lock
protocol may ensure that at most one local copy indicates the lock to be free (say, a
non-zero value), while other local copies all indicate that the lock is not free (say, a
zero value). Coherence maintenance on this location includes the responsibility for
moving the "lock is free" instance from one node to another across the cluster as the
need arises, and maintaining a list of waiting lock requesters.
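The lock example above can be sketched as follows. This is an illustrative model of the invariant described, at most one local copy reads "free", with the free instance migrating to the head of a waiter list on release; the class and method names are assumptions, and real protocols (QOLB among them) differ in many details.

```python
# Sketch of an operation-aware lock protocol: coherence keeps exactly
# one local copy reading "free" (1) cluster-wide, and migrates that
# instance to the head of a waiter list on release.
from collections import deque

class ClusterLock:
    def __init__(self, nodes):
        self.copy = {n: 0 for n in nodes}  # 0 = "not free" everywhere...
        self.copy[nodes[0]] = 1            # ...except one free instance
        self.waiters = deque()             # list of waiting requesters

    def acquire(self, node):
        if self.copy[node]:                # local copy says free
            self.copy[node] = 0            # take the lock locally
            return True
        self.waiters.append(node)          # protocol queues the waiter
        return False

    def release(self, node):
        if self.waiters:                   # hand the free instance over
            nxt = self.waiters.popleft()
            self.copy[nxt] = 1
        else:
            self.copy[node] = 1            # free instance stays here

lock = ClusterLock(nodes=[0, 1, 2])
got0 = lock.acquire(0)   # node 0 holds the free instance: succeeds
got1 = lock.acquire(1)   # node 1's copy reads "not free": queued
lock.release(0)          # the free instance migrates to node 1
```

The point of the stylization is that node 1's subsequent acquire hits in its own copy, with no cluster-wide cache-miss fetch.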

Aside from locks, memory locations used for accumulating a sum, minimum,
maximum, or some other reduction operation can also benefit from operation-aware
coherence protocols. The protocol could automatically clone local versions of the
memory location and combine the local versions with the global version periodically.
A special operation is then used to obtain the globally up-to-date value. Further
research and experimentation is needed to evaluate the effectiveness of operation-aware
protocols, but as a start, the NIU must be able to accommodate their implementation.
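The reduction case is equally easy to sketch. Again this is an assumed illustration of the idea, not a specified protocol: each node accumulates into a local clone with no per-update coherence traffic, and the special read combines the clones.

```python
# Sketch of an operation-aware reduction: each node accumulates into
# a local clone of the location, and a special operation combines the
# clones on demand instead of invalidating on every add.
class ReductionCell:
    def __init__(self, op=lambda a, b: a + b, identity=0):
        self.op = op               # the reduction operator (sum here)
        self.identity = identity
        self.local = {}            # node -> local clone

    def add(self, node, value):
        # Purely local: no coherence traffic per update.
        self.local[node] = self.op(self.local.get(node, self.identity),
                                   value)

    def global_value(self):
        # The special operation: combine all local versions.
        total = self.identity
        for v in self.local.values():
            total = self.op(total, v)
        return total

cell = ReductionCell()
cell.add(0, 5); cell.add(1, 7); cell.add(0, 2)   # three local adds
peak = ReductionCell(op=max, identity=float("-inf"))
peak.add(0, 3); peak.add(1, 9)                   # a max reduction
```

An ordinary invalidate protocol would serialize the three adds through the home node; here only the combining step touches remote state.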

Performance Enhancement Hints

Shared memory performance can often be improved with hints. A system should allow
easy specification of hints, such as the choice of coherence protocol, prefetching, and
presending of data of various granularities. Because hints only affect performance
but not correctness, they do not have to be precisely accurate all the time.

The NIU hardware should also assist in collecting memory access statistics. Either
user code or the cache coherence protocol can then use this information to select the
appropriate coherence maintenance strategy.

Summary

Overall, because shared memory support is still an active research area, it is important
that our NIU design permits experimentation. Both CC-NUMA and S-COMA styles
of implementation should be supported. The implementation should permit easy
modifications to the coherence protocol, and provide low-cost mechanisms for dynamically
relaying hints from the application code to the underlying CC-DSM implementation.

System Requirements

System-level requirements for cluster communication center around protection
enforcement and fault isolation. Protection enforcement depends on the network sharing
model, which is in turn highly related to the job scheduling model. Fault isolation
focuses on preventing faults at one SMP from spreading to other SMPs through
cluster-level communication.

Multitasking Model

Existing approaches to job scheduling for parallel applications have mostly been
restricted to gang scheduling. When there is tight dependence between different threads
of execution, gang scheduling is necessary to ensure that threads scheduled to run are
not held back waiting for data from unscheduled threads. This led to the popularity
of gang scheduling, and the side effect that existing protection schemes in message
passing systems are mostly built on this assumption.

Unfortunately, gang scheduling is too restrictive for a general cluster system. For
one, it is not an appropriate concept for distributed application and OS communication.
OS communication is more naturally thought of as occurring on demand,
concurrently with user jobs' communication. If an OS message is urgent, it should
cause an interrupt at the destination so that it is handled immediately. If it is not
urgent, the message should be buffered up for subsequent processing, such as when
the OS takes an exception to service some other trap or interrupt.

It is also unclear at this point how a job scheduler should deal with exceptional
events, such as page faults, on a subset of a parallel job's processes. Many earlier
parallel machines, e.g. the CM-5 and most Crays, sidestepped this issue by operating
with no virtual memory. Suspending the entire parallel job when an exception occurs
runs the danger of causing repeated job swaps as page faults occur in one process after
another of the same parallel job. A better strategy is to allow the unaffected processes
to continue until data dependence prevents them from making further progress.

More generally, parallelism variation across different phases of a program's
execution suggests that machine utilization will improve if the number of processors
devoted to a parallel job can expand and contract over time. A parallel job may have
processes that are fairly dormant during certain phases of execution. At such times,
the underutilized processors should be allocated to other jobs, including other parallel
jobs or a subset of a parallel job. A parallel job with coarse-granularity dependence,
for example, is a good candidate for absorbing the idle processors, as it is likely to
make useful progress without all its constituent parts scheduled in a gang fashion.

The cluster communication support can make it easier to improve job scheduling
by imposing fewer restrictions on the scheduler. Thus, the traditional practice of
using gang scheduling to impose communication protection by limiting the number
of jobs simultaneously sharing the network is unnecessarily limiting. The communication
support should also be virtualized so that it does not impose any impediment
on process migration. If a job scheduler can easily and cheaply migrate processes, it
will have an easier task at load balancing. The same capabilities also make it easier
to implement automatic checkpoint and restart. Along the same line, a communication
architecture with appropriate hooks can ease the implementation of run-time
expansion and contraction of allocated processor resources.

Network Sharing Model

New network sharing and protection models are needed to achieve greater job scheduling
flexibility. Our discussion will focus on network sharing in the presence of direct
user-level network access. With the exception of experimental machines like Alewife,
and possibly FLASH (depending on the PP firmware in MAGIC), the network in
distributed shared memory machines is not directly accessible from user-level code,
and is really a private resource controlled by the coherence protocol. As a result,
there is no issue of isolating network traffic belonging to different protection domains.
Job scheduling on these machines is not restricted by any communication protection
concerns, which are adequately taken care of by normal virtual address translation.
The picture changes immediately when user-level code is given direct access to the
network. We first examine how a number of message passing machines deal with
sharing and protection issues before proposing our own model.

The CM-5 treats its fast networks as a resource dedicated to each job while it is
running, much like processors. The machine can be shared by several independent
jobs in two ways: (i) it can be physically partitioned into several smaller independent
units, and (ii) each partition can be time-sliced between several jobs under a strict
gang-scheduled policy. Within each partition and time slice, its networks are not
shared. Context switching between time slices includes saving and restoring the
network state, with the assistance of special operation modes in the CM-5 network
switches. Aside from having a rather restricted network sharing model, this design
employs special context-switching features in switch hardware which are not useful
for other purposes. Context switching a lossless network without this level of support
is tricky, as one runs into the issue of buffer space requirements for the packets that are
swapped out. Unless some flow-control scheme is in place to bound the packet count, the
storage requirement can build up each time a job is swapped in and out. Furthermore,
unless care is taken, a save-and-restore scheme is also likely to affect delivery order
guarantees.

The SP-2 adopts a slightly different solution, but essentially still gang-schedules
parallel jobs and uses the network for only one user job. Instead of having hardware
capability for context-switching the network, the SP-2 NIU tags every outgoing
message with a job ID. At the destination, this ID is checked against that of the current
job. If a mismatch is detected, the message is dropped. The scheme relies on a
message loss recovery protocol to resend the dropped message at some later time. This
allows each context switch to be less strictly synchronized. The SP-2 also allows system
code to share the network by supporting an extra system job ID and message queues.
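The receive-side check just described is simple enough to sketch directly. The class and method names below are assumptions; only the policy, accept the current job's ID or the system ID, drop everything else and rely on retransmission, comes from the description above.

```python
# Sketch of the SP-2-style protection check: every packet carries a
# job ID; the receiving NIU drops packets whose ID does not match the
# currently scheduled job, relying on sender-side retransmission.
class ReceivingNiu:
    def __init__(self, current_job, system_job="SYS"):
        self.current_job = current_job
        self.system_job = system_job    # extra ID so OS code can share
        self.delivered, self.dropped = [], []

    def on_packet(self, job_id, payload):
        if job_id in (self.current_job, self.system_job):
            self.delivered.append((job_id, payload))
        else:
            # Stale packet from a descheduled job: drop it; the loss
            # recovery protocol will resend it at some later time.
            self.dropped.append((job_id, payload))

niu = ReceivingNiu(current_job="A")
niu.on_packet("A", "user data")     # matches the scheduled job
niu.on_packet("B", "stale packet")  # job B not scheduled: dropped
niu.on_packet("SYS", "os message")  # system traffic always accepted
```

Dropping rather than blocking is what lets context switches be loosely synchronized: in-flight packets from the previous job are simply discarded and resent later.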

We argue for sharing the network aggressively. As one of the most expensive
resources in a cluster system, the network should be used by all traffic that benefits
from fast communication. In addition to user parallel jobs, the fast network should be
available to OS services such as a parallel file system and the job scheduler. To support
the flexible job scheduling policies described earlier, the network must be prepared to
transport messages belonging to arbitrarily many communication domains.

To protect the independence of two parallel jobs A and B that happen to be
sharing a network, the communication system must ensure that job A cannot intercept
job B's messages nor fake messages to job B, and vice versa. Furthermore, a job
must be prevented from hogging shared network resources and depriving other jobs
of communication services.

The SP-2 job ID solution can be generalized to satisfy these requirements. But
traditional loss recovery protocols often increase communication latency because of
more complex source buffering requirements and protocol bookkeeping. This is
particularly troublesome with the queues-and-network model, because the sliding window
protocol is a per source-destination pair protocol. This mismatch means that a simple
FIFO send buffer management policy, which is convenient for hardware
implementation, does not work. Instead, when retransmission is needed, send buffers may
be read and deallocated in an order different from the message enqueue order. The
increased hardware complexity may increase latency beyond what is acceptable for
latency-sensitive communication such as shared memory protocol traffic. A solution that
permits simpler hardware, at least in the common case, is highly desirable.
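The FIFO mismatch is easiest to see with a tiny model. The sketch below is illustrative only (the structure is assumed): messages are enqueued in one global order, but acknowledgments arrive per destination, so buffers are freed out of enqueue order, which is exactly what a simple FIFO ring cannot express.

```python
# Sketch of why per-destination sliding windows break FIFO send-buffer
# management: buffers are enqueued in message order but acknowledged
# (and hence freed) per destination, so deallocation order differs
# from enqueue order.
class SendBuffers:
    def __init__(self):
        self.unacked = []   # (seq, dest) held in enqueue order

    def send(self, seq, dest):
        self.unacked.append((seq, dest))   # retained for retransmission

    def ack(self, dest, upto_seq):
        # A destination acknowledges only its own sequence numbers.
        freed = [e for e in self.unacked
                 if e[1] == dest and e[0] <= upto_seq]
        self.unacked = [e for e in self.unacked if e not in freed]
        return freed

buf = SendBuffers()
buf.send(0, dest="X")
buf.send(1, dest="Y")
buf.send(2, dest="X")
freed = buf.ack("Y", upto_seq=1)   # the *middle* buffer is freed first
```

A FIFO policy could only reclaim buffer 0 first; here buffer 1 is reclaimed while 0 and 2 remain, forcing the more complex allocator the text warns about.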

A shared network also interacts with buffer management, in that it cannot afford to
allow packets to block in the network when destination queues are full. This option,
allowed in the CM-5 and most distributed shared memory machines, requires logically
separate networks. While these logically separate networks can be implemented as
virtual channels in the network switches to avoid the cost of extra wires and package
pins, a fast network switch cannot support more than a small number of virtual
channels. The virtual channel solution is therefore insufficient for our environment,
where an arbitrary and large number of communication domains have to be supported
simultaneously.

Issues of network sharing are present in LANs and WANs, but critical differences
between these and tightly coupled cluster networks make many of the LAN/WAN
solutions infeasible. Communication occurs much more frequently in a cluster system
and requires much lower latencies. Because these differences usually span several
orders of magnitude, the LAN/WAN protection approach of implementing
communication through system calls was long ago recognized as too expensive. For the same
reasons, we are revisiting the practice of using common loss recovery protocols, such
as variants of sliding window protocols, for preventing network blockage. New
solutions could exploit positive attributes of cluster system networks, such as their extremely
low communication latency.

Fault Isolation

In order for a cluster system to be more reliable than its constituent SMPs, faults in
each SMP must be isolated. Each SMP should run its own instance of the operating
system that can function in isolation. Parallel or cluster-wide services should be
layered on top as extensions. By limiting the interaction of the OSes on different SMP
nodes to a well-defined, protected interface, it is easier to isolate faults.

At a lower level, it is necessary to impose memory access control on remotely
initiated shared memory operations, particularly remotely initiated memory writes.
Granting each remote SMP access to only a limited part of memory confines the damage
that each SMP can wreak. The local OS's state, for example, should not be directly
read/writeable with shared memory operations from any other SMP.

SMP Host System Restrictions

A typical commercial SMP offers an NIU limited points of interface. The most
common options are the I/O bus and the memory bus; interfacing via the DRAM SIMM/DIMM
interface has also been proposed.

The I/O bus is by far the most popular choice. Examples include the SP-1
and SP-2, Myrinet, StarT-Jr, StarT-X, Memory Channel, and
Shrimp.[2] The I/O bus, designed to accommodate third-party devices, has the
advantage of being precisely specified. There are also extensive tools, chipsets, and
ASIC modules available to make implementation easier. Unfortunately, current
bridges connecting I/O and memory buses do not propagate sufficient cache
operation information for an efficient cache-coherent distributed shared memory
implementation.

Interfacing at the main memory DRAM interface was proposed in MINI. Just
like in the case of the I/O bus, insufficient cache-state information is exchanged across
the memory bus/DRAM interface for implementing cache-coherent distributed shared
memory efficiently. Its advantage over an I/O bus NIU is potentially shorter latency
for message passing operations, due to its proximity to the processor.

[2] Shrimp-II has interfaces on both the EISA I/O bus and the memory bus.

The memory bus is the most flexible point of interface for a cluster system NIU. An
NIU positioned on the memory bus can participate in the snoopy bus protocol, making
a cluster-wide cache-coherent shared memory implementation feasible. Message passing
performance is also good, due to close proximity to both the SMP processors and
memory, the two main sources and destinations of message data. Message passing
interfaces can also take advantage of the coherent caching capability of the memory
bus. The NIUs of most existing shared memory machines are located here, e.g.
DASH, Origin, NUMA-Q, and Synfinity-NUMA. Message passing
machines like the CM-5, the CS-2, and the Paragon also have NIUs on the
memory bus. This choice is convenient with today's SMPs, whose multiple memory
bus slots, intended for processor and sometimes I/O cards, present ready slots for
the NIU.

This thesis will investigate an NIU that interfaces to the SMP memory bus.
Although the design of such an NIU is influenced by the types of bus transaction
supported by the SMP, these are fairly similar across today's microprocessors, so that
the general architectural principles proposed in this work are applicable across SMP
families. As an example of the uniformity among SMP families, today practically all
of them support invalidation-based cache coherence. Most will also allow additional
system bus devices to be bus master, slave, and snooper. The main variations,
presented below, are the atomic access primitive and the method of intervention. These
differences affect low-level NIU design choices, but have no bearing on the high-level
architectural concepts of our thesis.

The most common atomic access operations are the load-with-reservation (LR)
and store-conditional (SC) pair used in the PowerPC, Alpha, and MIPS. Swapping
between a register and a memory location is used in the SPARC and x86 processor
architectures. SPARC also supports conditional swap. There is as yet no definitive position on
how each of these atomic operations scales in a distributed system.
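The LR/SC pair is worth illustrating, since it underlies the retry loops that software builds on these processors. The sketch below models the reservation with a per-location version counter (an assumption of the model, not how hardware tracks reservations): SC fails if any store has intervened since the matching LR.

```python
# Sketch of the load-with-reservation / store-conditional pattern:
# SC succeeds only if no other store hit the location since the
# matching LR, so read-modify-write loops retry under contention.
class Memory:
    def __init__(self):
        self.data = {}
        self.version = {}   # bumped on every store: models reservation loss

    def lr(self, addr):
        # Load and establish a reservation (returned as a version).
        return self.data.get(addr, 0), self.version.get(addr, 0)

    def sc(self, addr, value, version):
        if self.version.get(addr, 0) != version:
            return False                 # reservation lost: SC fails
        self.data[addr] = value
        self.version[addr] = version + 1
        return True

def atomic_add(mem, addr, delta):
    while True:                          # the classic LR/SC retry loop
        old, ver = mem.lr(addr)
        if mem.sc(addr, old + delta, ver):
            return old + delta

mem = Memory()
atomic_add(mem, "ctr", 1)
atomic_add(mem, "ctr", 2)
```

Note that an intervening write between LR and SC, which is precisely what happens under remote contention in a distributed system, forces a retry; how often that happens is the scaling question raised above.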

Intervention refers to the process whereby a cache with the most up-to-date copy
of a cacheline supplies the data to another cache. Less aggressive snoopy buses
implement intervention by retrying the reader, having the cache with the data write it
back to main memory, and then supplying data to the reader from main memory.
More aggressive buses simply have the cache with the up-to-date data supply it
directly to the reader, possibly with write-back to main memory at the same time. This
is sometimes called cache-to-cache transfer. Many systems which implement the less
aggressive style of intervention can accommodate a look-aside cache, which supplies
data directly to readers, providing a special case of cache-to-cache transfer. We will
see later that our NIU design makes use of this feature.

Interfacing to the memory bus has the slight disadvantage of dealing with a
proprietary, and sometimes not very well documented, bus. This is not a serious problem
if the SMP manufacturer is accessible to clarify ambiguities. For portability of
network interfaces, one may choose to define a new interface designed to accommodate
cluster communication, but this is beyond the scope of this thesis.

NIU Functionalities

This section examines the tasks an NIU performs to deliver the communication
functions and system environment described in earlier sections of this chapter. We break
communication operations into elementary steps that are examined individually. We
begin with three key functions supplied by every NIU: interface to host, interface to
network, and data path and buffering. Next, two other functions commonly found in
NIUs are discussed; these are data transport reliability and ordering, and support
for unilateral remote communication action. Finally, cache-coherent shared memory
related NIU functions, partitioned between a cache protocol engine and a home
protocol engine, are described.

Interface to Host

As the intermediary between a computation node and the network, an NIU is
responsible for detecting when communication is required and determining the
communication content. It is also responsible for notifying the host node of information that
has arrived, and making that information available.

Except for NIUs that are integrated into microprocessors (an extinct species
these days), NIUs typically detect service requests by the following methods: (i)
monitoring the state of predetermined main memory locations, (ii) presenting software
with a memory-mapped interface, or (iii) snooping bus transactions. The first two
methods are commonly employed when communication is explicitly initiated by
software, as in the case of message passing style communication. Of the two, method
(i) has the disadvantage that the NIU has to actively poll for information; method
(ii) allows the NIU to process requests as they arrive, without polling. The last approach is most commonly
used to implement shared memory, where communication is implicitly initiated when
software accesses to shared memory locations trigger remote cache-miss or ownership
acquisition processing. The method can also be used in combination with (i) to
reduce polling cost: the predetermined location is only polled when snooping indicates
that a write has been attempted on that location.
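The combination of methods (i) and (iii) can be sketched as follows. This is an illustrative model only; the doorbell address and all names are assumptions chosen for the example.

```python
# Sketch of method (i) combined with bus snooping: the NIU polls a
# predetermined memory location only after snooping a write to it,
# avoiding continuous polling traffic on the memory system.
class SnoopFilteredPoller:
    DOORBELL_ADDR = 0x1000   # predetermined location (an assumption)

    def __init__(self, memory):
        self.memory = memory
        self.dirty = False   # set when a write to the doorbell is snooped
        self.polls = 0       # count of actual polls issued

    def snoop(self, addr):
        # Method (iii): passively observe bus transactions.
        if addr == self.DOORBELL_ADDR:
            self.dirty = True

    def maybe_poll(self):
        # Method (i), gated by the snoop result.
        if not self.dirty:
            return None      # nothing snooped: skip the poll entirely
        self.polls += 1
        self.dirty = False
        return self.memory.get(self.DOORBELL_ADDR)

mem = {}
niu = SnoopFilteredPoller(mem)
idle = niu.maybe_poll()          # no snooped write: no poll issued
mem[SnoopFilteredPoller.DOORBELL_ADDR] = "request"
niu.snoop(0x1000)                # the host's write observed on the bus
req = niu.maybe_poll()           # now a single poll retrieves the request
```

The poll counter makes the saving explicit: however often `maybe_poll` runs, memory is actually read only once per snooped write.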

Message passing communication requires the NIU to inform the host when
information arrives for it. This may be done actively, with the NIU interrupting a
host processor, or passively, with the NIU supplying status information in either
predefined main memory locations or memory-mapped NIU registers. In the case of
shared memory, software on the host processor never explicitly sees the arrival of
messages. Instead, incoming information is delivered at the hardware level: e.g.,
replies to locally initiated requests, such as cache-miss or cacheline ownership requests, allow
pending bus transactions to complete. Externally initiated requests, such as to
remove cache copies or write permission of a cacheline from local caches, result in
NIU-initiated system bus transactions that cause the desired changes.

The different methods of detecting service requests and delivering message availability information require different NIU capabilities. If coordination between NIU and host is through the main memory, the NIU needs to have bus master capability, i.e. be able to initiate bus transactions. If coordination is through memory-mapped NIU registers, the NIU has to be a bus slave, a device which responds to bus transactions, supplying or accepting data. As a snooper, the NIU has to participate in the snoopy bus protocol, intervening where necessary. Although limiting the NIU to be either just a slave or just a master simplifies the NIU design, an NIU with all three capabilities, master, slave and snooper, has the greatest flexibility in host interface design.

Interface to Network

The NIU is the entity that connects directly to the network. As such, it has to participate in the link-level protocol of the network, behaving much like a port in a network switch. Link-level protocol covers issues like how the beginning and end of a packet is indicated, and flow-control strategy to deal with possible buffer overrun problems. In some networks, it also includes link-level error recovery. This part of the NIU also deals with encoding, e.g. Manchester encoding to improve reliability, and the actual electrical levels used in signaling.

Data transported over the network has to conform to some format specified by the network. For instance, there is usually a header containing pre-specified control fields like the packet destination, followed by a payload of arbitrary data. Most systems also append extra information for checking the integrity of the packet, usually some form of CRC. A network may also impose a limit on the maximum packet size; messages that are larger than the size limit have to be fragmented at the source and reassembled at the destination. Whether these details are directly exposed to software, making it responsible for assembling messages into the network packet format, or masked by the NIU varies across systems.
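The fragmentation and reassembly described above can be sketched as follows. This is a minimal illustration only: the field names, dictionary-based packet format, and the 64-byte payload limit are assumptions for the sketch, not the format of any particular network discussed in this thesis.

```python
# Sketch of source-side fragmentation and destination-side reassembly for a
# network that imposes a maximum packet size. Packet format is hypothetical.

MAX_PAYLOAD = 64  # assumed maximum payload bytes per network packet

def fragment(dest, msg_id, data, max_payload=MAX_PAYLOAD):
    """Split a message into packets; each carries the destination, a message
    id, its fragment index, and the total fragment count."""
    chunks = [data[i:i + max_payload]
              for i in range(0, len(data), max_payload)] or [b""]
    total = len(chunks)
    return [{"dest": dest, "msg_id": msg_id, "frag": i, "total": total,
             "payload": c} for i, c in enumerate(chunks)]

class Reassembler:
    """Collects fragments, possibly arriving out of order, and returns the
    complete message once every fragment of it has been seen."""
    def __init__(self):
        self.partial = {}  # msg_id -> {fragment index: payload}

    def receive(self, pkt):
        frags = self.partial.setdefault(pkt["msg_id"], {})
        frags[pkt["frag"]] = pkt["payload"]
        if len(frags) == pkt["total"]:
            del self.partial[pkt["msg_id"]]
            return b"".join(frags[i] for i in range(pkt["total"]))
        return None  # message not yet complete
```

A 150-byte message would thus travel as three packets, and the destination can reassemble it even if the fragments arrive out of order.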

Data Path and Buffering

An NIU provides the data path between the network and its host node. This includes at least some rate-matching buffers, used as transient storage to decouple the real-time bandwidth requirement of the network from that available on the host node.

Within the node, CC-NUMA style shared memory directly transfers data between the NIU and caches. Although S-COMA style shared memory logically moves data between the NIU and main memory DRAM, actual data transfer sometimes occurs directly between the NIU and caches.

Message passing often involves additional buffers, which can be thought of as service delay buffers. For instance, incoming messages can be buffered away until software is at a convenient point to service them. Such buffers are managed in a cooperative fashion between node software and the NIU.

Service delay buffers may be on the NIU or in main memory DRAM. Another possibility is to place them logically in the main memory but have the NIU maintain a cache of them. For NIUs located on memory buses that support cache-to-cache intervention, the latter design avoids cycling data through DRAM in good cases, while having access to large, cheap buffer space in main memory.

Every NIU has to deliver all of the above three functions in some form. In fact, simple message passing NIUs provide only these functions and none of those described later. For instance, the CM NIU is a slave device on the memory bus, providing memory-mapped FIFOs that software on the node processor writes to or reads from directly. These FIFOs serve both the bandwidth-matching and service delay functions.

The StarT-X PCI bus NIU card is both a slave and a master on the PCI bus. As a slave, it accepts direct command and data writes from the processor, much like the CM NIU, except that the writes physically pass through a chip bridging the memory and PCI buses. With its bus master capability, the StarT-X NIU can transfer data to and from main memory DRAM. This allows it to use main memory DRAM as service delay buffers.

Data Transport Reliability and Ordering

In many systems, the function of recovering from losses caused by unreliable network service or by the NIU dropping packets is left to software, e.g. most ATM networks. Although this does not present a functional problem, since software can implement recovery protocols, it is best that the NIU provides this function in low-latency networks. Aside from offloading this overhead from software, the NIU is in a better position to implement the recovery protocol: without the duty of running long computation threads, it can send and handle acknowledgements in a timely fashion and monitor the need for retransmission. For similar reasons, protocols for guaranteeing in-order communication are also best implemented by the NIU.

NIU implemented loss recovery and ordering protocols are most often found in message passing NIUs with programmable embedded processors, where the function is implemented in firmware. Examples include the SP and the Myrinet. Most shared memory machines relied on their networks to guarantee reliable data transport, strongly motivated by their requirement for low communication latency. An exception is the Synfinity-Numa, which has a hardware implemented NIU-to-NIU loss recovery protocol.

Unilateral Remote Action

A few NIUs, such as Shrimp, Memory Channel, TD and TE, support uncached remote memory access. This ability to carry out remote actions unilaterally can be very useful, because it decouples the round-trip communication latency from the scheduling status of any user software counterpart on the remote end. Currently, this capability is only available for stylized operations, typically memory read/write. The TD also supports an atomic remote fetch-increment-write operation. The ability to run short remote user threads upon message arrival has been proposed, e.g. hardware platforms like the J-Machine and M-Machine, and software libraries like Active Messages, but has not found widespread adoption.

Cache Protocol Engine

The NIU of a cache-coherent distributed shared memory system performs tasks beyond those listed above, namely coherence maintenance. Though traditionally considered part of the cache controller or the memory controller, coherence functions are included in our NIU because they are a form of communication, and many capabilities needed to implement shared memory are also useful for message passing. Coherence hardware is divided into two parts: the cache protocol engine, described in this section, and the home protocol engine, described in the next section. Our discussion is confined to NIUs with direct access to the system bus, because other points of interface are inadequate to this task.

The cache protocol engine performs the cache side of coherence shared memory support. It has two functions: (i) it determines when a requested operation on a cache copy requires actions at other nodes, and initiates those actions; and (ii) it executes other nodes' commands to modify the state or data of local cache copies. The cache protocol engine is really part of the NIU's interface to host.

For an S-COMA style shared memory implementation, the cache protocol engine has to maintain cache-line state information and implement access control to DRAM cache-lines. For CC-NUMA style shared memory, the cache protocol engine typically does not retain any long term information, but simply echoes bus transactions to the home site for the cache-line involved.

Executing other nodes' commands on local cache copies involves performing an appropriate system bus transaction. This is, of course, limited to the set of bus transactions supported by the host system. In the case of S-COMA, the cache protocol engine may also have to modify the cache-line state it maintains.

An important role of the cache protocol engine is to maintain transient state for outstanding operations. Because a cache-coherence protocol is a distributed algorithm dealing with concurrent activities, a local view sometimes cannot differentiate between a number of possible global scenarios. The projections of these scenarios onto a local node can be similar, but the scenarios require different local actions. Maintaining transient state can help disambiguate. For instance, when a cache protocol engine receives an external command to invalidate a cache-line for which it has a pending read, it is not possible to tell whether the command has overtaken a reply to its pending read. Many implementations put the two messages onto different virtual networks due to deadlock concerns; thus message ordering cannot be assumed. The cache protocol engine therefore has to be more sophisticated than echoing commands/requests in a simplistic way.

Support for S-COMA requires the NIU to track address mappings so that, from the physical address of a cache-line, the cache engine can determine its home site and provide the home site with sufficient information to look up the cache-line's directory information. The cache engine also needs to track the global address to local physical address mapping, so that home protocol engines operate on global addresses only.

CC-NUMA style shared memory avoids address translation by using a fixed field of the physical address to identify the home node. As long as a large physical address space is available, this is an adequate approach. Translation capability brings convenience, e.g. when the home node is migrated, only the translation information needs to be modified, instead of having to flush all existing cache copies.

Home Protocol Engine

Associated with main memory, the home protocol engine is the centralized controller which maintains the global view of what is happening to a cache-line. It is typically the authority which issues caches the permissions to maintain and operate on cache copies. It also initiates commands to these caches to modify those permissions. The exact functions performed by the home protocol engine differ across implementations. For instance, almost all systems keep a directory for each cache-line to identify the caches that may have copies of that cache-line. This can be kept in a localized data structure at the cache-line's home site, the common approach, or in a distributed data structure maintained cooperatively with the cache engines, as in SCI.

Directory size is often a small but significant fraction of the supported shared memory size. While highly implementation dependent, this fraction is often targeted to be not more than to . For example, if bytes are kept for every bytes of cache-line, the overhead is about . It is also preferable to use part of main memory DRAM for this purpose, but when that is not feasible, such as due to possible deadlocks, extra DRAM can be provided on the NIU.

Earlier shared memory machines with only CC-NUMA support include DASH, Alewife^3, SGI's Origin, Sequent's NUMA-Q (subsequently renamed STiNG), and Fujitsu-HAL's Synfinity-Numa.

^3 Alewife's coherence protocol is implemented with a combination of hardware FSMs and processor software. Hardware handles the common cases but interrupts the processor for corner cases. Therefore, it is theoretically possible for Alewife to implement S-COMA style shared memory. This has not been attempted.

Chapter

A Layered Network Interface

Macroarchitecture

This chapter presents our NIU's macroarchitecture. The macroarchitecture is an abstract functional specification realizable with multiple implementations, one of which is presented in Chapter . We adopt a layered architecture to facilitate implementing a wide range of communication operations and ease of porting. Three layers are defined: (i) a Physical Network layer, which provides reliable, ordered packet delivery; (ii) a Virtual Queues layer, which implements protected network sharing; and (iii) an Application Interface layer, which implements the communication interface exported to application code, for example shared memory or message passing operations.

Each layer builds on services provided by the layer below to offer additional functions. By dividing the responsibility this way, a uniform framework takes care of network sharing protection issues once and for all, and the Application Interface layer can implement multiple abstractions without worrying about subtle protection violations; it only needs to control, through traditional virtual memory translation mechanisms, which job gets access to each memory-mapped interface.

Figure compares our proposed communication architecture against that of more established machines. Each NIU in the other machines has a monolithic structure and presents a small set of interface functions. The figure also shows how each architecture has a different way of enforcing sharing protection. Protection in our architecture, demonstrated in the StarT-Voyager example, is enforced in two different places. The first one, higher up in the diagram, makes use of conventional VMM translation, while the second one is implemented in the Virtual Queues layer in the NIU.

Figure: Comparison of the StarT-Voyager network interface architecture against those of traditional message passing machines and shared memory machines. The bold black lines indicate points where protection checks are imposed. Message passing machines typically either rely on the OS to enforce communication protection (top right) or operate in single network user mode (bottom right).

Physical Network Layer

The Physical Network layer defines the characteristics of the underlying data transport service. These are chosen both to ease the design and implementation of the layers above, and to match the functions supplied by most system area networks (SANs). The Physical Network layer provides reliable, ordered packet delivery over two logically independent networks, with a bounded number of outstanding packets in each network.

Two Independent Networks

Abstractly, the Physical Network layer at each NIU provides two pairs of send and receive packet queues, corresponding to entry and exit points of two independent networks. Each packet enqueued into a send queue specifies its destination receive queue, which has to be on the same network. Having more than one network greatly increases the options for overcoming network deadlock problems, with attendant performance improvements.

Reliable Delivery Option

The Physical Network layer provides reliable packet delivery as an option, on a packet-by-packet basis. Packets that are not marked as requiring reliable delivery may never reach their destination queue, but are never delivered more than once. This service guarantee is made an option because we expect some systems to incur lower overhead for packets that do not request reliable delivery.

An alternative is to push delivery reliability guarantees up to the next layer, the Virtual Queues layer. Our discussion of buffer management in Section mentioned that a design which dynamically allocates receiver queue buffer space at message arrival may run into the situation where no buffer space is left. If not handled properly, that problem can lead to deadlock affecting all parties sharing the network. We will revisit this problem in Section . For now, it suffices to mention that one solution is to drop a packet when its destination queue is out of buffer space, and to recover with a loss recovery protocol implemented between the private message queues of the Virtual Queues layer. Attempting to recover the dropped packet with a loss recovery protocol at the shared queues of the Physical Network layer does not resolve the deadlock (more details in the next section). It therefore appears that requiring reliable packet delivery at the network level is redundant.

Our choice is motivated by two performance considerations. Firstly, many cluster networks provide a reliable packet delivery service, which one would like to pass on to application code. This level of service is wasted if network sharing entails packet dropping and the attendant overhead of a loss recovery protocol.

Secondly, implementing a sliding window protocol requires a fair amount of state and buffering. If it is done at the multiple queues level, it most likely will have to be done in NIU firmware. Implementing it directly in hardware for all the queues in the Virtual Queues layer is infeasible because of the significant increase in per message queue state. In contrast, by pushing the loss recovery protocol down to the shared Physical Network layer, each NIU only implements two copies of the protocol, one for each network, making it feasible to implement the protocol fully in hardware at high performance.

Finally, our choice is possible because of an alternate solution to the deadlock problem: Reactive Flow-control. This is discussed in Section .

Ordered Delivery Option and Ordering-set Concept

We introduced a more flexible concept of delivery ordering to permit taking advantage of multiple paths in a network. Each packet which requires ordered delivery of some form specifies an ordering-set. Packets between the same source-destination pair that have requested ordered delivery under the same ordering-set are guaranteed to be delivered in the sent order. No ordering is guaranteed between packets of different ordering-sets. The traditional notion of message ordering is the limiting case where only one ordering-set is supported.

The ordering-set concept is useful in networks which provide multiple paths between each source-destination NIU pair. For systems that determine packet routes at the source, the ordering-set concept translates into using the same path for packets in the same ordering-set. Ordered packets from different ordering-sets can utilize different paths, spreading out traffic load. Packets with no ordering requirements are randomly assigned one of the possible routes.
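Source-routed path selection under ordering-sets can be sketched as follows. This is an illustrative sketch, not the thesis's mechanism: the hash-based assignment of an ordering-set to a path is an assumption; any fixed mapping from ordering-set to path would do, since all that matters is that packets of one set always share a path.

```python
import random

def pick_route(routes, ordering_set=None, rng=random):
    """Choose one of the available paths between a source-destination pair.
    Packets in the same ordering-set always take the same path, so the
    network's per-link FIFO behavior preserves their order; packets with no
    ordering requirement are spread randomly across paths."""
    if ordering_set is None:                 # no ordering requirement
        return rng.choice(routes)
    return routes[hash(ordering_set) % len(routes)]  # deterministic per set
```

Two ordered packets tagged with the same ordering-set therefore always traverse the same path, while unordered traffic load-balances over all paths.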

Bounded Outstanding Packet Count

The term outstanding packets refers to packets in transit in the network: they have left the source NIU but may not have reached the destination NIU. A bound on the number of outstanding packets is required by the Reactive Flow-control scheme described in Section . A well chosen bound can also prevent congestion in the network without limiting good case network performance.

Packet delivery reliability was made optional based on the argument that providing that guarantee incurs additional cost in certain systems. Given the need to account for outstanding packets, that argument may appear invalid. After all, accounting for outstanding packets requires knowing when a packet no longer exists in the network. If one has gone through the effort of tracking the existence of a packet, recovering from its loss should not be that much more expensive.

This perception is not strictly true. Packets may be lost because they are deliberately dropped at the destination NIU in response to lack of buffer space in destination queues. Accounting for outstanding packets is relatively simple, with destinations sending periodic aggregated acknowledgement counts to the sources. In comparison, if recovery of lost packets is needed, it becomes necessary to buffer packets at the source until delivery is confirmed. A bound on the number of outstanding packets can also be gotten in passive ways, e.g. the entire network's transient buffering capacity provides a bound, which may be low enough to be useful for Reactive Flow-control.
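The aggregated acknowledgement counting described above can be sketched as follows. The counter names and the policy of refusing sends at the bound are assumptions for illustration; the point is that the destination only reports a cumulative count, never per-packet acknowledgements, and that this suffices to bound outstanding packets without supporting loss recovery.

```python
class OutstandingCounter:
    """Per-destination accounting of packets in transit. The destination
    periodically reports an aggregate cumulative count of packets it has
    consumed (delivered or deliberately dropped); the source never learns
    the fate of any individual packet."""
    def __init__(self, bound):
        self.bound = bound       # maximum packets allowed in transit
        self.sent = 0            # total packets launched so far
        self.consumed = 0        # cumulative count reported by destination

    def can_send(self):
        return self.sent - self.consumed < self.bound

    def on_send(self):
        assert self.can_send(), "outstanding-packet bound would be exceeded"
        self.sent += 1

    def on_ack_count(self, total_consumed):
        # Aggregated acknowledgement: a cumulative count, so a lost or
        # reordered report is subsumed by any later one.
        self.consumed = max(self.consumed, total_consumed)
```

Note that nothing here enables retransmission: the source keeps no copy of the packets, which is exactly why accounting is cheaper than loss recovery.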

Programmable Send-rate Limiter

A programmable send-rate limiter imposes a programmable minimum time interval between launching two packets into the same network. It is a useful tool for congestion control. Congestion in the network arises when senders inject packets at a faster rate than they can be removed. In switched networks, this often leads to very bad congestion due to tree-blocking effects. With the send-rate limiter, faster NIUs in a heterogeneous environment can be programmed to match the receive speed of slower NIUs. The minimum time interval can also be dynamically altered in response to network conditions.
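The limiter's behavior can be sketched as follows. This is a software illustration under assumed names; a hardware implementation would compare against a cycle counter rather than take an explicit timestamp.

```python
class SendRateLimiter:
    """Imposes a programmable minimum interval between packet launches into
    one network. The interval can be reprogrammed at any time, e.g. in
    response to observed network conditions."""
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last_send = None

    def try_send(self, now):
        """Return True if a packet may be launched at time 'now'; otherwise
        the packet is held back until the interval has elapsed."""
        if self.last_send is not None and now - self.last_send < self.min_interval:
            return False
        self.last_send = now
        return True
```

A fast NIU paired with a slow receiver would simply be programmed with a larger `min_interval`, matching its injection rate to the receiver's drain rate.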

The Physical Network layer implements the glue to the underlying network. When that network already possesses the desired properties, the network layer simply implements low level signaling between the NIU and the network. When there is a mismatch, some form of sliding window protocol can be used to bridge the gap. A sliding window protocol not only recovers from packet losses, but can also enforce order. Furthermore, the send window size imposes a bound on the number of unique outstanding packets.
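A minimal go-back-N sliding window sketch illustrates how one protocol instance provides all three properties at once: loss recovery (retransmission of unacknowledged packets), ordering (the receiver accepts only the expected sequence number), and a bound on unique outstanding packets (the window size). Details such as cumulative acknowledgements and unbounded sequence numbers are simplifying assumptions, not the thesis's wire protocol.

```python
class SlidingWindowSender:
    """Go-back-N sender: buffers unacknowledged packets, bounds the number
    outstanding by the window size, resends them all on timeout."""
    def __init__(self, window):
        self.window = window
        self.next_seq = 0
        self.unacked = {}            # seq -> packet awaiting acknowledgement

    def send(self, payload):
        if len(self.unacked) >= self.window:
            return None              # window full: outstanding bound reached
        pkt = (self.next_seq, payload)
        self.unacked[self.next_seq] = pkt
        self.next_seq += 1
        return pkt

    def on_ack(self, ack_seq):
        # Cumulative ack: everything up to and including ack_seq arrived.
        for s in [s for s in self.unacked if s <= ack_seq]:
            del self.unacked[s]

    def on_timeout(self):
        # Retransmit every unacknowledged packet, oldest first.
        return [self.unacked[s] for s in sorted(self.unacked)]

class SlidingWindowReceiver:
    """Accepts only the expected sequence number, discarding out-of-order
    and duplicate packets, and returns the cumulative ack to send back."""
    def __init__(self):
        self.expected = 0
        self.delivered = []

    def on_packet(self, pkt):
        seq, payload = pkt
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
        return self.expected - 1     # cumulative ack; -1 if nothing yet
```

Because the sender refuses to launch a packet when the window is full, the send window size directly serves as the outstanding-packet bound mentioned above.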

Virtual Queues Layer

The Physical Network layer is a physically addressed packet transport service, unaware of processes and protection requirements. The Virtual Queues layer virtualizes it, laying the foundation for multitasking sharing of the network. System-level job scheduling flexibility depends on this layer. Our design not only facilitates transparent process migration, but also enables novel features such as dynamic contraction and re-expansion of the number of processes devoted to a parallel job.

Figure shows the Virtual Queues layer and Physical Network layer working cooperatively. Central to the Virtual Queues layer is the abstraction of an arbitrary number of message queues, allocated by the system as private queues to different jobs. These queues form the components for constructing an arbitrary number of independent virtual networks, which can operate simultaneously. Protection is strictly enforced through a virtual queue naming and translation scheme, while the out-of-receive-buffer problem is solved by a Reactive Flow-control protocol which imposes zero performance cost under favorable conditions.

Figure: Relationship between the Virtual Queues layer and the Physical Network layer. At the sending node, the virtual destination name is translated into a physical destination node name and receive queue identity; the packet is routed according to its physical destination node name, and at the receiving node it is directed to the appropriate receive queue based on its receive queue identity. A loss recovery/ordering protocol and a buffer management protocol operate between the nodes.

Virtual Queue Names and Translation

Controlling access to message queues is the basis for protection. There are two categories of message queue access: (i) local access to transmit and receive queues, for message enqueue and dequeue operations respectively, and (ii) remote receive queue access, i.e. naming a receive queue as the destination of a message.

Without modifications to microprocessors, local access to transmit and receive queues has to be through a memory-mapped interface. Access control here is then simply a matter of restricting which transmit and receive queues are mapped into the virtual address space of a process. The normal virtual address translation mechanism takes care of enforcing protection in each access instance.

Control over the naming of message destination queues is more interesting, as it defines the communication domains. Application code refers to message destinations with virtual names, each of which is translated at the sender's NIU into a global queue address used by the Physical Network layer. This translation is context dependent, with each transmit queue having a different destination name space. Only system code can set up the translation tables which define the reach of each transmit queue.

Each global queue name has two components: one part identifies the destination NIU, while the second specifies a receive queue identifier (RQID) on that NIU. The latter is subjected to a second translation at the destination NIU to determine the actual destination queue resources. The utility of this second translation will become apparent later, when we discuss process migration in Section and the combined Resident and Non-resident queues approach of implementing the Virtual Queues layer in Section .
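The two-level translation can be sketched as follows. The table layout, node names, and `PermissionError` convention are assumptions for illustration; the essential points from the text are that only system code installs entries, that each transmit queue sees only its own destination name space, and that the RQID is translated again at the destination.

```python
class VirtualQueueTranslation:
    """Sketch of two-level destination name translation. Each transmit queue
    has a system-installed table mapping virtual destination names to global
    queue addresses (destination NIU, RQID); each NIU has a second table
    mapping incoming RQIDs to actual local receive queue resources."""
    def __init__(self):
        self.tx_tables = {}  # (node, txq) -> {virtual name: (dest node, rqid)}
        self.rx_tables = {}  # node -> {rqid: local receive queue resource}

    # Only system code may call the install methods, defining each TxQ's reach.
    def install_tx(self, node, txq, vname, dest_node, rqid):
        self.tx_tables.setdefault((node, txq), {})[vname] = (dest_node, rqid)

    def install_rx(self, node, rqid, local_queue):
        self.rx_tables.setdefault(node, {})[rqid] = local_queue

    def send(self, node, txq, vname):
        """First translation at the sender's NIU, second at the destination."""
        table = self.tx_tables.get((node, txq), {})
        if vname not in table:
            raise PermissionError("destination outside this TxQ's reach")
        dest_node, rqid = table[vname]           # sender-side translation
        return self.rx_tables[dest_node][rqid]   # destination-side translation
```

The second translation is what makes migration cheap: remapping an RQID to a different local resource at the destination requires no change at any sender.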

This scheme allows arbitrary communication graphs to be set up. Aside from the fully-connected communication graph common to parallel jobs, partially connected communication graphs are also possible. The latter may be useful for client-server communication, where a server is connected to its clients but the clients are not connected to one another. Adding and removing connections dynamically is also easy in this scheme.

Our protection scheme for network sharing offers several advantages over conventional approaches, such as the network time-slicing employed on the CM, and the job-ID tagging and matching scheme used on the SP. These approaches coupled three issues which we feel should be dealt with separately; these are (i) communication protection, (ii) job scheduling, and (iii) network packet loss recovery. Coupling communication protection with job scheduling restricts the kind of job scheduling or network sharing in a system. As for network packet loss recovery, many cluster system networks are reliable enough to not require provision for recovery from losses due to network errors.

Our protection scheme also offers other system-level benefits, like easing job migration and dynamic computation resource adjustment, as described below. But before that, we show that our scheme can be viewed as a generalized job-ID tagging scheme.

The job-ID tagging scheme on the SP only delivers packets that are tagged with either the current job's job-ID or the system job's job-ID. It is easy to extend this scheme so that the receiver NIU permits more receive queues to be active, and demultiplexes incoming packets into their desired receive queues. This is, however, insufficient when we want to migrate jobs, and their message queues that used to be on different nodes, to the same node. Job-ID alone does not provide sufficient disambiguation; it has to be further refined by some kind of process identity or queue identity. This is essentially our destination queue naming and translation scheme.

Dynamic Destination Buffer Allocation

This section revisits the problem of allocating destination queue buffer space dynamically at message arrival time. When space is unavailable, the choices are either to drop or to block the message. Blocking is a dangerous option in a shared environment, because a blocked packet occupies a shared network buffer, leading to the insidious build up of dependences between logically unrelated jobs. Blocking is only safe if there is some way to guarantee that space will eventually free up. In dedicated network environments, this typically depends on the application respecting some network usage discipline. In shared network environments, this requires an NIU imposed strategy. This is necessary because the system cannot trust every job to respect the communication discipline, and sharing the same pool of network buffers can create inter-job dependences.

Figure: A sliding window style loss recovery protocol at the Physical Network layer is insufficient for preventing blockage in the receive queue of one communication domain from blocking traffic in another communication domain which shares the same physical network. Repeated dropping of A40 blocks B12.

Dropping a message when receiver buffer space is unavailable is simple, but requires some recovery strategy if reliable communication is desired. Care has to be taken to ensure that the recovery protocol does not introduce its own dependence arcs between unrelated message queues. In particular, if a variant of sliding window protocol is used, an instance of the protocol is required for each pair of Virtual Queues layer transmit and receive message queues. Relying on a shared instance in the Physical Network layer for this recovery will not work, as illustrated in Figure .

Figure shows a case where packet A40 arrives to find its destination queue RxQA1 full. The packet is dropped, and no positive acknowledgement is provided by the sliding window protocol in the Physical Network layer. This causes A40 to be resent later. Suppose the program which sent A40 has a deadlock bug, such that it never frees up queue space to accommodate A40. The sliding window protocol will fail to make progress, because the failure to deliver A40 prevents the protocol from reporting progress. Packets from queues belonging to other protection domains are thus blocked too, e.g. B12 waiting in TxQB1. Basically, any sliding window protocol must operate between a TxQ and RxQ in the same layer.

When the Physical Network layer provides lossless packet delivery, and losses only occur because a full receive queue causes new packets to be dropped, another possible recovery protocol is to provide an acknowledgement, either positive or negative, for each packet. Under this scheme, the acknowledgements use a logically different network from the one used for normal packets. A negative acknowledgement (NAck) triggers a subsequent resend. Source buffer space has to be preallocated before the message is launched into the network, but copying into this buffer can be deferred by requiring the NAck to return the entire message. The appeal of this scheme is its simplicity.

Yet another alternative is to take steps to ensure that there is always sufficient buffer space in the receive queues. The onus of ensuring this can be placed on the user or library code running on the aP. A possible manifestation of this idea is to require the message sender to specify a destination address to store the message into. Another is to continue to have the NIU allocate destination buffers, but should the user program make a mistake, resulting in a message arriving to find no buffer space, the message will be dropped. To aid debugging, the Virtual Queues layer provides a history bit, available to user code, indicating whether any message has been discarded. A second option is to provide NIU assistance which dynamically and transparently regulates message traffic to avoid running out of receive queue buffer space. We propose one such scheme, which we call Reactive Flow-control.

Reactive Flow-control

Reactive Flow-control imposes no flow-control overhead when traffic in the network is well behaved, while providing a means to contain the damage wrought by ill behaved programs. It is also well suited to our NIU architecture; as shown in the next chapter, it can be implemented very cheaply on our system.

Under Reactive Flow-control, each receive queue has a low and a high watermark, as shown in Figure . When the number of packets in a receive queue is below the high watermark, the receive queue operates without any flow-control feedback to the transmit queues which send it messages. When occupancy of the receive queue reaches the high watermark, flow-control throttling packets are sent to all the transmit queues that are potential message sources, to throttle message traffic: the senders are told to stop sending packets to this destination until further notified. The receive queue is said to have overflowed and gone into throttled mode. To ensure that flow-control packets are not blocked by the congestion they are attempting to control, one logical network of the Physical Network layer is reserved for flow-control packets only. All normal packets go over the other logical network.

After flow-control throttling has been initiated, packets will continue to arrive for the receive queue until the throttle is completely in place. The receive queue must continue to accept these packets. Upon receiving a throttling packet, an NIU uses a selective disable capability to prevent the relevant transmit queue from sending packets to the receive queue concerned. Packets to other destinations can, however, continue to be sent out from that transmit queue.

Each receive queue also has a low watermark. A receive queue in throttled mode gets out of the mode when its occupancy drops below this watermark. At that point, the receive queue's NIU sends throttle-lifting flow-control packets to the sources to notify them that the receive queue is ready to accept packets again. To prevent a sudden flood of packets to the receive queue, the throttle-lifting phase is done in a controlled fashion, by staggering the destination re-enabling at different sources.
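To make the mode transitions concrete, here is a minimal Python sketch of the hysteresis behavior described above; the class and method names are ours, not the NIU's:

```python
class ReceiveQueue:
    """Toy model of Reactive flow-control mode transitions (illustrative only)."""

    def __init__(self, low, high):
        assert low < high
        self.low, self.high = low, high
        self.occupancy = 0           # packets currently buffered
        self.throttled = False       # True once the high watermark is crossed

    def packet_arrived(self):
        # The queue must accept every packet, even in throttled mode:
        # packets already in flight keep arriving until the throttle is in place.
        self.occupancy += 1
        if not self.throttled and self.occupancy >= self.high:
            self.throttled = True    # send throttling packets to all sources

    def packet_serviced(self):
        self.occupancy -= 1
        if self.throttled and self.occupancy < self.low:
            self.throttled = False   # send throttle-lifting packets, staggered
```

Because the queue leaves throttled mode only below the low watermark, not immediately below the high one, the gap between the two watermarks provides the hysteresis that prevents rapid oscillation between the modes.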

Figure: Overview of Reactive Flow-control. (The figure plots receive queue buffer usage against the low and high watermarks. In Normal mode there is no flow-control overhead. When buffer usage exceeds the high watermark, the queue enters Throttled mode: packet sources are told to stop sending packets to this receive queue, although the queue continues to accept packets sent before throttling took effect. When buffer usage recedes below the low watermark, the queue returns to Normal mode; the gap between the watermarks provides hysteresis between the two modes.)

Reactive flow-control requires the receive queue to accept all incoming packets.

For this to work, we must be able to determine the maximum amount of buffering needed. In our architecture, the network layer provides this bound. Both active and passive methods for achieving this bound were discussed in an earlier section. This amount of buffer space, also known as the overflow buffer space, is needed for each receive queue in addition to the normal buffer space of size equivalent to the high watermark. In practice, the overflow buffer size can be substantial, several kilobytes, but because it resides within the owner process's virtual address space, implementations can provide overflow buffering in main memory DRAM and even locally paged memory.

Strictly speaking, in order to use the network's buffering capacity as a bound on overflow buffer space, as is done in the StarT-Voyager NES described in the next chapter, a message which crosses the high watermark should be blocked until the flow-control throttling is known to have taken effect at all the message sources. To avoid this temporary blockage, a threshold watermark, at a point lower than the high watermark, is introduced. Flow-control is initiated when the threshold watermark is reached, while message blockage pending the disabling of all sources is only required when crossing the high watermark. With an appropriate gap between the threshold and high watermarks, the temporary message blockage can be avoided in most cases.

The watermarks are picked in the following ways:

Low watermark: The low watermark should be high enough so that there is sufficient work to keep the receiver busy while the throttle is lifted. Thus, if it takes time t_tl to remove the throttle, the message servicing rate is r_s, and the average size of each message is m_s bytes, the low watermark should be t_tl x r_s x m_s bytes. This value should be determined by system code, as both t_tl and m_s are system parameters; r_s is application dependent, though the system can easily estimate a lower bound. This value can also be specified by the application when it requests allocation of a message queue.

Threshold watermark: The choice of threshold watermark is influenced by two considerations. Firstly, the threshold should provide sufficient space for the expected normal buffering requirement of the application. Furthermore, the size between the low and threshold watermarks should provide hysteresis, to prevent oscillation between throttled and normal mode. For instance, this size should be such that the time to fill it up under the maximum message arrival rate is a reasonable multiple of the time it takes to impose and remove the throttle. Hysteresis should work in the other direction too, but we expect the message reception service rate to be lower than the arrival rate, so we do not have to worry about the throttle being imposed and lifted very quickly. The application should provide system code with the expected buffering requirement, from which the system code determines a threshold watermark that also has sufficient hysteresis.

High watermark: The high watermark is picked so that the size between the threshold and high watermarks allows the receiver NIU to continue taking in arriving messages during the time it takes for the throttle to be put in place. So if it takes time t_ti to impose the throttle, the message arrival rate is r_a, and the average size of each message remains at m_s bytes, the size between the threshold and high watermarks should be t_ti x r_a x m_s. This, like the low watermark, should be determined by system code. Application code can again assist by estimating the expected r_a.
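The two sizing rules above can be stated as one-line formulas. This sketch simply evaluates t_tl x r_s x m_s and t_ti x r_a x m_s for whatever parameter estimates system code has; the function names and the example numbers in the usage below are illustrative:

```python
def low_watermark(t_lift, service_rate, msg_size):
    # Enough buffered work to keep the receiver busy while the throttle
    # is being lifted: t_tl * r_s * m_s bytes.
    return t_lift * service_rate * msg_size

def threshold_to_high_gap(t_impose, arrival_rate, msg_size):
    # Space for messages that keep arriving while the throttle is being
    # imposed: t_ti * r_a * m_s bytes.
    return t_impose * arrival_rate * msg_size
```

For example, with a hypothetical 2 ms throttle-lifting time, a service rate of 50,000 messages/s, and 64-byte messages, the low watermark works out to 6400 bytes.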

Decoupled Process Scheduling and Message Queue Activity

To enable more flexible job scheduling policies, message queues remain active independent of the scheduling status of the owner processes. This means messages can be launched into the network from them, and arrive from the network into them, while their owner processes are not running. Reactive flow-control takes care of filled receive queues when the receiver process is not active, preventing network deadlock from developing. This will eventually block senders that need to communicate with this receiver. The cluster job scheduler can be informed of receive queue overflow and transmit queue blockage, and use this information to carry out appropriate scheduling actions.

Transparent Process Migration

Using virtual message destination queue names removes one potential obstacle to transparent process migration. When migrating a process which owns a number of receive queues to a new SMP, the global queue names of these receive queues have to be changed. In our design, this simply means changing the virtual-to-global queue name mapping in the translation tables of the transmit queues that send messages to the affected receive queues. The rest of this section briefly sketches out the communication aspects of job migration.

The system first suspends the migrating process and all processes which communicate with it. Next, changes are made to the transmit queue translation tables, while the network is swept of all outstanding messages to the affected receive queues. Once this is done, the process migration can be completed and all the suspended processes re-enabled. Transparent migration requires references to other system resources, such as file descriptors, to be migratable too. These issues are, however, orthogonal to communication and beyond the scope of this thesis.
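In sketch form, the communication side of migration is a pass over the senders' translation tables; the dictionary representation below is a simplification of the NIU's tables, and all names are invented:

```python
def migrate_receive_queue(translation_tables, virtual_name, new_global_name):
    """Repoint every transmit queue's entry for one virtual destination
    queue to the queue's new global (physical) name after migration."""
    for table in translation_tables:          # one table per transmit queue
        if virtual_name in table:
            table[virtual_name] = new_global_name
```

Because senders address the queue only by its virtual name, nothing in application code or in-flight application data needs to change; only these table entries do.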

The rest of this discussion describes several features in our design which further reduce the coordination and dead time (i.e., time when processes must be suspended) during process migration. Although it is unclear how important these migration cost reductions are compared to other costs of migration, such as copying the migrated process state to the new SMP, the improvements rely on features that are present for other purposes and thus come for free.

Improvement 1: Each message transmit and receive queue in the Virtual Queues layer can be disabled individually. When a message transmit queue is disabled, messages will not be launched from it into the network layer. The owner process can, however, continue to enqueue messages into the transmit queue as long as there is buffer space. When a message receive queue is disabled, messages arriving for this queue will not be enqueued into it. Instead, system code handles such messages.

The capability for system code to disable message queues is critical for the Reactive flow-control scheme, but also comes in handy for process migration. Instead of suspending processes that may send messages to the migrating process, the system only needs to disable the affected transmit queues until the migration is completed. The processes themselves can continue execution. Queue disabling involves only actions in the NIUs, and is a very cheap operation in contrast to process suspension.

Improvement 2: Further improvement to process migration is possible with the transmit queues' selective disable feature. As the name suggests, applying selective disable to a transmit queue prevents messages heading towards specific destinations from being launched from that queue. Messages to other destinations are unaffected. This is again motivated by Reactive flow-control considerations, but is useful for relaxing the restrictions on sender processes during process migration.

Improvement 3: Thus far, our description of process migration requires disabling all transmit queues until the network is swept of messages previously sent from them. This is necessary to preserve message ordering. In cases where message ordering is unimportant, these sweeps are unnecessary. To deal with messages heading to a receive queue that is being migrated, a temporary message forwarding service, implemented by the NIU, forwards these messages to the queue's new location. This results in an implementation of process migration without global coordination or global synchronization requirements.

Dynamic Computation Resource Adjustment

The contraction and subsequent re-expansion of the number of processors allocated to a parallel job is likely to be a useful tool for job scheduling. Such changes may be triggered by a cluster's workload fluctuations, or by fault-induced cluster size contraction. We will consider two cases here to illustrate how our queue name translation model assists these contractions and re-expansions.

Our first example considers a static parallel execution model with coarse-grain dependence, for example bulk-synchronous scientific parallel programs. Because dependence is coarse grain, it is sufficient to implement contraction by migrating some processes so that the parallel job's processes fit into fewer SMPs. This assumes that dependences between processes mapped onto the same processors are coarse grain enough that satisfying them does not require excessive context switching. Re-expansion again simply involves process migration. Using the migration capabilities described earlier, both contraction and re-expansion are logically transparent to application code.

Our second example considers a multithreaded execution model characterized by dynamically created medium to short threads and finer grain dependence. Execution, orchestrated by a run-time system (RTS), is driven off continuation queues or stacks. Typically each process has one continuation queue/stack, and work stealing is commonly used for distributing work among processes.

Cid and Cilk are examples of parallel programming systems which fall into this model. Because of fine-grain dependences, an efficient implementation of contraction should combine processes and the continuation queues/stacks of these processes. This is to avoid constant process switches to satisfy fine-grain dependences between threads in different processes that are mapped onto the same processor. Message receive queues of these processes should also be combined, to improve polling efficiency and timely servicing of messages.

Appropriate RTS support is critical for making such a contraction possible. Our destination queue name translation assists the collapsing of queues by allowing two or more virtual destination queues to be mapped to the same physical queue. With this support, references to virtual destination queues can be stored in application data structures before the contraction and continue to be valid after it. Messages sent before the contraction and received after it can also contain references to virtual destination queues without encountering problems.
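A toy illustration of this queue collapsing, with invented names: two virtual destination queues end up aliased to one physical queue, so references stored before the contraction remain valid afterwards:

```python
# Virtual destination queue names, as held in application data structures.
translation = {"vqA": "phys0", "vqB": "phys1"}
physical_queues = {"phys0": [], "phys1": []}

def send(virtual_dest, payload):
    # Senders never see physical queue names, only virtual ones.
    physical_queues[translation[virtual_dest]].append(payload)

# Contraction: collapse vqB onto vqA's physical queue. Messages already
# buffered in phys1 move too, and the alias is updated.
physical_queues["phys0"].extend(physical_queues.pop("phys1"))
translation["vqB"] = "phys0"

send("vqA", "x")
send("vqB", "y")   # both now land in the same physical queue
```

After the contraction, one process can poll a single combined queue, which is the polling-efficiency benefit noted above.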

Application Interface Layer: Message Passing

The third and last layer of this communication architecture is the Application Interface layer, which corresponds to the NIU's interface on the SMP host's memory bus. The service exported by this layer is visible to application code, and is divided between a message passing interface, a shared memory interface, and provision for interface extensions. This section covers the message passing interface.

Software overhead, latency and bandwidth are several considerations for message transmission and reception. StarT-Voyager provides five mechanisms that offer performance tradeoffs among these considerations. These are: (i) Basic message, (ii) Express message, (iii) DMA, (iv) Express-TagOn, and (v) Basic-TagOn. Custom message passing interfaces can be added using the interface extension capability of the NIU.

For most commercial microprocessors and SMPs, memory mapping is the main interface to the NES. Because neither the memory hierarchy's control structures nor its data path are designed for message passing, the interaction between the processor and the NIU has to be constructed out of existing memory load/store oriented operations.

On the transmit side, this interaction has three logical components: (i) determine whether there is sufficient transmit buffer, (ii) indicate the content of the message, and (iii) indicate that the message is ready for transmission. The interaction on the receive side also has three logical components: (i) determine if any message has arrived and is awaiting processing, (ii) obtain message data, and (iii) free up the buffer space occupied by the message. Designing efficient methods to achieve these interactions requires taking into account the optimal operating mode of the memory bus, the cacheline transfer mechanism, and the memory model.

With today's SMPs, the number of bus transactions involved in each message send and receive is a primary determinant of message passing performance. Processor bookkeeping overhead tends to have a smaller impact on bandwidth and latency, because today's processor clock frequency is typically several times the bus clock frequency, whereas each bus transaction occupies the bus for several bus clocks. The consumption of bus bandwidth becomes an even more significant issue in an SMP environment, where the memory bus is shared by a number of processors. Several techniques useful for reducing the number of message passing bus transactions are described below.

Use cacheline burst transfers: Virtually all modern microprocessor buses are optimized for cacheline burst transfers. Consider the 60X bus used by the PowerPC: a full cacheline of data can be transferred in a single burst in far fewer bus cycles than the eight uncached word-sized transfers needed to move the same data piecemeal.¹ Aside from superior bus occupancy, the cacheline burst transfer also uses processor store-buffers more efficiently, reducing processor stalls due to store-buffer overflow.

Using burst transfers adds some complexity to NIU design, because burst transfers are typically only available for cacheable memory. With message queues read and written by both the processor and the NIU, cache coherence must be maintained between the NES and processor caches in order to exploit the burst transfer. This issue is discussed further when we describe Basic Message support below.

Avoid main memory (DRAM): The message data path should avoid going through main memory (DRAM) in the common case, because information that goes through DRAM crosses the system bus twice: first to be written into main memory, and a second time to be read out. This not only wastes bus bandwidth but increases communication latency; in today's systems, the write-read delay through DRAM can easily add many additional bus clocks. Instead, transfers should be directly between the NIU and the processor, through any in-line caches between them.

¹ A few microprocessors, such as certain MIPS R-series processors and the Pentium Pro, are able to aggregate uncached memory writes to contiguous addresses into larger units for transfer over the memory bus; others offer special block transfer operations (e.g., block load/store instructions in SPARC). These are, however, still non-standard. The PowerPC family used in this project has neither of these features.

Main memory (DRAM) does have the advantage of offering a large amount of relatively cheap memory. An NIU may want to take advantage of it as backup message buffer space. This is quite easy if the SMP system's bus protocol supports cache-to-cache transfers. This feature allows a cache with modified data to supply it directly to another cache without first writing it back to main memory. In such systems, message buffers can logically reside in main memory, but be burst transferred directly between the processor cache and the NES.

Bundle control and data into the same bus transaction: Control information can be placed on the same cacheline as message data, so that it is transferred in the same burst transaction. For instance, a Full/Empty field can be used to indicate the status of a message. For transmit buffers, the NIU indicates that a buffer is available by setting its Full/Empty field to Empty, and processor software sets it to Full to indicate that the message should be transmitted.

Control information can also be conveyed implicitly, by the very presence of a bus event. Because the number of bus events on cached addresses is not directly controlled by software but is dependent on cache behavior, this technique can only be used on uncached addresses. It is employed in our Express and Express-TagOn message passing mechanisms. In those cases, we also make use of address bits to convey data to the NIU.

Compress message into a single software-controllable bus transaction: This is an advantage if the SMP system uses one of the weak memory models. Found in many modern microprocessor families, weak memory models allow memory operations executed on one processor to appear out-of-order to other devices. In such systems, a message-launch operation might get reordered to appear on the bus before the corresponding message-compose memory operations. To enforce ordering, processors that implement a weak memory model provide a memory barrier operation (e.g., sync in PowerPC processors), which is needed between the message-composition operations and the message-launch operation. Unfortunately, memory barrier operations result in bus transactions, and their implementations often wait for the superscalar processor pipeline to serialize.

If the entire content of a message can be compressed into a single software-controllable bus transaction, memory-barrier instructions are avoided. Express message takes advantage of this.

Aggregate control operations: An interface that exchanges producer and consumer pointers to a circular buffer to coordinate buffer usage allows aggregation of control operations. Instead of finding out whether there is another free transmit buffer, or another message waiting to be serviced, the processor software can find out the number of transmit buffers still available, or the number of messages still waiting to be serviced. Software can similarly aggregate the transmit-message indicators, or the indicators that free received-message buffer space. Aggregation in these cases unfortunately does have the negative side-effect of delaying transmission and the release of unneeded buffer space. There is thus a tradeoff between overhead and latency.

Another form of aggregation is packing several pieces of control information into the data exchanged in one transaction. This can again potentially save bus transactions. Both aggregation techniques are used in our Basic Message support.
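As a sketch of the first form of aggregation, assuming wrap-around pointer arithmetic on a circular queue (the queue size and helper names are ours):

```python
QUEUE_SIZE = 16  # buffers; pointers wrap modulo the queue size

def free_slots(producer, consumer):
    # Number of transmit buffers still available; one slot is kept
    # empty so that a full queue is distinguishable from an empty one.
    return (consumer - producer - 1) % QUEUE_SIZE

def pending_messages(producer, consumer):
    # Number of received messages waiting to be serviced.
    return (producer - consumer) % QUEUE_SIZE
```

Software caches the result and re-reads the uncached NIU pointer only when the count reaches zero, so one bus transaction is amortized over many messages, at the cost of the delayed-release effect noted above.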

Cache control information: For control variables that are expected to have good run-length (i.e., one party reads it many times before the other writes it, or one party writes it many times before the other reads it), making the location cacheable is advantageous. To actually exchange information, invalidation-based snoopy coherence operations will cause at least two bus operations: one to acquire write ownership, and another to read data. Therefore this is only a good idea if the average run-length is greater than two.

A possible application of this idea is to cache the consumer pointer of a receive queue, as is done in the CNI message passing interface. This pointer is written frequently by processor software, but typically read infrequently by the NIU. Similarly, the consumer pointer of a transmit queue is also a candidate, although in that case making it uncached has its merits. This is because, to obtain long write run-length on the NIU side for a transmit queue consumer pointer, the processor software has to use a local variable to keep track of the number of available buffers, and only read the consumer pointer from the NIU to update this variable when it reaches zero. That being the case, making the consumer pointer uncached is actually better, because there is no need for the NIU to obtain write ownership, as is necessary if the location is cached.

In the case of the receive queue consumer pointer, making the pointer uncached and using aggregation to reduce the number of write bus transactions has the negative side-effect of delaying buffer release. A cached consumer pointer allows the write to occur to the processor cache, and it only goes on the bus when the NIU needs it.

Special support for multicast/broadcast: An NIU interface that allows multicast or broadcast operations to be specified can reduce the number of times the same data is moved over the memory bus. Our TagOn message capitalizes on this, while giving the flexibility for some of the message data to be different for each multicast destination.

One of the biggest challenges in improving message passing performance is to reduce the latency of reading information into the processor. Unfortunately, there are few ways of improving this, other than to reduce the number of such occurrences using the techniques described above. An invalidation-based snoopy bus protocol provides no opportunity to push data into caches unless the processor has requested it. Snoopy bus protocols that allow update or snarfing, neither of which is found in commercial systems today, may open up new opportunities. Processor software can also use prefetching to hide this latency, but this is not always easy or possible, particularly since the data can become stale if the prefetch occurs too early.

We studied the above design options when we designed StarT-NG, and again when we started designing StarT-Voyager. Independently, Mukherjee et al. also studied the problem of message passing interface design for NIUs connecting to coherent memory buses. The most interesting result of their work is a class of message passing interfaces that they named Coherent Network Interfaces (CNI). We will discuss these after we describe Basic Message. This section describes the message passing mechanisms in our Application Interface layer.

Basic Message

The Basic message mechanism provides direct access to the Virtual Queues layer's messaging service. With a header specifying the logical destination queue and other options, and a variable payload of between four and twenty-two words, a Basic Message is ideal for communicating an RPC request, or any medium size transfer of up to several hundred bytes.

The Basic Message interface consists of separate transmit and receive queues, each with a cacheable message buffer region, and uncached producer and consumer pointers for exchanging control information between the processor and the NIU. Status information from the NIU (the transmit queues' consumer pointers and the receive queues' producer pointers) is packed into a single value, so that it can all be obtained with one read. The message buffer region is arranged as a circular FIFO, with the whole queue visible to software,² enabling concurrent access to multiple messages. The message content is specified by value, i.e., the processor is responsible for assembling the message content into the transmit buffer space. An uncached pointer update immediately triggers NIU processing.

The processor performs four steps to send a Basic Message (top half of the figure). The Basic Message transmit code first checks to see if there is sufficient buffer space to send the message (Step 1). The figure also shows several messages that were composed earlier and are waiting to be transmitted. When there is buffer space, the message is stored into the next available buffer location (Step 2); the buffer is maintained as a circular queue of a fixed but configurable size. The transmit and receive queue buffers are mapped into cached regions of memory. Unless NIU Reclaim mode is used, in which case the NIU is responsible for buffer space coherence maintenance, the processor must issue clean instructions to write the modified cache lines to the corresponding NIU buffer locations (Step 3). For systems with a weak memory model, a barrier instruction is required after the clean instructions and before the producer pointer is updated via an uncached write (Step 4). This write prompts the NIU to launch the message, after which the NIU frees the buffer space by incrementing the consumer pointer.

² This is not strictly true, in that the NIU maintains an overflow buffer extension for each receive queue, where incoming messages are temporarily buffered when the normal software-visible receive queue is full.

Figure: Sending a Basic Message. (The figure shows four stages: 1. initial state; 2. write buffer; 3. flush cache to NES buffer; 4. update producer pointer. Each stage depicts the processor, its cache, the NES's cached buffer and uncached pointers, and the software-maintained pointer copy PC.)
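The four send steps can be lined up as follows in a toy Python model; the flush and memory_barrier calls merely stand in for the clean and sync instructions, and every name here is illustrative rather than the actual NES interface:

```python
class SoftwareState:
    """Processor-side bookkeeping: local copies of the queue pointers."""
    def __init__(self, size):
        self.size, self.producer, self.consumer = size, 0, 0
    def free_slots(self):
        # One slot is notionally reserved to distinguish full from empty.
        return self.size - 1 - (self.producer - self.consumer)

class ToyNIU:
    """Stand-in for the NES; pointers are monotonic, buffer indices wrap."""
    def __init__(self, size):
        self.size, self.buffer, self.consumer = size, [None] * size, 0
    def read_consumer_pointer(self):      # uncached read
        return self.consumer
    def flush(self, start, end):          # clean instructions (or NIU Reclaim)
        pass
    def memory_barrier(self):             # e.g. PowerPC sync
        pass
    def write_producer_pointer(self, p):  # uncached write triggers the launch
        pass

def send_basic_message(niu, sw, cachelines):
    # Step 1: check space; re-read the uncached consumer pointer only
    # when the cached copy suggests the queue is full.
    if sw.free_slots() < len(cachelines):
        sw.consumer = niu.read_consumer_pointer()
        if sw.free_slots() < len(cachelines):
            return False
    # Step 2: compose the message into the next buffer slots.
    for i, line in enumerate(cachelines):
        niu.buffer[(sw.producer + i) % niu.size] = line
    # Step 3: push the modified cachelines out to the NIU buffer.
    niu.flush(sw.producer, sw.producer + len(cachelines))
    niu.memory_barrier()                  # needed under a weak memory model
    # Step 4: the producer-pointer update prompts the NIU to launch.
    sw.producer += len(cachelines)
    niu.write_producer_pointer(sw.producer)
    return True
```

Note how the consumer pointer is read at most once per failed space check, reflecting the aggregation tradeoff discussed earlier.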

The application processor overhead can be reduced by using the NIU Reclaim facility, where the NIU issues clean bus operations to maintain coherence between the processor cache and the NIU buffers. In this case, the pointer update will cause the NIU to reclaim the message and then launch it.

Though the transmit and receive queues are mapped to cached regions, producer and consumer pointers are mapped to uncached regions, to ensure that the most up-to-date copies are seen both by the application and the NIU. To minimize the frequency of reading these pointers from the NIU, software maintains a copy of the producer and consumer pointers (PC in the top half of the figure). The copy of the consumer pointer needs to be updated only when it indicates that the queue is full; space may have freed up since by then. The NIU may move the consumer pointer any time it launches a message, as illustrated in our example.

Message reception by polling is expected to be the common case, although an application can request an interrupt upon message arrival. This choice is available to a receiver on a per-receive-queue basis, or to a sender on a per-message basis. When polling for messages, an application compares the producer and consumer pointers to determine the presence of messages. Messages are read directly from the message buffer region. Coherence maintenance is again needed, so that the application does not read a cached copy of an old message. As before, this can be done either explicitly by the processor, with flush instructions, or by NIU Reclaim.
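Polling reception can be sketched the same way; this toy class folds the NIU and software state together, elides coherence maintenance, and is not the actual interface:

```python
class ToyReceiveQueue:
    """Illustrative polling receiver; all names and fields are invented."""
    def __init__(self, size):
        self.size = size
        self.buffer = [None] * size
        self.niu_producer = 0        # pointer as the NIU sees it
        self.producer = 0            # software's cached copy of it
        self.consumer = 0

    def deliver(self, *cachelines):
        # NIU side: a message arrives and is enqueued.
        for line in cachelines:
            self.buffer[self.niu_producer % self.size] = line
            self.niu_producer += 1

    def poll(self):
        """Compare pointers; return one message's cachelines, or None."""
        if self.producer == self.consumer:
            self.producer = self.niu_producer    # uncached pointer read
            if self.producer == self.consumer:
                return None                      # queue really is empty
        # (flush of stale cachelines elided) read the header, then payload
        length = self.buffer[self.consumer % self.size]["lines"]
        msg = [self.buffer[(self.consumer + i) % self.size]
               for i in range(length)]
        self.consumer += length                  # frees buffer space
        return msg
```

The header-first read mirrors the point made below about message size being unknown until the header is seen.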

A unique aspect of the Basic Message buffer queue is its memory allocation scheme. Buffer space in this queue is allocated at cacheline granularity, and both the producer and consumer pointers are cacheline address pointers. Allocation at smaller granularity is undesirable because of the coherence problem caused by multiple messages sharing a cacheline. The other obvious choice, of allocating maximum-sized buffers, was rejected because it does not work well with either software prefetching of received messages or NIU Reclaim. The main problem is that the size of a message is unknown until the header is read. Therefore, both prefetching and a simple implementation of NIU Reclaim must either first read the header and then decide how many more cachelines of data to read, or blindly read all three cachelines. The former introduces latency, while the latter wastes bandwidth. With cacheline granularity allocation, every cacheline contains useful data, and data that is fetched will either be used for the current message or subsequent ones. Better buffer space utilization is another benefit of this choice.
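The effect of cacheline-granularity allocation can be seen in a couple of lines; the 32-byte cacheline here is only an assumed example size:

```python
CACHELINE = 32  # bytes; an illustrative size, not necessarily the NES's

def cachelines_needed(message_bytes):
    # Round up to whole cachelines: no two messages share a line, and
    # every fetched line holds useful data.
    return -(-message_bytes // CACHELINE)   # ceiling division

def allocate(producer_line, message_bytes):
    """Advance the cacheline-address producer pointer past one message."""
    return producer_line + cachelines_needed(message_bytes)
```

For instance, a 40-byte message consumes two cachelines under this scheme, instead of the three that a maximum-sized buffer would occupy.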

When we designed the Basic message support, we considered both the producer-consumer pointer scheme described above, and a scheme which uses Full/Empty bits in each fixed-size buffer. We adopted the latter scheme in StarT-NG, the predecessor of StarT-Voyager. The tables below compare the bus transaction costs of the two schemes.

A Full/Empty bit scheme is expected to have better performance, because the control functions do not incur additional bus transactions. Under the producer-consumer pointer scheme, the cost of obtaining free transmit buffers can probably be amortized, and is thus insignificant. Similarly, the cost of releasing receive buffers should be insignificant. But this still leaves the scheme at a disadvantage, because of the extra cost of indicating transmission and finding out about received messages, both of which can only be amortized at a cost to end-to-end latency.

We chose the producer-consumer pointer based scheme, despite its performance disadvantage, due to implementation considerations. A later section provides a more detailed comparison of the implementation complexity of the pointer and Full/Empty bit handshake schemes.

Express Message

Messages with a minimal amount of data are common in many applications, for synchronization or for communicating a simple request or reply. Basic Messages, with their cached message buffer space, are a bad match: the bandwidth of burst transfer is not needed, while the overhead of coherence maintenance, weak memory model handling (if applicable), and explicit handshake remains. Express Messages are introduced to cater to such small payloads by utilizing a single uncached access to transfer all the data of a message, thus avoiding these overheads of Basic Messages.

A major challenge of the Express Message design is to maximize the amount of data transported by a message, while keeping each compose-and-launch to a single uncached memory access. The Express Message mechanism packs the transmit queue ID, the message destination, and several bits of data into the address of an uncached write. The NIU automatically transforms the information contained in the address into a message header, and appends the data from the uncached write to form a message. The accompanying figure shows a simplified format for sending and receiving Express Messages.
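In outline, the address encoding might look like the following; the field widths and the word-alignment shift are invented for illustration and are not the NES's actual format:

```python
# Hypothetical field layout for the uncached-write address (low bits first).
DATA_BITS, DEST_BITS, TXQ_BITS = 5, 6, 4

def express_address(base, txq_id, dest, data_bits):
    """Pack the transmit queue ID, destination, and a few data bits into
    the address of a single uncached write; the NIU decodes the address
    into a message header and appends the written data."""
    assert data_bits < (1 << DATA_BITS) and dest < (1 << DEST_BITS)
    offset = (txq_id << (DEST_BITS + DATA_BITS)) | (dest << DATA_BITS) | data_bits
    return base + (offset << 3)          # keep addresses word-aligned

def decode(base, addr):
    """NIU side: recover the fields from the observed bus address."""
    offset = (addr - base) >> 3
    data = offset & ((1 << DATA_BITS) - 1)
    dest = (offset >> DATA_BITS) & ((1 << DEST_BITS) - 1)
    txq = offset >> (DEST_BITS + DATA_BITS)
    return txq, dest, data
```

Each extra encoded bit doubles the address range consumed, which is exactly the virtual/physical address space and TLB pressure discussed next.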

Additional address bits can be used to convey more information, but they consume larger virtual and physical address space, and can also have a detrimental effect on the TLB if the information encoded into the address bits does not exhibit good locality. Alternate translation mechanisms, such as PowerPC's block-address translation, can mitigate this effect.

Transmit:
T1. Read transmit queue consumer pointer: 1 bus transaction (aggregatable).
T2. Write message content: 1 to 2 bus transactions per cacheline. Only 1 bus transaction is needed to move data from the processor cache to the NIU for each cacheline if it is already present in the cache with write permission; otherwise, a cache miss or ownership acquisition incurs another bus transaction.
T3. Memory barrier operation: 1 bus transaction (aggregatable).
T4. Write transmit queue producer pointer: 1 bus transaction (aggregatable).
Total (n-cacheline message, not using Reclaim): between n and 2n data transactions, plus the amortized control transactions above, where b is the average number of buffers obtained at each read of the transmit queue consumer pointer, and s is the average number of buffers sent at each write of the transmit queue producer pointer.

Receive:
R1. Read receive queue producer pointer: 1 bus transaction (aggregatable).
R2. Read message content: 1 bus transaction per cacheline.
R3. Removal of stale data: 1 bus transaction per cacheline.
R4. Memory barrier operation: 1 bus transaction (aggregatable).
R5. Write receive queue consumer pointer: 1 bus transaction (aggregatable).
Total (n-cacheline message): 2n data transactions, plus the amortized control transactions above, where r is the average number of buffers obtained at each read of the receive queue producer pointer, and f is the average number of buffers freed at each write of the receive queue consumer pointer.

Table: Bus transaction cost of Basic Message support, assuming NIU Reclaim is not used. With NIU Reclaim, the receive case incurs an additional bus transaction for each cacheline of data. NIU Reclaim for the transmit case does not incur any additional bus transaction if the snoopy bus protocol supports cache-to-cache transfer; otherwise, it also incurs an additional bus transaction for each cacheline of data.

Transmit:
  T1. Read next transmit buffer's Full/Empty bit: 1 bus transaction.
  T2. Write message content: 0 additional bus transactions to move data from processor cache to NIU for the cacheline which contains the Full/Empty bit, assuming that the read in T1 also obtained write permission. 1 to 2 bus transactions for each additional cacheline: only 1 bus transaction to move data from processor cache to NIU for each cacheline if it is already present in the cache with write permission; otherwise a cache miss or ownership acquisition incurs another bus transaction.
  T3. Memory barrier operation: 1 bus transaction.
  T4. Write Full/Empty bit: 0 additional bus transactions.
  Total (n-cacheline message, not using Reclaim): n + 1 to 2n.

Receive:
  R1. Read next receive buffer's Full/Empty bit: 1 bus transaction.
  R2. Read message content: 0 additional bus transactions for the cacheline which contains the Full/Empty bit; 1 bus transaction for each additional cacheline.
  R3. Removal of stale data: 1 bus transaction for each cacheline after the first one.
  R4. Memory barrier operation: 1 bus transaction.
  R5. Write Full/Empty bit: 0 additional bus transactions, assuming the earlier read of the cacheline has obtained write permission; 1 bus transaction accounts for the NIU reading this bit.
  Total (n-cacheline message, not using Reclaim): 2n + 1.

Table: Bus transaction cost of a Full/Empty-bit-based scheme, assuming software is responsible for coherence maintenance. If something analogous to NIU Reclaim is used, each transmitted or received cacheline incurs an additional bus transaction.

Figure: Express Message Formats. The figure shows the transmit (Tx) and receive (Rx) formats: a 32-bit address containing fixed fields indicating the queue number and priority, the logical destination (Tx) or logical source (Rx), and general payload bits, together with 64 bits of data payload. The Arctic packet format is also shown, with fields for interrupt-on-arrival, priority, up and down routes, logical source and destination, node number, MsgOp, receive queue ID, and length.

mechanism, may be employed to mitigate this problem, but this depends on both processor architecture and OS support. Producer and consumer pointers are still required for transmit queues, as a mechanism for software to find out the amount of buffer space available in the transmit queues.
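As a rough illustration of the compose-and-launch path described above, the sketch below packs a queue ID, a logical destination, and a few payload bits into the address of a single uncached 64-bit store. The region base, field positions, and widths here are assumptions for illustration only; the actual StarT-Voyager encoding differs.

```c
#include <stdint.h>

/* Assumed base of the uncached Express Message transmit region. */
#define EXPRESS_REGION_BASE 0xF0000000u

/* Pack control information into the address of an uncached write.
   Field positions and widths are hypothetical. */
static inline uint32_t express_tx_addr(uint32_t queue_id,
                                       uint32_t logical_dest,
                                       uint32_t extra_bits)
{
    return EXPRESS_REGION_BASE
         | (queue_id     << 20)   /* selects the transmit queue   */
         | (logical_dest << 8)    /* names the destination node   */
         | (extra_bits   & 0xFFu);/* a few payload bits "for free" */
}

/* One uncached store composes and launches the whole message: the NIU
   turns the address bits into a packet header and appends the 64-bit
   store data as the message body. */
static inline void express_send(uint32_t queue_id, uint32_t dest,
                                uint32_t extra, uint64_t payload)
{
    volatile uint64_t *slot =
        (volatile uint64_t *)(uintptr_t)express_tx_addr(queue_id, dest, extra);
    *slot = payload;
}
```

The single store is what keeps the per-message overhead to one bus transaction, at the cost of consuming address space proportional to the encodable field widths.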

To reduce the data read by a receive handler, the NIU reformats a received Express Message packet into a 64-bit value, as illustrated in the figure. Unlike Express Message transmits, address bits cannot be used to convey message data to the processor when receiving an Express Message. With processors which support a doubleword read into contiguous GPRs, an Express Message receive is accomplished with a single 64-bit uncached read. For processor families without this support but with 64-bit floating-point registers, an Express Message receive can be accomplished with a 64-bit uncached read into an FPR, with the data subsequently moved into GPRs. Alternatively, two 32-bit loads into GPRs can be issued to receive the message.

Unlike Basic Messages, the addresses for accessing Express Message queues do not specify particular queue entries. Instead, the NIU provides a FIFO push/pop interface to transmit and receive Express Messages. Because of this side effect, speculative loads from Express Message receive regions are disabled by setting the page attributes

Transmit:
  T1. Indicate transmit and content: 1 bus transaction.

Receive:
  R1. Poll received message: 2 bus transactions if using 32-bit uncached reads; 1 bus transaction if using a 64-bit uncached read.

Table: Bus transaction cost of Express Message transmit and receive.

appropriately. When an application attempts to receive a message from an empty receive queue, a special Empty Express Message, whose content is programmable by system code, is returned. If message handler information is encoded in the message, such as in Active Messages, the Empty Message can be treated as a legitimate message with a no-action message handler.

The Express Message interface is implemented as a veneer on top of circular buffer queues controlled with producer-consumer pointers. The NIU automatically increments these pointers in response to read/write bus transactions. Software can actually access this lower-level interface directly, in the same way as for Basic Messages, except that several messages now pack into each cacheline. When there are a number of Express Messages to send, e.g., when multicasting to many destinations, it is cheaper to use this lower-level interface, since each cacheline bus transaction carries several messages, instead of one uncached write being needed per message. In systems with larger cacheline sizes this advantage is even greater.

Although using the lower-level interface can be useful on the receive end too, software has to guess whether there are going to be many messages for it to receive, and then select the interface accordingly. This can be difficult. The NIU, which knows the number of messages waiting to be processed in each Express RxQ, can help by making this decision for software; this feature is not in the current version of the StarT-Voyager NES described in the next chapter. When an RxQ has a large number of messages pending, the NIU responds to a poll not with the next message but with a special message that includes the identity of the Express RxQ and the current pointer values; such a message is already used for OnePoll on Basic Messages. Software can then switch to using the lower-level interface to receive Express Messages. The advantage of using the lower-level interface for receives is even greater than for transmits, since on StarT-Voyager we use two 32-bit uncached reads, plus a possible SYNC, to receive each Express Message. Using the lower-level interface, two cachelines worth of messages can be received with a handful of bus transactions, compared with two or more uncached reads per message through the usual interface.

DMA Transfer

DMA support provides an efficient mechanism for moving contiguously located data from the memory on one node to that on another. It resembles a traditional DMA facility in that an aP can unilaterally achieve the movement without the remote aP's involvement. This is in contrast to data movement effected through normal message passing, where matching request and reply messages are needed. The DMA facility can be thought of as a remote memory-get or memory-put operation. DMA requests are specified with virtual addresses, allowing zero-copy message passing to be implemented. When global shared memory address space is involved, the DMA guarantees local coherence but not global coherence. This means that the data that is read or written is coherent with any cached copies in the source and destination nodes respectively, but not necessarily with cached copies elsewhere in the system. Kubiatowicz provides arguments for this model of DMA in shared memory machines.

The DMA facility is designed to be lightweight so that it can be profitably employed for relatively small transfers. To reduce per-transfer aP overhead, the design decouples page pinning from the transfer request. The cost of the former can be amortized over several transfers if traditional system calls are used to pin pages. Alternatively, system software can be designed so that the aP Virtual Memory Manager cooperates with the sP, allowing the latter to effect any necessary page pinning. User code initiates DMA transfers through an interface similar to a Basic Message transmit queue. The transfer direction, logical source and destination, source data virtual address, destination data virtual address, and length are specified in a DMA request message to the local sP. This sP, together with the other sP involved in the transfer, performs the necessary translation and protection checks before setting up DMA hardware to carry out the transfer. An option for messages to be delivered after DMA completion, e.g., to notify the receiver or the sender, is also available.
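The request interface described above might be sketched as a descriptor like the following; the field names, widths, and layout are hypothetical, not the actual NES request format.

```c
#include <stdint.h>
#include <string.h>

/* A DMA request names the transfer entirely in virtual addresses; the
   sPs perform translation and protection checks before programming the
   DMA hardware. Layout here is illustrative only. */
enum dma_dir { DMA_GET = 0, DMA_PUT = 1 };

struct dma_request {
    uint8_t  direction;    /* remote memory get or put            */
    uint16_t logical_src;  /* logical source node                 */
    uint16_t logical_dest; /* logical destination node            */
    uint64_t src_vaddr;    /* source data virtual address         */
    uint64_t dst_vaddr;    /* destination data virtual address    */
    uint32_t length;       /* bytes to transfer                   */
    uint8_t  notify;       /* deliver a message after completion? */
};

/* Compose a request into a transmit-queue buffer slot; on the real
   hardware this is a write into the Basic-Message-like request queue,
   modeled here as a plain copy. */
static void compose_dma_request(void *txq_slot, const struct dma_request *r)
{
    memcpy(txq_slot, r, sizeof *r);
}
```

Decoupling pinning from this per-request path is what lets the descriptor stay small enough to post with a few stores.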

TagOn Message

The Express-TagOn Message mechanism extends the Express Message mechanism to allow additional data located in special NIU memory to be appended to an outgoing message. The Express-TagOn Message mechanism was designed to eliminate a copy when message data is already in NIU memory. It is especially useful for implementing coherent shared memory protocols and for multicasting medium-sized messages. As composed by an application, an Express-TagOn Message looks similar to an Express Message, with the addition that several previously unused address bits now specify the NIU memory location where the additional bytes of message data can be found. For protection, and to reduce the number of bits used to specify this location, the address used is an offset from a TagOn base address. This base address can be different for each transmit queue and is programmed by system code.

At the destination NIU, an Express-TagOn Message is partitioned into two parts that are placed into two separate queues. The first part is its header, which is delivered like an Express Message, via a queue that appears to be a hardware FIFO. The second part, made up of the data that is tagged on, is placed in a separate buffer, similar to a Basic Message receive queue, which utilizes explicit buffer deallocation.

Express-TagOn Messages have the advantage of decoupling the header from the message data, allowing them to be located in non-contiguous addresses. This is useful in coherence protocols when shipping a cacheline of data from one site to another. Suppose the sP is responding to another site's request for data. This is achieved by the sP first issuing a command to move the data from aP DRAM into NIU memory, followed by an Express-TagOn Message that ships the data to the requester. In addition to the cacheline of data, an Express-TagOn Message inherits the 64-bit payload of Express Messages, which can be used in this instance to identify the message type and the address of the cacheline that is shipped. Cacheline data may also be brought into the NIU without the sP asking for it; for example, the aP's cache may initiate a writeback of dirty data. In such cases, the Express-TagOn Message's ability to decouple message header and data allows the data to be shipped out without further copying.

Express-TagOn Messages are also useful for multicasting. To multicast some data, an application first moves it into NIU memory. Once that is done, the application can send it to multiple destinations very cheaply, using an Express-TagOn Message for each one. Thus, data is moved over the system memory bus only once at the source site, and the incremental cost for each destination is an uncached write to indicate an Express-TagOn Message.

Basic-TagOn extends Basic Messages in a way similar to Express-TagOn. It differs in that the non-TagOn part of the message can contain a variable amount of data. Furthermore, when a Basic-TagOn Message is received, it is not separated into two parts but is instead placed into one receive queue, just like an ordinary Basic Message. Basic-TagOn offers advantages similar to Express-TagOn at the transmit end: separation of the message body from the TagOn data, permitting more efficient multicast.

Basic-TagOn was added fairly late in the design, mostly to make the support uniform: TagOn is an orthogonal option available with both Basic and Express Messages.

OnePoll

To minimize the overhead of polling multiple receive queues, StarT-Voyager introduces a novel mechanism called OnePoll, which allows one polling action to poll simultaneously from a number of Express Message receive queues as well as Basic Message receive queues. A single uncached read specifies, within some of its address bits, the queues from which to poll. The result of the read is the highest priority Express Message. If the highest priority non-empty queue is a Basic Message queue, a special Express Message that includes the Basic Message queue name and its queue pointers is returned. If there are no messages in any of the polled queues, a special Empty Express Message is returned.

OnePoll is useful to user applications, most of which are expected to have four receive queues: Basic and Express/Express-TagOn, each with two priorities. The sP has nine queues to poll; clearly, the OnePoll mechanism dramatically reduces the sP's polling costs.
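A receive loop using OnePoll might decode the returned 64-bit value as sketched below. The tag values and bit position are assumptions for illustration, since the actual encoding is defined by system code on the NES.

```c
#include <stdint.h>

/* Hypothetical tag encoding in the top bits of the 64-bit poll result:
   a real Express Message, a redirect naming a non-empty Basic queue
   (with packed queue pointers), or the programmable Empty message. */
#define ONEPOLL_TAG_SHIFT 61
#define TAG_EXPRESS 0u
#define TAG_BASIC   1u
#define TAG_EMPTY   2u

static inline unsigned onepoll_kind(uint64_t v)
{
    return (unsigned)(v >> ONEPOLL_TAG_SHIFT);
}

/* Dispatch one poll result: Empty acts as a no-op handler, matching the
   "legitimate message with a no-action handler" usage in the text. */
static void handle_onepoll(uint64_t v,
                           void (*on_express)(uint64_t),
                           void (*on_basic_queue)(uint64_t))
{
    switch (onepoll_kind(v)) {
    case TAG_EXPRESS: on_express(v);     break;
    case TAG_BASIC:   on_basic_queue(v); break; /* switch to Basic recv */
    default:          /* TAG_EMPTY: nothing to do */ break;
    }
}
```

One uncached read plus this cheap decode replaces a separate poll of each of the nine sP queues.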

Implications of Handshake Alternatives

The Full/Empty bit scheme requires the control portion of the NIU to poll and write locations in the message buffers. With message queue buffer usage occurring in FIFO order, the NIU knows which buffer in each queue to poll next. This section considers several implementations; invariably, each is more complex than those for the Basic Message mechanism.

First consider a simple implementation where the NIU provides message buffer SRAM, and the control portion of the NIU reads the next buffer's Full/Empty bit field from this SRAM. The NIU control also writes to the SRAM to set or clear Full/Empty bits. Both actions consume SRAM memory port bandwidth, a contention problem that increases in severity if the same SRAM port is used for other functions, such as buffering received messages. The polling overhead also increases as the number of queues supported in the NIU increases. Even without considering contention, the time to perform this polling increases with the number of message queues, as the number of message buffer SRAM ports is unlikely to increase at the same rate.

Clearly, to implement the Full/Empty scheme efficiently, the NIU has to be smarter and poll only when necessary. This requires snooping on writes to message buffer space and polling only after writes have occurred. If NIU Reclaim is supported, the snooping will only reveal an attempt to write, due to acquisition of cacheline ownership; the NIU should then Reclaim the cacheline after a suitable delay. The design also has to deal with the situation where the NIU is unable to transmit as fast as messages are composed. So that it does not drop any needed poll, the NIU has to keep track of the message queues with pending polls, probably in a condensed form. A design that blocks write or write-ownership acquisition until the NIU can transmit enough messages to free up hardware resources to track new message transmit requests is unacceptable, as it can cause deadlocks.

The contention effect of the NIU reading Full/Empty bits can be circumvented by duplicating the slice of the message buffer SRAM where the Full/Empty information is kept, devoting the copy to NIU Full/Empty bit polling. Using the same idea to reduce the contention effect of the NIU writing Full/Empty bits requires slightly more complex data paths. Because the system bus width is unlikely to be the same as the message buffer width, certain bits of the data bus take data either from the Full/Empty SRAM or the normal message data SRAM, depending on which part of the message is being driven onto the bus; a mux is thus needed. Making this scheme work for variable-sized message buffers further increases complexity. The most likely design in that case is to store not only the value of the Full/Empty bit but also a bit that determines whether that location is currently used for that purpose.

If we constrain software to modify the Full/Empty bits in FIFO order and assume fixed-size message buffers, the above design can be further simplified. The NIU need not poll on the Full/Empty bits. Instead, it simply snoops on the data portion of a bus transaction, in addition to the address and control parts. When it sees a set Full/Empty field in a bus transaction to a region associated with a transmit queue, it goes ahead and increments that transmit queue's producer pointer. Similarly, a cleared Full/Empty bit in a bus transaction associated with a receive queue triggers an increment of that queue's consumer pointer. The constraint that the Full/Empty bits of message queues have to be set or cleared in FIFO order is unlikely to be a problem if each message queue is used by only one thread. If several threads share a message queue, it may incur additional coordination overhead.
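Under the FIFO-order constraint, the NIU-side bookkeeping reduces to something like the following sketch; the structure names and the Full/Empty bit position are assumptions for illustration.

```c
#include <stdint.h>

/* The NIU snoops the DATA of a bus write: a set Full/Empty field in a
   transmit-queue region bumps the producer pointer, and a cleared one
   in a receive-queue region bumps the consumer pointer. */
struct fifo_q { uint32_t prod, cons; };

#define FE_BIT(data) ((unsigned)((data) >> 63)) /* assumed bit position */

static void snoop_write(struct fifo_q *txq, struct fifo_q *rxq,
                        int is_tx_region, uint64_t bus_data)
{
    if (is_tx_region && FE_BIT(bus_data) == 1u)
        txq->prod++;   /* message composed: buffer now Full  */
    else if (!is_tx_region && FE_BIT(bus_data) == 0u)
        rxq->cons++;   /* handler finished: buffer now Empty */
}
```

No SRAM polling is needed at all; the pointers advance purely as a side effect of observed bus writes.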

Although the Full/Empty bit design is well understood, the implementation constraints we faced in StarT-Voyager make it infeasible. Because we are using off-the-shelf dual-ported SRAMs for message buffers, with one port of the SRAM directly connected to the SMP system bus, we are unable to use the duplicated memory slice idea to remove the overhead of the NIU writing Full/Empty bits. Even duplicating the slice to remove NIU Full/Empty read contention was not seriously considered, because of concerns about capacitance loading on the SMP system bus. We would have picked the Full/Empty bit design had we been implementing the NIU in an ASIC, with much more flexibility over the datapath organization.

The implementation of the producer-consumer pointer scheme is much simpler. One reason is that the data and control information are clearly separated. In fact, software updates message queue pointers by providing the new value of the pointer in the address portion of a memory operation. We also allow both read and write operations to update the pointers, with the read operation returning packed pointer information from the NIU.

The separation of data and control makes it feasible to implement producer and consumer pointers in registers located with the control logic. By comparing these pointers, the control logic determines whether the queue is full or empty and whether the NIU needs to carry out any action.
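The full/empty test by pointer comparison can be sketched as follows. The queue depth and the wrap-modulo-2N convention are illustrative assumptions: letting the free-running pointers count over twice the queue size is one common way to distinguish full from empty without an extra flag.

```c
#include <stdint.h>

#define QSIZE 16u  /* assumed queue depth, a power of two */

/* Pointers advance freely and are compared modulo 2*QSIZE, so that
   prod == cons means empty while a difference of QSIZE means full. */
static inline unsigned q_empty(uint32_t prod, uint32_t cons)
{
    return prod == cons;
}

static inline unsigned q_full(uint32_t prod, uint32_t cons)
{
    return (prod - cons) % (2u * QSIZE) == QSIZE;
}

static inline unsigned q_count(uint32_t prod, uint32_t cons)
{
    return (prod - cons) % (2u * QSIZE);
}
```

Because both pointers live in registers next to the control logic, these comparisons cost nothing on the memory ports, unlike Full/Empty bits stored in buffer SRAM.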

Comparison with Coherent Network Interfaces

Coherent Network Interfaces (CNIs) use a combination of consumer pointers and Full/Empty bits to exchange control information between processor software and the NIU. The pointers are placed in cached address locations to take advantage of expected long write run-lengths.

To avoid explicitly setting the Full/Empty bits to empty, CNI adds a clever idea: sense reverse on the Full/Empty bits. The meaning of a value in the Full/Empty bit field changes as the usage of the queue reaches the end of the linear buffer address region and wraps around to the beginning. Thus, whereas on one pass a given value indicates Full buffers, the same value indicates Empty buffers on the next pass. Using this scheme requires fixed-size message buffers and a linear buffer region that is a multiple of that fixed size. The hardware complexity of implementing CNI is around that of the generic Full/Empty bit scheme. While it dispenses with the NIU clearing transmit queue Full/Empty bits, it adds the need to maintain full cache-coherence on the producer pointers. [Footnote: Capacitance loading on the SMP system bus was a serious concern earlier in the design, when we were targeting a higher clock rate; with our final, lower clock rate it should not be an issue.]
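Under assumed structure and field names (a sketch, not CNI's actual implementation), the sense-reverse bookkeeping looks like this:

```c
#include <stdint.h>

/* Instead of clearing Full/Empty bits after use, the meaning of the
   stored bit flips each time usage wraps around the linear buffer
   region of fixed-size buffers. */
struct sr_queue {
    unsigned head;   /* next buffer index to examine        */
    unsigned nbufs;  /* fixed-size buffers in the region    */
    unsigned sense;  /* current value that means "Full"     */
};

/* fe[i] holds the Full/Empty bit as last written by the producer. */
static int sr_buffer_full(const struct sr_queue *q, const unsigned *fe)
{
    return fe[q->head] == q->sense;
}

static void sr_advance(struct sr_queue *q)
{
    if (++q->head == q->nbufs) { /* wrapped: flip the sense */
        q->head = 0;
        q->sense ^= 1u;
    }
}
```

The consumer never writes the bit back, which is exactly what removes the explicit "set to Empty" transactions, at the cost of requiring a buffer region that is a whole multiple of the fixed buffer size.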

The bus transaction costs of using CNI are shown in the table below. They actually look very similar to those of the generic Full/Empty bit scheme. A definitive comparison of CNI with the Full/Empty bit scheme can only be done for a specific system, taking into account a number of details, including the processor, memory system details, and the actual NIU implementation. For instance, under CNI, software does not read the Full/Empty bit to determine whether a transmit buffer is available, so round-trip read latency is avoided. However, software using a generic Full/Empty bit scheme can hide this cost by prefetching the next buffer's Full/Empty bit when it starts to use the current transmit buffer. CNI's use of a cached receive queue consumer pointer is a good idea: under the reasonable assumption that the receive queue does not run out of space very frequently, it is better than using an uncached consumer pointer or using a Full/Empty bit to indicate release of receive queue buffers.

Application Interface Layer: Shared Memory

The shared memory interface implements both CC-NUMA and S-COMA support. The NIU's behavior on the system bus differs in the two cases. It is the slave for bus transactions generated by CC-NUMA cache misses and has to carry out actions at remote sites to obtain data or ownership. In response to requests from other nodes, the NIU must be able to behave as a proxy bus master, fetching data from local main memory or forcing it out of local caches. The NIU also has to keep directory information for those cachelines for which it is the home site.

In clusters with multiprocessor SMP nodes, a CC-NUMA implementation presents some anomaly if the host SMP does not support cache-to-cache data transfer of both clean and dirty data. Most snoopy bus protocols do not require, nor allow, a cache to respond to a read transaction on the bus if it has the requested data in clean state. Thus, it is possible that a remotely fetched cacheline is present in processor A of an SMP, but processor B encounters a cache miss on the same cacheline, which has to be serviced by the SMP's NIU. For simplicity of the distributed CC-DSM protocol,

Transmit:
  T1. Read transmit queue consumer pointer: 1 bus transaction, aggregatable.
  T2. Write message content: 1 to 2 bus transactions each cacheline. Only 1 bus transaction to move data from processor cache to NIU for each cacheline if it is already present in the cache with write permission; otherwise a cache miss or ownership acquisition incurs another bus transaction.
  T3. Memory barrier operation: 1 bus transaction.
  T4. Write Full/Empty bit: 0 additional bus transactions.
  Total (n-cacheline message, not using Reclaim): n + 1 + 1/b to 2n + 1 + 1/b, where b is the average number of buffers obtained at each read of the transmit queue consumer pointer.

Receive:
  R1. Read next receive buffer's Full/Empty bit: 1 bus transaction.
  R2. Read message content: 1 bus transaction each cacheline.
  R3. Removal of stale data: 1 bus transaction each cacheline.
  R4. Memory barrier operation: 1 bus transaction.
  R5. Write receive queue consumer pointer: 0 bus transactions if it hits in the cache; 2 bus transactions each time the NIU actually reads the consumer pointer.
  Total (n-cacheline message, not using Reclaim): 2n + 2 + 2/h, where h is the average number of times software writes the consumer pointer before the NIU reads it.

Table: Bus transaction cost of a CNI-style scheme, assuming software is responsible for coherence maintenance for receive queues. If something analogous to NIU Reclaim is used on the receive queues, each received cacheline incurs an additional bus transaction.

the NIU should implement a cache of remotely fetched data, so that processor B's cache miss can be serviced with data from this cache. This may be implemented with NIU firmware.

NIU support for S-COMA keeps coherence state information for cachelines in local main memory and snoops on bus transactions addressing these regions. Cases where further actions at remote nodes are needed use infrastructure similar to that for implementing CC-NUMA support. A further requirement is the ability to translate addresses between local main memory addresses and global shared memory addresses. NIU S-COMA support should include hardware snooping, so that the sP is not involved when the current cacheline state permits the bus transaction to complete; otherwise, the advantage of S-COMA over CC-NUMA is badly eroded.

The shared memory interface includes provision for passing hints from application code to the NIU. This can be implemented fairly simply by using a memory-mapped interface to addresses interpreted by the sP.

The problem of designing and implementing correct cache-coherence protocols is still a major research area today. Deadlock avoidance aside, implementing a coherence protocol that actually meets the specifications of its memory model is also a difficult problem. Part of the problem is that many memory models are motivated by implementation convenience and specified in an operational manner that is often imprecise and incomplete. It is beyond the scope of this thesis to delve into the problems of coherence protocol design, implementation, and verification. Instead, we target an NIU design which leaves the details of the coherence protocol programmable. This both decouples the final details of coherence protocols from NIU hardware design and implementation, and permits future experimentation.

Providing the ability to program the coherence protocol takes care of logical errors. In addition, hardwired low-level resource sharing policies in our NIU design must not cause deadlock. This problem is actually protocol dependent: whether a certain partitioning of resources is an adequate safeguard against resource-sharing-induced deadlock for a protocol depends on the dependence chains that can arise under that protocol. One solution is to restrict the types of dependence chains that a protocol can create and constrain protocols to be written in a way that respects these limitations. This approach is adopted for PP code in FLASH's MAGIC chip and is applicable to our design. A second possibility, which our design also supports, is to provide the capability to extend the number of resource pools through firmware.

Support for Interface Extensions

The NIU firmware-programmable core has several capabilities that enable extending the application communication interface. These capabilities, described below, are in addition to a basic instruction set encompassing general integer and control instructions, and access to a reasonable amount of memory.

The NIU firmware observes the system bus to detect both explicit and implicit initiation of communication. This is achieved through bus slave and bus snooper capabilities. The NIU firmware is the slave for a static region of physical address space: write bus transactions to this address space are forwarded to NIU firmware, while read bus transactions are supplied data by the NIU firmware.

The NIU also snoops on a static region of main memory. NIU firmware can specify, at cacheline granularity, the types of bus transactions it is interested in observing and those in which it wants to intervene. In the latter case, the bus transaction is not allowed to complete until NIU firmware gives its approval and, optionally, provides data to read-like transactions.

NIU firmware is able to initiate any bus transaction to an arbitrary physical address. If data is transferred, the NIU specifies the memory location on the NIU to transfer the data to or from, including regions in the transmit and receive queues. This, together with NIU firmware's ability to directly read and write these NIU memory locations, enables NIU firmware to indirectly read and modify the SMP's main memory locations.

General message passing capability is available to NIU firmware. A TagOn-like capability makes it easy for NIU firmware to send out data that is in NIU SRAM. The message passing capability also allows physical addressing of the message destination; in this way, NIU firmware can implement destination translation.

NIU firmware has sufficient capability to generate network packets of arbitrary format. Thus, the firmware message passing capability not only enables NIU firmware on different nodes to communicate, but also allows NIU firmware to generate network packets that are processed completely by NIU hardware at the destination. The NIU firmware can also intercept selected incoming packets and take over their processing. This selection is based on a field in the packet header, normally inserted during destination translation at the message source.

All the NIU firmware interface extension capabilities are composable. They function as an instruction set that NIU firmware uses to deliver the functions of new communication interfaces. Because NIU firmware is limited to one or a small number of threads but is multiplexed between many different functions, it is important that its execution is never blocked at the hardware level. Instead, all blockage must be visible to the firmware, so that it can suspend processing the affected request and switch to processing other requests; the latter may be necessary to clear up the blockage.

Chapter

StarT-Voyager NES Microarchitecture

This chapter describes the StarT-Voyager Network Endpoint Subsystem (NES), an NIU that connects IBM RISC System/6000 SMPs to the Arctic network. The IBM RISC System/6000 SMP used, also called Doral, had been introduced only shortly before this work. With two processor card slots and two PCI I/O buses, it is a desktop-class machine marketed as an engineering graphics workstation. Each processor card contains a PowerPC 604e and a 512 kByte in-line L2 cache. The system bus conforms to the 60X bus protocol and runs at 66 MHz in the original system. In StarT-Voyager, we replace one of the processor cards with our NES. The figure below shows both the original Doral SMP and the one used in the StarT-Voyager system.

We would have preferred to use larger SMPs, but real-life constraints concerning access to technical information, system cost, and timely availability of the SMP led to our choice. An earlier iteration of this project, StarT-NG, targeted an eight-processor SMP, based on a PowerPC microprocessor, that was being developed by Bull. Unfortunately, that PowerPC microprocessor was never fully debugged or sold commercially; naturally, the SMP we were targeting did not materialize either.

The demise of that project taught us a few lessons. In StarT-NG, strong emphasis was placed on the absolute performance of the resulting system. To meet this goal,

Figure: The top of the diagram shows an original IBM Doral SMP; the bottom shows a Doral used in the StarT-Voyager system, with a processor card replaced by the StarT-Voyager NES. Each processor card holds a PowerPC 604e microprocessor with 32 kByte instruction and data caches and a 512 kByte level-2 cache. The system bus uses the 60X protocol (64 bits data, 32 bits address, 66 MHz); the memory interface is 64 bits at 50 MHz, and the PCI buses are 32 bits at 33 MHz. The NES card contains the sP subsystem (a 604 and its memory system) and the NES core (custom logic, SRAM, FIFOs, and TTL-to-PECL conversion) connecting to the Arctic network.

we targeted projected commercial systems that were still in very early stages of development. This was to ensure that the completed cluster system would be available around the same time as the host SMP, rather than much later as a late system using previous-generation SMPs. This choice greatly increased the risk of the chosen host SMP system not materializing, an unnecessary risk for architecture research. Although the absolute clock speed of StarT-Voyager is lower than that of contemporary systems, both its architecture, at the microprocessor and SMP system levels, and its relative clock ratios are highly similar to today's systems. Our research results are therefore directly applicable. For example, the newest PowerPC microprocessor

are therefore directly applicable For example the newest PowerPC micropro cessor

from IBM the announced in Septemb er has a microarchitecture almost

identical to that of the e the only dierences b eing the presence of a backside L

cache interface on the and a much higher pro cessor core clo ck rate of MHz

Once we were convinced that absolute clo ck sp eed was no longer a high priority for

our pro ject we decided to implement the custom p ortions of the NES with FPGAs

Field Programmable Gate Arrays from Xilinx and an LPGA Laser Programmable

Gate Array from ChipExpress This reduces b oth the manp ower and nancial costs

while improving the architecture research p otential of the system as FPGA allows

relatively easy hardware design mo dications after the NES is constructed It also

reduces the risk of the pro ject Although we had access to IBM technical manuals

of the devices on the system bus we were unable to get simulation mo dels from

IBM for verifying our design Instead our hardware design is veried in a simulation

system that uses device mo dels we wrote based on our reading of the manuals Using

FPGAs reduces the risk from misinterpreting the manuals as changes can b e made

during bringup if such problems are detected The price for using FPGA and LPGA

technologies is a lower NES and memory bus clo ck frequency of MHz

Host SMP Memory System Characteristics

The PowerPC 604e implements a weak memory model, where memory accesses are not guaranteed to appear in program order to external devices. When such ordering is important, it is necessary to insert SYNC instructions. Each SYNC instruction

    State          Read okay   Write okay   Dirty
    Modified (M)   Yes         Yes          Yes
    Exclusive (E)  Yes         Yes          No
    Shared (S)     Yes         No           No
    Invalid (I)    No          No           No

Table: Semantics of the four cacheline states in the MESI coherence protocol.

ensures that memory access instructions before it are completed and visible to external devices before those after it.

The 60X bus protocol uses the MESI cache coherence protocol, an invalidation-based protocol in which each cache maintains a state for each cacheline, representing one of four states: (i) M, Modified; (ii) E, Exclusive; (iii) S, Shared; and (iv) I, Invalid. Table describes the semantics of these states. The PowerPC processor family provides the Load-reserve/Conditional-store (LR/SC) pair of instructions as its atomicity primitives.
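The state semantics in the table above can be captured as a small lookup. The following Python sketch (the names are ours, not from the NES implementation) models the read/write/dirty properties of each MESI state and the check a cache would perform before satisfying an access locally:

```python
# A minimal model of the MESI cacheline states described in the table.
# State -> (read okay, write okay, dirty); encoding is illustrative only.
MESI = {
    "M": (True,  True,  True),   # Modified: local copy is valid and dirty
    "E": (True,  True,  False),  # Exclusive: valid, clean, sole copy
    "S": (True,  False, False),  # Shared: valid, clean, read-only
    "I": (False, False, False),  # Invalid: no usable copy
}

def can_satisfy_locally(state, access):
    """Return True if a cache holding `state` can service `access`
    ('read' or 'write') without a bus transaction."""
    read_ok, write_ok, _dirty = MESI[state]
    return read_ok if access == "read" else write_ok

def needs_writeback(state):
    """A snoop hit on a dirty line forces a push-out to memory first,
    since the 60X protocol has no general cache-to-cache transfer."""
    return MESI[state][2]
```

A write to a Shared line, for example, misses locally and must first gain exclusive ownership through an invalidating bus transaction.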

The 60X bus protocol does not support cache-to-cache transfers; i.e., data is not normally supplied from one cache to another, even if one cache has a dirty copy of a cacheline that another cache wants to read. Instead, the dirty cacheline is first written back to main memory before it is read out again. The Doral system provides an exception to this: it can accommodate a look-aside cache which can intervene and act like main memory, supplying data to read transactions and accepting data from write transactions. This limited form of cache-to-cache transfer capability is utilized by the NES.

Arctic Network Characteristics

The Arctic network is a high-performance, packet-switched Fat-Tree network. It supports variable packet sizes, in fixed increments between a minimum and a maximum size; this size includes the packet header and the CRC. Packet delivery is reliable, i.e., lossless, and in-order for packets using the same up-route through the Fat-Tree. Except for nearest-neighbor nodes, multiple paths through the network exist

between each source-destination node pair. Each link of the Arctic network delivers a

[Figure: a high-level view of the StarT-Voyager NES, showing the sP subsystem and the NES Core between the 60X-protocol system bus and the Arctic network, with a hardware filter on the bus side and another on the network side of the main data path.]

Figure: A high-level view of the StarT-Voyager Network Endpoint Subsystem (NES). The hardware filters determine whether the sP should be involved in processing a particular event. In most cases, NES Core hardware handles the events completely, ensuring high performance. The sP handles corner cases and provides extendibility.

bandwidth of MBytes per second. Two links, an in link and an out link, connect to each SMP in the StarT-Voyager system.

Arctic distinguishes between two priorities of packets, high and low, with high-priority packets having precedence over low-priority ones when routing through a switch. Furthermore, switch buffer allocation is done in such a way that the last buffer is always reserved for high-priority packets. Effectively, this means that high-priority packets can always get through the network even if low-priority ones are blocked. The converse is, however, not true. Strictly speaking, therefore, Arctic does not support two fully independent networks, but the design is adequate for request-reply style network usage, where one network (reply/high) must remain unclogged even when the other (request/low) is blocked, but not vice versa.
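The buffer-reservation rule can be illustrated with a toy allocator: a low-priority packet may claim a free buffer only if at least one buffer would remain for high-priority traffic. The class below is our own illustration, not Arctic's actual switch logic:

```python
class SwitchBuffers:
    """Toy model of Arctic's switch buffer allocation: the last free
    buffer is reserved for high-priority packets, so high-priority
    traffic can make progress even when low-priority traffic is blocked."""

    def __init__(self, total):
        self.free = total  # number of free packet buffers

    def try_alloc(self, priority):
        """Attempt to claim a buffer for a packet of the given priority."""
        if self.free == 0:
            return False
        if priority == "low" and self.free == 1:
            return False  # last buffer is held back for high priority
        self.free -= 1
        return True
```

With two free buffers, one low-priority allocation succeeds, the next low-priority one is refused, yet a high-priority packet still gets through.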

StarT-Voyager NES Overview

The NES design faces the challenge of delivering a wide range of functions at high performance while providing programmability. Furthermore, it must be implementable with moderate effort. We meet these demands with the design shown in Figure, which couples a moderate amount of custom hardware, the NES Core, with an off-the-shelf microprocessor, a PowerPC 604, which we refer to as the sP (service processor). The idea is to have custom hardware completely handle the most frequent operations so that the sP is not involved very frequently. Because the sP takes care of infrequent corner cases, the custom hardware can be kept simple. Conversely, because the custom hardware completely takes care of most cases, overall performance is little affected even if the sP is somewhat slower at handling operations. This makes it feasible to employ a normal off-the-shelf microprocessor to provide NES programmability.

Using an Off-the-Shelf Service Processor

Compared to a custom-designed programmable engine, a generic microprocessor reduces the design effort, but faces the problems of inadequate capabilities and slow access to off-chip devices due to deep pipelining between its high-speed processor core and the external bus. To overcome these deficiencies, we designed the custom logic in the NES Core as a coprocessor to the sP, offering it a flexible set of composable, communication-oriented commands. These commands provide missing functions and offload data movement tasks that are simple for the NES Core to provide but are inefficient for the sP to undertake.

The strength of the sP is its ability to implement complex control decisions that can also be modified easily. Thus, a guiding principle in the NES design is to have the sP make control decisions which are carried out by the NES Core. With this organization, most of the communication data moves between the network and the host SMP system through the NES Core, bypassing the sP. This design philosophy is also adopted in FLASH's MAGIC chip, where the PP typically does not handle data directly but only deals with control information such as network packet headers.

A disadvantage of firmware-implemented functions is lower throughput, since handling each event takes multiple firmware instructions spanning a number of processor cycles. With today's commercial microprocessors, which are all single-threaded, this limits concurrency. A custom-designed programmable core suffers similar inefficiencies, although it could opt to support multiple threads simultaneously to improve throughput. In contrast, dedicated hardware for handling the same event can be pipelined so that several operations can simultaneously be in progress. Custom dedicated hardware can also exploit greater parallelism, e.g., CAM (Content Addressable Memory) hardware for parallel lookup.

As before, this problem is partly resolved with the communication primitives supplied by the NES Core hardware. With an appropriate set of primitives and an efficient interface between the sP and the NES Core, the sP can simply specify the sequence of actions needed, but it neither carries out those actions by itself nor closely supervises the ordering of the sequence. This reduces the occupancy of the sP. A detailed discussion of the interface between the sP and the NES Core can be found in Section .

The sP has its own memory subsystem, consisting of a memory controller and normal page-mode DRAM. The decision to provide the sP with its own memory system, as opposed to using the host SMP's main memory, is discussed in the next section (Section ), where several alternative organizations are presented.

Overview of NES Hardware Functions

The NES Core provides a full hardware implementation of Basic messages, Express messages, and their TagOn variants. These message-passing services are available to both the aP and the sP. Because the buffer space and control state of these message queues both reside in the NES Core, only a fixed, small number of hardware transmit and receive queues are available. To meet our goal of supporting a large number of queues, additional message queues, with their buffer and control state residing in DRAM, are implemented with the assistance of the sP. Both hardware-implemented and sP-implemented message queues present identical interfaces to aP software. Switching a logical message queue used by aP software between hardware and sP-implemented queues involves copying its state and changing the VMM mapping on the aP. This can be done in a manner completely transparent to aP software. The switch is also a local decision. The NES Core maintains TLB-like hardware which identifies the RQIDs associated with the hardware receive queues. A packet with an RQID that misses in this lookup is handled by the sP.
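The receive-side steering can be pictured as a small lookup: a packet whose RQID hits in the hardware table goes to a hardware receive queue, while a miss falls through to the sP. The sketch below is illustrative; the names and the dictionary-based "TLB" are ours, not the NES's actual structures:

```python
# RQIDs of the hardware receive queues, as known to the TLB-like lookup.
# (Hypothetical contents for illustration.)
hw_rqid_table = {3: "hw_queue_3", 7: "hw_queue_7"}

def route_packet(rqid):
    """Return the handler for an incoming packet: a hardware receive
    queue on a lookup hit, or the sP on a miss (an sP-implemented
    queue whose buffer and control state reside in DRAM)."""
    return hw_rqid_table.get(rqid, "sP")
```

Because the lookup is a pure filter, moving a logical queue between hardware and sP implementations only requires updating the table; aP software sees the same interface either way.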

NES hardware handles the repetitive operations in a DMA transfer. At the sender, NES Core hardware reads data from the system bus, packetizes it, and sends the packets into the network. At the receiver, NES Core hardware writes the data into DRAM and keeps count of the number of packets that have arrived. The sPs at both the source and the destination are involved in the non-repetitive operations of a DMA transfer, such as address translation and issuing commands to the functional units that perform the repetitive tasks. See Section for details.
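The division of labour in a DMA transfer can be sketched as follows: the loops stand in for the repetitive work done by NES Core hardware, while the one-time setup (address translation, command issue) is what the sP contributes. All names and the payload size are illustrative assumptions, not the NES's actual parameters:

```python
def dma_send(data, packet_payload_bytes):
    """Sender-side repetitive work: read data and packetize it.
    Returns the list of packet payloads sent into the network."""
    return [data[i:i + packet_payload_bytes]
            for i in range(0, len(data), packet_payload_bytes)]

def dma_receive(packets, expected_count):
    """Receiver-side repetitive work: write payloads to DRAM and
    count arrivals. Returns the reassembled data and a flag telling
    the sP whether the whole transfer has arrived."""
    dram = b"".join(packets)        # hardware writes data into DRAM
    arrived = len(packets)          # hardware keeps the arrival count
    return dram, arrived == expected_count
```

In the real system, the sP computes the physical addresses and packet count up front, then lets the functional units run the loops without further supervision.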

NES Core hardware also implements cacheline state bits for main memory falling within the S-COMA address space, using them to impose access permission checks on bus transactions to this address space. We sometimes refer to this as the Snooped Space. In the good cases, where the cacheline is present in a state that permits the attempted bus transaction, the bus transaction completes with no further delay than a cache miss to local main memory in the original SMP. The sP can ask to be notified of such an event. In the other cases, the bus transaction is retried until the sP is notified and replies with an approval. When giving approval, the sP can optionally supply the data for a read-like bus transaction. Section elaborates on S-COMA support.

The sP is directly responsible for another address space, the sP-Serviced Space. Typically, write-like bus transactions are allowed to complete, with their control and state information enqueued for subsequent processing by the sP. Read-like bus transactions are typically retried until the sP comes back with approval. The NES Core hardware that supports this actually provides other options, as described later in Section . CC-NUMA style shared-memory implementation is one use of this address space. Other uses are possible: because the sP interprets bus transactions to this address space, aP software and the sP can impose arbitrary semantics on bus transactions to specific addresses.

The sP can also request bus transactions on the aP bus. These are specified as commands to the NES Core, with the sP specifying all 60X bus control signal values. NES Core execution of these bus transactions is decoupled from the sP's instruction stream processing. Further details are in Section .

Our sP is similar to the embedded processors on message-passing NIUs like Myrinet and SP in that it has the ability to construct, send, and receive arbitrary message packets across the network (see Sections through ). In addition, the NES Core provides the sP with a very general capability, not present in those systems, to observe and intervene in aP system bus transactions (details in Sections and ).

The sP operates as a software functional unit, with many virtual functional units time-multiplexed onto it. NES Core hardware improves the sP's efficiency by implementing configurable hardware filters which present it with only the bus transactions or packets that it wants to see or handle. On the bus side, the hardware cacheline state-bit check performs that function. On the network side, the destination translation and receive-packet RQID lookup perform this filtering for outgoing and incoming packets, respectively.

Alternate Organizations

This section examines several alternate NES organizations, covering major design choices that fundamentally shape the NES microarchitecture.

Using an SMP Processor as NIU Service Processor

An alternative to our StarT-Voyager NES organization is a design which uses one of the SMP processors as the sP (see Figure ). Typhoon has a similar organization, except that the NES is split into two devices: a device on the system bus to impose fine-grain access control, and a device on the I/O bus for message-passing access to the network. This design has the advantage of not needing an sP

[Figure: an SMP in which one of the processors serves as the sP. The system bus connects the processors, the memory controller with its main memory DRAM, and an I/O bridge, with the NES providing the connection to the network.]

Figure: A design which uses one of the SMP processors to serve as the NES sP.

subsystem on the NES. Unfortunately, this design has a serious performance problem, arising from the sP having to constantly poll the NES to find out if it has any events to handle. This translates into system bus usage even when there is nothing for the sP to do. [1]

There is also the possibility of deadlocks. Part of the sP's function is to be intimately involved in controlling the progress of bus transactions on the system bus. In this capacity, there will be cases when the sP wants to prevent bus transactions from completing until it has carried out some action, such as communicating with other nodes. With the sP functions undertaken by one of the SMP processors, these actions now depend on the ability of this sP to use the system bus, either to communicate with the NES or simply to access main memory. Insufficient bus interface resources may prevent such system bus transactions from completing, as elaborated below.

In an SMP, bus interface resources in the microprocessors and other system bus devices are divided into separate pools devoted to different operations. This is to avoid deadlocks arising from dependence chains in snoopy cache coherence operations. The specifics of a design, e.g., the number of resource pools and the mapping of bus interface operations to the pools, are tied to the dependence chains that can arise under its particular bus protocol.

[1] Caching device registers, proposed by Mukherjee et al., can remove this bus bandwidth consumption when there is no event to handle. However, it will lengthen the latency of handling events, since an actual transfer of information now requires two bus transactions: an invalidation and an actual read.

In a bus-based SMP, it is common to assume that a push-out write transaction, triggered by a bus snoop hitting a dirty local cache copy, is the end of a dependence chain. Such a bus transaction will always complete because its destination, memory, is effectively an infinite sink of data.

Read transactions, on the other hand, may depend on a push-out. Consequently, resources used for read transactions should be in a different pool from those used for push-outs, and dependence from this pool to that used for push-outs will build up dynamically from time to time. The fact that push-out transactions do not create new dependences prevents dependence cycles from forming.

This assumption that a push-out transaction never creates new dependences is violated when a normal SMP processor used as an sP delays the completion of a push-out until it is able to read from main memory. Whether this will really cause a deadlock, and whether there are any acceptable workarounds, depend on details of the SMP system, e.g., whether push-out queues operate strictly in FIFO order or permit bypasses. Nevertheless, the danger clearly exists.

StarT-NG

In the StarT-NG project, we explored an organization that is very similar to using an SMP processor as the sP (see Figure ). It differs in that the PowerPC processor was designed with a backside L2 cache interface that can also accommodate a slave device. Our design made use of this capability to connect the NES through a private, i.e., non-shared, interface to a processor that also directly connects to the system bus. This processor, used as the sP, would poll the NES via its backside L2 cache interface rather than the system bus. [2] One way to think about this

[2] The StarT-NG design divides the equivalent of our NES Core into two portions. One portion, which is message-passing oriented, interfaces to the backside L2 cache interface. The other portion, which takes care of shared-memory oriented functions, is mostly accessible only via the system bus, with the exception that notification of pending events is delivered to the sP via the message-passing interface. Another difference is that StarT-NG's shared-memory support is a subset of StarT-Voyager's.

[Figure: the StarT-NG organization. One SMP processor serves as the sP, with the NES attached to its backside L2 cache interface and connected to the network; the system bus also connects the processors to the memory controller, main memory DRAM, and the I/O bridge.]

Figure: The StarT-NG design, which uses one of the SMP processors as the NES sP, but capitalizes on this processor's special backside L2 cache interface to have a private interface between the NES and this processor.

is that StarT-Voyager's sP bus is moved to the backside L2 cache interface. This design solves the performance problem of the sP constantly polling over the system bus.

To deal with the problem of potential deadlocks from the sP directly using the system bus, usage rules are imposed on sP software in the StarT-NG design. For instance, the sP processor and the other SMP processors cannot share cacheable memory. This avoids any possibility of dependences arising from snoopy coherence operations on the shared data. In addition, many bus transactions that the sP wants performed on the system bus have to be done through the NES Core, even though sP software is capable of issuing instructions that result in the desired bus transactions on the system bus. This, again, is to avoid deadlocks; an example follows.

Consider the case where the sP wishes to flush a particular cacheline from the other caches in the SMP node. A Flush bus transaction on the system bus will achieve this. But even though the sP can directly execute a Dcbf (Data Cache Block Flush) instruction that will result in such a bus transaction, it is not safe to do so. The Flush bus transaction may require a push-out from another SMP processor before it completes. In turn, that push-out may be queued behind several other push-outs requiring sP processing. But until the Flush bus transaction triggered by the sP's Dcbf instruction is completed, the sP's instruction pipeline is blocked. A deadlock results. [3]

While there are several ways of circumventing this particular deadlock scenario, delegating the task of performing the Flush bus transaction to the NES Core is the only general one. [4] It ensures that the sP's execution is decoupled from the completion of the Flush bus transaction. While waiting for the latter's completion, the sP is free to handle push-outs or any of many other events. Because many functions are multiplexed onto the sP, any scenario which can block its execution pipeline must be closely examined.

Using the SMP main memory as sP memory is a more elegant design than the separate sP memory system adopted in the StarT-Voyager NES. But as shown above, several serious issues have to be resolved before it is feasible. In StarT-Voyager, the lack of a backside L2 cache interface on the PowerPC 604e processor precludes the solution used in StarT-NG. The next best solution that both avoids dedicated sP DRAM and maintains a private bus between the sP and the NES Core is to have the NES Core act as a bridge chip, relaying sP bus transactions onto the SMP system bus to access main memory. We decided that for our experimental prototype this was too much trouble; we were able to put together the sP's memory system using commercial parts with little design effort.

[3] This example assumes that items in the push-out queues are serviced in order. The problem does not arise if the queuing policy allows bypassing. Unfortunately, this level of detail is rarely specified in microprocessor manuals. Furthermore, neither a bus-based SMP nor a more traditional hardware-implemented cache-coherent distributed shared-memory machine requires bypassing in the push-out queue. Since allowing bypass requires a more complex design, it is unlikely that the feature is implemented.

[4] Other solutions typically rely on implementation details that allow absolute bounds on buffering requirements to be computed and provided for. We view these solutions as too dependent on implementation specifics, which should not be tied down rigidly. For example, should the buffer space in the processor bus interface increase because a new version of the chip has more silicon available, this solution will need the amount of reserved resources to be increased.

Custom NIU ASIC with Integrated Programmable Core

Incorporating a custom programmable core into the NIU is an alternative that can not only achieve the goals that we laid out in earlier chapters, but also overcome some disadvantages of an off-the-shelf sP. Aside from having the programmable core closer to the rest of the NIU, this approach brings many opportunities for customizing the programmable core's instruction set, and perhaps for adding multiple contexts or even simultaneous multithreading support.

A number of projects have taken this approach, e.g., FLASH, Sun's S3.mp, and Sequent's NUMA-Q. Though all three machines contain some kind of custom programmable core, they vary in generality. Most are highly specialized microcode engines. The MAGIC chip in FLASH is the closest to a generally programmable core.

In the interest of keeping the design effort down, we elected not to include a custom-designed programmable core. We also felt that, with the correct mix of simple hardware and an off-the-shelf processor, good communication performance can be achieved. Furthermore, the higher core clock speed of our sP may allow it to perform more complex decisions on the occasions when it is involved. In contrast, the PP, the programmable core of MAGIC, is involved in all transactions. The number of cycles it can spend on handling each event is therefore constrained, in order to maintain adequate throughput and latency. This can become a particularly serious issue with multiple-processor SMP nodes. [5]

Table-driven Protocol Engines

Using hardware protocol engines driven by configurable tables offers some amount of programmability over how a cache coherence protocol operates, without employing a microcode engine or a general programmable core. The flexibility is, of course, limited to what is configurable in the tables. Since it is less general, its implementation is likely to retain the low-latency and high-throughput advantages of hardwired coherence protocol engines. Its implementation is also simpler than a generally programmable core.

[5] Each MAGIC chip serves only a single R10000 processor in FLASH.

The StarT-Voyager NES actually employs this idea in a limited way. As explained later in Section , the NES Core performs cacheline-granularity access permission checks for system bus transactions addressed to a portion of main memory DRAM. This memory region can be used to implement S-COMA and the local portion of CC-NUMA shared memory. The outcome of this hardware check is determined not only by cacheline state information, but also by configurable tables which specify how the cacheline state information is interpreted. Useful as a component of our overall design, the table-driven FSM approach alone does not provide the level of flexibility that we want in our NIU.
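The table-driven check can be sketched as an FSM whose interpretation of the cacheline state bits is itself data: a configurable table maps a (state, bus operation) pair to an action. The encoding below is our own illustration, not the NES's actual table format:

```python
# Configurable table: (cacheline state, bus operation) -> action.
# "proceed" lets the transaction complete at local-miss speed;
# "retry_notify_sP" retries the transaction and queues an event for
# the sP, which later replies with approval (and possibly the data).
access_table = {
    ("valid",   "read"):  "proceed",
    ("valid",   "write"): "retry_notify_sP",  # e.g. write to a read-only copy
    ("invalid", "read"):  "retry_notify_sP",
    ("invalid", "write"): "retry_notify_sP",
}

def check_bus_transaction(state, op):
    """Hardware permission check for a bus transaction into the
    snooped (S-COMA) address space, driven by the table above."""
    return access_table[(state, op)]
```

Reprogramming the table changes how the same state bits are interpreted, which is exactly the limited flexibility this approach offers; what it cannot do is attach arbitrary new semantics to a transaction, which is why the sP remains in the design.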

StarT-Voyager NES Execution Model

The NES Core is unusual as a coprocessor in that its execution is driven by multiple external sources of stimuli, triggering concurrent execution in shared functional units. These sources are: (i) commands from the sP; (ii) packet arrivals from the network; and (iii) bus transactions on the SMP system bus. Internally, the NES Core consists of a collection of functional units connected by queues. Execution proceeds in a continuation-passing/dataflow style, with each functional unit taking requests from one or more input queues and, in some cases, generating continuation results into output queues. These queues in turn feed other functional units.

This model of functional units connected by queues extends across the entire cluster, with requests from one NES traveling across the network into the queues of another NES. A very important part of the design is to ensure that dependence cycles do not build up across this vast web of queues.
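The execution model, functional units joined by queues, each consuming requests and emitting continuations, can be sketched with ordinary FIFOs. The two-stage pipeline below is our own minimal illustration, not the NES's actual unit set:

```python
from collections import deque

# Two functional units connected by queues: the first packetizes a
# request, the second "transmits" packets. Each unit drains its input
# queue and pushes continuation results into its output queue.
requests, packets, sent = deque(), deque(), []

def packetize_unit():
    """Consume requests; produce fixed-size packet payloads."""
    while requests:
        data = requests.popleft()
        for i in range(0, len(data), 4):
            packets.append(data[i:i + 4])

def transmit_unit():
    """Consume packets; 'send' them into the network."""
    while packets:
        sent.append(packets.popleft())

requests.append(b"abcdefgh")
packetize_unit()   # stage 1: request -> packets
transmit_unit()    # stage 2: packets -> network
```

Because each unit communicates only through its queues, units can run concurrently in hardware; the cluster-wide version simply lets one node's output queue feed another node's input queue across the network.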

Although the NES Core has several sources of stimuli, the notion of per-thread context is weak, limited to the address base/bound and producer/consumer pointer values that define the FIFO queues. In the case of transmit queues, the context also includes the destination translation table and the base/bound addresses which define the NES SRAM memory space used for TagOn messages.

Conspicuously absent is a register file associated with each command stream. Unlike the register-style instructions used in RISC microprocessors today, but similar to tokens in dataflow machines, the commands in the NES Core specify operand information by value. This choice is motivated by hardware simplicity. Each of the sP command streams could probably make use of a small register file, but since most events handled by the sP are expected to result in only a small number of commands, [6] the context provided by a register file in the NES Core is not critical. Without the register file and the associated register-operand fetching, NES Core hardware is simplified.

Management of ordering and dependence between commands from the same stream, discussed in Section , is also kept very simple. There is no general hardware support for dynamic data-dependence tracking, such as scoreboarding. A few special commands are, however, designed to allow vector-processor style chaining (see the implementation of DMA in Section ).

Interface between sP and NES Custom Functional Units

The interface between the sP and the NES Core is critical to performance: sP occupancy and the overall latency of functions implemented by the sP can vary significantly depending on this interface. In designing this interface, we have to keep in mind that it is relatively expensive for the sP to read from the NES Core. It also takes a fair number of cycles for the sP to write to the NES Core. Three possible interface designs are discussed here: (i) a traditional command and status register

[6] We will see later, in the evaluation chapter (Chapter ), that the cost of switching context is a significant overhead for sP firmware. That suggests that multiple-context support in the sP is likely to be useful. However, that is not the same as saying that the NES Core should support multiple contexts for its command streams: firmware must be able to get to those contexts cheaply if that is to be an advantage.

[Figure: a command and status register interface. The service processor (sP) writes per-functional-unit control registers and reads per-functional-unit status registers.]

Figure: A command and status register interface.

interface, common among I/O devices; (ii) a command and completion queue design with ordering guarantees between selected commands; and (iii) an interface similar to (ii), with the addition of command templates. The StarT-Voyager NES implemented the second option.

Option 1: Status and Command Registers

I/O devices are commonly designed with status and command registers which are memory-mapped by the microprocessor. As illustrated in Figure , the microprocessor issues requests by writing to command registers, and checks the results or progress of these requests by reading from status registers. The I/O device can usually also notify the microprocessor of an event with an interrupt, but this is expensive.

Traditional I/O device interfaces are not designed for efficiency, as microprocessors are expected to access them infrequently. A major limitation of this interface is that each command register typically supports only one request at a time. The microprocessor has to withhold subsequent requests to the same command register until the current command is completed.

Secondly, when a sequence of operations to different registers, i.e., different functional units of the device, contains dependences, the required ordering has to be enforced externally by the microprocessor. The microprocessor has to delay issuing a command until the commands it depends on are known to have completed.

Thirdly, polling individual status registers to find out about the progress or completion of a command is slow. This can be improved by packing status bits together, so that a single poll returns more information. There is, however, a limit to the amount of information that can be packed into a single access: because status reads are normally done with uncached bus transactions, this limit is typically a single bus word. Checking through a vector of packed status bits can also be slow for sP firmware.

An sP working with such an interface will be very inefficient. To accomplish a sequence of commands, the sP has to enforce inter-command dependences and arbitrate between functional-unit usage conflicts. In the meantime, it has to poll for both completion of previous requests and arrival of new transactions to handle. Performance is degraded because polling is relatively slow, and pipelining between NES Core command execution and sP issue of subsequent requests is severely limited by the poll-and-issue sequence. The sP code also becomes very complex, because it has to choreograph one or more sequences of actions while servicing new requests arriving from the network or the SMP bus. [7]

Option 2: Command and Completion Queues

Ideally, the sP would like to issue, all at once, the entire sequence of commands needed for processing a new request. This simplifies sP coding by removing the need to continually monitor the progress of the sequence. It also facilitates pipelining: not only can the sP issue more commands while earlier ones are being executed, the NES Core can potentially exploit parallelism between these commands.

The efficiency of sP polling for new requests or for notification of command completion can be improved by merging them into polling from a single address. The

[7] Interleaving the processing of multiple transactions is not merely a performance-improvement option but a necessity, as failure to continue servicing new requests can lead to deadlocks.

[Figure: a command and completion queue interface. The service processor (sP) enqueues commands into command queues; dispatch logic forwards them to the functional units, and completions are reported through a completion queue with completion acknowledgement.]

Figure: A command and completion queue interface.

StarT-Voyager NES achieves these advantages with an interface composed of two command queues, a completion queue, and the ability to perform OnePoll from several queues.

Command Ordering and Data Dependence

A major command queue design issue is hardware maintenance of ordering and data dependence between commands. Without such guarantees, the sP cannot fully exploit the command queues, since it still has to manually enforce data dependence in the old way. In addressing this issue, a design has to balance the need for ordering against exploitation of inter-command parallelism and design complexity. The following are some options we considered.

Dependence Tracking Hardware: At one extreme is an aggressive design that dynamically tracks dependence between commands due to read/write access to shared NES SRAM locations. Several implementations of this dependence tracking are possible. One is to employ scoreboarding, with full/empty bits for all NES SRAM memory locations. Another is to maintain a list of SRAM locations which are being modified, much like the scheme used in the load/store units of today's aggressive superscalar microprocessors. The size of the NES SRAM memory makes keeping full/empty bits relatively expensive. The other design is also difficult because the exact SRAM memory location read or written by a command is sometimes unknown until part way through its execution in a functional unit.

Barrier Command: A middle-of-the-road option with moderate design complexity is to have sP software explicitly specify dependences. The sP could be provided with barrier commands which block issue of subsequent commands until previous ones have completed. Given that barriers are expected to be used quite frequently, the barrier can be made an option in each ordinary command; setting this option bit is equivalent to preceding the command with a barrier command.
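The per-command barrier option amounts to one predicate in the dispatch logic. A minimal sketch, with a made-up bit position and command encoding:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical encoding: one bit in the command word marks a barrier.
 * Setting it is equivalent to preceding the command with a separate
 * barrier command. */
#define CMD_BARRIER_BIT (1u << 31)

/* Dispatch predicate for a sketch of the NES Core dispatch logic:
 * a command with the barrier bit set may only issue once all earlier
 * commands have completed (outstanding == 0). */
static bool can_dispatch(uint32_t cmd, unsigned outstanding) {
    if (cmd & CMD_BARRIER_BIT)
        return outstanding == 0;   /* wait for all earlier commands */
    return true;                   /* no ordering constraint */
}
```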

Dependence Bit Register File: Another way for sP software to explicitly specify dependence is to provide a small register file of full/empty bits. Each command is expanded with two or more fields for naming full/empty registers: one for which it is the producer, and the others consumers. A command is not issued until its consumer full/empty registers are all set. Its producer full/empty register is cleared when the command is issued and set upon its completion. This approach provides finer-granularity specification of dependence than the barrier command approach, but comes at the price of increased command size, dependence register file hardware, and the control logic to track and enforce dependence.
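The full/empty discipline described above can be sketched as follows; the register file width and field names are illustrative assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint16_t full;                 /* 16 full/empty bits (assumed width) */
} dep_regfile_t;

typedef struct {
    uint8_t producer;              /* register this command will set */
    uint8_t consumer[2];           /* registers that must be full first */
    uint8_t nconsumers;
} dep_cmd_t;

/* A command issues only when every consumer register is full. */
static bool dep_can_issue(const dep_regfile_t *rf, const dep_cmd_t *c) {
    for (int i = 0; i < c->nconsumers; i++)
        if (!(rf->full & (1u << c->consumer[i])))
            return false;          /* a consumed value is not ready */
    return true;
}

static void dep_on_issue(dep_regfile_t *rf, const dep_cmd_t *c) {
    rf->full &= ~(1u << c->producer);   /* cleared at issue */
}

static void dep_on_complete(dep_regfile_t *rf, const dep_cmd_t *c) {
    rf->full |= (1u << c->producer);    /* set at completion */
}
```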

Multiple Sequentially Processed Queues: At the other extreme is a very simple design which executes only one command from each queue at a time, proceeding to the next one only after the previous one has completed. While simple, this design kills all inter-command parallelism. To improve parallelism, a design can employ multiple command queues with no ordering constraints between commands in different queues. It can also limit ordering to command types that are likely to have dependences. In the unlikely event that ordering is needed between other commands, the sP will manually enforce it by appropriately holding back command issue.

The last design is adopted in the StarT-Voyager NES, where two command queues are supported. The sP can also utilize additional queues which are limited to sending messages only. All the design options described above achieve the first-order requirement of allowing the sP to issue a stream of commands without polling for completion. They differ only in their exploitation of inter-command parallelism. Our choice is a compromise between design simplicity and parallelism exploitation.

Command Completion Notification

Command completion is reported to the sP via a completion queue. Since it is expensive for the sP to poll the NES Core, completion notification is made an option in each command. Very often a sequence of commands requires completion notification only for the last command, or none at all. The completion queue is treated like another message queue and can be one of the queues named in a OnePoll. Using OnePoll further reduces sP polling cost.

The completion queue has a finite size, and when it is full, completion notifications cannot be placed into it. This will cause functional units to stall. The sP is responsible for ensuring that this does not lead to a deadlock. It should in general preallocate space in the completion queue before issuing a command requiring completion notification. Alternatively, sP code can be structured to ensure that it is constantly removing notifications from the completion queue.
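The preallocation discipline can be made concrete with a small counter sketch; the queue depth and names are assumptions, and the NES Core and sP sides are shown as plain function calls rather than real bus traffic:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t depth;      /* completion queue capacity */
    uint32_t reserved;   /* slots promised to outstanding commands */
    uint32_t occupied;   /* notifications not yet consumed by the sP */
} compl_queue_t;

/* sP side: reserve a slot BEFORE issuing a command that requests
 * completion notification, so the queue can never fill and stall
 * the functional units. */
static bool compl_reserve(compl_queue_t *q) {
    if (q->reserved + q->occupied >= q->depth)
        return false;    /* sP must drain the queue first */
    q->reserved++;
    return true;
}

static void compl_deliver(compl_queue_t *q) {  /* NES Core side */
    q->reserved--;
    q->occupied++;
}

static void compl_consume(compl_queue_t *q) {  /* sP removes an entry */
    q->occupied--;
}
```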

Option: Template-Augmented Command and Completion Queues

Given that the sP is expected to issue short command sequences when processing new requests, it is natural to consider extending our design with command templates. The goal is to reduce the time taken to issue a command sequence by preprogramming the fixed portions of each command sequence in the NES Core.

[Figure: A template-augmented command and status register interface. Template memory sits beside the dispatch logic, between the sP's command queues and the functional units, with a completion queue and completion acknowledgement path as before.]

To invoke a sequence, the sP simply identifies the template and specifies the values of the variable parameters.

While interesting, it is unclear how much savings this produces. As mentioned before, when a stream of commands is issued by the sP, processing of earlier commands overlaps issuing of subsequent ones. Thus the benefit of the template scheme is probably more a matter of reducing sP occupancy than of reducing the overall latency of getting the command sequence done. Coming up with an efficient means of passing parameters is also tricky.

As proposed here, a template is a straight sequence of commands. It is, however, not difficult to envision including more capabilities, such as predicated commands and conditional branches. If the NES Core capability is pushed in that direction, one will soon end up with a customized programmable core; at that point, the sP is probably no longer needed. A substantial amount of design and implementation effort is needed to implement the template scheme, and even more for a more generally programmable NES Core. Consequently, we did not explore the idea much further.

NES Core Microarchitecture

This section describes in detail the different queues and functional units in the NES Core, as illustrated in the figures that follow. We do this in an incremental fashion, starting with the simplest subset, which implements only Resident Basic message support. By reusing and extending existing hardware capabilities and adding additional NES Core hardware, the NES's functionality is enhanced until we arrive at the final design. A figure below shows the partitioning of the NES Core functional units and queues into physical devices.

Resident Basic Message

A figure below presents a logical view of the NES Core components that implement Resident Basic message passing support. The state associated with each queue is separated into two parts: the buffer space and the control state. Control state is located with the transmit and receive functional units, while buffer space is provided by normal dual-ported synchronous SRAM. The exact location and size of each queue's buffer space is programmable by setting appropriate base and bound registers in the queue's control state. To support a number of queues with minimal duplication of hardware, the message queue control state is aggregated into files, similar to register files, which share control logic that choreographs the launch and arrival of messages to and from the network.

Software accesses the message queue buffer space directly, using either cached or uncached bus transactions. The NES State Read/Write logic provides software with two windows to access message queue control state. One window has access to the full state, including configuration information; obviously, access to this window should be limited to system software or other trusted software. A second window provides limited access to only the producer and consumer pointers. This is exported to user-level code.

The actual implementation of Basic Message functions involves the devices shown in the figures below. The design is completely symmetrical for the aP and the sP.

[Figure: Queues and functional units in the StarT-Voyager NES.]

[Figure: Physical devices in the StarT-Voyager NES: the aBIU and sBIU (Xilinx FPGAs), NESCtrl (ChipExpress LPGA), the TxU/RxU datapath (Xilinx FPGA), dual-ported aSRAM and sSRAM banks, ClsSRAM (fast SRAM) holding cache-line state, and TTL/PECL signal conversion toward the network.]

[Figure: Queues and functional units to implement Resident Basic Message.]

Two banks of dual-ported SRAM provide storage space for message buffers. Software access to these SRAM banks is achieved with the help of the BIUs, which determine the SRAM location to read data from and write data into. The destination translation tables are also located in the SRAM banks. The TxFIFO and RxFIFO in the diagram are used to decouple the real-time requirements of the Arctic network from the scheduling of the Ibus. This is necessary for incoming messages to prevent data losses. A second role of these FIFOs is crossing clock domains: the network and most of the NES operate at different clock frequencies.

Logic that implements the transmit and receive functions is divided into two parts: one part resides in the NESCtrl chip, and the other in the TxURxU FPGA. The former is responsible for the control functions; although it observes packet headers, this part does not manipulate or alter the data stream. Such tasks are the responsibility of the TxURxU. The NESCtrl chip is also the arbiter for the Ibus, which is shared among many different functions.

Access to Queue Control State

Implementation of control state update is easy.

[Figure: Actual realization of the Resident Basic Message queues and functional units, mapped onto the aBIU, NESCtrl, sBIU, SRAM banks, and TxURxU devices.]

It is easy because both the name of the state being updated and the new value are found in the address portion of a control state update bus transaction. Reading control state is more involved, because the data has to be written from the NESCtrl chip over the Ibus into the SRAM banks before it can be supplied as data to read bus transactions. To reduce the impact of this long data path on the latency of reading queue state, shadow copies of the most frequently read state (the producer pointer of receive queues and the consumer pointer of transmit queues) are kept in the SRAM banks and periodically updated by NESCtrl.

Transmit Unit

The transmit unit tracks the transmit queues that are non-empty, updating this information each time the producer or consumer pointer of a transmit queue is written to. A scheduler in the transmit unit then selects one of the non-empty transmit queues that is also enabled, and loads its control state into the state machine that controls the actual message launch.

The transmit unit is responsible for performing destination translation on most outgoing packets. With both message and translation table data coming from NES SRAM, this is achieved by simply marshalling the appropriate translation table entry data into the message data stream heading towards the TxURxU. The data path in TxURxU then splices that information into the appropriate portions of the outgoing packet's header.

For the other packets, which request physical destination addressing, the transmit unit checks that the transmit queue from which the packet originates has physical addressing privilege. Any error, such as naming a logical destination whose translation table entry is invalid or using physical addressing in a queue without that privilege, results in an error that shuts down the transmit queue. The error is reported in an NES Error register and may raise either an aP or an sP interrupt, depending on how the aP and sP interrupt masks are configured.

Receive Unit

The receive unit is responsible for directing an incoming packet into an appropriate receive queue. It maintains an associative lookup table which matches the RQID (Receive Queue ID), taken from the header of incoming packets, to physical message queues. The receive unit also maintains a special queue called the Overflow/Miss queue. Packets with RQIDs that miss in the lookup are placed into this queue. So are packets heading to receive queues that are already full and whose control state enables overflow. Exceptions to the latter are packets that ask to be dropped when their receive queue is full. The choice of whether to drop a packet heading to a full receive queue is made at the packet source, specified in the translation table entry.

A packet can also block if its receive queue is full. This happens when the receive queue state indicates that overflow is not enabled and the incoming packet does not ask to be dropped when encountering such a situation. Although blocking is in general dangerous, providing this option in the hardware gives system software the flexibility to decide whether it should be made available to any software; perhaps sP software can make use of it.
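The receive unit's queue-selection policy, as described in the two paragraphs above, reduces to a small decision function. This is a behavioural sketch only; the field names are invented, and the real logic operates on RQID lookup hardware and queue control state rather than a C struct:

```c
#include <stdint.h>

typedef enum { RX_DELIVER, RX_OVERFLOW, RX_DROP, RX_BLOCK } rx_action_t;

typedef struct {
    int valid;            /* RQID hit in the associative lookup table? */
    int full;             /* target receive queue currently full? */
    int overflow_enabled; /* queue control state allows overflow */
    int drop_on_full;     /* source asked (via translation table entry)
                           * to drop if the queue is full */
} rx_state_t;

static rx_action_t rx_dispatch(const rx_state_t *s) {
    if (!s->valid)           return RX_OVERFLOW;  /* RQID miss */
    if (!s->full)            return RX_DELIVER;
    if (s->drop_on_full)     return RX_DROP;      /* source's choice */
    if (s->overflow_enabled) return RX_OVERFLOW;
    return RX_BLOCK;         /* dangerous in general; see text */
}
```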

Link-level Protocol

Both the transmit and receive units participate in link-level flow control with an Arctic Router switch. Arctic adopts the strategy that each outgoing port keeps a count of the number of message buffers available at its destination. This count is decremented each time a packet is sent out of a port, and incremented when the destination signals that space for another packet is available. Arctic also imposes the rule that the last packet buffer has to be reserved for high-priority packets; the NES's transmit unit has to respect this convention. The receive unit typically indicates that a packet buffer space has freed up when a packet is moved from the RxFIFO into message buffer space in SRAM.
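The credit discipline just described can be sketched as a counter with a one-buffer reservation for high-priority traffic. The buffer count is a made-up parameter; the real mechanism lives in the transmit unit's hardware, not software:

```c
#include <stdbool.h>

typedef struct {
    int credits;                  /* free packet buffers downstream */
} link_port_t;

/* Decrement a credit on every send; low-priority traffic must leave
 * the last downstream buffer untouched (reserved for high priority). */
static bool link_try_send(link_port_t *p, bool high_priority) {
    int reserve = high_priority ? 0 : 1;
    if (p->credits <= reserve)
        return false;             /* would violate the reservation */
    p->credits--;
    return true;
}

/* Destination signalled that space for another packet is available. */
static void link_buffer_freed(link_port_t *p) {
    p->credits++;
}
```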

Resident Express Message

The NES Core implementation of Resident Express message support, shown in the figures below, reuses most of the hardware that is present for Basic message. The hardware message queue structures are reused with only minor changes.

[Figure: Queues and functional units to implement Resident Express message and Basic message. The lightly shaded portions require modifications when we add Express message; the more deeply shaded portions are new additions for Express message.]

[Figure: Actual realization of Express and Basic Message queues and functional units. The lightly shaded portions require modifications when we add Express message; the more deeply shaded portions are new additions for Express message.]

[Figure: Message headers of Basic and Express messages. Both formats carry UpRoute, logical destination, and tag-on-data fields; the Basic format additionally carries an explicit length field, while the Express format carries an address offset and general payload.]

The main change is to the granularity of the queue pointers, whose units differ between Basic and Express messages.

Transmit Unit

The Basic message and Express message formats are compared in the figure above. Because we keep the initial bytes of the Basic and Express message headers almost identical, few changes are needed in the transmit unit. One change is having the transmit unit add a packet length field, which is explicitly specified by software in the Basic message format but left implicit in the Express message format. Another change is feeding the Express message to TxURxU twice: the first time as the equivalent of the Basic message header, and the second time as the equivalent of the first data word in a Basic message.

Receive Unit

The receive unit requires additions to reformat Arctic packets into the format required by Express message. No reformatting is needed for Basic message, since its receive packet format is essentially Arctic's packet format. The reformatting, shown in the figure below, is very simple and is done in TxURxU. Most existing functions of the receive unit, such as respecting Arctic's link-level protocol and determining the NES SRAM location to place an incoming packet, are unchanged between Basic and Express messages.

[Figure: Transforming an Arctic packet into the Express message receive format.]

BIU Express Message Support

The main hardware addition to support Express message is in the BIUs, where shadow copies of the control state of the Express message queues are introduced. The BIUs use these to implement the hardware FIFO interface of Express messages, and also the OnePoll mechanism for prioritized polling of message queues.

When processor software performs an Express message send with an uncached write, its BIU uses the local copy of queue state to generate the message buffer's SRAM address. The BIU also writes the address portion of the bus transaction, which is part of the message, into the even word of this buffer if the transmit is triggered by an uncached write of the appropriate size. This is done over the Ibus with the help of NESCtrl. Finally, it increments its copy of the producer pointer and updates the transmit unit's copy of that pointer. The latter is done using the same hardware path taken when queue state is explicitly updated by software in a Basic message send.

When software polls for messages from an Express message queue, the BIU again uses its copy of queue state to generate the SRAM address. But instead of always generating the SRAM address of the buffer indicated by the queue's consumer pointer, the BIU takes into account whether the queue is empty. If it is empty, the BIU generates the SRAM address of a special location that contains a system-software-programmed Empty Express message. If the queue is non-empty, the BIU updates the consumer pointer. The BIU and the receive unit cooperate to keep the queue pointers in synchrony: the BIU propagates consumer pointer updates to the receive unit, while the receive unit propagates producer pointer updates in the reverse direction.
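The poll-address generation just described can be sketched as follows. The queue layout, field names, and addresses are all illustrative assumptions; the essential behaviour is that an empty queue yields the address of the programmed Empty message rather than stale buffer data:

```c
#include <stdint.h>

typedef struct {
    uint32_t base;        /* SRAM base of the queue's buffer space */
    uint32_t msg_bytes;   /* size of one Express message buffer */
    uint32_t nslots;
    uint32_t prod;        /* updated by the receive unit */
    uint32_t cons;        /* updated by the BIU on a successful poll */
    uint32_t empty_addr;  /* SRAM address of the Empty Express message */
} express_rxq_t;

/* Address the BIU supplies for an Express poll bus transaction. */
static uint32_t express_poll_addr(express_rxq_t *q) {
    if (q->prod == q->cons)
        return q->empty_addr;     /* queue empty: return Empty message */
    uint32_t addr = q->base + (q->cons % q->nslots) * q->msg_bytes;
    q->cons++;   /* consumer pointer advances; propagated to receive unit */
    return addr;
}
```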

OnePoll

OnePoll of multiple queues is an extension of polling from a single Express message receive queue. A OnePoll bus transaction specifies a number of message queues that it wants to poll from. On its part, the BIU determines which of these queues are not empty; when there are several, one queue is selected based on a fixed, hardwired priority. Since normal polling already selects between two possible SRAM addresses to read data from, OnePoll simply increases the choice to include more queues.

To reduce the amount of work done when a OnePoll transaction is encountered, the BIU maintains a bit for each Express message receive queue indicating whether it is empty. This is reevaluated each time a producer or consumer pointer is updated.[8] When servicing a OnePoll bus transaction, the BIU uses this information, selecting only those queues picked by the OnePoll bus transaction, and finds the highest-priority non-empty queue among them. The result is then used as mux controls to select the desired SRAM address. The depth of the OnePoll logic for n queues is O(log2 n), while the size of the logic is O(n). In our implementation we are able to include up to ten queues without timing problems.
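The selection step reduces to a priority encoder over the per-queue non-empty bits, masked by the queues the OnePoll transaction names. A behavioural sketch (lowest queue number is assumed to be highest priority, which is an illustrative choice; the real priority order is simply whatever is hardwired):

```c
#include <stdint.h>

/* Returns the selected queue index, or -1 if every polled queue is
 * empty (the caller would then supply the Empty message address). */
static int onepoll_select(uint32_t nonempty_bits, uint32_t poll_mask) {
    uint32_t candidates = nonempty_bits & poll_mask;
    if (candidates == 0)
        return -1;
    int q = 0;
    while (!(candidates & 1u)) {  /* priority encode: find lowest set bit;
                                   * in hardware this is O(log n) deep */
        candidates >>= 1;
        q++;
    }
    return q;
}
```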

A Basic message receive queue can also be included as a target queue in a OnePoll operation. When a Basic message queue is the highest-priority non-empty queue, the data returned by the OnePoll action includes both a software-programmable portion, which is typically programmed to identify the Basic message queue, and a portion containing the NESCtrl-updated shadow queue pointers in NES SRAM.

[Footnote 8: This implementation also avoids the need for parallel access to a large number of producer and consumer pointers. These pointers can continue to be kept in register files with a limited number of read/write ports.]

TagOn Capability

TagOn capability is an option available with both Basic and Express message types. When transmitting a TagOn message, the transmit unit appends additional data from an NES SRAM location specified in the message. When receiving an Express TagOn message, the receive unit splits the packet into two parts, directing the first part into a normal Express receive queue and the second into a queue similar to a Basic message receive queue. Modifications to the control logic of both the transmit and receive units are necessary to implement this feature.

Implementation of TagOn also requires additional control state. Transmit queues are augmented with a TagOn base address and a bound value. With the use of these base/bound values, the SRAM address of the additional data is specified with only an offset, reducing the number of address bits needed. This is helpful in Express TagOn, where bits are scarce. The scheme also enforces protection by limiting the NES SRAM region that software can specify as the source of TagOn data. The control state of the Express receive queue is increased to accommodate the additional queue.

When performing a TagOn transmit, the transmit unit needs to read in the TagOn address offset to generate the TagOn data's SRAM address. This capability is already present to generate the addresses of translation table entries, so the design changes are fairly incremental.
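The base/bound scheme is a standard segmentation check; a minimal sketch, with invented field names and widths, shows how one comparison both shortens the offset field and confines software to the approved SRAM region:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t tagon_base;    /* SRAM base for this queue's TagOn data */
    uint32_t tagon_bound;   /* highest legal offset (exclusive) */
} txq_state_t;

/* Compute the SRAM address of the TagOn data from the short offset
 * carried in the message. Returns false on a protection violation,
 * which the real hardware would report as a transmit-queue error. */
static bool tagon_addr(const txq_state_t *q, uint32_t offset,
                       uint32_t *sram_addr) {
    if (offset >= q->tagon_bound)
        return false;            /* outside the approved region */
    *sram_addr = q->tagon_base + offset;
    return true;
}
```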

NES Reclaim

The NES provides the Reclaim option to help maintain coherence of cacheable Basic message queue buffer space.[9] This is implemented in a very simple fashion. Instead of actually maintaining coherence state bits for each cache line of buffer space,

[Footnote 9: The Reclaim option is only available to the aP. Supporting Reclaim requires the BIU to be a bus master; because there is no pressing need for the sBIU to be a bus master, we did not go through the effort of implementing it.]

[Figure: Queues and functional units to implement Resident Express message and Basic message, including the Reclaim option for Basic message queues. The shaded portions are introduced to support Reclaim.]

[Figure: Actual realization of Express and Basic Message queues and functional units, including the Reclaim option for Basic message queues. The shaded portions are introduced to support Reclaim.]

the design relies on queue pointer updates to trigger coherence maintenance actions. When the producer pointer of a transmit queue is updated, the region between its old value and the new value is conservatively assumed to be freshly written and still in the processor's cache. The NES triggers write-back of these cache lines, using appropriate system bus transactions, before transmitting the messages.

Similarly, when the consumer pointer of a receive queue is updated, the region between its old and new values is conservatively assumed to be in the processor's cache. This time the region needs to be invalidated, as it will become stale once the buffer space is reused for new messages.

As shown in the figures above, the NES Core implements Reclaim with the addition of a Bus Master Unit to the aBIU and a new Reclaim Unit to the NESCtrl. The Reclaim Unit keeps a new category of pointers, called reclaim pointers, that are related to the producer and consumer pointers. We will use transmit queue operations to illustrate how this scheme works.

During a message send, software updates the reclaim pointer instead of the producer pointer of a transmit queue. When the Reclaim Unit detects that a reclaim pointer has advanced beyond its corresponding producer pointer, the Reclaim Unit issues bus transaction requests to the BIU's Bus Master unit to pull any dirty data out of the processor caches. Once these bus transactions complete, the Reclaim Unit updates the queue's producer pointer, which triggers the transmit unit to begin processing. With this implementation, neither the transmit nor receive units are modified to support Reclaim. Furthermore, the choice of whether to use Reclaim simply depends on which pointer software updates.
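The transmit-side interaction between the reclaim pointer and the producer pointer can be sketched as below. The flush callback stands in for the real bus-master write-back transactions, and cache lines are abstracted as indices; all names are illustrative:

```c
#include <stdint.h>

typedef struct {
    uint32_t reclaim;   /* advanced by software on a message send */
    uint32_t producer;  /* advanced by the Reclaim Unit afterwards */
} reclaim_txq_t;

typedef void (*flush_fn)(uint32_t line_index);

/* One pass of the Reclaim Unit: the region between the producer and
 * reclaim pointers is conservatively assumed dirty in the processor
 * cache; each line is flushed, then the producer pointer advances,
 * which is what triggers the transmit unit. Returns lines flushed. */
static int reclaim_step(reclaim_txq_t *q, flush_fn flush) {
    int n = 0;
    while (q->producer != q->reclaim) {
        if (flush)
            flush(q->producer);   /* pull dirty data out of caches */
        q->producer++;            /* transmit unit may now proceed */
        n++;
    }
    return n;
}
```

Using Reclaim or not is simply a matter of which pointer software writes: updating `producer` directly bypasses the flush pass entirely.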

sP Bus Master Capability on SMP System Bus

With the NES Core functions described so far, the sP is unable to do much as an embedded processor; in fact, an sP receives no more services from the NES Core than an aP. But now that the aBIU has implemented bus master capability, it is only a small step to make this capability available to the sP. The first figure below shows the additional functional blocks and queues that are needed to make this possible.

[Figure: Queues and functional units to implement Resident Express message and Basic message, including Reclaim, and giving the sP the ability to initiate bus transactions on the SMP system bus. The latter is made possible with the addition of the shaded regions.]

[Figure: Actual realization of Resident Express and Basic message functions, including Reclaim, and giving the sP the ability to initiate bus transactions on the SMP system bus. The latter is made possible with the addition of the shaded regions.]

A second figure shows how these components map onto physical devices. The main additions are two Local Command Queues, a Dispatch Unit that arbitrates and dispatches commands from these queues, and an Acknowledgement Queue.

Local Command Queues: To execute bus operations on the SMP system bus, the sP inserts bus operation commands into any of the Local Command queues, in the same way it issues an Express message send. One can think of this as issuing a message to the bus master unit in the aBIU to perform the bus operation.

A bus operation command specifies the transaction's physical address and control signals. In addition, if data is involved in the transaction, the command also specifies the aSRAM address to read data from or write data to. The format of this bus operation request message is also used by the Reclaim Unit to request bus operations. In this way, the Bus Master unit offers a primitive that is used by both the Reclaim Unit and the sP.
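The shared use of this command primitive by the Reclaim Unit and the sP can be sketched as follows. This is a minimal Python model; all class and field names are invented for illustration, not taken from the NES implementation:

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class BusOpCommand:
    """One bus-operation request, as inserted into a Local Command queue."""
    phys_addr: int            # physical address on the SMP system bus
    txn_type: str             # transaction type / control-signal encoding
    sram_addr: Optional[int]  # aSRAM address, if the transaction moves data
    want_ack: bool = False    # request an acknowledgement on completion

class BusMaster:
    """Executes commands; the same primitive serves the Reclaim Unit and sP."""
    def __init__(self):
        self.ack_queue = deque()   # polled by the sP, like a message queue
        self.executed = []

    def execute(self, cmd):
        self.executed.append((cmd.txn_type, cmd.phys_addr, cmd.sram_addr))
        if cmd.want_ack:
            self.ack_queue.append(cmd)

# The sP "sends a message" to the bus master by enqueuing a command:
local_cmd_queue = deque()
local_cmd_queue.append(BusOpCommand(0x10000000, "READ", sram_addr=0x40, want_ack=True))

bm = BusMaster()
while local_cmd_queue:
    bm.execute(local_cmd_queue.popleft())
# bm.ack_queue now holds one acknowledgement for the sP to poll
```

The point of the sketch is that any producer speaking the `BusOpCommand` format, hardware Reclaim Unit or sP firmware, can drive the same Bus Master primitive.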

Acknowledgement Queue: If the sP needs to know when a command is completed, it can request an acknowledgement. The Bus Master unit inserts such an acknowledgement into an Acknowledgement queue, using much of the infrastructure that already exists to place the address portion of an Express message into buffer space in the SRAM banks. A minor difference is the number of bits of information that have to be written for an acknowledgement, as opposed to the case of Express message composition. The sP polls for acknowledgements in the same way it receives Express messages. In fact, using OnePoll, the sP can poll the Acknowledgement queue and other message queues simultaneously.

The above description shows that most of the functions needed to let the sP issue bus transactions on the SMP system bus already exist. Giving the sP the ability to transfer data between the SMP system bus and the aSRAM, which it can access directly using load/store instructions, gives the sP indirect access to the host SMP's main memory. Furthermore, data can be transferred between the SMP main memory and NES SRAM locations used as message buffer space. Thus, the sP can now be a proxy that transmits messages out of aP DRAM and writes incoming messages into aP DRAM.

The sP can also send normal Express and Express TagOn messages from the Local Command queues. Because bus operation and message commands in these queues are serviced in strict sequential order, the sP can issue both a bus command to move data into aSRAM and an Express TagOn command to ship it out all at once; there is no need for the sP to poll for the completion of the bus command before issuing the Express TagOn.

While this level of NES support is functionally adequate for the sP to implement Non-resident Basic message queues, other NES Core features, described in Sections and , improve efficiency.

Internode DMA

With the NES Core features described so far, the sP can implement DMA in firmware. But such an implementation can consume a significant amount of sP time (see Section ). Since block DMA involves a number of simple but repetitive steps, we added several functional units in the NES Core to offload these tasks from the sP.

Two functional units, the Block Bus Operations unit and the Block Transmit unit, are employed at the sender NES. The receiver NES is assisted by the addition of a Remote Command Queue and a counting service provided by the Bus Master unit. The basic idea is to transmit DMA packets that include both data and bus commands. When the latter are executed by the Bus Master unit at the destination NES, data is written to the appropriate SMP main memory locations. The counting service counts the number of packets that have arrived, so that when all packets of a transfer have arrived, an acknowledgement is inserted into the Acknowledgement queue. The sP is responsible for setting up the Block Bus Operations unit and the Block Transmit unit, and for initializing the DMA Channel Counters. It does this with commands issued to the Local Command queues. All these commands include acknowledgement options. Details of these features follow.


Figure: This diagram illustrates the addition of DMA support. The lightly shaded blocks are modified, while the darkly shaded regions are new additions to provide NES hardware DMA support.


Figure: Newly added logic for DMA support is shaded in this diagram. Lightly shaded regions are existing functional blocks that are modified, while darkly shaded regions are new functional blocks.

Block Bus Operations Unit: The Block Bus Operations unit is used to repeat a bus operation command a number of times, each time incrementing the SMP bus address and SRAM address by the cacheline size.

A command to this unit is similar to a bus operation command, but also specifies the number of times the bus operation is repeated. The Block Bus Operations unit treats SRAM space very much like a message queue. Configurable system state in the Block Bus Operations unit includes base and bound addresses and a producer pointer that are used to generate aSRAM addresses. The base and bound values are not expected to change frequently, so they are programmed through the usual NES state access mechanism and not via the command queues.

In order for the Block Bus Operations unit to operate in a pipelined fashion with the Block Transmit unit, the producer pointer is shared with the Block Transmit unit. It can be optionally reset in each command.
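The address generation described above can be sketched as follows. This is an illustrative Python model; the 64-byte cacheline size, function name, and window parameters are assumptions for the example, not values from the NES:

```python
CACHELINE = 64  # illustrative; the real increment is the bus cacheline size

def block_bus_ops(bus_addr, count, base, bound, producer):
    """Expand one block command into `count` bus operations, incrementing the
    SMP bus address and the aSRAM address by the cacheline size each time.
    aSRAM addresses come from a base/bound window and a producer pointer
    that wraps, treating SRAM space like a message queue."""
    ops = []
    for i in range(count):
        ops.append((bus_addr + i * CACHELINE, base + producer))
        producer = (producer + CACHELINE) % (bound - base)
    return ops, producer   # the pointer is shared with the Block Transmit unit

ops, prod = block_bus_ops(bus_addr=0x20000000, count=4,
                          base=0x800, bound=0x900, producer=0)
assert ops[0] == (0x20000000, 0x800)
assert ops[3] == (0x200000C0, 0x8C0)
assert prod == 0   # the pointer wrapped around the 256-byte window
```

Returning the advanced producer pointer mirrors the design choice of sharing it with the Block Transmit unit so the two can operate in a pipelined fashion.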

The Block Bus Operations unit is implemented in NESCtrl. It could equally well have been associated with the Bus Master unit in the aBIU FPGA.

Block Transmit Unit: The Block Transmit unit formats data into DMA packets, appending an appropriate header and two bus operation commands as trailers; two cachelines of data taken from NES SRAM are sandwiched between them. A command to this unit specifies the physical destination node name, details of the bus operation command to use, such as the transaction type and other control signal values, and the starting SMP system bus physical address. The command also specifies a packet count.

This unit shares the base, bound, and producer pointer state with the Block Bus Operations unit. It also uses a private consumer pointer. As long as the desired packet count has not been reached and the producer pointer is ahead of the consumer pointer, this unit generates DMA packets. Data is fetched from aSRAM addresses produced from the base, bound, and incrementing consumer pointer. The generated bus operation commands carry remote system bus addresses in increments of the cacheline size.

Footnote: The design would be much more flexible if the increment were not fixed but could be varied in each command. Furthermore, there should be two separate increment values, one for the system bus address and the other for the SRAM address. Such a design would allow the Block Bus Operations unit to gather data from the SMP system bus into NES SRAM, subject to the datapath-imposed constraints on the smallest data granularity and address alignment.

Each command to this unit can optionally reset the consumer pointer. This, together with the decoupling of the read operation from the transmit operation, allows multicast DMA operation at no additional design or hardware cost: a single block read command executed by the Block Bus Operations unit moves the desired data into NES SRAM, and multiple Block Transmit commands send it to several destinations.
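The packet formatting and the multicast-by-pointer-reset idea can be modeled roughly as below. Field names are hypothetical, the 128-byte payload is only an example, and a single trailing bus command stands in for the real packet's two:

```python
def make_dma_packets(dest, txn_type, start_bus_addr, npackets, payload=128):
    """Format DMA packets: a header, a payload taken from aSRAM (two
    cachelines in the real NES), and a trailing bus-operation command whose
    remote bus address advances by the payload size per packet."""
    return [{"header": {"dest": dest},
             "payload_bytes": payload,
             "bus_cmd": {"type": txn_type,
                         "bus_addr": start_bus_addr + i * payload}}
            for i in range(npackets)]

# Multicast at no extra cost: one block read fills aSRAM once; resetting the
# Block Transmit unit's consumer pointer replays the same data per node.
for node in (3, 7):
    pkts = make_dma_packets(dest=node, txn_type="WRITE",
                            start_bus_addr=0x30000000, npackets=2)
assert pkts[1]["bus_cmd"]["bus_addr"] == 0x30000080
```

Because the destination executes the carried bus command verbatim, each destination in the multicast can even be given a different starting physical address.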

The Block Transmit unit shares features with the normal Transmit unit and is implemented in NESCtrl.

Remote Command Queue: At its destination, a DMA packet is split into two parts, command and data, and enqueued into the Remote Command queue, a special receive queue. The command part of the Remote Command queue offers an instruction set that is almost identical to that of the Local Command queues. Thus, supporting a Remote Command queue adds few new requirements to what is already in the design.

Since it is more efficient to move data directly from the receive queue buffer space, and space in this queue is shared between packets from multiple sources and dynamically allocated, the bus commands in each DMA packet do not specify the SRAM address at which data is located. It is possible to adopt a design where the destination NES inserts the SRAM address into each bus command before it is passed on to the Bus Master unit. This is inconvenient in the StarT-Voyager NES because of the way data path and control are organized: TxU/RxU is the part of the data path where this could be done quite easily and efficiently, but unfortunately the SRAM address is not available at that point.

Footnote: Just like in the case of the Block Bus Operations unit, the Block Transmit unit would have been more flexible if the address increment were programmable, with separate SRAM address and SMP physical address increments. This would have allowed scatter operation. Scattering of data with granularity smaller than a cacheline is still rather inefficient, because each packet can only cause two bus transactions at the destination. The best way to get around this is to have a smarter functional unit at the receive end which produces the required number of bus transactions, a unit similar in function to the Block Bus Operations unit.

Instead, we augment the Bus Master unit with a copy of the Remote Command queue state and a modified form of bus operation commands called DMA Bus Operation commands. The aBIU generates the SRAM address for these commands from the consumer pointer and associated base and bound values of the Remote Command queue. This choice also has the beneficial effect of freeing up some bits in the DMA Bus Operation command for specifying a DMA Channel number.

DMA Channel Counters: The Bus Master unit maintains eight DMA Channel counters. These are initialized by commands from the local or remote command queues, and decremented when a DMA Bus Operation command is completed. When a counter reaches zero, an acknowledgement is inserted into the Acknowledgement queue to inform the local sP of the event.
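The counting service can be modeled in a few lines; the class name, method names, and the acknowledgement record format are hypothetical:

```python
class DMAChannelCounters:
    """Model of the eight counters in the Bus Master unit: initialized by a
    local or remote command, decremented as each DMA Bus Operation command
    completes, and posting an acknowledgement when a counter reaches zero."""
    def __init__(self):
        self.count = [0] * 8
        self.ack_queue = []        # polled by the local sP

    def init_channel(self, ch, npackets):
        self.count[ch] = npackets

    def dma_bus_op_done(self, ch):
        self.count[ch] -= 1
        if self.count[ch] == 0:
            self.ack_queue.append(("dma_done", ch))

ctr = DMAChannelCounters()
ctr.init_channel(2, npackets=3)
for _ in range(3):
    ctr.dma_bus_op_done(2)
assert ctr.ack_queue == [("dma_done", 2)]   # all packets of the transfer arrived
```

Note that arrival order does not matter: the counter only tracks how many DMA Bus Operation commands have completed, which matches the design's tolerance of out-of-order packet delivery within a transfer.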

With these functional units, the sP is only minimally involved in DMA. The sPs may have to perform address translation if a request from user code uses logical memory and node addresses. But once that is done, the destination sP only needs to initialize a DMA Channel counter and then poll for acknowledgement of that command. In parallel, the source sP issues the Block Bus Operations unit a command to read the desired data into the aSRAM. After it is informed that the destination sP has initialized the DMA Channel counter, the source sP issues a Block Transmit command.

A DMA Channel counter can also be reinitialized from the Remote Command queue. This can reduce the latency of a transfer, but requires the source sP to be preallocated a DMA Channel counter, and to also send all the DMA packets in FIFO order behind the reset command packet.

Footnote: The design can be made more flexible. Instead of using a counter that is decremented, we can

Having control over the transaction type used by the Block Bus Operations unit is useful. In the context of DMA, the possible options, READ, RWITM (Read With Intent To Modify), and RWNITC (Read With No Intent To Cache), have different effects on caches that have previously fetched the data that is being sent out: READ will leave only shared copies in caches; RWITM will leave no copies; while RWNITC will not modify the cache states, i.e., a cache can retain ownership of a cacheline.

The functional units introduced for DMA have additional uses. The Block Bus Operations unit is obviously useful for moving block data from aSRAM to aP DRAM, e.g., when we swap message queues between Resident and Non-resident implementations. It can also be used to issue Flush bus operations to a page that is being paged out. The current design shares base/bound state values between the Block Bus Operations and Block Transmit units. This couples their operations too tightly. Greater flexibility can be achieved if each has its own copy of this state. The producer pointer should also be duplicated, and a command to the Block Bus Operations unit should specify whether to increment the Block Transmit unit's producer pointer.

The DMA Channel counters can be decremented by a command that does not include bus transactions. This allows them to be used as counters that are decremented remotely via the Remote Command queue. Uses include implementing a barrier, or accumulating the acknowledgement count of cache-coherence protocol invalidations. Although the sP could do this counting, this hardware support reduces sP occupancy and results in faster processing of the packets.

The Remote Command queue opens up many possibilities. An sP can now issue commands to a remote NES Core without the involvement of the remote sP. This cuts down latency and remote sP occupancy. On the down side, this opens up some

protection concerns. As long as it is safe to assume that the remote sP and system code are trusted, there is no safety violation, because access to this queue is controlled by the destination address translation mechanism. Hence, user packets cannot normally get into this queue. If the remote sP and system code are not to be trusted, some protection checks will be needed at the destination. This is not in our current design.

Footnote (continued): use two counters, a target value and a current count. An acknowledgement is only generated if the two counters match and the target value is non-zero. If both counters are reset to zero after they match, this design avoids the requirement that the reset command has to arrive before the data packets. A variant which keeps the target value unmodified after both counters match avoids the need to reinitialize the counter for repeated transfers of the same size. To accommodate both, a reset command should specify whether to clear the target value register. This part of the design is in the aBIU FPGA, and can easily be modified to implement this new design.

DMA Implementation Alternatives

We now compare this design with two other DMA design options which were not adopted. One design is completely stateless at the receiver end, with each DMA packet acknowledged to the sender, which tracks the status of the transfer. By adding a programmable limit on the number of unacknowledged DMA packets, this design can also easily incorporate sender-imposed traffic control to avoid congestion. If the sender already knows the physical addresses to use at the destination, possibly because this has been set up in an earlier transfer and is still valid, this design has the advantage of not requiring setup. Its main disadvantage is the extra acknowledgement traffic, and the hardware to implement the acknowledgement.

We take the position that congestion control should be addressed at a level where all message traffic is covered. As for the setup delay, our design can also avoid the setup round-trip latency if the destination physical address is known and a channel has been preallocated, as described earlier.

Another DMA design utilizes a much more elaborate receiver, in which both the bus transaction command and the full destination address are generated at the destination NES. The DMA packets still have to contain address offset information if multiple network paths are exploited to improve performance; otherwise the packets have to arrive in a predetermined order, effectively restricting them to in-order delivery. Destination setup is, of course, necessary in this design. It has the advantage of avoiding the overhead of sending the bus command and address over the network repeatedly. For example, in our design each DMA packet carries its data payload, a bus command, and an Arctic header; the bus command is a fairly high overhead, which can be reduced if this alternate design is adopted. This alternate design also gives better protection, since SMP memory addresses are generated locally.

We settled on our design because it introduces very little additional mechanism for receiving DMA packets, but instead reuses, with minor augmentation, existing bus master command capability. Implementing our design requires adding an external command queue, which is a more general mechanism than a specialized DMA receiver.

sP Serviced Space

Using the NES Core features described so far, the aP and sP can only communicate either with messages, or via memory locations in NES SRAM or the SMP's main memory. The sP Serviced Space support described in this section allows the sP to directly participate in aP bus transactions at a low hardware level. This capability, together with the features described above, is functionally sufficient for implementing CC-NUMA style cache-coherent distributed shared memory.

The sP Serviced Space is a physical address region on the SMP system bus that is mapped to the NES, i.e., the NES behaves like memory, taking on the responsibility of supplying or accepting data. What distinguishes this region from normal memory, such as NES SRAM, is that the sP handles bus transactions to this space: the sP decides what data to supply to reads, what to do with the data of writes, and when each bus transaction is allowed to complete. In that sense, this address space is active, and not merely a static repository of state. The following NES Core mechanisms together implement the sP Serviced Space.

Transaction Capture

The transaction capture mechanism informs the sP about bus transactions to the sP Serviced space. For instance, if a bus transaction initiated by the aP writes data to this region, the sP needs to be informed and given the full details, such as the address, transaction type, caching information, and the data itself. With this information, the sP can decide what to do, such as writing the data to main memory of a remote node.

We extend the Acknowledgement queue for this purpose. When capturing a transaction, the NES Core inserts its address and control information as an entry


Figure: The addition of sP Serviced space support to the NES involves introducing the shaded functional block. Other existing infrastructure, such as the Capture & Ack queue, is reused.


Figure: The addition of sP Serviced space support to the NES involves introducing the shaded functional block in the aBIU. Other existing infrastructure, such as the Capture & Ack queue, is reused.

into the Acknowledgement queue, which we rename the Capture & Ack queue. As before, the sP polls this queue to obtain the captured information. In addition, a data queue is added to the Capture & Ack queue, into which data written by the bus transaction, if any, is enqueued. From an implementation point of view, the hardware data structure implementing this queue is the same as that of an Express message receive queue, which has both a header portion that stores the entries and a data portion. Thus, existing design and implementation components are reused.

Transaction Approval

In order for the sP to return arbitrary data to a read-like bus transaction, the NES Core needs to suspend a bus transaction until the sP provides the data to return. To avoid deadlock situations, such as those described in Section , when this address space is used to implement CC-NUMA shared memory, the means of suspending the bus transaction must not block the system bus. For the X bus protocol, this requires retrying the bus transaction, i.e., the bus master for the transaction is told to relinquish the bus and reattempt that transaction at a later time. Since the bus master may reattempt the same bus transaction several times before the sP is ready with the data, the NES Core should filter out repeated attempts: the sP should not be sent multiple copies of the bus transaction's address and control information.

The ability to hold a bus transaction until the sP gives the approval to proceed is not only useful for read-like transactions, but also for bus transactions like SYNC (memory barrier). The sP may need to carry out some actions before the SYNC bus transaction is allowed to complete, in order to maintain the semantics of weak memory models.

Footnote: In more advanced bus protocols that support out-of-order, split address and data buses, such as the XX bus protocol, other mechanisms are potentially available. For instance, the address phase may be allowed to complete with the data phase happening later; during this interval, other bus transactions can begin and fully complete, i.e., both address and data phases finish on the bus. This provision may, however, still be insufficient to prevent deadlocks. In particular, this is unsafe if completing the address phase of a bus transaction means the processor cache takes logical ownership of the cacheline, and will hold back other bus transactions attempting to Invalidate the cacheline until its data phase has completed. Determining whether deadlock can arise involves details that are still not uniform across bus protocol families. There is also close interaction between the snoopy bus and specifics of the directory-based coherence protocol. Delving into this level of detail is beyond the scope of this thesis.

Figure: State transition diagram of the ApprovalReg. (States: Free, Pending, Ready, and Locked; a captured transaction needing approval moves Free to Pending, sP approval moves Pending to Ready, and a matching retried transaction completes and clears the register.)

The NES Core provides an Approval Register (ApprovalReg) to coordinate this process. The ApprovalReg is best thought of as a transient cache entry: space is allocated when a cache-miss is detected, the sP is responsible for filling the cache entry, and the cache entry is freed after a single use. Although our implementation has only one ApprovalReg, more ApprovalRegs can be included to permit more outstanding pending approvals.

When a bus transaction requiring the sP's approval first appears on the bus, and the ApprovalReg is not currently occupied, its details are recorded in the ApprovalReg and also captured into the Capture & Ack queue. The ApprovalReg state changes from Free to Pending, as shown in the figure. Subsequently, the sP gives approval for the transaction to complete by writing the ApprovalReg, changing its state to Ready. If data is involved, this write also supplies the SRAM address where the data should be read from or written to. When the same bus transaction is attempted again, it is allowed to complete. At the same time, the Approval Register is cleared.
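The capture, retry-filtering, and single-use behavior described above can be sketched as a small state machine. This Python model is illustrative only; names are invented, and the Locked state shown in the figure is omitted:

```python
class ApprovalReg:
    """Sketch: Free -> Pending on first capture, Pending -> Ready when the sP
    writes approval (with an SRAM data address), Ready -> Free when the
    retried transaction finally completes (single use)."""
    def __init__(self):
        self.state, self.addr, self.sram = "Free", None, None
        self.captures = []                 # stands in for the Capture & Ack queue

    def bus_txn(self, addr):
        if self.state == "Free":
            self.state, self.addr = "Pending", addr
            self.captures.append(addr)     # sP informed exactly once
            return "retry"
        if self.addr != addr:
            return "table"                 # no match: fall back to the response table
        if self.state == "Pending":
            return "retry"                 # reattempt filtered, no re-capture
        self.state, self.addr = "Free", None   # Ready: complete and clear
        return "complete"

    def approve(self, sram_addr):
        assert self.state == "Pending"
        self.state, self.sram = "Ready", sram_addr

reg = ApprovalReg()
assert reg.bus_txn(0x5000) == "retry"      # first attempt: captured and retried
assert reg.bus_txn(0x5000) == "retry"      # reattempt: filtered
reg.approve(sram_addr=0x80)
assert reg.bus_txn(0x5000) == "complete"   # retried attempt now completes
assert reg.captures == [0x5000]            # sP saw the transaction only once
```

The transient-cache-entry analogy from the text is visible here: allocation on the first miss, fill by the sP, and release after a single matching use.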

Allowing the sP to specify an arbitrary data location (SRAM address) when it gives approval, as opposed to restricting the data to a fixed location, avoids copying if the data has arrived over the network and is buffered in some message receive queue. Because the command to write the ApprovalReg comes from either the Local or Remote command queues, it is possible for a remote sP to return data directly to the NES Core without the involvement of the local sP. The local sP can be notified of such an event simply by setting the acknowledgement option in the command.

Notify sP   Retry   NES behavior
---------   -----   ------------
Yes         Yes     Approval Needed
Yes         No      Notification Only
No          Yes     Retry Only
No          No      Ignore (allow bus transaction to complete)

Table: The four possible NES responses to bus transactions to the sP Serviced Space or Snooped Space, and bus transactions with no associated address.

Service Space Response Table

The NES's response to bus transactions addressed to the sP Serviced space, or bus transactions without an associated address (e.g., SYNC), is determined by a configurable response table, the Srv Space Response Table. The bus transaction type encoding is used as an index into this table to obtain the response. As shown in the table above, four responses are possible, arising from the cross product of two decisions: (i) whether the sP should be notified of the bus transaction, and (ii) whether the bus transaction should be retried. This level of programmability is only marginally useful for the sP Serviced space, since only one or two responses make sense for most bus transaction types. It is included for uniformity of design, because similar programmability is very useful for the Snooped Space support described in the next section.

Nevertheless, the design tries to make this generality as useful as possible. For instance, if the table indicates a response of Ignore to a read-like bus transaction, and there is no match against the ApprovalReg, data is read from a fixed SRAM location, which system software can program with whatever Miss value is desired. This feature allows aP software to determine whether a cache-miss has occurred: if the Miss value contains a bit pattern that is not valid data, software can check the value loaded to see if a cache-miss has occurred.

The response obtained from the Srv Space Response table is used only if a bus transaction does not match that captured in the ApprovalReg. Otherwise, the response is to retry the transaction if the ApprovalReg is in the Pending state, and to allow it to complete if it is in the Ready state.
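The table lookup can be sketched as follows. The table entries below are hypothetical examples of how system software might configure responses, not the actual NES configuration:

```python
# The four responses arise from the cross product of two decisions:
# (i) notify the sP?  (ii) retry the bus transaction?
RESPONSES = {
    (True,  True):  "Approval Needed",
    (True,  False): "Notification Only",
    (False, True):  "Retry Only",
    (False, False): "Ignore",
}

# Hypothetical configuration, indexed by bus-transaction type encoding.
srv_space_table = {
    "READ":  (True, True),    # sP must supply the data: notify and retry
    "WRITE": (True, False),   # capture the write but let it complete
    "SYNC":  (True, True),    # hold until the sP enforces barrier semantics
}

def respond(txn_type):
    notify, retry = srv_space_table.get(txn_type, (False, False))
    return RESPONSES[(notify, retry)]

assert respond("READ") == "Approval Needed"
assert respond("CLEAN") == "Ignore"        # unconfigured type completes freely
```

In the real NES the ApprovalReg match, when present, supersedes this lookup, as the paragraph above describes.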

Snooped Space

The NES Core provides a cacheline-granularity access-permission check for a portion of the SMP's main memory: the Snooped address space. The mechanism enables the sP to selectively observe and intervene in bus transactions to this address space. This capability can be used to implement S-COMA style shared memory, or to allow fast access to the local portion of CC-NUMA shared memory. It is also useful for sP implementation of fancier message passing interfaces, such as CNI, which can only be implemented efficiently if cacheline ownership acquisition is used to trigger processing.

The Snooped Space support shares many similarities with the sP Serviced Space mechanism, but is more complex. The NES response to a bus transaction to this address space is again configurable to any one of the four described in the table above. Generation of this response is more complex, taking into account a cacheline state associated with the bus transaction's address, and maintained in a new memory bank, the clsSRAM (see Figure). Access to this memory is controlled by the Snooped Space unit in the aBIU. To simplify the design, the sP can write, but not read, this state, with commands issued via the Local or Remote command queues.

Other designs that maintain cacheline state bits for DRAM hardwire the interpretation of the state values. In order to allow flexible experimentation with cache-coherence protocols, the StarT-Voyager NES uses a configurable table, the Snooped Space response table, to interpret the state values. Experimental flexibility is also the rationale for the number of state bits per cacheline, as having only four different cacheline states leaves little room for what the four states should mean.

Footnote: With small modifications to the aBIU FPGA, it can also be used to implement NuCOMA style shared memory, described in Section .


Figure: This diagram illustrates the addition of Snooped Space support through the shaded functional block. It also shows the full design of the StarT-Voyager NES.


Figure: This diagram illustrates the addition of Snooped Space support through the shaded functional block, from a device perspective. It also shows the full design of the StarT-Voyager NES.

The response to a bus transaction is determined by using its transaction type and its cacheline state as an offset into the Snooped Space response table. The result is a response encoding one of the four possibilities described in the table above.
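The state-dependent lookup can be sketched as below. The state encodings and table entries are invented for illustration, loosely following a MESI-like reading of the states:

```python
# Hypothetical cacheline-state encodings for the clsSRAM (in the real NES the
# meaning of each state value is itself configurable).
INVALID, SHARED, EXCLUSIVE = 0, 1, 2

# Fragment of an illustrative Snooped Space response table, indexed by
# (transaction type, cacheline state) rather than transaction type alone.
snooped_table = {
    ("READ",  INVALID):   "Approval Needed",  # data must be fetched remotely
    ("READ",  SHARED):    "Ignore",           # local copy valid: use DRAM
    ("RWITM", SHARED):    "Approval Needed",  # must first gain ownership
    ("RWITM", EXCLUSIVE): "Ignore",
}

def snooped_response(txn_type, cls_state):
    return snooped_table.get((txn_type, cls_state), "Ignore")

assert snooped_response("READ", INVALID) == "Approval Needed"
assert snooped_response("READ", SHARED) == "Ignore"
```

The contrast with the Srv Space table is that the same transaction type can produce different responses depending on the per-cacheline state kept in clsSRAM.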

The ApprovalReg described in the previous section is also used for Snooped Space bus transactions. As before, it filters out retries of a bus transaction requiring the sP's approval, so that the sP is only informed once. Whenever the address of a bus transaction matches that in the ApprovalReg, the ApprovalReg's response supersedes that from the Snooped Space response table lookup. Again, a Pending ApprovalReg state retries the bus transaction, while a Ready state allows the bus transaction to complete. In the latter case, the data supplied to a pending bus transaction is obtained from the NES SRAM location indicated by the ApprovalReg. Data for a write transaction is also directly written into that NES SRAM location.

Providing approval through the ApprovalReg is not the only way to end the retry of a Snooped Space bus transaction. An alternative is to modify the cacheline state kept in clsSRAM to a value which allows the bus transaction to complete. In that case, data for the bus transaction is read from or written to SMP main memory. If data is coming from or going to the network, using the ApprovalReg has the advantage of avoiding cycling the data through main memory DRAM.

aP bus transactions initiated by the NES's bus master unit are treated specially by the NES Snooped Space unit. They are allowed to complete regardless of the status of the ApprovalReg, the various response tables, and the relevant clsSRAM cacheline state value.

The use of a configurable response table adds to the latency of response generation. But because the table is small, this increase does not present a serious timing issue in our design. A production system that supports only one or a small number of fixed protocols will not need this flexibility.

The addition of Snooped Space support brings us to the full design of the StarT-Voyager NES. By incrementally adding more functions to the design, this section both provides a detailed view of the NES microarchitecture and illustrates the reuse of existing functional blocks as new capabilities are added to the NES.

Mapping onto Microarchitecture

Now that the NES microarchitecture has been described, this section shows how the macroarchitecture described in Chapter is realized on the microarchitecture.

Physical Network Layer Implementation

The Arctic network closely matches the requirements of the Physical Network layer. The only missing property is an active means of bounding the number of outstanding packets. To keep the design simple, we rely on the maximum buffering capacity of the entire Arctic network to provide this bound, instead of introducing some new active mechanism in the NES. Our system is relatively small, with a target size of nodes; this translates into Arctic routers. Buffering capacity in each Arctic router is also not large. This gives a total network buffering capacity of kilobytes. This bound is low enough to be a useful overflow buffer size estimate for Reactive Flow-control.

Virtual Queues Layer Implementation

The StarT-Voyager NES implements the Virtual Queues Layer and the Application Interface Layer with a combination of NES Core hardware and sP firmware. The division of tasks between hardware and firmware is mostly based on function, with those functions expected to be used infrequently delegated to sP firmware. Some functions, such as those related to Basic and Express message passing mechanisms and their TagOn variants, are performed by both NES Core hardware and the sP.

Resident and Non-resident Queues

The NES Core hardware provides a fast implementation of a limited number of message queues, a number which is meant to capture the typical working set. The sP implements, at lower performance, a much larger number of queues to meet the design goal of supporting a large number of simultaneously active queues. The former, referred to as Resident message queues, act as an sP-firmware-managed cache of the latter, the Non-resident message queues. Switching a logical message queue between Resident and Non-resident resources is a local decision requiring no coordination with other nodes, and is transparent to aP software.

In the case of Resident message queues, the NES Core hardware directly implements the Virtual Queues Layer functions of destination translation and of multiplexing and demultiplexing messages from several hardware message queues onto the Physical Network Layer services. This is fairly straightforward and is described in the microarchitecture Sections through .

To logically support a larger number of message queues, the sP firmware multiplexes the Non-resident queues onto a subset of the Resident queues. Discussion of exactly how the aP interacts with the sP to use Non-resident queues is deferred until the next section, where the Application Interface Layer mapping is described. During transmit, the sP performs destination translation either by physically addressing the packet destination or by changing a translation table entry. Low-level design choices, such as associating a source identity with each translation table entry instead of with each message queue, make it feasible to use the latter approach. Other NES Core functions, such as giving the sP the ability to initiate aP bus transactions to move data between aSRAM and sP DRAM, and to read data (message headers) from aSRAM, are also necessary capabilities.

The NES Core hardware's RQID cache-tag lookup mechanism (Section ) makes it possible to switch a receive queue between Resident and Non-resident status without global coordination. It also channels packets of Non-resident queues to the Miss/Overflow queue, where the sP takes over the task of sorting the packets into their final receive queues.
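The RQID lookup's role as an sP-managed cache can be sketched as follows; the slot count and structures are illustrative, not the hardware's actual organization.

```python
# Sketch of the RQID cache-tag lookup: packets whose receive-queue ID hits a
# Resident slot go to that hardware queue; misses are channeled to the
# Miss/Overflow queue for the sP to sort out.

class RqidLookup:
    def __init__(self, num_resident_slots):
        self.tags = {}                # rqid -> resident slot index
        self.resident = {i: [] for i in range(num_resident_slots)}
        self.miss_overflow = []       # serviced by sP firmware

    def bind(self, rqid, slot):
        # A local decision: no coordination with other nodes is needed.
        self.tags[rqid] = slot

    def unbind(self, rqid):
        self.tags.pop(rqid, None)     # queue becomes Non-resident

    def deliver(self, rqid, packet):
        slot = self.tags.get(rqid)
        if slot is None:
            self.miss_overflow.append((rqid, packet))
        else:
            self.resident[slot].append(packet)

lut = RqidLookup(num_resident_slots=4)
lut.bind(rqid=7, slot=0)
lut.deliver(7, "pkt-a")               # Resident hit
lut.deliver(9, "pkt-b")               # Non-resident: goes to Miss/Overflow
assert lut.resident[0] == ["pkt-a"]
assert lut.miss_overflow == [(9, "pkt-b")]
```

Because the tags are purely local state, rebinding a queue changes only where subsequent packets are steered, which is why the switch is transparent to aP software.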

Reactive Flow-control

Reactive flow-control for preventing deadlock is implemented in sP firmware. For Non-resident queues, it is clear that the sP, which is already handling all the messages, can impose Reactive flow-control, both triggering throttle (and getting out of it) and selectively disabling transmission to a particular destination queue.

For Resident queues, the sP is also in a position to impose Reactive flow-control, as long as the threshold watermark, the point at which throttle is triggered, is larger than the receive queue size. When a Resident receive queue overflows, overflowing packets are diverted to the Miss/Overflow queue serviced by the sP. The sP therefore starts handling the Resident queue's incoming packets as well, and can initiate throttle when necessary.

The NES SRAM buffer space of a Resident receive queue is actually only part of the total buffer space for the receive queue. The overflowing packets are buffered in memory outside NES SRAM, in aP DRAM or sP DRAM. This is also the buffer space for the queue when it operates in Non-resident mode. Ideally, the threshold watermark should match the size of the Resident receive queue's buffer space in NES SRAM, so that as long as a program operates within this receive queue buffering requirement, the sP is never involved and end-to-end message passing latency is not degraded.
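The watermark discipline described in the preceding paragraphs can be sketched as follows; the capacities and thresholds are illustrative, not the NES's actual sizes.

```python
# Sketch of Reactive flow-control for a Resident receive queue: NES SRAM holds
# only part of the total buffer space; overflow is diverted to the sP, which
# triggers throttle once the threshold watermark is reached and lifts it when
# usage drops below the low watermark.

class ResidentReceiveQueue:
    def __init__(self, sram_capacity, threshold, low_watermark):
        assert threshold >= sram_capacity  # sP sees traffic before throttling
        self.sram = []
        self.overflow = []        # packets buffered in aP/sP DRAM by the sP
        self.capacity = sram_capacity
        self.threshold = threshold
        self.low = low_watermark
        self.throttled = False

    def enqueue(self, pkt):
        if len(self.sram) < self.capacity:
            self.sram.append(pkt)  # fast path: sP never involved
        else:
            self.overflow.append(pkt)
            if len(self.sram) + len(self.overflow) >= self.threshold:
                self.throttled = True   # sP initiates throttle

    def dequeue(self):
        pkt = self.sram.pop(0)
        if self.overflow:          # sP refills SRAM as space frees up
            self.sram.append(self.overflow.pop(0))
        if self.throttled and len(self.sram) + len(self.overflow) < self.low:
            self.throttled = False  # sP lifts throttle below low watermark
        return pkt

q = ResidentReceiveQueue(sram_capacity=2, threshold=4, low_watermark=2)
for i in range(4):
    q.enqueue(i)
assert q.throttled and q.overflow == [2, 3]
assert [q.dequeue() for _ in range(3)] == [0, 1, 2]
assert not q.throttled
```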

When a Resident receive queue overflows, the sP programs the NES Core to notify it of subsequent bus transactions accessing this queue's state. (This notification was not implemented in the NES, but it requires only simple modifications to the aBIU FPGA Verilog code.) In this way, the sP can move packets from the overflow queue back into the aSRAM as space frees up, and monitor whether space usage in the Resident receive queue has dropped below the low watermark.

The sP is also involved in implementing selective disable of Resident transmit queues, a function not directly supported in hardware. To selectively disable a destination, the sP modifies the physical destination of its translation table entry, so that packets heading to this logical destination are directed back to the sP itself. When the sP receives such looped-back packets, it buffers them for subsequent retransmission. In this way, the Resident transmit queue, which operates in a strict FIFO manner, can continue to send other packets out to other destinations. In the event that too many packets have looped back, the sP can shut down the transmit queue by writing its queue-enable bit, to avoid running out of loopback packet buffer space.
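A sketch of this loopback-based selective disable follows; the loopback-buffer limit and names are illustrative assumptions.

```python
# Sketch of selective disable: to stop sending to one logical destination
# without stalling the strictly-FIFO Resident transmit queue, the sP rewrites
# that destination's translation-table entry to point back at itself, and
# buffers looped-back packets for later retransmission.

SELF = "sP"   # loopback target

class TransmitPath:
    def __init__(self, translation):
        self.translation = dict(translation)  # logical dest -> physical dest
        self.loopback_buffer = []
        self.sent = []
        self.enabled = True

    def disable_destination(self, logical):
        self.translation[logical] = SELF

    def send(self, logical, pkt):
        if not self.enabled:
            raise RuntimeError("transmit queue disabled")
        phys = self.translation[logical]
        if phys == SELF:
            self.loopback_buffer.append((logical, pkt))
            if len(self.loopback_buffer) > 8:   # illustrative limit
                self.enabled = False            # sP clears queue-enable bit
        else:
            self.sent.append((phys, pkt))

tx = TransmitPath({0: "node-A", 1: "node-B"})
tx.disable_destination(1)
tx.send(0, "x")          # still flows to node-A
tx.send(1, "y")          # looped back to the sP for later retransmission
assert tx.sent == [("node-A", "x")]
assert tx.loopback_buffer == [(1, "y")]
```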

Application Interface Layer Implementation: Message Passing Interfaces

As pointed out in the last section, the message passing interfaces, with the exception of DMA, are implemented with both Resident and Non-resident queues. The Application Interface Layer portion of Resident queues is completely implemented in NES Core hardware, as described in Sections and . The DMA implementation is also described earlier, in Section . This section describes how the interfaces of Non-resident queues are implemented.

A key component of the Non-resident message queue implementation is the mapping of addresses. This determines the visibility of access events to the sP, the level of control the sP exercises over such events, and the efficiency of the message data path.

Non-resident Basic Message Queues

The message queue buffers of Non-resident Basic message queues are mapped to aP main memory; addresses for reading pointer values are mapped to NES SRAM; and addresses used for updating pointers are mapped to the sP Serviced Space. (Because protection is only enforceable at kilobyte granularity, while only a few bytes are really needed for message queue state, the NES SRAM can be designed with a window where each kilobyte page is aliased to only a few bytes of NES SRAM space.) Mapping pointer update addresses to sP Serviced Space has the advantage of relieving the sP from continually polling memory locations holding queue pointers. Instead, captured transactions trigger sP processing. This is crucial, as the sP may otherwise have to blindly poll many locations.

Updates to the producer pointer of a Non-resident Basic message transmit queue cause the sP to attempt a message transmit. If the proxy Resident transmit queue, i.e., the queue onto which Non-resident messages are multiplexed, has sufficient buffer space, the sP issues a bus operation command to read message data from the SMP system bus into the proxy queue's next available buffer space. The sP will require notification of command completion so that it can then read in the message header, translate it, and update the proxy queue's producer pointer to launch the packet into the network. At that time it also updates the emulated queue's consumer pointer to free up transmit buffer space. If the sP is unable to transmit the message immediately because the proxy transmit queue is full, the pending send is recorded and subsequently resumed when transmit space frees up. The sP periodically polls the proxy transmit queue's state to determine if sufficient space has freed up.
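The producer-pointer-triggered emulation can be sketched as follows, with illustrative structures standing in for the proxy Resident queue and the sP's pending-send record.

```python
# Sketch of Non-resident transmit emulation: a producer-pointer update
# captured by the NES wakes the sP, which copies the message into a proxy
# Resident queue and launches it; sends that find the proxy full are
# recorded and resumed when space frees up.

class NonResidentTransmit:
    def __init__(self, proxy_capacity):
        self.proxy = []              # proxy Resident transmit queue
        self.proxy_capacity = proxy_capacity
        self.pending = []            # sends deferred while proxy is full
        self.launched = []

    def producer_update(self, message):
        # Captured bus transaction on the pointer address triggers the sP.
        if not self._try_send(message):
            self.pending.append(message)

    def _try_send(self, message):
        if len(self.proxy) >= self.proxy_capacity:
            return False
        self.proxy.append(message)   # bus-master read of message data
        return True

    def network_drain(self):
        # A packet leaves on the network side; proxy space frees up, and the
        # sP, polling the proxy queue's state, resumes any pending send.
        self.launched.append(self.proxy.pop(0))
        if self.pending and self._try_send(self.pending[0]):
            self.pending.pop(0)

tx = NonResidentTransmit(proxy_capacity=1)
tx.producer_update("m1")
tx.producer_update("m2")             # proxy full: recorded as pending
assert tx.pending == ["m2"]
tx.network_drain()                   # m1 launched, m2 promoted to the proxy
assert tx.launched == ["m1"] and tx.proxy == ["m2"] and tx.pending == []
```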

Non-resident Express Message Queues

The address used to indicate an Express message transmit is mapped to sP Serviced space. This is needed for implementing FIFO semantics, i.e., a store has further effects beyond merely overwriting the data in a fixed memory location, and it allows event-driven sP processing. Similarly, the address for receive polling is also mapped to sP Serviced space. Buffer space for the non-tagged-on portion of an Express message is provided in sP DRAM, while buffer space for the tagged-on data is mapped to aP DRAM.

Mapping the Express receive polling address to sP Serviced space provides the required functionality, but the aP incurs fairly long latency when polling for a message. A design which takes the sP off the latency path of providing the polled data is highly desirable. This is tricky, however: although the sP can deposit the next message at the address that aP software polls from, a mechanism is needed to ensure that the aP software sees this message only once. The generality of OnePoll also complicates this, as several different polling addresses could all receive from the same receive queue.

If OnePoll is limited to only the high and low priority receive queues of a typical Express message Endpoint, and a simple enhancement is made to NES Core hardware, a lower latency alternative is possible. The required enhancement enables the Snooped Space support to not only read the clsSRAM state bits but also modify them. A new configurable table could be introduced to determine the next clsSRAM value based on the current clsSRAM value. With the Express message receive address mapped to a Snooped Space location, the sP can get the next answer ready there, so that an aP can poll (i.e., read) it from DRAM. Because the aP needs to read this cacheline exactly twice (assuming the aP uses only fixed-width reads to receive Express messages), the automatic clsSRAM value transition will need to go from one which generates an Ignore response (written by the sP), to a second one that Notifies the sP (automatic), to a third which Retries without Notifying the sP (automatic).

This enhancement scheme works even when OnePolling is from both receive queues of an endpoint, because the address for polling only one receive queue and that for polling both receive queues in an endpoint all fall within the same cacheline. Unfortunately, the more general form of OnePoll does not work in this enhanced scheme. Although this new scheme improves the latency of a single Express receive poll, the minimum interval between two polls, and hence the maximum throughput, is still constrained by sP processing.
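The proposed automatic transition can be sketched as a tiny state machine; the state names and table contents are illustrative assumptions, not the actual clsSRAM encoding.

```python
# Sketch of the proposed clsSRAM auto-transition for low-latency Express
# polling: each aP read of the polled cacheline advances its state, so the
# aP's (exactly two) reads of a deposited message are supplied, and further
# reads are retried until the sP deposits the next message.

TRANSITIONS = {                  # current state -> (response, next state)
    "fresh":     ("ignore", "seen_once"),   # first read: supply data
    "seen_once": ("notify_sp", "stale"),    # second read: data + notify sP
    "stale":     ("retry", "stale"),        # further reads retried, no notify
}

class PollLine:
    def __init__(self):
        self.state = "stale"

    def sp_deposit(self):
        self.state = "fresh"     # sP writes the next message and resets state

    def ap_read(self):
        response, self.state = TRANSITIONS[self.state]
        return response

line = PollLine()
assert line.ap_read() == "retry"     # nothing deposited yet
line.sp_deposit()
assert line.ap_read() == "ignore"    # first of exactly two reads
assert line.ap_read() == "notify_sp"
assert line.ap_read() == "retry"     # a message cannot be seen twice
```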

Application Interface Layer Implementation: Shared Memory Interfaces

Coherent shared memory implementation on StarT-Voyager relies heavily on the sP's involvement. For instance, as described below, the sP is involved in servicing all cache misses under both CC-NUMA and S-COMA style shared memory. The sP implements most of the functions of the cache protocol engine and the home protocol engine.

The basic NES Core hardware mechanisms that enable the sP to implement cache-coherent distributed shared memory are described in Sections and . It is beyond the scope of this thesis to design and implement complete coherence protocols. In this section we sketch out a cache-miss servicing example for S-COMA style shared memory, to illustrate how coherence protocols can be put together.

In this example, an S-COMA style cacheline miss results in fetching data from the home node. (We use the term cache miss broadly to mean not having the correct access permission; i.e., acquiring write permission is considered a cache miss.) The NES cacheline state and the response table entries are set up so that a read transaction to an absent cacheline is retried while the sP is notified.

[Figure: the numbered steps of the miss, among the aP, L2 cache, NES Core, sP, and main memory at the requesting and home nodes.]

Figure : S-COMA style processing of a READ cache miss on StarT-Voyager. In this example, clean data is available in the home node.

As described in Section , when such a bus transaction is attempted, the NES Core hardware notifies the sP of the event by sending it details of the bus transaction through the Capture & Ack queue. It also records details of the bus transaction in the ApprovalReg (Step in Figure ).

The top-level sP code is an infinite loop in which the sP polls for events from the Capture & Ack queue, the overflow queue, and other message receive queues used for inter-sP communication. OnePoll is used to improve polling efficiency. The sP may also need to monitor for space in transmit queues if there are pending message transmits.

When the sP reads the captured read transaction information (Step in Figure ) and decodes it as an S-COMA space read cache miss, it translates the captured physical address into a global address and a home node number. Using the Express message mechanism, the sP sends a request message to the home node (Step ). This message could be sent through a Local Command queue or another Express message queue. The choice will be the former if the protocol relies on ordering between this new request and earlier messages sent through the Local Command queue. The sP will typically also need to keep some transient information about the pending bus transaction, in order to deal with incoming commands which may be directed at the same cacheline address.

Action now shifts to the home node. The home node sP receives this message when it polls for new events in its top-level polling loop (Step a). This sP looks up the cacheline's directory information to decide what to do. In our example, the directory state indicates that the cacheline is clean at home; consequently, data can be supplied directly from the home node. To achieve this, the home sP first translates the global address into a local physical address, and then issues a bus master command (Step ) to transfer data from that local SMP system bus physical address into NES SRAM (Step ). Through the same Local Command queue, the sP also issues an ExpressTagOn message to send that data to the requesting sP. Ordering imposed by NES Core hardware between bus master commands and Express/ExpressTagOn message commands ensures that the second command is not executed until the data fetched by the first is in NES SRAM.

While waiting for this reply, the sP at the requesting node modifies the clsSRAM state of the cacheline with the pending transaction to one which allows reads (Step b). Because the ApprovalReg still records the pending transaction, and its response dominates, the READ bus transaction is still unable to proceed. Once the reply comes in (Step ), the sP moves the data from the receive queue buffer into the appropriate aP DRAM location (Step ), using a bus master command issued through a Local Command queue. It then clears the ApprovalReg, and a retried READ bus transaction can now complete (Step ).
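The numbered steps can be condensed into a sketch like the following, with the requesting-node and home-node sP handlers collapsed into one routine. All structures, address values, and the directory encoding are illustrative.

```python
# Sketch of the S-COMA READ-miss service sequence described above, reduced
# to straight-line steps for the clean-at-home case.

def service_read_miss(requester, home, phys_addr):
    log = []
    # Requesting node: captured READ transaction decoded as an S-COMA miss.
    gaddr = requester["translate"][phys_addr]          # global address
    log.append("req: capture miss, send request to home")
    # Home node: directory says the line is clean at home.
    assert home["directory"][gaddr] == "clean"
    local = home["translate"][gaddr]                   # local physical address
    data = home["memory"][local]                       # bus-master fetch
    log.append("home: fetch into NES SRAM, ExpressTagOn reply")
    # Requesting node: stage the data, then clear the ApprovalReg so the
    # retried READ completes from aP DRAM.
    requester["memory"][phys_addr] = data
    requester["approval_reg"] = None
    log.append("req: data placed, ApprovalReg cleared, READ completes")
    return data, log

req = {"translate": {0x100: "G7"}, "memory": {}, "approval_reg": 0x100}
hom = {"directory": {"G7": "clean"}, "translate": {"G7": 0x900},
       "memory": {0x900: b"cacheline"}}
data, log = service_read_miss(req, hom, 0x100)
assert data == b"cacheline" and req["approval_reg"] is None
assert req["memory"][0x100] == b"cacheline"
```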

Discussion

The above scenario involves the sP three times: twice at the requesting node and once at the home node. The last involvement of the sP can be removed, or at least taken off the latency-critical path of returning data to the pending bus transaction. To achieve this, the home sP sends the reply into the Remote Command queue using a reply message similar to a DMA packet. The requesting sP can be informed of the reply using the acknowledgement option on this bus command.

Heavy sP involvement in cache-miss processing has both advantages and disadvantages. The main disadvantage is relatively long cache-miss latencies; therefore, shared memory performance on this machine is very sensitive to cache-miss rate. The advantage is flexibility in how the protocols are implemented, including the option of trying out less common ways of maintaining coherence which may improve cache-miss rate. Given the tradeoffs, this design makes sense for an experimentation platform. An actual workhorse design should include more custom hardware support to reduce sP involvement during cache-miss processing.

An interesting question is what other NES Core mechanisms can be provided to reduce sP involvement in cache-miss processing without drastically affecting the flexibility of using different coherence protocols. One avenue is to remove the duty of the cache protocol engine from the sP, or at least to involve the sP only in difficult but rare cases when the ordering of events is ambiguous. We need to experiment with a number of protocols using the existing StarT-Voyager NES design before we can answer that question concretely.

[Figure: the NES main board holds the NES Core — the aBIU and sBIU bus interface units, dual-ported aSRAM and sSRAM (32 kB each), clsSRAM (512 kB), the RxFIFO and TxFIFO, and the RxU/TxU datapath (35 MHz) — and the sP subsystem, with a PowerPC 604 sP, memory controller, and sP DRAM. The Arctic interface (37.5 MHz) sits on the NES daughter card, which connects to the Arctic network.]

Figure : Major components of the StarT-Voyager NES.

NES Hardware Implementation

The StarT-Voyager NES is physically separated into two printed circuit boards, the NES main board and the NES daughter card, as shown in Figure . Logic specific to the network is isolated in the daughter card, so that porting the design to another network involves replacing only the daughter card.

The main board contains two major portions: the sP subsystem and the NES Core. The former consists of a PowerPC microprocessor employed as an embedded processor, and its own memory system, comprising the commercially available MPC memory controller and ordinary DRAM.

    Component    Implementation Technology                      Comments
    NESCtrl      ChipExpress CX Laser Programmable Gate Array   approximately k gates plus kilobits of RAM
    aBIU         Xilinx XCXL, speed grade                       of CLBs used
    sBIU         Xilinx XCXL, speed grade                       of CLBs used
    TxU/RxU      Xilinx XCXL, speed grade                       of CLBs used

Table : Custom-designed hardware in the StarT-Voyager NES.

The NES Core includes a control block (NESCtrl) implemented with a ChipExpress LPGA, two Bus Interface Units (aBIU and sBIU) implemented in Xilinx FPGAs, and two banks of dual-ported SRAM. One bus interface unit, the aBIU, interfaces to the SMP's memory bus, while the other, the sBIU, interfaces to the sP subsystem's memory bus.

Porting this design to a different SMP will require changes to the aBIU, and possibly to sP firmware, but the other parts of the NES hardware can remain unmodified. Originally, a single LPGA was to contain the NESCtrl, aBIU, and sBIU. This introduced pin-count constraints, so that only address and control signals from the two external buses are directly wired to the BIUs; the data portion of the buses connects directly to the dual-ported SRAM banks only. We subsequently moved the aBIU and sBIU into large FPGAs, which no longer have the pin-count constraints, but we left this design unchanged.

The main datapath through the NES main board is eight bytes wide. (There are eight more parity bits, but we will ignore parity in this discussion.) Each of the dual-ported SRAM banks, aSRAM and sSRAM, has 32 kilobytes of storage, provided by eight byte-sliced IDT dual-ported synchronous SRAM memory chips. Five IDT hardware FIFO chips, each bits wide and deep, are used to implement the RxFIFO and TxFIFO.

Design Flow

The main custom-designed hardware components in the StarT-Voyager NES are the NESCtrl, aBIU, sBIU, and TxU/RxU. Design of these components was done in Verilog at the Register Transfer Level (RTL), and then automatically synthesized, placed, and routed with CAD tools.

The NESCtrl LPGA, designed by Daniel Rosenband, is synthesized with Synopsys's Design Compiler and then placed and routed by ChipExpress with proprietary tools. The TxU/RxU FPGA, designed by Michael Ehrlich, is synthesized with Synopsys's FPGA Compiler and then placed and routed with Xilinx's M tool set. The aBIU and sBIU FPGAs, designed by me, are synthesized using Synplicity's Synplify, and again placed and routed with Xilinx's M tool set.

Functional simulation and testing is done using the Cadence Verilog-XL environment, augmented with our own C code to model the aP and sP processors and DRAM. The C code interacts with Verilog through Cadence's PLI (Programming Language Interface). Verilog code implements the bus interface portion of the processors, while C code models the rest of the processors. As for memory, the Verilog code implements the memory controls, but the actual data storage elements are modeled in C code. Functional testing is done with hand-coded test examples; these include tests targeting specific functions as well as random combinations of tests.
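As an illustration of keeping bulk data storage in software while the Verilog models control, a sparse memory model of the kind commonly used in such co-simulations might look like the following. The page size and zero-fill default are assumptions for the sketch, not details of our C model.

```python
# Sketch of a sparse memory model: only pages that are actually touched are
# allocated, so a full physical address space stays affordable to simulate.

PAGE = 4096

class SparseMemory:
    def __init__(self):
        self.pages = {}                    # page number -> bytearray

    def _page(self, addr):
        return self.pages.setdefault(addr // PAGE, bytearray(PAGE))

    def write(self, addr, data):
        for i, b in enumerate(data):
            self._page(addr + i)[(addr + i) % PAGE] = b

    def read(self, addr, n):
        return bytes(self._page(addr + i)[(addr + i) % PAGE]
                     for i in range(n))

mem = SparseMemory()
mem.write(0x1FFE, b"\xAA\xBB\xCC\xDD")     # write straddling a page boundary
assert mem.read(0x1FFE, 4) == b"\xAA\xBB\xCC\xDD"
assert mem.read(0x5000, 2) == b"\x00\x00"  # untouched memory reads as zero
assert len(mem.pages) == 3                 # only touched pages are allocated
```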

Chapter

Evaluations

This chapter describes simulation evaluation results obtained with microbenchmarks. The results fall into three categories. The first category evaluates the StarT-Voyager NES's support for multiple message passing mechanisms. The goal is to demonstrate that additional mechanisms, layered on top of a basic set of capabilities, improve performance by taking advantage of specific communication characteristics, such as very small or very large message sizes. The second category examines the multiple message queues support, quantifying the cost of supporting a large number of message queues in the StarT-Voyager design. The third category evaluates the performance of the off-the-shelf sP, providing both absolute performance numbers for representative communication operations and an understanding of the factors limiting its performance.

Although actual hardware was designed and built as part of this research work, it was not available in time for the evaluation reported here. Instead, the experiments presented were conducted on a simulator, StarT-sim, described in the next section. Although we were unable to validate the simulator against an actual working system, due to the late availability of working hardware, we expect the conclusions drawn from our results to hold. This is because we base our conclusions on relative performance relationships, which still hold even if there are inaccuracies in the simulator, since those affect the performance numbers uniformly.

Evaluation Methodology

StarT-sim is an execution-driven simulator: it executes the simulated application during each simulation run, so that the program can respond to run-time conditions. Thus it is possible for simulation of the same program to follow different execution paths when run-time conditions, such as memory access or message passing latency, vary. Written in C, StarT-sim runs in a single process and has two major components: (i) a processor core simulator, AugRSk, and (ii) a memory system, NES, and network simulator, Csim.

Processor Core Simulation

The processor core simulator, AugRSk, operates in an event-driven fashion and is capable of modelling a multiprocessor system. In the interest of faster simulation speed, the processors are not simulated in lockstep, cycle by cycle. Instead, as long as a processor is executing instructions that do not depend on external events, it advances its local time and continues executing more instructions. In our system, this means that simulation of a processor only suspends at a memory access instruction. At that point, an event to resume simulating the processor when global time reaches its local time is enqueued in the global event queue.
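This scheduling discipline can be sketched as follows. The instruction costs and the memory latency are illustrative; AugRSk's actual cost model, described below, is richer.

```python
# Sketch of event-driven processor simulation: each processor runs ahead on
# its local clock until it reaches a memory access, then parks a resume
# event in the global event queue at local time + memory latency.
import heapq

def simulate(programs, mem_latency=10):
    # programs: per-processor lists of ("alu",) or ("mem",) instructions
    events = [(0, pid, 0) for pid in range(len(programs))]  # (time, pid, pc)
    heapq.heapify(events)
    finish = {}
    while events:
        gtime, pid, pc = heapq.heappop(events)  # global time advances
        local = gtime
        prog = programs[pid]
        while pc < len(prog) and prog[pc][0] == "alu":
            local += 1                          # one cycle per non-memory op
            pc += 1
        if pc < len(prog):                      # suspend at the memory access
            heapq.heappush(events, (local + mem_latency, pid, pc + 1))
        else:
            finish[pid] = local
    return finish

# Two processors: ALU-only work never re-queues; a memory access costs
# mem_latency extra cycles before the processor resumes.
done = simulate([[("alu",), ("alu",)], [("alu",), ("mem",), ("alu",)]])
assert done[0] == 2
assert done[1] == 1 + 10 + 1
```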

Processor simulation is achieved by editing the assembly code of the simulated program with an extra custom compilation pass. Memory access instructions are replaced with calls to special routines, which enforce causal order by suspending and resuming simulation as appropriate. These routines also make calls to model the caches, the NES, and the network. Code is also added at basic block exit points to advance local processor time.

StarT-sim models the latencies of memory access instructions precisely. These are determined dynamically, taking into account cache hit/miss information and the latency and contention effects in the memory system and the NES. StarT-sim's modelling of processor core timing is approximate: it assumes that each non-memory-access instruction takes one processor cycle. There is no modelling of superscalar instruction issue, data-dependence-induced stalls, or branch prediction and the cost of misprediction. Modelling the processor more accurately is highly desired, but it requires an effort that is beyond the scope of this thesis.

The processor simulator was based on the Augmint simulator, which models x86 processors. Initial effort to port it to model the PowerPC architecture was done at IBM Research. We made extensive modifications to complete the port and to model the StarT-Voyager system accurately.

Memory System, NES, and Network Simulation

The second component of StarT-sim, Csim, is a detailed RTL (Register Transfer Level) style model of the processor cache, memory bus, NES, and Arctic network, written in C. Simulation in Csim proceeds in a cycle-by-cycle, time-synchronous fashion. Modelling is very detailed, accounting for contention and delays in the NES and the memory system. The NES portion of this model is derived from the Verilog RTL description used to synthesize the actual NES custom hardware.

StarT-sim models two separate clock domains: a processor clock domain and a bus clock domain. The former can be any integral multiple of the latter, and is set to four times in our experiments, to reflect the processor-core-to-bus-clock ratio of our hardware prototype. This ratio is also common in systems sold today; for example, a typical system with a MHz Pentium II processor employs a MHz system bus. The processor core, load/store unit, and cache are in the processor clock domain, while the memory bus, the NES, and the network operate in the bus clock domain. The performance numbers reported in this chapter assume a processor clock frequency of MHz and a bus clock frequency of MHz.

StarT-sim also models address translation, including both the page and block address translation mechanisms found in the PowerPC architecture. Our simulation environment provides a skeletal set of virtual memory management system software routines to allocate virtual and physical address ranges, track virtual-to-physical address mappings, and handle page faults. (Csim is mainly the work of Derek Chiou; I wrote code to model some of the NES functions and assisted in debugging Csim.)

Multiple Message Passing Mechanisms

This section presents a detailed microbenchmark-based performance evaluation of the Resident versions of Basic message, Express message, ExpressTagOn message, and DMA. Performance is measured in terms of bandwidth, latency, and processor overhead. The results show that while Basic message presents a versatile message passing mechanism, adding thin interface veneers on top of it to support the other message passing mechanisms significantly enhances performance and functionality for certain types of communication. Together, the collection of mechanisms is able to offer better performance over a range of communication sizes and patterns than is possible with Basic message alone. A synopsis of specific instances follows.

NES hardware-supported DMA presents a significant advantage for block transfers beyond about a kilobyte. If such transfers are done using Basic messages, the overhead that aP code incurs while copying message data is the dominating performance constraint. Moving messages through the aP registers is an unnecessarily convoluted data path in this case. This inefficiency imposes performance limits that are significantly lower than those under hardware DMA: bidirectional block transfer bandwidth under Basic message is only about a third of that under hardware DMA. Hardware DMA also brings with it the equally significant advantage of offloading the data packetization overhead from the aP.

When message latency is critical and the message payload is very small, as often happens with urgent control messages, Express message is the best option. Intra-node handshake overhead between aP software and the NES is a significant component of latency; in StarT-Voyager, this amounts to a substantial fraction of the overall latency for the smallest Basic message. This ratio is expected to be even higher for future systems, because the latency through the StarT-Voyager NES and Arctic network is not the best possible. The NES is a board-level design, which incurs latencies that an integrated ASIC can avoid. Newer networks, like SGI's Spider, have significantly lower latency than Arctic. With Express message, the cost of the intra-node handshake is reduced drastically, so that overall latency is markedly lower than that for Basic message.

    Message Passing Mechanism    Bandwidth (MBytes/s)
    Express Message
    Basic Message
    ExpressTagOn Message
    NES Hardware DMA

Table : Bandwidth achieved with different message passing mechanisms to transfer kBytes of data from one node to another.

When multicasting a message, TagOn message offers the best mechanism, by avoiding redundantly moving the same data to the NES multiple times. Through more efficient intra-node data movement and control exchange between the aP and the NES, TagOn message both reduces the aP processor overhead and improves throughput. TagOn's interface is also thread-safe, and is thus a better option if a message queue is shared between multiple threads.

The remainder of this section (Sections through ) presents details of the benchmarks, the performance numbers, and explanations of these numbers.

Bandwidth

Table shows the bandwidth achieved using the four message passing mechanisms. As expected, NES hardware-implemented DMA achieves the highest bandwidth, while Express message delivers the lowest bandwidth. The bandwidths for Basic and ExpressTagOn are effectively the same.

The bandwidth advantage of NES hardware DMA is even more pronounced when bidirectional transfer bandwidth is considered, i.e., when two nodes are simultaneously exchanging data with each other. Hardware DMA sustains a combined bandwidth of MBytes/s for kByte bidirectional block transfers. The corresponding bandwidth for the other three transfer mechanisms is actually lower than their respective unidirectional transfer bandwidths, since the aP has to multiplex between message send and receive actions.

Microbenchmark Details

Unidirectional bandwidth is measured with different send and receive nodes. The time taken to transfer kBytes of data from the sending node's main memory to the destination node's main memory is measured, and bandwidth is derived accordingly. All benchmarks are written in C.

For NES hardware-supported DMA, the initiating aP sends its local sP an Express-TagOn message with details of the node-to-node transfer request. This sP coordinates with the remote sP to translate virtual addresses into physical addresses and set up the NES DMA hardware. From that point on, all the data transfer tasks are handled by NES hardware, which also notifies the destination sP when the transfer completes.

In the Basic Message block transfer microbenchmark, the source aP is responsible for packetizing and marshalling data into transmit buffers. It also appends a destination address to each packet. Because each packet is self-identifying, the destination aP maintains almost no state for each in-progress DMA. The destination aP is responsible for copying data from receive queue buffers into the target main memory locations. A similar scheme is used by the Express-TagOn block transfer microbenchmark.

Carrying the destination address in each packet simplifies the destination aP's job, but consumes a portion of the total payload. An alternate design avoids this overhead by having the destination node generate destination addresses. This, however, requires establishing a connection between the source and destination node. Furthermore, to avoid the overhead of sequence numbers and connection identification in each packet, the data packets have to arrive in the sent order, and only one connection can exist between each source-destination pair at any time.

This connection-based approach is used in the Express message block transfer microbenchmark. The extremely small payload in Express message makes it impractical for each packet to be self-identifying. By adopting the connection approach, the

Block Transfer Size        Bandwidth (MBytes/s)
kBytes
kBytes
kBytes

Table: Measured NES hardware DMA bandwidth as transfer size changes. Bandwidth improves with larger transfers because of fixed setup overhead.

benchmark can transfer data bytes in each packet. These packets all belong to the same order set, to ensure the desired arrival order. Each Express message has room to carry a few more bits of information, which are used to identify the connection number and the packet type (connection setup vs. data packets).

Bandwidth Limiting Factors

The NES hardware DMA bandwidth is limited by a combination of the Arctic network bandwidth limit and the NES DMA architecture. Each NES hardware DMA data packet carries bytes of data and bytes of aP bus operation command. There is a further bytes of overhead imposed by the Arctic network. Thus, data takes up only part of each packet. Consequently, although Arctic's link bandwidth is MBytes/s, our DMA scheme is bounded by a maximum bandwidth of MBytes/s. The overhead of setting up the NES DMA hardware is responsible for bringing that number down to the measured MBytes/s for kByte transfers. Because this overhead is constant regardless of the transfer size, the measured bandwidth improves with larger transfer sizes, as illustrated in Table.

We can model the time taken for an n-byte transfer as nx + c, where x is the per-byte cost and c is the fixed overhead. Using the timing for kByte transfers, we arrive at values for x and c. As n → ∞, the achieved bandwidth is

    lim_{n→∞} n/(nx + c)  =  lim_{n→∞} 1/(x + c/n)  =  1/x

which gives the asymptotic bandwidth in MBytes/s. c is the fixed coordination overhead.

The other three message types all share the same bandwidth-limiting factor: the aP's processing overhead. The handshake between the aP and the NES, and marshalling data through the aP to the NES, limit the bandwidth attained. Much of this is due to aP bus access cost. The next paragraph justifies this claim in the case of the Basic Message.

Table in Chapter estimates the number of bus transactions for each Basic Message packet. Under steady state in the block transfer microbenchmark, the handshake aggregation feature of Basic message is not expected to help reduce cost, neither in obtaining a free buffer nor in updating the transmit queue producer pointer. This is supported by the fact that Express-TagOn achieves the same (in fact slightly better) bandwidth as Basic Message; the two mechanisms have almost identical overheads once aggregation is not done in Basic Message.

A maximum-size Basic message (cache lines) is therefore expected to incur about bus transactions, which occupy the bus for bus clocks. This is close to the one-packet-every-few-bus-clocks throughput inferred from the bandwidth of the Basic Message block transfer benchmark. The difference between this number and the memory bus occupancy number is due to other aP processor overhead: buffer allocation and deallocation, and data marshalling.

Message Passing Mechanism    Minimum Size Message Latency (pclk | µs @ MHz pclk)
Express Message
Express-TagOn Message
Basic Message

Table: One-way message latency for minimum size Express, Express-TagOn and Basic messages. Numbers are reported both in processor clocks (pclks) and in microseconds (µs), assuming a processor clock of MHz. The latency is measured from the point when sender aP software begins execution to the moment the receiver aP software reads in the message data. It includes the software cost of allocating sender buffers. The reported numbers are for communication between nearest neighbors; each additional hop on the Arctic network adds pclks under no-contention situations.

Latency

Latency is measured by ping-ponging a message between two nodes a large number of times, averaging the round-trip time observed by one node, and then halving that to obtain the one-way message latency. We are only interested in the latency of short messages that fit within one packet, since the latency of multi-packet messages is reflected in the bandwidth benchmarks described previously. For this reason we present data for Express, Express-TagOn and Basic messages, but not for hardware DMA.

Table lists the one-way latency of sending a minimum size message under different message passing mechanisms.² The result shows that Express Message achieves latency lower than that of Basic Message, while Express-TagOn incurs the longest latency due to its limited message size options.

As the message size increases from two to twenty words, the latency of Basic messages increases in a step fashion. Each step coincides with the message buffer crossing a cache-line boundary, between and words and then again between

² This may appear to be an unfair comparison, since the minimum message size of Express-TagOn is larger than those of Express and Basic messages, owing to its paucity of message size options. The intent of this table is, however, to capture the minimum latency incurred when the payload is extremely small.


Figure: One-way latency for Express (emsg), Express-TagOn (tmsg) and Basic (bmsg) messages as message size varies from to words.

Latency components                       Express      Express-TagOn   Basic
                                         (proc clks)  (proc clks)     (proc clks)
aP software send
Delay through source NES
Delay through Arctic network switch
Delay through destination NES
aP software receive
Total

Table: Breakdown of end-to-end latency of minimum size Express, Express-TagOn and Basic messages. The aP software components include latency on the memory bus.

and words. This is illustrated in Figure, which also shows Express-TagOn message latency. Express-TagOn message latencies also have a grouping pattern determined by cache-line boundary, but owing to its different interface, the cache-line boundary for Express-TagOn message occurs between message sizes of and words.

Table provides the breakdown of the end-to-end latency for minimum size messages. aP software overhead, including moving the data to and from the NES, dominates the latency of Express-TagOn and Basic messages. Of the two, Express-TagOn incurs higher software overhead since its buffer management is more complex: it has to deal with the buffer space used for the TagOn data in addition to that for its Express-style header portion. Together, these numbers clearly show that intra-node overhead is a serious issue for message passing. With Express message, we are able to pare this down to the point where the latency through StarT-Voyager's hardware components dominates.

The absolute latency through the NES and Arctic is quite substantial for all three message types; it accounts for the greater part of Express message latency. There are three reasons for this, all of which can be overcome with better hardware implementation technology.

Firstly, the processor core to bus clock ratio greatly magnifies any NES or network latency. This can be improved by using better silicon technology in the NES

Message Passing Mechanism    Tx Processor Overhead         Rx Processor Overhead
                             (pclk | µs @ MHz pclk)        (pclk | µs @ MHz pclk)
Express Message
Express-TagOn Message
Basic Message

Table: Processor overhead for a minimum size message of one word.

and the network, to bring their clock speeds closer to that of the processor. Secondly, our NES design is a loosely integrated board-level design. Loose integration results in our NES message data path crossing devices between the Arctic network and the aP system bus. Each chip crossing adds at least two cycles to latch in and latch out the data. Implementing the entire NES Core in a single ASIC will improve this. Lastly, parts of our NES are designed for fairly slow FPGAs. Even though faster FPGAs are eventually available, these portions have not been redesigned to reduce their pipeline depths. Based on manual examination of the Verilog RTL code, we expect the pclk (NES clock) delay through the current NES to be reduced substantially if a reimplementation is done.

Processor Overhead

We now examine the processor overhead for the various message types. The processor overhead is related to, but not exactly the same as, the aP software components of the latency path: the former measures occupancy while the latter measures critical path.

Table reports the processor overheads incurred for minimum size messages. More detailed breakdowns for these numbers are shown in the two tables that follow. The processor overhead for messages with sizes ranging from to words is reported in Figure; this shows a significantly higher processor overhead increment when the message size increase crosses a cache-line boundary.

The processor overhead for Express message is significantly lower than those for the other message types. This is especially true for message transmit. This is achieved


Figure: Message transmit (tx) and receive (rx) processor overhead for Express (emsg), Express-TagOn (tmsg) and Basic (bmsg) messages as message size varies from to words.

Processor overhead components    Express      Express-TagOn   Basic
                                 (proc clks)  (proc clks)     (proc clks)
Data copying, active cycles
Data copying, stalled cycles
Bookkeeping
Total

Table: Breakdown of processor overhead for sending minimum size Express, Express-TagOn and Basic messages.

Processor overhead components    Express      Express-TagOn   Basic
                                 (proc clks)  (proc clks)     (proc clks)
Reading data, active cycles
Reading data, stalled cycles
Bookkeeping
Total

Table: Breakdown of processor overhead for receiving minimum size Express, Express-TagOn and Basic messages.

by bundling all the off-chip accesses into an uncached transaction so that off-chip access cost is minimized. As reflected in Table, off-chip access cost, which shows up as stalled processor cycles caused by cache miss and cache pushout, dominates the processor overhead for the other message types.

Processor overhead incurred during message receive is again dominated by off-chip access, particularly the latency of reading from the NES. It is very difficult to reduce this read latency, which shows up as processor stall cycles, unless prefetching is done. Though a good idea, prefetching is not easy when message receive code is integrated into real programs; as such, our microbenchmarks do not include prefetching. Among the different message types, Express message still manages to achieve the lowest processor overhead because it does not require any bookkeeping by the aP software.

Next, we present statistics on the processor overhead of sending multicast messages under the Basic and Express-TagOn message passing mechanisms. The overhead is significantly lower with Express-TagOn as the number of multicast destinations increases, since the aP specifies the content of the multicast message to the NES only once. For a payload of bytes, Basic message incurs an overhead of processor clocks per destination. Using Express-TagOn, the cost for the first destination is processor clocks, but each additional destination incurs only a fraction of that.

Finally, if multiple threads share a message queue, Basic message will require the threads to coordinate their usage with mutex locks. This adds processor overhead. To get an idea of this cost, we implemented user-level mutex locks with the load-reserve and store-conditional pair of atomicity instructions available on PowerPC processors, and measured the cost on a real system. Even in the case where there is no contention, it costs about processor clocks to take or to release a lock. The high cost is partly due to serialization of execution within the PowerPC processor, and partly due to the bus operations triggered by these instructions. Bracketing a message send or receive code sequence with a lock acquire and release will therefore add about processor clocks of processor overhead. In contrast, the uncached write to send an Express or Express-TagOn message is inherently atomic, and hence no locking is needed. Atomicity on the receive side can similarly be achieved using an uncached read.

Sharing a Basic Message queue among concurrent threads not only adds the overhead of using mutex locks to coordinate message queue usage, but also introduces thread scheduling headaches. A thread that has obtained the right to use a message queue will block other threads if it is suspended, say due to a page fault, before completing its usage of the message queue.

Multiple Message Queues Support

This section provides a quantitative evaluation of the virtual message queues idea as implemented with our Resident/Nonresident strategy. First, the impact of message queue virtualization and of supporting a moderate number of hardware queues is quantified. We show that the added latency from queue name translation and the slightly more complex logic is immaterial: it is very small compared to the overall message passing latency today.

Next, the performance of Nonresident Basic message is presented and compared to Resident Basic message performance. The evaluation shows that the latency of the Nonresident implementation is at worst five times longer than the Resident implementation, while bandwidth is reduced by two thirds. This level of performance is perfectly adequate as a backup, although it is not as good as we had anticipated. The message send portion of the Basic message interface presents many challenges to efficient emulation by the sP; other message send interfaces, notably those of Express and Express-TagOn messages, are more conducive to sP emulation.

Despite not living up to our expectations, the Nonresident Basic message performance is still very respectable. The one-way latency of µs is not unlike those reported for Myrinet. Myrinet performance varies depending on the host system used and the interface firmware running on the LANai, its custom-designed embedded processor. Numbers reported by various researchers indicate one-way latency as low as a few µs. However, well known user-level interfaces such as AM-II and VIA incur longer one-way latencies.

Although the Nonresident bandwidth may look low, this is achieved with the sP performing send and receive of messages with only a small per-message payload. If all we care about is how fast the sP can implement block transfers, other methods presented in Section achieve substantially higher bandwidths.

An insight we gained from this evaluation is the importance of handling the most common communication operations completely in hardware, instead of involving firmware. The relatively long latency of many well known message passing NIUs/machines, such as Myrinet and SP, is due in large part to using firmware to handle every message send and receive operation. It is common to attribute their long latency to their location on the I/O bus. While that is partly responsible, firmware is by far the biggest culprit.

Component                    Latency Penalty (NES clks)
TxQ state load
Destination translation
TxQ state writeback
RQID lookup
RxQ state load
RxQ state writeback

Table: The latency penalty incurred by Resident message queues in order to support multiple virtual message queues. The message queue state writeback penalties are zero because they are not on the latency critical path.

Performance Cost to Resident Message Queues

Message queue virtualization as implemented in StarT-Voyager imposes two types of performance penalty on hardware message queues: (i) the cost of supporting a moderately large number of hardware message queues, and (ii) the cost of queue name translation and matching virtual queue names to hardware queues. Latency is the performance metric most affected, incurring a total overhead of six NES clock cycles. This is a small fraction of Express message latency, which is the shortest among all message types. Table shows the breakdown of this overhead.

The transmit and receive queue state loading overhead is due to the way the NES microarchitecture implements multiple hardware message queues. As described in Section, message queue states are separated from the logic that operates on them. The former are grouped into queue state files, akin to register files, while the latter is structured as transmit and receive functional units. This organization achieves more efficient silicon usage as the number of message queues increases, since the functional units need not be replicated. With this organization, queue state has to be loaded into the functional units and subsequently written back. Each load or writeback operation takes one cycle. Its latency impact is minimal, and its bandwidth impact is practically zero since it is hidden by pipelining.

Virtualization of queue names requires message destination translation during transmit, and RQID (Receive Queue IDentity) tag lookup to demultiplex incoming packets. Similar to cache-tag matching, the RQID tag lookup incurs a latency penalty of one NES clock cycle. It has no bandwidth impact, as the lookup is done with a dedicated CAM (Content Addressable Memory) in the NESCtrl chip.

Message destination translation adds three NES cycles to latency: one cycle to generate the address of the translation table entry, another to read the translation information itself, and a third to splice this information into the outgoing packet. Because the NES design places the translation information in the same SRAM bank as message queue buffers, reading the translation information has a bandwidth impact: SRAM port usage for an n-byte message is lengthened by one cycle. This only has an impact on very small messages. Furthermore, an implementation can use a separate translation table RAM to avoid this bandwidth penalty altogether.

The latency penalty of six NES clock cycles is so low that it will remain a small part of the overall latency even if better implementation technology is used for the NES and the network, so that their latency contribution is greatly reduced. We can make an argument that the six cycle overhead is small relative to the latency of a system bus transaction, which is at least four bus clocks today and unlikely to decrease. As long as the NIU is external to the processor, any message passing communication will require at least two system bus transactions, and probably more.

Comparison of Resident and Nonresident Basic Message Queues

This section examines the performance of Nonresident Basic Message queues implemented as described in Section. We present bandwidth and latency numbers for the four combinations resulting from the cross-product of the sender and the receiver using either Resident or Nonresident queues. Processor overhead numbers are only minimally increased by the switch from Resident to Nonresident queues, and are not

TxQ Resident   RxQ Resident   Bandwidth (MBytes/s)
Yes            Yes
Yes            No
No             Yes
No             No

Table: Bandwidth achieved with Basic Message when transferring kBytes of data from one node to another. The four cases employ different combinations of Resident and Nonresident queues at the sender and receiver.

presented here. The marginal increase is due to the longer latency of accessing DRAM.

Table lists the bandwidth achieved for kByte block data transfers. When both sender and receiver use Nonresident message queues, the bandwidth is a third of the Resident implementation's bandwidth. The same table also shows that the bottleneck is at the sender's end, since close to half the Resident queues' bandwidth is attained if the sender uses a Resident queue while the receiver uses a Nonresident queue.

Figure reports one-way message latency under the four combinations of sender and receiver queue types. The bar graph displays the absolute latency numbers, while the line graphs join the points indicating the latency slowdown ratios, i.e., the ratio of the longer latencies incurred with Nonresident queues to the corresponding latencies under Resident queues. The latency deterioration is worse for smaller messages, because several Nonresident queue overhead components are fixed regardless of message size. The figure also shows that Nonresident message transmission causes greater latency deterioration than Nonresident message receive. In the worst case of a one-word Basic message sent between Nonresident queues, latency is almost five times longer.

Accounting for sP Emulation Costs

To facilitate understanding of sP emulation cost, we differentiate between two cost categories. The first is the inherent cost of sP firmware processing; this is determined by our NES microarchitecture. The second is the Basic message interface itself:


Figure: One-way latency of Basic Message when sent from and received in different combinations of Resident and Nonresident Basic Message queues. The bmsg case is between Resident transmit and Resident receive queues. The nr-bmsg case is between Nonresident transmit and Nonresident receive queues. The snr-rr-bmsg case is between a Nonresident sender and a Resident receiver, while the sr-rnr-bmsg case is between a Resident sender and a Nonresident receiver.

obviously, different interface design choices influence the efficiency of sP emulation.

We defer detailed discussion of the efficiency of sP firmware processing to Section. It suffices to say that sP handling of each event imposes fixed overheads, which can be significant if events are frequent. Multi-phase processing of an operation exacerbates this cost, because state has to be saved and restored between phases. The Basic message interface design is difficult to implement efficiently with sP code for two reasons: (i) information is exchanged in a way that forces the sP to process a message transmission in two phases; (ii) our choice of buffer queue organization, with a dynamically varied end of queue. Details follow.

Updating a transmit queue producer pointer triggers the first phase of sP processing. The sP is, however, unable to complete the emulated transmit because two pieces of message header information, the message destination and size, have to be read from the SMP's main memory. Because the sP cannot directly read the aP's DRAM, it has to first marshal the message header and data into NES SRAM. While waiting for this to complete, it must continue to service other requests to avoid possible deadlocks. This incurs context switching overhead, including storing aside information for the second phase, which then has to locate it.

Basic message receive also incurs multiple phases of sP processing, but fortunately the second phase can be aggregated, and there is little state carried from the first phase to the second one. The main task of the latter is to free up message buffer space in the proxy receive queue.³ In retrospect, we could have extended the command set of the local command queue to include one that frees up hardware queue space. This, together with the FIFO property of the local command queue, would make it possible for the sP to completely process a Basic message receive in a single phase.

Not all Nonresident transmit emulation is as expensive as that for Basic message. The transmit part of the Express and Express-TagOn interface is in fact much more conducive to sP Nonresident implementation, because message destination and size information are available in the bus transaction that triggers sP processing. This enables the sP to emulate message transmit in a single phase. As a result of this simplification, the transmit sP occupancy is expected to be half that of Basic message.

³ The proxy queue is the hardware queue onto which packets from multiple emulated queues are multiplexed.

Implementing Basic message Nonresident transmit is also particularly difficult due to a seemingly innocuous buffer queue design decision: Basic message queues are circular queues with variable size buffers. While having many advantages, discussed in Section, variable buffer size leads to the possibility of a buffer at the end of a message queue's linear address region straddling both the end and the beginning of the queue's address range. This situation is inconvenient for software; for example, it makes it impossible to reference a message buffer as a struct in C programs.

To handle this problem, we made the effective end of a queue variable: whenever the space left at the end is insufficient for the maximum size message, the queue is wrapped back to the beginning. The amount of space skipped over is dependent on the dynamic composition of buffer sizes encountered. This was a bad decision, because of the problems it causes and because there is an easier solution.

Our buffer queue structure makes buffer allocation code tedious. This inconvenience and inefficiency is greatly compounded in Nonresident transmit, because the dynamically varied end of a queue shows up in both the emulated Nonresident queue and the hardware proxy queue. Furthermore, the dynamic end of an emulated queue cannot be determined until the second phase of processing.

The straightforward way to deal with this complexity is to stall processing whenever any queue, either proxy or emulated, approaches its dynamically determined end. Processing only resumes when message size information, available during the second phase, resolves the precise point at which the queue has reached its end.

A second approach avoids stalling by having phase one processing marshal sufficient data into two different SRAM locations. Only one of these two copies is used during phase two, depending on where the queue wraps back to its beginning. Phase two processing disposes of the redundant data by sending loopback garbage packets, which it discards. This is the implementation we chose.

The complexity of this variable end-of-queue design could have been avoided with an alternate solution that permits variable packet sizes without breaking any packet into non-contiguous address regions. This solution imposes the restriction that message queues have to be allocated at page-size granularity. When the physical pages are mapped into an application's virtual address space, an extra virtual page is used at the end, which maps to the first physical page; i.e., both the first and last virtual pages of a message queue map onto the same physical page. This allows a packet occupying non-contiguous physical addresses to end up in contiguous virtual addresses.

Performance Limits of the sP

This section examines the efficacy of the sP in the StarT-Voyager NES design. As described in Chapter, the NES Core provides the sP with a large, flexible set of functions, leaving few doubts that the sP is functionally capable of emulating almost any communication abstraction. What is less apparent is the performance the sP can achieve, and the factors limiting this performance.

The results reported in this section show that, when driven to the limit, sP performance is constrained by either context switching or off-chip access. When the amount of work involved in each sP invocation is small, the context switch overhead (event poll, dispatch, and bookkeeping, especially for multi-phase operations) dominates. When each sP invocation orchestrates a large amount of data communication, the cost of sP off-chip access dominates.

This result validates several of our concerns when designing the sP/NES Core interface. These include (i) crafting an interface which minimizes the number of phases in sP processing of an event, and (ii) reducing the number of sP off-chip accesses through mechanisms like One-poll. On the other hand, it also reveals one area of sP function that we did not consider carefully during our design: the burden of NES Core hardware resource allocation and deallocation.

At a higher level, the results are a reminder that the sP's performance is ultimately lower than that of dedicated hardware. Therefore, it is crucial that it is not invoked too often. If the sP's duties are limited to a small number of simple tasks, its processing cost can potentially be lowered because of reduced bookkeeping. In our opinion, an sP so constrained is better replaced with FPGAs.

Two sets of experiments contribute to the results of this section. The first is the Nonresident Basic message performance first presented in Section. We revisit the performance results, dissecting them to quantify the cost of generic sP functions. Details are reported in Section.

In the second set of experiments, the sP implements block transfer in several ways. These experiments differ qualitatively from the first set in that a fair amount of communication is involved each time the sP is invoked. Because the sP overhead from dispatch and resource allocation is amortized, these experiments reveal a different set of performance constraints. Details are reported in Section.

sP Handling of Micro-operations

We instrumented our simulator and the Nonresident Basic message performance benchmarks to obtain the cost of each invocation of the sP when it emulated Basic message queues. These numbers are tabulated in Table. To double-check that these numbers are reasonable, we also inferred from the bandwidth numbers in Table the number of processor clocks each Nonresident Basic packet takes for transmit and receive respectively at the bottleneck point.⁴ These numbers are compatible with those in Table. They also show that sP occupancy constrains the transmit bandwidth but not the receive bandwidth.

Why does the sP take several hundred processor clocks to process each of these events? Firstly, all our code is written in C and then compiled with GCC. Manual examination of the code suggests that careful assembly coding should improve performance considerably. Secondly, the sP code is written to handle a large number of events. Polling and then dispatching to these events, and saving and restoring the

⁴ Each emulated Basic Message packet carries bytes of data. When the sP emulates message transmit, it achieves MBytes/s, i.e. it processes thousand packets every second. Since the sP operates at MHz, this works out to be one packet every cycles. Similarly, when the sP emulates message receive, it achieves MBytes/s, i.e. it processes thousand packets every second. This works out to be one packet every cycles.
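The arithmetic in this inference can be written symbolically. With s the packet payload in bytes, B the achieved bandwidth in bytes per second, and f the sP clock frequency, the per-packet occupancy is:

```latex
\text{packets/s} = \frac{B}{s}, \qquad
\text{occupancy (cycles/packet)} = \frac{f}{B/s} = \frac{f \cdot s}{B}
```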

Component                                                  sP occupancy
                                                           (proc clks/packet)
Tx emulation Phase: Marshal data
Tx emulation Phase: Destination trans. and launch
Total
Rx emulation Phase: Demultiplex data
Rx emulation Phase: Free buffer and update queue state
sP free receive buffer
Total

Table: Breakdown of sP occupancy when it implements Non-resident Basic message.

state of suspended multi-phase processing, all contribute to the cost.

As an illustration, we obtained the timing for sP code fragments taken from the Non-resident Basic message implementation. The sP's top-level dispatch code, using a C switch-case statement, takes processor clocks. When this is followed by dequeuing the state of a suspended operation from a FIFO queue and then a second dispatch, additional processor clocks are incurred.

Hardware resource management, such as allocation and deallocation of space in the local command queues, also incurs overhead. With each task taking several tens of cycles, these dispatches, lookups and resource management very quickly add up to a large number of sP processor cycles.
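To make the dispatch and bookkeeping structure concrete, the two-level dispatch described above can be sketched as follows. This is an illustrative reconstruction, not the actual sP firmware: the event names, the FIFO of suspended operations, and the handler bodies are all hypothetical.

```c
#include <assert.h>

/* Hypothetical sketch of the sP's two-level dispatch: a top-level
   switch-case over event types, plus a second dispatch on the phase of
   a suspended multi-phase operation dequeued from a FIFO. */

enum event { EV_TX_REQUEST, EV_RX_PACKET, EV_RESUME };

struct suspended_op {            /* saved state of a multi-phase op */
    int phase;
    int queue_id;
};

#define FIFO_CAP 16
static struct suspended_op fifo[FIFO_CAP];
static int fifo_head, fifo_count;

static void fifo_push(struct suspended_op op) {
    assert(fifo_count < FIFO_CAP);
    fifo[(fifo_head + fifo_count++) % FIFO_CAP] = op;
}

static struct suspended_op fifo_pop(void) {
    assert(fifo_count > 0);
    struct suspended_op op = fifo[fifo_head];
    fifo_head = (fifo_head + 1) % FIFO_CAP;
    fifo_count--;
    return op;
}

/* Returns a code identifying which handler ran, so the two cost paths
   (first dispatch alone vs. dequeue plus second dispatch) are visible. */
int dispatch(enum event ev) {
    switch (ev) {                        /* top-level dispatch */
    case EV_TX_REQUEST: {
        /* phase one done; suspend until the command queue drains */
        struct suspended_op op = { /*phase*/ 2, /*queue_id*/ 0 };
        fifo_push(op);
        return 1;
    }
    case EV_RX_PACKET:
        return 2;
    case EV_RESUME: {
        struct suspended_op op = fifo_pop();  /* restore saved state */
        switch (op.phase) {              /* second dispatch */
        case 2:  return 3;               /* e.g. free buffers, ack */
        default: return -1;
        }
    }
    }
    return -1;
}
```

Even in this toy form, each event costs a poll, a jump through the switch, and possibly a queue operation and a second switch, which is where the tens of cycles per task come from.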

sP Handling of Macro-operations

Table shows the bandwidth achieved for kByte block transfer using three transfer methods, two of which involve the sP.

Benchmark Details

The first method uses the NES hardware DMA capability. This has been reported

Transfer Method                          Bandwidth (MBytes/s)
NES Hardware DMA
sP sends and receives
sP sends, NES hardware receives

Table: kByte block transfer bandwidth under different transfer methods on StarT-Voyager, as measured on the StarTsim simulator.

earlier, but is repeated here for easy comparison. In the second method, the sP is involved at both the sending and receiving ends. The sP packetizes and sends data by issuing aP bus operations to its local command queue to read data into the NES. These are followed by ExpressTagOn commands to ship the data across the network. The sP takes advantage of the local command queue's FIFO guarantee to avoid a second phase of processing for the transmit packets.

On the receive end, the ExpressTagOn receive queue is set up so that the TagOn part of the packet goes into aSRAM while the header (Express-like) part goes into sSRAM. The sP examines only the latter, and then issues aP bus operation commands to its local command queue to move the TagOn data into the appropriate aP DRAM locations. Processing at the receiver sP has a second phase to deallocate TagOn data buffer space in the receive queue. This can be aggregated, and the reported numbers are from code that aggregates the buffer free actions of two packets into one sP invocation.
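The sender-side structure just described, issuing data-read commands followed by ExpressTagOn send commands into an in-order command queue, can be sketched as below. This is a hypothetical model, not the actual firmware: the command encodings and the queue interface are invented for illustration, and the array stands in for the hardware local command queue whose FIFO guarantee lets the sP skip a second transmit phase.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the sP's transmit loop: for each packet-sized
   chunk, issue a bus-read command to pull data into the NES, then an
   ExpressTagOn command to ship it across the network. The array models
   the local command queue, drained in order by the NES hardware. */

enum cmd_kind { CMD_READ, CMD_TAGON_SEND };

struct cmd { enum cmd_kind kind; size_t addr; size_t len; };

#define CMDQ_CAP 64
static struct cmd cmdq[CMDQ_CAP];
static size_t cmdq_len;

static void cmdq_issue(struct cmd c) {
    assert(cmdq_len < CMDQ_CAP);
    cmdq[cmdq_len++] = c;          /* FIFO: hardware drains in order */
}

/* Packetize [src, src+len) into pkt_size chunks; returns the number of
   commands issued (two per packet). */
size_t packetize_and_send(size_t src, size_t len, size_t pkt_size) {
    size_t issued = 0;
    for (size_t off = 0; off < len; off += pkt_size) {
        size_t n = (len - off < pkt_size) ? len - off : pkt_size;
        struct cmd rd = { CMD_READ, src + off, n };
        cmdq_issue(rd);            /* read data into the NES */
        struct cmd tx = { CMD_TAGON_SEND, src + off, n };
        cmdq_issue(tx);            /* ship it across the network */
        issued += 2;
    }
    return issued;
}
```

Because the read for a packet is queued strictly before its send, the send sees the data in place without the sP revisiting the packet.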

In the third benchmark, the sP is only involved at the sender end. On the receiving side, the packets are processed by the NES's remote command queue, the same hardware used in NES hardware DMA. The purpose of this example is to characterize sP packetize-and-send performance. When the sP is responsible for both sending and receiving, the receive process is the likely performance bottleneck because it has two phases. This suspicion is confirmed by the results in Table, which show this third transfer method achieving higher performance than the second one.

Result Analysis

The numbers in Table can be explained by the conjecture that the limiting factor on sP performance is the number of times it makes off-chip accesses. For example, under the second transfer method, the MBytes/s bandwidth implies that the bottleneck processes one packet every processor clocks (bus clocks). As shown in the following table, when the sP sends block data, off-chip access occupies the sP bus for bus clocks per packet. Short bus transactions involve smaller amounts of data, and each occupies the bus for only bus clocks; more data is involved in long bus transactions, each occupying the bus for bus clocks.

Operation                          num & type of bus transaction    bus clocks
Poll for event to handle           short
local command queue commands       short
cacheline write-miss               long
cacheline flush                    long
Poll for command ack               short (aggregated)
Total
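The conjecture that off-chip access is the limiter can be stated as a simple bound. With s the packet payload in bytes, c the total bus clocks of off-chip access per packet, and f_bus the bus clock frequency, the achievable bandwidth is capped at:

```latex
B \;\le\; \frac{s \cdot f_{\text{bus}}}{c}
```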

Chapter

Conclusions and Future Work

This piece of research centers around the thesis that a cluster system NIU should support multiple communication interfaces layered on a virtual message queues substrate, in order to streamline data movement both within each node as well as between nodes. To validate this thesis, we undertook the design and implementation of the StarT-Voyager NES, an NIU that embodies these ideas. Our work encompasses design specification, Verilog coding, synthesis and performance tuning, simulator building, functional verification, PC board netlist generation, sP firmware coding, and finally microbenchmark based evaluation. Through this exercise, we obtained a qualitative idea of the design complexity and a quantitative picture of the hardware silicon size. Neither issue presents any significant impediment to realizing our design.

What We Did

To enable an NIU to support multiple interfaces, we solved a series of problems, such as sharing a fast system area network in a safe manner without imposing an unreasonable performance penalty, and managing the hardware complexity and cost of supporting multiple interfaces. We introduced a three-layer NIU architecture, described in Chapter, to decouple the different issues encountered in this design. Specifically, network sharing protection is handled in the Virtual Queues layer, so that the Application Interface layer can devote itself to crafting efficient communication interfaces.

In the Virtual Queues layer, we designed a protection scheme that is both very flexible and cheap to implement. Flexibility is achieved by leaving policy decisions to system software. Implementation is cheap because the scheme requires only simple support from the NIU hardware. We implemented these virtual queues in the StarT-Voyager NES with a combination of hardware Resident queues and firmware-emulated Non-resident queues. The former are employed as caches of the latter under firmware control. This design illustrates the synergy between custom hardware functional blocks and the embedded processor in our hybrid NIU microarchitecture.

In the Applications Interface layer, we designed a number of message passing mechanisms catering to messages of different sizes and communication patterns. We also provided sufficient hooks in the NES for firmware to implement cache-coherent distributed shared memory. These hooks allow firmware to participate in, and control the outcome of, snoopy bus transactions on the computation node, ensuring tight integration into the node's memory hierarchy. Though originally intended for cache-coherent shared memory implementation, these capabilities are used to implement other communication interfaces, such as our Non-resident message queues. The design shows that by building on a basic set of hardware mechanisms, multiple interfaces can be supported with minimal to no per-interface enhancements.

At the microarchitectural level, we examined different options for introducing programmability into the NIU. We considered using one of the node processors, but found serious dangers of deadlock and likely performance problems. We finally picked using a dedicated off-the-shelf embedded processor in the NIU as the most expedient choice. To overcome some performance and functional limitations of this approach, we treated the custom NIU hardware as a coprocessor to the embedded processor, and structured this interface as a set of in-order command and completion queues.

What We Learned

The merits of our ideas are evaluated on a system simulator we developed. The results show that the Resident/Non-resident implementation of virtual queues is a good idea that achieves the seemingly conflicting goals of high performance, low cost, and flexible functionality. The average performance, expected to be dominated by Resident queue performance, is little worse than if the NIU had supported only a small fixed number of queues. NIU hardware cost is kept low, as the large number of queues is implemented in firmware using storage in main memory.

Microbenchmark-based evaluation also illustrates the performance merit of multiple interface support in the StarT-Voyager NES: the best message passing mechanism for a specific communication depends on its characteristics, such as data size or communication pattern. The advantage comes from using these characteristics to pick the best data and control path from among the myriad options in today's complex memory hierarchy. It is also significant that, despite the generality and flexibility of the StarT-Voyager NES, each mechanism it offers is competitive against implementations that provide only that particular mechanism.

Finally, we evaluated the utility of the off-the-shelf embedded processor in the NIU. While functionally extremely flexible, the embedded processor offers lower performance than custom hardware. We showed that, with our design, the embedded processor is limited either by context switch overhead, in the case of fine-grain communication events, or by off-chip access when it handles coarse-grain communication events.

While the StarT-Voyager NES is an interesting academic prototype, its implementation is not very aggressive, resulting in compromised performance. If financial and manpower resources permitted, the entire NES, except for the embedded processor and its memory system, should be integrated into one ASIC. This will improve communication performance, particularly latency. If further resources are available, a generic programmable core integrated into this ASIC could overcome the off-chip access limitation of our embedded processor.

Future Work

This work addresses the question: Is a multi-interface NIU feasible? This question has been answered in the affirmative through a complete design and microbenchmark performance evaluation of the message passing substrate. This work is, however, just an initial investigation, and many open issues remain; we list some below.

Further Evaluation with Real Workload

This thesis focuses more on design, both abstract and concrete, and less on evaluation. Further evaluation of the StarT-Voyager NES with real applications and workload will lead to more definitive answers to the following important questions:

- What are the effects of multiple message passing mechanisms on overall application and workload performance?

- Does more flexible job scheduling, enabled by our network sharing model, actually improve system utilization?

- What is the performance of Reactive Flow-control when used on real programs under real workload conditions?

- Is our choice of supporting the queues-of-network model and dynamic receive queue buffer allocation necessary? In particular, is either the channel model, or having the message sender specify the destination address, a sufficiently convenient interface?

While seemingly simple, these questions can only be answered after the entire system (hardware, system software and application software) is in place.

Cache-coherent Global Shared Memory

Shared memory is not a focus of this research, but to fully validate the concept of a multi-function NIU, we must demonstrate that this architecture does not penalize shared memory performance in any significant way. The goal is to demonstrate shared memory performance on par with, if not better than, that delivered by shared memory machines.

Although the current NES functionally supports cache-coherent shared memory, and will deliver good performance when the inter-node cache miss rate is low, there are concerns about inter-node cache-miss processing latency and throughput, because the sP is always involved in such cases.

Further work needed in this area includes a combination of performance study, and scrutiny of and modification to the NES microarchitecture, to ensure that all cache-miss latency critical paths and throughput bottlenecks are handled in NES hardware. Some possibilities are to add bypasses, aggressively avoid store-and-forward in the processing paths, and give shared memory traffic priority in resource arbitration. The StarT-Voyager NES, with its malleable FPGA-based hardware, provides a good platform for conducting this piece of research.

Multi-interface NIU vs. aP Emulation on Shared Memory NIU

This thesis investigated multi-interface NIUs, i.e. NIUs designed to directly support multiple communication interfaces. An increasingly common approach to satisfying application demand for different communication abstractions is to implement fast coherent shared memory support in hardware, and to use aP code to emulate all other communication interfaces. Although we believe that this latter approach has both performance and fault-isolation drawbacks, due to lack of direct control over data movement, this thesis did not attempt a definitive comparison of the two approaches. Such a comparison is premature until the implementation feasibility of the multi-interface NIU approach is demonstrated. With this thesis work, and further work on the shared memory aspects of the multi-interface NIU architecture as the foundation, a comparative study of how best to support multiple communication interfaces is in order.
