The Architecture and Programming
of a Fine-Grain Multicomputer

Thesis by

Jakov N. Seizovic

In Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy

California Institute of Technology
Pasadena, California

Submitted August

Caltech-CS-TR

Copyright
Jakov N. Seizovic
All Rights Reserved

Acknowledgments

Many thanks to all the great teachers I've had. In particular: To Chuck Seitz, my research advisor, for knowing when to be a friend and when a tough guy, for his willingness to be both a student and a teacher, for pointing out the obvious, and for insisting on simplicity. To Jan van de Snepscheut, Mani Chandy, and Eric van de Velde, members of my thesis-defense committee, for teaching me through their example that one must never stop learning. To Milenko Cvetinovic, for giving me all the right books to get excited about. To Ilija Stojanovic, Dejan Zivkovic, and Slobodan Cuk, for believing in me before I did.

Many more thanks to my fellow students. To Nan Boden, for her dedicated support and critique, depending on what I needed on any particular day. To Wen-King Su, for the long hours we spent working together, and for sharing his insights into the wonderful world of programming. To the late Mike Pertel, for teaching me about stamina in times good and bad. To Bill Athas, Don Speck, and Craig Steele, for helping me keep both feet on the ground. To Tony Lee, for being the best officemate there is.

Thanks also go to my research sponsors. To the Advanced Research Projects Agency, whose program managers repeatedly demonstrated a remarkable ability to balance researchers' vanities against the need for interproject cooperation. To IBM, for their orientation towards the future, embodied in part in numerous student-support programs.

To the entire Caltech Computer Science Department staff, and to Arlene DesJardins in particular, for humming busily in the background of my cocoon.

And to my family and friends, who made sure that it wasn't all work and no play.

To Goga

Abstract

The research presented in this thesis was conducted in the context of the Mosaic C, an experimental fine-grain multicomputer. The objective of the Mosaic experiment was to develop a concurrent system with maximum performance per unit cost, while still retaining a general-purpose application span. A stipulation of the Mosaic project was that the complexity of a Mosaic node be limited by the silicon complexity available on a single VLSI chip.

The two most important, original results reported in the thesis are:

1. The design and implementation of C+-, a concurrent, object-oriented programming system.

Syntactically, C+- is an extension of C++. The concurrent semantics of C+- are contained within the process concept. A C+- process is analogous to a C++ object, but it is also an autonomous computing agent and a unit of potential concurrency. Atomic, single-process updates that can be individually enabled and disabled are the execution units of the concurrent computation. The limited set of primitives that C+- provides is shown to be sufficient to express a variety of concurrent-programming problems concisely and efficiently.

An important design requirement for C+- was that efficient implementations should exist on a variety of concurrent architectures, and, in particular, on the simple and inexpensive hardware of the Mosaic node. The Mosaic runtime system was written entirely in C+-.

2. Pipeline synchronization, a novel, generally applicable technique for hardware synchronization.

This technique is a simple, low-cost, high-bandwidth, high-reliability solution to interfaces between synchronous and asynchronous systems, or between synchronous systems operating from different clocks. The technique can sustain the full communication bandwidth and achieve an arbitrarily low, nonzero probability of synchronization failure, P_f, with the price in both latency and chip area being O(log(1/P_f)).

Pipeline synchronization has been successfully applied to the high-performance inter-computer communication in Mosaic node ensembles.

Contents

Introduction
    Concurrency and VLSI
    Concurrent Architectures
    Concurrent Programming
        Shared-Memory Programming
        Explicit Message Passing
        Architecture-Independent Programming
    The Reactive-Process Programming Model
    The Mosaic C Project
    Overview of the Thesis

C+-
    Introduction
        Object-Oriented Programming vs. Concurrency
        Concurrent Object-Oriented Languages
    The Process Concept
    Managing Concurrency
        Remote Procedure Call
        Call Forwarding
        Fork-Join
        Semaphores
        Monitors
        Recursion
        Message Passing
        Single-Assignment Variables
        Process Aggregates
        Summary
    Managing Program Complexity
        Class Inheritance
        Virtual Functions
        Process Layering
        Process Libraries
        Data Exchange
        Putting It All Together

Implementation Issues
    The Runtime-System Framework
        Process Creation
        Runtime Services
        Process Dispatch
        The pointer_t and the entry_t Types
        Process State
        Process Migration
        Invoking Atomic Actions
        Active-Passive
        Remote Procedure Call
    From C+- to C++
        Parsing
        Code Generation
        Code Splitting

The Mosaic C
    Multicomputer Architecture
    The Mosaic Node
        The Mosaic Router
        The Dynamic RAM
        The Processor and the Network Interface
    Software Overhead of Communications

Pipeline Synchronization
    Introduction
        Problem Specification
        Existing Solutions
    Pipeline Synchronization
        The Mutual-Exclusion Element
        Two-Phase-Protocol FIFO
        Pipeline Synchronizer
        Correctness Proof
        Variations on the Theme
    A CMOS Implementation
    Conclusions

Conclusions
    Comparison with Related Work
        Medium-Grain Multicomputers
        Fine-Grain Multicomputers
        Multiprocessors
    Summary

A  Example Products of C+- Compilation

Bibliography

Chapter 1

Introduction

Concurrency and VLSI

Progress in microelectronics technology during the past four decades has been remarkable by any measure. Three major factors contributed to this progress: (1) a rapid and steady pace of improvements in processing technology to produce ever smaller, faster, and lower-power devices; (2) the development of design methodologies and tools to manage design complexity; and (3) the exploitation of concurrency. The first two factors are readily understood. The importance of concurrency to the performance/cost ratio of VLSI systems can be understood from results of VLSI-complexity theory, and has been demonstrated repeatedly in practice.

Special-purpose computing engines were the first to employ concurrent solutions, and continue to do so, highly successfully, to this day. Although various forms of concurrency (instruction-level parallelism, pipelining, vectorization) are exploited regularly in general-purpose computing engines, applying concurrent solutions to general-purpose computing at the application level has been slower in gaining ground.

A considerable effort has been made to exploit the concurrency that is implicit in sequential programs. This effort has been successful in discovering and utilizing modest degrees of concurrency, but is now regarded almost universally as having approached its limits. Applications with explicitly concurrent formulations are the driving force for a range of concurrent architectures, some of which are discussed in the following section.

Concurrent Architectures

Most of today's concurrent computers are representatives of one of the following three architectures:

- Computers with a single instruction stream and multiple data streams (SIMD).
- Two variants of computers with multiple instruction streams and multiple data streams (MIMD):
    - multiprocessors, which have one global address space, and
    - multicomputers, which have multiple local address spaces.

Early concurrent-computer implementations closely followed this classification: SIMD computers employed multiple computing units to which instructions were broadcast; multiprocessors utilized buses and/or switches to connect multiple processors to the global memory; multicomputers featured independent processor-memory pairs interacting through a message-passing network.

The differences between the more recent representatives of these three architectures are blurred. When observed from a point that is sufficiently close to the hardware, or from a point that is sufficiently far away from the hardware, these three architectures are remarkably similar. Each consists of a communication network connecting a collection of computing nodes. Each node consists of one to several instruction-interpreting processors, a local memory, and a network interface. All three architectures support some concept of processes: computing agents that execute concurrently, and that can communicate data and synchronize activities with each other. What were once architectural distinctions became differences in programming style: data-parallel, shared-memory, and message-passing (Chapter 2) programming abstractions. Depending on the emphasis on support for one of these abstractions, additional architectural support is provided: for global synchronization, for efficient non-local memory access, or for low-latency, low-context-switch-overhead message handling.

In this thesis, we shall focus principally on the multicomputer variety of MIMD computers, but shall indicate also how our results apply to multiprocessors. Multiprocessors provide hardware support for a global address space, and implement interprocess communication through shared-memory access, whereas multicomputers provide a generic message-exchange mechanism, and implement shared-memory access through interprocess communication.

Given the similarity between these two architectures, one might expect that, given a typical problem-set load of the computer, it would be easy to test and decide which architecture and which machine should be employed. Yet the reality of concurrent computers and concurrent-programming systems is somber: Programs are most often written in notations tailor-made for a particular computer architecture, sometimes even for a particular machine. The cost-effectiveness of program execution on concurrent machines is their main advantage over their sequential counterparts. Striving to maximize this cost-effectiveness, however, emphasizes the concurrent computers' main disadvantage: the complexity of writing programs.

Concurrent Programming

There are two, typically conflicting, driving forces shaping the developments in concurrent programming: increasing efficiency and increasing expressivity.

The efficiency-conscious programming systems are typically the products of design teams also involved with the design of concurrent machines, and often reflect the underlying architecture. Shared-memory programming and explicit message-passing programming are representatives of this class.

The expressivity-conscious programming systems are often produced by the frustrated users of the products of the former groups, and are typically architecture-independent.

Shared-Memory Programming

The first developments in concurrent programming were motivated by the advent of multiprogramming and multiuser operating systems. It should not, therefore, be surprising that the first concurrent-programming systems supported concurrent processes that communicated and synchronized through the memory of the machine on which they were executing. The development of the Parallel RAM (PRAM) model, a theoretical framework on which much of the work in concurrent algorithms is based, also promoted the popularity of this programming style, which is still the predominant form of concurrent programming.

From the early stages on, shared-memory programming has been plagued by various incarnations of the mutual-exclusion problem. This problem is due primarily to the discrepancy in access granularity between the data structures and the memory units used to represent these data structures. A number of remedies were introduced: atomic test-and-set and/or fetch-and-add instructions, and semaphores. One of the most significant efforts was the work of Per Brinch Hansen on Concurrent Pascal, and the development of monitors. Monitors encapsulate data with the mutually exclusive operations defined on the data, in well-defined and runtime-system-managed units. This work forms a foundation on which many of the recent developments in object-oriented concurrent programming are based, including the programming system described in this thesis.

Explicit Message Passing

Communication and synchronization through explicit message passing is a programming paradigm whose roots are as old as computers themselves, stemming from the need for inter-computer information exchange. This programming paradigm was adopted and adapted for programming multicomputers.

Starting with the Cosmic Cube and its commercial descendents, the mainstream representatives of the multicomputer architecture employ off-the-shelf processor, memory, and compiler technology. Programming systems for these machines are based on a variety of sequential programming languages for specifying individual process behavior, wherein communication and synchronization between processes is achieved through a set of routines.

There are two problems that are the curse of this programming style. First, although modular organization of data structures can be achieved within a process, this modularity does not extend readily to collections of processes. Second, the off-the-shelf technology often brought the off-the-shelf notion of process granularity: heavyweight processes impose an unacceptably high software overhead on process communication and synchronization.

Architecture-Independent Programming

A number of programming models and notations have been devised to provide a uniform view of concurrent computers to the programmer, and to map computations onto either of the architectures described above. The advantages that these programming systems offer in reducing programming effort are remarkable; preserving the cost-effectiveness of concurrent computers running such programs, however, has yet to be demonstrated. The assembly programming of conventional, sequential computers has been all but eliminated by higher-level notations, through large improvements in program-writing efficiency with small degradations of program-execution efficiency. The same has yet to happen to tailor-made concurrent-programming notations.

Functional Programming and Dataflow

In its pure form, functional programming provides a method for defining functions in terms of other, more-primitive functions. The value of a function is determined only by the value of its arguments, and is not history-sensitive. Since there are no side effects, functional-programming notations are implicitly concurrent, and subexpressions, including function arguments, can be evaluated independently of each other.

The introduction of side effects into functional-programming notations enables them to model history-sensitive behavior, but it also opens them up to the full set of problems associated with imperative-programming notations. Extending pure functional programming with single-assignment variables and streams, as introduced by dataflow researchers, represents an important intermediate point. This extension relaxes the no-side-effects requirement into the monotonicity requirement: A variable starts up uninitialized, and an assignment binds it to a value; multiple assignments are disallowed. A stream consists of a possibly infinite sequence of variables that can only be read and appended. Single-assignment variables are also used extensively for communication and synchronization in compositional programming and in concurrent logic programming, described next.

Concurrent Logic Programming

The programming model typically associated with sequential logic programming is that of proving an existentially quantified statement, given a program that consists of a set of axioms. Implementations of this model involve backtracking, which could, in principle, be replaced by concurrent examination of all the alternatives. However, for efficiency reasons, and because of the need to better model input/output behavior, concurrent logic programming makes a significant departure from this model: There is no backtracking; once a non-deterministic choice is made, no alternatives are examined.

A concurrent logic program consists of a set of guarded clauses, and each clause represents a recursive specification of process structures. To program in a concurrent-logic-programming notation is to specify tasks as unordered, concurrent sets of subtasks. Tasks communicate and synchronize with each other by binding single-assignment variables, and by waiting for variables to become bound.

Restrictions on the expressivity of clause guards to improve efficiency lead to a family of flat concurrent-logic notations. A minimalist approach to concurrent logic programming by Ian Foster and Stephen Taylor resulted in Strand, a streamlined and efficient concurrent-programming system, without giving up much of the expressive power.

UNITY

UNITY, developed by K. Mani Chandy and Jayadev Misra, is a computation model and a programming notation with an associated proof methodology. A UNITY program consists of a set of guarded, multiple assignments. These assignments are executed in arbitrary order. The focus of programming in UNITY is on what (i.e., on data transformations), as opposed to when. A particular execution order can be enforced only through data dependencies. A computation terminates when it reaches a fixed point, i.e., when no assignment in the program modifies any variables.

Interesting related research has been reported by Craig S. Steele. In this work, a programming model and a corresponding notation are developed in which program actions are associated with data objects through a programmer-specified triggering mechanism. An efficient multicomputer implementation of this UNITY-like programming system is demonstrated.

Actors

The Actors model of computation was first proposed by Carl Hewitt and Henry Baker, and was later formalized by William D. Clinger and Gul Agha. In this model, the unit of concurrent computation is an actor: an independent computing agent that is activated in response to messages sent to it. Each actor has a unique address, an associated message queue, and a specified behavior. In response to a message, an actor can send messages, create new actors, and become a new actor by specifying its replacement behavior.

Because of its simplicity, potential efficiency, and straightforward implementation on distributed architectures, the Actors model is the basis for numerous concurrent-programming systems. The reactive-process programming model described next, and its associated notation described in Chapter 2, are based in part on the Actors model of computation.

The Reactive-Process Programming Model

The reactive-process programming model is a variant of the Actors programming model. Computation in this model is performed by a set of processes: independent computing agents. A process is normally at rest, and starts executing in response to a message, including the initial, creation message. In the course of its execution, a process can send messages, create new processes, and modify its state, including self-termination. Message order is preserved for each pair of processes in direct communication. Each message is marked with a tag that specifies which of the process's compile-time-fixed set of entry points should be invoked. Each entry point runs to completion, and is therefore an atomic update of its process's state. A process can affect the order of execution of its entry points by enabling and disabling them selectively at run time; all entry points are initially enabled. A message tagged for a disabled entry point is delivered when and if that entry point is active again.

This model is extended to include the remote procedure call (RPC). An entry point of a process can be specified to return a value to the message sender. When a message is sent and tagged for such an entry point, the sender is suspended until the message with the returned value arrives.
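As a preview, a reactive process might be sketched as follows in the C+- notation introduced in Chapter 2. This is a hedged illustration, not taken from the thesis text; the names cell, put, and get are illustrative.

    processdef cell {
    private:
        int data;
    public:
        atomic cell()          { passive get; }           // after creation, the process is
                                                          // at rest, with get disabled
        atomic void put(int i) { data = i; active get; }  // entry point invoked by a tagged
                                                          // message; re-enables get
        atomic int get()       { return data; }           // RPC-style entry point: the sender
                                                          // is suspended until the value returns
    };

Each invocation of put or get is queued and serviced as an atomic, run-to-completion update of the process's state.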

Background

The reactive-process programming model is a result of the work in our research group over the last decade. Interestingly, a comparison with the early work of C. R. Lang on a concurrent version of Simula reveals that our group's ideas seem to have come almost full circle. The ideas of C. R. Lang, and the preceding work on which they built, were far-sighted and out of sync with the multicomputer technology of their time. In retrospect, it is as if much of what our research group has been doing was tracking and driving the necessary communication, processor, memory, and compiler technology to approach this target.

Starting with the development of the Cosmic Cube, our group embraced the explicit message-passing programming style. The design of an experimental fine-grain multicomputer, the Mosaic C, and the similarity of our approach to the Actors model of computation provided additional motivation; this effort culminated with the work of W. J. Dally on Concurrent Smalltalk, of W. C. Athas and N. J. Boden on Cantor, a minimalist, Actor-based notation, and of W.-K. Su on Reactive-C and distributed event-driven simulation. The work on the Cosmic Environment and the Reactive Kernel shifted our focus from organizing computations around processes to organizing computations around messages, and reactivity became an essential part of the programming model.

The Mosaic C Project

The work described in this thesis attempts to make a contribution to the unstately condition of concurrent programming today. Our work is based on the following principles:

- Concurrency must come cheap: Concurrent machines must be extensible in small and inexpensive chunks.
- Concurrency must come cheap: Only those high-level programming constructs for concurrency that can be implemented efficiently by the small and inexpensive hardware can be used.
- Concurrency must come cheap: Expressing concurrency in programs must be simple.

Starting from these simple and restrictive requirements, our research group has been conducting a computer-architecture experiment: to design a concurrent-computing system with as high a performance/cost ratio as possible, while still retaining the (not so well-defined) general-purpose application span. In the course of this experiment, we have built a testbed consisting of a fine-grain multicomputer, the Mosaic C; a concurrent-programming notation, C+-; and a distributed runtime system, MADRE.

What computer architecture is all about is bridging the gap between a programming model and a technology. Designers of all of today's computers divide this complex spanning task into a set of smaller, more manageable subtasks. A particular choice of subtasks is illustrated in the figure below. Well-defined anchor points simplify the implementation of the subtasks; however, if those anchor points are too numerous and/or too rigid, the design space may become severely restricted.

Figure: Computer architecture. The spanning task from applications down to the physical system: applications; programming model; programming abstractions; programming notation; programs; compiler; code; runtime system; tasks; machine; computation and communication; technology; physical system.

For the Mosaic experiment, the fairly rigid anchor points have been:

- the scalable CMOS VLSI technology, because that was the best technology that we both had access to and understood well, and
- the reactive-process programming model, because we believed we could implement it efficiently.

For each of the remaining, intermediate points, we have been able to identify opportunities for what appeared to be significant improvements over existing concurrent-computing systems. In particular:

- The programming notation was to be a derivative of a widely accepted object-oriented programming notation, trying to leverage off of the advances that object-oriented programming brought to sequential-programming systems.
- The compiler technology was mature enough that we believed a compiler could perform much of the checking traditionally done at run time, if at all.
- The runtime system was to be fully distributed, and to utilize the distributed queue described in prior work.
- The machine was to benefit from the results in router technology, dRAM technology, synchronization methods (Chapter 5), packaging technology, and self-testing possibilities.

Our computer-architecture experiment culminated in designing and building the Mosaic concurrent-computing system. Parts of the Mosaic system are described in companion publications and in this thesis.

In as closely coupled a research team as ours, it is difficult to properly grant every bit of credit, but, to the first order, the following is the list of contributions to the Mosaic project. Most of the work has been done by five people: Nanette J. Boden, Charles L. Seitz, Don Speck, Wen-King Su, and the author of this thesis. Nan has been involved principally with the runtime system, but was also invaluable to the development of the programming model and of the notation, and was the most understanding and thorough tester of the compiler and of the machine. Chuck did most of the overall machine architecture, the processor architecture, the system integration and packaging, and the router design. Chuck also had important contributions to every other aspect of the project, and the commitment to see it all the way through. Don designed the dRAM and ROM, and taught all of us the value of patience and thoroughness. Wen-King designed most of the processor and router and the workstation interfaces, and worked on all aspects of verification and packaging. Wen was also a principal contributor to the programming model, wrote a low-level (but works-reliably-the-very-first-time) programming system, and retargeted the GNU C and C++ compiler to the Mosaic. The author of the thesis feels he deserves credit only for saying, "Why don't we get this thing finished?" and for then going off to fill in the missing links: routing, network interface, chip-level system integration and verification, the programming notation, and the compiler. Numerous other people have been associated with the Mosaic project, most notably William C. Athas, who contributed to the programming model and to the processor architecture, and Michael J. Pertel, who contributed to the choice of the routing network and to its performance evaluation.

Overview of the Thesis

In Chapter 2, we introduce C+-, a concurrent, object-oriented notation based on C++. Chapter 3 defines the C+- runtime-system interface, illustrates how C+- can be customized, and explains the C+- compilation process. In Chapter 4, a brief description of the architecture of the Mosaic multicomputer is presented, with emphasis on architectural support for C+-. Chapter 5 presents a novel, generally applicable synchronization technique, along with a proof of its correctness and its application to the Mosaic. In Chapter 6, we compare the results of our work with the related research, and suggest possible future research directions.

Chapter 2

C+-

Introduction

Object-Oriented Programming vs. Concurrency

Programming notations that support object-oriented programming techniques are the notations of choice for a rapidly growing number of complex applications. Indeed, not since the introduction of structured programming has there been such a degree of unanimity in the programming community. This unanimity is even more remarkable considering that, just as was the case with structured programming, the power of object-oriented techniques is difficult to convey to readers through short example programs in books or articles. When observed in isolation, none of these techniques is new or revolutionary. It is only when one approaches a large-scale programming task armed with the full set of techniques that their power becomes evident.

Structured-programming techniques advocate structuring of program control flow in a top-down, compositional fashion. Object-oriented programming techniques promote data organization in a bottom-up, standard-parts fashion. Both paradigms emphasize modularity, but whereas the former focuses principally on modularity of control structures, the latter does a better job of encapsulating data structures with the operations defined on these structures.

Object-oriented programming came about through attempts to make large sequential programs more manageable. Techniques such as data encapsulation and access protection, inheritance, and guaranteed initialization all emerge from the goal of helping programmers help themselves.

In our view, much of what the techniques of object-oriented programming are really helping to manage is concurrency. Events are concurrent if they are unordered, i.e., if they can occur in any order, or in parallel. Mutual exclusion is an example of an issue most often associated with concurrent programming, but the problems that result from a disregard for mutual exclusion also occur regularly in large sequential programs: With uncontrolled access to global variables, it is impossible to keep track of all of the places in the code where a certain variable is accessed, and of all the invocations of such code. Nondeterministic execution is another issue most often associated with concurrent programming. For a fixed set of inputs, the execution of a sequential program will always result in the same ordering of state changes; yet, with side effects on global variables, it is often far from obvious what all the inputs to a program are.

Whereas sequential programming brings out the worst in us only in the large, concurrent programming will do that already in the small. It should not be surprising, then, that, in the hope of reaping some of the benefits that object-oriented techniques brought to sequential programming, we are witnessing a proliferation of programming systems trying to amend a particular object-oriented notation with concurrent semantics.

Concurrent Object-Oriented Languages

Figure: Design trade-offs for concurrent-programming systems (a three-way tension among safety, expressivity, and efficiency).

The three-way design trade-offs illustrated in the figure above are typical of the design of any programming system, not only those attempting to harness concurrency. However, all three requirements are more pronounced, and the balance more difficult to achieve, for a concurrent-programming system:

- Efficiency: One of the major reasons to employ concurrent solutions in the first place is to get more performance, and programming-system overheads are less likely to be tolerated by users.
- Expressivity: Moving from a single to many threads of control, and the requirement that threads can communicate and synchronize their activities, place additional demands on expressivity.
- Safety: In addition to mutual exclusion and possible nondeterminism, mentioned in the previous section, issues such as deadlock and livelock have to be dealt with. Simple semantics that aid correctness proofs are essential.

It is likely that some readers will find what we consider a balanced design to be biased in favor of efficiency, then expressivity, and then safety. Our argument about the increased importance of efficiency in a concurrent-programming environment is sometimes disputed on the grounds that, because concurrent systems offer better performance/cost than their sequential counterparts, one can afford more inefficiencies at the operating/runtime-system level. The consequence of this view on concurrent architectures is that machines with pathetic process-creation and communication overheads are being designed and built. The major goals of the work described in this thesis are to show that this pitfall can be avoided, and to demonstrate that fine-grain concurrency can be efficiently exploited.

Extensions of C++

C++ is an object-oriented notation that is in widespread use, due to its efficiency, availability, and upward compatibility with C. C++ is the starting point for numerous programming systems that attempt to amend C++ with concurrent semantics, including the system described in this thesis.

A comparison of our work to related concurrent-programming systems can be found in Chapter 6.

C+-

C+- is the result of an experiment to express reactive-process concurrent programs (introduced in Chapter 1) in an object-oriented programming notation. Although C+- is an extension of C++, the objective of the C+- project has not been to be able to execute arbitrary C++ programs efficiently on the Mosaic. The emphasis of C+- is on providing efficient support for the simple abstractions fundamental to the reactive-process computational model: process creation and communication. C+- strives not to impose higher-level policies on synchronization, communication protocols, or process placement. (Footnote: C+- is not a superset of C++, because it imposes restrictions on global variables, as discussed in the section on managing concurrency.)

Although the C+- programming system is portable across a wide range of architectures, the Mosaic has been both the driving force and the reality test behind this effort. Design decisions have consistently been made to avoid compromising the performance of C+- programs on the Mosaic. Higher-level programming systems may be layered on top of C+-, but C+- is intended to serve as the Mosaic's lowest-level, workhorse programming system, suitable both for operating-system and application programming.

The remaining sections of this chapter are devoted to teaching the reader about C+-. Familiarity with the basic concepts of object-oriented programming, and of C++ in particular, is assumed: classes, inheritance, access rules, operator overloading. Keywords are underlined in the programming examples. Although an effort has been made to steer clear of the idiosyncrasies of C++, some of them were essential, and they are explained as they are encountered. The reader is cautioned, however, that C+- is by no measure a minimalist, toy-example-writing notation; some of the more advanced examples are likely to present difficulties to those not familiar with C++. Our hope is that this difficulty is the result of C+-'s completeness, rather than of poor design choices.

The Process Concept

The C++ object concept is carried over intact to C+-: A class is a user-defined type; an object created according to a class definition is a collection of data items, a set of operations defined on them, and a set of access rules (see the class definition below). Class member functions have the usual sequential semantics.

    class C {
    private:
        int data;
    public:
        C()               { data = 0; }     // initialization
        void write(int i) { data = i; }     // update
        int read()        { return data; }  // retrieve
    };

Program: A Class Definition

The process concept is the only extension that C+- introduces to C++. The processdef keyword parallels the class keyword syntactically (see the process definition below). Access rules are associated with data members and functions of a process definition, and process definitions can be derived from other process definitions (see the section on managing program complexity). However, a process created according to a process definition is more than a collection of data items.

Specification 1: A process is an independent computing agent and a unit of potential concurrency. Its public interface consists of a set of atomic actions.

At creation time, the process constructor is executed, if it is defined. (Footnote: A process constructor is an atomic action with the same name as that of the process definition. The constructor may not return any value.) After the constructor completes, the process is at rest. The invocation of an atomic action of a C+- process is decoupled from its execution: Conceptually, there is an infinite queue of incoming requests for each process; the invocation of an atomic action places a request into this queue. Process execution consists of servicing these requests, with each atomic action running to completion.

    processdef P {
    private:
        int data;
    public:
        atomic P()               { data = 0; }     // initialization
        atomic void write(int i) { data = i; }     // update
        atomic int read()        { return data; }  // retrieve
    };

Program: A Process Definition

Creating a process is no different from creating an object (see the program below). In most cases, processes are created dynamically (pp = new P), and persist until they are explicitly destroyed (delete pp). One can also create a temporary process as a local variable, just as with any other type (P p). This temporary process is destroyed implicitly when execution leaves its scope.

    int i;              // declaring an integer
    P *pp;              // declaring a process pointer

    pp = new P;         // creating a persistent process
    i = pp->read();     // retrieving a value
    pp->write(i);       // updating
    delete pp;          // explicitly destroying the persistent process

    {
        P p;            // declaring a temporary process
        i = p.read();   // retrieving a value
        p.write(i);     // updating
    }                   // implicitly destroying the temporary process

Program: Programming with Processes

A C+- computation is initiated by a runtime system that, concurrently with the initialization of global processes, creates an instance of root (shown below), the constructor of which is defined by the user.

    processdef root {
    public:
        atomic root(int argc, char *argv[]);
    };

Program: The root Process

Specification 2: A process can affect the order of execution of its atomic actions by enabling and disabling them selectively at run time. All atomic actions are initially enabled; execution of a disabled action is postponed until the action is enabled again.

For example, let us assume that the rules for accessing a process of type P are such that it may be updated only once; every subsequent write request should be tagged as an error. Furthermore, all read requests occurring before the first write should be serviced only after the first update occurs. The process definition for this version of P is listed in the program "Enabling and Disabling Atomic Actions" below.

Processes communicate and synchronize with each other through atomic actions. Thus far, we have discussed only the behavior of processes as servers: how they deal with incoming requests (invocations of their atomic actions). We shall now define the behavior of processes as clients: how they request services from other processes.

Specification 3: When invoking an atomic action that does not return a value (returns a void), or if the returned value is not used, the caller continues execution independently of the callee. The order of invocations is preserved for each pair of processes in direct communication. If the value returned by an atomic action is used, the caller may be suspended until the returned value is available.

Invoking an atomic action that returns a value does not, in itself, imply that the requesting process will be suspended until the requested value is available. It is only when this value is used that a thread of activity must be suspended. For example, the program "Divide and Conquer" below uses a divide-and-conquer approach to compute the nth Fibonacci number. Both subcomputations are initiated, and the process will suspend only if it attempts to add the two partial results before they are available.

It is sometimes desirable to enforce a sequential order of execution of subcomputations. In such cases, the C+- await construct should be used. For example, in the divide-and-conquer process below, return await(f1.compute(n-1)) + f2.compute(n-2); ensures that the first subcomputation is complete before the second one is initiated.

    processdef P {
    private:
        int initialized;
        int data;
    public:
        atomic P();
        atomic void write(int);
        atomic int read();
    };

    atomic P::P() {
        initialized = 0;
        passive read;
    }

    atomic void P::write(int i) {
        if (initialized) {
            report_error();
        } else {
            data = i;
            initialized = 1;
            active read;
        }
    }

    atomic int P::read() {
        return data;
    }

Program: Enabling and Disabling Atomic Actions

    processdef fib {
    public:
        atomic int compute(int n) {
            switch (n) {
              case 0:  return 0;
              case 1:  return 1;
              default: {
                  fib f1, f2;
                  return f1.compute(n-1) + f2.compute(n-2);
              }
            }
        }
    };

Program: Divide and Conquer

Programming systems differ considerably in what constitutes use of unresolved variables, also called futures. The most aggressive systems allow futures to be exchanged between processes, and suspend a thread only when a value is needed for a hardware-implemented expression evaluation. Support for futures is the central issue for numerous concurrent-programming systems; C+- is not one of these systems, and is not very aggressive in trying to discover and utilize this type of concurrency. In C+-, assigning an unresolved value to any programmer-defined variable constitutes use of that future, and will cause the thread to be suspended. C+- guarantees only that a thread will not be suspended unnecessarily within an expression evaluation. C+- semantics allow any additional compiler/runtime-system optimization, but only within the body of a function or an atomic action. Unresolved variables must be resolved before they can be passed as arguments.

The reason for C+-'s non-aggressive utilization of futures is that we want to encourage a programming style in which concurrent behavior is generated explicitly, as opposed to trying to utilize the concurrency that is implicit in a sequential formulation. Synchronization on an unresolved future is inherently more expensive than, for example, synchronization using the active/passive semantics, because the process state that must be saved when blocking on a future is much larger. For notations that have stack-based implementations of the regular function-call abstraction, such as C+-, this state includes the stack.
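The following hedged sketch, not taken from the thesis, makes these rules concrete; probe is an illustrative name, and server and request are borrowed from the remote-procedure-call example later in this chapter.

    processdef probe {
    public:
        atomic void run(server *s) {
            int a = s->request(1);      // assigning an unresolved value is a use:
                                        // the thread is suspended here until the
                                        // reply arrives
            int b = s->request(2) + s->request(3);
                                        // both requests are issued before the
                                        // addition; the thread is suspended only
                                        // when the sum itself is needed
        }
    };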

Managing Concurrency

All concurrency-related issues in the C+- programming system are encapsulated into the process concept. The following syntactic restrictions enforce this requirement:

- Only atomic actions can be public members of a process definition. (Footnote: C++ static member functions can also be public members of a process definition, since their semantics do not allow them to access process members anyway.)
- Only values, process pointers, and process references can be arguments to atomic actions. (Footnote: The difference between pointers and references is a subtle idiosyncrasy of C++; for the purposes of this thesis, the two can be considered equivalent.)
- Processes are the only global variables allowed. (Footnote: This includes both global and static C++ variables, i.e., all variables with file scope.)
- Process definitions can have no friends. (Footnote: The friend construct in C++ allows non-member functions to have full access to private class members.)

As specified in the previous section, a process is a unit of potential concurrency. Processes communicate and synchronize with each other through atomic actions. The remainder of this section will be devoted to examples illustrating how some of the well-known concurrent-programming paradigms can be implemented in terms of C+- processes.

Remote Procedure Call

The remote procedure call (RPC) is a common form of interaction between threads of activity. As illustrated in the program and figure below, a client requests a service from a server, and suspends its execution until the request has been attended to. The semantics of the RPC are identical to those of an ordinary procedure call. The implementations of the two types of procedure calls, however, are typically different, because the client and the server may be operating in different address spaces. A better name for the RPC might be "interprocess procedure call."

    processdef server {
    public:
        atomic int request(int);
    };

    processdef client {
    public:
        atomic client(server *s) {
            int i = s->request(0);
        }
    };

Program: Remote Procedure Call

During a remote procedure call, the calling process is nominally suspended until the returned value is available, so no concurrency is introduced. However, as discussed in the previous section, with the use of futures, the semantics of the RPC can be extended so that several requests can be issued concurrently, and the calling process is suspended until all the requests have been serviced (as in the divide-and-conquer program and the figure below).

Figure: Remote Procedure Call (a time/place diagram: the client is suspended while the server services its request).

Figure: Divide and Conquer (a time/place diagram of fib processes issuing concurrent requests).

Call Forwarding

Call forwarding is a paradigm associated with message-based, object-oriented programming systems, and is similar to tail recursion. As an example, consider the sequential search of a singly-linked list of dictionary processes in the program below. When the value returned from an atomic action is itself obtained by an atomic-action invocation, the programmer may choose to use the forward statement instead. With the return statement, a request is issued, the process is suspended until the value is available, and then a reply is sent to the calling process. The effect of call forwarding is to defer the servicing of the request to another process. Two sequential-search examples, one using the return and the other the forward statement, are illustrated in parts (a) and (b), respectively, of the figure below. In addition to reducing the number of replies, call forwarding enables the list of processes that form a dictionary to process multiple requests in a pipelined fashion: At any point in time, each search request is being worked on by at most one dictionary process.

    processdef dict {
    private:
        dict *next;
        int index;
        int data;
    public:
        atomic int find(int i) {
            if (i == index)
                return data;
            else
                return next->find(i);   // can be replaced by:
                                        //   forward next->find(i);
        }
    };

Program: A Sequential Search

Figure: A Sequential Search with RPC (a), and with Call Forwarding (b) (time/place diagrams of client requests traversing a chain of dict processes).

Fork-Join

The remote-procedure-call mechanism, with the limited support for futures provided by C+-, offers a convenient and easy-to-understand programming paradigm for an important class of problems. A more flexible, fork-join mechanism for process synchronization in C+- is offered through the combination of non-suspending atomic-action invocation and active/passive semantics.

There are two paradigms that C+- programmers can use to generate concurrent activities:

- Creating new processes, whether persistent or temporary. The parent process continues execution independently of the child. (Footnote: When a pointer to a newly created process is used in a subsequent computation, this may or may not require suspending the parent, depending on the implementation. However, the parent continues execution concurrently with the child's constructor.)
- Upon invoking an atomic action that does not return a value, or when the returned value is not used, the caller continues executing without waiting for the callee.

The synchronization barriers can be expressed using active/passive semantics, as in the sketch below.
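For example, an n-way join, a barrier on which one client waits for n workers, might be sketched as follows. This is assumed code, not taken from the thesis; join, done, and wait are illustrative names.

    processdef join {
    private:
        int remaining;                      // workers that have not yet checked in
    public:
        atomic join(int n)  { remaining = n; passive wait; }
        atomic void done()  { if (--remaining == 0) active wait; }
        atomic int  wait()  { return 0; }   // serviced only after the last done(); a
                                            // caller that uses the returned value is
                                            // suspended until then
    };

A client forks the n workers, hands each of them a pointer to the join process, and then uses the value of wait() to suspend until all of them have invoked done().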

Suppose that an FFT computation is implemented as illustrated in the figure below. The expressions along the edges of the graph are coefficients. Multiple inputs to a node imply addition, and multiple outputs imply replication of the result.

Figure: An 8-Point FFT Computation (the W_N expressions along the edges of the graph are the coefficients).

A concurrent program for N-point FFT computation could employ N processes, and compute the result in O(log N) steps. Each step would consist of getting two requests along the input edges, adding the two input values, multiplying by the coefficient, and producing two output values.

A version of this program could similarly employ N log N processes in a pipeline regime, achieving the same O(log N) latency, but a new result would be computed on every step.

In either approach, though, a process (circled in the figure) must get one data item along each of its input edges to be able to compute and emit one data item along each of its output edges. A process that might be used as part of the FFT-computation pipeline is listed below.

    processdef fft {
    private:
        Complex W, first;
        fft *outup, *outdn;

        void output(Complex in) {
            Complex result = (first + in) * W;
            outup->up(result);
            outdn->dn(result);
        }
    public:
        atomic fft(fft *u, fft *d, Complex r) {
            W = r;
            outup = u;
            outdn = d;
        }
        atomic void up(Complex in) {
            if (passive(dn)) {      // upon receiving both requests,
                                    //   produce the output
                active dn;
                output(in);
            } else {                // if you only have one request,
                                    //   await the second one
                passive up;
                first = in;
            }
        }
        atomic void dn(Complex in) {
            if (passive(up)) {      // upon receiving both requests,
                                    //   produce the output
                active up;
                output(in);
            } else {                // if you only have one request,
                                    //   await the second one
                passive dn;
                first = in;
            }
        }
    };

Program: An FFT-Computing Process

Semaphores

First introduced by E. W. Dijkstra, semaphores are low-level primitives for process synchronization. A semaphore is typically used to control access to a shared data structure, with an N-ary semaphore allowing access to at most N processes at any point in time. Two operations are defined on semaphores: acquire and release. In general, an implementation of an N-ary semaphore must guarantee that the number of acquire operations minus the number of release operations is at most N and at least 0. A C+- implementation of an N-ary semaphore is presented below.

    processdef semaphore {
    private:
        int count;                      // number of processes inside
                                        //   the critical section
        int max;                        // the maximum number allowed
    public:
        atomic semaphore(int N) {       // initially, there are no
                                        //   processes inside the critical
            max = N;                    //   section
            count = 0;
            passive release;
        }
        atomic int acquire() {
            count++;                    // one more inside
            active release;             // at least one can release
            if (count == max)           // if the maximum is reached,
                passive acquire;        //   no one can get in
            return 0;
        }
        atomic int release() {
            count--;                    // one less inside
            active acquire;             // at least one can acquire
            if (count == 0)             // no one is in, so
                passive release;        //   no one can exit
            return 0;
        }
    };

Program: N-ary Semaphore

An often-used special case, for N = 1, the binary semaphore, is illustrated below.

    processdef semaphore {
    public:
        atomic semaphore() {
            passive release;
        }
        atomic int acquire() {
            active release;
            passive acquire;
            return 0;
        }
        atomic int release() {
            active acquire;
            passive release;
            return 0;
        }
    };

Program: Binary Semaphore
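A hedged usage sketch, not taken from the thesis (worker is an illustrative name), shows how a client process would use either semaphore:

    processdef worker {
    public:
        atomic void run(semaphore *s) {
            int granted = s->acquire();     // assigning the returned value suspends the
                                            //   caller until the acquire request is serviced
            // ... critical section ...
            s->release();                   // the returned value is not used, so the
                                            //   caller continues without waiting
        }
    };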

Monitors

Of all of the concurrent-programming paradigms, the semantics of C+- processes are closest to those of monitors. Just as with monitors, C+- processes encapsulate a set of data items, and offer mutually exclusive access to a set of routines operating on this data. C+- processes also share some of the problems associated with monitors, as both are non-reentrant. The invocation of an atomic action of a C+- process is, unlike an invocation of a monitor function, decoupled from its execution: Conceptually, there is an infinite buffer of incoming requests for each process. This decoupling enables processes to be active computing agents, able to affect the order of execution of their atomic actions.

Recursion

In the examples shown so far, the requirement that all the public member functions of a process be atomic actions has been helpful in expressing interactions between concurrent threads of activity. From the point of view of C+- programmers, the most significant repercussion of the atomicity of interprocess activities is that, since at most one execution thread can be associated with a process, atomic actions that return values are not reentrant. For example, in the program below, the private member function fac has ordinary, sequential, reentrant semantics. However, the public member function FAC must be an atomic action. An invocation of FAC will therefore result in deadlock.

    processdef bad {
    private:
        int fac(int n) {
            if (n == 0)
                return 1;
            else
                return n * fac(n-1);    // OK: functions are reentrant
        }
    public:
        atomic int FAC(int n) {
            if (n == 0)
                return 1;
            else
                return n * FAC(n-1);    // ERROR: atomic actions are
                                        //   not reentrant
        }
        atomic int Fac(int n) {
            return fac(n);              // OK: atomic-action interface
                                        //   to a function
        }
    };

Program: Recursive Functions and Non-Recursive Atomic Actions

In the world of non-reentrant atomic actions, processes are the medium used to express recursive behavior (see the program "Recursive Processes" below).

Message Passing

Invoking an atomic action of a process is equivalent to wrapping up the argument list and sending it in a message. According to Specification 3, the atomic-action invocation does not imply blocking (waiting for the reply does), so it is equivalent to a non-blocking message send.

Message receiving has two forms:

- explicit, associated with the behavior of processes as clients, which receive a value that is returned from a call to an atomic action, and
- implicit, associated with the behavior of processes as servers, which receive an argument list as part of a request to execute an atomic action.

    processdef fac {
    private:
        int output;
    public:
        atomic fac(int input) {
            if (input == 0)
                output = 1;
            else {
                fac child(input-1);
                output = input * child.result();
            }
        }
        atomic int result() {
            return output;
        }
    };

or:

    processdef fac {
    private:
        int input;
        fac *parent;
    public:
        atomic fac(int i, fac *p) {
            if (i == 0) {
                p->result(1);
                delete this;
            } else {
                input = i;
                parent = p;
                new fac(i-1, this);
            }
        }
        atomic void result(int r) {
            parent->result(input * r);
            delete this;
        }
    };

Program: Recursive Processes

The two forms of receive, explicit and implicit, cover the two extremes of the spectrum of possible mechanisms for message discretion: The explicit receive accepts only a particular message from a particular process; the implicit receive accepts any message from any process. The active/passive semantics provide a more general, selective-receive mechanism: Atomic actions of a process represent incoming communication channels, and the process can, at run time, select the communication channels over which it is ready to accept a message.
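As a hedged illustration of this selective-receive mechanism (assumed code, not from the thesis; merge, a, b, and consume are illustrative names), the following process accepts messages over its two channels strictly alternately, starting with a:

    processdef merge {
    private:
        void consume(int x) { /* handle the accepted message */ }
    public:
        atomic merge()       { passive b; }                        // accept a first
        atomic void a(int x) { passive a; active b; consume(x); }
        atomic void b(int x) { passive b; active a; consume(x); }
    };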

Single-Assignment Variables

Single-assignment variables are a safe form of futures. Requesting a read access on an uninitialized single-assignment variable causes the requesting process to be suspended until the variable is assigned to. Since there can be at most one assignment to a single-assignment variable, these variables can be effectively cached. Processes of type P, as defined in the program "Enabling and Disabling Atomic Actions" above, are an example of a possible C+- implementation of single-assignment variables.
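A hedged usage sketch, not taken from the thesis (producer, consumer, and the value 42 are illustrative), of the single-assignment process P defined above:

    processdef producer {
    public:
        atomic producer(P *v) { v->write(42); }      // the one permitted assignment
    };

    processdef consumer {
    public:
        atomic consumer(P *v) { int x = v->read(); } // suspended until the write has
                                                     //   occurred; the value can then be
                                                     //   cached, since it can never change
    };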

Process Aggregates

Thus far, we have described processes as independent entities, and have emphasized the code-execution aspects of processes. In this section, we shall show how processes can be treated as instances of a restricted data form: one that can be accessed only through a set of mutually exclusive atomic actions.

As illustrated in the program below, C+- programmers can treat processes as variables of any other type. Whether a process is a local variable, a member of an object or of another process, an element of an array, or used in any other way in which a variable can be used in C++, the process semantics are the same. (Footnote: Process assignment is an atomic-action invocation, equivalent to issuing a request to the source process to send a copy of itself to the destination process. Passing processes as arguments is a form of assignment.) According to the syntactic restrictions described earlier, the only operations allowed on a process are to take its address and to access its public members, all of which are atomic actions. The various process usages determine only when a process is created and when it is destroyed. For non-process data types, variable usage also implies what the memory layout is. When accessing processes, one cannot assume, for example, that a process declared as a local variable resides on the stack, nor can one assume that a process that is a member of a class is placed in memory next to the other data members. We shall discuss later how programmers can affect the process-placement strategy.

The semantics of C+- are defined such that efficient implementations exist for both mainstream variants of MIMD computers: multiprocessors, which have one global address space, and multicomputers, which have multiple local address spaces. In C+-, regardless of the underlying architecture, a pointer to a process can be dereferenced globally, since it contains sufficient information to uniquely identify the process it points to.

    processdef P { /* ... */ };

    class C {               // an object of class C contains:
    public:
        P p;                //   a process,
        P *pp;              //   and a process pointer
    };

    P p1, p2;               // declare two processes
    p1 = p2;                // process assignment
    P pa[8];                // declare a process array

Program: Treating Processes As Data

An important advantage that multiprocessors have over multicomputers is that they can employ most of the data-layout strategies developed for sequential computers. There are additional performance considerations guiding the design decisions on the data layout, as discussed in the literature. If, for the time being, we neglect such performance considerations, a vector of C+- processes could, on a multiprocessor, be laid out in memory in the same way as a vector of elements of any simple data type: Elements with successive indices would reside at memory addresses that differ by a stride equal to the size of the process. This approach would allow the programmer to compute the address of any process in the vector, given the address of any other process in the same vector and the two corresponding indices.

On a multicomputer, using the above layout strategy for vectors of processes is unacceptable for two reasons: first, the address space of a multicomputer is contiguous only within each multicomputer node, so the maximum size of a process vector would be limited by the size of node memory; and second, although the computation model allows elements of a process vector to operate concurrently, that concurrency could not be used to a performance advantage, because the elements would all reside on the same node.

This example is but an instance of a more general problem of naming constituent elements of distributed objects. There are two issues that are central to the solution of this problem. The first issue is that there should exist a single name (address) of a distributed object, and a way of addressing constituents given this name. The second issue is that the programmer should be able to compute on references, not just store them at process-creation time and fetch them when they need to be used.

A simple solution that takes only the first issue into account could employ an address-manager process. The manager's address would represent the address of the distributed process as a whole. All the requests would be directed to this process, and then forwarded to appropriate constituent processes. This solution obviously introduces an access bottleneck, but may be acceptable for element processes that exhibit a large ratio of computation to communication.

We consider this problem to b e to o imp ortant to b e left to ad hoc approaches

particularly for such oftenused paradigms as arrays of pro cesses Accordingly

C oers a runtimesystemsupp orted mechanism for address managementthat

preserves the C addresscomputation semantics

The example in the program below shows that the creation of a process array consists of two stages.

processdef P {...};

P *p = new P[N];                        // is equivalent to:

P *p = (P *)unique_CPM(N, sizeof(P));
for (int i = 0; i < N; i++)
    new (&p[i]) P;

Program: Creating a Vector of Processes

First, a set of unique references is allocated, by invoking the unique_CPM function with arguments specifying how many references are required and what the stride between adjacent references should be. This function returns a pointer of the generic process-pointer type, pointer_t (analogous to void* in C++). Next, the actual process creation is requested, specifying that each new element process be placed in such a manner that it can be located through the given pointer. A description of the various flavors of process creation is presented later in this chapter; a set of algorithms that provide efficient support for process placement and lookup is described in the literature.
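To make the idea of globally computable references concrete, here is an illustrative C++ sketch only: the modular placement rule, the field layout, and the names below are assumptions of this sketch, not the algorithms used by the C+- runtime system.

#include <cstdint>

struct ref_t {                    // stand-in for pointer_t
    std::uint32_t value;          // base + index*stride, computable by the user
};

struct placement_t {              // what the runtime resolves a reference to
    unsigned      node;
    std::uint32_t local_addr;
};

// Resolve a reference from a block of N references allocated with a fixed
// stride, assuming element i was placed on node (i mod nodes).
placement_t resolve(ref_t base, std::uint32_t stride, unsigned nodes, ref_t r)
{
    std::uint32_t index = (r.value - base.value) / stride;   // recover the element index
    return { index % nodes,                                  // node that holds element index
             (index / nodes) * stride };                     // slot within that node
}

The essential property preserved is that the user may still do ordinary reference arithmetic (base plus index times stride), while the runtime system retains the freedom to scatter the elements across nodes.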

Summary

The programming examples in this chapter illustrate that a small set of mechanisms supported by C+- is sufficient to express a variety of concurrent-programming paradigms. This set consists of process creation, asynchronous request, synchronous request (remote procedure call), and selective servicing of requests (the active/passive mechanism). In the next chapter we shall present an implementation framework for this set of mechanisms.

Managing Program Complexity

In the introductory section of this chapter, we discussed how object-oriented programming techniques came about through efforts to aid programmers in managing program complexity. All of the object-oriented techniques supported by C++ are extended to managing processes in C+-. The interested reader may consult the wealth of available literature on C++. In the remainder of this section, for completeness, we shall mention briefly two of those techniques: inheritance and virtual functions. We shall then discuss the techniques that are specific to C+- and concurrent programming: process layering, process libraries, and customizing of the data exchange.

Class Inheritance

Class inheritance is the C++ mechanism that enables user-defined types to be derived from more basic types, inheriting data members and functions from the base type, possibly adding new ones and/or overriding old ones. Access rights are associated with each class member. For example, in the program below, private members of the base class shape can be accessed only by member functions of shape; protected members of shape can, in addition, be accessed by member functions of any class derived from shape (for example, circle); and public members of shape can be accessed by any piece of code anywhere in the program. The class circle is derived from class shape by adding a data member, radius, and a member function, modify_radius, and by overriding the member function draw.

A typical memory layout for the two classes is shown in the figure that follows the program. (This neglects, for the time being, such C++ features as multiple inheritance and virtual functions.) The point to be remembered is that C++ class inheritance is a compile-time, rather than a run-time, mechanism.

class shape {
 private:
  int origin;
  void modify_origin();
 protected:
  int color;
  void modify_color();
 public:
  void draw();
};

class circle : shape {
 private:
  int radius;
 public:
  void modify_radius();
  void draw();
};

Program: Class Inheritance

Figure: Class Inheritance vs. Memory Layout (every circle object contains an embedded shape part, origin and color, followed by its own radius)

Every instance of class circle contains a part corresponding to an instance of class shape; it is the definition of class shape that is shared, not any particular instance of it.

The C++ class-inheritance mechanism is mimicked by process definitions in C+-: they, too, can be specified through their similarities with, and differences from, previously defined process definitions.

Virtual Functions

The virtual-function mechanism supported by C++ enables programmers to separate the design of member-function interfaces from the design of the member functions themselves.

For example, in the preceding program, given a shape *sp and a circle *cp, the invocations sp->draw() and cp->draw() will result in calling shape::draw() and circle::draw(), respectively. The compiler decides which call to generate based on the type of the pointer through which the function has been called.

Had the two draw functions been virtual, the invocation sp->draw() could have invoked either of the two functions, depending on what the pointer sp pointed to. In this case, the compiler generates an indirect call through the class-specific table.
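For readers less familiar with C++, the following self-contained fragment contrasts the two dispatch rules just described; the member name vdraw is ours, added only so that both forms can appear side by side.

#include <cstdio>

struct shape {
    void draw()          { std::puts("shape::draw"); }   // non-virtual: chosen by the pointer type
    virtual void vdraw() { std::puts("shape::vdraw"); }  // virtual: chosen by the object pointed to
};

struct circle : shape {
    void draw()           { std::puts("circle::draw"); }
    void vdraw() override { std::puts("circle::vdraw"); }
};

int main()
{
    circle c;
    shape *sp = &c;
    sp->draw();    // prints "shape::draw"   -- compile-time choice
    sp->vdraw();   // prints "circle::vdraw" -- indirect call through the class-specific table
}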

Process Layering

The standard C+- inheritance mechanism allows one to describe process definitions hierarchically. However, once a process is created, it is an independent entity: the hierarchy is reflected in its structure, not in its relationship with other processes.

There are important applications where, in addition to hierarchy in structure, it is useful to have run-time-exercised hierarchy in control. For example, in operating or runtime systems, user processes are created and managed by system processes. In simulators, processes that model the behavior of physical elements are managed by time- or event-driven schedulers.

The mechanism that C+- uses to support such applications is process layering, also called dynamic process inheritance. As illustrated in the program and figure below, every instance of processdef gate is managed by an instance of processdef scheduler.

processdef scheduler {
 private:
  int time;
  ...
};

processdef gate : dynamic scheduler {
 protected:
  gate *output;
  ...
};

processdef two_input_gate : gate {
 private:
  int state;
 public:
  atomic void input1(int);
  atomic void input2(int);
};

Program: Process Layering

Figure: Process Layering vs. Memory Layout (the scheduler instance, with its data member time, is a separate process; a two_input_gate instance contains only the gate part, the output reference, and its own state)

The details of process layering will be discussed later, in the description of the C+- runtime-system interface. The relationship between the manager process and the managed process is established at the creation time of the managed process. The manager provides a set of services to all processes that it manages, with the same access protection that is offered through the class-inheritance mechanism. The manager decides when an atomic action of any of the processes managed by it is executed (as opposed to invoked), while conforming to the definitions of process behavior specified earlier.

Process Libraries

Libraries of C+- processes can be organized in the same way as libraries of data structures in C++. In most cases, the remote procedure calls to atomic actions of processes form a suitable interface, and these calls replace the class member-function interfaces. In these cases, it is sufficient that programs include header files that contain interface-process definitions.

There are cases, however, in which imposing the RPC interface would overly serialize computations that are otherwise concurrent. For example, a process library might initialize a set of processes for FFT computation, as illustrated later in this chapter, employing several input and several output data streams. A stream of input values can be represented by a sequence of non-blocking atomic-action invocations. If a stream of output values were represented as a sequence of replies obtained through the RPC mechanism, just as in the earlier sequential-search example, the computation could not be pipelined. However, unlike in that search example, this problem could not be resolved with call forwarding.

The mechanism typically used for C+- libraries with multiple input and output streams is as follows: an input stream is represented by a sequence of non-blocking atomic-action invocations of an input-interface process; an output stream is similarly a sequence of non-blocking atomic-action invocations of a process provided by the library user. In this arrangement, the library-user process must be derived from the output-interface process of the library it uses. When a process uses multiple libraries, multiple inheritance is employed to derive such a process from all of the output-interface processes from which it requires results.
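The following is a structural sketch of that arrangement, written in plain C++ classes rather than C+- processdefs; all names here are ours. The library exposes an input-interface type to which the user streams values, and an output-interface type from which the user process must derive so that the library can stream results back through non-blocking invocations.

struct Complex { double re, im; };

struct fft_input_interface {            // provided by the library
    virtual void in(Complex) = 0;       // a non-blocking atomic action in C+-
    virtual ~fft_input_interface() = default;
};

struct fft_output_interface {           // the user derives from this
    virtual void out(Complex) = 0;      // invoked by the library for each result
    virtual ~fft_output_interface() = default;
};

struct filter_output_interface {        // a second library's output interface
    virtual void filtered(Complex) = 0;
    virtual ~filter_output_interface() = default;
};

// A user process that consumes results from both libraries derives, via
// multiple inheritance, from both output-interface types.
struct consumer : fft_output_interface, filter_output_interface {
    void out(Complex c) override      { (void)c; /* handle FFT result */ }
    void filtered(Complex c) override { (void)c; /* handle filter result */ }
};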

Data Exchange

The designers of C++ made a commendable effort to provide an overloading mechanism that enables programmers to pass arguments by value, even when these arguments are arbitrarily complicated linked data structures. This mechanism is not sufficient for concurrent-programming systems, which must take into account some additional considerations: on multicomputers, object pointers have local meaning; also, concurrent computers may be heterogeneous ensembles, comprised of machines with different data layout, alignment, size, or representation.

C+- addresses all of these potential problems at the interprocess-communication level (invocations of atomic actions), with mechanisms that are described in the remainder of this section. The communication specifications are declarative, as opposed to imperative: the programmer specifies what special actions should be taken when a data item of a certain type is communicated; the compiler guarantees that the actions thus specified will be invoked on every occurrence of communication.

Communicating Arbitrarily Complex Data Structures by Value

One of the premises of fine-grain concurrent programming is that large data structures are implemented in terms of many small cooperating processes, so it is tempting to claim that process pointers that can be globally dereferenced are all that programmers might possibly want. However, an important use for pointers in C++ is for data structures that are only partially specified at compile time: linked data structures and arrays of variable size. If proper support and clean semantics for this feature were not offered, users would have resorted to ad hoc solutions.

The mechanism supported by C+- enables the programmer to specify what extra actions should be taken when communicating an object of some class by value. In its most common form, it amounts to flattening the linked data structure before sending, and relinking it upon receiving. As will be illustrated later, variants of this mechanism can also be used to express more intricate, but sometimes much more efficient, communication protocols.

Suppose that the data type of choice is a singly-linked list of elements of type list, each of which contains a pointer to the next element in the list, a pointer to a vector of integers, and a field specifying the size of the integer vector. The figure below illustrates what is required to pass a data item of type list by value. Part (a) shows a data item scattered around in memory. Part (b) shows the flattened data structure, with the dashed parts corresponding to other arguments that may be sent in the same communication. If the concurrent computer at hand is a shared-memory multiprocessor, and if the flattened argument list is in the shared address space, the task is completed. Now suppose that passing arguments moves them from one address space to another, as typically happens on a multicomputer. When the message that encapsulates the argument list is received, all the pointers are off by a constant, as in (c), and have to be relinked, as in (d).

Figure: Flattening Linked Data Structures ((a) the structure scattered in memory; (b) the flattened argument list; (c) the received copy, with pointers off by a constant; (d) the relinked structure)

The program below is the specification of the flattening and relinking tasks. The operator space() computes how much extra space is needed in the argument list when an instance of list is passed as an argument to an atomic action. The operator send() specifies that, in addition to this instance of list, a vector of integers and the remaining part of the list should be passed along. The operator recv() requests that the vector of integers (data) and the rest of the list (next) be relinked in place on the receive side.

This special handling will be invoked not only for instances of list, but also for all objects derived from list, and for all objects that contain instances of list as members. C+- data-structure libraries can accordingly be built in a way that allows library users to be indifferent about the details of the implementation.

This example illustrates how arbitrarily complex linked data structures can be passed by value. However, to avoid copying, and when sharing of data structures between processes is needed, structures must consist of linked processes, not of linked objects.


class list {
 private:
  int size;               // number of integers data points to
  int *data;
  list *next;             // a pointer to the next of kin
 public:
  size_t operator space()
  {
    size_t s = space(data, size);        // space for size integers
    if (next) s += space(next);          // space for the rest
    return s;                            //   of the list
  }
  void *operator send(void *v)
  {
    v = send(v, data, size);             // send size integers
    if (next) v = send(v, next);         // send the rest
    return v;                            //   of the list
  }
  void operator recv()
  {
    recv(data);                          // relink int*
    if (next) recv(next);                // relink the rest
  }                                      //   of the list
};

Program: Passing Linked Data Structures by Value

Communicating Across Heterogeneous Machine Boundaries

The C+- compiler assembles all messages (argument lists to atomic actions) and initiates all instances of communication (invocations of atomic actions). This information enables the compiler to handle the size and alignment of the basic data types (integers, floating-point numbers, etc.) for a programmer-specified set of machines that may be involved in direct communication. (The complete list of C+- basic data types is: char, short, int, long, float, double, long double, signed char, unsigned char, unsigned short, unsigned int, unsigned long, void*, entry_t, and pointer_t.)

The example in the program below specifies that, in addition to the local machine type, communication may be established with machines of types I and Sparc (arbitrary, user-specified names). The entries within each machine description correspond to the data size and alignment, measured in units of size equal to the minimum-addressable memory unit on the machine running this program, and to any special treatment that may be required for a particular basic data type.


machine I {
  char   <size> <alignment>;
  short  <size> <alignment>;
  int    <size> <alignment>;
  long   <size> <alignment>;
};

machine Sparc {
  char   <size> <alignment>;
  short  <size> <alignment>  send_lib  recv_lib;
  int    <size> <alignment>;
  long   <size> <alignment>;
};

Program: Machine Descriptions

For example, the entry for short integers on a machine of type Sparc specifies their size and the divisibility requirement for the addresses on which they must be positioned. When sending a short integer to a process residing on a machine of type Sparc, the data item has to be converted using the user-supplied and user-named function send_lib; when receiving a short integer from such a process, the data item has to be converted using the function recv_lib.

The compiler implicitly generates the type machine_t, defined as

enum machine_t { local_CPM, I, Sparc };

and the user is obliged to define the function

machine_t machine_CPM(pointer_t);

that maps process pointers into machine types.
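As a minimal sketch of this obligation, one possible machine_CPM could look as follows; the typedef, the node_of helper, and the per-node table are assumptions of this sketch, not parts of the C+- runtime system.

typedef unsigned long pointer_t;        // stand-in for the runtime-defined type

enum machine_t { local_CPM, I, Sparc };

extern int       node_of(pointer_t);    // assumed: extracts the node number from a reference
extern machine_t machine_table[];       // assumed: machine type of each node, filled at startup

machine_t machine_CPM(pointer_t p)
{
    return machine_table[node_of(p)];   // map the process's node to its machine type
}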

Putting It All Together

The examples of C+- programs shown so far were chosen to illustrate programming techniques. We have deliberately chosen clarity over completeness, and, indeed, some of these examples require the addition of forward declarations to be accepted by the compiler.

In this section we shall show an example of a complete program that computes the N-point FFT. Our concurrent program will



closely match this data-dependency graph, with one addition: we shall introduce a column of nodes whose purpose is to rearrange the input values from the standard linear ordering of indices to the bit-reversed ordering required at the input of the FFT-computing graph. The figure below shows the modified graph, with circled parts corresponding to the subcomputations performed by individual processes.
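For concreteness, the following is a sketch of the bit-reversing index permutation used by that input column, assuming N is a power of two; its signature follows the bit_reverse(int N, int i) declaration that appears in the library header later in this section.

int bit_reverse(int N, int i)
{
    int r = 0;
    for (int bits = N >> 1; bits > 0; bits >>= 1) {   // one step per address bit of an N-point FFT
        r = (r << 1) | (i & 1);                       // shift the low bit of i into r
        i >>= 1;
    }
    return r;
}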

Figure: An N-point FFT computation, with the processes circled (the butterfly graph connecting the inputs x to the outputs X through twiddle factors W_N)

Typically, writing C+- programs consists of four stages:

1. choosing a concurrent algorithm;
2. designing an input/output interface;
3. designing the process hierarchy; and
4. describing the process behavior.

We shall organize the program as a library package. The figure below illustrates the user-level view of this library. Input values are to be sent to processes of type fft, and output values will be delivered to processes of the same type. For an N-point FFT computation there are N input and N output processes, all of which have to be derived from fft. The set of pointers to the N input processes could be represented in a variety of ways, but it is often most intuitive to represent these processes as members of a process vector, as described earlier in this chapter. The same is true for the set of pointers to the N output processes.

The first program below is the header file that the user must include to access the library; a user program might look like the example that follows it. Since the library sends the output values to the vector of fft processes, the consumer processes are derived from fft, and have to be created using the distributed-process mechanism. The producer processes, on the other hand, don't have to be elements of any vector, unless some other part of the user code needs to treat them so.

Figure: User View of the FFT Library (N producer processes feed the inputs of the FFT graph; N consumer processes, derived from fft, receive the outputs)



// fft.h
#include "c+-.h"                     // The runtime-system header file
#include "Complex.h"                 // The complex-arithmetic package

processdef fft : public CPM {        // The runtime system requires that
                                     //   every process be derived from CPM
 public:
  atomic virtual void connect(fft*) = 0;  // The "= 0" syntax in C+- denotes
  atomic virtual void in(Complex)   = 0;  //   that this is the specification
                                          //   of an interface, leaving it to
                                          //   the derived processes to specify
                                          //   how the requests are serviced
};

processdef fft_graph : public CPM {  // This process represents the
                                     //   whole graph
 private:
  fft *inputs;                       // The pointer to the first input
  int order;                         // Size of the FFT graph
 public:
  atomic fft_graph(int, fft*);       // Creating the fft process graph
  atomic ~fft_graph();               // Deleting the fft process graph
  atomic fft *input(int);            // Finding out the address of a
                                     //   particular input
};

Complex W(int N, int i);             // A function that computes
                                     //   complex roots of unity
int bit_reverse(int N, int i);       // A bit-reversing function

Program: The FFT-Library Header File


#include "fft.h"

processdef consumer : public fft {
 public:
  atomic virtual void in(Complex);         // Do something with the result
};

processdef producer : public CPM {
 public:
  atomic producer(fft*);                   // Produce input values
};

const int N = ...;

root::root(int argc, char *argv[])
{
  fft *outputs = new consumer[N];          // Create the vector of consumers
  fft_graph *g = new fft_graph(N, outputs);
                                           // Create the computation graph
  fft *inputs = g->input(0);               // Get the reference to the inputs
  for (int i = 0; i < N; i++)              // Create N producers
    new producer(&inputs[i]);
}

Program: An Example of FFT-Library Usage


The figure below shows the process-specification hierarchy that we chose to implement, and the two program parts that follow it specify this hierarchy.

Figure: Process-Specification Hierarchy (fft at the root; relay derived from fft; fork and join derived from relay; mult_fork derived from fork; join_fork derived from join; join_mult_fork derived from join_fork)

The fft process definition is just an interface specification and does not describe any computation. The remaining process definitions specify that the process activity consists of four distinct stages:

1. establishing a connection, i.e., obtaining output references;
2. getting one or two input values;
3. computing the result, which may involve an addition and a multiplication; and
4. outputting one or two output values.

The common parts of the code are shared between different process definitions through the process-inheritance mechanism. Using multiple inheritance, whereby process definitions can be derived from more than one process definition, would have resulted in better code reuse. Nevertheless, we felt that, in the examples in this thesis, multiple inheritance would not have contributed to the readers' understanding of C+-.


// fft.h (continued)
#include "fft.h"

processdef relay : public fft {
 protected:
  fft *out;                            // Output reference
  Complex result;                      // The result
  virtual void compute(Complex);       // How to compute the result
  virtual void output();               // How to generate the output
 public:
  atomic virtual void in(Complex);
  atomic virtual void connect(fft*);
  atomic relay() { passive(in); }
};

processdef join;

processdef fork : public relay {
 protected:
  join *out1;                          // Fork adds an output reference
  virtual void output();               //   and produces two output values
 public:
  atomic virtual void connect(fft*, join*);
};

processdef mult_fork : public fork {   // Mult_fork also needs to multiply,
 protected:
  Complex W;                           //   so here is the multiplicand,
  virtual void compute(Complex);       //   and how to compute;
  virtual void output();               //   it must generate the output
 public:
  atomic virtual void connect(fft*, join*, Complex);
};

Program: Process Hierarchy for the FFT Computation (Part 1)


// fft.h (continued)
#include "fft.h"

processdef join : public relay {       // Join has two distinct inputs
 protected:
  virtual void compute(Complex);       // How to compute the result
 public:
  atomic virtual void in1(Complex);
  atomic virtual void in2(Complex);
  atomic join() { passive(in1); passive(in2); }
};

processdef join_fork : public join {   // The same modifications
 protected:                            //   as from relay to fork
  join *out1;
  virtual void output();
 public:
  atomic virtual void connect(fft*, join*);
};

processdef join_mult_fork : public join_fork {   // The same modifications
 protected:                                      //   as from fork to mult_fork
  Complex W;
  virtual void compute(Complex);
  virtual void output();
 public:
  atomic virtual void connect(fft*, join*, Complex);
};

Program: Process Hierarchy for the FFT Computation (Part 2)

The behavior of the various process types is specified in the three program parts that follow.

// fft.cpm
#include "fft.h"

atomic void relay::connect(fft *f)
{
    out = f;
    active();                          // make all atomic functions active
}

atomic void fork::connect(fft *f, join *j)
{
    out = f;
    out1 = j;
    active();
}

atomic void mult_fork::connect(fft *f, join *j, Complex c)
{
    out = f;
    out1 = j;
    W = c;
    active();
}

atomic void join_fork::connect(fft *f, join *j)
{
    out = f;
    out1 = j;
    active();
}

atomic void join_mult_fork::connect(fft *f, join *j, Complex c)
{
    out = f;
    out1 = j;
    W = c;
    active();
}

Program: The FFT Computation (Part 1)


// fft.cpm (continued)
#include "fft.h"

atomic void relay::in(Complex c)
{
    compute(c);
    output();
}

void relay::compute(Complex c)
{
    result = c;
}

void mult_fork::compute(Complex c)
{
    result = W * c;
}

void relay::output()
{
    out->in(result);
}

void fork::output()
{
    out->in(result);
    out1->in1(result);
}

void mult_fork::output()
{
    out->in(result);
    out1->in2(result);
}

Program: The FFT Computation (Part 2)


// fft.cpm (continued)
#include "fft.h"

atomic void join::in1(Complex c)
{
    if (passive(in2)) {
        compute(c); output(); active(in2);
    } else {
        result = c; passive(in1);
    }
}

atomic void join::in2(Complex c)
{
    if (passive(in1)) {
        compute(c); output(); active(in1);
    } else {
        result = c; passive(in2);
    }
}

void join::compute(Complex c)
{
    result += c;
}

void join_mult_fork::compute(Complex c)
{
    result = (result + c) * W;
}

void join_fork::output()
{
    out->in(result);
    out1->in1(result);
}

void join_mult_fork::output()
{
    out->in(result);
    out1->in2(result);
}

Program: The FFT Computation (Part 3)


Finally, the three program parts that follow contain the code used to build the N-point-FFT process graph. Depending on how time-critical this creation task is, solutions range from entirely sequential, taking O(N log N) steps, to maximally concurrent, taking just O(log N) steps. Our solution follows an intermediate approach, in which the process creation is concurrent and takes O(log N) steps, whereas passing references around is sequential for each process column and takes O(N) steps.

// fft.h (continued)
#include "fft.h"

processdef build_top_fft : public CPM {
 public:
  atomic build_top_fft(int, join*, int, int, fft*);
};

processdef build_btm_fft : public CPM {
 public:
  atomic build_btm_fft(int, join*, int, int, fft*);
};

Program: Building the FFT Graph (Part 1)


// fft.cpm (continued)
#include "fft.h"

fft_graph::fft_graph(int N, fft *outs)
{
    order = N;
    inputs = new relay[N];
    if (N > 1) {
        join *j = new join[N];
        new build_top_fft(N, j, 0, N/2, inputs);
        new build_btm_fft(N, j, N/2, N, inputs);
        for (int i = 0; i < N; i++)
            j[i].connect(&outs[i]);
    } else {
        inputs->connect(outs);
    }
}

Program: Building the FFT Graph (Part 2)


// fft.cpm (continued)
#include "fft.h"

build_top_fft::build_top_fft(int N, join *outs, int from, int to, fft *inputs)
{
    int n = to - from;
    if (n > 1) {
        join_fork *f = new join_fork[n];
        new build_top_fft(N, f, from, from + n/2, inputs);
        new build_btm_fft(N, f, from + n/2, from + n, inputs);
        for (int i = 0; i < n; i++)
            f[i].connect(&outs[i], &outs[n+i]);
    } else {
        fork *f = new fork;
        f->connect(&outs[0], &outs[1]);
        inputs[bit_reverse(N, from)].connect(f);
    }
}

build_btm_fft::build_btm_fft(int N, join *outs, int from, int to, fft *inputs)
{
    int n = to - from;
    if (n > 1) {
        join_mult_fork *f = new join_mult_fork[n];
        new build_top_fft(N, f, from, from + n/2, inputs);
        new build_btm_fft(N, f, from + n/2, from + n, inputs);
        for (int i = 0; i < n; i++)
            f[i].connect(&outs[n+i], &outs[i], W(N, from+i));
    } else {
        mult_fork *f = new mult_fork;
        f->connect(&outs[1], &outs[0], W(N, from));
        inputs[bit_reverse(N, from)].connect(f);
    }
}

Program: Building the FFT Graph (Part 3)


Chapter: Implementation Issues

There are two major components to the C+- programming system: the translator from C+- to C++, and the C+- runtime system. This programming system is currently supported on the Mosaic and on all systems that support the Cosmic Environment/Reactive Kernel (CE/RK) message-passing primitives, which includes sequential computers, networks of workstations, and a variety of commercial multicomputers and multiprocessors.

The translator is written in C++ and is both compile-machine- and target-machine-independent. Most of the runtime-system code is portable as well, with the exception of a small set of C library functions that are illustrated later in this chapter.

The Runtime-System Framework

The relationship between the C+- programming notation and the C+- runtime systems is symbiotic: programs written in C+- require runtime-system support; C+- runtime systems are typically written in C+-.

Although most of the runtime-system code is portable, the resource-allocation requirements on various machines are quite different. Given a sufficiently large node memory, the amount of runtime-system support that C+- programs require is minimal: the runtime systems for C+- implementations on computers with workstation-size nodes typically consist of less than a thousand lines of C+- code.

The Mosaic fine-grain multicomputer consists of nodes with severely restricted memory resources; hence, the runtime system for the Mosaic employs much more sophisticated runtime mechanisms. Various configurations of MADRE (the MosAic Distributed Runtime systEm) range from two to ten thousand lines of C+- code. MADRE was written by Nanette J. Boden, and its design and the distributed algorithms it employs are described in detail in her PhD thesis. This work demonstrates that the complexity of runtime systems for fine-grain multicomputers need not result in large penalties in speed, nor does it imply large chunks of node-resident code that reduce the available node memory even further: MADRE is itself a concurrent program that employs distributed solutions to manage distributed resources.

The mutual dependence of the C+- programming notation and the C+- runtime systems is only apparent. In fact, the runtime system is just a pre-written part of any user program, a part that includes an interface to the resource-allocation and communication capabilities of the machine it is running on. The C+- programming model and programming notation supply only the framework for implementing process management and data communication, striving not to restrict the spectrum of possible runtime-system implementations. The remainder of this section describes this framework. Since the primary target for executing C+- programs is the Mosaic, the names and default semantics of the functions that we use correspond to message-passing communication primitives. This does not, however, imply that these primitives are the only ones that can be used; shared-memory communication primitives, for example, are equally suitable for implementing the necessary low-level routines.

Process Creation

An example of how process creation may be implemented in C+- is given in the program below. In general, process creation consists of the following three stages:

1. Choosing a manager, by invoking the manager_CPM function corresponding to the type of the process being created. This function must return a pointer to the process that will be asked to instantiate the new process. It is possible to define multiple versions of this function, some of which may take arguments; for example, different versions may correspond to different process-placement strategies.

2. Requesting the creation from the chosen manager, by invoking the manager's create_CPM atomic action. The two arguments correspond to the size of the process and the address of the constructor to be invoked. If the constructor takes arguments, those are passed as well. Various flavors of process creation can coexist in the system, with one of them selected at creation time.

3. Instantiating the process is done by a manager process, not necessarily the one originally chosen: the creation can be delegated to other potential manager processes, and is eventually done in the consenting manager's address space.

(The manager_CPM function must be declared static, which is a C++ feature that makes a member function generic: associated with a certain class definition, not with any particular instance of that class. The size_t is a C++-defined integer type that can represent the size of the largest possible object or process; the entry_t type is introduced by C+- and will be described later in this chapter.)


processdef Manager {
 public:
  atomic P *create_CPM(size_t, entry_t, ...);
  ...
};

processdef P : dynamic Manager {
 public:
  static Manager *manager_CPM();
  atomic P();
  atomic P(int);
  ...
};

new P;           // is equivalent to:
P::manager_CPM()->create_CPM(sizeof(P), P::P);

int i;
new P(i);        // is equivalent to:
P::manager_CPM()->create_CPM(sizeof(P), P::P(int), i);

Program: Process Creation

Runtime Services

All of the protected and public members of a manager can be accessed by the processes it manages. This access is handled transparently by the compiler: the programmer need not be concerned whether some service is provided through regular inheritance or through dynamic inheritance, with the latter requiring one or more levels of indirection (see the program below).

Process Dispatch

A problem that emerges in the design of all operating and runtime systems is that of specifying an interface for invoking user programs. This task is typically done in an ad hoc way. For example, user programs written in C and run under UNIX must have a function called main, which is the user-code entry point. However, this approach does not enable the operating-system code to merely call this function, since the address of main is not known at operating-system linking time. The typical solution is to require that main always be at the same address, or to find its address at loading time.

Every C+- process has a fixed number of entry points, corresponding to its atomic actions, each of which could take different numbers and types of arguments


processdef Manager {               // runtime-system code
 protected:
  int i;
  void f();
  ...
};

processdef P : dynamic Manager {   // user code
 private:
  int j;
  void g();
 public:
  atomic P()
  {
    j = 0;                         // accessing local data
    i = 0;                         // accessing manager's data
    g();                           // calling local function
    f();                           // calling manager's function
  }
};

Program: Accessing Runtime Services

and return values of different types. If the runtime system itself is to be expressed in C+-, there must be a way of dispatching to any atomic action of any process, or of any process in some predefined set. In the remainder of this section, we describe the C+- atomic-action dispatch mechanism.

As illustrated in the figure below, every process P is a node of a process tree, with its path toward the root of the tree leading through its manager M, its manager's manager MM, etc. Several such trees may coexist on each physical node. Every processdef M that could be used as a dynamic base for some process definition (which means that an instance of M could be a manager of some process) must have a special atomic action defined, atomic M(entry_t, ...), called the dispatcher. A generic dispatcher, atomic (entry_t, ...), also has to be defined; its job is to dispatch to the root processes of process trees.

The entry_t is a type introduced by the compiler, corresponding to any and all types of entry points of processes that could be defined with M as their dynamic base. A variable of this type can be used like a regular C++ member-function pointer, with one important distinction: one need not know the interfacing details of all atomic actions that a variable of type entry_t may be used to invoke. How arguments are passed to anonymous atomic actions is discussed at the end of this section; how values are returned from atomic actions is presented later in this chapter.


atomic (entry_t, ...);                                     // generic dispatcher
processdef MM              { atomic MM(entry_t, ...); ... };
processdef M  : dynamic MM { atomic M(entry_t, ...); ... };
processdef P  : dynamic M  { atomic int f(int, char); ... };

Figure: Process Dispatch

Specification: An execution of an atomic action of a process can be requested only from the body of its manager's dispatcher atomic action.

For the process hierarchy in the figure above, this specification means that the execution of an atomic action of processdef P, say P::f, consists of executing the generic dispatcher, which calls MM::MM, which calls M::M, which calls P::f. It is this layered execution that enables managers to manage other processes. The semantics of atomic-action executions can be changed by modifying the runtime-system code. As stated in The Annotated C++ Reference Manual, this opens vast opportunities for generalization and language extension in the general area of "What is a function, and how can I call it?" This feature could strike the reader as intolerably underspecified, and as inviting hacking and abuse. However, the safety properties of this mechanism are not as weak as they may appear to be: the runtime-system-specified mechanisms cannot be changed by users; the manager always gets to run before dispatching to the managed process. We have come to believe that support for some mechanism of this kind is essential for a notation that is intended for expressing operating and/or runtime systems.

Another way of thinking about this layered dispatch mechanism is that every process provides a set of services (its atomic actions), and an escape mechanism to which it can defer the execution if it cannot handle the requested service itself.
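The following toy C++ rendering of the layered-dispatch idea may help fix the picture; plain functions stand in for dispatcher atomic actions, and all names and the argument record are ours, not the C+- runtime interface. Each layer gets to run, and may account, schedule, or defer, before handing control to the entry point it was asked to reach.

#include <cstdio>

using entry_fn = void (*)(void *args);

void user_entry(void *)            { std::puts("P::f runs"); }

void manager_dispatch(entry_fn e, void *args)
{
    std::puts("M runs first (scheduling, accounting, ...)");
    e(args);                       // only the manager starts the managed action
}

void generic_dispatch(entry_fn e, void *args)
{
    std::puts("root dispatcher runs");
    manager_dispatch(e, args);     // descend one level of the process tree
}

int main()
{
    generic_dispatch(user_entry, nullptr);
}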

Arguments to Atomic Actions

The memory layout of the arguments to atomic actions is the same as that for regular functions in C++, with additional arguments being passed to the dispatcher actions of the manager processes (see the figure below).


Figure: Atomic Actions' Argument Layout (the arguments to MM::MM, followed by the arguments to M::M, followed by the arguments to P::f -- an int and a char)

These additional arguments are, by default, generated by the compiler; but, as discussed later in this chapter, this default behavior can be replaced by one defined by the programmer.

An additional feature is that the arguments are assumed to be members of the compiler-introduced structure args_t, and can be accessed as a unit through a pointer variable, args_t *args, similar to the this variable in C++.
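A minimal sketch of accessing the argument record as a unit, in the spirit of the args_t convention; the struct layout shown is an assumption for a particular atomic action, not a fixed format.

struct args_t {                     // compiler-introduced record of P::f's arguments
    int  i;
    char c;
};

int sum_of_arguments(args_t *args)
{
    return args->i + args->c;       // the record can be inspected, copied, or forwarded whole
}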

The pointer_t and entry_t Types

In the programming examples we made use of the pointer_t and entry_t types, always referring to them as introduced by the compiler. These two types are actually defined by the runtime system, in a file that has to be included by every C+- program (c+-.h). The C+- translator makes the structure of every process pointer the same as that of pointer_t, and the structure of every pointer to a member of a process the same as that of entry_t.

Process State

As discussed in the previous sections, the state of a C+- process consists of:

- its data members,
- the active/passive set, and
- a pointer to the manager process.

What are the semantics of process assignment in the context of processes with the state defined above? The default C+- semantics for process assignment are bitwise copying of the data members and of the representation of the active/passive set; the pointer to the manager process is left untouched. The example in the program below shows process assignment as equivalent to sending a request to the source process to send a copy of itself to the specified destination process.


processdef P {
 private:
  int i;
 public:
  atomic int copy_CPM(P *pp)
  {
      pp->copy_CPM(*this);           // forward
  }
  atomic int copy_CPM(P p)
  {
      *this = p;
      return 0;
  }
};

P p1, p2;

p1 = p2;                             // is equivalent to:
await p2.copy_CPM(&p1);

Program: Process Assignment

Process Migration

No notion of process migration is supported directly in C+-: a process pointer typically contains an absolute address of a piece of memory representing the state of a process. However, the example in the preceding program shows how simple it is to copy the state of a process. Furthermore, with the ability of the runtime system to define the structure of process pointers, the runtime-system framework described in this chapter was sufficient to implement distributed processes. The support for distributed processes requires the same indirection mechanism that might be used for process migration. The work reported in the literature is a first step towards a thorough examination of the issues involved in process migration; the results presented in that work establish conditions under which, for example, process state can be shipped to where the atomic-action code is located, just as readily as code can be cached where the process state is located.

Invoking Atomic Actions

As illustrated in the program at the end of this section, an atomic-action invocation consists of three stages:

Introductory Stage: Upon calling, the operator space() is invoked to determine the size of the argument list, and the operator head() is invoked to build the dispatcher list. Given a data type TYPE and a process type PROCESS, the default operator semantics are as follows:

size_t operator space(TYPE t)
{
    return sizeof(t);
}

static void *PROCESS::operator head(void *v, pointer_t p, entry_t e, size_t s)
{
    return operator send(v, e);
}

Main Stage: For each element in the argument list, the operator send() is invoked. The default operator semantics are a bitwise copy:

void *operator send(void *v, TYPE t)
{
    TYPE *tp = (TYPE *)v;
    *tp = t;
    return tp + 1;
}

Final Stage: The operator tail() is invoked, with no-op default semantics:

static void *PROCESS::operator tail(void *v, pointer_t, entry_t, size_t)
{
    return v;
}

At the time of atomic-action execution, operator recv() is invoked for each element in the argument list. The default semantics for this operator are a no-op:


void operator recv(TYPE &t) {}

Program: Default operator recv()

The set of operators described above provides runtime-system programmers with a powerful tool that they can use to define how process communication is actually implemented in terms of lower-level routines. The same set of operators is available to users. An example of an application that might benefit significantly from the ability to exercise total control is a program that implements communication-network protocols. The general usability of the above mechanism, however, is highly questionable: once the compiler relinquishes control over data layout to a naive user, obscure problems abound. For the great majority of applications, the efficiency of the data-exchange mechanisms described earlier is sufficient.


processdef MM {
 public:
  atomic MM(entry_t, ...);
};

processdef M : dynamic MM {
 public:
  atomic M(entry_t, ...);
};

processdef P : dynamic M {
 public:
  atomic void f(int, char);
};

P *p;
int i;
char c;

p->f(i, c);        // atomic-action invocation, is equivalent to:

size_t size = operator space(MM::MM)     // assuming there are
            + operator space(M::M)       //   no alignment problems
            + operator space(P::f)
            + operator space(i)
            + operator space(c);
void *b, *v;
pointer_t pp = p;
b = v = operator head(pp, MM::MM, size);
v = MM::operator head(v, pp, M::M, size);
v = M::operator head(v, pp, P::f, size);
v = operator send(v, i);
v = operator send(v, c);
v = M::operator tail(v, pp, P::f, size);
v = MM::operator tail(v, pp, M::M, size);
operator tail(b, v, pp, MM::MM, size);

Program: Atomic-Action Invocation


Active/Passive

The active/passive mechanism, because of its simplicity and efficiency, is the C+- synchronization mechanism of choice. The runtime-system interface for this mechanism is presented in the program below. If a different synchronization mechanism is required, it can be implemented following the same approach.

processdef P {
 public:
  atomic void f();
  atomic int g();
};

atomic void P::f()
{
    active(f);         // is equivalent to:
    P::active_CPM(&P::f);

    passive(g);        // is equivalent to:
    P::passive_CPM(&P::g);
}

Program: Active/Passive Implementation

Remote Procedure Call

When invoking an atomic action that returns a value, the sequence of events is identical to that described above, except that an extra argument is passed. This extra argument is the pointer to the currently running process, the process that expects the reply. This pointer is obtained by calling the runtime-system-defined function current_CPM. A NULL extra argument implies that the returned value is not required.

Values Returned from Atomic Actions

Inside an atomic action, the extra argument is called reply_CPM. As illustrated in the program below, returning a value from an atomic action is equivalent to invoking the return_CPM atomic action of the process pointed to by the reply_CPM pointer. (Note that it was not possible to use the this variable, because a process might be suspended while executing a non-member function.)



processdef P {
 public:
  atomic int f();
};

atomic int P::f()
{
    return 0;          // is equivalent to:
    if (reply_CPM)
        reply_CPM->return_CPM(0);
    return 0;
}

Program: Atomic Actions Returning Values

Suspending a Process

Whenever a returned value is expected from an atomic action, the compiler introduces a placeholder for that value, and the runtime system is passed a pointer to this placeholder through the wait_CPM(void*) function. Multiple placeholders can be active at any time. When the process attempts to access the placeholder and finds it uninitialized, it suspends itself by invoking the suspend_CPM function.
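A sketch of this placeholder discipline, written in ordinary C++ in place of the compiler-generated code; wait_here and suspend_here are hypothetical stand-ins for wait_CPM and suspend_CPM, stubbed out here only so the fragment is self-contained.

#include <optional>

template <typename T>
struct placeholder {                    // compiler-introduced slot for a reply
    std::optional<T> value;             // empty until the reply arrives
};

// Stubs standing in for the runtime's wait_CPM and suspend_CPM; their real
// counterparts register the slot and deschedule the process.
inline void wait_here(void *) {}
inline void suspend_here()    {}

template <typename T>
T await_value(placeholder<T> &slot)
{
    wait_here(&slot);                   // hand the runtime a pointer to the placeholder
    while (!slot.value)                 // found uninitialized: suspend ourselves
        suspend_here();                 //   until the runtime fills it in and resumes us
    return *slot.value;
}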

From C+- to C++

There are a number of reasons for translating from C+- to C++, instead of compiling from C+- directly to Mosaic code. First, this was a faster way to build a running system. Second, the wide availability of C++ guaranteed machine independence. Third, we had good experience in retargeting the Gnu C compiler to produce excellent code for the Mosaic processor. And fourth, since C+- is syntactically so similar to C++, C++ debugging tools and other programming-support tools can be used with few or no modifications. One disadvantage of the translation approach is that the compile time increases, because programs must be parsed twice. A possible disadvantage is that some optimization opportunities may be lost when using C++ as an intermediate target notation; however, we have identified no such lost opportunities so far.

Parsing

The translator is a C++ program built within the framework of a Bison-produced parser. Practically every person who has ever worked on a project that involved parsing of C++ has already expressed their distaste that C++ syntax cannot be described by an LALR grammar. Nevertheless, we feel that our own distaste should be on record, too. We acknowledge that it is not the compiler writer, but the language user, who should be the ultimate judge of the value and style of a programming notation. However, if syntactic issues are subtle enough to be difficult for a compiler, what hope does a user have of not making obscure mistakes when writing programs using that syntax? Fortunately, beginners tend to use a small set of basic language constructs, whereas experienced users tend to develop their own programming style from a subset of the rich C++ offering. In our experience, the complexity of handling the few special cases in parsing C++ is comparable to the complexity of all of the remaining issues of translating C+- into C++. Suffice it to say that we are looking forward to the ANSI standard for C++ syntax.

In our implementation of the translator, each grammar rule corresponds to a class definition. For example, given the grammar rule in the program below, three class definitions have to be written, as shown in the program that follows it.

expression:
    assignment-expression
  | expression , assignment-expression
  ;

Program: An Example of a Grammar Rule

Parsing a C+- program generates a parse tree that consists of nodes that are instances of classes such as those illustrated below. We developed a program that, given an input grammar such as the one illustrated above, generates the default class definitions, the code that builds the parse tree, and the default definitions of the output functions. The resulting program code is a parsing specification for Bison, which can be used to produce a default parser. When a source program is fed to this default parser, the parser builds the parse tree; it then invokes the output function at the topmost level of the tree, thereby causing the entire source program to be produced as the output. This default behavior can be modified by defining additional elements of class definitions, by specifying extra actions to be taken while building the parse tree, and by providing customized versions of the output routine for any class definition.


class expression {
 public:
  virtual void output();
};

class expression_1 : public expression {
  assignment_expression *member_1;
 public:
  void output()
  {
    member_1->output();
  }
};

class expression_2 : public expression {
  expression *member_1;
  assignment_expression *member_2;
 public:
  void output()
  {
    member_1->output();
    member_2->output();
  }
};

Program: A Part of the Definition of the Parse Tree

This simple tool for developing programs for source-to-source transformation, itself a program of less than two thousand lines of C++ code, has been crucial to our ability to experiment with numerous versions of the C+- syntax. The tool generates about two-thirds of the C++ code of the complete C+- translator.

Code Generation

Once the hurdle of parsing C+- is overcome, the translation from C+- to C++ is a fairly simple task; the description of the runtime-system framework earlier in this chapter also specifies this translation task. Since the process concept is the only extension that C+- introduces to C++, the focus of the translator is on keeping track of processes and of the various other process-related types. The translator considers each segment of a source program to be a type transformation. For example, a process-pointer type, when dereferenced, is transformed into a process type, and a function call transforms a list of argument types into the type of the returned value. Since the translator keeps track of all of the type transformations in a program text, operations on processes are detected, and the replacement code, as illustrated earlier, is generated.

Code Splitting

In addition to the transformations described above, there is one more requirement on the translator. Since the Mosaic, a machine with limited node-memory resources, is the most important target machine for executing C+- programs, the C+- translator must provide support for code splitting. Pieces of code are cached in each node by the runtime system, and invoked through the indirect-function-call mechanism. A design decision had to be made on what the code-splitting target should be.

The default object-code unit provided by the regular C++ compilers is the piece of code produced by the compilation of one source file. We considered this default setup to be unacceptable: programmers would have to organize their code according to the code-splitting policy, rather than according to the programming-abstraction requirements of the application. This setup would unavoidably lead to a loss of portability, whereby the source code would have to be rearranged and split into smaller pieces when moving to a machine with less node memory.

Given that the default code-splitting policy was deemed unusable, we identified three well-defined code-splitting targets. These three targets, with increasing granularity, are to split the code so that each piece corresponds to:

1. an atomic action of a process;
2. a function and/or an atomic action of a process; or
3. a block of code within a function with strictly sequential execution (no conditional execution).

The next-higher-granularity target would be equivalent to turning the runtime system into a pseudo-code interpreter.

If the block of code with strictly sequential execution is the code-splitting target, only code that is certain to be executed is ever brought to the code cache; however, this implies more frequent code-cache updates.

If the code corresponding to a function or an atomic action is the code-splitting target, there is no unnecessary code duplication, as every named piece of code is a stand-alone unit; in this case, an indirect-call overhead has to be paid for each function call.

Even though each of these options could be supported by the C+- translator, we decided to split the code into pieces that correspond to atomic actions of processes. This was the least-complicated and the best-understood approach,


and it still allowed us to provide an experimental testbed that can be used to determine the effect of code-splitting granularity on machine performance. The code of a function is linked with every atomic action that invokes it. Some of the runtime-system services, such as sending messages and creating new processes, are accessed by virtually every user process, and replicating that code would be equivalent to including a large fraction of the runtime system in the code of each user-process atomic action. Access to these services is through the indirect-function-call mechanism, but its specification is left entirely to the runtime-system implementation. We consider this an acceptable compromise, particularly because any efficient code-caching policy must distinguish such often-used code anyway.
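The indirect-function-call discipline that code splitting relies on can be sketched as follows; the table layout and the fetch_code helper are assumptions of this sketch, not MADRE's design.

#include <cstddef>

using code_ptr = void (*)(void *args);

struct code_cache_entry {
    int      id;         // which atomic action this slot currently holds
    code_ptr entry;      // nullptr when the code is not resident
};

constexpr std::size_t CACHE_SLOTS = 64;
code_cache_entry cache[CACHE_SLOTS];

code_ptr fetch_code(int id);                       // assumed: brings the code into node memory

void invoke(int id, void *args)
{
    code_cache_entry &slot = cache[id % CACHE_SLOTS];
    if (slot.id != id || slot.entry == nullptr)    // miss: load and remember the code
        slot = { id, fetch_code(id) };
    slot.entry(args);                              // the indirect call itself
}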


Chapter: The Mosaic C Multicomputer Architecture

In its evolution away from the sequential computer organization, the multicomputer variety of MIMD concurrent computers has, from the very beginning, acknowledged the importance of locality in VLSI systems. A multicomputer consists of a collection of computing nodes connected by a communication network (see the figure below). Each node is typically a sequential computer, with its own processor and memory.

Figure: Multicomputer Architecture (computing nodes connected by a communication network)

These computers communicate data and synchronize their activities by exchanging messages through the communication network. Distinct mechanisms are used for processor-memory and inter-processor communication: the first is optimized for minimum latency, the second for maximum bandwidth. Unlike interprocess-communication latency, which can be covered up by excess concurrency with multiple processes per node, there is no way to compensate for a lack of communication bandwidth. In multicomputers, the components involved in the latency-sensitive communication, processor and memory, are placed physically close, within the node. The use of these distinct communication mechanisms is why the multicomputer architecture has proven to be so cost-effective and scalable. Even though the question of the preferred


programming model is the subject of much dispute, the majority of contemporary concurrent computers, regardless of their primary programming model, are built either as pure multicomputers or as multicomputers with some additional hardware support.

The communication bandwidth and latency are important figures of merit of a multicomputer. The bandwidth is limited by the communication network itself, or by the nodes' network interface. The latency consists of two components: the network latency, the time required for a message to traverse the network; and the software overhead, the time that the processor takes to launch a message into the network and to accept a message arriving from the network. Since the communication network typically operates concurrently with the computing nodes, the network component of the message latency can be covered up with excess concurrency; the software overhead cannot. What is more, the software overhead consumes processor cycles that might have been devoted to the user's computation.

Background

The most common representatives of multicomputers, and arguably the most powerful multicomputers in existence, are computer networks. However, even when the logistics associated with using such collections of computers are taken care of, these systems are cost-effective today only for loosely coupled concurrent computations. This ineffectiveness is due to the inadequate communication bandwidth of existing networks, and to the software overheads of the concurrent-programming systems that access these communication capabilities.

The history of multicomputer-design efforts is the history of attempts to increase the available communication bandwidth and to reduce the communication latency, thereby enlarging the application span to include more tightly coupled concurrent computations. While some of these attempts have been successful, we believe that most of them have been half-hearted, which has contributed to the widespread belief that multicomputers are harder to program than multiprocessors. One aim of this thesis is to make the case that the programming-model issue can be separated from the machine-architecture issue.

In an earlier chapter we described a programming model with support for both shared-variable and message-passing programming paradigms. In the remainder of this chapter, we shall present the architecture of a multicomputer, the Mosaic. This architecture, although very simple, is a platform on which an efficient implementation of this programming model can be built.


The Mosaic Node

Most contemporary multicomputers adopt a node complexity that requires a circuit board per node (medium-grain nodes). A stipulation of the Mosaic project was that the complexity of a Mosaic node be determined by the silicon complexity available on a single chip with reasonable yield. This requirement was motivated primarily by the large disparity in performance and density between on-chip and inter-chip interconnection technology. Nodes of even finer granularity are feasible, but, at the present state of fabrication technology, smaller nodes quickly reach a point where they are too small to hold even their own program code. An additional consideration was that we perceived single-chip-node multicomputers to represent a not-sufficiently-explored point in the design space of multicomputers. The ultimate motivation for single-chip nodes came from the realization that fine-grain multicomputers could have a much larger application span than their medium-grain counterparts: they can deliver cost-effective computing in small embedded configurations as well as in large ensembles.

The organization of the Mosaic node, as shown in the figure below, is centered around the memory bus.

Figure: The Mosaic Node (processor, network interface, and router, with the dRAM and bootstrap ROM on the memory bus)

This bus connects the dynamic RAM (dRAM) and the bootstrap ROM on one side with the instruction-interpreting processor and the communication-network interface on the other side. The node router, although logically this node's part of the communication network, is part of the Mosaic chip. A plot of the layout of the Mosaic chip is shown in the figure below. A more complete description of the Mosaic node and of other Mosaic assemblies can be found elsewhere.


Figure: The Mosaic C Chip (SCMOS technology)

In this thesis, we shall focus on those architectural issues that are fundamental to our programming model, and, in particular, on achieving low hardware and software communication overhead.

The Mosaic Router

The communication network of the Mosaic multicomputer is a two-dimensional, bidirectional mesh. The analysis of the network performance, and the arguments for employing this particular network, can be found in the literature.

Each communication channel is an asynchronous link. The router provides packet communication and deadlock-free, dimension-order, cut-through routing. A detailed description of the Mosaic router is provided elsewhere.

The largest configuration supported with the current version of the router is bounded; larger ensembles can be supported, but messages would have to be relayed in software when the distance traveled along any dimension exceeds that bound. From the point of view of the rest of the computing node, the network is a bidirectional communication link with all other nodes. In a non-congested network, this communication link provides the full channel bandwidth in and out of the node, with a per-hop communication-establishing latency on the order of nanoseconds.

The Dynamic RAM

The Mosaic dRAM is by far the largest part of the Mosaic chip, and the most precious resource of a Mosaic node. The dRAM has been described in detail in [ ]. From the standpoint of the rest of the node, this memory is a single-clock-cycle dynamic RAM operating in a pipelined mode (see the accompanying figure). Memory access is allocated on a per-clock-cycle basis to one of the four independent address sources competing for it: the processor, the send and receive parts of the network interface, and the memory-refresh mechanism.

Figure: Memory-Access Pipeline (request, arbitrate, drive address, memory access, and post stages overlapped across successive clock cycles).

The Pro cessor and the Network Interface

The Mosaic processor is a microprogrammed engine with one-, two-, and three-word instructions, and with an average of approximately three clock cycles per instruction. The network interface is a direct-memory-access device that transfers data and performs reliable synchronization between the asynchronous router and the synchronous memory (see the chapter on pipeline synchronization).

The most prominent feature of the Mosaic node architecture, the one that makes it particularly appropriate for a multicomputer node, is the interaction between the processor and the network interface. Although low-latency handling of messages was imperative for the Mosaic, the message-handling capabilities had to be sufficiently general to allow experiments with different message-handling strategies. These two often contradictory goals were achieved in part through the two-context processor architecture, and in part by providing a set of highly efficient, low-level message primitives.

Two-Context Architecture

The most unusual feature of the processor is its two-context architecture. Each context has its own status register and eight general registers; there are eight additional general registers that are shared between the two contexts.

Typically, one context is used for running programs and the other for message handling. In this regime of operation, the message-handling context can be thought of as an extension of the network interface.

A context switch is performed only between instructions, so it may be postponed for several cycles. However, once initiated, a context switch takes zero time, since no processor state needs to be saved.

Messages and Interrupts

The message-handling and interrupt-handling operations are centered around a small set of dedicated registers. All of these registers can be accessed both by the processor and by the network interface. For example:

To send a message, the processor must specify where the message is located in memory and which node to send it to. This send operation is performed by writing into the following three registers:

- Message Send Pointer (MSP), which points to the location of the first word of the message;
- Message Send Limit (MSL), which points to the location of the last word of the message; and
- Destination Register (DXDY), which contains the address of the destination node, encoded as the relative distance in the X and Y dimensions of the mesh network.

Writing into DXDY triggers the network interface. The network interface then starts transferring the data from the memory, increments MSP, and continues until MSP exceeds MSL.
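As an illustration, here is a minimal C sketch of the send sequence. The register names MSP, MSL, and DXDY are those described above, but the memory-mapped declarations and the send_message helper are assumptions made for the example; the actual Mosaic code reaches these registers through compiler-specific extensions.

    /* Hypothetical memory-mapped views of the dedicated send registers. */
    extern volatile unsigned short MSP;    /* Message Send Pointer        */
    extern volatile unsigned short MSL;    /* Message Send Limit          */
    extern volatile unsigned short DXDY;   /* Destination Register        */

    /* Launch a message occupying the words [first_addr, last_addr].
       The dxdy argument is the pre-encoded relative destination address;
       writing it last triggers the network interface, which then copies
       words from memory, incrementing MSP until it exceeds MSL. */
    static void send_message(unsigned short first_addr,
                             unsigned short last_addr,
                             unsigned short dxdy)
    {
        MSP  = first_addr;
        MSL  = last_addr;
        DXDY = dxdy;
    }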

The message-receive operation is initiated by the network interface. The processor must specify where the message is to be written by setting:

- Message Receive Pointer (MRP), which points to the first available memory location; and
- Message Receive Limit (MRL), which points to the last available memory location.

The network interface generates three distinct interrupts, corresponding to the following conditions: when a complete message has been sent, when a complete message has been received, and when the receive buffer has been exhausted. The interrupts are handled by accessing two additional registers:

- Interrupt Status Register (ISR), which contains three bits that correspond to the three sources of interrupts. These bits are set by the network interface and cleared by the processor.
- Interrupt Mask Register (IMR), which contains three bits that correspond to the three interrupt sources. Modifying these bits enables or disables the corresponding interrupts.

The program excerpt below shows code that the run-time system might use as an interrupt-dispatch routine. All of these registers can be accessed directly from C through a feature provided by the GNU C compiler.

Software Overhead of Communications

As discussed earlier, the software overhead associated with the message send and receive operations is the communication bottleneck of most programming systems for multicomputers. The software overhead for C programs running on the Mosaic under the MADRE run-time system is analyzed in this section. The overhead shown represents a typical case, measured in Mosaic assembly instructions. The complexity of Mosaic assembly instructions is comparable to that of a typical load/store RISC processor, but the number of clock cycles per instruction is approximately three. The communication support of the Mosaic processor is minimal (see above), and the incorporation of such support into a typical RISC processor core is arguably relatively simple. Experiments with compiling the same code for contemporary RISC processors exhibit comparable instruction counts.

We want to emphasize that all the numbers were obtained from compiled code: our translator (which emits C) together with the GNU C compiler targeted for the Mosaic. Both the user programs and the run-time-system code were written in C, and the only programmer-specified optimization was the use of inline functions for critical run-time-system code. Excerpts from the source code and the produced assembly code are presented in Appendix A. In our experience, the compiled code was typically only ten percent less efficient than the best hand-coded assembly we could produce, which is not nearly enough to justify such an effort.

    void (*interrupt_table[])() = {     /* a jump table, filled with the      */
        software_int,                   /* names of routines that correspond  */
        recv_int,                       /* to various interrupt conditions    */
        send_int,
        send_recv_int,
        buff_int,
        buff_recv_int,
        buff_send_int,
        buff_send_recv_int
    };

    while (1) {
        int pending = ISR;              /* get the pending interrupts         */
        (*interrupt_table[pending])();  /* dispatch to an interrupt handler   */
        ISR = pending;                  /* writing back acknowledges all the
                                           interrupts that have just been
                                           serviced                           */
        asm("PUNT");                    /* assembly instruction to return to
                                           the other context; when the next
                                           interrupt arrives, we shall start
                                           here                               */
    }

Program: An Interrupt-Dispatch Routine

Message Sending

The accompanying figure illustrates the activity of two processes placed on two different nodes in direct communication. The producer process sends an infinite stream of empty messages to the consumer process. The send overhead in the typical case consists of [ ] Mosaic instructions.

A major portion of the send overhead is the instructions that allocate space in, and update, the send queue. Since the send queue is guaranteed to be used in FIFO order, it is implemented as a simple circular buffer. One way of reducing the overhead associated with the send queue would be to have the processor write the message contents directly into the network instead of into the memory. However, this approach would introduce too strong a coupling between the processor and the communication network. For example, an interrupt during the message-send operation would block the network; similarly, a blocked network would prevent the processor from doing other useful work, or would result in more frequent context switching.

Figure: Components of the Software Overhead of a Communication (per-component instruction counts for the producer and consumer sides: register updates, send-queue allocation, header formation, DXDY computation, send-queue check, message-send request, send-queue append, interrupt dispatch and return, send-queue update, network latency, receive-queue append, receive-queue check, code lookup, active/passive check, function dispatch, user atomic action, and receive-queue update).

Another big contributor to the software overhead on the sending side is the DXDY conversion. The routing hardware requires the first word of any message to consist of two bytes of sign-and-magnitude-encoded distance in the x and y routing dimensions. Extending the Mosaic instruction set with an instruction that would compute this relative distance would reduce the sender overhead at a negligible price in chip area. In general, the approach adopted by the Mosaic design team has been minimalist, avoiding such special instructions. However, the discrepancy between the byte-size, sign-and-magnitude encoding required by the router and the word-size, two's-complement arithmetic capabilities of the Mosaic processor must be regarded as a design oversight.
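As an illustration of the conversion discussed above, the following C sketch packs the signed relative distances into two sign-and-magnitude bytes; the exact bit layout of the header byte pair is an assumption for the example, not the documented Mosaic format.

    /* Pack signed relative distances dx and dy into two sign-and-magnitude
       bytes forming the first word of a message.  Bit 7 of each byte is
       assumed to hold the sign and bits 6..0 the magnitude. */
    static unsigned short encode_dxdy(int dest_x, int dest_y,
                                      int my_x, int my_y)
    {
        int dx = dest_x - my_x;
        int dy = dest_y - my_y;
        unsigned char bx = (dx < 0) ? (unsigned char)(0x80 | (-dx & 0x7f))
                                    : (unsigned char)(dx & 0x7f);
        unsigned char by = (dy < 0) ? (unsigned char)(0x80 | (-dy & 0x7f))
                                    : (unsigned char)(dy & 0x7f);
        return (unsigned short)((by << 8) | bx);
    }

On a word-oriented, two's-complement processor, each of these tests, masks, and shifts costs instructions, which is precisely the overhead that a dedicated instruction would eliminate.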

The message header in this example consists of the size of the message, the node number and address of the destination process, and the identifier of the atomic action to be invoked.

The four instructions designated as miscellaneous register updates could not be associated with any particular message-handling operation; these instructions are the necessary glue associated with register allocation between the message-sending components.

The code could be optimized for the case of short messages and a lightly loaded network by checking whether the message has already been launched into the network, thereby avoiding the interrupt dispatch and return.

Message Receiving

The typical receive overhead totals [ ] Mosaic instructions, about a quarter of which is spent in receive-queue management. Unlike the send queue, consumption of the receive queue depends on the user programs. Messages can be consumed out of order, either when using the active/passive mechanism or when suspending while waiting for an RPC reply. In our experience, programming models and notations that do not allow such message discretion merely dump the burden of buffer management on the programmer. For programs with regular communication patterns, this requirement is not too demanding. However, for highly dynamic communication patterns, buffer management becomes a significant part of the programming effort. The problem is exacerbated on fine-grain multicomputers, where local node resources may not be sufficient to absorb receive-queue fluctuations. One may be tempted to deal with receive-queue overflow by blocking the incoming message traffic, which will eventually block the message source. Such an approach introduces negative feedback that equalizes the communication rates of the producer and the consumer. Unfortunately, it also violates the consumption assumption that routing networks typically require to guarantee freedom from deadlock. The approach that the MADRE run-time system takes in dealing with receive-queue overflow is to export messages to other nodes and retrieve them later; for a detailed description, see [ ].

Once the decision is made that messages can be consumed out of order, the run-time system must provide a general implementation of memory allocation. The performance of receive-queue management depends on the communication pattern of the application at hand. We have experimented with two approaches to receive-queue memory allocation:

- Optimistic: designed for minimum latency, and optimized for the case when the majority of messages are consumed in order. A circular buffer can be used, and the general memory allocator need only be invoked to allocate space into which to copy the messages not consumed in order. The software overheads of the figure above were obtained with this approach (a code sketch follows this list).

- Pessimistic: designed for maximum throughput, and optimized for the case when there is enough irregularity in message consumption that the copying costs of the optimistic approach outweigh the advantages of simple memory allocation. In this approach, implemented in MADRE, the header of a message is received into a small dedicated buffer, and upon the buffer-full interrupt a buffer of the correct size is allocated for the rest of the message. This approach adds approximately twenty instructions to the receive overhead in the typical case, but messages are never copied between memory buffers.
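The following C sketch illustrates the optimistic policy mentioned in the list above, assuming a hypothetical fixed-size ring and the standard allocator; the real MADRE data structures are more involved, and wraparound handling is omitted.

    #include <stdlib.h>
    #include <string.h>

    #define RQ_WORDS 1024                   /* hypothetical ring capacity     */

    static unsigned short rq[RQ_WORDS];     /* circular receive queue, filled */
    static unsigned rq_head;                /* by the network interface       */

    /* Advance past the message of 'len' words at the head of the ring.
       A message consumed in order is handed to the user directly from the
       ring; a message that must be retained for later (out-of-order
       consumption) is copied out through the general allocator so that the
       ring head can keep advancing. */
    static unsigned short *advance(unsigned len, int consumed_in_order)
    {
        unsigned short *msg = &rq[rq_head];
        if (!consumed_in_order) {
            unsigned short *copy = malloc(len * sizeof *copy);
            memcpy(copy, msg, len * sizeof *copy);
            msg = copy;
        }
        rq_head = (rq_head + len) % RQ_WORDS;
        return msg;
    }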

The code-lookup overhead depends on the algorithm used; the ten instructions shown represent the overhead typical of a successful lookup operation. If the code is not present, it is located and brought from another node.

Chapter

Pipeline Synchronization

Introduction

The design methods and tools for VLSI are oriented almost exclusively towards the design of clocked, synchronous systems. Except for a few notable examples [ ], contemporary processor and memory technology is designed and used exclusively within the synchronous framework. The one area in which the asynchronous design style has been successful in upsetting the dominance of synchronous circuitry is high-performance data communication and routing. Even though there are mechanisms for minimizing clock skew [ ], the performance penalty for maintaining clock coherency in physically large systems is prohibitive.

One of the premises of the Mosaic project was that the machine be, at least in principle, arbitrarily extensible, so the communication network of the Mosaic node is asynchronous. The rest of the Mosaic node, however, is a synchronous design. The Mosaic network interface performs data transfer between the synchronous memory bus and an asynchronous communication link, each with a bandwidth of [ ] MB/s. A large Mosaic ensemble with [ ] nodes has a worst case of [ ] synchronization events per second, or an immense number of synchronization events per year. Just to reduce the rate of synchronization failure to once per year, the best synchronizers we know how to build in the CMOS technology used require about half of the available clock period. We were clearly very close to a point where small, unexpected process variations could result in nasty surprises. To deal with this difficult synchronization problem, in which the synchronization and data rates are similar, we developed a technique that can sustain the full bandwidth and achieve an arbitrarily low, nonzero probability of failure $P_f$, with a price in both latency and chip area of $O(\log\frac{1}{P_f})$.

Problem Specification

Given a required rate of data transfer of $E$ events per second between an asynchronous and a synchronous system, with each event delivering $W$ bits of information, design an interface that will guarantee that the probability of synchronization failure is less than a given $P_f$.

The assumption is that the flow control is implemented as either a two-phase or a four-phase signaling protocol with bundled data [ ].

Existing Solutions

The standard approach to interfacing asynchronous and synchronous systems is to use a synchronizer at each synchronous-system input. A synchronizer is a circuit that attempts to solve one of the following two equivalent decision problems characteristic of digital systems: given an input signal and a time reference, decide whether the input signal makes a transition before or after the reference; or, given an input signal and a voltage reference, decide whether the input voltage is higher or lower than the reference. As shown by theoretical work [ ] and by a wealth of experimental evidence [ ], any system attempting to solve one of these two problems is of limited reliability. In addition to the two stable states corresponding to the two decision outcomes, the system has a metastable state. One cannot put a bound on how long the system may require to exit this metastable state. However, a number of simple synchronizer implementations have been demonstrated that can guarantee that the probability that the metastable state lasts longer than $t_m$ decreases exponentially with $t_m$:
\[
P_f = e^{-t_m/\tau},
\]
where $\tau$ is a characteristic of the implementation. Therefore, to achieve a sufficiently small probability of synchronization failure for a single asynchronous input, all that is required is to allow a sufficiently long time for the synchronizer to exit the metastable state.

Let us apply this single-synchronizer approach to the problem specified above (see the accompanying figure). We shall use a synchronizer that compares the arrival time of its asynchronous input with the time reference defined by the down-going edge of its clock input. When the asynchronous input changes state within $T_W$ around the down-going edge of the clock input, the synchronizer will enter the metastable state, with the exit probability determined by the equation above. Regardless of the signaling protocol used, the synchronous system with clock period $T$ must satisfy $E \le 1/T$ to be able to sustain the required throughput of $E$ events per second. From now on, we shall assume that $E$ is equal to the maximum attainable throughput, $1/T$. We shall also assume that the implementation of the signaling protocol on the synchronous and asynchronous sides imposes a total overhead of $T_{oh}$ per transferred data item. This assumption implies that we can allow at most $T - T_{oh}$ for the synchronizer to exit the metastable state, and that there is a lower bound on the probability of synchronization failure that this simple approach can achieve:
\[
P_f \ge e^{-(T - T_{oh})/\tau}.
\]

Figure: Interfacing synchronous and asynchronous systems.
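For a feel for this bound, consider a purely illustrative numeric example; the values are hypothetical, not Mosaic measurements. With a clock period $T = 30$ ns, a protocol overhead $T_{oh} = 25$ ns, and $\tau = 0.5$ ns,
\[
P_f \ge e^{-(T - T_{oh})/\tau} = e^{-10} \approx 4.5\times 10^{-5},
\]
so at millions of transfers per second one would expect hundreds of synchronization failures per second, which motivates the search for a better approach.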

Widening the Data Path

One possible approach to improving on this simple solution is to change the data representation: instead of transferring $W$ bits every $1/E$ seconds, we can transfer $kW$ bits every $k/E$ seconds, in order to allow $k$ times as much time for synchronization (see the accompanying figure). A simple variant of this solution requires that all communications consist of multiples of $k$ data units. A less restrictive solution is equivalent to the solution presented next.

Figure: Widening the data path.

Deriving Signals With Less Than E Events per Second

An alternative approach is to change the control representation: instead of using a request signal that changes state for every data item transferred, we can derive a request signal that denotes that there are at least $k$ data items to be transferred (see the accompanying figure).

Figure: Deriving signals with less than E events per second (an asynchronous FIFO whose request to the synchronous side is raised only when k cells are full).

Stretchable Clocks

The solution illustrated in this section does not, strictly speaking, conform to the problem specification, but it is presented here for completeness. This solution achieves a $P_f$ of exactly zero, but it does not maintain the required bandwidth. The synchronizer must be able to detect that it is in the metastable condition, and it stretches the clock cycle of the synchronous system until the metastability has been resolved [ ]. Instead of synchronizing the asynchronous input to the clock, the clock is synchronized to the asynchronous input. When there is more than one asynchronous input, the clock must be stretched until all the synchronizers have exited the metastable state.

Figure: Stretchable clock.

Pipeline Synchronization

A common denominator of all the solutions presented above is that they treat synchronization as a one-shot process: signal events (transitions) are either asynchronous or synchronous. In this section, we shall characterize a signal $S$ by the probability distribution of its signal-event arrival time, $p_S(t)$, with respect to time as measured at the synchronous system to which $S$ is an input. In some cases we shall be interested only in the arrival times of positive or negative edges, and shall use the symbols $p_{S\uparrow}(t)$ and $p_{S\downarrow}(t)$, respectively. The two graphs in the accompanying figure represent two typical cases of $p_S(t)$, one for a synchronous and one for an asynchronous signal. The parameter $T$ is equal to the clock period of the synchronous system, and $p_S(t)$ is a periodic function with period $T$:
\[
\int_t^{t+T} p_S(u)\,du = 1 .
\]

Figure: Probability distribution of signal-event arrival time (a synchronous and an asynchronous case).

What makes a signal synchronous with respect to some clock is that events on that signal satisfy some setup and/or hold time with respect to the clock. Relating to the figure above, we use the following definition.

Definition: Signal $S$ is synchronous with respect to some clock if there exists a nonempty time segment $(t_s, t_h)$ in which $p_S(t) = 0$.

The usual assumption for asynchronous signals is that each arrival time is equally probable, or that we have no knowledge of the probability distribution. We define the asynchronous signals as all non-synchronous ones.

Definition: Signal $S$ is asynchronous with respect to some clock if there is no nonempty time segment in which $p_S(t) = 0$.

The probability distribution for a signal contains more information than is often necessary, so we shall also introduce a simpler metric to characterize how asynchronous a signal is.

Definition: For a signal $S$ and a given time window $T_W$ ($T_W \le T$), the asynchronicity is defined as
\[
A_S(T_W) = \min_t \int_t^{t+T_W} p_S(u)\,du .
\]

The intuitive meaning of asynchronicity is that, given a signal $S$ and a synchronous sampling device $D$ with setup time $T_{setup}$ and hold time $T_{hold}$, $A_S(T_{setup} + T_{hold})$ is the lowest probability of metastable behavior that can be achieved when sampling $S$ with the device $D$. For example:

- For synchronous signals, according to the definition above we can find a $T_{max}$ such that $T_W \le T_{max} \Rightarrow A_S(T_W) = 0$. Therefore, if we can build a sampling device with $T_{setup} + T_{hold} \le T_{max}$, it can sample $S$ without ever exhibiting metastable behavior.

- For asynchronous signals with a uniform probability distribution, $A_S(T_W) = T_W/T$. From the definition it is easy to show that this is the worst case for asynchronicity (a brief justification follows this list).

- An asynchronicity of 1 corresponds to a hypothetical, malicious asynchronous signal: no matter how we position the sampling window, and no matter how small we make it, the probability distribution will be a unit-size delta function positioned within our chosen time window. This case has to be assumed when interfacing to a signal of unknown probability distribution.
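A brief justification of the worst-case claim, added here as a sketch: averaging the window integral over all window positions within one period gives exactly $T_W/T$, so the minimum over positions can never exceed it,
\[
\frac{1}{T}\int_0^{T}\!\left(\int_t^{t+T_W} p_S(u)\,du\right) dt = \frac{T_W}{T}
\quad\Longrightarrow\quad
A_S(T_W) \le \frac{T_W}{T},
\]
with equality when $p_S$ is uniform.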

In the remainder of this chapter, we shall show how to build circuits that transform the arrival-time probability distribution of signals in a way that reduces their asynchronicity, and how to use these circuits to build pipeline synchronizers.

The Mutual-Exclusion Element

The mutual-exclusion (ME) element is a variant of a synchronizer. The ME element compares the signal-arrival times at its two asynchronous request inputs and generates mutually exclusive acknowledge signals. If the ME element enters the metastable state, no acknowledge is granted until the ME element exits this state. The accompanying figure shows the symbol we use for the mutual-exclusion element, and a CMOS implementation that was first introduced by Seitz [ ].

Let $t^{j}_{S\uparrow}$ and $t^{j}_{S\downarrow}$ denote the time of the $j$-th up-going edge and the time of the $j$-th down-going edge, respectively, of a signal $S$, and let $\delta_{E_1 \to E_2}$ denote a causal delay from one event (signal edge) $E_1$ to another, $E_2$. The requirements that any implementation of the ME element must satisfy are then that each acknowledge edge on $A_i$ follows the corresponding request edge on $R_i$ by at least the associated delay, and that the acknowledges $A_1$ and $A_2$ are mutually exclusive, i.e., never asserted at the same time.

Figure: Mutual-exclusion element (symbol and CMOS implementation).

The delays $\delta_{R_1\uparrow \to A_1\uparrow}$ and $\delta_{R_2\uparrow \to A_2\uparrow}$ depend on the implementation and are typically approximately equal. Without loss of generality, we shall assume that
\[
\delta_{R_1\uparrow \to A_1\uparrow} = \delta_{R_2\uparrow \to A_2\uparrow} = \delta_{RA} .
\]

The environment of the ME element is obliged to behave according to the following specification: a request $R_i$ may be withdrawn only after the corresponding acknowledge $A_i$ has been granted, and a new request may be issued on an input only after the previous handshake on that input has completed.

Let us now present an example illustrating how an ME element can be used to reduce the asynchronicity of signals. We shall examine what happens when one of the request inputs of the ME element is connected to a periodic signal, and the other input to an asynchronous signal with a uniform probability distribution of up-going edges (see the accompanying figure).

If an asynchronous event arrives while there is no contest in the ME element, the acknowledge will be granted after $\delta_{RA}$. If the arrival time falls within the period during which the periodic request holds the ME element, the acknowledge will be postponed until that request is withdrawn again. The behavior in this idealized case, which does not take metastability into account, is what we would like synchronizers to achieve. The delta function in the figure corresponds to all the requests that occur during the contested period and are acknowledged together when the periodic request is withdrawn; the area of the delta function is such that the normalization condition above holds.

Figure: ME element as a synchronizer (acknowledge-arrival distributions for an ideal and for a real synchronizer).

As discussed above, any implementation of the ME element will exhibit metastable behavior if an input event occurs within some narrow time window $T_W$

around the time when the clock-connected request input makes its transition from low to high. The probability that the ME element will remain in the metastable state longer than $t_m$ decays exponentially with $t_m$, as in the synchronizer equation above. Upon exiting the metastable state, the ME element generates an acknowledge on one or the other of its outputs. The ME element therefore transforms an input signal of asynchronicity $A_R(T_W) = T_W/T$ into an output signal of much lower asynchronicity; for a typical implementation, in which $T_W \ll \tau \ll T_{high}$,
\[
A_A(T_W) \approx \frac{T_W}{T}\, e^{-T_{high}/\tau},
\]
so the ME element reduces the asynchronicity of the input signal by a factor that is exponential in the time allowed for synchronization.

Since the other request input is connected to a clock, the metastable state cannot last longer than $T_{high}$, but the acknowledge output is still asynchronous according to the definition above. The ME element attempts to do the synchronization in the allotted time $T_{high}$, and if it does not succeed, it is forced out of the metastable state. We trade one uncertainty, whether the input made a transition, for another, whether the ME element is in a metastable state. Connecting a request input to a clock violates the requirements that the ME element imposes on its environment. If the ME element is leaving the metastable state and generating the acknowledge of the asynchronous request just as the clock makes its down-going transition, this violation can result in a glitch on the clock-side acknowledge output. Therefore, the clock-side acknowledge must not be used as an input to any circuit, and need not even be generated. As long as the asynchronous request does not violate the ME-element requirements, there will be no spurious signals on its acknowledge output.

Synchronizers

We shall postulate the existence of three types of synchronizers, as shown in the accompanying figure, depending on whether they synchronize only the up-going edges, only the down-going edges, or both edges of the input signal. CMOS implementations of these synchronizers are presented later in this chapter.

Figure: Synchronizers.

These synchronizers will be characterized by the time window $T_W$ centered around the down-going edge of the clock. Unless otherwise specified, the phrase "coincident with the down-going edge of some clock" will mean "within $T_W$ around the down-going edge of that clock."

If the input-event arrival time is of uniform distribution, where the input event is an up-going edge, a down-going edge, or a signal transition, the output-event distribution is as shown in the accompanying figure, with $\delta_S = \delta_{RA}$. The input/output behavior of each synchronizer, clocked with a periodic signal of period $T = T_{low} + T_{high}$, can be modeled as a nondeterministic, variable delay $\delta_V$, where
\[
\delta_S \le \delta_V \le T_{low} + \delta_S .
\]

Two-Phase-Protocol FIFO

The accompanying figure shows the symbol we use to represent a two-phase-protocol FIFO element. A number of different implementations of this FIFO cell can be found in [ ]. For the purposes of this thesis, we shall not concern ourselves with the details of any particular implementation. To be a valid implementation, the circuit has to behave according to the following behavioral specification:
\[
*[\, R_i ;\; A_i , R_o ;\; A_o \,]
\]
The operational description of the above specification is to repeat forever the following sequence of actions: wait for an event (a signal transition) on $R_i$; generate concurrent events on $A_i$ and $R_o$; and wait for an event on $A_o$.

Figure: Synchronizer input/output specification.

Figure: Two-phase-protocol FIFO element.

Each signal's events are numbered from zero. A formal definition corresponding to the specification above is as follows:
\[
t^{j}_{A_i} = \begin{cases} t^{j}_{R_i} + \delta_{R_i \to A_i}, & j = 0\\[2pt] \max\!\left(t^{j}_{R_i} + \delta_{R_i \to A_i},\; t^{j-1}_{A_o} + \delta_{A_o \to A_i}\right), & j > 0 \end{cases}
\]
\[
t^{j}_{R_o} = \begin{cases} t^{j}_{R_i} + \delta_{R_i \to R_o}, & j = 0\\[2pt] \max\!\left(t^{j}_{R_i} + \delta_{R_i \to R_o},\; t^{j-1}_{A_o} + \delta_{A_o \to R_o}\right), & j > 0 \end{cases}
\]
The environment is obliged to behave according to the following specification: the next input request may be produced only after the previous one has been acknowledged, $t^{j}_{R_i} > t^{j-1}_{A_i}$ for $j > 0$, and an output request may be acknowledged only after it has been issued, $t^{j}_{A_o} > t^{j}_{R_o}$.

The specification does not mention the data-handling requirements, because they do not affect the synchronization issues. Since there is a well-defined time relationship between the data and the control signals, the only source of uncertainty is the time when the input signals $R_i$ and $A_o$ make transitions.

The accompanying figure shows the dependency graph for a two-phase-protocol FIFO element. Nodes of the graph represent signal events, and edges represent the dependencies between the events. The values associated with solid edges are delays characteristic of a particular FIFO implementation. The dashed edges correspond to event dependencies that are maintained by the environment; the only property that can be assumed about the delays associated with these dependencies is that they are positive.

Figure: The dependency graph of a two-phase-protocol FIFO element.

Let $p_{R_i}(t)$ and $p_{A_o}(t)$ represent the probability distributions of the signal-event arrival times for the two inputs of a two-phase-protocol FIFO element, and for the moment disregard the fact that these two distributions are not independent. The upper bound for the output-event probability distributions is
\[
p_{R_o}(t) \le p_{R_i}(t - \delta_{R_i \to R_o}) + p_{A_o}(t - \delta_{A_o \to R_o}),
\]
\[
p_{A_i}(t) \le p_{R_i}(t - \delta_{R_i \to A_i}) + p_{A_o}(t - \delta_{A_o \to A_i}).
\]

Pipeline Synchronizer

The block diagram of a pipeline synchronizer that can be used to interface an asynchronous input data stream to a synchronous system is shown in the accompanying figure. $\phi_1$ and $\phi_2$ are two-phase, nonoverlapping clock signals of the kind often used for the internal clocking of CMOS chips. For simplicity, we shall assume that the two phases are symmetric, each occupying half of the clock period $T$; one can show that this simplification does not qualitatively affect the results that we shall present in this section. The FIFO elements are the two-phase-protocol asynchronous FIFO elements described above. The synchronizers are the symmetric elements described elsewhere in this chapter. The wires connecting the two have zero delay; the wire delay is absorbed into the FIFO elements.

Figure: Asynchronous-input, synchronous-output pipeline synchronizer.

Figure: Two-phase nonoverlapping clocks.

The principal claim is that the data-stream synchronization can be done in stages, along with the data flow. In the following section we shall prove that the probability of synchronization failure at the synchronous end of the structure in the figure decreases exponentially with the number of stages $k$:
\[
P_f^{(k)} = P_f^{(0)}\, e^{-k\left(\frac{T}{2} - T_{oh}\right)/\tau},
\]
where $T_{oh}$ is the implementation-dependent overhead. Therefore, to achieve the desired $P_f$ at the synchronous end, one needs at least
\[
k = \frac{\tau}{\frac{T}{2} - T_{oh}}\,\log\frac{P_f^{(0)}}{P_f}
\]
stages. The area complexity of this solution is $O(\log\frac{1}{P_f})$. Both theoretical work [ ] and experimental evidence [ ] show that it is not possible to synchronize asynchronous signals with a latency less than $O(\log\frac{1}{P_f})$, so this solution is latency-optimal.
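As a rough illustration of this stage count, here is a small C program that evaluates the formula; the parameter values in the example are hypothetical, not measured Mosaic values.

    #include <math.h>
    #include <stdio.h>

    /* Number of pipeline-synchronizer stages needed so that an initial
       failure probability pf0 is attenuated to at most pf_target, when each
       stage contributes (T/2 - Toh) of settling time and the synchronizer
       time constant is tau. */
    static int stages_needed(double pf0, double pf_target,
                             double T, double Toh, double tau)
    {
        double per_stage = (T / 2.0 - Toh) / tau;  /* exponent gained per stage */
        return (int)ceil(log(pf0 / pf_target) / per_stage);
    }

    int main(void)
    {
        /* Hypothetical numbers: 30 ns clock, 5 ns overhead, tau = 0.5 ns. */
        printf("%d stages\n",
               stages_needed(1.0, 1e-20, 30e-9, 5e-9, 0.5e-9));
        return 0;
    }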

It is often the case that the output sections of asynchronous data sources themselves consist of a series of FIFO elements. In these cases it is possible to insert the synchronizer elements into the output section and to perform part of, or even the entire, synchronization there. The latency and area will still be $O(\log\frac{1}{P_f})$, but they may be achieved with a smaller $T_{oh}$.

Correctness Proof

We shall first show that the structure in the figure behaves as an asynchronous FIFO. Then we shall find implementation requirements for the FIFO elements and synchronizers that are sufficient to guarantee that the exponential bound above holds and that the maximum throughput can be sustained. Finally, we shall find the upper bound for the latency.

Proper Functional Behavior

As discussed above, the input/output behavior of a synchronizer whose clock input is connected to a periodic signal is equivalent to that of a variable delay. The introduction of a bounded, albeit variable, delay on any signal of a speed-independent implementation of the FIFO element does not affect the correctness of the circuit operation.

Probability of Metastability Failure

The accompanying figure shows a two-stage segment of the pipeline synchronizer described above.

Figure: A segment of a pipeline-synchronizer chain.

The $j$-th event on $R_o^i$ can occur only at time
\[
t^{j}_{R_o^i} = t^{j}_{R_i^i} + \delta_{R_i \to R_o},
\]
or at time
\[
t^{j}_{R_o^i} = t^{j-1}_{A_o^i} + \delta_{A_o \to R_o}.
\]
This event can cause metastable behavior of the $(i{+}1)$-st synchronizer only if it occurs coincident with the down-going transition of $\phi_{(i+1) \bmod 2}$. Therefore,
\[
P_f^{\,i+1} \le P_f^{\,i+1}(R_i) + P_f^{\,i+1}(A_o),
\]
where the first term of the sum corresponds to the probability that an event on $R_o^i$ caused by $R_i^i$ occurs coincident with the end of $\phi_{(i+1) \bmod 2}$, and the second term to the probability that an event on $R_o^i$ caused by $A_o^i$ does.

We shall first examine the case corresponding to $P_f^{\,i+1}(R_i)$. If the implementation of the FIFO element satisfies the requirement
\[
\delta_S + \delta_{R_i \to R_o} \le \frac{T}{2},
\]
the $i$-th synchronizer will guarantee that this case can occur only if it was in a metastable state for at least
\[
t_m = \frac{T}{2} - \delta_S - \delta_{R_i \to R_o}.
\]
Therefore,
\[
P_f^{\,i+1}(R_i) \le P_f^{\,i}\, e^{-\left(\frac{T}{2} - \delta_S - \delta_{R_i \to R_o}\right)/\tau}.
\]
If we could show that $P_f^{\,i+1}(A_o) = 0$, then
\[
P_f^{(k)} \le P_f^{(0)}\, e^{-k\left(\frac{T}{2} - T_{oh}\right)/\tau}, \qquad T_{oh} = \delta_S + \delta_{R_i \to R_o}.
\]
However, we shall show that the $P_f(A_o) = 0$ requirement is overly restrictive. In the remaining part of this section, we shall devise a set of criteria that are sufficient to guarantee that, even if a synchronizer enters the metastable regime because of an event on the $A_o$ input of a FIFO element, the properties of the FIFO chain ensure that this particular metastability results in benign behavior at the synchronous end.

To prove this, we shall first define first-event metastability (FEM) and second-event metastability (SEM), corresponding to two particular propagation modes of the FIFO in the figure. Next, we shall show that an $A_o$ event can only cause SEM, and that SEM itself can only cause SEM. Finally, we shall show that SEM is benign at the synchronous end.

Let us observe one synchronizer element S from the figure, with its clock input connected to a clock phase $\phi$. S will exhibit metastable behavior only if its input event is coincident with the down-going edge of $\phi$.

Definition: When the input of a synchronizer element S clocked with $\phi$ changes state coincident with an arbitrary $j$-th down-going edge of $\phi$, and there were no prior input events between the $(j{-}1)$-st and the $j$-th down-going edges of $\phi$, we shall say that S has entered first-event metastability.

Definition: When the input of a synchronizer element S clocked with $\phi$ changes state coincident with an arbitrary $j$-th down-going edge of $\phi$, and there was at least one prior input event between the $(j{-}1)$-st and the $j$-th down-going edges of $\phi$, we shall say that S has entered second-event metastability.

Let us now pick up the thread of the proof again by going back to the equation above and analyzing what happens when the $j$-th event on $R_o^i$ is caused by the $(j{-}1)$-st event on $A_o^i$, i.e., when
\[
t^{j}_{R_o^i} = t^{j-1}_{A_o^i} + \delta_{A_o \to R_o} .
\]
For a typical FIFO implementation, $\delta_{R_i \to R_o} \approx \delta_{R_i \to A_i}$ and $\delta_{A_o \to R_o} \approx \delta_{A_o \to A_i}$, so we shall assume that the $(j{-}1)$-st events on $A_i^i$ and $R_o^i$ coincide. This simplification does not qualitatively change the results we shall obtain; it only improves the readability of our arguments.

If the metastability at the $i$-th synchronizer is a result of the $(j{-}1)$-st transition on $A_o^i$, it can in turn cause metastability in the $(i{+}1)$-st synchronizer. However, we can ensure that the latter is always SEM if the $j$-th event on $R_o^i$ follows the $(j{-}1)$-st event on $R_o^i$ within half a clock period. From this condition and the equations above, we can derive the following implementation requirement:
\[
\delta_{A_o \to R_o} \le \frac{T}{2} .
\]

We shall focus next on showing that SEM can only cause SEM. Suppose that the $j$-th event on $R_o^i$ is SEM. Then, according to the definition, a prior event occurred on the input of the $(i{+}1)$-st synchronizer during the preceding clock period, which bounds the spacing between the $(j{-}1)$-st and the $j$-th events on $R_o^i$ by one clock period $T$. Propagating this bound through the synchronizer and FIFO delays, together with the requirement $\delta_{A_o \to R_o} \le T/2$, shows that the corresponding events on $R_o^{i+1}$ are spaced by no more than $T$ as well. Therefore, SEM at the $i$-th synchronizer can only cause SEM at the $(i{+}1)$-st synchronizer.

(The distinction between first-event and second-event metastability has no physical basis; it is defined only for the purposes of the proof.)

Finally, with reference to the accompanying figure, it should be obvious to the reader that one can design a synchronous circuit whose correct operation is not affected by SEM. The required throughput is equal to one event per clock cycle, so the synchronous circuit must, after the first event within a clock cycle is observed (an event that is guaranteed by the definition not to cause synchronization failure), acknowledge that first event and disregard any input change until the following clock cycle.

Figure: Second-event metastability.

Sustaining the Throughput

The maximum data throughput of the pipeline synchronizer in the figure is throttled by the synchronous side and is equal to one data item per clock cycle. In the remainder of this section, we shall find implementation requirements that are sufficient to guarantee that the pipeline synchronizer can sustain that throughput.

Let us first consider an infinitely long pipeline-synchronizer chain and assume that it is in the steady state. The accompanying figure illustrates this condition, with the arcs between events describing causal dependencies. For all even-numbered FIFO cells, the events on $R_i$ occur simultaneously, a fixed delay after $\phi_1$ becomes high; for all odd-numbered FIFO cells, the events on $R_i$ occur simultaneously, a fixed delay after $\phi_2$ becomes high. In this regime of operation, no synchronizer enters the metastable state, and the steady-state event times within each half-period are determined by the synchronizer delay $\delta_S$ and by the FIFO delays $\delta_{R_i \to A_i}$, $\delta_{A_o \to A_i}$, $\delta_{R_i \to R_o}$, and $\delta_{A_o \to R_o}$.

Figure: Steady-state operation of the pipeline synchronizer.

For the steady state in the figure to be possible, these event times must be less than the half-period. The following conditions on the implementation of the synchronizer and of the FIFO element are therefore sufficient to guarantee that the infinite chain of pipeline synchronizers can sustain the maximum throughput while operating in this particular regime:
\[
\delta_{A_o \to A_i} \le \frac{T}{2}, \qquad
\delta_S + \delta_{R_i \to R_o} \le \frac{T}{2}, \qquad
\delta_S + \delta_{R_i \to A_i} + \delta_{A_o \to R_o} \le \frac{T}{2}.
\]

To examine the case of a finite-length pipeline synchronizer, we shall use an analogy with transmission lines. A piece of transmission line that is terminated with its characteristic impedance behaves as if it were of infinite length. The pipeline synchronizer, similarly, has to be terminated with circuitry that satisfies the above conditions if the full throughput is to be maintained.

The requirement at the synchronous end is
\[
\delta_{RA} + \delta_{A_o \to R_o} \le \frac{T}{2},
\]
where $\delta_{RA}$ is the delay from the time when one data transfer is requested until it is acknowledged. In the absence of metastability at the synchronous end, this condition is trivially achieved.

At the asynchronous end, to terminate the pipeline synchronizer properly and maintain the bandwidth, the outside circuitry has to satisfy the condition
\[
\delta_S + \delta_{R_i \to A_i} + \delta_{AR} \le \frac{T}{2},
\]
where $\delta_{AR}$ is the delay from the time when one data transfer is acknowledged until the time when the next data transfer is requested. Of course, the definition of asynchronous signals prevents us from imposing such requirements. What we shall prove in the following section is that, with a properly terminated sink, if a particular asynchronous data transfer at the source is initiated at any point in time, and if from then on this condition is satisfied, the pipeline synchronizer will, with bounded latency, enter the steady state illustrated in the figure.

Latency

Let us observe the behavior of a pipeline synchronizer of length $k$ with a properly terminated sink, as specified above. We shall assume that the synchronizer starts in a state equivalent to the state at $t = 0$, that is, with all FIFO cells empty. Starting in this initial state, we shall number each signal's transitions from zero.

If the first asynchronous request occurs while $\phi_1$ is low, between a down-going edge and the next up-going edge of $\phi_1$, the first event at $R_i$ will occur a synchronizer delay after $\phi_1$ goes high again. Given that the throughput-sustaining conditions above are satisfied, the first input transitions of successive FIFO stages then occur half a clock period apart, each a synchronizer delay after its own clock phase goes high. Using double induction, it is simple to show that, if the throughput-sustaining conditions are satisfied, every subsequent transition occurs at the latest at the time corresponding to the steady state illustrated in the figure. The latency of the pipeline synchronizer is in this case $T_l = kT/2$ plus a fixed overhead that does not depend on $k$.

Let us now assume that the first asynchronous request occurs while $\phi_1$ is high, at some time between an up-going edge and the following down-going edge of $\phi_1$. For the $i$-th pipeline-synchronizer element, let $\Delta_i^{j}$ denote how far from the steady state the $j$-th transition of its $R_i$ input is, and let $\Delta_k^{j}$ denote how far from the steady state the $j$-th transition of the synchronous-system input $R_o^{k-1}$ is. With the throughput-sustaining conditions satisfied, the first transitions on the inputs of the FIFO elements and on the input of the synchronous system are displaced from the steady state by amounts that can only shrink from stage to stage: since $\delta_S + \delta_{R_i \to R_o} \le T/2$, the first output transition of each subsequent stage is closer to the steady-state condition. Using double induction, one can show that the displacements of all subsequent transitions also shrink, stage by stage and transition by transition, provided that the following conditions are satisfied:
\[
\delta_{AR} \le \frac{T}{2}, \qquad
\delta_S + \delta_{R_i \to R_o} \le \frac{T}{2}, \qquad
\delta_{A_o \to R_o} \le \frac{T}{2}, \qquad
\delta_{RA} + \delta_{A_o \to R_o} \le \frac{T}{2}.
\]
Under these conditions, successive transitions on $R_o^{k-1}$, the input of the synchronous circuit, always occur less than half a period after the successive clock edges, and the latency is again $T_l = kT/2$ plus a fixed overhead.

In this subsection we shall find the bound on latency in the case of metastability caused by the input asynchronous event. Since the synchronous side throttles the throughput, and since we cannot assume correct operation of the synchronous system under synchronization failure, this part of the proof considers only the case in which the metastability does not reach the synchronous end.

If the first asynchronous request occurs coincident with a down-going edge of $\phi_1$, the first synchronizer will enter the metastable state; depending on when it exits this state, we shall distinguish between the three cases described next.

(a) If the metastable state lasts for $t_m$ with $t_m \le T/2 - \delta_S - \delta_{R_i \to R_o}$, the first event on $R_o^0$ still falls within the current half-period, and all subsequent transitions follow the steady-state pattern analyzed above.

(b) If the metastable state lasts longer than that bound, the first event is in effect delayed by one half-period, and the earlier analysis holds with the transition numbering shifted by one clock edge; the latency is again $kT/2$ plus a fixed overhead.

(c) Finally, if the metastable state lasts for exactly the boundary duration, it can cause metastability in the subsequent synchronizer. Let us assume that the first $m$ synchronizers enter the metastable state as a result of their first input transition, and that the remaining $k - m$ synchronizers do not. Then the first output transitions of the first $m$ synchronizers are each delayed by one half-period relative to the steady state, and after the first transition they follow the steady-state pattern; the remaining $k - m$ synchronizers behave as analyzed in (a) or (b). The latency is again $kT/2$ plus a fixed overhead.

Conclusion

In this section we have found a set of requirements that are sufficient to guarantee that a particular implementation of a synchronizer and of a FIFO element can be used for pipeline synchronization. The union of all of these requirements is:
\[
\delta_S + \delta_{R_i \to R_o} \le \frac{T}{2}, \qquad
\delta_{A_o \to R_o} \le \frac{T}{2}, \qquad
\delta_{AR} \le \frac{T}{2}, \qquad
\delta_{A_o \to A_i} \le \frac{T}{2},
\]
\[
\delta_S + \delta_{R_i \to A_i} + \delta_{A_o \to R_o} \le \frac{T}{2}, \qquad
\delta_S + \delta_{R_i \to A_i} + \delta_{AR} \le \frac{T}{2}, \qquad
\delta_{RA} + \delta_{A_o \to R_o} \le \frac{T}{2}.
\]

Variations On the Theme

Above, we showed how pipeline synchronization can be used to interface input data streams that follow the two-phase asynchronous protocol to synchronous systems, along with a proof of its correct operation. In this section we shall show that the same technique is applicable to interfacing to output streams that follow the two-phase asynchronous protocol, as well as to the synchronization of input and output streams that operate under the four-phase asynchronous protocol. The same proof techniques can be applied, so only the implementations will be presented.

Reversing the Direction of Data Flow

A block diagram of a pipeline synchronizer that can be used to interface an asynchronous output data stream to a synchronous system is shown in the accompanying figure.

Figure: Synchronous-input, asynchronous-output pipeline synchronizer.

The implementation of the synchronizer and of the FIFO elements must satisfy requirements that are the duals of those listed above, with the roles of the request and acknowledge signals interchanged, and with the environment turnaround delays $\delta_{RA}$ and $\delta_{AR}$ likewise interchanged: each sum of a synchronizer delay and the relevant FIFO delays along an acknowledge path, and each environment turnaround delay, must not exceed $T/2$.

Using Four-Phase-Protocol FIFO Elements

When using the four-phase signaling protocol, the spectrum of valid implementations of FIFO elements is much larger; for an exhaustive study, see [ ]. One possible implementation performs, in sequence, the complete four-phase handshake on the input port followed by the complete four-phase handshake on the output port, with the dependency graph as shown in the accompanying figure.

Figure: The dependency graph of one form of four-phase-protocol FIFO element.

When using the four-phase protocol, there are two transitions for every data item transferred, and we shall pick only one of the edges, either the up-going or the down-going one, to represent an event. Let us assume that we have made the choice such that one designated edge of each of $R_i$, $A_i$, $R_o$, and $A_o$ represents an event, i.e., its arrival time is uncertain. (In the case of this particular, arbitrary choice, when FIFO elements with the behavior specified above form a FIFO chain, there is no uncertainty about the arrival time of the complementary edges: each complementary edge follows a designated edge by a fixed implementation delay.) Then the pipeline-synchronization chain that we have used for

synchronization of the two-phase protocol in the figures above can be used with the four-phase protocol when modified in the following way: the FIFO elements are replaced with the four-phase-protocol version, and the symmetric synchronizers are replaced with an asymmetric version that synchronizes only the edges that have been chosen to represent signal events.

The requirements derived for the two-phase case still apply, with each two-phase delay replaced by the sum of the four-phase delays along the corresponding path of the dependency graph.

A CMOS Implementation

All four versions of pipeline synchronizers described in this chapter, with the two- and four-phase protocols and with synchronous and asynchronous input streams, have been implemented and used in the Mosaic over the past three years. In a set of experiments reported by Cohen et al. [ ], a local-area network was implemented using Mosaic components and was used to transmit and receive more than [ ] bits without a single error.

Earlier in this chapter we showed the ME element that we use to build asymmetric synchronizers, which synchronize only the up-going or only the down-going transitions. To build a symmetric synchronizer, we utilize two ME elements in the circuit illustrated in the accompanying figure: one ME element synchronizes the up-going transitions of the input and the other the down-going transitions, and the surrounding circuitry merges their acknowledges into the synchronized output.

Figure: A symmetric synchronizer (two ME elements and the surrounding merge circuitry).

The symmetric-synchronizer and the four-phase-protocol FIFO elements were designed using a prerelease version of the asynchronous-design tools described in [ ].

Conclusions

Even though the motivation for pipeline synchronization came from a specific problem, interfacing the Mosaic's asynchronous router and synchronous memory, the applications of this technique are much broader. Pipeline synchronization is a simple, low-cost, high-bandwidth, high-reliability solution to interfaces between synchronous and asynchronous systems, or between synchronous systems operating from different clocks, where the data rate is too high for the single-synchronizer approach.

The power and simplicity of this technique allow the designer to break away from the traditional divisions in the design of asynchronous and synchronous circuits: clocked systems deal with synchronous events, self-timed systems deal with asynchronous events, and interfacing between the two is done with synchronizers of limited reliability. With pipeline synchronization, the designer can achieve arbitrarily low failure rates in exchange for latency, rather than for a reduction in bandwidth.

In developing pipeline synchronization, instead of reasoning about signal events, we reasoned about the probability distributions of signal-event arrival times, and we introduced a metric to characterize how asynchronous a signal is. Along the way, we used some simple yet unconventional techniques: we can partially synchronize a signal event, use it to perform some work using self-timed circuits, and then repeat the process until the desired reliability is achieved.

The probability model that we used was crucial to our understanding of synchronization techniques and to our ability to devise a novel approach to data-stream synchronization. However, we were not successful in using this model to prove all the circuit properties that interested us; hence, we had to resort to rather elaborate proof techniques. This is a topic that we believe deserves additional investigation.

Chapter

Conclusions

Comparison With Related Work

A variety of concurrent machines have been built in an effort to apply concurrent approaches to general-purpose computing at the application level. In this section we shall compare the Mosaic architecture and the C programming system with other contemporary architectures and programming systems. At the architecture level, we shall focus on communication bandwidth and latency, and on the complexity of the implementation. At the programming level, we shall discuss the relative importance of efficiency, expressivity, and safety, compared with C.

Medium-Grain Multicomputers

The antecedent of all modern-day multicomputers is the Cosmic Cube [ ]. A number of commercial developments followed this effort, with similar multicomputers manufactured by Intel, nCUBE, and Ametek. With the exception of nCUBE's custom designs of an integrated processor and network interface, these machines employed off-the-shelf processor, memory, and compiler technology.

The raw hardware performance of these machines is quite impressive in terms of peak instruction rates and, in some cases, in terms of available communication bandwidth. The nodes are of a complexity comparable to that of workstations, with hardware floating-point capabilities, top-of-the-line single-chip processors, and megabytes of node memory.

Standard programming systems for these machines are based on sequential programming notations for specifying individual process behavior, with library routines for interprocess communication and synchronization. The expressive power of these programming systems is comparable to that of C, but because the programs that specify individual process behavior are, unlike in C, lexically separate, compilers cannot perform many of the safety-improving checks typical of object-oriented programming that C performs, including, for example, type matching of messages between the sender and the receiver. Since the message scope in these systems is not well defined, it is difficult, if not impossible, for the compiler to assist the programmer in the message passing of run-time-specified data structures or in message exchange in heterogeneous environments. An important difference in programming emphasis is due to C's blurring of the distinction between processes as computing agents and processes as data, deliberately implying that process creation is inexpensive.

Standard programming systems for commercial multicomputers regularly fall short of utilizing the hardware capabilities to their fullest, mainly because of the large software overhead of communications, typically on the order of a thousand processor instructions. Recent work on Active Messages [ ] clearly demonstrates how the incompatibility of the hardware mechanisms with the programming model results in large software overheads. The Mosaic design team's approach to reducing the software overhead anticipated that advocated by the research on Active Messages: a relatively small amount of hardware support is devoted to message handling, allowing software optimizations of special cases. The message-handling layer of the MADRE run-time system is triggered by message reception, and message handlers run to completion. Two contexts, one for message handling and the other for regular computation, are used to eliminate the context-switch overhead. Where Active Messages diverges from our approach is in expecting the user to handle the message-buffer management, something we consider too large a burden except for applications with highly regular communication patterns. For the Mosaic, a machine with scarce node-memory resources and with so much potential concurrency as to necessitate run-time-system-managed process placement, message-buffer management is particularly demanding.

Fine-Grain Multicomputers

In this section, we shall compare the Mosaic architecture to two representatives of fine-grain multicomputers: the Transputer, a well-established commercial family of chips manufactured by INMOS, and the J-Machine, developed by the research team led by William J. Dally at MIT. Just as with the Mosaic, both the Transputer and the J-Machine have been developed in conjunction with, and influenced by, their respective programming systems.

Transputer and Occam

Occam and its subsequent variants are based on the work of C. A. R. Hoare on Communicating Sequential Processes (CSP). Occam is an explicit-message-passing notation, with processes interacting through synchronous communication channels. In Occam programs, all processes and all channels must be known at compile time, and Occam compilers use this information to completely eliminate the need for runtime resource management. This requirement, together with typed


channels, makes Occam a very safe notation, to the point that there are rules for semantics-preserving program transformations. However, Occam does not support dynamic process creation, and it supports only a limited form of iteration and a non-recursive procedure call.

The Transputer approach is to offer prepackaged computing solutions in hardware, including the Occam process model and floating-point arithmetic. The presence of floating-point hardware on chip, unlike in the Mosaic or the J-Machine, is a consequence of INMOS being driven by its markets, whereas the other two projects are testbeds for concurrent-programming experiments. The Transputer communication mechanism is limited to nearest-neighbor communication. It requires either software-assisted store-and-forward routing or additional external communication components to provide a general communication capability. The newest-generation Transputer improves the routing generality and performance through the use of virtual channels.

When used as intended, for running Occam programs, the Transputer provides an excellent execution platform. However, its limited communication capabilities and the hardwired process notion typically present difficulties for the implementation of non-Occam-like concurrent-programming systems.

J-Machine and CST

Concurrent Smalltalk (CST) is a concurrent, object-oriented programming notation and, in spite of a vastly different syntax, it is in many important aspects similar to C+-. The main semantic difference between C+- and CST is that, whereas the state of a C+- process can be accessed only through a set of mutually exclusive atomic actions, the methods of a CST object can access the object concurrently. The atomicity of CST methods must be managed explicitly, using locks. The primary mechanism for generating concurrency in CST is issuing concurrent remote procedure calls and synchronizing through futures. Although this mechanism is supported in C+-, the primary mechanisms for generating concurrency in C+- are process creation and non-blocking message sends. Some recently reported results on the performance evaluation of the J-Machine have been obtained with programs written in the J language, an extension of C with a small number of constructs for communication and synchronization, not unlike the C+- approach.
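As a rough illustration of what non-blocking sends buy, here is a sketch (ours, not taken from the thesis) in the processdef/atomic notation that appears in Appendix A; the process type Worker and its entry name are illustrative only.

    processdef Worker {                  // illustrative process type
    public:
        atomic void compute(int job);    // an atomic action of the process
    };

    void
    fan_out(Worker **w, int n)
    {
        for (int i = 0; i < n; i++)
            w[i]->compute(i);            // each invocation is a non-blocking
    }                                    //   message send; the loop never waits,
                                         //   so all n actions may run concurrently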

Unlike the Transputer, the Message-Driven Processor of the J-Machine does not impose its preferred process model on the user. The emphasis of the J-Machine is on implementing in hardware the instruction sequences that CST programs use often: a simple hashing policy with a two-way set-associative cache for name translation, detecting access to uninitialized variables for synchronization, scheduling off of the message-receive queue, injecting data into the routing network, and computing the relative distance of the message destination.


Because there are fewer preconceived notions built into the hardware than with the Transputer, the J-Machine more readily accommodates programming systems for which it had not originally been designed.

The emphasis of the Mosaic is on providing high-performance routing and memory, with a seamless interface between the two. One can argue that a C+- implementation on the Mosaic would benefit from most, if not all, of the hardware-supported communication and synchronization mechanisms of the J-Machine. However, the reduction in the number of cycles and in the size of the code has to be compared against the additional price in chip area and the loss of flexibility. Given the large discrepancy in access time between on-chip and off-chip memory, the advantage of nominally time-saving but area-expensive mechanisms is less than obvious. The importance of preserving flexibility to as late a design stage as possible cannot be overstated. For example, the xlate instruction on the J-Machine implements a simple hashing policy and a two-way set-associative cache, and executes in three cycles. An equivalent of this instruction takes a short sequence of Mosaic instructions, but it can be modified trivially to work with different hash functions and a different cache structure and/or size.
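For readers unfamiliar with the mechanism, the following is a sketch (ours, not the Mosaic runtime's code) of what such a software name-translation cache might look like; the table size, hash function, and field names are arbitrary, which is precisely the flexibility argued for above.

    struct XlateEntry { bool valid; unsigned key; void *value; };

    enum { XLATE_SETS = 64 };                      // arbitrary; trivially changed
    static XlateEntry xlate_cache[XLATE_SETS][2];  // two ways per set

    extern void *xlate_miss(unsigned key);         // hypothetical slow path

    inline void *
    xlate(unsigned key)
    {
        XlateEntry *set = xlate_cache[key % XLATE_SETS];   // simple hashing policy
        if (set[0].valid && set[0].key == key)             // way 0
            return set[0].value;
        if (set[1].valid && set[1].key == key)             // way 1
            return set[1].value;
        return xlate_miss(key);                            // miss: consult the full table
    }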

The J-Machine and the Mosaic projects share many of the same underlying principles and motivations, many of which trace back to the time when William J. Dally was a PhD student in our research group; the rest is due to our relatively frequent mutual progress updates. The necessity to limit the scopes of the respective projects made the Mosaic team focus on efforts that were more aggressive technologically (single-chip nodes with internal dRAM, pipeline synchronization, advanced packaging), whereas the focus of the J-Machine team's efforts was primarily on mechanisms (synchronization, name translation, scheduling).

Multiprocessors

As discussed in earlier sections, the shared-memory programming model is a predominant concurrent-programming paradigm, mostly because many programmers find it to be a natural extension of the sequential programming model. Traditional shared-memory programming systems extend sequential notations with process creation and synchronization primitives; data communication is done through shared-memory access. Some recently developed shared-memory programming systems, such as COOL, move towards concurrent object-oriented programming similar to C+-, although for a different set of reasons. While the goal of C+- is to increase the expressivity of explicit message-passing notations by providing a global name space, COOL offers monitor-like data encapsulation as a safe alternative to the all-powerful access of a shared-memory word.

C+- programs often look remarkably similar to programs written in notations designed with multiprocessors as their primary targets. Let us compare the


runtime behavior of concurrent object-oriented programs on a cache-coherent multiprocessor and on the Mosaic.

Assume that a concurrent computation consists of a partially ordered set of actions; each action modifies the state of an object o by executing a function f with an argument list a. One can write

    o.f(a);

Assume also that the state of the object, the function code, and the argument list are each stored in their respective pieces of memory somewhere in the concurrent machine.
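As a small illustration (ours, not the thesis's), the three pieces named above can be thought of as one action record; in the Mosaic runtime system, the RTSmsg header of Appendix A plays this role.

    struct Object;                           // opaque object state

    struct Action {                          // illustrative action record
        Object *o;                           // the object whose state is modified
        void  (*f)(Object *, void *);        // the function to execute
        void   *a;                           // its argument list
    };

    inline void
    perform(const Action &act)
    {
        act.f(act.o, act.a);                 // o.f(a)
    }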

A typical multiprocessor is processor-centered. A processor is assigned to perform this action, possibly by directly accessing the task queue of the assigned processor, or by appending to the global task queue responsible for scheduling the entire machine. Once the action is scheduled, using that processor's cache-update mechanism, the pieces of object state, function code, and argument list are requested and brought to the processor. After the required computations are performed, the object's state is eventually updated in the memory through the cache-coherence protocol. Most communications are demand-driven, initiated by the cache-update requests, with the exception of appending to the task queues.

The Mosaic is memory-centered. There is only one processor that can access the object's state: the one controlling the memory where the object's state is stored. The argument list will be put into the same local memory by a message. The function code can, in principle, be sent along with the argument list, but because the code is read-only, it can be effectively software-cached in the same memory as the object it operates on. The communication is, therefore, mostly supply-driven, with the exception of updating the software code cache.

Due to the inherent complexity of coherent data caching, a typical multiprocessor employs as much as two orders of magnitude fewer nodes than a Mosaic multicomputer of the same price. On the other hand, the higher up-front node cost enables one to incorporate a more powerful processor. The emphasis of the programming effort is therefore radically different: whereas the effort in programming a multiprocessor will typically be in keeping the expensive nodes busy with better load-balancing strategies, the effort in programming the Mosaic is in generating as much concurrency as possible.

For problems with limited concurrency, a multiprocessor clearly holds the performance edge; for problems with abundant concurrency, the Mosaic is the machine of choice. For applications that belong to neither of these two extremes, the performance will depend on how effective the coherent data caching is. If the shared data is mostly of the single-writer, multiple-readers variety, including cases in which who the writer is changes relatively infrequently, coherent data caching is effective. For highly contested, multiple-writers, multiple-readers shared data, numerous updating and invalidating requests render caching useless.


Summary

A computer-architecture experiment is a complex task indeed. To make such an experiment a successful one is to achieve perfection in a balancing act. Many an experiment has failed to fulfill its promise due to less-than-perfect solutions for such mundane details as a few bits' worth of addressing space, I/O capabilities, mechanical assemblies, etc.

One of the great pitfalls, so forcefully exposed by the developers of the RISC processor architecture, is to restrict oneself into a well-defined box, no matter how nice the box looked from the inside. It was only after processor architects had poked into the compiler designers' turf that the possibility for dramatic improvement in processor performance was revealed.

Accordingly, the single most important design requirement, the requirement that turned into the cornerstone of our testbed, is that we must not assume that we understand all the possible ways in which the system will be used. As best as we could, we tried to provide mechanisms, not policies. This effort should be obvious from the simplicity of our hardware communication primitives, and should be even more obvious to readers who had the stamina to follow through the examples describing the runtime-system interfaces. We made sure that the default operations of our software and hardware systems were as simple and intuitive as possible, and strived not to hide anything from an inquisitive user. The resulting non-assuming architecture of the Mosaic is suitable for the implementation of a number of concurrent-programming systems, as well as for special-purpose embedded applications.

We have started with the one-chip-computer stipulation and tried to push the application-span envelope as far as we could. The performance potential of fine-grain concurrent computers has long been recognized. What is lacking is a powerful programming system to exploit that performance potential, not for a handful of carefully crafted applications, but for a much larger application span. C+- approaches this problem from the bottom up: programming paradigms that can be implemented efficiently on simple and inexpensive hardware, namely atomic updates that can be enabled or disabled, are mapped into a simple extension of a popular object-oriented notation.

What if we had an arbitrary amount of processing power dispersed around the memory? What if copying a memory segment from the local memory into a remote memory was actually faster than the local copy operation? What if we sent the code and the argument list to where the data is, instead of having all three of those join at the all-important processor? Or perhaps we could send code and data to where the argument list is? How much caching can we efficiently do in software? What if concurrency was so abundant that we did not have to worry about the utilization of individual processors at all, but rather about how to extract more concurrency from an application?


We do not know the answers to any of these questions today. What we have provided is a testbed that could help in answering some of these questions in a quantitative way, and some results have already been established.

Appendix A

Example Products of C+- Compilation

In the interest of making it possible for other researchers to understand the structure of the software overhead of communications reported earlier in this thesis, we shall present a few simple examples of the Mosaic assembly code produced by translation from C+- to C++, followed by compilation using the GNU C++ compiler targeted for the Mosaic processor.

The Mosaic register set is described earlier in this thesis. The assembly instructions of interest are:

op src,dst, where op is one of the standard arithmetic instructions (add, subtract, ...), src and dst are general-purpose registers, and the semantics are dst = dst op src.

mov src,dst, with the semantics of dst = src, where src and dst can be general-purpose registers, special registers, or memory operands with addresses determined by: a register; a register, post-incremented; a pre-decremented register; a register plus a constant; or a constant.

jmp dst: jump to the destination dst, determined by one of the addressing modes described above.

call dst: actually two instructions, one to save the return address on the stack and another to jump, as above.


jcc label, where cc is a condition under which to jump to the address specified by the label.

The program below defines the basic data structures used by the runtime system.

typedef unsigned nodet;        // multicomputer node identifier

struct pointert {              // a process pointer consists of:
    nodet  node;               //   a node number
    void  *ptr;                //   and an address
};

typedef unsigned entryt;       // a unique identifier of an atomic action

struct RTSmsg {                // a message
    RTSmsg   *next;            // queuing information
    int       size;
    pointert  ptr;             // message header
    entryt    e;
};

Program: Runtime-System Data Structures
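The following usage sketch is ours, not the thesis's: it shows how a complete message for an atomic action with a single int argument can be laid out as an RTSmsg header followed by the argument bytes; the function name buildmessage and the size accounting are illustrative assumptions.

    char *
    buildmessage(char *buf, pointert dest, entryt entry, int arg)
    {
        RTSmsg *m = (RTSmsg *) buf;
        m->next = 0;                              // queuing link, used only locally
        m->size = sizeof(RTSmsg) + sizeof(int);   // header plus one int argument (assumed convention)
        m->ptr  = dest;                           // destination process pointer
        m->e    = entry;                          // which atomic action to invoke
        *(int *) (buf + sizeof(RTSmsg)) = arg;    // the argument list follows the header
        return buf + m->size;                     // first free byte after the message
    }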


The code in the program below allows one to access the machine registers from the source code. Note that the pointers used in managing the send and the receive queues are kept in general registers that are shared between the two contexts.

register void   *MSP    asm ("msp");    // message send pointer
register void   *MSL    asm ("msl");    // message send limit
register void   *MRP    asm ("mrp");    // message receive pointer
register void   *MRL    asm ("mrl");    // message receive limit
register int     DXDY   asm ("dxdy");   // destination register
register int     IMR    asm ("imr");    // interrupt mask register
register int     ISR    asm ("isr");    // interrupt status register
register RTSmsg *SQHEAD asm ("r");      // send queue head
register RTSmsg *SQTAIL asm ("r");      // send queue tail
register void   *SQEND  asm ("r");      // end of the send buffer
register RTSmsg *RQHEAD asm ("r");      // receive queue head
register RTSmsg *RQTAIL asm ("r");      // receive queue tail
register void   *RQEND  asm ("r");      // end of the receive buffer

Program: Accessing Machine Registers from the Source Code

As shown in the program below, the routing-word (dx, dy) computation uses two tables. If the storage cost is prohibitive, it can be done with no extra storage in approximately twice the time.

extern int dxtable[];                   // per-coordinate routing tables
extern int dytable[];

inline
int
dxdyconversion(nodet dest)
{
    unsigned dx = dest >> 8,            // x part of the node number (assumed layout)
             dy = dest & 0xFF;          // y part of the node number
    return dxtable[dx] | dytable[dy];   // combine the two table entries into the routing word
}

Program: The (dx, dy) Computation
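The table-free variant mentioned above might look roughly like the sketch below; this is ours, not the Mosaic code, and MYX, MYY, the node-number layout, and the offset encoding are all assumptions made only for illustration.

    extern int MYX, MYY;                         // this node's coordinates (hypothetical)

    inline int
    encode_offset(int d) { return d & 0xFF; }    // placeholder encoding of a signed offset

    inline int
    dxdyconversion_notables(nodet dest)
    {
        int dx = (int) (dest >> 8)   - MYX;      // compute, rather than look up,
        int dy = (int) (dest & 0xFF) - MYY;      //   the relative offsets
        return (encode_offset(dx) << 8) | encode_offset(dy);
    }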


As described earlier, a message-sending operation consists of calling the head operator, building the argument list, and calling the tail operator. The program below presents an example definition of these two operators.

inline
void *
sendqueuealloc(int size)
{
    if ((void *) ((char *) SQTAIL + size) >= SQEND)
        return sendqueuewraparound(size);
    else
        return (void *) SQTAIL;
}

inline
void *
operator_head(pointert p, entryt e, int size)
{
    RTSmsg *v = (RTSmsg *) sendqueuealloc(size);
    v->size = size;                       // build message header
    v->ptr  = p;
    v->e    = e;
    return (void *) v;
}

inline
void
operator_tail(void *begin, void *end, pointert p, entryt e, int size)
{
    if (SQHEAD == SQTAIL) {               // send queue empty: send directly
        int dxdy = dxdyconversion(p.node);
        IMR  = ...;                       // disable interrupts
        MSP  = (void *) &SQTAIL->size;    // send the message
        MSL  = end;
        DXDY = dxdy;
        SQTAIL = SQTAIL->next = (RTSmsg *) end;   // update the send queue
        SQTAIL->next = 0;
        IMR  = ...;                       // enable interrupts
    } else
        sendqueueappend(begin, end, p, e, size);
}

Program: Building the Message
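To connect the pieces, here is a usage sketch (ours, and only a sketch: the actual translated code, argument layout, and size accounting may differ) of roughly what a translated p->foo(x) message send expands to in terms of the two operators above.

    inline void
    send_foo(pointert p, entryt foo_entry, int x)
    {
        int   size  = sizeof(RTSmsg) + sizeof(int);            // assumed size accounting
        char *begin = (char *) operator_head(p, foo_entry, size);
        *(int *) (begin + sizeof(RTSmsg)) = x;                  // build the argument list
        operator_tail(begin, begin + size, p, foo_entry, size);
    }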


By now, we have all the code we need to compile a message-sending example. Let the program below be the user code.

class C {                      // a user-defined class
public:
    void foo();
};

void                           // and its member function
C::foo() {}

void
f(C *c)
{
    while (1)                  // an endless sequence of
        c->foo();              //   member-function invocations
}

processdef P {                 // a user-defined process
public:
    atomic void foo();
};

atomic
void                           // and its atomic action
P::foo() {}

void
f(P *p)
{
    while (1)                  // an endless sequence of
        p->foo();              //   message-send operations
}

Program: A Message-Send Example


The two programs below are the result of translation of the user code to C++, followed by compilation to Mosaic assembly code.


globl fFPC fchar

fFPC

FUNC PROLOGUE BGN

mov bpsp save frame ptr

mov spbp new frame ptr

Save regs used

mov rsp

FUNC PROLOGUE END

mov bpr

L

mov rsp

call fooC

inc spsp

jmp L

FUNC EPILOGUE BGN

Restore regs used

mov spr

mov bpsp restore stack ptr

mov spbp restore frame ptr

rtn

FUNC EPILOGUE END

globl fFGPcpmptr fP

fFGPcpmptr

FUNC PROLOGUE BGN

mov bpsp save frame ptr

mov spbp new frame ptr

Save regs used

mov rsp

mov rsp

mov rsp

mov rsp

FUNC PROLOGUE END

mov r

L

mov bpr

mov bpr

mov rr

add r

cmp rr

jltu L

mov rsp

Program: A Message-Send Example, Compiled Code (Part 1)


call sendqueuewraparoundFi

inc spsp

jmp L

L

inc rr

L

mov r

mov rr

mov rr

mov r

mov rr

mov bpr

cmp rr

jne L

rnr rr lsr by

rnr rr

and r

and r

mov rdxtabler

mov rdytabler

mov imr

inc rr

mov rmsp

dec rr

mov rmsl

or rr

mov rdxdy

mov rr

mov rr

mov r

mov imr

jmp L

L

mov rsp

call sendqueueappendFPv

inc spsp

jmp L

FUNC EPILOGUE BGN

Restore regs used

mov spr

mov spr

mov spr

mov spr

mov bpsp restore stack ptr

mov spbp restore frame ptr

rtn

FUNC EPILOGUE END

Program: A Message-Send Example, Compiled Code (Part 2)


The program below can be used for user-level dispatch; the two programs that follow it are its compiled version.

struct RTScode {               // a code piece
    int offset;                // info on how to find the
    int mask;                  //   active/passive bit
                               //   within the process state
    int code[];                // the code itself
};

typedef void (*FP)(void *, void *);    // a generic atomic-action pointer

inline
atomic
void
dispatch(int size, pointert p, entryt e)       // generic dispatcher
{
    RTScode *c  = lookupcode(e);       // find the code piece
    void *ps    = p.ptr;               // the process pointer
    FP    f     = (FP) c->code;        // the code pointer
    void *a     = args;                // the arguments pointer
    int  offset = c->offset;
    int  mask   = c->mask;
    if (*(int *) ((char *) ps + offset) & mask)    // check the active/passive bit
        f(ps, a);                      // dispatch to user code
    else
        messagerefused(args);
}

void
userdispatch()                         // this code runs continuously
{                                      //   in the user context
    while (1) {
        while (!RQHEAD->next)
            ;                          // polling the receive queue
        dispatch(RQHEAD->size, RQHEAD->ptr, RQHEAD->e);   // call the generic dispatcher
        RQHEAD = RQHEAD->next;         // updating the receive queue
    }
}

Program: User-Level Dispatch


globl userdispatchFv

userdispatchFv

FUNC PROLOGUE BGN

mov bpsp save frame ptr

mov spbp new frame ptr

Save regs used

mov rsp

mov rsp

mov rsp

mov rsp

mov rsp

FUNC PROLOGUE END

Program: User-Level Dispatch, Compiled Code (Part 1)


L

L

mov rr

cmp r

jeq L

mov rsp

call lookupcodeFUi

mov rr

mov rr

add r

mov rr

add r

mov rr

mov rr

mov rr

add rr

mov rr

mov rr

and rr

inc spsp

cmp r

jeq L

mov rsp

mov rsp

call r

add sp

jmp L

L

mov rsp

call messagerefusedFPRTSmsg

inc spsp

L

mov rr

jmp L

FUNC EPILOGUE BGN

Restore regs used

mov spr

mov spr

mov spr

mov spr

mov spr

mov bpsp restore stack ptr

mov spbp restore frame ptr

rtn

FUNC EPILOGUE END

Program: User-Level Dispatch, Compiled Code (Part 2)


The program below is the interrupt-dispatch loop, and the two programs after it are the compiled code.


void softwareint(), recvint(), sendint(), sendrecvint(),      // handler declarations
     buffint(), buffrecvint(), buffsendint(), buffsendrecvint();

void (*interrupttable[])() = {         // a jump table filled with the
                                       //   names of routines that correspond
    softwareint,                       //   to various interrupt conditions
    recvint,
    sendint,
    sendrecvint,
    buffint,
    buffrecvint,
    buffsendint,
    buffsendrecvint
};

void
interruptdispatch()
{
    while (1) {
        int pending = ISR;                   // get the pending interrupts
        (*interrupttable[pending])();        // dispatch to an interrupt handler
        ISR = pending;                       // writing back acknowledges all the
                                             //   interrupts that have just been
                                             //   serviced
        asm("PUNT");                         // assembly instruction to return to
                                             //   the other context;
                                             //   when the next interrupt arrives,
                                             //   we shall start here
    }
}

void
recvint()                              // on receive interrupt,
{                                      //   the newly arrived message is
    RTSmsg *m = (RTSmsg *) MRP;        //   linked into the receive queue
    m->next = 0;
    RQTAIL = RQTAIL->next = m;
    MRP = (char *) m + m->size;
}

void
sendint()                              // on send interrupt,
{                                      //   the send queue is updated and
    SQHEAD = SQHEAD->next;             //   checked for additional messages
    if (SQHEAD->next)
        sendnextmessage();
}

Program: Programming the Interrupt-Driven Context


data

globl interrupttable

interrupttable

word softwareintFv

word recvintFv

word sendintFv

word sendrecvintFv

word buffintFv

word buffrecvintFv

word buffsendintFv

word buffsendrecvintFv

Program: Programming the Interrupt-Driven Context, Compiled Code (Part 1)


text

globl interruptdispatchFv

interruptdispatchFv

FUNC PROLOGUE BGN

mov bpsp save frame ptr

mov spbp new frame ptr

Save regs used

mov rsp

FUNC PROLOGUE END

L

mov isrr

call rinterrupttable

mov risr

APP

PUNT

NOAPP

jmp L

Restore regs used

mov spr

mov bpsp restore stack ptr

mov spbp restore frame ptr

rtn

FUNC EPILOGUE END

globl recvintFv

recvintFv

mov mrpr

mov r

mov rr

mov rr

inc rr

mov rmrp

rtn

globl sendintFv

sendintFv

mov rr

mov rr

cmp r

jeq L

call sendnextmessageFv

rtn

Program: Programming the Interrupt-Driven Context, Compiled Code (Part 2)


Bibliography

Anant Agarwal, David Chaiken, Godfrey D'Souza, Kirk Johnson, David Kranz, John Kubiatowicz, Kiyoshi Kurihara, Beng-Hong Lim, Gino Maa, Dan Nussbaum, Mike Parkin, Donald Yeung. The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor. MIT/LCS/TM, MIT, November.

Alok Aggarwal, Ashok K. Chandra, Marc Snir. On Communication Latency in PRAM Computations. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures.

Gul Agha. Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press.

William C. Athas. Fine-Grain Concurrent Computation. Caltech Computer Science Technical Report, PhD thesis.

William C. Athas and Charles L. Seitz. Multicomputers: Message-Passing Concurrent Computers. IEEE Computer, August.

J. Backus. Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs. CACM, August.

H. B. Bakoglu. Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley.

Gordon Bell. Ultracomputers: A Teraflop Before Its Time. CACM, August.

Nanette J. Boden. A Study of Fine-Grain Programming Using Cantor. Caltech-CS-TR.

Nanette J. Boden. Runtime Systems for Fine-Grain Multicomputers. Caltech-CS-TR.


Robert W. Brodersen. Evolution of VLSI Signal-Processing Circuits. In Advanced Research in VLSI: Proceedings of the Decennial Caltech Conference, edited by Charles L. Seitz, MIT Press.

Henry Burkhardt III, Steven Frank, Bruce Knobe, and James Rothnie. Overview of the KSR1 Computer System. Technical Report KSR-TR, Kendall Square Research, Boston, February.

Alan Burns. Programming in occam 2. Addison-Wesley.

Steve M. Burns. Performance Analysis and Optimization of Asynchronous Circuits. Caltech-CS-TR.

David Chaiken, John Kubiatowicz, Anant Agarwal. LimitLESS Directories: A Scalable Scheme. MIT/LCS/TM, MIT, June.

Thomas J. Chaney and Charles E. Molnar. Anomalous Behavior of Synchronizer and Arbiter Circuits. IEEE Transactions on Computers, April.

Andrew A. Chien. Concurrent Aggregates. MIT Press.

Rohit Chandra, Anoop Gupta, John L. Hennessy. Integrating Concurrency and Data Abstraction in a Parallel Programming Language. Stanford University Technical Report No. CSL-TR, February.

K. Mani Chandy, Carl Kesselman. Compositional C++: Compositional Parallel Programming. Caltech-CS-TR.

K. Mani Chandy, Jayadev Misra. Parallel Program Design: A Foundation. Addison-Wesley.

K. Mani Chandy, Stephen Taylor. A Primer for Program Composition Notation. Caltech-CS-TR.

William Douglas Clinger. Foundations of Actor Semantics. MIT AI Lab Technical Report AI-TR, May.

D. Cohen, G. Finn, R. Felderman, A. DeSchon. ATOMIC: A High-Speed, Low-Cost Local Area Network. USC/ISI Technical Report ISI/RR, September.

W. Crowther, J. Goodhue, E. Starr, R. Thomas, W. Milliken, and T. Blackadar. Performance Measurements on a 128-Node Butterfly Parallel Processor. In Proceedings of the International Conference on Parallel Processing.


Ole-Johan Dahl, Edsger W. Dijkstra, and Charles A. R. Hoare. Structured Programming. Academic Press.

William J. Dally. A VLSI Architecture for Concurrent Data Structures. Kluwer Academic Publishers, Norwell, MA.

William J. Dally and Charles L. Seitz. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks. IEEE Transactions on Computers, May.

Edsger W. Dijkstra. Cooperating Sequential Processes. In Programming Languages, edited by F. Genuys, Academic Press.

Edsger W. Dijkstra. A Discipline of Programming. Prentice-Hall.

Charles Donnelly and Richard Stallman. BISON: The YACC-compatible Parser Generator. The Free Software Foundation, June.

Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, Klaus Erik Schauser. Active Messages: a Mechanism for Integrated Communication and Computation. University of California, Berkeley, Report No. UCB/CSD, March.

Margaret A. Ellis and Bjarne Stroustrup. The Annotated C++ Reference Manual. Addison-Wesley.

M. J. Flynn. Very High-Speed Computing Systems. Proceedings of the IEEE, December.

Ian Foster and Stephen Taylor. STRAND: New Concepts in Parallel Programming. Prentice Hall.

A. Gottlieb et al. The NYU Ultracomputer: Designing an MIMD Shared-Memory Parallel Computer. IEEE Transactions on Computers, February.

Per Brinch Hansen. Structured Multiprogramming. CACM, July.

John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers.

Carl Hewitt. Viewing Control Structures as Patterns of Passing Messages. In Artificial Intelligence: An MIT Perspective, edited by Winston and Brown, MIT Press.


Carl Hewitt and Henry Baker. Laws for Communicating Parallel Processes. IFIP, Toronto, August.

W. Daniel Hillis. The Connection Machine. MIT Press.

C. A. R. Hoare. Monitors: An Operating System Structuring Concept. CACM, October.

C. A. R. Hoare. Communicating Sequential Processes. CACM, August.

Waldemar Horwat. Concurrent Smalltalk on the Message-Driven Processor. MS Thesis, MIT Department of Electrical Engineering and Computer Science, May.

Waldemar Horwat, Andrew A. Chien, William J. Dally. Experience with CST: Programming and Implementation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.

INMOS Limited. Transputer Reference Manual. Prentice Hall.

Kirk Johnson, Anant Agarwal. The Impact of Communication Locality on Large-Scale Multiprocessor Performance. MIT/LCS/TM, MIT, February.

David Kranz, Kirk Johnson, Anant Agarwal, John Kubiatowicz, and Beng-Hong Lim. Integrating Message-Passing and Shared-Memory: Early Experience. MIT/LCS/TM, MIT, October.

Clyde P. Kruskal, Marc Snir. Cost-Bandwidth Tradeoffs for Communication Networks. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures.

David J. Kuck et al. Measurements of Parallelism in Ordinary FORTRAN Programs. IEEE Computer, January.

H. T. Kung and Charles E. Leiserson. Algorithms for VLSI Processor Arrays. Chapter in Introduction to VLSI Systems by Carver Mead and Lynn Conway, Addison-Wesley.

Charles R. Lang, Jr. The Extension of Object-Oriented Languages to a Homogeneous, Concurrent Architecture. Caltech Technical Report TR.

K. Rustan M. Leino. Multicomputer Programming with Modula-D. Caltech-CS-TR.


Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, John Hennessy, Mark Horowitz, and Monica Lam. Design of the Stanford DASH Multiprocessor. Stanford University Technical Report No. CSL-TR, December.

Sigurd L. Lillevik. The Touchstone Gigaflop DELTA Prototype. IEEE, March.

Johan J. Lukkien and Jan L. A. van de Snepscheut. A Tutorial Introduction to Mosaic Pascal. Caltech-CS-TR.

L. R. Marino. The Effect of Asynchronous Inputs on Sequential Network Reliability. IEEE Transactions on Computers, November.

Alain J. Martin. Programming in VLSI: From Communicating Processes to Delay-Insensitive Circuits. Chapter one in Developments in Concurrency and Communication, edited by C. A. R. Hoare, Addison-Wesley.

Alain J. Martin, Drazen Borkovic, Marcel van der Goot, Tony Lee, Jose Tierno. CAST: Caltech Asynchronous Synthesis Tools, The First Release. Caltech-CS-TR.

Alain J. Martin, Steven M. Burns, T. K. Lee, Drazen Borkovic, Pieter J. Hazewindus. The Design of an Asynchronous Microprocessor. In Advanced Research in VLSI: Proceedings of the Decennial Caltech Conference, edited by Charles L. Seitz, MIT Press.

Daniel Maskit, Stephen Taylor. Experiences in Programming the J-Machine. Caltech-CS-TR.

David May. Occam and the Transputer. Chapter two in Developments in Concurrency and Communication, edited by C. A. R. Hoare, Addison-Wesley.

Carver Mead and Lynn Conway. Introduction to VLSI Systems. Addison-Wesley.

Michael Noakes, William J. Dally. System Design of the J-Machine. In Proceedings of the Sixth MIT Conference on Advanced Research in VLSI, MIT Press.

Michael D. Noakes, Deborah A. Wallach, and William J. Dally. The J-Machine Multicomputer: An Architectural Evaluation. In Proceedings of the 20th Annual International Symposium on Computer Architecture, San Diego, California, May.


Alan V. Oppenheim and Ronald W. Schafer. Signal Processing. Prentice-Hall.

Miroslav Pechoucek. Anomalous Response Times of Input Synchronizers. IEEE Transactions on Computers, February.

Michael J. Pertel. A Simple Network Simulator. Caltech-CS-TR.

Michael J. Pertel. A Critique of Adaptive Routing. Caltech-CS-TR.

G. F. Pfister et al. The IBM Research Parallel Processor (RP3): Introduction and Architecture. In Proceedings of the International Conference on Parallel Processing.

John R. Rose, Guy L. Steele Jr. C*: An Extended Language for Data Parallel Programming. Thinking Machines Technical Report PL, March.

Fred Rosenberger and Thomas J. Chaney. Flip-Flop Resolving Time Test Circuit. IEEE Journal of Solid-State Circuits, August.

Fred U. Rosenberger, Charles E. Molnar, Thomas J. Chaney, and Ting-Pien Fang. Q-Modules: Internally Clocked Delay-Insensitive Modules. IEEE Transactions on Computers, September.

Charles L. Seitz. Ideas About Arbiters. LAMBDA, First Quarter.

Charles L. Seitz. System Timing. Chapter seven in Introduction to VLSI Systems by Carver Mead and Lynn Conway, Addison-Wesley.

Charles L. Seitz. Concurrent VLSI Architectures. IEEE Transactions on Computers, December.

Charles L. Seitz. The Cosmic Cube. CACM, January.

Charles L. Seitz. Multicomputers. Chapter five in Developments in Concurrency and Communication, edited by C. A. R. Hoare, Addison-Wesley.

Charles L. Seitz, W. C. Athas, C. M. Flaig, A. J. Martin, J. Seizovic, C. S. Steele, and W.-K. Su. The Architecture and Programming of the Ametek Series 2010 Multicomputer. In The Third Conference on Hypercube Concurrent Computers and Applications, Pasadena, California.

Charles L. Seitz, Nanette J. Boden, Jakov Seizovic, and Wen-King Su. The Design of the Caltech Mosaic C Multicomputer. In Research on Integrated Systems: Proceedings of the Symposium, edited by Gaetano Borriello and Carl Ebeling, MIT Press.


Charles L. Seitz, Jakov Seizovic, Wen-King Su. The C Programmer's Abbreviated Guide to Multicomputer Programming. Caltech-CS-TR.

Charles L. Seitz and Wen-King Su. A Family of Routing and Communication Chips Based on the Mosaic. In Research on Integrated Systems: Proceedings of the Symposium, edited by Gaetano Borriello and Carl Ebeling, MIT Press.

Jakov Seizovic. The Reactive Kernel. Caltech-CS-TR.

Ehud Shapiro, editor. Concurrent Prolog: Collected Papers. MIT Press.

Ray Simar. Personal Communication.

Jaswinder Pal Singh, Wolf-Dietrich Weber, Anoop Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. Stanford University Technical Report No. CSL-TR, June.

K. Stuart Smith, Arunodaya Chatterjee. A C Environment for Distributed Application Execution. MCC Technical Report ACT-ESP.

Don Speck. Fast, Scalable CMOS dRAM. In Advanced Research in VLSI, UC Santa Cruz, MIT Press.

E. Spertus. Execution of Dataflow Programs on General-Purpose Hardware. MS Thesis, MIT Department of Electrical Engineering and Computer Science, August.

Craig S. Steele. Affinity: A Concurrent Programming System for Multicomputers. Caltech-CS-TR.

Leon Sterling, Ehud Shapiro. The Art of Prolog: Advanced Programming Techniques. MIT Press.

Wen-King Su. Reactive-Process Programming and Distributed Discrete-Event Simulation. Caltech-CS-TR.

Ivan E. Sutherland. Micropipelines. Turing Award Lecture, CACM, June.

Stephen Taylor. Parallel Logic Programming Techniques. Prentice Hall.

Michael D. Tiemann. User's Guide to GNU C++. The Free Software Foundation.

Manu Thapar. Cache Coherence for Scalable Shared Memory Multiprocessors. Stanford University Technical Report No. CSL-TR, May.


Thinking Machines Corporation. The Connection Machine CM Technical Summary. Thinking Machines Corporation, January.

Clark D. Thompson. Area-Time Complexity for VLSI. In Caltech Conference on Very Large Scale Integration, edited by Charles L. Seitz.

Harry J. M. Veendrick. The Behavior of Flip-Flops Used as Synchronizers and Prediction of Their Failure Rate. IEEE Journal of Solid-State Circuits, April.

Akinori Yonezawa and Mario Tokoro, editors. Object-Oriented Concurrent Programming. MIT Press.