DISTRIBUTED ALGORITHMS

Lecture Notes for

Fall

Nancy A. Lynch

Boaz Patt-Shamir

January

Preface

This report contains the lecture notes used by Nancy Lynch's graduate course in Distributed Algorithms during the fall semester. The notes were prepared by Nancy Lynch and Teaching Assistant Boaz Patt-Shamir.

The main part of this report is detailed lecture notes for the regular class meetings. We think the reader will find these to be reasonably well polished, and we would very much appreciate comments about any errors, or suggestions on the presentation. Following these notes are the homework assignments. Finally, we included an appendix that contains notes for three additional lectures that were given after the regular term was over. The extra lectures cover material that did not fit into the scheduled number of class hours. These notes are in a very rough shape.

Many thanks go to the students who took the course this semester: they contributed many useful comments and suggestions on the technical content, and also had a very strong influence on the pace and level of the presentation. If these notes are understandable to other audiences, it is largely because of the feedback provided by this first audience. Thanks also go to George Varghese for contributing an excellent lecture on self-stabilizing algorithms.

We would also like to acknowledge the work of Ken Goldman and Isaac Saias in preparing earlier versions of the notes for this course. We were able to reuse much of their work. The students who took the course during those two years worked hard to prepare those earlier lecture notes, and also deserve a lot of thanks. Jennifer Welch and Mark Tuttle also contributed to earlier versions of the course. Finally, we would like to thank Joanne Talbot for her invaluable help.

Nancy Lynch, lynch@theory.lcs.mit.edu

Boaz Patt-Shamir, boaz@theory.lcs.mit.edu

January

Contents

LECTURES

Lecture: September
Introduction to the Course
The Subject Matter
Style
Overview of the Course
Synchronous Network Algorithms
Problem Example: Leader Election in a Ring
Algorithm: LeLann, Chang-Roberts

Lecture: September
Leader Election on a Ring (cont.)
Algorithm: Hirshberg-Sinclair
Counterexample Algorithms
Lower Bound on Comparison-Based Protocols
Leader Election in a General Network
Breadth-First Search
Extensions

Lecture: September
Shortest Paths: Unweighted and Weighted
Minimum Spanning Tree
Maximal Independent Set

Lecture: September
Link Failures
The Coordinated Attack Problem
Randomized Consensus
Faulty Processors
Stop Failures

Lecture: September
Byzantine Failures
Reducing the Communication Cost
Stopping Faults
The Byzantine Case
The Turpin-Coan Algorithm

Lecture: September
Number of Processes for Byzantine Agreement
Byzantine Agreement in General Graphs
Weak Byzantine Agreement
Number of Rounds with Stopping Faults

Lecture: October
Number of Rounds with Stopping Failures (cont.)
The Commit Problem
Two-Phase Commit (2PC)
Three-Phase Commit (3PC)
Lower Bound on the Number of Messages

Lecture: October
Asynchronous Shared Memory
Informal Model
Mutual Exclusion Problem
Dijkstra's Mutual Exclusion Algorithm
A Correctness Argument

Lecture: October
Dijkstra's Mutual Exclusion Algorithm (cont.)
An Assertional Proof
Running Time
Improved Mutual Exclusion Algorithms
No-Starvation Requirements
Peterson's Two-Process Algorithm
Tournament Algorithm
Iterative Algorithm

Lecture: October
Atomic Objects
Example: Read/Write Object
Definition
Atomic Snapshots
Problem Statement
Unbounded Algorithm
Bounded Register Algorithm

Lecture: October
Burns' Mutual Exclusion Algorithm
Lamport's Bakery Algorithm
Analysis
The Number of Registers for Mutual Exclusion
Two Processes and One Variable
Three Processes and Two Variables
The General Case

Lecture: October
Consensus Using Read/Write Shared Memory
Impossibility for Arbitrary Stopping Faults
Impossibility for a Single Stopping Fault
Modeling and Modularity for Shared Memory Systems
The Basic Input/Output Automaton Model

Lecture: October
I/O Automata (cont.)
Problem Specification
Proof Techniques
Shared Memory Systems as I/O Automata
Instantaneous Memory Access
Object Model
Modularity Results
Transforming from Instantaneous to Atomic Objects
Composition in the Object Model

Lecture: October
Modularity (cont.)
Wait-freedom
Concurrent Timestamp Systems
Definition of CTS Objects
Bounded Label Implementation
Correctness Proof
Application to Lamport's Bakery Algorithm

Lecture: November
Multi-Writer Register Implementation Using CTS
The Algorithm
A Lemma for Showing Atomicity
Proof of the Multi-Writer Algorithm
Algorithms in the Read-Modify-Write Model
Consensus
Mutual Exclusion

Lecture: November
Read-Modify-Write Algorithms (cont.)
Mutual Exclusion (cont.)
Other Resource Allocation Problems
Safe and Regular Registers
Definitions
Implementation Relationships for Registers
Register Constructions

Lecture: November
Register Constructions (cont.)
Construction: Binary Regular Register from Binary Safe Register
Construction: K-ary Regular Register from Binary Regular Register
Construction: Reader K-ary Atomic Register from Regular Register
Multi-Reader Atomic Registers
Lamport's Bakery Revisited
Asynchronous Network Algorithms
Leader Election

Lecture: November
Leader Election in Asynchronous Rings (cont.)
The LeLann-Chang Algorithm (cont.)
The Hirshberg-Sinclair Algorithm
The Peterson Algorithm
Burns' Lower Bound Result
Problems in General Asynchronous Networks
Network Leader Election
Breadth-First Search and Shortest Paths

Lecture: November
Asynchronous Broadcast-Convergecast
Minimum Spanning Tree
Problem Statement
Connections Between MST and Other Problems
The Synchronous Algorithm (Review)
The Gallager-Humblet-Spira Algorithm: Outline
In More Detail
A Summary of the Code in the GHS Algorithm
Complexity Analysis
Proving Correctness for the GHS Algorithm

Lecture: November
Minimum Spanning Tree (cont.)
Simpler "Synchronous" Strategy
Synchronizers
Synchronous Model
High-Level Implementation Using a Synchronizer
Synchronizer Implementations
Hybrid Implementation
Applications
Lower Bound for Synchronizers

Lecture: November
Time, Clocks, etc.
Logical Time
Algorithms
Applications
Simulating Shared Memory
Mutual Exclusion and Resource Allocation
Problem Definition
Mutual Exclusion Algorithms

Lecture: December
Mutual Exclusion (cont.)
Ricart-Agrawala
Carvalho-Roucairol
General Resource Allocation
Burns-Lynch Algorithm
Drinking Philosophers
Stable Property Detection
Termination for Diffusing Computations
Snapshots

Lecture: December
The Concept of Self-Stabilization
Door Closing and Domain Restriction
Self-Stabilization is Attractive for Networks
Criticisms of Self-Stabilization
Definitions of Stabilization
Execution Definitions
Definitions of Stabilization Based on External Behavior
Discussion on the Stabilization Definitions
Examples from Dijkstra's Shared Memory Model
Dijkstra's Shared Memory Model and IOA
Dijkstra's Second Example as Local Checking and Correction
Dijkstra's First Example as Counter Flushing
Message Passing Model
Modeling the Topology
Modeling Links
Modeling Network Nodes and Networks
Implementing the Model in Real Networks
Local Checking and Correction in our Message Passing Model
Link Subsystems and Local Predicates
Local Checkability
Local Correctability
Local Correction Theorem
Theorem Statement
Overview of the Transformation Code
Intuitive Proof of Local Correction Theorem
Intuition Behind Local Snapshots
Intuition Behind Local Resets
Intuition Behind Local Correction Theorem
Implementing Local Checking in Real Networks: Timer Flushing
Summary

Lecture: December
Self-Stabilization and Termination
Self-Stabilization and Finite State
Self-Stabilizing Reset Protocol
Problem Statement
Unbounded-Registers Reset
Example: Link Reset
Reset Protocol
Analysis
Comments
Application: Network Synchronization
Implementation with Bounded Pulse Numbers

Lecture: December
Fischer-Lynch-Paterson Impossibility Result
Ben-Or Randomized Protocol for Consensus
Correctness
Parliament of Paxos

HOMEWORK ASSIGNMENTS

Homework Assignment
Homework Assignment
Homework Assignment
Homework Assignment
Homework Assignment
Homework Assignment
Homework Assignment
Homework Assignment
Homework Assignment
Homework Assignment
Homework Assignment

APPENDIX

Lecture: January
A. Implementing Reliable Point-to-Point Communication in terms of Unreliable Channels
A. Stenning's Protocol
A. Alternating Bit Protocol
A. Bounded-Header Protocols Tolerating Reordering

Lecture: January
B. Reliable Communication Using Unreliable Channels
B. Bounded-Header Protocols Tolerating Reordering
B. Tolerating Node Crashes
B. Packet Handshake / Internet Protocol

Lecture: January
C. MMT Model Definition
C. Timed Automata
C. Simple Mutual Exclusion Example
C. Basic Proof Methods
C. Consensus
C. More Mutual Exclusion
C. Clock Synchronization

BIBLIOGRAPHY

D. Introduction
D. Models and Proof Methods
D. Automata
D. Invariants
D. Mapping Methods
D. Shared Memory Models
D. I/O Automata
D. Temporal Logic
D. Timed Models
D. Algebra
D. Synchronous Message-Passing Systems
D. Computing in a Ring
D. Network Protocols
D. Consensus
D. Asynchronous Shared Memory Systems
D. Mutual Exclusion
D. Resource Allocation
D. Atomic Registers
D. Snapshots
D. Timestamp Systems
D. Consensus
D. Shared Memory for Multiprocessors
D. Asynchronous Message-Passing Systems
D. Computing in Static Networks
D. Network Synchronization
D. Simulating Asynchronous Shared Memory
D. Resource Allocation
D. Detecting Stable Properties
D. Deadlock Detection
D. Consensus
D. Data Link
D. Special-Purpose Network Building Blocks
D. Self-Stabilization
D. Knowledge
D. Timing-Based Systems
D. Resource Allocation
D. Shared Memory for Multiprocessors
D. Consensus
D. Communication
D. Clock Synchronization
D. Applications of Synchronized Clocks

Distributed Algorithms, September

Lecturer: Nancy Lynch

Lecture

Introduction to the Course

The Subject Matter

This course is about distributed algorithms. Distributed algorithms include a wide range of parallel algorithms, which can be classified by a variety of attributes:

Interprocess Communication (IPC) method: shared memory, message-passing, dataflow.

Timing model: synchronous, asynchronous, partially synchronous.

Failure model: reliable system, faulty links, faulty processors.

Problems addressed: resource allocation, communication, agreement, database concurrency control, deadlock detection, and many more.

Some of the major intended application areas of distributed algorithms are:

communication systems,

shared-memory multiprocessor computation,

distributed operating systems,

distributed database systems,

digital circuits, and

real-time process-control systems.

Some kinds of parallel algorithms are studied in other courses and will not be covered in this course, e.g., PRAM algorithms and algorithms for fixed-connection networks such as the Butterfly. The algorithms to be studied in this course are distinguished by having a higher degree of uncertainty and more independence of activities. Some of the types of uncertainty that we will consider are:

unknown number of processors,

unknown shape of network,

independent inputs at different locations,

several programs executing at once, starting at different times, going at different speeds,

nondeterministic processors,

uncertain message delivery times,

unknown message ordering,

failures: processor (stopping, transient omission, Byzantine); link (message loss, duplication, reordering).

Because of all this uncertainty, no component of a distributed system knows the entire system state. Luckily, not all the algorithms we consider will have to contend with all of these types of uncertainty.

Distributed algorithms can be extremely complex, at least in their details, and can be quite difficult to understand. Even though the actual code may be short, the fact that many processors execute the code in parallel, with steps interleaved in some undetermined way, implies that there can be prohibitively many different executions, even for the same inputs. This implies that it is nearly impossible to understand everything about the executions of distributed algorithms. This can be contrasted with other kinds of parallel algorithms, such as PRAM algorithms, for which one might hope to understand exactly what the (well-defined) state looks like at each given point in time. Therefore, instead of trying to understand all the details of the execution, one tends to assert certain properties of the execution, and just understand and prove these properties.

Style

The general flavor of the work to be studied is as follows:

Identify problems of major significance in practical distributed computing, and define abstract versions of the problems for mathematical study.

Give precise problem statements.

Describe algorithms precisely.

Prove rigorously that the algorithms solve the problems.

Analyze the complexity of the algorithms.

Prove corresponding impossibility results.

Note the emphasis on rigor: this is important in this area because of the subtle complications that arise. A rigorous approach seems necessary to be sure that the problems are meaningful, the algorithms are correct, the impossibility results are true and meaningful, and the interfaces are sufficiently well-defined to allow system building.

However, because of the many complications, rigor is hard to achieve. In fact, the development of good formal methods for describing and reasoning about distributed algorithms is the subject of a good deal of recent research. Specifically, there has been much serious work in defining appropriate formal mathematical models, both for describing the algorithms and for describing the problems they are supposed to solve; a considerable amount of work has also been devoted to proof methods. One difficulty in carrying out a rigorous approach is that, unlike in many other areas of theoretical computer science, there is no single accepted formal model to cover all the settings and problems that arise in this area. This phenomenon is unavoidable, because there are so many very different settings to be studied (consider the difference between shared memory and message-passing), and each of them has its own suitable formal models.

So rigor is a goal to be striven for, rather than one that we will achieve entirely in this course. Due to time limitations, and sometimes the difficulty of making formal presentations intuitively understandable, the presentation in class will be a mixture of rigorous and intuitive.

Overview of the Course

There are many different orders in which this material could be presented. In this course we divide it up first according to timing assumptions, since that seems to be the most important model distinction. The timing models to be considered are the following:

synchronous: This is the simplest model. We assume that components take steps simultaneously, i.e., the execution proceeds in synchronous rounds.

asynchronous: Here we assume that the separate components take steps in arbitrary order.

partially synchronous (timing-based): This is an in-between model: there are some restrictions on the relative timing of events, but execution is not completely lockstep.

The next division is by the IPC mechanism: shared memory vs. message-passing. Then we subdivide by the problem studied. And finally, each model and problem can be considered with various failure assumptions.

We now go over the bibliographical list and the tentative schedule (see the handouts). The bibliographical list doesn't completely correspond to the order of topics to be covered; the difference is that the material on models will be spread throughout the course as needed.

General references. These include the previous course notes and some related books. There is not really much in the way of textbooks on this material.

Introduction. The chapter on distributed computing in the Handbook of Theoretical Computer Science is a sketchy overview of some of the modeling and algorithmic ideas.

Models and Proof Methods. We shall not study this as an isolated topic in the course; rather, it is distributed through the various units. The basic models used are automata-theoretic, starting with a basic state-machine model with little structure. These state machines need not necessarily be finite-state. Invariant assertions are often proved about automaton states by induction. Sometimes we use one automaton to represent the problem being solved, and another to represent the solution; then a correctness proof boils down to establishing a correspondence that preserves the desired external behavior. In this case, proving the correspondence is often done using a mapping or simulation method. Specially tailored state machine models have been designed for some special purposes, e.g., for shared memory models, where the structure consists of processes and shared variables. Another model is the I/O automaton model for reactive systems, i.e., systems that interact with an external environment in an ongoing fashion. This model can model systems based on shared variables, but is more appropriate for message-passing systems. One of the key features of this model is that it has good compositionality properties, e.g., that the correctness of a compound automaton can be proved using the correctness of its components. Temporal logic is an example of a special set of methods (language, logic) mainly designed for proving liveness properties, e.g., that something eventually happens. Timed models are mainly newer research work. Typically, these are specially-tailored models for talking about timing-based systems, e.g., those whose components have access to system clocks, can use timeouts, etc. Algebraic methods are an important research subarea, but we will not have time for this. The algebraic methods describe concurrent processes and systems using algebraic expressions, then use equations involving these expressions to prove equivalences and implementation relationships among the processes.

Synchronous Message-Passing. As noted, the material is organized first by timing model. The simplest model (i.e., the one with the least uncertainty) is the synchronous model, in which all the processes take steps in synchronous rounds. The shared-memory version of this model is the PRAM. Since it is studied in other courses, we shall skip this subject and start with synchronous networks.

We spend the first two weeks or so of the course on problems in synchronous networks that are typical of the distributed setting. In the network setting, we have the processors at the nodes of a graph G, communicating with their neighbors via messages in the edges. We start with a simple toy example involving ring computation. The problem is to elect a unique leader in a simple network of processors, which are assumed to be identical except for Unique Identifiers (UIDs). The uncertainty is that the size of the network and the set of UIDs of the processors are unknown (although it is known that the UIDs are indeed unique). The main application for this problem is a token ring, where there is a single token circulating, and sometimes it is necessary to regenerate a lost token. We shall see some details of the modeling, and some typical complexity measures will be studied. For example, we shall show upper and lower bounds for the time and the amount of communication (i.e., number of messages) required. We shall also study some other problems in this simple setting.

Next we'll go through a brief survey of some protocols in more general networks. We shall see some protocols used in unknown synchronous networks of processes to solve some basic problems, like finding shortest paths, defining a minimum spanning tree, computing a maximal independent set, etc.

Then we turn to the problem of reaching consensus. This refers to the problem of reaching agreement on some abstract fact, where there are initial differences of opinion. The uncertainty here stems not only from different initial opinions, but also from processor failures. We consider failures of different types: stopping, where a processor suddenly stops executing its local protocol; omission, where messages may be lost en route; and Byzantine, where a faulty processor is completely unrestricted. This has been an active research area in the past few years, and there are many interesting results, so we shall spend a couple of lectures on this problem. We shall see some interesting bounds on the number of tolerable faults, time, and communication.

Asynchronous Shared Memory. After warming up with synchronous algorithms (in which there is only a little uncertainty), we move into the more characteristic, and possibly more interesting, part of the course, on asynchronous algorithms. Here, processors are no longer assumed to take steps in lockstep synchrony, but rather can interleave their steps in arbitrary order, with no bound on individual process speeds. Typically, the interactions with the external world (i.e., input/output) are ongoing, rather than just initial input and final output. The results in this setting have quite a different flavor from those for synchronous networks.

The first problem we deal with is mutual exclusion. This is one of the fundamental (and historically first) problems in this area, and consequently much work has been dedicated to exploring it. Essentially, the problem involves arbitrating access to a single, indivisible, exclusive-use resource. The uncertainty here is about who is going to request access, and when. We'll sketch some of the important algorithms, starting with the original algorithm of Dijkstra. Many important concepts for this field will be illustrated in this context, including progress, fairness, fault-tolerance, and time analysis for asynchronous algorithms. We shall see upper bounds on the amount of shared memory, corresponding lower bounds, and impossibility results. We shall also discuss generalizations of mutual exclusion to more general resource allocation problems. For example, we will consider the Dining Philosophers problem, a prototypical resource allocation problem.

Next we shall study the concept of atomic registers: so far we have been assuming indivisible access to shared memory. But how can one implement this on simpler architectures? We shall look at several algorithms that solve this problem in terms of weaker primitives. An interesting new property that appears here is wait-freeness, which means that any operation on the register must complete regardless of the failure of other concurrent operations.

An atomic snapshot is a convenient primitive for shared read/write memory. Roughly speaking, the objective is to take an instantaneous snapshot of all the memory locations at once. An atomic snapshot is a useful primitive to have for building more powerful systems. We shall see how to implement it.

A concurrent timestamp system is another nice primitive. This is a system that issues, upon request, timestamps that can be used by programs to establish a consistent order among their operations. The twist here is how to implement such systems with bounded memory. It turns out that such bounded timestamp systems can be built in terms of atomic snapshots. Also, a concurrent timestamp system can be used to build more powerful forms of shared memory, such as multi-writer multi-reader memory.

We shall also reconsider the consensus problem in the asynchronous shared memory model, and prove the interesting fact that it is impossible to solve in this setting.

A possible new topic this time is shared memory for multiprocessors. Much recent research is aimed at identifying different types of memory abstractions that are used, or might be useful, for real systems. Systems architects who develop such types of memory frequently do not give precise statements of what their memory guarantees, especially in the presence of concurrent accesses and failures. So recently, theoreticians have started trying to do this.

Asynchronous Message-Passing Systems. This section deals with algorithms that operate in asynchronous networks. Again, the system is modeled as a graph with processors at nodes, and communication links are represented by the edges, but now the system does not operate in rounds. In particular, messages can arrive at arbitrary times, and the processors can take steps at arbitrary speeds. One might say that we now have "looser" coupling of the components of the system: we have more independence and uncertainty.

Computing in static graphs. The simplest type of setting here is computation in a fixed (unknown) graph in which the inputs arrive at the beginning, and there is a single output to be produced. Some examples are leader election in a ring, and minimum spanning tree computation.

Network synchronization. At this point we could plunge into a study of the many special-purpose algorithms designed expressly for asynchronous distributed networks. But instead, we shall first try to impose some structure on such algorithms, by considering algorithm transformations that can be used to run algorithms designed for a simpler computation model on a complex asynchronous network.

The first example here arises in the very important paper by Lamport, where he shows a simple method of assigning consistent logical times to events in a distributed network. This can be used to allow an asynchronous network to simulate one in which the nodes have access to perfectly synchronized real-time clocks. The second example is Awerbuch's synchronizer, which allows an asynchronous network to simulate the lockstep synchronous networks discussed in the first part of the course (at least, those without failures), and to do so efficiently. We shall contrast this simulation result with an interesting lower bound that seems to say that any such simulation must be inefficient; the apparent contradiction turns out to depend on the kind of problem being solved. Third, we shall see that an asynchronous network can simulate a centralized (non-distributed) state machine. And fourth, an asynchronous network can be used to simulate asynchronous shared memory. Any of these simulations can be used to run algorithms from simpler models in the general asynchronous network model.

Next, we shall look at some specific problems, such as resource allocation. We shall see how to solve mutual exclusion, dining philosophers, etc., in networks.

Detection of stable properties refers to a class of problems with a similar flavor and a common solution. Suppose that there is a separate algorithm running, and we want to design another algorithm to "monitor" the first. To "monitor" here might mean, for instance, to detect when it terminates or deadlocks, or to take a "consistent snapshot" of its state.

We shall also revisit the consensus problem in the context of networks. The problem is easy without faults, but with faults it is mostly impossible (even for very simple types of faults, such as just stopping).

Datalink protocols involve the implementation of a reliable communication link in terms of unreliable underlying channels. We shall see the basic Alternating Bit Protocol (the standard case study for concurrent algorithm verification papers). There have been many new algorithms and impossibility results for this problem.

Special-Purpose Network Building Blocks. Some special problems that arise in communication networks have special solutions that seem able to combine with others. Major examples are the protocols of broadcast-convergecast, reset, and end-to-end.

Self-stabilization. Informally, a protocol is said to be self-stabilizing if its specification does not require a certain initial configuration to be imposed on the system to ensure correct behavior of the protocol. The idea is that a self-stabilizing protocol is highly resilient: it can recover from any transient error. We shall see some of the basic self-stabilizing protocols, and survey some of the recent results.

Timing-based Systems. These systems lie between synchronous and asynchronous, so they have somewhat less uncertainty than the latter. They are put at the end of the course because they are a subject of newer research, and since they are less well understood. In these systems, processors have some knowledge of time, for example, access to real time (or approximate real time), or some timeout facility. Another possible assumption is that the processor step time, or message delivery time, is within some known bounds. Time adds some extra structure and complication to state machine models, but it can be handled in the same general framework. In terms of the various time parameters, one can get upper and lower bounds for various interesting problems, such as mutual exclusion, resource allocation, consensus, and communication.

We may have time to look at some popular clock-synchronization algorithms, and perhaps some of their applications.

Synchronous Network Algorithms

We start with the synchronous network algorithms. Our computation model is defined as follows. We are given a collection of nodes, organized into a directed graph G = (V, E). We shall denote n = |V|. We assume the existence of some message alphabet M, and let null denote the absence of a message. Each node i ∈ V has:

states(i), a set of states;

start(i), a subset of states(i);

msgs(i), mapping states(i) × out-nbrs(i) to elements of M or null;

trans(i), mapping states(i) and a vector of messages in M ∪ {null}, one per in-neighbor of i, to states(i).

Execution begins with all the nodes in some start states, and all channels empty. Then the nodes, in lock step, repeatedly execute the following procedure, called a round:

1. Apply the message generation function to generate messages for the out-neighbors.

2. Put them in the channels.

3. Apply the transition function to the incoming messages and the state to get the new state.

Remarks: Note the following points. First, the inputs are assumed to be encoded in the start states. Second, the model presented here is deterministic: the transitions and the messages are given by functions, i.e., they are single-valued.
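As a concrete illustration (a sketch of ours, not part of the original notes), the round structure just described can be written as a small simulator. All names here (Node, run_round, and the trivial example functions) are our own invention, chosen only to mirror the states/msgs/trans structure of the model.

```python
# Minimal simulator for the synchronous network model: each node has a state,
# a message function (state, out-neighbor) -> message-or-NULL, and a
# transition function (state, vector of incoming messages) -> new state.

NULL = None  # stands for the absence of a message


class Node:
    def __init__(self, start_state, msgs_fn, trans_fn):
        self.state = start_state
        self.msgs_fn = msgs_fn      # (state, out_nbr) -> message or NULL
        self.trans_fn = trans_fn    # (state, {in_nbr: message}) -> new state


def run_round(nodes, edges):
    """Execute one synchronous round.

    nodes: dict mapping node id -> Node
    edges: set of (u, v) pairs, meaning u can send to v
    """
    # Step 1: every node generates messages for its out-neighbors,
    # and (step 2) the messages are placed in the channels.
    channels = {(u, v): nodes[u].msgs_fn(nodes[u].state, v) for (u, v) in edges}
    # Step 3: every node applies its transition function to its state and the
    # vector of incoming messages, one entry per in-neighbor (possibly NULL).
    new_states = {}
    for i, node in nodes.items():
        incoming = {u: channels[(u, i)] for (u, v) in edges if v == i}
        new_states[i] = node.trans_fn(node.state, incoming)
    for i, node in nodes.items():   # all nodes change state simultaneously
        node.state = new_states[i]
```

For instance, two nodes that each send their state to the other and keep the maximum value seen will agree after a single round.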

Problem Example: Leader Election in a Ring

Consider a graph that is a ring, plus (a technicality for the problem definition) an extra dummy node representing the outside world. The ring is assumed to be unidirectional: messages can be sent only clockwise. The ring is of arbitrary size, unknown to the processors (i.e., size information is not built into their states). The requirement is that, eventually, exactly one process outputs a leader message on its dummy outgoing channel.

A first easy observation is that if all the nodes are identical, then this problem is impossible to solve in this model.

Proposition: If all the nodes in the ring have identical state sets, start states, message functions, and transition functions, then they do not solve the leader election problem for n > 1.

Proof: It is straightforward to verify, by induction on the number of rounds, that all the processors are in identical states after any number of rounds. Therefore, if any processor ever sends a leader message, then they all do so at the same time, violating the problem requirement.
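The inductive argument can also be observed experimentally. The following sketch (ours, with invented names) simulates a unidirectional ring of identical deterministic nodes and checks that the states stay identical round after round, for any choice of message and transition functions shared by all nodes.

```python
# Symmetry argument in miniature: in a unidirectional ring of identical
# deterministic nodes, all nodes remain in identical states after every round,
# so no node can single itself out as leader.

def run_identical_ring(n, start, msg_fn, trans_fn, rounds):
    """Simulate a ring of n identical nodes; node i receives from node i-1."""
    states = [start] * n
    for _ in range(rounds):
        sent = [msg_fn(s) for s in states]                 # each node sends clockwise
        states = [trans_fn(states[i], sent[(i - 1) % n])   # receive from ccw neighbor
                  for i in range(n)]
    return states


# Whatever identical functions we pick, the states never diverge:
states = run_identical_ring(5, 0, lambda s: s + 1, lambda s, m: s + m, rounds=10)
assert len(set(states)) == 1  # all five nodes are in the same state
```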

As indicated by the Proposition, the only way to solve the leader election problem is to break the symmetry somehow. A reasonable assumption (from practice) is that the nodes are identical except for a unique identifier (UID), chosen from some large totally ordered set, e.g., the integers. We are guaranteed that each node's UID is different from each other's in the ring, though there is no constraint on which UIDs actually appear in the given ring (e.g., they don't have to be consecutive integers). We exploit this assumption in the solutions we present.

Algorithm: LeLann, Chang-Roberts

We first describe the algorithm informally. The idea is that each process sends its identifier around the ring; when a node receives an incoming identifier, it compares that identifier to its own. If the incoming identifier is greater than its own, it keeps passing the identifier; if it is less than its own, it discards the incoming identifier; and if it is equal to its own, the process declares itself the leader. Intuitively, it seems clear that the process with the largest UID will be the one that outputs a leader message. In order to make this intuition precise, we give a more careful description of the system.

The message alphabet M consists of the UIDs, plus the special message "leader".

The state set state(i) consists of the following components:

own, of type UID, initially i's UID
send, of type UID or null, initially i's UID
status, with values in {unknown, chosen, reported}, initially unknown

The start state start(i) is defined by the initializations above.

The messages function msgs(i) is defined as follows. Send message m, a UID, to the clockwise neighbor exactly if m is the current value of send. Send "leader" to the dummy node exactly if status = chosen.

The transition function is defined by the following pseudocode:

Suppose the new message is m.
send := null
If status = chosen then status := reported
Case:
  m > own: send := m
  m = own: status := chosen; send := null
  else: no-op
Endcase

Although the description is in a fairly convenient programming-language style, note that it has a direct translation into a state machine (e.g., each state consists of a value for each of the variables). We do not, in general, place restrictions on the amount of computation needed to go from one state to another. Also note the relative scarcity of flow-of-control statements; this phenomenon is pretty typical for distributed algorithms. The presence of such statements would make the translation from program to state machine less transparent.
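To make the round structure concrete, here is a small simulation of the LCR ring (an illustration only; the function name and the list-based encoding of the ring are ours, not part of the formal model):

```python
# Hypothetical round-by-round simulation of the LCR algorithm on a
# synchronous unidirectional ring; returns the indices that output "leader".
def lcr(uids):
    n = len(uids)
    send = list(uids)                 # the send component, per process
    status = ["unknown"] * n
    leaders = []
    for _ in range(2 * n):            # enough rounds for election plus output
        incoming = [send[(i - 1) % n] for i in range(n)]  # clockwise delivery
        for i in range(n):
            if status[i] == "chosen":
                status[i] = "reported"
                leaders.append(i)     # the "leader" output on the dummy channel
            m = incoming[i]
            if m is not None and m > uids[i]:
                send[i] = m           # pass along a larger UID
            elif m is not None and m == uids[i]:
                status[i] = "chosen"  # own UID came all the way around
                send[i] = None
            else:
                send[i] = None        # discard a smaller UID (or nothing)
    return leaders

print(lcr([3, 9, 2, 7, 5]))  # -> [1], the index holding the maximum UID 9
```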

How do we go about proving that the algorithm is correct? Correctness means that exactly one process ever does a "leader" output. Let i_max denote the index (where indices are arranged consecutively around the ring) of the processor with the maximum UID. It suffices to show that i_max does a "leader" output at round n + 1, and that no other processor ever does such an output. We prove these properties, respectively, in the following two lemmas.

Lemma: Node i_max outputs a "leader" message at round n + 1.

Proof: Let v_max be the own value of i_max. Note that own values never change (by the code), that they are all distinct (by assumption), and that i_max has the largest (by definition of i_max). By the code, it suffices to show that after n rounds,

status(i_max) = chosen.

This can be proved, as things of this sort are typically proved, by induction on the number of computation steps, here the number of rounds. We need to strengthen the statement to be able to prove it by induction; we make the following assertion:

For 0 <= r <= n - 1, after r rounds, send(i_max + r) = v_max.

(Addition is modulo n.) In words, the assertion is that the maximum value appears at the position r away from i_max. Now this assertion is straightforward to prove by induction. The key step is that every node other than i_max admits the maximum value into its send component, since v_max is greater than all the other values.

Lemma: No process other than i_max ever sends a "leader" message.

Proof: It suffices to show that all other processes always have status = unknown. Again, it helps to have an invariant. Let [a, b] denote the set of positions {a, a+1, ..., b}, where addition is modulo n. The following invariant asserts that only the maximum value appears between the maximum and the owner in the ring:

If i != i_max and j is in [i_max + 1, i], then own(i) != send(j).

Again, it is straightforward to prove the assertion by induction; the key step is that a non-maximum value does not get past the maximum. (Obviously, we use the fact that v_max is greater than all the others.)

Termination: As written, this algorithm never terminates, in the sense of all the processors reaching some designated final state. We could augment the model to include the concept of a final state. For this algorithm to have everyone terminate, we could have the elected leader start a "report" message around the ring, and anyone who receives it can halt.

Note that final states do not play the same role (i.e., that of accepting states) for distributed algorithms as they do for finite-state automata; normally, no notion of acceptance is used here.

Complexity analysis: The time complexity is n + 1 rounds (for a ring of size n) until a leader is announced. The communication complexity is O(n^2) messages.

Distributed Algorithms, September
Lecturer: Nancy Lynch

Lecture

Leader Election on a Ring (cont.)

Last time we considered the LeLann-Chang-Roberts algorithm. We showed that its time complexity is n + 1 time units (rounds) and its communication complexity is O(n^2) messages. The presentation of that algorithm was fairly formal, and the proofs were sketched. Today we shall use less formal descriptions of three other algorithms.

Algorithm: Hirschberg-Sinclair

The time complexity of the LCR algorithm is fine. The number of messages, however, looks excessive. The first algorithm to reduce the worst-case message complexity to O(n log n) was that of Hirschberg and Sinclair. Again we assume that the ring size is unknown, but now we use bidirectional communication. As in LCR, this algorithm elects the process with the maximum UID.

The idea is that every process, instead of sending messages all the way around the ring as in the LCR algorithm, will send messages that "turn around" and come back to the originating process. The algorithm proceeds as follows.

Each process sends out messages, in both directions, that go distances that are successively larger powers of 2 and then return to their origin (see the figure).

When UID v_i is sent out by process i, other processes on v_i's path compare v_i to their own UID. For such a process j, whose UID is v_j, there are two possibilities. If v_i < v_j, then rather than pass along the original message, j sends back a message to i telling i to stop initiating messages. Otherwise, v_i > v_j; in that case, j relays v_i, and since j can deduce that it cannot win, it will not initiate any new messages. Finally, if a process receives its own message before that message has turned around, then that process is the winner.

Figure: successive message-sends (out-and-back distances 1, 2, 4, ...) in the Hirschberg-Sinclair algorithm.

Analysis: A process initiates a message along a path of length 2^i (for i >= 1) only if it has not been defeated by another process within distance 2^(i-1) in either direction along the ring. This means that within any group of 2^(i-1) + 1 consecutive processes along the ring, at most one will go on to initiate messages along paths of length 2^i. This can be used to show that at most

  ceil(n / (2^(i-1) + 1))

processes in total will initiate messages along paths of length 2^i.

The total number of messages sent out is then bounded by

  4 * ( 1*n + 2*ceil(n/2) + 4*ceil(n/3) + 8*ceil(n/5) + ... + 2^i * ceil(n / (2^(i-1) + 1)) + ... )

The factor of 4 in this expression is derived from the fact that each round of message-sending for a given process occurs in both directions (clockwise and counterclockwise), and that each outgoing message must turn around and return. For example, in the first round of messages, each process sends out two messages (one in each direction, a distance of one each), and then each outgoing message returns a distance of one, for a net total of four messages sent. Each term in the large parenthesized expression is the number of messages sent out around the ring at a given pass, counting only messages sent in one direction and along the outgoing path. Thus the first term, 1*n, indicates that all n processes send out messages for an outgoing distance of 1.

Each term in the large parenthesized expression is at most about 2n, and there are at most 1 + ceil(log n) terms in the expression, so the total number of messages is O(n log n), with a constant factor of approximately 8.

The time complexity for this algorithm is just O(n), as can be seen by considering the time taken for the eventual winner. The winning process will send out messages that take time 2*1, 2*2, 2*4, and so forth (to go out and return), and it will finish after sending out the ceil(log n)-th message. If n is an exact power of 2, then the time taken by the winning process is approximately 3n, and if not, the time taken is at most about 5n.
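The phase structure can be checked with a small simulation (a sketch under simplifying assumptions: we process whole phases at once rather than individual rounds, count message hops as probes travel out and back, and do not count the winner's final trip around the ring):

```python
# Election by the Hirschberg-Sinclair scheme, organized phase by phase.
def hirschberg_sinclair(uids):
    n = len(uids)
    active = set(range(n))          # processes still initiating probes
    hops = 0                        # message transmissions, out plus back
    phase = 0
    while True:
        d = 2 ** phase              # probe distance for this phase
        if d >= n:
            # Once d >= n, only the global maximum can still be active:
            # its probe would come all the way around, so it wins.
            return active.pop(), hops
        survivors = set()
        for i in active:
            won = True
            for direction in (1, -1):
                steps = 0
                for s in range(1, d + 1):
                    steps = s
                    if uids[(i + direction * s) % n] > uids[i]:
                        won = False       # probe discarded at a larger UID
                        break
                hops += 2 * steps         # outbound hops plus the return trip
            if won:
                survivors.add(i)
        active = survivors
        phase += 1

print(hirschberg_sinclair([3, 9, 2, 7, 5])[0])  # -> 1 (index of UID 9)
```

The hop count stays within the roughly 8 * n * (1 + ceil(log n)) message bound derived above.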

Counterexample Algorithms

Now we consider the question of whether it is possible to elect a leader with fewer than O(n log n) messages. The answer to this problem, as we shall demonstrate shortly with an impossibility result, is negative. That result, however, is valid only in a restricted model (comparison-based protocols, defined below). For the general case, the following counterexample algorithms can be used to show that no such impossibility result can be proved. We use the term "counterexample algorithm" for an algorithm that isn't interesting by itself (e.g., neither practical nor particularly elegant from a mathematical viewpoint); however, it serves to show that an impossibility result cannot be proved.

The time-slice algorithm: We now suppose that the model is simpler, with only a little uncertainty. Specifically, we assume the following:

n is known to all processes.
The communication is unidirectional.
The IDs are positive integers.
All processes start the algorithm at the same time.

In this setting, the following simple algorithm works.

In the first n rounds, only a message marked with UID 1 can travel around. If a processor with UID 1 does exist, then this message is relayed throughout the ring; otherwise, the first n rounds are idle. In general, the rounds in the range (k-1)n + 1, ..., kn are reserved for UID k. Thus the minimal UID eventually gets all the way around, which causes its originating process to get distinguished.

If it is desired to have the maximum elected instead of the minimum, simply let the minimum send a special message around at the end to determine the maximum.

(Actually, the time-slice algorithm isn't entirely a counterexample algorithm; it has really been used in some systems.)

The nice property of the algorithm is that the total number of messages is n. Unfortunately, the time is about n * M, where M is the value of the minimum ID; this value is unbounded for a fixed-size ring. Note how heavily this algorithm uses the synchrony, quite differently from the first two algorithms.
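The n-message behavior is easy to check in a sketch (the function and its tuple output are ours; we shortcut the idle slices rather than simulating them round by round):

```python
# A sketch of the time-slice algorithm: rounds (k-1)*n + 1 .. k*n are
# reserved for UID k, so only the minimum UID's token ever travels, and it
# uses exactly n messages.  Positive-integer UIDs and known n are assumed.
def time_slice(uids):
    n = len(uids)
    round_no = 0
    while True:
        round_no += 1
        k = (round_no - 1) // n + 1      # the UID whose slice contains this round
        if k in uids:
            # The token with UID k is relayed one hop per round of its slice;
            # after n relays it is back at its originator, which is elected.
            origin = uids.index(k)
            return origin, n, round_no + n - 1   # (winner, messages, rounds)

print(time_slice([3, 9, 2, 7, 5]))  # -> (2, 5, 10): UID 2 wins after 2n rounds
```

Note that the round count grows linearly with the minimum UID value, illustrating the n * M time bound.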

The variable-speeds algorithm (Frederickson-Lynch): We now present an algorithm that achieves the O(n) message bound in the case of unknown ring size (all other assumptions are as for the time-slice algorithm). The cost is time complexity, which is even worse than for the time-slice algorithm: specifically, O(n * 2^M), where M is the minimum UID. Clearly, no one would even think of using this algorithm; it serves here just as a counterexample algorithm.

Each process spawns a message which moves around the ring, carrying the UID of the original process. Identifiers that originate at different processes are transmitted at different rates: in particular, UID v travels at the rate of one message transmission every 2^v clock pulses. Any slow identifier that is overtaken by a faster identifier is deleted, since it has a larger identifier. Also, identifier v arriving at process w will be deleted if w < v. If an identifier gets back to its originator, the originator is elected.

This strategy guarantees that the process with the smallest identifier gets all the way around before the next smallest gets halfway, etc., and therefore, up to the time of election, it uses more messages than all the others combined. Therefore, the total number of messages up to the time of election is less than 2n. The time complexity, as mentioned above, is about n * 2^M. Also note that by the time the minimum gets all the way around, all other transmissions have died out, and thus 2n is an upper bound on the number of messages that are ever sent by the algorithm, even after the leader message is output.

We remark that these algorithms can also be used with the processors waking up at different times. The first two require no modification, but the variable-speeds algorithm needs some work in order to maintain the desired complexities.
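A pulse-level sketch of the variable-speeds algorithm (a simplification: we omit the deletion-by-overtaking rule, which only matters for cleanup after the election; the speeds alone already keep the pre-election message count below 2n):

```python
# Each token carries a UID v and moves one hop every 2**v pulses; it is
# deleted when it reaches a process with a smaller UID, and it elects its
# originator if it gets all the way around.
def variable_speed(uids):
    n = len(uids)
    # token id -> (uid carried, originator, current position)
    tokens = {i: (uids[i], i, i) for i in range(n)}
    messages = 0
    pulse = 0
    while True:
        pulse += 1
        for key, (v, origin, pos) in list(tokens.items()):
            if pulse % (2 ** v) == 0:        # this token moves on this pulse
                nxt = (pos + 1) % n
                messages += 1
                if nxt == origin:
                    return origin, messages  # token got all the way around
                if uids[nxt] < v:
                    del tokens[key]          # deleted at a smaller-UID node
                else:
                    tokens[key] = (v, origin, nxt)

winner, msgs = variable_speed([3, 9, 2, 7, 5])
print(winner)  # -> 2, the index of the minimum UID
```

By the time the minimum's token finishes its n steps, any other token v has moved at most n / 2^(v - M) steps, so the total is n + n/2 + n/4 + ... < 2n.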

Lower Bound on Comparison-Based Protocols

We have seen several algorithms for leader election on a ring. The Hirschberg-Sinclair algorithm was comparison-based and had a complexity of O(n log n) messages and O(n) time. The time-slice and variable-speeds algorithms were non-comparison-based and had O(n) messages, with huge running times. To gain further understanding of the problem, we now show a lower bound of Omega(n log n) messages for comparison-based algorithms (Frederickson-Lynch, Attiya-Snir-Warmuth). This lower bound holds even if we assume that n is known to the processors.

The result is based on the difficulty of breaking symmetry. Recall the impossibility proof from last time, where we showed that, because of symmetry, it is impossible to elect a leader without identifiers. The main idea in the argument that we shall use below is that symmetry is possible with identifiers also; now it is possible to break it, but doing so must require much communication.

Before we state the main results, we need some definitions.

Definition: We call a distributed protocol a comparison-based algorithm if the nodes are identical except for UIDs, and the only way that UIDs can be manipulated is to copy them and to compare them (for <, =, >); the results of the comparisons are used to make choices within the state machine. UIDs can be sent in messages, perhaps combined with other information.

Intuitively, the decisions made by the state machines depend only on the relative ranks of the UIDs, rather than their values.

The following concept will be central in the proof.

Definition: Let X = x_1, x_2, ..., x_r and Y = y_1, y_2, ..., y_r be two strings of UIDs. We say that X is order equivalent to Y if for all i, j in {1, ..., r}, we have x_i <= x_j if and only if y_i <= y_j.

Notice that two strings of UIDs are order equivalent if and only if the corresponding strings of relative ranks of their UIDs are identical.

The following definitions are technical.

Definition: A computation round is called an active round if at least one non-null message is sent in it.

Definition: The k-neighborhood of a process p in a ring of size n, where k <= floor(n/2), is defined to consist of the 2k + 1 processes at distance at most k from p.

The following concept is a key concept in the argument.

Definition: We say that two states s and t correspond with respect to strings X = x_1, x_2, ..., x_r and Y = y_1, y_2, ..., y_r if all of the UIDs in s are chosen from X, and t is identical to s except for substituting each occurrence of any x_i in s by y_i in t, for all 1 <= i <= r.

Corresponding messages are defined analogously.

We can now start proving our lower bound.

Lemma: Let p and q be two distinct processes executing a comparison-based algorithm. Suppose that p and q have order-equivalent k-neighborhoods. Then at any point after at most k active rounds, p and q are in corresponding states with respect to their k-neighborhoods.

Figure: scenario in the proof of the lemma; p's counterclockwise and clockwise neighbors are p1 and p2, and similarly q1 and q2 for q.

Proof: By induction on the number r of rounds in the computation. Notice that possibly r > k. For each r, we prove the lemma for all k.

Basis: r = 0. By the definition of a comparison-based algorithm, the initial states of p and q are identical except for their UIDs, and hence they are in corresponding initial states with respect to their 0-neighborhoods, consisting of nothing but their own UIDs.

Inductive step: Assume that the lemma holds for all r' < r. Let p1 and p2 be the respective counterclockwise and clockwise neighbors of p, and similarly q1 and q2 for q. We proceed by case analysis.

Suppose that neither p nor q receives a message from either neighbor at round r. Then by the induction hypothesis on r - 1 (using the same k), p and q are in corresponding states before r, and since they have no new input, they make corresponding transitions and end up in corresponding states after r.

Suppose now that at round r, p receives a message from p1 but no message from p2. Then round r is active. By induction, p1 and q1 are in corresponding states with respect to their (k-1)-neighborhoods just before round r. Hence q1 also sends a message to q in round r, and it corresponds to the message sent by p1 with respect to the (k-1)-neighborhoods of p1 and q1, and therefore with respect to the k-neighborhoods of p and q. Similarly, p2 and q2 are in corresponding states with respect to their (k-1)-neighborhoods just before round r, and therefore q2 does not send a message to q at round r. Also by induction, p and q are in corresponding states with respect to their (k-1)-neighborhoods, and hence with respect to their k-neighborhoods, just before round r. So after they receive the corresponding messages, they remain in corresponding states with respect to their k-neighborhoods.

Finally, suppose p receives a message from p2. We use the same argument as in the previous case to argue that the lemma holds in the two subcases: either p1 sends a message at round r or not.

The lemma tells us that many active rounds are necessary to break symmetry if there are large order-equivalent neighborhoods. We now define particular rings with the special property that they have many order-equivalent neighborhoods of various sizes.

Figure: the bit-reversal assignment for n = 8. Position i holds the integer whose 3-bit binary representation is the reverse of that of i: position 0 holds 0 (000), position 1 holds 4 (100), position 2 holds 2 (010), position 3 holds 6 (110), position 4 holds 1 (001), position 5 holds 5 (101), position 6 holds 3 (011), and position 7 holds 7 (111).

Definition: Let c > 0 be a constant, and let R be a ring of size n. R is said to have c-symmetry if for every l with sqrt(n) <= l <= n, and for every segment S of R of length l, there are at least cn/l segments in R that are order-equivalent to S (counting S itself).

Remark: the square root is a technicality.

Example: For n a power of 2, we shall show that the bit-reversal ring of size n has c-symmetry for a constant c. Specifically, suppose n = 2^a. We assign to each process i the integer in the range [0, n-1] whose a-bit binary representation is the reverse of the a-bit binary representation of i. For instance, for n = 8 we have a = 3, and the assignment is as shown in the figure.

The bit-reversal ring is highly symmetric, as we now claim.

Claim: The bit-reversal ring is c-symmetric (for a suitable constant c).

Proof: Left as an exercise.

We remark that for the bit-reversal ring there is actually no need for the square-root caveat. For other values of n (non-powers of 2), there is a more complicated construction that can yield c-symmetry for some smaller constant c. The construction is complicated, since simply adding a few extra processors could break the symmetry.

Now suppose we have a ring R of size n with c-symmetry. The following lemma states that this implies many active rounds.

Lemma: Suppose that algorithm A elects a leader in a c-symmetric ring, and let k be such that sqrt(n) <= 2k + 1 <= n and cn / (2k + 1) >= 2. Then A has more than k active rounds.

Proof: By contradiction: suppose A elects a leader, say p, in at most k active rounds. Let S be the k-neighborhood of p; S is a segment of length 2k + 1. Since R is c-symmetric, there must be at least one other segment in R which is order-equivalent to S; let q be the midpoint of that segment. Now, by the previous lemma, p and q remain in corresponding states throughout the execution, up to the election point. We conclude that q also gets elected, a contradiction.

Now we can prove the lower bound.

Theorem: Any comparison-based leader election algorithm for rings of size n sends Omega(n log n) messages before termination.

Proof: Define

  K = max { k : sqrt(n) <= 2k + 1 <= n and cn / (2k + 1) >= 2 }.

Note that K = Theta(n) (from the last constraint, 2K + 1 is about cn/2).

By the previous lemma, there are at least K active rounds. Consider the r-th active round, where r <= K. Since the round is active, there is some processor p that sends a message in round r. Let S be the (r-1)-neighborhood of p. Since R is c-symmetric, there are at least cn / (2(r-1) + 1) segments in R that are order-equivalent to S, provided sqrt(n) <= 2(r-1) + 1 <= n. By the correspondence lemma, at the point just before the r-th active round, the midpoints of all these segments are in corresponding states, so they all send messages.

Thus the total number of messages is at least

  sum over r with sqrt(n) <= 2(r-1) + 1 and r <= K of cn / (2(r-1) + 1)
    >= (cn/2) * (ln(2K + 1) - ln(sqrt(n)))    (integral approximation of the sum)

which is about (cn/4) * ln n = Omega(n log n).

Remark: The paper by Attiya, Snir, and Warmuth contains other related results about limitations of computing power in synchronous rings.

Leader Election in a General Network

So far we considered only a very special kind of network, namely rings. We now turn to consider more general networks, in the same synchronous model. We shall start with leader election.

Let us first state the problem in the general framework. Consider an arbitrary, unknown, strongly connected network digraph G = (V, E). We assume that the nodes communicate only over the directed edges. (In the BFS example below, we will remove this restriction and allow two-way communication even when the edges are one-way.) Again, nodes have UIDs, but the set of actual UIDs is unknown. The goal is, as before, to elect a leader.

To solve the problem, we use a generalization of the idea in the LCR algorithm, as follows.

Every node maintains a record of the maximum UID it has seen so far (initially its own). At each round, the node propagates this maximum on all its outgoing edges.

Clearly, the processor with the maximum UID is the only one that will keep its own UID forever as the maximum. But how does the processor with the maximum know that it is the maximum, so that it can do an "elected" output? In LCR it was enough that a processor received its own UID on an incoming link, but that was particular to the ring topology and cannot work in general. One possible idea is to wait for a sufficiently long time, until the process is guaranteed that it would have received any larger value if one existed. It can easily be seen that this time is exactly the maximum distance from any node in the graph to any other, i.e., the diameter. Hence this strategy requires some built-in information about the graph: specifically, every node must have an upper bound on the diameter.

We now specify the above algorithm formally. We describe process i.

state(i) consists of the following components:

own, of type UID, initially i's UID
max-id, of type UID, initially i's UID
status, with values in {unknown, chosen, reported}, initially unknown
rounds, an integer, initially 0

msgs(i): Send message m, a UID, to all outgoing neighbors exactly if m is the current value of max-id. Send "leader" to the dummy node exactly if status = chosen.

trans(i):
Suppose the new message from each incoming neighbor j is m(j).
If status = chosen then status := reported
rounds := rounds + 1
max-id := max of {max-id} together with {m(j) : j in in-nbrs(i)}
If rounds >= diam and max-id = own and status = unknown then status := chosen

Complexity: it is straightforward to observe that the time complexity is about diam rounds until a leader is announced, and that the number of messages is diam * |E|.



(It is possible to design a general algorithm for strongly connected digraphs that doesn't use this diameter information; it is left as an exercise.)
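The flooding algorithm can be sketched as follows (the adjacency-dict encoding and the function name are ours; the returned list plays the role of the "chosen" outputs):

```python
# A sketch of the flooding-max election on a strongly connected digraph with
# a known diameter bound: every node forwards the largest UID seen so far,
# and after diam rounds the node still holding its own UID is chosen.
def flood_max(uids, out_edges, diam):
    n = len(uids)
    maxid = list(uids)                    # the max-id component, per node
    for _ in range(diam):
        incoming = {i: [] for i in range(n)}
        for i in range(n):
            for j in out_edges[i]:        # propagate on all outgoing edges
                incoming[j].append(maxid[i])
        for i in range(n):
            maxid[i] = max([maxid[i]] + incoming[i])
    return [i for i in range(n) if maxid[i] == uids[i]]   # chosen node(s)

# Directed 4-cycle 0 -> 1 -> 2 -> 3 -> 0 (diameter 3); UID 9 at node 2 wins.
print(flood_max([3, 1, 9, 4], {0: [1], 1: [2], 2: [3], 3: [0]}, 3))  # -> [2]
```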

Breadth-First Search

The next problem we consider is doing BFS in a network, with the same assumptions as above. More precisely, suppose there is a distinguished source node i0. A breadth-first spanning tree of a graph is defined to have a parent pointer at each node, such that the parent of any node at distance d > 0 from i0 is a node at distance d - 1 from i0. The distinguished node i0 has a nil pointer (i.e., i0 is a root). The goal in computing a BFS tree is that each node will eventually output the name of its parent in a breadth-first spanning tree of the network, rooted at i0.

The motivation for constructing such a tree comes from the desire to have a convenient communication structure, e.g., for a distributed network. The BFS tree minimizes the maximum communication time to the distinguished node.

The basic idea for the algorithm is the same as for the sequential algorithm:

At any point, there is some set of nodes that is "marked" (initially just i0). Other nodes get marked when they first receive "report" messages. The first round after a node gets marked, it announces its parent to the outside world and sends a "report" to all its outgoing neighbors, except along the edges from which the "report" message was received. Whenever an unmarked node receives a "report", it marks itself and chooses one node from which the "report" has come as its parent.

Due to the synchrony, this algorithm produces a BFS tree, as can be proved by induction on the number of rounds.

Complexity: the time complexity is at most diam rounds (actually, this could be refined a little, to the maximum distance from i0 to any node). The number of messages is |E|: each edge transmits a "report" message exactly once.
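A round-based sketch of the construction, assuming two-way communication as discussed (the marking and parent choice are as above; which parent a node picks among simultaneous reports is arbitrary):

```python
# Synchronous BFS-tree construction: in each round, every newly marked node
# sends "report" to its neighbors; an unmarked recipient marks itself and
# adopts one sender as its parent.  Undirected edges are assumed here.
def distributed_bfs(n, edges, source):
    nbrs = {i: set() for i in range(n)}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    parent = {source: None}          # the root has a nil pointer
    frontier = [source]              # nodes marked in the previous round
    while frontier:
        next_frontier = []
        for u in frontier:           # each sends "report" to its neighbors
            for v in nbrs[u]:
                if v not in parent:  # unmarked: mark and choose a parent
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
print(distributed_bfs(5, edges, 0))  # parent pointers of a BFS tree rooted at 0
```

Whatever parent each node picks, following the pointers upward always takes exactly one step closer to the source.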

Extensions

The BFS algorithm is very basic in distributed computing. It can be augmented to execute other useful tasks; below we list two of them. As mentioned earlier, we shall assume that communication is bidirectional on all links, even though the graph we compute for might have directed edges.

Message broadcast: Suppose that a processor has a message that it wants to communicate to all the processors in the network. One way to do it is to piggyback the message on the "report" messages when establishing a BFS tree. Another possibility is to first produce a BFS tree and then use it to broadcast. Notice that for this, the pointers go opposite to the direction of the links; we can augment the algorithm by letting a node that discovers a parent communicate with it, so that each node knows which are its children.

Broadcast-convergecast: Sometimes it is desirable to collect information from throughout the network. For example, let us consider the problem in which each node has some initial input value, and we wish to find the global sum of the inputs. This is easily and efficiently done as follows. First, establish a BFS tree which includes two-way pointers, so parents know their children. Then, starting from the leaves, "fan in" the results: each leaf sends its value to its parent; each parent waits until it gets the values from all its children, adds them and its own input value, and then sends the sum to its own parent. When the root does its sum, the final answer is obtained. The complexity of this scheme is O(diam) time (we can refine as before); the number of messages is |E| to establish the tree, and then an extra O(n) to establish the child pointers and fan in the results.

We remark that funny things happen to the time when the broadcast-convergecast is run asynchronously. We will revisit this, as well as the other algorithms, later, when we study asynchronous systems.
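The fan-in computation can be sketched directly over the parent pointers (a centralized stand-in for the distributed convergecast; the recursion mirrors each parent waiting for the sums from all its children):

```python
# Convergecast of a sum over an already-built tree, given as parent pointers
# (None at the root), e.g. as produced by the BFS phase.
def convergecast_sum(parent, values):
    children = {i: [] for i in parent}
    root = None
    for node, par in parent.items():
        if par is None:
            root = node
        else:
            children[par].append(node)

    def subtree_sum(node):      # fan in: children's sums plus own input
        return values[node] + sum(subtree_sum(c) for c in children[node])

    return subtree_sum(root)

parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 3}
print(convergecast_sum(parent, {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}))  # -> 15
```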

Distributed Algorithms, September
Lecturer: Nancy Lynch

Lecture

Shortest Paths: Unweighted and Weighted

Today we continue studying synchronous protocols. We begin with a generalization of the BFS problem. Suppose we want to find shortest paths from a distinguished node i0 in V (which we sometimes refer to as the root) to each node in V. Now we require that each node should not only know its parent on a shortest path, but also the distance.

If all edges are of equal weight, then a trivial modification of the simple BFS tree construction can be made to produce the distance information as well. But what if there are weights assigned to the edges?

This problem arises naturally in communication networks, since in many cases we have some associated cost (for instance, monetary cost) to traverse a link.

Specifically, for each edge (i, j), let weight(i, j) denote its weight. Assume that every node knows the weight of all its incident edges (i.e., the weight of an edge appears in special variables at both its endpoint nodes). As in the case of BFS, we are given a distinguished node i0, and the goal is to have at each node the weighted distance from i0. The way to solve it is as follows.

Each node keeps track of the best-dist it knows from i0. Initially, best-dist(i0) = 0 and best-dist(j) = infinity for j != i0. At each round, each node sends its best-dist to all its neighbors. Each recipient node i updates its best-dist by a "relaxation" step, where it takes the minimum of its previous best-dist value and of best-dist(j) + weight(j, i) for each incoming neighbor j.

It can be proven that eventually the best-dist values will converge to the correct best-distance values. The only subtle point is when this occurs. It is no longer sufficient to wait diam time (see, for example, the figure); however, n time units are sufficient. In fact, the way to argue the correctness of the above algorithm, which is a variant of the well-known Bellman-Ford algorithm, is by induction on the number of links in the best paths from i0. This seems to imply that a bound on the number of nodes must be known, but this information can easily be computed.

Figure: the root has a direct edge of weight 100 to a node that is also reachable by a longer path of unit-weight edges; the correct shortest paths from the root stabilize only after many time units, even though all nodes are reachable in one step.
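The relaxation scheme can be sketched as follows (a synchronous simulation; the example graph, in the spirit of the figure, has a heavy direct edge that is improved over several rounds via a chain of light edges):

```python
import math

# Synchronous Bellman-Ford relaxation: every node repeatedly takes the
# minimum of its current estimate and each neighbor's estimate plus the
# edge weight; n - 1 rounds suffice on an n-node graph.
def sync_bellman_ford(n, weighted_edges, root):
    best = [math.inf] * n
    best[root] = 0
    for _ in range(n - 1):
        sent = list(best)                  # everyone sends simultaneously
        for u, v, w in weighted_edges:     # undirected: relax in both ways
            best[v] = min(best[v], sent[u] + w)
            best[u] = min(best[u], sent[v] + w)
    return best

# A heavy shortcut (weight 100) later beaten by a chain of weight-1 edges.
edges = [(0, 3, 100), (0, 1, 1), (1, 2, 1), (2, 3, 1)]
print(sync_bellman_ford(4, edges, 0))  # -> [0, 1, 2, 3]
```

After the first round, node 3's estimate is 100; only in round three does the value 3 arrive along the chain, illustrating why diam rounds are not enough.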

Minimum Spanning Tree

We shall now modify our model slightly and assume that the communication graph is undirected; i.e., if there is an edge between nodes i and j, then i and j can exchange messages in both directions. The edges, as in the previous algorithm, are assumed to have weights, which are known at their endpoints. We continue to assume that nodes have UIDs.

Before we define the problem, let us define a few standard graph-theoretic notions. A spanning tree for a graph G = (V, E) is a graph T = (V, E'), where E' is a subset of E, such that T is connected and acyclic. (If T is acyclic but not necessarily connected, T is called a spanning forest.) The weight of a graph is the sum of the weights of its edges.

The Minimum Spanning Tree (MST) problem is to find a spanning tree of minimal weight. In the distributed setting, the output is a marked set of edges: each node will mark those edges adjacent to it that are in the MST. This problem arises naturally in the context of communication networks, since, similarly to the BFS tree, the MST is a convenient communication structure. The MST minimizes the cost of broadcast, i.e., of a node communicating a message to all the nodes in the network.

All known MST algorithms are based on the same simple theory. Notice that any acyclic set of edges (i.e., a forest) can be augmented to a spanning tree. The following lemma indicates which edges can be in a minimum spanning tree.

Lemma: Let G = (V, E) be an undirected graph, and let {(V_i, E_i) : 1 <= i <= k} be any spanning forest for G, where each (V_i, E_i) is connected. Fix any i, 1 <= i <= k. Let e be an edge of lowest cost in the set

  { e' : e' in E - U_j E_j, and exactly one endpoint of e' is in V_i }.

Then there is a spanning tree for G that includes {e} together with U_j E_j, and this tree is of as low a cost as any spanning tree for G that includes U_j E_j.

Proof: By contradiction. Suppose the claim is false, i.e., that there exists a spanning tree T that contains U_j E_j, does not contain e, and is of strictly lower cost than any spanning tree which contains {e} together with U_j E_j. Now consider the graph T' obtained by adding e to T. Clearly, T' contains a cycle, and this cycle has at least one more edge e' != e that is outgoing from V_i. By the choice of e, we know that weight(e') >= weight(e). Now consider T'' constructed by deleting e' from T'. T'' is a spanning tree for G, it contains {e} together with U_j E_j, and it has weight no greater than that of T, a contradiction.

Figure: the circles represent components. If the components choose minimum-weight outgoing edges (MWOEs) as depicted by the arrows, a cycle would be created.

The lemma above suggests a simple strategy for constructing a spanning tree: start with the trivial spanning forest that contains no edges; then, for some connected component, add the minimum-weight outgoing edge (MWOE). The claim that we have just proven guarantees that we are safe in adding these edges: there is an MST that contains them. We can repeat this procedure until there is only one connected component, which is an MST. This principle forms the basis for well-known sequential MST algorithms. The Prim-Dijkstra algorithm, for instance, starts with one node and successively adds the smallest-weight outgoing edge from the current partial tree, until a complete spanning tree has been obtained. The Kruskal algorithm, as another example, starts with all nodes as "fragments" (i.e., connected components of the spanning forest), and successively extends a fragment with the minimum-weight outgoing edge, thus combining fragments, until there is only one fragment, which is the final tree. More generally, we could use the following basic strategy: start with all nodes as fragments, and successively extend an arbitrary fragment with its MWOE, combining fragments where possible.

It is not readily clear whether we can do this in parallel. The answer, in general, is negative: consider, for example, the graph in the figure above.

However, if all the edges have distinct weights, then there is no problem, as implied by the following lemma.

Lemma If al l edges of a graph G have distinct weights then there is exactly one MST for

G

Pro of Similar to the pro of of Lemma left as an exercise

Figure: arrows represent MWOEs; exactly one edge is chosen twice in each new component.

By this lemma, if we have a forest all of whose edges are in the unique MST, then all MWOEs for all components are also in the unique MST. So we can add them all in at once, in parallel, and there is no danger of creating a cycle. So all we have to worry about is making the edge weights unique; this is easily done using the UIDs. The weight of an edge (i, j) will, for our purposes, be the triplet (weight(i, j), v_i, v_j), where v_i < v_j, and v_i (resp. v_j) denotes the UID of node i (resp. j). The order relation among the edges is determined by the lexicographical order among the triplets.
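The UID tie-breaking can be illustrated directly: in Python, tuples compare lexicographically, so the triplet ordering needs no extra machinery. (The function name below is illustrative, not from the notes.)

```python
# Make edge weights unique by pairing each raw weight with the endpoint
# UIDs: the effective weight of edge (i, j) is (weight, v_i, v_j) with
# v_i < v_j, and triplets compare lexicographically.

def effective_weight(weight, uid_a, uid_b):
    lo, hi = min(uid_a, uid_b), max(uid_a, uid_b)
    return (weight, lo, hi)

# Two edges with equal raw weight 5 are now strictly ordered by UIDs:
e1 = effective_weight(5, 17, 3)    # (5, 3, 17)
e2 = effective_weight(5, 8, 12)    # (5, 8, 12)
```

Since UIDs are distinct, no two edges share an effective weight, so the MST under this ordering is unique.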

Using the uniqueness lemma and the UIDs as described above, we can now sketch an algorithm for distributed construction of an MST. The algorithm builds the components in parallel, in levels, as follows. It is based on the asynchronous algorithm of Gallager, Humblet, and Spira.

1. Start with Level 0 components, consisting of individual nodes and no edges.

2. Suppose inductively that we have Level i components, each with its own spanning tree of edges established, and with a known leader within the component. To get the Level i + 1 components, each Level i component will conduct a search, along its spanning-tree nodes, for the MWOE of the component. Specifically, the leader broadcasts search requests along tree edges. Each node must find its own MWOE; this is done by broadcasting along all non-tree edges, to see whether the other end is in the same component or not. This can be checked by comparing the leader's UID, which is used at all the nodes of a component as the component UID. Then the nodes convergecast the responses toward the leader, while taking minimums along the way.

3. When all components have found their MWOEs, the fragments are merged by adding the MWOEs in; the leader can communicate with the source of the MWOE to tell it to mark the edge. Finally, a new leader must be chosen. This can be done as follows. It is easy to see that, for each new connected component, there is a unique new edge that is the MWOE of both of its endpoint components; this is the edge with the least weight among all the new edges (see the figure for an example). We can let the new leader be the larger-UID endpoint of this edge.
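The level structure can be illustrated by a small centralized simulation: at each level every component computes its MWOE (distinct weights assumed), and all MWOEs are added simultaneously. This sketch tracks only components, not messages, so it shows the merging and the log n bound on levels rather than the communication pattern; all names are ours.

```python
def mst_levels(n, edges):
    """edges: list of (weight, u, v) with distinct weights, connected
    graph on nodes 0..n-1. Returns (tree_edges, number_of_levels)."""
    comp = list(range(n))             # comp[v] = component id ("leader")
    tree, levels = set(), 0
    while len(set(comp)) > 1:
        levels += 1
        mwoe = {}                     # component id -> its MWOE
        for e in edges:
            w, u, v = e
            cu, cv = comp[u], comp[v]
            if cu != cv:              # outgoing edge for both components
                for c in (cu, cv):
                    if c not in mwoe or e < mwoe[c]:
                        mwoe[c] = e
        tree.update(mwoe.values())    # add all MWOEs "in parallel"
        # recompute component ids by relaxing over the chosen tree edges;
        # the largest old leader id becomes the new leader
        for _ in range(n):
            for w, u, v in tree:
                m = max(comp[u], comp[v])
                comp[u] = comp[v] = m
    return tree, levels
```

Since every component merges with at least one other at each level, the number of components at least halves, giving at most ceil(log2 n) levels.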

Note that it is crucial to keep the levels synchronized, so that when a node inquires whether the other endpoint of a candidate edge is in the same component or not, the other endpoint has up-to-date information. If the UID at the other end is different, we want to be sure that the other end is really in a different component, and not just that it hasn't received the component UID yet.

Therefore, we execute the levels synchronously, one at a time. To be sure we have completed a level before going on to the next, we must allow an overestimate of the time, which unfortunately can be big: the time for a level can be Θ(n) in the worst case. (Not O(diam): the paths on which communication takes place are not necessarily minimum in terms of number of links.) The number of levels is bounded by log n, since the number of components is reduced by at least a half at each level.

We conclude that the time complexity is O(n log n). The message complexity is O((n + |E|) log n), since in each level O(n) messages are sent along all the tree edges, and O(|E|) additional messages are required to do the exploration for local MWOEs.

Remark: The number of messages can be reduced to O(n log n + |E|) by being more clever about the local MWOE search strategy. The optimization makes the time increase by at most O(n) time units. The idea is as follows. Each node marks edges when they are known to be in the same component; there is no need to explore them again. Also, the edges are explored in order, one at a time, starting from the lightest. This way, each edge is either marked or discovered to be the local MWOE. For the message complexity analysis, we note that the number of messages sent over tree edges is, as before, O(n log n). Let us apportion the cost of explore messages. Each edge gets explored and rejected at most once in each direction, for a total of O(|E|). There can be repeated exploration of edges that get locally accepted (i.e., are found to be outgoing), but there is at most one such exploration per node in each level, summing up for a total of O(n log n).

Maximal Independent Set

So far we saw a collection of basic synchronous network algorithms, motivated by graph theory and practical communications. We shall now see one more example before going on to study models with failures: that of finding a maximal independent set of the nodes of a graph.

This problem can be motivated by resource allocation problems in a network. The neighbors in the graph represent processes that cannot simultaneously perform some activity involving a shared resource, such as radio broadcast. It is undesirable to have a process blocked if no neighbors are broadcasting; hence the maximality requirement.

The following algorithm, due to Luby, is particularly interesting because it is simple and basic, and because it introduces the use of randomization.

Problem Statement: Let G = (V, E) be an undirected graph. A set I ⊆ V of nodes is independent if for all nodes i, j ∈ I, (i, j) ∉ E. An independent set I is maximal if any set I' which strictly contains I is not independent. Our goal is to compute a maximal independent set of G. The output is done by each member of I eventually sending a special message ("in I").

We assume that a bound B on n, the number of nodes, is known to all the nodes. We do not need to assume the existence of UIDs; i.e., we assume an anonymous network.

The Algorithm: The basic underlying strategy is based on the following iterative scheme. Let G = (V, E) be the initial graph.

    I := ∅
    while V ≠ ∅ do
        choose a set I' ⊆ V that is independent in G
        I := I ∪ I'
        G := induced subgraph of G on V − (I' ∪ nbrs(I'))
    end while

This scheme always produces a maximal independent set: by construction, I is independent, since we throw out the neighbors of any element put into I; maximality is also easy, since the only nodes that are removed from consideration are neighbors of nodes inserted into I.

The key question in implementing this general skeleton algorithm is how to choose I' at each iteration. We employ a new, powerful idea here, namely randomization. Specifically, in each iteration, each node i chooses an integer val(i) in the range {1, . . . , B^4} at random, using the uniform distribution. (Here we require the nodes to know B; the upper bound B^4 is chosen to be sufficiently large so that, with high probability, all nodes choose distinct values.) Having chosen these values, we define I' to consist of all the nodes i such that val(i) > val(j) for all nodes j ∈ nbrs(i). This obviously yields an independent set, since two neighbors cannot simultaneously be greater than each other.
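The iterative scheme with this randomized choice of I' can be sketched as a centralized simulation (one loop iteration per phase). Here B is the assumed bound on n, and the names are ours; the distributed version below exchanges the same values by messages.

```python
import random

def luby_mis(adj, B, rng):
    """adj: dict node -> set of neighbors (undirected graph).
    Each phase, every live node draws val from {1, ..., B**4}; local
    maxima join I, and they and their neighbors leave the graph.
    Returns a maximal independent set."""
    live = {v: set(adj[v]) for v in adj}    # remaining graph
    I = set()
    while live:
        val = {v: rng.randrange(1, B**4 + 1) for v in live}
        winners = {v for v in live
                   if all(val[v] > val[u] for u in live[v])}
        removed = set(winners)
        for v in winners:
            removed |= live[v]              # neighbors of winners leave too
        I |= winners
        live = {v: nbrs - removed for v, nbrs in live.items()
                if v not in removed}
    return I
```

By construction the result is always independent and maximal; only the number of phases is random.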

Formal Modeling: When we come to describe this synchronous algorithm formally, we encounter a problem: it doesn't quite fit the model we have already defined. We need to augment the model with some extra steps for the random choices. The option we adopt here is to introduce a new function, in addition to the message-generation and transition functions, to represent the random choice steps. Formally, we add an int ("internal transition") component to the automaton description, where for each state s, int(s) is a probability distribution over the set of states. Each execution step will now proceed by first using int to pick a new random state, and then applying msgs and trans as usual.

For the MIS algorithm, we first outline the code in words. The algorithm works in phases, where each phase consists of three rounds.

1. In the first round of a phase, the nodes choose their respective val's and send them to their neighbors. By the end of the first round, when all the values have been received, the winners discover that they are in I'.

2. In the second round of a phase, the winners announce that they are in I'. By the end of this round, each node learns whether it has a neighbor in I'.

3. In the third round, each node with a neighbor in I' tells all its neighbors, and then all the involved nodes (the winners, their neighbors, and the neighbors of winners' neighbors) remove the appropriate nodes and edges from the graph. Specifically, this means that the winners and their neighbors discontinue participation after this phase, and the neighbors of neighbors will remove all the edges that are incident on the newly removed neighbors.

Let us now describe the algorithm formally in our model.

State:
    phase:    integer, initially 1
    round:    in {1, 2, 3}, initially 1
    val:      in {1, . . . , B^4} (initial value arbitrary)
    nbrs:     set of vertices, initially the neighbors in the original graph G
    awake:    Boolean, initially true
    winner:   Boolean, initially false
    neighbor: Boolean, initially false

int(i): (the global int is described by independent choices at all the nodes)
    if awake and round = 1 then
        val := random element of {1, . . . , B^4}, chosen from the uniform distribution

msgs(i): if awake = true then:
    if round = 1:
        send val(i) to all nodes in nbrs
    if round = 2:
        if winner = true then
            send ("in I") to a dummy node
            send ("winner") to all nodes in nbrs
    if round = 3:
        if neighbor = true then
            send ("neighbor") to all nodes in nbrs

Below, we use the notation m.val to denote the val component of a message m.

trans(i): if awake = true then:
    case
        round = 1:
            if val > m_j.val for all j ∈ nbrs then winner := true
        round = 2:
            if some ("winner") message arrives then neighbor := true
        round = 3:
            if winner or neighbor then awake := false
            nbrs := nbrs − {j : a ("neighbor") message arrives from j}
    endcase
    round := (round mod 3) + 1

Analysis: A complete analysis can be found in the original paper by Luby. Here we just sketch the proof.

We first state, without proof, a technical lemma which we need.

Lemma: Let G = (V, E), and define, for an arbitrary node i ∈ V,

    sum(i) = Σ_{j ∈ nbrs(i)} 1/d(j),

where d(j) is the degree of j in G. Let I' be defined as in the algorithm above. Then

    Pr[i ∈ nbrs(I')] ≥ (1/4) · min(sum(i), 1).

Informally, the idea in proving this lemma is that, since the random choices made by the different nodes are independent, identically distributed (iid) random variables, the probability of a given node i with d(i) neighbors to be a local maximum is approximately 1/(d(i) + 1). Since we are dealing with overlapping events, the inclusion-exclusion principle is used to obtain the bound in the lemma.

The next lemma is the main lemma in proving an upper bound on the expected number of phases of the algorithm.

Lemma: Let G = (V, E). The expected number of edges removed from G in a single phase of the algorithm is at least |E|/8.

Proof: By the algorithm, all the edges with at least one endpoint in nbrs(I') are removed, and hence the expected number of edges removed in a phase is at least

    (1/2) · Σ_{i ∈ V} d(i) · Pr[i ∈ nbrs(I')].

This is since each vertex i has the given probability of having a neighbor in I', and if this event occurs, then i is certainly removed, and thereby it causes the deletion of d(i) edges. The factor 1/2 is included to take account of possible overcounting, since an edge has two endpoints that may cause its deletion.

We now plug in the bound from the lemma above:

    (1/2) · Σ_{i ∈ V} d(i) · (1/4) · min(sum(i), 1).

Breaking this up according to which term of the min is less, this equals

    (1/8) · ( Σ_{i : sum(i) < 1} d(i) · sum(i)  +  Σ_{i : sum(i) ≥ 1} d(i) ).

Now expand the definition of sum(i), and write d(i) as a sum of 1's:

    (1/8) · ( Σ_{i : sum(i) < 1} Σ_{j ∈ nbrs(i)} d(i)/d(j)  +  Σ_{i : sum(i) ≥ 1} Σ_{j ∈ nbrs(i)} 1 ).

Next, we make the sum over edges: for each edge, we have two endpoints for which the above expression adds in the indicated terms, and so the preceding expression is equal to

    (1/8) · (  Σ_{(i,j) : sum(i) < 1, sum(j) < 1} ( d(i)/d(j) + d(j)/d(i) )
             + Σ_{(i,j) : sum(i) < 1, sum(j) ≥ 1} ( d(i)/d(j) + 1 )
             + Σ_{(i,j) : sum(i) ≥ 1, sum(j) ≥ 1} ( 1 + 1 )  ).

Now, each of the expressions within the summation terms is at least 1, so the total is at least |E|/8.

Theorem The expected number of phases of the MIS algorithm is O log n

Pro of By Lemma of the edges of the current graph are removed in each phase

Therefore the total exp ected running time is O log jE j O log n

Remark Actually a slightly stronger result holds the numb er of phases in the MIS

algorithm can b e proven to b e O log n with high probability

Discussion: Randomization is a technique that is frequently very useful in distributed algorithms. It can be used to break symmetry; e.g., it can be used for the leader election problem (cf. homework problem). However, when designing randomized algorithms, one thing that we must be careful about is that there is some possibility that things will go wrong, and we have to make sure that such anomalies do not cause serious problems. In MIS algorithms, for example, probabilistic anomalies might result in outputting a set which is not independent, or not maximal. Fortunately, Luby's algorithm always produces an MIS. The performance of the algorithm, however, depends crucially on the random choices: it can take more rounds if we are unlucky and bad choices are made. This is generally considered to be acceptable for randomized algorithms. Even so, there is something worse here: it is possible to have equal choices made repeatedly for neighbors, which means that there is a possibility (which occurs with zero probability) that the algorithm never terminates. This is less acceptable.

6.852J/18.437J Distributed Algorithms September

Lecturer: Nancy Lynch

Lecture

Our next topic is the major example we shall study in the synchronous model, namely consensus algorithms. We shall spend some lectures on this subject. The consensus problem is a basic problem for distributed networks, which is of interest in various applications. In studying this problem, we extend our model to include a new source of uncertainty: failures of communication links and processors.

Link Failures

The Coordinated Attack Problem

The problem we will describe now is an abstraction of a common problem arising in distributed database applications. Informally, the problem is that, at the end of processing a given database transaction, all the participants are supposed to agree about whether to commit (i.e., accept the results of) the transaction, or to abort it. We assume that the participants can have different opinions initially as to whether to commit or abort, e.g., due to local problems in their part of the computation.

Let us state the problem more precisely, cast in terms of generals (the abstraction is due to Gray).

    Several (sometimes only two) generals are planning to attack, from different directions, a common objective. It is known that the only way for the attack to succeed is if all the generals attack. If only some of the generals attack, their armies will be destroyed. There may be some reasons why one of the generals cannot attack, e.g., insufficient supplies. The generals are located in different places, and can communicate only via messengers. However, messengers can be caught, and their messages may be lost en route.

Notice that if messages are guaranteed to arrive at their destinations, then all the generals could send messengers to all the others, saying whether they are willing to attack; in our previous terminology, all the nodes will broadcast their inputs. After sufficiently many rounds, all the inputs will propagate to all the participants (in the graph setting, this requires knowledge of an upper bound on the diameter). Having propagated all the inputs, virtually any rule can be used; in our case, the rule is "attack if there is no objection". Since all the participants apply the same rule to the same set of input values, they will all decide identically.
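The propagate-then-decide scheme just described can be sketched as follows: a centralized simulation assuming reliable links, where `rounds` is at least the diameter and the common rule is "attack (decide 1) iff no 0 input was heard". Names are illustrative.

```python
def reliable_consensus(adj, inputs, rounds):
    """adj: dict node -> set of neighbors; inputs: dict node -> 0/1.
    Flood the inputs for `rounds` >= diameter rounds, then apply the
    common rule: decide 1 iff no objection (no 0 input) is known."""
    known = {v: {inputs[v]} for v in adj}     # input values heard so far
    for _ in range(rounds):
        nxt = {v: set(k) for v, k in known.items()}
        for v in adj:
            for u in adj[v]:
                nxt[u] |= known[v]            # each node forwards all it knows
        known = nxt
    return {v: int(0 not in known[v]) for v in adj}
```

Since every node ends up with the full input set, all nodes apply the rule to the same set and decide identically, exactly as argued above.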

In a model with more uncertainty, in which messages may be lost, this simplistic algorithm fails. It turns out that this is not just a problem with this particular algorithm: we shall prove that there is no algorithm that always solves this problem correctly.

Before we go on to prove the impossibility result, we state the problem a bit more carefully. We consider the following model. Consider processes i, for 1 ≤ i ≤ n, arranged in an undirected graph G. We assume that all processes start with initial binary values, encoded in their states. We use the same synchronous model we have been working with so far, except that, during the course of execution, any number of messages might be lost. The goal is to reach consensus, i.e., identical decisions, in a fixed number, say r, of rounds. We represent the decision by a special value encoded in the processes' states at the end of the last round. (Alternatively, we can require that the decision value be output using special messages; the difference is simply one more round.)

We impose two conditions on the decisions made by the processes, as follows.

Agreement: All processes decide on the same value.

Validity: If all processes start with 0, then the decision value is 0; and if all processes start with 1 and all messages are delivered, then the decision value is 1.

The agreement requirement is natural: we would certainly like to be consistent. The validity requirement is one possible formalization of the natural idea that the value decided upon should be "reasonable". For instance, the trivial protocol that always decides 0 is ruled out by the validity requirement. We comment that the validity condition above is quite weak: if even one message is lost, the protocol is allowed to decide on any value. Nevertheless, it turns out that even this weak requirement is impossible to satisfy in any nontrivial graph, i.e., a graph with more than one node. To see this, it suffices to consider the special case of two nodes connected by one edge (cf. homework problem).

We shall need the following standard definition.

Definition: Let α and α' be two executions. We say that α is indistinguishable from α' with respect to a process i, denoted α ~_i α', if in both α and α', i has the same initial state and receives exactly the same messages at the same rounds.

The basic fact, for any distributed protocol, is that if α ~_i α', then i behaves the same in α and α'. This simple observation is the key in the proof of the following theorem.

Figure: pattern of message exchanges between p_1 and p_2 over rounds 1, . . . , r.

Theorem: There is no algorithm that solves the coordinated attack problem on a graph with two nodes connected by an edge.

Proof: By contradiction. Suppose a solution exists, say for r rounds. Without loss of generality, we may assume that the algorithm always sends messages in both directions at every round, since we can always send dummy messages. We shall now construct a series of executions, such that each of them is indistinguishable from its predecessor with respect to one of the processes, and hence all these executions must have the same decision value.

Specifically, let α_1 be the execution that results when both processes 1 and 2 start with value 1, and all messages are delivered. By validity, both must decide 1 at the end of α_1. Now let α_2 be the execution that is the same as α_1, except that the last message, from process 1 to 2, is not delivered. Note that, although process 2 may go to a different state at the last round, α_1 ~_1 α_2, and therefore process 1 decides 1 in α_2; by agreement, process 2 must also decide 1 in α_2. Next, let α_3 be the same as α_2, except that the last message from process 2 to 1 is lost. Since α_2 ~_2 α_3, process 2 decides 1 in α_3, and, by agreement, so does process 1. Continuing in this way, by alternately removing messages, eventually we get down to an execution in which both processes start with 1 and no messages are delivered. As before, both processes are forced to decide 1 in this case.

But now consider the execution in which process 1 starts with 1, process 2 starts with 0, and no messages are delivered. This execution is indistinguishable from the previous one with respect to process 1, and hence process 1 will still decide 1, and so will process 2, by agreement. But this execution is also indistinguishable, with respect to process 2, from the execution in which both processes start with 0 and no messages are delivered, in which validity requires that they both decide 0, a contradiction.

In simple words, this theorem tells us that there is very little that we can do in distributed networks if the communication may be faulty and the correctness requirements are strict. This result is considered to reflect a fundamental limitation of the power of distributed networks.

However, something close to this problem must be solved in real systems. Recall the database commit problem: there we had an even stronger condition for validity, namely, if anyone starts with 0, then the decision value must be 0. How can we cope with the impossibility indicated by the theorem? We must relax our requirements. One possible idea is to use randomization in the algorithm, and allow some probability of violating the agreement and/or validity conditions. Another approach is to allow some chance of nontermination. We will return to nonterminating commit protocols in a lecture or two.

Randomized Consensus

In this section we discuss some properties of randomized protocols for consensus. We shall consider, as above, processes in an arbitrary connected graph. Our model will allow randomized processes, as for the MIS (Maximal Independent Set) problem.

In this section, we weaken the problem requirement, and allow some small probability ε of error. Consider a protocol that works in a fixed number r of rounds. Our main results in this section (due to Varghese and Lynch) are bounds on the liveness probability that an algorithm can guarantee, in terms of ε and r; that is, bounds on the probability that all the participants attack in the case where all start with 1 and all messages are delivered.

We shall give an upper bound on the liveness probability. (Note that here an upper bound is a negative result: no algorithm will decide to attack in this case with too high probability.) We shall also see an algorithm that nearly achieves this probability.

Formal Modeling: We must be careful about the probabilistic statements in this setting, since it is more complicated than for the MIS case. The complication is that the execution that unfolds depends not only on the random choices, but also on the choices of which messages are lost. These are not random; rather, we want to imagine that they are under the control of some adversary, and we want to evaluate the algorithm by taking the worst case over all adversaries.

Formally, an adversary is here an arbitrary choice of:

1. initial values for all processes, and

2. a subset of {(i, j, k) : (i, j) is an edge, and 1 ≤ k ≤ r is a round number}.

The latter represents which messages are delivered: (i, j, k) means that the message crossing (i, j) at round k gets through. Given any particular adversary, the random choices made by the processes induce a probability distribution on the executions. Using this probability space, we can express the probability of events such as agreeing, etc. To emphasize this fact, we shall use the notation Pr_A for the probability function induced by a given adversary A.

Let us now restate the generals problem in this probabilistic setting. We state it in terms of two given parameters, ε and L.

Agreement: For every adversary A,

    Pr_A[some process decides 1 and some process decides 0] ≤ ε.

Validity:

1. If all processes start with 0, then 0 is the only possible decision value.

2. If A is the adversary in which all processes start with 1 and all messages get through, then Pr_A[all decide 1] ≥ L.

We stress that we demand the first validity requirement unconditionally: in the database context, for example, one cannot tolerate any probability of error in this respect.

Our goal is: given ε, a bound on the probability of disagreement, and r, a bound on the time to termination, we would like to design an algorithm with the largest possible L.

Example: A Protocol for Two Processes. We start with a description of a simple protocol for two processes, which will prove insightful in the sequel. The protocol proceeds in r rounds, as follows.

In the first round, process 1 chooses an integer key uniformly at random from the range {1, . . . , r}. The main idea of the algorithm is that this key will serve as a threshold for the number of rounds of communication that must be completed successfully; otherwise, the processors will resort to the default value 0. Specifically, at each round, the processes send messages with the following contents:

1. their initial values,

2. (from process 1 only) the value of key, and

3. a color, green or red.

Intuitively, a message colored red signifies that some message was lost in the past. More precisely, round 1 messages are colored green, and process i's round k message, for k > 1, is colored green provided that i has received all the messages at all previous rounds and they were all colored green; otherwise it is colored red.

We complete the specification of the protocol by the following attack rule:

    A process attacks exactly if (1) it knows that at least one process started with 1, (2) it knows the value of key, and (3) it has received green messages in all the first key rounds.
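A small simulation of this two-process protocol illustrates the attack rule. We assume both initial values are 1 (so condition (1) always holds) and represent the adversary by the set of (sender, round) pairs whose messages arrive; the names are ours, not the notes'.

```python
def run(r, key, delivered):
    """Two-process key protocol. delivered: set of (sender, round)
    pairs that arrive; process 1 knows key, and the key rides on its
    messages. Returns (attack1, attack2)."""
    green_ok = {1: True, 2: True}   # process may still send green messages
    greens = {1: 0, 2: 0}           # length of green prefix received so far
    knows_key = {1: True, 2: False}
    for k in range(1, r + 1):
        # colors of the round-k messages, fixed before round-k receipts
        colors = {1: green_ok[1], 2: green_ok[2]}
        for sender, recv in ((1, 2), (2, 1)):
            if (sender, k) in delivered:
                if sender == 1:
                    knows_key[2] = True
                if colors[sender] and greens[recv] == k - 1:
                    greens[recv] = k        # unbroken green prefix continues
                else:
                    green_ok[recv] = False
            else:
                green_ok[recv] = False      # a lost message breaks the prefix
    # attack rule: know key, and green messages in the first `key` rounds
    return tuple(knows_key[i] and greens[i] >= key for i in (1, 2))
```

Running it over adversaries whose first loss is before, after, or exactly at round key reproduces the case analysis of the claim below.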

Let us make a small digression here, to define formally what it means for a process to know the value of some variable.

Definition: A process i is said to know at time k the value of a variable v if either of the following holds:

1. v is a local variable of i, or

2. by time k, i has received a message from a process that knew v at the time the message was sent.

The definition is simplified, since we consider only variables whose values are set before any message is sent, and never changed afterwards. It is straightforward to extend the definition to the general case.

Another way to view the knowledge of a process is to think of the execution graph, defined as follows. The execution graph is a directed graph with a node (i, k) for each process i and each round k, and a directed edge for each delivered message; i.e., ((i, k − 1), (i', k)) is an edge iff a message sent from i to i' at round k is delivered. With this abstraction, the definition says that a process i knows at time k everything about the nodes from which (i, k) is reachable. (Notice that, in effect, we assume that a node tells all it knows in every message.)
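The execution graph turns "knows" into a reachability computation; here is a small sketch. Node (i, k) stands for process i at time k, and `delivered` lists (sender, receiver, round) triples; these names are ours, not the notes'.

```python
def knowledge(n, r, delivered):
    """Returns a dict mapping each execution-graph node (i, k) to the
    set of nodes from which (i, k) is reachable, i.e., everything
    process i knows about at time k (a node tells all it knows)."""
    knows = {(i, 0): {(i, 0)} for i in range(n)}
    for k in range(1, r + 1):
        # time edge: (i, k-1) -> (i, k)
        for i in range(n):
            knows[(i, k)] = knows[(i, k - 1)] | {(i, k)}
        # message edge: (s, k-1) -> (t, k) for each delivered message
        for s, t, k2 in delivered:
            if k2 == k:
                knows[(t, k)] |= knows[(s, k - 1)]
    return knows
```

For example, with two processes and only process 0's round-1 message delivered, process 1 knows process 0's initial state at time 2, while process 0 never learns process 1's.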

Analysis: Let ε = 1/r and L = 1. Validity is easy: a process decides 1 only if there is some 1 input; and if all messages are delivered and all inputs are 1, then the protocol will decide 1. The more interesting part here is agreement. Fix a particular adversary. If all messages are delivered, then the protocol will have all processes decide on the same value. We now focus on the case that some message is not delivered, and let k denote the first round at which a message is lost.

Claim: The only case in which there is disagreement is where key = k.

Proof: We use the fact that, if either process receives green messages for at least l rounds, then the other does for at least l − 1 rounds.

If k < key, then one process fails to receive a green message at round k, strictly before the key-th round. Consequently, the other fails to do so by round k + 1 ≤ key, and, by the protocol, neither attacks.

If k > key, then both get at least key green messages. Also, since this means k ≥ 2, both know the value of key and both know the same initial values, and thus they decide on the same value.

Corollary: Let A be any adversary, and let r be the number of rounds. Then

    Pr_A[disagreement] ≤ 1/r.

An Upper Bound on the Liveness: An obvious question that arises after the two-process algorithm is: can we do better?

Before giving an answer, we observe that the algorithm above has the unpleasant property that, if just one first-round message gets lost, then we are very likely not to attack. We might be interested in improving the probability of attacking as a function of the number of messages that get through, aiming at a more refined liveness claim. Another direction of improvement is to extend the algorithm to arbitrary graphs, since the round-difference claim above does not always hold in general graphs. The answer to the question above is therefore "yes": we can give a modification for more general graphs that has provably better liveness bounds for partial message loss. On the other hand, however, an almost-matching upper bound for the liveness, which we prove below, shows that we cannot do significantly better.

We start with the definition of a notion that is important in the impossibility proof, and in the extended algorithm we shall describe later. We define the information level of a process at a given time point. This notion will be used instead of the simple count of the number of consecutive messages received.

Definition: Let i ∈ V be a process, and let k be a round number. Fix an adversary A. We define H_A(i, k), the set of heights of i at round k, inductively, as follows:

1. 0 is an element of H_A(i, k).

2. 1 is an element of H_A(i, k) if, at round k, process i knows about some initial value of 1.

3. h + 1 is an element of H_A(i, k) if, at round k, for all processes j ≠ i, process i knows that h ∈ H_A(j, l) for some l.

Finally, define level_A(i, k) = max H_A(i, k).

Intuitively, level_A(i, k) = h means that, at round k, processor i knows that all processors know that all processors know . . . (h − 1 times) . . . that there exists an input value of 1. As it turns out, this level plays a key role in the randomized consensus probability bounds.

We can now prove the upper bound on the liveness.

Theorem: Any r-round protocol for the generals problem, with probability of disagreement at most ε, and probability of attack when no message is lost at least L, satisfies L ≤ ε(r + 1).

We prove the theorem using the following, more general, lemma.

Lemma: Let A be any adversary, and let i ∈ V be any node. Then

    Pr_A[i decides 1] ≤ ε · (level_A(i, r) + 1).

We first show how the lemma implies the theorem.

Proof of Theorem: For the liveness condition, we need to consider the adversary A in which all processors have input 1 and no messages are lost. The probability that all processors decide 1 is at most the probability that any one of them decides 1, which, by the lemma, is at most ε · (level_A(i, r) + 1); and since, by the definition of A, level_A(i, r) = r, we are done.

We now prove the lemma.

Proof of Lemma: First, based on A, we define another adversary A' as follows: A' is the adversary in which the only initial values of 1, and the only messages delivered, are those that i knows about in A. Put in the execution-graph language, A' is obtained from A by modifying the execution as follows. We remove all the edges (i.e., messages) which are not on a path leading to a node associated with process i; we also change the inputs in the initial states from 1 to 0, for all the initial states from which there is no path to any node associated with i. The idea in A' is to change all the inputs that are "hidden" from i in A to 0, while preserving the view from process i's standpoint.

Now we can prove the lemma by induction on the value of level_A(i, r).

Base case: Suppose level_A(i, r) = 0. Then in A, i doesn't know about any initial value of 1, and hence in A' all the initial values are 0. By validity, i must decide 0 in A', and since A' ~_i A, we conclude that i decides 0 in A as well.

Inductive step: Suppose level_A(i, r) = l > 0, and suppose the lemma holds for all levels less than l. Consider the adversary A' defined above. Notice that there must exist some j such that level_{A'}(j, r) ≤ l − 1: otherwise, we would have level_A(i, r) > l, a contradiction. Now, by the inductive hypothesis,

    Pr_{A'}[j decides 1] ≤ ε · (level_{A'}(j, r) + 1),

and hence

    Pr_{A'}[j decides 1] ≤ ε · l.

But, by the agreement bound, we must have

    Pr_{A'}[i decides 1] ≤ Pr_{A'}[j decides 1] + ε,

and thus

    Pr_{A'}[i decides 1] ≤ ε · (l + 1);

and since A' ~_i A, we may conclude that

    Pr_A[i decides 1] ≤ ε · (l + 1).

Extended Algorithm: The proof of the theorem actually suggests a modified algorithm, which works in arbitrary connected graphs. Roughly speaking, the processes will compute their information levels explicitly, and will use this information level in place of the number of consecutive message deliveries in a threshold-based algorithm. As in the basic algorithm, process 1 still needs to choose key at the first round, to be used as a threshold for the level reached. To get any desired probability ε of disagreement, we now let key be a real number chosen uniformly from the interval [0, 1/ε].

We extend the definition of level slightly, to incorporate knowledge of the key, as follows.

Definition: Let i ∈ V be a process, and let k be a round number. Fix an adversary A. We define H'_A(i, k), the set of heights of i at round k, inductively, as follows:

1. 0 is an element of H'_A(i, k).

2. 1 is an element of H'_A(i, k) if, at round k, process i knows about some initial value of 1, and knows key.

3. h + 1 is an element of H'_A(i, k) if, at round k, for all processes j ≠ i, process i knows that h ∈ H'_A(j, l) for some l.

Define the modified level by mlevel_A(i, k) = max H'_A(i, k).

The difference from the earlier definition is in rule 2: we require that the knowledge contain the value of key also. It is not hard to show that mlevel is always within one of level (cf. homework problem). The idea in the protocol is simply to calculate the value of mlevel_A(i, k) explicitly:

    Each process maintains a variable mlevel, which is broadcast in every round. The processes update mlevel according to the definition above. After r rounds, a process decides 1 exactly if it knows the value of key and its calculated mlevel is at least key.

Analysis. For simplicity, we will just carry out the analysis assuming that the graph G is complete. Modifications for other graphs are left as an exercise.

First, consider the validity condition. If all processes start with 0, then mlevel remains 0 at all processors throughout the execution, and so they all decide 0.

Lemma: Assume the graph is complete. If all processes start with 1 and no message is lost, then Pr[attack] >= (r - 1)/r.

Proof: If all processes start with 1 and no message is lost, then by the end of the algorithm, mlevel >= r - 1 everywhere, and the probability of all attacking is at least equal to the probability that key <= r - 1, which is (r - 1)/r, since key is uniformly distributed in (0, r].

We remark that it is possible to get a more refined result for intermediate adversaries: for any given adversary A, define L(A) to be the probability of attack under A, and mlevel_A to be the minimum value of mlevel_A(i, r), i.e., the minimum value of the mlevel at the end of the protocol, taken over all the processes. Then L(A) >= min(mlevel_A, r)/r. The proof is left as a simple exercise.

We now analyze the agreement achieved by the protocol. This does not depend on the graph being complete.

Lemma: Let A be any adversary. Then the probability of disagreement under A is no more than 1/r.

Proof: By the protocol, if mlevel_A >= key, then all processes will attack, and if key > mlevel_A + 1, then all processes will refrain from attacking. This follows from the fact that the final mlevel values are within 1 of each other. Hence, disagreement is possible only when mlevel_A < key <= mlevel_A + 1. The probability of this event is 1/r, since mlevel_A is fixed and key is uniformly distributed in (0, r].
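As a sanity check on this analysis, the level computation and the threshold decision can be simulated directly. The sketch below is an illustrative simplification, not the protocol from the notes: each process tracks only the highest level it knows every other process has attained, process 0 plays the role of the key-choosing process, and the adversary may drop an arbitrary set of messages each round. It exhibits the two facts used above: with no losses every process reaches level r, and under any adversary the final levels are within 1 of each other.

```python
import random

def run(n, r, drops=lambda rnd, i, j: False):
    # drops(rnd, i, j) -> True means the adversary kills the round-`rnd`
    # message from i to j.  Process 0 chooses (and hence knows) key.
    knows_key = [i == 0 for i in range(n)]
    best = [[0] * n for _ in range(n)]   # best[i][j]: highest level i knows j reached

    def level(i):
        # 0 always; level >= 1 requires knowing key; h+1 requires knowing
        # that every other process reached h (rules 1-3 of the definition).
        return 0 if not knows_key[i] else 1 + min(best[i][j] for j in range(n) if j != i)

    for rnd in range(1, r + 1):
        # Synchronous round: snapshot all surviving messages, then deliver.
        msgs = [(i, j, knows_key[i], level(i))
                for i in range(n) for j in range(n)
                if i != j and not drops(rnd, i, j)]
        for i, j, sender_knows_key, lvl in msgs:
            knows_key[j] = knows_key[j] or sender_knows_key
            best[j][i] = max(best[j][i], lvl)
    return [level(i) for i in range(n)]
```

With no drops, every process ends at level r; the decision rule then attacks for any key in (0, r], and the only risky executions are those where key falls inside the length-1 window just above the minimum final level.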

Faulty Processors

In this section, we continue our treatment of the consensus problem, but now we consider another situation, where processors fail rather than links. We shall investigate two cases: stopping failures, where processors may experience "sudden death", and Byzantine failures, where a faulty process may exhibit absolutely unconstrained behavior. The former aims at modelling unpredictable crashes, and the latter may model, say, buggy programs.

In both cases, we assume that the number of failures is bounded in advance. Even though this assumption is realistic, in the sense that it may be highly unlikely for a larger number of failures to occur, there is a serious drawback to these models: if the number of failures is already quite high, then it is likely in practice that we will have more failures. More precisely, assuming a bound on the number of failures implicitly implies that the failures are negatively correlated. It is arguable that failures are independent, or even positively correlated.

Stop Failures

We first consider the simple case where processors fail by stopping only. That is, at some point during the algorithm, a processor may stop taking steps altogether. In particular, a processor can stop in the middle of the message-sending step at some round, i.e., at the round in which the processor stops, only a subset of the messages it should have sent may actually be sent. We assume that the links are perfectly reliable: all the messages that are sent are delivered. For simplicity, we assume that the communication graph is complete. As before, we assume that the n processors start with initial values.

Now the problem is as follows. Suppose that at most f processors fail. We want all the nonfaulty processors to eventually decide, subject to:

Agreement: all values that are decided upon agree.

Validity: if all processors have initial value v, then the only possible decision value is v.

We shall now describe a simple algorithm that solves the problem. In the algorithm, each processor maintains a labeled tree that represents the complete knowledge of the processor, as follows. The tree has f + 2 levels, ranging from 0 (the root) to f + 1 (the leaves). Each node at level k, 0 <= k <= f, has n - k children. The nodes are labeled by strings as follows. The root is labeled by the empty string, and each node with label i_1 ... i_k has n - k children with labels i_1 ... i_k j, where j ranges over all the elements of {1, ..., n} - {i_1, ..., i_k}. See the figure below for an illustration.

The processes fill in the tree in the course of the computation, where the meaning of a value v at a node labeled i_1 ... i_k is, informally, "i_k knows that i_{k-1} knows that ... knows that the value of i_1 is v". The algorithm now simply involves filling in all the entries in the tree and, after f + 1 rounds, deciding according to some rule we shall describe later.

More precisely, each process i fills in its own initial value v_i for the root value in the start state. Then, in round 1, process i broadcasts this value to all the other processes (for simplicity of presentation, we also pretend that process i sends the value to itself, though in our model this is simulated by local computation). At each round k, 2 <= k <= f + 1, process i broadcasts all the values from all the level k - 1 nodes in its tree whose labels do not contain i, to all the processes. The recipients record this information in the appropriate node at level k of their tree: the value sent by process j, associated with its node i_1 i_2 ... i_{k-1}, is recorded at the node i_1 i_2 ... i_{k-1} j; this is value(i_1 i_2 ... i_{k-1} j). In this way, the trees are filled up layer by layer.

Figure: the tree used for the stop-failures algorithm. The root (level 0) has children 1, 2, ..., n at level 1; node 1 has children 12, ..., 1n, node 2 has children 21, 23, ..., 2n, and so on, down to the leaves (e.g., 123, ..., 12n) at level f + 1.

Thus, in general, a value v associated with a node labeled i_1 i_2 ... i_k at a process j means that "i_k told j that i_{k-1} told i_k that ... that i_1 told i_2 that v was i_1's initial value".

Lastly, we specify a decision rule. Each process that has not yet failed defines W to be the set of values that appear anywhere in its tree. If W is a singleton set, then choose the unique element of W; otherwise, choose a prespecified default value.
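To make the tree construction concrete, here is a small simulation of the algorithm. This is an illustrative sketch under assumed conventions, not code from the notes: trees are dictionaries mapping label tuples to values (the root is the empty tuple), and the crash model lets one designated process deliver its final round to only a chosen subset of recipients.

```python
DEFAULT = 0

def eig_stop(n, f, inputs, crasher=None, crash_round=None, recipients=()):
    trees = {i: {(): inputs[i]} for i in range(n)}   # label tuple -> value
    alive = set(range(n))
    for k in range(1, f + 2):                        # rounds 1 .. f+1
        sends = []
        for i in sorted(alive):
            # level k-1 nodes whose labels do not contain i
            labels = [lab for lab in trees[i] if len(lab) == k - 1 and i not in lab]
            targets = list(range(n))                 # includes i itself (self-delivery)
            if i == crasher and k == crash_round:    # partial send, then crash
                targets = list(recipients)
                alive.discard(i)
            sends += [(j, lab + (i,), trees[i][lab]) for j in targets for lab in labels]
        for j, lab, v in sends:                      # record at level-k nodes
            trees[j].setdefault(lab, v)
    decisions = {}
    for i in alive:                                  # decision rule on W
        W = set(trees[i].values())
        decisions[i] = W.pop() if len(W) == 1 else DEFAULT
    return decisions

# A crash while sending in round 1 leaks value 0 to process 1 only; the
# extra round f+1 lets process 1 relay it, so all survivors still agree.
print(eig_stop(4, 1, [0, 1, 1, 1], crasher=0, crash_round=1, recipients=[1]))
# -> {1: 0, 2: 0, 3: 0}
```

The example shows exactly the scenario the agreement proof worries about: a value known to only one survivor after a partial send still reaches everyone within the remaining rounds.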

We shall now argue that the above algorithm satisfies the requirements.

Claim (Validity): If all the initial values are v, then the only possible decision value is v.

Proof: Trivial: if all start with v, then the only possible value at any node in anyone's tree is v, and hence, by the decision rule, only v can be decided upon.

Claim (Agreement): All the values decided upon are identical.

Proof: Let W_i be the set of values in the tree of processor i. By the decision rule, it suffices to show that W_i = W_j for any two processors i and j that decide, i.e., that complete the algorithm. We prove that if v is in W_i, then v is in W_j. Suppose v is in W_i. Then v = value_i(p) for some label p that does not contain i (convince yourself of this). If |p| <= f, then |pi| <= f + 1, so process i relays value v to process j at round |pi|, and v = value_j(pi); thus v is in W_j.

On the other hand, if |p| = f + 1, then there is some nonfaulty process l appearing in the label p (since p contains f + 1 distinct indices and at most f processes are faulty), so that p = qlr, where q and r are strings. Then, at round |ql|, processor l succeeds in sending to j, and therefore v is relayed to j (since that is the value relayed further by the processes in r). Thus v = value_j(ql), and hence v is in W_j.

By the same argument, if v is in W_j then v is in W_i, and we conclude that W_i = W_j.

One can easily see that the number of bits communicated in the execution of the protocol is prohibitively large — exponential in f, on the order of n^{f+1} — and deem this protocol impractical. The important thing we learned from the above algorithm, however, is that the consensus problem is solvable for this simple kind of processor failure. This, of course, should be contrasted with the theorem which states that the problem is unsolvable in the case of communication failures. In the next lecture, we shall see how the number of communicated bits can be reduced dramatically.

6.852J/18.437J Distributed Algorithms — September

Lecturer: Nancy Lynch

Lecture

Byzantine Failures

Last time, we considered the consensus problem for stopping failures. Now suppose that we allow a worse kind of processor failure (cf. Lamport et al., Dolev-Strong, Bar-Noy et al.): the processors might not only fail by stopping, but could exhibit arbitrary behavior. Such failures are known as Byzantine failures, because of the bad reputation of the Byzantine Generals. It must be understood that in this model, a failed process can send arbitrary messages and do arbitrary state transitions. The only limitation on the behavior of a failed process is that it can only mess up the things that it controls locally, namely its own outgoing messages and its own state.

In order for the consensus problem to make sense in this setting, its statement must be modified slightly. As before, we want all the nonfaulty processes to decide. But now the agreement and validity conditions are as follows.

Agreement: All values that are decided upon by nonfaulty processes agree.

Validity: If all nonfaulty processes begin with v, then the only possible decision value by a nonfaulty process is v.

It is intuitively clear that now the situation is somewhat harder than for stopping faults. In fact, we shall show that there is a bound on the number of faults that can be tolerated. Specifically, we will see that n >= 3f + 1 processes are needed to tolerate f faults. To gain some intuition as to the nature of the problem, let us consider the following example.

What might go wrong with too few processes? Consider three processes i, j, k, and suppose they are to solve consensus tolerating one Byzantine fault. Let us see why it is impossible for them to decide in two rounds (see the figure below).

Scenario 1: Processes i and k are nonfaulty and start with 1. Process j is faulty. In the first round, processes i and k report their values truthfully, and process j tells both i and k that its value is 0. Consider the viewpoint of process i. In the second round, process k tells i truthfully that j said 0, and process j tells i falsely that k said 0. In this situation, the problem constraints require that i and k decide 1.

Figure: possible configurations for three processors. The shaded circles represent faulty processes. Configurations (a), (b), and (c) depict Scenarios 1, 2, and 3, respectively.

Scenario 2: Dual to Scenario 1. Processes j and k are nonfaulty and start with 0. Process i is faulty. In the first round, processes j and k report their values truthfully, and process i tells both others that its value is 1. Consider the view of process j. In the second round, process k tells j that i said 1, and i tells j falsely that k said 1. In this situation, the problem constraints require that j and k decide 0.

Scenario 3: Now suppose that processes i and j are nonfaulty and start with 1 and 0, respectively. Process k is faulty, but sends the same messages to i that it does in Scenario 1, and sends the same messages to j that it does in Scenario 2. In this case, i will send the same messages to j as it does in Scenario 1, and j will send the same messages to i as it does in Scenario 2. Then i and j must follow the same behavior they exhibit in Scenarios 1 and 2, namely, decide 1 and 0, respectively, thereby violating agreement.

Note that i can tell that someone is faulty in Scenario 1, since k tells i that its value is 1, but j tells i that k said 0. But process i doesn't know which of j and k is faulty. We remark that, although the protocol could have more rounds, the same kind of argument will carry over. We shall see later a rigorous proof of the necessity of n >= 3f + 1.

An Algorithm. We shall now describe an algorithm, assuming n >= 3f + 1 processes. Recall the "full information" protocol described in the last lecture for stopping faults. The basic idea in the Byzantine case is based on the same tree and propagation strategy as for the stopping faults case, but now we shall use a different decision rule. It is no longer the case that we want to accept any value for insertion into the set of "good" values just because it appears somewhere in the tree. Now we must beware of liars. We use the following easy fact.

Lemma: If i, j, and k are all nonfaulty, then value_i(p) = value_j(p) for every label p ending in k.

Proof: Follows from the simple fact that, since k is nonfaulty, it sends the same thing to everyone.

Now we can specify a decision rule.

If a value is missing at a node, it is set to some prespecified default. To determine the decision for a given tree, we work from the leaves up, computing a new value, newvalue, for each node, as follows. For the leaves, newvalue_i(p) = value_i(p). For each internal node, newvalue is defined to be the majority value of the newvalues of the children; if no majority value exists, we set newvalue to the default. The decision is newvalue_i(root).
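The bottom-up newvalue computation can be written directly as a recursion over labels. This is a sketch with assumed conventions: labels are tuples of process indices, a tree maps labels to reported values, and a missing entry falls back to the default.

```python
from collections import Counter

def newvalue(tree, lab, n, f, default=0):
    # Leaves sit at depth f+1; their newvalue is the reported value itself.
    if len(lab) == f + 1:
        return tree.get(lab, default)
    children = [lab + (j,) for j in range(n) if j not in lab]
    vals = [newvalue(tree, c, n, f, default) for c in children]
    top, cnt = Counter(vals).most_common(1)[0]
    # Strict majority of the children's newvalues, else the default.
    return top if 2 * cnt > len(vals) else default

def decide(tree, n, f, default=0):
    return newvalue(tree, (), n, f, default)
```

For instance, with n = 4 and f = 1, a tree in which every node reports 1 except those whose labels end in a single lying index decides 1: the lying entries are outvoted at every internal node, exactly as in the honest-majority argument below.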

We now start proving the correctness of the algorithm. First, we prove a general lemma.

Lemma: Suppose that p is a label ending with the index of a nonfaulty process. Then there is a value v such that value_i(p) = newvalue_i(p) = v for all nonfaulty processes i.

Proof: By induction on the tree labels, working from the leaves on up.

Base case: Suppose p is a leaf. Then, by the previous lemma, all nonfaulty processes i have the same value_i(p), which is the desired value v, by the definition of newvalue for leaves.

Inductive step: Suppose |p| = k, where 0 < k <= f. Then the previous lemma implies that all nonfaulty processes i have the same value_i(p); call this value v. Therefore, all the nonfaulty processes l send the same value v for pl to all processes. So, for all nonfaulty i and l, we have value_i(pl) = v. By the inductive hypothesis, we have that newvalue_i(pl) = v for all nonfaulty processes i and l.

We now claim that a majority of the child labels of p end in nonfaulty process indices. This is true because the number of children is n - k >= n - f >= 2f + 1, and since at most f may be faulty, there always is an "honest majority". (Note: this is where we use the bound on the number of processes.)

The induction is complete once we observe that, since v = newvalue_i(pl) for any nonfaulty i and a majority of the children pl, we have newvalue_i(p) = v, by the algorithm.

We now argue validity.

Corollary: Suppose all nonfaulty processes begin with the same value v. Then all nonfaulty processes decide v.

Proof: If all nonfaulty processes begin with v, then all nonfaulty processes send v at the first round, and therefore value_i(j) = v for all nonfaulty processes i and j. The lemma above implies that newvalue_i(j) = v for all nonfaulty i and j, and therefore newvalue_i(root) = v for all nonfaulty i.

Before we show agreement, we need a couple of additional definitions.

Definition: Consider an execution of the algorithm. A tree node p is said to be common in the given execution if, at the end of the execution, all the nonfaulty processes have the same newvalue(p).

Definition: Consider a rooted tree T. A subset C of the tree nodes is a path covering of T if every path from a leaf to the root contains at least one node from C. A path covering is common if it consists entirely of common nodes.

Lemma: There exists a common path covering of the tree.

Proof: Consider any path from a leaf to the root. Each of the f + 1 non-root nodes on the path has a label ending with a distinct index, and therefore, since at most f processes are faulty, there exists a node on the path, say p, with a label that ends in a nonfaulty index. Then the lemma above implies that p is common.

Lemma: Let p be a node. If there is a common path covering of the subtree rooted at p, then p is common.

Proof: By induction on p, starting at the leaves and working up.

Base case: If p is a leaf, then any path covering of the (one-node) subtree rooted at p must contain p itself, so p is common.

Inductive step: Suppose p is at level k, 0 <= k <= f. Assume that there is a common path covering C of p's subtree. If p itself is in C, then p is common and we are done; so suppose p is not in C. Consider any child pl of p. C induces a common path covering for the subtree rooted at pl. Hence, by the inductive hypothesis, pl is common. Since pl was an arbitrary child of p, all the children of p are common, which in turn implies that p is common, by the definition of newvalue.

We can now conclude that the algorithm satisfies the agreement requirement.

Corollary: The root is common.

Proof: By the two preceding lemmas.

Authentication. The original paper of Lamport, Pease, and Shostak, and a later paper of Dolev and Strong, present algorithms that work in the presence of Byzantine failures, but where the nodes have the extra power of authentication, based on digital signatures. Informally, this means that no one can claim that another process said something unless in fact it did. There is no nice formal characterization given for this model. The interesting fact is that the algorithm is much like the stopping fault algorithm, rather than the full Byzantine algorithm above. In particular, it works for any number of processes, even when n <= 3f.

Reducing the Communication Cost

A major drawback of the algorithms presented above is that they require an excessive amount of communication: specifically, the number of messages communicated in the tree algorithms is exponential in f. (Here we are counting the separate values being sent as separate messages, though in terms of the formal model, many would be packaged into a single message; the bound would be better expressed in terms of the number of bits of communication.) In this section, we discuss some methods to cut down this cost.

Stopping Faults

We start by reducing the number of messages communicated in the stopping faults algorithm to polynomial in f. Let us examine the algorithm described in the previous lecture more closely. In that algorithm, we have everyone broadcasting everything to everyone. However, in the end, each process only looks to see whether the number of distinct values it has received is exactly 1. Consider a process which has already relayed two distinct values to everyone. We claim that this process doesn't need to relay anything else, since every other process has already received at least two values. Thus, the algorithm boils down to the following rule:

Use the regular algorithm, but each process prunes the messages it relays, so that it sends out only the first two messages containing distinct values.

The complexity analysis of the above algorithm is easy: the number of rounds is the same as before, f + 1, but the number of messages is only O(n^2), since, in total, each process sends at most two messages to each other process.

What is a bit more involved is proving the correctness of the above algorithm. In what follows, we only sketch a proof for the correctness of the optimized algorithm. The main interest lies in the proof technique we employ below.

Let O denote this optimized algorithm, and U the unoptimized version. We prove correctness of O by relating it to U. First, we need the following property.

Lemma: In either algorithm O or algorithm U, after any number k of rounds, for any process index i and for any value v, the following is true: if v appears in i's tree, then it appears in the tree at some node whose label does not contain the index i.

This lemma is true because there are only stopping failures, so i can only relay a value (and others can only claim that i relayed it) if i in fact had it.

Define values_O(i, k), 0 <= k <= f + 1, to be the set of values that appear in levels 0, ..., k of process i's tree in O. We define values_U(i, k) analogously. The following lemma is immediate from the specification of algorithm U.

Lemma: In algorithm U, suppose that i sends a round k message to j, and j receives and processes it. Then values_U(i, k - 1) is a subset of values_U(j, k).

The key "pruning" property of O is captured by the following lemma.

Lemma: In algorithm O, suppose that i sends a round k message to j, and j receives and processes it.

1. If |values_O(i, k - 1)| <= 2, then values_O(i, k - 1) is a subset of values_O(j, k).

2. If |values_O(i, k - 1)| >= 2, then |values_O(j, k)| >= 2.

(Note that the first lemma above is required to prove this one.)

Now, to see that O is correct, imagine running the two algorithms side by side: O, the new optimized algorithm, and U, the original unoptimized version, with the same initial values, and with failures occurring at the same processes at exactly the same times. (If a process fails in the middle of sending in one algorithm, it should fail at the same point in the other algorithm.) We state invariants relating the states of the two algorithms.

Lemma: After any number of rounds k, 0 <= k <= f + 1: values_O(i, k) is a subset of values_U(i, k).

Lemma: After any number of rounds k, 0 <= k <= f + 1:

1. If |values_U(i, k)| <= 1, then values_U(i, k) is a subset of values_O(i, k).

2. If |values_U(i, k)| >= 2, then |values_O(i, k)| >= 2.

Proof: By induction. The base case (k = 0) is immediate. Assume now that the lemma holds for k; we show it holds for k + 1.

1. Suppose |values_U(i, k + 1)| <= 1. It follows that |values_U(i, k)| <= 1, so by the inductive hypothesis we have that values_U(i, k) is a subset of values_O(i, k). It suffices to show that values_U(i, k + 1) is a subset of values_O(i, k + 1).

Let v be in values_U(i, k + 1). If v is in values_U(i, k), then v is in values_O(i, k), which is a subset of values_O(i, k + 1). So now assume that v is not in values_U(i, k), and suppose v arrives at i in a round k + 1 message from some process j, where v is in values_U(j, k). Since |values_U(i, k + 1)| <= 1, the flow lemma for U implies that |values_U(j, k)| <= 1, and hence values_U(j, k) = {v}. By the inductive hypothesis (part 1), v is in values_O(j, k); by the containment lemma, values_O(j, k) is a subset of values_U(j, k), so |values_O(j, k)| <= 1 <= 2. By the pruning lemma (part 1), values_O(j, k) is a subset of values_O(i, k + 1), and hence v is in values_O(i, k + 1), as needed.

2. Suppose |values_U(i, k + 1)| >= 2. If |values_U(i, k)| >= 2, then by the inductive hypothesis we have |values_O(i, k)| >= 2, which implies the result. So assume that |values_U(i, k)| <= 1. By the inductive hypothesis, we have that values_U(i, k) is a subset of values_O(i, k). We consider two subcases.

(a) For all j from which i receives a round k + 1 message, |values_U(j, k)| <= 1.

In this case, for all such j, we have by the inductive hypothesis (part 1) that values_U(j, k) is a subset of values_O(j, k), and by the containment lemma, |values_O(j, k)| <= 2. The pruning lemma then implies that, for all such j, values_U(j, k) is a subset of values_O(j, k), which is a subset of values_O(i, k + 1). It follows that values_U(i, k + 1) is a subset of values_O(i, k + 1) (every value in values_U(i, k + 1) is either in values_U(i, k) or arrives from some such j), which is sufficient to prove the inductive step.

(b) There exists j from which i receives a round k + 1 message such that |values_U(j, k)| >= 2.

By the inductive hypothesis (part 2), |values_O(j, k)| >= 2. By the pruning lemma (part 2), |values_O(i, k + 1)| >= 2, as needed.

The truth of the above lemma for k = f + 1, and the fact that W_i = W_j for nonfaulty i and j in U, together imply that i and j decide in the same way in algorithm O. (This follows from considering the two cases, in which W_i and W_j in U are singletons or have more than one value; note that W_i is nonempty, since the root always gets a value.)

Note: The proof technique of running two algorithms side by side, and showing that they have corresponding executions, is very important. In particular, in cases where one algorithm is an optimization of another, it is often much easier to show the correspondence than it is to show directly that the optimized algorithm is correct. This is a special case of the simulation proof method (cf. the Models section in the bibliographical list).
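The side-by-side technique can be illustrated by a toy simulation. This is a sketch, not the notes' formal model: since the decision depends only on the set of values a process has seen, each process is reduced to its ordered list of learned values, and a crash lets a process deliver its last round to only a chosen subset of recipients. Running the pruned and unpruned versions on the same failure pattern and comparing decisions mirrors the correspondence argument above.

```python
def consensus(n, f, inputs, crashes=(), prune=False, default=0):
    # crashes: dict {proc: (round, recipients)} -- partial send, then stop
    crashes = dict(crashes)
    learned = {i: [inputs[i]] for i in range(n)}  # values, in arrival order
    sent = {i: set() for i in range(n)}
    alive = set(range(n))
    for rnd in range(1, f + 2):                   # rounds 1 .. f+1
        out = []
        for i in sorted(alive):
            pool = learned[i][:2] if prune else learned[i]  # the pruning rule
            fresh = [v for v in pool if v not in sent[i]]   # relay each value once
            sent[i].update(fresh)
            targets = range(n)
            if i in crashes and crashes[i][0] == rnd:
                targets = crashes[i][1]
                alive.discard(i)
            out += [(j, v) for j in targets for v in fresh]
        for j, v in out:                          # synchronous delivery
            if v not in learned[j]:
                learned[j].append(v)
    return {i: learned[i][0] if len(learned[i]) == 1 else default
            for i in alive}
```

With a three-value input and two carefully timed crashes, the pruned run actually suppresses a relay that the unpruned run performs, yet (as the invariants predict) every survivor still ends up with at least two values, so the decisions coincide.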

The Byzantine Case

In the Byzantine model, it does not seem so easy to obtain an optimized ("pruned") algorithm with a polynomial number of messages. We only summarize results in this section.

Polynomial communication and f + k rounds, for a constant k, assuming n > 3f processes, was achieved by a fairly complex algorithm by Dolev, Fischer, Fowler, Lynch, and Strong [DolevFFLS].

Essentially this algorithm was expressed in a more structured fashion by Srikanth and Toueg [SrikanthT]. Their structure involved substituting an authentication protocol for the assumed signature scheme in an efficient authenticated BA algorithm (the one used by Dolev and Strong). Using this authentication protocol requires twice the number of rounds as does the Dolev-Strong protocol, and also needs n > 3f processes. The basic idea is that, whenever a message was sent in the Dolev-Strong algorithm, Srikanth and Toueg run a protocol to "send" and "accept" the message.

An improvement on the f + k rounds required by the above algorithms was achieved by Coan [Coan], who presented a family of algorithms requiring f + epsilon*f rounds, where epsilon > 0. The message complexity is polynomial, but as epsilon approaches zero, the degree of the polynomial increases. An important consequence of this result is that no lower bound substantially larger than f rounds can be proved for polynomial algorithms, although no fixed-degree polynomial algorithm is actually given for f + 1 rounds. A paper by Bar-Noy, Dolev, Dwork, and Strong [BarNoyDDS] presents similar ideas in a different way. We remark that they introduced the tree idea used above to present the consensus algorithms.

Moses and Waarts achieve f + 1 rounds with polynomial communication [MosesW]. The protocol looks pretty complicated, with many involved strategies for pruning the tree. Also, it does not work for n > 3f processors, as does the exponential communication protocol, but rather requires a larger number of processors.

Berman and Garay use a different approach to solve the same problem as [MosesW]. They achieve n > 4f processors and f + 1 rounds. Their algorithm is still fairly complicated.

The Turpin-Coan Algorithm

It is hard to cut down the amount of communication for general Byzantine agreement. We close this topic with an interesting trick that helps somewhat, though it does not reduce the communication from exponential to polynomial. The problem addressed here is how to reduce BA for a multivalued domain to binary-valued BA. In other words, the algorithm we present uses binary BA as a subroutine. If the domain is large, the savings can be substantial.

In the Turpin-Coan algorithm, we assume that n > 3f, and that a default value v_0 is known initially by all nonfaulty processes. For each process, a local variable x is initialized to the input value for that process.

Below, we describe the Turpin-Coan algorithm. This time, the style is changed a little bit: namely, the code for each round is written explicitly. Implicitly, we assume that the code will be executed for rounds 1, 2, ... in sequence. (Of course, in the underlying state machine model, this convention is expanded into manipulation of an explicit round variable, which gets read to determine which code to perform, and gets incremented at every round.)

Round 1:

1. Broadcast x to all other processes.

2. In the set of messages received, if there are at least n - f for a particular value v, then x := v; otherwise, x := nil.

Round 2:

1. Broadcast x to all other processes.

2. Let v_max be the value other than nil that occurs most often among those values received, with ties broken in some consistent way, and let num be the number of occurrences of v_max.

3. If num >= n - f then vote := 1, else vote := 0.

After round 2, run the binary Byzantine agreement subroutine, using vote as the input value. If the bit decided upon is 1, then decide v_max; otherwise, decide the default value v_0.
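In the fault-free case — which is all a short sketch can show honestly — the two rounds reduce to counting: every process sees the same multiset of messages, so the protocol can be written with a single shared tally per round. The binary subroutine is left as a parameter; the `majority` stand-in below is NOT a Byzantine agreement algorithm, just a placeholder for this sketch.

```python
from collections import Counter

def turpin_coan(inputs, f, binary_ba, default="v0"):
    n = len(inputs)
    # Round 1: broadcast x; adopt v iff at least n - f copies arrive, else nil.
    # (At most one value can reach n - f copies, since n - f > n/2.)
    tally = Counter(inputs)
    x = [next((v for v, c in tally.items() if c >= n - f), None)] * n
    # Round 2: broadcast x; v_max = most frequent non-nil value, num its count.
    tally = Counter(v for v in x if v is not None)
    v_max, num = tally.most_common(1)[0] if tally else (default, 0)
    votes = [1 if num >= n - f else 0 for _ in range(n)]
    # Binary BA subroutine decides one bit from the votes.
    return v_max if binary_ba(votes) == 1 else default

def majority(bits):  # placeholder subroutine, fault-free only
    return int(2 * sum(bits) > len(bits))
```

For example, `turpin_coan(["a"] * 4, 1, majority)` returns `"a"` (validity), while a split input such as `["a", "b", "c", "a"]` produces all-nil round 2 messages and falls back to the default.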

Claim: At most one value v != nil is sent in round 2 messages by correct processes.

Proof: Any process p sending a value v != nil in round 2 must have received at least n - f round 1 messages containing v. Since there are at most f faulty processes, this means that all other processes received at least n - 2f copies of v. Since the total number of messages received is n, no process could have received n - f messages containing a value v' != v in round 1, because (n - 2f) + (n - f) = 2n - 3f > n when n > 3f. Therefore, the claim holds.

Theorem: The Turpin-Coan algorithm solves multivalued Byzantine agreement when given a Boolean Byzantine agreement algorithm as a subroutine.

Proof: It is easy to see that this algorithm terminates.

To show validity, we must prove that if all nonfaulty processes start with a value w, then all nonfaulty processes must decide w. After the first round, all nonfaulty processes will have set x := w, because at least n - f processes broadcast it reliably. Therefore, in the second round, each nonfaulty process will have v_max = w and num >= n - f, and will therefore obtain vote = 1. The binary agreement subroutine is therefore required (by its own validity condition) to choose 1, and each nonfaulty process will choose w.

In showing agreement, we consider two cases. If the subroutine decides on vote = 0, then the default value is chosen by all nonfaulty processes, so agreement holds. If the subroutine decides on vote = 1, then we must argue that the local variable v_max is the same at each nonfaulty process. Note that, for the subroutine to agree on vote = 1, some nonfaulty process p must have started the subroutine with vote = 1. Therefore, process p must have received at least n - f round 2 messages containing some particular value v. Since there are at most f faulty processes, each other process must have received at least n - 2f round 2 messages with non-nil values from nonfaulty processes. By the Claim, all of the non-nil messages from nonfaulty processes must have the same value, namely v. Therefore, v must be the value occurring most often, since there are at most f faulty processes and n - 2f > f.

The proof method uses a bound on the number of faulty processes to obtain claims about similarity between the views of different processes in a fault-prone system. This is a useful technique.

6.852J/18.437J Distributed Algorithms — September

Lecturer: Nancy Lynch

Lecture

We have seen algorithms for distributed agreement, even in the case of Byzantine faults. The algorithm we saw used n >= 3f + 1 processes and f + 1 synchronous rounds of communication. In this lecture, we will consider whether these bounds are necessary.

Number of Processes for Byzantine Agreement

It turns out that it is impossible to beat the 3f + 1 bound on the number of processes; i.e., if n <= 3f, then there is no protocol that ensures agreement. The proof is neat. We first prove that three processes cannot tolerate one fault, as suggested in an earlier example (last lecture). We then show why it is impossible to tolerate f >= n/3 faults for any n. The first proof of this appeared in [PeaseSL]; the proof given here follows that in [FischerLM].

Lemma: Three processes cannot solve Byzantine agreement in the presence of one fault.

Proof: By contradiction. Assume there is a protocol P, consisting of three processes p, q, and r, such that P with arbitrary inputs will satisfy the Byzantine agreement conditions, even if one process malfunctions. We shall construct a new system S from copies of the same processes, and show that S must exhibit contradictory behavior. It follows that P cannot exist.

Take two copies of each process in P, and arrange them into a six-process system S, as shown in the figure below: the processes p, q, r, p', q', r' are connected in a ring, in that order.

When configured in this way, the system appears to every process as if it is configured in the original three-process system P. Notice that the processes are not required to produce any specific output in S; however, S with any particular input assignment does exhibit some well-defined behavior. We shall see that no such well-defined behavior is possible.

Suppose that the processes in S are started with the input values indicated in the figure: p, q, and r start with 1, and p', q', and r' start with 0.

Figure: the system S (the arrangement of processes in the proof of the lemma): a ring of six processes p, q, r, p', q', r', where p, q, r have input 1 and p', q', r' have input 0.

Consider the processes p, q, r, p' in S, with the given inputs. To q and r, it appears as if they are running in the three-process system P, in which p is faulty (i.e., "pretending" that it is running in a six-process system). This is an allowable behavior for Byzantine agreement on three processes, and the correctness conditions for BA imply that q and r must eventually agree on 1 in the three-process system. Since the six-process system S appears identical to q and r, both will eventually decide on 1 in S as well.

Next, consider the processes r, p', q', r'. By similar reasoning, p' and q' will eventually agree on 0 in S.

Finally, consider the processes q, r, p', q'. To r and p', it appears as if they are in a three-process system with q faulty. Then r and p' must eventually agree in P (although there is no requirement on what value they agree upon), and so they agree in S. However, this is impossible, since we just saw that r must decide 1 and p' must decide 0, and we have arrived at the desired contradiction.

We now use the lemma to show that Byzantine agreement requires n >= 3f + 1 processes to tolerate f Byzantine faults [LamportPS]. We will do this by showing how an n-process solution, with n <= 3f, that can tolerate f Byzantine failures implies the existence of a 3-process solution which can tolerate a single Byzantine failure. This contradicts the above lemma.

Theorem: There is no solution to the Byzantine agreement problem on n processes in the presence of f Byzantine failures, when 1 < n <= 3f.

Proof: Assume there is a solution for Byzantine agreement with 1 < n <= 3f. For n = 2, there can be no agreement: if one process starts with 0 and the other with 1, then each must allow for the possibility that the other is lying, and decide on its own value; but if neither is lying, this violates the agreement property. So assume n >= 3, and construct a three-process system, with each new process simulating approximately one-third of the original processes. Specifically,

partition the original processes into three nonempty subsets P_1, P_2, and P_3, each of size at most f. Let the three new processes be p_1, p_2, and p_3, and let each p_i simulate the original processes in P_i, as follows. Each process p_i keeps track of the states of all the original processes in P_i, assigns its own initial value to every member of P_i, and simulates the steps of all the processes in P_i, as well as the messages between the processes in P_i. Messages from processes in P_i to processes in another subset are sent from p_i to the process simulating that subset. If any simulated process in P_i decides on a value v, then p_i can decide on the value v. (If there is more than one such value, then p_i will choose a particular one, chosen according to some default.)

To see that this is a correct 3-process solution, we reason as follows. At most one of p1, p2,
p3 is allowed to be faulty, and each simulates at most f original processes, so the simulation
contains no more than f simulated faults. Designate the faulty processes in the simulated
system to be exactly those that are simulated by the faulty actual process, regardless of
whether they actually exhibit a fault. This is as required by the simulated solution. The
n-process simulated solution then guarantees that the simulated processes satisfy agreement,
validity, and termination.

Let us argue briefly that the validity, agreement, and termination of the n-process simu-
lation carry over to the 3-process system. Termination is easy. For validity: if all nonfaulty
actual processes begin with a value v, then all the nonfaulty simulated processes begin with
v. Validity for the simulated system implies that the only simulated decision value for a
simulated nonfaulty process must be v, so the only actual decision value for an actual non-
faulty process is v. For agreement: suppose pi and pj are nonfaulty actual processes; then
they simulate only nonfaulty processes. Agreement for the simulated system implies that all
of these simulated processes agree, so pi and pj also agree.

We conclude that, given a system that tolerates f faults with n <= 3f, we can construct a
protocol for 3 processes that tolerates one faulty process, contradicting the lemma above.

Let us recap what assumptions we have used in this proof.

Locality: A process's actions depend only on messages from its input channels and its initial
value.

Faultiness: A faulty process is allowed to exhibit any combination of behaviors on its
outgoing channels. (We could strengthen this assumption and insist that the behavior
of each channel can arise in some system in which the process is acting correctly.)

The locality assumption basically states that communication only takes place over the
edges of the graph, and thus it is only these inputs and a process's initial value that can affect
its behavior. Note that the nodes are allowed to have information about the network in which
they are supposed to run (e.g., the three-process network above) built into them, but this
information is not changed "magically" if the processes are assembled into an unexpected
system (e.g., the six-process network above).

The strengthened version of the fault assumption expresses a "masquerading" capability
of faulty processes: we cannot determine whether a particular edge leads to a correct process,
or to a faulty process simulating the behavior of a correct process over the edge. The fault
assumption gives faulty processes the ability to simulate the behaviors of different correct
processes over different edges.

Byzantine Agreement in General Graphs

We have shown that Byzantine agreement can be solved with n processes and f faults,
where n >= 3f + 1. In proving this result, we assumed that any process could send a message
directly to any other process. We now consider the problem of Byzantine agreement in
general communication graphs [Dolev].

Consider a communication graph G, where the nodes represent processes and an edge
exists between two processes if they can communicate. It is easy to see that if G is a tree, we
cannot accomplish Byzantine agreement with even one faulty process: any faulty process
that is not a leaf would essentially cut off one section of G from another, and the nonfaulty
processes in the different components would not be able to communicate reliably, much less
reach agreement. Similarly, if removing f nodes can disconnect the graph, it should also be
impossible to reach agreement with f faulty processes. We formalize this intuition using the
following standard graph-theoretic concept.

Definition: The connectivity of a graph G, conn(G), is the minimum number of nodes
whose removal results in a disconnected graph or a trivial (1-node) graph. We say a graph G
is k-connected if conn(G) >= k.

For example, a tree has connectivity 1, and a complete graph with n nodes has connec-
tivity n - 1. Figure shows a graph with connectivity 2: if q and s are removed, then we
are left with two disconnected pieces, p and r.

We remark that Menger's Theorem states that a graph is k-connected if and only if every
pair of points is connected by at least k node-disjoint paths. We shall use this alternative
characterization of connectivity in the sequel.
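To make the definition concrete, here is a small brute-force sketch (illustrative only; the function names and the adjacency-list representation are our own, and the exhaustive search over removal sets is exponential, so it is only suitable for tiny graphs like the four-node example above):

```python
from itertools import combinations

def is_connected(adj, removed):
    """BFS over the graph induced by dropping the `removed` nodes."""
    nodes = [v for v in adj if v not in removed]
    if not nodes:
        return True
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(nodes)

def connectivity(adj):
    """conn(G): fewest node removals that disconnect G or leave one node."""
    n = len(adj)
    for k in range(n):
        for removed in combinations(adj, k):
            if n - k <= 1 or not is_connected(adj, set(removed)):
                return k
    return n
```

On the example above (a 4-cycle p-q-r-s), removing the two nodes q and s separates p from r, and no single removal disconnects it, so the function returns 2, matching conn(G) = 2.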

Figure: A graph G with conn(G) = 2.

In the following theorem, we relate the connectivity of a graph to the possibility of ex-
ecuting a Byzantine agreement protocol over it. The proof uses methods similar to those
used in our lower bound proof for the number of faulty processes.

Theorem: Byzantine agreement can be solved on a graph G with n nodes and f faults if
and only if

1. n >= 3f + 1, and
2. conn(G) >= 2f + 1.

Proof: We already know that n >= 3f + 1 processes are required for a fully connected
graph. It is easy to see that for an arbitrary communication graph we still need n >= 3f + 1,
since removing communication links does not help the protocols.

We now show the "if" direction, namely, that Byzantine agreement is possible if n >= 3f + 1
and conn(G) >= 2f + 1. Since we are assuming that G is (2f + 1)-connected, Menger's Theorem
implies that there are at least 2f + 1 node-disjoint paths between any two nodes. We can
simulate a direct connection between these nodes by sending the message along each of the
2f + 1 paths. Since only f processes are faulty, we are guaranteed that the value received in
the majority of these messages is correct. Therefore, a simulation of a fully connected graph
can be accomplished, simulating each round of the protocol for a fully connected graph by at
most n rounds in G. We have already seen an algorithm that solves Byzantine agreement
in this situation.
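As an aside, the majority step of this simulation can be sketched as follows (hypothetical names; the routing of the 2f + 1 node-disjoint paths is taken as given, and a path containing a faulty relay is modeled as delivering an arbitrary, adversarially chosen value):

```python
from collections import Counter

def send_over_paths(value, paths, faulty, corrupt):
    """Deliver `value` along each node-disjoint path; a path that contains a
    faulty node may deliver an arbitrary value chosen by `corrupt` instead."""
    received = []
    for path in paths:
        if any(node in faulty for node in path):
            received.append(corrupt(value, path))  # adversarial delivery
        else:
            received.append(value)
    return received

def reliable_receive(received):
    """With 2f+1 disjoint paths and at most f faulty nodes, the true value is
    always the strict majority of the delivered copies."""
    val, count = Counter(received).most_common(1)[0]
    assert count > len(received) // 2
    return val
```

For example, with three disjoint relay paths and one faulty relay that flips the bit, two of the three copies still arrive intact, so the majority recovers the value actually sent.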

We now prove the "only if" direction of the theorem. The argument that Byzantine agree-
ment is not possible if conn(G) <= 2f is a bit more intricate; we will only argue the case
f = 1, for simplicity.

Assume there exists a graph G with conn(G) <= 2, in which consensus can be solved
in the presence of one fault, using protocol P. Then there are two nodes in G that either
disconnect the graph or reduce it to one node. But the latter case means that there are only
three nodes in the graph, and we already know that Byzantine agreement cannot be solved
in a 3-node graph in the presence of 1 fault. So assume the two nodes disconnect the graph.

The picture is then as in Figure, where p and r are replaced by arbitrary connected
graphs, and there may be several edges from each of q and s to each of the two disconnected
groups. (Again, for simplicity, we just consider p and r to be single nodes.) We
construct a system S by "rewiring" two copies of P, as shown in Figure. Each process
in S behaves as if it were the same-named process in P, for the graph in Figure.

Figure: System S, made by rewiring two copies of P.

Now assume we start up system S with inputs corresponding to the subscripts. Consider
the behavior of the processes p0, q0, and r0 of S, outlined in Figure. As before, it appears
to all three that they are in P, with input as in Figure, in an execution in which s is
a faulty process. By the locality assumption, the outlined processes will behave in the same
way in these two cases. By the validity property for P, these processes must all decide 0 in
P, and hence in S.

Figure: A set of processes in S.

Now consider Figure and the corresponding 3-node situation of Figure. By the
same argument, all the outlined processes will decide 1 in S.

Finally, consider Figure and its corresponding 3-node situation of Figure. Since
only q is faulty, the agreement condition requires that the outlined processes decide on the
same value in P, and hence in S.

Figure: A configuration of P with s faulty that the outlined processes cannot distinguish
from Figure.

Figure: Another set of processes in S.

Figure: A configuration with s faulty that the outlined processes cannot distinguish from
Figure.

Figure: Yet another set of processes in S.

Figure: The nonfaulty processes must agree, giving us a contradiction.

However, we have already shown that process p0 must decide 0 and process r1 must
decide 1. Thus, we have reached a contradiction. It follows that we cannot solve Byzantine
agreement for conn(G) <= 2 and f = 1.

To generalize the result to f > 1, we use the same diagrams, with q and s replaced by
graphs of at most f nodes each, and p and r by arbitrary graphs. Again, removing q and s
disconnects p and r. The edges of Figure now represent "bundles" of all the edges between
the different groups of nodes in p, q, r, and s.

Weak Byzantine Agreement

A slightly stronger result can be obtained from the same proof method. Lamport
defined a weaker problem than Byzantine agreement, still in the case of Byzantine faults, by
changing the validity requirement:

Validity: If all processes start with value v, and no faults occur, then v is the only
allowable decision value.

Previously, we required that even if there were faults, if all nonfaulty processes started
with v, then all nonfaulty processes must decide v. Now, they are only required to decide v
in the case of no failures. This weakened restriction is motivated (loosely) by the database
commit problem, since we only require commitment in the case of no faults.

Lamport tried to get better algorithms for weak Byzantine agreement than for Byzantine
agreement, but failed; instead, he got an impossibility result. (Note: In obtaining this result,
Lamport did not even need the assumption that the algorithm terminates in a fixed number
of rounds, but only that every execution eventually leads to termination for all nonfaulty
processes. In fact, the impossibility results proved so far in this lecture still work even with
this weakening of the termination requirement. For the present result, we will explicitly
assume the weaker termination requirement.)

Theorem: n >= 3f + 1 processes and conn(G) >= 2f + 1 are needed, even for weak Byzantine
agreement.

We will just show that three processes cannot solve weak Byzantine agreement with one
fault; the rest of the proof follows as before.

Suppose there exist three processes p, q, and r that can solve weak Byzantine agreement
with one fault. Let P0 be the execution in which all three processes start with 0 and no failures
occur, and therefore all three processes decide 0. Let P1 be the same for 1 instead of 0. Let k
be greater than or equal to the number of rounds in executions P0 and P1. We now create a
new system S, with at least 4k + 2 nodes, by "pasting" enough copies of the sequence p, q, r into
a ring. Let half the ring start with value 0, and the other half start with 1 (see Figure).

By arguing as we did before, any two consecutive processes must agree; therefore, all
processes in S must agree. Assume, without loss of generality, that they all decide 1.

Now consider a block of at least 2k + 1 consecutive processes that start with 0. For
round 1, the middle 2k - 1 processes in this block will all mimic their namesakes in P0, because
they do not receive any information from outside the block of 0s. Likewise, for round 2, the
middle 2k - 3 processes will behave as in P0, etc. Finally, the middle process of the group
will behave as in P0 for k rounds. Therefore, it will decide 0. This is a contradiction.

Figure: The ring configuration in the proof of the Theorem; half the ring starts with 0 and
half with 1.

Number of Rounds with Stopping Faults

We now turn to the question of whether Byzantine agreement can be solved in fewer than
f + 1 rounds. Once again, the answer is no, even if we require only the weakest validity
condition, namely, that processes must decide v only when all start with v and no faults
occur. In fact, at least f + 1 rounds are required even to simply tolerate f stopping faults.

We assume the following model. The number of processes satisfies n >= f + 2. The
algorithm terminates in a fixed number of rounds. We also assume, for simplicity, that every
process sends to every other process at every round, until it fails. As usual, we assume that
a process can fail in the middle of sending at any round, so that it can succeed in sending
any subset of the messages.

As for the two-generals problem, it is useful to carry out the proof using the notion of a
communication pattern.

Define a communication pattern to be an indication of which processes send to which other
processes in each round. A communication pattern does not tell us the actual information
sent, but only who sent to whom in each round. A communication pattern can be depicted
graphically, as shown in Figure. As before, a communication pattern is a subset of the set
of (i, j, k) triples, where i and j are distinct processes and k is a round. However, in the
stopping-failure case, if a message from some process i fails to reach its destination in round
k, then no messages from i are delivered in any round after k.

In the figure, one of the processes fails to send to another in some round; thus it must
have stopped, and will send nothing further in any later round. Essentially, a communication
pattern depicts how processes fail in a run.

Figure: Example of a communication pattern for four processes; two of the processes stop
partway through the execution, and the other two are up throughout.

Define a run as an input assignment and a communication pattern. This is identical to
what we called an adversary in the section on two-generals. Given any run rho, we can define

an execution exec(rho) generated by rho. This execution is an alternating sequence of vectors of
states (one per process) and matrices of messages (one per pair of processes), starting with
the initial states containing the given input values. These states and messages can be filled
in using the deterministic message and transition functions. It is possible to infer when a
process fails, by the first time (if any) when it fails to send a message. After such a failure,
it will not perform its state transition function. By convention, we say that a process that
sends all its messages at round k, but none at round k + 1, performs the transition of round
k in the execution exec(rho).

A process is faulty in a run exactly if it stops sending somewhere in the communication
pattern. We will only consider runs with at most f faulty processes.
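For concreteness, the stopping-failure constraint on communication patterns can be expressed as a well-formedness check over the (i, j, k) triples (a sketch with names of our own choosing):

```python
def stop_round(pattern, i, processes, rounds):
    """First round in which process i omits a required message, else None."""
    for k in range(1, rounds + 1):
        for j in processes:
            if j != i and (i, j, k) not in pattern:
                return k
    return None

def is_stopping_pattern(pattern, processes, rounds):
    """A stopped process must send nothing after its first omission round."""
    for i in processes:
        k = stop_round(pattern, i, processes, rounds)
        if k is not None and any(
            (i, j, r) in pattern
            for j in processes
            for r in range(k + 1, rounds + 1)
        ):
            return False
    return True
```

A process may still send to a subset of the others in the round where it stops; the check only rules out its reappearing in any later round.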

The Case of One Failure. First, for intuition, we will consider the special case f = 1.
We show that two rounds are required to handle a single stopping fault. We do this by
contradiction, so assume a 1-round algorithm exists.

We construct a chain of executions, each with at most one fault, such that (i) weak
validity requires that the first execution must lead to a 0 decision and the last execution
must lead to a 1 decision, and (ii) any two consecutive executions in the chain look the
same to some process that is nonfaulty in both. This implies a contradiction, since if two
executions look the same to a nonfaulty process, that process must make the same decision
in both executions, and therefore, by the agreement requirement, both executions must have
the same decision value. Thus, every execution in the chain must have the same decision
value, which is impossible, since the first execution must decide 0 and the last must decide 1.

We start the chain with the execution exec(rho_1), determined from the run rho_1 having all-0
input values and the complete communication pattern, as in Figure. This execution
must decide 0.

Figure: A run which must decide 0.



Starting from execution exec(rho_1), form the next execution by removing the message from
p1 to p2. These two executions look the same to all processes except p1 and p2. Since
n >= f + 2 = 3, there is at least one such other process; it is nonfaulty in both, and hence
must decide the same in both.

Next, we remove the message from p1 to p3; these two look the same to all processes
except p1 and p3. We continue in this manner, removing one message at a time, in such a way
that every consecutive pair of executions looks the same to some nonfaulty process.

Once we have removed all the messages from p1, we continue by changing p1's input value
from 0 to 1. Of course, these two executions will look the same to every process except p1,
since p1 sends no messages. Now we can add the messages back in, one by one, and again
every consecutive pair of executions will look the same to some nonfaulty process. This
all yields exec(rho_2), where rho_2 has 1 as the input value for p1, 0 for all the others, and complete
communication.

Next, we repeat this construction for p2: removing p2's messages one by one, changing
p2's input value from 0 to 1, and then adding p2's messages back in. This yields rho_3. Repeating
this construction for p3, ..., pn, we end the chain with the execution exec(rho_{n+1}), where all start
with 1 and there is complete communication. This execution must decide 1. Thus, we have
produced a chain as claimed, and this is a contradiction.
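The chain just described can be generated and checked mechanically. In this sketch (names are our own), a 1-round run is a pair (inputs, messages), and two runs look the same to a process q when q has the same input and receives the same messages, each carrying its sender's input (which is all a deterministic 1-round message can depend on):

```python
def view(q, inputs, msgs):
    """What q sees: its input, plus (sender, sender's input) per message."""
    return inputs[q], frozenset((i, inputs[i]) for (i, j) in msgs if j == q)

def nonfaulty(q, msgs, n):
    return all((q, j) in msgs for j in range(n) if j != q)

def chain(n):
    """The chain of 1-round runs used in the f = 1 lower bound argument."""
    inputs = [0] * n
    msgs = {(i, j) for i in range(n) for j in range(n) if i != j}
    runs = [(tuple(inputs), frozenset(msgs))]
    for p in range(n):
        outgoing = [(p, j) for j in range(n) if j != p]
        for m in outgoing:            # remove p's messages one by one
            msgs.discard(m)
            runs.append((tuple(inputs), frozenset(msgs)))
        inputs[p] = 1                 # flip the now-silent process's input
        runs.append((tuple(inputs), frozenset(msgs)))
        for m in outgoing:            # add the messages back one by one
            msgs.add(m)
            runs.append((tuple(inputs), frozenset(msgs)))
    return runs

def check(n):
    """Every consecutive pair looks the same to some nonfaulty process."""
    runs = chain(n)
    for (in1, m1), (in2, m2) in zip(runs, runs[1:]):
        assert any(
            nonfaulty(q, m1, n) and nonfaulty(q, m2, n)
            and view(q, in1, m1) == view(q, in2, m2)
            for q in range(n)
        )
    return len(runs)
```

Running check(3) confirms that the chain from the all-0 complete run to the all-1 complete run really does connect each pair of neighbors through a common nonfaulty witness.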

The Case of Two Failures. Before moving to the general case, we will do one more
preliminary case: we will now show that two rounds are not sufficient to handle two stopping
failures. Again, we proceed by contradiction. We form a chain, as we did for the case of one
fault and one round. We start with the execution determined by all 0s as input and the
complete 2-round communication pattern; this execution must decide 0.

To form the chain, we again want to work toward killing p1 at the beginning. When we
were only dealing with one round, we could kill messages from p1 one by one. Now, if we
delete a first-round message from p1 to q in one step of the chain, then it is no longer the
case that the two executions must look the same to some nonfaulty process. This is because,
in the second round, q could inform all other processes as to whether or not it received a
message from p1 in the first round, so that at the end of the second round, the two executions
can differ for all nonfaulty processes.

We solve this problem by using several steps to delete the first-round message from p1
to q, and by letting q be faulty too (we are allowed two faults). We start with an execution
in which p1 sends to q in the first round and q sends to every process in the second round.
Now we let q be faulty, and remove second-round messages from q one by one, until we have
an execution in which p1 sends to q in the first round and q sends no messages in the second
round. Now we can remove the first-round message from p1 to q, and these two executions
will only look different to p1 and q. Now we replace the second-round messages from q, one by
one, until we have an execution in which p1 does not send to q in the first round and q sends
to all in the second round. This achieves our goal of removing a first-round message from p1
to q, while still letting each consecutive pair of executions look the same to some nonfaulty
process.

In this way, we can remove first-round messages from p1 one by one, until p1 sends no
messages. Then we can change p1's input from 0 to 1 as before, and replace p1's messages
one by one. Repeating this for p2, ..., pn gives the desired chain.

The General Case. The preceding two examples contain the main ideas for the general
case, in which there are up to f faulty processes in a run. Suppose we have an f-round
protocol for f faults.

A little notation may help. If rho and rho' are runs with p nonfaulty in both, then we write
rho ~_p rho'
to mean that exec(rho) and exec(rho') look the same to p through time f (same state
sequence and same messages received).

We write rho ~ rho' if there exists a process p that is nonfaulty in both rho and rho' such that
rho ~_p rho'.

We write rho ~~ rho' for the transitive closure of the ~ relation.

Every run rho has a decision value, denoted dec(rho). If rho ~ rho', then dec(rho) = dec(rho'), since rho
and rho' look the same to some nonfaulty process. Thus, if rho ~~ rho', then dec(rho) = dec(rho').

Now, let F(rho) be the set of faulty processes in run rho.

Lemma: Let rho and rho' be runs, and let p be a process. Let k be such that 0 <= k <= f. If
|F(rho) u F(rho')| <= k + 1, and rho and rho' only differ in p's failure behavior after time k, then
rho ~~ rho'.

This means that the patterns are the same, except for some differences in the existence
of some outgoing messages from p at round k or greater.

To get our result, we need to use the case k = 0 of the Lemma. This says that if runs rho and
rho' have only one failure between them, and differ only in the failure behavior of one process,
then rho ~~ rho'. We use this as follows.

Let rho_0 and rho_0' be two runs, both starting with all 0s as input; in rho_0 no process fails, and
in rho_0', process p1 fails at the beginning and sends no messages. There is only one failure
between them, and they differ only in p1's failure behavior, so by the lemma, rho_0 ~~ rho_0'.

Now let rho_0'' be the same as rho_0', except that p1 has a 1 as input instead of a 0. Since p1 sends no
messages in rho_0', this change cannot be seen by any process except p1, so rho_0' ~~ rho_0''. Now let
rho_1 be the same as rho_0'', except that there are no failures in rho_1 (rho_1 has the same input as
rho_0''). Again, the lemma says rho_0'' ~~ rho_1. Therefore, rho_0 ~~ rho_1.

We continue this, letting rho_i be the execution with no failures and input defined by having
the first i processes get 1 as input and the others get 0. We have rho_i ~~ rho_{i+1}, so rho_0 ~~ rho_n, and
therefore dec(rho_0) = dec(rho_n). But this is a contradiction, since rho_0 has all zeros as input with
no failures, and hence must decide 0, whereas rho_n has all ones as input with no failures, and
hence must decide 1.

It remains to prove the lemma. This will be done in the next lecture.

6.852J/18.437J Distributed Algorithms October 1992

Lecturer: Nancy Lynch

Lecture

Number of Rounds With Stopping Failures (cont.)

Last time, we were showing the lower bound of f + 1 rounds on consensus with stopping faults.
We had gotten it down to the following lemma. Today, we will finish by proving the lemma.

Lemma: Let rho and rho' be runs, and let p be a process. Let k be such that 0 <= k <= f. If
|F(rho) u F(rho')| <= k + 1, and rho and rho' only differ in p's failure behavior after time k, then
rho ~~ rho'.

Intuitively, the lemma means that the patterns are the same, except for some differences
in the outgoing messages from p at round k or later.

Proof: We prove the lemma by reverse induction on k; i.e., k goes from f down to 0.

Base case: Suppose k = f. In this case, rho and rho' agree up to the end of round f - 1.
Consider round f. If rho and rho' do not differ at all, then we are done; so suppose they differ.
Then p is faulty in at least one of rho or rho', and the total number of other processes that fail
in either rho or rho' is no more than f. Consider two processes q and r that are nonfaulty
in both rho and rho'. (Note that two such processes exist, since n >= f + 2.)

Let rho'' be the same as rho, except that p sends to q in round f of rho'' exactly if it does in
rho' (see Figure for an example).

We have rho ~_r rho'' and rho'' ~_q rho', and so rho ~~ rho'.

Inductive step: We want to show that the lemma is true for k, where 0 <= k < f,
assuming that it holds for k + 1. Runs rho and rho' agree up to the end of round k - 1. If they also
agree up to the end of round k, then we can apply the inductive hypothesis and we are done;
so assume that process p fails in round k, and assume, without loss of generality, that p fails
in rho.

Let rho_i, for 0 <= i <= n, be the same as rho, except that p sends to p1, ..., pi at round k of rho_i
exactly if it does in rho'. (Thus rho_0 = rho.)

Claim: For 0 <= i < n, rho_i ~~ rho_{i+1}.

Proof: Runs rho_i and rho_{i+1} differ at most in what p sends to p_{i+1} at round k. Let sigma_i
be the same as rho_i, except that p sends no messages after round k in sigma_i. Similarly, let
sigma_{i+1} be the same as rho_{i+1}, except that p sends no messages after round k in sigma_{i+1}. This situation
is illustrated in Figure.

Figure: Example of the construction used to prove the base case of the Lemma.

Figure: Example of the construction used to prove the inductive step of the Lemma.

Notice that each of sigma_i and sigma_{i+1} has no more than k + 1 failures, since rho has no more than
k + 1 failures (and p is already faulty in rho). Therefore, by the inductive hypothesis, we have
rho_i ~~ sigma_i and rho_{i+1} ~~ sigma_{i+1}. Also, sigma_i ~~ sigma_{i+1},
since both look the same to any nonfaulty process other than p_{i+1}. Thus,
rho_i ~~ sigma_i ~~ sigma_{i+1} ~~ rho_{i+1}.

From the claim, and by the transitivity of ~~, we have rho = rho_0 ~~ rho_n; and since rho_n and rho' only
differ in p's failure behavior after round k, we can apply the inductive hypothesis once
again to get rho_n ~~ rho'. Thus, we have rho ~~ rho', as needed.

Applying the Lemma with k = 0, we can summarize with the following theorem.

Theorem: Any agreement protocol that tolerates f stopping failures requires at least f + 1 rounds.

The Commit Problem

The commit problem [Dwork, Skeen] is a variant of the consensus problem, and may be
described as follows. We assume that the communication is realized by a complete graph.
We assume reliable message delivery, but there may be any number of process stopping
faults.

Agreement: There is at most one decision value.

Validity:

1. If any process starts with 0, then 0 is the only possible decision value.

2. If all processes start with 1 and there are no failures, then 1 is the only possible decision
value.

Termination: This comes in two flavors. The weak termination requirement is that termi-
nation is required only in the case where there are no failures; the strong termination
requirement specifies that all nonfaulty processes terminate.

The main differences between the commit problem and the consensus prob-
lem for stopping faults are, first, in the particular choice of validity condition, and second,
in the consideration of a weaker notion of termination.

Two-Phase Commit (2PC)

For weak termination, the standard algorithm in practice is two-phase commit. The protocol
consists of everyone sending their values to a distinguished process, say p1; p1 deciding, based
on whether or not anyone has an initial value of 0; and p1 broadcasting the decision (see Figure
for an illustration).

Figure: Message pattern in the two-phase-commit protocol.

It is easy to see that 2PC satisfies agreement, validity, and weak termination.

However, 2PC does not satisfy strong termination, because if p1 fails, the other processes
never finish. In practice, if p1 fails, then the others can carry out some kind of communication
among themselves, and may sometimes manage to complete the protocol.

But this does not always lead to termination, since the other processes might not be able
to figure out what a failed p1 has already decided, yet they must not disagree with it. For
example, suppose that all processes except for p1 start with 1, but p1 fails before sending
any messages to anyone else. Then no one else ever learns p1's initial value, so they cannot
decide 1. But neither can they decide 0, since, as far as they can tell, p1 might have already
decided 1.

This protocol takes only 2 rounds. The weaker termination condition, however, implies
that this does not violate the lower bound just proved on the number of rounds. (Note that
the earlier result could be modified to use the commit validity condition.)
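A minimal failure-free sketch of the two rounds (hypothetical names; with failures, 2PC only guarantees weak termination, which is exactly the limitation discussed above):

```python
def two_phase_commit(inputs):
    """inputs: dict mapping each process to its initial value (0 or 1).
    Returns every process's decision in the failure-free run."""
    # Round 1: everyone sends its value to the coordinator p1.
    reports = dict(inputs)
    # p1 commits (decides 1) only if nobody reported an initial 0.
    decision = 1 if all(v == 1 for v in reports.values()) else 0
    # Round 2: p1 broadcasts the decision, and everyone adopts it.
    return {p: decision for p in inputs}
```

Note that this matches the validity condition above: a single 0 anywhere forces the decision 0, and an all-1 failure-free run decides 1.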

Three-Phase Commit (3PC)

2PC can be augmented to ensure strong termination. The key is that p1 does not decide to
commit unless everyone else knows that it intends to. This is implemented using an extra
phase. The algorithm proceeds in four rounds, as follows (see also Figure).

Round 1: All processes send their values to p1. If p1 receives any 0, or doesn't hear from
some process, then p1 decides 0. Otherwise, it doesn't yet decide.

Round 2: If p1 decided 0, it broadcasts this decision. If not (that is, it has received 1s
from everyone), then p1 sends tentative-1 to all other processes. If a process pi (i >= 2)
receives 0, it decides 0. If it receives tentative-1, it records this fact, thus removing its
uncertainty about the existence of any initial 0s and about p1's intended decision.

Round 3: If a process pi (i >= 2) recorded a tentative-1, it sends an ack to p1. At the end of
this round, process p1 decides 1, whether or not it receives acks from all the others.(*)
Note that p1 knows that each process has either recorded the tentative-1, or else has
failed without deciding and can therefore be ignored.

Round 4: If p1 decided 1, it broadcasts 1. Anyone who receives 1 decides 1.

Rounds 2 and 3 serve as an extra phase of informing everyone of the intention to
commit, and of making sure they know this intention before p1 actually commits.

(*) Note that there is an apparent redundancy in this presentation: the acks, which arise
in the practical versions of the protocol, appear to be ignored in this abstract version. So this version can
apparently be shortened to three rounds, eliminating the acks entirely. This remains to be worked out.

Figure: Message pattern in the three-phase-commit protocol.
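The four rounds can be sketched as follows (hypothetical names; this sketch assumes the coordinator stays up, and models a failed process simply as one that never reports):

```python
def three_phase_commit(inputs, alive):
    """inputs: dict process -> 0 or 1; alive: processes that never fail.
    Returns the decisions of the alive processes (coordinator 1 is alive)."""
    assert 1 in alive
    # Round 1: values are sent to the coordinator; any 0, or silence from
    # a failed process, makes the coordinator decide 0 (abort).
    if any(inputs[p] == 0 for p in alive) or set(inputs) != alive:
        return {p: 0 for p in alive}   # Round 2: broadcast the 0 decision.
    # Round 2: coordinator sends tentative-1, which everyone records.
    # Round 3: acks flow back; the coordinator then decides 1 regardless
    # (the acks are ignored in the abstract version, per the footnote).
    # Round 4: the coordinator broadcasts 1 and everyone decides 1.
    return {p: 1 for p in alive}
```

The interesting behavior, of course, is what happens when the coordinator itself fails partway through; that is the subject of the termination rule discussed below.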

This protocol does not yet guarantee strong termination. If p1 doesn't fail, every non-
faulty process will eventually decide, since p1 never blocks waiting for the others. But if p1
fails, the others can be left waiting, either to receive the tentative-1 or the final 1. They
must do something at the end of the four rounds if they haven't decided yet.

At this point, the literature is a little messy; elaborate protocols have been described,
with many alternative versions. We give below the key ideas, for the case where p1 has failed.

After 4 rounds, every process p other than p1 (even if it has failed) is in one of the following states:

- decided 0,
- uncertain: it hasn't decided, and hasn't received a round-2 tentative-1 message,
- ready: it hasn't decided, and has received a round-2 tentative-1 message, or
- decided 1.

The important property of 3PC is stated in the following observation.

Observation: Let p be any process, failed or not, including p1. If p has decided 0, then
all nonfaulty processes (in fact, all processes) are either uncertain or have decided 0; and if
p has decided 1, then all nonfaulty processes are either ready or have decided 1.

Using this observation, the non-failed processes try to complete the protocol. For in-
stance, this could involve electing a new leader: one solution is to use p2, and if this fails, to
use p3, etc. (This assumes that the processes know their indices.) The new leader collects
the status information from the non-failed processes, and uses the following termination rule.

- If some process reports that it decided, the leader decides in the same way, and broad-
casts the decision.

- If all processes report that they are uncertain, then the leader decides 0, and broadcasts 0.

- Otherwise (i.e., all are uncertain or ready, and at least one is ready), the leader contin-
ues the work of the original protocol from the middle of round 2. That is, it sends
tentative-1 messages to all that reported uncertain, waits, and decides 1. Notice that
this cannot cause a violation of the validity requirement, since the fact that someone
was ready implies that all started with initial value 1.

- Any process that receives a 0 or 1 message decides on that value.
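The new leader's decision rule can be stated compactly (a sketch; the status names are ours, mirroring the four states listed above):

```python
def leader_termination_rule(statuses):
    """statuses: dict process -> one of 'decided0', 'decided1',
    'uncertain', 'ready' (the reports collected from non-failed processes).
    Returns the new leader's decision."""
    values = set(statuses.values())
    if "decided0" in values:
        return 0  # echo an existing decision
    if "decided1" in values:
        return 1
    if values == {"uncertain"}:
        return 0  # nobody is ready, so deciding 0 is safe
    # All uncertain or ready, with at least one ready: everyone must have
    # started with 1, so resuming from mid-round 2 and committing is safe.
    return 1
```

By the observation above, the reports can never mix decided-0 with ready or decided-1 states, so the branches of the rule never conflict.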

Lower Bound on the Number of Messages

It is interesting to consider the number of messages that are sent in the protocol. In this
section, we prove the following theorem [Dwork, Skeen].

Theorem: Any protocol that solves the commit problem, even with weak termination, sends
at least 2(n - 1) messages in the failure-free run starting with all 1s.

Note that 2PC uses 2(n - 1) messages in the failure-free case, and therefore the result is
tight.

The key idea in the lower bound is stated in the following lemma.

Lemma: Let rho be the run where all processes start with 1 and there are no failures. Then
there must be a path of messages from every node to every other node.

Before we prove Lemma let us consider a sp ecial case Consider the failurefree execu

tion whose graph is depicted in Figure Assume all pro cesses start with in

ρ ρ’ p 1 1 1

1 1

1 1 p

4 1 0

Figure execution graphs for and in the sp ecial case Black square represents pro cess

failure

In ρ there is no path of messages from p4 to p1. Consider now an alternative execution ρ', in which the initial value of p4 is 0, and all processes stop when they first hear from p4. In other words, in ρ' all the edges reachable from the node of p4 at time 0 are removed. It is straightforward to show that ρ' looks the same as ρ to p1, regardless of the initial value of p4. However, validity requires that p1 decide 1 in ρ and 0 in ρ', a contradiction.

The general case proof uses the same argument.

Proof: In ρ, the decision of all processes must be 1, by termination and validity. If the lemma is false, then there is some pair of processes p, q such that there is no path of messages from p to q in ρ. Thus, in ρ, q never learns about p. Now construct ρ' by failing every process exactly when it first learns about p, and changing p's input to 0. The resulting run looks the same to q, i.e., ρ' is indistinguishable from ρ to q, and therefore q decides 1 in ρ', which violates validity.

The proof of the theorem is completed by the following graph-theoretic lemma.

Lemma: Let G be a communication graph for n processes. If each of i processes is connected to each of all the n processes by a chain of messages, then there are at least n + i - 2 messages in the graph.

Proof: By induction on i.

Base case: i = 1. The bound of n - 1 follows from the fact that there must be some message to each of the n - 1 processes other than the given process.

Inductive step: assume the lemma holds for i. Let S be the set of i + 1 processes that are joined to all n. Without loss of generality, we can assume that in round 1 at least one of the processes in S sends a message. (If not, we can remove all the initial rounds in which no process in S sends a message; what is left still joins S to all n.) Call some such process p. Now consider the reduced graph obtained by removing a single first-round message from p to anyone. The remaining graph connects everyone in S - {p} to all n. By induction, it must have at least n + i - 2 edges. So the whole graph has at least n + i - 1 edges, which is n + (i + 1) - 2, as needed.
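The combinatorics here can be checked concretely. The following sketch (our own encoding, not from the notes) builds the timestamped message pattern of 2PC with process 0 as a hypothetical leader, verifies that there is a time-respecting message chain from every node to every other node, and counts exactly 2n - 2 messages:

```python
# Check that the 2PC failure-free pattern (everyone sends to a leader,
# then the leader broadcasts) gives a message chain from every node to
# every other node, using exactly 2n - 2 messages.

def reachable(n, msgs, src):
    """Nodes reachable from src via time-respecting message chains.
    msgs is a list of (time, sender, receiver) triples."""
    INF = float("inf")
    earliest = {v: INF for v in range(n)}   # earliest time v knows about src
    earliest[src] = 0
    for t, u, v in sorted(msgs):            # process messages in time order
        if earliest[u] <= t:                # u already knows when it sends
            earliest[v] = min(earliest[v], t)
    return {v for v in range(n) if earliest[v] < INF}

def two_pc_messages(n, leader=0):
    round1 = [(1, p, leader) for p in range(n) if p != leader]  # votes in
    round2 = [(2, leader, p) for p in range(n) if p != leader]  # decision out
    return round1 + round2

n = 5
msgs = two_pc_messages(n)
assert len(msgs) == 2 * n - 2               # matches the lower bound
for p in range(n):
    assert reachable(n, msgs, p) == set(range(n))
```

The `reachable` computation captures the "chain of messages" notion from the lemma: a message counts only if its sender already knew about the source when it was sent.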

6.852J/18.437J Distributed Algorithms, October
Lecturer: Nancy Lynch

Lecture: Asynchronous Shared Memory

In this lecture, we begin a new topic, which features a major switch in the flavor of the algorithms we study. Namely, we add the very important uncertainty of asynchrony: for processes, variable process speeds, and for messages, variable message delay. We shall start this part with shared memory rather than networks. These systems can be thought of as the asynchronous analog of PRAMs. At first, as a simplifying assumption, we will not consider failures; asynchrony is complicated enough to deal with.

Before we can express any algorithm or make any meaningful statement, we need to define a new computation model. We first describe the model informally and consider an example to gain some intuition; only then will we come back to a more formal model.

Informal Model

The system is modeled as a collection of processes and shared memory cells, with an external interface (see the figure below).

Each process is a kind of state machine, with a set states of states and a subset start of states indicating the initial states, as before. But now it also has labeled actions, describing activities it participates in. These are classified as either input, output, or internal actions. We further distinguish between two different kinds of internal actions: those that involve the shared memory, and those which involve strictly local computation.

There is no message generation function, since there are no messages in this model: all communication is via the shared memory.

There is a transition relation trans, which is a set of (s, π, s') triples, where π is the label of an input, output, or local computation action. The triple (s, π, s') in trans says that from state s it is possible to go to state s' while performing action π. Note that trans is a relation rather than a function; this generality includes the convenience of nondeterminism.

We assume that input actions can always happen; technically, this is an input-enabling model. Formally, for every s and input action π, there exists s' such that (s, π, s') is in trans.

Figure: a shared memory system. Circles represent processes, squares represent shared memory cells, and arrows represent the external interface, i.e., the input and output actions.

In contrast, output and local computation steps might be enabled only in a subset of the states. The intuition behind the input-enabling property is that we think of the input actions as being controlled by an arbitrary external user, while the internal and output actions are under the control of the system.

Shared memory access transitions are formalized differently. Specifically, these transitions have the form ((s, m), π, (s', m')), where s, s' are process states and m, m' are shared memory states. These describe changes to the state and the shared memory as a result of accessing the shared memory from some state. A special condition that we impose here is that the enabling of a shared memory access action should depend on the local process state only, and not on the contents of shared memory; however, the particular response, i.e., the change to the states of the process and of the shared memory, can depend on the shared memory state. Formally, this property is captured by saying that if π is enabled in (s, m), then for all memory states m', π is enabled in (s, m'). In particular cases, the accesses are further constrained.

The most common restriction is to allow only one location to be involved in a single step. Another popular restriction is that all accesses should be of type read or write. A read just involves copying the value of a shared memory location to a local variable, while a write just involves writing a designated value to a shared memory location, overwriting whatever was there before.
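To make the read/write restriction concrete, here is a toy rendering (our own encoding, not the notes') of the two access types as transitions of the form ((s, m), π, (s', m')); note that a read's response depends on memory, while a write overwrites unconditionally:

```python
# A minimal sketch of the shared-memory transition style described above.
# A shared-memory access maps (state, memory) to (state', memory'): the
# enabling depends only on the local state, while the response may depend
# on the memory contents.

def read_step(state, memory, loc, var):
    """Read shared location `loc` into local variable `var`."""
    s2 = dict(state)
    s2[var] = memory[loc]        # response depends on memory contents
    return s2, dict(memory)      # memory is unchanged by a read

def write_step(state, memory, loc, value_of):
    """Write a value determined from the local state to location `loc`."""
    m2 = dict(memory)
    m2[loc] = value_of(state)    # overwrite whatever was there before
    return dict(state), m2

state = {"pc": "L1", "tmp": None}
memory = {"x": 7}
state, memory = read_step(state, memory, "x", "tmp")
assert state["tmp"] == 7
state, memory = write_step(state, memory, "x", lambda s: s["tmp"] + 1)
assert memory["x"] == 8
```

The names `read_step`, `write_step`, and the dictionary encoding are illustrative choices, not part of the model.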

An execution is modeled very differently from before. Now processes are allowed to take steps one at a time, in arbitrary order, rather than taking steps in synchronized rounds. This arbitrariness is the essence of asynchrony. More formally, an execution is an alternating sequence s0, π1, s1, π2, s2, ..., consisting of system states (i.e., the states of all the processes and of the shared memory), alternated with actions (each identified with a particular process), where successive (state, action, state) triples satisfy the transition relation. An execution may be a finite or an infinite sequence.

There is one important exception to the arbitrariness: we don't want to allow a process to stop when it is supposed to be taking steps. This situation arises when the process is in a state in which any locally controlled action (i.e., non-input action) is enabled. (Recall that input actions are always enabled; however, we will not require that they actually occur.) A possible solution is to require that if a process takes only finitely many steps, its last step takes it to a state in which no more locally controlled actions are enabled. But that is not quite sufficient: we would like also to rule out another case, where a process does infinitely many steps, but after some point all these steps are input steps. We want to make sure that the process itself also gets turns. But we have to be careful in saying this: consider the scenario in which infinitely many inputs occur and no locally controlled actions, but in fact no locally controlled actions are enabled. That seems OK, since, vacuously, the process had chances to take steps, but simply had none it wanted to take. We account for all these possibilities in the following definition. For each process i, either

the entire execution is finite, and in the final state no locally controlled action of process i is enabled, or

the execution is infinite, and there are either infinitely many places in the sequence where i does a locally controlled action, or else infinitely many places where no such action is enabled.

We call this condition the fairness condition for this system.

We will normally consider the case where the execution is fair. However, at a later point in the course, when we consider wait-free objects and other fault-tolerance issues, we will want to prove things about executions that are not necessarily fair.
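The first clause of the definition (the finite case) can be checked mechanically on a concrete execution. Below is a toy encoding of our own; `enabled(state, i)` is an assumed predicate, supplied by whatever model is being checked, saying whether process i has some locally controlled action enabled:

```python
# Toy check of the fairness condition for a *finite* execution: in the
# final state, no locally controlled action of any process may be enabled.
# States and actions are opaque values.

def finite_execution_fair(states, actions, nprocs, enabled):
    assert len(states) == len(actions) + 1   # alternating sequence s0,pi1,s1,...
    final = states[-1]
    return all(not enabled(final, i) for i in range(nprocs))

# Example: two processes; say a process has an enabled locally controlled
# action while its counter is below 2.
enabled = lambda state, i: state[i] < 2
states = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]
actions = [0, 1, 0, 1]                       # which process took each step
assert finite_execution_fair(states, actions, 2, enabled)
assert not finite_execution_fair(states[:2], actions[:1], 2, enabled)
```

The infinite clauses, of course, cannot be checked on a finite prefix; they are proof obligations, not runtime checks.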

Mutual Exclusion Problem

The problem of Mutual Exclusion involves allocating a single indivisible resource among n processes p1, ..., pn. Having access to the resource is modeled by the process reaching a critical region, which is simply a designated subset of its states. Normally, that is, when it doesn't want access to the resource, the process is said to be in the remainder region. In order to gain access to the critical region, a process executes a trying protocol, while, for symmetry, after the process is done with the resource, it executes an (often trivial) exit protocol. This procedure can be repeated, and the behavior of each process is thus cyclic: moving from the remainder region (R) to the trying region (T), then to the critical region (C), and finally to the exit region (E). The cycle of regions that a process visits is shown in the figure below. In the diagram, self-loops indicate that a process can remain in these regions after it performs a step. In contrast, we will abstract away steps done within the critical and remainder regions and not consider them here.

Figure: the cycle of regions of a single process (R to T to C to E, and back to R).

We will consider here only algorithms in which communication among processes is done through shared memory. For starters, we will suppose that the memory is read/write only; that is, in one step, a process can either read or write a single word of shared memory. To match this up with the shared memory model discussed above, we have n process automata accessing read/write shared memory. The two basic actions involving process p_i and a shared variable x are:

p_i reads x, and uses the value read to modify the state of p_i; and

p_i writes a value, determined from p_i's state, to x.

The inputs to processes are try_i, modeling the rest of the program requesting access to the resource, and exit_i, modeling the rest of the program announcing it is done with the resource. The output actions are crit_i, which is interpreted as granting the resource, and rem_i, which tells the rest of the program that it can continue. The processes inside the diagram, then, are responsible for performing the trying and exit protocols.

It is sometimes useful to view the outside world as consisting of n users, which we model as other state machines that communicate with their respective processes via the designated actions. These users are assumed to execute some application programs. In between the crit_i and exit_i actions, user i is free to use the resource (see the figure below).

We will assume that each user obeys the cyclic behavior; that is, each user is not the first to violate the cyclic order of actions between itself and its process.

Figure: interface specification for the user (actions try_i, crit_i, exit_i, rem_i).

A global picture of the system architecture appears in the figure below.

Within this setting, the protocol itself is supposed to preserve the cyclic behavior, and moreover, to satisfy the following basic properties.

Mutual exclusion: There is no reachable state in which more than one process is in region C.

Deadlock-freedom: If at any point in a fair execution at least one process is in the trying region and no process is in the critical region, then at some later point some process enters the critical region. Similarly, if at any point in a fair execution at least one process is in the exit region (with no other condition here), then at some later point some process enters the remainder region.

Note that deadlock-freedom presupposes fairness to all the processes, whereas we don't need to assume this for the other conditions. In the ensuing discussions of mutual exclusion, we might sometimes neglect to specify that an execution is fair, but it should be clear that we assume it throughout, until otherwise specified (namely, when we consider faults).

Note that responsibility for the entire system continuing to make progress depends not only on the protocol, but on the users as well. If a user gets the resource but never returns it via an exit_i, then the entire system will grind to a halt. But if the user returns the resource every time, then the deadlock-freedom condition implies that the system continues to make progress, unless everyone remains in its remainder region.

Note that the deadlock-freedom property does not imply that any particular process ever succeeds in reaching its critical region. Rather, it is a global notion of progress, saying that some process reaches the critical region. For instance, the following scenario does not violate the deadlock-freedom condition: p1 enters T, while p2 cycles repeatedly through the regions; the rest of the processes are in R. The deadlock-freedom condition does not guarantee that p1 will ever enter C.

Figure: Architecture for the mutual exclusion problem. Processes P1, ..., Pn access the shared variables; each user i interacts with its process Pi via try_i, crit_i, exit_i, and rem_i.

There is one other constraint: a process can have a locally controlled action enabled only when it is in T or E. This implies that the processes can actively execute the protocol only when there are active requests. The motivation for this constraint is as follows. In the original setting where this problem was studied, the processes did not have dedicated processors; they were logically independent processes executed on a single time-shared processor. In this setting, having special processes managing the mutual exclusion algorithm would involve extra context-switching, i.e., between the manager process and the active processes. In a true multiprocessor environment, it is possible to avoid the context-switching by using a dedicated processor to manage each resource; however, such processors would be constantly monitoring the shared memory, causing memory contention. Moreover, a processor dedicated to managing a particular resource is unavailable for participation in other computational tasks.

Dijkstra's Mutual Exclusion Algorithm

The first mutual exclusion algorithm for this setting was developed in 1965 by Edsger Dijkstra. It was based on a prior two-process solution by Dekker. Until then, it was not even clear if the problem is solvable.

We begin by looking at the code, though it will not be completely clear at first how this code maps to the state-machine model. The code is given in the figure below.

The shared variables are flag(i), 1 <= i <= n, one per process, each taking on values from {0, 1, 2}, initially 0, and turn, an integer in {1, ..., n}. Each flag(i) is a single-writer multi-reader variable, writable only by process i but readable by everyone. The turn variable is multi-writer multi-reader, both writable and readable by everyone.

We translate the code to our model as follows. The state of each process consists of the values of its local variables, as usual, plus some other information that is not represented explicitly in the code, including:

Temporary variables, needed to remember values of variables read.

A program counter, to say where in the code the process is up to.

Temporary variables introduced by the flow of control of the program; e.g., the for loop introduces a set to keep track of the indices already processed.

A region designation: T, C, E, or R.

The state of the entire system consists of process states, plus values for all the shared variables.

The initial state of the system consists of initial values for all local and shared variables, and program counters in the remainder region. Temporary variables can be undefined. The actions are try_i, crit_i, exit_i, rem_i, local computation steps, and the read/write accesses to the shared variables. The steps just follow the code in the natural way. The code describes the changes to the local and shared variables explicitly, but the implicit variables also need to be changed by the steps. For example, when a try_i action occurs, the region designation for i becomes T. The program counter changes as it should; e.g., when try_i occurs, it becomes a pointer to the beginning of the trying region code.

This code is adapted from Dijkstra's paper. Note that the code does not specify explicitly exactly which portions of the code comprise indivisible steps. However, since the processes are asynchronous, it is important to do this. For this algorithm, the atomic actions are the try_i etc., plus the single reads from and writes to the shared variables, plus some local computation steps. The code is rewritten in a second figure below to make the indivisible steps more

Shared variables:
    flag: an array indexed by {1, ..., n} of integers from {0, 1, 2}, initially all 0;
          flag(i) written by p_i and read by all processes
    turn: integer from {1, ..., n}, initially arbitrary;
          written and read by all processes

Code for p_i:

** Remainder Region **

try_i
    (* begin first stage: trying to obtain turn *)
L:  flag(i) := 1
    while turn != i do
        if flag(turn) = 0 then turn := i
        end if
    end while
    (* begin second stage: checking that no other processor has reached this stage *)
    flag(i) := 2
    for j != i do               (* Note: order of checks unimportant *)
        if flag(j) = 2 then goto L
        end if
    end for
crit_i

** Critical Region **

exit_i
    flag(i) := 0
rem_i

Figure: Dijkstra's mutual exclusion algorithm.

explicit. The atomic actions are enclosed in pointy brackets. Note that the read of flag(turn) does not take two atomic actions, because turn was just read in the line above, and so a local copy of turn can be used. The atomicity of local computation steps is not specified; in fact, it is unimportant, so any reasonable interpretation can be used.
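The algorithm can also be exercised directly. The following is our own transcription of the code above into Python threads; under CPython, a single read or write of a list element is effectively atomic, which roughly matches the pointy-bracket granularity (this is an informal sketch, not a substitute for the correctness proof):

```python
import threading

# flag(i) in {0, 1, 2}; written only by thread i, read by all.
# turn and the instrumentation are boxed in lists so threads share them.
N = 3
flag = [0] * N
turn = [0]
occupancy = [0]          # instrumentation only: how many threads are in C
violations = []

def trying(i):
    while True:
        flag[i] = 1                      # first stage: obtain turn
        while turn[0] != i:
            if flag[turn[0]] == 0:
                turn[0] = i
        flag[i] = 2                      # second stage
        if all(flag[j] != 2 for j in range(N) if j != i):
            return                       # enter C
        # some flag(j) = 2: goto L, i.e., retry the first stage

def process(i, rounds):
    for _ in range(rounds):
        trying(i)
        occupancy[0] += 1                # critical region
        if occupancy[0] > 1:
            violations.append(i)         # mutual exclusion broken
        occupancy[0] -= 1
        flag[i] = 0                      # exit region

threads = [threading.Thread(target=process, args=(i, 10)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert violations == []                  # no two threads in C together
```

A run like this can only fail to falsify mutual exclusion, never establish it; the assertional proof in the next section is what establishes it.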

A Correctness Argument

In this section, we sketch a correctness argument for Dijkstra's algorithm. Recall that for the synchronous model, the nicest proofs resulted from invariants describing the state of the system after some number of rounds. In the asynchronous setting there is no notion of round, so it may not be readily clear how to use assertional methods. It turns out that they can still be used, and indeed are extremely useful, but typically it is a little harder to obtain the invariants. The arguments must be applied at a finer level of granularity, to make claims about the system state after any number of individual process steps, rather than rounds.

Before presenting an assertional proof, however, we will sketch a traditional operational argument, i.e., an argument based directly on the executions of the algorithm.

In the following series of lemmas, we show that the algorithm satisfies the requirements. The first lemma is very easy to verify.

Lemma: Dijkstra's algorithm preserves cyclic region behavior for each i.

The second lemma is more interesting.

Lemma: Dijkstra's algorithm satisfies mutual exclusion.

Proof: By contradiction. Assume p_i and p_j are both simultaneously in region C in some reachable state, where i != j. Consider an execution that leads to this state. By the code, both p_i and p_j set their flag variables to 2 before entering their critical regions. Consider the last such step in which each sets its flag variable. Assume, without loss of generality, that flag(i) is set to 2 first. Then flag(i) remains 2 from that point until p_i leaves C, which must be after p_j enters C (by the assumption that they both end up in C simultaneously). So flag(i) has the value 2 throughout the time from when p_j sets flag(j) to 2 until p_j enters C (see the figure below). But during this time, p_j does a final test of flag(i) and finds it unequal to 2, a contradiction.

We will redo this argument using an invariant assertion proof later. But first, we proceed with an argument for the deadlock-freedom property.

Lemma: Dijkstra's algorithm is deadlock-free.

Proof: We again argue by contradiction. Suppose there is some fair execution that reaches a point where there is at least one process in T and no process in C, and suppose

Shared variables:
    flag: an array indexed by {1, ..., n} of integers from {0, 1, 2}, initially all 0;
          flag(i) written by p_i and read by all
    turn: integer from {1, ..., n}, initially arbitrary;
          written and read by all

Code for p_i:

** Remainder region **

try_i
L:  < flag(i) := 1 >
    while < turn != i > do
        if < flag(turn) = 0 > then < turn := i >
        end if
    end while
    < flag(i) := 2 >
    for j != i do
        if < flag(j) = 2 > then goto L
        end if
    end for
crit_i

** Critical region **

exit_i
    < flag(i) := 0 >
rem_i

Figure: Dijkstra's algorithm, showing atomic actions.

Figure: At time t1, p_i sets flag(i) to 2; at time t2, p_j tests flag(i); at time t3, p_i exits C.

that after this point, no process ever enters C. We begin by removing some complications. Note that all processes in T or E continue taking steps, and hence, if some of the processes are in E, then after one step they will be in R. So after some point, all processes are in T or R. Moreover, since there are only finitely many processes, after some point no new processes enter T. That is, after some point, all the processes are in T or R, and no process ever changes region. So we can write the execution in the form α α', where in α' there is a nonempty, fixed group of processes continuing to take steps forever in T, and no region changes occur. Call these processes contenders.

After at most a single step in α', each contender i ensures that flag(i) >= 1, and it remains >= 1 for the rest of α'. So we can assume, without loss of generality, that this is the case for all contenders throughout α'.

Clearly, if turn changes during α', it is changed to a contender's index. Moreover, we have the following claim.

Claim: In α', eventually turn acquires a contender's index.

Proof: Suppose not; i.e., the value of turn remains equal to the index of a non-contender throughout α'. Then, when any contender p_i checks, it finds turn != i and flag(turn) = 0, and hence would set turn to i. There must exist an i for which this will happen, because all contenders are either in the while loop, or in the second stage destined to fail and return to the while loop (because, by assumption, we know they don't reach C in α'). Thus, eventually in α', turn gets set to a contender's index.

Once turn is set to a contender's index, it remains equal to a contender's index (although the value may change to different contenders' indices). Then any later reads of turn and flag(turn) will yield flag(turn) >= 1, since for all contenders i, flag(i) >= 1, and so turn will not be changed. Therefore, eventually turn stabilizes to a final contender's index. Let α'' be a suffix of α' in which the value of turn is stabilized at some contender's index, say i.

Now we claim that in α'', any contender j != i eventually ends up in the first while loop, looping forever. This is because, if it is ever in the second stage, then, since it doesn't succeed in entering C, it must eventually return to L. But then it is stuck in stage 1, because turn = i != j and flag(turn) >= 1 throughout α''. So let α''' be a suffix of α'' in which all contenders other than i loop forever in the first while loop. Note that this means that all contenders other than i have their flag variables equal to 1 throughout α'''.

We conclude the argument by claiming that in α''', process i has nothing to stand in its way. This is true because, when the while loop is completed, flag(i) is set to 2, and no other process has flag = 2, so p_i succeeds in its second-stage tests and enters C.

To complete the proof, we remark that the argument for the exit region is straightforward, since no one ever gets blocked there.

6.852J/18.437J Distributed Algorithms, October
Lecturer: Nancy Lynch

Lecture: Dijkstra's Mutual Exclusion Algorithm (cont.)

In this section, we finish our discussion of Dijkstra's mutual exclusion algorithm, with an alternative proof of the mutual exclusion property of the algorithm. Last time, we saw an operational argument. Today we'll sketch an assertional proof using invariants, just as in the synchronous case. See [Goldman-Lynch, LICS] for a more detailed proof.

An Assertional Proof

To prove mutual exclusion, we must show that, for all processes p_i, p_j with i != j, it is never the case that p_i is in C and p_j is in C.

We would like to prove this proposition by induction on the number of steps in an execution. But, as usual, the given statement is not strong enough to prove directly by induction. We need to augment it with other conditions, which may involve all the state components, namely shared variables, local variables, program counters, region designations, and temporary variables.

We have to be a little more precise about the use of temporary variables. We define, for each process p_i, a new local variable S_i; this variable is used by p_i, during execution of the for loop, to keep track of the set of processes that it has already observed to have flag != 2. Initially, S_i = ∅. When the process finds flag(j) != 2 during an iteration of the for loop, it performs the assignment S_i := S_i ∪ {j}. When S_i = {1, ..., n} - {i}, the program counter of p_i moves to after the for loop. When flag(i) is set to 0 or 1, p_i assigns S_i := ∅. The figure below shows where these uses of S_i would appear in the code for p_i. Note that it becomes more convenient to rewrite the for loop as an explicit while loop.

For a given system state, we define the following sets of processes:

before-C: processes whose program counter is right after the final loop.

in-C: processes whose program counter is in region C.

Shared variables:
    flag: an array indexed by {1, ..., n} of integers from {0, 1, 2}, initially all 0;
          flag(i) written by p_i and read by all processes
    turn: integer from {1, ..., n}, initially arbitrary;
          written and read by all processes

Code for p_i:

** Remainder Region **

try_i
L:  S_i := ∅
    < flag(i) := 1 >
    while < turn != i > do
        if < flag(turn) = 0 > then < turn := i >
        end if
    end while
    < flag(i) := 2 >
    while S_i != {1, ..., n} - {i} do
        choose j in {1, ..., n} - {i} - S_i
        if < flag(j) = 2 > then goto L
        end if
        S_i := S_i ∪ {j}
    end while
crit_i

** Critical Region **

exit_i
    S_i := ∅
    < flag(i) := 0 >
rem_i

Figure: Dijkstra's algorithm, showing atomic actions and the S_i variables.

(The actual entrance to C is a separate step from the last test in the loop.)

Put in the above terms, we need to prove that |in-C| <= 1. We actually prove a stronger statement, namely that |in-C| + |before-C| <= 1. This is a consequence of the following two claims:

For all p_i, p_j with i != j: it is not the case that both i ∈ S_j and j ∈ S_i.

For all p_i ∈ before-C ∪ in-C: S_i = {1, ..., n} - {i}.

i i

First a small lemma

Lemma If S then ag i

i

Pro of From the denition of S and the co de for the algorithm

i

Now the main nonexistence claim

Lemma p p j i j i S j S

i j j i

Pro of By induction on the length of executions The basis case is easy since in the initial

state all sets S are emptyNow consider the case where j gets added to S the reader can

i i

convince himherself that this is actually the only case of interest This must o ccur when i

is in its nal lo op testing ag Sincej gets added to S itmust b e the case that ag j

i

j

b ecause otherwise i would exit the lo op By the contrap ositive of Lemma then S

j

and so i S

j

Now, the second claim.

Lemma: For all p_i ∈ before-C ∪ in-C, S_i = {1, ..., n} - {i}.

Proof: By induction on the length of the execution. In the basis case, all processes are in region R, so the claim holds vacuously. For the inductive step, assume the claim holds in any reachable state, and consider the next step. The steps of interest are those where p_i exits the second loop normally (i.e., doesn't goto L), or enters C. Clearly, if p_i exits the loop normally, S_i contains all indices except i, since this is the termination condition for the loop when rewritten explicitly in terms of the S_i sets. Upon entry to the critical region, S_i does not change. Thus the claim holds in the inductive step.

We can now prove the main result, by contradiction.

Lemma: |in-C| + |before-C| <= 1.

Proof: Assume, for contradiction, that in some state reachable in an execution, |in-C| + |before-C| >= 2. Then there exist two processes p_i and p_j, i != j, such that p_i ∈ before-C ∪ in-C and p_j ∈ before-C ∪ in-C. By the previous lemma, S_i = {1, ..., n} - {i} and S_j = {1, ..., n} - {j}. But by the non-existence lemma, either i ∉ S_j or j ∉ S_i, a contradiction.
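For a small instance, the invariant can also be verified exhaustively. The sketch below is our own encoding of the algorithm for n = 2, one atomic read or write per step (the pc labels are invented names); it enumerates every reachable state and checks |in-C| + |before-C| <= 1 in each:

```python
from collections import deque

# Brute-force reachability check of |in-C| + |before-C| <= 1 for n = 2.
# State = (pc pair, tmp pair, flag pair, turn); tmp holds the last value
# of turn read, since reading turn and reading flag(turn) are separate steps.

def steps(state):
    pc, tmp, flag, turn = state
    out = []
    for i in (0, 1):
        j = 1 - i
        def emit(newpc, newtmp=None, newflag=None, newturn=turn):
            q, u, g = list(pc), list(tmp), list(flag)
            q[i] = newpc
            if newtmp is not None:
                u[i] = newtmp
            if newflag is not None:
                g[i] = newflag
            out.append((tuple(q), tuple(u), tuple(g), newturn))
        a = pc[i]
        if a == 'rem':
            emit('L')                                   # try_i
        elif a == 'L':
            emit('turn?', newflag=1)                    # <flag(i) := 1>
        elif a == 'turn?':                              # <read turn>
            emit('stage2' if turn == i else 'flag?', newtmp=turn)
        elif a == 'flag?':                              # <read flag(turn)>
            emit('set-turn' if flag[tmp[i]] == 0 else 'turn?')
        elif a == 'set-turn':
            emit('turn?', newturn=i)                    # <turn := i>
        elif a == 'stage2':
            emit('check', newflag=2)                    # <flag(i) := 2>
        elif a == 'check':                              # <read flag(j)>
            emit('L' if flag[j] == 2 else 'before-C')
        elif a == 'before-C':
            emit('crit')                                # separate entrance step
        elif a == 'crit':
            emit('exit')                                # leave C
        elif a == 'exit':
            emit('rem', newflag=0)                      # <flag(i) := 0>
    return out

init = [(('rem', 'rem'), (0, 0), (0, 0), t) for t in (0, 1)]
seen, frontier = set(init), deque(init)
while frontier:
    s = frontier.popleft()
    assert sum(p in ('before-C', 'crit') for p in s[0]) <= 1
    for s2 in steps(s):
        if s2 not in seen:
            seen.add(s2)
            frontier.append(s2)
```

Note the separate before-C step: merging the final test with the entrance into one atomic step would hide exactly the interleavings the invariant is about.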

Running Time

We would like to bound the time from any state in which there is some process in T and no one in C, until someone enters C. In the asynchronous setting, however, it is not clear what "time" should mean. For instance, unlike the synchronous case, there is no notion of rounds to count in this model. The way to measure time in this model is as follows. We assume that each step occurs at some real time point, and that the algorithm begins at time 0. We impose upper bounds on the step time for individual processes, denoted by s, and on the maximum time anyone uses the critical region, denoted by c. Using these bounds, we can deduce upper bounds on the time required for interesting activity to occur.

Theorem: In Dijkstra's algorithm, suppose that at a particular time there is some process in T and no process in C. Then, within time O(sn), some process enters C.

We remark that the constant involved in the big-O is independent of s, c, and n.

Proof: We analyze the time along the same lines as we proved deadlock-freedom. Suppose the contrary, and consider an execution in which, at some point, process p_i is in T and no process is in C, and in which no process enters C for time ksn, for some particular large constant k.

First, we claim that the time elapsed from the starting point of the analysis until p_i tests turn is at most O(sn). (We will assume that k has been chosen to be considerably larger than the constant hidden by this big-O.) This is because p_i can, at worst, spend this much time before failing in the second stage and returning to L. We know that it must fail, because otherwise it would go to C, which we have assumed doesn't happen this quickly.

Second, we claim that the additional time elapsed until turn equals a contender's index is at most O(s). To see this, we need a small case analysis. If, when p_i tests turn, turn holds a contender's index, we are done; so suppose that this is not the case, specifically, suppose that turn = j, where j is not a contender. Then, within time O(s) after this test, p_i will test flag(j). If p_i finds flag(j) = 0, then p_i sets turn to i, which is the index of a contender, and we are again done. On the other hand, if it finds flag(j) != 0, then it must be that, in between the test of turn by p_i and the test of flag(j), process j entered the trying region and became a contender. If turn has not changed in the interim, then turn is equal to the index of a contender (j), and we are done. On the other hand, if turn has changed in the interim, then it must have been set to the index of a contender. So again we are done.

Then, after an additional time O(s), no process can ever reset turn or enter the second stage. Then, time O(sn) later, all others in the second stage will have left it, since they do not succeed and so return to L. Then, within an additional time O(sn), i must succeed in entering C.

Improved Mutual Exclusion Algorithms

While Dijkstra's algorithm guarantees mutual exclusion and deadlock-freedom, there are other desirable properties that it does not have. Most importantly, it does not guarantee any sort of fairness to individual processes; that is, it is possible that one process will continuously be granted access to its critical region, while other processes trying to gain access are prevented from doing so. This situation is sometimes called process starvation. Note that this is a different kind of fairness from that discussed earlier: this is fairness to the invoked operations, rather than fairness of process steps.

Another unattractive property of Dijkstra's algorithm is that it uses a multi-reader multi-writer variable for turn. This may be difficult or expensive to implement in certain kinds of systems. Several algorithms that improve upon Dijkstra's have been designed. We shall look at some algorithms developed by Peterson, and by Peterson and Fischer.

No-Starvation Requirements

Before we look at alternative mutual exclusion algorithms, we consider what it means for an algorithm to guarantee fairness. Depending upon the context in which the algorithm is used, different notions of fairness may be desirable. Specifically, we shall consider the following three ideas that have been used to refine the requirements for mutual exclusion algorithms.

Lockout-freedom: In a fair execution (with respect to processes' steps of the algorithm), where the users always return the resource, any process in T eventually enters C. Also, in any fair execution, any process in E eventually enters R.

Time bound b: If the users always return the resource within time c of when it is granted, and processes always take steps in T or E within time s, then any process in T enters C within time b. (Note: b will typically be some function of s and c.) Also, if processes always take steps in T or E within time s, then any process in E enters R within time b.

Number of bypasses b: Consider an interval of an execution throughout which some process p_i is in T (more specifically, has performed the first non-input step in T). During this interval, any other process p_j, j != i, can only enter C at most b times. Also, the same for bypass in E.

In each case above, we have stated fairness conditions for the exit region that are similar to those for the trying region. However, all the exit regions we will consider are actually trivial, and satisfy stronger properties (e.g., are wait-free).

Some implications. There are some simple relations among the different requirements on mutual exclusion protocols.

Theorem: If an algorithm is lockout-free, then it is deadlock-free.

Proof: Consider a point in a fair execution such that i is in T, no one is in C, and suppose that no one ever enters C. Then this is a fair execution in which the user always returns the resource, so the lockout-freedom condition implies that i eventually enters C, as needed for deadlock-freedom.

Likewise, consider a point in a fair execution such that i is in E. By lockout-freedom, i eventually enters R, as needed for deadlock-freedom.

Theorem: If an algorithm is deadlock-free and has a bypass bound, then the algorithm is lockout-free.

Proof: Consider a point in a fair execution. Suppose that the users always return the resource, and i is in T. Deadlock-freedom and the users returning all resources imply that the system keeps making progress as long as anyone is in T. Hence, the only way to avoid breaking the bypass bound is for i eventually to reach C.

The argument for the exit region is similar.

Theorem: If an algorithm has any time bound, then it is lockout-free.

Proof: Consider a point in a fair execution. Suppose the users always return the resource, and i is in T. Associate times with the events in the execution in a monotone nondecreasing unbounded way, so that the times for steps of each process are at most s, and the times for all the critical regions are at most c. By the assumption, i enters C within the time bound; in particular, i eventually enters C, as needed for lockout-freedom.

The argument for the exit region is similar.

In the following, we shall see some protocols that satisfy some of these more sophisticated requirements.

Peterson's Two-Process Algorithm

We begin with an algorithm that gives lockout-freedom and a good time bound, with an interesting analysis. We start with a 2-process solution, for processes p_0 and p_1. We write ī for 1 − i, i.e., the index of the other process. The code is given in the figure below.

Shared variables:
    level: a Boolean array indexed by {0, 1}, initially 0
    turn: a Boolean variable, initial state arbitrary

Code for p_i:

    ** Remainder Region **

    try_i:
        level[i] := 1
        turn := i
        wait for level[ī] = 0 or turn = ī
    crit_i

    ** Critical Region **

    exit_i:
        level[i] := 0
    rem_i

Figure: Peterson's algorithm for two processes.

We now argue that the algorithm is correct.

Theorem: Peterson's two-process algorithm satisfies mutual exclusion.

Proof Sketch: It is easy to show by induction that

    level[i] = 1 if and only if p_i is at the wait statement (atwait), just before the
    critical region (beforeC), or in the critical region (inC).

Using the first invariant, we can show by induction that:

    if p_i is in beforeC or inC, and p_ī is in atwait, beforeC, or inC, then turn = ī.

There are three key steps to check. First, when i passes the wait test, we have one of the following: either turn = ī, and we are done; or else level[ī] = 0, in which case we are done by the first invariant. Second, when ī reaches atwait, it explicitly sets turn := ī. Third, when turn gets set to i, this must be a step of process p_i, performed when it is not in the indicated region.

Theorem: Peterson's two-process algorithm satisfies deadlock-freedom.

Proof Sketch: We prove deadlock-freedom by contradiction. That is, assume that at some point i is in T, there is no process in C, and no one ever enters C later. We consider two cases.

If ī eventually enters T, then both processes must get stuck at the wait statement, since they don't enter C. But this cannot happen, since the value of turn must be favorable to one of them.

On the other hand, suppose that ī never reaches T. In this case, we can show by induction that level[ī] eventually becomes and stays 0, contradicting the assumption that i is stuck in its wait statement.

Theorem: Peterson's two-process algorithm satisfies lockout-freedom.

Proof Sketch: We show the stronger property of bounded bypass. Suppose the contrary, i.e., that i is in T (after setting level[i] := 1), and ī enters C three times. By the code, in the second and third times, ī sets turn := ī and then must afterwards see turn = i. This means that there are two occasions upon which i must set turn := i, since only i can set turn := i. But by our assumption, turn := i is performed only once during one pass by process i through the trying region, a contradiction.
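To make the two-process algorithm concrete, here is a minimal executable sketch in Python. The threading harness, the iteration count N, and the shared counter are our own illustration, not part of the notes; the busy-wait relies on CPython's effectively sequentially consistent interleaving of simple list and variable accesses.

```python
import sys
import threading

sys.setswitchinterval(5e-4)   # switch threads often so busy-waiting stays cheap

level = [0, 0]   # level[i] = 1 while p_i is in its trying/critical region
turn = 0         # shared arbitration variable; the later writer waits
count = 0        # data protected by the critical region
N = 500          # critical-section entries per process

def p(i):
    global turn, count
    other = 1 - i
    for _ in range(N):
        # trying region (try_i)
        level[i] = 1
        turn = i
        while not (level[other] == 0 or turn == other):
            pass                       # the "wait for" statement
        # critical region: an unprotected read-modify-write
        count += 1
        # exit region (exit_i)
        level[i] = 0

threads = [threading.Thread(target=p, args=(i,)) for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(count)   # 2 * N when mutual exclusion holds
```

If the `turn = i` line or the wait loop is removed, the two read-modify-write increments can race and the final count may fall short of 2N (though under the GIL such races are rare).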

Tournament Algorithm

We now extend the basic two-process algorithm to a more general number of processes. Here, for simplicity, we assume that n is a power of 2. The basic idea is to run a tournament using the 2-process strategy.

In the following presentation, we take the Peterson-Fischer tournament algorithm and rephrase it using Peterson's simpler 2-process algorithm. We remark that the original Peterson-Fischer algorithm is quite messy, at least too messy to prove formally here. The code given here seems simpler, but it still needs careful checking and proof. A disadvantage of this code, as compared to the Peterson-Fischer code, is that this one involves multi-writer variables, while the original protocol works with single-writer registers.

Before we give the algorithm, we need some notation. For 0 <= i <= n − 1 and 1 <= k <= log n, we define the following notions.

The (i, k)-competition is the number given by the high-order (log n) − k bits of i, i.e.,

    comp(i, k) = floor(i / 2^k).

The (i, k)-role is the ((log n) − k + 1)st bit of the binary representation of i, i.e.,

    role(i, k) = floor(i / 2^(k−1)) mod 2.

The (i, k)-opponents is the set of numbers with the same high-order (log n) − k bits as i, and opposite ((log n) − k + 1)st bit, i.e.,

    opponents(i, k) = { j : comp(j, k) = comp(i, k) and role(j, k) != role(i, k) }.
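The index arithmetic above is easy to get wrong, so here is a small Python sketch of comp, role, and opponents. The function names follow the notes; the n = 8 example values are our own.

```python
def comp(i, k):
    # high-order (log n - k) bits of i, i.e., floor(i / 2^k)
    return i >> k

def role(i, k):
    # the ((log n) - k + 1)st bit of i, i.e., floor(i / 2^(k-1)) mod 2
    return (i >> (k - 1)) & 1

def opponents(i, k, n):
    # processes with the same high-order bits as i but the opposite role bit
    return {j for j in range(n)
            if comp(j, k) == comp(i, k) and role(j, k) != role(i, k)}

# With n = 8, process 0 meets process 1, then {2, 3}, then {4, 5, 6, 7}:
print(opponents(0, 1, 8), opponents(0, 2, 8), opponents(0, 3, 8))
```

The number of opponents at level k is 2^(k−1), which is what the time analysis below charges for each scan of the wait condition.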

Shared variables:
    level: an array indexed by [0..n − 1] of {0, ..., log n}, initially all 0
    turn: a Boolean array indexed by all binary strings of length at most (log n) − 1,
          initial state arbitrary

Code for p_i:

    ** Remainder Region **

    try_i:
        for k := 1 to log n do
            level[i] := k
            turn[comp(i, k)] := role(i, k)
            wait for [level[j] < k for all j in opponents(i, k)] or turn[comp(i, k)] != role(i, k)
        end for
    crit_i

    ** Critical Region **

    exit_i:
        level[i] := 0
    rem_i

Figure: Peterson's tournament algorithm.

The code of the algorithm is given in the figure above. We only sketch the correctness arguments for the algorithm.

Theorem: The tournament algorithm satisfies mutual exclusion.

Proof Sketch: The proof follows the same ideas as for two processes. This time, we must show that at most one process from each subtree of a level-k node reaches level k + 1 (or finishes, if k = log n). This is proven using an invariant.

We think of the main loop of the algorithm unfolded. For 1 <= k <= log n, define done_k to be the set of points in the code from right after the level-k loop body until the end of C, and let wait_k be the set of points at the level-k wait statement. Consider the following invariant:

    (*) For all k, 1 <= k <= log n: if i is in done_k, then either
        opponents(i, k) has no member in wait_k or done_k,
        or turn[comp(i, k)] != role(i, k).

We first show that (*) implies the result: if two processes i != j reach the critical section together, consider their lowest common competition, that is, the minimum l such that comp(i, l) = comp(j, l). Suppose this competition is at level k. Then both i and j are in done_k, and since role(i, k) != role(j, k), the invariant requires conflicting values of turn[comp(i, k)], a contradiction.

The inductive proof of (*) is roughly as follows. Fix a process i and a level number k. We need to verify the following key steps: (a) when i passes the level-k test (this case is easy); (b) an opponent setting turn; and (c) a relative setting turn: i can't, from where it is, and others can't reach there, by the inductive hypothesis for level k − 1.

Theorem: The tournament algorithm satisfies deadlock-freedom.

Proof Sketch: Deadlock-freedom follows from lockout-freedom, which in turn follows from a time bound. We show a time bound: specifically, we claim that within O(sn² + cn) time, any process in T enters C. We prove this bound using a recurrence, as follows.

Define T to be the maximum time from when a process enters T until it enters C. For 1 <= k <= log n, let T(k) be the maximum time it takes a process to enter C after winning (completing the body of the loop) at level k. We wish to bound T.

By the code, we have T(log n) = s, since only one step is needed to enter C after winning at the top level. Now, in order to find T, we set up a recurrence for T(k − 1) in terms of T(k).

Suppose that p_i has just won at level k − 1 and advances to level k. First, p_i does a couple of assignments, taking O(s) time, and then it reaches the wait loop. Now we consider two cases.

1. If p_i finds the wait condition to be true the first time it tests it, then p_i immediately wins at level k. Since the number of opponents at level k is 2^(k−1), and since p_i must test the level of all its opponents, this fragment takes O(2^k · s) time. The remaining time until it reaches the critical region is at most T(k), by definition. The total time in this case is at most O(2^k · s) + T(k).

2. If p_i finds the wait condition to be false the first time it tests it, then there is a competitor, say p_j, at a level >= k. In this case, we claim that either p_i or p_j must win at level k within time O(2^k · s) of when p_i reaches the wait condition. (In fact, p_j might have already won before p_i reached the wait condition.) This is true because within time O(2^k · s), p_j must reach its wait condition, and then one of the two wait conditions must be true, because of the value of turn[comp(i, k)]. Within an additional O(2^k · s), someone will discover that its wait condition is true, and win. Now, if p_i is the winner, then it takes an additional T(k) until it reaches the critical region. If p_j is the winner, then within additional time T(k), p_j reaches C, and within additional time c, it leaves C. After another O(s) time, it resets level[j] := 0. Thereafter, we claim that p_i will soon win at level k: right after p_j resets its level, all the level values of opponents(i, k) are less than k. If they remain this way for time O(2^k · s), p_i will find its condition true and win. If not, then within this time some opponent of i reaches level k; then, within additional time s, it sets turn to be favorable to i, and within an additional O(2^k · s), p_i will discover this fact and win.

The total time is thus at most the maximum of the two cases above, that is,

    max{ O(2^k · s) + T(k),  O(2^k · s) + 2T(k) + c }.

Thus, we need to solve the following recurrence:

    T(k − 1) = 2T(k) + O(2^k · s) + c
    T(log n) = s.

Let a be the constant hidden by the big-O notation. Unrolling the recurrence down from T(log n) = s, we have

    T(0) = 2^(log n) · s + sum_{k=1}^{log n} 2^(k−1) (a · 2^k · s + c)
         = n·s + O(s·n²) + (n − 1)·c
         = O(s·n² + c·n).
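The recurrence can be checked numerically. The sketch below (parameter values are arbitrary illustrations, not from the notes) unrolls T(k−1) = 2T(k) + a·2^k·s + c from T(log n) = s down to T(0); with c = 0 the result roughly quadruples when n doubles, as the O(s·n²) term predicts.

```python
import math

def T0(n, s, c, a=1.0):
    """Unroll T(k-1) = 2*T(k) + a*2**k*s + c from T(log n) = s down to T(0)."""
    L = int(math.log2(n))
    t = s                        # T(log n) = s
    for k in range(L, 0, -1):    # compute T(k-1) from T(k)
        t = 2 * t + a * (2 ** k) * s + c
    return t

# with c = 0, the quadratic s-term dominates; doubling n roughly quadruples T(0)
print(T0(64, 1, 0), T0(128, 1, 0))
```

Setting s = 0 isolates the other term: the unrolling then yields exactly (2^(log n) − 1)·c = (n − 1)·c.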

Bounded bypass. As a last remark, we point out that Peterson's tournament algorithm does not have any bound on the number of bypasses. To see this, consider an execution in which one process i arrives at its leaf and takes its steps with intervening times equal to the upper bound s, while meanwhile another process j arrives in the other half of the tree, going much faster. Process j can reach all the way to the top and win, and in fact it can repeat this arbitrarily many times before i even wins at level 1. This is due to the fact that no lower bound was assumed on process step times. Note that there is no contradiction between unbounded bypass and a time upper bound, because the unbounded bypass can only occur when some processes take steps very fast.

Shared variables:
    level: an array indexed by [1..n] of {0, ..., n − 1}, initially all 0
    turn: an array indexed by [1..n − 1] of {1, ..., n}, initial state arbitrary

Code for p_i:

    ** Remainder Region **

    try_i:
        for k := 1 to n − 1 do
            level[i] := k
            turn[k] := i
            wait for [level[j] < k for all j != i] or turn[k] != i
        end for
    crit_i

    ** Critical Region **

    exit_i:
        level[i] := 0
    rem_i

Figure: Peterson's iterative algorithm.

Remark: In the original Peterson-Fischer tournament algorithm with single-writer variables, the turn information is not kept in centralized variables, but is rather distributed among variables belonging to the separate processes. This leads to a complex system of repeated reads of the variables in order to obtain consistent readings.

Iterative Algorithm

In Peterson's newer paper, there is another variant of the algorithm presented above. Instead of a tournament, this one conducts only one competition for each level; and rather than allowing only one winner for each competition, it makes sure there is at least one loser. In this algorithm, all n processes can compete at the beginning, but at most n − 1 can have won at level 1 at any particular time, and in general, at most n − k can have won at level k. Thus, after n − 1 levels we get only one winner. The code is given in the figure above.
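As a sanity check on the iterative algorithm, here is a small random-interleaving simulation in Python. The scheduler, step granularity, and parameters are our own; for brevity, the wait predicate is evaluated in one atomic step, which is a coarsening of the real algorithm's separate reads.

```python
import random

def simulate(n, steps=50000, seed=1):
    """Simulate Peterson's iterative algorithm, asserting mutual exclusion."""
    level = [0] * (n + 1)          # processes are numbered 1..n
    turn = [0] * n                 # turn[k] for k = 1..n-1; initial value arbitrary
    pc = {i: ('set_level', 1) for i in range(1, n + 1)}
    in_crit = set()
    rng = random.Random(seed)
    crit_entries = 0
    for _ in range(steps):
        i = rng.randrange(1, n + 1)
        if pc[i] == 'crit':        # leave C and execute the exit region
            in_crit.discard(i)
            level[i] = 0
            pc[i] = ('set_level', 1)
            continue
        op, k = pc[i]
        if op == 'set_level':
            level[i] = k
            pc[i] = ('set_turn', k)
        elif op == 'set_turn':
            turn[k] = i
            pc[i] = ('wait', k)
        else:                      # one test of the wait condition
            if all(level[j] < k for j in range(1, n + 1) if j != i) \
                    or turn[k] != i:
                if k < n - 1:
                    pc[i] = ('set_level', k + 1)
                else:
                    assert not in_crit, "mutual exclusion violated"
                    in_crit.add(i)
                    pc[i] = 'crit'
                    crit_entries += 1
    return crit_entries

print(simulate(4))   # number of critical-section entries observed
```

Note that waiting only after writing turn[k] := i is what makes the arbitrary initial value of turn harmless: a process never tests turn[k] before having set it to its own index.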

Distributed Algorithms, October
Lecturer: Boaz Patt-Shamir

Lecture

Atomic Objects

In the past couple of lectures, we have been considering the mutual exclusion problem in the asynchronous shared-memory model. Today we shall consider a new problem in the same model. The new problem is the implementation of an interesting programming-language primitive: a kind of data object called an atomic object. Atomic objects have recently been proposed as the basis for a general approach to solving problems in the asynchronous shared-memory model. This approach can be thought of as an object-oriented style of constructing asynchronous concurrent systems.

The system architecture is the same as before; see the figure below.

Figure: a shared memory system (processes connected to shared memory cells by interface lines).

Before giving a formal definition of atomic objects, we give a simple example.

Example: Read/Write Object

The read/write object is somewhat similar to a shared read/write variable, but with separate invocation and response actions. More specifically, it has two types of input actions: read_i, where i is a process, and write(v)_i, where i is a process and v a value to be written. A read is supposed to return a value; the return event is represented by a separate output action, which we call read-respond(v)_i. For uniformity, we also give the write a corresponding response action, write-respond_i. The write-respond_i is merely an ack, whose meaning is that the write action is done.

The separation of actions, which can be viewed as giving a finer atomicity, makes this kind of object a little different from a shared read/write register. In particular, it permits concurrent access to the object from different users.

The read/write object is one of the simplest useful objects. We can easily define other, fancier objects, for example:

- queues, with insert and delete actions,
- read-modify-write objects (to be discussed later),
- snapshot objects (to be considered next),

and more. Each object has its own characteristic interface, with inputs being invocations of operations, and outputs being responses.

Definition

In this section, we give an elaborate definition of the requirements on an atomic object. First, recall the cyclic discipline for invocations and responses for the mutual exclusion object. Something similar is required of any atomic object: we define a generalized notion of well-formedness at the interface. Here, this says that for each process i, whose connection with the external environment is called line i, the invocations and responses alternate, starting with an invocation. We stress that not everything needs to alternate in this way, only the invocations and responses on any particular line; there can be concurrency among the invocations on different lines. We think of the object as being used in combination with user programs that preserve (i.e., are not the first to violate) this well-formedness condition, and the first requirement on an object implementation is that it also preserve well-formedness.

Property: Preserves well-formedness.

Also as before, we make a restriction on when our processes can take steps. Namely, in well-formed executions, process i is only permitted to take steps in between an invocation and a response at i, that is, while an invocation is active at i.

The next correctness condition involves the correctness of the responses. This correctness is defined in terms of a related serial specification. A serial specification describes the correct responses to a sequence of operation invocations, when they are executed sequentially (i.e., without concurrent invocations). For instance, for a read/write object having initial value 0, the correct sequences include the following (subscripts denote processes/line numbers):

    read_1, read-respond(0)_1, write(8)_2, write-respond_2, read_1, read-respond(8)_1

Usually, these sequences are specified by a simple state machine, in which each operation causes a change of state and a possible return value. We remark that this serial specification is now standard in the theory of data types.

So let S be a serial specification for a particular set of operation types (e.g., read, write, etc.). An atomic object corresponding to S has the same interface as S. In addition to the well-formedness condition above, we require something about the contents of the sequences.

Property Atomicity

Wewanttosay that anywellformed execution must lo ok like a sequence in the serial

sp ecication S The wayofsaying this is a little complicated since wewant a condition

that also makes sense for executions in which some of the op erations dont return That is

even if for some of the lines i the nal op eration gets invoked and do es not return wewould

like that the execution ob ey some rules Sp ecicallywe require for a nite or innite

wellformed execution of the ob ject that it b e p ossible to select the following

For each completed op eration a serialization point within the activeinterval

An arbitrary subset T of the incomplete op erations ie those having an invo cation

but no resp onse such that for each op eration in T we can select a serialization p oint

sometime after the invo cation and a resp onse action

The p oints should b e selected in a waysuchthatifwemovetheinvo cation and resp onse

actions for all the completed op erations and all the op erations in T so that they ank their

resp ective serialization p oints ie if we shrink the completed op erations and all the op er

ations in T to contain only their resp ective serialization p oints then the resulting sequence

of invo cations and resp onses is in S ie is correct according to the serial sp ecication

So the atomicity condition essentially stipulates that an execution lo oks as if the op era

tions that were completed and some of the incomplete ones were p erformed instantaneously

at some time in their intervals Figure shows a few scenarios involving a twopro cess readwrite ob ject

(a) (c) write(8) write−respond write(8) process 1 Time process 2 read read−respond(0) read read−respond(8)

(b) (d) write(8) write−respond write(8)

read read−respond(8) read read−respond(0) read read−respond(8)

(e) write(8)

read read−respond(0) read read−respond(0)

Figure p ossible executions of a readwrite ob ject with two pro cesses The arrows repre

sent serialization p oints In the scenario e there are innitely many reads that return and

consequently the incomplete write do es not get a serialization p oint

Property Liveness

This prop erty is simple we require that in every wellformed fair executionevery invo

cation receives a resp onse This rules out for example the dead ob ject that never returns

resp onses

Notice that here we are using the underlying fairness notion for the mo del ie that

pro cesses continue to take steps when those are enabled In this case the statementof

Prop erty could b e simplied The reason wehavegiven the more complicated statement

of Prop erty is that in the sequel we shall consider nonfair executions of atomic ob ject

implementations This will b e when we are discussing resiliencyandwaitfreedom Note

that Prop erty so far only makes sense in the particular mo del we are using with pro cesses

to whichitmakes sense to b e fair

Discussion Intuitivelywecansay that an atomic ob ject is a concurrentversion of a

corresp onding serial sp ecication It is appropriate to ask what go o d is this concept The



In the literature waitfreedom is sometimes referred to as waitfreeness

imp ortant feature we get is that we can use an atomic ob ject in mo dular construction of sys

tems supp ose that we design a system to use instantaneous ob jects eg readwrite queues

or something more interesting but we dont actually have such instantaneous ob jects If

wehave atomic versions of them then we can plug those implementations in and as we

will later see under certain circumstances the resulting system b ehaves the same as far

as a user can tell We shall return to this discussion later

Atomic Snapshots

We shall now see how simple atomic objects can be used to implement seemingly much more powerful objects. Specifically, we'll see how to implement an atomic snapshot object using atomic registers. We follow the general ideas of Afek, Attiya, et al. We start with a description of the problem, and then give a simple solution that uses registers of unbounded size. Finally, we refine the construction to work with bounded registers.

Problem Statement

An atomic snapshot object models an entire memory, divided into n words. It has n update lines and n snap lines (see the figure below). The update(v)_i writes v into word i, and the snap_i returns the vector of all the latest values.

As usual, the atomic version of this object has the same responses as if each operation were shrunk to a point in its interval.

Figure: schematic interface of an atomic snapshot object, with n update lines and n snap lines.

The model of computation we assume is that of 1-writer, n-reader atomic registers.

Unbounded Algorithm

Suppose that each update writes into its register in such a way that each value can be verified to be new. A simple way to do this is to attach a sequence number to each value: whenever a newer value is written, it is written with an incremented sequence number. For this rule, we have the following simple property:

    Suppose every update leaves a unique mark in its register. If two consecutive
    global reads return identical values, then the values returned are consistent,
    i.e., they constitute a true snapshot.

The observation above immediately gives rise to the following simplistic algorithm: each update increments the local sequence number; each snap collects all the values repeatedly, until two identical consecutive sets of collected values are seen (a successful double collect). This common set is returned by the snap-respond.

This algorithm always returns correct answers, but it suffers from the problem that a snap may never return if there are updates invoked concurrently.

For this problem, however, we have the following observation (see the figure below):

    If a snap sees a value being updated 3 times, then the updater executes a
    complete update operation within the interval of the snap.

Figure: a complete update must be contained in the interval containing three changes of a single value. Notice the embedded snap in the middle update.

The observation above leads us to the main idea of the algorithm, which can be informally stated as "if you want to write, then read first". Specifically, we extend the algorithm so that before an update process writes, it must take a snap action, and save the result of this internal snap somewhere accessible to all others. We shall use the term embedded snap for the snaps activated by updates.

We can now describe the unbounded algorithm. The system structure is depicted in the figure below. First, we describe the state variables.

For each update_i, we have a register which is a record with the following fields:

- value v_i,
- sequence number s_i,
- view (i.e., a vector of values) G_i.

Figure: system structure for the algorithm; each update_i performs an embedded snap before writing its register, which is read by all snap_j processes.

The code for a snap_i process is as follows:

    Read all registers, until either
    1. you see equal s_j's for all j in two consecutive reads (double collect), or
    2. you see 3 distinct values of s_k for some k.

    In the first case, return the repeated vector of v_j's.
    In the second case, return the second G_k for that k (borrowed view).

The code for an update(v)_i process is as follows:

    1. Do snap_i (embedded snap procedure).
    2. Write, atomically:
       - an incremented sequence number in s_i,
       - v in v_i, and
       - the value returned by the snap in G_i.
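The unbounded algorithm can be sketched directly in Python. The register layout as a tuple (s_i, v_i, G_i) and the sequential driver below are our own illustration; a single tuple assignment stands in for the atomic write of the whole record.

```python
n = 3
# reg[i] = (sequence number s_i, value v_i, embedded view G_i)
reg = [(0, 0, (0,) * n)] * n

def collect():
    return list(reg)               # one read of each register

def snap():
    old = collect()
    seen = [{old[j][0]} for j in range(n)]   # sequence numbers observed per j
    while True:
        new = collect()
        if all(old[j][0] == new[j][0] for j in range(n)):
            return tuple(r[1] for r in new)  # successful double collect
        for j in range(n):
            seen[j].add(new[j][0])
            if len(seen[j]) >= 3:            # j updated twice inside our interval
                return new[j][2]             # return its borrowed view
        old = new

def update(i, v):
    view = snap()                  # embedded snap, saved for borrowers
    s, _, _ = reg[i]
    reg[i] = (s + 1, v, view)      # one atomic write of the whole record

update(0, 5)
update(1, 7)
print(snap())
```

In this sequential driver every snap succeeds by double collect; the borrowed-view branch only fires when updates run concurrently with a snap.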

Correctness. The well-formedness condition is clearly maintained. The more interesting parts are serializability and termination.

Theorem: The algorithm satisfies the atomicity condition.

Proof: We construct explicit serialization points in the operations' intervals, at which the read or write can be said to have occurred. Clearly, an update must be serialized at the point of its write.

For the snap operation, the situation is a bit more complicated. Consider the sequence of writes that occurs in any execution. At any point in time, there is a unique vector of v_i's; call these the acceptable vectors. We show that only acceptable vectors are obtained by snaps. We will construct serialization points for all snaps, real and embedded, as follows.

First, consider snaps that terminate with successful double collects. For these, we can pick any point between the end of the first collect and the beginning of the second as a serialization point. It is clear that the vector returned by the snap-respond is exactly the acceptable vector at this point, because there is no change in that interval, by the assumption that any change affects the sequence numbers.

Second, consider snaps that terminate with a borrowed view. We pick serialization points for these by induction on the order of their completion. The base case is trivial (no such snaps). For the inductive step, note that the borrowed view of the k-th such snap is the result of some other snap, which is completely contained within the snap at hand. By induction, the former was already assigned a point; we choose that same point for the snap at hand. This point is in the bigger snap's interval, since it is in the smaller one.

By the induction, we also have that the value returned is exactly the acceptable vector at the serialization point.

Theorem: In a fair execution, each snap and update terminates in O(n²) memory access steps.

Proof: For a snap, it follows from the pigeonhole principle that after at most O(n) collects of all the values, there must be a process with 3 changes. Since in each collect the snap process reads n values, the O(n²) bound follows. For an update, we have O(n²) too, because of the embedded snap, which is followed only by a single write.

Bounded Register Algorithm

The requirement of the previous algorithm to have unbounded registers is problematic: theoretically, this means that the algorithm needs unbounded space. The point is that correctness only required that a change be detectable; having a sequence number gives us a total order, which is far more than we need.

If we think about the algorithm a little, we can see that the purpose of the sequence numbers was that snap processes could tell when an update_i produced a new value. This, however, can be communicated explicitly, using handshake bits. More specifically, the main idea is as follows. Instead of keeping sequence numbers, for each update_i keep n pairs of handshake bits, for communication between update_i and snap_j for all j (actually 2n pairs, for real and embedded snaps). The handshake protocol is similar to the Peterson-Fischer 2-process mutual exclusion algorithm: update_i sets the bits to be unequal, and snap_i sets them equal.

We use the following notation for the handshake bits. For each (update_i, snap_j) pair, we have bits p_ij (at i) and q_ji (at j). With this notation, the handshake protocol is simply the following:

- When update_i writes: p_ij := not q_ji.
- Before snap_j reads: q_ji := p_ij.
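The effect of one handshake pair can be illustrated in isolation. This is a toy sketch of our own; in the algorithm, p_ij lives in update_i's register and q_ji in snap_j's, and each bit is written only by its owner.

```python
p = False   # p_ij: written only by update_i
q = False   # q_ji: written only by snap_j

def update_write():
    global p
    p = not q           # the updater makes the pair unequal

def snap_prepare():
    global q
    q = p               # the snapper makes the pair equal again

def moved():
    return p != q       # unequal bits: a write happened since snap_prepare

snap_prepare()
before = moved()        # nothing written yet
update_write()
after = moved()         # the write is now detectable
print(before, after)
```

A snap that reads equal bits in both of its collects concludes that the corresponding updater did not complete a write in between, which is exactly the role the sequence numbers played before.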

The handshake protocol is nearly sufficient. As we shall see, it can be the case that two consecutive values are confused. To overcome this problem, each update_i has an additional toggle bit, toggle_i, that it flips during each write. This ensures that each write changes the register value.

We can now give a sketch of the code of the bounded algorithm. The complete algorithm can be found in the handout.

update_i:
    1. Read the handshake bits q_ji, for all j.
    2. Snap.
    3. Write the value, the borrowed view, the negated handshake bits
       (p_ij := not q_ji), and the negated toggle bit.

snap_j:
    Repeat:
        Read the handshake bits p_ij, and set q_ji := p_ij, for all i.
        Do two collects.
        If both collects are equal in all p_ij and toggle_i, then return the common vector;
        else record who has moved.
    Until someone moved 3 times.
    Return a borrowed view.

Correctness. The general idea is the same as for the unbounded case. We serialize updates by their writes. For snaps, we can serialize those that return a borrowed view using the same induction as before. Not surprisingly, the problem now is the correctness of snaps that return by double collect. The following argument shows that the same serialization rule (i.e., any point between the two collects) works.

Claim: If a snap does a successful double collect, then no write occurs between the end of the first and the beginning of the second.

Proof: By contradiction. Suppose two reads by snap_j of r_i produce p_ij equal to q_ji's most recent value, and the same toggle bit, and assume a write by i occurs in between the reads. Consider the last such write: it must write the same p_ij and toggle_i read by snap_j. Since during update_i, p_ij gets the negation of q_ji, the read of q_ji by update_i must precede snap_j's most recent write of q_ji.

Thus, we must have the situation depicted in the figure below.

Figure: the scenario described in the proof of the claim: the update process reads q = not b and then writes p = b, toggle = t; the snap process writes q = b and then twice reads p = b, toggle = t. The shaded area represents the active interval of the last update_i before the second read of snap_j.

The read and write are part of the same update_i, so the two reads by snap_j return values written by two successive update_i writes with identical toggle bits, contradicting the code.

Before we conclude the discussion of the atomic snapshot algorithm, we remark that it has an additional nice property: wait-freedom. Namely, in any execution, including those that are fair not necessarily to all processes but just to some subset P, any operation of any i in P is guaranteed to terminate. Notice that processes not in P may even stop, without affecting the progress of the processes in P. We will discuss this concept extensively in future lectures.

Distributed Algorithms, October
Lecturer: Nancy Lynch

Lecture

So far, we have been working within a model based on read/write shared memory, while studying the mutual exclusion and atomic snapshot problems. We have digressed slightly by jumping from the mutual exclusion problem to the atomic snapshot problem. Today we will return to the mutual exclusion problem once again. In the following lectures, we will consider the consensus problem in read/write shared memory, and then consider atomic and other objects once again.

Burns' Mutual Exclusion Algorithm

Both of the algorithms we have studied so far, Dijkstra's and Peterson's, use multi-writer variables (turn), along with a collection of single-writer variables (flag). Due to the fact that it might be difficult and inefficient to implement (i.e., physically build) multi-writer shared variables in certain systems, in particular in distributed systems, algorithms that use only single-writer variables are worth investigating. We shall see two single-writer algorithms, by Burns and by Lamport.

The first algorithm, developed by Jim Burns, appears in the figure below. It does not guarantee fairness, but only mutual exclusion and deadlock-freedom. (Lamport's algorithm is fair also, but has the disadvantage of using unbounded variables.)

Mutual exclusion. The proof that Burns' algorithm guarantees mutual exclusion is similar to the proof for Dijkstra's algorithm, except that the flag variable is set to 1 where in Dijkstra's it is set to 2. For example, consider an operational argument. If i and j both reach C, then assume that i set its flag to 1 first. By the code, it keeps it 1 until it leaves C. Also by the code, after j sets its flag to 1, j must check that flag[i] = 0 before it can enter C. The argument is completed by showing that at least one of the processes must notice that the other has set its flag. Specifically, if i < j, then j must return to L, and if i > j, then i cannot proceed to the critical region.

Shared variables:
    flag: an array indexed by [1..n] of {0, 1}, initially all 0, where flag[i] is
          written by p_i and read by all.

Code for p_i:

    L:
        flag[i] := 0
        for j in {1, ..., i − 1} do
            if flag[j] = 1 then goto L
        end for
        flag[i] := 1
        for j in {1, ..., i − 1} do
            if flag[j] = 1 then goto L
        end for
    M:
        for j in {i + 1, ..., n} do
            if flag[j] = 1 then goto M
        end for
        ** Critical region **
        flag[i] := 0
        ** Remainder region **

Figure: Burns' mutual exclusion algorithm.
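Like the two-process algorithm earlier, Burns' algorithm can be exercised with Python threads. The harness and parameters are our own (and indices run 0..n−1 here rather than 1..n); the busy-waits rely on CPython's effectively sequentially consistent interleaving of simple list accesses, which is the kind of memory the algorithm assumes.

```python
import sys
import threading

sys.setswitchinterval(5e-4)   # switch threads often so spinning stays cheap

n = 3
flag = [0] * n          # flag[i] is written only by thread i, read by all
count = 0               # protected by the critical region
ITERS = 100

def burns(i):
    global count
    for _ in range(ITERS):
        while True:                              # label L
            flag[i] = 0
            if any(flag[j] for j in range(i)):   # first scan of smaller indices
                continue                         # goto L
            flag[i] = 1
            if any(flag[j] for j in range(i)):   # second scan of smaller indices
                continue                         # goto L
            break
        while any(flag[j] for j in range(i + 1, n)):
            pass                                 # label M: wait out larger indices
        count += 1                               # critical region
        flag[i] = 0                              # exit region

threads = [threading.Thread(target=burns, args=(i,)) for i in range(n)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(count)
```

Deadlock-freedom is what guarantees the join calls return: some contender (the largest-index process to pass label M, per the argument below) always makes it into the critical region.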

Deadlock-freedom. This property can be argued by contradiction, as follows. As for Dijkstra's algorithm, assume that there exists some execution in which all processes are in either R or T, and there are no further region changes. Partition the processes into those that ever reach label M and those that do not; call the first set P and the second set Q. Eventually in this execution, there must exist a point where all processes in P have already reached M. Note that they never thereafter drop back to any point prior to label M. Now we claim that there is at least one process in P, specifically the one with the largest index among all the contenders, that will reach C. Let i be the largest index of a process in P. We claim that eventually, after this point, any process j in Q such that j > i has flag[j] set permanently to 0. This is because if its flag is 1, it eventually detects the presence of a smaller-index active process and returns to L, where it sets its flag to 0; from that point on, it can never progress far enough to set it back to 1. Finally, we claim that p_i will eventually reach the critical region, a contradiction.

Remark: Burns' algorithm uses no multi-writer variables, but does use n shared multi-reader variables to guarantee mutual exclusion and deadlock-freedom for n processes. Later in this lecture we will see that this is optimal in terms of the number of registers, even if unbounded multi-writer registers are available.
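As a sanity check on the pseudocode above, here is a small threaded transliteration in Python (our sketch, not part of the notes). The gotos become loop restarts; the monitor lock and counters are instrumentation only, not part of the algorithm; and we rely on CPython's interpreter lock to make each individual read or write of a flag entry atomic, which is an assumption of this simulation rather than something the algorithm requires.

```python
import threading

N = 3        # number of processes
ITERS = 5    # critical-section entries attempted per process

flag = [0] * N   # flag[i] is written only by process i and read by all
inside = 0       # monitoring only; not part of the algorithm
violations = 0
monitor = threading.Lock()   # protects the monitor counters only

def burns_process(i):
    global inside, violations
    for _ in range(ITERS):
        # Trying region.
        while True:                                    # label L
            flag[i] = 0
            if any(flag[j] == 1 for j in range(i)):
                continue                               # goto L
            flag[i] = 1
            if any(flag[j] == 1 for j in range(i)):
                continue                               # goto L
            break
        while any(flag[j] == 1 for j in range(i + 1, N)):
            pass                                       # label M: recheck larger indices
        # Critical region.
        with monitor:
            inside += 1
            if inside > 1:
                violations += 1
        for _ in range(100):
            pass                                       # linger briefly in C
        with monitor:
            inside -= 1
        # Exit region.
        flag[i] = 0

threads = [threading.Thread(target=burns_process, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("violations:", violations)   # prints "violations: 0"
```

Because the algorithm is only deadlock-free, not fair, low-index processes may repeatedly bypass high-index ones during the run; the check only confirms that two processes are never in the critical region at once.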

Lamport's Bakery Algorithm

In this section we discuss Lamport's bakery algorithm. It is a very basic and interesting mutual exclusion algorithm: it is practical, and its ideas reappear in many other places. In the presentation here, we assume for simplicity that the shared memory consists of single-writer read/write shared variables. The algorithm guarantees nice fairness behavior: it features lockout-freedom, a good time bound, and bounded bypass. In fact, it has a stronger property: FIFO after a wait-free "doorway" (to be defined below). An unattractive property it has, however, is that it uses unbounded-size registers. The code is given in the figure below. We remark that the code given here can be simplified in the case of indivisible read/write registers. The given code works also for the safe registers model.

In the algorithm, the trying region is broken up into two subregions, which we call the doorway and the rest of T. The doorway is the part from when the process enters until it sets choosing(i) to 0. In the doorway, the process chooses a number that is greater than the numbers that it sees that other processes have already chosen. While it does this, it sets choosing(i) to 1, to let the other processes know that it is currently choosing a number. Note

[Footnote: The algorithm works also under a weaker model, called safe registers, in which the registers have much less predictable behavior in the presence of concurrent accesses. We discuss this model later in the course.]

Shared variables:
choosing: an array indexed by [1..n] of integers from {0,1}, initially all 0, where choosing(i) is written by p_i and read by all;
number: an array indexed by [1..n] of integers from {0,1,2,...}, initially all 0, where number(i) is written by p_i and read by all.

Code for p_i:

** beginning of doorway **
try_i
L1: choosing(i) := 1
    number(i) := 1 + max{number(1),...,number(n)}
    choosing(i) := 0
** end of doorway **
** beginning of bakery **
for j in {1,...,n} do
  L2: if choosing(j) = 1 then goto L2
      end if
  L3: if number(j) != 0 and (number(j), j) < (number(i), i) then goto L3
      end if
end for
crit_i
** Critical region **
** end of bakery **
exit_i
number(i) := 0
rem_i
** Remainder region **

Figure: Lamport's bakery mutual exclusion algorithm.

that it is possible for two processes to be in the doorway concurrently, and thus two processes may obtain the same number. To overcome this problem, the comparison is done between (number, index) pairs lexicographically, and thus ties are broken in favor of the process with the smaller index (this is really an arbitrary rule, but it is commonly used). In the rest of the trying region, the process waits for its (number, index) pair to be the lowest, and is also waiting for any processes that are choosing.

The algorithm resembles the operation of a bakery: customers enter the doorway, where they choose a number; then they exit the doorway and wait in the store until their number is the smallest.
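The bakery pseudocode can be transliterated into a threaded Python sketch the same way (again our illustration, with the monitor counters serving only as instrumentation, and relying on CPython's interpreter lock for atomic reads and writes of individual array entries). Note how Python's tuple comparison expresses the lexicographic (number, index) test directly:

```python
import threading

N = 3
ITERS = 5

choosing = [0] * N   # choosing[i] written only by process i
number = [0] * N     # number[i] written only by process i
inside = 0           # monitoring only; not part of the algorithm
violations = 0
monitor = threading.Lock()

def bakery_process(i):
    global inside, violations
    for _ in range(ITERS):
        # Doorway: pick a number larger than every number seen.
        choosing[i] = 1
        number[i] = 1 + max(number)
        choosing[i] = 0
        # Bakery: wait until (number, index) is lowest among active processes.
        for j in range(N):
            while choosing[j] == 1:
                pass                      # L2: j is choosing; wait
            while number[j] != 0 and (number[j], j) < (number[i], i):
                pass                      # L3: lexicographic test; ties broken by index
        # Critical region.
        with monitor:
            inside += 1
            if inside > 1:
                violations += 1
        with monitor:
            inside -= 1
        # Exit region.
        number[i] = 0

threads = [threading.Thread(target=bakery_process, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("violations:", violations)   # prints "violations: 0"
```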

Analysis

Basic Properties

Let D denote the doorway; T - D is then the rest of the trying region. We first argue that the algorithm does not violate the mutual exclusion property.

Claim: In any reachable state of the algorithm, if p_i is in C, and for some j != i we have p_j in (T - D) or C, then (number(i), i) < (number(j), j).

We give here an operational proof, since it can be extended more easily to the safe-register case later.

Proof: Process i had to read choosing(j) = 0 at L2 before entering C. Thus, at the time of that read, j was not in the "choosing region", i.e., in the doorway after setting choosing(j) to 1. But since j is in (T - D) or C, j must have gone through the doorway at some point. There are two cases to consider.

Case 1: j entered the choosing region after process i read that choosing(j) = 0. Then i's number was chosen before j started choosing, ensuring that j saw number(i) when it chose. Therefore, when i is in C, we have number(j) > number(i), which suffices.

Case 2: j left the choosing region before i's read. Then, when i reaches L3 and reads j's number, it gets the most recent value, number(j). Since i decided to enter C anyhow, it must be that (number(i), i) < (number(j), j).

Corollary: The bakery algorithm satisfies mutual exclusion.

Proof: Suppose that two processes i and j are both in C. Then, by the Claim, we must have both (number(i), i) < (number(j), j) and (number(j), j) < (number(i), i), a contradiction.

[Footnote: Here we are using the processes' indices in much the same way as we previously used the UIDs. If the processes did not know their indices, but had UIDs, we could use those here instead. This observation holds also for Dijkstra's algorithm, but not for Peterson's, since there the indices distinguish the roles the processes play.]

Theorem: The algorithm satisfies deadlock-freedom.

Proof: Again, we argue by contradiction. Suppose that deadlock occurs; then eventually a point is reached after which a fixed set of processes are in T, and no new region changes occur. Also, by the code, eventually all of the processes in T get out of the doorway and into T - D. At this time point, the process with the lowest (number, index) is not blocked.

Theorem: The algorithm satisfies lockout-freedom.

Proof: Consider a particular process i in T. It eventually exits the doorway. Thereafter, any new process that enters the doorway sees i's number, and hence chooses a higher number. Thus, if i doesn't reach C, none of these new processes can reach C either, since they all see that i has a smaller number, and by the code they must wait behind i. But by the previous theorem, processes in T must continue to go to C, which means that i eventually goes too.

Time Analysis

In this section we show that any process enters C at most O(sn² + cn) time after it starts executing T, where (as before) s is an upper bound on process step time and c is an upper bound on critical region time. We start by showing that the algorithm has bounded bypass after the doorway.

Lemma: The total number of times a processor p_i enters the bakery, while some processor p_j is continuously in the bakery, is no more than 2.

Proof: It suffices to show that any processor may exit the bakery at most once while p_j is in the bakery. While j is in the bakery, number(j) is unchanged. Hence, if a process i exits the bakery, then when it enters the doorway again, number(i) will be assigned a number strictly greater than number(j), and p_i will not be able to proceed past L3 (from some state in the fragment) until p_j exits C.

We now analyze the time for an individual process to enter C. Consider an execution fragment such that in the first state some process p_t is at L1, and in the last state p_t is in C. Our goal is to bound the time of any such execution fragment. Denote by T_d(i) the time a processor p_i spends in the doorway, and by T_b(i) the time p_i spends in the bakery (we consider only the fragment in question). We shall bound T_d(t) + T_b(t). We first bound the time spent in the doorway.

Lemma: For any processor p_i, T_d(i) = O(sn).

Proof: The doorway consists of O(n) reads and writes.

By the above lemma, after O(sn) time the process p_t reaches the bakery. We now consider only the suffix of the fragment in which p_t is in the bakery. Consider the if statements: they can be described as busy-waiting until some predicate holds. We call a test successful if the next step is not another execution of the test; specifically, if the predicate in L2 or L3 was false. The following lemma bounds the number of successful tests in the fragment we consider.

Lemma: The total number of successful tests in the fragment is no more than O(n²).

Proof: In the whole trying region, every processor makes O(n) successful tests, and by the lemma above, any processor passes through the trying region no more than twice.

We can now bound the total time of the fragment.

Theorem: The total time of the execution fragment in question is no more than T_d(t) + T_b(t) = O(sn²) + O(cn).

Proof: We know that in all states during the execution fragment, some processor can make progress, or otherwise the liveness property is contradicted. But this implies that either some process is making progress in the doorway, or some process makes a successful comparison, or some process makes progress in C. The total amount of time in the fragment during which some processor is in C is O(cn), since by the lemma above any processor enters C at most once. Also, the total amount of time in the fragment during which any processor is in the doorway is O(sn²), by the lemmas above. Taking into account the successful test steps, we can conclude that

T_b(t) = O(sn²) + O(sn²) + O(cn) = O(sn²) + O(cn).

The result is obtained by combining the above bound with the bound on the doorway time.

Additional Properties

The algorithm has some other desirable properties. For example, we have FIFO after a doorway. This stronger fairness property says that if i finishes the doorway before j enters T, then i must precede j in entering C. This is an "almost-FIFO" property: it is not actually FIFO based on time of entry to T, or even on time of first non-input step in T (i could set its choosing variable to 1, then delay while others choose, so all the others could beat it).

It isn't very useful to say, as a feature, that an algorithm is "FIFO after a doorway", since so far there are no constraints on where the doorway begins; we might even place the doorway right at the entrance to C. But the doorway in this algorithm has a nice property: it is wait-free, which means that a process is guaranteed eventually to complete it if that process continues to take steps, regardless of what the other processes do (continue taking steps or stop, in any combination, at any speed).

The bakery algorithm also has some limited amount of fault-tolerance. Consider "faults" in which a user can abort an execution at any time when it is not in R, by sending an abort_i input to its process. Process i handles this by setting its variables (local and shared) back to initial values. A lag is allowed in resetting the shared variables. We can see that mutual exclusion is still preserved. Moreover, we still have some kind of deadlock-freedom: suppose that at some point some process is in T and does not fail (abort), that there are only finitely many total failures, and that C is empty. Then the algorithm guarantees that eventually some process enters C, even if all other processes fail.

The Number of Registers for Mutual Exclusion

We have seen several algorithms using read/write shared memory for mutual exclusion with deadlock-freedom, and also various fairness properties. The question arises as to whether the problem can be solved with fewer than n read/write variables.

If the variables are constrained to be single-writer, then the following lemma implies that at least n variables are needed.

Claim: Suppose that an algorithm A solves mutual exclusion with deadlock-freedom for n processes, using only read/write shared variables. Suppose that s is a reachable state of A in which all processes are in the remainder region. Then any process i can go critical on its own starting from s, and along the way, i must write some shared variable.

Proof: Deadlock-freedom implies that i can go critical on its own from s. Suppose that it does not write any shared variable. Then consider any other process i': deadlock-freedom also implies that i' can go critical on its own starting from s. Now consider a third execution, in which process i first behaves as it does in its solo execution, eventually entering C without writing any shared variable. Since process i does not write any shared variable, the resulting state s' looks like state s to process i'. (Here we say that two states look alike to some process if the state of that process and the values of all shared variables are the same in the two states.) Therefore, i' is also able to go critical on its own starting from s'. This violates the mutual exclusion requirement.

The Claim directly implies that if only single-writer registers are available, then the algorithm must use at least n of them. But note that we have not even beaten this bound when we used multi-writer variables. In this section we prove that this is not an accident; in fact, we have the following lower bound.

Theorem: If algorithm A solves mutual exclusion with deadlock-freedom for n processes, using only read/write variables, then A must use at least n shared variables.

Note that this result holds regardless of the size of the shared variables: they can be as small as a single bit, or even unbounded in size. Also note that no fairness assumption is needed; deadlock-freedom is sufficient to get the impossibility result.

To get some intuition, we start with two simple cases. We first consider the case of one variable and two processes; then we consider the case of two variables and three processes; and finally we shall extend the ideas to the general case.

We shall use the following definition extensively.

Definition: Suppose that at some global state s, the action enabled in some process p is writing to a variable v; that is, the next time p takes a step, it writes to v. In this case, we say that p covers v.

Two Processes and One Variable

Suppose that the system consists of two processes p1 and p2, and only one variable v. We construct a run that violates mutual exclusion for this system. The Claim above implies that there is a solo run of p1 that causes it to enter C, and to write the single shared variable v before doing so. So now run p1 just until the first time it is about to write to v, i.e., until the first time it covers v. We then run p2 until it goes to C: since p1 hasn't written anything yet, for p2 the situation is indistinguishable from the state in which p1 is in R, and hence p2 will behave as it does running solo. Then let p1 continue running. Now the first thing p1 does is to write v, thereby overwriting anything that p2 wrote, and so eliminating all traces of p2's execution. Thus p1 will run as if alone and also go to C, contradicting the mutual exclusion requirement.
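The schedule in this argument can be acted out concretely. The sketch below is our illustration, not from the notes: it uses a deliberately naive one-variable protocol (read v; if it is 0, write 1 and enter C) and models each process as a generator that pauses before every shared-memory step, so the adversarial schedule can be driven by hand.

```python
v = {"val": 0}    # the single shared read/write variable
history = []      # which processes reached the critical region, in order

def naive_process(name):
    r = v["val"]                     # read step; if r == 0, the next step covers v
    yield "read"
    if r == 0:
        v["val"] = 1                 # write step: overwrites any trace of others
        yield "write"
        history.append((name, "C"))  # enter the critical region

p1, p2 = naive_process("p1"), naive_process("p2")
next(p1)             # p1 reads v = 0; it now covers v (its next step is the write)
for _ in p2:         # run p2 solo: it also reads v = 0, writes, and enters C
    pass
for _ in p1:         # resume p1: its write erases p2's trace, so it enters C too
    pass
print(history)       # [('p2', 'C'), ('p1', 'C')]
```

p1 is frozen while covering v, p2 runs solo to C, and p1's resumed write erases every trace of p2, so both end up in C: mutual exclusion is violated, exactly as in the proof.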

Three Processes and Two Variables

Now suppose that we have three processes p1, p2, p3 and two variables v1, v2. Again we shall construct a run that violates mutual exclusion, using the following strategy. Starting from an initial state, we will maneuver p1 and p2 alone so that each is covering a different variable; moreover, the resulting state s' is indistinguishable to p3 from another reachable state s'' in which all three processes are in R. Then we run p3 from s', and it must eventually go to C, since it can't tell that anyone else is there. Then we let p1 and p2 each take a step. Since they are covering the two variables, the first thing they do is to overwrite all traces of p3's execution, and so they will run as if they are alone, and by deadlock-freedom one will eventually go to C, yielding the desired violation of mutual exclusion.

It remains to show how we maneuver p1 and p2 to cover the two shared variables. We do this as follows (see Figure).

First, we run p1 alone until it covers a shared variable for the first time; call this point t1. Then we run p1 alone until it enters C, then continues to E, R, back to T, and again covers some variable for the first time; call this point t2. We repeat this procedure to obtain a third point t3. Notice that two of the three points t1, t2 and t3 must involve covering the same variable (see Figure). Without loss of generality, suppose that at t1 and t2, p1 covers v1 (the same argument holds for all the other cases).

Figure: A solo run of p1. It covers a variable at each of t1, t2 and t3, and hence it covers the same variable twice.

Now consider the following execution. Run p1 until point t1; then let p2 enter and run alone. We claim that eventually p2 enters C, since the state looks to p2 as if p2 is in its solo execution. Moreover, we claim that along the way, p2 must write the other variable, call it v2: for otherwise, p2 could go to C, then p1 could immediately write v1 and thus overwrite all traces of p2, then go on and violate mutual exclusion.

Figure: Execution for two variables (p1 covers v1; p2 covers v2; p1 overwrites v1 and covers it again). By the end of this fragment, p1 and p2 cover both shared variables, and the state is indistinguishable from the state in which p1 and p2 are in their last remainder states.

So now we construct the required execution as follows (see Figure). Run p1 until point t1, when it covers v1. Then run p2 until it first covers v2. Note that we are not done yet, because p2 could have written v1 since last leaving R. But now we can resume p1 until point t2: the first thing it does is write v1, thereby overwriting anything p2 wrote. Then p1 and p2 cover variables v1 and v2, respectively. Moreover, by stopping p1 and p2 back in their most recent remainder regions, the shared memory and the state of p3 would still be the same. This completes the construction of the execution in which p1 and p2 cover the two shared variables, in a state that is indistinguishable to p3 from the state in which both are in R. By the argument above, we conclude that in this case mutual exclusion can be violated.

The General Case

The proof for the general case extends the examples above with an inductive argument. We call a state a global remainder state if all processes are in R in that state. We call a state k-reachable from another if it is reachable using steps of processes p_1,...,p_k only.

We start with two preliminary lemmas.

Lemma: Suppose A solves mutual exclusion with deadlock-freedom for n processes, using only read/write variables. Let s and s' be reachable states of A that are indistinguishable to p_i, and suppose that s' is a global remainder state. Then p_i can go on its own from s to its critical region.

Proof: Process p_i can do so from s', by deadlock-freedom. Since s looks like s' to p_i, it can behave in the same way from s.

Lemma: Suppose A solves mutual exclusion with deadlock-freedom for n processes, using only read/write variables. Let s be a reachable state of A, and suppose that i goes from R to C on its own starting from s. Then along the way, i must write to some variable that is not covered in s.

Proof: If not, then after p_i enters C, we can resume the covering processes, whose writes overwrite and hide all traces of i's execution; mutual exclusion can then be violated as in the examples above.

Using the above easy properties, we shall prove the following main lemma.

Lemma: Let A be an n-process algorithm solving mutual exclusion with deadlock-freedom, using only read/write variables. Let s be any reachable global remainder state, and suppose 1 <= k <= n. Then there are two states s' and s'', each k-reachable from s, such that the following properties hold:

1. k distinct variables are covered by p_1,...,p_k in s';
2. s'' is a global remainder state;
3. s' looks like s'' to p_{k+1},...,p_n.

Notice that applying this lemma for k = n yields the theorem.

Proof: By induction on k.

Base case: For k = 1, we have the first example above. To get s', just run p_1 till it covers a variable. In this case we define s'' = s.

Inductive step: Suppose the lemma holds for k; we prove it holds for k + 1 <= n. Starting from s, by induction, we run p_1,...,p_k until they reach a point where they cover k distinct variables, yet the state looks to p_{k+1},...,p_n like some global remainder state that is k-reachable from s. Then we let p_1,...,p_k write in turn (overwriting the k covered variables), and then let them progress through C, E, and R, calling the resulting state s_1. We apply the inductive hypothesis again to reach another covering state that looks like a global remainder state k-reachable from s_1. We repeat this procedure (n choose k) + 1 times. Now, by the Pigeonhole Principle, among the (n choose k) + 1 covering states that have been obtained, there must be two that cover the same set of variables; call this set V. Let t_1 be the first of these points in the execution and t_2 the second, and let s'_1 and s''_1 be the covering and global remainder states corresponding to t_1, and likewise s'_2 and s''_2 for t_2.
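The counting step here is elementary, and can be illustrated directly (our sketch, not from the notes): there are only (n choose k) possible covered sets of k variables, so among any (n choose k) + 1 covering states, two must cover the same set.

```python
from itertools import combinations
from math import comb

n, k = 4, 2
cover_sets = list(combinations(range(n), k))   # all possible covered k-sets of variables
print(len(cover_sets))                          # prints 6, i.e. comb(4, 2)

# A run producing comb(n, k) + 1 covering states labels each state with a
# covered set; by pigeonhole, at least two states must share the same set.
states = [cover_sets[i % len(cover_sets)] for i in range(comb(n, k) + 1)]
repeated = next(s for s in states if states.count(s) > 1)
assert states.count(repeated) >= 2
```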

Consider running p_{k+1} alone from t_1. Since the state at t_1 looks to p_{k+1} like a reachable global remainder state, the first lemma above implies that p_{k+1} will eventually enter C. Along the way, by the second lemma above, it must write to some variable not in V.

Now we construct the needed execution to prove the lemma, as follows (see Figure). First run p_1,...,p_k until t_1; then let p_{k+1} take steps until it first covers a variable not in V. Next, let p_1,...,p_k resume and go on to t_2.

Figure: Construction for the general case. At t_1, processes p_1,...,p_k cover the k variables of V; at t', p_{k+1} covers some variable not in V; at t_2, both V and that variable are covered.

Call the resulting state s'; this is the s' that is required in the statement of the lemma. (This state is the same as s'_2, with the state of p_{k+1} changed to what it is when it stops.) Let s'' = s''_2.

We now show that the required properties hold. First, note that the entire construction (including the uses of the inductive hypothesis) only involves running p_1,...,p_{k+1}, so both s' and s'' are (k+1)-reachable from s. Also, directly from the construction, we have k + 1 variables covered in s': the k variables in V, and a new one covered by p_{k+1}. By the inductive hypothesis, s'' is a global remainder state. It remains only to show that s' and s'' look alike to p_{k+2},...,p_n. But this follows from the facts that s'_2 and s''_2 look alike to p_{k+1},...,p_n, and that s' and s'_2 look alike to all processes except p_{k+1}.

6.852J/18.437J Distributed Algorithms, October

Lecturer: Nancy Lynch

Lecture

Consensus Using Read/Write Shared Memory

In this section we focus on the consensus problem in the shared memory setting. We shall work from papers by Herlihy; by Loui and Abu-Amara; and by Fischer, Lynch and Paterson. We first define the problem in this context. The architecture is the same architecture we considered in the previous lectures, namely read/write shared variables, allowing multi-reader multi-writer variables. Here we assume that the external interface consists of input actions init_i(v), where v is the input value, and output actions decide_i(v), where v is the decision value. Just to keep consistent with the invocation/response style we are using for mutual exclusion and atomic objects, we assume that the initial values arrive from the outside in input actions.

Figure: A shared memory system (processes access shared memory cells through interface lines).

We require the following properties from any execution, fair or not.

Agreement: All decision values are identical.

Validity: If all processes start with the same value v, then v is the only possible decision.

In addition, we need some kind of termination condition. The simplest would be the following.

Termination: If init events occur at all nodes and the execution is fair, then all processes eventually decide.

This condition, which applies only when there are no faults, has simple solutions even though there is asynchrony. So it is only interesting to consider the faulty case. In the following requirement, we give the formulation of the requirement for stopping faults.

Termination: Suppose init events occur at all processes, and let i be any process. If i does not fail (i.e., the execution is fair to i), then eventually a decide_i event occurs.

We remark that this condition is similar to wait-freedom.

Impossibility for Arbitrary Stopping Faults

Our first result in this section is the following.

Theorem: There is no algorithm for consensus that tolerates stopping faults.

Proof: Assume that A is a solution. Suppose for simplicity that the values in A are chosen from {0, 1}. Without loss of generality, we can assume that the state-transition relation of A is "process-deterministic": i.e., from any global state, if a process i is enabled to take a non-input step, then there is a unique new global state that can result from its doing so. Also, for any global state and any input, there is a unique resulting global state. This does not restrict the generality, since we could just prune out some of the transitions; the problem still has to be solved in the pruned algorithm.

We can further restrict attention to "input-first" executions of A, in which inputs arrive everywhere before anything else happens, in order of process index. That is, they have a prefix of the form init_1(v_1), init_2(v_2), ..., init_n(v_n). This is just a subset of the allowed executions. Now consider any finite input-first execution α of A. We say that α is 0-valent if the only value v that ever appears in a decide(v) event, in α or any continuation of α, is 0; analogously, we say that α is 1-valent if the only such value v is 1. We say that α is univalent if it is either 0-valent or 1-valent, and bivalent if both values appear in some extensions. Note that this classification is exhaustive, because any such α can be extended to a fair (i.e., failure-free) execution of A, in which everyone is required to eventually decide.

Define an initial execution of A to be an input-first execution of length exactly n; that is, it consists of exactly n init actions, one init_i for each i, in order of process index.

Lemma: There exists an initial execution of A that is bivalent.

Proof: If not, then all initial executions are univalent. Note that the one in which all processes start with 0 is 0-valent, and analogously for 1. Now consider changing the inputs from 0 to 1, one at a time. This creates a chain of initial executions, where any two consecutive initial executions in this chain only differ in one process's input. Since every initial execution in the chain is univalent by assumption, there must be two consecutive initial executions α_0 and α_1 such that α_0 is 0-valent and α_1 is 1-valent. Without loss of generality, suppose that α_0 and α_1 only differ in the initial value of process i. Now consider any extension of α_0 in which i fails right at the beginning and never takes a step. The rest of the processes must eventually decide, by the termination condition. Since α_0 is 0-valent, this decision must be 0. Now run the same extension after α_1, still having i fail at the beginning. Since i fails at the beginning, and α_0 and α_1 are the same except for process i, the other processes will behave in the same way and decide 0 in this case as well. But that contradicts the assumption that α_1 is 1-valent.
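The chain construction in this proof is easy to make concrete (our illustration, not from the notes): flip the inputs from all-0 to all-1 one process at a time, and observe that consecutive input assignments differ in exactly one process's input, which is what forces a 0-valent/1-valent adjacent pair to exist somewhere along the chain.

```python
n = 4
# Flip the inputs from all-0 to all-1, one process at a time.
chain = [tuple([1] * i + [0] * (n - i)) for i in range(n + 1)]
print(chain[0], chain[-1])   # (0, 0, 0, 0) (1, 1, 1, 1)
for a, b in zip(chain, chain[1:]):
    # Consecutive initial input assignments differ in exactly one position,
    # so some adjacent pair must straddle the 0-valent/1-valent boundary.
    assert sum(x != y for x, y in zip(a, b)) == 1
```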

Remark: Note, for later reference, that this lemma still holds under a more constrained assumption about A: that it experiences at most a single fault. Technically, the lemma holds under the same conditions as above, with the additional constraint in the termination condition that there is at most one failure.

Next, we define a decider to be an input-first execution that is bivalent, but any one-step extension of it is univalent. We have the following lemma.

Lemma: A has a decider.

Proof: Suppose not; then any bivalent input-first execution has a bivalent one-step extension. Then, starting with a bivalent initial execution (as guaranteed by the lemma above), we can produce an infinite input-first execution all of whose prefixes are bivalent. The construction is extremely simple: at each stage we have a bivalent input-first execution, and we extend it by one step to another bivalent one; since any bivalent input-first execution has a bivalent one-step extension, we know we can do this. But this is a contradiction, because some process does not fail in this execution, so a decision is supposed to be reached by that process.

Remark: Note that this lemma does require the full resiliency; assuming resiliency to only a single fault is not sufficient to guarantee the existence of a decider.

Now we can complete the proof of the theorem, as follows. Define extension(α, i) to be the execution obtained by extending execution α by a single step of i. By the lemma above, we obtain a decider α. Suppose that p_i's step leads to a 0-valent execution and p_j's step leads to a 1-valent execution (see Figure); that is, extension(α, i) is 0-valent and extension(α, j) is 1-valent. Clearly i != j. Note that by the process-determinism assumption, each process can only lead to a unique extension of α.

Figure: α is a decider. If i takes a step, then the resulting configuration is 0-valent, and if j takes a step, the resulting configuration is 1-valent.

We now proceed by case analysis, and get a contradiction for each possibility.

Figure: The extension β does not include steps of i, and therefore must result in decision value 1 in both cases.

Case 1: p_i's step is a read step. Consider extending extension(α, j) in such a way that all processes except p_i continue to take steps. Eventually, they must decide 1. Now take the same extension, beginning with the step of p_j, and run it after extension(α, i) (see Figure). Since p_i's step is just a read, it does not leave any trace that would cause the other processes to behave any differently. So in this case also, they all decide 1, contradicting the assumption that extension(α, i) is 0-valent.

Case 2: p_j's step is a read step. This case is symmetric to Case 1, and the same argument applies.

Case 3: p_i's and p_j's steps are both writes. We distinguish between the following subcases.

Case 3(a): p_i and p_j write to different variables. In this case, the result of running either p_i and then p_j, or p_j and then p_i, after α is exactly the same global state (see Figure).

Figure: If i and j write to different variables, then applying j after i results in the same configuration as applying i after j.

But then we have a common global state that can follow either a 0-valent or a 1-valent execution. If we run all the processes from this state, they must reach a decision, but either decision will yield a contradiction. For instance, if a decision of 0 is reached, we have a decision of 0 in an execution extending a 1-valent prefix.
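The heart of Case 3(a) is simply that writes by different processes to different variables commute as functions on the shared state; a minimal illustration (our sketch, not from the notes):

```python
def write(state, var, val):
    # one process's write step, applied to a snapshot of shared memory
    s = dict(state)
    s[var] = val
    return s

s = {"v1": 0, "v2": 0}
ij = write(write(s, "v1", 5), "v2", 7)   # p_i's write, then p_j's
ji = write(write(s, "v2", 7), "v1", 5)   # p_j's write, then p_i's
assert ij == ji                          # the same global state either way
print(ij)                                # {'v1': 5, 'v2': 7}
```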

Case 3(b): They are writes to the same variable. As in Figure, we can run all but p_i after extension(α, j) until they decide 1. On the other hand, we can apply that extension also after extension(α, i) and obtain the same decision, because the very first step of the extension (p_j's write) will overwrite the value written by p_i; we thus reach decision value 1 from a 0-valent state, a contradiction as in Case 1.

In summary, we have contradictions in all possible cases, and thus we conclude that no such algorithm A can exist.

Impossibility for a Single Stopping Fault

We can strengthen the theorem above to show impossibility even of a 1-resilient consensus algorithm. The stronger version is due to Loui and Abu-Amara, following the Fischer-Lynch-Paterson proof for message-passing systems.

Theorem: There is no algorithm that solves consensus under the conditions above in the presence of a single stopping fault.

Proof: As noted above, the lemma on bivalent initial executions holds also for the case of one stopping fault, and hence we are still guaranteed that there exists a bivalent initial execution. However, our proof will now proceed using a different argument. Specifically, we now show the following lemma.

Lemma: For every bivalent input-first execution α and every process i, there is an extension α' of α such that extension(α', i) is bivalent.

Before we turn to prove the lemma, we show how it implies the theorem. The lemma allows us to construct the following bad execution, in which no one fails and yet no one ever decides. We start with the given bivalent initial execution, and repeatedly extend it, including at least one step of process 1 in the first extension, then at least one step of process 2 in the second extension, etc., in round-robin order, while keeping all prefixes bivalent. The key step is the extension for any particular process, but that is done by applying the lemma. In the resulting execution, each process takes infinitely many steps, yet no process ever reaches a decision. So it remains only to prove the lemma.

Proof (of the Lemma): By contradiction; suppose the lemma is false. Let us spell out what this means: we assume that there exist some bivalent input-first execution α and some process i such that for every extension α' of α, extension(α', i) is univalent. This in particular implies that extension(α, i) is univalent; suppose, without loss of generality, that it is 0-valent. Since α is bivalent, there is some extension α'' of α leading to a decision of 1. Now we must have extension(α'', i) 1-valent. Consider applying i at all points along the path from α to α''. At the beginning we get 0-valence, and at the end 1-valence; at each intermediate step we have univalence. Therefore, it must be that there are two consecutive points at which i is applied, in the first of which it yields a 0-valent extension, and in the second a 1-valent extension. See Figure.

α j i

i 0−valent

1−valent

Figure whenever we extend with iwe get a valent conguration

Supp ose that j is the pro cess that takes the intervening step We claim j i This

is true b ecause if j iwehave one step of i leading to valence while two steps lead to

valence contradicting the pro cessdeterminism So wemust have somewhere the congu

ration depicted in Figure

We again proceed by case analysis, and obtain a contradiction for each case.

Case 1: i's step is a read. In this case, after the extensions (j, i) and (i, j), the values in all shared variables and the states of all processes other than i are identical. So from either of these states, run all processes except i, and they will have to decide. But one of these positions is 0-valent and the other 1-valent, so we get a contradiction for any decision value.

Case 2: j's step is a read. This case is similar to Case 1. After (i) and (j, i), the values of all shared variables and the states of all processes other than j are identical, and therefore, when we let all processes except j run from both states, we obtain the same contradiction as above. (Figure: the point in the execution which we analyze by cases.)

Case 3: Both steps are writes.

Case 3a: The writes are to different variables. In this case we get a commutative scenario, depicted in the corresponding figure, which immediately implies a contradiction. (Figure: process j does not take steps in the deciding extension β.)

Case 3b: The writes are to the same variable. In this case, in a (j, i) extension, the i step overwrites the j step. Thus, after (i) and (j, i), the values of all shared variables and the states of all processes other than j are identical, and we obtain a contradiction as before.

This completes the proof of the theorem.

Modeling and Modularity for Shared Memory Systems

In this section we go back to considering the idea of atomic objects, and show how they might be used to describe modular system designs. An atomic object is a concurrent version of a corresponding serial specification. We want to prove a theorem that says, roughly speaking, that any algorithm written for the instantaneous-access model we have already been using can also run correctly in a variant of the model where all the shared variables are replaced by corresponding atomic objects. This would allow us to decompose the task of solving a problem into the task of solving it assuming some kind of instantaneous-access memory, then separately implementing atomic objects of the corresponding kind. This could be done in a series of stages, if the problem being solved is itself that of building a more powerful atomic object.

To make all of this precise, we should be a little more formal about the model we are using.

Recall the model we have been assuming: state machines with input, output, local computation, and shared memory actions, where the latter have transitions of the form ((s, m), (s', m')). So far, all accesses to shared memory have been by instantaneous actions.

Now we would like to define a slightly different version of the model where, instead of sharing memory with instantaneous actions, the processes communicate with separate objects. The communication is no longer instantaneous, but involves separate invocations and responses. We must say what kind of fairness we want here, too.

It is helpful at this point to introduce a very general, simple, and precise formal model for asynchronous concurrent systems: I/O Automata. This model is useful as a general foundation for the various kinds of asynchronous systems we will be studying. By adding particular extra structure, it can be used to describe the instantaneous-access shared memory systems we have been studying. It is also very natural for describing non-instantaneous-access shared memory systems, as well as message-passing systems, dataflow systems, etc.; virtually any asynchronous model.

The Basic Input/Output Automaton Model

In this section we give the basic definitions for the I/O Automaton model for asynchronous concurrent computation. For the moment, these definitions do not include any notion of fairness. In the subsequent sections, we define composition of I/O automata and show that it has the nice properties that it should. Then we introduce the notion of fairness, and show how it interacts with the composition notions. We then describe some useful conventions for stating problems to be solved by I/O automata, in terms of their external behavior, and describe what it means for an I/O automaton to solve a problem. Finally, we describe two important proof techniques for verifying systems described as I/O automata, based on modular decomposition and on hierarchical decomposition. These notes are adapted from the Lynch-Tuttle CWI paper.

Actions and Action Signatures

We assume a universal set of actions. Sequences of actions are used in this world for describing the behavior of modules in concurrent systems. Since the same action may occur several times in a sequence, it is convenient to distinguish the different occurrences. We refer to a particular occurrence of an action in a sequence as an event.

The actions of each automaton are classified as either input, output, or internal. The distinctions are that input actions are not under the automaton's control, output actions are under the automaton's control and externally observable, and internal actions are under the automaton's control but not externally observable. In order to describe this classification, each automaton comes equipped with an action signature.

An action signature S is an ordered triple consisting of three disjoint sets of actions. We write in(S), out(S), and int(S) for the three components of S, and refer to the actions in the three sets as the input actions, output actions, and internal actions of S, respectively. We let ext(S) = in(S) ∪ out(S), and refer to the actions in ext(S) as the external actions of S. Also, we let local(S) = out(S) ∪ int(S), and refer to the actions in local(S) as the locally controlled actions of S. Finally, we let acts(S) = in(S) ∪ out(S) ∪ int(S), and refer to the actions in acts(S) as the actions of S. An external action signature is an action signature consisting entirely of external actions, that is, having no internal actions. If S is an action signature, then the external action signature of S is the action signature extsig(S) = (in(S), out(S), ∅), i.e., the action signature that is obtained from S by removing the internal actions.
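These derived sets are purely mechanical set operations. As a small illustrative sketch (not part of the original notes; the class name and representation are our own), they can be written as:

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class ActionSignature:
    """An action signature S = (in(S), out(S), int(S)): three disjoint action sets."""
    inp: FrozenSet[str]
    out: FrozenSet[str]
    int_: FrozenSet[str]

    def __post_init__(self):
        # The three components must be pairwise disjoint.
        assert not (self.inp & self.out or self.inp & self.int_ or self.out & self.int_)

    def ext(self):
        # External actions: in(S) ∪ out(S).
        return self.inp | self.out

    def local(self):
        # Locally controlled actions: out(S) ∪ int(S).
        return self.out | self.int_

    def acts(self):
        # All actions: in(S) ∪ out(S) ∪ int(S).
        return self.inp | self.out | self.int_

    def extsig(self):
        # External action signature: the internal actions removed.
        return ActionSignature(self.inp, self.out, frozenset())
```

The disjointness assertion in the constructor mirrors the requirement that the triple consist of three disjoint sets.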

Input/Output Automata

Now we are ready to define the basic component of our model. An input/output automaton A (also called an I/O automaton, or simply an automaton) consists of five components:

1. an action signature sig(A),
2. a set states(A) of states,
3. a nonempty set start(A) ⊆ states(A) of start states,
4. a transition relation trans(A) ⊆ states(A) × acts(sig(A)) × states(A), with the property that for every state s and input action π there is a transition (s, π, s') in trans(A), and
5. an equivalence relation part(A) on local(sig(A)), having at most countably many equivalence classes.

We refer to an element (s, π, s') of trans(A) as a transition, or step, of A. The transition (s, π, s') is called an input transition of A if π is an input action. Output transitions, internal transitions, external transitions, and locally controlled transitions are defined analogously. If (s, π, s') is a transition of A, then π is said to be enabled in s. Since every input action is enabled in every state, automata are said to be input-enabled. The input-enabling property means that the automaton is not able to block input actions. The partition part(A) is what was described in the introduction as an abstract description of the "components" of the automaton. We shall use it to define fairness later.

An execution fragment of A is a finite sequence s_0 π_1 s_1 π_2 ... π_n s_n or an infinite sequence s_0 π_1 s_1 π_2 ... of alternating states and actions of A, such that (s_i, π_{i+1}, s_{i+1}) is a transition of A for every i. An execution fragment beginning with a start state is called an execution. We denote the set of executions of A by execs(A), and the set of finite executions of A by finexecs(A). A state is said to be reachable in A if it is the final state of a finite execution of A.

The schedule of an execution fragment α of A is the subsequence of α consisting of actions, and is denoted by sched(α). We say that β is a schedule of A if β is the schedule of an execution of A. We denote the set of schedules of A by scheds(A), and the set of finite schedules of A by finscheds(A). The behavior of an execution or schedule α of A is the subsequence of α consisting of external actions, and is denoted by beh(α). We say that β is a behavior of A if β is the behavior of an execution of A. We denote the set of behaviors of A by behs(A), and the set of finite behaviors of A by finbehs(A).
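Operationally, sched and beh are simple projections. A sketch in Python (our own rendering, representing a finite execution fragment as an alternating list s_0, π_1, s_1, ...):

```python
def sched(execution):
    """Schedule: the subsequence of actions (the odd positions) of an
    alternating state/action sequence s0, a1, s1, a2, s2, ..."""
    return [x for i, x in enumerate(execution) if i % 2 == 1]

def beh(execution, external):
    """Behavior: the subsequence of the schedule consisting of external actions."""
    return [a for a in sched(execution) if a in external]
```

Note that beh depends only on the signature's classification of actions as external, not on the states.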

Composition

For motivation, consider the composition of user automata and a shared memory system automaton.

Generally speaking, we can construct an automaton modeling a complex system by composing automata modeling the simpler system components. The essence of this composition is quite simple: when we compose a collection of automata, we identify an output action π of one automaton with the input action π of each automaton having π as an input action. Consequently, when one automaton having π as an output action performs π, all automata having π as an input action perform π simultaneously; automata not having π as an action do nothing.

We impose certain restrictions on the composition of automata. Since internal actions of an automaton A are intended to be unobservable by any other automaton B, we cannot allow A to be composed with B unless the internal actions of A are disjoint from the actions of B, since otherwise one of A's internal actions could force B to take a step. Furthermore, in keeping with our philosophy that at most one system component controls the performance of any given action, we cannot allow A and B to be composed unless the output actions of A and B form disjoint sets. Finally, since we do not preclude the possibility of composing a countable collection of automata, each action of a composition must be an action of only finitely many of the composition's components. (Note that with infinite products we can handle systems that can create processes dynamically.)

Since the action signature of a composition (the composition's interface with its environment) is determined uniquely by the action signatures of its components, it is convenient to define a composition of action signatures before defining the composition of automata. The preceding discussion motivates the following definition. A countable collection {S_i}_{i∈I} of action signatures is said to be strongly compatible if, for all i, j ∈ I satisfying i ≠ j, we have:

1. out(S_i) ∩ out(S_j) = ∅,
2. int(S_i) ∩ acts(S_j) = ∅, and
3. no action is contained in infinitely many sets acts(S_i).

We say that a collection of automata are strongly compatible if their action signatures are strongly compatible.
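For a finite collection, condition 3 holds automatically, and strong compatibility reduces to pairwise checks of conditions 1 and 2. A sketch (our own encoding of each signature as a dict of 'in'/'out'/'int' sets):

```python
def strongly_compatible(sigs):
    """Check strong compatibility of a finite collection of signatures,
    each given as a dict with 'in', 'out', 'int' action sets.
    (For finite collections, the 'only finitely many' condition is automatic.)"""
    for i, S in enumerate(sigs):
        for j, T in enumerate(sigs):
            if i == j:
                continue
            acts_T = T["in"] | T["out"] | T["int"]
            if S["out"] & T["out"]:   # condition 1: no shared output actions
                return False
            if S["int"] & acts_T:     # condition 2: internal actions are private
                return False
    return True
```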

When we compose a collection of automata, internal actions of the components become internal actions of the composition, output actions become output actions, and all other actions (each of which can only be an input action of a component) become input actions. As motivation for this decision, consider one automaton A having π as an output action, and two automata B_1 and B_2 having π as an input action. Notice that π is essentially a broadcast from A to B_1 and B_2 in the composition of the three automata. Notice, however, that if we hide communication, then the system obtained by composing A with B_1, hiding π, and then composing with B_2 would not be the same as the composition of all three automata, since π would be made internal to the composition of A and B_1 before composing with B_2, and hence would no longer be a broadcast to both B_1 and B_2. This is problematic if we want to reason about the full system in a modular way, by first reasoning about the composition of A and B_1, and then reasoning about the whole. We will define another operation to hide such communication actions explicitly.

The preceding discussion motivates the following definitions. The composition S = Π_{i∈I} S_i of a countable collection of strongly compatible action signatures {S_i}_{i∈I} is defined to be the action signature with:

1. in(S) = ∪_{i∈I} in(S_i) − ∪_{i∈I} out(S_i),
2. out(S) = ∪_{i∈I} out(S_i), and
3. int(S) = ∪_{i∈I} int(S_i).

(Illustration: users and a shared memory system automaton. List the various actions; show the signatures are compatible.)
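The composed signature can be computed directly from these three equations. A sketch, again encoding each signature as a dict of 'in'/'out'/'int' sets (a convention of ours, not of the notes):

```python
def compose_signatures(sigs):
    """Composition of strongly compatible action signatures:
    outputs and internals are unions; inputs are the remaining inputs,
    i.e. the union of the component inputs minus the union of the outputs."""
    out = set().union(*(S["out"] for S in sigs))
    int_ = set().union(*(S["int"] for S in sigs))
    inp = set().union(*(S["in"] for S in sigs)) - out
    return {"in": inp, "out": out, "int": int_}
```

An input of one component that is an output of another is "matched" and therefore disappears from the composition's inputs, exactly as in clause 1.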

The composition A = Π_{i∈I} A_i of a countable collection of strongly compatible automata {A_i}_{i∈I} is the automaton defined as follows:

1. sig(A) = Π_{i∈I} sig(A_i),
2. states(A) = Π_{i∈I} states(A_i),
3. start(A) = Π_{i∈I} start(A_i),
4. trans(A) is the set of triples (s, π, s') such that, for all i ∈ I, if π ∈ acts(A_i) then (s[i], π, s'[i]) ∈ trans(A_i), and if π ∉ acts(A_i) then s[i] = s'[i], and
5. part(A) = ∪_{i∈I} part(A_i).

When I is the finite set {1, ..., n}, we often denote Π_{i∈I} A_i by A_1 · ... · A_n.

Notice that, since the automata A_i are input-enabled, so is their composition. The partition of the composition's locally controlled actions is formed by taking the union of the components' partitions; that is, each equivalence class of each component becomes an equivalence class of the composition.

Three basic results relate the executions, schedules, and behaviors of a composition to those of the composition's components. The first says, for example, that an execution of a composition induces executions of the component automata. Given an execution α = s_0 π_1 s_1 ... of A, let α|A_i be the sequence obtained by deleting π_j s_j when π_j is not an action of A_i, and replacing the remaining s_j by s_j[i].

Proposition: Let {A_i}_{i∈I} be a strongly compatible collection of automata, and let A = Π_{i∈I} A_i. If α ∈ execs(A), then α|A_i ∈ execs(A_i) for every i ∈ I. Moreover, the same result holds if execs is replaced by finexecs, scheds, finscheds, behs, or finbehs.

(Here, start(A) and states(A) are defined in terms of the ordinary Cartesian product, while sig(A) is defined in terms of the composition of action signatures just defined. Also, we use the notation s[i] to denote the ith component of the state vector s.)

Certain converses of the preceding proposition are also true. The following proposition says that executions of component automata can often be pasted together to form an execution of the composition.

Proposition: Let {A_i}_{i∈I} be a strongly compatible collection of automata, and let A = Π_{i∈I} A_i. Suppose α_i is an execution of A_i for every i ∈ I, and suppose β is a sequence of actions in acts(A) such that β|A_i = sched(α_i) for every i ∈ I. Then there is an execution α of A such that β = sched(α) and α_i = α|A_i for every i ∈ I. Moreover, the same result holds when acts and sched are replaced by ext and beh, respectively.

As a corollary, schedules and behaviors of component automata can also be pasted together to form schedules and behaviors of the composition.

Proposition: Let {A_i}_{i∈I} be a strongly compatible collection of automata, and let A = Π_{i∈I} A_i. Let β be a sequence of actions in acts(A). If β|A_i ∈ scheds(A_i) for every i ∈ I, then β ∈ scheds(A). Moreover, the same result holds when acts and scheds are replaced by ext and behs, respectively.

As promised, we now define an operation that "hides" actions of an automaton by converting them to internal actions. This operation is useful for redefining what the external actions of a composition are. We begin with a hiding operation for action signatures: if S is an action signature and Σ ⊆ acts(S), then hide_Σ(S) = S', where in(S') = in(S) − Σ, out(S') = out(S) − Σ, and int(S') = int(S) ∪ Σ. We now define a hiding operation for automata: if A is an automaton and Σ ⊆ acts(A), then hide_Σ(A) is the automaton A' obtained from A by replacing sig(A) with sig(A') = hide_Σ(sig(A)).

Fairness

We are, in general, only interested in the executions of a composition in which all components are treated fairly. While what it means for a component to be treated fairly may vary from context to context, it seems that any reasonable definition should have the property that, infinitely often, the component is given the opportunity to perform one of its locally controlled actions. In this section we define such a notion of fairness.

As we have mentioned, the partition of an automaton's locally controlled actions is intended to capture some of the structure of the system the automaton is modeling. Each class of actions is intended to represent the set of locally controlled actions of some system component.

The definition of automaton composition guarantees that an equivalence class of a component automaton becomes an equivalence class of the composition, and hence that composition retains the essential structure of the system's primitive components. (Footnote: It might be argued that retaining this partition is a bad thing to do, since it destroys some aspects of abstraction. Notice, however, that any reasonable definition of fairness must lead to some breakdown of abstraction, since being fair means being fair to the primitive components, which must somehow be modeled.) In our model, therefore,

being fair to each component means being fair to each equivalence class of locally controlled actions. This motivates the following definition.

A fair execution of an automaton A is defined to be an execution α of A such that the following conditions hold, for each class C of part(A):

1. If α is finite, then no action of C is enabled in the final state of α.
2. If α is infinite, then either α contains infinitely many events from C, or α contains infinitely many occurrences of states in which no action of C is enabled.

This says that a fair execution gives fair turns to each class C of part(A), and therefore to each component of the system being modeled. Infinitely often, the automaton attempts to perform an action from the class C; on each attempt, either an action of C is performed, or no action from C can be performed, since no action from C is enabled. For example, we may view a finite fair execution as an execution at the end of which the automaton repeatedly cycles through the classes in round-robin order, attempting to perform an action from each class, but failing each time since no action is enabled in the final state.

We denote the set of fair executions of A by fairexecs(A). We say that β is a fair schedule of A if β is the schedule of a fair execution of A, and we denote the set of fair schedules of A by fairscheds(A). We say that β is a fair behavior of A if β is the behavior of a fair execution of A, and we denote the set of fair behaviors of A by fairbehs(A).
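For a finite execution, the fairness condition is directly checkable: no class may have an enabled action in the final state. A minimal sketch (our own encoding, with `enabled` supplied as a function from states to the set of enabled locally controlled actions):

```python
def is_fair_finite(execution, classes, enabled):
    """Fairness check for a *finite* execution: no action of any class C of
    the partition may be enabled in the final state.
    `execution` is an alternating list s0, a1, s1, ..., sn ending in a state;
    `classes` is a list of action sets; `enabled(state)` gives enabled actions."""
    final_state = execution[-1]
    return all(not (C & enabled(final_state)) for C in classes)
```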

We can prove the following analogues to the propositions in the preceding section.

Proposition: Let {A_i}_{i∈I} be a strongly compatible collection of automata, and let A = Π_{i∈I} A_i. If α ∈ fairexecs(A), then α|A_i ∈ fairexecs(A_i) for every i ∈ I. Moreover, the same result holds if fairexecs is replaced by fairscheds or fairbehs.

Proposition: Let {A_i}_{i∈I} be a strongly compatible collection of automata, and let A = Π_{i∈I} A_i. Suppose α_i is a fair execution of A_i for every i ∈ I, and suppose β is a sequence of actions in acts(A) such that β|A_i = sched(α_i) for every i ∈ I. Then there is a fair execution α of A such that β = sched(α) and α_i = α|A_i for every i ∈ I. Moreover, the same result holds when acts and sched are replaced by ext and beh, respectively.

Proposition: Let {A_i}_{i∈I} be a strongly compatible collection of automata, and let A = Π_{i∈I} A_i. Let β be a sequence of actions in acts(A). If β|A_i ∈ fairscheds(A_i) for every i ∈ I, then β ∈ fairscheds(A). Moreover, the same result holds when acts and fairscheds are replaced by ext and fairbehs, respectively.


We state these results because analogous results often do not hold in other models. As we will see in the following section, the fact that the fair behavior of a composition is uniquely determined by the fair behavior of the components makes it possible to reason about the fair behavior of a system in a modular way.

Distributed Algorithms — October
Lecturer: Nancy Lynch
Lecture: I/O Automata (cont.)

Problem Specification

We want to say that a problem specification is simply a set of allowable "behaviors", and that an automaton "solves" the specification if each of its behaviors is contained in this set. The automaton solves the problem in the sense that every behavior it exhibits is a behavior allowed by the problem specification; but notice that there is no single behavior the automaton is required to exhibit. The appropriate notion of behavior (e.g., finite behavior, arbitrary behavior, fair behavior, etc.) used in such a definition depends to some extent on the nature of the problem specification. In addition to a set of allowable behaviors, however, a problem specification must specify the required interface between a solution and its environment. That is, we want a problem specification to be a set of behaviors together with an action signature.

We therefore define a schedule module H to consist of two components:

1. an action signature sig(H), and
2. a set scheds(H) of schedules.

Each schedule in scheds(H) is a finite or infinite sequence of actions of H. We denote by finscheds(H) the set of finite schedules of H. The behavior of a schedule β of H is the subsequence of β consisting of external actions, and is denoted by beh(β). We say that β is a behavior of H if β is the behavior of a schedule of H. We denote the set of behaviors of H by behs(H), and the set of finite behaviors of H by finbehs(H). We extend the definitions of fair schedules and fair behaviors to schedule modules in a trivial way, letting fairscheds(H) = scheds(H) and fairbehs(H) = behs(H). We will use the term module to refer to either an automaton or a schedule module.

There are several natural schedule modules that we often wish to associate with an automaton. They correspond to the automaton's schedules, finite schedules, fair schedules, behaviors, finite behaviors, and fair behaviors. For each automaton A, let Scheds(A), Finscheds(A), and Fairscheds(A) be the schedule modules having action signature sig(A) and having schedules scheds(A), finscheds(A), and fairscheds(A), respectively. Also, for each module M (either an automaton or a schedule module), let Behs(M), Finbehs(M), and Fairbehs(M) be the schedule modules having the external action signature extsig(M) and having schedules behs(M), finbehs(M), and fairbehs(M), respectively. Here, and elsewhere, we follow the convention of denoting sets of schedules with lower-case names, and corresponding schedule modules with corresponding upper-case names.

It is convenient to define two operations for schedule modules. Corresponding to our composition operation for automata, we define the composition of a countable collection of strongly compatible schedule modules {H_i}_{i∈I} to be the schedule module H = Π_{i∈I} H_i, where:

1. sig(H) = Π_{i∈I} sig(H_i), and
2. scheds(H) is the set of sequences β of actions of H such that β|H_i is a schedule of H_i for every i ∈ I.

The following proposition shows how composition of schedule modules corresponds to composition of automata.

Proposition: Let {A_i}_{i∈I} be a strongly compatible collection of automata, and let A = Π_{i∈I} A_i. Then Scheds(A) = Π_{i∈I} Scheds(A_i), Fairscheds(A) = Π_{i∈I} Fairscheds(A_i), Behs(A) = Π_{i∈I} Behs(A_i), and Fairbehs(A) = Π_{i∈I} Fairbehs(A_i).

Corresponding to our hiding operation for automata, we define hide_Σ(H) to be the schedule module H' obtained from H by replacing sig(H) with sig(H') = hide_Σ(sig(H)).

Finally, we are ready to define a "problem specification" and what it means for an automaton to "satisfy" a specification. A problem is simply a schedule module P. An automaton A solves a problem P if A and P have the same external action signature and fairbehs(A) ⊆ fairbehs(P). An automaton A implements a problem P if A and P have the same external action signature (that is, the same external interface) and finbehs(A) ⊆ finbehs(P). Notice that if A solves P, then A cannot be a trivial solution of P, since the fact that A is input-enabled ensures that fairbehs(A) contains a response by A to every possible sequence of input actions. For analogous reasons, the same is true if A implements P.

Since we may want to carry out correctness proofs hierarchically, in several stages, it is convenient to state the definitions of "solves" and "implements" more generally. For example, we may want to prove that one automaton solves a problem by showing that the automaton "solves" another automaton, which in turn "solves" another automaton, and so on, until some final automaton solves the original problem. Therefore, let M and M' be modules (either automata or schedule modules) with the same external action signature. We say that M solves M' if fairbehs(M) ⊆ fairbehs(M'), and that M implements M' if finbehs(M) ⊆ finbehs(M'). (This concept is sometimes called "satisfying".)

As we have seen, there are many ways to argue that an automaton A solves a problem P. We now turn our attention to two more general techniques.

Proof Techniques

Modular Decomposition

One common technique for reasoning about the behavior of an automaton is modular decomposition, in which we reason about the behavior of a composition by reasoning about the behavior of the component automata individually.

It is often the case that an automaton behaves correctly only in the context of certain restrictions on its input. These restrictions may be guaranteed in the context of the composition with other automata comprising the remainder of the system, or they may be restrictions defined by a problem statement describing conditions under which a solution is required to behave correctly. A useful notion for discussing such restrictions is that of a module "preserving" a property of behaviors: as long as the environment does not violate this property, neither does the module.

In practice, this notion is of most interest when the property is prefix-closed, and when the property does not concern the module's internal actions. A set of sequences P is said to be prefix-closed if β ∈ P whenever both β is a prefix of γ and γ ∈ P. A module M (either an automaton or a schedule module) is said to be prefix-closed provided that finbehs(M) is prefix-closed.

Let M be a prefix-closed module, and let P be a nonempty, prefix-closed set of sequences of actions from a set Φ satisfying Φ ∩ int(M) = ∅. We say that M preserves P if βπ|Φ ∈ P whenever β|Φ ∈ P, π ∈ out(M), and βπ|M ∈ finbehs(M).

In general, if a module preserves a property P, then the module is not the first to violate P: as long as the environment only provides inputs such that the cumulative behavior satisfies P, the module will only perform outputs such that the cumulative behavior satisfies P. This definition, however, deserves closer inspection. First, notice that we consider only sequences βπ with the property that βπ|M ∈ finbehs(M). This implies that we consider only sequences βπ that contain no internal actions of M. Second, notice that we require sequences βπ to satisfy only β|Φ ∈ P, rather than the stronger property β ∈ P. Suppose, for example, that P is a property of the actions at one of two interfaces to the module M. In this case, it may be that for no β ∈ P and π ∈ out(M) is it the case that βπ|M ∈ finbehs(M), since all finite behaviors of M containing outputs include activity at both interfaces to M. By considering sequences βπ satisfying only β|Φ ∈ P, we consider all sequences determining finite behaviors of M that, at the interface concerning P, do not violate the property P.

One can prove that a composition preserves a property by showing that each of the component automata preserves the property.

Proposition: Let {A_i}_{i∈I} be a strongly compatible collection of automata, and let A = Π_{i∈I} A_i. If A_i preserves P for every i ∈ I, then A preserves P.

In fact, we can prove a slightly stronger result. An automaton is said to be closed if it has no input actions. In other words, it models a closed system that does not interact with its environment.

Proposition: Let A be a closed automaton. Let P be a set of sequences over Φ. If A preserves P, then finbehs(A)|Φ ⊆ P.

In the special case that Φ is the set of external actions of A, the conclusion of this proposition reduces to the fact that finbehs(A) ⊆ P. The proof of the proposition depends on the fact that Φ does not contain any of A's input actions, and hence that, if the property P is violated, then it is not an input action of A committing the violation. In fact, this proposition follows as a corollary from the following slightly more general statement: if A preserves P and Φ ∩ in(A) = ∅, then finbehs(A)|Φ ⊆ P.

Combining the two preceding propositions, we have the following technique for proving that an automaton implements a problem.

Corollary: Let {A_i}_{i∈I} be a strongly compatible collection of automata, with the property that A = Π_{i∈I} A_i is a closed automaton. Let P be a problem with the external action signature of A. If A_i preserves finbehs(P) for all i ∈ I, then A implements P.

That is, if we can prove that each component A_i preserves the external behavior required by the problem P, then we will have shown that the composition A preserves the desired external behavior; and since A has no input actions that could be responsible for violating the behavior required by P, it follows that all finite behaviors of A are behaviors of P.

Hierarchical Decomposition

A second common technique for proving that an automaton solves a problem is hierarchical decomposition, in which we prove that the given automaton solves a second, that the second solves a third, and so on, until the final automaton solves the given problem. One way of proving that one automaton A "solves" another automaton B is to establish a relationship between the states of A and B, and use this relationship to argue that the fair behaviors of A are fair behaviors of B. In order to establish such a relationship between two automata, we can use a simulation relation f. Below, for a binary relation f, we use the notation u ∈ f[s] as an alternative way of writing (s, u) ∈ f.

Definition: Suppose A and B are input/output automata with the same external action signature, and suppose f is a binary relation from states(A) to states(B). Then f is a forward simulation from A to B provided that both of the following are true:

1. If s ∈ start(A), then f[s] ∩ start(B) ≠ ∅.
2. If s is a reachable state of A, u ∈ f[s] is a reachable state of B, and (s, π, s') is a step of A, then there is an "extended step" (u, γ, u') such that u' ∈ f[s'] and γ|ext(B) = π|ext(A).

An extended step of an automaton A is a triple of the form (s, β, s'), where s and s' are states of A, β is a finite sequence of actions of A, and there is an execution fragment of A having s as its first state, s' as its last state, and β as its schedule. The following theorem gives the key property of forward simulations.

Theorem: If there is a forward simulation from A to B, then behs(A) ⊆ behs(B).
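For finite-state automata, both conditions of the definition can be checked by enumeration. The sketch below is our own encoding; for brevity it only looks for matching extended steps of B of length zero or one, and it checks all states rather than only the reachable ones, which is conservative (it may reject some valid simulations but never accepts an invalid one):

```python
def check_forward_simulation(f, A, B):
    """Naive check that relation f (a dict: state of A -> set of states of B)
    is a forward simulation from A to B.  A and B are dicts with 'start'
    (set of states), 'trans' (set of (s, act, s') triples), and 'ext'
    (set of external actions)."""
    # Condition 1: every start state of A is related to some start state of B.
    for s in A["start"]:
        if not (f.get(s, set()) & B["start"]):
            return False
    # Condition 2: every step of A is matched from every related state of B.
    for (s, act, s2) in A["trans"]:
        for u in f.get(s, set()):
            ok = False
            if act not in A["ext"]:
                # An internal step of A may be matched by B doing nothing.
                ok = u in f.get(s2, set())
            for (u1, bact, u2) in B["trans"]:
                # An external step must be matched by the same action.
                if u1 == u and bact == act and u2 in f.get(s2, set()):
                    ok = True
            if not ok:
                return False
    return True
```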

Shared Memory Systems as I/O Automata

Instantaneous Memory Access

As a case study, we consider a mutual exclusion system. In this section we model the users as I/O automata. Specifically, each user i is an IOA with inputs crit_i and rem_i, and outputs try_i and exit_i. The user automata have arbitrary internal actions, states sets, and start states. There is only one constraint: any such automaton "preserves" the cyclic behavior, i.e., if the system is not the first to violate it, the user does not violate it. (The formal definition of "preserves" is given in the notes above.)

The following example shows a particular user IOA. Note the language that is used to describe the automaton. We first describe the IOA intuitively: the example user chooses an arbitrary number (from some fixed finite range) and then makes exactly that many requests for the resource, in succession.

In the precondition/effect language, the user is described as follows.

states: As before, we can write the state as consisting of components. Here we have the following:
    region: with values in {R, T, C, E}, initially R
    count: a number from the fixed range, or nil, initially arbitrary
    chosen: a Boolean, initially false

start: Given by the initializations.

acts: The external actions are described above. Also, there is an internal action choose_i.

part: We have only one class, which contains all the non-input actions.

trans: We describe the allowed triples (s, π, s') by organizing them according to action:

trans We describ e the allowed triples s s by organizing them according to action

choose

i

Precondition

schosen false

Eect

s chosen true

s count f g

try

i

Precondition

schosen true

sregion R

scount

Eect

s region T

s count scount

crit

i

Eect

s region C

exit

i

Precondition

sregion C

Eect

s region E

rem

i

Eect

s region R

Note that the automaton is input-enabled, i.e., it can accommodate unexpected inputs (which do not necessarily induce interesting behavior, in this case, though).
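The precondition-effect code above can be rendered as a class whose methods assert their preconditions. This is an illustrative sketch, not the notes' notation; the range {1, 2, 3} for count is an assumption chosen for the example, since the actual bounds are not preserved in these notes.

```python
import random

class User:
    def __init__(self):
        self.region = "R"      # one of R, T, C, E
        self.count = None      # requests remaining; None plays the role of nil
        self.chosen = False

    # --- internal / output actions: precondition + effect ---
    def choose(self):
        assert not self.chosen                 # precondition
        self.chosen = True
        self.count = random.choice([1, 2, 3])  # nondeterministic effect (assumed range)

    def try_(self):
        assert self.chosen and self.region == "R" and self.count >= 1
        self.region = "T"
        self.count -= 1

    def exit(self):
        assert self.region == "C"
        self.region = "E"

    # --- inputs: no precondition (input-enabled) ---
    def crit(self):
        self.region = "C"

    def rem(self):
        self.region = "R"

u = User()
u.choose(); u.try_(); u.crit(); u.exit(); u.rem()
print(u.region)  # R: back in the remainder region, one request consumed
```

Note that crit and rem carry no assertions: as inputs, they must be accepted in every state, even when they violate the cyclic behavior.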

Here we are describing the transitions, of the form (s, π, s'), using explicit mentions of the pre- and post-states s and s'. The effect is described in terms of equations relating the two. We remark that this notation is very similar to that used by Lamport in his language TLA (Temporal Logic of Actions). An alternative notation, which is also popular, suppresses explicit mention of the states and describes the effect using assignment statements. This is the syntax used in Ken Goldman's Spectrum system, for instance.

Fair Executions. Consider the executions in which the system does not violate the cyclic behavior. All these executions must involve first doing choose: this action is enabled, nothing else is enabled, and fairness requires that the execution can't stop while something is enabled. By the code, the next step must be try_i. Then the execution can stop if no crit_i ever occurs. But if crit_i occurs, a later exit_i must occur. Again, at this point the execution can stop; but if a later rem_i occurs, then the automaton stops if it chose 1, else the execution continues. That is, we have fair executions (omitting the states, which are determined):

choose, try
choose, try, crit, exit
choose, try, crit, exit, rem
choose, try, crit, exit, rem, try
etc.

There are also some fair executions in which the system violates the cyclic behavior, e.g.,

choose, crit

But, for instance, the following is not fair (when the chosen count is greater than 1, so that a further try is enabled at the end):

choose, try, crit, exit, rem

Note that for the automaton above, all the fair executions with correct cyclic behavior are finite.
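Under the assumption that the system promptly grants crit and rem (so that the only nondeterminism left is the chosen count), the fair schedules listed above can be generated mechanically. This tiny helper is an illustration, not part of the notes.

```python
def fair_schedule(count):
    """The unique fair schedule of the example user, given its chosen count,
    assuming the system grants crit and rem immediately after try and exit."""
    sched = ["choose"]
    for _ in range(count):
        sched += ["try", "crit", "exit", "rem"]
    return sched

print(fair_schedule(2))
# ['choose', 'try', 'crit', 'exit', 'rem', 'try', 'crit', 'exit', 'rem']
```

Truncating this schedule anywhere after a "try" and before the matching "rem" also yields a fair execution, since at those points only inputs are enabled.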

To model the rest of the shared memory system, we use one big IOA to model all the processes and the memory together. The outputs are the crit and rem actions. The internal actions are any local computation actions and the shared memory actions. The states set consists of the states of all the processes, plus the state of the shared memory. The partition is defined by having one class per process, which contains all the non-input actions belonging to that process. The IOA fairness condition captures exactly our requirement on fairness of process execution in the shared memory system. Of course, there are extra constraints that are not implied by the general model, e.g., types of memory (read/write, etc.), and locality of process action (i.e., that a process can only affect its own state, only access one variable at a time, etc.).

Object Model

In this section we define formally an alternative way to model shared memory systems, the object model, which was discussed briefly in an earlier lecture. We will use this model for the next few lectures. In the object model, each process is a separate IOA (usually with only one fairness class), and each shared memory object is also an IOA. Thus, for example, a read/write object is an IOA with inputs of the form read_{i,x}, where i is a process and x an object name, and write(v)_{i,x}, where i is a process, x is an object, and v is a value to be written. Now note that a read is supposed to return a value. Since this is an IOA, this will be a separate output action, read-respond(v)_{i,x}. For uniformity, also give the write a write-respond_{i,x}. The write-respond_{i,x} action serves only as an "ack", saying that the write activation has terminated.

This is essentially the same interface that we have already described for atomic read/write objects; the only difference is that now we are adding an identifier for the specific object into all the actions. This is because IOAs have a global naming scheme for actions.

Of course, we could also have other objects: queues, read-modify-write registers, and snapshot objects, as discussed earlier. Each has its own characteristic interface, with inputs being invocations of operations and outputs being responses.
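As a concrete, deliberately simplified sketch of this interface style, here is a read/write object that accepts one invocation at a time and later emits the matching response. The action tuples are an assumed encoding for illustration; a real atomic object would also allow concurrently pending operations from different processes.

```python
class ReadWriteObject:
    """A read/write object with an invocation/response interface.
    Simplification: at most one operation may be pending at a time."""

    def __init__(self, x, initial=0):
        self.x = x            # object name, used to tag all actions
        self.value = initial
        self.pending = None   # (process, op, arg) awaiting a response

    # invocations (inputs to the object)
    def read(self, i):
        assert self.pending is None
        self.pending = (i, "read", None)

    def write(self, i, v):
        assert self.pending is None
        self.pending = (i, "write", v)

    # responses (outputs of the object)
    def respond(self):
        i, op, arg = self.pending
        self.pending = None
        if op == "read":
            return ("read-respond", i, self.x, self.value)
        self.value = arg
        return ("write-respond", i, self.x)   # serves only as an ack

r = ReadWriteObject("x")
r.write(1, 42); print(r.respond())  # ('write-respond', 1, 'x')
r.read(2); print(r.respond())       # ('read-respond', 2, 'x', 42)
```

The object name x appears in every action tuple, reflecting the global naming scheme for IOA actions mentioned above.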

The processes are automata with inputs as before, plus the responses to their data operations; e.g., read-respond(v)_{i,x} is an input to process i. The outputs of the processes now also contain the invocations of their data operations, e.g., read_{i,x}. Usually, each process would have only one fairness class; that is, we think of it as a sequential process. But we would like to be able to generalize this when convenient.

For the objects, we usually don't want to consider just any automata with the given interface; we want some extra conditions on the correctness of the responses. The atomic object conditions defined earlier are an important example. We recall these conditions below (see the earlier lecture notes for a more complete specification):

1. Preserving well-formedness. (Recall that "preserves" has a formal definition for IOAs.)

2. Giving atomic responses, based on serialization points, in all executions. Recall that it is required that all complete operations, and an arbitrary subset of the incomplete operations, have serialization points.

3. In fair executions, every invocation has a response. Note that this now makes sense for automata that have some organization other than the process-variable architecture we have been studying, because IOAs in general have a notion of fairness.

After composing the process automata and the object automata, we hide the communication actions between them.

Modularity Results

Now that we have described the two kinds of models, instantaneous access and atomic objects, more or less precisely in terms of IOAs, we can state our modularity results.

Transforming from Instantaneous to Atomic Objects

In this section we discuss how to transform algorithms that are specified for instantaneous objects so that they work with atomic versions of the objects. Let A be an algorithm in the instantaneous shared memory model satisfying the following conditions.

1. Each process can only access one shared variable at a time (this is the usual restriction).

2. For each user i of A, suppose that there is a notion of user well-formedness, consisting of a subset of alternating sequences of input and output actions. We require that A preserve user well-formedness.

3. Suppose that p_i's states are classified as "input states" and "output states", based on which is supposed to happen next in a well-formed execution (initially, only input actions are enabled). We require that non-input steps of p_i are only enabled in output states.

We remark that this is the case for all the examples we have considered. For mutual exclusion algorithms, the output states are the T, E states, which occur between try and crit actions and between exit and rem actions, respectively. For snapshot implementations, the output states are just those while an invocation of update or snap is active. For consensus algorithms, the output states are those between the init and the decide.

Assuming A satisfies the above conditions, we can now transform A into a new algorithm T(A), designed for the atomic object model. T(A) includes an atomic object for each shared memory location. Each access to an object is expanded to an invocation and a response. After the invocation, the process waits until the response occurs, doing nothing else in the meantime. When the response comes in, the process resumes as in A.
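The expansion performed by T(A) can be sketched as follows. ToyAtomicObject is an assumption made for illustration: it serializes invocations internally and responds one internal step later. The point is only the shape of transformed_access, which issues the invocation and then waits, doing nothing else until the response arrives.

```python
class ToyAtomicObject:
    """Stand-in for an atomic read/write object (assumed, for illustration)."""

    def __init__(self):
        self.value = 0
        self.inbox = []      # pending invocations, in arrival order
        self.outbox = {}     # process -> response, once available

    def invoke(self, i, op, *args):
        self.inbox.append((i, op, args))

    def step(self):
        """Internal action: serialize and perform one pending operation."""
        if self.inbox:
            i, op, args = self.inbox.pop(0)
            if op == "write":
                self.value = args[0]
                self.outbox[i] = "ack"
            else:
                self.outbox[i] = self.value

    def transformed_access(self, i, op, *args):
        """T(A)'s expansion of one instantaneous access: invoke, then wait
        (doing nothing else) until the matching response is available."""
        self.invoke(i, op, *args)
        while i not in self.outbox:
            self.step()      # meanwhile, other components take steps
        return self.outbox.pop(i)

obj = ToyAtomicObject()
obj.transformed_access(1, "write", 7)
print(obj.transformed_access(2, "read"))  # 7
```

Because the process is blocked between invocation and response, it performs no locally controlled actions in that interval, which is exactly the fact the proof sketch below relies on.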

Lemma:

1. T(A) preserves user well-formedness for each user i.

2. If α is any (not necessarily fair) user well-formed execution of T(A), then there is a user well-formed execution α' of A such that beh(α) = beh(α').

3. If α is any fair user well-formed execution of T(A) (i.e., both the processes and the atomic objects have fair executions), then there is a fair user well-formed execution α' of A such that beh(α) = beh(α').

Proof Sketch: Here is a sketch of the argument. We start with α and modify it to get α', as follows. First, by the definition, we have the serialization points within the operation intervals for all the invocations of the atomic objects. Now consider the execution obtained by moving the invocation and response actions to the serialization points, and adjusting the states accordingly. In the case of incomplete operation intervals, there are two cases. If there is a serialization point for the interval, then put the invocation from α, together with a newly manufactured response event, at the serialization point (the response event should be the one whose existence is asserted by the atomic object definition). If there is no serialization point, then just remove the invocation.

We now claim that we can move all the events we have moved without changing the order of any events of a particular process p_i. This follows from two facts. First, by construction, p_i does no locally controlled actions while waiting for a response from an operation. Second, when p_i invokes an operation, it is in an output state, and by assumption no inputs will come in when p_i is in an output state. Similarly, we can remove the invocations we have removed, and add the responses we have added, without affecting the external behavior: since a process with an incomplete invocation does not do anything afterwards anyway, it doesn't matter whether it stops just before issuing the invocation, while waiting for the response, or just after receiving the response.

In the resulting execution, all the invocations and responses occur in consecutive pairs. We can replace those pairs by instantaneous access steps, thus obtaining an execution α' of the instantaneous access system A. Thus α and α' have the same behavior at the user interface, proving Part 2.

For Part 3, notice that if α is fair, then all embedded executions of the atomic objects always respond to all invocations. Coupling this with the fairness assumption for processes yields fairness for the simulated algorithm A, i.e., that the simulated processes continue taking steps.

So the Lemma says that any algorithm for the instantaneous shared memory model can be used, in a transformed version, in the atomic object shared memory model, and no external user of the algorithm could ever tell the difference.

In the special case where A itself is an atomic object implementation in the instantaneous shared memory model, this lemma implies that T(A) is also an atomic object implementation. To see this, note that Property 1 (preserving well-formedness) follows immediately from Part 1 of the Lemma; Property 2 follows from Part 2 of the Lemma and the fact that A is an atomic object (Property 2 of the atomic object definition, for A); and Property 3 follows from Part 3 of the Lemma and the fact that A is an atomic object (Property 3 of the atomic object definition, for A). Therefore, we can build atomic objects hierarchically.

Composition in the Object Model

Now we want to consider building objects that are not necessarily atomic, hierarchically. To do this, we have to generalize the notion of an object specification to include other kinds of objects. Later in the course we shall see some examples of such objects, including safe and regular objects, and the special concurrent timestamp object to be considered next time.

Object Specifications

We define the general notion of an object specification, generalizing the specification for atomic objects. Formally, an object specification consists of the following:

1. An external signature, consisting of input actions, called invocations, and output actions, called responses. There is also a fixed binary relation that relates responses syntactically to corresponding invocations.

2. A set of sequences of external actions; all sequences in the set are well-formed, i.e., they contain alternating invocations and corresponding responses on each of the lines, with the invocation first. The set is prefix-closed.

This is a very general notion. Basically, it is an arbitrary behavior specification at the object boundary. The behavior specification for atomic objects is a special case.

An IOA with the appropriate interface is said to satisfy (or implement) this specification if the following conditions hold:

1. It preserves well-formedness.

2. All its well-formed behaviors are in the set.

3. All its fair behaviors include responses to all invocations.
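The well-formedness notion above (alternating invocations and matching responses on each line) can be checked by a one-pass scan over a behavior. The ("inv"/"resp", line) tagging below is an assumed encoding, used only for illustration.

```python
def well_formed(behavior):
    """Check that, on each line, invocations and responses alternate,
    invocation first.  behavior is a sequence of (kind, line) pairs."""
    pending = set()                      # lines with an outstanding invocation
    for kind, line in behavior:
        if kind == "inv":
            if line in pending:
                return False             # second invocation before a response
            pending.add(line)
        else:                            # kind == "resp"
            if line not in pending:
                return False             # response with no matching invocation
            pending.remove(line)
    return True

print(well_formed([("inv", 1), ("inv", 2), ("resp", 1), ("resp", 2)]))  # True
print(well_formed([("inv", 1), ("inv", 1)]))                            # False
```

Note that every prefix of a well-formed behavior is well-formed, matching the prefix-closure requirement on the specification's set of sequences.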

Compositionality of Object Implementations

Suppose that we start with a system A in the object specification model, using a particular set of object specifications for its objects. We need to be a little more careful about what this means. The executions of this system are the executions of the processes that get responses allowed by the object specifications. The fair executions of the system are then defined to be those executions that are fair to all the processes, and in which all invocations to the object specifications eventually return. We also assume that each process of A preserves well-formedness for each object specification.

Suppose that A itself satisfies another, higher-level, object specification. Consider the result of replacing all the object specifications of A with object IOAs that satisfy the corresponding specifications in the object specification model. We claim that the resulting system still implements the external object specification.

Figure: On the left, an object composed of smaller objects; all the shaded processes are associated with the same external interface line. On the right, the combined process, for fairness purposes.

Note that fairness of the composed implementation refers to fair execution of all the processes, high-level and low-level, plus the object specification liveness condition for the low-level objects.

Note also that the resulting architecture is different from the model we have been using: it doesn't have one process per line, but rather a distributed collection of processes. We can convert this to a system in the usual architecture by composing some of these processes to give the processes we want, as follows (see the figure). Compose the distributed processes that operate on behalf of any particular line i: the high-level process, plus, for every line of the intermediate-level objects that it uses, the corresponding low-level process (or processes, if one high-level process invokes several different operation types on a single object). Call this composition q_i. Let the composed processes q_i be the processes of the composed implementation. Notice that they can have several fairness classes each; this is why we didn't want to rule that out earlier.

Remark: We can get a similar result composing a system that uses object specifications with object implementations that use instantaneous memory, getting a composed implementation that uses instantaneous memory.

Distributed Algorithms, October
Lecturer: Nancy Lynch
Lecture

Modularity (cont.)

Wait-freedom

In this section we consider the wait-freedom property. We define this property only for implementations in certain special forms, as follows.

In the instantaneous shared memory model, wait-freedom deals with an intermediate notion between ordinary IOA fairness (fairness to all processes) and no fairness, namely, fairness to particular processes. For this setting, the wait-freedom condition is that if an execution is fair to any particular process i, then any invocation of i eventually gets a response.

In the object specification model, where in place of the objects we have object specifications, the wait-freedom condition is that if an execution is fair to any particular process i, and all invocations to memory objects by process i return, then any invocation of i eventually gets a response.

Let us consider the compositionality of wait-free implementations. To be specific, suppose that we start with a system A in the object spec model that by itself satisfies some object spec, using a particular set of object specs for its objects. Suppose that each process of A preserves well-formedness for each object. Assume also that A is wait-free. Suppose we then replace all the object specs used by A with wait-free objects that satisfy the corresponding specs (again, these are required to be expressed in the object spec model). We already know how to get the composition, and we also know that the result satisfies the external object spec. We now want to claim that the resulting system is also wait-free. We argue as follows. Let α be an execution of the composed system in which q_i does not fail, and in which all invocations by q_i on low-level objects return. In α, the high-level process p_i does not fail, since by assumption q_i doesn't fail. Also, any invocation by p_i on the intermediate-level objects must return, by the wait-freedom of the implementation of the intermediate objects. Therefore, by the wait-freedom of the high-level implementation, any invocation on line i eventually returns.

Remark: We can get a similar result about composing a wait-free implementation that uses object specs with wait-free object implementations that use instantaneous memory, getting a wait-free composed implementation that uses instantaneous memory.

Concurrent Timestamp Systems

Today we shall go through some constructions to show how we could build some powerful objects out of others. We shall do this hierarchically. We have already seen, in an earlier lecture, how we can get a wait-free implementation of an atomic snapshot object in terms of instantaneous-access single-writer multi-reader read/write objects. In this section we show how to get a wait-free implementation of another type of object, called a concurrent timestamp object (CTS), in terms of instantaneous snapshot objects. By the compositionality results above, we can combine these and get a wait-free implementation of a concurrent timestamp object in terms of instantaneous-access single-writer multi-reader read/write variables.

We shall then see how we can use a CTS object as a building block for other tasks. Specifically, we give an improvement to Lamport's bakery algorithm, and an implementation of a wait-free multi-writer multi-reader register. CTS objects can also be used for other tasks, e.g., resource allocation systems.

Definition of CTS Objects

A concurrent timestamp system object has two operations:

- label(v)_i: always gets the response OK.

- scan_i: gets a response that describes a total ordering of the n processes, together with an associated value for each process.

The rough idea is that each newly arrived process gets an associated label of some kind, and we require that later labels can be detected as larger than earlier labels. The scan operation is supposed to return a reasonable ordering of the processes, based on the order of their most recent labels, i.e., the approximate order of their most recent arrivals. And, for good measure, it also returns the values associated with those labels.

Note that this is not an atomic object; it has a more general kind of object spec. In the literature (see Gawlick's thesis, and Dolev-Shavit), the precise requirements are given by a set of axioms about the well-formed behaviors. These look a little too complicated to present in class, so instead we define the correct behaviors as the well-formed behaviors of a fairly simple algorithm, given in the two figures below.

This spec algorithm uses a different programming notation from what we have previously used (e.g., for mutual exclusion). The notation presented here describes I/O automata more directly: each action is specified separately, with an explicit description of the states in which it can occur and the changes it causes to the process state. Note also the explicit manipulation of the program counter pc. This notation is very similar to that used in the previous lecture to describe a user automaton; the main differences are that here we use assignment statements rather than equations, and leave the pre- and post-states implicit.

The begin and end actions just move the pc and transmit the results. The real work is done by the embedded snap and update actions. In the scan, the process does a snap to get an instantaneous snapshot of everyone's latest label and value. Then it returns the order determined by the labels, with the associated values. In the label, the process first does a snap to collect everyone's labels. Then, if the process does not already have the max, it chooses some larger value; this could be the integer one greater than the max, as in Lamport's bakery, or it could be some more general real value. Finally, it does an instantaneous update to write the new label and value in the shared memory.

A straightforward observation is that the algorithm is wait-free. The most interesting property of the algorithm, however, is that it can be implemented using a bounded domain for the labels. This property will allow us to obtain bounded versions of a bakery-style mutual exclusion algorithm, and a wait-free implementation of multi-writer multi-reader atomic objects in terms of single-writer multi-reader ones, which uses three levels of implementation: snapshot in terms of single-writer objects, CTS object in terms of snapshot, and multi-writer registers in terms of CTS object.

Bounded Label Implementation

We start with a bounded label implementation. First, we recall what this means. We are using the spec program (given in the two figures) as an object specification. It describes the interface and gives a well-defined set of well-formed behaviors (whatever they are). To get an implementation of this spec, we need to design another algorithm that also preserves well-formedness, that has the required liveness properties, and whose well-formed behaviors are a subset of those of this algorithm.

The algorithm we use is very close to the unbounded algorithm given by the spec. The only difference is in the label domain: for the bounded case, we shall use a finite domain, of size exponential in n. The new domain has a binary relation ≺, defined for the purpose of determining order of labels, maximum, etc.; only now it can't be a total order, because elements have

Shared variables: a snapshot object whose value is a vector of pairs (label, value), with one component for each labeler process i. The labels are positive real numbers, and the domain of value is arbitrary.

scan process i:

Local state:
  pc: program counter
  snap-label, snap-value: to hold values returned by a snap
  order: to hold an ordering of processes, to be returned

Actions:

begin-scan_i
  Effect:
    pc := snap

snap_i
  Precondition:
    pc = snap
  Effect:
    (snap-label, snap-value) := (label, value)
    pc := define-response

define-response_i
  Precondition:
    pc = define-response
  Effect:
    order := the total order on indices in which j appears
      before k iff (snap-label(j), j) < (snap-label(k), k)
    pc := end-scan

end-scan(v, o)_i
  Precondition:
    pc = end-scan
    o = order
    v = snap-value
  Effect:
    pc := begin-scan

Figure: Specification algorithm for timestamp system, part I.

label process i:

Local state:
  pc: program counter
  snap-label, snap-value: to hold values returned by a snap
  new-label: to hold the newly chosen label
  new-value: to hold the value being written

Actions:

begin-label(v)_i
  Effect:
    new-value := v
    pc := snap

snap_i
  Precondition:
    pc = snap
  Effect:
    (snap-label, snap-value) := (label, value)
    if (snap-label(j), j) ≤ (snap-label(i), i) for all j, then
      new-label := snap-label(i)
    else new-label := max_j(snap-label(j)) + r, where r
      is any positive real (nondeterministic)
    pc := update

update_i
  Precondition:
    pc = update
  Effect:
    (label(i), value(i)) := (new-label, new-value)
    pc := end-label

end-label_i
  Precondition:
    pc = end-label
  Effect:
    pc := begin-label

Figure: Specification algorithm for timestamp system, part II.

to be reused. The algorithm will ensure, however, that the labels that are in use at any particular time are totally ordered by the given relation.

The only changes in the code are in the actions define-response of scan (in particular, where the order of indices is determined from the scanned labels) and snap of label (in particular, in determining whether the process executing the step has the maximum label, and in determining the next label to choose).

Atomic Labels and Scans

To motivate the value domain chosen, first suppose that each label and scan operation is done atomically. Consider two processes. We use the label domain consisting of {1, 2, 3}, with the relation ≺ given by the arrows (see the figure).

Figure: 3-element domain for two processes; the elements 1, 2, 3 form a cycle, and a → b indicates that a ≺ b.

That is, the order relation ≺ is {(1, 2), (2, 3), (3, 1)}. Suppose that the two processes p and q initially both have label 1 (ties are broken by process index). If p does a label operation, it sees that q has the maximum (assuming that p < q), so when p chooses a new label it gets the next label, which is 2. Now if q does a label, it sees that the maximum is 2, and therefore chooses the next label, which is now 3. The two, if they take turns, just chase each other around the circle, in leapfrog fashion. It is important to see that the effect is just as if they were continuing to choose from an unbounded space.
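The two-process leapfrog can be sketched with the relation and successor function written out explicitly. This is an illustration, not the notes' code; the tie-breaking here assumes the lower-index process p moves first, as in the scenario above.

```python
PREC = {(1, 2), (2, 3), (3, 1)}           # a -> b on the cycle means a precedes b
NEXT = {1: 2, 2: 3, 3: 1}                 # successor on the 3-cycle

def do_label(labels, chooser, other):
    """chooser takes a new label unless it currently holds the maximum."""
    if (labels[other], labels[chooser]) in PREC:
        return                            # chooser already holds the maximum
    labels[chooser] = NEXT[labels[other]] # next label after the maximum

labels = {"p": 1, "q": 1}                 # tie: q is the maximum (p < q)
do_label(labels, "p", "q")                # p -> 2
do_label(labels, "q", "p")                # q -> 3
do_label(labels, "p", "q")                # p -> 1 (wraps around)
do_label(labels, "q", "p")                # q -> 2
print(labels)  # {'p': 1, 'q': 2}
```

Each step preserves the invariant that the two in-use labels are adjacent on the cycle, so the ≺ relation always orders them, exactly as an unbounded label space would.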

What about three processes? We can't simply extend this to a ring of a larger size; consider the following scenario. Suppose one of the processes, say p, retains its label and does no new label operations, while the other two processes, say q and r, leapfrog around the ring. This can yield quite different behavior from the unbounded algorithm when q and r eventually "bump into" p. In fact, we don't even have a definition of how to compare non-adjacent labels in this ring.

A valid solution for the 3-process case is given by the recursive label structure depicted in the figure below. In this domain we have three main (level-1) components, each of which is constructed from three level-2 components. In each level, the ordering is given by the arrows. A label now consists of a string of two sub-labels, one for each level, and the ordering is determined lexicographically.

Figure: 9-element domain for three processes. Each element has a label that consists of a string of two characters from {1, 2, 3}, and the ≺ relation is taken to be lexicographic.

To get a feeling for why this construction works, consider again the three-process scenario described above, starting from labels 1·1, 1·2, and 1·3 for p, q, and r. If p keeps the initial label 1·1, q will choose next; but q will not wrap around within component 1, because that level-1 component is already full (i.e., it contains 2 processes). So instead it goes to 2·1, starting in the next component; r, whose maximum is now 2·1, then chooses 2·2. The total order of the processes is p, q, r, since the given relation relates all three of the pairs of labels involved: here we have 1·1 ≺ 2·1, 2·1 ≺ 2·2, and 1·1 ≺ 2·2 in the relation. In the next step, q would choose 2·3. Then r could choose 2·1 and still maintain the total order property, because a component is not considered full if the processes it contains include the one now choosing. This can go on forever, since q and r alone would just cycle within component 2, without violating the requirement.

It is now quite easy to extend this scheme to the general case of n processes. We will have a domain of size 3^(n−1), where each label is a string of length n−1 over {1, 2, 3}, corresponding to a nesting of cycles of depth n−1. We say that a level-k component is full if it contains at least n − k processes. An invariant can be proved saying that, for any cycle at any level, at most two of the 3 components are occupied, which suffices to yield a total order (cf. homework problem).

The following rule is used to pick a new value. If the choosing process is the maximum (i.e., it dominates all the others in the given relation), it keeps its old value. If it is not the maximum, then it looks at the current maximum value. It finds the first level k such that the level-k component of the maximum is full, i.e., contains at least n − k processes' values, excluding the process currently choosing. The chosen value is the first label (according to the relation) in the next component at level k.

Note that there must be some full level, since fullness at the last level just means that there is a process (namely, the maximum itself) in the lowest-level component, viz., the single node containing the maximum value.
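The choice rule can be sketched directly on the length-(n−1) string encoding. This is an illustration under the conventions above, with two assumptions: a level-k component is identified with a shared length-k prefix, and lmax is supplied by the caller rather than computed from the cyclic relation.

```python
NEXT = {1: 2, 2: 3, 3: 1}  # successor within a 3-cycle

def choose_label(labels, i, lmax, n):
    """New label for process i (assumed not to be the maximum).
    Labels are length-(n-1) tuples over {1, 2, 3}."""
    others = [labels[j] for j in labels if j != i]
    for k in range(1, n):                       # levels 1 .. n-1
        prefix = lmax[:k]
        in_component = [l for l in others if l[:k] == prefix]
        if len(in_component) >= n - k:          # level-k component is full
            # first label of the next component at level k
            # (assumes a component's first label has an all-1s suffix)
            return lmax[:k-1] + (NEXT[lmax[k-1]],) + (1,) * (n - 1 - k)
    raise AssertionError("the last level is always full: the max itself is there")

# Three processes, labels of length 2, as in the scenario above.
labels = {"p": (1, 1), "q": (1, 2), "r": (1, 1)}
print(choose_label(labels, "r", (1, 2), 3))  # (2, 1): level-1 component 1 is full
labels["r"] = (2, 1)
print(choose_label(labels, "q", (2, 1), 3))  # (2, 2): the node (2, 1) is full
```

The two calls reproduce the hand trace above: the chooser jumps to a fresh level-1 component when the maximum's component is full, and otherwise advances within it.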

It is perhaps helpful to trace this procedure for n = 4 processes, with 3 levels of nesting (see the figure below).

If the max's level-1 component is full, it contains all 3 other processes, so we can choose in the next level-1 component, since we are guaranteed that no one has a value outside the max's level-1 component. If the max's level-1 component is not full, then there are at most 2 processes in that component. We then look in the level-2 component of the max. If it is full, it contains 2 processes' values, so there are none left for the other two level-2 components, and hence it is safe to choose there. Else, there is at most 1 process in the level-2 component. We can repeat the argument for this case to see that it is OK to choose the next element within the level-2 component, i.e., the next level-3 component.

Non-atomic Labels

The recursive triangles construction above doesn't quite give us what we want. The problems arise when we consider concurrent execution of the label operations. Consider the structure with 3 processes, and suppose that we get to the point where p, q, and r have labels 1·1, 1·2, and 1·3, respectively.

Suppose that both p and q initiate label operations. First, both do their embedded snap steps, discovering that the max label in use is 1·3. They both choose label 2·1. Process p writes its new label, but process q delays writing. Now processes p and r can leapfrog within component 2, eventually reaching a point where the occupied labels there are 2·2 and 2·3. So far, so good; but now let q perform its delayed write. This makes all three labels in component 2 occupied, which breaks the total ordering.

In order to handle such race conditions when processes first enter a component, the size-3 ring is modified by adding a "gateway" (see the figure below). The labels for n processes are obtained by nesting this structure recursively, as before; again, we say that a level-k component is full if it contains at least n − k processes. We use the same choice rule as before, looking for the first level at which the max's component is full.

Figure: the label domain for four processes.

Now we claim that this construction works in the concurrent case also.

For intuition, reconsider the previous example, now with the gateway. Suppose that we manage to get the processes to the point where p and q hold two earlier labels and r holds the maximum (see the figure below).

Processes p and q can choose concurrently, both selecting the gateway label of the next component. Now let p write, and delay q, as before. If p and r now leapfrog around that component, they do so around the cycle, past the gateway; this means that they never get back to the gateway position. Thus, when q eventually writes there, its label is ordered earlier than those of the leapfrogging processes, and the total order is preserved. Note that there are now 3 processes in the component, i.e., it is "overfull". But that isn't too harmful, since there are only 2 within the cycle. If q now picks another label, it will observe that the component is already full, and will choose in the next component.

For general n, we claim that although components can become overfull (i.e., a level-k component could have more than n − k processes), the cycle of a level-k component never contains more than n − k processes. For general n, we can prove the invariant that in every cycle, at every level, at most two of the three sub-components are occupied, which implies a total ordering of the labels. We remark that the precise statements of the invariants are a little trickier than this; see below.

Correctness Proof

Gawlick, Lynch, and Shavit have carried out an invariant proof for the CTS algorithm. The following invariants are used.

Invariant 1: Let L be any set of n labels obtained by choosing, for each i, exactly one of label_i and new-label_i. Then L is totally ordered by the label order relation.

Note that Invariant 1 says not only that the written labels are totally ordered, but also that so are all the pending ones, and any combination of the two kinds. Given this, we can define lmax to be the max label value, and imax to be the largest index of a process having label lmax.

Figure: structure of one level of the concurrent-case domain (the ring with a gateway).

Figure: initial configuration for the concurrent case.

Invariant 2: If i = imax, then new-label_i = label_i.

Invariant 2 says that any pending label from the known max is the same as its old label: it would never have had any reason to increase its label, if it's the max.

The following invariant describes a property that any pending label that is greater than the current known max must satisfy. Define the nextlabel function to yield the very next value at the indicated level.

Invariant 3: If lmax < new-label_i, then new-label_i = nextlabel(lmax, k) for some k, 1 ≤ k ≤ n.

The previous three invariants are sufficient to complete the proof; the others are just used to make the induction go through. However, in order to prove these three invariants by induction, we require three additional technical invariants. Thus, for all levels k, 1 ≤ k ≤ n, we have the following three technical claims. First, we have one about old pending labels: it constrains how different these pending labels can be from the corresponding actual labels.

Invariant 4: If new-label_i ≤ lmax, and label_i is in the same level-k component as lmax, then so is new-label_i.

The next invariant describes some more constraints on the new-label values, based on the corresponding actual labels. This time, the constraint refers to new-label values in the cycle of a component containing the max label.

Invariant 5: If new-label_i is in the cycle of the level-k component containing lmax, then label_i is in that level-k component.

And finally, we have an invariant saying that it is sometimes possible to deduce that the component containing the max label contains lots of processes.

Invariant 6:

1. If new-label_i = nextlabel(lmax, k), then there are at least n − k processes, excluding i, whose label values are in the same level-k component as lmax.

2. If lmax is not in the lowest level-k component, then there are at least n − k processes whose label values are in the same level-k component as lmax.

These invariants are proved by induction, as usual. We remark that the proofs are not short: they are fairly long, detailed case analyses.

The upshot of the invariants is that we have the total ordering property preserved; but this doesn't exactly show that we have behavior inclusion.

Recall that we need to show that all well-formed behaviors of the bounded object are also well-formed behaviors of the unbounded object. The right way to show this is to give a formal correspondence between the two objects. Recall the case of the stopping-failure synchronous consensus algorithm, where we showed a formal correspondence between an inefficient but understandable algorithm and an optimized version. We did this using invariants relating the states of the two algorithms when they run using the same inputs and failure pattern. We are going to do essentially the same thing here. Now, however, we don't have any convenient notion analogous to the (inputs, failure pattern) pair to determine the course of the execution. We are in a setting in which there is much more nondeterminism, both in the order in which various actions occur, and in what state changes happen when a particular action happens (consider the choice of the new label in the unbounded object). But note that now we don't have to show an equivalence saying that everything that either algorithm does, the other does also. Rather, we only need a one-directional relationship, showing that everything that the bounded algorithm does, the unbounded one can also do; this is exactly inclusion of external behaviors.

We shall use the notion of a forward simulation relation, defined in an earlier lecture, applying it to get a simulation from B, the bounded object, to U, the unbounded object. Incidentally, in this case we don't need to use the well-formedness restriction: the correspondence between the objects also holds in the non-well-formed case, although it is not so interesting in that case. We remark that if we only wanted to consider inclusion in the restricted case, say for some other implementation that didn't work in the case of non-well-formed executions, we could just enlarge the system by explicit composition with a "most general well-formedness-preserving" environment, and prove inclusion for the resulting composed systems.

It turns out that the forward simulation mapping f is quite simple. (Note the similarity in style to the earlier proof for synchronous consensus, with invariants about the pair of states.) We define u ∈ f(s) provided that the following hold, for all i and j with i ≠ j:

- s.pc_i = u.pc_i
- s.label_i < s.label_j if and only if u.label_i < u.label_j
- s.label_i < s.new-label_j if and only if u.label_i < u.new-label_j
- s.new-label_i < s.label_j if and only if u.new-label_i < u.label_j
- s.new-label_i < s.new-label_j if and only if u.new-label_i < u.new-label_j
- s.order_i = u.order_i
- s.value_i = u.value_i
- s.snapvalue_i = u.snapvalue_i

In other words, everything is exactly the same, except for the two kinds of label values; and in these cases, the ordering relationships are the same. Now we need to show that this correspondence is preserved. Formally, we have to show the conditions of the definition of a simulation relation. To help us, we use the invariants above. The interesting step is the snap step embedded in a label operation, since it is the one that has the potential to violate the nice ordering relationships, by choosing new-label_i. So consider a snap step within a label operation. If i detects itself to be the max, based on the label values that it reads in the snapshot, it doesn't make any changes in either algorithm, so the correspondence is preserved. So suppose that it doesn't detect itself to be the max (which is the same in both). So a particular label is chosen in B. We must define some corresponding label in U, which should satisfy the same ordering conditions. We know that this new label will be strictly greater than all the labels in B, so we might be tempted to just choose one plus the max of those for use in U. But we have to also preserve the relative order of this new-label and all the other new-labels. There are scenarios in which this new one might not be greater than all the others; in fact, it possibly could be strictly less than some. But Invariant 4 implies that we don't have to worry about the new-labels that are not greater than the maximum label in B, and Invariant 3 characterizes the form of the new-labels that are greater than the maximum label in B. Now it is easy to see how the new label now being chosen fits in relative to these, and we can choose a new label in U in the same way.
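The ordering conditions in the simulation relation amount to requiring that the bounded and unbounded label assignments be order-isomorphic. A toy check of that property (a hypothetical helper, not from the notes):

```python
from itertools import combinations

def order_isomorphic(s_labels, u_labels):
    """True if two equal-length label assignments induce the same pairwise
    < relation -- the heart of the simulation's ordering conditions."""
    assert len(s_labels) == len(u_labels)
    return all(
        (s_labels[i] < s_labels[j]) == (u_labels[i] < u_labels[j])
        for i, j in combinations(range(len(s_labels)), 2)
    )

# Replacing bounded labels by any order-preserving unbounded ones is fine:
assert order_isomorphic([2, 1, 3], [20, 10, 30])
assert not order_isomorphic([2, 1, 3], [10, 20, 30])
```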

Application to Lamport's Bakery Algorithm

We can use a CTS algorithm to provide a bounded-register variant of Lamport's bakery algorithm. We describe the algorithm in the figure below, in a sort of pseudocode roughly similar to the original Lamport algorithm, rather than in precondition-effect notation. The invocations of label and scan are written as single statements, but they denote pairs of invocation and response actions on the CTS object, e.g., begin-label_i and end-label_i.

Let us go over the differences from the original bakery algorithm. First, instead of explicitly reading all the variables and taking the maximum, we just do a label operation to establish the order; everything is hidden within the CTS object. Second, the structure of the checks is changed a little. Instead of looping through the processes one at a time, the modified algorithm first checks that no one is choosing, and then checks the priorities. (Note that the algorithm is not sensitive to this difference.) Third, in the loop where we check the priorities, instead of reading the numbers, the modified algorithm at process i does a scan to obtain the order, and then looks for i to be ahead of everyone else. Note that in this variant, we only check for i to be ahead of those that are currently competing. In the prior algorithm, the number was explicitly set back to 0 when a process left; but now we no longer have sufficient explicit control to assign those numbers. So instead, we leave the order as it is, but explicitly discount those that aren't competing.

To understand the above algorithm, consider it first when the unbounded CTS algorithm is employed. We claim that in this case, arguments similar to those used in the correctness proof for the original bakery algorithm show the same properties, i.e., mutual exclusion, deadlock-freedom, lockout-freedom, and FIFO after a wait-free doorway (now the doorway is the portion of the code from entry to the trying region until the process sets its flag to 2). The unbounded variant is more complicated, has worse time complexity, and so far does not seem to be better than the original algorithm. But now note that we can replace the unbounded CTS object within this algorithm by the bounded CTS object. Since every well-formed behavior of the bounded algorithm is also a well-formed behavior of the unbounded algorithm, we obtain an algorithm that uses bounded-size registers and still works correctly. Recall that the bounded algorithm is based on snapshot, in turn based on single-writer registers. And the only values that need to be written in the registers are the timestamps, chosen from the bounded domain. Note that this application did not use the value part of the CTS; only the ordering part is used.

Shared variables:
    flag: an array indexed by {1, ..., n} of integers from {0, 1, 2}, initially all 0, where flag(i) is written by p_i and read by all
    CTS: a concurrent timestamp object

Code for p_i:

try_i
L1: flag(i) := 1
    label_i
    flag(i) := 2
L2: for j in {1, ..., n} do
        if flag(j) = 1 then goto L2
        end if
    end for
L3: scan_i
    for j in {1, ..., n} do
        if j precedes i in the scanned order and flag(j) = 2 then goto L3
        end if
    end for
crit_i

Critical region

exit_i
    flag(i) := 0
rem_i

Remainder region

Figure: the bakery algorithm with concurrent timestamps.

6.852J/18.437J Distributed Algorithms, November
Lecturer: Nancy Lynch

Lecture

Multi-Writer Register Implementation Using CTS

In this section, we show another simple and interesting application of CTS: implementing multi-writer multi-reader registers from single-writer multi-reader registers. In this application, we use the value part of the CTS as well as the ordering part.

The Algorithm

The code is very simple.

read_i: process i performs scan_i, obtaining (o, v), and returns v_j, where j is the maximum index in o.

write(v)_i: process i performs label(v)_i.

The algorithm is straightforward: a write simply inserts the new value as part of a label action, while a read returns the latest value, as determined by the order returned by an embedded scan. In fact, the algorithm is easier to understand by considering the high-level code and the CTS-level code together, but keeping the snapshot abstraction instantaneous. In this more elaborate description, the low-level memory consists of a vector of (value, timestamp) pairs. The write action snaps all of memory instantaneously, chooses a timestamp bigger than the biggest timestamp already existing in the memory (with the exception that if its own is already the biggest, it keeps that timestamp), and, in a separate step, writes back the new value with that new timestamp. The read process snaps all of memory instantaneously, and returns the value that is associated with the biggest timestamp.
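The elaborate description above can be rendered executably under one simplifying assumption: the snap and update steps are made instantaneous with a single lock in this sketch, whereas the real construction uses a wait-free atomic snapshot. Names here are illustrative.

```python
import threading

class MultiWriterRegister:
    """Multi-writer register from a vector of (value, timestamp) pairs,
    one slot per process.  The lock only models the *instantaneous*
    snap/update steps of the description; it is not part of the algorithm."""
    def __init__(self, n, initial=None):
        self.slots = [(initial, 0)] * n   # (value, timestamp) per process
        self.lock = threading.Lock()

    def write(self, i, v):
        with self.lock:                   # snap: read all of memory
            max_ts = max(ts for _, ts in self.slots)
        with self.lock:                   # separate step: write back
            self.slots[i] = (v, max_ts + 1)

    def read(self, i):
        with self.lock:                   # snap: read all of memory
            snap = list(self.slots)
        # biggest (timestamp, index) pair wins; ties broken by process index
        j = max(range(len(snap)), key=lambda k: (snap[k][1], k))
        return snap[j][0]

r = MultiWriterRegister(3, initial=0)
r.write(0, 7)        # chooses timestamp 1
r.write(2, 9)        # sees timestamp 1, chooses timestamp 2
assert r.read(1) == 9
```

(The "keep your own timestamp if it is already the max" exception is omitted here; for unbounded integer timestamps, always choosing max + 1 is also correct.)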

Note that if the algorithm works correctly using the unbounded CTS algorithm, then it will also work using the bounded algorithm, which can be implemented using bounded space. Thus, it is sufficient to prove that the algorithm works correctly using unbounded timestamps.

It is easy to see that the algorithm has the nice properties of wait-freeness, fairness, and well-formedness. The interesting thing here is to show the existence of the needed serialization points. We do it by first proving the following lemma.

A Lemma for Showing Atomicity

Lemma: Let β be a well-formed sequence of invocations and responses for a read/write register. Let T be any subset of the incomplete operations in β, and let S be the union of the complete operations and T. Suppose that ≺ is a partial ordering of all the operations in S, satisfying the following properties:

1. Let i be a write operation in S, and let j be any operation in S. Then we have i ≺ j or j ≺ i. (In other words, ≺ totally orders all the write operations, and orders all the read operations with respect to the write operations.)

2. The value returned by each complete read operation in S is the value written by the last preceding write operation according to ≺ (or the initial value, if none).

3. If i is a complete operation and end_i precedes begin_j in β, then it cannot be the case that j ≺ i.

4. For any operation i, there are only finitely many operations j such that j ≺ i.

Then β is an atomic read/write register behavior, i.e., it satisfies the atomicity condition.

Proof: We construct serialization points *_i for all the operations i in S, as follows. We insert each serialization point *_i immediately after the later of begin_i and all begin_j such that j ≺ i. Note that Condition 4 ensures that this position is well defined. Order consecutive *'s in any way that is consistent with ≺. Also, for each incomplete read in T, assign a return value equal to the value of the last write operation in S that is ordered before it by ≺ (or the initial value, if none).
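The construction in the proof can be carried out mechanically. In this sketch (a hypothetical encoding: operations are given begin times on a real time line, plus an explicit precedence relation), the serialization point of i is placed just after the later of begin_i and the begins of all of i's ≺-predecessors:

```python
def serialization_points(begins, prec):
    """begins: dict op -> begin time; prec: set of pairs (j, i) meaning j < i.
    Places each point just after the later of begin_i and its predecessors'
    begins, per the construction in the lemma's proof."""
    eps = 1e-3   # "immediately after", modeled as a tiny offset
    points = {}
    for i in begins:
        pred_begins = [begins[j] for (j, k) in prec if k == i]
        points[i] = max([begins[i]] + pred_begins) + eps
    return points

# A write w that finishes before read r begins, with w < r:
begins = {"w": 0.0, "r": 2.0}
ends   = {"w": 1.0, "r": 3.0}
pts = serialization_points(begins, {("w", "r")})
# Each point lies within its operation's interval (the first claim),
# and the points are ordered consistently with < (the second claim):
assert begins["w"] < pts["w"] < ends["w"]
assert begins["r"] < pts["r"] < ends["r"]
assert pts["w"] < pts["r"]
```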

To prove that this serialization satisfies the requirements, we have to show that (a) the serialization points are all within the respective operation intervals, and (b) each read i returns the value of the write whose serialization point is the last one before *_i (or the initial value, if none).

For the first claim, consider the serialization point *_i. By construction, it must be after begin_i. If i is a complete operation, it is also before end_i: if not, then end_i precedes begin_j for some j having j ≺ i, contradicting Condition 3.

For the second claim, consider a read operation i. By Condition 2 for complete reads, and by the definition above for incomplete reads, the value returned by read i is the value written by the last write operation in S that is ordered before operation i by ≺ (or the initial value, if none). Condition 1 says that ≺ orders all the writes with respect to all other writes and all reads. To show that read i returns the value of the write whose serialization point is the last one before *_i (or the initial value, if none), it is sufficient to show that the total order of the serialization points is consistent with ≺, i.e., that if i ≺ j then *_i precedes *_j. This is true since, by definition, *_i occurs after the last of begin_i and all begin_k with k ≺ i, while *_j occurs after the last of begin_j and all begin_k with k ≺ j. Thus, if i ≺ j, then for every k such that k ≺ i we have k ≺ j, and hence *_j occurs after *_i.

Proof of the Multi-Writer Algorithm

We now continue verifying the implementation of multi-writer registers. We define an ordering ≺ on all the completed operations and T, where T is the set of all the incomplete write operations that get far enough to perform their embedded update steps. Namely: order all the write operations in order of the timestamps they choose, with ties broken by process index, as usual. Also, if two operations have the same timestamp and the same process index, then we order them in order of their occurrence (this can be done, since one must totally precede the other). Order each complete read operation after the write whose value it returns (or before all the writes, if it returns the initial value). Note that this can be determined by examining the execution. To complete the proof, we need to check that this ordering satisfies the four conditions of the lemma. It is immediate from the definitions that Conditions 1 and 2 are satisfied. Condition 4 is left to the reader; we remark only that it follows from the fact that any write in T eventually performs its update, and so after that time, other new write operations will see it and choose bigger timestamps.

The interesting one is Condition 3. We must show that if i is a complete operation and end_i precedes begin_j, then it is not the case that j ≺ i.

First note that, for any particular process, the timestamps written in the vector location for that process are monotone nondecreasing. This is because each process sees the timestamp written by the previous one, and hence chooses a larger timestamp (or the same, if it is the maximum). We proceed by case analysis, based on the types of the operations involved.

Case 1: two writes. First suppose that i and j are performed by different processes. Then in its snap step, j sees the update performed by i (or something with a timestamp at least as big), and hence j chooses a larger timestamp than i's, and gets ordered after i by ≺. Now suppose that i and j are performed by the same process. Then, since j sees the update performed by i, j chooses either the same timestamp as i, or a larger one. If it chooses a larger timestamp, then as before, j gets ordered after i by ≺. On the other hand, if it chooses the same timestamp, then the explicit tie-breaking convention says that the two writes are ordered by ≺ in order of occurrence, i.e., with j after i.

Case 2: two reads. If read i finishes before read j starts, then by the monotonicity above, j sees timestamps at least as large as those i sees, for all the processes. This means that read j returns a value associated with a (timestamp, index) pair at least as great as that of i, and so j does not get ordered before i.

Case 3: read i, write j. Since i completes before j begins, the maximum (timestamp, index) pair that j sees is at least as great as the maximum one that i sees.

First suppose that j does not see its own process with the maximum pair. In this case, j chooses a new timestamp that is strictly larger than the maximum timestamp it sees, so strictly larger than the maximum timestamp that i sees. Now, i gets ordered right after the write whose value it gets, which is, by definition, the maximum-pair write that it sees. Since this pair has a smaller timestamp than that of j, it must be that i is ordered before j.

On the other hand, suppose that j does see its own process with the maximum pair. In this case, j chooses the same timestamp that it sees. The resulting pair is either greater than or equal to the maximum pair that i sees. If it is greater, then, as above, i is ordered before j. If it is equal, then the explicit tie-breaking rule for the same pair implies that write j gets ordered by ≺ after the write whose value i gets, and hence after the read by i. This again implies that i gets ordered before j.

Case 4: write i, read j. Since i completes before j begins, j sees a timestamp for i's process that is at least as great as that of operation i, and hence the value that j returns comes from a write operation that has at least as great a (timestamp, index) pair as does i. So again, i is ordered before j.

This sort of case analysis is useful for other read/write register algorithms. Before we conclude the discussion of implementing multi-writer registers from single-writer registers, we remark that this problem has a rich history of complicated, incorrect algorithms, including papers by Peterson and Burns, by Schaffer, and by Li, Tromp, and Vitanyi.

Algorithms in the Read-Modify-Write Model

In this section, we consider shared memory with a different memory access primitive, called read-modify-write. Our model is the instantaneous shared memory model, only now the variables are accessible by a more powerful operation. Intuitively, the idea is that a process should be able, in one instantaneous step, to access a shared variable, using the variable's value and the process state to determine the new value of the variable and the new process state. We can formulate this in the invocation-response style by using apply(f) to describe the information needed to update the variable; the old value v comes back as a response (cf. the homework). The response value can be used to update the state, since an input to the process can trigger an arbitrary state change.

In its full generality, this is a very powerful type of shared memory, since it allows arbitrary computation based on the state and the shared memory values. It is not exactly clear how this would be implemented: remember that the basic model implies fair turns for all the processes, which means fair access to the shared memory. So it seems that in this model, we are implicitly assuming some low-level arbitration mechanism to manage access to these variables. Still, we might be able to implement the fair access "with high probability": if accesses are fast, for example, then interference is unlikely, and by repeated retry (maybe with some backoff mechanism) we would eventually gain access to the variable.

In what follows, we shall consider various problems in this model, including consensus, mutual exclusion, and other resource allocation problems.

Consensus

It is very easy to implement wait-free consensus using a single read-modify-write shared variable, as follows. The variable starts out with the value "undefined". All processes access the variable. If a process sees "undefined", it changes the value to its own initial value, and decides on this value. If a process sees 0 or 1, it does not change the value written there, but instead accepts the written value as the decision value. It is easy to verify correctness and wait-freedom for this scheme.

Let us formulate the idea using the apply(f) style. Each process uses f_0 or f_1, depending on its initial value, where

    f_b(x) = b,  if x = undefined
    f_b(x) = x,  otherwise

A return value of "undefined" causes decision to be set to the initial value, and a return value of b causes it to be set to b.
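A minimal executable rendering of this scheme (the lock below only models the atomicity of a single read-modify-write step; the class and function names are illustrative):

```python
import threading

UNDEFINED = None

class RMWVariable:
    """A read-modify-write cell: apply(f) atomically replaces the value
    by f(value) and returns the old value."""
    def __init__(self, initial=UNDEFINED):
        self.x = initial
        self.lock = threading.Lock()   # models atomicity of one RMW step

    def apply(self, f):
        with self.lock:
            old = self.x
            self.x = f(old)
            return old

def decide(var, b):
    """Wait-free consensus: a single RMW access, then decide."""
    f_b = lambda x: b if x is UNDEFINED else x
    old = var.apply(f_b)
    return b if old is UNDEFINED else old

var = RMWVariable()
assert decide(var, 0) == 0   # the first access installs its own value
assert decide(var, 1) == 0   # later accesses adopt the installed value
```

Every process performs exactly one shared-memory step, so the scheme is wait-free regardless of how the accesses interleave.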

Another style for writing this algorithm is describing flow of control, with explicit brackets indicating the beginning and ending of steps involving the shared memory and the local state. Since arbitrarily complicated computations can be done in one step, there can be many lines of code within one pair of brackets. (There shouldn't be nonterminating loops, however, since a step must always end.) The code is given in this style in the first figure below, and in the precondition-effect style in the second.

A result by Herlihy proves that it is impossible to obtain a wait-free implementation of atomic read-modify-write objects in the instantaneous shared memory model where the memory is read/write. This can now be seen easily, as follows. If we could obtain such an implementation, then by transitivity, composing the simple consensus solution above for the read-modify-write model with the claimed implementation of read-modify-write objects,

init(b)_i:
    value := b
    lock
        if x = undefined then
            x := b
            decision := b
        else decision := x
    unlock
decide(d)_i

Figure: consensus in the lock/unlock memory style.

we could solve wait-free consensus in the instantaneous read/write model, which we have already shown to be impossible. A similar argument implies that we can't get a wait-free implementation of atomic read-modify-write objects in the atomic read/write object model either.

Mutual Exclusion

Consider the mutual exclusion problem with the more powerful read-modify-write memory. In some sense, this seems almost paradoxical: it sounds as if we are assuming a solution to a very similar problem (fair exclusive access to a shared variable) in order to get fair exclusive access to the critical region. This seems likely to make the problem trivial. And indeed, it does simplify things considerably, but not completely. In particular, we shall see some nontrivial lower bounds.

Deadlock-Freedom

Recall that for read/write registers, we needed n variables to guarantee even the weak properties of mutual exclusion and deadlock-freedom. To see how different the read-modify-write model is from read/write, consider the trivial one-variable algorithm in the figure below. It is straightforward to see that the algorithm guarantees both mutual exclusion and deadlock-freedom.

State:
    pc, with values in {init, access, decide, done}
    value
    decision

Actions:

init(b)_i
    Effect: value := b
            pc := access

access_i
    Precondition: pc = access
    Effect: if x = undefined then
                x := value
                decision := value
            else decision := x
            pc := decide

decide(d)_i
    Precondition: pc = decide and decision = d
    Effect: pc := done

Figure: consensus algorithm in the precondition-effect language.

Shared variables: v, with values in {0, 1}, initially 0

Local variables for process i: pc, with values in {R, T1, T2, C, E1, E2}, initially R

Code for process i:

try_i
    Effect: pc := T1

test_i
    Precondition: pc = T1
    Effect: if v = 0 then
                v := 1
                pc := T2

crit_i
    Precondition: pc = T2
    Effect: pc := C

exit_i
    Effect: pc := E1

return_i
    Precondition: pc = E1
    Effect: v := 0
            pc := E2

rem_i
    Precondition: pc = E2
    Effect: pc := R

Figure: mutual exclusion algorithm for the read-modify-write model.
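Operationally, the test_i step is a test-and-set: in one atomic step it reads v and, if v = 0, sets it to 1. A threaded sketch of the same one-variable algorithm (the inner lock only models the atomicity of a single read-modify-write step, not the mutual exclusion being built):

```python
import threading

v = 0                       # the single shared variable, values in {0, 1}
step = threading.Lock()     # models atomicity of one read-modify-write step
in_crit = 0                 # bookkeeping, to check mutual exclusion below

def process():
    global v, in_crit
    # trying region: repeat the test_i step until it succeeds
    while True:
        with step:
            if v == 0:      # test-and-set, in one atomic step
                v = 1
                break
    in_crit += 1            # critical region (exclusive, so no race here)
    with step:              # exit region: return_i resets v
        v = 0

threads = [threading.Thread(target=process) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert in_crit == 4 and v == 0
```

Note that this sketch, like the algorithm in the figure, provides no fairness: a process can in principle lose the test forever, which is the topic of the next subsection.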

Bounded Bypass

The simple algorithm of the previous figure does not guarantee any fairness. But we could even get FIFO order (based on the first locally controlled step of a process in the trying region), e.g., by maintaining a queue of process indices in the shared variable. An entering process adds itself to the end of the queue; when a process is at the beginning of the queue, it can go critical; and when it exits, it deletes itself from the queue. This is simple and fast, and it only uses one variable; but it is space-consuming: there are more than n! different queues of at most n indices, and so the variable would need to be able to assume more than n! different values, i.e., roughly n log n bits. One might try to cut down the size of the shared variable. The interesting question is how many states we need in order to guarantee fairness. Can we do it with a constant number of values? A linear number?

We can achieve fairness with n² values (2 log n bits) by keeping only two numbers in the single shared variable: the next "ticket" to be given out, and the number of the ticket that has permission to enter the critical region. This can be viewed as a distributed implementation of the queue from the previous algorithm. Initially the two components of the shared variable are equal. When a process enters, it takes a ticket, i.e., copies and increments the first component. If a process' ticket is equal to the second component, it goes to the critical region. When a process exits, it increments the second component. This algorithm is also FIFO, based on the first non-input step in the trying region. We can allow the tickets to be incremented mod n, with a total of n² values.
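The two-component variable is exactly a ticket lock. A sketch (the inner lock only models the atomicity of one read-modify-write step; tickets here are unbounded Python integers rather than being taken mod n):

```python
import threading

class TicketLock:
    """(next, granted): take a ticket on entry, spin until it is granted,
    bump 'granted' on exit -- the FIFO queue, distributed."""
    def __init__(self):
        self.next = 0                   # next ticket to hand out
        self.granted = 0                # ticket allowed into the critical region
        self.step = threading.Lock()    # models atomicity of one RMW step

    def enter(self):
        with self.step:                 # take a ticket: copy and increment
            my = self.next
            self.next += 1
        while True:                     # spin until our ticket is granted
            with self.step:
                if self.granted == my:
                    return

    def leave(self):
        with self.step:                 # grant the next ticket
            self.granted += 1

lock, total = TicketLock(), 0

def worker():
    global total
    for _ in range(100):
        lock.enter()
        total += 1                      # critical region
        lock.leave()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert total == 400
```

Entry order is exactly ticket order, which gives the FIFO property based on the ticket-taking step.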

The obvious question now is whether we can do better. The following theorem gives the answer for the case of bounded bypass, which is a stronger requirement than lockout-freedom.

Theorem: Let A be an n-process mutual exclusion algorithm with deadlock-freedom and bounded bypass, using a single read-modify-write shared variable. Then the number of distinct values the variable can take on is at least n.

Proof: By contradiction: we shall construct an execution in which some process is bypassed arbitrarily (but finitely) many times. We start by defining a sequence of finite executions, each an extension of the previous one, as follows. The first execution, α_1, is obtained by letting process p_1 run alone from the start state until it enters C. The second execution, α_2: after α_1, we let p_2 enter the trying region and take one non-input step. (Obviously, p_2 remains in the trying region.) And for 3 ≤ i ≤ n, α_i is defined by starting after the end of α_{i-1}, then letting p_i enter the trying region and take one non-input step.

Define v_i to be the value of the shared variable just after α_i, for 1 ≤ i ≤ n. We first claim that v_i ≠ v_j for i ≠ j. For assume the contrary, i.e., that v_i = v_j, and without loss of generality assume i < j. Then the state after α_i "looks like" the state after α_j to p_1, ..., p_i, since the value of the shared variable and all these processes' states are the same in both. Now, there is a fair continuation of α_i, involving p_1, ..., p_i only, that causes some process to enter C infinitely many times. This follows from the deadlock-freedom assumption (which only applies in fair executions).

The same continuation can be applied after α_j, with the same effect. But note that now the continuation is not fair: p_{i+1}, ..., p_j are not taking any steps, although they are in T. However, running a sufficiently large (but finite) portion of this extension after α_j is enough to violate the bounded-bypass assumption: p_j will get bypassed arbitrarily many times by some process.

Is this lower bound tight, or can it be raised? Another result in [BFJLP] shows that it can't be raised by much. This is demonstrated by a counterexample algorithm that needs only n + c values (for a constant c) for mutual exclusion with deadlock-freedom and bounded bypass, where the bypass bound is a small constant.

The main idea of the algorithm is as follows. The processes in the trying region are divided into two sets, called buffer and main. When processes enter the trying region, they go into buffer, where they lose their relative order. At some point when main is empty, all processes in buffer go to main, thereby emptying buffer. Then processes are chosen, one at a time, in an arbitrary order, to go to the critical region.

The implementation of this idea needs some communication mechanisms to tell processes when to change regions. We'll sketch the ideas here and leave the details to the reader. Assume for a start that we have a "supervisor" process that is always enabled, regardless of user inputs, and that can tell processes when to change regions and go critical. (Clearly, the supervisor does not fit within the rules of our models; we'll see how to get rid of it later.)

With such a supervisor, we have the following strategy. First, let the variable have two components: one for a count (up to n), and one to hold any of a finite set of messages. This is cn values, but we can optimize to n + c by employing a priority scheme that allows "preemption" of the variable for different purposes. The supervisor keeps counts of the number of processes it knows about in the buffer and main regions. When a process enters, it increments the count in the shared variable, to inform the supervisor that someone new has entered, and then waits in buffer. The supervisor, whenever it sees a nonzero count in the variable, absorbs it into its local buffer-count, and resets the variable count to 0. Thus, the supervisor can figure out when to move processes to main. This is done sequentially, by putting messages in the message component of the shared variable. We have the following messages:

ENTER-MAIN: the first process in buffer that sees this message can pick it up, and is thereby told to go to main.

ACK: process response.

ENTER-CRIT: can be picked up by anyone in main; the process that picks it up can go to the critical region.

ACK: process response.

BYE: the process in the critical region says it is done and leaves.

Let's see briefly how we can reuse the variable. We need the variable for two purposes: counting and message delivery. Note that message communication proceeds more or less sequentially; see the figure for an example.

Figure: a typical communication fragment through the shared variable (the supervisor posts ENTER-MAIN, a process answers with an ACK, and so on, through ENTER-CRIT).

We have an implicit control "thread" here. If this thread is ever broken, by overwriting a message with a count increase, the rest of the system will eventually quiesce: the supervisor will eventually absorb all counts, and the count will become 0. At that point, the overwritten message can be replaced (the count is at its default of 0 whenever a message is present). More specifically, the following occurs. When a process entering the system sees a message in the variable, it picks the message up and puts down a count of 1, to announce its presence. This process holds the message until the count is 0 again, and then puts the message it is holding back in the shared variable.

Now we turn back to our model, in which there is no supervisor. The idea is to have a "virtual supervisor", simulated by the processes. E.g., at any time, the process that controls C will be the designated supervisor, and it must pass the responsibility on when it leaves C. This involves passing on its state information, and we need to use the variable to communicate this too. This can be done by using the variable as a message channel, where we again optimize multiple uses of the message channel as before. Note that if there is ever no other process to pass the responsibility to, then no one is waiting. But then there is no interesting information in the state anyway, so there is no problem in "abandoning" the responsibility when the process leaves (putting an indicator in the variable).

Lockout-Freedom

The lower bound of the theorem above can be beefed up to get a similar lower bound for lockout-freedom [BFJLP]. This is harder, because lockout-freedom, unlike bounded bypass, is a property of possibly infinite fair executions, and not only of finite executions. The "bad" executions that are constructed must be fair.

We can get a lower bound of n/2, by a complicated construction, with the extra assumption that processes don't remember anything when they return to their remainder regions. Some attempts were made to raise the bound to n, since it seems as if an algorithm has to have sufficiently many different values of the variable to record any number of entering processes. But the search for a better lower bound was halted by another clever counterexample algorithm, giving lockout-freedom using only n/2 + c values.

The idea of that algorithm is as follows. Similarly to the n + c value algorithm, the algorithm has incoming processes increment a count, which now only takes on values 0, ..., n/2, and then wraps around back to 0. The count is absorbed by a (conceptual) supervisor, as before. If the count wraps around to 0, it seems as if a group of n/2 processes is hidden. But this is not quite the case: there is someone who knows about them, called the executive -- the one that did the transition from n/2 back to 0. The executive can send SLEEP messages to an arbitrary set of n/2 processes in the buffer, to put them to sleep. (It doesn't matter which processes in the buffer go to sleep.) Then, having removed n/2 processes from the fray, the executive reenters the system. Now the algorithm runs exactly as the bounded bypass algorithm, since the count can't overflow; it only gets up to n/2. When the executive reaches C, it can take care of the sleepers, by sending them messages to awaken, and telling the supervisor about them. Again, we must share the variable, now with a more complicated priority scheme.

Note that we can't have executives overlapping and confusing their sleepers, because it is not possible for that many processes to enter while the others are asleep.

6.852J/18.437J Distributed Algorithms -- November

Lecturer: Nancy Lynch

Lecture

Read-Modify-Write Algorithms (cont.)

Mutual Exclusion (cont.)

Last time, we described read-modify-write algorithms for mutual exclusion with bounded bypass, and for mutual exclusion with lockout-freedom. We showed that we can get algorithms that use a linear number of values for either. Today we conclude this topic with one more lower bound result.

Though we will not give here the full lower bound for lockout-freedom (it is too complicated), we will show the kinds of ideas that are needed for such a proof. We shall only sketch a weaker lower bound for lockout-freedom, of approximately √n. Specifically, we prove the following theorem.

Theorem: Any system of n ≥ k(k+1)/2 processes satisfying mutual exclusion and lockout-freedom requires at least k values of the shared variable.

Proof: By induction on k. The base case (k = 1) is trivial. Assume now that the theorem holds for k; we show that it holds for k + 1. Suppose n ≥ (k+1)(k+2)/2, and suppose for contradiction that the number of shared values is no more than k. From the inductive hypothesis, it follows that the number of values is exactly k. We now construct a bad execution to derive a contradiction.

Define an execution α_1 by running p_1 alone until it enters C. Define further an execution α_2 by running p_2 after α_1, until p_2 reaches a state with shared variable value v_2 such that p_2 can run on its own, causing v_2 to recur infinitely many times. (Such a state must exist, since x can assume only finitely many values.) Likewise, get α_i for 3 ≤ i ≤ n by running p_i after α_{i-1}, until it reaches a point where it could, on its own, cause the current value of the shared variable x to recur infinitely many times. Let v_i be the value corresponding to α_i.

Since there are only k values for x, by the pigeonhole principle there must exist i and j, where n - k ≤ i < j ≤ n, such that v_i = v_j.

Now, processes p_1, ..., p_i comprise a system with at least k(k+1)/2 processes solving mutual exclusion with lockout-freedom. Thus, by induction, they must use all k values of shared memory. More specifically: for every global state s that is i-reachable (i.e., reachable using steps of p_1, ..., p_i only) from the state q_i just after α_i, and every value v of x, there is a global state reachable from s in which the value of x is v. (If not, then we could run the system to global state s; starting from there, we would have a reduced system with fewer than k values, contradicting the induction hypothesis.) This implies that there exists a fair execution of processes p_1, ..., p_i that starts from q_i, such that all k values of shared memory recur infinitely many times.

Now we can construct the bad execution as follows. First run p_1, ..., p_j through α_j to state q_j, which looks like q_i to p_1, ..., p_i. Then run p_1, ..., p_i as above, now from q_j instead of q_i, in the execution in which each shared value recurs infinitely often. Now recall that in q_j, each process p_l, i < l ≤ j, is able to cause the value it sees to recur infinitely many times. So intersperse steps of the main run of p_1, ..., p_i with steps of p_{i+1}, ..., p_j, as follows: each time the shared variable is set to some v_l, i < l ≤ j, run p_l for enough steps (at least one) to let it return the value to v_l. This yields an infinite execution that is fair to all processes, and that locks out, for example, p_j.
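To see that this gives a lower bound of roughly √n, fix n and take the largest k to which the theorem's threshold of n ≥ k(k+1)/2 applies:

```latex
% largest k with k(k+1)/2 <= n
n \;\ge\; \frac{k(k+1)}{2}
\;\iff\; k^2 + k - 2n \;\le\; 0
\;\iff\; k \;\le\; \frac{\sqrt{8n+1}-1}{2},
```

so a system of n processes is forced to use at least about √(2n) distinct shared values.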

The key idea to remember from the above proof is the construction of bad fair executions by splicing together fair executions. Also note the apparent paradox: there is a nontrivial inherent cost to lockout-free mutual exclusion, even though we are assuming fair, exclusive access to a shared variable, which seems very close.

Other Resource Allocation Problems

Mutual exclusion is only one kind of resource allocation problem, modeling access to a single shared resource. In this section we generalize the problem. There are two ways to describe the generalizations: the first is exclusion problems, and the second is resource problems. We start with the new definitions.

Definitions

In the first variant, we describe exclusion problems. Formally, we describe these problems by a collection E of "conflicting" sets of processes, which are not allowed to coexist in C. E could be any collection of sets of processes that is closed under containment, i.e., if S ∈ E and S ⊆ S', then S' ∈ E. For example, the mutual exclusion problem is characterized by

    E = {S : |S| ≥ 2}.

A common generalization is the k-exclusion problem, modeling k shared resources, where at most k processes may be in the critical region at any given time. In our language, the problem is defined by

    E = {S : |S| ≥ k + 1}.

Clearly, we can be more arbitrary: e.g., E could be defined to contain the pairs {1, 2}, {1, 3}, {2, 3}, {3, 4}, and all of their supersets. Note that in such a case, 1 does not exclude 4, and 2 does not exclude 4.
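The containment-closure condition is easy to check mechanically. The following small Python sketch (the helper name and the choice of a four-process universe are illustrative assumptions) verifies it for 2-exclusion:

```python
from itertools import combinations

def is_exclusion_collection(E, procs):
    # closed under containment: every superset (within the universe)
    # of an excluded set must itself be excluded
    universe = list(procs)
    all_sets = [frozenset(c) for r in range(len(universe) + 1)
                for c in combinations(universe, r)]
    return all(T in E for S in E for T in all_sets if S <= T)

procs = range(1, 5)
# 2-exclusion on four processes: any set of 3 or more processes conflicts
E = {frozenset(c) for r in (3, 4) for c in combinations(procs, r)}
assert is_exclusion_collection(E, procs)
assert frozenset({1, 2}) not in E   # pairs may still coexist in C
```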

This leads us to the second variant of the definition, in terms of resources. In this case, we have a set of explicit resources in the problem statement, and for each p_i we give a list of requirements, in the form of a monotone Boolean formula, e.g., (R1 ∨ R2) ∧ (R3 ∨ R4), where the R's are resources. Intuitively, the meaning of the example is that p_i needs either R1 or R2, and either R3 or R4, in order to go critical. More formally, a set of resources is acceptable for a process to go critical iff setting all the corresponding variables to true yields a satisfying assignment for the formula.
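In code, a requirement of this form can be checked directly. A minimal sketch, representing the monotone formula as a list of clauses (this CNF representation is an assumption of the sketch, not notation from the text):

```python
def acceptable(held, formula):
    # formula: a monotone CNF -- a list of clauses, each a set of
    # resources, e.g. (R1 or R2) and (R3 or R4).
    # A set of held resources is acceptable iff every clause is satisfied.
    return all(clause & held for clause in formula)

f = [{"R1", "R2"}, {"R3", "R4"}]        # the formula from the example
assert acceptable({"R1", "R3"}, f)      # R1 and R3 suffice
assert not acceptable({"R1", "R2"}, f)  # second clause unsatisfied
# monotonicity: adding resources never makes an acceptable set unacceptable
assert acceptable({"R1", "R2", "R4"}, f)
```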

Let us consider another simple example to illustrate this style of formulation. Suppose p1 wants resources R1 and R2, p2 wants R2 and R3, p3 wants R3 and R4, and p4 wants R4 and R1. In this example, the corresponding formulas have only a single clause each, i.e., there are no options here.

Note that any resource problem gives rise to an exclusion problem, where the excluded sets are those whose combined resource requirements cannot be simultaneously satisfied. E.g., for the example above, we get an equivalent formulation by

    E = {S : S ⊇ {p1, p2}, {p2, p3}, {p3, p4}, or {p4, p1}}.

It is also true that every exclusion condition can be expressed as a resource problem, and thus the two formulations are equivalent.
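The translation from a conjunctive resource problem to its exclusion problem can be mechanized: two processes conflict exactly when their requirement sets intersect. A sketch (the `p1`/`R1` names follow the four-process ring example above, which is illustrative):

```python
def exclusion_sets(requirements):
    # requirements: process -> set of resources (conjunctive problem:
    # each process needs exactly one particular set). Two processes
    # conflict iff their requirement sets share a resource; E consists
    # of these minimal excluded pairs (all supersets are implied).
    procs = sorted(requirements)
    return {frozenset((p, q))
            for i, p in enumerate(procs) for q in procs[i + 1:]
            if requirements[p] & requirements[q]}

req = {"p1": {"R1", "R2"}, "p2": {"R2", "R3"},
       "p3": {"R3", "R4"}, "p4": {"R4", "R1"}}
E = exclusion_sets(req)
assert E == {frozenset({"p1", "p2"}), frozenset({"p2", "p3"}),
             frozenset({"p3", "p4"}), frozenset({"p4", "p1"})}
```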

For stating a general problem to be solved in concurrent systems, we again assume the cyclic behavior of the requests, as before. The mutual exclusion condition is now replaced by a general exclusion requirement. The definitions of progress and lockout-freedom remain the same as before. We actually want to require something more, since there are some trivial solutions that satisfy the requirements listed so far: any solution to mutual exclusion is a fortiori a solution to more general exclusion. But there is something wrong with this: we want to insist that more concurrency is somehow possible. We don't know what is the best way to express this intuitive notion; we give here a slightly stronger condition, in terms of the resource formulation, as follows.

    Consider any reachable state in which process i is in T, and every process that has any of the same-named resources in its resource requirement formula is in R. Then if all the conflicting processes stay in R, and i continues to take steps, eventually i proceeds to C. (Note: we allow for other processes to halt.)

Dijkstra originally defined a special case, called the Dining Philosophers problem, described below as a resource problem.

Figure: The dining philosophers problem for n = 5. Philosophers p0, ..., p4 sit around a table, with forks F0, ..., F4 placed between adjacent philosophers.

In the traditional scenario (see the figure), we have five (say) philosophers at a table, usually thinking (i.e., in R). From time to time, each philosopher wants to eat (i.e., to enter C), which requires two forks, representing the resources; we assume that each philosopher can only take the forks adjacent to him or her. Thus, the formula for p_i is F_{i-1} ∧ F_i (counting mod 5). We can represent this as an exclusion problem, namely

    E = {S : S ⊇ {p_i, p_{i+1}} for some i}.

We assume that there is a shared variable associated with each fork, and that the access model for the variables is read-modify-write. If the processes are all identical and they don't use their indices, then we have the same kind of symmetry problem we saw in electing a leader in a ring, which cannot be solved. To be concrete, suppose they all try to pick up their left fork first, and then their right. This satisfies the exclusion condition, but it can deadlock: if all wake up at the same time and grab their left forks, then no one can proceed. (This is worse than a deadlocking execution -- the system actually wedges.) Thus, we must use some kind of symmetry-breaking technique, and here we make the assumption that the process indices (i.e., locations) are known to the processes.

Dijkstra's original solution requires simultaneous locking of the two adjacent forks (a centralized shared read-modify-write variable), and has no fairness. We'll see a better solution below.

The Left-Right Algorithm

This algorithm appeared in Burns' paper, but it is generally considered folklore. The algorithm ensures exclusion, deadlock-freedom (in the stronger version stated above), and lockout-freedom. Also, it has a good time bound, independent of the size of the ring.

This is a simple style of algorithm, based on running around and collecting the forks, one at a time. We have to be careful about the order in which this is done. For instance, if the order is arbitrary, then we risk deadlock, as above. Even if there is no danger of deadlock, we might be doing very poorly in terms of time performance. E.g., if everyone gets their forks in numerical order of the forks, then although they don't deadlock, this might result in a long waiting chain. To see this, consider the following scenario. In the ring of the figure above, suppose that p1 gets both of its forks; then p2 gets one fork and waits for the fork it shares with p1; p3 gets one fork and waits for the fork it shares with p2; p4 gets one fork and waits for the fork it shares with p3; and p0 waits for the fork it shares with p4. (See the figure below for the final configuration.)

Figure: A configuration with a long waiting chain. Arrows indicate possession of forks.

In the last state, we have a chain: p0 waits for p4, which waits for p3, which waits for p2, which waits for p1. The processes in the waiting chain can go into the critical region only sequentially. In general, this means that the time can be at least linear in the size of the ring.

In contrast, the Burns algorithm has a constant bound, independent of the size of the ring. The algorithm accomplishes this by avoiding long waiting chains. We describe the algorithm here for n even; the variant for odd n is similar, and is left as an exercise. The algorithm is based on the following simple rule: even-numbered processes pick up their left forks first, and odd-numbered processes pick up their right forks first. (Thus, the asymmetry here is the knowledge of the parity.) Variables correspond to the forks, as before. Now each variable contains a queue of processes waiting for the fork, of length at most 2; the first process on the queue is assumed to own the fork. The code is given below.

Properties: Mutual exclusion is straightforward. Deadlock-freedom follows from lockout-freedom, which in turn follows from a time bound, which we prove now. Assume, as usual, that the step time is at most s, and that the critical region time is at most c. Define T(n) to be the worst time from entry to the trying region until entry to the critical region, in an n-process ring. Our goal is to bound T(n). As an auxiliary quantity, we define S(n) to be the worst time from when a process has its first fork until entry to the critical region.

try_i
    Effect:  pc := look-left

look-left_i
    Precondition:  pc = look-left
    Effect:  if i is not on x_left's queue then add i to the end
             else if i is first on x_left's queue then pc := look-right

look-right_i
    Precondition:  pc = look-right
    Effect:  if i is not on x_right's queue then add i to the end
             else if i is first on x_right's queue then pc := before-C

crit_i
    Precondition:  pc = before-C
    Effect:  pc := C

exit_i
    Effect:  return-set := {L, R}
             pc := return

return-left_i
    Precondition:  pc = return
                   L ∈ return-set
    Effect:  return-set := return-set - {L}
             remove i from x_left's queue

return-right_i
    Precondition:  pc = return
                   R ∈ return-set
    Effect:  return-set := return-set - {R}
             remove i from x_right's queue

rem_i
    Precondition:  pc = return
                   return-set = ∅
    Effect:  pc := rem

Figure: Left-right dining philosophers algorithm -- code for process p_i, i even. (Here x_left and x_right denote the shared variables for p_i's left and right forks.)
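A small round-robin simulation of this style of algorithm illustrates both the exclusion property and freedom from deadlock on an even ring. This is a sketch, not the I/O-automaton code above; the fork orientation (p_i's left fork is fork (i-1) mod n) is an assumption of the sketch:

```python
from collections import deque

def run_ring(n, rounds=200):
    # fork j sits between p_j and p_{j+1 mod n}; p_i's left fork is
    # (i - 1) mod n and its right fork is i (an assumed orientation)
    forks = [deque() for _ in range(n)]
    # even processes seek their left fork first, odd ones their right
    order = {i: ((i - 1) % n, i) if i % 2 == 0 else (i, (i - 1) % n)
             for i in range(n)}
    pc = {i: 0 for i in range(n)}  # 0/1: seeking a fork, 2: critical, 3: done
    history = []
    for _ in range(rounds):
        in_crit = [i for i in range(n) if pc[i] == 2]
        for i in in_crit:  # exclusion: neighbors never eat together
            assert (i + 1) % n not in in_crit and (i - 1) % n not in in_crit
        for i in range(n):
            if pc[i] in (0, 1):
                f = order[i][pc[i]]
                if i not in forks[f]:       # enqueue on the fork's queue
                    forks[f].append(i)
                if forks[f][0] == i:        # first on the queue: owns the fork
                    pc[i] += 1
            elif pc[i] == 2:                # leave C, return both forks
                history.append(i)
                for f in order[i]:
                    forks[f].remove(i)
                pc[i] = 3
    return history

assert sorted(run_ring(6)) == list(range(6))   # everyone reached C
```

Each process here enters C once; in the full algorithm, processes cycle through the regions repeatedly.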

Let us start by bounding T(n) in terms of S(n). Process p_i tests its first fork within time s of when it enters T. When p_i first tries, either it gets its first fork immediately, or not. If so, then S(n) time later it goes critical. If not, then the other process has the first fork; at worst, that other process just got the fork, so it takes at most S(n) + c + 2s until it gets to C, finishes C, and returns both forks. Then, since p_i recorded its index in the queue for its first fork, within additional time s, p_i will retest the first fork and succeed in getting it; then, in additional S(n) time, p_i reaches C. This analysis says that

    T(n) ≤ s + max{S(n), (S(n) + c + 2s) + s + S(n)} = 4s + c + 2S(n).    (1)

We now bound S(n). Process p_i tests its second fork within time s of when it gets the first fork. Again, we need to consider two cases. If it gets the second fork immediately, then within an additional time s, p_i goes critical. If not, then the other process has it, and, because of the left-right arrangement, since it is p_i's second fork, it must be its neighbor's second fork also. So, within time at most s + c + 2s, the neighbor gets to C, leaves, and returns both forks. Then, within additional time s, p_i will retest the second fork and succeed in getting it. Then additional time s will take p_i to C. This analysis says that

    S(n) ≤ s + max{s, (s + c + 2s) + s + s} = 6s + c.    (2)

Putting Eqs. (1) and (2) together, we conclude that

    T(n) ≤ 4s + c + 2(6s + c) = 16s + 3c.

Note that the bound is independent of n. This is a very nice property for a distributed algorithm to have: it expresses a strong kind of locality, or "network transparency".
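As a sanity check, the arithmetic can be verified mechanically (using the bounds T(n) ≤ s + max{S, S + c + 3s + S} and S(n) ≤ s + max{s, 5s + c} in the form derived above):

```python
def S_bound(s, c):
    # S(n) <= s + max{ s, (s + c + 2*s) + s + s }
    return s + max(s, (s + c + 2 * s) + s + s)

def T_bound(s, c):
    # T(n) <= s + max{ S, (S + c + 2*s) + s + S } with S = S_bound(s, c)
    S = S_bound(s, c)
    return s + max(S, (S + c + 2 * s) + s + S)

for s, c in [(1, 1), (2, 5), (3, 100)]:
    assert S_bound(s, c) == 6 * s + c       # Eq. (2)
    assert T_bound(s, c) == 16 * s + 3 * c  # the combined bound
```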

Extension to More General Resource Problems

Given the nice features of the Left-Right algorithm, it is natural to ask whether we can extend this alternating left-right idea to more general resource allocation problems. In this section, we show an extension that is not completely general, but rather applies only to conjunctive problems, where the resource requirement of a process is exactly one particular set. This case does not cover k-exclusion, for example, which is a disjunctive problem (more than one alternative is allowed).

Again, for each resource, we have an associated variable, shared by all processes that use that resource. The variable contains a FIFO queue to record who is waiting, as before. As before, the processes wait for the resources one at a time. To avoid deadlock, we arrange the resources in some global total ordering, and let each process seek resources in order according to this ordering, smallest to largest. It is easy to see that this prevents deadlock: if process i waits for process j, then j holds a resource which is strictly larger than the one for which i is waiting, and hence the process holding the largest-numbered resource can always advance. The FIFO nature of the queues also prevents lockout.

However, the time performance of this strategy is not very good in general: there is no limit on the length of waiting chains, except for the number of processes, and so the time bounds can be linearly dependent on the number of nodes in the network. (We saw such an example above, for rings.) So we must refine the ordering scheme, by giving a way to choose a good total ordering, i.e., one that does not permit long waiting chains. We use the following strategy for selecting a good total order. Define the resource problem graph, where the nodes represent the resources, and there is an edge from one node to another if some user needs both of the corresponding resources. The figure below gives the graph for an instance of Dining Philosophers with six nodes.

Figure: Resource graph for six dining philosophers. The fork nodes F0, ..., F5 form a ring; each edge corresponds to the philosopher (p0, ..., p5) who needs both endpoint forks.

Now color the nodes of the graph so that no two adjacent nodes have the same color. We can try to minimize the number of colors: finding a minimum coloring is NP-hard, but a greedy algorithm is guaranteed to color the graph using at most one more color than the maximum node degree.
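Greedy coloring is a single pass over the nodes; a sketch on the six-fork ring (node names and the scan order are illustrative choices):

```python
def greedy_coloring(adj):
    # adj: node -> set of neighbors (the resource-problem graph)
    color = {}
    for v in sorted(adj):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:       # smallest color unused by the neighbors
            c += 1
        color[v] = c
    return color

# resource graph for six dining philosophers: a ring of six forks
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
col = greedy_coloring(ring)
assert all(col[u] != col[v] for u in ring for v in ring[u])  # proper coloring
maxdeg = max(len(ns) for ns in ring.values())
assert max(col.values()) + 1 <= maxdeg + 1   # at most (max degree + 1) colors
assert max(col.values()) + 1 == 2            # the even ring is 2-colorable
```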

Now totally order the colors. This only partially orders the resources, but it does totally order the resources needed by any single process. (See the figure below for an example.)

Now each process accesses its forks in increasing order according to the color ordering. (Equivalently, take any topological ordering consistent with the given partial order, and let all processes follow that total ordering in the earlier strategy.) Carrying on with our example, this boils down to the alternating left-right strategy.

It is easy to see that the length of waiting chains is bounded by the number of colors: a process can be waiting for a resource of minimum color, which another process has; that process could only be waiting for a resource of strictly larger color; and so on.

Figure: A 2-coloring of the resource graph for six dining philosophers.

The algorithm has the nice property that the worst-case time is not directly dependent

on the total number of processes and resources. Rather, let s be an upper bound on step time, and c an upper bound on critical section time, as before. Let k be the number of colors, and let m be the maximal number of users for a single resource. We show that the worst-case time bound is O(m^k c + k m^k s); that is, the time depends only on "local" parameters. (We can make a reasonable case that, in a realistic distributed network, these parameters are actually local, not dependent on the size of the network.) Note that this is a big bound: it is exponential in k, so the complexity is not proportional to the length of the waiting chain, but is actually more complicated. There are some complex interleavings that get close to this bound.

To prove the upper bound, we use the same strategy as before. Define T_{i,j}, for 1 ≤ i ≤ k (indicating colors) and 1 ≤ j ≤ m (indicating positions on a queue), to be the worst-case time from when a process reaches position j on any queue of a resource with color i, until it reaches C. Then set up inequalities, as for the Left-Right algorithm. The base case is when the process already has the last resource:

    T_{k,1} ≤ s.

Also, when a process is first on any queue, within time s it will be (at worst) at position m on the next higher queue:

    T_{i,1} ≤ s + T_{i+1,m},  for all i < k.

And lastly, when a process is in position j ≥ 2 on queue i, it only has to wait for its predecessor on the queue to get the resource and give it up, and then it gets the resource:

    T_{i,j} ≤ T_{i,j-1} + c + ks + T_{i,1},  for all i, and for all j ≥ 2.

The claimed bound follows from solving for T_{1,m}, after adding an extra s.
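The recurrences can be solved numerically bottom-up; this sketch computes the bound and checks the closed form for k = 1 (the indexing and the final extra s follow the inequalities above):

```python
def time_bound(k, m, s, c):
    # T[i][j]: worst-case time from position j on a color-i queue to C
    T = [[0.0] * (m + 1) for _ in range(k + 1)]
    for i in range(k, 0, -1):
        T[i][1] = s if i == k else s + T[i + 1][m]
        for j in range(2, m + 1):
            T[i][j] = T[i][j - 1] + c + k * s + T[i][1]
    return s + T[1][m]     # the extra s: time to join the first queue

# closed form for a single color (k = 1): 2s + (m - 1)(c + 2s)
assert time_bound(1, 3, 1, 10) == 2 + 2 * 12
# the bound grows roughly like m**k: exponential in the number of colors
assert time_bound(3, 2, 1, 1) > time_bound(2, 2, 1, 1) > time_bound(1, 2, 1, 1)
```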

Cutting Down the Time Bound: This result shows that a very general class of resource allocation problems can have time independent of global network parameters. But it would be nice to cut down the time bound from exponential to linear in the number of colors. There are some preliminary results by Awerbuch-Saks and by Choy-Singh, but there is probably still more work to be done.

Safe and Regular Registers

In this section, we switch from the very powerful read-modify-write shared memory model to some much weaker variants. We start with some definitions.

Definitions

We now again focus on read/write objects (registers), and consider generalizations of the notion of an atomic read/write register: safe and regular registers. These variants are weaker than atomic registers, but can still be used to solve interesting problems. They fit our general object definition; in particular, the inputs are read_{i,x} and write_{i,x}(v), with outputs as before. The registers are modeled as I/O automata, as in the figure below.

Figure: Concurrent read/write register X, with interface actions write(v), write-respond, read, and read-respond(v). Subscripts mentioned in the text are omitted from the diagram.

As before, the index i on the read and write operations corresponds to a particular line on which a process can access the register. Also as before, we assume that operations on each line are invoked sequentially, i.e., no new operations are invoked on a line until all previous operations invoked on that line have returned. But otherwise, operations can overlap.

Recall the definition of atomic registers. An atomic register has three properties: it preserves well-formedness, it satisfies the atomicity condition, and fairness implies that operations complete. Also, depending on the architecture of the implementation, we can have a wait-free condition. Now we generalize the second (atomicity) condition, without changing the other conditions. This still fits the general object spec model, so the modularity results hold; this allows us to build objects hierarchically, even preserving the wait-free property.

For these weaker conditions, we only consider single-writer registers. This means that writes never overlap one another. We require, in all cases, that any read that doesn't overlap a write gets the value of the closest preceding write (or the initial value, if none). This is what would happen in an atomic register. The definitions differ from atomic registers in allowing for more possibilities when a read does overlap a write.

The weakest possibility is a safe register, in which a read that overlaps a write can return an arbitrary value. The other possibility is a regular register, which falls somewhere in between safe and atomic registers. As above, a read operation on a regular register returns the correct value if it does not overlap any write operations. However, if a read overlaps one or more writes, it has to return the value of the register either before or after any of the writes it overlaps.

Figure: A read overlapping writes. Writes W0, W1, W2, W3, W4 occur in sequence; read R1 overlaps W1 through W4, and read R2 overlaps W3 and W4.

For example, consider the scenario in the figure. The set of feasible writes for R1 is {W0, W1, W2, W3, W4}, because R1 is allowed to return a value written by any of these write operations. Similarly, the set of feasible writes for R2 is {W2, W3, W4}. The important difference between regular and atomic registers is that for regular registers, consecutive reads can return values in inverted order.
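The feasible-write set is easy to compute from operation intervals. A sketch (the numeric interval endpoints are illustrative assumptions chosen to match the figure's overlaps):

```python
def feasible_writes(read, writes, initial="W_init"):
    # operations are (name, start, end) intervals; writes never overlap
    # one another (single writer). A regular read may return the value
    # of any write it overlaps, or of the last write ending before it.
    r_start, r_end = read[1], read[2]
    overlapping = [w for w in writes if w[1] < r_end and r_start < w[2]]
    preceding = [w for w in writes if w[2] <= r_start]
    last = max(preceding, key=lambda w: w[2], default=(initial, 0, 0))
    return {last[0]} | {w[0] for w in overlapping}

# the scenario of the figure: R1 overlaps W1..W4, R2 overlaps W3 and W4
W = [("W0", 0, 1), ("W1", 2, 3), ("W2", 4, 5), ("W3", 6, 7), ("W4", 8, 9)]
assert feasible_writes(("R1", 2.5, 8.5), W) == {"W0", "W1", "W2", "W3", "W4"}
assert feasible_writes(("R2", 6.5, 9.5), W) == {"W2", "W3", "W4"}
```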

Implementation Relationships for Registers

Suppose we only have safe registers (because that's all our hardware provides), and we would like to have atomic registers. We shall show that, in fact, we can start with safe single-reader single-writer binary registers, and build all the way up to atomic multi-reader multi-writer k-ary registers, for arbitrary k. Lamport carries out a series of wait-free constructions, starting with the simplest registers and ending up with single-reader single-writer k-ary atomic registers. Others have completed the remaining constructions. We have already seen how to get multi-writer multi-reader registers from single-writer multi-reader registers (although there is room for improved efficiency). Also, Singh, Anderson, and Gouda have given an implementation of single-writer multi-reader registers from single-writer single-reader registers.

In this general development, some of the algorithms are logically complex and/or time-consuming, so there is some question about their practicality.

The general development we consider here can be formalized in the object spec model, so we get transitivity, as discussed earlier. Recall the architecture:

Figure: Example -- implementing a 2-writer, 1-reader register with two 1-writer, 2-reader registers x1 and x2. Write process WP1 sits on writer line 1, WP2 on writer line 2, and the read process RP on the reader line.

In the figure, we see a particular implementation of a 2-writer, 1-reader register in terms of two 1-writer, 2-reader registers. There is a process for each line of the register being implemented, i.e., one for each of the high-level writers and one for the high-level reader. These processes are connected to the low-level registers in such a way that each line of a low-level register is connected to only one process.

In general, we call the high-level registers logical registers, and the low-level registers physical registers. In all of the constructions we will consider, a logical register is constructed from one or more physical registers. Each input line to the logical register is connected to a process. These processes, in turn, are connected to one or more of the physical registers, using internal lines. Exactly one process is connected to any internal line of a physical register; this permits the processes to guarantee sequentiality on each line. Processes connected to external write lines are called write processes, and processes connected to external read lines are called read processes. Note that nothing prevents a read process from being connected to an internal write line of a physical register, and vice versa.

We follow Lamport's development. We have safe vs. regular vs. atomic, single-reader vs. multi-reader, and binary vs. k-ary registers. We thus consider the twelve different kinds of registers that this classification gives rise to, and see which register types can be used to implement which other types. The implementation relationships in the figure below should be obvious, because they indicate types of registers that are special cases of other types.

Figure: Obvious implementation relationships between register types. An arrow from type A to type B means that B can be implemented using A; each type implements any type obtained by weakening one of its attributes (atomic over regular over safe, n-reader over 1-reader, k-ary over binary).

Register Constructions

Lamport presents five constructions to establish further implementation relationships. In the following constructions, actions on external lines are always specified in UPPERCASE, whereas actions on internal lines are specified in lowercase. For example, WRITE and READ denote external operations, whereas write and read denote internal operations.

N-Reader Registers from Single-Reader Registers

The following construction (Construction 1 in the paper) implements an n-reader safe register from n single-reader safe registers. The same algorithm also implements an n-reader regular register from n single-reader regular registers. The write process is connected to the write lines of all n internal registers, as in the figure below.

Figure: n-reader register from n single-reader registers. The write process WP writes each of x_1, ..., x_n; read process RP_i reads only x_i.

Read process i is connected to the read line of the i-th physical register. Thus, the write process writes all the physical registers, while each read process reads only one physical register. The code is as follows.

WRITE(v): For all i in {1, ..., n}, invoke write(v) on x_i. Wait for acks from all the x_i, and then do ACK. (The writes can be done in any order.)

READ (by read process i): Invoke read on x_i. Wait for the returned value v, and then do RETURN(v).
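Sequentially, the construction is simply "write every copy, read your own copy". A minimal sketch (the class name is illustrative; concurrency and the ack handshake are elided):

```python
class NReaderFromSingleReader:
    # one write process, n read processes: the writer writes every
    # physical copy, and read process i reads only copy x_i
    def __init__(self, n, initial=0):
        self.x = [initial] * n      # n single-reader physical registers

    def WRITE(self, v):
        for i in range(len(self.x)):  # one low-level write per copy,
            self.x[i] = v             # in any order

    def READ(self, i):
        return self.x[i]              # reader i touches only x_i

r = NReaderFromSingleReader(3)
r.WRITE(7)
assert all(r.READ(i) == 7 for i in range(3))
```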

Claim: If x_1, ..., x_n are safe registers, then so is the logical register.

Proof: Within each WRITE, exactly one write is performed on each particular register x_i. Therefore, since WRITE operations occur sequentially, write operations for each particular x_i are also sequential. Likewise, read operations for a particular x_i are also sequential. Therefore, each physical register has the required sequentiality of accesses. This means that the physical registers return values according to their behavior specs.

Now, if a READ, say by RP_i, does not overlap any WRITE, then its contained read does not overlap any write to x_i. Therefore, the correctness of x_i assures that the read operation gets the value written by the last completed write to x_i. This is the same value as written by the last completed WRITE, and since RP_i returns this value, this READ returns the value of the last completed WRITE.

Claim: If x_1, ..., x_n are regular registers, then so is the logical register.

Proof: We can reuse the preceding proof to get the required sequentiality of accesses to each physical register, and to prove that any READ that does not overlap a WRITE returns the correct value. Therefore, we only need to show that if a READ R overlaps some WRITE operations, then it returns the value written by a feasible WRITE. Since the read r for R falls somewhere inside the interval of R, the set of feasible writes for r corresponds to a subset of the feasible WRITEs for R. Therefore, regularity of the physical register x_i implies that r gets one of the values written by its set of feasible writes, and hence R gets one of the values written by the set of feasible WRITEs.

Consider two READs done sequentially at different read processes, both bracketed within a single WRITE. They can obtain out-of-order results: the first may see the result of the WRITE while the later one does not, since the WRITE is not done atomically. This implies that the construction does not make the logical register atomic, even if the x_i are atomic.

With this construction, the implementation relationships collapse, as in the figure below.

Wait-freedom: The previous construction guarantees that all logical operations terminate within a bounded number of steps of the given process, regardless of what the other processes do. Therefore, the implementation is wait-free. In fact, the property satisfied is stronger, since it mentions a bounded number of steps; sometimes in the literature, wait-freedom is formulated in terms of such a bound.

K-ary Safe Registers from Binary Safe Registers

If l = ⌈log₂ k⌉, then we can implement a k-ary safe register using l binary safe registers (Construction 2 in the paper). We do this by storing the i-th bit of the value in binary register x_i. The logical register allows the same number of readers as the physical registers do (see the figure below).

The code for this construction is as follows.

WRITE(v): For i in {1, ..., l}, in any order, write bit i of v to register x_i.

Figure: Collapsed implementation relationships.

Figure: K-ary safe registers from binary safe registers. The write process WP writes each of the bit registers x_1, ..., x_l; each read process RP_i reads all l bit registers.

READ: For i in {1, ..., l}, in any order, read bit v_i from register x_i. Then RETURN(v), where v is the value whose bits are v_1, ..., v_l.

Note that, unlike Construction 1, this construction works for safe registers only; i.e., a k-ary regular register cannot be constructed from binary regular registers using this method.

With this construction, the relationships collapse further, as in the figure below.
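To see why regularity is lost, consider a WRITE changing the value from 1 (binary 01) to 2 (binary 10): a READ that samples the bit registers between the two low-level writes can assemble 3 (11) or 0 (00), values that were never written. A sketch enumerating the possible interleavings for the two-bit case (helper name and enumeration style are illustrative):

```python
def interleaved_reads(old, new, l):
    # the writer flips the differing bit registers one at a time, in
    # either order; a read may sample any intermediate bit pattern
    seen = set()
    for flip_order in ([0, 1], [1, 0]):
        bits = [(old >> i) & 1 for i in range(l)]
        snapshots = [tuple(bits)]
        for i in flip_order:
            bits[i] = (new >> i) & 1
            snapshots.append(tuple(bits))
        for snap in snapshots:
            seen.add(sum(b << i for i, b in enumerate(snap)))
    return seen

vals = interleaved_reads(1, 2, 2)
# includes 0 and 3: neither the old value (1) nor the new value (2),
# so the logical register is not regular even if the bits are regular
assert vals == {0, 1, 2, 3}
```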

Figure: Collapsed implementation relationships.

6.852J/18.437J Distributed Algorithms -- November

Lecturer: Nancy Lynch

Lecture

Register Constructions (cont.)

We continue investigating the implementation relationships among the different register models. In the last lecture, we saw Constructions 1 and 2, and consequently reduced the relationships to those described in the figure below.

[Figure: diagram of the remaining implementation relationships among the safe, regular, and atomic register models (binary and k-ary, 1-reader and n-reader).]

Figure: Collapsed Implementation Relationships

Construction: Binary Regular Register from Binary Safe Register

A binary regular register can easily be implemented using just one binary safe register (see the figure below).

Recall that a read from a binary safe register may return any value from its domain if the read overlaps a write. For binary registers, the domain consists of only 0 and 1. Notice that, since a binary regular register is only required to return a feasible value, it suffices to ensure

[Figure: writer W and write process WP share a single binary safe register x, read by read processes RP_1, ..., RP_n on behalf of readers R_1, ..., R_n.]

Figure: Binary Regular Register from Binary Safe Register

that all writes done on the low-level register actually change the value, because in this case either value is feasible.

The basic idea is that the write process WP keeps track of the contents of register x. This is easy because WP is the only writer of x. WP does a low-level write only when it is performing a WRITE that would actually change the value of the register. If the WRITE is just rewriting the old value, then WP finishes the operation right away, without touching x. Therefore, all low-level writes toggle the value of the register. Specifically, the code is as follows.

WRITE(v): If v is the last recorded value, then do WRITE-RESPOND; else do write(v), and upon receiving write-respond, do WRITE-RESPOND.

READ: Do read; upon receiving read-respond(x), do READ-RESPOND(x).

Now consider what happens when a READ is overlapped by a WRITE. If the corresponding write is not performed (i.e., the value is unchanged), then register x will just return the old value, and this READ will be correct. If the corresponding write is performed, x may return either 0 or 1. However, both 0 and 1 are in the feasible value set of this READ, because the overlapping WRITE is toggling the value of the register. Therefore, the READ will be correct.
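The writer-side bookkeeping can be sketched as follows (a sequential Python illustration, not a real concurrent implementation; the class and field names are ours):

```python
class ToggleRegular:
    """Sketch of the toggle construction: one binary safe register (self.x)
    implements a binary regular register.  The writer keeps a local copy
    (self.last) and only touches x when the value actually changes, so
    every low-level write toggles x."""

    def __init__(self, initial=0):
        self.x = initial      # the underlying binary safe register
        self.last = initial   # writer's local record of x's contents

    def WRITE(self, v):
        # Skip the low-level write entirely when it would not change x.
        if v != self.last:
            self.x = v
            self.last = v

    def READ(self):
        return self.x
```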

The implementation relationships now reduce further, as shown in the next figure.

[Figure: diagram of the remaining implementation relationships among the regular and atomic register models (binary and k-ary, 1-reader and n-reader), with all safe models now collapsed.]

Figure: Collapsed Implementation Relationships

Construction: K-ary Regular Register from Binary Regular Registers

We can implement a k-ary regular register using k binary regular registers, arranged as in the figure below. Here, we assume that the k values of the register are encoded as 0, ..., k − 1. We will use a unary encoding of these values, and the idea (which has appeared in other papers, by Lamport and by Peterson) of writing a sequence of bits in one order and reading them in the opposite order. We remark that Lamport also gives an improved implementation for binary representation, thus reducing the space requirement logarithmically.

If the initial value of the logical register is v, then initially x_v is 1 and the other physical registers are all 0 (using unary representation). The code is as follows.

WRITE(v): First write 1 in x_v. Then, in order, write 0 in x_{v−1}, ..., x_0.

READ: Do read(x_0), read(x_1), ..., in order, until x_v = 1 is found for some v. Then RETURN(v).

Note that RP is guaranteed to find a nonzero x_v, because whenever a physical register is zeroed out, there is already a 1 written in a higher-index register. Alternatively, we could initialize the system so that x_{k−1} = 1, and then we would have x_{k−1} always equal to 1. We now prove the correctness of this construction formally.

Claim: If a READ R sees 1 in x_v, then v must have been written by a WRITE that is feasible for R.

[Figure: the write process WP writes binary regular registers x_0, ..., x_{k−1}; read processes RP_1, ..., RP_n read them on behalf of readers R_1, ..., R_n.]

Figure: K-ary Regular Registers from Binary Regular Registers

Proof: By contradiction. Suppose R sees x_v = 1, and neither an overlapping nor an immediately preceding WRITE wrote v to the logical register. Then v was written either sometime in the past (say, by WRITE W), or v is the initial value. We analyze below the former case; the initial-value case can be treated similarly.

So let W be the last WRITE completely preceding R that wrote v. Since v is not a feasible value for R, there must be another WRITE W' after W which completely precedes R. Any such W' (i.e., a WRITE that follows W and completely precedes R) can write only values strictly smaller than v: if it wrote a value greater than v, it would set x_v to 0 before R could see x_v = 1, and it cannot write v, by the choice of W. Note that if such a W', writing value v', completely precedes the read of x_v in R, then the contained write of 1 to x_{v'} completely precedes the read of x_v.

So consider the latest write of 1 to a register x_{v'}, such that v' < v, and such that the write follows W and completely precedes the read of x_v in R. By the above arguments, there is at least one such write.

Since R reads the registers in order x_0, x_1, ..., and it returns value v > v', R must see x_{v'} = 0. But we have just identified a place where x_{v'} is set to 1. Therefore, there exists some write of 0 to x_{v'}, which follows the write of 1 to x_{v'}, and either precedes or overlaps the read of x_{v'} in R. This can only happen if there is an intervening write of 1 to some x_{v''} such that v'' > v'. As above, in this case we must have v'' < v: if not, then before doing the write of 0 to x_{v'}, the enclosing WRITE would overwrite the 1 in x_v, and so R could not get it. But then this is a contradiction to the choice of the write of 1 to x_{v'} as the latest. (Note that the write of 1 to x_{v''} completely precedes the read of x_v in R, because the write of 0 to x_{v'} either precedes or overlaps the read of x_{v'} in R, and v' < v.)

The implementation relationships now collapse as shown in Figure

nreader

k ary atomic

s

J

J

J

J

nreader reader

s s

binary atomic k ary atomic

J

J

J

J

s

reader

binary atomic

s

regular

and safe

Figure Collapsed Implementation Relationships

Construction: 1-Reader K-ary Atomic Register from Regular Registers

We now show how to construct a 1-writer, 1-reader, k-ary atomic register from two 1-writer, 1-reader regular registers, arranged as in the figure below; the code is given in the figure that follows.

The cases in the code are designed to decide whether to return the new or the old value read. Intuitively, if x'.num = 2, it means that there has been much progress in the write (or it may be finished), so it is reasonable to return the new value. At the same time, it is useful to remember that the new value has already been returned, in the newreturned variable. This is because a later read of x that overlaps a write could get an earlier version of the variable, with num = 1, because x is a regular register and not an atomic register. (Notice that it couldn't get num = 0.) The second case covers the situation in which the previous read could have already returned the value of a write that overlaps both. The conditions look a little ad hoc. Unfortunately, we can't give a correctness proof in class; for the correctness proof, we refer the reader to Lamport's paper. This is not the sort of proof that can be done in any obvious way using invariants, because the underlying registers are not atomic. Instead,

[Figure: the writer W/WP writes regular register x, which the reader R/RP reads; the reader writes regular register y, which the writer reads.]

Figure: 1-Reader Atomic Register from Regular Registers

Lamport developed an alternative theory of concurrency, based on actions with duration, overlapping or totally preceding each other. Another approach to proving correctness is to use the lemma presented in an earlier lecture.

The implementation relationships now collapse as in the figure below. This concludes Lamport's constructions.

Multi-Reader Atomic Registers

Notice that the constructions thus far still leave a gap between the n-reader, 1-writer atomic registers and the 1-reader, 1-writer atomic registers. This gap has been closed in a couple of other papers [SinghAG, NewmanW]. We remark that these algorithms are somewhat complicated; it is also not clear if they are practical, since their time complexity is high. Singh-Anderson-Gouda have a reasonable construction, related to Lamport's construction above.

Lamport's Bakery Revisited

We now show that the original Lamport bakery algorithm presented in class still works if the underlying registers are only safe. The explanation is as follows.

A read of the choosing variable while it is being written is guaranteed to get either 0 or 1, since these are the only possible values; but those correspond to either the point before or the point after the write step.

If a number is read by a chooser while it is being written, the chooser can get an arbitrary value. When the chooser takes the maximum of this arbitrary value with

Shared variables (regular registers):

x: stores tuples (old, new, num, color), where old, new are from the value domain of the atomic register, num ∈ {0, 1, 2}, and color ∈ {red, blue}. Initially x = (v_0, v_0, 0, red).

y: stores values from {red, blue}. Initially y = blue.

Code for WRITE(v):
  newcolor := y
  oldvalue := x.new   (record value of x locally; no need to read it)
  for i ∈ {1, 2} do
    x := (oldvalue, v, i, newcolor)

Code for READ:
  (remember the last two reads, in x' and x'')
  x'' := x'
  x' := x
  y := x'.color
  case
    x'.num = 2:
      newreturned := true
      return x'.new
    x'.num = 1 and (x''.num = 2 or (x''.num = 1 and newreturned and x'.color = x''.color)):
      return x'.new
    otherwise:
      newreturned := false
      return x'.old
  endcase

Figure: Code for constructing a 1-reader k-ary atomic register from regular registers
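A sequential Python transcription of the code above may help in reading it (our sketch; the interesting behavior is concurrent, which plain function calls cannot exhibit, so this only sanity-checks the sequential case):

```python
class Reg:
    """Sequential sketch of the 1-reader atomic register code.  The shared
    regular registers x and y are simulated by plain attributes; field
    order in x is (old, new, num, color), as in the pseudocode."""

    def __init__(self, v0):
        self.x = (v0, v0, 0, 'red')
        self.y = 'blue'
        # reader-local state: x1 is x', x2 is x''
        self.x1 = self.x
        self.x2 = self.x
        self.newreturned = False

    def WRITE(self, v):
        newcolor = self.y
        oldvalue = self.x[1]          # record x.new locally
        for i in (1, 2):
            self.x = (oldvalue, v, i, newcolor)

    def READ(self):
        self.x2 = self.x1             # remember the last two reads
        self.x1 = self.x
        self.y = self.x1[3]
        old, new, num, color = self.x1
        if num == 2:
            self.newreturned = True
            return new
        if num == 1 and (self.x2[2] == 2 or
                         (self.x2[2] == 1 and self.newreturned
                          and color == self.x2[3])):
            return new
        self.newreturned = False
        return old
```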

[Figure: remaining implementation relationships: n-reader k-ary atomic, n-reader binary atomic, and the regular, safe, and 1-reader atomic models.]

Figure: Collapsed Implementation Relationships

the others, the result is that the chooser is guaranteed to choose something bigger than all the values it sees without overlap; but there is no guarantee that it will choose something bigger than a concurrent writer. However, the concurrent writer is choosing concurrently, and the algorithm didn't need any particular relation between values chosen concurrently.

Finally, suppose someone reads a number, in the comparison test, while it's being written. If the test succeeds, the writer is back at the point of choosing its value, and everything is fine; that is, it must be that the reader saw the writer before it began choosing its number, so the writer is guaranteed to choose a larger number. The problem is if the writer's value comes back as very small, which can delay the reader. However, it cannot delay the reader forever.

Asynchronous Network Algorithms

Now we make another major switch in the model we consider; namely, we switch from asynchronous shared memory to asynchronous networks. As in synchronous networks, we assume a graph (or digraph) G, with processes at the nodes and communication over directed edges. But now, instead of synchronous rounds of communication, we have asynchrony in the communication as well as in the process steps.

We model each process as an IOA, generally with some inputs and outputs connecting it with the outside world. This is the boundary where the problems will usually be stated. In addition, process i has outputs of the form send(m)_{i,j}, where j is an outgoing neighbor and m is a message (an element of some message alphabet M), and inputs of the form receive(m)_{j,i}, for j an incoming neighbor. Except for these interface restrictions, the process can be an arbitrary IOA. (We sometimes restrict the number of classes, or the number of states, etc., to obtain complexity results.)

To model communication, we give a behavior specification for the allowable sequences of send(i, j) and receive(i, j) actions for each edge (i, j). This can be done in two ways: by a list of properties (or axioms), or by giving an IOA whose fair behaviors are exactly the desired sequences. The advantage of giving an explicit IOA is that, in this case, the entire system is described simply as a composition, and we have information about the state of the entire system to use in invariants and simulation proofs. But sometimes it's necessary to do some unnatural programming to specify the desired behavior as an IOA, especially when we want to describe liveness constraints. Let us now make a short digression to define the important notions of safety and liveness.

Let S and L be properties (i.e., sets of sequences) over a given action alphabet. S is a safety property if the following conditions hold:

1. S is nonempty, i.e., there is some sequence that satisfies S;

2. S is prefix-closed, i.e., if a sequence satisfies S, then all its prefixes satisfy S; and

3. S is limit-closed, i.e., if every finite prefix of an infinite sequence satisfies S, then the infinite sequence satisfies S.

L is a liveness property if for every finite sequence, there is some extension that satisfies L. Informally, safety properties are appropriate for expressing the idea that nothing "bad" ever happens: if nothing bad has happened in a finite sequence, then nothing bad has happened in any prefix either; moreover, if something bad happens, it happens at a finite point in time. Liveness, on the other hand, is most often used to express the idea that something "good" eventually happens.
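A toy illustration of the definitions, with finite traces as Python tuples (the particular properties chosen are ours):

```python
# "At most one 'leader' event" is a safety property: nonempty,
# prefix-closed, and limit-closed (a violation appears at a finite point).
# "Some 'leader' event occurs" is a liveness property: every finite trace
# has an extension satisfying it.

def safe(trace):                         # the safety property S
    return trace.count('leader') <= 1

def prefixes(trace):
    return [trace[:i] for i in range(len(trace) + 1)]

t = ('send', 'receive', 'leader')
assert safe(t) and all(safe(p) for p in prefixes(t))   # prefix-closure

def live_extension(trace):               # witness for the liveness property L
    return trace + ('leader',)           # some extension satisfies L

assert 'leader' in live_extension(('send', 'send'))
```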

We now return to the asynchronous message-passing model. The most common communication channel studied in the research literature is a reliable FIFO channel. Formally, this is an IOA with the appropriate interface, whose state is a queue of messages. The send(i, j) action puts its argument at the end of the queue. The receive(m)_{i,j} action is enabled if m is at the head of the queue, and its effect is to remove the message from the queue. (Of course, it also delivers the message to the recipient process, which should handle it somehow; but that is a part of the process' job, not the channel's. The right division of responsibilities is obtained by the IOA composition definition.) The fairness partition here can put all the locally controlled actions in a single class.
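A minimal sketch of such a channel automaton in Python (the method names are ours; a deque stands in for the queue state):

```python
from collections import deque

class FIFOChannel:
    """Sketch of a reliable FIFO channel IOA: state is a queue of messages;
    send is an input action, receive a locally controlled action enabled
    when the queue is nonempty."""

    def __init__(self):
        self.queue = deque()            # channel state: messages in transit

    def send(self, m):                  # input action send(m)
        self.queue.append(m)

    def receive_enabled(self):
        return bool(self.queue)

    def receive(self):                  # locally controlled action receive(m)
        assert self.receive_enabled()
        return self.queue.popleft()     # delivers the head of the queue

ch = FIFOChannel()
for m in ('a', 'b', 'c'):
    ch.send(m)
assert [ch.receive() for _ in range(3)] == ['a', 'b', 'c']   # FIFO order
```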

To gain some experience with asynchronous network algorithms within this model, we go back to the beginning of the course and reconsider some of the examples we previously considered in synchronous networks.

Leader Election

Recall the algorithms we considered earlier. First, the graph is a ring, with UIDs built into the initial states of the processors, as before. In the action signature, there are no inputs from the outside world, and the only output actions are leader_i, one for each process i.

LeLann-Chang Algorithm

In this algorithm, each process sends its ID around the (unidirectional) ring. When a node receives an incoming identifier, it compares that identifier to its own: if the incoming identifier is greater than its own, it keeps passing the identifier; if it is less than its own, the identifier is discarded; and if it is equal, it outputs the leader action. This idea still works in an asynchronous system. A figure below gives the code for the algorithm.

Essentially, the behavior of the algorithm is the same as in the synchronous case, but skewed in time. We might prove it correct by relating it to the synchronous algorithm, but the relationship would not be a forward simulation, since things happen here in different orders; we would need a more complex correspondence. Instead of pursuing that idea, we'll discuss a more common style of verifying such algorithms, namely, direct use of invariants and induction.

Invariant proofs for asynchronous systems work fine. The induction is now on individual processes' steps, whereas the induction in the synchronous model was on global system steps. Here, as in the synchronous case, we must show two things:

1. that no one except the maximum ever outputs leader, and

2. that a leader output eventually occurs.

Note that (1) is a safety property, asserting the fact that no violation ever occurs, and (2) is a liveness property.

Safety properties can be proved using invariants. Here, we use an invariant similar to the one we stated for the synchronous case.

Invariant: If i ≠ i_max and j ∈ [i_max, i), then own(i) ∉ send(j).

We want to prove this invariant by induction, again similar to the synchronous case (recall a homework problem about this). But now, due to the introduction of the message channels with their own states, we need to add another property to make the induction go through.

Invariant: If i ≠ i_max and j ∈ [i_max, i), then own(i) ∉ queue(j, j+1).

With these two statements together, the induction works much as before. More specifically, we proceed by case analysis, based on one process at a time performing an action. The key step, as before, is where own(i) gets pruned out by process i_max. This proves the safety property.

We now turn to the liveness property, namely, that a leader_i event eventually occurs. For this property, things are quite different from the synchronous case. Recall that in the synchronous case, we used a very strong invariant, which characterized exactly where the maximum value had reached after k rounds. Now, there is no notion of round, so the proof must be different: we can't give such a precise characterization of what happens in the computation, since there is now so much more uncertainty. So we need to use a different method, based on making progress toward a goal. Here, we show inductively on k, for 0 ≤ k ≤ n − 1, that eventually own(i_max) winds up in send(i_max + k). We then apply this to k = n − 1, and show that own(i_max) eventually gets put into the (i_max − 1, i_max) channel, then that this message eventually is received at i_max, and therefore eventually a leader_{i_max} action gets done. All these "eventually" arguments depend on using the IOA fairness definition.

For example, suppose that there is a state s, appearing in some fair execution, where a message m appears at the head of a send queue in s. We want to show that eventually send(m) must occur. Suppose not. By examination of the allowable transitions, we conclude that m remains at the head of the send queue forever. This means that the send class stays enabled forever. By the IOA fairness, some send must eventually occur. But m is the only message at the head of the queue, so it is the only one for which an action can be enabled; so send(m) must eventually occur.

Likewise, if m appears in the k-th position in the send queue, we can prove that eventually send(m) occurs. This is proven by induction on k, with the base case above: the inductive step says that something in position k eventually gets to position k − 1, and this is based on the head eventually getting removed.

We remark that this type of argument can be formalized within temporal logic, using statements of the form P "leads to" Q (see Manna and Pnueli).

State of process i:

own, type UID, initially i's UID
send, type queue of UIDs, initially containing only i's UID
status, takes values from {unknown, chosen, reported}, initially unknown

Actions of process i:

send(m)_{i,i+1}
  Precondition: m is first on send
  Effect: Remove first element of send

receive(m)_{i-1,i}
  Effect:
    case
      m > own: add m to end of send
      m = own: status := chosen
      otherwise: do nothing
    endcase

leader_i
  Precondition: status = chosen
  Effect: status := reported

Fairness partition classes: For any process, all send actions are in one class, and the leader action in another (singleton) class.

Figure: LeLann-Chang algorithm for leader election in asynchronous rings
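The pseudocode above can be exercised with a small sequential simulation (our sketch; the scheduling loop below is just one arbitrary fair schedule, and the function name is ours):

```python
from collections import deque

def lelann_chang(uids):
    """Simulate LeLann-Chang on a unidirectional ring with distinct UIDs.
    chan[i] holds messages in transit from node i to node i+1 (mod n)."""
    n = len(uids)
    chan = [deque([uids[i]]) for i in range(n)]   # each node first sends its UID
    status = ['unknown'] * n
    messages = 0
    while any(chan):                   # run until all channels are empty
        for i in range(n):
            if chan[i]:
                m = chan[i].popleft()  # deliver at node i+1
                j = (i + 1) % n
                messages += 1
                if m > uids[j]:
                    chan[j].append(m)  # pass larger UIDs along
                elif m == uids[j]:
                    status[j] = 'chosen'   # own UID survived a full loop
                # smaller UIDs are discarded
    return status, messages
```

Only the maximum UID survives a full trip around the ring, so exactly one process ends up chosen.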

6.852J/18.437J Distributed Algorithms, November

Lecturer: Nancy Lynch

Lecture

Leader Election in Asynchronous Rings (cont.)

The LeLann-Chang Algorithm (cont.)

We conclude our discussion of this algorithm with a time analysis. In the synchronous case, we had an easy way to measure time, by counting the number of synchronous rounds; for the synchronous version of this algorithm, we got about n. In the asynchronous case, we have to use a real-time measure, similar to what we used for asynchronous shared memory algorithms.

At this point, we pause and generalize this time measure to arbitrary IOAs. This turns out to be straightforward: thinking of the fairness classes as representing separate "tasks", we can just associate a positive real upper bound with each class. This generalizes what we did for shared memory in a natural way; there, we had upper bounds for each process and for each user (recall that we could model the user as a separate I/O automaton). In specializing this version to the reliable network model, we put an upper bound of l on each class of each process, and an upper bound of d on the time for the message queue to do a step, that is, the time to deliver the first message in the queue.

Using these definitions, we proceed with the analysis of the LeLann-Chang algorithm. A naive approach, as in the CWI paper, gives an O(n^2) bound. This is obtained by using the progress argument from the last lecture, adding in the time bounds along the way: in the worst case, it takes nl + nd time for the maximum UID to get from one node to the next (nl time to send it, and then nd to receive it). Note that in this analysis, we are allowing for the maximum number of messages, n, to delay the message of interest in all the queues. The overall time complexity is thus O(n^2 (l + d)).

But it is possible to carry out a more refined analysis, thus obtaining a better upper bound. The key idea is that, although the queues can grow to size at worst n, and incur nd time to deliver everything, this cannot happen everywhere. More specifically, in order for a long queue to form, some messages must have been traveling faster than the worst-case upper bound. Using this intuition, we can carry out an analysis that yields an O(n(l + d)) upper bound on time. Informally, the idea is to associate a "progress measure" with each state, giving an upper bound on the time from any point when the algorithm is in that state until the leader is reported. Formally, we prove the following claim.

Claim: Given a state in an execution of the algorithm, consider the distance around the ring of a "token" (i.e., message) i from its current position to its home node i. Let k be the maximum of these distances. Then within k(l + d) + l time, a leader event occurs.

Applying this claim to the initial state yields an upper bound for the algorithm of n(l + d) + l, as desired.

Proof: By induction on k.

Base case: k = 0, i.e., the i_max token has already arrived at node i_max. In this case, within l time units, the leader event will occur.

Inductive step: Suppose the claim holds for k − 1; we prove it holds for k. Suppose that k is the maximum of the indicated distances. Consider any token that is at distance k; by definition of k, no token is at distance strictly greater than k. This means that the token is either in the send queue at distance k, or in the channel's queue from distance k to k − 1. We now argue that this token is not blocked by any other token, i.e., that it is the only token in the combination of this send buffer and this queue. To see this, notice that if another token were there also, then that other token would be at distance strictly greater than k from its goal, a contradiction to the assumption that k is the maximum such value. So, within time at most l + d, the token on which we focus gets to the next node, i.e., to within distance k − 1 of its goal; and then, by the inductive hypothesis, in additional time (k − 1)(l + d) + l, a leader event occurs. This proves the inductive step.

The Hirshberg-Sinclair Algorithm

Recall the Hirshberg-Sinclair algorithm, which used the idea of two-directional exploration at successively doubled distances. It is straightforward to see that this algorithm still works fine in the asynchronous setting, with the same time and communication upper bounds (O(n log n) communication).

The Peterson Algorithm

Hirshberg and Sinclair, in their original paper, conjectured that in order to get O(n log n) worst-case message performance, a leader election algorithm would have to allow bidirectional message passing. Peterson, however, constructed an algorithm disproving this conjecture. More specifically, the assumptions used by the Peterson algorithm are the following:

1. The number of nodes in the ring is unknown.
2. Unidirectional channels.
3. Asynchronous model.
4. Unbounded identifier set.
5. Any node may be elected as the leader (not necessarily the maximum).

The Peterson algorithm not only has O(n log n) worst-case message performance, but in fact the constant factor is low: it is easy to show an upper bound of 2n log n, and Peterson used a trickier, optimized construction to get a worst-case performance of 1.44n log n. The constant has been brought even further down by other researchers.

The reason we didn't introduce this algorithm earlier in the course is that synchrony doesn't help in this case; it seems like a "naturally asynchronous" algorithm.

In Peterson's algorithm, processes are designated as being either in active mode or relay mode; all processes are initially active. We can consider the active processes as the ones doing the real work of the algorithm, the processes still participating in the leader election; the relay processes, in contrast, just pass messages along.

The Peterson algorithm is divided into (asynchronously determined) phases. In each phase, the number of active processes is divided at least in half, so there are at most log n phases.

In the first phase of the algorithm, each process sends its UID two steps clockwise. Thus, everyone can compare its own UID to those of its two counterclockwise neighbors. When it receives the UIDs of its two counterclockwise neighbors, each process checks to see whether it is in a configuration such that the immediate counterclockwise neighbor has the highest UID of the three; i.e., process i checks if id(2) > id(1) and id(2) > id(3), where id(1) is its own UID and id(2), id(3) are the UIDs of its first and second counterclockwise neighbors. If process i finds that this condition is satisfied, it remains active, adopting as a temporary UID the UID of its immediate counterclockwise neighbor. Any process for which the above condition does not hold becomes a relay for the remainder of the execution. The job of a relay is only to forward messages to active processes.

Subsequent phases proceed in much the same way among active processors: only those whose immediate active counterclockwise neighbor has the highest temporary UID of the three will remain active for the next phase. A process that remains active after a given phase will adopt a new temporary UID for the subsequent phase; this new UID will be that of its immediate active counterclockwise neighbor from the just-completed phase. The formal code for Peterson's algorithm is given in the two figures below.

It is clear that in any given phase, there will be at least one process that finds itself in a configuration allowing it to remain active, unless only one process participates in the phase,

State of process i:

mode, taking values from {active, relay}, initially active
status, taking values from {unknown, elected, reported}, initially unknown
id(j), where j ∈ {1, 2, 3}, type UID or nil; initially id(1) is i's UID, and id(2) = id(3) = nil
send, queue of UIDs, initially containing i's UID
receive, queue of UIDs

Actions of process i:

get-second-uid_i
  Precondition: mode = active
                receive is nonempty
                id(2) = nil
  Effect: id(2) := head(receive)
          remove head of receive
          add id(2) to end of send
          if id(2) = id(1) then status := elected

get-third-uid_i
  Precondition: mode = active
                receive is nonempty
                id(2) ≠ nil
                id(3) = nil
  Effect: id(3) := head(receive)
          remove head of receive

Figure: Peterson's algorithm for leader election (part 1)

Code for process i (cont.):

advance-phase_i
  Precondition: mode = active
                id(3) ≠ nil
                id(2) > max{id(1), id(3)}
  Effect: id(1) := id(2)
          id(2) := nil
          id(3) := nil
          add id(1) to send

become-relay_i
  Precondition: mode = active
                id(3) ≠ nil
                id(2) ≤ max{id(1), id(3)}
  Effect: mode := relay

relay_i
  Precondition: mode = relay
                receive is nonempty
  Effect: move head of receive to tail of send

send(m)_i
  Precondition: head(send) = m
  Effect: delete head of send

receive(m)_i
  Effect: add m to the tail of receive

Figure: Peterson's algorithm for leader election (part 2)
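A phase-level sketch of the algorithm in Python may help (our simplification: it simulates only the successive sets of active processes and their temporary UIDs, abstracting away the message passing and the relays; the function name is ours, and UIDs are assumed distinct):

```python
def peterson_phases(uids):
    """Phase-level simulation of Peterson's algorithm.  Each active process
    stays active, adopting its immediate counterclockwise neighbor's
    temporary UID, exactly when that UID is the largest of the three it
    sees; all others drop out.  Returns (winning UID, number of phases)."""
    active = list(uids)                  # temporary UIDs, in ring order
    phases = 0
    while len(active) > 1:
        n = len(active)
        survivors = []
        for i in range(n):
            own = active[i]
            nbr1 = active[(i - 1) % n]   # immediate counterclockwise neighbor
            nbr2 = active[(i - 2) % n]   # two away, counterclockwise
            if nbr1 > own and nbr1 > nbr2:
                survivors.append(nbr1)   # stay active, adopting nbr1's UID
        active = survivors
        phases += 1
    return active[0], phases
```

The maximum UID always survives (its clockwise neighbor keeps it), and at most half the active processes survive each phase, so the loop terminates with the maximum UID as the elected value.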

in which case the lone remaining process is declared the winner. Moreover, at most half the previously active processes can survive a given phase, since for every process that remains active, there is an immediate counterclockwise active neighbor that must go into its relay state. Thus, as stated above, the number of active processes is at least halved in each phase, until only one active process remains.

Communication and Time Analysis. The total number of phases in Peterson's algorithm is at most ⌊log n⌋, and during each phase, each process in the ring sends and receives exactly two messages (this applies to both active and relay processes). Thus, there are at most 2n⌊log n⌋ messages sent in the entire algorithm. Note that this is a much better constant factor than in the Hirshberg-Sinclair algorithm.

As for time performance, one might first estimate that the algorithm should take O(n log n) time, since there are log n phases, and each phase could involve a chain of message deliveries (passing through relays) of total length O(n). As it turns out, however, the algorithm terminates in O(n) time. We give a brief sketch of the time analysis. Again, we may neglect "pileups", which seem not to matter, for the same reason as in the analysis of LeLann-Chang. This allows us to simplify the analysis by assuming that every node gets a chance to send a message in between every two receives. For simplicity, we also assume that local processing time is negligible compared to the message-transmission time d, so we just consider the message-transmission time. Now, our plan is to trace backwards the longest sequential chain of message-sends that had to occur in order to produce a leader.

Let us denote the eventual winner by P0. In the final phase of the algorithm, P0 had to hear from two active counterclockwise neighbors, P1 and P2. In the worst case, the chain of messages sent from P2 to P0 is actually n in length, and P0 = P2, as depicted in the figure below.

Now consider the previous phase. We wish to continue pursuing the dependency chain backwards from P2 (which might be the same node as P0). The key point is that, for any two consecutive phases, it must be the case that between any two active processes in the later phase, there is at least one active process in the previous phase. Thus, at the next-to-last phase, there must have been an additional process in the interval starting from P0 counterclockwise to P1, and another additional process in the interval starting from P1 counterclockwise to P2.

Thus, the chain of messages pursued backward from P2 in the next-to-last phase, from P3 and P4, can at worst only extend as far as P1, i.e., in the worst case P4 = P1, as depicted in the second figure below. And there is an additional process Q1 between P3 and P4.

At the phase preceding the next-to-last phase, P4 waits to hear from P5 and P6, where P6 is at worst equal to Q1; also, there is an additional process Q2 between Q1 and P0 (see the figures).

[Figure: ring with P0 = P2 and P1.]

Figure: The last phase of the Peterson algorithm; P0 is the winner

[Figure: ring with P0, P3, Q1, and P1 = P4.]

Figure: The next-to-last phase of the Peterson algorithm

At the phase before this, P6 waits to hear from P7 and P8, etc. This analysis can go on, but we never get back around to P0, and so the total length of the dependency chain is at worst 2n. We conclude, therefore, that the time is at most around 2nd.

Burns' Lower Bound Result

All the algorithms in the previous section are designed to minimize the number of messages, and the best achieve O(n log n) communication. Recall that in the synchronous case, we had a matching Ω(n log n) lower bound, under the assumption that the algorithms are comparison based. That carries over to the asynchronous model, since the synchronous model can be formulated as a special case of the asynchronous model. Note that the algorithms to which this lower bound applies are not allowed to use the UIDs in arbitrary ways (e.g., for counting). As we saw, if we lift this restriction, then we can get O(n)-message algorithms in

[Figure: ring with P0, Q2, P3, P6 = Q1, P5, P4.]

Figure: The next preceding phase of the Peterson algorithm

the synchronous model.

It turns out, however, that the asynchronous network model has enough extra uncertainty that the Ω(n log n) lower bound applies regardless of how the identifiers are used. The proof of this fact is completely different from the synchronous case; now the asynchrony is used heavily. We consider leader election algorithms with the following properties:

1. The number of nodes in the ring is unknown.
2. Bidirectional channels.
3. Asynchronous model.
4. Unbounded identifier set.
5. Any node may be elected as leader.

For this setting, Burns proved the following result.

Theorem: Any leader election algorithm with the properties listed above sends Ω(n log n) messages in the worst case, where n is the number of processes in the ring.

For simplicity, we assume that n is a power of 2. (The proof can be extended to arbitrary n; cf. the homework.) We model each process as an I/O automaton, and stipulate that each automaton is distinct; in essence, that each process has a unique identifier. The automaton can be represented as in the figure below. Each process has two output actions, send-right and send-left, and two input actions, receive-right and receive-left.

Our job will ultimately be to see how a collection of automata of this type behave when arranged into a ring; however, in the course of this exploration, we would also like to see how the automata behave when arranged not in a ring, but simply in a straight line, as in the second figure below.

receiveleft sendright

sendleft receiveright

Figure A pro cess participating in a leader election algorithm

Figure A line of leaderelecting automata

Figure Formallywecansay that a line is a linear comp osition using IO automaton

comp osition of distinct automata chosen from the universal set of automata

The executions of such a line of automata can be examined in isolation, where the two terminal automata receive no input messages; in this case, the line simply operates on its own. Alternatively, we might choose to examine the executions of the line when certain input messages are provided to the two terminal automata.

As an added bit of notation, we will say that two lines of automata are compatible when they contain no common automaton between them. We will also define a join operation on two compatible lines, which simply concatenates the lines: this operation interposes a new message queue to connect the rightmost receive-right action of the first line with the leftmost send-left action of the second, and likewise to connect the leftmost receive-left action of the second line with the rightmost send-right action of the first, and then uses ordinary I/O automaton composition. Finally, the ring operation on a single line interposes new queues to connect the rightmost send-right and leftmost receive-left actions of the line, and the rightmost receive-right and leftmost send-left actions. The ring and join operations are depicted graphically in Figure ...

Figure: The join and ring operations.

We proceed with a proof that Ω(n log n) messages are required to elect a leader in a bidirectional asynchronous ring, where n is unknown to the processes and process identifiers are unbounded. For a system S (line or ring) and an execution α of S, we define the following notions:

- COMM(α, S) is the number of messages sent in execution α of system S.

- COMM(S) = sup_α COMM(α, S). Here we consider the number of messages sent during any execution of S. For lines, we only consider executions in which no messages come in from the ends.

- A state q of a ring is quiescent if there is no execution fragment starting from q in which any new message is sent.

- A state q of a line is quiescent if there is no execution fragment starting from q in which no messages arrive on the incoming links at the ends, and in which any new message is sent.

We now state and prove our key lemma.

Lemma: For every i ≥ 0, there is an infinite set of disjoint lines L_i such that for all L ∈ L_i, |L| = 2^i and COMM(L) ≥ (1/4) · i · 2^i.

Proof: By induction on i.

Base case: For i = 0, we need an infinite set of different processes, each of which can send at least one message without first receiving one. Suppose for contradiction that there are two processes, p and q, such that neither can send a message without first receiving one. Consider the rings R1, R2, and R3 shown in Figure ...: R1 contains p, R2 contains q, and R3 contains both p and q.

Figure: Basis for the proof of the lemma.

In all three rings, no messages are ever sent, so each process proceeds independently. Since R1 solves election, p must elect itself, and similarly for R2 and q. But then R3 elects two leaders, a contradiction. It follows that there is at most one process that can't send a message before receiving one. If there is an infinite number of processes, removing one leaves an infinite set of processes that will each send a message without first receiving one. Let L_0 be this set, which proves the base case.

Inductive step: Assume the lemma is true for i. Let n = 2^{i+1}. Let L, M, and N be any three lines from L_i. Consider all possible combinations of two of these lines into a line of double size: LM, LN, ML, NL, MN, and NM. Since infinitely many disjoint sets of three lines can be chosen from L_i, the following claim implies the lemma.

Claim: At least one of these lines can be made to send at least (n/4) log n messages.

Proof: Assume that the claim is false. By the inductive hypothesis, there exists a finite execution α_L of L for which COMM(α_L) ≥ (n/8) log(n/2), and in which no messages arrive from the ends.

We can assume without loss of generality that the final configuration of α_L is quiescent, since otherwise α_L can be extended to generate more messages, until the number (n/4) log n of messages is reached (which would already contradict our assumption). We can assume the same condition for α_M and α_N, by similar reasoning.

Figure: join(L, M); only the processes near the boundary can know about the join.

Now consider any two of the lines, say L and M, and consider join(L, M). Consider an execution that starts by running α_L on L and α_M on M, but delays messages over the boundary. In this execution, there are at least 2 · (n/8) log(n/2) = (n/4) log(n/2) messages. Now deliver the delayed messages over the boundary. By the assumption that the claim is false, the entire line must quiesce without sending more than n/4 additional messages, for otherwise the total exceeds (n/4) log(n/2) + n/4 = (n/4) log n. This means that at most n/4 processes in join(L, M) know about the join, and these are contiguous and adjacent to the boundary, as shown in Figure ... These processes extend at most halfway into either segment. Let us call this execution α_LM; similarly for α_LN, etc.

Figure: The ring R1, formed from the lines L, M, and N, with leader p1.

In ring R1 of Figure ..., consider an execution in which α_L, α_M, and α_N occur first, quiescing the three pieces separately. Then quiesce around the boundaries, as in α_LM, α_MN, and α_NL. Since the processes that know about each join extend at most halfway into either segment, these messages will be non-interfering. Similarly for R2.

Each of R1 and R2 elects a leader, say p1 and p2. We can assume without loss of generality that p1 is between the midpoint of L and the midpoint of M, as in Figure ... We consider cases based on the position of p2 in R2 (Figure ...) to get a contradiction.

If p2 is between the midpoint of L and the midpoint of N, as in Figure ..., then let R3 be assembled by joining just M and N.

Figure: The ring R2, formed from the lines L, N, and M, with leader p2.

Consider an execution of R3 in which the two segments are first run to quiescence, using α_M and α_N, and then quiescence occurs around the boundaries, as in α_MN and α_NM.

Figure: The ring R3 = ring(join(M, N)); the leader p3 is elected in the lower half.

Figure: The ring R2 again, now with both p2 and p3.

By the problem definition, a leader must be elected in R3, say p3. First suppose it is in the lower half, as in Figure ... Then the same behavior also occurs in R2, and p3 gets elected there too, as in Figure ... There are two leaders in this case, which is a contradiction. If p3 is in the upper half of R3, then we arrive at a similar contradiction in R1.

If p2 is between the midpoint of L and the midpoint of M, then we arrive at a similar contradiction based on R1 again.

If p2 is between the midpoint of M and the midpoint of N, then we arrive at a similar contradiction based on R4, a ring containing L and N, as in Figure ...

Figure: The ring R4 = ring(join(L, N)).

The lemma follows from the claim.

The lemma essentially proves the theorem: let n be a power of 2. Pick a line L of length n with COMM(L) ≥ (n/4) log n, and paste it into a circle using the ring operator. Let the processes in L behave exactly as they would if they were not connected (in the execution that sends the large number of messages), and delay all messages across the pasted boundary until all of the large number of messages have been sent. This gives us the desired bound.

Note the crucial part played by the asynchrony in this argument.

Problems in General Asynchronous Networks

So far, we have revisited several synchronous ring leader election algorithms from the beginning of the course, and have seen that they extend directly to the asynchronous setting. Now we will reexamine the algorithms for more general synchronous networks. In the sequel, we shall assume that the underlying communication network is modeled by a general undirected graph.

Network Leader Election

The synchronous algorithm we saw earlier for leader election in a general graph does not extend directly to the asynchronous model. In the synchronous algorithm, every node maintains a record of the maximum UID it has seen so far (initially its own). At each round, the node propagates this maximum on all its incident edges. The algorithm terminates after a number of rounds equal to the diameter; if the process then has its own ID as its known maximum, it announces itself as the leader.

In the asynchronous setting, we don't have rounds, and we don't have any information about time that would allow us to wait long enough to hear from everyone, so the above idea doesn't extend. We could still follow the basic strategy of propagating maximum UIDs asynchronously: whenever a process gets a new maximum, it propagates it sometime later. The problem now is that we don't know when to stop. To overcome this problem, we shall use other techniques.

Breadth-First Search and Shortest Paths

Reconsider the shortest paths algorithms. Recall that breadth-first search is a special case of shortest paths, in which all edges have weight 1. In these algorithms, we assume that we are given an initiator node i_0, and we want each node eventually to output a pointer to a parent in a shortest-paths (or breadth-first) tree. We can also output the distance along the shortest path.

The algorithms we presented earlier used the synchrony very heavily. E.g., for breadth-first search, we marked nodes in layers, one layer for each synchronous round. This was very efficient: O(E) total messages, and time proportional to the diameter.

We cannot run that algorithm directly in an asynchronous network, since different branches of the tree can grow more quickly than others. More specifically, this means that a node can get some information about one path first, and only later find out about a shorter path. Thus, we need to be able to adjust estimates downward when new information arrives. Recall that the synchronous shortest-paths algorithm we saw earlier already did this kind of adjustment; we called it a relaxation step.

This adjustment idea is used in a famous shortest-paths (and BFS) algorithm that imitates the Bellman-Ford algorithm from sequential complexity theory. In fact, this algorithm was the routing algorithm used in the ARPANET. Note that this will not be a terminating algorithm: the nodes will not know when they are done. We give the specification of this algorithm in precondition-effects style in Figure ...

State variables:

- best-dist: for i_0, this is initially 0; for others, ∞
- parent: initially undefined
- new-send: for i_0, initially neighbors(i_0); for others, initially ∅
  (says whom they should send a newly-discovered distance to)

Signature: We assume that the weights are known a priori, so no inputs are needed from the outside world. We need send and receive actions as usual; the only message is a distance value (a nonnegative real). There are no outputs, since it is not known when this terminates.

Actions:

send(m)_{i,j}
  Precondition: j ∈ new-send
                m = best-dist
  Effect: new-send := new-send − {j}

receive(m)_{j,i}
  Effect: if m + weight(j, i) < best-dist then
            best-dist := m + weight(j, i)
            new-send := neighbors(i) − {j}
            parent := j

Figure: Bellman-Ford algorithm for shortest paths.
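As a sanity check on the precondition-effect code above, here is a minimal Python simulation of the same relaxation rule under an arbitrary (randomized) message schedule. The function name async_bellman_ford, the per-edge FIFO channels, and the random scheduler are illustrative choices of ours, not part of the notes; the final best-dist values are independent of the schedule.

```python
import random
from collections import deque

def async_bellman_ford(n, weights, source=0, seed=1):
    # Build a symmetric weight map and neighbor sets from undirected edges.
    w = {}
    for (u, v), c in weights.items():
        w[(u, v)] = c
        w[(v, u)] = c
    nbrs = {u: set() for u in range(n)}
    for (u, v) in w:
        nbrs[u].add(v)
    best = {u: 0 if u == source else float('inf') for u in range(n)}
    parent = {u: None for u in range(n)}
    new_send = {u: set(nbrs[u]) if u == source else set() for u in range(n)}
    chans = {e: deque() for e in w}          # one FIFO channel per directed edge
    rng = random.Random(seed)
    while True:
        sends = [('send', (i, j)) for i in range(n) for j in new_send[i]]
        recvs = [('recv', e) for e, q in chans.items() if q]
        steps = sends + recvs
        if not steps:                        # quiescent: no enabled action
            break
        kind, (a, b) = rng.choice(steps)     # adversarial/arbitrary scheduling
        if kind == 'send':                   # send(best-dist) to one neighbor
            chans[(a, b)].append(best[a])
            new_send[a].discard(b)
        else:                                # receive(m): the relaxation step
            m = chans[(a, b)].popleft()
            if m + w[(a, b)] < best[b]:
                best[b] = m + w[(a, b)]
                new_send[b] = set(nbrs[b]) - {a}
                parent[b] = a
    return best, parent
```

Because best-dist values only decrease and range over a finite set of path weights, the simulation always quiesces.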

Analysis: If the algorithm were executed in synchronous rounds, then the time would be O(n), where n is the total number of nodes, and the number of messages would be O(n · |E|). This does not seem too bad. But in asynchronous execution, the time bound can be much worse; in fact, we show below that the time and message complexity are exponential in n.

Claim: The algorithm can send Ω(2^n) messages in the worst case.

Proof: Consider the example illustrated in Figure ...

Figure: Bad scenario for the Bellman-Ford algorithm: nodes i_0, i_1, ..., i_{k+2}, where each consecutive pair of nodes up to i_{k+1} is connected both by an upper edge (of weight 2^k, 2^{k-1}, ..., 2, 1, respectively) and by a lower edge of weight 0.

The possible estimates that i_{k+1} can have for its best-dist are 2^{k+1} − 1, ..., 2, 1, 0. Moreover, we claim that it is possible for i_{k+1} to acquire all of these estimates, in order, from the largest to the smallest. To see how, consider the following scenario. Suppose that the messages on the upper paths all propagate first, giving i_{k+1} the estimate 2^{k+1} − 1. Next, the alternative (weight-0) message from i_k arrives, giving i_{k+1} the adjusted estimate of 2^{k+1} − 2. Next, the alternative message from i_{k−1} to i_k arrives at i_k, causing i_k to reduce its estimate; it then propagates this revision on both paths. Again, suppose the message on the upper path arrives first, etc.

Since i_{k+1} gets 2^{k+1} distinct estimates, it is possible (if i_{k+1} takes steps fast enough) for i_{k+1} to actually put all these values into the outgoing channel to i_{k+2}. This yields exponentially many messages. Moreover, the time complexity is similarly bad, because these messages all get delivered sequentially to i_{k+2}.
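The counting behind this bad scenario can be checked directly: since each hop offers either a weight-2^j upper edge or a weight-0 lower edge, the reachable path weights are exactly the subset sums of {1, 2, 4, ..., 2^k}. A small sketch (the helper name is ours):

```python
from itertools import combinations

def distinct_estimates(k):
    # Enumerate all subset sums of the upper-edge weights {1, 2, ..., 2^k};
    # each subset corresponds to one choice of upper-vs-lower edge per hop.
    weights = [2 ** j for j in range(k + 1)]
    sums = {sum(c) for r in range(len(weights) + 1)
            for c in combinations(weights, r)}
    return sorted(sums)
```

Every integer from 0 to 2^{k+1} − 1 arises, so the last node can indeed be driven through exponentially many successively smaller estimates.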

We now consider upper bounds for the Bellman-Ford algorithm. First, we claim that the number of messages is at most O(n^n · |E|). This is because each node only sends new messages out when it gets a new estimate. Now, each estimate is the total weight of a simple path to that node from i_0, and the bound follows from the fact that there are at most n^n such paths. (This may be a loose estimate.) Now consider any particular edge: it only gets messages when one of its endpoints gets a new estimate, leading to a total of O(n^n) messages on that edge.

6.852J/18.437J Distributed Algorithms, November
Lecturer: Nancy Lynch

Lecture

Asynchronous Broadcast-Convergecast

In an earlier lecture, we described a synchronous discipline for broadcasting and convergecasting (sometimes called broadcast-with-echo), using a breadth-first spanning tree. In the asynchronous setting, the breadth-first spanning tree is not as usable, since we don't know when the construction is complete. We can still do broadcast-convergecast reasonably efficiently on an arbitrary (not necessarily breadth-first) spanning tree.

First, we consider how to construct an arbitrary spanning tree. The distinguished source node i_0 begins by sending out report messages to all its neighbors. Any other node, upon receipt of its first report message, marks itself as done, identifies the sender j as its parent, sends a parent(j)_i announcement to the outside world, and sends out report messages on its outgoing links. The node ignores additional report messages. This idea works fine asynchronously: nodes know when they are done, O(E) messages are used, and it takes O(diam) time until termination. There are no pile-ups on links, since only two messages ever get sent on each edge, one in each direction.
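A minimal simulation of this flooding construction, with an arbitrary delivery order standing in for asynchrony (the function name and the randomized scheduler are illustrative assumptions, not part of the notes):

```python
import random
from collections import deque

def flood_spanning_tree(n, edges, root=0, seed=1):
    nbrs = {u: set() for u in range(n)}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    chans = {(u, v): deque() for u in nbrs for v in nbrs[u]}
    parent = {root: root}                    # root marks itself as its own parent
    for v in nbrs[root]:
        chans[(root, v)].append('report')    # the source floods its neighbors
    rng = random.Random(seed)
    while True:
        nonempty = [e for e, q in chans.items() if q]
        if not nonempty:
            break
        u, v = rng.choice(nonempty)          # arbitrary delivery order = asynchrony
        chans[(u, v)].popleft()
        if v not in parent:                  # first report received: join the tree
            parent[v] = u
            for x in nbrs[v]:
                chans[(v, x)].append('report')
        # later reports are simply ignored
    return parent
```

Each node sends one report per incident edge (once, when it joins), matching the two-messages-per-edge bound above.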

Now we proceed to the task of broadcasting a message to all nodes. We can construct an arbitrary spanning tree as above, and then i_0 uses it to send messages over the tree links. To do this, nodes need to be informed as to which other nodes are their children, and this requires local communication from children to parents upon discovery. Note the interesting timing anomaly here: tree paths could take Ω(n) time to traverse the second time. To avoid that, we can send the message in the course of building the tree, by piggybacking the message on all report messages.

We can also use an asynchronously-constructed spanning tree to fan in the results of a computation back to i_0: as in the synchronous case, each node waits to hear from all its children in the spanning tree, and then forwards a message to its parent.

This brings us to the task of combined broadcast and convergecast. Now suppose we want to broadcast a message and collect acknowledgments back at i_0 that everyone has received it. To solve this problem, we do a combination of the asynchronous spanning tree construction above, followed (asynchronously) by fanning in results from the leaves.

In fact, if the tree is being constructed solely for the purpose of sending and getting acks for a particular message, it is possible to delete the tree information after fanning in

State:

- msg: the value being broadcast; at i_0, initially the message it is sending; elsewhere, initially nil
- send-buffer: for outgoing messages; at i_0, initially contains bcast(m) to all neighbors of i_0; elsewhere, initially empty
- parent: pointer for the parent edge, initially nil
- acked: a set to keep track of children that have acked; everywhere, initially ∅
- done: Boolean flag, initially false

Actions:

send: as usual, just sends anything in send-buffer

receive(bcast(m))_{j,i}
  Effect: if msg = nil then
            msg := m
            parent := j
            for all k ∈ neighbors − {j}, put message bcast(m) in send-buffer
            acked := ∅
          else put message ack to j in send-buffer

receive(ack)_{j,i}
  Effect: acked := acked ∪ {j}

finish_i, for i ≠ i_0
  Precondition: acked = neighbors − {parent}
                done = false
  Effect: put ack message to parent in send-buffer
          done := true

finish_i, for i = i_0
  Precondition: acked = neighbors
                done = false
  Effect: output done message to outside world
          done := true

Figure: Code for the broadcast-convergecast task, without deleting the tree.

the results back to the parent. However, we have to be careful that the node doesn't forget everything, since if it later gets a report message from another part of the tree (whose processing is delayed), it should know enough to ignore it. In other words, we need to keep some kind of flag at each node saying that it's done. The code is given in Figure ...

Analysis: Communication is simple: it takes O(E) messages to set up the tree, and after the tree is set up, it's only O(n) messages to do the broadcast-convergecast work. The time analysis is trickier. One might think that the algorithm terminates in O(diam) time, since that's how long it takes to do the broadcast. The convergecast, however, can take longer. Specifically, we have the following timing anomaly: a message can travel fast along a long network path, causing a node to become attached to the spanning tree on a long branch, even though there is a shorter path to it. When the response travels back along the long path, it may now take time proportional to the length of the path, which is Ω(n) for general graphs. To fix this problem, we shall need new ideas, such as a synchronizer, which we shall discuss next lecture.

Example: application to leader election. The broadcast-convergecast algorithm provides an easy solution to the leader election problem. Recall that the simple leader election algorithm we mentioned above didn't terminate. We could now do an alternative leader election algorithm as follows: starting from every node, run broadcast-convergecast in parallel, collecting the maximum UID in the convergecast on each tree. The node that finds that the maximum is equal to its own UID gets elected.
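The convergecast-of-maxima idea can be sketched as follows. Here the spanning trees are given directly (as child maps, one tree rooted at each node) rather than built by the asynchronous construction, and all names are illustrative assumptions of ours:

```python
def convergecast_max(children, uids, u):
    # Fan in the maximum UID from the leaves toward the root u: each node
    # reports the max of its own UID and its children's reports.
    return max([uids[u]] + [convergecast_max(children, uids, c)
                            for c in children.get(u, [])])

def elect(uids, trees):
    # trees[r]: child map of a spanning tree rooted at r.  Conceptually,
    # every node runs broadcast-convergecast in parallel; the node whose
    # collected maximum equals its own UID elects itself.
    return [r for r in trees if convergecast_max(trees[r], uids, r) == uids[r]]
```

Since each tree spans the whole graph, every root learns the global maximum UID, so exactly one node (the owner of that maximum) is elected.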

Minimum Spanning Tree

Problem Statement

Now we return to the problem of constructing a minimum-weight spanning tree (MST), this time in an asynchronous network. Let us recall the problem definition. We are given an undirected graph G = (V, E) with weighted edges, such that each vertex is associated with its own process, and processes are able to communicate with each other via messages sent over edges. We wish to have the processes (vertices) cooperate to construct a minimum-weight spanning tree for the graph G. That is, we want to construct a tree covering the vertices in G, whose total edge weight is less than or equal to that of every other spanning tree for G.

We assume that processes have unique identifiers, and that each edge of the graph is associated with a unique weight, known to the vertices on each side of that edge. The assumption of unique weights on edges is not a strong one, given that processes have unique identifiers: if edges had non-distinct weights, we could derive virtual weights for all edges by appending the identifier numbers of the endpoints onto the edge weights, and use them as tiebreakers between the original weights. We will also assume that a process does not know the overall topology of the graph (only the weights of its incident edges), and that it can learn non-local information only by sending messages to other processes over those edges. The output of the algorithm will be a marked set of tree edges; every process will mark those edges adjacent to it that are in the final MST.
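The virtual-weight tie-breaking construction mentioned above can be sketched directly; lexicographic comparison of the resulting tuples stands in for comparing concatenated weight-and-identifier values (the helper name is ours):

```python
def virtual_weight(w, u, v):
    # Append the endpoint identifiers (smaller first) to the raw weight w;
    # lexicographic tuple comparison then breaks ties between equal weights,
    # and edges still compare primarily by their original weight.
    return (w, min(u, v), max(u, v))
```

Two distinct edges can never produce the same tuple, since they differ in at least one endpoint.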

There is one significant piece of input to this algorithm, namely, that any number of nodes may be awakened from the outside to begin computing the spanning tree. A process can be awakened either by the outside world, asking that the process begin the spanning tree computation, or by another (already awakened) process during the course of the algorithm.

The motivation for the MST problem comes mainly from the area of communications. The weights of edges might be regarded as message-sending costs over the links between processes. In this case, if we want to broadcast a message to every process, we would use the MST to get the message to every process in the graph at minimum cost.

Connections Between MST and Other Problems

The MST problem has strong connections to two other problems: that of finding an arbitrary (undirected, unrooted) spanning tree for a graph, and that of electing a leader in a graph.

If we are given an undirected, unrooted spanning tree, it is pretty easy to elect a leader: this proceeds via a convergecast of messages from the leaves of the tree, until the incoming messages converge at some particular node, which can then be designated as the leader. (It is possible that the messages converge at some edge rather than some node, in which case one of the endpoints of this edge, say the one with the larger ID, can be chosen.)

Conversely, if we are given a leader, it is easy to find an arbitrary spanning tree, as discussed above under broadcast-convergecast: the leader just broadcasts messages along each of its neighboring edges, and nodes designate as their parent in the tree that node from which they first receive an incoming message, after which the nodes then broadcast their own messages along their remaining neighboring edges. So, modulo the costs of these basic algorithms, the problems of leader election and finding an arbitrary spanning tree are equivalent.

A minimum spanning tree is, of course, a spanning tree. The converse problem, going from an arbitrary spanning tree or a leader to a minimum spanning tree, is much harder. One possible idea would be to have every node send information regarding its surrounding edges to the leader, which then computes the MST centrally and distributes the information back to every other node in the graph. This centralized strategy requires a considerable amount of local computation and a fairly large amount of communication.

The Synchronous Algorithm: Review

Let us recall how the synchronous algorithm worked. It was based on two basic properties of MSTs.

Property 1: Let G be an undirected graph with vertices V and weighted edges E. Let {(V_i, E_i) : 1 ≤ i ≤ k} be any spanning forest for G, with k > 1. Fix any i, 1 ≤ i ≤ k. Let e be an edge of lowest cost in the set {e' : e' ∈ E − ∪_j E_j and exactly one endpoint of e' is in V_i}. Then there is a spanning tree for G that includes {e} ∪ ∪_j E_j, and this tree is of as low a cost as any spanning tree for G that includes ∪_j E_j.

Property 2: If all edges of a connected graph have distinct weights, then the MST is unique.

These properties justify a basic strategy in which the MST is built up in fragments, as follows. At any point in the algorithm, each of a collection of fragments may independently and concurrently find its own minimum-weight outgoing edge (MWOE), knowing that all such edges found must be included in the unique MST. Note that if the edge weights were not distinct, the fragments couldn't carry out this choice independently, since it would be possible for them to form a cycle.

This was the basis of the synchronous algorithm. It worked in synchronous levels, where in each level, all fragments found their MWOEs and combined using these to form at most half as many fragments.

If we try to run the given synchronous algorithm in an asynchronous network, we see some problems.

Difficulty 1: In the synchronous case, when a node queries a neighbor to see if it is in the same fragment, it knows that the neighbor node is up-to-date, at the same level. Therefore, if the neighbor has a distinct fragment id, then this fact implies that the neighbor is not in the same fragment. But in the asynchronous setting, it could be the case that the neighbor has not yet heard that it is in the same fragment.

Difficulty 2: The synchronous algorithm achieves a message cost of O(n log n + E), based on a balanced construction of the fragments. Asynchronously, there is danger of constructing the fragments in an unbalanced way, leading to many more messages. The number of messages sent by a fragment to find its MWOE will be proportional to the number of nodes in the fragment. Under certain circumstances, one might imagine the algorithm proceeding by having one large fragment that picks up a single node at a time, each time requiring f messages, where f is the number of nodes in the fragment (see Figure ...). In such a situation, the algorithm would require Ω(n²) messages to be sent overall.

Figure: How do we avoid a big fragment growing by one node at a time?

The Gallager-Humblet-Spira Algorithm: Outline

These difficulties lead us to rethink the algorithm, to modify it for use in an asynchronous network. Luckily, the basic ideas are the same. The algorithm we shall see below is by Gallager, Humblet, and Spira. They focus on keeping the number of messages as small as possible, and manage to achieve the same O(n log n + E) message bound as in the synchronous algorithm.

Before we continue, let us make a few preliminary remarks. First, consider the question of whether this is the minimum bound possible. Note that, at least for some graphs (e.g., rings), the n log n term is necessary: it comes from the lower bound on the number of messages for leader election that we saw in Burns' theorem. What about the E term? This seems essential: in work by Awerbuch, Goldreich, Peleg, and Vainish, it is shown that the number of bits of communication must be at least Ω(E log n). If we assume that each message contains only a constant number of ids, then this means that the number of messages is Ω(E). If the messages are allowed to be of arbitrary size, however, then it can be shown that O(n) messages suffice.

Another thing we would like to say about the Gallager-Humblet-Spira algorithm is that not only is it interesting, but it is also extremely clever: as presented in their paper, it is about two pages of tight, modular code, and there is a good reason for just about every line in the algorithm. In fact, only one or two tiny optimizations have been advanced over the original algorithm. The algorithm has been proven correct via some rather difficult formal proofs (see Welch-Lamport-Lynch), and it has been referenced and elaborated upon quite often in subsequent research. It has become a sort of test case for algorithm proof techniques.

As in the synchronous algorithm we saw earlier, the central idea of the Gallager-Humblet-Spira algorithm is that nodes form themselves into collections, i.e., fragments, of increasing size. Initially, all nodes are considered to be in singleton fragments. Each fragment is itself connected by edges that form an MST for the nodes in the fragment. Within any fragment, nodes cooperate in a distributed algorithm to find the MWOE for the entire fragment, that is, the minimum-weight edge that leads to a node outside the fragment. The strategy for accomplishing this involves broadcasting over the edges of the fragment, asking each node separately for its own MWOE leading outside the fragment. Once all these edges have been found, the minimal edge among them will be selected as an edge to include in the eventual MST.

Once an MWOE for a fragment is found, a message may be sent out over that edge to the fragment on the other side. The two fragments may then combine into a new, larger fragment. The new fragment then finds its own MWOE, and the entire process is repeated until all the nodes in the graph have combined themselves into one giant fragment, whose edges are the final MST.

This is not the whole story, of course; there are still some problems to overcome. First, how does a node know which of its edges lead outside its current fragment? A node in fragment F can communicate over an outgoing edge, but the node at the other end needs some way of telling whether it, too, is in F. We will therefore need some way of naming fragments, so that two nodes can determine whether they are in the same fragment. But the issue is still more complicated: it may be, for example, that the other node at the end of the apparently outgoing edge is in F, but hasn't learned this fact yet, because of communication delays. Thus, some sort of overall synchronization process is needed: some sort of strategy that ensures that nodes won't search for outgoing edges until all nodes in the fragment have been informed of their current fragment.

And there is also the second problem discussed above, of the unbalanced merging behavior causing excessive message cost. This second problem should suggest a balanced-tree algorithm solution; that is, the difficulty derives from the merging of data structures that are very unequal in size. The strategy that we will use, therefore, is to merge fragments of roughly equal size. Intuitively, if we can keep merging fragments of nearly equal size, we can keep the total number of messages to O(n log n).

The trick we will use to keep the fragments at similar sizes is to associate a level number with each fragment. We will ensure that if level(F) = l for a given fragment F, then the number of nodes in F is greater than or equal to 2^l. Initially, all fragments are just singleton nodes at level 0. When two fragments at level l are merged together, the result is a new fragment at level l + 1. This preserves the condition for level numbers: if two fragments of size at least 2^l are merged, the result is a new fragment of size at least 2^{l+1}.

So far, this may look similar to the way the level numbers were used in the synchronous algorithm, but it will actually be somewhat different. E.g., in the synchronous case, we could merge some arbitrary number of level-l fragments to get a new level-(l+1) fragment.

The level numbers, as it turns out, will not only be useful in keeping things balanced, but they will also provide some identifier-like information, helping to tell nodes whether they are in the same fragment.

In More Detail

There are two ways of combining fragments.

Merging: This combining rule is applied when we have two fragments F and F' with the same level and the same minimum-weight outgoing edge:

  level(F) = level(F') = l
  MWOE(F) = MWOE(F')

The result of a merge is a new fragment at a level of l + 1.

Absorbing: There is another case to consider. It might be that some nodes are forming into huge fragments via merging, but isolated nodes (or small fragments) are lagging behind at a low level. In this case, the small fragments may be absorbed into the larger ones, without determining the MWOE of the large fragment. Specifically, the rule for absorbing is that if two fragments F and F' satisfy level(F) < level(F'), and MWOE(F) leads to F', then F can be absorbed into F' by combining them along MWOE(F). The larger fragment formed is still at the level of F'. In a sense, we don't want to think of this as a new fragment, but rather as an augmented version of F'.

These two combining strategies are illustrated in a rough way by Figure ... It is worth stressing the fact that level(F) < level(F') does not imply that fragment F is smaller than F'; in fact, it could be larger. Thus, the illustration of F as a small fragment in Figure ... is meant only to suggest the typical case.
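The merge and absorb rules can be summarized as a small decision procedure. This sketch assumes the caller has already checked that F1's MWOE leads into F2, and all names (combine, the 'merge'/'absorb' tags) are illustrative choices of ours:

```python
def combine(level1, core1, mwoe1, level2, core2, mwoe2):
    # Decide how fragment F1 (whose MWOE leads into F2) combines with F2.
    if level1 == level2 and mwoe1 == mwoe2:
        # Merge: the shared MWOE becomes the core of a level-(l+1) fragment.
        return ('merge', level1 + 1, mwoe1)
    if level1 < level2:
        # Absorb: F1 joins F2, keeping F2's level and core edge.
        return ('absorb', level2, core2)
    return (None, None, None)                # F1 must wait for F2 to catch up
```

The returned core edge anticipates the identifier discussion below: a merge installs the common MWOE as the new core, while an absorb keeps the higher-level fragment's core.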

Level numbers also serve, as mentioned above, as identifying information for fragments. For fragments of level 1 or greater, however, the specific fragment identifier is the core edge of the fragment. The core edge is the edge along which the merge operation resulting in the current fragment level took place. (Since level numbers for fragments are only incremented by merge operations, any fragment of level 1 or greater must have had its level number incremented by some previous merge along an edge.) The core edge also serves as the site where the processing for the fragment originates, and where information from the nodes of the fragment is collected.

To summarize the way in which core edges are identified for fragments:

- For a merge operation, core is the common MWOE of the two combining fragments.

- For an absorb operation, core is the core edge of the fragment with the larger level number.



Figure: Two fragments combine by merging; a fragment absorbs itself into another.

Note that identifying a fragment by its core edge depends on the assumption that all edges have unique identifiers. Since we are assuming that the edges have unique weights, the weights could be the identifiers.

We now want to show that this strategy of having fragments merge together and absorb themselves into larger fragments will in fact suffice to combine all fragments into an MST for the entire graph.

Claim: If we start from an initial situation in which each fragment consists of a single node, and we apply any possible sequence of merge and absorb steps, then there is always some applicable step to take, until the result is a single fragment containing all the nodes.

Proof: We want to show that no matter what configuration we arrive at in the course of the algorithm, there is always some merge or absorb step that can be taken.

One way to see that this is true is to look at all the current fragments at some stage in the running algorithm. Each of these fragments will identify its MWOE, leading to some other fragment. If we view the fragments as vertices in a "fragment graph" and draw the MWOE for each fragment, we get a directed graph with an equal number of vertices and edges (see Figure ). Since we have k nodes and k edges for some k, such a directed graph must have a cycle, and because the edges have distinct weights, only cycles of size 2 (i.e., cycles involving two fragments) may exist. Such a cycle represents two fragments that share a single MWOE.

Now, it must be the case that the two fragments in any cycle can be combined: if the two fragments in the cycle have the same level number, a merge operation can take place; otherwise, the fragment with the smaller level number can absorb itself into the fragment with the larger one.

Figure: A collection of fragments and their minimum-weight outgoing edges.
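The progress argument above can be exercised with a toy, high-level simulation. This is illustrative only, not the real protocol: it keeps a map from fragments (frozensets of nodes) to levels, repeatedly looks for any applicable merge or absorb step, and raises an error if none exists, which the claim says cannot happen for a connected graph with distinct edge weights.

```python
# Toy simulation of the progress claim. Edges are (weight, u, v) tuples with
# distinct weights; the graph is assumed connected.

def mwoe(frag, edges):
    """Minimum-weight edge with exactly one endpoint inside frag."""
    out = [e for e in edges if (e[1] in frag) != (e[2] in frag)]
    return min(out) if out else None

def combine_all(n, edges):
    frags = {frozenset([v]): 0 for v in range(n)}   # fragment -> level
    steps = 0
    while len(frags) > 1:
        owner = {u: f for f in frags for u in f}
        for f in list(frags):
            w, u, v = mwoe(f, edges)
            g = owner[v] if v not in f else owner[u]  # fragment at far end
            if frags[f] < frags[g]:                   # absorb f into g
                new_frag, new_level = f | g, frags[g]
                break
            if frags[f] == frags[g] and mwoe(g, edges)[0] == w:
                new_frag, new_level = f | g, frags[f] + 1   # merge: 2-cycle
                break
        else:
            raise AssertionError("no applicable step: contradicts the claim")
        del frags[f], frags[g]
        frags[new_frag] = new_level
        steps += 1
    return steps
```

Each step removes one fragment, so a connected n-node graph always takes exactly n-1 combining steps to reach a single fragment.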

Let's return to the question of how the MWOE is found for a given fragment. The basic strategy is as follows. Each node in the fragment finds its own MWOE leading outside the fragment; then the information from each node is collected at a selected process, which takes the minimum of all the edges suggested by the individual nodes.

This seems straightforward, but it reopens the question of how a node knows that a given edge is outgoing, that is, that the node at the other end of the edge lies outside the current fragment. Suppose we have a node p that looks across an edge e to a node q at the other end. How can p know if q is in a different fragment or not?

A fragment name (or identifier) may be thought of as a pair (core, level). If q's fragment name is the same as p's, then p certainly knows that q is in the same fragment as itself. However, if q's fragment name is different from that of p, then it is still possible that q and p are indeed in the same fragment, but that q has not yet been informed of that fact. That is to say, q's information regarding its own current fragment may be out of date.

However, there is an important fact to note: if q's fragment name has a core unequal to that of p, and it has a level value at least as high as p's, then q can't be in the fragment that p is in currently, and never will be. This is so because, in the course of the algorithm, a node will only be in one fragment at any particular level. Thus, we have a general rule that q can use in telling p whether both are in the same fragment: if the value of (core, level) for q is the same as that of p, then they are in the same fragment; and if the value of core is different for q, and the value of level is at least as large as that of p, then they are in different fragments.

The upshot of this is that MWOE(p) can be determined only if level(q) >= level(p). If q has a lower level than p, it simply delays answering p until its own level is at least as great as p's.
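The rule q uses to answer p can be summarized in a few lines. This is a sketch with hypothetical helper names (not the actual GHS code), assuming each node knows its current fragment name as a (core, level) pair, with cores identified by unique edge weights.

```python
# Sketch of q's decision rule when p tests the edge between them.
# p_core/p_level come from p's TEST message; q_core/q_level are q's own.

def respond(p_core, p_level, q_core, q_level):
    """Return q's answer to p's test of the connecting edge."""
    if q_core == p_core:
        return "same-fragment"       # the edge is not outgoing for p
    if q_level >= p_level:
        return "different-fragment"  # definitely outgoing, now and forever
    return "delay"                   # q's name may be stale; answer later
```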

Figure: Fragment F absorbs itself into F', while F' is still searching for its own MWOE.

However, notice that we now have to reconsider the progress argument, since this extra precondition may cause progress to be blocked. Since a fragment can be delayed in finding its MWOE (because some individual nodes within the fragment are being delayed), we might ask whether it is possible for the algorithm to reach a state in which neither a merge nor an absorb operation is possible. To see that this is not the case, we use essentially the same argument as before, but this time we only consider those MWOEs found by fragments with the lowest level in the graph; call this level l. These succeed in all their individual MWOE(p) calculations, so they succeed in determining their MWOE for the entire fragment. Then the argument is as before: if any fragment at level l finds a MWOE to a higher-level fragment, then an absorb operation is possible; and if every fragment at the level l has a MWOE to some other fragment at level l, then we must have a cycle between two fragments at level l, and a merge operation is possible. So again, even with the new preconditions, we conclude that the algorithm must make progress until the complete MST is found.

Getting back to the algorithm, each fragment F will find its overall MWOE by taking a minimum of the MWOE for each node in the fragment. This will be done by a broadcast-convergecast algorithm, starting from the core, emanating outward, and then collecting all information back at the core.

This leads to yet another question: what happens if a small fragment F gets absorbed into a larger one F', while F' is still in the course of looking for its own MWOE?

There are two cases to consider (consult Figure for the labeling of nodes). Suppose first that the minimum edge leading outside the fragment F' has not yet been determined. In this case, we search for a MWOE for the newly-augmented fragment F' in F as well; there is no reason it cannot be there.

On the other hand, suppose MWOE(F') has already been found at the time that F absorbs itself into F'. In that event, the MWOE for q cannot possibly be e, since the only way that the MWOE for q could be known is for that edge to lead to a fragment with a level at least as great as that of F', and we know that the level of F is smaller than that of F'. Moreover, the fact that the MWOE for q is not e implies that the MWOE for the entire fragment F' cannot possibly be in F. This is true because e is the MWOE for fragment F, and thus there can be no edges leading out of F with a smaller cost than the already-discovered MWOE for node q. Thus, we conclude that if MWOE(F') is already known at the time the absorb operation takes place, then fragment F' needn't look for its overall MWOE among the newly-absorbed nodes. This is fortunate, since if F' did in fact have to look for its MWOE among the new nodes, it could be too late by the time the absorb operation takes place: q might have already reported its own MWOE, and fragment F' might already be deciding on an overall MWOE, without knowing about the newly-absorbed nodes.

A Summary of the Code in the GHS Algorithm

We have described intuitively the major ideas of the Gallager-Humblet-Spira algorithm; the explanation above should be sufficient to guide the reader through the code presented in their original paper.

Although the actual code in the paper is dense and complicated, the possibility of an understandable high-level description turns out to be fairly typical for communications algorithms. In fact, the high-level description that we have seen can serve as a basis for a correctness proof for the algorithm, using simulations.

The following message types are employed in the actual code:

- INITIATE messages are broadcast outward on the edges of a fragment, to tell nodes to start finding their MWOE.

- REPORT messages are the messages that carry the MWOE information back in. These are the convergecast response to the INITIATE broadcast messages.

- TEST messages are sent out by nodes when they search for their own MWOE.

- ACCEPT and REJECT messages are sent in response to TEST messages from nodes; they inform the testing node whether the responding node is in a different fragment (ACCEPT) or is in the same fragment (REJECT).

- CHANGEROOT is a message sent toward a fragment's MWOE once that edge is found. The purpose of this message is to change the root of the merging (or currently-being-absorbed) fragment to the appropriate new root.

- CONNECT messages are sent across an edge when a fragment combines with another. In the case of a merge operation, CONNECT messages are sent both ways along the edge between the merging fragments; in the case of an absorb operation, a CONNECT message is sent by the smaller fragment, along its MWOE, toward the larger fragment.

In a bit more detail, INITIATE messages emanate outward from the designated core edge to all the nodes of the fragment; these INITIATE messages not only signal the nodes to look for their own MWOE (if that edge has not yet been found), but they also carry information about the fragment identity: the core edge and level number of the fragment. As for the TEST-ACCEPT-REJECT protocol, there's a little bookkeeping that nodes have to do. Every node, in order to avoid sending out redundant messages (testing and retesting edges), keeps a list of its incident edges in the order of weights. The nodes classify these incident edges in one of three categories:

- Branch edges are those edges designated as part of the building spanning tree.

- Basic edges are those edges that the node doesn't know anything about yet; they may yet end up in the spanning tree. Initially, of course, all the node's edges are classified as basic.

- Rejected edges are edges that cannot be in the spanning tree (i.e., they lead to another node within the same fragment).

A fragment node searching for its MWOE need only send messages along basic edges. The node tries each basic edge in order, lowest weight to highest. The protocol that the node follows is to send a TEST message with the fragment level number and core edge (represented by the unique weight of the core edge). The recipient of the TEST message then checks if its own identity is the same as the TESTer's; if so, it sends back a REJECT message. If the recipient's identity (core edge) is different, and its level is greater than or equal to that of the TESTer, it sends back an ACCEPT message. Finally, if the recipient has a different identity from the TESTer, but has a lower level number, it delays responding until such time as it can send back a definite REJECT or ACCEPT.

So each node finds the MWOE, if it exists. All of this information is sent back to the nodes incident on the core edge via REPORT messages, and these nodes determine the MWOE for the entire fragment. A CHANGEROOT message is then sent back towards the MWOE, and the endpoint node sends a CONNECT message out over the MWOE.

When two CONNECT messages cross, this is the signal that a merge operation is taking place. In this event, a new INITIATE broadcast emanates from the new core edge, and the newly-formed fragment begins once more to look for its overall MWOE. If an absorbing CONNECT occurs (from a lower-level to a higher-level fragment), then the node in the high-level fragment knows whether it has found its own MWOE, and thus whether to send back an INITIATE message to be broadcast in the lower-level fragment.

Complexity Analysis

Message complexity: The analysis is similar to the synchronous case, giving the same bound of O(n log n + E). More specifically, in order to analyze the message complexity of the GHS algorithm, we apportion the messages into two different sets, resulting (separately, as we will see) in the O(n log n) term and the O(E) term.

The O(E) term arises from the fact that each edge in the graph must be tested at least once. In particular, we know that TEST messages and associated REJECT messages can occur at most once for each edge; this follows from the observation that once a REJECT message has been sent over an edge, that edge will never be tested again.

All other messages sent in the course of the algorithm (the TEST-ACCEPT pairs that go with the acceptances of edges, the INITIATE-REPORT broadcast-convergecast messages, and the CHANGEROOT-CONNECT messages that occur when fragments combine) can be considered as part of the overall process of finding the MWOE for a given fragment. In performing this task for a fragment, there will be at most one of these messages associated with each node: each node receives at most one INITIATE and one ACCEPT; each sends at most one successful TEST, one REPORT, and one of either CHANGEROOT or CONNECT. Thus, the number of messages sent within a fragment in finding the MWOE is O(m), where m is the number of nodes in the fragment.

The total number of messages sent in the MWOE-finding process, therefore, is proportional to

    sum over all fragments F of (number of nodes in F),

which is

    sum over all level numbers L of (sum over all fragments F of level L of (number of nodes in F)).

Now, the total number of nodes in the inner sum at each level is at most n, since each node appears in at most one fragment at a given level number L. And since the biggest possible value of L is log n, the sum above is bounded by

    sum from L = 1 to log n of n = O(n log n).

Thus, the overall message complexity of the algorithm is O(E + n log n).

Time complexity: It can be shown, by induction on l, that the time for all the nodes to reach level at least l is at most O(l n). Hence, the time complexity of the GHS algorithm is O(n log n).

Proving Correctness for the GHS Algorithm

A good deal of interesting work remains to be done in the field of proving correctness for communication algorithms like the Gallager-Humblet-Spira algorithm. The level of complexity of this code makes direct invariant-assertion proofs quite difficult to do, at least at the detailed level of the code. Note that most of the discussion in this lecture has been at the higher level of graphs, fragments, levels, MWOEs, etc. So it seems that a proof ought to take advantage of this high-level structure.

One promising approach is to apply invariant-assertion and other techniques to prove correctness for a high-level description of the algorithm, and then prove independently that the code in fact correctly simulates the high-level description (see [WelchLL]).

A proof can be formalized by describing the high-level algorithm as an I/O automaton (in which the state consists of fragments, and actions include merge and absorb operations), describing the low-level code as another I/O automaton, and then proving that there is a simulation mapping between the two automata.

6.852J/18.437J Distributed Algorithms (November)
Lecturer: Nancy Lynch

Lecture

Minimum Spanning Tree (cont.)

Simpler Synchronous Strategy

Note that in the asynchronous algorithm, we have complications we did not encounter in the synchronous algorithm (e.g., absorbing vs. merging, waiting for fragments to catch up). Another idea to cope with these difficulties is to try to simulate the synchronous algorithm more closely, somehow synchronizing the levels.

More specifically, we can design an algorithm based on combining fragments, where each fragment has an associated level number, as follows. Each individual node starts as a single-node fragment at level 0. This time, fragments of level l can only be combined into fragments of level l+1, as in the synchronous algorithm. Each node keeps a local-level variable, which is the latest level it knows of the fragment it belongs to; so initially, every node's local-level is 0, and when it finds out about its membership in a new fragment of level l, it raises its local-level to l. The basic idea is that a node at level l tries not to participate in the algorithm for finding its level-l fragment's MWOE until it knows that all the nodes in the system have reached level at least l. Implementing this test exactly requires global synchronization (this is barrier synchronization). But in fact, some kind of weaker local synchronization suffices: namely, each node only tests that all of its neighbors have local-level at least l. This requires each node to send a message on all of its incident edges every time its level is incremented.

This algorithm has the same time performance as before, i.e., O(n log n). The message complexity gets worse, however: it is now O(E log n).
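The local synchronization test above can be sketched per node. This is an illustrative helper (the names are not from the algorithm): a node tracks the last level it has heard from each neighbor, announces every increment of its own local-level, and participates in the MWOE search only when all neighbors are known to have caught up.

```python
# Sketch of the weaker "local barrier": a node at local-level l waits until
# every neighbor is known to have reached local-level >= l.

class LevelTracker:
    def __init__(self, neighbors):
        self.local_level = 0
        self.known = {v: 0 for v in neighbors}  # last level heard per neighbor

    def raise_level(self, level):
        """Local-level goes up; return the announcements to send out."""
        self.local_level = max(self.local_level, level)
        return [(v, self.local_level) for v in self.known]

    def hear(self, neighbor, level):
        self.known[neighbor] = max(self.known[neighbor], level)

    def may_search(self):
        """OK to join the MWOE search for the current-level fragment?"""
        return all(l >= self.local_level for l in self.known.values())
```

Since every level increment triggers one message per incident edge, and levels go up to log n, this is where the O(E log n) message bound comes from.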

Synchronizers

The idea discussed above suggests a general strategy for running a synchronous algorithm in an asynchronous network. This will only apply to a fault-free asynchronous system (i.e., no failed processes, or messages being lost, duplicated, etc.), since, as we shall soon see, the fault-tolerance capability is very different in synchronous and asynchronous systems. We are working from [Awerbuch], with some formalization and proofs done recently by Devarajan, a student.

The strategy works for general synchronous algorithms without faults; the idea is to synchronize at each round, rather than every so often as in the MST algorithm sketched above. (Doing this only every so often depends on special properties of the algorithm, e.g., that it works correctly if it is allowed to exhibit arbitrary interleavings of node behavior between synchronization points.) We shall only describe how to synchronize at every round.

Synchronous Model

Recall that in the synchronous model, each node had message generation and transition functions. It is convenient to reformulate each node as an IOA, with output actions synch-send and input actions synch-receive. The node must alternate these actions, starting with synch-send. (At rounds at which the synchronous algorithm doesn't send anything, we assume that a synch-send action occurs anyway.) We call such a node automaton a client automaton. The client automaton may also have other external actions, interacting with the outside world.

The rest of the system can be described as a synchronous message system (called synch-sys, for short, in the sequel), which at each round collects all the messages that are sent at that round in synch-send actions, and then delivers them to all the client automata in synch-receive actions. In particular, this system synchronizes globally after all the synch-send events and before all the synch-receive events of each round.

Note that this composition of IOAs is equivalent to the more tightly-coupled synchronous model described earlier, where the message system was not modeled explicitly.

We will describe how to simulate the synch-sys using an asynchronous network, so that the client automata running at the network nodes cannot tell the difference between running in the simulation system and running in a real synchronous system.

High-Level Implementation Using a Synchronizer

The basic idea, discussed by Awerbuch, is to separate out the processing of the actual messages from the synchronization. He splits up the implementation into several pieces: a front-end for each node, able to communicate with the front-ends of neighbors over special channels, and a pure synchronizer S. The job of each front-end is to process the messages received from the local client in synch-send events. At a particular round, the front-end collects all the messages that are to be sent to neighbors (note that a node might send to some neighbors and not to some others). It actually sends those messages to the neighbors along the special channels. For each such message it sends, it waits to obtain an acknowledgment along the special channel. When it has received acks for all its messages, it is safe (using Awerbuch's terminology, from the Awerbuch paper). For a node to be safe means that it knows that all its neighbors have received its messages. Meanwhile, while waiting for its own messages to be acknowledged, it is collecting and acknowledging the messages of its neighbors.

When is it OK for the node front-end to deliver all the messages it has collected from its neighbors to the client? Only when it knows that it won't receive any others. It is therefore sufficient to determine that all the node's neighbors are safe for that round, i.e., that those neighbors know that their messages for that round have all been delivered.

Thus, the job of the synchronizer is to tell front-ends when all their neighbors are safe. To do this, the synchronizer has OK input actions (outputs from the front-ends), by which the front-ends tell the synchronizer that they are safe. The synchronizer S sends a GO message to a front-end when it has received an OK message from each of its neighbors.
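The front-end's round protocol can be sketched as follows. This is a simplified illustration, not Awerbuch's code: channels are modeled by returning the messages to be sent, and the synchronizer's GO is modeled as a method call.

```python
# Sketch of one round of a front-end: send the client's round messages,
# collect acks until "safe", and deliver neighbors' messages only upon GO.

class FrontEnd:
    def __init__(self):
        self.awaiting_ack = set()
        self.collected = []

    def synch_send(self, msgs):
        """msgs: {neighbor: message}. Returns what to put on the channels."""
        self.awaiting_ack = set(msgs)
        return [(v, ("msg", m)) for v, m in msgs.items()]

    def on_message(self, sender, m):
        """Collect a neighbor's message; return the ack to send back."""
        self.collected.append((sender, m))
        return (sender, ("ack",))

    def on_ack(self, sender):
        self.awaiting_ack.discard(sender)

    def is_safe(self):
        """Safe: all of this round's messages are known to be delivered."""
        return not self.awaiting_ack

    def on_go(self):
        """Synchronizer says all neighbors are safe: deliver to client."""
        delivered, self.collected = self.collected, []
        return delivered
```

When is_safe() becomes true, the front-end would emit its OK to the synchronizer; on_go() corresponds to receiving the synchronizer's GO.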

For now, we are just writing an abstract spec for the S component; of course, this will have to be implemented in the network. We claim that this combination simulates the synchronous system (the combination of clients and synch-sys). Notice we did not say that it implements the system in the formal sense of behavior inclusion. In fact, it doesn't: this simulation is local, in the sense that the rounds can be skewed at distances in the network; rounds are only kept close for neighbors. But in this architecture, where the client programs do not have any means of communicating directly with each other outside of the synch-sys, this is enough to preserve the view of each client.

Theorem: If α is any execution of the implementation (clients, front-ends, channels, and synchronizer), then there is an execution α' of the specification (clients and synch-sys) such that for all clients i, α|client_i = α'|client_i.

That is, as far as each client can tell, it's running in a synchronous system. So the behavior is the same; the same correctness conditions hold at the outside boundary of the clients, subject to reordering at different nodes.

Proof Sketch: We cannot do an implementation proof as we did, for example, for the CTSS algorithm, since we are reordering the actions at different nodes. Note that certain events depend on other events to enable them; e.g., an OK event depends on prior ack input events at the same front-end, and all events at one client automaton may depend on each other (this is a conservative assumption). We define this "depends on" relation D formally as follows. For every schedule β of the implementation system, there is a partial order relation D describing events occurring in β that depend on each other. The key property of this relation D is that for any schedule β of the implementation system, and any permutation β' of β that preserves the partial order of events given by D, we have that β' is also a schedule of the implementation system. This says that the D relations are capturing enough about the dependencies in the schedule to ensure that any reordering that preserves these dependencies is still a valid schedule of the implementation system.

Now we start with any schedule β of the implementation, and reorder the events to make the successive rounds line up globally. The goal is to make the reordered schedule look as much as possible like a schedule of the synchronous system. We do this by explicitly putting all the synch-receive(i) events after all the synch-send(i) events, for the same i. We now claim this new requirement is consistent with the dependency requirements in D, since those never require the reverse order, even when applied transitively. We get β' in this way. By the key claim above, β' is also a schedule of the implementation system.

We're not done yet: although β' is fairly synchronous, it is still a schedule of the implementation system, and we want a schedule of the specification system. In the next step, we suppress the internal actions of the synch-sys, getting a reduced sequence β''. This still looks the same to the clients as β', and now we claim that it is a schedule of the specification system. This claim can be proved by induction.

Note that the theorem started with an execution, not a schedule. To complete the proof, we extract the schedule β from α, get β'' as above, and then fill in the states to get an execution α' of the specification system, filling in the client states as in α.

Note that if we care about the order of events at different clients, then this simulation strategy does not work.

Synchronizer Implementations

Now we are left with the job of implementing the synchronizer part of this implementation. Its job is quite simple: it gets OK inputs from each node at each round, and for each node it can output GO when it knows that all its neighbors in the graph (strictly speaking, including the node itself) have done input OK for that round. We want to implement this by a distributed algorithm, one piece per node, using a message system as usual. There are several ways to do this.

Synchronizer α: The α synchronizer works as follows. When a node gets an OK, it sends this information to all its neighbors. When a node hears that all its neighbors are OK, and it is OK itself, it outputs GO. The correctness of this algorithm is fairly obvious. The complexity is as follows. For every round, we generate O(E) messages, since all nodes send messages on all edges at every round. Also, each round takes constant time (measuring time from when all the OKs at a round have happened until all the GOs for that round have happened).

We conclude that this algorithm is fast, but maybe too communication-inefficient for some purposes. This brings us to the other extreme, below.
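The per-node behavior of α is small enough to sketch directly. This is an illustrative class (hypothetical names), covering a single round.

```python
# Per-node sketch of synchronizer alpha, for one round: forward OK to all
# neighbors; GO is enabled once the node and all its neighbors are OK.

class AlphaNode:
    def __init__(self, neighbors):
        self.neighbors = set(neighbors)
        self.ok_from = set()
        self.ok_self = False

    def on_ok(self):
        """The local front-end reports safe: tell every neighbor. Over the
        whole graph this puts a message on every edge, hence O(E) per round."""
        self.ok_self = True
        return [(v, "OK") for v in self.neighbors]

    def on_neighbor_ok(self, v):
        self.ok_from.add(v)

    def may_go(self):
        return self.ok_self and self.ok_from == self.neighbors
```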

Synchronizer β: For this synchronizer, we assume that there is a rooted spanning tree in G. The rule is now as follows: convergecast all the OK info to the root, and then broadcast permission to do the GO outputs. The cost of this algorithm is O(n) messages for every round, but the time to complete a round is proportional to the height of the spanning tree, which is n in the worst case.
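The per-round costs of β on a given rooted tree can be computed directly. This is a sketch with illustrative names: the tree is a node-to-children map, the convergecast sends one message up each tree edge, and the broadcast sends one back down.

```python
# Per-round cost of synchronizer beta on a rooted spanning tree:
# one convergecast up the tree, then one broadcast down it.

def beta_round_costs(children, root):
    def size_and_height(u):
        size, height = 1, 0
        for c in children.get(u, []):
            s, h = size_and_height(c)
            size += s
            height = max(height, h + 1)
        return size, height
    n, height = size_and_height(root)
    # one message per tree edge in each direction; time is up plus down
    return 2 * (n - 1), 2 * height
```

On a path (height n-1) the time per round is linear, while on a shallow tree it is small; the message count is always O(n). This is exactly the tradeoff against α.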

Hybrid Implementation

By combining synchronizers α and β, it is possible to get a hybrid algorithm that, depending on the graph structure, might give simultaneously better time efficiency than β and better message efficiency than α. The basic idea is to divide the graph up into clusters of nodes, each with its own spanning tree (i.e., a spanning forest). We assume that this spanning forest is constructed using some preprocessing. The rule now will be to use synchronizer β within each cluster tree, and use α to synchronize between clusters. To see how this works, we found it useful to describe a high-level decomposition (see Figure ).

Figure: Decomposition of the synchronization task into cluster-synch (for intra-cluster synchronization) and forest-synch (for inter-cluster synchronization).

The behavior of cluster-synch is to take OKs from all nodes in one particular cluster (which is a connected subset of nodes in the graph), and output a single cluster-OK to forest-synch. Then, when a single cluster-GO is input from forest-synch, it produces a GO for each node in the cluster. A possible way to implement this on the nodes of the given cluster, of course, is just to use the idea of synchronizer β, doing convergecast to handle the OKs and broadcast to handle the GOs, both on a spanning tree of the cluster.

The other component is essentially (up to renaming) a synchronizer for the cluster graph G' of G, where the nodes of G' correspond to the clusters of G, and there is an edge from C to D in G' if in G there is an edge from some node in C to some node in D. Ignoring for the moment the problem of how to implement this forest synchronizer in a distributed system based on G, let us first argue why this decomposition is correct. We need to show that any GO that is output at any node i implies that OKs for i and all its neighbors in G have occurred. First, consider node i itself. The GO(i) action means that cluster-GO for i's cluster C has occurred, which means that cluster-OK for C has occurred, which means that OK(i) has occurred. Now consider j in neighbors_G(i) that lie in C. The same argument holds: cluster-OK for C means that OK(j) has occurred. Finally, let j be in neighbors_G(i) outside C. The GO(i) action means, as before, that cluster-GO for C has occurred, which implies that cluster-OK for D has occurred, where D is the cluster containing j (the clusters C and D are neighbors in G' because the nodes i and j are neighbors in G). This implies, as before, that OK(j) has occurred.

Implementation of forest-synch: We can use any synchronizer to synchronize between the clusters. Suppose we want to use synchronizer α. Note that we can't run α directly, because it is supposed to run on nodes that correspond to the clusters, with edges directly connecting these nodes (clusters); we aren't provided with such nodes and channels in a distributed system. However, it is not hard to simulate them. We do this as follows. Assume we have a leader node in each cluster, and let it run the node protocol of α for the entire cluster. This can (but doesn't have to) be the same node as is used as the root in the implementation of β for the cluster, if the cluster is synchronized using β. The next problem to solve is the way two leaders in different clusters can communicate. We simulate direct communication using a path between them. Note that there must exist such a path, because the leaders that need to communicate are in adjacent clusters, and each cluster is connected. We need some preprocessing to determine these paths; we ignore this issue for now.

We also need to specify, for the final implementation, where the cluster-OK action occurs. This is done as an output from some node in the cluster-synch protocol; if synchronizer β is used, it is an output of the root of the spanning tree of the cluster. It is also an input to the leader node in the forest-synch for that same cluster. If these two are the same node, then this action just becomes an internal action of the actual node in the distributed system. If these nodes are different, then we need to have them communicate along some path, also determined by some preprocessing; this requires an extra piece of the implementation.

The formal structure of this hybrid algorithm is quite nice: each node in the distributed network is formally an IOA, which is the composition of two other IOAs, one for each of the two protocols (the intra-cluster and inter-cluster synchronizers). We can consider two orthogonal decompositions of the entire implementation: vertical (i.e., into nodes and channels) or horizontal (i.e., by the two algorithms, each distributed one piece per node).

Analysis: We shall consider the specific implementation where the cluster-synch is done using β and the forest-synch is done using α, and we assume that the leader for α within a cluster is the same node that serves as the root for β.

Consider one round. The number of messages is O(n) for the work within the clusters, plus O(E'), where E' is the number of edges on all the paths needed for communication among the leaders. We remark that, depending on the decomposition, E' could be significantly smaller than E. The time complexity is proportional to the maximum height of any cluster spanning tree.

Thus, the optimal complexity of the hybrid algorithm boils down to the combinatorial problem of finding a graph decomposition such that both the sum of the path lengths and the maximum height are small. Also, we need to establish this decomposition with a distributed algorithm. We will not address these problems here.

Applications

We can use the synchronizers presented above to simulate any synchronous algorithm (in the original model for synchronous systems presented earlier in the course) on an asynchronous system. Note that the synchronizer doesn't work for fault-tolerant algorithms such as Byzantine agreement, because it is not possible in this case to wait for all processes.

Ring leader election algorithms such as LeLann-Chang, Hirschberg-Sinclair, and Peterson can thus be run in an asynchronous system. But note that the message complexity goes way up if we do this. Since they already worked in an asynchronous system without any such extra synchronization, this isn't interesting. The following applications are more interesting.

Network leader election: Using the synchronizer, the algorithm that propagates the maximal ID seen so far can work as in the synchronous setting. This means that it can count rounds as before, waiting only diam rounds before terminating. It also means that the number of messages doesn't have to be excessive: it's not necessary for nodes to send constantly, since each node now knows exactly when it needs to send (once to each neighbor at each asynchronous round). Using synchronizer α, the complexity of the resulting algorithm is O(diam) time and O(E * diam) messages, which is better than the aforementioned naive strategy of multiple spanning trees.

Breadth-first search. For this problem, the synchronizer proves to be a big win. Recall the horrendous performance of the Bellman-Ford algorithm in the asynchronous model, and how simple the algorithm was in the synchronous setting. Now we can run the synchronous algorithm using, say, synchronizer α, which leads to an O(E · diam) message, O(diam) time algorithm. Compare with the synchronous algorithm, which needed only O(E) messages.

Note that O(E · diam) might be considered excessive communication. We can give a direct asynchronous algorithm based on building the BF tree in levels, doing broadcast-convergecast at each level, adding in the new leaves for that level, and terminating when no new nodes are discovered. For this algorithm the time is then O(diam²), which is worse than the algorithm based on the synchronizer. But the message complexity is only O(E + n · diam), since each edge is explored once, and tree edges are traversed at most a constant number of times for each level. It is also possible to obtain a time-communication tradeoff by combining the two strategies. More specifically, we can explore k levels in each phase (using α), obtaining time complexity of O(diam²/k) and communication complexity of O(E · k + n · diam/k), for k ≤ diam.

Weighted single-source shortest paths. Using the synchronous algorithm with synchronizer α yields an algorithm with O(n) time complexity and O(E · n) messages. We remark that there is an algorithm with fewer messages and more time (cf. Gabow, as developed by Awerbuch in his notes).

Maximal independent set. We can apply the synchronizer to a randomized synchronous algorithm like MIS too. We omit details.

Lower Bound for Synchronizers

Awerbuch's synchronizer result suggests that a synchronous algorithm can be converted into a corresponding asynchronous algorithm without too great an increase in costs. In particular, by using synchronizer α it is possible not to increase the time cost at all. The following result by Arjomandi-Fischer-Lynch shows that this approach cannot be adopted universally. In particular, it establishes a lower bound on the time for an asynchronous algorithm to solve a particular problem. Since there is a very fast synchronous algorithm for the problem, this means that not every fast synchronous algorithm can be converted to an asynchronous algorithm with the same time complexity. Note that the difference is not caused by requiring any fault-tolerance, so the result may seem almost contradictory to the synchronizer results.

We start by defining the problem. Let G = (V, E) be a graph. Recall that diam(G) denotes the maximum distance between two nodes in G. The system's external interface consists of flash_i output actions for all i ∈ V, where flash_i is an output of the I/O automaton at node i. As an illustration, imagine that the flash_i is a signal that node i has completed some task. Define a session as a sequence of flashes in which at least one flash_i occurs for every i. We can now define the k-session problem: we require simply that the algorithm should perform at least k disjoint sessions.

The original motivation for this setting was that the nodes were performing some kind of coordinated calculation on external data, e.g., a Boolean matrix transitive closure in the PRAM style. Each node (i, j, k) was responsible for writing a 1 in location (i, j) in case it ever saw 1's in both locations (i, k) and (k, j). Notice that for this problem it doesn't matter whether the nodes do excessive checking and writing; correctness is guaranteed if we have at least log n sessions, i.e., if there is enough interleaving.

The reason the problem is stated this way is that it is a more general problem than the simpler problem in which all processes do exactly one step in each session, so it strengthens the impossibility result. Note also that this is a weak problem statement: it doesn't even require the nodes to know when k sessions have completed.

Before we turn to prove the lower bound result, let us make the problem statement more precise. As usual, we model the processes as IOAs connected by FIFO channels. For simplicity, we let each node automaton's partition consist of a single class; intuitively, the nodes are executing sequential programs.

Our goal is to prove an inherent time complexity result. Note that this is the first lower bound on time that we are proving in this course. We need to augment the model by associating times, as usual, with the events, in a monotone nondecreasing way, approaching infinity in an infinite fair execution. Let l be a bound for local step time, and d be the bound for delivery of the first message in each channel. (This is a special case of the general notion of time bounds for IOAs.) We assume that d ≥ l. An execution with times associated with events, satisfying the given requirements, is called a timed execution.

Next, we define the time measure T(A) for algorithm A as follows. For each execution α of the algorithm, define T(α) to be the supremum of the times at which a flash event occurs in α. (There could be infinitely many such events, hence we use a supremum here.) Finally, we define

T(A) = sup_α T(α).

We can now state and prove our main result for this section.

Theorem: Suppose A is an algorithm that solves the k-session problem on graph G. Then T(A) ≥ (k − 1) · diam(G) · d.

Before we turn to prove the theorem, consider the k-session problem in the synchronous case. We can get k sessions without the nodes ever communicating: each node just does a flash at every round, for k rounds. The time would be only kd (for a time measure normalized so that each round counts for d time). This discrepancy proves that the inherent multiplicative overhead due to asynchrony, for some problems, is proportional to diam(G).

Proof Sketch of the Theorem:

By contradiction. Suppose that there exists an algorithm A with T(A) < (k − 1) · diam(G) · d. Call a timed execution slow if all the actions take their maximum times, i.e., if the deliveries of the first messages in the channels always take time d, and the local steps always take time l. Let α be any slow fair timed execution of the system. By the assumption that A is correct, α contains k sessions. By the contradiction assumption, the time of the last flash in α is strictly less than (k − 1) · diam(G) · d. So we can write α = α′α″, where the time of the last event in α′ is less than (k − 1) · diam(G) · d, and where there are no flash events in α″. Furthermore, because of the time bound, we can decompose α′ = α_1 α_2 ⋯ α_{k−1}, where each of the α_r has the difference between the times of its first and last events strictly less than diam(G) · d.

We now construct another fair execution β = β_1 β_2 ⋯ β_{k−1} β″ of the algorithm. This will be an ordinary untimed fair execution; that is, we do not assign times to this one. The execution β is constructed by reordering the actions in each α_r and removing the times to obtain β_r, and by removing the times from α″ to get β″. Of course, when we reorder events, we will have to make some adjustments to the intervening states. We will show that the modified execution β contains fewer than k sessions, which contradicts the correctness of A. The reordering must preserve certain dependencies in order to produce a valid execution of A. Formally, we shall prove the following claim.

Claim: Let i_r = i if r is even, and i_r = j if r is odd, where dist(i, j) = diam(G). For all r, 1 ≤ r ≤ k − 1, there exists a reordering β_r = γ_r δ_r of α_r (i.e., the actions are reordered, with the initial and final states unchanged) such that the following properties hold:

1. The reordering preserves the following dependency partial order:
   (a) the order of all actions of the same node, and
   (b) the order of each send(m)_{i,j} event and its corresponding receive(m)_{i,j} event.
2. γ_r contains no event of i_r.
3. δ_r contains no event of i_{r+1}.

Let us first show how to complete the proof of the Theorem using the Claim. Since the reordering preserves the dependencies, β = γ_1 δ_1 ⋯ γ_{k−1} δ_{k−1} β″ is a fair execution of A. But we can show that this execution contains at most k − 1 sessions, as follows. No session can be entirely contained within γ_r, since γ_r contains no event of i_r. Likewise, no session can be entirely contained within δ_r γ_{r+1}, since this sequence contains no event of process i_{r+1}. (Recall also that β″ contains no flash events.) This implies that each session must contain events on both sides of some boundary between γ_r and δ_r. Since there are only k − 1 such boundaries, the theorem follows.

It remains to prove the Claim. To do this, we produce the required reorderings. The construction is done independently for each r, so fix some r, 1 ≤ r ≤ k − 1. We consider two cases.

Case 1: α_r contains no event of i_r. In this case we define γ_r = α_r, and δ_r is empty.

Case 2: α_r contains an event of i_r. Let π be the first event of i_r in α_r. Now we claim that there is no step of i_{r+1} in α_r that depends on π, according to the dependency relation defined in the Claim. This is true since the time for a message to propagate from i_r to i_{r+1} in a slow execution is at least diam(G) · d, whereas we have constructed α_r so that the time between its first and last events is strictly less than diam(G) · d. Thus we can reorder α_r so that all the steps of i_{r+1} precede π; we keep the initial and final states the same, and mimic the local state changes as in α_r. This still yields allowable transitions of the algorithm, by a dependency theorem similar to the one we used for the synchronizer construction (but simpler). Let γ_r be the part of the reordered sequence before π, and δ_r the part of the sequence starting with π. It is straightforward to verify that β_r = γ_r δ_r has the properties required to prove the Claim.

Note that the reordered sequence is not a timed execution: if we were trying to preserve the times somehow during the reordering, we would be stretching and shrinking some time intervals, and we would have to be careful to observe the upper bound requirements. We could avoid these problems by making all the time intervals very small, but we don't need to worry about this at all, since all we need to get the contradiction is an untimed fair execution.

The Theorem almost looks like a contradiction to the synchronizer result: recall that the synchronizer gives a simulation with constant time overhead. This is not a contradiction, however, because the synchronizer simulation doesn't preserve the total external behavior; it only preserves the behavior at each node, while it may reorder the events at different nodes. We remark that for most purposes we might not care about global reordering, but sometimes, when there is some out-of-band communication, the order of events at different nodes might be important.

6.852J/18.437J Distributed Algorithms (November)
Lecturer: Nancy Lynch

Lecture: Time, Clocks, etc.

The idea that it is OK to reorder events at different nodes motivates Lamport's important notion of logical time, defined in his famous paper "Time, Clocks, and the Ordering of Events in a Distributed System".

Logical Time

The model assumed is the one we have been using. In this model there is no built-in notion of real time (in particular, there are no clocks), but it is possible to impose a logical notion of time, namely Lamport's logical time. Specifically, every event of the system (send or receive of a message, external interface event, internal event of a node) gets assigned a logical time, an element of some fixed totally ordered set. Typically, this set is either the nonnegative integers or the nonnegative reals, perhaps with tiebreakers. These logical times don't have to have any particular relationship to any notion of real time. However, they do need to satisfy the following properties:

1. No two events get assigned the same logical time.
2. Events at each node obtain logical times that are strictly increasing, in the same order as the events occur.
3. For any message, its send event gets a strictly smaller logical time than its receive event.
4. For any event, there are only finitely many other events that get assigned logical times that are smaller.

The important result about such an assignment is as follows. Suppose that we are given an execution α of a network system of I/O automata, and a logical time assignment ltime that satisfies the conditions above. Then there is another execution α′ of the same system which looks the same to each node (i.e., α′|i = α|i for all i), and in which the times from the assignment ltime occur in increasing order in real time. In other words, we can have a reordered execution such that the logical time order of the original execution is the same as the real time order in the reordered execution. So we can say that programming in terms of logical time looks to all the users as if they are programming in terms of real time.

We can view this as another way to reduce the uncertainty of an asynchronous network system. Roughly speaking, we can write asynchronous programs where we assume that each node has access to real time. (Real time modeling doesn't fit the IOA model, which is why this is rough; it will be done more carefully later in the course, when we do timing-based models.) In the basic model the nodes don't have this access. Instead, we can have them implement logical time somehow, and let them use the logical time in place of the real time. The reordering result says that the resulting executions will look the same to all nodes separately.

One generalization is useful. Suppose we want to augment the model to allow broadcast actions, which serve to send the same message to all the other nodes. We can implement this, of course, with a sequence of individual send actions; there is nothing too interesting here thus far. But the important thing is that we can allow a broadcast to have a single logical time, and require that all the associated receives have larger logical times. The reordering result still holds with this new action.

Algorithms

Lamport's implementation. Lamport presents a simple algorithm for producing such logical times. It involves each node (process) maintaining a local variable clock that it uses as a local clock. The local clock gets incremented at each event (input, output, or internal) that occurs at that node. The logical time of the event is defined to be the value of the variable clock immediately after the event, paired with the process index as a tiebreaker.

This algorithm is easily seen to ensure Properties 1 and 2 above. In order to ensure Property 3, the following rule is observed. Whenever a node does a send (or broadcast) event, it first increments its clock variable (to get the logical time for the send event), and it attaches that clock value to the message when it sends it out. When the receiver of the message performs a receive event, it makes sure that it increments its clock variable to be not only larger than its previous value, but also strictly larger than the clock value in the message. As before, it is this new clock value that gets assigned to the receive event. To ensure Property 4, it suffices to require each increase of a clock variable to increase the value by some minimum amount, e.g., 1.
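As a concrete illustration, the clock rule above can be sketched as follows; the class and method names here are our own invention, not from the notes.

```python
# A minimal sketch of Lamport's logical-clock rule. The names
# (LamportNode, local_event, send, receive) are illustrative.

class LamportNode:
    def __init__(self, index):
        self.index = index      # process index, used as a tiebreaker
        self.clock = 0

    def _ltime(self):
        # Logical time of the event just performed: (clock, index).
        return (self.clock, self.index)

    def local_event(self):
        self.clock += 1         # every event advances the local clock
        return self._ltime()

    def send(self):
        self.clock += 1
        # Return the event's logical time and the clock value that is
        # attached to the outgoing message.
        return self._ltime(), self.clock

    def receive(self, msg_clock):
        # Jump strictly past both the previous local value and the
        # clock value carried by the message (this ensures Property 3).
        self.clock = max(self.clock, msg_clock) + 1
        return self._ltime()

p, q = LamportNode(1), LamportNode(2)
t_send, stamp = p.send()        # p sends a message carrying its clock
t_recv = q.receive(stamp)
assert t_send < t_recv          # send is ordered before receive
```

Since every increment is at least 1 and ties are broken by process index, Properties 1, 2, and 4 hold as well.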

Welch's implementation. In Welch's algorithm, as in Lamport's, the logical time at each node is maintained by a state variable named clock, which can take on nonnegative integer or real values. But this time we assume that there is an input action tick(t) that is responsible for setting this variable: the node sets its clock variable to t when tick(t) is received. We can imagine this input as arriving from a separate clock IOA; the node implements both this clock and the process IOA.

Now, each non-tick action gets its logical time equal to the value of clock when it is performed, with tiebreakers: first by the process index, and second by execution order at the same process. Note that the clock value does not change during the execution of this action. Some additional careful handling is needed to ensure Property 3. This time we have an extra FIFO receive-buffer at each node, which holds messages whose clock values are at least the clock value at the local node. Assuming that the local clock value keeps increasing without bound, eventually it will be big enough so that it dominates the incoming messages' clock values. The rule is to wait for the time when the first message in the receive-buffer has an associated clock less than the local clock. Then this message can be processed as usual; we define the logical time of the receive event in terms of the local clock variable when this simulated receive occurs.

The correctness of this algorithm is guaranteed by the following facts. Property 1 is ensured by the tiebreakers. Property 2 relies on the monotonicity of the local clocks, plus the tiebreakers. Property 3 is guaranteed by the buffer discipline, and Property 4 requires a special property of the clock automaton, namely, that it must grow unboundedly. This could be achieved, e.g., by having a positive lower bound on the tick increment, and having the clock automaton execute fairly.

Note that this algorithm leads to bad performance when the local clocks get out of synch. For this to work in practice, we need some synchronization among the clocks, which is not really possible in a purely asynchronous system. We could use a synchronizer for the local clocks, but it is too expensive to do this at every step. Instead, we could use a synchronizer every so often. Still, in a purely asynchronous system, the worst-case behavior of this strategy will be poor. This algorithm makes more sense in a setting with some notion of real time.
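The receive-buffer discipline can be sketched as follows; the names and the use of floating-point clock values are illustrative assumptions, not from the notes.

```python
# A sketch of the receive-buffer discipline in Welch's algorithm. The
# names (WelchNode, tick, deliver, try_receive) are illustrative.
from collections import deque

class WelchNode:
    def __init__(self, index):
        self.index = index
        self.clock = 0.0
        self.buffer = deque()   # FIFO receive-buffer of (msg, msg_clock)

    def tick(self, t):
        # Input from the separate clock IOA; assumed strictly increasing.
        assert t > self.clock
        self.clock = t

    def deliver(self, msg, msg_clock):
        # The channel delivers; the message waits in the buffer.
        self.buffer.append((msg, msg_clock))

    def try_receive(self):
        # Process the first buffered message only once its attached clock
        # is strictly less than the local clock (this ensures Property 3).
        if self.buffer and self.buffer[0][1] < self.clock:
            msg, _ = self.buffer.popleft()
            return msg, (self.clock, self.index)
        return None             # must wait for a larger local clock

n = WelchNode(2)
n.deliver("m", 5.0)
assert n.try_receive() is None  # local clock (0.0) not yet past 5.0
n.tick(7.0)
assert n.try_receive() == ("m", (7.0, 2))
```

Note how a message stamped 5.0 sits in the buffer until a tick pushes the local clock past it, which is exactly why out-of-synch clocks hurt performance.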

Applications

It is not clear that the above algorithms can always be used to implement some "ltime" abstraction, which then can be used as a black box; the problem is that the logical time abstraction does not have a clear interface description. Rather, we'll consider uses of the general idea of logical time. Before we go into more sophisticated applications, we remark that both algorithms can be used to support logical time for a system containing broadcast actions, as described above.

[Footnote: In other words, we don't allow Zeno behaviors, in which infinite executions take a finite amount of time.]

Banking system

Suppose we want to count the total amount of money in a banking system in which no new money is added to or removed from the system, but it may be sent around and received (i.e., the system satisfies "conservation of money"). The system is modeled as some collection of IOAs with no external actions, having the local amount of money encoded in the local state, with send actions and receive actions that have money parameters. The architecture is fixed in our model; the only variation is in the decision of when to send money, and how much.

Suppose this system somehow manages to associate a logical time with each event, as above. Imagine that each node now consists of an augmented automaton that contains the original automaton. This augmented automaton is supposed to know the logical time associated with each event of the basic automaton. In this case, in order to count the total amount of money in the bank, it suffices to do the following:

1. Fix some particular time t, known to all the nodes.
2. For each node, determine the amount of money in the state of the basic automaton after processing all events with logical time less than or equal to t, and no later events.
3. For each channel, determine the amount of money in all the messages sent at a logical time at most t, but received at a logical time strictly greater than t.

Adding up all these amounts gives a correct total. This can be argued as follows. Consider any execution α of the basic system, together with its logical time assignment. By the reordering result, there's another execution α′ of the same basic system (as above, and with the same logical time assignment) that looks the same to all nodes, and in which all events occur in logical time order. What this strategy does is "cut" execution α′ immediately after time t, and record the money in all the nodes and the money in all the channels. This simple strategy is thus giving an instantaneous snapshot of the system state, which gives the correct total amount of money in the banking system.

Let us consider how to do this with a distributed algorithm. Each augmented automaton i is responsible for overseeing the work of basic automaton i. Augmented automaton i is assumed able to look inside basic automaton i, to see the amount of money. (Note that this is not ordinary IOA composition; the interface abstractions are violated by the coupling.) It is also assumed to know the logical time associated with each event of the basic automaton. The augmented automaton i is responsible for determining the state of the basic automaton at that node, plus the states of all the incoming channels.

The augmented automaton i attaches the logical time of each send event to the message being sent. It records the basic automaton state as follows: automaton i records the state after each step, until it sees some logical time greater than t; then it "backs up" one step and returns the previous state. For recording the channel state, we need to know the messages that are sent at a logical time (at the sender) at most t, and received at a logical time (at the receiver) strictly greater than t. So as soon as an event occurs with logical time exceeding t (that is, when it records the node state), it starts recording messages coming in on each channel. It continues recording them as long as the attached timestamp is no more than t. (Note: to ensure termination, we need to assume that each basic node continues sending messages every so often, so that its neighbors eventually get something with timestamp strictly larger than t. Alternatively, the counting protocol itself could add some extra messages to determine when all the sender's messages with attached time at most t have arrived.)

We remark that the augmented algorithm has the nice property that it doesn't interfere with the underlying operation of the basic system.
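As a small arithmetic check of the cut rule above, consider an invented event log in which timestamps stand in for logical times; the data and helper names are illustrative.

```python
# A toy check of the logical-time "cut" rule for counting money.
# Each transfer is (send_ltime, recv_ltime, src, dst, amount), with
# send_ltime < recv_ltime as required by Property 3.
transfers = [(1, 4, "A", "B", 30), (2, 3, "B", "C", 10), (5, 6, "C", "A", 5)]
initial = {"A": 100, "B": 50, "C": 0}

def total_at_cut(t):
    # Node balances after applying every event with logical time <= t.
    bal = dict(initial)
    for s, r, src, dst, amt in transfers:
        if s <= t:
            bal[src] -= amt
        if r <= t:
            bal[dst] += amt
    # Money in transit: sent at logical time <= t, received strictly after t.
    in_transit = sum(amt for s, r, _, _, amt in transfers if s <= t < r)
    return sum(bal.values()) + in_transit

# The cut yields the conserved total (150) at every logical time t.
assert all(total_at_cut(t) == 150 for t in range(0, 8))
```

Omitting the in-transit term would undercount whenever the cut falls between a send and its matching receive, which is exactly why step 3 above is needed.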

General Snapshot

The strategy above can, of course, be generalized beyond banks, to arbitrary asynchronous systems. Suppose we want to take any such system that is running and determine a global state at some point in time. We can't actually do this without stopping everything in the system at one time, or making duplicate copies of everything. It is not practical in a real distributed system (e.g., one with thousands of nodes) to really stop everything; it could even be impossible. But for some applications it may be sufficient to get a state that just looks like a correct global state, as far as the nodes can tell. We shall see some such applications later. In this case, we can use the strategy described above.

Simulating a Single State Machine

Another use, mentioned but not emphasized in Lamport's paper (but very much emphasized by him in the ensuing years), is the use of logical time to allow a distributed system to simulate a centralized state machine, i.e., a single object. The notion of an object that works here is a very simple one: essentially, it is an I/O automaton with input actions only; we require that there be only one initial value, and that the new state be determined by the old state and the input action. This may not seem very useful at first glance, but it can be used to solve some interesting synchronization problems; e.g., we will see how to use it to solve mutual exclusion. It's also a model for a database with update operations.

Suppose now that there are n users supplying inputs (updates) at the n node processes of the network. We would like them all to apply their updates to the same centralized object. We can maintain one copy of this object in one centralized location and send all the updates there. But suppose that we would like all the nodes to have access to (i.e., to be able to read) the latest available state of the object; e.g., suppose that reads are more frequent than writes, and we want to minimize the amount of communication. This suggests a replicated implementation of the object, where each node keeps a private copy: we can broadcast all the updates to all the nodes, and let them all apply the updates to their copies. But they need to update their copies in the same way, at least eventually. If we just broadcast the updates, nothing can stop different nodes from applying them in different orders. We require a total order of all the updates produced anywhere in the system, known to all the nodes, so that they can all apply them in the same order. Moreover, there should only be finitely many updates ordered before any particular one, so that all can eventually be performed. Also, a node needs to know when it's safe to apply an update; this means that no further updates that should have been applied before (i.e., before the one about to be applied) will ever arrive. We would like the property of monotonicity of successive updates submitted by a particular node, in order to facilitate this property.

The answer to our problems is logical time. When a user submits an update, the associated process broadcasts it and assigns it a timestamp, which is the logical time assigned to the broadcast event for the update. If a node stops getting local updates as inputs, it should still continue to send some dummy messages out, just to keep propagating information about its logical time. Each node puts all the updates it receives, together with their timestamps, in a request queue, not yet applying them to the object. (Note: this queue must include all the updates the node receives in broadcasts by other nodes, plus those the node sends in its own broadcasts.) The node applies update u if the following conditions are satisfied:

1. Update u has the smallest timestamp of any request on the request queue.
2. The node has received a message (request or dummy) from every node, with send time at least equal to the timestamp of u.

Condition 2 and the FIFOness of the channels guarantee that the node will never receive anything with a smaller timestamp than this first update. Furthermore, any node eventually succeeds in applying all updates, because logical time keeps increasing at all nodes. Note that we didn't strictly need Property 3 of logical time (relating the times of message send and receive) here for correctness; but something like that is important for performance.
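The two conditions above can be sketched as follows, assuming (as a simplification for illustration) that each replica only tracks the largest send time it has heard from each other node; all names are invented.

```python
# A sketch of the timestamp-ordering rule for a replicated state machine.
# Names (Replica, enqueue, note_heard, try_apply) are illustrative.
import heapq

class Replica:
    def __init__(self, peers):
        self.queue = []                      # min-heap of (timestamp, update)
        self.heard = {p: -1 for p in peers}  # largest send time from each peer
        self.state = []                      # the simulated object: applied log

    def enqueue(self, timestamp, sender, update):
        heapq.heappush(self.queue, (timestamp, update))
        self.note_heard(sender, timestamp)

    def note_heard(self, sender, timestamp):
        # Any message (request or dummy) advances what we know of sender.
        self.heard[sender] = max(self.heard[sender], timestamp)

    def try_apply(self):
        # Apply the front update once every peer has been heard from at a
        # send time >= its timestamp; FIFO channels then rule out any
        # smaller-timestamped arrival.
        while self.queue and all(h >= self.queue[0][0]
                                 for h in self.heard.values()):
            _, update = heapq.heappop(self.queue)
            self.state.append(update)

r = Replica(peers=["p1", "p2"])
r.enqueue(5, "p1", "credit 10")
r.try_apply()
assert r.state == []                         # p2 not heard from yet: unsafe
r.enqueue(3, "p2", "debit 4")
r.try_apply()
assert r.state == ["debit 4"]                # both peers heard at time >= 3
r.note_heard("p2", 7)                        # a dummy message from p2
r.try_apply()
assert r.state == ["debit 4", "credit 10"]   # applied in timestamp order
```

The last step shows the role of dummy messages: without p2's dummy, the update with timestamp 5 could never safely be applied.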

Example: Banking (distributed database). Suppose each node is a bank branch, and the updates are things like "deposit", "withdraw", "add-interest", "withdraw-from-checking-if-sufficient-funds-else-from-savings", etc. Note that in some of these cases, the order of the updates matters. Also suppose that all the nodes want to keep a recent picture of all the bank information, for reading (deciding on which later updates should be triggered at that branch, etc.). We can use logical time as above to simulate a centralized state machine representing the bank information.

Note that this algorithm is not very fault-tolerant. Also, in general, it does not seem very efficient. But it can sometimes be reasonable (see the mutual exclusion algorithm below).

Simulating Shared Memory

In this section we consider another simplifying strategy for asynchronous networks. We use this strategy to simulate instantaneous shared memory algorithms.

The basic idea is simple. Locate each shared variable at some node. Each operation gets sent to the appropriate node, and the sending process waits for a response. When the operation arrives at the simulated location, it gets performed (possibly being queued up before being performed), and a response is sent back. When the response is received by the sending process, it completes its step. If new inputs arrive from the outside world at a process during a waiting period, the process only queues them; these pending inputs are processed only when the anticipated response returns. We now claim that this simulation is correct, i.e., that every external behavior is also an external behavior of the instantaneous shared memory system being simulated; also, every fair behavior of nodes and channels is also a fair behavior of the instantaneous shared memory system; also, some kind of wait-freedom carries over. To prove this claim, we use the corresponding result for atomic shared memory simulating instantaneous shared memory, and the observation that we are actually simulating atomic objects using a centralized object and the delays in the message system. We omit details.
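A toy, single-process sketch of the forwarding idea: a shared variable lives at one "owner" node, and other processes submit read/write operations and block until the response. All names here are invented for illustration.

```python
# A sketch of simulating a shared variable by locating it at one node.
# The names (VariableOwner, submit, step) are illustrative.
from collections import deque

class VariableOwner:
    def __init__(self, initial):
        self.value = initial
        self.pending = deque()          # queued operations at the owner

    def submit(self, op, arg=None):
        # A remote process forwards its operation to the owner.
        self.pending.append((op, arg))

    def step(self):
        # Perform one queued operation atomically, returning the response
        # that would be sent back to the waiting process.
        op, arg = self.pending.popleft()
        if op == "write":
            self.value = arg
            return "ack"
        return self.value               # "read"

owner = VariableOwner(initial=0)
owner.submit("write", 42)   # a writer forwards its operation...
owner.submit("read")        # ...and a reader forwards its own
assert owner.step() == "ack"
assert owner.step() == 42   # operations take effect in arrival order
```

Since the owner performs queued operations one at a time, each operation takes effect atomically at a single point between its invocation and its response, which is what the atomic-object argument in the claim above relies on.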

The problem of where to put the shared variables depends on the characteristics of the application at hand. E.g., for a system based on single-writer multi-reader shared variables where writes are fairly frequent, we would expect to locate the variables at the writers' nodes. Note that the objects have to be simulated even while the nodes don't have active requests.

[Footnote: Earlier in the course, we had a restriction, for the atomic shared memory simulation of instantaneous shared memory, that no new inputs arrive at any node while the node process is waiting for an invocation to return. The reason for this was to avoid introducing any new interleavings. It seems, however, that we could have handled this by just forcing the process to queue up the inputs, as above, and so remove the need for this restriction.]

There are other ways of simulating shared memory algorithms.

Caching. For single-writer multi-reader registers, note that someone who wants to repeatedly test the value of a register has, in the implementation described above, to send repeated messages, even though the variable is not changing. It is possible to develop an alternative strategy by which writers notify readers before they change the value of the variable, so that if the readers haven't heard about a modification, they can just use their previous value. This is the basic idea of caching.

Replicated copies. Caching is one example of a general style of implementation based on data replication. Data can be replicated for many reasons, e.g., fault-tolerance, which we will discuss later; but often it is done only for availability. Suppose we have a multi-writer multi-reader shared register in which writes are very infrequent. Then we can locate copies of the register at all the readers' nodes (compare with the caching example). A reader, to read the register, can always look at its local copy. A writer, to write the register, needs to write all the copies. Note that it needs to do this atomically, since writing one copy at a time may cause out-of-order reads. Therefore, we need some extra protocol here to ensure that the copies are written as if atomically. This requires techniques from the area of database concurrency control.

Mutual Exclusion and Resource Allocation

In this section, we consider the problem of resource allocation in a distributed message-passing network.

Problem Definition

We have the same interface as before, but now the inputs and outputs for user i occur at a corresponding node i (see the figure). The processes (one for each i) communicate via messages over FIFO channels.

Figure: Interface specification for resource allocation on a message-passing system. (The figure shows, for each node i, the actions try_i, crit_i, exit_i, and rem_i.)

Consider the resource requirement formulation of the problem, in terms of a monotone Boolean formula involving the needed resources. Each node i has a static (fixed) resource requirement. Assume that any two nodes that have any common resource in their two resource requirements are neighbors in the network graph, so they can communicate to negotiate priority for the resources. (We do not model the resources separately.)

In this setting, we typically drop the restriction that nodes can perform locally-controlled steps only when they are between requests and responses. This is because the algorithms will typically need to respond to requests by their neighbors for the resources.

The correctness conditions are now as follows. As before, we require a solution to preserve well-formedness (cyclic behavior), and also mutual exclusion, or a more general exclusion condition.

We also require deadlock-freedom, i.e., if there is any active request, and no one is in C, then some request eventually gets granted; and if there is any active exit region, then some exit eventually occurs. This time, the hypothesis is that all the node and channel automata exhibit fair behavior (in the IOA sense), i.e., all the nodes keep taking steps, and all the channels continue to deliver all messages.

The lockout-freedom condition is defined as before, under the hypothesis that all the node and channel automata exhibit fair behavior.

For the concurrency property, suppose that a request is invoked at a node i, and all neighbors are in R and remain in R. Suppose that execution proceeds so that i and all its neighbors continue to take steps fairly, and the connecting links continue to operate fairly. (Note that other processes can remain in C forever.) Then the concurrency condition requires that eventually i reach C.

We remark that this is analogous to the weak condition we had for shared memory systems; only here, we add fairness requirements on the neighbors and the links in between.

Mutual Exclusion Algorithms

Raynal's book is a good reference, also covering the shared memory mutual exclusion algorithms. We have many possible approaches now.

Simulating Shared Memory

We have learned about several shared memory algorithms for mutual exclusion. We can simply simulate one of these in a distributed system; e.g., use the Peterson multi-writer multi-reader shared register tournament algorithm (or the variant of this that we didn't have time to cover), or some version of the bakery algorithm (maybe with unbounded tickets).

LeLann

LeLann proposed the following simple solution. The processes are arranged in a logical ring p_1, p_2, ..., p_n. A token, representing control of the resource, is passed around the ring in order. When process p_i receives the token, it checks for an outstanding request for the resource from user i. If there is no such request, the token is passed to the next process in the ring. If there is an outstanding request, the resource is granted, and the token is held until the resource is returned, and then passed to the next process.

The code for process p_i is given in the figure below.

Let us go over the properties of LeLann's algorithm.

Mutual Exclusion is guaranteed in normal operation, because there is only one token, and only its holder can have any resource.

Deadlock-Freedom: in fair executions, when no one is in C, the process that holds the token is either in T, and then it can go to C, or in E or R, and then it has to pass the token to the next process. Similarly, deadlock-freedom for the exit region is satisfied.

Lockout-Freedom is straightforward.

Local variables:
  token, with values in {none, available, in-use, used}; initially available at p_1, and none elsewhere
  region, with values in {R, T, C, E}; initially R

Actions:

try_i
  Effect: region := T

crit_i
  Precondition: region = T
                token = available
  Effect: region := C
          token := in-use

exit_i
  Effect: region := E

rem_i
  Precondition: region = E
  Effect: region := R
          token := used

receive-token_{i-1,i}
  Effect: token := available

send-token_{i,i+1}
  Precondition: token = used, or (token = available and region is not T)
  Effect: token := none

Figure: LeLann mutual exclusion algorithm for message-passing systems.

Performance: First, let's consider the number of messages. It is not clear what to measure, since the messages aren't naturally apportioned to particular requests. E.g., we can measure the worst-case number of messages sent in between a try_i and the corresponding crit_i, but it seems more reasonable to try to do some kind of amortized analysis. Note that each virtual link in the logical ring is really of unit length, since the graph is complete.

In the worst case, n messages are sent between try_i and crit_i. For the amortized cost, we consider the case of heavy load, where there is always an active request at each node; in this case, there are only a constant number of messages per request. If a request comes in alone, however, we still get n messages in the worst case. Also, it is hard to get a good bound statement here, since messages are sent even in the absence of any requests at all.

For the time complexity, we assume worst-case bounds. Let c be the time spent in C, d be the message delay for the first message in any channel, and s be the process step time. The worst-case time is approximately n(c + d + O(s)). Note that, unfortunately, this time bound has a d·n term regardless of the load, and this might be big.
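To make the token-passing rules concrete, here is a small single-threaded simulation sketch. The class, the run_ring driver, and the instantaneous "channel" (a direct assignment to the successor's token variable) are illustrative assumptions for this sketch, not part of LeLann's original presentation; they only exercise the token rules above.

```python
class LeLannProcess:
    """One node of the logical ring, following LeLann's token rules."""

    def __init__(self, i, has_token):
        self.i = i
        self.region = "R"                    # R, T, C, or E
        self.token = "available" if has_token else "none"

    def step(self, pass_token):
        """One locally controlled step; pass_token forwards the token."""
        if self.token == "available" and self.region == "T":
            self.region = "C"                # crit_i: enter the critical region
            self.token = "in_use"
        elif self.token == "used" or (self.token == "available"
                                      and self.region != "T"):
            self.token = "none"              # send-token: pass it along
            pass_token(self.i)

    def leave(self):
        """exit_i followed immediately by rem_i (resource returned)."""
        self.region, self.token = "R", "used"

def run_ring(n, requesters):
    """Drive the ring until each requesting node has been critical once."""
    procs = [LeLannProcess(i, has_token=(i == 0)) for i in range(n)]
    for i in requesters:
        procs[i].region = "T"                # try_i inputs from the users
    granted = []
    while len(granted) < len(requesters):
        for p in procs:
            p.step(lambda i: setattr(procs[(i + 1) % n], "token", "available"))
            if p.region == "C":              # record the grant, then leave
                granted.append(p.i)
                p.leave()
    return granted                           # grants occur in ring order
```

Because the token circulates in ring order starting at p_1 (index 0 here), requests are always granted in ring order, regardless of the order in which they arrived.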

Resiliency: LeLann discusses various types of resiliency in his paper.

Process failure: If a process stops and announces its failure, the rest of the processes can reconfigure the ring to bypass the failed process. (This requires a distributed protocol.) Note that a process that doesn't announce its failure can't be detected as failed, since in an asynchronous system there is no way for a process to distinguish a failed process from one that is just going very slowly.

Loss of token: When a token loss is detected (e.g., by timeout), a new one can be generated by using a leader-election protocol, as studied earlier.

Lamport's algorithm

Another solution uses Lamport logical time. In particular, the state machine simulation approach is applied. The state machine here has as its state a queue of process indices, which is initially empty. The operations are try_i and exit_i, where try_i has the effect of adding i to the end of the queue, and exit_i has the effect of removing i from the queue, provided it is at the front. When user i initiates a try_i or exit_i event, it is regarded as submitting a corresponding update request. Then, as described above, the state machine approach broadcasts the updates and waits to be able to perform them (i.e., a node can perform an update when that update has the smallest timestamp of any update in the request buffer, and the node knows it has all the smaller-timestamp updates). When the node is able to perform the update, it does so. The replicated object state is used as follows: when the queue at node i gets i at the front, the process at node i is allowed to do a crit_i; when the queue at node i gets i removed, the process at node i is allowed to do a rem_i.

Let us now argue that this algorithm guarantees mutual exclusion. Suppose not; say p_i and p_j are simultaneously in C, and suppose, without loss of generality, that p_i has the smaller timestamp. Then, when j was added to p_j's queue, the state machine implementation ensures that i must have already been there, ahead of j. But i has not been removed, since p_i is still in C. So p_j could not have done a crit_j action, a contradiction.

It is a little hard to analyze the algorithm in this form, since it combines three protocols: the logical time generation protocol, the replicated state machine protocol, and the extra rules for when to do crit and rem actions. We can merge and optimize these protocols somewhat, as follows. Assume that the logical time generation is done according to Lamport's algorithm, based on the given messages. We combine the request buffer and the simulated state machine's state. Also, the ack messages below are used as the "dummy" messages discussed in the general approach.

In this algorithm, every process p_i maintains a local variable region, as before, and, for each other process p_j, a local queue queue_j. This queue contains all the messages ever received from p_j (we remark that this can be optimized). There are three types of messages:

try-msg(i): Broadcast by p_i to announce that it is trying.

exit-msg(i): Broadcast by p_i to announce that it is exiting.

ack(i): Sent by p_i to p_j, acknowledging the receipt of a try-msg(j) message.

The code is written so as to send all these messages at the indicated times. We assume that the sending (or broadcast) logical times are piggybacked on the messages, and the queues contain these logical times as well as the messages themselves. Also, the logical time of each of the broadcast events for try or exit requests is used as the timestamp of the corresponding update.

This already tells what messages are sent and when. We won't have a separate application step to apply the update to the queue, since we do not have an explicit representation of the queue. All we need now are rules telling p_i when to perform crit_i and rem_i actions. We give these rules below.

rem_i: Can be done anytime after exit_i occurs.

crit_i: Must ensure that region = T and that the following conditions hold: (a) mutual exclusion is preserved, and (b) there is no other request pending with a smaller timestamp. Process p_i can ensure that these conditions are met by checking, for each j not equal to i:

1. Any try-msg in queue_j with timestamp less than the timestamp of the current try-msg of p_i has an exit-msg in queue_j with a larger timestamp than that of j's try-msg.

2. queue_j contains some message (possibly an ack) with timestamp larger than the timestamp of the current try-msg of p_i.

Lamport's algorithm has the following properties.

Mutual Exclusion: This can be proven by contradiction, as follows. Assume that two processes p_i and p_j are in C at the same time, and, without loss of generality, that the timestamp of p_i's request is smaller than the timestamp of p_j's request. Then p_j had to check its queue_i in order to enter C. The second test and the FIFO behavior of the channels imply that p_j had to see p_i's try-msg; so, by the first test, it had also to see an exit-msg from p_i, and so p_i must have already left C.

Lockout-freedom: This property results from servicing requests in timestamp order. Since each event (in particular, each request event) has a finite number of predecessors, all requests eventually get serviced.

Complexity: Let us first deal with the number of messages. We shall use amortized analysis here, as follows. Every request involves sending try-msg, ack, and exit-msg messages between some process and all the others; thus 3(n − 1) messages are sent per request. For the time complexity, note that when there is a single request in the system, the time is 2d + O(s), where d is the communication delay and s is the local processing time. (We assume that the broadcast is done as one atomic step; if the n − 1 messages are treated separately, the processing costs are linear in n, but these costs are still presumed to be small compared to the communication delay d.) Recall that, in contrast, the time complexity of LeLann's algorithm had a dn term.
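The two entry tests above can be phrased as a small predicate over the per-neighbor queues. In the sketch below, the (time, pid) timestamp encoding and the queues dictionary are illustrative assumptions, not the notes' notation; the predicate itself is the pair of checks p_i performs before crit_i.

```python
def may_enter_crit(my_try_ts, queues):
    """Lamport's entry test for process i, run over its local queues.

    my_try_ts -- (logical_time, pid) timestamp of i's current try-msg
    queues    -- {j: [(kind, ts), ...]}: all messages received from each
                 j != i, with kind in {"try", "exit", "ack"} and ts a
                 (time, pid) pair compared lexicographically (pid breaks ties)
    """
    for msgs in queues.values():
        # Test 2: queue_j holds some message with a larger timestamp, so
        # (channels being FIFO) j has already seen i's try-msg.
        if not any(ts > my_try_ts for _, ts in msgs):
            return False
        # Test 1: every earlier try-msg from j is followed by an exit-msg
        # with a larger timestamp, i.e., no smaller request is pending.
        for kind, ts in msgs:
            if kind == "try" and ts < my_try_ts and not any(
                    k == "exit" and ts2 > ts for k, ts2 in msgs):
                return False
    return True
```

For example, if queue_2 holds an old try with a matching later exit, plus an ack later than p_1's own try, process 1 may enter; if the exit is missing, it must keep waiting.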

Distributed Algorithms, December
Lecturer: Nancy Lynch

Lecture

Mutual Exclusion (cont.)

Last time we gave Lamport's mutual exclusion algorithm, based on his state-machine approach. Recall that we have try and exit messages as input to the state machine, and ack messages to respond to try messages. The following two variants provide interesting optimizations of this algorithm.

Ricart-Agrawala

Ricart and Agrawala's algorithm uses only 2(n − 1) messages per request. It improves Lamport's original algorithm by acknowledging requests in a careful manner that eliminates the need for exit messages. This algorithm uses only two types of messages: try and OK. Process p_i sends try(i), as in Lamport's algorithm, and can go critical after OK messages have been received from all the others. So the interesting part is the rule for when to send an OK message. The idea is to use a priority scheme. Specifically, in response to a try, a process does the following.

It replies with OK if it is not critical or trying.

If it is critical, it defers the reply until it exits, and then immediately sends all the deferred OKs.

If it is trying, it compares the timestamp of its own request to the timestamp of the incoming try. If its own is bigger, its own is interpreted as lower priority, and the process sends OK immediately; else (having higher priority), the process defers acknowledging the request until it finishes with its own critical region.

In other words, when there is some conflict, this strategy resolves it in favor of the earlier request, as determined by the timestamps.

Properties of the Ricart-Agrawala Algorithm

Mutual Exclusion: Proved by contradiction. Assume that processes p_i and p_j are both in C, and, without loss of generality, that the timestamp of p_i's request is smaller than the timestamp of p_j's request. Then there must have been try and OK messages sent from each to the other prior to their entry to C; at each node, the receipt of the try precedes the sending of the matching OK. But there are several possible orderings of the various events (see the figure).

Figure: A possible scenario for the Ricart-Agrawala algorithm: process i sends try-i and receives OK-i; process j sends try-j and receives OK-j.

Now, the timestamp of i is less than that of j, and the logical time of the receipt of j's try by i is greater than the timestamp of j's request (by the logical time property). So it must be that the receipt of p_j's try message at i occurs after p_i has broadcast its try. Then, at the time p_i receives p_j's try, it is either trying or critical. In either case, p_i's rules say it has to defer the OK message; thus p_j could not be in C, a contradiction.

Deadlock-freedom is also proved by contradiction. Assume some execution that reaches a point after which no progress is achieved. That is, at that point all the processes are either in R or T, none are in C, and from that point on no process changes regions. By going a little further out in the execution, we can also reach a point after which no messages are ever in transit. Among all the processes in T after that point, let p_i be the process having the request message with the lowest timestamp. Then, since p_i is blocked forever, it must be because some other process p_j has not returned an OK message to it. Process p_j could only have deferred the OK because it was either

in C; but since p_j eventually leaves C, it must eventually send the deferred OK; or

in T; in this case, p_j deferred the OK because the timestamp of its request was smaller than p_i's. Since p_i's request has the smallest timestamp in T now, p_j must have completed; thus, after exiting C, it had to send the deferred OK.

In either case we get a contradiction.

Fairness: it can be shown that the algorithm has lockout-freedom.

Complexity: it is easy to see that there are 2(n − 1) messages per request. The time complexity is left as an exercise.
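The OK/defer decision is the heart of the algorithm and fits in a few lines. In the sketch below, the tuple encoding of the local state is an illustrative assumption; the three cases are exactly the priority rules above.

```python
def respond_to_try(state, incoming_ts):
    """Ricart-Agrawala's rule for answering an incoming try message.

    state       -- ("remainder",), ("critical",), or ("trying", own_ts)
    incoming_ts -- (logical_time, pid) timestamp of the incoming try
    Returns "ok" (send OK now) or "defer" (send it on leaving C).
    Timestamps are (time, pid) pairs; lexicographic order breaks ties.
    """
    kind = state[0]
    if kind == "remainder":
        return "ok"              # neither critical nor trying
    if kind == "critical":
        return "defer"           # all deferred OKs are sent on exit
    own_ts = state[1]            # trying: the earlier timestamp wins
    return "ok" if own_ts > incoming_ts else "defer"
```

Note that the rule is purely local: a process never needs to know who else is trying, only how the incoming timestamp compares with its own.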

Carvalho-Roucairol

This algorithm improves on Ricart-Agrawala by giving a different interpretation to the OK message. When some process p_i sends an OK to some other process p_j, not only does it approve p_j's current request, but it also gives p_j p_i's permission to re-enter C again and again, until p_j sends an OK to p_i in response to a try from p_i.

This algorithm performs well under light load: when a single process is requesting again and again, with no other process interested, it can go critical with no messages sent. Under heavy load, however, it basically reduces to Ricart and Agrawala's algorithm.

This algorithm is closely related to (i.e., is generalized by) the Chandy-Misra Dining Philosophers algorithm.

General Resource Allocation

We now turn to consider more general resource allocation problems in asynchronous networks, as defined earlier.

Burns-Lynch Algorithm

This algorithm solves conjunctive resource problems. Recall the strategies studied in the asynchronous shared memory setting, involving seeking forks in order and waiting for resources one at a time. We can run the same strategies in asynchronous networks. The best way to do this is not via a general simulation of the shared memory model in the network model (recall that the algorithms involve busy-waiting on shared variables; if simulated, each busy-waiting step would involve sending messages). Instead, we have each resource represented by its own process, which must be located at some node. The user process sends a message to the resource process for the first resource it needs; the resource process puts the user index on its queue, and the user process then waits. When the user index reaches the front of the queue, the resource process sends a message back to that user process, which then goes on to wait for its next resource, etc. When a process gets all its resources, it tells its external user to go critical; when exit occurs, the user process sends messages to all the resource processes to remove its index from their queues.
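A resource process here is essentially a FIFO queue manager. The sketch below (the class name and method conventions are assumptions for illustration) shows the grant/release bookkeeping for a single resource; a user process is expected to request its resources one at a time, in the global resource order, waiting for each grant before requesting the next.

```python
from collections import deque

class ResourceProcess:
    """Process managing one resource's wait queue (illustrative sketch)."""

    def __init__(self):
        self.queue = deque()     # user indices waiting, in arrival order
        self.holder = None       # user currently granted the resource

    def request(self, user):
        """A user asks for the resource; returns the user granted now, if any."""
        self.queue.append(user)
        return self._maybe_grant()

    def release(self, user):
        """On exit, drop the user's index and grant the next waiter, if any."""
        if self.holder == user:
            self.holder = None
        elif user in self.queue:
            self.queue.remove(user)
        return self._maybe_grant()

    def _maybe_grant(self):
        """When the front user can be served, send it the grant message."""
        if self.holder is None and self.queue:
            self.holder = self.queue.popleft()
            return self.holder
        return None
```

A user process collects grants one resource at a time and then tells its external user to go critical; on exit, it calls release on every resource process it contacted.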

The analysis is similar to that done in the shared memory case, and depends only on local parameters. The general case has exponential dependence on the number of colors. There are some variations in the literature to improve the running time; this is still a topic of current research (see, e.g., Styer-Peterson, Awerbuch-Saks, Choy-Singh).

Drinking Philosophers

This is a dynamic variant of the general conjunctive (static) resource allocation problem. In this setting, we have the same kind of graph as before, where a process is connected to everyone with whom it might have a resource conflict. But now, when a try request arrives from the outside, it contains an extra parameter: a set B of resources, indicating which resources the process actually needs this time. This set must be a subset of the static set for that process; the twist is that now it needn't be the whole set.

As before, we want to get exclusion, but this time based on the actual resources being requested. We would like to have also deadlock-freedom and lockout-freedom. The concurrency condition is now modified as follows. Suppose that a request is invoked at node i, and during the interval of this request there are no actual conflicting requests. Suppose the execution proceeds so that i and all its neighbors in the underlying graph continue to take steps fairly, and the connecting links continue to operate fairly (note that some processes might remain in C). Then eventually i reaches C.

Chandy and Misra gave a solution to this, which was based on a particular inefficient Dining Philosophers algorithm. Welch and Lynch noticed that the Chandy-Misra algorithm made almost-modular use of the underlying Dining Philosophers algorithm, and so they redid their result with the implicit modularity made explicit. Then it was possible to substitute a more efficient Dining Philosophers algorithm for the original Chandy-Misra subroutine.

The proposed architecture is sketched in the figure below. The Dining Philosophers subroutine executes using its own messages, as usual. In addition, there are new messages for the Drinking Philosophers algorithm.

We assume for simplicity here that each resource is shared between only two processes. The heart of the algorithm is the D_i automata. First we describe the algorithm informally, and then we present the D_i automaton code.

When D_i enters its trying region, needing a certain set of resources, it sends requests for those resources that it needs but lacks. A recipient D_j of a request satisfies the request unless D_j currently also wants the resource or is already using it. In these two cases, D_j defers the request, so that it can satisfy it when it is finished using the resource.

In order to prevent drinkers from deadlocking, a Dining Philosophers algorithm is used as a subroutine.

Figure: Architecture for the Drinking Philosophers problem. Each D_i sits between user i's drinking interface (T, C, E, R) and the shared Dining Philosophers algorithm, and communicates with the other D_j via send and receive.

The resources manipulated by the Dining Philosophers subroutine are priorities for the real resources; there is one dining resource for each drinking resource.

As soon as D_i is able to do so in its drinking trying region, it enters its dining trying region; that is, it tries to gain priority for its maximum set of resources. If D_i ever enters its dining critical region while still in its drinking trying region, it sends demands for the needed resources that are still missing. A recipient D_j of a demand must satisfy it, even if D_j wants the resource, unless D_j is actually using the resource. In that case, D_j defers the demand and satisfies it when D_j is through using the resource.

Once D_i is in its dining critical region, we can show that it eventually receives all its needed resources and never gives them up. Then it may enter its drinking critical region. Once D_i enters its drinking critical region, it may relinquish its dining critical region, since the benefits of having the priorities are no longer needed. Doing so allows some extra concurrency: even if D_i stays in its drinking critical region forever, other drinkers can continue to make progress.

A couple of points about the code deserve explanation. We can show that when a request is received, the resource is always at the recipient; thus it is not necessary for the recipient to check that it has the resource before satisfying or deferring the request. On the other hand, it is possible for a demand for a missing resource to be received, so before satisfying or deferring a demand, the recipient must check that it has the resource.

Another point concerns the question of when the actions of the dining subroutine should be performed. Note that some drinkers could be locked out if D_i never relinquishes the dining critical region. The reason for this is that, as long as D_i is in its dining critical region, it has priority for the resources. Thus, if we are not careful, D_i could cycle through its drinking critical region infinitely many times while other drinkers wait. To avoid this situation, we keep track, using the variable current, of whether the dining critical region was entered on behalf of the current drinking trying region, i.e., whether the latest (dining) crit_i occurred after the latest try_i(B). If it was, then D_i may enter its drinking critical region (assuming it has all the needed resources). Otherwise, D_i must wait until the current dining critical region has been relinquished before continuing.

The full code is given in the figures below.

Correctness proof: Exclusion is straightforward, based on possession of the resources, since we use explicit tokens. Concurrency is also straightforward. To prove lockout-freedom, we need to rely on the Dining Philosophers subroutine. The proof is a bit subtle. First, we argue that the environment of the dining philosophers algorithm preserves dining-well-formedness; this is proved by a straightforward explicit check of the code. Therefore, every execution of the system is dining-well-formed, since the dining subroutine also preserves dining-well-formedness. Therefore, the dining subroutine provides its guarantees (dining exclusion, and lockout-freedom for the dining subroutine) in the context of the drinking algorithm.

Next, we show the following lemma.

Lemma: In any fair drinking-well-formed execution, if the drinking users always return the resources, then the dining users always return the resources.

Proof Sketch: Suppose that the dining resources are granted to D_i. Then D_i sends demands. Consider any recipient D_j of a demand. If D_j has the resource and is not actually using it, then it gives it to i, because, by the exclusion property of the dining subroutine, D_j cannot also be in its dining critical region. On the other hand, if D_j is using the resource, then, by the assumption that no drinker is stuck in its critical region, D_j eventually finishes and satisfies the demand. Thus, eventually, D_i gets all the needed resources and enters its drinking critical region, after which, by the code, it returns the dining resources.

With this, we can prove the following property.

Lemma: In any fair drinking-well-formed execution, if the drinking users always return the resources, then every drinking request is granted.

Proof Sketch: Once a drinking request occurs, say at node i, then, unless the grant occurs immediately, a subsequent dining request occurs at i. Then, by dining lockout-freedom and the preceding lemma, eventually the dining subroutine grants at i. Again by the preceding lemma, eventually i returns the dining resource. But note that i only returns the dining resource if, in the interim, it has done a drinking grant. (The formal proof is by contradiction, as usual.)

State:

drink-region: equals T if the most recent drinking action was try_i(B) for some B, C if it was crit_i(B), etc. Initially R.

dine-region: equals T if the most recent dining action was try_i, C if it was crit_i, etc. Initially R.

need: equals B, where the most recent drinking action had parameter B. Initially empty.

bottles: set of tokens that have been received and not yet enqueued to be sent. Initially, the token for a resource shared between D_i and D_j is in the bottles variable of exactly one of D_i and D_j, with the choice being made arbitrarily.

deferred: set of tokens in the set bottles for which a request or demand has been received since the token was received. Initially empty.

current: Boolean, indicating whether the current dining critical region is on behalf of the current drinking trying region. Initially false.

msgs(j), for all j not equal to i: FIFO queue of messages for D_j, enqueued but not yet sent. Initially empty.

Actions for process i:

try_i(B), for all B a subset of B_i:
  Effect: drink-region := T
          need := B
          for all j not equal to i and b in (need intersect B_j) - bottles:
            enqueue request(b) in msgs(j)

send_i(m, j), for all j not equal to i and m in {request(b), token(b), demand(b) : b in B_i intersect B_j}:
  Precondition: m is at head of msgs(j)
  Effect: dequeue m from msgs(j)

try_i:
  Precondition: dine-region = R
                drink-region = T
  Effect: dine-region := T

Figure: Code for the modular drinking philosophers algorithm, part I.

Actions for process i (cont.):

receive_i(request(b), j), for all j not equal to i and b in B_i intersect B_j:
  Effect: if b in need and drink-region in {T, C} then
            deferred := deferred union {b}
          else
            enqueue token(b) in msgs(j)
            bottles := bottles - {b}

crit_i:
  Effect: dine-region := C
          if drink-region = T then
            for all j not equal to i and b in (need intersect B_j) - bottles:
              enqueue demand(b) in msgs(j)
            current := true

receive_i(demand(b), j), for all j not equal to i and b in B_i intersect B_j:
  Effect: if b in bottles and (b not in need or drink-region is not C) then
            enqueue token(b) in msgs(j)
            bottles := bottles - {b}
            deferred := deferred - {b}

receive_i(token(b), j), for all j not equal to i and b in B_i intersect B_j:
  Effect: bottles := bottles union {b}

crit_i(B), for all B a subset of B_i:
  Precondition: drink-region = T
                B = need
                need is a subset of bottles
                if dine-region = C then current = true
  Effect: drink-region := C
          current := false

exit_i:
  Precondition: dine-region = C
                if drink-region = T then current = false
  Effect: dine-region := E

Figure: Code for the modular drinking philosophers algorithm, part II.

Actions for process i (cont.):

rem_i:
  Effect: dine-region := R

exit_i(B), for all B a subset of B_i:
  Effect: drink-region := E
          for all j not equal to i and b in deferred intersect B_j:
            enqueue token(b) in msgs(j)
          bottles := bottles - deferred
          deferred := empty

rem_i(B), for all B a subset of B_i:
  Precondition: drink-region = E
                B = need
  Effect: drink-region := R

Figure: Code for the modular drinking philosophers algorithm, part III.

We remark that the complexity bounds depend closely on the costs of the underlying dining algorithm.

Stable Property Detection

In this section we consider a new problem. Suppose we have some system of I/O automata, as usual, and suppose we want to superimpose on this system another algorithm that monitors the given algorithm somehow. For instance, the monitoring algorithm can

detect when the given system has terminated execution,

check for violation of some global invariants,

detect some kind of deadlock, where processes are waiting for each other in a cycle to do something, or

compute some global quantity, e.g., the total amount of money.

We shall see some powerful techniques to solve such problems.

Termination for Diffusing Computations

First we consider the termination problem alone. Dijkstra-Scholten gave an efficient algorithm for the special case where the computation originates at one node. In this kind of computation, called a diffusing computation, all nodes are originally quiescent, defined here to mean that they are not enabled to do any locally controlled actions. The algorithm starts with an input from the outside world at one of the nodes, and we assume that no further inputs occur. According to the IOA definitions, once a node receives such a message, it can change state in such a way that it now is enabled to do some locally-controlled actions, e.g., to send messages to other nodes. (The nodes can also do various outputs to the outside world.) When other nodes receive these messages, they in turn might also become enabled to do some locally-controlled actions, and so on.

The termination detection problem is formulated as follows:

If the entire system ever reaches a quiescent state (i.e., all the processes are in quiescent states and there are no messages in transit), then eventually the originator node outputs a special message done.

Of course, the original algorithm does not have such an output. The new system constructed has to be an augmentation of the given system, where each node automaton is obtained from the old automaton in some allowed way.

Below, we state informally what is allowed to solve this problem. (The precise definition should be similar to the superposition definition described in Chandy and Misra's Unity work.)

We may add new state components, but the projection of the start states of the augmented machine must be exactly the start states of the original machine.

The action set can be augmented with new inputs, outputs, and internal actions. The old actions must remain, with the same preconditions, but they may have new effects on the new components; they might also have additional information piggybacked on them, e.g., send(m) now does send(m, c). The old actions remain in the same classes.

The new input actions affect the new components only.

The new output and internal actions can be preconditioned on the entire state (old and new components), but they may only affect the new components. They get grouped into new fairness classes.

The idea in the Dijkstra-Scholten termination detection algorithm is to superimpose a kind of spanning-tree construction on the existing work of the underlying algorithm. Starting from the initial node, every message of the underlying algorithm gets a tree message piggybacked upon it. Just as in the asynchronous algorithm for constructing a spanning tree, each node records the first receipt of a tree message in its state, as its parent. Any subsequent receipts of tree messages are immediately acknowledged; they are not discarded as in the spanning tree protocol, but an explicit acknowledgment is sent back. Only the first one gets held, unacknowledged, for now. Also, if the initial node ever receives a tree message, it immediately acknowledges it. So, as the messages circulate around the network, a spanning tree of the nodes involved in the protocol gets set up.

Now we are going to let this tree converge back upon itself, as in convergecast, in order to report termination back to the initial node. Specifically, each node looks for when both of the following hold at the same time: (a) its underlying algorithm is in a quiescent state, and (b) all its outgoing tree messages have been acknowledged. When it finds this, it "closes up shop": it sends an acknowledgment to its parent, and forgets everything about this protocol. Note that this is different from what had to happen for ordinary broadcast-convergecast. There, we had to allow the nodes to remember that they had participated in the algorithm (marking themselves as "done"), so that they didn't participate another time; if we hadn't, the nodes would just continue broadcasting to each other repeatedly. This time, however, we do allow the nodes to participate any number of times; whether or not they do so is controlled by where the messages of the underlying algorithm get sent.
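The per-node bookkeeping can be sketched as follows. The hook names and return-value conventions below are illustrative assumptions (in the notes this logic is superimposed on the node automaton), and the surrounding message-delivery machinery is not shown.

```python
class DSNode:
    """Dijkstra-Scholten bookkeeping superimposed on one node (sketch)."""

    def __init__(self, is_root=False):
        self.is_root = is_root
        self.parent = None          # node we owe our closing ack to
        self.in_tree = is_root      # whether we currently belong to the tree
        self.unacked = 0            # outgoing tree messages not yet acked
        self.quiescent = not is_root

    def on_send(self):
        """Underlying algorithm sends a message (tree message piggybacked)."""
        self.unacked += 1

    def on_receive(self, sender):
        """Returns 'ack' if the tree message should be acked immediately."""
        self.quiescent = False      # receipt may re-enable local actions
        if self.in_tree:
            return "ack"            # already have a parent (or we are root)
        self.parent = sender        # first tree message: join the tree
        self.in_tree = True
        return None                 # this one is held, unacknowledged

    def on_ack(self):
        self.unacked -= 1

    def try_close(self):
        """Close up shop when quiescent with all outgoing messages acked."""
        if self.in_tree and self.quiescent and self.unacked == 0:
            if self.is_root:
                return "done"       # originator announces termination
            p, self.parent, self.in_tree = self.parent, None, False
            return ("ack_to", p)    # ack the parent and forget everything
        return None
```

Note that try_close resets the node completely, so a later message from the underlying algorithm can pull it back into a fresh tree, as in the example below.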

Example: Suppose the nodes are 0, 1, 2 and 3, and consider the following scenario (see Figure for illustration). (i) Node 0 wakes up, sends a message to 1, and sends a message to 2. (ii) Nodes 1 and 2 receive the messages and set their parent pointers to point to node 0. Then they both wake up and send messages to each other; but since each node already has a parent, it will just send an ack. Then both nodes send messages to 3. (iii) Suppose 1's message gets to node 3 first; then 1 is set to be the parent of 3, and therefore 3 acknowledges 2's message immediately. The four nodes continue for a while, sending messages to each other; they immediately acknowledge each message.

Now suppose the basic algorithm at 3 quiesces. Still, 1 does not close up, because it has an unacknowledged message to 3. (iv) Suppose now 3 quiesces; it is then ready to acknowledge 1, and it forgets it ever participated in the algorithm (actually, it forgets everything). When 1 receives this ack, and since 1 is also done, it sends an ack to 0 and forgets everything. (v) Now suppose 2 sends out messages to 1 and 3. When they receive these messages, they wake up as at the beginning, continue carrying out the basic algorithm, and reset their parent pointers, this time to 2.

This execution can continue in this fashion indefinitely. But if all the nodes ever quiesce, and there are no messages of the basic algorithm in transit, then the nodes will all succeed in converging back to the initial node, in this case 0. When 0 is quiescent and has acknowledgments for all its outgoing messages, it can announce termination.

Figure: Example of an execution of the termination detection algorithm. The arrows on the edges indicate messages in transit, and the arrows parallel to the edges indicate parent pointers.

Suppose the underlying algorithm is A_i. In the two code figures below, we give code for the superimposed version p_i.

We have argued, roughly, above that if the basic algorithm terminates, this algorithm eventually announces it. We can also argue that this algorithm never does announce termination unless the underlying system has actually terminated. We can prove this using the following invariant (cf. homework):

Invariant: All non-idle nodes form a tree rooted at the node with status leader, with the tree edges given by parent pointers.

Francez and Shavit have generalized this algorithm to the case where the computation can initiate at several locations; the idea there is to use several trees and wait for them all to terminate.

Complexity: The number of messages is proportional to the number of messages of the underlying algorithm. A nice property of this algorithm is its locality: if the diffusing computation only operates for a short time, in a small area of the network, the termination protocol only incurs proportionately small costs. On the other hand, if the algorithm works for a very long time before terminating, then the termination detection algorithm requires considerable overhead.
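To make the bookkeeping concrete, here is a toy sequential simulation of the scheme, replaying roughly the four-node scenario above. The Python representation, the topology, and the delivery order are illustrative assumptions; only the rules — count a deficit per unacknowledged tree message, ack immediately at a node that already has a parent, and close up shop when quiescent with all deficits zero — are taken from the algorithm.

```python
from collections import deque, defaultdict

# Toy sequential simulation of Dijkstra-Scholten termination detection.
class Node:
    def __init__(self, name):
        self.name = name
        self.status = "idle"                 # idle / leader / non-leader
        self.parent = None
        self.deficit = defaultdict(int)      # unacked tree messages, per neighbor
        self.quiescent = True
        self.announced = False

channels = defaultdict(deque)                # (src, dst) -> FIFO channel
nodes = {i: Node(i) for i in range(4)}

def send(src, dst, msg):
    channels[(src, dst)].append(msg)
    if msg == "basic":
        nodes[src].deficit[dst] += 1         # one more unacknowledged message

def deliver(src, dst):
    msg = channels[(src, dst)].popleft()
    n = nodes[dst]
    if msg == "ack":
        n.deficit[src] -= 1
    elif n.status == "idle":                 # first contact: join the tree
        n.status, n.parent, n.quiescent = "non-leader", src, False
    else:                                    # already in the tree: ack at once
        send(dst, src, "ack")

def try_close(i):
    n = nodes[i]
    if n.quiescent and all(d == 0 for d in n.deficit.values()):
        if n.status == "non-leader":         # close up shop: ack the parent
            send(i, n.parent, "ack")
            n.status, n.parent = "idle", None
        elif n.status == "leader":           # done: announce termination
            n.status, n.announced = "idle", True

# Node 0 gets an input, becomes leader, and the tree grows to 1, 2, then 3.
nodes[0].status = "leader"
send(0, 1, "basic"); send(0, 2, "basic")
deliver(0, 1); deliver(0, 2)
send(1, 3, "basic"); deliver(1, 3)

# Everyone quiesces; acks convergecast back up the tree toward node 0.
for i in (3, 1, 2):
    nodes[i].quiescent = True
    try_close(i)
    for (s, d) in list(channels):
        while channels[(s, d)]:
            deliver(s, d)
try_close(0)
print(nodes[0].announced)                    # True: termination detected
```

Note the locality property in miniature: only the channels that the basic computation actually used ever carry tree messages or acks.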

New state components:

    status, in {idle, leader, non-leader}, initially idle
    parent, a node or nil, initially nil
    for each neighbor j:
        send(j), a queue of messages, initially empty
        deficit(j), an integer, initially 0

Actions:

    Any input action of node i (excluding a receive):
        New Effect: status := leader

    send(m)_{i,j}:
        New Effect: deficit(j) := deficit(j) + 1

    New action: receive(ack)_{j,i}
        New Effect: deficit(j) := deficit(j) - 1

    receive(m)_{j,i}, where m is a message of A_i:
        New Effect: if status = idle then
                        status := non-leader
                        parent := j
                    else add ack to send(j)

    New action: close-up-shop_i
        Precondition: status = non-leader
                      j = parent
                      send(k) is empty for all k
                      deficit(k) = 0 for all k
                      state of A_i is quiescent
        Effect: add ack to send(j)
                status := idle
                parent := nil

Figure: Code for the Dijkstra-Scholten termination detection algorithm (part I).

Actions (cont.):

    New action: send(ack)_{i,j}
        Precondition: ack is first on send(j)
        Effect: remove first message from send(j)

    New action: done_i
        Precondition: status = leader
                      send(k) is empty for all k
                      deficit(k) = 0 for all k
                      state of A_i is quiescent
        Effect: status := idle

Figure: Code for the Dijkstra-Scholten termination detection algorithm (part II).

Snapshots

Problem Statement

The snapshot problem is to obtain a consistent global state of a running distributed algorithm. By "global state" we mean a state for each process and each FIFO channel. The consistency requirement, roughly speaking, is that the output is a global state that could have occurred at a fixed moment in time.

We make the definition of the problem more precise using the notion of logical time discussed in the last lecture. Recall the definition of logical time: it is possible to assign a unique time to each event, in a way that is increasing at any given node, such that sends precede receives, and such that only finitely many events precede any particular event. Clearly, a given execution can get many logical time assignments. A consistent global state is one that arises from any particular logical time assignment and a particular real time t. For this assignment and this t, there is a well-defined notion of what has happened "up through and including time t". We want to include exactly this information in the snapshot: the states of nodes after the local events up through time t, and the states of channels at time t, i.e., those messages sent by time t and not received by time t, in the order of sending. See the figure for an example.

We can stretch and shrink the execution to align corresponding times at different nodes, and the information in the snapshot is just the states of nodes and channels at the t-cut. Because we might have to shrink and stretch the execution, it is not necessarily true that the state which is found actually occurs at any point in the execution. But in a sense this doesn't matter: it could have occurred, i.e., no process can tell locally that it didn't occur.

Figure: t is a logical time cut.

Actually, we would like more than just any consistent global state; we would like one that is fairly recent. E.g., we want to rule out the trivial solution that always returns the initial global state. For instance, one might want a state that reflects all the events that have actually occurred in real time before the invocation of the snapshot algorithm. So we suppose that the snapshot is initiated by the arrival of a snap input from the outside world at some nonzero number of the nodes (but only once at each node). We do not impose the restriction that the snapshot must originate at a single node, as in the diffusing computation problem. Also, the underlying computation is not necessarily diffusing.

Applications

A simple example is a banking system, where the goal is to count all the money exactly once. More generally, we have the problem of stable property detection. Formally, we say that a property of global states is stable if it persists, i.e., if it holds in some state, then it holds in any subsequent state. For instance, termination (the property of all the nodes being in quiescent states and the channels being empty) is a stable property. Also, unbreakable deadlock, where a cycle of nodes are in states in which they are waiting for each other, is a stable property.

The Algorithm

The architecture is similar to the one used above for the termination-detection algorithm: the underlying A_i is augmented to give p_i, with certain additions that do not affect the underlying automaton's behavior.

We have already sketched one such algorithm, based on an explicit logical time assignment. We now give another solution, by Chandy and Lamport. The algorithm is very much like the snapshot based on logical time, only it does away with the explicit logical time.

The idea is fairly simple. Each node is responsible for snapping its own state, plus the states of its incoming channels. When a node first gets a snap input, it takes a snapshot of the state of the underlying algorithm. Then it immediately puts a marker message in each of its outgoing channels, and starts to record incoming messages in order to snap each of its incoming channels. The marker indicates the dividing line between messages that were sent out before the local snap and messages sent out after it. For example, if this is a banking system and the messages are money, the money before the marker was not included in the local snap, but the money after the marker was.

A node that has already snapped just records these incoming messages until it encounters the marker, at which point it stops recording. Thus, the node has recorded all the messages sent by the sender before the sender did its local snap. Returning to the banking system example: in this case, the recipient node has counted all the money that was sent out before the sender did its snap, and thus not counted by the sender.

There is one remaining situation to consider: suppose that a node receives a marker message before it has done its own local snap (because it has not received a snap input). Then, immediately upon receiving the first marker, the recipient snaps its local state, sends out its markers, and begins recording the incoming channels. The channel upon which it has just received the marker is recorded as empty, however.
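A minimal, single-threaded sketch of the marker rule follows, using the bank setting. The integer "balance" state and the manual delivery order are illustrative assumptions, not part of the Chandy-Lamport algorithm; the rules themselves — snap and emit markers on the first marker (or snap input), record pre-marker messages, and close a channel's recording when its marker arrives — are the algorithm's.

```python
from collections import deque

# Toy Chandy-Lamport snapshot over FIFO channels; each node's underlying
# state is an integer "balance" (illustrative assumption).
class Proc:
    def __init__(self, name, balance, peers):
        self.name, self.balance, self.peers = name, balance, peers
        self.snapped = None                  # local state recorded at snap time
        self.recording = {}                  # peer -> recorded channel contents

    def snap(self, net):
        if self.snapped is None:
            self.snapped = self.balance      # record own state
            self.recording = {p: [] for p in self.peers}
            for p in self.peers:             # marker on every outgoing channel
                net[(self.name, p)].append("marker")

    def receive(self, src, msg, net):
        if msg == "marker":
            self.snap(net)                   # first marker forces a local snap
            self.recording[src] = tuple(self.recording[src])  # channel closed
        else:
            self.balance += msg
            rec = self.recording.get(src)
            if self.snapped is not None and isinstance(rec, list):
                rec.append(msg)              # pre-marker: in transit at the cut

net = {(a, b): deque() for a in "ij" for b in "ij" if a != b}
i, j = Proc("i", 1, ["j"]), Proc("j", 1, ["i"])

i.snap(net)                                  # snap input arrives at i
net[("i", "j")].append(1); i.balance -= 1    # i sends its dollar after the marker
net[("j", "i")].append(1); j.balance -= 1    # j sends its dollar before any marker
i.receive("j", net[("j", "i")].popleft(), net)   # i records the incoming dollar
j.receive("i", net[("i", "j")].popleft(), net)   # marker reaches j: j snaps ($0)
i.receive("j", net[("j", "i")].popleft(), net)   # j's marker closes i's recording

total = i.snapped + j.snapped + sum(sum(c) for p in (i, j)
                                    for c in p.recording.values())
print(total)   # 2: the snapshot accounts for exactly two dollars
```

This is precisely the two-node bank execution worked through later in this section: the snapped state (one dollar at i, one dollar recorded on the channel from j to i) never actually arises, yet the total is correct.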

Modeling: It is not completely straightforward to model a snapshot system. The subtlety is in the atomicity of the snapping and marker-sending steps. To do this much atomically, we seem to need to modify the underlying algorithm in slightly more complicated ways than for the Dijkstra-Scholten algorithm, possibly by blocking some messages from being sent by A_i while the markers are being put into the channels. This really means interfering with the underlying algorithm; but, roughly speaking, we do not interfere by too much.

The code is given in the figure below.

Correctness: We need to argue that the algorithm terminates, and that when it does, it gives a consistent global state corresponding to some fixed logical time.

Termination: We need strong connectivity to prove termination. As soon as the snap input occurs, the recipient node does a local snapshot and sends out markers. Every time anyone gets a marker, it snaps its state if it hasn't already. The markers thus eventually propagate everywhere, and everyone does a local snapshot. Also, the nodes eventually finish collecting the messages on all channels, when they receive the markers.

Consistency: We omit details. The idea is to assign a logical time to each event of the underlying system, by means of such an assignment for the complete system. We do this in

New state components:

    status, in {start, snapping, reported}, initially start
    snap-state, a state of A_i or nil, initially nil
    for each neighbor j:
        channel-snapped(j), a Boolean, initially false, and
        send(j), snap-channel(j), FIFO queues of messages, initially empty

Actions:

    New action: snap_i
        Effect: if status = start then
                    snap-state := the state of A_i
                    status := snapping
                    for all k in neighbors, add marker to send(k)

    Replace the send(m)_{i,j} actions of A_i with internal-send(m)_{i,j} actions,
    which put m at the end of send(j).

    New action: send(m)_{i,j}
        Precondition: m is first on send(j)
        Effect: remove first element of send(j)

    receive(m)_{j,i}, m a message of A_i:
        New Effect: if status = snapping and channel-snapped(j) = false then
                        add m to snap-channel(j)

    New action: receive(marker)_{j,i}
        Effect: if status = start then
                    snap-state := the state of A_i
                    status := snapping
                    for all k in neighbors, add marker to send(k)
                channel-snapped(j) := true

    New action: report-snapshot(s, C)_i, where s is a node state and C is the states of incoming links
        Precondition: status = snapping
                      for all j in neighbors, channel-snapped(j) = true
                      s = snap-state
                      for all j in neighbors, C(j) = snap-channel(j)
        Effect: status := reported

Figure: The snapshot algorithm: code for process i.

such a way that the same time t (with ties broken by indices) is assigned to each event at which a node snaps its local state. The fact that we can do this depends on the fact that there is no message sent at one node after the local snap that arrives at another node before its local snap: if a message is sent after a local snap, it follows the marker on the channel; then, when the message arrives, the marker will have already arrived, and so the recipient will have already done its snapshot. It is obvious that what is returned by each node is the state of the underlying algorithm up to time t. We must also check that the channel recordings give exactly the messages "in transit at logical time t", i.e., the messages sent before the sender's snap and received after the receiver's snap.

Some simple examples: We again use the bank setting. Consider a two-dollar, two-node banking system, where initially each node (i and j) has one dollar. The underlying algorithm has some program (it doesn't matter what it is) that determines when it sends out some money. Consider the following execution (see the figure).

Figure: Execution of a two-node bank system. The final snapshot is depicted below the intermediate states.

1. snap_i occurs: i records the state of the bank, puts a marker in the buffer to send to j, and starts recording incoming messages.

2. i puts its dollar in the buffer to send to j, behind the marker.

3. j sends its dollar to i; i receives it and records it in its location for recording incoming messages.

4. j receives the marker from i, records its local bank state, puts a marker in the buffer to send to i, and snaps the state of the incoming channel as empty.

5. i receives the marker, and snaps the state of the channel as the sequence consisting of one message (the dollar it received before the marker).

Put together, the global snapshot obtained is: one dollar at i, one dollar in the channel from j to i, and the other node and channel empty. This looks reasonable; in particular, the correct total is obtained. But note that this global state never actually arose during the computation. However, we can show a reachability relationship. Suppose that α is the actual execution of the underlying bank algorithm. Then α can be partitioned into α1 α2 α3, where α1 indicates the part before the snapshot starts, and α3 indicates the part after the snapshot ends. Then the snapped state is reachable from the state at the end of α1, and the state at the beginning of α3 is reachable from the snapped state. So the state "could have happened", as far as anyone before or after the snapshot could tell.

In terms of Lamport's partial order: consider the actual execution of the underlying system, depicted in part (a) of the figure. We can reorder it, while preserving the partial order, so that all the states recorded in the snapshot are aligned, as in part (b).

Figure: (a) is the actual execution, and (b) is a possible reordering of it, in which the logical time cut of (a) is a real time cut.

Applications revisited

Stable property detection

A stable property is one that, if it is true of any state s of the underlying system, is also true in all states reachable from s. Thus, if it ever becomes true, it stays true. The following is a simple algorithm to detect the occurrence of a stable property P. First, the algorithm does a global snapshot; then either it collects the information somewhere and computes P of the result, or else it does a distributed algorithm on the static information recorded from the snapshot to determine P of the result. The correctness conditions of the snapshot (the reachability version of those conditions, that is) imply the following: if the output is true, then P is true in the real state of the underlying algorithm at the end of the snapshot; and if the output is false, then P is false in the real state of the underlying algorithm at the beginning of the snapshot. (There is some uncertainty about the interim.)

Example: Termination detection. Assume a system of basic processes with no external inputs, but whose processes don't necessarily begin in quiescent states. This system will execute on its own for a while, and then might quiesce ("terminate"). More precisely, quiescence here refers to all nodes being quiescent and no messages in transit anywhere. We want to be able to detect this situation, since it means that the system will perform no further action (since there are no inputs). Quiescence of global states is a stable property (assuming no inputs). So here is a simple algorithm to detect it: do a snapshot, send the states somewhere, and check whether anything is enabled or in transit. Actually, we don't need to send the whole state to one process: we can have everyone check termination locally, and fan in (i.e., convergecast) the results to some leader along a spanning tree. We only need to convergecast a bit, saying whether the computation is done in the subtree. It may be interesting to contrast this algorithm with Dijkstra-Scholten termination detection: the snapshot algorithm involves the whole network, i.e., it is not local at all. But, on the other hand, if it only has to be executed once, then the overhead is bounded, regardless of how long the underlying algorithm has gone on for.
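The local check plus fan-in just described can be sketched as follows. The snapshot encoding (a `quiescent` flag per node state, a message list per channel) and the fixed spanning tree are illustrative assumptions.

```python
# Convergecast a "done" bit up a spanning tree over snapshot output: each node
# ANDs its own local check with its children's reports.
def local_done(node, snap):
    states, in_chans = snap
    # Locally done: this node's snapped state is quiescent, and every recorded
    # incoming channel of this node is empty.
    return (states[node]["quiescent"]
            and all(not msgs for (u, v), msgs in in_chans.items() if v == node))

def convergecast(node, tree, snap):
    return (local_done(node, snap)
            and all(convergecast(c, tree, snap) for c in tree.get(node, [])))

states = {0: {"quiescent": True}, 1: {"quiescent": True}, 2: {"quiescent": True}}
chans = {(0, 1): [], (1, 0): [], (0, 2): [], (2, 0): []}
tree = {0: [1, 2]}                       # node 0 is the leader/root
print(convergecast(0, tree, (states, chans)))   # True: quiescence detected

chans[(2, 0)].append("m")                # a basic message still in transit
print(convergecast(0, tree, (states, chans)))   # False
```

Only one bit per tree edge travels toward the leader, as the text notes; the full states never need to be collected in one place.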

Example: Deadlock detection. Now the basic algorithm involves processes that are waiting for other processes; e.g., it can be determined from the state of process A_i that it is waiting for process A_j (say, to release a resource). The exact formulation varies in different situations; e.g., a process might be waiting for resources owned by the other processes, etc. In all formulations, however, deadlock essentially amounts to a waiting cycle. If we suppose that every process that is waiting is stopped, and that while it is stopped it will never release resources to allow anyone else to stop waiting, then deadlock is stable. Thus we can detect it by taking snapshots, sending the information to one place, and then doing an ordinary centralized cycle-detection algorithm. Alternatively, after taking the snapshot, we can perform some distributed algorithm to detect cycles on the static data resulting from the snapshot.
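The centralized alternative is ordinary cycle detection on the wait-for graph extracted from the snapshot; a sketch, where the adjacency-list encoding of the snapped wait-for relation is an illustrative assumption:

```python
# Detect deadlock as a cycle in the wait-for graph extracted from a snapshot.
def has_waiting_cycle(waits_for):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {p: WHITE for p in waits_for}
    def dfs(p):
        color[p] = GRAY                   # p is on the current DFS path
        for q in waits_for.get(p, []):
            if color.get(q) == GRAY:      # back edge: a waiting cycle
                return True
            if color.get(q, BLACK) == WHITE and dfs(q):
                return True
        color[p] = BLACK                  # fully explored, no cycle through p
        return False
    return any(color[p] == WHITE and dfs(p) for p in waits_for)

# A waits for B, B waits for C, C waits for A: unbreakable deadlock.
print(has_waiting_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))   # True
print(has_waiting_cycle({"A": ["B"], "B": ["C"], "C": []}))      # False
```

Because deadlock is stable, a True answer about the snapped state is also a True answer about the current state, exactly as the reachability argument above guarantees.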

Distributed Algorithms, December

Lecturer: George Varghese

Lecture

Designing network protocols is complicated by three issues: parallelism, asynchrony, and fault-tolerance. This course has already covered techniques for making protocols robust against Byzantine faults. The Byzantine fault model is attractive because it is general: the model allows a faulty process to continue to behave arbitrarily. In this lecture we will study another general fault model, the self-stabilization model, which is attractive from both a theoretical and a practical viewpoint.

We will compare the two models a little later, but for now let's just say that self-stabilization allows an arbitrary number of faults that stop, while Byzantine models allow a limited number of faults that continue. Stabilizing solutions are also typically much cheaper, and have found their way into real networks. In this lecture we will describe self-stabilization using two network models: an elegant shared memory model and a more practical network model. We will also describe three general techniques for making protocols stabilizing: local checking and correction, counter flushing, and timer flushing.

The Concept of Self-Stabilization

Door Closing and Domain Restriction

Today we will focus on the ability of network protocols to stabilize to "correct behavior" after arbitrary initial perturbation. This property was called self-stabilization by Dijkstra [Dijkstra]. The "self" emphasizes the ability of the system to stabilize by itself, without manual intervention.

A story illustrates the basic idea. Imagine that you live in a house in Alaska, in the middle of winter. You establish the following protocol (set of rules) for people who enter and leave your house: anybody who leaves or enters the house must shut the door after them. If the door is initially shut, and nobody makes a mistake, then the door will eventually return to the closed position. Suppose, however, that the door is initially open, or that somebody forgets to shut the door after they leave. Then the door will stay open until somebody passes through again. This can be a problem if heating bills are expensive, and if several hours can go by before another person goes through the door. It is often a good idea to make the door-closing protocol self-stabilizing. This can be done by adding a spring (or automatic door closer) that constantly restores the door to the closed position.

Figure: Exiting through a revolving door (step 1: entering; step 2: middle of the door; step 3: out at last).

In some cases it is possible to trivialize the problem by "hardwiring" relationships between variables to avoid illegal states; such a technique is actually used in a revolving door. A person enters the revolving door (see the figure), gets into the middle of the door, and finally leaves. It is physically impossible to leave the door open, and yet there is a way to exit through the door.

This technique, which we will call domain restriction, is often a simple but powerful method for removing illegal states in computer systems that contain a single shared memory. Consider two processes A and B that have access to a common memory, as shown in the first part of the figure below. Suppose we want to implement mutual exclusion by passing a token between A and B.

One way to implement this is to use two boolean variables, token_A and token_B. To pass the token, process A sets token_A to false and sets token_B to true. In a self-stabilizing setting, however, this is not a good implementation. For instance, the system will deadlock if token_A = token_B = false in the initial state. The problem is that we have some extra and useless states. The natural solution is to restrict the domain to a single bit called turn, such that turn = 0 when A has the token and turn = 1 when B has the token. By using domain restriction, we ensure that any possible state is also a legal state.

It is often feasible to use domain restriction to avoid illegal states within a single node of a computer network. Domain restriction can be implemented in many ways.* The most natural way is by restricting the number of bits allocated to a set of variables, so that every possible value assigned to the bits corresponds to a legal assignment of values to each of the variables in the set. Another possibility is to modify the code that reads variables, so that only values within the specified domain are read.

*In this example we are really changing the domain; however, we prefer the term domain restriction.

Figure: Token passing among two processes: (1) with shared memory; (2) without shared memory.

Unfortunately, domain restriction cannot solve all problems. Consider the same two processes A and B, which wish to achieve mutual exclusion. This time, however (see part 2 of the figure), A and B are at two different nodes of a computer network. The only way they can communicate is by sending token messages to each other. Thus we cannot use a single turn variable that can be read by both processes. In fact, A must have at least two states: a state in which A has the token, and a state in which A does not have the token; B must also have two such states. Thus we need at least four combined states, of which two are illegal.

Thus, domain restriction at each node cannot prevent illegal combinations across nodes. We need other techniques to detect and correct illegal states of a network. It should be no surprise that the title of Dijkstra's pioneering paper on self-stabilization [Dijkstra] was "Self-Stabilization in spite of Distributed Control".

Figure: A typical mesh network.

Self-Stabilization is Attractive for Networks

We will explore self-stabilization properties for computer networks and network protocols. A computer network consists of nodes that are interconnected by communication channels. The network topology (see the figure) is described by a graph. The vertices of the graph represent the nodes, and the edges represent the channels. Nodes communicate with their neighbors by sending messages along channels. Many real networks, such as the ARPANET, DECNET, and SNA, can be modeled in this way.

A network protocol consists of a program for each network node. Each program consists of code and inputs, as well as local state. The global state of the network consists of the local state of each node, as well as the messages on the network links. We define a catastrophic fault as a fault that arbitrarily corrupts the global network state, but not the program code or the inputs from outside the network.

Self-stabilization formalizes the following intuitive goal for networks: despite a history of catastrophic failures, once catastrophic failures stop, the system should stabilize to correct behavior without manual intervention. Thus self-stabilization is an abstraction of a strong fault-tolerance property for networks. It is an important property of real networks, because:

1. Catastrophic faults occur. Most network protocols are resilient to common failures such as node and link crashes, but not to memory corruption. But memory corruption does happen from time to time. It is also hard to prevent a malfunctioning device from sending out an incorrect message.

2. Manual intervention has a high cost. In a large decentralized network, restoring the network manually after a failure requires considerable coordination and expense. Thus, even if catastrophic faults occur rarely (say, once a year), there is considerable incentive to make network protocols self-stabilizing. A reasonable guideline is that the network should stabilize preferably before the user notices, and at least before the user logs a service call.

These issues are illustrated by the crash of the original ARPANET protocol [Rosen, Perlman]. The protocol was carefully designed never to enter a state that contained three conflicting updates a, b, and c. Unfortunately, a malfunctioning node injected three such updates into the network, and crashed. After this, the network cycled continuously between the three updates. It took days of detective work [Rosen] before the problem was diagnosed. With hindsight, the problem could have been avoided by making the protocol self-stabilizing.

Self-stabilization is also attractive because a self-stabilizing program does not require initialization. The concept of an initial state makes perfect sense for a single sequential program. However, for a distributed program, an initial state seems to be an artificial concept. How was the distributed program placed in such an initial state? Did this require another distributed program? Self-stabilization avoids these questions by eliminating the need for distributed initialization.

Criticisms of Self-Stabilization

Despite the claims of the previous section, there are peculiar features of the self-stabilization model that need justification.

1. The model allows network state to be corrupted, but not program code. However, program code can be protected against arbitrary corruption of memory by redundancy, since code is rarely modified. On the other hand, the state of a program is constantly being updated, and it is not clear how one can prevent illegal operations on the memory by using checksums. It is even harder to prevent a malfunctioning node from sending out incorrect messages.

2. The model only deals with catastrophic faults that stop; Byzantine models deal with continuous faults. However, in Byzantine models only a fraction of the nodes are allowed to exhibit arbitrary behavior. In the self-stabilization model, all nodes are permitted to start with arbitrary initial states. Thus, the two models are orthogonal, and can even be combined.

3. A self-stabilizing program P is allowed to make initial mistakes. However, the important stabilizing protocols that we know of are used for routing, scheduling, and resource allocation tasks. For such tasks, initial errors only result in a temporary loss of service.

Definitions of Stabilization

Execution Definitions

All the existing definitions of stabilization are in terms of the states and executions of a system. We will begin with a definition of stabilization that corresponds to the standard definitions (for example, that of Katz and Perry [KP]). Next, we will describe another definition of stabilization, in terms of external behaviors. We believe that the definition of behavior stabilization is appropriate for large systems that require modular proofs. However, the definition of execution stabilization given below is essential in order to prove results about behavior stabilization.

Suppose we define the correctness of an automaton in terms of a set C of legal executions. For example, for a token passing system, we can define the legal executions to be those in which there is exactly one token in every state, and in which every process periodically receives a token.

What do we mean when we say that an automaton A stabilizes to the executions in set C? Intuitively, we mean that eventually all executions of A begin to "look like" an execution in set C. For example, suppose C is the set of legal executions of a token passing system. Then, in the initial state of A, there may be zero or more tokens. However, the definition requires that eventually there is some suffix of any execution of A in which there is exactly one token in every state.

Formally:

Definition: Let C be a set of executions. We say that automaton A stabilizes to the executions in C if, for every execution α of A, there is some suffix of α that is in C.
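On a finite execution fragment, the suffix condition can be checked directly. Representing an execution as a list of global states, and taking C to be the token-passing condition ("exactly one token in every state"), are illustrative assumptions:

```python
# Check whether a finite execution fragment stabilizes to the legal set C,
# here: states with exactly one token.
def legal(state):
    return sum(state) == 1          # state = tuple of per-node token counts

def stabilizes(execution):
    # True iff some (nonempty) suffix of the execution lies entirely in C.
    for k in range(len(execution)):
        if all(legal(s) for s in execution[k:]):
            return True
    return False

# Starts with two tokens (illegal); then a single token circulates.
ex = [(1, 1, 0), (1, 0, 1), (0, 0, 1), (1, 0, 0), (0, 1, 0)]
print(stabilizes(ex))               # True: the suffix from index 2 is legal
print(stabilizes([(1, 1, 0), (0, 0, 0)]))   # False: ends with zero tokens
```

For the actual definition, executions are in general infinite, so the suffix requirement is a genuine liveness-style condition rather than a finite scan.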

We can extend this definition in the natural way to define what it means for an automaton A to stabilize to the executions of another automaton B.

Definitions of Stabilization based on External Behavior

In the I/O Automaton model, the correctness of an automaton is specified in terms of its external behaviors. Thus, we specify the correctness of a token passing system without any reference to the state of the system. We can do so by specifying the ways in which token delivery and token return actions (to and from some external users of the token system) can be interleaved.

Thus, it is natural to look for a definition of stabilization in terms of external behaviors. We would also hope that such a definition would allow us to modularly compose results about the stabilization of parts of a system, to yield stabilization results about the whole system.

Now, an IOA A is said to solve a problem P if the behaviors of A are contained in P. For stabilization, however, it is reasonable to weaken this definition and ask only that an IOA eventually exhibit correct behavior. Formally:

Definition: Let P be a problem (i.e., a set of behaviors). An IOA A stabilizes to the behaviors in P if, for every behavior β of A, there is a suffix of β that is in P.

Similarly, we can specify that A stabilizes to the behaviors of some other automaton B, in the same way. The behavior stabilization definition used in this section (and in [V]) is a special case of a definition suggested by Nancy Lynch.

Finally, we have a simple lemma that ties together the execution and behavior stabilization definitions. It states that execution stabilization implies behavior stabilization. In fact, the only method we know to prove a behavior stabilization result is to first prove a corresponding execution stabilization result, and then use this lemma. Thus the behavior and execution stabilization definitions complement each other: in this thesis, the former is typically used for specification, and the latter is often used for proofs.

Lemma: If IOA A stabilizes to the executions of IOA B, then IOA A stabilizes to the behaviors of B.

Discussion on the Stabilization Definitions

First, notice that we have defined what it means for an arbitrary IOA to stabilize to some target set or automaton. Typically, we will be interested in proving stabilization properties only for a special kind of automata: unrestricted automata. An unrestricted IOA (UIOA) is one in which all states of the IOA are also start states. Such an IOA models a system that has been placed in an arbitrary initial state by an arbitrary initial fault.

A possible modification, which is sometimes needed, is to only require, in the two definitions above, that the suffix of a behavior (execution) be a suffix of a behavior (execution) of the target set. One problem with this modified definition is that we know of no good proof technique to prove that the behaviors (executions) of an automaton are suffixes of a specified set of behaviors (executions). By contrast, it is much easier to prove that every behavior of an automaton has a suffix that is in a specified set. Thus we prefer to use the simpler definitions for what follows.

Examples from Dijkstra's Shared Memory Model

In Dijkstra's model [Dijkstra], a network protocol is modeled using a graph of finite state machines. In a single move, a single node is allowed to read the state of its neighbors, compute, and then possibly change its state. In a real distributed system, such atomic communication is impossible; typically, communication has to proceed through channels. While Dijkstra's original model is not very realistic, it is probably the simplest model of an asynchronous distributed system. This simple model provided an ideal vehicle for introducing [Dijkstra] the concept of stabilization without undue complexity. For this section, we will use Dijkstra's original model to introduce two important and general techniques for self-stabilization: local checking and correction, and counter flushing. In the next section, we will introduce a more realistic message passing model.

Dijkstra's Shared Memory Model and IOA

How can we map Dijkstra's model into the IOA model? Suppose each node in Dijkstra's model is a separate automaton. Then, in the Input/Output Automata model, it is not possible to model the simultaneous reading of the state of neighboring nodes. The solution we use is to dispense with modularity, and to model the entire network as a single automaton. All actions, such as reading the state of neighbors and computing, are internal actions. The asynchrony in the system, which Dijkstra modeled using a demon, is naturally a part of the IOA model. Also, we will describe the correctness of Dijkstra's systems in terms of executions of the automaton.

Formally:

A shared memory network automaton N for graph G = (V, E) is an automaton in which:

1. The state of N is the cross-product of a set of node states S_u(N), one for each node u in V. For any state s of N, we use s|u to denote s projected onto S_u. This is also read as "the state of node u in global state s."

2. All actions of N are internal actions, and are partitioned into sets A_u(N), one for each node u in V.

3. Suppose (s, pi, s') is a transition of N and pi belongs to A_u(N). Consider any state t of N such that t|u = s|u and t|v = s|v for all neighbors v of u. Then there is some transition (t, pi, t') of N such that t'|v = s'|v for u and all of u's neighbors in G.

4. Suppose (s, pi, s') is a transition of N and pi belongs to A_u(N). Then s'|v = s|v for all v != u.

Informally, the third condition requires that the transitions of a node u in V depend only on the state of node u and the states of the neighbors of u in G. The fourth condition requires that the effect of a transition assigned to node u in V can only be to change the state of u.

Figure: Dijkstra's protocol for token passing on a line. (The figure shows the line drawn vertically, with Bottom = Process 0 and Top = Process n-1; processes below the token have up = true and bit 1, while processes above it have up = false and bit 0.)

Dijkstra's Second Example as Local Checking and Correction

In this section we will begin by reconsidering the second example in [Dijkstra]. The derivation here is based on some unpublished work I did with Anish Arora and Mohamed Gouda at the University of Texas. This protocol is essentially a token passing protocol on a line of nodes, with process indices ranging from 0 to n-1. Imagine that the line is drawn vertically, so that Process 0 is at the bottom of the line (and hence is called bottom), and Process n-1 is at the top of the line (and called top). This is shown in the figure. The down neighbor of Process i is Process i-1, and the up neighbor is Process i+1. Processes n-1 and 0 are not connected.

Dijkstra observed that it is impossible, without randomization, to solve mutual exclusion in a stabilizing fashion if all processes have identical code. To break symmetry, he made the code for the top and bottom processes different from the code for the others.

Dijkstra's second example is modeled by the automaton D2 shown in the figure. Each process i has a boolean variable up_i and a bit x_i. Roughly, up_i is a pointer at node i that points in the direction of the token, and x_i is a bit that is used to implement token passing. The first figure above shows a state of this protocol when it is working correctly. First, there can be at most two consecutive nodes whose up pointers differ in value, and the token is at one of these two nodes. If the two bits at the two nodes are different, as in the figure, then the token is at the upper node; else the token is at the lower node.

The state of the system consists of a boolean variable up_i and a bit x_i, one for every process i in the line.

We will assume that up_0 = true and up_{n-1} = false, by definition.

In the initial state, x_i = 0 for 0 <= i <= n-1, and up_i = false for 0 < i < n-1.

Move Up_0 (action for the bottom process only, to move the token up):
  Precondition: x_0 = x_1 and up_1 = false
  Effect: x_0 := not x_0

Move Down_{n-1} (action for the top process only, to move the token down):
  Precondition: x_{n-1} != x_{n-2}
  Effect: x_{n-1} := x_{n-2}

Move Up_i, 0 < i < n-1 (action for the other processes, to move the token up):
  Precondition: x_i != x_{i-1}
  Effects: x_i := x_{i-1}; up_i := true (point upwards, in the direction the token was passed)

Move Down_i, 0 < i < n-1 (action for the other processes, to move the token down):
  Precondition: x_i = x_{i+1} and up_i = true and up_{i+1} = false
  Effect: up_i := false (point downwards, in the direction the token was passed)

All actions are in a separate class.

Figure: Automaton D2, a version of Dijkstra's second example with initial states. The protocol does token passing on a line, using nodes with at most 4 states.
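The figure above can be exercised directly. Below is a small simulation sketch (our own, not the authors' code) of automaton D2 for a line of n >= 3 processes; the names x, up, Move Up, and Move Down follow the figure, but the class structure and the enabled/step methods are our own packaging.

```python
class D2:
    def __init__(self, n):
        self.n = n
        self.x = [0] * n                  # initial state from the figure
        self.up = [False] * n
        self.up[0] = True                 # up_0 = true by definition

    def enabled(self):
        """All currently enabled actions, as (name, i) pairs."""
        acts, x, up, n = [], self.x, self.up, self.n
        if x[0] == x[1] and not up[1]:
            acts.append(("MoveUp", 0))
        if x[n - 1] != x[n - 2]:
            acts.append(("MoveDown", n - 1))
        for i in range(1, n - 1):
            if x[i] != x[i - 1]:
                acts.append(("MoveUp", i))
            if x[i] == x[i + 1] and up[i] and not up[i + 1]:
                acts.append(("MoveDown", i))
        return acts

    def step(self, action):
        name, i = action
        if name == "MoveUp" and i == 0:
            self.x[0] = 1 - self.x[0]     # bottom flips its bit
        elif name == "MoveDown" and i == self.n - 1:
            self.x[i] = self.x[i - 1]     # top copies the bit below it
        elif name == "MoveUp":
            self.x[i] = self.x[i - 1]
            self.up[i] = True             # point up, where the token went
        else:
            self.up[i] = False            # point down, where the token went
```

Starting from the initialized state, scanning `enabled()` at every step always yields exactly one action, and the token bounces up and down the line, as the text below describes.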

For the present, assume that all processes start with x_i = 0. Also, initially assume that up_i = false for all processes other than Process 0. We will remove the need for such initialization below. We start by understanding the correct executions of this protocol when it has been correctly initialized.

A process i is said to have the token when any action at Process i is enabled. As usual, the system is correct when there is at most one token in the system. Now, it is easy to see that in the initial state only Move Up_0 is enabled. Once node 0 makes a move, then only Move Up_1 is enabled, followed by Move Up_2, and so on, as the token travels up the line. Finally, the token reaches node n-1, and we reach a state s in which x_{i-1} = x_i for 0 < i <= n-2, and x_{n-2} != x_{n-1}. Also, in state s, up_i = true for i <= n-2, and up_{n-1} = false. Thus in state s, Move Down_{n-1} is enabled, and the token begins to move down the line by executing Move Down_{n-1}, followed by Move Down_{n-2}, and so on, until we reach a state like the initial state (with the bits complemented). Then the cycle continues. Thus, in correct executions, the token is passed up and down the line.

We describe these good states of D2 (the states that occur in correct executions) in terms of local predicates. In the shared memory model, a local predicate is any predicate that only refers to the state variables of a pair of neighbors. Thus, in a good state of D2, two properties are true for any Process i other than 0:

1. If up_{i-1} = up_i, then x_{i-1} = x_i.

2. If up_i = true, then up_{i-1} = true.

First, we prove that if these two local predicates hold for all 0 < i <= n-1, then there is exactly one action enabled. Intuitively, since up_{n-1} = false and up_0 = true, we can start with process n-1 and go down the line until we find a pair of nodes i and i+1 such that up_{i+1} = false and up_i = true. Consider the first such pair. The second predicate guarantees us that there is exactly one such pair. The first predicate then guarantees that all nodes j <= i have x_j = x_i, and all nodes k > i have x_k = x_{i+1}. Thus only one action is enabled. If x_i = x_{i+1} and i != 0, then only Move Down_i is enabled. If x_i = x_{i+1} and i = 0, then only Move Up_0 is enabled. If x_i != x_{i+1} and i+1 != n-1, then only Move Up_{i+1} is enabled. If x_i != x_{i+1} and i+1 = n-1, then only Move Down_{n-1} is enabled.

Similarly, we can show that if exactly one action is enabled, then all local predicates hold. Thus, if the system is in a bad state, some pair of neighbors will be able to detect this fact. Hence we conclude that D2 is locally checkable. However, we would like to go even further, and be able to correct the system to a good state by adding extra correction actions to each node. If we can add correction actions so that all link predicates will eventually become true, then we say that D2 is also locally correctable. Adding local correction actions is tricky, because the correction actions of nodes may interfere and result in a form of thrashing.

However, a line is a special case of a tree: with, say, 0 as the root, each node i other than 0 can consider i-1 to be its parent. The tree topology suggests a simple correction strategy. For each node i other than 0, we can add a new action Correct Child_i, which is in a separate class for each i. Basically, Correct Child_i checks whether the link predicate on the link between i and its parent is true. If not, i changes its state such that the predicate becomes true. Notice that Correct Child_i leaves the state of i's parent unchanged. Suppose j is the parent of i and k is the parent of j. Then Correct Child_i will leave the local predicate on link (j, k) true if it was true in the previous state.

Thus we have an important stability property: correcting a link does not affect the correctness of links above it in the tree. Using this, it is easy to see that eventually all links will be in a good state, and so the system is in a good state. We will ignore the details of this proof, but the basic idea should be clear. Dijkstra's actual code does not use an explicit correction action. However, we feel that this way of understanding his example is clearer, and illustrates a general method. The general method of local checking and correction was first introduced in [APV].
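One plausible concrete realization of Correct Child_i (a sketch of our own; the particular repair rule below is an assumption, not Dijkstra's code) is: copy the parent's bit and, for interior nodes, set up_i to false, which makes the second predicate vacuously true. Correcting outward from the root then never disturbs a link above the one being repaired.

```python
import random

n = 8
# Adversarial initial state: arbitrary bits and pointers, except that
# up_0 = true and up_{n-1} = false are fixed by definition.
x = [random.randint(0, 1) for _ in range(n)]
up = [random.choice([True, False]) for _ in range(n)]
up[0], up[n - 1] = True, False

def link_ok(i):
    """The two local predicates for the link (i-1, i)."""
    p1 = up[i - 1] != up[i] or x[i - 1] == x[i]
    p2 = not up[i] or up[i - 1]
    return p1 and p2

def correct_child(i):
    """Repair link (i-1, i) by changing only node i's own state."""
    x[i] = x[i - 1]
    if i < n - 1:
        up[i] = False        # predicate 2 becomes vacuously true

# Correct outward from root 0; a correction never invalidates links above.
for i in range(1, n):
    if not link_ok(i):
        correct_child(i)

assert all(link_ok(i) for i in range(1, n))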

Dijkstra's First Example as Counter Flushing

The counter flushing paradigm described in this section is taken from [V].

Dijkstra's first example is modeled by the automaton D1 shown in the figure. As in the previous example, the nodes, once again numbered from 0 to n-1, are arranged such that node 1 has node 0 and node 2 as its neighbors, and so on. However, in this case we also assume that Processes 0 and n-1 are neighbors. In other words, by making 0 and n-1 adjacent, we have made the line into a ring. For process i, let us call Process i-1 (we assume that all arithmetic on indices is mod n, and on counters is mod n+1) the counterclockwise neighbor of i, and i+1 the clockwise neighbor of i.

Each node has a counter count_i in the range {0, ..., n} that is incremented mod n+1. Once again, the easiest way to understand this protocol is to understand what happens when it is properly initialized. Thus, assume that initially Process 0 has its counter set to 1, while all other processes have their counter set to 0. Processes other than 0 are only allowed to move (see the figure) when their counter differs in value from that of their counterclockwise neighbor; in this case, the process is allowed to make a move by setting its counter to equal that of its counterclockwise neighbor. Thus initially only Process 1 can make a move, after which Process 1 has its counter equal to 1; next only Process 2 can move, after which Process 2 sets its counter equal to 1; and so on, until the value 1 moves clockwise

around the ring, until all processes have their counter equal to 1.

The state of the system consists of an integer variable count_i in {0, ..., n}, one for every process i in the ring.

We assume that Processes 0 and n-1 are neighbors.

In the initial state, count_i = 0 for 0 < i <= n-1, and count_0 = 1.

Move_0 (action for Process 0 only):
  Precondition: count_0 = count_{n-1} (equal to counterclockwise neighbor)
  Effect: count_0 := (count_0 + 1) mod n+1 (increment counter)

Move_i, 0 < i <= n-1 (action for the other processes):
  Precondition: count_i != count_{i-1} (not equal to counterclockwise neighbor)
  Effect: count_i := count_{i-1} (set equal to counterclockwise neighbor)

All actions are in a separate class.

Figure: Automaton D1, a version of Dijkstra's first example with initial states. The protocol does token passing on a ring, using nodes with n+1 states.

Process 0, on the other hand, cannot make a move until Process n-1 has the same counter value as Process 0. Thus, until Process n-1 sets its counter to 1, Process 0 cannot make a move. However, when this happens, Process 0 increments its counter mod n+1, to 2. Then the cycle repeats, as now the value 2 begins to move across the ring (assuming n >= 2), and so on. Thus, after proper initialization, this system does perform a form of token passing on a ring: each node is again considered to have the token when the system is in a state in which the node can take a move.
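The ring dynamics, and the stabilization argument made below, can be watched in a few lines of simulation (a sketch of our own, not from the notes; the scheduler here is random, which is one legal resolution of the asynchrony):

```python
import random

random.seed(1)
n = 6
K = n + 1                        # counters range over {0, ..., n}
count = [random.randrange(K) for _ in range(n)]   # ARBITRARY start state

def enabled():
    """Indices of processes that may move, i.e., that hold a token."""
    acts = [i for i in range(1, n) if count[i] != count[i - 1]]
    if count[0] == count[n - 1]:
        acts.append(0)
    return acts

def step(i):
    if i == 0:
        count[0] = (count[0] + 1) % K   # Process 0 increments mod n+1
    else:
        count[i] = count[i - 1]         # others copy counterclockwise

# Run well past the worst-case stabilization time; since K exceeds the
# number of processes, Process 0 eventually reaches a fresh value that
# flushes the ring, after which exactly one token exists in every state.
for _ in range(10 * n * n):
    step(random.choice(enabled()))
for _ in range(100):
    acts = enabled()
    assert len(acts) == 1
    step(acts[0])
```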

It is easy to see that the system is in a good state iff the following local predicates are true:

1. For 0 < i <= n-1: either count_i = count_{i-1} or count_i = count_{i-1} - 1 (mod n+1).

2. Either count_0 = count_{n-1} or count_0 = count_{n-1} + 1 (mod n+1).
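For a small ring, the two local predicates can be checked by brute force (a sketch of our own): every state satisfying them has exactly one enabled process, and the predicates are closed under D1's transitions, which is the shape of local checkability used later in this lecture.

```python
from itertools import product

n, K = 4, 5                      # counters in {0, ..., n}, K = n + 1 values

def enabled(c):
    acts = [i for i in range(1, n) if c[i] != c[i - 1]]
    if c[0] == c[n - 1]:
        acts.append(0)
    return acts

def predicates(c):
    """The two local predicates characterizing good states."""
    ok = all(c[i] in (c[i - 1], (c[i - 1] - 1) % K) for i in range(1, n))
    return ok and c[0] in (c[n - 1], (c[n - 1] + 1) % K)

def step(c, i):
    c = list(c)
    c[i] = (c[i] + 1) % K if i == 0 else c[i - 1]
    return tuple(c)

for c in product(range(K), repeat=n):
    if predicates(c):
        acts = enabled(c)
        assert len(acts) == 1                 # good state: exactly one token
        assert predicates(step(c, acts[0]))   # and the predicates are closed
```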

The system is locally checkable, but it does not appear to be locally correctable. However, it does stabilize, using a paradigm that we can call counter flushing. Even if the counter values are arbitrarily initialized in the range {0, ..., n}, the system will eventually begin executing as some suffix of a properly initialized execution. We will prove this informally, using three claims.

1. In any execution, Process 0 will eventually increment its counter. Suppose not. Then, since Process 0 is the only process that can produce new counter values, the number of distinct counter values cannot increase. If there are two or more distinct counter values, then moves by processes other than 0 will reduce the number of distinct counter values to 1, after which Process 0 will increment its counter.

2. In any execution, Process 0 will eventually reach a fresh counter value that is not equal to the counter value of any other process. To see this, note that in the initial state there are at most n distinct counter values. Thus there is some counter value, say m, that is not present in the initial state. Since Process 0 keeps incrementing its counter, Process 0 will eventually reach m, and in the interim no other process can set its counter value to m.

3. Any state in which Process 0 has a fresh counter value m is eventually followed by a state in which all processes have counter value m. It is easy to see that the value m moves clockwise around the ring, flushing any other counter values, while Process 0 remains at m. This is why I call this paradigm counter flushing.

The net effect is that any execution of D1 eventually reaches a good state, in which it remains. The general counter flushing paradigm can be stated roughly as follows:

1. A sender (in our case, Process 0) periodically sends a message to a set of responders; after all responders have received the message, the sender sends the next message. (This corresponds to one cycle around the ring in our case.)

2. To make the protocol stabilizing, the sender numbers each message with a counter. The size of the counter must be greater than the maximum number of distinct counter values that can be present in the network in any state. The sender increments its counter after the message has reached all responders.

3. The protocol must ensure that the sender's counter value will always increment, and so eventually reach a fresh value not present in the network. A freshly numbered message m from the sender must guarantee that all old counter values will be flushed from the network before the sender sends the next message.

In Dijkstra's first example, the flushing property is guaranteed because the topology consists of a ring of unidirectional links. Similar forms of counter flushing can be used to implement Data Links [AfekB] and token passing [DolevIM] between a pair of nodes. Counter flushing is, however, not limited to rings or pairs of nodes: Katz and Perry [KatzP] extend the use of counter flushing to arbitrary networks in an ingenious way. The stabilizing end-to-end protocol [APV] is obtained by first applying local correction to the Slide protocol of [AGR], and then applying a variant of counter flushing to the Majority protocol of [AGR].

Message Passing Model

Having understood the techniques of counter flushing and local checking and correction in a shared memory model, we move to a more realistic message passing model. The work in the next three sections is based on [V], which formalizes the ideas introduced in [APV]. To model a network protocol, we will model the network topology, the links between nodes, and the nodes themselves. Our model is essentially the standard asynchronous message passing model, except for two major differences:

1. The major difference is that links are restricted to store at most one packet at a time.

2. We assume that for every pair of neighbors there is some a priori way of assigning one of the two nodes as the leader of the pair.

We will argue that, even with these differences, our model can be implemented in real networks.

Modeling the Topology

We will call a directed graph (V, E) symmetric if for every edge (u, v) in E there is an edge (v, u) in E.

Definition: A topology graph G = (V, E, l) is a symmetric directed graph (V, E) together with a leader function l such that for every edge (u, v) in E, l(u, v) = l(v, u), and either l(u, v) = u or l(u, v) = v.

We use E(G) and V(G) to denote the set of edges and nodes in G. If it is clear what graph we mean, we sometimes simply say E and V. As usual, if (u, v) is in E, we will call v a neighbor of u.
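The definition is easy to make concrete (a minimal sketch of our own; the rule "the leader is the smaller endpoint" is one arbitrary but a-priori-agreed choice, not mandated by the definition):

```python
# A symmetric edge set over nodes {0, 1, 2}, forming a line 0 - 1 - 2.
E = {(0, 1), (1, 0), (1, 2), (2, 1)}
assert all((v, u) in E for (u, v) in E)          # (V, E) is symmetric

def l(u, v):
    """Leader function: picks one endpoint, the same for both orderings."""
    assert (u, v) in E
    return min(u, v)

# l satisfies the two conditions of the definition on every edge.
assert all(l(u, v) == l(v, u) and l(u, v) in (u, v) for (u, v) in E)
```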

Modeling Links

Traditional models of a data link have used what we call Unbounded Storage Data Links, which can store an unbounded number of packets. Now, real physical links do have bounds on the number of stored packets; however, the unbounded storage model is a useful abstraction in a non-stabilizing context.

Each packet p belongs to the packet alphabet P defined above.
The state of the automaton consists of a single variable Q_{u,v} in P union {nil}.

Send_{u,v}(p): input action
  Effect: if Q_{u,v} = nil, then Q_{u,v} := p

Free_{u,v}: output action
  Precondition: Q_{u,v} = nil
  Effect: none

Receive_{u,v}(p): output action
  Precondition: p = Q_{u,v} != nil
  Effect: Q_{u,v} := nil

The Free and Receive actions are in separate classes.

Figure: Unit Storage Data Link automaton.
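The UDL automaton of the figure is small enough to transcribe directly (a sketch of our own packaging; Send is an input action, always enabled, while Free and Receive are output actions whose preconditions appear as explicit checks):

```python
class UDL:
    def __init__(self, q=None):
        # UIOA: any initial value of Q is a legal start state, including a
        # stale packet left over from an initial fault.
        self.q = q                  # None plays the role of nil

    def send(self, p):              # input action Send(p)
        if self.q is None:          # if a packet is already stored, the
            self.q = p              # newly sent packet is silently dropped

    def free_enabled(self):         # precondition of output action Free
        return self.q is None

    def receive(self):              # output action Receive(p)
        assert self.q is not None, "precondition: link stores a packet"
        p, self.q = self.q, None
        return p

link = UDL()
link.send("a")
link.send("b")                      # dropped: the link already holds "a"
assert link.receive() == "a" and link.free_enabled()

stale = UDL("junk")                 # arbitrary start state
assert stale.receive() == "junk"    # the stale packet is simply delivered
```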

Unfortunately, this is no longer true in a stabilizing setting. If the link can store an unbounded number of packets, it can have an unbounded number of bad packets in the initial state. It has been shown [DolevIM] that almost any nontrivial task is impossible in such a setting. Thus, in a stabilizing setting, it is necessary to define Data Links that have bounded storage.

A network automaton for topology graph G consists of a node automaton for every vertex in G, and one channel automaton for every edge in G. We will restrict ourselves to a special type of channel automaton: a unit storage data link, or UDL for short. Intuitively, a UDL can store at most one packet at any instant. Node automata communicate by sending packets to the UDLs that connect them. In the next section, we will argue that a UDL can be implemented over real physical channels.

We fix a packet alphabet P. We assume that P = P_data union P_control consists of two disjoint packet alphabets; these correspond to what we call data packets and control packets. The specification for a UDL will allow both data and control packets to be sent on a UDL.

Definition: We say that C_{u,v} is the UDL corresponding to ordered pair (u, v) if C_{u,v} is the IOA defined in the figure.

C_{u,v} is a UIOA, since we have not defined any start states for C_{u,v} (so all of its states are start states). The external interface to C_{u,v} includes an action to send a packet at node u (Send_{u,v}(p)), an action to receive a packet at node v (Receive_{u,v}(p)), and an action (Free_{u,v}) to tell the sender that the link is ready to accept a new packet. The state of C_{u,v} is simply a single variable Q_{u,v} that stores a packet, or has the default value of nil.

Notice two points about the specification of a UDL. The first is that if the UDL has a packet stored, then any new packet sent will be dropped. Second, the Free action is enabled continuously whenever the UDL does not contain a packet.

Modeling Network Nodes and Networks

Next, we specify node automata. We do so using a set that contains a node automaton for every node in the topology graph. For every edge incident to a node u, a node automaton N_u must have interfaces to send and receive packets on the channels corresponding to that edge. However, we will go further, and require that nodes obey a certain stylized convention in order to receive feedback from, and send packets on, links.

In the specification for a UDL, if a packet p is sent when the UDL already has a packet stored, then the new packet p is dropped. We will prevent packets from being dropped by requiring that the sending node keep a corresponding free variable for the link, which records whether or not the link is free to accept new packets. The sender sets the free variable to true whenever it receives a Free action from the link. We require that the sender only send packets on the link when the free variable is true. Finally, whenever the sender sends a packet on the link, the sender sets its free variable to false.

We wish the interface to a UDL to stabilize to good behavior even when the sender and link begin in arbitrary states. Suppose the sender and the link begin in arbitrary states. Then we can have two possible problems. First, if free = true but the UDL contains a packet, then the first packet sent by the sender can be dropped. However, it is easy to see that all subsequent packets will be accepted and delivered by the link. This is because, after the first packet is sent, the sender will never set free to true unless it receives a Free notification from the link; but a Free notification is delivered to the sender only when the link is empty. The second possible problem is deadlock. Suppose that initially free = false, but the channel does not contain a packet. To avoid deadlock, the UDL specification ensures that the Free action is enabled continuously whenever the link does not contain a packet.
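Both failure modes can be exercised in a few lines (a sketch of our own; the round-robin scheduler and integer payloads are assumptions made for illustration, not part of the model):

```python
import itertools

def run(q0, free0, n_packets=5):
    q, free = q0, free0             # arbitrary initial link / sender state
    delivered, i = [], 0
    for _ in range(100):
        if free and i < n_packets:  # sender: send only when free = true
            if q is None:
                q = i               # packet payload is just its index
            # else: the link already holds a packet, so packet i is dropped
            free, i = False, i + 1
        elif q is not None:
            delivered.append(q)     # Receive at the far end empties the link
            q = None
        elif not free:
            free = True             # Free notification from the empty link
    return delivered

# Over all four "corrupted" starting combinations, at most the FIRST packet
# is lost, and no run deadlocks.
for q0, free0 in itertools.product([None, "stale"], [True, False]):
    good = [p for p in run(q0, free0) if p != "stale"]
    assert good in ([0, 1, 2, 3, 4], [1, 2, 3, 4])
```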

A formal description of node automata can be found in [V]; we omit it here. Finally, a network automaton is the composition of a set of node and channel automata.

Figure: Implementing a UDL over a physical channel. (The figure shows a Data Link Sender holding a one-packet queue, connected through a physical channel to a Data Link Receiver; Send(p), Free, and Receive(p) form the external interface.)

Implementing the Model in Real Networks

In a real network implementation, the physical channel connecting any two neighboring nodes would typically not be a UDL. For example, a telephone line connecting two nodes can often store more than one packet. The physical channel may also not deliver a free signal. Instead, an implementation can construct a Data Link protocol on top of the physical channel, such that the resulting Data Link protocol stabilizes to the behaviors of a UDL (e.g., [AfekB, S]).

The figure shows the structure of such a Data Link protocol over a physical link. The sender end of the Data Link protocol has a queue that can contain a single packet. When the queue is empty, the Free signal is enabled. When a Send(p) arrives and the queue is empty, p is placed on the queue; if the queue is full, p is dropped. If there is a packet on the queue, the sender end constantly attempts to send the packet. When the receiving end of the Data Link receives a packet, the receiver sends an ack to the sender. When the sender receives an ack for the packet currently in the queue, the sender removes the packet from the queue.

If the physical channel is initially empty, and the physical channel is FIFO (i.e., does not permute the order of packets), then a standard stop-and-wait, or alternating bit, protocol [BSW] will implement a UDL. However, if the physical channel can initially store packets, then the alternating bit protocol is not stabilizing [S]. There are two approaches to creating a stabilizing stop-and-wait protocol. Suppose the physical channel can store at most X packets in both directions. Then [AfekB] suggest numbering packets using a counter that has at least X + 1 values. Suppose instead that no packet can remain on the physical channel for more than a bounded amount of time; [S] exploits such a time bound to build a stabilizing Data Link protocol. The main idea is to use either numbered packets (counter flushing) or timers (timer flushing) to flush the physical channel of stale packets.

A stop-and-wait protocol is not very efficient over physical channels that have a high transmission speed and/or high latency. It is easy to generalize a UDL to a Bounded Storage Data Link, or BDL, that can store more than one packet.

Local Checking and Correction in our Message Passing Model

In this section, we introduce formal definitions of local checkability and local correctability. The definitions for a message passing model will be slightly more complex than the corresponding ones for a shared memory model, because of the presence of channels between nodes.

Link Subsystems and Local Predicates

Consider a network automaton with graph G. Roughly speaking, a property is said to be local to a subgraph G' of G if the truth of the property can be ascertained by examining only the components specified by G'. We will concentrate on link subsystems, which consist of a pair of neighboring nodes u and v and the channels between them. It is possible to generalize our methods to arbitrary subsystems.

In the following definitions, we fix a network automaton N = Net(G, N).

Definition: We define the (u, v) link subsystem of N as the composition of N_u, C_{u,v}, C_{v,u}, and N_v.

For any state s of N, s|u denotes s projected onto node N_u, and s|(u, v) denotes s projected onto C_{u,v}. Thus, when N is in state s, the state of the (u, v) subsystem is the tuple (s|u, s|(u, v), s|(v, u), s|v).

A predicate L of N is a subset of the states of N. Let (u, v) be some edge in graph G of N. A local predicate L_{u,v} of N for edge (u, v) is a subset of the states of the (u, v) subsystem in N. We use the word "local" because L_{u,v} is defined in terms of the (u, v) subsystem.

The following definition provides a useful abbreviation. It describes what it means for a local property to hold in a state s of the entire automaton.

Definition: We say that a state s of N satisfies a local predicate L_{u,v} of N iff (s|u, s|(u, v), s|(v, u), s|v) is in L_{u,v}.

We will make frequent use of the concept of a closed predicate. Intuitively, a property is closed if it remains true once it becomes true. In terms of local predicates:

Definition: A local predicate L_{u,v} of network automaton N is closed if, for all transitions (s, pi, s') of N, if s satisfies L_{u,v}, then so does s'.

The following definitions provide two more useful abbreviations. The first gives a name to a collection of local predicates, one for each edge in the graph. The second, the conjunction of a collection of local properties, is the property that is true when all local properties hold at the same time. We will require that the conjunction of the local properties be nontrivial; i.e., there is some global state that satisfies all the local properties.

Definition: L is a link predicate set for N = Net(G, N) if for each (u, v) in E(G) there is some L_{u,v} such that:

1. If (a, b, c, d) is in L_{u,v}, then (d, c, b, a) is in L_{v,u}; i.e., L_{u,v} and L_{v,u} are identical except for the way the states are written down.

2. L = {L_{u,v} : (u, v) in E(G)}.

3. There is at least one state s of N such that s satisfies L_{u,v} for all L_{u,v} in L.

Definition: The conjunction of a link predicate set L is the predicate {s : s satisfies L_{u,v} for all L_{u,v} in L}. We use Conj(L) to denote the conjunction of L.

Note that Conj(L) cannot be the null set, by the definition of a link predicate set.

Local Checkability

Suppose we wish a network automaton N to satisfy some property; an example would be the property "all nodes have the same color." We can often specify a property of N formally using a predicate L of N. Intuitively, N can be locally checked for L if we can ascertain whether L holds by checking all link subsystems of N. The motivation for introducing this notion is performance: in a distributed system, we can check all link subsystems in parallel in constant time. We formalize the intuitive notion of a locally checkable property as follows.

Definition: A network automaton N is locally checkable for predicate L using link predicate set L if:

1. L is a link predicate set for N, and L = Conj(L).

2. Each L_{u,v} in L is closed.

The first item in the definition requires that L hold iff a collection of local properties all hold. The second item is perhaps more surprising: it requires that each local property also be closed.

We add this extra requirement because, in an asynchronous distributed system, it appears to be impossible to check whether an arbitrary local predicate holds all the time. What we can do is to sample the local subsystem periodically, to see whether the local property holds. Suppose the network automaton consists of three nodes u, v, and w, such that v is the neighbor of both u and w. Suppose the property L that we wish to check is the conjunction of two local predicates L_{u,v} and L_{v,w}. Suppose further that exactly one of the two predicates is always false, and the predicate that is false is constantly changing. Then, whenever we check the (u, v) subsystem, we might find L_{u,v} true; similarly, whenever we check the (v, w) subsystem, we might find L_{v,w} true. Then we may never detect the fact that L does not hold in this execution. We avoid this problem by requiring that L_{u,v} and L_{v,w} be closed.

Local Correctability

The motivation behind local checking was to efficiently ensure that some property L holds for network automaton N. We would also like to efficiently correct N to make the property true. We have already set up some plausible conditions for local checking; can we find some plausible conditions under which N can be locally corrected?

To this end, we define a local reset function f. This is a function with three arguments: the first argument is a node, say u; the second argument is any state of node automaton N_u; and the third argument is a neighbor v of u. The function produces a state of the node automaton corresponding to the first argument. Let s be a state of N; recall that s|u is the state of N_u. Then f(u, s|u, v) is the state of N_u obtained by applying the local reset function at u with respect to neighbor v. We will abuse notation by omitting the first argument when it is clear what the first argument is; thus we prefer to write f(s|u, v) instead of the more cumbersome f(u, s|u, v).

We will insist that f meet two requirements, so that f can be used for local correction.

Assume that the property L holds iff a local property L_{u,v} holds for every edge (u, v). The first requirement is that if any (u, v) subsystem does not satisfy L_{u,v}, then applying f to both u and v should result in making L_{u,v} hold. More precisely, let us assume that by some magic we have the ability to simultaneously:

1. apply f to N_u with respect to v,

2. apply f to N_v with respect to u, and

3. remove any packets stored in channels C_{u,v} and C_{v,u}.

Then the resulting state of the (u, v) subsystem should satisfy L_{u,v}. Of course, in a real distributed system, such simultaneous actions are clearly impossible. However, we will achieve essentially the same effect by applying a so-called reset protocol to the (u, v) subsystem. We will describe a stabilizing local reset protocol for this purpose in the next section.

The first requirement allows nodes u and v to correct the (u, v) subsystem if L_{u,v} does not hold. But other subsystems may be correcting at the same time. Since subsystems overlap, correction of one subsystem may invalidate the correctness of an overlapping subsystem. For example, the (u, v) and (v, w) subsystems overlap at v. If correcting the (u, v) subsystem causes the (v, w) subsystem to be incorrect, then the correction process can thrash. To prevent thrashing, we add a second requirement. In its simplest form, we might require that correction of the (u, v) subsystem leave the (v, w) subsystem correct, if the (v, w) subsystem was correct in the first place.

However, there is a more general definition of a reset function f that turns out to be useful. Recall that we wanted to avoid the thrashing that could be caused if correcting a subsystem causes an adjacent subsystem to be incorrect. Informally, let us say that the (u, v) subsystem depends on the (v, w) subsystem if correcting the (v, w) subsystem can invalidate the (u, v) subsystem. If this dependency relation is cyclic, then thrashing can occur. On the other hand, if the dependency relation is acyclic, then the correction process will eventually stabilize. Such an acyclic dependency relation can be formalized using a partial order <= on unordered pairs of nodes: informally, the (u, v) subsystem depends on the (v, w) subsystem if {v, w} <= {u, v}.

Using this notion of a partial order, we present the formal definition of a local reset function.

Definition: We say that f is a local reset function for network automaton N = net(G, N) with respect to link predicate set L = {L_{u,v}} and partial order < if, for any state s of N and any edge (u,v) of G:

1. (Correction) (f(s|u, v), nil, nil, f(s|v, u)) is in L_{u,v}.

2. (Stability) For any neighbor w of v: if (s|u, s|(u,v), s|(v,u), s|v) is in L_{u,v} and {v,w} < {u,v}, then (s|u, s|(u,v), s|(v,u), f(s|v, w)) is in L_{u,v}.

Notice that in the special case where all the link subsystems are independent, no edge is less than any other edge in the partial order.
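As a toy, hypothetical illustration (none of these concrete choices are from the notes), the Correction condition can be checked mechanically. Here the predicate L ignores the node states, so the link subsystems are independent and the trivial partial order suffices for Stability:

```python
NIL = None  # absence of a packet on a link

def f(state, neighbor):
    """Hypothetical local reset function: zero the node's basic state."""
    return 0

def L(s_u, p_uv, p_vu, s_v):
    """Toy link predicate: both links empty; node states unconstrained."""
    return p_uv is NIL and p_vu is NIL

# Correction: (f(s|u, v), nil, nil, f(s|v, u)) must satisfy L for all states.
correction_ok = all(L(f(su, "v"), NIL, NIL, f(sv, "u"))
                    for su in range(5) for sv in range(5))

# Stability: since L ignores node states, resetting v with respect to some
# other neighbor w (which changes only s_v) can never falsify L; the link
# subsystems are independent, so the trivial partial order works.
stability_ok = all(L(su, NIL, NIL, f(sv, "w")) == L(su, NIL, NIL, sv)
                   for su in range(5) for sv in range(5))
```

With a predicate that related the two node states, Stability would force a nontrivial partial order, which is exactly the situation the definition anticipates.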

Using the definition of a reset function, we can define what it means to be locally correctable.

Definition: A network automaton N is locally correctable to L using link predicate set L, local reset function f, and partial order < if:

1. N is locally checkable for L using L, and

2. f is a local reset function for N with respect to L and <.

Intuitively, if we have a reset function f with partial order <, we can expect the local correction to stabilize in time proportional to the maximum chain length in the partial order. Recall that a chain is a sequence a_1 < a_2 < ... < a_n. Thus the following piece of notation is useful.

Definition: For any partial order <, height(<) is the length of the maximum-length chain in <.
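For a finite partial order given as a strict-order predicate, this height can be computed by memoized search. This is a sketch; a chain a_1 < ... < a_n is counted here as having length n, since the notes' exact convention is elided:

```python
from functools import lru_cache

def height(items, less_than):
    """Number of elements in a longest chain a1 < a2 < ... < an."""
    items = list(items)

    @lru_cache(maxsize=None)
    def chain_ending_at(i):
        # longest chain whose maximum element is items[i]
        below = [chain_ending_at(j) for j in range(len(items))
                 if less_than(items[j], items[i])]
        return 1 + max(below, default=0)

    return max((chain_ending_at(i) for i in range(len(items))), default=0)
```

For an antichain (no two elements comparable) the height is 1, matching the independent-subsystems special case above.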

Local Correction Theorem

Theorem Statement

In the previous section we set up plausible conditions under which a network automaton can be locally corrected to achieve a property L. We claimed that these conditions could be exploited to yield local correction. In this section we make these claims precise.

Intuitively, the result states that if N is locally correctable to L using local reset function f and partial order <, then we can transform N into a new augmented automaton N' such that N' satisfies the following property: eventually, every behavior of N' will "look like" a behavior of N in which L holds. We use the notation N|L to denote an automaton identical to N, except that the initial states of N|L are restricted to lie in set L.

Theorem (Local Correction): Consider any network automaton N = net(G, N) that is locally correctable to L using link predicate set L, local reset function f, and partial order <. Then there exists some N' that is a UIOA and a network automaton, such that N' stabilizes to the behaviors of N|L.

So far we have ignored time complexity. However, time complexity is of crucial importance to a protocol designer, because nobody would be interested in a protocol that took years to stabilize. Thus [V] uses a version of the timed automata model throughout, and uses this model to give a formal definition of stabilization time. Intuitively, we can measure time complexity as follows. We assume that it takes one time unit to send any message on a link and to deliver a free action; we assume that node processing takes no time. With this assumption, we can define the stabilization time of a behavior to be the time it takes before the behavior has a suffix that belongs to the target set. We define the overall stabilization time as the worst-case stabilization time across all behaviors.

Using this intuitive definition of stabilization time, we can state a more refined version of the Local Correction Theorem. The refinement we add is that N' stabilizes to the behaviors of N|L in time proportional to height(<), the height of the partial order.

Figure : The structure of a single phase of the local snapshot/reset protocol.

Overview of the Transformation Code

To transform N into N', we will add actions and states to N. These actions will be used to send and receive snapshot packets, which will be used to do local checking on each link subsystem, and reset packets, which will be used to do local correction on each link subsystem. For every link (u,v), the leader l(u,v) initiates the checking and correction of the (u,v) subsystem. The idea is, of course, that the leader of each (u,v) subsystem will periodically do a local snapshot to check if the (u,v) subsystem satisfies its predicates; if the leader detects a violation, it tries to make the local predicate true by doing a local reset of the (u,v) subsystem.

The structure of our local snapshot protocol is slightly different from the Chandy-Lamport snapshot protocol [ChandyLamport] studied earlier in the course. It is possible to show [V] that the Chandy-Lamport scheme cannot be used without modifications over unit-storage links. (Prove this; it will help you understand the Chandy-Lamport scheme better.)

Our local snapshot/reset protocol works roughly as follows. Consider a (u,v) subsystem, and assume that l(u,v) = u, i.e., u is the leader on link (u,v). A single snapshot or reset phase has the structure shown in the figure above.

A single phase of either a snapshot or a reset procedure consists of u sending a request that is received by v, followed by v sending a response that is received by u. During a phase, node u sets a flag check_u(v) to indicate that it is checking/correcting the (u,v) subsystem. While this flag is set, no packets other than request packets can be sent on link C_{u,v}. Since a phase completes in constant time, this does not delay the data packets by more than a constant factor.

In what follows, we will use the "basic state" at a node u to mean the part of the state at u corresponding to automaton N. To do a snapshot, node u sends a snapshot request to v.

Figure : Using counter flushing to ensure that request/response matching will work correctly within a small number of phases.

A snapshot request is identified by a mode variable in the request packet that carries a mode of snapshot. If v receives a request with a mode of snapshot, node v records its basic state, say s, and sends s in a response to u.

When u receives the response, it records its basic state, say r. Node u then records the state of the (u,v) subsystem as x = (r, nil, nil, s). If x is not in L_{u,v}, i.e., local property L_{u,v} does not hold, then u initiates a reset.

To do a reset, node u sends a reset request to v. A reset request is identified by a mode variable in the request packet that carries a mode of reset. Recall that f denotes the local reset function. After v receives the request, v changes its basic state to f(s, u), where s is the previous value of v's basic state. Node v then sends a response to u. When u receives the response, u changes its basic state to f(r, v), where r is the previous value of u's basic state.
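As a sketch (the endpoint names and the zeroing reset function are illustrative, not from the notes), a properly matched reset phase can be written as a pure function of the two basic states:

```python
NIL = None  # absence of a packet on a link

def reset_phase(state_u, state_v, f):
    """One properly matched reset phase on link (u, v), with leader u."""
    # u's reset request reaches v: v resets its basic state with respect to u
    state_v = f(state_v, "u")
    # v's response reaches u: u resets its basic state with respect to v
    state_u = f(state_u, "v")
    # both links are empty once the response has been delivered
    return (state_u, NIL, NIL, state_v)
```

By the Correction property, the resulting subsystem state lies in L_{u,v} whenever f is a local reset function for the predicate set.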

Of course, the local snapshot and reset protocol must itself be stabilizing. However, the protocol we just described informally may fail if requests and responses are not properly matched. This can happen, for instance, if there are spurious packets in the initial state of N'. It is easy to see that both the local snapshot and reset procedures will only work correctly if the response from v is sent following the receipt of the request at v. The diagram on the left of the figure above shows a scenario in which requests are matched incorrectly to old responses that were sent in previous phases. Thus we need each phase to eventually follow the structure shown on the right of the figure.

To make the snapshot and reset protocols stabilizing, we use counter flushing. We number all request and response packets: each request and response packet carries a number count. Also, the leader u keeps a variable count_u(v) that u uses to number all requests sent to v within a phase. At the end of the phase, u increments count_u(v). Similarly, the responder v keeps a variable count_v(u) in which v stores the number of the last request it has received from u. Node v weeds out duplicates by only accepting requests whose number is not equal to count_v(u).

Clearly, the count values can be arbitrary in the initial state, and the first few phases may not work correctly. However, counter flushing guarantees that in constant time a response will be properly matched to the correct request. Because the links are unit storage, there can be at most three distinct counter values stored in the link subsystem (one in each link, and one at the receiver). Thus a small space of numbers is sufficient to guarantee that eventually (more precisely, within a constant number of phases) requests and responses are correctly matched.
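The pigeonhole step behind this claim can be sanity-checked directly. Assuming, as the argument suggests, at most three stale counter values stored in a subsystem (one per link and one at the responder) and a counter space of four values, the leader can always reach a value distinct from every stale one:

```python
from itertools import combinations

K = 4  # assumed size of the counter space: values 0..3

def fresh_value_exists(stored):
    """True if some counter value differs from every stored (stale) value."""
    return any(c not in stored for c in range(K))

# At most 3 values can be stored in the subsystem (one per unit-storage link,
# one at the responder), so out of 4 possible values one is always fresh.
ok = all(fresh_value_exists(set(s))
         for r in range(4) for s in combinations(range(K), r))
```

Once the leader's counter reaches a fresh value, every response carrying that value must have been generated for the current request, which is exactly the matching property the protocol needs.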

Besides properly matching requests and responses, we must also avoid deadlock when the local snapshot/reset protocol begins in an arbitrary state. To do so, when check_u(v) is true (i.e., u is in the middle of a phase), u continuously sends requests. Since v weeds out duplicates, this does no harm, and it also prevents deadlock. Similarly, v continuously sends responses to the last request v has received. Once the responses begin to be properly matched to requests, this does no harm, because u discards such duplicate responses.

Intuitive Proof of the Local Correction Theorem

A formal proof of the Local Correction Theorem can be found in [V]. In this section we provide the main intuition. We have already described how to transform a locally correctable automaton N into an augmented automaton N'. We have to prove that, in time proportional to the height of the partial order, every behavior of N' looks like a behavior of N in which all local predicates hold.

We have already described the intuition behind the use of a counter to ensure proper request/response matching. We now describe the intuition behind the local snapshot and reset procedures. The intuition behind the snapshot is very similar to the intuition behind the snapshot procedure studied earlier in the course.

Intuition Behind Local Snapshots

The diagram on the left of the figure below shows why a snapshot works correctly if the response from v is sent following the receipt of the request at v. Let a and b be the states of nodes u and v, respectively, just before the response is sent. Let a' and b' be the states of nodes u and v, respectively, just after the response is delivered. This is sketched in the figure.

The snapshot protocol must guarantee that node u does not send any data packets to v during a phase. Also, v cannot send another data packet to u from the time the response is sent until the time the response is delivered: the link from v to u is a UDL that will not give a free indication until the response is received. Recall that nil denotes the absence of any packet on a link. Thus the state of the (u,v) subsystem just before the response is sent is (a, nil, nil, b). Similarly, the state of the (u,v) subsystem just after the response is delivered is (a', nil, nil, b').

We claim that it is possible to construct some other execution of the (u,v) subsystem which starts in state (a, nil, nil, b), has an intermediate state equal to (a', nil, nil, b), and has a final state equal to (a', nil, nil, b'). This is because we could have first applied all the actions that changed the state of node u from a to a', which would cause the (u,v) subsystem to reach the intermediate state. Next, we could apply all the actions that changed the state of node v from b to b', which would cause the (u,v) subsystem to reach the final state. Note that this construction is only possible because u and v do not send data packets to each other between the time the response is sent and the time the response is delivered.

Thus the state (a', nil, nil, b) recorded by the snapshot is a possible successor of the state of the (u,v) subsystem when the response is sent. The recorded state is also a possible predecessor of the state of the (u,v) subsystem when the response is delivered. But L_{u,v} is a closed predicate: it remains true once it is true. Thus if L_{u,v} was true just before the response was sent, then the state recorded by the snapshot must also satisfy L_{u,v}. Similarly, if L_{u,v} is false just after the response is delivered, then the state recorded by the snapshot does not satisfy L_{u,v}. Thus the snapshot detection mechanism will not produce false alarms if the local predicate holds at the start of the phase. Also, the snapshot mechanism will detect a violation if the local predicate does not hold at the end of the phase.

Intuition Behind Local Resets

The diagram on the right of the figure below shows why a local reset works correctly if the response from v is sent following the receipt of the request at v. Let b be the state of node v just before the response is sent. Let a and b' be the states of nodes u and v, respectively, just before the response is delivered. This is sketched in the figure.

The code for the augmented automaton will ensure that just after the response is sent, node v will locally reset its state to f(b, u). Similarly, immediately after it receives the response, node u will locally reset its state to f(a, v). Using arguments similar to the ones used for a snapshot, we can show that there is some execution of the (u,v) subsystem which begins in the state (f(a,v), nil, nil, f(b,u)) and ends in the state (f(a,v), nil, nil, b').

Figure : Local snapshots and resets work correctly if requests and responses are properly matched.

But the latter state is the state of the (u,v) subsystem immediately after the response is delivered. And we know from the Correction property of a local reset function that (f(a,v), nil, nil, f(b,u)) satisfies L_{u,v}. Since L_{u,v} is a closed predicate, we conclude that L_{u,v} holds at the end of the reset phase.

Intuition Behind the Local Correction Theorem

We can now see intuitively why the augmented automaton will ensure that all local predicates hold in time proportional to the height of the partial order. Consider a (u,v) subsystem such that no pair {w,x} satisfies {w,x} < {u,v}, i.e., {u,v} is a minimal element in the partial order. Then within a constant number of phases of the (u,v) subsystem, the request/response matching will begin to work correctly. If the sixth phase of the (u,v) subsystem is a snapshot phase, then either L_{u,v} will hold at the end of the phase, or the snapshot will detect a violation. But in the latter case, the seventh phase will be a reset phase, which will cause L_{u,v} to hold at the end of the seventh phase.

But once L_{u,v} becomes true, it remains true. This is because L_{u,v} is a closed predicate of the original automaton N, and the only extra actions we have added to N that can affect L_{u,v} are actions that locally reset a node using the reset function f. By the Stability property of a local reset function, any application of f at u with respect to some neighbor other than v cannot affect L_{u,v}. Similarly, any application of f at v with respect to some neighbor other than u cannot affect L_{u,v}. Thus, in constant time, the local predicates corresponding to link subsystems that are minimal elements in the partial order will become and remain true.

Now suppose that the local predicates for all subsystems with height less than i hold from some state s_i onward. By similar arguments, we can show that in constant time after s_i, the local predicates for all subsystems with height i become and remain true. Once again, the argument depends crucially on the Stability property of a local reset function: applications of the local reset function to subsystems with height less than i do not occur after state s_i, and these are the only actions that can falsify the local predicates for subsystems with height i. The net result is that all local predicates become and remain true within time proportional to the height of the partial order.

Implementing Local Checking in Real Networks

Timer Flushing

In the previous section we made the snapshot/reset protocol stabilizing by numbering snapshot requests and responses. We also relied on the fact that each link was a UDL. In practice, however, there is an even simpler way of making the snapshot/reset protocol stabilizing: using timers.

Suppose there is a known bound on the length of time a packet can be stored in a link, and a known bound on the length of time between the delivery of a request packet and the sending of a matching response packet. Then, by controlling the interval between successive snapshot/reset phases, it is easily possible to obtain a stabilizing snapshot protocol. The interval is chosen to be large enough that all packets from the previous phase will have disappeared by the end of the interval [APV-sigcomm]. We call the general method timer flushing.

The main idea in timer flushing is to bound the lifetime of old state information in the network. This is done by using node clocks that run at approximately the same rate, and by enforcing a maximum packet lifetime over every link. State information that is not periodically refreshed is timed out by the nodes. Timer flushing has been used in the OSI Network Layer Protocol [Perlman] and the IEEE Spanning Tree protocol [Perlman]. Spinelli [S] uses timer flushing to build practical stabilizing Data Link and virtual circuit protocols.

In most real networks, each node periodically sends "keep-alive" packets on every link in order to detect failures of adjacent links; if no keep-alive packet arrives before a local timer expires, the link is assumed to have failed. Thus it is common practice to assume time bounds for the delivery and processing of packets. Note also that the snapshot and reset packets used for local checking can be piggybacked on these keep-alive packets without any appreciable loss in efficiency.
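A minimal sketch of timer flushing (the data layout is assumed, not from the notes): tag every piece of link state with the clock time at which it was last refreshed, and periodically discard anything older than the known bound:

```python
def timer_flush(state, now, max_lifetime):
    """Discard state entries that have not been refreshed within max_lifetime.

    state maps a key (e.g. a link name) to the time it was last refreshed;
    anything older is timed out, bounding the lifetime of stale information
    left over from a transient fault.
    """
    return {key: t for key, t in state.items() if now - t <= max_lifetime}
```

Entries refreshed by keep-alive traffic survive each flush, while corrupted or orphaned entries disappear within one lifetime bound.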

Summary

Self-stabilization was introduced by Dijkstra in a seminal paper [Dijkstra]. Dijkstra's shared-memory model is enchanting in its simplicity, and makes an ideal vehicle for describing nontrivial examples of self-stabilization. We showed how to simulate Dijkstra's model as one big IOA with a few restrictions. We also used Dijkstra's first two examples to introduce two powerful methods for self-stabilization: counter flushing, and local checking and correction.

Next, we moved to a more realistic message-passing model. We introduced a reasonable model for bounded-storage links. Using this model, we formally defined the notions of local checkability and correctability, and described the Local Correction Theorem. This theorem shows that any locally correctable protocol can be stabilized in time proportional to the height of a partial order that underlies the definition of local correctability. Our proof of this theorem was constructive: we showed how to augment the original protocol with snapshot and reset actions. We made the local snapshot and reset protocols stabilizing using counter flushing. While counter flushing relies on bounding the number of packets that can be stored on a link, we observed that a practical implementation would be based on bounding the time that a message could stay on a link; we called this technique timer flushing.

Together, counter flushing, timer flushing, and local checking (with and without local correction) can be used to explain a number of self-stabilizing protocols, and can be used to design new protocols. While these techniques cannot be used to solve every problem, there are a large number of useful protocols that can be efficiently stabilized using these notions.

In the next lecture, Boaz will provide an example of a reset protocol that, somewhat surprisingly, is locally correctable using the trivial partial order. In other places, we have used local checking to provide solutions for mutual exclusion, the end-to-end problem, stabilizing synchronizers, and spanning tree protocols; more detailed references can be found in [APV], [V], and [AVarghese].

The messages used for local checking can be piggybacked on keep-alive traffic in real networks without any appreciable loss of efficiency. As a side benefit, local checking also provides a useful debugging tool: any violations of local checking can be logged for further examination. In a trial implementation of our reset procedure on the Autonet [Auto], local checking discovered bugs in the protocol code. In the same vein, local checking can provide a record of catastrophic transient faults that are otherwise hard to detect.

Distributed Algorithms                                    December

Lecturer: Boaz Patt-Shamir

Lecture

Self-Stabilization and Termination

Can a self-stabilizing protocol terminate? This question has several possible answers. First, consider the regular I/O automata model and the behavior stabilization definition. In this setting, if the problem spec allows for finite behaviors, then the trivial automaton that does nothing is a self-stabilizing "solution". This is clearly unsatisfactory. We shall use a different formulation of the model to see that no interesting problem admits a terminating self-stabilizing protocol.

Consider the input-output relation specification of problems, defined as follows. In each process there are special input and output registers; the problem is defined in terms of a relation that specifies, for each input assignment, what the possible output assignments are. We also assume that some of the states are designated as terminating, i.e., no action is enabled in them. This is just another formulation of a non-reactive distributed task; we saw quite a few such tasks (e.g., MST, shortest paths, leader election, etc.).

In this setting, it is easy to see that a self-stabilizing protocol can have no terminating states if it satisfies the following non-locality condition.

There exist an input value v, an output value v', a node j, and two input assignments I and I', such that v is assigned to j in both I and I', and v' is a possible output value for I but is not a possible output value for I'.

Intuitively, the non-locality condition means that node j cannot tell locally whether having v as input and v' as output is correct or wrong. We remark that all interesting distributed tasks have this property: any other task amounts to a collection of unrelated processes.

The moral of this observation is that self-stabilizing algorithms must constantly verify whether their output (or, more generally, their state) is up to date with respect to the rest of the system, and therefore they cannot terminate. (This can be formalized in the IOA model by assuming that the input value is input to the system infinitely often, and that the output action that carries the output value is continuously enabled.)

Self-Stabilization and Finite State

Let us consider an issue which may seem unrelated to self-stabilization at first glance, namely infinite-state protocols. In particular, consider the idea of unbounded registers. We went through a lot of difficulty (conceptually), and paid a significant price in terms of performance, to get bounded-registers protocols (e.g., applying CTS to Lamport's bakery).

This approach is easy to criticize from the practical viewpoint: to be concrete, suppose that the registers contain many bits; no system today can hit the resulting bound in any practical application. However, note that this argument depends crucially on the assumption that the registers are initialized properly. For instance, consider the case where, as a result of a transient fault, the contents of the registers may be corrupted. Then, no matter how big we let the registers be, they may hit their bound very quickly. On the other hand, any feasible system can have only finitely many states, and in many cases it is desirable to be able to withstand transient faults.

In short, unbounded registers are usually a convenient abstraction given that the system is initialized properly, while self-stabilizing protocols are useful only when they have finitely many states. This observation gives rise to two possible approaches. The first is to develop only finite-state self-stabilizing protocols, tailored to specific applications. The second is to have some kind of general transformation that enables us to design abstract unbounded-state protocols using real-world bounded-size registers, and, whenever the physical limit is hit, to reset the system.

The main tool required for the second approach is a self-stabilizing reset protocol. This will be the focus of our lecture today.

Self-Stabilizing Reset Protocol

Problem Statement

Informally, the goal of the reset procedure is, whenever it is invoked, to create a fresh version of the protocol, which should be free of any possible corruption in the past. Formally, we require the following.

We are given a user at each node (we sometimes identify the user with its node). The user has input action Receive(m, e) and output action Send(m, e), where m is drawn from some message alphabet and e is an incident link. The meaning of these actions is the natural one: m is received from e, and m is to be sent over e, respectively. (We distinguish between the actual link alphabet and the user message alphabet; we assume that the link alphabet comprises the user messages and the distinct Reset protocol messages.) In addition, the

Figure : Interface specification for the reset service.

user also has output action reset_request and input action reset_signal. The problem is to design a protocol ("reset service") such that, if one of the nodes makes a reset request and no node makes infinitely many requests, then:

1. In finite time, all the nodes in the connected component of a requesting node receive a reset signal.

2. No node receives infinitely many reset signals.

3. Let e = (u,v) be any link in the final topology of the network. Then the sequence of Send(m, e) input at u after the last reset signal at u is identical to the sequence of Receive(m, e) output at v after the last reset signal at v.

Informally, (1) and (2) guarantee that every node gets a last reset signal, and (3) stipulates that the last reset signal provides a consistent reference time-point for the nodes. In order that the consistency condition (3) be satisfiable, it is assumed that all messages from the user to the network and vice versa are controlled by the reset protocol (see the figure above).

Unbounded-Registers Reset

The original problem requires us to run a "new version" of the user's protocol. Consider the following simple solution.

- The reset protocol maintains a version number.

- All outgoing messages are stamped with the version number, and the numbers are stripped before delivery of received messages.

- Whenever a request occurs, the version counter is incremented and a signal is output immediately; all messages will be stamped with the new value.

- Whenever a higher value is seen, it is adopted and a signal is output.

- Messages with low values are ignored.

This protocol works fine for the nodes participating in the protocol. Stabilization time is O(diam), which is clearly optimal. What makes this protocol uninteresting, as discussed above, is the need for an unbounded version counter.
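The scheme above can be sketched as a flood-max computation; the synchronous-round loop below is a simplification of the asynchronous network, and the function name is illustrative:

```python
def version_reset(adj, version, requester):
    """Flood-max sketch of the unbounded-version reset (synchronous rounds).

    adj: adjacency lists; version: current version number per node.
    Returns the final versions and the set of nodes that output a signal.
    """
    version = dict(version)
    version[requester] += 1          # request: bump the version, signal locally
    signaled = {requester}
    changed = True
    while changed:                   # settles within diameter rounds
        changed = False
        for u in adj:
            for v in adj[u]:
                if version[v] < version[u]:
                    version[v] = version[u]   # adopt the higher version
                    signaled.add(v)           # ...and output a reset signal
                    changed = True
    return version, signaled
```

The sketch makes the O(diam) stabilization visible, and also the weakness: a corrupted initial version value is adopted just as readily as a legitimate one, so the counter must be unbounded.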

Example: Link Reset

Recall the Local Correction Theorem. The basic mechanism there was applying a local correction action, which reset the nodes to some pre-specified state and reset the links to empty. This matches the above spec for reset. This link reset can be implemented fairly efficiently (the details are a bit messy, however). It requires the use of identifiers at nodes, and a flushing mechanism based on either a time bound on message delivery or a bound on the maximal capacity of the link.

Our goal in the reset protocol is to provide a network-wide reset. We could do it using similar flushing mechanisms (e.g., the global timeout used in DECNET). However, global bounds are usually very bad, as they need to accommodate simultaneous worst cases everywhere in the system. Thus we are interested in extending the self-stabilization properties of the links, as described by the Local Correction Theorem, to the whole network, without incurring the cost of, say, a global timeout.

Reset Protocol

Let us first describe how the reset protocol would have worked if it were initialized properly. In the so-called Ready mode, the protocol relays (using buffers) messages between the network and the user. When a reset request is made at some node, its mode is changed to

Figure : An example of a run of the reset protocol. In (a), two nodes get reset requests from the user. In (b), the nodes in Abort mode are colored black; the heavy arrows represent parent pointers, and the dashed arrows represent messages in transit. In (c), none of the nodes is in Ready mode. In (d), some of the nodes are in Ready mode, and some reset signals are output.

Abort: it broadcasts Abort messages to all its neighbors, sets the Ack_Pend bits to true for all incident links, and waits until all the neighboring nodes send back Ack. If a node receives an Abort message while in Ready mode, it marks the link from which the message arrived as its parent, broadcasts Abort, and waits for Acks to be received from all its neighbors. If an Abort message is received by a node not in Ready mode, Ack is sent back immediately. When a node receives Ack, it sets the corresponding Ack_Pend bit to false. The action of a node when it receives the last anticipated Ack depends on the value of its parent. If parent is not nil, its mode is changed to Converge, and Ack is sent on parent. If parent is nil, the mode is changed to Ready, and the node broadcasts Ready messages and outputs a reset signal. A node that gets a Ready message on its parent link while in Converge mode changes its mode to Ready, makes its parent nil, outputs a reset signal, and broadcasts Ready.

While the mode is not Ready, messages input by the user are discarded, and all messages from a link e are queued on Buffer(e). When Abort arrives on link e, Buffer(e) is flushed. While the mode is Ready, messages from the user to the network are put on the output queue, and messages in Buffer are forwarded to the user. The size of Buffer depends on the assumptions we make about the user.

The actual code is given in the figures below. A few remarks are in order.

1. In the self-stabilization model we can deal with the assumption that the topology isn't fixed in advance. This is done by assuming that links may be up or down, and that the nodes are constantly informed of the status of their incident links. As usual, we require correctness only after the system has stabilized. We model this by assuming that the maximal degree d is known (e.g., the number of ports), and that the set of operational links is kept updated in the Edges array.

2. In the self-stabilization model we tend to have as few variables as possible (fewer things can get corrupted). See, for example, the definition of mode(u) in the code.

The first step in making the above protocol self-stabilizing is to make it locally checkable. A clear problem with the existing protocol is that it will deadlock if, in the initial state, some parent edges form a cycle. We mend this flaw by maintaining a distance variable at each node, such that a node's distance is one greater than that of its parent. Specifically, distance is initialized to 0 upon a reset request, and its accumulated value is appended to the Abort messages. However, since all we care about is acyclicity, there is no need to update distance when a link goes down.
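The reason the distance variable buys acyclicity can be checked concretely: around any parent cycle, the per-link predicate "child distance = parent distance + 1" is unsatisfiable. A small sketch (names hypothetical):

```python
from itertools import product

def distances_consistent(parent, distance):
    """Per-link predicate: each node's distance is one more than its parent's."""
    return all(distance[u] == distance[parent[u]] + 1
               for u in parent if parent[u] is not None)

# A proper tree admits a consistent distance assignment ...
tree_ok = distances_consistent({"r": None, "a": "r", "b": "a"},
                               {"r": 0, "a": 1, "b": 2})

# ... but a parent cycle admits none (exhaustive search over small distances):
cycle = {"a": "b", "b": "a"}
cycle_satisfiable = any(distances_consistent(cycle, {"a": da, "b": db})
                        for da, db in product(range(4), repeat=2))
```

Since the predicate is checkable per link, a cycle left over from a corrupted initial state is guaranteed to violate it on at least one link, which local checking can then detect.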

Next, we list all the link predicates that are necessary to ensure correct operation of the Reset protocol. It turns out that all the required link predicates are independently stable, and hence the stabilization time would be one time unit (each subsystem needs to be corrected at most once, independently).

Our last step is to design a local correction action for links (the f of the Local Correction Theorem). The main difficulty in designing a correcting strategy is making it local, i.e., ensuring that when we correct a link we do not violate the predicates of incident link subsystems. For the case of a dynamic-topology network the solution is simple: emulate a link failure and recovery. Of course, care must be exercised when writing the code for these events (see the code for local reset below).

In the code, the detection and correction mechanism is written explicitly. For every link we have the ID of the node at the other endpoint; the node with the higher

State of process u:

    Ack_Pend: array of d Boolean flags
    parent: pointer ranging over {1, ..., d} ∪ {nil}
    distance: ranges over 0..N, where N is a bound on the number of processes
    Edges: set of operational incident links (maintained by the links protocol)
    Queue: array of d send buffers
    Buffer: array of d receive buffers
    IDs: array of d node identifiers
    Timers: array of d timers
    Check: array of d Boolean flags
    Snap_Mode: array of d flags, ranging over {snap, reset}

Shorthand for the code of process u:

    mode(u) =
        Abort     if there is an e such that Ack_Pend(e) = true
        Converge  if parent ≠ nil and, for all e, Ack_Pend(e) = false
        Ready     if parent = nil and, for all e, Ack_Pend(e) = false

    Propagate(e, dist):
        if dist > 0 then parent ← e else parent ← nil
        distance ← dist
        for all edges e′ ∈ Edges do
            Ack_Pend(e′) ← true
            enqueue Abort(distance) on Queue(e′)

    Local_Reset(e):
        Buffer(e) ← ∅
        Queue(e) ← ∅
        (D) if there is an e′ ≠ e with Ack_Pend(e′) = true, or (parent ≠ e and parent ≠ nil), then
            if parent = e then parent ← nil
            if Ack_Pend(e) = true then
                Ack_Pend(e) ← false
                if mode(u) = Converge then enqueue Ack in Queue(parent)
        (D) else if mode(u) ≠ Ready then
            Ack_Pend(e) ← false
            parent ← nil
            output reset signal
            for all e′ ∈ Edges do enqueue Ready in Queue(e′)

Figure: Self-stabilizing reset protocol, part I.

Code for process u:

    Whenever reset request and mode(u) = Ready:
        (Q) Propagate(nil, 0)

    Whenever Abort(dist) on link e:
        (A1) if mode(u) = Ready and dist < N then
            Propagate(e, dist + 1)
        (A2) if dist ≥ N or mode(u) ≠ Ready then enqueue Ack in Queue(e)

    Whenever Ack on link e and Ack_Pend(e) = true:
        Ack_Pend(e) ← false
        (K1) if mode(u) = Converge then
            enqueue Ack in Queue(parent)
        (K2) else if mode(u) = Ready then
            output reset signal
            for all e′ ∈ Edges do
                enqueue Ready in Queue(e′)

    Whenever Ready on link e and parent = e:
        (R) if mode(u) = Converge then
            parent ← nil
            output reset signal
            for all edges e′ ∈ Edges do
                enqueue Ready in Queue(e′)

    Whenever Send(m, e):
        if mode(u) = Ready then
            enqueue m in Queue(e)

    Whenever Queue(e) ≠ ∅:
        send head(Queue(e)) over e
        delete head(Queue(e))

Figure: Self-stabilizing reset protocol, part II.

Code for process u (cont.):

    Whenever m received from link e:
        if Check(e) = true then
            if Snap_Mode(e) = snap then
                record m as part of the snapshot state of link e
            enqueue m in Buffer(e)
        else enqueue m in Buffer(e)

    Whenever Buffer(e) ≠ ∅ and mode(u) = Ready:
        output Receive(head(Buffer(e)), e)
        delete head(Buffer(e))

    Whenever Abort on e:
        Buffer(e) ← ∅

    Whenever Down on e:
        execute Local_Reset(e)

    Whenever Timers(e) expires for some link e:
        if My_ID > IDs(e) then
            Check(e) ← true
            if Snap_Mode(e) = snap then
                record local state
            else execute Local_Reset(e)
            send Start(Snap_Mode(e), My_ID) on e

    Whenever Start(b, id) on link e:
        IDs(e) ← id
        if id > My_ID then
            if b = reset then
                execute Local_Reset(e)
            send Response(b, local state)

    Whenever Response(b, S) on link e:
        Check(e) ← false
        if Snap_Mode(e) = snap and any invariant does not hold then
            Snap_Mode(e) ← reset
        else execute Local_Reset(e)

Figure: Self-stabilizing reset protocol, part III.

(A) Ack_Pend_u(e) = true iff one of the following holds:
    1. Abort(distance_u) ∈ Queue_u(e)
    2. mode_v = Abort and parent_v = ē
    3. Ack ∈ Queue_v(ē)

(B) At most one of the following holds:
    1. Abort(distance_u) ∈ Queue_u(e)
    2. mode_v = Abort and parent_v = ē
    3. Ack ∈ Queue_v(ē)

(C) parent_u = e implies that mode_u = Converge iff one of the following holds:
    1. Ack ∈ Queue_u(e)
    2. Ack_Pend_v(ē) = false and mode_v ≠ Ready
    3. Ready ∈ Queue_v(ē)

(D) parent_u = e implies that at most one of the following holds:
    1. Ack ∈ Queue_u(e)
    2. Ack_Pend_v(ē) = false and mode_v ≠ Ready
    3. Ready ∈ Queue_v(ē)

(R) tail(Queue_u(e)) ∈ {Ready} implies mode_u = Ready.

(P) parent_u = e implies that one of the following holds:
    1. distance_u > distance_v
    2. Ready ∈ Queue_v(ē)

(E) e ∉ Edges_u implies all of the following:
    1. Ack_Pend_u(e) = false
    2. parent_u ≠ e
    3. Queue_u(e) = ∅
    4. Buffer_u(e) = ∅

(Q) Queue_u(e) is a subsequence of one of the following:
    1. ⟨Abort, Ack⟩
    2. ⟨Ack, Ready, Abort⟩
    3. ⟨Ack, Ready⟩

Figure: Reset protocol, part IV: link invariants for the link e = (u, v). The link (v, u) is denoted ē.

ID is assumed to be responsible for the correctness of the subsystem. This is done as follows. Whenever the timer dedicated to that link expires, the node checks whether it is in charge of that link; if so, it sends out a snapshot message that returns with the state of the other node. Based on this state, its own state, and the messages received, it can find out whether any of the invariants (see the figure above) is violated. If this is the case, local reset is executed, in a coordinated way that resembles the way snapshots of the link are taken.

The invariants of the figure express simple properties that turn out to be sufficient to ensure correct operation of the protocol. Invariants A and B capture the desired behavior of abort messages; C and D deal with the converge messages analogously. The ready message is simpler, since we don't expect anything from a node in ready mode, and is described by invariant R. Invariant P is crucial to ensure stabilization: it guarantees that no cycles of the parent pointers can survive checking and correction. Invariant E describes what is expected of a non-existing edge; the goal of the local reset action is to bring the subsystem to that state. The last invariant, Q, is a technicality: we need to consider the messages in the send queue also.
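Checking such invariants is mechanical. As an illustration, here is a small sketch of a checker for invariant Q; the three allowed patterns are read off the partly garbled figure above, so treat the exact pattern list as an assumption:

```python
def is_subsequence(seq, sup):
    """True iff seq can be obtained from sup by deleting elements."""
    it = iter(sup)
    return all(tok in it for tok in seq)  # 'in' consumes the iterator

# Patterns allowed for a send queue by invariant Q (as read from the figure).
Q_PATTERNS = [("Abort", "Ack"), ("Ack", "Ready", "Abort"), ("Ack", "Ready")]

def invariant_q(queue):
    """Invariant Q: the queue is a subsequence of one of the patterns."""
    return any(is_subsequence(queue, pat) for pat in Q_PATTERNS)
```

For example, a queue holding Ack followed by Ready satisfies Q, while Ready followed by Ack does not.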

Analysis

We'll sketch the outlines of the proofs, just to show the ideas; the details are a bit messy. We start with the statement of stabilization.

Theorem: Suppose that the links are bounded-delay links with delay time D, and suppose that the invariants are verified by the manager at least every P time units. Then in any execution of the Reset protocol, regardless of the initial state, all the invariants hold for all the links in all states from time O(D + P) onwards.

This theorem is proven by showing that the invariants are stable. The argument is induction on the actions: assuming that a given state is legal with respect to a given link subsystem, we show that applying any action will not break any invariant in this subsystem. Thus the proofs are basically a straightforward (somewhat tedious) case analysis. Using the local correction theorem, we can deduce that the protocol is self-stabilizing.

It is more interesting to analyze the executions of the algorithm. We need a few definitions.

Definition: Let s_0 a_1 s_1 a_2 s_2 ... be an execution fragment. An interval [i, j] is a ready interval at node u if

1. mode(u) = Ready in all states s_l, for all i ≤ l ≤ j;
2. if i > 0 then mode(u) ≠ Ready in s_{i-1}, and if j < ∞ then mode(u) ≠ Ready in s_{j+1}.

Abort intervals and converge intervals are defined analogously.

For these mode intervals we have the following lemma.

Lemma: Let s_0 a_1 s_1 ... a_n s_n be an execution fragment, and let u be a node. Then, at u,

1. A ready interval may be followed only by an abort interval.
2. A converge interval may be followed only by a ready interval.
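The transition discipline stated by this lemma can be checked mechanically on any trace of a node's mode. A small sketch (the trace representation is mine):

```python
from itertools import groupby

def mode_intervals(trace):
    """Collapse a per-state mode trace into its maximal intervals."""
    return [m for m, _ in groupby(trace)]

def follows_lemma(trace):
    """Ready may be followed only by Abort; Converge only by Ready."""
    runs = mode_intervals(trace)
    for a, b in zip(runs, runs[1:]):
        if a == "Ready" and b != "Abort":
            return False
        if a == "Converge" and b != "Ready":
            return False
    return True
```

A trace such as Ready, Abort, Converge, Ready obeys the lemma; Ready followed directly by Converge does not.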

The proof is by straightforward inspection of the actions. Next we define the concept of a parent graph.

Definition: Let s be a state. The parent graph G(s) is defined by the parent and Queue variables as follows: the nodes of G(s) are the nodes of the network, and e = (u, v) is an edge in G(s) iff parent_u = e and Ready ∉ Queue_v(ē) in s.

From the distances invariant we know that the parent graph is actually a forest in legal states. We also define a special kind of abort interval: an abort interval [i, j] at node u is a root interval if, for some i ≤ k ≤ j, parent_u = nil in s_k.

The next step of the proof is to show that for any abort interval there exists a root interval that contains it. This is done by induction on the depth of the node in the parent graph. Using this fact, we bound the duration of abort intervals.

Lemma: Let s be a legal state, let u be a node, and suppose that no topological changes occur after s. If mode(u) = Abort in s and depth(u) = k in G(s), then there exists a state s′ in the following n − k time units such that mode(u) ≠ Abort in s′.

This lemma is proven by showing that no Ack_Pend bit is continuously true for more than n − k time units. This implies the lemma, since no Ack_Pend bit is set to true while the node is in a non-ready mode.

Similarly, we can prove, by induction on depth(u), the following bound on the duration of converge intervals.

Lemma: Let s be a legal state at time t, let u be a node, and suppose that no topological changes occur after time t. If mode(u) = Abort in s and depth(u) = k in G(s), then by time t + n + depth(u) an action occurs in which mode(u) is changed to Ready, and Ready messages are sent by u to all its neighbors.

It is easy to derive the following conclusion.

Theorem: Suppose that the links (including the link invariants) stabilize in C time units. If the number of reset requests and topological changes is finite, then within C + n time units after the last reset request/topological change, mode(u) = Ready at all nodes u.

Proof: Consider the root intervals. Note that, by the code, each such root interval can be associated with an input request made at this node, or with a topological change. Since each request can be associated with at most one root interval, and since each topological change can be associated with at most two root intervals, the number of root intervals is finite. Furthermore, by the lemma above, n time units after the last reset request all root intervals terminate, and Ready will be propagated. Since any non-root abort interval is contained in some root interval, after n time units all abort intervals have terminated, and hence n time units after the last reset request the mode of all the nodes is Ready.

The following theorem states the liveness property of the Reset protocol.

Theorem: Suppose that the execution of the Reset protocol begins at time 0 in an arbitrary state, and suppose that the links (including the link invariants) stabilize in C time units. If the last reset request/topological change occurs at time t ≥ C + n, then a RESET signal is output at all the nodes by time t + n.

Proof Sketch: The key property we need is that the time between successive ready intervals is at most n time units.

First, notice that it suffices to show that for any node v there is a time in the interval [C, t + n] at which mode(v) ≠ Ready.

Suppose now, for contradiction, that at time t ≥ C + n a reset request occurs at some node w, and that there exists a node u whose mode is Ready at all times in [C, t + n]. Let v be any node in the network. We show, by induction on the distance i between u and v, that mode(v) = Ready throughout the time interval [C + n, t + n − i]. This will complete the proof, since this claim implies that mode(w) = Ready at time t, while the reset request arriving at w at time t forces w out of Ready.

The base case follows from the assumption that mode(u) = Ready in the time interval [C, t + n]. The inductive step follows from the fact that if for some node v′, at a time t′ ≥ C + n, we have mode(v′) ≠ Ready, then there occurs an action in the time interval [t′ − n, t′] in which the mode of v′ changed from Ready to Abort, and Abort messages were sent to all the neighbors of v′. Hence, for each neighbor v′′ of v′, there is a state in the time interval [t′ − n, t′ + n] in which mode(v′′) ≠ Ready.

Next, we prove the consistency of the Reset protocol. This property follows from the simple Buffer mechanism: there is a buffer dedicated to each link, such that all messages arriving from link e are stored in Buffer(e), and Buffer(e) is flushed whenever an Abort message arrives from e.

Theorem: Suppose no topological changes occur after time C, and that the last reset signals occur at two adjacent nodes u and v after time C + n. Then the sequence of Send(m, e) inputs given by the user at u following the last reset signal at u is identical to the sequence of Receive(m, ē) outputs given to the user at v after the last reset signal at v.

Proof: Consider the last non-ready intervals at u and v. Suppose that the last states in which mode(u) ≠ Ready and mode(v) ≠ Ready are s_{j_u} and s_{j_v}, respectively, and let s_{i_u} and s_{i_v} be the first states in these last non-ready intervals, respectively; i.e., for all i_u ≤ k ≤ j_u, mode(u) ≠ Ready in s_k, and mode(u) = Ready in s_{i_u − 1}. Note that since the last reset signal is output after time C + n, it follows that the states s_{i_u} and s_{i_v} indeed exist. Note further that the change of mode to Abort is accompanied by broadcasting Abort messages (by Propagate).

Consider Send(m, e) input at u after s_{j_u}. We argue that m is delivered at v after s_{i_v}: suppose not. Then the mode of v is changed to Abort after s_{j_u}, and v broadcasts Abort to u. Hence there must be a state s_k, with k > j_u, such that mode(u) ≠ Ready in s_k, contradicting the assumption that s_{j_u} is the last state in the last non-ready interval. Recalling that the manager buffers messages until the mode is Ready, we conclude that the corresponding Receive(m, ē) is output to the user at v after s_{j_v}.

Now consider Send(m, e) input at u before s_{j_u}. Since the manager discards all messages input while in non-ready mode, we need only consider Send(m, e) input at u before s_{i_u}. Again, we argue that Receive(m, ē) is not output at v after s_{j_v}: this follows from the fact that there must be an Abort message sent to u from v after m, which implies that the mode of u is not Ready after s_{j_u}, a contradiction.

Comments

The reset protocol is a powerful tool, but sometimes it is overkill: it doesn't seem reasonable to reset the whole system whenever a minor but frequent fault occurs (e.g., a new node joins, or a link fails). It seems that one of the best applications of reset is when it is combined with unbounded-register protocols: the effect of the reset signal in this case can usually be defined easily (e.g., set counters to 0).

Finally, we note that the time complexity is actually bounded by the length of the longest simple path in the network, which is a more refined measure than just the number of nodes. If the network is a tree, for instance, the protocol works in diameter time, which is clearly optimal. The space complexity is logarithmic in the bound N. As is typical for the local correction method, the communication bandwidth required for stabilization is proportional to the space, which is in our case logarithmic.

Application: Network Synchronization

Recall the network synchronization problem. The basic property there was that if a message is sent in local round i, then no messages of rounds i or less will ever be received at that node. We abstract the synchronizer as a service module whose task is to provide the user with pulse numbers at all times. If we associate with each node v its round (pulse) number P(v), it can easily be seen that a state is legal only if no two adjacent nodes' pulses differ by more than 1, i.e., for all nodes v,

    for all u ∈ N(v): |P(u) − P(v)| ≤ 1,

where N(v) denotes the set of nodes adjacent to v. Call the states satisfying this condition legal.

Of course, we also want to keep progressing. This can be stated as asserting that for each configuration there is some pulse number K such that all nodes get all pulses K, K + 1, ....

If the system is initialized in a legal state, we can use the following rule:

    P(v) ← min over u ∈ N(v) of {P(u)} + 1.
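The legality condition and the min-plus-one rule above transcribe directly; a minimal sketch (the function names are mine):

```python
def is_legal(P, adj):
    """Legality: adjacent pulse numbers differ by at most 1."""
    return all(abs(P[u] - P[v]) <= 1 for v in adj for u in adj[v])

def min_plus_one(P, adj, v):
    """The basic rule: P(v) <- min over v's neighbors, plus one."""
    return min(P[u] for u in adj[v]) + 1
```

For example, on a three-node chain with pulses 3, 4, 4, the state is legal, and applying the rule at the middle node keeps its pulse at 4.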

The idea is that whenever the pulse number is changed, the node sends out all the messages of previous rounds which haven't been sent yet.

The rule above is stable: i.e., if the configuration is legal, then applying the rule arbitrarily can yield only legal configurations. Notice, however, that if the state is not legal, then applying the rule may cause pulse numbers to drop. This is a reason to be worried, since the regular course of the algorithm requires pulse numbers only to grow. And indeed, the rule is not self-stabilizing.

Figure (panels (a)-(i)): An execution using the min-plus-one rule; the node that moved in each step is marked. The original figure shows successive configurations of a small network in which pulse numbers repeatedly drop.

One idea that may pop into mind to try to repair the above flaw is to never let pulse numbers go down. Formally, the rule is the following:

    P(v) ← max { P(v), (min over u ∈ N(v) of {P(u)}) + 1 }.

This rule can be proven to be self-stabilizing. However, it suffers from a serious drawback regarding its stabilization time. Consider the configuration depicted below:

    1 --- 1 --- 1000000

Figure: A pulse assignment that is slow for the never-decrease rule.

A quick thought should suffice to convince the reader that the stabilization time here is on the order of 1000000 time units, which seems unsatisfactory for such a small network.
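The slow convergence can be observed directly. The sketch below runs sequential sweeps (a "central demon"-style schedule, for illustration only) of the never-decrease rule on the three-node chain with pulses 1, 1, 1000; the number of sweeps needed is on the order of the largest pulse number, not of the diameter:

```python
def legal(P, adj):
    # A configuration is legal when adjacent pulses differ by at most 1.
    return all(abs(P[u] - P[v]) <= 1 for v in adj for u in adj[v])

adj = {0: [1], 1: [0, 2], 2: [1]}
P = {0: 1, 1: 1, 2: 1000}

sweeps = 0
while not legal(P, adj):
    for v in adj:  # one sequential sweep of the rule over all nodes
        P[v] = max(P[v], min(P[u] for u in adj[v]) + 1)
    sweeps += 1
```

Running this, the two low nodes have to climb step by step until they reach the high node, so `sweeps` ends up in the hundreds for this tiny network.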

The next idea is to use a combination of rules: if the neighborhood is legal, then the problem specification requires that the min rule be used; but if the neighborhood is not legal, another rule can be used. The first idea we consider is the following:

    P(v) ← (min over u ∈ N(v) of {P(u)}) + 1,  if for all u ∈ N(v), |P(v) − P(u)| ≤ 1;
    P(v) ← max over u ∈ N(v) of {P(u)},        otherwise.

It is fairly straightforward to show that if an atomic action consists of reading one's neighbors and setting one's own value (in particular, no neighbor changes its value in the meanwhile), then the max rule above indeed converges to a legal configuration. Unfortunately, this model (traditionally called the central demon model) is not adequate for a truly distributed system, which is based on loosely coordinated asynchronous processes. And, as one might suspect, the max rule does not work in a truly distributed system.

So here is a solution:

    P(v) ←  (min over u ∈ N(v) of {P(u)}) + 1,  if for all u ∈ N(v), |P(u) − P(v)| ≤ 1;
            (max over u ∈ N(v) of {P(u)}) − 1,  if for some u ∈ N(v), P(u) > P(v) + 1;
            P(v),                               otherwise.

In words, the rule is to apply "minimum plus one" when the neighborhood seems to be in a legal configuration, and, if the neighborhood seems to be illegal, to apply "maximum minus one".
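One plausible transcription of this combined rule follows (the exact guards are reconstructed from the surrounding text, so treat them as an assumption). On the same 1, 1, 1000 chain that was slow before, it converges in a couple of sweeps:

```python
def combined_rule(P, adj, v):
    nbrs = [P[u] for u in adj[v]]
    if all(abs(x - P[v]) <= 1 for x in nbrs):  # neighborhood looks legal
        return min(nbrs) + 1                   # "minimum plus one"
    if any(x > P[v] + 1 for x in nbrs):        # some neighbor is far ahead
        return max(nbrs) - 1                   # "maximum minus one"
    return P[v]                                # otherwise: stay put

adj = {0: [1], 1: [0, 2], 2: [1]}
P = {0: 1, 1: 1, 2: 1000}
sweeps = 0
while not all(abs(P[u] - P[v]) <= 1 for v in adj for u in adj[v]):
    for v in adj:
        P[v] = combined_rule(P, adj, v)
    sweeps += 1
```

Here the middle node jumps to 999 in the first sweep (max minus one) and the configuration is legal after a number of sweeps on the order of the diameter.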

The intuition behind the modification is that if nodes change their pulse numbers to be the maximum of their neighbors', then a race condition might evolve, where nodes with high pulses can "run away" from nodes with low pulses. If the correction action takes the pulse number to be one less than the maximum, then the high nodes are "locked", in the sense that they cannot increment their pulse counters until all their neighborhoods have reached their pulse number. This locking spreads automatically over all the "infected" area of the network.

We shall now prove that this rule indeed stabilizes in time proportional to the diameter of the network. For this we shall need a few definitions.

Definition: Let v be a node in the graph. The potential of v, denoted Φ(v), is defined by

    Φ(v) = max over u ∈ V of {P(u) − P(v) − dist(u, v)}.

Intuitively, Φ(v) is a measure of how much v is not synchronized, or, alternatively, the size of the largest skew in the synchronization of v, corrected by the distance. Pictorially, one can think of every node u as a point on a plane, where the x-coordinate represents the distance of u from v, and the y-coordinate represents the pulse number (see the figure below for an example). In this representation, v is the only node on the y-axis, and Φ(v) is the maximal vertical distance of any point (i.e., node) above the 45-degree line going through (0, P(v)).
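The potential is easy to compute directly from the definition; a small self-contained sketch, using BFS for the distances (names are mine):

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src to every node, by breadth-first search."""
    dist = {src: 0}
    q = deque([src])
    while q:
        x = q.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                q.append(y)
    return dist

def potential(P, adj, v):
    """Phi(v) = max over u of P(u) - P(v) - dist(u, v)."""
    dist = bfs_dist(adj, v)
    return max(P[u] - P[v] - d for u, d in dist.items())
```

On the chain with pulses 1, 1, 1000, the low endpoint has potential 997, while the high endpoint has potential 0; on a legal assignment such as 3, 4, 5 every potential is 0, matching the lemmas below.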

Figure (panels (i)-(iii)): On the left (i) is an example of a graph with a pulse assignment. Geometrical representations of this configuration are shown in (ii) and (iii); the plane corresponding to node c is in the middle (ii), and the plane corresponding to node b is on the right (iii). In each plane the x-axis is the distance from the node, the y-axis is the pulse number, and the dashed line is the potential line. From the planes, Φ(c) and Φ(b), and also the wavefronts W(c) and W(b), can be read off (see the definition of wavefront below).

Let us start with some properties of Φ that follow immediately from the definition.

Lemma: For all nodes v ∈ V, Φ(v) ≥ 0.

Lemma: A configuration of the system is legal if and only if, for all v ∈ V, Φ(v) = 0.

We now show the key property of the new rule, namely that the potential of the nodes never increases under this rule.

Lemma: Let P be any pulse assignment, and suppose that node u changes its pulse number by applying the rule. Denote the new pulse number of u by P′(u), and the potential of the nodes in the new configuration by Φ′. Then, for all nodes v ∈ V, Φ′(v) ≤ Φ(v).

Proof: The first easy case to consider is the potential of u itself. Since P′(u) ≥ P(u), we have

    Φ′(u) = max over w ∈ V of {P(w) − P′(u) − dist(w, u)}
          ≤ max over w ∈ V of {P(w) − P(u) − dist(w, u)}
          = Φ(u).

Note, for later reference, that the inequality here is strict if Φ(u) > 0. Now consider v ≠ u. The only value that was changed in the set

    {P(w) − P(v) − dist(w, v) | w ∈ V}

is P(u) − P(v) − dist(u, v). There are two cases to examine. If u changed its pulse by applying the "min plus one" part of the rule, then there must be a node w which is a neighbor of u and is closer to v, i.e., dist(u, v) = dist(w, v) + 1. Also, since "min plus one" was applied, we have P′(u) ≤ P(w) + 1. Now,

    P′(u) − P(v) − dist(u, v) ≤ (P(w) + 1) − P(v) − (dist(w, v) + 1)
                              = P(w) − P(v) − dist(w, v),

and hence Φ(v) does not increase in this case. The second case to consider is when u has changed its value by applying the "max minus one" part of the rule. The reasoning in this case is similar: let w be a neighbor of u with P(w) = P′(u) + 1. Clearly, dist(w, v) ≤ dist(u, v) + 1. This implies that

    P′(u) − P(v) − dist(u, v) ≤ (P(w) − 1) − P(v) − (dist(w, v) − 1)
                              = P(w) − P(v) − dist(w, v),

and we are done.

As noted above, the inequality bounding Φ′(u) is strict if Φ(u) > 0. In other words, each time a node with positive potential changes its pulse number, its potential decreases. This fact, when combined with the two lemmas above, immediately implies eventual stabilization. However, this argument leads only to a proof that the stabilization time is bounded by the total potential of the configuration, which in turn depends on the initial pulse assignment. We need a stronger argument in order to prove a bound on the stabilization time that depends only on the topology. Toward this end, we define the notion of a wavefront.

Definition: Let v be any node in the graph, with potential Φ(v). The wavefront of v, denoted W(v), is defined by

    W(v) = min over u ∈ V of {dist(u, v) | P(u) − P(v) − dist(u, v) = Φ(v)}.

In the graphical representation, the wavefront of a node is simply the distance to the closest node on the potential line (see the figure above for an example). Intuitively, one can think of W(v) as the distance to the closest "largest trouble" of v. The importance of the wavefront becomes apparent in the lemma below, but let us first state an immediate property it has.

Lemma: Let v ∈ V. Then W(v) = 0 if and only if Φ(v) = 0.

Lemma: Let v be any node with Φ(v) > 0, and let W′(v) be the wavefront of v after one time unit. Then W′(v) < W(v).

Proof: Suppose W(v) = f at some state. Let u be any node such that P(u) − P(v) − dist(u, v) = Φ(v) and dist(u, v) = f. Consider a neighbor w of u which is closer to v, i.e., dist(w, v) = f − 1 (it may be the case that w = v). From the definition of W(v) it follows that P(w) < P(u) − 1. Now consider the next time at which w applies the rule. If at that time the wavefront of v is already less than f, we are done. Otherwise, w must assign P′(w) = P(u) − 1: no greater value can be assigned, or otherwise the potential lemma would be violated. At this time P′(w) − P(v) − dist(w, v) = Φ(v) also, and hence W(v) ≤ f − 1.

Corollary: Let v be any node. Then after W(v) time units, Φ(v) = 0.
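The wavefront, too, can be read straight off the definition; a self-contained sketch (names are mine):

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src to every node, by breadth-first search."""
    dist = {src: 0}
    q = deque([src])
    while q:
        x = q.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                q.append(y)
    return dist

def wavefront(P, adj, v):
    """W(v): distance to the closest node achieving the max in Phi(v)."""
    dist = bfs_dist(adj, v)
    phi = max(P[u] - P[v] - d for u, d in dist.items())
    return min(d for u, d in dist.items() if P[u] - P[v] - d == phi)
```

On the 1, 1, 1000 chain, the "largest trouble" of node 0 sits at distance 2, so W = 2 there; at the high endpoint Φ = 0 and the wavefront is 0, matching the first lemma above.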

We can now prove the main theorem.

Theorem: Let G = (V, E) be a graph with diameter d, and let P: V → N be a pulse assignment. Applying the combined min-plus-one/max-minus-one rule above results in a legal configuration within d time units.

Proof: By the legality lemma, it suffices to show that after d time units, Φ(v) = 0 for all v ∈ V. From the corollary above we actually know that a slightly stronger fact holds: for each node v ∈ V, after W(v) time units, Φ(v) = 0. The theorem follows from the fact that W(v) ≤ d for all v ∈ V, and from the fact that Φ(v) never increases (by the potential lemma).

Implementation with Bounded Pulse Numbers

The optimal rule from above works only when the pulse numbers may grow unboundedly. However, we can use the reset protocol to make it work with bounded-size registers. Suppose that the registers dedicated to the pulse numbers may hold the values 0..B, for some bound B. Then the protocol would work by proceeding according to the unbounded rule, and whenever a value should be incremented above B, reset is invoked. Notice that we must require that B is sufficiently large to enable proper operation of the protocol between resets: if B is less than the diameter of the network, for instance, then there is a possible scenario in which the far nodes never get to participate in the protocol.

Distributed Algorithms: December
Lecturer: Nancy Lynch

Lecture

This is the last lecture of the course; we didn't have time to cover a lot of the material we intended to. Some of the main topics missing are fault-tolerant network algorithms and timing-based computing. We'll have several additional lectures in January to cover some of this, for those who are interested. Today we'll do one major result in fault-tolerant network algorithms, namely the impossibility of fault-tolerant consensus in asynchronous networks. We'll also do two other results on ways to get around this limitation.

Fischer-Lynch-Paterson Impossibility Result

Recall the consensus problem from the shared memory work. The interface is defined by the input actions init_i(v) and output actions decide_i(v), where i is a process and v is a value. Here we shall consider the same init/decide interface, but the implementation is now a distributed network algorithm. The message delivery is FIFO and completely reliable. In fact, we will assume that the nodes have reliable broadcast actions, so that when a node sends a message, it is sent simultaneously to all the others, and, because of the reliability of message transmission, it will eventually get there. We consider solving this problem in the face of processor faults; since we are giving an impossibility result, we consider only a very simple kind of faulty behavior: at most one stopping failure. Using such a weak kind of fault makes the impossibility result stronger.

The impossibility of solving consensus in such a friendly environment is rather surprising. The story is that Fischer, Lynch, and Paterson worked on this problem for some while, trying to get a positive result (i.e., an algorithm). Eventually they realized that there was a very fundamental limitation getting in the way: they discovered an inherent problem with reaching agreement in an asynchronous system in the presence of even a single faulty process.

The Model. As in any impossibility result, it is very important to define precisely the assumptions made about the underlying model of computation. Here we assume that the network graph is complete, and that the message system consists of fair FIFO queues. Each process (node) i is an I/O automaton. There are inputs and outputs init_i(v) and decide_i(v), respectively, where i is a node and v is a value. There are also outputs broadcast_i(v) and inputs receive_{j,i}(v), where i and j are nodes and v is a value. The effect of a broadcast_i(v) action is to put, atomically, a v message in the channels leading to all the nodes other than i in the system. With the broadcast action, we actually don't need individual send actions.

We assume, without loss of generality, that each process has only one fairness class (i.e., the processes are sequential), and that the processes are deterministic, in the sense that in each process state there is at most one enabled locally controlled action, and for each state s and action π there is at most one new state s′ such that (s, π, s′) is a transition of the protocol. In fact, we can assume, furthermore without loss of generality, that in each state there is exactly one enabled locally controlled action, since we can always add dummy steps.

Correctness Requirement. For the consensus problem, we require that in any execution such that exactly one init_i occurs for each node i, we have the following conditions satisfied.

Agreement: There are no two different decision values.

Validity: If all the initial values are the same value v, then v is the only possible decision value.

In addition, we need some termination conditions. First, we require that if there is exactly one init_i for each i, and if the entire system (processes and queues) executes fairly, then all processes should eventually decide. This condition is easy to satisfy (e.g., just exchange values and take the majority). But we require more. Define 1-fair executions to be those in which all but at most 1 process execute fairly, and all channels execute fairly. We assume that a process that fails still does its input steps, but it can stop performing locally-controlled actions even if such actions remain enabled. We require that in all 1-fair executions, all processes that don't stop eventually decide.

It seems as though it ought to be possible to solve the problem for 1-fair executions also, especially in view of the powerful broadcast primitive; you should try to design an algorithm. Our main result in this section, however, is the following theorem.

Theorem: There is no 1-resilient consensus protocol.

Henceforth we shall restrict our attention to the binary problem, where the input values are only 0 and 1; this is sufficient to prove impossibility.

Terminology. Define A to be a resilient consensus protocol (RCP) provided that it satisfies agreement, validity, and termination in all its fair executions. Define A to be a 1-resilient consensus protocol (1-RCP) provided it is an RCP and, in addition, all non-stopped processes eventually decide in its 1-fair executions.

We call a finite execution α 0-valent if 0 is the only decision value in all extensions of α. (Note: the decision can occur anywhere in the extension, including the initial segment α.) Similarly, we define 1-valent finite executions. We say that α is univalent if it is either 0-valent or 1-valent, and we say that α is bivalent if it is not univalent; i.e., α is bivalent if there is one extension of α in which someone decides 0 and another extension in which someone decides 1.

These valency concepts capture the idea that a decision may be determined although the actual decision action might not yet have occurred. As in the shared memory case, we will restrict attention in the proof to input-first executions, i.e., executions in which all inputs arrive at the beginning, in round-robin order of init events for processes 1, ..., n. Define an initial execution to be an input-first execution of length exactly n; that is, an initial execution just provides the initial values to all the processes.

We can now start proving the impossibility result. We begin with a statement about the initial executions.

Lemma: In any 1-RCP there is a bivalent initial execution.

Proof: The idea is the same as in the shared memory setting. Assume that the lemma is false, i.e., that all initial executions are univalent. By validity, the all-0 input must lead to a decision of 0, and the all-1 input to a decision of 1. By assumption, any vector of 0's and 1's leads to a unique decision. Hence there exist two input vectors with Hamming distance 1, with corresponding initial executions such that one is 0-valent and the other is 1-valent; let α_0 and α_1 be these two respective initial executions, and let i be the index of the unique process in whose initial value they differ.

The idea is that the rest of the system can't tell which of the vectors was input if i never does anything. More formally, consider a 1-fair execution α′_0 that extends the initial execution α_0, in which i fails right at the beginning, i.e., it takes no locally-controlled steps. In α′_0, some process j eventually decides 0, by the assumptions that α_0 is 0-valent and that the protocol is a 1-RCP.

But then we claim that there is another execution α′_1 that extends α_1 and that also leads to a decision of 0. This is because the only difference between the states at the end of α_0 and α_1 is in the state of process i; now allow the same processes to take the same steps in α′_1 as in α′_0, the same messages to get delivered, etc. (see the figure). The only difference in how α′_0 and α′_1 will unfold is in the state changes caused by the receipt of the same messages by i in the two executions; those states can be different, but since i never performs any locally controlled action in either, the rest of the system cannot tell the difference.

Figure: In both extensions, of α_0 and of α_1, the same processes take the same actions, and process j decides 0 in both; therefore α_1 cannot be 1-valent.

Hence process j must decide 0 in α′_1 too, contradicting the 1-valence of α_1.
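The step in this proof that produces the two neighboring input vectors can be made concrete: if all initial executions were univalent, the decision would be a function of the input vector, and walking from the all-0 vector to the all-1 vector, flipping one bit at a time, must cross a pair of vectors with Hamming distance 1 that decide differently. A sketch, using majority as a hypothetical decision function:

```python
def adjacent_pair(decide, n):
    """Walk 00..0, 10..0, 110..0, ..., 11..1; consecutive vectors differ
    in one position, and the endpoints decide differently (by validity),
    so some adjacent pair must decide differently."""
    vecs = [tuple(1 if j < i else 0 for j in range(n)) for i in range(n + 1)]
    for a, b in zip(vecs, vecs[1:]):
        if decide(a) != decide(b):
            return a, b

majority = lambda v: int(sum(v) > len(v) / 2)
```

With n = 3 and majority as the decision function, the pair found is (1, 0, 0) and (1, 1, 0): they differ only in process 2's input, yet decide 0 and 1 respectively.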

Examples. The majority protocol is an RCP; it has no bivalent initial execution, since an initial execution can only lead to a decision of v in case the value v is the initial value of a majority of the processes.

Another example of an RCP is a leader-based protocol, where everyone sends their values to process 1, which decides on the first value received and informs everyone else about the decision. This protocol does have bivalent initial executions: any initial execution in which there are two processes other than p_1 having different initial values.

We now proceed with the definition of a decider, which is defined somewhat differently from the way it was defined for the shared memory setting.

Definition: A decider for an algorithm A consists of an input-first execution α of A and a process index i, such that the following conditions hold (see the figure):

1. α is bivalent.
2. There exists a 0-valent extension α_0 of α such that the suffix of α_0 after α consists of steps of process i only. (These can be input, output, or internal steps.)
3. There exists a 1-valent extension α_1 of α such that the suffix of α_1 after α consists of steps of process i only.

Note that the exibility in obtaining two dierent executions here arises b ecause of the

p ossible dierent orders of interleaving b etween lo cally controlled steps and messagereceipt α

α α 0 1 i i only only

0−valent 1−valent

Figure schematic diagram of a decider i

steps on the various channels Also messages could arrive in dierent orders in twodierent

executions

Let us now state a lemma ab out the existence of a decider

Lemma Let A beanyRCP with a bivalent initial execution Then there exists a decider

for A

We shall prove this lemma later Let us rst complete the imp ossibility pro of given

Lemma

Proof of Theorem: Suppose A is a 1-RCP. Since A is also a 0-RCP, and since it has a bivalent initial prefix, we can obtain by the lemma a decider (α, i). We now argue a contradiction, in a way that is similar to the argument for the initial case.

Consider a fair extension α_2 of α_0 in which i takes no locally-controlled steps (i.e., i fails right after α_0). Eventually, in α_2, some process j must decide; suppose, without loss of generality, that the decision value is 0. Also, we can suppose, without loss of generality, that there are no message deliveries to i in the suffix of α_2 after α_0: if there are any, then omitting them still leads to the same decision.

Now we claim that there is another execution α'_2 that extends α_1 and that also leads to j deciding 0 (see the figure).

This is because the only differences between the states at the end of α_0 and α_1 are in (a) the state of process i, (b) the last elements in the channels leaving process i (there can be some extras after α_1), and (c) the first elements in the channels entering process i (these can be different: there can be some messages missing after α_1). Then we allow the same processes to take the same steps in the suffix after α_1 as in α_2, the same messages to get delivered, etc. Nothing that i has done will interfere with this; note that we never deliver any of the extra messages. As before, the only difference in what happens is in the state changes caused by the receipt of the same message by i in the two executions: those states can be different, but since i never performs any locally controlled action in either, the rest of the system cannot tell the difference.

[Figure: Contradiction derived from the existence of a decider. The 0-valent extension α_0 (steps of i only) is extended by α_2, with no steps of i, in which process j decides 0; the 1-valent extension α_1 (steps of i only) is extended by α'_2, with no steps of i, in which process j also decides 0.]

Hence process j decides 0 by the end of α'_2, contradicting the 1-valence of α_1.

It remains now to prove the lemma.

Proof of the Lemma: Suppose there is no decider. Then, starting from a bivalent initial execution, we construct a fair execution (with no failures at all) in which successive prefixes are all bivalent, thus violating the termination requirement.

We work in a round-robin fashion: at each phase, we either let one specified process take a locally controlled step, or else we deliver the first message in some channel. Visiting all processes and all channels in round-robin order yields a fair execution (if a channel is empty, we can bypass it in the order).
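The round-robin discipline can be sketched in code. The sketch below is ours, not from the notes (the names `round_robin`, `channels` are our inventions); it shows only the fair-visitation skeleton, not the bivalence-preserving choice of steps that the rest of the proof supplies.

```python
from collections import deque

def round_robin(n, channels, phases):
    """Generate a fair schedule skeleton: visit every process and every
    channel in round-robin order, delivering the oldest message on each
    visited channel and bypassing empty channels.

    channels: dict mapping (sender, receiver) -> deque of pending messages.
    """
    schedule = []
    for _ in range(phases):
        for i in range(n):
            # let process i take one locally controlled step
            schedule.append(("step", i))
        for edge, queue in channels.items():
            if queue:  # bypass empty channels
                schedule.append(("deliver", edge, queue.popleft()))
    return schedule

# Example: two processes, one pending message from 0 to 1.
chans = {(0, 1): deque(["m"]), (1, 0): deque()}
sched = round_robin(2, chans, 1)
# -> [("step", 0), ("step", 1), ("deliver", (0, 1), "m")]
```

Repeating such phases forever gives every process infinitely many turns and eventually delivers every sent message, which is exactly the fairness the construction needs.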

We have now reduced the proof to doing the following. We need to consider one step at a time, where each step starts with a particular bivalent input-first execution, and we need either to:

1. let a particular process i take a turn, or

2. deliver the oldest message in some channel.

We must do this while keeping the result bivalent. We proceed by considering the possible cases of the step at hand.

Case 1: Deliver message m from j to i, where m is the first message in the channel from j to i in the state after α.

Consider the tree of all possible finite extensions of α in which m is delivered to i just at the end (in the interim, arbitrary other steps can occur). If any of these leaves is bivalent, we are done with this stage. So assume all are univalent. As in Loui-Abu-Amara, assume, without loss of generality, that delivering the message immediately after α leads to a 0-valent state. Consider a finite extension of α that contains a decision of 1; this is possible because α is bivalent (see the figure).

[Figure: after α, the delivery of m leads to a 0-valent state, while some finite extension of α contains a decision of 1.]

We now use this to get a configuration involving one step, such that if m is delivered just before the step, the result is a 0-valent execution, whereas if it is delivered just after the step, the result is a 1-valent execution (see the figure). We show this must happen by considering cases, as follows.

[Figure: the delivery of m before step s leads to a 0-valent state, and the delivery of m after step s leads to a 1-valent state.]

If the extension leading to the decision of 1 does not contain a delivery of m, then delivering m right at the end leads to a 1-valent state; then, since attaching the delivery anywhere along the chain gives univalence, there must be consecutive positions along the chain where one gives 0-valence and the other gives 1-valence (see the figure).

[Figure: Left, the delivery of m does not appear in the extension to a 1-valent state; right, m is delivered somewhere in the extension of α. In both cases, the chain of delivery points runs from 0-valent (just after α) to 1-valent.]

On the other hand, if this extension does contain a delivery event for m, then consider the place where this delivery occurs. Immediately after the delivery we must have univalence which, for consistency with the later decision in this execution, must be 1-valence. Then we use the same argument between this position and the top.

Now, if the intervening step (step s in the figure) involves some process other than i, then it is also possible to perform the same action after the delivery of m, and the resulting global state is the same after either order of these two events (see the figure).

[Figure: if s does not involve i, then performing s before or after the delivery of m leads to the same global state, contradicting valency.]

But this is a contradiction, since one execution is supposed to be 0-valent and the other 1-valent. So it must be that the intervening step involves process i. But then this yields a decider, a contradiction to the assumption.

Case 2: Process i takes a locally-controlled step.

Essentially the same argument works in this case; we only sketch it. Consider the tree of finite extensions in which anything happens, not including a locally-controlled step of i, and then i does a locally-controlled step. (Here we use the assumption that locally-controlled steps are always possible.) Note that the tree can include message deliveries to i, but not locally-controlled steps of i, except at the leaves.

This time, we get a configuration involving one step such that if i takes a locally-controlled step just before this step, the result is a 0-valent execution, whereas if i takes its locally-controlled step just after the step, the result is a 1-valent execution. Again, we get this by cases. If the extension doesn't contain a locally-controlled step of i, we can attach it at the end and proceed as in Case 1. If it does, then choose the first such step in the extension and proceed as in Case 1. Then we complete the proof as in Case 1, by showing that either commutativity holds or a decider exists.

Ben-Or's Randomized Protocol for Consensus

Fischer, Lynch, and Paterson show that consensus is impossible in an asynchronous system, even for simple stopping faults. However, this is only for deterministic algorithms. It turns out that the problem can be solved in an asynchronous environment using randomization, with probability 1 of eventually terminating. In fact, an algorithm can even be designed to tolerate stronger types of process faults: Byzantine faults, where processes can move to arbitrary states and send arbitrary messages at any time.

The algorithm uses a large number of processes relative to the number of faults being tolerated (e.g., n > 5f). This can be reduced, but this version is easiest to see. The algorithm proceeds as follows. Each process starts with its initial value in variable x. The algorithm works in asynchronous phases, where each phase consists of two rounds. Each process sends messages of the form (first, r, v) and (second, r, v), where r is the phase number and v is a value in the domain being considered. The code is given below, for binary values. We assume for simplicity that the nodes continue performing this algorithm forever, even though they only decide once.

Initially: x is i's initial value.

For each phase r = 1, 2, ... do:

  Round 1:  broadcast (first, r, x)
            wait for n − f messages of the form (first, r, *)
            if at least n − f messages in this set have the same value v
              then x := v
              else x := nil

  Round 2:  broadcast (second, r, x)
            wait for n − f messages of the form (second, r, *)
            let v be the value occurring most often (break ties arbitrarily),
              and let m be the number of occurrences of v
            if m ≥ n − f then DECIDE(v); x := v
            else if m ≥ n − 3f then x := v
            else x := random

Figure: Randomized consensus algorithm for an asynchronous network; code for process i.

Correctness:

Validity: If all the processes start with the same initial value v, then it is easy to see that all nonfaulty processes decide on v in the first phase. In the first round of phase 1, all the nonfaulty processes broadcast (first, 1, v) and receive (first, 1, v) messages from at least n − f processes. Then, in round 2 of phase 1, all nonfaulty processes broadcast (second, 1, v) and again receive at least n − f (second, 1, v) messages.
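To make the two-round structure concrete, here is a small lock-step simulation of the protocol. This is our sketch, not part of the notes: processes never actually crash, and the "wait for n − f messages" rule is modeled by having each receiver hear from a random subset of n − f senders per round. All names are our own.

```python
import random

def ben_or(inputs, f, seed=0, max_phases=2000):
    """Lock-step simulation of the Ben-Or protocol for binary values.
    Each receiver hears from only n - f senders per round (a randomly
    chosen subset), modeling the 'wait for n - f messages' rule."""
    rng = random.Random(seed)
    n = len(inputs)
    assert n > 5 * f
    x = list(inputs)                 # each process's variable x
    decision = [None] * n
    for r in range(max_phases):
        # Round 1: broadcast (first, r, x); collect n - f of them.
        received = [[x[s] for s in rng.sample(range(n), n - f)]
                    for _ in range(n)]
        # adopt v if at least n - f received messages equal v, else nil
        x = [msgs[0] if len(set(msgs)) == 1 else None for msgs in received]
        # Round 2: broadcast (second, r, x); collect n - f of them.
        received = [[x[s] for s in rng.sample(range(n), n - f)]
                    for _ in range(n)]
        for i in range(n):
            msgs = [v for v in received[i] if v is not None]
            v = max(set(msgs), key=msgs.count) if msgs else None
            m = msgs.count(v)
            if m >= n - f:           # enough votes: decide
                decision[i] = v
                x[i] = v
            elif m >= n - 3 * f:     # forced choice
                x[i] = v
            else:                    # toss a coin
                x[i] = rng.choice([0, 1])
        if all(d is not None for d in decision):
            break
    return decision

print(ben_or([0] * 6, f=1))  # validity case: everyone decides 0 in phase 1
```

With a common input the run is deterministic and finishes in one phase; with mixed inputs, the coin flips in the last branch eventually line up, matching the termination argument below.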

Agreement: We show that the nonfaulty processes cannot disagree. We will also see that all nonfaulty processes terminate quickly once one process decides. Suppose that process i decides v at phase r. This can happen only if i receives at least n − f (second, r, v) messages, which guarantees that each other nonfaulty process j receives at least n − 3f (second, r, v) messages. This is true since, although there can be some difference in which processes' messages are received by i and j, process j must receive messages from at least n − 2f of the n − f senders of the (second, r, v) messages received by i. Among those senders, at least n − 3f must be nonfaulty, so they send the same message to j as to i. A similar counting argument shows that v must be the value occurring most often in p_j's set of messages, since n > 5f. Therefore, process j cannot decide on a different value at phase r. Moreover, j will set x := v at phase r. Since this holds for all nonfaulty processes j, it follows, as in the argument for validity, that all nonfaulty processes that have not yet decided will decide v in phase r + 1.
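The counting in the agreement argument can be checked mechanically. The following quick sanity check (ours, not from the notes) verifies the overlap arithmetic for all small n and f with n > 5f.

```python
# Quorum-overlap arithmetic behind the agreement argument:
# i's and j's sets of n - f senders overlap in at least n - 2f
# processes; discarding up to f faulty ones leaves at least n - 3f
# senders that sent the same (second, r, v) to both; and n - 3f is a
# strict majority of the n - f messages j waits for whenever n > 5f.
for f in range(1, 20):
    for n in range(5 * f + 1, 10 * f + 2):
        assert (n - f) + (n - f) - n == n - 2 * f   # overlap size
        assert (n - 2 * f) - f == n - 3 * f         # nonfaulty part
        assert 2 * (n - 3 * f) > n - f              # strict majority
print("ok")
```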

Termination: Now it remains to argue that the probability that someone eventually terminates is 1. So consider any phase r. We will argue that, regardless of what has happened at prior rounds, with probability at least 2^(−n) (which is quite small, but nevertheless positive), all nonfaulty processes will end up with the same value of x at the end of round 2 of phase r. In this case, by the argument for agreement, all nonfaulty processes will decide by phase r + 1.

For this phase r, consider the first time any nonfaulty process, say i, reaches the point of having received at least n − f (first, r, *) messages. Let M be this set of messages, and let v be the majority value of the set (define v arbitrarily if there is no majority value). We now claim that, for any (second, r, w) message sent by a nonfaulty process, we must have w = v or w = nil. To see that this is true, suppose w ≠ nil. Now, if (second, r, w) is sent by a nonfaulty process j, then j receives at least n − f (first, r, w) messages. By counting, at least n − 3f (first, r, w) messages appear in M, and since n > 5f, this is a majority, and so we have w = v.

Next, we claim that, at phase r, the only value that can be forced upon any nonfaulty process as its choice is v. To see this, note that for a process to be forced to choose w, the process must receive at least n − 3f (second, r, w) messages. Since at least one of these messages is from a nonfaulty process, it follows from the previous claim that w = v.

Now note that the value v is determined at the first point where a nonfaulty process has received at least n − f (first, r, *) messages. Thus, by the last claim, the only forcible value for phase r is determined at that point. But this point is before any nonfaulty process tosses a coin for phase r, and hence these coin tosses are independent of the choice of forcible value for phase r. Therefore, with probability at least 2^(−n), all processes tossing coins will choose v, agreeing with all those that do not toss coins. This gives the claimed decision by phase r + 1.

The main interest in this result is that it shows a significant difference between the randomized and nonrandomized models: the consensus problem can be solved in the randomized model. The algorithm is not practical, however, because its expected termination time is very high. We remark that cryptographic assumptions can be used to improve the probability, but the algorithms and analysis are quite difficult; see Feldman.

The Parliament of Paxos

In this section we'll see Lamport's recent consensus algorithm for deterministic asynchronous systems. Its advantage is that in practice it is very tolerant of faults of the following types: node stopping and recovery; lost, out-of-order, or duplicated messages; and link failure and recovery. However, it's a little hard to state exactly what fault-tolerance properties it has, so instead we'll present the algorithm and then describe its properties informally. We remark that the protocol always satisfies agreement and validity. The only issue is termination: under what circumstances are nonfaulty processes guaranteed to terminate?

The Protocol: All (or some) of the participating nodes keep trying to conduct ballots. Each ballot has an identifier (t, i), which is a (timestamp, index) pair, where the timestamp needn't be a logical time but can just be an increasing number at each node. The originator of each ballot tries (if it doesn't fail) to associate a value with the ballot and get the ballot accepted by a majority of the nodes. If it succeeds in doing so, the originator decides on the value in the ballot and informs everyone else of the decision.

The protocol proceeds in five rounds, as follows.

1. The originator sends the identifier (t, i) to all processes, including itself. If a node receives (t, i), it commits itself to voting NO in all ballots with identifier smaller than (t, i) on which it has not yet voted.

2. Each process sends back, to the originator of (t, i), the set of all pairs (t', i') indicating those ballots with identifier smaller than (t, i) on which the node has previously voted YES; it also encloses the final values associated with those ballots. If the originator receives this information from a majority, it chooses, as the final value of the ballot, the value v associated with the largest (t', i') < (t, i) that arrives in any of these messages. If there is no such value reported by anyone in the majority, the originator chooses its own initial value as the value of the ballot.

3. The originator sends out (t, i, v), where v is the chosen value from round 2.

4. Each process that hasn't already voted NO on ballot (t, i) votes YES and sends its vote back to the originator. If the originator receives YES votes from a majority of the nodes, it decides on the value v.

5. The originator sends out the decision value to all the nodes. Any node that receives it also decides.
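As an illustration only, the five rounds can be sketched for the failure-free case, with messages replaced by direct access to node records. The class `Node`, the function `run_ballot`, and all data-structure choices below are ours, not part of the notes; the sketch shows why a later ballot must adopt the value of an earlier accepted one.

```python
class Node:
    def __init__(self, initial_value):
        self.initial = initial_value
        self.promised = (-1, -1)   # commits to vote NO below this ballot id
        self.voted_yes = {}        # ballot id (t, i) -> associated value
        self.decision = None

def run_ballot(nodes, originator, ballot_id):
    majority = len(nodes) // 2 + 1
    # Round 1: broadcast the identifier; receivers commit to voting NO
    # on all smaller ballots on which they have not yet voted.
    for nd in nodes:
        nd.promised = max(nd.promised, ballot_id)
    # Round 2: a majority reports earlier YES votes (with their values);
    # the originator adopts the value of the largest reported ballot,
    # else it uses its own initial value.
    reports = [(b, v) for nd in nodes[:majority]
               for b, v in nd.voted_yes.items() if b < ballot_id]
    value = max(reports)[1] if reports else nodes[originator].initial
    # Rounds 3-4: send (ballot_id, value); a node votes YES unless it
    # has since committed to a larger ballot.
    yes = 0
    for nd in nodes:
        if nd.promised <= ballot_id:
            nd.voted_yes[ballot_id] = value
            yes += 1
    if yes >= majority:
        # Round 5: the decision is sent out; every node decides.
        for nd in nodes:
            nd.decision = value
        return value
    return None

nodes = [Node(v) for v in (0, 1, 1)]
print(run_ballot(nodes, originator=0, ballot_id=(1, 0)))  # -> 0
print(run_ballot(nodes, originator=1, ballot_id=(2, 1)))  # -> 0 (not 1)
```

The second ballot, although originated by a node with initial value 1, hears about the earlier YES votes for value 0 and is forced to choose 0, which is exactly the mechanism the agreement proof below relies on.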

We first claim that this protocol satisfies agreement and validity. Termination will need some more discussion.

Validity: Obvious, since a node can only decide on a value that is some node's initial value.

Agreement: We prove agreement by contradiction: suppose there is disagreement. Let b be the smallest ballot on which some process decides, and say that the associated value is v. By the protocol, some majority, say M, of the processes vote YES on b. We first claim that v is the only value that gets associated with any ballot strictly greater than b. For suppose not, and let b' be the smallest ballot greater than b that has a different value, say v' ≠ v. In order to fix the value of ballot b', the originator has to hear from a majority, say M', of the processes. M' must contain at least one process, say i, that is also in M, i.e., that votes YES on b. Now it cannot be the case that i sends its information (round 2) for b' before it votes YES (round 4) on b, for at the time it sends this information it commits itself to voting NO on all smaller ballots. Therefore, i votes YES on b before sending its information for b'. This in turn means that (b, v) is included in the information i sends for b'. So the originator of b' sees a YES for ballot b. By the choice of b', the originator cannot see any value other than v for any ballot between b and b'. So it must be the case that the originator chooses v, a contradiction.

Termination: Note that it is possible for two processes to keep starting ballots and prevent each other from finishing. Termination can be guaranteed if there exists a sufficiently long time during which there is only one originator trying to start ballots (in practice, processes could drop out if they see someone with a smaller index originating ballots, but this is problematic in a purely asynchronous system), and during which time a majority of the processes and the intervening links all remain operative.

The reason for the "Paxos" name is that the entire algorithm is couched in terms of a fictitious Greek parliament, in which legislators keep trying to pass bills. Probably the cutest part of the paper is the names of the legislators.

6.852J/18.437J Distributed Algorithms: Handout (September)

Homework Assignment 1

Due: Thursday, September

Reading: The Lamport-Lynch survey paper.

Exercises:

1. Carry out careful inductive proofs of correctness for the LeLann-Chang-Roberts (LCR) algorithm outlined in class.

2. For the LCR algorithm:

(a) Give a UID assignment for which Ω(n²) messages are sent.

(b) Give a UID assignment for which only O(n) messages are sent.

(c) Show that the average number of messages sent is O(n log n), where this average is taken over all the possible orderings of the elements on the ring, each assumed to be equally likely.

3. Write code for a state machine to express the Hirschberg-Sinclair algorithm, in the same style as the code given in class for the LCR algorithm.

4. Show that the Frederickson-Lynch counterexample algorithm doesn't necessarily have the desired O(n) message complexity if processors can wake up at different times. Briefly describe a modification to the algorithm that would restore this bound.

5. Prove the best lower bound you can for the number of rounds required, in the worst case, to elect a leader in a ring of size n. Be sure to state your assumptions carefully.

6.852J/18.437J Distributed Algorithms: Handout (September)

Homework Assignment 2

Due: Thursday, September

Reading: "The Byzantine generals problem," by Lamport, Shostak, and Pease.

Exercises:

1. Prove the best lower bound you can for the number of rounds required, in the worst case, to elect a leader in a ring of size n. Be sure to state your assumptions carefully.

2. Recall the proof of the lower bound on the number of messages for electing a leader in a synchronous ring. Prove that the bit-reversal ring described in class (for n = 2^k, for any positive integer k) is symmetric.

3. Consider a synchronous bidirectional ring of unknown size n, in which processes have UIDs. Give upper and lower bounds on the number of messages required for all the processors to compute n mod 2 (output is via a special message).

4. Write pseudocode for the simple algorithm discussed in class for determining shortest paths from a single source i_0 in a weighted graph. Give an invariant assertion proof of correctness.

5. Design an algorithm for electing a leader in an arbitrary strongly connected directed graph network, assuming that the processes have UIDs but that they have no knowledge of the size or shape of the network. Analyze its time and message complexity.

6. Prove the following lemma: if all the edges of a connected weighted graph have distinct weights, then the minimum spanning tree is unique.

7. Consider the following variant of the generals problem. Assume that the network is a complete graph of n participants, and that the system is deterministic (i.e., non-randomized). The validity requirement is the same as described in class. However, suppose the agreement requirement is weakened to say that if any general decides 1, then there are at least two that decide 1. That is, we want to rule out the case where one general attacks alone.

Is this problem solvable or unsolvable? Prove your answer.

6.852J/18.437J Distributed Algorithms: Handout (September)

Homework Assignment 3

Due: Thursday, October

Exercises:

1. Show that the generals problem with link failures is unsolvable in any nontrivial connected graph.

2. Consider the generals problem with link failures, for two processes, in a setting where each message has an independent probability p of getting delivered successfully. Assume that each process can send only one message per round. For this setting, design an algorithm that guarantees termination, has low probability ε of disagreement for any pair of inputs, and has high probability L of attacking in case both inputs are 1 and there are no failures. Determine ε and L in terms of p for your algorithm.

3. In the proofs of bounds for randomized generals algorithms, we used two definitions of information level: level_A(i, k) and mlevel_A(i, k). Prove that for any A, i, and k, these two are always within 1 of each other.

4. Be the Adversary!

(a) Consider the algorithm given in class for consensus in the presence of processor stopping faults, based on a labeled tree. Suppose that, instead of running for f + 1 rounds, the algorithm only runs for f rounds, with the same decision rule. Describe a particular execution in which the correctness requirements are violated. (Hint: a process may stop in the middle of a round, before completing sending all its messages.)

(b) Now consider the algorithm given for consensus in the presence of Byzantine faults. Construct an execution to show that the algorithm gives a wrong result if it is run with either (i) … nodes, … faults, and … rounds, or (ii) … nodes, … faults, and … rounds.

5. Using the stopping-fault consensus algorithm as a starting point, modify the correctness conditions, algorithm, and proof as necessary to obtain a consensus algorithm that is resilient to Byzantine faults, assuming that authentication of messages is available.

6. (Optional) Consider the optimized version of the stopping-fault consensus algorithm, where only the first two messages containing distinct values are relayed. In this algorithm, it isn't really necessary to keep all the tree information around. Can you reduce the information kept in the state, while still preserving the correctness of the algorithm? You should sketch a proof of why your optimized algorithm works.

6.852J/18.437J Distributed Algorithms: Handout (October)

Homework Assignment 4

Due: Thursday, October

Reading: Lynch and Tuttle, "An Introduction to Input/Output Automata," CWI Quarterly.

Exercises:

1. Consider the network graph G depicted in the figure.

[Figure: the graph G, a connected communication graph on the seven nodes a, b, c, d, e, f, g.]

Show directly (i.e., not by quoting the connectivity bound stated in class, but by a proof specific to this graph) that Byzantine agreement cannot be solved in G when two processors are faulty.

2. In class, a recursive argument was used to show the impossibility of f-round consensus in the presence of f stopping failures. If the recursion is unwound, the argument essentially constructs a long chain of runs, where each two consecutive runs in the chain look the same to some process. How long is the chain of runs?

(Optional) Can you shorten this chain using Byzantine faults rather than stopping faults?

3. Write code for a correct 3-phase commit protocol.

4. Fill in more details in the inductive proof of mutual exclusion for Dijkstra's algorithm.

5. Describe an execution of Dijkstra's algorithm in which a particular process is locked out forever.

6.852J/18.437J Distributed Algorithms: Handout (October)

Homework Assignment 5

Due: Thursday, October

Exercises:

1. Consider the Burns mutual exclusion algorithm.

(a) Exhibit a fair execution in which some process is locked out.

(b) Do a time analysis for deadlock-freedom. That is, assume that each step outside the remainder region takes at most 1 time unit, and give upper and lower bounds on the length of the time interval in which there is some process in T and no process in C. (An upper bound means an analysis that applies to all schedules; a lower bound means a particular schedule of long duration.)

2. Why does the Lamport bakery algorithm fail if the integers are replaced by the integers mod B, for some very large value of B? Describe a specific scenario.

3. Design a mutual exclusion algorithm for a slightly different model, in which there is one extra process, a supervisor process, that can always take steps. The model should use single-writer multi-reader shared variables. Analyze its complexity.

4. Design an atomic read-modify-write object based on lower-level multi-writer multi-reader read/write memory. The object should support one operation, apply(f), where f is a function. In the serial specification, the effect of apply(f) on an object with value v is that the value of the object becomes f(v), and the original value v is returned to the user. The executions can be assumed to be fair.

5. Consider the following variant of the unbounded-registers atomic snapshot object. In this variant, the sequence number is also incremented each time a snap is completed, rather than only in update. Does this variant preserve the correctness of the original algorithm?

6.852J/18.437J Distributed Algorithms: Handout (October)

Homework Assignment 6

Due: Thursday, October

Exercises:

1. Why does the Lamport bakery algorithm fail if the integers are replaced by the integers mod B, for some very large value of B? Describe a specific scenario.

2. Programmers at the Flaky Computer Corporation designed the following protocol for n-process mutual exclusion with deadlock-freedom.

Shared variables: x and y; initially y = 0.

Code for p_i:

  ** Remainder region **
  try_i:
  L: x := i
     if y ≠ 0 then goto L
     y := 1
     x := i
     if x ≠ i then goto L
  crit_i
  ** Critical region **
  exit_i:
     y := 0
  rem_i

Does this protocol satisfy the two claimed properties? Sketch why it does, or show that it doesn't. If you quote an impossibility result to say that it doesn't satisfy the properties, then you must still go further and actually exhibit executions in which the properties are violated.

3. Reconsider the consensus problem using read/write shared memory. This time, suppose that the types of faults being considered are more constrained than general stopping failures: in particular, the only kind of failure is stopping right at the beginning (never performing any non-input steps). Is the consensus problem solvable? Give an algorithm or an impossibility proof.

4. Let A be any I/O automaton, and suppose that α is a finite execution of A.

(a) Prove that there is a fair execution of A that is an extension of α.

(b) If β is any finite or infinite sequence of input actions of A, prove that there is a fair execution α' of A, extending α, such that the sequence of input actions of beh(α') after beh(α) is identical to β.

6.852J/18.437J Distributed Algorithms: Handout (October)

Homework Assignment 7

Due: Thursday, November

Exercises:

1. Let A be any I/O automaton. Show that there is another I/O automaton B, with only a single partition class, that implements (or solves) A, in the sense that fairbehs(B) ⊆ fairbehs(A).

2. Rewrite the Lamport bakery algorithm as an I/O automaton, in precondition-effects notation. While doing this, generalize the algorithm slightly, by introducing as much nondeterminism in the order of execution of actions as you can. (The precondition-effects notation makes it easier to express nondeterministic order of actions than does the usual flow-of-control notation.)

3. (Warning: we haven't entirely worked this one out.) Consider a sequential timestamp object based on the sequential timestamp system discussed informally in class (the one with the 3^n values). This is an I/O automaton that has a big shared variable label, containing the latest labels (you can ignore the values) for all processes, and that has two indivisible actions involving the label vector: an action that indivisibly looks at the entire label vector, chooses a new label according to the rule based on full components, and writes it in the label vector (this is all done indivisibly); and an action that just snaps the entire label vector.

(a) Write this as an IOA.

(b) Prove the following invariant: for any cycle at any level, at most two of the three components are occupied.

(c) Describe a similar sequential timestamp system based on unbounded real-number timestamps.

(d) Use a possibilities mapping to show that your algorithm of (a) implements your algorithm of (c).

Plenty of partial credit will be given for good attempts.

6.852J/18.437J Distributed Algorithms: Handout (November)

Homework Assignment 8

Due: Thursday, November

Exercises:

1. Give example executions to show that the multi-writer multi-reader register implementation given in class (based on a concurrent timestamp object) is not correctly serialized by serialization points placed as follows:

(a) For a read: at the point of the embedded snap operation. For a write: at the point of the embedded update operation.

(b) For all operations: at the point of the embedded snap operation.

Note that the embedded snap and update operations are embedded inside the atomic snapshot object out of which the concurrent timestamp object is built.

2. Consider the following simple majority-voting algorithm for implementing a multi-writer multi-reader register in the majority-snapshot shared memory model. In this model, the memory consists of a single vector object, one position per writer process, with each position containing a pair consisting of a value v and a positive integer timestamp t. Each read operation instantaneously reads any majority of the memory locations and returns the value associated with the largest timestamp. Each write operation instantaneously reads any majority of the memory locations, determines the largest timestamp t, and writes the new value to any majority of the memory locations, accompanied by timestamp t + 1.

Use the lemma proved in class to sketch a proof that this is a correct implementation of an atomic multi-writer multi-reader register.

3. Show that every exclusion problem can be expressed as a resource problem.

4. Show that Lamport's construction, when applied to atomic registers, does not necessarily give rise to an atomic register.

6.852J/18.437J Distributed Algorithms: Handout (November)

Homework Assignment 9

Due: Thursday, November

Exercises:

1. Give an explicit I/O automaton, with a single-writer multi-reader user interface, that preserves user well-formedness (alternation of requests and responses), whose fair well-formed behaviors contain responses for all invocations, and whose well-formed external behaviors are exactly those that are legal for a regular read/write register. Do this also for a safe read/write register.

2. Extend the Burns lower bound on the number of messages needed to elect a leader in an asynchronous ring, so that it applies to rings whose sizes are not powers of two.

3. Give an explicit algorithm that allows a distinguished leader node i_0, in an arbitrary connected network graph G, to calculate the exact total number of nodes in G. Sketch a proof that it works correctly.

4. Consider a banking system in which each node of a network keeps a number indicating an amount of money. We assume for simplicity that there are no external deposits or withdrawals, but messages travel between nodes at arbitrary times, containing money that is being transferred from one node to another. The channels preserve FIFO order. Design a distributed network algorithm that allows each node to decide on (i.e., output) its own balance, so that the total of all the balances is the correct amount of money in the system. To rule out trivial cases, we suppose that the initial state of the system is not necessarily quiescent; i.e., there can be messages in transit initially.

The algorithm should not halt or delay transfers unnecessarily. Give a convincing argument that your algorithm works.

JJ Distributed Algorithms Handout Novemb er

Homework Assignment

Due Thursday December

Exercises

This question is op en ended and wehavent worked it out entirely Compare the

op eration of the GallagerHumbletSpira spanning tree algorithm to that of the syn

chronous version discussed earlier in the course Eg is there any relationship b etween

the fragments pro duced in the two cases Perhaps such a connection can b e used in

a formal pro of

Design (outline) an asynchronous algorithm for a network with a leader node i_0 that allows i_0 to discover and output the maximum distance d from itself to the furthest node in the network. The algorithm should work in time O(d). What is the message complexity?

Consider a square grid graph G consisting of n x n nodes. Consider a partition P_k into k^2 equal-sized clusters, obtained by dividing each side into k equal intervals. In terms of n and k, what are the time and message complexities of the cluster synchronizer (synchronizer gamma) based on partition P_k?

Obtain the best upper bound you can for an asynchronous solution to the k-session problem.

6.852J/18.437J Distributed Algorithms, Handout (December)

Homework Assignment

Due: Thursday, December

Exercises

Develop a way of simulating single-writer multi-reader shared memory algorithms in an asynchronous network. Your method should be based on caching, where readers keep local copies of all the variables and just read their local copies if possible. The writer, however, needs to carry out a more complicated protocol in order to write. Describe appropriate protocols for the writer (and also for the reader, in case the local copy is not available) in order to simulate instantaneous read/write shared memory. Be sure to guarantee that each process's invocations eventually terminate.

Outline why your algorithm works correctly.

Optimize the mutual exclusion algorithm described in class (based on Lamport's state machine simulation strategy) in order not to keep all the messages ever received from all other processes.

Prove the following invariant for the Dijkstra-Scholten termination detection algorithm: All non-idle nodes form a tree rooted at the node with status leader, where the tree edges are given by the parent pointers.

In the shortest paths problem, we are given a graph G = (V, E) with weights w(e) for all e in E, and a distinguished node i_0 in V. The objective is to compute, at each node, its weighted distance from i_0. Consider the Bellman-Ford algorithm for synchronous networks, defined as follows. Each node i maintains a nonnegative distance_i variable. The algorithm proceeds in rounds as follows. At each round, all the nodes send to their neighbors the value of their distance variable. The distinguished node i_0 has distance_{i_0} fixed at 0, and all other nodes i assign, at each round,

    distance_i := min { distance_j + w(j, i) : (j, i) in E }.

Assume that the state of a node consists only of its distance variable. Prove that this algorithm is self-stabilizing.
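To make the round structure concrete, here is a hypothetical Python sketch of the relaxation rule defined in the exercise (the graph, weights, and corrupted starting values are invented for illustration; only the update rule comes from the exercise):

```python
def bf_round(dist, adj, root):
    """One synchronous round of the exercise's rule:
    distance_i := min over edges (j, i) of distance_j + w(j, i);
    the distinguished node keeps distance 0."""
    return {i: 0 if i == root else min(dist[j] + w for j, w in adj[i])
            for i in dist}

# Undirected 4-node example graph; adj[i] lists (neighbor, weight) pairs.
adj = {0: [(1, 1), (2, 4)],
       1: [(0, 1), (2, 1), (3, 5)],
       2: [(0, 4), (1, 1), (3, 1)],
       3: [(1, 5), (2, 1)]}

# Start from an arbitrary corrupted state, as self-stabilization requires.
dist = {0: 7, 1: 0, 2: 9, 3: 2}
for _ in range(len(adj)):
    dist = bf_round(dist, adj, 0)
print(dist)  # converges to the true distances {0: 0, 1: 1, 2: 2, 3: 3}
```

Running a few rounds from the corrupted state shows the distances settling to the correct values, which is the behavior the exercise asks you to prove in general.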

Appendix

Notes for extra lectures

6.852J/18.437J Distributed Algorithms (January)

Lecturer: Nancy Lynch

Lecture

A Implementing Reliable Point-to-Point Communication in terms of Unreliable Channels

In our treatment of asynchronous networks, we have mostly dealt with the fault-free case. (Exceptions: self-stabilization, single arbitrary failure, processor stopping faults and Byzantine faults for consensus.) Now we emphasize failures, but in a very special case: a two-node network.

The problem to be considered is reliable communication between users at two nodes. Messages are sent by some user at one node and are supposed to be delivered to a user at the other. Generally, we want this delivery to be reliable: FIFO, exactly once each.

The sender and receiver nodes are I/O automata. The two channels inside are generally not our usual reliable channels, however. They may have some unreliability properties, e.g.:
  loss of packets,
  duplication,
  reordering.
However, we don't normally allow worse faults, such as manufacturing of completely bogus messages. We also don't allow total packet loss; e.g., we have a liveness assumption that says something like:

  If infinitely many packets are sent, then infinitely many of them are delivered.

We have to be careful to state this right: since we are considering duplication of packets, it's not enough just to say that infinitely many sends imply infinitely many receives; we want infinitely many of the different send events to have corresponding receive events. We don't want infinitely many receives of some early message while infinitely many new messages keep being inserted.

We can usually formalize the channels as I/O automata, using IOA fairness to express the liveness condition, but it may be more natural just to specify their allowable external behavior, in the form of a "cause" function from packet receipt events to packet send events, satisfying certain properties.

Later, we will want to generalize the conditions to utilize a port structure, and say that the channel exhibits FIFO behavior and liveness with respect to each port separately. This amounts to a composition of channels, one for each pair of matching ports.

A Stenning's Protocol

The sender places incoming messages in a buffer, buffer_s. It tags the messages in its buffer with successively larger integers (no gaps), starting from 1. The receiver assembles a buffer, buffer_r, also containing messages with tags starting from 1. As the buffer gets filled in, the receiver can deliver the initial messages, in order.

(An alternative modeling choice: eliminate the sender's buffer in favor of a handshake protocol with the user, telling it when it can submit the next message.)

Other than the basic liveness condition described earlier, this protocol can tolerate a lot of unreliable behavior on the part of the channel: loss, duplication and reordering.

Note that, strictly speaking, only one direction of underlying channel is required, although in this case the sender never knows to stop sending any particular message. If we would like to eventually stop sending each message after it has been delivered, we need to also add an acknowledgement protocol in the receiver-to-sender direction.

The protocol can send lots of messages concurrently, but just to be specific and simple, we describe a sequential version of the protocol, in which the sender waits to know that the previous message has been delivered to the receiver before starting to send the next message.

Code: see Figures A.1 and A.2.

Correctness proof:

We can give a high-level spec that is basically a queue, MQ, with a single state component, mqueue. A send_msg action puts a message on the end, and a receive_msg removes it from the front. This is the same as our earlier channel spec. The liveness condition is then the same as we used before for individual channels in our reliable network model: eventual delivery of the first message. (This has an easy spec as an IOA.)

The implementation consists of IOAs for the sender and receiver, but it is natural to use a more general model for the underlying channels. However, I don't want to just use axioms for external behavior: to do a mapping proof, it is still convenient to have a state machine, even though the fairness condition needed isn't so easily specified using IOA fairness.

The channel state machine has as its state a set: a send puts an element into the set, and the channel can deliver a packet any number of times if it's in the set; a packet never gets removed from the set.
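This set-based channel can be rendered as a short, hypothetical Python sketch (the class name and the use of random choice to model nondeterminism are illustrative assumptions, not from the notes):

```python
import random

class SetChannel:
    """State is a set of packets: send_pkt adds an element; any packet in
    the set may be delivered, any number of times; nothing is ever removed.
    This models loss (a packet may simply never be chosen), duplication,
    and reordering all at once."""

    def __init__(self):
        self.packets = set()

    def send_pkt(self, p):
        self.packets.add(p)  # packets are never removed

    def receive_pkt(self):
        # Nondeterministic delivery: any packet currently in the set.
        if not self.packets:
            return None
        return random.choice(sorted(self.packets))

ch = SetChannel()
ch.send_pkt(("hello", 1))
ch.send_pkt(("hello", 1))  # duplicate send leaves the set unchanged
ch.send_pkt(("world", 2))
assert len(ch.packets) == 2
assert ch.receive_pkt() in {("hello", 1), ("world", 2)}
assert ch.receive_pkt() in {("hello", 1), ("world", 2)}  # redelivery allowed
```

The liveness condition (infinitely many sends imply infinitely many deliveries of the different sent packets) is what this state machine alone does not capture; it is stated separately as a liveness predicate.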

Interface:
  Input:
    send_msg(m), m in M
    receive_pkt_rt(i), i a nonnegative integer
  Output:
    send_pkt_tr(m, i), m in M, i a nonnegative integer

State:
  buffer, a finite queue of elements of M, initially empty, and
  integer, a nonnegative integer, initially 1

Steps:
  send_msg(m), m in M
    Effect:
      add m to buffer

  send_pkt_tr(m, i), m in M, i a nonnegative integer
    Precondition:
      m is first on buffer
      i = integer
    Effect:
      None

  receive_pkt_rt(i), i a nonnegative integer
    Effect:
      if i = integer then
        remove first element (if any) from buffer
        integer := integer + 1

Partition:
  all send_pkt_tr actions are in one class

Figure A.1: Stenning Sender A_s

Interface:
  Input:
    receive_pkt_tr(m, i), m in M, i a nonnegative integer
  Output:
    receive_msg(m), m in M
    send_pkt_rt(i), i a nonnegative integer

State:
  buffer, a finite queue of elements of M, initially empty, and
  integer, a nonnegative integer, initially 0

Steps:
  receive_msg(m), m in M
    Precondition:
      m is first on buffer
    Effect:
      remove first element from buffer

  receive_pkt_tr(m, i), m in M, i a nonnegative integer
    Effect:
      if i = integer + 1 then
        add m to buffer
        integer := integer + 1

  send_pkt_rt(i), i a nonnegative integer
    Precondition:
      i = integer
    Effect:
      None

Partition:
  all send_pkt_rt actions are in one class, and
  all receive_msg actions are in another class

Figure A.2: Stenning Receiver A_r
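As a concrete illustration (a hypothetical Python sketch, not part of the notes: the round-robin scheduling loop and all names are made up, while the transition rules follow the two figures), the sequential protocol can be run over set-style channels that lose, duplicate, and reorder:

```python
import random

class SetChannel:
    """Packets in transit form a set; any element may arrive at any time."""
    def __init__(self): self.pkts = set()
    def send(self, p): self.pkts.add(p)
    def recv(self):
        return random.choice(sorted(self.pkts)) if self.pkts else None

def run_stenning(msgs, steps=10_000, seed=0):
    random.seed(seed)
    sr, rs = SetChannel(), SetChannel()  # the tr and rt channels
    buf_s, int_s = list(msgs), 1         # sender: buffer, current tag
    buf_r, int_r = [], 0                 # receiver: buffer, last accepted tag
    delivered = []
    for _ in range(steps):
        if buf_s:
            sr.send((buf_s[0], int_s))          # send_pkt_tr(m, integer)
        p = sr.recv()
        if p is not None and p[1] == int_r + 1:  # accept only the next tag
            buf_r.append(p[0]); int_r += 1
        if buf_r:
            delivered.append(buf_r.pop(0))       # receive_msg(m)
        rs.send(int_r)                           # ack the last accepted tag
        a = rs.recv()
        if a == int_s and buf_s:                 # ack for current tag: advance
            buf_s.pop(0); int_s += 1
    return delivered

assert run_stenning(["a", "b", "c"]) == ["a", "b", "c"]
```

Old packets and old acks circulate forever in the sets, but the tag comparisons reject them, matching the safety argument that follows.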

Liveness (i.e., the infinite-send/infinite-receive condition) is specified as an additional condition, a liveness predicate.

In order to prove that this works, it is useful to have some invariants.

Lemma: The following are true about every reachable state of Stenning's protocol.
1. integer_r <= integer_s <= integer_r + 1.
2. All packets in queue_sr have integer tag at most equal to integer_s.
3. All packets in queue_sr with integer tag equal to integer_s have their message components equal to the first message of buffer_s.
4. All packets in queue_rs have integer tag at most equal to integer_r.
5. If integer_s appears anywhere other than the sender's state, then buffer_s is nonempty.

To show the safety property, we show that every behavior of the implementation is also a behavior of the spec. We do this by using a forward simulation (actually, a refinement). Specifically, if s and u are states of the implementation and the specification, respectively, then we define u = r(s) provided that u.queue is determined as follows.

If s.integer_s = s.integer_r, then u.queue is obtained by first removing the first element of s.buffer_s, and then concatenating s.buffer_r and the reduced buffer_s. (The invariant implies that s.buffer_s is nonempty in this case.)

Otherwise, u.queue is just the concatenation of s.buffer_r and s.buffer_s.

Now we show that r is a refinement. The correspondence of initial states is easy, because all the queues are empty. For the inductive step, given a step (s, pi, s') and u = r(s), we consider the following cases.

Case pi = send_msg(m):
Let u' = r(s'). We must show that (u, pi, u') is a step of MQ. This is true because the send_msg event modifies both queues in the same way.

Case pi = receive_msg(m):
Let u' = r(s'). We must show that (u, pi, u') is a step of MQ. Since pi is enabled in s, m is first on s.buffer_r, so m is also first on u.queue; so pi is also enabled in u. Both queues are modified in the same way.

Case pi = send_pkt_tr(m, i):
Then we must show that u = r(s'). This is true because the action doesn't change the virtual queue.

Case pi = receive_pkt_tr(m, i):
We show that u = r(s'). The only case of interest is if the packet is accepted. This means i = s.integer_r + 1; so, by the lemma, in state s, integer_s = integer_r + 1, so r(s) = u is the concatenation of s.buffer_r and s.buffer_s. Then in s' the integers are equal, and m is added to the end of buffer_r; so r(s') is determined by removing the first element of s'.buffer_s and then concatenating with s'.buffer_r. Then, in order to see that r(s') is the same as u, it suffices to note that m is the same as the first element in s.buffer_s. But the lemma implies that i = s.integer_s, and that m is the same as the first element in s.buffer_s, as needed.

Case pi = receive_pkt_rt(i):
Again, we show that u = r(s'). The only case of interest is if the packet is accepted. This means i = s.integer_s, which implies, by the lemma, that s.integer_s = s.integer_r. This means that r(s) is the concatenation of s.buffer_r and s.buffer_s, with the first element of s.buffer_s removed. Then s'.integer_s = s'.integer_r + 1, so r(s') is just the concatenation of s'.buffer_r and s'.buffer_s. But s'.buffer_s is obtained from s.buffer_s by removing the first element, so r(s') = u.

Since r is a refinement, all behaviors of Stenning's protocol are also behaviors of the specification automaton, and so satisfy the safety property that the sequence of messages in receive_msg events is consistent with the sequence of messages in send_msg events. In particular, the fair behaviors of Stenning's protocol satisfy this safety property.

Liveness:

The basic idea is that we first show a correspondence between an execution of the implementation and an execution of the spec MQ. We can get this by noticing that a forward simulation actually yields a stronger relationship than just behavior inclusion: an execution correspondence lemma gives a correspondence between states and external actions throughout the execution.

Then suppose that the liveness condition of the specification is violated; that is, the abstract queue is nonempty from some point in the execution, but no receive_msg events ever occur. We consider what this means in the implementation execution. If buffer_r is ever nonempty, then the receiver eventually delivers a message, a contradiction; so we can assume that buffer_r is always empty. Then buffer_s must be nonempty throughout the execution, so the sender keeps sending packets for the first message on buffer_s; one of these gets delivered to the receiver (by channel fairness), and the message is then put in buffer_r, which is a contradiction.

A Alternating Bit Protocol

This is an optimized version of Stenning's protocol that doesn't use sequence numbers, but only Boolean values. These get associated with successive messages in an alternating fashion. This only works for channels with stronger reliability conditions: we assume the channels exhibit loss and duplication, but no reordering. Liveness again says that infinitely many sends lead to receives of infinitely many of the sent packets.

The sender keeps trying to send a message with a bit attached. When it gets an ack with the same bit, it switches to the opposite bit to send the next message. The receiver is also always looking for a particular bit.

Code: see Figures A.3 and A.4.

Correctness Proof:

We model the sender and receiver as IOAs, just as before. Again, the channels are not quite IOAs, because of the fairness conditions being different from usual IOA fairness. But it is again convenient to have a state machine with a liveness restriction. This time, the channel state machine has as its state a queue. A send puts a copy of the packet at the end of the queue, and a receive can access any element of the queue; when it does, it throws away all the preceding elements, but not the element that is delivered. This allows for message loss and duplication. Again, we need a liveness condition to say that any packet that is sent infinitely many times is also delivered infinitely many times.

A convenient correctness proof can be based on a forward simulation (this one is multivalued) to a restricted version of Stenning's protocol, which we call FIFO-Stenning. This restricted version is correct because the less restricted version is.

Interface:
  Input:
    send_msg(m), m in M
    receive_pkt_rt(b), b a Boolean
  Output:
    send_pkt_tr(m, b), m in M, b a Boolean

The state consists of the following components:
  buffer, a finite queue of elements of M, initially empty, and
  flag, a Boolean, initially 1

The steps are:
  send_msg(m), m in M
    Effect:
      add m to buffer

  send_pkt_tr(m, b), m in M, b a Boolean
    Precondition:
      m is first on buffer
      b = flag
    Effect:
      None

  receive_pkt_rt(b), b a Boolean
    Effect:
      if b = flag then
        remove first element (if any) from buffer
        flag := flag + 1 mod 2

Partition:
  all send_pkt_tr actions are in one class

Figure A.3: ABP Sender A_s

Interface:
  Input:
    receive_pkt_tr(m, b), m in M, b a Boolean
  Output:
    receive_msg(m), m in M
    send_pkt_rt(b), b a Boolean

State:
  buffer, a finite queue of elements of M, initially empty, and
  flag, a Boolean, initially 0

Steps:
  receive_msg(m), m in M
    Precondition:
      m is first on buffer
    Effect:
      remove first element from buffer

  receive_pkt_tr(m, b), m in M, b a Boolean
    Effect:
      if b = flag + 1 mod 2 then
        add m to buffer
        flag := flag + 1 mod 2

  send_pkt_rt(b), b a Boolean
    Precondition:
      b = flag
    Effect:
      None

Partition:
  all send_pkt_rt actions are in one class, and
  all receive_msg actions are in another class

Figure A.4: ABP Receiver A_r
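The two automata can be exercised with a hypothetical Python sketch (not from the notes: the scheduling loop is invented, while the channel model and transition rules follow the text and Figures A.3-A.4):

```python
import random

class FifoChannel:
    """FIFO channel with loss and duplication but no reordering: a receive
    may take any queued element, discarding everything before it (loss)
    while keeping the element itself (so it can be received again)."""
    def __init__(self): self.q = []
    def send(self, p): self.q.append(p)
    def recv(self):
        if not self.q: return None
        k = random.randrange(len(self.q))
        self.q = self.q[k:]   # predecessors are thrown away
        return self.q[0]      # delivered element stays: duplication

def run_abp(msgs, steps=20_000, seed=1):
    random.seed(seed)
    tr, rt = FifoChannel(), FifoChannel()
    buf_s, flag_s = list(msgs), 1   # sender flag initially 1
    buf_r, flag_r = [], 0           # receiver flag initially 0
    delivered = []
    for _ in range(steps):
        if buf_s:
            tr.send((buf_s[0], flag_s))           # send_pkt_tr(m, flag)
        p = tr.recv()
        if p is not None and p[1] == (flag_r ^ 1):  # the awaited bit
            buf_r.append(p[0]); flag_r ^= 1
        if buf_r:
            delivered.append(buf_r.pop(0))          # receive_msg(m)
        rt.send(flag_r)                             # ack last accepted bit
        b = rt.recv()
        if b == flag_s and buf_s:                   # matching ack: advance
            buf_s.pop(0); flag_s ^= 1
    return delivered

assert run_abp(["x", "y", "z"]) == ["x", "y", "z"]
```

Duplicated old packets carry the wrong bit when they arrive late and are simply ignored; because the channel never reorders, a packet with the awaited bit can only be the current message.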

The restricted version is based on the FIFO channels described above, instead of arbitrary channels. We use an additional invariant about this restricted version of the algorithm.

Lemma: The following is true about every reachable state of the FIFO-Stenning protocol. Consider the sequence consisting of the indices in queue_rs, in order from first to last on the queue, followed by integer_r, followed by the indices in queue_sr, followed by integer_s; in other words, the sequence starting at queue_rs and tracing the indices all the way back to the sender automaton. The indices in this sequence are nondecreasing; furthermore, the difference between the first and last index in this sequence is at most 1.

Proof: The proof proceeds by induction on the number of steps in the finite execution leading to the given reachable state.

The base case is where there are no steps, which means we have to show this to be true in the initial state. In the initial state, the channels are empty, integer_s = 1 and integer_r = 0. Thus the specified sequence is 0, 1, which has the required properties.

For the inductive step, suppose that the condition is true in state s, and consider a step (s, pi, s') of the algorithm. We consider cases based on pi.

Case pi is a send_msg or receive_msg event:
Then none of the components involved in the stated condition is changed by the step, so the condition is true after the step.

Case pi is a send_pkt_tr(m, i) event, for some m:
Then queue_sr is the only one of the four relevant components of the global state that can change. We have s'.queue_sr equal to s.queue_sr with the addition of (m, i). But i = s.integer_s, by the precondition of the action. Since the new i is placed consecutively in the concatenated sequence with s.integer_s = i, the properties are all preserved.

Case pi is a receive_pkt_rt(i) event:
If i is not equal to s.integer_s, then the only change is to remove some elements in the concatenated sequence, so all properties are preserved. On the other hand, if i = s.integer_s, then the inductive hypothesis implies that the entire concatenated sequence in state s must consist of i's. The only changes are to remove some elements from the beginning of the sequence and add one i + 1 to the end, since s'.integer_s = i + 1 by the effect of the action. Thus, the new sequence consists of all i's followed by one i + 1, so the property is satisfied.

Case pi is a send_pkt_rt(i) event:
Similar to the case for send_pkt_tr.

Case pi is a receive_pkt_tr(m, i) event:
If i is not equal to s.integer_r + 1, then the only change is to remove some elements from the concatenated sequence, so all properties are preserved. On the other hand, if i = s.integer_r + 1, then the inductive hypothesis implies that the entire concatenated sequence in state s must consist of (i - 1)'s, up to and including s.integer_r, followed entirely by i's. The only changes are to remove some elements from the sequence and change the value of integer_r from i - 1 to i, by the effect of the action; this still has the required properties.

Now we will show that the ABP is correct, by demonstrating a mapping from ABP to FIFO-Stenning. This mapping will be a multivalued possibilities mapping, i.e., a forward simulation. In particular, it will fill in the integer values of the tags, working only from the bits.

More precisely, we say that a state u of FIFO-Stenning is in f(s), for a state s of ABP, provided that the following conditions hold:
1. s.buffer_s = u.buffer_s and s.buffer_r = u.buffer_r.
2. s.flag_s = u.integer_s mod 2 and s.flag_r = u.integer_r mod 2.
3. queue_sr has the same number of packets in s and u. Moreover, for any j, if (m, i) is the j-th packet in queue_sr in u, then (m, i mod 2) is the j-th packet in queue_sr in s. Also, queue_rs has the same number of packets in s and u. Moreover, for any j, if i is the j-th packet in queue_rs in u, then i mod 2 is the j-th packet in queue_rs in s.

Theorem: f above is a forward simulation.

Proof: By induction.

For the base, let s be the start state of ABP and u the start state of FIFO-Stenning. First, all the buffers are empty. Second, s.flag_s = 1 = u.integer_s mod 2, and s.flag_r = 0 and u.integer_r = 0, which is as needed. Third, both channels are empty. This suffices.

Now we show the inductive step. Suppose (s, pi, s') is a step of ABP and u is in f(s). We consider cases based on pi.

Case pi = send_msg(m):
Choose u' to be the unique state such that (u, pi, u') is a step of FIFO-Stenning. We must show that u' is in f(s'). The only condition that is affected by the step is the first, for the buffer_s component. However, the action affects both s.buffer_s and u.buffer_s in the same way, so the correspondence holds.

Case pi = receive_msg(m):
Since pi is enabled in s, m is the first value on s.buffer_r. Since u is in f(s), m is also the first value on u.buffer_r, which implies that pi is enabled in u. Now choose u' to be the unique state such that (u, pi, u') is a step of FIFO-Stenning. All conditions are unaffected except for the first, for buffer_r; and buffer_r is changed in the same way in both algorithms, so the correspondence holds.

Case pi = send_pkt_tr(m, b):
Since pi is enabled in s, b = s.flag_s and m is the first element on s.buffer_s. Let i = u.integer_s. Since u is in f(s), m is also the first element on u.buffer_s. It follows that send_pkt_tr(m, i) is enabled in u. Now choose u' so that (u, send_pkt_tr(m, i), u') is a step of FIFO-Stenning. We must show that u' is in f(s'). The only interesting condition is the third, for queue_sr. By the inductive hypothesis, and the fact that the two steps each insert one packet, it is easy to see that queue_sr has the same number of packets in s' and u'. Moreover, the new packet gets added with tag i in state u' and with tag b in state s'; since u is in f(s), we have s.flag_s = u.integer_s mod 2, i.e., b = i mod 2, which implies the result.

Case pi = receive_pkt_tr(m, b):
Since pi is enabled in s, (m, b) appears in queue_sr in s. Since u is in f(s), (m, i) appears in the corresponding position in queue_sr in u, for some integer i with b = i mod 2. Then receive_pkt_tr(m, i) is enabled in u. Let u' be the state such that (u, receive_pkt_tr(m, i), u') is a step of FIFO-Stenning, and such that the same number of packets is removed from the queue. We must show that u' is in f(s'). It is easy to see that the third condition is preserved, since each of pi and receive_pkt_tr(m, i) removes the same number of packets from queue_sr.
Suppose first that b = s.flag_r, so that the packet is rejected. Then the effects of pi imply that the receiver state in s' is identical to that in s. Now, since u is in f(s), s.flag_r = u.integer_r mod 2; since b = i mod 2, this case must have i not equal to u.integer_r + 1. Then the effects of receive_pkt_tr(m, i) imply that the receiver state in u' is identical to that in u. It is immediate that the first and second conditions hold.
So now suppose that b is not equal to s.flag_r. The invariant above for FIFO-Stenning implies that either i = u.integer_r or i = u.integer_r + 1. Since b = i mod 2 and, since u is in f(s), s.flag_r = u.integer_r mod 2, this case must have i = u.integer_r + 1. Then, by the effect of the action, u'.integer_r = u.integer_r + 1 and s'.flag_r = s.flag_r + 1 mod 2, preserving the second condition. Also, buffer_r is modified in both cases by adding the entry m at the end; therefore, the first condition is preserved.

Case pi = send_pkt_rt(b):
Similar to send_pkt_tr(m, b).

Case pi = receive_pkt_rt(b):
Since pi is enabled in s, b is in queue_rs in s. Since u is in f(s), i is in the corresponding position in queue_rs in u, for some integer i with b = i mod 2. Then receive_pkt_rt(i) is enabled in u. Let u' be the state such that (u, receive_pkt_rt(i), u') is a step of FIFO-Stenning, and such that the same number of packets are removed. We must show that u' is in f(s'). It is easy to see that the third condition is preserved, since each of pi and receive_pkt_rt(i) removes the same number of messages from queue_rs.
Suppose first that b is not equal to s.flag_s, so that the ack is rejected. Then the effects of pi imply that the sender state in s' is identical to that in s. Now, since u is in f(s), s.flag_s = u.integer_s mod 2; since b = i mod 2, this case must have i not equal to u.integer_s. Then the effects of receive_pkt_rt(i) imply that the sender state in u' is identical to that in u. It is immediate that the first and second conditions hold for this situation.
So now suppose that b = s.flag_s. The invariant above for FIFO-Stenning implies that either i = u.integer_s - 1 or i = u.integer_s. Since b = i mod 2 and, since u is in f(s), s.flag_s = u.integer_s mod 2, this case must have i = u.integer_s. Then the effect of the action implies that u'.integer_s = u.integer_s + 1 and s'.flag_s = s.flag_s + 1 mod 2, preserving the second condition. Also, buffer_s is modified in the same way in both cases, so the first condition is preserved.

Remark:

Consider the structure of the forward simulation f of this example. In going from FIFO-Stenning to ABP, the integer tags are condensed to their low-order bits; the multiple values of the mapping f essentially replace this information. In this example, the correspondence between ABP and FIFO-Stenning can be described in terms of a mapping in the opposite direction: a single-valued projection from the state of FIFO-Stenning to that of ABP, which removes information. Then f maps a state s of ABP to the set of states of FIFO-Stenning whose projections are equal to s. While this formulation suffices to describe many interesting examples, it does not always work. (Consider garbage collection examples.) Machine assistance should be useful in verifying, and maybe producing, such proofs.

Liveness:

As before, the forward simulation yields a close correspondence between executions of the two systems. Given any live execution of ABP, we construct a corresponding execution of FIFO-Stenning (this needs formalizing), and show that it is a live execution of FIFO-Stenning.

This includes the conditions of IOA fairness for the sender and receiver: if the constructed execution is not live in this sense, then some IOA class is enabled forever but no action in that class occurs; this easily yields the same situation in the ABP execution (if the correspondence is stated strongly enough), which contradicts the fairness of the sender and receiver there.

It also includes the channel liveness conditions (infinitely many sends imply infinitely many receives). Again, this carries over from ABP.

So far, we see that we can tolerate all types of channel faults using Stenning's protocol, with unbounded headers for messages. With bounded headers, as in ABP, we can tolerate loss and duplication, but not reordering.

A Bounded-Header Protocols Tolerating Reordering

This leads us to the question of whether we can use bounded headers and tolerate reordering.

Consider what goes wrong with the ABP, for example, if we attempt to use it with channels that reorder packets. The recipient can get fooled into accepting an old packet with the same bit as the one expected. This could cause duplicate acceptance of the same message.

Here we consider the question of whether there exists a protocol that uses bounded headers and tolerates reordering. Wang-Zuck prove that there is no bounded-header protocol that tolerates duplication and reordering, and consequently none that tolerates all three types of faulty behavior. In contrast, AAFFLMWZ produce a bounded-header protocol that tolerates loss and reordering (but not duplication). That paper also contains an impossibility result for efficient bounded-header protocols tolerating loss and reordering.

We formalize the notion of bounded headers by assuming that the external message alphabet is finite, and requiring that the packet alphabet also be finite.

Wang-Zuck Reordering+Duplication Impossibility:

Suppose there is an implementation. Starting from an initial state, run the system so as to send, systematically, copies of as many different packets as possible, yielding execution alpha. That is, there is no extension of alpha in which any packet is sent that isn't also sent in alpha. Suppose there are n send_msg events in alpha.

Let alpha' be a live extension of alpha that contains exactly one additional send_msg event. By the correctness condition, all messages in alpha' get delivered.

Now we construct a finite execution alpha'' that looks like alpha' to the receiver, just up to the point where the (n+1)st message delivery to the user occurs, but in which the additional send_msg event never occurs. Namely, delay the sender immediately after alpha; also, don't allow the new send_msg event. But let the receiver do the same thing that it does in alpha': receive the same packets, send the same packets, deliver the same messages to the user. It can receive the same packets, even though the sender doesn't send them, because the earlier packets in alpha can be duplicated; nothing new can be sent that doesn't appear already in alpha. Then, as in alpha', it delivers n + 1 messages to the user, even though only n send_msg events have occurred.

To complete the contradiction, extend alpha'' to a live execution of the system without any new send_msg events. This has more receive_msg events than send_msg events, a contradiction.

AAFFLMWZ Bounded-Header Protocol Tolerating Loss and Reordering:

It turns out that it is possible to tolerate reordering, and also loss of messages, with bounded headers. This assumes that duplication does not occur. It is not at all obvious how to do this; indeed, it was a conjecture for a while that this was impossible, i.e., that tolerating reordering and loss would require unbounded headers. This algorithm is NOT a practical algorithm; it is just a counterexample algorithm, to show that an impossibility result cannot be proved.

The algorithm can most easily be presented in two pieces. The underlying channels are at least guaranteed not to duplicate anything. So, first use the no-duplication channels to implement channels that do not reorder messages, but can lose or duplicate them. Then use the resulting FIFO channels to implement reliable communication.

Note that pasting these two pieces together requires each underlying channel to be used for two separate purposes, to simulate channels in two different protocols, one yielding FIFO communication in either direction. This means that the underlying channels need to be fair to each protocol separately; we need the port structure mentioned earlier.

The second task can easily be done using bounded headers, e.g., using the ABP. The first is much less obvious.

The sender sends a value to the receiver only in response to an explicit query packet from the receiver. The value that it sends is always the most recent message that was given to it (saved in latest). To ensure that it answers each query exactly once, the sender keeps a variable unanswered, which is incremented whenever a new query is received and decremented whenever a value is sent.

The receiver continuously sends query packets to the sender, keeping track, in pending, of the number of unanswered queries. The receiver counts, in count(m), the number of copies of each value m received since the last time it delivered a message (or since the beginning of the run, if no message has yet been delivered). At the beginning, and whenever a new message is delivered, the receiver sets old to pending. When count(m) > old, the receiver knows that m was the value of latest at some time after the receiver performed its last receive event. It can therefore safely output m.

The finiteness of the message alphabet, the fact that the sender will always respond to query messages, and the liveness of the channels together imply that the receiver will output infinitely many values, unless there is no send event at all.

Code:

Sender:

Variables:
  latest, an element of M or nil, initially nil
  unanswered, a nonnegative integer, initially 0

send(m)
  Effect:
    latest := m

rcv_pkt(query)
  Effect:
    unanswered := unanswered + 1

send_pkt(m)
  Precondition:
    unanswered > 0
    m = latest, latest not nil
  Effect:
    unanswered := unanswered - 1

Receiver:

Variables:
  pending, a nonnegative integer, initially 0
  old, a nonnegative integer, initially 0
  for each m in M: count(m), a nonnegative integer, initially 0

receive(m)
  Precondition:
    count(m) > old
  Effect:
    count(n) := 0 for all n in M
    old := pending

send_pkt(query)
  Effect:
    pending := pending + 1

rcv_pkt(m)
  Effect:
    pending := pending - 1
    count(m) := count(m) + 1
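The query/response piece can be exercised with a hypothetical Python sketch (not from the notes: the scheduler, class names, and loss probability are invented; the transition rules follow the code above, and only one message "x" is ever sent):

```python
import random

class LossyReorderChannel:
    """Loses and reorders packets, but never duplicates: each packet is
    removed from transit when delivered (or when lost)."""
    def __init__(self): self.pkts = []
    def send(self, p): self.pkts.append(p)
    def recv(self):
        if not self.pkts: return None
        if random.random() < 0.2:  # loss: drop some packet silently
            self.pkts.pop(random.randrange(len(self.pkts)))
            return None
        # reordering: deliver an arbitrary packet, exactly once
        return self.pkts.pop(random.randrange(len(self.pkts)))

def run_query_protocol(steps=2000, seed=2):
    random.seed(seed)
    sr, rs = LossyReorderChannel(), LossyReorderChannel()
    latest, unanswered = "x", 0   # sender state, after one send("x")
    pending, old = 0, 0           # receiver state
    count, outputs = {}, []
    for _ in range(steps):
        rs.send("query"); pending += 1     # receiver keeps querying
        if rs.recv() == "query":
            unanswered += 1
        if unanswered > 0 and latest is not None:
            sr.send(latest); unanswered -= 1   # answer with latest
        m = sr.recv()
        if m is not None:
            pending -= 1
            count[m] = count.get(m, 0) + 1
        for m2 in list(count):
            if count[m2] > old:    # m2 was latest after the last delivery
                outputs.append(m2)
                count, old = {}, pending
                break
    return outputs

out = run_query_protocol()
assert out and set(out) == {"x"}
```

Since only "x" is ever given to the sender, every output is "x"; the count(m) > old test is what prevents stale responses (to queries that predate the last delivery) from being output.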

6.852J/18.437J Distributed Algorithms (January)

Lecturer: Nancy Lynch

Lecture

B Reliable Communication Using Unreliable Channels

B Bounded-Header Protocols Tolerating Reordering

We began considering bounded-header protocols tolerating reordering. We gave a simple impossibility result for reordering in combination with duplication, and then showed the AAFFLMWZ protocol that tolerates reordering in combination with loss (but no duplication). Today, we start with an impossibility result, saying that it's not possible to tolerate reordering and loss with an efficient protocol.

AAFFLMWZ Imp ossibility Result

Again consider the case of channels C_sr and C_rs that cannot duplicate
messages, but can lose and reorder them. Again, if infinitely many packets
are sent, infinitely many of these packets are required to be delivered.
(The impossibility proof will still work if we generalize this condition to
a corresponding port-based condition.)

As above, we can get a protocol for this setting, but that protocol was
very inefficient, in the sense that more and more packets are required to
send individual messages. Now we will show that this is inherent: such a
protocol cannot be efficient, in that it must send more and more messages.
Again, assume a finite message alphabet and a finite packet alphabet.

To capture the inefficiency, we need a definition that captures the number
of packets needed to send a message. First, a finite execution is *valid*
if the number of send_msg actions is equal to the number of rec_msg
actions. This means that the data-link protocol succeeded in sending all
the messages m provided by the user, i.e., all the messages m for which an
action send_msg(m) occurred.

The following definition expresses the notion that, in order to
successfully deliver any message, the data-link protocol only needs (in the
best case) to send a bounded number of packets over the channels.

Definition: If α is a valid execution, an extension αβ of α is a
k-extension if:

1. In β, the user actions are exactly the two actions send_msg(m) and
   rec_msg(m), for some given message m. (This means that exactly one
   message has been sent successfully by the protocol.)

2. All packets received in β are sent in β, i.e., no old packets are
   received.

3. The number of rec_pkt actions in β is less than or equal to k.

A protocol is k-bounded if there is a k-extension of α for every message m
and every valid execution α.

Theorem: There is no k-bounded data-link protocol.

Proof: Assume there is such a protocol, and look for a contradiction.

Suppose that there were a multiset T of packets, a finite execution α, and
a k-extension αβ such that:

1. every packet in T is in transit from s to r after the execution α
   (i.e., T is a multiset of "old" packets), and

2. the multiset of packets received by the receiver in β is a
   sub-multiset of T.

We could then derive a contradiction, as follows. Consider an alternative
execution that begins like αβ, but that contains no send_msg event. All
the packets in T cause the receiver to behave as in the k-extension β, and
hence to generate an incorrect rec_msg action. In other words, the
receiver is confused by the presence of old packets in the channel, which
were left in transit in the channel in α and are equivalent to those sent
in β. At the end of the alternative execution, a message has been received
without being sent, and the algorithm fails.

(Note that we are here assuming a "universal" channel, one that can
actually exhibit all the behavior that is theoretically allowed by the
channel spec. Such universal channels exist.)

So it remains to manufacture this bad situation.

We need one further definition.

Definition: T ⊂_k T' if:

1. T ⊆ T'. (This inclusion is among multisets of packets.)

2. There is a packet p such that mult(p, T) < mult(p, T') ≤ k.

Here mult(p, T) denotes the multiplicity of p within the multiset T.
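To make the relation concrete, it can be sketched in Python with multisets
represented as `collections.Counter`s. The function name `subset_k` and the
exact reading of condition 2 are my reconstruction from the damaged text,
so treat this as illustrative rather than definitive:

```python
from collections import Counter

def subset_k(t, t_prime, k):
    """Check T 'subset_k' T': (1) T is included in T' as multisets, and
    (2) some packet's multiplicity strictly grows but stays at most k.
    (A reconstruction of the relation in the notes; illustrative only.)"""
    c, c2 = Counter(t), Counter(t_prime)
    # Condition 1: multiset inclusion T <= T'.
    if any(c[p] > c2[p] for p in c):
        return False
    # Condition 2: a witness packet p with mult(p, T) < mult(p, T') <= k.
    return any(c[p] < c2[p] <= k for p in c2)

# For example, {a} subset_2 {a, a} holds, but {a} subset_2 {a, a, a}
# fails because the witness multiplicity would exceed k = 2.
```

Under this reading, a chain T_0 ⊂_k T_1 ⊂_k ... in which each step adds a
single packet keeps every multiplicity at most k, which is what bounds the
chain length by k·|P| in the argument below.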

Lemma: If α is valid and T is a multiset of packets in transit after α,
then either:

1. there exists a k-extension αβ such that the multiset of packets
   received by A_r in β is a sub-multiset of T, or

2. there exists a valid execution αγ such that
   (a) all packets received in γ are sent in γ, and
   (b) there is a new multiset T' of packets in transit after αγ such
       that T ⊂_k T'.

Suppose that this lemma is true. We then show that there is some valid α
and a multiset T of packets in transit after α such that case 1 holds. As
we already argued, the existence of such an α and T leads to the desired
contradiction.

For this, we define two sequences α_i and T_i, with α_0 equal to the empty
sequence and T_0 empty. If condition 1 does not hold for α_0 and T_0
(i.e., for i = 0), we are in the situation of case 2; we then set
α_1 = α_0 γ and T_1 = T'. In general, assuming that case 1 does not hold
for α_i and T_i, we are in case 2 and derive a valid extension α_{i+1} of
α_i and a multiset T_{i+1} of packets in transit after α_{i+1}, with
T_i ⊂_k T_{i+1}.

But by definition of the ⊂_k relation, the sequence

    T_0 ⊂_k T_1 ⊂_k T_2 ⊂_k ... ⊂_k T_i ⊂_k ...

can have at most k·|P| terms. Its last term is the T we are looking for.
(Here |P| is the number of different possible packets, which is finite
because we assumed that the packet alphabet and the size of the headers
are bounded.)

Proof (of Lemma): Pick any m, and get a k-extension αβ for m by the
k-boundedness condition. If the multiset of packets received by A_r in β
is included in T, we are in case 1.

Otherwise, there is some packet p for which the multiplicity of rec_pkt(p)
actions in β is bigger than mult(p, T). We then set T' = T ∪ {p}. Since
the extension is k-bounded, the number of these rec_pkt(p) actions is at
most k, so that mult(p, T') ≤ k.

We now want to get a valid extension αγ of α, leaving the packets from T'
in transit. We know that there is a send_pkt(p) action in β, since all
packets received in β were sent in β. Consider then a prefix β' of β
ending with the first such send_pkt(p). (See Figure B.)

[Figure B: An extension of αβ' with a send_pkt(p) action.]

After αβ', all packets from T' are still in transit. We want to extend
αβ' to a valid execution without delivering any packet from T'. There are
three cases.

First, suppose that both the send-message and receive-message events from
β appear in β'; then γ = β' itself suffices.

Second, suppose that only the send-message but not the receive-message
event appears in β'. To get validity, we need to deliver m (which is the
only undelivered message) without delivering a packet from T'. We can
achieve this because of a basic property of live executions in this
architecture: for any finite execution, there is a live extension that
contains no new send-message events and that does not deliver any old
packets. Because of the correctness condition, this extension must
eventually deliver m. The needed sequence γ stops just after the
receive-message event.

Third and finally, suppose that neither the send-message nor the
receive-message event appears in β'. Then add a new send-message event
just after β', and extend this just as in the previous case to get the
needed valid extension.

B Tolerating Node Crashes

The results above settle pretty much all the cases for reliable nodes.
Now we add consideration of faulty nodes. This time, we consider node
crashes that lose information. We give two impossibility results, then
show a practical algorithm used in the Internet.

Impossibility Results

Consider the case where each node has an input action crash and a
corresponding output recover. When a crash occurs, a special crash flag
just gets set, and no locally controlled actions other than recover can be
enabled. At some later time, a corresponding recover action is supposed
to occur; when this happens, the state of the node goes back to an
arbitrary initial state. Thus, the crash causes all information to be
lost.

In such a model, it is not hard to see that we can't solve the reliable
communication problem, even if the underlying channels are completely
reliable (FIFO, no loss or duplication).

Proof: Consider any live execution α in which a single send(m) event
occurs, but no crashes. Then correctness implies that there is a later
receive(m). Let α' be the prefix of α ending just before this receive(m).

Now consider α' followed by the events crash_r, recover_r. By basic
properties of the model, this can be extended to a live execution α'' in
which no further send-message events or crashes occur. Since α'' must
also satisfy the correctness conditions, it must be that α'' contains a
receive(m) event, corresponding to the given send(m) event, sometime after
the given crash.

Now take the portion of α'' after the crash-recover, and splice it at the
end of the finite execution α' receive(m) crash_r recover_r. We claim
that the result is another live execution. But this execution has two
receive(m) events, a contradiction to the correctness conditions.

Thus, the key idea is that the system cannot tell whether a receiver crash
occurred before or after the delivery of a message. Note that a key fact
that makes this result work is the separation between the receive(m) event
and any other effect, such as sending an acknowledgement.

We could argue that the problem statement is too strong for the case of
crashes. So now we weaken the problem statement quite a lot, but still
obtain an impossibility result, though it's harder to prove.

We continue to forbid duplication. We allow reordering, however. And,
most importantly, we now allow loss of messages sent before the last
recover action. Messages sent after the last recover must be delivered,
however.

We describe the underlying channels in as strong a way as possible. Thus,
we insist that they are FIFO and don't duplicate; all they can do is lose
messages. The only constraint on losing messages is the liveness
constraint: infinitely many sends lead to infinitely many receives (or a
port version of this).

Now it makes sense to talk about a *sequence* of packets being in transit
on a channel. By this we mean that the sequence is a subsequence of all
the packets in the channel sent after the sending of the last packet
delivered.
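To make this channel model concrete, here is a minimal simulation sketch
of a FIFO channel that can only lose packets (no reordering, no
duplication). The class name, the probabilistic loss model, and the
interface are all illustrative assumptions, not part of the notes:

```python
import random

class LossyFIFOChannel:
    """A FIFO channel that may lose packets but never reorders or
    duplicates them (an illustrative sketch, not the notes' model)."""

    def __init__(self, loss_rate=0.3, seed=0):
        self.queue = []                  # packets in transit, oldest first
        self.rng = random.Random(seed)   # deterministic adversary
        self.loss_rate = loss_rate

    def send(self, pkt):
        self.queue.append(pkt)

    def deliver(self):
        """Deliver the oldest surviving packet; packets popped before it
        are the ones the adversary chose to lose."""
        while self.queue:
            pkt = self.queue.pop(0)
            if self.rng.random() >= self.loss_rate:
                return pkt               # delivered, FIFO order preserved
        return None                      # everything remaining was lost
```

Whatever the adversary does, the delivered sequence is a subsequence of
the sent sequence in order, which is exactly the FIFO-with-loss behavior
described above; with any loss rate below 1, the liveness constraint
(infinitely many sends yield infinitely many receives) holds with
probability 1.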

Notation: If α is an execution, x ∈ {s, r}, and 0 ≤ k ≤ |α|, then define:

- in(x, k): the sequence of packets received by A_x during the first k
  steps of α,
- out(x, k): the sequence of packets sent by A_x during the first k steps
  of α,
- state(x, k): the state of A_x after exactly k steps of α, and
- ext(x, k): the sequence of external actions of A_x occurring during the
  first k steps of α.

Define x̄ to be the opposite node to x.

Lemma: Let α be any crash-free finite execution of length n. Let x be
either node, and let 0 ≤ k ≤ n. Suppose that either k = 0, or else the
k-th action in α is an action of A_x. Then there is a finite execution α'
(possibly containing crashes) that ends in a state in which:

1. the state of x is state(x, k),
2. the state of x̄ is state(x̄, k), and
3. the sequence out(x, k) is in transit from x to x̄.

Proof: Proceed by induction on k.

Base case (k = 0): immediate; use the initial configuration only.

Inductive step: suppose the claim is true for all values smaller than k,
and prove it for k.

Case 1: The first k steps in α are all steps of A_x. Then the first k
steps of α themselves suffice.

Case 2: The first k steps in α include at least one step of A_x̄. Then
let j be the greatest integer such that the j-th step is a step of A_x̄.
(Note that in fact j < k, since we have assumed the k-th step is a step of
A_x.) Then in(x, k) is a subsequence of out(x̄, j), and
state(x̄, k) = state(x̄, j).

By the inductive hypothesis (applied to x̄ and j), we get an execution
that leads the two nodes to state(x̄, j) and state(x, j), respectively,
and that has out(x̄, j) in transit from x̄ to x. This already has x̄ in
the needed state.

Now run A_x alone: let it first crash and recover, going back to its
initial state in α. Now let it run on its own, using the messages in
in(x, k), which are available in the incoming channel since they are a
subsequence of out(x̄, j). This brings x to the needed state and puts the
needed messages in the outgoing channel.

Now we finish the proof. Choose α, of length n, to be a crash-free
execution containing one send(m) and its corresponding receive(m) (such an
execution must exist). Now use α to construct an execution whose final
node states are those of α, and that has a send as the last external
action.

How do we do this? Let k be the index of the last step of α occurring at
the receiver (one must exist, since there is a receive event in α). Then
the Lemma yields an execution ending in the node states after k steps, and
with out(r, k) in transit. Note that state(r, k) = state(r, n). Now
crash and recover the sender, and then have it run again from the
beginning, using the inputs in in(s, n), which are a subsequence of the
available sequence out(r, k). This lets it reach state(s, n). Also,
there is a send step, but no other external step, in this suffix. This is
as needed.

Now we get a contradiction. Let α' be the execution obtained above, whose
final node states are those of α and that has a send as the last external
action. Now lose all the messages in transit in both channels; then
extend α' to a live execution containing no further send or crash events.
By correctness, this extension must eventually deliver the last message.

Now we claim we can attach this suffix at the end of α, and it will again
deliver the message. But α already had an equal number of sends and
receives (one of each), so this extra receive violates correctness. QED

This says that we can't even solve a weak version of the problem when we
have to contend with crashes that lose all state information.

B Five-Packet Handshake (Internet Protocol)

Note that it is important in practice to have a message-delivery protocol
that can tolerate node crashes. Here we give one important example: the
five-packet handshake protocol of Belsnes, used in the Internet. This
does tolerate node crashes. It uses a small amount of stable storage, in
the form of unique IDs. (Think of this as stable storage because, even
after a crash, the system remembers not to reuse any ID.) The underlying
channels can reorder, duplicate (but only finitely many times), and lose
packets. The protocol yields no reordering, no duplication ever, and it
always delivers messages sent after the last recovery.

The five-packet handshake protocol of Belsnes is one of a batch he
designed, getting successively more reliability out of successively more
messages. It is used in practice.

For each message that the sender wishes to send, there is an initial
exchange of packets between the sender and receiver to establish a
commonly-agreed-upon message identifier. That is, the sender sends a new,
unique identifier called jd to the receiver, which the receiver then pairs
with another new, unique identifier called id. It sends this pair back to
the sender, who knows that the new id is recent because it was paired with
its current jd. The identifier jd isn't needed any longer, but id is used
for the actual message communication.

The sender then associates this identifier id with the message. The
receiver uses the associated identifier to decide whether or not to accept
a received message: it will accept a message provided the associated
identifier id is equal to the receiver's current id. The receiver also
sends an acknowledgement. Additional packets are required in order to
tell the receiver when it can throw away a current identifier.
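The happy path of this exchange can be traced in a few lines of Python.
This sketch only walks through the five packets in order, with no loss,
duplication, reordering, or crashes modeled; every function and packet
name in it is illustrative rather than taken from the protocol code below:

```python
import itertools

# Fresh-identifier generators, shared across calls so that jds and ids
# are never reused (the "stable storage" assumption of the protocol).
_jd_gen = itertools.count()
_id_gen = itertools.count(1000)

def five_packet_handshake(message):
    """Trace the loss-free five-packet exchange for one message.
    Returns (list of packets exchanged, delivered message)."""
    trace = []

    # 1. Sender picks a fresh jd and asks the receiver for an id.
    jd = next(_jd_gen)
    trace.append(("needid", jd))

    # 2. Receiver pairs jd with a fresh id of its own.
    rid = next(_id_gen)
    trace.append(("accept", jd, rid))

    # 3. Sender adopts rid (it knows rid is recent because it arrived
    #    paired with its current jd) and sends the message under it.
    trace.append(("send", message, rid))

    # 4. Receiver accepts (rid matches its current id), delivers the
    #    message, and acknowledges positively.
    delivered = message
    trace.append(("ack", rid, True))

    # 5. Sender tells the receiver it may now forget rid.
    trace.append(("acked", rid))
    return trace, delivered
```

In the real protocol each of these packets may be lost or duplicated, which
is what the modes, buffers, and nack handling in the code below deal with.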

Variable        Type            Initially  Description
mode_s          {acked, needid, acked      Sender's mode
                 send, rec}
buf_s           M list          empty      Sender's list of messages to send
jd_s            JD or nil       nil        jd chosen for current message by
                                           sender
jd-used_s       JD set          empty      Includes all jds ever used by
                                           sender
id_s            ID or nil       nil        id received from receiver
current-msg_s   M or nil        nil        Message about to be sent to
                                           receiver
current-ack_s   Boolean         false      Ack from receiver
acked-buf_s     ID list         empty      List of ids for which the sender
                                           will issue acked message

mode_r          {idle, accept,  idle       Receiver's mode
                 rcvd, ack, rec}
buf_r           M list          empty      Receiver's list of messages to
                                           deliver
jd_r            JD or nil       nil        jd received from sender
id_r            ID or nil       nil        id chosen for received jd by
                                           receiver
last_r          ID or nil       nil        Last id the receiver remembers
                                           accepting
issued_r        ID set          empty      Includes all ids ever issued by
                                           receiver
nack-buf_r      ID list         empty      List of ids for which receiver
                                           will issue negative acks

Sender Actions

Input:
  send_msg(m), m ∈ M
  receive_pkt_rs(p), p ∈ P
  crash_s
Output:
  ack(b), b a Boolean
  send_pkt_sr(p), p ∈ P
  recover_s
Internal:
  choose-jd
  grow-jd-used_s

Receiver Actions

Input:
  receive_pkt_sr(p), p ∈ P
  crash_r
Output:
  receive_msg(m), m ∈ M
  send_pkt_rs(p), p ∈ P
  recover_r
Internal:
  grow-issued_r

Sender Transitions:

send_msg(m)
  Effect:
    if mode_s ≠ rec then
      add m to buf_s

choose-jd
  Precondition:
    mode_s = acked
    buf_s is nonempty
    jd ∉ jd-used_s
  Effect:
    mode_s := needid
    current-msg_s := head(buf_s)
    remove head(buf_s)
    jd_s := jd
    add jd to jd-used_s

send_pkt_sr(needid, jd)
  Precondition:
    mode_s = needid
    jd = jd_s
  Effect:
    none

receive_pkt_rs(accept, jd, id)
  Effect:
    if mode_s ≠ rec then
      if mode_s = needid and jd = jd_s then
        mode_s := send
        id_s := id
      else if id ≠ id_s then
        add id to acked-buf_s

send_pkt_sr(send, m, id)
  Precondition:
    mode_s = send
    m = current-msg_s
    id = id_s
  Effect:
    none

receive_pkt_rs(ack, id, b)
  Effect:
    if mode_s ≠ rec then
      if mode_s = send and id = id_s then
        mode_s := acked
        current-ack_s := b
        jd_s := nil
        id_s := nil
        current-msg_s := nil
        if b = true then
          add id to acked-buf_s

send_pkt_sr(acked, id)
  Precondition:
    mode_s ≠ rec
    id is first on acked-buf_s
  Effect:
    remove head(acked-buf_s)

ack(b)
  Precondition:
    mode_s = acked
    buf_s is empty
    b = current-ack_s
  Effect:
    none

crash_s
  Effect:
    mode_s := rec

recover_s
  Precondition:
    mode_s = rec
  Effect:
    mode_s := acked
    jd_s := nil
    id_s := nil
    empty buf_s
    current-msg_s := nil
    current-ack_s := false
    empty acked-buf_s

grow-jd-used_s
  Precondition:
    none
  Effect:
    add some JDs to jd-used_s

Receiver Transitions:

receive_pkt_sr(needid, jd)
  Effect:
    if mode_r = idle then
      mode_r := accept
      choose an id not in issued_r
      jd_r := jd
      id_r := id
      add id to issued_r

send_pkt_rs(accept, jd, id)
  Precondition:
    mode_r = accept
    jd = jd_r
    id = id_r
  Effect:
    none

receive_pkt_sr(send, m, id)
  Effect:
    if mode_r ≠ rec then
      if mode_r = accept and id = id_r then
        mode_r := rcvd
        add m to buf_r
        last_r := id
      else if id ≠ last_r then
        add id to nack-buf_r

receive_msg(m)
  Precondition:
    mode_r = rcvd
    m is first on buf_r
  Effect:
    remove head(buf_r)
    if buf_r is empty then
      mode_r := ack

send_pkt_rs(ack, id, true)
  Precondition:
    mode_r = ack
    last_r = id
  Effect:
    none

send_pkt_rs(ack, id, false)
  Precondition:
    mode_r ≠ rec
    id is first on nack-buf_r
  Effect:
    remove head(nack-buf_r)

receive_pkt_sr(acked, id)
  Effect:
    if (mode_r = accept and id = id_r) or
       (mode_r = ack and id = last_r) then
      mode_r := idle
      jd_r := nil
      id_r := nil
      last_r := nil

crash_r
  Effect:
    mode_r := rec

recover_r
  Precondition:
    mode_r = rec
  Effect:
    mode_r := idle
    jd_r := nil
    id_r := nil
    last_r := nil
    empty buf_r
    empty nack-buf_r

grow-issued_r
  Precondition:
    none
  Effect:
    add some IDs to issued_r

Explain the algorithm from the talk overview; discuss informally why it
works. Note the two places where ids get added to the acked buffer: once
normally, and once in response to information about an old message.

The trickiest part of this code turns out to be the liveness argument: how
do we know that this algorithm continues to make progress? There are some
places where only single responses are sent (e.g., a single nack), and
such a response could get lost. However, in this case the node at the
other end is continuing to resend, so eventually another nack will be
triggered.

Distributed Algorithms                                          January
Lecturer: Nancy Lynch

Lecture

This is the final lecture. It's on timing-based computing.

Recall that we started this course out studying synchronous distributed
algorithms, and spent the rest of the time doing asynchronous algorithms.
There is an interesting model that is in between these two (really, a
class of models); call these *partially synchronous*. The idea is that
the processes can presume some information about time, though the
information might not be exact. For example, we might have bounds on
process step time or message delivery time.

Note that it doesn't change things much just to include upper bounds,
because the same set of executions results. (In fact, we have been
associating upper bounds with events throughout the course, whenever we
analyzed time complexity.) Rather, we will consider both lower and upper
bounds on the time for events. Now we do get some extra power, in the
sense that we are now restricting the possible interleavings.

C MMT Model Definition

A natural model to use is the one of Merritt, Modugno, and Tuttle. Here
I'm describing the version that's used in the Lynch-Attiya paper. This is
the same as I/O automata, but now each class doesn't just get treated
fairly; rather, we have a *boundmap*, which associates lower and upper
bounds with each class. The lower bound is any nonnegative real number,
while the upper bound is either a positive real or ∞.

The intuition is that the successive actions in each class aren't supposed
to occur any sooner, after the class is enabled or after the last action
in that class, than indicated by the lower bound; also, no later than the
upper bound.

Executions are described as sequences of alternating states and
(action, time) pairs.

Admissibility: I've edited the following from the Lynch-Attiya model
paper.

C Timed Automata

In this subsection, we augment the I/O automaton model to allow discussion
of timing properties. The treatment here is similar to the one described
in [AttiyaL] and is a special case of the definitions proposed in
[MerrittMT]. A *boundmap* for an I/O automaton A is a mapping that
associates a closed subinterval of [0, ∞] with each class in part(A),
where the lower bound of each interval is not ∞ and the upper bound is
nonzero. Intuitively, the interval associated with a class C by the
boundmap represents the range of possible lengths of time between
successive times when C gets a chance to perform an action. We sometimes
use the notation b_l(C) to denote the lower bound assigned by boundmap b
to class C, and b_u(C) for the corresponding upper bound. A *timed
automaton* is a pair (A, b), where A is an I/O automaton and b is a
boundmap for A.

We require notions of "timed execution", "timed schedule", and "timed
behavior" for timed automata, corresponding to executions, schedules, and
behaviors for ordinary I/O automata. These will all include time
information. We begin by defining the basic type of sequence that
underlies the definition of a timed execution.

(Footnote: In [MerrittMT], the model is defined in a more general manner,
to allow boundmaps to yield open or semi-open intervals as well as closed
intervals. This restriction is not crucial here, but allows us to avoid
considering extra cases in some of the technical arguments.)

Definition: A *timed sequence* for an I/O automaton A is a finite or
infinite sequence of alternating states and (action, time) pairs,

    s_0, (π_1, t_1), s_1, (π_2, t_2), ...,

satisfying the following conditions.

1. The states s_0, s_1, ... are in states(A).
2. The actions π_1, π_2, ... are in acts(A).
3. The times t_1, t_2, ... are successively nondecreasing nonnegative
   real numbers.
4. If the sequence is finite, then it ends in a state s_i.
5. If the sequence is infinite, then the times are unbounded.

For a given timed sequence, we use the convention that t_0 = 0. For any
finite timed sequence α, we define t_end(α) to be the time of the last
event in α, if α contains any (action, time) pairs, or 0 if α contains no
such pairs. Also, we define s_end(α) to be the last state in α. We
denote by ord(α) (the "ordinary part" of α) the sequence

    s_0, π_1, s_1, π_2, ...,

i.e., α with the time information removed.

If i is a nonnegative integer and C ∈ part(A), we say that i is an
*initial index* for C in α if s_i ∈ enabled(A, C) and either i = 0, or
s_{i-1} ∈ disabled(A, C), or π_i ∈ C. Thus, an initial index for class C
is the index of an event at which C becomes enabled; it indicates a point
in α from which we will begin measuring upper and lower time bounds.

Definition: Suppose (A, b) is a timed automaton. Then a timed sequence α
is a *timed execution* of (A, b) provided that ord(α) is an execution of A
and α satisfies the following conditions, for each class C ∈ part(A) and
every initial index i for C in α.

1. If b_u(C) < ∞, then there exists j > i with t_j ≤ t_i + b_u(C), such
   that either π_j ∈ C or s_j ∈ disabled(A, C).

2. There does not exist j > i with t_j < t_i + b_l(C) and π_j ∈ C.

The first condition says that, starting from an initial index for C,
within time b_u(C) either some action in C occurs, or there is a point at
which no such action is enabled. Note that if b_u(C) = ∞, no upper bound
requirement is imposed. The second condition says that, again starting
from an initial index for C, no action in C can occur before time b_l(C)
has elapsed. Note in particular that if a class C becomes disabled and
then enabled once again, the lower bound calculation gets restarted at the
point where the class becomes re-enabled.
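The two conditions translate directly into a checker over a finite timed
sequence. The function below is a sketch under my own encoding (states as
a list, events as (action, time) pairs, predicates for enabledness and
class membership; none of these names come from the notes), and it treats
the upper-bound clause as requiring a witness within the given finite
sequence, which is only accurate for sequences long enough to exhibit one:

```python
def check_class_bounds(states, events, enabled, in_class, lower, upper):
    """Check conditions 1 and 2 for one class C over a finite timed
    sequence s_0, (pi_1, t_1), s_1, ..., (pi_n, t_n), s_n.
      states:   [s_0, ..., s_n]
      events:   [(pi_1, t_1), ..., (pi_n, t_n)]
      enabled:  predicate on states (is C enabled?)
      in_class: predicate on actions (is the action in C?)
      lower, upper: the boundmap interval for C (upper=None means infinity)
    """
    n = len(events)
    times = [0.0] + [t for (_, t) in events]   # convention: t_0 = 0
    for i in range(n + 1):
        # i is an initial index for C iff C is enabled in s_i and either
        # i = 0, C was disabled in s_{i-1}, or the i-th action is in C.
        if not enabled(states[i]):
            continue
        if not (i == 0 or not enabled(states[i - 1])
                or in_class(events[i - 1][0])):
            continue
        # Condition 1: within upper time units, an action of C occurs
        # or C becomes disabled (skipped when the upper bound is infinite).
        if upper is not None:
            if not any(times[j] <= times[i] + upper
                       and (in_class(events[j - 1][0])
                            or not enabled(states[j]))
                       for j in range(i + 1, n + 1)):
                return False
        # Condition 2: no action of C occurs before lower has elapsed.
        if any(times[j] < times[i] + lower
               and in_class(events[j - 1][0])
               for j in range(i + 1, n + 1)):
            return False
    return True
```

Note how the lower-bound clock restarts: each initial index i starts a
fresh measurement, so disabling and re-enabling C resets the bound, as the
text above requires.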

The *timed schedule* of a timed execution of a timed automaton (A, b) is
the subsequence consisting of the (action, time) pairs, and the *timed
behavior* is the subsequence consisting of the (action, time) pairs for
which the action is external. The timed schedules and timed behaviors of
(A, b) are just those of the timed executions of (A, b).

We model each timing-dependent concurrent system as a single timed
automaton (A, b), where A is a composition of ordinary I/O automata,
possibly with some output actions hidden. We also model problem
specifications, including timing properties, in terms of timed automata.

We note that the definition we use for timed automata may not be
sufficiently general to capture all interesting systems and timing
requirements; it does capture many, however. The [MerrittMT] paper
contains a generalization of this model in which bounds can be open or
closed; also, an upper bound of 0 is allowed; finally, it doesn't restrict
to only finitely many processes. It does results about composition,
similar to I/O automata.

C Simple Mutual Exclusion Example

Just as an illustrative example, I'll describe a very basic problem we
considered within this model, to get simple upper and lower bound results.
We started with mutual exclusion. When we did this, we did not have a
clear idea of which parameters, etc., would turn out to be the most
important, so we considered everything.

C Basic Proof Methods

Algorithms such as the ones in this paper quickly lead to the need for
proof methods for verifying correctness of timing-based algorithms.

(Footnote: An equivalent way of looking at each system is as a composition
of timed automata. An appropriate definition for a composition of timed
automata is developed in [MerrittMT], together with theorems showing the
equivalence of the two viewpoints.)

In the untimed setting, we make heavy use of invariants and simulations,
so it seems most reasonable to try to extend these methods to the timed
setting. However, it is not instantly obvious how to do this.

In an asynchronous system, everything important about the system's state
(i.e., everything that can affect the system's future behavior) is
summarized in the ordinary states of the processes and channels, i.e., in
the values of the local variables and the multiset of messages in transit
on each link. In a timing-based system, that is no longer the case. Now,
if a message is in transit, it is important to know whether it has just
been placed into a channel, or has been there a long time and is therefore
about to be delivered. Similar remarks hold for process steps.

Therefore, it is convenient to augment the states with some extra
"predictive" timing information, giving bounds on how long it will be
before certain events may or must occur.

C Consensus

Now we can also revisit the problem of distributed consensus in the timed
setting. This time, for simplicity, we ignore the separate clocks and
just suppose we have processes with lower and upper bounds c_1, c_2 on
step time. The question is how much time it takes to reach consensus if
there are f faulty processes.

C More Mutual Exclusion

Lynch-Shavit mutual exclusion, with proofs.

C Clock Synchronization

Work from Lamport and Melliar-Smith; Lundelius and Lynch; Fischer, Lynch,
and Merritt.

Distributed Algorithms                          Handout       September
Course Reading List

General References

N. Lynch and I. Saias. Distributed Algorithms: Lecture Notes for Fall.
MIT/LCS/RSS, Massachusetts Institute of Technology, February.

N. Lynch and K. Goldman. Distributed Algorithms: Lecture Notes.
MIT/LCS/RSS, Massachusetts Institute of Technology.

M. Raynal. Algorithms for Mutual Exclusion. MIT Press.

M. Raynal. Networks and Distributed Computation. MIT Press.

J. M. Helary and M. Raynal. Synchronization and Control of Distributed
Systems and Programs. John Wiley & Sons.

K. M. Chandy and J. Misra. Parallel Program Design: A Foundation.
Addison-Wesley.

Butler Lampson, William Weihl, and Eric Brewer. Principles of Computer
Systems, Fall. MIT/LCS/RSS, Massachusetts Institute of Technology.
Lecture notes and handouts.

D Introduction

L. Lamport and N. Lynch. Distributed computing. Chapter of Handbook on
Theoretical Computer Science. Also appeared as Technical Memo
MIT/LCS/TM, Laboratory for Computer Science, Massachusetts Institute of
Technology, Cambridge, MA, February.

D Models and Proof Methods

D Automata

Nancy Lynch and Frits Vaandrager. Forward and backward simulations for
timing-based systems. In Proceedings of REX Workshop "Real-Time: Theory
in Practice", Lecture Notes in Computer Science, Mook, The Netherlands.
Springer-Verlag. Also MIT/LCS/TM.

D Invariants

S. Owicki and D. Gries. An axiomatic proof technique for parallel
programs. Acta Informatica.

L. Lamport. Specifying concurrent program modules. ACM Transactions on
Programming Languages and Systems, April.

Butler Lampson, William Weihl, and Eric Brewer. Principles of Computer
Systems, Fall. MIT/LCS/RSS, Massachusetts Institute of Technology.
Lecture notes and handouts.

D Mapping Methods

Nancy Lynch and Frits Vaandrager. Forward and backward simulations for
timing-based systems. In Proceedings of REX Workshop "Real-Time: Theory
in Practice", Lecture Notes in Computer Science, Mook, The Netherlands.
Springer-Verlag. Also MIT/LCS/TM.

Martin Abadi and Leslie Lamport. The existence of refinement mappings.
Theoretical Computer Science.

Nils Klarlund and Fred B. Schneider. Proving nondeterministically
specified safety properties using progress measures. May. To appear in
Information and Computation, Spring.

D Shared Memory Models

N. Lynch and M. Fischer. On describing the behavior and implementation of
distributed systems. Theoretical Computer Science.

Kenneth Goldman, Nancy Lynch, and Kathy Yelick. Modelling shared state in
a shared action model. April. Manuscript. Also earlier version in
Proceedings of the Annual IEEE Symposium on Logic in Computer Science,
Philadelphia, PA, June.

D I/O Automata

Nancy A. Lynch and Mark R. Tuttle. An introduction to input/output
automata. CWI Quarterly.

Mark Tuttle. Hierarchical correctness proofs for distributed algorithms.
Master's thesis, Massachusetts Institute of Technology, Dept. of
Electrical Engineering and Computer Science, April.

Nick Reingold, Da-Wei Wang, and Lenore Zuck. Games I/O automata play.
September. Technical Report YALE/DCS/TR.

Da-Wei Wang, Lenore Zuck, and Nick Reingold. The power of I/O automata.
Manuscript, May.

D Temporal Logic

Z. Manna and A. Pnueli. The Temporal Logic of Reactive and Concurrent
Systems. Springer-Verlag.

K. M. Chandy and J. Misra. Parallel Program Design: A Foundation.
Addison-Wesley.

Leslie Lamport. The temporal logic of actions. Technical Report, Digital
Systems Research Center, December.

D Timed Models

F. Modugno, M. Merritt, and M. Tuttle. Time constrained automata. In
CONCUR: Proceedings Workshop on Theories of Concurrency: Unification and
Extension, Amsterdam, August.

Nancy Lynch and Frits Vaandrager. Forward and backward simulations for
timing-based systems. In Proceedings of REX Workshop "Real-Time: Theory
in Practice", Lecture Notes in Computer Science, Mook, The Netherlands.
Springer-Verlag. Also MIT/LCS/TM.

Rajeev Alur and David L. Dill. A theory of timed automata. To appear in
Theoretical Computer Science.

F. W. Vaandrager and N. A. Lynch. Action transducers and timed automata.
In Proceedings of CONCUR, International Conference on Concurrency Theory,
Lecture Notes in Computer Science, Stony Brook, NY, August.
Springer-Verlag. To appear.

N. Lynch and H. Attiya. Using mappings to prove timing properties.
Distributed Computing. To appear.

Thomas A. Henzinger, Zohar Manna, and Amir Pnueli. Temporal proof
methodologies for real-time systems. September. Submitted for
publication. An abbreviated version in [HMP].

T. A. Henzinger, Z. Manna, and A. Pnueli. Temporal proof methodologies
for real-time systems. In Proc. ACM Symp. on Principles of Programming
Languages, January.

D Algebra

C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall.

R. Milner. Communication and Concurrency. Prentice-Hall International,
Englewood Cliffs.

D Synchronous Message-Passing Systems

D Computing in a Ring

G. LeLann. Distributed systems: towards a formal approach. In IFIP
Congress, Toronto.

E. Chang and R. Roberts. An improved algorithm for decentralized
extrema-finding in circular configurations of processes. Communications
of the ACM, May.

D. Hirschberg and J. Sinclair. Decentralized extrema-finding in circular
configurations of processes. Communications of the ACM, November.

G. L. Peterson. An O(n log n) unidirectional distributed algorithm for
the circular extrema problem. ACM Transactions on Programming Languages
and Systems, October.

G. Frederickson and N. Lynch. Electing a leader in a synchronous ring.
Journal of the ACM, January. Also MIT/LCS/TM, July.

H. Attiya, M. Snir, and M. Warmuth. Computing in an anonymous ring.
Journal of the ACM, October.

D Network Protocols

Baruch Awerbuch. Course notes.

B. Awerbuch. Reducing complexities of distributed maximum flow and
breadth-first search algorithms by means of network synchronization.
Networks.

Michael Luby. A simple parallel algorithm for the maximal independent set
problem. SIAM J. Comput., November.

D Consensus

Basic Results

J. Gray. Notes on data base operating systems. Technical Report, IBM Report RJ, IBM, February. Also in Operating Systems: An Advanced Course. Springer-Verlag Lecture Notes in Computer Science.

G. Varghese and N. Lynch. A tradeoff between safety and liveness for randomized coordinated attack protocols. Symposium on Principles of Distributed Computing, Vancouver, B.C., Canada, August.

L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, July.

M. Pease, R. Shostak, and L. Lamport. Reaching agreement in the presence of faults. Journal of the ACM, April.

Amotz Bar-Noy and Raymond Strong. Shifting gears: Changing algorithms on the fly to expedite Byzantine agreement. Information and Computation.

D. Dolev and H. Strong. Authenticated algorithms for Byzantine agreement. SIAM J. Computing, November.

T. Srikanth and S. Toueg. Simulating authenticated broadcasts to derive simple fault-tolerant algorithms. Distributed Computing.

R. Turpin and B. Coan. Extending binary Byzantine agreement to multivalued Byzantine agreement. Information Processing Letters. Also Technical Report MIT/LCS/TR, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, April. Also revised in B. Coan, Achieving consensus in fault-tolerant distributed computing systems, PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.

Number of Processes

D. Dolev. The Byzantine generals strike again. Journal of Algorithms.

M. Fischer, N. Lynch, and M. Merritt. Easy impossibility proofs for distributed consensus problems. Distributed Computing.

L. Lamport. The weak Byzantine generals problem. Journal of the ACM.

Time Bounds

M. Fischer and N. Lynch. A lower bound for the time to assure interactive consistency. Information Processing Letters, June.

C. Dwork and Y. Moses. Knowledge and common knowledge in a Byzantine environment: Crash failures. Information and Computation, October.

Y. Moses and M. Tuttle. Programming simultaneous actions using common knowledge. Algorithmica.

Communication

B.A. Coan. Achieving Consensus in Fault-Tolerant Distributed Computer Systems: Protocols, Lower Bounds, and Simulations. PhD thesis, Massachusetts Institute of Technology, June.

Yoram Moses and O. Waarts. Coordinated traversal: (t+1)-round Byzantine agreement in polynomial time. In Proceedings of the Symposium on Foundations of Computer Science, October. To appear in the special issue of the Journal of Algorithms dedicated to selected papers from the IEEE Symposium on Foundations of Computer Science. Extended version available as Waarts' M.Sc. thesis, Weizmann Institute, August.

Piotr Berman and Juan Garay. Cloture voting: n/4-resilient distributed consensus in t+1 rounds. To appear in Mathematical Systems Theory: An International Journal on Mathematical Computing Theory, special issue dedicated to distributed agreement. Preliminary version appeared in Proc. Symp. on Foundations of Computer Science.

Approximate Agreement

D. Dolev, N. Lynch, S. Pinter, E. Stark, and W. Weihl. Reaching approximate agreement in the presence of faults. Journal of the ACM.

Firing Squad

J.E. Burns and N.A. Lynch. The Byzantine firing squad problem. Advances in Computing Research.

B. Coan, D. Dolev, C. Dwork, and L. Stockmeyer. The distributed firing squad problem. In Proceedings of the ACM Symposium on Theory of Computing, May.

Commit

C. Dwork and D. Skeen. The inherent cost of nonblocking commitment. In Proceedings of the 2nd Annual ACM Symposium on Principles of Distributed Computing, August.

P. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley.

B. A. Coan and J. L. Welch. Transaction commit in a realistic timing model. Distributed Computing.

The Knowledge Approach

J.Y. Halpern and Y. Moses. Knowledge and common knowledge in a distributed environment. In Proceedings of the 3rd ACM Symposium on Principles of Distributed Computing. Revised as IBM Research Report IBM-RJ.

C. Dwork and Y. Moses. Knowledge and common knowledge in a Byzantine environment: Crash failures. Information and Computation, October.

D Asynchronous Shared Memory Systems

D Mutual Exclusion

M. Raynal. Algorithms for Mutual Exclusion. MIT Press.

E.W. Dijkstra. Solution of a problem in concurrent programming control. Communications of the ACM, September.

D.E. Knuth. Additional comments on a problem in concurrent programming control. Communications of the ACM.

J.G. DeBruijn. Additional comments on a problem in concurrent programming control. Communications of the ACM.

M. Eisenberg and M. McGuire. Further comments on Dijkstra's concurrent programming control problem. Communications of the ACM.

James E. Burns and Nancy A. Lynch. Bounds on shared memory for mutual exclusion. To appear in Inf. Comput.

L. Lamport. A new solution of Dijkstra's concurrent programming problem. Communications of the ACM.

L. Lamport. The mutual exclusion problem. Journal of the ACM.

G. Peterson and M. Fischer. Economical solutions for the critical section problem in a distributed system. In Proceedings of the ACM Symposium on Theory of Computing, May.

J. Burns, M. Fischer, P. Jackson, N. Lynch, and G. Peterson. Data requirements for implementation of n-process mutual exclusion using a single shared variable. Journal of the ACM.

M. Merritt and G. Taubenfeld. Speeding Lamport's fast mutual exclusion algorithm. AT&T Technical Memorandum, May. Also submitted for publication.

D Resource Allocation

M. Fischer, N. Lynch, J. Burns, and A. Borodin. Resource allocation with immunity to limited process failure. In Proceedings of the IEEE Symposium on Foundations of Computer Science, October.

M. Fischer, N. Lynch, J. Burns, and A. Borodin. Distributed FIFO allocation of identical resources using small shared space. ACM Trans. Prog. Lang. and Syst., January.

D. Dolev, E. Gafni, and N. Shavit. Toward a non-atomic era: l-exclusion as a test case. In Proceedings of the ACM Symposium on Theory of Computing, May.

E.W. Dijkstra. Hierarchical ordering of sequential processes. Acta Informatica.

D Atomic Registers

G.L. Peterson. Concurrent reading while writing. ACM Transactions on Programming Languages and Systems.

L. Lamport. On interprocess communication. Distributed Computing. Also Digital Systems Research TM.

Ambuj K. Singh, James H. Anderson, and Mohamed G. Gouda. The elusive atomic register, June. Submitted for publication.

Bard Bloom. Constructing two-writer registers. IEEE Trans. Computers, December.

P. Vitanyi and B. Awerbuch. Atomic shared register access by asynchronous hardware. In Proceedings of the Annual IEEE Symposium on Theory of Computing, Toronto, Ontario, Canada, May. Also MIT/LCS/TM, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA. Corrigenda in Proceedings of the Annual IEEE Symposium on Theory of Computing.

R. Schaffer. On the correctness of atomic multi-writer registers. Bachelor's thesis, Massachusetts Institute of Technology, June. Also Technical Memo MIT/LCS/TM, Lab for Computer Science, MIT, June, and Revised Technical Memo MIT/LCS/TM, January.

Soma Chaudhuri and Jennifer Welch. Bounds on the costs of register implementations. Manuscript. Also Technical Report TR, University of North Carolina at Chapel Hill, June.

D Snapshots

Y. Afek, H. Attiya, D. Dolev, E. Gafni, M. Merritt, and N. Shavit. Atomic snapshots of shared memory. In Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, Quebec, Canada, August. Also Technical Memo MIT/LCS/TM, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, May. To appear in Journal of the ACM.

Cynthia Dwork, Maurice Herlihy, Serge Plotkin, and Orli Waarts. Time-lapse snapshots. In Israel Symposium on Theory of Computing and Systems (ISTCS). Also Stanford Technical Report STAN-CS, April, and Digital Equipment Corporation Technical Report CRL.

D Timestamp Systems

Amos Israeli and Ming Li. Bounded time-stamps. In Proceedings of the IEEE Symposium on Foundations of Computer Science, October.

R. Gawlick, N. Lynch, and N. Shavit. Concurrent timestamping made simple. In Israel Symposium on Theory of Computing and Systems (ISTCS). Springer-Verlag, May.

Cynthia Dwork and Orli Waarts. Simple and efficient bounded concurrent timestamping, or bounded concurrent timestamp systems are comprehensible. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC), Victoria, B.C., Canada, May. Preliminary version appears in IBM Research Report RJ, October.

D Consensus

M. Loui and H. Abu-Amara. Memory requirements for agreement among unreliable asynchronous processes. Advances in Computing Research.

Maurice Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, January.

H. Attiya, N. Lynch, and N. Shavit. Are wait-free algorithms fast? Submitted for publication. Earlier versions in MIT/LCS/TM and Proceedings of the Annual IEEE Symposium on Foundations of Computer Science, October.

Karl Abrahamson. On achieving consensus using a shared memory. In Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, Toronto, Canada, August.

D Shared Memory for Multiprocessors

Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, September.

Alan Jay Smith. Cache memories. Computing Surveys, September.

Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, July.

Maurice Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, January.

Yehuda Afek, Geoffrey Brown, and Michael Merritt. Lazy caching. TOPLAS, October. To appear. Also a preliminary version appeared in SPAA.

Phillip B. Gibbons, Michael Merritt, and Kourosh Gharachorloo. Proving sequential consistency of high-performance shared memories. AT&T Bell Laboratories, TM and TM, May.

Phillip B. Gibbons and Michael Merritt. Specifying nonblocking shared memories. AT&T Bell Laboratories, BL TM and BL TM, May.

Hagit Attiya and Roy Friedman. A correctness condition for high-performance multiprocessors. Unpublished manuscript, November.

Richard J. Lipton and Jonathan S. Sandberg. PRAM: A scalable shared memory. Department of Computer Science, Princeton University, CS-TR, September.

Mustaque Ahamad, James E. Burns, Phillip W. Hutto, and Gil Neiger. Causal memory. College of Computing, Georgia Institute of Technology, GIT-CC, September.

Hagit Attiya and Jennifer Welch. Sequential consistency versus linearizability. In Proceedings of the 3rd ACM Symp. on Parallel Algorithms and Architectures (SPAA). Also Technion, Israel Institute of Technology, Dept. of CS, TR.

D Asynchronous Message-Passing Systems

D Computing in Static Networks

Rings

G. LeLann. Distributed systems: towards a formal approach. In IFIP Congress, Toronto.

E. Chang and R. Roberts. An improved algorithm for decentralized extrema-finding in circular configurations of processes. Communications of the ACM, May.

D. Hirschberg and J. Sinclair. Decentralized extrema-finding in circular configurations of processes. Communications of the ACM, November.

G.L. Peterson. An O(n log n) unidirectional distributed algorithm for the circular extrema problem. ACM Transactions on Programming Languages and Systems, October.

D. Dolev, M. Klawe, and M. Rodeh. An O(n log n) unidirectional distributed algorithm for extrema finding in a circle. Research Report RJ, IBM, July. J. Algorithms.

J. Burns. A formal model for message passing systems. Technical Report TR, Computer Science Dept., Indiana University, May.

K. Abrahamson, A. Adler, L. Higham, and D. Kirkpatrick. Tight lower bounds for probabilistic solitude verification on anonymous rings. Submitted for publication.

Complete Graphs

E. Korach, S. Moran, and S. Zaks. Tight lower and upper bounds for some distributed algorithms for a complete network of processors. In Proceedings of the 3rd ACM Symposium on Principles of Distributed Computing.

Y. Afek and E. Gafni. Time and message bounds for election in synchronous and asynchronous complete networks. In Proceedings of the ACM Symposium on Principles of Distributed Computing, Minaki, Ontario, August.

General Networks

D. Angluin. Local and global properties in networks of processors. In Proceedings of the ACM Symposium on Theory of Computing.

R. Gallager, P. Humblet, and P. Spira. A distributed algorithm for minimum-weight spanning trees. ACM Transactions on Programming Languages and Systems, January.

P. Humblet. A distributed algorithm for minimum weight directed spanning trees. IEEE Transactions on Communications, COM. MIT LIDS-P.

Baruch Awerbuch and David Peleg. Sparse partitions. In Proceedings of the 31st IEEE Symposium on Foundations of Computer Science, St. Louis, Missouri, October.

D Network Synchronization

L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, July.

J. M. Helary and M. Raynal. Synchronization and Control of Distributed Systems and Programs. John Wiley & Sons.

B. Awerbuch. Complexity of network synchronization. J. ACM, October. Also Technical Memo MIT/LCS/TM, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, January.

B. Awerbuch and M. Sipser. Dynamic networks are as fast as static networks. In Proceedings of the IEEE Symposium on Foundations of Computer Science. IEEE, October.

E. Arjomandi, M. Fischer, and N. Lynch. Efficiency of synchronous versus asynchronous distributed systems. Journal of the ACM, July.

Leslie Lamport. Using time instead of timeout for fault-tolerant distributed systems. ACM Transactions on Programming Languages and Systems, April.

Leslie Lamport. The part-time parliament. Technical Memo, Digital Systems Research Center, September.

J.L. Welch. Simulating synchronous processors. Information and Computation, August.

D Simulating Asynchronous Shared Memory

Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing memory robustly in message-passing systems. In Proceedings of the Annual ACM Symposium on Principles of Distributed Computing. Revised version to appear in J. ACM.

D Resource Allocation

M. Raynal. Algorithms for Mutual Exclusion. MIT Press.

K. Chandy and J. Misra. The drinking philosophers problem. ACM Transactions on Programming Languages and Systems, October.

N. Lynch. Upper bounds for static resource allocation in a distributed system. Journal of Computer and System Sciences, October.

M. Rabin and D. Lehmann. On the advantages of free choice: a symmetric and fully distributed solution to the dining philosophers problem. In Proceedings of the ACM Symposium on Principles of Programming Languages.

Baruch Awerbuch and Michael Saks. A dining philosophers algorithm with polynomial response time. In Proceedings of the 31st IEEE Symp. on Foundations of Computer Science, St. Louis, Missouri, October.

Jennifer Welch and Nancy Lynch. A modular drinking philosophers algorithm, July. Earlier version appeared as MIT/LCS/TM.

D Detecting Stable Properties

Termination Detection

E. Dijkstra and C. Scholten. Termination detection for diffusing computations. Information Processing Letters, August.

N. Francez and N. Shavit. A new approach to detection of locally indicative stability. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP), Rennes, France, July. Springer-Verlag.

Global Snapshots

K. Chandy and L. Lamport. Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems, February.

M. Fischer, N. Griffeth, and N. Lynch. Global states of a distributed system. IEEE Transactions on Software Engineering, SE, May. Also in Proceedings of IEEE Symposium on Reliability in Distributed Software and Database Systems, Pittsburgh, PA, July.

D Deadlock Detection

D. Menasce and R. Muntz. Locking and deadlock detection in distributed databases. IEEE Transactions on Software Engineering, SE, May.

V. Gligor and S. Shattuck. On deadlock detection in distributed systems. IEEE Transactions on Software Engineering, SE, September.

R. Obermarck. Distributed deadlock detection algorithm. ACM Transactions on Database Systems, June.

G. Ho and C. Ramamoorthy. Protocols for deadlock detection in distributed database systems. IEEE Transactions on Software Engineering, SE, November.

K. Chandy, J. Misra, and L. Haas. Distributed deadlock detection. ACM Transactions on Programming Languages and Systems, May.

G. Bracha and S. Toueg. A distributed algorithm for generalized deadlock detection. Distributed Computing.

D. Mitchell and M. Merritt. A distributed algorithm for deadlock detection and resolution. In Proceedings of the 3rd ACM Symposium on Principles of Distributed Computing, Vancouver, B.C., Canada, August.

D Consensus

M. Fischer, N. Lynch, and M. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, April.

S. Moran and Y. Wolfstahl. Extended impossibility results for asynchronous complete networks. Information Processing Letters.

M. Bridgland and R. Watro. Fault-tolerant decision making in totally asynchronous distributed systems. In Proceedings of the ACM Symposium on Principles of Distributed Computing, August. Revision in progress.

Ofer Biran, Shlomo Moran, and Shmuel Zaks. A combinatorial characterization of the distributed 1-solvable tasks. Journal of Algorithms, September. Earlier version in Proceedings of the ACM Symposium on Principles of Distributed Computing, August.

Gadi Taubenfeld, Shmuel Katz, and Shlomo Moran. Impossibility results in the presence of multiple faulty processes. In C.E. Veni Madhavan, editor, Proc. of the FST&TCS Conference, volume of Lecture Notes in Computer Science, Bangalore, India. Springer-Verlag. To appear in Information and Computation.

Gadi Taubenfeld. On the nonexistence of resilient consensus protocols. Information Processing Letters, March.

H. Attiya, A. Bar-Noy, D. Dolev, D. Peleg, and R. Reischuk. Renaming in an asynchronous environment. J. ACM, July.

D. Dolev, C. Dwork, and L. Stockmeyer. On the minimal synchronism needed for distributed consensus. Journal of the ACM.

Tushar D. Chandra and Sam Toueg. Unreliable failure detectors for asynchronous systems. In Proceedings of the ACM Symposium on Principles of Distributed Computing, Montreal, Canada, August. Also Cornell University, Department of Computer Science, Ithaca, New York, TR.

Tushar D. Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest failure detector for solving consensus. In Proceedings of the ACM Symposium on Principles of Distributed Computing, Vancouver, Canada, August. Also Cornell University, Department of Computer Science, Ithaca, New York, TR.

D Datalink

A. Aho, J. Ullman, A. Wyner, and M. Yannakakis. Bounds on the size and transmission rate of communication protocols. Computers and Mathematics with Applications.

J. Y. Halpern and L. D. Zuck. A little knowledge goes a long way: Simple knowledge-based derivations and correctness proofs for a family of protocols. J. ACM, July. Earlier version exists in Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, August.

N. Lynch, Y. Mansour, and A. Fekete. The data link layer: Two impossibility results. In Proceedings of the ACM Symposium on Principles of Distributed Computing, Toronto, Canada, August. Also Technical Memo MIT/LCS/TM, May.

Y. Afek, H. Attiya, A. Fekete, M. Fischer, N. Lynch, Y. Mansour, D. Wang, and L. Zuck. Reliable communication over unreliable channels. Technical Memo MIT/LCS/TM, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, October.

A. Fekete, N. Lynch, Y. Mansour, and J. Spinelli. The data link layer: The impossibility of implementation in face of crashes. Technical Memo MIT/LCS/TM-b, Massachusetts Institute of Technology, Laboratory for Computer Science, August. Submitted for publication.

A. Fekete and N. Lynch. The need for headers: an impossibility result for communication over unreliable channels. In G. Goos and J. Hartmanis, editors, CONCUR: Theories of Concurrency: Unification and Extension, volume of Lecture Notes in Computer Science. Springer-Verlag, August. Also Technical Memo MIT/LCS/TM, May.

Y. Afek, H. Attiya, A. Fekete, M. Fischer, N. Lynch, Y. Mansour, D. Wang, and L. Zuck. Reliable communication over an unreliable channel. Submitted for publication.

Da-Wei Wang and Lenore Zuck. Tight bounds for the sequence transmission problem. In Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, August.

Ewan Tempero and Richard Ladner. Tight bounds for weakly bounded protocols. In Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, Quebec, Canada, August.

Ewan Tempero and Richard Ladner. Recoverable sequence transmission protocols, June. Manuscript.

D Special-Purpose Network Building Blocks

J. M. Helary and M. Raynal. Synchronization and Control of Distributed Systems and Programs. John Wiley & Sons.

B. Awerbuch, O. Goldreich, D. Peleg, and R. Vainish. A tradeoff between information and communication in broadcast protocols. J. ACM, April.

Yehuda Afek, Eli Gafni, and Adi Rosen. The slide mechanism with applications in dynamic networks. In Proceedings of the Eleventh Annual Symp. on Principles of Distributed Computing, Vancouver, British Columbia, Canada, August.

D Self-Stabilization

Edsger W. Dijkstra. Self-stabilizing systems in spite of distributed control. Communications of the ACM, November.

Edsger W. Dijkstra. A belated proof of self-stabilization. Distributed Computing, January.

Geoffrey M. Brown, Mohamed G. Gouda, and Chuan-Lin Wu. Token systems that self-stabilize. IEEE Trans. Computers, June.

James E. Burns and Jan Pachl. Uniform self-stabilizing rings. ACM Trans. Prog. Lang. and Syst., April.

Shmuel Katz and Kenneth J. Perry. Self-stabilizing extensions for message-passing systems, March. To appear in Distrib. Comput. Also earlier version in Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, August.

Yehuda Afek and Geoffrey Brown. Self-stabilization of the alternating bit protocol. In Proceedings of the IEEE Symposium on Reliable Distributed Systems.

Shlomi Dolev, Amos Israeli, and Shlomo Moran. Resource bounds for self-stabilizing message driven protocols. In Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, August.

Baruch Awerbuch, Boaz Patt-Shamir, and George Varghese. Self-stabilization by local checking and correction. Proceedings of the Annual IEEE Symposium on Foundations of Computer Science, October.

G. Varghese. Dealing with Failure in Distributed Systems. PhD thesis, MIT Dept. of Electrical Engineering and Computer Science. In progress.

D Knowledge

K. Chandy and J. Misra. How processes learn. Distributed Computing.

D Timing-Based Systems

D Resource Allocation

H. Attiya and N. Lynch. Time bounds for real-time process control in the presence of timing uncertainty. Inf. Comput. To appear.

N. Lynch and N. Shavit. Timing-based mutual exclusion. In Proceedings of the IEEE Real-Time Systems Symposium, Phoenix, Arizona, December. To appear.

Rajeev Alur and Gadi Taubenfeld. Results about fast mutual exclusion. In Proceedings of the IEEE Real-Time Systems Symposium, Phoenix, Arizona, December. To appear.

D Shared Memory for Multiprocessors

M. Mavronicolas. Timing-based distributed computation: Algorithms and impossibility results. Harvard University, Cambridge, MA, TR.

D Consensus

C. Dwork, N. Lynch, and L. Stockmeyer. Consensus in the presence of partial synchrony. Journal of the ACM.

Hagit Attiya, Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Bounds on the time to reach agreement in the presence of timing uncertainty. J. ACM. To appear. Also a preliminary version appeared in the 23rd STOC.

S. Ponzio. The real-time cost of timing uncertainty: Consensus and failure detection. Master's thesis, MIT Dept. of Electrical Engineering and Computer Science, June. Also MIT/LCS/TR.

Stephen Ponzio. Consensus in the presence of timing uncertainty: Omission and Byzantine failures. In the ACM Symposium on Principles of Distributed Computing, Montreal, Canada, August.

D Communication

Da-Wei Wang and Lenore D. Zuck. Real-time STP. In Proceedings of the Tenth ACM Symp. on Principles of Distributed Computing, August. Also Technical Report YALE/DCS/TR.

S. Ponzio. Bounds on the time to detect failures using bounded-capacity message links. In Proceedings of the Real-Time Systems Symposium, Phoenix, Arizona, December. To appear.

S. Ponzio. The real-time cost of timing uncertainty: Consensus and failure detection. Master's thesis, MIT Dept. of Electrical Engineering and Computer Science, June. Also MIT/LCS/TR.

Amir Herzberg and Shay Kutten. Fast isolation of arbitrary forwarding-faults. Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, August.

D Clock Synchronization

Leslie Lamport and P. M. Melliar-Smith. Synchronizing clocks in the presence of faults. J. ACM, January.

J. Lundelius and N. Lynch. A new fault-tolerant algorithm for clock synchronization. Information and Computation.

J. Lundelius and N. Lynch. An upper and lower bound for clock synchronization. Information and Control, August/September.

Danny Dolev, Joseph Y. Halpern, Barbara Simons, and Ray Strong. Dynamic fault-tolerant clock synchronization, January. IBM Research Report RJ. Also earlier version appeared in Proceedings of the 3rd ACM Symposium on Principles of Distributed Computing, August.

D. Dolev, J. Halpern, and R. Strong. On the possibility and impossibility of achieving clock synchronization. In Proceedings of the Symposium on Theory of Computing, May. Later version in Journal of Computer and System Sciences, April.

S. Mahaney and F. Schneider. Inexact agreement: accuracy, precision, and graceful degradation. In Proceedings of the ACM Symposium on Principles of Distributed Computing, August.

M. Fischer, N. Lynch, and M. Merritt. Easy impossibility proofs for distributed consensus problems. Distributed Computing.

D Applications of Synchronized Clocks

Barbara Liskov. Practical uses of synchronized clocks in distributed systems. In Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, Montreal, Canada, August.

Gil Neiger and Sam Toueg. Simulating synchronized clocks and common knowledge in distributed systems, July. To appear in J. ACM. Earlier version exists in Proceedings of the ACM Symposium on Principles of Distributed Computing, August, Vancouver, Canada.