Time and Logical Clocks

COS 418: Distributed Systems, Lecture 4

Kyle Jamieson

Today

1. The need for time synchronization
2. "Wall clock time" synchronization
3. Logical Time

A distributed edit-compile workflow

• Physical time →
• 2143 < 2144 → make doesn't call the compiler (the source file's timestamp appears older than the object file's, even though the edit actually happened after the compile)
• Result of the lack of time synchronization: a possible object file mismatch

What makes time synchronization hard?

1. Quartz oscillators are sensitive to temperature, age, vibration, and radiation
   – Accuracy ca. one part per million (one second of clock drift over 12 days)
2. The internet is:
   • Asynchronous: arbitrary message delays
   • Best-effort: messages don't always arrive

Today

1. The need for time synchronization
2. "Wall clock time" synchronization
   – Cristian's algorithm, Berkeley algorithm, NTP
3. Logical Time
   – Lamport clocks
   – Vector clocks

Just use Coordinated Universal Time?

• UTC is broadcast from radio stations on land and satellite (e.g., the Global Positioning System)
  – Computers with receivers can synchronize their clocks with these timing signals
• Signals from land-based stations are accurate to about 0.1−10 milliseconds
• Signals from GPS are accurate to about one microsecond
  – Why can't we put GPS receivers on all our computers?

Synchronization to a time server

• Suppose a server with an accurate clock (e.g., a GPS-disciplined crystal oscillator)
  – Could simply issue an RPC to obtain the time (Client → Server; time ↓)
• But this doesn't account for network latency
  – Message delays will have outdated the server's answer

Cristian's algorithm: Outline

1. Client sends a request packet, timestamped with its local clock T1
2. Server timestamps its receipt of the request with its local clock T2
3. Server sends a response packet with its local clock T3 and T2
4. Client locally timestamps its receipt of the server's response with T4

How can the client use these timestamps to synchronize its local clock to the server's local clock?

Cristian's algorithm: Offset sample calculation

Goal: Client sets clock ← T3 + δ_resp

• Client samples round trip time δ = δ_req + δ_resp = (T4 − T1) − (T3 − T2)
• But the client knows δ, not δ_resp
• Assume: δ_req ≈ δ_resp

Client sets clock ← T3 + ½δ

Today

1. The need for time synchronization
2. "Wall clock time" synchronization
   – Cristian's algorithm, Berkeley algorithm, NTP
3. Logical Time
   – Lamport clocks
   – Vector clocks
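Before moving on, here is a minimal Python sketch of Cristian's offset calculation from the slide above (timestamps are assumed to be seconds; the function name is illustrative):

    def cristian_sync(t1, t2, t3, t4):
        """One exchange of Cristian's algorithm.
        Returns (round_trip_delay, clock_setting): the client should set
        its clock to clock_setting at the instant it recorded t4."""
        delay = (t4 - t1) - (t3 - t2)   # time spent on the network, both ways
        # Assume the request and response delays are roughly equal, so the
        # server's clock read t3 about delay/2 seconds before the client's t4.
        return delay, t3 + delay / 2

    # Example: the server's clock runs ~100 ms ahead of the client's.
    delay, setting = cristian_sync(t1=10.000, t2=10.105, t3=10.106, t4=10.011)
    print(delay, setting)   # 0.01, 10.111 -> client sets its clock to 10.111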

Berkeley algorithm

• A single time server can fail, blocking timekeeping
• The Berkeley algorithm is a distributed algorithm for timekeeping
  – Assumes all machines have equally-accurate local clocks
  – Obtains an average from participating computers and synchronizes clocks to that average
• Master machine: polls L other machines using Cristian's algorithm → { θi } (i = 1…L)
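A minimal sketch of the master's averaging step, assuming the offsets { θi } have already been measured via Cristian's algorithm (the function name is illustrative):

    def berkeley_corrections(offsets):
        """offsets: each machine's clock offset relative to the master,
        with the master itself contributing 0.0.
        Returns the adjustment each machine should apply to its clock."""
        average = sum(offsets) / len(offsets)
        return [average - theta for theta in offsets]

    # Master at 0.0; machine 1 is 25 ms fast; machine 2 is 13 ms slow.
    print(berkeley_corrections([0.0, 0.025, -0.013]))
    # ~[0.004, -0.021, 0.017] -> every clock converges on the average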

Today

1. The need for time synchronization
2. "Wall clock time" synchronization
   – Cristian's algorithm, Berkeley algorithm, NTP
3. Logical Time
   – Lamport clocks
   – Vector clocks

The Network Time Protocol (NTP)

• Enables clients to be accurately synchronized to UTC despite message delays
• Provides a reliable service
  – Survives lengthy losses of connectivity
  – Communicates over redundant network paths
• Provides an accurate service
  – Unlike the Berkeley algorithm, leverages heterogeneous accuracy in clocks

NTP: System structure

• Servers and time sources are arranged in layers (strata)
  – Stratum 0: High-precision time sources themselves, e.g., atomic clocks, shortwave radio time receivers
  – Stratum 1: NTP servers directly connected to Stratum 0
  – Stratum 2: NTP servers that synchronize with Stratum 1 (Stratum 2 servers are clients of Stratum 1 servers)
  – Stratum 3: NTP servers that synchronize with Stratum 2 (Stratum 3 servers are clients of Stratum 2 servers)
• Users' computers synchronize with Stratum 3 servers

NTP operation: Server selection

• Messages between an NTP client and server are exchanged in pairs: request and response
  – Use Cristian's algorithm
• For the ith message exchange with a particular server, calculate:
  1. Clock offset θi from client to server
  2. Round trip time δi between client and server
• Over the last eight exchanges with server k, the client computes its dispersion εk = max_i δi − min_i δi
  – The client uses the server with minimum dispersion
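A minimal sketch of the dispersion computation and server selection above (the data layout and names are illustrative; a real client would keep eight samples per server):

    def dispersion(delays):
        """Dispersion over one server's recent round-trip-time samples."""
        return max(delays) - min(delays)

    def select_server(samples):
        """samples: server name -> list of recent round trip times (seconds).
        Pick the server whose delay varies the least."""
        return min(samples, key=lambda server: dispersion(samples[server]))

    print(select_server({"a": [0.020, 0.021, 0.019], "b": [0.010, 0.090, 0.015]}))
    # "a": smaller dispersion wins even though "b" saw a lower minimum delay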

NTP operation: Clock offset calculation

• The client tracks the minimum round trip time and its associated offset over the last eight message exchanges: (δ0, θ0)
  – θ0 is the best estimate of offset: the client adjusts its clock by θ0 to synchronize to the server

For one exchange (timestamps T1–T4 as in Cristian's algorithm):

θ = ½[(T2 − T1) + (T3 − T4)]
δ = (T4 − T1) − (T3 − T2)

Clock filter algorithm (wedge scattergram of offset θ versus round trip time δ; each point represents one sample):
• The most accurate offset θ0 is measured at the lowest delay δ0 (the apex of the wedge scattergram)
• The correct time θ must lie within the wedge θ0 ± (δ − δ0)/2
• δ0 is estimated as the minimum of the last eight delay measurements, and (θ0, δ0) becomes the peer update
• Each peer update can be used only once and must be more recent than the previous update

NTP operation: How to change time

• Can't just change the time: we don't want time to run backwards
  – Recall the make example
• Instead, change the update rate for the clock
  – Changes time in a more gradual fashion
  – Prevents inconsistent local timestamps

Clock synchronization: Take-away points

• Clocks on different systems will always behave differently
  – Disagreement between machines can result in undesirable behavior
• NTP, Berkeley clock synchronization
  – Rely on timestamps to estimate network delays
  – Accuracy: 100s of µs to ms
  – Clocks are never exactly synchronized
• Often inadequate for distributed systems
  – Often need to reason about the order of events
  – Might need precision on the order of ns

Today

1. The need for time synchronization
2. "Wall clock time" synchronization
   – Cristian's algorithm, Berkeley algorithm, NTP
3. Logical Time
   – Lamport clocks
   – Vector clocks
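Returning to NTP's offset calculation: a minimal sketch of the per-exchange formulas and the eight-sample clock filter (the class shape is illustrative, not NTP's actual implementation):

    from collections import deque

    class ClockFilter:
        """Keeps the last eight (offset, delay) samples for one server."""
        def __init__(self):
            self.samples = deque(maxlen=8)

        def add_exchange(self, t1, t2, t3, t4):
            offset = ((t2 - t1) + (t3 - t4)) / 2    # theta for this exchange
            delay = (t4 - t1) - (t3 - t2)           # delta for this exchange
            self.samples.append((offset, delay))

        def peer_update(self):
            """(theta0, delta0): the offset measured at the minimum delay."""
            return min(self.samples, key=lambda sample: sample[1])

    f = ClockFilter()
    f.add_exchange(10.000, 10.105, 10.106, 10.011)   # low round trip time
    f.add_exchange(11.000, 11.155, 11.156, 11.061)   # higher delay, noisier
    print(f.peer_update())   # the sample taken at the lowest delay wins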

Motivation: Multi-site database replication

• A New York-based bank wants to make its transaction ledger database resilient to whole-site failures
• Replicate the database: keep one copy in sf, one in nyc
  – Client sends a query to the nearest copy
  – Client sends an update to both copies

The consequences of concurrent updates

• Two concurrent updates, "Deposit $100" and "Pay 1% interest", arrive at the copies in different orders:
  – San Francisco: $1,000 → (deposit) $1,100 → (interest) $1,111
  – New York: $1,000 → (interest) $1,010 → (deposit) $1,110
• Inconsistent replicas! The updates should have been performed in the same order at each copy

Idea: Logical clocks

• Landmark 1978 paper by Leslie Lamport
• Insight: only the events themselves matter
• Idea: disregard the precise clock time; instead, capture just a "happens before" relationship between a pair of events

Defining "happens-before"

• Consider three processes: P1, P2, and P3
• Notation: Event a happens before event b (a → b)

(Diagram: three process timelines P1, P2, P3; physical time increases downward.)

Defining "happens-before"

1. Can observe event order at a single process: if a and b are events of the same process and a occurs before b, then a → b
2. Can observe ordering when processes communicate: if c is a message receipt of b, then b → c
3. Can observe ordering transitively: if a → b and b → c, then a → c

(Diagram: a and b are successive events on P1; the message sent at b is received at event c on another process; physical time increases downward.)

Concurrent events

• Not all events are related by →
• a and d are not related by →, so they are concurrent, written a || d

(Diagram: events a, b, c as before, plus an event d with no happens-before chain to or from a.)

Lamport clocks: Objective

• We seek a clock time C(a) for every event a
• Plan: tag events with clock times; use clock times to make the distributed system correct
• Clock condition: if a → b, then C(a) < C(b)

The Lamport Clock algorithm

• Each process Pi maintains a local clock Ci (initially 0)

1. Before executing an event a, Ci ← Ci + 1; set the event time C(a) ← Ci
2. Send the local clock in the message m: C(m) = C(b), where b is the send event
3. On process Pj receiving a message m: set Cj and the receive-event time C(c) ← 1 + max{ Cj, C(m) }

(Worked example: all clocks start at 0. Each process executes one local event: C(a) = C1 = 1 on P1, and C2 = C3 = 1 after local events on P2 and P3. P1 then executes the send event b: C1 = 2, C(b) = 2, and the message m carries C(m) = 2. When P2 receives m at event c: C2 ← 1 + max{1, 2} = 3, so C(c) = 3.)

Ordering all events

• Break ties by appending the process number to each event:
  1. Process Pi timestamps event e with Ci(e).i
  2. C(a).i < C(b).j when: C(a) < C(b), or C(a) = C(b) and i < j
• Now, for any two events a and b, either C(a) < C(b) or C(b) < C(a)
  – This is called a total ordering of events
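A minimal Python sketch of the three rules above (one instance per process; method names are illustrative):

    class LamportClock:
        def __init__(self):
            self.c = 0                      # this process's local clock Ci

        def local_event(self):
            self.c += 1                     # Rule 1: increment before the event
            return self.c                   # the event's timestamp

        def send_event(self):
            return self.local_event()       # Rule 2: the message carries this clock

        def receive_event(self, c_m):
            self.c = 1 + max(self.c, c_m)   # Rule 3
            return self.c                   # the receive event's timestamp

    # The worked example above:
    p1, p2 = LamportClock(), LamportClock()
    p1.local_event(); p2.local_event()      # C(a) = 1 on P1; C2 = 1 on P2
    m = p1.send_event()                     # C(b) = 2; message carries C(m) = 2
    print(p2.receive_event(m))              # C(c) = 1 + max{1, 2} = 3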

Making concurrent updates consistent

• Recall multi-site database replication:
  – San Francisco (P1) deposited $100: $
  – New York (P2) paid 1% interest: %
• We reached an inconsistent state
• Could we design a system that uses Lamport clocks to make multi-site updates consistent?

Totally-Ordered Multicast

• Client sends an update to one replica → Lamport timestamp C(x)
• Key idea: place events into a local queue, sorted by increasing C(x)
  – e.g., P1's local queue holds $ (timestamp 1.1) then % (timestamp 1.2); P2's local queue holds % (timestamp 1.2)
• Goal: all sites apply the updates in the same (total) Lamport clock order

Totally-Ordered Multicast (Almost correct)

1. On receiving an event from a client, broadcast it to others (including yourself)
2. On receiving an event from a replica:
   a) Add it to your local queue
   b) Broadcast an acknowledgement message to every process (including yourself)
3. On receiving an acknowledgement: mark the corresponding event acknowledged in your queue
4. Remove and process events everyone has ack'ed from the head of the queue

Walkthrough (ack's to self not shown):
• P1 queues $ (1.1); P2 queues % (1.2)
• P1 queues and ack's %; P1 marks % fully ack'ed
• P2 marks % fully ack'ed; P2 processes % — even though $ (1.1) is ordered before it

Totally-Ordered Multicast (Correct version)

1. On receiving an event from a client, broadcast it to others (including yourself)
2. On receiving or processing an event:
   a) Add it to your local queue
   b) Broadcast an acknowledgement message to every process (including yourself), but only from the head of the queue
3. When you receive an acknowledgement: mark the corresponding event acknowledged in your queue
4. Remove and process events everyone has ack'ed from the head of the queue

Walkthrough (ack's to self not shown):
• P1 queues $ (1.1); P2 queues % (1.2)
• P2 queues and ack's $; P2 marks $ fully ack'ed and processes $
• P1 marks $ fully ack'ed and processes $, then ack's %
• Both replicas mark % fully ack'ed and process %
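A minimal sketch of one replica's queue logic for the correct version (network delivery is abstracted away; the names and the timestamp encoding (clock, process-id) are illustrative):

    import heapq

    class Replica:
        def __init__(self, pid, all_pids):
            self.pid = pid
            self.all_pids = set(all_pids)
            self.queue = []    # min-heap of (timestamp, update); timestamp = (C, pid)
            self.acks = {}     # timestamp -> set of processes that have ack'ed it

        def on_update(self, timestamp, update):
            """Rule 2a: queue a broadcast update. (Rule 2b: we would then
            broadcast an ack for it once it reaches the head of the queue.)"""
            heapq.heappush(self.queue, (timestamp, update))
            self.acks.setdefault(timestamp, set())

        def on_ack(self, timestamp, from_pid):
            """Rule 3: record the ack; Rule 4: deliver fully-ack'ed head events."""
            self.acks.setdefault(timestamp, set()).add(from_pid)
            delivered = []
            while self.queue and self.acks[self.queue[0][0]] == self.all_pids:
                _, update = heapq.heappop(self.queue)
                delivered.append(update)   # processed in total timestamp order
            return delivered

    # $ = (1, 1) from P1, % = (1, 2) from P2: every replica delivers $ before %.
    r = Replica(pid=2, all_pids=[1, 2])
    r.on_update((1, 2), "%"); r.on_update((1, 1), "$")
    r.on_ack((1, 1), 1); print(r.on_ack((1, 1), 2))   # ['$']
    r.on_ack((1, 2), 1); print(r.on_ack((1, 2), 2))   # ['%']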

So, are we done?

• Does totally-ordered multicast solve the problem of multi-site replication in general?
• Not by a long shot!
  1. Our protocol assumed: no node failures, no message loss, no message corruption
  2. All-to-all communication does not scale
  3. Waits forever for delayed messages (performance?)

Take-away points: Lamport clocks

• Can totally order events in a distributed system: that's useful!
• But: while by construction a → b implies C(a) < C(b), the converse is not necessarily true:
  – C(a) < C(b) does not imply a → b (possibly, a || b)
• So we can't use Lamport clock timestamps to infer causal relationships between events

Today

1. The need for time synchronization
2. "Wall clock time" synchronization
   – Cristian's algorithm, Berkeley algorithm, NTP
3. Logical Time
   – Lamport clocks
   – Vector clocks

Vector clock (VC)

• Label each event e with a vector V(e) = [c1, c2, …, cn]
  – ci is a count of events in process i that causally precede e
• Initially, all vectors are [0, 0, …, 0]
• Two update rules:
  1. For each local event on process i, increment the local entry ci
  2. If process j receives a message with vector [d1, d2, …, dn]:
     – Set each local entry ck = max{ck, dk}
     – Increment the local entry cj
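A minimal sketch of the two update rules (one instance per process; names are illustrative):

    class VectorClock:
        def __init__(self, pid, n):
            self.pid = pid            # this process's index i
            self.v = [0] * n          # counts of causally-preceding events

        def local_event(self):
            self.v[self.pid] += 1     # Rule 1: bump our own entry
            return list(self.v)       # the event's vector timestamp

        def receive(self, d):
            # Rule 2: pointwise max with the message's vector, then bump ours
            self.v = [max(ck, dk) for ck, dk in zip(self.v, d)]
            self.v[self.pid] += 1
            return list(self.v)

    # Matches the example on the next slide: P1 executes a and b, then sends to P2.
    p1, p2 = VectorClock(0, 3), VectorClock(1, 3)
    a = p1.local_event()          # [1, 0, 0]
    b = p1.local_event()          # [2, 0, 0]
    c = p2.receive(b)             # [2, 1, 0]
    print(a, b, c)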

Vector clock: Example

• All counters start at [0, 0, 0]
• Applying the local update rule: on P1, a = [1,0,0] and b = [2,0,0]; on P3, e = [0,0,1]
• Applying the message rule: on P2, c = [2,1,0] (receipt of P1's message) and then d = [2,2,0]; on P3, f = [2,2,2]
• The local vector clock piggybacks on inter-process messages (physical time ↓)

Vector clocks can establish causality

• Rule for comparing vector clocks:
  – V(a) = V(b) when ak = bk for all k
  – V(a) < V(b) when ak ≤ bk for all k and V(a) ≠ V(b)
• Concurrency: a || b if ai < bi and aj > bj for some i, j
• V(a) < V(z) when there is a chain of events linked by → between a and z
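A minimal sketch of the comparison rules, using list-based vectors as above:

    def vc_less(a, b):
        """V(a) < V(b): every entry <= and the vectors differ."""
        return all(x <= y for x, y in zip(a, b)) and a != b

    def vc_concurrent(a, b):
        return not vc_less(a, b) and not vc_less(b, a) and a != b

    print(vc_less([1, 0, 0], [2, 2, 0]))        # True: a happens-before d
    print(vc_concurrent([2, 0, 0], [0, 0, 1]))  # True: b and e are concurrent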

VC application: Two events a, z

• Lamport clocks: C(a) < C(z) — conclusion: none
• Vector clocks: V(a) < V(z) — conclusion: a → … → z

Vector clock timestamps tell us about causal event relationships

Causally-ordered bulletin board system

• Distributed bulletin board application
  – Each post → multicast of the post to all other users
• Want: no user to see a reply before the corresponding original message post
• Deliver a message only after all messages that causally precede it have been delivered
  – Otherwise, the user would see a reply to a message they could not find
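One standard way to enforce this delivery rule (not spelled out on the slide, and assuming, as is common in causal broadcast, that each process bumps its vector entry only when it posts) is to hold back a message until it is the sender's next message and everything else it depends on has been delivered; a sketch:

    def can_deliver(msg_vc, local_vc, sender):
        """Deliver a message with vector msg_vc from process `sender` only
        if it is the next message from that sender (entry = ours + 1) and
        we have already delivered everything it causally depends on."""
        return all(
            m == l + 1 if k == sender else m <= l
            for k, (m, l) in enumerate(zip(msg_vc, local_vc))
        )

    # P2 (local vector [0,0,0]) receives the reply m* = (1,1,0) from user 1
    # before the original m = (1,0,0) from user 0: hold m*, deliver m first.
    print(can_deliver([1, 1, 0], [0, 0, 0], sender=1))   # False: buffer it
    print(can_deliver([1, 0, 0], [0, 0, 0], sender=0))   # True: deliver m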

VC application: Causally-ordered bulletin board system

• User 0 posts, user 1 replies to 0's post; user 2 observes

(Diagram: P0 posts the original message m at VC0 = (1,0,0). P1 receives m, so VC1 = (1,1,0), and multicasts its reply m* with vector (1,1,0). P2 starts at VC2 = (0,0,0); if m* arrives first, P2 buffers it, delivers m (VC2 = (1,0,0)), and only then delivers m* (VC2 = (1,1,0)). Physical time →)

Wednesday topic: Primary-Backup Replication
Pre-reading: VMware paper (on class website)

Why global timing?

• Suppose there were an infinitely-precise and globally consistent time standard
• That would be very handy. For example:
  1. Who got the last seat on the airplane?
  2. Mobile cloud gaming: which was first, A shoots B or vice-versa?
  3. Does this file need to be recompiled?

Totally-Ordered Multicast (Attempt #1)

Walkthrough (ack's to self not shown):
• P1 queues $, P2 queues %
• P1 queues and ack's %; P1 marks % fully ack'ed
• P2 marks % fully ack'ed; P2 processes %
• P2 queues and ack's $; P2 processes $
• P1 marks $ fully ack'ed; P1 processes $, then %
• Result: P2 processed % before $, while P1 processed $ before % — the replicas diverge

Totally-Ordered Multicast (Correct version)

Walkthrough (ack's to self not shown):
• P1 queues $, P2 queues %
• P1 queues %
• P2 queues and ack's $
• P2 marks $ fully ack'ed; P2 processes $
• P1 marks $ fully ack'ed; P1 processes $; P1 ack's %
• P1 marks % fully ack'ed; P1 processes %
• P2 marks % fully ack'ed; P2 processes %

Time standards

• Universal Time (UT1)
  – In concept, based on astronomical observation of the sun at 0º longitude
  – Known as "Greenwich Mean Time"
• International Atomic Time (TAI)
  – Beginning of TAI is midnight on January 1, 1958
  – Each second is 9,192,631,770 cycles of radiation emitted by a Cesium atom
  – Has diverged from UT1 due to the slowing of the earth's rotation
• Coordinated Universal Time (UTC)
  – TAI + leap seconds, to be within 0.9 seconds of UT1
  – Currently TAI − UTC = 36 seconds

VC application: Order processing

• Suppose we are running a distributed order-processing system
• Each process = a different user; each event = an order
• A user has seen all orders with V(order) < the user's current vector