Indian Institute of Science, Bangalore, India

Spanner Known “unknown” is better than unknown “unknown” , Inc.

Winner of the Jay Lepreau Best Paper Award at OSDI ’12

Presented by: Swapnil Gandhi, 19th November 2018

Hollywood, CA, USA, 2012

Authors*

Sanjay Ghemawat

“When Jeff gave a seminar at Stanford, it was so crowded that Donald Knuth had to sit on the floor.”

*and there are 24 more…

20-Nov-18 2

Applications at Google Scale

. Fast and Performant
  ‣ Low Latency
  ‣ High Throughput
. Global Scale
  ‣ Fault Tolerant
  ‣ Highly Available
. Seamless Scaling
  ‣ Automatic Rebalance of Compute
  ‣ Horizontal Scale-out
. Simple Back-Integration
  ‣ Integrates easily with existing products
  ‣ Multi-Language Support

Why throw in the Spanner? (1/2)

Why not MySQL?

. Relational Database
  + Supports ACID transactions
  + Supports indices and query optimization
  + Can run Plan Bouquet
  - Needs master-slave replication
  - Manual re-sharding

Used in: F1 Ad Database

Why throw in the Spanner? (2/2)

Why not ?

. Key-Value Store
  -- Eventually consistent
  - Lacks support for cross-row transactions
  ++ High write throughput

Why not Megastore?

. Semi-relational data model
  ++ Replicated synchronously
  + Supports ACID transactions
  -- Poor write throughput

Used in (across the two systems): Android Market, Personalized Search, Calendar

Why Spanner?

. Globally distributed multi-version database
  ‣ General-purpose transactions (ACID)
  ‣ SQL-based query language
  ‣ Schematized
  ‣ Semi-relational data model
. Lock-free distributed read-only transactions
. Applications control replication and placement
. Supports external consistency
. Scales out horizontally
  ‣ No need to shard data manually

What is External Consistency?

. Serializability ensures that transactions are executed in a manner that’s indistinguishable from a system in which they are executed serially.

. External consistency ensures that this serial order is consistent with the order in which transactions are observed to commit with respect to a global clock.

Why is External Consistency desired?

. A user sends an e-mail using the Gmail web app.
. The user then immediately views “Sent” to double-check what they wrote.
. Without external consistency, the app’s request may go to a different replica that is behind on state changes.
. This may lead to:
  ‣ Confusion
  ‣ Reduced user experience

Spanner Software Stack

Spanner Server Organization

Spanner Universe (a zone can be thought of as the unit of replication)

Spanserver stack

(key:string, timestamp:int64) -> string

Directory

A directory is a range-partitioned portion of the row space and can be thought of as the unit of data placement.
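The `(key:string, timestamp:int64) -> string` mapping and the idea of a directory as a range of keys can be illustrated with a small in-memory sketch. This is a toy under stated assumptions: `MultiVersionStore`, `directory_of` and the list-of-split-keys representation are hypothetical names chosen for illustration, not Spanner's implementation.

```python
import bisect
from collections import defaultdict


class MultiVersionStore:
    """Toy model of (key: str, timestamp: int) -> str. Hypothetical, for illustration."""

    def __init__(self):
        # For each key: parallel lists of timestamps and values, sorted by timestamp.
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, key, timestamp, value):
        # Insert the new version at its sorted position so reads can binary-search.
        idx = bisect.bisect_right(self._timestamps[key], timestamp)
        self._timestamps[key].insert(idx, timestamp)
        self._values[key].insert(idx, value)

    def read(self, key, timestamp):
        # Return the newest version written at or before `timestamp`, or None.
        idx = bisect.bisect_right(self._timestamps[key], timestamp)
        return self._values[key][idx - 1] if idx > 0 else None


def directory_of(key, boundaries):
    """Range partitioning: index of the directory whose key range holds `key`.

    `boundaries` is a sorted list of split keys (an assumed representation);
    directory i covers boundaries[i-1] <= key < boundaries[i].
    """
    return bisect.bisect_right(boundaries, key)


store = MultiVersionStore()
store.write("user1/subject", 10, "hello")
store.write("user1/subject", 20, "hello again")
```

Reading at timestamp 15 sees only the version written at 10, which is the property that lets old snapshots be read without blocking newer writes.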

Data Model

Prone to "garbage in, garbage out"

TrueTime

Spanner knows what time it is!

Timestamp and Global Clock

. [Assumption] A global clock is available
. Strict two-phase locking for write transactions
. Assign timestamps while locks are held

[Figure: transaction T on data split D1 picks s = now() between "Acquired locks" and "Release locks"]

External Consistency

If timestamp order is equivalent to commit order, we can claim:

“Timestamp order respects global wall-clock time, and thus ordering transactions by timestamp ensures external consistency.”

Timestamp Invariant

Transactions T1 and T2 overlap in time and on a data split (both on D1, each between "Acquired locks" and "Release locks").

Transactions T3 and T4 do not overlap in time or on a data split (T3 on D2, T4 on D3).

Is Synchronizing Time at Global Scale possible?

Distributed systems dogma:

. Synchronizing time within and between datacenters is extremely hard and uncertain.

. Efficient Serialization of requests is impossible at global scale

TrueTime API

“Global wall-clock time” with bounded uncertainty

[Figure: a time axis with an interval from Earliest to Latest of width 2*ε that contains the absolute time t_abs]

API:

Method         Returns
TT.now()       TTInterval: [earliest, latest]
TT.after(t)    True if t has definitely passed
TT.before(t)   True if t has definitely not arrived
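The three methods in the table can be mimicked in a few lines. The sketch below is only an illustration: it assumes a fixed uncertainty `epsilon` and an injectable local clock, both of which are simplifications introduced here, and the class names are hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: a local clock plus a fixed uncertainty bound (assumed)."""

    def __init__(self, epsilon=0.004, clock=time.time):
        self.epsilon = epsilon  # assumed half-width of the interval, in seconds
        self._clock = clock     # injectable clock so the sketch can be tested

    def now(self):
        t = self._clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True only if t is earlier than every instant in the current interval.
        return t < self.now().earliest

    def before(self, t):
        # True only if t is later than every instant in the current interval.
        return t > self.now().latest


tt = TrueTime(epsilon=0.004, clock=lambda: 100.0)
interval = tt.now()
```

For a timestamp inside the interval, both `after` and `before` return False: the caller knows only that the time is uncertain, which is the "known unknown".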

TrueTime Architecture

[Figure: Datacenter 1, Datacenter 2, …, Datacenter N each contain GPS clocks and atomic clocks; a client syncs with them every 30 secs]

TrueTime Implementation

now = reference now + local-clock offset

ε = reference ε + worst-case local-clock drift

[Assumption] Worst-case local-clock drift = 200 μs/sec; synchronization time = 50 μs

[Figure: ε plotted against time (0, 30, 60, 90 sec) as a sawtooth. After each sync ε returns to the reference uncertainty, then grows by up to +6 ms (200 μs/sec × 30 sec) before the next sync]
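The sawtooth follows directly from the two formulas: between syncs, the bound grows linearly at the worst-case drift rate. A minimal sketch, assuming a placeholder reference uncertainty of 1000 μs (that value is an assumption for illustration; the drift rate and the 30-second interval are the slide's figures):

```python
def epsilon_us(seconds_since_start, reference_epsilon_us=1000.0,
               drift_us_per_sec=200.0, sync_interval_sec=30.0):
    """Uncertainty bound in microseconds at a given time.

    reference_epsilon_us is an assumed placeholder; drift_us_per_sec and
    sync_interval_sec are the values quoted on the slide.
    """
    # Drift accumulates only since the most recent synchronization.
    since_last_sync = seconds_since_start % sync_interval_sec
    return reference_epsilon_us + drift_us_per_sec * since_last_sync
```

Just before a sync the drift term approaches 200 μs/sec × 30 sec = 6000 μs, which is the +6 ms on the slide.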

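Timestamp and TrueTime

[Figure: timeline of transaction T between "Acquired locks" and "Release locks", with the TrueTime interval marked from Earliest to Latest]

. Call the TrueTime API and pick s = TT.now().latest
. Commit wait: wait until TT.now().earliest > s
. Two segments on the timeline are each labelled "average ε"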
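The commit-wait rule can be sketched with a simulated clock. Everything here is an illustrative assumption: `FakeTrueTime` is a hypothetical stand-in driven by manual ticks, time is kept in integer microseconds, and `apply_writes` stands for whatever makes the writes visible and releases locks.

```python
class FakeTrueTime:
    """Hypothetical stand-in for TrueTime, driven by a manually advanced clock."""

    def __init__(self, epsilon_us):
        self.epsilon_us = epsilon_us
        self.t_us = 0  # simulated absolute time in microseconds

    def now(self):
        # Returns (earliest, latest).
        return (self.t_us - self.epsilon_us, self.t_us + self.epsilon_us)

    def advance(self, dt_us):
        self.t_us += dt_us


def commit(tt, apply_writes, tick_us=1000):
    """Pick a commit timestamp, commit-wait, then make the writes visible."""
    _, s = tt.now()                  # s = TT.now().latest
    waited_us = 0
    while not tt.now()[0] > s:       # wait until TT.now().earliest > s
        tt.advance(tick_us)          # a real system would sleep here
        waited_us += tick_us
    apply_writes(s)                  # only now expose writes / release locks
    return s, waited_us


tt = FakeTrueTime(epsilon_us=4000)
applied = []
s, waited_us = commit(tt, applied.append)
```

With ε = 4 ms the simulated wait comes out just over 2ε, consistent with the two "average ε" segments on the timeline.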

Invariants

. Monotonicity Invariant
  ‣ Spanner assigns timestamps to Paxos writes in monotonically increasing order.
. Disjointness Invariant
  ‣ A leader must only assign timestamps within the interval of its lease.
. External Consistency Invariant
  ‣ $t_{abs}(e_1^{commit}) < t_{abs}(e_2^{start}) \Rightarrow s_1 < s_2$
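The external consistency invariant can be checked mechanically on a recorded history. The sketch below assumes each transaction is described by a dict with hypothetical keys `name`, `start`, `commit` (absolute times of the start and commit events) and `s` (the assigned timestamp); it is a checker for illustration, not part of Spanner.

```python
from itertools import permutations


def violations(transactions):
    """Return (a, b) name pairs where a committed before b started, yet s_a >= s_b."""
    return [
        (a["name"], b["name"])
        for a, b in permutations(transactions, 2)
        if a["commit"] < b["start"] and not a["s"] < b["s"]
    ]


good = [
    {"name": "T1", "start": 0, "commit": 5, "s": 4},
    {"name": "T2", "start": 6, "commit": 9, "s": 8},
]
bad = [
    {"name": "T1", "start": 0, "commit": 5, "s": 4},
    {"name": "T2", "start": 6, "commit": 9, "s": 3},  # timestamp goes backwards
]
```

Transactions that overlap in real time are never reported, since the invariant only constrains pairs where one commits before the other starts.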

Types of Reads and Writes

A read-only transaction must be pre-declared as not having any writes.

Schema

Recipient ID | Sender ID | Subject | Message | is_deleted | is_spam | Time Stamp
Assad | Trump | Check my twitter feed | … | 0 | NO | …
 | Putin | Don’t you worry child | … | 0 | NO | …
Gandhi | Trump | Climate change is overrated | … | 0 | YES | …
 | Modi | Mittro … | … | 1 | YES | …
Rocketman | Trump | We have shiny rockets! | … | 0 | NO | …
Mueller | Trump | You’re fired! | … | 0 | NO | …

Consistent Reads

Stale Reads (1/2)

The Python example on this slide performs a stale read using a 15-second bounded-staleness timestamp.

Stale Reads (2/2)

[Figure: Zone 1, Zone 2 and Zone 3 each hold replicas of Split 1, Split 2 and Split 3; one replica is the leader and the others are slaves. A request reaches a slave, which asks itself "Do I have up-to-date data? (max 15 sec old data)", answers "yes!" and returns the response]

Request Timestamp ≤ Tsafe
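The condition `Request Timestamp ≤ Tsafe` is what lets a non-leader replica answer on its own. A minimal sketch of that check, assuming a hypothetical `Replica` class whose `t_safe` simply tracks the last applied write, and a read timestamp fixed at exactly `now - staleness` (a simplification of bounded staleness):

```python
class Replica:
    """Toy replica; names and structure are assumptions for illustration."""

    def __init__(self):
        self.versions = {}  # key -> list of (timestamp, value) in apply order
        self.t_safe = 0     # highest timestamp up to which this replica is current

    def apply(self, key, timestamp, value):
        # Writes arrive in increasing timestamp order in this sketch.
        self.versions.setdefault(key, []).append((timestamp, value))
        self.t_safe = timestamp

    def can_serve(self, read_ts):
        return read_ts <= self.t_safe  # Request Timestamp <= Tsafe

    def read(self, key, read_ts):
        if not self.can_serve(read_ts):
            raise RuntimeError("replica is not caught up to the read timestamp")
        older = [v for ts, v in self.versions.get(key, []) if ts <= read_ts]
        return older[-1] if older else None


def stale_read(replica, key, now, staleness=15):
    # Read at a timestamp `staleness` seconds in the past.
    return replica.read(key, now - staleness)


replica = Replica()
replica.apply("k", 100, "a")
replica.apply("k", 110, "b")
```

Here `stale_read(replica, "k", now=120)` reads at timestamp 105, which the replica can serve locally because 105 ≤ 110.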

Strong Reads (1/3)

The Python example on this slide performs a strong read from a Spanner database.

Strong Reads (2/3)

[Figure: Zone 1, Zone 2 and Zone 3 each hold replicas of Split 1, Split 2 and Split 3. A request reaches a slave, which asks the leader "Is this the latest data?"; the leader answers "Yep!" and the slave returns the response]

Strong Reads (3/3)

[Figure: same setup. The leader answers "Nope! Wait", so the response is blocked while the slave waits]
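The two strong-read cases ("Yep!" and "Nope! Wait") can be sketched as a single loop: the replica answers only once it has applied everything up to the leader's latest commit. The data shapes below (a dict for the replica, a deque of pending writes standing in for replication traffic) are assumptions made for this illustration.

```python
from collections import deque


def strong_read(key, replica, leader_last_commit_ts, replication_log):
    """Serve `key` from `replica` only once it holds the leader's latest data.

    replica: dict with 'applied_ts' (int) and 'data' (dict); an assumed shape.
    replication_log: deque of (timestamp, key, value) writes not yet applied.
    Returns (value, number_of_writes_waited_for).
    """
    waited = 0
    # "Is this the latest data?" -> "Nope! Wait": catch up before answering.
    while replica["applied_ts"] < leader_last_commit_ts:
        ts, k, v = replication_log.popleft()  # stands in for blocking on replication
        replica["data"][k] = v
        replica["applied_ts"] = ts
        waited += 1
    # "Yep!": the replica is as fresh as the leader, so the read is served.
    return replica["data"].get(key), waited


replica = {"applied_ts": 10, "data": {"k": "old"}}
log = deque([(11, "x", "1"), (12, "k", "new")])
first = strong_read("k", replica, 12, log)
second = strong_read("k", replica, 12, deque())
```

The first call has to wait for two writes before answering; the second returns immediately because the replica is already current.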

Consistent Writes

Read-Write Transaction

The Python example on this slide performs a read-write transaction on a Spanner database.

Transaction within Paxos Group

[Figure: three Paxos groups (Paxos 1, Paxos 2, Paxos 3) with leaders Leader1, Leader2 and Leader3, each over replicas of Split 1, Split 2 and Split 3. Labelled steps on a single leader: 1. TXN query; 2. acquire locks; 3. query results; 4. buffer writes; commit; acquire locks; call TrueTime API; commit wait; 5. apply writes and release locks; 6. TXN success]

Transaction across Paxos Group

[Figure: same setup, with the transaction spanning all three Paxos groups. Recoverable labels: 1. TXN query; 2. acquire locks; 3. query results; 4. buffer writes; commit; prepare and prepare ACK between leaders; each leader calls the TrueTime API, acquires locks, applies writes and releases locks; commit wait; write ACK; 6. TXN success]

Evaluation

Microbenchmarks

All operations performed were standalone reads and standalone writes.

Two-phase commit scalability

All write operations were performed across 3 zones.

Effect of Leader failure on throughput

Network Induced Uncertainty

Key Takeaways

. Spanner is the first globally distributed database to offer strong consistency guarantees
  ‣ Always use strong reads
  ‣ If latency makes strong reads infeasible, use reads with bounded staleness to improve performance
. TrueTime API is the secret sauce
. Stronger semantics are achievable: global scale != weaker semantics
. In case of a local clock malfunction, strong consistency guarantees are invalidated.
. Write latency is lower-bounded by the commit wait, which is about 2ε on average.

Questions!

Backup Slides

Just ignore these!

Safe Time

$T_{safe} = \min(T_{safe}^{Paxos}, T_{safe}^{TM})$

$T_{safe}^{Paxos}$ = timestamp of the highest applied Paxos write

$T_{safe}^{TM} = \begin{cases} \infty & \text{when zero transactions are in the pending state} \\ \min_i(s_{i,g}^{prepare}) - 1 & \text{when there are transactions in the pending state} \end{cases}$
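The safe-time definition translates almost line for line into code. A minimal sketch, assuming integer timestamps and a hypothetical function name `t_safe`; the list `prepare_timestamps` stands for the prepare timestamps of transactions currently in the pending state at this group.

```python
import math


def t_safe(highest_applied_paxos_ts, prepare_timestamps):
    """Safe time of a replica, following the slide's two-part definition."""
    # Paxos part: timestamp of the highest applied Paxos write.
    t_paxos = highest_applied_paxos_ts
    # Transaction-manager part: infinity with no pending transactions,
    # otherwise just below the smallest prepare timestamp.
    t_tm = min(prepare_timestamps) - 1 if prepare_timestamps else math.inf
    return min(t_paxos, t_tm)
```

A pending transaction prepared at timestamp 90 caps the safe time at 89 even if Paxos writes up to 100 have been applied, because that transaction may still commit at any timestamp from 90 onward.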

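Backup Slide (1/2)

[Figure: timeline of transaction T between "Acquired locks" and "Release locks". T calls the TrueTime API and picks s = TT.now().latest, runs consensus ("Start Consensus" to "End Consensus"), and commit-waits until TT.now().earliest > s. Two segments on the timeline are each labelled "average ε"]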

Backup Slide (2/2)

[Figure: two-phase commit timeline with coordinator TC and participants TP1 and TP2, each between "Acquired locks" and "Release locks". Participants log ("Start logging" to "Done logging"), pick si = TT.now().latest, reach "Prepared" and send si. The coordinator computes the overall s, completes its commit wait ("Commit wait done"), reaches "Committed" and notifies the participants of s]
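How the coordinator might combine the participants' prepare timestamps into one overall `s` can be sketched as below. The constraints encoded in `choose_commit_timestamp` (at least every `si`, at least the coordinator's own TT.now().latest, and above anything it assigned earlier) are assumptions of this sketch in the spirit of the slide; all function names and dict keys are hypothetical.

```python
def choose_commit_timestamp(prepare_timestamps, coordinator_tt_latest, last_assigned):
    """Overall s for a two-phase commit (assumed rule, see lead-in)."""
    return max(max(prepare_timestamps), coordinator_tt_latest, last_assigned + 1)


def two_phase_commit(participants, coordinator_tt_latest, last_assigned=0):
    # Phase 1: each participant has prepared and reports its prepare timestamp si.
    prepare_ts = [p["prepare_ts"] for p in participants]
    s = choose_commit_timestamp(prepare_ts, coordinator_tt_latest, last_assigned)
    # The coordinator's commit wait on s would happen here.
    # Phase 2: notify every participant of the overall s.
    for p in participants:
        p["commit_ts"] = s
    return s


participants = [{"prepare_ts": 105}, {"prepare_ts": 110}]
s = two_phase_commit(participants, coordinator_tt_latest=108, last_assigned=100)
```

Every participant ends up applying the transaction at the same timestamp, so the transaction appears at a single point in the global order even though it spans several Paxos groups.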