Indian Institute of Science Bangalore, India भारतीय विज्ञान संथान बंगलौर, भारत
Spanner Known “unknown” is better than unknown “unknown” Google, Inc.
Winner of Jay Lepreau Best Paper Award at OSDI’ 12
Presented by: Swapnil Gandhi, 19th November 2018
Hollywood, CA USA 2012 Authors*
Jeff Dean, Sanjay Ghemawat
“When Jeff gave a seminar at Stanford, it was so crowded that Donald Knuth had to sit on the floor.”
*and there are 24 more…
20-Nov-18 2
Applications at Google Scale
. Fast and Performant ‣ Low Latency ‣ High Throughput
. Global Scale ‣ Fault Tolerant ‣ Highly Available
. Seamless Scaling ‣ Automatic Rebalance of Compute ‣ Horizontal Scale-out
. Simple Back-Integration ‣ Integrates easily with existing products ‣ Multi-Language Support
Why throw in the Spanner? (1/2)
Why not MySQL?
. Relational Database
+ Supports ACID Transactions
+ Supports Indices and Query Optimization
+ Can run Plan Bouquet
- Needs Master-Slave replication
- Manual Re-sharding
Used in: F1 Ad Database
Why throw in the Spanner? (2/2)
Why not BigTable?
. Key-Value Store
-- Eventually Consistent
- Lacks support for cross-row transactions
++ High Write Throughput
Used in: Google Earth, Google Analytics, Personalized Search
Why not Megastore?
. Semi-relational data model
++ Replicated synchronously
+ Supports ACID Transactions
-- Poor Write Throughput
Used in: Gmail, Picasa, Android Market, Calendar
Why Spanner?
. Globally Distributed Multi-version Database ‣ General-purpose transactions (ACID) ‣ SQL-based query language ‣ Schematized tables ‣ Semi-relational data model
. Lock-free distributed read-only transactions
. Applications control Replication and Placement
. Supports External Consistency
. Scales out horizontally ‣ No need to shard data manually
What is External Consistency?
. Serializability ensures that transactions are executed in a manner that’s indistinguishable from a system in which they are executed serially.
. External Consistency ensures that this serial order is consistent with the order in which transactions are observed to commit with respect to a global clock: if T1 commits before T2 starts, T1 is ordered before T2.
Why is External Consistency desired?
. A user sends an e-mail using the Gmail web app.
. The user then immediately views “sent messages” to double-check what they wrote.
. Without external consistency, the app’s request may go to a different replica that is behind on state changes.
. This may lead to: ‣ Confusion ‣ Degraded user experience
Spanner Software Stack
Spanner Server Organization
Spanner Universe (a zone can be thought of as the unit of replication)
Spanserver stack
(key:string, timestamp:int64) -> string
Directory
A directory is a range-partitioned slice of the row space and can be thought of as the unit of data placement.
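The tablet mapping `(key:string, timestamp:int64) -> string` from the Spanserver stack slide, together with the directory idea, can be illustrated with a small in-memory model. This is only a sketch: the class name `VersionedStore`, its methods, and the choice to model a directory as the set of keys sharing a prefix are assumptions made for illustration, not Spanner's implementation.

```python
import bisect
from collections import defaultdict


class VersionedStore:
    """Toy in-memory model of (key:string, timestamp:int64) -> string."""

    def __init__(self):
        # key -> sorted list of timestamps at which the key was written
        self._timestamps = defaultdict(list)
        # (key, timestamp) -> value
        self._values = {}

    def write(self, key: str, timestamp: int, value: str) -> None:
        """Store a new version of `key` at `timestamp`."""
        if (key, timestamp) not in self._values:
            bisect.insort(self._timestamps[key], timestamp)
        self._values[(key, timestamp)] = value

    def read(self, key: str, timestamp: int):
        """Return the newest value written at or before `timestamp`, or None."""
        versions = self._timestamps.get(key, [])
        idx = bisect.bisect_right(versions, timestamp)
        if idx == 0:
            return None
        return self._values[(key, versions[idx - 1])]

    def directory(self, prefix: str):
        """Keys sharing a prefix: a hypothetical stand-in for one directory."""
        return sorted(k for k in self._timestamps if k.startswith(prefix))


store = VersionedStore()
store.write("users/1/name", 10, "alice")
store.write("users/1/name", 20, "alicia")
print(store.read("users/1/name", 15))
```

Keeping every version, rather than overwriting, is what lets a read at an older timestamp return the value that was current at that time.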
Data Model
Prone to "garbage in, garbage out"
TrueTime
Spanner knows what time it is!
Timestamp and Global Clock
. [Assumption] A global clock is available
. Strict two-phase locking for write transactions
. Assign timestamps while locks are held
[Figure: timeline of transaction T on D1: acquire locks, pick s = now(), release locks]
External Consistency
If timestamp order is equivalent to commit order, we can claim:
“Timestamp order respects global wall-clock time, and thus ordering transactions by timestamp ensures external consistency.”
Timestamp Invariant
Transactions T1 and T2 overlap on time and data split
[Figure: T1 and T2 on D1, each shown as an interval from acquiring locks to releasing locks]
Transactions T3 and T4 do not overlap on time and data split
[Figure: T3 on D2 and T4 on D3]
Is Synchronizing Time at Global Scale possible?
Distributed systems dogma:
. Synchronizing time within and between datacenters is extremely hard and uncertain.
. Efficient serialization of requests is impossible at global scale.
TrueTime API
“Global wall-clock time” with bounded uncertainty
[Figure: time axis with an interval of width 2*ε from Earliest to Latest that contains the absolute time t_abs]
API:
Method        Returns
TT.now()      TTInterval: [earliest, latest]
TT.after(t)   True if t has definitely passed
TT.before(t)  True if t has definitely not arrived
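The three TrueTime methods listed above can be sketched in a few lines. This is a minimal illustration, not Google's implementation: the class layout, the injectable `clock` parameter, and the default `epsilon` of 4 ms are assumptions chosen so the example runs on an ordinary machine.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: a local clock plus an assumed uncertainty bound epsilon."""

    def __init__(self, epsilon: float = 0.004, clock=time.time):
        self.epsilon = epsilon  # hypothetical bound, in seconds
        self._clock = clock     # injectable so the behaviour can be checked

    def now(self) -> TTInterval:
        """Interval of width 2*epsilon assumed to contain absolute time."""
        t = self._clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        """True if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True if t has definitely not arrived."""
        return self.now().latest < t


tt = TrueTime(epsilon=0.5, clock=lambda: 100.0)
print(tt.now(), tt.after(99.0), tt.before(101.0))
```

Note that `after(t)` and `before(t)` can both be False for the same `t`: that is exactly the case where `t` falls inside the uncertainty interval.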
TrueTime Architecture
[Figure: Datacenter 1, Datacenter 2, … Datacenter N, each with GPS clocks and atomic clocks; the client syncs every 30 secs]
TrueTime Implementation
now = Reference now + Local-clock Offset
ε = Reference ε + Worst-case local-clock drift
[Assumption] Worst-case local-clock drift = 200 μs/sec
Synchronization Time = 50 μs
[Figure: ε versus time (0 sec, 30 sec, 60 sec, 90 sec), rising from the reference uncertainty by +6 ms between synchronizations]
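The +6 ms in the figure follows directly from the assumed drift rate and the 30-second sync interval: 200 μs/sec × 30 sec = 6 ms. The sketch below reproduces that arithmetic; the function names and the 1 ms reference uncertainty used in the example call are illustrative assumptions, not values taken from the slides.

```python
DRIFT_RATE_US_PER_SEC = 200  # worst-case local-clock drift assumed on the slide
SYNC_INTERVAL_SEC = 30       # clients sync every 30 seconds


def epsilon_ms(seconds_since_sync: float, reference_epsilon_ms: float) -> float:
    """Uncertainty = reference uncertainty + drift accumulated since last sync."""
    drift_ms = DRIFT_RATE_US_PER_SEC * seconds_since_sync / 1000.0
    return reference_epsilon_ms + drift_ms


def epsilon_at(t_sec: float, reference_epsilon_ms: float) -> float:
    """Uncertainty at wall time t_sec, assuming syncs at 0, 30, 60, ... seconds."""
    return epsilon_ms(t_sec % SYNC_INTERVAL_SEC, reference_epsilon_ms)


# Drift accumulated over one full sync interval, with no reference uncertainty.
print(epsilon_ms(SYNC_INTERVAL_SEC, 0.0))
# Hypothetical 1 ms reference uncertainty, sampled 15 s after a sync.
print(epsilon_at(45, 1.0))
```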
Timestamp and TrueTime
[Figure: timeline of transaction T between acquiring and releasing locks: call the TrueTime API and pick s = TT.now().latest, then wait until TT.now().earliest > s (commit wait); each of the two phases takes ε on average]
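The commit-wait rule on this slide (pick s = TT.now().latest, then wait until TT.now().earliest > s) can be sketched with a simulated clock. Everything here is an illustrative assumption: `SimClock`, `tt_now`, and `commit_with_wait` are hypothetical names, time is kept in integer microseconds to avoid floating-point noise, and the caller is assumed to hold the transaction's locks for the whole call.

```python
class SimClock:
    """Deterministic clock (integer microseconds) so no real waiting is needed."""

    def __init__(self, start_us: int):
        self.t_us = start_us

    def now_us(self) -> int:
        return self.t_us

    def sleep_us(self, dt_us: int) -> None:
        self.t_us += dt_us


def tt_now(clock: SimClock, epsilon_us: int):
    """Toy TT.now(): (earliest, latest) around the local clock reading."""
    t = clock.now_us()
    return t - epsilon_us, t + epsilon_us


def commit_with_wait(clock: SimClock, epsilon_us: int, poll_us: int = 1_000) -> int:
    """Pick s = TT.now().latest, then block until TT.now().earliest > s."""
    _, latest = tt_now(clock, epsilon_us)
    s = latest  # commit timestamp, chosen while locks are held
    while tt_now(clock, epsilon_us)[0] <= s:  # i.e. TT.after(s) is still False
        clock.sleep_us(poll_us)
    return s  # only now is it safe to release locks and report the commit


clock = SimClock(start_us=100_000_000)
s = commit_with_wait(clock, epsilon_us=4_000)
print(s, clock.now_us() - 100_000_000)
```

With a constant ε the wait is a little over 2ε, which matches the two "average ε" spans in the figure.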
Invariants
. Monotonicity Invariant ‣ Spanner assigns timestamps to Paxos writes in monotonically increasing order.
. Disjointness Invariant ‣ A leader must only assign timestamps within the interval of its lease.
. External Consistency Invariant ‣ $t_{abs}(e_1^{commit}) < t_{abs}(e_2^{start}) \Rightarrow s_1 < s_2$
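The external consistency invariant can be checked mechanically against a recorded history. The sketch below is a hypothetical checker, not part of Spanner: the `Txn` record and the `violations` helper are names introduced for illustration, and absolute times are assumed to be known to the checker (which is only possible in a test harness).

```python
from itertools import permutations
from typing import NamedTuple


class Txn(NamedTuple):
    name: str
    start_abs: int   # absolute time of the start event
    commit_abs: int  # absolute time of the commit event
    s: int           # commit timestamp assigned by the system


def violations(txns):
    """Pairs (T1, T2) where T1 committed before T2 started but s1 >= s2."""
    return [
        (a.name, b.name)
        for a, b in permutations(txns, 2)
        if a.commit_abs < b.start_abs and not a.s < b.s
    ]


history = [Txn("T1", 0, 10, 8), Txn("T2", 12, 20, 15)]
print(violations(history))
```

Transactions that overlap in time place no constraint on each other, so the checker only looks at pairs where one commit strictly precedes the other start.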
Types of Reads and Writes
A read-only transaction must be pre-declared as not having any writes.
Schema
Recipient ID   Sender ID   Subject                     Message   is_deleted   is_spam   Time Stamp
------------   ---------   -------------------------   -------   ----------   -------   ----------
Assad          Trump       Check my twitter feed       …         0            NO        …
               Putin       Don’t you worry child       …         0            NO        …
Gandhi         Trump       Climate change is over-     …         0            YES       …
                           rated
               Modi        Mittro …                    …         1            YES       …
Rocketman      Trump       We have shiny rockets!      …         0            NO        …
Mueller        Trump       You’re fired!               …         0            NO        …
Consistent Reads
Stale Reads (1/2)
[Code listing: Python code that performs a stale read using a 15 sec bounded-staleness timestamp]
Stale Reads (2/2)
[Figure: Zones 1, 2 and 3, each holding replicas of Splits 1, 2 and 3 (one leader, two slaves). The request goes to a slave, which checks “Do I have up-to-date data?” (max 15 sec old data); the answer is “yes!” and the slave returns the response.]
Request Timestamp ≤ Tsafe
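The condition "Request Timestamp ≤ Tsafe" is what lets a slave replica answer a stale read locally. The sketch below simulates that check in memory; it is not the Cloud Spanner client API. `Replica`, `ReplicaBehind`, and `bounded_stale_read` are hypothetical names, timestamps are plain integers in seconds, and picking the read timestamp as `min(now, t_safe)` is a simplification of bounded staleness made for illustration.

```python
class ReplicaBehind(Exception):
    """Raised when a replica cannot serve the requested timestamp."""


class Replica:
    def __init__(self, t_safe: int, versions):
        self.t_safe = t_safe      # replica is up to date through this timestamp
        self.versions = versions  # list of (timestamp, value), sorted by timestamp

    def read_at(self, read_ts: int):
        """Serve the read only if read_ts <= t_safe."""
        if read_ts > self.t_safe:
            raise ReplicaBehind(f"read_ts={read_ts} > t_safe={self.t_safe}")
        value = None
        for ts, v in self.versions:
            if ts <= read_ts:
                value = v  # newest version at or before read_ts
        return value


def bounded_stale_read(replica: Replica, now: int, max_staleness: int):
    """Read at the freshest timestamp the replica can serve within the bound."""
    read_ts = min(now, replica.t_safe)
    if read_ts < now - max_staleness:
        raise ReplicaBehind("replica is staler than the allowed bound")
    return read_ts, replica.read_at(read_ts)


replica = Replica(t_safe=95, versions=[(80, "draft"), (90, "sent")])
print(bounded_stale_read(replica, now=100, max_staleness=15))
```

A replica that fails the check would have to wait or the request would go elsewhere; this sketch simply raises so the two cases are easy to tell apart.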
Strong Reads (1/3)
[Code listing: Python code that performs a strong read from a Spanner database]
Strong Reads (2/3)
[Figure: Zones 1, 2 and 3, each holding replicas of Splits 1, 2 and 3 (one leader, two slaves). The request goes to a slave, which asks the leader “Is this the latest data?”; the leader answers “Yep!” and the slave returns the response.]
Strong Reads (3/3)
[Figure: same setup. The leader answers “Nope!”, so the slave must wait and the response is blocked.]
Consistent Writes
Read-Write Transaction
[Code listing: Python code that performs a read-write transaction on a Spanner database]
Transaction within Paxos Group
[Figure: Paxos groups 1, 2 and 3 with Leader1, Leader2 and Leader3, each holding Splits 1, 2 and 3. Steps: 1. TXN query, 2. acquire locks, 3. query results, 4. buffer writes, 5. apply writes and release locks, 6. TXN success. Additional labels: TXN commit, call TrueTime API, commit wait.]
Transaction across Paxos Group
[Figure: same three Paxos groups. Labels: TXN query, acquire locks, query results, buffer writes, TXN commit, prepare and prepare ACK, commit wait, write and write ACK, apply writes and release locks on every leader, TXN success. Every leader calls the TrueTime API.]
Evaluation
Microbenchmarks
All operations performed were standalone reads and standalone writes
Two-phase commit scalability
All write operations performed across 3 zones
Effect of Leader failure on throughput
Network Induced Uncertainty
Key Takeaways
. Spanner is the first globally distributed database to offer strong consistency guarantees ‣ Always use strong reads ‣ If latency makes strong reads infeasible, use reads with bounded staleness to improve performance
. TrueTime API is the secret sauce
. Stronger semantics are achievable: global scale != weaker semantics
. In case of a local clock malfunction, strong consistency guarantees are invalidated.
. Write latency is lower-bounded by ε.
Questions!
Backup Slides
Just ignore these!
Safe Time
$T_{safe} = \min(T_{safe}^{Paxos}, T_{safe}^{TM})$
$T_{safe}^{Paxos}$ = timestamp of the highest applied Paxos write
$T_{safe}^{TM} = \infty$ when there are zero transactions in pending state
$T_{safe}^{TM} = \min_i(s_{i,g}^{prepare}) - 1$ when there are transactions in pending state
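The safe-time definition above translates almost line for line into code. The sketch below assumes integer timestamps and uses hypothetical function names; returning negative infinity when no Paxos write has been applied yet is an assumption made so the function is total, not something stated on the slide.

```python
import math


def t_safe_paxos(applied_write_timestamps):
    """Timestamp of the highest applied Paxos write (-inf if none: assumption)."""
    return max(applied_write_timestamps, default=-math.inf)


def t_safe_tm(prepare_timestamps):
    """Infinity with no pending transactions, else min prepare timestamp - 1."""
    if not prepare_timestamps:
        return math.inf
    return min(prepare_timestamps) - 1


def t_safe(applied_write_timestamps, prepare_timestamps):
    """Replica can serve any read whose timestamp is <= this value."""
    return min(t_safe_paxos(applied_write_timestamps), t_safe_tm(prepare_timestamps))


print(t_safe([10, 20, 30], []))        # no pending transactions
print(t_safe([10, 20, 30], [25, 40]))  # a transaction prepared at 25 is pending
```

A prepared but uncommitted transaction may still commit at or after its prepare timestamp, which is why it caps the safe time just below that value.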
Backup Slide (1/2)
[Figure: timeline of transaction T between acquiring and releasing locks, with consensus starting and ending inside the locked interval: call the TrueTime API and pick s = TT.now().latest, then wait until TT.now().earliest > s (commit wait); each of the two phases takes ε on average]
Backup Slide (2/2)
[Figure: two-phase commit timeline for coordinator T_C and participants T_P1 and T_P2, each shown from acquiring locks to releasing locks. Participants: start logging, done logging, prepared, send s_i (s_i = TT.now().latest). Coordinator: compute overall s, commit wait done, committed, notify participants of s.]
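The "compute overall s" step can be sketched as follows. The selection rules used here come from the Spanner paper's description of two-phase commit (s is at least every prepare timestamp, greater than TT.now().latest when the coordinator received the commit request, and greater than any timestamp the coordinator assigned earlier); the function names, parameters, and integer timestamps are illustrative assumptions.

```python
def choose_commit_timestamp(prepare_timestamps, coordinator_now_latest, last_assigned):
    """Overall commit timestamp s chosen by the coordinator leader."""
    return max(
        max(prepare_timestamps, default=0),  # s >= every participant's s_i
        coordinator_now_latest + 1,          # s > TT.now().latest at the coordinator
        last_assigned + 1,                   # s > any earlier assigned timestamp
    )


def commit_wait_done(s, now_earliest):
    """Commit wait ends once TT.now().earliest > s, i.e. s has definitely passed."""
    return now_earliest > s


s = choose_commit_timestamp([105, 110], coordinator_now_latest=108, last_assigned=100)
print(s, commit_wait_done(s, now_earliest=110), commit_wait_done(s, now_earliest=111))
```

Only after commit wait is done does the coordinator notify the participants of s, so every participant applies the transaction at the same timestamp.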