External Consistency

Indian Institute of Science, Bangalore, India

Spanner
Known “unknown” is better than unknown “unknown”
Google, Inc.
Winner of the Jay Lepreau Best Paper Award at OSDI ’12 (Hollywood, CA, USA, 2012)
Presented by: Swapnil Gandhi, 19th November 2018

Authors*
Sanjay Ghemawat, Jeff Dean
“When Jeff gave a seminar at Stanford, it was so crowded that Donald Knuth had to sit on the floor.”
*and there are 24 more…
20-Nov-18

Applications at Google Scale
• Fast and Performant
  ‣ Low Latency
  ‣ High Throughput
• Global Scale
  ‣ Fault Tolerant
  ‣ Highly Available
• Seamless Scaling
  ‣ Automatic Rebalance of Compute
  ‣ Horizontal Scale-out
• Simple Back-Integration
  ‣ Integrates easily with existing products
  ‣ Multi-Language Support

Why throw in the Spanner? (1/2)
Why not MySQL?
• Relational Database
  + Supports ACID Transactions
  + Supports Indices and Query Optimization
  + Can run Plan Bouquet
  - Needs Master-Slave replication
  - Manual Re-sharding
Used in: F1 Ad Database

Why throw in the Spanner? (2/2)
Why not BigTable?
• Key-Value Store
  -- Eventually Consistent
  - Lacks support for cross-row transactions
  ++ High Write Throughput
Used in: Google Earth, Google Analytics, Personalized Search

Why not Megastore?
• Semi-relational data model
  ++ Replicated synchronously
  + Supports ACID Transactions
  -- Poor Write Throughput
Used in: Gmail, Picasa, Android Market, Calendar

Why Spanner?
• Globally Distributed Multi-version Database
  ‣ General purpose transactions (ACID)
  ‣ SQL-based query Language
  ‣ Schematized Tables
  ‣ Semi-relational data model
• Lock-free distributed read-only transactions
• Applications control Replication and Placement
• Supports External Consistency
• Scales out horizontally
  ‣ No need to shard data manually

What is External Consistency?
• Serializability ensures that transactions are executed in a manner that’s indistinguishable from a system in which they are executed serially.
• External Consistency ensures that the serial order is consistent with the order in which transactions can be observed to commit with respect to a global clock.

Why is External Consistency desired?
• A user sends an e-mail using the Gmail web app, then immediately views “sent messages” to double-check what he/she wrote. Without external consistency, the app’s request may go to a different replica which is behind on state changes. This may lead to:
  ‣ Confusion
  ‣ Reduced user experience

Spanner Software Stack

Spanner Server Organization
Spanner Universe (a Zone can be thought of as the unit of replication)

Spanserver stack
(key:string, timestamp:int64) -> string

Directory
A Directory is a range-partitioned row space and can be thought of as the unit of data placement.

Data Model
Prone to "garbage in, garbage out"

TrueTime
Spanner knows what time it is!

Timestamp and Global Clock
• [Assumption] Global Clock is available
• Strict two-phase locks for write transactions
• Assign timestamps while locks are held
[Timeline for transaction T on D1: acquire locks, pick s = now(), release locks]

External Consistency
If timestamp order is equivalent to commit order, we can claim: “Timestamp order respects global wall-clock time, and thus transactions ordered by timestamp ensure External Consistency.”

Timestamp Invariant
• Transactions T1 and T2 overlap on time and data split
• Transactions T3 and T4 do not overlap on time and data split
[Diagram: lock timelines (acquire locks, release locks) for T1 and T2 on D1, and for T3 and T4 on D2 and D3]

Is Synchronizing Time at Global Scale possible?
Distributed systems dogma:
• Synchronizing time within and between datacenters is extremely hard and uncertain.
• Efficient serialization of requests is impossible at global scale.

TrueTime API
“Global wall-clock time” with bounded uncertainty
[Diagram: on the time axis, the interval from earliest to latest has width 2*ε and contains the absolute time $t_{abs}$]

API:
  Method         Returns
  TT.now()       TTInterval: [earliest, latest]
  TT.after(t)    True if t has definitely passed
  TT.before(t)   True if t has definitely not arrived

TrueTime Architecture
[Diagram: Datacenter 1, Datacenter 2, … Datacenter N each host GPS clocks and atomic clocks; the client syncs every 30 secs]

TrueTime Implementation
now = Reference now + Local-clock Offset
ε = Reference ε + Worst-case local-clock drift
[Assumption] Worst-case local-clock drift = 200 μs/sec
Synchronization Time = 50 μs
[Plot: ε versus time at 0, 30, 60, and 90 sec; between synchronizations ε rises by up to 6 ms (200 μs/sec × 30 sec) above the reference uncertainty]

Timestamp and TrueTime
[Timeline for transaction T: acquire locks; call the TrueTime API and pick s = TT.now().latest; commit wait until TT.now().earliest > s; release locks. Picking s and the commit wait each account for ε on average.]

Invariants
• Monotonicity Invariant
  ‣ Spanner assigns timestamps to Paxos writes in monotonically increasing order.
• Disjointness Invariant
  ‣ A leader must only assign timestamps within the interval of its lease.
• External Consistency Invariant
  ‣ $t_{abs}(e_1^{commit}) < t_{abs}(e_2^{start}) \Rightarrow s_1 < s_2$

Types of Reads and Writes
A read-only transaction must be pre-declared as not having any writes.

Schema
Columns: Recipient ID | Sender ID | Subject | Message | is_deleted | is_spam | Time Stamp
Rows (cells in extracted order):
  Assad | Trump | Check my twitter feed | … | 0 | NO | …
  Putin | Don’t you worry child | … | 0 | NO | …
  Gandhi | Trump | Climate change is over-rated | … | 0 | YES | …
  Modi | Mittro … | … | 1 | YES | …
  Rocketman | Trump | We have shiny rockets! | … | 0 | NO | …
  Mueller | Trump | You’re fired! | … | 0 | NO | …

Consistent Reads

Stale Reads (1/2)
The Python code on this slide performs a stale read using a 15 sec bounded-staleness timestamp.

Stale Reads (2/2)
[Diagram: Zones 1 to 3 each hold replicas of Splits 1, 2, and 3; Zone 2 holds the Leader, Zones 1 and 3 hold Slaves. Labels: Request; “Do I have up-to-date data? (Max 15 sec old data)”; “yes!”; Response. Condition: Request Timestamp ≤ $T_{safe}$]

Strong Reads (1/3)
The Python code on this slide performs a strong read from the Spanner database.

Strong Reads (2/3)
[Diagram: Zones 1 to 3 each hold replicas of Splits 1, 2, and 3; Zone 2 holds the Leader, Zones 1 and 3 hold Slaves. Labels: Request; “Is this the latest data?”; “Yep!”; Response]

Strong Reads (3/3)
[Diagram: same layout as Strong Reads (2/3). Labels: Request; “Is this the latest data?”; “Nope! Wait”; Response blocked]

Consistent Writes

Read-Write Transaction
The Python code on this slide performs a read-write transaction on the Spanner database.

Transaction within Paxos Group
[Diagram: Paxos groups 1 to 3 with Leaders 1 to 3, each holding Splits 1, 2, and 3. Numbered labels: 1. TXN Query; 2. Acq. Locks; 3. Query Results; 4. Buffer Writes; 5. Apply Writes and Release Locks; 6. TXN Success. Other labels: Acq. Locks; Call TrueTime API; Commit Wait]

Transaction across Paxos Group
[Diagram: Paxos groups 1 to 3 with Leaders 1 to 3, each holding Splits 1, 2, and 3. Labels include: 1. TXN Query; 2. Acq. Locks; 3. Query Results; 4. Buffer Writes; Prepare and Prepare ACK; Write and ACK; Commit Wait; Acq. Locks, Apply Writes and Release Locks on every leader; TXN Success. Every leader calls the TrueTime API.]

Evaluation

Microbenchmarks
All operations performed were standalone reads and standalone writes.

Two phase commit scalability
All write operations were performed across 3 Zones.

Effect of Leader failure on throughput

Network Induced Uncertainty

Key Takeaways
• Spanner is the first globally distributed database to offer strong consistency guarantees
  ‣ Always use strong reads
  ‣ If latency makes strong reads infeasible, use reads with bounded staleness to improve performance
• TrueTime API is the secret sauce
• Stronger semantics are achievable: Global scale != Weaker Semantics
• In case of a local clock malfunction, strong consistency guarantees are invalidated.
• Write latency is lower bounded by ε.

Questions!

Backup Slides
Just ignore these!

Safe Time
$T_{safe} = \min(T_{safe}^{Paxos}, T_{safe}^{TM})$
$T_{safe}^{Paxos}$ = timestamp of the highest applied Paxos write
$T_{safe}^{TM} = \infty$ when zero transactions are in the pending state
$T_{safe}^{TM} = \min_i(s_{i,g}^{prepare}) - 1$ when there are transactions in the pending state

Backup Slide (1/2)
[Timeline for transaction T: acquire locks; call the TrueTime API and pick s = TT.now().latest; start consensus; end consensus; commit wait until TT.now().earliest > s; release locks. Picking s and the commit wait each account for ε on average.]

Backup Slide (2/2)
[Timelines for a two-phase commit. Coordinator $T_C$: acquire locks; compute overall s; start logging; done logging; commit wait done; committed; notify participants of s; release locks. Participants $T_{P1}$ and $T_{P2}$: acquire locks; pick $s_i$ = TT.now().latest; prepared; send $s_i$; release locks.]
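
The TrueTime Implementation, Timestamp and TrueTime, and Safe Time slides can be tied together with a small Python sketch. It is only an illustration under stated assumptions, not Spanner's implementation: `SimulatedTrueTime`, `epsilon_at`, `commit`, and `safe_time` are hypothetical names, the clock is the local monotonic clock widened by a fixed ε, and the reference ε is a free parameter because the slides do not give its value.

```python
import math
import time
from dataclasses import dataclass

DRIFT_S_PER_S = 200e-6  # worst-case local-clock drift from the slides (200 us/sec)


def epsilon_at(seconds_since_sync: float, reference_epsilon_s: float) -> float:
    """epsilon = reference epsilon + worst-case drift accumulated since the last sync."""
    return reference_epsilon_s + DRIFT_S_PER_S * seconds_since_sync


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Toy stand-in for TrueTime (assumption): local clock +/- a fixed epsilon."""

    def __init__(self, epsilon_s: float = 0.004):
        self.epsilon_s = epsilon_s

    def now(self) -> TTInterval:
        t = time.monotonic()  # monotonic so the simulation never jumps backwards
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, t: float) -> bool:
        # True only if t has definitely passed.
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        # True only if t has definitely not arrived.
        return self.now().latest < t


def commit(tt: SimulatedTrueTime, apply_writes) -> float:
    """Pick s = TT.now().latest, apply the writes, then commit-wait until s has passed."""
    s = tt.now().latest            # chosen while locks would be held
    apply_writes(s)
    while not tt.after(s):         # commit wait: TT.now().earliest > s
        time.sleep(tt.epsilon_s / 4)
    return s                       # locks would be released only now


def safe_time(highest_applied_paxos_ts: float, prepared_timestamps: list) -> float:
    """T_safe = min(T_safe^Paxos, T_safe^TM), following the Safe Time slide."""
    t_tm = min(prepared_timestamps) - 1 if prepared_timestamps else math.inf
    return min(highest_applied_paxos_ts, t_tm)


if __name__ == "__main__":
    tt = SimulatedTrueTime()
    s1 = commit(tt, lambda s: None)
    s2 = commit(tt, lambda s: None)   # starts after the first commit returned
    print("s1 < s2:", s1 < s2)
    print("epsilon 30 s after sync:", epsilon_at(30, reference_epsilon_s=0.001))
    print("safe time:", safe_time(100, [90, 95]))
```

Because `commit` returns only after `TT.now().earliest > s`, a transaction that starts afterwards picks a strictly larger timestamp, which is the external consistency invariant; the price is a wait of roughly 2ε per write. A replica in this sketch could serve a read at timestamp t whenever t ≤ `safe_time(...)`.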
