Spanner: Google's Globally-Distributed Database
Today's Reminders
• Discuss project ideas with Phil & Kevin
  – Phil's office hours: after class today
  – Sign up for a slot: 11–12:30 or 3–4:20 this Friday

Spanner: Google's Globally-Distributed Database [OSDI'12 best paper]
Phil Gibbons, 15-712 F15, Lecture 14

Database vs. Key-Value Store
• "We provide a database instead of a key-value store to make it easier for programmers to write their applications."
• "We consistently received complaints from users that Bigtable can be difficult to use for some kinds of applications."

Authors: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford (Google x 26)

Spanner [slides from the OSDI'12 talk, presented by Wilson Hsieh representing a host of authors]
• Worked on Spanner for 4½ years at the time of OSDI'12
• Scalable, multi-version, globally-distributed, synchronously-replicated database
  – Hundreds of datacenters, millions of machines, trillions of rows
• Transaction properties
  – Transactions are externally consistent (a.k.a. linearizable)
  – Read-only transactions are lock-free
  – (Read-write transactions use two-phase locking)
• Flexible replication configuration

What is Spanner?
• Distributed multiversion database
  – General-purpose transactions (ACID)
  – SQL query language
  – Schematized tables
  – Semi-relational data model
• Running in production
  – Storage for Google's ad data
  – Replaced a sharded MySQL database

Example: Social Network
[Figure: world map — user posts and friend lists sharded x1000 across datacenters in the US, Brazil, Russia, and Spain (San Francisco, Seattle, Arizona, São Paulo, Santiago, Buenos Aires, Moscow, London, Paris, Berlin, Krakow, Madrid, Lisbon)]

Overview
• Feature: lock-free distributed read transactions
• Property: external consistency of distributed transactions
  – First system at global scale
• Implementation: integration of concurrency control, replication, and 2PC
  – Correctness and performance
• Enabling technology: TrueTime
  – Interval-based global time

Read Transactions
• Generate a page of friends' recent posts
  – Consistent view of friend list and their posts
• Why consistency matters:
  1. Remove untrustworthy person X as friend
  2. Post P: "My government is repressive…"

Single Machine / Multiple Machines / Multiple Datacenters
[Figures: generating my page must block writes — first on a single machine holding the user posts and friend lists; then across multiple machines holding friend1…friend1000's posts; then across datacenters, with posts and friend lists sharded x1000 in the US, Spain, Brazil, and Russia]

Version Management
• Transactions that write use strict 2PL
  – Each transaction T is assigned a timestamp s
  – Data written by T is timestamped with s

  Time          <8     8     15
  My friends    [X]    []
  My posts                   [P]
  X's friends   [me]   []

Synchronizing Snapshots
• Global wall-clock time
  == External consistency: commit order respects global wall-time order
  == Timestamp order respects global wall-time order, given timestamp order == commit order

Timestamps, Global Clock
• Strict two-phase locking for write transactions
• Assign timestamp while locks are held
[Figure: transaction T — acquire locks, pick s = now(), release locks]

Timestamp Invariants
• Timestamp order == commit order
• Timestamp order respects global wall-time order
[Figure: overlapping transactions T1–T4 on a timeline]

TrueTime
• "Global wall-clock time" with bounded uncertainty
• TT.now() returns an interval [earliest, latest] of width 2ε that contains the true time

Timestamps and TrueTime
• Acquire locks, then pick s = TT.now().latest
• Commit wait: hold locks until TT.now().earliest > s, then release
• Commit wait averages ε

Commit Wait and Replication
[Figure: acquire locks → pick s → start consensus → achieve consensus → commit wait done → notify slaves → release locks; commit wait averages ε]

Commit Wait and 2-Phase Commit
[Figure: coordinator TC and participants TP1, TP2 — participants acquire locks, log prepare records ("start logging … done logging"), and each computes a prepare timestamp; each sends its s to TC; TC computes the overall s, commits, does commit wait, notifies participants of s, and all release locks]

Example
[Figure: timeline with marks at s = 6, 8, 15 — transaction T2 (remove X from my friend list; remove myself from X's friend list) commits at s = 8 with participant timestamp sP = 8; the risky post P commits at s = 15]

  Time          <8     8     15
  My friends    [X]    []
  My posts                   [P]
  X's friends   [me]   []
What Have We Covered?
• Lock-free read transactions across datacenters
• External consistency
• Timestamp assignment
• TrueTime
  – Uncertainty in time can be waited out

What Haven't We Covered?
• How to read at the present time
• Atomic schema changes
  – Mostly non-blocking
  – Commit in the future
• Non-blocking reads in the past
  – At any sufficiently up-to-date replica

TrueTime Architecture
[Figure: per-datacenter time masters — GPS timemasters plus an Atomic-clock timemaster — across Datacenter 1, Datacenter 2, …, Datacenter n; clients compute the reference interval [earliest, latest] = now ± ε]

TrueTime Implementation
• now = reference now + local-clock offset
• ε = reference ε + worst-case local-clock drift
[Figure: sawtooth of ε vs. time — ε grows at 200 μs/sec of local drift, adding up to about +6 ms between 30-second timemaster polls (ticks at 0, 30, 60, 90 sec)]

What If a Clock Goes Rogue?
• Timestamp assignment would violate external consistency
• Empirically unlikely, based on 1 year of data
  – Bad CPUs are 6 times more likely than bad clocks

Network-Induced Uncertainty
[Figure: ε in ms at the 90th/99th/99.9th percentiles — one plot spanning Mar 29–Apr 1, one spanning 6AM–12PM on April 13]

What's in the Literature
• External consistency / linearizability
• Distributed databases
• Concurrency control
• Replication
• Time (NTP, Marzullo)

Future Work
• Improving TrueTime
  – Lower ε to < 1 ms
• Building out database features
  – Finish implementing basic features
  – Efficiently support rich query patterns

Semi-relational Data Model
• Rows must have names
  – Every table must have one or more primary-key columns
• Applications control data locality through their choice of keys

Conclusions
• Reify clock uncertainty in time APIs
  – Known unknowns are better than unknown unknowns
  – Rethink algorithms to make use of uncertainty
• Stronger semantics are achievable
  – Greater scale != weaker semantics

[End of slides from the OSDI'12 talk]

Read-Only Transactions: Constraints
• Must declare "read-only" upfront
• Must have a "scope" expression
  – Summarizes the keys that will be read by the entire transaction

Availability

F1 Advertising Backend
• Lock conflicts
• Only one DC had SSDs

Friday: Discuss Projects with Phil & Kevin
1) The problem you want to solve
2) The approach you are going to take
3) How it pertains to the course
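The numbers on the TrueTime implementation slide (200 μs/sec worst-case drift, timemasters polled every 30 seconds) determine the sawtooth bound on ε directly. A quick check — the ~1 ms reference ε right after a poll is an assumption for illustration; the drift rate and poll interval come from the slide:

```python
# Sawtooth bound from the TrueTime implementation slide:
#   epsilon = reference epsilon + worst-case local-clock drift
DRIFT_RATE = 200e-6     # slide: 200 microseconds of drift per second
POLL_INTERVAL = 30.0    # slide: timemasters polled every 30 seconds
REFERENCE_EPS = 1e-3    # assumed ~1 ms uncertainty right after a poll

def epsilon_at(seconds_since_poll):
    """Worst-case uncertainty a given time after the last timemaster poll."""
    return REFERENCE_EPS + DRIFT_RATE * seconds_since_poll

print(f"{epsilon_at(0.0) * 1e3:.1f} ms")            # just after a poll: 1.0 ms
print(f"{epsilon_at(POLL_INTERVAL) * 1e3:.1f} ms")  # just before the next: 7.0 ms
```

This reproduces the slide's sawtooth: ε ramps up by about 6 ms between polls, then snaps back when the daemon resynchronizes with a timemaster — which is why "improving TrueTime" to ε < 1 ms is listed as future work.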