Distributed SQL Databases Deconstructed
Understanding Amazon Aurora, Google Spanner & the Spanner Derivatives
Karthik Ranganathan & Sid Choudhury · April 18, 2019

Introduction

Karthik Ranganathan – Co-Founder & CTO, YugaByte. Nutanix ♦ Facebook ♦ Microsoft. IIT Madras, University of Texas-Austin. @karthikr
Sid Choudhury – VP Product, YugaByte. AppDynamics ♦ Salesforce ♦ Oracle. IIT Kharagpur, University of Texas-Austin. @SidChoudhury

Types of Data Stores
OLAP                               OLTP (this talk's focus)
Write once, read many              Mixed reads & writes
Few concurrent sessions            Many concurrent sessions
Long-running, ad-hoc queries       Single-digit ms query latency
Large table scans                  Point reads & short-range scans
Petabyte-scale data storage        Terabyte-scale data storage
Examples

The examples span OLAP and OLTP, NoSQL and SQL, across both open source and proprietary systems. Proprietary examples include Google BigQuery (OLAP SQL), Google BigTable (NoSQL), Amazon Aurora (OLTP SQL) and Google Spanner (OLTP SQL). This talk focuses on OLTP SQL.
Devs ❤ SQL

1. Query Flexibility
   – Model data once, change queries as the business changes
   – Balance modeling richness with performance needs
2. Rich Ecosystem
   – Data modeling & query examples
   – Developer IDEs & data visualization tools
   – Easy to reuse & build integrations
3. Universal Standard for Data Access
   – Learn once, use forever
Devs ❤ SQL, But…

1. Large Dataset?
   – No horizontal write scalability
   – Use manually sharded SQL or non-transactional NoSQL
2. Infrastructure Failures?
   – No native failover & repair
   – Use complex replication schemes
3. Multi-Region / Geo-Distributed App?
   – Multi-master deployment is the only option
   – Data loss w/ Last Writer Wins (LWW) conflict resolution
Distributed SQL = Keep & Remove

1. SQL Features
   – ACID, JOINs, foreign keys, serializable isolation
2. Horizontal Write Scalability
   – Scale write throughput by adding/removing nodes
3. Fault Tolerance With High Availability
   – Native failover & repair
4. Globally Consistent Writes
   – Lower end-user latency and tolerate region failures
5. Low Read Latency
   – Strongly consistent (aka correct) reads

Distributed SQL Architectures – Aurora vs Spanner
Shared Storage: Amazon Aurora – "A highly available MySQL and PostgreSQL-compatible relational database service." Available on AWS since 2015.
Shared Nothing: Google Spanner – "The first horizontally scalable, strongly consistent, relational database service." Available on Google Cloud since 2017.

#1 SQL Features
Depth of SQL Support

Amazon Aurora: ✓ MySQL and PostgreSQL-compatible
Google Spanner: Subset of MySQL/PostgreSQL features
Aurora vs Spanner

Feature                        Amazon Aurora    Google Spanner
SQL Features                   ✓
Horizontal Write Scalability
Fault Tolerance with HA
Globally Consistent Writes
Low Read Latency

#2 Horizontal Write Scalability
Amazon Aurora
Single Node SQL on Multi-Zone Distributed Storage
❌ Add Primary Instances for Write Scaling
✓ Add Read Replicas for Read Scaling
Google Spanner
Multi-Node SQL on Multi-Region Distributed Storage
✓ Add Primary Instances for Write Scaling
✓ Add Read Replicas for Read Scaling
Aurora vs Spanner

Feature                        Amazon Aurora    Google Spanner
SQL Features                   ✓
Horizontal Write Scalability   ❌               ✓
Fault Tolerance with HA
Globally Consistent Writes
Low Read Latency

#3 Fault Tolerance with HA
Amazon Aurora
Native Failover & Repair Through Primary Auto Election
✓ HA When Primary Instance Fails
✓ HA When Read Replica Fails
Google Spanner
Native Failover & Repair Through Shard Leader Auto Election
✓ HA When Any Primary Node Fails
✓ HA When Read Replica Fails
Aurora vs Spanner

Feature                        Amazon Aurora    Google Spanner
SQL Features                   ✓
Horizontal Write Scalability   ❌               ✓
Fault Tolerance with HA        ✓                ✓
Globally Consistent Writes
Low Read Latency

#4 Globally Consistent Writes

Amazon Aurora
Multi-Master Last Writer Wins Conflict Resolution Leads to Inconsistencies
Region 1: a SQL app runs SET BALANCE = BALANCE - 10
Region 2: a SQL app concurrently runs SET BALANCE = BALANCE - 100
Asynchronous replication between the two regions reconciles the conflicting writes with Last Writer Wins.
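To make the failure mode concrete, here is a minimal Python sketch (illustrative only, not Aurora internals): two masters accept concurrent debits against the same starting balance, and a last-writer-wins merge keeps only the write with the later timestamp, silently dropping the other debit.

# Minimal sketch (not Aurora code): how last-writer-wins (LWW) conflict
# resolution can silently drop one of two concurrent multi-master updates.
from dataclasses import dataclass

@dataclass
class Versioned:
    value: int
    ts: float  # wall-clock timestamp attached to the write

def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep whichever write carries the later timestamp."""
    return a if a.ts >= b.ts else b

balance0 = Versioned(value=100, ts=0.0)

# Two regions accept writes concurrently against the same starting balance.
region1 = Versioned(value=balance0.value - 10,  ts=1.0)   # SET BALANCE = BALANCE - 10
region2 = Versioned(value=balance0.value - 100, ts=1.1)   # SET BALANCE = BALANCE - 100

merged = lww_merge(region1, region2)
print(merged.value)           # 0   -> the -10 debit has vanished
print(balance0.value - 110)   # -10 -> the correct, serialized result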
Google Spanner
Purpose-Built for Globally Consistent Writes
Two SQL apps concurrently run SET BALANCE = BALANCE - 10 and SET BALANCE = BALANCE - 100; both updates are applied consistently.
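For contrast, a minimal sketch (not Spanner code) of what globally consistent writes buy: when concurrent updates to the same row are ordered through a single consensus leader, both debits apply and neither is lost.

# Minimal sketch (not Spanner code): serializing concurrent writes to a row
# through one leader means both debits are reflected, unlike the LWW merge above.
from threading import Lock

class RowLeader:
    """Stand-in for the shard leader that orders all writes to one row."""
    def __init__(self, balance: int):
        self.balance = balance
        self._lock = Lock()  # models the leader applying writes one at a time

    def debit(self, amount: int) -> int:
        with self._lock:
            self.balance -= amount
            return self.balance

leader = RowLeader(balance=100)
leader.debit(10)       # app in region 1
leader.debit(100)      # app in region 2
print(leader.balance)  # -10 -> both writes are reflected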
Aurora vs Spanner

Feature                        Amazon Aurora    Google Spanner
SQL Features                   ✓
Horizontal Write Scalability   ❌               ✓
Fault Tolerance with HA        ✓                ✓
Globally Consistent Writes     ❌               ✓
Low Read Latency

#5 Low Read Latency

Amazon Aurora
Strongly Consistent Reads Served By Primary Instance
Google Spanner
Strongly Consistent Reads Served By Shard Leaders w/o Read Quorum
Aurora vs Spanner

Feature                        Amazon Aurora    Google Spanner
SQL Features                   ✓
Horizontal Write Scalability   ❌               ✓
Fault Tolerance with HA        ✓                ✓
Globally Consistent Writes     ❌               ✓
Low Read Latency               ✓                ✓

Battle of Architectures – Spanner Beats Aurora

No Performance & Availability Bottlenecks – scale to large clusters while remaining highly available
Built for Geo-Distributed Apps – future-proofs the data tier at global businesses
Complex to Engineer – needs clock skew tracking across instances

Analyzing Open Source Spanner Derivatives

Spanner Brought to Life in Open Source

Design Principles
• CP in CAP Theorem
  – Consistent
  – Partition Tolerant
  – HA on failures (new leader elected in seconds)
• High Performance
  – All layers in C++ to ensure high perf
  – Run on large memory machines
  – Optimized for SSDs
• ACID Transactions
  – Single-row linearizability
  – Multi-row ACID
  – Serializable
  – Snapshot
• Run Anywhere
  – No external dependencies
  – No atomic clocks
  – Bare metal, VM and Kubernetes
Functional Architecture

YSQL – PostgreSQL-Compatible Distributed SQL API
DocDB – Spanner-Inspired Distributed Document Store
Cloud Neutral – No Specialized Hardware Needed
Distributed SQL = Keep & Remove
1. SQL Features
2. Replication Protocol
3. Clock Skew Tracking
4. Transaction Manager
Spanner vs. its Open Source Derivatives

Feature | Google Spanner | YugaByte DB | CockroachDB | TiDB
Cost | Expensive | Free | Free | Free
SQL API Compatibility
Replication Protocol
Clock Skew Tracking
Transaction Manager
Tunable Read Latency
Official Jepsen Tests

SQL API Compatibility
PostgreSQL Transformed into Distributed SQL

Depth of SQL Support

Current:
• Data Types
• Built-in Functions
• Expressions
• Secondary Indexes
• JOINs
• Transactions
• Views

Future:
• Relational Integrity (Foreign Keys)
• Stored Procedures
• Triggers
• JSON Column Type
• Foreign Data Wrappers
• And more ...
Spanner vs. its Open Source Derivatives

Feature | Google Spanner | YugaByte DB | CockroachDB | TiDB
Cost | Expensive | Free | Free | Free
SQL API Compatibility | Proprietary | PostgreSQL | PostgreSQL | MySQL
Replication Protocol
Clock Skew Tracking
Transaction Manager
Tunable Read Latency
Official Jepsen Tests

Replication Protocol

Every Table is Automatically Sharded
SHARDING = AUTOMATIC PARTITIONING OF TABLES
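A minimal sketch of the idea (the tablet count and hash function here are illustrative, not YugaByte DB's actual scheme): each row's primary key is hashed into a fixed keyspace, and the hash determines which tablet (shard) owns the row.

# Minimal sketch (illustrative): hash-sharding the rows of a table into tablets.
import hashlib

NUM_TABLETS = 4  # assumption for illustration; real deployments use many more

def tablet_for(primary_key: str) -> int:
    """Map a primary key to a tablet by hashing it into a fixed keyspace."""
    h = int.from_bytes(hashlib.sha256(primary_key.encode()).digest()[:2], "big")
    return h % NUM_TABLETS

for key in ["user-1", "user-2", "user-3", "order-42"]:
    print(key, "-> tablet", tablet_for(key))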
Replication Done at Shard Level

Tablet #1 is replicated as Tablet Peer 1 on Node X, Tablet Peer 2 on Node Y, and Tablet Peer 3 on Node Z.
Replication Uses a Consensus Algorithm

Uses the Raft algorithm: first, a tablet leader (the Raft leader) is elected among the peers.
Writes in Raft Consensus

Writes are processed by the Raft leader: it sends the write to all peers and waits for a majority to acknowledge before committing.
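A minimal sketch of this write path (illustrative, not a complete Raft implementation): the leader appends locally, fans the entry out to its peers, and reports success only once a majority has acknowledged it.

# Minimal sketch (illustrative): a leader commits a write only after a majority
# of the replication group (including itself) has acknowledged the entry.
class Peer:
    def __init__(self, name: str):
        self.name = name
        self.log = []

    def append(self, entry) -> bool:
        self.log.append(entry)
        return True  # ack; a real peer could reject or time out

def leader_write(leader: Peer, followers: list, entry) -> bool:
    acks = 1 if leader.append(entry) else 0               # leader writes locally
    acks += sum(1 for f in followers if f.append(entry))  # replicate to peers
    majority = (1 + len(followers)) // 2 + 1
    return acks >= majority                               # commit only on quorum

leader = Peer("node-x")
followers = [Peer("node-y"), Peer("node-z")]
print(leader_write(leader, followers, ("k1", "v1")))      # True: quorum of 2-of-3 reached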
Reads in Raft Consensus

Reads are handled by the Raft leader, which uses leader leases for performance, allowing it to serve reads without a read quorum.
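A minimal sketch of a leader lease (illustrative; real leases are granted and renewed through Raft heartbeats): while its time-bound lease is valid, the leader serves strongly consistent reads from its own copy, skipping the quorum round-trip.

# Minimal sketch (illustrative): a leader answers reads locally only while its
# time-bound lease is still valid.
import time

class LeaseHoldingLeader:
    def __init__(self, lease_duration_s: float):
        self.lease_duration_s = lease_duration_s
        self.lease_expiry = 0.0
        self.data = {}

    def renew_lease(self):
        # In a real system the lease is granted/renewed via Raft heartbeats.
        self.lease_expiry = time.monotonic() + self.lease_duration_s

    def read(self, key):
        if time.monotonic() < self.lease_expiry:
            return self.data.get(key)  # fast local read, no read quorum needed
        raise RuntimeError("lease expired: cannot serve reads until re-acquired")

leader = LeaseHoldingLeader(lease_duration_s=2.0)
leader.renew_lease()
leader.data["row1"] = "value"
print(leader.read("row1"))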
Spanner vs. its Open Source Derivatives

Feature | Google Spanner | YugaByte DB | CockroachDB | TiDB
Cost | Expensive | Free | Free | Free
SQL API Compatibility | Proprietary | PostgreSQL | PostgreSQL | MySQL
Replication Protocol | Paxos | Raft | Raft | Raft
Clock Skew Tracking
Transaction Manager
Tunable Read Latency
Official Jepsen Tests

Transactions and Clock Skew Tracking

Multi-Shard Transactions
BEGIN TXN; UPDATE k1; UPDATE k2; COMMIT;

k1 and k2 may belong to different shards, i.e. to different Raft groups on completely different nodes.
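A minimal, client's-eye sketch of such a transaction (illustrative; the real protocol also needs a transaction status record and an agreed commit timestamp): provisional writes go to each shard's Raft leader, then COMMIT promotes them on every shard touched.

# Minimal sketch (illustrative): a transaction spanning two shards stages
# provisional writes on each shard leader and promotes them at COMMIT.
class ShardLeader:
    def __init__(self):
        self.committed = {}     # visible, committed key-values
        self.provisional = {}   # per-transaction pending writes

    def write_provisional(self, txn_id, key, value):
        self.provisional.setdefault(txn_id, {})[key] = value

    def commit(self, txn_id):
        self.committed.update(self.provisional.pop(txn_id, {}))

    def abort(self, txn_id):
        self.provisional.pop(txn_id, None)

shard_for_k1, shard_for_k2 = ShardLeader(), ShardLeader()  # different Raft groups

txn = "txn-1"
shard_for_k1.write_provisional(txn, "k1", 10)   # UPDATE k1
shard_for_k2.write_provisional(txn, "k2", 20)   # UPDATE k2
for shard in (shard_for_k1, shard_for_k2):      # COMMIT: promote on both shards
    shard.commit(txn)
print(shard_for_k1.committed, shard_for_k2.committed)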
What Do Distributed Transactions Need?

BEGIN TXN; UPDATE k1; UPDATE k2; COMMIT;

The two updates land on two different Raft leaders, yet should get written at the same physical time. But how will the nodes agree on time?
Use a Physical Clock

You would need an Atomic Clock or two lying around. Atomic Clocks are highly available, globally synchronized clocks with tight error bounds.

"Jeez! I'm fresh out of those. Most of my physical clocks are never synchronized."
Hybrid Logical Clock (HLC)

Combine coarsely-synchronized physical clocks with Lamport clocks to track causal relationships.

An HLC is a (physical component, logical component) pair: the physical component is synchronized using NTP, and the logical component is a monotonic counter.

Nodes update the HLC on each Raft exchange, e.g. heartbeats, leader election and data replication.
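A minimal sketch of the HLC update rules described above (based on the published hybrid logical clock algorithm, not YugaByte DB source): a timestamp is a (physical, logical) pair that never moves backwards and always advances past any timestamp received from a peer.

# Minimal sketch (illustrative): hybrid logical clock combining NTP-synchronized
# wall time with a monotonic logical counter.
import time

class HLC:
    def __init__(self):
        self.physical = 0  # microseconds, tracks coarsely-synchronized wall time
        self.logical = 0   # monotonic counter that breaks ties within one tick

    def _wall(self) -> int:
        return int(time.time() * 1_000_000)

    def now(self):
        """Advance the clock for a local or send event (e.g. stamping a message)."""
        wall = self._wall()
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            self.logical += 1
        return (self.physical, self.logical)

    def update(self, remote):
        """Advance the clock on receiving a message stamped with `remote`."""
        r_phys, r_log = remote
        wall = self._wall()
        old_phys = self.physical
        self.physical = max(old_phys, r_phys, wall)
        if self.physical == old_phys == r_phys:
            self.logical = max(self.logical, r_log) + 1
        elif self.physical == old_phys:
            self.logical += 1
        elif self.physical == r_phys:
            self.logical = r_log + 1
        else:
            self.logical = 0
        return (self.physical, self.logical)

a, b = HLC(), HLC()
stamp = a.now()         # node A stamps an outgoing Raft message
print(b.update(stamp))  # node B merges it; B's clock never goes backwards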
Spanner vs. its Open Source Derivatives

Feature | Google Spanner | YugaByte DB | CockroachDB | TiDB
Cost | Expensive | Free | Free | Free
SQL API Compatibility | Proprietary | PostgreSQL | PostgreSQL | MySQL
Replication Protocol | Paxos | Raft | Raft | Raft
Clock Skew Tracking | TrueTime Atomic Clock | Hybrid Logical Clock + Max Clock Skew | Hybrid Logical Clock + Max Clock Skew | Single Timestamp Gen ⇒ No Tracking Needed
Transaction Manager | At Every Node | At Every Node | At Every Node | Special Node for Timestamp Generation
Tunable Read Latency
Official Jepsen Tests

Miscellaneous
Spanner vs. its Open Source Derivatives

Feature | Google Spanner | YugaByte DB | CockroachDB | TiDB
Cost | Expensive | Free | Free | Free
SQL API Compatibility | Proprietary | PostgreSQL | PostgreSQL | MySQL
Replication Protocol | Paxos | Raft | Raft | Raft
Clock Skew Tracking | TrueTime Atomic Clock | Hybrid Logical Clock + Max Clock Skew | Hybrid Logical Clock + Max Clock Skew | Single Timestamp Gen ⇒ No Tracking Needed
Transaction Manager | At Every Node | At Every Node | At Every Node | Special Node for Timestamp Generation
Tunable Read Latency | ✓ | ✓ | ❌ | ❌
Official Jepsen Tests | Unknown | ✓ | ✓ | ❌

Read more at blog.yugabyte.com
Storage Layer – blog.yugabyte.com/distributed-postgresql-on-a-google-spanner-architecture-storage-layer
Query Layer – blog.yugabyte.com/distributed-postgresql-on-a-google-spanner-architecture-query-layer

Questions?
Try it at docs.yugabyte.com/quick-start
Check us out on GitHub https://github.com/YugaByte/yugabyte-db