Distributed SQL Deconstructed

Understanding Amazon Aurora, & the Spanner Derivatives

Karthik Ranganathan Sid Choudhury April 18, 2019

© 2019 All rights reserved. 1 Introduction

Karthik Ranganathan Sid Choudhury

Co-Founder & CTO, YugaByte VP Product, YugaByte Nutanix ♦ Facebook ♦ Microsoft AppDynamics ♦ Salesforce ♦ Oracle IIT Madras, University of Texas-Austin IIT Kharagpur, University of Texas-Austin

@karthikr @SidChoudhury

© 2019 All rights reserved. 2 Types of Data Stores This Talk’s Focus

OLAP OLTP

Write once, Read many Mixed reads & writes Few concurrent sessions Many concurrent sessions Long running, ad-hoc queries Single-digit ms query latency Large table scans Point reads & short-range scans Petabyte-scale data storage Terabyte-scale data storage

© 2019 All rights reserved. 3 Examples Open Source This Talk’s Focus

OLAP OLTP NoSQL SQL NoSQL SQL

Amazon Aurora

Google Google Google BigQuery Spanner

Proprietary

© 2019 All rights reserved. 4 Devs  SQL

1. Query Flexibility  – Model data once, change queries as business changes – Balance modeling richness with performance needs

2. Rich Ecosystem  – Data modeling & query examples – Developer IDEs & data visualization tools – Easy to reuse & build integrations

3. Universal Standard for Data Access – Learn once, use forever

© 2019 All rights reserved. 5 Devs  SQL

1. Large Dataset?  – No horizontal write scalability – Use manually sharded SQL or non-transactional NoSQL

2. Infrastructure Failures?  – No native failover & repair – Use complex replication schemes

3. Multi-Region/Geo-Distributed App?  – Multi-master deployment is the only option – Data loss w/ Last Writer Wins (LWW) conflict resolution

© 2019 All rights reserved. 6 Distributed SQL = Keep  & Remove  1. SQL Features – ACID, JOINs, foreign keys, serializable isolation

2. Horizontal Write Scalability – Scale write throughput by adding/removing nodes

3. Fault Tolerance With High Availability – Native failover & repair

4. Globally Consistent Writes – Lower end user latency and tolerate region failures

5. Low Read Latency – Strongly consistent (aka correct) reads © 2019 All rights reserved. 7 Distributed SQL Architectures - Aurora vs Spanner

Shared Storage Shared Nothing

Amazon Aurora Google Spanner

“A highly available MySQL and PostgreSQL-compatible “The first horizontally scalable, strongly consistent, relational service” relational database service”

Available on AWS since 2015 Available on Google Cloud since 2017

© 2019 All rights reserved. 8 #1 SQL Features

© 2019 All rights reserved. 9 Depth of SQL Support Amazon Aurora Google Spanner

✓ MySQL and PostgreSQL-compatible Subset of MySQL/PostgreSQL features

© 2019 All rights reserved. 10 Aurora vs Spanner

Feature Amazon Aurora Google Spanner

SQL Features ✓

Horizontal Write Scalability ✓

Fault Tolerance with HA ✓

Globally Consistent Writes ✓

Low Read Latency

© 2019 All rights reserved. 11 #2 Horizontal Write Scalability

© 2019 All rights reserved. 12 Amazon Aurora

Single Node SQL on Multi-Zone Distributed Storage

SQL APP

INSERT ROW

❌ Add Primary Instances for Write Scaling

✓ Add Read Replicas for Read Scaling

© 2019 All rights reserved. 13 Google Spanner

Multi-Node SQL on Multi-Region Distributed Storage

SQL APP

INSERT ROW1 INSERT ROW3

✓ Add Primary Instances for Write Scaling

✓ Add Read Replicas for Read Scaling

© 2019 All rights reserved. 14 Aurora vs Spanner

Feature Amazon Aurora Google Spanner

SQL Features ✓

Horizontal Write Scalability ❌ ✓

Fault Tolerance with HA

Globally Consistent Writes

Low Read Latency

© 2019 All rights reserved. 15 #3 Fault Tolerance with HA

© 2019 All rights reserved. 16 Amazon Aurora

Native Failover & Repair Through Primary Auto Election

SQL APP

INSERT ROW

✓ HA When Primary Instance Fails

✓ HA When Read Replica Fails

© 2019 All rights reserved. 17 Google Spanner

Native Failover & Repair Through Shard Leader Auto Election

SQL APP

INSERT ROW1 INSERT ROW3

✓ HA When Any Primary Node Fails

✓ HA When Read Replica Fails

© 2019 All rights reserved. 18 Aurora vs Spanner

Feature Amazon Aurora Google Spanner

SQL Features ✓

Horizontal Write Scalability ❌ ✓

Fault Tolerance with HA ✓ ✓

Globally Consistent Writes

Low Read Latency

© 2019 All rights reserved. 19 #4 Globally Consistent Writes

© 2019 All rights reserved. 20 Amazon Aurora

Multi-Master Last Writer Wins Conflict Resolution Leads to Inconsistencies

SQL APP SQL APP

SET BALANCE = BALANCE - 10 SET BALANCE = BALANCE - 100

Asynchronous Replication

Region 1 Region 2

© 2019 All rights reserved. 21 Google Spanner

Purpose-Built for Globally Consistent Writes

SQL APP SQL APP

SET BALANCE = SET BALANCE = BALANCE - 10 BALANCE - 100

© 2019 All rights reserved. 22 Aurora vs Spanner

Feature Amazon Aurora Google Spanner

SQL Features ✓

Horizontal Write Scalability ❌ ✓

Fault Tolerance with HA ✓ ✓

Globally Consistent Writes ❌ ✓

Low Read Latency

© 2019 All rights reserved. 23 #5 Low Read Latency

© 2019 All rights reserved. 24 Amazon Aurora

Strongly Consistent Reads Served By Primary Instance

SQL APP

READ ROW

© 2019 All rights reserved. 25 Google Spanner

Strongly Consistent Reads Served By Shard Leaders w/o Read Quorum

SQL APP

READ ROW1

© 2019 All rights reserved. 26 Aurora vs Spanner

Feature Amazon Aurora Google Spanner

SQL Features ✓

Horizontal Write Scalability ❌ ✓

Fault Tolerance with HA ✓ ✓

Globally Consistent Writes ❌ ✓

Low Read Latency ✓ ✓

© 2019 All rights reserved. 27 Battle of Architectures - Spanner Beats Aurora

No Performance & Availability Bottlenecks Scale to Large Clusters while Remaining Highly Available

Built for Geo-Distributed Apps Future Proofs Data Tier at Global Businesses

Complex to Engineer Needs Clock Skew Tracking Across Instances

© 2019 All rights reserved. 28 Analyzing Open Source Spanner Derivatives

© 2019 All rights reserved. 29 Spanner Brought to Life in Open Source

© 2019 All rights reserved. 30 Design Principles

• CP in CAP Theorem • High Performance • Consistent • All layers in C++ to ensure high perf • Partition Tolerant • Run on large memory machines • HA on failures • Optimized for SSDs (new leader elected in seconds)

• ACID Transactions • Run Anywhere • Single-row linearizability • No external dependencies • Multi-row ACID • No atomic clocks • Serializable • Bare metal, VM and Kubernetes • Snapshot

© 2019 All rights reserved. 31 Functional Architecture

YSQL PostgreSQL-Compatible Distributed SQL API

tablet 1’ DOCDB Spanner-Inspired Distributed Document Store tablet 1’

CLOUD NEUTRAL No Specialized Hardware Needed

© 2019 All rights reserved. 32 Distributed SQL = Keep  & Remove 

1. SQL Features

2. Replication Protocol

3. Clock Skew Tracking

4. Transactions Manager

© 2019 All rights reserved. 33 Spanner vs. its Open Source Derivatives

Feature Google Spanner YugaByte DB CockroachDB TiDB

Cost Expensive Free Free Free

SQL API Compatibility

Replication Protocol

Clock Skew Tracking

Transaction Manager

Tunable Read Latency

Official Jepsen Tests

© 2019 All rights reserved. 34 SQL API Compatibility

© 2019 All rights reserved. 35 PostgreSQL Transformed into Distributed SQL

© 2019 All rights reserved. 36 Depth of SQL Support

• Current • Future • Data Types • Relational Integrity (Foreign Keys) • Built-in Functions • Stored Procedures • Expressions • JSON Column Type • Triggers • Secondary Indexes • Foreign Data Wrappers • JOINs • And more ... • Transactions • Views

© 2019 All rights reserved. 37 Spanner vs. its Open Source Derivatives

Feature Google Spanner YugaByte DB CockroachDB TiDB

Cost Expensive Free Free Free

SQL API Compatibility Proprietary PostgreSQL PostgreSQL MySQL

Replication Protocol

Clock Skew Tracking

Transaction Manager

Tunable Read Latency

Official Jepsen Tests

© 2019 All rights reserved. 38 Replication Protocol

© 2019 All rights reserved. 39 Every Table is Automatically Sharded

… … … … … …

… … …

… … …

… … …

SHARDING = AUTOMATIC PARTITIONING OF TABLES tablet 1’

© 2019 All rights reserved. 40 Replication Done at Shard Level

Tablet #1

Tablet Peer 2 on Node Y

Tablet Peer 1 on Node X Tablet Peer 3 on Node Z

tablet 1’

© 2019 All rights reserved. 41 Replication uses a Consensus algorithm

Raft Leader

Uses Raft Algorithm

First elect Tablet Leader

tablet 1’

© 2019 All rights reserved. 42 Writes in Raft Consensus

Write Raft Leader

Writes processed by leader:

Send writes to all peers Wait for majority to ack

tablet 1’

© 2019 All rights reserved. 43 Reads in Raft Consensus

Read Raft Leader

Reads handled by leader

Uses Leader Leases for performance

tablet 1’

© 2019 All rights reserved. 44 Spanner vs. its Open Source Derivatives

Feature Google Spanner YugaByte DB CockroachDB TiDB

Cost Expensive Free Free Free

SQL API Compatibility Proprietary PostgreSQL PostgreSQL MySQL

Replication Protocol Paxos Raft Raft Raft

Clock Skew Tracking

Transaction Manager

Tunable Read Latency

Official Jepsen Tests

© 2019 All rights reserved. 45 Transactions and Clock Skew

Tracking

© 2019 All rights reserved. 46 Multi-Shard Transactions

BEGIN TXN UPDATE k1 UPDATE k2 COMMIT

k1 and k2 may belong to different shards

Belong to different Raft groups on completely different nodes

tablet 1’

© 2019 All rights reserved. 47 What do Distributed Transactions need?

BEGIN TXN UPDATE k1 UPDATE k2 COMMIT

Raft Leader Raft Leader

Updates should get written at the same physical time

But how will nodes agree on time? tablet 1’

© 2019 All rights reserved. 48 Use a Physical Clock

You would need an Atomic Clock or two lying around

Atomic Clocks are highly available, globally synchronized clocks with tight error bounds

Jeez! I’m fresh out of those.

Most of my physical clocks are never synchronized

tablet 1’

© 2019 All rights reserved. 49 Hybrid Logical Clock or HLC

Combine coarsely-synchronized physical clocks with Lamport Clocks to track causal relationships

(physical component, logical component) synchronized using NTP a monotonic counter

Nodes update HLC on each Raft exchange for things like heartbeats, leader election and data replication

tablet 1’

© 2019 All rights reserved. 50 Spanner vs. its Open Source Derivatives

Feature Google Spanner YugaByte DB CockroachDB TiDB

Cost Expensive Free Free Free

SQL API Compatibility Proprietary PostgreSQL PostgreSQL MySQL

Replication Protocol Paxos Raft Raft Raft

TrueTime Atomic Hybrid Logical Clock + Hybrid Logical Clock Single Timestamp Gen Clock Skew Tracking Clock Max Clock Skew + Max Clock Skew ⇒ No Tracking Needed

Special Node for Transaction Manager At Every Node At Every Node At Every Node Timestamp Generation

Tunable Read Latency

Official Jepsen Tests

© 2019 All rights reserved. 51 Miscellaneous

© 2019 All rights reserved. 52 Spanner vs. its Open Source Derivatives

Feature Google Spanner YugaByte DB CockroachDB TiDB

Cost Expensive Free Free Free

SQL API Compatibility Proprietary PostgreSQL PostgreSQL MySQL

Replication Protocol Paxos Raft Raft Raft

TrueTime Atomic Hybrid Logical Clock + Hybrid Logical Clock Single Timestamp Gen Clock Skew Tracking Clock Max Clock Skew + Max Clock Skew ⇒ No Tracking Needed

Special Node for Transaction Manager At Every Node At Every Node At Every Node Timestamp Generation

Tunable Read Latency ✓ ✓ ❌ ❌

Official Jepsen Tests Unknown ✓ ✓ ❌

© 2019 All rights reserved. 53 Read more at blog.yugabyte.com

Storage Layer blog.yugabyte.com/distributed-postgresql-on-a-google-spanner-architecture-storage-layer

Query Layer blog.yugabyte.com/distributed-postgresql-on-a-google-spanner-architecture-query-layer

© 2019 All rights reserved. 54 Questions?

Try it at docs.yugabyte.com/quick-start

Check us out on GitHub https://github.com/YugaByte/yugabyte-db

© 2019 All rights reserved. 55