
Extending PostgreSQL to Google Spanner Architecture
YugabyteDB
Karthik Ranganathan, Co-founder/CTO

Introduction to YugabyteDB

What is Distributed SQL? A Revolutionary Database Architecture
SQL & Transactions | Massive Scalability | Geo Distribution | Ultra Resilience

YugabyteDB = distributed SQL database
+ high performance (low latency)
+ cloud native (runs on Kubernetes, VMs, bare metal)
+ open source (Apache 2.0)

Fastest Growing Distributed SQL Project!
Growth in 1 year: Clusters ▲ 12x, Slack ▲ 12x, GitHub Stars ▲ 7x
We ❤ stars! Give us one: github.com/YugaByte/yugabyte-db
Join our community: yugabyte.com/slack

Architecture

SQL Design Goals
• PostgreSQL compatible
  ○ Re-uses the PostgreSQL query layer
  ○ New changes do not break existing PostgreSQL functionality
• Enable migrating to newer PostgreSQL versions
  ○ New features are implemented in a modular fashion
  ○ Integrate with new PostgreSQL features as they become available
  ○ E.g. moved from PostgreSQL 10.4 → 11.2 in 2 weeks!
• Cloud native architecture
  ○ Fully decentralized to enable scaling to 1000s of nodes
  ○ Tolerates rack/zone and datacenter/region failures automatically
  ○ Runs natively in containers and Kubernetes
  ○ Zero-downtime rolling software upgrades and machine reconfiguration

Spanner design + Aurora-like Compatibility
Feature                        Amazon Aurora   Google Spanner   YugabyteDB
Horizontal Write Scalability   ❌              ✓                ✓
Fault Tolerance with HA        ✓               ✓                ✓
Globally Consistent Writes     ❌              ✓                ✓
Scalable Consistent Reads      ❌              ✓                ✓
SQL Support                    ✓               ❌               ✓

Layered Architecture
• YugaByte SQL (YSQL): PostgreSQL-Compatible Distributed SQL API
• DocDB: Spanner-Inspired Distributed Document Store
• Cloud Neutral: No Specialized Hardware Needed

Single-Node PostgreSQL

Phase #1: Extend to Distributed SQL

Phase #2: Perform More SQL Pushdowns

Phase #3: Enhance Optimizer
• Table-statistics-based hints
  ○ Piggyback on the current PostgreSQL optimizer, which uses table statistics
• Geographic-location-based hints
  ○ Based on "network" cost
  ○ Factors in network latency between nodes and tablet placement
• Rewriting the query plan for distributed SQL
  ○ Extend PostgreSQL "plan nodes" for distributed execution

We'll focus only on phase #1: first the storage layer (DocDB), then the query layer (YSQL).

DocDB Architecture

DocDB: Sharding

Every Table is Automatically Sharded
SHARDING = AUTOMATIC PARTITIONING OF TABLES

Supports multiple sharding strategies
● Hash Sharding (see the code sketch below)
  ○ Ideal for massively scalable workloads
  ○ Distributes data evenly across all the nodes in the cluster
● Range Sharding
  ○ Efficiently query a range of rows by primary key values
  ○ Example: look up all keys between a lower and an upper bound

Hash Sharding

Hash Sharding Examples

Range Sharding

Range Sharding Examples

DocDB: Replication

DocDB uses Raft consensus to replicate data (https://raft.github.io/)
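Before moving on to replication, the hash sharding described a few slides above can be pictured with a small sketch: each row's primary key is hashed into a fixed hash space, and every tablet owns a contiguous slice of that space. The Go program below is a hypothetical illustration only; the hash function (FNV-1a), the 16-bit hash space split across four tablets, and all names are assumptions made for the example, not YugabyteDB's actual implementation.

package main

import (
	"fmt"
	"hash/fnv"
)

// A tablet owns a contiguous range [start, end) of a 16-bit hash space.
// (Illustrative type; not YugabyteDB's real metadata format.)
type Tablet struct {
	name       string
	start, end uint32
}

// hashKey maps a primary-key value into the 16-bit hash space.
// FNV-1a is used here purely for illustration.
func hashKey(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() & 0xFFFF
}

// tabletFor picks the tablet whose hash range contains the key's hash.
func tabletFor(key string, tablets []Tablet) Tablet {
	hv := hashKey(key)
	for _, t := range tablets {
		if hv >= t.start && hv < t.end {
			return t
		}
	}
	panic("hash ranges must cover the whole hash space")
}

func main() {
	// Hash space 0x0000..0xFFFF split evenly across four tablets.
	tablets := []Tablet{
		{"tablet-1", 0x0000, 0x4000},
		{"tablet-2", 0x4000, 0x8000},
		{"tablet-3", 0x8000, 0xC000},
		{"tablet-4", 0xC000, 0x10000},
	}
	for _, k := range []string{"user-1", "user-2", "order-17"} {
		t := tabletFor(k, tablets)
		fmt.Printf("key %-8s -> hash %#06x -> %s\n", k, hashKey(k), t.name)
	}
}

Range sharding works the same way except that tablets own ranges of the primary-key values themselves rather than of their hashes, which is what makes bounded range scans by primary key efficient.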
Raft vs Paxos
• Easier to understand than Paxos
• Dynamic membership changes (e.g. change machine types)
• Heidi Howard will talk about "Raft vs Paxos"

Dynamic Membership Changes
• On adding a node, it gets added to the Raft group
• The replication factor of the tablet temporarily goes to 4
• Data gets replicated to the newly added node
• An existing node is removed from the Raft group
• Data is now stored on three different nodes, with no downtime

Many other performance enhancements needed!

Multi-region Raft reads are slow!
• Each read requires a round trip to a majority of replicas
• A round trip in a multi-region deployment = high latency

Leader Leases to the Rescue!

Monotonic Clocks
From the Jepsen testing report.

Group Commits

DocDB: Storage

Uses a heavily modified RocksDB
• Key-to-document store
• The WAL log of RocksDB is not used (the Raft log serves that purpose)
• MVCC is performed at a higher layer

Logical Encoding of a Document

Encoding of DocDB data

Many Performance Enhancements

DocDB: Deployment and Failover

Multi-Zone Deployment
● Multi-region is similar
● 6 tablets in the table
● Replication factor = 3
● 1 replica per zone

Balancing across zones and regions
Tablet leaders are balanced across:
● Zones
● Nodes in a zone
● Per-table balancing

Tolerating Zone Outage
● New tablet leaders are re-elected (~3 sec)
● No impact when a tablet follower fails
● DB unavailable during the re-election window
● Follower reads are OK

Automatic rebalancing
● New leaders are evenly rebalanced
● After 15 minutes, data is re-replicated (if possible)
● When a failed node recovers, it automatically catches up

How long to failover in a multi-zone setup?

DocDB: Distributed Transactions

Google Percolator

Percolator = Single Timestamp Oracle
• Not scalable
• Does not work for multi-region deployments

Google Spanner

Spanner = Distributed Time Synchronization
• Scalable
• Low-latency, multi-region deployments
• 2-phase commit

We picked a Google Spanner-like design for distributed transactions in YugabyteDB.

YugabyteDB distributed transactions:
• Based on 2-phase commit
• Use hybrid logical clocks

Fully Decentralized Architecture
• No single point of failure or bottleneck
  ○ Any node can act as a Transaction Manager
• Transaction status table is distributed across multiple nodes
  ○ Tracks the state of active transactions
• Transactions have 3 states: Pending, Committed, Aborted
• Reads are served only for committed transactions
  ○ Clients never see inconsistent data
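As a toy illustration of the visibility rule on the slide above (reads are served only for committed transactions), the sketch below keeps a per-transaction status record and exposes a provisional write only once its owning transaction reaches the Committed state. All types and names here are invented for the example; in the real system the transaction status table is itself sharded and Raft-replicated rather than an in-memory map.

package main

import "fmt"

// Transaction states as described on the slide.
type TxnStatus int

const (
	Pending TxnStatus = iota
	Committed
	Aborted
)

// statusTable stands in for the distributed transaction status table.
type statusTable map[string]TxnStatus

// A provisional (intent) record written by an in-flight transaction.
type provisionalRecord struct {
	key, value string
	txnID      string
}

// visibleValue applies the visibility rule: a reader sees a provisional
// write only after its transaction has committed; otherwise it keeps
// seeing the last committed value.
func visibleValue(rec provisionalRecord, committedValue string, st statusTable) string {
	if st[rec.txnID] == Committed {
		return rec.value
	}
	return committedValue // Pending or Aborted: fall back to the committed value
}

func main() {
	st := statusTable{"txn-42": Pending}
	rec := provisionalRecord{key: "k1", value: "new", txnID: "txn-42"}

	fmt.Println(visibleValue(rec, "old", st)) // "old": txn still pending
	st["txn-42"] = Committed
	fmt.Println(visibleValue(rec, "old", st)) // "new": txn committed
}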
Isolation Levels
• Serializable Isolation
  ○ Read-write conflicts get auto-detected
  ○ Both reads and writes in read-write transactions need provisional records
  ○ Maps to SERIALIZABLE in PostgreSQL
• Snapshot Isolation
  ○ Write-write conflicts get auto-detected
  ○ Only writes in read-write transactions need provisional records
  ○ Maps to REPEATABLE READ, READ COMMITTED & READ UNCOMMITTED in PostgreSQL
• Read-only Transactions
  ○ Lock-free

Distributed Transactions - Write Path
(A sequence of diagrams walks through the write path step by step.)

DocDB: Software Defined Atomic Clocks

Distributed (aka Multi-Shard) Transactions
  BEGIN TXN
    UPDATE k1
    UPDATE k2
  COMMIT
• k1 and k2 may belong to different shards
• Those shards belong to different Raft groups on completely different nodes

What do Distributed Transactions need?
(Same transaction as above; k1 and k2 live on different Raft leaders.)
• The updates should get written at the same time
• But how will the nodes agree on time?

Atomic Clocks
"You would need an Atomic Clock or two lying around. Atomic Clocks: highly available, globally synchronized clocks, tight error bounds."
"Jeez! I'm fresh out of those. Most of my physical clocks are never synchronized."

Use a Physical Clock

Hybrid Logical Clock (HLC)
• Combines coarsely-synchronized physical clocks with Lamport clocks to track causal relationships
• An HLC = (physical component, logical component)
  ○ Physical component: synchronized using NTP
  ○ Logical component: a monotonic counter
• Nodes update the HLC on each Raft exchange, e.g. heartbeats, leader election and data replication (sketched in code below)

YSQL Architecture

Existing PostgreSQL Architecture
Reuse PostgreSQL: a SQL app connects to the postmaster (authentication, authorization); queries flow through the query layer (rewriter, planner, optimizer, executor); the WAL writer, background writer and other processes persist data to disk.

DocDB as Storage Engine
SQL app, postmaster (authentication, authorization), …
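Returning to the Hybrid Logical Clock slide above: a generic HLC (in the style of Kulkarni et al.) can be sketched in a few lines. The timestamp carries a physical component plus a logical counter, advances on local events, and absorbs remote timestamps observed on messages such as Raft heartbeats. This is a textbook-style sketch under stated assumptions, not YugabyteDB's implementation; the types, field sizes and function names are invented.

package main

import (
	"fmt"
	"time"
)

// HLC is a hybrid logical clock timestamp: a physical component (wall
// clock, NTP-synchronized in practice) plus a monotonic logical counter.
type HLC struct {
	physical int64  // wall-clock microseconds (illustrative resolution)
	logical  uint32 // breaks ties when the physical component repeats
}

type Clock struct{ last HLC }

func wallClock() int64 { return time.Now().UnixMicro() }

// Now advances the clock for a local event or an outgoing message.
func (c *Clock) Now() HLC {
	pt := wallClock()
	if pt > c.last.physical {
		c.last = HLC{physical: pt, logical: 0}
	} else {
		c.last.logical++ // wall clock did not move forward: bump the counter
	}
	return c.last
}

// Update merges a timestamp received from another node (e.g. on a Raft
// heartbeat or replication message) so the local clock never falls
// behind anything it has observed.
func (c *Clock) Update(remote HLC) HLC {
	pt := wallClock()
	switch {
	case pt > c.last.physical && pt > remote.physical:
		c.last = HLC{physical: pt, logical: 0}
	case remote.physical > c.last.physical:
		c.last = HLC{physical: remote.physical, logical: remote.logical + 1}
	case c.last.physical > remote.physical:
		c.last.logical++
	default: // equal physical components: take the max counter and advance
		if remote.logical > c.last.logical {
			c.last.logical = remote.logical
		}
		c.last.logical++
	}
	return c.last
}

func main() {
	var a, b Clock
	t1 := a.Now()
	t2 := b.Update(t1) // b observes a's timestamp and moves strictly past it
	fmt.Println(t1, t2)
}

The property this buys is causal ordering: if event A happens before event B (including across nodes that exchange messages), A's HLC timestamp is smaller than B's, even though the physical clocks are only coarsely synchronized via NTP.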