TiDB as an HTAP

主讲人:PingCap CEO 刘奇 About me

• CEO and co-founder of PingCAP • Open source hacker • Infrastructure engineer • Founder of TiDB/TiKV/Codis • Infrastructure software engineer • Wandoulabs/JD Why a new database? Brief History

RDBMS NoSQL NewSQL

● Standalone RDBMS 1970s 2010 2015 Present ● NoSQL ● Middleware & Proxy Redis Google Spanner MySQL HBase Google F1 ● NewSQL PostgreSQL Cassandra TiDB Oracle MongoDB DB2 ...... NewSQL Database

● Horizontal Scalability ● ACID Transaction ● High Availability ● SQL at Scale OLTP & OLAP

OLTP OLAP

ETL 8am 2pm 6pm 2am

Database

Where is my data? ERP Is the data out-of-date? Data Warehouse CRM Why two separate systems

● Huge data size ● Complex query logic ● Latency VS Throughput ● Point query VS Full range scan ● Transaction & lsolation level OLAP + OLTP = HTAP

Hybrid Transactional / Analytical Processing ● ACID Transcation ● Real-time analysis ● SQL HTAP How do we build the new database What is TiDB

● Scalability as the first class feature ● SQL is necessary ● Compatible with MySQL, in most cases ● OLTP + OLAP = HTAP (Hybrid Transactional/Analytical Processing) ● 24/7 availability, even in case of datacenter outages ● Open source, of course Architecture

Stateless SQL Layer

Metadata / Timestamp request TiDB ... TiDB ... TiDB

Placement Driver (PD)

Raft Raft Raft TiKV ... TiKV TiKV TiKV Control flow: Balance / Failover

Distributed Storage Layer TiKV - Overview

• Region: a set of continuous key-value pairs • Data is organized/stored/replicated by Regions • Highly layered

TiKV Key Space RPC (gRPC)

Node A Transaction 256MB

MVCC [ start_key, Raft end_key) RocksDB (-∞, +∞) Raft Raft Sorted Map

Node B Node C Raft TiKV - Multi-Raft

Multiple raft groups in the cluster, one group for each region.

Client

RPC RPC RPC RPC

Store 1 Store 2 Store 3 Store 4

Region 1 Region 1 Region 2 Region 1

Region 2 Region 3 Region 2 Region 5

Region 5 Raft Region 5 Region 4 Region 3 Group Region 4 Region 4 Region 3

TiKV node 1 TiKV node 2 TiKV node 3 TiKV node 4 TiKV - Horizontal Scale

Node B

Region 1^ Region 1 Region 2 Region 3 Region 1* Node D

Region 2 Region 2 Region 3 Region 3 Node C Node A Add Replica Three steps to move a leader replica ● Transfer Leader ● Add Replica Node E ● Remove Replica PD - Overview

• Meta data management • Load balance management

Route Info TiKV Client PD

Node/Region Management Info Command

TiKV Cluster TiKV TiKV TiKV … ... TiKV PD - TiKV Cluster Managment

PD Node 1 HeartBeat Region A Cluster Info

Region B Scheduling Command Movement Admin

Scheduling Config Node 2 Stratege

Region C TiDB - Overview

• The stateless SQL layer

Optimized SQL AST Logical Plan Logical Plan

Statistics Cost Model

Selected TiDB SQL Layer Physical Plan

TiKV TiKV TiKV TiDB - Distributed SQL

SELECT COUNT(c1) FROM t WHERE c1 > 10 AND c2 = ‘shanghai’;

Physical Plan on TiDB Physical Plan on TiKV (index scan)

Partial Aggregate Final Aggregate COUNT(c1) SUM(COUNT(c1)) Row COUNT(c1) Filter c2 = “shanghai” DistSQL Scan Row

Read Row Data by RowID COUNT(c1) COUNT(c1) COUNT(c1) RowID TiKV Read Index TiKV idx1: (10, +∞) TiKV TiDB - Cost Based Optimizer

• Predicate Pushdown • Column Pruning • Eager Aggregate • Convert Subquery to Join • Statistics framework • CBO Framework – Index Selection – Join Operator Selection • Hash join • Index lookup join • Sort-merge join – Stream Operators VS Hash Operators OLTP + OLAP

TiDB TiKV Scheduler OLTP Jobs Query Job TiDB TiKV High Queue Priority TiDB TiKV Worker

. Worker OLAP . Query TiDB Jobs . . Worker Low TiDB Priority TiKV Worker Pool HTAP Ecosystem: Beyond TiDB and SQL Spark on TiDB

PD PD

TSO/Data location Data location

PD

PD Cluster

Meta data Spark Driver TiDB Job Application TiKV TiKV TiDB DistSQL API DistSQL API

Worker TiDB TiKV TiKV Syncer Worker TiDB TiKV TiKV Worker TiDB ...... Spark Cluster

TiDB Cluster TiKV Cluster (Storage) TiSpark Spark on TiDB

• Spark ecosystem • TiKV Connector is better than JDBC connector • Index support • Complex Calculation Pushdown • CBO – Pick up right Access Path – Join Reorder • Priority & Isolation Level Future plan

• Code Generation • MPP Engine • Mixed storage engine (Columnar / Row-based) • Heterogeneous computing (CPU/GPU/FPGA) THANKS