SYSTEMS FOR CLOUD DATA ANALYTICS
Peter Boncz

Credits
• David DeWitt & Willis Lang (Microsoft) – cloud DW material
• Stratis Viglas (Google) – extreme computing course (University of Edinburgh)
• Marcin Zukowski (Snowflake)
• Ippokratis Pandis (Redshift/Spectrum)
• Spark team
  – Matei Zaharia, Xiangrui Meng (Stanford)
  – Ion Stoica, Xifan Pu (UC Berkeley)
  – Reynold Xin, Alex Behm (Databricks)

Is it safe to have enterprise data in the Cloud?

2005: No way! Are you crazy?

2012: Don’t think so... But wait, we store our email where?

2018: Of course!

Getting a database in the cloud

– "Hi! I'm a Data Scientist!"
– "Hello! I am your account manager at X!"
– "I'm looking for a database for our cloud system."
– "Sure thing! Let's install our product, DBMS X, for you!"
– "Awesome! It seems to work!"
– "Great. Let me send you that invoice!"
– "Just a sec… how much does the storage cost?"
– "Hold on, let me check that."
– "Wait, what? And the system is elastic, right? And I only pay for what I use, right?"
– "Mommy!!!"

Traditional DB systems and the cloud
• Designed for:
  – Small, fixed, optimized clusters of machines
  – Constrained amount of data and resources

• Can be delivered via the Cloud
  – Reduces the complexity of hardware setup and software installation
  – But: no elasticity
  – No cheap storage
  – Not designed for the cloud's poor stability
  – Not easy to use
  – Not "always on"
  – ...

Data in the Cloud
• Data traditional DW systems are built for:
  – Assumes predictable, slow-evolving internal data
  – Complex ETL (extract-transform-load) pipelines and physical tuning
  – Limited number of users and use cases
  – OK to cost $100K per TB

• Data in the cloud:
  – Dynamic, external sources: web, logs, mobile devices, sensor data…
  – ELT instead of ETL (data transformation inside the system)
  – Often in semi-structured form (JSON, XML, Avro)
  – Access required by many users, with very different use cases
  – 100 TB+ volumes common

10,000 ft. view: Complexity vs Cost

Only 2 options 5 years ago!

Roll-your-own (RYO)
▪ Buy & install a cluster of servers
▪ Buy, install, & configure software (Vertica, Asterdata, Greenplum, …)
▪ High complexity
▪ Medium capex and opex

Buy an appliance
▪ Microsoft APS, …
▪ High capex, low opex
▪ Low complexity
▪ Gold standard for performance

[Chart: complexity (deployment & operational, low–high) vs. cost (capex + opex, low–high), plotting RYO and Appliance]

10,000 ft. view: Complexity vs Cost

Use a SAAS DW in the cloud
▪ AWS Redshift, MSFT SQL DW, Snowflake, BigQuery
▪ Low complexity
▪ No capex, low opex

Roll-your-own-Cloud (RYOC)
▪ Rent a cluster of cloud servers
▪ Buy, install, & configure software (Spark, Hive, Vertica, Asterdata, Greenplum, …)
▪ Medium to high complexity
▪ Low capex
▪ Medium opex

[Chart: complexity (deployment & operational, low–high) vs. cost (capex + opex, low–high), plotting RYO, RYOC, Appliance, and Cloud DW]

Scalability and the price of agility

[Chart: time to make an adjustment (months / weeks / minutes) vs. cost of making an adjustment (low / medium / high): an Appliance takes months to adjust, RYO and RYOC take weeks, a Cloud DW takes minutes]

Why Cloud DW?

• No CapEx and low OpEx
• Go from conception to insight in hours
• Rock bottom storage prices (Azure, AWS S3, GFS)
• Flexibility to scale up/down compute capacity
• Simple upgrade process

Parallel Processing in Analytical DBs

▪ Alternative architectures
  – Shared-memory
  – Shared-disk/storage
  – Shared-nothing ("The Case for Shared Nothing," Stonebraker, HPTS '85)
▪ Partitioned tables
▪ Partitioned parallelism

Shared-Nothing

• Commodity servers connected via commodity networking
• DB storage is "strictly local" to each node: co-located compute and storage

[Diagram: Node 1 … Node K, each with its own CPU, memory, and local storage, connected by an interconnection network]

• Design scales extremely well

Shared Disk/Storage

• Commodity servers connected to each other and storage using commodity networking

• DB is stored on "remote storage" (e.g. a SAN, S3, Azure Storage)
• Local disks are used for caching DB pages, temp files, …
• The network can limit scaling, as it must carry the I/O traffic

[Diagram: Node 1 … Node K, each with CPU and memory, connected to a shared Storage Area Network]

Table Partitioning

What? Distribute rows of each table across multiple storage devices
Why?
  • spread I/O load
  • parallel query execution
  • data lifecycle management
How? Hash, Round Robin, Range (see the sketch below)
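A minimal sketch of how this is declared in practice, using Redshift-flavored DDL (table and column names are illustrative; other MPP systems use similar but not identical clauses): DISTSTYLE KEY hash-partitions on a column, DISTSTYLE EVEN spreads rows round-robin, and DISTSTYLE ALL replicates a small table to every node.

```sql
-- Hash partitioning: rows with the same id always land on the same node/slice.
CREATE TABLE customers (
    id      INT,
    name    VARCHAR(64),
    amt_due DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (id);

-- Round-robin partitioning: rows spread evenly, useful when there is no dominant join key.
CREATE TABLE weblog_staging (
    ts  TIMESTAMP,
    url VARCHAR(2048)
)
DISTSTYLE EVEN;

-- Full replication: a small dimension table copied to every node, so joins to it stay local.
CREATE TABLE countries (
    code VARCHAR(2),
    name VARCHAR(64)
)
DISTSTYLE ALL;
```

Range partitioning (typically by date) is what supports the data-lifecycle use above, since old ranges can be dropped cheaply.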

Shared-Nothing Example

SELECT Name, Item
FROM Orders O, Customers C
WHERE O.CID = C.ID

The coordinator (parser, optimizer, catalogs, execution coordinator) ships the join to the compute nodes.
Join can be done "locally" – no data movement: each node joins its own partitions on O.CID = C.ID.

[Diagram: example of "partitioned parallelism" – the Orders table (CID, OID, Item) is hash-partitioned on CID and the Customers table (ID, Name, AmtDue) is hash-partitioned on ID, so matching rows sit on the same node]

Shared-Nothing Example

SELECT Name, Item
FROM Orders O, Customers C
WHERE O.CID = C.ID

Now the join cannot be done "locally": data movement is needed. The biggest table (Orders) has to be shuffled, requiring all-to-all communication (see the sketch below).

[Diagram: same query, but the Orders table (CID, OID, Item) is hash-partitioned on OID while the Customers table (ID, Name, AmtDue) is hash-partitioned on ID, so joining rows are not co-located]
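A hedged illustration of why the distribution key matters here (again Redshift-flavored DDL, names illustrative): if both tables are distributed on the join column, each node can join its local partitions and no all-to-all shuffle of Orders is needed.

```sql
-- Distribute Orders on the join column CID ...
CREATE TABLE orders (
    cid  INT,
    oid  INT,
    item VARCHAR(64)
)
DISTSTYLE KEY DISTKEY (cid);

-- ... and Customers on ID: rows that join with each other hash to the same node.
CREATE TABLE customers (
    id      INT,
    name    VARCHAR(64),
    amt_due DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (id);

-- This join can now run entirely node-locally, with no data movement.
SELECT c.name, o.item
FROM orders o
JOIN customers c ON o.cid = c.id;
```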

Shared-Storage Example

SELECT Name, Item
FROM Orders O, Customers C
WHERE O.CID = C.ID

Both tables are remote: the compute nodes must read Orders and Customers over the LAN from shared storage before joining.

[Diagram: coordinator (parser, optimizer, execution coordinator) and compute nodes 1–2 connected over a LAN to shared storage holding the Orders table (hash-partitioned on CID) and the Customers table (hash-partitioned on ID)]

For 30+ years

• Shared-nothing has been the "gold standard"
  – Teradata, Netezza, DB2/PE, SQL Server PDW, ParAccel, Greenplum, …
• Simplest design
• Excellent scalability
• Minimizes data movement
  – Especially for DBs with a star schema design
• The "cloud" has changed the game for shared-nothing (2017)

Outline

• Part 1: Intro
• Part 2: Databases in the Cloud
  – Amazon Redshift
  – Snowflake
  – Microsoft Azure Synapse Analytics
  – Google BigQuery
  – Databricks
• Part 3: Cloud Research Challenges

Amazon (AWS) Redshift
• Classic shared-nothing design with locally attached storage
  – Engine is the ParAccel database system (classic MPP, JIT-compiled C++)
• Leverages AWS services
  – EC2 compute instances
  – S3 storage system
  – Virtual Private Cloud (VPC)
• Leader in market adoption

A Redshift Instance
• A single leader node (catalogs, planning) fronts one or more compute nodes (EC2 instances)
• One slice per core; memory, storage, & data are partitioned among the slices
• Hash & round-robin table partitioning, e.g. HashPartition on Customers(ID): the Customers (ID, Name, AmtDue) rows are spread over slices 1–4 on compute nodes 1 and 2

[Diagram: application → leader node → compute node 1 (slices 1–2) and compute node 2 (slices 3–4), each slice holding part of the Customers table]

Within a slice
Two sort options:
  1) Compound sort key
  2) "Interleaved" sort key (multidimensional sorting)

[Diagram: the ID, NAME, and AMT columns of a table stored column-wise within a slice]

Columns stored in 1MB blocks

Min and Max value of each block retained in a “zone” map

Rich collection of compression options (RLE, dictionary, gzip, …)
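A hedged sketch of the two sort options in Redshift DDL (table and column names are illustrative): a compound sort key orders blocks by the leading column(s) first, so zone maps prune best on predicates over that leading column; an interleaved sort key gives each listed column equal weight, helping multidimensional filters at some extra load/maintenance cost.

```sql
-- Compound sort key: data ordered by order_date first, then cid.
CREATE TABLE orders_compound (
    oid        INT,
    cid        INT,
    order_date DATE,
    item       VARCHAR(64)
)
DISTKEY (cid)
COMPOUND SORTKEY (order_date, cid);

-- Interleaved sort key: order_date and cid weighted equally, so filters on either prune well.
CREATE TABLE orders_interleaved (
    oid        INT,
    cid        INT,
    order_date DATE,
    item       VARCHAR(64)
)
DISTKEY (cid)
INTERLEAVED SORTKEY (order_date, cid);
```

With either layout, a range filter such as WHERE order_date BETWEEN '2018-01-01' AND '2018-01-31' lets the zone maps (the per-1MB-block min/max values above) skip blocks that cannot contain matches.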

Unique Fault Tolerance Approach

• Each 1MB block gets replicated on a different compute node
• …and also on S3
• S3, in turn, triply replicates each block

[Diagram: leader node (catalogs) over compute nodes 1 and 2 (slices 1–4); each slice's blocks are mirrored on another node and backed up to S3]

Handling Node Failures

Assume Node 1 fails:
• Alternative #1: Node 2 processes the load until Node 1 is restored
• Alternative #2: a new node (Node 3) is instantiated; it processes the workload using data in S3 until its local disks are restored

[Diagram: leader node (catalogs) with Node 1 replaced by Node 3; the replacement node reads blocks from S3 while its local copies are rebuilt]

Redshift Summary

• Highly successful cloud SAAS DW service
• Classic shared-nothing design
• Leverages S3 to handle node and disk failures
• Key strength: performance through use of local storage
• Key weakness: compute cannot be scaled independently of storage (and vice versa)

Redshift Spectrum

[Figure: Redshift Spectrum architecture]
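The slides show Spectrum only as figures; in short, it lets Redshift query data that stays in S3 without loading it (compare "Spectrum ~= Polybase" later in the deck). A hedged sketch of the DDL involved: schema, table, bucket, and role names are illustrative, and the external schema must point at a Glue/Athena data catalog plus an IAM role with S3 access.

```sql
-- Register an external schema backed by the AWS data catalog (the role ARN is a placeholder).
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'demo_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Describe Parquet files sitting in S3 as an external table ...
CREATE EXTERNAL TABLE spectrum_demo.weblogs (
    ts  TIMESTAMP,
    cid INT,
    url VARCHAR(2048)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/weblogs/';

-- ... and join it against a regular, locally stored Redshift table.
SELECT c.name, COUNT(*) AS hits
FROM spectrum_demo.weblogs w
JOIN customers c ON w.cid = c.id
GROUP BY c.name;
```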

Outline

• Part 1: Intro
• Part 2: Databases in the Cloud
  – Amazon Redshift
  – Snowflake
  – Microsoft Azure Synapse Analytics
  – Google BigQuery
  – Databricks
• Part 3: Cloud Research Challenges

Snowflake Elastic DW

• Shared-storage design
  – Compute decoupled from storage
  – Highly elastic
• Leverages AWS
  – Tables stored in S3 but dynamically cached on local storage
  – Clusters of EC2 instances used to execute queries
• Rich data model
  – Schema-less ingestion of JSON documents (see the sketch below)
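A hedged sketch of that schema-less ingestion in Snowflake SQL (table and field names are illustrative): JSON documents land in a VARIANT column without declaring a schema, and fields are extracted with path expressions and casts at query time.

```sql
-- One VARIANT column holds entire JSON documents.
CREATE TABLE raw_events (
    v VARIANT
);

-- Extract JSON fields with path syntax and casts; no up-front schema required.
SELECT
    v:device:id::STRING  AS device_id,
    v:reading::FLOAT     AS reading,
    v:ts::TIMESTAMP_NTZ  AS event_time
FROM raw_events
WHERE v:device:type::STRING = 'sensor';
```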

Snowflake Architecture

[Diagram: three layers]
• Cloud services layer: authentication & access control, infrastructure manager, query optimizer, transaction manager, security, metadata storage
• Compute layer: multiple virtual warehouses, each a cluster of EC2 instances of different sizes (e.g. 2, 4, or 8 nodes); their local disks are strictly used as data caches
• Data storage layer: S3, where the database tables are stored

Table Storage
• Rows of each table are stored in multiple S3 files
  ▪ Inside a file, rows are stored in columnar fashion: a file header followed by the values of each column (ID, NAME, AMT_DUE)
  ▪ Not able to support hash or round-robin partitioning, as files are created strictly as rows are inserted into the table
• "Standard" compression schemes (gzip, RLE, …) available
• Each file is ~10MB
• Min & max values of each column of each file of each table are kept in the catalog; used for pruning at run time (see the example below)

[Diagram: Customer_File1, Customer_File2, …, Customer_FileN, each with a header and column-wise ID, NAME, AMT_DUE values]
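A hedged illustration of what that per-file min/max metadata buys (table and column names are illustrative): a selective range predicate lets the engine skip every file whose [min, max] range for the filtered column cannot overlap the predicate, so only a fraction of the S3 files are read.

```sql
-- Only files whose recorded amt_due range overlaps [50000, +inf) are fetched from S3;
-- all other files are pruned using the catalog's min/max metadata.
SELECT id, name, amt_due
FROM customers
WHERE amt_due >= 50000;
```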

Virtual Warehouses

• A virtual warehouse is a dynamically created cluster of EC2 instances
• Local disks cache file headers & table columns
• Three sizing mechanisms (see the sketch below):
  1) Number of EC2 instances
  2) "Size" of each instance (# cores, I/O capacity)
  3) Auto-scaling of one virtual warehouse

[Diagram: compute layer – a virtual warehouse of EC2 instances N1–N4 with a local data cache]
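A hedged sketch of those knobs in Snowflake SQL (warehouse names and settings are illustrative): WAREHOUSE_SIZE picks the instance class, MIN/MAX_CLUSTER_COUNT enables multi-cluster auto-scaling, and AUTO_SUSPEND/AUTO_RESUME make the compute disappear when idle.

```sql
-- A medium warehouse for ad-hoc analysts that scales out to 3 clusters under load
-- and suspends itself after 5 minutes of inactivity.
CREATE WAREHOUSE analyst_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  AUTO_SUSPEND      = 300
  AUTO_RESUME       = TRUE;

-- A separate, larger warehouse for the nightly ELT job; both can query the same database.
CREATE WAREHOUSE etl_wh
  WAREHOUSE_SIZE = 'XLARGE'
  AUTO_SUSPEND   = 60;

-- Resize on the fly when the workload changes.
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'XXLARGE';
```

Giving each workload its own warehouse over the same database is exactly the separation of compute and storage the next slide describes.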

Separate Compute & Storage

• Queries against the same DB can be given the resources to meet their needs – a truly unique idea
• The DBA can dynamically adjust the number & types of nodes
• This flexibility is simply not feasible with a shared-nothing approach such as Redshift

[Diagram: two virtual warehouses of different sizes (2 nodes vs. 8 nodes) running queries Q1 and Q2 against the same Sales DB in S3]

Query Execution

• Each query runs on a single virtual warehouse
• Standard parallel query algorithms
• Modern SQL engine:
  – Columnar storage
  – Vectorized executor
• Updates create new files!
  – Artifact of S3 files not being updatable
  – But makes time travel possible (see the example below)
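A hedged example of the time-travel queries this makes possible (table name and offsets are illustrative): because updates write new files and the old ones are retained for the retention window, a table can be read as of an earlier moment.

```sql
-- Read the table as it looked one hour ago.
SELECT * FROM orders AT (OFFSET => -3600);

-- Read it as of a specific point in time.
SELECT * FROM orders AT (TIMESTAMP => '2018-05-01 09:00:00'::TIMESTAMP);
```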

Built-in disaster recovery and high availability

• Scale-out of all tiers: metadata (cloud services), compute (virtual warehouses), storage (database storage)
• Resiliency across multiple availability zones
  – Geographic separation
  – Separate power grids
  – Built for synchronous replication
• Fully online updates & patches (zero downtime)
• Fully managed by Snowflake

[Diagram: cloud services, virtual warehouses, and database storage spread across three availability zones]


Comprehensive data protection

• Protection against infrastructure failures: all data transparently & synchronously replicated 3+ ways across multiple datacenters
• Protection against corruption & user errors: the "time travel" feature enables instant roll-back to any point in time during the chosen retention window (query older versions T0, T1, T2 of a table, spanning new and modified data)
• Long-term data protection: daily and weekly backups to S3
• Zero-copy clones + optional export to S3 enable user-managed data copies (see the sketch below)
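A hedged sketch of those user-managed copies (names are illustrative; the export assumes a named external stage pointing at an S3 bucket already exists): a zero-copy clone shares the underlying files until either side is modified, so it is instant and initially consumes no extra storage.

```sql
-- Instant, zero-copy clone for a dev/test environment; storage is shared until the data diverges.
CREATE TABLE sales_dev CLONE sales;

-- Optional export to S3 through a pre-defined external stage, for copies managed outside Snowflake.
COPY INTO @sales_export_stage/sales_snapshot/
FROM sales
FILE_FORMAT = (TYPE = 'PARQUET');
```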

Secure by design

• Authentication: multi-factor authentication; federated authentication; identity propagation; SSO command-line integration
• Access control: role-based access control; granular privileges on all objects & actions
• Data encryption: tri-secret secure (customer-owned encryption available); all data encrypted, always, everywhere; encryption keys managed automatically; automatic re-keying of data
• External validation: certified against enterprise-class requirements; PCI and HIPAA available; FedRAMP ready

Data sharing
• Enabled by Snowflake's unique cloud architecture (see the sketch below)
• Providers:
  – Secure and integrated with Snowflake's access control model
  – Only pay normal storage costs for shared data
  – No limit to the number of consumer accounts with which a dataset may be shared
• Consumers:
  – Get access to the data without any need to move or transform it
  – Query and combine shared data with existing data, or join together data from multiple publishers

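A hedged sketch of how a provider exposes a dataset (database, share, and account names are illustrative); the consumer then mounts the share as a read-only database, and no data is copied or moved.

```sql
-- Provider side: create a share and grant access to specific objects.
CREATE SHARE sales_share;
GRANT USAGE  ON DATABASE sales_db               TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   sales_db.public        TO SHARE sales_share;
GRANT SELECT ON TABLE    sales_db.public.orders TO SHARE sales_share;

-- Make the share visible to a consumer account (account name is a placeholder).
ALTER SHARE sales_share ADD ACCOUNTS = consumer_account;

-- Consumer side: mount the shared data as a database and query it in place.
CREATE DATABASE shared_sales FROM SHARE provider_account.sales_share;
```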

Snowflake Summary

• Designed for the cloud from conception
• Can directly query semi-structured data (JSON) without loading
• Compute and storage independently scalable
  – AWS S3 for table storage
  – Virtual warehouses composed of clusters of AWS EC2 instances
  – Queries can be given exactly the compute resources they need
• No management knobs
  – No indices, no create/update stats, no distribution keys, …

Outline

• Part 1: Intro
• Part 2: Databases in the Cloud
  – Amazon Redshift
  – Snowflake
  – Microsoft Azure Synapse Analytics
  – Google BigQuery
  – Databricks
• Part 3: Cloud Research Challenges

Microsoft Azure SQL Data Warehouse

• Shared-storage design
• Based on SQL Server PDW appliance software
  – …which is in turn based on SQL Server
• Leverages Azure Storage
  – …the (slower) equivalent of S3
• Query without loading (Polybase)
  – Spectrum ~= Polybase (see the sketch below)

[Figure: elastic design]
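A hedged sketch of Polybase-style "query without loading" in SQL DW (all names, locations, and formats are illustrative, and a database-scoped credential is usually required as well):

```sql
-- Point at files in Azure Blob Storage via an external data source and file format ...
CREATE EXTERNAL DATA SOURCE azure_logs
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://logs@mystorageaccount.blob.core.windows.net');

CREATE EXTERNAL FILE FORMAT parquet_format
WITH (FORMAT_TYPE = PARQUET);

-- ... then expose them as an external table that can be queried (or CTAS'd) without loading.
CREATE EXTERNAL TABLE ext_weblogs (
    ts  DATETIME2,
    cid INT,
    url NVARCHAR(2048)
)
WITH (LOCATION    = '/weblogs/',
      DATA_SOURCE = azure_logs,
      FILE_FORMAT = parquet_format);

SELECT cid, COUNT(*) AS hits
FROM ext_weblogs
GROUP BY cid;
```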

Outline

• Part 1: Intro
• Part 2: Databases in the Cloud
  – Amazon Redshift
  – Snowflake
  – Microsoft Azure Synapse Analytics
  – Google BigQuery
  – Databricks
• Part 3: Cloud Research Challenges

Google BigQuery

• Separate storage and compute
• Leverages Google's internal storage & execution stacks
  – Colossus distributed file system
  – DremelX query executor
  – Jupiter networking stack
  – Borg resource allocator
• No knobs, no indices, …
• Serverless ➔ you do not start any machines, Google just runs your queries (it is much more like AWS Athena than AWS Redshift)

BigQuery Tables

• Stored in the Colossus FS
  – Partitioned by day (optionally; see the sketch below)
• Columnar storage (Capacitor)
  – RLE compression
  – Sampling used to pick the sort order
  – Columns partitioned across multiple disks
• Also "external" tables
  – JSON, CSV & Avro formats
  – Google Drive and Cloud Storage
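A hedged sketch in BigQuery standard SQL (dataset, table, and bucket names are illustrative): the first statement creates a native table partitioned by day on a timestamp column; the second defines an external table over CSV files in Cloud Storage that is queried in place.

```sql
-- Native table in Capacitor/Colossus, partitioned by day on event_ts.
CREATE TABLE mydataset.events (
  event_ts TIMESTAMP,
  user_id  INT64,
  payload  STRING
)
PARTITION BY DATE(event_ts);

-- External table over CSV files in Cloud Storage; the files are scanned, not loaded.
CREATE EXTERNAL TABLE mydataset.raw_logs
OPTIONS (
  format = 'CSV',
  uris   = ['gs://my-bucket/logs/*.csv']
);
```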

Query Execution

SQL queries compiled into a tree of DremelX operators

• Parallel operator instances are called "shards", each executed by a "slot"; max of 2,000 slots per query
• All operators are "purely in memory"
• A shuffle buffers rows in dedicated "memory" nodes

[Diagram: master over a tree of Agg operators, a shuffle, Join operators, another shuffle, and Filter operators scanning the Colossus DFS]

CPU Resource Allocation
• Compute resources are not dedicated!
  – Shared with other internal and external customers
  – No apparent way to control the computational resources used for a query
• The # of shards/slots assigned to an operator is a function of:
  – Estimated amount of data to be processed
  – Cluster capacity and current load

BigQuery Pricing
• Storage: $0.02/GB/month (AWS is about $0.023/GB/month)
• Query options:
  1) Pay-as-you-go: $5/TB "processed", calculated after columns are uncompressed (AWS is about $1.60/TB using an M4.4XLarge EC2 instance)
  2) Flat rate: $40,000/month for 2,000 dedicated slots
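As a rough worked example with the numbers above: a query that touches 2 TB of (uncompressed) column data costs 2 × $5 = $10 under pay-as-you-go, so the $40,000/month flat rate only breaks even once an organization processes on the order of 40,000 / 5 = 8,000 TB per month.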

Outline

• Part 1: Intro: Analytical Database Systems
• Part 2: Databases in the Cloud
  – Amazon Redshift
  – Snowflake
  – Microsoft Azure Synapse Analytics
  – Google BigQuery
  – Databricks
• Part 3: Cloud Research Challenges

Databricks Spark

• Spark-as-a-service in the cloud ("the best")
  – All data stored in S3
• Clusters run in the user's account
  – The control plane runs in the Databricks account
• Users can dynamically power clusters up and down
  – Clusters can be grown and shrunk

DBIO Caching Layer
• Cloud instances have fast local disks
  – AWS: 3TB NVMe drives, 500MB/s per core (vs. 150MB/s from S3)
  – Azure: even bigger difference (slower network)
• DBIO caches Parquet pages
  – compressed or uncompressed
  – the Spark scheduler schedules jobs with affinity: the node that likely caches the data becomes the executor of queries on it

Databricks Big Data - AI positioning

Databricks Delta

Databricks Delta under the Hood

Databricks MLflow
• System to make ML experiments reproducible

Pay For What You Use

• Redshift
  – More storage requires buying more compute
• SQL DW
  – Charged separately for Azure storage and DWU usage
• Snowflake
  – Charged separately for S3 storage and EC2 usage
  – Data resides in the Snowflake account
  – Works in AWS, Azure, and soon Google cloud
• BigQuery
  – Charged separately for GFS storage and TBs "processed"
• Databricks
  – Charged separately for S3 storage and EC2 usage (user account)
  – Plus DBUs to Databricks (~EC2 usage)
  – Works in AWS & Azure

Elasticity
• Redshift
  – Co-located storage and compute constrains elasticity
• Azure Synapse Analytics
  – DB-level adjustment of DWU capacity
• Snowflake
  – Query-level control through the Virtual Warehouse mechanism
• BigQuery (…AWS Athena is similar)
  – Google decides for you, based on input table sizes
• Databricks Spark
  – DB-level adjustment (cluster size) – dynamically changeable

Performance? "It's complicated!"

SQL DW & Redshift vs. Snowflake & BigQuery – shared-nothing vs. shared-storage: local disks provide better bandwidth

But, literally in minutes, Databricks, BigQuery & Snowflake can become …


Wrapup: Why Cloud Data Analytics?

• No capex
• Go from conception to insight in hours
• Low opex – pay for only what you use
• Rock bottom storage prices
• Flexibility to scale up/down compute capacity as needed
• Shared-nothing: [figure, 2017]

Summary

• No capex, low opex & low storage prices
• Cloud data systems architectures
  – separate storage from compute ➔ elasticity
• Discussed in some depth:
  – Redshift (no full storage/compute separation)
  – Snowflake (data sharing)
  – BigQuery (serverless!)
• Different cloud pricing models
  – Run clusters in the provider's account (Redshift, Snowflake)
  – Run in your own account (Databricks)
  – Pay per query (BigQuery, Athena)

Some Open Research Questions (1)

• New hardware: storage – persistent RAM (3D XPoint aka "Optane")
  – will (like NVMe) land on the compute side in the cloud
  – Ephemeral Optane storage??
  – HTTP as the interface to Optane??

Can / how should cloud storage APIs adapt to this?
