FROM FLAT FILES TO DECONSTRUCTED DATABASE
The evolution and future of the Big Data ecosystem.

Julien Le Dem @J_ [email protected] March 2019

Principal Data Engineer

• Author of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez
• Used Hadoop first at Yahoo in 2007
• Formerly Twitter Data platform and Dremio

Agenda

❖ At the beginning there was Hadoop (2005)

❖ Actually, SQL was invented in the 70s “MapReduce: A major step backwards”

❖ The deconstructed database

❖ What next?

At the beginning there was Hadoop

Hadoop

Storage: A distributed file system

Execution: Map Reduce

Based on Google’s GFS and MapReduce papers

Great at looking for a needle in a haystack…

… with snowplows

Original Hadoop abstractions

Execution: Map/Reduce
● Read/write locally
● Simple
● Flexible/Composable (Map → Shuffle → Reduce)
● Logic and optimizations tightly coupled
● Ties execution with persistence
● The job must know how to split the data

Storage: File System
● Just flat files
● Any binary
● No schema
● No standard
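To make the Map/Reduce model concrete, here is a minimal sketch in plain Python (not the Hadoop API; the shuffle is simulated in memory and the word-count example is purely illustrative):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit (key, value) pairs, here (word, 1) per word.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key (done by the framework).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(key, values):
    # Reduce: aggregate the values for one key.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped))
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```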

“MapReduce: A major step backwards” (2008)

Databases have been around for a long time

Relational model
• First described in 1969 by Edgar F. Codd

• Originally SEQUEL, developed in the early 70s at IBM

SQL
• 1986: first SQL standard
• Updated in 1989, 1992, 1996, 1999, 2003, 2006, 2008, 2011, 2016

Global Standard
• Universal data access language
• Widely understood

Underlying principles of relational databases

Separation of logic and optimization
• High level language focusing on logic (SQL)
• Optimizer
• Indexing

Standard
• SQL is understood by many

Separation of Schema and Application
• Schemas
• Views
• Evolution

Integrity
• Transactions
• Integrity constraints
• Referential integrity

Relational Database abstractions

Execution: SQL
SELECT a, AVG(b) FROM FOO GROUP BY a
● Decouples the logic of the query from:
  ○ Optimizations
  ○ Data representation
  ○ Indexing

Storage: Tables (FOO)
Abstracted notion of data
● Defines a Schema
● Format/layout decoupled from queries
● Has statistics/indexing
● Can evolve over time

Query evaluation

Syntax → Semantic → Optimization → Execution

A well integrated system

Storage side: the table FOO with its metadata (schema, stats, layout, …) and columnar data files.

SELECT f.c, AVG(b.d)
FROM FOO f JOIN BAR b ON f.a = b.b
WHERE f.d = x
GROUP BY f.c

The query passes through Syntax and Semantic analysis, then Optimization, which uses the table metadata and pushes work down (push downs) into the columnar storage. Execution runs the resulting plan: Scan FOO, Scan BAR, FILTER, JOIN, GROUP BY, Select.
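To illustrate the separation of logic and optimization, here is a small, hand-rolled sketch (not Calcite and not any real engine's API; the plan node classes and the single rewrite rule are invented for illustration). It pushes a filter below a join without changing what the query computes:

```python
from dataclasses import dataclass

@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    predicate: str  # e.g. "FOO.d = x"
    child: object

@dataclass
class Join:
    condition: str
    left: object
    right: object

def push_filter_below_join(plan):
    # Optimizer rule: if a filter sits on top of a join and only references
    # the join's left input, evaluate it before the join instead.
    # The query's result is identical; only the execution strategy changes.
    if isinstance(plan, Filter) and isinstance(plan.child, Join):
        join = plan.child
        if isinstance(join.left, Scan) and plan.predicate.startswith(join.left.table + "."):
            return Join(join.condition,
                        Filter(plan.predicate, join.left),
                        join.right)
    return plan

# Logical plan for: SELECT ... FROM FOO JOIN BAR ON FOO.a = BAR.b WHERE FOO.d = x
plan = Filter("FOO.d = x", Join("FOO.a = BAR.b", Scan("FOO"), Scan("BAR")))
print(push_filter_below_join(plan))
# Join(..., left=Filter(predicate='FOO.d = x', child=Scan(table='FOO')), right=Scan(table='BAR'))
```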

So why? Why Hadoop? Why snowplows?

The relational model was constrained

Constraints are good: they allow optimizations
• Statistics
• Pick the best join algorithm
• Change the data layout
• Reusable optimization logic

We need the right constraints and the right abstractions. Traditional SQL implementations:
• Flat schema
• Inflexible schema evolution
• History rewrite required
• No lower level abstractions
• Not scalable

Hadoop is flexible and scalable

No data shape constraint
• Nested data structures
• Unstructured text with semantic annotations
• Graphs
• Non-uniform schemas

It’s just code
• Room to scale algorithms that are not part of the standard
• Machine learning
• Your imagination is the limit

Open source
• You can improve it
• You can expand it
• You can reinvent it

You can actually implement SQL with this. And they did…

SELECT f.c, AVG(b.d)
FROM FOO f JOIN BAR b ON f.a = b.b
WHERE f.d = x
GROUP BY f.c

Parser → plan (Scan FOO, Scan BAR, FILTER, JOIN, GROUP BY, Select) → Execution (open-sourced 2009)

10 years later

The deconstructed database

(Photo: gamene, https://www.flickr.com/photos/gamene/4007091102)

The deconstructed database

Components: Query model, Data Exchange, Stream Processing, Batch execution, Machine learning, Storage.

We can mix and match individual components.

Specialized components (not exhaustive) exist for machine learning, streams, stream persistence, storage, execution, SQL, resource management, and table formats such as Iceberg.

We can mix and match individual components

Storage: row oriented or columnar; immutable or mutable; stream storage vs analytics optimized; …
Query model: SQL; functional; …
Machine Learning: training models; …

Data Exchange: row oriented or columnar
Streaming Execution: optimized for high throughput and low latency processing
Batch Execution: optimized for high throughput and historical analysis

Emergence of standard components

Columnar Storage: Apache Parquet as the columnar representation at rest for the analytics world.
Schema model: Apache Avro as a pipeline friendly schema.
SQL parsing and optimization: Apache Calcite as a versatile query optimizer framework.

Columnar Exchange: Apache Arrow as the next generation in-memory representation and no-overhead data exchange.
Table abstraction: Apache Iceberg has great potential to provide snapshot isolation and a layout abstraction on top of distributed file systems.
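As a hedged sketch of how these standard pieces fit together (assuming the pyarrow Python package is installed; the data and the file name foo.parquet are made up), the snippet builds an Arrow table in memory, writes it to Parquet at rest, and reads it back into the same Arrow representation:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Arrow: the in-memory columnar representation used for exchange.
table = pa.table({
    "a": [1, 2, 3, 4],
    "b": ["x", "y", "x", "z"],
    "c": [5, 5, 7, 5],
})

# Parquet: the columnar representation at rest.
pq.write_table(table, "foo.parquet")

# Any engine that speaks Parquet can read the same file back, schema included.
roundtrip = pq.read_table("foo.parquet")
print(roundtrip.schema)
```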

The deconstructed database’s optimizer: Calcite

SELECT f.c, AVG(b.d)
FROM FOO f JOIN BAR b ON f.a = b.b
WHERE f.d = x
GROUP BY f.c

Calcite covers Syntax, Semantic analysis, and Optimization: schema plugins describe the storage, optimizer rules and push downs shape the plan, and the optimized plan (Scan FOO, Scan BAR, FILTER, JOIN, GROUP BY, Select) is handed to the execution engine.

Apache Calcite is used in:

Batch SQL
• Apache Kylin
• …

Streaming SQL
• Apache Apex
• Apache SamzaSQL
• Apache StormSQL
• …

The deconstructed database’s storage

Columnar, immutable, optimized for analytics vs. row oriented, mutable, optimized for serving.

Query integration: to be performant, a query engine requires deep integration with the storage layer, implementing push downs and a vectorized reader producing data in an efficient representation (for example Apache Arrow).

Storage: Push downs

PROJECTION: read only what you need
• Read only the columns that are needed.
• Columnar storage makes this efficient.

PREDICATE: filter
Evaluate filters during the scan to:
• Leverage storage properties (min/max stats, partitioning, sort, etc.)
• Avoid decoding skipped data.
• Reduce IO.

AGGREGATION: avoid materializing intermediary data
To reduce IO, aggregation can also be implemented during the scan to:
• Minimize materialization of intermediary data.
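A hedged sketch of projection and predicate push down using pyarrow (reusing the illustrative foo.parquet file from the earlier snippet; column names are made up): only the needed columns are read, and the filter lets the reader use row-group statistics to skip data.

```python
import pyarrow.parquet as pq

# Projection push down: only columns a and b are decoded from the file.
# Predicate push down: the filter on c is evaluated during the scan,
# using row-group min/max statistics to skip data and reduce IO.
filtered = pq.read_table(
    "foo.parquet",
    columns=["a", "b"],
    filters=[("c", "=", 5)],
)
print(filtered)
```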

The deconstructed database interchange: Apache Arrow

Projection and predicate push downs

SELECT SUM(a) FROM t WHERE c = 5 GROUP BY b

Scanners read Parquet files and produce Arrow batches. Projection push down reads only columns a and b, and the c = 5 predicate is pushed into the scan. Each scanner feeds a partial aggregation; the partial results are shuffled to the final aggregation, which produces the result.
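A hedged, single-process sketch of that pipeline with the pyarrow dataset API (no distributed shuffle here; foo.parquet is the illustrative file from earlier): the scan projects a and b and pushes the c = 5 predicate down, yielding Arrow data, and a grouped aggregation then computes SUM(a) per b.

```python
import pyarrow.dataset as ds

dataset = ds.dataset("foo.parquet", format="parquet")

# Scan with projection (a, b) and predicate (c = 5) pushed down,
# producing Arrow data directly.
scanned = dataset.to_table(columns=["a", "b"], filter=ds.field("c") == 5)

# SUM(a) GROUP BY b. In a distributed engine this would be a partial
# aggregation per scanner, a shuffle, and a final aggregation.
result = scanned.group_by("b").aggregate([("a", "sum")])
print(result)
```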

Big Data infrastructure blueprint

Legend: Persistence, Processing, Metadata

UI / Data Infra components:
• Real time publishing and batch publishing through a Data API
• Stream persistence and stream processing
• Real time dashboards (Data API)
• Schema registry and datasets metadata
• Data storage (S3, GCS, HDFS) in Parquet or Avro
• Interactive analysis by analysts (Data API)
• Monitoring / Observability
• Batch processing, with scheduling / job dependencies (engineers)
• Periodic dashboards

The Future

Still Improving

A better data abstraction
• A better metadata repository: Marquez
• A better table abstraction: Apache Iceberg

Better interoperability
• More efficient interoperability: continued Apache Arrow adoption
• Common push down implementations (filter, projection, aggregation)

Better data governance
• Global Data Lineage
• Access control
• Protect privacy
• Record user consent

Some predictions

A common access layer

SQL (Impala, Drill, Presto, Hive, …), Batch (Spark, MR, Beam), and ML (TensorFlow, …) all go through a distributed access service: a table abstraction layer (schema, push downs, access control, anonymization) on top of file systems (HDFS, S3, GCS) and other storage (HBase, Cassandra, Kudu, …).

Centralizes:
• Schema evolution
• Access control/anonymization
• Efficient push downs
• Efficient interoperability

A multi tiered streaming batch storage system

Batch-Stream storage: a convergence of Kafka and Kudu. Time based ingestion flows through three tiers: 1) an in-memory row oriented store, 2) an in-memory column oriented store, 3) an on-disk column oriented store.

• The stream consumer API mainly reads from the in-memory row oriented tier.
• The batch consumer API mainly reads from the on-disk column oriented tier.
• Both APIs provide projection, predicate, and aggregation push downs.
• The integration centralizes: schema evolution, the API, access control/anonymization, efficient push downs, efficient interoperability.

THANK YOU! [email protected]

Julien Le Dem @J_ September 2018

We’re hiring!

Contact: [email protected]

WeWork in numbers (as of January)
• 425 locations in 100 cities internationally
• Over 400,000 members worldwide
• 27 countries

We’re growing
• WeWork doubled in size last year and the year before
• We expect to double in size this year as well
• Broadened scope: WeWork Labs, Powered by We, Flatiron School, Meetup, WeLive, …

Technology jobs locations
• San Francisco
• New York
• Tel Aviv
• Shanghai

Contact: [email protected]

Questions?

Julien Le Dem @J_ [email protected] March 2019