FROM FLAT FILES TO DECONSTRUCTED DATABASE
The evolution and future of the Big Data ecosystem.
Julien Le Dem @J_ [email protected] March 2019
Principal Data Engineer
• Author of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez
• Used Hadoop first at Yahoo in 2007
• Formerly Twitter Data platform and Dremio

Agenda
❖ At the beginning there was Hadoop (2005)
❖ Actually, SQL was invented in the 70s (“MapReduce: A major step backwards”)
❖ The deconstructed database
❖ What next?

At the beginning there was Hadoop

Hadoop
Storage: A distributed file system
Execution: Map Reduce
Based on Google’s GFS and MapReduce papers

Great at looking for a needle in a haystack…
…with snowplows

Original Hadoop abstractions
Execution: Map/Reduce
M → Shuffle → R, reading and writing locally
● Flexible/composable
● Logic and optimizations tightly coupled
● Ties execution with persistence

Storage: File system
Simple: just flat files
● Any binary
● No schema
● No standard
● The job must know how to split the data
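The Map/Reduce contract above can be sketched as a toy, single-process Python model (names and data are illustrative; this is not Hadoop’s actual Java API):

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Each mapper emits (key, value) pairs from its input split.
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    # The framework groups all values by key between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    # Each reducer folds the values for one key into a result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Classic word count. Note that the job itself must know how to
# interpret the flat input (no schema, no standard format).
lines = ["a rose is a rose", "a daisy is not"]
pairs = map_phase(lines, lambda line: ((w, 1) for w in line.split()))
counts = reduce_phase(shuffle(pairs), lambda k, vs: sum(vs))
```

The shuffle is what the framework provides; everything else, including how to split and parse the data, lives in the job’s code, which is exactly the coupling the slide describes.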
“MapReduce: A major step backwards” (2008)

Databases have been around for a long time
Relational model
• First described in 1969 by Edgar F. Codd

SQL
• Originally SEQUEL, developed in the early 70s at IBM
• 1986: first SQL standard
• Updated in 1989, 1992, 1996, 1999, 2003, 2006, 2008, 2011, 2016

Global standard
• Universal data access language
• Widely understood

Underlying principles of relational databases
Separation of logic and optimization
• High level language focusing on logic (SQL)
• Optimizer
• Indexing

Standard
• SQL is understood by many

Separation of schema and application
• Schemas
• Views
• Evolution

Integrity
• Transactions
• Integrity constraints
• Referential integrity

Relational Database abstractions
Execution: SQL
SELECT a, AVG(b) FROM FOO GROUP BY a
● Decouples the logic of the query from:
  ○ Optimizations
  ○ Data representation
  ○ Indexing

Storage: Tables (FOO)
Abstracted notion of data
● Defines a schema
● Format/layout decoupled from queries
● Has statistics/indexing
● Can evolve over time

Query evaluation
Syntax → Semantic → Optimization → Execution

A well integrated system
SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c

[Diagram: the query goes through Syntax → Semantic → Optimization → Execution; the optimized plan (Scan FOO → FILTER → JOIN with Scan BAR → GROUP BY → Select) reads columnar data and metadata (schema, stats, layout, …) from storage, with push downs at each scan, and returns results to the user]
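That separation of logic from execution is easy to demonstrate with Python’s built-in sqlite3; the FOO/BAR schema and values below are invented to match the slide’s query (with x = 5):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FOO (a INTEGER, c TEXT, d INTEGER);
    CREATE TABLE BAR (b INTEGER, d REAL);
    INSERT INTO FOO VALUES (1, 'x', 5), (2, 'y', 5), (1, 'x', 9);
    INSERT INTO BAR VALUES (1, 10.0), (1, 20.0), (2, 30.0);
""")

# The query states only the logic: join order, join algorithm, and
# data access are the optimizer's business and appear nowhere here.
query = """
    SELECT f.c, AVG(b.d)
    FROM FOO f JOIN BAR b ON f.a = b.b
    WHERE f.d = 5
    GROUP BY f.c
"""
rows = conn.execute(query).fetchall()

# EXPLAIN QUERY PLAN exposes the physical plan the optimizer chose.
plan = [r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + query)]
```

Running the same query against a different layout or index set would change `plan` but not `rows`, which is the whole point of the abstraction.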
So why? Why Hadoop? Why snowplows?

The relational model was constrained
Constraints are good
They allow optimizations:
• Statistics
• Pick the best join algorithm
• Change the data layout
• Reusable optimization logic

We need the right constraints
Need the right abstractions. Traditional SQL implementations:
• Flat schema
• Inflexible schema evolution
• History rewrite required
• No lower level abstractions
• Not scalable

Hadoop is flexible and scalable
No data shape constraint
• Nested data structures
• Unstructured text with semantic annotations
• Graphs
• Non-uniform schemas

It’s just code
• Room to scale algorithms that are not part of the standard
• Machine learning
• Your imagination is the limit

Open source
• You can improve it
• You can expand it
• You can reinvent it

You can actually implement SQL with this

And they did…

SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c
[Diagram: Parser → Optimization → Execution, producing the same plan: Scan FOO → FILTER → JOIN with Scan BAR → GROUP BY → Select]

(open-sourced 2009)

10 years later

The deconstructed database
Photo: gamene https://www.flickr.com/photos/gamene/4007091102

The deconstructed database
Query model · Data exchange · Stream processing · Batch execution · Machine learning · Storage

We can mix and match individual components

[Logo slide: specialized components in each category — stream processing, machine learning, streams, stream persistence, storage, execution, SQL, Iceberg, resource management. *not exhaustive!]

We can mix and match individual components
Storage: row oriented or columnar; immutable or mutable; stream storage vs analytics optimized; …
Query model: SQL, functional, …
Machine learning: training models
Data exchange: row oriented, columnar
Streaming execution: optimized for high throughput and low latency processing
Batch execution: optimized for high throughput and historical analysis

Emergence of standard components
Columnar storage: Apache Parquet as the columnar representation at rest for the analytics world.
Schema model: Apache Avro as a pipeline friendly schema representation.
SQL parsing and optimization: Apache Calcite as a versatile query optimizer framework.
Columnar exchange: Apache Arrow as the next generation in-memory representation and no-overhead data exchange.
Table abstraction: Apache Iceberg has great potential to provide snapshot isolation and layout abstraction on top of distributed file systems.

The deconstructed database’s optimizer: Calcite
[Diagram: the query (SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c) goes through Syntax → Semantic → Optimization; Calcite applies pluggable optimizer rules, using schema plugins backed by storage, then hands the optimized plan (Scan FOO → FILTER → JOIN with Scan BAR → GROUP BY → Select) to the execution engine]
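Calcite itself is a Java framework; its core idea, rewriting a relational plan with pluggable rules, can be sketched with a toy filter push down rule in Python (the plan encoding and the FOO/BAR column lists are invented for the example and are not Calcite’s API):

```python
# A toy relational plan: each node is a dict with an "op" field.
def scan(table, cols):
    return {"op": "scan", "table": table, "cols": cols}

def filter_(col, node):
    return {"op": "filter", "col": col, "input": node}

def join(left, right):
    return {"op": "join", "left": left, "right": right}

def columns(node):
    # Columns produced by a plan node (union of its scans' columns).
    if node["op"] == "scan":
        return set(node["cols"])
    if node["op"] == "filter":
        return columns(node["input"])
    return columns(node["left"]) | columns(node["right"])

def push_filter_below_join(node):
    # Rule: Filter(Join(L, R)) -> Join(Filter(L), R) (or the mirror)
    # when the filtered column comes from only one join input, so the
    # join sees fewer rows. Real optimizers apply many such rules
    # until the plan stops changing.
    if node["op"] == "filter" and node["input"]["op"] == "join":
        j = node["input"]
        if node["col"] in columns(j["left"]):
            return join(filter_(node["col"], j["left"]), j["right"])
        if node["col"] in columns(j["right"]):
            return join(j["left"], filter_(node["col"], j["right"]))
    return node

# FILTER on d over JOIN(FOO, BAR): d is a FOO column here, so the
# filter moves below the join, next to the FOO scan.
plan = filter_("d", join(scan("FOO", ["a", "c", "d"]),
                         scan("BAR", ["b", "e"])))
optimized = push_filter_below_join(plan)
```

Separating rules from plans like this is what lets one optimizer framework serve many engines, which is the list that follows.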
Apache Calcite is used in:
Batch SQL:
• Apache Hive
• Apache Phoenix
• Apache Kylin
• …

Streaming SQL:
• Apache Apex
• Apache SamzaSQL
• Apache StormSQL
• …
The deconstructed database’s storage
Columnar, immutable, optimized for analytics vs. row oriented, mutable, optimized for serving
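A minimal sketch of why each layout wins its use case, with plain Python lists standing in for real storage formats:

```python
# The same three records in both layouts.
rows = [
    {"a": 1, "b": 10.0, "c": "x"},
    {"a": 2, "b": 20.0, "c": "y"},
    {"a": 3, "b": 30.0, "c": "z"},
]

# Columnar: one contiguous array per column.
columnar = {
    "a": [1, 2, 3],
    "b": [10.0, 20.0, 30.0],
    "c": ["x", "y", "z"],
}

# Analytics (scan one column across all records): the columnar layout
# touches only the "b" array; the row layout would drag every field
# of every record through memory.
avg_b = sum(columnar["b"]) / len(columnar["b"])

# Serving (fetch one whole record): the row layout is a single lookup;
# the columnar layout must reassemble the record from every column.
record_1 = rows[1]
rebuilt_1 = {col, None} if False else {col: vals[1] for col, vals in columnar.items()}
```

Real formats add encoding, compression, and statistics on top, but the access-pattern asymmetry is the same.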
Query integration
To be performant, a query engine requires deep integration with the storage layer: implementing push downs and a vectorized reader that produces data in an efficient representation (for example Apache Arrow).

Storage: Push downs
PROJECTION: read only what you need
Read only the columns that are needed:
• Columnar storage makes this efficient.

PREDICATE: filter
Evaluate filters during the scan to:
• Leverage storage properties (min/max stats, partitioning, sort, etc.)
• Avoid decoding skipped data.
• Reduce IO.

AGGREGATION: avoid materializing intermediary data
To reduce IO, aggregation can also be implemented during the scan to:
• Minimize materialization of intermediary data.

The deconstructed database interchange: Apache Arrow

Projection and predicate push downs
SELECT SUM(a) FROM t WHERE c = 5 GROUP BY b

[Diagram: Parquet files → Scanner (projection push down: read only a and b) → Arrow batches → Partial Agg → Shuffle → Agg → Result]
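A toy single-process version of this pipeline for SELECT SUM(a) FROM t WHERE c = 5 GROUP BY b, with the push downs applied in the scanner and partial aggregates merged after the shuffle (the row groups and min/max stats only loosely mimic Parquet; no real Parquet or Arrow code is involved):

```python
from collections import defaultdict

def scan(row_groups, projection):
    # Scanner with push downs: skip whole row groups via min/max stats
    # on c, evaluate the predicate c = 5 during the scan, and
    # materialize only the projected columns.
    for rg in row_groups:
        lo, hi = rg["stats"]["c"]
        if not (lo <= 5 <= hi):                 # predicate via stats
            continue
        for row in rg["rows"]:
            if row["c"] == 5:                   # predicate push down
                yield {col: row[col] for col in projection}  # projection

def partial_sum(batch):
    # Partial aggregation before the shuffle: less data moves.
    acc = defaultdict(int)
    for row in batch:
        acc[row["b"]] += row["a"]
    return acc

def merge(partials):
    # Final aggregation after the shuffle merges the partial sums.
    total = defaultdict(int)
    for acc in partials:
        for b, s in acc.items():
            total[b] += s
    return dict(total)

row_groups = [
    {"stats": {"c": (0, 4)}, "rows": [{"a": 1, "b": 1, "c": 3}]},  # skipped
    {"stats": {"c": (5, 9)}, "rows": [{"a": 2, "b": 1, "c": 5},
                                      {"a": 3, "b": 2, "c": 5},
                                      {"a": 4, "b": 2, "c": 7}]},
]
batch = list(scan(row_groups, projection=("a", "b")))
result = merge([partial_sum(batch)])
```

The first row group is never decoded at all, and column c never leaves the scanner: that is the IO the slide says push downs save.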
Big Data infrastructure blueprint

Legend: Persistence / Processing / Metadata
[Diagram: real time and batch publishing feed Data APIs backed by stream persistence and data storage (S3, GCS, HDFS; Parquet, Avro), with a schema registry and datasets metadata; stream processing drives real time dashboards, batch processing with scheduling/job dependencies drives periodic dashboards, and analysts and engineers use a Data API for interactive analysis, all covered by monitoring/observability]
The Future

Still improving
A better data abstraction
• A better metadata repository: Marquez
• A better table abstraction: Apache Iceberg

Better interoperability
• More efficient interoperability: continued Apache Arrow adoption
• Common push down implementations (filter, projection, aggregation)

Better data governance
• Global data lineage
• Access control
• Protect privacy
• Record user consent

Some predictions

A common access layer
[Diagram: SQL engines (Impala, Drill, Presto, Hive, …), batch engines (Spark, MR, Beam), and ML frameworks (TensorFlow, …) all go through a distributed access service exposing a table abstraction layer (schema, push downs, access control, anonymization) over file systems (HDFS, S3, GCS) and other storage (HBase, Cassandra, Kudu, …)]

Centralizes:
• Schema evolution
• Access control/anonymization
• Efficient push downs
• Efficient interoperability

A multi tiered streaming batch storage system
Batch-stream storage integration: convergence of Kafka and Kudu.

[Diagram: time based ingestion flows through 1) an in-memory row oriented store, 2) an in-memory column oriented store, and 3) an on-disk column oriented store; the stream consumer API mainly reads from the in-memory stores and the batch consumer API mainly reads from the on-disk store, both with projection, predicate, and aggregation push downs]

• Schema evolution
• API
• Access control/anonymization
• Efficient push downs
• Efficient interoperability

THANK YOU!
[email protected]
Julien Le Dem @J_ September 2018

We’re hiring!
Contact: [email protected]
WeWork in numbers (as of January)
• 425 locations in 100 cities internationally
• Over 400,000 members worldwide
• 27 countries
We’re growing
• WeWork doubled in size last year and the year before
• We expect to double in size this year as well
• Broadened scope: WeWork Labs, Powered by We, Flatiron School, Meetup, WeLive, …
Technology jobs: locations
• San Francisco
• New York
• Tel Aviv
• Shanghai
Questions?
Rate today’s session
Session page on conference website · O’Reilly Events App