FROM FLAT FILES TO DECONSTRUCTED DATABASE: The Evolution and Future of the Big Data Ecosystem
FROM FLAT FILES TO DECONSTRUCTED DATABASE
The evolution and future of the Big Data ecosystem.
Julien Le Dem @J_ [email protected]
March 2019

Julien Le Dem @J_
Principal Data Engineer
• Author of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez
• Used Hadoop first at Yahoo in 2007
• Formerly Twitter Data Platform and Dremio

Agenda
❖ At the beginning there was Hadoop (2005)
❖ Actually, SQL was invented in the 70s: “MapReduce: A major step backwards”
❖ The deconstructed database
❖ What next?

At the beginning there was Hadoop

Hadoop
• Storage: a distributed file system
• Execution: MapReduce
• Based on Google’s GFS and MapReduce papers
Great at looking for a needle in a haystack… with snowplows.

Original Hadoop abstractions
Execution: Map/Reduce (map, shuffle, reduce; reads and writes locally)
• Simple
• Flexible/composable
• Logic and optimizations tightly coupled
• Ties execution with persistence: the job must know how to split the data
Storage: file system
• Just flat files, any binary
• No schema
• No standard

“MapReduce: A major step backwards” (2008)

Databases have been around for a long time
• Relational model: first described in 1969 by Edgar F. Codd
• SQL: originally SEQUEL, developed in the early 70s at IBM; first SQL standard in 1986, updated in 1989, 1992, 1996, 1999, 2003, 2006, 2008, 2011, and 2016
• A global standard: a universal data access language, widely understood

Underlying principles of relational databases
• Separation of logic and optimization: a high-level language focusing on logic (SQL), an optimizer, indexing
• Standard: SQL is understood by many
• Separation of schema and application: schemas, views, evolution
• Integrity: transactions, integrity constraints, referential integrity

Relational database abstractions
Execution: SQL (e.g. SELECT a, AVG(b) FROM FOO GROUP BY a)
• Decouples the logic of the query from optimizations, data representation, and indexing
Storage: tables (e.g. FOO)
• An abstracted notion of data that defines a schema
• Format/layout decoupled from queries
• Has statistics/indexing
• Can evolve over time

Query evaluation
Syntax → Semantic → Optimization → Execution

A well integrated system
SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c
• The query goes through syntax and semantic analysis, then optimization.
• The optimizer uses metadata (schema, stats, layout) and pushes work down to the columnar storage (push downs).
• Execution runs the resulting plan: Scan FOO, Scan BAR → Filter → Join → Group By.

So why? Why Hadoop? Why snowplows?
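To make the contrast between the two sets of abstractions concrete, here is a minimal sketch in plain Python (not actual Hadoop API code; the file contents and record layout are made up for illustration) of what the one-line query SELECT a, AVG(b) FROM FOO GROUP BY a looks like when hand-written as map, shuffle, and reduce phases over a flat file. The file format, splitting, and aggregation logic all live in user code, which is exactly the coupling described above:

    # Illustrative map/shuffle/reduce over a flat file; no schema, no optimizer.
    from collections import defaultdict

    def map_phase(lines):
        # The job must know how the flat file is laid out: here we assume
        # comma-separated "a,b" records with no schema enforcement.
        for line in lines:
            a, b = line.rstrip("\n").split(",")
            yield a, float(b)

    def shuffle(pairs):
        # Group intermediate pairs by key, as the framework's shuffle would.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(groups):
        # The aggregation logic is hand-rolled; nothing can be optimized away.
        for key, values in groups:
            yield key, sum(values) / len(values)

    if __name__ == "__main__":
        foo = ["x,1", "x,3", "y,10"]  # stand-in for one split of a flat file
        print(dict(reduce_phase(shuffle(map_phase(foo)))))
        # {'x': 2.0, 'y': 10.0}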
The relational model was constrained
Constraints are good: they allow optimizations
• Statistics
• Pick the best join algorithm
• Change the data layout
• Reusable optimization logic
But we need the right constraints and the right abstractions; traditional SQL implementations had:
• A flat schema
• Inflexible schema evolution (history rewrite required)
• No lower level abstractions
• Not scalable

Hadoop is flexible and scalable
No data shape constraint
• Nested data structures
• Unstructured text with semantic annotations
• Graphs
• Non-uniform schemas
• Room to scale
It’s just code
• Algorithms that are not part of the standard
• Machine learning
• Your imagination is the limit
Open source
• You can improve it
• You can expand it
• You can reinvent it

You can actually implement SQL with this
And they did… an SQL-on-Hadoop engine (open-sourced 2009): a parser turns SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c into the plan Scan FOO, Scan BAR → Filter → Join → Group By, executed as MapReduce jobs.

10 years later

The deconstructed database
(Photo: gamene, https://www.flickr.com/photos/gamene/4007091102)
Components: query model, data exchange, stream processing, batch execution, machine learning, storage.

We can mix and match individual components
Specialized components (not exhaustive): stream processing, machine learning, streams, stream persistence, storage, execution, SQL, Iceberg, resource management.

We can mix and match individual components
• Storage: row oriented or columnar; immutable or mutable; stream storage vs analytics optimized
• Query model: SQL, functional, …
• Machine learning: training models
• Data exchange: row oriented or columnar
• Streaming execution: optimized for high throughput and low latency processing
• Batch execution: optimized for high throughput and historical analysis

Emergence of standard components
• Columnar storage: Apache Parquet as the columnar representation at rest.
• Schema model: Apache Avro as the pipeline friendly schema for the analytics world.
• SQL parsing and optimization: Apache Calcite as a versatile query optimizer framework.
• Columnar exchange: Apache Arrow as the next generation in-memory representation and no-overhead data exchange.
• Table abstraction: Apache Iceberg has great potential to provide snapshot isolation and layout abstraction on top of distributed file systems.

The deconstructed database’s optimizer: Calcite
Calcite takes the query (SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c) through syntax and semantic analysis and optimization, using pluggable storage/schema plugins and optimizer rules, and hands the resulting plan (Scan FOO, Scan BAR → Filter → Join → Group By) to the execution engine.

Apache Calcite is used in:
Batch SQL: Apache Hive, Apache Drill, Apache Phoenix, Apache Kylin, …
Streaming SQL: Apache Apex, Apache Flink, Apache SamzaSQL, Apache StormSQL, …

The deconstructed database’s storage
• Columnar: immutable, optimized for analytics
• Row oriented: mutable, optimized for serving
To be performant, a query engine requires deep integration with the storage layer.
Query integration: implementing push downs and a vectorized reader producing data in an efficient representation (for example Apache Arrow).
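As a concrete illustration, here is a small sketch using pyarrow (the file name, column names, and toy data are invented for the example, not taken from the talk) of this kind of integrated scan: projection and predicate push down against a Parquet file, producing Arrow data on which a partial aggregation runs, mirroring the query SELECT SUM(a) FROM t WHERE c = 5 GROUP BY b that appears in the Arrow interchange example below:

    # Sketch of projection/predicate push down into a Parquet scan, with the
    # scan result materialized as Arrow data and aggregated in memory.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Create a toy Parquet file standing in for table t.
    pq.write_table(
        pa.table({"a": [1, 2, 3, 4], "b": ["x", "x", "y", "y"], "c": [5, 5, 5, 9]}),
        "t.parquet",
    )

    # Projection push down: read only the columns the query needs.
    # Predicate push down: the filter is evaluated during the scan, so row
    # groups whose statistics rule out c = 5 can be skipped entirely.
    scanned = pq.read_table("t.parquet", columns=["a", "b"], filters=[("c", "=", 5)])

    # Partial aggregation on the Arrow table produced by the scan.
    result = scanned.group_by("b").aggregate([("a", "sum")])
    print(result.to_pydict())  # per-group sums: b='x' -> 3, b='y' -> 3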
Storage: push downs
Projection: read only what you need.
• Read only the columns that are needed; columnar storage makes this efficient.
Predicate: filter.
• Evaluate filters during the scan to leverage storage properties (min/max stats, partitioning, sort, etc.), avoid decoding skipped data, and reduce IO.
Aggregation: avoid materializing intermediary data.
• To reduce IO, aggregation can also be implemented during the scan to minimize materialization of intermediary data.

The deconstructed database’s interchange: Apache Arrow
For SELECT SUM(a) FROM t WHERE c = 5 GROUP BY b, the scanners apply projection and predicate push downs to the Parquet files (reading only columns a and b), produce Arrow batches, compute partial aggregations, then shuffle and run the final aggregation to produce the result.

Big Data infrastructure blueprint
Legend: persistence, processing, metadata, UI.
• Publishing: real time publishing and batch publishing through a Data API
• Streaming: stream persistence, stream processing, real time dashboards
• Batch: data storage (S3, GCS, HDFS) in Parquet/Avro, batch processing with scheduling/job dependencies, periodic dashboards
• Metadata: schema registry, datasets metadata
• Consumption: interactive analysis through a Data API, for analysts and engineers
• Monitoring / observability

The Future

Still improving
• A better data abstraction: a better metadata repository (Marquez), a better table abstraction (Apache Iceberg), common push down implementations (filter, projection, aggregation)
• Better interoperability: more efficient interoperability through continued Apache Arrow adoption
• Better data governance: global data lineage, access control, protecting privacy, recording user consent

Some predictions

A common access layer
• SQL engines (Impala, Drill, Presto, Hive, …), batch engines (Spark, MR, Beam), and ML frameworks (TensorFlow, …) go through a distributed access service.
• A table abstraction layer (schema, push downs, access control, anonymization) sits on top of file systems (HDFS, S3, GCS) and other storage (HBase, Cassandra, Kudu, …).
• It centralizes: schema evolution, access control/anonymization, efficient push downs, efficient interoperability.

A multi tiered streaming batch storage system
• Batch-stream storage integration: a convergence of Kafka and Kudu.
• Time based ingestion flows through three tiers: 1) an in-memory row oriented store, 2) an in-memory column oriented store, 3) an on-disk column oriented store.
• A stream consumer API (mainly reading from the in-memory tiers) and a batch consumer API (mainly reading from the on-disk tier), both with projection, predicate, and aggregation push downs.
• It centralizes: schema evolution, the API, access control/anonymization, efficient push downs, efficient interoperability.

THANK YOU!
Julien Le Dem @J_ [email protected]

We’re hiring! Contact: [email protected]
WeWork in numbers (as of January):
• 425 locations in 100 cities internationally
• Over 400,000 members worldwide
• 27 countries
We’re growing:
• WeWork doubled in size last year and the year before, and we expect to double in size this year as well
• Broadened scope: WeWork Labs, Powered by We, Flatiron School, Meetup, WeLive, …
Technology jobs locations: San Francisco, New York, Tel Aviv, Shanghai

Questions?
Julien Le Dem @J_ [email protected]
March 2019