Autonomous ETL With Materialized Views
Abhishek Somani, Adesh Rao May 2018 Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution
2 Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution
3 Data structuring for SQL-on-Hadoop ● Partitioning
4 Data organization for SQL-on-Hadoop ● Columnar File Formats Parquet
ORC
5 Data organization for SQL-on-Hadoop ● Sorting ● Bucketing
6 Data organization for SQL-on-Hadoop
Speedup of Unsorted vs Sorted ORC data on TPCDS scale 1000
7 Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution
8 Difficulties in Structuring Data
● Workload Aware identification ● Evolving query patterns of optimal data structure
● Data pipeline dependencies ● Flexibility of data structuring
● Large number of consumers ● Seamless restructuring
● Data Admin Involvement ● Continuous and automatic maintenance
● Downtime NO DOWNTIME!
9 Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution
10 Basics: Materialized View
● A materialized view is a database object that contains the results of a query. ● It is a view for which the data has been materialized. ● Materialized Views can be consumed automatically by the query engine
Example: CREATE MATERIALIZED VIEW mv AS SELECT seller_id, seller_name, num_item*cost AS value FROM sales;
Effect: Query rewrite SELECT seller_id, num_item*cost AS value FROM sales; ~ SELECT seller_id, value FROM mv;
11 Materialized Views in Hive for Data Restructuring
Interesting properties of Materialized Views in Hive: ● A copy of the data(full, partial or transformed) ● Used automatically by the engine based on cost analysis ● Can be stored as ORC, Parquet etc ● Multiple materialized views can co-exist, optimally chosen
Plus: Storage is cheap
Idea: Create multiple materialized views of the full data with desired structures
12 Materialized Views for Data Restructuring
Example: Original Table T1: Query1: SELECT * from T1 where customer_id = 26988 and month ● Partitioned on Year, Month, Day = “January”; ● Stored as Text Rewritten: SELECT * from MV1 where customer_id = 26988 and Materialized View MV1: month = “January”;
● Partitioned on Year, Month, Day ● Sorted on Customer_Id Query2: SELECT * from T1 where seller_id = 121 and month = ● Stored as ORC “January”;
Materialized View MV2: Rewritten: SELECT * from MV2 where seller_id = 121 and month = ● Partitioned on Year, Month, Day “January”; ● Sorted on Seller_Id ● Stored as ORC
13 Materialized Views in SQL-on-Hadoop engines
● Basic implementation available in Apache Hive 2.3.0 ○ Uses Apache Calcite for query optimization and query rewrite ○ Multi file format support. Uses ORC (by default) for optimized columnar storage of materialized queries ● Not available in Presto ● Not available in Spark
14 Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution
15 Challenges with Materialized Views
● Invalidation
○ Only a subset of use cases can work with stale data ● Rebuilds and Refreshes ○ Prohibitively expensive for full data copies ● Maintenance Isolation ○ Ongoing queries get affected
16 Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution
17 FastCopy: A framework for Autonomous Materialized Views
● Materialized Views for Sorting, Partitioning and Bucketing for structuring data ● Synchronous invalidation on table updates ● Asynchronous automatic refreshes ● Maintenance isolation by refreshes in their own scheduler queues, or even their own cluster ● Recommendation Engine to suggest Materialized Views ● Cross engine support for using Materialized Views
18 Qubole FastCopy Infrastructure
19 Qubole FastCopy Infrastructure FastCopy Creation
20 Qubole FastCopy Infrastructure FastCopy Creation
21 Qubole FastCopy Infrastructure FastCopy Creation
22 Qubole FastCopy Infrastructure FastCopy Creation
23 Qubole FastCopy Infrastructure Incoming query for rewrite
24 Qubole FastCopy Infrastructure Query Rewrite
25 Qubole FastCopy Infrastructure Query Rewrite
26 Qubole FastCopy Infrastructure Query Rewrite
27 Qubole FastCopy Infrastructure Invalidation and Refresh
28 Qubole FastCopy Infrastructure Invalidation and Refresh
29 Qubole FastCopy Infrastructure Invalidation and Refresh
30 Qubole FastCopy Infrastructure Invalidation and Refresh
31 Qubole FastCopy Infrastructure Invalidation and Refresh
32 Qubole FastCopy Infrastructure Invalidation and Refresh
33 Qubole FastCopy Infrastructure Invalidation and Refresh
34 Fun Details ● Auto detect added, dropped or updated partitions using partition level tokens ● Multi Version Concurrency Control for FastCopy ● Minion clusters for workload isolation
35 Recommendations ● Top Tables
36 Recommendations ● Top Tables
37 Recommendations ● Column Usage as Filter predicates
38 Recommendations ● Column Usage as Filter predicates
39 Recommendations ● Column Usage as Filter predicates
40 Recommendations ● Top Tables
41 Recommendations ● Top Tables
42 Recommendations ● Column Usage as Filter predicates
43 Recommendations ● Column Usage as Filter predicates
44 Recommendations ● Column Usage as Filter predicates
45 Revise 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution
46 Status ● FastCopy is at an internal Alpha ● Will soon be released as a beta for customers in the next Quarter ● Contribute to Open Source
Thank You
47 Thank You
Abhishek Somani, Adesh Rao May 2018