Autonomous ETL With Materialized Views

Abhishek Somani, Adesh Rao
May 2018

Agenda
1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark, etc.)
2. Difficulties in structuring data
3. A case for Materialized Views
4. Challenges with Materialized Views
5. Solution


Data structuring for SQL-on-Hadoop
● Partitioning

Data organization for SQL-on-Hadoop
● Columnar File Formats: Parquet, ORC
● Sorting
● Bucketing

Figure: Speedup of Unsorted vs Sorted ORC data on TPCDS scale 1000
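One reason sorting can help is that columnar formats keep min/max statistics per block of rows, so a reader can skip blocks whose range cannot contain the filter value. The sketch below is a toy model of that idea, not ORC's actual reader: `Stripe`, `make_stripes`, and `stripes_to_read` are hypothetical names, and the data is synthetic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stripe:
    """A block of rows for one column, with min/max statistics (toy model)."""
    values: List[int]

    @property
    def lo(self) -> int:
        return min(self.values)

    @property
    def hi(self) -> int:
        return max(self.values)


def make_stripes(values: List[int], stripe_size: int) -> List[Stripe]:
    """Split a column into fixed-size stripes, preserving row order."""
    return [Stripe(values[i:i + stripe_size])
            for i in range(0, len(values), stripe_size)]


def stripes_to_read(stripes: List[Stripe], key: int) -> int:
    """Count stripes whose [lo, hi] range could contain `key`."""
    return sum(1 for s in stripes if s.lo <= key <= s.hi)


# Synthetic, distinct customer ids in a scattered (unsorted) order.
customer_ids = [(i * 7919) % 100_000 for i in range(10_000)]
key = customer_ids[5006]

unsorted_stripes = make_stripes(customer_ids, 1_000)
sorted_stripes = make_stripes(sorted(customer_ids), 1_000)

print("stripes read, unsorted:", stripes_to_read(unsorted_stripes, key))
print("stripes read, sorted:  ", stripes_to_read(sorted_stripes, key))
```

With unsorted data every stripe spans almost the whole id range, so none can be skipped; after sorting, the stripe ranges no longer overlap and a point lookup touches a single stripe.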


Difficulties in Structuring Data

● Evolving query patterns → Workload-aware identification of optimal data structure
● Data pipeline dependencies → Flexibility of data structuring
● Large number of consumers → Seamless restructuring
● Data Admin Involvement → Continuous and automatic maintenance
● Downtime → NO DOWNTIME!


Basics: Materialized View

● A materialized view is an object that contains the results of a query.
● It is a view for which the data has been materialized.
● Materialized Views can be consumed automatically by the query engine.

Example:
CREATE MATERIALIZED VIEW mv AS SELECT seller_id, seller_name, num_item*cost AS value FROM sales;

Effect: Query rewrite
SELECT seller_id, num_item*cost AS value FROM sales;
~
SELECT seller_id, value FROM mv;
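The equivalence behind the rewrite can be checked on a toy dataset. The sketch below uses Python's built-in sqlite3 module purely as a stand-in: SQLite has no materialized views, so a table created from the query plays the role of `mv`, and the rewrite is applied by hand rather than by an optimizer. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (seller_id INTEGER, seller_name TEXT,
                        num_item INTEGER, cost REAL);
    INSERT INTO sales VALUES (1, 'alice', 3, 2.5),
                             (2, 'bob',  10, 1.0),
                             (1, 'alice', 1, 4.0);
    -- Stand-in for CREATE MATERIALIZED VIEW: store the query result as a table.
    CREATE TABLE mv AS
        SELECT seller_id, seller_name, num_item * cost AS value FROM sales;
""")

# Query as written by the user, against the base table.
original = conn.execute(
    "SELECT seller_id, num_item * cost AS value FROM sales ORDER BY 1, 2"
).fetchall()

# Hand-written equivalent against the precomputed copy.
rewritten = conn.execute(
    "SELECT seller_id, value FROM mv ORDER BY 1, 2"
).fetchall()

print(original == rewritten)
```

Both queries return the same rows; the second one reads the stored `value` column instead of recomputing `num_item * cost`.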

Materialized Views in Hive for Data Restructuring

Interesting properties of Materialized Views in Hive:
● A copy of the data (full, partial or transformed)
● Used automatically by the engine based on cost analysis
● Can be stored as ORC, Parquet, etc.
● Multiple materialized views can co-exist, optimally chosen

Plus: Storage is cheap

Idea: Create multiple materialized views of the full data with desired structures

Materialized Views for Data Restructuring

Example:

Original T1:
● Partitioned on Year, Month, Day
● Stored as Text

Materialized View MV1:
● Partitioned on Year, Month, Day
● Sorted on Customer_Id
● Stored as ORC

Materialized View MV2:
● Partitioned on Year, Month, Day
● Sorted on Seller_Id
● Stored as ORC

Query1: SELECT * from T1 where customer_id = 26988 and month = "January";
Rewritten: SELECT * from MV1 where customer_id = 26988 and month = "January";

Query2: SELECT * from T1 where seller_id = 121 and month = "January";
Rewritten: SELECT * from MV2 where seller_id = 121 and month = "January";
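The choice between MV1 and MV2 in this example can be sketched as a lookup on the sort column. The slides say the engine chooses based on cost analysis; the code below swaps in a much simpler heuristic (pick the view sorted on a filtered column) for illustration only. `Layout`, `choose_layout`, and `rewrite` are hypothetical names, and the filter columns are passed in rather than parsed from the query.

```python
import re
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Layout:
    """Physical layout of a table or materialized view (illustrative)."""
    name: str
    partitioned_on: Tuple[str, ...]
    sorted_on: Optional[str]
    file_format: str


PARTS = ("year", "month", "day")
BASE = Layout("T1", PARTS, None, "TEXT")
VIEWS = [
    Layout("MV1", PARTS, "customer_id", "ORC"),
    Layout("MV2", PARTS, "seller_id", "ORC"),
]


def choose_layout(filter_columns: Set[str]) -> Layout:
    """Heuristic stand-in for cost analysis: prefer a view sorted on a filter column."""
    for view in VIEWS:
        if view.sorted_on in filter_columns:
            return view
    return BASE  # no view helps; keep reading the original table


def rewrite(query: str, filter_columns: Set[str]) -> str:
    """Replace references to T1 with the chosen layout's name."""
    target = choose_layout(filter_columns)
    return re.sub(r"\bT1\b", target.name, query)


q1 = 'SELECT * from T1 where customer_id = 26988 and month = "January";'
q2 = 'SELECT * from T1 where seller_id = 121 and month = "January";'
print(rewrite(q1, {"customer_id", "month"}))
print(rewrite(q2, {"seller_id", "month"}))
```

A query that filters only on `month` is left untouched in this sketch, since neither view's sort order helps it beyond the partitioning the base table already has.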

Materialized Views in SQL-on-Hadoop engines

● Basic implementation available in Hive 2.3.0
  ○ Uses Apache Calcite for query rewrite
  ○ Multi file format support. Uses ORC (by default) for optimized columnar storage of materialized queries
● Not available in Presto
● Not available in Spark


Challenges with Materialized Views

● Invalidation
  ○ Only a subset of use cases can work with stale data
● Rebuilds and Refreshes
  ○ Prohibitively expensive for full data copies
● Maintenance Isolation
  ○ Ongoing queries get affected


FastCopy: A framework for Autonomous Materialized Views

● Materialized Views for Sorting, Partitioning and Bucketing for structuring data
● Synchronous invalidation on table updates
● Asynchronous automatic refreshes
● Maintenance isolation by refreshes in their own scheduler queues, or even their own cluster
● Recommendation Engine to suggest Materialized Views
● Cross engine support for using Materialized Views
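The split between synchronous invalidation and asynchronous refresh can be sketched as a small state machine. This is not FastCopy's implementation: the class `CopyState`, its methods, and the assumption that validity is tracked per partition are all hypothetical, chosen only to show why a query never sees stale data even when the refresh runs later.

```python
from typing import Callable, Iterable, List, Set


class CopyState:
    """Tracks which partitions of a materialized copy are safe to read (sketch)."""

    def __init__(self, partitions: Iterable[str]) -> None:
        self.valid: Set[str] = set(partitions)
        self.stale: Set[str] = set()

    def on_table_update(self, partition: str) -> None:
        """Synchronous step: runs as part of the write to the base table."""
        self.valid.discard(partition)
        self.stale.add(partition)

    def can_serve(self, partitions: Iterable[str]) -> bool:
        """A query may be rewritten only if every partition it needs is valid."""
        return set(partitions) <= self.valid

    def refresh(self, rebuild: Callable[[str], None]) -> List[str]:
        """Asynchronous step: a background job rebuilds stale partitions."""
        refreshed = sorted(self.stale)
        for partition in refreshed:
            rebuild(partition)          # e.g. re-sort and rewrite this partition
            self.valid.add(partition)
        self.stale.clear()
        return refreshed


state = CopyState(["2018-01", "2018-02"])
state.on_table_update("2018-02")                 # base table changed
print(state.can_serve(["2018-01"]))              # untouched partition stays usable
print(state.can_serve(["2018-02"]))              # stale partition falls back to base table
print(state.refresh(rebuild=lambda p: None))     # later, on a separate queue or cluster
print(state.can_serve(["2018-02"]))
```

Because `refresh` only takes a callback, the rebuild work can run wherever the scheduler places it, which is one way to picture the maintenance isolation described above.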

Qubole FastCopy Infrastructure (diagram sequence; images not recoverable)
● FastCopy Creation
● Incoming query for rewrite
● Query Rewrite
● Invalidation and Refresh

Fun Details
● Auto detect added, dropped or updated partitions using level tokens
● Multi Version for FastCopy
● Minion clusters for workload isolation
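Detecting added, dropped, or updated partitions amounts to comparing two snapshots. The slides do not define "level tokens", so the sketch below makes a loud assumption: each partition carries an opaque token that changes whenever its data changes. `PartitionDiff` and `diff_partitions` are hypothetical names.

```python
from typing import Dict, NamedTuple, Set


class PartitionDiff(NamedTuple):
    added: Set[str]
    dropped: Set[str]
    updated: Set[str]


def diff_partitions(seen: Dict[str, str], current: Dict[str, str]) -> PartitionDiff:
    """Compare partition -> token maps from the last refresh and from now.

    Assumption: a token is an opaque string that differs whenever the
    partition's contents differ (the slides do not specify its form).
    """
    added = set(current) - set(seen)
    dropped = set(seen) - set(current)
    updated = {p for p in set(seen) & set(current) if seen[p] != current[p]}
    return PartitionDiff(added, dropped, updated)


last_refresh = {"day=01": "t1", "day=02": "t2", "day=03": "t3"}
now = {"day=02": "t2", "day=03": "t9", "day=04": "t4"}
print(diff_partitions(last_refresh, now))
```

Only the partitions in `added` and `updated` would need rebuilding, and those in `dropped` would be removed from the copy, which avoids a full rebuild on every change.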

Recommendations (screenshots not recoverable)
● Top Tables
● Column Usage as Filter predicates
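Both signals, top tables and column usage in filter predicates, can be derived from a query log. The sketch below is a rough illustration and not the recommendation engine itself: it uses regular expressions rather than a SQL parser, handles only single-table queries with simple comparison predicates, and `usage_stats` is a hypothetical name.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

TABLE_RE = re.compile(r"\bfrom\s+(\w+)", re.IGNORECASE)
WHERE_RE = re.compile(r"\bwhere\b(.*)", re.IGNORECASE | re.DOTALL)
# A column name followed by =, <, >, <= or >= (deliberately simplistic).
PREDICATE_RE = re.compile(r"(\w+)\s*(?:[<>]=?|=)")


def usage_stats(queries: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (table -> query count, (table, column) -> filter usage count)."""
    tables: Counter = Counter()
    columns: Counter = Counter()
    for query in queries:
        table_match = TABLE_RE.search(query)
        if table_match is None:
            continue
        table = table_match.group(1).lower()
        tables[table] += 1
        where_match = WHERE_RE.search(query)
        if where_match:
            for column in PREDICATE_RE.findall(where_match.group(1)):
                columns[(table, column.lower())] += 1
    return tables, columns


log = [
    "SELECT * FROM T1 WHERE customer_id = 26988 AND month = 'January'",
    "SELECT * FROM T1 WHERE seller_id = 121 AND month = 'January'",
    "SELECT * FROM T1 WHERE customer_id = 7",
    "SELECT seller_id FROM sales",
]
tables, columns = usage_stats(log)
print(tables.most_common(1))
print(columns.most_common())
```

A frequently queried table whose filters concentrate on a non-partition column such as `customer_id` would be a natural candidate for a copy sorted on that column.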

Revise
1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark, etc.)
2. Difficulties in structuring data
3. A case for Materialized Views
4. Challenges with Materialized Views
5. Solution

Status
● FastCopy is in internal alpha
● Will be released as a beta for customers in the next quarter
● Contribute to Open Source

Thank You
