Skipping-Oriented Partitioning for Columnar Layouts

Skipping-oriented Partitioning for Columnar Layouts Liwen Sun, Michael J. Franklin, Jiannan Wangy and Eugene Wuz University of California Berkeley, Simon Fraser Universityy, Columbia Universityz fliwen, [email protected], [email protected], [email protected] ABSTRACT year grade course year grade course grade year course t 2011 A AI t 2011 AI t1 2012 A DB 2 t1 A 2 As data volumes continue to grow, modern database systems in- 2011 B OS t 2011 A AI t3 t2 A t3 2011 OS creasingly rely on data skipping mechanisms to improve perfor- 2 t 2011 B OS 3 t1 2012 A DB t3 B t1 2012 DB mance by avoiding access to irrelevant data. Recent work [39] 2013 C DB t4 2013 C DB C t 2013 DB proposed a fine-grained partitioning scheme that was shown to im- t4 t4 4 prove the opportunities for data skipping in row-oriented systems. (a) original data (b) a SOP scheme (c) a GSOP scheme Modern analytics and big data systems increasingly adopt colum- Figure 1: Example of partitioning schemes. nar storage schemes, and in such systems, a row-based approach The opportunity for data skipping highly depends on how the misses important opportunities for further improving data skipping. tuples are organized into blocks. While traditional horizontal par- The flexibility of column-oriented organizations, however, comes titioning techniques, such as range partitioning, can be used for with the additional cost of tuple reconstruction. In this paper, we this purpose, in recent work [39], we proposed a skipping-oriented develop Generalized Skipping-Oriented Partitioning (GSOP), a novel partitioning (SOP) framework, which can significantly improve the hybrid data skipping framework that takes into account these row- effectiveness of data skipping over traditional techniques. The SOP based and column-based tradeoffs. In contrast to previous column- framework first analyzes the workload and extracts representative oriented physical design work, GSOP considers the tradeoffs be- filter predicates as features. Based on these features, SOP charac- tween horizontal data skipping and vertical partitioning jointly. Our terizes each tuple by a feature vector and partitions the data tuples experiments using two public benchmarks and a real-world work- by clustering the feature vectors. load show that GSOP can significantly reduce the amount of data While SOP has been shown to outperform previous techniques, scanned and improve end-to-end query response times over the its effectiveness depends on workload and data characteristics. Mod- state-of-the- art techniques. ern analytics applications can involve wide tables and complex workloads with diverse filter predicates and column-access patterns. For 1. INTRODUCTION this kind of workloads, SOP suffers from a high degree of fea- Data skipping has become an essential mechanism for improv- ture conflict. Consider the table in Figure 1(a). Suppose SOP ing query performance in modern analytics databases (e.g., [1,8,16, extracts two features from the workload: F1:grade=‘A’ and 26,40]) and the Hadoop ecosystem (e.g., [2,43]). In these systems, F2:year>2011^course=‘DB’. In this case, the best partition- data are organized into blocks, each of which typically contains ing scheme for feature F1 is t1t2jt3t4, since t1 and t2 satisfy F1 tens of thousands of tuples. At data loading time, these systems while t3 and t4 do not. For the same reason, the best partitioning compute statistics for each block (e.g., such as min and max val- scheme for feature F2 is t1t4jt2t3. Therefore, the conflict between ues of each column) and store the statistics as metadata. Incoming F1 and F2 lies in that their best partitioning schemes are different. queries can evaluate their filter predicates against such metadata Since SOP generates a single horizontal partitioning scheme that and decide which blocks can be safely skipped (i.e., do not need incorporates all features (e.g., Figure 1(b)), when there are many to be read or accessed). For example, suppose a query contains a highly conflicting features, it may be rendered ineffective. predicate age=20; if the metadata of a block says its min value on The key reason why SOP is sensitive to feature conflict is that it column age is 25, then this block can be skipped by this query. produces only monolithic horizontal partitioning schemes. That is, Reading less data not only saves I/O, but also reduces CPU work, SOP views every tuple as an atomic unit. While this perspective is such as decompression and deserialization. Therefore, data skip- natural for row-major data layouts, it becomes an unnecessary con- ping improves query performance even when data are memory- or straint for columnar layouts. Analytics systems increasingly adopt SSD-resident. columnar layouts [20] where each column can be stored separately. Inspired by recent work in column-oriented physical design, such as database cracking [23,38], we propose to remove the “atomic- tuple” constraint and allow different columns to have different horizontal partitioning schemes. By doing so, we can mitigate feature This work is licensed under the Creative Commons Attribution- conflict and boost the performance of data skipping. Consider the NonCommercial-NoDerivatives 4.0 International License. To view a copy example in Figure 1(c), where we partition column grade based of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For on F1 and independently partition the columns year and course any use beyond those covered by this license, obtain permission by emailing based on F . This hybrid partitioning scheme successfully resolves [email protected]. 2 Proceedings of the VLDB Endowment, Vol. 10, No. 4 the conflict between F1 and F2, as the relevant columns for each Copyright 2016 VLDB Endowment 2150-8097/16/12. can be partitioned differently. Unfortunately, this columnar ap- 421 proach incurs overhead for tuple-reconstruction [24], i.e., the pro- • We propose a GSOP framework for columnar layouts, which cess of assembling column values into tuples during query process- generalizes SOP by removing the atomic-tuple constraint. ing. Since column values are no longer aligned, to query such data, • we may need to maintain tuple ids and join the column values using We develop an objective function to quantify the skipping vs. these tuple ids. Thus, although combining horizontal and vertical reconstruction trade-off of GSOP. partitioning has large potential benefits, it is unclear how to bal- • We devise a skipping-aware column grouping algorithm and ance the benefits that a particular vertical partitioning scheme has propose techniques to select local features. on skipping against its tuple-reconstruction overheads. To this end, we propose a generalized SOP (GSOP) framework, • We prototype GSOP using Parquet and Spark and perform an with a goal of optimizing the overall query performance by auto- experimental evaluation using two public benchmarks and a matically balancing skipping effectiveness and tuple-reconstruction real-world workload. Our results show that GSOP can signifi- overhead. GSOP generalizes SOP by removing the atomic-tuple cantly outperform SOP in the presence of feature conflict. constraint and allowing both horizontal and vertical partitionings. For a given data and workload setting, GSOP aims to pick a hybrid 2. BACKGROUND partitioning scheme that maximizes the overall query performance. Given the goal of GSOP, we can think of several na¨ıve approaches. 2.1 SOP vs. Range Partitioning The first approach is to apply a state-of-the-art vertical partitioning How data is partitioned into blocks can significantly affect the technique (e.g., [10,14,27,46]) to divide the columns into groups chances of data skipping. A common approach is to perform a and to then use SOP to horizontally partition each column group. range partitioning on the frequently filtered columns. As pointed Such an approach, however, is oblivious to the potential impact of out in [39], however, range partitioning is not an ideal solution for vertical partitioning on skipping horizontal blocks. Another na¨ıve generating blocks at a fine-granularity. To use range partitioning approach is to first horizontally partition the data into blocks using for data skipping, we need a principled way of selecting partition- SOP and then vertically partition each block using existing tech- ing columns and capturing inter-column data correlation and filter niques. In this approach, although the columns are divided into correlation. SOP extracts features (i.e., representative filters which groups, they still have the same horizontal partitioning scheme, may span multiple columns) from a query log and constructs a fine- since this approach runs SOP (using all features) before applying grained partitioning map by solving a clustering problem. the vertical partitioning. Thus, this apporach does not really help SOP techniques can co-exist with traditional partitioning tech- mitigate feature conflict. Both naive approaches fail to incorporate niques such as range partitioning, as they operate at different gran- the interrelation between horizontal and vertical partitionings and ularities. In a data warehouse environment, for example, data are how they jointly affect data skipping. often horizontally partitioned by date ranges. Traditional horizontal skipping-aware column In the design of GSOP, we propose a partitioning facilitates common operations such as batch insertion grouping technique. We develop an objective function to quan- and deletion and enables partition pruning, but is relatively coarse- tify the trade-off of skipping effectiveness vs. tuple-reconstruction grained. While partition pruning helps queries skip some partitions overhead. One major technical challenge involved in using such an based on their date predicates, SOP further segments each horizon- objective function is to estimate the effectiveness of data skipping, tal partition (say, of 10 million tuples) into fine-grained blocks (say, i.e., how much data can be skipped by queries. As described in Sec- of 10 thousand tuples) and helps queries skip blocks inside each un- tion 4.3, directly assessing skipping effectiveness is prohibitively pruned partition.

Skipping-Oriented Partitioning for Columnar Layouts

Analysis of Big Data Storage Tools for Data Lakes Based on Apache Hadoop Platform

Developer Tool Guide

Unravel Data Systems Version 4.5

Talend Open Studio for Big Data Release Notes

Hive, Spark, Presto for Interactive Queries on Big Data

Impala: a Modern, Open-Source SQL Engine for Hadoop

Metadata in Pig Schema

Perform Data Engineering on Microsoft Azure Hdinsight (775)

Enabling Geospatial in Big Data Lakes and Databases with Locationtech Geomesa

Code Smell Prediction Employing Machine Learning Meets Emerging Java Language Constructs"

Release Notes Date Published: 2020-10-13 Date Modified

Eyeglass Search OSS Licenses and Packages V9