
Approximate Partition Selection for Big-Data Workloads using Summary Statistics

Kexin Rong†∗, Yao Lu†, Peter Bailis∗, Srikanth Kandula†, Philip Levis∗
†Microsoft, ∗Stanford
∗{krong,pbailis,pal}@cs.stanford.edu, †{lu.yao,srikanth}@microsoft.com

ABSTRACT
Many big-data clusters store data in large partitions that support access at a coarse, partition-level granularity. As a result, approximate query processing via row-level sampling is inefficient, often requiring reads of many partitions. In this work, we seek to answer queries quickly and approximately by reading a subset of the data partitions and combining partial answers in a weighted manner without modifying the data layout. We illustrate how to efficiently perform this query processing using a set of pre-computed summary statistics, which inform the choice of partitions and weights. We develop novel means of using the statistics to assess the similarity and importance of partitions. Our experiments on several datasets and data layouts demonstrate that to achieve the same relative error compared to uniform partition sampling, our techniques offer from 2.7× to 70× reduction in the number of partitions read, and the statistics stored per partition require fewer than 100KB.

PVLDB Reference Format:
Kexin Rong, Yao Lu, Peter Bailis, Srikanth Kandula, Philip Levis. Approximate Partition Selection for Big-Data Workloads using Summary Statistics. PVLDB, 13(11): 2606-2619, 2020.
DOI: https://doi.org/10.14778/3407790.3407848

[Figure 1: Our system PS3 makes novel use of summary statistics to perform importance and similarity-aware sampling of partitions. Offline, a Stats Builder computes precomputed and query-specific statistics for each partition; online, a Partition Picker takes a query (e.g., SELECT X, SUM(Y)) and a budget (e.g., 10 partitions) and outputs a weighted selection (e.g., partition 1 with weight 10, partition 3 with weight 2).]

1. INTRODUCTION
Approximate Query Processing (AQP) systems allow users to trade off between accuracy and query execution speed. In applications such as data exploration and visualization, this trade-off is not only acceptable but often desirable. Sampling is a common approximation technique, wherein the query is evaluated on a subset of the data, and much of the literature focuses on row-level samples [12, 15, 22].

When data is stored in media that does not support random access (e.g., flat files in data lakes and columnar stores [1, 52]), constructing a row-level sample can be as expensive as scanning the entire dataset. For example, if data is split into partitions of 100 rows each, a 1% uniform row sample would in expectation require fetching 64% (1 - 0.99^100) of the partitions; a 10% uniform row sample would touch almost all partitions. As a result, recent work from a production AQP system shows that row-level sampling only offers significant speedups for complex queries where substantial query processing remains after the sampling [41].
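To see where the 64% figure comes from: a partition is fetched unless the sample skips every one of its rows, so a uniform row sample with rate p reads an expected fraction 1 - (1 - p)^r of partitions holding r rows each. A quick check in Python (illustrative only):

```python
# Expected fraction of partitions a uniform row sample must fetch:
# a partition of r rows is skipped only if all r of its rows miss the
# sample, which happens with probability (1 - p)**r at sampling rate p.
def expected_partitions_read(p: float, r: int) -> float:
    return 1.0 - (1.0 - p) ** r

print(expected_partitions_read(0.01, 100))  # ~0.634: a 1% row sample reads ~64%
print(expected_partitions_read(0.10, 100))  # ~0.99997: a 10% sample reads nearly all
```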
In contrast to row-level sampling, the I/O cost of constructing a partition-level sample is proportional to the sampling fraction.¹ In our example above, a 1% partition-level sample would only read 1% of the data. We are especially interested in big data clusters, where data is stored in chunks of tens to hundreds of megabytes, instead of disk blocks or pages, which are typically a few kilobytes [32, 52]. Partition-level sampling is already used in production due to its appealing performance: commercial databases create statistics using partition samples [5, 28], and several Big Data stores allow sampling partitions of tables [2, 6, 8].

¹In the paper, we use "partition" to refer to the finest granularity at which the storage layer maintains statistics.

However, a key challenge remains in how to construct partition-level samples that can answer a given query accurately. Since all or none of the rows in a partition are included in the sample, the correlation between rows (e.g., due to layout) can lead to inaccurate answers. A uniformly random partition-level sample does not make a representative sample of the dataset unless the rows are randomly distributed among partitions [23], which happens rarely in practice [19]. In addition, even a uniform random sample of rows can miss rare groups in the answer or miss the rows that contribute substantially to SUM-like aggregates. It is not known how to compute stratified [12] or measure-biased [30] samples over partitions, which helps with queries with group-bys and complex aggregates.

In this work, we introduce PS3 (Partition Selection with Summary Statistics), a system that supports AQP via weighted partition selection (Figure 1). Our primary use case is large-scale production query processing systems such as Spark [14], F1 [51], and SCOPE [20], where queries are read-only and datasets are bulk-appended. Our goal is to minimize the approximation error given a sampling budget, i.e., the fraction of data that can be read. Motivated by observations from production clusters at Microsoft and in the literature that many datasets remain in the order in which they were ingested [42], PS3 does not require any specific layout or repartitioning of data. Instead of storing precomputed samples [13, 15, 21], which requires significant storage budgets to offer good approximations for a wide range of queries [24, 43], PS3 performs sampling exclusively during query optimization.
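To make weighted partition selection concrete, here is a minimal sketch of how partial answers from selected partitions combine into an estimate, mirroring the SELECT X, SUM(Y) example of Figure 1. The chosen partitions and their weights are assumed inputs here; how PS3 actually assigns them is the subject of the rest of the paper.

```python
# Minimal sketch of evaluating SELECT SUM(Y) over a weighted partition
# sample. Each partition's partial sum is scaled by its weight so that
# the weighted total estimates the full-data answer.
def weighted_sum_estimate(partitions, chosen):
    """partitions: pid -> list of row dicts; chosen: pid -> weight."""
    estimate = 0.0
    for pid, weight in chosen.items():
        partial = sum(row["Y"] for row in partitions[pid])  # read only this partition
        estimate += weight * partial  # scale the partial answer by its weight
    return estimate

# Toy data: 12 partitions of 3 rows each; read 2 partitions, weights assumed.
partitions = {pid: [{"Y": 10 * pid + j} for j in range(3)] for pid in range(12)}
print(weighted_sum_estimate(partitions, {1: 10.0, 3: 2.0}))
```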
Given the query and each partition's summary statistics, PS3 learns to cluster partitions into several importance groups, and allocates the sampling budget across groups such that the more important groups get a greater proportion of the budget. The training overhead is a one-time cost for each dataset and workload, and for high-value datasets in clusters that are frequently queried, this overhead is amortized over time.

In addition, we leverage the redundancy and skewness of the partitions for further optimization. For two partitions that output similar answers to an input query, it suffices to include only one of them in the sample. While directly comparing the contents of the two partitions is expensive, we can use the query-specific summary statistics as a proxy for the similarity between partitions. We also observe that datasets commonly exhibit significant skew in practice. For example, in a prototypical production service request log dataset at Microsoft, the most popular application version out of 167 distinct versions accounts for almost half of the dataset. Inspired by prior work in AQP that recognizes the importance of outliers [15, 22], we use summary statistics (e.g., the occurrences of heavy hitters in a partition) to identify a small number of partitions that are likely to contain rare groups, and dedicate a portion of the sampling budget to evaluating these partitions exactly.
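The following sketch illustrates the similarity-as-proxy idea only; the feature construction, distance metric, and threshold are our assumptions for exposition, not PS3's actual clustering procedure:

```python
import numpy as np

# Treat each partition's query-specific statistics as a feature vector and
# keep one representative per group of near-duplicate partitions,
# up-weighting it by the number of partitions it stands in for.
def pick_representatives(stats, threshold=0.1):
    reps, weights = [], []
    for i, vec in enumerate(stats):
        for j, rep_idx in enumerate(reps):
            if np.linalg.norm(vec - stats[rep_idx]) <= threshold:  # near-duplicate
                weights[j] += 1.0
                break
        else:
            reps.append(i)
            weights.append(1.0)
    return reps, weights

# Example: 4 partitions described by (estimated selectivity, mean of column Y).
stats = np.array([[0.5, 10.0], [0.5, 10.05], [0.9, 50.0], [0.1, 1.0]])
print(pick_representatives(stats))  # partitions 0 and 1 collapse into one sample
```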
Finally, similar to the query scope studied in prior work [13, 47, 53], PS3 supports single-table queries with SUM, COUNT(*), and AVG aggregates, GROUP BY on columnsets with moderate distinctiveness, and predicates that are conjunctions, disjunctions, or negations over single-column clauses.

To select the partitions that are most relevant to a query, PS3 leverages the insight that partition-level summary statistics are relatively inexpensive to compute and store. The key question is which statistics to use. Systems such as Spark SQL and ZoneMaps already maintain statistics such as the maximum and minimum values of a column to assist in query optimization [42]. Following similar design considerations, we look for statistics with small space requirements that can be computed for each partition in one pass at ingest time. For functionality, we look for statistics that are discriminative enough to support decisions such as whether a partition contributes disproportionately large values to the aggregates. We propose such a set of statistics for partition sampling (measures, heavy hitters, distinct values, and histograms) that includes but expands on conventional catalog-level statistics; a sketch of the resulting per-partition record appears at the end of this section. The total storage overhead scales with the number of partitions instead of with the dataset size. We maintain only single-column statistics to keep the overhead low. The resulting storage overhead can be orders of magnitude smaller than approaches that use auxiliary indices to reduce the cost of random access [30]. While the set of statistics is by no means complete, we show that each type of statistics contributes to the sampling performance and, in aggregate, delivers effective AQP results.

In summary, this paper makes the following contributions:

1. We introduce PS3, a system that makes novel use of summary statistics to perform weighted partition selection for many popular queries. Given the query semantics, summary statistics, and a sampling budget, the system intelligently combines a few sampling techniques to produce a set of partitions to sample and the weight of each partition.

2. We propose a set of lightweight sketches for data partitions that are not only practical to implement, but can also produce rich partition summary statistics. While the sketches are well known, this is the first time the statistics are used for weighted partition selection.

3. We evaluate on a number of real-world datasets with real and synthetic workloads. Our evaluation shows that each component of PS3 contributes meaningfully to the final accuracy, and that together the system outperforms alternatives across datasets and layouts, delivering a 2.7× to 70× reduction in data read at the same error compared to uniform partition sampling.
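As referenced above, here is a minimal sketch of what a per-partition summary record could look like for the four statistic types (measures, heavy hitters, distinct values, histograms); the field names and sketch choices are illustrative assumptions, not PS3's exact format:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative per-partition summary record; fields and parameters are
# assumptions for exposition, not the system's actual on-disk layout.
@dataclass
class ColumnSummary:
    min_val: float                 # measures: min / max / sum of the column
    max_val: float
    col_sum: float
    heavy_hitters: Dict[str, int] = field(default_factory=dict)  # frequent value -> count
    distinct_count: int = 0        # estimate, e.g., from a HyperLogLog-style sketch
    histogram: List[int] = field(default_factory=list)           # equi-width bucket counts

@dataclass
class PartitionSummary:
    num_rows: int
    columns: Dict[str, ColumnSummary] = field(default_factory=dict)

# Each record is small and computable in one pass at ingest time, so total
# storage grows with the number of partitions rather than the dataset size.
```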