Pushing Data-Induced Predicates Through Joins in Big-Data Clusters

Srikanth Kandula, Laurel Orr, Surajit Chaudhuri Microsoft

ABSTRACT Using data statistics, we convert predicates on a into data in- duced predicates (diPs) that apply on the joining tables. Doing so substantially speeds up multi- queries because the beneûts of predicate pushdown can now apply beyond just the tables that Figure Ï:Example illustrating creation and use of a data-induced have predicates. We use diPs to skip data exclusively during query predicate which only uses the join columns and is a necessary con- optimization; i.e., diPs lead to better plans and have no overhead dition to the true predicate, i.e., σ ⇒ dpartkey. during query execution. We study how to apply diPs for complex Ï query expressions and how the usefulness of diPs varies with the pruning [ÊÏ]), but predicates over join columns are rare, and these data statistics used to construct diPs and the data distributions.Our techniques do not apply when the predicates use columns that do results show that building diPs using zone-maps which are already not exist in the joining tables. maintained in today’s clusters leads to sizable data skipping gains. Some systems implement a form of sideways information passing Using a new (slightly larger) statistic, ªì of the queries in the TPC- over joins [}Ï, CÊ] during query execution. For example, they may H, TPC-DS and JoinOrder benchmarks can skip at least kkì of the build a bloom ûlter over the values of the join partkey query input. Consequently, the median query in a production big- in the rows that satisfy the predicate on the part table and use this data cluster ûnishes roughly }× faster. bloom ûlter to skip rows from the lineitem table. Unfortunately, this technique only applies during query execution, does not easily PVLDB Reference Format: extend to general joins and has high overheads, especially during Srikanth Kandula, Laurel Orr, Surajit Chaudhuri. Pushing Data-Induced Predicates _rough Joins in Big-Data Clusters. PVLDB, Ï}(k): xxxx-yyyy, parallel execution on large datasets because constructing the bloom }ªÏ•. ûlter becomes a scheduling barrier delaying the scan of lineitem DOI: https://doi.org/Ϫ.Ï99Ê/kkCÊ}ʕ.kkCÊ}•} until the bloom ûlter has been constructed. We seek a method that can convert predicates on a table to data 1. INTRODUCTION skipping opportunities on joining tables even if the predicate columns In this paper, we seek to extend the beneûts of predicate push- are absent in other tables. Moreover, we seek a method that applies down beyond just the tables that have predicates. Consider the fol- exclusively during generation in order to limit overheads lowing fragment of TPC-H query ¢Ï9 [}ª]. during query execution. Finally, we are interested in a method that is easy to maintain, applies to a broad class of queries and makes SELECT SUM(l extendedprice) FROM lineitem minimalist assumptions. JOIN part ON l partkey = p partkey Our target scenario is big-data systems, e.g., SCOPE[], Spark[k9, WHERE p brand=‘:1’ AND p container=‘:2’ Ê], Hive [ʪ], FÏ [9C] or Pig [C9] clusters that run SQL-like queries _e lineitem table is much larger than the part table, but because over large datasets; recent reports estimate over a million servers in the query predicate uses columns that are only available in part, such clusters [Ï]. predicate pushdown cannot speed up the scan of lineitem. How- Big-data systems already maintain data statistics such as the max- ever, it is easy to see that scanning the entire lineitem table will be imum and minimum value of each column at diòerent granularities wasteful if only a small number of those rows will join with the rows of the input; details are in Table Ï. In the rest of this paper, for sim- from part that satisfy the predicate on part. plicity, we will call this the zone-map statistic and we use the word If only the predicate was on the column used in the join condition, to denote the granularity at which statistics are maintained. partkey, then a variety of techniques become applicable (e.g., al- Using data statistics, we oòer a method that converts predicates gebraic equivalence [k], magic set rewriting [ª, 9}] or value-based on a table to data skipping opportunities on the joining tables at time. _e method, an example of which is shown This work is licensed under the Creative Commons Attribution- in Figure Ï, begins by using data statistics to eliminate partitions on NonCommercial-NoDerivatives 4.0 International License. To a copy tables that have predicates. _is step is already implemented in some of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For systems [9, ÏÊ, ,ÊÏ].Next, using the data statistics of the partitions any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights that satisfy the local predicates, we construct a new predicate which licensed to the VLDB Endowment. Ï Proceedings of the VLDB Endowment, Vol. 12, No. 3 Over all the queries in TPC-H [}C] and TPC-DS [}], there are ISSN 2150-8097. zero predicates on join columns perhaps because join columns tend DOI: https://doi.org/10.14778/3368289.3368292 to be opaque system-generated identiûers. Table Ï: Data statistics maintained by several systems.

Scheme Statistic Granularity ZoneMaps [Ï] max and min value per col- zone Spark [•, ÏÊ] umn file Exadata [C] max, min and present or per table region Vertica [•], ORC[}], null count per column stripe, - Parquet [ÏC] group Brighthouse [99] histograms, char maps per col data pack

Figure k:A workow which shows changes in red; using partition statistics, our query optimizer computes data-induced predicates and outputs plans that read less input.

range-set, that improves query performance over zone-maps at the Figure }: Illustrating the need to move diPs past other opera- cost of a small increase in space. We also discuss how to maintain tions (le). On a k-way join when all tables have predicates (middle), statistics when datasets evolve. In more detail, the rest of this paper the optimal schedule only requires three (parallel) steps (right). has these contributions. ● Using diPs for complex queries leads to novel challenges. captures all of the join column values contained in such partitions. – Consider TPC-H query¢Ï9 [}ª] in Figure }(le) which has a _is new data-induced predicate (diP) is a necessary condition of the nested sub-query on the lineitem table. Creating diPs for actual predicate (i.e., σ ⇒ d) because there may be false-positives; only the fragment considered in Figure Ï still reads the entire i.e., in the partitions that are included in the diP, not all of the rows lineitem table for the nested group-by. To alleviate this, we may satisfy σ. However, the diP can apply over the joining table be- use new QO transformation rules to move diPs; in this case, cause it only uses the join column; in the case of Figure Ï, the diP shown with dotted arrows, the diP is pulled above a join, moved constructed on the part table can be applied on the partition statis- sideways to a diòerent join input and then pushed below a group- tics of lineitem to eliminate partitions. All of these steps happen by thereby ensuring that a shared scan of lineitem will suõce. during query optimization; our QO eòectively replaces each table – When multiple joining tables have predicates, a second chal- with a partition subset of that table; the reduction in input size oen lenge arises. Consider the k-way join in Figure }(middle) where triggers other plan changes (e.g., using broadcast joins which elim- all tables have local predicates. _e ûgure shows four diPs: inate a partition-shuøe [9]) leading to more eõcient query plans. one per table and per join condition. If applying these diPs If the above method is implemented using zone-maps, which are eliminates partitions on some joining table, then the diPs that maintained by many systems already, the only overhead is an in- were previously constructed on that table are no longer up-to- crease in query optimization time which we show is small in §. date. Re-creating diPs whenever partition subsets change will For queries with joins, we show that data-induced predicates oòer increase data skipping but doing so na¨ıvely can construct ex- comparable query performance at much lower cost relative to mate- cessively many diPs which increases query optimization time. rializing denormalized join views [C] or using join indexes [, }k]. We present an optimal schedule for tree-like join graphs which _e fundamental reason is that these techniques use augmentary converges to ûxed point and hence achieves all possible data data-structures which are costly to maintain; yet, their beneûts are skipping while computing the fewest number of diPs. Star and limited to narrow classes of queries (e.g., queries that match views, snowake schemas result in tree-like join graphs. We discuss have speciûc join conditions, or very selective predicates) [k]. Data- how to derive diPs for general join graphs within a cost-based induced predicates, we will show, are useful more broadly. query optimizer in §k. We also note that the construction and use of data-induced pred- diPs icates is decoupled from how the datasets are laid out.Prior work ● We show how diòerent data statistics can be used to compute identiûes useful data layouts, for example, co-partition tables on in § and discuss why range-sets represent a good trade-oò be- their join columns [Ï, C}] or cluster rows that satisfy the same pred- tween space usage and potential for data skipping. icates [9k, 9Ê]; the former speeds up joins and the latter enhances ● We discuss two methods to cope with dataset updates in §.Ï. _e data skipping. In our big-data clusters, many unstructured datasets ûrst method taints partitions whose rows change and the second remain in the order that they were uploaded to the cluster. _e approximately updates partition statistics when rows change. We choice of data layout can have exogenous constraints (e.g., privacy) will show that both methods can be implemented eõciently and and is query dependent; that is, no one layout helps with all possible that diPs oòer large gains in spite of updates. queries. In §, we we will show that diPs oòer signiûcant additional ● Fundamentally, data-induced predicates are beneûcial only if the speedup when used with the data layouts proposed by prior works join column values in the partitions that satisfy a predicate con- and that diPs improve query performance in other layouts as well. tain only a small portion of all possible join column values. In §}.Ï, To the best of our knowledge, this paper is the ûrst to oòer data we discuss real-world use-cases that cause this property to hold skipping gains across joins for complex queries before query exe- and quantify their occurrence in production workloads. cution while using only small per-table statistics.Prior work either ● We report results from experiments on production clusters at Mi- only oòers gains during query execution [}Ï, ª, k, CÊ, 9},ÊÏ] or croso that have tens of thousands of servers. We also report re- uses more complex structures which have sizable maintenance over- sults on SQL server. See Figure k for a high-level architecture dia- heads [, }k, C, 9k, 9Ê]. To achieve the above, we oòer an eõcient gram. Our results in § will show that using small statistics and a method to compute diPs on complex query expressions (multiple small increase in query optimization time, diPs oòer sizable gains joins, join cycles, nested statements, other operations). _is method on three workloads (TPC-H [}C], TPC-DS [}], JOB[Ï}]) under works with a variety of data statistics. We also oòer a new statistic, a variety of conditions. 2. MOTIVATION Table }:Constructing diPs using partition statistics. We begin with an example that illustrates how data-induced pred- Column Partition ¢ icates (diPs) can enhance data skipping. Consider the query expres- Ï } k date sk [kªªª, ªªª] [Ϫªª, Cªªª] [9ªªª, Ï}ªªª] sion, σyear (date dim)&date sk R. Table }a shows the zone-maps per year [ϕ•, }ªªª] [ϕ•ª, }ªª}] [}ªª, }ªÏÊ] partition for the predicate and join columns. Recall that zone-maps (a) Zone maps [Ï], i.e., the maximum and minimum values, for two are the maximum and minimum value of a column in each partition, columns in three hypothetical partitions of the date dim table. and we use partition to denote the granularity at which statistics are Pred.(σ) Satisfying Data-induced ì total maintained which could be a ûle, a rowgroup etc.(see Table Ï). Ta- partitions Predicate range ble }b shows the diPs corresponding to diòerent predicates. _e year ≤ ϕ• {Ï, }} date sk ∈ [Ϫªª, Cªªª] ì predicate column year is only available on the date dim table, but year ∈ [}ªªk, }ªª] ∅ date sk ∈ [] ªì year > { } d ∈ [ ] the diPs are on the join column date sk and can be pushed onto }ªÏª k ate sk 9ªªª, Ï}ªªª ì diPs (b) Data-induced predicates on the join column date sk corresponding joining relations using column equivalence [k]. _e shown to predicates on the column year; built using stats from Table }a. here are DNFs over ranges; if the equijoin condition has multiple columns, the diPs will be a conjunction of DNFs, one DNF per col- umn. Further details on the class of predicates supported, extending to multiple joins and handling other operators, are in §k.}. Table }b also shows that the diPs contain a small portion of the range of the join column date sk (which is [Ϫªª, Ï}ªªª]); thus, they can oòer large data skipping gains on joining relations. diPs It is easy to see that can be constructed using any data statis- (a) Equiwidth histograms for the dataset in Table }a. tic that supports the following steps: (Ï) identify partitions that sat- Range-set Partition ¢ isfy query predicates, (}) merge the data statistic of the join columns (size }) Ï } k over the satisfying partitions, and (k) use the merged statistic to date sk {[kªªª, kªª], {[Ϫªª, }ªªª], {[9ªªª, Ϫªªª], [ªªª, ªªª]} [ªªª, Cªªª]} [ÏϪªª, Ï}ªªª]} extract a new predicate that can identify partitions satisfying the year {[ϕ•, ϕ•9], {[ϕ•ª, ϕ•k], {[}ªª, }ªÏ], predicate in joining relations.Many data statistics support these [ϕ•Ê, }ªªª]} [ϕ•Ê, }ªª}]} [}ªÏ, }ªÏÊ]} steps [kÏ], and diòerent stats can be used for diòerent steps. (b) Range-set of size }, i.e., two non-overlapping max and min values, To illustrate the trade-oòs in choice of data statistics, consider Fig- which contain all of the data values. ure a which shows equi-width histograms for the same columns and partitions as in Table }a. A histogram with b buckets uses b + } Figure : Other data statistics (histograms, range-sets) for the same doubles} compared to the two doubles used by zone maps (for the example as in Table }a; range-sets yield more succinct diPs. min. and max. value). Regardless of the number of buckets used, note that histograms will generate the same diPs as zone-maps. _is is because histograms do not remember gaps between buckets.Other histograms (e.g., equi-depth, v-optimal) behave similarly. More- over, the frequency information maintained by histograms is not useful here because diPs only reason about the existence of values. Guided by this intuition, consider a set of non-overlapping ranges {[li , ui ]} which contain all of the data values; such range-sets are Figure :For TPC-H query Ï9 in Figure } (le), the table shows the a simple extension of zone-maps which are, trivially, range-sets of partition reduction from using diPs. On the right, we show the plan size Ï. However, range-sets also record gaps that have no values. Fig- generated using magic-set transformations which push group-by ure b shows range-sets of size }.It is easy to see that range-sets give above the join. diPs complement magic-set transformations; we see rise to more succinct diPsk. We will show that using a small number here that magic-set tx cannot skip partitions of lineitem but be- of ranges leads to sizable improvements to query performance in §. cause group-by has been pushed above the join, moving diPs side- We discuss how to maintain range-sets and why range-sets perform ways once is enough unlike the case in Figure }(le). better than other statistics (e.g., bloom ûlters) in §. To assess the overall value of diPs, for TPC-H query ¢Ï9 [}ª] part contain rows that satisfy the predicate and the corresponding from Figure }(le),Figure  shows the I/O size reduction from using diP eliminates many partitions in lineitem. We will show results diPs. _ese results use a range-set of size  (i.e., Ê doubles per col- in § for many diòerent data layouts and data distributions. We dis- umn per partition). _e TPC-H dataset was generated with a scale cuss plan transformations needed to move the diP, as shown in Fig- factor of Ϫª, skewed with a zipf factor of } [}•], and tables were laid ure } (le), in §k.k.Overall, for the ϪªGB dataset, a ª.MB statistic  out in a typical manner . Each partition is ∼ ϪªMBs of data which reduces the initial I/O for this query by }ª×; the query can speed up is a typical quanta in distributed ûle systems [ª] and is the default by more or less depending on the work remaining aer initial I/O. in our clusters [ÊÊ]. Recall that the predicate columns are only avail- able in the part table. _e ûgure shows that only two partitions of 2.1 Use-cases where data-induced predicates can lead to large I/O savings } b to store the frequency per bucket and two for min and max. Given the examples thus far, it is perhaps easy to see that diPs k diP For year ≤ ϕ•, the using two ranges is date sk ∈ {[ÏK, }K], translate into large I/O savings when the following conditions hold. [kK, k.K], [K, CK]} which covers kªì fewer values than the diP date sk built using a zone-map ( ∈ [ÏK, CK]) in Table }b. C1 lineitem was clustered on l shipdate and each cluster sorted _e predicate on a table is satisûed by rows belonging to a on l orderkey; part was sorted on its key; this layout is known small subset of partitions of that table. to lead to good performance because it reduces re-partitioning for C2 _e join column values in partitions that satisfy the predicate joins and allows date predicates to skip partitions [k, }}, }9]. are a small subset of all possible join column values. 1 Table k: Notation used in this paper. 0.9 Symbol Meaning 0.8 Rows satisfying predicate pi Predicate on table i 0.7 Partitions satisfying predicate 0.6 JoinColValues in satisfying partitions pi j Equi-join condition between tables i and j 0.5 0.4 qi A vector whose x’th element is Ï if partition x of table i has to 0.3 be read and ª otherwise. 0.2 d → Data-induced predicate from table i to table j 0.1 i j 0 partition granularity at which the store maintains statistics (Table Ï) CDF (over predicates) 0 0.2 0.4 0.6 0.8 1 Fraction ′ Figure C: Quantifying how oen the conditions that lead to large I/O goal is to emit an equivalent expression E in which one or more of skipping gains from using diPs hold in practice by using queries and the table accesses are restricted to only read a subset of partitions. datasets from production clusters at Microso. _e algorithm applies to a wide class of queries (see §k.}) and can C3 In tables that receive diPs, the join column values are dis- work with many kinds of data statistics (see §). tributed such that diPs only pick a small subset of partitions. _e algorithm has three building blocks: use predicates on in- dividual tables to identify satisfying partitions, construct diPs for We identify use-cases where these conditions hold based on our pairs of joining tables and apply diPs to further restrict the subset experiences in production clusters at Microso []. of partitions that have to be read on each table. Using the notation ● Much of the data in production clusters is stored in the order in Table k, these steps can be written as: in which it was ingested into the cluster [k9, Ï]. A typical in- x Satisfy gestion process consists of many servers uploading data in large ∀ table i, partition x, qi ← (pi , x), (Ï) batches.Hence, a consecutive portion of a dataset is likely to con- ∀ tables i, j, di→ j ← DataPred(qi , pi j), (}) tain records for roughly similar periods of time, and entries from x x ∀ table j, partition x, q j ← q j ∏i≠ j Satisfy(di→ j , x).( k) a server are concentrated into just a few portions of the dataset. _us, queries for a certain time-period or for entries from a server We defer describing how to eõciently implement these equations will pick only a few portions of the dataset. _is helps with C1. to § because the details vary based on the statistic and focus here When such datasets are joined on time or server-id, this phe- on using these building blocks to enhance data skipping. nomenon also helps with C2 and C3. Note that the ûrst step (Equation Ï) executes once, but the lat- ● A common physical design methodology for performant parallel ter two steps may execute multiple times because whenever an in- plans is to hash partition a table by predicate columns and range coming diP changes the set of partitions that have to be read on a partition or order by the join columns [k, }}, }9, Ï] and vice- table (i.e., q changes in Equation k), then the diPs from that table versa.Performance improves since the shuøes to re-partition for (which are computed in Equation } based on q) will have to be re- joins decrease [Ï9, 9,ʕ] and predicates can skip data. Such data computed. _is eòect may cascade to other tables. layouts help with all three conditions C1–C3 and, in our experi- If a join graph, constructed with tables as nodes and edges be- ments, receive the largest I/O savings from diPs. tween tables that have a join condition, has n nodes and m edges, ● Join columns are keys which monotonically increase as new data then a na¨ıve method will construct }m diPs using Eq. }, one along is inserted and hence are related to time. For example, both the each edge in each direction, and will use these diPs in Eq. k to fur- title-id of movies and the name-id of actors in the IMDB dataset ther restrict the partition subsets of joining tables. _is step repeats [ÏÏ] roughly monotonically increase as each new title and new ac- until ûxpoint is reached (i.e., no more partitions can be eliminated). tor are added to the dataset. In such datasets, predicates on time Acyclic join graphs can repeat this step up to n − Ï times, i.e., con- as well as predicates that are implicitly related to time, such as co- struct up to }m(n − Ï) diPs, and join graphs with cycles can take stars, will select only a small range of join column values. _is even longer (see [ϕ] for an example). Abandoning this process be- helps with C1 and C2. fore the partition subsets converge can leave data skipping gains un- ● Practical datasets are skewed; oen times the skew is heavy-tailed tapped. On the other hand, generating too many diPs adds to query [k}]. In skewed datasets, predicates and diPs that skip over the optimization time. To address this challenge, we construct diPs in heavy hitters are highly selective; hence, skew can help C1–C3. a carefully chosen order so as to converge to the smallest partition subsets while building the minimum number of diPs (see §k.). Figure C illustrates how oen conditions C1 and C2 hold for dif- A second challenge arises when applying the above method, which ferent datasets, query predicates and join columns from produc- only accounts for select and join operations, to the general case where tion clusters at Microso. We used tens of datasets and extracted queries contain many other interceding operations such as group- predicates and join columns from thousands of queries. _e ûg- bys and nested statements. One option is to ignore other operations ure shows the cumulative distribution functions (CDFs) of the frac- and apply diPs only to sub-portions of the query that exclusively tion of rows satisfying each predicate (red squares), the fraction of consist of selections and joins. Doing so, again, leaves data skipping partitions containing these rows (green pluses) and the fraction of gains untapped; in some cases the unrealized gains can be substan- join column values contained in these partitions (orange triangles). tial as we saw for the query in Figure } (le) where ignoring the We see that about ªì of the predicates pick less than }ªì of par- nested statement (that is, restricting diPs to just the portion shown titions (C1); in about kªì of the predicates, the join column values with a shaded background in the ûgure) may lead to no gains since contained in the partitions satisfying the predicate are less than ªì the group-by can require reading the lineitem table fully. To ad- of all join column values (C2). dress this challenge, we move diPs around other relational operators using commutativity. We list the transformation rules used in §k.k 3. CONSTRUCTION AND USE OF DATA- which cover a broad class of operators. Using these transformations INDUCED PREDICATES extends the usefulness of diPs to complex query expressions. We describe our algorithm to enhance data skipping using data- 3.1 Deriving diPs within a cost-based QO induced predicates. Given a query E over some input tables, our Taken together, the previous paragraphs indicate two requirements Read the value of green pluses line at x = ª.} in Figure C. to quickly identify eõcient plans: (Ï) carefully schedule the order in Example: Figure 9 illustrates this process for TPC-DS query ¢k; the SQL query is in [}]. As shown in the top le of the ûgure, la- beled Ï , diPs that are triggered by the predicate on date dim are ûrst exchanged in maximal SJ portions: store sales & date dim, catalog sales & date dim and web sales & date dim. _e query joins these portions with another join expression aer a few set operations.Hence, in } , we build new diPs for the customer sk column and pull those up through the set operations (union and in- tersection translate to logical or and and over diPs) and push down to the customer (c) table. To do so, we use the transformation rules in §k.k. In k , if the incoming diP skips partitions on the customer table, another derivation ensues within the SJ expression on the le.C _e ûnal plan, shown in  , eòectively replaces each table with the partition subset that has to be read from that table. Figure 9: Illustrating the use of diPs in TPC-DS query ¢k. _e ta- 3.2 Supported Queries ble labels ss, cs, ws and d correspond to the tables store sales, catalog sales, web sales and date dim. Our prototype does not restrict the query class, i.e., queries can use any operator supported by the underlying platform. Here, we highlight aspects that impact the construction and use of diPs. which diPs are computed over a join graph and (}) use commutativ- Predicates: Our prototype triggers diPs for predicates which are ity to move diPs past other operators in complex queries. We sketch conjunctions, disjunctions or negations over the following clauses: diPs our method to derive within a cost-based QO here. ● c op v: here c is a numeric column, op denotes an operation that Let’s consider some alternative designs. (Ï) Could the user or a is either =, <, ≤, >, ≥, ≠ and v is a value. query rewriting soware that is separate from the QO insert opti- ● ci op c j: here ci , c j are numeric columns from the same relation diPs mal into the query?_is option is problematic because the and op is either =, <, ≤, >, ≥, ≠. user or the query rewriter will have to re-implement complex logic ● For string and categorical columns, equality check with a value. such as predicate simpliûcation and push-down that is already avail- able within the QO. Furthermore, moving diPs around other oper- Joins: Our prototype generates diPs for join conditions that are ators (see §k.k) requires plan transformation rules that are not im- column equality over one or more columns; although, extending to plemented in today’s QO; speciûcally rules that pull up diPs or move some other conditions (e.g., band joins []) is straightforward. We them sideways from one join input to another do not exist in typi- support inner, one-sided outer, semi, anti-semi and self joins. cal QOs. As we saw with the case of the example in Figure }(le), Projections: diPs commute trivially with any projection on columns diPs without such movement may not achieve any data skipping. (}) that are not used in the diP. On columns that are used in a diP, Could the change to QO be limited to adding some new plan trans- only single-column invertible projections commute with that diP formation rules? Doing so is appealing since the QO framework because only such projects can be inverted through zone-maps and remains unchanged. Unfortunately, as we saw in the case of Fig- other data statistics that we use to compute diPs (see [ϕ]). ure }(middle), diPs may have to be exchanged multiple times be- tween the same pair of tables, and to keep costs manageable, diPs Other operations: Operators that do not commute with diPs will have to be constructed in a careful order over the join graphs; in block the movement of diPs. As we discuss in §k.k next, diPs com- today’s cost-based optimizers, achieving such recursion and ûne- mute with a large class of operations. grained query-wide ordering is challenging [k]. _us, we use the 3.3 Commutativity of data-induced predicates hybrid design discussed next. We add derivation of diPs as a new phase in the QO aer plan with other operations simpliûcation rules have applied but before exploration, implemen- We list some query optimizer transformation rules that apply to tation, and costing rules, such as join ordering and choice of join data-induced predicates (diPs). _e correctness of these rules fol- implementations, are applied. _e input to this phase is a logical ex- lows from considering a diP as a ûlter on join columns. Note that pression where predicates have been simpliûed and pushed down. some of these transformation are not used in today’s query optimiz- _e output is an equivalent expression which replaces one or more ers. For example, pulling up diPs above a union and a join (rule ¢, tables with partition subsets of those tables. To speed up optimiza- ¢, below) naively results in redundant evaluation of predicates and tion, this phase creates maximal sub-portions of the query that only are hence not used today; however, as we saw in the case of Fig- contain selections and joins; we do this by pulling up group-bys, ure 9, such movements are necessary to skip partitions elsewhere in projections, predicates that use columns from multiple relations, etc. the query. We also note that diPs do not remain in the query plan; diPs are exchanged within these maximal select-join sub-portions the diPs directly on tables are replaced with a read of the partition of the query expression using the schedule in §k.. Next, using the subsets of that table, and other diPs are dropped. rules in §k.k, diPs are moved into the rest of the query. With this diPs method, derivation will be faster when the select-join sub-portions Ï. commute with any select. diP are large because, by decoupling the above steps, we avoid propa- }. A commutes with any projection that does not aòect the diP gating diPs which have not converged to other parts of the query. columns used in that . For projections that aòect columns diP Note that this phase executes exactly once for a query. _e increase used in a , commutativity holds if and only if the projec- in query optimization time is small, and by exploring alternative tions are invertible functions on one column. plans later, the QO can ûnd plans that beneût from the reduced in- CAer k , if the partition subset on the customer table becomes put sizes (e.g., choose a diòerent join order or use broadcast join further restricted, a new diP moves in opposite direction along the instead of hashjoin). path shown in } ; we do not discuss this issue for simplicity. Inputs: G, the join graph and ∀i, qi denoting partitions to be read in table i (notation is listed in Table k) Output: ∀ tables i, updated qi reecting the eòect of diPs Ï Func: DataPred (q, {c}) // Construct diP for columns {c} over partitions x having qx =Ï; see §. } Func: Satisfy (d, x) // = Ï if partition x satisûes predicate d; ª Figure Ê: Schedules of exchanging diPs for diòerent join graphs; otherwise. See §. numbers-in-circles denote the epoch; multiple diPs are exchanged k Func: Exchange(i, j) ∶ //send diP from table i to table j in parallel in each epoch. Details are in §k..  di→ j ← DataPred(qi , ColsOf(pi j , i)) x x  ∀ partition x ∈ table j, q j ← q j ∗ Satisfy(di→ j , x) k. diPs commute with a group-by if and only if the columns used C Func: TreeScheduler(T , {qi }): // a tree-like join graph diP in the are a subset of the group-by columns. 9 for h ← ª to height(T ) − Ï // bottom-up traversal do . diPs commute with set operations such as union, intersec- Ê foreach t ∈ T ∶ height(t) = h do tion, semi- and anti semi-joins, as shown below. • Exchange(t, Parent(t, T ))

● dÏ(RÏ)∩ d}(R}) ≡ (dÏ ∧ d})(RÏ ∩R}) ≡ (dÏ ∧ d})(RÏ)∩ Ϫ for h ← height(T ) to Ï // top-down traversal do (dÏ ∧ d})(R}) ÏÏ foreach t ∈ T ∶ height(t) = h do Exchange ● dÏ(RÏ) ∪ d}(R}) ≡ (dÏ ∨ d})(dÏ(RÏ) ∪ d}(R})) Ï} ∀ child c of t in T , (t, c) ● d(RÏ) − R} ≡ d(RÏ − R}) ≡ d(RÏ) − d(R}) Ïk Func: ExchangeExt (G, u, v)//send diPs from node u to v Ï foreach tÏ RelationsOf u , t} RelationsOf v do . diPs can move from one input of an equijoin to the other in- ∈ ( ) ∈ ( ) Ï if IsConnected(tÏ , t} , G) then put if the columns used in the diP match the columns used in ÏC d ← DataPred(qtÏ , ColsOf(ptÏ t} , tÏ)); the equi-join condition. For outer-joins, a diP can move only x x Ï9 ∀ partition x ∈ table t} , qt ← qt ∗ Satisfy(d, x) if from the le side of a le outer join (and vice versa). No } } ProcessNode diPs movement is possible with a full outer join. ÏÊ Func: (u) // exchange within node ϕ for i ← ª to κ // repeat up to κ times do }ª change ● dc (RÏ) &c=e R} ≡ dc (RÏ &c=e R}) ≡ dc (RÏ) &c=e de (R}); ← false; note here that c and e can be sets of multiple columns, then }Ï foreach tables tÏ , t} ∈ RelationsOf(u) do IsConnected c = e implies set equality. }} if (tÏ ≠ t}) ∧ (tÏ , t} , G) then }k d ← DataPred(qtÏ , ColsOf(ptÏ t} , tÏ)); } diP x C. As we saw in Figure 9 where a on the customer sk } foreach partition x ∈ t} ∶ qt} = Ï do x Satisfy column is being pushed down to the customer table, diPs on } qt} ← (d, x) change change x an inner join can push onto one of its input relations, gener- }C ← ∨ (qt} = ª) alizing the latter half of rule¢. _is requires the join input }9 change to contain all columns used in the diP, i.e., d(RÏ & R}) ≡ if ¬ then break// no new pruning; }Ê Func: TreeSchedulerExt , , qi // for cyclic join graphs. d(d(RÏ)&R}) iò all columns used by the diP d are available (V G { }) }• for h ª to height Ï // bottom-up traversal do in the relation RÏ. ← (V) − kª foreach u ∈ V ∶ height(u) = h do To see these rules in action, note that diPs move in Figure 9 } us- kÏ if IsNotSingleRelation(u) then ProcessNode(u); ing rule¢ twice to pull up past a union and an intersection, rule¢ to k} ExchangeExt(G, u, Parent(u, V)); move from one join input to another at the top of the expression and kk for h height to ª // top-down traversal do rule¢C twice to push to a join input. _e example in Figure } (le) ← (V) k foreach u ∈ V ∶ height(t) = h do uses rule¢ at the joins and rule¢k to push below the group-by. k if IsNotSingleRelation(u) then ProcessNode(u); ExchangeExt 3.4 Scheduling the deriving of predicates kC ∀ child v of u in V, (G, u, v); k9 Func: Scheduler(G, {qi }): Given a join graph where tables are nodes and edges correspond G kÊ if IsTree(G) then to join conditions, the goal here is to achieve the largest possible data k• return TreeScheduler (Treeify(G), {qi }); skipping (which improves query performance) while constructing ª else the fewest number of diPs (which reduces QO time). Ï V ← MaxWtSpanTree(CliqueGraph(Triangulate(G))) Consider the example join graphs in Figure Ê. _e simple case } return TreeSchedulerExt(Treeify(V), G, {qi }); diPs of two tables on the le only requires a single exchange of fol- Figure •: Pseudocode to compute a fast schedule. lowed by an update to the partition subsets q; proof is in [ϕ]. _e other two cases require more careful handling as we discuss next; the join graph in the middle is the popular star-join which leads to six diPs. A proof that this algorithm is optimal, i.e., can skip all skip- tree-like join graphs and on the right is a cyclic join graph. pable partitions in all tables while constructing the fewest number Our algorithm to hasten convergence is shown in Pseudocode •, of diPs, is in [ϕ]. Scheduler method at line¢k9. _e case of acyclic join graphs is an We convert a cyclic join graph into a tree over table subsets. _e important sub-case because it applies to queries with star or snowake conversion retains completeness; that is, all partitions that can be schema joins.Here, we construct a tree over the graph (Treeify skipped in the original join graph remain skippable by exchang- in line¢k• picks the root that has the smallest tree height and sets ing diPs only between adjacent table subsets on the tree. Further- parent–child pointers; details are in [ϕ]). _en, we pass diPs up the more, on the resulting tree, we apply the same simple schedule that tree (lines¢9–¢•) and aerwards pass diPs down the tree (lines¢Ïª– we used above for tree-like join graphs with a few changes. For ¢Ï}). To see why this converges, note that when line¢Ïª begins, the example, the join graph in Figure Ê(right) becomes the following partition subsets of the table at the root of the tree would have sta- tree: tÏ — {t} , tk , t} — {tk , t , t} — {t , t , tC} — t9. _e steps bilized; Figure Ê (middle) illustrates this case with t} as root and to achieve this conversion are shown in line¢ Ï and follow from shows that convergence requires at most two (parallel) epochs and the junction tree algorithm [Ïk, C•,Ê}]; a step-by-step derivation is in [ϕ]. On the resulting tree we mimic the strategy used for tree- like join graphs with two key diòerences. Speciûcally, at line¢}, Treeify picks a root with the lowest height as before. _en, diPs are exchanged from children to parents (lines¢}•–¢k}) and from par- ents to children (lines¢kk–¢kC). _e key diòerences between the TreeSchedulerExt and the TreeScheduler methods are: (Ï) as the ProcessNode method shows, diPs are exchanged until convergence or at most κ times between relations that are contained in a node Figure Ϫ: Illustrating the diòerence between range-sets, zone-maps and (}) we compute multiple diPs when exchanging information and equi-depth histogram; both histograms and range-sets have between nodes (see ExchangeExt) whereas the Exchange method three buckets. _e predicates shown in black dumbels below the constructs at most one diP. Figure Ê (right) illustrates the resulting axes will be false positives for all stats except the range-set. schedule; the root shown in blue is the node containing {tk , t , t}; ProcessNode epochs ¢}, ¢k and ¢ invoke on the triangle subsets of _e false positives noted above do not aòect query accuracy but tables which have the same color whereas epochs ¢Ï and ¢ exchange reduce the I/O savings. To reduce false positives, we consider other diP at most one on the edges shown. data statistics. Properties of Algorithm •: For tree-like join graphs, the method Equi-depth histograms [Ê] can avoid some of the false positives shown is optimal (proof in [ϕ]). For a tree-like join graph G with n when constructed with gaps between buckets. For e.g., a predicate tables, this method computes at most }(n − Ï) diPs (because a tree x = k may satisfy a partition’s zone-map because k lies between the has n − Ï edges) and requires }height(G) (parallel) epochs where min and max values for x but can be declared as not satisûed by that n tree height can vary from ⌈ } ⌉ to ⌈log n⌉. For cyclic join graphs partition’s histogram if the value k falls in a gap between buckets the method shown here is approximate; that is, it will not elimi- in the histogram. However, histograms are typically built without nate all partitions that can be skipped. We show by counter-example gaps between buckets [kª, Ê, }], are expensive to maintain [}], in [ϕ] that the optimal schedule for a cyclic join graph can require and the frequency information in histograms, while useful for other a very large number of diPs; the sub-optimality arises from limiting purposes, is a waste of space here because predicate satisfaction and how oen diPs are exchanged between relations within a node (in diP construction only check for the existence of values. ProcessNode the method). In §., we empirically demonstrate Bloom ûlters record set membership [k•]. However, we found them that our method for cyclic join graphs is a good trade-oò between to be less useful here because the partition sizes used in practical dis- diPs achieving large data skipping and computing many . tributed storage systems (e.g., ∼ ϪªMBs of data [ª, ÊÊ]) result in millions of distinct values per column in each partition, especially when join columns are keys. To record large sets, bloom ûlters re- 4. USING STATISTICS TO BUILD diPs quire large space or they will have a high false positive rate; e.g., a ÏKB bloom ûlter that records a million distinct values will have Data statistics play a key role in constructing data-induced pred- ••.C}ì false positives [k•] leading to almost no data skipping. icates; recall that the three equations Ï— k use statistics; the speciûc Alternatives such as the count-min [•] and AMS [kC] sketches statistic used determines both the eòectiveness and the cost of these behave similarly to a bloom ûlter for the purpose at hand. _eir operations. An ideal statistic is small, easy to maintain, supports space requirement is larger, and they are better at capturing the fre- evaluation of a rich class of query predicates and leads to succinct quency of values (in addition to set membership). However, as we diPs. In this section, we discuss the costs and beneûts of well-known noted in the case of histograms, frequency information is not help- data statistics including our new statistic, range-set, which our ex- ful to construct diPs. periments show to be particularly suitable for constructing diPs. Zone-maps [Ï] consist of the minimum and maximum value per Range-set: To reduce false-positives while keeping the stat size small, column per partition and are maintained by several systems today we propose storing a set of non-overlapping ranges over the column (see Table Ï). Each predicate clause listed in §k.} translates to a log- value, {[li , ui ]}. Note that a zone-map is a range-set of size Ï; us- ical operation over the zone-maps of the columns involved in the ing more ranges is hence a simple generalization. _e boundaries predicate. Conjunctions, disjunctions and negations translate to an of the ranges are chosen to reduce false positives by minimizing the intersection, an union or set diòerence respectively over the par- total width (i.e., ∑i ui − li ) while covering all of the column values. tition subsets that match each clause. For string-valued columns, To see why range-sets help, consider the range-set shown in green zone-maps are typically built over hash values of the strings and so dots in Figure Ϫ; compared to zone-maps, range-sets have fewer equality check is also a logical equality, but regular expressions are false positives because they record empty spaces or gaps. Equi-depth not supported. histograms, as the ûgure shows, will choose narrow buckets near Note that there can be many false positives because a zone map more frequent values and wider buckets elsewhere which can lead has no information about which values are present (except for the to more false positives. Constructing a range-set over r values takes 9 minimum and maximum values). O(r log r) time . Reecting on how zone-maps were used for the _e diP constructed using zone-maps, as we saw in the example three operations in Equations Ï– k, i.e., applying predicates, con- in Table }b, is a union of the zone-maps of the partitions satisfying structing diPs and applying diPs on joining tables, note that a simi- the predicate; hence, the diP is a disjunction over non-overlapping lar logic extends to the case of a range-set. SIMD-aware implemen- ranges. On the table that receives a diP, a partition will satisfy the tations can improve eõciency by operating on multiple ranges at diP only if there is an overlap between the diP and the zone-map of once. A range-set having n ranges uses }n doubles.Merging two that partition. Note that there can be false positives in this check as unsorted range-sets as well as checking for overlap between them well because no actual data row may have a value within the range that overlaps between the diP and the partition’s zone map. It is 9First sort the values, then sort the gaps between consecutive values straightforward to implement these checks eõciently, and our re- to ûnd a cutoò such that the number of gaps larger than cutoò is at sults will show that zone-maps oòer sizable I/O savings (Figures ÏÏ, Ïk). most the desired number of ranges; see [ϕ] for proof of optimality. Table : Greedily growing a range-set in the presence of updates. We also show that diPs are complementary to and sometimes bet- Beginning range-set: {[k, ], [Ϫ, }ª], [}k, }9]}, nr = k ter than using join indexes, materializing denormalized views or Update New range-set clustering rows in §.; these alternatives have much higher main- diPs r × Add C {[k, C], [Ϫ, }ª], [}k, }9]} tenance costs unlike which work in-situ using small per-table

de × statistics and a small increase to QO time. r × Add Ïk, Delete }ª,Change  to Ï no change

o × Ö Add } {[k, C], [Ϫ, }9], [}, }]} 5.1 Methodology uses O(n log n) time where n is the size of larger rangeset.Our re- sults will show that small numbers of ranges (e.g.,  or }ª) lead to Queries: We report results on TPC-H [}C], TPC-DS [}] and the substantial improvements over zone-maps (Figure ÏC). join order benchmark (JOB) [Cª]. We use all }} queries from TPC-H but because TPC-DS and JOB have many more queries we pick from 4.1 Coping with data updates them ª and k9 queries respectivelyÊ. We choose JOB for its cyclic When rows are added, deleted, or changed, if the data statistics are join queries. We choose TPC-DS because it has complex queries (e.g., not updated, partitions can be incorrectly skipped, i.e., false nega- several non foreign-key joins, UNIONs and nested SQL statements). tives may appear in equations Ï– k. We describe two methods to Query predicates are complex; e.g., qϕ from TPC-H has ÏC clauses avoid false negatives here. over Ê columns from multiple relations. While inner-joins domi- nate, the queries also have self-, semi- and outer joins. Tainting partitions: A statistic agnostic method to cope with data updates is to maintain a taint bit for each partition. A partition is Datasets: For TPC-H and TPC-DS we use ϪªGB and ÏTB datasets marked as tainted whenever any row in that partition changes. Ta- respectively. _e default datagen for TPC-H, unlike that of TPC-DS, bles with tainted partitions will not be used to originate diPs (be- creates uniformly distributed datasets which is not representative of cause that diP can be incorrect). However, all tables, even those practical datasets; therefore, we also use a modiûed datagen [}•] to with tainted partitions, can receive incoming diPs and use them to create datasets with diòerent amounts of skew (e.g., with zipf factors eliminate their un-tainted partitions. of Ï, Ï., }). For JOB, we use the IMDB dataset from May }ªÏk [Cª]. Taint bits can be maintained at transactional speeds and can be Layouts and partitioning: We experiment with many diòerent lay- extremely eòective in some cases, e.g., when updates are mostly to outs for each dataset. _e tuned layout speeds up queries by avoid- tables that do not generate data-reductive diPs. One such scenario is ing re-partitioning before joins and enhances data skipping•. diPs queries over updateable fact tables that join with many unchanging yield sizable gains on tuned layouts. To evaluate behavior more dimension tables; predicates on dimension tables can generate diPs broadly, we generate several other layouts where each table is or- that ow unimpeded by taint on to the fact tables. Going beyond dered on a randomly chosen column. For each data layout, we par- one taint bit per partition, maintaining taint bits at a ûner granular- tition the data as recommended by the storage system, i.e., roughly ity (e.g., per partition and per column) can improve performance ϪªMB of content in SCOPE clusters,[, ÊÊ] and roughly ÏM rows with a small increase in update cost. See results in §.k. per columnstore segment in SQL Server [Ê]. Approximately updating range-sets in response to updates: _e Systems: We have built prototypes on top of two production plat- key intuition of this method is to update the range-set in the follow- forms: SCOPE clusters which serve as the primary platform for batch ing approximate manner: ignore deletes and grow the range-set to analytics at Microso and comprise tens of thousands of servers [, cover the new values; that is, if the new value is already contained ÊÊ] and SQL Server }ªÏC. Both systems use cost-based query opti- in an existing range, there is nothing to do; otherwise, either grow mizers [k]. A SCOPE job is a collection of tasks orchestrated by a an existing range to contain the new value or merge two consecutive job manager; tasks read and write to a ûle system, and each task in- ranges and add the new value as a new range all by itself. Since these ternally executes a sub-graph of relational operators which pass data options increase the total width of the range-set, the process greedily through memory. _e servers are state-of-the-art Intel Xeons with chooses whichever option has the smallest increase in total width. ϕ}GB RAM, multiple disks and multiple ϪGbps network interface Table  shows examples of greedily growing a range-set.Our results cards.Our SQL server experiments ran on a similar server. Aer will show that such an update is fast (Table Ê), and the reduction in each query executes in SQL server, we ush various system buòer I/O savings— because the range-sets aer several such updates can pools to accurately measure the eòects of I/O savings. SCOPE clus- have more false positives than range-sets that are re-constructed for ters use a partitioned row store; for SQL server, we use both column- just the new column values— is small (Figure Ïa). stores and rowstores. SCOPE and SQL server implement several ad- vanced optimizations such as semijoins [}Ï], predicate pushdown to 5. EVALUATION eliminate partitions [9] and magic-set rewrites [ª]. Using our prototypes in Microso’s production big-data clusters Comparisons: In addition to the above production baselines, we and SQL server, we consider the following aspects: compare against several alternatives. By DenormView, we refer to a technique that avoids joins by denormalization, i.e., materializes a ● Do data-induced predicates oòer sizable gains for a wide variety join view over multiple tables. _e view is stored in column store of queries, data distributions, data layouts and statistic choices? format in SQL server. Since the view is a single relation, queries ● Understand the causes for gains and the value of our core contri- can skip partitions without worrying about joins. By JoinIndexes, butions. we refer to a technique that maintains clustered rowstore indexes ● Understand the gap from alternatives. on the join columns of each relation; for tables that join on more We will show that using diPs leads to sizable gains across queries than one column, we build an index on the most frequently used from TPC-H, TPC-DS and Join Order Benchmark, across diòerent data distributions and physical layouts and across statistics (§.}). Ê Ï . . . ª, •ª . . . •• from TPC-DS and ([Ï − •]SϪ)* from JOB _e costs to achieve these gains are small and range-sets oòer more •In short, dimension tables are sorted by key columns and fact tables gains in more cases than zone-maps (§.k). Both the careful ordering are clustered by a prevalent predicate column and sorted by columns of diPs and the commutativity rules to move diPs are helpful (§.). in the predominant join condition; details are in [ϕ] . 1 1 1 0.8 0.8 0.8 0.6 0.6 0.6 Latency (zipf 1.5) 0.4 0.4 Latency (zipf 2) 0.4 Latency (Zipf1) 0.2 Query Latency 0.2 Tot. Comp. Hrs (zipf 1.5) Total Compute Hours Tot. Comp. Hrs (zipf 2) 0.2 Latency (Zipf2) CDF over queries CDF over queries 0 0 CDF over queries Latency (Zip2 w ZoneMaps) 0 1 10 100 1 10 100 0.8 1 2 4 8 10 Performance Speedup Performance Speedup Performance Speedup (a) TPC-DS on SCOPE clusters (b) TPC-H (skew=zipf Ï. or }) on SCOPE (c) TPC-H (skew=zipf Ï or }) on SQL server Figure ÏÏ:Change in query performance from using data-induced predicates. _e ûgures show cumulative density functions (CDFs) of speedups for diòerent benchmarks, on diòerent platforms for the tuned data layout (see §.Ï). _e beneûts are wide-spread, i.e., almost all queries improve; in some cases, the improvements can be substantial. More discussion is in §.}. join column. By FineBlock, we refer to a single relation workload- Table : _e I˜ünuCnu from diPs for diòerent benchmarks and aware clustering scheme which enhances data skipping by colocat- diòerent layouts; each query and data layout (see §.Ï) contribute ing rows that match or do-not-match the same predicates [9Ê]. We a point to a CDF and the table shows values at various percentiles. apply FineBlock on the above denormalized view. Benchmark I˜ünuCnu at percentile We also compare with the following variants of our scheme: No ªth 9th •ªth •th Ϫªth Transforms does not use commutativity to move diPs; Naive Sched- TPC-H (zipf }) Ï.× C.× Ï9.9× k}.Ï× ÏCC.Ê× TPC-DS Ï. .Ï 9.} Ï}.ª }}. ule constructs as many diPs as our schedule but picks at random × × × × × JOB Ï.•× }.k× k.Ï× k.× ÏÏ.Ï× which diP to construct at each step. Preds uses the same statistics but only for predicate pushdown, i.e., it does not compute diPs. Statistics: Many systems already store zone-maps as noted in Ta- 1 1 ble Ï. We evaluate various statistics mentioned in §. Gap hist is 0.8 0.8 our own implementation of an optimal equi-depth histogram with gaps between buckets. Unless otherwise stated, we use }ª ranges for 0.6 0.6 range-sets and Ϫ buckets for gap hists. Also, unless otherwise stated the results use range-sets to construct diPs. 0.4 0.4 Uniform; tuned Metrics: We measure query performance (latency and resource Skew: zipf 1; tuned 0.2 Skew: zipf 1.5; tuned 0.2 use), statistic size, maintenance costs, and increase in query opti- Skew: zipf 2; tuned tuned layout Zipf 2; 7 diff. layouts 3 diff. layouts mization time. Since diPs reduce the input size that a query reads 0 0 from the store, we also report I˜ünuCnu which is the fraction of the CDF (over queries, data layouts) 1 10 100 1 10 InputCut on TPC-H InputCut on TPC-DS query’s input that is read aer data skipping; if data skipping elimi- Figure Ï}: How input skew and data layouts aòect the usefulness of nates half of a query’s input, I˜ünuCnu = }. When comparing two diPs; see §.}. techniques, we report the ratio of their metric values. We see that diPs produce larger speed-ups as skew increases mainly 5.2 Benefits from using diPs because predicates and diPs become more selective at larger skew. Figure ÏÏ shows the performance speedup from using diPs on dif- Figure ÏÏc shows sizable latency improvements when using zone- ferent workloads in SCOPE clusters and SQL server. Results are on maps. We have conûrmed that the query plans in the production the tuned layout which is popular because it avoids re-partitioning systems, SCOPE clusters and SQL server, reect the eòects of pred- for joins and enhances data skipping [}}, }9, Ï]. _e results are icate pushdown and bitmap ûlters for semijoins [9, }Ï, ª]; these CDFs over queries; we repeat each query at least ûve times. All ûgures show that diPs oòer sizable gains for a sizable fraction of of the results except one of the CDFs in Figure ÏÏc use range-sets. benchmark queries on top of such optimizations. Figure ÏÏa shows that the median TPC-DS query ûnishes almost }× Table  considers many diòerent layouts, and Figure Ï} also con- faster and uses × fewer total compute hours.Much larger speed- siders diòerent skew factors. _ese results show the I˜ünuCnu met- ups are seen on the tail. Total compute hours improves more than ric which is the reduction in initial I/O read by a query. Across data latency (higher speed-up in orange lines than in grey lines) because layouts, about ªì of the queries in each benchmark obtain an I˜- some of the changes to parallel plans that result from reductions ünuCnu of at least }×; that is, they can skip over half of the input. in initial I/O add to the length of the critical path which increases Figure Ï} shows that lower skew leads to a lower I˜ünuCnu, but diPs query latency while dramatically reducing total resource use; e.g., oòer gains even for a uniformly distributed dataset. _e tuned data replacing pair joins with broadcast joins eliminates shuøes but adds layout in both TPC-H and TPC-DS leads to larger values of I˜ünu- a merge of the smaller input before broadcast [9]. We see that al- Cnu relative to the other data layouts; that is, diPs skip more data most all queries improve. SCOPE clusters are shared by over hun- in the tuned layout. _is is because the tuned layouts help with all dreds of concurrent jobs, and so query latency is subject to perfor- three conditions C1−C3 listed in §}; predicates skip more partitions mance interference; the CDFs use the median value over at least ûve on each table because tuned layouts cluster by predicate columns trials, but some TPC-H queries in Figure ÏÏb still have a small re- and ordering by join column values helps diPs eliminate more par- gression in latency. Figures ÏÏb and ÏÏc show that TPC-H queries titions on the receiving tables. We also observe several instances receive similar latency speedup in SCOPE clusters and SQL server. where a query speeds up more in a diòerent layout than the tuned Unlike TPC-DS and real-world datasets which are skewed, the de- layout; typically, such queries use diòerent join or predicate columns fault datagen in TPC-H distributes data uniformly; these ûgures than those used by the tuned layout. show results with diòerent amounts of skew generated using [}•]. Figure Ïk breaks-down the gains for each query in TPC-H when 1000 diPs with range-set diPs with gap hist. diPs with zone map 100

10 InputCut 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 TPC-H Query Number Figure Ïk: _e ûgure shows the I˜ünuCnu values when using diòerent stats to construct diPs (for TPC-H with zipf } skew). Candlesticks show variation across seven diòerent data layouts including the tuned layout; the rectangle goes from }th to 9th percentile, the whiskers go from min to max and the thin black dash indicates the average value. Zone-maps do quite well and range-sets are a sizable improvement.

Table C: _e additional latency to derive diPs in seconds compared TPC-H TPC-DS to the baseline QO latency; see §.k. 1 1 ìile 0.9 0.9 Ϫth }th ªth 9th •ªth Latency(s) 0.8 0.8 0.7 Baseline QO ª.Ï ª.ÏÊ ª.Ï9C ª.ÏÊÊ ª.}ÏÊ 0.7 diPs 0.6 to add ª.ªk} ª.ªª ª.ªÊ ª.Ϫ9 ª.}ʪ 0.6 0.5 0.5 0.4 Table 9: Additional results for experiments in Figure ÏÏ, Table  0.4 0.3 No Tx 0.3 No Transforms 0.2 and Figure Ï}. _e table shows data from our SCOPE cluster. CDF (over queries) Naive Schedule Naive 0.2 0.1 TPC-H TPC-DS JOB Our Algo. Ours Input size ϪªGB ÏTB GB 0.1 0 1 10 100 ¢Tables, ¢Columns Ê, CÏ }, ÏC }Ï, ÏªÊ 1 10 100 InputCut ¢ Queries }} ª k9 Range-set size ∼ }MB ∼ kMB ∼ kªKB Figure Ï: How I˜ünuCnu varies when diòerent methods are used ¢ Partitions ∼ Ϫk ∼  ∗ Ϫ ∼ }ªª to derive diPs; we compare with No Transforms which does not use any transformation rules and Naive schedule which constructs the Table Ê: _e time to greedily update range-sets of various sizes (in same number of diPs in a naive manner.(Results are for TPC-H nanoseconds) measured on a desktop. skewed with zipf } and TPC-DS in the tuned layout.) Size }  Ê ÏC }ª k} C additional QO time to use diPs is rather small oen, but it can be Avg. Ê. ÏÏ.Ê }}.Ê }.Ï •.Ê C9.Ê Ï}Ï. Stdev. ª. ª. ª. ª.Ï }. k. k.• large on the tail. We verify that these outliers exchange diPs be- tween large tables which takes a long time because such diPs have many clauses and are evaluated on many partitions. We note that 1x108 1 4 ranges, total our derivation of diPs is a prototype, parts of which are in c¢ for 1x107 4 ranges, sort 20 ranges, total ease-of-debugging, and that evaluating diPs is embarrassingly par- 0.8 20 ranges, sort 1x106 allel (e.g., apply diP to stat of each partition); we have not yet im- 0.6 100000 plemented optimizations and believe that the extra QO time can be 0.4 10000 substantially smaller. Regarding storage overhead, Table 9 shows

Latency (µs) 1000 the size of range-sets which can be thought of as }ª “rows” per par- 0.2 100 tition (a partition is ϪªMB of data in SCOPE clusters and ÏM rows in a columnstore segment in SQL server [Ê]) and so the space over- 0 10 Greedy gap / Opt. Greedy 1 2 3 4 5 6 7 8 9 10 103 104 105 106 107 108 head for range-sets is ∼ ª.ªª}ì. Zone-maps use Ïª× less space be- Updated % of lineitem Number of rows cause they only record the max and min value per column, i.e., } (a)Greedy updates. (b) Construction latency “rows” per partition. Although TPC-DS and JOB have more tables Figure Ï: (Le) Eòectiveness of greedy-updates for range-sets; the and more columns, their ratio of stat size to input size is similar. ûgure shows the average and stdev across columns.(Right) Cost to construct range-sets measured on a desktop. Costs and gains when tainting partitions: Recall from § that a statistic-agnostic method to cope with data updates was to taint par- using diòerent statistics. Notice that zone-maps are oen as good as titions. We evaluate this approach by using the TPC-H data gen- the gap histograms to construct diPs; compare the third blue can- erator to generate Ϫª update sets each of which change ª.Ïì of dlestick in each cluster with the second green candlestick. Gap his- the orders and lineitem tables. Figure Ïk showed that diPs de- tograms are better in predicate satisfaction than zone-maps but do liver sizable gains for Ï out of the }} TPC-H queries; among these not lead to much better diPs. As the ûgure shows, range-sets (the queries, only six are unaòected by taints; speciûcally for {q}, qÏ, ûrst red candlestick in each cluster) oòer a marked improvement; qÏ, qÏC, qÏ9, qϕ}, diPs oòer large I/O savings in spite of updates. they oòer larger gains on more queries and in more layouts. _e other queries see reduced I/O savings because updates in TPC- H target the two largest tables, lineitem and orders; when both 5.3 Costs of using diPs these relations become tainted, diPs cannot ow between these re- diPs _e costs to obtain this speed-up include storing statistics, an in- lations, and so queries that require between these tables lose crease to the query optimization duration (to determine which par- I˜ünuCnu due to taints. As noted in §, taints are more suitable titions can be skipped), and maintaining statistics when data changes. when updates target smaller dimension tables. In big-data clusters, queries are read-only and datasets are bulk ap- Greedily maintaining range-sets: Recall from § that our second pended; so building and maintaining statistics is less impactful rel- proposal to cope with data updates is to greedily grow the range- ative to the storage space and QO overhead. Table C shows that the set statistic to cover the new values. Table Ê shows that range-sets 4 TPC-H TPC-DS Using diPs 1 1 Using join indexes

0.8 0.8 2

0.6 0.6 #1 #1 #2 #2 1 #4 0.4 #4 0.4 #8 Speedup Latency #6 #20 0.2 #20 0.2 #100 0.5 #100 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 0 0 TPC-H Query Number 1 10 100 1 10 100 CDF of InputCut (over queries) Figure Ï9:Comparing diPs with join indices on SQL server. Number of ranges in the range-set Figure ÏC: How I˜ünuCnu varies with the numbers of ranges used InputCut for diPs / InputCut for Preds 1 1 1 in the range-set statistic.(Results are for TPC-H skewed with zipf } and TPC-DS in the tuned layout; other cases behave similarly.) 0.8 0.8 0.8 can be updated in tens of nanoseconds using one core on a desktop; 0.6 0.6 0.6 thus, the anticipated slowdown to a transaction system is negligible. 0.4 0.4 0.4 Figure Ïa shows that the greedy update procedure leads to a reason- ably high quality range-set statistic; that is, the total gap value (i.e., 0.2 0.2 0.2 ∑i (ui − li ) for a range-set {[li , ui ]}) obtained aer many greedy 0 0 0 updates is close to the total gap value of an optimal range-set con- CDF (over queries, data layouts) 1 10 1 10 1 10 range-set hist max., min. structed over the updated dataset. _e ûgure shows that the greedy Figure ÏÊ: Comparing diPs versus using the same stats to only skip updates lead to a range-set with an average gap value ≥ ʪì of op- partitions on individual tables. timal when up to Ϫì of rows in the lineitem table are updated.

Range set construction time: Figure Ïb shows the latency to con- 10000 Denorm. latency InputCut w FineBlock+Preds struct range-sets. Computing larger range-sets (e.g., }ª ranges vs. 1000 FineBlock+Preds latency (proj.) ) has only a small impact on latency, and almost all of the latency is due to sorting the input once (the ‘total’ lines are indistinguish- 100 able from the ‘sort’ lines). _ese results use std::sort from Microso 10 Visual C++. Note that range-sets can be constructed during data in- 1 gestion in parallel per partition and column; construction can also piggy-back on the ûrst query to scan the input. 0.1 Speed-up= Baseline/ Method 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 TPC-H Query Number 5.4 Understanding why diPs help Figure ϕ: I˜ünuCnu on TPC-H queries for FineBlock; box plots Comparing diòerent methods to construct diPs: Figure Ï shows show results for diòerent predicates. that both the commutativity rules in §k.k and the algorithm in §k. believe that this is because: (Ï) the predicate selectivity in several are necessary to obtain large gains using diPs. _e na¨ıve schedule TPC-H queries is not small enough to beneût from an index seek has the same QO duration because it constructs the same number and so most plans use a clustered index scan, and (}) clustered index of diPs, but by not carefully choosing the order in which diPs are scans are slower than table scans. diPs are complementary because constructed, this schedule leaves gains on the table as shown in the they reduce I/O before query execution. ûgure. Not using commutativity rules leads to a faster QO time but, diPs as the ûgure shows, can lead to much smaller performance improve- vs. predicate pushdown: Figure ÏÊ shows the ratio of im- Preds ments because generating diPs only for maximal select-join por- provement over which can only skip partitions on individual tions of a query graph will not reduce I/O when queries have nested tables. When queries have no joins or the selective predicates are diPs statements and other complex operators. _e more complex queries only on large relations, do not oòer additional data skipping, diPs in TPC-DS suòer a greater falloò. but the ûgure shows that oòer a marked improvement for a large number of queries and layouts. How many ranges to use? Figure ÏC shows that a small number of diPs DenormView ranges achieve nearly the same amount of data skipping as much vs. : We use a (denormal- larger range-sets. Each step in diP creation, as noted in §, adds ized relation) which subsumes ÏC/}} queries in TPC-H; the remain- false positives, and there is a limit to gains based on the joint dis- ing queries require information that is absent in the view and can- tribution of join and predicate columns. We believe that achieving not be answered using this view (details are in [ϕ]). In colum- more I/O skipping beyond that obtained by using just a few ranges nar (rowstore) format, this view occupies }.9× (C.}×) more storage may require much larger statistics and/or more complex techniques. space than all of the tables combined. Queries over the view are un- hindered by joins because all predicates directly apply on a single 5.5 Comparing with alternatives relation; however, because the relation is larger in size, queries may or may not ûnish faster. We store this view in columnar format and Join Indexes: Figure Ï9 compares using diPs with the JoinIndexes compare against a baseline where all tables are in a columnar lay- scheme described in §.Ï. Results are on SQL server for TPC-H out. Figure ϕ (with + symbol) shows the speed-up in query latency skewed with zipf factor Ï and a scale factor of Ϫª. We built clus- when using this view in SQL server (results are for ϪªG dataset with tered rowstore indexes [C] on the key columns of the dimension zipf skew Ï). We see that all of the queries slow down (all + symbols tables, and on the fact tables, we built clustered indexes on their are below Ï). Materialized views speed-up queries, in general, only most frequently used join columns (i.e., l orderkey, o orderkey, when the views have selections or aggregations that reduce the size ps partkey). _e ûgure shows that using join indexes leads to of the view [k]. Unfortunately, this does not happen in the case of worse query latency than not using the indexes in ϕ/}} queries; we the view that subsumes all ÏC/}} TPC-H queries (see [ϕ]). diPs vs. clustering rows by predicates: A recent research pro- maps [Ï] on a fact table can be constructed to include predicate posal [9Ê] clusters rows in the above view to maximize data skipping. columns from dimension tables; doing so eòectively creates zone- Training over a set of predicates,[9Ê] learns a clustering scheme that maps on a larger denormalized view. Constructing and maintaining intends to skip data for unseen predicates. Figure ϕ shows with ⨉ these data structures has overhead, and as we saw in §, a particular symbols the average I˜ünuCnu obtained as a result of such cluster- view or join index does not subsume all queries.Hence, many dif- ing; the candlesticks around the ⨉ symbol show the min, }th per- ferent structures are needed to cover a large subset of queries which centile, 9th percentile and max I˜ünuCnu for diòerent query pred- further increases overhead. Queries with foreign-key — foreign- icates. We see that most queries receive no I˜ünuCnu (× marks are key joins (e.g., store sales and store returns in TPC-DS join at Ï) due primarily to two reasons: (Ï) the chosen clustering scheme in six diòerent ways) can require maintaining many diòerent struc- does not generalize across queries; that is, while some queries re- tures. diPs can be thought oò as a complementary approach that ceive gains the chosen clustering of rows does not help all queries, helps with or without auxiliary structures. and (}) the chosen clustering scheme does not generalize to unseen While data-induced predicates are similar to the implied integrity predicates as can be seen from the large span of the candlesticks. constraints used by [C], there are some key diòerences and addi- Figure ϕ also shows with circle symbols the average query latency tional contributions. (Ï) [C] only exchanges constraints between a when using this clustering. Only /}} queries improve (fastest query pair of relations; we oòer a method which exchanges diPs between is ∼ Ϫª× faster) and ÏÏ/}} queries regress (slowest is Ïª× slower). multiple relations, handles cyclic joins and supports queries having Hence, the practical value of such schemes is unclear. group-by’s, union’s and other operations.(}) [C] uses zone maps We also note the rather large overheads to create and maintain and two bucket histograms; we oòer a new statistic (range-set) that indexes and views [kk, 9ª] and to learn clusterings [9Ê]. Also, these performs better.(k) [C] shows no query performance improve- schemes require foreknowledge of queries and oòer gains only for ments; we show speed-ups in both a big-data cluster and a DBMS. queries that are similar [k, k, 9Ê]. In contrast, diPs only use small () [C] oòers no results in the presence of data updates; we de- and easily maintainable data statistics, require no apriori knowledge sign and evaluate two maintenance techniques that can be built into of queries and oòer sizable gains for ad-hoc and complex queries. transactional systems. While a query executes, sideways information passing (SIP) from 6. RELATED WORK one sub-expression to a joining sub-expression can prune the data- in-ight and speed up joins [kÊ, 9, Ck, CÊ, 9}, 9]. Several systems, To the best of our knowledge, this paper is the ûrst system to skip including SQL server, implement SIP and we saw in § that diPs data across joins for complex queries during query optimization. oòer additional speed-up. _is is because SIP only applies during _ese are fundamental diòerences: diPs rely only on simple per- query execution whereas diPs reduce the I/O to be read from the column statistics, are built on-the-y in the QO, can skip partitions store. SIP can reduce the cost of a join, but constructing the neces- of multiple joining relations, support diòerent join types and work sary info at runtime (e.g., a bloom ûlter over the join column values with complex queries; the resulting plans only read subsets of the from one input) adds runtime overhead, needs large structures to input relations and have no execution-time overhead. avoid false positives and introduces a barrier that prevents simul- Some research works discover data properties such as functional taneous parallel computation of the joining relations. Also, unlike dependencies and column correlations and use them to improve diPs, SIP cannot exchange information in both directions between query plans [Ϫ, }, C, Ê]. Inferring such data properties is a sizable joining relations nor does it create new predicates that can be pushed cost (e.g.,[C] uses student t-test between every pair of columns). It below group-bys, unions and other operations. is unclear if these properties can be maintained when data evolves. A large area of related work improves data skipping using work- More importantly, imprecise data properties are less useful for QO load aware adaptations to data partitioning or indexing [k, , , (e.g., a so functional dependency does not preserve set multiplic- C}, CC, 9Ï, 9, 9Ê, 9•,ÊC]; they co-locate data that is accessed to- ity and hence cannot guarantee correctness of certain plan trans- gether or build correlated indices. Some use denormalization to formations over group-bys and joins). A SQL server option [Ϫ] avoid joins [9Ê, ÊC]. In contrast, diPs require no changes to the data uses the fact that the l shipdate attribute of lineitem is between layout and no foreknowledge of queries. ª to •ª days larger than o orderdate from orders [}C] to con- vert predicates on l shipdate to predicates on o orderdate and vice versa.Others discover similar constraints more broadly [}, 7. CONCLUSION Ê]. In contrast, diPs exploit relationships that may only hold con- As dataset sizes grow, human-digestible insights increasingly use ditionally given a query and a data-layout. Speciûcally, even if the queries with selective predicates. In this paper, we present a new predicate columns and join columns are independent, diPs can of- technique that extends the gains from data skipping; the predicate fer gains if the subset of partitions that satisfy a predicate contain a on a table is converted into new data-induced predicates that can small subset of values of the join columns. As we saw in §}, such apply on joining tables. Data-induced predicates (diPs) are possi- situations arise when datasets are clustered on time or partitioned ble, at a fundamental level, because of implicit or explicit cluster- on join columns [Ï]. ing that already exists in datasets.Our method to construct diPs Prior work moves predicates around using column equivalence leverages data statistics and works with a variety of simple statistics, and magic-set style reasoning [ª, CÏ, Ck, 9},ÊÏ,Êk]. SCOPE clusters some of which are already maintained in today’s clusters. We ex- and SQL server implement such optimizations, and as we saw in §, tend the query optimizer to output plans that skip data before query diPs oòer gains over these baselines. Column equivalence does not execution begins (e.g., partition elimination). In contrast to prior help when predicate columns do not exist in joining relations.Magic work that oòers data skipping only in the presence of complex aux- set transformations help only }/}} queries in TPC-H queries and iliary structures, workload-aware adaptations and changes to query only when predicates are selective [9}]. By inferring new predicates execution, using diPs is radically simple.Our results in a large data- that are induced by data statistics, diPs have a wider appeal. parallel cluster and a DBMS show that large gains are possible across Auxiliary data structures such as views [}Ê], join indices [}k], join a wide variety of queries, data distributions and layouts. bitmap indexes [], succinct tries [Ê9], column sketches [], and partial histograms [Ê] can also help speed-up queries. Join zone 8. REFERENCES [Ï] }ªÏ9 big-data and analytics forecast. [kÊ]F . Bancilhon et al. Magic sets and other strange ways to https://bit.ly/2TtKyjB. implement logic programs. In SIGMOD, ϕÊ. [}]A pache orc spec. vÏ. https://bit.ly/2J5BIkh. [k•]B . Bloom. Space/time trade-oòs in hash coding with [k]A pache spark join guidelines and performance tuning. allowable errors. CACM, ϕ9ª. https://bit.ly/2Jd87We. [ª]D . Borthakur et al. Hdfs architecture guide. Hadoop Apache [] Band join. https://bit.ly/2kixJJn. Project, }ªªÊ. []B itmap join indexes in oracle. https://bit.ly/2TLBBTF. [Ï]D . Borthakur et al. Apache hadoop goes realtime at facebook. [C]C lustered and nonclustered indexes described. In SIGMOD, }ªÏÏ. https://bit.ly/2Drdb9o. [}] P. G. Brown and P. J. Haas. Bhunt:Automatic discovery of [9]C olumnstore index performance: Rowgroup elimination. fuzzy algebraic constraints in relational data. In VLDB, }ªªk. https://bit.ly/2VFpljV. [k] M. Brucato,A. Abouzied, and A.Meliou. A scalable execution [Ê]C olumnstore indexes described. engine for package queries. SIGMOD Rec., }ªÏ9. https://bit.ly/2F7LZuI. [] L. Cao and E. A. Rundensteiner. High performance stream [•] Data skipping index in spark. https://bit.ly/2qONacb. query processing with correlation-aware partitioning. VLDB, [Ϫ] Date correlation optimzation in server }ªª & }ªªÊ. }ªÏk. https://bit.ly/2VodSVN. [] R. Chaiken et al. SCOPE: Easy and Eõcient Parallel [ÏÏ] Imdb datasets. https://imdb.to/2S3BzSF. Processing of Massive Datasets. In VLDB, }ªªÊ. [Ï}] Join order benchmark. https://bit.ly/2tTRyIb. [C] R. Chirkova and J. Yang. Materialized views. Foundations and [Ïk] _e junction tree algorithm. https://bit.ly/2lPHNtA. Trends in , }ªÏ}. [Ï] guide: Using zone maps. [9] S. Chu, M. Balazinska, and D. Suciu. From theory to practice: https://bit.ly/2qMeO9E. Eõcient join query evaluation in a parallel database system. In SIGMOD, }ªÏ. [Ï] Oracle: Using zone maps. https://bit.ly/2vsUWKK. [Ê] G. Cormode, M. Garofalakis, P. J. Haas,C.Jermaine, et al. [ÏC] Parquet thri format. https://bit.ly/2vm6D5U. Synopses for massive data: Samples, histograms, wavelets, [Ï9] Presto: Repartitioned and replicated joins. sketches. Foundations and Trends in Databases, }ªÏÏ. https://bit.ly/2JauYll. [•] G. Cormode and S.Muthukrishnan. An improved data [ÏÊ] Processing petabytes of data in seconds with databricks delta. stream summary: the count-min sketch and its applications. J. https://bit.ly/2Pryf2E . Algorithms, }ªª. [ϕ] Pushing data-induced predicates through joins in bigdata [ª] M. Elhemali,C. A.Galindo-Legaria, T.Grabs, and M. M. clusters; extended version. https://bit.ly/2WhTwP1. Joshi. Execution strategies for sql subqueries. In SIGMOD, https://bit.ly/2kJRV72 [}ª] Query Ï9 in tpc-h, see page ¢9. . }ªª9. https://bit.ly/2NJzzgF [}Ï] Query execution bitmap ûlters. . [Ï] M. Y. Eltabakh et al. Cohadoop:Flexible data placement and [}}] Redshi:Choosing the best sort key. its exploitation in hadoop. In VLDB, }ªÏÏ. https://amzn.to/2AmYbXh . [}] P. B. Gibbons, Y.Matias, and V. Poosala. Fast incremental [}k] Teradata: Join index. https://bit.ly/2FbalDT. maintenance of approximate histograms. In VLDB, ϕ•9. [}] TPC-DS Benchmark. http://www.tpc.org/tpcds/. [k] G. Graefe. _e cascades framework for query optimization. [}] Tpc-ds query ¢k. https://bit.ly/2U0rIk6. IEEE Data Eng. Bull., ϕ•. [}C] TPC-H Benchmark. http://www.tpc.org/tpch. []B .Hentschel, M. S.Kester, and S.Idreos. Column sketches:A [}9] Vertica: Choosing sort order: Best practices. scan accelerator for rapid and robust predicate evaluation. In https://bit.ly/2yrvPtG. SIGMOD, }ªÏÊ. [}Ê] Views in sql server. https://bit.ly/2CnbmIo. [] S.Idreos, M. L. Kersten, and S.Manegold. Database cracking. [}•] Program for tpc-h data generation with skew. In CIDR, }ªª9. https://bit.ly/2wvdNVo, }ªÏC. [C] I. Ilyas et al. Cords:Automatic discovery of correlations and [kª]A . Aboulnaga and S. Chaudhuri. Self-tuning histograms: so functional dependencies. In SIGMOD, }ªª. Building histograms without looking at data. In SIGMOD, [9] Z. G. Ives and N. E. Taylor. Sideways information passing for ϕ••. push-style query processing. In ICDE, }ªªÊ. [kÏ] P. K. Agarwal et al. Mergeable summaries. TODS, }ªÏk. [Ê] H. Kimura et al. Correlation maps: a compressed access [k}] S. Agarwal et al. Blinkdb: Queries with bounded errors and method for exploiting so functional dependencies. In bounded response times on very large data. In EuroSys, }ªÏk. VLDB, }ªª•. [kk]D . Agrawal,A. El Abbadi,A. Singh, and T. Yurek. Eõcient [•]A .Lamb et al. _e vertica analytic database: C-store 9 years view maintenance at data warehouses. In SIGMOD, ϕ•9. later. VLDB, }ªÏ}. [k] S. Agrawal, S. Chaudhuri, and V. R.Narasayya. Automated [Cª] V.Leis et al. How good are query optimizers, really? In Selection of Materialized Views and Indexes in SQL VLDB, }ªÏ. Databases. VLDB, }ªªª. [CÏ]A . Y.Levy, I. S.Mumick, and Y. Sagiv. Query optimization by [k] S. Agrawal et al. Database tuning advisor for microso sql predicate move-around. In VLDB, ϕ•. server }ªª. VLDB, }ªª. [C}] Y.Lu,A. Shanbhag,A. Jindal, and S.Madden. AdaptDB: [kC] N. Alon, Y.Matias, and M. Szegedy. _e space complexity of Adaptive partitioning for distributed joins. In VLDB, }ªÏ9. approximating the frequency moments. In STOC, ϕ••. [Ck] I. S.Mumick and H. Pirahesh. Implementation of magic-sets [k9] M. Armbrust et al. Spark sql: Relational data processing in in a system. In SIGMOD, ϕ•. spark. In SIGMOD, }ªÏ. [C]A .Nanda.Oracle exadata: Smart scans meet storage indexes. http://bit.ly/2ha7C5u, }ªÏÏ. [C]A . Nica et al. Statisticum: Data Statistics Management in SAP [9Ê] L. Sun et al. Fine-grained partitioning for aggressive data HANA. In VLDB, }ªÏ9. skipping. In SIGMOD, }ªÏ. [CC] M. Olma et al. Slalom:Coasting through raw data via [9•] L. Sun, M. J. Franklin, J. Wang, and E. Wu. Skipping-oriented adaptive partitioning and indexing. In VLDB, }ªÏ9. partitioning for columnar layouts. In VLDB, }ªÏ9. [C9]C . Olston et al. Pig Latin:A Not-So-Foreign Language for [ʪ]A . _usoo et al. Hive- a warehousing solution over a Data Processing. In SIGMOD, }ªªÊ. map-reduce framework. In VLDB, }ªª•. [CÊ] J. M. Patel et al. Quickstep: A data platform based on the [ÊÏ] N. Tran et al. _e vertica query optimizer: _e case for scaling-up approach. In VLDB, }ªÏÊ. specialized query optimizers. In ICDE, }ªÏ. [C•] J. Pearl. Probabilistic Reasoning in Intelligent Systems: [Ê}] M. J. Wainwright and M. I. Jordan. Graphical models, Networks of Plausible Inference. Morgan Kaufmann, ϕÊÊ. exponential families, and variational inference. [9ª] K. A. Ross,D. Srivastava, and S. Sudarshan. Materialized view https://bit.ly/2yurPIS, }ªªÊ. maintenance and integrity constraint checking: Trading space [Êk]B . Walenz, S. Roy, and J. Yang. Optimizing iceberg queries for time. In SIGMOD, ϕ•C. with complex joins. In SIGMOD, }ªÏ9. [9Ï]F . M. Schuhknecht,A. Jindal, and J. Dittrich. _e uncracked [Ê] J. Yu and M. Sarwat. Two birds, one stone: a fast, yet pieces in database cracking. In VLDB, }ªÏk. lightweight, indexing scheme for modern database systems. [9}] P. Seshadri et al. Cost-based optimization for magic: Algebra In VLDB, }ªÏC. and implementation. In SIGMOD, ϕ•C. [Ê] M. Zaharia et al. Resilient distributed datasets: a fault-tolerant [9k]A . Shanbhag et al. A robust partitioning scheme for ad-hoc abstraction for in-memory cluster computing. In NSDI, }ªÏ}. query workloads. In SOCC, }ªÏ9. [ÊC]E . Zamanian,C. Binnig, and A. Salama. Locality-aware [9]A . Shanbhag,A. Jindal, Y.Lu, and S.Madden. Amoeba: a partitioning in parallel database systems. In SIGMOD, }ªÏ. shape changing storage system for big data. In VLDB, }ªÏC. [Ê9] H. Zhang et al. Surf: Practical range query ûltering with fast [9] L. Shrinivas et al. Materialization strategies in the vertica succinct tries. In SIGMOD, }ªÏÊ. analytic database: Lessons learned. In ICDE, }ªÏk. [ÊÊ] J. Zhou et al. SCOPE: Parallel databases meet MapReduce. In [9C] J. Shute et al. FÏ: A distributed sql database that scales. In VLDB, }ªÏ}. VLDB, }ªÏk. [ʕ] J. Zhou, P.-A.Larson, and R. Chaiken. Incorporating [99]D . S´le¸zak et al. Brighthouse: An analytic data warehouse for partitioning and parallel plans into the scope optimizer. In ad-hoc queries. In VLDB, }ªªÊ. ICDE, }ªÏª.