Identifying Robust Plans Through Plan Diagram Reduction
Total Page:16
File Type:pdf, Size:1020Kb
Identifying Robust Plans through Plan Diagram Reduction ∗ Harish D. Pooja N. Darera Jayant R. Haritsa Database Systems Lab, SERC/CSA Indian Institute of Science, Bangalore 560012, INDIA ABSTRACT tary and conceptually different approach, which we consider in this Estimates of predicate selectivities by database query optimizers paper, is to identify robust plans that are relatively less sensitive to often differ significantly from those actually encountered during such selectivity errors. In a nutshell, to “aim for resistance, rather query execution, leading to poor plan choices and inflated response than cure”, by identifying plans that provide comparatively good times. In this paper, we investigate mitigating this problem by performance over large regions of the selectivity space. Such plan replacing selectivity error-sensitive plan choices with alternative choices are especially important for industrial workloads where plans that provide robust performance. Our approach is based on global stability is as much a concern as local optimality [18]. the recent observation that even the complex and dense “plan di- Over the last decade, a variety of strategies have been proposed agrams” associated with industrial-strength optimizers can be ef- to identify robust plans, including the Least Expected Cost [6, 8], ficiently reduced to “anorexic” equivalents featuring only a few Robust Cardinality Estimation [2] and Rio [3, 4] approaches. These plans, without materially impacting query processing quality. techniques provide novel and elegant formulations (summarized in Extensive experimentation with a rich set of TPC-H and TPC- Section 6), but have to contend with the following issues: DS-based query templates in a variety of database environments Firstly, they are intrusive requiring, to varying degrees, modifi- indicate that plan diagram reduction typically retains plans that are cations to the optimizer engine. Secondly, they require specialized substantially resistant to selectivity errors on the base relations. information about the workload and/or the system which may not However, it can sometimes also be severely counter-productive, always be easy to obtain or model. Thirdly, their query capabili- with the replacements performing much worse. We address this ties may be limited compared to the original optimizer – e.g., only problem through a generalized mathematical characterization of SPJ queries with key-based joins were considered in [2, 3]. Fur- plan cost behavior over the parameter space, which lends itself to ther, [3] has been implemented and evaluated on a non-commercial efficient criteria of when it is safe to reduce. Our strategies are fully optimizer. Finally and most importantly, as explained in Section 6, non-invasive and have been implemented in the Picasso optimizer none of them provide, on an individual query basis, quantitative visualization tool. guarantees on the quality of their final plan choice relative to the original (unmodified) optimizer’s selection. That is, they “cater to the crowd, not individuals”. 1. INTRODUCTION The query execution plan choices made by database engines of- The SEER Algorithm. In this paper, we present SEER ten turn out to be poor in practice because the optimizer’s selec- (Selectivity-Estimate-Error-Resistance), a new strategy for identi- tivity estimates are significantly in error with respect to the actual fying robust plans that can be directly used on operational database values encountered during query execution. Such errors, which can environments. More concretely, it even be in orders of magnitude in real database environments [19], arise due to a variety of reasons [24], including outdated statistics, • Treats the optimizer as a black-box and therefore is inher- attribute-value independence assumptions and coarse summaries. ently (a) completely non-intrusive, and (b) capable of han- dling whatever SQL is supported by the system. Further, it Robust Plans. To address this problem, an obvious approach is to does not expect to have any additional information beyond improve the quality of the statistical meta-data, for which several that provided by the engine interface. techniques have been presented in the literature ranging from im- • Provides plan performance guarantees that are individually proved summary structures [1] to feedback-based adjustments [24] applicable to queries in the selectivity space. to on-the-fly reoptimization of queries [16, 19, 3]. A complemen- ∗ • Considers only the parametric optimal set of plans Contact Author: [email protected] (POSP) [13] as replacement candidates and therefore deliv- ers, for errors that lie within the replacement plan’s optimal- Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that copies ity region, robustness “for free”. In contrast, the previously Permission to copy without fee all or part of this material is granted provided are not made or distributed for profit or commercial advantage and proposed algorithms in the literature may evaluate plans that that the copies are not made or distributed for direct commercial advantage, that copies bear this notice and the full citation on the first page. are not optimal anywhere in the space. the VLDB copyright notice and the title of the publication and its date appear, Copyright for components of this work owned by others than VLDB and notice is given that copying is by permission of the Very Large Data Endowment must be honored. • Is validated on commercial optimizers on both the classical Base Endowment. To copy otherwise, or to republish, to post on servers TPC-H [26] and the recent TPC-DS [27] benchmarks. orAbstracting to redistribute with to credit lists, is requires permitted. a fee To and/or copy specialotherwise, permission to republish, from the publisher,to post on ACM. servers or to redistribute to lists requires prior specific VLDBpermission ‘08, August and/or 24-30, a fee. 2008, Request Auckland, permission New Zealand to republish from: We hasten to add that SEER, due to its non-intrusive design ob- CopyrightPublications 2008 Dept., VLDB Endowment,ACM, Inc. ACMFax 000-0-00000-000-0/00/00.+1 (212) 869-0481 or jective, only attempts to address selectivity errors that occur on the [email protected]. PVLDB '08, August 23-28, 2008, Auckland, New Zealand Copyright 2008 VLDB Endowment, ACM 978-1-60558-305-1/08/08 1124 select o year, sum(case when nation = ’BRAZIL’ then volume else 0 end) / runtime turn out to be significantly different, say (50%,40%), sum(volume) executing with P70, whose cost increases steeply with selec- from (select YEAR(o orderdate) as o year, l extendedprice * (1 - l discount) as tivity, would be disastrous. In contrast, this error would have volume, n2.n name as nation had no impact with the reduced plan diagram of Figure 2(b), from part, supplier, lineitem, orders, customer, nation n1, nation n2, region since P1, the replacement plan choice at (14%,1%), remains where p partkey = l partkey and s suppkey = l suppkey and l orderkey the preferred plan for a large range of higher values, includ- = o orderkey and o custkey = c custkey and c nationkey = ing (50%,40%). Quantitatively, at the run-time location, plan n1.n nationkey and n1.n regionkey = r regionkey and s nationkey = n2.n nationkey and r name = ’AMERICA’ and p type = ’ECON- P1 has a cost of 135, while P70’s cost of 402 is about three OMY ANODIZED STEEL’ and times more expensive. s acctbal :varies and l extendedprice :varies ) as all nations group by o year It is easy to see, as in the above example, that the replacement order by o year plan will, by definition, be a robust choice for errors that lie within its optimality region, i.e. its “endo-optimal” region. This is the ad- Figure 1: Example Query Template: QT8 vantage, mentioned earlier, of considering replacements only from the POSP set of plans. The obvious question then is whether the sizes of these regions are typically large enough to materially im- prove the system performance. base relations, similar to [1]. However, since these base errors are A second, and even more important question, is: What if the er- often the source of poor plan choices due to the multiplier effect as rors are such that the run-time locations are “exo-optimal” w.r.t. they progress up the plan-tree [15], minimizing their impact could the replacement plan? For example, if the run-time location hap- be of significant value in practical environments. Further, since pens to be at (80%,90%), which is outside the optimality region of SEER is a purely compile-time approach, it can be used in conjunc- P1? In this situation, nothing can be said upfront – the replacement tion with run-time techniques such as adaptive query processing [9] could be much better, similar or much worse than the original plan. for addressing selectivity errors in the higher nodes of the plan tree. Therefore, ideally speaking, we would like to have a mechanism Anorexic Reduction of Plan Diagrams. SEER is based on the through which one could assess whether a replacement is globally anorexic reduction of plan diagrams, a notion that was recently safe over the entire parameter space. presented and analyzed in [11]. Specifically, a “plan diagram” [22] Contributions. In this paper, we address the above issues from is a color-coded pictorial enumeration of the plan choices of the both theoretical and empirical perspectives. We have conducted optimizer for a parametrized query template over the relational se- extensive experimentation on a leading commercial optimizer with lectivity space. That is, it visually captures the POSP geometry. a rich suite of multi-dimensional TPC-H and TPC-DS based query For example, consider QT8, the parametrized 2D query tem- templates operating on a variety of logical and physical database plate shown in Figure 1, based on Query 8 of TPC-H.