Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans*

Navin Kabra
Computer Sciences Department
University of Wisconsin, Madison
[email protected]

David J. DeWitt
Computer Sciences Department
University of Wisconsin, Madison
dewitt@cs.wisc.edu

*This research was supported by NASA under contracts NAGW-3895 and NAGW-4229.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD '98 Seattle, WA, USA
© 1998 ACM 0-89791-995-5/98/006...$5.00

Abstract

For a number of reasons, even the best query optimizers can very often produce sub-optimal query execution plans, leading to a significant degradation of performance. This is especially true in databases used for complex decision support queries and/or object-relational databases. In this paper, we describe an algorithm that detects sub-optimality of a query execution plan during query execution and attempts to correct the problem. The basic idea is to collect statistics at key points during the execution of a complex query. These statistics are then used to optimize the execution of the query, either by improving the resource allocation for that query, or by changing the execution plan for the remainder of the query. To ensure that this does not significantly slow down the normal execution of a query, the Query Optimizer carefully chooses what statistics to collect, when to collect them, and the circumstances under which to re-optimize the query. We describe an implementation of this algorithm in the Paradise Database System, and we report on performance studies, which indicate that this can result in significant improvements in the performance of complex queries.

1 Introduction

One of the key reasons for the success of relational database technology is the use of declarative languages and query optimization. The user can just specify what data needs to be retrieved and the database takes over the task of finding the most efficient method of retrieving that data. It is the job of the query optimizer to evaluate alternative methods of executing a query, and selecting the cheapest alternative.

Notwithstanding the tremendous success of this approach, query optimization still remains a problem for database systems. Modern database systems are placing an increasingly heavy burden upon their query optimizers. Relational database systems are increasingly being used to execute complex decision support queries. In addition, commercial vendors are all scrambling to add object-relational features to their database systems. Unfortunately optimizer technology has not kept pace with these advances, and a number of the inadequacies of traditional query optimizers have become obvious. Due to the inability of query optimizers to accurately estimate the cost of executing complex query evaluation plans, they often produce sub-optimal plans.

There are a number of reasons why estimating the cost of query execution is difficult. Query optimizers use statistics stored in the system catalogs to estimate sizes and cardinalities of tables that participate in the query. This introduces an error in the estimates either due to the approximations involved, or because statistics are not kept up-to-date. As the number of joins in the query increases, these errors multiply and grow exponentially [9]. Another source of errors is the lack of sufficient information about the run-time system at query optimization time. The amount of available resources (especially memory), the load on the system, and the values of host language variables are things that differ for every execution of the query, and, in some cases, change in the middle of query execution.

The problem is further aggravated in the case of object-relational database systems that allow users to define data-types, methods, and operators. Collection and storage of statistics (for example, histograms) for user-defined data-types (for example, spatial data-types like polygon, point) is an area that has not yet been addressed by the database research community. There are some primitive methods that have been proposed to deal with the estimation of the cost of execution for user defined functions/methods written in an external language (like C++) [23], but these are far from adequate. Similarly, selectivity estimation for predicates involving user-defined methods/functions is another area that is poorly understood. All of this makes it really difficult to properly estimate the cost of executing object-relational queries. Although recent advances in estimation techniques (for example, the histograms of [19] and [11]) and the parameterized/dynamic query evaluation plans of [10, 8, 7] address some of the issues, many problems still remain to be solved.

In this paper, we describe Dynamic Re-Optimization, an algorithm that can detect the sub-optimality of a query execution plan while executing the query in order to re-optimize it and improve its performance. During query optimization, the plan produced by the query optimizer is annotated with the various estimates and statistics used by the optimizer. Actual statistics are collected at query execution time. These observed statistics are compared against the estimated statistics and the difference is taken as an indicator of whether the query-execution plan is sub-optimal. The new statistics (much more accurate than the initial optimizer estimates) can now be used to optimize the execution of the remainder of the query.

Collection of statistics at run-time can significantly slow down the execution of a query. Further, re-optimizing part of the query and modifying the query execution plan at run-time also incurs overheads. This can actually cause the performance of a query to deteriorate instead of improving. To prevent such problems, we use hints from the optimizer to determine the most strategic places in the query where statistics should be collected, and to determine the conditions under which to re-optimize a query.

Our approach is quite different from the competition model proposed by Antoshenkov [2, 3], the dynamic query plans of [8] and [7], or the parametric query optimization algorithms proposed in [10]. The differences between these algorithms and our approach are further described in Section 4 when we discuss related work.

The remainder of this paper is organized as follows. In Section 2 we describe the details of our algorithm. In Section 3, we describe an implementation of the algorithm in the Paradise database system, and report the results of a performance study that validates our algorithm. In Section 4, we contrast our approach with previous work described in the database literature. Section 5 presents our conclusions and directions of future research.

2 Algorithm Overview

The Dynamic Re-Optimization algorithm tries to detect sub-optimality of a query execution plan while the query is being executed. If a query execution plan is believed to be sub-optimal, it dynamically changes the execution plan of the remainder of the query (the part that hasn't been executed yet) leading to an improvement in performance.

These are the salient features of the algorithm:

1. (Annotated) Query Execution Plans: We assume that a conventional query optimizer is used to produce a query execution plan for a given query. The only requirement on the plan generated by the query optimizer is that the plan produced by the optimizer should include information about the optimizer's estimates of the sizes of all the intermediate results in the query, and the execution cost/time for each operator in the query. We refer to such a plan as an annotated query execution plan in the remainder of this

[...]

used indiscriminately. To prevent this from happening, at query optimization time, the most effective points to collect statistics are determined, and statistic collection operators are inserted into the query execution plan at only those points.

In the remainder of this section, we describe each of the above items in detail. We end the section with an overview of the whole dynamic re-optimization process, and how it all fits together.

2.1 Query Execution Plans

The job of a query optimizer in a database system is to take as input a query (which is declarative) and produce an execution plan for that query. Figure 1(a) shows an example SQL query. We will use this query as a running example throughout this section for illustrative purposes. Figure 1(b) shows a possible execution plan for this query that might be produced by a query optimizer. An execution plan is essentially a tree in which each node represents some database operator (like hash-join, index-scan) being applied to its inputs.

Figure 1: A query and its query execution plan

(a) Query:

    select avg(Rel1.selectattr1), Rel1.groupattr
    from Rel1, Rel2, Rel3
    where Rel1.selectattr1 < :value1
      and Rel1.selectattr2 < :value2
      and Rel1.joinattr2 = Rel2.joinattr2
      and Rel1.joinattr3 = Rel3.joinattr3
    group by Rel1.groupattr

(b) Execution plan (root first, inputs indented):

    Aggregate: avg(Rel1.selectattr1), group by Rel1.groupattr
      Indexed-Join: Rel1.joinattr3 = Rel3.joinattr3
        Hash-Join: Rel1.joinattr2 = Rel2.joinattr2
          Rel1, restricted by Rel1.selectattr1 < :value1 and Rel1.selectattr2 < :value2
          Rel2
        Rel3

During the course of optimization, the query optimizer estimates the sizes of various intermediate results that might be produced, and the cost/time taken by each operator. As part of the Dynamic Re-Optimization algorithm, we modify the query optimizer so that these estimates are included in the query evaluation plan that it produces, and are sent to the database execution engine.