VLDB Journal manuscript No. (will be inserted by the editor)

Query Optimization Through the Looking Glass, and What We Found Running the Join Order Benchmark

Viktor Leis · Bernhard Radke · Andrey Gubichev · Atanas Mirchev · Peter Boncz · Alfons Kemper · Thomas Neumann

Received: date / Accepted: date

Abstract Finding a good join order is crucial for query performance. In this paper, we introduce the Join Order Benchmark (JOB) that works on real-life data riddled with correlations and introduces 113 complex join queries. We experimentally revisit the main components in the classic query optimizer architecture using a complex, real-world data set and realistic multi-join queries. For this purpose, we describe cardinality-estimate injection and extraction techniques that allow us to compare the cardinality estimators of multiple industrial SQL implementations on equal footing, and to characterize the value of having perfect cardinality estimates. Our investigation shows that all industrial-strength cardinality estimators routinely produce large errors: though cardinality estimation using samples solves the problem for single-table queries, there are still no techniques in industrial systems that can deal accurately with join-crossing correlated query predicates. We further show that while estimates are essential for finding a good join order, query performance is unsatisfactory if the query engine relies too heavily on these estimates. Using another set of experiments that measure the impact of the cost model, we find that it has much less influence on query performance than the cardinality estimates. We investigate plan enumeration techniques comparing exhaustive dynamic programming with heuristic algorithms and find that exhaustive enumeration improves performance despite the sub-optimal cardinality estimates. Finally, we extend our investigation from main-memory only, to also include disk-based query processing. Here, we find that though accurate cardinality estimation should be the first priority, other aspects such as modeling random vs. sequential I/O are also important to predict query runtime.

Viktor Leis
Technische Universität München, Garching, Germany
E-mail: [email protected]

Bernhard Radke
Technische Universität München, Garching, Germany
E-mail: [email protected]

Andrey Gubichev
Technische Universität München, Garching, Germany
E-mail: [email protected]

Atanas Mirchev
Technische Universität München, Garching, Germany
E-mail: [email protected]

Peter Boncz
CWI, Amsterdam, The Netherlands
E-mail: [email protected]

Alfons Kemper
Technische Universität München, Garching, Germany
E-mail: [email protected]

Thomas Neumann
Technische Universität München, Garching, Germany
E-mail: [email protected]

1 Introduction

The problem of finding a good join order is one of the most studied problems in the field. Fig. 1 illustrates the classical, cost-based approach, which dates back to System R [45]. To obtain an efficient query plan, the query optimizer enumerates some subset of the valid join orders, for example using dynamic programming. Using cardinality estimates as its principal input, the cost model then chooses the cheapest alternative from semantically equivalent plan alternatives.

Theoretically, as long as the cardinality estimations and the cost model are accurate, this architecture obtains the optimal query plan. In reality, cardinality estimates are usually computed based on simplifying assumptions like uniformity and independence. In real-world data sets, these assumptions are frequently wrong, which may lead to sub-optimal and sometimes disastrous plans.

[Figure 1: a SQL query (SELECT ... FROM R,S,T WHERE ...) is fed through cardinality estimation, the cost model, and plan space enumeration, producing an execution plan (e.g., a hash-join/index-nested-loop tree over R, S, and T).]
Fig. 1 Traditional query optimizer architecture.

In this paper we experimentally investigate the three main components of the classical query optimization architecture in order to answer the following questions:

– How good are cardinality estimators and when do bad estimates lead to slow queries?
– How important is an accurate cost model for the overall query optimization process?
– How large does the enumerated plan space need to be?

To answer these questions, we use a novel methodology that allows us to isolate the influence of the individual optimizer components on query performance. Our experiments are conducted using a real-world data set and 113 multi-join queries that provide a challenging, diverse, and realistic workload. The main contributions of this paper are:

– We design a challenging workload named Join Order Benchmark (JOB), which is based on the IMDB data set. The benchmark is publicly available to facilitate further research.
– To the best of our knowledge, this work is the first end-to-end study of the join ordering problem using a real-world data set and realistic queries.
– By quantifying the contributions of cardinality estimation, the cost model, and the plan enumeration algorithm on query performance, we provide guidelines for the complete design of a query optimizer. We also show that many disastrous plans can easily be avoided.

The rest of this paper is organized as follows: We first discuss important background and the Join Order Benchmark in Section 2. Section 3 shows that the cardinality estimators of the major relational database systems produce bad estimates for many realistic queries, in particular for multi-join queries. The conditions under which these bad estimates cause slow performance are analyzed in Section 4. We show that it very much depends on how much the query engine relies on these estimates and on how complex the physical database design is, i.e., the number of indexes available. Query engines that mainly rely on hash joins and full table scans are quite robust even in the presence of large cardinality estimation errors. The more indexes are available, the harder the problem becomes for the query optimizer, resulting in runtimes that are far away from the optimal query plan. Section 5 shows that with the currently-used cardinality estimation techniques, the influence of cost model errors is dwarfed by cardinality estimation errors and that even quite simple cost models seem to be sufficient. Section 6 investigates different plan enumeration algorithms and shows that—despite large cardinality misestimates and sub-optimal cost models—exhaustive join order enumeration improves performance and that using heuristics leaves performance on the table. To augment the understanding obtained from aggregated statistics, Section 7 looks at two particular queries in our workload and analyzes their query plans. While most experiments use the in-memory setting, Section 8 repeats the important experiments with a cold cache and reading data from disk. Related work is discussed in Section 9.

We conclude this paper by repeating all important insights in Section 10. Time-constrained readers may start with Section 10, and selectively read the sections referenced there.

2 Background and Methodology

Many query optimization papers ignore cardinality estimation and only study search space exploration for join ordering with randomly generated, synthetic queries (e.g., [39,14]). Other papers investigate only cardinality estimation in isolation either theoretically (e.g., [22]) or empirically (e.g., [53]). As important and interesting both approaches are for understanding query optimizers, they do not necessarily reflect real-world user experience.

The goal of this paper is to investigate the contribution of all relevant query optimizer components to end-to-end query performance in a realistic setting. We therefore perform our experiments using a workload based on a real-world data set and the widely-used PostgreSQL system. PostgreSQL is a relational database system with a fairly traditional architecture making it a good subject for our experiments. Furthermore, its open source nature allows one to inspect and change its internals. In this section we introduce the Join Order Benchmark, describe all relevant aspects of PostgreSQL, and present our methodology.

2.1 The IMDB Data Set

Many research papers on query processing and optimization use standard benchmarks like TPC-H, TPC-DS, or the Star Schema Benchmark (SSB) [4,43,41]. While these benchmarks have proven their value for evaluating query engines, we argue that they are not good benchmarks for the cardinality estimation component of query optimizers. The reason is that in order to easily be able to scale the benchmark data, the data generators are using the very same simplifying assumptions (uniformity, independence, principle of inclusion) that query optimizers make.

Real-world data sets, in contrast, are full of correlations and non-uniform data distributions, which makes cardinality estimation much harder. Section 3.3 shows that PostgreSQL's simple cardinality estimator indeed works unrealistically well for TPC-H. TPC-DS is slightly harder in that it has a number of non-uniformly distributed (skewed) attributes, but is still too easy due to not having correlations between attributes.

Therefore, instead of using a synthetic data set, we chose the Internet Movie Data Base¹ (IMDB). It contains a plethora of information about movies and related facts about actors, directors, production companies, etc. The data is freely available for non-commercial use as text files². In addition, we used the open-source imdbpy³ package to transform the text files into a relational database. The schema and the key/foreign-key relationships are depicted in Fig. 2.

The data set allows one to answer queries like "Which actors played in movies released between 2000 and 2005 with ratings above 8?". Like most real-world data sets IMDB is full of correlations and non-uniform data distributions, and is therefore much more challenging than most synthetic data sets. Our snapshot is from May 2013 and occupies 3.6 GB when exported to CSV files. The cardinalities of all 21 tables of the data set are listed in Table 1.

Table 1 Cardinalities and aliases of the IMDB tables.

  table             alias    cardinality
  aka_name          an           901,343
  aka_title         at           361,472
  cast_info         ci        36,244,344
  char_name         chn        3,140,339
  comp_cast_type    cct                4
  company_name      cn           234,997
  company_type      ct                 4
  complete_cast     cc           135,086
  info_type         it               113
  keyword           k            134,170
  kind_type         kt                 7
  link_type         lt                18
  movie_companies   mc         2,609,129
  movie_info        mi        14,835,720
  movie_info_idx    mi_idx     1,380,035
  movie_keyword     mk         4,523,930
  movie_link        ml            29,997
  name              n          4,167,491
  person_info       pi         2,963,664
  role_type         rt                12
  title             t          2,528,312

¹ http://www.imdb.com/
² ftp://ftp.fu-berlin.de/pub/misc/movies/database/
³ https://bitbucket.org/alberanid/imdbpy/get/5.0.zip

2.2 The JOB Queries

Based on the IMDB database, we have constructed analytical SQL queries. Each query consists of one select-project-join block⁴. Since we focus on join ordering, which arguably is the most important query optimization problem, we designed the queries to have between 3 and 16 joins, with an average of 8 joins per query. Query 13d, which is shown in Fig. 3, is a typical example that computes the ratings and release dates for all movies produced by US companies. The join graph of query 13d is shown in Fig. 4. The solid edges in the graph represent key/foreign-key edges (1 : n) with the arrow head pointing to the primary key side. Dotted edges represent foreign key/foreign key joins (n : m), which appear due to transitive join predicates. Our query set consists of 33 query structures, each with 2-6 variants that differ in their selections only, resulting in a total of 113 queries – all depicted in detail in Appendix A. Note that depending on the selectivities of the base table predicates, the variants of the same query structure have different optimal query plans that yield widely differing (sometimes by orders of magnitude) runtimes. Also, some queries have more complex selection predicates than the example (e.g., disjunctions or substring search using LIKE).

Our queries, which are shown in Appendix A, are "realistic" and "ad hoc" in the sense that they answer questions that may reasonably have been asked by a movie enthusiast. We also believe that despite their simple SPJ-structure, the queries model the core difficulty of the join ordering problem. For cardinality estimators the queries are challenging due to the significant number of joins and the correlations contained in the data set. However, we did not try to "trick" the query optimizer, e.g., by picking attributes with extreme correlations. Indeed we believe that real-world predicates are correlated on real-world data sets more often than not (i.e., not independent). The prevalence of correlated predicates is nicely illustrated by the well-known Honda Accord example [35], which is often used in IBM papers. We intentionally did not include more complex join predicates like inequalities or non-surrogate-key predicates, because cardinality estimation for JOB is already quite challenging.

We do not claim that the specific quantitative results obtained using JOB are directly transferable to other workloads. On the other hand, we also do not see any excuse for why JOB should not run well in relational database systems. Thus, we propose JOB for future research in cardinality estimation and join order optimization.

⁴ Since in this paper we do not model or investigate aggregation, we omitted GROUP BY from our queries. To avoid communication from becoming the performance bottleneck for queries with large result sizes, we wrap all attributes in the projection clause with MIN(...) expressions when executing (but not when estimating). This change has no effect on PostgreSQL's join order selection because its optimizer does not push down aggregations.

Fig. 2 IMDB schema with key/foreign-key relationships. Underlined attributes are primary keys. Italic font indicates a foreign-key attribute.

SELECT MIN(cn.name), MIN(mi.info), MIN(mi_idx.info)
FROM company_name cn, company_type ct,
     info_type it, info_type it2, title t,
     kind_type kt, movie_companies mc,
     movie_info mi, movie_info_idx mi_idx
WHERE cn.country_code = '[us]'
  AND ct.kind = 'production companies'
  AND it.info = 'rating'
  AND it2.info = 'release dates'
  AND kt.kind = 'movie'
  AND ... -- (11 join predicates / see Fig. 4)

Fig. 3 Example JOB query 13d computes the ratings and release dates for all movies produced by US companies.

[Figure 4: join graph over company_name, company_type, info_type (twice), kind_type, movie_companies, movie_info, movie_info_idx, and title.]
Fig. 4 Join graph for JOB queries 13a, 13b, 13c, 13d.

2.3 PostgreSQL

PostgreSQL's optimizer follows the traditional textbook architecture. Join orders, including bushy trees but excluding trees with cross products, are enumerated using dynamic programming. The cost model, which is used to decide which plan alternative is cheaper, is described in more detail in Section 5.1. The cardinalities of base tables are estimated using histograms (quantile statistics), most common values with their frequencies, and domain cardinalities (distinct value counts). These per-attribute statistics are computed by the analyze command using a sample of the table. For complex predicates, where histograms can not be applied, the system resorts to ad hoc methods that are not theoretically grounded ("magic constants"). To combine conjunctive predicates for the same table, PostgreSQL simply assumes independence and multiplies the individual selectivity estimates.

The result sizes of joins are estimated using the formula

    |T1 ⋈x=y T2| = |T1| · |T2| / max(dom(x), dom(y)),

where T1 and T2 are arbitrary expressions and dom(x) is the domain cardinality of attribute x, i.e., the number of distinct values of x. This value is the principal input for the join cardinality estimation. To summarize, PostgreSQL's cardinality estimator is based on the following assumptions:

– uniformity: all values, except for the most-frequent ones, are assumed to have the same number of tuples
– independence: predicates on attributes (in the same table or from joined tables) are independent
– principle of inclusion: the domains of the join keys overlap such that the keys from the smaller domain have matches in the larger domain
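To make these assumptions concrete, the following Python sketch (our own illustration, not PostgreSQL code) mimics the two estimation steps described above: conjunctive base-table predicates are combined by multiplying per-predicate selectivities under the independence assumption, and join sizes follow the |T1 ⋈ T2| formula. The statistics passed in are hypothetical.

```python
# Sketch of PostgreSQL-style cardinality estimation (illustrative only).
# Base-table predicates are combined under the independence assumption;
# join sizes use |T1 JOIN T2| = |T1| * |T2| / max(dom(x), dom(y)).

def estimate_selection(table_rows, predicate_selectivities):
    """Multiply per-predicate selectivities (independence assumption)."""
    selectivity = 1.0
    for s in predicate_selectivities:
        selectivity *= s
    return table_rows * selectivity

def estimate_join(card_t1, card_t2, dom_x, dom_y):
    """Estimate |T1 JOIN_{x=y} T2| from input cardinalities and domain sizes."""
    return card_t1 * card_t2 / max(dom_x, dom_y)

if __name__ == "__main__":
    # Hypothetical statistics: movie_companies filtered by two predicates with
    # selectivities 0.1 and 0.05, then joined with company_name on a key with
    # 234,997 distinct values on both sides.
    mc = estimate_selection(2_609_129, [0.1, 0.05])
    cn = 234_997
    print(estimate_join(mc, cn, dom_x=234_997, dom_y=234_997))  # ~13,046
```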

The query engine of PostgreSQL takes a physical operator plan and executes it using Volcano-style interpretation. The most important access paths are full table scans and lookups in unclustered B+Tree indexes. Joins can be executed using either nested-loop joins (with or without index lookups), in-memory hash joins, or sort-merge joins where the sort can spill to disk if necessary. The decision which join algorithm is used is made by the optimizer and cannot be changed at runtime.

2.4 Cardinality Extraction and Injection

We loaded the IMDB data set into 5 relational database systems: PostgreSQL, HyPer, and 3 commercial systems. Next, we ran the statistics gathering command of each database system with default settings to generate the database-specific statistics (e.g., histograms or samples) that are used by the estimation algorithms. We then obtained the cardinality estimates for all intermediate results of our test queries using database-specific commands (e.g., using the EXPLAIN command for PostgreSQL). We will later use these estimates of different systems to obtain optimal query plans (w.r.t. respective systems) and run these plans in PostgreSQL. For example, the intermediate results of the chain query

    σx=5(A) ⋈A.bid=B.id B ⋈B.cid=C.id C

are σx=5(A), σx=5(A) ⋈ B, B ⋈ C, and σx=5(A) ⋈ B ⋈ C. Additionally, the availability of indexes on foreign keys and index-nested-loop joins introduces the need for additional intermediate result sizes. For instance, if there exists a non-unique index on the foreign key A.bid, it is also necessary to estimate A ⋈ B and A ⋈ B ⋈ C. The reason is that the selection A.x = 5 can only be applied after retrieving all matching tuples from the index on A.bid, and therefore the system produces two intermediate results, before and after the selection. Besides cardinality estimates from the different systems, we also obtain the true cardinality for each intermediate result by executing SELECT COUNT(*) queries⁵.

We further modified PostgreSQL to enable cardinality injection of arbitrary join expressions, allowing its optimizer to use the estimates of other systems (or the true cardinalities) instead of its own. This allows one to directly measure the influence of cardinality estimates from different systems on query performance. Note that IBM DB2 supports a limited form of user control over the estimation process by allowing users to explicitly specify the selectivities of predicates. However, selectivity injection cannot fully model inter-relation correlations and is therefore less general than the capability of injecting cardinalities.

⁵ For our workload it was still feasible to do this naïvely. For larger data sets the approach by Chaudhuri et al. [8] may become necessary.
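As an illustration of the extraction step, the sketch below (a hedged example of ours; it assumes a local IMDB database reachable via psycopg2) reads PostgreSQL's estimated row count for a subexpression from EXPLAIN (FORMAT JSON) and compares it with the true cardinality obtained by SELECT COUNT(*).

```python
# Hedged sketch: compare PostgreSQL's estimated cardinality of a subexpression
# with its true cardinality. Assumes an 'imdb' database and psycopg2 installed.
import json
import psycopg2

def estimated_rows(cur, subquery):
    """Top-level row estimate from EXPLAIN (FORMAT JSON) for the given SQL text."""
    cur.execute("EXPLAIN (FORMAT JSON) " + subquery)
    plan = cur.fetchone()[0]
    if isinstance(plan, str):      # depending on driver setup, may arrive as text
        plan = json.loads(plan)
    return plan[0]["Plan"]["Plan Rows"]

def true_rows(cur, subquery):
    cur.execute("SELECT COUNT(*) FROM (" + subquery + ") AS sub")
    return cur.fetchone()[0]

if __name__ == "__main__":
    sub = ("SELECT * FROM movie_companies mc, company_name cn "
           "WHERE mc.company_id = cn.id AND cn.country_code = '[us]'")
    with psycopg2.connect(dbname="imdb") as conn:
        with conn.cursor() as cur:
            print("estimate:", estimated_rows(cur, sub))
            print("true    :", true_rows(cur, sub))
```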
2.5 Experimental Setup

The cardinalities of the commercial systems were obtained using a laptop running Windows 7. All performance experiments were performed on a server with two Intel Xeon X5570 CPUs (2.9 GHz) and a total of 8 cores running PostgreSQL 9.4 on Linux. Version 9.4 does not support intra-query parallelism and since we do not execute multiple queries at the same time, only a single core was used in all experiments. The system has 64 GB of RAM, which means that the entire IMDB database is fully cached in RAM. Intermediate query processing results (e.g., hash tables) also easily fit into RAM, unless a very bad plan with extremely large intermediate results is chosen.

We set the memory limit per operator (work_mem) to 2 GB, which results in much better performance due to the more frequent use of in-memory hash joins instead of external memory sort-merge joins. Additionally, we set the buffer pool size (shared_buffers) to 4 GB and the size of the operating system's buffer cache used by PostgreSQL (effective_cache_size) to 32 GB. For PostgreSQL it is generally recommended to use OS buffering in addition to its own buffer pool and keep most of the memory on the OS side. The defaults for these three settings are very low (MBs, not GBs), which is why increasing them is generally recommended. Finally, by increasing the geqo_threshold parameter to 18 we forced PostgreSQL to always use dynamic programming instead of falling back to a heuristic for queries with more than 12 joins.
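The session-level settings can be scripted as in the minimal sketch below. The parameter names are the standard PostgreSQL ones quoted above; note that shared_buffers is a server-level parameter that cannot be changed with SET and must instead be configured in postgresql.conf before the server is started.

```python
# Sketch: apply the per-session settings used in our experiments.
# shared_buffers (4 GB) must be set in postgresql.conf and requires a restart.
import psycopg2

SESSION_SETTINGS = [
    "SET work_mem = '2GB'",               # per-operator memory limit
    "SET effective_cache_size = '32GB'",  # assumed OS buffer cache size
    "SET geqo_threshold = 18",            # always use dynamic programming
]

def configure(conn):
    with conn.cursor() as cur:
        for stmt in SESSION_SETTINGS:
            cur.execute(stmt)

if __name__ == "__main__":
    with psycopg2.connect(dbname="imdb") as conn:
        configure(conn)
```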

3 Cardinality Estimation

Cardinality estimates are the most important ingredient for finding a good query plan. Even exhaustive join order enumeration and a perfectly accurate cost model are worthless unless the cardinality estimates are (roughly) correct. It is well known, however, that cardinality estimates are sometimes wrong by orders of magnitude, and that such errors are usually the reason for slow queries. In this section, we experimentally investigate the quality of cardinality estimates in relational database systems by comparing the estimates with the true cardinalities.

3.1 Estimates for Base Tables

To measure the quality of base table cardinality estimates, we use the q-error, which is the factor by which an estimate differs from the true cardinality. For example, if the true cardinality of an expression is 100, the estimates of 10 or 1000 both have a q-error of 10. Using the ratio instead of an absolute or quadratic difference captures the intuition that for making planning decisions only relative differences matter. The q-error furthermore provides a theoretical upper bound for the plan quality if the q-errors of a query are bounded [37].
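The q-error itself is straightforward to compute; the short sketch below (ours) derives the percentiles reported in Table 2 below from paired lists of estimated and true cardinalities.

```python
# Sketch: q-error (factor by which an estimate deviates from the truth)
# and the percentiles used in Table 2.
def q_error(estimate, truth):
    # Guard against zero cardinalities, which would make the ratio undefined.
    estimate, truth = max(estimate, 1.0), max(truth, 1.0)
    return max(estimate / truth, truth / estimate)

def percentile(values, p):
    """Nearest-rank percentile, p in [0, 100]."""
    ordered = sorted(values)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[rank]

if __name__ == "__main__":
    estimates = [90, 1000, 12, 5]
    truths = [100, 10, 12, 500]
    errors = [q_error(e, t) for e, t in zip(estimates, truths)]
    for p in (50, 90, 95, 100):
        print(p, percentile(errors, p))
```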
Table 2 Q-errors for base table selections.

              median    90th     95th       max
  PostgreSQL    1.00     2.08     6.10       207
  DBMS A        1.01     1.33     1.98      43.4
  DBMS B        1.00     6.03     30.2    104000
  DBMS C        1.06     1677     5367     20471
  HyPer         1.02     4.47     8.00      2084

Table 2 shows the 50th, 90th, 95th, and 100th percentiles of the q-errors for the 629 base table selections in our workload. The median q-error is close to the optimal value of 1 for all systems, indicating that the majority of all selections are estimated correctly. However, all systems produce misestimates for some queries, and the quality of the cardinality estimates differs strongly between the different systems.

Looking at the individual selections, we found that DBMS A and HyPer can usually predict even complex predicates like substring search using LIKE very well. To estimate the selectivities for base tables HyPer uses a random sample of 1000 rows per table and applies the predicates on that sample⁶. This allows one to get accurate estimates for arbitrary base table predicates as long as the selectivity is not too low. When we looked at the selections where DBMS A and HyPer produce errors above 2, we found that most of them have predicates with extremely low true selectivities (e.g., 10⁻⁵ or 10⁻⁶). This routinely happens when the selection yields zero tuples on the sample, and the system falls back on an ad-hoc estimation method ("magic constants"). It therefore appears to be likely that DBMS A also uses the sampling approach.

⁶ The sample is stored in the DataBlock [26] format, which enables fast scans (and therefore fast estimation). The sample is (re-)generated when computing the statistics of a table.
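A sample-based base-table estimator of the kind described here can be sketched in a few lines (our illustration, not HyPer's implementation); note the fall-back constant that is used when no sampled tuple qualifies, which is exactly the situation that produces the large errors discussed above.

```python
# Sketch of sample-based base-table estimation: evaluate the predicate on a
# fixed-size random sample and scale up; fall back to a "magic constant"
# selectivity when no sampled row qualifies (illustrative only).
import random

FALLBACK_SELECTIVITY = 0.001  # hypothetical magic constant

def build_sample(rows, sample_size=1000, seed=42):
    rng = random.Random(seed)
    return rows if len(rows) <= sample_size else rng.sample(rows, sample_size)

def estimate(rows, sample, predicate):
    qualifying = sum(1 for row in sample if predicate(row))
    if qualifying == 0:
        return FALLBACK_SELECTIVITY * len(rows)
    return qualifying / len(sample) * len(rows)

if __name__ == "__main__":
    titles = [{"production_year": 1990 + (i % 30)} for i in range(100_000)]
    sample = build_sample(titles)
    print(estimate(titles, sample, lambda t: t["production_year"] > 2005))
```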

The estimates of the other systems are worse and seem to be based on per-attribute histograms, which do not work well for many predicates and cannot detect (anti-)correlations between attributes. Note that we obtained all estimates using the default settings after running the respective statistics gathering tool. Some commercial systems support the use of sampling for base table estimation, multi-attribute histograms ("column group statistics"), or ex post feedback from previous query runs [47]. However, these features are either not enabled by default or are not fully automatic.

3.2 Estimates for Joins

Let us now turn our attention to the estimation of intermediate results for joins, which are more challenging because sampling or histograms do not work well. Fig. 5 summarizes over 100,000 cardinality estimates in a single figure. For each intermediate result of our query set, we compute the factor by which the estimate differs from the true cardinality, distinguishing between over- and underestimation. The graph shows one "boxplot" (note the legend in the bottom-left corner) for each intermediate result size, which allows one to compare how the errors change as the number of joins increases. The vertical axis uses a logarithmic scale to encompass underestimates by a factor of 10⁸ and overestimates by a factor of 10⁴.

[Figure 5: one panel per system (PostgreSQL, DBMS A, DBMS B, DBMS C, HyPer); horizontal axis: number of joins (0-6); vertical axis: over-/underestimation factor on a log scale; boxplots show the 5th, 25th, 50th, 75th, and 95th percentiles.]
Fig. 5 Quality of cardinality estimates for multi-join queries in comparison with the true cardinalities. Each boxplot summarizes the error distribution of all subexpressions with a particular size (over all queries in the workload).

Despite the better base table estimates of DBMS A, the overall variance of the join estimation errors, as indicated by the boxplot, is similar for all systems with the exception of DBMS B. For all systems we routinely observe misestimates by a factor of 1000 or more. Furthermore, as witnessed by the increasing height of the box plots, the errors grow exponentially (note the logarithmic scale) as the number of joins increases [22]. For PostgreSQL 16% of the estimates for 1 join are wrong by a factor of 10 or more. This percentage increases to 32% with 2 joins, and to 52% with 3 joins. For DBMS A, which has the best estimator of the systems we compared, the corresponding percentages are only marginally better at 15%, 25%, and 36%.

Another striking observation is that all tested systems—though DBMS A to a lesser degree—tend to systematically underestimate the result sizes of queries with multiple joins. This can be deduced from the median of the error distributions in Fig. 5. For our query set, it is indeed the case that the intermediate results tend to decrease with an increasing number of joins because more base table selections get applied. However, the true decrease is less than the independence assumption used by PostgreSQL (and apparently by the other systems) predicts. Underestimation is most pronounced with DBMS B, which frequently estimates 1 for queries with more than 2 joins.

The estimates of DBMS A, on the other hand, have medians that are much closer to the truth, despite their variance being similar to some of the other systems. We speculate that DBMS A uses a damping factor that depends on the join size, similar to how many optimizers combine multiple selectivities. Many estimators combine the selectivities of multiple predicates (e.g., for a base relation or for a subexpression with multiple joins) not by assuming full independence, but by adjusting the selectivities "upwards", using a damping factor. The motivation for this stems from the fact that the more predicates need to be applied, the less certain one should be about their independence.

Given the simplicity of PostgreSQL's join estimation formula (cf. Section 2.3) and the fact that its estimates are nevertheless competitive with the commercial systems, we can deduce that the current join size estimators are based on the independence assumption. No system tested was able to detect join-crossing correlations.

Note that this section does not benchmark the query optimizers of the different systems. In particular, our results do not imply that DBMS B's optimizer or the resulting query performance is necessarily worse than that of other systems, despite larger errors in the estimator. The query runtime heavily depends on how the system's optimizer uses the estimates and how much trust it puts into these numbers. A sophisticated engine may employ adaptive operators (e.g., [5,9]) and thus mitigate the impact of misestimations, while another engine might have very complex access paths or join methods that require more accurate estimates. The results do, however, demonstrate that the state-of-the-art in cardinality estimation is far from perfect and its brittleness is further illustrated by the following anecdote: In PostgreSQL, we observed different cardinality estimates of the same simple 2-join query depending on the syntactic order of the relations in the from and/or the join predicates in the where clauses! Simply by swapping predicates or relations, we observed estimates of 3, 9, 128, or 310 rows for the same query (with a true cardinality of 2600)⁷.

⁷ The reasons for this surprising behavior are two implementation artifacts: First, estimates that are less than 1 are rounded up to 1, making subexpression estimates sensitive to the (usually arbitrary) join enumeration order, which is affected by the from clause. The second is a consistency problem caused by incorrect domain sizes of predicate attributes in joins with multiple predicates.

3.3 Estimates for TPC-H

We have stated earlier that cardinality estimation in TPC-H is a rather trivial task. Fig. 6 substantiates that claim by showing the distributions of PostgreSQL estimation errors for 3 of the larger TPC-H queries and 6 of our JOB queries. Note that in the figure we report estimation errors for individual queries (not for all queries like in Fig. 5). Clearly, the TPC-H query workload does not present many hard challenges for cardinality estimators. In contrast, our workload contains queries that routinely lead to severe overestimation and underestimation errors, and hence can be considered a challenging benchmark for cardinality estimation.

3.4 Better Statistics for PostgreSQL

As mentioned in Section 2.3, the most important statistic for join estimation in PostgreSQL is the number of distinct values. These statistics are estimated from a fixed-sized sample, and we have observed severe underestimates for large tables.

[Figure 6: one panel per query (JOB 6a, 13a, 13d, 16d, 17b, 25c; TPC-H 5, 8, 10); horizontal axis: number of joins (0-6); vertical axis: over-/underestimation factor on a log scale.]
Fig. 6 PostgreSQL cardinality estimates for 6 JOB queries and 3 TPC-H queries.

[Figure 7: PostgreSQL estimation errors by number of joins (0-6), comparing the default distinct count estimates with the true distinct counts; vertical axis: underestimation factor on a log scale.]
Fig. 7 PostgreSQL cardinality estimates based on the default distinct count estimates, and the true distinct counts.

To determine if the misestimated distinct counts are the underlying problem for cardinality estimation, we computed these values precisely and replaced the estimated with the true values.

Fig. 7 shows that the true distinct counts slightly improve the variance of the errors. Surprisingly, however, the trend to underestimate cardinalities becomes even more pronounced. The reason is that the original, underestimated distinct counts resulted in higher estimates, which, accidentally, are closer to the truth. This is an example for the proverbial "two wrongs that make a right", i.e., two errors that (partially) cancel each other out. Such behavior makes analyzing and fixing query optimizer problems very frustrating because fixing one query might break another.
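For readers who want to reproduce this kind of experiment on an unmodified PostgreSQL, a standard mechanism for pinning a column's distinct count is the per-column n_distinct storage option; this is not necessarily how the values were injected in our setup, but it has the same effect of overriding the sampled estimate. A hedged sketch:

```python
# Sketch: pin a column's distinct-count statistic in stock PostgreSQL using
# ALTER TABLE ... SET (n_distinct = ...). The override takes effect at ANALYZE.
import psycopg2

def set_true_distinct(conn, table, column):
    with conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(DISTINCT {column}) FROM {table}")
        distinct = cur.fetchone()[0]
        cur.execute(
            f"ALTER TABLE {table} ALTER COLUMN {column} "
            f"SET (n_distinct = {distinct})"
        )
        cur.execute(f"ANALYZE {table}")
    return distinct

if __name__ == "__main__":
    with psycopg2.connect(dbname="imdb") as conn:
        conn.autocommit = True
        print(set_true_distinct(conn, "movie_info", "movie_id"))
```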

4 When Do Bad Cardinality Estimates Lead to Slow Queries?

While the large estimation errors shown in the previous section are certainly sobering, large errors do not necessarily lead to slow query plans. For example, the misestimated expression may be cheap in comparison with other parts of the query, or the relevant plan alternative may have been misestimated by a similar factor thus "canceling out" the original error. In this section we investigate the conditions under which bad cardinalities are likely to cause slow queries.

One important observation is that query optimization is closely intertwined with the physical database design: the type and number of indexes heavily influence the plan search space, and therefore affect how sensitive the system is to cardinality misestimates. We therefore start this section with experiments using a relatively robust physical design with only primary key indexes and show that in such a setup the impact of cardinality misestimates can largely be mitigated. After that, we demonstrate that for more complex configurations with many indexes, cardinality misestimation makes it much more likely to miss the optimal plan by a large margin.

4.1 The Risk of Relying on Estimates

To measure the impact of cardinality misestimation on query performance we injected the estimates of the different systems into PostgreSQL and then executed the resulting plans. Using the same query engine allows one to compare the cardinality estimation components in isolation by (largely) abstracting away from the different query execution engines. Additionally, we inject the true cardinalities, which computes the—with respect to the cost model—optimal plan. We group the runtimes based on their slowdown w.r.t. the optimal plan, and report the distribution in the following table, where each column corresponds to a group of slowdown factors (the group [2,10), for example, contains all queries where the slowdown is between 2 and 10):

               <0.9   [0.9,1.1)   [1.1,2)   [2,10)   [10,100)   >100
  PostgreSQL   1.8%     38%         25%       25%      5.3%      5.3%
  DBMS A       2.7%     54%         21%       14%      0.9%      7.1%
  DBMS B       0.9%     35%         18%       15%      7.1%       25%
  DBMS C       1.8%     38%         35%       13%      7.1%      5.3%
  HyPer        2.7%     37%         27%       19%      8.0%      6.2%

[Figure 8: three panels — (a) default, (b) + no nested-loop join, (c) + rehashing — showing the fraction of queries per slowdown group ([0.3,0.9) through >100) relative to the plan obtained with true cardinalities.]
Fig. 8 Slowdown of queries using PostgreSQL estimates w.r.t. using true cardinalities. Only primary key indexes are enabled and we modified the query engine: (a) shows the performance of PostgreSQL 9.4, (b) disables nested-loop joins that do not use indexes, and (c) additionally enables dynamic rehashing of the hash tables that are used in hash joins.

A small number of queries become slightly slower using the true instead of the erroneous cardinalities. This effect is caused by cost model errors, which we discuss in Section 5. However, as expected, the vast majority of the queries are slower when estimates are used. Using DBMS A's estimates, 78% of the queries are less than 2× slower than using the true cardinalities, while for DBMS B this is the case for only 53% of the queries. This corroborates the findings about the relative quality of cardinality estimates in the previous section. Unfortunately, all estimators occasionally lead to plans that take an unreasonable time and lead to a timeout. Surprisingly, however, many of the observed slowdowns are easily avoidable despite the bad estimates as we show in the following.

When looking at the queries that did not finish in a reasonable time using the estimates, we found that most have one thing in common: PostgreSQL's optimizer decides to introduce a nested-loop join (without an index lookup) because of a very low cardinality estimate, whereas in reality the true cardinality is larger. As we saw in the previous section, systematic underestimation happens very frequently, which occasionally results in the introduction of nested-loop joins.

The underlying reason why PostgreSQL chooses nested-loop joins is that it picks the join algorithm on a purely cost-based basis. For example, if the cost estimate is 1,000,000 with the nested-loop join algorithm and 1,000,001 with a hash join, PostgreSQL will always prefer the nested-loop algorithm even if there is an equality join predicate, which allows one to use hashing. Of course, given the O(n²) complexity of nested-loop join and O(n) complexity of hash join, and given the fact that underestimates are quite frequent, this decision is extremely risky. And even if the estimates happen to be correct, any potential performance advantage of a nested-loop join in comparison with a hash join is very small, so taking this high risk can only result in a very small payoff.

Therefore, we disabled nested-loop joins (but not index-nested-loop joins) in all following experiments. As Fig. 8b shows, when rerunning all queries without these risky nested-loop joins, we observed no more timeouts despite using PostgreSQL's estimates.
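For reference, the sketch below shows the closest approximation available in stock PostgreSQL. Note that the built-in enable_nestloop switch penalizes all nested-loop joins, including index-nested-loop joins; the selective behavior described here (keeping index-nested-loop joins) required a change to the planner itself, so this snippet should be read as an approximation rather than our exact setup.

```python
# Sketch: discourage nested-loop joins via the standard planner switch.
# enable_nestloop = off affects *all* nested-loop joins in stock PostgreSQL,
# which is coarser than the source-level change used in our experiments.
import psycopg2

def discourage_nested_loops(conn):
    with conn.cursor() as cur:
        cur.execute("SET enable_nestloop = off")
        # Hash and merge joins remain available:
        cur.execute("SHOW enable_hashjoin")
        print("hash joins enabled:", cur.fetchone()[0])

if __name__ == "__main__":
    with psycopg2.connect(dbname="imdb") as conn:
        discourage_nested_loops(conn)
```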
Also, none of the queries performed slower than before despite having fewer join algorithm options, confirming our hypothesis that nested-loop joins (without indexes) seldom have any upside. However, this change does not solve all problems, as there are still a number of queries that are more than a factor of 10 slower (cf., red bars) in comparison with the true cardinalities.

When investigating the reason why the remaining queries still did not perform as well as they could, we found that most of them contain a hash join where the size of the build input is underestimated. PostgreSQL up to and including version 9.4 chooses the size of the in-memory hash table based on the cardinality estimate. Underestimates can lead to undersized hash tables with very long collision chains and therefore bad performance. The upcoming version 9.5 resizes the hash table at runtime based on the number of rows actually stored in the hash table. We backported this patch to our code base, which is based on 9.4, and enabled it for all remaining experiments. Fig. 8c shows the effect of this change in addition to the disabled nested-loop joins. Less than 4% of the queries are off by more than 2× in comparison with the true cardinalities.

To summarize, being "purely cost-based", i.e., not taking into account the inherent uncertainty of cardinality estimates and the asymptotic complexities of different algorithm choices, can lead to very bad query plans. Algorithms that seldom offer a large benefit over more robust algorithms should not be chosen. Furthermore, query processing algorithms should, if possible, automatically determine their parameters at runtime instead of relying on cardinality estimates.

4.2 Good Plans Despite Bad Cardinalities

The query runtimes of plans with different join orders often vary by many orders of magnitude (cf. Section 6.1). Nevertheless, when the database has only primary key indexes, as in all experiments so far, and once nested-loop joins have been disabled and rehashing has been enabled, the performance of most queries is close to the one obtained using the true cardinalities. Given the bad quality of the cardinality estimates, we consider this to be a surprisingly positive result. It is worthwhile to reflect on why this is the case.

The main reason is that without foreign key indexes, most large ("fact") tables need to be scanned using full table scans, which dampens the effect of different join orders. The join order still matters, but the results indicate that the cardinality estimates are usually good enough to rule out all disastrous join order decisions like joining two large tables using an unselective join predicate. Another important reason is that in main memory picking an index-nested-loop join where a hash join would have been faster is never disastrous. With all data and indexes fully cached, we measured that the performance advantage of a hash join over an index-nested-loop join is at most 5× with PostgreSQL and 2× with HyPer. Obviously, when the index must be read from disk, random IO may result in a much larger factor. Therefore, the main-memory setting is much more forgiving.

4.3 Complex Access Paths

So far, all query executions were performed on a database with indexes on primary key attributes only. To see if the query optimization problem becomes harder when there are more indexes, we additionally indexed all foreign key attributes. Fig. 9b shows the effect of additional foreign key indexes. We see large performance differences with 40% of the queries being slower by a factor of 2! Note that these results do not mean that adding more indexes decreases performance (although this can occasionally happen). Indeed overall performance generally increases significantly, but the more indexes are available the harder the job of the query optimizer becomes.

[Figure 9: two panels — (a) PK indexes, (b) PK + FK indexes — showing the fraction of queries per slowdown group ([0.3,0.9) through >100).]
Fig. 9 Slowdown of queries using PostgreSQL estimates w.r.t. using true cardinalities. Nested-loop joins that do not use indexes are disabled and rehashing is enabled, i.e., plot (a) is the same as Fig. 8(c). Adding foreign key indexes makes finding optimal plans much more difficult.

4.4 Join-Crossing Correlations

There is consensus in our community that estimation of intermediate result cardinalities in the presence of correlated query predicates is a frontier in query optimization research. Many industrial query optimizers keep only individual column statistics (e.g., histograms) and use the independence assumption for combining predicates on multiple columns. There has been previous work in detecting correlations between value distributions of different columns in the same table, for which then multi-column histograms or samples can be kept, e.g. [20]. The JOB workload studied in this paper consists of real-world data and its queries contain many correlated predicates. Our experiments that focus on single-table subquery cardinality estimation quality (cf. Table 2) show that systems that keep table samples (HyPer and presumably DBMS A) can achieve almost perfect estimation results, even for correlated predicates (inside the same table). As such, the cardinality estimation research challenge appears to lie in queries where the correlated predicates involve columns from different tables, connected by joins. These we call "join-crossing correlations". Such correlations frequently occur in the IMDB data set, e.g., actors born in Paris are likely to play in French movies.

Given these join-crossing correlations one could wonder if there exist complex access paths that allow one to exploit these. One example relevant here despite its original setting in XQuery processing is ROX [23]. It studied runtime join order query optimization in the context of DBLP co-authorship queries that count how many Authors had published Papers in three particular venues, out of many. These queries joining the author sets from different venues clearly have join-crossing correlations, since authors who publish in VLDB are typically database researchers, likely to also publish in SIGMOD, but not—say—in Nature.

In the DBLP case, Authorship is an n : m relationship that links the relation Authors with the relation Papers. The optimal query plans in [23] used an index-nested-loop join, looking up each author into Authorship.author (the leading column of the indexed primary key of Authorship) followed by a filter restriction on Paper.venue, which needs to be looked up with yet another join. This filter on venue would normally have to be calculated after these two joins. However, the physical design of [23] stored Authorship⁸ partitioned by Paper.venue. This partitioning has startling effects: instead of one Authorship table and primary key index, one physically has many, one for each venue partition. This means that by accessing the right partition, the filter is implicitly enforced (for free), before the join happens. This specific physical design therefore causes the optimal plan to be as follows: first join the smallish authorship set from SIGMOD with the large set for Nature producing almost no result tuples, making the subsequent nested-loops index lookup join into VLDB very cheap. If the tables would not have been partitioned, index lookups from all SIGMOD authors into Authorships would first find all co-authored papers, of which the great majority is irrelevant because they are from database venues, and were not published in Nature. Without this partitioning, there is no way to avoid this large intermediate result, and there is no query plan that comes close to the partitioned case in efficiency: even if cardinality estimation would be able to predict join-crossing correlations, there would be no physical way to profit from this knowledge.

⁸ In fact, rather than relational table partitioning, there was a separate XML document per venue, e.g., separate documents for SIGMOD, VLDB, Nature and a few thousand more venues. Storage in a separate XML document has roughly the same effect on access paths as partitioned tables.

The lesson to draw from this example is that the effects of query optimization are always gated by the available options in terms of access paths. This is similar to our experiments with and without indexes on foreign keys, where the latter, richer, scenario is more challenging for optimizers. Having a partitioned index on a join-crossing correlated predicate as in [23] is a non-obvious physical design alternative which even modifies the schema by bringing in a join-crossing column (Paper.venue) as partitioning key of a table (Authorship). We did not try to apply such optimizations in our IMDB experiments, because a physical design similar to [23] would help only a minority of the join-crossing correlations in our 113 queries, and this type of indexing is by no means common practice. The partitioned DBLP set-up is just one example of how one particular join-crossing correlation can be handled, rather than a generic solution. Join-crossing correlations remain an open frontier for database research involving the interplay of physical design, query execution and query optimization. In our JOB experiments we do not attempt to chart this mostly unknown space, but rather characterize the impact of (join-crossing) correlations on the current state-of-the-art of query processing, restricting ourselves to standard PK and FK indexing.

5 Cost Models

The cost model guides the selection of plans from the search space. The cost models of contemporary systems are sophisticated software artifacts that result from 30+ years of research and development, mostly concentrated in the area of traditional disk-based systems. PostgreSQL's cost model, for instance, is comprised of over 4000 lines of C code, and takes into account various subtle considerations, e.g., it accounts for partially correlated index accesses, interesting orders, tuple sizes, etc. It is interesting, therefore, to evaluate how much a complex cost model actually contributes to the overall query performance.

First, we will experimentally establish the correlation between the PostgreSQL cost model—a typical cost model of a disk-based DBMS—and the query runtime. Then, we will compare the PostgreSQL cost model with two other cost functions. The first cost model is a tuned version of PostgreSQL's model for a main-memory setup where all data fits into RAM. The second cost model is an extremely simple function that only takes the number of tuples produced during query evaluation into account. We show that, unsurprisingly, the difference between the cost models is dwarfed by the cardinality estimation errors. We conduct our experiments on a database instance with foreign key indexes. We begin with a brief description of a typical disk-oriented complex cost model, namely the one of PostgreSQL.

5.1 The PostgreSQL Cost Model

PostgreSQL's disk-oriented cost model combines CPU and I/O costs with certain weights. Specifically, the cost of an operator is defined as a weighted sum of the number of accessed disk pages (both sequential and random) and the amount of data processed in memory. The cost of a query plan is then the sum of the costs of all operators. The default values of the weight parameters used in the sum (cost variables) are set by the optimizer designers and are meant to reflect the relative difference between random access, sequential access and CPU costs.

The PostgreSQL documentation contains the following note on cost variables: "Unfortunately, there is no well-defined method for determining ideal values for the cost variables. They are best treated as averages over the entire mix of queries that a particular installation will receive. This means that changing them on the basis of just a few experiments is very risky." For a database administrator, who needs to actually set these parameters, these suggestions are not very helpful; no doubt most will not change these parameters. This comment is, of course, not PostgreSQL-specific, since other systems feature similarly complex cost models. In general, tuning and calibrating cost models (based on sampling, various machine learning techniques etc.) has been a subject of a number of papers (e.g., [51,32]). It is important, therefore, to investigate the impact of the cost model on the overall query engine performance. This will indirectly show the contribution of cost model errors to query performance.

5.2 Cost and Runtime

The main virtue of a cost function is its ability to predict which of the alternative query plans will be the fastest, given the cardinality estimates; in other words, what counts is its correlation with the query runtime. The correlation between the cost and the runtime of queries in PostgreSQL is shown in Fig. 10a.

[Figure 10: six panels — (a,b) standard cost model, (c,d) tuned cost model, (e,f) simple cost model — plotting runtime [ms] against cost (both on log scales), using PostgreSQL estimates (left column) and true cardinalities (right column).]
Fig. 10 Predicted cost vs. runtime for different cost models. Cardinality estimation has a much larger effect on query performance than the cost model and even a simple cost model seems sufficient for in-memory workloads.

Additionally, we consider the case where the engine has the true cardinalities injected, and plot the corresponding data points in Fig. 10b. For both plots, we fit the linear regression model (displayed as a straight line) and highlight the standard error. The predicted cost of a query correlates with its runtime in both scenarios. Poor cardinality estimates, however, lead to a large number of outliers and a very wide standard error area in Fig. 10a. Only using the true cardinalities makes the PostgreSQL cost model a reliable predictor of the runtime, as has been observed previously [51].

Intuitively, a straight line in Fig. 10 corresponds to an ideal cost model that always assigns (predicts) higher costs for more expensive queries. Naturally, any monotonically increasing function would satisfy that requirement, but the linear model provides the simplest and the closest fit to the observed data. We can therefore interpret the deviation from this line as the prediction error of the cost model. Specifically, we consider the absolute percentage error of a cost model for a query Q:

    ε(Q) = |T_real(Q) − T_pred(Q)| / T_real(Q),

where T_real is the observed runtime, and T_pred is the runtime predicted by our linear model. Using the default cost model of PostgreSQL and the true cardinalities, the median error of the cost model is 38%.
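The prediction-error computation can be reproduced with a few lines of Python (our sketch; it assumes paired cost/runtime measurements are available and uses an ordinary least-squares line as the predictor).

```python
# Sketch: fit a linear model runtime ~ cost and report the median absolute
# percentage error, as defined for epsilon(Q) above.
import numpy as np

def median_prediction_error(costs, runtimes_ms):
    costs = np.asarray(costs, dtype=float)
    t_real = np.asarray(runtimes_ms, dtype=float)
    slope, intercept = np.polyfit(costs, t_real, deg=1)  # least-squares line
    t_pred = slope * costs + intercept
    errors = np.abs(t_real - t_pred) / t_real
    return float(np.median(errors))

if __name__ == "__main__":
    costs = [1e3, 5e3, 2e4, 1e5, 7e5]              # hypothetical plan costs
    runtimes = [12.0, 40.0, 180.0, 900.0, 6100.0]  # hypothetical runtimes [ms]
    print(median_prediction_error(costs, runtimes))
```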

5.3 Tuning the Cost Model for Main Memory

As mentioned above, a cost model typically involves parameters that are subject to tuning by the database administrator. In a disk-based system such as PostgreSQL, these parameters can be grouped into CPU cost parameters and I/O cost parameters, with the default settings reflecting an expected proportion between these two classes in a hypothetical workload.

In many settings the default values are sub-optimal. For example, the default parameter values in PostgreSQL suggest that processing a tuple is 400x cheaper than reading it from a page. However, modern servers are frequently equipped with very large RAM capacities, and in many workloads the data set actually fits entirely into available memory (admittedly, the core of PostgreSQL was shaped decades ago when database servers only had a few megabytes of RAM). This does not eliminate the page access costs entirely (due to buffer manager overhead), but significantly bridges the gap between the I/O and CPU processing costs.

Arguably, the most important change that needs to be done in the cost model for a main-memory workload is to decrease the proportion between these two groups. We have done so by multiplying the CPU cost parameters by a factor of 50⁹. The results of the workload run with improved parameters are plotted in the two middle subfigures of Fig. 10. Comparing Fig. 10b with d, we see that tuning does indeed improve the correlation between the cost and the runtime. On the other hand, as is evident from comparing Fig. 10c and d, parameter tuning improvement is still overshadowed by the difference between the estimated and the true cardinalities. Note that Fig. 10c features a set of outliers for which the optimizer has accidentally discovered very good plans (runtimes around 1 ms) without realizing it (hence very high costs). This is another sign of "oscillation" in query planning caused by cardinality misestimates.

In addition, we measure the prediction error ε of the tuned cost model, as defined in Section 5.2. We observe that tuning improves the predictive power of the cost model: the median error decreases from 38% to 30%.

⁹ We did not run extensive experiments to find the "best" parameter and do not claim that 50 is the best setting. Our goal is to measure the effect of adjusting cost parameters into a significantly more accurate direction to determine how important it is to tune the cost model.

5.4 Are Complex Cost Models Necessary?

As discussed above, the PostgreSQL cost model is quite complex. Presumably, this complexity should reflect various factors influencing query execution, such as the speed of a disk seek and read, CPU processing costs, etc. In order to find out whether this complexity is actually necessary in a main-memory setting, we will contrast it with a very simple cost function C_mm. This cost function is tailored for the main-memory setting in that it does not model I/O costs, but only counts the number of tuples that pass through each operator during query execution:

    C_mm(T) =
      τ · |R|                                          if T = R ∨ T = σ(R)
      |T| + |T1| + C_mm(T1) + C_mm(T2)                 if T = T1 ⋈HJ T2
      C_mm(T1) + λ · |T1| · max(|T1 ⋈ R| / |T1|, 1)    if T = T1 ⋈INL T2, (T2 = R ∨ T2 = σ(R))

In the formula above R is a base relation, and τ ≤ 1 is a parameter that discounts the cost of a table scan in comparison with joins. The cost function distinguishes between hash ⋈HJ and index-nested-loop ⋈INL joins: the latter scans T1 and performs index lookups into an index on R, thus avoiding a full table scan of R. A special case occurs when there is a selection on the right side of the index-nested-loop join, in which case we take into account the number of tuple lookups in the base table index and essentially discard the selection from the cost computation (hence the multiplier max(|T1 ⋈ R| / |T1|, 1)). For index-nested-loop joins we use the constant λ ≥ 1 to approximate by how much an index lookup is more expensive than a hash table lookup. Specifically, we set λ = 2 and τ = 0.2. As in our previous experiments, we disable nested-loop joins when the inner relation is not an index lookup.
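The cost function translates directly into code; the sketch below (ours) evaluates C_mm over a small plan-tree representation with λ = 2 and τ = 0.2 as in the text. The plan encoding and the example cardinalities are hypothetical and supplied by the caller.

```python
# Sketch: the simple main-memory cost function C_mm over a tiny plan algebra.
# Plan nodes carry their (estimated or true) output cardinality; 'scan' covers
# both a base relation R and a selection sigma(R).
TAU, LAMBDA = 0.2, 2.0

def c_mm(plan):
    kind = plan["op"]
    if kind == "scan":                      # T = R  or  T = sigma(R)
        return TAU * plan["base_card"]
    left, right = plan["left"], plan["right"]
    if kind == "hashjoin":                  # T = T1 HJ T2
        return plan["card"] + left["card"] + c_mm(left) + c_mm(right)
    if kind == "indexjoin":                 # T = T1 INL T2, T2 a (filtered) base table
        lookups = max(plan["join_card"] / left["card"], 1.0)
        return c_mm(left) + LAMBDA * left["card"] * lookups
    raise ValueError(kind)

if __name__ == "__main__":
    t = {"op": "scan", "base_card": 2_528_312, "card": 100_000}     # sigma(title)
    mc = {"op": "scan", "base_card": 2_609_129, "card": 2_609_129}  # movie_companies
    join = {"op": "indexjoin", "left": t, "right": mc,
            "card": 250_000, "join_card": 250_000}                  # |T1 JOIN R|
    print(c_mm(join))
```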

The results of our workload run with C_mm as a cost function are depicted in Fig. 10e and f. We see that even our trivial cost model is able to fairly accurately predict the query runtime using the true cardinalities. To quantify this argument, we measure the improvement in the runtime achieved by changing the cost model for true cardinalities: In terms of the geometric mean over all queries, our tuned cost model yields 41% faster runtimes than the standard PostgreSQL model, but even a simple C_mm makes queries 34% faster than the built-in cost function. This improvement is not insignificant, but on the other hand, it is dwarfed by the improvement in query runtime observed when we replace estimated cardinalities with the real ones (cf. Fig. 8b). This allows us to reiterate our main message that cardinality estimation is much more crucial than the cost model.

6 Plan Space

Besides cardinality estimation and the cost model, the final important query optimization component is a plan enumeration algorithm that explores the space of semantically equivalent join orders. Many different algorithms, both exhaustive (e.g., [36,13]) as well as heuristic (e.g., [46,39]) have been proposed. These algorithms consider a different number of candidate solutions (that constitute the search space) when picking the best plan. In this section we investigate how large the search space needs to be in order to find a good plan.

The experiments of this section use a standalone query optimizer, which implements Dynamic Programming (DP) and a number of heuristic join enumeration algorithms. Our optimizer allows the injection of arbitrary cardinality estimates. In order to fully explore the search space, we do not actually execute the query plans produced by the optimizer in this section, as that would be infeasible due to the number of joins our queries have. Instead, we first run the query optimizer using the estimates as input. Then, we recompute the cost of the resulting plan with the true cardinalities, giving us a very good approximation of the runtime the plan would have in reality. We use the in-memory cost model from Section 5.4 and assume that it perfectly predicts the query runtime, which, for our purposes, is a reasonable assumption since the errors of the cost model are negligible in comparison to the cardinality errors. This approach allows us to compare a large number of plans without executing all of them.

Note that due to a data handling mistake, the numbers reported in Section 7 of the conference version of this paper [28] differ from the ones reported in this section. However, the qualitative conclusions are largely unaffected.

6.1 How Important Is the Join Order?

We use the Quickpick [49] algorithm to visualize the costs of different join orders. Quickpick is a simple, randomized algorithm that picks join edges at random until all joined relations are fully connected. Each run produces a correct, but usually slow, query plan. By running the algorithm 10,000 times per query and computing the costs of the resulting plans, we obtain an approximate distribution for the costs of random plans [49].
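The algorithm is small enough to sketch in full (our implementation, not the one from [49]; the join graph is given as a list of edges between relation names, it is assumed to be connected, and the cost of a resulting join order is delegated to a caller-supplied function such as C_mm above).

```python
# Sketch of Quickpick: repeatedly pick a random join edge; an edge is applied
# when it connects two previously unconnected partial plans (union-find).
# Running this many times yields a sample of random-but-valid join orders.
import random

def quickpick(relations, edges, cost_fn, runs=10_000, seed=1):
    rng = random.Random(seed)
    costs = []
    for _ in range(runs):
        parent = {r: r for r in relations}

        def find(r):
            while parent[r] != r:
                parent[r] = parent[parent[r]]  # path halving
                r = parent[r]
            return r

        order, remaining = [], list(edges)
        while len(order) < len(relations) - 1:
            edge = remaining.pop(rng.randrange(len(remaining)))
            a, b = find(edge[0]), find(edge[1])
            if a != b:                         # joins two disconnected components
                parent[a] = b
                order.append(edge)
        costs.append(cost_fn(order))
    return costs

if __name__ == "__main__":
    rels = ["cn", "mc", "t", "mi"]
    edges = [("cn", "mc"), ("mc", "t"), ("t", "mi"), ("mc", "mi")]
    weights = {("cn", "mc"): 3, ("mc", "t"): 5, ("t", "mi"): 2, ("mc", "mi"): 50}
    cost = lambda order: sum((i + 1) * weights[e] for i, e in enumerate(order))
    costs = quickpick(rels, edges, cost, runs=1000)
    print(min(costs), max(costs))
```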

The results of our workload run with Cmm as the cost function are depicted in Fig. 10e and f. We see that even our trivial cost model is able to fairly accurately predict the query runtime using the true cardinalities. To quantify this argument, we measure the improvement in runtime achieved by changing the cost model, using the true cardinalities: in terms of the geometric mean over all queries, our tuned cost model yields 41% faster runtimes than the standard PostgreSQL model, but even the simple Cmm makes queries 34% faster than the built-in cost function. This improvement is not insignificant, but on the other hand, it is dwarfed by the improvement in query runtime observed when we replace estimated cardinalities with the real ones (cf. Fig. 8b). This allows us to reiterate our main message that cardinality estimation is much more crucial than the cost model.

6 Plan Space

Besides cardinality estimation and the cost model, the final important query optimization component is a plan enumeration algorithm that explores the space of semantically equivalent join orders. Many different algorithms, both exhaustive (e.g., [36,13]) and heuristic (e.g., [46,39]), have been proposed. These algorithms consider a different number of candidate solutions (that constitute the search space) when picking the best plan. In this section we investigate how large the search space needs to be in order to find a good plan.

The experiments of this section use a standalone query optimizer, which implements Dynamic Programming (DP) and a number of heuristic join enumeration algorithms. Our optimizer allows the injection of arbitrary cardinality estimates. In order to fully explore the search space, we do not actually execute the query plans produced by the optimizer in this section, as that would be infeasible due to the number of joins our queries have. Instead, we first run the query optimizer using the estimates as input. Then, we recompute the cost of the resulting plan with the true cardinalities, giving us a very good approximation of the runtime the plan would have in reality.

We use the in-memory cost model from Section 5.4 and assume that it perfectly predicts the query runtime, which, for our purposes, is a reasonable assumption since the errors of the cost model are negligible in comparison to the cardinality errors. This approach allows us to compare a large number of plans without executing all of them.

Note that due to a data handling mistake, the numbers reported in Section 7 of the conference version of this paper [28] differ from the ones reported in this section. However, the qualitative conclusions are largely unaffected.

6.1 How Important Is the Join Order?

We use the Quickpick [49] algorithm to visualize the costs of different join orders. Quickpick is a simple, randomized algorithm that picks join edges at random until all joined relations are fully connected. Each run produces a correct, but usually slow, query plan. By running the algorithm 10,000 times per query and computing the costs of the resulting plans, we obtain an approximate distribution for the costs of random plans [49]. Fig. 11 shows density plots for 6 representative example queries and for three physical database designs: no indexes, primary key indexes only, and primary+foreign key indexes. The costs are normalized by the optimal plan (with foreign key indexes), which we obtained using dynamic programming with the true cardinalities.
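The core of the Quickpick idea can be sketched in a few lines. The version below is our own illustration (the relation and edge representations are assumptions), not the original implementation from [49]:

```python
import random

def quickpick_plan(relations, join_edges):
    """Build one random but valid join tree: pick join edges at random
    until all relations are connected.  `join_edges` is a list of
    (rel_a, rel_b) pairs from the query's join graph; the plan is
    returned as a nested tuple of relation names."""
    parent = {r: r for r in relations}   # union-find over join fragments
    tree = {r: r for r in relations}     # fragment root -> partial join tree

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    edges = list(join_edges)
    random.shuffle(edges)
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:                      # edge connects two fragments: join them
            tree[ra] = (tree[ra], tree[rb])
            parent[rb] = ra
    root = find(next(iter(relations)))
    return tree[root]

# Sampling the plan-cost distribution for one query (cost() could, for
# instance, be the C_mm sketch above applied to the resulting tree):
#   costs = [cost(quickpick_plan(rels, edges)) for _ in range(10000)]
```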

Fig. 11 Cost distributions for 6 representative queries (JOB 6a, 13a, 13d, 16d, 17b, 25c) and different index configurations (no indexes, PK indexes, PK + FK indexes). The horizontal axis shows the cost relative to the optimal FK plan [log scale]; the vertical green lines represent the cost of the optimal plan.

The graphs, which use a logarithmic scale on the horizontal cost axis, clearly illustrate the importance of the join ordering problem: the slowest or even the median cost is generally multiple orders of magnitude more expensive than the cheapest plan. The shapes of the distributions are quite diverse. For some queries there are many good plans (e.g., 25c), for others few (e.g., 16d). The distributions are sometimes wide (e.g., 16d) and sometimes narrow (e.g., 25c). The plots for the "no indexes" and the "PK indexes" configurations are very similar, implying that for our workload primary key indexes alone do not improve performance very much, since we do not have selections on primary key columns. In many cases the "PK+FK indexes" distributions have additional small peaks on the left side of the plot, which means that the optimal plan in this index configuration is much faster than in the other configurations.

Indexes only on primary keys and indexes on foreign as well as primary keys are two extremes of the spectrum. In reality, one would typically have only some foreign key indexes. We therefore generated a random permutation of all foreign key indexes and enabled them one by one. The effect on plan quality, as measured by the geometric mean of plan cost, is shown in Fig. 12. Generally, the more indexes exist, the better performance is. However, adding an index (e.g., the 9th) may even cause average performance to decrease (slightly). The curve is not smooth but has a number of large "jumps" due to the fact that certain foreign key indexes are particularly important for query performance. One final important point is that the relative performance difference between plans optimized with the PostgreSQL estimates and with the true cardinalities increases dramatically from 13% to 2.2× as indexes are added, because having more indexes makes finding the optimal plan harder.

Fig. 12 Geometric mean of plan cost with varying index configurations (number of foreign key indexes enabled, from "only PK" to "PK+FK"), for plans optimized with PostgreSQL estimates and with true cardinalities.

6.2 Are Bushy Trees Necessary?

Most join ordering algorithms do not enumerate all possible tree shapes. Virtually all optimizers ignore join orders with cross products, which results in a dramatically reduced optimization time with only negligible query performance impact. Oracle goes even further by not considering bushy join trees [1], and System R only enumerated pipelined (right-deep) query plans. We define the left input of a join to be the build side of the hash join and the right input to be the probe side. In order to quantify the effect of restricting the search space on query performance, we modified our DP algorithm to only enumerate left-deep, right-deep, or zig-zag trees.

Aside from the obvious tree shape restriction, each of these classes implies constraints on the join method selection. We follow the definition of Garcia-Molina et al.'s textbook, which is the reverse of the one in Ramakrishnan and Gehrke's book: using hash joins, right-deep trees are executed by first creating hash tables out of each relation except one before probing all of these hash tables in a pipelined fashion, whereas in left-deep trees, a new hash table is built from the result of each join. In zig-zag trees, which are a super set of all left- and right-deep trees, each join operator must have at least one base relation as input. For index-nested-loop joins we additionally employ the following convention: the left child of a join is a source of tuples that are looked up in the index on the right child, which must be a base table.
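As an illustration of these shape classes, the following sketch (ours, not taken from the paper's optimizer) classifies a plan given as nested tuples, where a leaf is a relation name and an inner node is a pair (left, right):

```python
def is_leaf(t):
    return not isinstance(t, tuple)

def is_left_deep(t):
    """Every join's right input is a base relation."""
    return is_leaf(t) or (is_leaf(t[1]) and is_left_deep(t[0]))

def is_right_deep(t):
    """Every join's left input is a base relation."""
    return is_leaf(t) or (is_leaf(t[0]) and is_right_deep(t[1]))

def is_zigzag(t):
    """Every join has at least one base relation as input
    (a superset of left-deep and right-deep trees)."""
    if is_leaf(t):
        return True
    return ((is_leaf(t[0]) or is_leaf(t[1]))
            and is_zigzag(t[0]) and is_zigzag(t[1]))

# ((("t", "mc"), "ci"), "n")  -> left-deep (hence also zig-zag)
# (("t", "mc"), ("ci", "n"))  -> bushy: none of the three predicates hold
```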

Using the true cardinalities, we compute the cost of the optimal plan for each of the three restricted tree shapes. We divide these costs by the cost of the optimal tree (which may have any shape, including "bushy"), thereby measuring how much performance is lost by restricting the search space. The results in Table 3 show that zig-zag trees offer decent performance in most cases, with the worst case being 1.88× more expensive than the best bushy plan. Left-deep trees are worse than zig-zag trees, as expected, but still result in reasonable performance. Right-deep trees, on the other hand, perform much worse than the other tree shapes and thus should not be used exclusively. The bad performance of right-deep trees is caused by the large intermediate hash tables that need to be created from each base relation and by the fact that only the bottom-most join can be done via index lookup.

Table 3 Slowdown for restricted tree shapes in comparison to the optimal plan (true cardinalities).

                      PK indexes               PK + FK indexes
              median    95%     max       median    95%      max
  zig-zag       1.00    1.04    1.38        1.00    1.16     1.88
  left-deep     1.00    1.23    1.66        1.94    48.2     1252
  right-deep    1.12    1.63    1.69        6.46    140      6108

6.3 Are Heuristics Good Enough?

So far in this paper, we have used the dynamic programming algorithm, which computes the optimal join order. However, given the bad quality of the cardinality estimates, one may reasonably ask whether an exhaustive algorithm is even necessary. We therefore compare dynamic programming with a randomized approach and two greedy heuristics.

The "Quickpick-1000" heuristic is a randomized algorithm that chooses the cheapest (based on the estimated cardinalities) of 1000 random plans. Among all greedy heuristics, we pick Greedy Operator Ordering (GOO) since it was shown to be superior to other deterministic approximate algorithms [12]. GOO maintains a set of join trees, each of which initially consists of one base relation. The algorithm then calculates ranks for each pair of join trees and combines the pair with the lowest rank into a single join tree. In addition to the originally proposed rank function MinCard, which minimizes the sizes of intermediate results, we use the Cmm cost function for ranking pairs of join trees, thus making GOO aware of indexes.
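The greedy skeleton of GOO can be sketched as follows. This is our simplification, not the implementation used in the experiments: the rank function is left as a parameter (ranking by the size of the intermediate result corresponds to GOO/MinCard, ranking by Cmm to the index-aware GOO/MinCost), and for brevity the sketch ignores the join graph and would therefore also consider cross products.

```python
def goo(relations, rank):
    """Greedy Operator Ordering: repeatedly combine the pair of partial
    join trees with the lowest rank until a single tree remains.
    `rank(t1, t2)` returns, e.g., the estimated size of t1 join t2
    (MinCard) or the C_mm cost of the combined tree (MinCost)."""
    trees = list(relations)  # start with one tree per base relation
    while len(trees) > 1:
        # pick the pair of trees that is cheapest to combine
        i, j = min(
            ((i, j) for i in range(len(trees)) for j in range(i + 1, len(trees))),
            key=lambda pair: rank(trees[pair[0]], trees[pair[1]]),
        )
        combined = (trees[i], trees[j])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [combined]
    return trees[0]
```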
We also implemented the "BSizePP" heuristic proposed by Bruno et al. [6], which is based on Greedy Operator Ordering but takes index-nested-loop joins into account explicitly. BSizePP starts with the same set of trees as GOO and also picks two join trees to combine based on an index-aware rank function (again we use Cmm for ranking). Additional plans are explored by applying the following two transformations each time a pair of join trees has been selected:

1. Push each tree inside the other to correct mistakes that the greedy nature of the approach might have caused in previous steps.
2. Pull leaves out of the resulting tree to allow for index joins that might become cheaper now due to a decreased input cardinality.

The cheapest of the generated alternatives is selected as the join tree for the set of relations it contains. This process repeats until all trees are combined into a single remaining solution covering the complete set of relations.

Quickpick-1000 as well as the two deterministic heuristics can produce bushy plans, but obviously only explore parts of the search space. All algorithms in this experiment internally use the PostgreSQL cardinality estimates to compute a query plan, for which we then compute the "true" cost using the true cardinalities.

The results of optimizing the 113 JOB queries using the aforementioned algorithms are summarized in Table 4. Apart from the strong impact of cardinality misestimates, we identify the following factors influencing the quality of the plans generated by an algorithm:

– exploration depth: Fully examining the search space using DP yields better plans than using heuristics (especially if many indexes are available).
– index-awareness: Algorithms that take indexes into account during join ordering, such as BSizePP and GOO/MinCost, outperform GOO/MinCard, which ignores indexes and orders joins solely by reducing the size of intermediate results.
– exploration methodology: Additional indexes add a few very good plans to the search space, which are less likely to be discovered by a randomized approach like Quickpick than by systematic exploration as performed by GOO and BSizePP.

The push and pull transformations introduced by BSizePP have little effect compared to the index-aware GOO/MinCost.

To summarize, our results indicate that enumerating all bushy trees exhaustively offers moderate but not insignificant performance benefits in comparison to algorithms that enumerate only a subset of the search space. The performance potential from good cardinality estimates is certainly much larger. However, given the existence of exhaustive enumeration algorithms that can find the optimal solution for queries with dozens of relations very quickly (e.g., [36,13]), using heuristics or disabling bushy trees should only be necessary for queries with a very large number of joins.

Table 4 Comparison of exhaustive dynamic programming with Quickpick-1000 [49] (best of 1000 random plans), Greedy Operator Ordering [12], and BSizePP [6]. All costs are normalized by the optimal plan of that index configuration.

                           PK indexes                                      PK + FK indexes
                  PostgreSQL estimates   true cardinalities      PostgreSQL estimates   true cardinalities
                  median   95%    max    median   95%    max     median   95%    max    median   95%    max
Dynamic Programming  1.03   1.69   4.79    1.00    1.00   1.00     1.42    15.3   35.3    1.00    1.00   1.00
Quickpick-1000       1.05   7.99   42.4    1.00    1.08   1.21     2.10    77.6   3303    1.08    9.82   26.2
GOO/MinCard          1.01   1.85   3.42    1.01    1.37   1.92     1.84    34.1   766     1.55    28.6   766
GOO/MinCost          1.13   1.84   2.36    1.06    1.43   1.97     2.04    20.3   136     1.21    4.69   21.0
BSizePP              1.16   2.40   3.41    1.04    1.39   1.93     2.03    21.1   126     1.15    4.66   21.0

7 Join Ordering By Example

Up to now, when looking at plan quality, we mostly showed aggregated statistics. In order to get a better understanding of the effect of cardinality misestimation on plan quality, this section looks at two particular queries in detail. We use PostgreSQL as the source of cardinality estimates and the two query variants 13d and 13a as examples. The queries compute the ratings and release dates of movies produced by US (respectively, German) production companies. The two variants have the same structure (cf. Fig. 3) and join graph (cf. Fig. 4). The only difference is the predicate on cn.country_code, i.e., the base table selection on the company_name table. (Since the two query variants only differ in a constant within a selection predicate, they could be executed using the same prepared statement. Not knowing all constants statically presents additional, important, and well-researched challenges; however, we do not consider prepared statements in this work and always send the full query text.) In the remainder of this section we call query 13a the German query and 13d the US query.

As Fig. 6 shows, both queries exhibit the typical trend of growing underestimates as the number of joins increases. The base table predicate estimates, including the two different cn.country_code selections, are very close to the true cardinalities. PostgreSQL correctly estimates that the company_name table contains more than 8 times as many US companies as German companies. As a result, all intermediate results containing the company_name table differ by a similar factor in the two query variants. Interestingly, however, the PostgreSQL estimates lead to the same plan, which is shown in Fig. 13, for both variants.

Fig. 13 Plan for both queries using PostgreSQL's estimates. This plan is fairly good for the US query, but not for the German query. The numbers in parentheses represent the estimated sizes of the respective intermediate results.

To find out how good this plan is, we re-optimized both queries using the true cardinalities and obtained two different plans (Fig. 14 and 15, respectively). Comparing those three plans, we find that the PostgreSQL plan is quite good for the US query, as its true cost is only 35% higher than the cost of the optimal plan. Querying for the German movies using the same plan, however, is 3 times as expensive as the optimal plan.

Fig. 14 Optimal plan for the US query using true cardinalities. The numbers in parentheses represent the true cardinalities of the respective intermediate results.

Fig. 15 Optimal plan for the German query using true cardinalities. The numbers in parentheses represent the true cardinalities of the respective intermediate results.

The plan quality difference of the two variants (a sub-optimality of 1.35 vs. 3.05) is surprising given the fact that both queries have similarly large estimation errors and the only difference between the two queries is one predicate that is estimated accurately in both cases. To understand this result, one has to look at the different plans carefully. The optimal plan for the US query, which is shown in Fig. 14, is fairly similar to the one selected by PostgreSQL. The main difference is that index-nested-loop joins are more common due to underestimated intermediate results. However, none of these decisions have a disastrous impact on the quality of the plan. As Fig. 15 shows, the optimal plan for the German query, on the other hand, has a completely different structure than the one selected by PostgreSQL. In the following we identify three major differences between the plans and their impact on plan quality:

For the relatively few movies produced by German companies, using the index to join those movies with the movie_info table (2 and 10) and afterwards filtering for the release date information (1 and 9) is the right decision. However, performing this index lookup for each of the many movies produced by US companies is no longer the cheapest solution. Instead, it would have been cheaper to first filter movie_info for the release dates (6) and afterwards join those with the US movies (5).

Even more costly is the decision to index join all the movies' rating information with the names of the involved companies (4) and to perform the required selection for German (respectively, US) companies (3) on top of this join. Due to the severe underestimation of the left-hand intermediate result (1,047 instead of 303K for both queries), the index-nested-loop join looks much cheaper than it actually is (cf. 7 and 8).

Finally, even though the estimates for the base table selection on company_name are almost perfect for both queries (the maximum q-error is 1.03), PostgreSQL fails to detect the cheap join of the filtered German company names with movie_companies (11) and hence does not use it as the bottom-most join. Again, the exponentially growing, systematic underestimation of intermediate cardinalities heavily distorts the optimizer's view: many intermediate results for larger sets of relations are estimated to be much smaller than the number of German production companies, causing the optimizer to favor such subexpressions. This is the main cause for the plan in Fig. 13 being significantly worse when answering the query for German movies.

While these explanations are anecdotal since they only concern two queries, we have seen similar phenomena more frequently:

– In many cases misestimations cancel each other out, at least partially.
– Sometimes even large estimation errors have no influence on the optimality of a join ordering for a certain subexpression.
– In other cases, already slightly misestimated cardinalities may lead to large differences in plan quality.

8 Disk-based Experiments

While main-memory systems are becoming widespread, large databases are still often stored on disk. Thus we now widen our investigation to include disk I/O cost as well. Our system has a hardware controller implementing RAID5, which, using magnetic disks, achieves a read throughput of around 900 MB/s.

Due to being a real-world data set, we cannot simply increase the data size of the IMDB database (as would be possible with most synthetic benchmarks). In order to be able to run disk-based experiments, we instead decrease the PostgreSQL buffer pool to only 16 MB (we also ran experiments with a 128 MB buffer pool, where we observed results that lie between the in-memory and the small-buffer-pool configurations). Additionally, we modified our JOB benchmark driver to flush the Linux file system buffer cache (echo 3 > /proc/sys/vm/drop_caches) before running each individual query. We increased the query timeout to 10 minutes. Repeating our experiments of Fig. 8, 9 and 10 under these conditions lasted roughly one week, and resulted in Fig. 16, 17, and 18.
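As an illustration of this cold-cache setup, a per-query measurement loop could look roughly like the sketch below. This is our own illustration, not the actual benchmark driver: the psycopg2 connection parameters and database name are assumptions, and the buffer pool size itself would be configured via shared_buffers in postgresql.conf rather than per session.

```python
import subprocess
import time

import psycopg2  # assumption: PostgreSQL driver available


def run_cold(query_sql, dbname="imdb"):
    """Time one query with a cold OS file system cache."""
    # Flush the Linux file system buffer cache before each query.
    subprocess.run(
        ["sudo", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True
    )
    conn = psycopg2.connect(dbname=dbname)  # hypothetical connection parameters
    try:
        with conn.cursor() as cur:
            cur.execute("SET statement_timeout = '10min'")  # per-query timeout
            start = time.time()
            cur.execute(query_sql)
            cur.fetchall()
            return time.time() - start
    finally:
        conn.close()
```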

8.1 Query Execution Engine

We now investigate the effects of the improvements to the query execution engine introduced in Section 5.2. Comparing Fig. 8a with Fig. 16a we see that the penalty for avoidable nested-loop joins is much higher on disk. Further, even when these are avoided and hash join performance is improved (cases b and c), significantly more queries are 2-10 times slower than optimal on disk when compared to the main-memory case. On-disk processing thus makes the absolute cost distribution more extreme than in main memory: slow, bad query plans are even further away in runtime from fast, optimal query plans.

Fig. 16 Disk version of the experiments in Fig. 8: slowdown distribution when using PostgreSQL estimates w.r.t. using true cardinalities, now on disk. Panels: (a) default, (b) no nested-loop join, (c) rehashing. On disk, the penalty for avoidable nested-loop joins (a, default PostgreSQL) is much higher. Even with this avoided and hash join performance improved (cases b, c), more queries are 2-10 times slower than optimal, compared to the main-memory case.

8.2 Adding Foreign Key Indexes

Although worse than in main memory, query performance on disk is, despite the errors in cardinality estimation, still quite good with only primary-key indexes available. As can be seen from the on-disk Fig. 17, adding foreign-key indexes widens the cost distribution even further. Using PK+FK indexes with wrong estimates on disk causes 24% of the queries to be more than 10× slower than optimal, with 12% being even more than a factor of 100 worse; something that occurs rarely in main memory.

Fig. 17 Disk version of the experiments in Fig. 9: slowdown distribution when using PostgreSQL estimates w.r.t. using true cardinalities, for (a) PK indexes and (b) PK + FK indexes. Using PK+FK indexes with wrong estimates on disk causes 10% of the queries to be slower than optimal by a factor of 100 or more; something that does not occur in main memory.

An in-depth analysis of the plans of the queries that were at least a factor of 10 slower than optimal revealed two main causes for them being slow. First, cardinalities, especially of small subexpressions (2-way and 3-way joins), are sometimes overestimated, resulting in a hash join being used in places where performing a few index lookups would have been significantly faster. The vast majority of plans, however, becomes that slow due to excessive use of indexes. The optimal plans for the complete workload use an overall of 669 index-nested-loop joins, whereas using the PostgreSQL estimates results in 780 index-nested-loop joins being planned. In disk-based settings, planning index-nested-loop joins is risky due to the expensive random I/O required for the index lookups.

Not only are more index-nested-loop joins planned, but often the same joins are performed in a different order. Because of undetected join-crossing correlations, the optimizer misses opportunities to eliminate irrelevant tuples early. Thus, we often observe plans performing several orders of magnitude more index lookups than the optimal plan.

8.3 Tuning the Cost Model for Disk IO

One of the main findings so far was that physical cost modeling has become relatively unimportant in join order optimization for main-memory query processing, and that research should focus on improving cardinality estimation. Does this conclusion also hold for disk-based query processing? Fig. 18, which compares estimated cost with actual runtimes, does confirm that cardinality estimation is indeed key to correctly predicting performance: even our trivial cost model that gets fed true cardinalities clearly correlates with on-disk query runtime, whereas with bad estimates (left side) there is no hope of good predictions. However, moving from main-memory (Fig. 10) to disk-based query processing does introduce significant errors to the predictions; on average one order of magnitude. In other words, even with perfect estimates, a join-order optimizer is likely to pick a suboptimal plan that could be an order of magnitude slower than optimal. Thus, for disk-based systems, a good cost model is more important than for systems purely operating in main memory.

PostgreSQL's standard cost model assumes the cost of fetching a random page to be 4× the cost of fetching pages sequentially. In reality, this factor is much higher, causing PostgreSQL to underestimate the cost of index lookups. We thus increased the cost of random I/O to be 100× as expensive as sequential I/O. The results of running the Join Order Benchmark with this tuned cost model are shown in subfigures c and d of Fig. 18. Using true cardinalities, tuning the cost model indeed improved the runtime prediction. About 63% of the queries are executed faster, and 3 of the queries suffering a timeout with the default cost model now execute successfully using the tuned cost model. However, the performance of 35% of the queries decreases significantly. Looking into the query plans, we find that the improvements are achieved due to less planning of index-nested-loop joins. The number of such joins dropped from 669 to 628 due to the tuned cost model. However, the cost model is still inaccurate in predicting query runtime, and thus performance also worsens for a significant number of queries. Finally, from the left-hand subfigures in Fig. 18, we see that because of the large errors in cardinality estimation, there is not much hope for good runtime predictions regardless of the cost model in use.
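In PostgreSQL this ratio is exposed through the seq_page_cost and random_page_cost settings (which default to 1.0 and 4.0). A session-level change along the lines of our tuned configuration could be issued as sketched below; this is an illustration with assumed connection parameters, not the exact setup of our experiments.

```python
import psycopg2  # assumption: PostgreSQL driver available

conn = psycopg2.connect(dbname="imdb")  # hypothetical connection parameters
with conn.cursor() as cur:
    # PostgreSQL defaults: seq_page_cost = 1.0, random_page_cost = 4.0
    cur.execute("SET seq_page_cost = 1.0")
    cur.execute("SET random_page_cost = 100.0")  # random I/O now costed 100x sequential I/O
conn.close()
```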
Summarizing the results of the on-disk experiments, we can conclude that cost prediction gets harder in a disk-based environment and that the influence of the cost model on plan quality is higher than in main memory. Besides the different characteristics of main memory and disk, further aspects, such as other storage devices, the degree of parallelism, and newer access methods [25], have an impact on the cost model. Thus, a cost model taking those factors into account may further increase the effect on plan quality. However, plan quality will generally still be dominated by the errors in cardinality estimation.

Fig. 18 Disk version of the experiments in Fig. 10. Note that the tuned cost model differs from Fig. 10, as we increase the relative cost of random to sequential I/O from the default factor of 4 to a more realistic factor of 100. On disk, true cardinalities are no longer perfect predictors of runtime (regardless of the tested cost model), and some high-cardinality queries time out. Accurate estimates still correlate with runtime, but performance predictions are now very frequently off by an order of magnitude, giving rise to significant join-order sub-optimalities.

9 Related Work

Our cardinality estimation experiments show that systems which keep table samples for cardinality estimation predict single-table result sizes considerably better than those which apply the independence assumption and use single-column histograms [21]. We think systems should be adopting table samples as a simple and robust technique, rather than earlier suggestions to explicitly detect certain correlations [20] in order to subsequently create multi-column histograms [42] for these.

However, many of our JOB queries contain join-crossing correlations, which single-table samples do not capture, and where the current generation of systems still applies the independence assumption. There is a body of existing research work to better estimate result sizes of queries with join-crossing correlations, mainly based on join samples [18], possibly enhanced against skew (end-biased sampling [11], correlated samples [53], sampling-based query re-optimization [52], index-based join sampling [30]), using sketches [44] or graphical models [48]. This work confirms that without addressing join-crossing correlations, cardinality estimates deteriorate strongly with more joins [22], leading to both the over- and underestimation of result sizes (mostly the latter), so it would be positive if some of these techniques were adopted by systems.

Another way of learning about join-crossing correlations is by exploiting query feedback, as in the LEO project [47], though there it was noted that deriving cardinality estimations based on a mix of exact knowledge and lack of knowledge needs a sound mathematical underpinning. For this, maximum entropy (MaxEnt [35,24]) was defined, though the costs for applying maximum entropy are high and have prevented its use in systems so far. We found that the performance impact of estimation mistakes heavily depends on the physical database design; in our experiments the largest impact is in situations with the richest designs. From the ROX [23] discussion in Section 4.4 one might conjecture that to truly unlock the potential of correctly predicting cardinalities for join-crossing correlations, we also need new physical designs and access paths.

Another finding in this paper is that the adverse effects of cardinality misestimations can be strongly reduced if systems "hedge their bets" and do not only choose the plan with the cheapest expected cost, but take the probabilistic distribution of the estimate into account, to avoid plans that are marginally faster than others but bear a high risk of strong underestimation. There has been work both on doing this for cardinality estimates purely [37], as well as on combining these with a cost model [2].

The problem with fixed hash table sizes in PostgreSQL illustrates that cost misestimation can often be mitigated by making the runtime behavior of the query engine more "performance robust". This can simply mean that operators do not use estimates in their implementation (e.g., [27,29]). More advanced techniques for making systems more adaptive include dynamically switching sides in a join or between hashing and sorting (GJoin [16]), switching between sequential scan and index lookup (smooth scan [5]), adaptively reordering join pipelines during query execution [31], or changing aggregation strategies at runtime depending on the actual number of group-by values [38] or partition-by values [3].

A radical approach is to move query optimization to runtime, when actual value distributions become available [40,10]. However, runtime techniques typically restrict the plan search space to limit runtime plan exploration cost, and sometimes come with functional restrictions such as only considering (sampling through) operators which have pre-created indexed access paths (e.g., ROX [23]).

Our experiments with the second query optimizer component besides cardinality estimation, namely the cost model, suggest that tuning cost models provides less benefit than improving cardinality estimates, and that in a main-memory setting even an extremely simple cost model can produce satisfactory results. This conclusion resonates with some of the findings in [51], which sets out to improve cost models but shows major improvements by refining cardinality estimates with additional sampling. In a disk-based setting, more accurate cost models have more impact and can improve query performance by an order of magnitude, but even this effect is generally overshadowed by the large cardinality estimation errors.

For testing the final query optimizer component, plan enumeration, we borrowed in our methodology from the Quickpick method used in randomized query optimization [49] to characterize and visualize the search space. Another well-known search space visualization method is Picasso [19], which visualizes query plans as areas in a space where query parameters are the dimensions. Interestingly, [49] claims in its characterization of the search space that good query plans are easily found, but our tests indicate that the richer the physical design and access path choices, the rarer good query plans become.
Query optimization is a core database research topic with a huge body of related work that cannot be fully represented in this section. After decades of work, the problem is still far from resolved [33]; some have even questioned the classical optimizer contract and argued for the need of optimizer application hints [7]. This paper introduces the Join Order Benchmark based on the highly correlated IMDB real-world data set and a methodology for measuring the accuracy of cardinality estimation. We hope that its integration into systems proposed for testing and evaluating the quality of query optimizers [50,17,15,34] will spur further innovation in this important topic.

10 Conclusions

Throughout this paper, in which we look at Query Optimization Through the Looking Glass, we made observations, some of which were already part of published literature or database systems lore, and others new. In the following we list all of these, numbered by the section in which they are made, with the most important conclusions in bold:

Section 3.1: Estimate cardinalities by execution on samples. Cardinality estimation by evaluating predicates on small samples (e.g., 1000 tuples) and extrapolating from these is to be preferred over other options (e.g., keeping histograms), since execution on samples automatically captures any predicate correlations between table columns and is capable of estimating any filter predicate. With large data volumes in analytical queries and fast CPUs available now, both the absolute and relative overhead of execution on samples during query optimization has dropped (in the past, this overhead made the technique less attractive). In a main-memory setting, sampling and existing index structures can even be used to detect join-crossing correlations [30]. One caveat is that execution on samples has a vulnerability for very low-cardinality predicates (of which the sample holds 0 instances).
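A minimal sketch of this estimation scheme follows; it is our illustration only, and the sample size, the fallback for empty samples, and the data representation are assumptions.

```python
import random

def estimate_cardinality(table_rows, predicate, sample_size=1000, seed=42):
    """Estimate |sigma_p(R)| by evaluating p on a random sample of R and
    extrapolating.  `table_rows` is the full relation (a list of rows),
    `predicate` a function row -> bool."""
    n = len(table_rows)
    if n <= sample_size:
        return sum(1 for row in table_rows if predicate(row))
    rng = random.Random(seed)
    sample = rng.sample(table_rows, sample_size)
    hits = sum(1 for row in sample if predicate(row))
    if hits == 0:
        # the blind spot mentioned above: the predicate may still match a
        # few tuples; one common fallback is to assume a selectivity just
        # below 1/sample_size
        return n / sample_size
    return hits * n / sample_size
```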
Sections 3.2, 5.1, 5.4: Focus research on cardinality estimation rather than cost models. Cardinality estimation of joins is the most important problem in query optimization. The estimates of all tested commercial systems routinely yield large estimation errors. Improving the accuracy of cardinality estimates is much more important for query optimizer quality than improving the accuracy of cost models. We tested an ultra-simple cost model that just sums the estimated amounts of intermediate tuples produced in a query, and this simple model performs just as well as the complex cost model of PostgreSQL in main memory, even after tuning the latter. In contrast, errors in cardinality estimation have a heavy effect on optimization quality.

Section 3.2: Underestimation is more common than overestimation. The more joins a real-life query has, the more current optimizers will underestimate the cardinalities due to applying the independence assumption. This implies that real-life queries tend to look for the Honda Accord (correlated predicates on brand and model) rather than for the Honda Mustang (anti-correlated), because queries tend to be posed with certain embedded domain knowledge about actual, existing entities. Our observations on System A suggest a heuristic that replaces simply applying selectivity multiplication (mandated by the independence assumption) with a "dampening" method that nudges selectivity downwards more gracefully. This suggestion is still a heuristic; of course, estimation methods that effectively capture join-crossing correlations would be better, but will be much harder to devise, given the huge space of potential correlations to cover.
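One possible instantiation of such dampening (our illustration, not the concrete proposal of this paper) is an exponential-backoff style combination, where only the most selective predicate is applied in full and each further selectivity enters with a diminishing exponent:

```python
def selectivity_independence(selectivities):
    """Textbook independence assumption: multiply all selectivities."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result

def selectivity_dampened(selectivities):
    """Exponential-backoff style dampening (one possible variant):
    apply the most selective predicate fully, the next with exponent 1/2,
    then 1/4, ... so the combined selectivity shrinks more gracefully."""
    result = 1.0
    for i, s in enumerate(sorted(selectivities)):
        result *= s ** (1.0 / (2 ** i))
    return result

# Three correlated predicates with selectivity 0.1 each:
#   independence: 0.1 * 0.1 * 0.1            = 0.001
#   dampened:     0.1 * 0.1**0.5 * 0.1**0.25 ~ 0.0178
```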
Section 3.3: Traditional benchmarks are not good tests for join order optimization. TPC-H, TPC-DS, and SSB all work on data sets where column values have uniform or (in the case of TPC-DS) stepwise-uniform frequency distributions, and which almost completely lack both intra- and inter-table correlations. This trivializes cardinality estimation. Furthermore, the queries in these benchmarks have a low number of joins, and this paper has shown that the hardness of accurate cardinality estimation increases directly with the number of joins.

Sections 3.4, 7: Query optimization is quirky business. Two wrongs often make a right in query optimization. In multiple cases we illustrated the effect that multiple errors cancel each other out. This phenomenon makes analyzing and fixing query optimizer problems very frustrating, because fixing one query may break another. Also, sometimes large estimation errors still lead the optimizer to find the right join order, whereas in other cases already slight misestimations have disastrous consequences.

Sections 4.1, 4.2, 8: One should use cost estimations in robust fashion. Rather than blindly picking the query plan with the lowest estimated cost, query optimizers should take the cardinality estimate error margins (or the estimate's probability density distribution) into account and avoid picking plans whose expected estimated cost is only slightly better but which run a significant risk of being much slower than a robust alternative. In general, hash joins should be favored over nested-loop equi-joins, because they are never much slower yet fall in a better complexity class. A related principle is never to rely on estimation for making decisions that can also be made at runtime, when the actual cardinalities are known (such as determining the number of buckets in a hash table created for a hash join). As a final example, a rule that prefers index-nested-loop joins over hash joins is a robust choice in query optimization for main-memory systems, since at worst there is only a small performance penalty in case hash joins would have been better, whereas the upside can be large. In disk-based systems, where index-nested-loop joins lead to (slow) random I/O, this is not the case.

Sections 4.3, 6.1, 8: The richer the physical database infrastructure, the harder query optimization becomes. We observe the effect when adding unclustered foreign key indexes to the schema, which typically did lead to faster query times. However, with a richer schema, (i) the cost distribution of the plan space gets more diverse, often introducing a few (and therefore hard-to-find) plans that are much faster, and (ii) the slowdown experienced due to misestimations (compared to the optimal plan) is much higher than in the case without indexes or with only primary key indexes. The effect of (ii) is much larger in disk-based systems than in main-memory systems.

Section 4.4: Access paths for join-crossing correlations should be a research topic. Correlations are a research frontier not only for correct estimation, but also in terms of devising new data structures, access paths, and execution algorithms. We discussed a DBLP co-authorship example query, where correctly predicting join-crossing anti-correlations is knowledge that cannot be leveraged unless special physical database designs are deployed (in the example, table partitioning on a join-crossing attribute is needed). Thus, research in cost estimation needs to be accompanied by research into new types of access paths.

Sections 6.2, 6.3: Superiority of exhaustive search. In rich schemas with FK indexes, exhaustive plan enumeration provides tangible benefits over more restricted strategies. Among these, restricting to zig-zag trees is better than considering only left-deep plans, which in turn is better than considering only right-deep plans. Heuristic strategies such as Quickpick, GOO, and BSizePP similarly find worse query plans than exhaustive search, especially in schemas with FK indexes, where index-aware approaches such as BSizePP perform better than the other heuristic approaches.

Acknowledgements We would like to thank Guy Lohman and the anonymous reviewers for their valuable feedback. We also thank Moritz Wilfer for his input in the early stages of this project.

References

1. Ahmed, R., Sen, R., Poess, M., Chakkappen, S.: Of snowstorms and bushy trees. PVLDB 7(13), 1452–1461 (2014)
2. Babcock, B., Chaudhuri, S.: Towards a robust query optimizer: A principled and practical approach. In: SIGMOD, pp. 119–130 (2005)
3. Bellamkonda, S., Li, H.G., Jagtap, U., Zhu, Y., Liang, V., Cruanes, T.: Adaptive and big data scale parallel execution in Oracle. PVLDB 6(11), 1102–1113 (2013)
4. Boncz, P.A., Neumann, T., Erling, O.: TPC-H analyzed: Hidden messages and lessons learned from an influential benchmark. In: TPCTC, pp. 61–76 (2013)
5. Borovica-Gajic, R., Idreos, S., Ailamaki, A., Zukowski, M., Fraser, C.: Smooth scan: Statistics-oblivious access paths. In: ICDE, pp. 315–326 (2015)
6. Bruno, N., Galindo-Legaria, C.A., Joshi, M.: Polynomial heuristics for query optimization. In: ICDE, pp. 589–600 (2010)
7. Chaudhuri, S.: Query optimizers: time to rethink the contract? In: SIGMOD, pp. 961–968 (2009)
8. Chaudhuri, S., Narasayya, V.R., Ramamurthy, R.: Exact cardinality query optimization for optimizer testing. PVLDB 2(1), 994–1005 (2009)
9. Colgan, M.: Oracle adaptive joins. https://blogs.oracle.com/optimizer/entry/what_s_new_in_12c (2013)
10. Dutt, A., Haritsa, J.R.: Plan bouquets: query processing without selectivity estimation. In: SIGMOD, pp. 1039–1050 (2014)
11. Estan, C., Naughton, J.F.: End-biased samples for join cardinality estimation. In: ICDE, p. 20 (2006)
12. Fegaras, L.: A new heuristic for optimizing large queries. In: DEXA, pp. 726–735 (1998)
In: bounding the impact of cardinality estimation errors. PVLDB DEXA, pp. 726–735 (1998) 2(1), 982–993 (2009) 13. Fender, P., Moerkotte, G.: Counter strike: Generic top-down join 38.M uller,¨ I., Sanders, P., Lacurie, A., Lehner, W., Farber,¨ F.: Cache- enumeration for hypergraphs. PVLDB 6(14), 1822–1833 (2013) efficient aggregation: Hashing is sorting. In: SIGMOD, pp. 1123– 14. Fender, P., Moerkotte, G., Neumann, T., Leis, V.: Effective and ro- 1136 (2015) bust pruning for top-down join enumeration algorithms. In: ICDE, 39. Neumann, T.: Query simplification: graceful degradation for join- pp. 414–425 (2012) order optimization. In: SIGMOD, pp. 403–414 (2009) 15. Fraser, C., Giakoumakis, L., Hamine, V., Moore-Smith, K.F.: Test- 40. Neumann, T., Galindo-Legaria, C.A.: Taking the edge off cardi- ing cardinality estimation models in SQL Server. In: DBtest nality estimation errors using incremental execution. In: BTW, (2012) pp. 73–92 (2013) 16. Graefe, G.: A generalized join algorithm. In: BTW, pp. 267–286 41. O’Neil, P.E., O’Neil, E.J., Chen, X., Revilak, S.: The star schema (2011) benchmark and augmented fact table indexing. In: TPCTC, pp. 17. Gu, Z., Soliman, M.A., Waas, F.M.: Testing the accuracy of query 237–252 (2009) optimizers. In: DBTest (2012) 42. Poosala, V., Ioannidis, Y.E.: Selectivity estimation without the at- 18. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Selectivity tribute value independence assumption. In: VLDB, pp. 486–495 and cost estimation for joins based on random sampling. J Com- (1997) puter System Science 52(3), 550–569 (1996) 43.P oss,¨ M., Nambiar, R.O., Walrath, D.: Why you should run TPC- 19. Haritsa, J.R.: The Picasso database query optimizer visualizer. DS: A workload analysis. In: PVLDB, pp. 1138–1149 (2007) PVLDB 3(2), 1517–1520 (2010) 44. Rusu, F., Dobra, A.: Sketches for size of join estimation. TODS 20. Ilyas, I.F., Markl, V., Haas, P.J., Brown, P., Aboulnaga, A.: 33(3) (2008) CORDS: automatic discovery of correlations and soft functional 45. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., dependencies. In: SIGMOD, pp. 647–658 (2004) Price, T.G.: Access path selection in a relational database man- 21. Ioannidis, Y.E.: The history of histograms (abridged). In: VLDB, agement system. In: SIGMOD, pp. 23–34 (1979) pp. 19–30 (2003) 46. Steinbrunn, M., Moerkotte, G., Kemper, A.: Heuristic and ran- 22. Ioannidis, Y.E., Christodoulakis, S.: On the propagation of errors domized optimization for the join ordering problem. VLDB J. in the size of join results. In: SIGMOD (1991) 6(3), 191–208 (1997) 23. Kader, R.A., Boncz, P.A., Manegold, S., van Keulen, M.: ROX: 47. Stillger, M., Lohman, G.M., Markl, V., Kandil, M.: LEO - DB2’s run-time optimization of XQueries. In: SIGMOD, pp. 615–626 learning optimizer. In: VLDB, pp. 19–28 (2001) (2009) 48. Tzoumas, K., Deshpande, A., Jensen, C.S.: Lightweight graphical 24. Kaushik, R., Re,´ C., Suciu, D.: General database statistics using models for selectivity estimation without independence assump- entropy maximization. In: DBPL, pp. 84–99 (2009) tions. PVLDB 4(11), 852–863 (2011) Query Optimization Through the Looking Glass, and What We Found Running the Join Order Benchmark 23

49. Waas, F., Pellenkoft, A.: Join order selection - good enough is easy. In: BNCOD, pp. 51–67 (2000)
50. Waas, F.M., Giakoumakis, L., Zhang, S.: Plan space analysis: an early warning system to detect plan regressions in cost-based optimizers. In: DBTest (2011)
51. Wu, W., Chi, Y., Zhu, S., Tatemura, J., Hacigümüs, H., Naughton, J.F.: Predicting query execution time: Are optimizer cost models really unusable? In: ICDE, pp. 1081–1092 (2013)
52. Wu, W., Naughton, J.F., Singh, H.: Sampling-based query re-optimization. In: SIGMOD (2016)
53. Yu, F., Hou, W., Luo, C., Che, D., Zhu, M.: CS2: a new database synopsis for query estimation. In: SIGMOD, pp. 469–480 (2013)

Q6 join-graph: filters: (no common filter predicates) t n 6a: k.keyword = ’marvel-cinematic-universe’ 6a: n.name LIKE ’%Downey%Robert%’ 6a: t.production year > 2010 k ci 6b: k.keyword in (’superhero’, ’sequel’, ’second- part’, ’marvel-comics’, ’based-on-comic’, ’tv- special’, ’fight’, ’violence’) A Appendix: Detailed Query Descriptions mk 6b: n.name LIKE ’%Downey%Robert%’ 6b: t.production year > 2014 6c: k.keyword = ’marvel-cinematic-universe’ JOB consists of 113 multi-join query variants based on 33 query struc- 6c: n.name LIKE ’%Downey%Robert%’ 6c: t.production year > 2014 tures. The query variants derive from the same query structure differ 6d: k.keyword in (’superhero’, ’sequel’, ’second- only in their filter predicates; which consist of a series of conjunc- part’, ’marvel-comics’, ’based-on-comic’, ’tv- tions multiple special’, ’fight’, ’violence’) . The join-relationships connect the tables through join 6d: n.name LIKE ’%Downey%Robert%’ relations, hence we draw the join-graph for each template on the left 6d: t.production year > 2000 6e: k.keyword = ’marvel-cinematic-universe’ side. We use the alias from Table1 as the tuple variable names for the 6e: n.name LIKE ’%Downey%Robert%’ joined relations. On the right side, we first list all filter predicates that 6e: t.production year > 2000 are common to all variants of a query structure, followed by a box con- 6f: k.keyword in (’superhero’, ’sequel’, ’second- part’, ’marvel-comics’, ’based-on-comic’, ’tv- taining all additional filter predicates, one box for each query variant. special’, ’fight’, ’violence’) On the bottom right, we list the projection columns (retrieved as MIN() 6f: t.production year > 2000 projections: aggregates). The query set is available online: k.keyword n.name t.title http://www-db.in.tum.de/˜leis/qo/job.tgz Q7 join-graph: filters: it.info = ’mini biography’ n it 7a: an.name LIKE ’%a%’ Q1 join-graph: filters: ct.kind = ’production companies’ 7a: lt.link =’features’ ct t mc.note NOT LIKE ’%(as Metro-Goldwyn-Mayer 7a: (n.gender=’m’ OR (n.gender = ’f’ Pictures)%’ pi 7a: n.name LIKE ’B%’)) 1a: it.info = ’top 250 rank’ 7a: n.name pcode cf BETWEEN ’A’ AND ’F’ mc it 1a: (mc.note LIKE ’%(co-production)%’ OR an 7a: pi.note =’Volker Boehm’ mc.note LIKE ’%(presents)%’) 7a: t.production year BETWEEN 1980 AND 1995 mi_idx 1b: it.info = ’bottom 10 rank’ 7b: an.name LIKE ’%a%’ 1b: t.production year BETWEEN 2005 AND 2010 lt ci 7b: lt.link =’features’ 1c: it.info = ’top 250 rank’ 7b: n.gender=’m’ 7b: n.name pcode cf LIKE ’D%’ 1c: (mc.note LIKE ’%(co-production)%’) ml 1c: t.production year > 2010 7b: pi.note =’Volker Boehm’ 1d: it.info = ’bottom 10 rank’ 7b: t.production year BETWEEN 1980 AND 1984 1d: t.production year > 2000 t 7c: an.name is NOT NULL projections: mc.note t.title t.production year 7c: (an.name LIKE ’%a%’ OR an.name LIKE ’A%’) 7c: lt.link in (’references’, ’referenced in’, ’features’, Q2 join-graph: filters: k.keyword = ’character-name-in-title’ ’featured in’) cn 2a: cn.country code =’[de]’ 7c: (n.gender=’m’ OR (n.gender = ’f’ 2b: cn.country code =’[nl]’ 7c: n.name LIKE ’A%’)) mc 2c: cn.country code =’[sm]’ 7c: n.name pcode cf BETWEEN ’A’ AND ’F’ 2d: cn.country code =’[us]’ 7c: pi.note is NOT NULL 7c: t.production year BETWEEN 1980 AND 2010 t projections: n.name t.title mk Q8 join-graph: filters: (no common filter predicates) k an 8a: ci.note =’(voice: English version)’ projections: t.title 8a: cn.country code =’[jp]’ 8a: mc.note LIKE ’%(Japan)%’ Q3 join-graph: filters: k.keyword LIKE ’%sequel%’ n 8a: mc.note NOT LIKE 
’%(USA)%’ t k 3a: mi.info IN (’Sweden’, ’Norway’, ’Germany’, 8a: n.name LIKE ’%Yo%’ ’Denmark’, ’Swedish’, ’Denish’, ’Norwegian’, ’Ger- ci 8a: n.name NOT LIKE ’%Yu%’ man’) 8a: rt.role =’actress’ mk 8b: ci.note =’(voice: English version)’ 3a: t.production year > 2005 t rt 3b: mi.info IN (’Bulgaria’) 8b: cn.country code =’[jp]’ mi 3b: t.production year > 2010 8b: (mc.note LIKE ’%(2006)%’ OR mc.note LIKE 3c: mi.info IN (’Sweden’, ’Norway’, ’Germany’, mc ’%(2007)%’) ’Denmark’, ’Swedish’, ’Denish’, ’Norwegian’, ’Ger- 8b: mc.note LIKE ’%(Japan)%’ 8b: mc.note NOT LIKE ’%(USA)%’ man’, ’USA’, ’American’) cn 8b: n.name LIKE ’%Yo%’ 3c: t.production year > 1990 8b: n.name NOT LIKE ’%Yu%’ projections: t.title 8b: rt.role =’actress’ Q4 join-graph: filters: 8b: t.production year BETWEEN 2006 AND 2007 it.info = ’rating’ 8b: t k k.keyword LIKE ’%sequel%’ (t.title LIKE ’One Piece%’ OR t.title LIKE 4a: t.production year > 2005 ’Dragon Ball Z%’) 4b: t.production year > 2010 8c: cn.country code =’[us]’ mk it 4c: t.production year > 1990 8c: rt.role =’writer’ 8d: cn.country code =’[us]’ mi_idx 8d: rt.role =’costume designer’ projections: mi idx.info t.title projections: an.name t.title 24 Viktor Leis et al.

Q9 join-graph: filters: cn.country code = ’[us]’ Q13 join-graph: filters:ct.kind = ’production companies’ an n.gender = ’f’ it2 it1.info = ’rating’ rt.role = ’actress’ it2.info = ’release dates’ 9a: ci.note in (’(voice)’, ’(voice: Japanese version)’, kt.kind = ’movie’ n chn ’(voice) (uncredited)’, ’(voice: English version)’) it1 mi 13a: cn.country code =’[de]’ 9a: mc.note is NOT NULL 13b: cn.country code =’[us]’ ci 9a: n.name LIKE ’%Ang%’ mi_idx cn ct 13b: t.title != ” 9a: t.production year BETWEEN 2005 AND 2015 13b: (t.title LIKE ’%Champion%’ OR t.title LIKE 9b: ci.note = ’(voice)’ ’%Loser%’) t rt 9b: mc.note LIKE ’%(200%)%’ kt mc 13c: cn.country code =’[us]’ 9b: n.name LIKE ’%Angel%’ 13c: t.title != ” mc 9b: t.production year BETWEEN 2007 AND 2010 t 13c: (t.title LIKE ’Champion%’ OR t.title LIKE 9c: ci.note in (’(voice)’, ’(voice: Japanese version)’, ’Loser%’) ’(voice) (uncredited)’, ’(voice: English version)’) 13d: cn.country code =’[us]’ cn 9c: n.name LIKE ’%An%’ projections: mi.info mi idx.info t.title 9d: ci.note in (’(voice)’, ’(voice: Japanese version)’, ’(voice) (uncredited)’, ’(voice: English version)’) projections: an.name chn.name t.title Q14 join-graph: filters:it1.info = ’countries’ kt it2.info = ’rating’ 14a: k.keyword in (’murder’, ’murder-in-title’, ’blood’, ’violence’) t k 14a: kt.kind = ’movie’ Q10 join-graph: filters:(no common filter predicates) 14a: mi.info IN (’Sweden’, ’Norway’, ’Germany’, t chn rt 10a: ci.note LIKE ’%(uncredited)%’ ’Denmark’, ’Swedish’, ’Denish’, ’Norwegian’, ’Ger- 10a: ci.note LIKE ’%(voice)%’ it1 mk man’, ’USA’, ’American’) 10a: cn.country code = ’[ru]’ 14a: t.production year > 2010 ci cn ct 10a: rt.role = ’actor’ 14b: k.keyword in (’murder’, ’murder-in-title’) 10a: t.production year > 2005 mi it2 14b: kt.kind = ’movie’ mc 10b: ci.note LIKE ’%(producer)%’ 14b: mi.info IN (’Sweden’, ’Norway’, ’Germany’, 10b: cn.country code = ’[ru]’ ’Denmark’, ’Swedish’, ’Denish’, ’Norwegian’, ’Ger- 10b: rt.role = ’actor’ mi_idx man’, ’USA’, ’American’) 10b: t.production year > 2010 14b: t.production year > 2010 10c: ci.note LIKE ’%(producer)%’ 14b: (t.title LIKE ’%murder%’ OR t.title LIKE 10c: cn.country code = ’[us]’ 10c: t.production year > 1990 ’%Murder%’ OR t.title LIKE ’%Mord%’) projections: chn.name t.title 14c: k.keyword in (’murder’, ’murder-in-title’, ’blood’, ’violence’) 14c: k.keyword is NOT 14c: kt.kind in (’movie’, ’episode’) 14c: mi.info IN (’Sweden’, ’Norway’, ’Germany’, ’Denmark’, ’Swedish’, ’Danish’, ’Norwegian’, ’Ger- Q11 join-graph: filters:cn.country code != ’[pl]’ man’, ’USA’, ’American’) lt 11a: (cn.name LIKE ’%Film%’ OR cn.name LIKE 14c: t.production year > 2005 ’%Warner%’) projections: mi idx.info t.title ml 11a: ct.kind =’production companies’ 11a: k.keyword =’sequel’ 11a: lt.link LIKE ’%follow%’ t 11a: mc.note IS NULL Q15 join-graph: filters:cn.country code = ’[us]’ 11a: t.production year BETWEEN 1950 AND 2000 t k it1.info = ’release dates’ 11b: (cn.name LIKE ’%Film%’ OR cn.name LIKE mi.note LIKE ’%internet%’ mk 15a: mc.note LIKE ’%(200%)%’ ’%Warner%’) it1 mk 15a: mi.info LIKE ’USA:% 200%’ 11b: ct.kind =’production companies’ 15a: t.production year > 2000 k mc 11b: k.keyword =’sequel’ mi ct cn 15b: cn.name = ’YouTube’ 11b: lt.link LIKE ’%follows%’ 15b: mc.note LIKE ’%(200%)%’ ct cn 11b: mc.note IS NULL 15b: mi.info LIKE ’USA:% 200%’ 11b: t.production year = 1998 mc 15b: t.production year BETWEEN 2005 AND 2010 11b: t.title LIKE ’%Money%’ 15c: mi.info is NOT NULL 11c: (cn.name LIKE ’20th Century Fox%’ OR at 15c: (mi.info LIKE ’USA:% 199%’ 
OR mi.info cn.name LIKE ’Twentieth Century Fox%’) LIKE ’USA:% 200%’) 11c: ct.kind is NOT NULL 15c: t.production year > 1990 11c: ct.kind != ’production companies’ 15d: t.production year > 1990 11c: k.keyword in (’sequel’, ’revenge’, ’based-on- projections: mi.info t.title novel’) 11c: mc.note is NOT NULL 11c: t.production year > 1950 11d: ct.kind is NOT NULL Q16 join-graph: filters:cn.country code = ’[us]’ 11d: ct.kind != ’production companies’ an k.keyword = ’character-name-in-title’ 11d: k.keyword in (’sequel’, ’revenge’, ’based-on- 16a: t.episode nr < 100 n 16a: t.episode nr > = 50 novel’) 16c: < 11d: mc.note is NOT NULL t.episode nr 100 11d: > 16d: t.episode nr < 100 t.production year 1950 ci 16d: > projections: cn.name lt.link t.title t.episode nr = 5 t

mc

Q12 join-graph: filters:cn.country code = ’[us]’ mk cn t ct cn 12a: ct.kind = ’production companies’ 12a: it1.info = ’genres’ 12a: it2.info = ’rating’ k mc 12a: mi.info in (’Drama’, ’Horror’) projections: an.name t.title 12a: t.production year BETWEEN 2005 AND 2008 12b: ct.kind is NOT NULL mi 12b: (ct.kind =’production companies’ OR ct.kind = Q17 join-graph: filters:k.keyword = ’character-name-in-title’ ’distributors’) n 17a: cn.country code =’[us]’ it1 mi_idx 12b: it1.info =’budget’ 17a: n.name LIKE ’B%’ 12b: it2.info =’bottom 10 rank’ ci 17b: n.name LIKE ’Z%’ 12b: t.production year > 2000 17c: n.name LIKE ’X%’ it2 12b: (t.title LIKE ’Birdemic%’ OR t.title LIKE 17d: n.name LIKE ’%Bert%’ ’%Movie%’) t 17e: cn.country code =’[us]’ 12c: ct.kind = ’production companies’ 17f: n.name LIKE ’%B%’ 12c: it1.info = ’genres’ mc 12c: it2.info = ’rating’ 12c: mi.info in (’Drama’, ’Horror’, ’Western’, ’Fam- mk cn ily’) 12c: t.production year BETWEEN 2000 AND 2010 k projections: cn.name mi idx.info t.title projections: n.name n.name Query Optimization Through the Looking Glass, and What We Found Running the Join Order Benchmark 25

Q18 join-graph: (figure omitted)
  common filters: (no common filter predicates)
  18a: ci.note in ('(producer)', '(executive producer)'); it1.info = 'budget'; it2.info = 'votes'; n.gender = 'm'; n.name LIKE '%Tim%'
  18b: ci.note in ('(writer)', '(head writer)', '(written by)', '(story)', '(story editor)'); it1.info = 'genres'; it2.info = 'rating'; mi.info in ('Horror', 'Thriller'); mi.note is NULL; n.gender = 'f'; n.gender is NOT null; t.production_year BETWEEN 2008 AND 2014
  18c: ci.note in ('(writer)', '(head writer)', '(written by)', '(story)', '(story editor)'); it1.info = 'genres'; it2.info = 'votes'; mi.info in ('Horror', 'Action', 'Sci-Fi', 'Thriller', 'Crime', 'War'); n.gender = 'm'
  projections: mi.info, mi_idx.info, t.title

Q19 join-graph: (figure omitted)
  common filters: cn.country_code = '[us]'; it.info = 'release dates'; n.gender = 'f'; rt.role = 'actress'
  19a: ci.note in ('(voice)', '(voice: Japanese version)', '(voice) (uncredited)', '(voice: English version)'); mc.note is NOT NULL; mi.info is NOT null; (mi.info LIKE 'Japan:%200%' OR mi.info LIKE 'USA:%200%'); n.name LIKE '%Ang%'; t.production_year BETWEEN 2005 AND 2009
  19b: ci.note = '(voice)'; mc.note LIKE '%(200%)%'; mi.info is NOT null; (mi.info LIKE 'Japan:%2007%' OR mi.info LIKE 'USA:%2008%'); n.name LIKE '%Angel%'; t.production_year BETWEEN 2007 AND 2008; t.title LIKE '%Kung%Fu%Panda%'
  19c: ci.note in ('(voice)', '(voice: Japanese version)', '(voice) (uncredited)', '(voice: English version)'); mi.info is NOT null; (mi.info LIKE 'Japan:%200%' OR mi.info LIKE 'USA:%200%'); n.name LIKE '%An%'; t.production_year > 2000
  19d: ci.note in ('(voice)', '(voice: Japanese version)', '(voice) (uncredited)', '(voice: English version)'); t.production_year > 2000
  projections: n.name, t.title

Q20 join-graph: (figure omitted)
  common filters: cct1.kind = 'cast'; cct2.kind LIKE '%complete%'; kt.kind = 'movie'
  20a: (chn.name LIKE '%Tony%Stark%' OR chn.name LIKE '%Iron%Man%'); chn.name NOT LIKE '%Sherlock%'; k.keyword in ('superhero', 'sequel', 'second-part', 'marvel-comics', 'based-on-comic', 'tv-special', 'fight', 'violence'); t.production_year > 1950
  20b: (chn.name LIKE '%Tony%Stark%' OR chn.name LIKE '%Iron%Man%'); chn.name NOT LIKE '%Sherlock%'; k.keyword in ('superhero', 'sequel', 'second-part', 'marvel-comics', 'based-on-comic', 'tv-special', 'fight', 'violence'); n.name LIKE '%Downey%Robert%'; t.production_year > 2000
  20c: chn.name is NOT NULL; (chn.name LIKE '%man%' OR chn.name LIKE '%Man%'); k.keyword in ('superhero', 'marvel-comics', 'based-on-comic', 'tv-special', 'fight', 'violence', 'magnet', 'web', 'claw', 'laser'); t.production_year > 2000
  projections: t.title

Q21 join-graph: (figure omitted)
  common filters: cn.country_code != '[pl]'; (cn.name LIKE '%Film%' OR cn.name LIKE '%Warner%'); ct.kind = 'production companies'; k.keyword = 'sequel'; lt.link LIKE '%follow%'; mc.note IS NULL
  21a: mi.info IN ('Sweden', 'Norway', 'Germany', 'Denmark', 'Swedish', 'Denish', 'Norwegian', 'German'); t.production_year BETWEEN 1950 AND 2000
  21b: mi.info IN ('Germany', 'German'); t.production_year BETWEEN 2000 AND 2010
  21c: mi.info IN ('Sweden', 'Norway', 'Germany', 'Denmark', 'Swedish', 'Denish', 'Norwegian', 'German', 'English'); t.production_year BETWEEN 1950 AND 2010
  projections: cn.name, lt.link, t.title

Q22 join-graph: (figure omitted)
  common filters: cn.country_code != '[us]'; it1.info = 'countries'; it2.info = 'rating'; k.keyword in ('murder', 'murder-in-title', 'blood', 'violence'); kt.kind in ('movie', 'episode')
  22a: mc.note LIKE '%(200%)%'; mc.note NOT LIKE '%(USA)%'; mi.info IN ('Germany', 'German', 'USA', 'American'); t.production_year > 2008
  22b: mc.note LIKE '%(200%)%'; mc.note NOT LIKE '%(USA)%'; mi.info IN ('Germany', 'German', 'USA', 'American'); t.production_year > 2009
  22c: mc.note LIKE '%(200%)%'; mc.note NOT LIKE '%(USA)%'; mi.info IN ('Sweden', 'Norway', 'Germany', 'Denmark', 'Swedish', 'Danish', 'Norwegian', 'German', 'USA', 'American'); t.production_year > 2005
  22d: mi.info IN ('Sweden', 'Norway', 'Germany', 'Denmark', 'Swedish', 'Danish', 'Norwegian', 'German', 'USA', 'American'); t.production_year > 2005
  projections: cn.name, mi_idx.info, t.title

Q23 join-graph: (figure omitted)
  common filters: cct.kind = 'complete+verified'; cn.country_code = '[us]'; it.info = 'release dates'; mi.note LIKE '%internet%'
  23a: kt.kind in ('movie'); mi.info is NOT NULL; (mi.info LIKE 'USA:% 199%' OR mi.info LIKE 'USA:% 200%'); t.production_year > 2000
  23b: k.keyword in ('nerd', 'loner', 'alienation', 'dignity'); kt.kind in ('movie'); mi.info LIKE 'USA:% 200%'; t.production_year > 2000
  23c: mi.info is NOT NULL; (mi.info LIKE 'USA:% 199%' OR mi.info LIKE 'USA:% 200%'); t.production_year > 1990
  projections: kt.kind, t.title

Q24 join-graph: (figure omitted)
  common filters: ci.note in ('(voice)', '(voice: Japanese version)', '(voice) (uncredited)', '(voice: English version)'); cn.country_code = '[us]'; it.info = 'release dates'; mi.info is NOT null; (mi.info LIKE 'Japan:%201%' OR mi.info LIKE 'USA:%201%'); n.gender = 'f'; n.name LIKE '%An%'; rt.role = 'actress'; t.production_year > 2010
  24a: k.keyword in ('hero', 'martial-arts', 'hand-to-hand-combat')
  24b: cn.name = 'DreamWorks Animation'; k.keyword in ('hero', 'martial-arts', 'hand-to-hand-combat', 'computer-animated-movie'); t.title LIKE 'Kung Fu Panda%'
  projections: chn.name, n.name, t.title
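Several families in this group (Q20, Q23) additionally join the complete_cast table together with its comp_cast_type lookup. As a hedged sketch of that pattern, variant 23a could be written as follows; the aliases and join predicates are again derived from the IMDB schema, not from the listing itself:

    -- Sketch of JOB variant 23a; the selections come from the listing above,
    -- the join predicates are inferred from the IMDB schema.
    SELECT MIN(kt.kind), MIN(t.title)
    FROM complete_cast AS cc, comp_cast_type AS cct, company_name AS cn,
         company_type AS ct, info_type AS it, keyword AS k, kind_type AS kt,
         movie_companies AS mc, movie_info AS mi, movie_keyword AS mk, title AS t
    WHERE cct.kind = 'complete+verified'           -- common filters
      AND cn.country_code = '[us]'
      AND it.info = 'release dates'
      AND mi.note LIKE '%internet%'
      AND kt.kind IN ('movie')                     -- variant 23a
      AND mi.info IS NOT NULL
      AND (mi.info LIKE 'USA:% 199%' OR mi.info LIKE 'USA:% 200%')
      AND t.production_year > 2000
      AND kt.id = t.kind_id
      AND t.id = mi.movie_id AND it.id = mi.info_type_id
      AND t.id = mk.movie_id AND k.id = mk.keyword_id
      AND t.id = mc.movie_id AND cn.id = mc.company_id AND ct.id = mc.company_type_id
      AND t.id = cc.movie_id AND cct.id = cc.status_id;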

Q25 join-graph: (figure omitted)
  common filters: ci.note in ('(writer)', '(head writer)', '(written by)', '(story)', '(story editor)'); it1.info = 'genres'; it2.info = 'votes'
  25a: k.keyword in ('murder', 'blood', 'gore', 'death', 'female-nudity'); mi.info = 'Horror'; n.gender = 'm'
  25b: k.keyword in ('murder', 'blood', 'gore', 'death', 'female-nudity'); mi.info = 'Horror'; n.gender = 'm'; t.production_year > 2010; t.title LIKE 'Vampire%'
  25c: k.keyword in ('murder', 'violence', 'blood', 'gore', 'death', 'female-nudity', 'hospital'); mi.info in ('Horror', 'Action', 'Sci-Fi', 'Thriller', 'Crime', 'War'); n.gender = 'm'
  projections: mi.info, mi_idx.info, n.name, t.title

Q26 join-graph: (figure omitted)
  common filters: cct1.kind = 'cast'; cct2.kind LIKE '%complete%'; chn.name is NOT NULL; ci.note in ('(voice)', '(voice: Japanese version)', '(voice) (uncredited)', '(voice: English version)'); it.info = 'rating'; kt.kind = 'movie'
  26a: k.keyword in ('superhero', 'marvel-comics', 'based-on-comic', 'tv-special', 'fight', 'violence', 'magnet', 'web', 'claw', 'laser'); t.production_year > 2000
  26b: k.keyword in ('superhero', 'marvel-comics', 'based-on-comic', 'fight'); t.production_year > 2005
  26c: cct2.kind = 'complete+verified'; k.keyword in ('superhero', 'marvel-comics', 'based-on-comic', 'tv-special', 'fight', 'violence', 'magnet', 'web', 'claw', 'laser'); t.production_year > 2000
  projections: chn.name, mi_idx.info, n.name, t.title

Q27 join-graph: (figure omitted)
  common filters: cn.country_code != '[pl]'; (cn.name LIKE '%Film%' OR cn.name LIKE '%Warner%'); ct.kind = 'production companies'; k.keyword = 'sequel'; lt.link LIKE '%follow%'; mc.note IS NULL
  27a: cct1.kind in ('cast', 'crew'); cct2.kind = 'complete'; mi.info IN ('Sweden', 'Germany', 'Swedish', 'German'); t.production_year BETWEEN 1950 AND 2000
  27b: cct1.kind in ('cast', 'crew'); cct2.kind = 'complete'; mi.info IN ('Sweden', 'Germany', 'Swedish', 'German'); t.production_year = 1998
  27c: cct1.kind = 'cast'; cct2.kind LIKE 'complete%'; mi.info IN ('Sweden', 'Norway', 'Germany', 'Denmark', 'Swedish', 'Denish', 'Norwegian', 'German', 'English'); t.production_year BETWEEN 1950 AND 2010
  projections: cn.name, lt.link, t.title

Q28 join-graph: (figure omitted)
  common filters: cn.country_code != '[us]'; it1.info = 'countries'; it2.info = 'rating'; k.keyword in ('murder', 'murder-in-title', 'blood', 'violence'); kt.kind in ('movie', 'episode'); mc.note LIKE '%(200%)%'
  28a: cct1.kind = 'crew'; cct2.kind != 'complete+verified'; mi.info IN ('Sweden', 'Norway', 'Germany', 'Denmark', 'Swedish', 'Danish', 'Norwegian', 'German', 'USA', 'American'); t.production_year > 2000
  28b: cct1.kind = 'crew'; cct2.kind != 'complete+verified'; mi.info IN ('Sweden', 'Germany', 'Swedish', 'German'); t.production_year > 2005
  28c: cct1.kind = 'cast'; cct2.kind = 'complete'; mi.info IN ('Sweden', 'Norway', 'Germany', 'Denmark', 'Swedish', 'Danish', 'Norwegian', 'German', 'USA', 'American'); t.production_year > 2005
  projections: cn.name, mi_idx.info, t.title

Q29 join-graph: (figure omitted)
  common filters: cct1.kind = 'cast'; cct2.kind = 'complete+verified'; cn.country_code = '[us]'; it1.info = 'release dates'; k.keyword = 'computer-animation'; n.gender = 'f'; n.name LIKE '%An%'; rt.role = 'actress'
  29a: chn.name = 'Queen'; ci.note in ('(voice)', '(voice) (uncredited)', '(voice: English version)'); it2.info = 'trivia'; mi.info is NOT null; (mi.info LIKE 'Japan:%200%' OR mi.info LIKE 'USA:%200%'); t.production_year BETWEEN 2000 AND 2010; t.title = 'Shrek 2'
  29b: chn.name = 'Queen'; ci.note in ('(voice)', '(voice) (uncredited)', '(voice: English version)'); it2.info = 'height'; mi.info LIKE 'USA:%200%'; t.production_year BETWEEN 2000 AND 2005; t.title = 'Shrek 2'
  29c: (chn.name LIKE '%man%' OR chn.name LIKE '%Man%'); it2.info = 'trivia'; mi.info is NOT null; (mi.info LIKE 'Japan:%200%' OR mi.info LIKE 'USA:%200%'); t.production_year BETWEEN 2000 AND 2010
  projections: chn.name, n.name, t.title

Q30 join-graph: (figure omitted)
  common filters: ci.note in ('(writer)', '(head writer)', '(written by)', '(story)', '(story editor)'); it1.info = 'genres'; it2.info = 'votes'; k.keyword in ('murder', 'violence', 'blood', 'gore', 'death', 'female-nudity', 'hospital'); n.gender = 'm'
  30a: cct1.kind in ('cast', 'crew'); mi.info in ('Horror', 'Thriller'); t.production_year > 2000
  30b: cct1.kind in ('cast', 'crew'); mi.info in ('Horror', 'Thriller'); t.production_year > 2000; (t.title LIKE '%Freddy%' OR t.title LIKE '%Jason%' OR t.title LIKE 'Saw%')
  30c: cct1.kind = 'cast'; mi.info in ('Horror', 'Action', 'Sci-Fi', 'Thriller', 'Crime', 'War')
  projections: mi.info, mi_idx.info, n.name, t.title

Q31 join-graph: (figure omitted)
  common filters: ci.note in ('(writer)', '(head writer)', '(written by)', '(story)', '(story editor)'); cn.name LIKE 'Lionsgate%'; it1.info = 'genres'; it2.info = 'votes'; k.keyword in ('murder', 'violence', 'blood', 'gore', 'death', 'female-nudity', 'hospital')
  31a: mi.info in ('Horror', 'Thriller'); n.gender = 'm'
  31b: mc.note LIKE '%(Blu-ray)%'; mi.info in ('Horror', 'Thriller'); n.gender = 'm'; t.production_year > 2000; (t.title LIKE '%Freddy%' OR t.title LIKE '%Jason%' OR t.title LIKE 'Saw%')
  31c: mc.note NOT LIKE '%(USA)%'; mi.info in ('Horror', 'Action', 'Sci-Fi', 'Thriller', 'Crime', 'War')
  projections: mi.info, mi_idx.info, n.name, t.title

Q32 join-graph: (figure omitted)
  common filters: (no common filter predicates)
  32a: k.keyword = '10,000-mile-club'
  32b: k.keyword = 'character-name-in-title'
  projections: lt.link, t1.title, t2.title
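Q32 (like Q33 below) joins two instances of the title table, t1 and t2, through the movie_link table. As a sketch, variant 32a can be written out as follows; the join predicates are inferred from the IMDB schema rather than copied from this listing:

    -- Sketch of JOB variant 32a: the filter and projections come from the
    -- listing above, the self-join through movie_link follows the schema.
    SELECT MIN(lt.link), MIN(t1.title), MIN(t2.title)
    FROM keyword AS k, link_type AS lt, movie_keyword AS mk,
         movie_link AS ml, title AS t1, title AS t2
    WHERE k.keyword = '10,000-mile-club'   -- variant 32a
      AND mk.keyword_id = k.id
      AND t1.id = mk.movie_id
      AND ml.movie_id = t1.id
      AND ml.linked_movie_id = t2.id
      AND lt.id = ml.link_type_id;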

Q33 join-graph: (figure omitted)
  common filters: it1.info = 'rating'; it2.info = 'rating'
  33a: cn1.country_code = '[us]'; kt1.kind in ('tv series'); kt2.kind in ('tv series'); lt.link in ('sequel', 'follows', 'followed by'); t2.production_year BETWEEN 2005 AND 2008
  33b: cn1.country_code = '[nl]'; kt1.kind in ('tv series'); kt2.kind in ('tv series'); lt.link LIKE '%follow%'; t2.production_year = 2007
  33c: cn1.country_code != '[us]'; kt1.kind in ('tv series', 'episode'); kt2.kind in ('tv series', 'episode'); lt.link in ('sequel', 'follows', 'followed by'); t2.production_year BETWEEN 2000 AND 2010
  projections: cn1.name, cn2.name, mi_idx1.info, mi_idx2.info, t1.title, t2.title
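The Q33 family is fully symmetric: each of the two title instances carries its own kind_type, movie_companies/company_name, and movie_info_idx branch, and the two branches are connected only through movie_link. The following is a hedged sketch of variant 33b that uses only the filters listed above (the benchmark query may carry additional constants) and schema-derived join predicates:

    -- Sketch of JOB variant 33b: two symmetric branches around t1 and t2,
    -- connected through movie_link; filters taken from the listing above,
    -- join predicates inferred from the IMDB schema.
    SELECT MIN(cn1.name), MIN(cn2.name), MIN(mi_idx1.info), MIN(mi_idx2.info),
           MIN(t1.title), MIN(t2.title)
    FROM company_name AS cn1, company_name AS cn2, info_type AS it1, info_type AS it2,
         kind_type AS kt1, kind_type AS kt2, link_type AS lt,
         movie_companies AS mc1, movie_companies AS mc2,
         movie_info_idx AS mi_idx1, movie_info_idx AS mi_idx2,
         movie_link AS ml, title AS t1, title AS t2
    WHERE it1.info = 'rating' AND it2.info = 'rating'     -- common filters
      AND cn1.country_code = '[nl]'                       -- variant 33b
      AND kt1.kind IN ('tv series') AND kt2.kind IN ('tv series')
      AND lt.link LIKE '%follow%'
      AND t2.production_year = 2007
      -- branch around t1
      AND kt1.id = t1.kind_id
      AND t1.id = mc1.movie_id AND cn1.id = mc1.company_id
      AND t1.id = mi_idx1.movie_id AND it1.id = mi_idx1.info_type_id
      -- branch around t2
      AND kt2.id = t2.kind_id
      AND t2.id = mc2.movie_id AND cn2.id = mc2.company_id
      AND t2.id = mi_idx2.movie_id AND it2.id = mi_idx2.info_type_id
      -- link between the two branches
      AND lt.id = ml.link_type_id
      AND ml.movie_id = t1.id AND ml.linked_movie_id = t2.id;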