Measuring the Complexity of Join in Query Optimization Kiyoshi Ono I, Guy M. Lehman IRM Almaden Research Center KS5/80 I, 650 Harry Road San ,Josc, CA 95120

A b&act Introduction Since relational management systems typically A query optimizer in a relational DRMS translates support only diadic join operators as primitive opera- non-procedural queries into a pr0cedura.l plan for ex- tions, a query optimizer must choose the “best” sc- ecution, typically hy generating many alternative plans, quence of two-way joins to achieve the N-way join of estimating the execution cost of each, and choosing tables requested by a query. The computational com- the plan having the lowest estimated cost. Increasing plexity of this optimization process is dominated by this set offeasilile plans that it evaluates improves the the number of such possible sequences that must bc chances - but dots not guarantee! - that it will find evaluated by the optimizer. This paper describes and a bcttct plan, while increasing the (compile-time) cost measures the performance of the Starburst join enu- for it to optimize the query. A major challenge in the merator, which can parameterically adjust for each design of a query optimizer is to ensure that the set of query the space of join sequences that arc evaluated feasible plans contains cflicient plans without making by the optimizer to allow or disallow (I) composite the :set too big to he gcncratcd practically. tables (i.e., tables that are themselves the result of a join) as the inner operand of a join and (2) joins between two tables having no join predicate linking them (i.e., Cartesian products). To limit the size of their optimizer’s search space, most earlier systems ex- One of Ihe major decisions an optitnizer must make cludcd both of these types of plans, which can exccutc is the order in which to join the tahlrs referenced in significantly faster for some queries. Dy experimentally the query. Since the join operation is implemented in varying the parameters of the Starburst join enumerator, most systems as a diadic (2-way) operator, the optimizer we have validated analytic formulas for the number of must generate plans that achieve N-way joins as a join sequcnccs under a variety of conditions, and have scqucncc of 2-way joins. When joining more than a proven their dependence upon the “shape” of the query. few tables, the numhcr of such possible sequences is Specifically, ‘linear” queries, in which tables arc con- the dominant factor in tlic number of altcmative plans: nectcd by binary predicates in a straight lint, can hc Ri! diffcrcnt scqucnccs are possible for joining N tables. I:vcn ~hcn dynamic programming is used, as Systcrn optimized in polynomial time. llence the dynamic R ISEl 791 and most current products do, thcorcticians programming techniques of System R and R* can still have, usrd the cxponcntinl worst case complexity to be used to optimize linear queries of as many as 100 argue that heuristic search methods should be used. tables in a reasonable amount of time! 1 Iov:cvl:r, these rncthods cannot guarantee o~firnality of their solution. :I? can dynamic programming. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distrihutetl fog I:or ttii!: rcasoti, tii:rtiy existing optimizers use licuristics direct commercial advantage. the VLDB copyright notice and \\ithiii c!):nnmic programming to limit the join sc- the title of the publication and its date appear. and notice is given qt~ericc:; cvalrtatctl. C)nc heuristic ctnploycd by System that copying is by permission of the Very Large Data Bahc I< ISI’. ,791 and IX* 11 ,OllK5l constructs only joins in Endowment. To copy otherwise. or to republish. requires a fee which :I single tahlc is join4 at each step with the and/or special permission from the Endowment. results o.,f previous joins. in a pipclincd fashion. This gL>rlrratlr:; plans st~h :IS (((Tl I;4 ‘1‘2) r4 '1'3) M '1‘4), :~ntl nvcbids sc~~llcd composite inntw (often referred to Proceedings of the 16th VLDB Conference as "!wshy trees” ICiRA87]), in which the inner table Brisbane, Australia 1990

I Current address: Tokyo Research 1.ahoralnry I IIM Japan, I.lrJ. 5.10. Snnhnchn Chiyoda-ku Tokyo 102, JAPAN

Introduction 314 (the second operand of the join) is the result of a join optirni;)cr treats the latter predicate in a plan such as that must be materialized in memory or - if it is too (‘1’1 M ‘1’2) I‘;d ‘1’3) as a restriction of the Cartesian big - on disk. The heuristic saves this materialization, product between (Tl r>a T2) and ‘I’3, rather than as but may exclude better plans for certain qucrics. As a join predicate. ‘t’hus, in conjunction with the heuristic an example, suppose a query with four large tables ‘1’1, that defers Cartesian products, any plan that might T2, 7’3, and T4 has two predicates Tl.Cl = ‘1’2.C2 join T2 and ‘1‘3 first is not considered, even if it dom- and ‘1’3.C3 = T4.C4 that are extremely selective (rc- inates all plans joining ‘I’1 and ‘I’2 first. strictive), and one T2.C6 = T4.C8 that is not. ‘I’hcn the plan ((Tl W T2) Cd (T3 fX ‘1’4)) with a com- Another limitation in handling predicates is that many posite inner (T3 W 1’4) is likely to be better than any systems do not derive impliedpredicates, which are not plan avoiding the composite inner, such as ((Tl 00 specified but arc implied by the predicates given by the ‘1’2) W T4) W T3, because the intermediate results user. For instance, two predicates Tl.Cl = T2.C2 of the first plan would all be significantly smaller than and T2.C2 = ‘1‘3.C3 imply a new predicate, Tl.Cl = any in the second plan. ‘1’3.(:3. Without this implied predicate, System R would avoid joining ‘1’1 with 1’3 until either had been Another major heuristic employed by both System R joined with ‘1’2. A few systems have relaxed these [%I,791 and INURES [WON761 always defers limitations somewhat. For example, commercial Cartesian products as late in the join sequence as pos- INGRI3S can generate the join between Tl and 1’3 by siblc, assuming that they result in large intermediate using an attribute equivahce c1u.r.r[ KOO80], although tables because there is no join predicate that restricts it does not apply the equivalence classes to more gen- the result (and hence every row in one table is joined eral forms of prcdicatcs - such as T2.C2 = ‘1’3.C3 + with every row in the other table). Again, this heuristic 1 - to derive an impticd predicate 1’1 .Cl = T3.C3 + may exclude the optimal plan for certain queries that 1, bccausc join predicates are still restricted to be equal- can benefit from Cartesian products. For instance, if ity predicates between two columns. the tables to be joined are small, and especially where they contain only one tuplc each, a Cartesian product is quite inexpensive. And its result may even have Solrrtion: An aciaptahie search space columns forming a composite key for another, much larger table to be accessed later, thus making the Although evaluating more plans may find a more ef- Cartesian product more advantageous. As an example, ficient plan for cxccuting some queries, it also increases consider a query with three tables T 1, T2, tind T3, and the cost of optimizing a given query. Ilence it is two prcdicatcs T1.C 1 = T2.C2 and T2.C3 = T3.C4. important to bc abtc to adjust the number of alternative The plan ((Tl W T3) Bl T2) is potentially the best plans considered by the optimizer for specific applica- plan if Tl and ‘1’3 are very small (or made so by tions and queries. For example, traditional batch- additional single-table predicates) and there is a multi- oricntccl applications can totcratc longer optimization column index for ‘1’2 on columns C2 and C3, even times than can intcractivc applications, for which the though it requires the Cartesian product of ‘1’1 and T3. optimization time is as important as the execution This will be illustrated concretely in Section 3.1. time. This adaptability (or customizability) of a query optimizer is even more critical for non-traditional ap- To make matters worse, many existing query optimizers plications, such as decision support systems and expert severely restrict the class of prcdicntcs that qualify as systems, and for qucrics gcncratcd automatically by a thcsc critical join predicates to be simple cquijoiris of user front end. ‘l‘hcsc applications tend to pose very the form “CohtmnI = Cdumn2”, excluding any prctli- complex queries rcfcrring to more tables than traditional cates that reference more than two tables or in\olvc applications /KR184, S\VA88]; without some of the arithmetic on the column values. Conscqucntly, what abol:c heuristics, it might bc infcasiblc to evaluate all may appear to the user as a perfectly good join prcdicntc the possible join scqucnccs2. is not treated as such by the system [1,01186], and a suboptimal join sequence results. For example, for a I-vcn within a single 1ypc of application, certain queries query with three tables 1’1, T2, and ‘1‘3, and two prcd- might tcncfit more from considering more alternatives icates Tl.Cl = ‘1’2.C2 and ‘1’2.C4 + 3 = ‘1‘3.C5, the than others. A query whose best plan is expected to

2 Whether the optimizer can accurately estimate the number of rows rczu!tin;! fr )m sucl~ ccrmplcu cluerics is an orthogonal issue, and will in any case not prevenj applications from posing these queries!

Introduction 315 run for 15 minutes is more likely to profit substantially join scqucnccs that the optimizer will consider for any from an additional minute of optimization than is a given query, to include or exclude consideration of query estimated to execute in under a minute! ‘l‘he composite inncrs and Cartesian products. Thus, the space of alternative join sequences must thcrcfore be level of optimization effort can be tailored individually adjustable for each query. to the query, either manually by the user or automat- ically by the system. In addition, the modular design of the Starburst join enumerator, described below, fa- Previous wovli cilitates the addition or repiaccmcnt of feasibility criteria for particular applications, exploits as join predicates Several authors have considered the benefits of increas- any implied prcdicatcs and predicates involving more ing or decreasing the space of possible join scquenccs than two tables or arit hmctic, and places no algorithmic through which the optimizer searches. In the original limit. on the number of tables in a query. And unlike (University) INGRES, intermediate results were always many other optimizers that arc limited to simple materialized so that the optimizer could assess the next SE1 .E<3’l’-I’I~O.lIJ~T1’-.IOIN queries, the Starburst op- join to perform based upon the size of the composite timizer correctly processes queries involving nested [WON76]. The commercial version of INGRIS allows subqucries and correlation predicates ]I 011841. plans to contain composite inners, but does not rcquirc them to do so [KOO80]. Rosenthal’s optimizer ‘l‘hese cnhanccmrnts allowed us to measure the impact [ROS82] also included plans with composite inncrs. of varying the optimizer’s search space, both upon the IIowever, so far as is known, none of these optimizers optimization complexity and upon the resulting exe- permitted adjusting the search space used by the opti- cution time. Our cxpcrimcntal results demonstrate mizer. ‘I‘he rule-based optimizer generator built by that the complexity of optimizing a query is largely Graefe for EXODIJS was easily adapted to consider depcndent~ upon the .rAtipe qf rhc qrtery gm@, where composite inners by changing just a rule or two the shape of a query indicates how tables arc connected ]GRA87]. IIowever, the optimizer so generated was with predicates, as well as the number of tables it then fixed and could not be adjusted for different que- rcfcrcnccs. Somewhat surprisingly, the number of fca- ries without generating a new optimizer. Graefe showed sible joins does not incrcasc exponentially with the that the number of plans considered by the I;,XODIJS number of tables for certain commonly posed queries. optimizer3 went up dramatically between 4 and 7 joins For example, for fincnr guerirs, in which tables arc per query, but then (surprisingly!) leveled off from 8 connected by binary prcdicatcs in a straight line, the to 14. Graefe offered no explanation or analysis for number of feasible joins is a polynomial of the number this apparent anomaly, nor any figures for queries ex- of the tables, rvcn when composite inncrs arc allowed. ceeding 14 joins. IJsing the EXODI IS cost equations, With the same feasibility criteria, the number incrcascs Graefe also showed that plans containing composite to an exponential for stnr qrwics, in which a table at inners had significantly improved (estimated) cost only the ccntcr is connrctcd by binary prcdicatcs to each of when the number of joins per query exceeded 10. the other surrounding tables, the same as for completely Since his composites were growing monotonically, tic connected query graphs. reasoned that “bushy trees” balanced the workload bct- ter as the number of joins increased, allowing mom joins on moderately-sized intermcdiatc results. Swami ‘l’hi: paper is organizrti as follows. I?rst, a brief over- [SWA89] measured the actual execution of a large num- vicu of the design of Starburst plan optimization is bcr of queries, allowing and disallowing both Cartesian given, including a discussion of how its join cnumcrator products and composite inners. IIis results showed can bc adapted and cxtrndcd. ‘lhc rcmaindcr of the that a small percentage of the plans in the increased paper prcscnts analysis and rxpcrimcntal results on the search space are optimal. nurnbcr of fcasiblc joins for queries containing up to 110 tables, as the scnrch space is varied using Starburst’s pammctcrizcd join enumerator. WC show how the Starburst’s Advances shalt affects the number of fcasiblc joins, and how this adaptability can help hnlancc the number of fcasiblc I’he Starburst join enumerator improves upon thcsc joins and the practicality of optimizing qucrics that earlier efforts by parameterizing the space of altcrnntivc must join that many tables.

3 Actually, the average size at the end of optimlzatinrl of the data Ctructurc, called hll!Sll, which contained plans

Introduction 316 Starburst’s Adaptable Join Enumeration Iqbr eac!l .s&l pair ?f quanii/ier .rels,the join enumerator invokes the plan generator to generate and evaluate alternative Ql’l’s for that join, passing to it the quan- Overview of Starburst’s Plan Optimization tificr sets, any join predicates linking those operands, and any limitations on the order of join (discussed This paper deals only with the plan optimization phase below). As with the plans for single tables, each job, ofquery processing in Starburst, an extensible relational QEI’ returned by the plan generator specifies execution database management system being prototyped at the details for one altcrnativc, such as an order and method IBM Almadcn Research Center [IIhA90). An overview of joining the two quantifier sets, the columns that can of Starburst query processing can be found in llIAA88]. be projected out, the plan chosen for producing each For a given query, plan optimization evaluates altcrna- of the two inputs to the join, and any intermediate tive query execution plans (QEPs), and outputs for operations (such as sorts) that must be inserted to execution the QEP with the least estimated execution make the chosen join method work with the chosen cost. ‘I’he input to plan optimization is a parsed query input plans [I ,OI !88]. that is stored in an internal database called the Query Graph Model (QGM). QGM is an internal represcn- The remainder of this paper will concentrate on the tation of the semantics of an SQI, query, including its top part of plan optimization, the join enumerator, various entities, such as tables, quantifiers, and predi- whose implementation is summarized next. cates, as well as the relationships among them. A quantifier corresponds to a join variable in SQL and a fupke variable in QtJI

Starburst’s Adaptable .loin Enumeration 317 quantifier. Now, at any point in the algorithm, suppose cludes imphd pr-cdicatcs, i.e. those predicates deriv- we have constructed the optimal plan for ail fcasihlc ahlc from predicates given in the query. Starburst quantifier sets containiilg up to k - I quantifiers (k 2 2) dcrivcs and exploits implied predicates in a slightly in the quantifier set table. We can obtain ail feasible more general approach than does INURES [K0080]; joins producing k quantifiers by considering every pair for details see IONO881. lnclading implied predicates of quantifier sets having i quantifiers and k - i quan- may product bcttcr execution plans by allowing tifiers (1 I i 5 floor (k / 2)). more feasible joins.

A detailed description of the filters of the default fca- l Composite inncrs: Enumeration of composite inncrs sibility criteria is given in [ONOSX]. ‘I’he mandatory can be controlled by a parameter filters are the same as for most systems: m~imlrmn~.~iz~_of~.rmn(ler_set to the join generator. This parameter specifies the maximum size of the Disjointncss: two quantifier sets to bc joined must smalicr quantifier set of each join: if it is set to I, be disjoint4. enumeration of composite inners is disabled”; by set- ting it to some intcgcr j (I

4 Note that this condition does not exclude joins of a table \v;lh itself, Ix!ca~~sc two accccsc~ lo lhr tnhlc are rrprcscnted as two different quantifiers. Recursive joins are specifically dctcclod and excluded from this criterion.

5 This conflict occurs when there are some qtlanlificrs upon which the inner dcp:rtdc and nh~ch ~hcn~cIvcs depend upon some quantifiers in the outer.

6 Actually, the join enumerator only enumerates pairs of qunnlificr Xl! , one of which always has a single quantifier if maximum size - of -. smaller-set is set to I. In conjunction with the dcprndor. relationship, Ihc plan generator then decides which quantifier set is the kner operand of the join in any particular allernatlvc plan.

Starburst’s Adaptable Join I’numcration 318 it reduces the time to enumerate feasible joins for such as the numhcr of quantifiers, the number of pred- certain queries, as shown in Section 3.4. icatcs, and the shape of the query, indicating how the quantifiers arc connected by the predicates, and 2) join l Replacing the join feasibility filters: Because each feasibility criteria, such as whether composite inners join feasibility criterion is an independent Boolean arc allowed or not. In the following, unless otherwise function, it is easy to change the function to another noted, the de$zuIt join fiasihility criterion is used, i.e., that decides whether it is advan- composite inncrs arc allowed and the existence of at tageous to join two quantifier sets. For instance, the least one join predicate is required for each join. filter requiring join predicates could be replaced by a new function that allows joins between any two sets whose estimated number of tuplcs arc sufficiently Verifying the Gain qf Larger Search Spaces small, even though there might be no join predicates between the sets. Although often postulated as “well known” in the lit- erature, examples from real applications that beneftt l Replacing the entire enumeration algorithm: The cn- from allowing Cartesian products or composite inners tire enumeration method can bc replaced as a whole arc rarely documcntcd. llcnce we first wanted to verify without affecting the rest of the plan optimizer. empirically that increasing the search space of the join This extension might be beneficial or necessary for cnumcrator to include Cartesian products and compos- handling very large queries with heuristic methods, ite inncrs produced significantly bcttcr plans for some such as Iterative Improvement [SWA88] or the qucrics. Greedy Algorithm of the OS/2 Extended I’dition Database Management System [ 1,011891. Cartesian products

Database designers often encode wide columns in a Experimental Results large table (e.g., encode “California” as “CA” or as an integer), and put the encodings for each column in a separate table. This was done in the following example IJsing the Starburst adaptable join enumerator de- database and query, containing a large (iO,OOO-row) scribed above to vary the search space, we measured table drawn from the actual online telephone directory the complexity of enumerating the feasible joins for of IBM employees in the San Jose area. The query several sample queries. Since we wanted to concentrate uses a JXP’I‘S table of only 51 rows to encode the on the primary factor in optimization cost, namely the department name to a department number, and a complexity of enumerating join scqucnccs, we tried to NOl>J3 table of only 24 rows to encode the node minimize the impact of what join and accessmethods name to a node idcntilier’: were currently implemented in our testbed by first just counting the number of feasible joins. The total opti- Database Schema: mization time is approximately the product of the SJOIR: LAST, FIRST, MIDDLE, PHONE, number of feasible joins and the average time to gcn- OEPT, OFFICE, NOOEIO, USER10 erate plans for a single feasible join. ‘I’he latter is the OEPTS: OEPTIO, OEPTLOC, OEPTNAME time for the rule-based plan generator to construct NODES: NOOEIO, NOOENAME plans for each feasible join, which is an orthogonal Index on SJOIR- NOOEIO, OEPT issue unrelated to the question at hand, depcndcnt ----.L upon how many alternative access and join methods Query: are available to choose from, is not significantly affcctcd ;ibiCT last, first, dept, c.nodeid by the characteristics of queries, and therefore is csscn- nodes a, depts b, sjdir c tially constant for all joins. We also measured the WHERE a.nodename = 'STL VM #14' extent to which the Graph Traversal ((IT) join gcncr- AND b.deptname = 'OBZ Optimiz.' ator reduced the time to enumerate joins. AND a.nodeid = c.nodeid AND b.deptid = c.dept; The number of joins evaluated for a query dcpcnds on two classes of factors: 1) characteristics of the query, ‘l‘hic (star-shaped) query was run on Starburst, allowing

7 Of course, these cardinalities do not reflect the actual numhcr of ~hcsc entities in the San .IOSC area!

Experimental Results 319 and disallowing Cartesian products, using 16 buffer pages that were flushed before each cxccution. Only one user was active. The best plan without Cartesian products did approximately 8 titnes the work of the best plan with Cartesian products, which formed the Cartesian product of DEI’TS and NODI3S, then ac- cessed SJDIR using the index:

l@xt Query Chavactefistics qf Figure: 2: Effect of Numhcr of Quantifiers on Number of Fcasihlc .loins, for linear and Star Queries Effect of the number of quantifiers:

For a given set of join feasibility criteria, the number of feasible joins is determined primarily by two factors: the number of tables to join and the shape of the query TIIIX)REM I (Complrxity of Linrar @cries with graph. Figure 1 shows the shapes of three rcprcsentativc Composite Inncrs): Ilsing dpnmic programming to op- queries with 13 quantifiers and 12 binary prcdicatcs. timize a linear qucty with N qunntiflcrs, nnd allowing In the figure, a dot represents a quantifier, and a rec- composite inncrs (bushy trees) , rkquircs eva~imting tangle represents a binary predicate. In the finenr qucrv, (iv” - w / 6 fcnrihle joins. all 13 quantifiers are connected consecutively with 12 predicates. In the star query, the quantifier at the PROOF: I;or a linear q~~ery with N quantifiers, each center is conncctcd to 12 surrounding quantiticrs. of the K steps in the dynamic programming algorithm (2 I K < N) inductively constructs joins of K consec- The following two theorems give the relationship be- utivc quantifiers from the best plan fragments contain- tween query shape and number of quantifiers for the ing I, 2, . . . . K - 1 quantifiers constructed by the previous extreme cases of linear and star-shaped queries, when iterations. ‘I‘hrrc arc (N - K + 1) such quantifier sets constructed using dynamic programming. Note that having K consccutivc quantifiers, and each of these can thcsc calculations compute the number of feasible joins bc c:mstructcd from (K - 1) different feasible joins, bc- (in which the outer and inner quantifier sets arc not cause thcrc arc exactly (K - I) places to break the K- distinguished), i.c., the number of times in Starburst quantifier subgraph into 2 smaller pieces to bc joined. that the plan generator is called; to obtain the numhcr ‘I’hcrcforc, the total number of fcasihlc joins is: of join sequences, multiply by 2. N ‘(K-- 1)(/V---K+ 1) =(,V”-~)/h L Linear K 2

(‘OROI.I,ARY 14 ((‘omplcxity of Linrar Qucrics with- 6 Branchee Star out Compositr Innrrs): 1Ising dynnmic progrnmming to optimize n M query with N qunnt@rs, and di.vaI- lowing cnmpositc inncr.5 (hush]) trrcs) , requires cmlu- % ~ “,‘:~~,,i’~,:’ # ntinr: (N - V2 .fin.~ihf~ bins.

PROOF: ‘I‘hc proof follows from the above, noticing that thcrc arc only 2 (instead of K-l) ways to break the K-quantifier subgraph into fragments of size 1 and Figure I: Examples of Query Shapes K-l for 3

F,xperimental Results 320 THEOREM 2 (Complexity of Star Qucrics): [ising a’ynamic programming to optimize a star auczv with N quantifiers requires evaluating (N - I)2 - ,fcasihle joins.

PROOF: For a star query with N quantifiers, a feasibic quantifier set with K quantifiers is obtained by choosing (K - 1) quantifiers from the (N - 1) quantifiers sur- rounding the center (or “huh”) quantifier, which must be included in any join (hence, allowing composite inners or not mak;.sfo difference in the complexity!). Flence, there are feasible quantifier sets, each of which can be con,c I-’ru zted from (K - 1) different joins. Therefore, the total number of feasible joins is:

&K- l)(;I ;) = (N- 1)2N-2 IGgure 3: FXcct nf Query Shape on Numher of Joins

her of quantifiers and predicates constant at 13 and 12, We are intcrcsted in star queries because, f%r a given rcspcctively, and just varied the shape of the query. number of quantijiers and predicales, star queries have Figure 3 shows how fast the number of feasible joins worst case complexity. One can see this intuitively by increased as the shape of the queries varied from linear moving any edge between any node j and the hub of to star, even though ail queries have 13 quantifiers and the star so that it connects to any non-hub node k I2 binary predicates. The abscissameasures the number instead, producing a star having one fewer edge incident of branches in the query graph, varying from 1 branch upon its hub, plus edge (k,j). Now there is only one (a iincar query) to 12 of them (a stdr query). way (via k) to join j to the rest of the aiterrcd graph, yielding fewer choices than when j was linked directly to the hub. &@xt qf Feasihlity Criteria

The analytic formulas of the above theorems were then For a given query, varying the join enumerator’s pa- verified by using Starburst to optimize queries that ramoters can drastically alter the number of joins con- varied each factor individually. Since the number of sidcrcd, as demonstrated by the following experiments. feasible joins for linear and star queries arc the minimum and maximum numbers, rcspectivcly, of feasible joins Effcrt of composite inncrs for linear queries: Figure 4 for a query with a given number of quantifiers, WC shows how the number of feasible joins for linear que- show in Figure 2 the number of joins, on a logarithmic rics increasesas the maximum size of the smaller quan- scale, as a function of the number of quantifiers, for tifier set. In the figure, the total number of quantifiers both linear and star queries. The lines denote the in a query is parametcrized by N, and the abscissa number of joins predicted by Theorems 1 and 2, and rcprcscnts Starburst’s composite inner parameter, the dots and squares indicate experimental confirmation r?lnwirntan_sizeof_smalr-.~~I, the maximum number of this analysis using Starburst’s adaptabic join enu- of quantifiers allowed in the smaller quantifier set. This merator. IJxactly as predicted, the number of joins for paramctcr may vary from one, whcrc no composite star queries increases exponentially, while the number inncrs arc allowctl, to the floor of N / 2, whcrc composite for linear queries increases poiynomiaiiy, as the number inncrs of any sizr arc allowed. Again, the iincs in the of quantifiers increases. From this, WCcan conciudc figure arc obtained by our combinatorial analysis, and that our join enumeration algorithm, as wcii as any the :;quarcs indicate cxpcrimcntai confirmation of this other enumeration algorithm based upon dynamic pro- analysi: by executing the query on Starburst. As pre- gramming, can remain practical for linear queries much dictcd, the obscrvcd number of joins without composite larger than currently allowed by most relational inncrs is sibmificantiy smalicr than that with composite DBMSs, but becomes impractical for large star qucrics. inners. Note that the measured number of feasible joins for 110 quantifiers Icithout composite inners was Effect of query shape: To quantify the relationship icss than the number of joins for SO quantifiers wilh between complexity and query shape, we hcid the num- compo:itc inncrs. ‘i’hcrcfore, WCcan optitnize a larger

Experimental Results 321 number, of fcasihlc joins for a star query with N + 1 N=llO L,,_,,_.. -..-..-. .-..-..-. quantifiers, less N(N + 1) / 2. This cab be explained ,,-..L” ------u=gO lE- ..fi -.------intuitively as follows: for a star query with N + 1 quan- :’ RY ,y, ., .~...... N = 70 tificrs, any surrounding quantifier is connected to any !/,.’ othrr quantifier through the quantifier at the center. 1,’ /.-.*’ -.-.* NZ60 2 ‘I’hcrefore, the N surrounding quantifiers become fully = ill : p' 7 f connected after quantifier sets containing two quanti- l a,/’ *‘----N N = 30 3 ficrs arc formed. The number of ways to make the II 6. 5 #I adds the minor term N(N + 1) / 2, which is Z1E ; This result further confirms our use of star worst case when join predicates are rcquircd. :

0 6 10 15 P363636404550 CPU Time! .fos Enurnevntion Yaxlmum Slzmof Smallmr Puonttflrr Sat ‘I‘he time to cnumcrate joins can be reduced substan- Figure 4: FAT&t af Compasite Inner on Feasible Joins, fur tially by reducing the number of potential joins coming Linear Query from the join gcncrntor. IGgurc 5 and Figure 6 show that exploiting the algorithm’s adaptability by changing the join gcncration part of the algorithm reduced the linear query in the same amount of time as that for time to cnumerntc feasible joins. In both cases, the smaller linear queries, if we disable composite inncrs number of potential joins coming from the join gen- for the larger query. This is precisely what System R crater was. reduced without affecting the set of feasible and R* did for aN queries, but, in Starburst, our pa- joins. In the first case, where linear queries were op- rameterized adaptability gives the user control over timized, the altcrnativc (

Effect of Cartesian products: When Cartesian products between two quantifier sets arc allowed, rcgardlcss of the existence of a join predicate bctwccn the sets, thr number of feasible joins for a query arc the samr, independent of the shape of the query. The numbrr of feasible joins for this case is interesting hccausc it gives the maximum number of feasible joins for the ,_...‘. -_-- ,._..’ -* - - query. For a query with N quantifiers, the numhcr is .,..... _--- 0 1 --p--f , , , , , I I 1 (3N- 2N+’ + 1) / 2 wit/z composite inncrs, and 10 l6!XiZ525 35 40 4666666665 10 N2N-’ - N(N + 1) / 2 without composite inncrs. It is Numbw of OuantIfl~m also interesting to note that the number of fcnsiblc joins for a query with N quantifiers al/owing (‘artcsian Figurr 5: Rcductinn of fkumeration ‘I’imc by the Graph products but not composite inncrs is the same as the ‘I’r:rversal Generation Method, for I,incar Queries

Experitncntal Results 322 Reduction of enumeration time for linear queries: Figure 5 shows that the Graph Traversal (GT) join generator reduced the time to enumerate the same set of feasible joins for linear queries by a factor of 2.7 on average, and by a factor of 3.5 (from 390.0 seconds to 111.7 seconds) for a query with 70 quantifiers, in particular. In addition, the GT join generator handled larger que- ries as efficiently as smaller queries, because its average time to enumerate each feasible join was constant at about 1.8 milliseconds for ail queries, whereas the av- erage time for the default method increased from 1.8 milliseconds (for N = 10) to 6.8 milliseconds (for N = 70). The reason for these improvements is that 1E-QI the GT join generator generatesfewer pairs of potential 1 3 4 !I D 7 I D IO 11 13 13 quantifier sets than the default method, so fewer must Numbu of Ouantlfhm be filtered. J:or instance, for the query with 70 quan- tifiers, the GT generator generates only 139,265 pairs, Fignrc 6: Reduction of Ennmeration Time (on a Logarithmic of which almost half (57,155) were feasible joins, com- Scale) hy Disabling <‘omposite inners for Star Queries pared with 2,542,470 pairs by the default method.

JJowever, the GT join generator worked poorly for this reduction factor increased as the number of quan- star queries because the feasibility criterion requiring a tifiers increased. join predicate - which the GT generator exploits - is vacuously satisfied in star queries for ail pairs containing two or more quantifiers. In fact, for star queries, the Conclusions number of potential joins coming from the GT join gcncrator actually increases, because every quantifier Enutneration of the join sequences for a query is the set is put into entries for every predicate in the modified dominant factor in both the optimization time for the quantifier set table. For instance, for a star query with query and the quality of resulting execution plans. To 13 quantifiers, the GT generator increased the number generate better execution plans, the join enumeration of potential joins from 3,566,678 to 14,717,640, increas- algorithm in this paper cniargcs the set of feasible ing the total enumeration time from 164.0 seconds to plans. The resulting plans can contain composite in- 680.0 seconds. For this reason, the GT join generator ncrs, and Cartesian products can occur at any place in is invoked only for linear queries in Starburst. This join scquenccs, not ncccssariiy at the end. Furthermore, illustrates how adaptability of our join enumeration by gcncraiizing join predicates to include non-equality algorithm for a particular query rcduccs the time to predicates, impiicd predicates, and predicates on more rnumcrate feasible joins. than two tabics, the Starburst optimizer can generate more cfficicnt plans that exploit these predicates as join Reduction of enumeration time for star queries: ‘I‘he predicates, rather than having to resort to Cartesian time to enumerate fcasibie joins for star queries was product,s. reduced by disabling the enumeration of composite inncrs, as shown in Jiigurc 6. Note that a logariliunic Although enlarging fhc set of fcasibic joins gcncrally scale is used for the ordinate to show the cxponcntini mnkcs the optimization tirnc larger, we can balance increase of the time as the number of quantifiers in a the number of fcasiblc plans with the optimization star query increases, and that ail times less than 0.1 time by varying the number of feasible joins using our seconds are shown to be 0.1 seconds in the ligurc pnrame~crizccifcasibic join rritcria. ‘J‘his kind of adapt- because the resolution of the CI’JJ clock was 0. I scc- ability is important so that some queries may be op- ends. Recall that disabling the cnumcration of rom- tiinircd cxtcnsivrly into an cxtrcmciy cfficicnt plan, posite inners for star queries does not rcducc the set and complex qucncs can bc optimized at ail. of feasible joins, since one of the operands must bc a single quantifier anyway. For a star query with 13 Our cxpcrimcntal rrsults on the number of feasibic quantifiers, the enumeration time was rcduccd by a joins show that ivc can find the optimal plan using factor of 6.8 (from 164.0 seconds to 24.0 seconds), and dynamic programming for a coinpicx query referring

Conclusions to as many as 100 tables if the shape of the query is som:: of the empirical results. ‘I’hc work reported in linear or almost linear. For linear qucrics, lhc number this paper was done while the primary author was on of feasible joins can also bc controlled hy a paramc- assignment from IlIM’s Tokyo Research I aboratory. terized feasible join criterion on composite inncrs that IIc wishes to cxprcss his thanks to the members of the controls the maximum size of smaller quantifirr sets Starhurst project for providing a stimulating and sup- for a join. It can be set to any intcgcr from one (no portivc working environment during that assignment composite inners) to the maximum, which is half of at the Ahnndcn Research Center. the total number of quantifiers (full composite inncrs). Although intuitively the linear shape seems to bc the most common shape of queries, it would be an inter- esting future study to examine shapes of queries in real Bibliography applications, classify the shapes, and obtain empirical formulas for typical shapes, because the shape of a lGRA871 6. Gracfc., RuleBased Query Optimiza- query largely determines the number of feasible joins tion in Extensible Database Systems, for the query and, thus, the practicality of optimizing Comptcr Sciences Tech. Report # 724 it. (P/r11 thesis) (IJniv. of Wisconsin, Mad- ison, WI, Nov. 1987). We also measured the join enumeration time, and found that adaptability in changing part of the enu- IrIAA88l I,.M. 1lass, W.1;. Cody, S. IGnkclstein, meration method allows us to reduce the enumeration J.C. Frcytag,

Acknowledgements lKR184I R. Krishnamurthy, II. Roral, and C. Zaniolo, Optimization of Nonrecursive The authors wish to thank Laura Ilaas, John Qurrics, I’rocs. of VZ,D/J (1984). McPherson, and Pat Selinger for reviewing an earlier draft of this paper, significantly improving its readabil- M.K. I ,ec, .J.C. Frcytag, and G.M. ity. The implementation of the join enumerator was I ,ohman, Implementing an Interprctcr for facilitated considerably by the well-designed QGM data Iunctional Rules in a Query Optimizer, structures develoPed by I-larnid Pirahesh, who gcncr- Procs. of 14111VIJIII (I ,ong Reach, August ously explained and adapted them when necessary for 1988) pp. 218-229. the join enumeration algorithm. We are grateful to Laura IIaas for her technical as well as managerial (I,OII84] G.M. I ohman, I). Daniels, L,.M. IIaas, guidance and support. IIanh Nguyen and George R. Kistler, and P.G. Selinger, Optimiza- Lapis provided invaluable programming support for tion of Nested Queries in a Distributed

Bibliography 324 Relational Database, Procs. of 10th VLDB (ON0881 K. Ono and G.M. Lehman, Extensible (August 1984) pp. 403-415. Enumeration of Feasible Joins for Rela- tional Query Optimization, IBM Research IL( 1H85] G.M. Lohrnan, C. Mohan, L.M. IJaas, Report R.16625 (San Jose, CA, Dec. 1989). B.G. Lindsay, P.G. Selinger, PP. Wilms, and D. Daniels, Query Processing in R+, [ROS82] A. Rosenthal and D. Reiner, An Archi- Query Processing in Database Systems tecture For Query Optimization, Procs. (Kim, Batory, & Reiner (eds.), 1985) pp. of ACM-SIGMOD (1982). 3 l-47. Springer-Verlag, I Ieidelberg. Also available as IBM Research Report [Sf:J,791 P.G. Selinger, M.M. Astrahan, D.D. RJ4272, San Jose, CA, April 1984. Chamberlin, R.A. Lorie, and T.G. Price, Access Path Selection in a Relational Da- tabase Management System, Procs. of [J-,01186] G.M. L&man, Do Semantically Equiv- ACM-sIr;MOD (1979). alent SQL Queries Perform Differently?, Procs. of IEEL Data Engineering (Pcbru- [SWASR] A. Swami and A. Gupta, Optimization of ary 1986) pp. 225-226. Jargc Join Queries, Procs. of ACM- SIGMOD (June 1988). [J/01188] G.M. Lehman, Grammar-like J;unctional Rules for Representing Query Optimiza- ISWA891 A. Swami, Optimization of Large Join tion Alternatives, Procs. of ACM- Queries: JXstributions of Query Plan SZGMOn (June 1988). costs, Tech. Report HPL-SAL-89-24 (JJcwlett-Packard Labs, Palo Alto, CA, [I ,01189] G.M. Lohman, Is Query Optimization a June 1989). ‘Solved’ Problem? (extended abstract), Workshop on Database Query Optimiza- (WON76] IT. Wong and K. Youssefi,, Decomposi- tion (CSB Tech. Report 89-005) (Oregon tion -- a Strategy for Query Processing, Graduate Center, Portland, OR, June ACM Transactions on Database Systems 1989). I,3 (September 1976) pp. 223-241.

Bibliography 325