2 Query optimization

! One logical plan → “best” physical plan Query Optimization ! Questions " How to enumerate possible plans " How to estimate costs " How to pick the “best” one CPS 196.3 ! Often the goal is not getting the optimum plan, but Introduction to Systems instead avoiding the horrible ones Any of these will do

1 second1 minute 1 hour

3 4 Plan enumeration in More relational algebra equivalences ! ! ! Apply relational algebra equivalences ! Convert σp-× to/from p: σp(R × S) = R p S ! Merge/split σ’s: σ (σ R) = σ R # Join reordering: × and ! are associative and p1 p2 p1 ∧ p2 ! Merge/split π’s: π (π R) = π R, where L1 L2 commutative (except ordering, but that is L1 L2 L1 ⊆ ! Push down/pull up σ: unimportant) ! ! σp ∧ pr ∧ ps (R S) = (σpr R) p (σps S), where ! ! ! " pr is a predicate involving only R columns ===… " ps is a predicate involving only S columns ! ! ! " p is a predicate involving both R and S columns T T T ! Push down π: πL (σp R) = πL (σp (πL L’ R)), where " L’ is the set of columns referenced by p that are not in L R S S R S R ! Many more (seemingly trivial) equivalences… " Can be systematically used to transform a plan to new ones

5 6 Relational query rewrite example Heuristics-based query optimization

πtitle ! Start with a logical plan σStudent.name=“Bart” ∧ Student.SID = Enroll.SID ∧ Enroll.CID = Course.CID × ! Push selections/projections down as much as possible × Course " Student Enroll πtitle Why? Reduce the size of intermediate results σEnroll.CID = Course.CID " Why not? May be expensive; maybe joins filter better σ ! Push down σ × Convert p-× to p Course ! Join smaller relations first, and avoid cross product πtitle σStudent.SID = Enroll.SID ! " Why? Reduce the size of intermediate results × Enroll.CID = Course.CID Enroll Course " Why not? Size depends on selectivity too σ ! Student.name = “Bart” Student.SID = Enroll.SID ! Convert the transformed logical plan to a physical Enroll Student plan (by choosing appropriate physical operators) σname = “Bart” Student

1 7 8 SQL query rewrite SQL query rewrite example

! More complicated—subqueries and views divide a ! SELECT name query into nested “blocks” FROM Student WHERE SID = ANY (SELECT SID FROM Enroll); " Processing each block separately forces particular join ! SELECT name methods and join order FROM Student, Enroll " Even if the plan is optimal for each block, it may not be WHERE Student.SID = Enroll.SID; optimal for the entire query " Wrong—consider two Bart’s, each taking two classes ! ! Unnest query: convert subqueries/views to joins SELECT name FROM (SELECT DISTINCT Student.SID, name # We can just deal with -project-join queries FROM Student, Enroll " Where the clean rules of relational algebra apply WHERE Student.SID = Enroll.SID); " Right—assuming Student.SID is a key

9 10 Dealing with correlated subqueries “Magic” decorrelation

! SELECT CID FROM Course ! SELECT CID FROM Course WHERE title LIKE ’CPS%’ WHERE title LIKE ’CPS%’ AND min_enroll > (SELECT COUNT(*) FROM Enroll AND min_enroll > (SELECT COUNT(*) FROM Enroll WHERE Enroll.CID = Course.CID); WHERE Enroll.CID = Course.CID); ! CREATE VIEW Supp_Course AS Process the outer query SELECT * FROM Course WHERE title LIKE ’CPS%’; without the subquery ! SELECT CID FROM Course, (SELECT CID, COUNT(*) AS cnt CREATE VIEW Magic AS Collect bindings FROM Enroll GROUP BY CID) t SELECT DISTINCT CID FROM Supp_Course; WHERE t.CID = Course.CID AND min_enroll > t.cnt CREATE VIEW DS AS Evaluate the subquery (SELECT Enroll.CID, COUNT(*) AS cnt with bindings AND title LIKE ’CPS%’; FROM Magic, Enroll WHERE Magic.CID = Enroll.CID " New subquery is inefficient (computes enrollment for all courses) GROUP BY Enroll.CID) UNION (SELECT Magic.CID, 0 AS cnt FROM Magic " Suppose a CPS class is empty? WHERE Magic.CID NOT IN (SELECT CID FROM Enroll); SELECT Supp_Course.CID FROM Supp_Course, DS Finally, refine WHERE Supp_Course.CID = DS.CID the outer query AND min_enroll > DS.cnt;

11 12 Heuristics- vs. cost-based optimization Cost estimation

! Heuristics-based optimization Physical plan example: PROJECT (title) MERGE-JOIN (CID) " Apply heuristics to rewrite plans into cheaper ones SORT (CID) SCAN (Course) ! Cost-based optimization Input to SORT(CID): MERGE-JOIN (SID) " Rewrite logical plan to combine “blocks” as much as SORT (SID) FILTER (name = “Bart”) possible SCAN (Enroll) " Optimize query block by block SCAN (Student) • Enumerate logical plans (already covered) ! We have: cost estimation for each operator • Estimate the cost of plans " Example: SORT(CID) takes 2 × B(input) • Pick a plan with acceptable cost •But what is B(input)? " Focus: select-project-join blocks ! We need: size of intermediate results

2 13 14 Selections with equality predicates Conjunctive predicates

! Q: σA = v R ! Q: σA = u and B = v R ! Suppose the following information is available ! Additional assumptions " Size of R: |R| " (A = u) and (B = v) are independent • Counterexample: major and advisor " Number of distinct A values in R: |πA R| ! Assumptions " No “over”-selection • Counterexample: A is the key " Values of A are uniformly distributed in R " Values of v in Q are uniformed distributed over all R.A ! |Q| ≈ |R| Ú (|πA R| · |πB R|) values " Reduce total size by all selectivity factors

! |Q| ≈ |R| Ú |πA R|

" Selectivity factor of (A = v) is 1 Ú |πA R|

15 16 Negated and disjunctive predicates Range predicates

! ! Q: σA ≠ v R Q: σA > v R

" |Q| ≈ |R| · (1 – 1 Ú |πA R|) ! Not enough information! • Selectivity factor of ¬ p is (1 – selectivity factor of p) " Just pick, say, |Q| ≈ |R| · 1 Ú 3

! Q: σA = u or B = v R ! With more information " Largest R.A value: high(R.A) " |Q| ≈ |R| · (1 Ú |πA R| + 1 Ú |πB R|)? • No! Tuples satisfying (A = u) and (B = v) are counted twice " Smallest R.A value: low(R.A) " |Q| ≈ |R| · (high(R.A) – v) Ú (high(R.A) – low(R.A)) " |Q| ≈ |R| · (1 – (1 – 1 Ú |πA R|) · (1 – 1 Ú |πB R|)) •Intuition: (A = u) or (B = v) is equivalent to " In practice: sometimes the second highest and lowest are ¬ ( ¬ (A = u) AND ¬ (B = v)) used instead • The highest and the lowest are often used by inexperienced database designer to represent invalid values!

17 18 Two-way equi-join Multiway equi-join

! Q: R(A, B) ! S(A, C) ! Q: R(A, B) ! S(B, C) ! T(C, D) ! Assumption: containment of value sets ! What is the number of distinct C values in the join " Every tuple in the “smaller” (one with fewer of R and S? distinct values for the join attribute) joins with some tuple in the other relation ! Assumption: preservation of value sets

" That is, if |πA R| · |πA S| then πA R ⊆ πA S " A non-join attribute does not lose values from its set of " Certainly not true in general possible values ! " But holds in the common case of joins " That is, if A is in R but not S, then πA (R S) = πA R

! |Q| ≈ |R| á |S| Ú max(|πA R|, |πA S|) " Certainly not true in general " Selectivity factor of R.A = S.A is " But holds in the common case of foreign key joins 1 Ú max(|πA R|, |πA S|)

3 19 20 Multiway equi-join (cont’d) Cost estimation: summary

! Q: R(A, B) ! S(B, C) ! T(C, D) ! Using similar ideas, we can estimate the size of projection, duplicate elimination, union, difference, aggregation (with ! Start with the product of relation sizes grouping) " |R| á |S| á |T| ! Lots of assumptions and very rough estimation ! Reduce the total size by the selectivity factor of each " Accurate estimate is not needed join predicate " Maybe okay if we overestimate or underestimate consistently " May lead to very nasty optimizer “hints” " R.B = S.B: 1 Ú max(|πB R|, |πB S|) " S.C = T.C: 1 Ú max(|π S|, |π T|) SELECT * FROM Student WHERE GPA > 3.9; C C SELECT * FROM Student WHERE GPA > 3.9 AND GPA > 3.9; " | | (| | | | | |) Ú Q ≈ R á S á T ! Not covered: better estimation using histograms (max(|πB R|, |πB S|) á max(|πC S|, |πC T|))

21 22 Search for the best plan Left-deep plans ! ! ! ! Huge search space !! R5 ! R ! “Bushy” plan example: ! 4 ! R3 R2 R1 R3 R2 R1 R4 R 5 ! Heuristic: consider only “left-deep” plans, in which only the ! Just considering different join orders, there are close to left child can be a join (n –1)! á 4 n –1bushy plans for R ! ! R 1 L n " Tend to be better than plans of other shapes, because many join " 30240 for n = 6 algorithms scan inner (right) relation multiple times—you will not ! And there are more if we consider: want it to be a complex subtree " Multiway joins ! ! ! How many left-deep plans are there for R1 L Rn? " Different join methods " Significantly fewer, but still lots— n! (720 for n = 6) " Placement of selection and projection operators

23 24 A greedy algorithm A dynamic programming approach

! S1, …, Sn ! Generate optimal plans bottom-up " Say selections have been pushed down; i.e., Si = σp Ri " Pass 1: Find the best single- plans (for each table) " Pass 2: Find the best two-table plans (for each pair of tables) by ! Start with the pair Si, Sj with the smallest estimated size for ! combining best single-table plans Si Sj " … ! Repeat until no relation is left: Pick from the remaining relations such that the join of " Pass k: Find the best k-table plans (for each combination of k Sk Sk tables) by combining two smaller best plans found in previous and the current result yields an intermediate result of the passes smallest size " … Pick most efficient join method Remaining ! Rationale: Any subplan of an optimal plan must also be Minimize expected size ! …, Sk, Sl, Sm, … relations optimal (otherwise, just replace the subplan to get a better to be joined overall plan) Current subplan S k # Well, not quite…

4 25 26 The need for “interesting order” Dealing with interesting orders

! Example: R(A, B) ! S(A, C) ! T(A, D) ! When picking the best plan ! Best plan for R ! S: hash join (beats sort-merge join) " Comparing their costs is not enough ! Best overall plan: sort-merge join R and S, and then sort- • Plans are not totally ordered by cost anymore merge join with T " Comparing interesting orders is also needed " Subplan of the optimal plan is not optimal! • Plans are now partially ordered ! Why? •Plan X is better than plan Y if " The result of the sort-merge join of R and S is sorted on A – Cost of X is lower than Y " This is an interesting order that can be exploited by later – Interesting orders produced by X subsume those produced by Y processing (e.g., join, duplicate elimination, GROUP BY, ORDER BY, ! Need to keep a set of optimal plans for joining every etc.)! combination of k tables " Typically one for each interesting order

27 Summary

! Relational algebra equivalence ! SQL rewrite tricks ! Heuristics-based optimization ! Cost-based optimization " Need statistics to estimate sizes of intermediate results " Greedy approach " Dynamic programming approach

5