Query Optimization

CPS 196.3: Introduction to Database Systems

Query optimization
• One logical plan → “best” physical plan
• Questions
  - How to enumerate possible plans
  - How to estimate costs
  - How to pick the “best” one
• Often the goal is not getting the optimum plan, but instead avoiding the horrible ones
[Figure: plans placed on a running-time scale (1 second, 1 minute, 1 hour), annotated “Any of these will do”]

Plan enumeration in relational algebra
• Apply relational algebra equivalences
  → Join reordering: × and ⋈ are associative and commutative (except column ordering, but that is unimportant)
[Figure: a chain of equivalent join trees over R, S, and T, “= = = …”]

More relational algebra equivalences
• Convert σ_p-× to/from ⋈_p: σ_p(R × S) = R ⋈_p S
• Merge/split σ’s: σ_p1(σ_p2 R) = σ_{p1 ∧ p2} R
• Merge/split π’s: π_L1(π_L2 R) = π_L1 R, where L1 ⊆ L2
• Push down/pull up σ: σ_{p ∧ pr ∧ ps}(R ⋈ S) = (σ_pr R) ⋈_p (σ_ps S), where
  - pr is a predicate involving only R columns
  - ps is a predicate involving only S columns
  - p is a predicate involving both R and S columns
• Push down π: π_L(σ_p R) = π_L(σ_p(π_{L ∪ L’} R)), where
  - L’ is the set of columns referenced by p that are not in L
• Many more (seemingly trivial) equivalences…
  - Can be systematically used to transform a plan to new ones

Relational query rewrite example
• Start:
    π_title(σ_{Student.name = "Bart" ∧ Student.SID = Enroll.SID ∧ Enroll.CID = Course.CID}((Student × Enroll) × Course))
• Push down σ:
    π_title(σ_{Enroll.CID = Course.CID}(σ_{Student.SID = Enroll.SID}((σ_{Student.name = "Bart"} Student) × Enroll) × Course))
• Convert σ_p-× to ⋈_p:
    π_title(((σ_{name = "Bart"} Student) ⋈_{Student.SID = Enroll.SID} Enroll) ⋈_{Enroll.CID = Course.CID} Course)

Heuristics-based query optimization
• Start with a logical plan
• Push selections/projections down as much as possible
  - Why? Reduce the size of intermediate results
  - Why not? May be expensive; maybe joins filter better
• Join smaller relations first, and avoid cross product
  - Why? Reduce the size of intermediate results
  - Why not? Size depends on join selectivity too
• Convert the transformed logical plan to a physical plan (by choosing appropriate physical operators)

SQL query rewrite
• More complicated: subqueries and views divide a query into nested “blocks”
  - Processing each block separately forces particular join methods and join order
  - Even if the plan is optimal for each block, it may not be optimal for the entire query
• Unnest query: convert subqueries/views to joins
  → We can just deal with select-project-join queries
  - Where the clean rules of relational algebra apply

SQL query rewrite example
• Original query:
    SELECT name
    FROM Student
    WHERE SID = ANY (SELECT SID FROM Enroll);
• Rewrite as a join:
    SELECT name
    FROM Student, Enroll
    WHERE Student.SID = Enroll.SID;
  - Wrong: consider two Bart’s, each taking two classes
• Rewrite with duplicate elimination:
    SELECT name
    FROM (SELECT DISTINCT Student.SID, name
          FROM Student, Enroll
          WHERE Student.SID = Enroll.SID);
  - Right, assuming Student.SID is a key

Dealing with correlated subqueries
• Original query:
    SELECT CID FROM Course
    WHERE title LIKE 'CPS%'
      AND min_enroll > (SELECT COUNT(*) FROM Enroll
                        WHERE Enroll.CID = Course.CID);
• Decorrelated version:
    SELECT CID
    FROM Course, (SELECT CID, COUNT(*) AS cnt
                  FROM Enroll GROUP BY CID) t
    WHERE t.CID = Course.CID AND min_enroll > t.cnt
      AND title LIKE 'CPS%';
  - New subquery is inefficient (computes enrollment for all courses)
  - Suppose a CPS class is empty?

“Magic” decorrelation
• Original query:
    SELECT CID FROM Course
    WHERE title LIKE 'CPS%'
      AND min_enroll > (SELECT COUNT(*) FROM Enroll
                        WHERE Enroll.CID = Course.CID);
• Process the outer query without the subquery:
    CREATE VIEW Supp_Course AS
      SELECT * FROM Course WHERE title LIKE 'CPS%';
• Collect bindings:
    CREATE VIEW Magic AS
      SELECT DISTINCT CID FROM Supp_Course;
• Evaluate the subquery with bindings:
    CREATE VIEW DS AS
      (SELECT Enroll.CID, COUNT(*) AS cnt
       FROM Magic, Enroll WHERE Magic.CID = Enroll.CID
       GROUP BY Enroll.CID)
      UNION
      (SELECT Magic.CID, 0 AS cnt FROM Magic
       WHERE Magic.CID NOT IN (SELECT CID FROM Enroll));
• Finally, refine the outer query:
    SELECT Supp_Course.CID FROM Supp_Course, DS
    WHERE Supp_Course.CID = DS.CID
      AND min_enroll > DS.cnt;

Heuristics- vs. cost-based optimization
• Heuristics-based optimization
  - Apply heuristics to rewrite plans into cheaper ones
• Cost-based optimization
  - Rewrite logical plan to combine “blocks” as much as possible
  - Optimize query block by block
    * Enumerate logical plans (already covered)
    * Estimate the cost of plans
    * Pick a plan with acceptable cost
  - Focus: select-project-join blocks

Cost estimation
• Physical plan example:
    PROJECT (title)
      MERGE-JOIN (CID)
        SORT (CID)
          MERGE-JOIN (SID)        <- input to SORT(CID)
            SORT (SID)
              SCAN (Enroll)
            FILTER (name = "Bart")
              SCAN (Student)
        SCAN (Course)
• We have: cost estimation for each operator
  - Example: SORT(CID) takes 2 × B(input)
    * But what is B(input)?
• We need: size of intermediate results

Selections with equality predicates
• Q: σ_{A = v} R
• Suppose the following information is available
  - Size of R: |R|
  - Number of distinct A values in R: |π_A R|
• Assumptions
  - Values of A are uniformly distributed in R
  - Values of v in Q are uniformly distributed over all R.A values
• |Q| ≈ |R| / |π_A R|
  - Selectivity factor of (A = v) is 1 / |π_A R|

Conjunctive predicates
• Q: σ_{A = u and B = v} R
• Additional assumptions
  - (A = u) and (B = v) are independent
    * Counterexample: major and advisor
  - No “over”-selection
    * Counterexample: A is the key
• |Q| ≈ |R| / (|π_A R| · |π_B R|)
  - Reduce total size by all selectivity factors

Negated and disjunctive predicates
• Q: σ_{A ≠ v} R
  - |Q| ≈ |R| · (1 - 1 / |π_A R|)
    * Selectivity factor of ¬p is (1 - selectivity factor of p)
• Q: σ_{A = u or B = v} R
  - |Q| ≈ |R| · (1 / |π_A R| + 1 / |π_B R|)?
    * No! Tuples satisfying (A = u) and (B = v) are counted twice
  - |Q| ≈ |R| · (1 - (1 - 1 / |π_A R|) · (1 - 1 / |π_B R|))
    * Intuition: (A = u) or (B = v) is equivalent to ¬(¬(A = u) AND ¬(B = v))

Range predicates
• Q: σ_{A > v} R
• Not enough information!
  - Just pick, say, |Q| ≈ |R| · 1/3
• With more information
  - Largest R.A value: high(R.A)
  - Smallest R.A value: low(R.A)
  - |Q| ≈ |R| · (high(R.A) - v) / (high(R.A) - low(R.A))
  - In practice: sometimes the second highest and lowest are used instead
    * The highest and the lowest are often used by inexperienced database designers to represent invalid values!

Two-way equi-join
• Q: R(A, B) ⋈ S(A, C)
• Assumption: containment of value sets
  - Every tuple in the “smaller” relation (one with fewer distinct values for the join attribute) joins with some tuple in the other relation
  - That is, if |π_A R| ≤ |π_A S| then π_A R ⊆ π_A S
  - Certainly not true in general
  - But holds in the common case of foreign key joins
• |Q| ≈ |R| · |S| / max(|π_A R|, |π_A S|)
  - Selectivity factor of R.A = S.A is 1 / max(|π_A R|, |π_A S|)

Multiway equi-join
• Q: R(A, B) ⋈ S(B, C) ⋈ T(C, D)
• What is the number of distinct C values in the join of R and S?
• Assumption: preservation of value sets
  - A non-join attribute does not lose values from its set of possible values
  - That is, if A is in R but not S, then π_A(R ⋈ S) = π_A R
  - Certainly not true in general
  - But holds in the common case of foreign key joins

Multiway equi-join (cont’d)
• Q: R(A, B) ⋈ S(B, C) ⋈ T(C, D)
• Start with the product of relation sizes
  - |R| · |S| · |T|
• Reduce the total size by the selectivity factor of each join predicate
  - R.B = S.B: 1 / max(|π_B R|, |π_B S|)
  - S.C = T.C: 1 / max(|π_C S|, |π_C T|)
  - |Q| ≈ (|R| · |S| · |T|) / (max(|π_B R|, |π_B S|) · max(|π_C S|, |π_C T|))

Cost estimation: summary
• Using similar ideas, we can estimate the size of projection, duplicate elimination, union, difference, aggregation (with grouping)
• Lots of assumptions and very rough estimation
  - Accurate estimate is not needed
  - Maybe okay if we overestimate or underestimate consistently
  - May lead to very nasty optimizer “hints”
      SELECT * FROM Student WHERE GPA > 3.9;
      SELECT * FROM Student WHERE GPA > 3.9 AND GPA > 3.9;
• Not covered: better estimation using histograms

Search for the best plan
• Huge search space
• “Bushy” plan example:
[Figure: a bushy join tree over R1, …, R5]
• Just considering different join orders, there are close to (n - 1)! · 4^(n - 1) bushy plans for R1 ⋈ … ⋈ Rn
  - 30240 for n = 6
• And there are more if we consider:
  - Multiway joins
  - Different join methods
  - Placement of selection and projection operators

Left-deep plans
[Figure: a left-deep join tree, ((((R2 ⋈ R1) ⋈ R3) ⋈ R4) ⋈ R5)]
• Heuristic: consider only “left-deep” plans, in which only the left child can be a join
  - Tend to be better than plans of other shapes, because many join algorithms scan inner (right) relation multiple times; you will not want it to be a complex subtree
• How many left-deep plans are there for R1 ⋈ … ⋈ Rn?
  - Significantly fewer, but still lots: n! (720 for n = 6)

A greedy algorithm
• S1, …, Sn
  - Say selections have been pushed down; i.e., Si = σ_p Ri
• Start with the pair Si, Sj with the smallest estimated size for Si ⋈ Sj
• Repeat until no relation is left: Pick Sk from the remaining relations such that the join of Sk and the current result yields an intermediate result of the smallest size
[Figure: the current subplan is joined with one of the remaining relations to be joined (…, Sk, Sl, Sm, …); annotations: “Pick most efficient join method”, “Minimize expected size”]

A dynamic programming approach
• Generate optimal plans bottom-up
  - Pass 1: Find the best single-table plans (for each table)
  - Pass 2: Find the best two-table plans (for each pair of tables) by combining best single-table plans
  - …
  - Pass k: Find the best k-table plans (for each combination of k tables) by combining two smaller best plans found in previous passes
  - …
• Rationale: Any subplan of an optimal plan must also be optimal (otherwise, just replace the subplan to get a better overall plan)
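The SQL query rewrite example (two Bart’s, each taking two classes) can be checked with a small sketch using Python’s built-in sqlite3 module. The rows and course IDs are hypothetical, and because SQLite has no `= ANY` operator the subquery is written with the equivalent `IN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (SID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Enroll (SID INTEGER, CID TEXT);
    -- Hypothetical data: two students named Bart, each taking two classes.
    INSERT INTO Student VALUES (1, 'Bart'), (2, 'Bart'), (3, 'Lisa');
    INSERT INTO Enroll VALUES (1, 'CPS196'), (1, 'CPS116'),
                              (2, 'CPS196'), (2, 'CPS130');
""")

def names(sql):
    """Run a query and return the first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# Original query (IN stands in for "= ANY", which SQLite does not support).
subquery = names(
    "SELECT name FROM Student WHERE SID IN (SELECT SID FROM Enroll)")

# Naive unnesting: one output row per enrollment, so duplicates appear.
naive_join = names(
    "SELECT name FROM Student, Enroll WHERE Student.SID = Enroll.SID")

# Unnesting with DISTINCT on the key restores one row per student.
distinct_join = names("""
    SELECT name FROM (SELECT DISTINCT Student.SID, name
                      FROM Student, Enroll
                      WHERE Student.SID = Enroll.SID)
""")

print(subquery)       # ['Bart', 'Bart']
print(naive_join)     # ['Bart', 'Bart', 'Bart', 'Bart']
print(distinct_join)  # ['Bart', 'Bart']
```

The naive join returns four rows instead of two, which is exactly the error the example warns about; the DISTINCT version matches the original only because SID is a key of Student.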
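The selectivity rules from the cost-estimation slides (equality, conjunction, negation, disjunction, range, and equi-join) can be written down as a minimal sketch. The function names, the clamping in `sel_gt`, and all statistics in the example are assumptions made for illustration, not part of the slides.

```python
from math import prod

def sel_eq(n_distinct):
    """(A = v): 1 / |pi_A R|, assuming uniformly distributed values."""
    return 1.0 / n_distinct

def sel_and(*sels):
    """Conjunction of independent predicates: multiply selectivity factors."""
    return prod(sels)

def sel_not(sel):
    """Negation: 1 - selectivity of the negated predicate."""
    return 1.0 - sel

def sel_or(*sels):
    """Disjunction via De Morgan: 1 - product of (1 - s_i)."""
    return 1.0 - prod(1.0 - s for s in sels)

def sel_gt(v, low, high):
    """(A > v) given low(R.A) and high(R.A).
    Falls back to 1/3 when the range is unusable and clamps the result
    to [0, 1]; both choices are assumptions of this sketch."""
    if high <= low:
        return 1.0 / 3.0
    return min(1.0, max(0.0, (high - v) / (high - low)))

def sel_join(n_distinct_left, n_distinct_right):
    """Equi-join predicate: 1 / max of the two distinct-value counts."""
    return 1.0 / max(n_distinct_left, n_distinct_right)

def estimate(base_size, *sels):
    """Reduce a base size by every selectivity factor."""
    return base_size * prod(sels)

# Hypothetical statistics: |R| = 1000, |pi_A R| = 10, |pi_B R| = 4.
R, dA, dB = 1000, 10, 4
print(round(estimate(R, sel_eq(dA))))                       # 100
print(round(estimate(R, sel_and(sel_eq(dA), sel_eq(dB)))))  # 25
print(round(estimate(R, sel_not(sel_eq(dA)))))              # 900
print(round(estimate(R, sel_or(sel_eq(dA), sel_eq(dB)))))   # 325
print(round(estimate(R, sel_gt(75, 0, 100))))               # 250

# Three-way join R(A,B), S(B,C), T(C,D) with hypothetical statistics:
# |R| = 1000, |S| = 2000, |T| = 500, distinct B: 50 vs 100, distinct C: 20 vs 25.
print(round(estimate(1000 * 2000 * 500, sel_join(50, 100), sel_join(20, 25))))  # 400000
```

Results are rounded because the factors are floating-point; the estimates themselves are rough by design, so the rounding loses nothing meaningful.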
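The plan counts quoted for n = 6 (30240 bushy plans, 720 left-deep plans) can be reproduced with a short sketch. The exact bushy-plan formula used here, (2(n - 1))! / (n - 1)!, is not stated on the slides; it is the standard count of ordered binary trees over n labeled leaves and is included as an assumption that happens to match the quoted figure.

```python
from math import factorial

def count_left_deep(n):
    """Left-deep plans: one per ordering of the n relations."""
    return factorial(n)

def count_bushy(n):
    """Bushy plans: binary tree shapes with n leaves (Catalan(n - 1))
    times the n! ways to assign relations to leaves, which simplifies
    to (2(n - 1))! / (n - 1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in range(2, 7):
    print(n, count_left_deep(n), count_bushy(n))
# The last line printed is: 6 720 30240
```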
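The greedy algorithm and the dynamic programming approach can be compared on a toy instance. This is a minimal sketch under stated assumptions: the relation sizes and join selectivities are hypothetical, the cost of a plan is taken to be the sum of its estimated intermediate result sizes, and join methods are ignored.

```python
from itertools import combinations
from math import prod

# Hypothetical statistics: sizes after selections were pushed down, and the
# selectivity factor of each join predicate.
SIZES = {"Student": 10, "Enroll": 5000, "Course": 200}
JOIN_SEL = {
    frozenset({"Student", "Enroll"}): 1 / 1000,
    frozenset({"Enroll", "Course"}): 1 / 200,
}

def est_size(rels):
    """Product of relation sizes times the selectivity of every join
    predicate whose two relations are both present. Pairs without a
    predicate contribute a factor of 1 (a cross product)."""
    rels = frozenset(rels)
    sels = [s for pair, s in JOIN_SEL.items() if pair <= rels]
    return prod(SIZES[r] for r in rels) * prod(sels)

def greedy_order():
    """Start with the pair of smallest estimated join size, then keep
    adding the relation that yields the smallest intermediate result."""
    first = min(combinations(sorted(SIZES), 2), key=est_size)
    order, remaining = list(first), set(SIZES) - set(first)
    while remaining:
        nxt = min(sorted(remaining), key=lambda r: est_size(order + [r]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def dp_best_plan():
    """Pass k finds the best plan for every k-table subset by combining
    two smaller best plans found in earlier passes."""
    names = sorted(SIZES)
    best = {frozenset({r}): (0.0, r) for r in names}  # pass 1
    for k in range(2, len(names) + 1):
        for subset in map(frozenset, combinations(names, k)):
            candidates = []
            for j in range(1, k):
                for left in map(frozenset, combinations(sorted(subset), j)):
                    right = subset - left
                    cost = best[left][0] + best[right][0] + est_size(subset)
                    candidates.append((cost, (best[left][1], best[right][1])))
            best[subset] = min(candidates, key=lambda c: c[0])
    return best[frozenset(names)]

print(greedy_order())   # ['Enroll', 'Student', 'Course']
cost, plan = dp_best_plan()
print(round(cost), plan)
```

On this instance both strategies join Student and Enroll first and avoid the Student × Course cross product. In general the greedy choice is not guaranteed to match the dynamic programming result, since it never revisits an earlier decision.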
