Logical Plan 1 Example Query
Total Page:16
File Type:pdf, Size:1020Kb
Review: Query Evaluation Steps SQL query CSE 444: Database Internals Parse & Rewrite Query Logical Select Logical Plan Query plan optimization Lecture 10 Select Physical Plan Query Optimization (part 1) Physical plan Query Execution Magda Balazinska - CSE 444, Spring 2012 1 Disk 2 What We Already Know… Example Query: Logical Plan 1 Supplier(sno,sname,scity,sstate) π Part(pno,pname,psize,pcolor) sname Supply(sno,pno,price) For each SQL query…. σ sscity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2 SELECT S.sname FROM Supplier S, Supply U WHERE S.scity='Seattle' AND S.sstate='WA’ AND S.sno = U.sno sno = sno AND U.pno = 2 There exist many logical query plan… Supplier Supply Magda Balazinska - CSE 444, Spring 2012 3 Magda Balazinska - CSE 444, Spring 2012 4 Example Query: Logical Plan 2 What We Also Know • For each logical plan… π sname • There exist many physical plans sno = sno σ sscity=‘Seattle’ ∧sstate=‘WA’ σ pno=2 Supplier Supply Magda Balazinska - CSE 444, Spring 2012 5 Magda Balazinska - CSE 444, Spring 2012 6 1 Example Query: Physical Plan 1 Example Query: Physical Plan 2 (On the fly) π sname (On the fly) π sname (On the fly) (On the fly) σ scity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2 σ scity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2 (Nested loop) (Index nested loop) sno = sno sno = sno Supplier Supply Supplier Supply (File scan) (File scan) (File scan) (Index scan) Magda Balazinska - CSE 444, Spring 2012 7 Magda Balazinska - CSE 444, Spring 2012 8 Query Optimizer Overview Today • Input: A logical query plan Estimating the cost of a query plan • Output: A good physical query plan • Basic query optimization algorithm – Enumerate alternative plans (logical and physical) – Compute estimated cost of each plan • Compute number of I/Os • Optionally take into account other resources – Choose plan with lowest cost – This is called cost-based optimization Magda Balazinska - CSE 444, Spring 2012 9 Magda Balazinska - CSE 444, Spring 2012 10 Estimating Cost of a Query Plan Access Path • We already know how to • Access path: a way to retrieve tuples from a table – Compute the cost of different operations – A file scan • In terms of number IOs – An index plus a matching selection condition • Normally should also consider CPU and network • We still need to • Index matches selection condition if it can be used – Compute cost of retrieving tuples from disk to retrieve just tuples that satisfy the condition • We can use different access paths – Example: Supplier(sid,sname,scity,sstate) • Queries have more sophisticated predicates than equality – B+-tree index on (scity,sstate) – Compute cost of a complete plan • matches scity=‘Seattle’ • does not match sid=3, does not match sstate=‘WA’ Magda Balazinska - CSE 444, Spring 2012 11 Magda Balazinska - CSE 444, Spring 2012 12 2 Access Path Selection Access Path Selectivity • Supplier(sid,sname,scity,sstate) • Access path selectivity is the number of pages retrieved if we use this access path • Selection condition: sid > 300 ∧ scity=‘Seattle’ – Most selective retrieves fewest pages • As we saw earlier, for equality predicates • Indexes: B+-tree on sid and B+-tree on scity – Selection on equality: σa=v(R) – V(R, a) = # of distinct values of attribute a – 1/V(R,a) is thus the reduction factor • Which access path should we use? – Clustered index on a: cost B(R)/V(R,a) – Unclustered index on a: cost T(R)/V(R,a) • We should pick the most selective access path – (we are ignoring I/O cost of index pages for simplicity) Magda Balazinska - CSE 444, Spring 2012 13 Magda Balazinska - CSE 444, Spring 2012 14 Selectivity for Range Predicates Back to Our Example • Selection condition: sid > 300 ∧ scity=‘Seattle’ Selection on range: σa>v(R) – Index I1: B+-tree on sid clustered – Index I2: B+-tree on scity unclustered • How to compute the selectivity? • Let’s assume • Assume values are uniformly distributed – V(Supplier,scity) = 20 • Reduction factor X – Max(Supplier, sid) = 1000, Min(Supplier,sid)=1 • X = (Max(R,a) - v) / (Max(R,a) - Min(R,a)) – B(Supplier) = 100, T(Supplier) = 1000 • Clustered index on a: cost B(R)*X • Cost I1: B(R) * (Max-v)/(Max-Min) = 100*700/999 ≈ 70 • Unclustered index on a: cost T(R)*X • Cost I2: T(R) * 1/V(Supplier,scity) = 1000/20 = 50 Magda Balazinska - CSE 444, Spring 2012 15 Magda Balazinska - CSE 444, Spring 2012 16 Selectivity with Back to Estimating Multiple Conditions Cost of a Query Plan What if we have an index on multiple attributes? • We already know how to • Ex: selection σa=v1 ∧ b= v2(R) and index on <a,b> – Compute the cost of different operations (last week) – Compute cost of retrieving tuples from disk with How to compute the selectivity? different access paths • Assume attributes are independent • X = 1 / (V(R,a) * V(R,b)) • We still need to – Compute cost of a complete plan • Clustered index on <a,b>: cost B(R)*X • Unclustered index on <a,b>: cost T(R)*X Magda Balazinska - CSE 444, Spring 2012 17 Magda Balazinska - CSE 444, Spring 2012 18 3 Computing the Cost of a Plan Projection: Cardinality of Result • Estimate cardinality in a bottom-up fashion • Output cardinality same as input cardinality – Cardinality is the size of a relation (nb of tuples) – Same number of input tuples – Compute size of all intermediate relations in plan – But tuples are smaller! • Estimate cost by using the estimated cardinalities • We learned how to compute the cost last week • But how do we compute cardinalities? Magda Balazinska - CSE 444, Spring 2012 19 Magda Balazinska - CSE 444, Spring 2012 20 Selection: Cardinality of Result Join: Cardinality of Result • Multiply input cardinality by reduction factor • For joins R ⋈ S – Similar to estimating access path selectivity – In fact, we also say operator/predicate selectivity – Take product of cardinalities of relations R and S – Condition is a = c /* value selection on R */ – Apply reduction factors for each term in join condition • Selectivity = 1/V(R,a) – Condition is a < v /* range selection on R */ – Terms are of the form: column1 = column2 • Selectivity = (v - Min(R, a))/(Max(R,a) - Min(R,a)) – Reduction: 1/ ( MAX( V(R,column1), V(S,column2)) – Multiple conditions: assume independence • Assumes each value in smaller set has a matching value • Use product of the reduction factors for the terms in the larger set (more on next slide) • Condition is a=v1 ∧ b= v2 • Selectivity = 1 / (V(R,a) * V(R,b)) Magda Balazinska - CSE 444, Spring 2012 21 Magda Balazinska - CSE 444, Spring 2012 22 Assumptions Selectivity of R ⨝A=B S • Containment of values: if V(R,A) <= V(S,B), then Assume V(R,A) <= V(S,B) the set of A values of R is included in the set of • Each tuple t in R joins with T(S)/V(S,B) tuple(s) in S B values of S – Note: this indeed holds when A is a foreign key in R, • Hence T(R ⨝A=B S) = T(R) T(S) / V(S,B) and B is a key in S In general: T(R ⨝A=B S) = T(R) T(S) / max(V(R,A),V(S,B)) • Preservation of values: for any other attribute C, V(R ⨝A=B S, C) = V(R, C) (or V(S, C)) Magda Balazinska - CSE 444, Spring 2012 23 Magda Balazinska - CSE 444, Spring 2012 24 4 T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 11 T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10 Complete Example V(Supply,pno) = 2,500 Physical Query Plan 1 SELECT sname Supplier(sid, sname, scity, sstate) (On the fly) π sname Selection and project on-the-fly FROM Supplier x, Supply y Supply(sid, pno, quantity) -> No additional cost. WHERE x.sid = y.sid (On the fly) • Some statistics and y.pno = 2 – T(Supplier) = 1000 records and x.scity = ‘Seattle’ σ scity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2 and x.sstate = ‘WA’ – T(Supply) = 10,000 records Total cost of plan is thus cost of join: – B(Supplier) = 100 pages = B(Supplier)+B(Supplier)*B(Supply)/(M-1) – B(Supply) = 100 pages (Nested loop) sno = sno = 100 + 100 * 100/10 – V(Supplier,scity) = 20, V(Suppliers,state) = 10 = 1,100 I/Os – V(Supply,pno) = 2,500 – Both relations are clustered Supplier Supply • M = 11 (File scan) (File scan) Magda Balazinska - CSE 444, Spring 2012 25 Magda Balazinska - CSE 444, Spring 2012 26 T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 11 M = 11 T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10 V(Supply,pno) = 2,500 Plan 2 with Different Numbers Physical Query Plan 2 Total cost Total cost (On the fly) π sname (d) What if we had: π sname (d) = 100 + 100 * 1/20 * 1/10 (a) = 10000 + 50 (a) 10K pages of Supplier + 100 + 100 * 1/2500 (b) + 10000 + 4 (b) 10K pages of Supply + 2 (c) + 3*50 + 4 (c) (Sort-merge join) (c) (Sort-merge join) (c) sno = sno + 0 (d) sno = sno + 0 (d) (Scan Total cost ≈ 204 I/Os (Scan Total cost ≈ 20,208 I/Os write to T1) (Scan write to T1) (Scan (a) σ write to T2) (a) σ write to T2) scity=‘Seattle’ ∧sstate=‘WA’ (b) σ pno=2 scity=‘Seattle’ ∧sstate=‘WA’ (b) σ pno=2 Need to do a two- Supplier Supply Supplier Supply pass sort algorithm (File scan) (File scan) (File scan) (File scan) Magda Balazinska - CSE 444, Spring 2012 27 Magda Balazinska - CSE 444, Spring 2012 28 T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 11 T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10 V(Supply,pno) = 2,500 Simplifications Physical Query Plan 3 (On the fly) d Total cost ( ) π sname • In the previous examples, we assumed that all = 1 (a) (On the fly) + 4 (b) index pages were in memory (c) σ scity=‘Seattle’ ∧sstate=‘WA’ + 0 (c) + 0 (d) • When this is not the case, we need to add the Total cost ≈ 5 I/Os (b) cost of fetching index pages from disk sno = sno (Index nested loop) (Use hash index) 4 tuples (a) σ pno=2 Supply Supplier (Hash index on pno ) (Hash index on sno) 29 Magda