CMU SCS CMU SCS Today’s Class

Carnegie Mellon Univ. • History & Background Dept. of Computer Science • Relational Algebra Equivalences 15-415/615 - DB Applications • Plan Cost Estimation • Plan Enumeration C. Faloutsos – A. Pavlo Lecture#15:

Faloutsos/Pavlo CMU SCS 15-415/615 2

CMU SCS CMU SCS Query Optimization 1970s –

• Remember that SQL is declarative. • Ted Codd saw the maintenance – User tells the DBMS what answer they want, overhead for IMS/Codasyl. not how to get the answer. • Proposed abstraction based • There can be a big difference in on relations: Codd performance based on plan is used: – Store database in simple data structures. – See last week: 5.7 days vs. 45 seconds – Access it through high-level language. – Physical storage left up to implementation.

Faloutsos/Pavlo CMU SCS 15-415/615 3 Faloutsos/Pavlo CMU SCS 15-415/615 4 CMU SCS CMU SCS IBM System R IBM System R

• Skunkworks project at IBM Research in • First implementation of a query optimizer. San Jose to implement Codd’s ideas. • People argued that the DBMS could never • Had to figure out all of the things that we choose a query plan better than what a are discussing in this course themselves. human could write. • IBM never commercialized System R. • A lot of the concepts from System R’s optimizer are still used today.

Faloutsos/Pavlo CMU SCS 15-415/615 5 Faloutsos/Pavlo CMU SCS 15-415/615 6

CMU SCS CMU SCS Today’s Class Relational Algebra Equivalences

• History & Background • A query can be expressed in different • Relational Algebra Equivalences ways. • Plan Cost Estimation • The optimizer considers variations and • Plan Enumeration choose the one with the lowest cost. • Nested Sub-queries • How do we know whether two queries are equivalent?

Faloutsos/Pavlo CMU SCS 15-415/615 7 Faloutsos/Pavlo CMU SCS 15-415/615 8 CMU SCS CMU SCS Relational Algebra Equivalences Predicate Pushdown

SELECT name, cid • Two relational algebra expressions are FROM student, enrolled equivalent if they generate the same set of WHERE student.sid = enrolled.sid tuples. AND enrolled.grade = ‘A’

π name, cid π name, cid

σ grade=‘A’ sid=sid

sid=sid σ grade=‘A’ ⨝ student ⨝ enrolled student enrolled Faloutsos/Pavlo CMU SCS 15-415/615 9 Faloutsos/Pavlo CMU SCS 15-415/615 10

CMU SCS CMU SCS Relational Algebra Equivalences Relational Algebra Equivalences

SELECT name, cid FROM student, enrolled • Selections: WHERE student.sid = – Perform them early enrolled.sid AND enrolled.grade = ‘A’ – Break a complex predicate, and push down σp1 p2 …pn(R) = σp1(σp2(σ…pn(R))…)

πname, cid(σgrade=‘A’(student enrolled)) ∧ ∧ • Simplify a complex predicate = ⋈ – (X=Y AND Y=3) → X=3 AND Y=3 πname, cid(student (σgrade=‘A’ (enrolled)))

Faloutsos/Pavlo CMU⋈ SCS 15-415/615 11 Faloutsos/Pavlo CMU SCS 15-415/615 12 CMU SCS CMU SCS Relational Algebra Equivalences Projection Pushdown

SELECT name, cid • Projections: FROM student, enrolled – Perform them early WHERE student.sid = enrolled.sid • Smaller tuples AND enrolled.grade = ‘A’ • Fewer tuples (if duplicates are eliminated) – Project out all attributes except the ones π name, cid π name, cid requested or required (e.g., joining attr.) σ grade=‘A’ sid=sid

sid=sid σ grade=‘A’ • This is not important for a column store… ⨝ student ⨝ enrolled student enrolled Faloutsos/Pavlo CMU SCS 15-415/615 13 Faloutsos/Pavlo CMU SCS 15-415/615 14

CMU SCS CMU SCS Projection Pushdown Relational Algebra Equivalences

SELECT name, cid FROM student, enrolled • Joins: WHERE student.sid = – Commutative, associative enrolled.sid AND enrolled.grade = ‘A’ R S = S R

π name, cid π name, cid (R S) T = R (S T) ⋈ ⋈ σ grade=‘A’ sid=sid • Q: How many⋈ different⋈ orderings⋈ ⋈ are there sid=sid π sid, cid π sid, name for an n-way join? ⨝ σ grade=‘A’

student ⨝ enrolled student enrolled Faloutsos/Pavlo CMU SCS 15-415/615 14 Faloutsos/Pavlo CMU SCS 15-415/615 15 CMU SCS CMU SCS Relational Algebra Equivalences Today’s Class

• Joins: How many different orderings are • History & Background there for an n-way join? • Relational Algebra Equivalences • A: Catalan number ~ 4n • Plan Cost Estimation – Exhaustive enumeration: too slow. • Plan Enumeration • We’ll see in a second how an optimizer limits the search space...

Faloutsos/Pavlo CMU SCS 15-415/615 16 Faloutsos/Pavlo CMU SCS 15-415/615 17

CMU SCS CMU SCS Cost Estimation Cost Estimation – Statistics

SR • How long will a query take? • For each relation R we keep: #1 #2 – CPU: Small cost; tough to estimate – NR → # tuples #3

– Disk: # of block transfers – SR → size of tuple in bytes – Memory: Amount of DRAM used – V(A,R) → # of distinct values – Network: # of messages of attribute ‘A’ • How many tuples will be read/written? … • What statistics do we need to keep? #NR

Faloutsos/Pavlo CMU SCS 15-415/615 18 Faloutsos/Pavlo CMU SCS 15-415/615 19 CMU SCS CMU SCS Derivable Statistics Derivable Statistics

SR • FR → max# records/block • SC(A,R) → Selection Cardinality

#1 avg# of records with A=given • BR → # blocks FR → NR / V(A,R) • SC(A,R) → selection cardinality #2 • Note that this assumes data uniformity avg# of records with A=given #3 – 10,000 students, 10 colleges – how many … students in SCS?

#BR

Faloutsos/Pavlo CMU SCS 15-415/615 20 Faloutsos/Pavlo CMU SCS 15-415/615 21

CMU SCS CMU SCS Additional Statistics Statistics

• For index i: • Where do we store them? HT – Fi → average fanout (~50-100) i • How often do we update them? – HTi → # levels of index i (~2-3) ~ log(#entries)/log(F ) i • Manual invocations: – LBi # → blocks at leaf level – Postgres/SQLite: ANALYZE – MySQL: ANALYZE TABLE

Faloutsos/Pavlo CMU SCS 15-415/615 22 Faloutsos/Pavlo CMU SCS 15-415/615 23 CMU SCS CMU SCS Selection Statistics Selections – Complex Predicates

• We saw simple predicates (name=“Kayne”) • Selectivity sel(P) of predicate P: • How about more complex predicates, like == fraction of tuples that qualify – salary > 10000 • Formula depends on type of predicate. – age=30 AND jobTitle=“Costermonger” – Equality • What is their selectivity? – Range – Negation – Conjunction – Disjunction

Faloutsos/Pavlo CMU SCS 15-415/615 24 Faloutsos/Pavlo CMU SCS 15-415/615 25

CMU SCS CMU SCS Selections – Complex Predicates Selections – Complex Predicates

• Selectivity sel(P) of predicate P: • Assume that V(rating, sailors) has 5

== fraction of tuples that qualify distinct values (0–4) and NR = 5 • Formula depends on type of predicate. • Equality Predicate: A=constant – Equality – sel(A=constant) = SC(P) / V(A,R) – Range – Example: sel(rating=‘2’) = – Negation – Conjunction – Disjunction

Faloutsos/Pavlo CMU SCS 15-415/615 25 26 CMU SCS CMU SCS Selections – Complex Predicates Selections – Complex Predicates

• Assume that V(rating, sailors) has 5 • Assume that V(rating, sailors) has 5

distinct values (0–4) and NR = 5 distinct values (0–4) and NR = 5 • Equality Predicate: A=constant • Equality Predicate: A=constant – sel(A=constant) = SC(P) / V(A,R) – sel(A=constant) = SC(P) / V(A,R) – Example: sel(rating=‘2’) = – Example: sel(rating=‘2’) =

count count V(rating,R)=5

0 1 2 3 4 0 1 2 3 4 26 26 rating rating

CMU SCS CMU SCS Selections – Complex Predicates Selections – Complex Predicates

• Assume that V(rating, sailors) has 5 • Assume that V(rating, sailors) has 5

distinct values (0–4) and NR = 5 distinct values (0–4) and NR = 5 • Equality Predicate: A=constant • Equality Predicate: A=constant – sel(A=constant) = SC(P) / V(A,R) – sel(A=constant) = SC(P) / V(A,R) – Example: sel(rating=‘2’) = – Example: sel(rating=‘2’) = 1/5

SC(rating=‘2’)=1 SC(rating=‘2’)=1

count V(rating,R)=5 count V(rating,R)=5

0 1 2 3 4 0 1 2 3 4 26 26 rating rating CMU SCS CMU SCS Selections – Complex Predicates Selections – Complex Predicates

• Range Query: • Range Query:

– sel(A>a) = (Amax – a) / (Amax – Amin) – sel(A>a) = (Amax – a) / (Amax – Amin) – Example: sel(rating >= ‘2’) – Example: sel(rating >= ‘2’) = (4 – 2) / (4 – 0) = 1/2

ratingmin = 0 ratingmax = 4

count count

0 1 2 3 4 0 1 2 3 4 27 27 rating rating

CMU SCS CMU SCS Selections – Complex Predicates Selections – Complex Predicates

• Negation Query • Negation Query – sel(not P) = 1 – sel(P) – sel(not P) = 1 – sel(P) – Example: sel(rating != ‘2’) – Example: sel(rating != ‘2’)

SC(rating=‘2’)=1

count count

0 1 2 3 4 0 1 2 3 4 28 28 rating rating CMU SCS CMU SCS Selections – Complex Predicates Selections – Complex Predicates

• Negation Query • Negation Query – sel(not P) = 1 – sel(P) – sel(not P) = 1 – sel(P) – Example: sel(rating != ‘2’) – Example: sel(rating != ‘2’) = 1 – (1/5) = 4/5 • Observation: selectivity ≈ probability

SC(rating!=‘2’)=2 SC(rating!=‘2’)=2 SC(rating!=‘2’)=2 SC(rating!=‘2’)=2

count count

0 1 2 3 4 0 1 2 3 4 28 28 rating rating

CMU SCS CMU SCS Selections – Complex Predicates Selections – Complex Predicates

• Conjunction: • Disjunction: – sel(rating = ‘2’ AND name LIKE ‘C%’) – sel(rating = ‘2’ OR name LIKE ‘C%’) – sel(P P ) – sel(P1 P2) = sel(P1) · sel(P2) 1 2 = sel(P ) + sel(P ) – sel(P P ) – INDEPENDENCE ASSUMPTION 1 2 1 2 ⋀ = sel⋁(P1) + sel(P2) – sel(P1) · sel(P2) – INDEPENDENCE ASSUMPTION,⋁ again

P1 P2

Faloutsos/Pavlo CMU SCS 15-415/615 29 Faloutsos/Pavlo CMU SCS 15-415/615 30 CMU SCS CMU SCS Selections – Complex Predicates Joins

• Disjunction, in general: • Q: Given a join of R and S, what is the

– sel(P1 OR P2 OR … Pn) = range of possible result sizes in #of tuples?

– 1 - (1- sel(P1) ) · (1 - sel(P2) ) · … (1 - sel(Pn))

P1 P2

Faloutsos/Pavlo CMU SCS 15-415/615 31 Faloutsos/Pavlo CMU SCS 15-415/615 32

CMU SCS CMU SCS Result Size Estimation for Joins Result Size Estimation for Joins

• General case: Rcols Scols = {A} where A • General case: Rcols Scols = {A} where A is not a key for either table. is not a key for either table. • Hint: for a given tuple⋂ of R, how many – Match each R-tuple⋂ with S-tuples: tuples of S will it match? estSize ≈ NR · NS / V(A,S) – Symmetrically, for S:

estSize ≈ NR · NS / V(A,R) • Overall:

– estSize ≈ NR · NS / max( {V(A,S), V(A,R)} )

Faloutsos/Pavlo CMU SCS 15-415/615 33 Faloutsos/Pavlo CMU SCS 15-415/615 34 CMU SCS CMU SCS Cost Estimations Cost Estimations

• Our formulas are nice but we assume that • Our formulas are nice but we assume that data values are uniformly distributed. data values are uniformly distributed. Uniform Approximation of D Non-Uniform Approximation of D 10 10

# of occurrences8 8

6 6

4 4

2 2

0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Bucket #1 Bucket #2 Bucket #3 Bucket #4 Bucket #5 Distinct values of attribute 35 Count=8 Count=4 Count=15 Count=3 Count=14 36

CMU SCS CMU SCS Cost Estimations Histograms with Quantiles

• Our formulas are nice but we assume that • A histogram type wherein the “spread” of data values are uniformly distributed. each bucket is same. Non-Uniform Approximation of D Equi-width Histogram ~ Quantiles 15 10

12 8

9 6

6 4

3 2

0 0 1-3 4-6 7-9 10-12 13-15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Bucket #1 Bucket #2 Bucket #3 Bucket #4 Bucket #5 Bucket #1 Bucket #2 Bucket #3 Bucket #4 Count=8 Count=4 Count=15 Count=3 Count=14 36 Count=12 Count=12 Count=9 Count=12 37 CMU SCS CMU SCS Histograms with Quantiles Sampling

• A histogram type wherein the “spread” of • Modern DBMSs also employ sampling to each bucket is same. estimate predicate selectivities. Equi-width Histogram ~ Quantiles SELECT * FROM sailors 15 sid sname rating age WHERE rating > 100 12 1 Kayne 999 45.0 3 Obama 50 52.0 9 2 Tupac 32 26.0 6 Bieber 10 19.0 6 3 1 billion tuples 0 ⋮ 1-5 6-8 9-13 14-15

37 Faloutsos/Pavlo CMU SCS 15-415/615 38

CMU SCS CMU SCS Sampling Today’s Class

• Modern DBMSs also employ sampling to • History & Background estimate predicate selectivities. • Relational Algebra Equivalences SELECT * FROM sailors • Plan Cost Estimation sid sname rating age WHERE rating > 100 1 Kayne 999 45.0 3 Obama 50 52.0 • Plan Enumeration 2 Tupac 32 26.0 Kayne 999 6 Bieber 10 19.0 sel(rating>100) = Tupac 32 Bieber 10 = 1/3 1 billion tuples ⋮

Faloutsos/Pavlo CMU SCS 15-415/615 38 Faloutsos/Pavlo CMU SCS 15-415/615 39 CMU SCS CMU SCS Query Optimization Plan Generation

SELECT * SR • Bring query in internal form into FROM sailors WHERE rating = 2 “canonical form” (syntactic q-opt) #1 FR

• Generate alternative plans. • What are our plan options? #2 – Single relation. #3 – Multiple relations. – Nested sub-queries. …

• Estimate cost for each plan. #BR • Pick the best one.

Faloutsos/Pavlo CMU SCS 15-415/615 40 Faloutsos/Pavlo CMU SCS 15-415/615 41

CMU SCS CMU SCS Plan Generation Sequential Scan

SELECT * SR SELECT * SR FROM sailors FROM sailors WHERE rating = 2 WHERE rating = 2 #1 #1 FR FR

• Sequential Scan #2 • BR (worst case) #2

• Binary Search #3 • BR /2 (on average, if we search #3 – if sorted & consecutive for primary key) … … • Index Search #BR #BR – if an index exists

Faloutsos/Pavlo CMU SCS 15-415/615 42 Faloutsos/Pavlo CMU SCS 15-415/615 43 CMU SCS CMU SCS Binary Search Binary Search

SELECT * SR SELECT * SR FROM sailors FROM sailors We showed that estimating WHERE rating = 2 WHERE rating = 2 this is non-trivial. #1 #1 FR FR

• ~log(BR) + SC(A,R)/ FR #2 • ~log(BR) + SC(A,R)/ FR #2

• Extra blocks are ones that #3 • Extra blocks are ones that #3 contain qualifying tuples contain qualifying tuples … … #BR #BR

Faloutsos/Pavlo CMU SCS 15-415/615 44 Faloutsos/Pavlo CMU SCS 15-415/615 45

CMU SCS CMU SCS Index Search Index Search: Case #1

SELECT * SR SELECT * SR FROM sailors FROM sailors WHERE rating = 2 WHERE rating = 2 #1 #1 FR

• Index Search: #2 • Primary Key #2

– levels of index + #3 – cost: HTi + 1 #3 blocks w/ qual. tuples … … Case#1: Primary Key Case#2: Secondary key – clustering index #BR #BR Case#3: Secondary key – non-clust. index HTi

Faloutsos/Pavlo CMU SCS 15-415/615 46 Faloutsos/Pavlo CMU SCS 15-415/615 47 CMU SCS CMU SCS Index Search: Case #2 Index Search: Case #3

SELECT * SR SELECT * SR FROM sailors FROM sailors WHERE rating = 2 WHERE rating = 2 #1 #1 … • Secondary key with #2 • Secondary key with #2

clustering index: #3 non-clustering index: #3

– cost: HTi + SC(A,R)/FR – cost: HTi + SC(A,R) … …

#BR #BR HTi HTi

Faloutsos/Pavlo CMU SCS 15-415/615 48 Faloutsos/Pavlo CMU SCS 15-415/615 49

CMU SCS CMU SCS Query Optimization Queries over Multiple Relations

• Bring query in internal form into • As number of joins increases, number of “canonical form” (syntactic q-opt) alternative plans grows rapidly • Generate alternative plans. – We need to restrict search space. – Single relation. • Fundamental decision in System R: only – Multiple relations. left-deep join trees are considered. – Nested sub-queries. Newer DBMSs don’t make • Estimate cost for each plan. this assumption anymore • Pick the best one.

Faloutsos/Pavlo CMU SCS 15-415/615 50 Faloutsos/Pavlo CMU SCS 15-415/615 51 CMU SCS CMU SCS Queries over Multiple Relations Queries over Multiple Relations

• Fundamental decision in System R: only • Fundamental decision in System R: only left-deep join trees are considered. left-deep join trees are considered.

D D D D ⨝ ⨝ ⨝ ⨝ C C ⨝ C C ⨝ ⨝ ⨝ A B C D ⨝ ⨝ A B C D A B A B ⨝ ⨝ A B XA B ⨝X⨝ ⨝ ⨝ ⨝ ⨝

Faloutsos/Pavlo CMU SCS 15-415/615 52 Faloutsos/Pavlo CMU SCS 15-415/615 52

CMU SCS CMU SCS Queries over Multiple Relations Queries over Multiple Relations

• Fundamental decision in System R: only • Enumerate the orderings left-deep join trees are considered. – Example: Left-deep tree #1, Left-deep tree #2… – Allows for fully pipelined plans where • Enumerate the plans for each operator intermediate results not written to temp files. – Example: Hash, Sort-Merge, Nested Loop… – Not all left-deep trees are fully pipelined. • Enumerate the access paths for each table

– Example: Index #1, Index #2, Seq Scan…

Faloutsos/Pavlo CMU SCS 15-415/615 53 Faloutsos/Pavlo CMU SCS 15-415/615 54 CMU SCS CMU SCS Queries over Multiple Relations Dynamic Programming Example

$500 • Enumerate the orderings BOS $200 CDG $800 – Example: Left-deep tree #1, Left-deep tree #2… $850 $150 • Enumerate the plans for each operator PIT JKF $450 PVG $650 $950 – Example: Hash, Sort-Merge, Nested Loop… $50 FRA

• Enumerate the access paths for each table ATL $1050 – Example: Index #1, Index #2, Seq Scan… • Use dynamic programming to reduce the Compute the cheapest flight PIT -> PVG number of cost estimations.

Faloutsos/Pavlo CMU SCS 15-415/615 54 Faloutsos/Pavlo CMU SCS 15-415/615 55

CMU SCS CMU SCS Dynamic Programming Example Dynamic Programming Example

$500 $500 BOS BOS $200 $200 CDG $800 CDG $800 $850 $850 $150 $150 PIT JKF $450 PVG PIT JKF $450 PVG $650 $950 $650 $950 $50 FRA $50 FRA

ATL $1050 ATL $1050

Compute the cheapest flight PIT -> PVG Compute the cheapest flight PIT -> PVG Solution: Compute partial optimal, left-to-right Solution: Compute partial optimal, left-to-right

Faloutsos/Pavlo CMU SCS 15-415/615 55 Faloutsos/Pavlo CMU SCS 15-415/615 55 CMU SCS CMU SCS Dynamic Programming Example Dynamic Programming Example

$500 $200 $500 BOS BOS $200 $200 CDG $800 CDG $800 $850 $850 $150 $150 PIT JKF $450 PVG PIT JKF $450 PVG $150 $650 $950 $650 $950 $50 FRA $50 FRA $50 ATL $1050 ATL $1050

Compute the cheapest flight PIT -> PVG Compute the cheapest flight PIT -> PVG Solution: Compute partial optimal, left-to-right Solution: Compute partial optimal, left-to-right

Faloutsos/Pavlo CMU SCS 15-415/615 55 Faloutsos/Pavlo CMU SCS 15-415/615 55

CMU SCS CMU SCS Dynamic Programming Example Dynamic Programming Example

$200 $500 $200 $500 BOS BOS $700 $200 $200 CDG $800 CDG $800 $850 $850 $150 $150 PIT JKF $450 PVG PIT JKF $650 $450 PVG $150 $150 $650 $950 $650 $950 $50 FRA $50 FRA $50 $50 ATL $1050 ATL $1050

Compute the cheapest flight PIT -> PVG Compute the cheapest flight PIT -> PVG Solution: Compute partial optimal, left-to-right Solution: Compute partial optimal, left-to-right

Faloutsos/Pavlo CMU SCS 15-415/615 55 Faloutsos/Pavlo CMU SCS 15-415/615 55 CMU SCS CMU SCS Dynamic Programming Example Dynamic Programming Example

$200 $500 $200 $500 BOS BOS $700 $700 $200 $200 CDG $800 CDG $800 $1500 $850 $850 $150 $150 PIT JKF $650 $450 PVG PIT JKF $650 $450 PVG $150 $150 $650 $950 $650 $950 $50 FRA $50 FRA $50 $50 ATL $1050 ATL $1050

Compute the cheapest flight PIT -> PVG Compute the cheapest flight PIT -> PVG Solution: Compute partial optimal, left-to-right Solution: Compute partial optimal, left-to-right

Faloutsos/Pavlo CMU SCS 15-415/615 55 Faloutsos/Pavlo CMU SCS 15-415/615 55

CMU SCS CMU SCS Dynamic Programming Example Q-Opt + Dynamic Programming

$200 $500 BOS $700 • Example: R S T $200 CDG $800 $1500 $850 $150 ⨝ ⨝ PIT JKF $650 $450 PVG $150 R S $650 $950 $50 FRA T $50 R ⨝ ATL $1050 … S R S T T ⨝ ⨝ Compute the cheapest flight PIT -> PVG R Solution: Compute partial optimal, left-to-right S T

Faloutsos/Pavlo CMU SCS 15-415/615 55 Faloutsos/Pavlo ⨝CMU SCS 15-415/615 56 CMU SCS CMU SCS Q-Opt + Dynamic Programming Q-Opt + Dynamic Programming

• Example: R S T • Example: R S T Sort-Merge Hash-Join 150 (SM) ⨝ ⨝ 150 (SM) ⨝ ⨝ R S R S 300 (HJ) 300 (HJ) T T 2500 (NL) 2500 (NL) R ⨝ R ⨝ … … S R S T S Nested Loop R S T T T 3000 (NL) ⨝ ⨝ 3000 (NL) ⨝ ⨝ R 500 (HJ) R 500 (HJ) S T S T 500 (SM) 500 (SM)

Faloutsos/Pavlo ⨝CMU SCS 15-415/615 56 Faloutsos/Pavlo ⨝CMU SCS 15-415/615 56

CMU SCS CMU SCS Q-Opt + Dynamic Programming Q-Opt + Dynamic Programming

• How to plan a query where R is sorted on • R S T order by R.a R.a? ⨝ ⨝ • Consider the following query: 150 (SM) R S 300 (HJ) SELECT * T FROM R, S, T R 2500 (NL) ⨝ WHERE R.a = S.a AND S.b = T.b … R S T ORDER BY R.a S T ⨝ ⨝ R S T

Faloutsos/Pavlo CMU SCS 15-415/615 57 Faloutsos/Pavlo ⨝ CMU SCS 15-415/615 58 CMU SCS CMU SCS Q-Opt + Dynamic Programming Q-Opt + Dynamic Programming

• R S T order by R.a • R S T order by R.a

350 (SM) 150⨝ (SM) ⨝ 150⨝ (SM) ⨝ R S 300 (HJ) R S 300 (HJ) T 1000 (sort) T 1000 (sort) R 2500 (NL) ⨝ R 2500 (NL) ⨝

… R S T … R S T S R S T sorted R.a S R S T sorted R.a T ⨝ ⨝ T ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ R R S T S T

Faloutsos/Pavlo ⨝ CMU SCS 15-415/615 58 Faloutsos/Pavlo ⨝ CMU SCS 15-415/615 58

CMU SCS CMU SCS Candidate Plan Example Candidate Plan Example SELECT sname, bname, day SELECT sname, bname, day FROM sailors S, reserves R, boats B FROM sailors S, reserves R, boats B WHERE S.sid = R.sid AND R.bid = B.bid No real DBMSs WHERE does S.sid it this = way.R.sid AND R.bid = B.bid It’s actually more messy… • How to generate plans for search algorithm: • How to generate plans for search algorithm: 1. Enumerate relation orderings 1. Enumerate relation orderings 2. Enumerate join algorithm choices 2. Enumerate join algorithm choices 3. Enumerate access method choices 3. Enumerate access method choices

Faloutsos/Pavlo CMU SCS 15-415/615 59 Faloutsos/Pavlo CMU SCS 15-415/615 59 CMU SCS CMU SCS Candidate Plans Candidate Plans SELECT sname, bname, day SELECT sname, bname, day FROM sailors S, reserves R, boatsPrune B plans with FROM sailors S, reserves R, boatsPrune B plans with WHERE S.sid = R.sid AND R.bid = crossB.bid- products WHERE S.sid = R.sid AND R.bid = crossB.bid- products 1. Enumerate relation orderings: immediately! 1. Enumerate relation orderings: immediately!

B B × R B B × R ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ S R R S B S S R R S BX S ⨝ ⨝ ⨝ ⨝ S S × R S S × R ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ B R R B S B B R R B S XB Faloutsos/Pavlo⨝ CMU⨝ SCS 15-415/615 60 Faloutsos/Pavlo⨝ CMU⨝ SCS 15-415/615 60

CMU SCS CMU SCS Candidate Plans Candidate Plans SELECT sname, bname, day SELECT sname, bname, day FROM sailors S, reserves R, boats B FROM sailors S, reserves R, boats B WHERE S.sid = R.sid AND R.bid = B.bid WHERE S.sid = R.sid AND R.bid = B.bid 2. Enumerate join algorithm choices: 3. Enumerate access method choices: NLJ HJ HJ HJ B NLJ B NLJ B HJ B HJ SeqScanB ⨝ ⨝ ⨝ ⨝ ⨝ S R S R S R S R SeqScanS SeqScanR ⨝ ⨝ NLJ ⨝ HJ ⨝ ⨝ HJ Do this for the Do this for the other plans. HJ B HJ B other plans. HJ SeqScanB ⨝ ⨝ ⨝ S R S R SeqScanS IndexScanR (R.sid) Faloutsos/Pavlo CMU SCS 15⨝-415/615 ⨝ 61 Faloutsos/Pavlo CMU SCS 15⨝-415/615 62 CMU SCS CMU SCS Postgres Optimizer Postgres GEQO Best: 100 1st Generation • Examines all types of join trees NLJ – Left-deep, Right-deep, bushy NLJ B • Two optimizer implementations: R S⨝ 300 – Traditional Dynamic Programming Approach ⨝ NLJ – Genetic Query Optimizer (GEQO) HJ S • Postgres uses the traditional one when # of B R⨝ 200 tables in query is less than 12 and switches ⨝ HJ to GEQO when there are 12 or more. HJ B S R⨝ 100

Faloutsos/Pavlo CMU SCS 15-415/615 63 Faloutsos/Pavlo⨝ CMU SCS 15-415/615 64

CMU SCS CMU SCS Postgres GEQO Postgres GEQO Best: 100 Best: 80 1st Generation 1st Generation 2nd Generation NLJ NLJ HJ NLJ B NLJ B B HJ RX S⨝ 300 RX S⨝ 300 100⨝ S R ⨝ NLJ ⨝ NLJ ⨝NLJ HJ S HJ S HJ S B R⨝ 200 B R⨝ 200 R B⨝ 200 ⨝ HJ ⨝ HJ ⨝ HJ HJ B HJ B HJ S S R⨝ 100 S R⨝ 100 B R⨝ 80 Faloutsos/Pavlo⨝ CMU SCS 15-415/615 64 Faloutsos/Pavlo⨝ ⨝CMU SCS 15-415/615 64 CMU SCS CMU SCS Postgres GEQO Postgres GEQO Best: 80 Best: 80 1st Generation 2nd Generation 1st Generation 2nd Generation 3rd Generation NLJ HJ NLJ HJ HJ NLJ B B HJ NLJ B B HJ B HJ RX S⨝ 300 100⨝ S R RX S⨝ 300 100⨝ S R ⨝R S ⨝ NLJ ⨝NLJ ⨝ NLJ ⨝NLJ ⨝HJ … HJ S HJ S HJ S HJ S HJ B B R⨝ 200 RX B⨝ 200 B R⨝ 200 RX B⨝ 200 R S⨝ ⨝ HJ ⨝ HJ ⨝ HJ ⨝ HJ ⨝HJ HJ B HJ S HJ B HJ S S HJ S R⨝ 100 B R⨝ 80 S R⨝ 100 B R⨝ 80 ⨝R B Faloutsos/Pavlo⨝ ⨝CMU SCS 15-415/615 64 Faloutsos/Pavlo⨝ ⨝CMU SCS 15-415/615 ⨝ 64

CMU SCS CMU SCS Query Optimization Nested Sub-Queries

• Bring query in internal form into • The DBMS treats nested sub-queries in the “canonical form” (syntactic q-opt) where clause as functions that take • Generate alternative plans. parameters and return a single value or set – Single relation. of values. – Multiple relations. • Two Approaches: – Nested sub-queries. – Rewrite to de-correlate and/or flatten them • Estimate cost for each plan. – Decompose nested query and store result to temporary table • Pick the best one.

Faloutsos/Pavlo CMU SCS 15-415/615 65 Faloutsos/Pavlo CMU SCS 15-415/615 66 CMU SCS CMU SCS Nested Sub-Queries: Rewrite Nested Sub-Queries: Rewrite

SELECT name FROM sailors AS S SELECT name FROM sailors AS S WHERE EXISTS ( WHERE EXISTS ( SELECT * FROM reserves AS R SELECT * FROM reserves AS R WHERE S.sid = R.sid WHERE S.sid = R.sid AND R.day = “2016-10-24” AND R.day = “2016-10-24” ) )

SELECT name FROM sailors AS S, reserves AS R WHERE S.sid = R.sid AND R.day = “2016-10-23”

Faloutsos/Pavlo CMU SCS 15-415/615 67 Faloutsos/Pavlo CMU SCS 15-415/615 67

CMU SCS CMU SCS Nested Sub-Queries: Decompose Nested Sub-Queries: Decompose

SELECT S.sid, MIN(R.day) SELECT S.sid, MIN(R.day) FROM sailors S, reserves R, boats B FROM sailors S, reserves R, boats B WHERE S.sid = R.sid WHERE S.sid = R.sid AND R.bid = B.bid AND R.bid = B.bid AND B.color = ‘red’ AND B.color = ‘red’ AND S.rating = (SELECT MAX(S2.rating) AND S.rating = (SELECT MAX(S2.rating) FROM sailors S2) FROM sailors S2) GROUP BY S.sid GROUP BY S.sid HAVING COUNT(*) > 1 HAVING COUNT(*) > 1

For each sailor with the highest rating (over all For each sailor with the highest rating (over all sailors) and at least two reservations for red sailors) and at least two reservations for red boats, find the sailor id and the earliest date on boats, find the sailor id and the earliest date on which the sailor has a reservation for a red boat. which the sailor has a reservation for a red boat.

Faloutsos/Pavlo CMU SCS 15-415/615 68 Faloutsos/Pavlo CMU SCS 15-415/615 68 CMU SCS CMU SCS Decomposing Queries Decomposing Queries

• For harder queries, the optimizer breaks up queries into blocks and then concentrates on SELECT S.sid, MIN(R.day) one block at a time. FROM sailors S, reserves R, boats B WHERE S.sid = R.sid • Sub-queries are written to a temporary table AND R.bid = B.bid AND B.color = ‘red’ that are discarded after the query finishes. AND S.rating = (SELECT MAX(S2.rating) FROM sailors S2) GROUP BY S.sid HAVING COUNT(*) > 1

Nested Block

Faloutsos/Pavlo CMU SCS 15-415/615 69 Faloutsos/Pavlo CMU SCS 15-415/615 70

CMU SCS CMU SCS Decomposing Queries Decomposing Queries

SELECT MAX(rating) FROM sailors SELECT MAX(rating) FROM sailors

SELECT S.sid, MIN(R.day) SELECT S.sid, MIN(R.day) FROM sailors S, reserves R, boats B FROM sailors S, reserves R, boats B WHERE S.sid = R.sid WHERE S.sid = R.sid AND R.bid = B.bid AND R.bid = B.bid AND B.color = ‘red’ AND B.color = ‘red’ AND S.rating = (SELECT MAX(S2.rating) AND S.rating = (SELECT### MAX(S2.rating) FROM sailors S2) FROM sailors S2) GROUP BY S.sid GROUP BY S.sid HAVING COUNT(*) > 1 HAVING COUNT(*) > 1

Nested Block

Faloutsos/Pavlo CMU SCS 15-415/615 70 Faloutsos/Pavlo CMU SCS 15-415/615 70 CMU SCS CMU SCS Decomposing Queries What Optimizers are Still Bad At

SELECT MAX(rating) FROM sailors • Cardinality estimations are still hard.

SELECT S.sid, MIN(R.day) • Problem Areas: FROM sailors S, reserves R, boats B – Prepared Statements WHERE S.sid = R.sid AND R.bid = B.bid – Correlated Columns AND B.color = ‘red’ AND S.rating = (SELECT### MAX(S2.rating) – Plan Stability FROM sailors S2) GROUP BY S.sid HAVING COUNT(*) > 1 More Information: Guy Lohman, Is Query Optimization a “Solved” Problem? SIGMOD Blog (April 2014), http://wp.sigmod.org/?p=1075 Outer Block

Faloutsos/Pavlo CMU SCS 15-415/615 70 Faloutsos/Pavlo CMU SCS 15-415/615 71

CMU SCS CMU SCS Prepared Statements Prepared Statements

SELECT S.name, B.bid PREPARE myQuery AS FROM sailors AS S, SELECT S.name, B.bid reserves AS R, boats AS B FROM sailors AS S, WHERE S.sid = R.sid reserves AS R, boats AS B AND R.bid = B.bid WHERE S.sid = R.sid AND B.color = ‘red’ AND R.bid = B.bid AND S.rating > 1000 AND B.color = ‘red’ AND S.rating > 1000

Faloutsos/Pavlo CMU SCS 15-415/615 72 Faloutsos/Pavlo CMU SCS 15-415/615 72 CMU SCS CMU SCS Prepared Statements Prepared Statements

PREPARE myQuery AS PREPARE myQuery AS SELECT S.name, B.bid SELECT S.name, B.bid FROM sailors AS S, FROM sailors AS S, reserves AS R, boats AS B reserves AS R, boats AS B WHERE S.sid = R.sid WHERE S.sid = R.sid AND R.bid = B.bid AND R.bid = B.bid AND B.color = ‘red’ AND B.color = ‘red’ AND S.rating > 1000 AND S.rating > 1000

EXECUTE myQuery; EXECUTE myQuery;

Faloutsos/Pavlo CMU SCS 15-415/615 72 Faloutsos/Pavlo CMU SCS 15-415/615 72

CMU SCS CMU SCS Prepared Statements Prepared Statements

PREPARE myQuery (varchar, int) AS Prepared Statement Query Plan • What should be the SELECT S.name, B.bid FROM sailors AS S, π name, bid join order for reserves AS R, boats AS B R.sid=S.sid SAILORS, BOATS, WHERE S.sid = R.sid and RESERVES? AND R.bid = B.bid σ rating>$2 AND B.color = $1 ⨝ AND S.rating > $2 B.bid=R.sid SAILORS σ color=$1 EXECUTE myQuery (‘red’, 1000); ⨝ BOAT RESERVES

Faloutsos/Pavlo CMU SCS 15-415/615 72 Faloutsos/Pavlo CMU SCS 15-415/615 73 CMU SCS CMU SCS Prepared Statements Correlated Columns

• Solution #1 – Rerun optimizer each time • We showed how selectivities are modeled the query is invoked. as probabilities on whether a predicate on • Solution #2 – Generate multiple plans for any given row will be satisfied. different values of the parameters. • We then multiply these individual • Solution #3 – Choose the average value selectivities together. for a parameter and use that for all invocations.

Faloutsos/Pavlo CMU SCS 15-415/615 74 Faloutsos/Pavlo CMU SCS 15-415/615 75

CMU SCS CMU SCS Correlated Columns Column Group Statistics

• Consider a database of automobiles: • Tell the DBMS that it should keep track of – # of Makes = 10, # of Models = 100 statistics for groups of columns together • And the following query: rather than just treating them all as – make=“Honda” AND model=“Accord” independent variables. • With the independence and uniformity • Only supported in commercial systems. assumption, the selectivity is: – 1/10 * 1/100 = 0.001 • But since only Honda makes Accords the real selectivity is 1/100 = 0.01. 76 Faloutsos/Pavlo CMU SCS 15-415/615 77 CMU SCS CMU SCS Plan Stability Plan Stability

• We want to deploy a new version of a • Solution #1 – Allow tuning hints in plans. DBMS but need to make sure that there are • Solution #2 – Set the optimizer version no performance regressions. number and migrate queries one-by-one to • What if 99% of the query plans are faster the new optimizer. on the newer DBMS version, but 1% are • Solution #3 – Save query plan from old slower? version and provide it to the new DBMS.

Faloutsos/Pavlo CMU SCS 15-415/615 78 Faloutsos/Pavlo CMU SCS 15-415/615 79

CMU SCS CMU SCS Conclusions Next Class

• Ideas to remember: • How to refine database schemas to remove – Filter early as possible. redundancies and prevent loss data. – Selectivity estimations (uniformity, indep.; histograms; join selectivity) – Dynamic programming for join orderings – Rewrite nested queries • Query optimization is hard…

Faloutsos/Pavlo CMU SCS 15-415/615 80 Faloutsos/Pavlo CMU SCS 15-415/615 81