References:
- [RG-3ed] Chapter 15
- [SKS-6ed] Chapter 13

Database Management Systems II, Huiping Cao

Query Optimization

- Introduction
- Evaluation of Expressions
- Query Blocks
- Transformation of Relational Expressions
- Cost Estimation
- Enumeration of Alternative Plans
- Nested Subqueries

Individual Operators

- Queries are composed of a few basic operators; the implementation of these operators can be carefully tuned (and it is important to do this!).
- There are many alternative implementation techniques for each operator; for most operators, no technique is universally superior.
- We must consider the available alternatives for each operation in a query and choose the best one based on system statistics, etc. This is part of the broader task of optimizing a query composed of several operations.

Introduction

- Alternative ways of evaluating a given query
  - Equivalent expressions
  - Different algorithms for each operation

SELECT name, title
FROM instructor, teaches, course
WHERE dept_name = 'Music'
  AND instructor.id = teaches.iid
  AND course.cid = teaches.cid;

Introduction (Cont.)

- An evaluation plan defines exactly which algorithm is used for each operation and how the execution of the operations is coordinated.
- Find out how to view query execution plans on your favorite database system.

PostgreSQL

EXPLAIN SELECT *
FROM measurement_instance AS mi, measurement_type AS mt
WHERE mt.annot_id = mi.did AND mt.mtypelabel = mi.mtypelabel;

QUERY PLAN
------
Hash Join  (cost=11.75..25.96 rows=1 width=1377)
  Hash Cond: ((mi.did = mt.annot_id) AND (mi.mtypelabel = mt.mtypelabel))
  ->  Seq Scan on measurement_instance mi  (cost=0.00..12.40 rows=240 width=310)
  ->  Hash  (cost=10.70..10.70 rows=70 width=1067)
        ->  Seq Scan on measurement_type mt  (cost=0.00..10.70 rows=70 width=1067)
(5 rows)
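For readers without a PostgreSQL server at hand, plan inspection can also be tried with SQLite through Python's standard `sqlite3` module. This is only a sketch: the two-column table definitions are assumptions made for illustration (the real schema is not given here), and SQLite's `EXPLAIN QUERY PLAN` output format differs from PostgreSQL's.

```python
import sqlite3

def show_plan(conn, sql):
    """Return the rows SQLite reports for EXPLAIN QUERY PLAN on a query."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()

conn = sqlite3.connect(":memory:")
# Hypothetical, minimal versions of the two tables used in the example above.
conn.executescript("""
    CREATE TABLE measurement_type (annot_id INTEGER, mtypelabel TEXT);
    CREATE TABLE measurement_instance (did INTEGER, mtypelabel TEXT);
""")

query = """
    SELECT *
    FROM measurement_instance AS mi, measurement_type AS mt
    WHERE mt.annot_id = mi.did AND mt.mtypelabel = mi.mtypelabel
"""

plan_rows = show_plan(conn, query)
for row in plan_rows:
    print(row)  # one row per plan step; the exact text depends on the SQLite version
```

Each printed row describes one step of the chosen plan (for example, which table is scanned and whether an index is used).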

Introduction (Cont.)

- The cost difference between evaluation plans for a query can be enormous
  - E.g., seconds vs. days in some cases
- Steps in cost-based query optimization
  1. Generate logically equivalent expressions using equivalence rules
  2. Annotate the resultant expressions to get alternative query plans
  3. Choose the cheapest plan based on estimated cost

Introduction (Cont.)

- Estimation of plan cost is based on:
  - Statistical information about relations, e.g., number of tuples, number of distinct values for an attribute
  - Statistics estimation for intermediate results, to compute the cost of complex expressions
  - Cost formulae for algorithms, computed using statistics

Evaluation of Expressions

- So far, we have seen algorithms for individual operations
- Alternatives for evaluating an entire expression tree
  - Materialization: generate the result of an expression whose inputs are relations or are already computed, and materialize (store) it on disk.
  - Pipelining: pass tuples on to parent operations even while an operation is still being executed

Materialization

- Materialized evaluation: evaluate one operation at a time, starting at the lowest level. Use intermediate results materialized into temporary relations to evaluate next-level operations.
- E.g., for the expression tree below, compute and store $\sigma_{building = \text{"Watson"}}(department)$, then compute and store its join with instructor, and finally compute the projection on name.

[Figure: expression tree for $\Pi_{name}(\sigma_{building = \text{"Watson"}}(department) \bowtie instructor)$]

Materialization (Cont.)

- Materialized evaluation is always applicable
- The cost of writing results to disk and reading them back can be quite high
  - The cost formulas for individual operations ignore the cost of writing results to disk, so
    overall cost = sum of costs of individual operations + cost of writing intermediate results to disk
- Double buffering: use two output buffers for each operation; when one is full, write it to disk while the other is being filled
  - Allows disk writes to overlap with computation and reduces execution time

Pipelining

- Pipelined evaluation: evaluate several operations simultaneously, passing the results of one operation on to the next.
- E.g., in the previous expression tree, do not store the result of $\sigma_{building = \text{"Watson"}}(department)$
  - Instead, pass tuples directly to the join. Similarly, do not store the result of the join; pass tuples directly to the projection.
- Much cheaper than materialization: no need to store a temporary relation on disk.
- Pipelining may not always be possible, e.g., for sort and hash join.
- Pipelines can be executed in two ways: demand driven and producer driven

Pipelining (Cont.)

- In demand-driven (lazy, or pull-model) evaluation
  - The system repeatedly requests the next tuple from the top-level operation
  - Each operation requests the next tuple from its child operations as required, in order to output its next tuple
  - In between calls, an operation has to maintain "state" so that it knows what to return next
- In producer-driven (eager, or push-model) pipelining
  - Operators produce tuples eagerly and pass them up to their parents
    - A buffer is maintained between operators; the child puts tuples in the buffer and the parent removes tuples from it
    - If the buffer is full, the child waits until there is space in the buffer and then generates more tuples
  - The system schedules operations that have space in their output buffer and can process more input tuples
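The demand-driven (pull) model described above maps naturally onto Python generators: each operator asks its child for the next tuple only when it needs one, and the generator's suspended frame plays the role of the operator "state". The sketch below is illustrative only; the operator names and the tiny `department` and `instructor` tables are assumptions made up for this example.

```python
# Each operator is a generator: it pulls tuples from its child on demand.

def scan(table):
    """Leaf operator: yield the tuples of a stored relation one at a time."""
    for row in table:
        yield row

def select(child, pred):
    """Pass on only the tuples that satisfy the predicate."""
    for row in child:
        if pred(row):
            yield row

def nested_loop_join(outer, inner_table, pred):
    """Join each pulled outer tuple with the (stored) inner relation."""
    for o in outer:
        for i in inner_table:
            if pred(o, i):
                yield {**o, **i}

def project(child, attrs):
    """Keep only the requested attributes."""
    for row in child:
        yield {a: row[a] for a in attrs}

# Hypothetical sample data.
department = [
    {"dept_name": "Music", "building": "Packard"},
    {"dept_name": "Physics", "building": "Watson"},
    {"dept_name": "Biology", "building": "Watson"},
]
instructor = [
    {"name": "Einstein", "dept_name": "Physics"},
    {"name": "Mozart", "dept_name": "Music"},
    {"name": "Crick", "dept_name": "Biology"},
]

# Pipelined plan: no intermediate result is stored.
plan = project(
    nested_loop_join(
        select(scan(department), lambda d: d["building"] == "Watson"),
        instructor,
        lambda d, i: d["dept_name"] == i["dept_name"],
    ),
    ["name"],
)

names = [row["name"] for row in plan]  # pulling from the root drives the whole pipeline
print(names)
```

Nothing runs until the list comprehension starts pulling from `plan`, which mirrors the system repeatedly requesting the next tuple from the top-level operation.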

Evaluation Algorithms for Pipelining

- Some algorithms are not able to output results even as they receive input tuples
  - E.g., merge join or hash join
  - Intermediate results are written to disk and then read back
- Such operations are called blocking operations; the remaining operations can be pipelined

- PostgreSQL
  - EXPLAIN command: shows the execution plan of a statement
  - Ref: http://www.postgresql.org/docs/8.1/static/sql-explain.html
- MySQL
  - EXPLAIN command
  - Ref: http://dev.mysql.com/doc/refman/5.0/en/explain.html

Query Blocks: Units of Optimization

SELECT S.sname
FROM Sailors S
WHERE S.age IN
    (SELECT MAX(S2.age)
     FROM Sailors S2
     GROUP BY S2.rating)

(The first three lines form the outer block; the parenthesized subquery is the nested block.)

- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- A query block has
  - No nesting
  - Exactly one SELECT and one FROM clause
  - At most one WHERE clause, GROUP BY clause, and HAVING clause
    - WHERE clause in conjunctive normal form

Query Blocks: Units of Optimization (Cont.)

- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. (This is an over-simplification, but serves for now.)
- For each block, the plans considered are:
  - All available access methods, for each relation in the FROM clause.
  - All left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

Transformation of Relational Expressions

- Two relational algebra expressions are said to be equivalent if they generate the same set of tuples on every legal database instance
  - Note: the order of tuples is irrelevant
  - We do not care if they generate different results on databases that violate integrity constraints
- In SQL, inputs and outputs are multisets of tuples
  - Two expressions in the multiset version of the relational algebra are said to be equivalent if they generate the same multiset of tuples on every legal database instance.
- An equivalence rule says that expressions of two forms are equivalent
  - Can replace an expression of the first form by one of the second form, or vice versa

Equivalence Rules

1. Conjunctive selection operations can be deconstructed into a sequence of individual selections.
   $\sigma_{\theta_1 \land \theta_2}(E) = \sigma_{\theta_1}(\sigma_{\theta_2}(E))$
2. Selection operations are commutative.
   $\sigma_{\theta_1}(\sigma_{\theta_2}(E)) = \sigma_{\theta_2}(\sigma_{\theta_1}(E))$
3. Only the last in a sequence of projection operations is needed; the others can be omitted.
   $\Pi_{L_1}(\Pi_{L_2}(\ldots(\Pi_{L_n}(E))\ldots)) = \Pi_{L_1}(E)$
4. Selections can be combined with Cartesian products and theta joins.
   a. $\sigma_{\theta}(E_1 \times E_2) = E_1 \bowtie_{\theta} E_2$
   b. $\sigma_{\theta_1}(E_1 \bowtie_{\theta_2} E_2) = E_1 \bowtie_{\theta_1 \land \theta_2} E_2$

Equivalence Rules (Cont.)

5. Theta-join operations (and natural joins) are commutative.
   $E_1 \bowtie_{\theta} E_2 = E_2 \bowtie_{\theta} E_1$
6. (a) Natural join operations are associative:
   $(E_1 \bowtie E_2) \bowtie E_3 = E_1 \bowtie (E_2 \bowtie E_3)$
   (b) Theta joins are associative in the following manner:
   $(E_1 \bowtie_{\theta_1} E_2) \bowtie_{\theta_2 \land \theta_3} E_3 = E_1 \bowtie_{\theta_1 \land \theta_3} (E_2 \bowtie_{\theta_2} E_3)$
   where $\theta_2$ involves attributes from only $E_2$ and $E_3$.

Equivalence Rules (Cont.)

7. The selection operation distributes over the theta join operation under the following two conditions:
   (a) When all the attributes in $\theta_0$ involve only the attributes of one of the expressions ($E_1$) being joined.
   $\sigma_{\theta_0}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_0}(E_1)) \bowtie_{\theta} E_2$
   (b) When $\theta_1$ involves only the attributes of $E_1$ and $\theta_2$ involves only the attributes of $E_2$.
   $\sigma_{\theta_1 \land \theta_2}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_1}(E_1)) \bowtie_{\theta} (\sigma_{\theta_2}(E_2))$

Pictorial Depiction of Equivalence Rules

[Figure: pictorial depiction of the equivalence rules]

Equivalence Rules (Cont.)

8. The projection operation distributes over the theta join operation as follows:
   (a) If $\theta$ involves only attributes from $L_1 \cup L_2$:
   $\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = (\Pi_{L_1}(E_1)) \bowtie_{\theta} (\Pi_{L_2}(E_2))$
   (b) Consider a join $E_1 \bowtie_{\theta} E_2$.
   - Let $L_1$ and $L_2$ be sets of attributes from $E_1$ and $E_2$, respectively.
   - Let $L_3$ be attributes of $E_1$ that are involved in join condition $\theta$, but are not in $L_1 \cup L_2$, and
   - Let $L_4$ be attributes of $E_2$ that are involved in join condition $\theta$, but are not in $L_1 \cup L_2$.
   $\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = \Pi_{L_1 \cup L_2}((\Pi_{L_1 \cup L_3}(E_1)) \bowtie_{\theta} (\Pi_{L_2 \cup L_4}(E_2)))$

Equivalence Rules (Cont.)

9. The set operations union and intersection are commutative (set difference is not commutative).
   $E_1 \cup E_2 = E_2 \cup E_1$
   $E_1 \cap E_2 = E_2 \cap E_1$
10. Set union and intersection are associative.
   $(E_1 \cup E_2) \cup E_3 = E_1 \cup (E_2 \cup E_3)$
   $(E_1 \cap E_2) \cap E_3 = E_1 \cap (E_2 \cap E_3)$
11. The selection operation distributes over $\cup$, $\cap$ and $-$.
   $\sigma_{\theta}(E_1 - E_2) = \sigma_{\theta}(E_1) - \sigma_{\theta}(E_2)$, and similarly for $\cup$ and $\cap$ in place of $-$
   Also: $\sigma_{\theta}(E_1 - E_2) = \sigma_{\theta}(E_1) - E_2$, and similarly for $\cap$ in place of $-$, but not for $\cup$
12. The projection operation distributes over union.
   $\Pi_{L}(E_1 \cup E_2) = (\Pi_{L}(E_1)) \cup (\Pi_{L}(E_2))$

Transformation Example: Pushing Selections

- Query: Find the names of all instructors in the Music department, along with the titles of the courses that they teach
  - $\Pi_{name, title}(\sigma_{dept\_name = \text{"Music"}}(instructor \bowtie (teaches \bowtie \Pi_{course\_id, title}(course))))$
- Transformation using rule 7a:
  - $\Pi_{name, title}((\sigma_{dept\_name = \text{"Music"}}(instructor)) \bowtie (teaches \bowtie \Pi_{course\_id, title}(course)))$
- Performing the selection as early as possible reduces the size of the relation to be joined.
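The effect of rule 7a can be checked on a toy instance. The sketch below is not part of the original material: the helper functions and the three small relations are hypothetical, and relations are modeled as lists of dictionaries with a set-valued projection. It evaluates the expression with the selection applied late and early, and compares both the answers and the size of the join input.

```python
def natural_join(r, s):
    """Natural join of two relations given as lists of dicts."""
    out = []
    for a in r:
        for b in s:
            common = set(a) & set(b)
            if all(a[k] == b[k] for k in common):
                out.append({**a, **b})
    return out

def select(r, pred):
    return [t for t in r if pred(t)]

def project(r, attrs):
    """Set-valued projection (duplicates removed)."""
    return {tuple(t[a] for a in attrs) for t in r}

# Hypothetical sample data.
instructor = [
    {"ID": 1, "name": "Mozart", "dept_name": "Music"},
    {"ID": 2, "name": "Einstein", "dept_name": "Physics"},
    {"ID": 3, "name": "Brahms", "dept_name": "Music"},
]
teaches = [
    {"ID": 1, "course_id": "MU-199"},
    {"ID": 2, "course_id": "PHY-101"},
]
course = [
    {"course_id": "MU-199", "title": "Music Video Production"},
    {"course_id": "PHY-101", "title": "Physical Principles"},
]

def is_music(t):
    return t["dept_name"] == "Music"

teaches_course = natural_join(teaches, course)

# Selection applied after the join (original expression).
late = project(select(natural_join(instructor, teaches_course), is_music),
               ["name", "title"])

# Selection pushed down to instructor (after rule 7a).
music_instructors = select(instructor, is_music)
early = project(natural_join(music_instructors, teaches_course),
                ["name", "title"])

print(late == early)                            # both forms give the same answer
print(len(instructor), len(music_instructors))  # join input shrinks from 3 to 2 tuples
```

On real data the reduction in join input size is what makes the pushed-down form cheaper; the answer itself is unchanged.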

Example with Multiple Transformations

- Query: Find the names of all instructors in the Music department who have taught a course in 2009, along with the titles of the courses that they taught
  - $\Pi_{name, title}(\sigma_{dept\_name = \text{"Music"} \land year = 2009}(instructor \bowtie (teaches \bowtie \Pi_{course\_id, title}(course))))$
- Transformation using join associativity (Rule 6a):
  - $\Pi_{name, title}(\sigma_{dept\_name = \text{"Music"} \land year = 2009}((instructor \bowtie teaches) \bowtie \Pi_{course\_id, title}(course)))$
- The second form provides an opportunity to apply the "perform selections early" rule, resulting in the subexpression
  $\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie \sigma_{year = 2009}(teaches)$

Multiple Transformations (Cont.)

[Figure: expression trees for the transformations above]

Transformation Example: Pushing Projections

- Consider: $\Pi_{name, title}((\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie teaches) \bowtie \Pi_{course\_id, title}(course))$
- When we compute
  $\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie teaches$
  we obtain a relation whose schema is: (ID, name, dept_name, salary, course_id, sec_id, semester, year)
- Push projections using equivalence rules 8a and 8b; eliminate unneeded attributes from intermediate results to get:
  $\Pi_{name, title}((\Pi_{name, course\_id}(\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie teaches)) \bowtie \Pi_{course\_id, title}(course))$
- Performing the projection as early as possible reduces the size of the relation to be joined.

Join Ordering Example

- For all relations $r_1$, $r_2$, and $r_3$,
  $(r_1 \bowtie r_2) \bowtie r_3 = r_1 \bowtie (r_2 \bowtie r_3)$ (join associativity)
- If $r_2 \bowtie r_3$ is quite large and $r_1 \bowtie r_2$ is small, we choose
  $(r_1 \bowtie r_2) \bowtie r_3$
  so that we compute and store a smaller temporary relation.

Join Ordering Example (Cont.)

- Consider the expression
  $\Pi_{name, title}((\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie teaches) \bowtie \Pi_{course\_id, title}(course))$
- We could compute $teaches \bowtie \Pi_{course\_id, title}(course)$ first, but the result of that join is likely to be a large relation
- Only a small fraction of the university's instructors are likely to be from the Music department
  - It is better to compute $\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie teaches$ first

Cost Estimation

- For each plan considered, we must estimate its cost:
  - Must estimate the cost of each operation in the plan tree.
    - Depends on input cardinalities.
    - Cost of operations (sequential scan, index scan, joins, etc.)
  - Must also estimate the size of the result for each operation in the tree!
    - Use information about the input relations.
    - For selections and joins, assume independence of predicates.

Estimating result size

- Reduction factor
  - The ratio of the (expected) result size to the input size, considering only the selection represented by the term.
- How to calculate reduction factors:
  - Column = value
    - 1/Nkeys(I), where I is an index on the column
    - 1/10: no index (an arbitrary default)
  - Column1 = column2
    - 1/Max(Nkeys(I1), Nkeys(I2)): both columns have indexes
    - 1/Nkeys(I): only one of column1 and column2 has an index I
    - 1/10: no index

Size estimation (cont.)

- How to calculate reduction factors:
  - Column > value
    - (High(I) - value)/(High(I) - Low(I)): with index
    - Less than half: no index
  - Column IN (list of values)
    - RF(column = value) * number of items
    - At most half
- Reduction factor for projection
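The reduction-factor rules above can be collected into a few small functions. This is a sketch under stated assumptions: the function names are invented for illustration, `None` stands for "no index available", and the no-index default for `column > value` is set to 1/3 only because the rule above asks for "less than half" without fixing a number.

```python
def rf_eq_value(nkeys=None):
    """column = value: 1/Nkeys(I) with an index, else the 1/10 default."""
    return 1.0 / nkeys if nkeys else 0.1

def rf_eq_columns(nkeys1=None, nkeys2=None):
    """column1 = column2, depending on which columns have an index."""
    if nkeys1 and nkeys2:
        return 1.0 / max(nkeys1, nkeys2)
    if nkeys1 or nkeys2:
        return 1.0 / (nkeys1 or nkeys2)
    return 0.1

def rf_greater(value, low=None, high=None, default=1.0 / 3.0):
    """column > value: (High - value)/(High - Low) with an index.

    `default` is an assumed placeholder for the no-index case
    (any fraction below one half fits the rule).
    """
    if low is None or high is None or high == low:
        return default
    return min(1.0, max(0.0, (high - value) / (high - low)))

def rf_in_list(num_items, nkeys=None):
    """column IN (list): RF(column = value) * number of items, at most 1/2."""
    return min(0.5, rf_eq_value(nkeys) * num_items)

# Hypothetical statistics: rating ranges from 1 to 10 and has 10 distinct keys.
print(rf_eq_value(nkeys=10))
print(rf_greater(5, low=1, high=10))
print(rf_in_list(3, nkeys=10))
```

Under the independence assumption, the reduction factors of several terms are simply multiplied together.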

Improved Statistics: Histograms

- Column > value
  - rN for uniform distribution
- Histogram on age
- Age > 13
  - Nonuniform: 9
  - Uniform: (1/15)*45 = 3

[Figure: nonuniform distribution of age over the values 0 to 14, and its uniform approximation]

Histograms

- Equi-width histograms
  - Divide the range into subranges of equal size (in terms of the values)
- Equi-depth histograms
  - Divide the range into subranges such that the number of tuples in each subrange is equal
- Age > 13

[Figure: equi-width histogram on age (values 0 to 14). Buckets 1 to 5 have counts 8, 4, 15, 3, 15 and per-value heights 2.67, 1.33, 5.0, 1.0, 5.0]
[Figure: equi-depth histogram on age (values 0 to 14). Buckets 1 to 5 have counts 9, 10, 10, 7, 9 and per-value heights 2.25, 2.5, 5.0, 1.75, 9.0]
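To see how the two histograms change the estimate for Age > 13, the following sketch applies the usual "uniform within a bucket" assumption to the bucket counts listed above. The bucket boundaries are not stated explicitly; they are inferred here by dividing each count by its per-value height (for example, 9 / 2.25 = 4 values), so treat them as an assumption.

```python
def estimate_greater_than(buckets, value):
    """Estimate the number of tuples with attribute > value.

    buckets: list of (low, high, count) over integer values, inclusive.
    Tuples are assumed to be spread uniformly over the values in a bucket.
    """
    total = 0.0
    for low, high, count in buckets:
        width = high - low + 1
        # Number of integer values in the bucket that are > value.
        qualifying = max(0, high - max(low - 1, value))
        total += count * qualifying / width
    return total

# Bucket boundaries inferred from count / height (an assumption).
equi_width = [(0, 2, 8), (3, 5, 4), (6, 8, 15), (9, 11, 3), (12, 14, 15)]
equi_depth = [(0, 3, 9), (4, 7, 10), (8, 9, 10), (10, 13, 7), (14, 14, 9)]
uniform = [(0, 14, 45)]  # a single bucket is the plain uniform assumption

print(estimate_greater_than(uniform, 13))
print(estimate_greater_than(equi_width, 13))
print(estimate_greater_than(equi_depth, 13))
```

With these buckets the uniform assumption gives 3, the equi-width histogram gives 5, and the equi-depth histogram gives 9, which matches the actual count of 9 because the frequent value 14 gets a bucket of its own.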

Enumeration of Alternative Plans

- Given a query, an optimizer essentially
  - enumerates a certain set of plans, and
  - chooses the plan with the least estimated cost.
- Algebraic equivalence
- Cost estimation
- Subset of plans considered by a typical optimizer

Enumeration of Alternative Plans

- There are two main cases:
  - Single-relation plans
  - Multiple-relation plans
- Consider a query block:
  - The maximum # tuples in the result is the product of the cardinalities of the relations in the FROM clause.
  - The reduction factor (RF) associated with each term reflects the impact of the term in reducing the result size. Result cardinality = Max # tuples * product of all RFs.
- For queries over a single relation, queries consist of a combination of selects, projects, and aggregate ops:
  - Each available access path (file scan/index) is considered, and the one with the least estimated cost is chosen.

Single Relation Queries -- Plans without Index

SELECT S.rating, COUNT(DISTINCT S.sname) AS dsname
FROM Sailors S
WHERE S.rating > 5 AND S.age = 20
GROUP BY S.rating
HAVING dsname > 2;

- Plans without index
  - Scan the relation and apply selections and projections
  - Write out the tuples after the selections and projections
  - Sort these tuples to implement the GROUP BY clause

Example

- File scan of Sailors: 500 pages
- Writing out (S.rating, S.sname) pairs costs 500 * ratio, where ratio is the product of the reduction factors
  - Let the selection RF of rating be 0.5
  - Let the selection RF of age be 0.1
  - Let the projection RF be 0.8
  - Result: 500 * (0.5 * 0.1 * 0.8) = 500 * 0.04 = 20
- Sorting the intermediate relation
  - Assume there is enough memory to finish sorting in two passes (relational optimizers often assume that a relation can be sorted in two passes to simplify the estimation of sorting costs)
  - 3 * 20 = 60
- Total cost: 500 + 20 + 60 = 580
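The arithmetic of this example can be reproduced with a few lines of code. The function below is only a sketch of the calculation shown above; its name and parameters are invented, and the factor 3 for the two-pass sort is taken directly from the "3*20" step of the example.

```python
def cost_without_index(relation_pages, selection_rfs, projection_rf, sort_factor=3):
    """Estimated I/O cost of scan + write temporary + sort, as in the example.

    Returns (scan_cost, write_cost, sort_cost, total_cost).
    """
    ratio = projection_rf
    for rf in selection_rfs:
        ratio *= rf                           # predicates assumed independent
    temp_pages = relation_pages * ratio       # size of the temporary relation
    scan_cost = relation_pages                # read every page of Sailors once
    write_cost = temp_pages                   # write out the (rating, sname) pairs
    sort_cost = sort_factor * temp_pages      # two-pass sort, as assumed above
    return scan_cost, write_cost, sort_cost, scan_cost + write_cost + sort_cost

scan_cost, write_cost, sort_cost, total_cost = cost_without_index(
    relation_pages=500, selection_rfs=[0.5, 0.1], projection_rf=0.8
)
print(round(write_cost), round(sort_cost), round(total_cost))
```

Rounding is used when printing because the product of the reduction factors is computed in floating point.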

Single Relation Queries -- Plans Utilizing an Index

- Single-index access path
  - When several indexes match the selection conditions, choose the access path that is expected to retrieve the fewest pages
- Multiple-index access path
  - Intersect the sets of record ids
  - Sort them by page id
  - Retrieve the data
- Sorted index access path
  - The GROUP BY attributes form a prefix of a tree index
- Index-only access path
  - Only an index scan; avoid retrieving data tuples
  - Steps: (1) apply selections, (2) remove unwanted attributes, (3) sort for grouping, (4) compute aggregation
  - Works even if the index does not match the selections in the WHERE clause

Example

SELECT S.rating, COUNT(DISTINCT S.sname) AS dsname
FROM Sailors S
WHERE S.rating > 5 AND S.age = 20
GROUP BY S.rating
HAVING dsname > 2;

Assumptions:
- (1) B+-tree index on rating;
- (2) Hash index on age;
- (3) B+-tree index on

Example

- Single-index access path
  - Use the hash index on age
    - Cost: retrieve the index entries + tuples
    - Apply the rating>5 condition
  - Project out the fields mentioned in SELECT, GROUP BY, HAVING
  - Write out temporary results (only keep sname and rating)
  - Sort on the rating field for GROUP BY
  - Apply aggregation and HAVING

Example

- Multiple-index access path
  - Retrieve the rids of tuples satisfying rating>5 (B+-tree index)
  - Retrieve the rids of tuples satisfying age=20 (hash index)
  - Intersect the two sets of rids and sort them by page id
  - Retrieve the corresponding data tuples; retain just the rating and name fields
  - Write temporary results
  - Sort on the rating field for GROUP BY
  - Apply aggregation and HAVING
- The other two cases?
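The first steps of the multiple-index access path (collect rids from each index, intersect, sort by page id, then fetch) can be sketched as follows. Everything here is a toy model assumed for illustration: pages are lists of dictionaries, a rid is a (page id, slot) pair, and plain dictionaries stand in for the B+-tree and hash indexes.

```python
from collections import defaultdict

# Toy heap file: page_id -> list of tuples; a rid is (page_id, slot).
pages = {
    0: [{"sname": "Ann", "rating": 7, "age": 20},
        {"sname": "Bob", "rating": 3, "age": 20}],
    1: [{"sname": "Cat", "rating": 9, "age": 35},
        {"sname": "Dan", "rating": 8, "age": 20}],
}

def build_index(attr):
    """Stand-in for an index: attribute value -> set of rids."""
    index = defaultdict(set)
    for page_id, tuples in pages.items():
        for slot, t in enumerate(tuples):
            index[t[attr]].add((page_id, slot))
    return index

rating_index = build_index("rating")  # plays the role of the B+-tree on rating
age_index = build_index("age")        # plays the role of the hash index on age

# Step 1: rids satisfying rating > 5 (a range scan on the tree index).
rids_rating = set().union(*(rids for key, rids in rating_index.items() if key > 5))
# Step 2: rids satisfying age = 20 (a hash lookup).
rids_age = age_index.get(20, set())
# Step 3: intersect, then sort by page id so each page is fetched once, in order.
matching = sorted(rids_rating & rids_age)
# Step 4: fetch the tuples and keep only rating and sname.
result = [(pages[p][s]["rating"], pages[p][s]["sname"]) for p, s in matching]
print(result)
```

Sorting the rids before fetching is what keeps the data-page accesses sequential and avoids reading the same page twice.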

Queries Over Multiple Relations

- Linear trees: at least one child of each join node is a base relation
- Left-deep trees: the right child of each join node is a base table
- Fundamental decision in System R: only left-deep join trees are considered.
  - As the number of joins increases, the number of alternative plans grows rapidly; we need to restrict the search space.
- Left-deep trees allow us to generate all fully pipelined plans.
  - Intermediate results are not written to temporary files.
  - Not all left-deep trees are fully pipelined (e.g., sort-merge join).

[Figure: example join trees over relations A, B, C, and D]

Enumeration of Left-Deep Plans

- Left-deep plans differ only in (1) the order of relations, (2) the access method for each relation, and (3) the join method for each join.
- Enumerated using N passes (if N relations are joined):
  - Pass 1: Find the best 1-relation plan for each relation.
    - Apply the selection terms that involve only that relation
    - Project out unneeded attributes
    - Keep the cheapest plan
  - Pass 2: Find the best way to join the result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.)
  - Pass N: Find the best way to join the result of an (N-1)-relation plan (as outer) to the N'th relation. (All N-relation plans.)
- For each subset of relations, retain only:
  - the cheapest plan overall, plus
  - the cheapest plan for each interesting order of the tuples.

Enumeration of Left-Deep Plans

- Pass 2: Find the best way to join the result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.)
  - Consider each single relation retained after Pass 1 as the outer relation and every other relation as the inner relation
  - A: outer relation; B: inner relation
    - Selections that involve only B: apply before the join
    - Selections that define the join
    - Selections that involve attributes in other relations: apply after the join
  - The first two groups of selections determine the access path for B
  - Project out unneeded attributes from B
  - Pipelined?
  - Best access method

Enumeration of Plans (Cont.)

- ORDER BY, GROUP BY, aggregates, etc. are handled as a final step, using either an "interestingly ordered" plan or an additional sorting operator.
- In spite of pruning the plan space, this approach is still exponential in the # of tables.
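The pass-by-pass enumeration of left-deep plans is a dynamic program over subsets of relations. The sketch below shows only that skeleton and makes strong simplifying assumptions that are not in the material above: the cost of a plan is taken to be the sum of its estimated intermediate result sizes, access methods, join methods and interesting orders are ignored, and the cardinalities and join selectivities are made-up numbers.

```python
from itertools import combinations

def best_left_deep_plan(card, sel):
    """Return (cost, result_size, join_order) of the cheapest left-deep plan.

    card: relation name -> cardinality
    sel:  frozenset({a, b}) -> join selectivity (1.0 when there is no join term)
    Cost model (an assumption): sum of intermediate result sizes.
    """
    rels = sorted(card)
    # Pass 1: one plan per single relation.
    best = {frozenset([r]): (0.0, float(card[r]), (r,)) for r in rels}
    # Pass k: extend the best (k-1)-relation plans by one inner relation.
    for k in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, k)):
            for inner in sorted(subset):
                outer = subset - {inner}
                cost, size, order = best[outer]
                out_size = size * card[inner]
                for r in outer:
                    out_size *= sel.get(frozenset([r, inner]), 1.0)
                new_cost = cost + out_size
                # Retain only the cheapest plan for each subset of relations.
                if subset not in best or new_cost < best[subset][0]:
                    best[subset] = (new_cost, out_size, order + (inner,))
    return best[frozenset(rels)]

# Hypothetical statistics for three relations A, B, C.
card = {"A": 10, "B": 1000, "C": 100}
sel = {frozenset(["A", "B"]): 0.01, frozenset(["B", "C"]): 0.01}

cost, size, order = best_left_deep_plan(card, sel)
print(order, round(cost), round(size))
```

For these numbers the cheapest plan joins A and B first and adds C last, because joining through the pair without a join term (A and C) or the large pair (B and C) creates a much bigger intermediate result. The table `best` still has one entry per subset of relations, which is why the approach remains exponential in the number of tables.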

Cost Estimation for Multi-relation Plans

SELECT attribute list
FROM relation list
WHERE term1 AND ... AND termk

- Consider a query block:
  - The maximum # tuples in the result is the product of the cardinalities of the relations in the FROM clause.
  - The reduction factor (RF) associated with each term reflects the impact of the term in reducing the result size. Result cardinality = Max # tuples * product of all RFs.
- Multi-relation plans are built up by joining one new relation at a time.
  - The cost of the join method, plus the estimate of the join cardinality, gives us both a cost estimate and a result size estimate
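The result-cardinality formula for a query block can be written down directly. In the sketch below the function name is invented, and the cardinalities and reduction factors in the usage example are hypothetical values chosen only to show the calculation.

```python
from math import prod

def estimate_result_cardinality(cardinalities, reduction_factors):
    """Max # tuples (product of cardinalities) times the product of all RFs."""
    return prod(cardinalities) * prod(reduction_factors)

# Hypothetical numbers: 40,000 Sailors and 100,000 Reserves tuples, with
# RF 1/40000 for S.sid = R.sid, 1/100 for bid = 100, and 1/2 for rating > 5.
estimated_rows = estimate_result_cardinality(
    [40_000, 100_000],
    [1 / 40_000, 1 / 100, 1 / 2],
)
print(round(estimated_rows))
```

Multiplying the reduction factors is valid only under the independence assumption; correlated predicates make this estimate too low.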

Example

Sailors: unclustered B+ tree on rating; unclustered hash index on sid
Reserves: unclustered B+ tree on bid

SELECT sname
FROM Sailors AS S, Reserves AS R
WHERE S.sid = R.sid AND bid = 100 AND rating > 5;

- Pass 1:
  - Sailors: the B+ tree matches rating>5 and is probably cheapest. However, if this selection is expected to retrieve a lot of tuples, a file scan may be cheaper.
    - Still, the B+ tree plan is kept (because tuples are in rating order).
  - Reserves: the B+ tree on bid matches bid=100; cheapest.

[Figure: plan tree with projection on sname over the join sid=sid of the selection bid=100 on Reserves and the selection rating > 5 on Sailors]

Example (cont.)

- Pass 2:
  - Consider each plan retained from Pass 1 as the outer, and consider how to join it with the (only) other relation.
  - Reserves as outer: the hash index can be used to get Sailors tuples that satisfy sid = outer tuple's sid value.
  - Sailors as outer:
    - Two selection conditions: (1) bid=100; (2) sid=value

Nested Sub-queries

- Find the names of sailors with the highest rating

SELECT S.sname
FROM Sailors S
WHERE S.rating = (SELECT MAX(S2.rating)
                  FROM Sailors S2)

- The sub-query returns a single value
- Steps:
  - The nested subquery can be evaluated just once, yielding a single value
  - This value is incorporated into the top-level query
    - E.g., S.rating = 8

Nested Sub-queries

- Find the names of sailors who have reserved the boat with number 103

SELECT S.sname
FROM Sailors S
WHERE S.sid IN (SELECT R.sid
                FROM Reserves R
                WHERE R.bid = 103)

- The sub-query returns a relation
- Steps:
  - The nested subquery can be evaluated just once, yielding a relation
  - Join S with this temporary relation.
  - A smart plan would use the temporary relation as the outer relation and, if S.sid has an index, an index nested loop join. Generally, optimizers do NOT do this.

Nested Sub-queries

- Correlated queries: Find the names of sailors who have reserved boat number 103

SELECT S.sname
FROM Sailors S
WHERE EXISTS (SELECT *
              FROM Reserves R
              WHERE R.bid = 103 AND S.sid = R.sid)

- Steps: Evaluate the nested sub-query for each tuple of Sailors.
- Problems
  - The nested sub-query is evaluated once per outer tuple
    - What if the same value appears in the correlation field more than once?
  - Not set-oriented; precludes other join alternatives
  - Even an index nested loop join has problems
  - The implicit ordering of these blocks means that some good strategies are not considered (Sailors as outer vs. Reserves as outer).

Nested Queries

SELECT S.sname
FROM Sailors S
WHERE EXISTS
    (SELECT *
     FROM Reserves R
     WHERE R.bid=103 AND R.sid=S.sid)

- A nested query: equivalent query without nesting
- A correlated query: equivalent query without correlation
- The unnested and "decorrelated" version of the query is typically optimized better.
- Many current optimizers cannot transform some of the nested versions into nonnested versions.

Nested block to optimize:
SELECT *
FROM Reserves R
WHERE R.bid=103 AND S.sid = outer value

Equivalent non-nested query:
SELECT S.sname
FROM Sailors S, Reserves R
WHERE S.sid=R.sid AND R.bid=103
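The equivalence of the correlated query and its non-nested form can be tried out on a small database. The sketch below uses Python's standard `sqlite3` module; the table definitions and sample rows are assumptions for illustration. Note that the join form can return a name more than once if a sailor reserved boat 103 several times, so in general the comparison should be made on the set of names (here each sailor has at most one such reservation).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and sample rows.
conn.executescript("""
    CREATE TABLE Sailors (sid INTEGER PRIMARY KEY, sname TEXT, rating INTEGER, age REAL);
    CREATE TABLE Reserves (sid INTEGER, bid INTEGER, day TEXT);
    INSERT INTO Sailors VALUES
        (22, 'Dustin', 7, 45.0), (31, 'Lubber', 8, 55.5), (58, 'Rusty', 10, 35.0);
    INSERT INTO Reserves VALUES
        (22, 101, '1998-10-10'), (22, 103, '1998-10-08'),
        (31, 103, '1998-11-06'), (58, 102, '1998-11-12');
""")

nested_query = """
    SELECT S.sname FROM Sailors S
    WHERE EXISTS (SELECT * FROM Reserves R
                  WHERE R.bid = 103 AND R.sid = S.sid)
"""
flat_query = """
    SELECT S.sname FROM Sailors S, Reserves R
    WHERE S.sid = R.sid AND R.bid = 103
"""

nested_names = sorted(row[0] for row in conn.execute(nested_query))
flat_names = sorted(row[0] for row in conn.execute(flat_query))
print(nested_names, flat_names)
```

Prefixing either statement with `EXPLAIN QUERY PLAN` shows how SQLite evaluates it, which is a convenient way to compare how the two formulations are planned.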

System R Optimizer

- Uses statistics to estimate the cost of a query evaluation plan
- Considers only plans with binary joins in which the inner relation is a base relation (i.e., not a temporary relation)
  - Reduces the number of alternative plans
- Focuses optimization on unnested SQL queries
- Uses a cost model that accounts for both I/O costs and CPU costs
- Does not perform duplicate elimination (except when a DISTINCT clause is present)

Summary

- Query optimization is an important task in a relational DBMS.
- We must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- Two parts to optimizing a query:
  - Consider a set of alternative plans.
    - Must prune the search space; typically, left-deep plans only.
  - Must estimate the cost of each plan that is considered.
    - Must estimate the size of the result and the cost for each plan node.
    - Key issues: statistics, indexes, operator implementations.
