Query Optimization

References:
- [RG-3ed] Chapter 15
- [SKS-6ed] Chapter 13

Database Management Systems II, Huiping Cao

Query Optimization
- Introduction
- Evaluation of Expressions
- Query Blocks
- Transformation of Relational Expressions
- Cost Estimation
- Enumeration of Alternative Plans
- Nested Subqueries

Individual Operators
- Queries are composed of a few basic operators; the implementation of these operators can be carefully tuned (and it is important to do this!).
- Each operator has many alternative implementation techniques; for most operators, no technique is universally superior.
- The optimizer must consider the available alternatives for each operation in a query and choose the best one based on system statistics, etc. This is part of the broader task of optimizing a query composed of several operations.

Introduction
- Alternative ways of evaluating a given query:
  - Equivalent expressions
  - Different algorithms for each operation

    SELECT name, title
    FROM instructor, teaches, course
    WHERE dept_name = 'Music'
      AND instructor.id = teaches.iid
      AND course.cid = teaches.cid;

Introduction (Cont.)
- An evaluation plan defines exactly what algorithm is used for each operation, and how the execution of the operations is coordinated.
- Find out how to view query execution plans on your favorite database.

PostgreSQL

    EXPLAIN select * from measurement_instance as mi, measurement_type as mt
    where mt.annot_id=mi.did and mt.mtypelabel=mi.mtypelabel;

                                     QUERY PLAN
    ------------------------------------------------------------------------------------
    Hash Join  (cost=11.75..25.96 rows=1 width=1377)
      Hash Cond: ((mi.did = mt.annot_id) AND (mi.mtypelabel = mt.mtypelabel))
      ->  Seq Scan on measurement_instance mi  (cost=0.00..12.40 rows=240 width=310)
      ->  Hash  (cost=10.70..10.70 rows=70 width=1067)
            ->  Seq Scan on measurement_type mt  (cost=0.00..10.70 rows=70 width=1067)
    (5 rows)

Introduction (Cont.)
- The cost difference between evaluation plans for a query can be enormous.
  - E.g., seconds vs. days in some cases
- Steps in cost-based query optimization:
  1. Generate logically equivalent expressions using equivalence rules
  2. Annotate resultant expressions to get alternative query plans
  3. Choose the cheapest plan based on estimated cost

Introduction (Cont.)
- Estimation of plan cost is based on:
  - Statistical information about relations. Examples: number of tuples, number of distinct values for an attribute
  - Statistics estimation for intermediate results, to compute the cost of complex expressions
  - Cost formulae for algorithms, computed using statistics

Evaluation of Expressions
- So far, we have seen algorithms for individual operations.
- Alternatives for evaluating an entire expression tree:
  - Materialization: generate the results of an expression whose inputs are relations or are already computed, and materialize (store) them on disk.
  - Pipelining: pass tuples on to parent operations even as an operation is being executed.

Materialization
- Materialized evaluation: evaluate one operation at a time, starting at the lowest level. Use intermediate results materialized into temporary relations to evaluate next-level operations.
- E.g., in the expression below, compute and store $\sigma_{building = \text{"Watson"}}(department)$, then compute and store its join with instructor, and finally compute the projection on name.

  $\Pi_{name}(\sigma_{building = \text{"Watson"}}(department) \bowtie instructor)$

Materialization (Cont.)
- Materialized evaluation is always applicable.
- The cost of writing results to disk and reading them back can be quite high.
  - The cost formulas for individual operations ignore the cost of writing results to disk, so: overall cost = sum of costs of individual operations + cost of writing intermediate results to disk
- Double buffering: use two output buffers for each operation; when one is full, write it to disk while the other is being filled.
  - Allows disk writes to overlap with computation and reduces execution time.

Pipelining
- Pipelined evaluation: evaluate several operations simultaneously, passing the results of one operation on to the next.
- E.g., in the previous expression tree, do not store the results of $\sigma_{building = \text{"Watson"}}(department)$; instead, pass tuples directly to the join. Similarly, do not store the results of the join; pass tuples directly to the projection.
- Much cheaper than materialization: no need to store a temporary relation to disk.
- Pipelining may not always be possible, e.g., sort, hash-join.
- Pipelines can be executed in two ways: demand driven and producer driven.

Pipelining (Cont.)
- In demand-driven (lazy evaluation, pull model) pipelining:
  - The system repeatedly requests the next tuple from the top-level operation.
  - Each operation requests the next tuple from its child operations as required, in order to output its next tuple.
  - In between calls, an operation has to maintain "state" so it knows what to return next.
- In producer-driven (eager, push model) pipelining:
  - Operators produce tuples eagerly and pass them up to their parents.
    - A buffer is maintained between operators; the child puts tuples in the buffer, and the parent removes tuples from the buffer.
    - If the buffer is full, the child waits until there is space in the buffer, and then generates more tuples.
  - The system schedules operations that have space in the output buffer and can process more input tuples.

Evaluation Algorithms for Pipelining
- Some algorithms are not able to output results even as they get input tuples.
  - E.g., merge join, or hash join
  - Intermediate results are written to disk and then read back
- Blocking operations
- Operations are pipelined

Viewing execution plans
- PostgreSQL
  - EXPLAIN command: shows the execution plan of a statement
  - Ref: http://www.postgresql.org/docs/8.1/static/sql-explain.html
- MySQL
  - EXPLAIN command
  - Ref: http://dev.mysql.com/doc/refman/5.0/en/explain.html

Query Blocks: Units of Optimization

    SELECT S.sname
    FROM Sailors S
    WHERE S.age IN
        (SELECT MAX (S2.age)
         FROM Sailors S2
         GROUP BY S2.rating)

  (Outer block: the query on Sailors S. Nested block: the subquery on Sailors S2.)

- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- A query block has:
  - No nesting
  - Exactly one SELECT and one FROM clause
  - At most one WHERE clause, GROUP BY clause, and HAVING clause
    - WHERE clause in conjunctive normal form

Query Blocks: Units of Optimization (Cont.)
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. (This is an over-simplification, but serves for now.)
- For each block, the plans considered are:
  - All available access methods, for each relation in the FROM clause.
  - All left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
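
The left-deep plan space described for a query block can be illustrated with a small sketch. It enumerates every relation permutation and every choice of join method per join step, as the slide text describes. The function name `left_deep_plans` and the method labels "NLJ" and "HJ" are hypothetical, chosen only for illustration; a real optimizer would typically prune this space with cost estimates rather than list it exhaustively.

```python
from itertools import permutations


def left_deep_plans(relations, join_methods):
    """Yield every left-deep join plan as a nested tuple.

    A plan is either a base relation name, or a triple
    (method, left_subplan, right_base_relation). The right (inner)
    input of every join is always a base relation, which is what
    makes the tree left-deep.
    """
    for order in permutations(relations):
        # Start from the first relation, then join the rest one at a time.
        partial = [order[0]]
        for rel in order[1:]:
            partial = [(m, p, rel) for p in partial for m in join_methods]
        yield from partial


# Hypothetical labels: NLJ = nested-loops join, HJ = hash join.
plans = list(left_deep_plans(["instructor", "teaches", "course"], ["NLJ", "HJ"]))

# 3! relation orders times 2 method choices for each of the 2 joins.
print(len(plans))  # 24
print(plans[0])    # ('NLJ', ('NLJ', 'instructor', 'teaches'), 'course')
```

With n relations and m join methods the sketch produces n! * m^(n-1) plans, which shows why enumeration has to be combined with cost-based pruning for anything beyond a handful of relations.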

Transformation of Relational Expressions
- Two relational algebra expressions are said to be equivalent if the two expressions generate the same set of tuples on every legal database instance.
  - Note: the order of tuples is irrelevant.
  - We do not care if they generate different results on databases that violate integrity constraints.
- In SQL, inputs and outputs are multisets of tuples.
  - Two expressions in the multiset version of the relational algebra are said to be equivalent if the two expressions generate the same multiset of tuples on every legal database instance.
- An equivalence rule says that expressions of two forms are equivalent.
  - We can replace an expression of the first form by the second, or vice versa.

Equivalence Rules
1. Conjunctive selection operations can be deconstructed into a sequence of individual selections.
   $\sigma_{\theta_1 \wedge \theta_2}(E) = \sigma_{\theta_1}(\sigma_{\theta_2}(E))$
2. Selection operations are commutative.
   $\sigma_{\theta_1}(\sigma_{\theta_2}(E)) = \sigma_{\theta_2}(\sigma_{\theta_1}(E))$
3. Only the last (outermost) in a sequence of projection operations is needed; the others can be omitted.
   $\Pi_{L_1}(\Pi_{L_2}(\dots(\Pi_{L_n}(E))\dots)) = \Pi_{L_1}(E)$
4. Selections can be combined with Cartesian products and theta joins.
   a. $\sigma_{\theta}(E_1 \times E_2) = E_1 \bowtie_{\theta} E_2$
   b. $\sigma_{\theta_1}(E_1 \bowtie_{\theta_2} E_2) = E_1 \bowtie_{\theta_1 \wedge \theta_2} E_2$

Equivalence Rules (Cont.)
5. Theta-join operations (and natural joins) are commutative.
   $E_1 \bowtie_{\theta} E_2 = E_2 \bowtie_{\theta} E_1$
6. (a) Natural join operations are associative:
   $(E_1 \bowtie E_2) \bowtie E_3 = E_1 \bowtie (E_2 \bowtie E_3)$
   (b) Theta joins are associative in the following manner:
   $(E_1 \bowtie_{\theta_1} E_2) \bowtie_{\theta_2 \wedge \theta_3} E_3 = E_1 \bowtie_{\theta_1 \wedge \theta_3} (E_2 \bowtie_{\theta_2} E_3)$
   where $\theta_2$ involves attributes from only $E_2$ and $E_3$.
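
As a sanity check on a few of the rules above, the following sketch evaluates both sides of rules 1, 2, 4a, and 5 on a tiny instance and compares the results as multisets. The helper names (`select`, `cross`, `theta_join`, `as_bag`) and the sample tuples are made up for illustration; agreement on one instance does not prove an equivalence, it only shows what the rules claim.

```python
from collections import Counter
from itertools import product


def select(pred, rel):
    """Selection: keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]


def cross(r1, r2):
    """Cartesian product; assumes the attribute names are disjoint."""
    return [{**a, **b} for a, b in product(r1, r2)]


def theta_join(pred, r1, r2):
    """Theta join: pairs of tuples whose combination satisfies pred."""
    return [{**a, **b} for a, b in product(r1, r2) if pred({**a, **b})]


def as_bag(rel):
    """Multiset view of a relation, ignoring tuple and attribute order."""
    return Counter(frozenset(t.items()) for t in rel)


# Made-up sample data, only for illustration.
instructor = [
    {"id": 1, "name": "Mozart", "dept_name": "Music"},
    {"id": 2, "name": "Einstein", "dept_name": "Physics"},
    {"id": 3, "name": "Bach", "dept_name": "Music"},
]
teaches = [
    {"iid": 1, "cid": "MU-199"},
    {"iid": 2, "cid": "PHY-101"},
    {"iid": 3, "cid": "MU-199"},
]


def theta1(t):
    return t["dept_name"] == "Music"


def theta2(t):
    return t["id"] >= 2


def join_cond(t):
    return t["id"] == t["iid"]


checks = {
    # Rule 1: a conjunctive selection equals a cascade of selections.
    "rule 1": as_bag(select(lambda t: theta1(t) and theta2(t), instructor))
    == as_bag(select(theta1, select(theta2, instructor))),
    # Rule 2: selections commute.
    "rule 2": as_bag(select(theta1, select(theta2, instructor)))
    == as_bag(select(theta2, select(theta1, instructor))),
    # Rule 4a: selection over a Cartesian product is a theta join.
    "rule 4a": as_bag(select(join_cond, cross(instructor, teaches)))
    == as_bag(theta_join(join_cond, instructor, teaches)),
    # Rule 5: theta joins commute.
    "rule 5": as_bag(theta_join(join_cond, instructor, teaches))
    == as_bag(theta_join(join_cond, teaches, instructor)),
}

for rule, holds in checks.items():
    print(rule, "holds on this instance:", holds)
```

Comparing `Counter` objects rather than sets matches the multiset semantics the slides mention for SQL, so duplicate tuples would be counted rather than collapsed.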
