Introduction to Database Systems
CSE 414
Lecture 26: More Indexes and Operator Costs
CSE 414 - Spring 2018

Hash Table Example
Index Student_ID on Student.ID

Index file (preferably in memory), hash table keys:
10, 20, 50, 200, 220, 240, 420, 800
Each entry points to the matching record in the data file.

Data file Student (on disk):
ID   fName  lName
10   Tom    Hanks
20   Amy    Hanks
…

• Hash table indexes are good for point queries
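A minimal sketch of the Student_ID hash index, assuming a Python dict stands in for the hash table and a list stands in for the data file on disk. The names data_file, student_id_index and point_lookup are made up for illustration; only the two Student rows come from the slide.

```python
# Hypothetical in-memory stand-in for the Student data file: a list of
# records, where the list position plays the role of a disk address.
data_file = [
    (10, "Tom", "Hanks"),
    (20, "Amy", "Hanks"),
]

# Index Student_ID on Student.ID: maps each key to its record position.
student_id_index = {record[0]: pos for pos, record in enumerate(data_file)}


def point_lookup(student_id):
    """Return the Student record with the given ID, or None if absent."""
    pos = student_id_index.get(student_id)  # one hash probe, no scan
    return None if pos is None else data_file[pos]


print(point_lookup(20))
print(point_lookup(50))
```

A point query costs one probe regardless of table size. A dict keeps no key order, which is why this structure does not help with range predicates.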
B+ Tree Index by Example
Recall binary trees from CSE 143!
d = 2
Find the key 40

Index file, B+ tree (preferably in memory):
Root:          80
Second level:  20 60 100 120 140
Leaf level:    10 15 18 20 30 40 50 60 65 80 85 90

Data file (on disk):
10 15 18 20 30 40 50 60 65 80 85 90

Search path for 40:
40 <= 80 (root)
20 < 40 <= 60 (second level)
30 < 40 <= 40 (leaf)

• B+ tree indexes are good for range queries

Clustered vs Unclustered
[Figure: a CLUSTERED and an UNCLUSTERED B+ tree, each showing index entries (index file) pointing to data records (data file)]
• Every table can have only one clustered and many unclustered indexes
• Why?
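A rough sketch of why an ordered index suits range queries. It does not build a real B+ tree: a sorted Python list stands in for the leaf level, and binary search stands in for the root-to-leaf descent. The function names are made up for illustration; the keys are the leaf keys from the example.

```python
from bisect import bisect_left, bisect_right

# Leaf-level keys from the example, in sorted order.
leaf_keys = [10, 15, 18, 20, 30, 40, 50, 60, 65, 80, 85, 90]


def find(key):
    """Point lookup: binary search stands in for the root-to-leaf descent."""
    i = bisect_left(leaf_keys, key)
    return i < len(leaf_keys) and leaf_keys[i] == key


def range_query(low, high):
    """Return all keys k with low <= k <= high."""
    start = bisect_left(leaf_keys, low)   # locate the first qualifying key
    end = bisect_right(leaf_keys, high)   # one past the last qualifying key
    return leaf_keys[start:end]           # qualifying keys sit next to each other


print(find(40))
print(range_query(20, 60))
```

Because the keys are kept in order, a range query needs one search for the lower bound and then reads neighbors sequentially.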
Example
Student(ID, fname, lname)
Takes(studentID, courseID)

SELECT *
FROM Student x, Takes y
WHERE x.ID=y.studentID AND y.courseID > 300

Without indexes:

for y in Takes
  if y.courseID > 300 then
    for x in Student
      if x.ID=y.studentID
        output *

Assume the database has indexes on these attributes:
• Takes_courseID = index on Takes.courseID
• Student_ID = index on Student.ID

With indexes (index selection on Takes, then index join with Student):

for y' in Takes_courseID where y'.courseID > 300
  y = fetch the Takes record pointed to by y'
  for x' in Student_ID where x'.ID = y.studentID
    x = fetch the Student record pointed to by x'
    output *

Query plan: σ courseID>300 (Takes) ⋈ studentID=ID Student

Which Indexes?
Student
ID   fName  lName
10   Tom    Hanks
20   Amy    Hanks
…
• How many indexes could we create?
• Which indexes should we create?
In general this is a very hard problem
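The two strategies for the Student/Takes join (nested loops versus index selection plus index join) can be sketched as runnable code. This is an illustration under assumptions: the Takes rows are invented, a dict models the Student_ID index, and a sorted list of (courseID, position) pairs models the Takes_courseID index.

```python
from bisect import bisect_right

# Sample rows: the Student rows come from the slides, the Takes rows are made up.
student = [(10, "Tom", "Hanks"), (20, "Amy", "Hanks")]
takes = [(10, 311), (10, 142), (20, 414), (20, 344)]  # (studentID, courseID)


def nested_loop_join():
    """Scan Takes, filter on courseID, then scan all of Student per match."""
    out = []
    for y in takes:
        if y[1] > 300:
            for x in student:
                if x[0] == y[0]:
                    out.append(x + y)
    return out


# Student_ID: index on Student.ID, modeled as key -> record.
student_id = {x[0]: x for x in student}

# Takes_courseID: index on Takes.courseID, modeled as sorted
# (courseID, position in takes) entries.
takes_course_id = sorted((y[1], pos) for pos, y in enumerate(takes))


def index_join():
    """Use Takes_courseID for the selection and Student_ID for the join."""
    out = []
    # Jump to the first index entry with courseID > 300.
    start = bisect_right(takes_course_id, (300, float("inf")))
    for _course_id, pos in takes_course_id[start:]:
        y = takes[pos]              # fetch the Takes record pointed to by the entry
        x = student_id.get(y[0])    # index probe on Student.ID
        if x is not None:
            out.append(x + y)
    return out


print(sorted(nested_loop_join()) == sorted(index_join()))
```

Both functions return the same tuples; the index version never touches Takes rows with courseID <= 300 and never scans Student.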
Which Indexes?
• The index selection problem
  – Given a table, and a "workload" (big Java application with lots of SQL queries), decide which indexes to create (and which ones NOT to create!)
• Who does index selection:
  – The database administrator (DBA)
  – Semi-automatically, using a database administration tool

Index Selection: Which Search Key
• Make some attribute K a search key if the WHERE clause contains:
  – An exact match on K
  – A range predicate on K
The Index Selection Problem 1
V(M, N, P);

Your workload is this:
100000 queries:
  SELECT * FROM V WHERE N=?
100 queries:
  SELECT * FROM V WHERE P=?

What indexes?
A: V(N) and V(P) (hash tables or B-trees)

The Index Selection Problem 2
V(M, N, P);

Your workload is this:
100000 queries:
  SELECT * FROM V WHERE N>? and N<?
100 queries:
  SELECT * FROM V WHERE P=?
100000 queries:
  INSERT INTO V VALUES (?, ?, ?)

What indexes?
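For the Problem 1 answer, V(N) and V(P), here is a small sketch of creating both indexes and checking that a query uses them. It assumes SQLite through Python's sqlite3 module, which builds B-tree indexes only; the index names V_N and V_P and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE V (M INTEGER, N INTEGER, P INTEGER)")
conn.executemany(
    "INSERT INTO V VALUES (?, ?, ?)",
    [(i, i % 100, i % 7) for i in range(1000)],  # made-up rows
)

# One index per search key used in the workload (index names are hypothetical).
conn.execute("CREATE INDEX V_N ON V (N)")
conn.execute("CREATE INDEX V_P ON V (P)")


def plan(sql, params):
    """Return SQLite's query-plan description lines for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # last column holds the plan text


print(plan("SELECT * FROM V WHERE N = ?", (5,)))
print(plan("SELECT * FROM V WHERE P = ?", (3,)))
```

The exact plan wording varies between SQLite versions, so the sketch prints it rather than assuming a fixed string. Each extra index also has to be updated on every INSERT, which is the trade-off the later problems raise.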
A: definitely V(N) (must be a B-tree); unsure about V(P)

The Index Selection Problem 3
V(M, N, P);

Your workload is this:
100000 queries:
  SELECT * FROM V WHERE N=?
1000000 queries:
  SELECT * FROM V WHERE N=? and P>?
100000 queries:
  INSERT INTO V VALUES (?, ?, ?)

What indexes?
A: V(N, P)
How does this index differ from:
1. Two indexes V(N) and V(P)?
2. An index V(P, N)?

The Index Selection Problem 4
V(M, N, P);

Your workload is this:
1000 queries:
  SELECT * FROM V WHERE N>? and N<?
100000 queries:
  SELECT * FROM V WHERE P>? and P<?

What indexes?
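To see how the Problem 3 index V(N, P) differs from V(P, N) for the predicate N=? and P>?, here is a small sketch. It assumes sorted Python lists stand in for the ordered index entries; the (N, P) values and function names are made up.

```python
from bisect import bisect_right

# Made-up (N, P) values for a few rows of V.
rows = [(1, 5), (2, 1), (2, 4), (2, 9), (3, 2), (3, 7)]

index_np = sorted(rows)                     # V(N, P): ordered by N, then P
index_pn = sorted((p, n) for n, p in rows)  # V(P, N): ordered by P, then N


def query_with_np(n, p_low):
    """N = n AND P > p_low: one contiguous slice of the (N, P) index."""
    start = bisect_right(index_np, (n, p_low))        # first entry with P > p_low
    end = bisect_right(index_np, (n, float("inf")))   # one past the last entry with N = n
    return index_np[start:end]


def query_with_pn(n, p_low):
    """Same query on (P, N): entries with P > p_low are contiguous, but the
    ones with N = n are scattered among them, so each must be checked."""
    start = bisect_right(index_pn, (p_low, float("inf")))
    scanned = index_pn[start:]
    matches = sorted((nn, p) for p, nn in scanned if nn == n)
    return matches, len(scanned)


print(query_with_np(2, 3))
print(query_with_pn(2, 3))
```

With V(N, P) the index reads only matching entries. With V(P, N) it reads every entry with P > p_low and filters on N afterwards.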
A: V(N) secondary, V(P) primary index

Two typical kinds of queries
• Point queries
  SELECT * FROM Movie WHERE year = ?
  • What data structure should be used for index?
• Range queries
  SELECT * FROM Movie WHERE year >= ? AND year <= ?
  • What data structure should be used for index?
Basic Index Selection Guidelines
• Look at WHERE clause for possible search key
• Try to choose indexes that speed up multiple queries
• Range queries benefit mostly from clustering

[Figure: cost (y-axis) vs. percentage of tuples retrieved, 0 to 100 (x-axis), for the query SELECT * FROM R WHERE R.K>? and R.K<?]
[Figure: cost vs. percentage of tuples retrieved for the range query on R.K, with curves for sequential scan and clustered index]
Choosing Index is Not Enough
• We still need to consider:
  – How each operator is implemented
  – The cost of each operator
  – Let's start with the basics
Cost Parameters
• Cost = I/O + CPU + Network BW
  – We will focus on I/O in this class
• Parameters (a.k.a. statistics):
  – B(R) = # of blocks (i.e., pages) for relation R
  – T(R) = # of tuples in relation R
  – V(R, a) = # of distinct values of attribute a
    When a is a key, V(R,a) = T(R)
    When a is not a key, V(R,a) can be anything <= T(R)
• DBMS collects statistics about base tables; it must infer them for intermediate results

Cost of Reading Data From Disk
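As a small illustration of the three statistics, the sketch below computes them for a toy relation. The relation R, the assumed page capacity tuples_per_block, and the helper names are all made up; a real DBMS maintains these numbers in its catalog rather than recomputing them this way.

```python
import math

# Made-up relation R(a, b); tuples_per_block is an assumed page capacity.
R = [(1, "x"), (2, "y"), (2, "z"), (3, "x"), (3, "y"), (3, "z"), (4, "x")]
tuples_per_block = 3


def T(rel):
    """T(R): number of tuples in the relation."""
    return len(rel)


def B(rel, per_block):
    """B(R): number of blocks needed, assuming per_block tuples fit in a block."""
    return math.ceil(len(rel) / per_block)


def V(rel, attr_pos):
    """V(R, a): number of distinct values of the attribute at position attr_pos."""
    return len({t[attr_pos] for t in rel})


print(T(R), B(R, tuples_per_block), V(R, 0), V(R, 1))
```

Here neither attribute is a key, so both V values are below T(R), consistent with V(R,a) <= T(R).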
Selectivity Factors for Conditions
• A = c /* σ_{A=c}(R) */
  – Selectivity = 1/V(R,A)

Cost of Reading Data From Disk
• Sequential scan for relation R costs B(R)
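A worked example of the two formulas, using invented statistics. The estimated result size T(R)/V(R,A) is an added assumption (values of A spread uniformly over the tuples), not something stated on the slide; the function names are hypothetical.

```python
def selectivity_eq(v_r_a):
    """Selectivity of the condition A = c: 1 / V(R, A)."""
    return 1 / v_r_a


def estimated_result_size(t_r, v_r_a):
    """Estimated tuples matching A = c, assuming uniformly distributed values."""
    return t_r / v_r_a


def sequential_scan_cost(b_r):
    """A sequential scan reads every block once: B(R) block I/Os."""
    return b_r


# Hypothetical statistics for a relation R.
t_r, b_r, v_r_a = 100_000, 2_000, 20

print(selectivity_eq(v_r_a))
print(estimated_result_size(t_r, v_r_a))
print(sequential_scan_cost(b_r))
```

With these numbers the selection is expected to keep 1 in 20 tuples, yet a sequential scan still pays for all 2000 blocks.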