
Introduction to Database Systems
CSE 414
Lecture 26: More Indexes and Operator Costs
CSE 414 - Spring 2018

Hash table example

Index Student_ID on Student.ID

Data file Student (on disk):
  ID    fName   lName
  10    Tom     Hanks
  20    Amy     Hanks
  …

[Figure: index file Student_ID (preferably in memory) with keys 10, 20, 50, 200, 220, 240, 420, 800, each pointing to the matching record in data file Student (on disk)]

Hash table indexes are good for point queries

B+ Tree Index by Example

Recall binary trees from CSE 143!

d = 2
Find the key 40

[Figure: B+ tree index file (preferably in memory) with root key 80, second-level keys 20, 60, 100, 120, 140, and leaf keys 10 15 18 20 30 40 50 60 65 80 85 90 pointing into the data file (on disk). The search path for key 40 is: 40 <= 80, then 20 < 40 <= 60, then 30 < 40 <= 40.]

B+ indexes are good for range queries

Clustered vs Unclustered

[Figure: two B+ trees, each with index entries (index file) above data records (data file), one labeled CLUSTERED and one labeled UNCLUSTERED]

Every table can have only one clustered and many unclustered indexes
Why?

Example

Student(ID, fname, lname)
Takes(studentID, courseID)

  SELECT *
  FROM Student x, Takes y
  WHERE x.ID=y.studentID AND y.courseID > 300

Nested loop plan:

  for y in Takes
    if courseID > 300 then
      for x in Student
        if x.ID=y.studentID
          output *

Assume the database has indexes on these attributes:
• Takes_courseID = index on Takes.courseID
• Student_ID = index on Student.ID

[Figure: query plan with an index join ⋈ studentID=ID whose inputs are an index selection σ courseID>300 on Takes and the relation Student]

Index selection followed by index join:

  for y’ in Takes_courseID where y’.courseID > 300
    y = fetch the Takes record pointed to by y’
    for x’ in Student_ID where x’.ID = y.studentID
      x = fetch the Student record pointed to by x’
      output *

Which Indexes?

Student
  ID    fName   lName
  10    Tom     Hanks
  20    Amy     Hanks
  …

• How many indexes could we create?
• Which indexes should we create?

In general this is a very hard problem
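The index selection and index join pseudocode on the Example slide above can be tried out with a small Python sketch. Everything here is an assumption made for illustration: the rows are invented, a sorted list of (key, row position) pairs stands in for the Takes_courseID index, a dictionary stands in for the Student_ID index, and the helper name index_join is hypothetical.

```python
import bisect

# Toy relations; the rows are made up for illustration.
student = [(10, "Tom", "Hanks"), (20, "Amy", "Hanks")]   # Student(ID, fname, lname)
takes = [(10, 414), (20, 143), (20, 344)]                # Takes(studentID, courseID)

# Stand-in for Takes_courseID: sorted (courseID, row position) pairs,
# roughly what the leaf level of a B+ tree would hold.
takes_course_id = sorted((course, pos) for pos, (_, course) in enumerate(takes))

# Stand-in for Student_ID: key -> row positions, like a hash index.
student_id = {}
for pos, (sid, _, _) in enumerate(student):
    student_id.setdefault(sid, []).append(pos)


def index_join(min_course_id):
    """Return Student x Takes rows with x.ID = y.studentID and y.courseID > min_course_id."""
    out = []
    # Index selection: jump to the first index entry with courseID > min_course_id.
    start = bisect.bisect_right(takes_course_id, (min_course_id, float("inf")))
    for _, y_pos in takes_course_id[start:]:
        y = takes[y_pos]                        # fetch the Takes record
        for x_pos in student_id.get(y[0], []):  # index lookup on Student.ID
            x = student[x_pos]                  # fetch the Student record
            out.append(x + y)
    return out


result = index_join(300)
print(result)
```

Unlike the nested loop plan, this version never scans Student: each qualifying Takes record triggers one lookup in the Student_ID stand-in.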
• The index selection problem
  – Given a table, and a “workload” (big Java application with lots of SQL queries), decide which indexes to create (and which ones NOT to create!)
• Who does index selection:
  – The database administrator (DBA)
  – Semi-automatically, using a database administration tool

Index Selection: Which Search Key

• Make some attribute K a search key if the WHERE clause contains:
  – An exact match on K
  – A range predicate on K

The Index Selection Problem 1

V(M, N, P);

Your workload is this:

  100000 queries:
    SELECT *
    FROM V
    WHERE N=?

  100 queries:
    SELECT *
    FROM V
    WHERE P=?

What indexes?

A: V(N) and V(P) (hash tables or B-trees)

The Index Selection Problem 2

V(M, N, P);

Your workload is this:

  100000 queries:
    SELECT *
    FROM V
    WHERE N>? and N<?

  100 queries:
    SELECT *
    FROM V
    WHERE P=?

  100000 queries:
    INSERT INTO V
    VALUES (?, ?, ?)

What indexes?

A: definitely V(N) (must be a B-tree); unsure about V(P)
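The answers to Problems 1 and 2 turn on which lookups each index structure supports. The sketch below is only an illustration under stated assumptions: the rows of V are invented, a Python dictionary plays the role of a hash index on N, and a sorted list searched with bisect plays the role of a B-tree on N. The function names point_query and range_query are hypothetical.

```python
import bisect

# Toy rows of V(M, N, P); the values are made up for illustration.
v_rows = [(1, 5, 9), (2, 7, 9), (3, 5, 4), (4, 12, 1)]

# Hash-style index on N: supports exact-match lookups only.
hash_on_n = {}
for pos, (_, n, _) in enumerate(v_rows):
    hash_on_n.setdefault(n, []).append(pos)

# Sorted index on N: a stand-in for the ordered leaf level of a B-tree.
sorted_on_n = sorted((n, pos) for pos, (_, n, _) in enumerate(v_rows))


def point_query(n):
    """SELECT * FROM V WHERE N = n, answered from the hash-style index."""
    return [v_rows[pos] for pos in hash_on_n.get(n, [])]


def range_query(lo, hi):
    """SELECT * FROM V WHERE N > lo and N < hi, answered from the sorted index."""
    start = bisect.bisect_right(sorted_on_n, (lo, float("inf")))
    end = bisect.bisect_left(sorted_on_n, (hi, -1))
    return [v_rows[pos] for _, pos in sorted_on_n[start:end]]


print(point_query(5))
print(range_query(5, 12))
```

The dictionary offers no way to find "all keys between lo and hi" short of visiting every key, which is why the range workload in Problem 2 calls for an ordered structure. Both structures must also be updated on every INSERT, which is the cost that makes V(P) doubtful there.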
The Index Selection Problem 3

V(M, N, P);

Your workload is this:

  100000 queries:
    SELECT *
    FROM V
    WHERE N=?

  1000000 queries:
    SELECT *
    FROM V
    WHERE N=? and P>?

  100000 queries:
    INSERT INTO V
    VALUES (?, ?, ?)

What indexes?

A: V(N, P)

How does this index differ from:
1. Two indexes V(N) and V(P)?
2. An index V(P, N)?

The Index Selection Problem 4

V(M, N, P);

Your workload is this:

  1000 queries:
    SELECT *
    FROM V
    WHERE N>? and N<?

  100000 queries:
    SELECT *
    FROM V
    WHERE P>? and P<?

What indexes?

A: V(N) secondary, V(P) primary index

Two typical kinds of queries

• Point queries
    SELECT *
    FROM Movie
    WHERE year = ?
• What data structure should be used for index?

• Range queries
    SELECT *
    FROM Movie
    WHERE year >= ? AND year <= ?
• What data structure should be used for index?

Basic Index Selection Guidelines

• Consider queries in workload in order of importance
• Consider relations accessed by query
  – No point indexing other relations
• Look at WHERE clause for possible search key
• Try to choose indexes that speed-up multiple queries

  SELECT *
  FROM R
  WHERE R.K>? and R.K<?

• Range queries benefit mostly from clustering

[Figure: cost versus percentage of tuples retrieved (0 to 100) for the query above, with curves for sequential scan and clustered index]

Choosing Index is Not Enough
• To estimate the cost of a query plan, we still need to consider other factors:
  – How each operator is implemented
  – The cost of each operator
  – Let’s start with the basics

[Figure: cost versus percentage of tuples retrieved (0 to 100), with curves for unclustered index, sequential scan, and clustered index]

Cost Parameters

• Cost = I/O + CPU + Network BW
  – We will focus on I/O in this class
• Parameters (a.k.a. statistics):
  – B(R) = # of blocks (i.e., pages) for relation R
  – T(R) = # of tuples in relation R
  – V(R, a) = # of distinct values of attribute a
      When a is a key, V(R,a) = T(R)
      When a is not a key, V(R,a) can be anything <= T(R)
• DBMS collects statistics about base tables; it must infer them for intermediate results

Cost of Reading Data From Disk

Selectivity Factors for Conditions

• A = c /* σA=c(R) */
  – Selectivity = 1/V(R,A)
• A < c /* σA<c(R) */
  – Selectivity = (c - min(R, A))/(max(R,A) - min(R,A))
• c1 < A < c2 /* σc1<A<c2(R) */
  – Selectivity = (c2 – c1)/(max(R,A) - min(R,A))

Cost of Reading Data From Disk

• Sequential scan for relation R costs B(R)
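The selectivity factors and the sequential scan cost above translate directly into arithmetic. The following sketch only restates those formulas as Python functions. The function names are hypothetical, and the statistics used in the example (T(R) = 100,000, V(R, A) = 20, attribute values between 0 and 100) are assumed purely for illustration.

```python
def selectivity_eq(v_ra):
    """Selectivity of A = c: 1 / V(R, A)."""
    return 1 / v_ra


def selectivity_lt(c, min_a, max_a):
    """Selectivity of A < c: (c - min) / (max - min)."""
    return (c - min_a) / (max_a - min_a)


def selectivity_between(c1, c2, min_a, max_a):
    """Selectivity of c1 < A < c2: (c2 - c1) / (max - min)."""
    return (c2 - c1) / (max_a - min_a)


def estimated_tuples(t_r, selectivity):
    """Expected number of matching tuples: T(R) * selectivity."""
    return t_r * selectivity


def sequential_scan_cost(b_r):
    """A sequential scan reads every block once: B(R) I/Os."""
    return b_r


# Assumed statistics, for illustration only.
T_R, V_RA, MIN_A, MAX_A = 100_000, 20, 0, 100

print(estimated_tuples(T_R, selectivity_eq(V_RA)))
print(estimated_tuples(T_R, selectivity_lt(25, MIN_A, MAX_A)))
print(estimated_tuples(T_R, selectivity_between(10, 30, MIN_A, MAX_A)))
```

The range formulas implicitly assume that values of A are spread evenly between min(R, A) and max(R, A), so the estimates are only as good as that assumption.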
Index Based Selection

• Example: B(R) = 2000, T(R) = 100,000, V(R, a) = 20; cost of σa=v(R) = ?
• Table scan: B(R) = 2,000 I/Os
• Index based selection:
  – If index is clustered: B(R) * 1/V(R,a) = 100 I/Os
  – If index is unclustered: T(R) * 1/V(R,a) = 5,000 I/Os

Note: we ignore I/O cost for index pages

Lesson: Don’t build unclustered indexes when V(R,a) is small!

Cost of Executing Operators
(Focus on Single Node Joins)

Outline

• Join operator algorithms
  – One-pass algorithms (Sec. 15.2 and 15.3)
  – Index-based algorithms (Sec 15.6)
• Note about readings:
  – In class, we discuss only algorithms for joins
  – Other operators are easier: read the book

Join Algorithms

• Nested loop join
• Hash join
• Sort-merge join
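Before moving on to joins, the Index Based Selection example earlier in this section (2,000 vs. 100 vs. 5,000 I/Os) can be reproduced with a few lines of Python. This is a sketch of the slide's formulas only; the function names are hypothetical and, as on the slide, the I/O cost of reading index pages is ignored.

```python
def table_scan_cost(b_r):
    """Read every block of R: B(R) I/Os."""
    return b_r


def clustered_index_cost(b_r, v_ra):
    """Matching tuples sit on adjacent blocks: B(R) * 1/V(R, a) I/Os."""
    return b_r / v_ra


def unclustered_index_cost(t_r, v_ra):
    """Each matching tuple may cost its own block read: T(R) * 1/V(R, a) I/Os."""
    return t_r / v_ra


B_R, T_R, V_RA = 2000, 100_000, 20   # statistics from the slide's example

scan = table_scan_cost(B_R)
clustered = clustered_index_cost(B_R, V_RA)
unclustered = unclustered_index_cost(T_R, V_RA)
print(scan, clustered, unclustered)

# With these B(R) and T(R), the unclustered index only beats the scan
# once T(R) / V(R, a) < B(R), i.e. once V(R, a) exceeds T(R) / B(R).
break_even_v = T_R / B_R
print(break_even_v)
```

For these statistics the break-even point is V(R, a) = 50, so with only 20 distinct values the unclustered index costs more than scanning the whole table, which is the slide's lesson.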
Nested Loop Joins

• Tuple-based nested loop R ⋈ S
• R is the outer relation, S is the inner relation

  for each tuple t1 in R do
    for each tuple t2 in S do
      if t1 and t2 join then output (t1,t2)

What is the Cost?

• Cost: B(R) + T(R) B(S)
• Multiple-pass since S is read many times

Page-at-a-time Refinement

  for each page of tuples r in R do
    for each page of tuples s in S do
      for all pairs of tuples t1 in r, t2 in s
        if t1 and t2 join then output (t1,t2)

What is the Cost?

• Cost: B(R) + B(R)B(S)

[Figure: Page-at-a-time Refinement, shown in three steps. Pages of Patient and Insurance on disk are read one page at a time into an input buffer for Patient and an input buffer for Insurance; joined tuples go to an output buffer.]

Keep going until read all of Insurance
Then repeat for next page of Patient… until end of Patient

Cost: B(R) + B(R)B(S)
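The page-at-a-time refinement and its cost formula can be checked with a small simulation. This is a minimal sketch under several assumptions: relations are modeled as Python lists of pages, each page read counts as one I/O, "t1 and t2 join" is taken to mean equality on one field, and the data values and helper names (read_pages, page_nested_loop_join) are invented for illustration.

```python
def read_pages(relation, counter):
    """Yield the pages of a relation one at a time, counting one I/O per page."""
    for page in relation:
        counter["io"] += 1
        yield page


def page_nested_loop_join(r_pages, s_pages, key_r=0, key_s=0):
    """Page-at-a-time nested loop join; returns (joined pairs, number of I/Os)."""
    counter = {"io": 0}
    out = []
    for r in read_pages(r_pages, counter):          # outer relation R
        for s in read_pages(s_pages, counter):      # inner relation S, re-read per R page
            for t1 in r:
                for t2 in s:
                    if t1[key_r] == t2[key_s]:      # assumed equality join condition
                        out.append((t1, t2))
    return out, counter["io"]


# Toy data: B(R) = 2 pages, B(S) = 3 pages, two tuples per page.
patient = [[(1, "p1"), (2, "p2")], [(3, "p3"), (4, "p4")]]
insurance = [[(2, "x"), (4, "y")], [(6, "z"), (1, "w")], [(3, "u"), (9, "v")]]

pairs, io = page_nested_loop_join(patient, insurance)
print(len(pairs), io)
```

With B(R) = 2 and B(S) = 3 the simulation performs 2 + 2 * 3 = 8 page reads, matching B(R) + B(R)B(S). The tuple-based version would re-read S once per tuple of R, giving B(R) + T(R)B(S) = 2 + 4 * 3 = 14 for the same data.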
