
Introduction to Database Systems
CSE 414
Lecture 26: More Indexes and Operator Costs
CSE 414 - Spring 2018

Hash table example
Index Student_ID on Student.ID

Data file Student (on disk):

    ID    fName   lName
    10    Tom     Hanks
    20    Amy     Hanks
    …

[Figure: the index file (preferably in memory) holds entries for the keys 10, 20, 50, 200, 220, 240, 420, 800, each pointing to the matching record in the data file (on disk).]

Hash table indexes are good for point queries

B+ Tree Index by Example
Recall binary trees from CSE 143!
• d = 2
• Find the key 40

[Figure: B+ tree. Root: 80. Inner level: 20, 60, 100, 120, 140. Leaf entries: 10 15 18 20 30 40 50 60 65 80 85 90, pointing to the same keys in the data file (on disk).]

Search path for 40: 40 <= 80, then 20 < 40 <= 60, then 30 < 40 <= 40.

B+ indexes are good for range queries

Clustered vs Unclustered
[Figure: two panels, CLUSTERED and UNCLUSTERED. In each, a B+ tree (index file, preferably in memory) has index entries that point to data records in the data file.]
Every table can have only one clustered and many unclustered indexes
Why?

Example
Student(ID, fname, lname)
Takes(studentID, courseID)

    SELECT *
    FROM Student x, Takes y
    WHERE x.ID=y.studentID AND y.courseID > 300

    for y in Takes
      if courseID > 300 then
        for x in Student
          if x.ID=y.studentID
            output *

Assume the database has indexes on these attributes:
• Takes_courseID = index on Takes.courseID
• Student_ID = index on Student.ID

    for y’ in Takes_courseID where y’.courseID > 300      (index selection)
      y = fetch the Takes record pointed to by y’
      for x’ in Student_ID where x’.ID = y.studentID      (index join)
        x = fetch the Student record pointed to by x’
        output *

[Figure: query plan. ⋈ studentID=ID joins σ courseID>300 (Takes) with Student.]

Which Indexes?
• How many indexes could we create?
• Which indexes should we create?
In general this is a very hard problem
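The two plans for the Student/Takes example can be sketched in Python as follows. This is only an illustration under assumptions: the rows beyond Tom and Amy Hanks are made up, a `dict` stands in for the Student_ID index, and a sorted list searched with `bisect` stands in for the Takes_courseID index. A real DBMS would store both on disk pages.

```python
from bisect import bisect_right

# Hypothetical sample data; only the first two Student rows appear in the slides.
students = [(10, "Tom", "Hanks"), (20, "Amy", "Hanks"), (50, "Ann", "Lee")]
takes = [(10, 311), (20, 142), (50, 414), (10, 444)]  # (studentID, courseID)


def nested_loop_plan(students, takes):
    """Plan without indexes: scan Takes, then scan Student for each match."""
    out = []
    for y in takes:
        if y[1] > 300:
            for x in students:
                if x[0] == y[0]:
                    out.append(x + y)
    return out


def index_plan(students, takes):
    """Plan with indexes: index selection on courseID, then index join on ID."""
    # Stand-in for Student_ID: maps an ID to the position of its record.
    student_id = {row[0]: pos for pos, row in enumerate(students)}
    # Stand-in for Takes_courseID: (courseID, position) entries in key order.
    takes_course_id = sorted((row[1], pos) for pos, row in enumerate(takes))

    # Index selection: skip every entry with courseID <= 300.
    start = bisect_right(takes_course_id, (300, float("inf")))
    out = []
    for _, y_pos in takes_course_id[start:]:
        y = takes[y_pos]                    # fetch the Takes record
        x_pos = student_id.get(y[0])        # index join: probe Student_ID
        if x_pos is not None:
            out.append(students[x_pos] + y)  # fetch the Student record
    return out
```

Both functions return the same set of joined rows; the indexed plan touches only the Takes entries above 300 and does one probe per qualifying entry instead of a full pass over Student.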
• The index selection problem
  – Given a table, and a “workload” (big Java application with lots of SQL queries), decide which indexes to create (and which ones NOT to create!)
• Who does index selection:
  – The database administrator (DBA)
  – Semi-automatically, using a database administration tool

Index Selection: Which Search Key
• Make some attribute K a search key if the WHERE clause contains:
  – An exact match on K
  – A range predicate on K

The Index Selection Problem 1
V(M, N, P);
Your workload is this:
• 100000 queries:
      SELECT * FROM V WHERE N=?
• 100 queries:
      SELECT * FROM V WHERE P=?
What indexes?
A: V(N) and V(P) (hash tables or B-trees)

The Index Selection Problem 2
V(M, N, P);
Your workload is this:
• 100000 queries:
      SELECT * FROM V WHERE N>? and N<?
• 100 queries:
      SELECT * FROM V WHERE P=?
• 100000 queries:
      INSERT INTO V VALUES (?, ?, ?)
What indexes?
A: definitely V(N) (must be a B-tree); unsure about V(P)
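Why Problem 1 accepts either structure while Problem 2 needs a B-tree on N can be seen in a small Python sketch. Everything here is illustrative: the rows of V are invented, a `dict` plays the role of a hash index on P, and a sorted list searched with `bisect` plays the role of the ordered leaf level of a B-tree on N.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical rows of V(M, N, P).
rows = [(1, 5, 70), (2, 8, 30), (3, 5, 10), (4, 12, 30), (5, 9, 50)]

# Hash index on P: supports exact-match lookups only.
hash_on_p = defaultdict(list)
for pos, (_, _, p) in enumerate(rows):
    hash_on_p[p].append(pos)

# Ordered index on N: (N, position) entries kept in key order.
ordered_on_n = sorted((n, pos) for pos, (_, n, _) in enumerate(rows))


def point_query_p(p):
    """SELECT * FROM V WHERE P = p, answered by one hash probe."""
    return [rows[pos] for pos in hash_on_p.get(p, [])]


def range_query_n(lo, hi):
    """SELECT * FROM V WHERE N > lo AND N < hi, answered by a range scan."""
    start = bisect_right(ordered_on_n, (lo, float("inf")))  # first N > lo
    end = bisect_left(ordered_on_n, (hi, -1))               # first N >= hi
    return [rows[pos] for _, pos in ordered_on_n[start:end]]
```

The hash index has no key order to exploit, so a range predicate on P would have to visit every bucket. The ordered index answers both kinds of predicate, at the price of keeping the entries sorted on every INSERT.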
The Index Selection Problem 3
V(M, N, P);
Your workload is this:
• 100000 queries:
      SELECT * FROM V WHERE N=?
• 1000000 queries:
      SELECT * FROM V WHERE N=? and P>?
• 100000 queries:
      INSERT INTO V VALUES (?, ?, ?)
What indexes?
A: V(N, P)
How does this index differ from:
1. Two indexes V(N) and V(P)?
2. An index V(P, N)?

The Index Selection Problem 4
V(M, N, P);
Your workload is this:
• 1000 queries:
      SELECT * FROM V WHERE N>? and N<?
• 100000 queries:
      SELECT * FROM V WHERE P>? and P<?
What indexes?
A: V(N) secondary, V(P) primary index

Two typical kinds of queries
• Point queries
      SELECT * FROM Movie WHERE year = ?
  – What data structure should be used for index?
• Range queries
      SELECT * FROM Movie WHERE year >= ? AND year <= ?
  – What data structure should be used for index?

Basic Index Selection Guidelines
• Consider queries in workload in order of importance
• Consider relations accessed by query
  – No point indexing other relations
• Look at WHERE clause for possible search key
• Try to choose indexes that speed-up multiple queries
• Range queries benefit mostly from clustering

    SELECT * FROM R WHERE R.K>? and R.K<?

[Figure: cost plotted against percentage of tuples retrieved (0 to 100) for this query, built up over several slides with curves for sequential scan and clustered index.]

Choosing Index is Not Enough
• To estimate the cost of a query plan, we still need to consider other factors:
  – How each operator is implemented
  – The cost of each operator
  – Let’s start with the basics

[Figure: final build of the range-query cost plot (cost vs. percentage of tuples retrieved, 0 to 100), with curves for unclustered index, sequential scan, and clustered index.]

Cost Parameters
• Cost = I/O + CPU + Network BW
  – We will focus on I/O in this class
• Parameters (a.k.a. statistics):
  – B(R) = # of blocks (i.e., pages) for relation R
  – T(R) = # of tuples in relation R
  – V(R, a) = # of distinct values of attribute a
    When a is a key, V(R,a) = T(R)
    When a is not a key, V(R,a) can be anything <= T(R)
• DBMS collects statistics about base tables; it must infer them for intermediate results

Cost of Reading Data From Disk

Selectivity Factors for Conditions
• A = c          /* σ_{A=c}(R) */
  – Selectivity = 1/V(R,A)
• A < c          /* σ_{A<c}(R) */
  – Selectivity = (c - min(R,A))/(max(R,A) - min(R,A))
• c1 < A < c2    /* σ_{c1<A<c2}(R) */
  – Selectivity = (c2 - c1)/(max(R,A) - min(R,A))

Cost of Reading Data From Disk
• Sequential scan for relation R costs B(R)
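The selectivity factors above translate directly into code. The sketch below is a minimal Python rendering of the three formulas; the function names are invented for illustration, and the formulas implicitly assume that values of A are spread uniformly between min(R,A) and max(R,A).

```python
def sel_equals(v):
    """Selectivity of A = c, where v = V(R, A)."""
    return 1.0 / v


def sel_less_than(c, lo, hi):
    """Selectivity of A < c, where lo = min(R, A) and hi = max(R, A)."""
    return (c - lo) / (hi - lo)


def sel_between(c1, c2, lo, hi):
    """Selectivity of c1 < A < c2, with lo and hi as above."""
    return (c2 - c1) / (hi - lo)


def estimated_tuples(t, selectivity):
    """Expected result size: T(R) times the selectivity factor."""
    return t * selectivity
```

For example, with the hypothetical statistics T(R) = 100,000, V(R,A) = 20, min(R,A) = 0 and max(R,A) = 200, `estimated_tuples(100_000, sel_equals(20))` estimates about 5,000 matching tuples, and `sel_between(50, 100, 0, 200)` gives a selectivity of 0.25.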
Index Based Selection
• Example: B(R) = 2000, T(R) = 100,000, V(R, a) = 20; cost of σ_{a=v}(R) = ?
• Table scan: B(R) = 2,000 I/Os
• Index based selection:
  – If index is clustered: B(R) * 1/V(R,a) = 100 I/Os
  – If index is unclustered: T(R) * 1/V(R,a) = 5,000 I/Os
Note: we ignore I/O cost for index pages
Lesson: Don’t build unclustered indexes when V(R,a) is small!

Outline
• Join operator algorithms
  – One-pass algorithms (Sec. 15.2 and 15.3)
  – Index-based algorithms (Sec 15.6)
• Note about readings:
  – In class, we discuss only algorithms for joins
  – Other operators are easier: read the book

Cost of Executing Operators
(Focus on Single Node Joins)

Join Algorithms
• Nested loop join
• Hash join
• Sort-merge join
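Returning to the index-based selection example in this section, the three I/O estimates can be reproduced with a few lines of Python. This is a sketch of the slide's formulas only: the function names are invented, and, as in the slide, I/O for the index pages themselves is ignored.

```python
def table_scan_cost(b):
    """Sequential scan reads every block: B(R)."""
    return b


def clustered_index_cost(b, v):
    """Matching tuples are contiguous, so read B(R) / V(R, a) blocks."""
    return b / v


def unclustered_index_cost(t, v):
    """Each matching tuple may be on its own block: T(R) / V(R, a) reads."""
    return t / v


def cheapest_plan(b, t, v, clustered):
    """Pick the cheaper of a table scan and the available index."""
    index_cost = clustered_index_cost(b, v) if clustered else unclustered_index_cost(t, v)
    scan_cost = table_scan_cost(b)
    return ("index", index_cost) if index_cost < scan_cost else ("scan", scan_cost)
```

With B(R) = 2000, T(R) = 100,000 and V(R, a) = 20, the estimates are 2,000, 100 and 5,000 I/Os, so an unclustered index loses to the table scan. Under these formulas the unclustered index only wins once T(R)/V(R,a) < B(R), that is, once V(R, a) exceeds T(R)/B(R) = 50 for these statistics.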
Nested Loop Joins
• Tuple-based nested loop R ⋈ S
• R is the outer relation, S is the inner relation

    for each tuple t1 in R do
      for each tuple t2 in S do
        if t1 and t2 join then output (t1,t2)

What is the Cost?
• Cost: B(R) + T(R) B(S)
• Multiple-pass since S is read many times

Page-at-a-time Refinement

    for each page of tuples r in R do
      for each page of tuples s in S do
        for all pairs of tuples t1 in r, t2 in s
          if t1 and t2 join then output (t1,t2)

What is the Cost?
• Cost: B(R) + B(R)B(S)

[Figure: page-at-a-time refinement on Patient ⋈ Insurance, both stored on disk. One input buffer holds a page of Patient, one input buffer holds a page of Insurance, and joined tuples go to an output buffer. Keep going until read all of Insurance, then repeat for next page of Patient… until end of Patient.]
Cost: B(R) + B(R)B(S)
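The two cost formulas can be checked by simulating both loops and counting page reads. The Python sketch below models each relation as a list of pages, each page a list of join keys; the page contents are hypothetical and the counters simply add one for every page fetched, with no buffering beyond the single page per relation that the slides assume.

```python
def tuple_nested_loop(r_pages, s_pages):
    """Tuple-based nested loop: rescan all of S for every tuple of R."""
    out, reads = [], 0
    for r_page in r_pages:
        reads += 1                      # read one page of R
        for t1 in r_page:
            for s_page in s_pages:
                reads += 1              # read one page of S, per R tuple
                for t2 in s_page:
                    if t1 == t2:
                        out.append((t1, t2))
    return out, reads


def page_nested_loop(r_pages, s_pages):
    """Page-at-a-time refinement: rescan all of S once per page of R."""
    out, reads = [], 0
    for r_page in r_pages:
        reads += 1                      # read one page of R
        for s_page in s_pages:
            reads += 1                  # read one page of S, per R page
            for t1 in r_page:
                for t2 in s_page:
                    if t1 == t2:
                        out.append((t1, t2))
    return out, reads


# Hypothetical relations: B(R) = 3, T(R) = 6, B(S) = 4.
r_pages = [[1, 2], [3, 4], [9, 6]]
s_pages = [[2, 4], [6, 6], [1, 3], [8, 9]]
```

On this input the tuple-based version performs B(R) + T(R)B(S) = 3 + 6·4 = 27 page reads and the page-at-a-time version performs B(R) + B(R)B(S) = 3 + 3·4 = 15, while both produce the same joined pairs.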
