

Introduction to Database Systems
CSE 414

Lecture 26: More Indexes and Operator Costs

CSE 414 - Spring 2018

Hash table example

Student
ID   fName  lName
10   Tom    Hanks
20   Amy    Hanks
…

Index Student_ID on Student.ID

[Figure: an index file (preferably in memory) whose entries 10, 20, 50, 200, 220, 240, 420, 800 point to the matching records of the data file Student (on disk).]

Hash table indexes are good for point queries
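A minimal sketch of the hash table example, assuming the data file can be modeled as a Python list and the index as a dict from key to record positions. The names `build_hash_index` and `point_query` are hypothetical and only illustrate why a point query needs no scan of the data file.

```python
# Data file: a list of records standing in for the Student file on disk.
student_file = [
    (10, "Tom", "Hanks"),
    (20, "Amy", "Hanks"),
]


def build_hash_index(records, key_pos=0):
    """Map each key value to the positions of the records that hold it."""
    index = {}
    for pos, rec in enumerate(records):
        index.setdefault(rec[key_pos], []).append(pos)
    return index


def point_query(index, records, key):
    """Return the records with the given key by following index pointers."""
    return [records[pos] for pos in index.get(key, [])]


# Index Student_ID on Student.ID
student_id_index = build_hash_index(student_file)
print(point_query(student_id_index, student_file, 20))
```

A dict gives expected constant-time lookups for an exact key, but its keys are unordered, so it offers no help for a range predicate.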

B+ Tree Index by Example

Recall binary trees from CSE 143!

d = 2
Find the key 40

[Figure: a B+ tree index file (preferably in memory) over a data file (on disk). Index entries: 80 at the root, then 20, 60 and 100, 120, 140. Leaf keys: 10 15 18 20 30 40 50 60 65 80 85 90. The search for 40 follows 40 <= 80, then 20 < 40 <= 60, then 30 < 40 <= 40.]

B+ indexes are good for range queries

Clustered vs Unclustered

[Figure: two B+ trees, each with index entries (index file) above data records (data file), one labeled CLUSTERED and one labeled UNCLUSTERED.]

Every table can have only one clustered index and many unclustered indexes. Why?
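A rough sketch of why an ordered index suits range queries. It assumes the sorted leaf level of the B+ tree can be modeled as a flat sorted Python list searched with `bisect`; this is a stand-in for the tree descent, not a B+ tree implementation, and `range_query` is a hypothetical helper.

```python
from bisect import bisect_left, bisect_right

# Leaf keys from the example, in sorted order.
leaf_keys = [10, 15, 18, 20, 30, 40, 50, 60, 65, 80, 85, 90]


def range_query(sorted_keys, low, high):
    """Return every key k with low <= k <= high."""
    start = bisect_left(sorted_keys, low)    # first position with key >= low
    end = bisect_right(sorted_keys, high)    # first position with key > high
    return sorted_keys[start:end]            # one contiguous run of leaves


print(range_query(leaf_keys, 20, 60))
```

Because the keys are kept in order, the answer to a range predicate is a single contiguous run: locate the lower bound once, then read forward.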

Example

Student(ID, fname, lname)
Takes(studentID, courseID)

SELECT *
FROM Student x, Takes y
WHERE x.ID=y.studentID AND y.courseID > 300

for y in Takes
  if courseID > 300 then
    for x in Student
      if x.ID=y.studentID
        output *

Assume the database has indexes on these attributes:
• Takes_courseID = index on Takes.courseID
• Student_ID = index on Student.ID

Index join

for y’ in Takes_courseID where y’.courseID > 300
  y = fetch the Takes record pointed to by y’
  for x’ in Student_ID where x’.ID = y.studentID
    x = fetch the Student record pointed to by x’
    output *

[Figure: query plan with ⋈studentID=ID at the top, joining σcourseID>300 over Takes with Student.]

Which Indexes?

Student
ID   fName  lName
10   Tom    Hanks
20   Amy    Hanks
…

• How many indexes could we create?
• Which indexes should we create? (Index selection)

In general this is a very hard problem
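The index join from the Student/Takes example can be sketched as follows. This is an illustration under stated assumptions: the `takes` rows are made up, both indexes are modeled as dicts from key to record positions, and filtering the keys of `takes_course_id` stands in for a real range scan over an ordered index.

```python
from collections import defaultdict

student = [(10, "Tom", "Hanks"), (20, "Amy", "Hanks")]
takes = [(10, 311), (10, 414), (20, 143), (20, 344)]  # (studentID, courseID), made-up rows


def build_index(rows, key_pos):
    """Index = key value -> positions of the rows holding that value."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[key_pos]].append(pos)
    return index


takes_course_id = build_index(takes, 1)  # Takes_courseID
student_id = build_index(student, 0)     # Student_ID


def index_join(min_course):
    """Student x, Takes y WHERE x.ID = y.studentID AND y.courseID > min_course."""
    out = []
    for course in sorted(c for c in takes_course_id if c > min_course):
        for y_pos in takes_course_id[course]:
            y = takes[y_pos]                        # fetch the Takes record
            for x_pos in student_id.get(y[0], []):
                x = student[x_pos]                  # fetch the Student record
                out.append(x + y)                   # output *
    return out


for row in index_join(300):
    print(row)
```

Only the Takes records that satisfy the selection are fetched, and each one triggers a lookup in Student_ID instead of a scan of Student.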

Which Indexes?

• The index selection problem
  – Given a table, and a “workload” (big Java application with lots of SQL queries), decide which indexes to create (and which ones NOT to create!)
• Who does index selection:
  – The database administrator (DBA)
  – Semi-automatically, using a database administration tool

Index Selection: Which Search Key

• Make some attribute K a search key if the WHERE clause contains:
  – An exact match on K
  – A range predicate on K

The Index Selection Problem 1

V(M, N, P);

Your workload is this

100000 queries:
SELECT *
FROM V
WHERE N=?

100 queries:
SELECT *
FROM V
WHERE P=?

What indexes?

A: V(N) and V(P) (hash tables or B-trees)

The Index Selection Problem 2

V(M, N, P);

Your workload is this

100000 queries:
SELECT *
FROM V
WHERE N>? and N<?

100 queries:
SELECT *
FROM V
WHERE P=?

100000 queries:
INSERT INTO V
VALUES (?, ?, ?)

What indexes?

A: definitely V(N) (must be a B-tree); unsure about V(P)

The Index Selection Problem 3

V(M, N, P);

Your workload is this

100000 queries:
SELECT *
FROM V
WHERE N=?

1000000 queries:
SELECT *
FROM V
WHERE N=? and P>?

100000 queries:
INSERT INTO V
VALUES (?, ?, ?)

What indexes?

A: V(N, P)

How does this index differ from:
1. Two indexes V(N) and V(P)?
2. An index V(P, N)?
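One way to explore the question about V(N, P) versus V(P, N) is the sketch below. It assumes a composite index can be modeled as a list of (N, P, position) entries sorted lexicographically; the rows and the helper `query_n_eq_p_gt` are made up for illustration.

```python
from bisect import bisect_right

# Made-up rows of V(M, N, P).
rows = [(1, 5, 30), (2, 5, 10), (3, 7, 20), (4, 5, 20), (5, 7, 5)]

# Composite index V(N, P): entries sorted by N first, then by P.
index_np = sorted((n, p, pos) for pos, (m, n, p) in enumerate(rows))

INF = float("inf")


def query_n_eq_p_gt(index, n, p):
    """Row positions with N = n and P > p, read from one contiguous index run."""
    start = bisect_right(index, (n, p, INF))    # skip entries with N = n and P <= p
    end = bisect_right(index, (n, INF, INF))    # stop after the last entry with N = n
    return [pos for (_, _, pos) in index[start:end]]


print(query_n_eq_p_gt(index_np, 5, 10))
```

With N as the leading attribute, all entries for one value of N sit next to each other and are ordered by P, so both predicates narrow a single run. In an index ordered by (P, N), the entries with P > ? span many values of N, and the matching ones are scattered among them.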

The Index Selection Problem 4

V(M, N, P);

Your workload is this

1000 queries:
SELECT *
FROM V
WHERE N>? and N<?

100000 queries:
SELECT *
FROM V
WHERE P>? and P<?

What indexes?

A: V(N) secondary, V(P) primary index

Two typical kinds of queries

• Point queries
  SELECT *
  FROM Movie
  WHERE year = ?
  • What data structure should be used for the index?

• Range queries
  SELECT *
  FROM Movie
  WHERE year >= ? AND year <= ?
  • What data structure should be used for the index?

Basic Index Selection Guidelines

• Look at the WHERE clause for possible search keys
• Try to choose indexes that speed up multiple queries
• Range queries benefit mostly from clustering

SELECT *
FROM R
WHERE R.K>? and R.K<?

[Figure: cost versus percentage of tuples retrieved (0 to 100), with curves labeled "Sequential scan" and "Clustered index".]

Choosing Index is Not Enough

– How each operator is implemented
– The cost of each operator
– Let’s start with the basics

Cost Parameters

• Cost = I/O + CPU + Network BW
  – We will focus on I/O in this class
• Parameters (a.k.a. statistics):
  – B(R) = # of blocks (i.e., pages) for relation R
  – T(R) = # of tuples in relation R
  – V(R, a) = # of distinct values of attribute a
    When a is a key, V(R,a) = T(R)
    When a is not a key, V(R,a) can be anything <= T(R)
• DBMS collects statistics about base tables; it must infer them for intermediate results

Cost of Reading Data From Disk

Selectivity Factors for Conditions

• A = c /* σA=c(R) */
  – Selectivity = 1/V(R,A)
• A < c /* σA<c(R) */
• c1 < A < c2 /* σc1<A<c2(R) */

Cost of Reading Data From Disk

• Sequential scan for relation R costs B(R)
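The source text gives only the selectivity for the equality condition. The sketch below adds the two range conditions using the usual textbook estimates, which assume values are uniformly distributed between a minimum and a maximum; those two formulas, the `lo`/`hi` parameters, and all function names are assumptions, not taken from the slides.

```python
def sel_eq(v_distinct):
    """A = c: one of V(R,A) equally likely values."""
    return 1 / v_distinct


def sel_less_than(c, lo, hi):
    """A < c under a uniform assumption on [lo, hi] (assumed formula)."""
    return (c - lo) / (hi - lo)


def sel_between(c1, c2, lo, hi):
    """c1 < A < c2 under a uniform assumption on [lo, hi] (assumed formula)."""
    return (c2 - c1) / (hi - lo)


print(sel_eq(20))
print(sel_less_than(25, 0, 100))
print(sel_between(10, 60, 0, 100))
```

Multiplying a selectivity by T(R) gives the estimated number of matching tuples.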

Index Based Selection

• Example: B(R) = 2000, T(R) = 100,000, V(R, a) = 20; cost of σa=v(R) = ?
• Table scan: B(R) = 2,000 I/Os
• Index based selection:
  – If index is clustered: B(R) * 1/V(R,a) = 100 I/Os
  – If index is unclustered: T(R) * 1/V(R,a) = 5,000 I/Os

Note: we ignore I/O cost for index pages

Lesson: Don’t build unclustered indexes when V(R,a) is small!
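The arithmetic of the example can be checked with a short sketch. The function names are hypothetical; the formulas are the ones stated in the example, and the I/O cost of index pages is ignored, as noted there.

```python
def table_scan_cost(b):
    """Sequential scan reads every block of R."""
    return b


def clustered_index_cost(b, v):
    """Matching tuples are stored together: B(R) * 1/V(R,a) blocks."""
    return b / v


def unclustered_index_cost(t, v):
    """Each matching tuple may sit on a different block: T(R) * 1/V(R,a) reads."""
    return t / v


B, T, V = 2000, 100_000, 20
print(table_scan_cost(B))            # 2000
print(clustered_index_cost(B, V))    # 100.0
print(unclustered_index_cost(T, V))  # 5000.0
```

With V(R,a) = 20, the unclustered index costs more than scanning the whole table, which is the point of the lesson.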

Outline

• Join operator algorithms
  – One-pass algorithms (Sec. 15.2 and 15.3)
  – Index-based algorithms (Sec. 15.6)
• Note about readings:
  – In class, we discuss only algorithms for joins
  – Other operators are easier: read the book

Cost of Executing Operators
(Focus on Single Node Joins)

Join Algorithms

• Nested loop join
• Hash join
• Sort-merge join

Nested Loop Joins

• Tuple-based nested loop R ⋈ S
• R is the outer relation, S is the inner relation

for each tuple t1 in R do
  for each tuple t2 in S do
    if t1 and t2 join then output (t1,t2)

What is the Cost?
• Cost: B(R) + T(R) B(S)
• Multiple-pass since S is read many times

Page-at-a-time Refinement

for each page of tuples r in R do
  for each page of tuples s in S do
    for all pairs of tuples t1 in r, t2 in s
      if t1 and t2 join then output (t1,t2)

What is the Cost?
• Cost: B(R) + B(R)B(S)

Page-at-a-time Refinement

[Figure: pages of Patient and Insurance on disk, with an input buffer for Patient, an input buffer for Insurance, and an output buffer.]

Keep going until read all of Insurance.
Then repeat for next page of Patient… until end of Patient.

Cost: B(R) + B(R)B(S)
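A small sketch of the page-at-a-time nested loop join that also counts page reads, so the cost B(R) + B(R)B(S) can be checked on a toy input. It assumes each relation is a list of pages and each page a list of join-key values; the data and the name `page_nested_loop_join` are made up.

```python
def page_nested_loop_join(r_pages, s_pages, joins):
    """Return (output pairs, number of page reads) for R join S."""
    out, reads = [], 0
    for r in r_pages:              # read one page of the outer relation R
        reads += 1
        for s in s_pages:          # re-read every page of S for this page of R
            reads += 1
            for t1 in r:
                for t2 in s:
                    if joins(t1, t2):
                        out.append((t1, t2))
    return out, reads


# Made-up pages: B(R) = 2, T(R) = 4, B(S) = 3.
patient = [[1, 2], [3, 4]]
insurance = [[2, 4], [4, 3], [2, 8]]

pairs, reads = page_nested_loop_join(patient, insurance, lambda a, b: a == b)
print(pairs)
print(reads)  # B(R) + B(R)*B(S) = 2 + 2*3 = 8

# The tuple-based variant would instead cost B(R) + T(R)*B(S) = 2 + 4*3 = 14.
tuple_based_cost = len(patient) + sum(len(p) for p in patient) * len(insurance)
print(tuple_based_cost)
```

The two variants produce the same join result; the refinement only changes how often the inner relation is read.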