Indexing

Chapters 8, 10, and 11

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Tree-Based Indexing

• The data entries are arranged in sorted order by search key value.
• A hierarchical search structure (a tree) is maintained that directs searches to the correct page of data entries.
• Tree-structured indexing techniques support both range searches and equality searches.

Trees

• Tree: A structure with a unique starting node (the root), in which each node can have child nodes and a unique path exists from the root to every other node.
• Root: The top node of a tree structure; a node with no parent.
• Leaf node: A tree node that has no children.


Trees

• Level: The distance of a node from the root.
• Height: The maximum level of any node in the tree.

Time Complexity of Tree Operations

• Time complexity of searching: O(height of the tree)
• Time complexity of inserting: O(height of the tree)
• Time complexity of deleting: O(height of the tree)


Multi-way Tree

If we relax the restriction that each node can hold only one key, we can reduce the height of the tree. A multi-way search tree of order m is a tree in which each node holds between 1 and m-1 distinct keys.

Range Searches

• "Find all students with gpa > 3.0"
  - If the data is in a sorted file, do a binary search to find the first such student, then scan to find the others.
  - The cost of binary search can be quite high.
• Simple idea: create an "index" file.

[Figure: an index file with entries k1, k2, ..., kN, each pointing to one of Page 1, Page 2, Page 3, ..., Page N of the data file.]

We can do binary search on the (smaller) index file!

Property of Multi-way Tree

• The keys in each node are sorted.
• A node with k values has k+1 sub-trees, where the sub-trees may be empty.
• The i'th sub-tree of a node [v_1, ..., v_k], 0 <= i <= k, may hold only values v in the range v_i < v < v_{i+1} (v_0 is assumed to equal -∞, and v_{k+1} is assumed to equal +∞).
• An m-way tree of height h has between h and m^h - 1 keys.
  - The height of a complete m-ary tree with n nodes is ceiling(log_m n).

Searching m-way Tree

• We make an m-way branching decision at each node, according to the number of the node's children.
• Searching is performed recursively.
• Time complexity is O(h), where h is the height of the tree.
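
The recursive search described above can be sketched in Python. This is a minimal illustration, not code from the textbook: the Node class, its field names, and the sample tree are assumptions made for the example.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical m-way search tree node: k sorted keys and k+1 sub-trees."""
    keys: List[int]
    children: List[Optional["Node"]] = field(default_factory=list)  # empty for a leaf


def search(node: Optional[Node], key: int) -> bool:
    """Return True if key is stored in the tree rooted at node."""
    if node is None:
        return False
    i = bisect_left(node.keys, key)        # first position with keys[i] >= key
    if i < len(node.keys) and node.keys[i] == key:
        return True
    if not node.children:                  # leaf reached: nowhere left to look
        return False
    return search(node.children[i], key)   # sub-tree i holds v_i < v < v_{i+1}


def leaf(*keys: int) -> Node:
    return Node(list(keys))


root = Node([20, 40], [leaf(5, 10), leaf(25, 30, 35), leaf(50)])
print(search(root, 30), search(root, 45))
```

Each call descends one level, so the number of nodes visited is bounded by the height of the tree.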

Two Popular Trees

• ISAM (Indexed Sequential Access Method)
  - A static structure.
• B+ tree
  - A dynamic structure; adjusts gracefully under inserts and deletes.
• Leaf pages of both contain data entries.


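ISAM

Index entry: P0 | K1 | P1 | K2 | P2 | ... | Km | Pm

• The index file may still be quite large. But we can apply the idea repeatedly!

[Figure: the ISAM structure. Non-leaf pages sit above the leaf pages; the leaf level consists of primary pages, with overflow pages chained to them.]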

Comments on ISAM

• A multi-way tree.
• Each tree node is a disk page.
• Leaf pages contain data entries.
  - Alternative 1: Leaf pages are created holding the data records with key value k; all leaf pages are allocated sequentially and sorted on the search key value.
  - Alternative 2 or 3: Data records are created and sorted in a separate file, and the data entries are then stored in the leaf pages of the ISAM index.
  - Then index pages are allocated, then space for overflow pages.
• Index entries: <search key value, page id>; they "direct" the search for data entries, which are in leaf pages.

[Figure: file layout, in allocation order: data pages, index pages, overflow pages.]

Comments on ISAM

• Search: Start at the root; use key comparisons to go to a leaf. Cost is proportional to log_F N.
  - F = # entries per index page, N = # leaf pages
• Insert: Find the leaf the data entry belongs to, and put it there (in an overflow page if the leaf is full).
• Delete: Find and remove from the leaf; if an overflow page becomes empty, de-allocate it.

Static tree structure: inserts/deletes affect only leaf pages.

Example ISAM Tree

• Each node can hold 2 entries; no need for "next-leaf-page" pointers. (Why? The structure is static, so primary leaf pages stay in their sequential order.)
• Example: search for a record with the key value 27.

Root:         [40]
Index pages:  [20 33]  [51 63]
Leaf pages:   [10* 15*] [20* 27*] [33* 37*]  [40* 46*] [51* 55*] [63* 97*]

After Inserting 23*, 48*, 41*, 42* ...

Root:                [40]
Index pages:         [20 33]  [51 63]
Primary leaf pages:  [10* 15*] [20* 27*] [33* 37*]  [40* 46*] [51* 55*] [63* 97*]
Overflow pages:      [23*] chained to [20* 27*]; [48* 41*] and then [42*] chained to [40* 46*]

... Then Deleting 42*, 51*, 97*

The number of primary leaf pages is fixed at file creation time (STATIC).

Root:                [40]
Index pages:         [20 33]  [51 63]
Primary leaf pages:  [10* 15*] [20* 27*] [33* 37*]  [40* 46*] [55*] [63*]
Overflow pages:      [23*] chained to [20* 27*]; [48* 41*] chained to [40* 46*]

Static tree structure: inserts/deletes affect only leaf pages. Note that 51 appears in the index levels, but 51* no longer appears in a leaf!
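
The insert and delete sequence on the example ISAM tree can be replayed with a small simulation. This is a sketch under simplifying assumptions: the IsamFile class and its method names are hypothetical, the two index levels are flattened into one static list of separator keys, and entries within a page are not kept sorted.

```python
from bisect import bisect_right

CAPACITY = 2  # entries per page, as in the slide example


class IsamFile:
    """Hypothetical ISAM file: static primary pages plus overflow chains."""

    def __init__(self, sorted_keys):
        # Primary leaf pages are filled once, at creation time; none are added later.
        self.primary = [sorted_keys[i:i + CAPACITY]
                        for i in range(0, len(sorted_keys), CAPACITY)]
        # One overflow chain (a list of pages) per primary page.
        self.overflow = [[] for _ in self.primary]
        # Static separators: first key of every primary page except the first.
        self.separators = [page[0] for page in self.primary[1:]]

    def _leaf(self, key):
        return bisect_right(self.separators, key)

    def insert(self, key):
        i = self._leaf(key)
        if len(self.primary[i]) < CAPACITY:
            self.primary[i].append(key)
            return
        chain = self.overflow[i]
        if not chain or len(chain[-1]) == CAPACITY:
            chain.append([])               # allocate a new overflow page
        chain[-1].append(key)

    def delete(self, key):
        i = self._leaf(key)
        for page in [self.primary[i]] + self.overflow[i]:
            if key in page:
                page.remove(key)
                break
        # De-allocate empty overflow pages; primary pages stay, even if empty.
        self.overflow[i] = [p for p in self.overflow[i] if p]

    def search(self, key):
        i = self._leaf(key)
        return any(key in p for p in [self.primary[i]] + self.overflow[i])


f = IsamFile([10, 15, 20, 27, 33, 37, 40, 46, 51, 55, 63, 97])
for k in (23, 48, 41, 42):
    f.insert(k)
chains_after_inserts = [list(map(list, c)) for c in f.overflow]
for k in (42, 51, 97):
    f.delete(k)
```

After the deletions, 51 is still a separator key even though search(51) fails, which mirrors the note on the slide.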

Comments on ISAM

• The static design leads to the problem that long overflow chains can develop.
  - To alleviate this problem, the tree is initially created so that about 20% of each page is free.
• The static design has the advantage that index-level pages need not be locked, since they are never modified.

Static tree structure: inserts/deletes affect only leaf pages.

B+ Trees: A Dynamic Index Structure

• The most widely used index.
• A balanced tree in which the internal nodes direct the search and the leaf nodes contain the data entries.
• The tree structure grows and shrinks dynamically; leaf pages are not sequentially allocated.
• Leaf pages are organized in a doubly linked list.


B+ Tree Indexes

• Leaf pages contain data entries, sorted by search key, and are chained (prev & next) in a doubly linked list.
• Non-leaf pages have index entries, which are only used to direct searches.

Index entry: P0 | K1 | P1 | K2 | P2 | ... | Km | Pm

Fan-out of the Tree

• Fan-out of the tree: the average number of children of a non-leaf node.
• If every non-leaf node has n children, a tree of height h has n^h leaf pages.
• A good approximation of the number of leaf pages is F^h (F is the average number of children, which is at least 100).
• Example:
  - A tree of height 4 contains 100 million leaf pages.
  - A binary search over the same pages will take log_2 100,000,000 > 25 I/Os.
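
The arithmetic in this example can be checked with a few lines of Python. The function names are made up for the illustration, and each level of the tree or step of the binary search is assumed to cost one page I/O.

```python
import math


def leaf_pages(fanout: int, height: int) -> int:
    """Number of leaf pages reachable when every non-leaf node has `fanout` children."""
    return fanout ** height


def tree_levels_needed(fanout: int, n_leaf_pages: int) -> int:
    """Smallest height h such that fanout**h >= n_leaf_pages."""
    height, capacity = 0, 1
    while capacity < n_leaf_pages:
        capacity *= fanout
        height += 1
    return height


def binary_search_ios(n_pages: int) -> int:
    """Page reads for a binary search over n_pages sorted pages (one I/O per probe)."""
    return math.ceil(math.log2(n_pages))


n = leaf_pages(100, 4)
print(n, tree_levels_needed(100, n), binary_search_ios(n))
```

With F = 100, four index levels reach the same 100 million pages that a binary search needs 27 probes to cover.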


B+ Trees

• Insert/delete at log_F N cost; keep the tree height-balanced. (F = fanout, N = # leaf pages)
• Minimum 50% occupancy (except for the root). Each node contains m entries, where d <= m <= 2d; the root is the exception, with 1 <= m <= 2d. The parameter d is called the order of the tree.
• Supports equality and range searches efficiently.

[Figure: index entries (which direct the search) sit above the data entries (the "sequence set").]

Example B+ Tree

Root:         [17]   (left sub-tree: entries <= 17; right sub-tree: entries > 17)
Index pages:  [5 13]  [27 30]
Leaf pages:   [2* 3*] [5* 7* 8*] [14* 16*]  [22* 24*] [27* 29*] [33* 34* 38* 39*]

Note how data entries in the leaf level are sorted.

• Find 28*? 29*? All > 15* and < 30*?
• Insert/delete: Find the data entry in a leaf, then change it. Need to adjust the parent sometimes.
  - And the change sometimes bubbles up the tree.


B+ Trees in Practice

• Typical order: d = 100. Typical fill-factor: 67%.
  - Average fanout = 133
• Typical capacities:
  - Height 4: 133^4 = 312,900,721 records
  - Height 3: 133^3 = 2,352,637 records
• Can often hold the top levels in the buffer pool:
  - Level 1 = 1 page = 8 Kbytes
  - Level 2 = 133 pages = about 1 Mbyte
  - Level 3 = 17,689 pages = about 133 MBytes

Inserting a Data Entry into a B+ Tree

• Find the correct leaf node L.
• Put the data entry onto L.
  - If L has enough space, done!
  - Else, must split L (into L and a new node L2).
    - Redistribute entries evenly, copy up the middle key.
    - Insert an index entry pointing to L2 into the parent of L.
• This can happen recursively.
  - To split an index node, redistribute entries evenly, but push up the middle key. (Contrast with leaf splits.)
• Splits "grow" the tree; a root split increases its height.
  - Tree growth: the tree gets wider or one level taller at the top.
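
The difference between copying up (leaf split) and pushing up (index split) can be made concrete with two small functions. This is a sketch, not the textbook's pseudocode: the function names are hypothetical, nodes are plain sorted lists of keys, and child pointers are left out.

```python
def split_leaf(entries):
    """Split an overfull leaf (2d+1 sorted entries).

    Returns (left, right, key_for_parent). The middle key is COPIED up:
    it goes to the parent and also stays in the right leaf.
    """
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]


def split_index(keys):
    """Split an overfull index node (2d+1 sorted keys).

    Returns (left, right, key_for_parent). The middle key is PUSHED up:
    it moves to the parent and no longer appears at this level.
    """
    mid = len(keys) // 2
    return keys[:mid], keys[mid + 1:], keys[mid]


# Inserting 8* into the full leaf [2* 3* 5* 7*] (order d = 2).
leaf_result = split_leaf(sorted([2, 3, 5, 7] + [8]))
# The copied-up key 5 then overflows the index node [13 17 24 30].
index_result = split_index(sorted([13, 17, 24, 30] + [leaf_result[2]]))
print(leaf_result, index_result)
```

Both halves of each split end up with at least d = 2 entries, which is how minimum occupancy is preserved.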


Inserting 8* into Example B+ Tree

Root:        [13 17 24 30]
Leaf pages:  [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

Split the full leaf node and copy up the middle key:

Entry to be inserted in parent node:  5
Leaf pages after the split:           [2* 3*] [5* 7* 8*]

(Note that 5 is copied up and continues to appear in the leaf.)

Inserting 8* into Example B+ Tree

To split an index node, redistribute entries evenly, but push up the middle key. (Contrast with leaf splits.)

Entry to be inserted in parent node:  17
Index pages after the split:          [5 13] [24 30]

(Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.)

• Observe how minimum occupancy is guaranteed in both leaf and index page splits.
• Note the difference between copy-up and push-up; be sure you understand the reasons for this.


Example B+ Tree After Inserting 8*

Root:         [17]
Index pages:  [5 13]  [24 30]
Leaf pages:   [2* 3*] [5* 7* 8*] [14* 16*]  [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

• Notice that the root was split, leading to an increase in height.
• In this example, we could avoid the split by re-distributing entries; however, this is usually not done in practice.

Deleting a Data Entry from a B+ Tree

• Start at the root, find the leaf node L where the entry belongs.
• Remove the entry.
  - If L is at least half-full, done!
  - If L has only d-1 entries,
    - Try to re-distribute, borrowing from a sibling (an adjacent node with the same parent as L).
    - If re-distribution fails, merge L and the sibling.
• If a merge occurred, must delete the entry (pointing to L or the sibling) from the parent of L.
• A merge could propagate to the root, decreasing the height.


Example Tree After (Inserting 8*, Then) Deleting 19*

Before:
Root:         [17]
Index pages:  [5 13]  [24 30]
Leaf pages:   [2* 3*] [5* 7* 8*] [14* 16*]  [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

After:
Root:         [17]
Index pages:  [5 13]  [24 30]
Leaf pages:   [2* 3*] [5* 7* 8*] [14* 16*]  [20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

Example Tree After (Inserting 8*, Deleting 19*, then) Deleting 20* ...

Before:
Root:         [17]
Index pages:  [5 13]  [24 30]
Leaf pages:   [2* 3*] [5* 7* 8*] [14* 16*]  [20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

After:
Root:         [17]
Index pages:  [5 13]  [27 30]
Leaf pages:   [2* 3*] [5* 7* 8*] [14* 16*]  [22* 24*] [27* 29*] [33* 34* 38* 39*]

• Deleting 20* is done with re-distribution. Notice how the middle key is copied up.

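... And Then Deleting 24*

• Must merge.
• Observe the "toss" of the index entry (27).

Before:
Root:         [17]
Index pages:  [5 13]  [27 30]
Leaf pages:   [2* 3*] [5* 7* 8*] [14* 16*]  [22* 24*] [27* 29*] [33* 34* 38* 39*]

Right sub-tree after the leaf merge:
Index page:   [30]
Leaf pages:   [22* 27* 29*] [33* 34* 38* 39*]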

... And Then Deleting 24*

• Merge recursively.
• Observe the "pull down" of the index entry (17).

Before the index-level merge:
Root:         [17]
Index pages:  [5 13]  [30]
Leaf pages:   [2* 3*] [5* 7* 8*] [14* 16*]  [22* 27* 29*] [33* 34* 38* 39*]

After:
Root:         [5 13 17 30]
Leaf pages:   [2* 3*] [5* 7* 8*] [14* 16*] [22* 27* 29*] [33* 34* 38* 39*]
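
The choice between re-distribution and merging at the leaf level can be sketched as a single decision function. This is an illustration only: the function name and return shape are assumptions, leaves are plain sorted lists, and the sibling is taken to be the right sibling, as in the examples above.

```python
def fix_leaf_underflow(leaf, right_sibling, d):
    """Repair a leaf that dropped to d-1 entries.

    Returns (action, new_leaf, new_sibling, new_separator):
    - "redistribute": entries are evened out and the first key of the
      right node is copied up as the new separator in the parent.
    - "merge": the sibling is absorbed, so the parent's separator entry
      must be deleted (tossed) by the caller.
    """
    if len(right_sibling) > d:             # sibling can spare an entry
        combined = leaf + right_sibling
        mid = len(combined) // 2
        new_leaf, new_sibling = combined[:mid], combined[mid:]
        return "redistribute", new_leaf, new_sibling, new_sibling[0]
    return "merge", leaf + right_sibling, None, None


# Deleting 20* leaves [22*] next to [24* 27* 29*] (d = 2).
after_20 = fix_leaf_underflow([22], [24, 27, 29], d=2)
# Deleting 24* then leaves [22*] next to [27* 29*].
after_24 = fix_leaf_underflow([22], [27, 29], d=2)
print(after_20, after_24)
```

The first call reproduces the copied-up key 27; the second produces the merged leaf [22* 27* 29*], after which the parent falls below minimum occupancy and the merge continues one level up.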


Example of Non-leaf Re-distribution

• The tree is shown below during the deletion of 24*.
• In contrast to the previous example, we can re-distribute an entry from the left child of the root to the right child.

Root:         [22]
Index pages:  [5 13 17 20]  [30]
Leaf pages:   [2* 3*] [5* 7* 8*] [14* 16*] [17* 18*] [20* 21*]  [22* 27* 29*] [33* 34* 38* 39*]

After Re-distribution

• Intuitively, entries are re-distributed by "pushing through" the splitting entry in the parent node.
• It suffices to re-distribute the index entry with key 20; we've re-distributed 17 as well for illustration.

Root:         [17]
Index pages:  [5 13]  [20 22 30]
Leaf pages:   [2* 3*] [5* 7* 8*] [14* 16*]  [17* 18*] [20* 21*] [22* 27* 29*] [33* 34* 38* 39*]

Bulk Loading of a B+ Tree

• If we have a large collection of records and we want to create a B+ tree on some field, doing so by repeatedly inserting records is very slow.
• Bulk loading can be done much more efficiently.
• Initialization: Sort all data entries, then insert a pointer to the first (leaf) page in a new (root) page.

Root:  [ ]  (points to the first leaf page)
Sorted pages of data entries, not yet in B+ tree:
[3* 4*] [6* 9*] [10* 11*] [12* 13*] [20* 22*] [23* 31*] [35* 36*] [38* 41*] [44*]

Bulk Loading (Contd.)

• Index entries for leaf pages are always entered into the right-most index page just above the leaf level. When this fills up, it splits. (The split may go up the right-most path to the root.)
• Much faster than repeated inserts, especially when one considers locking!

Partway through:
Root:         [10 20]
Index pages:  [6] [12] [23 35]
Leaf pages:   [3* 4*] [6* 9*] [10* 11*] [12* 13*] [20* 22*] [23* 31*] [35* 36*]
Data entry pages not yet in B+ tree: [38* 41*] [44*]

After adding the next leaf page:
Root:         [20]
Index pages:  [10] [35]
              [6] [12] [23] [38]
Leaf pages:   [3* 4*] [6* 9*] [10* 11*] [12* 13*] [20* 22*] [23* 31*] [35* 36*] [38* 41*]
Data entry pages not yet in B+ tree: [44*]
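
A simplified way to see why bulk loading is cheap is to build the index one level at a time from the sorted leaves. Note that this is a different, level-by-level technique, not the right-most-page insertion shown above, so it produces a differently shaped (but valid) search structure. The function name, its parameters, and the list-based node representation are assumptions for the sketch, which also ignores minimum occupancy of the last node on each level.

```python
def bulk_load(sorted_entries, leaf_capacity, fanout):
    """Build index levels bottom-up over sorted data entries.

    Returns a list of levels, leaves first and root last. Each index node
    is represented only by its list of separator keys.
    """
    leaves = [sorted_entries[i:i + leaf_capacity]
              for i in range(0, len(sorted_entries), leaf_capacity)]
    levels = [leaves]
    low_keys = [page[0] for page in leaves]   # smallest key under each node
    while len(low_keys) > 1:
        groups = [low_keys[i:i + fanout]
                  for i in range(0, len(low_keys), fanout)]
        # A node with c children stores c-1 separators: the low keys of
        # every child except the first.
        levels.append([group[1:] for group in groups])
        low_keys = [group[0] for group in groups]
    return levels


entries = [3, 4, 6, 9, 10, 11, 12, 13, 20, 22, 23, 31, 35, 36, 38, 41, 44]
levels = bulk_load(entries, leaf_capacity=2, fanout=3)
for level in reversed(levels):
    print(level)
```

Every page is written once, in order, which is also why the leaves end up stored sequentially.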

Summary of Bulk Loading

• Option 1: multiple inserts.
  - Slow.
  - Does not give sequential storage of leaves.
• Option 2: bulk loading.
  - Has advantages for concurrency control.
  - Fewer I/Os during build.
  - Leaves will be stored sequentially (and linked, of course).
  - Can control the "fill factor" on pages.

A Note on "Order"

• The order (d) concept is replaced by a physical space criterion in practice ("at least half-full").
  - Index pages can typically hold many more entries than leaf pages.
  - Variable-sized records and search keys mean different nodes will contain different numbers of entries.
  - Even with fixed-length fields, multiple records with the same search key value (duplicates) can lead to variable-sized data entries (if we use Alternative (3)).


Summary

• Tree-structured indexes are ideal for range searches, and also good for equality searches.
• ISAM is a static structure.
  - Only leaf pages are modified; overflow pages are needed.
  - Overflow chains can degrade performance unless the size of the data set and the data distribution stay constant.
• B+ tree is a dynamic structure.
  - Inserts/deletes leave the tree height-balanced; log_F N cost.
  - High fanout (F) means the depth is rarely more than 3 or 4.
  - Almost always better than maintaining a sorted file.

Summary (Contd.)

• B+ tree (continued):
  - Typically, 67% occupancy on average.
  - Usually preferable to ISAM, modulo locking considerations; adjusts to growth gracefully.
  - If data entries are data records, splits can change rids!
• Key compression increases fanout and reduces height.
• Bulk loading can be much faster than repeated inserts for creating a B+ tree on a large data set.
• The B+ tree is the most widely used index in database management systems because of its versatility. It is one of the most optimized components of a DBMS.


Comparing File Organizations

• A collection of employee records with composite search key <age, sal>:
  - Heap files (random order; insert at eof)
  - Sorted files, sorted on <age, sal>
  - Clustered B+ tree file, Alternative (1), search key <age, sal>
  - Heap file with unclustered B+ tree index on search key <age, sal>
  - Heap file with unclustered hash index on search key <age, sal>

Operations to Compare

• Scan: Fetch all records from disk
• Search with equality selection
• Search with range selection
• Insert a record
• Delete a record


Cost Model for Our Analysis

We ignore CPU costs in the final estimates, for simplicity. The parameters are:
  - B: The number of data pages
  - R: The number of records per page
  - D: (Average) time to read or write a disk page
    - 15 milliseconds
  - C: (Average) time to process a record (e.g., comparing a field value to a selection constant)
    - 100 nanoseconds
  - H: The time to apply the hash function to a record
    - 100 nanoseconds
  - F: Fan-out of an index tree
    - 100

Cost Model

• We expect the cost of I/O to dominate.
• Disk speeds are not increasing at the same pace as CPU speeds.
• Measuring the number of page I/Os ignores the gains of pre-fetching a sequence of pages; thus, even the I/O cost is only approximated.
• Average-case analysis, based on several simplistic assumptions.


Assumptions in Our Analysis

• Heap files:
  - Equality selection on key; exactly one match.
• Sorted files:
  - Files compacted after deletions.
• Indexes:
  - Alt (2), (3): data entry size = 10% of the size of a record
  - Hash: No overflow buckets.
    - 80% page occupancy => file size = 1.25 * data size
  - Tree: 67% occupancy (this is typical).
    - Implies file size = 1.5 * data size

Assumptions (contd.)

• Scans:
  - Leaf levels of a tree index are chained.
  - Index data entries plus the actual file are scanned for unclustered indexes.
• Range searches:
  - We use tree indexes to restrict the set of data records fetched, but ignore hash indexes.


Heap Files

• Scan: B(D + RC)
• Search with equality selection: 0.5B(D + RC)
  - Average case of linear search
• Search with range selection: B(D + RC)
• Insert: 2D + C
  - Insert at the end of the file
• Delete: search cost + (D + C)
  - If the page is located directly through the rid: 2D + C

Sorted Files

• Scan: B(D + RC)
• Search with equality selection: D log_2 B + C log_2 R
• Search with range selection: the cost of the search + the cost of retrieving the set of records.
  - The cost of the search includes fetching the first page.
• Insert: search cost + 2 * 0.5B(D + RC)
• Delete: search cost + 2 * 0.5B(D + RC)


Clustered Files with B+ Tree

• Scan: 1.5B(D + RC)
• Search with equality selection: D log_F 1.5B + C log_2 R
• Search with range selection: similar to equality search, but with many qualifying records
• Insert/Delete: D log_F 1.5B + C log_2 R + D

Heap File with Un-clustered Tree Index

• The number of leaf pages is 0.1 * 1.5B = 0.15B.
• The number of data entries on a page is 10 * 0.67R = 6.7R.
• Scan: 0.15B(D + 6.7RC) + BR(D + C), or 4B (sorting)
  - Data entries cost: 0.15B(D + 6.7RC)
  - Fetching records (one I/O per record): BR(D + C), or
  - Just sort the records: 4B
• Search with equality selection: D log_F 0.15B + C log_2 6.7R + D


Heap File with Un-clustered Tree Index

• Search with range selection:
  - Similar to equality search, but with many qualifying records: one I/O per qualifying record.
  - If 10% of the data records qualify, it is better to sort the data file.
• Insert:
  - Insert the record in the heap file: 2D + C
  - Insert the data entry in the index: D log_F 0.15B + C log_2 6.7R + D
• Delete: D log_F 0.15B + C log_2 6.7R + 2D

Heap File with Un-clustered Hash Index

• The number of data entry pages is 1.25 * 0.1B = 0.125B.
• The number of data entries on a page is 10 * 0.8R = 8R.
• Scan: 0.125B(D + 8RC) + BR(D + C)
  - Data entries: 0.125B(D + 8RC)
  - Fetching each record: BR(D + C)
• Search with equality selection: H + 2D + 4RC
  - Hashing cost: H
  - Fetching the data entry page: D
  - Searching the data entry page: 0.5 * 8RC = 4RC
  - Fetching the data record page: D


Heap File with Un-clustered Hash Index

• Search with range selection: B(D + RC)
  - Scan the whole heap file.
• Insert: 2D + C plus H + 2D + C
  - Insert the data record in the heap file: 2D + C
  - Update the hash index: H + 2D + C
• Delete: H + 2D + 4RC plus 2D
  - Hashing cost: H
  - Fetching the data entry page: D
  - Searching the data entry page: 0.5 * 8RC = 4RC
  - Fetching the data record page: D
  - Updating both the data entry page and the data record page: 2D

Cost of Operations

File type               | (a) Scan    | (b) Equality          | (c) Range                               | (d) Insert  | (e) Delete
(1) Heap                | BD          | 0.5BD                 | BD                                      | 2D          | Search + D
(2) Sorted              | BD          | D log_2 B             | D(log_2 B + # pgs with match recs)      | Search + BD | Search + BD
(3) Clustered           | 1.5BD       | D log_F 1.5B          | D(log_F 1.5B + # pgs w. match recs)     | Search + D  | Search + D
(4) Unclust. tree index | BD(R+0.15)  | D(1 + log_F 0.15B)    | D(log_F 0.15B + # pgs w. match recs)    | Search + 2D | Search + 2D
(5) Unclust. hash index | BD(R+0.125) | 2D                    | BD                                      | Search + 2D | Search + 2D

Several assumptions underlie these (rough) estimates!
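
To get a feel for the magnitudes, the scan and equality columns of the table can be evaluated for concrete parameter values. The sketch below uses D = 15 ms and F = 100 from the cost model; the function name, the dictionary layout, and the sample values B = 10,000 and R = 100 are assumptions chosen for illustration.

```python
import math


def io_costs(B, R, D=0.015, F=100):
    """I/O-only estimates, in seconds, for the scan and equality columns."""

    def log_f(x):
        return math.log(x, F)

    return {
        "heap": {"scan": B * D, "equality": 0.5 * B * D},
        "sorted": {"scan": B * D, "equality": D * math.log2(B)},
        "clustered": {"scan": 1.5 * B * D, "equality": D * log_f(1.5 * B)},
        "unclustered tree": {"scan": B * D * (R + 0.15),
                             "equality": D * (1 + log_f(0.15 * B))},
        "unclustered hash": {"scan": B * D * (R + 0.125),
                             "equality": 2 * D},
    }


costs = io_costs(B=10_000, R=100)
for name, row in costs.items():
    print(f"{name:17s} scan={row['scan']:10.2f}s  equality={row['equality']:.4f}s")
```

For these sample values the equality search drops from over a minute on a heap file to a few hundredths of a second with any index, while scanning through an unclustered index is about a hundred times slower than scanning the file itself.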

Summary

• Heap file: Good storage efficiency, fast scan, slow search and deletion.
• Sorted file: Good storage efficiency, slow insertion and deletion. Searching is faster than in a heap file.
• Clustered file: As good as a sorted file, plus efficient insertion and deletion. Some space overhead.
• Un-clustered file: Fast search, insertion, and deletion. Slow scan and range searching.

Impact of the Workload

• An index supports efficient retrieval of data entries that satisfy a given selection condition.
• Hash-based indexes are optimized for equality selections and perform poorly on range selections.
• Tree-based indexes support both efficiently.
• Both support inserts, deletes, and updates quite efficiently.


Impact of the Workload

• A B+ tree has two important advantages over sorted files:
  - It handles inserts and deletes of data entries efficiently.
  - Finding the correct leaf page when searching for a record by search key value is much faster than binary search in a sorted file.

Understanding the Workload

• For each query in the workload:
  - Which relations does it access?
  - Which attributes are retrieved?
  - Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
• For each update in the workload:
  - Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
  - The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected.


Choice of Indexes

• What indexes should we create?
  - Which relations should have indexes? What field(s) should be the search key? Should we build several indexes?
• For each index, what kind of an index should it be?
  - Clustered? Hash/tree?

Choice of Indexes

• Before creating an index, we must also consider the impact on updates in the workload!
  - Trade-off: Indexes can make queries go faster and updates slower. They require disk space, too. Using indexes on tables that are frequently updated can result in poor performance.
• Indexes are most useful on tables whose data does not change frequently but is used a lot in queries.


Choice of Indexes

• One approach: Consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it.
• Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on the important queries that would benefit the most from clustering.

Index Selection Guidelines

• Attributes in the WHERE clause are candidates for index keys.
  - An exact match condition suggests a hash index.
  - A range query suggests a tree index.
  - Clustering is especially useful for range queries; it can also help on equality queries if there are many duplicates.


Examples of Clustered Indexes

• A B+ tree index on E.age can be used to get the qualifying tuples.
• An alternative is a sorted file on E.age.
• Consider:
  - How selective is the condition?
  - Is the index clustered?

SELECT E.dno
FROM Emp E
WHERE E.age > 40;

Examples of Clustered Indexes

• Consider the GROUP BY query.
  - If many tuples have E.age > 10, using the E.age index and sorting the retrieved tuples may be costly.
  - A clustered E.dno index may be better, since sorting is expensive.

SELECT E.dno, COUNT(*)
FROM Emp E
WHERE E.age > 10
GROUP BY E.dno;


Examples of Clustered Indexes

• If an index is on a search key that does not include a candidate key (so many records can share one key value), clustering is important.
• Equality queries and duplicates:
  - Clustering on E.hobby helps!

SELECT E.dno
FROM Emp E
WHERE E.hobby = 'Stamps';

Index Selection Guidelines

• Composite search keys should be considered when a WHERE clause contains several conditions.
  - A hash index works for equality conditions on every field.
  - A tree index works for an equality or range condition on a prefix of the composite search key, so the order of attributes is important for range queries.
  - Such indexes can sometimes enable index-only strategies for important queries.
    - For index-only strategies, clustering is not important!


Indexes with Composite Search Keys

Examples of composite key indexes using lexicographic order.

• Equality query: age = 20 and sal = 75
• Data entries in the index are sorted by search key to support range queries.
  - Lexicographic order, or
  - Spatial order.

Data records, sorted by name:

name | age | sal
bob  | 12  | 10
cal  | 11  | 80
joe  | 12  | 20
sue  | 13  | 75

Data entries in index sorted by <age, sal>:  11,80   12,10   12,20   13,75
Data entries sorted by <age>:                11      12      12      13
Data entries in index sorted by <sal, age>:  10,12   20,12   75,13   80,11
Data entries sorted by <sal>:                10      20      75      80
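
Lexicographic order is exactly how Python compares tuples, so the four sample records are enough to illustrate why a condition on a prefix of the composite key can be answered with one contiguous run of data entries. The helper name prefix_range is made up for this sketch, and sorted lists stand in for the leaf level of a tree index.

```python
from bisect import bisect_left, bisect_right

# (name, age, sal) records from the example above.
records = [("bob", 12, 10), ("cal", 11, 80), ("joe", 12, 20), ("sue", 13, 75)]

# Data entries of two composite indexes, in lexicographic order.
age_sal = sorted((age, sal) for _, age, sal in records)
sal_age = sorted((sal, age) for _, age, sal in records)


def prefix_range(entries, first):
    """All entries whose first field equals `first`, as one contiguous slice."""
    lo = bisect_left(entries, (first,))                 # (first,) sorts before (first, x)
    hi = bisect_right(entries, (first, float("inf")))   # after every (first, x)
    return entries[lo:hi]


print(age_sal)                    # order of the <age, sal> index
print(sal_age)                    # order of the <sal, age> index
print(prefix_range(age_sal, 12))  # age = 12 is a prefix condition on <age, sal>
```

On the <sal, age> index, a condition on age alone is not a prefix condition, so in general every data entry would have to be examined.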

Composite Search Keys

• To retrieve Emp records with age = 30 AND sal = 4000, an index on <age, sal> would be better than an index on age or an index on sal.
• If the condition is 20 < age < 30 AND 3000 < sal < 5000, a clustered tree index on <age, sal> or <sal, age> is best.
• If the condition is age = 30 AND 3000 < sal < 5000, a clustered <age, sal> index is much better than a <sal, age> index!
• Composite indexes are larger and are updated more often.


Index-Only Evaluation

• Plan A: Sort Employees on E.dno to compute the count.
• Plan B: If an index on E.dno is available, the query can be answered by scanning only the index.

SELECT E.dno, COUNT(*)
FROM Emp E
GROUP BY E.dno;
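
Plan B works because the leaf level of an index on E.dno already lists every dno value in sorted order, so equal values are adjacent and can be counted in one pass without touching the data records. A minimal sketch, in which the function name and the sample key values are hypothetical:

```python
from itertools import groupby


def count_by_key(index_keys):
    """Count data entries per key value.

    index_keys must be in index (sorted) order, so that equal keys are
    adjacent; only the index is read, never the data records.
    """
    return {key: sum(1 for _ in group) for key, group in groupby(index_keys)}


# Hypothetical E.dno values as they would appear in the index leaf level.
dno_index = [1, 1, 2, 2, 2, 5]
print(count_by_key(dno_index))
```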

Index-Only Plans

• A composite B+ tree index on <age, sal> could answer this query with an index-only scan:

SELECT AVG(E.sal)
FROM Emp E
WHERE E.age = 25 AND E.sal BETWEEN 3000 AND 5000;

• A composite B+ tree index on <dno, sal> could answer this query with an index-only scan:

SELECT E.dno, MIN(E.sal)
FROM Emp E
GROUP BY E.dno;

Create Index Basic Syntax

• Indexes can be defined in two ways:
  - At the time of table creation.
  - After the table has been created.
• Example schemas:
  - Sailors (sid: integer, sname: string, rating: integer, age: real)
  - Boats (bid: integer, bname: string, color: string)
  - Reserves (sid: integer, bid: integer, day: date)

Example

• For the Sailors table, we expect lots of searches on the sailor id, which is the primary key.

CREATE TABLE Sailors (
    sid    INTEGER NOT NULL AUTO_INCREMENT,
    sname  CHAR(30) NOT NULL,
    rating INTEGER,
    age    REAL,
    CONSTRAINT StudentsKey PRIMARY KEY (sid) USING BTREE,
    CHECK (rating >= 1 AND rating <= 10)
);

* The primary key of the table has already been indexed by MySQL. The index is named PRIMARY, and it is the clustered index.


Example

• For the Sailors table, we could also create an index on sname when we create the table.

CREATE TABLE Sailors (
    sid    INTEGER NOT NULL AUTO_INCREMENT,
    sname  CHAR(30) NOT NULL,
    rating INTEGER,
    age    REAL,
    CONSTRAINT StudentsKey PRIMARY KEY (sid) USING BTREE,
    CHECK (rating >= 1 AND rating <= 10),
    INDEX sname_index (sname) USING HASH
);

Example

• Later, we find that people search a lot by sailors' names and ages, so we add a composite index after the table has been created.

CREATE TABLE Sailors (
    sid    INTEGER NOT NULL AUTO_INCREMENT,
    sname  CHAR(30) NOT NULL,
    rating INTEGER,
    age    REAL,
    CONSTRAINT StudentsKey PRIMARY KEY (sid),
    CHECK (rating >= 1 AND rating <= 10)
);

CREATE INDEX sname_age_index ON Sailors (sname, age) USING BTREE;


Index Status

• View the indexes defined on a particular table:

SHOW INDEX FROM Sailors;

• Sometimes it helps to drop an index to improve performance (for example, when updates are slowed by index maintenance):

DROP INDEX sname_index ON Sailors;

Summary

• Many alternative file organizations exist, each appropriate in some situation.
• If selection queries are frequent, sorting the file or building an index is important.
  - Hash-based indexes are only good for equality search.
  - Sorted files and tree-based indexes are best for range search; they are also good for equality search. (Files are rarely kept sorted in practice; a B+ tree index is better.)
• An index is a collection of data entries plus a way to quickly find entries with given key values.


Summary (Contd.)

• Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.
  - This choice is orthogonal to the indexing technique used to locate data entries with a given key value.
• We can have several indexes on a given file of data records, each with a different search key.
• Indexes can be classified as clustered vs. unclustered, primary vs. secondary, and dense vs. sparse. The differences have important consequences for utility and performance.
