Indexing: Tree-Based Indexing
Indexing
Chapters 8, 10, 11
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke

Tree-Based Indexing
• The data entries are arranged in sorted order by search key value.
• A hierarchical search data structure (tree) is maintained that directs searches to the correct page of data entries.
• Tree-structured indexing techniques support both range searches and equality searches.

Trees
• Tree: a structure with a unique starting node (the root), in which each node can have child nodes and a unique path exists from the root to every other node.
• Root: the top node of a tree structure; a node with no parent.
• Leaf node: a tree node that has no children.
• Level: the distance of a node from the root.
• Height: the maximum level.

Time Complexity of Tree Operations
• Searching: O(height of the tree)
• Inserting: O(height of the tree)
• Deleting: O(height of the tree)

Multi-way Tree
• If we relax the restriction that each node can have only one key, we can reduce the height of the tree.
• A multi-way search tree is a tree in which each node holds between 1 and m-1 distinct keys.

Range Searches
• "Find all students with gpa > 3.0"
  - If the data is in a sorted file, do a binary search to find the first such student, then scan to find the others.
  - The cost of binary search can be quite high.
• Simple idea: create an "index" file.
[Figure: an index file with entries k1, k2, ..., kN, each pointing to one of Page 1, Page 2, Page 3, ..., Page N of the data file]
• Can do binary search on the (smaller) index file!

Property of Multi-way Tree
• The keys in each node are sorted.
• A node with k values has k+1 sub-trees, where the sub-trees may be empty.
• The i-th sub-tree of a node $[v_1, \ldots, v_k]$, $0 \le i \le k$, may hold only values $v$ in the range $v_i < v < v_{i+1}$ ($v_0$ is assumed to equal $-\infty$, and $v_{k+1}$ is assumed to equal $+\infty$).
• An m-way tree of height h has between h and $m^h - 1$ keys.
• The height of a complete m-ary tree with n nodes is $\lceil \log_m n \rceil$.

Searching m-way Tree
• We make an m-way branching decision at each node according to the number of the node's children.
• Searching is performed recursively.
• Time complexity is O(h).

Two Popular Trees
• ISAM (Indexed Sequential Access Method)
  - A static structure.
• B+ tree
  - A dynamic structure; adjusts gracefully under inserts and deletes.
• Leaf pages of both contain data entries.

ISAM
• Index entry: <P0, K1, P1, K2, P2, ..., Km, Pm>
• The index file may still be quite large. But we can apply the idea repeatedly!
[Figure: ISAM structure with non-leaf pages on top, leaf (primary) pages below, and overflow pages chained to the primary pages]

Comments on ISAM
• A multi-way tree; each tree node is a disk page.
• Leaf pages contain data entries.
  - Alternative 1: leaf pages are created with the data records with key value k; all leaf pages are allocated sequentially and sorted on the search key value.
  - Alternative 2 or 3: data records are created and sorted in a separate file, and <key, rid> entries are then stored in the leaf pages of the ISAM index.
• Index pages are allocated next, then space for overflow pages.
• Index entries: <search key value, page id>; they "direct" the search for data entries, which are in leaf pages.
[Figure: file layout with data pages first, then index pages, then overflow pages]

Comments on ISAM
• Search: start at the root; use key comparisons to go to a leaf. Cost is $\log_F N$, where F = # entries per index page and N = # leaf pages.
• Insert: find the leaf the data entry belongs to and put it there (in an overflow page if the leaf is full).
• Delete: find the entry and remove it from the leaf; if an overflow page becomes empty, de-allocate it.
• Static tree structure: inserts/deletes affect only leaf pages.

Example ISAM Tree
• Each node can hold 2 entries; no need for "next-leaf-page" pointers. (Why? The structure is static.)
• Example: search for a record with key value 27.

Root:         [40]
Index pages:  [20 33]  [51 63]
Leaf pages:   [10* 15*] [20* 27*] [33* 37*] [40* 46*] [51* 55*] [63* 97*]

After Inserting 23*, 48*, 41*, 42* ...

Root:                [40]
Index pages:         [20 33]  [51 63]
Primary leaf pages:  [10* 15*] [20* 27*] [33* 37*] [40* 46*] [51* 55*] [63* 97*]
Overflow pages:      [23*] chained to [20* 27*]
                     [48* 41*] chained to [40* 46*], followed by [42*]

... Then Deleting 42*, 51*, 97*
• The number of primary leaf pages is fixed at file creation time (STATIC).

Root:                [40]
Index pages:         [20 33]  [51 63]
Primary leaf pages:  [10* 15*] [20* 27*] [33* 37*] [40* 46*] [55*] [63*]
Overflow pages:      [23*] chained to [20* 27*]
                     [48* 41*] chained to [40* 46*]

• Static tree structure: inserts/deletes affect only leaf pages.
• Note that 51 appears in the index levels, but 51* no longer appears in a leaf!

Comments on ISAM
• The static design leads to the problem that long overflow chains can develop.
• To alleviate this problem, the tree is initially created so that about 20% of each page is free.
• The static design has the advantage that index-level pages need not be locked, since they are never modified.
• Static tree structure: inserts/deletes affect only leaf pages.

B+ Trees
A Dynamic Index Structure; the Most Widely Used Index
• A balanced tree in which the internal nodes direct the search and the leaf nodes contain the data entries.
• The tree structure grows and shrinks dynamically; leaf pages are not sequentially allocated.
• Leaf pages are organized in a doubly linked list.

B+ Tree Indexes
• Leaf pages contain data entries (sorted by search key) and are chained (prev & next) in a doubly linked list.
• Non-leaf pages have index entries, which are only used to direct searches.
• Index entry: <P0, K1, P1, K2, P2, ..., Km, Pm>
[Figure: non-leaf pages on top of a doubly linked chain of leaf pages]

Fan-out of the Tree
• Fan-out of the tree: the average number of children of a non-leaf node.
• If every non-leaf node has n children, a tree of height h has $n^h$ leaf pages.
• A good approximation of the number of leaf pages is $F^h$ (F is the average number of children, which is at least 100).
• Example:
  - A tree of height 4 contains $100^4$ = 100 million leaf pages.
  - A binary search over that many pages would take $\log_2 100{,}000{,}000 > 25$ I/Os.

B+ Trees
• Insert/delete at $\log_F N$ cost; keep the tree height-balanced. (F = fanout, N = # leaf pages)
• Minimum 50% occupancy (except for the root). Each node contains m entries, where d <= m <= 2d; the root is the exception, with 1 <= m <= 2d. The parameter d is called the order of the tree.
• Supports equality and range searches efficiently.
[Figure: index entries (direct the search) above the data entries (the "sequence set")]

Example B+ Tree
• Note how the data entries in the leaf level are sorted.

Root:         [17]    (entries <= 17 to the left, entries > 17 to the right)
Index pages:  [5 13]  [27 30]
Leaf pages:   [2* 3*] [5* 7* 8*] [14* 16*] [22* 24*] [27* 29*] [33* 34* 38* 39*]

• Find 28*? 29*? All entries > 15* and < 30*?
• Insert/delete: find the data entry in a leaf, then change it. The parent sometimes needs to be adjusted, and the change sometimes bubbles up the tree.

B+ Trees in Practice
• Typical order: d = 100. Typical fill-factor: 67%.
  - Average fanout = 133
• Typical capacities:
  - Height 4: $133^4$ = 312,900,721 records
  - Height 3: $133^3$ = 2,352,637 records
• Can often hold the top levels in the buffer pool:
  - Level 1 = 1 page = 8 Kbytes
  - Level 2 = 133 pages ≈ 1 Mbyte
  - Level 3 = 17,689 pages ≈ 133 MBytes

Inserting a Data Entry into a B+ Tree
• Find the correct leaf node L.
• Put the data entry onto L.
  - If L has enough space, done!
  - Else, must split L (into L and a new node L2):
    • Redistribute entries evenly, copy up the middle key.
    • Insert an index entry pointing to L2 into the parent of L.
• This can happen recursively.
  - To split an index node, redistribute entries evenly, but push up the middle key. (Contrast with leaf splits.)
• Splits "grow" the tree; a root split increases the height.
  - Tree growth: the tree gets wider or one level taller at the top.

Inserting 8* entry into B+ Tree

Root:        [13 17 24 30]
Leaf pages:  [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

• Split the full leaf node and copy up the middle key: [2* 3*] [5* 7* 8*]
• 5 is the entry to be inserted in the parent node. (Note that 5 is copied up and continues to appear in the leaf.)
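
The leaf split in the 8* example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: data entries are modeled as plain integers, the order is d = 2 (so a leaf holds at most 4 entries), and the helper name insert_into_leaf is hypothetical rather than taken from the textbook. It covers the leaf level only and does not insert the copied-up key into the parent.

```python
from bisect import insort


def insert_into_leaf(leaf, key, d):
    """Insert key into a sorted leaf of a B+ tree of order d (hypothetical helper).

    A leaf holds at most 2*d entries. Returns (left, right, copied_up_key);
    right and copied_up_key are None when the leaf does not need to split.
    """
    entries = list(leaf)      # work on a copy so the caller's list is untouched
    insort(entries, key)      # keep the entries sorted by search key
    if len(entries) <= 2 * d:
        return entries, None, None
    # Overflow: 2*d + 1 entries. Keep d on the left and d + 1 on the right.
    left, right = entries[:d], entries[d:]
    # Copy up: the smallest key of the new right leaf goes to the parent
    # and also stays in the leaf.
    return left, right, right[0]


# Assumed setup matching the example: the full leaf [2* 3* 5* 7*], order d = 2.
left, right, up = insert_into_leaf([2, 3, 5, 7], 8, d=2)
print(left, right, up)  # [2, 3] [5, 7, 8] 5
```

Because the middle key is copied rather than moved, 5 still appears in the right leaf after the split. An index-node split would instead push the middle key up and remove it from both halves.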