1 Tree-Structured Indexes Range Searches ISAM Example ISAM Tree ISAM Is a STATIC Structure

Introduction Tree-Structured Indexes • Recall: 3 alternatives for data entries k*: CS 186, Fall 2002, Lecture 17 • Data record with key value k R & G Chapter 9 • <k, rid of data record with search key value k> • <k, list of rids of data records with search key k> • Choice is orthogonal to the indexing technique used to locate data entries k*. “If I had eight hours to chop down a • Tree-structured indexing techniques support tree, I'd spend six sharpening my ax.” both range searches and equality searches. • ISAM: static structure; B+ tree: dynamic, Abraham Lincoln adjusts gracefully under inserts and deletes. • ISAM = ???Indexed Sequential Access Method index entry Range Searches P K P K P K P 0 1 1 2 2 m m • ``Find all students with gpa > 3.0’’ ISAM – If data is in sorted file, do binary search to find first such student, then scan to find others. • Index file may still be quite large. But we can – Cost of binary search in a database can be quite apply the idea repeatedly! high. Q: Why??? • Simple idea: Create an ìndex’ file. Non-leaf Pages k1 k2 kN Index File Leaf Pages Page N Data File Page 1 Page 2 Page 3 Overflow page Primary pages Can do binary search on (smaller) index file! Leaf pages contain data entries. Example ISAM Tree Data Pages ISAM is a STATIC Structure • Index entries:<search key value, page id> • File creation: Leaf (data) pages allocated Index Pages they direct search for data entries in leaves. sequentially, sorted by search key; then • Example where each node can hold 2 entries; index pages allocated, then overflow pgs. Overflow pages • Search: Start at root; use key comparisons to go to leaf. Cost = log F N ; Root F = # entries/pg (i.e., fanout), N = # leaf pgs 40 – no need for `next-leaf-page’ pointers. (Why?) • Insert: Find leaf that data entry belongs to, 20 33 51 63 and put it there. Overflow page if necessary. • Delete: Find and remove from leaf; if empty page, de-allocate. 10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97* Static tree structure: inserts/deletes affect only leaf pages. 1 Example: Insert 23*, 48*, 41*, 42* ... then Deleting 42*, 51*, 97* Root Root Index 40 Index 40 Pages Pages 20 33 51 63 20 33 51 63 Primary Primary Leaf Leaf 10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97* 10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97* Pages Pages Overflow 23* 48* 41* Overflow 23* 48* 41* Pages Pages 42* 42* Note that 51* appears in index levels, but not in leaf! B+ Tree: The Most Widely Used Index ISAM ---- Issues? • Insert/delete at log F N cost; keep tree height- balanced. (F = fanout, N = # leaf pages) • Pros • Minimum 50% occupancy (except for root). Each – ???? node contains d <= m <= 2d entries. “d” is called the order of the tree. • Supports equality and range-searches efficiently. • As in ISAM, all searches go from root to leaves, but • Cons structure is dynamic. – ???? Index Entries (Direct search) Data Entries ("Sequence set") Example B+ Tree B+ Trees in Practice • Search begins at root, and key comparisons • Typical order: 100. Typical fill-factor: 67%. direct it to a leaf (as in ISAM). – average fanout = 133 • Search for 5*, 15*, all data entries >= 24* ... • Typical capacities: 4 Root – Height 4: 133 = 312,900,700 entries – Height 3: 1333 = 2,352,637 entries 13 17 24 30 • Can often hold top levels in buffer pool: – Level 1 = 1 page = 8 Kbytes – Level 2 = 133 pages = 1 Mbyte 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* – Level 3 = 17,689 pages = 133 MBytes Based on the search for 15*, we know it is not in the tree! 2 Inserting a Data Entry into a B+ Tree Example B+ Tree - Inserting 8* • Find correct leaf L. • Put data entry onto L. Root Root –If L has enough space, done! 17 13 17 24 30 –Else, must split L (into L and a new node L2) • Redistribute entries evenly, copy up middle key. 5 13 24 30 • Insert index entry pointing to L2 into parent of L. 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* • This can happen recursively 2* 3* 5*7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* – To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) Notice that root was split, leading to increase in height. • Splits “grow” tree; root split increases height. In this example, we can avoid split by re-distributing – Tree growth: gets wider or one level taller at top. entries; however, this is usually not done in practice. Example: Data vs. Index Page Split Deleting a Data Entry from a B+ Tree 3* 5* Data 2* 7* 8* Entry to be inserted in parent node. • Observe how Page 5 (Note that 5 iss copied up and • Start at root, find leaf L where entry belongs. minimum Split continues to appear in the leaf.) occupancy is • Remove the entry. guaranteed in 2* 3* 5* 7* 8* – If L is at least half-full, done! both leaf and … –If L has only d-1 entries, index pg splits. •Try to re-distribute, borrowing from sibling (adjacent • Note difference Index 5 13 17 24 30 between copy- Page node with same parent as L). Entry to be inserted in parent node. up and push-up; Split (Note that 17 is pushed up and only 17 appears once in the index. Contrast • If re-distribution fails, merge L and sibling. be sure you this with a leaf split.) understand the • If merge occurred, must delete entry (pointing to L reasons for this. or sibling) from parent of L. 5243013 • Merge could propagate to root, decreasing height. Example Tree (including 8*) Delete 19* and 20* ... ... And Then Deleting 24* • Must merge. 30 Root • Observe `toss’ of Root 17 17 index entry (on right), and `pull down’ of 22* 27* 29* 33* 34* 38* 39* 5 13 24 30 index entry (below). 5 13 27 30 2* 3* 5*7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* 2* 3* 5*7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39* Root 5 13 17 30 • Deleting 19* is easy. • Deleting 20* is done with re-distribution. Notice how middle key is copied up. 2* 3* 5*7* 8* 14* 16* 22* 27* 29* 33* 34* 38* 39* 3 Example of Non-leaf Re-distribution After Re-distribution • Tree is shown below during deletion of 24*. (What • Intuitively, entries are re-distributed by `pushing could be a possible initial tree?) through’ the splitting entry in the parent node. • In contrast to previous example, can re-distribute • It suffices to re-distribute index entry with key 20; entry from left child of root to right child. we’ve re-distributed 17 as well for illustration. Root Root 17 22 5 13 20 22 30 5 13 17 20 30 2* 3* 5*7* 8* 14* 16* 17* 18* 20*21* 22* 27* 29* 33* 34* 38* 39* 2* 3* 5*7* 8* 14* 16* 17* 18* 20* 21* 22* 27* 29* 33* 34* 38* 39* Prefix Key Compression Bulk Loading of a B+ Tree • Important to increase fan-out. (Why?) • If we have a large collection of records, and we • Key values in index entries only `direct traffic’; want to create a B+ tree on some field, doing so can often compress them. by repeatedly inserting records is very slow. – Also leads to minimal leaf utilization --- why? – E.g., If we have adjacent index entries with search key values Dannon Yogurt, David Smith and • Bulk Loading can be done much more efficiently. Devarakonda Murthy, we can abbreviate David Smith • Initialization: Sort all data entries, insert pointer to Dav. (The other keys can be compressed too ...) to first (leaf) page in a new (root) page. • Is this correct? Not quite! What if there is a data entry Root Davey Jones? (Can only compress David Smith to Davi) Sorted pages of data entries; not yet in B+ tree • In general, while compressing, must leave each index entry greater than every key value (in any subtree) to its left. • Insert/delete must be suitably modified. 3* 4* 6* 9* 10* 11* 12* 13* 20* 22* 23* 31* 35* 36* 38* 41* 44* Bulk Loading (Contd.) Summary of Bulk Loading Root 10 20 • Index entries for leaf pages always entered Data entry pages 6 12 23 35 • Option 1: multiple inserts. into right-most index not yet in B+ tree page just above leaf –Slow. level. When this fills – Does not give sequential storage of leaves. up, it splits. (Split 3* 4* 6* 9* 10* 11* 12* 13* 20*22* 23* 31* 35*36* 38*41* 44* may go up right-most • Option 2: Bulk Loading path to the root.) – Has advantages for concurrency control. Root • Much faster than 20 – Fewer I/Os during build. repeated inserts, especially when one 10 35 Data entry pages – Leaves will be stored sequentially (and linked, of considers locking! not yet in B+ tree course). 6 12 23 38 – Can control “fill factor” on pages. 3* 4* 6* 9* 10* 11* 12* 13* 20* 22* 23* 31* 35* 36* 38* 41* 44* 4 A Note on Òrder’ Summary • Order (d) concept replaced by physical space • Tree-structured indexes are ideal for range- criterion in practice (àt least half-full’).

1 Tree-Structured Indexes Range Searches ISAM Example ISAM Tree ISAM Is a STATIC Structure

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support