Chapter 4 Tree-Structured Indexing + ISAM and B -Trees Binary Search
Total Page:16
File Type:pdf, Size:1020Kb
Tree-Structured Indexing Torsten Grust Chapter 4 Tree-Structured Indexing + ISAM and B -trees Binary Search ISAM Architecture and Implementation of Database Systems Multi-Level ISAM Too Static? Summer 2014 Search Efficiency B+-trees Search Insert Redistribution Delete Duplicates Key Compression Bulk Loading Partitioned B+-trees Torsten Grust Wilhelm-Schickard-Institut für Informatik Universität Tübingen 1 Here, let k denote the full record with key k: ∗ 8180* 8910* 4104* 4123* 4222* 4450* 4528* 5012* 6330* 6423* 8050* 8105* 8245* 8280* 8406* 8570* 8600* 8604* 8700* 8808* 8887* 8953* 9016* 9532* 9200* scan Tree-Structured Ordered Files and Binary Search Indexing Torsten Grust How could we prepare for such queries and evaluate them efficiently? 1 SELECT* 2 FROM CUSTOMERS 3 WHERE ZIPCODE BETWEEN 8880 AND 8999 Binary Search ISAM We could Multi-Level ISAM Too Static? 1 sort the table on disk (in ZIPCODE-order) Search Efficiency + 2 To answer queries, use binary search to find the first B -trees Search qualifying tuple, then scan as long as ZIPCODE < 8999. Insert Redistribution Delete Duplicates Key Compression Bulk Loading Partitioned B+-trees 2 Tree-Structured Ordered Files and Binary Search Indexing Torsten Grust How could we prepare for such queries and evaluate them efficiently? 1 SELECT* 2 FROM CUSTOMERS 3 WHERE ZIPCODE BETWEEN 8880 AND 8999 Binary Search ISAM We could Multi-Level ISAM Too Static? 1 sort the table on disk (in ZIPCODE-order) Search Efficiency + 2 To answer queries, use binary search to find the first B -trees Search qualifying tuple, then scan as long as ZIPCODE < 8999. Insert Redistribution Delete Duplicates Here, let k denote the full record with key k: Key Compression ∗ Bulk Loading Partitioned B+-trees 4104* 4123* 4222* 4450* 4528* 5012* 6330* 6423* 8050* 8105* 8180* 8245* 8280* 8406* 8570* 8600* 8604* 8700* 8808* 8887* 8910* 8953* 9016* 9200* 9532* scan 2 % We need to read about as many pages for this. The whole point of binary search is that we make far, unpredictable jumps. This largely defeats page prefetching. Tree-Structured Ordered Files and Binary Search Indexing Torsten Grust Binary Search 4104* 4123* 4222* 4450* 4528* 5012* 6330* 6423* 8050* 8105* 8180* 8245* 8280* 8406* 8570* 8600* 8604* 8700* 8808* 8887* 8910* 8953* 9016* 9200* 9532* ISAM page 0 page 1 page 2 page 3 page 4 page 5 page 6 page 7 page 8 page 9 page 10 page 11 page 12 Multi-Level ISAM scan Too Static? Search Efficiency " We get sequential access during the scan phase. B+-trees Search We need to read log (# tuples) tuples during the search phase. Insert 2 Redistribution Delete Duplicates Key Compression Bulk Loading Partitioned B+-trees 3 Tree-Structured Ordered Files and Binary Search Indexing Torsten Grust Binary Search 4104* 4123* 4222* 4450* 4528* 5012* 6330* 6423* 8050* 8105* 8180* 8245* 8280* 8406* 8570* 8600* 8604* 8700* 8808* 8887* 8910* 8953* 9016* 9200* 9532* ISAM page 0 page 1 page 2 page 3 page 4 page 5 page 6 page 7 page 8 page 9 page 10 page 11 page 12 Multi-Level ISAM scan Too Static? Search Efficiency " We get sequential access during the scan phase. B+-trees Search We need to read log (# tuples) tuples during the search phase. Insert 2 Redistribution Delete Duplicates % as many pages Key Compression We need to read about for this. Bulk Loading Partitioned B+-trees The whole point of binary search is that we make far, unpredictable jumps. This largely defeats page prefetching. 3 Tree-Structured Tree-Structured Indexing Indexing Torsten Grust • This chapter discusses two index structures which especially Binary Search shine if we need to support range selections (and thus ISAM + Multi-Level ISAM sorted file scans): ISAM files and B -trees. Too Static? • Both indexes are based on the same simple idea which Search Efficiency B+-trees naturally leads to a tree-structured organization of the Search Insert indexes. (Hash indexes are covered in a subsequent chapter.) Redistribution + Delete • B -trees refine the idea underlying the rather static ISAM Duplicates scheme and add efficient support for insertions and Key Compression Bulk Loading deletions. Partitioned B+-trees 4 Tree-Structured Indexed Sequential Access Method (ISAM) Indexing Torsten Grust Remember: range selections on ordered files may use binary Binary Search ISAM search to locate the lower range limit as a starting point for a Multi-Level ISAM Too Static? sequential scan of the file (until the upper limit is reached). Search Efficiency B+-trees ISAM Search ... Insert Redistribution • . acts as a replacement for the binary search phase, and Delete Duplicates • touches considerably fewer pages than binary search. Key Compression Bulk Loading Partitioned B+-trees 5 Tree-Structured Indexed Sequential Access Method (ISAM) Indexing Torsten Grust To support range selections on field A: 1 In addition to the A-sorted data file, maintain an index file with entries (records) of the following form: Binary Search ISAM index entry separator pointer Multi-Level ISAM Too Static? Search Efficiency p0 k1 p1 k2 p2 kn pn B+-trees ··· Search • • • • Insert Redistribution 2 Delete ISAM leads to sparse index structures. In an index entry Duplicates Key Compression Bulk Loading k , pointer to p , + h i ii Partitioned B -trees key ki is the first (i.e., the minimal) A value on the data file page pointed to by pi (pi: page no). 6 Tree-Structured Indexed Sequential Access Method (ISAM) Indexing Torsten Grust index entry separator key p0 k1 p1 k2 p2 kn pn ··· • • • • Binary Search ISAM • In the index file, the ki serve as separators between the Multi-Level ISAM Too Static? contents of pages pi 1 and pi. Search Efficiency − B+-trees • It is guaranteed that ki 1 < ki for i = 2,..., n. − Search Insert • We obtain a one-level ISAM structure. Redistribution Delete Duplicates One-level ISAM structure of N + 1 pages Key Compression Bulk Loading Partitioned B+-trees index file k1 k2 kN p p p data file 0 1 p2 N 7 Tree-Structured Searching ISAM Indexing Torsten Grust SQL range selection on field A 1 SELECT* Binary Search 2 FROMR ISAM 3 WHEREA BETWEEN lower AND upper Multi-Level ISAM Too Static? Search Efficiency B+-trees To support range selections: Search Insert Redistribution 1 Conduct a binary search on the index file for a key of value Delete lower. Duplicates Key Compression 2 Start a sequential scan of the data file from the page Bulk Loading Partitioned B+-trees pointed to by the index entry (scan until field A exceeds upper). 8 Tree-Structured Indexed Sequential Access Method ISAM Indexing Torsten Grust • The size of the index file is likely to be much smaller than the data file size. Searching the index is far more efficient than searching the data file. Binary Search ISAM • For large data files, however, even the index file might be too Multi-Level ISAM large to allow for fast searches. Too Static? Search Efficiency B+-trees Search Insert Main idea behind ISAM indexing Redistribution Delete Recursively apply the index creation step: treat the topmost Duplicates Key Compression index level like the data file and add an additional index layer on Bulk Loading + top. Partitioned B -trees Repeat, until the the top-most index layer fits on a single page (the root page). 9 Tree-Structured Multi-Level ISAM Structure Indexing Torsten Grust This recursive index creation scheme leads to a tree-structured hierarchy of index levels: Multi-level ISAM structure Binary Search ISAM Multi-Level ISAM Too Static? Search Efficiency 8180 8910 B+-trees • • • Search Insert Redistribution Delete index pages Duplicates 4222 4528 6330 8050 8280 8570 8604 8808 9016 9532 Key Compression • • • • • • • • • • • • • Bulk Loading Partitioned B+-trees data 8953* 9532* 4104* 4123* 4222* 4450* 4528* 5012* 6330* 6423* 8050* 8105* 8180* 8245* 8280* 8406* 8570* 8600* 8604* 8700* 8808* 8887* 8910* 9016* 9200* pages 10 Tree-Structured Multi-Level ISAM Structure Indexing Torsten Grust Multi-level ISAM structure 8180 8910 • • • Binary Search index pages 4222 4528 6330 8050 8280 8570 8604 8808 9016 9532 ISAM • • • • • • • • • • • • • Multi-Level ISAM Too Static? Search Efficiency B+-trees data 4104* 4123* 4222* 4450* 4528* 5012* 6330* 6423* 8050* 8105* 8180* 8245* 8280* 8406* 8570* 8600* 8604* 8700* 8808* 8887* 8910* 8953* 9016* 9200* 9532* pages Search Insert Redistribution Delete Duplicates Key Compression Bulk Loading • Each ISAM tree node corresponds to one page (disk block). + Partitioned B -trees • To create the ISAM structure for a given data file, proceed bottom-up: 1 Sort the data file on the search key field. 2 Create the index leaf level. 3 If the top-most index level contains more than one page, repeat. 11 Tree-Structured Multi-Level ISAM Structure: Overflow Pages Indexing Torsten Grust • The upper index levels of the ISAM tree remain static: insertions and deletions in the data file do not affect the upper tree layers. • Insertion of record into data file: if space is left on the associated leaf page, insert record there. Binary Search ISAM • Otherwise create and maintain a chain of overflow pages Multi-Level ISAM Too Static? hanging off the full primary leaf page. Note: the records on Search Efficiency the overflow pages are not ordered in general. B+-trees Search Over time, search performance in ISAM can degrade. Insert ⇒ Redistribution Delete Duplicates Multi-level ISAM structure with overflow pages Key Compression Bulk Loading Partitioned B+-trees ··· ··· ··· ··· ··· ··· ··· overflow pages 12 Tree-Structured Multi-Level ISAM Structure: Example Indexing Torsten Grust Eeach page can hold two index entries plus one (the left-most) page pointer: Example (Initial