File Organizations and Indexing

Lecture 5
R&G Chapter 8

"If you don't find it in the index, look very carefully through the entire catalogue."
-- Sears, Roebuck, and Co., Consumer's Guide, 1897

Administrivia
• Homework 1 Assignment now Available
  – Due in 2 parts: Sunday 9/14, Thursday 9/18
  – Have to make changes to Postgres
  – Don't procrastinate

Review – The Big Picture
• Databases are cool
  – Theory: data modelling, relational algebra
  – Practical: storing data with disk, RAM
• Today:
  – Practical: File Organization, Indexing
• Next Time:
  – Theory: Relational Calculus

Review – Relational Algebra
• Basic operators for manipulating relations
  – Selection (σ) Selects a subset of rows
  – Projection (π) Retains only wanted columns
  – Cross-product (×) Combine two relations
  – Set-difference (−) Tuples in r1, but not in r2
  – Union (∪) Tuples in r1 and/or in r2
  – Rename (ρ) Change column name
• Other Useful Operators
  – Intersection (∩) Tuples in both relations
  – Join (⋈) Match common columns
  – Division (/) Tuples in A that match all in B

Relational Algebra 2: Division
• Division A / B
  – columns in A must be a superset of columns in B
  – let C_both be the columns in both A and B
  – let C_A-Only be the columns that are only in A
  – result has columns C_A-Only
  – result has only rows where values in C_A-Only have a match with every value in B
  – can be defined using other operators

Review: Examples of Division A/B

  A
  sno  pno
  s1   p1
  s1   p2
  s1   p3
  s1   p4
  s2   p1
  s2   p2
  s3   p2
  s4   p2
  s4   p4

  B1 (pno): p2
  B2 (pno): p2, p4
  B3 (pno): p1, p2, p4

  A/B1 (sno): s1, s2, s3, s4
  A/B2 (sno): s1, s4
  A/B3 (sno): s1

Review: Reserves and Sailors

  Reserves
  sid  bid  day
  22   101  10/10/96
  58   103  11/12/96

  Sailors
  sid  sname   rating  age
  22   dustin  7       45.0
  31   lubber  8       55.5
  58   rusty   10      35.0

Find the names of sailors who've reserved all boats
• Uses division; schemas of the input relations to / must be carefully chosen:
    ρ(Tempsids, (π_{sid,bid} Reserves) / (π_{bid} Boats))
    π_{sname}(Tempsids ⋈ Sailors)
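The division examples above can be checked with a short sketch. It models a relation as a Python set of tuples; the function names `divide` and `divide_basic` are illustrative and not part of the lecture. The second function follows the remark that division "can be defined using other operators" (projection, cross-product, set-difference).

```python
def divide(a, b):
    """A / B for a binary relation a of (x, y) pairs and a unary relation b of y values.

    Keeps each x that is paired in a with every y in b.
    """
    xs = {x for x, _ in a}
    return {x for x in xs if all((x, y) in a for y in b)}


def divide_basic(a, b):
    """The same result built from projection, cross-product and set-difference."""
    xs = {x for x, _ in a}                      # projection of A onto the A-only column
    all_pairs = {(x, y) for x in xs for y in b}  # cross-product with B
    disqualified = {x for x, _ in all_pairs - a}  # x values missing at least one y
    return xs - disqualified


# The instance A(sno, pno) and divisors B1, B2, B3 from the example slide.
A = {
    ("s1", "p1"), ("s1", "p2"), ("s1", "p3"), ("s1", "p4"),
    ("s2", "p1"), ("s2", "p2"),
    ("s3", "p2"),
    ("s4", "p2"), ("s4", "p4"),
}
B1 = {"p2"}
B2 = {"p2", "p4"}
B3 = {"p1", "p2", "p4"}

print(sorted(divide(A, B1)))  # ['s1', 's2', 's3', 's4']
print(sorted(divide(A, B2)))  # ['s1', 's4']
print(sorted(divide(A, B3)))  # ['s1']
```

Both functions return the same sets; `divide_basic` just makes the "disqualified" step explicit.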
• To find sailors who've reserved all 'Interlake' boats:
    ..... / π_{bid}(σ_{bname='Interlake'} Boats)

  Boats
  bid  bname      color
  101  Interlake  Blue
  102  Interlake  Red
  103  Clipper    Green
  104  Marine     Red

Review – Buffer Management and Files
• Storage of Data
  – Fields, either fixed or variable length...
  – Stored in Records...
  – Stored in Pages...
  – Stored in Files
• If data won't fit in RAM, store on Disk
  – Need Buffer Pool to hold pages in RAM
  – Different strategies decide what to keep in pool

Today: File Organization
• How to keep pages of records on disk
• but must support operations:
  – scan all records
  – search for a record id "RID"
  – search for record(s) with certain values
  – insert new records
  – delete old records

Alternative File Organizations
Many alternatives exist, tradeoffs for each:
  – Heap files:
    • Suitable when typical access is file scan of all records.
  – Sorted Files:
    • Best for retrieval in search key order
    • Also good for search based on search key
  – Indexes: Organize records via trees or hashing.
    • Like sorted files, speed up searches for search key fields
    • Updates are much faster than in sorted files.

Indexes
• Sometimes need to retrieve records by the values in one or more fields, e.g.,
  – Find all students in the "CS" department
  – Find all students with a gpa > 3
• An index on a file is a:
  – Disk-based data structure
  – Speeds up selections on the search key fields for the index.
  – Any subset of the fields of a relation can be index search key
  – Search key is not the same as key
    • (e.g. doesn't have to be unique ID).
• An index
  – Contains a collection of data entries
  – Supports efficient retrieval of all records with a given search key value k.
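To make the heap file versus sorted file tradeoff concrete, here is a minimal in-memory sketch. The student records and the function names are made up for illustration, and Python lists stand in for pages on disk, so it shows only the access pattern, not real I/O.

```python
from bisect import bisect_left, bisect_right

# Hypothetical student records (sid, dept, gpa); not taken from the lecture.
students = [
    (53831, "CS", 3.2),
    (53666, "EE", 3.8),
    (53688, "CS", 2.9),
    (53650, "Math", 3.5),
    (53700, "CS", 3.9),
]


def heap_search(records, dept):
    """Heap file: records are in no particular order, so every record is examined."""
    return [r for r in records if r[1] == dept]


def sorted_search(sorted_records, dept):
    """File sorted on dept: binary search locates the run of matching records.

    The key list is rebuilt here only to keep the sketch short; a real sorted
    file would binary-search over pages instead.
    """
    keys = [r[1] for r in sorted_records]
    lo = bisect_left(keys, dept)
    hi = bisect_right(keys, dept)
    return sorted_records[lo:hi]


# dept is the search key here; it is not a key of the relation (values repeat).
by_dept = sorted(students, key=lambda r: r[1])

print(heap_search(students, "CS"))
print(sorted_search(by_dept, "CS"))
```

Both calls return the same three CS students; only the amount of data touched to find them differs.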
First Question to Ask About Indexes
• What kinds of selections do they support?
  – Selections of form field <op> constant
  – Equality selections (op is =)
  – Range selections (op is one of <, >, <=, >=, BETWEEN)
  – More exotic selections:
    • 2-dimensional ranges ("east of Berkeley and west of Truckee and North of Fresno and South of Eureka")
      – Or n-dimensional
    • 2-dimensional distances ("within 2 miles of Soda Hall")
      – Or n-dimensional
    • Ranking queries ("10 restaurants closest to VLSB")
    • Regular expression matches, genome string matches, etc.
    • One common n-dimensional index: R-tree
      – Supported in Oracle and Informix
      – See http://gist.cs.berkeley.edu for research on this topic

Index Classification
• What selections does it support
• Representation of data entries in index
  – what info is the index storing? 3 alternatives:
    • Data record with key value k
    • <k, rid of data record with search key value k>
    • <k, list of rids of data records with search key k>
• Clustered vs. Unclustered Indexes
• Single Key vs. Composite Indexes
• Tree-based, hash-based, other

Alternatives for Data Entry k* in Index
• Three alternatives:
  1. Actual data record (with key value k)
  2. <k, rid of matching data record>
  3. <k, list of rids of matching data records>
• Choice is orthogonal to the indexing technique.
  – techniques: B+ trees, hash-tables, R trees, …
  – Typically, index contains auxiliary information that directs searches to the desired data entries
• Can have multiple (different) indexes per file.
  – E.g. file sorted by age, with a hash index on salary and a B+tree index on name.

Alternatives for Data Entries (Contd.)
• Alternative 1: Actual data record (with key value k)
  – Index structure is file organization for data records (like Heap files or sorted files).
  – At most one index on a table can use Alternative 1.
  – Saves pointer lookups
  – Can be expensive to maintain with insertions and deletions.
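The three data-entry alternatives can be put side by side in a small sketch. The records, the `(page, slot)` form of a rid, and all variable names are assumptions made for illustration; the search key here is age.

```python
from collections import defaultdict

# Hypothetical heap file: rid -> record (name, age). A rid is modelled as (page, slot).
heap_file = {
    (0, 0): ("ann", 25),
    (0, 1): ("bob", 31),
    (1, 0): ("cat", 25),
}

# Alternative 1: the data entries ARE the records, kept in search-key order.
alt1 = sorted(heap_file.values(), key=lambda rec: rec[1])

# Alternative 2: one <k, rid> data entry per record.
alt2 = sorted((rec[1], rid) for rid, rec in heap_file.items())

# Alternative 3: one <k, list of rids> data entry per distinct key value.
alt3 = defaultdict(list)
for rid, rec in sorted(heap_file.items()):
    alt3[rec[1]].append(rid)


def lookup_alt2(k):
    """Follow each matching <k, rid> entry to its record (one pointer lookup each)."""
    return [heap_file[rid] for key, rid in alt2 if key == k]


def lookup_alt3(k):
    """Fetch the rid list for k, then follow each rid."""
    return [heap_file[rid] for rid in alt3.get(k, [])]


print(alt1)
print(lookup_alt2(25))
print(lookup_alt3(25))
```

Alternative 3 stores the key 25 once with two rids, where Alternative 2 stores it twice; that is the compactness tradeoff, at the price of variable-sized entries.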
Alternatives for Data Entries (Contd.)
• Alternative 2 <k, rid of matching data record> and Alternative 3 <k, list of rids of matching data records>
  – Easier to maintain than Alt 1.
  – At most one index can use Alternative 1; any others must use Alternatives 2 or 3.
  – Alternative 3 more compact than Alternative 2, but leads to variable sized data entries even if search keys are of fixed length.
  – Even worse, for large rid lists the data entry might have to span multiple pages!

Index Classification
• Clustered vs. unclustered:
  – If order of data records is the same as, or 'close to', order of index data entries, then called clustered index.
• A file can be clustered on at most one search key.
• Cost to retrieve data records with index varies greatly based on whether index clustered or not!
• Alternative 1 implies clustered, but not vice-versa.

Clustered vs. Unclustered Index
• Suppose that Alternative (2) is used for data entries, and that the data records are stored in a Heap file.
  – To build clustered index, first sort the Heap file (with some free space on each block for future inserts).
  – Overflow blocks may be needed for inserts. (Thus, order of data recs is 'close to', but not identical to, the sort order.)

[Figure: CLUSTERED vs. UNCLUSTERED. Index entries direct search for data entries; data entries (index file) point to data records (data file).]

Unclustered vs. Clustered Indexes
• What are the tradeoffs?
• Clustered Pros
  – Efficient for range searches
  – May be able to do some types of compression
  – Possible locality benefits (related data?)
• Clustered Cons
  – Expensive to maintain (on the fly or sloppy with reorganization)

Hash-Based Indexes
• Good for equality selections.
• Index is a collection of buckets. Bucket = primary page plus zero or more overflow pages.
• Hashing function h:
  – h(r) = bucket in which record r belongs.
  – h looks at the search key fields of r.
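A minimal sketch of the bucket structure just described, assuming a fixed number of buckets and Alternative 2 data entries (<key, rid> pairs). The class name, `PAGE_CAPACITY`, and `NUM_BUCKETS` are hypothetical values chosen for illustration, and Python lists stand in for disk pages.

```python
PAGE_CAPACITY = 2  # hypothetical: data entries that fit on one page
NUM_BUCKETS = 4    # hypothetical: fixed number of buckets


class HashIndex:
    def __init__(self):
        # Each bucket is a list of pages: page 0 is the primary page,
        # any later pages are overflow pages.
        self.buckets = [[[]] for _ in range(NUM_BUCKETS)]

    def _h(self, key):
        """Hash function: looks only at the search key, returns a bucket number."""
        return hash(key) % NUM_BUCKETS

    def insert(self, key, rid):
        pages = self.buckets[self._h(key)]
        if len(pages[-1]) == PAGE_CAPACITY:
            pages.append([])  # last page is full: chain a new overflow page
        pages[-1].append((key, rid))

    def search(self, key):
        """Equality search: read every page of one bucket, keep matching rids."""
        pages = self.buckets[self._h(key)]
        return [rid for page in pages for k, rid in page if k == key]


index = HashIndex()
for key, rid in [(1, "r1"), (5, "r2"), (9, "r3"), (2, "r4")]:
    index.insert(key, rid)

print(index.search(5))        # ['r2']
print(len(index.buckets[1]))  # 2 pages: primary plus one overflow
```

Keys 1, 5 and 9 all land in bucket 1, so the third insert forces an overflow page. A range selection gets no help from this structure, since neighbouring key values are scattered across buckets.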
Hash-Based Indexes (Contd.)
• If Alternative (1) is used, the buckets contain the data records; otherwise, they contain <key, rid> or <key, rid-list> pairs.

B+ Tree Indexes
[Figure: non-leaf pages above a row of leaf pages]
• Leaf pages contain data entries, and are chained (prev & next)
• Non-leaf pages contain index entries and direct searches:
    index entry layout: P0 | K1 | P1 | K2 | P2 | ... | Km | Pm

Example B+ Tree
  Root: 17 (entries <= 17 to the left, entries > 17 to the right)
  Non-leaf level: 5, 13 and 27, 30
  Leaf data entries: 2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39*
• Find 28*? 29*? All > 15* and < 30*
• Insert/delete: Find data entry in leaf, then change it. Need to adjust parent sometimes.
  – And change sometimes bubbles up the tree

Comparing File Organizations
• Heap files (random order; insert at eof)
• Sorted files, sorted on <age, sal>
• Clustered B+ tree file, Alternative (1), search key <age, sal>
• Heap file with unclustered B+ tree index on search key <age, sal>
• Heap file with unclustered hash index on search key <age, sal>

Operations to Compare
• Scan: Fetch all records from disk
• Fetch all records in sorted order
• Equality search
• Range selection
• Insert a record
• Delete a record

Cost Model for Analysis
• I/O cost 150,000 times more than hash function
  – We ignore CPU costs, for simplicity
• B: The number of data pages
• R: Number of records per page
• F: Fanout of B-tree
• Average-case analysis; based on several simplistic assumptions.
  * Good enough to show the overall trends!

Heap File: I/O Cost of Operations
[Table: rows are Scan all records, Get all in sort order, Equality Search, Range Search, Insert, Delete; cost column blank]
  B: Number of data pages (packed)
  R: Number of records per page
  S: Time required for equality search

Assumptions in Our Analysis
• Heap Files:
  – Equality selection on key; exactly one match.
• Sorted Files:
  – Files compacted after deletions.
• Indexes:
  – Alt (2), (3): data entry size = 10% size of record
  – Hash: No overflow buckets.
    • 80% page occupancy => File size = 1.25 data size
  – Tree: 67% occupancy (this is typical).
    • Implies file size = 1.5 data size

[Tables: Sorted File and Clustered Tree, I/O cost of operations (Scan all records, Get all in sort order, ...)]
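The cost model above can be turned into a small calculator. The cost cells of the tables are blank here, so the formulas below are an assumption: they are the usual textbook-style average-case estimates in page I/Os for a heap file and a sorted file, not values read off these slides. B is the number of data pages as defined above; `match_pages` and the function names are hypothetical.

```python
import math


def heap_costs(B):
    """Assumed average-case page I/Os for a heap file with B data pages."""
    return {
        "scan": B,
        "equality": 0.5 * B,    # search on a key, exactly one match: half the file on average
        "range": B,             # matches can be anywhere, so read everything
        "insert": 2,            # read the last page, write it back
        "delete": 0.5 * B + 1,  # find the record, then write its page back
    }


def sorted_costs(B, match_pages=1):
    """Assumed average-case page I/Os for a file sorted on the search key."""
    search = math.log2(B)       # binary search over pages
    return {
        "scan": B,
        "equality": search,
        "range": search + match_pages,
        "insert": search + B,   # find the slot, then shift half the file (read + write)
        "delete": search + B,   # same shifting, since files are compacted after deletions
    }


def file_pages(data_pages, occupancy):
    """Pages needed when each page is only partly full."""
    return data_pages / occupancy


print(heap_costs(1024)["equality"])    # 512.0
print(sorted_costs(1024)["equality"])  # 10.0
print(file_pages(100, 0.80))           # 80% occupancy: about 1.25 x data size
print(file_pages(100, 0.67))           # 67% occupancy: about 1.5 x data size
```

Even with these rough estimates the trend is visible: equality search drops from hundreds of I/Os to about ten when the file is sorted, while inserts go from constant cost to cost proportional to the file size.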
