File Organizations and Indexing

Lecture 5
R&G Chapter 8

"If you don't find it in the index, look very carefully through the entire catalogue."
-- Sears, Roebuck, and Co., Consumer's Guide, 1897

Administrivia

• Homework 1 assignment now available
– Due in 2 parts: Sunday 9/14, Thursday 9/18
– Have to make changes to Postgres
– Don't procrastinate

Review – The Big Picture

• Databases are cool
– Theory: data modelling, relational algebra
– Practical: storing data with disk, RAM
• Today:
– Practical: File Organization, Indexing
• Next Time:
– Theory: Relational Calculus

Review

• Basic operators for manipulating relations
– Selection ($\sigma$) Selects a subset of rows
– Projection ($\pi$) Retains only wanted columns
– Cross-product ($\times$) Combine two relations
– Set-difference ($-$) Tuples in r1, but not in r2
– Union ($\cup$) Tuples in r1 and/or in r2
– Rename ($\rho$) Change name
• Other Useful Operators
– Intersection ($\cap$) Tuples in both relations
– Join ($\bowtie$) Match common columns
– Division ($/$) Tuples in A that match all in B

Relational Algebra 2: Division

• Division A / B,
– columns in A must be a superset of columns in B
– let C_both be the columns in both A and B
– let C_A-Only be the columns that are only in A
– result has columns C_A-Only
– result has only rows where values in C_A-Only have a match with every value in B
– can be defined using other operators

Review: Examples of Division A/B

A:   sno  pno
     s1   p1
     s1   p2
     s1   p3
     s1   p4
     s2   p1
     s2   p2
     s3   p2
     s4   p2
     s4   p4

B1 (pno): p2
B2 (pno): p2, p4
B3 (pno): p1, p2, p4

A/B1 (sno): s1, s2, s3, s4
A/B2 (sno): s1, s4
A/B3 (sno): s1
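
The division examples can be checked with a short Python sketch. The function name `divide` and the set-of-tuples representation are assumptions made for illustration; the relation contents are the A, B1, B2, and B3 tables of the example.

```python
def divide(a, b):
    """Return A / B, where A is a set of (sno, pno) pairs and B a set of pno values.

    A supplier qualifies only if it is paired in A with every pno in B.
    """
    candidates = {sno for sno, _ in a}
    return {sno for sno in candidates if all((sno, pno) in a for pno in b)}


A = {
    ("s1", "p1"), ("s1", "p2"), ("s1", "p3"), ("s1", "p4"),
    ("s2", "p1"), ("s2", "p2"),
    ("s3", "p2"),
    ("s4", "p2"), ("s4", "p4"),
}
B1 = {"p2"}
B2 = {"p2", "p4"}
B3 = {"p1", "p2", "p4"}

print(sorted(divide(A, B1)))  # ['s1', 's2', 's3', 's4']
print(sorted(divide(A, B2)))  # ['s1', 's4']
print(sorted(divide(A, B3)))  # ['s1']
```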

Review: Find the names of sailors who have reserved all boats

Reserves
sid  bid  day
22   101  10/10/96
58   103  11/12/96

Sailors
sid  sname   rating  age
22   dustin  7       45.0
31   lubber  8       55.5
58   rusty   10      35.0

Boats
bid  bname      color
101  Interlake  Blue
102  Interlake  Red
103  Clipper    Green
104  Marine     Red

Find the names of sailors who've reserved all boats

• Uses division; schemas of the input relations to / must be carefully chosen:
  $\rho(\mathit{Tempsids},\ (\pi_{sid,bid}\,\mathit{Reserves}) / (\pi_{bid}\,\mathit{Boats}))$
  $\pi_{sname}(\mathit{Tempsids} \bowtie \mathit{Sailors})$
• To find sailors who've reserved all 'Interlake' boats:
  $\ldots\ /\ \pi_{bid}(\sigma_{bname='Interlake'}\,\mathit{Boats})$

Review – Buffer Management and Files

• Storage of Data
– Fields, either fixed or variable length...
– Stored in Records...
– Stored in Pages...
– Stored in Files
• If data won't fit in RAM, store on Disk
– Need Buffer Pool to hold pages in RAM
– Different strategies decide what to keep in pool

Today: File Organization

• How to keep pages of records on disk
• but must support operations:
– scan all records
– search for a record id "RID"
– search for record(s) with certain values
– insert new records
– delete old records

Alternative File Organizations

Many alternatives exist, tradeoffs for each:
– Heap files:
  • Suitable when typical access is file scan of all records.
– Sorted Files:
  • Best for retrieval in search key order
  • Also good for search based on search key
– Indexes: Organize records via trees or hashing.
  • Like sorted files, speed up searches for search key fields
  • Updates are much faster than in sorted files.

Indexes

• Sometimes need to retrieve records by the values in one or more fields, e.g.,
– Find all students in the "CS" department
– Find all students with a gpa > 3
• An index on a file is a:
– Disk-based data structure
– Speeds up selections on the search key fields for the index.
– Any subset of the fields of a relation can be the index search key
– Search key is not the same as key
  • (e.g. doesn't have to be unique ID).
• An index
– Contains a collection of data entries
– Supports efficient retrieval of all records with a given search key value k.

First Question to Ask About Indexes

• What kinds of selections do they support?
– Selections of form field <op> constant
– Equality selections (op is =)
– Range selections (op is one of <, >, <=, >=, BETWEEN)
– More exotic selections:
  • 2-dimensional ranges ("east of Berkeley and west of Truckee and North of Fresno and South of Eureka")
    – Or n-dimensional
  • 2-dimensional distances ("within 2 miles of Soda Hall")
    – Or n-dimensional
  • Ranking queries ("10 restaurants closest to VLSB")
  • Regular expression matches, genome string matches, etc.
  • One common n-dimensional index: R-tree
    – Supported in Oracle and Informix
    – See http://gist.cs.berkeley.edu for research on this topic

Index Classification

• What selections does it support
• Representation of data entries in index
– what info is the index storing? 3 alternatives:
  • Data record with key value k
  • <k, rid of data record with search key value k>
  • <k, list of rids of data records with search key value k>
• Clustered vs. Unclustered Indexes
• Single Key vs. Composite Indexes
• Tree-based, hash-based, other

Alternatives for Data Entry k* in Index

• Three alternatives:
  1. Actual data record (with key value k)
  2. <k, rid of matching data record>
  3. <k, list of rids of matching data records>
• Choice is orthogonal to the indexing technique.
– Indexing techniques: B+ trees, hash-tables, R trees, …
– Typically, index contains auxiliary information that directs searches to the desired data entries
• Can have multiple (different) indexes per file.
– E.g. file sorted by age, with a hash index on salary and a B+tree index on name.

Alternatives for Data Entries (Contd.)

• Alternative 1: Actual data record (with key value k)
– Index structure is file organization for data records (like Heap files or sorted files).
– At most one index on a table can use Alternative 1.
– Saves pointer lookups
– Can be expensive to maintain with insertions and deletions.

Alternatives for Data Entries (Contd.)

Alternative 2: <k, rid of matching data record>
and Alternative 3: <k, list of rids of matching data records>
– Easier to maintain than Alt 1.
– At most one index can use Alternative 1; any others must use Alternatives 2 or 3.
– Alternative 3 more compact than Alternative 2, but leads to variable sized data entries even if search keys are of fixed length.
– Even worse, for large rid lists the data entry might have to span multiple pages!

Index Classification

• Clustered vs. unclustered:
– If order of data records is the same as, or `close to', order of index data entries, then called clustered index.
• A file can be clustered on at most one search key.
• Cost to retrieve data records with index varies greatly based on whether index clustered or not!
• Alternative 1 implies clustered, but not vice-versa.
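
The three data-entry alternatives can be contrasted with a small Python sketch. The tiny `heap` file, its rids, and the function names are hypothetical, chosen only to show the shape of each kind of data entry for an index on age.

```python
from collections import defaultdict

# Hypothetical heap file: rid -> (name, age). The search key is age.
heap = {0: ("bob", 12), 1: ("cal", 11), 2: ("joe", 12), 3: ("sue", 13)}


def alt1(records):
    # Alternative 1: the data entries ARE the records, kept in key order.
    return sorted(records.values(), key=lambda rec: rec[1])


def alt2(records):
    # Alternative 2: one <key, rid> pair per record.
    return sorted((rec[1], rid) for rid, rec in records.items())


def alt3(records):
    # Alternative 3: one <key, rid-list> entry per distinct key value.
    rid_lists = defaultdict(list)
    for rid, rec in sorted(records.items()):
        rid_lists[rec[1]].append(rid)
    return sorted(rid_lists.items())


print(alt2(heap))  # [(11, 1), (12, 0), (12, 2), (13, 3)]
print(alt3(heap))  # [(11, [1]), (12, [0, 2]), (13, [3])]
```

Alternative 3 stores each key once, which is why it is more compact, but its entries vary in size with the number of matching rids.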

Clustered vs. Unclustered Index

• Suppose that Alternative (2) is used for data entries, and that the data records are stored in a Heap file.
– To build clustered index, first sort the Heap file (with some free space on each block for future inserts).
– Overflow blocks may be needed for inserts. (Thus, order of data recs is `close to', but not identical to, the sort order.)

[Figure: CLUSTERED vs. UNCLUSTERED. Index entries direct search for data entries; data entries (index file) point to data records (data file).]

Unclustered vs. Clustered Indexes

• What are the tradeoffs????
• Clustered Pros
– Efficient for range searches
– May be able to do some types of compression
– Possible locality benefits (related data?)
• Clustered Cons
– Expensive to maintain (on the fly or sloppy with reorganization)
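
The range-search advantage of clustering can be illustrated with a toy Python model. Everything here is an assumption for illustration: 1000 records with keys 0..999, 10 records per page, and a made-up scattering rule standing in for an unclustered heap file. The sketch counts how many distinct data pages a range query must touch.

```python
RECORDS_PER_PAGE = 10
NUM_RECORDS = 1000


def clustered_page(key):
    # Clustered: records are stored in key order, so neighbours share pages.
    return key // RECORDS_PER_PAGE


def unclustered_page(key):
    # Unclustered (hypothetical placement): 37 is coprime to 1000, so this
    # maps keys to slots one-to-one while scattering neighbouring keys.
    slot = (key * 37) % NUM_RECORDS
    return slot // RECORDS_PER_PAGE


def pages_touched(page_of, lo, hi):
    # Distinct data pages holding records with lo <= key <= hi.
    return len({page_of(k) for k in range(lo, hi + 1)})


clustered = pages_touched(clustered_page, 100, 199)
unclustered = pages_touched(unclustered_page, 100, 199)
print(clustered, unclustered)
```

With clustering, the 100 matching records sit on 10 consecutive pages; with the scattered layout the same query touches many more pages, and without buffering it could cost up to one I/O per matching record.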

B+ Tree Indexes

[Figure: non-leaf pages above a chained level of leaf pages.]

• Leaf pages contain data entries, and are chained (prev & next)
• Non-leaf pages contain index entries and direct searches:

  P0 | K1 | P1 | K2 | P2 | ... | Km | Pm

Hash-Based Indexes

• Good for equality selections.
• Index is a collection of buckets. Bucket = primary page plus zero or more overflow pages.
• Hashing function h:
– h(r) = bucket in which record r belongs.
– h looks at the search key fields of r.
• If Alternative (1) is used, the buckets contain the data records; otherwise, they contain <key, rid> or <key, rid-list> pairs.
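
A hash-based index using Alternative (2) can be sketched in a few lines of Python. The class name `HashIndex`, the bucket count, and the use of Python's built-in `hash` are assumptions for illustration; overflow pages are not modelled, so each bucket is simply a list.

```python
class HashIndex:
    """Toy hash index storing <key, rid> data entries (Alternative 2)."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def _h(self, key):
        # h looks only at the search key and picks the bucket.
        return hash(key) % len(self.buckets)

    def insert(self, key, rid):
        self.buckets[self._h(key)].append((key, rid))

    def lookup(self, key):
        # Equality search: examine a single bucket.
        return [rid for k, rid in self.buckets[self._h(key)] if k == key]


idx = HashIndex()
for rid, age in enumerate([12, 11, 12, 13]):
    idx.insert(age, rid)

print(idx.lookup(12))  # [0, 2]
print(idx.lookup(99))  # []
```

Because neighbouring key values land in unrelated buckets, this structure gives no help for range selections.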

Example B+ Tree

Root:      17   (Entries <= 17 to the left, Entries > 17 to the right)
Non-leaf:  5 13  |  27 30
Leaves:    2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*

• Find 28*? 29*? All > 15* and < 30*
• Insert/delete: Find data entry in leaf, then change it. Need to adjust parent sometimes.
– And change sometimes bubbles up the tree

Comparing File Organizations

• Heap files (random order; insert at eof)
• Sorted files, sorted on search key
• Clustered B+ tree file, Alternative (1), on search key
• Heap file with unclustered B+ tree index on search key
• Heap file with unclustered hash index on search key
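
The three searches posed for the example B+ tree (28*, 29*, and all entries > 15* and < 30*) can be traced with a minimal Python sketch. The dictionary layout and function names are assumptions, and the sketch sends a key equal to a separator to the right child, which matches where 5* and 27* sit in the example leaves.

```python
from bisect import bisect_right

# Leaf pages of the example tree, in chained order.
LEAVES = [[2, 3], [5, 7, 8], [14, 16], [22, 24], [27, 29], [33, 34, 38, 39]]

# Non-leaf pages: separator keys plus children (dict = non-leaf, int = leaf number).
TREE = {
    "keys": [17],
    "children": [
        {"keys": [5, 13], "children": [0, 1, 2]},
        {"keys": [27, 30], "children": [3, 4, 5]},
    ],
}


def find_leaf(key):
    # Descend from the root, choosing one child per level.
    node = TREE
    while isinstance(node, dict):
        node = node["children"][bisect_right(node["keys"], key)]
    return node


def lookup(key):
    return key in LEAVES[find_leaf(key)]


def range_scan(lo, hi):
    # All entries k with lo < k < hi: find the first leaf, then follow the chain.
    result = []
    for leaf in LEAVES[find_leaf(lo):]:
        for k in leaf:
            if k >= hi:
                return result
            if k > lo:
                result.append(k)
    return result


print(lookup(28), lookup(29))  # False True
print(range_scan(15, 30))      # [16, 22, 24, 27, 29]
```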

Operations to Compare

• Scan: Fetch all records from disk
• Fetch all records in sorted order
• Equality search
• Range selection
• Insert a record
• Delete a record

Cost Model for Analysis

I/O cost 150,000 times more than hash function
– We ignore CPU costs, for simplicity

B: The number of data pages
R: Number of records per page
F: Fanout of B-tree

Average-case analysis; based on several simplistic assumptions.

* Good enough to show the overall trends!

Assumptions in Our Analysis

• Heap Files:
– Equality selection on key; exactly one match.
• Sorted Files:
– Files compacted after deletions.
• Indexes:
– Alt (2), (3): data entry size = 10% size of record
– Hash: No overflow buckets.
  • 80% page occupancy => File size = 1.25 data size
– Tree: 67% occupancy (this is typical).
  • Implies file size = 1.5 data size

Heap File

I/O Cost of Operations (rows: Scan all records, Get all in sort order, Equality Search, Range Search, Insert, Delete)
B: Number of data pages (packed)
R: Number of records per page
S: Time required for equality search

Sorted File

I/O Cost of Operations (rows: Scan all records, Get all in sort order, Equality Search, Range Search, Insert, Delete)
B: Number of data pages (packed)
R: Number of records per page
S: Time required for equality search

Clustered Tree

I/O Cost of Operations (rows: Scan all records, Get all in sort order, Equality Search, Range Search, Insert, Delete)
B: Number of data pages (packed)
R: Number of records per page
F: Fanout of B-Tree
S: Time required for equality search

Unclustered Tree

I/O Cost of Operations (rows: Scan all records, Get all in sort order, Equality Search, Range Search, Insert, Delete)
B: Number of data pages (packed)
R: Number of records per page
F: Fanout of B-Tree
S: Time required for equality search

Hash Index

I/O Cost of Operations (rows: Scan all records, Get all in sort order, Equality Search, Range Search, Insert, Delete)
B: Number of data pages (packed)
R: Number of records per page
S: Time required for equality search

I/O Cost of Operations

B: The number of data pages
R: Number of records per page
F: Fanout of B-Tree
S: Time required for equality search

                       Heap File  Sorted File          Clustered Tree       Unclustered Tree       Hash Index
Scan all records       B          B                    1.5 B                1.5 B                  1.25 B
Get all in sort order  4B         B                    1.5 B                4B                     4B
Equality Search        0.5 B      log_2 B              log_F (1.5 B)        log_F (1.5 B) + 1      2
Range Search           B          S + #matching pages  S + #matching pages  S + #matching records  1.25 B
Insert                 2          S + B                S + 1                S + 2                  4
Delete                 0.5B + 1   S + B                0.5B + 1             S + 2                  S + 2

Index Selection Guidelines

• Attributes in WHERE clause are candidates for index keys.
– Exact match condition suggests hash index.
– Range query suggests tree index.
  • Clustering is especially useful for range queries; can also help on equality queries if there are many duplicates.
• Multi-attribute search keys should be considered when a WHERE clause contains several conditions.
– Order of attributes is important for range queries.
– Such indexes sometimes enable index-only strategies
  • For index-only strategies, clustering is not important!
• Choose indexes that benefit as many queries as possible.
• Since only one index can be clustered per table, choose it based on important queries that would benefit the most from clustering.
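
To get a feel for the Equality Search row of the cost table, the formulas can be evaluated for concrete numbers. The values B = 1000 pages and F = 100 are arbitrary assumptions, and the function name is hypothetical; the formulas themselves are taken from the table.

```python
import math


def equality_search_cost(B, F):
    """I/O cost of an equality search for each file organization.

    B = number of data pages, F = B-tree fanout.
    """
    tree = math.log(1.5 * B, F)  # log base F of the tree file size
    return {
        "heap": 0.5 * B,
        "sorted": math.log2(B),
        "clustered_tree": tree,
        "unclustered_tree": tree + 1,
        "hash": 2,
    }


costs = equality_search_cost(B=1000, F=100)
for name, cost in costs.items():
    print(f"{name:17s} {cost:8.2f}")
```

Under these assumed values the heap file needs about 500 page reads on average, the sorted file about 10, and the tree and hash organizations only a handful.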

Examples of Clustered Indexes

• B+ tree index on E.age can be used to get qualifying tuples.
    SELECT E.dno FROM Emp E WHERE E.age>40
– How selective is the condition?
– Is the index clustered?
• Consider the GROUP BY query.
    SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>10 GROUP BY E.dno
– If many tuples have E.age > 10, using E.age index and sorting the retrieved tuples may be costly.
– Clustered E.dno index may be better!
• Equality queries and duplicates:
    SELECT E.dno FROM Emp E WHERE E.hobby=Stamps
– Clustering on E.hobby helps!

Indexes with Composite Search Keys

• Composite Search Keys: Search on a combination of fields.
– Equality query: Every field value is equal to a constant value. E.g. wrt a composite index on age and sal:
  • age=20 and sal=75
– Range query: Some field value is not a constant. E.g.:
  • age=20; or age=20 and sal > 10
• Data entries in index sorted by search key to support range queries.
– Lexicographic order, or
– Spatial order.

Examples of composite key indexes using lexicographic order:

Data records sorted by name:
name  age  sal
bob   12   10
cal   11   80
joe   12   20
sue   13   75

Data entries in index sorted by <age,sal>: 11,80  12,10  12,20  13,75
Data entries in index sorted by <age>:     11  12  12  13
Data entries in index sorted by <sal,age>: 10,12  20,12  75,13  80,11
Data entries in index sorted by <sal>:     10  20  75  80
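
Lexicographic ordering of composite keys behaves like tuple ordering in Python, so the example data entries can be reproduced with a short sketch. The helper name `prefix_range` is hypothetical; the records are the four from the example.

```python
from bisect import bisect_left, bisect_right

# (name, age, sal) records from the example.
records = [("bob", 12, 10), ("cal", 11, 80), ("joe", 12, 20), ("sue", 13, 75)]

# Data entries for the two composite keys, in lexicographic order.
age_sal = sorted((age, sal) for _, age, sal in records)
sal_age = sorted((sal, age) for _, age, sal in records)


def prefix_range(entries, first):
    """Entries whose first field equals `first`.

    In a lexicographically sorted index these form one contiguous run,
    which is why field order matters for range queries.
    """
    lo = bisect_left(entries, (first,))
    hi = bisect_right(entries, (first, float("inf")))
    return entries[lo:hi]


print(age_sal)                    # [(11, 80), (12, 10), (12, 20), (13, 75)]
print(sal_age)                    # [(10, 12), (20, 12), (75, 13), (80, 11)]
print(prefix_range(age_sal, 12))  # [(12, 10), (12, 20)]
```

A condition on age alone maps to one contiguous run in the <age,sal> order, while in the <sal,age> order the matching entries are scattered.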

Composite Search Keys

• To retrieve Emp records with age=30 AND sal=4000, an index on <age,sal> would be better than an index on age or an index on sal.
– Choice of index key orthogonal to clustering etc.
• If condition is: 20<age<30 AND 3000<sal<5000:
– Clustered tree index on <age,sal> or <sal,age> is best.
• If condition is: age=30 AND 3000<sal<5000:
– Clustered <age,sal> index much better than <sal,age> index!
• Composite indexes are larger, updated more often.

Index-Only Plans

• A number of queries can be answered without retrieving any tuples from one or more of the relations involved if a suitable index is available.

  Index on <E.dno>:
    SELECT D.mgr FROM Dept D, Emp E WHERE D.dno=E.dno
  Tree index on <E.dno,E.eid>:
    SELECT D.mgr, E.eid FROM Dept D, Emp E WHERE D.dno=E.dno
  Index on <E.dno>:
    SELECT E.dno, COUNT(*) FROM Emp E GROUP BY E.dno
  Tree index:
    SELECT ... FROM Emp E GROUP BY E.dno
  Tree index on <E.age,E.sal> or <E.sal,E.age>:
    SELECT AVG(E.sal) FROM Emp E WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000
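
The idea of an index-only plan can be sketched for the `SELECT E.dno, COUNT(*) ... GROUP BY E.dno` query. The sample data entries and the function name below are made up for illustration, assuming an index on E.dno that stores <dno, rid> pairs; the point is that the answer is computed from the data entries alone and no rid is ever followed to a data record.

```python
from collections import Counter

# Hypothetical data entries of an index on E.dno, as <dno, rid> pairs.
dno_index = [(1, 0), (1, 4), (2, 1), (2, 2), (2, 5), (3, 3)]


def count_by_dno(entries):
    # Only the key part of each entry is read; the rid is ignored,
    # so the Emp records themselves are never fetched.
    return dict(Counter(dno for dno, _rid in entries))


print(count_by_dno(dno_index))  # {1: 2, 2: 3, 3: 1}
```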

Index-Only Plans (Contd.)

• Index-only plans are possible if the key is <dno,age> or we have a tree index with key <age,dno>
– Which is better?
– What if we consider the second query?

    SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age=30 GROUP BY E.dno

    SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>30 GROUP BY E.dno

Summary

• Alternative file organizations, tradeoffs for each
• If selection queries are frequent, sorting the file or building an index is important.
– Hash-based indexes only good for equality search.
– Sorted files and tree-based indexes best for range search; also good for equality search. (Files rarely kept sorted in practice; B+ tree index is better.)
• Index is a collection of data entries plus a way to quickly find entries with given key values.

Summary (Contd.)

• Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.
– Choice orthogonal to indexing technique used to locate data entries with a given key value.
• Can have several indexes on a given file of data records, each with a different search key.
• Indexes can be
– clustered, unclustered
– B-tree, hash table, etc.

Summary (Contd.)

• Understanding the nature of the workload for the application, and the performance goals, is essential
– What are the important queries and updates? What attributes/relations are involved?
• Indexes must be chosen to speed up important queries (and perhaps some updates!).
– Index maintenance overhead on updates to key fields.
– Choose indexes that can help many queries, if possible.
– Build indexes to support index-only strategies.
– Clustering is an important decision; only one index on a given relation can be clustered!
– Order of fields in composite index key can be important.