Architecture and Implementation of Database Systems
2013 – 2014
http://adrem.ua.ac.be/advanced-database-systems

Introduction
Preliminaries and Organizational Matters

Welcome all . . .

. . . to this course, whose lectures are primarily about digging in the mud of database system internals.

• While others talk about SQL and graphical query interfaces, we will
  1 learn how DBMSs can access files on hard disks without paying too much for I/O traffic,
  2 see how to organize data on disk and which kinds of “maps” over huge amounts of data we can use to avoid getting lost,
  3 assess what it means to sort/combine/filter data volumes that exceed main memory size by far, and
  4 learn how user queries are represented and executed inside the database kernel.

Architecture of a DBMS / Course Outline

[Figure: layered architecture of a DBMS. Web forms, applications, and the SQL interface issue SQL commands to the query processor (Parser, Optimizer, Operator Evaluator, Executor), which sits on top of Files and Access Methods, the Buffer Manager, and the Disk Space Manager; the Transaction Manager, Lock Manager, and Recovery Manager cut across these layers. Below the DBMS lies the database itself: data files, indices, . . . This course covers the lower layers. Figure inspired by Ramakrishnan/Gehrke: “Database Management Systems”, McGraw-Hill 2003.]

Reading Material

• Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. McGraw-Hill.
• . . . in fact, any book about advanced database topics and internals will do — pick your favorite.

Here and there, pointers to specific research papers will be given, and you are welcome to search for additional background reading. Use Google Scholar or similar search engines.

Exam: Open book; example exams are on the website.

These Slides . . .

• . . . are prepared/updated throughout the semester — watch out for bugs and please let me know. Thanks.
• Posted to the course web home on the day before the lecture.

Example
  Open Issues/Questions — take notes.
  Code Snippets, Algorithms
  IBM DB2 Specifics — if possible and insightful, discuss how IBM DB2 does things.
  PostgreSQL Specifics — ditto, but related to the glorious PostgreSQL system.

Before We Begin

Questions? Comments? Suggestions?

Chapter: Storage
Disks, Buffer Manager, Files . . .

Outline:
• Magnetic Disks — Access Time; Sequential vs. Random Access; I/O Parallelism; RAID Levels 1, 0, and 5
• Alternative Storage Techniques — Solid-State Disks; Network-Based Storage
• Managing Space — Free Space Management
• Buffer Manager — Pinning and Unpinning; Replacement Policies
• Databases vs. Operating Systems
• Files and Records — Heap Files; Free Space Management Inside a Page; Alternative Page Layouts
• Recap

Database Architecture

[Figure: the DBMS architecture overview from the Introduction again. This chapter zooms in on the bottom layers of the stack: the Disk Space Manager, the Buffer Manager, and Files and Access Methods, operating over the database (data files, indices, . . . ).]

The Memory Hierarchy

                        capacity          latency
  CPU (with registers)  bytes             < 1 ns
  caches                kilo-/megabytes   < 10 ns
  main memory           gigabytes         70–100 ns
  hard disks            terabytes         3–10 ms
  tape library          petabytes         varies

• Fast—but expensive and small—memory close to the CPU
• Larger, slower memory at the periphery
• DBMSs try to hide latency by using the fast memory as a cache.

Magnetic Disks

[Figure: disk assembly. Steadily rotating platters carry concentric tracks divided into sectors; an arm positions the read/write heads; a block spans one or more sectors. Photo: http://www.metallurgy.utah.edu/]

• A stepper motor positions an array of disk heads on the requested track.
• Platters (disks) steadily rotate.
• Disks are managed in blocks: the system reads/writes data one block at a time.

Access Time

Data blocks can only be read and written if disk heads and platters are positioned accordingly.
• This design has implications on the access time to read/write a given block:

Definition (Access Time)
1 Move the disk arm to the desired track (seek time t_s).
2 The disk controller waits for the desired block to rotate under the disk head (rotational delay t_r).
3 Read/write the data (transfer time t_tr).

⇒ access time: t = t_s + t_r + t_tr

Example: Seagate Cheetah 15K.7 (600 GB, server-class drive)

• Seagate Cheetah 15K.7 performance characteristics:
  • 4 disks, 8 heads, avg. 512 kB/track, 600 GB capacity
  • rotational speed: 15 000 rpm (revolutions per minute)
  • average seek time: 3.4 ms
  • transfer rate ≈ 163 MB/s

What is the access time to read an 8 kB data block?
  average seek time:         t_s  = 3.40 ms
  average rotational delay:  t_r  = 1/2 · 1/(15 000 min⁻¹) = 2.00 ms
  transfer time for 8 kB:    t_tr = 8 kB / (163 MB/s) ≈ 0.05 ms
  ⇒ access time for an 8 kB data block: t = 5.45 ms

Sequential vs. Random Access

Example (Read 1 000 blocks of size 8 kB)
• random access:
    t_rnd = 1 000 · 5.45 ms = 5.45 s
• sequential read of adjacent blocks:
    t_seq = t_s + t_r + 1 000 · t_tr + 16 · t_s,track-to-track
          = 3.40 ms + 2.00 ms + 50 ms + 3.2 ms ≈ 58.6 ms
  (The Seagate Cheetah 15K.7 stores an average of 512 kB per track, with a 0.2 ms track-to-track seek time; our 1 000 blocks of 8 kB are spread across 16 tracks.)

⇒ Sequential I/O is much faster than random I/O.
⇒ Avoid random I/O whenever possible.
⇒ As soon as we need at least 58.6 ms / 5 450 ms ≈ 1.07 % of a file, we had better read the entire file sequentially.
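To make the arithmetic of the last two slides concrete, here is a small Python sketch (our own; the constants are the drive figures quoted on the slides, and the function names are invented for illustration). It recomputes the single-block access time and the 1 000-block comparison:

  # Recompute the Seagate Cheetah 15K.7 numbers; all times in milliseconds.
  from math import ceil

  SEEK_MS          = 3.40                   # average seek time t_s
  ROT_DELAY_MS     = 0.5 * 60_000 / 15_000  # t_r: half a revolution at 15 000 rpm = 2.00 ms
  TRANSFER_MS_8KB  = 8 / 163_000 * 1000     # t_tr: 8 kB at 163 MB/s = ~0.05 ms
  TRACK_SWITCH_MS  = 0.2                    # track-to-track seek
  BLOCKS_PER_TRACK = 512 // 8               # 512 kB track / 8 kB block = 64

  def access_time_ms():
      # t = t_s + t_r + t_tr for one random 8 kB block
      return SEEK_MS + ROT_DELAY_MS + TRANSFER_MS_8KB

  def random_read_ms(n):
      return n * access_time_ms()           # every block pays the full access time

  def sequential_read_ms(n):
      # one initial seek + rotational delay, then pure transfer, plus one
      # track-to-track seek per track boundary crossed
      return (SEEK_MS + ROT_DELAY_MS + n * TRANSFER_MS_8KB
              + ceil(n / BLOCKS_PER_TRACK) * TRACK_SWITCH_MS)

  print(access_time_ms())          # ~5.45 ms
  print(random_read_ms(1000))      # ~5450 ms = 5.45 s
  print(sequential_read_ms(1000))  # ~58 ms (the slide rounds t_tr to 0.05 ms and gets 58.6 ms)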

Performance Tricks

• Disk manufacturers play a number of tricks to improve performance:

track skewing
  Align sector 0 of each track to avoid rotational delay during longer sequential scans.

request scheduling
  If multiple requests have to be served, choose the one that requires the smallest arm movement (SPTF: shortest positioning time first; elevator algorithms).

zoning
  Outer tracks are longer than the inner ones. Therefore, divide outer tracks into more sectors than inner tracks.

Evolution of Hard Disk Technology

Disk seek and rotational latencies have only marginally improved over the last years (≈ 10 % per year).

But:
• Throughput (i.e., transfer rates) improves by ≈ 50 % per year.
• Hard disk capacity grows by ≈ 50 % every year.

Therefore:
• Random access cost hurts even more as time progresses.

Example (5 Years Ago: Seagate Barracuda 7200.7)
Read 1 000 blocks of 8 kB sequentially/randomly: 397 ms / 12 800 ms.

Ways to Improve I/O Performance

The latency penalty is hard to avoid.

But:
• Throughput can be increased rather easily by exploiting parallelism.
• Idea: Use multiple disks and access them in parallel, trying to hide latency.

TPC-C: an industry benchmark for OLTP
A recent #1 system (IBM DB2 9.5 on AIX) uses
• 10,992 disk drives (73.4 GB each, 15,000 rpm) (!), plus 8 internal 146.8 GB SCSI drives,
• connected with 68 4-Gbit fibre channel adapters,
• yielding 6 million transactions per minute.

Disk Mirroring

• Replicate data onto multiple disks:

[Figure: the same block sequence 1 2 3 4 5 6 7 8 9 . . . is written identically to each of three disks.]

• Achieves I/O parallelism only for reads.
• Improved failure tolerance—can survive one disk failure.
• This is also known as RAID 1 (mirroring without parity).
  (RAID: Redundant Array of Inexpensive Disks)

Disk Striping

• Distribute data over disks:

[Figure: blocks 1 2 3 4 5 6 7 8 9 . . . are striped round-robin across three disks: disk 1 holds 1, 4, 7, . . . ; disk 2 holds 2, 5, 8, . . . ; disk 3 holds 3, 6, 9, . . . ]

• Full I/O parallelism for read and write operations.
• High failure risk (here: 3 times the risk of a single disk failure)!
• Also known as RAID 0 (striping without parity).

Disk Striping with Parity

• Distribute data and parity information over ≥ 3 disks:

[Figure: blocks 1 2 3 4 5 6 7 8 9 . . . striped across three disks, with a parity block for each stripe (e.g., 1/3, 5/7, 2/4, 6/8) rotating among the disks, so that data and its associated parity never reside on the same disk.]

• High I/O parallelism.
• Fault tolerance: any one disk may fail without data loss (with dual parity/RAID 6: two disks may fail).
• Parity (e.g., XOR) information is distributed over the disks, separating data and its associated parity.
• Also known as RAID 5 (striping with distributed parity).
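The XOR parity idea is easy to demonstrate in a few lines of Python. This is a toy sketch, not a real RAID implementation; the block contents and function names are made up for illustration:

  # Bytewise XOR parity: the parity block is the XOR of all data blocks,
  # so any single lost block can be recomputed from the survivors.

  def parity(blocks):
      # XOR blocks of equal length into one parity block
      p = bytearray(len(blocks[0]))
      for block in blocks:
          for i, byte in enumerate(block):
              p[i] ^= byte
      return bytes(p)

  def reconstruct(surviving_blocks, parity_block):
      # the missing block is the XOR of the parity and the survivors
      return parity(list(surviving_blocks) + [parity_block])

  stripe = [b"4711", b"1723", b"6381"]   # one block per data disk
  p = parity(stripe)

  # the disk holding stripe[1] fails; its block is recomputable:
  assert reconstruct([stripe[0], stripe[2]], p) == stripe[1]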

Solid-State Disks

Solid-state disks (SSDs) have emerged as an alternative to conventional hard disks.

• SSDs provide very low-latency random read access (< 0.01 ms).
• Random writes, however, are significantly slower than on traditional magnetic drives:
  1 (Blocks of) pages have to be erased before they can be updated.
  2 Once pages have been erased, sequentially writing them is almost as fast as reading.

[Figure: read/write times of a flash disk vs. a magnetic disk — flash random reads are far faster, flash random writes far slower.]

[Graph: Samsung 32 GB flash disk; 4096 bytes read/written randomly. Source: Koltsidas and Viglas. Flashing up the Storage Layer. VLDB 2008.]

SSDs: Page-Level Writes, Block-Level Deletes

• Typical page size: 128 kB
• SSDs erase blocks of pages: block ≈ 64 pages (8 MB)

Example (Perform a block-level delete to accommodate new data pages)
[Illustration taken from arstechnica.com (“The SSD Revolution”).]

Example: Seagate Pulsar.2 (800 GB, server-class solid-state drive)

• Seagate Pulsar.2 performance characteristics:
  • NAND flash memory, 800 GB capacity
  • standard 2.5″ enclosure, no moving/rotating parts
  • data read/written in pages of 128 kB size
  • transfer rate ≈ 370 MB/s

What is the access time to read an 8 kB data block?
  no seek time:                      t_s  = 0.00 ms
  no rotational delay:               t_r  = 0.00 ms
  transfer time (one 128 kB page):   t_tr = 128 kB / (370 MB/s) ≈ 0.30 ms
  ⇒ access time for an 8 kB data block: t = 0.30 ms

Sequential vs. Random Access with SSDs

Example (Read 1 000 blocks of size 8 kB)
• random access:
    t_rnd = 1 000 · 0.30 ms = 0.3 s
• sequential read of adjacent pages:
    t_seq = (1 000 · 8 kB / 128 kB) · t_tr ≈ 18.9 ms
  (The Seagate Pulsar.2 (sequentially) reads data in 128 kB chunks.)

⇒ Sequential I/O still beats random I/O (but random I/O is more feasible again).
• Adapting database technology to these characteristics is a current research topic.

Network-Based Storage

Today the network is not a bottleneck any more:

  Storage media/interface    Transfer rate
  Hard disk                  100–200 MB/s
  Serial ATA                 375 MB/s
  Ultra-640 SCSI             640 MB/s
  10-Gbit Ethernet           1,250 MB/s
  Infiniband QDR             12,000 MB/s

For comparison (RAM):
  PC2-5300 DDR2-SDRAM        10.6 GB/s
  PC3-12800 DDR3-SDRAM       25.6 GB/s

⇒ Why not use the network for database storage?

Storage Area Network (SAN)

• Block-based network access to storage:
  • SANs emulate the interface of block-structured disks (“read block #42 of disk 10”).
  • This is unlike network file systems (e.g., NFS, CIFS).
• SAN storage devices typically abstract from RAID or physical disks and present logical drives to the DBMS.
  • Hardware acceleration and simplified maintainability.
• Typical setup: a local area network with multiple participating servers and storage resources.
  • Failure tolerance and increased flexibility.

Grid or Cloud Storage

Internet-scale enterprises employ clusters of 1000s of commodity PCs (e.g., Amazon, Google, Yahoo!):
• trade off system cost ↔ reliability and performance,
• use massive replication for data storage.

Spare CPU cycles and disk space are sold as a service:

Amazon’s Elastic Compute Cloud (EC2)
  Use a Linux-based compute cluster by the hour (∼ 10 ¢/h).

Amazon’s Simple Storage Service (S3)
  “Infinite” store for objects between 1 B and 5 TB in size, organized in a map data structure (key ↦ object).
  • Latency: 100 ms to 1 s (not impacted by load)
  • Pricing ≈ disk drives (but additional cost for access)

⇒ Building a database on S3? (Brantner et al., SIGMOD 2008)

Managing Space

[Figure: DBMS architecture overview — the Disk Space Manager at the bottom of the stack is the topic of this section.]

Managing Space

Definition (Disk Space Manager)
• Abstracts from the gory details of the underlying storage (the disk space manager talks to the I/O controller and initiates I/O).
• The DBMS issues allocate/deallocate and read/write commands to it.
• Provides the concept of a page (typically 4–64 kB) as the unit of storage to the remaining system components.
• Maintains a locality-preserving mapping

    page number ↦ physical location,

  where a physical location could be, e.g.,
  • an OS file name and an offset within that file,
  • head, sector, and track of a hard drive, or
  • tape number and offset for data stored in a tape library.

Empty Pages

The disk space manager also keeps track of used/free blocks (deallocation and subsequent allocation may create holes):

1 Maintain a linked list of free pages.
  • When a page is no longer needed, add it to the list.
2 Maintain a bitmap, reserving one bit for each page.
  • Toggle bit n when page n is (de-)allocated.

Allocation of contiguous pages
To exploit sequential access, it is useful to allocate contiguous sequences of pages. Which of the two techniques would you choose to support this?
This is a lot easier to do with a free-page bitmap (option 2): a run of contiguous free pages shows up as a run of zero bits, as the sketch below illustrates.
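A minimal sketch of that bitmap approach (our own illustration; the FreeSpaceBitmap class and its method names are invented): finding a run of n free pages is a linear scan for n consecutive zero bits, something a free list cannot offer.

  # Free-space bitmap: bit n is 1 iff page n is allocated.
  # Contiguous allocation = find a run of n zero bits.

  class FreeSpaceBitmap:
      def __init__(self, num_pages):
          self.bits = [0] * num_pages

      def set(self, page, allocated):
          self.bits[page] = 1 if allocated else 0   # toggle on (de)allocation

      def allocate_contiguous(self, n):
          # return the first page of a run of n free pages, or None
          run_start, run_len = 0, 0
          for page, bit in enumerate(self.bits):
              if bit == 0:
                  if run_len == 0:
                      run_start = page
                  run_len += 1
                  if run_len == n:
                      for p in range(run_start, run_start + n):
                          self.bits[p] = 1          # mark the run allocated
                      return run_start
              else:
                  run_len = 0
          return None                               # no sufficiently large hole

  bm = FreeSpaceBitmap(8)
  bm.set(2, True)                     # page 2 already in use
  print(bm.allocate_contiguous(3))    # -> 3 (pages 3,4,5; the hole 0..1 is too small)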

Buffer Manager

[Figure: DBMS architecture overview — now the Buffer Manager layer is in focus.]

Buffer Manager

Definition (Buffer Manager)
• Mediates between external storage and main memory.
• Manages a designated main memory area, the buffer pool, for this task.

[Figure: page requests are served from the buffer pool in main memory; disk pages are brought into free frames as needed.]

• Disk pages are brought into memory as needed and loaded into memory frames.
• A replacement policy decides which page to evict when the buffer is full.

Interface to the Buffer Manager

Higher-level code requests (pins) pages from the buffer manager and releases (unpins) pages after use.

pin (pageno)
  Request page number pageno from the buffer manager, loading it into memory if necessary; mark the page as clean (¬ dirty). Returns a reference to the frame containing pageno.

unpin (pageno, dirty)
  Release page number pageno, making it a candidate for eviction. Must set dirty = true if the page was modified.

Why do we need the dirty bit?
Only modified pages need to be written back to disk upon eviction.

Proper pin ()/unpin () Nesting

• Any database transaction is required to properly “bracket” any page operation using pin () and unpin () calls:

  A read-only transaction
    a ← pin (p);
    ...
    read data on the page at memory address a;
    ...
    unpin (p, false);

• Proper bracketing enables the system to keep a count of the active users (e.g., transactions) of a page.

Implementation of pin ()

Function pin(pageno)
1   if the buffer pool already contains pageno then
2       pinCount (pageno) ← pinCount (pageno) + 1 ;
3       return address of frame holding pageno ;
4   else
5       select a victim frame v using the replacement policy ;
6       if dirty (page in v) then
7           write page in v to disk ;
8       read page pageno from disk into frame v ;
9       pinCount (pageno) ← 1 ;
10      dirty (pageno) ← false ;
11      return address of frame v ;

Implementation of unpin ()

Function unpin(pageno, dirty)
1   pinCount (pageno) ← pinCount (pageno) − 1 ;
2   dirty (pageno) ← dirty (pageno) ∨ dirty ;

Why don’t we write pages back to disk during unpin ()?
Well, we could . . .
+ recovery from failure would be a lot simpler
– higher I/O cost (every page write implies a write to disk)
– bad response time for writing transactions
This discussion is also known as force (or write-through) vs. write-back. Actual database systems typically implement write-back.
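The two functions above translate almost one-to-one into executable code. The following Python sketch is our own illustration (a real buffer manager would add latching, real I/O, and error handling; the disk object and the pick_victim callback are assumed interfaces, not any particular system's API):

  class BufferManager:
      def __init__(self, disk, num_frames, pick_victim):
          self.disk = disk                 # exposes read_page / write_page
          self.pick_victim = pick_victim   # chooses among unpinned page numbers
          self.capacity = num_frames
          self.frames = {}                 # pageno -> [data, pin_count, dirty]

      def pin(self, pageno):
          if pageno in self.frames:                        # lines 1-3 above
              self.frames[pageno][1] += 1
          else:                                            # lines 4-11 above
              if len(self.frames) == self.capacity:
                  victim = self.pick_victim(
                      [p for p, f in self.frames.items() if f[1] == 0])
                  data, _, dirty = self.frames.pop(victim)
                  if dirty:                                # write back only if dirty
                      self.disk.write_page(victim, data)
              self.frames[pageno] = [self.disk.read_page(pageno), 1, False]
          return self.frames[pageno][0]                    # frame holding the page

      def unpin(self, pageno, dirty):
          frame = self.frames[pageno]
          frame[1] -= 1                                    # pinCount - 1
          frame[2] = frame[2] or dirty                     # dirty stays sticky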

Concurrent Writes?

Conflicting changes to a block
Assume the following:
1 The same page p is requested by more than one transaction (i.e., pinCount (p) > 1), and
2 . . . those transactions perform conflicting writes on p.

Conflicts of this kind are resolved by the system’s concurrency control, a layer on top of the buffer manager.

The buffer manager may assume that everything is in order whenever it receives an unpin (p, true) call.

Replacement Policies

The effectiveness of the buffer manager’s caching functionality can depend on the replacement policy it uses, e.g.,

Least Recently Used (LRU)
  Evict the page whose latest unpin () is longest ago.
LRU-k
  Like LRU, but considers the k latest unpin () calls, not just the latest.
Most Recently Used (MRU)
  Evict the page that has been unpinned most recently.
Random
  Pick a victim randomly.

Rationales behind each of these policies?

Example Policy: Clock (“Second Chance”)

• Simulate an LRU policy with less overhead (no LRU queue reorganization on every frame reference):

Clock (“Second Chance”)
1 Number the N frames in the buffer pool 0, . . . , N − 1; initialize current ← 0; maintain a bit array referenced[0, . . . , N − 1], initialized to all 0.
2 In pin(p), assign referenced[p] ← 1.
3 To find the next victim, consider frame current; if referenced[current] = 0, current is the victim; otherwise, referenced[current] ← 0, current ← current + 1 mod N, repeat 3.
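A compact sketch of step 3 (our own code; pin counts are ignored for brevity): the referenced bits grant every frame a second chance before it becomes the victim.

  class Clock:
      def __init__(self, num_frames):
          self.n = num_frames
          self.referenced = [0] * num_frames
          self.hand = 0                        # "current" in the algorithm above

      def touch(self, frame):
          self.referenced[frame] = 1           # done on every pin()

      def pick_victim(self):
          while True:
              if self.referenced[self.hand] == 0:
                  victim = self.hand           # bit is 0: no second chance left
                  self.hand = (self.hand + 1) % self.n
                  return victim
              self.referenced[self.hand] = 0   # spend the second chance
              self.hand = (self.hand + 1) % self.n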

Heuristic Policies Can Fail

The mentioned policies, including LRU, are heuristics only and may fail miserably in certain scenarios.

Example (A Challenge for LRU)
A number of transactions want to scan the same sequence of pages (consider a repeated SELECT * FROM R). Assume a buffer pool capacity of 10 pages.
1 Let the size of relation R be 10 or fewer pages. How many I/O operations (actual disk page reads) do you expect?
2 Now grow R by one page. How about the number of I/O operations in this case?
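The thought experiment can be played through in a few lines. This simulation is our own sketch; the misses() helper and its parameters are invented for illustration:

  # Sequential flooding, simulated: repeated scans of relation R against an
  # LRU-managed pool of 10 frames.
  from collections import OrderedDict

  def misses(relation_pages, pool_size=10, scans=5):
      lru, n_miss = OrderedDict(), 0
      for _ in range(scans):
          for page in range(relation_pages):
              if page in lru:
                  lru.move_to_end(page)          # hit: page becomes most recent
              else:
                  n_miss += 1
                  if len(lru) == pool_size:
                      lru.popitem(last=False)    # evict least recently used
                  lru[page] = True
      return n_miss

  print(misses(10))   # 10 -> only the first scan goes to disk
  print(misses(11))   # 55 -> every access misses: 5 scans x 11 pages

With 10 frames, a 10-page relation misses only during the first scan; the 11-page relation misses on every single access, because LRU always evicts exactly the page that the scan will need next.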

More Challenges for LRU

1 Transaction T1 repeatedly accesses a fixed set of pages; transaction T2 performs a sequential scan of the database pages.
2 Assume a B+-tree-indexed relation R. R occupies 10,000 data pages R_i; the B+-tree occupies one root node and 100 index leaf nodes I_k. Transactions perform repeated random index key lookups on R ⇒ page access pattern (ignoring the B+-tree root node):

    I1, R1, I2, R2, I3, R3, . . .

How will LRU perform in this case?
With LRU, 50 % of the buffered pages will be pages of R. However, the probability of re-accessing page R_i is only 1/10,000.

Buffer Management in Practice

Prefetching
  Buffer managers try to anticipate page requests to overlap CPU and I/O operations:
  • Speculative prefetching: Assume a sequential scan and automatically read ahead.
  • Prefetch lists: Some database algorithms can instruct the buffer manager with a list of pages to prefetch.

Page fixing/hating
  Higher-level code may request to fix a page if it may be useful in the near future (e.g., nested-loop join). Likewise, an operator that hates a page will not access it any time soon (e.g., table pages in a sequential scan).

Partitioned buffer pools
  E.g., maintain separate pools for indexes and tables.

Databases vs. Operating Systems

Wait! Didn’t we just re-invent the operating system?

Yes,
• disk space management and buffer management very much look like file management and virtual memory in OSs.

But,
• a DBMS may be much more aware of the access patterns of certain operators (prefetching, page fixing/hating),
• concurrency control often calls for a prescribed order of write operations,
• technical reasons may make OS tools unsuitable for a database (e.g., file size limitations, platform independence).

Databases vs. Operating Systems

In fact, databases and operating systems sometimes interfere:
• The operating system and the buffer manager effectively buffer the same data twice.
• Things get really bad if parts of the DBMS buffer get swapped out to disk by the OS VM manager.
• Therefore, database systems try to turn off certain OS features or services.

⇒ Raw disk access instead of OS files.

Files and Records

[Figure: DBMS architecture overview — we now move up one layer, to Files and Access Methods.]

Database Files

• So far we have talked about pages. Their management is oblivious with respect to their actual content.
• On the conceptual level, a DBMS primarily manages tables of tuples and indexes.
• Such tables are implemented as files of records:
  • a file consists of one or more pages,
  • each page contains one or more records,
  • each record corresponds to one tuple.

[Figure: file 0 spans pages 0–3 and file 1 spans pages 4–7; pages hold records plus remaining free space.]

Database Heap Files

The most important type of file in a database is the heap file. It stores records in no particular order (in line with, e.g., the SQL semantics):

Typical heap file interface
• create/destroy heap file f named n:
    f = createFile(n), deleteFile(n)
• insert record r and return its rid:
    rid = insertRecord(f,r)
• delete a record with a given rid:
    deleteRecord(f,rid)
• get a record with a given rid:
    r = getRecord(f,rid)
• initiate a sequential scan over the whole heap file:
    openScan(f)

N.B. Record ids (rids) are used like record addresses (or pointers). The heap file structure maps a given rid to the page containing the record.
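A Python sketch of this interface (the operation names follow the slide; the in-memory page list is our own stand-in for real disk pages obtained via the buffer manager):

  class HeapFile:
      def __init__(self, name, page_capacity=4):
          self.name, self.page_capacity = name, page_capacity
          self.pages = [[]]                       # list of pages; page = list of records

      def insertRecord(self, r):
          for pageno, page in enumerate(self.pages):
              if len(page) < self.page_capacity:  # first page with free space
                  page.append(r)
                  return (pageno, len(page) - 1)  # rid = <pageno, slotno>
          self.pages.append([r])                  # no space anywhere: new page
          return (len(self.pages) - 1, 0)

      def getRecord(self, rid):
          pageno, slotno = rid                    # the rid maps directly to the page
          return self.pages[pageno][slotno]

      def openScan(self):
          for page in self.pages:                 # sequential scan, no order implied
              yield from page

  f = HeapFile("R")
  rid = f.insertRecord((4711, "John", "M"))
  assert f.getRecord(rid) == (4711, "John", "M")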

Heap Files

(Doubly) linked list of pages: a header page is allocated when createFile(n) is called—initially both page lists are empty:

[Figure: the header page links to a list of data pages with free space and a list of full data pages.]

+ easy to implement
– most pages will end up in the free-page list
– might have to search many pages to place a (large) record

Heap Files

Operation insertRecord(f,r) for a linked list of pages:
1 Try to find a page p in the free list with free space > |r|; should this fail, ask the disk space manager to allocate a new page p.
2 Write record r to page p.
3 Since, generally, |r| ≪ |p|, p will belong to the list of pages with free space.
4 A unique rid for r is generated and returned to the caller.

Generating sensible record ids (rids)
Given that rids are used like record addresses: what would be a feasible rid generation method?
Generate a composite rid consisting of the address of page p and the placement (offset/slot) of r inside p:

    ⟨pageno p, slotno r⟩

Heap Files

Directory of pages:

[Figure: a directory whose entries point to the data pages, one entry per page.]

• Use the directory as a space map with information about the free space on each page.
• The entry granularity is a trade-off space ↔ accuracy (entries may range from a single open/closed bit to exact free-space information).

+ free space search more efficient
– memory overhead to host the page directory

Free Space Management

Which page to pick for the insertion of a new record?

Append Only
  Always insert into the last page. Otherwise, create a new page.
Best Fit
  Reduces fragmentation, but requires searching the entire free list/space map for each insert.
First Fit
  Search from the beginning, take the first page with sufficient space. (⇒ These pages quickly fill up; the system may waste a lot of search effort on these first pages later on.)
Next Fit
  Maintain a cursor and continue searching where the search stopped last time.

Free Space Witnesses

We can accelerate the search by remembering witnesses:
• Classify pages into buckets, e.g., “75 %–100 % full”, “50 %–75 % full”, “25 %–50 % full”, and “0 %–25 % full”.
• For each bucket, remember some witness pages.
• Do a regular best/first/next fit search only if no witness is recorded for the specific bucket.
• Populate the witness information, e.g., as a side effect when searching for a best/first/next fit page.

Inside a Page — Fixed-Length Records

Now turn to the internal page structure:

[Figure: tuples of table R (ID, NAME, SEX) = (4711, John, M), (1723, Marc, M), (6381, Betty, F) stored as fixed-length records in consecutive slots; the page header holds the record count (3) and a slot directory bitmap (e.g., 1 0 1 after a deletion).]

• Record identifier (rid): ⟨pageno, slotno⟩
• Record position (within the page): slotno × bytes per slot
• Record deletion?
  • The record id should not change.
  ⇒ keep a slot directory (bitmap) in the page header that marks live slots.
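The slot arithmetic is small enough to spell out. A minimal sketch, assuming fixed 16-byte slots and a bitmap slot directory (the class and its names are our own invention):

  # Fixed-length record page: slot i lives at offset i * slot_size, and a
  # bitmap marks live slots. Deleting clears a bit; all other rids
  # <pageno, slotno> stay valid because records never move.

  class FixedLengthPage:
      def __init__(self, page_size, slot_size):
          self.slot_size = slot_size
          self.num_slots = page_size // slot_size
          self.bitmap = [0] * self.num_slots        # slot directory
          self.data = bytearray(page_size)

      def insert(self, record):
          assert len(record) == self.slot_size
          slotno = self.bitmap.index(0)             # first free slot (raises if full)
          off = slotno * self.slot_size             # record position
          self.data[off:off + self.slot_size] = record
          self.bitmap[slotno] = 1
          return slotno

      def delete(self, slotno):
          self.bitmap[slotno] = 0                   # rid of other records unchanged

      def get(self, slotno):
          assert self.bitmap[slotno]
          off = slotno * self.slot_size
          return bytes(self.data[off:off + self.slot_size])

  page = FixedLengthPage(page_size=64, slot_size=16)
  s0 = page.insert(b"4711 John      M")             # 16 bytes each
  s1 = page.insert(b"1723 Marc      M")
  page.delete(s0)                                   # bitmap: 0 1 ...; s1 still valid
  assert page.get(s1).startswith(b"1723")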

Inside a Page — Variable-Sized Fields

• Variable-sized fields are moved to the end of each record.
• A placeholder points to their location. (Why?)
• The slot directory points to the start of each record.
• Records can move on the page, e.g., if a field size changes or the page is compacted.
• Create a “forward address” if a record won’t fit on its page anymore. (What about future updates?)
• Related issue: space-efficient representation of NULL values (⊥).

[Figure: records (4711, M, John), (1723, M, Marc), (6381, F, Betty) with their variable-sized name fields placed at the record ends; the page header holds the record count (3) and a slot directory of pointers to the record starts; a record that outgrew its page leaves a forward address behind.]

IBM DB2 Data Pages

Data page and layout details
• Support for 4 K, 8 K, 16 K, and 32 K data pages in separate table spaces. Buffer manager pages match in size.
• 68 bytes of database manager overhead per page. On a 4 K page: a maximum of 4,028 bytes of user data (maximum record size: 4,005 bytes). Records do not span pages.
• Maximum table size: 512 GB (with 32 K pages). Maximum number of columns: 1,012 (4 K page: 500); maximum number of rows per page: 255. (What does this imply for the IBM DB2 RID format?)
• Columns of type LONG VARCHAR, CLOB, etc. are maintained outside the regular data pages (the pages contain descriptors only).
• Free space management: first-fit order. The free space map is distributed over every 500th page, in FSCRs (free space control records). Records are updated in place if possible; otherwise forward records are used.

[Figure: DB2 data page layout, taken directly from the DB2 V9.5 Information Center — http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/]

Alternative Page Layouts

We have just populated data pages in a row-wise fashion:

[Figure: tuples (a_i, b_i, c_i, d_i), i = 1, . . . , 4, stored row-wise — page 0 holds the complete tuples 1 and 2, page 1 holds tuples 3 and 4.]

We could as well do that column-wise:

[Figure: the same table stored column-wise — one page sequence per attribute: the a-values a_1 . . . a_4 together, the b-values b_1 . . . b_4 together, and so on.]

Alternative Page Layouts

These two approaches are also known as NSM (n-ary storage model) and DSM (decomposition storage model).¹
• A tuning knob for certain workload types (e.g., OLAP).
• DSM is suitable for narrow projections and in-memory database systems (see Database Systems and Modern CPU Architecture).
• Different behavior with respect to compression.

A hybrid approach is the PAX (Partition Attributes Across) layout:
• Divide each page into minipages.
• Group the attributes into them.
(Ailamaki et al. Weaving Relations for Cache Performance. VLDB 2001.)
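The row-wise/column-wise distinction is easy to see in code. A toy sketch (ours; page capacities and names are invented) of the two layouts, and of why DSM wins for a narrow projection:

  # NSM vs. DSM on four tuples (a_i, b_i, c_i, d_i). NSM keeps whole tuples
  # together on a page; DSM splits the table into one page sequence per column.

  tuples = [("a1", "b1", "c1", "d1"), ("a2", "b2", "c2", "d2"),
            ("a3", "b3", "c3", "d3"), ("a4", "b4", "c4", "d4")]

  # NSM: a page is a run of complete tuples
  nsm_pages = [tuples[0:2], tuples[2:4]]

  # DSM: one page sequence per attribute column
  dsm_pages = {col: [[t[col] for t in tuples]] for col in range(4)}

  # A narrow projection (SELECT a FROM R) touches every NSM page,
  # but only the a-column pages under DSM:
  print([t[0] for page in nsm_pages for t in page])   # reads 2 pages
  print(dsm_pages[0][0])                              # reads 1 page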

¹ Recently, the terms row-store and column-store have become popular, too.

Recap

Magnetic Disks
  Random access is orders of magnitude slower than sequential access.
Disk Space Manager
  Abstracts from hardware details and maps page number ↦ physical location.
Buffer Manager
  Page caching in main memory; pin ()/unpin () interface; the replacement policy is crucial for effectiveness.
File Organization
  Stable record identifiers (rids); maintenance with fixed-sized records and variable-sized fields; NSM vs. DSM.

Chapter: Indexing
Navigate and Search Large Data Volumes

Outline:
• File Organization — File Organization Competition; Cost Model; Scan; Equality Test; Range Selection; Insertion; Deletion
• Indexes — Index Entries; Clustered / Unclustered; Dense / Sparse

File Organization and Indexes

• A heap file provides just enough structure to maintain a collection of records (of a table).
• The heap file supports sequential scans (openScan(·)) over the collection, e.g.

  SQL query leading to a sequential scan
  1 SELECT A,B
  2 FROM   R

But: No further operations receive specific support from the heap file.

File Organization and Indexes

• For queries of the following type, it would definitely be helpful if the SQL query processor could rely on a particular file organization of the records in the file for table R:

  Queries calling for a systematic file organization
  1 SELECT A,B
  2 FROM   R
  3 WHERE  C > 42

  1 SELECT A,B
  2 FROM   R
  3 ORDER BY C ASC

File organization for table R
Which organization of the records in the file for table R could speed up the evaluation of both queries above?
Allocate the records of table R in ascending order of their C attribute values, and place the records on neighboring pages. (Only include columns A, B, C in the records.)

A Competition of File Organizations

• This chapter presents a comparison of three file organizations:
  1 files of randomly ordered records (heap files),
  2 files sorted on some record field(s),
  3 files hashed on some record field(s).
• A file organization is tuned to make a certain query (class) efficient, but if we have to support more than one query class, we may be in trouble. Consider:

  Query Q calling for an organization sorted on column A
  1 SELECT A,B,C
  2 FROM   R
  3 WHERE  A > 0 AND A < 100

• If the file for table R is sorted on C, this does not buy us anything for query Q.
• If Q is an important query but is not supported by R’s file organization, we can build a support data structure, an index, to speed up (queries similar to) Q.

A Competition of File Organizations

• This competition assesses the 3 file organizations in 5 disciplines:
  1 Scan: read all records in a given file.
  2 Search with equality test:

    Query calling for equality test support
    1 SELECT *
    2 FROM   R
    3 WHERE  C = 42

  3 Search with range selection (the upper or lower bound might be unspecified):

    Query calling for range selection support
    1 SELECT *
    2 FROM   R
    3 WHERE  A > 0 AND A < 100

  4 Insert a given record in the file, respecting the file’s organization.
  5 Delete a record (identified by its rid), maintaining the file’s organization.

Simple Cost Model

• Performing these 5 database operations clearly involves block I/O, the major cost factor.
• However, we additionally have to pay for the CPU time used to search inside a page, compare a record field to a selection constant, etc.
• To analyze cost more accurately, we introduce the following parameters:

  Simple cost model parameters
  Parameter  Description
  b          # of pages in the file
  r          # of records on a page
  D          time needed to read/write a disk page
  C          CPU time needed to process a record (e.g., compare a field value)
  H          CPU time taken to apply a function to a record (e.g., a comparison or hash function)

Remarks:
• D ≈ 15 ms
• C ≈ H ≈ 0.1 µs
• This is a coarse model to estimate the actual execution time (it does not model network access, cache effects, burst I/O, . . . ).
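The cost formulas on the following slides all plug into these parameters. A small Python helper (our own; the constants are the rough figures above, and the relation size is invented) makes the numbers tangible:

  from math import log2

  D = 15          # read/write one disk page, in ms
  C = 0.0001      # process one record (0.1 microseconds)
  H = 0.0001      # apply a comparison/hash function

  def scan_heap(b, r):        # Scan_heap = b * (D + r * C); same for a sorted file
      return b * (D + r * C)

  def search_sorted(b, r):    # binary search, equality test on the sort field
      return log2(b) * (D + log2(r) * (C + H))

  b, r = 10_000, 100          # 10 000 pages, 100 records each
  print(scan_heap(b, r))      # ~150 000 ms: a full scan
  print(search_sorted(b, r))  # ~200 ms: binary search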

Aside: Hashing

A simple hash function
A hashed file uses a hash function h to map a given record onto a specific page of the file.
Example: h uses the lower 3 bits of the first field (of type INTEGER) of the record to compute the corresponding page number:

  h(⟨42, true, ’foo’⟩) → 2    (42 = 101010₂)
  h(⟨14, true, ’bar’⟩) → 6    (14 = 1110₂)
  h(⟨26, false, ’baz’⟩) → 2   (26 = 11010₂)

• The hash function determines the page number only; record placement inside a page is not prescribed.
• If a page p is filled to capacity, a chain of overflow pages is maintained to store additional records with h(⟨. . .⟩) = p.
• To avoid immediate overflowing when a new record is inserted, pages are typically filled to 80 % only when a heap file is initially (re)organized as a hashed file.
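The example hash function in executable form (a direct transcription of the slide's example; the tuple encoding is ours):

  # page number = lower 3 bits of the record's first INTEGER field
  def h(record):
      return record[0] & 0b111

  assert h((42, True, "foo")) == 2    # 42 = 101010b
  assert h((14, True, "bar")) == 6    # 14 =   1110b
  assert h((26, False, "baz")) == 2   # 26 =  11010b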

8 Scan Cost

1 Heap file
  Scanning the records of a file involves reading all b pages as well as processing each of the r records on each page:

      Scan_heap = b · (D + r · C)

2 Sorted file
  The sort order does not help much here. However, the scan retrieves the records in sorted order (which can be a big plus later on):

      Scan_sort = b · (D + r · C)

3 Hashed file
  Again, the hash function does not help. We simply scan from the beginning (skipping over the spare free space typically found in hashed files):

      Scan_hash = (100/80) · b · (D + r · C) = 1.25 · b · (D + r · C)

9 Cost for Search With Equality Test (A = const)

1 Heap file
  Consider (a) the equality test is on a primary key, (b) the equality test is not on a primary key:

      (a) Search_heap = 1/2 · b · (D + r · (C + H))
      (b) Search_heap = b · (D + r · (C + H))

2 Sorted file (sorted on A)
  We assume the equality test to be on the field determining the sort order. The sort order enables us to use binary search:

      Search_sort = log₂ b · (D + log₂ r · (C + H))

  If more than one record qualifies, all other matches are stored right after the first hit.
  (Nevertheless, no DBMS will implement binary search for value lookup.)

10 Cost for Search With Equality Test (A = const)

3 Hashed file (hashed on A)
  Hashed files support equality searching best. The hash function directly leads us to the page containing the hit (overflow chains ignored here).
  Consider (a) the equality test is on a primary key, (b) the equality test is not on a primary key:

      (a) Search_hash = H + D + 1/2 · r · C
      (b) Search_hash = H + D + r · C

  There is no dependence on the file size b here. (All qualifying records live on the same page or, if present, in its overflow chain.)

11 Cost for Range Selection (A >= lower AND A <= upper)

1 Heap file
  Qualifying records can appear anywhere in the file:

      Range_heap = b · (D + r · (C + 2 · H))

2 Sorted file (sorted on A)
  Use equality search (with A = lower), then sequentially scan the file until a record with A > upper is encountered:

      Range_sort = log₂ b · (D + log₂ r · (C + H)) + ⌈n/r⌉ · D + n · (C + H)

  (n + 1 overall hits in the range, n ≥ 0)

12 Cost for Range Selection (A >= lower AND A <= upper)

3 Hashed file (hashed on A)
  Hashing offers no help here, as hash functions are designed to scatter records all over the hashed file (e.g., for the h we considered earlier: h(⟨7, ...⟩) = 7, h(⟨8, ...⟩) = 0):

      Range_hash = (100/80) · b · (D + r · (C + H))

13 Insertion Cost

1 Heap file
  We can add the record to some arbitrary free page (finding that page is not accounted for here). This involves reading and writing that page:

      Insert_heap = 2 · D + C

2 Sorted file
  On average, the new record will belong in the middle of the file. After insertion, we have to shift all subsequent records (in the second half of the file):

      Insert_sort = log₂ b · (D + log₂ r · (C + H)) + 1/2 · b · (D + r · C + D)
                                                      (read + shift + write)

14 Insertion Cost

3 Hashed file
  We pretend to search for the record, then read and write the page determined by the hash function (here we assume the spare 20 % space on the page is sufficient to hold the new record):

      Insert_hash = H + D + C + D
                    (H + D: search)

15 Deletion Cost (Record Specified by its rid)

1 Heap file
  If we do not try to compact the file after we have found and removed the record (because the file uses free space management), the cost is:

      Delete_heap = D + C + D
                    (first D: search by rid)

2 Sorted file
  Again, we access the record's page and then (on average) shift the second half of the file to compact the file:

      Delete_sort = D + 1/2 · b · (D + r · C + D)

3 Hashed file
  Accessing the page using the rid is even faster than applying the hash function, so the hashed file behaves like the heap file:

      Delete_hash = D + C + D

16 The "Best" File Organization?

• There is no single file organization that responds equally fast to all 5 operations.
• This is a dilemma, because more advanced file organizations can really make a difference in speed.
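To see this dilemma in numbers, the following Python sketch (an illustration, not the slides' code) tabulates the cost formulas above side by side, using the parameter values from the cost model:

import math

# Cost parameters from the simple cost model (times in seconds).
D = 0.015      # read/write one disk page
C = 1e-7       # process one record
H = 1e-7       # apply a comparison/hash function
b = 1_000      # pages in the file
r = 100        # records per page
n = 10         # hits in a range selection

eq_sort = math.log2(b) * (D + math.log2(r) * (C + H))   # sorted equality search

costs = {
    "heap":   {"scan":     b * (D + r * C),
               "equality": 0.5 * b * (D + r * (C + H)),   # primary-key case
               "range":    b * (D + r * (C + 2 * H)),
               "insert":   2 * D + C,
               "delete":   2 * D + C},
    "sorted": {"scan":     b * (D + r * C),
               "equality": eq_sort,
               "range":    eq_sort + math.ceil(n / r) * D + n * (C + H),
               "insert":   eq_sort + 0.5 * b * (2 * D + r * C),
               "delete":   D + 0.5 * b * (2 * D + r * C)},
    "hashed": {"scan":     1.25 * b * (D + r * C),
               "equality": H + D + 0.5 * r * C,           # primary-key case
               "range":    1.25 * b * (D + r * (C + H)),
               "insert":   H + 2 * D + C,
               "delete":   2 * D + C},
}

for org, ops in costs.items():
    print(org, {op: f"{t:.3f} s" for op, t in ops.items()})

Running this makes the trade-off visible: each organization wins some disciplines and loses others badly.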

17 Range Selection Performance

• Performance of range selections for files of increasing size (D = 15 ms, C = 0.1 µs, r = 100, n = 10):

[Plot: range selection time [s] (log scale, 0.01–10000 s) vs. file size b [pages] (10–100000) for a sorted file vs. a heap/hashed file]

18 Deletion Performance Comparison

• Performance of deletions for files of increasing size (D = 15 ms, C = 0.1 µs, r = 100):

[Plot: deletion time [s] (log scale, 0.01–10000 s) vs. file size b [pages] (10–100000) for a sorted file vs. a heap/hashed file]

19 Indexes

• There exist index structures which offer all the advantages of a sorted file and support insertions/deletions efficiently¹: B+-trees.
• Before we turn to B+-trees in detail, the following sheds some light on indexes in general.

¹ At the cost of a modest space overhead.

20 Indexes

• If the basic organization of a file does not support a particular operation, we can additionally maintain an auxiliary structure, an index, which adds the needed support.
• We will use indexes like guides. Each guide is specialized to accelerate searches on a specific attribute A (or a combination of attributes) of the records in its associated file:

Index usage

1 Query the index for the location of a record with A = k (k is the search key).
2 The index responds with an associated index entry k* (k* contains sufficient information to access the actual record in the file).
3 Read the actual record by using the guiding information in k*; the record will have an A-field with value k.

      k  ──→  k*  ──→  ⟨..., A = k, ...⟩
      (index access)   (file access)

21 Index Entries

• We can design the index entries, i.e., the k*, in various ways:

Index entry designs

Variant  Index entry k*
-------  ------------------------
  A      ⟨k, ⟨..., A = k, ...⟩⟩
  B      ⟨k, rid⟩
  C      ⟨k, [rid1, rid2, ...]⟩

Remarks:
• With variant A, there is no need to store the data records in addition to the index; the index itself is a special file organization.
• If we build multiple indexes for a file, at most one of these should use variant A to avoid redundant storage of records.
• Variants B and C use rid(s) to point into the actual data file.
• Variant C leads to fewer index entries if multiple records match a search key k, but index entries are of variable length.
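A tiny Python sketch (hypothetical structures, not from the slides) may help make the variants and the three-step index usage concrete; plain dicts stand in for the index and the data file:

entry_a = (44, ("Smith", 44, 3000))    # variant A: <k, <..., A = k, ...>>
entry_b = (44, ("p3", 0))              # variant B: <k, rid>  (rid = page, slot)
entry_c = (44, [("p3", 0), ("p7", 2)]) # variant C: <k, [rid1, rid2, ...]>

def lookup(index, data_file, k):
    """Variant-B index usage: index access, then file access."""
    rid = index[k]             # steps 1/2: query the index, obtain k* = <k, rid>
    return data_file[rid]      # step 3: read the actual record via the rid

index = {44: ("p3", 0)}
data_file = {("p3", 0): ("Smith", 44, 3000)}
assert lookup(index, data_file, 44) == ("Smith", 44, 3000)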

22 Indexing Example

Example (Indexing example, see following slide)

• The data file contains ⟨name, age, sal⟩ records of table employees; the file itself (index entry variant A) is hashed on field age (hash function h1).
• The index file contains ⟨sal, rid⟩ index entries (variant B), pointing into the data file.
• This file organization + index efficiently supports equality searches on the age and sal keys.

23 Indexing Example

Example (Indexing example)

[Figure: the data file of ⟨name, age, sal⟩ records, hashed on age via h1 (e.g., the h(age) = 0 bucket holds Smith/44/3000, Jones/40/6003, Tracy/44/5004); a separate index file of ⟨sal, rid⟩ entries, hashed on sal via h2, points into the data file]

24 Index Properties: Clustered vs. Unclustered

• Suppose we have to support range selections on records such that A >= lower AND A <= upper.
• If we maintain an index on the A-field, we can
  1 query the index once for a record with A = lower, and then
  2 sequentially scan the data file from there until we encounter a record with field A > upper.
• However: this switch from index to data file will only work provided that the data file itself is sorted on field A.

25 Index Properties: Clustered vs. Unclustered

Index over a data file with matching sort order

[Figure: index file entries in sorted order, each pointing to consecutive pages of the data file]

Remark:
• In a B+-tree, for example, the index entries k* stored in the leaves are sorted by key k.

26 Index Properties: Clustered vs. Unclustered

Definition (Clustered index)

If the data file associated with an index is sorted on the index search key, the index is said to be clustered.

In general, the cost for a range selection grows tremendously if the index on A is unclustered. In this case, proximity of index entries does not imply proximity in the data file:

• As before, we can query the index for a record with A = lower. To continue the scan, however, we have to revisit the index entries, which point us to data pages scattered all over the data file.

27 Index Properties: Clustered vs. Unclustered

Unclustered index

[Figure: index file entries in sorted order whose pointers cross, targeting data file pages in arbitrary order]

Remarks:
• If an index uses entries k* of variant A, the index is obviously clustered by definition.

Variant A in Oracle 8i

CREATE TABLE ... (... PRIMARY KEY (... ))
  ORGANIZATION INDEX;

• A data file can have at most one clustered index (but any number of unclustered indexes).

28 Index Properties: Clustered vs. Unclustered

Clustered indexes

Create a clustered index IXR on table R, index key is attribute A:

CREATE INDEX IXR ON R (A ASC) CLUSTER

The DB2 V9.5 manual says:

"[ CLUSTER ] specifies that the index is the clustering index of the table. The cluster factor of a clustering index is maintained or improved dynamically as data is inserted into the associated table, by attempting to insert new rows physically close to the rows for which the key values of this index are in the same range. Only one clustering index may exist for a table so CLUSTER may not be specified if it was used in the definition of any existing index on the table (SQLSTATE 55012). A clustering index may not be created on a table that is defined to use append mode (SQLSTATE 428D8)."

29 Index Properties: Clustered vs. Unclustered

Cluster a table based on an existing index

Reorganize the rows of table R so that their physical order matches the existing index IXR:

CLUSTER R USING IXR

• If IXR indexes attribute A of R, rows will be sorted in ascending A order.
• The evaluation of range queries will touch fewer pages (which, additionally, will be physically adjacent).
• Note: generally, future insertions will compromise the perfect A order.
• You may issue CLUSTER R again to re-cluster.
• In CREATE TABLE, use WITH (fillfactor = f), f ∈ 10 ... 100, to reserve page space for subsequent insertions.

30 Index Properties: Clustered vs. Unclustered

Inspect clustered index details

db2 => reorgchk update statistics on table grust.accel

Doing RUNSTATS ...

Table statistics:
  .
  .
Index statistics:

F4: CLUSTERRATIO or normalized CLUSTERFACTOR > 80

SCHEMA.NAME                 INDCARD  LEAF  LVLS    KEYS   F4  REORG
-------------------------------------------------------------------
Table: GRUST.ACCEL
Index: GRUST.IKIND           235052   529     3       3   99  -----
Index: GRUST.IPAR            235052   675     3   67988   99  -----
Index: GRUST.IPOST           235052   980     3  235052  100  -----
Index: GRUST.ITAG            235052   535     3      75   80  *----
Index: SYSIBM.SQL080119120245000
                             235052   980     3  235052  100  -----
-------------------------------------------------------------------

31 Index Properties: Dense vs. Sparse

A clustered index comes with more advantages than the improved performance for range selections. We can additionally design the index to be space efficient:

Definition (Sparse index)

To keep the size of the index small, maintain one index entry k* per data file page (not one index entry per data record). Key k is the smallest key on that page.

Indexes of this kind are called sparse. (Otherwise, indexes are referred to as dense.)

32 Index Properties: Dense vs. Sparse

To search for a record with field A = k in a sparse A-index,

1 locate the largest index entry k'* such that k' ≤ k, then
2 access the page pointed to by k'*, and
3 scan this page (and the following pages, if needed) to find records with ⟨..., A = k, ...⟩.

Since the data file is clustered (i.e., sorted) on field A, we are guaranteed to find matching records in the proximity.
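A possible Python sketch of this three-step sparse-index search (the list-of-pages representation is an assumption for illustration; a full implementation would also continue into the following pages):

import bisect

def sparse_search(index_keys, index_pages, pages, k):
    """index_keys[i] is the smallest A value on page index_pages[i]."""
    # 1. locate the largest index entry k' with k' <= k
    i = bisect.bisect_right(index_keys, k) - 1
    if i < 0:
        return []
    # 2. access the page pointed to by that entry, 3. scan it for A == k
    return [rec for rec in pages[index_pages[i]] if rec[0] == k]

index_keys  = [22, 40]                 # one entry per data page
index_pages = [0, 1]
pages = {0: [(22, "Daniels"), (25, "Ashby"), (33, "Basu")],
         1: [(40, "Jones"), (44, "Smith"), (44, "Tracy")]}
assert sparse_search(index_keys, index_pages, pages, 44) == \
    [(44, "Smith"), (44, "Tracy")]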

33 Index Example (Dense vs. Sparse)

Example (Sparse index example)

• Again, the data file contains ⟨name, age, sal⟩ records. We maintain a clustered sparse index on field name and an unclustered dense index on field age. Both use index entry variant B to point into the data file.

[Figure: sparse index on name (entries Ashby, Cass, Smith, one per data page) and dense index on age (entries 22, 25, 30, 33, 40, 44, 44, 50, one per record), both pointing into the data file sorted on name]

34 Index Properties: Dense vs. Sparse

Note:
• Sparse indexes need 2–3 orders of magnitude less space than dense indexes (consider # records/page).
• We cannot build a sparse index that is unclustered (i.e., there is at most one sparse index per file).

SQL queries and index exploitation

How do you propose to evaluate the following SQL queries?

SELECT MAX(age)
FROM   employees

SELECT MAX(name)
FROM   employees

Chapter 4
Tree-Structured Indexing: ISAM and B+-trees

Architecture and Implementation of Database Systems

Outline: Binary Search · ISAM (Multi-Level ISAM · Too Static? · Search Efficiency) · B+-trees (Search · Insert · Redistribution · Delete · Duplicates · Key Compression · Bulk Loading · Partitioned B+-trees)

1 Ordered Files and Binary Search

How could we prepare for such queries and evaluate them efficiently?

SELECT *
FROM   CUSTOMERS
WHERE  ZIPCODE BETWEEN 8880 AND 8999

We could
1 sort the table on disk (in ZIPCODE order), and
2 to answer queries, use binary search to find the first qualifying tuple, then scan as long as ZIPCODE < 8999.

Here, let k* denote the full record with key k:

4104* 4123* 4222* 4450* 4528* 5012* 6330* 6423* 8050* 8105* 8180* 8245* 8280* 8406* 8570* 8600* 8604* 8700* 8808* 8887* 8910* 8953* 9016* 9200* 9532*
(the scan covers 8887*, 8910*, 8953*)

2 Ordered Files and Binary Search

[Figure: the same sorted records distributed over pages 0–12; binary search jumps between pages until it locates the first qualifying tuple, then the scan proceeds sequentially]

+ We get sequential access during the scan phase.
− We need to read log₂(# tuples) tuples during the search phase, and about as many pages for this.
− The whole point of binary search is that we make far, unpredictable jumps. This largely defeats page prefetching.
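A small Python sketch of this sort-then-binary-search plan (illustrative only; bare keys stand in for the full records k*):

import bisect

def range_select(sorted_keys, lower, upper):
    """Binary search for the first qualifying key, then scan sequentially."""
    start = bisect.bisect_left(sorted_keys, lower)   # search phase
    hits = []
    for k in sorted_keys[start:]:                    # scan phase
        if k > upper:
            break
        hits.append(k)
    return hits

keys = [4104, 4123, 4222, 8808, 8887, 8910, 8953, 9016, 9200, 9532]
assert range_select(keys, 8880, 8999) == [8887, 8910, 8953]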

3 Tree-Structured Indexing

• This chapter discusses two index structures which especially shine if we need to support range selections (and thus sorted file scans): ISAM files and B+-trees.
• Both indexes are based on the same simple idea which naturally leads to a tree-structured organization of the indexes. (Hash indexes are covered in a subsequent chapter.)
• B+-trees refine the idea underlying the rather static ISAM scheme and add efficient support for insertions and deletions.

4 Indexed Sequential Access Method (ISAM)

Remember: range selections on ordered files may use binary search to locate the lower range limit as a starting point for a sequential scan of the file (until the upper limit is reached).

ISAM ...
• ... acts as a replacement for the binary search phase, and
• touches considerably fewer pages than binary search.

5 Indexed Sequential Access Method (ISAM)

To support range selections on field A:

1 In addition to the A-sorted data file, maintain an index file with entries (records) of the following form (each index entry is a ⟨separator, pointer⟩ pair):

      p0 | k1 p1 | k2 p2 | ··· | kn pn

2 ISAM leads to sparse index structures. In an index entry ⟨ki, pointer to pi⟩, key ki is the first (i.e., the minimal) A value on the data file page pointed to by pi (pi: page no).

6 Indexed Sequential Access Method (ISAM)

      p0 | k1 p1 | k2 p2 | ··· | kn pn

• In the index file, the ki serve as separators between the contents of pages p(i−1) and pi.
• It is guaranteed that k(i−1) < ki for i = 2, ..., n.
• We obtain a one-level ISAM structure.

One-level ISAM structure of N + 1 pages

[Figure: an index file of separators k1 ... kN above a data file of pages p0, p1, p2, ..., pN]

7 Searching ISAM

SQL range selection on field A

SELECT *
FROM   R
WHERE  A BETWEEN lower AND upper

To support range selections:
1 Conduct a binary search on the index file for a key of value lower.
2 Start a sequential scan of the data file from the page pointed to by the index entry (scan until field A exceeds upper).

8 Indexed Sequential Access Method (ISAM)

• The size of the index file is likely to be much smaller than the data file size. Searching the index is far more efficient than searching the data file.
• For large data files, however, even the index file might be too large to allow for fast searches.

Main idea behind ISAM indexing

Recursively apply the index creation step: treat the topmost index level like the data file and add an additional index layer on top.
Repeat until the top-most index layer fits on a single page (the root page).

9 Multi-Level ISAM Structure

This recursive index creation scheme leads to a tree-structured hierarchy of index levels:

Multi-level ISAM structure

[Figure: two levels of index pages (root: 8180, 8910; below it: 4222, 4528, 6330, 8050 / 8280, 8570, 8604, 8808 / 9016, 9532) above the data pages 4104* ... 9532*]

10 Multi-Level ISAM Structure

[Figure: the multi-level ISAM structure of the previous slide, index pages above data pages]

• Each ISAM tree node corresponds to one page (disk block).
• To create the ISAM structure for a given data file, proceed bottom-up (a code sketch follows below):

1 Sort the data file on the search key field.
2 Create the index leaf level.
3 If the top-most index level contains more than one page, repeat.
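A compact Python sketch of this bottom-up construction (a simplification: every child is indexed by a ⟨minimum key, child⟩ pair rather than the p0 | k1 p1 | ... separator layout):

def build_isam(data_pages, fanout):
    """data_pages: the already sorted data file, as a list of sorted key lists."""
    def smallest(node):
        # a data page is a list of keys; an index node is a list of (key, child)
        return node[0][0] if isinstance(node[0], tuple) else node[0]

    level = data_pages                      # step 2 starts from the data pages
    while len(level) > 1:                   # step 3: repeat ...
        level = [[(smallest(c), c) for c in level[i:i + fanout]]
                 for i in range(0, len(level), fanout)]
    return level[0]                         # ... until one root page is left

root = build_isam([[4104, 4123], [4222, 4450], [4528, 5012], [6330, 6423]], 2)
assert [k for k, _ in root] == [4104, 4528]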

11 Multi-Level ISAM Structure: Overflow Pages

• The upper index levels of the ISAM tree remain static: insertions and deletions in the data file do not affect the upper tree layers.
• Insertion of a record into the data file: if space is left on the associated leaf page, insert the record there.
• Otherwise create and maintain a chain of overflow pages hanging off the full primary leaf page. Note: the records on the overflow pages are not ordered in general.
⇒ Over time, search performance in ISAM can degrade.

Multi-level ISAM structure with overflow pages

[Figure: the ISAM tree with chains of overflow pages hanging off the leaf pages]

12 Multi-Level ISAM Structure: Example

Each page can hold two index entries plus one (the left-most) page pointer:

Example (Initial state of ISAM structure)

   root page:          [ 40 ]
                      /       \
                [ 20 33 ]    [ 51 63 ]
   leaf pages: 10* 15* | 20* 27* | 33* 37* | 40* 46* | 51* 55* | 63* 97*

13 Multi-Level ISAM Structure: Insertions

Example (After insertion of data records with keys 23, 48, 41, 42)

   root page:          [ 40 ]
                      /       \
                [ 20 33 ]    [ 51 63 ]
   primary leaf pages: 10* 15* | 20* 27* | 33* 37* | 40* 46* | 51* 55* | 63* 97*
   overflow pages:                23*                48* 41*
                                                     42*

14 Multi-Level ISAM Structure: Deletions

Example (After deletion of data records with keys 42, 51, 97)

   root page:          [ 40 ]
                      /       \
                [ 20 33 ]    [ 51 63 ]
   primary leaf pages: 10* 15* | 20* 27* | 33* 37* | 40* 46* | 55* | 63*
   overflow pages:                23*                48* 41*

15 ISAM: Too Static?

• The non-leaf levels of the ISAM structure have not been touched at all by the data file updates.
• This may lead to index key entries which do not appear in the index leaf level (e.g., key value 51 on the previous slide).

Orphaned index entries

Does an index key entry like 51 above lead to problems during index key searches?
No, since the index keys maintain their separator property.

• To preserve the separator property of the index key entries, maintenance of overflow chains is required.
⇒ ISAM may lose balance after heavy updating. This complicates life for the query optimizer.

16 ISAM: Being Static is Not All Bad

• Leaving free space during index creation reduces the insertion/overflow problem (typically ≈ 20 % free space).
• Since ISAM indexes are static, pages need not be locked during concurrent index access.
• Locking can be a serious bottleneck in dynamic tree indexes (particularly near the index root node).
⇒ ISAM may be the index of choice for relatively static data.

17 ISAM: Efficiency of Searches

• Regardless of these deficiencies, ISAM-based searching is the most efficient order-aware index structure discussed so far:

Definition (ISAM fan-out)

• Let N be the number of pages in the data file, and let F denote the fan-out of the ISAM tree, i.e., the maximum number of children per index node.
• The fan-out in the previous example is F = 3; typical realistic fan-outs are F ≈ 1,000.
• When index searching starts, the search space is of size N. With the help of the root page we are guided into an index subtree of size

      N · 1/F .

18 ISAM: Efficiency of Searches

• As we search down the tree, the search space is repeatedly reduced by a factor of F:

      N · 1/F · 1/F · ···

• Index searching ends after s steps when the search space has been reduced to size 1 (i.e., we have reached the index leaf level and found the data page that contains the wanted record):

      N · (1/F)^s = 1   ⇔   s = log_F N .

• Since F ≫ 2, this is significantly more efficient than access via binary search (log₂ N).

Example (Required I/O operations during ISAM search)

Assume F = 1,000. An ISAM tree of height 3 can index a file of one billion (10⁹) pages, i.e., 3 I/O operations are sufficient to locate the wanted data file page.

19 B+-trees: A Dynamic Index Structure

The B+-tree index structure is derived from the ISAM idea, but is fully dynamic with respect to updates:

• Search performance is only dependent on the height of the B+-tree (because of the high fan-out F, the height rarely exceeds 3).
• No overflow chains develop; a B+-tree remains balanced.
• B+-trees offer efficient insert/delete procedures; the underlying data file can grow/shrink dynamically.
• B+-tree nodes (except the root page) are guaranteed to have a minimum occupancy of 50 % (typically 2/3).

Original publication (B-tree): R. Bayer and E.M. McCreight. Organization and Maintenance of Large Ordered Indexes. Acta Informatica, vol. 1, no. 3, September 1972.

20 B+-trees: Basics

B+-trees resemble ISAM indexes, where
• leaf nodes are connected to form a doubly-linked list, the so-called sequence set,²
• leaves may contain actual data records or just references to records on data pages (i.e., index entry variants B or C). Here we assume the latter since this is the common case.

Remember: ISAM leaves were the data pages themselves, instead.

Sketch of B+-tree structure (data pages not shown)

[Figure: inner index nodes above a doubly-linked chain of leaf nodes]

² This is not a strict B+-tree requirement, although most systems implement it.

21 B+-trees: Non-Leaf Nodes

B+-tree inner (non-leaf) node (index entries are ⟨separator, pointer⟩ pairs):

      p0 | k1 p1 | k2 p2 | ··· | k2d p2d

B+-tree non-leaf nodes use the same internal layout as inner ISAM nodes:
• The minimum and maximum number of entries n is bounded by the order d of the B+-tree:

      d ≤ n ≤ 2 · d   (root node: 1 ≤ n ≤ 2 · d) .

• A node contains n + 1 pointers. Pointer pi (1 ≤ i ≤ n − 1) points to a subtree in which all key values k are such that

      ki ≤ k < ki+1 .

  (p0 points to a subtree with key values < k1; pn points to a subtree with key values ≥ kn.)

22 B+-trees: Leaf Nodes

• B+-tree leaf nodes contain pointers to data records (not pages). A leaf node entry with key value k is denoted as k*, as before.
• Note that we can use all index entry variants A, B, C to implement the leaf entries:
  • For variant A, the B+-tree represents the index as well as the data file itself. Leaf node entries thus look like

        ki* = ⟨ki, ...⟩ .

  • For variants B and C, the B+-tree lives in a file distinct from the actual data file. Leaf node entries look like

        ki* = ⟨ki, rid⟩ .

23 B+-tree: Search

Below, we assume that key values are unique (we defer the treatment of duplicate key values).

B+-tree search

1 Function: search (k)
2   return tree_search (k, root);

1 Function: tree_search (k, node)
2   if node is a leaf then
3     return node;
4   switch k do
5     case k < k1
6       return tree_search (k, p0);
7     case ki ≤ k < ki+1
8       return tree_search (k, pi);
9     case k2d ≤ k
10      return tree_search (k, p2d);

• Function search(k) returns a pointer to the leaf node page that contains potential hits for search key k.

Node page layout:   p0 | k1 p1 | k2 p2 | ··· | k2d p2d
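A direct Python transcription of tree_search (a sketch, not the slides' code; the dict-based node representation is an assumption):

import bisect

def tree_search(k, node):
    """Return the leaf page that may contain entries for search key k.
    Nodes: {"leaf": bool, "keys": [...], "ptrs": [...]},
    inner nodes keep len(ptrs) == len(keys) + 1."""
    if node["leaf"]:
        return node
    # number of keys <= k selects p0 (k < k1), pi (ki <= k < ki+1), or p2d
    i = bisect.bisect_right(node["keys"], k)
    return tree_search(k, node["ptrs"][i])

leaf1 = {"leaf": True,  "keys": [2, 3],    "ptrs": []}
leaf2 = {"leaf": True,  "keys": [5, 7, 8], "ptrs": []}
root  = {"leaf": False, "keys": [5],       "ptrs": [leaf1, leaf2]}
assert tree_search(7, root) is leaf2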

24 B+-tree: Insert

• Remember that B+-trees remain balanced³ no matter which updates we perform. Insertions and deletions have to preserve this invariant.
• The basic principle of B+-tree insertion is simple:
  1 To insert a record with key k, call search(k) to find the page p to hold the new record. Let m denote the number of entries on p.
  2 If m < 2 · d (i.e., there is capacity left on p), store k* in page p. Otherwise ...?

• We must not start an overflow chain hanging off p: this would violate the balancing property.
• We want the cost for search(k) to be dependent on tree height only, so placing k* somewhere else (even near p) is no option either.

³ All paths from the B+-tree root to any leaf are of equal length.

25 B+-tree: Insert

• Sketch of the insertion procedure for entry ⟨k, q⟩ (key value k pointing to rid q):

1 Find the leaf page p where we would expect the entry for k.
2 If p has enough space to hold the new entry (i.e., at most 2d − 1 entries in p), simply insert ⟨k, q⟩ into p.
3 Otherwise node p must be split into p and p', and a new separator has to be inserted into the parent of p.
  Splitting happens recursively and may eventually lead to a split of the root node (increasing the tree height).
4 Distribute the entries of p and the new entry ⟨k, q⟩ onto pages p and p' (see the code sketch below).
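Steps 3/4 for a full leaf as a small Python sketch (hypothetical helper; the order d is passed explicitly). Note that the separator is copied up, so it also stays on the new page:

def split_leaf(entries, new_entry, d):
    """entries: the 2d sorted (key, rid) pairs of p; returns (p, p', separator)."""
    merged = sorted(entries + [new_entry])
    left, right = merged[:d], merged[d:]       # d entries stay, d + 1 move to p'
    return left, right, right[0][0]            # separator = smallest key on p'

p, p_new, sep = split_leaf([(2, "a"), (3, "b"), (5, "c"), (7, "d")], (8, "e"), d=2)
assert (sep, p, p_new) == (5, [(2, "a"), (3, "b")], [(5, "c"), (7, "d"), (8, "e")])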

26 B+-tree: Insert

Example (B+-tree insertion procedure)

1 Insert a record with key k = 8 into the following B+-tree:

   root page: [ 13 17 24 30 ]
   leaves: 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*

2 The left-most leaf page p has to be split. Entries 2*, 3* remain on p; entries 5*, 7*, and 8* (new) go on new page p'.

27 B+-tree: Insert and Leaf Node Split

Example (B+-tree insertion procedure)

3 Pages p and p' are shown below. Key k' = 5, the new separator between pages p and p', has to be inserted into the parent of p and p' recursively:

   parent (so far): [ 13 17 24 30 ]
   new separator: 5
   pages p, p': 2* 3* | 5* 7* 8*

• Note that, after such a leaf node split, the new separator key k' = 5 is copied up the tree: the entry 5* itself has to remain in its leaf page.

28 B+-tree: Insert and Non-Leaf Node Split

Example (B+-tree insertion procedure)

4 The insertion process is propagated up the tree: inserting key k' = 5 into the parent leads to a non-leaf node split (the 2·d + 1 keys and 2·d + 2 pointers make for two new non-leaf nodes and a middle key which we propagate further up for insertion):

   pushed-up middle key: 17
   new non-leaf nodes: [ 5 13 ] and [ 24 30 ]

• Note that, for a non-leaf node split, we can simply push up the middle key (17). Contrast this with a leaf node split.

29 B+-tree: Insert and Root Node Split

Example

5 Since the split node was the root node, we create a new root node which holds the pushed-up middle key only:

   root page: [ 17 ]
   level below: [ 5 13 ]   [ 24 30 ]
   leaves: 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*

• Splitting the old root and creating a new root node is the only situation in which the B+-tree height increases. The B+-tree thus remains balanced.

30 B+-tree: Insert

Further key insertions

How does the insertion of records with keys k = 23 and k = 40 alter the B+-tree?

   root page: [ 17 ]
   level below: [ 5 13 ]   [ 24 30 ]
   leaves: 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*

31 B+-tree Insert: Further Examples

Example (B+-tree insertion procedure)

Initial tree (leaf entries carry pointers to data pages, not shown):

   node 0 (root): [ 8500 ]
   node 1: [ 5012 8280 ]   node 2: [ 8700 9016 ]
   node 3: 4123 4450 4528 | node 4: 5012 6423 8050 8105 | node 5: 8280 8404 |
   node 6: 8500 8570 8604 | node 7: 8700 8808 8887 | node 8: 9016 9200

Insert a new entry with key 4222.
⇒ Enough space in node 3, simply insert.
⇒ Keep entries sorted within nodes:  node 3: 4123 4222 4450 4528

Insert key 6333.
⇒ Must split node 4: entries 5012, 6333 remain on node 4; entries 6423, 8050, 8105 go on new node 9.
⇒ The new separator 6423 goes into node 1 (including the pointer to the new page):

   node 1: [ 5012 6423 8280 ]
   node 4: 5012 6333 | node 9: 6423 8050 8105

B+-tree Insert: Further Examples

Example (B+-tree insertion procedure)

After inserting keys 8180 and 8245 (node 9 is split, its new right half becomes node 10, and separator 8105 is added to node 1), insert key 4104.
⇒ Must split node 3: entries 4104, 4123 remain on node 3; entries 4222, 4450, 4528 go on new node 11. The new separator is 4222.
⇒ Node 1 ([ 5012 6423 8105 8280 ]) now overflows: split it. Keys 4222, 5012 stay on node 1; keys 8105, 8280 go on new node 12.
⇒ The middle key 6423 (which originally came from a leaf split) goes into the root. Unlike during a leaf split, the separator key does not remain in the inner node. Why?

Resulting tree:

   node 0 (root): [ 6423 8500 ]
   node 1: [ 4222 5012 ]   node 12: [ 8105 8280 ]   node 2: [ 8700 9016 ]
   node 3: 4104 4123 | node 11: 4222 4450 4528 | node 4: 5012 6333 |
   node 9: 6423 8050 | node 10: 8105 8180 8245 | node 5: 8280 8404 |
   node 6: 8500 8570 8604 | node 7: 8700 8808 8887 | node 8: 9016 9200

34 B+-tree: Root Node Split

• Splitting starts at the leaf level and continues upward as long as index nodes are fully occupied.
• Eventually, this can lead to a split of the root node:
  • Split like any other inner node.
  • Use the separator to create a new root.
• The root node is the only node that may have an occupancy of less than 50 %.
• This is the only situation where the tree height increases.

How often do you expect a root split to happen?

E.g., a B+-tree over 8-byte integers, 4 KB pages; pointers encoded as 8-byte integers.
• 128–256 index entries/page (fan-out F).
• An index of height h indexes at least 128^h records, typically more.

   h   # records
   2        16,000
   3     2,000,000
   4   250,000,000

35 B+-tree: Insertion Algorithm

B+-tree insertion algorithm

1 Function: tree_insert (k, rid, node)

2   if node is a leaf then
3     return leaf_insert (k, rid, node);
4   else
5     switch k do
6       case k < k1
7         ⟨sep, ptr⟩ ← tree_insert (k, rid, p0);
8       case ki ≤ k < ki+1
9         ⟨sep, ptr⟩ ← tree_insert (k, rid, pi);
10      case k2d ≤ k
11        ⟨sep, ptr⟩ ← tree_insert (k, rid, p2d);
12    if sep is null then
13      return ⟨null, null⟩;
14    else
15      return non_leaf_insert (sep, ptr, node);

(case analysis as in tree_search ())

1 Function: leaf_insert (k, rid, node)
2   if another entry fits into node then
3     insert ⟨k, rid⟩ into node;
4     return ⟨null, null⟩;
5   else
6     allocate new leaf page p;
7     let ⟨k1+, rid1+⟩, ..., ⟨k2d+1+, rid2d+1+⟩ := entries from node ∪ {⟨k, rid⟩};
8     leave entries ⟨k1+, rid1+⟩, ..., ⟨kd+, ridd+⟩ in node;
9     move entries ⟨kd+1+, ridd+1+⟩, ..., ⟨k2d+1+, rid2d+1+⟩ to p;
10    return ⟨kd+1+, p⟩;

1 Function: non_leaf_insert (k, ptr, node)
2   if another entry fits into node then
3     insert ⟨k, ptr⟩ into node;
4     return ⟨null, null⟩;
5   else
6     allocate new non-leaf page p;
7     let ⟨k1+, p1+⟩, ..., ⟨k2d+1+, p2d+1+⟩ := entries from node ∪ {⟨k, ptr⟩};
8     leave entries ⟨k1+, p1+⟩, ..., ⟨kd+, pd+⟩ in node;
9     move entries ⟨kd+2+, pd+2+⟩, ..., ⟨k2d+1+, p2d+1+⟩ to p;
10    set p0 ← pd+1+ in p;
11    return ⟨kd+1+, p⟩;

36 B+-tree: Insertion Algorithm

B+-tree insertion algorithm

1 Function: insert (k, rid)

2   ⟨key, ptr⟩ ← tree_insert (k, rid, root);
3   if key is not null then
4     allocate new root page r;
5     populate r with
6       p0 ← root;
7       k1 ← key;
8       p1 ← ptr;
9     root ← r;

• insert (k, rid) is called from outside.
• Variable root contains a pointer to the B+-tree root page.
• Note how leaf node entries point to rids, while inner nodes contain pointers to other B+-tree nodes.

38 B+-tree Insert: Redistribution

• We can further improve the average occupancy of a B+-tree using a technique called redistribution.
• Suppose we are trying to insert a record with key k = 6 into the B+-tree below:

   root page: [ 13 17 24 30 ]
   leaves: 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*

• The left-most leaf is full already; its right sibling still has capacity, however.

39 B+-tree Insert: Redistribution

• In this situation, we can avoid growing the tree by redistributing entries between siblings (entry 7* moved into the right sibling):

   parent: [ 7 17 24 30 ]
   leaves: 2* 3* 5* 6* | 7* 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*

• We have to update the parent node (new separator 7) to reflect the redistribution.
• Inspecting one or both neighbor(s) of a B+-tree node involves additional I/O operations.
⇒ Actual implementations often use redistribution on the leaf level only (because the sequence set page chaining gives direct access to both sibling pages).

40 B+-tree Insert: Redistribution

Redistribution makes a difference

Insert a record with key k = 30

1 without redistribution,
2 using leaf level redistribution
into the B+-tree shown below. How does the tree change?

   root page: [ 13 17 24 30 ]
   leaves: 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*

41 B+-tree: Delete

• The principal idea to implement B+-tree deletion comes as no surprise:
  1 To delete a record with key k, use search(k) to locate the leaf page p containing the record. Let m denote the number of entries on p.
  2 If m > d then p has sufficient occupancy: simply delete k* from p (if k* is present on p at all). Otherwise ...?

42 B+-tree: Delete

Example (B+-tree deletion procedure)

1 Delete the record with key k = 19 (i.e., entry 19*) from the following B+-tree:

   root page: [ 17 ]
   level below: [ 5 13 ]   [ 24 30 ]
   leaves: 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*

2 A call to search(19) leads us to leaf page p containing entries 19*, 20*, and 22*. We can safely remove 19* since m = 3 > 2 (no page underflow in p after removal).

43 B+-tree: Delete and Leaf Redistribution

Example (B+-tree deletion procedure)

3 Subsequent deletion of 20*, however, lets p underflow (p has the minimal occupancy of d = 2 already).
  We now use redistribution and borrow entry 24* from the right sibling p' of p (since p' hosts 3 > 2 entries, redistribution won't let p' underflow).
  The smallest key value on p' (27) is the new separator of p and p' in their common parent:

   root page: [ 17 ]
   level below: [ 5 13 ]   [ 27 30 ]
   leaves: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* (page p) | 27* 29* (page p') | 33* 34* 38* 39*

44 B+-tree: Delete and Leaf Merging

Example (B+-tree deletion procedure)

4 We continue and delete entry 24* from p. Redistribution is no option now (sibling p' has the minimal occupancy of d = 2). We now have m_p + m_p' = 1 + 2 < 2 · d, however: B+-tree deletion thus merges leaf nodes p and p'.
  Move entries 27*, 29* from p' to p, then delete page p':

   parent entry: [ 30 ]
   merged leaf: 22* 27* 29* | right neighbor: 33* 34* 38* 39*

• NB: the separator 27 between p and p' is no longer needed and thus discarded (recursively deleted) from the parent.

45 B+-tree: Delete and Non-Leaf Node Merging

Example (B+-tree deletion procedure)

5 The parent of p experiences underflow. Redistribution is no option, so we merge with the left non-leaf sibling. After merging we have d + (d − 1) keys (left + right) and (d + 1) + d pointers (left + right) on the merged page:

   before: [ 5 13 ] and [ 30 ], separator 17 in the parent
   after:  [ 5 13 17 30 ]

  The missing key value, namely the separator of the two nodes (17), is pulled down (and thus deleted) from the parent to form the complete merged node.

46 B+-tree: Root Deletion

Example (B+-tree deletion procedure)

6 Since we have now deleted the last remaining entry in the root, we discard the root (and make the merged node the new root):

   root page: [ 5 13 17 30 ]
   leaves: 2* 3* | 5* 7* 8* | 14* 16* | 22* 27* 29* | 33* 34* 38* 39*

• This is the only situation in which the B+-tree height decreases. The B+-tree thus remains balanced.

47 B+-tree: Delete and Non-Leaf Node Redistribution

Example (B+-tree deletion procedure)

7 We have now seen leaf node merging and redistribution as well as non-leaf node merging. The remaining case of non-leaf node redistribution is straightforward:
  • Suppose during deletion we encounter the following B+-tree:

   root page: [ 22 ]
   level below: [ 5 13 17 20 ]   [ 30 ]
   leaves: 2* 3* | 5* 7* 8* | 14* 16* | 17* 18* | 20* 21* | 22* 27* 29* | 33* 34* 38* 39*

  • The non-leaf node with entry 30 underflowed. Its left sibling has two entries (17 and 20) to spare.

48 B+-tree: Delete and Non-Leaf Node Redistribution

Example (B+-tree deletion procedure)

8 We redistribute entry 20 by "rotating it through" the parent. The former parent entry 22 is pushed down:

   root page: [ 20 ]
   level below: [ 5 13 17 ]   [ 22 30 ]
   leaves: 2* 3* | 5* 7* 8* | 14* 16* | 17* 18* | 20* 21* | 22* 27* 29* | 33* 34* 38* 39*

49 Merge and Redistribution Effort

• Actual DBMS implementations often avoid the cost of merging and/or redistribution, and relax the minimum occupancy rule instead.

B+-tree deletion

• System parameter MINPCTUSED (minimum percent used) controls when the kernel should try a leaf node merge ("online index reorg").
  (This is particularly simple because of the sequence set pointers connecting adjacent leaves, see slide 40.)
• Non-leaf nodes are never merged (a "full index reorg" is required to achieve this).
• To improve concurrency, deleted index entries are merely marked as deleted and only removed later (IBM DB2 UDB type-2 indexes).

50 B+-tree: Duplicates

• As discussed here, the B+-tree search, insert (and delete) procedures ignore the presence of duplicate key values.
• Often this is a reasonable assumption: if the key field is a primary key for the data file (i.e., for the associated relation), the search keys k are unique by definition.

Treatment of duplicate keys

Since duplicate keys add to the B+-tree complexity, IBM DB2 forces uniqueness by forming a composite key of the form ⟨k, id⟩ where id is the unique tuple identity of the data record with key k.
Tuple identities
1 are system-maintained unique identifiers for each tuple in a table, and
2 are not dependent on tuple order and never rise again.
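A minimal Python sketch of this composite-key idea (the key and identity values are invented for illustration); all duplicates of a key form one contiguous run in the sorted entry sequence:

import bisect

entries = [(42, 1), (42, 5), (42, 9), (57, 2)]   # <k, id>, unique by construction
entries.sort()                                    # lexicographic: by k, then id

lo = bisect.bisect_left(entries, (42, float("-inf")))
hi = bisect.bisect_right(entries, (42, float("inf")))
assert entries[lo:hi] == [(42, 1), (42, 5), (42, 9)]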

51 B+-tree: Duplicates (IBM DB2 Tuple Identities)

Expose IBM DB2 tuple identity

$ db2
(c) Copyright IBM Corporation 1993,2007
Command Line Processor for DB2 Client 9.5.0
[...]
db2 => CREATE TABLE FOO(ROWID INT GENERATED ALWAYS AS IDENTITY,
                        TEXT VARCHAR(10))
db2 => INSERT INTO FOO VALUES (DEFAULT, 'This is'), ...
db2 => SELECT * FROM FOO

ROWID       TEXT
----------- ----------
          1 This is
          2 nothing
          3 but a
          4 silly
          5 example!

52 B+-tree: Duplicates (IBM DB2 Tuple Identities)

Expose IBM DB2 tuple identity (continued)

db2 => DELETE FROM FOO WHERE TEXT = 'silly'
db2 => SELECT * FROM FOO

ROWID       TEXT
----------- ----------
          1 This is
          2 nothing
          3 but a
          5 example!

db2 => INSERT INTO FOO VALUES (DEFAULT, 'I am new.')
db2 => SELECT * FROM FOO

ROWID       TEXT
----------- ----------
          1 This is
          2 nothing
          3 but a
          6 I am new.
          5 example!

53 B+-tree: Duplicates

Other approaches alter the B+-tree implementation to add real awareness for duplicates:

1 Use variant C (see slide 3.22) to represent the index data entries k*:

      k* = ⟨k, [rid1, rid2, ...]⟩

  • Each duplicate record with key field k makes the list of rids grow. Key k is not repeatedly stored (space savings).
  • B+-tree search and maintenance routines are largely unaffected. Index data entry size varies, however (this affects the B+-tree order concept).
  • Implemented in IBM Informix Dynamic Server, for example.

B+-tree: Duplicates

2 Treat duplicate key values like any other value in insert and delete. This affects the search procedure.

 Impact of duplicate insertion on search

Given the following B+-tree of order d = 2, perform insertions (do not use redistribution): insert(2,·), insert(2,·), insert(2,·):

              [ 3 | 8 ]
             /    |    \
      [1* 2*] [4* 5*] [10* 11*]

B+-tree: Duplicates

 Impact of duplicate insertion on search (continued)

The resulting B+-tree is shown here. Now apply insert(2,·), insert(2,·) to this B+-tree:

              [ 2 | 3 | 8 ]
            /     |    |    \
   [1* 2*] [2* 2* 2*] [4* 5*] [10* 11*]

We get the tree depicted below:

              [ 2 | 2 | 3 | 8 ]
           /     |     |    |    \
   [1* 2*] [2* 2*] [2* 2* 2*] [4* 5*] [10* 11*]

⇒ In search: in a non-leaf node, follow the left-most page pointer pi such that ki ≤ k ≤ ki+1. In a leaf, also check the left sibling.

B+-tree: Key Compression

• Recall the search I/O effort s in an ISAM or B+-tree for a file of N pages. The fan-out F has been the deciding factor:

      s = ⌈log_F(N)⌉

[Figure: tree index search effort s (in I/Os, 0–10) as a function of file size N (10 to 10^7 pages) for fan-outs F = 10, 50, 100, 250, 500, 1000.]
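The plotted effect is easy to reproduce; a quick sketch (Python; page counts and fan-outs chosen to match the figure's axes):

   from math import ceil, log

   # Search I/O effort s = ceil(log_F(N)) for a file of N pages and fan-out F.
   for F in (10, 50, 100, 250, 500, 1000):
       s = [ceil(log(N, F)) for N in (10**3, 10**5, 10**7)]
       print(f"F = {F:4}: s = {s} I/Os for N = 10^3, 10^5, 10^7")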

⇒ It clearly pays off to invest effort and try to maximize the fan-out F of a given B+-tree implementation.

B+-tree: Key Compression

• Index entries in inner (i.e., non-leaf) B+-tree nodes are pairs ⟨ki, pointer to pi⟩.
• The representation of page pointers is prescribed by the DBMS's pointer representation, and especially for key field types like CHAR(·) or VARCHAR(·), we will have

      |pointer| ≪ |ki|

• To minimize the size of keys, observe that key values in inner index nodes are used only to direct traffic to the appropriate leaf page:

Excerpt of search(k)

   switch k do
     case k < k1:
       ···
     case ki ≤ k < ki+1:
       ···
     case k2d ≤ k:
       ···

⇒ The actual key values are not needed as long as we maintain their separator property.

B+-tree: Key Compression

Example (Searching a B+-tree node with VARCHAR(·) keys)

To guide the search across this B+-tree node

   [ · | ironical | · | irregular | · ] ···

it is sufficient to store the prefixes iro and irr.

 We must preserve the B+-tree semantics, though: all index entries stored in the subtree left of iro must have keys k < iro, and index entries stored in the subtree right of iro must have keys k ≥ iro (and k < irr).

B+-tree: Key Suffix Truncation

Example (Key suffix truncation, B+-tree with order d = 1)

Before key suffix truncation:

                    [ Goofy ]
                  /           \
     [ Daisy Duck ]           [ Mickey Mouse | Mini Mouse ]
      /        \               /        |          \
   [Dagobert  [Daisy        [Goofy]  [Mickey     [Mini
    Duck]      Duck]                  Mouse]      Mouse]

After key suffix truncation:

                    [ G ]
                  /       \
          [ Dai ]          [ Mic | Min ]

   (leaf level unchanged: Dagobert Duck, Daisy Duck, Goofy, Mickey Mouse, Mini Mouse)
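A key compressor only needs the shortest prefix of the right subtree's minimum key that still separates it from the left subtree's maximum key. A rough sketch of that rule (Python; real implementations work on the DBMS's internal key encodings, and this simple prefix rule reproduces only part of the slide's result):

   def shortest_separator(left_max: str, right_min: str) -> str:
       """Shortest prefix s of right_min with left_max < s <= right_min.
       Such an s keeps the B+-tree separator property intact."""
       for i in range(1, len(right_min) + 1):
           s = right_min[:i]
           if left_max < s:
               return s
       return right_min   # keys share the full prefix: no truncation possible

   assert shortest_separator("Daisy Duck", "Goofy") == "G"
   assert shortest_separator("Mickey Mouse", "Mini Mouse") == "Min"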

B+-tree: Key Suffix Truncation

 Key suffix truncation

How would a B+-tree key compressor alter the key entries in the inner node of this B+-tree snippet?

   inner node:  [ irish | ironical | irregular ] ···
   leaf level:  ··· [irksome] [iron | ironage] [irrational | irreducible] ···

B+-tree: Key Prefix Compression

• Observation: Keys within a B+-tree node often share a common prefix.

Example (Shared key prefixes in inner B+-tree nodes)

   [ ·· | Mic | Min | ··· ]   vs.   Mi: [ ·· | c | n | ··· ]
   (leaf level: Goofy, Mickey Mouse, Mini Mouse)

Key prefix compression:
• Store the common prefix only once (e.g., as "k0").
• Keys become highly discriminative now.

Violating the 50 % occupancy rule can help to improve the effectiveness of prefix compression.

Rudolf Bayer, Karl Unterauer: Prefix B-Trees. ACM TODS 2(1), March 1977.

B+-tree: Bulk Loading

• Consider the following database session (this might as well be commands executed on behalf of a database transaction):

Table and index creation

   db2 => CREATE TABLE FOO (ID INT, TEXT VARCHAR(10))

   [... insert 1,000,000 rows into table FOO ...]

   db2 => CREATE INDEX FOO_IDX ON FOO (ID ASC)

• The last SQL command initiates 1,000,000 calls to the B+-tree insert(·) procedure—a so-called index bulk load.
⇒ The DBMS will traverse the growing B+-tree index from its root down to the leaf pages 1,000,000 times.

 This is bad . . . but at least it is not as bad as swapping the order of row insertion and index creation. Why?

B+-tree: Bulk Loading

• Most DBMS installations ship with a bulk loading utility to reduce the cost of operations like the above.

B+-tree bulk loading algorithm

 1 For each record with key k in the data file, create a sorted list of pages of index leaf entries k∗.

   Note: For index variants B or C, this does not imply sorting the data file itself. (For variant A, we effectively create a clustered index.)

 2 Allocate an empty index root page and let its p0 page pointer point to the first page of sorted k∗ entries.

B+-tree: Bulk Loading

Example (State of bulk load after step 2, B+-tree of order d = 1)

   root page: [ · ]
               ↓
   [3* 4*] [6* 9*] [10* 11*] [12* 13*] [20* 22*] [23* 31*] [35* 36*] [38* 41*] [44*]

(In the original figure, index leaf pages not yet part of the B+-tree are framed; the root's p0 pointer points to the first leaf page.)

 Bulk loading continued

Can you anticipate how the bulk loading process will proceed from this point?

B+-tree: Bulk Loading

• We now use the fact that the k∗ are sorted. Any insertion will thus hit the right-most index node (just above the leaf level).
• Use a specialized bulk_insert(·) procedure that avoids B+-tree root-to-leaf traversals altogether:

B+-tree bulk loading algorithm (continued)

 3 For each leaf level page p, insert the index entry

      ⟨minimum key on p, pointer to p⟩

   into the right-most index node just above the leaf level.

The right-most node is filled left-to-right. Splits occur only on the right-most path from the leaf level up to the root.
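A compact sketch of this bottom-up idea (Python; the node representation is an illustrative assumption, and it is simplified: it keeps one separator key per child rather than a p0 pointer, and a "split" simply starts a fresh right-most node):

   # Bulk-build a B+-tree level by level from sorted leaf entries.
   # Each inner node is (keys, children); leaves are plain lists of k* entries.
   def bulk_load(entries, capacity=2):            # capacity = 2d
       level = [entries[i:i + capacity]
                for i in range(0, len(entries), capacity)]
       while len(level) > 1:                      # build the next index level
           parents = []
           for node in level:
               key = min_key(node)                # separator = minimum key
               if parents and len(parents[-1][1]) < capacity:
                   parents[-1][0].append(key)     # fill the right-most node
                   parents[-1][1].append(node)
               else:
                   parents.append(([key], [node]))  # "split": new right-most node
           level = parents
       return level[0]

   def min_key(node):
       return node[0] if not isinstance(node, tuple) else min_key(node[1][0])

   tree = bulk_load([3, 4, 6, 9, 10, 11, 12, 13, 20, 22,
                     23, 31, 35, 36, 38, 41, 44])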

B+-tree: Bulk Loading

Example (Bulk load continued)

   root page: [ 6 | 10 ]
   leaves: [3* 4*] [6* 9*] [10* 11*] [12* 13*] [20* 22*] [23* 31*] [35* 36*] [38* 41*] [44*]

After the right-most index node overflows, a split creates a new root:

   root page: [ 10 ]
             /      \
        [ 6 ]      [ 12 ]
   leaves: [3* 4*] [6* 9*] [10* 11*] [12* 13*] [20* 22*] [23* 31*] [35* 36*] [38* 41*] [44*]

B+-tree: Bulk Loading

Example (Bulk load continued)

   root page: [ 10 | 20 ]
        [ 6 ]   [ 12 ]   [ 23 | 35 ]
   leaves: [3* 4*] [6* 9*] [10* 11*] [12* 13*] [20* 22*] [23* 31*] [35* 36*] [38* 41*] [44*]

A further split on the right-most path propagates up and yields:

   root page: [ 20 ]
        [ 10 ]        [ 35 ]
   [ 6 ]  [ 12 ]  [ 23 ]  [ 38 ]
   leaves: [3* 4*] [6* 9*] [10* 11*] [12* 13*] [20* 22*] [23* 31*] [35* 36*] [38* 41*] [44*]

Composite Keys

B+-trees can (in theory⁴) be used to index everything with a defined total order, e.g.:
• integers, strings, dates, . . . , and
• concatenations thereof (based on lexicographical order).

Possible in most SQL DDL dialects:

Example (Create an index using a composite (concatenated) key)

   CREATE INDEX ON TABLE CUSTOMERS (LASTNAME, FIRSTNAME);

One useful application is partitioned B-trees:
• Leading index attributes effectively partition the resulting B+-tree.

G. Graefe: Sorting And Indexing With Partitioned B-Trees. CIDR 2003.

⁴ Some implementations won't allow you to index, e.g., large character fields.

Partitioned B-trees

Example (Index with composite key, low-selectivity key prefix)

   CREATE INDEX ON TABLE STUDENTS (SEMESTER, ZIPCODE);

 What types of queries could this index support?

The resulting B+-tree is going to look like this:

   [ SEMESTER = 1 | SEMESTER = 2 | ··· | SEMESTER = n ]
   (one leaf-level partition per SEMESTER value, each internally ordered by ZIPCODE)

It can efficiently answer queries with, e.g.,
• equality predicates on SEMESTER and ZIPCODE,
• equality on SEMESTER and a range predicate on ZIPCODE, or
• a range predicate on SEMESTER only.

Chapter 5
Multi-Dimensional Indexing
Indexing Points and Regions in k Dimensions

Architecture and Implementation of Database Systems

• B+-trees over composite keys
• Point Quad Trees (point/equality search, inserting data, region queries)
• k-d Trees (balanced construction)
• K-D-B-Trees (operations, splitting a point page, splitting a region page)
• R-Trees (searching and inserting, splitting R-tree nodes)
• UB-Trees (bit interleaving / Z-ordering, B+-trees over Z-codes, range queries)
• Spaces with High Dimensionality
• Wrap-Up

More Dimensions . . .

One SQL query, many range predicates

   SELECT *
   FROM   CUSTOMERS
   WHERE  ZIPCODE BETWEEN 8880 AND 8999
     AND  REVENUE BETWEEN 3500 AND 6000

This query involves a range predicate in two dimensions.

Typical use cases for multi-dimensional data include
• geographical data (GIS: Geographical Information Systems),
• multimedia retrieval (e.g., image or video search),
• OLAP (Online Analytical Processing), and
• queries against encoded data (e.g., XML).


. . . More Challenges . . .

Queries and data can be points or regions

[Figure: the query/data matrix — point queries and region queries over point data and region data (region inclusion or intersection); most interesting: k-nearest-neighbor search (k-NN).]

. . . and you can come up with many more meaningful types of queries over multi-dimensional data.

Note: All equality searches can be reduced to one-dimensional queries.

Points, Lines, and Regions

[Figure: ZVV fare zone map ("Tarifzonen | Fare zones") — railway stations (points), railway lines (lines), and fare zones (regions) as a real-world example of multi-dimensional data. © Zürcher Verkehrsverbund/PostAuto Region Zürich, 12.2007]

. . . More Solutions

Recent proposals for multi-dimensional index structures:

   Quad Tree [Finkel 1974]          K-D-B-Tree [Robinson 1981]
   R-tree [Guttman 1984]            Grid File [Nievergelt 1984]
   R+-tree [Sellis 1987]            LSD-tree [Henrich 1989]
   R*-tree [Beckmann 1990]          hB-tree [Lomet 1990]
   Vp-tree [Chiueh 1994]            TV-tree [Lin 1994]
   UB-tree [Bayer 1996]             hB-Π-tree [Evangelidis 1995]
   SS-tree [White 1996]             X-tree [Berchtold 1996]
   M-tree [Ciaccia 1996]            SR-tree [Katayama 1997]
   Pyramid [Berchtold 1998]         Hybrid-tree [Chakrabarti 1999]
   DABS-tree [Böhm 1999]            IQ-tree [Böhm 2000]
   Slim-tree [Faloutsos 2000]       Landmark file [Böhm 2000]
   P-Sphere-tree [Goldstein 2000]   A-tree [Sakurai 2000]

None of these provides a "fits all" solution.

Can We Just Use a B+-tree?

• Can two B+-trees, one over ZIPCODE and one over REVENUE, adequately support a two-dimensional range query?

Querying a rectangle

[Figure: querying a rectangle in the two-dimensional (ZIPCODE, REVENUE) space.]

• Can only scan along either index at once, and both of them produce many false hits.
• If all you have are these two indices, you can do index intersection: perform both scans in separation to obtain the rids of candidate tuples. Then compute the intersection between the two rid lists (DB2: IXAND).

IBM DB2: Index Intersection

IBM DB2 uses index intersection (operator IXAND).

   SELECT *
   FROM   ACCEL
   WHERE  PRE BETWEEN 0 AND 10000
     AND  POST BETWEEN 10000 AND 20000

[Figure: the plan selected by the query optimizer, combining both index scans via IXAND.]

Relevant indexes defined over table ACCEL:
 1 IPOST over column POST
 2 SQL0801192024500 over column PRE (primary key)

Can Composite Keys Help?

Indexes with composite keys

[Figure: leaf-entry ordering in the (ZIPCODE, REVENUE) plane for a ⟨REVENUE, ZIPCODE⟩ index vs. a ⟨ZIPCODE, REVENUE⟩ index.]

• Almost the same thing! Indexes over composite keys are not symmetric: the major attribute dominates the organization of the B+-tree.
• But: Since the minor attribute is also stored in the index (part of the keys k), we may discard non-qualifying tuples before fetching them from the data pages.

Multi-Dimensional Indices

• B+-trees can answer one-dimensional queries only.¹
• We would like to have a multi-dimensional index structure that
  • is symmetric in its dimensions,
  • clusters data in a space-aware fashion,
  • is dynamic with respect to updates, and
  • provides good support for useful queries.
• We will start with data structures that have been designed for in-memory use, then tweak them into disk-aware database indices (e.g., organize data in a page-wise fashion).

¹ Toward the end of this chapter, we will see UB-trees, a nifty trick that uses B+-trees to support some multi-dimensional queries.

"Binary" Search Tree

In k dimensions, a "binary tree" becomes a 2^k-ary tree.

Point quad tree (k = 2)

[Figure: a set of two-dimensional points; each point splits its enclosing region into four quadrants.]

• Each data point partitions the data space into 2^k disjoint regions.
• In each node, a region points to another node (representing a refined partitioning) or to a special null value.
• This data structure is a point quad tree.

Finkel and Bentley. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica, vol. 4, 1974.

Searching a Point Quad Tree

[Figure: the point quad tree (k = 2) from the previous slide.]

Point quad tree search

    1 Function: p_search (q, node)
    2   if q matches data point in node then
    3     return data point;
    4   else
    5     P ← partition containing q;
    6     if P points to null then
    7       return not found;
    8     else
    9       node′ ← node pointed to by P;
   10       return p_search (q, node′);

    1 Function: pointsearch (q)
    2   return p_search (q, root);

Inserting into a Point Quad Tree

Inserting a point qnew into a quad tree happens analogously to an insertion into a binary tree:

 1 Traverse the tree just like during a search for qnew until you encounter a partition P with a null pointer.
 2 Create a new node n′ that spans the same area as P and is partitioned by qnew, with all partitions pointing to null.
 3 Let P point to n′.

Note that this procedure does not keep the tree balanced. A sketch of both operations follows below.
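A minimal in-memory sketch of a point quad tree with exact-match search and insertion (Python; the node layout and quadrant numbering are illustrative assumptions):

   class QuadNode:
       """A point quad tree node: one data point, 2^2 = 4 children."""
       def __init__(self, point):
           self.point = point             # (x, y)
           self.child = [None] * 4        # one slot per quadrant

   def quadrant(node, q):
       """Index of the partition of node that contains point q."""
       x, y = node.point
       return (0 if q[0] < x else 1) + (0 if q[1] < y else 2)

   def p_search(q, node):
       if node is None:
           return None                    # not found
       if node.point == q:
           return node
       return p_search(q, node.child[quadrant(node, q)])

   def p_insert(q, node):
       """Insert q below node; returns the (possibly new) subtree root."""
       if node is None:
           return QuadNode(q)             # new node, all partitions null
       if node.point != q:                # ignore exact duplicates
           i = quadrant(node, q)
           node.child[i] = p_insert(q, node.child[i])
       return node

   root = None
   for p in [(5, 5), (2, 8), (8, 1)]:
       root = p_insert(p, root)
   assert p_search((8, 1), root) is not None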

Example (Insertions into an empty point quad tree)

[Figure: step-by-step insertion of points into an initially empty quad tree; each new point refines one existing partition.]

Range Queries

To evaluate a range query², we may need to follow several children of a quad tree node node:

Point quad tree range search

    1 Function: r_search (Q, node)
    2   if data point in node is in Q then
    3     append data point to result;
    4   foreach partition P in node that intersects with Q do
    5     node′ ← node pointed to by P;
    6     r_search (Q, node′);

    1 Function: regionsearch (Q)
    2   return r_search (Q, root);

² We consider rectangular regions only; other shapes may be answered by querying for the bounding rectangle and post-processing the output.

Point Quad Trees—Discussion

Point quad trees
✓ are symmetric with respect to all dimensions and
✓ can answer point queries and region queries.

But
✗ the shape of a quad tree depends on the insertion order of its content; in the worst case it degenerates into a linked list,
✗ null pointers are space inefficient (particularly for large k).

In addition,
✗ they can only store point data.

Also remember that quad trees were designed for main memory operation.

k-d Trees

Sample k-d tree (k = 2)

[Figure: a two-dimensional point set partitioned by alternating vertical and horizontal split lines.]

• Index k-dimensional data, but keep the tree binary.
• For each tree level l, use a different discriminator dimension d_l along which to partition. Typically: round robin.
• This is a k-d tree.

Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Comm. ACM, vol. 18, no. 9, Sept. 1975.

k-d Trees

k-d trees inherit the positive properties of the point quad trees, but improve on space efficiency. For a given point set, we can also construct a balanced k-d tree:³

k-d tree construction (Bentley's OPTIMIZE algorithm)

    1 Function: kdtree (pointset, level)
    2   if pointset is empty then
    3     return null;
    4   else
    5     p ← median from pointset (along d_level);
    6     points_left  ← { v ∈ pointset where v_d_level < p_d_level };
    7     points_right ← { v ∈ pointset where v_d_level ≥ p_d_level } \ {p};
    8     n ← new k-d tree node, with data point p;
    9     n.left  ← kdtree (points_left, level + 1);
   10     n.right ← kdtree (points_right, level + 1);
   11     return n;
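The same algorithm as a runnable sketch (Python; the tuple-based node representation is an illustrative choice, and ties on the discriminator go left here, whereas the pseudocode above puts them right):

   def kdtree(points, level=0, k=2):
       """Balanced k-d tree: split on the median along dimension level % k."""
       if not points:
           return None
       d = level % k                          # round-robin discriminator
       points = sorted(points, key=lambda v: v[d])
       m = len(points) // 2                   # median position
       p = points[m]
       return (p,
               kdtree(points[:m], level + 1, k),        # v[d] below median
               kdtree(points[m + 1:], level + 1, k))    # v[d] at/above, minus p

   tree = kdtree([(1, 9), (2, 3), (4, 1), (3, 7),
                  (5, 4), (6, 8), (7, 2), (8, 5)])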


³ v_i: coordinate i of point v.

Balanced k-d Tree Construction

Example (Step-by-step balanced k-d tree construction)

[Figure: the point set {a, . . . , h} is split recursively at the median, alternating between the two dimensions.]

Resulting tree shape: root d; next level e and c; below them g, a, h, f; leaf b.

K-D-B-Trees

k-d trees improve on some of the deficiencies of point quad trees:
✓ We can balance a k-d tree by re-building it. (For a limited number of points and in-memory processing, this may be sufficient.)
✓ We're no longer wasting big amounts of space.

It's time to bring k-d trees to the disk. The K-D-B-Tree
• uses pages as an organizational unit (e.g., each node in the K-D-B-tree fills a page) and
• uses a k-d tree-like layout to organize each page.

John T. Robinson. The K-D-B-Tree: A Search Structure for Large Multidimensional Dynamic Indexes. SIGMOD 1981.

K-D-B-Tree Idea

[Figure: region pages (pages 0–3) recursively partition the plane; point pages (pages 4, 5, . . . ) hold the data points.]

Region pages:
• contain entries ⟨region, pageID⟩
• no null pointers
• form a balanced tree
• all regions disjoint and rectangular

Point pages:
• contain entries ⟨point, rid⟩
• ↝ B+-tree leaf nodes

K-D-B-Tree Operations

• Searching a K-D-B-Tree works straightforwardly:

  • Within each page, determine all regions Ri that contain the query point q (intersect with the query region Q).
  • For each of the Ri, consult the page it points to and recurse.
  • On point pages, fetch and return the corresponding record for each matching data point pi.

• When inserting data, we keep the K-D-B-Tree balanced, much like we did in the B+-tree:
  • Simply insert a ⟨region, pageID⟩ (⟨point, rid⟩) entry into a region page (point page) if there's sufficient space.
  • Otherwise, split the page.

K-D-B-Tree: Splitting a Point Page

Splitting a point page p

 1 Choose a dimension i and an i-coordinate xi along which to split, such that the split will result in two pages that are not overfull.
 2 Move data points p with pi < xi and pi ≥ xi to new pages pleft and pright (respectively).
 3 Replace ⟨region, p⟩ in the parent of p with ⟨left region, pleft⟩ and ⟨right region, pright⟩.

Step 3 may cause an overflow in p's parent and, hence, lead to a split of a region page.

K-D-B-Tree: Splitting a Region Page

• Splitting a point page and moving its data points to the resulting pages is straightforward.
• In case of a region page split, by contrast, some regions may intersect with both sides of the split. Consider, e.g., page 0 on slide 19:

[Figure: a split line through region page 0; one region crosses the split.]

• Such regions need to be split, too.
• This can cause a recursive splitting downward (!) the tree.

Example: Page 0 Split in K-D-B-Tree of Slide 19

Split of region page 0

[Figure: after the split, a new root page points to region pages 0 and 6; region page 1 has been split into pages 1 and 7; pages 2 and 3 are unchanged.]

• Root page 0 ⇒ pages 0 and 6 (⇒ creation of a new root).
• Region page 1 ⇒ pages 1 and 7 (point pages not shown).

K-D-B-Trees: Discussion

K-D-B-Trees
✓ are symmetric with respect to all dimensions,
✓ cluster data in a space-aware and page-oriented fashion,
✓ are dynamic with respect to updates, and
✓ can answer point queries and region queries.

However,
✗ we still don't have support for region data and
✗ K-D-B-Trees (like k-d trees) will not handle deletes dynamically.

This is because we always partitioned the data space such that
• every region is rectangular and
• regions never intersect.

R-Trees

R-trees do not have the disjointness requirement:
• R-tree inner or leaf nodes contain ⟨region, pageID⟩ or ⟨region, rid⟩ entries (respectively). region is the minimum bounding rectangle that spans all data items reachable by the respective pointer.
• Every node contains between d and 2d entries (↝ B+-tree).⁴
• Insertion and deletion algorithms keep an R-tree balanced at all times.
⇒ R-trees allow the storage of point and region data.

Antonin Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD 1984.

⁴ except the root node

R-Tree: Example

• order: d = 2
• region data

[Figure: nested minimum bounding rectangles in the plane; inner nodes (pages 0–3) and leaf nodes (pages 6–9) of the R-tree.]

R-Tree: Searching and Inserting

While searching an R-tree, we may have to descend into more than one child node for point and region queries (↝ range search in point quad trees, slide 13).

Inserting into an R-tree (cf. B+-tree insertion)

 1 Choose a leaf node n to insert the new entry.
    • Try to minimize the necessary region enlargement(s).
 2 If n is full, split it (resulting in n and n′) and distribute old and new entries evenly across n and n′.
    • Splits may propagate bottom-up and eventually reach the root (↝ B+-tree).
 3 After the insertion, some regions in the ancestor nodes of n may need to be adjusted to cover the new entry.

Splitting an R-Tree Node

To split an R-tree node, we generally have more than one alternative:

[Figure: two ways to split the same set of rectangles; a "bad" split with large, strongly overlapping covering regions vs. a "good" split with small total area.]

Heuristic: Minimize the total covered area. But:
• Exhaustive search for the best split is infeasible.
• Guttman proposes two ways to approximate the search.
• Follow-up papers (e.g., the R*-tree) aim at improving the quality of node splits. A small sketch of the split cost measure follows below.
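To make the heuristic concrete, here is a tiny sketch of the cost measure (Python; the rectangle representation and function names are illustrative assumptions):

   def mbr(rects):
       """Minimum bounding rectangle of (x1, y1, x2, y2) rectangles."""
       x1s, y1s, x2s, y2s = zip(*rects)
       return (min(x1s), min(y1s), max(x2s), max(y2s))

   def area(r):
       x1, y1, x2, y2 = r
       return (x2 - x1) * (y2 - y1)

   def split_cost(group1, group2):
       """Total area covered by the two resulting nodes (to be minimized)."""
       return area(mbr(group1)) + area(mbr(group2))

   # A "good" split keeps rectangles that are close together in the same node:
   left  = [(0, 0, 2, 2), (1, 1, 3, 3)]
   right = [(8, 8, 10, 10), (9, 7, 11, 9)]
   assert split_cost(left, right) < split_cost([left[0], right[0]],
                                               [left[1], right[1]])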

R-Tree: Deletes

All R-tree invariants (see slide 25) are maintained during deletions.

1 If an R-tree node n underflows (i.e., less than d entries are left after a deletion), the whole node is deleted.

 2 Then, all entries that existed in n are re-inserted into the R-tree.

Note that Step 1 may lead to a recursive deletion of n's parent.
• Deletion, therefore, is a rather expensive task in an R-tree.

Spatial indexing in mainstream database implementations

• Spatial indexing in commercial database systems is typically based on R-trees.
• Yet, only a few systems implement them out of the box (e.g., PostgreSQL). Most require the licensing/use of specific extensions.

Bit Interleaving

• We saw earlier that a B+-tree over concatenated fields ⟨a, b⟩ does not help our case, because of the asymmetry between the roles of a and b in the index key.
• What happens if we interleave the bits of a and b (hence, make the B+-tree "more symmetric")?

   a = 42: 00101010          b = 17: 00010001

   ⟨a, b⟩ (concatenation):  0010101000010001
   a and b interleaved:     0000100110001001
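The interleaving step itself is easy to implement; a minimal sketch (Python; the 8-bit field width matches the example above):

   def interleave(a: int, b: int, bits: int = 8) -> int:
       """Z-code of (a, b): bits of a and b alternate, a first."""
       z = 0
       for i in reversed(range(bits)):        # from most significant bit down
           z = (z << 2) | (((a >> i) & 1) << 1) | ((b >> i) & 1)
       return z

   assert interleave(42, 17) == 0b0000100110001001   # matches the figure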

Z-Ordering

[Figure: the (a, b) value space linearized two ways: ⟨a, b⟩ concatenation scans the space stripe by stripe, while bit interleaving traces the recursive Z-shaped curve.]

• Both approaches linearize all coordinates in the value space according to some order (see also slide 8).
• Bit interleaving leads to what is called the Z-order.
• The Z-order (largely) preserves spatial clustering.


UB-Tree: B+-trees Over Z-Codes

• Use a B+-tree to index Z-codes of multi-dimensional data.
• Each leaf in the B+-tree describes an interval in the Z-space.
• Each interval in the Z-space describes a region in the multi-dimensional data space:

UB-Tree: B+-tree over Z-codes

[Figure: data points x1–x10 in the plane, grouped into Z-regions; below, the corresponding B+-tree leaf intervals on the Z-axis: [0,3], [4,20], [21,35], [36,47], [48,63].]

• To retrieve all data points in a query region Q, try to touch only those leaf pages that intersect with Q.


UB-Tree: Range Queries

After each processed page, perform an index re-scan to fetch the next leaf page that intersects with Q. (Example by Volker Markl and Rudolf Bayer, taken from http://mistral.in.tum.de/)

UB-tree range search (Function ub_range(Q))

    1 cur ← z(Q_bottom,left);
    2 while true do
         // search B+-tree page containing cur
    3    page ← search(cur);
    4    foreach data point p on page do
    5      if p is in Q then
    6        append p to result;

    7    if region in page reaches beyond Q_top,right then
    8      break;
         // compute next Z-address using Q and data on current page
    9    cur ← get_next_z_address(Q, page);

UB-Trees: Discussion

• Routine get_next_z_address() (return the next Z-address lying in the query region) is non-trivial and depends on the shape of the Z-region.
• UB-trees are fully dynamic, a property inherited from the underlying B+-trees.
• The use of other space-filling curves to linearize the data space is discussed in the literature, too, e.g., Hilbert curves.
• UB-trees have been commercialized in the Transbase® database system.


Spaces with High Dimensionality

For large k, all techniques that have been discussed become ineffective:
• For example, for k = 100, we get 2^100 ≈ 10^30 partitions per node in a point quad tree. Even with billions of data points, almost all of these are empty.
• Consider a really big cube-shaped search region, covering 95 % of the range along each dimension:

[Figure: the query region covers 95 % of the data space along each dimension.]

For k = 100, the probability of a point being in this region is still only 0.95^100 ≈ 0.59 %.

• We experience the curse of dimensionality here.
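Both numbers are quickly verified (Python):

   print(2 ** 100)      # ~1.27e30 partitions per quad tree node for k = 100
   print(0.95 ** 100)   # ~0.0059: only about 0.59 % of all points qualify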


Page Selectivity for k-NN Search

[Figure: N = 50,000, image database, k = 10. Percentage of vector/leaf blocks visited (y-axis, 0–140 %) vs. number of dimensions in the vectors (x-axis, 0–45) for Scan, R*-Tree, X-Tree, and VA-File.]

Data: Stephen Bloch. What's Wrong with High-Dimensionality Search. VLDB 2008.


Wrap-Up

Point Quad Tree: k-dimensional analogy to binary trees; main memory only.
k-d Tree, K-D-B-Tree: k-d tree: partition space one dimension at a time (round-robin); K-D-B-Tree: B+-tree-like organization with pages as nodes; nodes use a k-d-tree-like structure internally.
R-Tree: regions within a node may overlap; fully dynamic; for point and region data.
UB-Tree: use a space-filling curve (Z-order) to linearize k-dimensional data, then use a B+-tree.
Curse of Dimensionality: most indexing structures become ineffective for large k; fall back to sequential scanning and approximation/compression.


Chapter 6
Hash-Based Indexing
Efficient Support for Equality Search

Architecture and Implementation of Database Systems

• Hash-Based Indexing
• Static Hashing (hash functions)
• Extendible Hashing (search, insertion, procedures)
• Linear Hashing (insertion: split and rehashing; running example; procedures)

Hash-Based Indexing

• We now turn to a different family of index structures: hash indexes.
• Hash indexes are "unbeatable" when it comes to support for equality selections:

Equality selection

   SELECT *
   FROM   R
   WHERE  A = k

• Further, other query operations internally generate a flood of equality tests (e.g., nested-loop join). (Non-)presence of hash index support can make a real difference in such scenarios.

Hashing vs. B+-trees

• Hash indexes provide no support for range queries, however (hash indexes are also known as scatter storage).
• In a B+-tree world, to locate a record with key k means to compare k with other keys k′ organized in a (tree-shaped) search data structure.
• Hash indexes use the bits of k itself (independent of all other stored records) to find the location of the associated record.
• We will now briefly look into static hashing to illustrate the basics. Static hashing does not handle updates well (much like ISAM).
• Later, we introduce extendible hashing and linear hashing, which refine the hashing principle and adapt well to record insertions and deletions.

Static Hashing

• To build a static hash index on attribute A:

Build static hash index on column A

 1 Allocate a fixed area of N (successive) disk pages, the so-called primary buckets.
 2 In each bucket, install a pointer to a chain of overflow pages (initially set the pointer to null).
 3 Define a hash function h with range [0, . . . , N − 1]. The domain of h is the type of A, e.g.,

      h : INTEGER → [0, . . . , N − 1]

   if A is of SQL type INTEGER.

Static Hashing

Static hash table

[Figure: hash table with primary buckets 0, 1, 2, . . . , N−1; h maps a key k to its bucket; chains of overflow pages hang off the primary buckets.]

• A primary bucket and its associated chain of overflow pages is referred to as a bucket.
• Each bucket contains index entries k∗ (implemented using any of the variants A, B, C; see slide 2.22).

Static Hashing

• To perform hsearch(k) (or hinsert(k)/hdelete(k)) for a record with key A = k:

Static hashing scheme

 1 Apply hash function h to the key value, i.e., compute h(k).
 2 Access the primary bucket page with number h(k).
 3 Search (insert/delete) the subject record on this page or, if required, access the overflow chain of bucket h(k).

• If the hashing scheme works well and overflow chain access is avoidable,
  • hsearch(k) requires a single I/O operation,
  • hinsert(k)/hdelete(k) require two I/O operations.
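A toy static hash table that makes the overflow behavior visible (Python; the fixed bucket capacity and list-based "pages" are illustrative simplifications of disk pages):

   N, CAPACITY = 4, 2                  # primary buckets, entries per page

   def h(k):                           # hash by division
       return k % N

   primary  = [[] for _ in range(N)]   # primary bucket pages
   overflow = [[] for _ in range(N)]   # overflow chain per bucket

   def hinsert(k):
       page = primary[h(k)]
       if len(page) < CAPACITY:
           page.append(k)              # the good case: 2 I/Os on disk
       else:
           overflow[h(k)].append(k)    # chain grows -> extra I/Os

   def hsearch(k):
       b = h(k)
       return k in primary[b] or k in overflow[b]

   for k in (3, 7, 11, 15, 4):         # 3, 7, 11, 15 all collide in bucket 3
       hinsert(k)
   assert hsearch(11) and not hsearch(8)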

Static Hashing: Collisions and Overflow Chains

• At least for static hashing, overflow chain management is important.
• Generally, we do not want hash function h to avoid collisions, i.e.,

      h(k) = h(k′) even if k ≠ k′

  (otherwise we would need as many primary bucket pages as different key values in the data file).
• At the same time, we want h to scatter the key attribute domain evenly across [0, . . . , N − 1] to avoid the development of long overflow chains for few buckets. These make the hash table's I/O behavior non-uniform and unpredictable for a query optimizer.

• Such “good” hash functions are hard to discover, unfortunately.

The Birthday Paradox (Need for Overflow Chain Management)

Example (The birthday paradox)

Consider the people in a group as the domain and use their birthday as hash function h (h : Person → [0, . . . , 364]). If the group has 23 or more members, chances are > 50 % that two people share the same birthday (collision).

Check: Compute the probability that n people all have different birthdays:

    1 Function: different_birthday (n)
    2   if n = 1 then
    3     return 1;
    4   else
    5     return different_birthday (n − 1) × (365 − (n − 1)) / 365;
            // different_birthday(n − 1): probability that n − 1 persons have
            //   different birthdays
            // (365 − (n − 1)) / 365: probability that the nth person's birthday
            //   differs from the first n − 1
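Transcribed directly into Python, the check confirms the 50 % claim:

   def different_birthday(n):
       """Probability that n people all have different birthdays."""
       if n == 1:
           return 1.0
       return different_birthday(n - 1) * (365 - (n - 1)) / 365

   print(1 - different_birthday(23))   # ~0.507: collision chance already > 50 %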

Hash Functions

• It is impossible to generate truly random hash values from the non-random key values found in actual tables. Can we define hash functions that scatter even better than a random function?

Hash function

 1 By division. Simply define

      h(k) = k mod N .

   This guarantees the range of h(k) to be [0, . . . , N − 1].
   Note: Choosing N = 2^d for some d effectively considers the least d bits of k only. Prime numbers work best for N.

 2 By multiplication. Extract the fractional part of Z · k (for a specific Z¹) and multiply by the arbitrary hash table size N:

      h(k) = ⌊N · (Z · k − ⌊Z · k⌋)⌋

¹ The (inverse) golden ratio Z = 2/(√5 + 1) ≈ 0.6180339887 is a good choice. See D. E. Knuth, "Sorting and Searching."
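Both schemes in executable form (Python; the table size N and sample keys are arbitrary choices):

   from math import sqrt

   N = 13                          # prime table sizes work best for division
   Z = 2 / (sqrt(5) + 1)           # inverse golden ratio, ~0.6180339887

   def h_div(k):
       return k % N                # division method

   def h_mul(k):
       frac = Z * k - int(Z * k)   # fractional part of Z * k
       return int(N * frac)        # scaled to [0, ..., N-1]

   for k in (42, 4711, 12345):
       print(k, h_div(k), h_mul(k))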

Static Hashing and Dynamic Files

• For a static hashing scheme:
  • If the underlying data file grows, the development of overflow chains spoils the otherwise predictable hash I/O behavior (1–2 I/O operations).
  • If the underlying data file shrinks, a significant fraction of primary hash buckets may be (almost) empty—a waste of page space.
• As in the ISAM case, however, static hashing has advantages when it comes to concurrent access.
• We may periodically rehash the data file to restore the ideal situation (20 % free space, no overflow chains).
⇒ Expensive, and the index cannot be used while rehashing is in progress.

Extendible Hashing

• Extendible hashing can adapt to growing (or shrinking) data files.
• To keep track of the actual primary buckets that are part of the current hash table, we hash via an in-memory bucket directory:

Example (Extendible hash table setup; ignore the ⟨2⟩ fields for now²)

   directory (global depth 2)   buckets (local depth 2 each)
   00 → bucket A: 4* 12* 32* 16*
   01 → bucket B: 1* 5* 21*
   10 → bucket C: 10*
   11 → bucket D: 15* 7* 19*

² Note: This figure depicts the entries as h(k)∗, not k∗.

Extendible Hashing: Search

Search for a record with key k

 1 Apply h, i.e., compute h(k).
 2 Consider the last 2 bits of h(k) and follow the corresponding directory pointer to find the bucket.

Example (Search for a record)

To find a record with key k such that h(k) = 5 = 101₂, follow the second directory pointer (101₂ ∧ 11₂ = 01₂) to bucket B, then use entry 5∗ to access the wanted record.

Extendible Hashing: Global and Local Depth

Global and local depth annotations

• Global depth (⟨n⟩ at the hash directory): Use the last n bits of h(k) to look up a bucket pointer in the directory (the directory size is 2^n).
• Local depth (⟨d⟩ at individual buckets): The hash values h(k) of all entries in this bucket agree on their last d bits.

Extendible Hashing: Insert

Insert record with key k

 1 Apply h, i.e., compute h(k).
 2 Use the last n bits of h(k) to look up the bucket pointer in the directory.
 3 If the primary bucket still has capacity, store h(k)∗ in it. (Otherwise . . . ?)

Example (Insert record with h(k) = 13 = 1101₂)

   directory (global depth 2)   buckets (local depth 2 each)
   00 → bucket A: 4* 12* 32* 16*
   01 → bucket B: 1* 5* 21* 13*
   10 → bucket C: 10*
   11 → bucket D: 15* 7* 19*

Extendible Hashing: Insert, Bucket Split

Example (Insert record with h(k) = 20 = 10100₂)

Insertion of a record with h(k) = 20 = 10100₂ leads to overflow in primary bucket A. Initiate a bucket split for A.

 1 Split bucket A (creating a new bucket A2) and use bit position d + 1 to redistribute the entries:

       4 =    100₂
      12 =   1100₂
      32 = 100000₂
      16 =  10000₂
      20 =  10100₂

   bit d+1 = 0: bucket A:  32* 16*
   bit d+1 = 1: bucket A2: 4* 12* 20*

Note: We now need 3 bits to discriminate between the old bucket A and the new split bucket A2.

Extendible Hashing: Insert, Directory Doubling

Example (Insert record with h(k) = 20 = 10100₂, continued)

 2 In the present case, we need to double the directory by simply copying its original pages (we now use 2 + 1 = 3 bits to look up a bucket pointer).
 3 Let the bucket pointer for 100₂ point to A2 (the directory pointer for 000₂ still points to bucket A):

   directory (global depth 3)
   000 → bucket A:  32* 16*         (local depth 3)
   001 → bucket B:  1* 5* 21* 13*   (local depth 2)
   010 → bucket C:  10*             (local depth 2)
   011 → bucket D:  15* 7* 19*      (local depth 2)
   100 → bucket A2: 4* 12* 20*      (local depth 3)
   101 → bucket B
   110 → bucket C
   111 → bucket D

Extendible Hashing: Insert

If we split a bucket with local depth d < n (global depth), directory doubling is not necessary:

Example (Insert record with h(k) = 9 = 1001₂)
• Insert a record with key k such that h(k) = 9 = 1001₂.
• The associated bucket B is split, creating a new bucket B2. Entries are redistributed. The new local depth of B and B2 is 3 and thus does not exceed the global depth of 3.
⇒ Modifying the directory's bucket pointer for 101₂ is sufficient (see following slide).

Extendible Hashing: Insert

Example (After insertion of record with h(k) = 9 = 1001₂)

    directory        hash table buckets (local depths in parentheses)
    000 ──→ bucket A:  32* 16*       (3)
    001 ──→ bucket B:  1* 9*         (3)
    010 ──→ bucket C:  10*           (2)
    011 ──→ bucket D:  15* 7* 19*    (2)
    100 ──→ bucket A2: 4* 12* 20*    (3)
    101 ──→ bucket B2: 5* 21* 13*    (3)
    110 ──→ bucket C
    111 ──→ bucket D

Extendible Hashing: Search Procedure

• The following hsearch(·) and hinsert(·) procedures operate over an in-memory array representation of the bucket directory bucket[0, ..., 2^n − 1].

Extendible Hashing: Search

1 Function: hsearch(k)
2   n ← n ;                    /* global depth */
3   b ← h(k) & (2^n − 1) ;     /* mask all but the low n bits */
4   return bucket[b] ;

Extendible Hashing: Insert Procedure

Extendible Hashing: Insertion

1 Function: hinsert(k*)
2   n ← n ;                    /* global depth */
3   b ← hsearch(k) ;
4   if b has capacity then
5     Place k* in bucket b ;
6     return ;
    /* overflow in bucket b, need to split */
7   d ← d_b ;                  /* local depth of hash bucket b */
8   Create a new empty bucket b2 ;
    /* redistribute entries of b including k* */
9   ...

Extendible Hashing: Insert Procedure (continued)

Extendible Hashing: Insertion (cont'd)

 1   ...                       /* redistribute entries of b including k* */
 2   foreach k'* in bucket b do
 3     if h(k') & 2^d ≠ 0 then
 4       Move k'* to bucket b2 ;
     /* new local depths for buckets b and b2 */
 5   d_b ← d + 1 ;
 6   d_b2 ← d + 1 ;
 7   if n < d + 1 then
       /* we need to double the directory */
 8     Allocate 2^n new directory entries bucket[2^n, ..., 2^(n+1) − 1] ;
 9     Copy bucket[0, ..., 2^n − 1] into bucket[2^n, ..., 2^(n+1) − 1] ;
10     n ← n + 1 ;
     /* update the bucket directory to point to b2 */
11   bucket[(h(k) & (2^n − 1)) | 2^d] ← addr(b2) ;
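To make the directory arithmetic concrete, here is a minimal in-memory Python sketch of the two procedures above (an illustration under our own assumptions, not the slides' code: integer keys, an identity hash h(k) = k, a fixed bucket capacity, and an initial global depth of 1):

    # In-memory extendible hashing sketch: bucket split + directory doubling.
    CAPACITY = 4               # assumed bucket capacity (entries per page)

    def h(k):                  # assumed hash function (identity, as on the slides)
        return k

    class Bucket:
        def __init__(self, d):
            self.d = d         # local depth
            self.entries = []  # stored hash values h(k)

    class ExtendibleHashTable:
        def __init__(self):
            self.n = 1                               # global depth
            self.bucket = [Bucket(1), Bucket(1)]     # directory: 2**n pointers

        def hsearch(self, k):
            return self.bucket[h(k) & (2**self.n - 1)]

        def hinsert(self, k):
            b = self.hsearch(k)
            if len(b.entries) < CAPACITY:            # primary bucket has capacity
                b.entries.append(h(k))
                return
            d = b.d                                  # split b on bit position d
            b2 = Bucket(d + 1)
            pending = b.entries + [h(k)]
            b.entries  = [e for e in pending if e & (1 << d) == 0]
            b2.entries = [e for e in pending if e & (1 << d) != 0]
            b.d = d + 1
            if self.n < d + 1:                       # double the directory
                self.bucket += self.bucket[:]
                self.n += 1
            for i in range(2**self.n):               # redirect slots with bit d = 1
                if self.bucket[i] is b and i & (1 << d):
                    self.bucket[i] = b2
            # skewed data may leave a bucket overfull; a real system would
            # split again or hang an overflow chain off the bucket

    ht = ExtendibleHashTable()
    for k in [4, 12, 32, 16, 1, 5, 21, 10, 15, 7, 19, 13, 20, 9]:
        ht.hinsert(k)                                # replays the running example
    assert 20 in ht.hsearch(20).entries

Replaying the slides' running example this way ends with global depth 3 and the buckets A, B, C, D, A2, B2 exactly as depicted above.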

Extendible Hashing: Overflow Chains? / Delete

 Overflow chains?
Extendible hashing uses overflow chains hanging off a bucket only as a last resort. Under which circumstances will extendible hashing create an overflow chain?

If considering d + 1 bits does not lead to a satisfying record redistribution in procedure hinsert(k*) (skewed data, hash collisions).

• Deleting an entry k* from a bucket may leave its bucket completely (or almost) empty.
• Extendible hashing then tries to merge the empty bucket and its associated partner bucket.

 Extendible hashing: deletion
When is the local depth decreased? When is the global depth decreased? (Try to work out the details on your own.)

Linear Hashing

• Linear hashing can, just like extendible hashing, adapt its underlying data structure to record insertions and deletions:
  • Linear hashing does not need a hash directory in addition to the actual hash table buckets.
  • Linear hashing can define flexible criteria that determine when a bucket is to be split.
  • Linear hashing, however, may perform badly if the key distribution in the data file is skewed.
• We will now investigate linear hashing in detail and come back to the points above as we go along.
• The core idea behind linear hashing is to use an ordered family of hash functions, h0, h1, h2, ... (traditionally the subscript is called the hash function's level).

Linear Hashing: Hash Function Family

• We design the family so that the range of h_level+1 is twice as large as the range of h_level (for level = 0, 1, 2, ...).

Example (h_level with range [0, ..., N − 1])

    h_level     has range  [0, ..., N − 1]
    h_level+1   has range  [0, ..., 2·N − 1]
    h_level+2   has range  [0, ..., 4·N − 1]

Linear Hashing: Hash Function Family

• Given an initial hash function h and an initial hash table size N, one approach to define such a family of hash functions h0, h1, h2, ... would be:

Hash function family

    h_level(k) = h(k) mod (2^level · N)    (level = 0, 1, 2, ...)
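In Python, this family amounts to a single modulo expression (a sketch; the identity hash and N = 4 match the running example further below):

    def h_(level, k, N=4):
        # h_level(k) = h(k) mod (2**level * N), with h(k) = k
        return k % (2**level * N)

    assert h_(0, 43) == 3    # 43 = 101011₂: last two bits, 11₂
    assert h_(1, 43) == 3    # last three bits, 011₂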

Linear Hashing: Basic Scheme

Basic linear hashing scheme
1  Initialize: level ← 0, next ← 0.
2  The current hash function in use for searches (insertions/deletions) is h_level; active hash table buckets are those in h_level's range: [0, ..., 2^level · N − 1].
3  Whenever we realize that the current hash table overflows, e.g.,
   • insertions filled a primary bucket beyond c % capacity,
   • or the overflow chain of a bucket grew longer than p pages,
   • or ⟨insert your criterion here⟩,
   we split the bucket at hash table position next (in general, this is not the bucket which triggered the split!) 

Linear Hashing: Bucket Split

Linear hashing: bucket split
1  Allocate a new bucket, append it to the hash table (its position will be 2^level · N + next).
2  Redistribute the entries in bucket next by rehashing them via h_level+1 (some entries remain in bucket next, some go to bucket 2^level · N + next).
3  Increment next by 1.
   ⇒ All buckets at positions < next have been rehashed.

Linear Hashing: Rehashing

Searches need to take the current next position into account:

    h_level(k) < next : we hit an already split bucket, rehash via h_level+1
    h_level(k) ≥ next : we hit a yet unsplit bucket, bucket found

Example (Current state of linear hashing scheme)

    positions 0 ... next − 1            : buckets already split (h_level+1)
    position next                       : next bucket to be split
    positions next ... 2^level · N − 1  : unsplit buckets (h_level)
    positions ≥ 2^level · N             : images of already split buckets (h_level+1)

Linear Hashing: Split Rounds

 When next is incremented beyond the hash table size . . . ?
A bucket split increments next by 1 to mark the next bucket to be split. How would you propose to handle the situation when next is incremented beyond the last current hash table position, i.e., next > 2^level · N − 1?

Answer:
• If next > 2^level · N − 1, all buckets in the current hash table are hashed via function h_level+1.
  ⇒ Proceed in a round-robin fashion: if next > 2^level · N − 1, then
  1  increment level by 1,
  2  next ← 0 (start splitting from the hash table top again).
• In general, an overflowing bucket is not split immediately, but—due to round-robin splitting—no later than in the following round.

Linear Hashing: Running Example

Linear hash table setup:
• Bucket capacity of 4 entries, initial hash table size N = 4.
• Split criterion: allocation of a page in an overflow chain.

Example (Linear hash table, h_level(k)* shown)

    level = 0
    h1    h0    hash buckets           overflow pages
    000   00    32* 44* 36*            ← next
    001   01    9* 25* 5*
    010   10    14* 18* 10* 30*
    011   11    31* 35* 7* 11*

Linear Hashing: Running Example

Example (Insert record with key k such that h0(k) = 43 = 101011₂)

    level = 0
    h1    h0    hash buckets           overflow pages
    000   00    32*
    001   01    9* 25* 5*              ← next
    010   10    14* 18* 10* 30*
    011   11    31* 35* 7* 11*         43*
    100         44* 36*

Linear Hashing: Running Example

Example (Insert record with key k such that h0(k) = 37 = 100101₂)

    level = 0
    h1    h0    hash buckets           overflow pages
    000   00    32*
    001   01    9* 25* 5* 37*          ← next
    010   10    14* 18* 10* 30*
    011   11    31* 35* 7* 11*         43*
    100         44* 36*

Linear Hashing: Running Example

Example (Insert record with key k such that h0(k) = 29 = 11101₂)

    level = 0
    h1    h0    hash buckets           overflow pages
    000   00    32*
    001   01    9* 25*
    010   10    14* 18* 10* 30*        ← next
    011   11    31* 35* 7* 11*         43*
    100         44* 36*
    101         5* 37* 29*

Linear Hashing: Running Example

Example (Insert three records with key k such that h0(k) = 22 = 10110₂ / 66 = 1000010₂ / 34 = 100010₂)

    level = 0
    h1    h0    hash buckets           overflow pages
    000   00    32*
    001   01    9* 25*
    010   10    66* 18* 10* 34*
    011   11    31* 35* 7* 11*         43*   ← next
    100         44* 36*
    101         5* 37* 29*
    110         14* 30* 22*

Linear Hashing: Running Example

Example (Insert record with key k such that h0(k) = 50 = 110010₂)

    level = 1
    h1    hash buckets           overflow pages
    000   32*                    ← next
    001   9* 25*
    010   66* 18* 10* 34*        50*
    011   43* 35* 11*
    100   44* 36*
    101   5* 37* 29*
    110   14* 30* 22*
    111   31* 7*

 Rehashing a bucket requires rehashing its overflow chain, too.

Linear Hashing: Search Procedure

• The procedures below operate over the hash table bucket (page) address array bucket[0, ..., 2^level · N − 1].
• Variables level and next are hash-table globals, N is constant.

Linear hashing: search

1 Function: hsearch(k)
2   b ← h_level(k) ;
3   if b < next then
      /* b has already been split, the record for key k        */
      /* may be in bucket b or bucket 2^level · N + b ⇒ rehash */
4     b ← h_level+1(k) ;
    /* return address of bucket at position b */
5   return bucket[b] ;

Linear Hashing: Insert Procedure

Linear hashing: insert

1 Function: hinsert(k*)
2   b ← h_level(k) ;
3   if b < next then
      /* rehash */
4     b ← h_level+1(k) ;
5   Place h(k)* in bucket[b] ;
6   if overflow(bucket[b]) then
7     Allocate new page b' ;
      /* grow hash table by one page */
8     bucket[2^level · N + next] ← addr(b') ;
9   ...

• Predicate overflow(·) is a tunable parameter: whenever overflow(bucket[b]) returns true, trigger a split.

Linear Hashing: Insert Procedure (continued)

Linear hashing: insert (cont'd)

 1   ...
 2   if overflow(···) then
 3     ...
 4     foreach entry k'* in bucket[next] do
         /* redistribute */
 5       Place k'* in bucket[h_level+1(k')] ;
 6     next ← next + 1 ;
       /* did we split every bucket in the hash table? */
 7     if next > 2^level · N − 1 then
         /* hash table size doubled, split from the top again */
 8       level ← level + 1 ;
 9       next ← 0 ;
10   return ;
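Putting hsearch(·) and hinsert(·) together, here is a compact in-memory Python sketch (an illustration under simplifying assumptions of our own: integer keys, identity hash, N = 4, and an overflow(·) predicate that fires as soon as a bucket exceeds a capacity of four entries):

    N, CAPACITY = 4, 4

    class LinearHashTable:
        def __init__(self):
            self.level, self.next = 0, 0
            self.bucket = [[] for _ in range(N)]   # pages incl. overflow space

        def _h(self, level, k):
            return k % (2**level * N)              # h_level(k)

        def _addr(self, k):
            b = self._h(self.level, k)
            if b < self.next:                      # already split ⇒ rehash
                b = self._h(self.level + 1, k)
            return b

        def hsearch(self, k):
            return k in self.bucket[self._addr(k)]

        def hinsert(self, k):
            b = self._addr(k)
            self.bucket[b].append(k)
            if len(self.bucket[b]) > CAPACITY:     # overflow(·) fired
                self.bucket.append([])             # page 2**level * N + next
                victim, self.bucket[self.next] = self.bucket[self.next], []
                for k2 in victim:                  # redistribute bucket `next`
                    self.bucket[self._h(self.level + 1, k2)].append(k2)
                self.next += 1
                if self.next > 2**self.level * N - 1:
                    self.level, self.next = self.level + 1, 0

    t = LinearHashTable()
    for k in [32, 44, 36, 9, 25, 5, 14, 18, 10, 30, 31, 35, 7, 11, 43]:
        t.hinsert(k)
    assert t.hsearch(43) and t.next == 1           # bucket 0 was split

Replaying the inserts from the running example this way reproduces the first split shown above: the overflow in bucket 11₂ triggers a split of bucket 0, whose entries 32, 44, 36 are rehashed via h1 into positions 000 and 100.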

Linear Hashing: Delete Procedure (Sketch)

• Linear hashing deletion essentially behaves as the "inverse" of hinsert(·):

Linear hashing: delete (sketch)

1 Function: hdelete(k)
2   ...
    /* does record deletion leave the last bucket empty? */
3   if empty(bucket[2^level · N + next]) then
4     Remove the page pointed to by bucket[2^level · N + next] from the hash table ;
5     next ← next − 1 ;
6     if next < 0 then
        /* round-robin scheme for deletion */
7       level ← level − 1 ;
8       next ← 2^level · N − 1 ;
9   ...

• Possible: replace empty(·) by a suitable underflow(·) predicate.

Chapter 7
External Sorting
Sorting Tables Larger Than Main Memory

Architecture and Implementation of Database Systems

Query Processing · Sorting · Two-Way Merge Sort · External Merge Sort · Comparisons · Replacement Sort · B+-trees for Sorting

Query Processing

Challenges lurking behind a SQL query

    SELECT   C.CUST_ID, C.NAME, SUM(O.TOTAL) AS REVENUE   -- aggregation
    FROM     CUSTOMERS AS C, ORDERS AS O
    WHERE    C.ZIPCODE BETWEEN 8000 AND 8999              -- selection
      AND    C.CUST_ID = O.CUST_ID                        -- join
    GROUP BY C.CUST_ID                                    -- grouping
    ORDER BY C.CUST_ID, C.NAME                            -- sorting

A DBMS query processor needs to perform a number of tasks
• with limited memory resources,
• over large amounts of data,
• yet as fast as possible.

Query Processing

[Figure: DBMS architecture as on the course outline slide: the query processor (parser, optimizer, operator evaluator, executor) sits on top of the files and access methods layer, the buffer manager, and the disk space manager.]

Query Plans and Operators

Query plans and operators
• A DBMS does not execute a query as one large monolithic block but rather provides a number of specialized routines, the query operators.
• Operators are "plugged together" to form a network of operators, a plan, that is capable of evaluating a given query.
• Each operator is carefully implemented to perform a specific task well (i.e., time- and space-efficiently).
• Now: zoom in on the details of the implementation of one of the most basic and important operators: sort.

Query Processing: Sorting

• Sorting stands out as a useful operation, explicit or implicit:

Explicit sorting via the SQL ORDER BY clause

    SELECT A, B, C FROM R ORDER BY A

Implicit sorting, e.g., for duplicate elimination

    SELECT DISTINCT A, B, C FROM R

Implicit sorting, e.g., to prepare an equi-join

    SELECT R.A, S.Y FROM R, S WHERE R.B = S.X

• Further: grouping via GROUP BY, B+-tree bulk loading, sorted rid scans after access to unclustered indexes, ...

Sorting

• A file is sorted with respect to sort key k and ordering θ if, for any two records r1, r2 with r1 preceding r2 in the file, their corresponding keys are in θ-order:

    r1 θ r2  ⟺  r1.k θ r2.k

• A key may be a single attribute as well as an ordered list of attributes. In the latter case, order is defined lexicographically. Consider k = (A, B), θ = <:

    r1 < r2  ⟺  r1.A < r2.A  ∨  (r1.A = r2.A ∧ r1.B < r2.B)
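In Python terms (a minimal sketch, assuming records are plain (A, B) tuples), this lexicographic order is exactly built-in tuple comparison:

    records = [(1, 'foo'), (1, 'bar'), (0, 'foo')]
    # sort key k = (A, B), θ = <  ⇒  plain tuple comparison
    assert sorted(records) == [(0, 'foo'), (1, 'bar'), (1, 'foo')]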

Sorting

• As it is a principal goal not to restrict the file sizes a DBMS can handle, we face a fundamental problem:

  How can we sort a file of records whose size exceeds the available main memory space (let alone the available buffer manager space) by far?

• Approach the task in a two-phase fashion:
  1  Sorting a file of arbitrary size is possible even if three pages of buffer space is all that is available.
  2  Refine this algorithm to make effective use of larger and thus more realistic buffer sizes.
• As we go along, consider a number of further optimizations in order to reduce the overall number of required page I/O operations.

Two-Way Merge Sort

We start with two-way merge sort, which can sort files of arbitrary size with only three pages of buffer space.

Two-way merge sort
Two-way merge sort sorts a file with N = 2^k pages in multiple passes, each of them producing a certain number of sorted sub-files called runs.
• Pass 0 sorts each of the 2^k input pages individually and in main memory, resulting in 2^k sorted runs.
• Subsequent passes merge pairs of runs into larger runs. Pass n produces 2^(k−n) runs.
• Pass k leaves only one run, the sorted overall result.

During each pass, we consult every page in the file. Hence, k · N page reads and k · N page writes are required to sort the file.

Basic Two-Way Merge Sort Idea

Pass 0 (input: N = 2^k unsorted pages; output: 2^k sorted runs)
1. Read N pages, one page at a time.
2. Sort records, page-wise, in main memory.
3. Write sorted pages to disk (each page results in a run).
This pass requires one page of buffer space.

Pass 1 (input: N = 2^k sorted runs; output: 2^(k−1) sorted runs)
1. Open two runs r1 and r2 from Pass 0 for reading.
2. Merge records from r1 and r2, reading input page-by-page.
3. Write new two-page run to disk (page-by-page).
This pass requires three pages of buffer space.
...
Pass n (input: 2^(k−n+1) sorted runs; output: 2^(k−n) sorted runs)
1. Open two runs r1 and r2 from Pass n − 1 for reading.
2. Merge records from r1 and r2, reading input page-by-page.
3. Write new 2^n-page run to disk (page-by-page).
This pass requires three pages of buffer space.

Two-Way Merge Sort: Example

Example (7-page file, two records per page, keys k shown, θ = <)

    input file:            [6,5] [4,3] [4,7] [8,9] [5,2] [1,3] [8]
    Pass 0, 1-page runs:   [5,6] [3,4] [4,7] [8,9] [2,5] [1,3] [8]
    Pass 1, 2-page runs:   [3,4,5,6] [4,7,8,9] [1,2,3,5] [8]
    Pass 2, 4-page runs:   [3,4,4,5,6,7,8,9] [1,2,3,5,8]
    Pass 3, 7-page run:    [1,2,3,3,4,4,5,5,6,7,8,8,9]

Two-Way Merge Sort: Algorithm

Two-way merge sort, N = 2^k

1 Function: two_way_merge_sort (file, N)
    /* Pass 0: create N sorted single-page runs (in-memory sort) */
2   foreach page p in file do
3     read p into memory, sort it, write it out into a new run ;
    /* the next k passes merge pairs of runs, until only one run is left */
4   for n in 1 ... k do
5     for r in 0 ... 2^(k−n) − 1 do
6       merge runs 2·r and 2·r + 1 from the previous pass into a new run, reading the input runs one page at a time ;
7       delete input runs 2·r and 2·r + 1 ;
8   result ← last output run ;

Each merge requires three buffer frames (two to read the two input runs and one to construct output pages).
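A runnable Python rendering of the algorithm (an in-memory sketch of ours: "pages" are small lists, the page-by-page I/O is implicit, and heapq.merge plays the role of the three-frame merge step):

    from heapq import merge

    def two_way_merge_sort(pages):
        # Pass 0: sort each page individually → single-page runs
        runs = [sorted(p) for p in pages]
        # Passes 1..k: merge pairs of runs until a single run is left
        while len(runs) > 1:
            runs = [list(merge(*runs[i:i + 2]))
                    for i in range(0, len(runs), 2)]
        return runs[0] if runs else []

    pages = [[6, 5], [4, 3], [4, 7], [8, 9], [5, 2], [1, 3], [8]]
    assert two_way_merge_sort(pages) == [1, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 8, 9]

The sequence of runs after each iteration of the while loop is exactly the trace of the 7-page example above.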

Two-Way Merge Sort: I/O Behavior

• To sort a file of N pages, in each pass we read N pages, sort/merge, and write N pages out again:

    2 · N I/O operations per pass

• Number of passes:

    1 + ⌈log₂ N⌉     (1 for Pass 0, ⌈log₂ N⌉ for Passes 1, ..., k)

• Total number of I/O operations:

    2 · N · (1 + ⌈log₂ N⌉)

 How many I/Os does it take to sort an 8 GB file? Assume a page size of 8 KB (with 1,000 records each).

External Merge Sort

• So far we have "voluntarily" used only three pages of buffer space.

 How could we make effective use of a significantly larger buffer page pool (of, say, B frames)?

• Basically, there are two knobs we can turn and tune:
  1  Reduce the number of initial runs by using the full buffer space during the in-memory sort.
  2  Reduce the number of passes by merging more than two runs at a time.

Reducing the Number of Initial Runs

With B frames available in the buffer pool, we can read B pages at a time during Pass 0 and sort them in memory (% slide 9):

Pass 0 (input: N unsorted pages; output: ⌈N/B⌉ sorted runs)
1. Read N pages, B pages at a time.
2. Sort records in main memory.
3. Write sorted pages to disk (resulting in ⌈N/B⌉ runs).
This pass uses B pages of buffer space.

The number of initial runs determines the number of passes we need to make (% slide 12):

⇒ Total number of I/O operations:  2 · N · (1 + ⌈log₂ ⌈N/B⌉⌉)

 How many I/Os does it take to sort an 8 GB file now? Again, assume 8 KB pages. Available buffer space is B = 1,000.

Reducing the Number of Passes

With B frames available in the buffer pool, we can merge B − 1 pages at a time (leaving one frame as a write buffer).

Pass n (input: ⌈N/B⌉ / (B − 1)^(n−1) sorted runs; output: ⌈N/B⌉ / (B − 1)^n sorted runs)
1. Open B − 1 runs r1, ..., r_(B−1) from Pass n − 1 for reading.
2. Merge records from r1, ..., r_(B−1), reading input page-by-page.
3. Write new B · (B − 1)^n-page run to disk (page-by-page).
This pass requires B pages of buffer space.

With B pages of buffer space, we can do a (B − 1)-way merge.

⇒ Total number of I/O operations:  2 · N · (1 + ⌈log_(B−1) ⌈N/B⌉⌉)

 How many I/Os does it take to sort an 8 GB file now? Again, assume 8 KB pages. Available buffer space is B = 1,000.
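A quick sanity check on the three cost formulas and the 8 GB questions (a sketch under the stated assumptions: 8 KB pages, so N = 10⁶ pages, and B = 1,000 buffer frames):

    from math import ceil, log

    N, B = 10**6, 1000                    # 8 GB in 8 KB pages; buffer frames

    def io_ops(N, fan_in, runs):          # 2 · N · (1 + ⌈log_fan_in(runs)⌉)
        # beware: ceil(log(...)) may misbehave for exact powers of the base
        return 2 * N * (1 + ceil(log(runs, fan_in)))

    print(io_ops(N, 2, N))                # two-way, B = 3:       42,000,000
    print(io_ops(N, 2, ceil(N / B)))      # B-page initial runs:  22,000,000
    print(io_ops(N, B - 1, ceil(N / B)))  # (B−1)-way merge:       6,000,000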

Reducing the Number of Passes

(B − 1)-way merge using a buffer of B pages:

    input 1   ─┐
    input 2   ─┤
    ...        ├──→ output ──→ disk
    input B−1 ─┘

    (B main memory buffers: B − 1 input frames, one output frame)

External Sorting: I/O Behavior

• The I/O savings in comparison to two-way merge sort (B = 3) can be substantial:

[Plot: number of passes for buffer sizes B = 3 (two-way), 5, 9, 17, 129, 257 over file sizes N = 100 ... 10⁹ pages; the pass count drops sharply as B grows.]

External Sorting: I/O Behavior

• Sorting N pages with B buffer frames requires

    2 · N · (1 + ⌈log_(B−1) ⌈N/B⌉⌉)

  I/O operations.

 What is the access pattern of these I/Os?

• In Pass 0, we read chunks of size B sequentially.
• Everything else is random access due to the (B − 1)-way merge. (Which of the B − 1 runs will contribute the next record . . . ?)

Blocked I/O

We could improve the I/O pattern by reading blocks of, say, b pages at once during the merge phases.
• Allocate b pages for each input (instead of just one).
• Reduces per-page I/O cost by a factor of ≈ b.
• The price we pay is a decreased fan-in (resulting in an increased number of passes and more I/O operations).
• In practice, main memory sizes are typically large enough to sort files with just one merge pass, even with blocked I/O.

 How long does it take to sort 8 GB (counting only I/O cost)? Assume 1,000 buffer pages of 8 KB each, 8.5 ms average seek time.
• Without blocked I/O: ≈ 4 · 10⁶ disk seeks (10.5 h) + transfer of ≈ 6 · 10⁶ disk pages (13.6 min)
• With blocked I/O (b = 32-page blocks): ≈ 6 · 32,800 disk seeks (28.1 min) + transfer of ≈ 8 · 10⁶ disk pages (18.1 min)

External Merge Sort: CPU Load and Comparisons

• External merge sort reduces the I/O load, but is considerably CPU intensive.
• Consider the (B − 1)-way merge during passes 1, 2, ...: to pick the next record to be moved to the output buffer, we need to perform B − 2 comparisons.

Example (Comparisons for B − 1 = 4, θ = <)

    run 1:  087 503 504 ...     Picking each output record (087, then 154,
    run 2:  170 908 994 ...     then 170, then 426, ...) compares the current
    run 3:  154 426 653 ...     head records of all four runs.
    run 4:  612 613 700 ...

Selection Trees

Choosing the next record from B − 1 (or B/b − 1) input runs can be quite CPU intensive (B − 2 comparisons).
• Use a selection tree to reduce this cost.
• E.g., a "tree of losers" (% D. Knuth, TAoCP, vol. 3):

Example (Selection tree, read bottom-up)

[Figure: the leaves hold the current head records of the input runs (985, 670, 850, 873, 190, 412, 390, 901, ...); each inner node holds the loser of a pairwise comparison; the overall winner (23) emerges at the root.]

• This cuts the number of comparisons down to log₂(B − 1).
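Python's heapq.merge realizes the same effect with a heap of the current run heads, so each pick costs on the order of log₂(B − 1) comparisons (a sketch of the effect, reusing the run heads from the comparison example above, not Knuth's tree-of-losers layout):

    from heapq import merge

    runs = [[87, 503, 504], [170, 908, 994], [154, 426, 653], [612, 613, 700]]
    # 4-way merge: each output record is selected via the heap instead of
    # comparing against all other run heads
    assert list(merge(*runs))[:4] == [87, 154, 170, 426]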

Further Reducing the Number of Initial Runs

• Replacement sort can help to further cut down the number of initial runs ⌈N/B⌉: try to produce initial runs with more than B pages.

Replacement sort
• Assume a buffer of B pages. Two pages are dedicated input and output buffers. The remaining B − 2 pages are called the current set:

    input buffer      current set              output buffer
    [ 2 | 8 ]         [ 12 10 3 4 16 5 ... ]   [ ... ]

Replacement Sort

Replacement sort
1  Open an empty run file for writing.
2  Load the next page of the file to be sorted into the input buffer. If the input file is exhausted, go to 4.
3  While there is space in the current set, move a record from the input buffer to the current set (if the input buffer is empty, reload it at 2).
4  In the current set, pick the record r with the smallest key value k such that k > k_out, where k_out is the maximum key value in the output buffer.¹ Move r to the output buffer. If the output buffer is full, append it to the current run.
5  If all k in the current set are < k_out, append the output buffer to the current run and close the current run. Open a new empty run file for writing.
6  If the input file is exhausted, stop. Otherwise go to 3.

¹ If the output buffer is empty, define k_out = −∞.

Replacement Sort

Example (The record with key k = 8 will be the next to be moved into the output buffer; current k_out = 5)

    The current set contains (among others) the records with keys 2 and 8;
    the maximum key in the output buffer is k_out = 5. Key 8 is the smallest
    key in the current set greater than k_out, so its record is moved to the
    output buffer next.

• The record with key k = 2 remains in the current set and will be written to the subsequent run.

Replacement Sort

 Tracing replacement sort
Assume B = 6, i.e., a current set size of 4. The input file contains records with INTEGER key values:

    503 087 512 061 908 170 897 275 426 154 509 612

Write a trace of replacement sort by filling out the table below; mark the end of the current run by ⟨EOR⟩ (the current set has already been populated at step 3):

    current set              output
    503  087  512  061
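If you would rather let a program produce the trace, here is a runnable Python sketch of replacement sort's run generation (an illustration of ours: record-at-a-time instead of page-sized buffers, a heap standing in for the current set, and "frozen" keys deferred to the next run):

    import heapq

    CURRENT_SET = 4                      # plays the role of the B − 2 pages

    def replacement_sort_runs(keys):
        """Generate initial runs; returns a list of sorted runs."""
        it = iter(keys)
        current = []
        for k in it:                     # populate the current set (step 3)
            current.append(k)
            if len(current) == CURRENT_SET:
                break
        heapq.heapify(current)
        runs, run, frozen = [], [], []   # frozen: keys for the next run
        for k in it:
            out = heapq.heappop(current) # smallest key > k_out (step 4)
            run.append(out)
            if k >= out:                 # incoming key may extend this run
                heapq.heappush(current, k)
            else:                        # k < k_out: defer to the next run
                frozen.append(k)
            if not current:              # current run ends (step 5)
                runs.append(run)
                run, current, frozen = [], frozen, []
                heapq.heapify(current)
        run.extend(sorted(current))      # input exhausted: drain (step 6)
        runs.append(run)
        if frozen:
            runs.append(sorted(frozen))
        return runs

    keys = [503, 87, 512, 61, 908, 170, 897, 275, 426, 154, 509, 612]
    assert replacement_sort_runs(keys)[0] == [61, 87, 170, 503, 512, 897, 908]

On the trace exercise's input this yields a first run of length 7, matching the observation on the next slide that initial runs grow to about twice the current set size.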

Replacement Sort

• Step 4 of replacement sort will benefit from techniques like the selection tree, especially if B − 2 is large.
• The replacement sort trace suggests that the length of the initial runs indeed increases. In the example: first run length 7, roughly twice the size of the current set.

 Length of initial runs?
Implement replacement sort to empirically determine the initial run length, or check the proper analysis (% D. Knuth, TAoCP, vol. 3, p. 254).

External Sort: Remarks

• External sorting follows a divide-and-conquer principle.
• This results in a number of independent (sub-)tasks.
• ⇒ Execute tasks in parallel in a distributed DBMS or exploit multi-core parallelism on modern CPUs.
• To keep the CPU busy while the input buffer is reloaded (or the output buffer appended to the current run), use double buffering:
  Create shadow buffers for the input and output buffers. Let the CPU switch to the "double" input buffer as soon as the input buffer is empty and asynchronously initiate an I/O operation to reload the original input buffer. Treat the output buffer similarly.

(Not) Using B+-trees for Sorting

• If a B+-tree matches a sorting task (i.e., the B+-tree is organized over key k with ordering θ), we may be better off to access the index and abandon external sorting.
1  If the B+-tree is clustered, then
   • the data file itself is already θ-sorted,
   ⇒ simply read the sequence set sequentially (or the pages of the data file).
2  If the B+-tree is unclustered, then
   • in the worst case, we have to initiate one I/O operation per record (not per page)!
   ⇒ do not consider the index.

[Figure: an unclustered B+-tree: the index entries k* in the index file point to data records scattered across the pages of the data file.]

(Not) Using B+-trees for Sorting

• Let p denote the number of records per page (typically, p = 10, ..., 1000). The expected number of I/O operations to sort via an unclustered B+-tree will thus be p · N:

[Plot (expected sort I/O operations, assume B = 257): I/O operations over file size N for a clustered B+-tree scan, external merge sort, and unclustered B+-tree access with p = 10 and p = 100; the unclustered variants are orders of magnitude more expensive.]

Chapter 8
Evaluation of Relational Operators
Implementing the Relational Algebra

Architecture and Implementation of Database Systems

Relational Query Engines · Operator Selection · Selection (σ) · Selectivity · Conjunctive Predicates · Disjunctive Predicates · Projection (π) · Join (⋈) · Nested Loops Join · Block Nested Loops Join · Index Nested Loops Join · Sort-Merge Join · Hash Join · Operator Pipelining · Volcano Iterator Model

Relational Query Engines

• In many ways, a DBMS's query engine compares to virtual machines (e.g., the Java VM):

    Relational Query Engine                  Virtual Machine (VM)
    operators of the relational algebra     primitive VM instructions
    operates over streams of rows           acts on object representations
    operator network (tree/DAG)             sequential program (with branches, loops)
    several equivalent variants of an       compact instruction set
      operator ⋈

 Equivalent operator variants
Instead of a single operator ⋈, a typical DBMS query engine features equivalent variants ⋈′, ⋈″, .... What would equivalent mean in the context of the relational model?

Operator Variants

• Specific operator variants may be tailored to exploit physical properties of their input or the current system state:
  1  the presence or absence of indexes on the input file(s),
  2  the sortedness of the input file(s),
  3  the size of the input file(s),
  4  the available space in the buffer pool,
  5  the buffer replacement policy,
  6  ...

Physical operators
The variants (⋈′, ⋈″) are thus referred to as physical operators. They implement the logical operators of the relational algebra.

• The query optimizer is in charge of performing optimal (or, reasonable) operator selection (much like the instruction selection phase in a programming language compiler).

Operator Selection

Initial, logical operator network ("plan")

    R ──┐
        ⋈ ──┐
    S ──┘   ⋈ ── sort ── π ──→
    T ── σ ─┘

Physical plan with (un)sortedness annotations (u/s)

    R (s) ──┐
            ⋈′ (u) ──┐
    S (u) ──┘        ⋈′ (u) ── sort′ (s) ── π′ ──→
    T (s) ── σ″ (s) ─┘

Plan Rewriting

Physical plan with (un)sortedness annotations (u/s)

    R (s) ──┐
            ⋈′ (u) ──┐
    S (u) ──┘        ⋈′ (u) ── sort′ (s) ── π′ ──→
    T (s) ── σ″ (s) ─┘

• Rewrite the plan to exploit that the ⊕″ variant of an operator ⊕ can benefit from/preserve sortedness of its input(s):

Rewritten physical plan (preserve equivalence!)

    R (s) ──────────────┐
                        ⋈″ (s) ──┐
    S (u) ── sort′ (s) ─┘        ⋈″ (s) ── π″ (s) ──→
    T (s) ── σ″ (s) ─────────────┘

Selection (σ)—No Index, Unsorted Data

• Selection (σ_p) reads an input file R_in of records and writes those records satisfying predicate p into the output file:

Selection

1 Function: σ(p, R_in, R_out)
2   out ← createFile(R_out) ;
3   in ← openScan(R_in) ;
4   while (r ← nextRecord(in)) ≠ ⟨EOF⟩ do
5     if p(r) then
6       appendRecord(out, r) ;
7   closeFile(out) ;

Selection (σ)—No Index, Unsorted Data

Remarks:
• Reading the special "record" ⟨EOF⟩ from a file via nextRecord() indicates that all its records have been retrieved (scanned) already.
• This simple procedure does not require R_in to come with any special physical properties (the procedure is exclusively defined in terms of heap files).
• In particular, predicate p may be arbitrary.

Selection (σ)—No Index, Unsorted Data

• We can summarize the characteristics of this implementation of the selection operator as follows:

Selection (σ)—no index, unsorted data: σ_p(R)

    input access¹    file scan (openScan) of R
    prerequisites    none (p arbitrary, R may be a heap file)
    I/O cost         N_R (input cost) + sel(p) · N_R (output cost)

• N_R denotes the number of pages in file R; |R| denotes the number of records (if p_R records fit on one page, we have N_R = ⌈|R| / p_R⌉).

¹ Also known as access path in the literature and textbooks.

Aside: Selectivity

• sel(p), the selectivity of predicate p, is the fraction of records satisfying predicate p:

    sel(p) = |σ_p(R)| / |R| ,    0 ≤ sel(p) ≤ 1

 Selectivity examples
What can you say about the following selectivities?
1  sel(true)
2  sel(false)
3  sel(A = 0)

Estimated selectivities
IBM DB2 reports (estimated) selectivities in the operator details of, e.g., its IXSCAN operator.

Selection (σ)—Matching Predicates with an Index

• A selection on input file R can be sped up considerably if an index has been defined and that index matches predicate p.
• The matching process depends on p itself as well as on the index type. If there is no immediate match but p is compound, a sub-expression of p may still find a partial match. Residual predicate evaluation work may then remain.

 When does a predicate match a sort key?
Assume R is tree-indexed on attribute A in ascending order. Which of the selections below can benefit from the index on R?
1  σ_(A = 42) (R)
2  σ_(A < 42) (R)
3  σ_(A > 42 AND A < 100) (R)
4  σ_(A > 42 OR A > 100) (R)
5  σ_(A > 42 AND A < 32) (R)
6  σ_(A > 42 AND B = 10) (R)
7  σ_(A > 42 OR B = 10) (R)

Selection (σ)—B+-tree Index

• A B+-tree index on R whose key matches the selection predicate p is clearly the superior method to evaluate σ_p(R):
• Descend the B+-tree to retrieve the first index entry that satisfies p. If the index is clustered, access that record on its page in R and continue to scan inside R.
• If the index is unclustered and sel(p) indicates a large number of qualifying records, it pays off to
  1  read the matching index entries k* = ⟨k, rid⟩ in the sequence set,
  2  sort those entries on their rid field,
  3  and then access the pages of R in sorted rid order.
  Note that lack of clustering is a minor issue if sel(p) is close to 0.

Accessing unclustered B+-trees
IBM DB2 uses the physical operator quadruple IXSCAN/SORT/RIDSCN/FETCH to implement the above strategy.

Selection (σ)—B+-tree Index

[IBM DB2 plan excerpt: the IXSCAN/SORT/RIDSCN/FETCH operator quadruple.]


• Note: The selectivity of the predicate is estimated as 57 % (table accel has 235,501 rows).

Selection (σ)—B+-tree Index

Selection (σ)—clustered B+-tree index: σ_p(R)

    input access     access of B+-tree on R, then sequence set scan
    prerequisites    clustered B+-tree on R with key k, p matches key k
    I/O cost         ≈ 3 (B+-tree access) + sel(p) · N_R (sorted scan) + sel(p) · N_R (output cost)

Selection (σ)—Hash Index, Equality Predicate

• A selection predicate p matches a hash index only if p contains a term of the form A = c (c constant, assuming the hash index has been built over column A).
• We are directly led to the bucket of qualifying records and pay I/O cost only for the access of this bucket². Note that sel(p) is likely to be close to 0 for equality predicates.

Selection (σ)—hash index, equality predicate: σ_p(R)

    input access     hash table on R
    prerequisites    R hashed on key A, p has term A = c
    I/O cost         1 (hash access) + sel(p) · N_R (output cost)

² Remember that this may include the access cost for the pages of an overflow chain hanging off the primary bucket page.

Selection (σ)—Conjunctive Predicates

• Indeed, selection operations with simple predicates like σ_(A θ c) (R) are a special case only.
• We somehow need to deal with complex predicates, built from simple comparisons and the Boolean connectives AND and OR.
• Matching a selection predicate with an index can be extended to cover the case where predicate p has conjunctive form:

    A1 θ1 c1 AND A2 θ2 c2 AND ··· AND An θn cn

• Here, each conjunct Ai θi ci is a simple comparison (θi ∈ {=, <, >, <=, >=}).
• An index with a multi-attribute key may match the entire complex predicate.

Selection (σ)—Conjunctive Predicates

 Matching a multi-attribute hash index
Consider a hash index for the multi-attribute key k = (A, B, C), i.e., all three attributes are input to the hash function. Which conjunctive predicates p would match this type of index?

Conjunctive predicate match rule for hash indexes
A conjunctive predicate p matches a (multi-attribute) hash index with key k = (A1, A2, ..., An) if p covers the key, i.e.,

    p ≡ A1 = c1 AND A2 = c2 AND ··· AND An = cn AND φ .

The residual conjunct φ is not supported by the index itself and has to be evaluated after index retrieval.

Selection (σ)—Conjunctive Predicates

 Matching a multi-attribute B+-tree index
Consider a B+-tree index for the multi-attribute key k = (A, B, C), i.e., the B+-tree nodes are searched/inserted in lexicographic order w.r.t. these three attributes:

    k1 < k2  ≡  A1 < A2
                ∨ (A1 = A2 ∧ B1 < B2)
                ∨ (A1 = A2 ∧ B1 = B2 ∧ C1 < C2)

Excerpt of an inner B+-tree node (separator): (50, 20, 30)

Which conjunctive predicates p would match this type of index?

Selection (σ)—Conjunctive Predicates

Conjunctive predicate match rule for B+-tree indexes
A conjunctive predicate p matches a (multi-attribute) B+-tree index with key k = (A1, A2, ..., An) if p is a prefix of the key, i.e.,

    p ≡ A1 θ1 c1 AND φ
    p ≡ A1 θ1 c1 AND A2 θ2 c2 AND φ
        ...
    p ≡ A1 θ1 c1 AND A2 θ2 c2 AND ··· AND An θn cn AND φ

• Note: Whenever a multi-attribute hash index matches a predicate, so does a B+-tree over the same key.

Selection (σ)—Conjunctive Predicates

• If the system finds that a conjunctive predicate does not match a single index, its (smaller) conjuncts may nevertheless match distinct indexes.

Example (Partial predicate match)
The conjunctive predicate in σ_(p AND q) (R) does not match an index, but both conjuncts p, q do. A typical optimizer might thus decide to transform the original query

    R ── σ_(p AND q) ──→

into

    R ──┬── σ′_p ──┐
        │          ∩_rid ──→
        └── σ′_q ──┘

Here, ∩_rid denotes a set intersection operator defined by rid equality (IBM DB2: IXAND).

Selection (σ)—Conjunctive Predicates

 Selectivity of conjunctive predicates
What can you say about the selectivity of the conjunctive predicate p AND q?

    sel(p AND q) =

Now assume p ≡ AGE <= 16 and q ≡ SALARY > 5000. Reconsider your proposal above.

Selection (σ)—Disjunctive Predicates

• Choosing a reasonable execution plan for a disjunctive selection of the general form

    A1 θ1 c1 OR A2 θ2 c2 OR ··· OR An θn cn

is much harder:
• We are forced to fall back to a naive file-scan-based evaluation as soon as only a single term does not match an index.
• If all terms are matched by indexes, we can exploit a rid-based set union ∪_rid to improve the plan:

    R ──┬── σ′_(A1 θ1 c1) ──┐
        │       ...         ∪_rid ──→
        └── σ′_(An θn cn) ──┘

Selection (σ)—Disjunctive Predicates

[IBM DB2 plan excerpts: a selective disjunctive predicate (note the multi-input RIDSCN operator) vs. a non-selective disjunctive predicate (note: the presence of indexes is ignored).]

Selection (σ)—Disjunctive Predicates

 Selectivity of disjunctive predicates
What can you say about the selectivity of the disjunctive predicate p OR q?

    sel(p OR q) =

Projection (π)

• Projection (π_ℓ) modifies each record in its input file and cuts off any field not listed in the attribute list ℓ:

Relational projection

    π_(A,B) applied to      without duplicate     with duplicate
                            elimination           elimination
    A  B      C             A  B                  A  B
    1  'foo'  3             1  'foo'              1  'foo'
    1  'bar'  2             1  'bar'              1  'bar'
    1  'foo'  2             1  'foo'
    1  'bar'  0             1  'bar'
    1  'foo'  0             1  'foo'

• In general, the size of the resulting file will be only a fraction of the original input file:
  1  any unwanted fields (here: C) have been thrown away, and
  2  optionally duplicates removed (SQL: DISTINCT).

Projection (π)—Duplicate Elimination, Sorting

• Sorting is one obvious preparatory step to facilitate duplicate elimination: records with all fields equal will end up adjacent to each other.

Implementing DISTINCT
• One benefit of sort-based projection is that the operator will write a sorted output file, i.e.,

    R ── π_ℓ^sort ──(s)──→

 Sort ordering?
What would be the correct ordering θ to apply in the case of duplicate elimination?

Projection (π)—Duplicate Elimination, Hashing

• If the DBMS has a fairly large number of buffer pages (B, say) to spare for the π_ℓ(R) operation, a hash-based projection may be an efficient alternative to sorting:

Hash-based projection π_ℓ: partitioning phase
1  Allocate all B buffer pages. One page will be the input buffer; the remaining B − 1 pages will be used as hash buckets.
2  Read the file R page-by-page; for each record r, cut off the fields not listed in ℓ.
3  For each such record, apply hash function h1(r) = h(r) mod (B − 1)—which depends on all remaining fields of r—and store r in hash bucket h1(r). (Write the bucket to disk if full.)

Projection (π)—Hashing

Hash-based projection π_ℓ: partitioning phase

    input file (disk) ──→ input buffer ──→ hash function h ──→ partitions 1, ..., B − 1 (disk)
                          (B main memory buffers)

• After partitioning, duplicate elimination becomes an intra-partition problem only: two identical records have been mapped to the same partition:

    h1(r) = h1(r′)  ⇐  r = r′ .

Projection (π)—Hashing

Hash-based projection π_ℓ: duplicate elimination phase
1  Read each partition page-by-page (possibly in parallel).
2  To each record, apply hash function h2 ≠ h1 over all record fields.
3  Only if two records collide w.r.t. h2, check whether r = r′. If so, discard r′.
4  After the entire partition has been read in, append all hash buckets to the result file (which will be free of duplicates).

 Huge partitions?
Note: This works efficiently only if the duplicate elimination phase can be performed in the buffer (main memory). What to do if the partition size exceeds the buffer size?
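The two phases in a compact Python sketch (an in-memory illustration of ours: lists stand in for the B − 1 disk partitions, and a Python set plays the role of the h2-based in-memory hash table; the sample rows are the ones from the projection example above):

    # Hash-based duplicate-eliminating projection (illustration only).
    B = 5                                     # 1 input buffer + B − 1 buckets

    def project_distinct(records, fields):
        # Phase 1: project each record, partition on h1
        partitions = [[] for _ in range(B - 1)]
        for r in records:
            r = tuple(r[f] for f in fields)           # cut off unwanted fields
            partitions[hash(r) % (B - 1)].append(r)   # h1
        # Phase 2: eliminate duplicates inside each partition
        result = []
        for partition in partitions:
            seen = set()                              # stand-in for h2 table
            for r in partition:
                if r not in seen:
                    seen.add(r)
                    result.append(r)
        return result

    rows = [(1, 'foo', 3), (1, 'bar', 2), (1, 'foo', 2),
            (1, 'bar', 0), (1, 'foo', 0)]
    assert sorted(project_distinct(rows, (0, 1))) == [(1, 'bar'), (1, 'foo')]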

The Join Operator (⋈)

• The join operator ⋈_p is actually a short-hand for a combination of cross product × and selection σ_p:

Join vs. Cartesian product

    R ⋈_p S  ⇔  σ_p (R × S)

One way to implement ⋈_p is to follow this equivalence:
1  Enumerate and concatenate all records in the cross product of R and S.
2  Then pick those that satisfy p.
More advanced algorithms try to avoid the obvious inefficiency in Step 1 (the size of the intermediate result is |R| · |S|).

Nested Loops Join

The nested loops join is the straightforward implementation of the σ–× combination:

Nested loops join

1 Function: nljoin (R, S, p)
    /* outer relation R */
2   foreach record r ∈ R do
      /* inner relation S */
3     foreach record s ∈ S do
        /* ⟨r, s⟩ denotes record concatenation */
4       if ⟨r, s⟩ satisfies p then
5         append ⟨r, s⟩ to result

Let N_R and N_S be the number of pages in R and S; let p_R and p_S be the number of records per page in R and S. The total number of disk reads then is

    N_R + p_R · N_R · N_S      (p_R · N_R = # tuples in R)

Nested Loops Join: I/O Behavior

The good news about nljoin() is that it needs only three pages of buffer space (two to read R and S, one to write the result).

The bad news is its enormous I/O cost:
• Assuming p_R = p_S = 100, N_R = 1000, N_S = 500, we need to read 1000 + 5 · 10⁷ disk pages.
• With an access time of 10 ms for each page, this join would take 140 hours!
• Switching the roles of R and S to make S (the smaller one) the outer relation does not bring any significant advantage.

Note that reading data page-by-page (even tuple-by-tuple) means that every I/O suffers the disk latency penalty, even though we process both relations in sequential order.

Block Nested Loops Join

• Again we can save random access cost by reading R and S in blocks of, say, bR and bS pages.

Block nested loops join

1 Function: block_nljoin (R, S, p)
2   foreach bR-sized block in R do
3     foreach bS-sized block in S do
4       find matches in the current R- and S-blocks and append them to the result ;   /* performed in the buffer */

• R is still read once, but now with only ⌈NR/bR⌉ disk seeks.
• S is scanned only ⌈NR/bR⌉ times now, and we need to perform ⌈NR/bR⌉ · ⌈NS/bS⌉ disk seeks to do this.
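The trade-off illustrated by the chart on the next slide can be explored numerically (a sketch; splitting the buffer as b_R + b_S = B − 1, reserving one page for the output, is an assumption of this example):

from math import ceil

def seeks(N_R, N_S, b_R, b_S):
    """Disk seeks of a block nested loops join (formula above)."""
    return ceil(N_R / b_R) + ceil(N_R / b_R) * ceil(N_S / b_S)

# B = 100 buffer frames, N_R = 1000, N_S = 500 (next slide's example)
for b_R in (10, 33, 50, 66, 90):
    print(b_R, 99 - b_R, seeks(1000, 500, b_R, 99 - b_R))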

Choosing bR and bS

Example (Choosing bR and bS)

Buffer pool with B = 100 frames, NR = 1000, NS = 500:

[Chart: number of disk seeks as a function of the block size bR used for reading R (10–90) and the block size bS used for reading S (90–10); the seek count is large at the extremes and minimal for a balanced choice of bR and bS]

In-Memory Join Performance

• Line 4 in block_nljoin (R, S, p) implies an in-memory join between the R- and S-blocks currently in memory.
• Building a hash table over the R-block can speed up this join considerably.

Block nested loops join: build hash table from outer row block

1 Function: block_nljoin′ (R, S, p)
2   foreach bR-sized block in R do
3     build an in-memory hash table H for the current R-block ;
4     foreach bS-sized block in S do
5       foreach record s in the current S-block do
6         probe H and append matching ⟨r, s⟩ tuples to the result ;

• Note that this optimization only helps equi-joins.

Index Nested Loops Join

The index nested loops join takes advantage of an index on the inner relation (swap outer ↔ inner if necessary):

Index nested loops join

1 Function: index_nljoin (R, S, p)
2   foreach record r ∈ R do
3     scan the S-index using (the key value in) r and concatenate r with all matching tuples s ;
4     append ⟨r, s⟩ to the result ;

• The index must match the join condition p.
  • Hash indices, e.g., only support equality predicates.
  • Remember the discussion about composite keys in B+-trees.
  • Such predicates are also called sargable (sarg: search argument; % Selinger et al., SIGMOD 1979).

Index Nested Loops Join: I/O Behavior

For each record in R, we use the index to find matching S-tuples. While searching for matching S-tuples, we incur the following I/O costs for each tuple in R:

1 Access the index to find its first matching entry: Nidx I/Os.
2 Scan the index to retrieve all n matching rids. The I/O cost for this is typically negligible (locality in the index).
3 Fetch the n matching S-tuples from their data pages:
  • For an unclustered index, this requires n I/Os.
  • For a clustered index, this only requires ⌈n/pS⌉ I/Os.

Note that (due to 2 and 3) the cost of an index nested loops join becomes dependent on the size of the join result.

Index Access Cost

If the index is a B+-tree index:
• A single index access requires the inspection of h pages (h: B+-tree height).
• If we repeatedly probe the index, however, most of these are cached by the buffer manager.
• The effective value for Nidx is around 1–3 I/Os.

If the index is a hash index:
• Caching will not help here (no locality in accesses to the hash table).
• A typical value for Nidx is 1.2 I/Os (> 1 due to overflow pages).

Overall, the use of an index (over, e.g., a block nested loops join) pays off if the join is selective (picks out only few tuples from a big table).

Sort-Merge Join

Join computation becomes particularly simple if both inputs are sorted with respect to the join attribute(s).
• The merge join essentially merges both input tables, much like we did for sorting. Both tables are read once, in parallel.
• In contrast to sorting, however, we need to be careful whenever a tuple has multiple matches in the other relation; this disrupts strictly sequential access:

Multiple matches per tuple (join on B = C)

      A     B        C  D
    "foo"   1        1  false
    "foo"   2        2  true
    "bar"   2        2  false
    "baz"   2        3  true
    "baf"   4

• Merge join is typically used for equi-joins only.

Merge Join: Algorithm

Merge join algorithm

1 Function: merge_join (R, S, α = β)        // α, β: join columns in R, S
2   r ← position of first tuple in R ;      // r, s, s′: cursors over R, S, S
3   s ← position of first tuple in S ;
4   while r ≠ ⟨EOF⟩ and s ≠ ⟨EOF⟩ do        // ⟨EOF⟩: end-of-file marker
5     while r.α < s.β do
6       advance r ;
7     while r.α > s.β do
8       advance s ;
9     s′ ← s ;                              // remember current position in S
10    while r.α = s′.β do                   // all R-tuples with the same α value
11      s ← s′ ;                            // rewind s to s′
12      while r.α = s.β do                  // all S-tuples with the same β value
13        append ⟨r, s⟩ to result ;
14        advance s ;
15      advance r ;
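A runnable version of the same control flow (a Python sketch under the slide's assumptions: both inputs already sorted on the join column; list index arithmetic stands in for cursor movement, plus the bounds checks the pseudocode leaves implicit):

def merge_join(R, S, alpha, beta):
    """Merge join of R and S on r[alpha] == s[beta]; inputs sorted."""
    result, r, s = [], 0, 0
    while r < len(R) and s < len(S):
        while r < len(R) and R[r][alpha] < S[s][beta]:
            r += 1                          # advance r
        if r == len(R):
            break
        while s < len(S) and R[r][alpha] > S[s][beta]:
            s += 1                          # advance s
        if s == len(S):
            break
        s0 = s                              # remember current position in S
        while r < len(R) and R[r][alpha] == S[s0][beta]:
            s = s0                          # rewind s to s0
            while s < len(S) and R[r][alpha] == S[s][beta]:
                result.append({**R[r], **S[s]})
                s += 1
            r += 1                          # advance r
    return result

R = [{"a": 1}, {"a": 2}, {"a": 2}]
S = [{"b": 2}, {"b": 2}, {"b": 3}]
print(merge_join(R, S, "a", "b"))           # four (2, 2) combinations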

Merge Join: I/O Behavior

• If both inputs are already sorted and there are no exceptionally long sequences of identical key values, the I/O cost of a merge join is NR + NS (which is optimal).
• By using blocked I/O, these I/O operations can be done almost entirely as sequential reads.
• Sometimes it pays off to explicitly sort an (unsorted) relation first and then apply merge join. This is particularly the case if a sorted output is beneficial later in the execution plan.
• The final sort pass can also be combined with the merge join, avoiding one round-trip to disk and back.

 What is the worst-case behavior of merge join?

If both join attributes are constants and carry the same value (i.e., the result is the Cartesian product), merge join degenerates into a nested loops join.

Merge Join: IBM DB2 Plan

Merge join (left input: sort, right input: sorted index scan)

[Figure: IBM DB2 execution plan: an MSJOIN(3) whose left input is produced by an explicit SORT and whose right input is a sorted index scan; a FILTER(11) operator appears on the inner input]

• Note: The FILTER(11) implements the join predicate of the MSJOIN(3).

Hash Join

• Sorting effectively brought related tuples into spatial proximity, which we exploited in the merge join algorithm.
• We can achieve a similar effect with hashing, too.
• Partition R and S into partitions R1, …, Rn and S1, …, Sn using the same hash function h (applied to the join attributes).

Hash partitioning for both inputs

[Figure: relations R and S are both routed through hash function h, producing partition pairs (R1, S1), (R2, S2), (R3, S3), …, (Rn, Sn) on disk]

• Observe that Ri ⋈ Sj = ∅ for all i ≠ j.

Hash Join

• By partitioning the data, we reduced the problem of joining to smaller sub-relations Ri and Si.
• Matching tuples are guaranteed to end up together in the same partition (again: this works for equality predicates only).
• We only need to compute Ri ⋈ Si (for all i).
• By choosing n properly (i.e., the hash function h), partitions become small enough to implement the Ri ⋈ Si as in-memory joins.
• The in-memory join is typically accelerated using a hash table, too. We already did this for the block nested loops join (% slide 34).

 Intra-partition join via hash table

Use a different hash function h′ ≠ h for the intra-partition join. Why? (Hint: within partition i, all tuples agree on their h value, so h itself would not spread the partition's tuples over the in-memory buckets at all.)

Hash Join Algorithm

Hash join

1 Function: hash_join (R, S, α = β)
    /* partitioning phase */
2   foreach record r ∈ R do
3     append r to partition Rh(r.α)
4   foreach record s ∈ S do
5     append s to partition Sh(s.β)
    /* intra-partition join phase */
6   foreach partition i ∈ 1, …, n do
7     build a hash table H for Ri, using hash function h′ ;
8     foreach block b ∈ Si do
9       foreach record s ∈ b do
10        probe H via h′(s.β) and append matching tuples to the result ;
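Both phases, condensed into runnable Python (a sketch: partitions live in memory, Python's hash() with different salts plays the roles of h and h′, and collections.defaultdict provides the buckets):

from collections import defaultdict

def hash_join(R, S, alpha, beta, n=8):
    """Partitioned hash join of R and S on r[alpha] == s[beta]."""
    h  = lambda v: hash(("h",  v)) % n   # partitioning hash function
    h2 = lambda v: hash(("h'", v))       # intra-partition hash function

    # Partitioning phase: route both inputs through h.
    R_part, S_part = defaultdict(list), defaultdict(list)
    for r in R:
        R_part[h(r[alpha])].append(r)
    for s in S:
        S_part[h(s[beta])].append(s)

    # Intra-partition join phase: build on R_i, probe with S_i.
    result = []
    for i in range(n):
        H = defaultdict(list)
        for r in R_part[i]:
            H[h2(r[alpha])].append(r)
        for s in S_part[i]:
            for r in H[h2(s[beta])]:
                if r[alpha] == s[beta]:  # resolve h' collisions
                    result.append({**r, **s})
    return result

print(hash_join([{"a": 1}, {"a": 2}], [{"b": 2}, {"b": 2}], "a", "b"))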

Hash Join: Buffer Requirements

• We assumed that we can create the necessary n partitions in one pass (note that we want NRi < (B − 1)).
• This works out if R consists of at most ≈ (B − 1)² pages.

 Why (B − 1)²? Why ≈?

• We can write out at most B − 1 partitions in one pass (one buffer page is needed to read R); each of them should be at most B − 1 pages in size.
• Hashing does not guarantee an even distribution. Since the actual size of each partition varies, R must actually be smaller than (B − 1)².

• Larger input tables require multiple passes for partitioning (recursive partitioning).

Orchestrating Operator Evaluation

So far we have assumed that all database operators consume and produce files (i.e., on-disk items):

File-based operator input and output

    file1 → π → file2 → σ → file3 → ⋯ → filen

• Obviously, using secondary storage as the communication channel causes a lot of disk I/O.
• In addition, we suffer from long response times:
  • An operator cannot start computing its result before all its input files are fully generated (“materialized”).
  • Effectively, all operators are executed in sequence.

Unix: Temporary Result Files

• Architecting the query processor in this fashion bears much resemblance to using the Unix shell like this:

File-based Unix command sequencing

# report "large" XML files below current working dir
$ find . -size +500k > tmp1
$ xargs file < tmp1 > tmp2
$ grep -i XML < tmp2 > tmp3
$ cut -d: -f1 < tmp3
⟨output generated here⟩

# remove temporary files
$ rm -f tmp[0-9]*

Pipelined Evaluation

• Alternatively, each operator could pass its result directly on to the next operator (without persisting it to disk first).
  ⇒ Do not wait until the entire file is created, but propagate output immediately.
  ⇒ Start computing results as early as possible, i.e., as soon as enough input data is available to start producing output.
• This idea is referred to as pipelining.
• The granularity in which data is passed may influence performance:
  • Smaller chunks reduce the response time of the system.
  • Larger chunks may improve the effectiveness of (instruction) caches.
  • Actual systems typically operate tuple at a time.

Unix: Pipelines of Processes

Unix provides a similar mechanism to communicate between processes (“operators”):

Pipeline-based Unix command sequencing

$ find . -size +500k | xargs file | grep -i XML | cut -d: -f1
⟨output generated here⟩

 Execution of this pipe is driven by the rightmost operator; all operators act in parallel:

• To produce a line of output, cut only needs to see the next line of its input: grep is requested to produce this input.
• To produce a line of output, grep requests input lines from the xargs process until it receives a line containing the string "XML".
  ⋮
• Each line produced by the find process is passed through the pipe until it reaches the cut process and eventually is echoed to the terminal.

The Volcano Iterator Model

• The calling interface used in database execution runtimes is very similar to the one used in Unix process pipelines.
• In databases, this interface is referred to as the open-next-close interface or Volcano iterator model.
• Each operator implements the functions

  open ()   Initialize the operator's internal state.
  next ()   Produce and return the next result tuple, or ⟨EOF⟩.
  close ()  Clean up all allocated resources (typically after all tuples have been processed).

• All state is kept inside each operator instance:
  • Operators are required to produce a tuple via next (), then pause, and later resume on a subsequent next () call.

% Goetz Graefe. Volcano - An Extensible and Parallel Query Evaluation System. IEEE Trans. Knowl. Data Eng., vol. 6, no. 1, February 1994.

Volcano-Style Pipelined Evaluation

Example (Pipelined query plan)

    scan(R1), scan(R2) → ⋈p → πl → σq

• Given a query plan like the one shown above, query evaluation is driven by the query processor like this (just like in the Unix shell):
  1 The whole plan is initially reset by calling open () on the root operator, i.e., σq.open ().
  2 The open () call is forwarded through the plan by the operators themselves (see σ.open () on slide 53).
  3 Control returns to the query processor.
  4 The root is requested to produce its next result record, i.e., the call σq.next () is made.
  5 Operators forward the next () request as needed. As soon as the next result record is produced, control returns to the query processor again.

Volcano-Style Pipelined Evaluation


• In a nutshell, the query processor uses the following routine to evaluate a query plan:

Query plan evaluation driver

1 Function: eval (q)
2   q.open ();
3   r ← q.next ();
4   while r ≠ ⟨EOF⟩ do
      /* deliver record r (print, ship to DB client) */
5     emit (r);
6     r ← q.next ();
    /* resource deallocation now */
7   q.close ();

Volcano-Style Selection (σp)

• Input operator (sub-plan root) R, predicate p:

Volcano-style interface for σp(R)

1 Function: open ()
2   R.open () ;

1 Function: close ()
2   R.close () ;

1 Function: next ()
2   while ((r ← R.next ()) ≠ ⟨EOF⟩) do
3     if p(r) then
4       return r ;

5   return ⟨EOF⟩ ;
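The open-next-close protocol maps naturally onto Python classes (a minimal sketch; Scan, Select, and the use of None to signal ⟨EOF⟩ are illustrative choices, not part of the original interface):

class Scan:
    """Leaf operator: iterates over an in-memory list of tuples."""
    def __init__(self, table):
        self.table = table
    def open(self):
        self.pos = 0
    def next(self):
        if self.pos == len(self.table):
            return None                    # <EOF>
        self.pos += 1
        return self.table[self.pos - 1]
    def close(self):
        pass

class Select:
    """sigma_p: pulls from input R until a tuple satisfies p."""
    def __init__(self, R, p):
        self.R, self.p = R, p
    def open(self):
        self.R.open()
    def next(self):
        while (r := self.R.next()) is not None:
            if self.p(r):
                return r                   # pause here, resume on next call
        return None                        # <EOF>
    def close(self):
        self.R.close()

# Driver: the eval () routine from the previous slide.
q = Select(Scan([1, 2, 3, 4]), lambda r: r % 2 == 0)
q.open()
while (r := q.next()) is not None:
    print(r)                               # emits 2, 4
q.close()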

Volcano-Style Nested Loops Join (⋈p)

 A Volcano-style implementation of nested loops join R ⋈p S?

1 Function: open ()
2   R.open () ;
3   S.open () ;
4   r ← R.next () ;

1 Function: close ()
2   R.close () ;
3   S.close () ;

1 Function: next ()
2   while (r ≠ ⟨EOF⟩) do
3     while ((s ← S.next ()) ≠ ⟨EOF⟩) do
4       if p(r, s) then
          /* emit concatenated result */
5         return ⟨r, s⟩ ;
      /* reset inner join input */
6     S.close () ;
7     S.open () ;
8     r ← R.next () ;
9   return ⟨EOF⟩ ;

Pipelining (Real DBMS Product)

Volcano-style pipelined selection operator (C code)

/* eFLTR -- apply filter predicate pred to stream.
   Filter the in-bound stream; only stream elements that fulfill
   e->pred contribute to the result.  No index support. */

eRC eOp_FLTR (eOp *ip) {
  eObj_FLTR *e = (eObj_FLTR *)eObj (ip);

  while (eIntp (e->in) != eEOS) {       /* pull next element from input */
    eIntp (e->pred);                    /* evaluate the predicate       */
    if (eT_as_bool (eVal (e->pred))) {
      eVal (ip) = eVal (e->in);         /* forward qualifying element   */
      return eOK;
    }
  }
  return eEOS;                          /* input stream exhausted       */
}

eRC eOp_FLTR_RST (eOp *ip) {
  eObj_FLTR *e = (eObj_FLTR *)eObj (ip);

  eReset (e->in);                       /* reset input and predicate    */
  eReset (e->pred);
  return eOK;
}

Blocking Operators

• Pipelining reduces memory requirements and response time since each chunk of input is propagated to the output immediately.
• Some operators cannot be implemented in such a way.

 Which operators do not permit pipelined evaluation?

• (external) sorting (this is also true for Unix sort)
• hash join
• grouping and duplicate elimination over unsorted input

• Such operators are said to be blocking.
• Blocking operators consume their entire input before they can produce any output.
• The data is typically buffered (“materialized”) on disk.

Chapter 9
Cardinality Estimation
How Many Rows Does a Query Yield?

Architecture and Implementation of Database Systems

Database Profiles
  Assumptions

Estimating Operator Cardinality
  Selection σ
  Projection π
  Set Operations ∪, \, ×
  Join ⋈

Histograms
  Equi-Width
  Equi-Depth

Statistical Views


Cardinality Estimation

• A relational query optimizer performs a phase of cost-based plan search to identify the (presumably) “cheapest” alternative among a set of equivalent execution plans (% Chapter on Query Optimization).
• Since page I/O cost dominates, the estimated cardinality of a (sub-)query result is crucial input to this search.
  • Cardinality is typically measured in pages or rows.
• Cardinality estimates are also valuable when it comes to buffer “right-sizing” before query evaluation starts (e.g., allocate B buffer pages and determine the blocking factor b for an external sort).

Estimating Query Result Cardinality

There are two principal approaches to query cardinality estimation:

1 Database Profile.
  Maintain statistical information about numbers and sizes of tuples and the distribution of attribute values for base relations, as part of the database catalog (meta information), during database updates.
  • Calculate these parameters for intermediate query results based upon a (simple) statistical model during query optimization.
  • Typically, the statistical model is based upon the uniformity and independence assumptions.
  • Both are typically not valid, but they allow for simple calculations ⇒ limited accuracy.
  • In order to improve accuracy, the system can record histograms to more closely model the actual value distributions in relations.

Estimating Query Result Cardinality

2 Sampling Techniques.
  Gather the necessary characteristics of a query plan (base relations and intermediate results) at query execution time:
  • Run the query on a small sample of the input.
  • Extrapolate to the full input size.
  • It is crucial to find the right balance between sample size and the resulting accuracy.

These slides focus on 1 Database Profiles.

Database Profiles

Keep profile information in the database catalog. Update whenever SQL DML commands are issued (database updates):

Typical database profile for relation R

  |R|         number of records in relation R
  NR          number of disk pages allocated for these records
  s(R)        average record size (width)
  V(A, R)     number of distinct values of attribute A
  MCV(A, R)   most common values of attribute A
  MCF(A, R)   frequency of the most common values of attribute A
  ⋮           possibly many more

Database Profiles: IBM DB2

Excerpt of IBM DB2 catalog information for a TPC-H database

db2 => SELECT TABNAME, CARD, NPAGES
db2 (cont.) =>   FROM SYSCAT.TABLES
db2 (cont.) =>   WHERE TABSCHEMA = 'TPCH';

TABNAME     CARD      NPAGES
----------  --------  ------
ORDERS      1500000   44331
CUSTOMER    150000    6747
NATION      25        2
REGION      5         1
PART        200000    7578
SUPPLIER    10000     406
PARTSUPP    800000    31679
LINEITEM    6001215   207888

8 record(s) selected.

• Note: Column CARD ≡ |R|, column NPAGES ≡ NR.

Database Profile: Assumptions

In order to obtain tractable cardinality estimation formulae, assume one of the following:

Uniformity & independence (simple, yet rarely realistic)
All values of an attribute appear with the same probability (even distribution). Values of different attributes are independent of each other.

Worst case (unrealistic)
No knowledge about relation contents at all. In the case of a selection σp, assume all records will satisfy predicate p. (May only be used to compute upper bounds of the expected cardinality.)

Perfect knowledge (unrealistic)
Details about the exact distribution of values are known. Requires a huge catalog or prior knowledge of incoming queries. (May only be used to compute lower bounds of the expected cardinality.)

Cardinality Estimation for σ (Equality Predicate)

• Database systems typically operate under the uniformity assumption. We will come across this assumption multiple times below.

Query: Q ≡ σA=c(R)

  Selectivity   sel(A = c) = MCF(A, R)[c]   if c ∈ MCV(A, R)
  Selectivity   sel(A = c) = 1/V(A, R)      otherwise (uniformity)
  Cardinality   |Q| = sel(A = c) · |R|
  Record size   s(Q) = s(R)

Selectivity Estimation for σ (Other Predicates)

• Equality between attributes (Q ≡ σA=B(R)): Approximate the selectivity by

    sel(A = B) = 1 / max(V(A, R), V(B, R)) .

  (Assumes that each value of the attribute with fewer distinct values has a corresponding match in the other attribute: independence.)

• Range selections (Q ≡ σA>c(R)):
  In the database profile, maintain the minimum and maximum values of attribute A in relation R, Low(A, R) and High(A, R).
  Approximate the selectivity (uniformity) by

    sel(A > c) = (High(A, R) − c) / (High(A, R) − Low(A, R))   if Low(A, R) ≤ c ≤ High(A, R)
    sel(A > c) = 0                                             otherwise
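Both rules fit in a few lines of Python (a sketch; the profile is a plain dict whose field names mirror the profile entries above and are illustrative only):

def sel_eq_const(profile, A, c):
    """sel(A = c): use MCV/MCF statistics if recorded, else uniformity."""
    if c in profile[A].get("mcv", {}):
        return profile[A]["mcv"][c]          # recorded frequency
    return 1.0 / profile[A]["distinct"]      # 1 / V(A, R)

def sel_gt_const(profile, A, c):
    """sel(A > c) under the uniformity assumption."""
    lo, hi = profile[A]["low"], profile[A]["high"]
    if lo <= c <= hi:
        return (hi - c) / (hi - lo)
    return 0.0

profile = {"price": {"distinct": 50, "low": 0, "high": 100, "mcv": {}}}
print(sel_eq_const(profile, "price", 42))    # 0.02
print(sel_gt_const(profile, "price", 75))    # 0.25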

Cardinality Estimation for π

• For Q ≡ πL(R), estimating the number of result rows is difficult (L = ⟨A1, A2, …, An⟩: list of projection attributes):

Q ≡ πL(R) (duplicate elimination)

  Cardinality   |Q| = V(A, R)                       if L = ⟨A⟩
                |Q| = |R|                           if the keys of R ∈ L
                |Q| = |R|                           if no duplicate elimination is performed
                |Q| = min(|R|, ∏Ai∈L V(Ai, R))      otherwise (independence)

  Record size   s(Q) = ∑Ai∈L s(Ai)

Cardinality Estimation for ∪, \, ×

Q ≡ R ∪ S
  |Q| ≤ |R| + |S|
  s(Q) = s(R) = s(S)          (schemas of R, S identical)

Q ≡ R \ S
  max(0, |R| − |S|) ≤ |Q| ≤ |R|
  s(Q) = s(R) = s(S)

Q ≡ R × S
  |Q| = |R| · |S|
  s(Q) = s(R) + s(S)
  V(A, Q) = V(A, R) for A ∈ R;  V(A, Q) = V(A, S) for A ∈ S

Cardinality Estimation for ⋈

• Cardinality estimation for the general join case is challenging.
• A special, yet very common case: a foreign-key relationship between the input relations R and S:

Establish a foreign key relationship (SQL)

CREATE TABLE R (A INTEGER NOT NULL,
                ...
                PRIMARY KEY (A));
CREATE TABLE S (...,
                A INTEGER NOT NULL,
                ...
                FOREIGN KEY (A) REFERENCES R);

Q ≡ R ⋈R.A=S.A S

The foreign key constraint guarantees πA(S) ⊆ πA(R). Thus:

  |Q| = |S| .

Cardinality Estimation for ⋈

Q ≡ R ⋈R.A=S.B S

  |Q| = (|R| · |S|) / V(A, R)   if πB(S) ⊆ πA(R)
  |Q| = (|R| · |S|) / V(B, S)   if πA(R) ⊆ πB(S)

  s(Q) = s(R) + s(S)
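Stated as code (a sketch; collapsing the two containment cases into taking the larger distinct-value count is a common simplification, not something the slide prescribes):

def est_join_card(card_R, card_S, v_A_R, v_B_S):
    """|R JOIN S| estimate on A = B; assumes value-set containment."""
    return card_R * card_S / max(v_A_R, v_B_S)

# Foreign-key case: S.B references key R.A, so V(A, R) = |R| and the
# estimate collapses to |S|.
print(est_join_card(1000, 50000, v_A_R=1000, v_B_S=800))  # 50000.0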

Histograms

• In realistic database instances, values are not uniformly distributed in an attribute's active domain (the actual values found in a column).
• To keep track of this non-uniformity for an attribute A, maintain a histogram to approximate the actual distribution:
  1 Divide the active domain of A into adjacent intervals by selecting boundary values bi.
  2 Collect statistical parameters for each interval between boundaries, e.g.,
    • the number of rows r with bi−1 < r.A ≤ bi, or
    • the number of distinct A values in the interval (bi−1, bi].
• The histogram intervals are also referred to as buckets.

(% Y. Ioannidis: The History of Histograms (Abridged). Proc. VLDB 2003)

Histograms in IBM DB2

Histogram maintained for a column in a TPC-H database

SELECT SEQNO, COLVALUE, VALCOUNT
FROM   SYSCAT.COLDIST
WHERE  TABNAME = 'LINEITEM'
AND    COLNAME = 'L_EXTENDEDPRICE'
AND    TYPE = 'Q';

SEQNO  COLVALUE            VALCOUNT
-----  ------------------  --------
    1  +0000000000996.01       3001
    2  +0000000004513.26     315064
    3  +0000000007367.60     633128
    4  +0000000011861.82     948192
    5  +0000000015921.28    1263256
    6  +0000000019922.76    1578320
    7  +0000000024103.20    1896384
    8  +0000000027733.58    2211448
    9  +0000000031961.80    2526512
   10  +0000000035584.72    2841576
   11  +0000000039772.92    3159640
   12  +0000000043395.75    3474704
   13  +0000000047013.98    3789768
    ⋮

• Catalog table SYSCAT.COLDIST also contains information like
  • the n most frequent values (and their frequency),
  • the number of distinct values in each bucket.
• Histograms may even be manipulated manually to tweak optimizer decisions.

Histograms

• Two types of histograms are widely used:
  1 Equi-Width Histograms.
    All buckets have the same width, i.e., boundary bi = bi−1 + w, for some fixed w.
  2 Equi-Depth Histograms.
    All buckets contain the same number of rows (i.e., their width varies).
• Equi-depth histograms (2) are able to adapt to data skew (highly non-uniform distributions).
• The number of buckets is the tuning knob that defines the tradeoff between estimation quality (histogram resolution) and histogram size: catalog space is limited.

Equi-Width Histograms

Example (Actual value distribution)

Column A of SQL type INTEGER (domain {…, -2, -1, 0, 1, 2, …}). Actual non-uniform distribution in relation R:

[Figure: bar chart of the frequency of each value A = 1, …, 16; the 64 tuples of R are distributed non-uniformly, with frequencies between 0 and 9]

Equi-Width Histograms

• Divide the active domain of attribute A into B buckets of equal width. The bucket width w will be

    w = (High(A, R) − Low(A, R) + 1) / B

Example (Equi-width histogram (B = 4))

[Figure: the distribution above grouped into four width-4 buckets [1, 4], [5, 8], [9, 12], [13, 16] with frequency sums 5, 19, 27, and 13]

• Maintain the sum of the value frequencies in each bucket (in addition to the bucket boundaries bi).

Equi-Width Histograms: Equality Selections

Example (Q ≡ σA=5(R))

[Figure: equi-width histogram as above; value 5 falls into bucket [5, 8]]

• Value 5 is in bucket [5, 8] (with 19 tuples).
• Assume a uniform distribution within the bucket:

    |Q| = 19/w = 19/4 ≈ 5 .

Actual: |Q| = 1

 What would be the cardinality under the uniformity assumption (no histogram)?

Equi-Width Histograms: Range Selections

Example (Q ≡ σA>7 AND A≤16(R))

[Figure: equi-width histogram as above; the query interval (7, 16] covers buckets [9, 12] and [13, 16] and touches [5, 8]]

• The query interval (7, 16] covers buckets [9, 12] and [13, 16]. The query interval touches [5, 8]:

    |Q| = 27 + 13 + 19/4 ≈ 45 .

Actual: |Q| = 48

 What would be the cardinality under the uniformity assumption (no histogram)?

Equi-Width Histogram: Construction

• To construct an equi-width histogram for relation R, attribute A:
  1 Compute the boundaries bi from High(A, R) and Low(A, R).
  2 Scan R once sequentially.
  3 While scanning, maintain B running tuple frequency counters, one for each bucket.
• If scanning R in step 2 is prohibitive, scan a small sample Rsample ⊂ R, then scale the frequency counters by |R| / |Rsample|.
• To maintain the histogram under insertions (deletions):

1 Simply increment (decrement) frequency counter in affected bucket.
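Construction and the two estimation rules, combined into one small Python class (a sketch for integer-valued domains; the class and method names are illustrative, and values are assumed to lie inside the recorded domain):

from math import ceil

class EquiWidthHistogram:
    def __init__(self, values, B):
        lo, hi = min(values), max(values)
        self.lo, self.w = lo, ceil((hi - lo + 1) / B)
        self.freq = [0] * B                  # one counter per bucket
        for v in values:                     # single sequential scan
            self.freq[(v - lo) // self.w] += 1

    def est_eq(self, c):
        """sigma_{A=c}: bucket frequency spread uniformly over w values."""
        return self.freq[(c - self.lo) // self.w] / self.w

    def est_range(self, lo, hi):
        """sigma_{lo < A <= hi}: sum covered buckets, prorate partial ones."""
        est = 0.0
        for i, f in enumerate(self.freq):
            b_lo = self.lo + i * self.w
            b_hi = b_lo + self.w - 1
            overlap = min(hi, b_hi) - max(lo + 1, b_lo) + 1
            if overlap > 0:
                est += f * overlap / self.w
        return est

h = EquiWidthHistogram([1, 5, 5, 9, 9, 9, 13], B=4)
print(h.est_eq(5), h.est_range(7, 16))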

Equi-Depth Histograms

• Divide the active domain of attribute A into B buckets holding roughly the same number of tuples each; the depth d of each bucket will be

    d = |R| / B .

Example (Equi-depth histogram (B = 4, d = 16))

[Figure: the distribution above grouped into four buckets [1, 7], [8, 9], [10, 11], [12, 16] of 16 tuples each; the bucket widths vary]

• Maintain the depth (and the bucket boundaries bi).

Equi-Depth Histograms

Example (Equi-depth histogram (B = 4, d = 16))

[Figure: as above, the four equi-depth buckets of 16 tuples each]

Intuition:
• High value frequencies are more important than low value frequencies.
• The resolution of the histogram adapts to skewed value distributions.

Equi-Width vs. Equi-Depth Histograms

Example (Histogram on a customer age attribute (B = 8, |R| = 5,600))

[Figure, top: equi-width histogram with buckets 0–10, 10–20, …, 70–80 and bucket counts between 100 and 1,300; bottom: equi-depth histogram with buckets 0–20, 21–29, 30–36, 37–41, 44–48, 49–52, 55–59, 60–80, each holding 700 tuples]

• Equi-depth histogram “invests” bytes in the densely populated customer age region between 30 and 59.

Equi-Depth Histograms: Equality Selections

Example (Q ≡ σA=5(R))

[Figure: equi-depth histogram as above; value 5 falls into the first bucket [1, 7]]

• Value 5 is in the first bucket [1, 7] (with d = 16 tuples).
• Assume a uniform distribution within the bucket:

    |Q| = d/7 = 16/7 ≈ 2 .

(Actual: |Q| = 1)

Equi-Depth Histograms: Range Selections

Example (Q ≡ σA>5 AND A≤16(R))

[Figure: equi-depth histogram as above; the query interval (5, 16] fully covers the buckets [8, 9], [10, 11], [12, 16] and touches [1, 7]]

• The query interval (5, 16] covers buckets [8, 9], [10, 11] and [12, 16] (all with d = 16 tuples). The query interval touches [1, 7]:

    |Q| = 16 + 16 + 16 + 2/7 · 16 ≈ 53 .

(Actual: |Q| = 59)

Equi-Depth Histograms: Construction

• To construct an equi-depth histogram for relation R, attribute A:
  1 Compute the depth d = |R|/B.
  2 Sort R by sort criterion A.
  3 b0 = Low(A, R); then determine the bi by dividing the sorted R into chunks of size d.

Example (B = 4, |R| = 64)

1 d = 64/4 = 16.
2 Sorted R.A:
  ⟨1,2,2,3,3,5,6,6,6,6,6,6,7,7,7,7, 8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9, 10,10,…⟩
3 Boundaries of the d-sized chunks in the sorted R: the first chunk ends at b1 = 7, the second at b2 = 9, ….
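The same construction in Python (a sketch; it returns the upper bucket boundaries b1, …, bB−1 read off the sorted input, and the value counts below are assumed so as to reproduce the sorted sequence shown above — only its first 34 entries appear on the slide):

def equi_depth_boundaries(values, B):
    """Sort, then cut into B chunks of depth d; report chunk-end values."""
    d = len(values) // B
    ordered = sorted(values)
    return [ordered[i * d - 1] for i in range(1, B)]

values = ([1] + [2] * 2 + [3] * 2 + [5] + [6] * 6 + [7] * 4
          + [8] * 8 + [9] * 8 + [10] * 9 + [11] * 7
          + [12] * 3 + [13] * 5 + [14] * 3 + [15] * 3 + [16] * 2)
print(len(values), equi_depth_boundaries(values, B=4))  # 64 [7, 9, 11]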

A Cardinality (Mis-)Estimation Scenario

• Because exact cardinalities and estimated selectivity information are provided for base tables only, the DBMS relies on projected cardinalities for derived tables.
• In the case of foreign key joins, IBM DB2 promotes selectivity factors for one join input to the join result, for example.

Example (Selectivity promotion; K is the key of S, πA(R) ⊆ πK(S))

    R ⋈R.A=S.K (σB=10(S))

If sel(B = 10) = x, then assume that the join will yield x · |R| rows.

• Whenever the value distribution of A in R does not match the distribution of B in S, the cardinality estimate may be severely off.

A Cardinality (Mis-)Estimation Scenario

Example (Excerpt of a data warehouse)

Dimension table STORE (63 rows):
  STOREKEY · STORE_NUMBER · CITY · STATE · DISTRICT · ⋯

Dimension table PROMOTION (35 rows):
  PROMOKEY · PROMOTYPE · PROMODESC · PROMOVALUE · ⋯

Fact table DAILY_SALES (754,069,426 rows):
  STOREKEY · CUSTKEY · PROMOKEY · SALES_PRICE · ⋯

Let the tables be arranged in a star schema:
• The fact table references the dimension tables,
• the dimension tables are small/stable, the fact table is large/continuously updated on each sale.
⇒ Histograms are maintained for the dimension tables.

A Cardinality (Mis-)Estimation Scenario

Query against the data warehouse

Find the number of those sales in store '01' (18 of the overall 63 locations) that were the result of the sales promotion of type 'XMAS' (a “star join”):

SELECT COUNT(*)
FROM   STORE d1, PROMOTION d2, DAILY_SALES f
WHERE  d1.STOREKEY = f.STOREKEY
AND    d2.PROMOKEY = f.PROMOKEY
AND    d1.STORE_NUMBER = '01'
AND    d2.PROMOTYPE = 'XMAS'

The query yields 12,889,514 rows. The histograms lead to the following selectivity estimates:

  sel(STORE_NUMBER = '01') = 18/63 (28.57 %)
  sel(PROMOTYPE = 'XMAS')  = 1/35  (2.86 %)

A Cardinality (Mis-)Estimation Scenario

Estimated cardinalities and selected plan

SELECT COUNT(*)
FROM   STORE d1, PROMOTION d2, DAILY_SALES f
WHERE  d1.STOREKEY = f.STOREKEY
AND    d2.PROMOKEY = f.PROMOKEY
AND    d1.STORE_NUMBER = '01'
AND    d2.PROMOTYPE = 'XMAS'

[Plan fragment (top numbers indicate estimated cardinalities): an IXAND(8), estimated at 6.15567e+06 rows, intersects two NLJOIN legs over the fact table's indexes. The PROMOTION leg (35 rows fetched, probing index PROMO_FK_IDX on the 7.54069e+08-row fact table) is estimated at 2.15448e+07 rows; the STORE leg (18 qualifying of 63 rows, probing index STORE_FK_IDX) is estimated at 2.15448e+08 rows]

IBM DB2: Statistical Views

• To provide database profile information (estimated cardinalities, value distributions, …) for derived tables:
  1 Define a view that precomputes the derived table (or possibly a small sample of it; IBM DB2: 10 %),
  2 use the view result to gather and keep statistics, then delete the result.

Statistical views

CREATE VIEW sv_store_dailysales AS
  (SELECT s.*
   FROM STORE s, DAILY_SALES ds
   WHERE s.STOREKEY = ds.STOREKEY);

CREATE VIEW sv_promotion_dailysales AS
  (SELECT p.*
   FROM PROMOTION p, DAILY_SALES ds
   WHERE p.PROMOKEY = ds.PROMOKEY);

ALTER VIEW sv_store_dailysales ENABLE QUERY OPTIMIZATION;
ALTER VIEW sv_promotion_dailysales ENABLE QUERY OPTIMIZATION;

RUNSTATS ON TABLE sv_store_dailysales WITH DISTRIBUTION;
RUNSTATS ON TABLE sv_promotion_dailysales WITH DISTRIBUTION;

Cardinality Estimation with Statistical Views

Estimated cardinalities and selected plan after reoptimization

[Plan fragment after reoptimization: the IXAND(8) is now estimated at 1.04627e+07 rows; the STORE leg is estimated at 6.99152e+07 rows, the PROMOTION leg at 1.12845e+08 rows]

Note the new estimated selectivities after the join:
• Selectivity of PROMOTYPE = 'XMAS' is now only 14.96 % (was: 2.86 %)
• Selectivity of STORE_NUMBER = '01' is now 9.27 % (was: 28.57 %)

Chapter 10
Query Optimization
Exploring the Search Space of Alternative Query Plans

Architecture and Implementation of Database Systems

Search Space Illustration

Dynamic Programming
  Example: Four-Way Join
  Algorithm
  Discussion

Left/Right-Deep vs. Bushy

Greedy join enumeration

Finding the “Best” Query Plan

Throttle or brake?

[Figure: the same SELECT-FROM-WHERE query can be answered by different plans, e.g., a nested loops join ⋈NL or a hash join ⋈hash over inputs such as R, σ(S), and T]

• We already saw that there may be more than one way to answer a given query.
• Which one of the join operators should we pick? With which parameters (block size, buffer allocation, …)?
• The task of finding the best execution plan is, in fact, the “holy grail” of any database implementation.

Plan Generation Process

Query compilation (SQL → Parser → Rewriting → Optimizer → Plan):

• Parser: syntactical/semantical analysis.
• Rewriting: heuristic optimizations, independent of the current database state (table sizes, availability of indexes, etc.). For example:
  • apply predicates early,
  • avoid unnecessary duplicate elimination.
• Optimizer: optimizations that rely on a cost model and information about the current database state.

• The resulting plan is then evaluated by the system's execution engine.

Impact on Performance

Finding the right plan can dramatically impact performance.

Sample query over TPC-H tables

SELECT L.L_PARTKEY, L.L_QUANTITY, L.L_EXTENDEDPRICE
FROM   LINEITEM L, ORDERS O, CUSTOMER C
WHERE  L.L_ORDERKEY = O.O_ORDERKEY
AND    O.O_CUSTKEY = C.C_CUSTKEY
AND    C.C_NAME = 'IBM Corp.'

[Figure: two alternative join trees, both yielding the final 57 rows. One materializes intermediate results of 6 million and 1.5 million rows; the other applies the selection on CUSTOMER first (150,000 → 1 row) and keeps all intermediate results tiny]

• In terms of execution times, these differences can easily mean “seconds versus days.”

The SQL Parser

• Besides some analyses regarding the syntactical and semantical correctness of the input query, the parser creates an internal representation of the input query.
• This representation still resembles the original query:
  • Each SELECT-FROM-WHERE clause is translated into a query block.

Deriving a query block from an SQL SFW block

  SELECT   proj-list
  FROM     R1, R2, …, Rn
  WHERE    predicate-list    →    πproj-list ( σhaving-list ( grpbygroupby-list ( σpredicate-list ( R1 × R2 × ⋯ × Rn ))))
  GROUP BY groupby-list
  HAVING   having-list

• Each Ri can be a base relation or another query block.

Finding the “Best” Execution Plan

The parser output is fed into a rewrite engine which, again, yields a tree of query blocks. It is then the optimizer's task to come up with the optimal execution plan for the given query.

Essentially, the optimizer
1 enumerates all possible execution plans (if this yields too many plans, at least enumerates the “promising” plan candidates),
2 determines the quality (cost) of each plan, then
3 chooses the best one as the final execution plan.

Before we can do so, we need to answer the question: What is a “good” execution plan at all?

Cost Metrics

Database systems judge the quality of an execution plan based on a number of cost factors, e.g.,
• the number of disk I/Os required to evaluate the plan,
• the plan's CPU cost,
• the overall response time observable by the database client as well as the total execution time.

A cost-based optimizer tries to anticipate these costs and find the cheapest plan before actually running it.
• All of the above factors depend on one critical piece of information: the size of (intermediate) query results.
• Database systems therefore spend considerable effort on accurate result size estimates.

Result Size Estimation

Consider a query block corresponding to a simple SFW query Q.

SFW query block

  πproj-list ( σpredicate-list ( R1 × R2 × ⋯ × Rn ) )

We can estimate the result size of Q based on
• the sizes of the input tables, |R1|, …, |Rn|, and
• the selectivity sel(p) of the predicate predicate-list:

    |Q| ≈ |R1| · |R2| ⋯ |Rn| · sel(predicate-list) .

Join Optimization

• We've now translated the query into a graph of query blocks.
• Query blocks essentially are a multi-way Cartesian product with a number of selection predicates on top.
• We can estimate the cost of a given execution plan.
  • Use result size estimates in combination with the costs of the individual join algorithms discussed in previous chapters.

We are now ready to enumerate all possible execution plans, i.e., all possible 2-way join combinations for each query block.

Ways of building a 3-way join from two 2-way joins

  (R ⋈ S) ⋈ T    (S ⋈ R) ⋈ T    (R ⋈ T) ⋈ S    (T ⋈ R) ⋈ S    (S ⋈ T) ⋈ R    (T ⋈ S) ⋈ R
  T ⋈ (R ⋈ S)    T ⋈ (S ⋈ R)    S ⋈ (R ⋈ T)    S ⋈ (T ⋈ R)    R ⋈ (S ⋈ T)    R ⋈ (T ⋈ S)

How Many Such Combinations Are There?

• A join over n + 1 relations R1, …, Rn+1 requires n binary joins.
• Its root-level operator joins sub-plans of k and n − k − 1 join operators (0 ≤ k ≤ n − 1):

          ⋈
        /   \
   k joins    n − k − 1 joins
  R1,…,Rk+1    Rk+2,…,Rn+1

• Let Ci be the number of possibilities to construct a binary tree of i inner nodes (join operators):

    Cn = ∑k=0…n−1 Ck · Cn−k−1 .

Catalan Numbers

This recurrence relation is satisfied by the Catalan numbers:

    Cn = ∑k=0…n−1 Ck · Cn−k−1 = (2n)! / ((n + 1)! n!) ,

describing the number of ordered binary trees with n + 1 leaves.

For each of these trees, we can permute the input relations R1, …, Rn+1 (why?), leading to:

Number of possible join trees for an (n + 1)-way relational join

    ((2n)! / ((n + 1)! n!)) · (n + 1)! = (2n)! / n!
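A few lines of Python reproduce these counts (and the table on the next slide); this is plain arithmetic on the formulas above:

from math import factorial

def catalan(n):
    """C_n = (2n)! / ((n+1)! n!): ordered binary trees, n inner nodes."""
    return factorial(2 * n) // (factorial(n + 1) * factorial(n))

def join_trees(n_relations):
    """(2n)!/n! with n = n_relations - 1: trees times leaf permutations."""
    n = n_relations - 1
    return factorial(2 * n) // factorial(n)

for m in (2, 3, 4, 5, 6, 7, 8, 10):
    print(m, catalan(m - 1), join_trees(m))   # matches the table below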

Search Space

The resulting search space is enormous:

Possible bushy join trees joining n relations

  number of relations n   Cn−1    join trees
  2                       1       2
  3                       2       12
  4                       5       120
  5                       14      1,680
  6                       42      30,240
  7                       132     665,280
  8                       429     17,297,280
  10                      4,862   17,643,225,600

• And we haven't yet even considered the use of k different join algorithms (yielding another factor of k^(n−1))!

Dynamic Programming

The traditional approach to master this search space is the use of dynamic programming.

Idea:
• Find the cheapest plan for an n-way join in n passes.
• In each pass k, find the best plans for all k-relation sub-queries.
• Construct the plans in pass k from the best i-relation and (k − i)-relation sub-plans found in earlier passes (1 ≤ i < k).

Assumption:
• To find the optimal global plan, it is sufficient to only consider the optimal plans of its sub-queries (“principle of optimality”).

Dynamic Programming

Example (Four-way join of tables R1,…,4)

Pass 1 (best 1-relation plans)
  Find the best access path to each of the Ri individually (considers index scans, full table scans).

Pass 2 (best 2-relation plans)
  For each pair of tables Ri and Rj, determine the best order to join Ri and Rj (use Ri ⋈ Rj or Rj ⋈ Ri?):

    optPlan({Ri, Rj}) ← best of Ri ⋈ Rj and Rj ⋈ Ri .

  → 12 plans to consider.

Pass 3 (best 3-relation plans)
  For each triple of tables Ri, Rj, and Rk, determine the best three-table join plan, using the sub-plans obtained so far:

    optPlan({Ri, Rj, Rk}) ← best of Ri ⋈ optPlan({Rj, Rk}), optPlan({Rj, Rk}) ⋈ Ri, Rj ⋈ optPlan({Ri, Rk}), … .

  → 24 plans to consider.

Dynamic Programming

Example (Four-way join of tables R1,...,4 (cont’d))

Pass 4 (best 4-relation plan)
  For each set of four tables Ri, Rj, Rk, and Rl, determine the best four-table join plan, using the sub-plans obtained so far:

    optPlan({Ri, Rj, Rk, Rl}) ← best of Ri ⋈ optPlan({Rj, Rk, Rl}), optPlan({Rj, Rk, Rl}) ⋈ Ri, …, optPlan({Ri, Rj}) ⋈ optPlan({Rk, Rl}), … .

  → 14 plans to consider.

• Overall, we looked at only 50 (sub-)plans (instead of the possible 120 four-way join plans; % slide 12).
• All decisions required the evaluation of simple sub-plans only (no need to re-evaluate optPlan(·) for already known relation combinations ⇒ use a lookup table).

15 Sharing Under the Optimality Principle Query Optimization Torsten Grust

Sharing optimal sub-plans Join Ordering Dynamic Programming Search Space with Sharing under Optimality Principle

R , R , R , R { 1 2 3 4} Query Optimization Search Space Illustration Dynamic Programming Example: Four-Way Join Algorithm R , R , R R , R , R { 1 2 4} { 1 3 4} Discussion Left/Right-Deep vs. Bushy R1, R2, R3 R2, R3, R4 { } { } Greedy join enumeration

R1, R4 {R , R } R , R { 1 3} { 2 4} R1, R2 R3, R4 {R , R } { } { 2 3}

R1 R2 R3 R4 151/575 Drawing by Guido Moerkotte, U Mannheim

16 Dynamic Programming Algorithm Query Optimization Torsten Grust Find optimal n-way bushy join tree via dynamic programming

1 Function: find_join_tree_dp (q(R1,..., Rn))

2 for i = 1 to n do

3 optPlan( Ri ) access_plans (Ri) ; Query Optimization { } ← 4 prune_plans (optPlan( Ri )) ; Search Space Illustration { } Dynamic Programming Example: Four-Way Join 5 for i = 2 to n do Algorithm 6 foreach S R1,..., Rn such that S = i do Discussion ⊆ { } | | Left/Right-Deep vs. Bushy 7 optPlan(S) ∅ ; ← Greedy join enumeration 8 foreach O S with O = ∅ do ⊂ 6 9 optPlan(S) optPlan(S) ← ∪

10 possible_joins ;  optPlan(O) 1optPlan(S O)  \   11 prune_plans (optPlan(S)) ;

12 return optPlan( R1,..., Rn ) ; { } • possible_joins [R S] enumerates the possible joins between R and S (nested1 loops join, merge join, etc.). • prune_plans (set) discards all but the best plan from set. 17 Dynamic Programming—Discussion Query Optimization Torsten Grust • Enumerate all non-empty true subsets of S (using C): 1 O = S &-S; 2 do { 3 /* perform operation on O */ 4 O = S &(O - S); 5 } while (O != S); Query Optimization Search Space Illustration • find_join_tree_dp () draws its advantage from filtering Dynamic Programming Example: Four-Way Join plan candidates early in the process. Algorithm Discussion • In our example on slide 14, pruning in Pass 2 reduced Left/Right-Deep vs. Bushy the search space by a factor of 2, and another factor of 6 Greedy join enumeration in Pass 3. • Some heuristics can be used to prune even more plans: • Try to avoid Cartesian products. • Produce left-deep plans only (see next slides). • Such heuristics can be used as a handle to balance plan quality and optimizer runtime.

Control optimizer investment

1 SET CURRENT QUERY OPTIMIZATION = n

18 Left/Right-Deep vs. Bushy Join Trees Query Optimization Torsten Grust

The algorithm on slide 17 explores all possible shapes a join tree could take: Join tree shapes Query Optimization Search Space Illustration Dynamic Programming Example: Four-Way Join 1 1 Algorithm 1 ··· ··· Discussion 1 1 Left/Right-Deep vs. Bushy ··· 1 1 ··· Greedy join enumeration 1 ··· ··· ··· ··· 1 ··· ··· bushy ··· ··· left-deep (everything else) right-deep

Actual systems often prefer left-deep join trees.1 • The inner (rhs) relation always is a base relation. • Allows the use of index nested loops join. • Easier to implement in a pipelined fashion.

1 The seminal System R prototype, e.g., considered only left-deep plans. 19 Join Order Makes a Difference Query Optimization Torsten Grust • XPath location step evaluation over relationally encoded XML data.2 • n-way self-join with a range predicate.

Query Optimization Search Space Illustration Dynamic Programming 250 250 Example: Four-Way Join Algorithm Discussion 200 200 Left/Right-Deep vs. Bushy Greedy join enumeration 150 150

100 100 exec. time [s] 50 50

0 0 1 2 3 4 5 6 7 8 9 10 11 12 35 MB XML IBM DB2 9 SQL path length ·

2 Grust et al. Accelerating XPath Evaluation in Any RDBMS. TODS 2004. % http://www.pathfinder-xquery.org/ 20 Join Order Makes a Difference Query Optimization Torsten Grust

Contrast the execution plans for a path of 8 and 9 XPath location steps:

Join plans Query Optimization Search Space Illustration Dynamic Programming Example: Four-Way Join Algorithm Discussion Left/Right-Deep vs. Bushy Greedy join enumeration

left-deep join tree bushy join tree DB2’s optimizer essentially gave up in the face of 9+ joins. ⇒

21 Joining Many Relations Query Optimization Torsten Grust

Dynamic programming still has exponential resource requirements: ( K. Ono, G.M. Lohman, Measuring the Complexity of Join Enumeration Query Optimization % Search Space Illustration in Query Optimization, VLDB 1990) Dynamic Programming Example: Four-Way Join n • time complexity: (3 ) Algorithm O Discussion • space complexity: (2n) Left/Right-Deep vs. Bushy O Greedy join enumeration This may still be too expensive • for joins involving many relations ( 10–20 and more), ∼ • for simple queries over well-indexed data (where the right plan choice should be easy to make).

The greedy join enumeration algorithm jumps into this gap.

22 Greedy Join Enumeration Query Optimization Torsten Grust

Greedy join enumeration for n-way join

1 Function: find_join_tree_greedy (q(R1,..., Rn))

2 worklist ∅ ; ← 3 for i = 1 to n do Query Optimization Search Space Illustration 4 worklist worklist best_access_plan (Ri) ; ← ∪ Dynamic Programming Example: Four-Way Join 5 for i = n downto 2 do Algorithm // worklist = P1,..., Pi Discussion { } Left/Right-Deep vs. Bushy 6 find Pj, Pk worklist and ... such that cost(Pj ... Pk) is minimal ; ∈ Greedy join enumeration 7 worklist worklist Pj, 1Pk (Pj ... Pk) ; 1 ← \{ } ∪ { } // worklist = P1 1 { } 8 return single plan left in worklist ;

• In each iteration, choose the cheapest join that can be made over the remaining sub-plans at that time (this is the “greedy” part). • Observe that find_join_tree_greedy () operates similar to finding the optimum binary tree for Huffman coding.

23 Join Enumeration—Discussion Query Optimization Torsten Grust

Greedy join enumeration: • The greedy algorithm has (n3) time complexity: O Query Optimization • The loop has (n) iterations. Search Space Illustration O Dynamic Programming • Each iteration looks at all remaining pairs of plans in Example: Four-Way Join worklist. An (n2) task. Algorithm O Discussion Left/Right-Deep vs. Bushy Greedy join enumeration Other join enumeration techniques: • Randomized algorithms: randomly rewrite the join tree one rewrite at a time; use hill-climbing or simulated annealing strategy to find optimal plan. • Genetic algorithms: explore plan space by combining plans (“creating offspring”) and altering some plans randomly (“mutations”).

24 Physical Plan Properties Query Optimization Torsten Grust

Consider the simple equi-join query

Join query over TPC-H tables Query Optimization Search Space Illustration 1 SELECT O.O_ORDERKEY Dynamic Programming 2 FROM ORDERS O, LINEITEM L Example: Four-Way Join Algorithm 3 WHERE O.O_ORDERKEY = L.L_ORDERKEY Discussion Left/Right-Deep vs. Bushy where table ORDERS is indexed with a clustered index OK_IDX on Greedy join enumeration column O_ORDERKEY.

Possible table access plans (1-relation plans) are:

ORDERS • full table scan: estimated I/Os: NORDERS • index scan: estimated I/Os: NOK_IDX + NORDERS.

LINEITEM • full table scan: estimated I/Os: NLINEITEM.

25 Physical Plan Properties Query Optimization Torsten Grust

• Since the full table scan is the cheapest access method for both tables, our join algorithms will select them as the best 3 Query Optimization 1-relation plans in Pass 1. Search Space Illustration Dynamic Programming To join the two scan outputs, we now have the choices Example: Four-Way Join Algorithm • nested loops join, Discussion Left/Right-Deep vs. Bushy • hash join, or Greedy join enumeration • sort both inputs, then use merge join. Hash join or sort-merge join are probably the preferable candidates, incurring a cost of 2 (N + N ). ≈ · ORDERS LINEITEM Overall cost: ⇒ N + N + 2 (N + N ). ORDERS LINEITEM · ORDERS LINEITEM

3Dynamic programming and the greedy algorithm happen to do the same in this example. 26 Physical Plan Properties—A Better Plan Query Optimization Torsten Grust

It is easy to see, however, that there is a better way to evaluate the query:

Query Optimization Search Space Illustration 1 Use an index scan to access ORDERS. This guarantees that Dynamic Programming Example: Four-Way Join the scan output is already in O_ORDERKEY order. Algorithm Discussion 2 Then only sort LINEITEM and Left/Right-Deep vs. Bushy Greedy join enumeration 3 join using merge join.

Overall cost: (N + N ) + 2 N . ⇒ OK_IDX ORDERS · LINEITEM 1 2 / 3 | {z } | {z } Although more expensive as a standalone table access plan, the use of the index (order enforcement) pays off later on in the overall plan.

27 Physical Plan Properties: Interesting Orders Query Optimization Torsten Grust

• The advantage of the index-based access to ORDERS is that it provides beneficial physical properties. Query Optimization • Optimizers, therefore, keep track of such properties by Search Space Illustration Dynamic Programming annotating candidate plans. Example: Four-Way Join Algorithm Discussion • System R introduced the concept of interesting orders, Left/Right-Deep vs. Bushy Greedy join enumeration determined by • ORDER BY or GROUP BY clauses in the input query, or • join attributes of subsequent joins ( merge join). ; In prune_plans (), retain ⇒ • the cheapest “unordered” plan and • the cheapest plan for each interesting order.

28 Query Rewriting Query Optimization Torsten Grust

• Join optimization essentially takes a set of relations and a set of join predicates to find the best join order. Query Optimization Search Space Illustration Dynamic Programming • By rewriting query graphs beforehand, we can improve the Example: Four-Way Join Algorithm effectiveness of this procedure. Discussion Left/Right-Deep vs. Bushy Greedy join enumeration • The query rewriter applies heuristic rules, without looking into the actual database state (no information about cardinalities, indexes, etc.). In particular, the optimizer • relocates predicates (predicate pushdown), • rewrites predicates, and • unnests queries.

29 Predicate Simplification Query Optimization Torsten Grust

Rewrite Example (Query against TPC-H table)

1 SELECT* Query Optimization 2 FROM LINEITEM L Search Space Illustration 3 WHERE L.L_TAX * 100 < 5 Dynamic Programming Example: Four-Way Join Algorithm into Discussion Left/Right-Deep vs. Bushy Example (Query after predicate simplification) Greedy join enumeration

1 SELECT* 2 FROM LINEITEM L 3 WHERE L.L_TAX < 0.05

 In which sense is the rewritten predicate simpler? Why would a RDBMS query optimizer rewrite the selection predicate as shown above?

30 Introducing Additional Join Predicates Query Optimization Torsten Grust

Implicit join predicates as in

Implicit join predicate through transitivity

1 SELECT* Query Optimization 2 FROM A, B, C Search Space Illustration 3 WHERE A.a = B.b AND B.b = C.c Dynamic Programming Example: Four-Way Join Algorithm can be turned into explicit ones: Discussion Left/Right-Deep vs. Bushy Explicit join predicate Greedy join enumeration

1 SELECT* 2 FROM A, B, C 3 WHERE A.a = B.b AND B.b = C.c 4 AND A.a::::=::::C.c This makes the following join tree feasible:

(A C) B . 1 1 (Note: (A C) would have been a Cartesian product before.) 1

31 Nested Queries and Correlation Query Optimization Torsten Grust

SQL provides a number of ways to write nested queries. • Uncorrelated sub-query:

No free variables in subquery Query Optimization Search Space Illustration 1 SELECT* Dynamic Programming 2 FROM ORDERS O Example: Four-Way Join Algorithm 3 WHERE O_CUSTKEYIN(SELECT C_CUSTKEY Discussion 4 FROM CUSTOMER Left/Right-Deep vs. Bushy 5 WHERE C_NAME =’IBM Corp.’) Greedy join enumeration

• Correlated sub-query:

Row variable O occurs free in subquery

1 SELECT* 2 FROM ORDERS O 3 WHERE O.O_CUSTKEYIN 4 (SELECT C.C_CUSTKEY 5 FROM CUSTOMER C 6 WHERE C.C_ACCTBAL < O.O_TOTALPRICE)

32 Query Unnesting Query Optimization Torsten Grust • Taking query nesting literally might be expensive. • An uncorrelated query, e.g., need not be re-evaluated for every tuple in the outer query. • Oftentimes, sub-queries are only used as a syntactical way to Query Optimization express a join (or a semi-join). Search Space Illustration Dynamic Programming • The query rewriter tries to detect such situations and make Example: Four-Way Join the join explicit. Algorithm Discussion • This way, the sub-query can become part of the regular join Left/Right-Deep vs. Bushy Greedy join enumeration order optimization.

 Turning correlation into joins Reformulate the correlated query of slide 32 (use SQL syntax or relational algebra) to remove the correlation (and introduce a join).

Won Kim. On Optimizing an SQL-like Nested Query. ACM TODS, vol. 7, % no. 3, September 1982. 33 Summary Query Optimization Torsten Grust

Query Parser

Translates input query into (SFW-like) query blocks. Query Optimization Search Space Illustration Rewriter Dynamic Programming Example: Four-Way Join Logical (database state-independent) optimizations; Algorithm Discussion predicate simplification; query unnesting. Left/Right-Deep vs. Bushy Greedy join enumeration (Join) Optimization Find “best” query execution plan based on a cost model (considering I/O cost, CPU cost, . . . ); data statistics (histograms); dynamic programming, greedy join enumeration; physical plan properties (interesting orders).

Database optimizers still are true pieces of art. . .

34 “Picasso” Plan Diagrams Query Optimization Torsten Grust

Generated by “Picasso”: SQL join query with filters of parameterizable selectivities (0 ... 100) against both join inputs

Query Optimization Search Space Illustration Dynamic Programming Example: Four-Way Join Algorithm Discussion Left/Right-Deep vs. Bushy Greedy join enumeration

(a) Plan Diagram (a) Complex(b) Cost Plan Diagram Diagram (b) Reduced Plan Diagram

Figure 1: Smooth Plan and Cost Diagram (QueryFigure 7) 2: Complex Plan and Reduced Plan Diagram (Query 8, OptA) Naveen Reddy and Jayant Haritsa. Analyzing Plan Diagrams of % TheDatabase Picasso Tool Query Optimizers. VLDB 2005made. they here are are “doing perforce too good speculative a job”, in not nature merited and by should the Validity of PQO: A rich body of literature exists on para- thereforecoarseness be treated of the as underlying such. Our cost intention space. is primarily Moreover, to metric query optimization (PQO) [1, 2, 7, 8, 3, 4, 10, As part of our ongoing project on developing value- alertif database it were possiblesystem designers to simplify and the developers optimizer to tothe pro- phe- 11, 12]. The goal here is to aprioriidentify the optimal addition software for query optimization [24], we have cre- nomenaduce that only we reduced have encountered plan diagrams, during it is the plausible course of that our set of plans for the entire relational selectivity space ated a tool, called Picasso, that given a query and a rela- study,the with considerable the hope thatprocessing they may overheads prove useful typically in building asso- at compile time, and subsequently to use at run time tional engine, automatically generates the associated plan the nextciated generation with query of optimization optimizers. could be significantly the actual selectivity parameter settings to identify the and cost diagrams. In this paper, we report on the fea- lowered. best plan – the expectation35 is that this would be much tures of the plan/cost diagrams output by Picasso on a suite Features of Plan and Cost Diagrams faster than optimizing the query from scratch. Much of three popular commercial query optimizers for queries Complex Patterns: The plan diagrams exhibit a variety of this work is based on a set of assumptions, that we based on the TPC-H benchmark. [Due to legal restrictions, Analyzingof intricate the TPC-H-based tessellated patterns, query plan including and costspeckles diagrams, do not find to hold true, even approximately, in the these optimizers are anonymouslyidentified as OptA, OptB providesstripes a, varietyblinds, ofmosaics interestingand bands insights,, among including others. the For fol- plan diagrams produced by the commercial optimiz- and OptC, in the sequel.] lowing:example, witness the rapidly alternating choices be- ers. Our evaluation shows that a few queries in the bench- tween plans P12 (dark green) and P16 (light gray) mark do produce “well-behaved” or “smooth” plan dia- Fine-grainedin the bottom Choices: left quadrantModern of query Figure optimizers 2(a). Further, often For example, one of the assumptions is that a plan is grams, like that shown in Figure 1(a). A substantial remain- themake boundaries extremely offine-grained the plan optimalityplan choices, regions exhibiting can be optimal within the entire region enclosed by its plan der, however, result in extremely complex and intricate plan highlya marked irregular skew in – the a case space in coverage point is of plan the P8 individual (dark boundaries. But, in Figure 2(a), this is violated by the diagrams that appear similar to cubist paintings3, providing pink)plans. in theFor top example, right quadrant 80 percent of Figure of the space2(a). These is usu- small (brown) rectangle of plan P14, close to coordi- rich material for investigation. 
A particularly compelling complexally covered patterns by less appear than to 20 indicate percent ofthe the presence plans, with of nates (60,30), in the (light-pink) optimality region of example is shown in Figure 2(a) for Query 8 of the bench- stronglymany of non-linear the smaller andplans discretized occupying cost models, less than againone plan P3, and there are several other such instances. 4 perhapspercent anof over-kill the selectivity in light space. of Figure Using 2(b). the well-known mark with optimizer OptA , where no less than 68 plans On the positive side, however, we show that some Gini index [22], which ranges over [0,1], to quantify cover the space in a highly convoluted manner! Further, of the important PQO assumptions do hold approxi- Non-Monotonicthe skew, we Cost find Behavior: that all theWe optimizers, have foundacross quite the a even this cardinality is a conservative estimate since it was mately for reduced plan diagrams. fewboard instances, exhibit where, a marked although skewthe in excess base relation of 0.5 for selec- most obtained with a query grid of 100 x 100 – a finer grid size tivities and the result cardinalities are monotonically of 300 x 300 resulted in the plan cardinality going up to 80 queries, on occasion going even higher than 0.8. increasing, the cost diagram does not show a corre- 1.1 Organization plans! spondingFurther, monotonic and more behavior. importantly,5 Sometimes, we show the that non- the Before we go on, we hasten to clarify that our goal in monotonicsmall-sized behavior plans may arises often due be to supplanted a change inby plan, larger The above effects are described in more detail in the re- this paper is to provide a broad overview of the intrigu- perhapssiblings understandablewithout substantively given the affecting restricted the searchquality. mainder of this paper, which is organized as follows: In ing behavior of modern optimizers, but not to make judge- spaceFor example, evaluated the by plan the optimizer. diagram of But, Figure more 2(a) surpris- which Section 2, we present the Picasso tool and the testbed en- ments on specific optimizers, nor to draw conclusionsabout ingly,has 68 we plans have can also be encountered “reduced” to situations that shown where in Fig- a vironment. Then, in Section 3, the skew in the plan space the relative qualities of their execution plans. Further, not planure shows 2(b) featuring such behavior as few even as seveninternalplans,to its without optimal- in- distribution, as well as techniques for reducing the plan set being privy to optimizer internals, some of the remarks itycreasing region. the estimated cost of any individual query cardinalities, are discussed. The relationship to PQO is ex- point by more than 10 percent. plored in Section 4. Interesting plan diagram motifs are 3Hence, the name of our tool – Pablo Picasso is considered to be a 5Our query setup is such that in addition to the result cardinality mono- founder of the cubist painting genre [23]. tonicallyOverall, increasing this as we leads travel us outwards to the along hypothesis the selectivity that axes, current the presented in Section 5. An overviewof related work is pro- 4Operating at its highest optimization level. result tuplesoptimizers are also supersets may perhapsof the previous be over-sophisticated results. in that vided in Section 6. 
Finally, in Section 7, we summarize the common change in all plans across the switch-point is that the hash-join between relations PART and PARTSUPP is replaced by a sort-merge-join. But, in the same picture, there are switch-points occur- ring at 26% and 50% in the PARTSUPP selectivity range, that result in a counter-intuitive non-monotonic cost behav- ior, as shown in the corresponding cost diagram of Fig- ure 11(b). Here, we see that although the result cardi- nalities are monotonically increasing, the estimated costs for producing these results show a marked non-monotonic step-downQuery Optimization behavior in the middle section. Specifically, “Picasso” Plan Diagrams at the 26% switch-point, an additional `sort' operator (on psTorstensupplycost Grust ) is added, which substantially de- creases the overall cost – for example, in moving from plan P2 to P3 at 50% PART selectivity, the estimated cost de- creases by a factor of 50! Conversely, in moving from P3 Generated by “Picasso”: each distinct color represent a distinct plan to P1 at the 50% switch-point, the cost of the optimal plan jumps up by a factor of 70 at 50% PART selectivity. Figure 6: Duplicates and Islands (Query 5, OptC) Figure 7: Plan Switch-Point (Query 9, OptA) Figure 9: Footprint Pattern (Query 7, OptA) considered by the DBMS Step-function upward jumps in the cost with increas- Databases # Duplicates # Islands ing input cardinalities are known to occur – for example, Original Reduced Original Reduced when one of the relations in a join ceases to fit completely OptA 136 14 38 3 withinQuery the Optimization available memory – however, what is surprising OptB 80 15 1 0 in theSearch above Space is Illustration the step-function cost decrease at the 26% OptC 55 7 8 3 switch-point.Dynamic Programming We conjecture that such disruptive cost be- haviorExample: may Four-Way arise either Join due to the presence of rules in the optimizer,Algorithm or due to parameterized changes in the search Table 3: Duplicates and Islands spaceDiscussion evaluated by the optimizer. The above example showed non-monotonic behavior PQO, which, as mentioned in the previous section, appears Left/Right-Deep vs. Bushy arising out of a plan-switch. However, more surprisingly, ill-suited to directly capture the complexities of modern op- Greedy join enumeration we have also encountered situations where a plan shows timizers, may turn out to be a more viable propositionin the non-monotonic behavior internal to its optimality region. space of reduced plan diagrams. A specific example is shown in Figure 12 obtained for Q21 with OptA. Here, the plans P1, P3, P4 and P6, show a re- 5.2 Plan Switch Points duction in their estimated costs with increasing input and In several plan diagrams, we find lines parallel to the axes result cardinalities. An investigation of these plans showed that run through the entire selectivity space, with a plan that all of them feature a nested-loops join, whose esti- shift occurring for all plans bordering the line, when we mated cost decreases with increasing cardinalities of its in- Figure 8: Venetian Blinds Pattern (Query 9, OptB) move across the line. We will hereafter refer to such lines Figure 10: Speckle Pattern (Query 17, OptA) put relations – this may perhaps indicate an inconsistency as “plan switch-points”. in the associated cost model. 
Further, such instances of join across the NATION, SUPPLIER and LINEITEM rela- 5.4 Speckle Pattern In the plan diagram of Figure 7, obtained with Q9 on non-monotonic behavior were observed with all three opti- tions. Both variations have almost equal estimated cost, OptA, an example switch-point appears at approximately Operating Picasso with Q17 on OptA (at its highest opti- mizers. andDownload their cost-models Picasso are perhaps at discretized in a step- 30% selectivity of the SUPPLIER relation. Here, we found mization level) results in Figure 10. We see here that the function manner, resulting in the observed blinds.

a common change in all plans across the switch-point – the entire plan diagram is divided into just two plans, P1 and 6 Related Work

» º» hash-join sequence PARTSUPP º SUPPLIER PART is al- P2, occupying nearly equal areas, but that plan P1 (bright

http://dsl.serc.iisc.ernet.in/projects/PICASSO/index.html. To the best of our knowledge, there has been no prior work

» º» tered to PARTSUPP º PART SUPPLIER, suggesting an in- green) also appears as speckles sprinkled in P2’s (red) area. 5.3 Footprint Pattern on the analysis of plan diagrams with regard to real-world tersection of the cost function of the two sequences at this The only difference between the two plans is that an ad- industrial-strength query optimizers. However, similar is- switch-point. ditional SORT operation is present in P2 on the PART rela- A curious pattern, similar to footprints on the beach, shows sues have been studied in the parametricquery optimization For the same Q9 query, an even more interesting switch- tion. However, the cost of this sort is very low, and there- up in Figure 9, obtained with Q7 on the OptA optimizer, (PQO) literature in the context of simplified self-crafted op- point example is obtained with OptB, shown in Figure 8. fore we find intermixing of plans due to the close and per- where we see plan P7 exhibiting a thin (cadet-blue) bro- timizers. Specifically,36 in [1, 11, 12], an optimizer modeled Here we observe, between 10% and 35% on the SUPPLIER haps discretized cost models. ken curved pattern in the middle of plan P2’s (orange) re- along the lines of the original System R optimizer [16] is axis, six plans simultaneously changing with rapid alterna- gion. The reason for this behavior is that both plans are of used, with the search space restricted to left-deep join trees, tions to produce a “Venetian blinds” effect. Specifically, 5.5 Non-Monotonic Cost Behavior roughly equal cost, with the difference being that in plan and the workload comprised of pure SPJ queries with “star” the optimizer changes from P6 to P1, P16 to P4, P25 to P2, the SUPPLIER relation participates in a sort-merge- The example switch-points shown earlier, were all cost- or “linear” join-graphs. The metrics considered include P23, P7 to P18, P8 to P9, and P42 to P47, from one vertical join at the top of the plan tree, whereas in P7, the hash-join based switch-points, where plans were switched to de- the cardinality and spatial distribution of the set of optimal strip to the next. operator is used instead at the same location. This is con- rive lower execution costs. Yet another example of such plans – while [1] considered only single-relation selectivi- The reason for this behavior is that the optimizer alter- firmed in the corresponding reduced plan diagram where a switch-point is seen in Figure 11(a), obtained with query ties, [11, 12] evaluated two-dimensional relational selectiv- nates between a left-deep hash join and a right-deep hash the footprints disappear. Q2 on OptA, at 97% selectivity of the PART relation. Here, ity spaces, similar to those considered in this paper. Their Transaction Management

Floris Geerts

Transaction Management

Concurrent and Consistent Data Access ACID Properties

Anomalies

The Scheduler

Serializability

Query Scheduling

Locking

Two-Phase Locking

SQL 92 Isolation Levels

Concurrency in B-trees

Optimistic Concurrency Protocol

Multiversion Concurrency Protocol

11.1 Transaction The “Hello World” of Transaction Management Management Floris Geerts

• My bank issued me a debit card to access my account. • Every once in a while, I’d use it at an ATM to draw some money from my account, causing the ATM to perform a ACID Properties transaction in the bank’s database. Anomalies The Scheduler

Serializability

Query Scheduling

Example (ATM transaction) Locking

Two-Phase Locking 1 bal read_bal (acct_no) ; ← SQL 92 Isolation 2 bal bal 100 ; Levels ← − ¤ 3 write_bal (acct_no, bal) ; Concurrency in B-trees Optimistic Concurrency Protocol

Multiversion Concurrency Protocol • My account is properly updated to reflect the new balance.

11.2 Transaction Concurrent Access Management Floris Geerts The problem is: My wife has a card for the very same account, too. We might end up using our cards at different ATMs at the ⇒ same time, i.e., concurrently.

ACID Properties Example (Concurrent ATM transactions) Anomalies

The Scheduler

me my wife DB state Serializability bal read (acct) ; 1200 Query Scheduling Locking ← bal read (acct) ; 1200 Two-Phase Locking bal bal 100 ; ← 1200 ← − SQL 92 Isolation bal bal 200 ; 1200 Levels write (acct, bal) ; ← − 1100 Concurrency in B-trees Optimistic write (acct, bal) ; 1000 Concurrency Protocol

Multiversion Concurrency Protocol

• The first update was lost during this execution. Lucky me!

11.3 Transaction If the Plug is Pulled . . . Management Floris Geerts • This time, I want to transfer money over to another account.

Example (Money transfer transaction) // Subtract money from source (checking) account ACID Properties 1 chk_bal read_bal (chk_acct_no) ; ← Anomalies 2 chk_bal chk_bal 500 ¤ ; ← − The Scheduler 3 write_bal (chk_acct_no, chk_bal) ; Serializability

Query Scheduling

// Credit money to the target (savings) account Locking 4 sav_bal read_bal (sav_acct_no) ; ← Two-Phase Locking 5 sav_bal sav_bal + 500 ¤ ; SQL 92 Isolation ← Levels 6 Concurrency in B-trees 7 write_bal (sav_acct_no, sav_bal) ; Optimistic Concurrency Protocol Multiversion • Before the transaction gets to step 7, its execution is Concurrency Protocol interrupted/cancelled (power outage, disk failure, software bug, . . . ). My money is lost . /

11.4 Transaction ACID Properties Management Floris Geerts

To prevent these (and many other) effects from happening, a DBMS guarantees the following transaction properties:

ACID Properties

A Atomicity Either all or none of the updates in a database Anomalies

transaction are applied. The Scheduler C Consistency Every transaction brings the database from one Serializability consistent state to another. (While the transaction Query Scheduling executes, the database state may be temporarily Locking Two-Phase Locking

inconsistent.) SQL 92 Isolation I Isolation A transaction must not see any effect from other Levels Concurrency in B-trees transactions that run in parallel. Optimistic D Durability The effects of a successful transaction maintain Concurrency Protocol Multiversion persistent and may not be undone for system Concurrency Protocol reasons.

11.5 Transaction Concurrency Control Management Floris Geerts

Web Forms Applications SQL Interface

SQL Commands

Executor Parser ACID Properties Anomalies

Operator Evaluator Optimizer The Scheduler

Serializability

Query Scheduling Files and Access Methods Transaction Locking Manager Recovery Two-Phase Locking Buffer Manager Manager SQL 92 Isolation Lock Manager Levels Disk Space Manager Concurrency in B-trees Optimistic DBMS Concurrency Protocol Multiversion Concurrency Protocol data files, indices, . . . Database

11.6 Transaction Anomalies: Lost Update Management Floris Geerts

ACID Properties • We already saw an example of the lost update anomaly on Anomalies slide 3: The Scheduler Serializability

Query Scheduling

The effects of one transaction are lost due to an Locking

uncontrolled overwrite performed by the second Two-Phase Locking

transaction. SQL 92 Isolation Levels

Concurrency in B-trees

Optimistic Concurrency Protocol

Multiversion Concurrency Protocol

11.7 Transaction Anomalies: Inconsistent Read Management Floris Geerts

Reconsider the money transfer example (slide 4), expressed in SQL syntax:

Example ACID Properties

Transaction 1 Transaction 2 Anomalies

1 UPDATE Accounts The Scheduler

2 SET balance = balance - 500 Serializability 3 WHERE customer = 1904 Query Scheduling 4 AND account_type = ’C’; Locking 1 SELECT SUM(balance) Two-Phase Locking 2 FROM Accounts SQL 92 Isolation 3 WHERE customer = 1904; Levels

5 UPDATE Accounts Concurrency in B-trees

6 SET balance = balance + 500 Optimistic 7 WHERE customer = 1904 Concurrency Protocol 8 AND account_type = ’S’; Multiversion Concurrency Protocol

• Transaction 2 sees a temporary, inconsistent database state.

11.8 Transaction Anomalies: Dirty Read Management Floris Geerts At a different day, my wife and me again end up in front of an ATM at roughly the same time. This time, my transaction is cancelled (aborted): Example ACID Properties me my wife DB state Anomalies The Scheduler bal read (acct) ; 1200 Serializability ← bal bal 100 ; 1200 Query Scheduling write← (acct−, bal) ; 1100 Locking bal read (acct) ; 1100 Two-Phase Locking ← SQL 92 Isolation bal bal 200 ; 1100 Levels ← − abort ; 1200 Concurrency in B-trees Optimistic write (acct, bal) ; 900 Concurrency Protocol

Multiversion Concurrency Protocol • My wife’s transaction has already read the modified account balance before my transaction was rolled back (i.e., its effects are undone).

11.9 Transaction Concurrent Execution Management Floris Geerts • The scheduler decides the execution order of concurrent database accesses.

The transaction scheduler ACID Properties

Anomalies Client 1 Client 2 Client 3 The Scheduler 3 2 3 Serializability 2 1 2 Query Scheduling 1 1 Locking

Scheduler Two-Phase Locking

SQL 92 Isolation 1 Levels

2 Concurrency in B-trees 1 Optimistic Concurrency Protocol 1 Multiversion Concurrency Protocol

Access and Storage Layer

11.10 Transaction Database Objects and Accesses Management Floris Geerts

• We now assume a slightly simplified model of database access:

1 A database consists of a number of named objects. In a ACID Properties

given database state, each object has a value. Anomalies 2 Transactions access an object o using the two The Scheduler operations read o and write o. Serializability Query Scheduling • In a relational DBMS we have that Locking Two-Phase Locking object attribute . SQL 92 Isolation ≡ Levels Concurrency in B-trees

This defines the granularity of our discussion. Other Optimistic possible granularities: Concurrency Protocol Multiversion Concurrency Protocol object row , object table . ≡ ≡

11.11 Transaction Transactions Management Floris Geerts Database transaction A database transaction T is a (strictly ordered) sequence of steps. Each step is a pair of an access operation applied to an object. ACID Properties • Transaction T = s ,..., s h 1 ni Anomalies • Step si = (ai, ei) The Scheduler Serializability • Access operation ai r(ead), w(rite) ∈ { } Query Scheduling The length of a transaction T is its number of steps T = n. Locking | | Two-Phase Locking

SQL 92 Isolation We could write the money transfer transaction as Levels Concurrency in B-trees 3 Optimistic T = (read, Checking), (write, Checking), 2 Concurrency Protocol h (read, Saving), (write, Saving) 1 Multiversion i Concurrency Protocol or, more concisely, T = r(C), w(C), r(S), w(S) . h i

11.12 Transaction Schedules Management Floris Geerts Schedule

A schedule S for a given set of transactions T = T1,..., Tn is an arbitrary sequence of execution steps { } 2 S(k) = (Tj, ai, ei) k = 1 ... m , 1 ACID Properties 1 Anomalies such that The Scheduler Serializability 1 S contains all steps of all transactions and nothing else and Query Scheduling 2 the order among steps in each transaction Tj is preserved: Locking Two-Phase Locking

(ap, ep) < (aq, eq) in Tj (Tj, ap, ep) < (Tj, aq, eq) in S SQL 92 Isolation ⇒ Levels (read “<” as: occurs before). Concurrency in B-trees Optimistic Concurrency Protocol

Multiversion We sometimes write Concurrency Protocol S = r (B), r (B), w (B), w (B) h 1 2 1 2 i to abbreviate S(1) = (T1, read, B) S(3) = (T1, write, B) S(2) = (T2, read, B) S(4) = (T2, write, B) 11.13 Transaction Serial Execution Management Floris Geerts Serial execution One particular schedule is serial execution.

• A schedule S is serial iff, for each contained transaction Tj, all

its steps are adjacent (no interleaving of transactions and ACID Properties

thus no concurrency). Anomalies Briefly: The Scheduler Serializability

Query Scheduling S = Tπ1, Tπ2,..., Tπn (for some permutation π of 1,..., n) Locking

Two-Phase Locking Consider again the ATM example from slide 3. 2 2 SQL 92 Isolation Levels S = r (B), r (B), w (B), w (B) 1 • 1 2 1 2 Concurrency in B-trees h i 1 • This is a schedule, but it is not serial. Optimistic Concurrency Protocol

Multiversion If my wife had gone to the bank one hour later (initiating Concurrency Protocol transaction T2), the schedule probably would have been serial. 2 1 • S = r1(B), w1(B), r2(B), w2(B) h i 2 1 11.14 Transaction Correctness of Serial Execution Management Floris Geerts

• Anomalies such as the “lost update” problem on slide 3 can only occur in multi-user mode. ACID Properties • If all transactions were fully executed one after another (no Anomalies The Scheduler concurrency), no anomalies would occur. Serializability Any serial execution is correct. Query Scheduling ⇒ Locking • Disallowing concurrent access, however, is not practical. Two-Phase Locking SQL 92 Isolation Levels

Correctness criterion Concurrency in B-trees Allow concurrent executions if their overall effect is the same as Optimistic Concurrency Protocol

an (arbitrary) serial execution. Multiversion Concurrency Protocol

11.15 Transaction Conflicts Management Floris Geerts

What does it mean for a schedule S to be equivalent to another schedule S0?

• Sometimes, we may be able to reorder steps in a schedule. ACID Properties • We must not change the order among steps of any Anomalies transaction T ( slide 13). The Scheduler j % • Rearranging operations must not lead to a different Serializability result. Query Scheduling Locking

• Two operations (Ti, a, e) and (Tj, a0, e0) are said to be in Two-Phase Locking conflict (Ti, a, e) = (Tj, a0, e0) if their order of execution SQL 92 Isolation matters. Levels Concurrency in B-trees

• When reordering a schedule, we must not change the Optimistic relative order of such operations. Concurrency Protocol Multiversion • Any schedule S0 that can be obtained this way from S is said Concurrency Protocol to be conflict equivalent to S.

11.16 Transaction Conflicts Management Floris Geerts Based on our read/write model, we can come up with a more machine-friendly definition of a conflict.

Conflicting operations

Two operations (Ti, a, e) and (Tj, a0, e0) are in conflict (=) in S if ACID Properties 1 they belong to two different transactions (Ti = Tj), Anomalies 6 The Scheduler 2 they access the same database object, i.e., e = e0, and Serializability

3 at least one of them is a write operation. Query Scheduling

Locking • This inspires the following conflict matrix: Two-Phase Locking SQL 92 Isolation read write Levels read Concurrency in B-trees × Optimistic write Concurrency Protocol × × Multiversion • Conflict relation : Concurrency Protocol ≺S (Ti, a, e) S (Tj, a0, e0) ≺:= (Ti, a, e) = (Tj, a0, e0) (Ti, a, e) occurs before (Tj, a0, e0) in S ∧ 11.17 Transaction Conflict Serializability Management Floris Geerts

ACID Properties

Conflict serializability Anomalies A schedule S is conflict serializable iff it is conflict equivalent to The Scheduler Serializability some serial schedule S . 0 Query Scheduling • The execution of a conflict-serializable S schedule is Locking correct. Two-Phase Locking SQL 92 Isolation • Note: S does not have to be a serial schedule.  Levels Concurrency in B-trees

Optimistic Concurrency Protocol

Multiversion Concurrency Protocol

11.18 Transaction Serializability: Example Management Floris Geerts Example (Three schedules Si for two transactions T1,2, with S2 serial)

Schedule S1 Schedule S2 Schedule S3 T1 T2 T1 T2 T1 T2 read A read A read A ACID Properties

write A write A write A Anomalies

read A read B read A The Scheduler

write A write B write A Serializability read B read A read B Query Scheduling write B write A write B read B read B read B Locking write B write B write B Two-Phase Locking SQL 92 Isolation Levels • Conflict relations: Concurrency in B-trees Optimistic (T1, r, A) (T2, w, A), (T1, r, B) (T2, w, B), ≺S1 ≺S1 Concurrency Protocol (T1, w, A) S (T2, r, A), (T1, w, B) S (T2, r, B), ≺ 1 ≺ 1 Multiversion (T1, w, A) (T2, w, A), (T1, w, B) (T2, w, B)  Concurrency Protocol ≺S1 ≺S1  Note:  ( S2 = S1 )  S1serializable ≺ ≺  ⇒  (T1, r, A) (T2, w, A), (T2, r, B) (T1, w, B), ≺S3 ≺S3 (T1, w, A) (T2, r, A), (T2, w, B) (T1, r, B), ≺S3 ≺S3  (T1, w, A) S (T2, w, A), (T2, w, B) S (T1, w, B)  ≺ 3 ≺ 3   11.19  Transaction The Conflict Graph Management Floris Geerts

• The serializability idea comes with an effective test for the correctness of a schedule S based on its conflict graph G(S) (also: serialization graph): ACID Properties Anomalies

• The nodes of G(S) are all transactions Ti in S. The Scheduler • There is an edge T T iff S contains operations Serializability i → j (Ti, a, e) and (Tj, a0, e0) such that (Ti, a, e) S (Tj, a0, e0) Query Scheduling ≺ Locking (read: in a conflict equivalent serial schedule, Ti must Two-Phase Locking occur before Tj). SQL 92 Isolation Levels

Concurrency in B-trees

• S is conflict serializable iff G(S) is acyclic. Optimistic Concurrency Protocol

An equivalent serial schedule for S may be immediately Multiversion obtained by sorting G(S) topologically. Concurrency Protocol

11.20 Transaction Serialization Graph Management Floris Geerts Example (ATM transactions ( slide 3)) % • S = r1(A), r2(A), w1(A), w2(A) T h i 2 • Conflict relation: (T , r, A) (T , w, A) 1 S 2 T1 ACID Properties (T , r, A) ≺ (T , w, A) 2 ≺S 1 Anomalies (T , w, A) (T , w, A) not serializable The Scheduler 1 ≺S 2 ⇒ Serializability

Query Scheduling Example (Two money transfers ( slide 4)) Locking % Two-Phase Locking SQL 92 Isolation • S = r (C), w (C), r (C), w (C), r (S), w (S), r (S), w (S) Levels h 1 1 2 2 1 1 2 2 i • Conflict relation: Concurrency in B-trees T2 Optimistic (T , r, C) (T , w, C) Concurrency Protocol 1 ≺S 2 (T1, w, C) S (T2, r, C) Multiversion T Concurrency Protocol (T , w, C) ≺ (T , w, C) 1 1 ≺S 2 . serializable . ⇒

11.21 Transaction Query Scheduling Management Floris Geerts

Can we build a scheduler that always emits a serializable schedule?

Idea: ACID Properties Client 1 Client 2 Client 3 Anomalies • Require each transaction to The Scheduler

obtain a lock before it 3 2 3 Serializability 2 1 2 accesses a data object o: 1 1 Query Scheduling Locking

Locking and unlocking of o Scheduler Two-Phase Locking 2 SQL 92 Isolation 1 lock o ; Levels 1 2 . . . access o ...; Concurrency in B-trees 1 3 unlock o ; Optimistic Concurrency Protocol

Access and Storage Layer Multiversion • This prevents concurrent Concurrency Protocol access to o.

11.22 Transaction Locking Management Floris Geerts

• If a lock cannot be granted (e.g., because another transaction ACID Properties T0 already holds a conflicting lock) the requesting transaction T gets blocked. Anomalies The Scheduler • The scheduler suspends execution of the blocked Serializability transaction T. Query Scheduling Locking • Once T0 releases its lock, it may be granted to T, whose execution is then resumed. Two-Phase Locking SQL 92 Isolation Since other transactions can continue execution while T is Levels ⇒ blocked, locks can be used to control the relative order of Concurrency in B-trees Optimistic operations. Concurrency Protocol Multiversion Concurrency Protocol

11.23 Transaction Locking and Scheduling Management Example (Locking and scheduling) Floris Geerts

• Consider two • Two valid schedules (respecting lock and transactions T1,2: unlock calls) are:

ACID Properties Schedule S1 Schedule S1 Anomalies T1 T2 T1 T2 T1 T2 lock A lock A lock A lock A The Scheduler write A write A write A write A Serializability lock B lock B lock B lock B Query Scheduling unlock A write B unlock A write B Locking

write B write A lock A write A Two-Phase Locking

unlock B unlock A write A unlock A SQL 92 Isolation write B write B lock A Levels unlock B unlock B write A Concurrency in B-trees lock B write B Optimistic write B unlock B Concurrency Protocol Multiversion write A lock B Concurrency Protocol unlock A write B write B unlock B unlock B unlock A

• Note: Both schedules S1,2 are serializable. Are we done yet?

11.24 Transaction Locking and Serializability Management Floris Geerts Example (Proper locking does not guarantee serializability yet) Even if we adhere to a properly nested lock/unlock discipline, the scheduler might still yield non-serializiable schedules:

ACID Properties Schedule S1 T1 T2 Anomalies lock A The Scheduler lock C  What is the conflict graph of this Serializability write A write C schedule? Query Scheduling unlock A Locking lock A Two-Phase Locking

write A SQL 92 Isolation lock B Levels unlock A Concurrency in B-trees write B Optimistic unlock B Concurrency Protocol

unlock C Multiversion lock B Concurrency Protocol write B unlock B lock C write C unlock C

11.25 Transaction ATM Transaction with Locking Management Floris Geerts

Example (Two concurrent ATM transactions with locking) Transaction 1 Transaction 2 DB state lock (acct) ; 1200 read (acct) ; ACID Properties Anomalies unlock (acct) ; lock (acct) ; The Scheduler read (acct) ; Serializability Query Scheduling unlock (acct) ; lock (acct) ; Locking Two-Phase Locking write (acct) ; 1100 SQL 92 Isolation unlock (acct) ; Levels lock (acct) ; Concurrency in B-trees write (acct) ; 1000 Optimistic Concurrency Protocol

unlock (acct) ; Multiversion Concurrency Protocol Again: on its own, proper locking does not guarantee ⇒ serializability yet.

11.26 Transaction Two-Phase Locking (2PL) Management Floris Geerts The two-phase locking protocol poses an additional restriction on how transactions have to be written: Definition (Two-Phase Locking)

• Once a transaction has released any lock (i.e., performed the ACID Properties first unlock), it must not acquire any new lock: Anomalies The Scheduler # of locks held Serializability Query Scheduling

Locking

Two-Phase Locking time SQL 92 Isolation Levels lock phase release phase Concurrency in B-trees Optimistic Concurrency Protocol lock point Multiversion Concurrency Protocol

• Two-phase locking is the concurrency control protocol used in database systems today.

11.27 Transaction Again: ATM Transaction Management Floris Geerts

Example (Two concurrent ATM transactions with locking, 2PL) ¬ Transaction 1 Transaction 2 DB state

lock (acct) ; 1200 ACID Properties

read (acct) ; Anomalies unlock (acct) ; The Scheduler lock (acct) ; Serializability read (acct) ; Query Scheduling unlock (acct) ; Locking lock (acct) ; Two-Phase Locking write (acct) ; 1100 SQL 92 Isolation Levels unlock (acct) ; lock (acct) ; Concurrency in B-trees Optimistic write (acct) ; 1000 Concurrency Protocol unlock (acct) ; Multiversion Concurrency Protocol These locks violate the 2PL principle.

11.28

Transaction A 2PL-Compliant ATM Transaction Management Floris Geerts

• To comply with the two-phase locking protocol, the ATM transaction must not acquire any new locks after a first lock ACID Properties has been released: Anomalies The Scheduler A 2PL-compliant ATM withdrawal transaction Serializability Query Scheduling lock phase Locking 1 lock (acct) ; Two-Phase Locking 2 bal read_bal (acct) ; SQL 92 Isolation ← Levels 3 bal bal 100 ; ← − ¤ Concurrency in B-trees 4 write_bal (acct, bal) ; Optimistic Concurrency Protocol 5 unlock (acct) ; unlock phase Multiversion Concurrency Protocol

11.29 Transaction Resulting Schedule Management Floris Geerts

Example Transaction 1 Transaction 2 DB state lock (acct) ; 1200 ACID Properties

read (acct) ; Anomalies lock (acct) ; The Scheduler Serializability write (acct) ; Transaction 1100 Query Scheduling

unlock (acct) ; blocked Locking lock (acct) ; Two-Phase Locking SQL 92 Isolation read (acct) ; Levels write (acct) ; 900 Concurrency in B-trees Optimistic unlock (acct) ; Concurrency Protocol Multiversion Concurrency Protocol

• Theorem: The use of 2PL-compliant locking always leads to a correct and serializable schedule.

11.30 Transaction Lock Modes Management Floris Geerts • We saw earlier that two read operations do not conflict with each other. • Systems typically use different types of locks (lock modes) to allow read operations to run concurrently. ACID Properties read locks shared locks • or : mode S Anomalies

• write locks or exclusive locks: mode X The Scheduler

Serializability

Query Scheduling Locks are only in conflict if at least one of them is an X lock: • Locking Shared vs. exclusive lock compatibility Two-Phase Locking SQL 92 Isolation shared (S) exclusive (X) Levels shared (S) Concurrency in B-trees × Optimistic exclusive (X) Concurrency Protocol × × Multiversion Concurrency Protocol

• It is a safe operation in two-phase locking to (try to) convert a shared lock into an exclusive lock during the lock phase (lock upgrade) improved concurrency. ⇒ 11.31 Transaction Deadlocks Management Floris Geerts

• Like many lock-based protocols, two-phase locking has the risk of deadlock situations:

Example (Proper schedule with locking) ACID Properties Transaction 1 Transaction 2 Anomalies The Scheduler lock (A) ; Serializability . . lock (B) ; Query Scheduling Locking . do something . Two-Phase Locking . SQL 92 Isolation . do something Levels Concurrency in B-trees . lock (B) ; . Optimistic Concurrency Protocol [ wait for T2 to release lock ] lock (A) ; Multiversion [ wait for T1 to release lock ] Concurrency Protocol

• Both transactions would wait for each other indefinitely.

11.32 Transaction Deadlock Handling Management Floris Geerts • Deadlock detection:

1 The system maintains a waits-for graph, where an edge T T indicates that T is blocked by a lock held by T . 1 → 2 1 2 2 Periodically, the system tests for cycles in the graph. ACID Properties 3 If a cycle is detected, the deadlock is resolved by Anomalies aborting one or more transactions. The Scheduler 4 Selecting the victim is a challenge: Serializability • Aborting young transactions may lead to Query Scheduling Locking starvation: the same transaction may be cancelled Two-Phase Locking

again and again. SQL 92 Isolation • Aborting an old transaction may cause a lot of Levels computational investment to be thrown away (but Concurrency in B-trees Optimistic the undo costs may be high). Concurrency Protocol

Multiversion • Deadlock prevention: Concurrency Protocol Define an ordering on all database objects. If A B, then order the lockoperations in all transactions in the same way (lock(A) before lock(B)).

11.33 Transaction Deadlock Handling Management Floris Geerts

Other common technique: • Deadlock detection via timeout: Let a transaction T block on a lock request only until a timeout occurs (counter expires). On expiration, assume that ACID Properties Anomalies a deadlock has occurred and abort T. The Scheduler

Serializability

Query Scheduling Timeout-based deadlock detection Locking db2 => GET DATABASE CONFIGURATION; Two-Phase Locking . SQL 92 Isolation . Levels Interval for checking deadlock (ms) (DLCHKTIME) = 10000 Concurrency in B-trees Lock timeout (sec) (LOCKTIMEOUT) = 30 Optimistic Concurrency Protocol . . Multiversion Concurrency Protocol • Also: lock-less optimistic concurrency control ( slide 65). %

11.34 Transaction Variants of Two-Phase Locking Management • The two-phase locking discipline does not prescribe exactly Floris Geerts when locks have to be acquired and released. • Two possible variants: Preclaiming and strict 2PL locks locks ACID Properties held held Anomalies The Scheduler time time Serializability Query Scheduling

“lock phase” release phase lock phase “release phase” Locking

Two-Phase Locking Preclaiming 2PL Strict 2PL SQL 92 Isolation Levels

Concurrency in B-trees

Optimistic  What could motivate either variant? Concurrency Protocol Multiversion Concurrency Protocol 1 Preclaiming 2PL:

2 Strict 2PL:

11.35 Transaction Cascading Rollbacks Management Floris Geerts Consider three transactions:

Transations T1,2,3, T1 fails later on

w(x) abort ; T1 ACID Properties r(x) Anomalies T2 The Scheduler

r(x) Serializability T3 Query Scheduling

Locking time t1 t2 Two-Phase Locking

SQL 92 Isolation Levels

Concurrency in B-trees

Optimistic • When transaction T1 aborts, transactions T2 and T3 have Concurrency Protocol already read data written by T ( dirty read, slide 9) Multiversion 1 % Concurrency Protocol • T2 and T3 need to be rolled back, too (cascading roll back).

• T2 and T3 cannot commit until the fate of T1 is known. Strict 2PL can avoid cascading roll backs altogether. (How?) ⇒ 11.36 Transaction Implementing a Lock Manager Management Floris Geerts

We’d like the Lock Manager to do three tasks very efficiently:

1 Check which locks are currently held for a given resource ACID Properties (invoked on lock(A) to decide whether the lock request on Anomalies A can be granted). The Scheduler Serializability

2 When unlock(A) is called, blocked transactions that Query Scheduling

requested locks on A have to be identified and now granted Locking the lock. Two-Phase Locking SQL 92 Isolation 3 When a transaction terminates, all held locks must be Levels

released. Concurrency in B-trees

Optimistic Concurrency Protocol

What is a good data structure to accommodate these needs? Multiversion Concurrency Protocol

11.37 Transaction Lock Manager: Bookkeeping Management Floris Geerts

Bookkeeping in the lock manager

Transaction ID Transaction hash table, Update Flag Control Block indexed by resource ID (TCB) TX Status ACID Properties # of Locks Anomalies . . . Resource Control Block (RCB) . . . The Scheduler Resource ID ... LCB Chain Serializability

Hash Chain Query Scheduling . . First In Queue Locking . . . Two-Phase Locking . SQL 92 Isolation Levels

Transaction ID Transaction ID ... Concurrency in B-trees

Resource ID Resource ID Optimistic Lock Mode Lock Mode Concurrency Protocol Lock Status Lock Status Multiversion Concurrency Protocol Next in Queue Next in Queue LCB Chain LCB Chain Lock Control Blocks (LCBs)

11.38 Transaction Implementing Lock Manager Tasks Management Floris Geerts

1 The locks held for a given resource can be found using a hash lookup: • Linked list of Lock Control Blocks via ‘First In ACID Properties Queue’/‘Next in Queue’ Anomalies

• The list contains all lock requests, granted or not. The Scheduler

• The transaction(s) at the head of the list are the ones Serializability

that currently hold a lock on the resource. Query Scheduling

Locking

Two-Phase Locking 2 When a lock is released (i.e., its LCB removed from the list), SQL 92 Isolation the next transaction(s) in the list are considered for granting Levels the lock. Concurrency in B-trees Optimistic Concurrency Protocol

3 Multiversion All locks held by a single transaction can be identified via Concurrency Protocol the linked list ‘LCB Chain’ (and easily released upon transaction termination).

11.39 Transaction Granularity of Locking Management Floris Geerts The granularity of locking is a trade-off:

database level low concurrency low overhead ACID Properties

Anomalies 1 tablespace level The Scheduler

Serializability table level Query Scheduling Locking

Two-Phase Locking page level SQL 92 Isolation Levels

Concurrency in B-trees row-level high concurrency high overhead Optimistic Concurrency Protocol

Multiversion Concurrency Protocol Idea: multi-granularity locking. ⇒

1An DB2 tablespace represents a collection of tables that share a physical storage location. 11.40 Transaction Multi-Granularity Locking Management Floris Geerts • Decide the granularity of locks held for each transaction (depending on the characteristics of the transaction): • For example, aquire a row lock for

Row-selecting query (C_CUSTKEY is key) ACID Properties Anomalies 1 SELECT* Q1 The Scheduler 2 FROM CUSTOMERS Serializability 3 WHERE C_CUSTKEY = 42 Query Scheduling and a table lock for Locking Two-Phase Locking Table scan query SQL 92 Isolation Levels 1 SELECT* Q2 Concurrency in B-trees 2 FROM CUSTOMERS Optimistic Concurrency Protocol

Multiversion Concurrency Protocol • How do such transactions know about each others’ locks?

• Note that locking is performance-critical. Q2 does not want to do an extensive search for row-level conflicts.

11.42 Intention Locks

Databases use an additional type of lock: intention locks.

• Lock mode intention share: IS
• Lock mode intention exclusive: IX

Extended conflict matrix (× marks a conflict):

          S    X    IS   IX
     S         ×         ×
     X    ×    ×    ×    ×
     IS        ×
     IX   ×    ×

• An intention lock (IS or IX) on a coarser level of granularity means that there is some lock on a lower level.
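The matrix is small enough to encode directly. A hedged Python sketch (the table layout is just one possible representation):

# Compatibility of S, X, IS, IX lock modes: True means "no conflict".
COMPATIBLE = {
    ("S",  "S"): True,  ("S",  "X"): False, ("S",  "IS"): True,  ("S",  "IX"): False,
    ("X",  "S"): False, ("X",  "X"): False, ("X",  "IS"): False, ("X",  "IX"): False,
    ("IS", "S"): True,  ("IS", "X"): False, ("IS", "IS"): True,  ("IS", "IX"): True,
    ("IX", "S"): False, ("IX", "X"): False, ("IX", "IS"): True,  ("IX", "IX"): True,
}

def compatible(requested, held):
    """May a `requested` lock be granted while `held` is in place?"""
    return COMPATIBLE[(requested, held)]

assert compatible("IX", "IS")      # two updaters announcing different rows
assert not compatible("IX", "S")   # a table-level reader blocks row updaters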

11.43 Intention Locks

Multi-granularity locking protocol

1 Before a granule g can be locked in mode m ∈ {S, X}, the transaction has to obtain an intention lock on all coarser granularities that contain g.

2 If all intention locks could be granted, the transaction can lock granule g in the announced mode.

Example (Multi-granularity locking)

Query Q1 (slide 41) would, e.g.,
• obtain an IS lock on table CUSTOMERS (also on the containing tablespace and database) and
• obtain an S lock on the row(s) with C_CUSTKEY = 42.

Query Q2 would place an
• S lock on table CUSTOMERS (and an IS lock on tablespace and database).
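A minimal sketch of the two protocol rules, assuming a fixed hierarchy database → tablespace → table → row and a lock manager with a lock(txn, granule, mode) call (such as the bookkeeping sketch earlier; all names here are illustrative):

HIERARCHY = ["database", "tablespace", "table", "row"]

def lock_granule(lockmgr, txn, path, mode):
    """path maps levels to granule ids, e.g.
    {"database": "db1", "tablespace": "ts1", "table": "CUSTOMERS", "row": 42};
    mode is "S" or "X"."""
    intent = "IS" if mode == "S" else "IX"
    granules = [path[level] for level in HIERARCHY if level in path]
    *coarser, target = granules
    for g in coarser:                        # rule 1: announce on coarser levels
        if not lockmgr.lock(txn, g, intent):
            return False                     # blocked on an intention lock
    return lockmgr.lock(txn, target, mode)   # rule 2: lock g itself

# Q1 (slide 41): IS on database, tablespace, and table; S on the row.
# Q2: IS on database and tablespace; S on table CUSTOMERS (no "row" entry).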

11.44 Detecting Conflicts

• Now suppose an updating query comes in:

Update request Q3

UPDATE CUSTOMERS
SET    NAME = 'Seven Teen'
WHERE  C_CUSTKEY = 17

• Q3 will want to place
  • an IX lock on table CUSTOMERS (and on all coarser granules) and
  • an X lock on the row holding customer 17.

As such, it is
• compatible with Q1 (there is no conflict between IX and IS on the table level),
• but incompatible with Q2 (the table-level S lock held by Q2 conflicts with Q3's IX lock request).

11.45 Consistency Guarantees and SQL 92

Sometimes, some degree of inconsistency may be acceptable for specific applications:

• Occasional "off-by-one errors", e.g., will not considerably affect the outcome of an aggregate over a huge table. ⇒ Accept the inconsistent read anomaly and be rewarded with improved concurrency.

• To this end, SQL 92 specifies different isolation levels.

SQL 92 isolation level control

SET ISOLATION SERIALIZABLE | DIRTY READ | READ UNCOMMITTED | ··· ;

• Obviously, relaxed consistency guarantees should lead to increased throughput.

11.46 SQL 92 Isolation Levels

READ UNCOMMITTED (also: DIRTY READ or BROWSE) (0 locks)
Only write locks are acquired (according to strict 2PL). Any row read may be concurrently changed by other transactions.

READ COMMITTED (also: CURSOR STABILITY) (1 lock)
Read locks are only held for as long as the application's cursor sits on a particular, current row. Other rows may be changed by other transactions. Write locks are acquired according to strict 2PL.

READ STABILITY (n locks)
Locks those rows actually read by the transaction. The transaction may read phantom rows if it executes an aggregation query twice.

SERIALIZABLE
Strict 2PL locking. No phantom rows are seen by the transaction.

11.47 Phantom Row Problem

Example

Transaction 1      Transaction 2             Effect
scan relation R                              T1 locks all rows
                   insert new row into R     T2 locks new row
                   commit                    T2's lock released
scan relation R                              reads new row, too!

• Although both transactions properly followed the 2PL protocol, T1 observed an effect caused by T2. Isolation violated!
• Cause of the problem: T1 can only lock existing rows.
• Possible solution: "declarative locking", i.e., rules specify which rows are to be locked:
  • key range locking, typically in B-trees (see the sketch below),
  • predicate locking (if the predicate evaluates to true for a row, lock it).
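To make key range locking concrete, here is a tiny Python sketch (the representation is an assumption for illustration): a range lock covers key values that do not exist yet, which is exactly what blocks phantom inserts.

class KeyRangeLock:
    """A lock on an interval of key values, including keys that
    no existing row carries: this is what a row lock cannot express."""
    def __init__(self, txn, low, high):
        self.txn, self.low, self.high = txn, low, high

    def conflicts_with_insert(self, txn, key):
        # An insert by another transaction conflicts if the new key
        # would fall into the locked range.
        return txn != self.txn and self.low <= key <= self.high

# T1 scans all of R, so it range-locks the entire key space.
scan_lock = KeyRangeLock(txn="T1", low=float("-inf"), high=float("inf"))
# T2's insert of a brand-new row is detected as a conflict, even though
# no row-level lock could ever have been placed on that row.
assert scan_lock.conflicts_with_insert(txn="T2", key=4711)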

11.48 Isolation Levels and Performance: IBM DB2

SUM aggregation query vs. concurrent debit/credit transactions

[Chart: impact of the chosen isolation level on transaction throughput in IBM DB2.]

Dennis Shasha and Philippe Bonnet. Database Tuning. Morgan Kaufmann, 2003.

11.49 Isolation Levels and Performance: Oracle

SUM aggregation query vs. concurrent debit/credit transactions

[Chart: impact of the chosen isolation level on transaction throughput in Oracle.]

Dennis Shasha and Philippe Bonnet. Database Tuning. Morgan Kaufmann, 2003.

11.50 Isolation Levels and Performance: SQL Server

SUM aggregation query vs. concurrent debit/credit transactions

[Chart: impact of the chosen isolation level on transaction throughput in SQL Server.]

Dennis Shasha and Philippe Bonnet. Database Tuning. Morgan Kaufmann, 2003.

11.51 Resulting Consistency Guarantees

Isolation levels vs. consistency guarantees

isolation level     dirty read      non-repeatable read   phantom read
READ UNCOMMITTED    possible        possible              possible
READ COMMITTED      not possible    possible              possible
READ STABILITY      not possible    not possible          possible
SERIALIZABLE        not possible    not possible          not possible

• Some DBMS implementations support more, fewer, or different levels of isolation.

• Note: few applications really need serializability. Selecting a weaker yet acceptable isolation level is an integral part of database tuning.

11.52 Concurrency in B-tree Indices

Consider an insert transaction Tw into a B+-tree that resulted in a leaf node split (see the following slide):

1 Assume node 4 has just been split, but the new separator has not yet been inserted into node 1.

2 Now a concurrent read transaction Tr tries to find 8050.

3 The (old) node 1 guides Tr to node 4.

4 Node 4 no longer contains entry 8050, so Tr believes there is no data item with zip code 8050.

⇒ This calls for concurrency control in B-trees.

11.53 Concurrent Access to B+-tree While Insertion is in Progress

Example (A search for key 8050 may be misguided)

[Figure: B+-tree with root node 0 (separator 8500), inner nodes 1 (5012, 8280) and 2 (8700, 9016), and leaf nodes 3–8. Insert key 6333: node 4 must be split into node 4 (5012, 6333) and a new node 9 (6423, 8050, 8105); the new separator 6423 has not yet been inserted into node 1.]

• Must split node 4.
• The new separator goes into node 1 (only after the lookup for key 8050 has been performed).

11.53 Concurrent Access to B+-tree While Insertion is in Progress (continued)

Example (A search for key 8050 may be misguided)

[Figure: the same B+-tree after the split has completed. Node 1 now holds the separators 5012, 6423, 8280; node 4 holds (5012, 6333) and the new node 9 holds (6423, 8050, 8105).]

• Must split node 4.
• The new separator goes into node 1 (only after the lookup for key 8050 has been performed).

11.54 Locking and B-tree Indices

Remember how we performed operations on B+-trees:

• To search a B+-tree, we descended the tree top-down. Based on the content of a node n, we decided in which son n′ to continue the search.

• To update a B+-tree, we
  • first did a search,
  • then inserted new data into the right leaf.
  • Depending on the fill levels of nodes, we had to split tree nodes and propagate splits bottom-up.

According to the two-phase locking protocol, we would have to
• obtain S/X locks when we walk down the tree² and
• keep all locks until we are finished.

²Note that lock conversion is not a good idea. It would increase the likelihood of deadlocks (read locks acquired top-down, write locks bottom-up).

11.55 Locking and B-tree Indices

• This strategy would seriously reduce concurrency:
  • All transactions will have to lock the tree root, which becomes a locking bottleneck.
  • Root node locks, effectively, serialize all (write) transactions.

⇒ Two-phase locking is not practical for B-trees.

11.56 Lock Coupling

Let us consider the write-only case first (all locks conflict).

The write-only tree locking (WTL) protocol is sufficient to guarantee serializability:

1 For all tree nodes n other than the root, a transaction may only acquire a lock on n if it already holds a lock on n's parent.

2 Once a node n has been unlocked, the same n may not be locked again by the same transaction.

Effectively,
• all transactions have to follow a top-down access pattern,
• no transaction can "bypass" any other transaction along the same path. Conflicting transactions are thus serializable.
• The WTL protocol is deadlock free.

11.57 Split Safety

• We still have to keep as many write locks as nodes might be affected by node splits.

• It is easy to check for a node n whether an update might affect n's ancestors:
  • If n contains fewer than 2d entries, no split will propagate above n.
  • If n satisfies this condition, it is said to be (split) safe.

• We can use this definition to release write locks early: if, while searching top-down for an insert location, we encounter a safe node n, we can release the locks on all of n's ancestors.

⇒ Effectively, locks near the root are held for a shorter time.

11.58 Lock Coupling Protocol (Variant 1)

Readers

place S lock on root;
current ← root;
while current is not a leaf node do
    place S lock on appropriate son of current;
    release S lock on current;        /* lock coupling */
    current ← son of current;

Writers

place X lock on root;
current ← root;
while current is not a leaf node do
    place X lock on appropriate son of current;
    current ← son of current;
    if current is safe then
        release all locks held on ancestors of current;
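A runnable Python sketch of the writer's descent (the node layout, the exclusive lock standing in for an X lock, and the choice of d are illustrative assumptions):

import threading

D = 50  # assumed branching parameter: a node holds at most 2*D entries

class Node:
    def __init__(self, entries, children=None):
        self.entries = entries            # sorted keys / separators
        self.children = children or []    # empty list means: leaf node
        self.lock = threading.RLock()     # stands in for an X-capable lock

def is_safe(node):
    # Fewer than 2*D entries: one more insert cannot force a split.
    return len(node.entries) < 2 * D

def child_for(node, key):
    for i, sep in enumerate(node.entries):
        if key < sep:
            return node.children[i]
    return node.children[-1]

def writer_descend(root, key):
    """Lock-coupled descent (Variant 1): X-lock top-down and release
    all ancestor locks as soon as a split-safe node is reached."""
    root.lock.acquire()
    held, current = [root], root
    while current.children:
        son = child_for(current, key)
        son.lock.acquire()
        held.append(son)
        current = son
        if is_safe(current):
            for ancestor in held[:-1]:
                ancestor.lock.release()
            held = [current]
    return held   # the caller inserts into the leaf, then releases these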

11.59 Increasing Concurrency for Common Scenarios

• Even with lock coupling, a considerable number of locks is held on inner tree nodes (reducing concurrency).

• Chances that inner nodes are actually affected by updates, however, are very small.

• Back-of-the-envelope calculation: d = 50 ⇒ every 50th insert causes a split (a 2 % chance).

⇒ An insert transaction could thus optimistically assume that no leaf split is going to happen:
  • On inner nodes, only read locks are acquired during tree navigation (plus a write lock on the affected leaf).
  • If the assumption is wrong, nothing is broken yet: re-traverse the tree and obtain write locks.

11.60 Lock Coupling Protocol (Variant 2)

Modified protocol for writers:³

Writers

place S lock on root;
current ← root;
while current is not a leaf node do
    son ← appropriate son of current;
    if son is a leaf then
        place X lock on son;
    else
        place S lock on son;
    release lock on current;
    current ← son;
/* current is a leaf now */
if current is unsafe then
    release all locks and repeat with protocol Variant 1;

³The reader protocol remains unchanged.
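Continuing the Variant 1 sketch above (it reuses Node, child_for, is_safe, and writer_descend; the single exclusive lock again stands in for both S and X locks, which is a simplification):

def writer_descend_optimistic(root, key):
    """Variant 2: assume no split will happen. Lock inner nodes only
    briefly (S in a real system), X-lock the leaf, and fall back to
    Variant 1 if the leaf turns out to be unsafe."""
    root.lock.acquire()
    current = root
    while current.children:
        son = child_for(current, key)
        son.lock.acquire()          # X if son is a leaf, S otherwise
        current.lock.release()      # plain lock coupling, no ancestors kept
        current = son
    if not is_safe(current):        # assumption failed: the leaf may split
        current.lock.release()
        return writer_descend(root, key)   # re-traverse with Variant 1
    return [current]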

11.61 B+-trees Without Read Locks

• Lehman and Yao (TODS vol. 6(4), 1981) proposed a protocol that does not need any read locks on B-tree nodes.

• Requirement: a next pointer, pointing to the right sibling.

[Figure: B+-tree with right-sibling pointers; every node, inner as well as leaf, points to its right neighbor on the same level.]

• Original B+-tree: linked list along the leaf level only (the sequence set).

• Sibling pointers provide a second path to find each node.

• This way, read operations that have been misguided by concurrent splits can still find the node that they need.

11.62 Insertion With Inner Node Split

Insertion (of key 4222) hits full inner node B:

1  lock & read page B;
2  create new page D and lock it;
3  populate page D;
4  set next pointer D → C;
5  write D;
6  set next pointer B → D;
7  adjust content of B;
8  write B;
9  lock & read A;
10 adjust content of A;
11 write A;

[Figure: parent node A (6423, 8500) above nodes B and C. After the split, B holds (4222, 5012), the new page D holds (8105, 8280), and C holds (8700, 9016); the next pointers run B → D → C.]

• All index entries remained reachable at all times.
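The crucial point is the write ordering: D becomes reachable through its sibling pointer before B sheds entries, and the parent is updated last. A hedged Python sketch of that ordering (the Page layout and the write_page stub are assumptions; write_page stands in for an atomic single-page disk write):

class Page:
    def __init__(self, entries, children=(), nxt=None):
        self.entries = list(entries)
        self.children = list(children)
        self.next = nxt                   # right-sibling pointer

def write_page(page):
    pass                                  # placeholder for a real page write

def split_inner(A, B):
    """Split the full inner page B under parent A in Lehman/Yao order
    (cf. steps 1-11 above)."""
    mid = len(B.entries) // 2
    separator = B.entries[mid]
    D = Page(B.entries[mid + 1:], B.children[mid + 1:], nxt=B.next)  # 2-4
    write_page(D)                                                    # 5
    B.next = D                                                       # 6
    B.entries, B.children = B.entries[:mid], B.children[:mid + 1]    # 7
    write_page(B)                                                    # 8
    i = A.children.index(B)                                          # 9
    A.entries.insert(i, separator)                                   # 10
    A.children.insert(i + 1, D)
    write_page(A)                                                    # 11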

11.63 B-tree Access By Readers

• The next pointers give readers the chance to "find" entries even in the middle of a split, when some entries (e.g., keys 8105, 8280) have already moved to the new page.

• During tree_search(k), reading transactions move from a node n to its next sibling if
  1 no appropriate entry can be found in n (the high key⁴ of n is less than k, so the tree structure may be temporarily inconsistent) and
  2 next is not a null pointer.

• A search for key 8200 will exceed the high key value of node B after line 8 (which will be less than 8105) ⇒ traverse the sibling link from node B to D.

• Write locks do not prevent readers from accessing a page! (This is because readers do not acquire read locks.)

• PostgreSQL, e.g., uses this protocol for B+-tree access.

⁴An inner node v's high key indicates the maximum value in the subtree below v.
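A hedged sketch of the reader's "move right" step, reusing Page and child_for from the sketches above and assuming each page also carries a high_key field (None for the rightmost page on a level):

def tree_search(page, key):
    """Read descent without read locks (Lehman/Yao): whenever the key
    exceeds a page's high key, a concurrent split has moved the relevant
    entries to the right sibling, so chase the next pointer."""
    while True:
        while (page.high_key is not None and key > page.high_key
               and page.next is not None):
            page = page.next              # move right past a concurrent split
        if not page.children:             # a leaf: the search ends here
            return page
        page = child_for(page, key)       # ordinary downward step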
11.64 Locks and Latches

Which assumption is necessary for lock-free read access?
Modifications to a single page need to be done atomically. The tree might be inconsistent, e.g., while a writer is just executing line 7 on slide 62.

• We do not want to use locks to guarantee atomicity (which would, after all, defeat the whole approach).

• Instead, systems typically provide special light-weight locks, so-called latches, to allow for short-term atomic operations.

• Latches incur very little overhead; they are sometimes even implemented as spinlocks (busy-wait loops).

• However, they are not under the control of the lock manager. Hence, they are not covered by deadlock detection or, depending on the system, by automatic unlocking in case of a transaction rollback.
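A toy latch in Python (illustrative only; real systems build latches on hardware atomic instructions): spin briefly in the hope that the holder finishes its short critical section, then fall back to blocking.

import threading

class Latch:
    """Light-weight short-term lock: spin first, then block.
    Unlike lock-manager locks, there is no deadlock detection and
    no automatic release on transaction rollback."""
    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self, spins=1000):
        for _ in range(spins):                    # busy-wait phase
            if self._lock.acquire(blocking=False):
                return
        self._lock.acquire()                      # give up spinning, block

    def release(self):
        self._lock.release()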

11.65 Optimistic Concurrency Control

• Up to here, the approach to concurrency control has been pessimistic: assume that transactions will conflict, and thus protect database objects by locks and lock protocols.

• The converse is an optimistic concurrency control approach:
  • Hope for the best and let transactions freely proceed with their read/write operations.
  • Only just before updates are to be committed to the database, perform a check to see whether conflicts indeed did not happen.

• Rationale: non-serializable conflicts are not that frequent. Save the locking overhead in the majority of cases and only invest the effort if really required.

11.66 Optimistic Concurrency Control

Under optimistic concurrency control, transactions proceed in three phases:

Optimistic concurrency control

1 Read Phase. Execute the transaction, but do not write data back to disk immediately. Instead, collect updates in the transaction's private workspace.

2 Validation Phase. When the transaction wants to commit, test whether its execution was correct (i.e., only acceptable conflicts happened). If it was not, abort the transaction.

3 Write Phase. Transfer data from the private workspace into the database.

• Note: phases 2 and 3 need to be performed in a non-interruptible critical section (thus also called the val-write phase).

11.67 Validating Transactions

Validation is typically implemented by looking at transaction Tj's

• read set RS(Tj) (the attributes read by Tj) and
• write set WS(Tj) (the attributes written by Tj).

Backward-oriented optimistic concurrency control (BOCC)

Compare Tj against all committed transactions Ti. The check succeeds if

    Ti committed before Tj started   or   RS(Tj) ∩ WS(Ti) = ∅.

Forward-oriented optimistic concurrency control (FOCC)

Compare Tj against all running transactions Ti. The check succeeds if

    WS(Tj) ∩ RS(Ti) = ∅.
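Both rules are plain set tests. A compact Python sketch (the transaction record and its logical timestamps are illustrative assumptions):

from dataclasses import dataclass, field

@dataclass
class Txn:
    start: int                              # logical start time
    commit: int | None = None               # None: still running
    rs: set = field(default_factory=set)    # read set
    ws: set = field(default_factory=set)    # write set

def bocc_validate(tj, committed):
    """BOCC: Tj passes if every committed Ti either finished before
    Tj started or wrote nothing that Tj read."""
    return all(ti.commit <= tj.start or not (tj.rs & ti.ws)
               for ti in committed)

def focc_validate(tj, running):
    """FOCC: Tj passes if its writes touch nothing that any
    currently running Ti has read."""
    return all(not (tj.ws & ti.rs) for ti in running)

# Example: committed T1 wrote attribute "y" after T2 had started,
# and T2 read "y": BOCC must reject T2.
t1 = Txn(start=1, commit=5, ws={"y"})
t2 = Txn(start=3, rs={"y"})
assert not bocc_validate(t2, [t1])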

11.68 Optimistic Concurrency Control in IBM DB2

• DB2 V9.5 provides SQL-level constructs that enable database applications to implement optimistic concurrency control:
  • RID(r): return the row identifier of row r,
  • ROW CHANGE TOKEN FOR r: a unique number reflecting the time row r was last updated.

Optimistic concurrency control

db2 => SELECT * FROM EMPLOYEES

ID NAME DEPT SALARY
-- ---- ---- ------
 1 Alex DE      300
 2 Bert DE      100
 3 Cora DE      200
 4 Drew US      200
 5 Erik US      400

db2 => SELECT E.NAME, E.SALARY,
              RID(E) AS RID, ROW CHANGE TOKEN FOR E AS TOKEN
       FROM EMPLOYEES E
       WHERE E.NAME = 'Erik'

NAME SALARY RID      TOKEN
---- ------ -------- --------------
Erik    400 16777224 74904229642240

11.69 Optimistic Concurrency Control in IBM DB2

Optimistic concurrency control (continued)

• The 'Erik' row belongs to our read set. Save its RID and TOKEN values to perform validation later.

• (. . . time passes . . . ) Now try to save our changes to the row and perform BOCC-style validation:

db2 => UPDATE EMPLOYEES E
       SET E.SALARY = 450
       WHERE RID(E) = 16777224                        -- identify row
         AND ROW CHANGE TOKEN FOR E = 74904229642240  -- row changed?

SQL0100W  No row was found for FETCH, UPDATE or DELETE; or the
result of a query is an empty table.  SQLSTATE=02000

db2 => SELECT E.ID, E.NAME,
              RID(E) AS RID, ROW CHANGE TOKEN FOR E AS TOKEN
       FROM EMPLOYEES E

ID NAME RID      TOKEN
-- ---- -------- ------------------
 1 Alex 16777220 74904229642240
 2 Bert 16777221 74904229642240
 3 Cora 16777222 74904229642240
 4 Drew 16777223 74904229642240
 5 Erik 16777224 141378732653941710

⇒ The UPDATE affected no row: a concurrent transaction changed the 'Erik' row in the meantime (its ROW CHANGE TOKEN differs), so validation fails.

11.70 Multiversion Concurrency Control

Consider the schedule

    r1(x), w1(x), r2(x), w2(y), r1(y), w1(z)

(let t denote the point in time just before the w2(y) operation).

Is this schedule serializable?
No! We have w1(x) < r2(x) (T1 must precede T2) and w2(y) < r1(y) (T2 must precede T1): the conflict graph is cyclic.

• Now suppose that, when T1 wants to read y, we still have the "old" value of y around, valid at time t (prior to the conflicting w2(y) operation).

• We could then create a schedule equivalent to

    r1(x), w1(x), r2(x), r1(y), w2(y), w1(z),

in which r1(y) reads that old version, and which is serializable.

11.71 Multiversion Concurrency Control

• With old object versions still around, read transactions no longer need to be blocked. They might see outdated, but consistent versions of the data.

• Problem: versioning requires space and management overhead (and eventually garbage collection).

• Some database systems implement such an approach, termed snapshot isolation: Oracle, SQL Server, PostgreSQL.
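A minimal sketch of version-based reads (the version-store layout and logical timestamps are illustrative assumptions): every write appends a new version, and a reader sees the newest version that committed at or before its snapshot, without ever blocking.

class VersionedStore:
    """Toy multiversion store: object -> list of (commit_time, value),
    in commit-time order. Old versions linger until garbage collection."""
    def __init__(self):
        self.versions = {}

    def write(self, obj, value, commit_time):
        self.versions.setdefault(obj, []).append((commit_time, value))

    def read(self, obj, snapshot_time):
        visible = [v for (t, v) in self.versions.get(obj, ())
                   if t <= snapshot_time]
        return visible[-1] if visible else None

store = VersionedStore()
store.write("y", "old", commit_time=1)   # the version valid at time t
store.write("y", "new", commit_time=5)   # T2's conflicting write
assert store.read("y", snapshot_time=3) == "old"   # T1 still sees time-t value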
