Why Sort?

™ A classic problem in computer science! ™ Data requested in sorted order External Sorting ƒ e.g., find students in increasing gpa order ™ Sorting is first step in bulk loading B+ tree index. Chapter 13 ™ Sorting useful for eliminating duplicate copies in a collection of records (Why?) ™ Sort-merge join involves sorting. ™ Problem: sort 1Gb of data with 1Mb of RAM. ƒ why not virtual memory?

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 2

2-Way Sort: Requires 3 Buffers Two-Way External

3,4 6,2 9,4 8,7 5,6 3,1 2 Input file ™ Each pass we read + write PASS 0 ™ Pass 1: Read a page, sort it, write it. each page in file. 3,42,6 4,9 7,8 5,6 1,3 2 1-page runs PASS 1 ƒ only one buffer page is used 4,7 1,3 ™ N pages in the file => the 2,3 2-page runs number of passes 4,6 8,9 5,6 2 ™ Pass 2, 3, …, etc.: PASS 2 =+log N 1 2,3 ƒ three buffer pages used. 2 4,4 1,2 4-page runs ™ So toal cost is: 6,7 3,5 8,9 6 PASS 3 INPUT 1 21NN(log2 + ) OUTPUT 1,2 ™ Idea: Divide and conquer: 2,3 INPUT 2 sort subfiles and merge 3,4 8-page runs 4,5 Main memory buffers Disk Disk 6,6 7,8 9 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 3 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4

General External Merge Sort Cost of External Merge Sort

* More than 3 buffer pages. How can we utilize them? ™ To sort a file with N pages using B buffer pages: ™ Number of passes: 1 + logB−1 NB / ƒ Pass 0: use B buffer pages. Produce  NB / sorted runs of B ™ Cost = 2N * (# of passes) pages each. ™ E.g., with 5 buffer pages, to sort 108 page file: ƒ Pass 2, …, etc.: merge B-1 runs. ƒ Pass 0:  108 / 5 = 22 sorted runs of 5 pages each

INPUT 1 (last run is only 3 pages) ƒ Pass 1:  22 / 4 = 6 sorted runs of 20 pages each INPUT 2 (last run is only 8 pages) . . . OUTPUT ...... ƒ Pass 2: 2 sorted runs, 80 pages and 28 pages INPUT B-1 ƒ Pass 3: Sorted file of 108 pages Disk Disk B Main memory buffers

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 5 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 6 Number of Passes of External Sort Internal Sort Algorithm

™ is a fast way to sort in memory. N B=3 B=5 B=9 B=17 B=129 B=257 ™ An alternative is “tournament sort” (a.k.a. 100 7 4 3 2 1 1 “”) 1,000 10 5 4 3 2 2 ƒ Top: Read in B blocks 10,000 13 7 5 4 2 2 ƒ Output: move smallest record to output buffer 100,000 17 9 6 5 3 3 ƒ Read in a new record r 1,000,000 20 10 7 5 3 3 ƒ insert r into “heap” 10,000,000 23 12 8 6 4 3 ƒ if r not smallest, then GOTO Output 100,000,000 26 14 9 7 4 4 ƒ else remove r from “heap” 1,000,000,000 30 15 10 8 5 4 ƒ output “heap” in order; GOTO Top

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 7 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8

More on Heapsort I/O for External Merge Sort

™ Fact: average length of a run in heapsort is 2B ™ … longer runs often means fewer passes! ƒ The “snowplow” analogy ™ Actually, do I/O a page at a time ™ Worst-Case: ™ In fact, read a block of pages sequentially! ƒ What is min length of a run? ™ Suggests we should make each buffer ƒ How does this arise? (input/output) be a block of pages. ™ Best-Case: B ƒ But this will reduce fan-out during merge passes! ƒ What is max length of a run? ƒ In practice, most files still sorted in 2-3 passes. ƒ How does this arise? ™ Quicksort is faster, but ...

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 9 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 10

Number of Passes of Optimized Sort Double Buffering

™ To reduce wait time for I/O request to N B=1,000 B=5,000 B=10,000 complete, can prefetch into `shadow block’. 100 1 1 1 ƒ Potentially, more passes; in practice, most files still 1,000 1 1 1 sorted in 2-3 passes. 10,000 2 2 1 100,000 3 2 2 INPUT 1 INPUT 1' 1,000,000 3 2 2 INPUT 2 OUTPUT 10,000,000 4 3 3 INPUT 2' OUTPUT'

100,000,000 5 3 3 b block size 1,000,000,000 5 4 3 Disk INPUT k Disk INPUT k'

* Block size = 32, initial pass produces runs of size 2B. B main memory buffers, k-way merge Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 11 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12 Sorting Records! Using B+ Trees for Sorting

™ ™ Sorting has become a blood sport! Scenario: Table to be sorted has B+ tree index on sorting column(s). ƒ Parallel sorting is the name of the game ... ™ Idea: Can retrieve records in order by traversing ™ Datamation: Sort 1M records of size 100 bytes leaf pages. ƒ Typical DBMS: 15 minutes ™ ƒ World record: 3.5 seconds Is this a good idea? • 12-CPU SGI machine, 96 disks, 2GB of RAM ™ Cases to consider: ™ New benchmarks proposed: ƒ B+ tree is clustered Good idea! ƒ Minute Sort: How many can you sort in 1 minute? ƒ B+ tree is not clustered Could be a very bad idea! ƒ Dollar Sort: How many can you sort for $1.00?

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 13 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 14

Clustered B+ Tree Used for Sorting Unclustered B+ Tree Used for Sorting

™ Alternative (2) for data entries; each data ™ Cost: root to the left- Index entry contains rid of a data record. In general, most leaf, then retrieve (Directs search) all leaf pages one I/O per data record! (Alternative 1) Data Entries ™ If Alternative 2 is used? ("Sequence set") Index (Directs search) Additional cost of retrieving data records: each page fetched just Data Entries ("Sequence set") once. Data Records

* Always better than external sorting! Data Records Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 15 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 16

External Sorting vs. Unclustered Index Summary

N Sorting p=1 p=10 p=100 ™ External sorting is important; DBMS may dedicate part of buffer pool for sorting! 100 200 100 1,000 10,000 ™ External merge sort minimizes disk I/O cost: 1,000 2,000 1,000 10,000 100,000 ƒ Pass 0: Produces sorted runs of size B (# buffer pages). 10,000 40,000 10,000 100,000 1,000,000 Later passes: merge runs. 100,000 600,000 100,000 1,000,000 10,000,000 ƒ # of runs merged at a time depends on B, and block size. 1,000,000 8,000,000 1,000,000 10,000,000 100,000,000 ƒ Larger block size means less I/O cost per page. 10,000,000 80,000,000 10,000,000 100,000,000 1,000,000,000 ƒ Larger block size means smaller # runs merged. ƒ In practice, # of runs rarely more than 2 or 3. * p: # of records per page * B=1,000 and block size=32 for sorting * p=100 is the more realistic value. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 17 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 18 Summary, cont.

™ Choice of internal sort algorithm may matter: ƒ Quicksort: Quick! ƒ Heap/tournament sort: slower (2x), longer runs ™ The best sorts are wildly fast: ƒ Despite 40+ years of research, we’re still improving! ™ Clustered B+ tree is good for sorting; unclustered tree is usually very bad.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 19