External Sorting Why Sort? 2-Way Sort: Requires 3 Buffers Two-Way

External Sorting Why Sort? 2-Way Sort: Requires 3 Buffers Two-Way

Why Sort? A classic problem in computer science! Data requested in sorted order External Sorting e.g., find students in increasing gpa order Sorting is first step in bulk loading B+ tree index. Chapter 13 Sorting useful for eliminating duplicate copies in a collection of records (Why?) Sort-merge join algorithm involves sorting. Problem: sort 1Gb of data with 1Mb of RAM. why not virtual memory? Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 2 2-Way Sort: Requires 3 Buffers Two-Way External Merge Sort 3,4 6,2 9,4 8,7 5,6 3,1 2 Input file Each pass we read + write PASS 0 Pass 1: Read a page, sort it, write it. each page in file. 3,42,6 4,9 7,8 5,6 1,3 2 1-page runs PASS 1 only one buffer page is used 4,7 1,3 N pages in the file => the 2,3 2-page runs number of passes 4,6 8,9 5,6 2 Pass 2, 3, …, etc.: PASS 2 =+log N 1 2,3 three buffer pages used. 2 4,4 1,2 4-page runs So toal cost is: 6,7 3,5 8,9 6 PASS 3 INPUT 1 21NN(log2 + ) OUTPUT 1,2 Idea: Divide and conquer: 2,3 INPUT 2 sort subfiles and merge 3,4 8-page runs 4,5 Main memory buffers Disk Disk 6,6 7,8 9 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 3 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 General External Merge Sort Cost of External Merge Sort * More than 3 buffer pages. How can we utilize them? To sort a file with N pages using B buffer pages: Number of passes: 1 + logB−1 NB / Pass 0: use B buffer pages. Produce NB / sorted runs of B Cost = 2N * (# of passes) pages each. E.g., with 5 buffer pages, to sort 108 page file: Pass 2, …, etc.: merge B-1 runs. Pass 0: 108 / 5 = 22 sorted runs of 5 pages each INPUT 1 (last run is only 3 pages) Pass 1: 22 / 4 = 6 sorted runs of 20 pages each INPUT 2 (last run is only 8 pages) . OUTPUT . Pass 2: 2 sorted runs, 80 pages and 28 pages INPUT B-1 Pass 3: Sorted file of 108 pages Disk Disk B Main memory buffers Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 5 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 6 Number of Passes of External Sort Internal Sort Algorithm Quicksort is a fast way to sort in memory. N B=3 B=5 B=9 B=17 B=129 B=257 An alternative is “tournament sort” (a.k.a. 100 7 4 3 2 1 1 “heapsort”) 1,000 10 5 4 3 2 2 Top: Read in B blocks 10,000 13 7 5 4 2 2 Output: move smallest record to output buffer 100,000 17 9 6 5 3 3 Read in a new record r 1,000,000 20 10 7 5 3 3 insert r into “heap” 10,000,000 23 12 8 6 4 3 if r not smallest, then GOTO Output 100,000,000 26 14 9 7 4 4 else remove r from “heap” 1,000,000,000 30 15 10 8 5 4 output “heap” in order; GOTO Top Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 7 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8 More on Heapsort I/O for External Merge Sort Fact: average length of a run in heapsort is 2B … longer runs often means fewer passes! The “snowplow” analogy Actually, do I/O a page at a time Worst-Case: In fact, read a block of pages sequentially! What is min length of a run? Suggests we should make each buffer How does this arise? (input/output) be a block of pages. Best-Case: B But this will reduce fan-out during merge passes! What is max length of a run? In practice, most files still sorted in 2-3 passes. How does this arise? Quicksort is faster, but ... Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 9 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 10 Number of Passes of Optimized Sort Double Buffering To reduce wait time for I/O request to N B=1,000 B=5,000 B=10,000 complete, can prefetch into `shadow block’. 100 1 1 1 Potentially, more passes; in practice, most files still 1,000 1 1 1 sorted in 2-3 passes. 10,000 2 2 1 100,000 3 2 2 INPUT 1 INPUT 1' 1,000,000 3 2 2 INPUT 2 OUTPUT 10,000,000 4 3 3 INPUT 2' OUTPUT' 100,000,000 5 3 3 b block size 1,000,000,000 5 4 3 Disk INPUT k Disk INPUT k' * Block size = 32, initial pass produces runs of size 2B. B main memory buffers, k-way merge Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 11 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12 Sorting Records! Using B+ Trees for Sorting Sorting has become a blood sport! Scenario: Table to be sorted has B+ tree index on sorting column(s). Parallel sorting is the name of the game ... Idea: Can retrieve records in order by traversing Datamation: Sort 1M records of size 100 bytes leaf pages. Typical DBMS: 15 minutes World record: 3.5 seconds Is this a good idea? • 12-CPU SGI machine, 96 disks, 2GB of RAM Cases to consider: New benchmarks proposed: B+ tree is clustered Good idea! Minute Sort: How many can you sort in 1 minute? B+ tree is not clustered Could be a very bad idea! Dollar Sort: How many can you sort for $1.00? Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 13 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 14 Clustered B+ Tree Used for Sorting Unclustered B+ Tree Used for Sorting Alternative (2) for data entries; each data Cost: root to the left- Index entry contains rid of a data record. In general, most leaf, then retrieve (Directs search) all leaf pages one I/O per data record! (Alternative 1) Data Entries If Alternative 2 is used? ("Sequence set") Index (Directs search) Additional cost of retrieving data records: each page fetched just Data Entries ("Sequence set") once. Data Records * Always better than external sorting! Data Records Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 15 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 16 External Sorting vs. Unclustered Index Summary N Sorting p=1 p=10 p=100 External sorting is important; DBMS may dedicate part of buffer pool for sorting! 100 200 100 1,000 10,000 External merge sort minimizes disk I/O cost: 1,000 2,000 1,000 10,000 100,000 Pass 0: Produces sorted runs of size B (# buffer pages). 10,000 40,000 10,000 100,000 1,000,000 Later passes: merge runs. 100,000 600,000 100,000 1,000,000 10,000,000 # of runs merged at a time depends on B, and block size. 1,000,000 8,000,000 1,000,000 10,000,000 100,000,000 Larger block size means less I/O cost per page. 10,000,000 80,000,000 10,000,000 100,000,000 1,000,000,000 Larger block size means smaller # runs merged. In practice, # of runs rarely more than 2 or 3. * p: # of records per page * B=1,000 and block size=32 for sorting * p=100 is the more realistic value. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 17 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 18 Summary, cont. Choice of internal sort algorithm may matter: Quicksort: Quick! Heap/tournament sort: slower (2x), longer runs The best sorts are wildly fast: Despite 40+ years of research, we’re still improving! Clustered B+ tree is good for sorting; unclustered tree is usually very bad. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 19.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    4 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us