External Sorting and Query Evaluation (R&G Ch. 13 and 14)

Carnegie Mellon Univ., Dept. of Computer Science (CMU SCS)
15-415 - Database Applications
Lecture 11: external sorting and query evaluation (R&G ch. 13 and 14)
Faloutsos

Why Sort?
• select ... order by
  – e.g., find students in increasing gpa order
• bulk loading B+ tree index
• duplicate elimination (select distinct)
• select ... group by
• Sort-merge join algorithm involves sorting

Outline
• two-way merge sort
• external merge sort
• fine-tunings
• B+ trees for sorting

Two-Way External Merge Sort (2-way sort: requires 3 buffers)
• Pass 0: read a page, sort it, write it.
  – only one buffer page is used
• Pass 1, 2, 3, ..., etc.: requires 3 buffer pages
  – merge pairs of runs into runs twice as long
  – three buffer pages used (INPUT 1, INPUT 2, OUTPUT)
• Each pass we read + write each page in file.

[Figure: disk -> main memory buffers (INPUT 1, INPUT 2, OUTPUT) -> disk]

Example (7-page file, two records per page, "|" separates runs):
  Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
  PASS 0 (1-page runs):  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
  PASS 1 (2-page runs):  2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
  PASS 2 (4-page runs):  2,3 4,4 6,7 8,9 | 1,2 3,5 6
  PASS 3 (8-page runs):  1,2 2,3 3,4 4,5 6,6 7,8 9
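The two-way procedure above can be sketched in Python. This is a minimal in-memory simulation under stated assumptions, not code from the lecture: "pages" are plain lists of integers, the "disk" is a list of pages, and the names `merge_runs` and `two_way_external_sort` are hypothetical. The merge step keeps only two input pages and one output page at a time, mirroring the three buffers.

```python
from typing import List, Tuple

Page = List[int]
Run = List[Page]  # a run is a sequence of pages whose records are in sorted order


def merge_runs(left: Run, right: Run, page_size: int) -> Run:
    """Merge two sorted runs using two input buffer pages and one output page."""
    out_pages: Run = []
    out: Page = []                     # the single OUTPUT buffer page
    in1: Page = []                     # INPUT 1 buffer page
    in2: Page = []                     # INPUT 2 buffer page
    li = ri = 0                        # next page to read from each run
    while True:
        # Refill an input buffer from its run when it has been drained.
        while not in1 and li < len(left):
            in1 = list(left[li])
            li += 1
        while not in2 and ri < len(right):
            in2 = list(right[ri])
            ri += 1
        if not in1 and not in2:
            break
        # Move the smaller head record to the output page.
        if in1 and (not in2 or in1[0] <= in2[0]):
            out.append(in1.pop(0))
        else:
            out.append(in2.pop(0))
        # "Write" the output page when it is full.
        if len(out) == page_size:
            out_pages.append(out)
            out = []
    if out:
        out_pages.append(out)
    return out_pages


def two_way_external_sort(pages: List[Page], page_size: int = 2) -> Tuple[Run, int]:
    """Return the sorted file (as pages) and the number of passes used."""
    # Pass 0: sort each page on its own, giving 1-page runs.
    runs: List[Run] = [[sorted(p)] for p in pages]
    passes = 1
    # Pass 1, 2, ...: merge pairs of runs into runs twice as long.
    while len(runs) > 1:
        merged: List[Run] = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                merged.append(merge_runs(runs[i], runs[i + 1], page_size))
            else:
                merged.append(runs[i])  # odd run out is carried to the next pass
        runs = merged
        passes += 1
    return (runs[0] if runs else []), passes


if __name__ == "__main__":
    slide_input = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
    print(two_way_external_sort(slide_input))
```

Run on the 7-page input file of the example, the sketch needs four passes (pass 0 through pass 3) and ends with the pages 1,2 2,3 3,4 4,5 6,6 7,8 9.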
Two-Way External Merge Sort: cost
• N pages in the file => the number of passes is $\lceil \log_2 N \rceil + 1$
• So total cost is: $2N \left( \lceil \log_2 N \rceil + 1 \right)$
• Idea: divide and conquer: sort subfiles and merge

External merge sort (B > 3 buffers)
• Q1: how to sort?
• Q2: cost?

General External Merge Sort
B > 3 buffer pages. How to sort a file with N pages?
– Pass 0: use B buffer pages. Produce $\lceil N/B \rceil$ sorted runs of B pages each.
– Pass 1, 2, ..., etc.: merge B-1 runs.

[Figure: disk -> B main memory buffers (INPUT 1, INPUT 2, ..., INPUT B-1, OUTPUT) -> disk]

Sorting
– create sorted runs of size B (how many?)
– merge them (how?)
– merge first B-1 runs into a sorted run of (B-1)*B pages, ...
– How many steps do we need?
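The general algorithm can be sketched the same way. The following is an illustrative simulation, assuming in-memory lists stand in for disk pages and that a run can be kept as one flat sorted list; the function name `external_merge_sort` and the `runs_per_pass` bookkeeping are hypothetical. Pass 0 sorts B pages at a time, and every later pass merges groups of B-1 runs (one buffer is reserved for output).

```python
import heapq
from typing import List, Tuple


def external_merge_sort(pages: List[List[int]], num_buffers: int) -> Tuple[List[int], List[int]]:
    """Simulate external merge sort with `num_buffers` buffer pages.

    Returns the sorted records and the number of runs left after each pass.
    """
    if num_buffers < 3:
        raise ValueError("need at least 3 buffer pages")

    # Pass 0: read B pages at a time, sort them in memory, write one run.
    runs: List[List[int]] = []
    for start in range(0, len(pages), num_buffers):
        chunk = [rec for page in pages[start:start + num_buffers] for rec in page]
        runs.append(sorted(chunk))
    runs_per_pass = [len(runs)]

    # Pass 1, 2, ...: (B-1)-way merge, since one buffer holds the output page.
    fan_in = num_buffers - 1
    while len(runs) > 1:
        runs = [
            list(heapq.merge(*runs[start:start + fan_in]))
            for start in range(0, len(runs), fan_in)
        ]
        runs_per_pass.append(len(runs))

    return (runs[0] if runs else []), runs_per_pass


if __name__ == "__main__":
    # 108 pages of two records each, sorted with B = 5 buffer pages.
    demo_pages = [[(i * 37) % 101, (i * 53) % 97] for i in range(108)]
    _, counts = external_merge_sort(demo_pages, 5)
    print(counts)
```

For a 108-page file and B = 5 the sketch reports 22, 6, 2 and 1 runs, that is, four passes in total. A real implementation would stream pages through the buffers rather than materialize whole runs in memory.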
Cost of External Merge Sort
• Number of passes: $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$
  – the number of merge passes is the smallest i where $B (B-1)^i \ge N$
• How many reads/writes per pass? N + N
• Cost = 2N * (# of passes)
• E.g., with 5 buffer pages, to sort a 108-page file:
  – Pass 0: $\lceil 108/5 \rceil$ = 22 sorted runs of 5 pages each (last run is only 3 pages)
  – Pass 1: $\lceil 22/4 \rceil$ = 6 sorted runs of 20 pages each (last run is only 8 pages)
  – Pass 2: 2 sorted runs, 80 pages and 28 pages
  – Pass 3: sorted file of 108 pages
• Formula check: $\lceil \log_4 22 \rceil = 3$, ... + 1 -> 4 passes √

Number of Passes of External Sort (I/O cost is 2N times number of passes)

  N               B=3   B=5   B=9   B=17   B=129   B=257
  100               7     4     3      2       1       1
  1,000            10     5     4      3       2       2
  10,000           13     7     5      4       2       2
  100,000          17     9     6      5       3       3
  1,000,000        20    10     7      5       3       3
  10,000,000       23    12     8      6       4       3
  100,000,000      26    14     9      7       4       4
  1,000,000,000    30    15    10      8       5       4

Outline
• two-way merge sort
• external merge sort
• fine-tunings
  – which internal sort for Phase 0?
  – blocked I/O
• B+ trees for sorting

Internal Sort Algorithm
• Quicksort is a fast way to sort in memory.
• But: we get B buffers, and produce 1 run of length B.
• Can we produce longer runs than that?
• Heapsort (B=3): pick smallest, output, read from next buffer.
• Alternative: "tournament sort" (a.k.a. "heapsort", "replacement selection")
• Produces runs of length ~ 2*B
• Clever, but not implemented, for subtle reasons: tricky memory management on variable length records

Reminder: Heapsort (details)
• Pick smallest, write to output buffer. Heap, level by level:
    10
    14 11
    15 17 18 16
  10 is written out.
• Get next key (22); put at top and 'sink' it:
    22                 11                 11
    14 11       ->     14 22       ->     14 16
    15 17 18 16        15 17 18 16        15 17 18 22
• When done, pick top (= smallest) and output it, if 'legal' (i.e., >= 10 in our example).
• This way, we can keep on reading new key values (beyond the B ones of quicksort).

Blocked I/O & double-buffering
• So far, we assumed random disk access.
• Cost changes, if we consider that runs are written (and read) sequentially.
• What could we do to exploit it?
• A1: Blocked I/O (exchange a few random disk accesses (r.d.a.) for several sequential ones)
• A2: double-buffering

Double Buffering
• To reduce wait time for I/O request to complete, can prefetch into 'shadow block'.
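The heap-based run generation described in the internal sort slides (replacement selection) can be sketched as follows. This is a hedged illustration rather than the lecture's implementation: it works on single integer keys instead of buffer pages, and the names `replacement_selection`, `capacity`, and `frozen` are assumptions made for the example. A new key joins the current run only if it is 'legal', meaning it is not smaller than the key just written; otherwise it is held back for the next run.

```python
import heapq
from typing import Iterable, List

_DONE = object()  # sentinel marking the end of the input


def replacement_selection(records: Iterable[int], capacity: int) -> List[List[int]]:
    """Split `records` into sorted runs while holding at most `capacity` keys in memory."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    source = iter(records)

    # Fill memory with the first `capacity` keys and build a min-heap.
    heap: List[int] = []
    for rec in source:
        heap.append(rec)
        if len(heap) == capacity:
            break
    heapq.heapify(heap)

    runs: List[List[int]] = []
    current: List[int] = []   # the run being written
    frozen: List[int] = []    # keys too small for the current run
    while heap:
        smallest = heapq.heappop(heap)   # pick smallest, write to output
        current.append(smallest)
        nxt = next(source, _DONE)        # get next key
        if nxt is not _DONE:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)  # 'legal': can still join this run
            else:
                frozen.append(nxt)         # must wait for the next run
        if not heap:
            # Current run is finished; the held-back keys seed the next one.
            runs.append(current)
            current = []
            heap, frozen = frozen, []
            heapq.heapify(heap)
    return runs


if __name__ == "__main__":
    # First eight keys follow the heapsort slides; 12 and 5 are made-up extras.
    print(replacement_selection([10, 14, 11, 15, 17, 18, 16, 22, 12, 5], 7))
```

With room for 7 keys, the first run in this example holds 9 keys, longer than what sorting one memory-load would give. On input that is already in descending order the runs shrink back to exactly `capacity` keys each, which is the worst case.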
