Lecture 11: External Sorting Lecture 11 Announcements

Lecture 11 Lecture 11: External Sorting Lecture 11 Announcements 1. Midterm Review: This Friday! 2. Project Part #2 is out. Implement CLOCK! 3. Midterm Material: Everything up to Buffer management. 1. Today’s lecture is no fair game. 2. Do not forget to go over the activities as well! 4. Our usual Wednesday update…. 2 Lecture 11 Lecture 11: External Sorting Lecture 11 What you will learn about in this section 1. External Merge (of sorted files) 2. External Merge - Sort 4 Lecture 11 1. External Merge 5 Lecture 11 > Section 1 > External Merge Challenge: Merging Big Files with Small Memory How do we efficiently merge two sorted files when both are much larger than our main memory buffer? Lecture 11 > Section 1 > External Merge External Merge Algorithm • Input: 2 sorted lists of length M and N • Output: 1 sorted list of length M + N • Required: At least 3 Buffer Pages • IOs: 2(M+N) Lecture 11 > Section 1 > External Merge Key (Simple) Idea To find an element that is no larger than all elements in two lists, one only needs to compare minimum elements from each list. If: �" ≤ �$ ≤ ⋯ ≤ �& �" ≤ �$ ≤ ⋯ ≤ �( Then: ��(�", �") ≤ �/ ��(�", �") ≤ �0 for i=1….N and j=1….M Lecture 11 > Section 1 > External Merge External Merge Algorithm Main Memory Buffer Input: F1 1,5 7,11 20,31 Two sorted files F2 2,22 23,24 25,30 Output: One merged sorted file Disk Lecture 11 > Section 1 > External Merge External Merge Algorithm Main Memory Buffer Input: F1 7,11 20,31 Two sorted 1,5 2,22 files F2 23,24 25,30 Output: One merged sorted file Disk Lecture 11 > Section 1 > External Merge External Merge Algorithm Main Memory Buffer Input: F1 7,11 20,31 Two sorted 5 22 1,2 files F2 23,24 25,30 Output: One merged sorted file Disk Lecture 11 > Section 1 > External Merge External Merge Algorithm Main Memory Buffer Input: F1 7,11 20,31 Two sorted 5 22 files F2 23,24 25,30 Output: 1,2 One merged sorted file Disk Lecture 11 > Section 1 > External Merge External Merge Algorithm Main Memory Buffer Input: F1 7,11 20,31 Two sorted 22 5 files F2 23,24 25,30 Output: 1,2 One merged sorted file This is all the algorithm “sees”… Which file to load a page from next? Disk Lecture 11 > Section 1 > External Merge External Merge Algorithm Main Memory Buffer Input: F1 7,11 20,31 Two sorted 22 5 files F2 23,24 25,30 Output: 1,2 One merged sorted file We know that F2 only contains values ≥ 22… so we should load from F1! Disk Lecture 11 > Section 1 > External Merge External Merge Algorithm Main Memory Buffer Input: F1 20,31 Two sorted 7,11 22 5 files F2 23,24 25,30 Output: 1,2 One merged sorted file Disk Lecture 11 > Section 1 > External Merge External Merge Algorithm Main Memory Buffer Input: F1 20,31 Two sorted 11 22 5,7 files F2 23,24 25,30 Output: 1,2 One merged sorted file Disk Lecture 11 > Section 1 > External Merge External Merge Algorithm Main Memory Buffer Input: F1 20,31 Two sorted 11 22 files F2 23,24 25,30 Output: 1,2 5,7 One merged sorted file Disk Lecture 11 > Section 1 > External Merge External Merge Algorithm Main Memory Buffer Input: F1 20,31 Two sorted 22 11 files F2 23,24 25,30 Output: 1,2 5,7 One merged sorted file And so on… Disk Lecture 11 > Section 1 > External Merge We can merge 2 lists of arbitrary length with only 3 buffer pages. If lists of size M and N, then Cost: 2(M+N) IOs Each page is read once, written once With B+1 buffer pages, can merge B lists. How? Lecture 11 > Section 2 2. External Merge Sort 20 Lecture 11 > Section 2 What you will learn about in this section 1. External merge sort (2-way sort) 2. External merge sort on larger files 3. OptimiZations for sorting 21 Lecture 11 > Section 2 > External Merge Sort External Merge Algorithm • Suppose we want to merge two sorted files both much larger than main memory (i.e. the buffer) • We can use the external merge algorithm to merge files of arbitrary length in 2*(N+M) IO operations with only 3 buffer pages! Our first example of an “IO aware” algorithm / cost model Lecture 11 > Section 2 > External Merge Sort Why are Sort Algorithms Important? • Data requested from DB in sorted order is extremely common • e.g., find students in increasing GPA order • Why not just use quicksort in main memory?? • What about if we need to sort 1TB of data with 1GB of RAM… A classic problem in computer science! Lecture 11 > Section 2 > External Merge Sort More reasons to sort… • Sorting useful for eliminating duplicate copies in a collection of records (Why?) • Sorting is first step in bulk loading B+ tree Coming up… index. Coming up… • Sort-merge join algorithm inVolVes sorting Lecture 11 > Section 2 > External Merge Sort Do people care? http://sortbenchmark.org Sort benchmark bears his name Lecture 11 > Section 2 > External Merge Sort External Merge Sort Lecture 11 > Section 2 > External Merge Sort So how do we sort big files? 1. Split into chunks small enough to sort in memory (“runs”) 2. Merge pairs (or groups) of runs using the external merge algorithm 3. Keep merging the resulting runs (each time = a “pass”) until left with one sorted file! Lecture 11 > Section 2 > External Merge Sort External Merge Sort Algorithm (2-way sort) Example: Disk Main Memory • 3 Buffer pages • 6-page file Buffer 44,10 33,12 55,31 F Orange file 18,22 27,24 3,1 = unsorted 1. Split into chunks small enough to sort in memory Lecture 11 > Section 2 > External Merge Sort External Merge Sort Algorithm (2-way sort) Example: Disk Main Memory • 3 Buffer pages • 6-page file Buffer F1 44,10 33,12 55,31 Orange file F2 18,22 27,24 3,1 = unsorted 1. Split into chunks small enough to sort in memory Lecture 11 > Section 2 > External Merge Sort External Merge Sort Algorithm (2-way sort) Example: Disk Main Memory • 3 Buffer pages • 6-page file Buffer F1 44,10 33,12 55,31 Orange file F2 18,22 27,24 3,1 = unsorted 1. Split into chunks small enough to sort in memory Lecture 11 > Section 2 > External Merge Sort External Merge Sort Algorithm (2-way sort) Example: Disk Main Memory • 3 Buffer pages • 6-page file Buffer F1 10,12 31,33 44,55 Orange file F2 18,22 27,24 3,1 = unsorted 1. Split into chunks small enough to sort in memory Lecture 11 > Section 2 > External Merge Sort External Merge Sort Algorithm (2-way sort) Example: Disk Main Memory • 3 Buffer pages • 6-page file Buffer F1 10,12 31,33 44,55 Each sorted 1,3 18,22 24,27 file is a F2 18,22 27,24 3,1 called a run And similarly for F2 1. Split into chunks small enough to sort in memory Lecture 11 > Section 2 > External Merge Sort External Merge Sort Algorithm (2-way sort) Example: Disk Main Memory • 3 Buffer pages • 6-page file Buffer F1 10,12 31,33 44,55 F2 1,3 18,22 24,27 2. Now just run the external merge algorithm & we’re done! Lecture 11 > Section 2 > External Merge Sort Calculating IO Cost For 3 buffer pages, 6 page file: 1. Split into two 3-page files and sort in memory 1. = 1 R + 1 W for each file = 2*(3 + 3) = 12 IO operations 2. Merge each pair of sorted chunks using the external merge algorithm 1. = 2*(3 + 3) = 12 IO operations 3. Total cost = 24 IO Lecture 11 > Section 2 > External Merge Sort: Larger files Running External Merge Sort on Larger Files Assume we still Disk only have 3 buffer pages (Buffer not 10,12 31,33 44,55 pictured) 45,38 18,43 24,27 10,12 31,33 47,55 41,3 18,22 23,20 42,46 31,33 39,55 1,3 18,23 24,27 10,12 48,33 44,40 16,31 18,22 24,27 Lecture 11 > Section 2 > External Merge Sort: Larger files Running External Merge Sort on Larger Files Assume we still Disk only have 3 buffer 1. Split into files small enough to pages (Buffer not 10,12 31,33 44,55 sort in buffer… pictured) 45,38 18,43 24,27 10,12 31,33 47,55 41,3 18,22 23,20 42,46 31,33 39,55 1,3 18,23 24,27 10,12 48,33 44,40 16,31 18,22 24,27 Lecture 11 > Section 2 > External Merge Sort: Larger files Running External Merge Sort on Larger Files Assume we still Disk only have 3 buffer 1. Split into files small enough to pages (Buffer not 10,12 31,33 44,55 sort in buffer… and sort pictured) 18,24 27,38 43,45 10,12 31,33 47,55 3,18 20,22 23,41 31,33 39,42 46,55 1,3 18,23 24,27 10,12 33,40 44,48 Call each of these 16,18 22,24 27,31 sorted files a run Lecture 11 > Section 2 > External Merge Sort: Larger files Running External Merge Sort on Larger Files Assume we still Disk Disk only have 3 buffer pages (Buffer not 10,12 31,33 44,55 10,12 18,24 27,31 pictured) 18,24 27,38 43,45 33,38 43,44 45,55 10,12 31,33 47,55 3,10 12,18 20,22 2.

Lecture 11: External Sorting Lecture 11 Announcements

The Influence of Caches on the Performance of Sorting

17. Chapter 11

External Sorting Why Sort? Sorting a File in RAM 2-Way Sort of N Pages

Efficient External Sorting on Flash Memory Embedded Devices

External Sorting

Algorithms and Data Structures for External Memory Algorithms and Data Structures 2:4 for External Memory Jeffrey Scott Vitter

14.1 Sorting

External Sorting and Query Evaluation (R&G Ch

8 File Processing and External Sorting

External Sort in Data Structure with Example

Leyenda: an Adaptive, Hybrid Sorting Algorithm for Large Scale Data with Limited Memory

Cache-Oblivious Sorting (1999; Frigo, Leiserson, Prokop, Ramachandran)