Overview

Midterm Review, Spring 2003

• Sorting
• Hashing
• Selections
• Joins

Two-Way External Merge Sort

• Each pass we read + write each page in the file.
• N pages in the file => the number of passes = ⌈log2 N⌉ + 1
• So total cost is: 2N(⌈log2 N⌉ + 1)
• Idea: divide and conquer: sort subfiles and merge

[Figure: the example run tree. Input file 3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2. Pass 0 produces 1-page runs (3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2); pass 1 merges them into 2-page runs; pass 2 into 4-page runs; pass 3 yields one sorted 8-page run.]

General External Merge Sort

• More than 3 buffer pages. How can we utilize them?
• To sort a file with N pages using B buffer pages:
  – Pass 0: use B buffer pages. Produce ⌈N/B⌉ sorted runs of B pages each.
  – Pass 1, 2, …, etc.: merge B-1 runs.

[Figure: B main memory buffers between input disk and output disk; during a merge, B-1 input buffers feed one output buffer.]
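To make the two passes concrete, here is a minimal in-memory sketch of the general algorithm (the function name and the list-of-pages representation are mine, not from the slides): pass 0 sorts B pages at a time, and each later pass does (B-1)-way merges.

```python
import heapq

def external_merge_sort(pages, B):
    """Sketch: 'pages' is a list of pages, each page a list of keys."""
    # Pass 0: read B pages at a time, sort them -> ceil(N/B) sorted runs.
    runs = []
    for i in range(0, len(pages), B):
        runs.append(sorted(x for page in pages[i:i + B] for x in page))
    # Passes 1, 2, ...: merge groups of B-1 runs
    # (B-1 input buffers + 1 output buffer).
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + B - 1]))
                for i in range(0, len(runs), B - 1)]
    return runs[0] if runs else []

# The 7-page file from the figure, sorted with B = 3 buffers:
pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
print(external_merge_sort(pages, 3))  # [1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9]
```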

Cost of External Merge Sort

• Number of passes: 1 + ⌈log_{B-1} ⌈N/B⌉⌉
• Cost = 2N * (# of passes)
• E.g., with 5 buffer pages, to sort a 108-page file:
  – Pass 0: ⌈108/5⌉ = 22 sorted runs of 5 pages each (last run is only 3 pages)
  – Now, do four-way (B-1) merges
  – Pass 1: ⌈22/4⌉ = 6 sorted runs of 20 pages each (last run is only 8 pages)
  – Pass 2: 2 sorted runs, 80 pages and 28 pages
  – Pass 3: sorted file of 108 pages

Sorting warnings

• Be able to run the general external merge sort!
  – Careful use of buffers in pass 0 vs. pass i, i > 0.
  – Draw pictures of runs like the "tree" in the slides for 2-way external merge sort (will look slightly different!)
• Be able to compute # of passes correctly for a file of N blocks, B buffers!
  – Watch the number of buffers available in pass 0
  – tournament sort (heapsort) vs. quicksort for building the initial runs
  – Be able to count I/Os carefully!
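Since the review asks you to compute pass counts for arbitrary N and B, here is a small sanity-check helper (a sketch; the function names are mine):

```python
from math import ceil

def num_passes(N, B):
    runs = ceil(N / B)              # runs produced by pass 0
    passes = 1
    while runs > 1:                 # each merge pass combines B-1 runs
        runs = ceil(runs / (B - 1))
        passes += 1
    return passes

def sort_io_cost(N, B):
    return 2 * N * num_passes(N, B)  # read + write every page, every pass

# The slide's example: 108 pages, 5 buffers -> 22, 6, 2, 1 runs = 4 passes.
assert num_passes(108, 5) == 4
print(sort_io_cost(108, 5))          # 2 * 108 * 4 = 864 I/Os
```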

More tips

• How to sort any file using 3 memory pages
• How to sort in as few passes given some amount of memory
• I have a file of N blocks and B buffers
  – How big can N be to sort in 2 phases?
  – Need B-1 >= N/B, so N <= B(B-1), i.e. N <= B^2 .. approx of course

Query Processing Overview

• The query optimizer translates SQL to a special internal "language"
  – Query Plans
• The query executor is an interpreter for query plans
• Think of query plans as "box-and-arrow" dataflow diagrams
  – Each box implements a relational operator
  – Edges represent a flow of tuples (columns as specified)
  – For single-table queries, these diagrams are straight-line graphs

[Figure: the optimizer compiles "SELECT DISTINCT name, gpa FROM Students" into the plan HeapScan -> Sort -> Distinct, each edge carrying (name, gpa) tuples.]
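As a toy illustration of such a straight-line plan, the three boxes above can be mimicked with composed Python iterators (the operator names and row format here are illustrative; real executors use an open/next/close iterator interface):

```python
def heap_scan(table):          # leaf operator: stream stored tuples
    yield from table

def sort_op(child):            # blocking operator: consume all, emit sorted
    yield from sorted(child)

def distinct_op(child):        # relies on sorted input: drop adjacent dups
    prev = object()            # sentinel unequal to any row
    for row in child:
        if row != prev:
            yield row
        prev = row

students = [("Bo", 3.2), ("Ann", 3.9), ("Bo", 3.2)]
plan = distinct_op(sort_op(heap_scan(students)))  # HeapScan -> Sort -> Distinct
print(list(plan))              # [('Ann', 3.9), ('Bo', 3.2)]
```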

Sort GROUP BY: Naïve Solution

[Plan: Sort -> Aggregate]

• The Sort iterator (could be external sorting, as explained last week) naturally permutes its input so that all tuples are output in sequence
• The Aggregate iterator keeps running info ("transition values") on agg functions in the SELECT list, per group
  – E.g., for COUNT, it keeps count-so-far
  – For SUM, it keeps sum-so-far
  – For AVERAGE, it keeps sum-so-far and count-so-far
• As soon as the Aggregate iterator sees a tuple from a new group (a minimal sketch of this logic appears after the next slide):
  1. It produces an output for the old group based on the agg function. E.g., for AVERAGE it returns (sum-so-far / count-so-far)
  2. It resets its running info.
  3. It updates the running info with the new tuple's info.

An Alternative to Sorting: Hashing!

• Idea:
  – Many of the things we use sort for don't exploit the order of the sorted data
  – E.g.: forming groups in GROUP BY
  – E.g.: removing duplicates in DISTINCT
• Often good enough to match all tuples with equal field-values
• Hashing does this!
  – And may be cheaper than sorting! (Hmmm…!)
  – But how to do it for data sets bigger than memory??
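The promised sketch of the Aggregate iterator's transition values for AVERAGE, assuming rows arrive as (group, value) pairs already sorted by group (the function name and row format are mine):

```python
def sort_group_average(sorted_rows):
    """Input: (group_key, value) pairs already sorted by group_key."""
    cur, total, count = None, 0, 0
    for key, val in sorted_rows:
        if key != cur and count > 0:
            yield cur, total / count     # 1. output the old group
            total, count = 0, 0          # 2. reset running info
        cur = key
        total += val                     # 3. fold in the new tuple's info
        count += 1
    if count > 0:
        yield cur, total / count         # flush the final group

rows = sorted([("a", 4), ("b", 2), ("a", 6), ("b", 8)])
print(list(sort_group_average(rows)))    # [('a', 5.0), ('b', 5.0)]
```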

General Idea

• Two phases:
  – Partition: use a hash function hp to split tuples into partitions on disk.
    • We know that all matches live in the same partition.
    • Partitions are "spilled" to disk via output buffers.
  – ReHash: for each partition on disk, read it into memory and build a main-memory hash table based on a second hash function hr.
    • Then go through each bucket of this hash table to bring together matching tuples.

[Figure, Partition phase: the original relation streams from disk through one input buffer; hp routes each tuple to one of B-1 output buffers, producing B-1 partitions on disk (B main memory buffers in all).]

[Figure, ReHash phase: each partition Ri (at most B pages) is read back and hashed by hr into an in-memory hash table, whose buckets are then scanned to produce the result.]
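A sketch of both phases applied to duplicate elimination, assuming each partition's distinct values fit in memory (the Python dict stands in for the hr hash table; all names are mine):

```python
def hash_distinct(tuples, B):
    # Partition phase: hp routes each tuple to one of B-1 spill partitions.
    partitions = [[] for _ in range(B - 1)]
    for t in tuples:
        partitions[hash(t) % (B - 1)].append(t)   # hp; "spill" to disk
    # ReHash phase: per partition, build an in-memory table; all matches
    # land in the same bucket, so duplicates collapse.
    result = []
    for part in partitions:
        table = {}                                # stands in for the hr table
        for t in part:
            table[t] = True
        result.extend(table)
    return result

print(sorted(hash_distinct([1, 5, 1, 9, 5, 3], B=3)))   # [1, 3, 5, 9]
```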

Hash GROUP BY: Naïve Solution (similar to the Sort GROUP BY)

[Plan: Hash -> Aggregate]

• The Hash iterator permutes its input so that all tuples in the same group are output one after another (though not in sorted order)
• The Aggregate iterator keeps running info ("transition values") on agg functions in the SELECT list, per group
  – E.g., for COUNT, it keeps count-so-far
  – For SUM, it keeps sum-so-far
  – For AVERAGE, it keeps sum-so-far and count-so-far
• When the Aggregate iterator sees a tuple from a new group:
  1. It produces an output for the old group based on the agg function. E.g., for AVERAGE it returns (sum-so-far / count-so-far)
  2. It resets its running info.
  3. It updates the running info with the new tuple's info.

We Can Do Better!

[Plan: a single HashAgg operator]

• Combine the summarization into the hashing process
  – During the ReHash phase, don't store tuples; store pairs of the form <GroupVals, TransVals>
  – When we want to insert a new tuple into the hash table:
    • If we find a matching GroupVals, just update the TransVals appropriately
    • Else insert a new <GroupVals, TransVals> pair
• What's the benefit?
  – Q: How many pairs will we have to handle?
  – A: The number of distinct values of the GroupVals columns
    • Not the number of tuples!!
  – The pairs are also probably "narrower" than the tuples
• Can we play the same trick during sorting?
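The improved HashAgg in miniature, with AVERAGE transition values (a sketch; the names are mine):

```python
def hash_group_average(rows):
    table = {}                              # GroupVals -> (sum, count)
    for key, val in rows:
        s, c = table.get(key, (0, 0))       # matching GroupVals? update it
        table[key] = (s + val, c + 1)       # else a new pair is inserted
    return {k: s / c for k, (s, c) in table.items()}

# Table size tracks the number of distinct groups, not the number of tuples.
print(hash_group_average([("a", 4), ("b", 2), ("a", 6)]))  # {'a': 5.0, 'b': 2.0}
```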

Hashing for Grouped Aggregation

• How big can a partition be?
  – As big as can fit into the hash table during rehash
  – For grouped aggs, we have one entry per group!
  – So the key is: the number of unique groups!
  – A partition's size is only limited by the number of unique groups in the partition
• Similar analysis holds for duplicate elimination
  – Note: can think of dup-elim as a grouped agg
  – All tuples that contribute to the agg are identical
  – So any tuple of a "group" is a "representative"

Analysis

• How big of a table can we process?
  – B-1 "spill partitions" in Phase 1
  – Each is limited by the number of unique tuples per partition that can be accommodated in the hash table (U_H)
• Have a bigger table? Recursive partitioning!
  – In the ReHash phase, if a partition b has more unique tuples than U_H, then recurse:
    • pretend that b is a table we need to hash, run the Partitioning phase on b, and then the ReHash phase on each of its (sub)partitions
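A sketch of that recursion, assuming U_H is the number of groups one in-memory hash table can hold and using a fixed fanout (both parameters are illustrative):

```python
def rehash(partition, U_H, depth=0, fanout=4):
    groups = set(partition)          # one hash-table entry per unique group
    if len(groups) <= U_H:
        return groups                # fits in memory: done
    # Too many unique tuples: pretend this partition is a table and re-run
    # the Partitioning phase on it with a fresh hash function.
    subparts = [[] for _ in range(fanout)]
    for t in partition:
        subparts[hash((depth, t)) % fanout].append(t)  # salt hash per level
    out = set()
    for sp in subparts:
        out |= rehash(sp, U_H, depth + 1, fanout)
    return out

print(sorted(rehash(list(range(20)) * 3, U_H=8)))  # 20 unique values recovered
```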

Even Better: Hybrid Hashing

• What if the set of <GroupVals, TransVals> pairs fits in memory?
  – It would be a waste to spill it to disk and read it all back!
  – Recall this could be true even if there are tons of tuples!
• Idea: keep a smaller 1st partition in memory during phase 1!
  – Output its stuff at the end of Phase 1.
  – Q: how do we choose the number k?

[Figure: a k-buffer hash table (hashed by hr) holds partition 1 in memory; the remaining B-k buffers route tuples via hh into B-k spill partitions on disk.]

Analysis: Hybrid Hashing, GroupAgg

• H buffers in all:
  – In Phase 1: P "spill partitions", H-P buffers for the hash table
  – Subsequent phases: H-1 buffers for the hash table
• How big of a table can we process?
  – Each of the P partitions is limited by the number of unique tuples per partition that can be accommodated in the hash table (U_H)
  – Note that U_H depends on the phase!
    • In Phase 1, U_H is based on H-P buffers
    • In subsequent phases, U_H is based on H-1 buffers
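A COUNT-flavored sketch of hybrid hashing: bucket 0 of the routing hash plays the role of the in-memory partition and aggregates immediately, while the other buckets spill (the routing scheme, P, and all names are mine):

```python
def hybrid_hash_count(rows, P=3):
    in_memory = {}                   # "partition 0": <GroupVals, TransVals>
    spill = [[] for _ in range(P)]   # P spill partitions on "disk"
    for key, _ in rows:
        b = hash(key) % (P + 1)
        if b == 0:
            in_memory[key] = in_memory.get(key, 0) + 1  # aggregate in place
        else:
            spill[b - 1].append(key)                    # spill for Phase 2
    yield from in_memory.items()     # emit partition 0 at end of Phase 1
    for part in spill:               # ReHash each spilled partition
        table = {}
        for key in part:
            table[key] = table.get(key, 0) + 1
        yield from table.items()

print(dict(hybrid_hash_count([("a", 1), ("b", 1), ("a", 1), ("c", 1)])))
```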

Simple Selections (cont)

• With no index, unsorted:
  – Must essentially scan the whole relation
  – Cost is M (#pages in R). For "reserves" = 1000 I/Os.
• With no index, sorted:
  – Cost of binary search + number of pages containing results.
  – For reserves = 10 I/Os + ⌈selectivity * #pages⌉
• With an index on the selection attribute:
  – Use the index to find qualifying data entries,
  – then retrieve the corresponding data records.
  – Cost?

Using an Index for Selections

• Cost depends on the number of qualifying tuples, and on clustering.
  – Cost = finding qualifying data entries (typically small) plus the cost of retrieving records (could be large w/o clustering).
  – In the example "reserves" relation, suppose 10% of tuples qualify (100 pages, 10000 tuples).
    • With a clustered index, cost is little more than 100 I/Os.
    • If unclustered, could be up to 10000 I/Os!
      – Unless you get fancy…
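The slide's numbers as a back-of-envelope calculator, assuming reserves has M = 1000 pages and 100 tuples per page (helper names are mine; a fuller model would also charge for traversing the index itself):

```python
from math import ceil, log2

M, TUPLES_PER_PAGE = 1000, 100     # the "reserves" relation

def scan_cost():            return M
def sorted_cost(sel):       return ceil(log2(M)) + ceil(sel * M)
def clustered_cost(sel):    return ceil(sel * M)                  # ~pages touched
def unclustered_cost(sel):  return ceil(sel * M * TUPLES_PER_PAGE)  # worst case: one I/O per match

sel = 0.10
print(scan_cost(), sorted_cost(sel), clustered_cost(sel), unclustered_cost(sel))
# 1000, 110 (10 + 100), 100, 10000
```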

Projection (DupElim)

    SELECT DISTINCT R.sid, R.bid
    FROM Reserves R

• Issue is removing duplicates.
• Basic approach is to use sorting:
  – 1. Scan R, extract only the needed attrs (why do this 1st?)
  – 2. Sort the resulting set
  – 3. Remove adjacent duplicates
  – Cost: Reserves with size ratio 0.25 = 250 pages. With 20 buffer pages we can sort in 2 passes, so 1000 + 250 + 2*2*250 + 250 = 2500 I/Os.
• Can improve by modifying external sort (see chapter 12):
  – Modify Pass 0 of external sort to eliminate unwanted fields.
  – Modify merging passes to eliminate duplicates.
  – Cost for the above case: read 1000 pages, write out 250 pages in runs of 40 pages, merge runs = 1000 + 250 + 250 = 1500 I/Os.

Simple Nested Loops Join

    foreach tuple r in R do
      foreach tuple s in S do
        if ri == sj then add <r, s> to result

• For each tuple in the outer relation R, we scan the entire inner relation S.
• How much does this cost?
  – (pR * M) * N + M = 100*1000*500 + 1000 I/Os.
  – At 10 ms/IO, total: ???
• What if the smaller relation (S) were the outer?
• What assumptions are being made here?
• Q: What is the cost if one relation can fit entirely in memory?
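The pseudocode above, made runnable (a sketch; the join columns i and j are parameters I added):

```python
def simple_nested_loops_join(R, S, i=0, j=0):
    for r in R:                 # outer relation: scanned once
        for s in S:             # inner relation: rescanned per outer tuple
            if r[i] == s[j]:
                yield r + s

R = [(1, "a"), (2, "b")]
S = [(1, "x"), (3, "y")]
print(list(simple_nested_loops_join(R, S)))   # [(1, 'a', 1, 'x')]

# Cost check: (pR * M) * N + M = (100 * 1000) * 500 + 1000 = 50,001,000 I/Os;
# at 10 ms each that is about 500,010 seconds, i.e. roughly 139 hours.
```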

Page-Oriented Nested Loops Join

    foreach page bR in R do
      foreach page bS in S do
        foreach tuple r in bR do
          foreach tuple s in bS do
            if ri == sj then add <r, s> to result

• For each page of R, get each page of S, and write out matching pairs of tuples <r, s>, where r is in the R-page and s is in the S-page. (A runnable sketch follows the midterm question below.)
• What is the cost of this approach?
  – M*N + M = 1000*500 + 1000
  – If the smaller relation (S) is the outer, cost = 500*1000 + 500

Question from midterm fall 1998

• Sorting: trying to sort a file of 250,000 blocks with only 250 buffers available.
  – How many initial runs will be generated with quicksort?
    • N/B = 250,000/250 = 1000
  – How many total I/Os will the sort perform, including the cost of writing out the output?
    • 2N(⌈log_{B-1}⌈N/B⌉⌉ + 1) = 2 * 250,000 * (1 + ⌈log_249 1000⌉) = 2 * 250,000 * 3 = 1,500,000
  – How many runs (on average) with heapsort?
    • Average run size = 2(B-2) = 2(248) = 496 blocks
    • Number of runs = N / (2(B-2)) = 250,000/496 ≈ 504
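A quick check of those answers in code (a sketch reusing the pass-counting idea from the sorting section):

```python
from math import ceil

N, B = 250_000, 250
runs_quicksort = ceil(N / B)        # 1000 initial runs
passes, r = 1, runs_quicksort
while r > 1:                        # each merge pass is (B-1)-way
    r = ceil(r / (B - 1))
    passes += 1
print(passes, 2 * N * passes)       # 3 passes, 1,500,000 I/Os
print(N / (2 * (B - 2)))            # ~504 heapsort runs on average
```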

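And the promised page-oriented nested loops sketch, assuming each relation is a list of pages and each page a list of tuples (names are mine): the inner relation is re-read once per outer page rather than once per outer tuple.

```python
def page_nested_loops_join(R_pages, S_pages, i=0, j=0):
    for bR in R_pages:              # one pass over the outer relation
        for bS in S_pages:          # inner re-read once per outer PAGE
            for r in bR:
                for s in bS:
                    if r[i] == s[j]:
                        yield r + s

R_pages = [[(1, "a"), (2, "b")], [(3, "c")]]
S_pages = [[(1, "x")], [(3, "y")]]
print(list(page_nested_loops_join(R_pages, S_pages)))
# [(1, 'a', 1, 'x'), (3, 'c', 3, 'y')]
# I/O cost: M*N + M = 1000*500 + 1000 = 501,000; with S outer: 500,500.
```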