Overview
Midterm Review (Spring 2003)
• Sorting
• Hashing
• Selections
• Joins
Two-Way External Merge Sort
• Each pass we read + write each page in file.
• N pages in the file => the number of passes = ⌈log2 N⌉ + 1
• So total cost is: 2N(⌈log2 N⌉ + 1)
• Idea: Divide and conquer: sort subfiles and merge
[Figure: input file of pages 3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2; Pass 0 produces 1-page runs, Pass 1 merges them into 2-page runs, Pass 2 into 4-page runs, Pass 3 yields one sorted 8-page run]

General External Merge Sort
• More than 3 buffer pages. How can we utilize them?
• To sort a file with N pages using B buffer pages:
  – Pass 0: use B buffer pages. Produce ⌈N/B⌉ sorted runs of B pages each.
  – Pass 1, 2, …, etc.: merge B-1 runs.
[Figure: main memory holds B-1 input buffers and 1 output buffer; runs stream from disk, are merged, and are written back to disk]
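The pass structure described above can be simulated in memory. This is a sketch, not real disk I/O: a "file" is a list of pages, each page a list of keys, and B is the buffer count. Pass 0 sorts B pages at a time; later passes merge B-1 runs at a time.

```python
import heapq

def external_merge_sort(pages, B):
    """Simulate external merge sort: `pages` is a list of pages (lists of keys),
    B is the number of buffer pages available."""
    page_size = max((len(p) for p in pages), default=1)
    # Pass 0: read B pages at a time, sort them into one run of <= B pages.
    runs = []
    for i in range(0, len(pages), B):
        keys = sorted(k for page in pages[i:i + B] for k in page)
        runs.append([keys[j:j + page_size] for j in range(0, len(keys), page_size)])
    # Passes 1, 2, ...: repeatedly merge groups of B-1 sorted runs.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), B - 1):
            group = runs[i:i + B - 1]
            keys = list(heapq.merge(*[[k for page in run for k in page] for run in group]))
            merged.append([keys[j:j + page_size] for j in range(0, len(keys), page_size)])
        runs = merged
    return runs[0] if runs else []
```

With B = 3 this reproduces the two-way tree of runs from the figure: 7 pages become 3 initial runs, then 2, then 1 sorted file.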
Cost of External Merge Sort
• Number of passes: 1 + ⌈log_(B-1) ⌈N/B⌉⌉
• Cost = 2N * (# of passes)
• E.g., with 5 buffer pages, to sort 108 page file:
  – Pass 0: ⌈108/5⌉ = 22 sorted runs of 5 pages each (last run is only 3 pages)
  – Now, do four-way (B-1) merges
  – Pass 1: ⌈22/4⌉ = 6 sorted runs of 20 pages each (last run is only 8 pages)
  – Pass 2: 2 sorted runs, 80 pages and 28 pages
  – Pass 3: Sorted file of 108 pages

Sorting Warnings
• Be able to run the general external merge sort!
  – Careful use of buffers in pass 0 vs. pass i, i > 0.
  – Draw pictures of runs like the “tree” in the slides for 2-way external merge sort (will look slightly different!)
• Be able to compute # of passes correctly for a file of N blocks, B buffers!
  – Watch the number of buffers available in pass 0
  – Tournament sort (heapsort) vs. quicksort
  – Be able to count I/Os carefully!
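The pass-count and cost formulas above translate directly to code, which can be used to check the 108-page worked example (4 passes, 2 * 108 * 4 = 864 I/Os):

```python
from math import ceil, log

def num_passes(N, B):
    """Pass 0 produces ceil(N/B) runs; each later pass merges B-1 runs."""
    runs = ceil(N / B)
    if runs <= 1:
        return 1
    return 1 + ceil(log(runs, B - 1))

def sort_cost(N, B):
    """Every pass reads and writes each of the N pages: 2N I/Os per pass."""
    return 2 * N * num_passes(N, B)
```

For N = 108, B = 5: 22 initial runs, then ⌈log_4 22⌉ = 3 merge passes, so 4 passes and 864 I/Os, matching the slide.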
More Tips
• How to sort any file using 3 memory pages
• How to sort in as few passes given some amount of memory
• I have a file of N blocks and B buffers. How big can N be to sort in 2 phases?
  – Pass 0 makes ⌈N/B⌉ runs and one merge pass handles at most B-1 of them, so we need B-1 >= N/B
  – So, N <= B(B-1), i.e. N <= B^2 approximately, of course

Query Processing Overview
• The query optimizer translates SQL to a special internal “language”
  – Query Plans
• The query executor is an interpreter for query plans
• Think of query plans as “box-and-arrow” dataflow diagrams
  – Each box implements a relational operator
  – Edges represent a flow of tuples (columns as specified)
  – For single-table queries, these diagrams are straight-line graphs
[Figure: the Optimizer turns “SELECT DISTINCT name, gpa FROM Students” into the plan HeapScan → Sort → Distinct, with each edge carrying (name, gpa)]
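The two-phase bound discussed above is a one-liner worth memorizing; a minimal helper, with the derivation repeated in the comment:

```python
def max_two_pass_file(B):
    """Largest file (in blocks) sortable in two passes with B buffers:
    pass 0 yields ceil(N/B) runs of B pages each, and one merge pass
    handles at most B-1 runs, so N <= B * (B-1) (roughly B^2)."""
    return B * (B - 1)
```

For example, 250 buffers can sort up to 250 * 249 = 62,250 blocks in two passes.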
Sort GROUP BY: Naïve Solution
• The Sort iterator (could be external sorting, as explained last week) naturally permutes its input so that all tuples are output in sequence
• The Aggregate iterator keeps running info (“transition values”) on agg functions in the SELECT list, per group
  – E.g., for COUNT, it keeps count-so-far
  – For SUM, it keeps sum-so-far
  – For AVERAGE, it keeps sum-so-far and count-so-far
• As soon as the Aggregate iterator sees a tuple from a new group:
  1. It produces an output for the old group based on the agg function. E.g., for AVERAGE it returns (sum-so-far / count-so-far)
  2. It resets its running info.
  3. It updates the running info with the new tuple’s info.
[Plan: Scan → Sort → Aggregate]

An Alternative to Sorting: Hashing!
• Idea:
  – Many of the things we use sort for don’t exploit the order of the sorted data
  – E.g.: forming groups in GROUP BY
  – E.g.: removing duplicates in DISTINCT
• Often good enough to match all tuples with equal field-values
• Hashing does this!
  – And may be cheaper than sorting! (Hmmm…!)
  – But how to do it for data sets bigger than memory??
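The Aggregate iterator's behavior can be sketched as a generator over sorted input. This is a simplified model (hypothetical function name, AVERAGE only): it keeps just the (sum-so-far, count-so-far) transition values, and emits a result whenever the group key changes:

```python
def sort_group_avg(sorted_rows):
    """Streaming AVERAGE over (group_key, value) rows already sorted by key.
    State per group is only the transition values: sum-so-far, count-so-far."""
    current, total, count = None, 0, 0
    for key, value in sorted_rows:
        if key != current:
            if count:                              # 1. emit output for the old group
                yield current, total / count
            current, total, count = key, 0, 0      # 2. reset running info
        total += value                             # 3. fold in the new tuple
        count += 1
    if count:                                      # flush the final group
        yield current, total / count
```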
General Idea
• Two phases:
  – Partition: use a hash function hp to split tuples into partitions on disk.
    • We know that all matches live in the same partition.
    • Partitions are “spilled” to disk via output buffers
  – ReHash: for each partition on disk, read it into memory and build a main-memory hash table based on a hash function hr
    • Then go through each bucket of this hash table to bring together matching tuples
[Figure: Partition phase reads the original relation through an input buffer in B main memory buffers and hashes with hp into B-1 output partitions spilled to disk. ReHash phase reads one partition Ri (k <= B pages) at a time, builds an in-memory hash table with hr, and produces the result.]
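The two phases can be sketched for duplicate elimination. This is a toy model (hypothetical names): Python lists stand in for the on-disk spill partitions, and a dict plays the role of the in-memory hr hash table:

```python
def external_dup_elim(tuples, B):
    """Two-phase hashing sketch: partition with h_p into B-1 'disk' partitions,
    then rehash each partition in memory, emitting one copy per distinct value."""
    h_p = lambda t: hash(("partition", t)) % (B - 1)   # partitioning hash h_p
    partitions = [[] for _ in range(B - 1)]
    for t in tuples:                       # Phase 1: all matches land together
        partitions[h_p(t)].append(t)
    out = []
    for part in partitions:                # Phase 2: one partition at a time
        table = {}                         # stands in for the h_r hash table
        for t in part:
            table.setdefault(t, t)         # equal values collide in one bucket
        out.extend(table.values())
    return out
```

Because all matches share a partition, each partition can be processed independently, which is what makes the memory bound per-partition rather than per-table.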
Hash GROUP BY: Naïve Solution (similar to the Sort GROUP BY)
• The Hash iterator permutes its input so that all tuples are output in sequence
• The Aggregate iterator keeps running info (“transition values”) on agg functions in the SELECT list, per group
[Plan: Scan → Hash → Aggregate]

We Can Do Better!
• Combine the summarization into the hashing process
  – During the ReHash phase, don’t store tuples; store pairs of the form ⟨group values, transition values⟩
[Plan: Scan → HashAgg]
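The HashAgg idea can be sketched as follows (hypothetical function, AVERAGE only): during rehash the table maps each group to its transition values (sum, count) instead of holding the tuples themselves, so memory scales with the number of groups, not the number of rows:

```python
def hash_group_avg(rows, B):
    """HashAgg sketch: partition (key, value) rows on the group key, then in
    the rehash phase keep only <group, (sum, count)> pairs, never full tuples."""
    parts = [[] for _ in range(B - 1)]
    for key, value in rows:                     # Phase 1: spill partitions
        parts[hash(key) % (B - 1)].append((key, value))
    result = {}
    for part in parts:                          # Phase 2: rehash + summarize
        trans = {}                              # group -> (sum-so-far, count-so-far)
        for key, value in part:
            s, c = trans.get(key, (0, 0))
            trans[key] = (s + value, c + 1)
        for key, (s, c) in trans.items():
            result[key] = s / c                 # finalize AVERAGE per group
    return result
```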
Hashing for Grouped Aggregation
• How big can a partition be?
  – As big as can fit into the hashtable during rehash
  – For grouped aggs, we have one entry per group!
  – So, the key is: the number of unique groups!
  – A partition’s size is only limited by the number of unique groups in the partition
• Similar analysis holds for duplicate elimination
  – Note: can think of dup-elim as a grouped agg
  – All tuples that contribute to the agg are identical
  – So any tuple of a “group” is a “representative”

Analysis
• How big of a table can we process?
  – B-1 “spill partitions” in Phase 1
  – Each limited by the number of unique tuples per partition that can be accommodated in the hash table (U_H)
• Have a bigger table? Recursive partitioning!
  – In the ReHash phase, if a partition b has more unique tuples than U_H, then recurse:
    • Pretend that b is a table we need to hash, run the Partitioning phase on b, and then the ReHash phase on each of its (sub)partitions
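The recursive-partitioning rule can be sketched as follows for dup-elim (a toy in-memory model: `fanout` stands in for the B-1 sub-partitions, a Python set plays the hash table, and the recursion level salts the hash so each level re-partitions differently):

```python
def rehash_with_recursion(partition, U_H, level=0):
    """Rehash a partition, recursing if it has more unique values than the
    in-memory hash table can hold (U_H). Returns the distinct values."""
    if len(set(partition)) <= U_H:
        return sorted(set(partition))       # fits: build the table directly
    fanout = 4                              # stand-in for B-1 sub-partitions
    subs = [[] for _ in range(fanout)]
    for t in partition:                     # treat the partition as a table:
        subs[hash((level, t)) % fanout].append(t)   # run Partitioning on it
    out = []
    for sub in subs:                        # then ReHash each sub-partition
        out.extend(rehash_with_recursion(sub, U_H, level + 1))
    return out
```

Since sub-partitions are disjoint, each distinct value is emitted exactly once.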
Even Better: Hybrid Hashing
Analysis: Hybrid Hashing, GroupAgg
• What if the set of …
[Figure: hybrid-hashing layout with B main memory buffers between disk and disk]
Simple Selections (cont)
• With no index, unsorted:
  – Must essentially scan the whole relation
  – Cost is M (# pages in R). For “reserves” = 1000 I/Os.
• With no index, sorted:
  – Cost of binary search + number of pages containing results.
  – For reserves = 10 I/Os + ⌈selectivity * #pages⌉
• With an index on selection attribute:
  – Use index to find qualifying data entries,
  – then retrieve corresponding data records.
  – Cost?

Using an Index for Selections
• Cost depends on # qualifying tuples, and clustering.
  – Finding qualifying data entries (typically small)
  – plus cost of retrieving records (could be large w/o clustering).
• In example “reserves” relation, if 10% of tuples qualify (100 pages, 10000 tuples):
  – With a clustered index, cost is little more than 100 I/Os
  – If unclustered, could be up to 10000 I/Os!
  – Unless you get fancy…
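The cost comparison above can be packaged as a rough estimator (a sketch with hypothetical names; it ignores the few I/Os of the index lookup itself and assumes worst-case one I/O per tuple for the unclustered case):

```python
from math import ceil, log2

def selection_cost(M, sel, tuples_per_page, clustered):
    """Rough I/O estimates for a selection on a relation of M pages where a
    fraction `sel` of tuples qualify. Returns (full scan, sorted-file
    binary search, index-based retrieval)."""
    full_scan = M                                        # read every page
    binary_search = ceil(log2(M)) + ceil(sel * M)        # locate + read results
    qualifying_tuples = int(sel * M * tuples_per_page)
    # Clustered: matches are contiguous pages. Unclustered: up to 1 I/O per tuple.
    index = ceil(sel * M) if clustered else qualifying_tuples
    return full_scan, binary_search, index
```

Plugging in the reserves numbers (M = 1000 pages, 10% selectivity, 100 tuples/page) reproduces the slide's 1000, 10 + 100, ~100 (clustered), and 10000 (unclustered) figures.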
Projection (DupElim)
SELECT DISTINCT R.sid, R.bid FROM Reserves R
• Issue is removing duplicates.
• Basic approach is to use sorting
  – 1. Scan R, extract only the needed attrs (why do this 1st?)

Simple Nested Loops Join
foreach tuple r in R do
  foreach tuple s in S do
    if ri == sj then add <r, s> to result
Page-Oriented Nested Loops Join
foreach page bR in R do
  foreach page bS in S do
    foreach tuple r in bR do
      foreach tuple s in bS do
        if ri == sj then add <r, s> to result
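The pseudocode above can be made concrete while counting page I/Os. This sketch (hypothetical function name) reads each page of R once and all of S once per R page, so for M pages of R and N pages of S it performs M + M*N page reads:

```python
def page_nested_loops_join(R_pages, S_pages, pred):
    """Page-oriented nested loops join over lists of pages (lists of tuples).
    Returns (joined pairs, page I/O count = M + M*N)."""
    result, io = [], 0
    for bR in R_pages:
        io += 1                        # read one page of the outer relation R
        for bS in S_pages:
            io += 1                    # read one page of the inner relation S
            for r in bR:
                for s in bS:
                    if pred(r, s):
                        result.append((r, s))
    return result, io
```

Compare with tuple-at-a-time simple NLJ, which re-scans S once per *tuple* of R rather than once per *page*, and is therefore far more expensive in I/Os.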
Question from midterm fall 1998
• Sorting: Trying to sort a file of 250,000 blocks with only 250 buffers available.
  – How many initial runs will be generated with quicksort?
  – How many total I/Os will the sort perform, including the cost of writing out the output?
  – How many runs (on average) with heapsort?
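One way to check your answers, using the formulas from the sorting slides (a sketch of the arithmetic, not an official answer key; the 2B average run length is the standard tournament-sort result):

```python
from math import ceil, log

N, B = 250_000, 250

# Quicksort in pass 0: sorts B blocks at a time, so runs of B blocks each.
quicksort_runs = ceil(N / B)                    # 1000 initial runs

# Merge passes with fan-in B-1 on those runs, plus pass 0 itself.
passes = 1 + ceil(log(quicksort_runs, B - 1))   # 3 passes total
total_io = 2 * N * passes                       # read + write every block per pass

# Heapsort (tournament sort) produces runs of about 2B blocks on average.
heapsort_runs = N / (2 * B)                     # ~500 runs
```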