Chapter 7. Sorting
Total Page:16
File Type:pdf, Size:1020Kb
CSci 335 Software Design and Analysis 3 Prof. Stewart Weiss Chapter 7 Sorting Sorting 1 Introduction Insertion sort is the sorting algorithm that splits an array into a sorted and an unsorted region, and repeatedly picks the lowest index element of the unsorted region and inserts it into the proper position in the sorted region, as shown in Figure 1. x sorted region unsorted region x sorted region unsorted region Figure 1: Insertion sort modeled as an array with a sorted and unsorted region. Each iteration moves the lowest-index value in the unsorted region into its position in the sorted region, which is initially of size 1. The process starts at the second position and stops when the rightmost element has been inserted, thereby forcing the size of the unsorted region to zero. Insertion sort belongs to a class of sorting algorithms that sort by comparing keys to adjacent keys and swapping the items until they end up in the correct position. Another sort like this is bubble sort. Both of these sorts move items very slowly through the array, forcing them to move one position at a time until they reach their nal destination. It stands to reason that the number of data moves would be excessive for the work accomplished. After all, there must be smarter ways to put elements into the correct position. The insertion sort algorithm is below. for ( int i = 1; i < a.size(); i++) { tmp = a[i]; for ( j = i; j >= 1 && tmp < a[j-1]; j = j-1) a[j] = a[j-1]; a[j] = tmp; } You can verify that it does what I described. The step of assigning to tmp and then copying the value into its nal position is a form of ecient swap. An ordinary swap takes three data moves; this reduces the swap to just one per item compared, plus the moves at the beginning and end of the loop. The worst case number of comparisons and data moves is O(N 2) for an array of N elements. The best case is Ω(N) data moves and Ω(N) comparisons. We can prove that the average number of comparisons and data moves is Ω(N 2). This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. 1 CSci 335 Software Design and Analysis 3 Prof. Stewart Weiss Chapter 7 Sorting 2 A Lower Bound for Simple Sorting Algorithms An inversion in an array is an ordered pair (i; j) such that i < j but a[i] > a[j]. An exchange of adjacent elements removes one inversion from an array. Insertion sort needs to remove all inversions by swapping, so if there are m inversions, m swaps will be necessary. Theorem 1. The average number of inversions in an array of n distinct elements is n(n-1)/4. Proof. Let L be any list and LR be its reverse. Pick any pair of elements (i; j) 2 L with i < j. There are n(n − 1)=2 ways to pick such pairs. This pair is an inversion in exactly one of L or LR. This implies that L and LR combined have exactly n(n − 1)=2 inversions. The set of all n! permutations of n distinct elements can be divided into two disjoint sets containing lists and their reverses. In other words, half of all of these lists are the reverses of the others. This implies that the average number of inversions per list over the entire set of permutations is n(n − 1)=4. Theorem 2. Any algorithm that sorts by exchanging adjacent elements requires Ω(n2) time on average. Proof. The average number of inversions is initially n(n − 1)=4. Each swap reduces the number of inversions by 1, and an array is sorted if and only if it has 0 inversions, so n(n − 1)=4 swaps are required. 3 Shell Sort Shell sort was invented by Donald Shell. It is like a sequence of insertion sorts carried out over varying distances in the array and has the advantage that in the early passes, it moves data items close to where they belong by swapping distant elements with each other. Consider the original insertion sort modied so that the gap between adjacent elements can be a number besides 1: Listing 1: A Generalized Insertion Sort Algorithm int gap = 1; for ( int i = gap; i < a.size(); i++) { tmp = a[i]; for ( j = i; j >= gap && tmp < a[j-gap]; j = j-gap) a[j] = a[j-gap]; a[j] = tmp ; } Now suppose we let the variable gap start with a large value and get smaller with successive passes. Each pass is a modied form of insertion sort on each of a set of interleaved sequences in the array. When the gap is h, it is h insertion sorts on each of the h sequences 0; h; 2h; 3h; : : : ; k0h; 1; 1 + h; 1 + 2h; 1 + 3h; : : : ; 1 + k1h 2; 2 + h; 2 + 2h; 2 + 3h; : : : ; 2 + k2h ::: h − 1; h − 1 + h; h − 1 + 2h; : : : ; h − 1 + kh−1h This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. 2 CSci 335 Software Design and Analysis 3 Prof. Stewart Weiss Chapter 7 Sorting where ki is the largest number such that i + kih < n. For example, if the array size is 15, and the gap is h = 5, then each of the following subsequences of indices in the array, which we will call slices, will be insertion-sorted independently: slice 0: 0 5 10 slice 1: 1 6 11 slice 2: 2 7 12 slice 3: 3 8 13 slice 4: 4 9 14 For each xed value of the gap, h, the sort starts at array element a[h] and inserts it in the lower sorted region of the slice, then it picks a[h + 1] and inserts it in the sorted region of its slice, and then a[h + 2] is inserted, and so on, until a[n − 1] is inserted into its slice's sorted region. For each element a[i], the sorted region of the slice is the set of array elements at indices i; i − h; i − 2h; : : : and so on. When the gap is h, we say that the array has been h-sorted. It is obviously not sorted, because the dierent slices can have dierent values, but each slice is sorted. Of course when the gap h = 1 the array is fully sorted. The following table shows an array of 15 elements before and after a 5-sort of the array. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Original Sequence: 81 94 11 96 12 35 17 95 28 58 41 75 15 65 7 After 5-sort: 35 17 11 28 7 41 75 15 65 12 81 94 95 96 58 The intuition is that the large initial gap sorts move items much closer to where they have to be, and then the successively smaller gaps move them closer to their nal positions. In Shell's original algorithm the sequence of gaps was n=2, n=4, n=8,:::; 1. This proved to be a poor choice of gaps because the worst case running time was no better than ordinary insertion sort, as is proved below. Listing 2: Shell's Original Algorithm for ( int gap = a.size()/2; gap > 0; gap /=2) for ( int i = gap; i < a.size(); i++) { tmp = a[i]; for ( j = i; j >= gap && tmp < a[j-gap]; j = j-gap) a[j] = a[j-gap]; a[j] = tmp ; } 3.1 Analysis of Shell Sort Using Shell's Original Increments 2 Lemma 3. The running time of Shell Sort when the increment is hk is O(n =hk). Proof. When the increment is hk, there are hk insertion sorts of n=hk keys. An insertion sort of 2 m elements requires in the worst case O(m ) steps. Therefore, when the increment is hk the total number of steps is n2 n2 hk · O 2 = O hk hk This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. 3 CSci 335 Software Design and Analysis 3 Prof. Stewart Weiss Chapter 7 Sorting Theorem 4. The worst case for Shell Sort, using Shell's original increment sequence, is Θ(n2). Proof. The proof has two parts, one that the running time has a lower bound that is asymptotic to n2, and one that it has an upper bound of n2. Proof of Lower Bound. Consider any array of size n = 2m, where m may be any positive integer. There are innitely many of these, so this will establish asymptotic behavior. Let the n=2 largest numbers be in the odd-numbered positions of the array and let the n=2 smallest numbers be in the even numbered positions. When n is a power of 2, halving the gap, which is initially n, in each pass except the last, results in an even number. Therefore, in all passes of the algorithm until the very last pass, the gaps are even numbers, which implies that all of the smallest numbers must remain in even-numbered positions and all of the largest numbers will remain in the odd-numbered positions. When it is time to do the last pass, all of the n=2 smallest numbers are sorted in the indices 0, 2, 4, 6, 8, ..., and the n=2 largest numbers are sorted and in indices 1, 3, 5, 7, and so on.