Sorting: From Theory to Practice / Sorting out Sorts (see libsort.cpp)
Sorting: From Theory to Practice

- Why do we study sorting?
  - Because we have to
  - Because sorting is beautiful
  - It is an example of algorithm analysis in a simple, useful setting
- There are n sorting algorithms; how many should we study?
  - O(n)? O(log n)? ...
  - Why do we study more than one algorithm?
    - Some are good, some are bad, some are very, very sad
    - They are paradigms of trade-offs and algorithmic design
  - Which sorting algorithm is best?
  - Which sort should you call from code you write?

Sorting out sorts (see libsort.cpp)

- Simple, O(n²) sorts for sorting n elements:
  - Selection sort: n² comparisons, n swaps, easy to code
  - Insertion sort: n² comparisons, n² moves, stable, fast
  - Bubble sort: n² everything; slow, slower, and ugly
- Divide-and-conquer sorts are faster, O(n log n) for n elements:
  - Quicksort: fast in practice, O(n²) worst case
  - Merge sort: good worst case, great for linked lists, uses extra storage for vectors/arrays
- Other sorts:
  - Heapsort: basically priority-queue sorting
  - Radix sort: doesn't compare keys, uses digits/characters
  - Shell sort: quasi-insertion, fast in practice, non-recursive

Selection sort: summary

- A simple-to-code O(n²) sort: n² comparisons, n swaps

    void selectSort(tvector<string>& a)
    {
        for(int k=0; k < a.size(); k++) {
            // index of the smallest element in a[k..size-1];
            // a sketch of findMin follows below
            int minIndex = findMin(a, k, a.size());
            swap(a[k], a[minIndex]);
        }
    }

- Number of comparisons: the sum of k for k = 1 to n, i.e. 1 + 2 + ... + n = n(n+1)/2 = O(n²)
  - Swaps?
  - Invariant: ?????
    [diagram: a[0..k-1] is sorted, won't move, in final position; a[k..n-1] is ?????]
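The selectSort code above calls a findMin helper that the slides do not show. A minimal sketch of what it could look like, assuming it returns the index of the smallest element in a[first..last-1] (an illustration, not the course's actual libsort.cpp code):

    // Assumed helper: index of the smallest element in a[first..last-1].
    // tvector is the Tapestry vector class used throughout these slides.
    int findMin(const tvector<string>& a, int first, int last)
    {
        int minIndex = first;
        for (int j = first + 1; j < last; j++) {
            if (a[j] < a[minIndex]) {
                minIndex = j;
            }
        }
        return minIndex;
    }

With a helper like this, selection sort performs the n(n+1)/2 comparisons counted above but only n swaps, one per pass.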
Insertion sort: summary

- A stable O(n²) sort, good on nearly sorted vectors
  - Stable sorts maintain the order of equal keys
  - Good for sorting on two criteria: name, then age

    void insertSort(tvector<string>& a)
    {
        int k, loc;
        string elt;
        for(k=1; k < a.size(); k++) {
            elt = a[k];
            loc = k;
            // shift right until the spot for elt is found
            while (0 < loc && elt < a[loc-1]) {
                a[loc] = a[loc-1];
                loc = loc-1;
            }
            a[loc] = elt;
        }
    }

  [diagram: a[0..k-1] is sorted relative to each other; a[k..n-1] is ?????]

Bubble sort: summary of a dog

- For completeness you should know about this sort
  - Few, if any, redeeming features: really slow, really, really slow
  - You can code it to recognize an already-sorted vector (see insertion)
    - Not worth it for bubble sort, which is much slower than insertion

    void bubbleSort(tvector<string>& a)
    {
        for(int j=a.size()-1; j >= 0; j--) {
            for(int k=0; k < j; k++) {
                if (a[k] > a[k+1]) {
                    swap(a[k], a[k+1]);
                }
            }
        }
    }

- "Bubble" elements down the vector/array
  [diagram: a[j+1..n-1] is sorted, in final position; a[0..j] is ?????]

Summary of simple sorts

- Selection sort has n swaps, good for "heavy" data
  - i.e., for moving objects with lots of state
    - A string isn't heavy; why? (pointer and pointee)
    - What happens in Java?
    - Wrap heavy items in a "smart pointer proxy"
- Insertion sort is good on nearly sorted data, it's stable, it's fast
  - It is also the foundation for Shell sort, a very fast non-recursive sort
  - More complicated to code, but still relatively simple, and fast
- Bubble sort is a travesty? But it's fast to code if you know it!
  - It can be parallelized, but on one machine don't go near it (see the quotes at the end of the slides)

Quicksort: fast in practice

- Invented in 1962 by C.A.R. Hoare, who at the time didn't understand recursion
  - The worst case is O(n²), but it is avoidable in nearly all cases
  - In 1997 Introsort was published (Musser, introspective sort)
    - Like quicksort in practice, but it recognizes when it will be bad and changes to heapsort

    void quick(tvector<string>& a, int left, int right)
    {
        if (left < right) {
            int pivot = partition(a, left, right);
            quick(a, left, pivot-1);
            quick(a, pivot+1, right);
        }
    }

- Recurrence?

Partition code for quicksort

- The partition is easy to develop
  [diagram, what we want: a[left..pIndex] <= pivot, a[pIndex+1..right] > pivot]
  [diagram, what we have: a[left..right] not yet partitioned (??????????????)]
  [diagram, loop invariant: pivot X in a[left]; a[left+1..pIndex] <= X, a[pIndex+1..k-1] > X, a[k..right] not yet examined; pIndex is the eventual pivot index]

    int partition(tvector<string>& a, int left, int right)
    {
        string pivot = a[left];
        int k, pIndex = left;
        for(k=left+1; k <= right; k++) {
            if (a[k] <= pivot) {
                pIndex++;
                swap(a[k], a[pIndex]);
            }
        }
        swap(a[left], a[pIndex]);
        return pIndex;      // final position of the pivot
    }

- Loop invariant:
  - a statement that is true each time the loop test is evaluated, used to verify the correctness of the loop
- A chosen pivot can be swapped into a[left] before the loop
  - Nearly sorted data is then still ok

Analysis of Quicksort

- Average case and worst case analysis
  - Recurrence for the worst case: T(n) = T(n-1) + T(1) + O(n)
  - What about the average case? T(n) = 2T(n/2) + O(n)
- Reason informally:
  - Two calls on vectors of size n/2
  - Four calls on vectors of size n/4
  - ... How many calls are there? How much work is done on each call?
- Partition: typically find the median of the left, middle, and right elements, swap it into place, and go
  - This avoids bad performance on nearly sorted data
- In practice: remove some (all?) recursion, avoid lots of "clones"

Tail recursion elimination

- If the last statement is a recursive call, the recursion can be replaced with iteration
  - The call cannot be part of an expression
  - Some compilers do this automatically

    void foo(int n)
    {
        if (0 < n) {
            cout << n << endl;
            foo(n-1);
        }
    }

    void foo2(int n)
    {
        while (0 < n) {
            cout << n << endl;
            n = n-1;
        }
    }

- What if the cout << and the recursive call were switched?
- What about recursive factorial? return n*factorial(n-1);

Merge sort: worst case O(n log n)

- Divide and conquer: a recursive sort
  - Divide the list/vector into two halves
    - Sort each half
    - Merge the sorted halves together
  - What is the complexity of merging two sorted lists?
  - What is the recurrence relation for merge sort as described? T(n) = 2T(n/2) + O(n)
- What is the advantage of a vector over a linked list for merge sort?
  - What about merging; what is the advantage of a linked list there?
  - A vector requires auxiliary storage (or very fancy coding)

Merge sort: lists or vectors

- Mergesort for vectors:

    void mergesort(tvector<string>& a, int left, int right)
    {
        if (left < right) {
            int mid = (right+left)/2;
            mergesort(a, left, mid);
            mergesort(a, mid+1, right);
            merge(a, left, mid, right);
        }
    }

- What's different when linked lists are used?
  - Do the differences affect complexity? Why?
- How does merge work?

Mergesort continued

- The vector code for merge isn't pretty, but it's not hard
  - Mergesort itself is elegant

    void merge(tvector<string>& a,
               int left, int middle, int right)
    // pre:  left <= middle <= right,
    //       a[left] <= ... <= a[middle],
    //       a[middle+1] <= ... <= a[right]
    // post: a[left] <= ... <= a[right]

- Why is this prototype potentially simpler for linked lists?
  - What will the prototype be? What is the complexity?
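The slides give only the prototype and pre/postconditions for merge. As one illustration of how the "extra storage" version might be written (a sketch, not the course's actual code; it assumes tvector supports push_back and size like an STL vector), the merged result can be built in a temporary vector and copied back:

    // Sketch: merge the sorted ranges a[left..middle] and a[middle+1..right]
    // back into a[left..right] using an auxiliary vector.
    void merge(tvector<string>& a, int left, int middle, int right)
    {
        tvector<string> temp;                          // the extra storage
        int first = left, second = middle + 1;
        while (first <= middle && second <= right) {   // take the smaller front element
            if (a[first] <= a[second]) temp.push_back(a[first++]);
            else                       temp.push_back(a[second++]);
        }
        while (first <= middle)  temp.push_back(a[first++]);   // copy any leftovers
        while (second <= right)  temp.push_back(a[second++]);
        for(int k=0; k < temp.size(); k++) {           // copy back into a
            a[left + k] = temp[k];
        }
    }

Every element is copied into temp and back exactly once, so merging is O(n) for n elements; that is the O(n) term in the recurrence T(n) = 2T(n/2) + O(n).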
Summary of O(n log n) sorts

- Quicksort is relatively straightforward to code, and very fast
  - The worst case is very unlikely, but it is possible, therefore ...
  - And if lots of elements are equal, performance will be bad
    - e.g., one million integers from the range 0 to 10,000
    - How can we change partition to handle this?
- Merge sort is stable, it's fast, and it's good for linked lists; harder to code?
  - Worst case performance is O(n log n); compare quicksort
  - Extra storage is needed for an array/vector
- Heapsort is more complex to code, has a good worst case, and is not stable
  - Basically a heap-based priority queue in a vector

Sorting in practice, see libsort.cpp

- Rarely will you need to roll your own sort, but when you do ...
  - What are the key issues?
- If you use a library sort, you need to understand its interface
  - In C++ we have the STL and sortall.cpp in Tapestry
    - The STL has sort and stable_sort
    - Tapestry has lots of sorts; its Mergesort is fast in practice, stable, and safe
  - In C the generic sort is complex to use because arrays are ugly
    - See libsort.cpp
  - In Java, guarantees and worst-case behavior are important
    - Why won't quicksort be used?
- Function objects permit the sorting criteria to be changed simply

Standard sorts: know your library

- Know how to use the STL sorts even if you don't use the STL
  - The sort function takes iterators as parameters
  - Vectors, strings, and other containers can "give me iterators"
    - What about linked-list iterators? Why aren't these "sortable"?

    string s = "….";
    sort(s.begin(), s.end());

    tvector<string> vs;
    // fill vs with values
    sort(vs.begin(), vs.end());

- Beware C's qsort: implementations vary widely and wildly on different platforms
  - Last year it was slow on Solaris, this year it is fast. Why?

In practice: templated sort functions

- Function templates permit us to write a sort once and use it several times, for several different types of vector
  - The function template "stamps out" a real function for each element type
  - Maintenance effort is saved, though the compiled code is still large (why?)
- What properties must hold for the vector elements?
  - They are comparable using the < operator
  - They can be assigned to each other
- Template functions capture these property requirements in code
  - This is part of generic programming
  - The newest Java (1.5 beta) has generics; older Java did not

Function object concept in Tapestry

- The idea is to encapsulate a comparison (like operator <) in an object
  - We need a convention for the parameter: its name and behavior
  - Enforceable by templates or by inheritance (or both)
- Name convention: know what the name of the function/method is
  - It takes two parameters, the (vector) elements being compared
  - In Tapestry the name is compare; in the STL it is operator()
- compare returns an int, operator() returns a bool
  - operator() behaves like <, but is invoked like a function
  - compare returns:
    - zero if the elements are equal
    - +1 (positive) if the first > the second
    - -1 (negative) if the first < the second

Function object example: Tapestry

    class StrLenComp // : public Comparer<string>
    {
      public:
        int compare(const string& a, const string& b) const
        // post: return -1, +1, or 0 as a.length() is less than,
        //       greater than, or equal to b.length()
        {
            if (a.length() < b.length()) return -1;
            if (a.length() > b.length()) return 1;
            return 0;
        }
    };

Function object example: STL

    struct stllencomp
    {
        // for use with standard C++ sorting functions
        bool operator() (const string& a, const string& b)
        {
            return a.length() < b.length();
        }
    };
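As a brief illustration of how a function object like stllencomp plugs into the STL sort shown above (a usage sketch, not from the slides; it uses std::vector and a small main with made-up sample strings so it is self-contained, relying on the standard three-argument overload of sort that accepts a comparison object):

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    // same idea as the stllencomp struct above: order strings by length
    struct stllencomp
    {
        bool operator() (const string& a, const string& b) const
        {
            return a.length() < b.length();
        }
    };

    int main()
    {
        vector<string> vs;
        vs.push_back("apricot");
        vs.push_back("fig");
        vs.push_back("pear");

        sort(vs.begin(), vs.end());               // default criterion: operator< (alphabetical)
        sort(vs.begin(), vs.end(), stllencomp()); // new criterion: by length
        for (size_t i = 0; i < vs.size(); i++) {
            cout << vs[i] << endl;                // prints fig, pear, apricot
        }
        return 0;
    }

Passing a different function object to the same call is what the slides mean by letting function objects change the sorting criteria simply, without changing the sort itself.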