Introspective Sorting and Selection Algorithms
Total Page:16
File Type:pdf, Size:1020Kb
Introsp ective Sorting and Selection Algorithms David R Musser Computer Science Department Rensselaer Polytechnic Institute Troy NY mussercsrpiedu Abstract Quicksort is the preferred inplace sorting algorithm in manycontexts since its average computing time on uniformly distributed inputs is N log N and it is in fact faster than most other sorting algorithms on most inputs Its drawback is that its worstcase time bound is N Previous attempts to protect against the worst case by improving the way quicksort cho oses pivot elements for partitioning have increased the average computing time to o muchone might as well use heapsort which has aN log N worstcase time b ound but is on the average to times slower than quicksort A similar dilemma exists with selection algorithms for nding the ith largest element based on partitioning This pap er describ es a simple solution to this dilemma limit the depth of partitioning and for subproblems that exceed the limit switch to another algorithm with a b etter worstcase bound Using heapsort as the stopp er yields a sorting algorithm that is just as fast as quicksort in the average case but also has an N log N worst case time bound For selection a hybrid of Hoares find algorithm which is linear on average but quadratic in the worst case and the BlumFloydPrattRivestTarjan algorithm is as fast as Hoares algorithm in practice yet has a linear worstcase time b ound Also discussed are issues of implementing the new algorithms as generic algorithms and accurately measuring their p erformance in the framework of the C Standard Template Library key words Quicksort Heapsort Sorting algorithms Introsp ective algorithms Hybrid algorithms Generic algorithms STL Intro duction Among sorting algorithms with O N log N average computing time medianof quicksort is considered to b e a go o d choice in most contexts It sorts in place except for log N stack space and is usually faster than other inplace algorithms such as heapsort mainly b ecause it do es substantially fewer data assignments and other op erations These characteristics make medianof quicksort a go o d candidate for a standard library sorting routine and it is in fact used as such in the C Standard Template Library STL as well as in older C librariesit is the algorithm most commonly used for qsort for example However its worst case time is N and although the worstcase behavior app ears to be highly unlikely the very existence of input sequences that can cause such bad p erformance may be a concern in some situations The b est alternativeinsuch situations has b een to use heapsort which has a N log N worstcase as well as averagecase time bound but doing so results in computing times to times longer than quicksorts time on most inputs Thus in order to protect This work was partially supp orted by a grant from IBM Corp oration completely against deterioration to quadratic time it would seem necessary to pay a substantial time p enalty for the great ma jority of input sequences A similar dilemma app ears with selection algorithms for nding the ith smallest elementof a sequence Hoares algorithm based on partitioning has a linear b ound in the average case but is quadratic in the worst case The BlumFloydPrattRivestTarjan lineartime worst case algorithm is much slower on the average than Hoares algorithm In this pap er we concentrate on the sorting problem and return to the selection problem only briey in a later section The next section presents a class of arbitrarily long input sequences that do in fact cause medianof quicksort to take quadratic time call these sequences medianof killers It is p ointed out that such input sequences might not be as rare in practice as a probabilistic argument based on uniform distribution of p ermutations would suggest The third section then describ es introsort for introspective sort a new hybrid sorting algorithm that b ehaves almost exactly like medianof quicksort for most inputs and is just as fast but which is capable of detecting when partitioning is tending toward quadratic b ehavior By switching to heapsort in those situations introsort achieves the same O N log N time b ound as heapsort but is almost always faster than just using heapsort in the rst place On a medianof killer sequence of length for example introsort switches to heapsort after partitions and has a total running time less than th of that of medianof quicksort In this case it would be somewhat faster to use heapsort in the rst place but for almost all randomlychosen integer sequences introsort is faster than heapsort by a factor of between and Ideally quicksort partitions sequences of size N into sequences of size approximately N those sequences in sequences of size N and so on implicitly pro ducing a tree of subproblems whose depth is approximately log N The quadratictime problem arises on sequences that cause many unequal partitions resulting in the subproblem tree depth growing linearly rather than logarithmically Introsort avoids the problem by putting a logarithmic b ound on the depth and switching to heapsort whenever the b ound is reached Another way of obtaining an O N log N worstcase time bound with an algorithm based on quicksort is to install a lineartime mediannding algorithm for cho osing partition pivots However the resulting algorithm is useless in practice since the large overhead of the mediannding algorithm makes the overall sorting algorithm much slower than heapsort This algorithm could however be used as the stopp er in an introsp ective algorithm in place of heapsort There might be some advantage in this algorithm in terms of commonality of co de with the selection algorithms describ ed in a later section Instead of nding the median exactly other less exp ensive alternatives have b een suggested such as adaptively selecting from larger arrays a larger sample of elements from which to estimate the median A new version of the C library function qsort based on this technique and other improvements outp erforms medianof versions in most cases but still takes N time in the worstcase Yet another p ossibility is to use a randomized version of quicksort in whichpivot elements are chosen at random The worstcase time b ound is still N but the probabilityof there seconds on a MHz microSPARC II versus seconds for the generic sort algorithm from the HewlettPackard HP implementation of the C Standard Template library with b oth algorithms compiled with the Ap ogee C compiler version Performance comparisons that are more architecture and compiler indep endent are given in a later section For a purely recursiveversion of quicksort the subproblem depth would b e the same as the recursion depth but the most ecientversions recurse directly or using a stack on only one branch of the subproblem tree and iterate down the other b eing suciently many bad partitions to achieve this b ound is extremely low However cho osing a single element of the sequence at random do es not pro duce go o d enough partitions on the average to comp ete with medianof choices and cho osing three elements at random requires to o much time for random numb er generation Introsort b eats these randomized variations in practice Medianof Killer Sequences As is wellknown the simplest metho ds of cho osing pivot elements for partitioning such as always cho osing the rst element cause quicksort to deteriorate to quadratic time on already or nearly sorted sequences Since there are many situations in which one applies a sorting algorithm to an already sorted or almost sorted sequence a more robust wayofcho osing pivots is needed While the ideal choice would be the median value the amount of computation required is to o large Instead one of the most frequently used metho ds is to cho ose the rst middle and last elements and then cho ose the median of these three values This metho d pro duces go o d partitions in most cases including the sorted or almost sorted cases that cause the simpler pivoting metho ds to blow up Nevertheless there are sequences that can cause medianof quicksort to make many bad partitions and take quadratic time For example let k be any even p ositiveinteger and consider the following p ermutation of k k k k k k k k k k k k k k k k Call this p ermutation K for reasons to be seen it will also b e called a medianof killer k permutation or sequence Assume the three elements chosen by a medianof quicksort algorithm on an array ak are aak and ak Thus for K the three values k chosen are and k and the median value chosen as the pivot for partitioning is In the partition the only elements exchanged are k and resulting in k k k k k k k k k k k k k k k where the partition point is after a Now the array ak now holds a sequence with the structure of K as can be seen by subtracting from every element As this step can k be repeated ktimeswehave the following Theorem For any even integer N divisible by the permutation K causes medianof N partitioning to produce a sequence of N partitions into subsequences of length and N and N and N Corollary On any sequencepermuted from an ordered sequencebyK where N is divisible N by medianof quicksort takes N time Pro of By the theorem the subproblem tree pro duced by partitioning K has at least N N levels Since the total time for all partitions at each level is N the total time for all levels is N One might argue that the K sequences are not likely to occur in real