View metadata,citationandsimilarpapersatcore.ac.uk arXiv:1810.12047v1 [cs.DS] 29 Oct 2018 es EA21) h ipicto sachieved is simplification The 2016). (ESA Weiss os hoeia eair h ipe ainsper- variants simpler the behavior, theoretical worse approaches Lomuto-based that shows analysis The paper the types, input different many on well works Block- the of variants simple presents paper This ainso taesilrlvn oa.Ti espe- This today. relevant still are it of variants al- proposed the of properties theoretical the of ysis rmiglnugs uhas pro- such modern languages, of libraries gramming standard im- the algorithms in sorting plemented at looking when shows cially [11], in Hoare by described was algorithm work to lose data routines to more and seem core day. more every not important on get does we most and as the relevance algorithms, of many one in problems. is fundamental It Science’s is way Computer efficient of an in one elements of sequence a Sorting Introduction 1 BlockQuick- of version original the sort. as well as their form despite that, shows study exten- experimental An sive approach. two-pivot the pivot suit about that reason choices to useful the particularly Moreover, is analysis Hoare- BlockQuicksort. the of than approach cost based sorting average higher a incur competitors. their to them compares and gorithms anal- an provides paper The robust. makes algorithm approach twist the quicksort simple two-pivot generic surprisingly the A Lomuto’s to of scheme. variant two-pivot partitioning novel a that introduces algorithm sorting robust a achieve the To partition to input. of technique instead pointer scheme crossing Hoare’s partitioning Lomuto’s using by and Edelkamp by described algorithm Quicksort Abstract ipeadFs lcQikotuigLmt’ Partitionin Lomuto’s using BlockQuicksort Fast and Simple hl thsbe oeta 5yassnethe since years 55 than more been has it While atnAmulrNkljHass Nikolaj Aum¨uller Martin C++ hc ssa uses which , coe 0 2018 30, October os-aeaaeqikotvratcle called variant quicksort aware worst-case hc eeue nanme fimplementations of number a in used were which otagrtm[3,adMicrosoft’s and [23], algorithm sort Oracle’s [15], rdtoa eino-he ucsr approach. quicksort -of-three traditional a t stemi rwako hi prah stat- approach, their of complex- drawback code main describe the [4] as al. algorithm. ity et the Axtmann to fact, introduced In been has complexity introduction. detailed the of More the part to work deferred 4]. related is 7, methods these 8, about information [17, variants quicksort of so- others, using among branches move-operations, avoid conditional called can circumstances and processors certain will under [10], we modern 2, CPU As Section i7 [12]. in Intel architectures describe other modern on a large larger on as even cycles be 15 can which as that incur, penalty mispredictions performance these the reduce to tools needed other comparison- are means of This regime algorithms. the the sorting in efficient of necessity elements a two are of input, comparison the of outcome the mispredicting i.e., branches, conditional processor dicting today’s in one bottleneck’s is architectures. largest which the unsolved, [18]. of mispredictions elements branch problem of the be of leaves sample behavior cannot a cache better from as that However, such pivot effect means a traditional an choosing more small other — a by the pivots achieved with that of improves showed in the 21] number behavior pivot 3, at access one [13, looked memory Here, have than 21] more algorithm. 3, the introducing 13, of line 23, benefits long [16, a papers of implementations, quicksort-based of ntebac-rergm f[7 ,4,altof lot a 4], 7, [17, of regime branch-free the In mispre- [6], in Moruz and Brodal by shown As performance the improving of pursuit the In Java hc ssatopvtquick- two-pivot a uses which , provided byTheITUniversityofCopenhagen'sRepository C# Scheme g hc uses which , brought toyouby CORE ing “formal verification of the correctness of the tice, it is usually chosen as the median in a sample implementation might help to increase trust in the of three elements, while theory suggests to choose remaining cases.” [4, p. 12]. The BlockQuicksort it as the median of Θ(√n) elements to minimize approach of Edelkamp and Weiss [7] has a very el- comparisons [14]. evated code complexity as well, using about 300 Super-Scalar Samplesort Sanders and lines of code for the partitioning procedure. Winkel [17] provided an engineered implemen- The main contribution of this paper are simple, tation of samplesort, a variant of quicksort in fast, and robust variants of branch-free quicksort- which many pivots are used to distribute the based algorithms. Thus, introducing more com- input to buckets. They used conditional moves to plexity is not necessarily required to improve per- traverse the heap-like tree used to find the bucket formance. In more detail, the paper presents sim- of an element. The algorithm uses two phases, pler variants of the BlockQuicksort algorithm de- classification and rearranging elements, and needs scribed in [7]. The difference to BlockQuicksort extra space in the size of the input. lies in utilizing a partitioning scheme that Bent- Tuned Quicksort Elmarsy et al. [8] described ley in [5] attributed to Nico Lomuto, rather than an engineered quicksort variant that uses a branch- using the traditional “crossing pointer” partition- free variant of Lomuto’s partitioning scheme using ing scheme by Hoare as in [7]. The description of conditional moves. the algorithms is provided in Section 3. In there, BlockQuicksort BlockQuicksort as intro- we describe a one-pivot and a two-pivot approach. duced by Edelkamp and Weiss in [7] generalizes Introducing an additional pivot allows us to make Hoare’s partitioning scheme using blocks to handle the algorithm robust and to avoid some of the com- misplaced elements. Filling these two blocks, one mon performance issues of the Lomuto partitioning block for each of the two pointers used in Hoare’s scheme, such as the presence of many equal ele- partitioning scheme, with references to misplaced ments [7]. elements is done branch-free using conditional Theoretical properties of the proposed algo- moves. When these blocks are near-full, elements rithms are discussed in Section 4. The result of are moved into a correct position with regard to this analysis allows us, among others, to reason the pivot choice. When the two pointers are close about good pivot choices for the two-pivot vari- at the end of the partitioning phase, the code has ant. Comparing the Lomuto-based algorithms to to handle many edges cases to produce a correct the Hoare-based approach used in standard Block- partition. To this end, BlockQuicksort uses a Quicksort, we notice a slightly worse comparison “clean-up phase” when partitioning is done. count and a slightly worse memory behavior with Multi-Pivot Quicksort [9, 16, 23, 13, 1, 2, 3, regard to the “scanned elements/memory accesses” 21] studied benefits of using a small number of piv- cost measure discussed in [13, 3, 21]. This makes ots in quicksort. This line of research showed that the results of the experiments in Section 5 quite this approach improves the memory access pattern surprising, which show that the proposed variants drastically [3, 21], an improvement that cannot be can compete with BlockQuicksort. In the same sec- achieved by other means such as choosing a good tion, we speculate about the reasons for this based pivot by sampling input elements. In short, multi- on actual measurements from the CPU. pivot quicksort allows to make fewer accesses to array cells to sort the input, an effect that trans- 1.1 Related Work lates into better running times. The most popular Tuning Quicksort Algorithms Quicksort multi-pivot algorithm is the two-pivot YBB algo- implementations (see Section 2 for a description rithm [23] used in Oracle’s Java since version 7. of the quicksort algorithm) usually use insertion- IPS4o [4] provides an in-place samplesort vari- sort to sort subproblems that are below a certain ant of [17] by combining it with the idea of using size threshold [18]. Moreover, the pivot is usually blocks from BlockQuicksort. Instead of two blocks, chosen from a small sample of elements. In prac- they use a large number of blocks, one for each Algorithm 1 One-Pivot-Quicksort Algorithm 2 Lomuto Partitioning Scheme procedure OnePivotQuicksort(A[1..n]) procedure LomutoPartition(A[1..n]) 1: if n > 1 then 1: p A[n] ← 2: choosePivot(A[1..n]) ⊲ Pivot resides in A[n] 2: i 1 ← 3: i partition(A[1..n]) 3: for j 1; j < n; j j + 1 do ← ← ← 4: OnePivotQuicksort(A[1..i 1]) 4: if A[j] < p then − 5: OnePivotQuicksort(A[i + 1..n]) 5: Swap A[i] and A[j] 6: i i + 1 ← 7: Swap A[i] and A[n] 8: return i bucket. These blocks are filled in a branch-free way using the classification strategy from [17]. All

1 then Require: n > 1, Pivot in A[n] 2: choosePivot(A[1..n]) ⊲ Pivots A[1] A[n] 1: p A[n]; ≤ ← 3: (i, j) partition(A[1..n]) 2: integer block[0,..., B 1], i, j 1, num 0 ← − ← ← 4: TwoPivotQuicksort(A[1..i 1]) 3: while j < n do − 5: TwoPivotQuicksort(A[i+1..j 1]) ⊲ Sec. 3.3 4: t min(B, n j); − ← − 6: TwoPivotQuicksort(A[j+1..n]) 5: for c 0; c < t; c c + 1 do ← ← 6: block[num] c; ← 7: num num + (p > A[j + c]); ← Otherwise, choose two elements p,q with p q 8: for c 0; c < num; c c + 1 do ≤ ← ← from A as pivots. Next, partition the input such 9: Swap A[i] and A[j + block[c]] that p resides in position i and q resides in posi- 10: i i + 1 tion j, A[1..i 1] contains elements smaller than p, ← − 11: num 0; A[i + 1..j 1] contains elements x with p x q ← − ≤ ≤ 12: j j + t; and A[j + 1..n] contains elements larger than q. ← 13: Swap A[i] and A[n]; Then, sort A[1..i 1], A[i+1..j 1], and A[j +1..n] − − 14: return i; recursively. Many different partitioning methods for two- B pivot quicksort exist, see, for example [23, 2].

q ? ? q procedure BlockLomuto2(A[1..n]) ≤···≤ · · · Require: n> 1, Pivots in A[1] A[n] ↑ ↑ ↑ ↑ ≤ i j k k + c 1 2 3 1: p A[1]; q A[n]; ← ← block 2: integer block[0,..., B 1] num↑≤q − num≤q 3: i, j, k 2, numq ? ? q 5: t min(B, n k); ≤···≤ · · · ← − ↑ ↑ ↑ 6: for c 0; c < t; c c + 1 do i j k 0 1 ← ← j + c 7: block[num≤q] c; ← block ↑ 8: num≤q num≤q + (q A[k + c]); num

A[j + c]); ← 15: for c 0; c < numq ≤···≤ Algorithm 4, the random variable PMAn counts the 1/6 1/3 1/2 number of times we reach Line 7 (reading A[j + c]) ′ Figure 4: Top: Sampling step with t = (0, 1, 2) and Line 9 (reading A[i]). For Algorithm 5, PMAn and two pivots. Bottom: Assuming a random per- counts the sum of the number of times Line 7, mutation, expected sizes of different groups under Line 12, and Line 14 are reached, i.e., individual accesses of i, j, and k in the array. We let by MAn the pivot choice. The 5 samples split the input ′ into 6 different parts of equal size (in expectation). and MAn denote the cost over the whole recursion, Thus, a fraction of 1/6 of the remaining elements respectively. are smaller than p, 1/3 are in between p and q, and From Partitioning Cost to Sorting Cost 1/2 are larger than q. Computing the average sorting cost from a given average partitioning cost is straight-forward if the latter is bounded by a n + O(1), see [2]. 4.1 Setup of the Analysis The input is a · For one-pivot quicksort with pivot p, we let random permutation of the set 1,...,n which { } by a := p 1, a := n p denote the number resides in an array A[1..n]. Fix an integer k 1, 2 0 − 1 − ∈ { } of elements smaller/larger than the pivot. For which denotes the number of pivots. Fix a vector k+1 two-pivot quicksort with pivots p and q, we set t = (t0,...,tk) N . Let κ := κ(t) = k + ∈ a0 := p 1, a1 := q p, a2 := n q to denote ti be the number of samples. We assume − − − P0≤i≤k group sizes in the partition. For a given sequence that κ is a constant independent of n. The general k+1 t = (t ,...,tk) N we define H(t) by outline of a k-pivot quicksort algorithm is then as 0 ∈ follows: If n κ, sort A directly. Otherwise, sort k ≤ ti + 1 the first κ elements and then set pi = A[i+ tj], (4.1) H(t)= (Hκ+1 Hti+1) , Pj

4.2 Cost Measures We measure cost in two (4.2) ways: we count the number of comparisons with E(Cn) = E(Pn)+ the pivot(s) and we count the number of array cells (E(Ca0 )+ + E(Cak )) Pr( a0, . . . , ak ), accessed by the respective algorithm when sorting X · · · · h i a0+···+ak=n−k an input containing n elements. where a , . . . , ak is the event that the group sizes With regard to comparisons, we define the h 0 i are exactly a , . . . , a . The probability of this event following random variables: Let PCMPn denote 0 k a0 ··· ak (t0 ) (tk ) the number of comparisons with elements against for a given vector t is n . Now, a result pivot p in Algorithm 4, i.e., the number of times (κ) by Hennequin [9, Proposition III.9] says that for Line 7 is reached. Let CMP be the number of n fixed k and t and average partitioning cost E(P )= comparisons over the whole recursion. Let PCMP′ n n a n + O(1) recurrence (4.2) has the solution denote the number of comparisons with elements · a against the pivots p and q in Algorithm 5 (Line 7 (4.3) E(C )= n ln n + O(n). ′ n H(t) and Line 12) and define CMPn accordingly. The sampling technique used here does not Algo AS Cost Best (cost, t) preserve randomness in subproblems, since a few H1 cmp 2.00n ln n, (0, 0) elements have already been sorted during the pivot H1 0ma 2.00n ln n, (0, 0) H1 cmp+ma 4.00n ln n, (0, 0) sampling step. For the analysis, we ignore that the unused samples have been seen and get only an H1 cmp 1.71n ln n, (1, 1) estimate on the sorting cost. See [16] for a detailed H1 2ma 1.71n ln n, (1, 1) H1 cmp+ma 3.43n ln n, (1, 1) analysis of this situation. H1 cmp 1.62n ln n, (2, 2) 4.3 Analysis H1 4ma 1.62n ln n, (2, 2) BlockLomuto1 (Algorithm 4) Every ele- H1 cmp+ma 3.24n ln n, (2, 2) ment in the input that has not been sampled is H1 cmp 1.53n ln n, (5, 5) compared with the pivot exactly once (Line 7), so H1 10ma 1.53n ln n, (5, 5) H1 cmp+ma 3.06n ln n, (5, 5) we get PCMPn = n + O(1). Conditioned on the pivot being p, there are exactly p 1 elements that L1 cmp 2.00n ln n, (0, 0) − are smaller than p in the input. Each of these el- L1 0ma 3.00n ln n, (0, 0) ements is accessed exactly once in Line 9, thus we L1 cmp+ma 5.00n ln n, (0, 0) get PMAn = n+p+O(1). If the pivot is chosen ac- L1 cmp 1.71n ln n, (1, 1) cording to a vector t = (t0,t1), we expect that there L1 2ma 2.57n ln n, (1, 1) are t0+1 n+O(1) many elements smaller than p. L1 cmp+ma 4.29n ln n, (1, 1) t0+t1+2 · t0+1 L1 cmp 1.62n ln n, (2, 2) Thus, we get E(PMAn)= 1+ n + O(1)  t0+t1+2  · L1 4ma 2.38n ln n, (1, 3) memory accesses on average. L1 cmp+ma 4.05n ln n, (2, 2) BlockLomuto2 (Algorithm 5) Each unsam- pled element from the input is compared exactly L1 cmp 1.53n ln n, (5, 5) L1 10ma 2.22n ln n, (4, 6) once to q (Line 7). Each element that is smaller L1 cmp+ma 3.78n ln n, (4, 6) than q is then compared to p (Line 12). If the pivots 3 L2 cmp 2.00n ln n, (0, 0, 0) p and q are chosen according to t = (t0,t1,t2) N , t0+t1+2 ∈ L2 0ma 2.40n ln n, (0, 0, 0) we expect that a fraction of t0+t1+t2+3 elements ′ L2 cmp+ma 4.40n ln n, (0, 0, 0) are smaller than q. Thus, we obtain E(PCMPn)= L2 cmp 1.73n ln n, (0, 1, 2) 1+ t0+t1+2 n + O(1). Regarding memory  t0+t1+t2+3  · L2 3ma 1.92n ln n, (0, 1, 2) accesses, fix the pivots p,q. Every array cell is L2 cmp+ma 3.65n ln n, (0, 1, 2) accessed by Line 7 of the algorithm. All array L2 cmp 1.62n ln n, (1, 1, 3) cells A[j] with j

2http://icl.utk.edu/papi/ s n

BlockLomuto1 ln

6 BlockLomuto2 /n

ns 2.6 4 2.4 Running time in 20 23 26 29 212 215 Block size B 2.2 221 222 223 224 225 226 Running time in Figure 5: Influence of block size B to the running Items time for n = 226, Xeon on Permutation. 2 (direct) 1 (direct) 2 (2,4 of 5) 2 (1,2 of 3) 1 (2 of 3) 2 (1,3 of 5) 2 (1,3 of 5*) 1 (3 of 5*) BlockLomuto2 and the implementation of IPS4o [4] retrieved from https://github.com/SaschaWitt/ips4o. Figure 6: Running times on Permutation on Xeon Furthermore, we used std::sort from gcc as the for different pivot choices. Legend entries of the baseline implementation to compare all algorithms form p (R of S) are read as follows: For the p- to. The implementations of BlockLomuto1 and pivot variant, we sort S elements and choose the BlockLomuto2 save lines of code by a factor of elements R in the sorted sample, e.g., (2, 4 of 5) 5 and 8, resp., compared to BlockQS under a means that the second and the fourth element in a similarly dense coding style. sorted sample of five elements are chosen as pivots; a * refers to the median of sampling. 5.1 Block Size Figure 5 shows how the block size influences the sorting time for the one- and two-pivot variant for 2 B 214. Both variants The plot shows that the pivot selection strat- ≤ ≤ benefit from large block sizes, decreasing the run- egy has big influence on the running time. In par- ning time to sort the input of n = 226 items from ticular, choosing the pivot directly from the input around 7 seconds to around 3 seconds. The mini- without sampling performs much worse than even mum is attained for both variants for a block that simple pivot selection strategies. For example, for can hold up to 1024 items, but there is no big dif- BlockLomuto1, there is a difference of a factor 1.14 ference for block sizes around this value. Conse- in running time when choosing the pivot directly quently, we set the block size to B = 1024 for both from the input compared to choosing it as the me- implementations. dian of 5 medians. BlockLomuto2 is a factor of 1.16 times slower without sampling than choosing 5.2 Pivot Strategies We implemented different the first two elements in a sample of three elements, pivot selection strategies and compared them to i.e., looking at one additional element. It is by a each other. Figure 6 shows the result of this factor of 1.22 slower when pivots are chosen among experiment. For BlockLomuto1, we compare direct the medians of five elements each. In particular, choice, median of 3, and median of medians of 5, choosing skewed pivots as predicted in Section 4 which groups 25 elements in 5 groups, chooses the provides an improvement for the two-pivot variant. median in each, and chooses the median of these Choosing the second and fourth element is a fac- medians as pivot, see [7]. For BlockLomuto2, we tor of 1.07 slower than choosing the first and third compare direct choice, first two of three elements, element. first and third of 5, second and fourth of 5, and For the final implementation, we used a the adaption of the median of medians strategy slightly more elaborate pivot selection strategy that described above, in which we take the first and switches between pivot strategies based on the in- third element in the sample of 5 medians. put size. The best thresholds to switch strate- gies have been obtained experimentally as well. of these input types, stdsort performs very well. While improvements in running time are consis- Furthermore, we note that average running times tent, they do not exceed a factor of 1.01 compared predict the difference between implementations to the fastest variant that does not switch strate- very well: Even for random structured inputs gies, as examplified through the difference between (RandomDup), significant differences that occurred BlockLomuto2 and 2 (1, 3 of 5∗) in Figure 6. in at least 95% of the trials made differences only about a factor of 0.02 smaller than average 5.3 Further Tuning Apart from choosing the running time differences. block size and pivot selection strategy based on Despite their simplicity, the BlockLomuto vari- the experiments described above, the implementa- ants show very good performance. In particu- tions used in the experiments are identical to Algo- lar, the two-pivot variant is robust on different rithm 4 and Algorithm 5. In contrast to the obser- input types, achieving robustness with the small vations made in [7, Section 3.2], unrolling the main twist introduced in Section 3. On the other hand, loop did not increase performance. We suspect that BlockQS achieves robustness with elaborate addi- this is due to the simplicity of the code that allows tional checks of the input after the main partition- the compiler to unroll the code. Furthermore, we ing step finishes. do not perform the cyclic rotations described in [7] since they actually increased running time. We sus- 5.5 Practical Observations Comparing our pect that this is due to the CPU pipelining the load- running time results to the considerations in Sec- /store instruction issued in Line 9 of Algorithm 4, tion 4, it is quite surprising that BlockLomuto1 and whereas the cyclic shifts described in [7] introduce BlockLomuto2 can compete in performance with read/write dependencies. BlockQS. In Table 2 we present selected measure- ments we got from running experiments. Exper- 5.4 Running Times We discuss the running iments are run on Permutation to report on the times observed on Xeon plotted in Figure 7. In average-case behavior of implementations. the following, running time differences are always From Section 4, we expect that IPS4o and stated for the data points associated with 227 items. BlockQS use fewer instructions than the Lomuto- On Permutation, IPS4o has the fastest running variants discussed in this paper. Furthermore, they time. On average, BlockLomuto1 and BlockLo- are expected to have a better cache behavior. muto2 are a factor of 1.15 times slower, BlockQS is From Table 2 we see that with regard to L1 1.20 times slower; stdsort is a factor of 2.22 slower cache misses, BlockQS incurs the fewest misses, fol- than IPS4o. With regard to significant running lowed by stdsort, BlockLomuto2, IPS4o, and Block- time differences, these factors decrease in absolute Lomuto1. So, IPS4o uses too many blocks for all of value by about 0.02. them to reside in L1 cache. Looking at L2 cache For inputs containing many duplicates misses, IPS4o incurs fewer misses than all other al- (SawTooth, RandomDup, EightDup, Equal), gorithms, as predicted. BlockLomuto2 has a bet- BlockLomuto1 cannot compete with the other ter cache behavior than BlockLomuto1, but behaves algorithms; a drawback of Lomuto’s partitioning worse than BlockQS, which again follows nicely the scheme that had already been identified in [7]. memory accesses computed in Section 4. Thus, we omit it in the plots. In the respective BlockLomuto1 and BlockLomuto2 branch condi- order of these input types, it was 221, 280, 18571, tionally in about the same dimension as IPS4o and 21010 times slower than IPS4o. On the other hand, BlockQS, and there is a big difference to the num- BlockLomuto2 is robust on all of these input types, ber of branches in stdsort. These branches are being fastest on RandomDup and Equal, and being easy to predict, as shown by the low misprediction close in performance to BlockQS on the other two. rates. Here, LomutoBlock variants make fewer mis- Only on Sorted and Reversed it is around a factor predictions than BlockQS, and there is a big gap to 2 to 3 slower than the fastest variant. For both the misprediction rate of IPS4o; stdsort mispredicts (a) Permutation (b) RandomDup

3 4 2

2 1

0 0 221 222 223 224 225 226 227 221 222 223 224 225 226 227 (c) SawTooth (d) EightDup

1 2 n

ln 0.5

n 1

0 0 scaled by 221 222 223 224 225 226 227 221 222 223 224 225 226 227

ns (e) Sorted (f) Reversed 3

2 2 Running time in 1 1

0 0 221 222 223 224 225 226 227 221 222 223 224 225 226 227 (g) Equal

1.5 IPS4o BlockQS 1 BlockLomuto2 stdsort 0.5 BlockLomuto1

0 221 222 223 224 225 226 227

Figure 7: Running time plots on Xeon. The x-axis represents the number of items, the y-axis shows the running time in nanoseconds scaled by n ln n. Algorithm L1/L2 CM CB (% MP) INS WOF (%, cycles) IPS4o .147 / .085 1.080 ( 6.1%) 12.870 2.921 (41.3%, 6.834) BlockLomuto1 .169 / .143 0.926 (11.7%) 16.456 3.001 (38.7%, 7.758) BlockLomuto2 .144 / .121 0.884 (10.5%) 17.467 2.579 (32.7%, 7.878) BlockQS .125 / .102 0.877 (14.7%) 17.566 3.240 (38.7%, 8.354) stdsort .137 / .115 2.128 (29.0%) 11.769 11.509 (74.0%, 15.550)

Table 2: CPU counter measurements for n = 227 items on 600 trials on Permutation. We use the abbreviations “CM” for “Cache Misses”, “INS” for “Instructions”, “CB” for “conditional branch instructions”, “MP” for “mispredicted”, and “WOF” for “cycles without instruction finished”. All values are normalized by n ln n. more branches, since comparisons to the pivot have mizations. For example, it would be nice to inspect a prediction rate of around 50%. how much can be gained from implementing differ- With regard to the instruction count, we see ent block operations such as filling the block or re- that IPS4o makes by far the fewest instructions arranging elements through vectorized SIMD oper- among “branch-free” variants. Somehow surpris- ations. In another line of research, having a simple ingly, the Lomuto-variants incur fewer instructions implementation could allow to formally verify the than BlockQS. We suspect that this is because of correctness of the implementation. Additionally, a their simpler structure that simplifies bookkeeping. theoretical analysis of handling equal elements as Looking at cycles in which no instruction was fin- described in Section 3.3 seems interesting [22]. ished (because of, e.g., memory stalls or branch mis- We remark that a 3-pivot variant of Lomuto’s predictions), Lomuto-based variants have slightly scheme did not give any improvements with regard fewer of them compared to BlockQS; around 33– to observed running times; the same is true for 40% of the total amount of cycles are of this type a two-pivot variant of Hoare’s scheme used in for branch-free variants, while they make up 74% BlockQuicksort. In particular, handling all the of the cycles for stdsort. edges cases correctly made this algorithm very In conclusion, we see that the theoretical dif- complicated. ferences between IPS4o/BlockQS and the Lomuto- Acknowledgments We thank Armin based variants translate into practice, but their eas- Weiss for a useful hint in the two-pivot ver- ier structure make them competitive to BlockQS. sion. We also thank Timo Bingmann who No variant can compete with IPS4o, both in theory provided some source code used in the testing and in practice. framework. Furthermore, we thank the anonymous 6 Conclusion reviewers for their suggestions that helped us in improving the presentation of this paper. This paper introduced simple variants of the Block- Quicksort algorithm by [7] using block-based ver- sions of Lomuto’s partitioning scheme. The imple- References mentation was shown to be competitive in running time to the implementation of [7]. A novel twist [1] Aum¨uller, M.: On the Analysis of Two Fundamen- to the general two-pivot quicksort approach made tal Randomized Algorithms - Multi-Pivot Quick- the proposed two-pivot variant particularly robust sort and Efficient Hash Functions. Ph.D. thesis, with regard to different input distributions. The Technische Universit¨at Ilmenau, Germany (2015), paper presented theoretical properties of the algo- http://www.db-thueringen.de/servlets/DocumentServlet?id=26263 rithms and verified them through experiments. [2] Aum¨uller, M., Dietzfelbinger, M.: Optimal Due to their simple structure, the proposed al- partitioning for dual-pivot quicksort. ACM gorithms are particularly suited to test further opti- Trans. Algorithms 12(2), 18:1–18:36 (2016), http://doi.acm.org/10.1145/2743020 [3] Aum¨uller, M., Dietzfelbinger, M., Klaue, P.: [16] Nebel, M.E., Wild, S., Mart´ınez, C.: Analysis of How good is multi-pivot quicksort? ACM pivot sampling in dual-pivot quicksort: A holistic Trans. Algorithms 13(1), 8:1–8:47 (2016), analysis of yaroslavskiy’s partitioning scheme. Al- http://doi.acm.org/10.1145/2963102 gorithmica pp. 1–52 (2015) [4] Axtmann, M., Witt, S., Ferizovic, D., Sanders, [17] Sanders, P., Winkel, S.: Super scalar sample P.: In-place parallel super scalar samplesort sort. In: Proc. of the 12th Annual European (ipsssso). In: 25th Annual European Sympo- Symposium on Algorithms (ESA’04). pp. 784–796. sium on Algorithms, ESA 2017, September 4- Springer (2004) 6, 2017, Vienna, Austria. pp. 9:1–9:14 (2017), [18] Sedgewick, R.: Implementing quicksort programs. https://doi.org/10.4230/LIPIcs.ESA.2017.9 Commun. ACM 21(10), 847–857 (1978) [5] Bentley, J.L.: Programming pearls. Addison- [19] Wild, S.: https://cs.stackexchange.com/questions/11458/quicksort-partitioning-hoare-vs-lomuto, Wesley (1986) accessed on August 8, 2018. [6] Brodal, G.S., Moruz, G.: Tradeoffs between [20] Wild, S.: Dual-pivot quicksort and beyond: Analy- branch mispredictions and comparisons for sort- sis of multiway partitioning and its practical poten- ing algorithms. In: Proc. of the 9th Interna- tial. Ph.D. thesis, Technische Universit¨at Kaiser- tional Workshop on Algorithms and Data Struc- slautern (2016) tures (WADS’05). pp. 385–395. Springer (2005), [21] Wild, S.: Dual-pivot and beyond: The poten- http://dx.doi.org/10.1007/11534273_34 tial of multiway partitioning in quicksort. it - [7] Edelkamp, S., Weiß, A.: Blockquicksort: Avoid- Information Technology 60(3), 173–177 (2018), ing branch mispredictions in quicksort. In: https://doi.org/10.1515/itit-2018-0012 24th Annual European Symposium on Al- [22] Wild, S.: Quicksort is optimal for many gorithms, ESA 2016, August 22-24, 2016, equal keys. In: Proceedings of the Fifteenth Aarhus, Denmark. pp. 38:1–38:16 (2016), Workshop on Analytic Algorithmics and Com- https://doi.org/10.4230/LIPIcs.ESA.2016.38 binatorics, ANALCO 2018, New Orleans, LA, [8] Elmasry, A., Katajainen, J., Stenmark, M.: USA, January 8-9, 2018. pp. 8–22 (2018), Branch mispredictions don’t affect mergesort. https://doi.org/10.1137/1.9781611975062.2 In: Experimental Algorithms - 11th Interna- [23] Wild, S., Nebel, M.E.: Average case analysis of tional Symposium, SEA 2012, Bordeaux, France, java 7’s dual pivot quicksort. In: Proc. of the 20th June 7-9, 2012. Proceedings. pp. 160–171 (2012), European Symposium on Algorithms (ESA’12). http://dx.doi.org/10.1007/978-3-642-30850-5_15 pp. 825–836 (2012) [9] Hennequin, P.: Analyse en moyenne d’algorithmes: tri rapide et arbres de recherche. Ph.D. thesis, A Running Time Plots for i7 Ecole Politechnique, Palaiseau (1991) The machine i7 was equipped with a 4-core Intel [10] Hennessy, J.L., Patterson, D.A.: Computer Archi- tecture - A Quantitative Approach, 5th Edition. Core i7-4790 clocked at 3.6 GHz, with 8 MB L3 Morgan Kaufmann (2012) Cache and 32GB RAM. i7 was running CentOS [11] Hoare, C.A.R.: Quicksort. Comput. J. 5(1), 10–15 7 with Linux kernel 3.10 and code was compiled (1962) using gcc version 5.1.1. Figure 8 presents an [12] Kaligosi, K., Sanders, P.: How branch mispredic- overview over the running time measurements that tions affect quicksort. In: Proc. of the 14th Annual we obtained. While i7 turns out to be faster than European Symposium on Algorithms (ESA’06). Xeon, differences in running time are comparable pp. 780–791. Springer (2006) [13] Kushagra, S., L´opez-Ortiz, A., Qiao, A., Munro, to Section 5. J.I.: Multi-pivot quicksort: Theory and ex- periments. In: Proc. of the 16th Meeting on Algorithms Engineering and Experiments (ALENEX’14). pp. 47–60. SIAM (2014) [14] Mart´ınez, C., Roura, S.: Optimal sampling strate- gies in quicksort and . SIAM J. Comput. 31(3), 683–705 (2001) [15] Musser, D.R.: Introspective sorting and selection algorithms. Softw., Pract. Exper. 27(8), 983–993 (1997) Permutation RandomDup

3 4

2 2 1

0 0 221 222 223 224 225 226 227 221 222 223 224 225 226 227

SawTooth EightDup 1.5

n 2 ln 1 n

1 0.5 scaled by

ns 0 0 221 222 223 224 225 226 227 221 222 223 224 225 226 227

Sorted Reversed 3

Running time in 2 2

1 1

0 0 221 222 223 224 225 226 227 221 222 223 224 225 226 227

Equal

2 IPS4o 1.5 BlockQS 1 BlockLomuto2 stdsort 0.5 BlockLomuto1

0 221 222 223 224 225 226 227

Figure 8: Running time plots on i7. The x-axis represents the number of items, the y-axis shows the running time in nanoseconds scaled by n ln n.