arXiv:1810.12047v1 [cs.DS] 29 Oct 2018
Simple and Fast BlockQuicksort using Lomuto's Partitioning Scheme

Martin Aumüller, Nikolaj Hass

October 30, 2018

Abstract

This paper presents simple variants of the BlockQuicksort algorithm described by Edelkamp and Weiss (ESA 2016). The simplification is achieved by using Lomuto's partitioning scheme instead of Hoare's crossing pointer technique to partition the input. To achieve a robust sorting algorithm that works well on many different input types, the paper introduces a novel two-pivot variant of Lomuto's partitioning scheme. A surprisingly simple twist to the generic two-pivot quicksort approach makes the algorithm robust. The paper provides an analysis of the theoretical properties of the proposed algorithms and compares them to their competitors. The analysis shows that Lomuto-based approaches incur a higher average sorting cost than the Hoare-based approach of BlockQuicksort. Moreover, the analysis is particularly useful to reason about pivot choices that suit the two-pivot approach. An extensive experimental study shows that, despite their worse theoretical behavior, the simpler variants perform as well as the original version of BlockQuicksort.

1 Introduction

Sorting a sequence of elements in an efficient way is one of Computer Science's fundamental problems. It is one of the most important core routines in many algorithms, and does not seem to lose relevance as we get more and more data to work on every day.

While it has been more than 55 years since the quicksort algorithm was described by Hoare in [11], variants of it are still relevant today. This shows especially when looking at the sorting algorithms implemented in the standard libraries of modern programming languages, such as C++, which uses a worst-case aware quicksort variant called introsort [15], Oracle's Java, which uses a two-pivot quicksort algorithm [23], and Microsoft's C#, which uses a traditional median-of-three quicksort approach.

In the pursuit of improving the performance of quicksort-based implementations, a long line of papers [16, 23, 13, 3, 21] has looked at the benefits of introducing more than one pivot in the algorithm. In particular, [13, 3, 21] showed that the memory access behavior improves with a small number of pivots, an effect that cannot be achieved by other more traditional means such as choosing a pivot from a sample of elements [18]. However, better cache behavior leaves the problem of branch mispredictions unsolved, and these are among the largest bottlenecks in today's processor architectures.

As shown by Brodal and Moruz in [6], mispredicting conditional branches, i.e., mispredicting the outcome of the comparison of two elements of the input, is a necessity in the regime of comparison-efficient sorting algorithms. This means other tools are needed to reduce the performance penalty that these mispredictions incur, which can be as large as 15 cycles on a modern Intel i7 CPU [10], and even larger on other architectures [12]. As we will describe in Section 2, under certain circumstances modern processors can avoid branches using so-called conditional move operations, among other techniques, which were used in a number of implementations of quicksort variants [17, 8, 7, 4]. More detailed information about these methods is deferred to the related work part of the introduction.

In the branch-free regime of [17, 7, 4], a lot of complexity has been introduced to the algorithm. In fact, Axtmann et al. [4] describe code complexity as the main drawback of their approach, stating “formal verification of the correctness of the implementation might help to increase trust in the remaining cases.” [4, p. 12]. The BlockQuicksort approach of Edelkamp and Weiss [7] also has high code complexity, using about 300 lines of code for the partitioning procedure.

The main contributions of this paper are simple, fast, and robust variants of branch-free quicksort-based algorithms. Thus, introducing more complexity is not necessarily required to improve performance. In more detail, the paper presents simpler variants of the BlockQuicksort algorithm described in [7]. The difference from BlockQuicksort lies in utilizing a partitioning scheme that Bentley in [5] attributed to Nico Lomuto, rather than using the traditional “crossing pointer” partitioning scheme by Hoare as in [7]. The description of the algorithms is provided in Section 3. There, we describe a one-pivot and a two-pivot approach. Introducing an additional pivot allows us to make the algorithm robust and to avoid some of the common performance issues of the Lomuto partitioning scheme, such as the presence of many equal elements [7].

Theoretical properties of the proposed algorithms are discussed in Section 4. The result of this analysis allows us, among other things, to reason about good pivot choices for the two-pivot variant. Comparing the Lomuto-based algorithms to the Hoare-based approach used in standard BlockQuicksort, we notice a slightly worse comparison count and a slightly worse memory behavior with regard to the “scanned elements/memory accesses” cost measure discussed in [13, 3, 21]. This makes the results of the experiments in Section 5 quite surprising: they show that the proposed variants can compete with BlockQuicksort. In the same section, we speculate about the reasons for this based on actual measurements from the CPU.

1.1 Related Work

Tuning Quicksort Algorithms. Quicksort implementations (see Section 2 for a description of the quicksort algorithm) usually use insertion sort to sort subproblems that are below a certain size threshold [18]. Moreover, the pivot is usually chosen from a small sample of elements. In practice, it is usually chosen as the median in a sample of three elements, while theory suggests choosing it as the median of Θ(√n) elements to minimize comparisons [14].

Super-Scalar Samplesort. Sanders and Winkel [17] provided an engineered implementation of samplesort, a variant of quicksort in which many pivots are used to distribute the input to buckets. They used conditional moves to traverse the heap-like tree used to find the bucket of an element. The algorithm uses two phases, classification and rearranging elements, and needs extra space of the size of the input.

Tuned Quicksort. Elmasry et al. [8] described an engineered quicksort variant that uses a branch-free variant of Lomuto's partitioning scheme based on conditional moves.

BlockQuicksort. BlockQuicksort, as introduced by Edelkamp and Weiss in [7], generalizes Hoare's partitioning scheme using blocks to handle misplaced elements. Filling these two blocks, one block for each of the two pointers used in Hoare's partitioning scheme, with references to misplaced elements is done branch-free using conditional moves. When these blocks are nearly full, elements are moved into a correct position with regard to the pivot choice. When the two pointers are close at the end of the partitioning phase, the code has to handle many edge cases to produce a correct partition. To this end, BlockQuicksort uses a “clean-up phase” when partitioning is done.

Multi-Pivot Quicksort. The works [9, 16, 23, 13, 1, 2, 3, 21] studied the benefits of using a small number of pivots in quicksort. This line of research showed that this approach improves the memory access pattern drastically [3, 21], an improvement that cannot be achieved by other means such as choosing a good pivot by sampling input elements. In short, multi-pivot quicksort needs fewer accesses to array cells to sort the input, an effect that translates into better running times. The most popular multi-pivot algorithm is the two-pivot YBB algorithm [23] used in Oracle's Java since version 7.

IPS⁴o. IPS⁴o [4] provides an in-place variant of the samplesort of [17] by combining it with the idea of using blocks from BlockQuicksort. Instead of two blocks, it uses a large number of blocks, one for each bucket. These blocks are filled in a branch-free way using the classification strategy from [17]. All elements are moved in blocks and a final rearrangement step is needed to swap blocks to correct positions. Again, a lot of care is needed in the clean-up phase. IPS⁴o lends itself to parallelization, as demonstrated by the experiments in [4].

1.2 Our Contributions

This paper contains the following contributions:

• We present variants of BlockQuicksort using Lomuto's partitioning scheme (Algorithm 4, Algorithm 5) and evaluate them experimentally. Comparing the implementations, the

Algorithm 1 One-Pivot-Quicksort
procedure OnePivotQuicksort(A[1..n])
1: if n > 1 then
2:     choosePivot(A[1..n])    ▷ Pivot resides in A[n]
3:     i ← partition(A[1..n])
4:     OnePivotQuicksort(A[1..i − 1])
5:     OnePivotQuicksort(A[i + 1..n])

Algorithm 2 Lomuto Partitioning Scheme
procedure LomutoPartition(A[1..n])
1: p ← A[n]
2: i ← 1
3: for j ← 1; j < n; j ← j + 1 do
4:     if A[j] < p then
5:         Swap A[i] and A[j]
6:         i ← i + 1
7: Swap A[i] and A[n]
8: return i

[Figure 1 array layout: < p | ≥ p | ? · · · ? | p, with pointers i and j marked below the array.]
Figure 1: Lomuto invariant: A[1..i − 1] consists of elements smaller than p, A[i..j − 1] consists of elements at least as large as p; A[j..n − 1] has not been looked at, which is depicted by filling this part of the array with both colors.

Lomuto's Partitioning Scheme. Traditionally, one uses the crossing pointer technique described by Hoare [11] to partition the input.
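For readers who want to experiment with Algorithm 1 and Algorithm 2, the following is a minimal Python sketch of the two procedures. It is not the authors' implementation: indices are 0-based and inclusive, and `choose_pivot` is a hypothetical helper that uses a median-of-three rule, which is an assumption made here because the pseudocode only requires that the pivot ends up in the last position. This plain version uses an ordinary conditional branch, so it does not show the branch-free techniques discussed in the related work.

```python
def lomuto_partition(a, lo, hi):
    """Partition a[lo..hi] (inclusive) around the pivot stored in a[hi].

    Mirrors Algorithm 2 with 0-based indices and returns the final
    position of the pivot.
    """
    p = a[hi]
    i = lo  # invariant: a[lo..i-1] < p and a[i..j-1] >= p
    for j in range(lo, hi):
        if a[j] < p:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]  # move the pivot between the two parts
    return i


def choose_pivot(a, lo, hi):
    """Hypothetical helper (assumption): median-of-three pivot choice.

    Moves the median of a[lo], a[mid], a[hi] to a[hi], as Algorithm 1
    expects the pivot to reside in the last position.
    """
    mid = (lo + hi) // 2
    median_index = sorted((lo, mid, hi), key=lambda k: a[k])[1]
    a[median_index], a[hi] = a[hi], a[median_index]


def one_pivot_quicksort(a, lo=0, hi=None):
    """Sort a[lo..hi] in place, following the structure of Algorithm 1."""
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        choose_pivot(a, lo, hi)
        i = lomuto_partition(a, lo, hi)
        one_pivot_quicksort(a, lo, i - 1)
        one_pivot_quicksort(a, i + 1, hi)


if __name__ == "__main__":
    data = [5, 2, 9, 1, 5, 6]
    one_pivot_quicksort(data)
    print(data)
```

Because the comparison `a[j] < p` sends every element equal to the pivot to the right-hand side, inputs with many equal keys produce very unbalanced partitions in this sketch, which is the weakness of Lomuto's scheme mentioned in the introduction.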