Robust Scalable Sorting

For the attainment of the academic degree of

Doctor of Natural Sciences

approved by the KIT Department of Informatics of the Karlsruhe Institute of Technology (KIT)

Dissertation

by

Michael Axtmann

Date of the oral examination: 17 May 2021

1st Reviewer: Prof. Dr. Peter Sanders, Karlsruhe Institute of Technology, Germany

2nd Reviewer: Prof. Guy E. Blelloch, Carnegie Mellon University, United States of America

To my parents, my sister, and my wife.

Abstract

Sorting is one of the most important basic algorithmic problems. Thus, it is not a surprise that sorting algorithms are needed in a very large number of applications. These applications are executed on a wide range of different machines—from smartphones with energy-efficient multi-core processors to supercomputers with thousands of machines interconnected by a high-performance network. Since single-core performance has stagnated, parallel applications have become an indispensable part of our everyday lives. Efficient and scalable algorithms are key to taking advantage of this immense availability of (parallel) computing power. In this thesis, we study sequential and parallel sorting algorithms aiming at robust performance for a diverse set of input sizes, input distributions, data types, and machines.

In the first part of this thesis, we study sequential sorting as well as parallel sorting on shared-memory machines. We propose In-place Parallel Super Scalar Samplesort (IPS4o), a new comparison-based algorithm that is provably in-place, i.e., the amount of additional memory does not depend on the size of the input. An essential result is that the in-place property improves the performance of IPS4o compared to similar non-in-place algorithms. Until now, the in-place feature has usually been associated with a loss of speed for most algorithms. IPS4o is also provably cache-efficient and performs O(푛/푡 log 푛) work per thread when executed with 푡 threads. Additionally, IPS4o incorporates a branchless decision tree to minimize the number of branch mispredictions, takes advantage of memory locality, and handles many elements with equal keys—so-called duplicated keys—by separating them into “equality buckets”. For the special case of sorting integer keys, we use the algorithmic framework of IPS4o to implement In-place Parallel Super Scalar Radix Sort (IPS2Ra). We validate the performance of our algorithms in an extensive experimental study involving 21 state-of-the-art sorting algorithms, six data types, ten input distributions, four machines, four memory allocation strategies, and input sizes varying over seven orders of magnitude. On the one hand, the study shows that our algorithms have consistently good performance. On the other hand, it reveals that many competitors have large performance issues: With IPS4o, we obtain a robust comparison-based sorting algorithm that outperforms other parallel in-place comparison-based sorting algorithms by almost a factor of three. In the large majority of the cases, IPS4o is the fastest comparison-based algorithm. At this point, it is irrelevant whether we compare IPS4o to sequential or parallel, in-place or non-in-place algorithms. IPS4o even outperforms competing implementations of integer sorting algorithms in many cases. The remaining cases mainly include uniformly distributed inputs and inputs with keys containing only a few bits. These inputs are usually “easy” for integer sorting algorithms. Our integer sorter IPS2Ra outperforms other integer sorting algorithms for these inputs in the large majority of the cases.

Exceptions are some very small inputs for which most of the algorithms are very inefficient. However, algorithms dedicated to these input sizes are usually much slower for the remaining range of input sizes.

In the second part of this thesis, we study sorting algorithms for distributed systems and their robust scalability with regard to the number of processors, the input size, duplicated keys, and the distribution of input keys to processors. Our main contributions are four robust scalable sorting algorithms that allow us to cover the entire range of input sizes. Three of the four are new fast algorithms that implement low-overhead mechanisms to make them scale robustly regardless of “difficult” inputs, e.g., inputs where the location of the input elements is correlated to the key values—so-called skewed element distributions—or inputs with duplicated keys. The fourth algorithm is simple and may be considered folklore. Previous algorithms for inputs of medium and large size have an unacceptably large communication volume or an unacceptably large number of message exchanges. For these input sizes, we describe a robust multi-level generalization of samplesort that represents a feasible compromise between a moderate communication volume and a moderate number of message exchanges. We overcome these previously incompatible goals with scalable approximate splitter selection and a new data routing algorithm. As an alternative, we present a generalization of mergesort with the advantage of perfect load balance. For small inputs, we design a variant of quicksort that overcomes the problem of skewed element distributions and duplicated keys with fast high-quality pivot selection, element randomization, and low-overhead duplicate handling. Previous practical approaches with polylogarithmic latency either have at least a logarithmic factor more communication or only consider uniform input. For very small inputs, we propose a practical and fast, yet work-inefficient algorithm with logarithmic latency. For these inputs, previous efficient approaches are theoretical algorithms, mostly with prohibitively large constant factors. For the smallest inputs, we recommend an algorithm that sorts the data while the input is routed to a single processor. An important contribution of this thesis to the practical side of algorithm engineering is a communication library that we call RangeBasedComm (RBC). RBC allows an efficient implementation of recursive algorithms with sublinear running time by providing scalable and efficient communication primitives on processor subsets. The RBC library significantly speeds up the algorithms, e.g., one competitor even by more than two orders of magnitude. We present an extensive experimental study involving two supercomputers with up to 262 144 cores, 11 algorithms, 10 input distributions, and input sizes varying over nine orders of magnitude. For all but the largest input sizes, we are the only ones performing benchmarks on these large machine instances. The study also shows that our algorithms have a robust performance and outperform competing implementations significantly. Whereas our algorithms provide consistent performance on all inputs, our competitors’ performance breaks down on “difficult” inputs or they literally break.

German Summary (Deutsche Zusammenfassung)

Sorting is one of the most fundamental algorithmic problems. It is therefore not surprising that sorting algorithms are needed in a large number of applications. These applications run on a wide variety of devices—from smartphones with power-efficient multi-core processors to supercomputers with thousands of machines that are connected by a high-performance network. At the latest since single-core performance stopped increasing significantly, parallel applications have become an indispensable part of our everyday lives. Efficient and scalable algorithms are therefore essential to exploit this immense availability of (parallel) computing power. This thesis investigates how sequential and parallel sorting algorithms can achieve maximum performance in a robust way. To this end, we consider a large parameter range of input sizes, input distributions, machines, and data types.

In the first part of this thesis, we study both sequential sorting and parallel sorting on shared-memory machines. We present In-place Parallel Super Scalar Samplesort (IPS4o), a new comparison-based algorithm that requires only a bounded amount of additional memory (the so-called “in-place” property). An essential insight is that our in-place technique improves the sorting speed of IPS4o compared to similar algorithms without the in-place property. Until now, the property of using only a bounded amount of additional memory has rather been associated with a loss of performance. IPS4o is also cache-efficient and performs O(푛/푡 log 푛) work per thread to sort an array of size 푛 with 푡 threads. In addition, IPS4o takes memory locality into account, uses a branchless decision tree, and employs dedicated partitions for elements with equal keys. For the special case that only integer keys are to be sorted, we reuse the algorithmic framework of IPS4o to implement In-place Parallel Super Scalar Radix Sort (IPS2Ra). We validate the performance of our algorithms in an extensive experimental study with 21 state-of-the-art sorting algorithms, six data types, ten input distributions, four machines, four memory allocation strategies, and input sizes varying over seven orders of magnitude. On the one hand, the study demonstrates the robust performance of our algorithms. On the other hand, it reveals that many competing algorithms have performance problems: With IPS4o, we obtain a robust comparison-based sorting algorithm that outperforms other parallel in-place comparison-based sorting algorithms by almost a factor of three. In the vast majority of cases, IPS4o is the fastest comparison-based algorithm.

Here it does not matter whether we compare IPS4o with algorithms that use a bounded amount of additional memory or additional memory on the order of the input size, or whether they run in parallel or sequentially. In many cases, IPS4o even outperforms competing implementations of integer sorting algorithms. The remaining cases mainly comprise uniformly distributed inputs and inputs whose keys contain only a few bits. These inputs are usually “easy” for integer sorting algorithms. Our integer sorter IPS2Ra outperforms other integer sorting algorithms for these inputs in the vast majority of cases. Exceptions are some very small inputs for which most algorithms are very inefficient. However, algorithms that target these input sizes are usually considerably slower for all other inputs.

In the second part of this thesis, we study scalable sorting algorithms for distributed systems that are robust with respect to the input size, frequently occurring sort keys, the distribution of the sort keys across the processors, and the number of processors. The result of our work is essentially four robust scalable sorting algorithms with which we can cover the entire range of input sizes. Three of these four algorithms are new, fast algorithms that are implemented with little overhead and at the same time scale robustly regardless of “difficult” inputs. Inputs are “difficult”, for example, when many equal elements occur or when the input elements are distributed unfavorably across the processors with respect to their sort keys.

Previous algorithms for medium and large input sizes exhibit an unacceptably large communication volume or exchange messages disproportionately often. For these input sizes, we describe a robust multi-level generalization of samplesort that represents a feasible compromise between the communication volume and the number of exchanged messages. We overcome these previously incompatible goals by means of a scalable approximate splitter selection and a new data redistribution algorithm. As an alternative, we present a generalization of mergesort that has the advantage of perfectly balanced output. For small inputs, we design a variant of quicksort. With little overhead, it avoids the problems of unfavorable element distributions and frequently occurring sort keys by quickly selecting high-quality splitters, assigning the elements to the processors at random, and applying a duplicate-handling scheme. Previous practical approaches with polylogarithmic latency either have a logarithmic factor more communication volume or only consider uniformly distributed inputs without repeated sort keys. For very small inputs, we propose a simple and fast, yet work-inefficient algorithm with logarithmic latency. For these inputs, previous efficient approaches are only theoretical algorithms that mostly have disproportionately large constant factors. For the smallest inputs, we recommend sorting the data while it is routed to a single processor.

An important contribution of this thesis to the practical side of algorithm engineering is the communication library RangeBasedComm (RBC). With RBC, we enable an efficient implementation of recursive algorithms with sublinear running time by providing scalable and efficient communication primitives for subsets of processors.

Finally, we present an extensive experimental study on two supercomputers with up to 262 144 processor cores, eleven algorithms, ten input distributions, and input sizes varying over nine orders of magnitude. With the exception of the largest input sizes, this work is the only one that performs sorting experiments on machines of this size at all. The RBC library speeds up the algorithms, in some cases drastically—one competing algorithm even by more than two orders of magnitude. The study demonstrates that our algorithms are robust and at the same time clearly outperform competing implementations. The competitors that one would normally have considered even crash on “difficult” inputs.


Acknowledgments

First and foremost, I would like to thank my doctoral advisor, Peter Sanders, for his support in so many ways. I am grateful to him for granting me a position in his research team, for giving me advice whenever needed, and for letting me pursue my research interests. Thank you, Peter Sanders and Guy Blelloch, for reviewing my thesis.

Special thanks go to my co-authors Timo Bingmann, Daniel Ferizovic, Emanuel Jöbstl, Sebastian Lamm, Huyen Chau Nguyen, Alexander Noe, Peter Sanders, Sebastian Schlag, Matthias Stumpp, Tobias Sturm, Armin Wiebigke, and Sascha Witt for their collaboration. Many thanks also go to Marisa Kirsch, Darren Strash, and Sascha Witt for proof-reading my thesis. I thank Jesper Larsson Träff and Christian Schulz for inviting me to Vienna for a research visit. I would also like to thank Sascha Witt for sharing an office with me as well as Julian Arz, Tomáš Balyo, Daniel Funke, Demian Hespe, Lorenz Hübschle-Schneider, Sebastian Lamm, and Darren Strash for joint recreational climbing and swimming activities.

I would like to thank all previous and current co-workers for the great years together in this wonderful team—thank you Yaroslav Akhremtsev, Julian Arz, Tomáš Balyo, Veit Batz, Norbert Berger, Timo Bingmann, Anja Blancani, Daniel Funke, Simon Gog, Demian Hespe, Tobias Heuer, Lukas Hübner, Lorenz Hübschle-Schneider, Markus Iser, Moritz Kobitzsch, Sebastian Lamm, Tobias Maier, Vitaly Osipov, Dennis Schieferdecker, Sebastian Schlag, Dominik Schreiber, Christian Schulz, Daniel Seemaier, Jochen Speck, Darren Strash, Marvin Williams, and Sascha Witt.

I gratefully acknowledge the Gauss Centre for Supercomputing (GCS) for providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS share of the supercomputer JUQUEEN [SD15] at Jülich Supercomputing Centre (JSC). GCS is the alliance of the three national supercomputing centres HLRS (Universität Stuttgart), JSC (Forschungszentrum Jülich), and LRZ (Bayerische Akademie der Wissenschaften), funded by the German Federal Ministry of Education and Research (BMBF) and the German State Ministries for Research of Baden-Württemberg (MWK), Bayern (StMWFK), and Nordrhein-Westfalen (MIWF). I gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time on the GCS Supercomputer SuperMUC at Leibniz Supercomputing Centre (LRZ, www.lrz.de). I also gratefully thank all authors who have made their implementations available to the public. Special thanks go to SAP AG, Ingo Mueller, and Sebastian Schlag for making their 1-factor algorithm [SM13] available. This research was partially supported by DFG project SA 933/11-1.

I would like to thank my dear friend Darren Strash—I very much appreciate the friendship that started on our first day working together at the Karlsruhe Institute of Technology. Finally, I send love to my dear sister, my dear parents, and my wonderful wife Marisa—it is only because of you that I became who I am today. Thank you for being there for me when I needed you.


Table of Contents

1 Introduction and Overview
   1.1 Sequential and Shared-Memory Sorting Algorithms
   1.2 Distributed Sorting Algorithms
   1.3 Machine Models
      1.3.1 Random Access Machine (RAM)
      1.3.2 The Parallel Random Access Machine (PRAM)
      1.3.3 The Parallel External Memory (PEM) Model
      1.3.4 The Single-Ported Message Passing Model
      1.3.5 Bulk Synchronous Parallel (BSP) Model
   1.4 General Preliminaries
   1.5 Contributions

I Sequential and Shared-Memory Sorting

2 Overview of Sequential and Shared-Memory Sorting Algorithms
   2.1 Definitions and Preliminaries
   2.2 Quicksort
   2.3 Samplesort
   2.4 Radix Sort
   2.5 (Strictly) In-Place Mergesort

3 Robust Scalable In-Place Sorting Algorithms
   3.1 In-Place Parallel Super Scalar Samplesort (IPS4o)
      3.1.1 Sequential and Parallel Partitioning
      3.1.2 Task Scheduling
   3.2 In-Place Parallel Super Scalar Radix Sort (IPS2Ra)
   3.3 Analysis of IPS4o
      3.3.1 Additional Memory Requirement
      3.3.2 Parallel Complexity and I/O Complexity
      3.3.3 Branch Mispredictions and Local Work

4 Experiments and Conclusion
   4.1 Implementation Details
   4.2 Statistical Evaluation
   4.3 Sequential Algorithms
      4.3.1 Comparison of Average Slowdowns
      4.3.2 Running Times for Uniform Input
      4.3.3 Comparison of Performance Profiles
   4.4 Influence of the Memory Allocation Policy
   4.5 Evaluation of the Parallel Task Scheduler
   4.6 Parallel Algorithms
      4.6.1 Comparison of Average Slowdowns
      4.6.2 Running Times for Uniform Input
      4.6.3 Speedup Comparison and Strong Scaling
      4.6.4 Input Distributions and Data Types
      4.6.5 Comparison of Performance Profiles
      4.6.6 Comparison to IMSDradix
   4.7 Phases of Our Algorithms
   4.8 Conclusion

II Engineering Distributed Sorting Algorithms

5 Overview of Distributed Sorting Algorithms
   5.1 Definitions and Preliminaries
      5.1.1 Communication Primitives
      5.1.2 Multiway Merging and Partitioning
      5.1.3 Multisequence Selection
   5.2 Delivering a Bulk of Data
   5.3 Sorting Algorithms from Tiny to Huge Inputs
      5.3.1 (All)gather-Merge-Sort
      5.3.2 Fast Work-Inefficient Rank
      5.3.3 Bitonic Sort
      5.3.4 Parallel Quicksort
      5.3.5 Multiway Sorting Algorithms
   5.4 More Related Work

6 Robust Scalable Distributed Sorting Algorithms
   6.1 Robust Fast Work-Inefficient Ranking and Sorting (RFIS)
   6.2 Building Blocks for Hypercube Quicksort
      6.2.1 Bound for the Balls into Bins Problem
      6.2.2 Randomized Shuffling on Hypercubes
      6.2.3 Approximate Median Selection with a Single Reduction
   6.3 Robust Quicksort on Hypercubes
   6.4 Delivering 푘 Data Partitions to 푘 PE-Groups
   6.5 Generalizing Multiway Mergesort (RLM-sort)
   6.6 Adaptive Multi-Level Samplesort (AMS-sort)

7 Experiments and Conclusion
   7.1 Implementation Details
      7.1.1 The 푘-way Data Exchange
      7.1.2 Parameter Tuning
   7.2 Methodology
   7.3 Input Size Analysis and Algorithm Comparison
   7.4 Robustness Analysis
      7.4.1 Robustness of Logarithmic Latency Algorithms
      7.4.2 Robustness of RQuick
      7.4.3 Robustness of Multiway Hypercube-Based Sorting
      7.4.4 Robustness of AMS-sort
   7.5 Efficiency in Scaling Experiments
   7.6 Conclusion

A Sequential and Shared-Memory Sorting
   A.1 Limit the Number of Recursions
   A.2 From In-Place to Strictly In-Place
   A.3 More Measurements

B On the Performance of MPI Libraries
   B.1 Startup Overheads of Communication Patterns
   B.2 Faster Sorting with MPI—Communicators and Collectives

List of Figures
List of Tables
List of Theorems
Publications and Supervised Theses
Bibliography


Chapter 1 Introduction and Overview

In this thesis, we examine how to efficiently sort data on shared-memory and distributed-memory machines. Our target is to robustly scale sorting up to the largest shared-memory machines and distributed systems with low overhead in theory as well as in practice—even for “difficult” worst-case inputs. Before we address these challenges in the following chapters, we present our motivation. Afterwards, we describe the machine models and general preliminaries used in this thesis. We conclude the chapter with an overview of our contributions.

“Many computer scientists consider sorting to be the most fundamental problem in the study of algorithms.” — INTRODUCTION TO ALGORITHMS [Cor+09]

Roughly speaking, sorting is the task of bringing elements into a specific, generally non-decreasing, order. Sorting algorithms are very important for many applications. The applications may inherently need a subroutine to obtain sorted data. They may also take advantage of the property that data is sorted or that similar data is stored logically or physically close together. Therefore, it is not surprising that reference books on sequential and parallel algorithms often devote a separate chapter to sorting [Cor+09; Kum+94; San+19]. These applications demand fast sorting algorithms for a diverse spectrum of machine architectures as well as for a wide range of inputs in regard to data type, input size, and distribution of key values and elements. Nowadays, applications are executed on a wide range of different machine architectures. About fifteen years ago, the CPU frequency stopped increasing while Moore's law—the number of transistors on a processor increases exponentially—still applies. Since then, the performance of a system has mainly been improved by increasing the number of cores in CPUs, the number of CPUs in the machines, and the number of machines in supercomputers and clusters. For example, workstation CPUs have evolved from one core in 2015 to 64 cores in 2020 [AMD15; AMD20]. At the same time, the number of cores of the largest supercomputers has risen by almost two orders of magnitude to more than ten million cores assembled in the Sunway TaihuLight supercomputer [top05; top20]. Even in everyday life, we have access to smartphones with half a dozen cores [App20]. These improvements can satisfy the need to solve tasks faster and process data in larger amounts. Still, efficient and scalable parallel algorithms designed for these computer architectures are required. In this thesis, we study sequential sorting as well as parallel sorting on shared-memory machines in Part I, and sorting on machines with distributed memory in Part II.


1.1 Sequential and Shared-Memory Sorting Algorithms

In Part I of this thesis, we study the problem of sorting an array of 푛 elements sequentially as well as in parallel on shared-memory machines with 푡 threads. Since sorting is considered to be one of the most fundamental algorithmic problems, hundreds of publications have studied how to make sorting algorithms fast. At the time of writing, more than four hundred technical reports in the field of computer science on the open-access repository arxiv.org contain the term “sort” in the title. At first glance, this seems to be a surprise, since simple asymptotically optimal algorithms have been known for more than 50 years, e.g., mergesort [Knu73], which is optimal in the worst case, or quicksort [Hoa62], which is optimal with high probability for random pivots. However, this is only the beginning in the quest for the fastest sorting algorithm. Today's modern machines provide plenty of opportunities for optimization, and many hardware features, such as the cache [SW04; BFV07; Fra04; BGS10; KDD17], instruction parallelism [SW04; Cor20; HWF15; PR14], branch prediction [SW04; KS06a; EW16; Yar10; Ska16a; BES17], multi-core [TZ03; SSP07; BGS10; Shu+12; Rei07; Obe+19b; MG89; HNR90; FP92; HWF15; PR14; KDD17; BES17], and virtual memory [JM14; WS11], have been studied in the context of sorting.

On the one hand, it is somewhat surprising that quicksort is the algorithm of choice in most libraries, e.g., in standard libraries of programming languages such as Java and C++, or in parallel programming libraries such as TBB [Rei07]. On the other hand, even simple implementations of quicksort are somewhat cache-efficient (although quicksort still reads each element a logarithmic number of times), have an asymptotically optimal running time with high probability, and are in-place—an important feature for machines with limited memory. Furthermore, quicksort can be parallelized [SSP07; Rei07] and sophisticated implementations can reduce branch mispredictions significantly [KS06b; EW16]. Although we have not found an implementation of quicksort that provides all of these features, it seems difficult to beat quicksort on a similar or larger feature set. For example, mergesort can take advantage of inputs with presorted sequences. In the remaining cases, it turns out that mergesort is slower in practice. This is in particular the case for in-place mergesort. Samplesort either exploits instruction parallelism and minimizes branch mispredictions [SW04] or runs in parallel [Ble+96; BGS10; Shu+12]. Radix sorting algorithms can exploit fast element classification, e.g., for (nearly) uniformly distributed input. However, these algorithms pay the price of accepting imbalanced buckets in the remaining cases and only work for a limited set of data types. That said, we did not find an algorithm that outperforms quicksort in all circumstances.

With this thesis, we contribute the cache-efficient In-place Parallel Super Scalar Samplesort (IPS4o) algorithm, which exploits instruction parallelism and simultaneously reduces branch mispredictions significantly. Even though we focus our study on the more general case of comparison-based sorting, we also contribute In-place Parallel Super Scalar Radix Sort (IPS2Ra) and compare it to state-of-the-art radix sort algorithms.
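To make the branch-misprediction aspect concrete, the following sketch shows a generic branchless classification step of the kind used by super scalar samplesort and described in detail in Section 2.1. It is our own illustration and not the IPS4o source code; names and types are chosen for exposition only.

```cpp
#include <cstddef>
#include <vector>

// Sketch of branchless element classification against k - 1 sorted splitters
// stored as an implicit binary search tree (1-based: node i has children 2i
// and 2i + 1; tree[0] is unused). k is assumed to be a power of two, so the
// loop runs log2(k) times. The comparison result (0 or 1) is added to the
// index, which compilers typically translate into a conditional move rather
// than a branch, so branch mispredictions are avoided.
template <typename T>
std::size_t classify(const std::vector<T>& tree, std::size_t k, const T& e) {
    std::size_t i = 1;
    while (i < k) {
        i = 2 * i + static_cast<std::size_t>(tree[i] < e);
    }
    return i - k;  // bucket index in [0, k)
}
```

Production implementations additionally unroll this loop and classify several elements at a time to increase instruction parallelism, as discussed in Section 2.1.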


1.2 Distributed Sorting Algorithms

In Part II and Appendix B of this thesis, we study the problem of sorting 푛 input elements evenly distributed among 푡 processors of a distributed-memory machine. On large machines, the network bottleneck becomes a significant issue. For the analysis, we thus have to consider message startup latencies and communication volume on the critical execution path. On real-world machines, the time for a message startup is significantly larger than the transfer time for a machine word. It turns out that sorting algorithms trade off message startups and communication volume differently. As a result, each algorithm is only suitable for a specific range of input sizes.

Although hundreds of papers on parallel sorting have been published, there is only a small number of practical studies that consider the largest machines with many thousands of processors (PEs). The most studied distributed sorting algorithms are single-level algorithms [Ble+96; Var+91; KK93; SSP07; SK10; SMB13; HKS19]. Single-level algorithms select 푡 − 1 splitters, send them to all processors, partition the local data into 푡 pieces, and redistribute them using an all-to-all data exchange. Since every PE receives at least the 푡 − 1 splitters, these algorithms require a total input of Ω(푡^2/log 푡) elements to be efficient. In this case, each processor may exchange Ω(푡/log 푡) messages when the data exchange algorithm sends the pieces directly to the target processors. Thus, the constant factor of the term Ω(푡^2/log 푡) may become very large. This makes single-level algorithms only efficient for very large inputs. For smaller input sizes, we need faster sorting algorithms since 푡-way sorting is not efficient and scalable enough.

The task to bring “similar” data together is important for many applications on distributed systems. For example, sorting on space filling curves [Bad13] is one approach to partition computational domains and can be much more efficient than graph-based partitioning algorithms [Den03]. Sorting may then become the limiting factor when scaling to thousands of processors, e.g., when a dynamic scheduler of simulations uses space filling curves for frequent load rebalancing. For these applications, the number of simulation steps per time step is the important performance metric and the input per processor can be very small [LBH07]. Thus, fast sorting algorithms for small inputs are required—even when scalability is beyond question. A special case for sorting small inputs is the distributed collective operation MPI_Comm_split, which basically means sorting inputs with one element per processor [SW11].

Asymptotically optimal algorithms have been intensively studied in theory. For example, Cole's mergesort [Col88] sorts inputs with one element per PE in time O(log 푡). However, this comes with large constant factors in practice. Simple algorithms with polylogarithmic message startup latency are variants of hypercube quicksort [Wag87; LM92; SMB13] and bitonic sort [SMB13]. However, quicksort is fast on only a narrow range of small input sizes since it exchanges the data a logarithmic number of times. Bitonic sort even exchanges the data O(log^2 푡) times. So-called 푟-level sorting algorithms [KK93; GV94; Goo99; SMB13; HKS19] try to close the gap between algorithms for very large inputs with O(푡) message exchanges and algorithms for small inputs with polylogarithmic startup latency but Ω(log 푡) data exchanges. The lower bound for 푟-level sorting algorithms is Ω(푟푛/푡 + 푟푡^(1/푟)). However, a straightforward implementation has O(푟푛/푡) communication volume but Ω(푡) message startups in the worst case [KK93]. Implementations that restrict data exchanges to communication partners sharing the same hypercube edge can guarantee O(푟푡^(1/푟)) message startups. However, these algorithms require 푡 to be a power of two, and the communication volume increases to O(푟푛/푡^(1/2)) in the worst case—even when the splitter elements partition the data into buckets of equal size. As a consequence, most of these algorithms are slow or crash for “difficult” inputs, i.e., when many elements have equal keys (duplicated keys) and when the location of the input elements is correlated to the key values (skewed element distributions). Unfortunately, these disadvantages also apply to variants of hypercube quicksort. Consequently, these algorithms are mostly studied for random inputs.

In Part II of this thesis, we consider practical massively parallel sorting algorithms for large inputs down to inputs where the number of elements is smaller than the number of processors. We propose four robust sorting algorithms that cover the entire range of input sizes. We consider robustness in regard to scalability, i.e., the running time as a function of the input size and the number of processors, and in regard to the presence of duplicated keys as well as skewed input distributions.

1.3 Machine Models

In this section, we describe the well-known sequential machine model random access machine (RAM) [vNeu45; SS63; CR72] as well as several parallel machine models. Parallel machine models are further divided into models of shared-memory machines and distributed-memory machines. The simplest shared-memory machine model used in this work is the parallel random access machine (PRAM) [FW78]. The parallel external memory (PEM) model [Arg+08] is an extension of the PRAM to two memory levels. The single-ported message passing model describes a distributed-memory machine with point-to-point message exchanges. The bulk synchronous parallel (BSP) model [Val90] is a more abstract model that exchanges messages bulk synchronously. Sanders et al. [San+19] describe and discuss these models in detail.

1.3.1 Random Access Machine (RAM)

The RAM is a sequential machine with a single processor and uniform main memory. The main memory is a storage with an unbounded number of available machine words. The processor has direct access to a small number of registers. The registers store machine words that are input and output of basic operations executed by the processor. A machine word stores an integer value whose absolute value is bounded by a polynomial in the input size of a program. This allows a machine word to be used to index a main memory whose size is polynomial in the input size. The processor accesses the main memory with load and store operations. A random access machine executes a program that is basically a sequence of basic operations, i.e., arithmetic operations, comparison operations, logical operations, bitwise operations, (un)conditional branches, and load and store operations. For each executed operation, we account for a single time step. The running time, often just called “time”, of a program is the number of operations required until the program terminates.


1.3.2 The Parallel Random Access Machine (PRAM)

The PRAM is the equivalent of the RAM for shared-memory parallel computing. It consists of 푡 PEs sharing a common main memory. Each PE executes the same program in synchronized time steps. The execution order of the operations executed by the PEs may vary due to conditional branches that depend on the data, the PE's rank, and the total number of PEs. The PRAM model is divided into several submodels. The EREW-PRAM allows exclusive reads from and exclusive writes to the main memory. The CREW-PRAM allows concurrent reads from and exclusive writes to the main memory. In the real world, machines support something similar to concurrent reads, although the read value may not be the globally last written value. Exclusive accesses to the same memory location must be coordinated, e.g., by serialization in O(푡) time or by combining the machine words in O(log 푡) time. The CRCW-PRAM is the most powerful machine since it allows both concurrent reads and concurrent writes. Conflict resolution rules define the semantics of concurrent writes. For example, the resolution rule “common” only allows writes to the same cell if all PEs writing to that cell write the same value. Another resolution rule is the “arbitrary” semantic, which stores the value of an arbitrary PE among all PEs attempting to write to the same cell at the same time. The time measure of the PRAM model is the asymptotic number of time steps of the PE that finishes the program last. We call the sequence of instructions executed by this PE the critical path. In real-world machines, access to the main memory happens in a rather asynchronous way. This makes it difficult to coordinate concurrent memory accesses, e.g., when a single memory access is split into several execution stages or when a series of operations shall exclusively operate on a part of the memory without interference from other PEs. Atomic operations overcome this problem for commonly used short series of operations. An atomic operation is executed as one indivisible step. This guarantees that the required values in the main memory are accessed exclusively by this operation, without interference.

1.3.3 The Parallel External Memory (PEM) Model

The PEM model is a cache-aware extension of the parallel random-access machine. This model is used to analyze parallel algorithms if the main issue is the number of accesses to the main memory. In the PEM model, each of the 푡 PEs has a private cache of size 푀, and access to the main memory happens in memory blocks of size 퐵. The I/O complexity of an algorithm is the number of parallel memory block transfers (I/Os) between the main memory and the private caches on the critical path. In this work, we use the term cache-efficient as a synonym for I/O-efficient when we want to emphasize that we consider memory block transfers between the main memory and the private caches. The PEM model supports the same access policies to the main memory as the PRAM model. Namely, we get the submodels EREW-PEM, CREW-PEM, and CRCW-PEM. Concurrent writes can use the conflict resolution rules as proposed for the CRCW-PRAM model. Similarly, exclusive accesses to the same memory location must be coordinated, e.g., by serialization, which accounts for O(푡) I/Os, or by combining blocks, which accounts for O(log 푡) I/Os.


For a given machine instance of the PEM model, the parameters 푀 and 퐵 are constants. Most of the time, however, we treat 푀 and 퐵 as variables in our asymptotic analysis in order to expose the effects of block size and cache limitations. We adopt an asynchronous variant of the PEM model based on the asynchronous CRCW PRAM by Sanders et al. [San+19, p. 38-40] (also see [Gib89]). The asynchronous variant allows atomic instructions such as fetch-and-add on variables that are shared with other threads. In our model, we charge 푡 I/Os if a thread atomically accesses a shared variable. We avoid additional delays due to false sharing by allocating at most one shared variable to each memory block.
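As an illustration of the kind of shared-variable access this cost model charges for, the sketch below performs an atomic fetch-and-add on a counter that is padded to a full memory block, so that no other shared variable lands in the same block. This is our own illustration, not code from IPS4o, and the 64-byte block size is an assumption (a typical cache-line size), not a parameter taken from this thesis.

```cpp
#include <atomic>
#include <cstdint>

// A shared counter that occupies its own memory block so that concurrent
// updates by different threads do not cause false sharing with other data.
struct alignas(64) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

// Each thread can claim a unique work item index. In the asynchronous PEM
// variant described above, every such atomic access to a shared variable is
// charged t I/Os.
std::uint64_t claim_next(PaddedCounter& counter) {
    return counter.value.fetch_add(1, std::memory_order_relaxed);
}
```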

1.3.4 The Single-Ported Message Passing Model

A common abstraction of communication in supercomputers is the single-ported message passing model. This model describes a distributed-memory machine consisting of 푡 random access machines connected by a communication network. Each PE 푖 executes the same program and knows its rank 푖 ∈ [0 .. 푡) as well as the total number of PEs. The PEs exclusively communicate with point-to-point communication. It takes time 훼 + 푚훽 to send a message of size 푚 machine words. The parameter 훼 defines the message startup overhead of the communication. The parameter 훽 defines the time to communicate one machine word. Multiple messages can be transferred at the same time. However, each PE can send and receive at most one message at the same time. For a given machine instance of the message passing model, the parameters 훼 and 훽 are constants. Most of the time, however, we treat 훼 and 훽 as variables in our asymptotic analysis in order to expose the effects of latency and bandwidth limitations. For simplicity, the single-ported message passing model ignores network hierarchies and assumes that 훼 ≫ 훽 ≫ 1, where our unit is the time for executing a simple machine instruction. We also assume that the size of a machine word is equivalent to the size of a data element. The cost measure of this message passing model is the asymptotic number of time steps executed by the PE that terminates last—the so-called critical path. The critical path consists of the following measures: (1) the “local work” executed by the PE, (2) its time for the actual message transfers, and (3) the time it spends waiting until its sending and receiving communication partners have posted the communication requests.

In reality, the time of a message transfer may also depend on many other parameters, e.g., the network hierarchy of the communication partners relative to each other and the communication protocols used by the communication library. For example, the Intel MPI library 2020 provides two protocols: The eager protocol has a small latency since it sends the message immediately, independent of the receiver status. This protocol is usually used for small messages since the receiver has to provide interim buffers. The rendezvous protocol only starts the actual message transfer after the receiver has acknowledged an available receive buffer. Thus, this protocol has a larger latency but can be used for arbitrary message sizes. We evaluate the performance of the Intel MPI library 2020 on the SuperMUC-NG [Lei18] supercomputer. The computation nodes of SuperMUC-NG are bundled into so-called islands. Within an island, the nodes are connected by a fat tree network. At the level of islands, the overall communication bandwidth is pruned by a factor of four. Figure 1.1 shows the running times of message transfers with different message lengths. We obtained the measurements by performing ping-pong message transfers between two communication partners for two seconds. Before the actual measurements, we executed the same message transfers as a warmup.


Figure 1.1: Running time of ping-pong benchmarks between two fixed PEs on SuperMUC-NG. (The two panels plot the average time per message in 휇푠 against the message length 푚 in bytes for intra-island and inter-island communication; the black lines show the linear fits 0.00027푚 + 2.7 for the eager protocol and 0.00009푚 + 20.7 for the rendezvous protocol.)

Both the intra-island communication and the inter-island communication use the eager protocol for messages up to 32 kB and the rendezvous protocol otherwise. The two black lines in Figure 1.1 fit the inter-island communication cost to the single-ported message passing model for both protocols separately with linear regression. According to the fitting, the eager protocol has a latency of 2.71 휇푠 and a bandwidth of 3.6 GiB/s, whereas the rendezvous protocol has a latency of 20.73 휇푠 and a bandwidth of 11.1 GiB/s. The theoretical peak bandwidth is 12.5 GiB/s. Inter-island and intra-island communication take about the same time with the rendezvous protocol. However, when we consider the eager protocol, we see that intra-island communication has an 86.90 % smaller latency than inter-island communication. The fittings of the two protocols to the single-ported message passing model predict the cost of the message transfers shown in Figure 1.1 very precisely. For example, the two models of inter-island communication have an error of less than 12.50 % (eager protocol) and 2.96 % (rendezvous protocol), respectively. Unfortunately, a single model for inter-island communication is much less accurate, e.g., it has an error of up to 55.45 %. When we also include intra-island communication, the single-ported message passing model fits our measurements even less well. We also observed high fluctuations in running time when we simultaneously exchanged data with a lot of different communication partners on various supercomputers. For more information, we refer to Appendix B.1.
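For reference, measurements of this kind can be gathered with a few lines of MPI. The sketch below is our own minimal ping-pong loop and not the exact benchmark code used for Figure 1.1; the message size and repetition count are illustrative.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Minimal ping-pong timing sketch between ranks 0 and 1. Half of the average
// round-trip time estimates alpha + m*beta for a message of m bytes.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int m = 1 << 15;   // message length in bytes (illustrative)
    const int reps = 1000;
    std::vector<char> buf(m);

    // The first pass serves as a warmup; only the second pass is reported.
    for (int pass = 0; pass < 2; ++pass) {
        double start = MPI_Wtime();
        for (int i = 0; i < reps; ++i) {
            if (rank == 0) {
                MPI_Send(buf.data(), m, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf.data(), m, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf.data(), m, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf.data(), m, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;
        if (rank == 0 && pass == 1) {
            std::printf("average one-way time: %g us\n",
                        0.5e6 * elapsed / reps);
        }
    }
    MPI_Finalize();
    return 0;
}
```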

1.3.5 Bulk Synchronous Parallel (BSP) Model

Many algorithms are bulk synchronous. Such algorithms are often described in the framework of the BSP model. Algorithms in the BSP model execute rounds of local computation, each followed by a synchronized bulk data transfer. In the local computation phase, PEs can post send requests.

In the data transfer phase, the posted messages are then delivered to their target PEs. This communication is typically considered as one-sided communication, meaning that the receiving PEs do not need to post receive requests. Let BSP(ℎ) = 푙 + ℎ푔 denote the time needed to execute a data exchange step where no PE receives or sends more than ℎ words in total. The gap 푔 defines the number of computation steps for transferring a single word. The term 푙 represents the startup overhead for the data exchange algorithm and the time for its synchronization.
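For instance, a single superstep in which no PE sends or receives more than ℎ = 푛/푡 words costs BSP(푛/푡) = 푙 + (푛/푡)푔, and an algorithm consisting of 푟 such supersteps costs 푟푙 + 푟(푛/푡)푔; these numbers are purely illustrative instantiations of the definition above.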

1.4 General Preliminaries

We use the notation [푎 .. 푏] as a shorthand for the ordered set {푎, … , 푏}, and [푎 .. 푏) for {푎, … , 푏 − 1}. Similarly, [푎푖 .. 푎푗] is a shorthand for {푎푖, … , 푎푗}, and [푎푖 .. 푎푗) is a shorthand for {푎푖, … , 푎_{푗−1}}. We also use log 푥 for log_2 푥 and ln 푥 for log_푒 푥. The symbol ℕ stands for the natural numbers including zero. We use the notation P[⋅] to express probabilities.

Pseudocode

We use pseudocode in the description of algorithms rather than a normal programming language. Our pseudocode is an abstraction of imperative programming languages such as C++, Java, and Pascal. We use mathematical formulas and symbols to present calculations in a compact and legible way. Sometimes we simplify algorithms so much that we describe function calls textually. For example, we may write “select a random sample of size 푛 from set 퐴”. Parallel algorithms are described in the single-program multiple-data programming model. In this model, the same pseudocode is executed in parallel by multiple PEs. Each PE has access to its index and to the total number of PEs. When we describe shared-memory algorithms, the PEs may access shared data structures, and distributed algorithms communicate via message passing. To increase readability, we also allow distributed algorithms to access non-local memory. For example, PE 푖 may execute the command

∑_{푗 ∈ [0 .. 푖)} 푎 @ PE 푗

which adds up the values of the variable 푎 stored on PEs [0 .. 푖). We use this notation only when these commands can be executed efficiently with a single collective operation.
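For instance, this particular command is an exclusive prefix sum over the PEs and maps to a single MPI collective. The following sketch is our own illustration (not code from the RBC library) and assumes one value of type double per PE.

```cpp
#include <mpi.h>

// Each PE contributes its local value a; afterwards, PE i holds the sum of
// the values of PEs 0..i-1, i.e., the quantity written in the pseudocode
// notation above. (On PE 0 the MPI standard leaves the result undefined, so
// we return 0 there.)
double exclusive_prefix_sum(double a, MPI_Comm comm) {
    double result = 0.0;
    MPI_Exscan(&a, &result, 1, MPI_DOUBLE, MPI_SUM, comm);
    int rank;
    MPI_Comm_rank(comm, &rank);
    return rank == 0 ? 0.0 : result;
}
```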

Chernoff Bound

A Bernoulli experiment is a random {0, 1}-variable 푋 with “success” probability P[푋 = 1] = 푡 and “failure” probability P[푋 = 0] = 1 − 푡. In its most generic form, the Chernoff inequality bounds the probability that a sequence of independent Bernoulli experiments deviates from the expected number of successes. Here, we consider a specific Chernoff bound [MU05] for Bernoulli trials. A Bernoulli trial is a set of independent random variables [푋1 .. 푋푛], each having the same success probability P[푋푖 = 1] = 푡. The Chernoff inequality bounds the probability that the Bernoulli trial deviates from the expected number of successes 푡푛 by a factor of at least 1 + 훿, for any 훿 > 0:

P[∑푖 푋푖 ≥ (1 + 훿)푡푛] ≤ 푒^(−min(훿, 훿^2)푡푛/3).
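As a purely illustrative instance of this bound: for 푛 = 1000 trials with success probability 푡 = 1/2 and 훿 = 0.2, the expected number of successes is 푡푛 = 500, and the bound limits the probability of seeing at least 600 successes.

```latex
% Illustrative numbers, not taken from this thesis:
% n = 1000, t = 1/2, delta = 0.2, so tn = 500 and (1 + delta)tn = 600.
\[
  \Pr\Big[\sum_i X_i \ge 600\Big]
    \le e^{-\min(0.2,\,0.04)\cdot 500 / 3}
    = e^{-20/3} \approx 1.3 \cdot 10^{-3}.
\]
```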

Parallel Algorithm Analysis

Let 푇(푛, 푡) be the asymptotic running time of a parallel algorithm A with an input of size 푛 executed by 푡 PEs. Also, let 푇(푛) be the running time of the fastest sequential algorithm B that solves the same problem. Then, the absolute speedup (speedup) of A is defined by 푆(푛, 푡) = 푇(푛)/푇(푛, 푡), and the efficiency of A is defined by 퐸(푛, 푡) = 푆(푛, 푡)/푡. Furthermore, we denote the total overhead function of algorithm A as 푇표(푛, 푡) = 푡 푇(푛, 푡) − 푇(푛). The isoefficiency function [Kum+94] of an algorithm A describes the relation of the input size to the number of PEs such that the efficiency of A remains constant when the number of PEs increases. To determine the isoefficiency function of A, we require

푇표(푛, 푡) ∈ O(푇(푛)). Roughly speaking, we require that the total overhead of A does not dominate the asymptotic running time of B. The isoefficiency function 퐼(푡) then describes 푛 as a function of 푡 such that this requirement is fulfilled, i.e., we need 푇표(퐼(푡), 푡) ∈ O(푇(퐼(푡))). Thus, when we use 푡′ instead of 푡 PEs and the speedup of A shall increase by a factor of 푡′/푡, we must increase the input size by a factor of 퐼(푡′)/퐼(푡).
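To make the definition concrete, here is a small, hypothetical example (the algorithm and its running time are invented for illustration only): assume a parallel algorithm with 푇(푛, 푡) = 푛/푡 + log 푡 and a sequential baseline with 푇(푛) = 푛.

```latex
% Hypothetical example: T(n,t) = n/t + log t and T(n) = n.
\[
  T_o(n,t) = t\,T(n,t) - T(n) = t \log t,
  \qquad
  T_o(n,t) \in O(T(n)) \iff n \in \Omega(t \log t),
\]
% hence the isoefficiency function is I(t) = t log t: to keep the efficiency
% constant when increasing the number of PEs from t to t', the input size
% must grow by a factor of I(t')/I(t).
```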

1.5 Contributions

In this thesis, we present contributions to the field of sequential, shared-memory, and distributed sorting algorithms. The following list provides the main contributions to sequential algorithms and algorithms for shared-memory systems. The contributions are based on the conference article Ref. [Axt+17c] and the technical report Ref. [Axt+20].

• Algorithm IPS4o. The comparison-based sorting algorithm In-place Parallel Super Scalar Samplesort provides the following features: IPS4o works sequentially as well as in parallel, is provably in-place, cache-efficient, and has work O(푛/푡 log 푛) per thread. The algorithm avoids branch mispredictions and is robust against a large number of equal keys with low overhead.


• Algorithm IPS2Ra. The sorting algorithm In-Place Parallel Super Scalar Radix Sort is a proof-of-concept implementation of most-significant-digit (MSD) radix sort [Fri56] using the in-place partitioning framework we developed for IPS4o.

• Experiments. We validate the performance in extensive experiments considering a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude.

• Experimental results of IPS4o. The algorithm outperforms all competing implementations in most cases. In many of these cases, the performance difference is large. For example, IPS4o outperforms its fastest parallel non-in-place and in-place comparison-based competitors on average by a factor of 2.00 and 2.53, respectively, for 64-bit floating-point inputs of medium and large size. When IPS4o is slower than a competitor, this is only by a small percentage in the overwhelming majority of cases.

• Experimental results of IPS2Ra. The radix sort algorithm complements the results obtained for IPS4o by being even faster in the sequential case and in some cases involving few processor cores and small, uniformly distributed keys. IPS2Ra also outperforms other radix sorters in many cases. For example, IPS2Ra outperforms its fastest sequential and parallel radix sort competitors on average by a factor of 1.15 and 1.54, respectively, for 64-bit integer inputs of medium and large size.

The following list provides the main contributions to algorithms for distributed systems. The contributions are based on the publications Ref. [Axt+15a; AS17; AWS18].

• Data redistribution DMA [Axt+15a]. Let 푡 = 푘푟 with 푘, 푟 ∈ ℕ be the number of PEs. Assume that each PE has partitioned its 푛/푡 local elements into 푘 pieces and assume that the total size of pieces with number 푖 is 푛/푘. We implement the data redistribution algorithm Deterministic Message Assignment (DMA) which moves pieces with number 푖 to PEs [푖푡/푘 .. (푖 + 1)푡/푘). The calculation of the communication partners takes time O(훼 log 푡 + 훽푘). During the actual data exchange step, each PE sends and receives 푛/푡 elements and O(푟) messages.

• Algorithm AMS-sort [Axt+15a]. The implementation of the 푟-level algorithm Adaptive Multi-level Samplesort works for arbitrary 푡, guarantees that the PEs always have less than (1 + 휖)푛/푡 local elements with tuning parameter 휖, and has isoefficiency function 푡^(1+1/푟)/log 푡, assuming 훼, 훽, 푟, and 휖 to be constants. In total, AMS-sort sends and receives at most 푟(1 + 휖)푛/푡 elements and O(푟푡^(1/푟)) messages per PE in the data redistribution and requires only 푟^2푡^(1/푟)/휖 splitter candidates. For the case that perfectly balanced output is required, we offer Recurse Last Multiway Mergesort (RLM-sort) with isoefficiency function 푡^(1+1/푟) log 푡.


algorithm, a low overhead measure to handle many equal keys, and input randomization to robustly sort skewed element distributions. We conjecture an isoefficiency function 푡 log 푡 and running time

푛 O‹훼 log2 푡 + 훽 log2 푡 + log 푛 . 푡

Robust Hypercube Quicksort+ (RQuick+), a modification of RQuick, provides a splitter selection that is more complex, but offers splitter quality guarantees in return. For inputs without duplicates, RQuick+ runs in time

O(훼 log^2 푡 + 훽 (푛/푡) log 푡 + (푛/푡) log 푛)

with high probability.

• Algorithm RFIS [AS17]. Robust Fast Work-Inefficient Sorting is a simple sorting algorithm with a low-overhead measure to handle many equal keys and has running time

O(훼 log 푡 + 훽 푛/푡^(1/2) + (푛/푡^(1/2) + 푛/푡) log 푡).

• RBC library [AWS18]. For an efficient implementation of sublinear time algorithms, we propose the lightweight library RangeBasedComm (RBC) based on MPI. RBC provides asymptotically time-optimal implementations of collective communication primitives on range-based communicators and communicator creation in constant time without synchronization and communication.

• Speedups with RBC. We use RBC to implement our algorithms. We also applied RBC to our competitors' algorithms: RBC improves the performance of hypercube quicksort from Sundar et al. [SMB13] by more than one order of magnitude. RBC also improves the multi-level algorithms HykSort [SMB13] and HSS [HKS19] by a factor of more than 5 and by more than two orders of magnitude, respectively.

• Experiments [AS17]. We compare our algorithms directly with several state-of-the-art implementations considered in recent studies [SW11; SMB13; HKS19] on two supercomputers with up to 262 144 cores, ten input distributions, and input sizes varying over 9 orders of magnitude.

• Input size evaluation [AS17]. The experimental results confirm our hypothesis that four sorting algorithms for different input sizes can be used to span the entire parameter space: AMS-sort for medium-sized and large inputs, RQuick with polylogarithmic latency for small inputs, RFIS with logarithmic latency for very small inputs and inputs with fewer elements than PEs, and a simple algorithm that sorts while data is routed to a single PE for even smaller inputs.


• Robustness evaluation and algorithm comparison [AS17]. Experiments with ten input distributions validate that our algorithms are robust against skewed input distributions and repeatedly occurring keys. For small inputs, RQuick is faster than hypercube quicksort from Sundar et al. [SMB13] by more than one order of magnitude. Classical bitonic sorting [Bat68; Joh84] is somewhat competitive only for a rather narrow range of input sizes. AMS-sort is the overall “winner” for medium-sized and large inputs: For some large inputs, HykSort [SMB13] is slightly faster than AMS-sort, and HSS from Harsh et al. [HKS19] is competitive. However, HykSort and HSS are less robust, i.e., they are much slower than AMS-sort or even crash for skewed inputs and inputs with duplicates. The data redistribution algorithm DMA speeds up AMS-sort by up to three orders of magnitude for “hard” inputs compared to previous data redistribution algorithms proposed for multi-level sorting algorithms that also work with arbitrary 푡 [KK93]. Classical samplesort and related algorithms that communicate the data only once are very slow even for rather large 푛/푡.



I Sequential and Shared-Memory Sorting

In this part of the thesis, we study sequential and shared-memory sorting algorithms and present two new cache-efficient in-place algorithms that outperform competing implementations significantly in the majority of cases. Chapter 2 introduces basic tools in Section 2.1 and discusses related work in Sections 2.2–2.5. Afterwards, we describe our new samplesort algorithm IPS4o in Section 3.1 of Chapter 3 and our radix sort algorithm IPS2Ra in Section 3.2. Chapter 3 concludes with an in-place, local-work, and I/O complexity analysis in Section 3.3. We then turn to an extensive experimental evaluation in Chapter 4. We first give implementation details of IPS4o and IPS2Ra in Section 4.1 and describe our statistical evaluation measures in Section 4.2. Subsequently, we discuss the results of sequential and shared-memory experiments in Sections 4.3–4.7. A final conclusion and an outlook on possible future work are given in Section 4.8. Appendix A gives additional proofs and provides further experimental data.

References and Attributions. This part of the thesis is based on a conference article [Axt+17c] and a technical report [Axt+20] jointly published with Peter Sanders, Sascha Witt, and Daniel Ferizovic. Most parts of the publications were written by the author of this thesis. He supervised the development of the sorting algorithms presented in this thesis, and he supervised Daniel Ferizovic, who was at that time a student employed at our institute. Daniel Ferizovic provided an initial implementation of IPS4o, which was later revised by Sascha Witt as well as extended and tuned by the author of this thesis. Daniel Ferizovic also obtained the measurements presented in the conference article. The extensive experimental evaluation presented in this part of the thesis is exclusive work of the author of this thesis. He also independently implemented the algorithm IPS2Ra (and the reference implementation PS4o). Peter Sanders provided the basic idea on how to make IPS4o in-place as well as important advice for the development of IPS4o, but he was not involved in the actual implementation process. Peter Sanders and Sascha Witt were involved in the editing of the publications and provided helpful remarks and comments to improve the presentation of the results. They also provided several notable parts of the publications. The technical report [Axt+20] is a consolidation of the work. Part I and Appendix A are a copy of most parts of this technical report. The report has also been submitted to the journal ACM Transactions on Parallel Computing and is currently under revision.

Chapter 2 Overview of Sequential and Shared-Memory Sorting Algorithms 2

This chapter provides important definitions, preliminary information, and related work for sequential and shared-memory sorting algorithms. Section 2.1 describes the sorting problem, defines the in-place property of algorithms in the sequential and parallel setting, and provides a summary of notations used throughout Part I of the thesis. We also give a description of samplesort and its non-in-place implementation Super Scalar Samplesort, which was the starting point for the development of IPS4o and IPS2Ra. In Sections 2.2–2.5, we describe the relevant related work of sequential as well as shared-memory sorting algorithms.

2.1 Definitions and Preliminaries

The input of a sorting algorithm is an array 퐴[0 .. 푛 − 1] of 푛 elements, sorted by 푡 PEs. Table 2.1 gives an overview of the most important notation used in this chapter. We expect that the output of a sorting algorithm is stored in the input array. In algorithm theory, a sequential algorithm works in-place if it uses only constant space in addition to its input. We use the term strictly in-place for this case. In algorithm engineering, one is sometimes satisfied if the additional space is logarithmic in the input size. In this case, we use the term in-place. In the context of in-place algorithms, we count machine words and equate the machine word size with the size of a data element to be sorted. Note that other space complexity measures may count the number of used bits. For the purpose of parallel in-place algorithms, we interpret “constant” as “independent of the input size” but allow additional space to depend on parameters of the machine model like the number of PEs. Recently, models for parallel in-place algorithms were proposed [GOS21]. There, even in the strong variant, a logarithmic number of machine words are allowed to be allocated on the stack by each PE. As far as we understand it, our algorithms could be cast within that framework without resorting to the tricks we use to become strictly in-place. Our target machine is a real-world shared-memory machine with one or multiple CPUs. A CPU contains one or multiple cores, which in turn contain one, two, or more hardware threads (threads). We denote a machine with multiple CPUs as a Non-Uniform Memory Access (NUMA) machine if the cores of the CPUs can access their “local main memory” faster than the memory attached to the other CPUs. We call the CPUs of NUMA machines NUMA nodes. A PE in the shared-memory machine model corresponds to a thread in our real-world shared-memory machines.

17 2 Overview of Sequential and Shared-Memory Sorting Algorithms

Table 2.1: Summary of notations

Symbol Meaning 퐴 input array 푛 number of elements 푡 # of PEs 푀 fast memory size of the PEM model 퐵 block size of PEM model 푏 block size within IPS4o 푘 distribution degree 푏푖 푖-th bucket 푤푖 write pointer of bucket 푖 푟푖 read pointer of bucket 푖 푇[ℓ, 푟] the task to sort 퐴[푙..푟 − 1]

(Super Scalar) Samplesort.

The 푘-way S4o algorithm [SW04] starts with allocating two temporary arrays of size 푛—one data array to store the buckets, and one oracle array to store the bucket indices where input elements are placed. The partitioning routine contains three phases and is executed recursively. The sampling phase sorts 훼푘 − 1 randomly sampled input elements where the oversampling factor 훼 is a tuning parameter. The splitters 푆 = [푠0 .. 푠푘 2] are then picked equidistantly from the sorted sample. The classification phase classifies each input element, stores its target bucket − in the oracle array, and increases the size of its bucket. Element 푒 goes to bucket 푏푖 if 푠푖 1 < 푒 ≤ 푠푖 (with 푠 = −∞ and 푠 = ∞). Then, a prefix sum is used to calculate the bucket boundaries. 1 푘 1 − The distribution phase uses the oracle array and the bucket boundaries to copy the elements − − from the input array into their buckets in the temporary data array. The main contribution of S4o to samplesort is to use a decision tree for element classification that eliminates branch mispredictions (branchless decision tree). Assuming 푘 is a power of two, the splitters are stored in an array 푎 representing a complete binary search tree: 푎1 = 푠푘 2 1, 푎2 = 푠 , 푎3 = 푠 , and so on. More generally, the left successor of 푎푖 is 푎2푖 and its right 푘 4 1 3푘 4 1 ~ − successor is 푎2푖 1. Thus, navigating through this tree is possible by performing a conditional ~ − ~ − instruction for incrementing an array index. S4o completely unrolls the loop that traverses the + decision tree to reduce the instruction count. Furthermore, the loop for classifying elements is unrolled several times to reduce data dependencies between instructions. This allows a higher degree of instruction parallelism. Bingmann et al. [BES17] apply the branchless decision tree to parallel string sample sorting (StringPS4o) and add additional buckets for elements identical to a splitter. After the decision 4 4 tree of S o has assigned element 푒 to bucket 푏푖, StringPS o updates the bucket to introduce additional equality buckets for elements corresponding to a splitter: Element 푒 goes to bucket

18 2.1 Definitions and Preliminaries

Sorted Splitters: a1 (s3) Decision Tree a: s0 s1 s2 s3 s4 s5 s6 s6 ⊥ s3 s1 s5 s0 s2 s4 s6 ≤ > a2 (s1) a3 (s5) ≤ > ≤ >

a4 (s0) a5 (s2) a6 (s4) a7 (s6) ≤ > ≤ > ≤ > ≤ >

< s0 = s0 < s1 = s1 < s2 = s2 < s3 = s3 < s4 = s4 < s5 = s5 < s6 = s6 > s6 ⊥

= S[0] = S[1] = S[2] = S[3] = S[4] = S[5] = S[6] = S[7]

Figure 2.1: Branchless decision tree with 7 splitters and 15 buckets, including 7 equality buckets. The first entry of the decision tree array stores a dummy to allow tree navigation. The last splitter in the sorted splitter array is duplicated to avoid a case distinction.

푏 1 if 푖 < 푘 − 1, otherwise 푒 goes to bucket 푏 .1 The case distinction 푖 < 푘 − 1 is necessary 2푖 푒 푠푖 2푖 as 푎 is undefined. +푘 =1 For− IPS4o, we adopt (and refine) the approach of element classification but change the organization of buckets in order to make S4o in-place and parallel. Our element classification works as follows: Beginning by the source node 푖 = 1 of the decision tree, the next node is 1 calculated by 푖 ← 2푖 + 푎푖 푒. When the leaves of the tree are reached, we update 푖 once more 푖 ← 2푖 + 1 − 푘. For now, we know for 푒 that 푠 < 푒 ≤ 푠 if we assume that 푠 = −∞ and 푎푖 푒 < 푖 1 푖 1 that 푠 = ∞. Finally, the bucket of 푒 is 2푖 + 1 − 1 . Note that we do not use the comparison 푘 1 < 푒− 푠푖 − 1 , from 푆푡푟푖푛푔푃푆4표 to calculate the final bucket. The reason is that our algorithm accepts a 푒 푠푖 − < compare function < and StringPS4o compares radices. Instead, we use 1 − 1 that is identical = 푒 푠푖 to 1 since we already know that 푒 ≤ 푠 . Also, note that we avoid the case distinction 푖 < 푘 − 1 푒 푠푖 푖 < from the classification of StringPS4o that may potentially cause a branch misprediction. Instead, = 4 4 we set 푠푘 1 = 푠푘 2. Compared to S o and StringPS o, we support values of 푘 that are no powers of two, i.e., when we had removed splitter duplicates in our algorithm. In these cases, we round − − up 푘 to the next power of two and pad the splitter array 푆 with the largest splitter. We note that this does not increase the depth of the decision tree. Figure 2.1 depicts our refined decision tree and Algorithm 1 classifies elements using the refined decision tree. Algorithm 1 classifies a chunk of elements in one step. A single instruction of the decision tree traversal is executed on multiple elements before the next operation is executed. We use loops to execute each instruction on a constant number of elements. It turned out that recent compilers automatically unroll these loops and remove the instructions of the loops for code optimization.

1 We use 1푐 to express a conversion of a comparison result to an integer. When 푐 is true, 1푐 is equal to 1. Otherwise, it is equal to 0.

19 2 Overview of Sequential and Shared-Memory Sorting Algorithms

Algorithm 1 Element classification of the first 푢⌊푛~푢⌋ elements Template parameters: 푠 number of splitters, 푢 unroll factor, equalBuckets Boolean value that indicates the use of equality buckets Input: 퐴[0 .. 푛 − 1] an array of 푛 input elements tree[1 .. 푠] decision tree, splitters stored in left-to-right breadth-first order splitter[0 .. 푠] sorted array of 푠 splitters, last splitter is duplicated compare(푒푙, 푒푟) a comparator function that returns 0 or 1 output(푒, 푡) an output function that gets an element 푒 and its target bucket 푡 ← + 푙 log2(푠 1) ▷ Log. number of buckets (equality buckets are excluded) 푘 ← 2푙 1 ▷ Number of buckets 푏[0 .. 푢+− 1] ▷ Array to store current position in the decision tree for 푗 ← 0 in steps of 푢 to 푛 − 푢 do ▷ Loop over elements in blocks of 푢 for 푖 ← 0 to 푢 − 1 do 푏[푖] ← 1 ▷ Set position to the tree root for 푟 ← 0 to 푙 do ▷ Unrolled by most compilers as 푙 and 푢 are constants for 푖 ← 0 to 푢 − 1 do 푏[푖] ← 2 ⋅ 푏[푖]+ compare(tree[푏[푖]], 푎[푗 + 푖]) ▷ Navigate through the tree if equalBuckets then for 푖 ← 0 to 푢 − 1 do▷ Assign elements identical to the splitter to its equality bucket 푏[푖] ← 2 ⋅ 푏[푖] + 1−compare(푎[푗 + 푖], splitter[푏[푖] − 푘~2]) for 푖 ← 0 to 푢 − 1 do output(푏[푖] − 푘, 푎[푗 + 푖])

2.2 Quicksort

Variants of Hoare’s quicksort [Hoa62; Mus97] are generally considered some of the most efficient general-purpose sorting algorithms. Quicksort works by selecting a pivot element and partitioning the array such that all elements smaller than the pivot are in the left part and all elements larger than the pivot are in the right part. The subproblems are solved recursively. Quicksort (with recursion on the smaller subproblem first) needs logarithmic additional space for the recursion stack. Strictly in-place variants [Dur86; BK86; Weg87] of quicksort avoid recursion, process the array from left to right, and use a careful placement of the pivots to find the end of the leftmost partition. A variant of quicksort (with a fallback to to avoid worst-case scenarios) is currently used in the C++ standard library of GCC [Mus97]. Some variants of quicksort use two or three pivots [Yar10; Kus+14] and achieve improve- ments of around 20 % in running time over the single-pivot case. The basic principle of quicksort remains, but elements are partitioned into three or four subproblems instead of two. Quicksort can be parallelized in a scalable way by parallelizing both partitioning and recur- sion [MG89; HNR90; FP92]. Tsigas and Zhang [TZ03] show in practice how to do this in-place. Their algorithm scans the input from left to right and from right to left until the scanning positions meet—as in most sequential implementations. The crucial adaptation is to do this in a blockwise fashion such that each thread works at one block from each scanning direction

20 2.3 Samplesort at a time. When a thread finishes a block from one scanning direction, it acquires a new one using an atomic fetch-and-add operation on a shared pointer. This process terminates when all blocks are acquired. The remaining unfinished blocks are resolved in a sequential cleanup phase. Our IPS4o algorithm can be considered as a generalization of this approach to 푘 pivots. This saves a factor Θ(log 푘) of passes through the data. We also parallelize the cleanup process.

2.3 Samplesort

Samplesort [FM70; Ble+96; BGS10] can be considered as a generalization of quicksort that uses 푘 − 1 splitters to partition the input into 푘 subproblems (from now on called buckets) of about equal size. Unlike single- and dual-pivot quicksort, samplesort is usually not in-place, but it is well-suited for parallelization and more cache-efficient than quicksort. The algorithm S4o [SW04] improves samplesort by avoiding inherently hard-to-predict conditional branches linked to element comparisons. Branch mispredictions are very expensive because they disrupt the pipelined and instruction-parallel operation of modern processors. Tra- ditional quicksort variants suffer massively from branch mispredictions [KS06a]. By replacing conditional branches with conditionally executed machine instructions, branch mispredictions can be largely avoided. This is done automatically by modern compilers if only a few instructions depend on a condition. As a result, S4o is up to two times as fast as quicksort (std::sort), at the cost of O(푛) additional space. BlockQuicksort [EW16] applies similar ideas to single-pivot quicksort, resulting in a very fast in-place sorting algorithm with performance similar to S4o. For IPS4o, we used a refined version of the branchless decision tree from S4o. S4o has also been adapted for efficient parallel string sorting [BES17]. We apply their approach of handling identical keys to our decision tree.

2.4 Radix Sort

As for samplesort, the core of radix sort is a 푘-way data partitioning routine that is recursively executed. In its simplest way, all elements are classified once to determine the bucket sizes and then a second time to distribute the elements. Most partitioning routines are applicable to samplesort as well as to radix sort. Samplesort classifies an element with Θ(log 푘) invocations of the comparator function while radix sort just extracts a digit of the key in constant time. In-place 푘-way data partitioning is often done element by element, e.g., in the sequential in-place radix sorters American Flag [MBM93] and SkaSort [Ska16a]. However, these approaches have two drawbacks. First, they perform the element classification twice. This is a particular problem when we apply this approach to samplesort as the comparator function is more expensive. Second, a naive parallelization where the threads use the same pointers and acquire single elements suffers from read/write dependencies. In 2014, Orestis and Ross [PR14] outlined a parallel in-place radix sorter that moves blocks of elements in its 푘-way data partitioning routine. We use the same general approach for IPS4o. However, the paper [PR14] leaves open how the basic idea can be turned into a correct in-place

21 2 Overview of Sequential and Shared-Memory Sorting Algorithms algorithm. The published prototypical implementation uses 20 % additional memory, and does not work for small inputs or a number of threads different from 64. In 2015, Minsik et al. published PARADIS [Cho+15], a parallel in-place radix sorter. The partitioning routine of PARADIS classifies the elements to get bucket boundaries and each thread gets a subsequence of unpartitioned elements from each bucket. The threads then try to move the elements within their subsequences so that the elements are placed in the subsequence of their target bucket. This takes time O(푛~푡). Depending on the data distribution, elements may still be in the wrong bucket. In this case, the threads repeat the procedure on the unpartitioned elements. Depending on the key distribution, the load of the threads in the partitioning routine differs significantly. No bound better than O(푛) is known for this partitioning routine [Obe+19b]. In 2019, Shun et al. [Obe+19b] proposed an in-place 푘-way data partitioning routine for the radix sorter RegionSort. This algorithm builds a graph that models the relationships between element regions and their target buckets. Then, the algorithm performs multiple rounds where the threads swap regions into their buckets. To the best of our knowledge, the initial version of IPS4o [Axt+17c], published in 2017, is the first parallel 푘-way partitioning algorithm that moves elements in blocks, works fully in-place, and gives adequate performance guarantees. Our algorithm IPS4o is more general than RegionSort in the sense that it is comparison-based. To demonstrate the advantages of our approach, we also propose the radix sorter IPS2Ra that adapts our in-place partitioning routine.

2.5 (Strictly) In-Place Mergesort

There is a considerable amount of theory work on strictly in-place sorting (e.g., [Fra04; FG05; GL91]). However, there are few—mostly negative—results of transferring the theory work into practice. Implementations of non-stable in-place mergesort [KPT96; EKS12; EW19] are reported to be slower than quicksort from the C++ standard library. Katajainen and Teuhola report that their implementation [KPT96] is even slower than heapsort, which is quite slow for big inputs due to its cache-inefficiency. The fastest non-stable in-place mergesort implemen- tation we have found is QuickMergesort (QMSort) from Edelkamp et al. [EW19]. Relevant implementations of stable in-place mergesort are WikiSort (derived from [KK08]) and Grail- Sort (derived from [HL92]). However, Edelkamp et al. [EW19] report that WikiSort is a factor of more than 1.5 slower than QMSort for large inputs and that GrailSort performs similar to WikiSort. Edelkamp et al. also state that non-in-place mergesort is considerably faster than in-place mergesort. There is previous theoretical work on sequential (strictly) in-place multiway merging [GG10]. However, this approach needs to allocate very large blocks to become efficient. In contrast, the block size of IPS4o does not depend on the input size. Gu et al. [GOS21] propose a method to convert parallel non-in-place algorithms meeting the Decomposable Property into algorithms that only require O‰푛1 휖Ž additional memory for 0 < 휖 < 1. They apply their method to merging, scan, filter, random− permutation, list contraction, and tree contraction and provide implementations for the latter five algorithms. The best practical shared-memory mergesort algorithm we found is the non-in-place multiway mergesort algo-

22 2.5 (Strictly) In-Place Mergesort rithm (MCSTLmwm) from the MCSTL library [SSP07]. We did not find any practical parallel in-place mergesort implementation.

23

Chapter 3 Robust Scalable In-Place Sorting Algorithms 3

In Sections 3.1 and 3.2, we propose the comparison-based sorting algorithm IPS4o as well as the integer sorting algorithm IPS2Ra. Both algorithms work in-place and can be executed sequentially as well as in parallel. In Section 3.3, we prove that the algorithms work in-place (Section 3.3.1). Additionally, we prove that IPS4o is cache-efficient (Section 3.3.2) and that it performs O(푛~푡 log 푛) work per thread if some constraints apply in regard to the input size, the number of threads, and the machine parameters (Section 3.3.3).

3.1 In-Place Parallel Super Scalar Samplesort (IPS4o)

IPS4o is a recursive algorithm. Each recursion level divides the input into 푘 buckets (partitioning step), such that each element of bucket 푏푖 is smaller than all elements of 푏푖 1. Partitioning steps operate on the input array in-place and are executed with one or more PEs, depending on their + size. If a bucket is smaller than a certain base case size, we invoke a base case algorithm on the bucket (base case) to sort small inputs fast. A scheduling algorithm determines at which time a base case or partitioning step is executed and which PEs are involved. We describe the partitioning steps in Section 3.1.1 and the scheduling algorithm in Section 3.1.2.

3.1.1 Sequential and Parallel Partitioning A partitioning step consists of four phases, executed sequentially or by a (sub)set of the input PEs. Sampling determines the bucket boundaries. Classification groups the input into blocks such that all elements in a block belong to the same bucket. (Block) permutation brings the blocks into the globally correct order. Finally, we clean up blocks that cross bucket boundaries or remained partially filled in the cleanup phase. Figure 3.1 depicts an overview of a parallel partitioning step. The following paragraphs will explain each of these phases in more detail.

3.1.1.a) Sampling. Similar to the sampling in S4o, the sampling phase of IPS4o creates a branchless decision tree—the tree follows the description of the decision tree proposed by Sanders and Winkel, extended by equality buckets1. For a description of the decision tree used in S4o including our

1The authors describe a similar technique for handling duplicates, but have not implemented the approach for their experiments.

25 3 Robust Scalable In-Place Sorting Algorithms

n n stripe 1 ( t ) stripe t ( t ) Input Classification … tk buffer blocks PE 1 … PE t

b Permutation bucket boundary Move empty blocks

Move misplaced blocks

tk buffer blocks PE 1 … PE t Cleanup

Figure 3.1: Overview of a parallel 푘-way partitioning step (푘 = 4) with 푡 PEs and blocks of size three. Elements with the same color belong into the same bucket. The brighter the color, the smaller the bucket index. This figure depicts the first and last stripe of the input array, containing 푛~푡 elements each. In the classification phase, PE 푖 classifies the elements of stripe 푖, moves elements of bucket 푗 into its buffer block 푗, and flushes the buffer block back into its stripe in case of an overflow. In the permutation phase, the bucket boundaries are calculated and the blocks belonging into bucket 푗 are placed in the blocks after bucket boundary 푗 in two steps: First, the empty blocks are moved to the end of the bucket. Then, the misplaced blocks are moved into its bucket. The cleanup phase moves elements that remained in the buffer blocks and elements that overlap into the next bucket to their final positions. refinements, we refer to Section 2.1. In IPS4o, the decision tree is used in the classification phase to assign elements to buckets. The sampling phase performs four steps. First, we sample 푘훼 − 1 elements of the input. We swap the samples to the front of the input array to keep the in-place property even if the oversampling factor 훼 depends on 푛. Second, 푘 − 1 splitters are picked equidistantly from the sorted sample. Third, we check for and remove duplicates from the splitters. This allows us to decrease the number of buckets 푘 if the input contains many duplicates. Finally, we create the decision tree. The strategy for handling identical keys is enabled conditionally: The decision tree only creates equality buckets when there are several identical splitters. Otherwise, we create a decision tree without equality buckets. Having inputs with many identical keys can be a problem for samplesort, since this might move large fractions of the keys through many recursion levels. The equality buckets turn inputs with many identical keys into “easy” instances as they introduce separate buckets for elements identical to splitters (keys occurring more than 푛~푘 times are likely to become splitters).

3.1.1.b) Classification The input array 퐴 is viewed as an array of blocks each containing 푏 elements (except possibly for the last one). For parallel processing, we divide the blocks of 퐴 into 푡 stripes of equal size—one

26 3.1 In-Place Parallel Super Scalar Samplesort (IPS4o)

A | {z } b 2) 1) b z }| { Buffer 3) b1 b2 b3 b4

Figure 3.2: Classification. Blue elements have already been classified, with different shades indicating different buckets. Unprocessed elements are green. We perform three steps in this example. 1) The next element (in dark green) is determined to belong to bucket 푏3. 2) As that buffer block is already full, we first have to write it into the array 퐴. 3) Then, we can write the new element into the now empty buffer.

A | {z }| {z } PE t − 1 PE t ··· Buffers b1 b2 b3 b4 b1 b2 b3 b4

Figure 3.3: Input array and block buffers of the last two PEs after classification. for each PE. Each PE works with a local array of 푘 buffer blocks—one for each bucket. A PE then scans its stripe. Using the search tree created in the sampling phase, each element in the stripe is classified into one of the 푘 buckets and then moved into the corresponding local buffer block. If this buffer block is already full, it is first written back into the local stripe, starting at the front. It is clear that there is enough space to write 푏 elements into the local stripe, since at least 푏 more elements have been scanned from the stripe than have been written back—otherwise, no full buffer could exist. In this way, each PE creates blocks of 푏 elements belonging to the same bucket. Figure 3.2 shows a typical situation during this phase. To achieve the in-place property, we do not track which bucket each block belongs to. However, we count how many elements are classified into each bucket, since we need this information in the following phases. This information can be obtained almost for free as a side effect of maintaining the buffer blocks. Figure 3.3 depicts the input array after classification. Each stripe contains full blocks, followed by empty blocks. The remaining elements are still contained in the buffer blocks.

3.1.1.c) Block Permutation In this phase, the blocks in the input array are rearranged such that they appear in the correct order. From the classification phase we know, for each stripe, how many elements belong to each bucket. We first aggregate the per-PE bucket sizes and then compute a prefix sum over the total bucket sizes. This yields the exact boundaries of the buckets in the output. Roughly, the idea is then that each PE repeatedly looks for a misplaced block 퐵 in some bucket 푏푖, finds the correct destination bucket 푏푗 for 퐵, and swaps 퐵 with a misplaced block in 푏푗. If 푏푗 does not contain a misplaced block, 퐵 is moved to an empty block in 푏푗. The PEs are coordinated by

27 3 Robust Scalable In-Place Sorting Algorithms

r1 w1 w2 r2 wi ri A

d1 d2 d3,...,i−1 di di+1

Figure 3.4: Invariant during block permutation. In each bucket 푏푖, blocks in [푑푖, 푤푖) are already correct (blue), blocks in [푤푖, 푟푖] are unprocessed (green), and blocks in [max(푤푖, 푟푖 + 1), 푑푖 1) are empty (white).

+ maintaining atomic read and write pointers for each bucket. Costs for updating these pointers are amortized by making blocks sufficiently large. We now describe this process in more detail beginning with the preparations needed before starting the actual block permutation. We mark the beginning of each bucket 푏푖 with a delimiter pointer 푑푖, rounded up to the next block. We similarly mark the end of the last bucket 푏푘 with a delimiter pointer 푑푘 1. Adjusting the boundaries may cause a bucket to “lose” up to 푏 − 1 elements; this doesn’t affect us, since this phase only deals with full blocks, and elements outside + full blocks remain in the buffers. Additionally, if the input size is not a multiple of 푏, some of the 푑푖s may end up outside the bounds of 퐴. To avoid overflows, we allocate a single empty overflow block that the algorithm will use instead of writing to the final (partial) block. For each 푏푖, a write pointer 푤푖 and a read pointer 푟푖 are introduced; these will be set such that all unprocessed blocks, i.e., blocks that still need to be moved into the correct bucket, are found between 푤푖 and 푟푖. At the beginning of the permutation phase, 푤푖 is set to 푑푖 and 푟푖 is set to 푑푖 1. During the block permutation, we maintain the following invariant for each bucket 푏푖, visualized in Figure 3.4: + • Blocks to the left of 푤푖 (exclusive) are correctly placed, i.e., contain only elements belong- ing to 푏푖.

• Blocks between 푤푖 and max(푤푖 − 1, 푟푖) (inclusive) are unprocessed, i.e., may need to be moved.

• Blocks to the right of max(푤푖, 푟푖 + 1) (inclusive) are empty. In other words, each bucket follows the pattern of correct blocks followed by unprocessed blocks followed by empty blocks, with 푤푖 and 푟푖 determining the boundaries. In the sequential case, this invariant is already fulfilled from the beginning. In the parallel case, all full blocks are at the beginning of each stripe, followed by its empty blocks. This means that only the buckets crossing a stripe boundary need to be fixed. To do so, each PE finds the bucket that starts before the end of its stripe but ends after it. It then finds the stripe in which that bucket ends (which will be the following stripe in most cases) and moves the last full block in the bucket into the first empty block in the bucket. It continues to do this until either all empty blocks in its stripe are filled or all full blocks in the bucket have been moved. In rare cases, very large buckets exist that cross multiple stripes. In this case, each PE will first count how many blocks in the preceding stripes need to be filled. It will then skip that many blocks at the end of the bucket before starting to fill its own empty blocks.

28 3.1 In-Place Parallel Super Scalar Samplesort (IPS4o)

w3 1) r3 r1 w1 1) 3) r2 w2 A A 2) 4) 2) 3) Swap buffers (a) Swapping a block into its correct position. (b) Moving a block into an empty position, fol- lowed by refilling the swap buffer.

Figure 3.5: Block permutation examples. The numbers indicate the order of the opera- tions.

The PEs are then ready to start the block permutation. Each PE maintains two local swap buffers that can hold one block each. We define a primary bucket 푏푝 for each PE; whenever both its buffers are empty, a PE tries to read an unprocessed block from its primary bucket. To do so, it decrements the read pointer 푟푝 (atomically) and reads the block it pointed to into one of its swap buffers. If 푏푝 contains no more unprocessed blocks (i.e., 푟푝 < 푤푝), it switches its primary bucket to the next bucket (cyclically). If it completes a whole cycle and arrives back at its initial primary bucket, there are no more unprocessed blocks and the whole permutation phase ends. The starting points for the PEs are distributed across that cycle to reduce contention. Once it has a block, each PE classifies the first element of that block to find its destination bucket 푏dest. There are now two possible cases, visualized in Figure 3.5:

• As long as 푤dest ≤ 푟dest, write pointer 푤dest still points to an unprocessed block in bucket 푏dest. In this case, the PE increases 푤dest, reads the unprocessed block into its empty swap buffer, and writes the other one into its place.

• If 푤dest > 푟dest, no unprocessed block remains in bucket 푏dest but 푤dest now points to an empty block. In this case, the PE increases 푤dest, writes its swap buffer to the empty block, and then reads a new unprocessed block from its primary bucket. We repeat these steps until all blocks are processed. We can skip moving unprocessed blocks that are already correctly placed: We simply classify blocks before reading them into a swap buffer, and skip as needed. It is possible that one PE wants to write to a block that another PE is currently reading from (when the reading PE has just decremented the read pointer but has not yet finished reading the block into its swap buffer). However, PEs are only allowed to write to empty blocks if no other PEs are currently reading from the bucket in question, otherwise, they must wait. Note that this situation occurs at most once for each bucket, namely when 푤dest and 푟dest cross each other. We avoid these data races by keeping track of how many PEs are reading from each bucket.

When a PE fetches a new unprocessed block, it reads and modifies either 푤푖 or 푟푖. The PE also needs to read the other pointer for the case distinctions. These operations are performed simultaneously to ensure a consistent view of both pointers for all PEs. Figure 3.6 depicts an example of the permutation phase with three PEs and four buckets.

29 3 Robust Scalable In-Place Sorting Algorithms

(1)

(2)

(3)

(4)

(5)

Figure 3.6: An example of a permutation phase with 푘 = 4 buckets and 푡 = 3 PEs. The brackets above the buckets mark the unprocessed blocks. After five permutation steps, the blocks were moved into their target buckets. (1) The buffer blocks are filled with blocks. (2-3) Swap buffer block with the leftmost unprocessed block of the buffer block’s buckets. (4) PE 0 and 1 have a buffer block for the last bucket. They increase the write pointer of this bucket concurrently. PE 0 executes the fetch-and-add operation first, the PE swaps its buffer block with the second block (unprocessed block) of the last bucket. PE 1 writes its buffer block into the third block (empty block) of the last bucket. After step four, PEs 1 and 2 finished a permutation chain, i.e., flushed their buffer block into an empty block. (5) PE 0 flushes its buffer block into an empty block. PE 1 classifies the last unprocessed block of the first bucket but this block is already in its target bucket.

3.1.1.d) Cleanup

After the block permutation, some elements may still be in incorrect positions since blocks may cross bucket boundaries. We call the partial block at the beginning of a bucket its head and the partial block at its end its tail. PE 푖 performs the cleanup for buckets [⌊푘푖~푡⌋ .. ⌊푘(푖 + 1)~푡⌋). PE 푖 first reads the head of the first bucket of PE 푖 + 1 into one of its swap buffers. Then, each PE processes its buckets from left to right, moving incorrectly placed elements into empty array entries The incorrectly placed elements of bucket 푏푖 can be in four locations: (i) Elements may be in the head of 푏푖 1 if the last block belonging into bucket 푏푖 overlaps into bucket 푏 . 푖 1 + + (ii) Elements may be in the partially filled buffers from the classification phase.

30 3.1 In-Place Parallel Super Scalar Samplesort (IPS4o)

bi−1 bi bi+1 bi+2 z }| {z }| {z }| {z }| { head tail A

Buffer blocks of bi bi+1 PE 1 PE 2 PE 1 PE 2

Figure 3.7: An example of the steps performed during cleanup.

(iii) Elements of the last bucket (in this PE’s area) may be in the swap buffer.

(iv) Elements of one bucket may be in the overflow buffer.

Empty array entries consist of the head of 푏푖 and any (empty) blocks to the right of 푤푖 (inclusive). Although the concept is relatively straightforward, the implementation is somewhat involved, due to the many parts that have to be brought together. Figure 3.7 shows an example of the steps performed during the cleanup phase. Afterwards, all elements are back in the input array and correctly partitioned, ready for recursion.

3.1.2 Task Scheduling

In this section, we describe the task scheduling of IPS4o. We also establish basic properties of the task scheduler. The properties are used to understand how the task scheduler works. The properties are also used later on in Section 3.3 to analyze the parallel I/O complexity and the local work of IPS4o. In general, IPS4o uses static scheduling to apply tasks to PEs. When a PE becomes idle, we additionally perform a dynamic rescheduling of sequential tasks to utilize the computation resources of the idle PE. Unless stated otherwise, we exclude the dynamic rescheduling from the analysis of IPS4o and only consider the static load balancing. We state that dynamic load balancing—when implemented correctly—cannot make things worse asymptotically. Before we describe the scheduling algorithm, we introduce some definitions and derive some properties from these definitions to understand how the task scheduler works. A task 푇[푙, 푟) either partitions the (sub)array 퐴[푙, 푟 − 1] with a partitioning step (partitioning task) or sorts the base case 퐴[푙, 푟 − 1] with a base case sorting algorithm (base case task). A partitioning step performed by a group of PEs (a PE group) is a parallel partitioning task (parallel task) and a partitioning step with one PE is a sequential partitioning task. Each PE has a local stack to store sequential tasks, i.e., sequential partitioning tasks and base cases. Additionally, each PE 푖 stores a handle 퐺푖 to its current PE-group and has access to the handles stored by the other PEs of the PE-group. The initial task is executed by all PEs. We now consider a subtask 푇[푙, 푟) of a parallel task that had been processed by some PEs [푡 .. 푡). To decide whether 푇[푙, 푟) is a parallel task or a ′ ′ sequential task, we denote 푡′ as ⌊푙푡~푛⌋ and 푡 as ⌊푟푡~푛⌋. 푇[푙, 푟) is a parallel task when 푡 − 푡′ > 1.

31 3 Robust Scalable In-Place Sorting Algorithms

′ In this case, the task is executed by PE group [푡′ .. 푡 ). Otherwise, the task is a sequential task and will be processed by PE ⌊min(푡′, 푡 − 1)⌋. We use a base case threshold 푛0 to determine whether a sequential task is a sequential partitioning task or a base case task: Buckets with at most 2푛0 elements as well as buckets of a task with at most 푘푛0 elements are considered as base cases. Otherwise, it is a sequential partitioning task. We use an adaptive number of buckets 푘 for partitioning steps with less than 2 푘 푛0 elements, such that the expected size of the base cases√ is between 0.5푛0 and 푛0 while the expected number of buckets remains equal or larger than 푘. To this end, for partitioning ′ ′ 2 log 푛 푛0 1 2 ′ steps with 푘푛0 < 푛 < 푘 푛0 elements, we adjust 푘 to 2 2 . When 푛0 < 푛 ≤ 푘푛0, we log 푛′ 푛 4 set 푘 = 2 2 0 . This adaption of 푘 is important⌈( for( a~ robust)+ )~ running⌉ time of IPS o. In the analysis⌈ of( ~ IPS)⌉4o, the adaption of 푘 will allow us to amortize the sorting of samples. In practice, our experiments have shown that for fixed 푘, the running time per element oscillates 푖 with maxima around 푘 푛0. From these definitions, Lemmas 3.1–3.4 follow. The lemmas allow a simple scheduling of parallel tasks and PE-groups.

Lemma 3.1 (Relationship between array position and PE index) The parallel task 푇[푙, 푟) covers position (푖 + 1)푛~푡 − 1, 푖 ∈ [0 .. 푡) of the input array if and only if PE 푖 executes the parallel task 푇[푙, 푟). Proof. We first prove that PE 푖 executes the parallel task 푇[푙, 푟) if 푇[푙, 푟) covers position (푖 + 1)푛~푡 − 1, 푖 ∈ [0 .. 푡) of the input array. Let the parallel task 푇[푙, 푟) cover position 푤 = (푖 + 1)푛~푡 − 1, 푖 ∈ [0 .. 푡) of the input array. From the inequalities

푖 = ⌊((푖 + 1)푛~푡 − 1)푡~푛⌋ ≥ ⌊푙푠푡~푛⌋ = 푡푠 푖 = ⌊((푖 + 1)푛~푡 − 1)푡~푛⌋ < ⌊푟푠푡~푛⌋ = 푡푠 follows that PE 푖 executes the parallel task 푇[푙, 푟). For the “≥” and the “<”, we use that task 푇[푙, 푟) covers position 푤 of the input array, i.e., 푙 ≤ 푤 < 푟. We now prove that a parallel task 푇[푙, 푟) must cover position (푖 + 1)푛~푡 − 1 of the input array if it is executed by PE 푖. Let us assume that a parallel task 푇[푙, 푟) is executed by PE 푖. From the inequalities

(푖 + 1)푛~푡 − 1 ≥ (푡 + 1)푛~푡 − 1 ≥ 푙푠

(푖 + 1)푛~푡 − 1 < 푡푛~푡 ≤ 푟푠 follows that task 푇[푙, 푟) covers position (푖 + 1)푛~푡 − 1 of the input array. For the second “≥” and for the “≤”, we use the definition for the PE-group of 푇[푙, 푟), i.e., [푡 .. 푡) = [⌊푙푡~푛⌋ .. ⌊푟푡~푛⌋). ◻

Lemma 3.2 (Relationship between PE of sequential task and parallel parent task)

Let the sequential task 푇[푙푠, 푟푠), processed by PE 푖 be a bucket of a parallel task 푇[푙, 푟). Then, task 푇[푙, 푟) is processed by PE 푖 and others.

Proof. Let the parallel task 푇[푙, 푟) be processed by PEs [푡 .. 푡). Task 푇[푙푠, 푟푠) is processed by PE 푖 = min(⌊푙푠푡~푛⌋, 푡 − 1). We have to show that 푡 ≤ 푖 < 푡 holds. Indeed, inequality 푖 =

32 3.1 In-Place Parallel Super Scalar Samplesort (IPS4o)

min(⌊푙푠푡~푛⌋, 푡 − 1) < 푡 holds. The inequality 푖 ≥ 푡 is only wrong if ⌊푙푠푡~푛⌋ < 푡 or if 푡 − 1 < 푡. However, we have 푙 ≤ 푙푠 as task 푇[푙푠, 푟푠) is a bucket of its parent task 푇[푙, 푟). Thus, ⌊푙푠푡~푛⌋ ≥ ⌊푙푡~푛⌋ = 푡 holds as the first PE 푡 of 푇[푙, 푟) is defined as ⌊푙푡~푛⌋. Also, we have 푡 − 1 > 푡 as task 푇[푙, 푟) is a parallel task with at least two PEs, i.e., 푡 − 푡 > 1. ◻

Lemma 3.3 (Parallel subtasks are processed by subgroup of PEs)

Let the parallel subtask 푇[푙푠, 푟푠), processed by PE 푖 and others, be a bucket of a task 푇[푙, 푟). Then, task 푇[푙, 푟) is also a parallel task processed by PE 푖 and others.

Proof. Task 푇[푙, 푟) is a parallel task if ⌊푟푡~푛⌋ − ⌊푙푡~푛⌋ > 1. This inequality is true as

⌊푟푡~푛⌋ − ⌊푙푡~푛⌋ ≥ ⌊푟푠푡~푛⌋ − ⌊푙푠푡~푛⌋ > 1 .

For the “≥” we use that 푇[푙푙, 푟푙) is a bucket of 푇[푙, 푟) and for the “>” we use that 푇[푙푠, 푟푠) is a parallel task. As the parallel task 푇[푙푠, 푟푠) is processed by PE 푖, 푇[푙푠, 푟푠) covers position (푖 + 1)푛~푡 − 1 of the input array (see Lemma 3.1). As task 푇[푙푠, 푟푠) is a bucket of 푇[푙, 푟), the parallel task 푇[푙, 푟) also covers the position (푖 + 1)푛~푡 − 1. From Lemma 3.1 follows that the parallel task 푇[푙, 푟) is processed by PE 푖. ◻

Lemma 3.4 (One parallel task per recursion level and PE) On each recursion level, PE 푖 works on at most one parallel task.

푗 Proof. Let 푆푖 , 푖 ∈ [0 .. 푡) be the set of parallel tasks on recursion level 푗 that cover the position (푖 + 1)푛~푡 − 1 of the input array. From Lemma 3.1 follows that PE 푖 processes on recursion level 푗 푗 푗 only the tasks 푆푖 . The set 푆푖 contains at most one task as tasks on the same recursion level are disjoint. ◻

3.1.2.a) Static Scheduling We start the description of the task scheduler by describing the static scheduling part. The idea behind the static scheduling is that each PE executes its tasks in depth-first search order tracing parallel tasks first. From Lemmas 3.3 and 3.4 follows, keeping the order of execution in mind, that each PE first executes all of its parallel tasks before it starts to execute its sequential tasks. IPS4o starts by processing a parallel task 푇[0, 푛) with PEs [0 .. 푡). In general, when a parallel task 푇[푙, 푟) is processed by the PE-group 퐺 = [푡 .. 푡), five steps are performed. (i) A parallel partitioning step is invoked on 퐴[푙, 푟 − 1].

(ii) The buckets of the partitioning step induce a set of subtasks 푆.

(iii) If subtask 푇[푙푠, 푟푠) ∈ 푆 is a sequential task, PE 푖 = ⌊min(푙푠푡~푛, 푡 − 1)⌋ adds the subtask to its local stack. From Lemma 3.2, we know that PE 푖 is actually also processing the current task 푇[푙, 푟). This allows PEs to add sequential tasks exclusively to their own local stack, so no concurrent stacks are required.

33 3 Robust Scalable In-Place Sorting Algorithms

0 n/t 2n/t 3n/t 4n/t 5n/t 6n/t 7n/t 8n/t Input array

PE 0..7

0 1 2 3 4 5 6 7

PE 4..5 PE 0..1

Time 4 5 0 1

Local stacks 0 1 2 3 4 5 6 7

Figure 3.8: Example schedule of a task execution in IPS4o with 8 PEs where partitioning steps split tasks into 8 buckets. Each rectangle represents a task in execution. The height of a task is defined by the size of the task divided by the number of PEs assigned to this task. For parallel tasks (green), the PEs processing that task are shown in the rectangles. The sequential partitioning tasks (blue) are covered by the local stack that stores the task until processing. Base case tasks are omitted for the sake of simplicity. The crosses at the bottom of a rectangle indicate bucket boundaries. The brackets pointing downwards are used to decide in which local stack the sequential subtasks are inserted. Tasks stored in local stack 푖 are executed by PE 푖.

(iv) Each PE 푖 ∈ [푡 .. 푡) extracts the subtask 푇푠 = 푇[푙푠, 푟푠) from 푆 that covers position ⌊(푖 + 1)푛~푡⌋ − 1 of the input array 퐴. Also, PE 푖 calculates 푡푠 = ⌊푙푠푡~푛⌋ as well as 푡푠 = ⌊푟푠푡~푛⌋ and continues with the case distinction 푡푠 − 푡푠 ≤ 1 and 푡푠 − 푡푠 > 1.

If 푡푠 − 푡푠 ≤ 1 or if 푇푠 is an equality bucket, PE 푖 once synchronizes with 퐺 and starts processing the sequential tasks on its local stack.

Otherwise, 푇푠 is actually a parallel task that has to be processed by the PEs [푡푠 .. 푡푠). From Lemmas 3.1 and 3.3 follows that the PEs [푡푠 .. 푡푠) are currently all processing 푇[푙, 푠) and exactly these PEs selected the same task 푇푠. This allows setting up the PEs [푡푠 .. 푡푠) for the next parallel task 푇푠 without keeping the PEs waiting: The first PE 푡푠 of task 푇푠 creates the data structure representing the task’s new PE-group 퐺′ = [푡푠 .. 푡푠) and updates the PE-group handles [퐺 .. 퐺 ) of the PEs [푡 .. 푡 ) to the new data structure. Afterwards, 푡푠 푡푠 푠 푠 all PEs of [푡푠 .. 푡푠) synchronize with PE-group 퐺 and access their new PE-group 퐺′ using the updated PE-group handles. Finally, the PEs [푡푠 .. 푡푠) start processing task 푇푠 with PE-group 퐺′. If a PE no longer processes another parallel task, it starts processing the sequential tasks of its stack until the stack is empty. Base cases are sorted right away. When the next task 푇[푙, 푟) is

34 3.1 In-Place Parallel Super Scalar Samplesort (IPS4o)

Algorithm 2 Task Scheduler Input: 퐴[0 .. 푛 − 1] array of 푛 input elements, 푡 number of PEs, 푖 current PE 푇[푙, 푟) ← 푇[0, 푛) ▷ Current task, initialized with 퐴[1 .. 푛] 퐺푖[푡, 푡) = 퐺[0, 푡) ▷ Initialize PE-group containing PE 푡 = 0 to 푡 = 푡 (excl.) 퐷 ← ∅ ▷ Empty local stack if 푡 − 푡 = 1 then D.pushFront(푇[푙, 푟)) ▷ Initial task is a sequential, go to sequential phase else while true do ▷ Execute current parallel task [푏0 .. 푏푘 1] ←partitionParallel(퐴[푙, 푟 − 1], 퐺푖)▷ Partitioning step; returns buckets for [푙 .. 푟 ) ← 푏 to 푏 do ▷ Handle the buckets 푠 − 푠 푘 1 0 if (푖 + 1)푛~푡 − 1 ∈ [푙 , 푟 ) then ▷ Update current task − 푠 푠 푇[푙, 푟) ← 푇[푙푠, 푟푠) ▷ It might be 푖’s next parallel task

if ⌈푟푠푡~푛⌉ − ⌈푙푠푡~푛⌉ ≤ 1 and 푖 = min(푙푠푡~푛, 푡 − 1) then D.push({푇[푙푠, 푟푠), 푙, 푟}) ▷ PE 푖 adds sequential task to its local stack 푡 ← 푙 ⋅ 푡~푛; 푡 ← 푟 ⋅ 푡~푛 ▷ Range of PEs used by current task if 푡 − 푡 ≤ 1 then break ▷ Go to sequential phase as 푇[푙, 푟) is not a parallel task if 푖 = 푡 then ▷ PE 푖 creates the PE subgroup as it is the first PE 퐺푖 ← createPEGroup([푡 .. 푡]) for 푗 ← 푡 to 푡 − 1 do ▷ Set subgroup for all subgroup PEs 퐺푗 ← ReferenceOf(퐺푖) WaitFor(푡) ▷ Wait until PE subgroup is created JoinPEGroup(퐺푖) ▷ Join shared data structures while notEmpty(퐷) do ▷ Execute sequential tasks {푇[푙푠, 푟푠), 푙, 푟} ← pop(퐷) if 푟푠 − 푙푠 ≤ 2푛0 or 푟 − 푙 ≤ 푘푛0 then processBaseCase(퐴[푙푠, 푟푠 − 1]) else [푏0 .. 푏푘 1] ←partitionSequential(퐴[푙푠, 푟푠 − 1]) ▷ Partitioning step—returns buckets − for 푏 ← 푏푘 1 to 푏0 do D.push({푇[begin(b), end(b)), 푙 , 푟 }) ▷ Add seq. subtasks − 푠 푠 a sequential partitioning task, three steps are performed. First, a sequential partitioning step is executed on 퐴[푙, 푟 − 1]. Second, a new sequential subtask is created for each bucket. Finally, the PE adds these subtasks to its local stack in sorted order. Algorithm 2 shows the steps of the task scheduling algorithm in detail. The scheduling algorithm is executed by all PEs simultaneously. Figure 3.8 shows an example schedule of a task execution in IPS4o. From Lemmas 3.5 and 3.6 follows that the workload of sequential tasks and parallel tasks is evenly divided between the PEs. This property is used in Section 3.3 to analyze the parallel I/O complexity and the local work.

35 3 Robust Scalable In-Place Sorting Algorithms

Lemma 3.5 (Relationship between subarray and PE-group) Let 푇[푙, 푟) be a parallel task with PE-group [푡 .. 푡) and 푡′ = 푡 − 푡 PEs. Then, 푇[푙, 푟) processes a consecutive sequence of elements that starts at position 푙 ∈ [푡푛~푡 .. (푡 + 1)푛~푡 − 1] and ends at position 푟 ∈ [푡푛~푡 − 1 .. (푡 + 1)푛~푡 − 1] of the input array. This sums up to Θ(푡′푛~푡) elements in total. Thus, the size of a parallel task is proportional to the size of its PE-group. Proof of Lemma 3.5. From Lemma 3.1 follows that 푇[푙, 푟) covers position (푡 + 1)푛~푡 − 1 but not position 푡푛~푡 − 1 of the input array. It also follows, that 푇[푙, 푟) covers position 푡푛~푡 − 1 but not position (푡 + 1)푛~푡 − 1 of the input array. ◻

Lemma 3.6 (Relationship between PE and its sequential tasks) PE 푖 processes sequential tasks only containing elements from 퐴[푖푛~푡, (푖 + 2)푛~푡 − 1]. This sums up to O(푛~푡) elements in total. This lemma shows that the load of sequential tasks is evenly distributed among the PEs. Proof of Lemma 3.6. We prove the following proposition: When a PE 푖 starts processing sequen- tial tasks, the tasks only contain elements from 퐴[푖푛~푡, (푖 + 2)푛~푡 − 1]. From this proposition, Lemma 3.6 follows directly as PE 푖 only processes sequential subtasks of these tasks. Let the sequential subtask 푇[푙푠, 푟푠) be a bucket of a parallel task 푇[푙, 푟) with PEs [푡 .. 푡). Assume that 푇[푙푠, 푟푠) was assigned to the stack of PE 푖. We show 푖푛~푡 ≤ 푙푠 < 푟푠 ≤ (푖 + 2)푛~푡 with the case distinction 푖 < 푡 − 1 and 푖 ≥ 푡 − 1. Assume 푖 < 푡 − 1. From the calculation of 푖, we know that

푖 = min(⌊푙푠푡~푛⌋, 푡 − 1) = ⌊푙푠푡~푛⌋ ≤ 푙푠푡~푛

⟹ 푙푠 ≥ 푖푛~푡 .

We show that 푟푠 ≤ (푖+2)푛~푡 with a proof by contradiction. For the proof, we need the inequality 푙푠 < (푖 + 1)푛~푡 that is true because

푖 = min(⌊푙푠푡~푛⌋, 푡 − 1) = ⌊푙푠푡~푛⌋ > 푙푡~푛 − 1 .

Now, we assume that 푟푠 > (푖+2)푛~푡. As 푇[푙푠, 푟푠) is a sequential task, we have ⌊푟푠푡~푛⌋−⌊푙푠푡~푛⌋ = 1. However, this leads to the contradiction

1 = ⌊푟푠푡~푛⌋ − ⌊푙푠푡~푛⌋ ≥ 푖 + 2 − 푙푠푡~푛 > (푖 + 2) − (푖 + 1) = 1 .

Thus, we limited the end of the sequential task to 푟푠 ≤ (푖 + 2)푛~푡 and its start to 푙푠 ≥ 푖푛~푡 for 푖 < 푡 − 1. Assume 푖 ≥ 푡 − 1. In this case, 푖 is essentially equal to 푡 − 1 as Lemma 3.2 tells us that a sequential subtask of a parallel task is assigned to a PE of the parallel task. From the calculation of PE 푖, we know that

푖 = min(⌊푙푠푡~푛⌋, 푡 − 1) = 푡 − 1 ≤ 푙푡~푛

⟹ 푙푠 ≥ 푛~푡(푡 − 1) .

36 3.2 In-Place Parallel Super Scalar Radix Sort (IPS2Ra)

The end 푟 of the parent task can be bounded by

푡 = ⌊푟푡~푛⌋ ≥ 푟푡~푛 − 1 ⟹ 푟 ≤ (푡 + 1)푛~푡 .

We can use this inequality to bound the end 푟푠 of the sequential subtask 푇[푙푠, 푟푠) to

푟푠 ≤ 푟 ≤ (푡 + 1)푛~푡 as the subtask does not end after the parent task’s end 푟. Thus, we limited the end of the sequential task to 푟푠 ≤ (푖 + 2)푛~푡 and its start to 푙푠 ≥ 푖푛~푡 for 푖 ≥ 푡 − 1. ◻

3.1.2.b) Dynamic Rescheduling

The task scheduler is extended to utilize computing resources of PEs that no longer have tasks. We implement a simplified version of voluntary work sharing proposed for parallel string sorting [BES17]. A global stack is used to transfer sequential tasks to idle PEs. PEs without sequential tasks increase a global atomic counter that tracks the number of idle PEs. PEs with sequential tasks check the counter after each partitioning step and move one task to the global stack if the counter is larger than zero. Then, an idle PE can consume the task from the global stack by decreasing the counter and processing the task. The algorithm terminates when the counter is equal to 푡 which implies that no PE has a task left. We expect that we are able to amortize the additional cost in most cases or even reduce the work on the critical execution path: As long as no PE becomes idle, the counter remains valid in the PE’s private cache and the PEs only access their local stacks. When a PE becomes idle, the local counter-copies of the other PEs are invalidated and the counter value is reloaded into their private cache. In most cases, we can amortize the counter reload by the previously processed task, as the task has typically more than Ω(푘푛0) elements. When a PE adds an own task to the global stack, the task transfer is amortized by the workload reduction.

3.2 In-Place Parallel Super Scalar Radix Sort (IPS2Ra)

We also use IPS4o as an algorithmic framework to implement In-place Parallel Super Scalar Radix Sort (IPS2Ra). For IPS2Ra, we replaced the branchless decision tree of IPS4o with a simple radix extractor function that accepts unsigned integer keys. For tasks with less than 212 elements, we use SkaSort as a base case sorting algorithm. For small inputs (푛 ≤ 27), SkaSort then falls back to quicksort that again uses for 푛 ≤ 25. If we refer to IPS2Ra in its sequential form, we use the term I1S2Ra. IPS2Ra only sorts data types with unsigned integer keys. The author of SkaSort [Ska16b; Ska16a] demonstrates that a radix sorter can be extended to sort inputs with floating-point keys and even compositions of primitive data types. We note that I1S2Ra can be extended to sort these data types as well.

37 3 Robust Scalable In-Place Sorting Algorithms

3.3 Analysis of IPS4o

In this section, we analyze the additional memory requirement (Section 3.3.1), the I/O com- plexity (Section 3.3.2), and the local work (Section 3.3.3) of IPS4o. For convenience, we use the term local work when we mean local work per PE. The analysis uses the asynchronous variant of the PEM model introduced in Section 1.3.3. Sections 3.3.2 and 3.3.3 assume the following constraints for IPS4o (refer to Table 2.1 for a summary of the notation): Assumption. Minimum size of a logical data block of IPS4o: 푏 ∈ Ω(푡퐵).

Assumption. Minimum number of elements per PE: 푛~푡 ∈ Ω(max(푀, 푏푡)).

Assumption. Restrict I/Os while sampling and buffers fit into private cache: 푀 ∈ Ω(퐵푘 log 푘 + 푏푘) for some sufficiently constant of proportionality.

Assumption. Oversampling factor: 훼 ∈ Θ(log 푘′) where 푘′ is the current number of buckets.

Assumption. Restrict maximum size of base cases: 푛0 ∈ Ω(log 푘) ∩ O(푀~푘). Without loss of generality, we assume that an element has the size of one machine word. In practice, we keep the block size 푏 the same, i.e., the number of elements in a block is inverse proportional to the element size. As a result, we can guarantee that the size of the buffer blocks does not exceed the private cache without adapting 푘.

3.3.1 Additional Memory Requirement In this section, we show that IPS4o can be implemented either strictly in-place if the local task stack is implicitly represented or in-place if the tasks are stored on the recursion stack.

Theorem 3.7 (Memory usage of 4IPS o without local stack) IPS4o can be implemented with O(푘푏) additional memory per PE. Proof. Each PE has a space overhead of two swap buffers and 푘 buffer blocks of size 푏 (in total O(푘푏)). This bound also covers smaller amounts of memory required for the partitioning steps. A partitioning step uses a search tree (O(푘)), an overflow buffer (O(푏)), read and write pointers (O(푘퐵) if we avoid false sharing), end pointers, and bucket boundary pointers (Θ(푘) each). All of these data structures can be used for all recursion levels. The classification phase stores elements only in the buffer blocks and the overflow buffer. As each PE reads its elements sequentially into its buffer blocks, there is always an empty block in the input array when a buffer block is flushed. When the size of the input array is not a multiple of the block size, a single overflow buffer may be required to store the overflow. The permutation phase only requires the swap buffers and the read and write pointers to move blocks into their target bucket. In the sampling phase, we do not need extra space as we swap the sample to the front of the input array. Nor do we need the local stacks (each of size O‰푘 log 푛 Ž): Parallel 푘 푡푛0 tasks are processed immediately and we can use an implicit representation of the sequential tasks as described in Appendix A.2. ◻

38 3.3 Analysis of IPS4o

Theorem 3.8 (Memory usage of 4IPS o with local stack) With a local stack, IPS4o can be implemented with O‰푘(푏 + 푡 log 푛 Ž additional memory per PE. 푘 푛0 Proof. Each recursion level stores at most 푘 tasks on the local stack. Only O‰ log 푛 Ž levels 푘 푛0 of parallel recursion are needed to get to the base cases with a probability of at least 1 − 푛0~푛 (see Theorem A.1 in Appendix A.1). In the rare case that the memory is exhausted, the algorithm is restarted. ◻

3.3.2 Parallel Complexity and I/O Complexity The PEM model measures the parallel complexity of an algorithm by counting the number of parallel I/O steps it performs. In particular, for external sorting algorithms, there are well- ‰ 푛 푛 Ž established lower bounds. For the PEM model this bound is Ω 푡퐵 log푀 퐵 퐵 [Arg+08]. We achieve this bound when the Assumptions 1–5 are fulfilled—essentially when the input is large enough. Since our main goal was to have both a reasonably simple algorithm~ and a reasonably simple analysis, the implied size bound is probably far from the best possible.2 Apart from the local work, the main issue of a sorting algorithm is the number of accesses to the main memory. In this section, we analyze this aspect in the PEM model. First, we show that IPS4o is I/O-efficient if the constraints we state at the beginning of this chapter apply. Then, we discuss how the I/O efficiency of IPS4o relates to practice.

Theorem 3.9 (I/O complexity of4 IPS o) 4 O‰ 푛 푛 Ž IPS o has an I/O complexity of 푡퐵 log푘 푀 memory block transfers with a probability of at least 1 − 푀~푛. Before we prove Theorem 3.9, we prove that sequential partitioning steps exceeding the private cache are I/O-efficient (Lemma 3.10) and that parallel partitioning steps are I/O-efficient (Lemma 3.11).

Lemma 3.10 (I/Os of a sequential partitioning task) A sequential partitioning task with 푛′ ∈ Ω(푀) elements transfers Θ(푛′~퐵) memory blocks. Proof. A sequential partitioning task performs a partitioning step with one PE. The sampling phase of the partitioning step requires Θ(푘 log 푘) I/Os for sorting the random sample (As- sumption 4). We have 푘 log 푘 ∈ O(푛′~퐵) as 푀 ∈ Ω(퐵푘 log 푘) (Assumption 3). During the classification phase, the PE reads O(푛′) consecutive elements, writes them to its local buffer blocks, and eventually moves them blockwise back to the main memory. This requires O(푛′~퐵) I/Os in total. As 푀 ∈ Ω(푘푏), the local buffer blocks fit into the private cache. The same asymp- totic cost occurs for moving blocks during the permutation phase. In the cleanup phase, the PE has to clean up 푘 buckets. To clean up bucket 푖, the PE moves the elements from buffer block 푖 and, if necessary, elements from a block that overlaps into bucket 푖 + 1 to bucket boundary 푖. The elements from these two blocks are moved consecutively. We can amortize the transfer of full memory blocks with the I/Os from the classification phase as these blocks have been filled

2Assumptions 1–5 imply 푛 = Ω‰푡3퐵Ž

39 3 Robust Scalable In-Place Sorting Algorithms in the classification phase. We account O(1) I/Os for potential truncated memory blocks at the ends of the consecutive sequences. For 푘 bucket boundaries, this sums up to O(푘) ∈ O(푛′~퐵) as 푛′ ∈ Ω(푀) ∈ Ω(퐵푘) (Assumptions 1 and 3). ◻

Lemma 3.11 (I/Os of a parallel task)
A parallel task with Θ(푡′푛/푡) elements and 푡′ PEs transfers O(푛/(푡퐵)) memory blocks per PE.

Proof. A parallel task performs a partitioning step. After each phase, a barrier synchronizes the PEs taking O(푡) ∈ O(푛/(푡퐵)) I/Os (Assumption 2). The sampling phase of the partitioning step requires O(푘 log 푘) I/Os for loading the random sample (Assumption 4). We have 푘 log 푘 ∈ O(푛/(푡퐵)) as 푛/푡 ∈ Ω(퐵푘 log 푘) (Assumptions 2 and 3). During the classification phase, each PE reads O(푛/푡) consecutive elements, writes them to its local buffer blocks, and eventually moves them blockwise back to the main memory. As 푀 ∈ Ω(푘푏), the local buffer blocks fit into the private cache. In total, the classification phase transfers O(푛/(푡푏)) logical data blocks causing O(푛/(푡퐵)) I/Os per PE. The same asymptotic cost occurs for moving blocks during the permutation phase.

In each iteration of the main loop of the permutation phase, each PE tries to acquire a block to be moved. If successful, it moves this block to its destination. We account O(푏/퐵) I/O steps for a successful operation and O(푡) I/O steps for an unsuccessful operation. Assumption 1 implies that 푏/퐵 = Ω(푡) so that both cases take care of the maximal possible contention for the fetch-and-add operations involved. Since overall there are only O(푡′푛/(푡푏)) blocks to be moved and since each PE will fail to acquire a block at most 푘 times (once for each bucket), O(푛/(푡푏) ⋅ 푏/퐵 + 푘푡) = O(푛/(푡퐵) + 푘푡) I/O steps per PE suffice to finish the permutation phase. By Assumptions 3, 2, and 1 we have 푘 = O(푀/푏), 푀 = O(푛/푡), and 푏 = Ω(퐵푡), respectively. Thus, 푘푡 = O(푀푡/푏) = O(푛/푏) = O(푛/(푡퐵)) so that the number of I/O steps per PE simplifies to O(푛/(푡퐵)).

In the cleanup phase, 푡′ PEs have to clean up 푘 buckets. To clean up a single bucket, elements from 푡′ + 2 buffer blocks and bucket boundaries are moved. This takes O(푡′푏/퐵) I/Os for cleaning a bucket. We consider a case distinction with respect to 푘 and 푡′. If 푘 ≤ 푡′, then each PE cleans at most one bucket. This amounts to a cost of O(푡′푏/퐵) ∈ O(푛/(푡퐵)) since 푛/푡 ∈ Ω(푡푏) (Assumption 2). If 푘 > 푡′, then each PE cleans O(푘/푡′) buckets with a total cost of O(푘/푡′ ⋅ 푡′푏/퐵) ∈ O(푘푏/퐵) I/Os. We have 푘푏/퐵 ∈ O(푛/(푡퐵)) since Θ(푛/푡) ∈ Ω(푘푏) (Assumptions 2 and 3). ◻
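The block acquisition in the permutation phase can be illustrated with the following sketch. It is a simplification under stated assumptions: the real implementation packs the read and write pointer of a bucket into a single 128-bit word (see Section 4.1), whereas here the two atomic counters are kept separate and the names are invented for the example.

#include <atomic>
#include <cstdint>

// Simplified per-bucket state of the permutation phase: a read pointer and a
// write pointer, both counted in blocks.
struct BucketPointers {
    std::atomic<std::int64_t> read{0};   // next block to fetch from this bucket
    std::atomic<std::int64_t> write{0};  // next free block slot in this bucket
};

// Try to acquire the next unprocessed block of the bucket. Returns the block
// index, or -1 if no block is left (the unsuccessful case charged with O(t)
// I/O steps in Lemma 3.11; it can happen at most once per bucket and PE).
inline std::int64_t acquire_block(BucketPointers& bp, std::int64_t bucket_end) {
    std::int64_t block = bp.read.fetch_add(1, std::memory_order_relaxed);
    return block < bucket_end ? block : -1;
}

// Reserve a destination slot in the target bucket for a block that has just
// been fetched (the successful case, O(b/B) I/Os for the actual block move).
inline std::int64_t reserve_slot(BucketPointers& bp) {
    return bp.write.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    BucketPointers bp;
    (void)acquire_block(bp, 4);
    (void)reserve_slot(bp);
}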

Now, we can prove that IPS4o is I/O-efficient if the constraints we state at the beginning of this chapter apply.

Proof of Theorem 3.9. In this proof, we can assume that IPS4o performs O(log_푘(푛/푀)) recursion levels until the tasks have at most 푀 elements. According to Theorem A.1, this assumption holds with a probability of at least 1 − 푀/푛. We do not consider the situation of many identical keys since the elements with these identical keys will not be processed at later recursion levels anymore. From Theorem 3.7 we know that IPS4o uses additional data structures that require O(푘푏) additional memory. In addition to the accesses to these data structures, task 푇[푙, 푟) only accesses 퐴[푙 .. 푟 − 1]. As 푀 ∈ Ω(푏푘) (Assumption 3) we can keep the additional data structures in the

private cache. Thus, we only have to count the memory transfers of tasks from and to the input array. In the following analysis, we consider a case distinction with respect to the task type, its size, and the size of its parent task. Each case requires at most O((푛/(푡퐵)) log_푘(푛/푀)) I/Os per PE:

Parallel tasks: IPS4o processes the parallel tasks first. Parallel tasks transfer Θ(푛/(푡퐵)) memory blocks per PE (see Lemma 3.11). As a PE performs at most one parallel task on each recursion level (see Lemma 3.4), parallel tasks on the first O(log_푘(푛/푀)) recursion levels perform O((푛/(푡퐵)) log_푘(푛/푀)) I/Os per PE. On subsequent recursion levels, no parallel tasks are executed: After O(log_푘(푛/푀)) recursion levels, the size of the tasks is O(푀). However, parallel tasks have Ω(푀) elements. This follows from Lemma 3.5 and Assumption 2.

Large tasks (Sequential partitioning tasks not fitting into the main memory): A large task with 푛′ elements takes Θ(푛′/퐵) I/Os (see Lemma 3.10). A PE processes sequential tasks covering a continuous stripe of O(푛/푡) elements of the input array (see Lemma 3.6). Thus, the large tasks of a PE transfer O(푛/(푡퐵)) memory blocks on each recursion level. This sums up to O((푛/(푡퐵)) log_푘(푛/푀)) I/Os per PE for the first O(log_푘(푛/푀)) recursion levels. After O(log_푘(푛/푀)) recursion levels, the size of tasks fits into the main memory, i.e., their size is O(푀) for some sufficiently small constant of proportionality.

Middle tasks (Sequential tasks of size Ω(퐵) that fit into the main memory and that have a parallel or large task as parent task): In the first step of middle tasks, the classification phase, the PE reads the elements of the task from left to right. As the task fits into the private cache, the task does not perform additional I/Os after the classification phase. For a middle task of size 푛′, we have O(⌊푛′/퐵⌋) I/Os as 푛′ ∈ Ω(퐵). Buckets of middle tasks are sequential subtasks that again fit into the main memory. Thus, each input element is part of a middle task only once and middle tasks cover disjoint parts of the input array. Additionally, we know from Lemma 3.6 that the sequential tasks of a PE contain O(푛/푡) elements. This implies that a PE transfers O(푛/(푡퐵)) memory blocks for middle tasks.

Tiny tasks (Sequential tasks of size O(퐵) that have a parallel or large task as parent task): A tiny task needs O(1) I/Os. We account these I/Os to its parent task. A parent task gets at most O(푘) additional I/Os in the worst case, O(1) for each bucket. The parent task has Ω(푡퐵푘) elements: By definition, the parent task has Ω(푀) elements and we have 푀 ∈ Ω(푡퐵푘) (Assumptions 1 and 3). We have already accounted Ω(푡퐵푘/퐵) I/Os (Ω(퐵푘/퐵) I/Os) for the parent task previously (see I/Os of large tasks and parallel tasks). Thus, the parent task can amortize the I/Os of its tiny subtasks.

Small tasks (Sequential tasks with sequential parent tasks fitting into the main memory): Let the small task 푇[푙푠, 푟푠) processed by PE 푖 be a bucket of a sequential task 푇[푙, 푟) containing O(푀) elements. When PE 푖 processed task 푇[푙, 푟), the subarray 퐴[푙 .. 푟 − 1] was loaded into the private cache of PE 푖. As the PE processes the sequential tasks from its local stack in depth-first search order, 퐴[푙 .. 푟 − 1] remains in the PE's private cache until the small task 푇[푙푠, 푟푠) is executed. The small task does not require any memory transfers – it only accesses 퐴[푙푠 .. 푟푠 − 1], which is a subarray of 퐴[푙 .. 푟 − 1]. ◻

Comparing the I/O Volume of I1S4o and S4o. We analyze the constant factors of the I/O volume (i.e., data flow between cache and main memory) for the sequential algorithms I1S4o (IPS4o with 푡 = 1) and S4o.


Algorithm   Subroutine                  Types        Reps       Sum in 푛 Bytes
S4o         Copy back                   r + w + wa   once       16 + 8
S4o         Base Case                   r + w        once       16
S4o         Init Temp Array + Oracle    w            once       9
S4o         Classification: Oracle      w + wa       per level  1 + 1
S4o         Classification: Array       r            per level  8
S4o         Redistribution: Oracle      r            per level  1
S4o         Redistribution: Array       r + w + wa   per level  16 + 8
I1S4o       Base Case                   r + w        once       16
I1S4o       Classification              r + w        per level  16
I1S4o       Redistribution              r + w        per level  16

Table 3.1: I/O volume of read (r) and write (w) operations broken down into subroutines of I1S4o and S4o. Additionally, potential write allocate operations (wa) are listed.

To simplify the discussion, we assume a single recursion level, 푘 = 256, 8-byte elements, and an oracle with 1-byte entries for S4o. Furthermore, we assume that the input does not fit into the private cache. Both algorithms read and write the data once for the base case—16푛 bytes of I/O volume. Each level of I1S4o reads and writes all data once in the classification phase and once in the permutation phase—32푛 bytes per level. Each level of S4o reads the elements twice and writes them once only in its distribution phase—24푛 bytes per level. Additionally, S4o writes an oracle sequence that indicates the bucket for each element in the classification phase and reads the oracle sequence in the distribution phase—2푛 bytes per level. The algorithm also has to allocate the temporary arrays. For security reasons, that memory is zeroed by the operating system—9푛 bytes.3 If the number of levels is odd, S4o has to copy the sorted result back to the input array—16푛 bytes. For now, I1S4o (S4o) has an I/O volume of 32푛 (26푛) bytes per level and 16푛 (41푛) bytes once.

When S4o writes to the temporary arrays or during copying back, cache misses happen when an element is written to a cache block that is currently not in the cache. Depending on the cache replacement algorithm, a write allocate may be performed—the block is read from the memory to the cache even though none of the data in that block will ever be read. Detecting that the entire cache line will be overwritten is difficult as S4o writes to the target buckets element by element. This amounts to an I/O volume of up to 9푛 bytes per level and 8푛 bytes once. I1S4o does not perform write allocates. The classification phase essentially sweeps a window of size Θ(푏푘) through the memory by reading elements from the right border of the window and

3In current versions of the Linux kernel this is done by a single PE and thus results in a huge scalability bottleneck in the parallel case.

writing elements to the left border. The permutation phase reads a block from the memory and replaces the "empty" memory block with a cached block afterwards. Table 3.1 shows the I/O volume of the subroutines in detail.

Overall, we get for I1S4o (S4o) a total I/O volume of 32푛 (35푛) bytes per level and 16푛 (49푛) bytes once—S4o with one level has a factor of 1.75 more I/O volume than I1S4o. This is surprising since, at first glance, the partitioning algorithm of I1S4o writes the data twice, whereas S4o does this only once. However, this is more than offset by "hidden" overheads of S4o like memory management, memory page initialization, and write allocates. Furthermore, S4o may suffer more conflict misses than I1S4o due to the mapping of data to cache lines. In the distribution phase, S4o reads the input from left to right but writes element-wise to positions in the buckets which are not coordinated. For the same reasons, S4o may suffer more TLB misses. I1S4o, on the other hand, essentially writes elements to cached buffer blocks (classification) and swaps blocks of size 푏 within the input array (block permutation). For an average case analysis on scanning multiple sequences, we refer to [MS03].

Much of this overhead can be reduced using measures that are non-portable (or hard to make portable). In particular, non-temporal writes eliminate the write allocates and also help to eliminate the conflict misses. One could also use a base case sorter that does the copying back as a side effect when the number of recursion levels is odd. When sorting multiple times within an application, one can keep the temporary arrays without having to reallocate them. However, this may require a different interface to the sorter. Overall, depending on many implementation details, S4o may require slightly or significantly more I/O volume.
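The non-temporal writes mentioned above could look as follows on x86 with AVX2. This is only a sketch of the idea, not code from I1S4o or S4o; the function name and sizes are chosen for the example.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Copy n 64-bit elements (n a multiple of 4) to a 32-byte aligned destination
// with non-temporal stores. The stores bypass the cache, so writing does not
// trigger a write-allocate read of the destination cache lines.
// Requires AVX2 (the benchmarks in Chapter 4 are compiled with -march=native).
void copy_nontemporal(const std::uint64_t* src, std::uint64_t* dst,
                      std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m256i v = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(src + i));
        _mm256_stream_si256(reinterpret_cast<__m256i*>(dst + i), v);
    }
    _mm_sfence();  // make the streamed stores globally visible
}

int main() {
    alignas(32) static std::uint64_t src[1024];
    alignas(32) static std::uint64_t dst[1024];
    for (std::size_t i = 0; i < 1024; ++i) src[i] = i;
    copy_nontemporal(src, dst, 1024);
    return static_cast<int>(dst[42]);  // 42
}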

3.3.3 Branch Mispredictions and Local Work

Besides the latency of loading and writing data, which we analyzed in the previous section, branch mispredictions and the (total) work of an algorithm can limit its performance. In the next paragraph, we address branch mispredictions of IPS4o and afterwards, we analyze the total work of IPS4o.

Our algorithm IPS4o itself has virtually no branch mispredictions during element classification.4 A branch misprediction only occurs when a bucket has to be flushed. This happens on average after 푏 element classifications (after 푏 log 푘 element comparisons). The base case has about one misprediction per element for insertion sort. The permutation phase has a negligible amount of mispredictions since it basically moves whole blocks to their target position. The cleanup phase has mispredictions of the order of O(푡 + 푘) since the buckets are evenly distributed to the PEs and since each PE contributes one buffer block per bucket.

We now analyze the local work of IPS4o. We neglect delays introduced by PE synchronizations and accesses to shared variables as we accounted for those in the I/O analysis in the previous section. For the analysis, we assume that the base case algorithm performs partitioning steps with 푘 = 2 and a constant number of samples until at most one element remains. Thus, the local work of the base case algorithm is identical to the local work of quicksort. We also assume that the base case algorithm is used to sort the samples.

4Unless the comparator function causes branch mispredictions.
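A minimal sketch of branchless classification with an implicit, heap-ordered splitter tree, in the spirit of the decision tree described above; the details of the actual implementation (equality buckets, unrolling, explicit conditional moves) are omitted and the names are illustrative.

#include <cstddef>
#include <vector>

// Classify `key` into one of k buckets using an implicit binary search tree
// stored in heap order: tree[1] is the root, the children of node j are 2j
// and 2j+1. The loop body contains no key-dependent branch; the comparison
// result is used arithmetically, so it typically compiles to setcc/cmov
// instead of a conditional jump. Assumes k is a power of two and tree has k
// entries (index 0 unused).
inline std::size_t classify(long key, const std::vector<long>& tree,
                            std::size_t k) {
    std::size_t j = 1;
    while (j < k) {
        j = 2 * j + static_cast<std::size_t>(key > tree[j]);
    }
    return j - k;  // bucket index in [0, k)
}

int main() {
    // k = 4 buckets, splitters 10 < 20 < 30 stored in heap order: 20, 10, 30.
    std::vector<long> tree = {0 /*unused*/, 20, 10, 30};
    std::size_t b0 = classify(5, tree, 4);   // bucket 0
    std::size_t b3 = classify(42, tree, 4);  // bucket 3
    (void)b0; (void)b3;
}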


Actually, our implementation of IPS4o sorts the base cases with insertion sort. The reason is that a base case with more than 2푛0 elements can only occur if its parent task has between 푛0 and 푘푛0 elements. In this case, the average base case size is between 0.5푛0 and 푛0. Our experiments have shown that insertion sort is more efficient than quicksort for these small inputs. Also, base cases with much more than 2푛0 elements are very rare. To sort the samples, our implementation recursively invokes IPS4o.
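For reference, the kind of straightforward insertion sort used for such small, partially presorted base cases could look as follows; this is a generic sketch, not the tuned routine from the implementation.

#include <cstddef>
#include <vector>

// Simple insertion sort; efficient for base cases of around n0 = 16 elements
// as discussed above, at the price of roughly one misprediction per element.
template <typename T>
void insertion_sort(std::vector<T>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        T v = a[i];
        std::size_t j = i;
        while (j > 0 && v < a[j - 1]) {
            a[j] = a[j - 1];
            --j;
        }
        a[j] = v;
    }
}

int main() {
    std::vector<int> a = {5, 2, 9, 1, 7};
    insertion_sort(a);  // a becomes {1, 2, 5, 7, 9}
}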

Theorem 3.12 (Local work of IPS4o)
When using quicksort to sort the base cases and the samples, IPS4o has a local work of O(푛/푡 log 푛) with a probability of at least 1 − 4/푛.

For the proof of Theorem 3.12, we need Lemmas 3.13–3.18. These lemmas use the term small task for tasks with at most 푘푛0 elements and the term large task for tasks with at least 푘푛0 elements.

Lemma 3.13 (Local work of a partitioning task)
A partitioning task with 푛′ elements and 푡′ PEs has a local work of O(푛′/푡′ log 푘) excluding the work for sorting the samples.

Proof. In the classification phase of the partitioning step, the comparisons in the branchless decision tree dominate. Each PE classifies 푛′/푡′ elements, which takes O(log 푘) comparisons each. This sums up to O(푛′/푡′ log 푘). The element classification dominates the remaining work of this phase, e.g., the work for loading each element once and, every 푏 elements, the work for flushing a local buffer.

In the permutation phase, each block in the input array is swapped into a buffer block once and swapped back into the input array once. As each PE swaps at most ⌊푛′/(푡′푏)⌋ blocks of size 푏, the phase has O(푛′/푡′) local work.

When the cleanup phase is executed sequentially, the elements of the local buffers are flushed into blocks that overlap into the next bucket. This may displace elements stored in these blocks. The displaced elements are written into empty parts of blocks. Thus, each element is moved at most once, which sums up to O(푛′) work. For the cleanup phase of a parallel partitioning step, we conclude from the proof of IPS4o's I/O complexity (see Theorem 3.9) that the local work is in O(푛′/푡′): The proof of the I/O complexity shows that a parallel cleanup phase is bounded by O(푛′/(푡′퐵)) I/Os. Also, each element that is accessed in the cleanup phase is moved at most once and no additional work is performed. We account a local work of 퐵 for each memory block that a PE accesses. Thus, we can derive from O(푛′/(푡′퐵)) I/Os a local work of O(푛′/푡′) for the parallel cleanup phase. ◻

Lemma 3.14 (Relationship between parallel tasks and small tasks) At most one parallel task that is processed by a PE is a small task.

Proof. Assume that a PE processes at least two small parallel tasks 푝1 and 푝2. According to Lemma 3.4, a PE processes at most one of these tasks per recursion level. A parallel subtask of PE 푖 is a subtask of a parallel task of PE 푖 and represents a bucket of this task (see Lemma 3.3). Thus, 푝1 and 푝2 must be on different recursion levels, and 푝1 processes a subset of elements


processed by 푝2 or vice versa. However, this is a contradiction as buckets of small tasks are base cases. ◻

Lemma 3.15 (Local work of small partitioning tasks)
The local work for all small partitioning tasks is in total O(푛/푡 log 푘) excluding the work for sorting the samples.

Proof. In this proof, we neglect the work for sorting the sample. Lemma 3.13 tells us that a partitioning task with 푛′ elements and 푡′ PEs has a local work of O(푛′/푡′ log 푘). Also, parallel tasks with 푡′ PEs have Θ(푡′푛/푡) elements (see Lemma 3.5). Thus, we have O(푛/푡 log 푘) local work for a small parallel task. Overall, we have O(푛/푡 log 푘) local work for small parallel tasks as each PE processes at most one of these tasks (see Lemma 3.14).

We now consider small sequential partitioning tasks. Small sequential partitioning tasks that are processed by a single PE cover in total at most O(푛/푡) different elements (see Lemma 3.6). Each of these elements passes at most one of the small partitioning tasks since buckets of these tasks are base cases. Thus, a PE processes small partitioning tasks of size O(푛/푡) in total. As small partitioning tasks with 푛′ elements require O(푛′ log 푘) local work (see Lemma 3.13), the local work of all small partitioning tasks is O(푛/푡 log 푘). ◻

Lemma 3.16 (Local work of partitioning tasks)
The partitioning tasks of one recursion level require O(푛/푡 log 푘) local work excluding the work for sorting the samples.

Proof. Lemma 3.13 tells us that a partitioning task with 푛′ elements and 푡′ PEs has a local work of O(푛′/푡′ log 푘) excluding the work for sorting the samples. Parallel tasks with 푡′ PEs have Θ(푡′푛/푡) elements (see Lemma 3.5). Thus, we have O(푛/푡 log 푘) local work on a recursion level for parallel tasks. A PE processes sequential partitioning tasks covering a continuous stripe of O(푛/푡) elements of the input array (see Lemma 3.6). Thus, we also have O(푛/푡 log 푘) local work on a recursion level for sequential partitioning tasks. ◻

Lemma 3.17 (Local work of large partitioning tasks)
All large partitioning tasks have in total O(푛/푡 log 푛) local work with a probability of at least 1 − 1/푛, excluding the work for sorting the samples.

Proof. In this proof, we neglect the work for sorting the samples of a partitioning task. Large partitioning tasks create between √푘 and 푘 buckets (see Section 3.1.1.a)). According to Theorem A.1 in Appendix A.1, IPS4o performs at most Θ(log_{√푘}(푛/(푘푛0))) = Θ(log_푘(푛/(푘푛0))) recursion levels with a probability of at least 1 − 푘푛0/푛 until all partitioning tasks have less than 푘푛0 elements. However, this probability is not tight enough. Instead, we can perform up to O(log_푘 푛) recursion levels and still have O(푛/푡 log 푛) local work as each recursion level requires O(푛/푡 log 푘) local work (see Lemma 3.16). In this case, Theorem A.1 in Appendix A.1 tells us that all partitioning tasks have at most one element with a probability of 1 − 1/푛. This probability also holds if we stop partitioning buckets with less than 푘푛0 elements. ◻


Lemma 3.18 (Local work of base case tasks)
The local work of all base case tasks is in total O(푛/푡 log 푛) with a probability of at least 1 − 1/푛.

Proof. Sorting O(푛) elements with the base case algorithm quicksort does not exceed O(log 푛) recursion levels with probability 1 − 1/푛 [JáJ00]. Thus, an execution of quicksort with O(푛/푡) elements requires O(푛/푡 log 푛) local work with a probability of at least 1 − 1/푛 as it would also not exceed O(log 푛) recursion levels with at least the same probability. Sorting all base cases of a PE is asymptotically at least as efficient as sorting O(푛/푡) elements at once: The base cases have in total at most O(푛/푡) elements (see Lemma 3.6) but the input is already prepartitioned. ◻

Lemma 3.19 (Local work for samples of small partitioning tasks)
The local work for sorting the samples of all small partitioning tasks is O(푛/푡 log 푛) in total with a probability of at least 1 − 1/푛.

Proof. Small partitioning tasks with 푛′ elements have O(푛′/푛0) buckets and a sample of size O((푛′/푛0) log(푛′/푛0)) (see Assumption 4 for the oversampling factor). The size of a sample is in particular bounded by O(푛′). For this, we use 푛0 ∈ Ω(log 푘) (Assumption 5) and 푘 ≥ 푛′/푛0, from which it follows that 푛0 ∈ Ω(log(푛′/푛0)) holds. Furthermore, small sequential partitioning tasks processed by a single PE cover in total at most O(푛/푡) different elements (see Lemma 3.6). Thus, the total size of all samples from small sequential partitioning tasks sorted by a PE is limited to O(푛/푡). We count one additional sample from a potential small parallel task (see Lemma 3.14). We can also limit its size to O(푛/푡) using Assumptions 2 and 5. We can prove that the work for sorting samples of a total size of O(푛/푡) is in O(푛/푡 log 푛) with a probability of at least 1 − 1/푛. We refer to the proof of Lemma 3.18 for details. ◻

Lemma 3.20 (Local work for samples of large partitioning tasks)
The local work for sorting the samples of all large partitioning tasks is O(푛/푡 log 푛) in total with a probability of at least 1 − 1/푛.

Proof. A sequential large partitioning task with Ω(푘푛0) elements has Ω(푘 log 푘) elements (see Assumption 5) and a parallel large partitioning task has Ω(푡′푛/푡) elements (see Lemma 3.5) with 푛/푡 ∈ Ω(푘 log 푘), which follows from Assumptions 2 and 3. Thus, a large partitioning task with 푡′ PEs has Ω(푡′푘 log 푘) elements.

Each PE invokes at most 푟 = 푙 ⋅ 푛 log 푛/(푡푘 log² 푘) large partitioning tasks for a constant 푙 – excluding the work for sorting the samples: On the one hand, Lemma 3.17 tells us that all partitioning tasks performed by a single PE sum up to O(푛/푡 log 푛) local work in total. On the other hand, this sum accounts O(푘 log² 푘) local work or more for a large partitioning task since it accounts O(푛′/푡′ log 푘) local work for large partitioning tasks with 푛′ elements and 푡′ PEs (see Lemmas 3.13 and 3.16) and since a large partitioning task has Ω(푡′푘 log 푘) elements. Thus, when we execute large partitioning tasks, each PE performs at most 푟 sample sorting routines – one for each task.

For the local work analysis, we consider a modified sample sorting algorithm. Instead of using quicksort, we use an adaption that restarts quicksort when it exceeds O(푘 log² 푘) local work until the sample is finally sorted. The bounds for the local work that we obtain from

this variant also hold when IPS4o executes quicksort until success instead of restarting the algorithm: A restart means to neglect the prepartitioned buckets, which makes the problem unnecessarily difficult. For the sample sorting routines of large partitioning tasks, each PE can spend O(푟푘 log² 푘) local work in total. As we restart quicksort after O(푘 log² 푘) local work, we can amortize even 푥푟 (re)starts of quicksort for any constant 푥. We show that 푥푟 (re)starts are sufficient to successfully sort 푟 samples with a probability of at least 1 − 1/푛.

We observe that one execution of quicksort unsuccessfully sorts a sample with a probability of at most 푝 = 푘^{−3} log₂^{−3} 푘 as the size of the samples is bounded by O(푘 log 푘). For this approximation, we use that sorting 푛 elements with quicksort takes O(푛 log 푛) work with high probability [JáJ00]. Each execution of quicksort is a Bernoulli trial as we have exactly two possible outcomes, "successful sorting in time" and "unsuccessful sorting in time", and the probability of failure is bounded by 푝 each time. When we consider all quicksort invocations of IPS4o, we need 푟 successes. We define a binomial experiment that repeatedly invokes quicksort on the sample of the first large partitioning task until success and then continues with the second large partitioning task, until the sample of each partitioning step of a PE is sorted. Asymptotically, we can spend 푥푟 (re)starts of quicksort for any constant 푥 ≥ 1 such that the binomial experiment does not exceed O(푛/푡 log 푛) local work. Let the random variable 푋 be the number of unsuccessful sample sorting executions and assume that 푥 > 2. Then, the probability 퐼

퐼 = P[푋 > (푥 − 1)푟]
  ≤ ∑_{푗 > (푥−1)푟} (푥푟 choose 푗) 푝^푗 (1 − 푝)^{푥푟−푗}
  ≤ ∑_{푗 > (푥−1)푟} (푥푟푒/푗)^푗 푝^푗
  ≤ ∑_{푗 > (푥−1)푟} (푥푟푒/((푥 − 1)푟))^푗 푝^푗
  = ∑_{푗 > (푥−1)푟} (푝푒푥/(푥 − 1))^푗
  = (푝푒푥/(푥 − 1))^{(푥−1)푟+1} ⋅ (1 − 푝푒푥/(푥 − 1))^{−1}                    (3.1)
  ≤ 푝푒푥/(푥 − 1 − 푝푒푥) ⋅ (푝푒푥/(푥 − 1))^{(푥−1) ⋅ 푙푘 log 푘 log 푛/(푘 log² 푘)}
  ≤ 푝푒푥/(푥 − 1 − 푝푒푥) ⋅ 푛^{푙 (푥−1) log(푝푒푥/(푥−1))/log 푘}
  ≤ 2.13 푛^{−2(푥−1)} ≤ 1/푛

defines an upper bound of the probability that 푥푟 (re)starts of the sample sorting algorithm execute less than 푟 successful runs. The second "≤" uses (푛 choose 푘) ≤ (푒푛/푘)^푘, the third "=" uses ∑_{푘=푛}^∞ 푟^푘 = 푟^푛/(1 − 푟), derived from the geometric series, the second "≤" and the third "=" use 푝푒푥/(푥 − 1) < 1, and the last "≤" uses log(푝푒푥/(푥 − 1))/log 푘 < −1/2 and 푝푒푥/(푥 − 1 − 푝푒푥) < 2.13. The last "≤" requires 푥 > 2 and a sufficiently large 푛. ◻


Proof of Theorem 3.12. According to Lemma 3.15, the small partitioning tasks excluding the work for sorting the samples require O(푛/푡 log 푘) local work. For large partitioning tasks, we have O(푛/푡 log 푛) local work with a probability of at least 1 − 1/푛 (see Lemma 3.17). The same holds for the base cases (see Lemma 3.18). Lemmas 3.19 and 3.20 bound the local work for sorting the samples of small and large partitioning tasks to O(푛/푡 log 푛), each with a probability of at least 1 − 1/푛. This sums up to a total local work of O(푛/푡 log 푛) with a probability of at least 1 − 4/푛. ◻

Chapter 4
Experiments and Conclusion

In this chapter, we present results for ten data distributions and six different data types, obtained on four different machines with one, two, and four processors, comparing 21 different sorting algorithms. We extensively compare our in-place parallel sorting algorithms IPS4o and IPS2Ra as well as their sequential counterparts I1S4o and I1S2Ra to various competitors1:

• Parallel in-place comparison-based sorting
  – MCSTLbq (OpenMP): Two implementations (balanced and unbalanced) from the GCC STL library [SSP07] based on quicksort proposed by Tsigas and Zhang [TZ03].
  – TBB [Wen19] (TBB): Quicksort from the Intel® TBB library [Rei07].

• Parallel non-in-place comparison-based sorting
  – PBBS [Shu+12] (Cilk): √푛-way samplesort [BGS10] implemented in the so-called problem based benchmark suite.
  – MCSTLmwm (OpenMP): Stable multiway mergesort from the GCC STL library.
  – PS4o [Axt20a] (OpenMP or std::thread): Our parallel and stable implementation of S4o from Section 4.1.
  – ASPaS [SyN18] (POSIX Threads): Stable mergesort that vectorizes the merge function with AVX2 [HWF15]. ASPaS only sorts int, float, and double inputs and uses the comparator function "<".

• Parallel in-place radix sort
  – RegionSort [Obe+19a] (Cilk): Most Significant Digit (MSD) radix sort [Obe+19b] that only sorts keys of unsigned integers. RegionSort skips the most significant bits that are zero for all keys.
  – IMSDradix [Pol14] (POSIX Threads): MSD radix sort [PR14] from Polychroniou and Ross with blockwise redistribution. The implementation, published by Polychroniou and Ross, however, requires 20 % additional memory in addition to the input array and is very explorative.

• Parallel non-in-place radix sort
  – PBBR [Shu+12] (Cilk): A simple implementation of stable MSD radix sort from the so-called problem based benchmark suite.

1Since several algorithms were implemented by third parties, we may cite their publication and implementation separately.


  – RADULS2 [Gro17] (std::thread): MSD radix sort that uses non-temporal writes to avoid write allocate misses [KDD17]. RADULS2 requires 256-bit array alignment. Both keys and objects have to be aligned at 8-byte boundaries. We partially compiled the code with the flag -O1 as recommended by the authors. The algorithm does not compile with the Clang compiler.

• Sequential in-place comparison-based sorting
  – BlockQ [Wei16a]: An implementation of BlockQuicksort [EW16] provided by the authors of the sorting algorithm publication.
  – BlockPDQ [Pet15]: Pattern-Defeating Quicksort that integrated the approach of BlockQuicksort in 2016. BlockPDQ has similar running times as BlockQuicksort using Lomuto's Partitioning [AH19], published in 2018.
  – DualPivot [Wei16b]: A C++ port of Yaroslavskiy's Dual-Pivot Quicksort [Yar10]. Yaroslavskiy's Dual-Pivot Quicksort has been the default sorting routine for primitive data types in Oracle's Java runtime library since version 7.
  – std::sort: from the GCC STL library.
  – WikiSort [McF14]: An implementation of stable in-place mergesort [KK08].

• Sequential non-in-place comparison-based sorting
  – Timsort [GM14]: A C++ port of Timsort [Pet02]. Timsort is an implementation of stable mergesort that takes advantage of presorted sequences of the input. Timsort has been part of Oracle's Java runtime library since version 7 to sort non-primitive data types.
  – S4oS [Hüb16]: A recent implementation of non-in-place S4o [SW04] optimized for modern hardware.
  – 1S4o [Axt20a]: Our implementation of S4o, which we describe in Section 4.1.

• Sequential in-place radix sort
  – SkaSort [Ska16b]: MSD radix sort [Ska16a] that accepts a key-extractor function returning primitive data types or pairs, tuples, vectors, and arrays containing primitive data types. The latter ones are sorted lexicographically.

• Sequential non-in-place radix sort
  – IppRadix [Cor20]: Radix sort from the Intel® Integrated Performance Primitives library optimized with the AVX2 and AVX-512 instruction sets.

We do not compare our algorithm to PARADIS as its source code is not publicly available. However, Obeya et al. [Obe+19b] compare RegionSort to the numbers reported in the publication of PARADIS and conclude that RegionSort is faster than PARADIS. Additionally, RegionSort and our algorithm have stronger theoretical guarantees (see Chapter 2).

Most radix sorters, i.e., IppRadix, IMSDradix, RADULS2, PBBR, and RegionSort, do not support all data types used in our experiments. Only the radix sorter SkaSort supports all data types, as the types used here are either primitives or compositions of primitives, which are

sorted lexicographically. All algorithms are written in C++ and compiled with version 7.5.0 of the GNU compiler collection, using the optimization flags "-march=native -O3". We have not found a more recent compiler that supports Cilk threads, needed for RegionSort, PBBS, and PBBR.

Algorithm 3 Quartet comparison
function lessThan(푙, 푟)
    if 푙.푎 ≠ 푟.푎 then
        return 푙.푎 < 푟.푎
    else if 푙.푏 ≠ 푟.푏 then
        return 푙.푏 < 푟.푏
    else
        return 푙.푐 < 푟.푐

Algorithm 4 100B comparison
function lessThan(푙, 푟)
    for 푖 ← 0 to 9 do
        if 푙.푘[푖] ≠ 푟.푘[푖] then
            return 푙.푘[푖] < 푟.푘[푖]
    return False
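In C++, the pseudocode of Algorithms 3 and 4 corresponds to comparators along the following lines. The struct layouts are assumptions based on the data type descriptions in the next paragraph and are not the exact benchmark code.

#include <array>
#include <cstdint>

// Assumed layouts: Quartet has a three-word key (a, b, c) plus one word of
// associated data; 100B has a 10-byte key plus 90 bytes of associated data.
struct Quartet {
    std::uint64_t a, b, c;   // key, compared lexicographically
    std::uint64_t payload;   // associated information
};

struct Record100B {
    std::array<std::uint8_t, 10> k;        // key
    std::array<std::uint8_t, 90> payload;  // associated information
};

// Algorithm 3: lexicographic comparison of Quartet keys.
inline bool less_than(const Quartet& l, const Quartet& r) {
    if (l.a != r.a) return l.a < r.a;
    if (l.b != r.b) return l.b < r.b;
    return l.c < r.c;
}

// Algorithm 4: lexicographic comparison of the 10-byte keys.
inline bool less_than(const Record100B& l, const Record100B& r) {
    for (int i = 0; i <= 9; ++i) {
        if (l.k[i] != r.k[i]) return l.k[i] < r.k[i];
    }
    return false;
}

int main() {
    Quartet x{1, 2, 3, 0}, y{1, 2, 4, 0};
    Record100B u{}, v{};
    v.k[9] = 1;
    return less_than(x, y) && less_than(u, v) ? 0 : 1;
}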

Input Instances. We ran benchmarks with double (64-bit floating-point numbers), uint64 (64-bit unsigned integers), uint32 (32-bit unsigned integers), and Pair, Quartet, and 100B data types. Pair (Quartet) consists of one (three) 64-bit unsigned integers as key and one 64-bit unsigned integer of associated information. 100B consists of 10 bytes as key and 90 bytes of associated information. The keys of Quartet and 100B are compared lexicographically. Algorithms 3 and 4 show the lexicographical compare function that we used in our benchmarks. We want to point out that lexicographical comparisons can be implemented in different ways. We also tested std::lexicographical_compare for Quartet and std::memcmp for 100B. However, it turned out that these compare functions are (much) less efficient for all competitive algorithms. SkaSort is the only radix sorter that is able to sort keys lexicographically. For Quartet and 100B data types, we invoke SkaSort with a key-extractor function that returns the values of the key stored in a std::tuple object.

We ran benchmarks with ten different input distributions: Uniform, Exponential, and AlmostSorted, proposed by Shun et al. [Shu+12]; RootDup, TwoDup, and EightDup from Edelkamp et al. [EW16]; and Zipf (Zipf distributed input), Sorted (sorted Uniform input), ReverseSorted, and Zero (just zeros). The input distribution Exponential selects for each element a value 푖 ∈ [0 .. ⌈log 푛⌉] uniformly at random and uses the hash of another value selected uniformly at random from [2^푖 .. 2^{푖+1}) as the key. The hash function has range [0 .. 2^푤) for keys represented by 푤 bits. RootDup sets 퐴[푖] = 푖 mod ⌊√푛⌋, TwoDup sets 퐴[푖] = (푖² + 푚/2) mod 푚, and EightDup sets 퐴[푖] = (푖⁸ + 푚/2) mod 푚 with 푚 being the minimum of 푛 and the maximum finite value representable by the key type. The input distribution Zipf generates the integer number 푘 ∈ [1 .. 10⁶] with probability proportional to 1/푘^0.75. Figure 4.1 illustrates the nontrivial input distributions RootDup, Zipf, Exponential, TwoDup, EightDup, and AlmostSorted.
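The closed-form distributions follow directly from the formulas above. The sketch below transcribes RootDup and TwoDup for uint64 keys (EightDup is analogous with the eighth power); it is a direct reading of the definitions, not the benchmark generator itself, and the function names are invented.

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// RootDup: A[i] = i mod floor(sqrt(n)).
std::vector<std::uint64_t> root_dup(std::size_t n) {
    std::vector<std::uint64_t> a(n);
    const std::uint64_t m =
        static_cast<std::uint64_t>(std::sqrt(static_cast<double>(n)));
    for (std::size_t i = 0; i < n; ++i) a[i] = i % m;
    return a;
}

// TwoDup: A[i] = (i^2 + m/2) mod m with m = min(n, maximum key value),
// which is simply n for 64-bit keys and realistic input sizes.
std::vector<std::uint64_t> two_dup(std::size_t n) {
    std::vector<std::uint64_t> a(n);
    const std::uint64_t m = n;
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint64_t x = static_cast<std::uint64_t>(i);
        a[i] = (x * x + m / 2) % m;
    }
    return a;
}

int main() {
    auto r = root_dup(std::size_t{1} << 20);
    auto t = two_dup(std::size_t{1} << 20);
    (void)r; (void)t;
}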

Machines. We ran our experiments on the following machines (all supporting 2 hardware threads per core):
• Machine A1x16 with 32 GiB of memory and one AMD Ryzen 9 3950X 16-core CPU (512 KiB private L2-cache per core and 64 MiB shared L3-cache).

51 4 Experiments and Conclusion


Figure 4.1: Examples of nontrivial input distributions for 512 uint32 values.

• Machine A1x64 with 1024 GiB of memory and one AMD EPYC Rome 7702P 64-core CPU (512 KiB private L2-cache per core and 256 MiB shared L3-cache).

• Machine I2x16 with 512 GiB of memory and two Intel Xeon E5-2683 v4 16-core CPUs (256 KiB private L2-cache per core and 40 MiB shared L3-cache per CPU).

• Machine I4x20 with 768 GiB of memory and four Intel Xeon Gold 6138 20-core CPUs (1024 KiB private L2-cache per core and 27.5 MiB shared L3-cache per CPU).

Methodology. Each algorithm was executed on all machines with all input distributions and data types. The parallel (sequential) algorithms were executed for all input sizes with 푛 = 2^푖, 푖 ∈ ℕ+, until the input array exceeds 128 GiB (32 GiB). For 푛 < 2³³ (푛 < 2³⁰), we perform each parallel (sequential) measurement 15 times and for 푛 ≥ 2³³ (푛 ≥ 2³⁰), we perform each measurement twice. Unless stated otherwise, we report the average over all runs except the first one2. We note that non-in-place sorting algorithms which require an additional array of 푛 elements will not be able to sort the largest inputs on A1x16, the machine with 32 GiB of memory. We also want to note that some algorithms did not support all data types because their interface rejects the data type. In our figures, we emphasize algorithms that are "not general-purpose"

2The first run is excluded because we do not want to overemphasize time effects introduced by memory management, instruction cache warmup, side effects of different benchmark configurations, …

with red lines, i.e., because they assume integer keys (I1S2Ra, IppRadix, RADULS2, RegionSort, and PBBR), make additional assumptions on the data type (RADULS2), or at least because they do not accept a comparator function (SkaSort). All non-in-place algorithms except RADULS2 return the sorted output in the input array. These algorithms copy the sorted result back into the input array if they have executed an odd number of recursion levels. Only RADULS2 returns the output in a second "temporary" array. To be fair, we copy the data back into the input array in parallel and include the time in the measurement. We tested all parallel algorithms on Uniform input with and without hyper-threading. Hyper-threading did not slow down any algorithm compared to using half as many threads but the same number of cores. Thus, we give results of all algorithms with hyper-threading. Overall, we executed more than 500 000 combinations of different algorithms, input distributions, input sizes, data types, and machines. We now present an interesting selection of our measurements and discuss our results.

Overview. This chapter is divided as follows. Section 4.2 introduces and discusses the statistical measure average slowdown, which we use to compare aggregated measurements. We present the results of I1S4o and I1S2Ra and their sequential competitors in Section 4.3. In Section 4.4, we discuss the influence of different memory allocation policies on the running time of parallel algorithms. Section 4.5 compares our parallel algorithm IPS4o to its implementation presented in our conference version [Axt+17c]. We compare the results of IPS4o and IPS2Ra to their parallel competitors in Section 4.6. Section 4.7 evaluates the subroutines of IPS4o, IPS2Ra, and their sequential counterparts. Finally, we conclude our work on in-place (shared-memory) sorting algorithms in Section 7.6.
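As a concrete reading of the measurement scheme from the methodology paragraph above (repeated runs, first run discarded, arithmetic mean of the rest), a minimal timing harness could look as follows. std::sort and the constants stand in for the sorter under test and the actual benchmark configuration.

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

// Time `runs` executions on fresh copies of `input`, drop the first (warm-up)
// run, and return the arithmetic mean of the remaining running times.
double average_sort_time(const std::vector<std::uint64_t>& input, int runs) {
    std::vector<double> seconds;
    for (int r = 0; r < runs; ++r) {
        std::vector<std::uint64_t> data = input;  // fresh unsorted copy
        auto start = std::chrono::steady_clock::now();
        std::sort(data.begin(), data.end());      // sorter under test
        auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    seconds.erase(seconds.begin());  // exclude the first run
    return std::accumulate(seconds.begin(), seconds.end(), 0.0) /
           static_cast<double>(seconds.size());
}

int main() {
    std::vector<std::uint64_t> input(1 << 20);
    for (std::size_t i = 0; i < input.size(); ++i)
        input[i] = (i * 2654435761u) ^ (i >> 7);  // cheap pseudo-random keys
    std::cout << average_sort_time(input, 15) << " s\n";
}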

4.1 Implementation Details

IPS4o has several parameters that can be used for tuning and adaptation. We performed our experiments using (up to) 푘 = 256 buckets, an oversampling factor 훼 = 0.2 log 푛, a base case size 푛0 = 16 elements, and a block size of 푏 = max(1, 2^{11−⌊log₂ 퐷⌋}) elements, where 퐷 is the size of an element in bytes (i.e., 푏 is about 2 KiB). Overall, the buffer blocks and the swap buffers sum up to less than 1.1 MB for each thread and take up the majority of the space required by IPS4o. The space for local variables and 푘 read and write pointers is negligible. Additionally, IPS4o needs logarithmic space for the recursion stack and each thread needs space for a decision tree containing 푘 elements.

In the sequential case, we avoid the use of atomic operations on pointers and we use the recursion stack to store the tasks. On the last level, we perform the base case sorting immediately after the bucket has been completely filled in the cleanup phase, before processing the other buckets. This is more cache-friendly, as it eliminates the need for another sweep over the data. Furthermore, we use insertion sort as the base case sorting algorithm. IPS4o (I1S4o) detects sorted inputs. If these inputs are detected, our algorithm reverses the input in the case that the input was sorted in decreasing order, and returns afterward. Note that such heuristics for detecting "easy" inputs are quite common [Obe+19b; Rei07].

For parallelization, we support OpenMP or std::thread transparently. If the application is compiled with OpenMP support, IPS4o employs the existing OpenMP threads. Otherwise, IPS4o uses C++ threads and determines 푡 by invoking the function hardware_concurrency

from the class std::thread. Additionally, the application can use its own custom threads to execute IPS4o. For that, the application creates a thread pool object provided by IPS4o, adds its threads to the thread pool, and passes the thread pool to IPS4o.

We store each read pointer 푟푖 and its corresponding write pointer 푤푖 in a single 128-bit word that we read and modify atomically. We use 128-bit atomic fetch-and-add operations from the GNU Atomic library libatomic if the CPU supports these operations. Otherwise, we guard the pointer pair with a mutex. We did not measure a difference in running time between these two approaches except for some very special corner cases. We align the thread-local data to 4 KiB, which is the typical memory page size. The alignment avoids false sharing and simplifies the migration of memory pages if a thread is moved to a different NUMA node.

We decided to implement our own version of the (non-in-place) algorithm S4o from Sanders and Winkel [SW04]. There are two reasons for this. First, the initial implementation is only of an explorative nature, e.g., it does not handle duplicated keys, and it is highly tuned for outdated hardware architectures. Second, the reimplementation S4oS [Hüb16] seemed to be unreasonably slow. We use IPS4o as an algorithmic framework to implement PS4o, our version of S4o. For PS4o, we had to implement the three main parts of S4o: (1) the partitioning step, (2) the decision tree, and (3) the base case algorithm, as well as additional parameters of the algorithm, e.g., for the maximum base case size, the number of buckets, and the oversampling factor. We replace the partitioning step of IPS4o with the one described by Sanders and Winkel. For the element classification, we reuse the branchless decision tree of IPS4o. We also reuse the base case algorithm and the parameters from IPS4o, which seem to work very well for PS4o. As we use IPS4o as an algorithmic framework, we can execute PS4o in parallel or with only one thread. If we refer to PS4o in its sequential form, we use the term 1S4o.

Our algorithms are rather slow for small inputs because of significant overhead for setting up distribution buffers and other data structures. To mitigate this effect, we also provide variants that can amortize this overhead over many sorting operations using sorter objects. The data structures are created when a sorter object is created. This object can then be used for many sorting calls with the same data type. For fairness reasons, we are not using these sorter objects in our experiments.

Our algorithms IPS4o, IPS2Ra, and PS4o are written in C++ and the implementations can be found on the official website https://github.com/ips4o. The website also contains the benchmark suite used for this publication and a description of how the experiments can be reproduced.
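The block-size rule 푏 = max(1, 2^{11−⌊log₂ 퐷⌋}) stated at the beginning of this section translates directly into code. The function below is only an illustration of that formula, not the configuration code of IPS4o.

#include <cstddef>
#include <iostream>

// Block size in elements for an element size of D bytes:
// b = max(1, 2^(11 - floor(log2(D)))), i.e. roughly 2 KiB worth of elements.
constexpr std::size_t block_size(std::size_t element_bytes) {
    std::size_t log2_d = 0;
    while ((std::size_t{1} << (log2_d + 1)) <= element_bytes) ++log2_d;
    return log2_d >= 11 ? 1 : (std::size_t{1} << (11 - log2_d));
}

static_assert(block_size(8) == 256, "8-byte elements: 256 * 8 B = 2 KiB");
static_assert(block_size(100) == 32, "100-byte elements: 32 * 100 B ~ 3 KiB");

int main() {
    std::cout << block_size(4) << ' ' << block_size(8) << ' '
              << block_size(100) << '\n';  // prints: 512 256 32
}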

4.2 Statistical Evaluation

Many methods are available to compare algorithms. In our case, the cross product of machines, input distributions, input sizes, data types, and array types describes the possible inputs of our benchmark. In this work we consider the result of a benchmark input always averaged over all executions of the input, using the arithmetic mean. A common approach of presenting benchmark results is to fix all but two variables of the benchmark set and show a plot for these two variables, e.g., plot the running time of different algorithms over the input size in a graph for a specific input distribution, data type, and array type, executed on a specific machine.


Often, an interesting subset of all possible graphs is presented as the benchmark instances have too many parameters. However, in this case, a lot of data is not presented at all and general propositions require further interpretation and aggregation of the presented, and possibly incomplete, data. Besides running time graphs and speedup graphs, we use average slowdown factors (average slowdowns) and performance profiles to present our benchmark results.

Let A be a set of algorithms, let I be a set of inputs, let 푆_퐴(I) be the inputs of I that algorithm 퐴 sorts successfully, and let 푟(퐴, 퐼) be the running time of algorithm 퐴 for input 퐼. Furthermore, let 푟(퐴, 퐼, 푇) be the running time of an algorithm 퐴 for an input 퐼 with array type 푇. Note that 퐴 might not sort 퐼 successfully, e.g., because its interface does not accept the data type or because 퐴 did not correctly sort the input. In this case, the running time of 퐴 is not defined. To obtain average slowdowns, we first define the slowdown factor of an algorithm 퐴 ∈ A relative to the algorithms A for the input 퐼 as

    푓_{A,퐼}(퐴) = 푟(퐴, 퐼) / min({푟(퐴′, 퐼) | 퐴′ ∈ A})   if 퐼 ∈ 푆_퐴(I), i.e., 퐴 successfully sorts 퐼,
    푓_{A,퐼}(퐴) = ∞                                    otherwise,

i.e., as the slowdown of using algorithm 퐴 to sort input 퐼 instead of using the fastest algorithm for 퐼 from the set of algorithms A. Then, the average slowdown of algorithm 퐴 ∈ A relative to the algorithms A for the inputs I,

    푠_{A,I}(퐴) = ( ∏_{퐼 ∈ 푆_퐴(I)} 푓_{A,퐼}(퐴) )^{1/|푆_퐴(I)|},

is the geometric mean of the slowdown factors of algorithm 퐴 relative to the algorithms A for the inputs of I that 퐴 sorts successfully.

Besides the average slowdown of algorithms, we present average slowdowns of an input array type to compare its performance to a set T of array types. The slowdown factor of an array type 푇 ∈ T relative to the array types T for the input 퐼 and an algorithm 퐴,

    푓_{T,퐴,퐼}(푇) = 푟(퐴, 퐼, 푇) / min({푟(퐴, 퐼, 푇′) | 푇′ ∈ T})   if 퐼 ∈ 푆_퐴(I), i.e., 퐴 successfully sorts 퐼,
    푓_{T,퐴,퐼}(푇) = ∞                                          otherwise,

is defined as the slowdown of using array type 푇 to sort input 퐼 with algorithm 퐴 instead of using the best array type from the set of array types T. Then, the average slowdown of an array type 푇 ∈ T relative to the array types T for the inputs I and the algorithm 퐴,

    푠_{퐴,T,I}(푇) = ( ∏_{퐼 ∈ 푆_퐴(I)} 푓_{T,퐴,퐼}(푇) )^{1/|푆_퐴(I)|},

is the geometric mean of the slowdown factors of 푇 relative to the array types T for the inputs I that 퐴 sorts successfully.

Average slowdown factors are heavily used by Timo Bingmann [Bin18a] to compare parallel string sorting algorithms. We want to note that the average slowdown could also be defined as the arithmetic mean of the slowdown factors, instead of using the geometric mean. In this

case, the average slowdown would have a very strong meaning: The average slowdown of an algorithm 퐴 over a set of inputs would be the expected slowdown relative to the fastest algorithm when an input of the benchmark set is picked at random. However, Timo Bingmann used in his work the geometric mean for the average slowdowns to "emphasize small relative differences of the fastest algorithms". Additionally, the geometric mean is more robust against outliers and skewed measurements than the arithmetic mean [McG12, p. 229]. Furthermore, the arithmetic mean of ratios is "meaningless" in the general case. For example, Fleming and Wallace [FW86] state that the arithmetic mean is meaningless when different machine instances are compared relative to a baseline machine. In this case, ratios smaller than one and larger than one can occur. However, combining those numbers is "meaningless" as ratios larger than one depend linearly on the measurements but ratios smaller than one do not. Note that the slowdown factors in this work will never be smaller than one.

We use pairwise performance profiles [DM02] to compare the relative speed of two algorithms on a set of inputs: In a plot like Figure 4.3 on Page 62, the point (푥, 푦) for an algorithm means that this algorithm sorts a fraction 푦 of the inputs at most 푥 times as slow as the other algorithm. Our instance sets include all machines, input distributions, and input sizes with a few exceptions depending on the question at hand. Since we are mostly interested in the performance for large inputs, we exclude small inputs. We also exclude the easy instance distributions Zero, Sorted, and ReverseSorted because including them gives an undue advantage to algorithms that achieve large speedups by providing simple special case treatments.
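The geometric-mean average slowdown defined above can be computed along the following lines; the running-time table in the example is a hypothetical stand-in for the benchmark results, and failed runs are marked as non-finite values.

#include <cmath>
#include <limits>
#include <vector>

// rows: inputs I; columns: algorithms A'. A non-finite entry means the
// algorithm failed to sort that input. Returns the average slowdown of the
// algorithm in column `a`: the geometric mean, over the inputs it sorts
// successfully, of its running time divided by the best running time for
// that input.
double average_slowdown(const std::vector<std::vector<double>>& time, int a) {
    double log_sum = 0.0;
    int successful = 0;
    for (const auto& row : time) {
        if (!std::isfinite(row[a])) continue;  // algorithm a failed here
        double best = std::numeric_limits<double>::infinity();
        for (double t : row)
            if (std::isfinite(t) && t < best) best = t;
        log_sum += std::log(row[a] / best);
        ++successful;
    }
    if (successful == 0) return std::numeric_limits<double>::infinity();
    return std::exp(log_sum / successful);
}

int main() {
    const double inf = std::numeric_limits<double>::infinity();
    // Three inputs, three algorithms; algorithm 2 fails on the last input.
    std::vector<std::vector<double>> time = {
        {1.0, 1.2, 2.0},
        {2.0, 1.0, 1.5},
        {3.0, 3.3, inf},
    };
    double s = average_slowdown(time, 1);  // geometric mean of 1.2, 1.0, 1.1
    (void)s;
}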

4.3 Sequential Algorithms

In this section, we compare sequential algorithms for different machines, input distributions, input sizes, and data types. We begin with a comparison of the average slowdowns of I1S2Ra, I1S4o, and their competitors for ten input distributions executed with six different data types (see Section 4.3.1). This gives a first general view of the performance of our algorithms as the presented results are aggregated across all machines. Afterwards, we compare our algorithms to their competitors on different machines, scaling over the input size, for input distribution Uniform and data type uint64 (see Section 4.3.2). Finally, we discuss the performance profiles of the algorithms in Section 4.3.3.

4.3.1 Comparison of Average Slowdowns

Table 4.1 shows average slowdowns of sequential algorithms for different data types and input distributions aggregated over all machines and input sizes with at least 2¹⁸ bytes. The fact that the shown slowdowns are almost always larger than 1.0 indicates that no algorithm clearly dominates all others even for a particular input distribution. We also see that a sorting algorithm performs similarly well for inputs with "similar" input distributions. Thus, we divide the inputs into four groups: The largest group, Skewed inputs, contains inputs with duplicated keys and skewed key occurrences, i.e., Exponential, Zipf, RootDup, TwoDup, and EightDup. The second group, Uniform inputs, contains Uniform distributed inputs. For these inputs, each bit of the key has maximum entropy.

o Ra o 4 2 Type Distribution 4 1S I1S IPS SkaSort BlockQ std::sort Timsort QMSort WikiSort IppRadix DualPivot BlockPDQ double Sorted 1.05 1.70 25.24 1.05 12.90 20.49 1.09 62.42 2.81 21.83 62.61 double ReverseSorted 1.04 1.71 14.28 1.06 5.09 5.93 1.07 25.34 5.89 9.41 25.22 double Zero 1.07 1.77 21.20 1.10 1.20 14.98 1.08 2.72 3.58 16.36 24.23 double Exponential 1.02 1.13 1.28 1.27 2.30 2.57 4.23 4.04 4.20 1.29 1.38 double Zipf 1.08 1.25 1.42 1.37 2.66 2.87 4.63 4.21 4.72 1.17 1.28 double RootDup 1.10 1.50 1.83 1.65 1.44 2.30 1.32 6.01 3.12 1.90 2.69 double TwoDup 1.17 1.33 1.37 1.41 2.48 2.65 2.96 3.42 3.20 1.07 1.22 double EightDup 1.01 1.13 1.41 1.30 2.42 2.84 4.43 4.69 4.40 1.31 1.60 double AlmostSorted 2.33 1.15 2.21 2.99 1.68 1.80 1.14 6.76 2.57 2.39 4.53 double Uniform 1.08 1.21 1.22 1.28 2.35 2.43 3.59 2.98 3.58 1.08 1.29 Total 1.20 1.24 1.50 1.54 2.15 2.47 2.81 4.42 3.61 1.40 1.77 Rank 1 2 4 5 7 8 9 12 11 3 6 uint64 Sorted 1.17 1.78 23.69 1.02 11.94 19.96 1.11 55.40 2.88 26.64 76.86 13.33 uint64 ReverseSorted 1.03 1.63 12.93 1.04 4.47 5.51 1.04 21.01 5.93 10.46 28.99 5.97 uint64 Zero 1.17 1.69 21.43 1.06 1.14 14.02 1.11 2.42 3.74 17.40 25.30 1.35 uint64 Exponential 1.06 1.22 1.37 1.37 2.28 2.64 4.52 3.82 4.51 1.21 1.74 1.05 uint64 Zipf 1.53 1.86 2.13 2.06 3.62 4.04 6.65 5.53 6.79 1.73 1.99 1.01 uint64 RootDup 1.25 1.73 2.19 2.07 1.60 2.60 1.70 6.34 3.91 2.08 2.88 1.13 uint64 TwoDup 1.73 2.07 2.11 2.17 3.56 3.88 4.54 4.65 4.93 1.58 2.66 1.00 uint64 EightDup 1.26 1.39 1.74 1.64 2.75 3.29 5.46 5.12 5.38 1.71 2.97 1.02 uint64 AlmostSorted 2.34 1.11 2.19 3.28 1.68 1.81 1.24 6.11 2.79 2.79 6.67 1.28 uint64 Uniform 1.35 1.60 1.60 1.71 2.85 3.02 4.62 3.49 4.63 1.20 2.19 1.04 Total 1.46 1.54 1.88 1.97 2.51 2.95 3.56 4.90 4.56 1.69 2.74 1.07 Rank 2 3 5 6 7 9 10 13 11 4 8 1 uint32 Sorted 2.44 3.89 57.73 2.42 28.63 53.53 1.96 139.13 6.34 46.41 44.93 29.91 uint32 ReverseSorted 1.40 2.06 17.70 1.47 6.09 8.37 1.03 29.37 5.57 10.08 20.92 7.28 uint32 Zero 2.30 3.71 59.44 2.28 2.28 37.19 2.06 6.19 8.98 24.29 14.03 3.05 uint32 Exponential 1.49 1.77 2.03 1.82 3.66 4.04 6.67 5.91 6.51 1.38 1.08 1.09 uint32 Zipf 1.82 2.33 2.75 2.37 4.97 5.46 8.68 7.55 8.81 1.41 1.27 1.12 uint32 RootDup 1.41 1.92 2.46 2.15 1.84 2.97 1.48 7.54 3.78 1.58 1.77 1.18 uint32 TwoDup 2.09 2.56 2.67 2.52 4.82 5.11 5.59 5.94 5.95 1.34 1.44 1.09 uint32 EightDup 1.40 1.68 2.09 1.76 3.67 4.19 6.47 6.23 6.45 1.35 1.77 1.02 uint32 AlmostSorted 3.07 1.45 2.79 4.24 2.15 2.58 1.06 8.24 2.97 2.66 5.45 1.51 uint32 Uniform 1.67 2.01 2.05 2.04 3.85 4.02 5.92 4.55 5.79 1.39 1.08 1.20 Total 1.78 1.93 2.39 2.32 3.37 3.93 4.09 6.47 5.45 1.54 1.67 1.16 Rank 4 5 7 6 8 9 10 12 11 2 3 1 Pair Sorted 1.06 1.62 16.88 1.04 9.36 14.67 1.04 34.54 2.30 17.51 10.48 Pair ReverseSorted 1.13 1.21 8.47 1.08 3.65 4.19 1.12 13.71 6.60 6.86 4.87 Pair Zero 1.09 1.61 13.30 1.03 1.07 11.63 1.08 1.94 2.71 11.09 1.21 Pair Exponential 1.10 1.92 1.20 1.36 1.84 2.12 3.87 3.11 4.14 1.16 1.05 Pair Zipf 1.48 2.72 1.64 1.86 2.64 2.83 5.02 3.87 5.50 1.46 1.01 Pair RootDup 1.27 1.44 1.78 1.84 1.42 2.16 1.83 4.70 4.05 1.69 1.03 Pair TwoDup 1.63 2.81 1.69 1.92 2.71 2.84 3.62 3.45 4.35 1.41 1.01 Pair EightDup 1.27 2.19 1.45 1.59 2.14 2.47 4.50 3.95 4.81 1.56 1.00 Pair AlmostSorted 3.20 1.01 2.79 4.00 2.18 2.39 2.34 6.56 4.55 3.24 1.74 Pair Uniform 1.37 2.46 1.45 1.66 2.40 2.45 3.89 2.88 4.26 1.17 1.03 Total 1.52 1.97 1.66 1.91 2.15 2.45 3.40 3.94 4.50 1.57 1.10 Rank 2 6 4 5 7 8 9 10 11 3 1 Quartet Uniform 1.14 1.85 1.29 1.49 1.89 1.86 3.14 2.15 3.52 1.02 
Rank 2 5 3 4 7 6 9 8 10 1 100B Uniform 1.41 1.27 1.27 1.64 1.83 1.33 2.22 1.78 3.17 1.06 Rank 5 2 3 6 8 4 9 7 10 1

Table 4.1: Average slowdowns of sequential algorithms for different data types and input distributions. The slowdowns are averaged over the machines and input sizes with at least 2¹⁸ bytes.

Thus, we can expect that radix sort performs the best for these inputs. The third group, Almost Sorted inputs, contains AlmostSorted distributed inputs. The last group, (Reverse) Sorted inputs, contains "easy inputs", i.e., Sorted, ReverseSorted, and Zero. For the average slowdowns separated by machine, we refer to Appendix A.3, Tables A.2–A.5.

In this section, an instance describes the inputs of a specific data type and input distribution. We say that "algorithm A is faster than algorithm B (by a factor of C) for some instances" if the average slowdown of B is larger than the average slowdown of A (by a factor of C) for these instances. The subsequent paragraph summarizes the performance of our algorithms. Then, we compare our competitors to our radix sorter I1S2Ra. Finally, we compare our competitors to our samplesort algorithm I1S4o.

Summary of the Results. Overall, I1S2Ra is significantly faster than our fastest radix sort competitor SkaSort. For example, I1S2Ra is at least a factor of 1.10 faster than SkaSort for all instances and more than a factor of 1.40 faster for 63 % of the instances. The radix sorter IppRadix is faster than I1S2Ra in some special cases. However, IppRadix is even slower than our competitor SkaSort for the remaining inputs. Our algorithm I1S2Ra also outperforms the comparison-based sorting algorithms for Uniform inputs and Skewed inputs significantly. For example, I1S2Ra is faster for all of these instances and for 56 % of these instances even by a factor of 1.20 or more. Only for Almost Sorted inputs and the "easy" (Reverse) Sorted inputs, the comparison-based algorithms BlockPDQ and Timsort are faster than I1S2Ra. For the remaining inputs—Uniform inputs and Skewed inputs—not only our radix sorter I1S2Ra but also our samplesort algorithm I1S4o is faster than all comparison-based competitors (except for one instance). For example, I1S4o is faster than our fastest comparison-based competitor BlockPDQ, e.g., by a factor of 1.10 and 1.20 for 25 and 15 out of 26 instances with Uniform and Skewed input, respectively. I1S4o is on average also faster than the fastest radix sorter.

Comparison to I1S2Ra. I1S2Ra outperforms SkaSort by a factor of 1.10, 1.20, 1.30, and 1.40 for 100 %, 83 %, 73 %, and 63 % of the instances, respectively. IppRadix is the only non-comparison-based algorithm that is able to outperform I1S2Ra for at least one instance, i.e., IppRadix is faster by a factor of 1.01 (of 1.11) for Exponential (Uniform) distributed inputs with the uint32 data type. However, IppRadix is (significantly) slower than I1S2Ra for other Exponential (Uniform) instances, e.g., by a factor of 1.66 (of 2.11) for the uint64 data type. For the remaining instances, IppRadix is (much) slower than I1S2Ra.

For the comparison of I1S2Ra to comparison-based algorithms, we first consider Uniform and Skewed inputs. For these instances, I1S2Ra is significantly faster than all comparison-based algorithms (including I1S4o). For example, I1S2Ra is faster than any of these algorithms by a factor of more than 1.00, 1.10, 1.20, and 1.30 for 100 %, 89 %, 78 %, and 56 % of the instances, respectively. We now consider Almost Sorted inputs, which are sorted the fastest by BlockPDQ. For all 3 instances, I1S2Ra is slower than BlockPDQ, i.e., by factors of 1.04, 1.16, and 1.72, respectively. The reason is that BlockPDQ heuristically detects and skips presorted input sequences.
Even though BlockPDQ is our fastest comparison-based competitor, it is significantly slower than I1S2Ra for many instances that are not (almost) sorted. For example, for 8 of these instances, BlockPDQ is even more than a factor of 2.00 slower than I1S2Ra. For (Reverse) Sorted inputs, I1S2Ra is slower than at least one comparison-based sorting algorithm for all 9 instances.


However, I1S2Ra could easily detect these instances by scanning the input array once. We want to note that I1S2Ra already scans the input array to detect the significant bits of the input keys.

Comparison to I1S4o. Our algorithm I1S4o is faster than any comparison-based competitor for 28 instances and slower for only 15 instances. However, when we exclude Almost Sorted inputs and (Reverse) Sorted inputs, I1S4o is still faster for the same number of instances but the number of instances for which I1S4o is slower drops to one instance. When we only exclude (Reverse) Sorted inputs, I1S4o is still only slower for 5 instances. BlockPDQ is a factor of 1.10, 1.15, 1.20, and 1.25 slower than I1S4o for 100 %, 71.43 %, 42.85 %, and 28.57 % out of 21 instances with Uniform input and Skewed input, respectively. BlockPDQ is also much slower for (Reverse) Sorted inputs. Only for Almost Sorted inputs, BlockPDQ is significantly faster than I1S4o, e.g., by a factor of 2.03 to 3.17. Again, the reason is that BlockPDQ takes advantage of presorted sequences in the input. BlockQ shows similar performance as BlockPDQ for Uniform inputs. The reason is that BlockPDQ reimplemented the partitioning routine proposed by BlockQ [EW16]. However, BlockQ does not take advantage of presorted sequences and BlockQ handles duplicated keys less efficiently. Thus, BlockQ is slower than BlockPDQ for (Reverse) Sorted inputs and Almost Sorted inputs.

I1S4o outperforms SkaSort for Skewed inputs by a factor of at least 1.10 for 50 % out of 20 instances whereas SkaSort is faster by a factor of at least 1.10 for only 15 % of the instances. I1S4o is also faster than SkaSort for all 12 (Reverse) Sorted inputs. For Almost Sorted inputs, both algorithms are for one instance at least a factor of 1.10 faster than the other algorithm (out of 4 instances). Only for Uniform inputs, SkaSort is the better algorithm. That is, SkaSort is faster by a factor of at least 1.10 on 83 % out of 6 Uniform instances whereas I1S4o is not faster on any of these instances.

As expected, 1S4o is slower than I1S4o for all instances except (Reverse) Sorted inputs. For (Reverse) Sorted inputs, both algorithms execute the same heuristic to detect and sort "easy" inputs. Also, as 1S4o is not in-place, 1S4o can sort only about half the input size that I1S4o can sort. The results strongly indicate that the I/O complexity of I1S4o has smaller constant factors than the I/O complexity of 1S4o as both algorithms share the same sorting framework including the same sampling routine, branchless decision tree, and base case sorting algorithm.

The algorithms std::sort, DualPivot, and BlockPDQ are adaptations of quicksort. However, std::sort and DualPivot do not avoid branch mispredictions by classifying and moving blocks of elements. The result is that these algorithms are always significantly slower than BlockPDQ. We also compare I1S4o to the mergesort algorithms Timsort, QMSort, and WikiSort. The in-place versions of mergesort, QMSort and WikiSort, are significantly slower than I1S4o for all instances. Timsort is also much slower than I1S4o for almost all input distributions—in most cases even more than a factor of three. Only for Almost Sorted inputs, Timsort is faster than I1S4o and for (Reverse) Sorted inputs, Timsort has similar running times as I1S4o.

We did not present the results of S4oS, an implementation of Super Scalar Samplesort [SW04]. We made this decision as S4oS is slower or significantly slower than 1S4o, our implementation of Super Scalar Samplesort, for all instances except 100B instances. For further details, we refer to Table A.6 in Appendix A.3 that shows average slowdowns of 1S4o and S4oS for different data types and input distributions.
We did not present the results of the sequential version of ASPaS for three reasons. First, ASPaS performs worse than 1S4o for all instances. Second, ASPaS only sorts inputs with the data type double. Finally, ASPaS returns unsorted output for inputs with at least 2³¹ elements.


Figure 4.2: Running times of sequential algorithms sorting uint64 values with input distribution Uniform, executed on the machines I4x20, A1x16, I2x16, and A1x64 (algorithms shown: I1S4o, BlockPDQ, 1S4o, SkaSort, IppRadix, and I1S2Ra; axes: normalized running time over item count 푛). The results of DualPivot, std::sort, Timsort, QMSort, and WikiSort cannot be seen as their running times exceed the plot.

4.3.2 Running Times for Uniform Input

In this section, we compare I1S2Ra and I1S4o to their closest sequential competitors for Uniform distributed uint64 inputs. Figure 4.2 depicts the running times of our algorithms I1S4o and I1S2Ra as well as their fastest competitors BlockPDQ and SkaSort separately for each machine. Additionally, we include measurements obtained from 1S4o (which we used as a starting point to develop I1S4o) and IppRadix (which is fast for uint32 data types with Uniform distribution). We refer to Table A.1 in Appendix A.3 for exact running time numbers. We decided to present results for uint64 inputs as our radix sorter does not support double inputs. We note that this decision is not a disadvantage for our fastest competitors as they show similar running times relative to our algorithms for both data types (see slowdowns in Table 4.1, Section 4.3.1).


Summary of the Results. Overall, I1S2Ra outperforms its radix sort competitors on all but one machine, and I1S4o significantly outperforms its comparison-based competitors. In particular, I1S2Ra is 1.40 times as fast as SkaSort, and I1S4o is 1.44 (1.60) times as fast as BlockPDQ (1S4o) for the largest input size. As expected, I1S2Ra is in almost all cases significantly faster than I1S4o on all machines, e.g., by a factor of 1.10 to 1.52 for the largest input size. I1S2Ra shows the fastest running times on the two machines with the most recent CPUs, A1x16 and A1x64. On these two machines, the gap between I1S2Ra and I1S4o is larger than on the other machines. This indicates that sequential comparison-based algorithms are not memory bound in general and that, on recent CPUs, radix sorters may benefit even more from their reduced number of instructions (for uniformly distributed inputs). In the following, we compare our algorithms to their competitors in more detail.

Comparison to I1S2Ra. Our algorithm I1S2Ra outperforms SkaSort significantly on two machines (A1x16 and A1x64), slightly on one machine (I4x20), and on one machine (I2x16), I1S2Ra is slightly slower than SkaSort. For example, I1S2Ra is on average a factor of 2.06, 2.13, 1.12, and 0.82 faster than SkaSort for 푛 ≥ 2¹⁵ on A1x16, A1x64, I4x20, and I2x16, respectively. According to the performance measurements obtained with the Linux tool perf, SkaSort performs more cache misses (factor 1.25) and significantly more branch mispredictions (factor 1.52 for 푛 = 2²⁸ on I4x20). On the machines A1x16, I2x16, and A1x64, we see that the running times of SkaSort and I1S2Ra vary—with peaks at 2¹⁵, 2²³, and 2³¹. We assume that these running time peaks occur because the radix sorters perform an additional 푘-way partitioning step with 푘 = 256 at these input sizes. We have seen the same behavior with our algorithm I1S4o when we do not adjust 푘 at the last recursion levels. However, with our adjustments, the large running time peaks disappear for I1S4o.

We also compare our algorithm against IppRadix, which takes advantage of the Advanced Vector Extensions (AVX). All machines support the instruction extension AVX2; I4x20 additionally provides AVX-512 instructions. We expected IppRadix to be competitive, at least on I4x20. However, IppRadix is significantly slower than I1S2Ra on all machines. For example, I1S2Ra outperforms IppRadix by a factor of 1.76 to 1.88 for the largest input size on A1x64, A1x16, and I4x20. On I2x16, I1S2Ra is even a factor of 3.00 faster. We note that IppRadix is surprisingly fast for (mostly small) Uniform distributed inputs with data type uint32 (see Figure A.1 in Appendix A.3). Unfortunately, IppRadix fails to sort uint32 inputs with more than 2²⁸ elements. In conclusion, it seems that AVX instructions only help for inputs whose data type size is very small, i.e., 32-bit unsigned integers in our case.

Comparison to I1S4o. For most medium and large input sizes, BlockPDQ and 1S4o are significantly slower than I1S4o. For example, on A1x16, I1S4o is 1.29 times as fast as 1S4o and 1.44 times as fast as BlockPDQ for the largest input size (푛 = 2³²). On the other machines, BlockPDQ is our closest competitor: I1S4o is 1.23 to 1.44 (1.29 to 1.60) times as fast as BlockPDQ (1S4o) for 푛 = 2³². According to the performance measurements obtained with the Linux tool perf, there may be several reasons why I1S4o outperforms BlockPDQ and 1S4o.
Consider the machine I4x20 and 푛 = 2²⁸: BlockPDQ performs significantly more instructions (factor 1.30), more cache misses (factor 2.05), and more branch mispredictions (factor 1.69) compared to I1S4o. Also, 1S4o performs significantly more total cache misses (factor 1.79), more L3-store operations (factor 2.68), and more L3-store misses (factor 9.71). We note that the comparison-based competitors DualPivot, std::sort, Timsort, QMSort, and WikiSort perform significantly more branch mispredictions than 1S4o, BlockPDQ, and BlockQ. We think that this is the reason for their poor performance.


Figure 4.3: Pairwise performance profiles of our algorithms I1S4o and I1S2Ra to BlockPDQ and SkaSort for large inputs (푛 ≥ 2¹⁸) on all machines and all input distributions excluding the “easy” ones Sorted, ReverseSorted, and Zero. Profiles with the radix sorter I1S2Ra only use inputs with unsigned integer keys (uint32, uint64, and Pair data types). (Panels: All Data Types and Unsigned Integer Key; each panel plots the fraction of inputs over the running time ratio relative to the best algorithm.)

4.3.3 Comparison of Performance Profiles

In this section, we discuss the pairwise performance profiles³ for inputs with at least 2¹⁸ elements. Figure 4.3 compares our algorithms I1S4o and I1S2Ra to the fastest comparison-based competitor (BlockPDQ) and the fastest radix sort competitor (SkaSort).

Overall, I1S2Ra has a significantly better profile than BlockPDQ and SkaSort. The performance profile of I1S4o is slightly better than the profile of SkaSort and significantly better than the one of BlockPDQ. Exceptions are AlmostSorted inputs, for which I1S4o is much slower than BlockPDQ.

Comparison to I1S2Ra. For the profiles containing I1S2Ra, we used only inputs with unsigned integer keys. The performance profile of I1S2Ra is significantly better than the profile of SkaSort. I1S2Ra is much faster for most of the inputs and only slightly slower for the remaining inputs. For example, I1S2Ra sorts 84 % of the inputs faster than SkaSort. Also, I1S2Ra stays within a factor of 1.25 of the better of the two algorithms for 91 % of the inputs, whereas SkaSort does so for only 34 %. The performance profile of BlockPDQ is even worse than the profile of SkaSort. For example, I1S2Ra stays within a factor of 1.25 of the better algorithm for 97 % of the inputs in this pairing, whereas BlockPDQ does so for only 26 %.

Comparison to I1S4o. The performance profile of I1S4o is in most ranges significantly better than the profile of BlockPDQ. For example, I1S4o sorts 79 % of the inputs faster than BlockPDQ.

³See Section 4.2 for an explanation of performance profiles.


Also, BlockPDQ stays within a factor of 1.25 of the better algorithm for only 60 % of the inputs, whereas I1S4o does so for 86 % of the inputs. We note that I1S4o is significantly slower than BlockPDQ for some inputs; these are AlmostSorted inputs. The performance profile of I1S4o is slightly better than the profile of SkaSort. For example, I1S4o sorts 54 % of the inputs faster than SkaSort. Also, I1S4o (SkaSort) stays within a factor of 1.25 of the better algorithm for 83 % (77 %) of the inputs.
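For reference, the profile values quoted above can be computed from pairwise running times roughly as follows — a minimal sketch under our reading of the performance profiles of Section 4.2, with illustrative names:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Value of algorithm A's pairwise performance profile at ratio tau:
    // the fraction of inputs on which A is at most a factor tau slower than
    // the better of the two algorithms. times_a[i] and times_b[i] are the
    // running times of A and B on input i.
    double profile_value(const std::vector<double>& times_a,
                         const std::vector<double>& times_b, double tau) {
        std::size_t covered = 0;
        for (std::size_t i = 0; i < times_a.size(); ++i) {
            double best = std::min(times_a[i], times_b[i]);
            if (times_a[i] <= tau * best) ++covered;
        }
        return static_cast<double>(covered) / times_a.size();
    }

Evaluating this for every ratio on the x-axis yields one curve of Figure 4.3; at ratio 1 it gives the fraction of inputs on which the algorithm is the faster of the two.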

4.4 Influence of the Memory Allocation Policy

On NUMA machines, access to memory attached to the local NUMA node is faster than memory access to other nodes. Thus, the memory access pattern of a shared-memory algorithm may highly influence its performance. For example, an algorithm can greatly benefit from minimizing the memory accesses of its threads to other NUMA nodes. However, we cannot avoid access to non-local NUMA nodes for shared-memory sorting algorithms: For example, when the input array is distributed among the NUMA nodes, the input and output positions of elements may lie on different nodes. In this case, it can be an advantage to distribute memory accesses evenly across the NUMA nodes to utilize the full memory bandwidth. Depending on the access patterns of an algorithm, one memory layout may suit a parallel algorithm better than another. If we do not consider different memory layouts of the input array in our benchmark, the results may wrongly indicate that one algorithm is better than another.

The memory layout of the input array depends on the NUMA allocation policy of the input array and former accesses to the array. The local allocation policy allocates memory pages at the thread’s local NUMA node if memory is available. This memory policy is oftentimes the default policy. Note that after the user has allocated memory with this policy, the actual memory pages are not allocated until a thread accesses them for the first time. A memory page is then allocated on the NUMA node of the accessing thread. This principle is called first touch. The interleaved allocation policy pins memory pages round-robin to (a defined set of) NUMA nodes. The bind allocation policy binds memory to a defined set of NUMA nodes, and preferred allocation allocates memory on a preferred set of NUMA nodes. For example, a user could create an array with the bind allocation policy such that the 푖-th stripe of the array is pinned to NUMA node 푖.

Benchmarks of sequential algorithms usually allocate and initialize the input array with a single thread and the default allocation policy (local allocation). The memory pages of the array are thus all allocated on a single NUMA node (local arrays). Local arrays are slow for many parallel algorithms because the NUMA node holding the array becomes a bottleneck. It is therefore recommended to use a different layout for parallel (sorting) algorithms. For example, the authors of RADULS2 recommend using an array where the 푖-th stripe of the array is first touched by thread 푖 (striped array). Another example is RegionSort, whose authors recommend invoking the application with the interleaved allocation policy. We call the arrays of those applications interleaved arrays. For their benchmarks [PR14], Polychroniou and Ross allocate, on machines with 푚 NUMA nodes, 푚 subarrays where subarray 푖 is pinned to NUMA node 푖.

We execute the benchmark of each algorithm with the following four input array types.

• For the local array, we allocate the array with the function malloc and a single thread initializes the array.


• For the striped array, we allocate the array with malloc and thread 푖 initializes the 푖-th stripe of the input array (see the first-touch sketch at the end of this section).

• For the interleaved array, we activate the process-wide interleaved allocation policy using the Linux tool numactl.

• The NUMA array [Axt20b] uses a refined NUMA-aware array proposed by Lorenz Hübschle-Schneider⁴. The NUMA array pins stripe 푖 of the array to NUMA node 푖. This approach is similar to the array used by Polychroniou and Ross except that the NUMA array is a continuous array.

Table 4.2 shows the average slowdown of each array type for each algorithm on our machines. As expected, the machines with a single CPU, A1x16 and A1x64, do not benefit from NUMA-aware allocations. On the NUMA machines, the local array performs significantly worse than the other arrays. Depending on the algorithm, the average slowdown of the local array is up to a factor of 1.49 larger than the average slowdown of the respectively best array on I2x16. On I4x20, the local array performs even worse: Depending on the algorithm, the average slowdown of the local array is a factor of 1.12 to 4.88 larger. The interleaved array (significantly) outperforms the other arrays for most algorithms or shows similar slowdown factors (±0.02) on the NUMA machines, i.e., I2x16 and I4x20. Only ASPaS is noticeably faster with the striped array than with the interleaved array on these machines. However, ASPaS shows large running times in general. On I2x16, the NUMA machine with 32 cores, the average slowdown ratios of the striped array and the NUMA array to the interleaved array are relatively small (up to 1.12). On I4x20, which is equipped with 80 cores, the average slowdown ratio of the striped array (NUMA array) to the interleaved array increases in the worst case to 1.44 (2.83).

Our algorithm IPS4o has almost the same average slowdowns when we execute the algorithm with the interleaved or the NUMA array. Other algorithms, e.g., our closest competitors RADULS2 and RegionSort, are much slower on I4x20 when executed with the NUMA array. The reason why our algorithm is not affected is that each of its threads predominantly works on one stripe of the input array, which resides on a single NUMA node.

In conclusion, the local array should not be used on NUMA machines. The interleaved array is the best array on these machines, with just a few minor exceptions. The NUMA array and the striped array perform better than the local array on NUMA machines and in most cases worse than the interleaved array. Unless stated otherwise, we report results obtained with the interleaved input array (restricting the interleaved allocation to the NUMA nodes participating in the sorting task).
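As an illustration of the striped array, first-touch placement can be forced by letting thread 푖 write its stripe once before the benchmark starts; the interleaved array, in contrast, is obtained externally, e.g., via numactl --interleave=all. A minimal OpenMP sketch under the default local (first-touch) policy — not the exact benchmark code:

    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>
    #include <omp.h>

    // Allocate an array and realize a striped NUMA layout via first touch:
    // with the local allocation policy, the pages of stripe i are placed on
    // the NUMA node of the thread that first writes them. If the threads are
    // pinned (one stripe per core/node), the resulting layout is deterministic.
    uint64_t* allocate_striped(std::size_t n) {
        uint64_t* a = static_cast<uint64_t*>(std::malloc(n * sizeof(uint64_t)));
        #pragma omp parallel
        {
            std::size_t t = static_cast<std::size_t>(omp_get_thread_num());
            std::size_t p = static_cast<std::size_t>(omp_get_num_threads());
            for (std::size_t i = n * t / p; i < n * (t + 1) / p; ++i)
                a[i] = 0;  // first touch of stripe t
        }
        return a;
    }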

4.5 Evaluation of the Parallel Task Scheduler

The version of IPS4o proposed in our conference article (IPS4oNT) [Axt+17c] uses a very simple task scheduling scheme: Tasks with more than 푛/푡 elements are all executed with 푡 threads (so-called parallel tasks), and sequential tasks are assigned to threads greedily in descending order according to their size.

⁴https://gist.github.com/lorenzhs


             A1x16                   A1x64                   I2x16                   I4x20
             LA    IA    SA    NA    LA    IA    SA    NA    LA    IA    SA    NA    LA    IA    SA    NA
ASPaS        1.01  1.00  1.00  1.00  1.00  1.01  1.01  1.01  1.22  1.05  1.01  1.02  4.59  1.11  1.00  1.39
MCSTLmwm     1.00  1.01  1.01  1.01  1.00  1.02  1.01  1.02  1.13  1.03  1.06  1.03  2.28  1.02  1.16  1.18
MCSTLbq      1.02  1.03  1.02  1.01  1.04  1.05  1.02  1.04  1.49  1.00  1.08  1.02  3.67  1.01  1.28  1.30
IPS4o        1.01  1.00  1.01  1.03  1.01  1.01  1.01  1.00  1.27  1.00  1.12  1.01  3.43  1.00  1.32  1.08
PBBS         1.01  1.00  1.00  1.00  1.01  1.00  1.00  1.01  1.09  1.00  1.03  1.01  1.47  1.00  1.17  1.12
PS4o         1.01  1.00  1.01  1.02  1.00  1.00  1.01  1.01  1.13  1.00  1.11  1.04  2.28  1.01  1.19  1.23
TBB          1.02  1.02  1.01  1.02  1.02  1.01  1.01  1.02  1.10  1.02  1.09  1.01  1.12  1.03  1.12  1.05
IPS2Ra       1.01  1.01  1.01  1.02  1.01  1.00  1.01  1.02  1.45  1.02  1.14  1.00  4.88  1.01  1.45  1.04
PBBR         1.00  1.01  1.01  1.00  1.03  1.01  1.01  1.02  1.11  1.01  1.03  1.04  2.33  1.01  1.27  1.44
RADULS2      1.00  1.01  1.01  1.10  1.08  1.09  1.09  1.00  1.23  1.01  1.03  1.09  4.80  1.01  1.53  2.86
RegionSort   1.00  1.01  1.01  1.00  1.00  1.01  1.01  1.01  1.28  1.00  1.07  1.05  4.18  1.04  1.22  1.36

Table 4.2: Average slowdowns of the local array (LA), the interleaved array with 4 KiB pages (IA), the striped array (SA), and the NUMA array (NA) for different parallel sorting algorithms on different machines. We only consider uint64 data types with at least 푡 ⋅ 2²¹ bytes and input distribution Uniform.

The task scheduler of IPS4o, described in Section 3.1.2, has three advantages. First, the number of threads processing a parallel task decreases as the size of the task decreases. This means that we can process small parallel subtasks more efficiently (a sketch of this proportional thread-group assignment appears at the end of this section). Second, voluntary work sharing is used to balance the load of sequential tasks between threads. Finally, thread 푖 predominantly accesses elements from 퐴[푖푛/푡 .. (푖 + 2)푛/푡 − 1] in sequential tasks and in classification phases (see Lemmas 3.5 and 3.6). Thus, the access pattern of IPS4o significantly reduces memory accesses to nonlocal NUMA nodes when the striped array or the NUMA array is used.

Table 4.3 compares IPS4o with IPS4oNT. On machines with multiple NUMA nodes, i.e., I2x16 and I4x20, both algorithms are much slower when the local array is used. This is not surprising as the input is in this case read from a single NUMA node. On the machines I4x20, I2x16, and A1x16, IPS4o shows a slightly smaller average slowdown than IPS4oNT for the same array type. It is hard to say whether this improvement is caused by the voluntary work sharing or by a better static scheduling. In any case, neither algorithm executes parallel subtasks there, as 푡 ≪ 푘.

In the remainder of this section, we discuss the results obtained on machine I4x20. These results are perhaps the most interesting: Compared to the other machines, tasks with more than 푛/푡 elements occur regularly on the second recursion level on I4x20, as the number of threads is only slightly smaller than 푘. Thus, both algorithms actually perform parallel subtasks. In contrast to IPS4oNT, IPS4o uses thread groups whose size is proportional to the size of parallel tasks. Thus, we expect IPS4o to be faster than IPS4oNT for any array type. We point out that the advantage of IPS4o is caused by the handling of its parallel subtasks, not by the voluntary work sharing: When no parallel subtasks are executed, the running times do not differ much. However, our experiments show that IPS4o performs much better than IPS4oNT in cases where parallel tasks occur on the second recursion level. We now discuss the running time improvements separately for each array type.

With the interleaved array, IPS4o reports the fastest running times. For this array, the average slowdown ratio of IPS4oNT to IPS4o is 1.13. For the interleaved array, we expect that


         local array            interleaved array      striped array          NUMA array
         IPS4oNT  IPS4o         IPS4oNT  IPS4o         IPS4oNT  IPS4o         IPS4oNT  IPS4o
A1x16    1.07     1.00 (3.62)   1.06     1.00 (3.59)   1.06     1.00 (3.61)   1.05     1.00 (3.67)
A1x64    1.04     1.00 (4.47)   1.04     1.00 (4.46)   1.04     1.00 (4.47)   1.05     1.00 (4.44)
I2x16    1.03     1.01 (5.65)   1.02     1.01 (4.47)   1.04     1.04 (5.01)   1.02     1.01 (4.52)
I4x20    1.51     1.09 (17.46)  1.13     1.00 (5.22)   1.84     1.00 (6.90)   2.54     1.00 (5.66)

Table 4.3: Average slowdown of IPS4o and IPS4oNT to the best of both algorithms for different array types and machines. The numbers in parentheses show the average running times of IPS4o divided by 푛/푡 ⋅ log₂ 푛 in nanoseconds. We only consider uint64 data types with at least 푡 ⋅ 2²¹ bytes and input distribution Uniform.

parallel subtasks oftentimes cover multiple memory pages. Thus, both algorithms can utilize the bandwidth of multiple NUMA nodes when executing parallel subtasks. We assume that this is the reason that IPS4oNT is not much slower than IPS4o with interleaved arrays. For the NUMA array, the average slowdown ratio increases to 2.54—IPS4oNT becomes much slower. The reason for this slowdown is that the subarray associated with a parallel subtask will often reside on a single NUMA node. IPS4oNT executes such tasks with all 푡 threads, which then leads to a severe memory bottleneck. Additionally, subtasks of this task can be assigned to any of these threads. IPS4o on the other hand executes the task with a thread group of appropriate size. Threads of this thread group also process resulting subtasks (unless they are rescheduled to other threads).

Let us now compare the striped array with the NUMA array. While IPS4oNT exhibits about the same (bad) performance with both arrays, IPS4o becomes 22 % slower when executed with the striped array (but still almost twice as fast as IPS4oNT). A reason for the slowdown of IPS4o might be that the striped array does not pin memory pages. Thus, during the block permutation, many memory pages are moved to other NUMA nodes. This is counterproductive since they are later accessed by threads on yet another NUMA node.

If a local array is used, the NUMA node holding it becomes a severe bottleneck—both IPS4oNT and IPS4o become several times slower. IPS4o suffers less from this bottleneck (slowdown factor 1.09 rather than 1.51 for IPS4oNT), possibly because a thread 푖 of IPS4o accesses a similar array stripe in a child task 푇′ as in a parent task 푇. Thus, during the execution of 푇, some memory pages used by 푇′ might be migrated to the NUMA node of 푖 (recall that local arrays are not pinned).

In conclusion, IPS4o is (much) faster than IPS4oNT for any array type tested here. IPS4o shows the best performance for the interleaved array and the NUMA array, with the interleaved array performing slightly better. Both arrays allocate memory pages distributed among the NUMA nodes, and, compared to the striped array, pin the memory pages to NUMA nodes. For these arrays, the average slowdown ratio of IPS4oNT to IPS4o is between 1.13 and 2.54.
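The proportional thread-group assignment referenced above can be illustrated as follows — a simplified sketch under our own assumptions, not the scheduler described in Section 3.1.2: given the sizes of the parallel tasks of one recursion level, each task receives a contiguous group of threads whose size is roughly proportional to its share of the elements (and at least one thread).

    #include <cstddef>
    #include <vector>

    // Number of threads assigned to each parallel task, roughly proportional
    // to its size; every task gets at least one thread. Leftover threads
    // (due to rounding) could additionally be given to the largest tasks.
    std::vector<int> thread_groups(const std::vector<std::size_t>& task_sizes, int t) {
        std::size_t total = 0;
        for (std::size_t s : task_sizes) total += s;
        std::vector<int> group(task_sizes.size(), 1);
        if (total == 0) return group;
        int assigned = static_cast<int>(task_sizes.size());
        for (std::size_t i = 0; i < task_sizes.size() && assigned < t; ++i) {
            int share = static_cast<int>(task_sizes[i] * static_cast<std::size_t>(t) / total);
            int extra = share > 1 ? share - 1 : 0;          // one thread already granted
            if (extra > t - assigned) extra = t - assigned; // never exceed t threads
            group[i] += extra;
            assigned += extra;
        }
        return group;
    }

IPS4oNT, in contrast, executes every such task with all 푡 threads, which is exactly what causes the memory bottleneck discussed above for the NUMA array.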


4.6 Parallel Algorithms

In this section, we compare parallel algorithms for different machines, input distributions, input sizes, and data types. We begin with a comparison of the average slowdowns of IPS4o, IPS2Ra, and their competitors for ten input distributions executed with six different data types (see Section 4.6.1). This gives a first general view of the performance of our algorithms as the presented results are aggregated across all machines. Afterwards, we compare the algorithms for input distribution Uniform with data type uint64 on different machines: We consider scaling with input sizes in Section 4.6.2 and scaling with the number of utilized cores in Section 4.6.3. Then, we discuss in Section 4.6.4 the running times for an interesting set of input distributions and data types, again by scaling the input size. In Section 4.6.5, we discuss the performance profiles of our algorithms and their most promising competitors. Finally, we separately compare IPS4o to IMSDradix, which is implemented only in a rather exploratory manner and thus works only in some special cases (see Section 4.6.6).

4.6.1 Comparison of Average Slowdowns

Table 4.4 shows average slowdowns of parallel algorithms for different data types and input distributions, aggregated over all machines and input sizes with at least 푡 ⋅ 2²¹ bytes. For the average slowdowns separated by machine, we refer to Appendix A.3, Tables A.8–A.11. In this section, an instance describes the inputs of a specific data type and input distribution. We say that “algorithm A is faster than algorithm B (by a factor of C) for some instances” if the average slowdown of B is larger than the average slowdown of A (by a factor of C) for these instances.

Summary of the Results. Overall, the results show that IPS4o is much faster than its competitors in most cases, except for some “easy” instances and for some instances with uint32 data types. IPS4o is faster than its fastest comparison-based competitor PBBS—in most cases by a factor of 1.5 or more. Except for some uint32 instances, IPS4o is even significantly faster than its fastest radix sort competitor RegionSort. This indicates that parallel sorting algorithms are memory bound for most inputs, except for data types that only have a few bytes. In most cases, IPS4o also outperforms our radix sorter IPS2Ra. IPS2Ra is faster for some instances with uint32 data types and, as expected, IPS2Ra is faster for Uniform instances. IPS2Ra has a better ranking than our fastest in-place radix sort competitor RegionSort. Thus, our approach of sorting data with parallel block permutations seems to perform better than the graph-based approach of RegionSort.

Comparison to IPS4o. IPS4o is the fastest algorithm for 30 out of 42 instances. IPS4o is outperformed for 8 instances having “easy” input distributions, i.e., Sorted, ReverseSorted, and Zero. For now, we consider only these instances: TBB detects Sorted and Zero inputs as sorted and returns immediately. RegionSort detects that the elements of Zero inputs only have zero bits and thus also returns immediately for Zero inputs. It is therefore not surprising that TBB and RegionSort sort easy inputs very fast. TBB (RegionSort) is better than IPS4o for 4 (3) Zero instances. TBB is also better for 3 Sorted instances, i.e., with double, uint64, and Pair data types. Also, RegionSort is faster than IPS4o for the ReverseSorted uint32 instance. Our algorithm also detects these instances but with a slightly larger overhead.

Type Distribution | average slowdowns (one column per algorithm, see caption)
double Sorted 1.42 10.96 2.02 15.47 13.36 1.06 42.23
double ReverseSorted 1.06 1.34 1.98 1.76 11.00 3.01 5.34
double Zero 1.54 12.83 1.80 14.55 166.67 1.06 41.78
double Exponential 1.00 1.82 1.97 2.60 3.20 10.77 4.97
double Zipf 1.00 1.96 2.12 2.79 3.55 11.56 5.33
double RootDup 1.00 1.54 2.22 2.52 3.88 5.54 6.28
double TwoDup 1.00 1.93 1.88 2.45 2.99 5.52 4.44
double EightDup 1.00 1.82 2.01 2.48 3.19 10.37 5.02
double AlmostSorted 1.00 1.73 2.40 5.12 2.18 3.54 6.37
double Uniform 1.00 2.00 1.85 2.53 2.99 9.16 4.39
Total 1.00 1.82 2.06 2.83 3.10 7.46 5.21
Rank 1 2 3 4 5 7 6
uint64 Sorted 1.45 10.56 1.80 15.65 13.50 1.09 6.72 56.24 33.08 8.83
uint64 ReverseSorted 1.17 1.42 2.23 2.01 12.27 3.40 1.34 8.07 4.65 1.76
uint64 Zero 1.69 13.58 1.87 15.02 171.86 1.13 1.36 51.61 32.50 1.16
uint64 Exponential 1.04 1.74 2.10 2.62 3.41 10.38 1.79 1.58 2.58 1.20
uint64 Zipf 1.00 1.82 2.16 2.69 3.60 10.48 1.61 16.80 6.04 1.68
uint64 RootDup 1.00 1.47 2.24 2.52 3.84 5.78 1.59 9.89 7.00 1.54
uint64 TwoDup 1.07 1.91 2.04 2.54 3.20 5.83 1.30 10.00 3.89 1.34
uint64 EightDup 1.02 1.69 2.06 2.42 3.25 9.54 1.37 12.45 5.00 1.44
uint64 AlmostSorted 1.11 1.88 2.73 5.75 2.54 4.15 1.36 9.84 5.87 1.55
uint64 Uniform 1.13 2.10 2.14 2.80 3.32 9.57 1.59 1.41 1.49 1.03
Total 1.05 1.79 2.20 2.91 3.28 7.54 1.51 6.17 4.07 1.38
Rank 1 4 5 6 7 10 3 9 8 2
uint32 Sorted 1.77 10.03 2.77 11.64 14.68 1.91 5.28 7.86 4.98
uint32 ReverseSorted 1.51 1.84 2.46 2.03 11.96 5.17 1.22 1.44 1.17
uint32 Zero 1.59 15.94 1.95 19.35 286.17 1.18 1.50 73.11 1.20
uint32 Exponential 1.31 2.85 2.34 3.68 4.55 17.62 1.57 2.02 1.02
uint32 Zipf 1.05 2.54 2.06 3.22 4.05 15.68 1.33 6.39 1.41
uint32 RootDup 1.09 1.78 2.26 2.62 3.92 6.16 1.37 7.50 1.42
uint32 TwoDup 1.40 3.18 2.32 3.59 4.35 9.10 1.24 1.83 1.02
uint32 EightDup 1.23 2.84 2.26 3.41 4.24 16.24 1.33 1.84 1.08
uint32 AlmostSorted 1.38 2.08 2.63 5.66 3.22 4.54 1.32 1.62 1.08
uint32 Uniform 1.41 3.26 2.28 3.68 4.45 14.52 1.36 1.61 1.03
Total 1.26 2.59 2.30 3.60 4.09 10.75 1.36 2.49 1.14
Rank 2 6 4 7 8 9 3 5 1
Pair Sorted 1.39 9.38 1.82 15.05 15.50 1.03 5.75 20.15 52.30 8.02
Pair ReverseSorted 1.09 1.47 2.06 2.22 10.46 3.15 1.35 3.21 8.24 1.77
Pair Zero 1.66 14.10 1.77 15.21 118.30 1.08 1.21 11.71 54.52 1.16
Pair Exponential 1.12 1.77 2.22 2.76 3.09 6.92 1.92 1.07 9.52 1.39
Pair Zipf 1.00 1.62 2.04 2.53 2.79 6.30 1.62 7.35 9.87 1.77
Pair RootDup 1.01 1.58 2.08 2.81 3.84 4.88 1.58 4.35 11.76 1.52
Pair TwoDup 1.02 1.67 2.02 2.44 2.96 4.10 1.43 4.88 7.54 1.48
Pair EightDup 1.02 1.59 2.05 2.41 2.83 6.01 1.40 6.98 8.81 1.57
Pair AlmostSorted 1.05 1.95 2.69 5.67 3.24 3.88 1.37 4.27 10.94 1.65
Pair Uniform 1.08 1.81 2.12 2.62 2.93 6.15 1.67 1.20 5.36 1.04
Total 1.04 1.71 2.16 2.90 3.08 5.35 1.56 3.46 8.87 1.47
Rank 1 4 5 6 7 9 3 8 10 2
Quartet Uniform 1.01 1.29 2.08 2.40 2.93 4.42
Rank 1 2 3 4 5 6
100B Uniform 1.05 1.14 2.14 2.35 3.18 3.55
Rank 1 2 3 4 5 6

Table 4.4: Average slowdowns of the parallel algorithms IPS4o, IPS2Ra, PS4o, TBB, PBBS, PBBR, ASPaS, RADULS2, MCSTLbq, RegionSort, and MCSTLmwm for different data types and input distributions. The slowdowns are averaged over the machines and over the input sizes with at least 푡 ⋅ 2²¹ bytes. Not every algorithm appears in every data type block (e.g., the radix sorters do not support double and 100B).


In this paragraph, we do not consider “easy” instances. IPS4o is significantly faster than its competitors for 23 out of 30 instances (by a factor of more than 1.15). For 3 instances, IPS4o performs similarly (within ±0.06) to RegionSort (the AlmostSorted distributed uint32 instance and the Uniform distributed uint32 instance) and PBBR (the Exponential distributed Pair instance). RegionSort is the only competitor that is noticeably faster than IPS4o for at least one instance, namely TwoDup distributed uint32 inputs (factor 1.13). Overall, IPS4o is faster than its respectively fastest competitor by a factor of at least 1.2, 1.4, 1.6, and 1.8 for 22, 13, 8, and 5 noneasy instances, respectively. If we only consider comparison-based competitors, IPS4o is faster by a factor of at least 1.2, 1.4, 1.6, and 1.8 for 29, 28, 22, and 10 noneasy instances, respectively. The values become even better when we only consider in-place comparison-based competitors. In this case, IPS4o is faster by a factor of 2.15 for all noneasy instances.

IPS4o is much faster than PS4o. The only difference between these algorithms is that IPS4o implements the partitioning routine in-place whereas PS4o is non-in-place. We note that the algorithms share most of their code; even the decision tree is the same. The reason why PS4o is slower than IPS4o is that IPS4o is more cache efficient than PS4o: For example, PS4o has about 46 % more L3-cache misses than IPS4o for Uniform distributed uint64 inputs with 2²⁷ elements, whereas the number of instructions and the number of branches (and branch misses) of PS4o are similar to the ones of IPS4o. The sequential results presented in Section 4.3 support this conjecture, as the gap between the sequential versions is smaller than the gap between the parallel versions.

Comparison to IPS2Ra. Our in-place radix sorter IPS2Ra performs slightly better than our fastest competitor, RegionSort. IPS2Ra is faster than RegionSort for 11 out of 21 noneasy instances. In particular, IPS2Ra is faster by a factor of at least 1.2, 1.4, and 1.6 for 9, 4, and 1 noneasy instances, and the factor is never smaller than 0.8. IPS2Ra outperforms IPS4o for instances with uint32 data types and some Uniform distributed instances. For uint32 instances, IPS2Ra is faster than IPS4o by a factor of 1.37. Interestingly, IPS2Ra is not much faster than IPS4o for Uniform instances with more than 32-bit elements. This indicates that the evaluation of the branchless decision tree is not a limiting factor for these data types in IPS4o. For the remaining instances (data types with more than 32-bit elements and instances with noneasy distributions other than Uniform), IPS4o is significantly faster than IPS2Ra.

From now on, we do not present results for ASPaS, TBB, PS4o, and MCSTLmwm. Among the non-in-place comparison-based competitors, the algorithms ASPaS, PS4o, and MCSTLmwm perform worse than PBBS. Among the in-place comparison-based competitors, the parallel quicksort algorithm TBB is slower for noneasy instances than the quicksort implementation MCSTLbq.

4.6.2 Running Times for Uniform Input

In this section, we compare IPS4o and IPS2Ra to their closest parallel competitors for Uniform distributed uint64 inputs. Figure 4.4 depicts the running times separately for each machine. We refer to Table A.7 in Appendix A.3 for exact running time numbers. The results obtained for double inputs are similar to the running times obtained for uint64 inputs. We decided to present results for uint64 inputs as our closest parallel competitors for data types with “primitive” keys, i.e., RegionSort, PBBR, and RADULS2, do not support double inputs.


Figure 4.4: Running times of parallel algorithms sorting uint64 values with input distribution Uniform executed on different machines (I4x20, A1x16, I2x16, A1x64; algorithms: IPS4o, PBBS, MCSTLbq, PBBR, RADULS2, RegionSort, and IPS2Ra).

We outperform all comparison-based algorithms significantly for medium and large input sizes, e.g., by a factor of 1.49 to 2.45 for the largest inputs depending on the machine. For in-place competitors, the factor is even 2.55 to 3.71. For small inputs, the non-in-place competitors PBBS and PBBR are faster. Note, however, that this advantage is reduced when the sorter object described in Section 4.1 is available. Also, the performance of PBBS and PBBR significantly decreases for larger inputs. Exploiting integer keys slightly improves the performance of IPS2Ra compared to IPS4o. The radix sorters RADULS2 and RegionSort are only competitive for large input sizes.

Still, they are very inefficient even for these input sizes on I4x20, our largest machine. In particular, they are 2.72 and 3.08 times slower than IPS2Ra, respectively, for the largest input size on this machine. We sort twice as much data as our non-in-place competitors (PBBS, RADULS2, and PBBR), which run out of memory for 2³² elements on A1x16. Also, the results in Table 4.4, Section 4.6.1, show that uint64 Uniform inputs are “best case” inputs for RADULS2. Other input distributions and data types are sorted much less efficiently by RADULS2.

Figure 4.5: Speedup of parallel algorithms with different numbers of threads relative to our sequential implementation I1S4o on different machines (I4x20, A1x16, I2x16, A1x64), sorting 2³⁰ elements of uint64 values with input distribution Uniform. Algorithms: IPS4o, PBBS, MCSTLbq, PBBR, RADULS2, RegionSort, and IPS2Ra.

4.6.3 Speedup Comparison and Strong Scaling

The goal of the speedup benchmark is to examine the performance of the parallel algorithms with increasing availability of cores. Benchmarks with 2푖 threads are executed on the first 푖 cores, starting at the first NUMA node until it is completely used. Then we continue using the cores of the next NUMA node, and so on. Here, by cores we mean “physical cores”, each of which runs two hardware threads on our machines, and we use NUMA nodes as a synonym for CPUs.⁵

Therefore, the benchmark always takes advantage of the “full capacity” of a core with hyperthreading. Preliminary experiments indicate that this gives around 10 % additional performance in most situations. No algorithm is slowed down by hyperthreading. A plausible reason is that hyperthreading helps with latency hiding during memory accesses or conditional branches.

Figure 4.5 depicts the speedup of parallel algorithms executed on different numbers of cores relative to our sequential implementation I1S4o on our machines for Uniform inputs.⁶ We first compare our algorithms to the non-in-place radix sorter RADULS2. This competitor is fast for Uniform inputs, but it is slow for inputs with skewed key distributions and inputs with duplicated keys (see Table 4.4 in Section 4.6.1). On the machines with one CPU, A1x16 and A1x64, RADULS2 is faster when we use only a fraction of the available cores. When we further increase the available cores on these machines, the speedup of RADULS2 stagnates and our algorithms, IPS4o and IPS2Ra, catch up until they have about the same speedup. RADULS2 also outperforms all algorithms on our machine with four CPUs, I4x20, when the algorithms use only one CPU. On the same machine, the performance of RADULS2 stagnates when the algorithms use more than one CPU. When RADULS2 uses all CPUs, it is even a factor of 2.54 slower than our algorithm IPS4o. We have seen the same performance characteristics when we executed IPS4oNT on this machine. IPS4o solves this problem of IPS4oNT with more sophisticated memory and task management. Thus, we conclude that the same problems also cause the performance problems of RADULS2. Our algorithms IPS4o and IPS2Ra use the memory of this machine more efficiently and do not become memory bound—the speedup of our algorithms increases linearly on I4x20.

The in-place radix sorter RegionSort seems to have similar problems as RADULS2 on I4x20. Even worse, the speedup of RegionSort stagnates on three out of four machines when the available cores increase. When all cores are used, the speedup of RegionSort is a factor of 1.11 to 2.70 smaller than the speedup of IPS2Ra. On three out of four machines, our radix sorter IPS2Ra has a larger speedup than our samplesort algorithm IPS4o when we use only a few cores. For more cores, their speedups converge on two machines, even though IPS2Ra performs significantly fewer instructions.

On the machines with one CPU, A1x16 and A1x64, IPS4o has speedups of 8.37 and 40.92, respectively. This is a factor of 1.46 and 1.85, respectively, more than the fastest comparison-based competitor. On the machine with four CPUs, I4x20, and on the machine with two CPUs, I2x16, the speedups of IPS4o are 20.91 and 17.49, respectively. This is even a factor of 2.27 and 2.17, respectively, more than the fastest comparison-based competitor. In conclusion, our in-place algorithms outperform their comparison-based competitors significantly on all machines, independently of the number of assigned cores. For example, IPS4o yields a speedup of 40.92 on the machine A1x64 whereas PBBS only obtains a speedup of 22.17.

As expected, the fastest competitors for the (Uniform) input used in this experiment are radix sorters. The fastest radix sort competitor, the non-in-place RADULS2, starts with a very large speedup when only a few cores are in use. For more cores, RADULS2 remains faster than our algorithms on one machine (I2x16).

⁵Many Linux tools interpret a CPU with two hardware threads per core as two distinct NUMA nodes—one contains the first hardware thread of each core and the other contains the second hardware thread of each core.
⁶The reference times of I1S4o for 2³⁰ elements of uint64 values with input distribution Uniform are 37.88, 29.91, 76.59, and 37.23 seconds on I4x20, A1x16, I2x16, and A1x64, respectively.

On two machines (A1x64 and A1x16), the speedup of RADULS2 converges to the speedups of our algorithms. And on our largest machine with four CPUs (I4x20), the memory management of RADULS2 seems not to be practical at all. On this machine, RADULS2 is even a factor of 2.54 slower than IPS4o. The in-place radix sort competitor RegionSort is in all cases significantly slower than our algorithms. The speedup of IPS2Ra is larger than the one of IPS4o when they use only a few cores of the machine. However, in most cases the speedups level out as the number of cores increases.
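The core-filling order used in this benchmark (fill the first NUMA node core by core, then the next) comes down to pinning each benchmark thread to a specific logical CPU before sorting starts. A minimal Linux-specific sketch — the mapping from “hardware thread of core 푖 on a given NUMA node” to logical CPU ids is machine dependent and assumed to be known, e.g., from hwloc or lscpu:

    // Compile with -D_GNU_SOURCE so that pthread_setaffinity_np is declared.
    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to one logical CPU; returns true on success.
    bool pin_current_thread(int logical_cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(logical_cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set) == 0;
    }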

4.6.4 Input Distributions and Data Types

In this section, we compare our algorithms to our competitors for different input distributions and data types by scaling the input size. We show results of Uniform inputs for the data types double, Pair, and 100B. For a discussion of Uniform distributed uint64 data types, we refer to Section 4.6.2. For the remaining input distributions, we use the data type uint64 as a convenient example: In contrast to double, uint64 is supported by all algorithms in Figure 4.6. Additionally, we assume that uint64 is more interesting than uint32 in practice. For Figure 4.6, we decided to present results obtained on machine A1x64 as our competitors have the smallest absolute running times on this machine. For more details, we refer to Figures A.2–A.5 in Appendix A.3, which report the results separately for each machine.

The in-place comparison-based algorithm MCSTLbq is significantly slower than our algorithms for all inputs. For example, MCSTLbq is a factor of 2.46 to 3.87 slower than IPS4o for the largest input size. We see this improvement as a major contribution of our work. For many inputs, our IPS4o is faster than IPS2Ra. For most inputs, IPS4o (and to some extent IPS2Ra) is much faster than RegionSort, our closest competitor. For example, IPS4o is up to a factor of 1.61 faster for the largest inputs (푛 = 2³⁷/퐷) and up to a factor of 1.78 faster for inputs of medium size (푛 = 2²⁹/퐷). The results show that radix sorters are often slow for inputs with many duplicates or skewed key distributions (i.e., Zipf, Exponential, EightDup, RootDup). Yet, our algorithm seems to be the least affected by this.

Our algorithms outperform their comparison-based competitors significantly for all input distributions and data types with 푛 ≥ 2²⁸/퐷. For example, IPS4o outperforms PBBS by a factor of 1.25 to 2.20 for the largest inputs. Only for small inputs, where the algorithms are inefficient anyway, are our algorithms consistently outperformed by one algorithm (the non-in-place PBBS). The remainder of this section compares our algorithms and their competitors in detail.

The non-in-place comparison-based PBBS is faster than IPS4o only for small inputs (푛 ≤ 2²⁷/퐷). We note that all algorithms are inefficient for these small inputs. However, for inputs where the algorithms become efficient and for large inputs, IPS4o significantly outperforms PBBS. For example, PBBS is a factor of 1.25 to 2.20 slower than IPS4o for the largest input size. The difference between PBBS and IPS4o is smallest for 100B inputs. This input has very large elements, which are moved only twice by PBBS due to its √푛-way partitioning strategy. We see this as an important turning point: While the previous state-of-the-art comparison-based algorithm worked non-in-place, it is now robustly outperformed by our in-place algorithms for inputs that are sorted efficiently by parallel comparison-based sorting algorithms.

The non-in-place radix sorter PBBR is tremendously slow for all inputs with skewed key distributions and inputs with identical keys. In particular, its running times exceed the limits of Figure 4.6 for AlmostSorted, RootDup, TwoDup, and Zipf inputs.


Figure 4.6: Running times of parallel algorithms on different input distributions and data types of size 퐷 executed on machine A1x64 (panels: Uniform double, AlmostSorted uint64, RootDup uint64, TwoDup uint64, Exponential uint64, Zipf uint64, Uniform Pair, and Uniform 100B; algorithms: IPS4o, PBBS, MCSTLbq, PBBR, RADULS2, RegionSort, and IPS2Ra). The radix sorters PBBR, RADULS2, RegionSort, and IPS2Ra do not support the data types double and 100B.


Figure 4.7: Pairwise performance profiles of IPS4o to PBBS, MCSTLbq, RegionSort, and IPS2Ra for input sizes with at least 푡 ⋅ 2²¹ bytes on all machines and all input distributions excluding the “easy” ones Sorted, ReverseSorted, and Zero. Profiles with the radix sorters RegionSort and IPS2Ra only use inputs with unsigned integer keys (uint32, uint64, and Pair data types). (Panels: All Data Types and Unsigned Integer Key; each panel plots the fraction of inputs over the running time ratio relative to the best algorithm.)

Exceptions are Uniform inputs with the Pair data type: For these inputs, PBBR is faster than our algorithms for small input sizes and performs similarly for medium and large inputs. However, this advantage disappears for other uniformly distributed inputs (see Table 4.4 in Section 4.6.1). The non-in-place radix sorter RADULS2 is a factor of 2.20 to 2.72 slower than IPS4o for the largest input size. For smaller inputs, its performance is even worse for almost all inputs.

Even though IPS2Ra outperforms the in-place radix sorter RegionSort for almost all inputs, IPS4o is even faster. Thus, we concentrate our analysis on comparing RegionSort to IPS4o rather than to IPS2Ra. For input data types supported by RegionSort, i.e., integer keys, it is our closest competitor. Overall, we see that the efficiency of RegionSort slightly degrades for inputs with 푛 > 2³², whereas the performance of IPS4o remains the same for these large input sizes. RegionSort performs best for AlmostSorted and TwoDup distributed inputs. For these inputs, RegionSort is competitive with IPS4o in most cases. However, RegionSort performs much worse than IPS4o for the remaining inputs, e.g., random inputs (Uniform), skewed inputs (Exponential and Zipf), and inputs with many duplicates (e.g., RootDup). For these distributions, RegionSort is slower than IPS4o by a factor of 1.17 to 1.61 for the largest input size and becomes even less efficient for smaller inputs, e.g., RegionSort is slower than IPS4o by factors of 1.29 to 1.68 for 푛 = 2²⁷.

IPS4o is competitive with or faster than IPS2Ra for all inputs. IPS4o and IPS2Ra perform similarly for Uniform, TwoDup, RootDup, and AlmostSorted distributed inputs of medium size. Still, for these inputs, the performance of IPS2Ra (significantly) decreases for large inputs (푛 > 2³²) in most cases. For inputs with very skewed key distributions, i.e., Exponential and Zipf, IPS4o is significantly faster than IPS2Ra.


4.6.5 Comparison of Performance Profiles

In this section, we compare the pairwise performance profiles⁷ of IPS4o with the in-place comparison-based MCSTLbq, the non-in-place comparison-based PBBS, and the radix sorter RegionSort, as well as the pairwise performance profiles of IPS2Ra and RegionSort, for inputs with at least 푡 ⋅ 2²¹ bytes. The profiles are shown in Figure 4.7. We do not compare our algorithms to the radix sorters PBBR and RADULS2 as these are non-in-place and as their profiles are much worse than the profiles of the in-place radix sorter RegionSort. Overall, the performance of IPS4o is much better than the performance of any other sorting algorithm. When we only consider radix sorters, the performance profile of IPS2Ra is better than the one of RegionSort.

IPS4o performs significantly better than PBBS. For example, PBBS sorts only 2.4 % of the inputs at least as fast as IPS4o. Also, there is virtually no input for which PBBS is at least 1.50 times as fast as IPS4o. In contrast, IPS4o sorts 66 % of the inputs at least 1.50 times as fast as PBBS. The performance profile of MCSTLbq is even worse than the one of PBBS. IPS4o is faster than MCSTLbq for virtually any input. IPS4o is even three times as fast as MCSTLbq for almost 50 % of the inputs.

The performance of IPS4o is also significantly better than the performance of RegionSort. For example, IPS4o sorts 74 % of the inputs faster than RegionSort. Also, RegionSort sorts only 9 % of the inputs at least 1.25 times as fast as IPS4o. In contrast, IPS4o sorts 44 % of the inputs at least 1.25 times as fast as RegionSort.

Among all pairwise performance profiles, the profiles of IPS4o and IPS2Ra are the closest. Still, IPS4o performs better than IPS2Ra. For example, IPS4o sorts 62 % of the inputs faster than IPS2Ra. Also, IPS2Ra outperforms IPS4o for 16 % of the inputs by a factor of 1.25 or more. On the other hand, IPS4o outperforms IPS2Ra for 31 % of the inputs by a factor of 1.25 or more.

4.6.6 Comparison to IMSDradix

We compare our algorithm IPS4o to the in-place radix sorter IMSDradix [PR14] separately, as the available implementation works only under rather special circumstances—64 threads, 푛 > 2²⁶, and integer key-value pairs with values stored in a separate array. Also, that implementation is not in-place and requires a very specific input array: On a machine with 푚 NUMA nodes, the input array must consist of 푚 subarrays with a capacity of 1.2푛/푚 each. The experiments in [Pol14] pin subarray 푖 to NUMA node 푖. We nevertheless see the comparison as important since IMSDradix uses a basic approach to block permutation similar to that of IPS4o.

Table 4.5 shows the average slowdowns of IPS4o and IMSDradix for different input distributions executed on I2x16. We did not run IMSDradix on A1x64, I4x20, and A1x16 as these machines do not have exactly 64 hardware threads. The results show that IPS4o is much faster than IMSDradix for all input distributions. For example, the average slowdown ratio of IMSDradix to IPS4o is 7.10 for RootDup input on I2x16. Note that IMSDradix breaks for Zero input and that its average slowdowns are between 35.64 and 44.58 for some input distributions with duplicated keys (TwoDup, EightDup, and Zipf) and with a skewed key distribution (Zipf). For Sorted input, IMSDradix is also much slower because IPS4o detects sorted inputs.

⁷See Section 4.2 for an explanation of performance profiles.


I2x16
Distribution     IPS4o   IMSDradix
Sorted           1.00    77.00
ReverseSorted    1.00    7.23
Zero             1.00    —
Exponential      1.00    2.31
Zipf             1.00    35.64
RootDup          1.00    7.10
TwoDup           1.00    41.03
EightDup         1.00    44.58
AlmostSorted     1.00    10.47
Uniform          1.00    1.68
Total            1.00    10.95
Rank             1       2

Table 4.5: Average slowdowns of IPS4o and IMSDradix for Pair data types, different input distributions, and machine I2x16 with inputs containing at least 푡 ⋅ 2²¹ bytes. IMSDradix breaks for Zero input.

4.6.6.a) Other Parallel Sorting Algorithms

We are aware that we compare our algorithms against a limited selection of sorting algorithms. A detailed comparison with all existing implementations is basically impossible. However, we aimed to choose at least the most promising competitor from each category of sorting algorithms, e.g., (non-)in-place, (non-)comparison-based, and so on. That said, it is sometimes hard to say which algorithms are the most promising competitors. For example, we recently became aware of the library ParlayLib [BAD20], which provides several parallel sorting algorithms. Preliminary experiments with Uniform distributed uint64 inputs on machine A1x16 show that these algorithms perform similarly to PBBR and PBBS. For 푛 ≤ 2²¹, the ParlayLib algorithms are, similar to PBBR and PBBS, faster than our algorithms. For larger inputs – which we focus on in this paper – the ParlayLib algorithms are significantly slower than our algorithms and, again, perform similarly to PBBR and PBBS. For example, IPS4o outperforms these algorithms by a factor of at least 1.40 and 1.45 for 2²⁵ and 2³⁰ elements, respectively.

An interesting hybrid between radix sort and samplesort is LearnedSort [Kri+20; Kri20], which uses a sample to learn a piecewise linear approximation of the cumulative key distribution function. In principle, this combines a more robust adaptation to the input distribution than radix sort with potentially faster classification than samplesort. However, the current implementation is outperformed by both relatives. In the technical report [Axt+20], we report detailed measurements on a version that had a performance bug when 푛 is not a multiple of 10⁶. We repeated a part of the measurements on a repaired implementation, which is, however, still significantly slower than either I1S4o or the best radix sorters.


Figure 4.8: Accumulated running time (normalized by 8푛 log₂ 푛) of the phases of our sequential samplesort and radix sort algorithms I1S4o and I1S2Ra (top left and top right) and their parallel counterparts IPS4o and IPS2Ra (bottom left and bottom right), obtained on machine A1x64 for uint64 values with input distribution Uniform. Phases: Sampling, Classification, Permutation, Cleanup, Base Case, and Overhead.

4.7 Phases of Our Algorithms

Figure 4.8 shows the running times of the sequential samplesort algorithm I1S4o and the sequential radix sorter I1S2Ra as well as their parallel counterparts IPS4o and IPS2Ra. The running times are split into the four phases of the partitioning step (sampling, classification, permutation, and cleanup), the time spent in the base case algorithm, and overhead for the remaining components of the algorithm such as initialization and scheduling. In the following discussion of the sequential and parallel execution times, we report numbers for the largest input size unless stated otherwise.


Sequential Algorithms

The running time curves of I1S2Ra are less smooth than those of I1S4o because this code currently lacks the same careful adaptation of the distribution degree 푘. The time for sampling in the partitioning steps of I1S4o is relatively small, i.e., 7.88 % of the total running time for 푛 = 2³². For I1S2Ra, no sampling is performed.

The classification phase of I1S4o takes up about half of the total running time. The permutation phase is about eight times as fast as the classification phase. As both phases transfer about the same data volume and as the classification phase performs a factor of Θ(log 푘) more work (log 푘 = 8), we conclude that the classification phase is bounded by its work. It is interesting to note that this was very different in 2004: In the 2004 publication [SW04], data distribution dominated element classification. Since then, the peak memory bandwidth of high-performance processors has increased much faster than the internal speed of a single core. The higher investment in memory bandwidth was driven by the need to accommodate the memory traffic of multiple cores. Indeed, we will see below that for parallel execution, memory access once more becomes crucial. The classification and permutation phases of I1S2Ra behave similarly to those of I1S4o. Since the work for classification is much lower for radix sort, the running time ratio between these two phases is smaller, yet with 2.43 still quite high (푛 = 2³²).

The cleanup takes less than five percent of the total running time of I1S4o and less than two percent of that of I1S2Ra. The sequential algorithms spend a significant part of the running time in the base case; it takes 36.71 % of the total running time for 푛 = 2³². For I1S2Ra, the base case even dominates the overall running time (70.29 % for 푛 = 2³²) because it performs less work in the partitioning steps and because it uses larger base cases. The overhead caused by the data structure construction and task scheduling is negligible in the sequential case.
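To spell the work-bound argument out with a rough model (our own notation, not taken from the thesis): let the classification phase cost roughly α·푛·log₂ 푘 for the tree traversal plus β·푛 for memory traffic per partitioning step, and the permutation phase roughly β·푛, since both phases transfer about the same volume. The measured ratio of about eight then gives

    \[
      \frac{T_{\mathrm{class}}}{T_{\mathrm{perm}}}
        \approx \frac{\alpha n \log_2 k + \beta n}{\beta n}
        = 1 + \frac{\alpha}{\beta}\log_2 k \approx 8
      \quad\Longrightarrow\quad
      \frac{\alpha}{\beta}\log_2 k \approx 7 ,
    \]

so with log₂ 푘 = 8 the per-element classification work term is about seven times the memory-traffic term, i.e., the phase is limited by its work rather than by memory bandwidth.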

Parallel Algorithms

The partitioning steps of the parallel algorithms are significantly slower than those of the sequential algorithms. While the permutation phase of I1S4o is almost negligible, it takes a significant fraction of the total running time of IPS4o. For example, the total time across all threads needed for the permutation phase increases by a factor of 11.59 for IPS4o and 19.88 for IPS2Ra compared to their sequential counterparts. Since the permutation phase does little more than copy rather large blocks, the difference is mainly caused by memory bottlenecks.

Since memory access costs now dominate the running time, the performance advantage of radix sort over samplesort decreases when executed in parallel instead of sequentially. For other input distributions as well as other data types, the parallel radix sorter is even slower than parallel samplesort (see Section 4.6.1). In other words, the price paid for less work in classification (for radix sort) is more data transfers due to less accurate classification. In the parallel setting, this tradeoff is harmful except for uniformly distributed keys and small element sizes.

When the input size is below 푛 = 2²⁶, the overhead constitutes a major part of the execution time. Note that this overhead can be mostly eliminated using the sorter objects described in Section 4.1. For example, IPS4o becomes 1.74 times faster for 푛 = 2²⁴.


4.8 Conclusion

In-place Parallel Super Scalar Samplesort (IPS4o) and In-place Parallel Super Scalar Radix Sort (IPS2Ra) are among the fastest sorting algorithms both sequentially and on multi-core machines. The algorithms can also be used for data distribution and local sorting in distributed-memory parallel algorithms (e.g., [Axt+15a]). Both algorithms are the fastest known algorithms for a wide range of machines, data types, input sizes, and data distributions. Exceptions are small inputs (where cache misses are less relevant), restriction to a fraction of the available cores (where competitors profit from nonportable SIMD instructions), and almost sorted inputs (which profit from sequential adaptive sorting algorithms). Even in those exceptions, our algorithms, which were not designed for these purposes, are surprisingly close to more specialized implementations. One reason is that for large inputs, memory access costs overwhelmingly dominate the total cost of a parallel sorting algorithm, so that saving elsewhere has little effect. Our comparison-based parallel algorithm IPS4o even mostly outperforms the integer sorting algorithms, despite having a logarithmic factor overhead with respect to executed instructions. The memory access efficiency of our algorithms is also the reason for the initially surprising observation that our in-place algorithms outperform algorithms that are allowed to use additional space.

Both algorithms significantly outperform the fastest parallel comparison-based competitor, PBBS, on almost all inputs. They are also significantly better than the fastest sequential comparison-based competitor, BlockPDQ, except for sorted and almost-sorted inputs. The fastest radix sort competitors are SkaSort (sequential) and RegionSort (parallel). Our radix sorter is significantly faster than SkaSort and competitive with RegionSort. Also, our parallel samplesort algorithm is significantly faster than RegionSort for all inputs except some 32-bit inputs. Our parallel samplesort algorithm even sorts uniformly distributed inputs significantly faster than RegionSort if the keys contain more than 32 bits.

Radix sorters that take advantage of nonportable hardware features, e.g., IppRadix (vector instructions) and RADULS2 (non-temporal writes), are very fast for small (Uniform distributed) data types. IppRadix, for example, sorts 32-bit unsigned integers very fast, and RADULS2 is very fast for 64-bit unsigned integers. However, the interesting methods developed for these algorithms have little impact on larger data types and “hard” input distributions, and thus we perform better overall.

We compare the algorithms for input arrays with various NUMA memory layouts. With our new locality-aware task scheduler, IPS4o is robustly fast for all NUMA memory layouts.

Future Work

Several improvements of our algorithms can be considered in order to address the remaining cases where our algorithms are outperformed. For small inputs, non-in-place variants of our algorithms with preallocated data structures, smaller values of the distribution factor 푘, and smaller block sizes could be faster. Also, the base case sorter becomes more relevant. Here we could profit from several results on fast sorting of very small inputs [BMS20; Bra17; Cod+17].


As a further practical improvement, we could speed up the branchless decision tree with vector instructions. Preliminary results have shown improvements of up to a factor of 1.25 for I1S4o with a decision tree using AVX-512 instructions. However, a general challenge remains how data-parallel instructions can be harnessed for sorting data with large keys and associated information and how to balance portability and efficiency.

With respect to the volume of accessed memory, which is a main distinguishing feature of our algorithms, further improvements are conceivable. One option is to reconsider the approach from most radix sort implementations and of the original super scalar samplesort [SW04] to first determine exact bucket sizes. This is particularly attractive for radix sorters since computing bucket indices is very fast. Then one could integrate the classification phase and the permutation phase of IPS4o. To make this efficient, one should still work with blocks of elements moved between local buffer blocks and the input/output array. For samplesort, one would approximate bucket sizes using the sample and a cleanup would be required. Another difficulty may be a robust parallel implementation that avoids contention for all input distributions.

A more radical approach to reducing the memory access volume would be to implement the permutation phase in sublinear time by using the hardware support for virtual memory. For large inputs, one could make data blocks correspond to virtual memory pages. One could then move blocks by just changing their virtual addresses. It is unclear to us though whether this is efficiently (or even portably) supported by current operating systems. Also, the output might have an unexpected mapping to NUMA nodes, which might affect the performance of subsequently processing the sorted array.

Our radix sorter IPS2Ra is currently a prototype meant for demonstrating the usefulness of our scheduling and data movement strategies independently of a comparison-based sorter. It could be made more robust by adapting the function for extracting bucket indices to various input distributions (which can be approximated by analyzing a sample of the input). This could in particular entail various compromises between the full-fledged search tree of IPS4o and the plain byte extraction of IPS2Ra. For example, one could accelerate the search tree traversal of super scalar samplesort by precomputing a lookup table of starting nodes that are addressed by the most significant bits of the key. One could also consider the approach from the LearnedSort algorithm [Kri20] that addresses a large number of buckets using few linear functions. Perhaps, approximate distribution-learning approaches can be replaced by fast and accurate computational-geometry algorithms. Existing geometry algorithms [II86; DP73] might have to be adapted to use a cost function that optimizes the information gain from using a small number of piece-wise linear functions.

Adaptive sorting algorithms are an intriguing area of research in algorithms [EW92]. However, implementations [Pet02; MW18] currently cannot compete with the best nonadaptive algorithms except for some extreme cases. Hence, it would be interesting to engineer adaptive sorting algorithms to take the performance improvements of fast nonadaptive algorithms (such as ours) into account.
The measurements reported in this work were performed using somewhat non-portable implementations that use a 128-bit fetch-and-add instruction specific to x86 architectures (see also Section 4.1). Our portable variants currently use locks that incur noticeable overheads for inputs with only very few different keys. Different approaches can avoid locks without noticeable overhead but these would lead to more complicated source code.


Coming back to the original motivation for an alternative to quicksort variants in standard libraries, we see IPS4o as an interesting candidate for sufficiently large inputs. Together with the variants discussed above, this might extend to medium or even small inputs. A remaining issue is code complexity. When code size matters (e.g., as indicated by a compiler flag like -Os), one could use IPS4o with fixed 푘 and a larger base case size. Formal verification of the correctness of the implementation might help to increase trust in the remaining cases.



II Engineering Distributed Sorting Algorithms

In this part of the thesis, we study distributed sorting algorithms. Chapter 5 introduces basic definitions and preliminaries, discusses problems and pitfalls of bulk data transfers, and gives an overview of existing distributed sorting algorithms. In Chapter 6, we propose new robust sorting algorithms—RFIS for very small input sizes (Section 6.1), RQuick for small input sizes (Sections 6.2 and 6.3), and RLM-sort as well as AMS-sort for large inputs (Sections 6.5 and 6.6). We then turn to an extensive experimental evaluation in Chapter 7. We also invite the reader to Appendix B. Appendix B.1 addresses the problem of running time fluctuations caused by large message startup latencies. In Appendix B.2, we propose the communication library RangeBasedComm (RBC) which provides scalable and efficient communication primitives on processor subsets—an essential feature for scalable implementations of recursive algorithms with sublinear running time.

References and Attributions. Part II of the thesis is based on the conference articles Ref. [Axt+15a; AS17; AWS18]. The publication Ref. [Axt+15a] was jointly published with Peter Sanders, Timo Bingmann, and Christian Schulz. The majority of the publication has been written by Peter Sanders, who also contributed the algorithmic ideas. The schematic illustrations of the algorithms were provided by Timo Bingmann. The author of this thesis provided the prototypical implementation of the algorithms and presented the experimental results. The publication Ref. [AS17] is joint work with Peter Sanders. Most parts of the publication were contributed and written by the author of this thesis. Peter Sanders also provided several notable parts of the publication. The implementations and the experimental evaluation presented in the publication are exclusive work of the author of this thesis. The publication Ref. [AWS18] is joint work with Armin Wiebigke and describes the RBC library. An initial version of this communication library has been implemented by Armin Wiebigke for his bachelor’s thesis. Armin Wiebigke was supervised by the author of this thesis, and publication Ref. [AWS18] has exclusively been written by the author of this thesis. The author extended the library with fast and asymptotically optimal communication primitives and incorporated the library into the algorithms presented here. All implementations and experiments presented in Part II are exclusive work of the author of this thesis. Part II contains partial copies of the conference articles Ref. [Axt+15a; AS17].

Chapter 5 Overview of Distributed Sorting Algorithms

Chapter 5 introduces the reader to relevant definitions, preliminary information, and related work for distributed sorting. In Section 5.1, basic definitions and preliminaries are provided. We discuss problems and pitfalls of bulk data transfers in Section 5.2. Section 5.3 gives an overview of existing distributed sorting algorithms. In Section 5.4, we emphasize publications that highly influenced this thesis and we discuss work on sorting algorithms that we see orthogonal to the results presented in this thesis.

5.1 Definitions and Preliminaries

We denote by 푡 the number of PEs available to us. The input of a sorting algorithm consists of 푛 elements with Θ(푛/푡) elements on each PE. The output must be globally sorted, i.e., each PE has elements with consecutive ranks and no element on PE 푖 is larger than any element on PE 푖 + 1. We also want O(푛/푡) output elements on each PE. Sometimes we more concretely consider perfectly balanced inputs and an output with at most (1 + 휖)푛/푡 elements per PE for some small positive constant 휖. We call 1 + 휖 ≥ 1 the imbalance factor of an algorithm. We call inputs dense when 푛/푡 ≥ 1 and sparse when 푛/푡 < 1. In the case that the input size is smaller than the number of PEs, we allow O(1) local input and output elements. For simplicity, we assume that sparse input is stored on the first O(푛) PEs. Table 5.1 gives an overview of the most important notation used in this chapter.

Table 5.1: Summary of notations

Symbol   Meaning
퐴        local input array
푛        total number of elements
푡        # of PEs
훼        message startup overhead
훽        time to communicate one element
푘        distribution degree in recursive multiway sorting
푟        # of levels in recursive multiway sorting
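To make these output requirements concrete, the following C++ sketch (purely illustrative; the names OutputCheck and checkOutput are not part of our implementations) checks on one PE whether its local output is sorted, respects the global order relative to the preceding PE, and stays within the imbalance factor 1 + 휖.

  // Illustrative sketch: per-PE check of the output requirements defined above.
  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct OutputCheck {
      bool locally_sorted;   // elements on this PE are sorted
      bool globally_sorted;  // no element here is smaller than the predecessor's maximum
      bool balanced;         // at most (1 + eps) * n / t elements on this PE
  };

  OutputCheck checkOutput(const std::vector<uint64_t>& local, uint64_t prevMax,
                          uint64_t n, uint64_t t, double eps) {
      OutputCheck c;
      c.locally_sorted  = std::is_sorted(local.begin(), local.end());
      c.globally_sorted = local.empty() || local.front() >= prevMax;
      c.balanced        = local.size() <= (1.0 + eps) * double(n) / double(t);
      return c;
  }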


5.1.1 Communication Primitives

In high-performance computing, algorithms usually use the so-called message passing interface (MPI) [For12] for communication on distributed systems. In MPI, a communicator connects a group of PEs. An initial communicator contains all 푡 PEs. MPI provides operations to create sub-communicators from parent communicators. MPI only allows communication between PEs of the same communicator. Point-to-point message exchanges are used for fine-grained communication. MPI offers various send and receive operations for message exchanges, e.g., blocking operations, which return after the operation has been completed, and nonblocking operations, which return after the operation has been posted and require a test operation to determine completion. Point-to-point messages can be used to implement complex communication patterns. MPI defines important distributed operations that are frequently used by distributed algorithms. These so-called collective operations involve communication between all PEs of a communicator. In the following, we describe the collective operations used in this work.

Broadcast. One PE provides a vector of 푛 elements and the collective operation stores the elements on each PE.

Gather. Each PE provides a vector of 푛/푡 elements and one PE returns the concatenation of these vectors.

All-Gather. Each PE provides a vector of 푛/푡 elements and the collective operation stores the concatenation of the elements on all PEs.

Reduce. Each PE provides a vector of 푛 elements and one PE returns the partial reduction of the vectors.

All-Reduce. Each PE provides a vector of 푛 elements and all PEs return the partial reduction of the vectors.

Prefix sum. Each PE provides a vector of 푛 elements and PE 푖 returns a vector of 푛 elements containing the partial reduction of the vectors provided by PEs 0 to 푖.

All-to-all. A data exchange in which each PE sends a distinct vector of 푛 elements to every other PE.

In the single-ported message passing model, broadcast, (all-)gather, (all-)reduce, and prefix sum can be implemented to run in time O(훼 log 푡 + 훽푛) [Bat68; SST09] for vectors of size 푛 on 푡 PEs. Many situations require collective operations with varying amounts of data. For example, an irregular gather operation receives different amounts of data from each PE. Another example is the irregular all-to-all operation that exchanges messages of different sizes between pairs of PEs. In MPI, Ω(푡) time is a lower bound for these operations: For the irregular gather, the root process has to provide the input size of each PE. Still, irregular gather can be implemented in time O(훼 log 푡 + 훽푛) [Trä18]. For the irregular all-to-all, each PE must specify the number of elements the PE receives from each potential communication partner and sends to it. In Section 5.2, we discuss the cost of irregular all-to-all operations in different cost models.
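The following C++/MPI sketch (illustrative only; the vector length and datatype are arbitrary) shows how the collective operations listed above map to standard MPI calls.

  // Illustrative mapping of the collective operations to standard MPI calls.
  #include <mpi.h>
  #include <vector>

  void collectivesDemo(MPI_Comm comm) {
      int rank, t;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &t);

      std::vector<int> v(4, rank);              // local vector
      std::vector<int> all(4 * t);              // buffer for gathered vectors
      std::vector<int> red(4), pre(4);

      MPI_Bcast(v.data(), 4, MPI_INT, 0, comm);                            // broadcast from PE 0
      MPI_Gather(v.data(), 4, MPI_INT, all.data(), 4, MPI_INT, 0, comm);   // gather at PE 0
      MPI_Allgather(v.data(), 4, MPI_INT, all.data(), 4, MPI_INT, comm);   // all-gather
      MPI_Reduce(v.data(), red.data(), 4, MPI_INT, MPI_SUM, 0, comm);      // reduce at PE 0
      MPI_Allreduce(v.data(), red.data(), 4, MPI_INT, MPI_SUM, comm);      // all-reduce
      MPI_Scan(v.data(), pre.data(), 4, MPI_INT, MPI_SUM, comm);           // (inclusive) prefix sum

      std::vector<int> sendbuf(4 * t, rank), recvbuf(4 * t);
      MPI_Alltoall(sendbuf.data(), 4, MPI_INT, recvbuf.data(), 4, MPI_INT, comm); // all-to-all
  }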


Algorithm 5 Hypercube algorithm design pattern
Input: 푎 input vector, 푡 number of PEs, 푖 PE number
  푑 ← log_2 푡
  localComputation(푎)
  for 푗 ← 푑 − 1 down to 0 do              ▷ or [0 .. 푑)
    푐 ← 푖 xor 2^푗                         ▷ Calculate communication partner
    푚 ← exchangeDataWith(푐)
    localComputation(푎, 푚)
(Accompanying illustration: a 3-dimensional hypercube.)
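For illustration, the following C++/MPI sketch (not taken from our implementations; it assumes that 푡 is a power of two) instantiates this template as a hypercube all-reduce with addition as the local computation; further instantiations are discussed below.

  // Illustrative instantiation of the hypercube design pattern: all-reduce (sum).
  #include <mpi.h>

  long hypercubeAllReduceSum(long local, MPI_Comm comm) {
      int i, t;
      MPI_Comm_rank(comm, &i);
      MPI_Comm_size(comm, &t);
      for (int j = 0; (1 << j) < t; ++j) {            // iterate over the dimensions
          int c = i ^ (1 << j);                       // communication partner
          long m;
          MPI_Sendrecv(&local, 1, MPI_LONG, c, 0,
                       &m,     1, MPI_LONG, c, 0, comm, MPI_STATUS_IGNORE);
          local += m;                                 // "localComputation(a, m)"
      }
      return local;                                   // every PE holds the global sum
  }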

In practice, it also turns out that available sub-communicator creation routines are relatively slow. In particular, the routines of the most recent open-source implementations Open MPI 3.1 and MPICH 3.4 take time Ω(푡). Our experiments in Appendix B.2 indicate that this is also a lower bound for the Intel MPI library 2020.

Hypercube Algorithms. A hypercube network of dimension 푑 consists of 푡 = 2^푑 PEs numbered [0 .. 푡). Two nodes 푥 and 푦 are connected along dimension 푖 if 푥 = 푦 ⊕ 2^푖. For this work, hypercubes are not primarily important as an actual network architecture. Rather, we extensively use communication in a conceptual hypercube as a design pattern for algorithms. To describe and understand hypercube algorithms, we need the concept of a subcube. A 푗-dimensional subcube consists of those PEs whose numbers have the same bits [푗 .. 푑) in their binary representation. More specifically, the hypercube Algorithm 5 iterates through the dimensions of the hypercube. In iteration 푗 ∈ [0 .. 푑), each PE 푖 is part of a (푑 − 푗)-dimensional hypercube that is split into two subcubes of dimension 푑 − 푗 − 1. Assume that PE 푖 is the 푘-th PE of one of these subcubes. Then, PE 푖 communicates in iteration 푗 with the 푘-th PE of the other subcube. Depending on how this algorithm template is instantiated, one achieves a large spectrum of global effects [vdGei91; Kru92; HHL88]. We now give an overview of common instantiations—a full-length description is given by Sanders et al. [San+19, Chpt. 13]. For example, by repeatedly summing an initial local vector containing 푛 elements, one gets an all-reduce in time O((훼 + 훽푛) log 푡). Similarly, if we replace addition by concatenation, we perform an all-gather operation, which runs in time O(훽푡푛 + 훼 log 푡). If the vectors are sorted sequences and we replace concatenation by merging, all PEs get the elements of all the local vectors in sorted order using time O(훽푡푛 + 훼 log 푡). We call this operation allgather-merge.

We can also use a hypercube algorithm for routing data—each PE sends a distinct vector of size 푛 to every other PE. In iteration 푗, data objects currently located on PE 푖 and destined for PE 푖′ are moved if 푖′ and 푖 differ in bit 푗 (e.g., [Lei92]). This has the advantage that we need only O(log 푡) startup overheads overall. This algorithm communicates O(푡푛 log 푡) data objects on the critical path. If every PE sends only a single object, it is known for random destination nodes that the running time remains O(log 푡) [Lei92]. Unfortunately, for worst case inputs, even if every PE sends and receives only a single object, the routing leads to severe intermediate data imbalances. A worst case input is the bit-reversal permutation proposed by Thomson Leighton [Lei92, Sec. 3.4.2]. This permutation stores an element destined for PE 푗 with bit representation [푏_{log 푡} .. 푏_0] on PE 푖 with bit representation [푏_0 .. 푏_{log 푡}]. Thomson Leighton shows that after ⌈(log 푡)/2⌉ iterations of the hypercube routing algorithm, 푡^{1/2} PEs hold 푡^{1/2} elements each.


Algorithm 6 Binomial-tree algorithm design pattern
Input: 푎 input vector, 푡 number of PEs, 푖 PE number
  푎 ← localComputation(푎)
  푧 ← trailingZeros(푖)                    ▷ Number of 0-bits until first 1-bit occurs
  푑 ← min(⌈log 푡⌉, 푧)
  for 푗 ← 0 up to 푑 − 1 do
    푐 ← 푖 + 2^푗                           ▷ Calculate communication partner
    if 푐 ≥ 푡 then break
    푚′ ← receiveMessageFrom(푐)
    푎 ← localComputation(푎, 푚′)
  if 푖 > 0 then sendSomeMessageTo(푖 − 2^푑)
  else return 푎
(Accompanying illustration: a binomial tree of order two with PEs 0–3.)

The required time to route the bit-reversal permutation is Θ(훼 log 푡 + 훽푡^{1/2}) and, generalized to 푛 local input elements, we get Θ(훼 log 푡 + 훽푛푡^{1/2}) time. A 푘-way hypercube data exchange [Bok91] trades off 푘 log_푘 푡 startup overheads for a bandwidth of O(훽푡푛 log_푘 푡). The idea behind this approach is to split the hypercube into 푘 subcubes on each recursion level. Thus, for 푘 = 2^푖, one iteration of the 푘-way exchange performs 푖 iterations of the ordinary hypercube algorithm. One consequence is that the 푘-way exchange can also lead to a load imbalance of Θ(푡^{1/2}) for worst case inputs.

Binomial-Tree Algorithms. We construct a binomial tree 푇_푘 of order 푘 as follows: Tree 푇_0 contains one node. Tree 푇_푘 contains a root node and 푘 binomial subtrees of order [0 .. 푘). We assign PEs to the tree nodes according to an in-order traversal that recurses on subtrees with lower order first. Algorithm 6 propagates information from each node to the root node. The algorithm has O(log 푡) message startups since messages from subtrees with small order are received first and since the height of a tree 푇_푘 is 푘. Depending on how this algorithm template is instantiated, one achieves a large spectrum of global effects. For example, by repeatedly concatenating an initial local vector 푎, one gets the collective operation gather in time O(훽푡|푎| + 훼 log 푡). If the 푎’s are sorted sequences and we replace concatenation by merging, the root PE gets the elements of all the local 푎’s in sorted order using time O(훽푡|푎| + 훼 log 푡). We call this operation gather-merge.
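The following C++/MPI sketch (illustrative only, not our implementation) instantiates Algorithm 6 as a binomial-tree reduction to PE 0, with addition playing the role of localComputation; gather and gather-merge follow the same communication structure.

  // Illustrative instantiation of the binomial-tree design pattern: reduce (sum) to PE 0.
  #include <mpi.h>

  long binomialTreeReduceSum(long local, MPI_Comm comm) {
      int i, t;
      MPI_Comm_rank(comm, &i);
      MPI_Comm_size(comm, &t);
      int logt = 0;
      while ((1 << logt) < t) ++logt;              // ceil(log2 t)
      int z = 0;                                   // trailing 0-bits of i (only needed for i > 0)
      while (i != 0 && ((i >> z) & 1) == 0) ++z;
      int d = (i == 0) ? logt : (z < logt ? z : logt);
      for (int j = 0; j < d; ++j) {
          int c = i + (1 << j);                    // child in the binomial tree
          if (c >= t) break;
          long m;
          MPI_Recv(&m, 1, MPI_LONG, c, 0, comm, MPI_STATUS_IGNORE);
          local += m;                              // combine with the child's contribution
      }
      if (i > 0) MPI_Send(&local, 1, MPI_LONG, i - (1 << d), 0, comm);  // send to the parent
      return local;                                // only meaningful on PE 0
  }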

5.1.2 Multiway Merging and Partitioning

Sequential multiway merging of 푘 sequences with total length 푛 can be implemented in time O(푛 log 푘). An efficient practical implementation may use tournament trees [Knu73; San00; SSP07]. If 푘 is small enough, this is even cache efficient, i.e., it incurs only O(푛/퐵) cache faults where 퐵 is the cache block size. If 푘 is too large, i.e., 푘 > 푀/퐵 for cache size 푀, a multi-pass merging algorithm may be advantageous. The dual operation for samplesort is partitioning the data according to 푘 − 1 splitters. This can be implemented with the same number of comparisons and similarly cache efficiently as 푘-way

merging. Multiway partitioning has the additional advantage that it can be implemented without causing branch mispredictions [SW04]. For more information on multiway partitioning, we refer the reader to Part I of this work.
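As an illustration of the multiway merging described above, the following C++ sketch merges 푘 sorted runs in O(푛 log 푘) time. For brevity it uses a binary heap; the tournament (loser) trees cited above follow the same idea while avoiding some of the heap's overhead.

  // Illustrative k-way merge of sorted runs using a min-heap.
  #include <functional>
  #include <queue>
  #include <utility>
  #include <vector>

  std::vector<int> multiwayMerge(const std::vector<std::vector<int>>& runs) {
      using Item = std::pair<int, size_t>;               // (key, run index)
      std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
      std::vector<size_t> pos(runs.size(), 0);           // next unread position per run
      for (size_t r = 0; r < runs.size(); ++r)
          if (!runs[r].empty()) pq.push({runs[r][0], r});
      std::vector<int> out;
      while (!pq.empty()) {
          auto [key, r] = pq.top();                      // smallest remaining key
          pq.pop();
          out.push_back(key);
          if (++pos[r] < runs[r].size()) pq.push({runs[r][pos[r]], r});
      }
      return out;
  }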

5.1.3 Multisequence Selection

In its simplest form, given sorted sequences 푑_1, …, 푑_푡 (not necessarily of equal length) and a rank 푟, multisequence selection asks for finding an element 푥 with rank 푟 in the union of these sequences. If all elements are different, 푥 also implicitly defines positions in the sequences such that there is a total number of 푟 elements to the left of these positions. There are several algorithms for multisequence selection, e.g., [Var+91; KK93; SSP07; SK10]. Here we use a particularly simple and intuitive method based on an adaptation of the well-known quickselect algorithm [Hoa61; SM08]. This algorithm may be folklore. See also [HS16]. The algorithm has also been on the publicly available slides of Sanders’ lecture on parallel algorithms since 2008 [San08b]. The base case occurs if there is only a single element (and 푟 = 1). Otherwise, a random element is selected as a pivot. This can be done in parallel by choosing the same pseudorandom number between 1 and ∑_푖 |푑_푖| on all PEs. Using a collective prefix sum over the sizes of the sequences, this element can be located easily in time O(훼 log 푡). Where ordinary quickselect has to partition the input doing linear work, we can exploit the sortedness of the sequences to obtain the same information in time O(log 퐷) with 퐷 := max_푖 |푑_푖| by doing binary search in parallel on each PE. If the items are evenly distributed, we have 퐷 = Θ(푛/푡), and thus only O(log(푛/푡)) time for the search that partitions all the sequences into two parts. Deciding whether we have to continue searching in the left or the right parts needs a collective reduce operation taking time O(훼 log 푡). The expected depth of the recursion is O(log ∑_푖 |푑_푖|) = O(log 푛) as in ordinary quickselect. Thus, the overall expected running time is O((훼 log 푡 + log(푛/푡)) log 푛).

In our application, we have to perform 푘 simultaneous executions of multisequence selection on the same input sequences but on 푘 different rank values. The involved collective communication operations will then get a vector of length 푘 as input and their running time using an asymptotically optimal implementation is O(푘훽 + 훼 log 푡) [Bal+95a; SST09]. Hence, the overall expected running time of multisequence selection becomes

O((훼 log 푡 + 푘훽 + 푘 log(푛/푡)) log 푛).    (5.1)
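The following C++ sketch (a sequential simulation with illustrative names) mirrors the quickselect-style multisequence selection described above; the per-sequence binary searches stand in for the parallel searches on the PEs, and the collective operations are omitted.

  // Illustrative sequential simulation of multisequence selection.
  #include <algorithm>
  #include <cstdint>
  #include <random>
  #include <vector>

  // Returns the element of rank r (1-based) in the union of the sorted sequences d.
  int multisequenceSelect(const std::vector<std::vector<int>>& d, uint64_t r, std::mt19937& rng) {
      std::vector<size_t> lo(d.size(), 0), hi;            // active part of each sequence
      for (const auto& s : d) hi.push_back(s.size());
      for (;;) {
          // Pick a pivot uniformly among the active elements (the shared pseudorandom choice).
          uint64_t active = 0;
          for (size_t i = 0; i < d.size(); ++i) active += hi[i] - lo[i];
          uint64_t idx = std::uniform_int_distribution<uint64_t>(0, active - 1)(rng);
          int pivot = 0;
          for (size_t i = 0; i < d.size(); ++i) {
              uint64_t len = hi[i] - lo[i];
              if (idx < len) { pivot = d[i][lo[i] + idx]; break; }
              idx -= len;
          }
          // Binary searches: how many active elements are < pivot and <= pivot?
          uint64_t less = 0, leq = 0;
          std::vector<size_t> cutL(d.size()), cutR(d.size());
          for (size_t i = 0; i < d.size(); ++i) {
              cutL[i] = std::lower_bound(d[i].begin() + lo[i], d[i].begin() + hi[i], pivot) - d[i].begin();
              cutR[i] = std::upper_bound(d[i].begin() + lo[i], d[i].begin() + hi[i], pivot) - d[i].begin();
              less += cutL[i] - lo[i];
              leq  += cutR[i] - lo[i];
          }
          if (r <= less)     hi = cutL;                   // continue in the smaller parts
          else if (r <= leq) return pivot;                // the pivot has the requested rank
          else { r -= leq; lo = cutR; }                   // continue in the larger parts
      }
  }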

5.2 Delivering a Bulk of Data

In this work, we use the term message assignment to describe the problem of moving a bulk of data from sending PEs to receiving PEs. More specifically, the message assignment determines for each pair of PEs ⟨푖, 푗⟩ the message that has to be delivered from PE 푖 to PE 푗. In the most general version, the messages may be of arbitrary size and may even be empty. A data exchange step then executes the message assignment and routes the messages from the senders to the receivers. We first consider a regular data exchange. The message assignment of a regular data exchange transfers the same number 푚 of elements between each pair of PEs. The data exchange on

a hypercube takes time (훼 + 훽푚푡) log 푡. For large 푚, delivering the data directly is cheaper. E.g., the 1-factor algorithm from Sanders and Träff [ST02] exchanges in round 푖 ∈ [0 .. 푡) messages between the PE pairs (푢, (푖 − 푢) mod 푡), 푢 ∈ [0 .. 푡), summing up to time Θ(푚푡훽 + 푡훼). Many bulk synchronous algorithms perform irregular data exchanges. We denote a data exchange as irregular if it performs a message assignment with arbitrary message sizes. The cost of an irregular data exchange depends on the message assignment and on the algorithm that performs the actual data exchange. It is not clear how to implement these irregular data exchanges efficiently on a realistic parallel machine. In particular, the irregular data exchanges of the most recent open-source implementations Open MPI 3.1 and MPICH 3.4 send the data directly using up to 푡 startups. For massively parallel machines, this is not scalable enough. A hypercube data exchange may only be efficient for small inputs as it moves the input log 푡 times and as it may have huge intermediate load imbalances.

Unfortunately, the BSP model only considers the bottleneck communication volume ℎ푔 and “hides” the startup overheads in the term 푙, which is independent of the actual message assignment. Thus, sending 푡 messages of size ℎ/푡 has the same cost as sending 푡^{1/2} messages of size ℎ/푡^{1/2}. The BSP∗ model [BDadH95] takes this into account by imposing a minimal message size. However, it also charges the cost for the maximal message size occurring in a data exchange for all its messages and thus, the BSP∗ model would still be too expensive for our sorting algorithms. We therefore use our own generalization of the BSP model: We consider a black box data exchange function Exch(푡, ℎ, 푘) that tells us how long it takes to perform the data exchange step of a message assignment in a compact subnetwork of 푡 PEs, with no PE receiving and sending more than ℎ words, and 푘 messages. Note that all three parameters of the function Exch(푡, ℎ, 푘) may be essential, as they model locality of communication (the subnetwork could be a rack of fully connected PEs), bottleneck communication volume (see [Bor13; SM13]), and startups, respectively. Sometimes we also write Exch̃(푡, ℎ, 푘) as a shorthand for (1 + 표(1))Exch(푡, ℎ, 푘) in order to summarize a sum of Exch(⋅) terms by the dominant one. We will also use that to absorb terms of the form O(훼 log 푡), O(훽푘), and Exch(푡, 푘, 푘). When comparing algorithms in different models, we can exploit that Exch(푡, ℎ, 푘) ≤ BSP(ℎ) and BSP(ℎ) ≥ Exch(푡, ℎ, 푡). In this work, we present a message assignment algorithm for multi-level sorting algorithms with 푘 ≪ 푡. In contrast to our model, the BSP model would treat these assignments the same. Compared to the single-ported message passing model we get Exch(푡, ℎ, 푘) ≥ ℎ훽 + 푘훼 when data is delivered directly. There are reasons to believe that a data exchange algorithm comes close to this but we are not aware of actual matching upper bounds. There are offline scheduling algorithms that can deliver the data using time ℎ훽 when startup overheads are ignored (using edge coloring of bipartite multi-graphs). However, this chops messages into many blocks and also requires us to run a parallel edge-coloring algorithm.
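The following C++/MPI sketch (illustrative only) implements the communication schedule of the 1-factor algorithm as described above: in round 푖, PE 푢 exchanges a block of 푚 elements with PE (푖 − 푢) mod 푡, and a PE that is paired with itself handles its own block.

  // Illustrative regular data exchange with the 1-factor schedule.
  // send holds t blocks of m chars (one block per destination PE).
  #include <algorithm>
  #include <mpi.h>
  #include <vector>

  void oneFactorExchange(const std::vector<char>& send, std::vector<char>& recv,
                         int m, MPI_Comm comm) {
      int u, t;
      MPI_Comm_rank(comm, &u);
      MPI_Comm_size(comm, &t);
      recv.assign(static_cast<size_t>(m) * t, 0);
      for (int i = 0; i < t; ++i) {                       // t rounds
          int partner = ((i - u) % t + t) % t;            // (i - u) mod t, kept non-negative
          if (partner == u) {                             // self-paired round: copy own block
              std::copy(send.begin() + size_t(m) * u, send.begin() + size_t(m) * (u + 1),
                        recv.begin() + size_t(m) * u);
              continue;
          }
          MPI_Sendrecv(send.data() + size_t(m) * partner, m, MPI_CHAR, partner, 0,
                       recv.data() + size_t(m) * partner, m, MPI_CHAR, partner, 0,
                       comm, MPI_STATUS_IGNORE);
      }
  }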

5.3 Sorting Algorithms from Tiny to Huge Inputs

We now outline a spectrum of parallel sorting algorithms (old and new), analyze them, and use the results to compare their performance for different values of 푛 and 푡. Table 5.2 summarizes the results, roughly going from algorithms for small inputs to ones for larger and larger inputs. We only give the cost terms with respect to latency (훼) and communication volume (훽) per PE because the relevant local work of the algorithms is O((푛/푡) log 푛) or dominated by the communication volume. We now compare the algorithms with regard to running time and robustness.

Table 5.2: Complexity of distributed sorting algorithms. For each algorithm, the table gives the latency term (number of 훼 startups), the communication volume term (factor of 훽), and remarks (e.g., whether 푡 must be a power of two or may be arbitrary, and whether the bound is deterministic, average case, or worst case). The symbol “∗” indicates that the costs for splitter selection and sample selection of recursive multiway algorithms take more than O(훽푘) time per recursion level. Unless stated otherwise, we assume deterministic or high probability running times. The listed algorithms are: (all)gather-merge-sort (see Sect. 5.3.1), Fast Work-Inefficient Sort [Axt+15a; AS17] (see Sect. 6.1), bitonic sort [Joh84], Minisort [SW11], hypercube quicksort with a single implicit sample element [Wag87], with the median of medians [LM92], and with a random sample [SMB13], robust hypercube quicksort (see Sect. 6.3), recursive BSP sorting [GV94; Goo99], recursive histogram sort [KK93], HykSort [SMB13], recursive HSS [HKS19], RLM-sort [Axt+15a] (see Sect. 6.5), AMS-sort [Axt+15a; AS17] (see Sect. 6.6), samplesort [Ble+96], and multiway mergesort [Var+91; SK10].



5.3.1 (All)gather-Merge-Sort

Gathering-based sorting algorithms do not fulfill our requirement of sorted output (i.e., O(푛/푡) local output elements), but may be of general interest. We assume the algorithms described here to be folklore. When we sort the input locally and then perform a binomial-tree gather-merge (see Section 5.1.1), the root PE stores the global input in sorted order. We denote this operation gather-merge-sort. When we sort the input of a hypercube locally and then perform the collective operation allgather-merge (see Section 5.1.1), the global input is stored in sorted order on each PE. We denote this operation as hypercube allgather-merge-sort. We also get an allgather-merge-sort operation when we combine the binomial-tree gather-merge-sort with a broadcast operation. In any case, the gathering-based sorting algorithms require time O(훼 log 푡 + 훽푛 + (푛/푡) log(푛/푡)) when 푛 ∈ Ω(푡). The running time also applies for sparse inputs when we use the binomial-tree gather-merge. When 푛 < 푡, the input is stored on the first 푛 PEs and the root PE has received the input of PEs [1 .. 2^푖) after 푖 recursion steps. Hypercube allgather-merge-sort requires a more complex placement of the input.

5.3.2 Fast Work-Inefficient Ranking

A simple but fast ranking algorithm for small inputs is Fast Work-Inefficient Ranking (FIR). The algorithm has been on the publicly available slides of Sanders’ lecture on parallel algorithms since 2008 [San08a] (also see [Ble+96]). In its simplest form, FIR may already have been considered folklore before. It arranges 푛^2 PEs as a square matrix using PE indices from [1 .. 푛] × [1 .. 푛]. Input element 푖 is assumed to be present at PE (푖, 푖) initially. First, the elements are broadcast along rows and columns. Then, PE (푖, 푗) computes the result of comparing elements 푖 and 푗 (0 or 1). Summing these comparison results over row 푖 yields the rank of element 푖. In total, the algorithm requires time O(훼 log 푡).

The lecture [San08a] also proposes a generalized version of FIR for 푛 ∈ Ω(푡). The PEs are arranged in an array of size O(푡^{1/2}) × O(푡^{1/2}) and each PE has O(푛/푡) elements as input. FIR first sorts the elements locally in time O((푛/푡) log(푛/푡)). Then each PE allgather-merges the elements of PEs in the same row as well as the elements of PEs in the same column. As each column and each row has O(푛/푡^{1/2}) elements evenly distributed to the PEs, the allgather-merges take time O(훼 log 푡 + 훽푛/푡^{1/2}). Afterwards, every PE stores the elements of all PEs in its row and column, respectively, in sorted order. Then the PEs rank each element received from their column in the elements received from their row in time O(푛/푡^{1/2}). Finally, FIR sums up the local ranks in each column with a collective all-reduce operation. The result is that each PE knows the global rank of all input elements in its row. The all-reduce operation with O(푛/푡^{1/2}) ranks per column PE again takes O(훼 log 푡 + 훽푛/푡^{1/2}) time. In total, the algorithm requires time O(훼 log 푡 + 훽푛/푡^{1/2} + (푛/푡) log(푛/푡)) for 푛 ∈ Ω(푡). This bound does in general not apply for 푛 ∈ o(푡)—depending on the distribution of input elements to PEs, the allgather-merge operations can take more than O(훼 log 푡 + 훽푛/푡^{1/2}) time. We did not come up with a simple

and practical distribution pattern that guarantees the running time we derived for 푛 ∈ Ω(푡) also for the case 푛 ∈ o(푡). In theory, FIR is a very good algorithm for 푛 = 푡^{1/2} since it only takes logarithmic delay. However, since all other sorting algorithms need Ω(log^2 푡) startup overheads, the generalized version of FIR is even interesting for 푛 ∈ O((훼/훽)푡^{1/2} log^2 푡). In Section 6.1 we refine FIR to run in time O(훼 log 푡 + 훽푛/푡^{1/2} + (푛/푡) log(푛/푡)) for any input size. Also, our version calculates unique ranks [0 .. 푛) without any additional element or lexicographic comparison—FIR computes the same rank for identical elements. We also show how to convert its output to a classical sorted permutation of the input with the same asymptotic guarantees.

5.3.3 Bitonic Sort

Bitonic sort [Bat68; Joh84] first sorts locally and then performs O(log^2 푡) pairwise exchange and merge operations. While the original bitonic sort only works when the number of PEs is a power of two, simple adaptations, e.g., [Lan06], allow an arbitrary number of PEs with the same asymptotic running time guarantees. For small inputs, this running time is dominated by log^2 푡 startup overheads. This gives it a similar running time as the parallel quicksort algorithms to be discussed next. However, for 푛 = 휔(푡훼/훽) the term 훽(푛/푡) log^2 푡 dominates—all the data is exchanged log^2 푡 times. This makes it unattractive for large inputs compared to quicksort and other algorithms designed for these input sizes. Indeed, only for 푛 superpolynomial in 푡 (푛 = Ω(푡^{log 푡})) does bitonic sort eventually become efficient. But for such large inputs, algorithms like samplesort are much better since they exchange the data only once.

5.3.4 Parallel Quicksort

Since quicksort is one of the most popular sorting algorithms, it is not surprising that there are also many parallel variants. For distributed memory, a variant using the hypercube communication pattern is attractive since it is simple and can exploit locality in the network. A recursive subproblem is solved by a subcube of the hypercube. The splitter is broadcast to all PEs. Elements smaller than the pivot are moved to the 0-subcube, i.e., to the PEs whose least significant distinguishing bit is 0, and elements larger than the pivot are moved to the 1-subcube, i.e., to the PEs whose least significant distinguishing bit is 1. Since this hypercube quicksort obliviously splits the PEs in half, the imbalance accumulating over all levels of recursion is a crucial problem. For worst case inputs, the intermediate load increases to 푛/푡^{1/2}—even in the case of perfect splitters. For a detailed discussion, we refer to Section 5.1.1.

In its simplest original form by Wagar [Wag87], PE 0 uses its local median as a pivot. This is certainly not robust against skew, and even for average case inputs it only works for rather large inputs. One can make this more robust—even in a deterministic sense—by using the global median of the local medians [LM92]. However, this introduces an additional 훽푡 term into the communication complexity and thus defeats the objective of having polylogarithmic execution time. The most recent implementation of hypercube quicksort that we have found has been proposed by Sundar et al. [SMB13]. They pick a random sample and use its median as a splitter. However, their approach is slow in practice even for random input: They pick a relatively large sample, the

sample is gathered in time Θ(훼 + 훽푡), and they use the operation MPI_Comm_Split to create sub-communicators whose current implementations need time Ω(훽푡). The implementation by Sundar et al. is also not robust with respect to duplicated keys. Therefore we propose a robust version of hypercube quicksort in Section 6.3. Robust hypercube quicksort turns worst case inputs into average case inputs and uses a median selection that is both fast and accurate. Non-hypercube distributed memory quicksort is also an interesting option. It allows us to adapt the number of PEs working on a recursive subproblem to its problem size. However, it leads to more irregular communication patterns and has its own load balancing problems because we cannot divide PEs fractionally. For a more detailed discussion, we refer to our work on perfectly balanced quicksort [AWS18]. Siebert and Wolf [SW11] exploit the special case 푛 = 푡 where this is no problem.

5.3.5 Multiway Sorting Algorithms

For large inputs, it is a disadvantage that quicksort exchanges the data at least log 푡 times. We can improve on that by partitioning the input with respect to 푘 − 1 pivots at once. This decreases the number of times the data is moved to O(log_푘 푡). However, the price we pay is a latency of Ω(훼푘) for delivering data from 푘 partitions to 푘 different PEs. This gives us a lower bound of

Ω((푛/푡) log 푛 + 훽(푛/푡) log_푘 푡 + 훼푘 log_푘 푡)

for running generalized quicksort. We call these generalizations multiway sorting algorithms. However, getting an algorithm with a matching upper bound is not easy. We have to find pivots that partition the input in a balanced way and we have to execute the more complex data exchange problems efficiently.

The Single-Level Case. Many multiway algorithms have been devised for the special case 푘 = 푡. We denote these algorithms as single-level algorithms. Single-level samplesort [Ble+96] is perhaps the most well known because it is simple and effective. It achieves the above bound if a sample of size 푆 = Ω(푡 log 푛) is sorted using a parallel algorithm. Then, every 푆/푘-th sample is used as a splitter. The algorithm achieves the desired bound for 푛 = O(푡^2/log 푡). Solomonik and Kalé [SK10] describe single-level histogram sort, which scales to 2^15 PEs for large inputs. Single-level histogram sort uses an iterative sampling-based algorithm for finding high-quality approximate splitters. Histogramming is a compromise between the single-shot algorithms used in samplesort and the exact (but slower) algorithms used for mergesort. Thus, histogram sort can be viewed as a hybrid between multiway mergesort and samplesort. It can also be seen as a generalization of multisequence selection. Single-level histogram sort overlaps the data exchange and the merging of the received data to minimize the running time. Solomonik and Kalé do not provide performance bounds on histogramming for the approximate case. When we transfer the bounds of splitter selection with perfect histogramming [KK93]—no imbalance is allowed—to single-level histogram sort, it achieves the desired bound for 푛 = O(푡^2 log 푡). Indeed, several multiway mergesort algorithms [Var+91; SSP07] that achieve perfect partitioning by finding optimal splitters have been proposed for the single-level case. This again requires 푛 = O(푡^2 log 푡) for achieving the desired bound.

96 5.3 Sorting Algorithms from Tiny to Huge Inputs

Allowing Multiple Levels. So-called recursive multiway algorithms sort the input recursively on 푟 recursion levels while performing a 푘-way data exchange on each level. Recursive multiway algorithms pursue two different approaches for the data transfer. One approach uses a multiway hypercube data exchange that routes the elements deterministically to the target PEs. The other approach is more general and works for arbitrary 푡. On each recursion level, this approach calculates a message assignment depending on the local bucket sizes. A bulk data transfer then performs an irregular data exchange with point-to-point messages. Kalé and Krishnan [KK93] propose recursive multiway histogram sort with perfect median selection and a bulk data transfer with direct message exchanges on each recursion level. This algorithm was implemented in the CHARM parallel programming system and does not run on today’s supercomputers. Gerbessiotis and Valiant [GV94] develop recursive multiway samplesort. Goodrich [Goo99] gives communication-efficient sorting algorithms based on recursive multiway merging. Both algorithms are developed for the BSP model and are not implemented in practice. However, these algorithms need a significant constant factor more communication per element than our algorithms. Moreover, bulk data transfers in the BSP model allow arbitrarily fine-grained communication at no additional cost. In particular, an implementation of the global data exchange primitive of BSP that delivers messages directly has a bottleneck of 푡 message startups for every global message exchange. Also, see Section 6.4 for a discussion on why it is not trivial to adapt the BSP algorithms to a more realistic model of computation—it turns out that for worst case inputs, one PE receives data from 푡 PEs. We have not found a recursive multiway sorting algorithm for arbitrary 푡 that is not affected by this problem.

While the latter sorting algorithms work for arbitrary 푡, we now consider (recursive) multiway hypercube sorting algorithms, which require that 푡 = 푘^푟 where 푟 is the number of levels. Sundar et al. [SMB13] develop and implement the multiway hypercube algorithm HykSort, a generalization of hypercube quicksort. On each recursion level, HykSort selects 푘 splitters with the approximate histogramming algorithm proposed for single-level sorting by Solomonik and Kalé [SK10] and redistributes the data to 푘 subcubes. Similar to this single-level algorithm, HykSort overlaps the data exchange and the merging of the received data to minimize the running time. The authors of HykSort also do not prove performance bounds on the approximate histogramming algorithm. However, when we transfer the bounds of splitter selection with perfect histogramming [KK93] to HykSort, it achieves the desired bound for 푛 = O(푡^{1+1/푘} log 푡) on uniformly distributed random keys. Unfortunately, HykSort and similar multiway hypercube sorting algorithms can lead to severe data imbalances in the worst case (see Section 5.1.1). The publication by Sundar et al. [SMB13] mentions measures for solving this problem, but does not analyze, implement, or evaluate them. Furthermore, HykSort uses the operation MPI_Comm_Split to create sub-communicators whose current implementations need time Ω(훽푡). HykSort is also not robust with respect to duplicated keys. In 2019, Harsh et al. [HKS19] refined the approximate splitter selection algorithm of histogram sort and proposed histogram sort with sampling (HSS).
The authors propose two versions of HSS—single-level HSS and multiway hypercube HSS. When we consider the allowed imbalance of the sorted output as a constant, HSS has an isoefficiency function of O(푡^{1+1/푘} log 푡)—the same isoefficiency as our recursive multiway samplesort algorithm AMS-sort proposed in 2015 (see Section 6.6 for a description of AMS-sort). HSS is the best

scaling single-level competitor we have seen so far and scales to 2^14 PEs for relatively large inputs. Hypercube-based HSS uses HykSort as an algorithmic framework and replaces the splitter selection routine of HykSort by histogramming with sampling. Hypercube-based HSS, however, faces the same problems as HykSort.

Tie-Breaking. Many multiway sorting algorithms are not robust against the case of many equal elements [SK10; KK93; SMB13]. In regard to recursive multiway sorting, we have not found any algorithm that handles duplicates. A naive approach to generating unique keys appends unique identifiers to the elements [GV94; Goo99; Ble+96]. Then, lexicographic ordering makes the keys unique. However, this approach makes sorting considerably more expensive due to higher communication volume and more expensive comparisons. In Part I, we propose a low-overhead measure that arranges elements equal to splitters into equality buckets.

Related Multiway Shared-Memory Approaches. Li and Sevcik [LS94] propose parallel sorting by overpartitioning for shared-memory multiprocessors. Their overpartitioning approach picks many more pivots than available PEs, and the buckets are sorted in order of decreasing bucket sizes. However, their approach is not scalable enough for the largest machines since sample sorting is centralized and since a master-worker load balancer heuristically distributes buckets among all PEs. As a result, the output is not globally sorted and an additional data exchange step would be required to fulfill our output requirements. The overpartitioning algorithm from Section 6.6, which we propose for AMS-sort, is fully parallelized and optimally assigns consecutive ranges of buckets to consecutive PE-groups. For more information on sequential and shared-memory sorting, we refer to Chapter 2. Some ideas of shared-memory algorithms, e.g., recursive multiway sorting [CR10; BGS10] for cache-oblivious shared memory models, may be interesting starting points for sorting on distributed machines. However, translating them to a distributed memory model seems nontrivial and is likely to result in worse bounds.

5.4 More Related Work

This thesis was highly influenced by two landmark papers. 25 years ago, Blelloch et al. [Ble+96] carefully studied a large number of algorithms, selected three of them (bitonic sort, radix sort, and samplesort), and compared them experimentally on up to 2^15 PEs. Most other publications are much more focused on one or two (more or less) new algorithms, sometimes missing the big picture. We felt that it was time again for a wider view, in particular because it became evident to us that a single algorithm is not enough for the largest machines and varying 푛/푡. Helman et al. [HBJ98]—concentrating on samplesort—took robustness seriously and made systematic experiments with a wide variety of input distributions, which form the basis for our experiments. They also propose initial random shuffling to make samplesort robust against skewed input distributions (this was already proposed in [SH97]). There has been a lot of work on making sorting fast on small systems exploiting SIMD, GPU, and shared-memory parallelism. We view these results as largely orthogonal to ours. However, it should be noted that large performance gains due to exploiting low-level hardware properties are mostly observed when sorting unrealistically small objects like 32-bit numbers.


Also, since the communication inevitably becomes the bottleneck for very large machines, other optimizations become less and less important.


Chapter 6 Robust Scalable Distributed Sorting Algorithms

We propose four robust sorting algorithms for different input sizes which can be used to span the entire parameter space. The first algorithm sorts while the data is routed to a single PE. This algorithm, gather-merge-sort, is already described in Section 5.3.1 since we assume the algorithm to be folklore. The second algorithm is a simple yet fast work-inefficient algorithm with logarithmic latency (RFIS). The third algorithm is an efficient variant of parallel quicksort with latency O(log^2 푡) (RQuick). The fourth algorithm is based on recursive multiway samplesort and has latency O(푟푡^{1/푟}) when we allow data to be moved 푟 times (AMS-sort). The output of the latter algorithm has an imbalance of 1 + 휖. For the case of perfectly balanced output, we additionally propose recursive multiway mergesort (RLM-sort). For both multiway algorithms, we propose a new message assignment algorithm (DMA) that bounds the number of incoming and outgoing messages of the data exchanges to O(푡^{1/푟}).

6.1 Robust Fast Work-Inefficient Ranking and Sorting (RFIS)

We propose Robust Fast Work-Inefficient Sorting (RFIS) for sparse and very small local input sizes. We first perform our Robust Fast Work-Inefficient Ranking (RFIR) algorithm that computes the rank of each element in the global data. Then, optionally, a data exchange routine sends the elements to the appropriate PEs according to their ranks. RFIR refines the idea of FIR (see Section 5.3.2). In contrast to FIR, RFIR works for any input size and generates unique ranks—even in the case of duplicate elements. This is crucial as the data exchange routine only works without significant load imbalances if the ranks are unique. Our generalization works for any number of elements 푛 and a quadratic 푎 × 푎 grid of PEs. Initially, there are 푛 elements uniformly distributed over the PEs. More precisely, each PE provides O(푛/푡) input elements if 푛 ≥ 푡. Otherwise, the elements are evenly distributed to the first 푛 PEs. Figure 6.1 provides pseudocode. The algorithm performs six steps.

(i) On each PE-column, we perform a collective allgather-merge-sort operation. This operation takes time O(훼 log 푡 + 훽푛/푡^{1/2} + (푛/푡) log(푛/푡)), even for sparse inputs (recall Section 5.3.1). Now, each PE stores the elements of its column in sorted order.

(ii) PE (푟, 푐) (the PE in row 푟 and column 푐) exchanges its sorted column elements with PE (푐, 푟). As a result, PE (푟, 푐) stores two sequences—the sorted column elements of column 푟 and of column 푐, respectively. We charge O(훼 + 훽푛/푡^{1/2}) time.

(iii) PE (푟, 푐) computes the ranks of the sorted elements 퐶_푐 from column 푐 in the sorted elements 퐶_푟 from column 푟.


Figure 6.1: Robust fast work-inefficient ranking on a 3 × 3 PE-grid with five elements. (The figure illustrates the column-wise allgather-merge-sort, the transpose, the local ranking, and the rank reduction that yields the global ranks.)

More precisely, for each element 푒 in 퐶_푐, we count the number of smaller elements in 퐶_푟. This can be done in time O(1 + 푛/푡^{1/2}) by merging these two sequences. For tie-breaking, we consider a case distinction with respect to 푟 and 푐. If 푐 < 푟, we consider an element 푒_푟 ∈ 퐶_푟 to be smaller than 푒 if 푒_푟 < 푒. If 푐 > 푟, we consider an element 푒_푟 ∈ 퐶_푟 to be smaller than 푒 if 푒_푟 ≤ 푒. If 푐 = 푟, we set the rank of the 푖-th element to 푖. This schema applies tie-breaking by implicitly handling the 푚-th smallest element of column 푐, i.e., the element 퐶_푐[푚], as the triple ⟨퐶_푐[푚], 푐, 푚⟩.

(iv) Collectively reducing the “local ranks” of column 푐 at PE (0, 푐) with the operation “addition” yields the global ranks of the elements in 퐶_푐. We charge O(훼 log 푡 + 훽푛/푡^{1/2}) time.

Roughly speaking, RFIS performs an allgather-merge-sort operation to obtain the sorted column input on each PE (i). The result of a transpose operation is that PE (푟, 푐) stores the input of columns 푟 and 푐 (ii). Step (iii) then ranks the elements of column 푐 into the elements of column 푟 using a simple tie-breaking approach, and step (iv) calculates global ranks. In contrast, the initial version of FIR [AS17] (also see Section 5.3.2) performs an allgather-merge-sort operation on each PE-row and each PE-column. Compared to RFIS, the disadvantage of this approach is that these operations require complicated distributions of input elements to PEs in order to become efficient.
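A C++ sketch of the merge-based counting in step (iii), including the tie-breaking rule, follows (illustrative only; 푟 and 푐 are the row and column indices of the PE).

  // Illustrative sketch of step (iii): rank the sorted column Cc within the sorted
  // column Cr by one merge-like pass, using the tie-breaking rule described above.
  #include <cstddef>
  #include <vector>

  std::vector<size_t> localRanks(const std::vector<int>& Cc, const std::vector<int>& Cr,
                                 int r, int c) {
      std::vector<size_t> rank(Cc.size());
      if (c == r) {                                  // diagonal PE: rank of the i-th element is i
          for (size_t i = 0; i < Cc.size(); ++i) rank[i] = i;
          return rank;
      }
      size_t j = 0;
      for (size_t i = 0; i < Cc.size(); ++i) {
          // Count elements of Cr that are "smaller" than Cc[i]:
          // strictly smaller if c < r, smaller or equal if c > r.
          while (j < Cr.size() && (c < r ? Cr[j] < Cc[i] : Cr[j] <= Cc[i])) ++j;
          rank[i] = j;
      }
      return rank;
  }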


If desired, the global ranks can then be used for routing the elements in such a way that a globally sorted output is achieved. I.e., the element with rank 푖 will be routed to PE ⌊푖푡/푛⌋. We first broadcast the global ranks of 퐶_푐 from PE (0, 푐) to all PEs of column 푐. If 푛 ≤ 푡, we send the elements directly to the target PEs. This takes time O(훼) since each PE sends and receives a constant number of elements—each PE of column 푐 must send only a constant fraction of 퐶_푐. If 푛 > 푡, the data exchange problem is solved locally in each row. We now exploit the fact that each PE-row stores the complete ranked input. We thus can afford to discard all elements that are not mapped to the same row without losing any element. For routing the elements to their correct column, we can use the hypercube algorithm from Section 5.1.1 if the grid length 푎 is a power of two. While routing the elements, we maintain their local order. The total number of elements in the row is 푛/푎. After log 푎 message exchange steps of the hypercube algorithm, each PE of the row will have 푛/푎^2 elements. In the 푖-th iteration of this algorithm, 푛/(푎2^{푖+1}) elements have to be delivered to each of the two subcubes—the subcubes have size 푎/2^{푖+1}. Even if all these elements are concentrated on the same PE, this requires at most 훼 + 훽푛/(푡^{1/2}2^{푖+1}) time. Summing this over all iterations yields overall time O(훼 log 푡 + 훽푛/푡^{1/2}). If the grid length is not a power of two, we can use the index algorithm by Bruck et al. [Bru+97] for the data exchange. The index algorithm works similarly to the hypercube algorithm. The main difference is that the PE-group in the 푖-th iteration has up to ⌈푎/2^푖⌉ (non-adjacent) PEs and ⌈log 푎⌉ iterations are performed. With the same argumentation as for hypercube quicksort, we get the same overall time O(훼 log 푡 + 훽푛/푡^{1/2}). Indeed, the asymptotic running time is the same for just ranking or ranking plus data exchange. Either way, we get the overall execution time

O(훼 log 푡 + 훽푛/푡^{1/2} + (푛/푡) log(푛/푡)).    (6.1)

Note that for 푛 polynomial in 푡 this bound is O(훼 log 푡 + 훽푛/푡^{1/2}). This restriction is fulfilled for all reasonable applications of this sorting algorithm.

6.2 Building Blocks for Hypercube Quicksort

In this section, we describe building blocks that we need for robust hypercube quicksort in Section 6.3.

6.2.1 Bound for the Balls into Bins Problem

We consider the balls into bins problem [JK77]. This problem throws 푚 balls independently and uniformly at random into 푏 bins. The number of balls thrown into a bin corresponds to the number of successes in a Bernoulli trial with 푚 experiments and a success probability of 1/푏. Here, we modify the balls into bins problem by allowing 푚 and 푏 to be scaled.

Lemma 6.1 (Scaled balls into bins problem) Let 푡 = 2^푥 ≥ 2 with 푥 being a natural number and let 푖 ∈ [0 .. log 푡). Furthermore, let 푛/2^푖 balls be thrown into 푡/2^푖 bins. Then, the probability that all bins have O(푛/푡 + log 푡) balls each is ≥ 1 − min(푛^{−푐}, 푡^{−푐}) for any constant 푐 > 0.


Proof. We first consider a fixed bin and 푛 ≥ 13푡 ln 푡. Let 푋 be the random variable determining the number of elements assigned to this bin. The number of elements in this bin is the result of a balls into bins problem with 푚 = 푛/2^푖 balls and 푏 = 푡/2^푖 bins. We bound the probability of having at least (2 + 푐)푛/푡 balls in the bin by

P[푋 ≥ (2 + 푐)푛/푡] ≤ 푒^{−(1+푐)푛/(3푡)} = 푒^{−ln(푛)(1+푐)푛/(3푡 ln 푛)} = 푛^{−(1+푐)푛/(3푡 ln 푛)} ≤ 푛^{−(1+푐)⋅13푡 ln(푡)/(3푡 ln(13푡 ln 푡))} ≤ 푛^{−(푐+1)}    (6.2)

푡~2푖 ⋅ P[푋 ≥ (2 + 푐)푛~푡] ≤ 푡~2푖 ⋅ 푛 푐 1 ≤ 푛 푐 − − − since we assume 푛 ≥ 푡. We now consider the case 푛 < 13푡 ln 푡. When we pad the elements to a total of 푛′ = 13푡 ln 푡 elements, the maximum load of the bins is at most (2 + 푐)13 ln 푡 with probability ≥ 1 − 푛′ 푐 for 푐 > 0. By removing the padding elements, this probability bound also holds for the initial− 푛 input elements. ◻

6.2.2 Randomized Shuffling on Hypercubes It is a folklore observation (e.g., [SH97]) that data skew can be removed by shuffling the input data randomly. This is implemented for the samplesort algorithm of Helman et al. [HBJ98] by directly sending each element to a random destination. Note that this incurs an overhead of about 훼 min(푡, 푛~푡) + 훽푛~푡. We propose a simple technique for small inputs where we need logarithmic startup latency. This can be achieved by routing the data according to a hypercube communication pattern—sending each element to a random side in each communication step. Here, we first communicate along dimension 0 and go up to dimension log 푡 − 1. Each communication step performs two operations. First, the PEs split their local elements into two random halves. Each element is assigned to one half with probability 1~2. Second, each PE sends one half to its communication partner, keeps one half, and receives one half from its communication partner.

Lemma 6.2 (Distribution of elements in randomized shuffling) After each communication step, each element is stored on a random PE of its subcube.

Proof. We consider a random element 푒. Let 푒 be input on PE 푗 where [푏_0 .. 푏_{log 푡}) is the binary representation of 푗. In order to prove Lemma 6.2, we show the following proposition. After communication along dimension 푖 ∈ [0 .. log 푡), element 푒 is stored on a PE whose 푖 + 1 least significant bits are distributed uniformly at random and whose log 푡 − 푖 − 1 most significant


bits are [푏_{log 푡−푖−1} .. 푏_{log 푡}). We prove this proposition by induction over the communication steps. The base case is 푖 = 0. In the first communication phase, the communication partner of 푗 differs in the least significant bit. The base case is true since 푒 is randomly routed to 푗 or to its communication partner. We now consider communication along dimension 푖 + 1: By the induction hypothesis, we know that 푒 is stored on a PE 푗 whose 푖 least significant bits are distributed uniformly at random and whose log 푡 − 푖 most significant bits are [푏_{log 푡−푖} .. 푏_{log 푡}). Again, the binary representation of PE 푗 and its communication partner only differ in bit 푖 + 1. Since 푒 is randomly routed to 푗 or its communication partner, the proposition also holds after communication along dimension 푖 + 1. ◻

Lemma 6.3 (Bound on the load of randomized shuffling) After each communication step, each PE stores O(max(푛/푡, log 푡)) elements with probability ≥ 1 − min(푛^{−푐}, 푡^{−푐}) for any constant 푐 > 0.

Proof. After communication step 푖 ∈ [0 .. log 푡), each subcube has O(2^{푖+1}푛/푡) elements in total. We consider a random subcube 퐶. According to Lemma 6.2, the elements stored in 퐶 are randomly distributed to its PEs. Thus, the number of elements on the PEs of 퐶 can be described by the balls into bins problem with O(2^{푖+1}푛/푡) balls and 2^{푖+1} bins. From Lemma 6.1 in Section 6.2.1 it follows that each PE of 퐶 stores at most O(푛/푡 + log 푡) elements with probability ≥ 1 − min(푛^{−푐}, 푡^{−푐}). This probability bound also holds for all subcubes since the hypercube has only 2푡 − 1 subcubes. ◻

Lemma 6.4 (Running time of randomized shuffling on hypercubes) The running time of our randomized shuffling algorithm with O(푛~푡) elements per PE on a hypercube of size 푡 is O‰훼 log 푡 + 훽 max(푛~푡, log 푡) log 푡Ž with probability ≥ 1 − min(푛 푐, 푡 푐) for any constant 푐 > 0. − − Proof. From Lemma 6.3 follows each PE communicates O(max(푛~푡, log 푡) log 푡) elements. We additionally account O(훼 log 푡) for communication startups. ◻

A naive approach labels each element with a random destination. Then, a hypercube algorithm redistributes the data. This approach increases the communication volume by a factor of two.

6.2.3 Approximate Median Selection with a Single Reduction Siebert and Wolf [SW11] consider parallel quicksort for the case where 푛 = 푡. They propose to select splitters using ternary remedian, an estimator to approximate the median of an input sequence initially introduced by Rousseeuw and Bassett [RB90]. Remedian uses a ternary tree whose leaves are the input elements. At each internal node, the median of the children elements is passed upward. Lemma 6.5 from Dean et al. [DJW14] shows that for randomly permuted inputs and a balanced tree this is a good approximation of the median.

105 6 Robust Scalable Distributed Sorting Algorithms

(a) Comparison of rank errors (b) Comparison of rank variance

−1 100 10 10−2 −1 10 10−3 10−4 10−2 Max. Rank Error 10−5 Rank Error Variance −6 10−3 10 22 28 214 220 22 28 214 220 Number of Elements [푛] Number of Elements [푛]

Ternary Remedian Binary Remedian 2푛−0.369 1.44푛−0.39

Figure 6.2: Comparison of ternary remedian and binary remedian with 푘 = 2.

Lemma 6.5 (Ternary remedian) Let 푆 be a random sample from any probability distribution that supports efficient sampling (e.g., elements stored in an array). For each constant 푐 > 0, there exists an 푛′ for which the ternary remedian algorithm selects from 푛 > 푛′ randomly shuffled sample elements 푆 a splitter element of rank [푛~2(1 − 2푛 0.369) .. 푛~2(1 + 2푛 0.369)] within the probability distribution with probability ≥ 1 − 푛 푐 when 푛−is a power of three. − − But Siebert and Wolf [SW11] do not permute randomly. Even when 푡 is a power of two, their method does not produce a completely balanced tree such that their implementation remains a heuristic. We fix these restrictions by using a binary remedian and working with randomly per- muted inputs: Consider a tuning parameter 푘 (even). Assume that the local data is a sorted sequence 푎[1 .. 푚] and that 푚 is even. Each PE is a leaf of the binary tree and forwards 푎[푚~2 − 푘~2 + 1 .. 푚~2 + 푘~2] up the tree—a local approximation of the elements closest to the median. Undefined array entries to the left are treated like a very small key and undefined entries to the right as very large keys. If 푚 is odd, we flip a coin whether 푎[⌊푚~2⌋−푘~2+1 .. ⌊푚~2⌋+푘~2] or 푎[⌈푚~2⌉ − 푘~2 + 1 .. ⌈푚~2⌉ + 푘~2] is forwarded. Internal nodes merge the received sequences and use the result as their sequence 푎 analogously to the leaves. At the root, we flip a coin whether 푎[푘~2] or 푎[푘~2 + 1] is returned. Note that in most MPI implementations, this algo- rithm can be implemented by defining an appropriate reduction operator. The overall running time is O(훼 log 푡). For randomly permuted local inputs of size 푛~푡, it is easy to show that our scheme yields a truthful estimator for the median, i.e., we get a result with expected rank 푛~2. We conjecture that similar quality bounds as from Dean et al. [DJW14] can be shown for our scheme even with 푘 = 2, i.e., that we get rank 푛~2(1 ± 푐푛 훾) with high probability for some constant 푐 and 훾. −

106 6.3 Robust Quicksort on Hypercubes

Algorithm 7 Robust Hypercube Quicksort + Input: 퐴[1 .. 푛~푡] data elements, 푡 = 2푑 number of PEs, PE number 푖 퐴 ← randomly redistribute 퐴 ▷ See Section 6.2.2 for 푗 ← 푑 − 1 down to 0 do ▷ Iterate over cube dims ← ⌈ log 푡 ⌉ 푠 calculate splitter performing 푗 1 iterations of parallel ternary remedian ▷ See text if isEmpty(푠) then return 퐴 ▷ No elements in cube + split 퐴 into 퐿 ⋅ 푅 ▷ See text 푖′ ← 푖 ⊕ 2푗 ▷ Communication partner if 푖′ < 푖 then send 퐿 to 푖′ and receive 푅′ from 푖′ 퐴 ← concatenate 푅 and 푅′ else send 푅 to 푖′ and receive 퐿′ from 푖′ 퐴 ← concatenate 퐿 and 퐿′ Sort(퐴) return 퐴

We experimentally evaluate the median approximations of our binary remedian with 푘 = 2 and the ternary remedian. We show that the binary remedian gives better median approxima- tions than the ternary remedian and the rank can also be bounded by 푛~2(1 ± 푐푛 훾) with better constants 푐 and 훾. Our benchmark runs both selection algorithms on uniformly− distributed random integers in the range [0, 232 − 1] of up to 220 elements. The input into the binary remedian is a power of two with nodes having two elements as input whereas the input into the ternary remedian is a power of three with nodes having three elements as input. We execute both algorithms 2 000 times for each input size. We measure the rank 푟 for each output element and calculate the rank error 푟 1 V − V . 푛 − 1 2 Figure 6.2a shows the worst observed rank error. Compared to the ternary remedian, the binary remedian shows a smaller maximum rank error over 2 000 runs for all input sizes. Moreover, the rank error of the binary remedian is bounded by 1.44푛 0.39 while the rank error of the ternary remedian is only limited by 2푛 0.369. Figure 6.2b depicts− the variance of the rank error for the binary remedian and the ternary− remedian. For small input sizes, the variance of the binary tree median approximation is by a factor of two smaller. For large inputs, the difference in the variance even increases to a factor of three.

6.3 Robust Quicksort on Hypercubes

Previous implementations of hypercube-based quicksort (e.g., [Wag87; LM92; SMB13]) have three major drawbacks. First, they are not robust against duplicates as no tie-breaking is

107 6 Robust Scalable Distributed Sorting Algorithms performed. Second, the algorithms do not consider non-uniform input distributions. For hypercube-based routing algorithms, it is well known that skewed inputs exist where the intermediate imbalance factor increases to ⌊푡1 2⌋ (see the discussion in Section 5.1.1). This bound even applies to unique input keys and true~ medians as a global splitter. Third, previous algorithms do not provide a fast high-quality splitter selection with median approximation guarantees. This is essential because, otherwise, quicksort gets impractical for large 푡 as the load imbalance accumulates or the splitter selection dominates the running time for small inputs. Previous quicksort implementations use one of three different approaches to calculate the splitter. The first approach [Wag87] selects a random PE that just broadcasts its local median. The second approach [LM92] gathers the local median of each PE and calculates the median of medians. The third approach [SMB13] uses the median of a relatively large random sample. In this section, we propose two versions of hypercube quicksort that overcome these prob- + + O‰ 푛 + lems. The first, robust hypercube quicksort (RQuick ), has a running time of 푡 log 푛 푛 + 2 Ž 훽 푡 log 푡 훼 log 푡 with high probability. The second version, robust hypercube quicksort (RQuick), is a much simpler version for which we conjecture the same running time guarantee.

RQuick+

Algorithm 7 provides pseudocode of RQuick+. First, RQuick+ randomly redistributes the input (see Section 6.2.2). The main loop goes through the dimensions 푖 ∈ [0, … , log 푡 − 1] of the hypercube, highest dimension first. In the splitter selection step, we approximate the median of the data in each 푖 + 1-dimensional subcube. The splitter selection performs the following six ⌈ log 푡 ⌉ phases log 푡 푖 times and returns the best median approximation as the splitter candidate. (1) ~ 푖 Let 푡′ be the− subcube size 푡 2 rounded up to the next power of three. Select 푡′ samples with the parallel random sampling algorithm from Sanders et al. [San+18]. (2) Each sample element is sent to a random PE of the subcube with a direct message. (3) The elements are shuffled locally. (4) Enumerate the sample elements using a collective prefix sum. (5) Send the sample with rank 푗 to PE ⌊푗~3⌋ of the subcube with a direct message. (6) Use the shuffled samples as input to the ternary remedian algorithm by Rousseeuw and Bassett [RB90] (also see Section 6.2.3) to obtain a median approximation. In the splitting step, each PE partitions its local elements according to the splitter. In the last step, the communication step, PEs differing in bit 푗 of their PE number exchange their data so that the PE with the 0-bit gets the elements smaller than 푠 and the PE with the 1-bit gets elements larger or equal to 푠. For robustness against repeated keys, we use a low overhead greedy tie-breaking schema: Suppose that the 0-bit PE holds the 푢 sorted sequence 푎 of the form 푎 = 푎ℓ ⋅ 푠 ⋅ 푎푟 and the 1-bit PE holds the sorted sequence 푏 of the 푣 form 푏 = 푏ℓ ⋅ 푠 ⋅ 푏푟, i.e., the splitter 푠 appears 푢 respectively 푣 times locally. We split 푎 and 푏 푥 푢 푥 푣 푦 푦 into four subsequences 푎ℓ ⋅ 푠 , 푠 ⋅ 푎푟, 푏ℓ ⋅ 푠 , and 푠 ⋅ 푏푟 with − − 푢 푥 = min(푠 , max((S푎S + S푏S)~2 − 푎ℓ − 푏ℓ), 0) and 푣 푦 = min(푠 , max((S푎S + S푏S)~2 − 푎푟 − 푏푟), 0))

108 6.3 Robust Quicksort on Hypercubes

푢 푥 푣 푦 and send 푠 ⋅ 푎푟 respectively 푏ℓ ⋅ 푠 to the communication partner. This minimizes the load of the two PEs− after the data exchange− while simultaneously minimizing the communication volume between these PEs. Theorem 6.6 (Running time of RQuick+) + O‰ 푛 + 푛 + 2 Ž − 푐 RQuick runs in time 푡 log 푛 훽 푡 log 푡 훼 log 푡 with probability ≥ 1 푡 for any constant 푐 > 0 and inputs without duplicated keys. − The proof of Theorem 6.6 is structured as follows. First, we consider the imbalance intro- duced by the median approximation. This starts with Lemma 6.7 that limits the imbalance introduced by splitting the data of a subcube. Lemma 6.7 leads to Lemma 6.8 bounding the maximum load of subcubes. Second, we limit the load of PEs within subcubes. Here, the initial random shuffling is crucial. A result of the random shuffling is that the elements of each subcube are randomly distributed to its PEs, independent of the splitter quality. This is shown in Lemma 6.9. Lemma 6.10 uses this property and the bounded imbalance between subcubes (see Lemma 6.7) to limit the individual load of the PEs. Finally, we prove Theorem 6.6 with Lemma 6.10. In the case the splitter element occurs repeatedly, Lemma 6.9 also helps our simple local tie-breaking heuristic—each PE has a random sample of the global data of its subcube and, hence, our local balancing approximates a global balancing.

For the proof of Theorem 6.6, we oftentimes consider an algorithmic property that is valid with probability ≥ 1 − 푡 푐 for any constant 푐 > 0. In this case, we say that the property is valid with probability 1 − 푡 Ω−푐 . Note that we can use O(푡) different properties that are each valid with probability 1 − 푡−Ω(푐). All these properties are then again valid with probability 1 − 푡 Ω 푐 . For simplicity, we assume− ( ) that 푛 ≥ 푡. For the proof, we assume that the input does not have− any( ) duplicated keys.

Lemma 6.7 (Quality of the splitter selection) With probability 1 − 푡 Ω 푐 , there exists a constant 푗 ∈ ℕ for which no subcube of dimension 푖 ≥ 푗 of RQuick+ introduces− an( ) imbalance factor of more than 1 + 2 0.369푖. − Proof. First, we consider the splitter selection of an 푖-dimensional subcube. Let 푗 ∈ ℕ be an appropriate large constant and let 푖 ≥ log 푗. We denote the execution of a single ternary remedian run as failed if the data of the subcube is split into two buckets with an imbalance factor larger than 1 + 2 훾푖 for 훾 = −0.369. For the proof, we use 2푖 rather than the actual sample size. This is an underestimate− of the sample size and thus, the splitter quality. According to Lemma 6.5, the failure probability is ≥ 2 푐푖 for any constant 푐 > 0 and a sufficiently large number of PEs. Let 푗 be this number. Thus,− the size 2푖 of the current subcube is sufficiently ⌈ log 푡 ⌉ large. Since we execute 푖 remedian repetitions, at least one remedian execution does not fail with probability

푐 log 푡 log 푡 푐 log 푡 푖 ≤ 1 − ‰2 푐푖Ž 푖 ≤ 1 − ‰2 푐푖Ž 푖 = 1 − Š‰2푖Ž  = 1 − ‰푡푖Ž 푖 = 1 − 푡 푐 . − − ⌈ ⌉ − − − Thus, the subcube does not introduce an imbalance factor of more than 1+2 훾푖 with probability 1 − 푡 Ω 푐 . − − ( )

109 6 Robust Scalable Distributed Sorting Algorithms

With probability 1 − 푡 Ω 푐 , no subcube of dimension 푖 ≥ 푗 introduces an imbalance factor of more than 1 + 2 훾푖 since− the( ) sorting algorithm has only 푡 − 1 subcubes. ◻ − Lemma 6.8 (Maximum load of subcubes in RQuick+) With probability 1 − 푡 Ω 푐 , there is no 푖-dimensional subcube of RQuick+ that has more than O( 푖 ( 푛 )) − ( ) 2 max 푡 , log 푡 elements. Proof. After the initial shuffling, each PE stores O(max(푛~푡, log 푡)) elements with probability 1−푡 Ω 푐 (see Lemma 6.3 in Section 6.2.2). Thus, with the same probability, each 푖-dimensional subcube− ( ) stores O‰2푖 max(푛~푡, log 푡)Ž elements. We now show that the additional load imbal- ance introduced by each step of the main loop can be limited by a constant factor. According to Lemma 6.7, the last O(1) iterations of the main loop only introduce a constant imbalance. Since these iterations introduce a constant imbalance, we only have to consider the remaining iterations. With probability 1 − 푡 Ω 푐 , the imbalance factor introduced by an 푖-dimensional subcube is bounded by 1 + 2 훾푖 with− (훾)= 0.369 for the remaining iterations (also see Lemma 6.7). Now we can bound the imbalance− factor 퐼 of the remaining iterations to

ln 1 2 훾푖 ‰ + 훾푖Ž = ∏푖 log 푡 퐼 ≤ ∏ 1 2 푒 − 푖 log 푡 − < ‰ + Ž ln 1 2 훾푖 2 훾푖 = <∑푖 log 푡 ∑푖 log 푡 푒 − ≤ 푒 − < < ∑ 2‰훾 +푖 Ž 1 = 푖 log 푡 1 2 훾 = O( ) 푒 − ≤ 푒 1 . < ( ) − − The first “≤” is used since we consider all iterations, the second “≤” uses the Taylor series development of ln, and the third “≤” uses a geometric sum. ◻

Lemma 6.9 (Placement of elements in a subcube of RQuick+) Before step 푖 of RQuick+’s main loop, the elements of a PE-subcube are randomly distributed to its PEs. Proof. Consider an arbitrary input element 푒. Let 푒 be stored on PE 푗 of a subcube 퐶 with 푡~2푖 PEs before step 푖. Then, 푒 had to be stored on a PE 푗′ ∈ [푗2푖 .. (푗 + 1)2푖) when the main loop of RQuick+ started. Since the initial random shuffling had moved 푒 to a random PE 푗′ ∈ [0 .. 푡) (see Lemma 6.2 in Section 6.2.2), PE 푗 must also be a random PE of subcube 퐶. ◻

Lemma 6.10 (Maximum load of PEs in RQuick+) Each PE has O(max(푛, 푡 log 푡)~푡) elements after each step of the main loop of RQuick+ with probability 1 − 푡 Ω 푐 . − ( ) Proof. Each subcube stores O(max(log 푡, 푛~푡)~2푖 1) elements with probability 1 − 푡 Ω 푐 after step 푖 ∈ [0 .. log 푡) of the main loop (see Lemma+ 6.8). The subcubes contain 푡~2푖 1−PEs( ) each. Lemma 6.9 tells us that the elements of a subcube are randomly distributed among the+ PEs with probability 1−푡 Ω 푐 . Thus, the number of elements on the PEs of a subcube can be described by the balls into bins− ( problem) with O(max(푡 log 푡, 푛)~2푖 1) balls and 푡~2푖 1 bins. From Lemma 6.1 in Section 6.2.1 follows that each PE of a subcube has+ at most O(max(log+ 푡, 푛~푡)) elements with

110 6.3 Robust Quicksort on Hypercubes

Algorithm 8 Robust Hypercube Quicksort Input: 퐴[1 .. 푛~푡] data elements, 푡 = 2푑 number of PEs, PE number 푖 퐴 ← randomly redistribute 퐴 ▷ See Section 6.2.2 Sort(퐴) for 푗 ← 푑 − 1 down to 0 do ▷ Iterate over cube dimensions 푚 ← median of 퐴 푠 ← calculate splitter with binary remedian using 푚 ▷ See Section 6.2.3 if isEmpty(푠) then return 퐴 ▷ No elements in cube split 퐴 into 퐿 ⋅ 푅 according to splitter 푠 ▷ See text 푖′ ← 푖 ⊕ 2푗 ▷ Communication partner if 푖′ < 푖 then send 퐿 to 푖′ and receive 푅′ from 푖′ 퐴 ← merge 푅 with 푅′ else send 푅 to 푖′ and receive 퐿′ from 푖′ 퐴 ← merge 퐿 with 퐿′ return 퐴 probability 1 − 푡 Ω 푐 . This probability bound also holds for all subcubes since the hypercube has only 2푡 − 1 subcubes.− ( ) ◻

We finally come back to Theorem 6.6, i.e., the total running time 푛 푛 O‰ log 푛 + 훽 log 푡 + 훼 log2 푡Ž . 푡 푡

Proof of Theorem 6.6. The first term covers the time for local sorting. The second and third term is an upper bound for the random shuffling (see Lemma 6.4 in Section 6.2.2). For this bound, we use that 훽 log2 푡 ∈ O(훼 log2 푡). The second and third term is also an upper bound + + ⌈ log 푝 ⌉ for the main loop of RQuick : In step 푖 of the main loop, RQuick executes log 푝 푖 rounds of splitter refinement. In each round, the sampling, the sample shuffling, and the collective − prefix sum require time O(훼(log 푝 − 푖)) each (see Lemma 6.1 for an upper bound for the sample shuffling). Additionally, Lemma 6.10 tells us that the local load of the main loopof RQuick+ is O(max(푛, 푡 log 푡)~푡). Thus, the data exchange and the partitioning runs in time O(훽푛~푡 log 푡 + 훼 log2 푡). Again, we use that 훽 log2 푡 ∈ O(훼 log2 푡). All bounds hold with probability 1 − 푡 Ω 푐 since each bound holds with probability 1 − 푡 Ω 푐 . ◻ − ( ) − ( ) The startup overhead 훼 log 푝 for shuffling is a lower order term compared to O‰log2 푝Ž for the main loop. Additionally, the cost of data transfer and local work for the shuffling phase is covered by the startup overhead of the main loop for 푛 ∈ O(훼~훽푝 log 푝). For larger inputs, the cost of data transfer and local work for shuffling is still a logarithmic factor smaller compared to the cost of bitonic sort.

111 6 Robust Scalable Distributed Sorting Algorithms

푡 = 212 푡 = 213 푡 = 214 푡 = 215 푡 = 216 퐼 5 4 3 2 1 Imbalance factor 0 Shuffle RQuick RQuick Shuffle RQuick RQuick Shuffle RQuick RQuick Shuffle RQuick RQuick Shuffle RQuick RQuick

+ + + + +

Figure 6.3: Distribution of the random shuffling (Shuffle), RQuick’s and RQuick+’s mea- sured imbalance factors obtained from 216 runs each.

RQuick

For RQuick, we use the binary remedian, which we conjecture to be more accurate than the ternary remedian algorithm used by RQuick+. Additionally, we take advantage of the observation that in each iteration of the main loop of RQuick+, the local data of the PEs is a random sample of their subcube’s data (see Lemma 6.9). This is a consequence of the initial random shuffling. The idea now is to use the medians of the PEs’ local data as input to the binary remedian since these medians are already very good approximations of the global median, especially compared to the samples used in RQuick+. In return, RQuick only executes one round of median approximation. Overall, the splitter selection overhead for RQuick is significantly smaller than for RQuick+ since the splitter selection of RQuick boils down to propagating two elements from the leaf PEs of a binomial tree to its root and one element back to the leaves. Algorithm 8 provides pseudocode for RQuick. First, RQuick randomly redistributes the input (see Section 6.2.2). Then the data is sorted locally. The main loop goes through the dimensions of the hypercube, highest dimension first. In each 푗-dimensional subcube, we use the binary remedian algorithm from Section 6.2.3 with 푘 = 2 to calculate a splitter element 푠. Each PE provides its local median as input to the selection algorithm. In a communication step, PEs differing in bit 푗 of their PE number exchange their data so that the PE with the 0-bit gets the elements smaller than 푠 and the PE with the 1-bit gets elements larger or equal to 푠. These two sequences are then merged. For robustness against repeated keys, we use the low overhead greedy tie-breaking schema from RQuick+. Now, we evaluate the imbalance factor of the random shuffling, RQuick, and RQuick+ experimentally. The imbalance factor of a run is denoted as the maximum output imbalance of the PEs relative to the average load 푛~푝. The imbalance of RQuick and RQuick+ is induced by the splitter selection quality—the better the median approximation quality, the smaller the imbalance between subcubes. After the last iteration of the main loop, we have one PE per

112 6.4 Delivering 푘 Data Partitions to 푘 PE-Groups

(a) Simple Message Assignment

PE 9 PE 10 < <

1 2 3 456 7 8 8 9 10 10 group 3

(b) Deterministic Message Assignment

PE 1 PE 2 PE 3 PE 4 PE 5 PE 6 PE 7 PE 8

1퐴 1푏 2퐴 2푏 3푎 3퐵 4푎 4퐵 5푎 5퐵 6푎 6퐵 7푎 7퐵 8퐴 8푏

group 1 distribute small pieces group 2

3푎 7푎 4푎 5푎 6푎 1푏 2푏 8푏 distribute large pieces

3푎 7푎 1퐴 4푎 1퐴 2퐴 5푎 2퐴 8퐴 6푎 8퐴 1푏 3퐵 2푏3퐵 4퐵 5퐵 8푏 5퐵 6퐵 6퐵 7퐵

Figure 6.4: Illustration of the different message assignments. subcube and the highest imbalance between subcubes. We see the intermediate imbalance factor of the initial random shuffling as the base line for the sorting algorithms. The box plot in Figure 6.3 depicts the distribution of the imbalance factors obtained by experiments with different numbers of PEs and 10푡 log 푡 elements. According to Figure 6.3, the imbalance factor distribution of initial random shuffling and RQuick are about the same. However, the imbalance of RQuick is always significantly smaller than the imbalance of RQuick+, regardless of the measures minimum, maximum, median, first quartile, and third quartile. Thus, we conjecture that RQuick and RQuick+ have the same asymptotic running time since the running time of RQuick+ is already the lower bound for hypercube quicksort.

Conjecture 6.11 (Running time of RQuick) O‰ 푛 + 푛 + 2 Ž − 푐 RQuick runs in time 푡 log 푛 훽 푡 log 푡 훼 log 푡 with probability ≥ 1 푡 for any constant 푐 > 0 and inputs without duplicated keys. −

6.4 Delivering 푘 Data Partitions to 푘 PE-Groups

In the recursive multiway sorting algorithms considered here, we face the following data redistribution problem: Each PE has partitioned its 푛~푡 local elements into 푘 pieces. The total size of pieces with number 푖 is 푛~푘. Pieces with number 푖 have to be moved to PE-group 푖 that consists of PEs [푖푡~푘 .. (푖 + 1)푡~푘). Each PE in a group should receive the same amount of data except for rounding issues. To simplify the discussion, we describe the algorithm for the case that every PE receives exactly 푛~푡 elements. We now want to compute a message assignment that minimizes the number of incoming and outgoing messages.

113 6 Robust Scalable Distributed Sorting Algorithms

Algorithm 9 Computation of the Simple Message Assignment

Input: 푖 current PE, 푡 PEs, 푘 PE-groups, 푚 = {푚0, … , 푚푘 1} number of elements for each group, 푎 array of local elements − 푡, 푙, 푠 ∈ ℕ푘 푡푗 ← ∑푥∈ 0 .. 푡 푚푗 @ PE 푥 ▷ Global number of elements for group 푗 (collective all-reduce) 푙푗 ← ∑ 푚푗 @ PE 푥 ▷ Number of elements for group 푗 on preceding PEs 푥∈[0 .. 푖 ) (collective excl. prefix sum) for 푤 ∈ [[0 .. 푘)) do ▷ Calculate messages for PEs in group 푤 표 ← ⌈푡푤푘~푡⌉ ▷ Maximum number of elements received by PEs of group 푤 if 푚푤 = 0 then continue ▷ No elements for group 푤

푏 ← 푙푤~표 ▷ First PE of group 푤 that gets a message from PE 푖 푒 ← (푙푤 + 푚푤 − 1)~표 ▷ Last PE of group 푤 that gets a message from PE 푖 for 푞 ∈ [푏 .. 푒] do Assign elements 푎[max(0, 표푡 − 푙푤) .. 푚푖푛((푡 + 1)표 − 푙푤, 푚푤)) to PE 푤푡~푘 + 푡

Simple Approach. We begin with a simple message assignment (SMA) [KK93; GV94; SW11] and then refine the assignment in order to handle bad cases. Algorithm 9 provides pseudocode. The basic idea is to compute a prefix sum over the piece sizes—this is a collective prefix sum with vector length 푘. As a result, each piece is labeled with a range of positions within the group it belongs to. Positions are numbers between 0 and 푛~푘 − 1. An element with number 푗 in group 푖 is sent to PE 푖푡~푘 + ⌊푗푡~푛⌋. This way, each PE sends and receives exactly 푛~푡 elements. Moreover, each piece is sent to one or two target PEs responsible for handling it in the recursion. Thereby, each PE sends at most 2푘 messages for the data exchange. Unfortunately, the number of received messages, although the same on average, may vary widely in the worst case. There are inputs where some PEs have to receive Θ(푡) very short messages. This happens when many consecutively numbered PEs provide only very small pieces of data (see PE 9 in the top of Figure 6.4a). The SMA algorithm takes O(훼 + 훽푘) time and the data exchange step needs Exch(푡, 푛~푡, 푡) time in the worst case. Reduce Message Transfers. We propose the deterministic message assignment (DMA), an assignment that limits the number of sent and received messages to O(푘). Figure 6.4b illustrates the process. The basic idea is to distribute small and large pieces separately. We consider a piece to be small if it contains less than 푛~(2푡푘) elements. Otherwise, the piece is large. Algorithm 11 provides pseudocode for the calculation of the message assignment. The message assignment algorithm performs four steps. In step one, the description of the pieces is sent to their target group: For each group, we enumerate the small pieces respectively large pieces with two collective prefix sums. Overall, this takes O(훼 log 푡 + 훽푘) time. The result is a global ranking of the pieces. Then, PE 푖 sends the description of its large respectively small piece with rank 푔 destined for group 푗 to PE ⌊푔~푘⌋ respectively PE (푔 mod 푘) of that group. This can be done in time Exch(푡, O(푘), 푘). Also, each PE receives at most 푘 small pieces. In the next two steps, each group independently produces an assignment of the received pieces. In step two, each PE assigns the small pieces to itself and subtracts the piece sizes from its

114 6.4 Delivering 푘 Data Partitions to 푘 PE-Groups

Algorithm 10 DMA compare function function lessThan(푙, 푟) if isEqualPieceType(푙, 푟) then return 푙.푖 < 푟.푖 else if isResidualPiece(푙) then return 푙.푏 ≤ 푟.푏 else return 푙.푏 < 푟.푏 initial capacity 푛~푡. This way, all small pieces are assigned without having to split them—each PE sends at most 푘 small pieces and each PE receives at most 푘 small pieces. Also, no receiving PE gets assigned more than half of its final load. In step three, the large pieces are assigned taking the residual capacity of the PEs into account. The assignment of large pieces to group 푗 is collectively calculated by the PEs of that group. After step three, each PE has calculated O(푘) assignments. Also, the pieces that have been initially delivered by a PE were assigned to O(푘) PEs. In step four, the assignments are sent back to the initial PEs. This can be done in time Exch(푡, O(푘), 푘). In the following, we describe step three, the assignment process of large pieces for a single group. Conceptually, we enumerate the unassigned elements of the large pieces on the one hand and the unassigned element slots provided by the residual capacities on the other hand and then map element 푖 to slot 푖.1 To implement this, we compute a collective prefix sum of the residual capacities of the receiving PEs on the one hand and the sizes of the unassigned large pieces on the other hand. This yields two sorted sequences 퐶 and 푈 respectively. On the one hand, each group PE 푖 contributes to 퐶 one residual piece (푖, 푏, 푒) = 푐푖 which indicates that this PE provides slots [푏 .. 푒). On the other hand, the 푖-th large piece (푗, 푏, 푒) = 푢푖 of 푈 indicates that the source PE 푗 of this piece will send elements with indices [푏 .. 푒). The sequence 퐶 is then merged into sequence 푈. Our merge operation guarantees that the pieces from 푈 remain on their PEs. Algorithm 10 depicts the comparator function. Roughly speaking, a residual piece 푐 is smaller than a large piece 푢 if and only if the first slot of 푐 is smaller than or equal to the first element index of 푢. In the merged sequence, a subsequence of the form a푐푖, 푢푗, … , 푢푗 푟, 푐푖 1, 푧f induces the following element assignment: The source PE of large piece 푢 ∈ [푢 .. 푢 ] has 푗 + 푗 푟+ to send S[푢.푏 .. 푢.푒) ∩ [푐 .푏 .. 푐 .푒)S elements to group PE 푖. Piece 푗 + 푟 may also wrap over to 푖 푖 + PE 푖 + 1, and possibly to PE 푖 + 2 if 푧 = 푐푖 2. No further wrapping over of large piece 푢푗 푟 is possible since no piece can be larger than 푛~푡 and since every PE has a residual capacity of at + + least 푛~(2푡). Thus, large pieces are assigned to a constant number of group PEs. Similarly, 푛 ~ = no PE gets assigned more than 푡 ⋅ 2푡푘 푛 2푘 parts of large pieces since large pieces have at most size 푛~푡 and since each PE has a residual capacity of at least 푛~(2푡). Before we calculate the message assignment of large pieces as described above, residual piece 푐푖 must be available on the PEs that store large pieces [푢푗 .. 푢푗 푟 1]. Also, residual pieces 푐 , 푐 , and possibly 푧 must be available on the PE that stores large piece 푢 . We show that it 푖 푖 1 + − 푗 푟 + + 1A similar approach to data redistribution is described in [HS16]. However, here we can exploit special properties of the input to obtain a simpler solution that avoids segmented gather and scatter operations.

115 6 Robust Scalable Distributed Sorting Algorithms is sufficient to greedily send each residual piece to the preceding PE and the two succeeding PEs. This operation only requires time O(훼). In order to make the desired residual pieces available on the particular PEs, we take advantage that we merged sequence 퐶 into sequence 푈 and that the pieces from 푈 remained on their PEs. From this follows that residual pieces 푐푖 1 and 푐푖 2 may need to be sent back to the preceding PE since piece 푢푗 푟 is either stored on the PE holding these two residual pieces or on the preceding PE. Also, piece 푐 may have to + + + 푖 be forwarded to the next two PEs since pieces [푢푗 .. 푢푗 푟 1] can only be stored on the PE that holds piece 푐 and on the two succeeding PEs. This is due to two reasons: On the one hand, the 푖 + − number of slots of piece 푐푖, and thus the total size of the pieces [푢푗 .. 푢푗 푟 1], is at most 푛~푡. On the other hand, step one of the DMA algorithm guarantees that the left PEs of a PE-group store + − 푘 large pieces each of size ≥ 푛~(2푘푡), followed by one PE that stores up to 푘 pieces, followed by potential PEs without any large pieces. The difficult part is to merge sequence 퐶 into sequence 푈 efficiently. Here one can adapt and simplify the work efficient parallel merging algorithm for EREW PRAMs from Hagerup and Rüb [HR89]. Essentially, one first merges the 푡~푘 pieces of 퐶 with a compressed representation of 푈—each PE fuses its local large pieces of 푈 into a single super piece. This merging operation can be done in time O(훼 log(푡~푘)) using Batcher’s merging network [Bat68] since each PE provides two pieces. Then each residual piece 푒 of 퐶 has to be merged into ≤ 푘 large pieces of 푈 on one particular PE to obtain the merged sequence of 퐶 and 푈: For each piece 푒, we determine its particular PE by counting the number of super pieces to the left of 푒 in the merged sequence. The counts can be obtained with a prefix sum in time O(훼 log(푡~푘)). We then route the residual pieces to the particular PEs. Note that each PE will have to merge only O(푘) residual pieces since it is impossible that the local large pieces (of total size ≤ 푘푛~푡) fill more than 2푘 PEs (each providing residual capacity > 푛~(2푡)). We also note that each PE sends at most two residual pieces. Thus, we can transfer the residual pieces to the appropriate PEs in time Exch(푡, O(푘), O(푘)). We now have everything at hand so that the PEs can merge the residual pieces into their large pieces. This can be done using local merging in time O(푘) if we sort the received residual pieces with [Knu73]. In other words, the special properties of the considered sequences make it unnecessary to perform the contention resolution measures that make the algorithm from Hagerup and Rüb [HR89] somewhat complicated. Overall, we get the following deterministic result.

Theorem 6.12 (Data redistribution with the deterministic message assignment) Data redistribution of 푘 × 푡 pieces to 푘 groups can be implemented to run in time 푛 ExchÈ (푡, , O(푘)) . 푡 Proof. We first consider the time to calculate the message assignment. Summing up the time we have accounted for the message assignment algorithm, we get time ExchÈ (푡, O(푘), O(푘)). Recall from Section 5.2 that ExchÈ (푡, ⋅, 푘) absorbs terms of the form O(훼 log 푡 + 훽푘). We now consider the time for the actual data exchange. Each PE sends and receives 푛~푡 elements since each PE provides 푛~푡 input elements and an initial capacity of 푛~푡. The message assignment splits each large piece into at most three messages—small pieces are not split at all. This sums up to O(푘) outgoing messages per PE. Each PE gets assigned at most 푘 small

116 6.4 Delivering 푘 Data Partitions to 푘 PE-Groups pieces of total size ≤ 푛~(2푡) in step two of the message assignment. In step three, a maximum of 2푘 + 1 large pieces is assigned to each PE to cover its residual capacity—one piece may wrap over from the previous PE, followed by pieces of size ≥ 푛~(푘푡), and a final piece may wrap over ( 푛 O( )) to the next PE. Thus, the data exchange can be done in time Exch 푡, 푡 , 푘 . È ( 푛 O( )) Overall, the data redistribution takes time Exch 푡, 푡 , 푘 (recall from Section 5.2 that ExchÈ (푡, ⋅, 푘) absorbs terms of the form ExchÈ (푡, 푘, 푘)). ◻

For applications with slightly imbalanced data like AMS-sort, some groups may receive up to (1 + 휖)푛~푡 elements, some groups may receive much less than 푛~푘 elements, and 푘 may not divide 푡 evenly. We address this by adjusting residual capacities of all PEs to (1 + 휖)푛~푡, ensuring that all PE-groups have a size of ⌊푡~푘⌋ or ⌈푡~푘⌉, and by carefully implementing corner È ( ( + ) 푛 O( )) cases. The data redistribution then takes time Exch 푡, 1 휖 푡 , 푘 .

117 6 Robust Scalable Distributed Sorting Algorithms

Algorithm 11 Deterministic Message Assignment procedure DMA(푛, 푡, 푘, 푚, 푖) Input: 푛 total input size, 푡 PEs, 푘 PE-groups, 푚 = [푚표 .. 푚푘) number of elements for each group, 푖 my PE index within my PE-group 휋 ← enumerateSmallPieces(푏) ▷ Vectorized exclusive collective prefix sum Π ← enumerateLargePieces(푏) ▷ 휋, Π ∈ ℕ푘 for 푗 ∈ [0 .. 푡~푘) do ▷ Route pieces to PE-groups in time Exch(푡, O(푘), O(푘)) if 푚푗 < 푛~(2푡푘) then Send small piece of size 푚푗 to PE ⌊휋푗~푘⌋ of group 푗 else Send large piece of size 푚푗 to PE (Π푗 mod 푘) of group 푗 ⟨푆, 퐿⟩ ← receiveSmallAndLargePieces() ▷ Pieces store source PE and size 푎 ← calculateAssignmentToGroup(푛, 푡, 푘, 푆, 퐿, 푖) Send assignment 푎 to requesters ▷ Exch(푡, O(푘), O(푘)) return receiveAssginment() procedure calculateAssignmentToGroup(푛, 푡, 푘, 푆, 퐿, 푖) ▷ Executed by PE-group 푎 ← AssignPiecesToMyself(푆) ▷ Returns assignment of small pieces 푐 ← 푛~푡 − ∑푠∈푆 size(푠) ▷ Own residual capacity 푖 푐′ ← ∑푗 0 푐 @ PE 푗 ▷ Residual capacity of preceding PEs (collective excl. prefix sum) 퐶 ← ⟨푖, 푐′, 푐′ + 푐⟩ ▷ Residual piece = 푢 ← ∑푙∈퐿 size(푙) ▷ Total size of local large pieces 푖 1 휇 ← ∑푗 0 푢 @ PE 푗 ▷ Collective exclusive prefix sum over large piece sizes 푆 ← ⟨푖,− 휇, 휇 + ∑ size(푙)⟩ ▷ Create super piece = 푙∈퐿 for each 푙 ∈ 퐿 do ▷ Create prefix representation of large pieces 푈 ← 푈 ∪ ⟨source(푙), 휇, 휇 + size(푙)⟩ 휇 ← 휇 + size(푙) ⟨푋, 푌⟩ ← distributedMerge(퐶, 푆) ▷ Merge sorted residual and super pieces 훾 ← countSuperPieces(푋, 푌) 푖 1 표 ← ∑푗 0 훾 ▷ Number of super pieces on preceding PEs (collective excl. prefix sum) if isResidualPiece− (푋) then ▷ Bring residual pieces and large pieces together = Send 푋 to PE [표 − 1 .. 표 + 2] if isResidualPiece(푌) then Send 푌 to PE [표 .. 표 + 3] else if isResidualPiece(푌) then Send 푌 to PE [표 − 1 .. 표 + 2] 퐶 ← receiveResidualPieces() ▷ Exch(푡, O(푘), O(푘)) return 푎 ∪ createLargePieceAssignment(퐶, 푈) ▷ Merge pieces and calculate intersections

118 6.5 Generalizing Multiway Mergesort (RLM-sort)

6.5 Generalizing Multiway Mergesort (RLM-sort)

In this section, we describe Recursive Last Multiway Mergesort (RLM-sort). RLM-sort is the first recursive multiway sorting algorithm that we propose in this thesis. Compared to our second recursive multiway algorithm AMS-sort, which we describe in the next section, RLM- sort simplifies several discussions since RLM-sort guarantees perfect load balancing. RLM-sort subdivides the PEs into “natural” groups of size 푡′ on which it to recurses. Asymptotically, 푘 ∶= 푡~푡′ around 푡1 푟 is a good choice if we want to execute 푟 levels of recursion. However, we may also fix 푡′ based~ on architectural properties. For example, in a cluster of many-core machines, we might choose 푡′ as the number of cores in one node. Similarly, if the network has a natural hierarchy, we will adapt 푡′ to that situation. For example, if PEs within a rack are more tightly connected than inter-rack connections, we may choose 푡′ to be the number of PEs within a rack. Other networks, e.g., meshes or tori have less pronounced cutting points. However, it still makes sense to map groups to subnetworks with nice properties, e.g., nearly cubic subnetworks. For simplicity, we will assume that 푡 is divisible by 푡′, and that 푘 = Θ‰푡1 푟Ž. There are several ways to define recursive multiway mergesort. We describe~ a method we call “recurse last” (see Figure 6.5) that needs to communicate the data only 푟 times and avoids problems with many small messages. Every PE sorts locally first. Then each of these 푡 sorted sequences is partitioned into 푘 pieces in such a way that the sum of these piece sizes is 푛~푘 for each of these 푘 resulting parts. In contrast to the single-level algorithm, we run only 푘 multisequence selections in parallel and thus reduce the bottleneck due to multisequence selection by a factor of 푡′. Now we have to move the data to the responsible groups. We refer to Section 6.4 which È ( 푛 O‰ 1 푟Ž) shows how this is possible using time Exch 푡, 푡 , 푡 . Afterwards, group 푖 stores elements that are no larger~ than any element in group 푖 + 1 and it suffices to recurse within each group. However, we do not want to ignore the information already available—each PE does not store an entirely unsorted array but a number of sorted sequences. This information is easy to use though—we merge these sequences locally first and obtain locally sorted data, which can then be subjected to the next round of splitting. Algorithm 12 provides pseudocode of the RLM-sort algorithm.

Theorem 6.13 (Running time of RLM-sort with O(1) recursion levels) RLM-sort with 푟 = O(1) levels of recursion can be implemented to run in time

O‰‰ + 1 푟 + 1 푟 푛 + 푛 Ž Ž+ 훼 log 푡 푡 훽 푡 log 푡 푡 log 푛 푟 ~ ~ È ‰ 푖 푟 푛 O‰ 1 푟ŽŽ ∑ Exch 푡 , 푡 , 푡 . (6.3) 푖 1 ~ ~ = O‰ 푛 Ž = O( ) Proof. (Outline) Local sorting takes time 푡 log 푛 . For 푟 1 multiselections we get the O‰( + + 푛 ) Ž bound 훼 log 푡 푘훽 푘 log 푡 log 푛 from Equation (5.1). Summing the latter two contri- butions, we get the first term of Equation (6.3). In level 푖 of the recursion we have 푘푖 independent groups containing 푡~푘푖 = 푡~푡푖 푟 = 푡1 푖 푟 푛 È ( 1 푖 푟 O‰ 1 푟Ž) ~ − ~ PEs each. An exchange within the group in level 푖 costs Exch 푡 , 푡 , 푡 time. Since all − ~ ~

119 6 Robust Scalable Distributed Sorting Algorithms

PE 1 PE 2 PE 3 PE 4 푝 = 4 sort locally < < < < 푘-way parallel multi-select with 푘 = Θ‰푝1 푟Ž = 41 2. < < < < ~ ~ 푛 Exch(푝, 푝 , O(푘))

execution < < < < < < < < < <

group 1 merge locally group 2 < < < < 푛 푝 푛 푝 recurse on 푘 items with 푘 PEs recurse on 푘 items with 푘 PEs

Figure 6.5: Algorithm schema of Recurse Last Parallel Multiway Mergesort. independent exchanges are performed simultaneously, we only need to sum over the 푟 recursive levels, which yields the second term of Equation (6.3). ◻

Equation (6.3) is a fairly complicated expression but using some reasonable assumptions we can simplify the equation. If all communications are equally expensive, the sum becomes È ( 푛 O‰ 1 푟Ž) 푟 Exch 푡, 푡 , 푡 , i.e, we have 푟 message exchanges involving all the data but we limit the number of startups~ to O‰푡1 푟Ž. On the other hand, on mesh or torus networks, the first ~ È ( 푛 O‰ 1 푟Ž) (global) exchange will dominate the cost and we get Exch 푡, 푡 , 푡 for the sum. If we also assume that data is delivered directly, Ω(푡1 푟) startups hidden in~ the ExchÈ () term will 2 dominate the O‰log 푡Ž startups in the remaining~ algorithm. We can assume that 푛 is bounded by a polynomial in 푡—otherwise, a traditional single-phase multiway mergesort would be a better algorithm. This implies that log 푛 = Θ(log 푡). Furthermore if 푛 = 휔(푡1 1 푟 log 푡) then 1 푛 ~ = ( 푟 ) ‰ Ž + ~ 푛 푡 휔 푡 log 푡 and the term Ω 훽 푡 hidden in the data exchange term dominates the term 1 푛 O‰ 푟 Ž O‰ Ž 훽푡 log 푛 . Thus Equation (6.3) simplifies to 푡 log 푛 (essentially the time for internal sorting) plus the data exchange term. È O‰ 푛 Ž If we also assume 훼 and 훽 to be constants and estimate Exch-term as 푡 , we get execution time 푛 O‰푡1 푟 log2 푡 + log 푛Ž . ~ 푡 From this, we can infer O‰푡1 1 푟 log 푡Ž as isoefficiency function. + ~ 6.6 Adaptive Multi-Level Samplesort (AMS-sort)

A good starting point is the recursive multiway samplesort algorithm by Gerbessiotis and Valiant [GV94]. However, they use centralized sorting of the sample and their data redistribu- tion may lead to some PEs receiving Ω(푡) messages (see also Section 6.4). We improve this algorithm in several ways to achieve a truly scalable algorithm. First, we sort the sample using

120 6.6 Adaptive Multi-Level Samplesort (AMS-sort)

Algorithm 12 RLM-sort Input: 퐴[1 .. 푛~푡] data elements, 푡 PEs, 푘 PE-groups 퐴′ ← sort(퐴) return RLM-recursive(퐴′, 푡, 푘)

function RLM-recursive(퐴, 푡, 푘) Input: 퐴[1 .. 푛~푡] data elements, 푡 PEs, 푘 PE-groups 퐿 ∈ ℕ푘 ← temp. array of local bucket sizes induced by equidistant ranks 퐺 ∈ ℕ푘 ← temp. array of global bucket sizes induced by equidistant ranks if 푡 = 1 then return 퐴 for 푖 ← [1 .. 푘) do vectorized 퐿[푖] ← multisequenceSelection(퐴, 푛~푘 ⋅ 푖)▷ Select splitter by global rank 푛~푘 ⋅ 푖 퐺[푖] ← 푛~푘 ⋅ 푖 푀 ← messageAssignment(퐿, 퐺, 푘) ▷ Assign local (sub)buckets to PEs of PE-groups 푀′ ← dataExchange(퐴, 푀) ▷ Send messages 푀 and receive messages 푀′ 퐴′ ← merge(푀′) ▷ Merge received messages return RLM-recursive(퐴′, 푡~푘, 푘) fast parallel sorting. Second, we use the advanced data exchange algorithms described in Sec- tion 6.4, and third, we give a scalable parallel adaptation of the idea of overpartitioning [LS94] in order to reduce the sample size needed for good load balance and a smaller memory footprint on the nodes. We state the lemmas and theorems in this section without proofs. For details, we refer to our initial publication of AMS-sort [Axt+15a]. But back to our version of recursive multiway samplesort (see Figure 6.6). As in RLM-sort, our intention is to split the PEs into 푘 groups of size 푡′ = 푡~푘 each, such that each group processes elements with consecutive ranks. To achieve this, we choose a random sample of size 푎푏푘 where the oversampling factor 푎 and the overpartitioning factor 푏 are tuning parameters. The sample is sorted using a fast sorting algorithm. We assume the fast inefficient algorithm from O‰ 푎푏푘 푎푏푘 + 푎푏푘 + Ž Section 6.1. Its execution time is 푡 log 푡 훽 푡1 2 훼 log 푡 . From the sorted sample, we choose 푏푘 − 1 splitter~ elements with equidistant rank. These splitters are broadcast to all PEs. This is possible in time O(훽푏푘 + 훼 log 푡). Then every PE partitions its local data into 푏푘 buckets corresponding to these splitters. This takes time O‰ 푛 ( )Ž 푡 log 푏푘 . Using a collective all-reduce, we then determine global bucket sizes in time O(훽푏푘 + 훼 log 푡). These can be used to assign buckets to PE-groups in a load-balanced way: Given an upper bound 퐿 on the number of elements per PE-group, we can scan through the array of bucket sizes and skip to the next PE-group when the total load would exceed 퐿. Using binary search on 퐿, this finds an optimal value for 퐿 in time O(푏푘 log 푛) using a sequential algorithm.

Lemma 6.14 (Optimality of overpartitioning) The above binary search scanning algorithm indeed finds the optimal 퐿.

121 6 Robust Scalable Distributed Sorting Algorithms

PE 1 PE 2 PE 3 PE 4 푝 = 4 pick 푎푏푘 samples and perform fast work inefficient sort

1 푟 1 2 partition locally by (푏푘)−1 splitters 푠1, 푠2, 푠3 with 푘 = Θ‰푝 Ž = 4 . < < < < < < < < < < < < ~ ~ 푠1 푠2 푠3 푠1 푠2 푠3 푠1 푠2 푠3 푠1 푠2 푠3 select group boundaries using binary search

execution < < < < < < < < < < < < 푛 group 1 group 2 Exch(푝, 푝 (1 + 휀), O(푘))

푛 ( − ) 푝 푛 ( + ) 푝 recurse on 푘 1 휀 items with 푘 PEs recurse on 푘 1 휀 items with 푘 PEs

Figure 6.6: Algorithm schema of AMS-sort.

The binary scanning algorithm finds arbitrarily good splitters with high probability for appropriate 푎 and 푏.

Lemma 6.15 (Imbalance bound of AMS-sort for one recursion level) = ( + ) 푛 = ( ~ ) We can achieve 퐿 1 휀 푘 with high probability choosing appropriate 푏 Ω 1 휀 and 푎푏 = Ω(log 푘).

We improve the assignment of buckets to PE-groups to O(푏푘 log 푏푘) and, using paralleliza- tion, even to O(푏푘 + 훼 log 푡). The first observation for improving the is that a PE-group size can take only O‰(푏푘)2Ž different values since it is defined by a range of buckets. We can modify the binary search in such a way that it operates not over all con- ceivable group sizes but only over those corresponding to ranges of buckets. When a scanning step succeeds, we can safely reduce the upper bound for the binary search to the largest PE- group actually used. On the other hand, when a scanning step fails, we can increase the lower bound: during the scan, whenever we finish a PE-group of size 푥 because the next bucket of size 푦 does not fit (i.e., 푥 + 푦 > 퐿), we compute 푧 = 푥 + 푦. The minimum over all observed 푧-values is the new lower bound. This is safe since a value of the scanning bound 퐿 less than 푧 will reproduce the same failed partition. This already yields an algorithm running in time O‰푏푘 log((푏푘)2)Ž = O(푏푘 log(푏푘)). The second observation is that only values for 퐿 in the range [⌈푛~푟 − 1⌉ .. (1 + O(1~푏))푛~푘] are relevant (see Lemma 6.15). Only O(푏푘) bucket ranges will have a total size in this range. To see this, consider any particular starting bucket for a bucket range. Searching from there to the right for range end points, we can skip all end buckets where the total size is below 푛~푘. We can stop as soon as the total size leaves the relevant range. Since buckets have an average size of O(푛~푏), only a constant number of end points will be in the relevant range on the average. Overall, we get O(푏푘) ⋅ O(1) = O(푏푘) relevant bucket ranges. Using this for initializing, the binary search speeds up the sequential algorithm by a factor of about two. Using all 푡 available PEs, we can do even better: in each iteration, we split the remaining range for 퐿 evenly into 푡 + 1 subranges. Each PE tries one subrange end point for scanning and

122 6.6 Adaptive Multi-Level Samplesort (AMS-sort)

Algorithm 13 AMS-sort Input: 퐴[1 .. 푛~푡] data elements, 푡 PEs, 푘 PE-groups, 푎 oversampling factor, 푏 overpartitioning factor 푙 ∈ ℕ푎푏푘 ← temp. array of local bucket sizes induced by splitters 푔 ∈ ℕ푎푏푘 ← temp. array of global bucket sizes induced by splitters 퐿 ∈ ℕ푘 ← temp. array of local bucket sizes induced by grouped buckets 퐺 ∈ ℕ푘 ← temp. array of global bucket sizes induced by grouped buckets if 푡 = 1 then 퐴′ ← sort(퐴) return 퐴′ 푆 ← chooseRandomSample(퐴, 푎푏푘) ▷ S푆S = 푎푏푘 푅 ← FastWorkInefficientRanking(푆) 퐸 ← chooseEquidistant(푆, 푅, 푏푘 − 1) ▷ S퐸S = 푏푘 − 1 퐸′ ← allgatherMerge(퐸) 푙 ← localPartition(퐴, 퐸′) ▷ Partition local data into 푏푘 buckets 푔 ← allreduce(l) ▷ Determine global bucket sizes (collective all-reduce) (퐿, 퐺) ← fastBucketGrouping(푙, 푔) 푀 ← messageAssignment(퐿, 퐺, 푘) ▷ Assign local (sub)buckets to PEs of PE-groups 퐴′ ← dataExchange(퐴, 푀) ▷ Exchange data according to messages 푀 return Ams-sort(퐴′, 푡~푘, 푘, 푎, 푏) ▷ Recurse on PE-groups uses the first observation to round up or down to an actually occurring size of a bucket range. Using a collective reduce operation we find the largest 퐿-value 퐿min for a failed scan and the smallest 퐿-value 퐿max for a successful scan. When 퐿max = 퐿min we have found the optimal value for 퐿. Otherwise, we continue with the range [퐿max .. 퐿min]. Since the bucket range sizes in the O( ) feasible region are fairly uniformly distributed, the number of iterations will be log푡 1 푏푘 . Since 푡 ≥ 푘, this is O(1) if 푏 is polynomial in 푘. Indeed, one or two iterations are likely to succeed in all reasonable cases. + The data splitting defined by the bucket group is then the input for the data exchange algorithm described in Section 6.4. This takes time ExchÈ (푡, (1 + 표(1))퐿, (2 + 표(1))푘)). We recurse on the PE-groups similar to Section 6.6. Within the recursion it can be exploited that the elements are already partitioned into 푏푘 buckets. We get the following overall execution time for one level:

Lemma 6.16 (Running time of single-level AMS-sort) One level of AMS-sort works in time

푛 푘 푘 O‰ log + 훽 Ž + ExchÈ (푡, (1 + 휀) 푛 , O(푘)) . (6.4) 푡 휀 휀 푡

Compared to previous implementations of samplesort, including the one from Gerbessiotis and Valiant [GV94], AMS-sort improves the sample size from O‰푡 log 푡~휀2Ž to O(푡(log 푘 + 1~휀)) and the number of startup overheads in the Exch-term from O(푡) to O(푘). In 2019, Harsh

123 6 Robust Scalable Distributed Sorting Algorithms

OŠ log 푝  OŠ log 푝  et al. [HKS19] improves these results to a total of 푝 log 휖 samples in log 휖 rounds of histogramming. In the base case of AMS-sort, when the recursion reaches a single PE, the local data is sorted sequentially. Algorithm 13 provides pseudocode of AMS-sort.

Theorem 6.17 (Running time of AMS-sort with 푟 recursion levels) Adaptive multi-level samplesort (AMS-sort) with 푟 levels of recursion and an imbalance factor (1 + 휀) in the output can be implemented to run in time

2 1 푟 푟 푛 푟 푡 푖 푛 1 푟 O‰ log 푛 + 훽 Ž + ∑ ExchÈ Š푡 푟 , (1 + 휀) , O‰푡 Ž ~ 푡 푡 휀 푖 1 ~ = O( ~ ) 1 = O‰ 1 푟Ž = if 푟 log 푡 log log 푡 and 휀 푛 . ~ Using a similar argument as for RLM-sort, for constant 푟 and 휀, we get an isoefficiency function of 푡1 1 푟~ log 푡 for 푘 = 푡1 푟. This is a factor log2 푡 better than for RLM-sort and an indication that+ AMS-sort~ might be~ the better algorithm—in particular, if some imbalance in the output is acceptable and if the inputs are rather small. Another indicator for the good scalability of AMS-sort is that we can view it as a generaliza- tion of parallel quicksort that also works efficiently for very small inputs. For example, suppose 푛 = O(푡 log 푡) and 1~휀 = O(1). We run 푟 = O(log 푡) levels of AMS-sort with 푘 = O(1) and 휀′ = O(푟~휀). This yields running time O‰log2 푡 log log 푡 + 훼 log2 푡Ž. This does a factor O(log log 푡) more local work than an asymptotically optimal algorithm. However, this is likely to be irrelevant in practice since we expect that 훼 ≫ log log 푡 for most cases. Also the factor log log 푡 would disappear in an implementation that exploits the information gained during bucket partitioning.

Robustness In this work, we tested two tie-breaking approaches. The first approach turns inputs with duplicates into inputs without duplicates. Each element 푒 is implicitly represented as a tuple (푒, 푖, 푚) where 푖 is the index of the source PE and 푚 is the position in the local input of PE 푖. The partitioning step compares the local elements to the splitters lexicographically. An explicit representation of the tuples is only required for the splitters. Depending on the 푖- and 푚-values of elements and splitters, we break ties by either using the comparator function < or by using the comparator function ≤. As all local elements share the same 푖-value and as elements that are close together have similar 푚-values we can apply the same comparator function to a large precalculated range of elements. Thus, the overhead is very small. However, this approach has two disadvantages. First, to become cache-efficient, we perform multiple partitioning passes if the number of partitions is large. This makes the tracking of the 푖- and 푚-values somewhat complicated. Second, this approach does not give us any flexibility in distributing duplicated elements to different PEs for the sake of low imbalance. For AMS-sort, we propose a different approach. We partition the local elements into buckets and add additional equality buckets for elements corresponding to a splitter. To break ties, we

124 6.6 Adaptive Multi-Level Samplesort (AMS-sort) 0 z 1 3 z z 2 1 z z 2 0 3 z z z L L Bound Bound

PE 0 PE 1 PE 2 PE 3 Leftover PE 0 PE 1 PE 2 PE 3

Equal bucket Non-equal bucket

Figure 6.7: Scanning algorithm examples with bound 퐿. 퐿 is a lower bound in the left example and an upper bound in the right example. refine the algorithm that assigns buckets to PE-groups. To verify a bound 퐿, we scan through the array of bucket sizes and skip to the next PE-group when the total load would exceed 퐿. Equality buckets are allowed to overlap one or multiple PE-groups. Our binary search should not operate over all conceivable group sizes but only over those corresponding to ranges of buckets. Thus, we adjust 퐿 before we continue the binary search algorithm of the bucket assignment. If the scanning algorithm was not able to assign all buckets, we can safely increase 퐿 until the scanning algorithm moves a bucket to a prior PE-group. Otherwise, we decrease 퐿 until a bucket would have to be moved to a subsequent PE-group. We first consider the case that the scanning algorithm failed. Let 푔 be the residual capacity of PE-group 푖, let 푤 be the capacity of the first bucket assigned to PE-group 푖+1, and let 푥 be the number of PE-groups prior to PE-group 푖 whose last bucket overlaps with its successor. Note that we can track 푥 for all PE-groups in O(푟) time when we process the PE-groups in ascending order. For PE-group 푖 we compute 푧푖 = 퐿 + (푤 − 푔)~(푥 + 1). The minimum over all observed 푧-values is the new lower bound. To understand that this is safe, we consider the 푧푗’s from previous PE-groups 푗 ∈ [0 .. 푖). On the one hand, PE-group 푖 needs 푤 − 푔 additional residual capacity. On the other hand, an increase of 퐿 by (푤 − 푔)~(푥 + 1) increases the total capacity of PE-group 푖 by (푤 − 푔)~(푥 + 1) and adds additional residual capacity of 푥(푤 − 푔)~(푥 + 1) to PE-groups [푖 − 푥 .. 푖). This additional residual capacity must be pulled back by these PE-groups using their overlapping equality buckets to create 푤−푔 additional residual capacity on PE-group 푖. Thus, 푧푖 is a lower bound required to move the first bucket of PE-group 푖 + 1 to PE-group 푖. If min푗∈ 0 .. 푖 푥푗 ≤ 푧푖, this lower bound is already undercut by previous PE-groups and we do not have to consider 푧푖 at all. If min 푥푗 > 푧푖, the value 푧푖 is also an upper bound required to [ ) 푗∈ 0 .. 푖 move the first bucket of PE-group 푖 + 1 to PE-group 푖 since PE-groups 푗 ∈ [푖 − 푥 .. 푖) are actually [ ) able to pull back enough elements using their overlapping equality buckets.

125 6 Robust Scalable Distributed Sorting Algorithms

When the scanning algorithm succeeds, we use a similar algorithm to update 퐿. The idea is that we allow PE-groups to push parts of their equality buckets to subsequent PE-groups as long as the equality buckets do not wrap over to another PE-group. Figure 6.7 depicts two examples of the scanning algorithm. In one example, the bound 퐿 is a lower bound and we increase 퐿 by the minimum of the 푧-values (here 푧0 respectively 푧3). In the other example, the bound 퐿 is an upper bound and we decrease 퐿 by the minimum of the 푧-values (here 푧0). Compared to the first tie-breaking approach, we can exploit equality buckets to decrease the load imbalance and simplify the partitioning algorithm. For example, we can use the partitioning step from Section 3.1 to partition the local input and to create equality buckets efficiently. A disadvantage is that the assignment of buckets to PE-groups becomes more complicated.

126 Chapter 7 Experiments and Conclusion 7

In this chapter, we present extensive experimental results from 11 different sorting algo- rithms, ten data distributions, and two supercomputers with up to 262 144 cores. On the one hand, we compare our algorithms to their competitors for a wide spectrum of inputs. On the other hand, we consider each algorithm separately and push their robustness to the limits with publicly available data distributions but also with our own worst case input distributions. To better understand which algorithmic measures have an impact under which circumstances we examine different implementation stages of our algorithms. Algorithms. We included the following implementations in our benchmark set: • GatherMsort: Our simple binomial-tree gather-merge-sort from Section 5.3.1. • RFIR: Our robust fast work-inefficient ranking from Section 6.1. • RFIS: Our robust fast work-inefficient sorting from Section 6.1. • RQuick: Our robust hypercube quicksort from Section 6.3. • BitonicSort: Our bitonic sort for arbitrary values of 푡 from Section 7.1. • AMS-sort: Our recursive multiway samplesort algorithm from Section 6.6. • Minisort: Our reimplementation of distributed quicksort from Siebert and Wolf [SW11] for the case 푛 = 푡 (also see Section 7.1). • RLMS-sort: Our MPI reimplementation of single-level histogram sort with sampling from Harsh et al. [HKS19] (again see Section 7.1). • HykQuick: Hypercube quicksort from Sundar et al. [SMB13]. • HykSort: Multiway hypercube mergesort from Sundar et al. [SMB13]. • HSS: Multiway hypercube histogram sort with sampling from Harsh et al. [HKS19]. Wedo not present results of allgather-merge-sort from Sections 5.1.1 and 5.3.1 since allgather- merge-sort was slower than RFIS and GatherMsort for any input size we have tested. Input Distributions. We ran benchmarks with ten input distributions of 64-bit floating-point elements. We report results for seven input distributions proposed by Helman et al. [HBJ98], i.e., Uniform (independent random values), BucketSorted (locally random, globally sorted), DeterDupl (only log 푡 different keys), Staggered (hypercube-like routing with perfect splitters would have a load imbalance factor of 2), g-Group with 푔 ≔ 푡1 2, Zero (all elements are equal), and RandDupl (32 local buckets of random size, each filled~ with an arbitrary value from [0 .. 32)). We do not present results for the input distribution Gaussian by Helman et al. [HBJ98] since all algorithms sort this distribution as fast as Uniform input. The input distribution


The input distribution BucketSorted is meaningless for 푛 < 푡^2 and the input distribution g-Group is meaningless for 푛 < 푡^{3/2}. Additionally, we included the input distributions AllToOne and Mirrored. The first 푛/푡 − 1 elements of AllToOne on PE 푖 are random elements from the range [푡 + (푡 − 푖)(2^32 − 푡)/푡 .. 푡 + (푡 − 푖 + 1)(2^32 − 푡)/푡] and the last element has the value 푡 − 푖. At the first level of a naive implementation of recursive multiway samplesort, the last min(푡, 푛/푡) PEs would send a message to PE 0. The input of the distribution Mirrored on PE 푖 consists of random numbers between 2^31 푗/푡 and 2^31 (푗 + 1)/푡 where the integer bit representation of 푗 is the reversed bit representation of 푖. Mirrored is a generalization of the bit-reversal permutation by Thomson Leighton [Lei92, Sec. 3.4.2] to arbitrary 푛. For naive hypercube-based sorting algorithms, this distribution has an intermediate imbalance of ⌊푛/푡^{1/2}⌋ elements in the case of perfect splitters. We refer to Section 5.1.1 for more details.

Supercomputers. The results of our sorting experiments were obtained on two supercomputers. The first supercomputer, SuperMUC-NG [Lei18], is a distributed system with a two-level hierarchical network. The computation nodes of SuperMUC-NG are bundled into eight so-called islands, each equipped with 788 nodes. Each node has two Intel Skylake Xeon Platinum 8174 24-core CPUs with a standard frequency of 3.1 GHz and 96 GByte of memory. A fat-tree network topology connects the nodes within an island for highly efficient communication using the Intel OmniPath interconnect network technology. Computation nodes are connected to the fat tree by Intel OPA100 adapters. At the level of islands, the overall communication bandwidth is pruned by a factor of four. When we allocate a job on SuperMUC-NG, we always allocate full islands to minimize running time fluctuations caused by other jobs on the same island. The maximum number of islands available to us was four. However, we used only two instead of four islands since we measured large running time fluctuations for small inputs on four islands.

The second supercomputer, JUQUEEN [SD15], is an IBM BlueGene/Q based system of 56 so-called midplanes, each equipped with 512 compute nodes. Each compute node has one IBM PowerPC A2 16-core CPU with a nominal frequency of 1.6 GHz and 16 GByte of memory. A 5-D torus network with 40 GB/s bandwidth and 2.5 휇s worst-case latency connects the nodes. Each midplane connects its compute nodes with a 4×4×4×4×2 torus. The user can specify the number of midplanes per torus dimension. In our experiments, we configured the tori such that the sizes of the dimensions are as equal as possible. The maximum number of midplanes available to us was 32. JUQUEEN was switched off in 2018.

We also executed preliminary experiments on SuperMUC Phase 1 and Phase 2 [Lei15]. However, the running time of the algorithms fluctuated a lot on these machines for large 푡. In Appendix B.1, we discuss these fluctuations measured on the SuperMUC supercomputers in more detail.

Overview. This chapter is organized as follows. We give implementation details in Section 7.1 and describe our parameter tuning in Section 7.1.2. In Section 7.2 we define the methodology of our experiments. Section 7.3 presents results for our algorithms and our closest competitors HykSort and HSS for eight input distributions on SuperMUC-NG with 2^16 cores and on JUQUEEN with 2^18 cores. In Section 7.4, we compare our robust algorithms to their nonrobust versions and their competitors.
In Section 7.5, we discuss the efficiency of RFIS, RQuick, and AMS-sort in terms of weak and strong scaling. Finally, we summarize our work on robust and scalable distributed sorting algorithms in Section 7.6.


7.1 Implementation Details

In our publication [AWS18] we propose the lightweight library RangeBasedComm (RBC) based on MPI (also see Appendix B.2). RBC provides routines to create RBC communicators of consecutive PE ranges, (non)blocking collective operations, and (non)blocking point-to-point communication. RBC provides the option to switch back to MPI communicators and collectives if desired. All algorithms that were implemented in this work use the RBC library. We also implemented RBC versions of our competitors HykQuick (hypercube quicksort from Sundar et al. [SMB13]), HykSort (multiway hypercube mergesort from Sundar et al. [SMB13]), and HSS (multiway hypercube histogram sort with sampling from Harsh et al. [HKS19]) by replacing the MPI routines with their RBC counterparts.

Our binomial-tree gather-merge-sort implementation GatherMsort performs three steps. First, each PE calculates the total input size of its subtree by propagating the input sizes towards the root node. Second, each PE allocates a temporary array that will store the data from its subtree. Finally, the actual binomial-tree gather-merge-sort algorithm gathers the input on the root node in sorted order.

Sundar et al. [SMB13] propose a distributed bitonic sort algorithm. However, it turned out that their bitonic sort implementation is not practical, i.e., it requires the same input size on each PE, it does not perform all routines in parallel when executed with multiple threads, and it is slowed down by unnecessary shared-memory code in the sequential case. Thus, we decided to compare our algorithms to BitonicSort, our own robust implementation of bitonic sort. BitonicSort allows arbitrary values of 푡 [Lan06]. To minimize message startups, each pairwise data exchange of BitonicSort transfers all local elements and the PEs keep the left respectively the right half of the merged sequence. We also tested a version of BitonicSort with multisequence selection (see Section 5.1.3) to minimize the communication volume. However, this approach was significantly slower for small inputs, where startup overheads matter, and for large inputs, one would consider other algorithms anyway. In order to support an arbitrary number of local input elements, we determine the maximum local input size 푚 of BitonicSort with a collective all-reduce. After each merge operation, we then split the result into one part containing the first ⌈푚/2⌉ elements (if available) and one part containing the remaining elements.

The implementation of AMS-sort borrows the partitioning step from IPS4o, which already supports equality buckets. To become more cache-efficient, we limit the number of splitters in a partitioning step to 푘 = 256 and perform multiple levels if required. For the data exchange step of AMS-sort, we extended the 1-factor data exchange algorithm [ST02] to send and receive messages using multiple connections simultaneously. For more details, we refer to the discussion on irregular data exchanges in Section 5.2. To guarantee a maximum imbalance factor of 1 + 휖 = 1.1, we configure AMS-sort as follows: On each recursion level, 푟-level AMS-sort accepts an additional imbalance factor of 1 + 휖′ = (1 + 휖)^{1/푟}. Thus, we use an overpartitioning factor 푏 = 2/휖′ and an oversampling factor of 푎 = max(2/휖′, 13 log_2 푡)/푏. In our experiments, the imbalance of AMS-sort was always smaller than 0.1 (except for 푛/푡 ≤ 16). If 푡 is not a power of 푘, AMS-sort creates 푘 PE-groups with ⌊푡/푘⌋ respectively ⌈푡/푘⌉ PEs.
The merging step of our deterministic message assignment algorithm uses the merge operation of BitonicSort.
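The parameter choices described above for AMS-sort can be illustrated with a small helper. This is only a sketch under the assumption that 휖′, 푏, and 푎 are computed exactly as stated in the text; the struct and function names are made up for this example.

```cpp
#include <algorithm>
#include <cmath>

// Per-level configuration of AMS-sort as reconstructed from the text above.
struct AmsLevelConfig {
    double eps_level;  // additional imbalance eps' accepted per recursion level
    double b;          // overpartitioning factor
    double a;          // oversampling factor
};

AmsLevelConfig configure_ams(double eps, int r, long t) {
    AmsLevelConfig c;
    // (1 + eps')^r = 1 + eps  =>  eps' = (1 + eps)^(1/r) - 1
    c.eps_level = std::pow(1.0 + eps, 1.0 / r) - 1.0;
    c.b = 2.0 / c.eps_level;
    c.a = std::max(2.0 / c.eps_level, 13.0 * std::log2((double)t)) / c.b;
    return c;
}

// Example: eps = 0.1, r = 2 levels, t = 65536 PEs.
// AmsLevelConfig c = configure_ams(0.1, 2, 1L << 16);
```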


We provide our own implementation of Minisort, a distributed quicksort algorithm from Siebert and Wolf [SW11] for the case 푛 = 푡, as the source code used in their experiments is not available anymore. On the authors' recommendation, we use our own implementation (1LHSS) of single-level histogram sort with sampling from Harsh et al. [HKS19] as the original implementation uses the communication framework Charm++ instead of MPI. In each round of histogramming, 1LHSS adds each element to the sample with probability 5푘/푛. We stop histogramming as soon as the imbalance factor is below 1.1. 1LHSS uses a fast tournament tree implementation [SSP07; Bin18b] for efficient multiway merging.

All algorithms use the sequential version of IPS4o for local sorting. Our implementations, i.e., GatherMsort, RFIS, RFIR, RQuick, BitonicSort, AMS-sort, and 1LHSS, are in general capable of sorting all inputs (including the case in which some PEs do not have any local input elements) for arbitrary 푡. When 푡 is not a power of two, RQuick routes the elements stored on the additional PEs to the embedded hypercube.

The algorithms are written in C++ and communicate with MPI. On SuperMUC-NG, we compiled with version 19.0 of the Intel compiler, using the full optimization flags -xCORE-AVX512 and -O3. For inter-process communication, we use the Intel MPI library 2020. On JUQUEEN, the algorithms were compiled with version 4.8.1 of the GNU compiler collection, using the optimization flag -O2. For inter-process communication, we used the IBM MPI library mpich2 version 1.5.

The implementations of our algorithms AMS-sort, RFIS, RFIR, and RQuick as well as the reimplementations 1LHSS and BitonicSort can be found on the official website https://github.com/MichaelAxtmann/KaDiS. The implementation of GatherMsort is part of the RBC library published at https://github.com/MichaelAxtmann/RBC.

7.1.1 The 푘-way Data Exchange

We describe the implementation of the data exchange routine represented by the cost function Exch(푎, 푏, 푐) used by AMS-sort and RLM-sort. We know that each PE receives at most 푏 elements. Also, our data exchanges do not require a specific ordering of the incoming messages. Thus, we can receive messages in any order and we can preallocate sufficient storage at the receiving PEs. However, we need a technique that exchanges data without initially knowing the incoming messages at the receiving side; each PE only knows its outgoing messages.

We use an implementation of the data exchange algorithm NBX [HSL10]. NBX posts all send operations in a bulk and receives incoming messages once available. It combines a nonblocking barrier with synchronous send operations to identify when all messages have been successfully received. NBX is designed for data exchanges with O(log 푡) messages per PE and thus can efficiently send the messages in a bulk. We use a refined version of NBX since our 푘-way sorting algorithms exchange O(푘) messages. The value 푘 is usually much larger than O(log 푡), e.g., our recursive multiway algorithms with three levels use 푘 = 푡^{1/3} groups. Our implementation limits the number of incoming and outgoing messages to a maximum of 푚. We first create 푚 slots for send respectively receive operations and post the first 푚 send operations determined by the 1-factor algorithm. Then we continuously fill vacant send slots with the next send operation and vacant receive slots with a receive operation once an incoming message is ready for pickup.
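The following sketch illustrates an NBX-style exchange with a bounded number of in-flight sends. It is only an illustration of the general idea, not the implementation used in this work: messages are treated as raw bytes, receives are simplified to one blocking receive at a time instead of 푚 receive slots, and the function name is made up.

```cpp
#include <mpi.h>
#include <vector>

// NBX-style sliding-window exchange: at most m synchronous sends in flight,
// incoming messages received in arbitrary order into a preallocated buffer,
// termination detected with a nonblocking barrier (MPI_Ibarrier).
void sliding_window_exchange(MPI_Comm comm,
                             const std::vector<std::vector<char>>& out_msgs,
                             const std::vector<int>& destinations,
                             std::vector<char>& recv_buffer,  // preallocated
                             int m, int tag) {
    std::vector<MPI_Request> send_reqs;
    size_t next_send = 0, recv_offset = 0;
    bool barrier_started = false, done = false;
    MPI_Request barrier_req = MPI_REQUEST_NULL;

    while (!done) {
        // Fill vacant send slots (synchronous sends are required by NBX).
        while (send_reqs.size() < (size_t)m && next_send < out_msgs.size()) {
            MPI_Request req;
            MPI_Issend(out_msgs[next_send].data(),
                       (int)out_msgs[next_send].size(), MPI_CHAR,
                       destinations[next_send], tag, comm, &req);
            send_reqs.push_back(req);
            ++next_send;
        }
        // Retire completed sends to free their slots.
        for (size_t i = 0; i < send_reqs.size();) {
            int flag = 0;
            MPI_Test(&send_reqs[i], &flag, MPI_STATUS_IGNORE);
            if (flag) send_reqs.erase(send_reqs.begin() + i); else ++i;
        }
        // Receive one incoming message if one is ready for pickup.
        int flag = 0; MPI_Status status;
        MPI_Iprobe(MPI_ANY_SOURCE, tag, comm, &flag, &status);
        if (flag) {
            int count = 0;
            MPI_Get_count(&status, MPI_CHAR, &count);
            MPI_Recv(recv_buffer.data() + recv_offset, count, MPI_CHAR,
                     status.MPI_SOURCE, tag, comm, MPI_STATUS_IGNORE);
            recv_offset += count;
        }
        // All local sends matched: enter the nonblocking barrier.
        if (!barrier_started && next_send == out_msgs.size() && send_reqs.empty()) {
            MPI_Ibarrier(comm, &barrier_req);
            barrier_started = true;
        }
        if (barrier_started) {
            int bflag = 0;
            MPI_Test(&barrier_req, &bflag, MPI_STATUS_IGNORE);
            done = (bflag != 0);
        }
    }
}
```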


Since version 2.1, the MPI library MPICH uses a similar technique to limit the number of simultaneous message transfers in irregular data exchanges.

7.1.2 Parameter Tuning

We performed extensive parameter tuning on JUQUEEN and SuperMUC-NG. Each algorithm has a range of input sizes for which it is designed to be the fastest. We optimized each algorithm to run as fast as possible in this input range. All recursive multiway algorithms in the benchmark portfolio are attractive for relatively large inputs (i.e., 푛/푡 ≥ 2^12 on SuperMUC-NG and 푛/푡 ≥ 2^14 on JUQUEEN) as they come with large overheads for small inputs. We tested all recursive multiway algorithms with these input sizes and the Uniform input distribution. AMS-sort, HykSort, and HSS were executed with 푘 = 16, 32, 64, 128. For each algorithm, we selected the 푘 for which it performed best on average for large inputs: On SuperMUC-NG, the algorithms sorted large inputs the fastest with 푘 = 64. On JUQUEEN, AMS-sort and HykSort performed best with 푘 = 64 respectively 푘 = 32. The HSS algorithm had not yet been published at the time we ran the experiments on JUQUEEN.

AMS-sort does not provide shared-memory parallelism. HSS and HykSort were significantly slower on SuperMUC-NG with more than one thread. On JUQUEEN, HykSort was slightly faster with eight threads instead of one. For the sake of consistency, we present results for all algorithms with one thread per process. More levels speed up the recursive multiway algorithms for smaller inputs (by up to 50%) but slow them down for large inputs. Fewer levels slightly speed up the algorithms for very large inputs.

7.2 Methodology

In our experiments, we always execute one MPI process per core. For the sake of consistency, we continue to use the term PE, which from now on is also a synonym for MPI process. Unless stated otherwise, we use 65 536 cores on SuperMUC-NG respectively 262 144 cores on JUQUEEN and the input distribution Uniform.

Before each measurement, we invoke a global barrier to synchronize the PEs. We then measure the local wall times and use the maximum wall time as the running time of the run. On JUQUEEN we repeat each measurement five times (푛 ≤ 2^17) respectively two times (푛 ≥ 2^18) and average over the runs. We do not show error bars since the ratio between the maximum execution time and the average execution time is consistently less than a few percent. On SuperMUC-NG we noticed fluctuations in running time during warmup as well as during sparse data exchanges with non-deterministic communication partners. We study these issues in Appendix B.1. As a consequence, we decided to sort each input repeatedly until the running time exceeds two seconds, but at least twice, and average over the runs (see the sketch at the end of this section). This experimental setup compensates for fluctuations in the running time of quickly sorted (small) inputs and limits the time spent on expensive (large) inputs. Overall, we spent more than twenty million core-hours on JUQUEEN and SuperMUC-NG.

We benchmark algorithms in separate jobs since the running time of some algorithms influenced the running time of other algorithms executed in the same job. Our competitors HykSort and HSS suffer from deadlocks on some small inputs and some large input distributions with duplicates. As a consequence, we sort small inputs first and we submit multiple jobs for each of these algorithms; the jobs separate input distributions and inputs with 푛/푡 ≤ 2^7 and 푛/푡 > 2^7. The algorithms were executed for dense inputs with 푛/푡 = 2^푖, 푖 ∈ [0 .. 22] and sparse inputs with sparsity factors 3^푖, 푖 ∈ [1 .. 5]. A sparsity factor of 푡/푛 means that only the first 푛 PEs hold an input element. Dense inputs are evenly distributed among the PEs.

Unless stated otherwise, we executed the algorithms as follows: On SuperMUC-NG and JUQUEEN, our implementations GatherMsort, RFIR, RFIS, RQuick, BitonicSort, AMS-sort, and Minisort used RBC for communication. On SuperMUC-NG, we benchmarked our competitors HykQuick, HykSort, and HSS with RBC as well. On this supercomputer, our competitors were much slower with pure MPI, even when we excluded the time for communicator splitting. We refer the reader to Appendix B.2 for a detailed performance comparison of HykQuick, HykSort, and HSS with pure MPI and with RBC on SuperMUC-NG. On JUQUEEN, we benchmarked our competitors HykQuick, HykSort, and HSS with pure MPI since the collective communication operations provided by MPI were extremely fast on this supercomputer. When we execute our competitors with MPI collectives, they also require MPI sub-communicators. Unfortunately, the MPI communicator creation was slow on this supercomputer. Thus, we decided to exclude the time for communicator splitting when we report results of our competitors obtained on JUQUEEN. We argue that the communicators could be stored in a sorting object and reused over many runs with arbitrary inputs. Thus, we view the overhead for MPI communicator splitting as a precomputation.
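A minimal sketch of the measurement scheme used on SuperMUC-NG follows. `sort_once` is a placeholder for the sorting call under test; the scheme on JUQUEEN used fixed repetition counts instead.

```cpp
#include <mpi.h>

// Synchronize with a global barrier, measure local wall time, take the maximum
// over all PEs as the running time of one run, and repeat until the accumulated
// time exceeds two seconds (but at least twice); return the average.
template <class SortFn>
double measure(MPI_Comm comm, SortFn sort_once) {
    double total = 0.0;
    int runs = 0;
    while (runs < 2 || total < 2.0) {
        MPI_Barrier(comm);                 // synchronize all PEs
        double start = MPI_Wtime();
        sort_once();                       // sort a fresh input
        double local = MPI_Wtime() - start;
        double max_time = 0.0;             // running time = maximum local time
        MPI_Allreduce(&local, &max_time, 1, MPI_DOUBLE, MPI_MAX, comm);
        total += max_time;
        ++runs;
    }
    return total / runs;                   // average over all runs
}
```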

7.3 Input Size Analysis and Algorithm Comparison

Motivation. In this section, we present the results of our sorting algorithms GatherMsort, BitonicSort, RFIS, RQuick, AMS-sort, and of the competitors HSS respectively HykSort. We exclude the competitor Minisort from the discussion in this section as Minisort is not competitive for any input distribution (see Section 7.4.2). We expect to need multiple algorithms to sort the entire parameter space of input sizes as fast as possible. The reason is that the algorithms trade off startup overheads and communication volume differently. Our asymptotic analysis suggests that algorithms with small overhead sort small inputs the fastest whereas for larger inputs, algorithms with larger overheads benefit from less communication volume. Besides the algorithm comparison for inputs ranging from a sparsity factor of 243 to 2^22 elements per PE, we examine the robustness of the algorithms for eight input distributions: the input distributions from Helman et al. [HBJ98] and the distribution Mirrored.

Experimental Results. Figures 7.1 and 7.3 show the running times obtained on the supercomputers SuperMUC-NG and JUQUEEN. Figures 7.2 and 7.4 give the running time ratios of each algorithm compared to the fastest algorithm. We do not present the results of our competitor HykSort obtained on SuperMUC-NG as HSS (HykSort with a refined splitter selection algorithm) turned out to be faster in all situations. On JUQUEEN, we have to compare our results to HykSort as HSS had not yet been published when JUQUEEN was decommissioned.

GatherMsort sorts very sparse inputs (푛/푡 ≤ 3^{-3}) up to 1.8 times as fast as the other sorting algorithms on JUQUEEN.

On SuperMUC-NG, the performance is slightly better as we use fewer PEs on this supercomputer. Remember that GatherMsort does not fulfill the balance constraint of sorted output.

For all input distributions, RFIS is the fastest sorting algorithm when 3^{-3} < 푛/푡 ≤ 4 on JUQUEEN. For example, RFIS sorts Uniform sparse inputs up to 3.6 times as fast as its competitors and a single element per PE more than twice as fast as RQuick and BitonicSort. On SuperMUC-NG, RFIS is the fastest sorting algorithm on a rather narrow range of input sizes.

On JUQUEEN, RQuick sorts small (푛/푡 = 2^3 to 2^14) Uniform inputs up to 3.4 times as fast as any other algorithm (more than 8 times if we exclude our algorithm AMS-sort). RQuick sorts these small inputs up to 3.3 times as fast as on SuperMUC-NG. In general, the running times of RQuick for Uniform inputs do not differ much from the running times for other input distributions.

There is no need for BitonicSort: When the running time of RQuick is bound by its message startups, BitonicSort has a similar performance. However, for these input sizes, one would prefer RFIS in most cases. For larger input sizes, RQuick is in most cases at least twice as fast as BitonicSort. Thus, even when 푡 is not a power of two, we expect RQuick to outperform BitonicSort.

AMS-sort is more efficient than the multiway hypercube algorithms HSS and HykSort for small inputs. Also, HSS and HykSort crashed on inputs of the distributions DeterDupl, RandDupl, and Zero. Apart from that, AMS-sort, HykSort, and HSS are the fastest algorithms for large inputs (more than 2^15 elements per PE on JUQUEEN and more than 2^13 elements per PE on SuperMUC-NG). For example, for the largest input, RQuick is between 2.08 (Uniform) and 12.35 (Zero) times slower than AMS-sort.

We now compare AMS-sort to HSS on SuperMUC-NG: AMS-sort is faster than HSS for all large input instances. For Uniform, BucketSorted, and g-Group inputs, HSS is almost as fast as AMS-sort. Also, AMS-sort sorts the skewed input Staggered up to 2.69 times as fast as HSS (more than 8 times if we consider Mirrored, which is hard for nonrobust hypercube algorithms).

Finally, we compare AMS-sort to HykSort on JUQUEEN: AMS-sort is always faster for 2^15 to 2^17 elements per PE. For 푛/푡 > 2^17, the situation becomes more distribution-dependent. For Uniform inputs, HykSort is up to 1.10 times as fast as AMS-sort. However, HykSort sorts Staggered inputs up to 1.70 times slower than AMS-sort. This slowdown is caused by the skewed input distribution. Mirrored input is even harder for HykSort: HykSort sorts Mirrored input 9.24 times slower (푛/푡 = 2^16) than AMS-sort and crashed for 푛/푡 ≥ 2^16. We want to note that we use a refined version of HSS that uses our RBC library for communication. For more information, we refer to Appendix B.2, which illustrates that the refinements speed up the original version by up to a factor of 101.

Conclusion. GatherMsort is the fastest algorithm for very sparse inputs. For sparse and very small inputs one would prefer RFIS. One would choose RQuick for medium-sized inputs. And AMS-sort seems to be a good compromise between robustness and performance for large inputs. Note that AMS-sort sorts inputs on arbitrary supercomputer sizes but HykSort and HSS require that 푡 = 푘^푟 where 푟 is the number of levels. HykSort and HSS are competitive to AMS-sort for some input distributions but sort other distributions slowly or even crash.
Due to the robustness of our sorting algorithms, their running time is not sensitive to the input distribution. As a result, one could once execute a performance test for Uniform inputs on a supercomputer instance and then choose the appropriate algorithm only depending on the input size.
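Such a selection could be as simple as the following sketch. The thresholds used here are placeholders that a user would calibrate once per machine with a Uniform-input performance test; they are not the exact crossover points measured in this chapter.

```cpp
enum class Sorter { GatherMsort, RFIS, RQuick, AMSsort };

// Pick the sorting routine solely by the number of local elements n/t.
// Placeholder thresholds, loosely oriented at the JUQUEEN results above.
Sorter choose_sorter(double n_over_t) {
    if (n_over_t < 1.0 / 27.0) return Sorter::GatherMsort;  // very sparse inputs
    if (n_over_t <= 4.0)       return Sorter::RFIS;         // sparse / very small
    if (n_over_t <= 1 << 14)   return Sorter::RQuick;       // small to medium
    return Sorter::AMSsort;                                 // large inputs
}
```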


Figure 7.1: Running times of distributed sorting algorithms on JUQUEEN (GatherMsort, RFIS, BitonicSort, RQuick, AMS-sort, and HykSort) for the input distributions Uniform, Mirrored, Staggered, Zero, DeterDupl, RandDupl, BucketSorted, and g-Group; average time in 휇s over the number of local elements 푛/푝. The red cross indicates a break due to memory overflow.


Figure 7.2: Running time ratios of each algorithm to the fastest algorithm on JUQUEEN for the input distributions Uniform, Mirrored, Staggered, Zero, DeterDupl, RandDupl, BucketSorted, and g-Group, over the number of local elements 푛/푝. The red cross indicates a break due to memory overflow.


Figure 7.3: Running times of distributed sorting algorithms on SuperMUC-NG (GatherMsort, RFIS, BitonicSort, RQuick, AMS-sort, and HSS) for the input distributions Uniform, Mirrored, Staggered, Zero, DeterDupl, RandDupl, BucketSorted, and g-Group; average time in 휇s over the number of local elements 푛/푝.


Figure 7.4: Running time ratios of each algorithm to the fastest algorithm on SuperMUC-NG for the input distributions Uniform, Mirrored, Staggered, Zero, DeterDupl, RandDupl, BucketSorted, and g-Group, over the number of local elements 푛/푝.


7.4 Robustness Analysis

We now take a closer look at the impact of various algorithmic measures to improve the robustness of our algorithms and discuss the robustness of our competitors. We consider three major issues with respect to the robustness of a massively parallel sorting algorithm: (1) its scalability, i.e., its running time as a function of 푛 and 푡, (2) how the algorithm behaves with respect to skewed input distributions, and (3) how it handles repeatedly occurring keys. We show results for a selection of input distributions: Uniform for randomly distributed input, DeterDupl for inputs with duplicates, and Staggered as well as Mirrored input for skewed respectively very skewed input. For the robustness experiments of AMS-sort, we also show results for the skewed input distribution AllToOne, which is a hard input for some recursive multiway algorithms.

We executed the robustness experiments on SuperMUC-NG and on JUQUEEN. However, we present results obtained on SuperMUC-NG in most cases. The reasons are manifold: First, the results on SuperMUC-NG were very similar to the results on JUQUEEN. Second, the nonrobust versions of the algorithms have an advantage on SuperMUC-NG as this supercomputer is equipped with more local memory and the number of PEs is lower. Third, SuperMUC-NG is currently listed at position 15 of the TOP500 list (November 2020) whereas JUQUEEN has already been decommissioned. Finally, algorithms that initially seemed to be robust on JUQUEEN were not robust on SuperMUC-NG. An example is the routing of the ranked elements to their final PEs in RFIS. For 푛/푡 < log 푡, it should be more efficient to route an element with rank 푟 to PE ⌊푟푡/푛⌋ with a single point-to-point message. However, on SuperMUC-NG these sparse data exchanges caused running time fluctuations and we achieved much more reliable results when we use a hypercube data routing algorithm. For a more detailed experimental analysis, we refer to Appendix B.1.

Overview. In Section 7.4.1, we examine the robustness of RFIS for different input distributions and compare its performance to the performance of different gather-based sorting approaches. Section 7.4.2 extensively evaluates the robustness of RQuick compared to the nonrobust version of RQuick for different input distributions. In this section, we also compare RQuick to quicksort for the special case 푛/푡 = 1 (Minisort from Siebert and Wolf [SW11]) and to HykQuick, a hypercube quicksort implementation from Sundar et al. [SMB13]. In Section 7.4.3, we discuss the robustness of our competitors HykSort and HSS. Finally, we extensively compare AMS-sort to nonrobust AMS-sort in Section 7.4.4. In this section, we also compare the performance of AMS-sort to the performance of the single-level sorting algorithms 1LAMS-sort and 1LHSS.

7.4.1 Robustness of Logarithmic Latency Algorithms

Motivation. In this section, we examine the robustness of algorithms with logarithmic message startups in three parts. First, we compare the running time of GatherMsort to much simpler gathering-based sorting algorithms and test the robustness of these simple algorithms in terms of scalability and efficiency. Second, we compare the cost of the ranking algorithm RFIR to the cost of converting the ranked elements to sorted output (RFIS). Finally, we study the robustness of RFIS with respect to skewed inputs and inputs with duplicates.


Figure 7.5: Running time comparison of gather-merge algorithms (GatherMsort, GatherS, MPIGatherS), RFIS, and RFIR for Uniform input (a) and running time comparison of RFIS for the input distributions DeterDupl, Mirrored, Staggered, and Uniform (b), obtained on SuperMUC-NG; average time over the number of local elements 푛/푝.

Sorting by Gathering. We compare our GatherMsort algorithm to GatherS, a simplified version of GatherMsort that sorts the whole input on the root PE, and to MPIGatherS, a collective gather with MPI collectives (MPI_Allgather for the message size exchange and MPI_Gatherv for the actual element gathering) followed by local sorting of the result. Figure 7.5a depicts the running times for Uniform input. MPIGatherS is up to two orders of magnitude slower than GatherMsort and GatherS for small inputs. The reason is that MPI_Gatherv requires the local input sizes on each PE to handle arbitrary input sizes on each PE. These sizes are distributed with an MPI_Allgather operation that has a communication volume linear in 푡. In contrast, GatherMsort and GatherS only need log 푡 message startups since they propagate the sizes with a binomial-tree operation to the parent PEs. For all inputs, GatherMsort is faster than GatherS, e.g., by more than a factor of two for 푛/푡 > 1. The reason is that GatherMsort sorts the input locally and merges the elements while routing them in parallel.

Overall, we have seen very poor scalability of the gathering-based sorting algorithms. In particular, the running time of MPIGatherS increased rapidly when we sorted sparse inputs on more than a hundred PEs. The fact that GatherMsort outperforms all algorithms for the smallest inputs comes at the price that it needs O(푛) local memory. Therefore, it was not surprising that gathering-based sorting ran out of memory for 푛/푡 > 2^8 on JUQUEEN.
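A gathering-based sorter in the spirit of MPIGatherS can be sketched as follows. This is only an illustration of the approach (size exchange with MPI_Allgather, element gathering with MPI_Gatherv, local sorting), not the exact benchmark code; the function name is made up.

```cpp
#include <mpi.h>
#include <algorithm>
#include <numeric>
#include <vector>

// Gather all (possibly differently sized) local inputs on the root and sort them there.
void mpi_gather_sort(std::vector<double>& local, int root, MPI_Comm comm) {
    int rank, t;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &t);

    // Exchange the local input sizes; communication volume is linear in t.
    int local_size = (int)local.size();
    std::vector<int> sizes(t), displs(t, 0);
    MPI_Allgather(&local_size, 1, MPI_INT, sizes.data(), 1, MPI_INT, comm);

    // Exclusive prefix sums give the receive displacements on the root.
    std::partial_sum(sizes.begin(), sizes.end() - 1, displs.begin() + 1);
    int total = displs.back() + sizes.back();

    std::vector<double> result;
    if (rank == root) result.resize(total);
    MPI_Gatherv(local.data(), local_size, MPI_DOUBLE,
                result.data(), sizes.data(), displs.data(), MPI_DOUBLE,
                root, comm);
    if (rank == root) std::sort(result.begin(), result.end());
}
```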

From Ranking to Sorting. Figure 7.5a shows the running times of RFIS and RFIR. For very small inputs, RFIS is up to a factor of two slower than RFIR as the redistribution of the elements requires Θ(log 푡) startup overheads.

For larger inputs, these overheads are dominated by the time for the actual data transfer and the local work. It turned out that the reduce operation, which calculates the final ranks, dominates the running time for large inputs. Thus, RFIS and RFIR have similar running times for these inputs.

Fast Work-Inefficient Sorting. Figure 7.5b compares the running times of RFIS for different input distributions. The time for routing the elements to their final PEs is limited by O(훼 log 푡 + 훽푛/푡^{1/2}). While inputs of some input distributions come close to this upper bound, other distributions need far fewer element transfers. For example, uniformly distributed inputs only need O(훼 log 푡 + 훽푛/푡) element transfers with high probability. Thus, RFIS sorts large Uniform inputs up to 1.26 times as fast as DeterDupl and Mirrored inputs in our experiments. We see some running time fluctuations for sparse inputs (factor of ≤ 1.25). These seem to be random overheads caused by the machine or the MPI library as they occurred randomly over multiple runs. On JUQUEEN, we did not see these fluctuations.

7.4.2 Robustness of RQuick

Motivation. In this section, we study the robustness of our algorithm RQuick by comparing its running time and imbalance to other quicksort algorithms. First, we compare RQuick to a variant of RQuick without redistribution and tie-breaking (NRQuick) to illustrate the quality of these robustness measures. Then, we compare RQuick to HykQuick from Sundar et al. [SMB13], a simple implementation of hypercube quicksort that selects splitters by gathering and locally sorting a random sample. We expect that the approximate median selection of RQuick selects high-quality splitters while still being much faster than the splitter selection of HykQuick, even for random input. We want to note that we compare RQuick to an already refined version of HykQuick that uses our RBC library for communication. For more information, we refer to Appendix B.2, which illustrates that the refinements speed up the original version by up to a factor of six. Finally, we compare our algorithm to Minisort, which sorts inputs with one element on each PE. Minisort sorts inputs for arbitrary 푡 by adjusting the PE-groups for the recursive calls depending on the splitter rank in the global input. However, we assume this approach to be slower than RQuick for two reasons: First, the number of recursion levels of Minisort will be larger. Second, the non-deterministic PE-groups cause communication between alternating communication partners. Preliminary experiments on several supercomputers have shown that such non-deterministic communication patterns are significantly slower when the main bottleneck is caused by message startups. For a detailed discussion, we refer to Appendix B.1.

Imbalance of RQuick and NRQuick. Figure 7.6a illustrates the maximum imbalance of RQuick for different input distributions and sizes. Inputs without duplicates have a very small imbalance, e.g., the maximum output size is 1.07 푛/푡 for 푛/푡 = 2^12 for Mirrored, Staggered, and Uniform. However, we also see considerable imbalances for almost all DeterDupl distributed inputs. An exception are large inputs for which the imbalance converges to the imbalance of the other input distributions. We want to note that we see the distribution DeterDupl as a worst-case distribution for the tie-breaking algorithm of RQuick: Assuming a perfect splitter selection algorithm, a perfect splitter 푒 would split the elements of a subcube into two parts of equal size on each recursion level, one part with elements smaller than 푒 and one part with elements equal to 푒.


Figure 7.6: Maximum imbalance of RQuick (a) and NRQuick (b). Average running time ratios of RQuick to NRQuick on SuperMUC-NG (c) and on JUQUEEN (d). Shown for the input distributions DeterDupl, Mirrored, Staggered, and Uniform.

The tie-breaking algorithm would then either perfectly balance the elements between the two communication partners (in case the majority of the elements are equal to 푒) or move more than half of the elements to the communication partner with the smaller index (in case the minority of the elements are equal to 푒). Looking at Figure 7.6b, we see that the imbalance of NRQuick (RQuick without tie-breaking and redistribution) deteriorates tremendously. Inputs with duplicates (DeterDupl) and very skewed inputs (Mirrored) come with huge imbalances and run out of memory for large inputs. Staggered inputs are sorted for all input sizes but with an imbalance of around four. Uniformly distributed inputs (i.e., Uniform inputs) were the only inputs for which the imbalance of NRQuick was as low as the imbalance of RQuick.

Running Time Comparison of RQuick and NRQuick. Figure 7.6c-d depicts the running time ratio of RQuick over NRQuick on SuperMUC-NG and on JUQUEEN.


Figure 7.7: Running times of RQuick and Minisort (a) as well as of RBC-HykQuick (b) for the input distributions DeterDupl, Mirrored, Staggered, and Uniform on SuperMUC-NG. The red crosses indicate breaks due to memory overflow.

The price of robustness for simple input distributions such as Uniform is an additional data redistribution. For large Uniform inputs, this slows down RQuick by a factor of up to 1.71. On the other hand, RQuick sorts skewed input distributions such as Staggered and Mirrored robustly. Thus, the redistribution makes RQuick robust against skewed inputs. Compared to NRQuick, it decreases the total running time by a factor of up to 9 (푛/푡 = 2^15) on JUQUEEN. Figure 7.6c does not provide running times for skewed input distributions for 푛/푡 ≥ 2^22 (Staggered input) respectively 푛/푡 ≥ 2^17 (Mirrored input) as NRQuick runs out of memory for these input sizes on SuperMUC-NG. On JUQUEEN, NRQuick already starts to run out of memory for skewed inputs with 2^8 elements per PE.

Running Time of RQuick for Different Distributions. Figure 7.7a depicts the running times of RQuick for different input distributions. We see that the running time of RQuick is very robust, meaning that RQuick does not sort any input distribution slower than random inputs (i.e., Uniform inputs) with the exception of a few outliers. For small inputs (푛 ≤ 2^12), the running time is about the same for all distributions. In the following, we consider large inputs, i.e., 푛 > 2^12, for which the situation is more diverse. The sorting times of large skewed inputs (Staggered and Mirrored) are very similar and close to the running times of large Uniform inputs. Exceptions are large inputs with many duplicates, i.e., DeterDupl distributed inputs: RQuick sorts large DeterDupl inputs faster than the other input distributions, e.g., up to 1.31 times as fast as Uniform. At first glance, this is surprising since RQuick has somewhat large imbalances when sorting DeterDupl inputs (see Figure 7.6a). However, the reasons for the short sorting times of DeterDupl inputs are that the local sorting algorithm sorts these inputs faster (due to equality buckets) and that RQuick performs fewer element transfers (due to tie-breaking).


Comparison to Minisort and HykQuick. RQuick is up to one order of magnitude faster than Minisort. One reason is that Minisort trades off more recursion levels for zero imbalance. We also noticed that the splitter quality of Minisort is worse than the splitter quality of RQuick, especially for the skewed inputs Mirrored and Staggered. However, this does not fully explain the large performance gap between RQuick and Minisort: Even though Minisort selects worse splitters with the distributions Mirrored and Staggered than with the distributions DeterDupl and Uniform, the running times with the former distributions were shorter than with the latter distributions. To explain this observation, one needs to take a look at the number of different communication partners separately for each input distribution. Since the distributions Mirrored and Staggered are skewed, the path an element takes to its final PE is more deterministic. And indeed, the average number of different communication partners for DeterDupl and Uniform inputs over many runs was much larger than for Mirrored and Staggered inputs. This confirms our observation in Appendix B.1 that alternating communication partners increase the message startup overhead.

Figure 7.7b shows the running times of our competitor HykQuick for different input distributions. Compared to RQuick (see Figure 7.7a), HykQuick is much slower and much less robust: (1) HykQuick deadlocks for sparse inputs. (2) For small inputs, HykQuick is more than one order of magnitude slower than our algorithm. We found that the splitter selection of HykQuick needs a tremendously large amount of time. (3) HykQuick runs out of memory while sorting large Mirrored inputs as its hypercube routing algorithm gathers too many elements on "hot" PEs. This is a known issue of naive hypercube routing algorithms (see Section 5.1.1). (4) HykQuick runs out of memory for relatively large DeterDupl inputs as the algorithm does not sort inputs with duplicates robustly. Only the largest Uniform inputs are sorted faster by HykQuick in comparison to our algorithm. The reason is the redistribution that our algorithm uses to make skewed inputs easy. However, one would use multiway sorting algorithms for these large inputs anyway.

7.4.3 Robustness of Multiway Hypercube-Based Sorting

Motivation. In Section 7.3, we have already noticed that our competitors HykSort and HSS are not robust in several aspects. This is not surprising since hypercube-based routing is known to have severe intermediate load imbalances for worst-case inputs (see the discussion in Section 5.1.1). Here, we study the robustness of HykSort and HSS based on the results obtained on JUQUEEN and SuperMUC-NG in more detail (see Figures 7.1–7.4 of Section 7.3).

Experimental Results. Both algorithms are not robust for three reasons. The first reason is that HykSort and HSS only work when 푡 = 푘^푟 where 푟 is the number of levels. The second reason is that the implementations are rather prototypical. Both algorithms crash for Zero, DeterDupl, and RandDupl inputs as they do not break ties. Additionally, they do not support sparse inputs even though we have already fixed some bugs that caused them to crash for small inputs. The third reason is that the algorithms do not sort skewed inputs robustly. On the first recursion level, they route the input of the distribution Staggered to 푡/2 PEs.


The load imbalance is even worse for Mirrored input: As the multiway hypercube algorithms perform log 푘 = (log 푡)/푟 levels of hypercube data routing in one step, after 푟/2 recursions of HykSort, 푡/푘^{푟/2} PEs hold 푛푘^{푟/2}/푡 elements each. As a result, HSS sorts Staggered (Mirrored) input 2.25 (6.95) times slower than non-skewed Uniform inputs on SuperMUC-NG (푛/푡 = 2^22). On JUQUEEN, HykSort even runs out of memory while sorting Mirrored input with more than 2^16 elements per PE. The algorithms do not break for this input distribution on SuperMUC-NG, as we have more local memory and fewer PEs available than on JUQUEEN.

To run HykSort and HSS for arbitrary 푡, it would be possible to send the data from the additional PEs to the embedded hypercube in a preprocessing step. To become robust against skewed inputs, a preprocessing step could redistribute the input. We expect both approaches to be impractical since they would increase the local work and the communication volume significantly. This would only have little effect on their running time for relatively small input sizes. However, one would prefer RQuick for these inputs anyway. These robustness measures, however, turned out to work quite well for RQuick: For small inputs, the element redistribution did not decrease the performance of RQuick much since it only adds log 푡 additional message startups, and RQuick still remains about as fast as BitonicSort. For larger inputs, i.e., inputs for which RQuick is bandwidth- and compute-bound, RQuick with element redistribution is still twice as fast as its closest competitor BitonicSort.

7.4.4 Robustness of AMS-sort

Motivation. In the first part of this section, we show that our recursive multiway algorithm AMS-sort is much more efficient than single-level sorting algorithms for the input sizes we consider in this work. The second part studies the robustness measures of AMS-sort in detail. In Section 7.3 we have already seen that AMS-sort sorts all kinds of input distributions robustly and very efficiently whereas its closest competitors HSS and HykSort are less efficient for small inputs and much less robust when the input becomes larger. However, without the robustness measures of AMS-sort, namely our message assignment algorithm DMA and our tie-breaking approach, the situation would be very different. To illustrate the advantages of our robustness measures we show that they come with low overhead for "easy" inputs but reduce the running time of AMS-sort tremendously for worst-case inputs.

Comparison of Message Assignment Algorithms. Figure 7.8a-b shows the running time ratio of AMS-sort over SMA-AMS, a version of AMS-sort that uses the simple message assignment algorithm proposed by Kalé and Krishnan [KK93] (see Section 5.2) instead of our deterministic message assignment algorithm. We present results for different input distributions obtained on SuperMUC-NG and JUQUEEN. DeterDupl, Mirrored, Staggered, and Uniform are "easy" inputs for the message assignment algorithms DMA and SMA, i.e., regardless of the message assignment algorithm that AMS-sort uses, the number of messages that are sent and received by each individual PE is very small. Our experiments show that the overhead of DMA over SMA for these inputs is very small. An exception are inputs with 푛/푡 ≈ 푘, for which DMA has an overhead of up to a factor of two. However, for such small inputs, RQuick would be used preferably in any case.

For the input distribution AllToOne, SMA-AMS sends min(푛/푡, 푡) messages to PE 0 on the first recursion level. In this case, AMS-sort actually can take advantage of DMA as it reduces the startups in the data exchange step to O(푘). This speeds up AMS-sort over SMA-AMS by a factor of up to 1240 on SuperMUC-NG.


Figure 7.8: Running time ratios of different algorithms with and without DMA for the input distributions AllToOne, DeterDupl, Mirrored, Staggered, and Uniform. The results were obtained on SuperMUC-NG using 2^16 cores and on JUQUEEN using 2^17 cores.

On JUQUEEN, the factor is 5.21 as JUQUEEN has very small message startup costs. The positive effect begins at 푛/푡 ≥ 256 elements per PE and increases rapidly. The effect decreases when the time for the message exchange dominates the startup time. One might assume that SMA-AMS sorts AllToOne inputs slowly because we use our own implementation for the actual data exchange step. Thus, we conclude the benchmark by presenting results of MPI-SMA-AMS, a version of SMA-AMS that uses collective operations provided by MPI, e.g., for communicator creation and for the data exchange step (MPI_Alltoall for exchanging the message sizes and MPI_Alltoallv for the data exchange). Figure 7.8c depicts the running times of MPI-SMA-AMS relative to the ones of AMS-sort obtained on SuperMUC-NG. MPI-SMA-AMS sorts AllToOne inputs two orders of magnitude slower than AMS-sort. Furthermore, MPI-SMA-AMS is more than one order of magnitude slower for inputs of small and medium size, regardless of the input distribution.

In preliminary experiments, we also tested AMS-sort with a straightforward implementation of the data exchange step, i.e., the data exchange step posted all send and receive operations at once and counted on the MPI library to execute the operations as efficiently as possible. This approach turned out to be relatively fast when AMS-sort uses our message assignment algorithm DMA. However, when we disabled DMA and used the simple message assignment algorithm (SMA) instead, the MPI library on JUQUEEN crashed while sorting AllToOne inputs as the PEs posted too many messages simultaneously.

Figure 7.9: Accumulated running times of the phases of AMS-sort (splitter selection, partitioning, DMA, data exchange, and local sorting) for Uniform inputs on SuperMUC-NG, over the number of local elements 푛/푝.

Subroutines of AMS-sort. We divide each level of AMS-sort into five distinct phases: splitter selection, partitioning, deterministic message assignment (DMA), data exchange, and local sorting. To measure the time of each phase, we place a barrier before each phase. The timings for these phases are accumulated over all recursion levels. The splitter selection phase includes the sample selection and sorting as well as the overpartitioning algorithm. Figure 7.9 shows the running times of the phases for Uniform inputs on SuperMUC-NG. The extra cost for the message assignment DMA is relatively small. For example, DMA takes less than 19 % of the total running time and less than 13 % when we look at the range where AMS-sort is faster than RQuick (푛/푡 > 2^13). For a comparison of AMS-sort and RQuick, we refer to Section 7.3. For small inputs (푛/푡 ≤ 2^12), the splitter selection takes more than 50 % of the total running time. This appears to be large at first glance. However, even if we exclusively consider the cost of the data exchange step, AMS-sort would not be faster than RQuick for small inputs. For inputs larger than 푛/푡 ≥ 2^17, the running time increases almost linearly and the phases partitioning, data exchange, and local sorting dominate the total running time with 85 % and more.

Tie-Breaking. Figure 7.10 shows the running time ratio of AMS-sort and AMS-sort without tie-breaking during local data partitioning (NTB-AMS) on SuperMUC-NG.


Figure 7.10: Running time ratio of AMS-sort to NTB-AMS on SuperMUC-NG for the input distributions AllToOne, DeterDupl, Mirrored, Staggered, and Uniform, over the number of local elements 푛/푝.

Figure 7.11: Running times of single-level algorithms and AMS-sort on SuperMUC-NG, over the number of local elements 푛/푝.

When sorting small inputs without duplicated keys (e.g., AllToOne, Mirrored, Staggered, and Uniform), the extra work to calculate splitters with tie-breaks slightly slows down AMS-sort compared to NTB-AMS (factor of 1.10). NTB-AMS sorts inputs of the distribution DeterDupl much slower than AMS-sort as this distribution contains many duplicates. The nonrobust version runs out of memory for 푛 ≥ 2^16 when we sort DeterDupl inputs because too many keys were gathered on "hot" PEs.

Comparison to Single-Level Algorithms. We sum up the robustness analysis of (recursive multiway) AMS-sort by comparing its performance to the performance of single-level sorting algorithms. The fastest single-level algorithms we have found for the input sizes considered in this work are our reimplementation of single-level histogram sort with sampling, 1LHSS (see Section 7.1), and our multiway AMS-sort algorithm executed with 푘 = 푡 (1LAMS-sort).


1LAMS-sort and 1LHSS use pure MPI for communication. In particular, they exchange the message sizes with the MPI collective MPI_Alltoall and exchange the data with the MPI collective MPI_Alltoallv. To achieve the best performance of 1LHSS and 1LAMS-sort, we also tested both algorithms with RBC collectives in a preliminary experiment. The versions with RBC collectives perform the data exchange with our data exchange routine described in Section 5.2. 1LAMS-sort and 1LHSS with pure MPI were faster for 푛/푡 > 2^8 (and much faster for 푛/푡 > 2^15) since our own data exchange routine has not been optimized for a dense 푡-way data exchange. For smaller inputs, 1LAMS-sort and 1LHSS were up to twice as fast when we use RBC.

Figure 7.11 depicts the running times of AMS-sort, 1LAMS-sort, and 1LHSS obtained on SuperMUC-NG for Uniform inputs. AMS-sort is between one and two orders of magnitude faster than the single-level algorithms for inputs of small and medium size (푛/푡 ≤ 2^18). For the largest input (푛/푡 = 2^22), AMS-sort is still faster by a factor of 2.13. Both the data exchange and the splitter selection contribute to the poor performance of 1LAMS-sort and 1LHSS. It turned out that the splitter selection contributes a large share of the running time, in particular for 푛/푡 ≥ 2^10. For larger inputs, the 푡-way data exchange dominates the running time. Looking at Figure 7.11, we see that 1LAMS-sort and 1LHSS have a peak around 푛/푡 = 2^14. For these input sizes, the data exchange operation MPI_Alltoallv is particularly slow. We assume that MPI_Alltoallv switches to a different implementation for larger inputs.

In preliminary experiments, we also tested the multiway hypercube implementation of histogram sort with sampling provided by Harsh et al. [HKS19] with 푘 = 푡. However, our implementation 1LHSS turned out to be faster as the data exchange routine MPI_Alltoallv performed better than the 푘-way hypercube exchange with 푘 = 푡.
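The dense 푡-way exchange used by the single-level algorithms follows a standard MPI pattern, sketched below: the message sizes are exchanged with MPI_Alltoall and the elements with MPI_Alltoallv. This is an illustration only; it assumes that the buckets are stored contiguously in `send_data` with per-PE counts in `send_counts`.

```cpp
#include <mpi.h>
#include <numeric>
#include <vector>

// Exchange counts with MPI_Alltoall, then exchange the elements with MPI_Alltoallv.
std::vector<double> dense_exchange(const std::vector<double>& send_data,
                                   const std::vector<int>& send_counts,
                                   MPI_Comm comm) {
    int t;
    MPI_Comm_size(comm, &t);

    std::vector<int> recv_counts(t);
    MPI_Alltoall(send_counts.data(), 1, MPI_INT,
                 recv_counts.data(), 1, MPI_INT, comm);

    // Exclusive prefix sums give the send and receive displacements.
    std::vector<int> send_displs(t, 0), recv_displs(t, 0);
    std::partial_sum(send_counts.begin(), send_counts.end() - 1,
                     send_displs.begin() + 1);
    std::partial_sum(recv_counts.begin(), recv_counts.end() - 1,
                     recv_displs.begin() + 1);

    std::vector<double> recv_data(recv_displs.back() + recv_counts.back());
    MPI_Alltoallv(send_data.data(), send_counts.data(), send_displs.data(),
                  MPI_DOUBLE,
                  recv_data.data(), recv_counts.data(), recv_displs.data(),
                  MPI_DOUBLE, comm);
    return recv_data;
}
```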

7.5 Efficiency in Scaling Experiments

The experimental setting of the efficiency test is as follows: We benchmarked RFIS, RQuick, and AMS-sort on 2^푖, 푖 ∈ [10 .. 16] PEs for different input sizes with Uniform inputs on SuperMUC-NG. For each algorithm, we have chosen input sizes for which the particular algorithm is faster than its competitors. Table 7.1 shows the efficiency of these algorithms compared to our sequential sorting algorithm I1S4o.

RFIS and RQuick. Even though RFIS respectively RQuick sort small inputs (푛/푡 ≤ 2^11) the fastest, they are still very inefficient. RQuick, for example, has an efficiency of less than 0.1 on all machine sizes. For 2^9 elements per PE, its efficiency is only 0.043 on 2^16 PEs. The efficiency of RFIS is even smaller, e.g., less than 0.0002 on 2^16 PEs for 푛/푡 ≤ 2. However, RFIS sorts these input sizes still 12 times as fast as the sequential sorting algorithm I1S4o. This is still a pessimistic comparison since we stored the input of I1S4o on a single PE.

AMS-sort. When we assume that the local input size is constant, the efficiency of AMS-sort increases by a factor of 2.2 to 2.6 when we use 2^10 instead of 2^16 PEs (strong scaling). When we assume that the number of PEs is constant, i.e., 2^10 PEs respectively 2^16 PEs, the efficiency of AMS-sort also increases by a factor of 2.2 to 2.6 when we use 2^22 instead of 2^16 elements on each PE (scaling the local input size). For 푛/푡 ≥ 2^18, the efficiency changes only slightly. This input size seems to be a turning point.


Algorithm   푛/푝     푝 = 2^10   2^11      2^12      2^13      2^14      2^15      2^16
AMS-sort    2^22    0.462      0.437     0.375     0.320     0.306     0.275     0.207
            2^20    0.435      0.425*    0.379*    0.318*    0.293*    0.248*    0.190
            2^18    0.425      0.403     0.357     0.283*    0.272*    0.247*    0.197
            2^16    0.347      0.278     0.271     0.206     0.194     0.172*    0.150
            2^14    0.217      0.179     0.156     0.128     0.121     0.096     0.083
RQuick      2^11    0.099      0.086     0.071     0.062     0.060     0.051     0.043
            2^9     0.070      0.050     0.031     0.029     0.027     0.019     0.018
            2^7     0.026      0.023     0.020     0.017     0.008     0.009     0.008
            2^5     0.008      0.006     0.005     0.004     0.004     0.002     0.002
RFIS        2^1     0.00115    0.00089   0.00045   0.00034   0.00035   0.00026   0.00018
            2^0     0.00057    0.00047   0.00047   0.00019   0.00018   0.00018   0.00011

Table 7.1: Efficiency of sorting algorithms compared to I1S4o on SuperMUC-NG. A single node was not able to store inputs with 푛 ≥ 2^34 elements. We linearly scaled the running time of I1S4o for these inputs and labeled the corresponding efficiencies with a "*".

When the number of local elements goes below 2^18, the efficiency decreases rapidly, independent of the machine size. Note that we had to extrapolate the running time of I1S4o for 푛 ≥ 2^34 as a single node could not store these large inputs.
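The entries of Table 7.1 can be read with the usual definition of parallel efficiency relative to the sequential baseline in mind; the following formula is our notation, assuming the standard speedup-per-PE definition with I1S4o as the sequential reference:

```latex
E(n, t) \;=\; \frac{T_{\mathrm{I1S4o}}(n)}{t \cdot T_{\mathrm{par}}(n, t)}
```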

7.6 Conclusion

We have shown how practical parallel sorting algorithms like recursive multiway mergesort and samplesort can be generalized so that they scale on massively parallel machines without incurring a large additional amount of communication volume. Our implementation of AMS-sort shows very competitive performance that is probably orders of magnitude better than that of other implementations for large 푡 and moderate 푛. For large 푛 it can compete with the best single-level algorithms. Indeed, AMS-sort is the only scalable recursive multiway sorting algorithm that works for arbitrary 푡. For smaller inputs, we devised RQuick and RFIS, two work-inefficient, yet practical and fast algorithms with (poly)logarithmic latency.

So far, most "practical" research on massively parallel sorting algorithms had focused on average-case inputs. Our work closed the gap between these algorithms and theoretical publications on achieving asymptotically fast and efficient parallel sorting algorithms, usually involving large constant factors, by turning worst-case inputs into average cases. The result is that we obtain robustness of massively parallel sorting with respect to the input size by using three algorithms: RFIS for sparse and very small inputs, RQuick for small inputs, and AMS-sort for moderate and large inputs. Robustness with respect to skew can be achieved by careful randomization (RQuick) and clever message assignments (AMS-sort).


Our competitors HSS and HykSort as well as nonrobust implementations of our algorithms crash when sorting hard input distributions with skewed input due to tremendously large load imbalances (HykSort, HSS, and nonrobust RQuick) or become extremely slow due to immense message startup overheads (nonrobust AMS-sort, also see algorithms such as [KK93]). Robustness with respect to duplicated keys can be achieved at reasonable overhead using careful implicit implementations of the brute-force tie-breaking idea (RFIS), equality buckets (AMS-sort), and data shuffling in combination with greedy tie-breaking (RQuick). Without these measures, all of these algorithms would crash when sorting hard input distributions with duplicated keys.

Our communication library RBC closed the gap between theory and practice a step further by providing asymptotically optimal implementations for range-based communicator splitting and collective communication operations. If we had relied on MPI functionality, scalable implementations of our algorithms would have been impossible. Moreover, a fair comparison to our competitors would have been impossible, too. Only after applying RBC to our competitors did their running times match the running times one would expect from the theoretical analysis. For our competitors, RBC speeds up the running time by multiple orders of magnitude for inputs of small and medium size.

Outlook. Still further improvements are possible. For large inputs, more overlapping of communication and computation, as well as a specialized shared-memory implementation for node-local partitioning, seems useful for AMS-sort. We could combine RFIS, RQuick, and AMS-sort into a single robust algorithm. A simple solution could perform binary searches on several machine sizes to determine the range of input sizes that each algorithm sorts the fastest (and interpolate the result).

Another approach could combine the advantages of gather-merge-sort, RQuick, and recursive multiway samplesort by adaptively selecting the appropriate sorting routine for each level of recursion. As long as the startup overheads for the splitter selection dominate the remaining running time of RQuick, a gather-merge-sort could ship all data to few PEs and replace the first levels of RQuick. The last levels of RQuick could again be replaced by a multiway samplesort routine or by a second gather-merge-sort, which collects all data of each subcube on its first PE.

On the SuperMUC supercomputers, we measured large fluctuations in the message startup latency on very large machine instances (SuperMUC-NG) and on some smaller instances (SuperMUC Phase 1–2). On the one hand, one could try to develop sorting algorithms that are more robust to these fluctuations. On the other hand, our experiments indicated that the fluctuations depend on the specific MPI implementation and the supercomputer at hand. In our opinion, a more universal solution would therefore be more robust MPI implementations. If we go a step further, one would also want fault-tolerant sorting algorithms with low overhead.



Appendix

Appendix A Sequential and Shared-Memory Sorting

In Appendix A.1, we limit the recursion depth of IPS4o. Appendix A.2 describes a technique to store the recursion stack of IPS4o implicitly—a requirement for making the space consumption of IPS4o independent of 푛. In Appendix A.3, we report additional measurements of sequential and shared-memory sorting algorithms.

A.1 Limit the Number of Recursions

We prove the following theorem:

Theorem A.1 (Recursion depth of IPS4o) Let 푀 ≥ 1 be a constant. Then, after O(log_푘(푛/푀)) recursion levels, all non-equality buckets of IPS4o have size at most 푀 with a probability of at least 1 − 푀/푛 for an oversampling ratio of 훼 = Θ(log 푘).
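As a purely illustrative instantiation of this bound (the numbers below are hypothetical and not taken from the experiments of this thesis): for 푘 = 256 buckets and 푛/푀 = 2^40, the theorem limits the recursion depth to

\[
  \log_k\!\frac{n}{M} \;=\; \log_{256} 2^{40} \;=\; \frac{40}{8} \;=\; 5
\]

levels, up to the constant factor hidden in the O(⋅).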

We first show Lemmas A.2 and A.3 in order to prove Theorem A.1. Let 푒 be an arbitrary but fixed element of a task with 푛 elements in IPS4o. A "successful recursion step" of 푒 is a recursion step that assigns the element to a bucket of size at most 3푛/푘.
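Writing 퐵_ℓ for the bucket that contains 푒 after ℓ successful recursion steps (퐵_ℓ is our notation, introduced only for this remark), the definition immediately gives

\[
  |B_\ell| \;\le\; \Bigl(\frac{3}{k}\Bigr)^{\!\ell} n ,
  \qquad\text{so}\qquad
  \ell \;\ge\; \log_{k/3}\!\frac{n}{M}
  \;\Longrightarrow\;
  |B_\ell| \;\le\; M .
\]

This is the number of successful steps that Lemma A.3 and Theorem A.1 aim for.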

Lemma A.2 (Probability of a successful recursion step) The probability of a successful recursion step of an arbitrary but fixed element is at least 1 − 2푘^{−푐/12} for an oversampling ratio of 훼 = 푐 log 푘.

Proof. We bound the probability that a task of 푛 elements assigns an arbitrary but fixed element 푒_푗 to a bucket containing at most 3푛/푘 elements (a successful recursion step). Let [푒_1 .. 푒_푛] be the input of the task in sorted order, let 푅_푟 = [푒_푟 .. 푒_{푟+1.5푛/푘−1}] be the set containing 푒_푟 and the next 1.5푛/푘 larger elements, and let [푠_1 .. 푠_{훼푘}] be the selected samples. The Boolean indicator 푋_푖푟 that sample 푠_푖 is an element of 푅_푟 is defined as

푋_푖푟 = 1 if 푠_푖 ∈ 푅_푟, and 푋_푖푟 = 0 otherwise.

The probability Pr[푋_푖푟 = 1] = 1.5푛/푘 ⋅ 1/푛 = 1.5/푘 is independent of the sample 푠_푖 as the samples are selected with replacement. Thus, the number of samples selected from 푅_푟 is 푋_푟 = ∑_{푖=1}^{훼푘} 푋_푖푟 and its expected value is E[푋_푟] = 1.5/푘 ⋅ 훼푘 = 1.5훼. We use the Chernoff bound to limit the probability of fewer than 훼 samples in 푅_푟 to Pr[푋_푟 < 훼] = Pr[푋_푟 < (1 − 1/3) E[푋_푟]] < 푒^{−(1/3)² E[푋_푟]/2} = 푒^{−훼/12}.
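Spelled out, the Chernoff step used here is the standard multiplicative lower-tail bound with 훿 = 1/3:

\[
  \Pr\bigl[X_r < (1-\delta)\,\mathrm{E}[X_r]\bigr] \;\le\; e^{-\delta^2 \mathrm{E}[X_r]/2},
  \qquad
  \delta = \tfrac{1}{3},\ \mathrm{E}[X_r] = 1.5\alpha
  \;\Longrightarrow\;
  \Pr[X_r < \alpha] \;\le\; e^{-\alpha/12} .
\]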


When 푅_푗 as well as 푅_{푗−1.5푛/푘} both provide at least 훼 samples, they both provide a splitter and 푒_푗 is in a bucket containing at most 3푛/푘 elements. The probability for this is

  Pr[푋_푗 ≥ 훼 ∧ 푋_{푗−1.5푛/푘} ≥ 훼] = 1 − Pr[푋_푗 < 훼 ∨ 푋_{푗−1.5푛/푘} < 훼]
    ≥ 1 − Pr[푋_푗 < 훼] − Pr[푋_{푗−1.5푛/푘} < 훼] > 1 − 2푒^{−훼/12} = 1 − 2푘^{−푐/12}. ◻

Lemma A.3 (Limit recursion depth of an arbitrary element) Let 푐 be a constant with 푐 ≥ 36 − 2.38/log(0.34 ⋅ 푘), let 훼 = 푐 log 푘 be the oversampling ratio of IPS4o, and let IPS4o execute 2 log_{푘/3}(푛/푀) recursion levels. Then, an arbitrary but fixed input element of IPS4o passes at least log_{푘/3}(푛/푀) successful recursion levels with a probability of at least 1 − (푛/푀)^{−2}.

Proof. We execute IPS4o for 2 log_{푘/3}(푛/푀) recursion levels and bound the probability that an arbitrary but fixed input element passes fewer than log_{푘/3}(푛/푀) successful recursion levels. This experiment is a Bernoulli trial as we have exactly two possible outcomes, "successful recursion step" and "non-successful recursion step", and the probability of success is the same on each level. Let the random variable 푋 denote the number of non-successful recursion steps after 2 log_{푘/3}(푛/푀) recursion levels, let 푡 be the probability of a non-successful recursion step, and let 푐 ≥ 36 − 2.38/log(0.34 ⋅ 푘). The probability

  퐼 = Pr[푋 > 2 log(푛/푀) − log(푛/푀)] ≤ Pr[푋 > log(푛/푀)]
    ≤ ∑_{푗 > log(푛/푀)} binom(2 log(푛/푀), 푗) 푡^푗 (1 − 푡)^{2 log(푛/푀) − 푗}
    ≤ ∑_{푗 > log(푛/푀)} (2푒 log(푛/푀)/푗)^푗 푡^푗
    ≤ ∑_{푗 > log(푛/푀)} (2푒)^푗 푡^푗
    ≤ ∑_{푗 > log(푛/푀)} (2푒)^푗 (2푘^{−푐/12})^푗                        (A.1)
    = (4푒푘^{−푐/12})^{log(푛/푀)+1} / (1 − 4푒푘^{−푐/12})
    ≤ (푛/푀)^{log(4푒) − 푐/12} / (1 − 4푒푘^{−푐/12})
    ≤ (푛/푀)^{−2}

defines an upper bound of the probability that a randomly selected input element passes 2 log_{푘/3}(푛/푀) recursion levels without passing log_{푘/3}(푛/푀) successful recursion levels. For the sake of simplicity, all logarithms in the equations above are to the base 푘/3. The third "≤" uses binom(푛, 푘) ≤ (푒푛/푘)^푘, the fifth "≤" uses Lemma A.2, and the "=" uses the geometric series. ◻
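For completeness, the geometric series used in the "=" step is, for an integer threshold 퐿 and 0 < 푞 < 1,

\[
  \sum_{j \ge L} q^{\,j} \;=\; \frac{q^{\,L}}{1-q},
\]

applied with 푞 = 4푒푘^{−푐/12} and 퐿 the smallest integer exceeding log_{푘/3}(푛/푀).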

Proof of Theorem A.1. We first assume that 푀 ≥ 푘²푛0 holds. In this case, we select 훼푘 = 푐푘 log 푘 samples. Let 푙 = log_{푘/3}(푛/푀) and let 푒 be an arbitrary but fixed input element of IPS4o after 2푙 recursion levels. Lemma A.3 tells us that 푒 has passed at least 푙 successful recursion steps with a probability of at least 1 − (푛/푀)^{−2} when IPS4o has performed 2 log_{푘/3}(푛/푀) recursion levels. Element 푒 is, in this case, in a bucket containing at most 푛 ⋅ (3/푘)^푙 = 푀 elements, as each successful recursion step shrinks the bucket by a factor of at least 3/푘. Let 퐸 = [푒_1 .. 푒_푛] be the input elements of IPS4o in sorted order and let 푄 = {푒_푖푀 | 1 ≤ 푖 < 푛/푀 ∧ 푖 ∈ ℕ} be the set of every 푀-th element. We now examine the buckets containing elements of 푄 after 2푙 recursion levels. The probability that any element of 푄 is in a bucket containing more than 푀 elements is less than 푛/푀 ⋅ (푛/푀)^{−2} = (푛/푀)^{−1}—this follows from the former insight and Boole's inequality. In other words, the probability that all elements in 푄 are in buckets containing at most 푀 elements is larger than 1 − 푀/푛. As this holds for all elements in 푄, i.e., for every 푀-th element of the sorted output, the probability that, after 2푙 recursion levels, all elements are in buckets containing at most 푀 elements is larger than 1 − 푀/푛. ◻
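The application of Boole's inequality in the last step, written out over the fewer than 푛/푀 marked elements of 푄, reads

\[
  \Pr\Bigl[\,\exists\, e \in Q:\ e \text{ lies in a bucket with more than } M \text{ elements}\Bigr]
  \;\le\; \frac{n}{M}\cdot\Bigl(\frac{n}{M}\Bigr)^{-2}
  \;=\; \frac{M}{n} .
\]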

A.2 From In-Place to Strictly In-Place

We now explain how the recursion stack of IPS4o can be implicitly represented in a rather simple way by adapting the strictly in-place approach of quicksort [Dur86]. The technique which we describe—making the space consumption of IPS4o independent of 푛—requires that (sequential and parallel) partitioning steps mark the beginning of each subtask by storing the largest element of a subtask in its first position. We now exploit that the sequential tasks of a PE are stored next to each other when the PE has finished its last parallel task:

Lemma A.4 (Sequential tasks of a PE are adjacent) When PE 푖 starts its sequential phase, the sequential tasks assigned to PE 푖 cover a consecutive subarray 퐴[푙 .. 푟 − 1] of the input array, i.e., there is no gap between the tasks in the input array.

For reasons of better readability, we have appended the proof of Lemma A.4 to the end of this section. The proof of this lemma also implicitly describes the technique to track the values of 푙 and 푟 until the last parallel task has been processed. In the sequential phase, the PEs have to sort their elements 퐴[푙 .. 푟 − 1], which are already partitioned into sequential tasks. The boundaries of the sequential tasks are implicitly represented by the largest element at the beginning of each task. Algorithm 14 emulates recursion on the sequential tasks of 퐴[푙 .. 푟 − 1] in constant space.

Algorithm 14 Sequential task execution of the subarray 퐴[푙 .. 푟 − 1] without a local stack
Input: 퐴[0 .. 푛 − 1] an array of 푛 input elements, 푙 begin of the subarray, 푟 end of the subarray
  푖 ← 푙                                          ▷ first element of the current task
  푗 ← searchNextLargest(퐴[푖], 퐴, 푖 + 1, 푟)        ▷ first element of the next task
  while 푖 < 푟 do
      if onlyEqualElements(퐴, 푖, 푗) then 푖 ← 푗    ▷ skip equal tasks
      else if 푗 − 푖 < 푛0 then smallSort(퐴, 푖, 푗); 푖 ← 푗   ▷ base case
      else partition(퐴, 푖, 푗)                     ▷ partition first unsorted task
      푗 ← searchNextLargest(퐴[푖], 퐴, 푖 + 1, 푟)    ▷ find beginning of the next task
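For concreteness, the following is a hedged C++ sketch of this stack-free traversal; the helper names are ours, std::sort stands in for both the base case and the further partitioning of IPS4o, and the sketch assumes that every task starts with its largest element and that all elements of a later task are strictly larger than the first element of every earlier task (IPS4o handles equal keys with equality buckets).

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // First position in [from, to) holding an element larger than `pivot`, or `to`
    // if there is none; exponential search followed by binary search, so the cost
    // is logarithmic in the distance to the result.
    long nextTaskBegin(const std::vector<int>& A, int pivot, long from, long to) {
        long lo = from, step = 1;
        while (lo + step < to && A[lo + step] <= pivot) { lo += step; step *= 2; }
        long hi = std::min(lo + step, to);
        return static_cast<long>(
            std::upper_bound(A.begin() + lo, A.begin() + hi, pivot) - A.begin());
    }

    // Stack-free pass over the sequential tasks of A[l..r-1]. Precondition
    // (established by the partitioning steps, not shown here): every task starts
    // with its largest element.
    void sequentialPhase(std::vector<int>& A, long l, long r) {
        long i = l;
        while (i < r) {
            long j = nextTaskBegin(A, A[i], i + 1, r);   // implicit task boundary
            bool onlyEqual = std::all_of(A.begin() + i, A.begin() + j,
                                         [&](int x) { return x == A[i]; });
            if (!onlyEqual) std::sort(A.begin() + i, A.begin() + j);
            i = j;                                       // next task, no stack needed
        }
    }

    int main() {
        // Two marked tasks: {4,1,3,2} (max 4 first) followed by {9,6,8,5,7} (max 9 first).
        std::vector<int> A = {4, 1, 3, 2, 9, 6, 8, 5, 7};
        sequentialPhase(A, 0, static_cast<long>(A.size()));
        for (int x : A) std::printf("%d ", x);           // prints 1 2 3 4 5 6 7 8 9
        std::printf("\n");
        return 0;
    }

Under the stated precondition, the range inspected by the exponential/binary search is partitioned with respect to "larger than the pivot", so std::upper_bound is applicable.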

Starting a search at the leftmost task, the first element larger than the first element of a task defines the first element of the next task. Note that the time required for the search is only logarithmic in the task size when using an exponential/binary search. We assume that the corresponding function searchNextLargest returns 푟 if no larger element exists—this happens for the last task. The function onlyEqualElements checks whether a task only contains identical elements. We have to skip these "equal" tasks to avoid an infinite loop. Algorithm 14 uses this approach to emulate recursion in constant space on the sequential tasks. The technique which we described—making the space consumption of IPS4o independent of 푛—relies on the requirement that the sequential tasks of a PE cover a consecutive subarray 퐴[푙 .. 푟 − 1] for some 푙 and 푟. In the following, we show that this requirement holds.

Proof of Lemma A.4. Let PE 푖 process a parallel task 푇[푙, 푟) with PE group [푡 .. 푡) in one of the following states: • Left. PE 푖 is the leftmost PE of the PE group, i.e., 푖 = 푡.

• Right. PE 푖 is the rightmost PE of the PE group, i.e., 푖 = 푡 − 1.

• Middle. PE 푖 is not the leftmost or rightmost of the PE group, i.e., 푡 < 푖 < 푡 − 1. We claim that the sequential tasks assigned to PE 푖 fulfill the following propositions when (and directly before) PE 푖 processes 푇[푙, 푟): • 푇[푙, 푟) was processed in state Left (Right). PE 푖 does not have sequential tasks or its sequential tasks cover a consecutive subarray of the input array, i.e., there is no gap between the tasks. In the latter case, the rightmost task ends at position 푙 − 1 with 푙 − 1 ∈ (푖푛~푡, (푖+1)푛~푡−1) (the leftmost task begins at position 푟 with 푟 ∈ (푖푛~푡, (푖+2)푛~푡−1)).

• 푇[푙, 푟) was processed in state Middle. PE 푖 does not have sequential tasks. Assume for now that these propositions hold—we will prove them later. We use the propo- sition to show that Lemma A.4 holds when a PE 푖 starts processing its sequential tasks: Let 푇[푙, 푟) be the last parallel task of PE 푖, executed with the PE group [푡 .. 푡). No matter in which state 푇[푙, 푟) has been executed, the sequential tasks of PE 푖 cover a consecutive subarray of the input array at the beginning of its sequential phase. • 푇[푙, 푟) was processed in state Right. As 푖 = 푡 − 1, we add all subtasks of 푇[푙, 푟) that start in 퐴[푖푛~푡 − 1, 푟 − 1] to PE 푖. No gap can exist between these subtasks as they cannot be interrupted by a parallel task. Also, before we assign the new subtasks to PE 푖, the leftmost sequential task of PE 푖 begins at position 푟 (see proposition). Then, the rightmost sequential subtask of 푇[푙, 푟) that starts in 퐴[푖푛~푡 − 1, 푟 − 1] ends at 퐴[푟 − 1]. Thus, after the subtasks were added, there is no gap between the sequential tasks of PE 푖.

• 푇[푙, 푟) was processed in state Left. As 푖 = 푡, we add all subtasks of 푇[푙, 푟) that start in 퐴[푙, (푖 + 1)푛~푡 − 1] to PE 푖. No gap can exist between these subtasks as they cannot be interrupted by a parallel task. Also, before PE 푖 adds the new subtasks, the rightmost sequential task ends at position 푙 − 1 (see proposition). Then, the leftmost subtask of 푇[푙, 푟) that starts in 퐴[푙, (푖 + 1)푛~푡 − 1] begins at 퐴[푙]. Thus, after the subtasks were added, there is no gap between the sequential tasks of PE 푖.

• 푇[푙, 푟) was processed in state Middle. We add all sequential subtasks to PE 푖 that start in 퐴[푖푛~푡, (푖 + 1)푛~푡 − 1]. No gap can exist between these subtasks as they cannot be interrupted by a parallel task. Also, the subtasks are added in sorted order from left to right.

We will now prove the propositions by induction.

Base case. When a PE 푖 processes its first parallel task 푇[0, 푛), PE 푖 does not have any sequential tasks.

Inductive step. We assume that the induction hypothesis holds when PE 푖 was executing the parallel task 푇[푙, 푟) with the PE group [푡 .. 푡). We also assume that PE 푖 and others execute the next parallel task 푇[푙푠, 푟푠). We note that 푇[푙푠, 푟푠) is a subtask of 푇[푙, 푟). We have to prove that the induction hypothesis still holds after we have added PE 푖's sequential subtasks of 푇[푙, 푟) to the PE, i.e., when PE 푖 executes the subtask 푇[푙푠, 푟푠).

• PE 푖 has executed 푇[푙, 푟) in the state Middle and PE 푖 is executing 푇[푙푠, 푟푠) in the state Middle. From the induction hypothesis, we know that PE 푖 did not have sequential tasks when 푇[푙, 푟) was executed. We have to show that PE 푖 did not get sequential subtask of 푇[푙, 푟), after 푇[푙, 푟) has been executed. A sequential subtask 푇[푎, 푏) would have been added to PE 푖, if 푖 = min(푎푡~푛, 푡 − 1). We show that no 푇[푎, 푏) with this property exists. As PE 푖 is not the rightmost PE of 푇[푙, 푟), we have 푖 < 푡 − 1. This means that a sequential subtask 푇[푎, 푏) is only assigned to PE 푖 if 푖 = ⌊푎푡~푛⌋ holds, i.e., 푎 ∈ [푖푛~푡, (푖 + 1)푛~푡) is required. However, there is no sequential subtask of 푇[푙, 푟) that begins in the 푖-th stripe of the input array: As PE 푖 is not the leftmost PE of 푇[푙푠, 푟푠), the parallel subtask 푇[푙푠, 푟푠) contains the subarray 퐴[푖푛~푡, (푖 + 1)푛~푡 − 1] completely (see Lemma 3.5). Thus, a second (sequential) subtask 푇[푎, 푏) with 푖푛~푡 ≤ 푎 < (푖 + 1)푛~푡 cannot exist.

• PE 푖 has executed 푇[푙, 푟) in the state Middle and PE 푖 is executing 푇[푙푠, 푟푠) in the state Right. From the induction hypothesis, we know that PE 푖 did not have sequential tasks when 푇[푙, 푟) was executed. As PE 푖 was not the rightmost PE of 푇[푙, 푟), we have 푖 < 푡 − 1. This means that a sequential subtask 푇[푎, 푏) of 푇[푙, 푟) is only assigned to PE 푖 if 푖 = ⌊푎푡~푛⌋ holds, i.e., 푎 ∈ [푖푛~푡, (푖 + 1)푛~푡) is required. However, as PE 푖 is not the leftmost PE of 푇[푙푠, 푟푠), 푇[푙푠, 푟푠) completely contains 퐴[푖푛~푡, (푖+1)푛~푡−1]. Thus, there is no sequential subtask 푇[푎, 푏) with 푎 ∈ [푖푛~푡(푖 + 1)푛~푡)—we do not add sequential tasks of 푇[푙, 푟) to PE 푖.

• PE 푖 has executed 푇[푙, 푟) in the state Middle and PE 푖 is executing 푇[푙푠, 푟푠) in the state Left. From the induction hypothesis, we know that PE 푖 did not have sequential tasks when 푇[푙, 푟) was executed. Also, as PE 푖 is not the rightmost PE of 푇[푙, 푟), we have 푖 < 푡−1. This means that a sequential subtask 푇[푎, 푏) is only assigned to PE 푖 if 푖 = ⌊푎푡~푛⌋ holds, i.e., 푎 ∈ [푖푛~푡, (푖 + 1)푛~푡) is required. Thus, if there is no sequential subtask 푇[푎, 푏) of 푇[푙, 푟) with 푎 ∈ [푖푛~푡, (푖 + 1)푛~푡), the PE 푖 does not get sequential subtasks and the induction step is completed in the case here. Otherwise, if sequential subtasks 푇[푎, 푏) exist with 푎 ∈ 푖푛~푡, (푖+1)푛~푡), they are added to PE 푖 and we have to show that the propositions hold afterwards: All subtasks 푇[푎, 푏) that begin in 퐴[푖푛~푡, (푖 + 1)푛~푡] are sequential subtasks, except one parallel subtask, 푇[푙푠, 푟푠). Thus, there is no gap between these sequential subtasks. As PE 푖 is the leftmost PE of 푇[푙푠, 푟푠), we know that 푙푠 ∈ (푖푛~푡, (푖 + 1)푛~푡) and that 푟푠 ≥ (푖 + 2)푛~푡. Thus, the rightmost sequential subtask ends at 퐴[푙푠 − 1] with 푙푠 − 1 ∈ (푖푛~푡, (푖 + 1)푛~푡 − 1).


• PE 푖 has executed 푇[푙, 푟) in the state Left (Right) and PE 푖 executes 푇[푙푠, 푟푠) in the state Left (Right). From the induction hypothesis, we know that 푙 − 1 (that 푟) is narrowed by 푙 − 1 ∈ (푖푛~푡, (푖+1)푛~푡) (by 푟 ∈ [푖푛~푡, (푖+2)푛~푡)) before tasks of 푇[푙, 푟) are added to PE 푖. As PE 푖 is the leftmost (rightmost) PE of 푇[푙푠, 푟푠), we can narrow the begin 푙푠 (end 푟푠) of 푇[푙푠, 푟푠) also by 푙푠 − 1 ∈ (푖푛~푡, (푖 + 1)푛~푡) (by 푟푠 ∈ [푖푛~푡, (푖 + 2)푛~푡)). Thus, 푇[푙, 푟) creates subtasks whereof one subtask starts at 퐴[푙] (at 퐴[푟푠]), one subtask ends at 퐴[푙푠 − 1] (at 퐴[푟 − 1]), and subtasks cover the remaining elements in between without gaps. These subtasks are sequential subtasks as ⌊(푙푠 − 1)푡~푛⌋ − ⌊푙푡~푛⌋ ≤ ⌊((푖 + 1)푛~푡 − 1)푡~푛⌋ − ⌊(푖푛~푡)푡~푛⌋ = 0 (as ⌊(푟 − 1)푡~푛⌋ − ⌊푟푠푡~푛⌋ ≤ ⌊((푖 + 2)푛~푡 − 1)푡~푛⌋ − ⌊(푖푛~푡)푡~푛⌋ = 1). And, these sequential subtasks are all added to PE 푖, as they start in the subarray 퐴[푖푛~푡, (푖 + 1)푛~푡 − 1] (in the subarray 퐴[푖푛~푡, (푖 + 2)푛~푡 − 1], the +2 is used as PE 푖 is the rightmost PE of 푇[푙푠, 푟푠)). Note that, in the penultimate sentence, we used the inequality 푙 ≤ 푖푛~푡 (the inequality 푟 ≤ (푖 + 2)푛~푡) from the induction hypothesis. When these subtasks were added to PE 푖, the sequential tasks of PE 푖 still cover a consecutive sequence of elements: On the one hand, the leftmost (rightmost) sequential subtask starts at 퐴[푙] (ends at 퐴[푟 − 1],) and the new sequential subtasks have no gaps in between. On the other hand, we know from the induction hypothesis that the rightmost (leftmost) sequential task of PE 푖 had ended at position 푙 − 1 (had started at position 푟) and that the old sequential tasks of PE 푖 had not had gaps in between. ◻
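A hedged helper capturing the stripe-based assignment rule used throughout the proof, namely that a sequential subtask starting at global position 푎 is assigned to PE min(⌊푎푡/푛⌋, 푡 − 1); the function is our own illustration, not code from IPS4o.

    #include <algorithm>
    #include <cstdint>

    // Owner PE of a sequential subtask that starts at global index a, for an
    // input of n elements and t PEs; assumes a*t fits into 64 bits.
    inline int ownerPE(std::int64_t a, std::int64_t n, int t) {
        std::int64_t pe = (a * t) / n;                        // floor(a*t/n)
        return static_cast<int>(std::min<std::int64_t>(pe, t - 1));
    }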


A.3 More Measurements

28 210 212 214 216 218 220 222 224 226 228 230 232 I1S4o 2.02 0.84 0.75 0.55 0.45 0.37 0.32 0.30 0.17 0.16 0.15 0.15 0.15 BlockPDQ 1.09 0.75 0.69 0.61 0.49 0.41 0.37 0.30 0.20 0.19 0.18 0.19 0.19 1S4o 1.42 1.00 0.62 0.57 0.47 0.42 0.38 0.30 0.24 0.21 0.24 0.21 0.23 std::sort 1.28 1.20 1.04 1.21 0.92 0.82 0.76 0.45 0.42 0.41 0.41 0.41 0.41 2

I4x20 I1S Ra 0.85 0.80 0.50 0.44 0.46 0.29 0.25 0.25 0.14 0.11 0.11 0.12 0.12 SkaSort 0.86 0.62 0.52 0.66 0.43 0.32 0.28 0.23 0.16 0.13 0.13 0.15 0.13 IppRadix 2.25 0.67 0.49 0.39 0.37 0.36 0.33 0.35 0.27 0.24 0.28 0.22 0.21 I1S4o 0.68 0.37 0.24 0.34 0.23 0.16 0.16 0.14 0.13w 0.12 0.12 0.12 BlockPDQ 0.43 0.29 0.36 0.25 0.24 0.22 0.20 0.18 0.18 0.18 0.18 0.17 1S4o 0.87 0.42 0.35 0.20 0.20 0.19 0.19 0.17 0.16 0.15 0.15 0.15 std::sort 0.60 0.36 0.59 0.35 0.34 0.38 0.37 0.34 0.33 0.33 0.34 0.34 I1S2Ra 0.65 0.43 0.18 0.16 0.14 0.13 0.09 0.11 0.09 0.08 0.07 0.09 A1x16 SkaSort 0.46 0.21 0.32 0.37 0.17 0.16 0.14 0.16 0.12 0.11 0.11 0.13 IppRadix 0.74 0.30 0.25 0.21 0.23 0.27 0.25 0.24 0.21 0.19 0.19 0.17 I1S4o 1.12 0.62 0.52 0.40 0.35 0.33 0.31 0.29 0.28 0.28 0.31 0.30 0.25 BlockPDQ 0.67 0.53 0.44 0.40 0.37 0.36 0.34 0.33 0.32 0.31 0.36 0.34 0.33 1S4o 1.10 0.57 0.44 0.39 0.40 0.37 0.35 0.39 0.39 0.36 0.40 0.42 0.38 std::sort 0.81 0.73 0.70 0.70 0.68 0.67 0.67 0.66 0.66 0.66 0.71 0.66 0.66 2

I2x16 I1S Ra 0.70 0.60 0.98 0.50 0.37 0.26 0.28 0.27 0.26 0.21 0.24 0.26 0.23 SkaSort 0.65 0.44 0.35 0.38 0.33 0.26 0.23 0.26 0.24 0.19 0.23 0.22 0.20 IppRadix 1.31 0.49 0.52 0.60 0.65 0.57 0.54 0.73 0.73 0.70 0.72 0.76 0.70 I1S4o 0.80 0.66 0.35 0.24 0.21 0.18 0.18 0.17 0.16 0.15 0.15 0.14 0.14 BlockPDQ 0.44 0.34 0.29 0.26 0.24 0.23 0.22 0.21 0.20 0.19 0.18 0.18 0.17 1S4o 0.81 0.59 0.38 0.30 0.26 0.24 0.23 0.22 0.21 0.20 0.19 0.18 0.18 std::sort 0.52 0.53 0.47 0.54 0.48 0.45 0.45 0.44 0.44 0.44 0.43 0.43 0.44 I1S2Ra 0.52 0.55 0.32 0.23 0.18 0.14 0.11 0.13 0.12 0.10 0.08 0.10 0.09 A1x64 SkaSort 0.41 0.29 0.25 0.29 0.21 0.17 0.16 0.20 0.16 0.13 0.12 0.14 0.11 IppRadix 1.08 0.47 0.32 0.29 0.26 0.25 0.25 0.26 0.27 0.24 0.22 0.20 0.17

Table A.1: Running times in nanoseconds of sequential algorithms sorting uint64 values with input distribution Uniform executed on different machines. The running times are scaled by 1/(8푛 log₂ 푛). The fastest comparison-based and the fastest non-comparison-based running times are highlighted.


[Figure A.1 shows four panels, one per machine (I4x20, A1x16, I2x16, A1x64). The x-axis is the item count 푛, the y-axis the running time scaled by 1/(4푛 log₂ 푛) in nanoseconds, and the curves are I1S4o, BlockPDQ, 1S4o, SkaSort, IppRadix, and I1S2Ra.]

Figure A.1: Running times of sequential algorithms sorting uint32 values with input distribution Uniform executed on different machines. The results of DualPivot, std::sort, Timsort, QMSort, and WikiSort cannot be seen as their running times exceed the plot.

162 A.3 More Measurements o o Ra 4 4 Type Distribution 2 1S I1S IPS SkaSort BlockQ std::sort Timsort QMSort WikiSort IppRadix DualPivot BlockPDQ double Sorted 1.11 1.77 20.31 1.04 11.54 17.14 1.03 51.78 2.90 16.90 49.53 double ReverseSorted 1.04 1.87 15.20 1.08 5.57 6.40 1.07 25.97 6.08 9.18 24.82 double Zero 1.14 1.93 17.48 1.11 1.29 13.56 1.02 2.93 3.72 13.09 18.38 double Exponential 1.05 1.10 1.24 1.28 2.31 2.56 4.15 3.79 3.98 1.27 1.12 double Zipf 1.19 1.32 1.44 1.45 2.86 3.07 4.77 4.23 4.84 1.26 1.14 double RootDup 1.10 1.39 1.73 1.62 1.51 2.50 1.50 5.43 2.81 1.70 2.43 double TwoDup 1.21 1.33 1.40 1.42 2.50 2.73 2.93 3.27 3.12 1.11 1.14 double EightDup 1.02 1.08 1.36 1.27 2.46 2.78 4.28 4.29 4.07 1.21 1.35 double AlmostSorted 2.22 1.05 1.87 2.79 1.57 1.62 1.25 5.85 2.26 2.05 4.22 double Uniform 1.13 1.24 1.27 1.32 2.47 2.57 3.62 2.94 3.56 1.14 1.07 Total 1.23 1.21 1.46 1.53 2.19 2.51 2.89 4.15 3.43 1.36 1.55 Rank 2 1 4 5 7 8 9 12 11 3 6 uint64 Sorted 1.10 1.77 17.70 1.06 8.93 16.60 1.06 42.12 2.82 17.88 61.80 11.03 uint64 ReverseSorted 1.02 1.73 14.04 1.08 4.92 6.25 1.03 22.35 6.16 10.28 35.04 6.71 uint64 Zero 1.10 1.50 15.49 1.04 1.08 12.05 1.04 2.35 3.34 12.65 16.71 1.29 uint64 Exponential 1.09 1.22 1.35 1.40 2.65 2.84 4.69 3.74 4.62 1.23 1.37 1.05 uint64 Zipf 1.45 1.71 1.92 1.93 3.54 3.82 6.08 5.02 6.23 1.60 1.50 1.04 uint64 RootDup 1.06 1.44 1.77 1.70 1.43 2.55 1.64 5.16 3.24 1.70 2.28 1.08 uint64 TwoDup 1.55 1.84 1.89 2.00 3.34 3.57 4.02 4.07 4.29 1.37 2.32 1.00 uint64 EightDup 1.20 1.32 1.56 1.58 2.77 3.15 5.07 4.76 5.03 1.49 2.68 1.02 uint64 AlmostSorted 2.13 1.06 1.85 3.03 1.52 1.71 1.35 5.36 2.41 2.37 6.14 1.23 uint64 Uniform 1.28 1.47 1.51 1.63 2.84 2.97 4.24 3.18 4.32 1.17 1.57 1.05 Total 1.36 1.42 1.68 1.84 2.45 2.86 3.42 4.40 4.14 1.52 2.24 1.06 Rank 2 3 5 6 8 9 10 12 11 4 7 1 uint32 Sorted 2.84 4.29 49.59 2.87 24.36 60.07 1.94 121.52 6.43 35.09 48.04 27.84 uint32 ReverseSorted 1.55 2.27 20.24 1.46 6.38 11.16 1.01 31.74 5.86 9.79 29.71 8.44 uint32 Zero 2.56 3.97 48.98 2.53 2.26 33.81 1.94 6.54 9.05 20.41 12.16 3.12 uint32 Exponential 1.54 1.85 2.07 1.89 4.37 4.57 7.00 5.93 6.71 1.47 1.03 1.18 uint32 Zipf 1.89 2.31 2.65 2.40 5.27 5.67 8.57 7.40 8.91 1.33 1.20 1.18 uint32 RootDup 1.19 1.55 1.97 1.85 1.63 2.76 1.44 5.98 3.15 1.23 1.52 1.11 uint32 TwoDup 1.93 2.46 2.50 2.46 5.05 5.07 5.07 5.20 5.47 1.22 1.46 1.10 uint32 EightDup 1.34 1.64 1.99 1.77 4.17 4.56 6.50 5.74 6.43 1.22 1.83 1.01 uint32 AlmostSorted 2.65 1.25 2.21 3.50 1.83 2.74 1.14 6.79 2.45 2.08 4.92 1.33 uint32 Uniform 1.75 2.05 2.06 2.04 4.10 4.23 5.89 4.55 5.91 1.41 1.00 1.32 Total 1.70 1.83 2.19 2.21 3.46 4.09 4.09 5.88 5.15 1.40 1.58 1.17 Rank 4 5 6 7 8 10 9 12 11 2 3 1 Pair Sorted 1.12 1.57 13.51 1.04 7.57 12.35 1.02 28.08 2.31 13.08 8.61 Pair ReverseSorted 1.11 1.41 9.31 1.01 3.78 4.63 1.05 14.28 7.20 6.49 5.00 Pair Zero 1.16 1.65 10.91 1.05 1.08 10.21 1.03 1.97 2.74 9.02 1.22 Pair Exponential 1.15 2.05 1.29 1.38 2.11 2.45 4.26 3.19 4.18 1.25 1.05 Pair Zipf 1.45 2.75 1.67 1.82 2.69 2.84 4.82 3.73 5.27 1.48 1.02 Pair RootDup 1.20 1.46 1.68 1.71 1.44 2.30 1.86 4.39 3.78 1.60 1.03 Pair TwoDup 1.74 3.04 1.83 2.01 2.90 3.10 3.74 3.56 4.41 1.47 1.00 Pair EightDup 1.30 2.39 1.53 1.65 2.28 2.65 4.51 3.97 4.93 1.46 1.01 Pair AlmostSorted 2.73 1.02 2.29 3.40 1.86 2.06 2.29 5.47 3.91 2.58 1.48 Pair Uniform 1.41 2.54 1.47 1.71 2.46 2.48 3.82 2.88 4.22 1.24 1.00 Total 1.50 2.06 1.66 1.88 2.20 2.53 3.43 3.81 4.36 1.54 1.07 Rank 2 6 4 5 7 8 9 10 11 3 1 Quartet Uniform 1.06 1.91 1.26 1.39 1.92 1.78 3.08 2.01 3.22 1.04 Rank 
2 6 3 4 7 5 9 8 10 1 100B Uniform 1.21 1.16 1.13 1.51 1.52 1.21 2.02 1.55 2.65 1.09 Rank 4 3 2 6 7 5 9 8 10 1

Table A.2: Average slowdowns of sequential algorithms for different data types and input distributions on I4x20. The slowdowns average over input sizes with at least 2^18 bytes.

163 A Sequential and Shared-Memory Sorting o o Ra 4 4 Type Distribution 2 1S I1S IPS SkaSort BlockQ std::sort Timsort QMSort WikiSort IppRadix DualPivot BlockPDQ double Sorted 1.01 1.76 35.11 1.09 15.38 21.09 1.10 76.15 2.58 28.91 72.53 double ReverseSorted 1.02 1.87 15.19 1.01 5.04 5.53 1.10 26.62 6.54 10.39 25.48 double Zero 1.03 1.83 31.36 1.09 1.25 17.36 1.10 2.50 3.36 21.07 32.62 double Exponential 1.02 1.29 1.59 1.23 2.45 2.69 4.36 4.42 4.48 1.46 1.38 double Zipf 1.05 1.46 1.73 1.26 2.70 2.86 4.64 4.43 4.86 1.34 1.27 double RootDup 1.13 1.99 2.51 1.75 1.61 2.43 1.34 6.93 3.54 2.44 2.98 double TwoDup 1.11 1.38 1.46 1.27 2.40 2.46 2.82 3.45 3.11 1.11 1.08 double EightDup 1.00 1.38 1.73 1.25 2.60 3.09 4.62 5.25 4.82 1.60 1.70 double AlmostSorted 2.20 1.48 2.47 2.90 1.58 1.58 1.01 6.79 2.58 2.51 3.53 double Uniform 1.06 1.25 1.34 1.22 2.36 2.39 3.58 3.00 3.61 1.18 1.05 Total 1.18 1.45 1.79 1.48 2.20 2.45 2.78 4.69 3.77 1.58 1.66 Rank 1 2 6 3 7 8 9 12 11 4 5 uint64 Sorted 1.27 1.84 35.98 1.00 15.29 23.94 1.40 70.86 3.38 39.54 100.69 16.53 uint64 ReverseSorted 1.01 1.79 13.80 1.01 4.27 5.30 1.01 20.68 6.63 11.26 28.76 6.19 uint64 Zero 1.24 1.79 35.08 1.00 1.19 15.68 1.41 2.31 4.18 24.10 36.80 1.35 uint64 Exponential 1.06 1.42 1.72 1.31 2.32 2.72 4.82 4.20 4.88 1.29 1.79 1.04 uint64 Zipf 1.78 2.52 3.14 2.24 4.21 4.77 7.84 6.78 8.28 2.37 2.38 1.00 uint64 RootDup 1.62 2.81 3.93 2.90 2.23 3.47 2.32 9.39 5.50 3.38 3.92 1.00 uint64 TwoDup 2.05 2.81 3.01 2.44 4.45 4.86 5.68 6.00 6.44 2.14 3.06 1.00 uint64 EightDup 1.42 1.72 2.48 1.70 3.15 3.97 6.55 6.13 6.40 2.25 4.15 1.02 uint64 AlmostSorted 2.19 1.26 2.45 3.20 1.56 1.65 1.13 5.94 2.94 2.89 6.59 1.18 uint64 Uniform 1.44 1.95 2.07 1.74 3.10 3.43 5.37 4.19 5.44 1.38 2.09 1.02 Total 1.61 1.98 2.60 2.13 2.84 3.37 4.11 5.88 5.47 2.13 3.13 1.03 Rank 2 3 6 4 7 9 10 13 12 5 8 1 uint32 Sorted 2.15 3.19 67.26 2.10 31.01 47.01 2.18 149.90 5.53 62.99 38.38 32.72 uint32 ReverseSorted 1.24 2.08 18.46 1.32 6.12 7.30 1.07 29.95 6.22 11.91 16.88 7.38 uint32 Zero 2.32 3.12 81.68 2.37 2.02 33.24 2.34 5.00 8.11 28.89 16.21 2.66 uint32 Exponential 1.49 1.99 2.59 1.91 3.63 4.11 7.14 6.41 7.00 1.55 1.05 1.05 uint32 Zipf 1.93 3.06 3.89 2.60 5.45 5.94 9.91 8.60 9.80 2.04 1.28 1.06 uint32 RootDup 1.74 3.34 4.51 3.14 2.60 4.03 2.14 11.16 5.37 2.89 2.20 1.00 uint32 TwoDup 2.27 3.18 3.51 2.88 5.32 5.86 6.77 7.21 7.00 1.69 1.24 1.02 uint32 EightDup 1.55 2.17 2.84 1.93 3.92 4.67 7.66 7.48 7.41 1.82 2.13 1.02 uint32 AlmostSorted 2.82 1.69 2.91 4.26 1.96 1.92 1.00 7.71 2.97 2.72 4.37 1.39 uint32 Uniform 1.75 2.33 2.66 2.31 4.27 4.50 6.87 5.22 6.57 1.67 1.02 1.16 Total 1.89 2.46 3.21 2.62 3.67 4.21 4.74 7.50 6.25 2.00 1.66 1.09 Rank 3 5 7 6 8 9 10 13 11 4 2 1 Pair Sorted 1.03 1.77 23.06 1.01 12.03 17.28 1.04 44.20 2.29 24.60 13.90 Pair ReverseSorted 1.05 1.18 8.67 1.04 3.83 4.29 1.04 13.90 7.30 7.52 5.51 Pair Zero 1.02 1.66 18.17 1.02 1.09 14.29 1.03 2.20 2.84 15.14 1.27 Pair Exponential 1.13 2.04 1.28 1.17 1.92 2.18 3.83 3.30 4.46 1.14 1.08 Pair Zipf 1.46 2.80 1.74 1.56 2.64 2.91 5.04 3.91 5.82 1.54 1.02 Pair RootDup 1.49 1.78 2.44 1.96 1.77 2.60 2.04 5.89 4.98 2.24 1.00 Pair TwoDup 1.65 2.98 1.81 1.67 2.85 2.96 3.82 3.65 4.76 1.52 1.00 Pair EightDup 1.32 2.28 1.63 1.40 2.29 2.61 4.91 4.22 5.01 1.74 1.00 Pair AlmostSorted 3.63 1.00 3.48 4.50 2.54 2.64 2.41 7.49 5.13 3.90 2.09 Pair Uniform 1.42 2.63 1.65 1.53 2.60 2.65 4.23 3.21 4.77 1.22 1.05 Total 1.60 2.10 1.91 1.78 2.34 2.64 3.58 4.32 4.97 1.75 1.14 Rank 2 6 5 4 7 8 9 10 11 3 1 Quartet Uniform 1.15 1.89 1.46 1.34 1.91 1.98 
3.37 2.38 3.89 1.01 Rank 2 5 4 3 6 7 9 8 10 1 100B Uniform 1.52 1.35 1.45 1.54 2.17 1.45 2.42 2.06 3.75 1.01 Rank 5 2 4 6 8 3 9 7 10 1

Table A.3: Average slowdowns of sequential algorithms for different data types and input distributions on A1x16. The slowdowns average over input sizes with at least 2^18 bytes.

164 A.3 More Measurements o o Ra 4 4 Type Distribution 2 1S I1S IPS SkaSort BlockQ std::sort Timsort QMSort WikiSort IppRadix DualPivot BlockPDQ double Sorted 1.07 1.83 31.60 1.01 14.48 28.79 1.02 92.63 3.93 27.46 94.22 double ReverseSorted 1.03 1.70 19.10 1.06 5.72 7.75 1.14 36.85 7.45 12.07 37.44 double Zero 1.09 1.79 26.88 1.01 1.15 19.12 1.05 3.02 4.87 21.95 28.47 double Exponential 1.00 1.11 1.23 1.29 2.34 2.69 4.44 4.42 4.19 1.19 1.65 double Zipf 1.08 1.23 1.38 1.48 2.79 3.11 5.01 4.70 4.79 1.08 1.39 double RootDup 1.04 1.23 1.52 1.51 1.23 2.12 1.34 6.01 2.57 1.59 2.39 double TwoDup 1.20 1.35 1.38 1.52 2.60 2.85 3.19 3.79 3.18 1.02 1.45 double EightDup 1.00 1.06 1.31 1.33 2.43 2.90 4.64 5.02 4.35 1.16 1.61 double AlmostSorted 2.29 1.01 2.07 2.76 1.60 1.87 1.30 7.43 2.22 2.26 5.61 double Uniform 1.05 1.20 1.20 1.34 2.43 2.54 3.82 3.26 3.63 1.01 1.78 Total 1.18 1.16 1.42 1.55 2.13 2.55 3.00 4.79 3.45 1.28 2.00 Rank 2 1 4 5 7 8 10 12 11 3 6 uint64 Sorted 1.11 1.83 28.42 1.02 13.30 26.46 1.05 84.18 3.33 34.08 116.92 16.69 uint64 ReverseSorted 1.06 1.73 17.76 1.02 5.28 7.00 1.04 32.52 7.72 13.81 43.72 7.59 uint64 Zero 1.10 1.68 25.90 1.03 1.11 17.94 1.01 2.73 4.42 22.45 28.05 1.47 uint64 Exponential 1.02 1.15 1.24 1.38 2.23 2.59 4.42 4.09 4.14 1.11 2.06 1.10 uint64 Zipf 1.28 1.56 1.71 1.85 3.20 3.61 5.76 5.23 5.72 1.27 1.75 1.01 uint64 RootDup 1.06 1.22 1.52 1.63 1.16 1.95 1.39 5.35 2.73 1.43 2.26 1.35 uint64 TwoDup 1.49 1.67 1.65 1.90 2.99 3.29 3.76 4.19 3.85 1.21 2.26 1.00 uint64 EightDup 1.16 1.20 1.47 1.58 2.51 3.00 4.95 5.04 4.74 1.36 2.54 1.02 uint64 AlmostSorted 2.37 1.02 1.91 3.08 1.60 1.76 1.41 6.81 2.39 2.69 7.30 1.21 uint64 Uniform 1.25 1.41 1.38 1.61 2.61 2.80 4.09 3.43 4.01 1.00 2.80 1.12 Total 1.32 1.30 1.54 1.80 2.21 2.64 3.25 4.77 3.79 1.37 2.66 1.11 Rank 3 2 5 6 7 8 10 13 11 4 9 1 uint32 Sorted 2.82 4.45 83.97 2.32 35.39 95.00 2.00 234.86 8.93 61.73 92.69 42.36 uint32 ReverseSorted 1.47 2.38 26.97 1.51 7.54 12.80 1.00 49.47 7.45 14.09 51.86 10.10 uint32 Zero 2.51 4.09 80.92 1.99 2.49 54.23 2.11 7.59 12.12 34.89 17.80 4.03 uint32 Exponential 1.29 1.53 1.67 1.62 3.20 3.56 5.92 5.77 5.61 1.13 1.11 1.05 uint32 Zipf 1.67 2.10 2.40 2.18 4.76 5.14 8.11 7.72 7.94 1.09 1.34 1.22 uint32 RootDup 1.35 1.43 1.85 1.70 1.41 2.42 1.26 6.66 2.76 1.09 1.67 1.61 uint32 TwoDup 1.86 2.16 2.22 2.15 4.12 4.40 4.86 5.67 4.89 1.04 1.78 1.18 uint32 EightDup 1.28 1.46 1.77 1.59 3.29 3.76 6.00 6.28 5.70 1.04 1.91 1.08 uint32 AlmostSorted 2.77 1.22 2.43 3.56 1.89 2.76 1.09 8.40 2.23 2.30 7.36 1.30 uint32 Uniform 1.35 1.62 1.56 1.63 3.20 3.25 4.79 4.01 4.53 1.01 1.12 1.20 Total 1.59 1.61 1.96 1.98 2.91 3.51 3.68 6.22 4.45 1.19 1.84 1.22 Rank 3 4 6 7 8 9 10 13 11 1 5 2 Pair Sorted 1.08 1.72 19.73 1.02 9.96 19.02 1.01 51.62 2.65 20.73 11.81 Pair ReverseSorted 1.07 1.18 10.15 1.08 3.48 4.63 1.07 17.97 7.55 7.46 5.07 Pair Zero 1.12 1.64 17.10 1.01 1.05 16.22 1.12 2.13 3.05 13.49 1.15 Pair Exponential 1.04 1.94 1.18 1.52 1.82 2.07 3.87 3.35 4.04 1.10 1.07 Pair Zipf 1.39 2.68 1.53 2.06 2.53 2.72 4.83 4.09 5.25 1.25 1.00 Pair RootDup 1.05 1.15 1.39 1.66 1.09 1.76 1.65 4.31 3.24 1.23 1.12 Pair TwoDup 1.48 2.67 1.56 2.00 2.47 2.66 3.35 3.55 3.95 1.21 1.02 Pair EightDup 1.24 2.20 1.40 1.78 2.08 2.47 4.52 4.35 4.85 1.37 1.00 Pair AlmostSorted 3.40 1.00 2.62 4.12 2.17 2.50 2.80 7.94 4.55 3.17 1.66 Pair Uniform 1.29 2.39 1.35 1.76 2.24 2.34 3.75 2.98 4.00 1.04 1.08 Total 1.43 1.88 1.53 2.01 2.00 2.34 3.37 4.16 4.22 1.37 1.12 Rank 3 5 4 7 6 8 9 10 11 2 1 Quartet Uniform 1.22 2.00 1.31 1.82 2.00 2.02 3.37 2.39 3.71 
1.02 Rank 2 5 3 4 6 7 9 8 10 1 100B Uniform 1.51 1.30 1.29 1.88 1.84 1.38 2.39 1.98 3.40 1.04 Rank 5 3 2 7 6 4 9 8 10 1

Table A.4: Average slowdowns of sequential algorithms for different data types and input distributions on I2x16. The slowdowns average over input sizes with at least 2^18 bytes.

165 A Sequential and Shared-Memory Sorting o o Ra 4 4 Type Distribution 2 1S I1S IPS SkaSort BlockQ std::sort Timsort QMSort WikiSort IppRadix DualPivot BlockPDQ double Sorted 1.03 1.64 29.22 1.11 16.69 22.54 1.24 70.63 2.51 27.84 66.47 double ReverseSorted 1.01 1.56 11.86 1.00 4.75 4.98 1.04 21.65 4.68 8.75 19.44 double Zero 1.01 1.64 21.64 1.17 1.11 16.84 1.19 2.62 3.03 18.42 32.61 double Exponential 1.03 1.10 1.20 1.24 2.62 2.92 4.78 4.32 4.79 1.36 1.32 double Zipf 1.02 1.11 1.27 1.26 2.95 3.12 4.92 4.28 5.06 1.08 1.21 double RootDup 1.14 1.64 1.96 1.86 1.79 2.69 1.14 7.04 3.71 2.32 3.24 double TwoDup 1.18 1.26 1.31 1.37 2.80 2.95 3.25 3.48 3.54 1.09 1.09 double EightDup 1.02 1.06 1.29 1.25 2.72 3.13 4.86 4.93 4.82 1.38 1.52 double AlmostSorted 3.03 1.22 3.02 4.47 2.43 2.33 1.00 9.01 3.56 3.52 4.81 double Uniform 1.09 1.15 1.15 1.25 2.64 2.68 3.96 3.07 3.95 1.10 1.04 Total 1.25 1.21 1.51 1.61 2.54 2.82 2.89 4.83 4.16 1.53 1.71 Rank 2 1 3 5 7 8 9 11 10 4 6 uint64 Sorted 1.33 1.92 29.77 1.00 16.75 24.57 1.03 59.38 2.81 37.63 79.94 16.30 uint64 ReverseSorted 1.01 1.54 10.29 1.01 4.11 4.67 1.01 15.95 4.66 9.84 20.49 5.05 uint64 Zero 1.37 2.06 24.96 1.10 1.18 16.98 1.01 2.57 3.81 22.53 39.42 1.39 uint64 Exponential 1.04 1.18 1.33 1.34 2.51 2.89 5.09 3.73 5.07 1.26 1.62 1.04 uint64 Zipf 1.73 2.06 2.39 2.33 4.80 5.32 9.03 6.36 9.19 2.15 2.51 1.00 uint64 RootDup 1.59 2.32 2.90 2.77 2.42 3.61 1.80 7.98 5.65 3.15 4.26 1.00 uint64 TwoDup 2.04 2.47 2.55 2.62 4.94 5.36 6.36 5.47 6.95 1.99 3.41 1.00 uint64 EightDup 1.37 1.52 1.91 1.77 3.43 4.06 6.89 5.57 6.84 2.13 3.39 1.00 uint64 AlmostSorted 2.93 1.19 2.94 4.64 2.39 2.41 1.03 7.08 3.83 4.03 7.29 1.66 uint64 Uniform 1.43 1.73 1.73 1.82 3.59 3.78 5.86 3.75 5.84 1.34 2.02 1.00 Total 1.65 1.71 2.17 2.30 3.30 3.78 4.17 5.51 6.00 2.12 3.13 1.08 Rank 2 3 5 6 8 9 10 11 12 4 7 1 uint32 Sorted 2.48 4.51 67.12 2.80 37.74 54.21 1.93 139.72 6.14 64.10 28.28 35.51 uint32 ReverseSorted 1.42 1.88 13.03 1.55 5.38 6.34 1.06 19.68 4.27 8.92 7.12 5.83 uint32 Zero 2.19 3.95 60.33 2.38 2.21 41.48 1.96 6.01 7.85 27.01 17.69 2.97 uint32 Exponential 1.60 1.84 2.13 1.97 4.12 4.76 7.67 5.90 7.78 1.51 1.00 1.09 uint32 Zipf 2.04 2.51 3.01 2.67 6.12 6.86 10.77 7.84 11.07 1.66 1.25 1.06 uint32 RootDup 1.67 2.48 3.21 2.83 2.56 4.07 1.50 9.10 5.14 2.41 2.08 1.00 uint32 TwoDup 2.65 3.13 3.32 3.13 6.46 7.09 7.72 6.89 8.31 1.63 1.09 1.10 uint32 EightDup 1.53 1.81 2.31 1.93 4.31 5.07 7.80 6.48 7.89 1.55 1.30 1.00 uint32 AlmostSorted 5.23 2.10 4.95 8.25 3.93 3.96 1.00 12.63 5.42 5.34 5.24 2.70 uint32 Uniform 2.21 2.53 2.60 2.57 5.38 5.77 8.32 5.38 8.39 1.75 1.02 1.29 Total 2.21 2.31 2.97 2.94 4.51 5.24 4.84 7.49 7.49 2.03 1.53 1.24 Rank 4 5 7 6 8 10 9 11 12 3 2 1 Pair Sorted 1.03 1.65 20.71 1.04 12.67 18.52 1.03 35.04 2.41 23.54 12.43 Pair ReverseSorted 1.08 1.18 6.89 1.07 3.77 3.90 1.05 10.76 5.49 6.82 4.58 Pair Zero 1.02 1.65 13.38 1.04 1.06 12.81 1.03 2.03 2.78 12.80 1.24 Pair Exponential 1.06 2.00 1.09 1.20 1.96 2.18 4.16 2.72 4.40 1.13 1.04 Pair Zipf 1.53 3.18 1.63 1.74 3.06 3.31 5.96 3.77 6.50 1.58 1.00 Pair RootDup 1.62 1.90 2.06 2.17 1.86 2.85 1.96 5.37 5.22 2.22 1.00 Pair TwoDup 1.67 3.23 1.69 1.86 3.10 3.34 4.26 3.32 4.99 1.55 1.00 Pair EightDup 1.24 2.37 1.34 1.44 2.31 2.70 4.96 3.64 5.16 1.72 1.00 Pair AlmostSorted 3.51 1.00 3.26 4.62 2.70 2.91 1.96 6.48 5.18 4.17 2.03 Pair Uniform 1.41 2.82 1.40 1.58 2.71 2.81 4.62 2.75 4.90 1.24 1.00 Total 1.60 2.21 1.68 1.90 2.48 2.85 3.69 3.83 5.16 1.77 1.11 Rank 2 6 3 5 7 8 9 10 11 4 1 Quartet Uniform 1.13 1.82 1.24 1.28 1.96 1.84 3.04 
1.95 3.67 1.01 Rank 2 5 3 4 8 6 9 7 10 1 100B Uniform 1.53 1.38 1.40 1.68 2.10 1.42 2.25 1.79 3.50 1.01 Rank 5 2 3 6 8 4 9 7 10 1

Table A.5: Average slowdowns of sequential algorithms for different data types and input distributions on A1x64. The slowdowns average over input sizes with at least 2^18 bytes.


Type Distribution 1S4o S4oS double Sorted 1.00 35.77 double ReverseSorted 1.00 16.15 double Zero 1.00 17.03 double Exponential 1.00 1.41 double Zipf 1.00 1.34 double RootDup 1.02 1.39 double TwoDup 1.00 1.23 double EightDup 1.00 1.52 double AlmostSorted 1.02 1.13 double Uniform 1.00 1.23 Total 1.01 1.32 Rank 1 2 uint64 Sorted 1.00 37.39 uint64 ReverseSorted 1.00 15.50 uint64 Zero 1.00 16.41 uint64 Exponential 1.00 1.41 uint64 Zipf 1.00 1.31 uint64 RootDup 1.02 1.33 uint64 TwoDup 1.00 1.24 uint64 EightDup 1.00 1.48 uint64 AlmostSorted 1.04 1.11 uint64 Uniform 1.00 1.24 Total 1.01 1.30 Rank 1 2 uint32 Sorted 1.00 39.40 uint32 ReverseSorted 1.01 15.16 uint32 Zero 1.00 20.16 uint32 Exponential 1.00 1.49 uint32 Zipf 1.00 1.38 uint32 RootDup 1.01 1.45 uint32 TwoDup 1.00 1.25 uint32 EightDup 1.00 1.56 uint32 AlmostSorted 1.04 1.14 uint32 Uniform 1.00 1.25 Total 1.01 1.35 Rank 1 2 Pair Sorted 1.00 24.42 Pair ReverseSorted 1.00 10.47 Pair Zero 1.00 12.16 Pair Exponential 1.00 1.32 Pair Zipf 1.01 1.20 Pair RootDup 1.02 1.24 Pair TwoDup 1.00 1.19 Pair EightDup 1.00 1.37 Pair AlmostSorted 1.04 1.07 Pair Uniform 1.01 1.16 Total 1.01 1.22 Rank 1 2 Quartet Uniform 1.01 1.12 Rank 1 2 100B Uniform 1.09 1.04 Rank 2 1

Table A.6: Average slowdowns of 1S4o and S4oS for different data types and input distributions. The slowdowns average over the machines and input sizes with at least 2^18 bytes.


218 220 222 224 226 228 230 232 234 IPS4o 272.84 51.73 8.20 3.00 1.06 0.66 0.50 0.49 0.49 PBBS 10.15 4.62 2.86 1.70 1.39 1.10 1.13 1.12 1.13 MCSTLbq 229 352.11 63.94 15.79 4.58 2.25 1.91 1.86 1.82 IPS2Ra 237.91 37.12 10.54 2.56 0.93 0.54 0.45 0.45 0.40 I4x20 PBBR 13.96 4.03 2.08 0.96 0.92 0.62 0.89 0.70 0.79 RADULS2 721.12 146.75 37.14 9.29 3.09 1.87 1.25 1.30 1.09 RegionSort 314.64 65.06 15.72 4.22 2.25 1.55 1.45 1.22 1.23 IPS4o 14.39 1.11 0.62 0.47 0.41 0.47 0.44 0.42 PBBS 1.99 1.20 0.79 0.83 0.72 0.69 0.66 0.65 MCSTLbq 57.04 13.62 2.34 1.08 1.11 1.23 1.34 1.48 IPS2Ra 2.41 0.48 0.38 0.37 0.41 0.48 0.46 0.42

A1x16 PBBR 1.16 0.68 0.49 0.56 0.56 0.63 0.63 0.59 RADULS2 9.13 2.87 1.20 0.63 0.48 0.50 0.51 0.44 RegionSort 9.88 2.56 0.91 0.49 0.55 0.51 0.49 0.45 IPS4o 21.33 2.89 1.12 0.74 0.58 0.53 0.50 0.48 0.48 PBBS 2.93 1.74 1.29 1.19 1.13 1.10 1.10 1.07 1.05 MCSTLbq 149.59 36.86 6.07 2.09 1.41 1.27 1.31 1.26 1.33 IPS2Ra 10.25 2.63 1.02 0.67 0.48 0.45 0.45 0.45 0.40 I2x16 PBBR 2.63 1.22 0.75 0.66 0.72 0.68 0.69 0.67 0.64 RADULS2 61.36 14.46 4.15 1.29 0.62 0.47 0.40 0.37 0.39 RegionSort 57.67 13.92 3.58 1.35 0.72 0.49 0.59 0.42 0.46 IPS4o 26.17 11.88 3.34 1.26 0.63 0.50 0.45 0.42 0.42 PBBS 4.72 2.48 1.74 1.26 0.95 0.87 0.83 0.83 0.83 MCSTLbq 408.96 95.10 27.22 5.30 1.79 1.29 1.18 1.19 1.24 IPS2Ra 45.22 11.72 3.20 1.25 0.52 0.45 0.44 0.41 0.39

A1x64 PBBR 5.80 2.03 1.38 0.87 0.64 0.58 0.55 0.52 0.50 RADULS2 154.04 36.82 9.82 2.79 1.00 0.56 0.44 0.40 0.40 RegionSort 103.12 26.75 6.62 2.09 0.96 0.66 0.55 0.39 0.46

Table A.7: Running times in nanoseconds of parallel algorithms sorting uint64 values with input distribution Uniform executed on different machines. The running times are scaled by 푡/(8푛 log₂ 푛). The fastest comparison-based and the fastest non-comparison-based running times are highlighted.


[Figure A.2 shows eight panels (Uniform double, AlmostSorted uint64, RootDup uint64, TwoDup uint64, Exponential uint64, Zipf uint64, Uniform Pair, Uniform 100B). The x-axis is the item count 푛, the y-axis the running time scaled by 푡/(퐷푛 log₂ 푛) in nanoseconds, and the curves are IPS4o, PBBS, MCSTLbq, PBBR, RADULS2, RegionSort, and IPS2Ra.]

Figure A.2: Running times of parallel algorithms on different input distributions and data types of size 퐷 executed on machine A1x64. The radix sorters PBBR, RADULS2, RegionSort, and IPS2Ra do not support the data types double and 100B.


[Figure A.3 shows eight panels (Uniform double, AlmostSorted uint64, RootDup uint64, TwoDup uint64, Exponential uint64, Zipf uint64, Uniform Pair, Uniform 100B). The x-axis is the item count 푛, the y-axis the running time scaled by 푡/(퐷푛 log₂ 푛) in nanoseconds, and the curves are IPS4o, PBBS, MCSTLbq, PBBR, RADULS2, RegionSort, and IPS2Ra.]

Figure A.3: Running times of parallel algorithms on different input distributions and data types of size 퐷 executed on machine I2x16. The radix sorters PBBR, RADULS2, RegionSort, and IPS2Ra do not support the data types double and 100B.


[Figure A.4 shows eight panels (Uniform double, AlmostSorted uint64, RootDup uint64, TwoDup uint64, Exponential uint64, Zipf uint64, Uniform Pair, Uniform 100B). The x-axis is the item count 푛, the y-axis the running time scaled by 푡/(퐷푛 log₂ 푛) in nanoseconds, and the curves are IPS4o, PBBS, MCSTLbq, PBBR, RADULS2, RegionSort, and IPS2Ra.]

Figure A.4: Running times of parallel algorithms on different input distributions and data types of size 퐷 executed on machine A1x16. The radix sorters PBBR, RADULS2, RegionSort, and IPS2Ra do not support the data types double and 100B.


[Figure A.5 shows eight panels (Uniform double, AlmostSorted uint64, RootDup uint64, TwoDup uint64, Exponential uint64, Zipf uint64, Uniform Pair, Uniform 100B). The x-axis is the item count 푛, the y-axis the running time scaled by 푡/(퐷푛 log₂ 푛) in nanoseconds, and the curves are IPS4o, PBBS, MCSTLbq, PBBR, RADULS2, RegionSort, and IPS2Ra.]

Figure A.5: Running times of parallel algorithms on different input distributions and data types of size 퐷 executed on machine I4x20. The radix sorters PBBR, RADULS2, RegionSort, and IPS2Ra do not support the data types double and 100B.

172 A.3 More Measurements o o Ra 4 4 Type Distribution 2 TBB PS IPS PBBS PBBR ASPaS IPS RADULS2 MCSTLbq RegionSort MCSTLmwm double Sorted 1.04 14.24 1.36 17.74 22.71 1.01 65.29 double ReverseSorted 1.09 1.21 1.73 1.40 15.82 3.40 6.39 double Zero 1.04 12.29 1.30 19.88 319.78 1.00 64.20 double Exponential 1.00 1.82 1.87 2.45 3.37 15.64 5.57 double Zipf 1.00 1.89 1.98 2.51 3.25 16.17 5.99 double RootDup 1.00 1.45 2.02 2.20 3.74 6.33 6.78 double TwoDup 1.00 1.90 1.73 2.23 2.92 7.01 4.85 double EightDup 1.00 1.84 1.94 2.29 3.34 15.16 5.63 double AlmostSorted 1.00 1.50 2.12 4.12 2.57 3.43 7.15 double Uniform 1.00 1.96 1.70 2.33 3.00 12.65 4.77 Total 1.00 1.75 1.90 2.53 3.15 9.57 5.76 Rank 123457 6 uint64 Sorted 1.17 12.63 1.31 17.38 23.67 1.00 7.19 78.17 44.41 10.76 uint64 ReverseSorted 1.17 1.24 1.96 1.59 18.27 3.96 1.07 7.83 4.24 1.37 uint64 Zero 1.09 12.45 1.33 19.68 317.45 1.00 1.02 69.97 39.00 1.32 uint64 Exponential 1.02 1.61 1.97 2.39 3.32 14.06 1.63 1.51 2.61 1.29 uint64 Zipf 1.00 1.66 2.03 2.41 3.44 14.10 1.39 19.54 5.47 1.23 uint64 RootDup 1.00 1.35 2.06 2.20 3.62 7.54 1.33 9.04 5.70 1.23 uint64 TwoDup 1.03 1.74 1.87 2.25 3.11 7.34 1.11 10.11 3.43 1.08 uint64 EightDup 1.00 1.61 2.00 2.24 3.46 13.40 1.28 13.38 4.38 1.23 uint64 AlmostSorted 1.11 1.58 2.46 4.72 3.22 4.39 1.05 9.69 5.47 1.30 uint64 Uniform 1.11 1.96 2.01 2.59 3.15 13.15 1.40 1.26 1.27 1.03 Total 1.04 1.63 2.05 2.59 3.33 9.77 1.30 6.25 3.64 1.20 Rank 1 4 5 6 7 10 3 9 8 2 uint32 Sorted 1.23 9.48 1.74 10.22 17.28 2.09 4.87 7.39 4.96 uint32 ReverseSorted 1.67 1.84 2.56 1.87 16.85 8.19 1.06 1.39 1.16 uint32 Zero 1.09 13.43 1.35 22.64 474.99 1.00 1.01 89.40 1.38 uint32 Exponential 1.27 2.60 2.12 3.43 4.24 24.27 1.37 1.78 1.00 uint32 Zipf 1.06 2.32 1.94 3.07 3.90 22.24 1.16 6.03 1.02 uint32 RootDup 1.11 1.61 2.13 2.43 3.89 7.71 1.18 6.98 1.08 uint32 TwoDup 1.46 3.09 2.27 3.53 4.61 12.22 1.07 1.59 1.00 uint32 EightDup 1.24 2.66 2.13 3.21 3.99 23.06 1.16 1.54 1.04 uint32 AlmostSorted 1.51 1.99 2.60 5.20 3.69 5.49 1.12 1.52 1.01 uint32 Uniform 1.46 3.18 2.24 3.77 4.73 21.36 1.21 1.50 1.01 Total 1.29 2.43 2.20 3.44 4.13 14.54 1.18 2.30 1.02 Rank 36478925 1 Pair Sorted 1.05 12.66 1.32 16.25 23.98 1.00 6.54 29.00 75.74 9.68 Pair ReverseSorted 1.11 1.28 1.85 1.57 16.77 3.21 1.12 2.82 7.53 1.39 Pair Zero 1.06 14.76 1.29 19.14 283.98 1.00 1.04 15.63 74.08 1.35 Pair Exponential 1.21 1.59 2.27 2.33 3.41 8.89 1.93 1.01 9.32 1.62 Pair Zipf 1.00 1.35 1.90 1.93 2.91 7.31 1.38 7.94 8.41 1.26 Pair RootDup 1.02 1.24 1.87 2.00 3.49 5.12 1.26 3.38 9.61 1.28 Pair TwoDup 1.02 1.37 1.88 1.86 2.84 4.37 1.24 4.69 6.20 1.17 Pair EightDup 1.02 1.36 1.96 1.89 3.10 7.29 1.31 8.00 7.57 1.29 Pair AlmostSorted 1.07 1.74 2.59 4.25 3.64 3.94 1.09 4.01 10.36 1.31 Pair Uniform 1.08 1.50 1.98 2.04 2.94 7.26 1.47 1.05 4.39 1.07 Total 1.06 1.44 2.05 2.23 3.17 6.07 1.36 3.30 7.70 1.28 Rank 1 4 5 6 7 9 3 8 10 2 Quartet Uniform 1.03 1.14 1.99 1.83 2.71 4.68 Rank 1 2 4 3 5 6 100B Uniform 1.05 1.11 2.04 1.70 2.53 3.51 Rank 1 2 4 3 5 6

Table A.8: Average slowdowns of parallel algorithms for different data types and input distributions obtained on machine A1x64. The slowdowns average over input sizes with at least 푡 ⋅ 2^21 bytes.

173 A Sequential and Shared-Memory Sorting o o Ra 4 4 Type Distribution 2 TBB PS IPS PBBS PBBR ASPaS IPS RADULS2 MCSTLbq RegionSort MCSTLmwm double Sorted 2.47 18.27 2.81 22.46 19.43 1.02 60.33 double ReverseSorted 1.05 1.20 1.78 1.40 8.79 2.24 4.29 double Zero 2.17 17.91 2.30 26.99 292.91 1.03 60.63 double Exponential 1.00 2.06 1.92 2.76 2.79 11.65 4.17 double Zipf 1.00 2.36 2.08 3.16 3.33 13.48 4.65 double RootDup 1.00 1.63 2.29 2.62 3.53 5.71 5.71 double TwoDup 1.00 2.15 1.79 2.60 2.64 5.37 3.52 double EightDup 1.00 2.10 1.98 2.67 2.93 11.29 4.29 double AlmostSorted 1.00 1.47 2.07 4.21 1.57 2.72 5.05 double Uniform 1.00 2.23 1.78 2.70 2.65 9.35 3.43 Total 1.00 1.97 1.98 2.92 2.71 7.54 4.34 Rank 123547 6 uint64 Sorted 2.32 16.98 2.91 21.77 18.67 1.01 8.19 93.89 50.76 12.20 uint64 ReverseSorted 1.34 1.43 2.28 1.75 11.17 2.83 1.00 8.30 4.24 1.47 uint64 Zero 1.62 18.42 2.30 26.88 291.98 1.07 1.09 85.76 52.94 1.09 uint64 Exponential 1.04 1.98 2.06 2.79 3.01 11.02 1.58 1.45 1.96 1.08 uint64 Zipf 1.00 2.17 2.09 2.99 3.24 12.17 1.32 18.30 5.64 1.30 uint64 RootDup 1.00 1.55 2.27 2.53 3.50 5.64 1.17 9.66 6.40 1.26 uint64 TwoDup 1.19 2.36 2.10 2.91 3.12 6.18 1.09 11.02 3.39 1.15 uint64 EightDup 1.05 2.02 2.11 2.66 2.98 10.68 1.14 14.02 4.45 1.15 uint64 AlmostSorted 1.23 1.73 2.62 5.24 1.99 3.42 1.04 9.95 5.18 1.26 uint64 Uniform 1.21 2.50 2.15 3.09 3.11 10.31 1.36 1.54 1.15 1.06 Total 1.10 2.02 2.19 3.08 2.95 7.80 1.23 6.43 3.49 1.18 Rank 1 4 5 7 6 10 3 9 8 2 uint32 Sorted 3.33 14.36 3.55 14.59 18.15 1.96 6.28 8.67 6.47 uint32 ReverseSorted 1.94 2.12 2.80 2.07 12.90 5.32 1.02 1.28 1.14 uint32 Zero 1.97 19.35 1.99 32.52 473.11 1.06 1.08 105.42 1.09 uint32 Exponential 1.46 3.38 2.43 4.20 4.33 19.28 1.32 1.93 1.00 uint32 Zipf 1.10 2.99 2.03 3.73 3.90 17.67 1.05 5.83 1.10 uint32 RootDup 1.21 1.94 2.41 2.96 3.48 6.35 1.01 7.00 1.36 uint32 TwoDup 1.64 3.79 2.48 4.08 4.38 9.68 1.04 2.12 1.06 uint32 EightDup 1.38 3.48 2.43 3.95 4.29 18.54 1.10 1.84 1.09 uint32 AlmostSorted 1.72 2.25 2.76 6.11 2.87 4.77 1.17 1.34 1.01 uint32 Uniform 1.53 3.63 2.23 3.95 4.09 14.73 1.09 1.73 1.08 Total 1.42 2.98 2.39 4.06 3.87 11.54 1.11 2.43 1.09 Rank 36487925 1 Pair Sorted 2.17 14.22 2.93 19.64 17.52 1.03 7.26 34.12 95.13 11.46 Pair ReverseSorted 1.14 1.31 2.05 1.74 9.42 2.76 1.03 3.11 8.53 1.53 Pair Zero 1.95 20.27 2.40 25.29 197.76 1.03 1.06 19.66 97.93 1.06 Pair Exponential 1.06 1.49 2.02 2.36 2.45 6.71 1.52 1.07 8.42 1.20 Pair Zipf 1.00 1.58 1.93 2.38 2.55 6.87 1.35 7.99 9.45 1.35 Pair RootDup 1.01 1.34 2.05 2.31 3.06 4.89 1.16 4.30 10.97 1.13 Pair TwoDup 1.05 1.65 1.99 2.30 2.48 4.08 1.15 4.92 6.91 1.17 Pair EightDup 1.02 1.50 2.04 2.22 2.48 6.44 1.16 7.57 8.34 1.21 Pair AlmostSorted 1.06 1.67 2.60 4.49 2.12 3.25 1.03 3.97 11.00 1.33 Pair Uniform 1.06 1.73 2.01 2.41 2.39 6.19 1.36 1.32 4.74 1.01 Total 1.04 1.56 2.08 2.56 2.49 5.31 1.24 3.55 8.26 1.20 Rank 1 4 5 7 6 9 3 8 10 2 Quartet Uniform 1.00 1.26 2.01 2.20 2.28 4.50 Rank 1 2 3 4 5 6 100B Uniform 1.01 1.16 1.94 1.93 2.16 3.33 Rank 1 2 4 3 5 6

Table A.9: Average slowdowns of parallel algorithms for different data types and input distributions obtained on machine I2x16. The slowdowns average over input sizes with at least 푡 ⋅ 2^21 bytes.

174 A.3 More Measurements o o Ra 4 4 Type Distribution 2 TBB PS IPS PBBS PBBR ASPaS IPS RADULS2 MCSTLbq RegionSort MCSTLmwm double Sorted 1.08 11.04 1.26 14.97 15.74 1.00 47.54 double ReverseSorted 1.01 1.15 1.45 1.53 3.63 1.48 5.11 double Zero 1.10 7.43 1.23 18.54 48.16 1.00 47.74 double Exponential 1.00 1.52 1.41 2.22 2.88 4.35 5.06 double Zipf 1.00 1.55 1.46 2.23 2.92 4.21 4.97 double RootDup 1.00 1.40 1.39 2.24 3.19 2.82 5.62 double TwoDup 1.00 1.62 1.43 2.11 2.81 2.68 4.64 double EightDup 1.00 1.52 1.46 2.29 2.85 4.42 5.20 double AlmostSorted 1.00 1.52 1.96 4.39 2.24 1.86 6.73 double Uniform 1.00 1.71 1.43 2.20 2.78 3.96 4.63 Total 1.00 1.55 1.50 2.44 2.80 3.33 5.22 Rank 132456 7 uint64 Sorted 1.05 10.76 1.21 14.83 15.67 1.00 6.04 42.27 27.60 7.78 uint64 ReverseSorted 1.07 1.23 1.58 1.64 3.86 1.58 1.04 5.03 3.12 1.27 uint64 Zero 1.06 7.41 1.17 18.31 47.38 1.00 1.01 34.98 30.32 1.06 uint64 Exponential 1.03 1.46 1.50 2.25 2.91 3.99 1.40 1.63 1.91 1.24 uint64 Zipf 1.00 1.47 1.49 2.22 2.94 3.82 1.29 6.92 3.55 1.27 uint64 RootDup 1.00 1.36 1.39 2.23 3.16 2.80 1.22 5.67 4.18 1.15 uint64 TwoDup 1.04 1.57 1.53 2.21 2.95 2.71 1.14 5.64 2.80 1.11 uint64 EightDup 1.03 1.45 1.50 2.29 2.90 4.06 1.22 7.00 3.27 1.22 uint64 AlmostSorted 1.08 1.66 2.16 4.78 2.42 2.05 1.07 6.63 4.23 1.18 uint64 Uniform 1.05 1.67 1.54 2.31 2.97 3.84 1.25 1.44 1.19 1.02 Total 1.03 1.52 1.57 2.51 2.88 3.23 1.22 4.08 2.79 1.17 Rank 1 4 5 6 8 9 3 10 7 2 uint32 Sorted 1.14 14.38 1.33 16.51 14.74 1.13 5.53 16.98 6.20 uint32 ReverseSorted 1.25 1.49 1.68 1.67 4.32 1.96 1.02 1.75 1.12 uint32 Zero 1.12 8.33 1.27 19.15 56.60 1.00 1.02 48.85 1.05 uint32 Exponential 1.10 2.20 1.60 2.70 3.18 6.38 1.27 1.64 1.07 uint32 Zipf 1.02 2.07 1.52 2.58 3.14 5.97 1.21 4.53 1.20 uint32 RootDup 1.01 1.68 1.53 2.31 3.21 3.51 1.19 5.33 1.13 uint32 TwoDup 1.12 2.36 1.57 2.61 3.14 3.75 1.10 1.75 1.00 uint32 EightDup 1.04 2.08 1.53 2.56 3.06 6.01 1.17 2.00 1.09 uint32 AlmostSorted 1.19 1.99 2.17 5.53 2.56 2.28 1.10 2.32 1.04 uint32 Uniform 1.17 2.56 1.57 2.79 3.17 5.69 1.13 1.43 1.00 Total 1.09 2.12 1.63 2.89 3.06 4.53 1.17 2.29 1.07 Rank 25478936 1 Pair Sorted 1.06 12.94 1.20 14.57 15.62 1.00 6.13 17.80 55.45 7.71 Pair ReverseSorted 1.06 1.51 1.54 1.68 3.87 1.51 1.10 2.11 6.43 1.28 Pair Zero 1.07 9.92 1.15 18.42 44.53 1.00 1.01 8.84 59.46 1.09 Pair Exponential 1.04 1.48 1.48 2.19 2.83 2.91 1.44 1.04 5.90 1.29 Pair Zipf 1.00 1.45 1.45 2.09 2.78 2.77 1.32 3.22 6.72 1.31 Pair RootDup 1.00 1.57 1.31 2.29 3.27 2.63 1.21 2.73 7.61 1.20 Pair TwoDup 1.01 1.42 1.48 2.03 2.79 2.29 1.25 2.68 5.85 1.11 Pair EightDup 1.02 1.48 1.49 2.19 2.81 3.01 1.17 3.28 6.60 1.26 Pair AlmostSorted 1.05 1.98 2.06 4.18 2.44 1.86 1.09 2.75 8.50 1.15 Pair Uniform 1.04 1.50 1.55 2.15 2.89 2.80 1.31 1.12 4.40 1.05 Total 1.02 1.54 1.53 2.37 2.82 2.58 1.25 2.20 6.39 1.19 Rank 1 5 4 7 9 8 3 6 10 2 Quartet Uniform 1.01 1.19 1.45 1.96 2.55 2.17 Rank 1 2 3 4 6 5 100B Uniform 1.03 1.12 1.48 2.00 2.43 2.09 Rank 1 2 3 4 6 5

Table A.10: Average slowdowns of parallel algorithms for different data types and input distributions obtained on machine A1x16. The slowdowns average over input sizes with at least 푡 ⋅ 2^21 bytes.

175 A Sequential and Shared-Memory Sorting o o Ra 4 4 Type Distribution 2 TBB PS IPS PBBS PBBR ASPaS IPS RADULS2 MCSTLbq RegionSort MCSTLmwm double Sorted 1.38 4.36 3.57 8.71 3.91 1.24 11.66 double ReverseSorted 1.08 1.99 3.54 3.50 33.14 8.28 6.39 double Zero 2.23 15.45 2.95 3.79 161.14 1.26 11.06 double Exponential 1.01 1.89 3.03 3.00 4.00 17.67 5.42 double Zipf 1.00 2.05 3.38 3.34 5.28 20.29 6.38 double RootDup 1.00 1.68 3.85 3.18 5.64 9.79 8.00 double TwoDup 1.00 2.05 2.86 2.96 3.80 9.77 5.21 double EightDup 1.00 1.83 2.93 2.67 3.82 15.83 5.17 double AlmostSorted 1.01 2.83 4.03 9.66 2.62 10.32 7.06 double Uniform 1.00 2.08 2.75 2.97 3.76 15.88 5.24 Total 1.00 2.03 3.23 3.56 4.02 13.66 5.99 Rank 123457 6 uint64 Sorted 1.47 4.75 2.22 9.82 4.12 1.45 5.51 26.12 16.95 5.50 uint64 ReverseSorted 1.10 1.91 3.56 3.77 31.91 8.39 3.24 12.98 8.60 4.15 uint64 Zero 4.94 18.91 3.54 4.48 190.32 1.58 3.37 27.58 15.69 1.21 uint64 Exponential 1.08 1.96 3.20 3.14 4.87 19.91 3.05 1.80 4.86 1.23 uint64 Zipf 1.00 2.00 3.51 3.24 5.38 19.31 3.08 30.36 12.53 4.42 uint64 RootDup 1.00 1.65 3.87 3.27 5.71 9.94 3.84 19.67 16.42 3.43 uint64 TwoDup 1.00 1.97 2.89 2.83 3.73 9.85 2.22 15.49 7.37 2.55 uint64 EightDup 1.01 1.69 2.87 2.49 3.84 14.73 2.14 17.58 10.19 2.74 uint64 AlmostSorted 1.00 2.83 4.07 9.64 2.77 10.81 3.29 14.71 10.28 3.34 uint64 Uniform 1.15 2.30 3.20 3.31 4.30 16.91 2.87 1.40 3.02 1.00 Total 1.03 2.03 3.35 3.58 4.26 13.92 2.87 8.75 8.12 2.38 Rank 1 2 5 6 7 10 4 9 8 3 uint32 Sorted 2.01 4.80 7.19 7.18 9.39 3.03 4.44 3.46 2.84 uint32 ReverseSorted 1.21 1.93 2.91 2.68 23.08 8.82 2.15 1.42 1.27 uint32 Zero 2.75 29.11 4.61 8.71 533.66 1.97 5.36 52.76 1.33 uint32 Exponential 1.45 3.34 3.61 4.60 7.78 34.24 2.88 3.03 1.02 uint32 Zipf 1.00 2.81 3.06 3.52 5.84 26.87 2.31 10.95 3.26 uint32 RootDup 1.00 1.89 3.29 2.78 5.74 8.66 2.69 12.64 2.65 uint32 TwoDup 1.40 3.56 3.25 4.29 5.78 16.27 2.09 1.86 1.03 uint32 EightDup 1.27 3.25 3.27 4.07 6.40 28.26 2.25 2.07 1.11 uint32 AlmostSorted 1.14 2.06 3.05 5.78 4.15 7.46 2.25 1.52 1.28 uint32 Uniform 1.53 3.73 3.45 4.34 6.69 26.29 2.50 1.80 1.00 Total 1.24 2.86 3.28 4.11 5.96 18.42 2.41 3.04 1.44 Rank 14678935 2 Pair Sorted 1.52 2.93 2.33 10.32 8.15 1.07 3.48 8.00 15.73 4.38 Pair ReverseSorted 1.07 1.91 3.14 5.76 21.25 8.15 2.92 5.88 11.18 4.00 Pair Zero 3.63 12.24 2.80 5.22 70.67 1.32 2.08 5.97 17.38 1.15 Pair Exponential 1.20 2.90 3.64 5.12 4.04 14.27 3.53 1.18 18.28 1.53 Pair Zipf 1.00 2.27 3.31 4.44 2.97 11.95 3.06 13.93 18.25 4.98 Pair RootDup 1.01 2.58 3.81 6.39 6.71 9.15 4.01 9.30 24.77 3.42 Pair TwoDup 1.00 2.46 3.05 4.22 4.07 7.30 2.54 9.19 13.48 3.48 Pair EightDup 1.02 2.23 2.97 3.84 3.02 9.60 2.28 11.55 14.89 3.43 Pair AlmostSorted 1.00 2.66 3.79 14.09 6.58 10.68 3.25 7.75 14.84 4.09 Pair Uniform 1.13 2.81 3.37 4.69 3.77 12.17 3.20 1.32 9.45 1.04 Total 1.05 2.55 3.41 5.52 4.24 10.51 3.08 5.57 15.68 2.79 Rank 1 2 5 7 6 9 4 8 10 3 Quartet Uniform 1.01 1.64 3.28 4.45 5.09 8.95 Rank 1 2 3 4 5 6 100B Uniform 1.14 1.17 3.61 4.73 8.11 7.00 Rank 1 2 3 4 6 5

Table A.11: Average slowdowns of parallel algorithms for different data types and input distributions obtained on machine I4x20. The slowdowns average over input sizes with at least 푡 ⋅ 2^21 bytes.

Appendix B On the Performance of MPI Libraries

In high-performance computing, algorithms usually use an MPI library for communication. The goal of MPI libraries is to simplify the development of algorithms by providing efficient and scalable implementations of frequently used communication operations. However, it turned out that point-to-point message exchanges, collective operations, and operations to create new communicators are much less efficient in many situations than one would expect: Depending on the MPI library, the startup overhead of data exchanges with alternating communication partners is tremendously large, collective operations are slow, and the time to create MPI communicators makes it impossible to implement algorithms with polylogarithmic latency. In Appendix B.1, we study the problem of performing fast data exchanges with point-to-point messages on different supercomputers with different MPI implementations. On one supercomputer, we did not see any running time fluctuations. On two supercomputers, data exchanges with alternating communication partners are not efficient at all. And on one supercomputer, the time for data exchanges fluctuates only slightly. In Appendix B.2, we study the performance of MPI collective operations and MPI communicators. We also present RBC, a communication library based on MPI that creates range-based communicators in constant time without communication. These RBC communicators provide very scalable and efficient implementations of collective operations. For an illustration, we apply RBC to three sorting algorithms from Harsh et al. [HKS19] and Sundar et al. [SMB13]. RBC improves the performance of these algorithms by multiple orders of magnitude.
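To make the constant-time creation of range-based communicators concrete, the following is a hedged C++ sketch of the underlying idea; the struct and function names are ours and do not reflect the actual RBC interface.

    #include <mpi.h>

    // A communicator represented as a contiguous range of ranks on top of a
    // base MPI communicator that is never split.
    struct RangeComm {
        MPI_Comm base;   // underlying MPI communicator
        int first;       // rank (in base) of the first PE of the range
        int size;        // number of PEs in the range
    };

    // Split [first, first+size) into two halves in O(1) time, without communication.
    inline RangeComm splitLeft(const RangeComm& c) {
        return {c.base, c.first, c.size / 2};
    }
    inline RangeComm splitRight(const RangeComm& c) {
        return {c.base, c.first + c.size / 2, c.size - c.size / 2};
    }

    // Translate a group-local rank into a rank of the base communicator.
    inline int globalRank(const RangeComm& c, int localRank) {
        return c.first + localRank;
    }

Because a sub-group is just a sub-range of ranks, no collective call and no message exchange is needed to create it.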

B.1 Startup Overheads of Communication Patterns

In this section, we discuss the reliability of four supercomputers and their MPI libraries in terms of the startup overhead of message exchanges. On each supercomputer, we executed a simple communication pattern with point-to-point messages in different situations to challenge the network and the MPI implementation. The results differed widely between the supercomputers: On JUQUEEN, the communication cost was very reliable—in most cases we had cost 훼 + 푛훽 for about the same constants 훼 and 훽. On SuperMUC-NG, SuperMUC Phase 1, and SuperMUC Phase 2, the startup overhead of the first communication between two PEs is tremendously large. SuperMUC Phase 1 and SuperMUC Phase 2 are only fast when we use fixed communication partners. On SuperMUC-NG, the time of data exchanges between fixed communication partners is slightly smaller than the time of exchanges between alternating partners. Since the results obtained on SuperMUC-NG and on JUQUEEN are much more reliable, we assume that the bad performance of SuperMUC Phase 1 and 2 cannot be generalized. Thus, we decided to use SuperMUC-NG and JUQUEEN for our sorting benchmarks.

The experimental setting of the startup overhead test was as follows: Our benchmark measured the running time of a single PE performing ping-pong message transfers with 256 different PEs. We present results of benchmark runs obtained after two different warmup types. The local warmup performs exactly the same ping-pong message transfers with the same PEs as the actual benchmark. The global warmup executes a point-to-point message transfer between all PEs beforehand. These two benchmarks provide minimal working examples to understand the impact of startup overheads of two different data exchanges. The benchmark after a local warmup represents data exchanges with fixed communication partners executed over and over again, e.g., hypercube data exchanges. The benchmark after a global warmup represents a situation in which we execute data exchanges with alternating communication partners over and over again. We also present results obtained after a cold start, i.e., the running time of the benchmark directly executed after system startup.

JUQUEEN. The communication cost on JUQUEEN matched very well with the single-ported message passing model. The communication bandwidth as well as the startup overhead for inter-node communication was almost the same between any pair of nodes. As expected, the intra-node communication was somewhat faster. Additionally, the time for local computations was very predictable—probably because the nodes ran at a constant frequency and a 17th core processed OS-related tasks. Overall, the fluctuation in the running time of our sorting algorithms was very small on JUQUEEN and only a few runs were necessary to obtain reliable results. On JUQUEEN, we used the IBM MPI library mpich2 version 1.5.

SuperMUC Phase 1 and 2. We present results obtained on SuperMUC Phase 2 [Lei15], which is the predecessor of SuperMUC-NG. We do not present results obtained on SuperMUC Phase 1 as the results were very similar. For the measurements, we used the IBM MPI library version 1.4 and the Intel MPI library 2018. We now discuss the startup overheads on SuperMUC Phase 2 in detail.

Figure B.1 shows the results obtained with IBM MPI. By default, IBM MPI 1.4 uses the bulk transfer protocol for messages with at least 2^16 bytes. As this protocol has a big impact on the performance of IBM MPI, we ran the benchmark once with bulk transfer (default) and once without. In both cases, we got reliable results after a local warmup (fixed communication partners) but not after global warmups (alternating communication partners). We first describe the results of the default configuration. When we executed the benchmark after a cold start, the startup overhead of messages that use bulk transfer internally was about a factor of 37 larger than the overhead of shorter messages. Even a global warmup did not prevent these large startup overheads for long messages (factor of 18.05). Even worse, the startup overhead of short messages increased by more than one order of magnitude when we used a global warmup instead of a local one. Only a local warmup (fixed communication partners) avoids these large running times. When we disabled the bulk transfer protocol, the overhead for long messages disappeared. However, the startup overhead after a global warmup was still one order of magnitude larger than the overhead after a local warmup.

The results obtained with Intel MPI 2018 were even worse (see Figure B.2).
Similar to IBM MPI, Intel MPI handled data exchanges with alternating communication partners (global warmup) very poorly.


[Figure B.1: Ping-pong benchmark with 256 partners on SuperMUC Phase 2 with IBM MPI. Panels: (a) bulk transfer protocol enabled (default), (b) bulk transfer protocol disabled. Axes: message length 푛 [B] versus running time; series: cold start, local warmup, global warmup.]

The startup latency after a global warmup was 32.65 times larger than the latency after a local warmup for long messages. Even worse, after a cold start, exchanges of short messages took almost two orders of magnitude longer than after a local warmup. The SuperMUC Phase 2 results were obtained on a job instance with 7 168 PEs. For small numbers of 푡, we were able to avoid these fluctuations by disabling so-called “dynamic connections”. Unfortunately, this did not help for large 푡, probably because the total number of open connections is limited.

SuperMUC-NG. Figure B.3 depicts the results obtained on SuperMUC-NG with the Intel MPI library 2020. Communicating the longest messages took about the same time in all benchmarks. We conclude that the choice of the communication partners is not a crucial factor when PEs send a large bulk of data. The results are different for short messages:


[Figure B.2: Ping-pong benchmark with 256 partners on SuperMUC Phase 2 with Intel MPI. Axes: message length 푛 [B] versus running time; series: cold start, local warmup, global warmup.]

After a global warmup, the startup overhead was “only” twice as large as the overhead after a local warmup. However, the startup overhead after a cold start was more than an order of magnitude larger than the overhead after a local warmup. Thus, the first execution of a data exchange will have increased startup overheads. If subsequent data exchanges involve new communication partners, the startup overheads will also be large for these connections. That may be very problematic if the data exchanges of an algorithm use few but always different connections.
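To make the benchmark setup of this section concrete, the following simplified sketch shows how a ping-pong benchmark with a local warmup (and, alternatively, a global warmup) can be structured. It is an illustration under assumptions, not the benchmark code used for the measurements; the message length, the number of partners, and the helper name PingPong are placeholders.

```cpp
// Simplified sketch of a ping-pong benchmark with warmup variants (not the
// exact benchmark code; sizes and repetition counts are placeholders).
#include <mpi.h>
#include <cstdio>
#include <vector>

// One ping-pong: PE 0 sends to 'partner' and waits for the echo.
static void PingPong(int me, int partner, std::vector<char>& buf, MPI_Comm comm) {
  const int n = (int)buf.size();
  if (me == 0) {
    MPI_Send(buf.data(), n, MPI_BYTE, partner, 0, comm);
    MPI_Recv(buf.data(), n, MPI_BYTE, partner, 0, comm, MPI_STATUS_IGNORE);
  } else if (me == partner) {
    MPI_Recv(buf.data(), n, MPI_BYTE, 0, 0, comm, MPI_STATUS_IGNORE);
    MPI_Send(buf.data(), n, MPI_BYTE, 0, 0, comm);
  }
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int me, p;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &p);

  const int kPartners = 256;       // number of partners of PE 0 (placeholder)
  std::vector<char> buf(1 << 16);  // message length in bytes (placeholder)

  // Local warmup: exactly the same transfers with the same partners as below.
  for (int i = 1; i <= kPartners && i < p; ++i) PingPong(me, i, buf, MPI_COMM_WORLD);

  // Global warmup (alternative): point-to-point traffic between all PEs,
  // here approximated by a small all-to-all exchange.
  // std::vector<char> s(p), r(p);
  // MPI_Alltoall(s.data(), 1, MPI_BYTE, r.data(), 1, MPI_BYTE, MPI_COMM_WORLD);

  // Measured benchmark: ping-pongs of PE 0 with kPartners different PEs.
  MPI_Barrier(MPI_COMM_WORLD);
  double start = MPI_Wtime();
  for (int i = 1; i <= kPartners && i < p; ++i) PingPong(me, i, buf, MPI_COMM_WORLD);
  double elapsed = MPI_Wtime() - start;

  if (me == 0) std::printf("time per ping-pong: %g s\n", elapsed / kPartners);
  MPI_Finalize();
  return 0;
}
```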

B.2 Faster Sorting with MPI—Communicators and Collectives

The size of supercomputers has rapidly increased to petascale machines with millions of cores. The de facto standard for communication in High Performance Computing (HPC) is the Message Passing Interface (MPI). Many applications need a flexible management of PE groups, e.g., to adjust the scope of parallelism for load balancing, to achieve parallelism on multiple levels, to divide tasks into fine-grained subproblems, or to recursively sort data [Din+11; Bal+09; SMB13; HKS19]. MPI uses the concept of communicators to enable multiple levels of parallelism by connecting groups of PEs. MPI provides (collective) communication operations between the PEs of a communicator to ensure scalability, portability, and comfortable programming with a high-level interface [HLR07; Bal+95b]. The group context of a communicator guarantees that collective communication and point-to-point communication within one communicator as well as across different communicators do not interfere. In this work, we do not need a separation of communication between communicators—in fact, all MPI sorting algorithms we have found only communicate between the PEs of one group at a time.


[Figure B.3: Ping-pong benchmark with 256 partners on SuperMUC-NG with Intel MPI. Axes: message length 푛 [B] versus running time; series: cold start, local warmup, global warmup.]

We did not find an MPI library that provides communicator creation routines running in sublinear time. Thus, collective communicator creation in MPI becomes a bottleneck for sublinear algorithms. For example, the most recent open-source implementations Open MPI 3.1 and MPICH 3.4 create an array of PE IDs when the user creates a communicator. In this case, the construction time of a communicator is linear in the number of group members. Mohamad Chaarawi and Edgar Gabriel [CG08] integrated sparse data storage into Open MPI. Their implementation reduces the footprint of an existing communicator, but the PE group is still stored explicitly during communicator construction. Unfortunately, MPI does not provide a method to invoke collective operations on a subset of PEs without creating a new communicator. Before performing the collective operation, the user must create a communicator of the subset of PEs with a blocking communicator creation routine.

Besides inefficient communicator creation routines, it turned out that some supercomputers used in this work provide poorly implemented collective operations. However, if we want to use the full functionality of the message passing interface to implement efficient algorithms, e.g., sub-communicators or collective operations, we need fast MPI libraries. Additionally, scalable and efficient MPI libraries are crucial for a fair comparison of algorithms. Otherwise, poorly implemented MPI operations would penalize algorithms that use these operations more frequently. Thus, we decided to implement a library that provides the functionality required for scalable and efficient sorting algorithms.

We present the lightweight library RangeBasedComm (RBC), which is based on MPI. RBC creates new communicators, containing a consecutive PE range of a parent communicator, in constant time and without communication. Our range-based communicators provide (non)blocking collective operations and (non)blocking point-to-point communication. As RBC cannot access the context ID of a message, the library does not fully support the nonblocking model of the MPI standard.
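To illustrate the underlying idea (this is not RBC's actual interface; all names below are hypothetical), a range-based communicator can be described by its parent MPI communicator together with the first and last rank of the range. Creating or splitting such a communicator then only requires computing new range bounds, which takes constant time and no communication.

```cpp
#include <mpi.h>

// Conceptual sketch of a range-based communicator: the group is described
// implicitly by [first, last] within a parent MPI communicator.
// Illustration of the idea only; not the RBC interface.
struct RangeComm {
  MPI_Comm parent;  // physical MPI communicator used for all communication
  int first;        // first global rank of the range
  int last;         // last global rank of the range (inclusive)

  int size() const { return last - first + 1; }
  int local_rank_of(int global_rank) const { return global_rank - first; }
};

// Split a range communicator at local position 'mid' into two consecutive
// ranges. Runs in constant time and performs no communication.
inline void SplitRange(const RangeComm& in, int mid,
                       RangeComm* left, RangeComm* right) {
  *left  = RangeComm{in.parent, in.first, in.first + mid - 1};
  *right = RangeComm{in.parent, in.first + mid, in.last};
}

// Point-to-point communication translates local ranks back to global ranks
// of the parent communicator, e.g.:
inline int SendInRange(const void* data, int count, MPI_Datatype type,
                       int dest_local, int tag, const RangeComm& rc) {
  return MPI_Send(data, count, type, rc.first + dest_local, tag, rc.parent);
}
```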


[Figure B.4: Running time of communicator splitting into two communicators of equal size on SuperMUC-NG with different numbers of cores. Series: MPI_Group_range_incl + MPI_Comm_create_group, MPI_Comm_split, RBC::Split_RBC_Comm. Axes: number of cores 푝 (up to 3.3 · 10^4) versus average time [ms].]

Even though we restricted the semantics of communication, the library is applicable to all sorting algorithms used in this work. For example, we replaced the MPI primitives used by our competitors RQuick, HykSort, and HSS with RBC by substituting the prefix “MPI_” with the prefix “RBC::”. It was also easy to replace the routines that create new MPI communicators because all sorting algorithms that we have found create communicators of consecutive PEs; our RBC communicators only need the index of the first and the last PE of the new PE range. Our experiments show that our library reduces the time to create a new communicator by multiple orders of magnitude, while the performance of collective operations improves in some cases by more than one order of magnitude. RBC improves the running times of RQuick as well as of HykSort and HSS, proposed by Sundar et al. [SMB13] and Harsh et al. [HKS19] respectively, by multiple orders of magnitude.

The RBC library was proposed with a focus on perfectly balanced quicksort [AWS18] in cooperation with our bachelor student Armin Wiebigke. The RBC library is available at https://github.com/MichaelAxtmann/RBC/. For a detailed description of RBC as well as of perfectly balanced quicksort, we refer to the conference publication [AWS18] and to the bachelor's thesis of Armin Wiebigke [Wie17a]. Since we published the first version of RBC, we have extended the collective operations of RBC with implementations of the two-tree algorithms for full-bandwidth broadcast, (all-)reduction, and (exclusive) prefix sum proposed by Sanders et al. [SST09]. We consider the implementation of these algorithms a separate contribution of this work since the source code of the two-tree algorithms has never been published as open source.

Experimental Results. It remains to evaluate the RBC library experimentally. We compare the performance of RBC to the Intel MPI library 2020 on SuperMUC-NG. First, we present benchmark results of communicator creation routines and collective operations. In a second step, we evaluate the performance of RBC on sorting algorithms as a whole.

MPI offers two methods to create sub-communicators of consecutive PE ranges. The first method requires each PE to provide the index of its new group and its rank within that group (MPI_Comm_split). For the second method, the PEs of the new communicator have to be provided by every single PE. Fortunately, MPI provides an efficient interface to enumerate consecutive PE ranges: MPI_Group_range_incl only requires the index of the first and the last PE of the range.
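As a concrete reference, the following sketch shows both MPI variants for a consecutive rank range [first, last]; the helper names, the tag value, and the omitted error handling are our own simplifications.

```cpp
#include <mpi.h>

// Variant 1: MPI_Comm_split. Every PE of the parent communicator provides a
// color (here: whether its rank lies in [first, last]) and a key that
// determines the rank order in the new communicator.
MPI_Comm SplitByColor(MPI_Comm parent, int first, int last) {
  int rank;
  MPI_Comm_rank(parent, &rank);
  const bool member = (rank >= first && rank <= last);
  MPI_Comm sub;
  MPI_Comm_split(parent, member ? 0 : MPI_UNDEFINED, rank, &sub);
  return sub;  // MPI_COMM_NULL on PEs outside the range
}

// Variant 2: enumerate the range with MPI_Group_range_incl (only the first
// and last rank plus a stride are needed) and create the communicator with
// MPI_Comm_create_group, which is collective only over the new group.
MPI_Comm CreateFromRange(MPI_Comm parent, int first, int last) {
  MPI_Group parent_group, range_group;
  MPI_Comm_group(parent, &parent_group);
  int range[1][3] = {{first, last, 1}};  // (first, last, stride)
  MPI_Group_range_incl(parent_group, 1, range, &range_group);
  MPI_Comm sub = MPI_COMM_NULL;
  int rank;
  MPI_Comm_rank(parent, &rank);
  if (rank >= first && rank <= last)
    MPI_Comm_create_group(parent, range_group, /*tag=*/0, &sub);
  MPI_Group_free(&range_group);
  MPI_Group_free(&parent_group);
  return sub;
}
```

Both variants are blocking and, in the implementations discussed above, take time linear in the number of group members; this is the overhead that RBC's constant-time range splitting avoids.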


[Figure B.5: Running time of collective operations with 푛 output elements on SuperMUC-NG with 32 784 cores. Panels: Bcast, Exscan, Reduce, Allreduce, Gather, Allgather. Axes: number of local elements 푛/푝 versus average time [휇푠]; series: MPI, RBC.]

Figure B.4 depicts the time to split a communicator into two ranges of equal size with MPI and with RBC. This splitting technique is, for example, frequently used by hypercube quicksort algorithms. We see that the time to split MPI communicators increases piecewise linearly with the communicator's size. Thus, it depends on the communicator which MPI splitting method works best. Overall, communicator splitting with MPI is very inefficient. For example, splitting a communicator containing 2^15 PEs takes between 94.62 and 110.00 milliseconds. In contrast, communicator splitting with RBC is multiple orders of magnitude faster.

Figure B.5 depicts the running times of the MPI collectives broadcast (MPI_Bcast), (all-)reduction (MPI_Reduce resp. MPI_Allreduce), exclusive prefix sum (MPI_Exscan), and (all-)gather (MPI_Gather resp. MPI_Allgather) as well as their counterparts provided by the RBC library. The exclusive prefix sum and the all-reduction are the most expensive MPI collectives. RBC performs these collectives more than one order of magnitude faster. Note that the prefix sum and the reduction are important subroutines of sorting algorithms, e.g., to sum up local bucket sizes or to compute the rank of an element within a bucket across PE boundaries. The other collectives are sometimes slightly faster with RBC and sometimes slightly faster with MPI.

We now study the performance impact of RBC on our competitors HykSort, HSS, and HykQuick. RBC improves the performance of HykSort, HSS, and HykQuick significantly.
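To illustrate why these collectives matter for sorting, the following hedged sketch computes per-bucket offsets with MPI_Exscan and total bucket sizes with MPI_Allreduce; the function and variable names are ours and not taken from any of the sorting implementations discussed here.

```cpp
#include <mpi.h>
#include <cstdint>
#include <vector>

// For k buckets, each PE holds the number of its local elements per bucket.
// An exclusive prefix sum (MPI_Exscan) over the PEs yields, for every bucket,
// how many elements of that bucket reside on PEs with a smaller rank, i.e.,
// the global starting position of this PE's elements within the bucket.
// An all-reduction yields the total bucket sizes.
void BucketOffsets(const std::vector<int64_t>& local_bucket_sizes,
                   std::vector<int64_t>* offset_in_bucket,
                   std::vector<int64_t>* global_bucket_sizes,
                   MPI_Comm comm) {
  const int k = (int)local_bucket_sizes.size();
  offset_in_bucket->assign(k, 0);
  global_bucket_sizes->assign(k, 0);

  // Elements of bucket i on PEs 0..rank-1 (exclusive prefix sum per bucket).
  MPI_Exscan(local_bucket_sizes.data(), offset_in_bucket->data(), k,
             MPI_INT64_T, MPI_SUM, comm);

  // The MPI standard leaves the Exscan result on rank 0 undefined; it is 0.
  int rank;
  MPI_Comm_rank(comm, &rank);
  if (rank == 0) offset_in_bucket->assign(k, 0);

  // Total number of elements per bucket across all PEs.
  MPI_Allreduce(local_bucket_sizes.data(), global_bucket_sizes->data(), k,
                MPI_INT64_T, MPI_SUM, comm);
}
```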


[Figure B.6: Running time ratios of algorithms with MPI collectives over algorithms with RBC collectives (series: HSS, HykSort, HykQuick). Panel (a) includes the running times of the MPI collective MPI_Comm_split for sub-communicator creation; panel (b) excludes them. Axes: number of local elements 푛/푝 versus average time ratio.]

Figure B.6a shows the speedup of HykSort, HSS, and HykQuick with RBC operations over the original implementations that use pure MPI. On SuperMUC-NG, HSS is our closest competitor for 푛/푡 ≥ 2^12. For these input sizes, RBC speeds up HSS by up to a factor of 101. When we exclude the time to construct MPI communicators, RBC still speeds up HykSort, HSS, and HykQuick significantly (see Figure B.6b). For example, RBC decreases the running time of HSS by a factor of 25 for 푛/푡 = 2^12. Our sorting algorithms RQuick, RFIS, and AMS-sort also benefit enormously from the RBC library. When we configure RBC to use MPI operations, RFIS is slowed down by a factor of 39.75 for 푛/푡 = 1/243, RQuick by a factor of 39.51 for 푛/푡 = 2^6, and AMS-sort by a factor of 2.47 for 푛/푡 = 2^17. For these input sizes, the respective algorithms were the fastest among all competitors.

List of Algorithms

2 Overview of Sequential and Shared-Memory Sorting Algorithms
1 Element classification of the first 푢⌊푛/푢⌋ elements ...... 20

3 Robust Scalable In-Place Sorting Algorithms
2 Task scheduler ...... 35

4 Experiments and Conclusion
3 Quartet comparison ...... 51
4 100B comparison ...... 51

5 Overview of Distributed Sorting Algorithms
5 Hypercube algorithm design pattern ...... 89
6 Binomial-tree algorithm design pattern ...... 90

6 Robust Scalable Distributed Sorting Algorithms
7 Robust Hypercube Quicksort ...... 107
8 Robust Hypercube Quicksort+ ...... 111
9 Computation of the Simple Message Assignment ...... 114
10 DMA compare function ...... 115
11 Deterministic Message Assignment ...... 118
12 RLM-sort ...... 121
13 AMS-sort ...... 123

A Sequential and Shared-Memory Sorting
14 Sequential task execution of the subarray 퐴[푏, 푒 − 1] without a local stack ...... 157


List of Figures

1 Introduction and Overview
1.1 Running time of ping-pong benchmarks between two fixed PEs. ...... 7

2 Overview of Sequential and Shared-Memory Sorting Algorithms
2.1 Branchless decision tree. ...... 19

3 Robust Scalable In-Place Sorting Algorithms
3.1 Overview of a parallel 푘-way partitioning step. ...... 26
3.2 Classification of IPS4o. ...... 27
3.3 Input array after parallel classification. ...... 27
3.4 Invariant during block permutation. ...... 28
3.5 Block permutation examples. ...... 29
(a) Swapping a block into its correct position. ...... 29
(b) Moving a block into an empty position, followed by refilling the swap buffer. ...... 29
3.6 An example of a permutation phase. ...... 30
3.7 An example of the steps performed during cleanup. ...... 31
3.8 Example schedule of a task execution in IPS4o with 8 PEs. ...... 34

4 Experiments and Conclusion
4.1 Examples of nontrivial input distributions for 512 uint32 values. ...... 52
4.2 Running times of sequential algorithms of uint64 values with input distribution Uniform executed on different machines. ...... 60
4.3 Pairwise performance profiles of sequential algorithms. ...... 62
4.4 Running times of parallel algorithms. ...... 70
4.5 Speedup of parallel algorithms. ...... 71
4.6 Running times of parallel algorithms on machine A1x64. ...... 74
4.7 Pairwise performance profiles of parallel algorithms. ...... 75
4.8 Accumulated running time of IPS4o and IPS2Ra. ...... 78

6 Robust Scalable Distributed Sorting Algorithms
6.1 Robust fast work-inefficient ranking on a 3 × 3 PE-grid with five elements. ...... 102
6.2 Comparison of ternary remedian and binary remedian with 푘 = 2. ...... 106
6.3 Distribution of the random shuffling (Shuffle), RQuick's and RQuick+'s measured imbalance factors obtained from 216 runs each. ...... 112


6.4 Illustration of the different message assignments. ...... 113
6.5 Algorithm schema of Recurse Last Parallel Multiway Mergesort. ...... 120
6.6 Algorithm schema of AMS-sort. ...... 122
6.7 Scanning algorithm for tie-breaking in AMS-sort. ...... 125

7 Experiments and Conclusion
7.1 Running times of distributed sorting algorithms on JUQUEEN. ...... 134
7.2 Running time ratios of each algorithm to the fastest algorithm on JUQUEEN. ...... 135
7.3 Running times of distributed sorting algorithms on SuperMUC-NG. ...... 136
7.4 Running time ratios of each algorithm to the fastest algorithm on SuperMUC-NG. ...... 137
7.5 Running time comparison of gather-merge algorithms, RFIS, and RFIR. ...... 139
7.6 Average running time and maximum imbalance ratios of RQuick to NRQuick. ...... 141
7.7 Running times of RQuick, Minisort, and RBC-HykQuick. ...... 142
7.8 Running time ratios with and without DMA. ...... 145
7.9 Accumulated running time of the phases of AMS-sort on SuperMUC-NG. ...... 146
7.10 Ratio of AMS-sort and NTB-AMS on the SuperMUC-NG. ...... 147
7.11 Running times of single-level algorithms and AMS-sort. ...... 147

A Sequential and Shared-Memory Sorting
A.1 Running times of sequential algorithms of uint32 values with input distribution Uniform executed on different machines. ...... 162
A.2 Running times of parallel algorithms on machine A1x64. ...... 169
A.3 Running times of parallel algorithms on machine I2x16. ...... 170
A.4 Running times of parallel algorithms on machine A1x16. ...... 171
A.5 Running times of parallel algorithms on machine I4x20. ...... 172

B On the Performance of MPI Libraries
B.1 Ping-pong benchmark with 256 partners on SuperMUC Phase 2 with IBM MPI. ...... 179
B.2 Ping-pong benchmark with 256 partners on SuperMUC Phase 2 with Intel MPI. ...... 180
B.3 Ping-pong benchmark with 256 partners on SuperMUC-NG with Intel MPI. ...... 181
B.4 Running time of communicator splitting. ...... 182
B.5 Running time of collective operations. ...... 183
B.6 Running time ratios of algorithms with MPI collectives over algorithms with RBC collectives. ...... 184

List of Tables

2 Overview of Sequential and Shared-Memory Sorting Algorithms
2.1 Summary of notations. ...... 18

3 Robust Scalable In-Place Sorting Algorithms
3.1 I/O volume of read and write operations broken down into subroutines. ...... 42

4 Experiments and Conclusion
4.1 Average slowdowns of sequential algorithms. ...... 57
4.2 Average slowdowns for different arrays. ...... 65
4.3 Average slowdown of IPS4o and IPS4oNT. ...... 66
4.4 Average slowdowns of parallel algorithms. ...... 68
4.5 Average slowdowns of IPS4o and IMSDradix. ...... 77

5 Overview of Distributed Sorting Algorithms
5.1 Summary of notations. ...... 87
5.2 Complexity of distributed sorting algorithms. ...... 93

7 Experiments and Conclusion
7.1 Efficiency of distributed sorting algorithms. ...... 149

A Sequential and Shared-Memory Sorting
A.1 Running times of sequential algorithms of uint64 values with input distribution Uniform from different machines. ...... 161
A.2 Average slowdowns of sequential algorithms on I4x20. ...... 163
A.3 Average slowdowns of sequential algorithms on A1x16. ...... 164
A.4 Average slowdowns of sequential algorithms on I2x16. ...... 165
A.5 Average slowdowns of sequential algorithms on A1x64. ...... 166
A.6 Average slowdowns of 1S4o and S4oS. ...... 167
A.7 Running times of parallel algorithms of uint64 values with input distribution Uniform from different machines. ...... 168
A.8 Average slowdowns of parallel algorithms on machine A1x64. ...... 173
A.9 Average slowdowns of parallel algorithms on machine I2x16. ...... 174
A.10 Average slowdowns of parallel algorithms on machine A1x16. ...... 175
A.11 Average slowdowns of parallel algorithms on machine I4x20. ...... 176


List of Theorems

3 Robust Scalable In-Place Sorting Algorithms
Lemma 3.1 Relationship between array position and PE index ...... 32
Lemma 3.2 Relationship between PE of sequential task and parallel parent task ...... 32
Lemma 3.3 Parallel subtasks are processed by subgroup of PEs ...... 33
Lemma 3.4 One parallel task per recursion level and PE ...... 33
Lemma 3.5 Relationship between subarray and PE-group ...... 35
Lemma 3.6 Relationship between PE and its sequential tasks ...... 36
Theorem 3.7 Memory usage of IPS4o without local stack ...... 38
Theorem 3.8 Memory usage of IPS4o with local stack ...... 39
Theorem 3.9 I/O complexity of IPS4o ...... 39
Lemma 3.10 I/Os of a sequential partitioning task ...... 39
Lemma 3.11 I/Os of a parallel task ...... 40
Theorem 3.12 Local work of IPS4o ...... 44
Lemma 3.13 Local work of a partitioning task ...... 44
Lemma 3.14 Relationship between parallel tasks and small tasks ...... 44
Lemma 3.15 Local work of small partitioning tasks ...... 45
Lemma 3.16 Local work of partitioning tasks ...... 45
Lemma 3.17 Local work of large partitioning tasks ...... 45
Lemma 3.18 Local work of base case tasks ...... 46
Lemma 3.19 Local work for samples of small partitioning tasks ...... 46
Lemma 3.20 Local work for samples of large partitioning tasks ...... 46

6 Robust Scalable Distributed Sorting Algorithms
Lemma 6.1 Scaled balls into bins problem ...... 103
Lemma 6.2 Distribution of elements in randomized shuffling ...... 104
Lemma 6.3 Bound load of randomized shuffling ...... 105
Lemma 6.4 Running time of randomized shuffling on hypercubes ...... 105
Lemma 6.5 Ternary remedian ...... 106
Theorem 6.6 Running time of RQuick+ ...... 109
Lemma 6.7 Quality of the splitter selection ...... 109
Lemma 6.8 Maximum load of subcubes in RQuick+ ...... 110
Lemma 6.9 Placement of elements in a subcube of RQuick+ ...... 110
Lemma 6.10 Maximum load of PEs in RQuick+ ...... 110
Theorem 6.12 Data redistribution with the deterministic message assignment ...... 116


Theorem 6.13 Running time of RLM-sort with O(1) recursion levels ...... 119
Lemma 6.14 Optimality of overpartitioning ...... 121
Lemma 6.15 Imbalance bound of AMS-sort for one recursion level ...... 122
Lemma 6.16 Running time of single-level AMS-sort ...... 123
Theorem 6.17 Running time of AMS-sort with 푟 recursion levels ...... 124

A Sequential and Shared-Memory Sorting
Theorem A.1 Recursion depth of IPS4o ...... 155
Lemma A.2 Probability of a successful recursion step ...... 155
Lemma A.3 Limit recursion depth of an arbitrary element ...... 156
Lemma A.4 Sequential tasks of a PE are adjacent ...... 157

Publications and Supervised Theses

In Conference Proceedings

Michael Axtmann, Timo Bingmann, Peter Sanders, and Christian Schulz. “Practical Massively Parallel Sorting”. In: 27th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 2015, pages 13–23. doi: 10.1145/2755573.2755595

Timo Bingmann, Michael Axtmann, Emanuel Jöbstl, Sebastian Lamm, Huyen Chau Nguyen, Alexander Noe, Sebastian Schlag, Matthias Stumpp, Tobias Sturm, and Peter Sanders. “Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++”. In: 2016 IEEE International Conference on Big Data. 2016, pages 172–183. doi: 10.1109/bigdata.2016.7840603

Michael Axtmann and Peter Sanders. “Robust Massively Parallel Sorting”. In: 29th Workshop on Algorithm Engineering and Experiments (ALENEX). 2017, pages 83–97. doi: 10.1137/1.9781611974768.7

Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. “In-Place Parallel Super Scalar Samplesort (IPSSSSo)”. In: 25th Annual European Symposium on Algorithms (ESA). 2017, 9:1–9:14. doi: 10.4230/LIPIcs.ESA.2017.9

Michael Axtmann, Armin Wiebigke, and Peter Sanders. “Lightweight MPI Communicators with Applications to Perfectly Balanced Quicksort”. In: 32nd IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2018, pages 254–265. doi: 10.1109/IPDPS.2018.00035

Technical Reports

Michael Axtmann, Timo Bingmann, Peter Sanders, and Christian Schulz. “Practical Massively Parallel Sorting”. In: Computing Research Repository (CoRR) (Aug. 2015). arXiv: 1410.6754

Timo Bingmann, Michael Axtmann, Emanuel Jöbstl, Sebastian Lamm, Huyen Chau Nguyen, Alexander Noe, Sebastian Schlag, Matthias Stumpp, Tobias Sturm, and Peter Sanders. “Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++”. In: Computing Research Repository (CoRR) (Aug. 2016). arXiv: 1608.05634

Michael Axtmann and Peter Sanders. “Robust Massively Parallel Sorting”. In: Computing Research Repository (CoRR) (June 2016). arXiv: 1606.08766

Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. “In-place Parallel Super Scalar Samplesort (IPS4o)”. In: Computing Research Repository (CoRR) (May 2017). arXiv: 1705.02257


Michael Axtmann, Armin Wiebigke, and Peter Sanders. “Lightweight MPI Communicators with Applications to Perfectly Balanced Quicksort”. In: Computing Research Repository (CoRR) (Oct. 2017). arXiv: 1710.08027

Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. Engineering In-place (Shared-memory) Sorting Algorithms. Computing Research Repository (CoRR). Sept. 2020. arXiv: 2009.13569

Theses

Michael Axtmann. “Echtzeitfähige Implementierung, Visualisierung und Analyse von STaC”. Bachelor's Thesis. Karlsruhe Institute of Technology, Feb. 2012

Michael Axtmann. “Structural Variant Detection in Genome Sequencing”. Master's Thesis. Karlsruhe Institute of Technology, May 2014

Supervised Theses

Armin Wiebigke. “Massively Parallel Schizophrenic Quicksort”. Bachelor's Thesis. Karlsruhe Institute of Technology, Feb. 2017

Bibliography

[AH19] Martin Aumüller and Nikolaj Hass. “Simple and Fast BlockQuicksort using Lo- muto’s Partitioning Scheme”. In: 21st Workshop on Algorithm Engineering and Experiments (ALENEX). 2019, pages 15–26. doi: 10.1137/1.9781611975499.2. [see page 50] [AMD15] AMD Inc. AMD Athlon™ 64 X2 Dual-Core Processor Product Data Sheet. 2015. url: https://www.amd.com/system/files/TechDocs/33425.pdf, archived at https://web.archive.org/web/20190315070511/https://www.amd. com/system/files/TechDocs/33425.pdf on Mar. 15, 2019. [see page 1] [AMD20] AMD Inc1. AMD Ryzen™ Threadripper™ PRO 3995WX. 2020. url: https:// www.amd.com/en/products/cpu/amd-ryzen-threadripper-pro-3995wx, archived at https://web.archive.org/web/20201128100115/https:// www.amd.com/en/products/cpu/amd-ryzen-threadripper-pro-3995wx on Nov. 28, 2020. [see page 1] [App20] Apple Inc. Apple iPad Air with A14 Bionic. 2020. url: https://www.apple. com/newsroom/2020/09/apple-unveils-all-new-ipad-air-with-a14- bionic-apples-most-advanced-chip/, archived at https://web.archive. org/web/20201003014200/https://www.apple.com/newsroom/2020/09/ apple-unveils-all-new-ipad-air-with-a14-bionic-apples-most- advanced-chip/ on Oct. 3, 2020. [see page 1] [Arg+08] Lars Arge, Michael T. Goodrich, Michael J. Nelson, and Nodari Sitchinava. “Fun- damental Parallel Algorithms for Private-Cache Chip Multiprocessors”.In: 20th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 2008, pages 197–206. doi: 10.1145/1378533.1378573. [see pages 4, 39] [AS16] Michael Axtmann and Peter Sanders. “Robust Massively Parallel Sorting”. In: Computing Research Repository (CoRR) (June 2016). arXiv: 1606.08766. [see page 193] [AS17] Michael Axtmann and Peter Sanders. “Robust Massively Parallel Sorting”.In: 29th Workshop on Algorithm Engineering and Experiments (ALENEX). 2017, pages 83– 97. doi: 10.1137/1.9781611974768.7. [see pages 10–12, 85, 93, 102, 193] [AWS17] Michael Axtmann, Armin Wiebigke, and Peter Sanders. “Lightweight MPI Com- municators with Applications to Perfectly Balanced Quicksort”.In: Computing Research Repository (CoRR) (Oct. 2017). arXiv: 1710.08027. [see page 194]


[AWS18] Michael Axtmann, Armin Wiebigke, and Peter Sanders. “Lightweight MPI Communicators with Applications to Perfectly Balanced Quicksort”. In: 32th IEEE International Parallel and Distributed Processing Sympo- sium (IPDPS). 2018, pages 254–265. doi: 10 . 1109 / IPDPS . 2018 . 00035. [see pages 10, 11, 85, 96, 129, 182, 193] [Axt+15a] Michael Axtmann, Timo Bingmann, Peter Sanders, and Christian Schulz. “Practi- cal Massively Parallel Sorting”.In: 27th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 2015, pages 13–23. doi: 10.1145/2755573. 2755595. [see pages 10, 80, 85, 93, 121, 193] [Axt+15b] Michael Axtmann, Timo Bingmann, Peter Sanders, and Christian Schulz. “Prac- tical Massively Parallel Sorting”.In: Computing Research Repository (CoRR) (Aug. 2015). arXiv: 1410.6754. [see page 193] [Axt+17a] Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. “In-place Parallel Super Scalar Samplesort (IPS4o)”. In: Computing Research Repository (CoRR) (May 2017). arXiv: 1705.02257. [see page 193] [Axt+17b] Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. “In-Place Parallel Super Scalar Samplesort (IPSSSSo)”.In: 25th Annual European Symposium on Algorithms (ESA). 2017, 9:1–9:14. doi: 10 . 4230 / LIPIcs . ESA . 2017 . 9. [see page 193] [Axt+17c] Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. “In-Place Parallel Super Scalar Samplesort (IPSSSSo)”.In: 25th Annual European Symposium on Algorithms (ESA). 2017, 9:1–9:14. doi: 10 . 4230 / LIPIcs . ESA . 2017 . 9. [see pages 9, 15, 22, 53, 64] [Axt+20] Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. Engineering In-place (Shared-memory) Sorting Algorithms. Computing Research Repository (CoRR). Sept. 2020. arXiv: 2009.13569. [see pages 9, 15, 77, 194] [Axt12] Michael Axtmann. “Echtzeitfähige Implementierung, Visualisierung und Analyse von STaC”. Bachelor’s Thesis. Karlsruhe Institute of Technology, Feb. 2012. [see page 194] [Axt14] Michael Axtmann. “Structural Variant Detection in Genome Sequencing”.Mas- ter’s Thesis. Karlsruhe Institute of Technology, May 2014. [see page 194] [Axt20a] Michael Axtmann. (Parallel) Super Scalar Sample Sort. https://github.com/ ips4o/ps4o. Accessed: 2020-09-01. 2020. [see pages 49, 50] [Axt20b] Michael Axtmann. NUMA Array. https://github.com/ips4o/NumaArray. Accessed: 2020-09-01. 2020. [see page 64] [Bad13] Michael Bader. Space-Filling Curves - An Introduction with Applications in Sci- entific Computing. Volume 9. Texts in Computational Science and Engineering. Springer, 2013. doi: 10.1007/978-3-642-31046-1. [see page 3]


[BAD20] Guy E. Blelloch, Daniel Anderson, and Laxman Dhulipala. “ParlayLib - A Toolkit for Parallel Algorithms on Shared-Memory Multicore Machines”.In: 32nd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 2020, pages 507–509. doi: 10.1145/3350755.3400254. [see page 77] [Bal+09] Pavan Balaji, Darius Buntinas, David Goodell, William Gropp, Sameer Kumar, Ewing L. Lusk, Rajeev Thakur, and Jesper Larsson Träff. “MPI on a Million Processors”. In: 16th European PVM/MPI Users’ Group Meeting (EuroPVM/MPI). 2009, pages 20–30. doi: 10.1007/978-3-642-03770-2\_9. [see page 180] [Bal+95a] Vasanth Bala, Jehoshua Bruck, Robert Cypher, Pablo Elustondo, Alex Ho, Ching- Tien Ho, Shlomo Kipnis, and Marc Snir. “CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers”. In: IEEE Transactions on Parallel and Distributed Systems 6.2 (1995), pages 154–164. doi: 10.1109/71. 342126. [see page 91] [Bal+95b] Vasanth Bala, Jehoshua Bruck, Robert Cypher, Pablo Elustondo, Alex Ho, Ching- Tien Ho, Shlomo Kipnis, and Marc Snir. “CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers”. In: IEEE Transactions on Parallel and Distributed Systems 6.2 (1995), pages 154–164: IEEE. doi: 10. 1109/71.342126. [see page 180] [Bat68] Kenneth E. Batcher. “Sorting Networks and Their Applications”.In: American Fed- eration of Information Processing Societies (AFIPS) Conference. 1968, pages 307– 314. doi: 10.1145/1468075.1468121. [see pages 12, 88, 95, 116] [BDadH95] Armin Bäumker, Wolfgang Dittrich, and Friedhelm Meyer auf der Heide. “Truly Efficient Parallel Algorithms: 푐-Optimal Multisearch for an Extension of the BSP Model”. In: 3rd Annual European Symposium on Algorithms (ESA). 1995, pages 17–30. doi: 10.1007/3-540-60313-1\_131. [see page 92] [BES17] Timo Bingmann, Andreas Eberle, and Peter Sanders. “Engineering Parallel String Sorting”.In: Algorithmica 77.1 (2017), pages 235–286: Springer. doi: 10.1007/ s00453-015-0071-1. [see pages 2, 18, 21, 37] [BFV07] Gerth Stølting Brodal, Rolf Fagerberg, and Kristoffer Vinther. “Engineering a cache-oblivious sorting algorithm”.In: ACM Journal of Experimental Algorithmics 12 (2007), 2.2:1–2.2:23. doi: 10.1145/1227161.1227164. [see page 2] [BGS10] Guy E. Blelloch, Phillip B. Gibbons, and Harsha Vardhan Simhadri. “Low depth cache-oblivious algorithms”. In: 22nd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 2010, pages 189–199. doi: 10.1145/ 1810479.1810519. [see pages 2, 21, 49, 98] [Bin+16a] Timo Bingmann, Michael Axtmann, Emanuel Jöbstl, Sebastian Lamm, Huyen Chau Nguyen, Alexander Noe, Sebastian Schlag, Matthias Stumpp, Tobias Sturm, and Peter Sanders. “Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++”. In: 2016 IEEE International Conference on Big Data. 2016, pages 172–183. doi: 10.1109/bigdata.2016.7840603. [see page 193]


[Bin+16b] Timo Bingmann, Michael Axtmann, Emanuel Jöbstl, Sebastian Lamm, Huyen Chau Nguyen, Alexander Noe, Sebastian Schlag, Matthias Stumpp, Tobias Sturm, and Peter Sanders. “Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++”. In: Computing Research Repository (CoRR) (Aug. 2016). arXiv: 1608.05634. [see page 193] [Bin18a] Timo Bingmann. “Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools”. PhD thesis. Department of Computer Science, Karlsruhe Institute of Technology, Germany, 2018. 374 pages. doi: 10.5445/IR/1000085031. url: https://arxiv.org/abs/1808.00963. [see page 55] [Bin18b] Timo Bingmann. TLX: Collection of Sophisticated C++ Data Structures, Algo- rithms, and Miscellaneous Helpers. 2018. url: https://panthema.net/tlx, archived at https://web.archive.org/web/20191103202232/https:// panthema.net/tlx on Nov. 3, 2019. [see page 130] [BK86] Huang Bing-Chao and Donald E Knuth. “A one-way, stackless quicksort algo- rithm”. In: BIT Numerical Mathematics 26.1 (1986), pages 127–130: Springer . [see page 20] [Ble+96] Guy E. Blelloch, Charles E. Leiserson, Bruce M. Maggs, C. Greg Plaxton, Stephen J. Smith, and Marco Zagha. “A Comparison of Sorting Algorithms for the Connec- tion Machine CM-2”.In: 3rd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 1996, pages 273–297. doi: 10.1145/113379.113380. [see pages 2, 3, 21, 93, 94, 96, 98] [BMS20] Timo Bingmann, Jasper Marianczuk, and Peter Sanders. Engineering Faster Sorters for Small Sets of Items. Computing Research Repository (CoRR). Aug. 2020. arXiv: 2002.05599. [see page 80] [Bok91] Shahid H. Bokhari. “Multiphase Complete Exchange on a Circuit Switched Hy- percube”.In: International Conference on Parallel Processing. 1991, pages 525–529. [see page 90] [Bor13] Shekhar Borkar. “Exascale Computing - A Fact or a Fiction?” In: 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS). May 2013, page 3. Keynote presentation. doi: 10.1109/IPDPS.2013.121. [see page 92] [Bra17] Berenger Bramas. “A Novel Hybrid Quicksort Algorithm Vectorized using AVX- 512 on Intel Skylake”. In: International Journal of Advanced Computer Science and Applications (IJACSA) 8.10 (2017), pages 337–344. doi: 10.14569/ijacsa. 2017.081044. [see page 80] [Bru+97] Jehoshua Bruck, Ching-Tien Ho, Shlomo Kipnis, Eli Upfal, and Derrick Weath- ersby. “Efficient Algorithms for All-to-All Communications in Multiport Message- Passing Systems”.In: IEEE Transactions on Parallel and Distributed Systems 8.11 (1997), pages 1143–1156. doi: 10.1109/71.642949. [see page 103]


[CG08] Mohamad Chaarawi and Edgar Gabriel. “Evaluating Sparse Data Storage Techniques for MPI Groups and Communicators”. In: 8th International Conference on Computational Science (ICCS). 2008, pages 297–306. doi: 10.1007/978-3-540-69384-0_35. [see page 181]
[Cho+15] Minsik Cho, Daniel Brand, Rajesh Bordawekar, Ulrich Finkler, Vincent KulandaiSamy, and Ruchir Puri. “PARADIS: An Efficient Parallel Algorithm for In-place Radix Sort”. In: International Conference on Very Large Data Bases (VLDB) 8.12 (2015), pages 1518–1529. doi: 10.14778/2824032.2824050. [see page 22]
[Cod+17] Michael Codish, Luís Cruz-Filipe, Markus Nebel, and Peter Schneider-Kamp. “Optimizing sorting algorithms by using sorting networks”. In: Formal Aspects of Computing 29.3 (2017), pages 559–579. doi: 10.1007/s00165-016-0401-3. [see page 80]
[Col88] Richard Cole. “Parallel Merge Sort”. In: SIAM Journal on Computing 17.4 (1988), pages 770–785. doi: 10.1137/0217049. [see page 3]
[Cor+09] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, 3rd Edition. MIT Press, 2009. isbn: 978-0-262-03384-8. [see page 1]
[Cor20] Intel Corporation. Intel® Integrated Performance Primitives. https://software.intel.com/en-us/ipp-dev-reference. Version 2020 Initial Release. 2020. [see pages 2, 50]
[CR10] Richard Cole and Vijaya Ramachandran. “Resource Oblivious Sorting on Multicores”. In: 37th International Colloquium on Automata, Languages, and Programming (ICALP). 2010, pages 226–237. doi: 10.1007/978-3-642-14165-2_20. [see page 98]
[CR72] Stephen A. Cook and Robert A. Reckhow. “Time-Bounded Random Access Machines”. In: 4th Annual ACM Symposium on Theory of Computing (STOC). 1972, pages 73–80. doi: 10.1145/800152.804898. [see page 4]
[Den03] John M. Dennis. “Partitioning with Space-Filling Curves on the Cubed-Sphere”. In: 17th IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2003, page 269. doi: 10.1109/IPDPS.2003.1213486. [see page 3]
[Din+11] James Dinan, Sriram Krishnamoorthy, Pavan Balaji, Jeff R. Hammond, Manojkumar Krishnan, Vinod Tipparaju, and Abhinav Vishnu. “Noncollective Communicator Creation in MPI”. In: 18th European MPI Users' Group Meeting (EuroMPI). 2011, pages 282–291. doi: 10.1007/978-3-642-24449-0_32. [see page 180]
[DJW14] Brian C. Dean, Rommel Jalasutram, and Chad G. Waters. “Lightweight Approximate Selection”. In: 27th Annual European Symposium on Algorithms (ESA). 2014, pages 309–320. doi: 10.1007/978-3-662-44777-2_26. [see pages 105, 106]
[DM02] Elizabeth D. Dolan and Jorge J. Moré. “Benchmarking optimization software with performance profiles”. In: Mathematical Programming 91.2 (2002), pages 201–213: Springer. doi: 10.1007/s101070100263. [see page 56]


[DP73] David H Douglas and Thomas K Peucker. “Algorithms for the reduction of the number of points required to represent a digitized line or its caricature”. In: Cartographica: The International Journal for Geographic Information and Geovisu- alization 10.2 (1973), pages 112–122: University of Toronto Press . [see page 81] [Dur86] Branislav Durian. “Quicksort Without a Stack”. In: Mathematical Foundations of Computer Science. 1986, pages 283–289. doi: 10.1007/BFb0016252. [see pages 20, 157] [EKS12] Amr Elmasry, Jyrki Katajainen, and Max Stenmark. “Branch Mispredictions Don’t Affect Mergesort”. In: 11th International Symposium on Experimental Algorithms (SEA). 2012, pages 160–171. doi: 10.1007/978-3-642-30850-5\_15. [see page 22] [EW16] Stefan Edelkamp and Armin Weiß. “BlockQuicksort: Avoiding Branch Mispre- dictions in Quicksort”.In: 24th Annual European Symposium on Algorithms (ESA). 2016, 38:1–38:16. doi: 10.4230/LIPIcs.ESA.2016.38. [see pages 2, 21, 50, 51, 59] [EW19] Stefan Edelkamp and Armin Weiß. “Worst-Case Efficient Sorting with Quick- Mergesort”. In: 21st Workshop on Algorithm Engineering and Experiments (ALENEX). 2019, pages 1–14. doi: 10.1137/1.9781611975499.1. [see page 22] [EW92] Vladimir Estivill-Castro and Derick Wood. “A Survey of Adaptive Sorting Algo- rithms”. In: ACM Computing Surveys 24.4 (1992), pages 441–476. doi: 10.1145/ 146370.146381. [see page 81] [FG05] Gianni Franceschini and Viliam Geffert. “An in-place sorting with O(푛 log 푛) comparisons and O(n) moves”.In: Journal of the ACM 52.4 (2005), pages 515–537. doi: 10.1145/1082036.1082037. [see page 22] [FM70] W. Donald Frazer and A. C. McKellar. “Samplesort: A Sampling Approach to Minimal Storage Tree Sorting”.In: Journal of the ACM 17.3 (1970), pages 496–507. doi: 10.1145/321592.321600. [see page 21] [For12] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Version 3.0. mpi-forum.org, 2012, pages 1–852. [see page 88] [FP92] Rhys S. Francis and L. J. H. Pannan. “A parallel partition for enhanced parallel QuickSort”.In: Parallel Computing 18.5 (1992), pages 543–550. doi: 10.1016/ 0167-8191(92)90089-P. [see pages 2, 20] [Fra04] Gianni Franceschini. “Proximity Mergesort: optimal in-place sorting in the cache- oblivious model”.In: 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 2004, pages 291–299. [see pages 2, 22] [Fri56] Edward H. Friend. “Sorting on Electronic Computer Systems”. In: Journal of the ACM 3.3 (1956), pages 134–168. doi: 10.1145/320831.320833. [see page 10] [FW78] Steven Fortune and James Wyllie. “Parallelism in Random Access Machines”. In: 4th Annual ACM Symposium on Theory of Computing (STOC). 1978, pages 114– 118. doi: 10.1145/800133.804339. [see page 4]


[FW86] Philip J. Fleming and John J. Wallace. “How Not To Lie With Statistics: The Correct Way To Summarize Benchmark Results”.In: Communications of the ACM 29.3 (1986), pages 218–221. doi: 10.1145/5666.5673. [see page 56] [GG10] Viliam Geffert and Jozef Gajdos. “Multiway in-place merging”. In: Theoretical Computer Science 411.16-18 (2010), pages 1793–1808: Elsevier. doi: 10.1016/j. tcs.2010.01.034. [see page 22] [Gib89] Phillip B. Gibbons. “A More practical PRAM Model”.In: Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, SPAA ’89, Santa Fe, New Mexico, USA, June 18-21, 1989. 1989, pages 158–168. doi: 10.1145/72935. 72953. [see page 6] [GL91] Xiaojun Guan and Michael A. Langston. “Time-Space Optimal Parallel Merging and Sorting”. In: IEEE Transactions on Computers 40.5 (1991), pages 596–602: IEEE. doi: 10.1109/12.88483. [see page 22] [GM14] Fuji Goro and Morwenn. TimSort. https : / / github . com / timsort / cpp - TimSort.git. Accessed: 2020-09-01. 2014. [see page 50] [Goo99] Michael T. Goodrich. “Communication-Efficient Parallel Sorting”. In: SIAM Jour- nal on Computing 29.2 (1999), pages 416–432. doi: https://doi.org/fdrws9. [see pages 3, 93, 97, 98] [GOS21] Yan Gu, Omar Obeya, and Julian Shun. Parallel In-Place Algorithms: Theory and Practice. Computing Research Repository (CoRR). Jan. 2021. arXiv: 2103.01216. [see pages 17, 22] [Gro17] REFRESH Bioinformatics Group. RADULS2. https://github.com/refresh- bio/RADULS. Accessed: 2020-09-01. 2017. [see page 50] [GV94] Alexandros V. Gerbessiotis and Leslie G. Valiant. “Direct Bulk-Synchronous Parallel Algorithms”.In: Journal of Parallel and Distributed Computing 22.2 (1994), pages 251–267: Elsevier. doi: 10.1006/jpdc.1994.1085. [see pages 3, 93, 97, 98, 114, 120, 123] [HBJ98] David R. Helman, David A. Bader, and Joseph JáJá. “A Randomized Parallel Sorting Algorithm with an Experimental Study”.In: JPDC 52.1 (1998), pages 1– 23. doi: 10.1006/jpdc.1998.1462. [see pages 98, 104, 127, 132] [HHL88] Sandra Mitchell Hedetniemi, Stephen T. Hedetniemi, and Arthur L. Liestman. “A survey of gossiping and broadcasting in communication networks”. In: Networks 18.4 (1988), pages 319–349. doi: 10.1002/net.3230180406. [see page 89] [HKS19] Vipul Harsh, Laxmikant V. Kalé, and Edgar Solomonik. “Histogram Sort with Sampling”.In: The 31st Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 2019, pages 201–212. doi: 10.1145/3323165.3323184. [see pages 3, 11, 12, 93, 97, 124, 127, 129, 130, 148, 177, 180, 182] [HL92] Bing-Chao Huang and Michael A. Langston. “Fast Stable Merging and Sorting in Constant Extra Space”.In: The Computer Journal 35.6 (1992), pages 643–650: IEEE. doi: 10.1093/comjnl/35.6.643. [see page 22]


[HLR07] Torsten Hoefler, Andrew Lumsdaine, and Wolfgang Rehm. “Implementation and Performance Analysis of Non-Blocking Collective Operations for MPI”.In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 2007, page 52. doi: 10.1145/1362622.1362692. [see page 180] [HNR90] Philip Heidelberger, Alan Norton, and John T. Robinson. “Parallel Quicksort Us- ing Fetch-and-Add”. In: IEEE Transactions on Computers 39.1 (1990), pages 133– 138: IEEE. doi: 10.1109/12.46289. [see pages 2, 20] [Hoa61] C. A. R. Hoare. “Algorithm 65: Find”.In: Communications of the ACM 4.7 (1961), pages 321–322. doi: 10.1145/366622.366647. [see page 91] [Hoa62] C. A. R. Hoare. “Quicksort”.In: The Computer Journal 5.1 (1962), pages 10–15: IEEE. doi: 10.1093/comjnl/5.1.10. [see pages 2, 20] [HR89] Torben Hagerup and Christine Rüb. “Optimal Merging and Sorting on the EREW PRAM”.In: Information Processing Letters 33.4 (1989), pages 181–185. doi: 10. 1016/0020-0190(89)90138-5. [see page 116] [HS16] Lorenz Hübschle-Schneider and Peter Sanders. “Communication Efficient Al- gorithms for Top-k Selection Problems”. In: 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS). May 2016, pages 659–668. doi: 10.1109/IPDPS.2016.45. [see pages 91, 115] [HSL10] Torsten Hoefler, Christian Siebert, and Andrew Lumsdaine. “Scalable commu- nication protocols for dynamic sparse data exchange”.In: 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 2010, pages 159–168. doi: 10.1145/1693453.1693476. [see page 130] [Hüb16] Lorenz Hübschle-Schneider. Super Scalar Sample Sort. https://github.com/ lorenzhs/ssssort. Accessed: 2020-09-01. 2016. [see pages 50, 54] [HWF15] Kaixi Hou, Hao Wang, and Wu-chun Feng. “ASPaS: A Framework for Automatic SIMDization of Parallel Sorting on x86-based Many-core Processors”.In: 29th International Conference on Supercomputing (ICS). 2015, pages 383–392. doi: 10.1145/2751205.2751247. [see pages 2, 49] [II86] Hiroshi Imai and Masao Iri. “An optimal algorithm for approximating a piecewise linear function”.In: Journal of Information Processing 9.3 (1986), pages 159–162 . [see page 81] [JáJ00] Joseph JáJá. “A Perspective on Quicksort”.In: Computing in Science and Engineer- ing 2.1 (2000), pages 43–49: IEEE. doi: 10.1109/5992.814657. [see pages 46, 47] [JK77] Norman L. Johnson and Samuel Kotz. Urn Models and Their Application: An Approach to Modern Discrete Probability Theory. Approach to Modern Discrete Probability Theory. Wiley, 1977. isbn: 9780471446309. [see page 103] [JM14] Tomasz Jurkiewicz and Kurt Mehlhorn. “On a Model of Virtual Address Trans- lation”.In: ACM Journal of Experimental Algorithmics 19.1 (2014), 1.9:1–1.9:28: ACM. doi: 10.1145/2656337. [see page 2]


[Joh84] S Lennart Johnsson. “Combining Parallel and Sequential Sorting on a Boolean N-Cube”. In: ICPP. 1984, pages 444–448. [see pages 12, 93, 95] [KDD17] Marek Kokot, Sebastian Deorowicz, and Maciej Dlugosz. “Even Faster Sorting of (Not Only) Integers”.In: 5th International Conference on Man-Machine Interac- tions (ICMMI). 2017, pages 481–491. doi: 10.1007/978-3-319-67792-7\_47. [see pages 2, 50] [KK08] Pok-Son Kim and Arne Kutzner. “Ratio Based Stable In-Place Merging”.In: Theory and Applications of Models of Computation (TAMC). 2008, pages 246–257. doi: 10.1007/978-3-540-79228-4\_22. [see pages 22, 50] [KK93] Laxmikant V. Kalé and Sanjeev Krishnan. “A Comparison Based Parallel Sorting Algorithm”.In: International Conference on Parallel Processing. 1993, pages 196– 200. doi: 10.1109/ICPP.1993.17. [see pages 3, 12, 91, 93, 96–98, 114, 144, 150] [Knu73] Donald E. Knuth. The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley, 1973. isbn: 0-201-03803-X. [see pages 2, 90, 116] [KPT96] Jyrki Katajainen, Tomi Pasanen, and Jukka Teuhola. “Practical In-Place Merge- sort”. In: Nordic Journal of Computing 3.1 (1996), pages 27–40: ACM . [see page 22] [Kri+20] Ani Kristo, Kapil Vaidya, Ugur Çetintemel, Sanchit Misra, and Tim Kraska. “The Case for a Learned Sorting Algorithm”.In: ACM International Conference on Man- agement of Data (SIGMOD). 2020, pages 1001–1016. doi: 10.1145/3318464. 3389752. [see page 77] [Kri20] Ani Kristo. LearnedSort. https : / / github . com / learnedsystems / LearnedSort. Accessed: 2021-02-11. 2020. [see pages 77, 81] [Kru92] David W. Krumme. “Fast Gossiping for the Hypercube”. In: SIAM Journal on Computing 21.2 (1992), pages 365–380. doi: 10.1137/0221026. [see page 89] [KS06a] Kanela Kaligosi and Peter Sanders. “How Branch Mispredictions Affect Quick- sort”.In: 14th Annual European Symposium on Algorithms (ESA). 2006, pages 780– 791. url: https://doi.org/10.1007/11841036%5C_69. [see pages 2, 21] [KS06b] Kanela Kaligosi and Peter Sanders. “How Branch Mispredictions Affect Quick- sort”. In: 14th Annual European Symposium on Algorithms (ESA). Springer. Sept. 2006, pages 780–791. doi: 10.1007/11841036_69. [see page 2] [Kum+94] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and . Addison-Wesley Longman, Amsterdam, 1994, page 597. isbn: 0805331700. [see pages 1, 9] [Kus+14] Shrinu Kushagra, Alejandro López-Ortiz, Aurick Qiao, and J. Ian Munro. “Multi- Pivot Quicksort: Theory and Experiments”. In: 16th Workshop on Algorithm Engineering and Experiments (ALENEX). 2014, pages 47–60. doi: 10.1137/1. 9781611973198.6. [see page 20]


[Lan06] Hans-Werner Lang. Bitonic for 푛 not a power of 2. 2006. url: http: //www.inf.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/ oddn.htm, archived at https://web.archive.org/web/20180910044034/ http : / / www . inf . fh - flensburg . de / lang / algorithmen / sortieren / bitonic/oddn.htm on Sept. 10, 2018. [see pages 95, 129] [LBH07] Justin Luitjens, Martin Berzins, and Tom Henderson. “Parallel space-filling curve generation through sorting”. In: Concurrency and Computation: Practice and Experience 19.10 (2007), pages 1387–1402: John Wiley & Sons, Ltd. doi: 10. 1002/cpe.1179. [see page 3] [Lei15] Leibniz-Rechenentrum. SuperMUC-NG. 2015. url: https://www.lrz.de/ services/compute/supermuc/systemdescription/, archived at https:// web.archive.org/web/20181110053315/https://www.lrz.de/services/ compute/supermuc/systemdescription/ on Nov. 10, 2018. [see pages 128, 178] [Lei18] Leibniz-Rechenentrum. SuperMUC-NG. 2018. url: https://doku.lrz.de/ display/PUBLIC/SuperMUC-NG, archived at https://web.archive.org/ web/20201111112101/https://doku.lrz.de/display/PUBLIC/SuperMUC- NG on Nov. 11, 2020. [see pages 6, 128] [Lei92] F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures. Mor- gan Kaufmann, 1992, pages 389–783. isbn: 978-1-4832-0772-8. doi: 10.1016/ C2013-0-08299-0. [see pages 89, 128] [LM92] Youran Lan and Magdi A. Mohamed. “Parallel Quicksort in hypercubes”.In: 1992 Symposium on Applied Computing (SAC). 1992, pages 740–746. doi: 10.1145/ 130069.130085. [see pages 3, 93, 95, 107, 108] [LS94] Hui Li and Kenneth C. Sevcik. “Parallel Sorting by Over Partitioning”.In: The 6th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 1994, pages 46–56. doi: 10.1145/181014.192329. [see pages 98, 121] [MBM93] Peter M. McIlroy, Keith Bostic, and M. Douglas McIlroy. “Engineering Radix Sort”. In: Computing Systems 6.1 (1993), pages 5–27 . [see page 21] [McF14] Mike McFadden. WikiSort. https : / / github . com / BonzaiThePenguin / WikiSort. Accessed: 2020-09-01. 2014. [see page 50] [McG12] Catherine C. McGeoch. A Guide to Experimental Algorithmics. Cambridge Uni- versity Press, 2012. isbn: 978-0-521-17301-8. [see page 56] [MG89] Charles U. Martel and Dan Gusfield. “A Fast Parallel Quicksort Algorithm”.In: Information Processing Letters 30.2 (1989), pages 97–102. doi: 10.1016/0020- 0190(89)90116-6. [see pages 2, 20] [MS03] Kurt Mehlhorn and Peter Sanders. “Scanning Multiple Sequences Via Cache Memory”. In: Algorithmica 35.1 (2003), pages 75–93: Springer. doi: 10.1007/ s00453-002-0993-2. [see page 43]


[MU05] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005. isbn: 978-0-521-83540-4. doi: 10.1017/CBO9780511813603. [see page 8] [Mus97] David R. Musser. “Introspective Sorting and Selection Algorithms”.In: Software: Practice and Experience 27.8 (1997), pages 983–993: John Wiley & Sons, Ltd . [see page 20] [MW18] J. Ian Munro and Sebastian Wild. “Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs”. In: 26th Annual Euro- pean Symposium on Algorithms (ESA). 2018, 63:1–63:16. doi: 10.4230/LIPIcs. ESA.2018.63. [see page 81] [Obe+19a] Omar Obeya, Endrias Kahssay, Edward Fan, and Julian Shun. RegionSort. https: //github.com/omarobeya/parallel-inplace-radixsort. Accessed: 2020- 09-01. 2019. [see page 49] [Obe+19b] Omar Obeya, Endrias Kahssay, Edward Fan, and Julian Shun. “Theoretically- Efficient and Practical Parallel In-Place Radix Sorting”.In: The 31st Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). ACM. 2019, pages 213–224. [see pages 2, 22, 49, 50, 53] [Pet02] Tim Peters. Timsort. http://svn.python.org/projects/python/trunk/Objects/list- sort.txt. Accessed: 2020-03-31. 2002, archived at https://web.archive.org/ web/20070301000000/ on Mar. 1, 2007. [see pages 50, 81] [Pet15] Orson Peters. Pattern-defeating quicksort. https : / / github . com / orlp / pdqsort. Accessed: 2020-09-01. 2015. [see page 50] [Pol14] Orestis Polychroniou. In-place MSB. http : / / www . cs . columbia . edu / ~orestis/publications.html. Accessed: 2020-09-01. 2014. [see pages 49, 76] [PR14] Orestis Polychroniou and Kenneth A. Ross. “A comprehensive study of main- memory partitioning and its application to large-scale comparison- and radix- sort”.In: ACM International Conference on Management of Data (SIGMOD). 2014, pages 755–766. doi: 10.1145/2588555.2610522. [see pages 2, 21, 49, 63, 76] [RB90] Peter J Rousseeuw and Gilbert W Bassett Jr. “The remedian: A Robust Averaging Method for Large Data Sets”.In: Journal of the American Statistical Association 85.409 (1990), pages 97–104: Taylor & Francis Group. doi: 10.2307/2289530. [see pages 105, 108] [Rei07] James Reinders. Intel threading building blocks - outfitting C++ for multi-core processor parallelism. O’Reilly, 2007. isbn: 978-0-596-51480-8. [see pages 2, 49, 53] [San+18] Peter Sanders, Sebastian Lamm, Lorenz Hübschle-Schneider, Emanuel Schrade, and Carsten Dachsbacher. “Efficient Parallel Random Sampling – Vectorized, Cache-Efficient, and Online”.In: ACM Transactions on Mathematical Software (TOMS) 44.3 (Apr. 2018), 29:1–29:14: ACM. doi: 10.1145/3157734. [see page 108]


[San+19] Peter Sanders, Kurt Mehlhorn, Martin Dietzfelbinger, and Roman Dementiev. Se- quential and Parallel Algorithms and Data Structures - The Basic Toolbox. Springer, 2019. isbn: 978-3-030-25208-3. doi: 10.1007/978-3-030-25209-0. [see pages 1, 4, 6, 89] [San00] Peter Sanders. “Fast Priority Queues for Cached Memory”.In: ACM Journal of Experimental Algorithmics 5 (2000), page 7. doi: 10.1145/351827.384249. [see page 90] [San08a] Peter Sanders. Course on Parallel Algorithms, Lecture slides, 100–105. 2008. url: https://algo2.iti.kit.edu/sanders/courses/paralg08/vorlesung1. pdf, archived at https://web.archive.org/web/20190819145046/https: //algo2.iti.kit.edu/sanders/courses/paralg08/vorlesung1.pdf on Aug. 19, 2019. [see page 94] [San08b] Peter Sanders. Course on Parallel Algorithms, Lecture slides, 126–128. 2008. url: https://algo2.iti.kit.edu/sanders/courses/paralg08/vorlesung1. pdf, archived at https://web.archive.org/web/20190819145046/https: //algo2.iti.kit.edu/sanders/courses/paralg08/vorlesung1.pdf on Aug. 19, 2019. [see page 91] [SD15] Michael Stephan and Jutta Docter. “Jülich Supercomputing Centre. JUQUEEN: IBM Blue Gene/Q Supercomputer System at the Jülich Supercomputing Centre”. In: Journal of Large-Scale Research Facilities (2015). doi: 10.17815/jlsrf-1-18. [see pages xi, 128] [SH97] Peter Sanders and Thomas Hansch. “Efficient Massively Parallel Quicksort”.In: 4th Solving Irregularly Structured Problems in Parallel (IRREGULAR). 1997, pages 13– 24. doi: 10.1007/3-540-63138-0\_2. [see pages 98, 104] [Shu+12] Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Aapo Kyrola, Harsha Vardhan Simhadri, and Kanat Tangwongsan. “Brief announcement: the problem based benchmark suite”. In: 24th Annual ACM Symposium on Paral- lelism in Algorithms and Architectures (SPAA). 2012, pages 68–70. doi: 10.1145/ 2312005.2312018. [see pages 2, 49, 51] [SK10] Edgar Solomonik and Laxmikant V. Kalé. “Highly Scalable Parallel Sorting”.In: 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2010, pages 1–12. doi: 10.1109/IPDPS.2010.5470406. [see pages 3, 91, 93, 96–98] [Ska16a] Malte Skarupke. I Wrote a Faster Sorting Algorithm. https://probablydance. com/2016/12/27/i-wrote-a-faster-sorting-algorithm/. Accessed: 2020-03-31. 2016. [see pages 2, 21, 37, 50] [Ska16b] Malte Skarupke. Ska Sort. https://github.com/skarupke/ska_sort. Ac- cessed: 2020-09-01. 2016. [see pages 37, 50] [SM08] Peter Sanders and Kurt Mehlhorn. Algorithms and Data Structures - The Basic Toolbox. Springer, 2008. isbn: 978-3-540-77977-3. doi: 10.1007/978-3-540- 77978-0. [see page 91]


[SM13] Peter Sanders, Sebastian Schlag, and Ingo Müller. “Communication Efficient Algorithms for Fundamental Big Data Problems”. In: 2013 IEEE International Conference on Big Data. 2013, pages 15–23. doi: 10.1109/BigData.2013.6691549. [see pages xi, 92]
[SMB13] Hari Sundar, Dhairya Malhotra, and George Biros. “HykSort: a New Variant of Hypercube Quicksort on Distributed Memory Architectures”. In: 27th International Conference on Supercomputing (ICS). 2013, pages 293–302. doi: 10.1145/2464996.2465442. [see pages 3, 11, 12, 93, 95, 97, 98, 107, 108, 127, 129, 138, 140, 177, 180, 182]
[SS63] John C. Shepherdson and Howard E. Sturgis. “Computability of Recursive Functions”. In: Journal of the ACM 10.2 (1963), pages 217–255. doi: 10.1145/321160.321170. [see page 4]
[SSP07] Johannes Singler, Peter Sanders, and Felix Putze. “MCSTL: The Multi-core Standard Template Library”. In: 13th International Euro-Par Conference. 2007, pages 682–694. doi: 10.1007/978-3-540-74466-5_72. [see pages 2, 3, 23, 49, 90, 91, 96, 130]
[SST09] Peter Sanders, Jochen Speck, and Jesper Larsson Träff. “Two-Tree Algorithms for Full Bandwidth Broadcast, Reduction and Scan”. In: Parallel Computing 35.12 (2009), pages 581–594: Elsevier. doi: 10.1016/j.parco.2009.09.001. [see pages 88, 91, 182]
[ST02] Peter Sanders and Jesper Larsson Träff. “The Hierarchical Factor Algorithm for All-to-All Communication”. In: 8th International Euro-Par Conference. 2002, pages 799–804. doi: 10.1007/3-540-45706-2_112. [see pages 92, 129]
[SW04] Peter Sanders and Sebastian Winkel. “Super Scalar Sample Sort”. In: 12th Annual European Symposium on Algorithms (ESA). 2004, pages 784–796. doi: 10.1007/978-3-540-30140-0_69. [see pages 2, 18, 21, 50, 54, 59, 79, 81, 91]
[SW11] Christian Siebert and Felix Wolf. “Parallel Sorting with Minimal Data”. In: 18th European MPI Users' Group Meeting (EuroMPI). 2011, pages 170–177. doi: 10.1007/978-3-642-24449-0_20. [see pages 3, 11, 93, 96, 105, 106, 114, 127, 130, 138]
[SyN18] Virginia Tech SyNeRGy Lab. ASPaS. https://github.com/vtsynergy/aspas_sort. Accessed: 2020-09-01. 2018. [see page 49]
[top05] top500.org. TOP 10 Sites for November 2005. 2005. url: https://www.top500.org/lists/top500/2005/11/, archived at https://web.archive.org/web/20210103032640/https://www.top500.org/lists/top500/2005/11/ on Jan. 3, 2021. [see page 1]
[top20] top500.org. TOP 10 Sites for November 2020. 2020. url: https://www.top500.org/lists/top500/2020/11/, archived at https://web.archive.org/web/20210103025922/https://www.top500.org/lists/top500/2020/11/ on Jan. 3, 2021. [see page 1]


[Trä18] Jesper Larsson Träff. “Practical, distributed, low overhead algorithms for irregular gather and scatter collectives”.In: Parallel Computing 75 (2018), pages 100–117. doi: 10.1016/j.parco.2018.04.003. [see page 88] [TZ03] Philippas Tsigas and Yi Zhang. “A Simple, Fast Parallel Implementation of Quick- sort and its Performance Evaluation on SUN Enterprise 10000”.In: 11th Euromicro Workshop on Parallel, Distributed and Network-Based Processing (PDP). 2003, page 372. doi: 10.1109/EMPDP.2003.1183613. [see pages 2, 20, 49] [Val90] Leslie G. Valiant. “A Bridging Model for Parallel Computation”. In: Communi- cations of the ACM 33.8 (1990), pages 103–111. doi: 10.1145/79173.79181. [see page 4] [Var+91] Peter J. Varman, Scott D. Scheufler, Balakrishna R. Iyer, and Gary R. Ricard. “Merging Multiple Lists on Hierarchical-Memory Multiprocessors”.In: Journal of Parallel and Distributed Computing 12.2 (1991), pages 171–177: Elsevier. doi: 10.1016/0743-7315(91)90022-2. [see pages 3, 91, 93, 96] [vdGei91] Robert A van de Geijn. “Efficient global combine operations”.In: 16th Distributed Memory Computing Conference. IEEE Computer Society. 1991, pages 291–292. doi: 10.1109/DMCC.1991.633147. [see page 89] [vNeu45] John von Neumann. The First Draft Report on the EDVAC (Electronic Discrete Variable Automatic Calculator). Technical report. Contract no. W-670-ORD-4926 between the United States Army Ordinance Department and the University of Pennsylvania. Moore School of Electrical Engineering, University of Pennsylvania, June 1945. [see page 4] [Wag87] Bruce Wagar. “Hyperquicksort: A fast sorting algorithm for hypercubes”.In: 2nd Conference on Hypercube Multiprocessors. 1987, pages 292–299. [see pages 3, 93, 95, 107, 108] [Weg87] Lutz M. Wegner. “A Generalized, One-Way, Stackless Quicksort”. In: BIT Numer- ical Mathematics 27.1 (1987), pages 44–48: Springer. doi: 10.1007/BF01937353. [see page 20] [Wei16a] Armin Weiss. BlockQuicksort. www.github.com/weissan/BlockQuicksort. Accessed: 2020-09-01. 2016. [see page 50] [Wei16b] Armin Weiss. Yaroslavskiy’s Dual-Pivot Quicksort. https : / / github . com / weissan / BlockQuicksort / blob / master / Yaroslavskiy . h + +. Accessed: 2020-09-01. 2016. [see page 50] [Wen19] Jokob Wenzel. Intel Threading Building Blocks with CMake build system. https: //github.com/wjakob/tbb. TBB 2019 Update 6. 2019. [see page 49] [Wie17a] Armin Wiebigke. „Massively Parallel Schizophrenic Quicksort“. Bachelor’s Thesis. Department of Computer Science, Karlsruhe Institute of Technology, Germany, Feb. 2017. [see page 182] [Wie17b] Armin Wiebigke. “Massively Parallel Schizophrenic Quicksort”.Bachelor Thesis. Karlsruhe Institute of Technology, Feb. 2017. [see page 194]


[WS11] Jan Wassenberg and Peter Sanders. “Engineering a Multi-core Radix Sort”. In: International Euro-Par Conference. 2011, pages 160–169. doi: 10.1007/978-3- 642-23397-5\_16. [see page 2] [Yar10] Vladimir Yaroslavskiy. Question on sorting. http://mail.openjdk.java.net/ pipermail/core-libs-dev/2010-July/004649.html. Accessed: 2020-09-01. 2010. [see pages 2, 20, 50]
