Felix Putze and Peter Sanders

Course Notes Algorithm Engineering
TU Karlsruhe
October 19, 2009

Preface
These course notes cover a lecture on algorithm engineering for the basic toolbox that Peter Sanders has been giving at Universität Karlsruhe since 2004. The text is compiled from slides, scientific papers, and other manuscripts. Most of this material is in English, so English was adopted as the main language; some parts are in German. The primary sources of our material are given at the beginning of each chapter. Please refer to the original publications for further references. This document is still work in progress. Please report bugs of any type (content, language, layout, ...) to [email protected]. Thank you!
Contents

1 What is Algorithm Engineering?
  1.1 Introduction
  1.2 State of Research and Open Problems

2 Data Structures
  2.1 Arrays & Lists
  2.2 External Lists
  2.3 Stacks, Queues & Variants

3 Sorting
  3.1 Quicksort Basics
  3.2 Refined Quicksort
  3.3 Lessons from experiments
  3.4 Super Scalar Sample Sort
  3.5 Multiway Merge Sort
  3.6 Sorting with parallel disks
  3.7 Internal work of Multiway Mergesort
  3.8 Experiments

4 Priority Queues
  4.1 Introduction
  4.2 Binary Heaps
  4.3 External Priority Queues
  4.4 Addressable Priority Queues

5 External Memory Algorithms
  5.1 Introduction
  5.2 The external memory model and things we already saw
  5.3 The Stxxl library
  5.4 Time-Forward Processing
  5.5 Cache-oblivious Algorithms
    5.5.1 Matrix Transposition
    5.5.2 Searching Using Van Emde Boas Layout
    5.5.3 Funnel sorting
    5.5.4 Is the Model an Oversimplification?
  5.6 External BFS
    5.6.1 Introduction
    5.6.2 Algorithm of Munagala and Ranade
    5.6.3 An Improved BFS Algorithm with sublinear I/O
    5.6.4 Improvements in the previous implementations of MR BFS and MM BFS R
    5.6.5 A Heuristic for maintaining the pool
  5.7 Maximal Independent Set
  5.8 Euler Tours
  5.9 List Ranking

6 van Emde Boas Trees
  6.1 From theory to practice
  6.2 Implementation
  6.3 Experiments

7 Shortest Path Search
  7.1 Introduction
  7.2 "Classical" and other Results
  7.3 Highway Hierarchy
    7.3.1 Introduction
    7.3.2 Hierarchies and Contraction
    7.3.3 Query
    7.3.4 Experiments
  7.4 Transit Node Routing
    7.4.1 Computing Transit Nodes
    7.4.2 Experiments
    7.4.3 Complete Description of the Shortest Path
  7.5 Dynamic Shortest Path Computation
    7.5.1 Covering Nodes
    7.5.2 Static Highway-Node Routing
    7.5.3 Construction
    7.5.4 Query
    7.5.5 Analogies To and Differences From Related Techniques
    7.5.6 Dynamic Multi-Level Highway Node Routing
    7.5.7 Experiments

8 Minimum Spanning Trees
  8.1 Definition & Basic Remarks
    8.1.1 Two important properties
  8.2 Classic Algorithms
    8.2.1 Excursus: The Union-Find Data Structure
  8.3 QuickKruskal
  8.4 The I-Max-Filter algorithm
  8.5 External MST
    8.5.1 Semiexternal Algorithm
    8.5.2 External Sweeping Algorithm
    8.5.3 Implementation & Experiments
  8.6 Connected Components

9 String Sorting
  9.1 Introduction
  9.2 Multikey Quicksort
  9.3 Radix Sort

10 Suffix Array Construction
  10.1 Introduction
  10.2 The DC3 Algorithm
  10.3 External Suffix Array Construction

11 Presenting Data from Experiments
  11.1 Introduction
  11.2 The Process
  11.3 Tables
  11.4 Two-dimensional Figures
  11.5 Grids and Ticks
  11.6 Three-dimensional Figures
  11.7 The Caption
  11.8 A Check List

12 Appendix
  12.1 Used machine models
  12.2 Amortized Analysis for Unbounded Arrays
  12.3 Analysis of Randomized Quicksort
  12.4 Insertion Sort
  12.5 Lemma on Interval Maxima
  12.6 Random Permutations without additional I/Os
  12.7 Proof of Discarding Theorem for Suffix Array Construction
  12.8 Pseudocode for the Discarding Algorithm
Chapter 1
What is Algorithm Engineering?
1.1 Introduction
Algorithms (including data structures) are at the heart of every computer application and are therefore of crucial importance for large areas of engineering, business, science, and everyday life. Algorithmics is concerned with the systematic development of efficient algorithms and thus plays a decisive part in the effective development of reliable and resource-efficient technology. We name only a few particularly spectacular examples here. Fast search over the enormous data volumes of the Internet (e.g., with Google) has changed the way we handle knowledge and information. It was made possible by full-text search algorithms that can fish all hits out of terabytes of data within fractions of a second, and by ranking algorithms that process graphs with billions of nodes in order to filter relevant answers out of the flood of hits. Less visible, but similarly important, are algorithms for efficiently distributing very frequently accessed data under massive load fluctuations or even overload attacks (distributed denial of service attacks). The market leader in this field, Akamai, was founded by algorithmicists. One of the most important scientific events of recent years was the publication of the human genome. A decisive factor in its early publication was the design of the sequencing process (whole genome shotgun sequencing) used by the company Celera, a design motivated by algorithmic considerations. Algorithmics did not restrict itself here to processing the data produced by natural scientists, but exerted a formative influence on the whole process. The list of areas in which sophisticated algorithms play a key role could be continued almost indefinitely: computer graphics, image processing, geographic information systems, cryptography, production planning, logistics, traffic planning... So how does the transfer of algorithmic innovation into application areas actually work?
Figure 1.1: Two views of algorithmics. Left: traditional. Right: AE = algorithmics as a cycle of design, analysis, implementation, and experimental evaluation of algorithms, driven by falsifiable hypotheses.

Traditionally, algorithmics has used the methodology of algorithm theory, which originates in mathematics: algorithms are designed for simple, abstract problem and machine models. The main results are provable performance guarantees for all possible inputs. In many cases this approach leads to elegant, timeless solutions that can be adapted to many applications. The hard performance guarantees reliably yield high efficiency, even for types of inputs that were unknown at implementation time. From the viewpoint of algorithm theory, taking up and implementing an algorithm is part of application development. By common observation, however, this kind of result transfer is a very slow process. With growing demands on innovative algorithms, this leads to growing gaps between theory and practice: through parallelism, pipelining, memory hierarchies, and so on, real hardware moves ever further away from simple machine models. Applications grow ever more complex. At the same time, algorithm theory develops ever more sophisticated algorithms that do contain important ideas but are sometimes hardly implementable. In addition, real inputs often have little to do with the worst-case scenarios of theoretical analysis. In the extreme case, promising algorithmic approaches are neglected because a complete analysis would be mathematically too difficult. Since the beginning of the 1990s, a broader view of algorithmics has therefore been gaining importance. It is called algorithm engineering (AE), and in it design, analysis, implementation, and experimental evaluation of algorithms stand side by side as equals.
Compared to algorithm theory, the richer set of methods, the inclusion of real software, and the closer connection to applications promise more realistic algorithms, the bridging of the gaps that have opened between theory and practice, and a faster transfer of algorithmic know-how into applications. Figure 1.1 shows this view of algorithmics as AE and its division into eight closely interacting activities. The goals and work program of the priority program follow naturally from it: to bring the full power of the AE methodology to bear, with the aim of bridging the gaps between theory and practice.
1. Study of realistic models for machines and algorithmic problems.

2. Design of algorithms that are simple and also efficient in practice.

3. Analysis of practicable algorithms in order to establish performance guarantees that bring theory and practice closer together.

4. Careful implementations that narrow the gaps between the best theoretical algorithm and the best implemented algorithm.

5. Systematic, reproducible experiments that serve to refute or support meaningful, falsifiable hypotheses arising from design, analysis, or earlier experiments. Often the task will be, for example, to compare algorithms whose theoretical analysis leaves too many questions open.

6. Development and extension of algorithm libraries that speed up application development and make algorithmic know-how broadly available.

7. Collection of large and realistic problem instances and development of benchmarks.

8. Use of algorithmic know-how in concrete applications.
1.2 State of Research and Open Problems
In the following, we describe the methodology of AE by means of examples.
Case Study: Route Planning in Road Networks

Everyone knows this increasingly important application: you enter a start and a destination into a navigation system and wait for it to output the fastest route. Here, AE has recently developed solutions that compute optimal routes within fractions of a second, whereas commercial products, despite considerably longer computation times, have so far been unable to give quality guarantees and are occasionally far off the mark. At first glance, the application model is a classical and well-studied problem from graph theory: shortest paths in graphs.
The old textbook solution, Dijkstra's algorithm, however, would have response times of minutes on a high-performance server and would be hopelessly slow on weaker mobile hardware with limited main memory. Commercial route planners therefore resort to heuristics that achieve acceptable response times but do not always find the best route. At second glance, a refined problem model suggests itself, one that allows information to be precomputed and then used for many queries. Theory waves this aside and proves that only an impractically large precomputed data structure can speed up the computation of fastest routes in arbitrary graphs. Real road graphs, however, have properties that make the precomputation idea practicable. The effectiveness of these approaches depends on hypotheses about the properties of road graphs, such as "far away from start and destination, the search may be restricted to trans-regional roads" or "roads that lead away from the destination may be ignored". Such intuitive formulations must then be formalized in a way that allows algorithms with performance guarantees to be derived from them. Ultimately, however, these hypotheses can only be verified by experiments with implementations that use realistic road graphs. The latter is difficult in practice, since many companies are reluctant to release data to researchers. A freely available graph of the USA, constructed from data on the web and now to be used for a DIMACS Implementation Challenge on route planning, is therefore particularly valuable. The experiments expose weak points, which in turn lead to the design of improved algorithms. For example, it turned out that even a few long-distance ferry connections drive up the precomputation effort of the first version of the algorithm enormously. Despite the successes, many questions remain open. Can the heuristics also be analyzed theoretically in order to arrive at more general performance guarantees? How does the precomputation idea cope with application requirements such as changes to the road network, roadworks, traffic jams, or users with different objective functions? How can the complex memory hierarchies of mobile devices be taken into account?
Models

An important aspect of AE are machine models. In principle they concern all applications, and they are the interface between algorithmics and the rapid technological development of ever more complex hardware. Because of its great simplicity, the strictly sequential von Neumann machine model with uniform memory is still the basis of most algorithmic work. This becomes a problem above all when processing large volumes of data, since memory access times vary by many orders of magnitude depending on whether the access goes to a processor's fastest cache, to main memory, or to the hard disk.
So far, algorithmics has mostly restricted memory hierarchies to two layers (the I/O model). This model is very successful, and a multitude of results for it are known. Often, however, large gaps remain between the best known algorithms and the implemented methods. Libraries for secondary-memory algorithms such as STXXL promise to improve this situation. Recently, though, there has been growing interest in further, still simple, models for processing large volumes of data, e.g., simple models for multi-level memory hierarchies, data stream models in which the data arrive over a network, or sublinear-time algorithms that do not even have to touch all of the data. On other complex properties of modern processors, such as the replacement mechanisms of hardware caches or branch prediction, only isolated results exist so far. We expect research on parallel algorithms to see a renaissance in the near future, since with the spread of multithreading, multi-core CPUs, and clusters, parallel processing is entering the mainstream of data processing. The traditional "flat" models of parallel processing are of limited use here, however, because alongside the memory hierarchy there is a hierarchy of more or less tightly coupled processing units.
Design

A crucial component of AE is the development of implementable algorithms that promise efficient execution in realistic situations. Ease of implementation means above all simplicity, but also opportunities for code reuse. In algorithm theory, efficient execution means good asymptotic running time and hence good scaling behavior for very large inputs. In AE, however, constant factors and the exploitation of easy problem instances matter as well. One example: the secondary-memory algorithm for computing minimum spanning trees was the first algorithm to solve a nontrivial graph problem with billions of nodes on a single PC. Theoretically it is suboptimal, because it needs a factor of O(log(m/M)) more disk accesses than the theoretically best algorithm (where m is the number of edges of the input graph and M the size of the machine's main memory). On sensibly configured machines, however, it needs, now and in the foreseeable future, at most a third of the disk accesses of the asymptotically best known algorithms. Given a priority queue for secondary memory, as in STXXL, the pseudocode of the algorithm takes twelve lines and the analysis of its expected running time seven.
Analysis

Even simple algorithms that have proven themselves in practice are often hard to analyze, and this is a main reason for gaps between practice and algorithm theory. The analysis of such algorithms is therefore an important aspect of AE. For example, randomized algorithms are often much simpler and faster than the best known deterministic algorithms, yet even simple randomized algorithms are often hard to analyze. Many complex optimization problems are solved with metaheuristics such as (randomized) local search or genetic programming. Algorithms designed this way are simple and can be adapted flexibly to the problem at hand. So far, however, only very few of these algorithms have been analyzed, although performance guarantees would be of great theoretical and practical interest. A famous example of local search is the simplex algorithm for linear programming, perhaps the most practically important algorithm in mathematical optimization. Simple variants of the simplex algorithm need exponential time on special, constructed inputs. It is conjectured, however, that variants exist which run in polynomial time; in practice, at any rate, a linear number of iterations suffices. So far, only subexponential expected running-time bounds are known, and only for impractical variants. Spielman and Teng were able to show, however, that even small random perturbations of the coefficients of an arbitrary linear program suffice to make the expected running time of the simplex algorithm polynomial. This concept of smoothed analysis is a generalization of average-case analysis and an interesting tool of AE beyond the simplex algorithm as well. For example, Beier and Vöcking showed for an important family of NP-hard problems that their smoothed complexity is polynomial. Among other things, this result explains why the NP-hard knapsack problem can be solved efficiently in practice, and it has also led to improvements of the best codes for knapsack problems. There are also close connections between smoothed complexity, approximation algorithms, and pseudo-polynomial algorithms, which are likewise an interesting approach to solving NP-hard problems in practice.
Implementation

Implementation only appears to be the most clearly laid-out and most boring step in the AE cycle. One reason is the large semantic gaps between abstractly formulated algorithms, imperative programming languages, and real hardware. An extreme example of this semantic gap are the many geometric algorithms that are designed under the assumption of exact arithmetic with real numbers and without explicit treatment of degenerate cases.
The robustness of geometric algorithms can therefore be regarded as a branch of AE in its own right. Even implementations of relatively simple basic algorithms can be very demanding, since one often has to compare several candidates whose running times differ only by small constant factors. The only reliable way is then to push every contender to its limits, because even small implementation details can grow into a factor of two in running time. Even a comparison of the generated machine code may be called for to settle doubtful cases. Often, it is only implementations of algorithms that deliver the final evidence of their correctness or of the quality of their results. In geometry and for graph problems, a graphical output of the results is usually produced quite naturally, so that shortcomings of an algorithm, or even errors, become visible immediately. For example, for embedding a planar graph, a paper by Hopcroft and Tarjan1 was cited for 20 years. It contains, however, only a vague description of how a planarity-testing algorithm can be extended for this purpose. Several attempts at a more detailed description were erroneous, which was only noticed when the first correct implementations were produced. For a long time, nobody managed to implement a famous algorithm2 for computing triconnected components (an important tool in graph drawing and in signal processing). Only during an implementation effort in the year 2000 were the errors in the algorithm identified and corrected. There are very many interesting algorithms for important problems that have never been implemented, for example the asymptotically best algorithms for many flow and matching problems, most algorithms for multi-level memory hierarchies (cache-oblivious algorithms), and geometric algorithms that use cuttings or ε-nets.
Experiments

Meaningful experiments are the key to closing the cycle of the AE process. For example, experiments3 on crossing minimization in graph drawing brought a new quality to this area. All previous studies had worked with relatively dense graphs and shown that the achieved crossing numbers came quite close to the respective theoretical upper bounds.
1 J. Hopcroft and R. E. Tarjan: Efficient planarity testing. Journal of the ACM, 21(4):549-568, 1974.
2 J. E. Hopcroft and R. E. Tarjan: Dividing a graph into triconnected components. SIAM J. Comput., 2(3):135-158, 1973.
3 M. Jünger and P. Mutzel: 2-layer straightline crossing minimization: Performance of exact and heuristic algorithms. Journal of Graph Algorithms and Applications (JGAA), 1(1):1-25, 1997.
The experiments mentioned here, by contrast, also used optimal algorithms and the sparse graphs that matter in practice. It turned out that the results of some heuristics exceed the optimal crossing number by a large factor. This paper is by now among the most cited works in the field of graph drawing. Experiments can also have a decisive influence on algorithm analysis: reconstructing a curve from a set of measured points is the most basic variant of an important family of image-processing problems. A paper by Althaus and Mehlhorn4 investigates a seemingly rather costly method based on the traveling salesman problem. Experiments showed that "reasonable" inputs lead to easily solvable instances of the traveling salesman problem. This observation was subsequently formalized and proved. Compared to the natural sciences, AE is in the privileged position of being able to run a large number of experiments quickly and comparatively cheaply. The flip side of this coin is the highly nontrivial planning, evaluation, archiving, preparation, and interpretation of the results. The starting point should be falsifiable hypotheses about the behavior of the algorithms under investigation, derived from design, analysis, implementation, or earlier experiments. The outcome is a refutation, confirmation, or refinement of these hypotheses. Complementing provable performance guarantees, such hypotheses not only lead to a better understanding of the algorithms, but also yield ideas for better algorithms, more accurate analyses, or more efficient implementations. Successful experimentation has much to do with software engineering. A modular structure of the implementations enables flexible experiments. Skillful use of tools eases evaluation. Careful documentation and version management support reproducibility, a central requirement of scientific experiments that is a major challenge given the rapid model changes of software and hardware.
Problem Instances

Collections of realistic problem instances for benchmarking have proven to be decisive for the further development of algorithms. For example, there are interesting collections for several NP-hard problems such as the traveling salesman problem, the Steiner tree problem, satisfiability, set covering, and graph partitioning. Especially for the first two problems, these have led to astonishing breakthroughs.
4 E. Althaus and K. Mehlhorn: Traveling salesman-based curve reconstruction in polynomial time. SIAM Journal on Computing, 31(1):27-66, 2002.
With the help of deep mathematical insights into the structure of these problems, even large realistic instances of the traveling salesman problem and the Steiner tree problem can be solved exactly. Oddly enough, realistic problem instances are much harder to obtain for polynomially solvable problems. For example, there are dozens of practical applications of maximum-flow computations, but algorithm development has so far had to make do with synthetic instances.
Applications

Algorithmics plays a key role in the development of innovative IT applications, and application-oriented AE projects of all kinds are accordingly a very important part of the priority program. Here we name only a few grand-challenge applications in which algorithmics could play an important role and which have a particular potential to influence science, technology, the economy, or daily life.5
Bioinformatics

Besides the already mentioned problem of genome sequencing, microbiology holds many further algorithmic challenges: computing the tertiary structures of proteins; algorithms for computing phylogenetic trees of species; data mining in the gene-activation data that are being produced on a large scale with DNA chips... These problems can only be solved in close cooperation with molecular biologists or chemists.
Information Retrieval

The indexing and ranking algorithms of Internet search engines mentioned at the beginning are very successful, but still leave much to be desired. Many heuristics are barely published, let alone equipped with performance guarantees. Only smaller systems so far make a serious attempt to support similarity search, and an arms race is emerging between ranking algorithms and the spammers who try to deceive them.
Traffic Planning

The use of algorithms in traffic planning has only just begun. Apart from individual applications in air traffic, where problem instances are comparatively small and the savings potential is large, these applications are confined to relatively simple, isolated areas: data acquisition (networks, road categories, travel times), monitoring and partial control (Toll Collect, railway control centers), prognosis (simulation, forecasting models), and simple user support (route planning, timetable queries).
5 That notwithstanding, no single subproject is of course expected to bring the breakthrough on a grand challenge, and many subprojects will deal with less spectacular but equally interesting applications.
AE can contribute substantially to developing and integrating these various aspects further, toward powerful algorithms for better planning and control of our traffic systems (control via tolls, timetable optimization, line planning, vehicle and crew scheduling). Particular challenges are the very complex application models and the resulting huge problem sizes.
Geographic Information Systems

Modern earth-observation satellites and other data sources now produce many terabytes of information every day, promising important applications in agriculture, environmental protection, disaster control, tourism, and so on. Processing such enormous volumes of data effectively is a real challenge, in which know-how from geometric algorithms, parallel processing, and memory hierarchies, together with AE on real input data, will play an important role.
Communication Networks

As networks become ever more versatile and larger, the need for efficient methods for organizing them grows at the same rate. Of particular interest are mobile, ad-hoc, and sensor networks, as well as peer-to-peer networks and the coordination of competing agents using game-theoretic techniques. What all these novel applications have in common is that they must function without central planning and organization. Many of the questions investigated here can be called not-yet applications. From the AE perspective, it is particularly interesting that even practical work here has no reliable data on the size and peculiarities of the eventual application scenario. On the one hand, this creates an even greater need for provable performance guarantees. On the other hand, the models of many theoretical works in this area are even further from reality than usual.
Scheduling Problems

Schedules in production and logistics keep getting tighter, and the need for algorithmic support and optimization grows. First contributions from algorithmics are online algorithms (dial-a-ride, scheduling) and flows over time (routing with time windows, dynamic flows), but this development is only in its beginnings. For meaningful quality statements about online algorithms, competitive analysis in particular needs to be reconsidered, since it is oriented too strongly toward coarse worst-case behavior. Flows over time call for better techniques for handling the dimension of time as efficiently as possible in algorithms.
Chapter 2
Data Structures
Most material in this chapter was taken from an as yet unpublished book manuscript by Peter Sanders and Kurt Mehlhorn. Some parts on external data structures were presented in [7]. Notice that during the lecture, the latter topics were covered in the talk on external algorithms, not in the introduction on data structures. If you are unfamiliar with external memory models, please read the introduction in 5.2 or the short overview in the appendix 12.1.
2.1 Arrays & Lists
For starters, we will study how algorithm engineering can be applied to the (apparently?) easy field of sequence data structures.

Bounded Arrays: Usually the most basic, built-in sequence data structure in programming languages. They offer constant running time for [·], pushBack, and popBack; the latter two add or remove an element behind the currently last entry. Their major drawback is that their size has to be known in advance in order to reserve enough memory.

Unbounded Arrays: To bypass this often inconvenient restriction, unbounded arrays are introduced (std::vector from the C++ STL is an example). They are implemented on top of a bounded array. If this array runs out of space for new elements, a new array of double size is allocated and the old content copied. If pop operations reduce the filling degree to a quarter, the array is replaced by a new one of half the size. Implemented this way, pushBack and popBack have amortized costs of O(1); a proof is given in the appendix in 12.2. Note that it is not possible to already shrink the array when it is half full, since repeated insertions and deletions at that point would lead to costs of O(n) per operation.
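A minimal C++ sketch of this growing and shrinking policy (grow to double size when full, shrink to half size when only a quarter is used); the class name and layout are our own, and copy control is omitted for brevity:

#include <algorithm>
#include <cstddef>

template <typename T>
class UArray {
    T* b = new T[1];       // the bounded array underneath
    std::size_t n = 0;     // number of used entries
    std::size_t cap = 1;   // allocated size

    void reallocate(std::size_t newCap) {
        T* nb = new T[newCap];
        std::copy(b, b + n, nb);   // copy the old content
        delete[] b;
        b = nb;
        cap = newCap;
    }
public:
    ~UArray() { delete[] b; }
    void pushBack(const T& x) {
        if (n == cap) reallocate(2 * cap);     // full: double the size
        b[n++] = x;
    }
    void popBack() {
        --n;
        if (cap > 1 && 4 * n <= cap)           // quarter full: halve the size
            reallocate(cap / 2);
    }
    T& operator[](std::size_t i) { return b[i]; }
    std::size_t size() const { return n; }
};

Both operations run in amortized constant time, exactly as the analysis in 12.2 shows.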
Doubly Linked Lists1: Figure 2.1 shows the basic building block of a linked list.

1 Sometimes singly linked lists (maintaining only a successor pointer) are sufficient and more space efficient. As they have non-intuitive semantics on some operations and are less versatile, we focus on doubly linked lists.
Class Item of Element
  e : Element
  next : Handle
  prev : Handle
  invariant next→prev = prev→next = this

Figure 2.1: Prototype of a segment in a doubly linked list
Figure 2.2: Structure of a doubly linked list

A list item (a link of a chain) stores one element and pointers to its successor and predecessor. This sounds simple enough, but pointers are so powerful that we can make a big mess if we are not careful. What makes a consistent list data structure? We make a simple and innocent looking decision, and the basic design of our list data structure follows from it: the successor of the predecessor of an item must be the original item, and the same holds for the predecessor of a successor. If all items fulfill this invariant, they form a collection of cyclic chains. This may look strange, since we want to represent sequences rather than loops. Sequences have a start and an end, whereas loops have neither. Most implementations of linked lists therefore go a different way and treat the first and last item of a list differently. Unfortunately, this makes the implementation of lists more complicated, more error-prone, and somewhat slower. Therefore, we stick to the simple cyclic internal representation. For conciseness, we implement all basic list operations in terms of the single operation splice depicted in Figure 2.3. splice cuts a sublist out of one list and inserts it after some target item. The target can be either in the same list or in a different list, but it must not be inside the sublist. splice can easily be specialized to common methods like insert, delete, and so on. Since splice never changes the number of items in the system, we assume that there is one special list freeList that keeps a supply of unused items. When inserting new elements into a list, we take the necessary items from freeList, and when deleting elements
we return the corresponding items to freeList. The function checkFreeList allocates memory for new items when necessary. A freeList is not only useful for the splice operation; it also simplifies our memory management, which could otherwise easily take 90% of the work, since a malloc would be necessary for every inserted element2. It remains to decide how to simulate the start and end of a list. The class List in Figure 2.2 introduces a dummy item h that does not store any element but separates the first element from the last element in the cycle formed by the list. By definition of Item, h points to the first "proper" item as its successor and to the last item as its predecessor. In addition, a handle head pointing to h can be used to encode a position before the first element or after the last element. Note that there are n + 1 possible positions for inserting an element into a list with n elements, so an additional item is hard to circumvent if we want to code handles as pointers to items. With these conventions in place, a large number of useful operations can be implemented as one-line functions that all run in constant time. Thanks to the power of splice, we can even manipulate arbitrarily long sublists in constant time. The dummy header can also be useful for other operations. For example, consider the following code for finding the next occurrence of x starting at item from; if x is not present, head should be returned. We use the header as a sentinel. A sentinel is a dummy element in a data structure that makes sure that some loop will terminate. By storing the key we are looking for in the header, we make sure that the search terminates even if x is originally not present in the list. This trick saves us an additional test in each iteration of whether the end of the list has been reached. A drawback of dummy headers is that they require additional space. This seems negligible for most applications but may be costly for many nearly empty lists. A typical scenario of this kind is a hash table using chaining to resolve collisions.
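The listing referred to above did not survive in these notes; what follows is a minimal C++ sketch of the sentinel trick under our own naming and layout (a raw-pointer Item as in Figure 2.1, with the dummy header h passed explicitly), not the manuscript's actual code:

template <typename T>
struct Item {          // one link of the cyclic chain
    T e;
    Item* next;
    Item* prev;
};

// Find the next occurrence of x, starting at from; returns the header h
// if x is not in the list.
template <typename T>
Item<T>* findNext(const T& x, Item<T>* from, Item<T>* h) {
    h->e = x;                  // sentinel: guarantees the loop terminates
    while (from->e != x)       // a single test per iteration
        from = from->next;
    return from;               // == h iff x was not present
}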
2.2 External Lists
The direct implementation of a linked list in an external memory model costs 1 I/O whenever a link is followed, which leads to Θ(N) I/Os for traversing N elements. This is caused by the high degree of freedom in the allocation of list elements within memory3. A first idea to improve this is to introduce locality by requiring that B consecutive elements be stored together. Traversal then takes only N/B = O(scan(N)) I/Os, but an insertion or deletion can cost Θ(N/B) I/Os for moving all following elements. We therefore relax the invariant to at least 2B/3 elements in every pair of consecutive blocks. Traversal is still possible with at most 3N/B = O(scan(N)) I/Os. For an insertion into block i, we have to distinguish two cases: if block i has space, we pay 1 I/O and are done. If it is full but a neighbor has space, we push an element to it for O(1) I/Os.
2 Another countermeasure against allocation overhead is to schedule many insertions at the same time, resulting in only one malloc and possibly fewer cache faults, as many items then reside in the same memory block.
3 A faster traversal is possible if we use list ranking (see 5.9) as preprocessing, which can be done in O(sort(N)) I/Os. Sorting with respect to each element's rank (distance from the last node) then gives a scannable representation of the list.
//Remove ⟨a, ..., b⟩ from its current list and insert it after t
//(..., a′, a, ..., b, b′, ..., t, t′, ...) ↦ (..., a′, b′, ..., t, a, ..., b, t′, ...)
Procedure splice(a, b, t : Handle)
  assert b is not before a ∧ t ∉ ⟨a, ..., b⟩
  // cut out ⟨a, ..., b⟩
  a′ := a→prev
  b′ := b→next
  a′→next := b′
  b′→prev := a′
  // insert ⟨a, ..., b⟩ after t
  t′ := t→next
  b→next := t′
  a→prev := t
  t→next := a
  t′→prev := b

Figure 2.3: The splice method
Figure 2.4: The direct implementation of linked lists is not suited for external memory.

If both neighbors are full, we split block i into two blocks of ≈ B/2 elements each, at an (amortized) cost of O(1) I/Os (at least B/6 deletions are needed before the invariant can be violated again). For a deletion from block i: if blocks i and i+1 or blocks i and i−1 together hold ≤ 2B/3 elements, we merge the two blocks, again at an (amortized) cost of O(1) I/Os.
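To get a feel for the savings, plug in some illustrative numbers (ours, not from the text): for N = 10^9 elements and blocks of B = 10^6 elements, a naive pointer-based traversal may incur up to 10^9 I/Os, while the blocked representation needs at most

    3N/B = 3 · 10^9 / 10^6 = 3000 = O(scan(N)) I/Os,

almost six orders of magnitude fewer.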
Figure 2.5: First approach: block B consecutive list elements together
Figure 2.6: Second approach: block ≥ 2B/3 consecutive list elements together

                 S-List        B-Array      U-Array
dynamic          +             −            +
space wasting    pointers      too large?   too large?
time wasting     set free?,    +            resizing
                 cache miss
worst case time  (+)           +            −

Table 2.1: Pros and cons for implementation variants of a stack
2.3 Stacks, Queues & Variants
We now want to use these general sequence types to implement another important data structure: a stack with operations push (insert at the end of the sequence) and pop (return and remove the last element), both of which we want to implement with constant costs. Let us examine the alternatives: a bounded array is only feasible if you can give a tight limit for the number of inserted elements; otherwise, you have to allocate much memory in advance to avoid running out of space. A linked list comes with nontrivial memory management and a lot of cache faults (when every successor is in a different memory block). An unbounded array has no constant cost guarantee for a single operation and can consume up to twice the actually required space. So none of the basic data structures comes without major drawbacks. For an optimal solution, we need to take a hybrid approach: a hybrid stack is a linked list containing bounded arrays of size B. When the current array is full, another one is allocated and linked. We now have a dynamic data structure
Figure 2.7: A hybrid stack
Figure 2.8: A variant of the hybrid stack

with (small) constant worst-case access time4 at the back pointer. We give up at most n/B + B wasted space (for the pointers and one empty block), which is minimized for B = Θ(√n). A variant of this stack works as follows: instead of having each block maintain a pointer to its successor, we use a directory (implemented as an unbounded array) containing these pointers. Together with two additional references to the current directory entry and the current position in the last block, we obtain the functionality of a stack. Additionally, it is now easy to implement [·] in constant time using integer division and modulo arithmetic. The drawback of this approach is a non-constant worst-case insertion time (although we still have constant amortized costs). There are further specialized data structures that can be useful for certain algorithms: a FIFO queue allows insertion at one end and extraction at the other. FIFO queues are easy to implement with singly linked lists carrying a pointer to the last element. For bounded queues, we can also use cyclic arrays where entry zero is the successor of the last entry. It then suffices to maintain two indices h and t delimiting the range of valid queue entries; these indices travel around the cycle as elements are queued and dequeued. The cyclic semantics of the indices can be implemented using arithmetic modulo the array size.5 Our implementation always leaves one entry of the array empty, because otherwise it would be difficult to distinguish a full queue from an empty one. Bounded queues can be made unbounded using similar techniques as for unbounded arrays. Finally, deques, which allow read/write access at both ends, cannot be implemented efficiently using singly linked lists, but the array-based FIFO just described is easy to generalize: a circular array can also support access via [·], interpreting [i] as [i + h mod n]. With techniques from both the hybrid stack variant and the cyclic FIFO queue, we can derive a data structure with constant costs for random accesses and costs O(√n) for insertion/deletion at arbitrary positions.
4 Although inserting at the end of the current array is still costlier.
5 On some machines one might get significant speedups by choosing the array size as a power of two and replacing mod by bit operations.
Figure 2.9: A variant of the hybrid stack

Instead of bounded arrays, we have our directory point to cyclic arrays. Random access works as above. For an insertion at an arbitrary location, we shift those elements in the corresponding cyclic array that follow the new element's position. If the array was full, there is no room for the last element, so it is propagated to the next cyclic array. There it replaces the last element (which can travel further in the same way), and the indices are rotated by one, giving the new element index 0. In the worst case, we have B elements to move in the first array and constant-time operations for the other n/B subarrays. This is again minimized for B = Θ(√n). Another specialized variant we can develop is an I/O-efficient stack6: we use two buffers of size B in main memory and a pointer to the end of the stack. When both buffers are full, we write the one containing the older elements to disk and use the freed room for new insertions. When both buffers run empty, we refill one of them with a block from disk. This leads to amortized I/O costs of O(1/B) per operation. Mind that a single buffer would not suffice: a sequence of B insertions followed by alternating insertions and deletions would incur 1 I/O per operation.
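A compact C++ sketch of the two-buffer scheme; the "disk" is mocked by an in-memory map of blocks so that the example is self-contained, and the tiny block size is for demonstration only:

#include <cstddef>
#include <map>
#include <vector>

class ExternalStack {
    static constexpr std::size_t B = 4;           // block size (tiny for demo)
    std::vector<int> buf;                         // up to 2B elements in RAM
    std::map<std::size_t, std::vector<int>> disk; // stand-in for disk blocks
    std::size_t blocks = 0;                       // blocks currently on "disk"
public:
    void push(int x) {
        if (buf.size() == 2 * B) {                // both buffers full:
            // spill the B older elements as one block, keep the newer B
            disk[blocks++] = std::vector<int>(buf.begin(), buf.begin() + B);
            buf.erase(buf.begin(), buf.begin() + B);
        }
        buf.push_back(x);
    }
    int pop() {                                   // assumes a non-empty stack
        if (buf.empty()) {                        // both buffers empty:
            buf = disk[--blocks];                 // refill one buffer
            disk.erase(blocks);
        }
        int x = buf.back();
        buf.pop_back();
        return x;
    }
};

Each block transfer moves B elements and can only be triggered once per B stack operations, which is exactly the O(1/B) amortized bound from the text.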
6 12.1 gives an introduction to our external memory model.
Table 2.2 summarizes some of the results of this chapter by comparing the running times of common operations on the presented data structures. Predictably, arrays are better at indexed access, whereas linked lists have their strengths at sequence manipulation at arbitrary positions. However, both basic approaches implement the special operations needed for stacks and queues roughly equally well. Where both approaches work, arrays are more cache efficient, whereas linked lists provide worst-case performance guarantees. This is particularly true for all kinds of operations that scan through the sequence; findNext is only one example.

Operation     List  UArray  hybr. Stack  hybr. Array  cycl. Array   explanation of '∗'
[·]           n     1       n            1            1
| · |         1∗    1       1            1            1             not with inter-list splice
first         1     1       1            1            1
last          1     1       1            1            1
insert        1     n       n            √n           n
remove        1     n       n            √n           n
pushBack      1     1∗      1            1∗           1∗            amortized
pushFront     1     n       n            n            1∗            amortized
popBack       1     1∗      1            1∗           1∗            amortized
popFront      1     n       n            n            1∗            amortized
concat        1     n       n            n            n
splice        1     n       n            n            n
findNext,...  n     n∗      n∗           n∗           n∗            cache efficient

Table 2.2: Running times of operations on sequences with n elements. Entries have an implicit O(·) around them.
Chapter 3
Sorting
The findings on how branch mispredictions affect quicksort are taken from [1]. Super Scalar Sample Sort is described in [2], Multiway Merge Sort is covered in [3], the analysis of duality between prefetching and buffered writing is from [4].
3.1 Quicksort Basics
Sorting is one of the most important algorithmic problems both practically and theoret- ically. Quicksort is perhaps the most frequently used sorting algorithm since it is very fast in practice, needs almost no additional memory, and makes no assumptions on the distribution of the input.
Function quickSort(s : Sequence of Element) : Sequence of Element
  if |s| ≤ 1 then return s               // base case
  pick p ∈ s uniformly at random         // pivot key
  a := ⟨e ∈ s : e < p⟩                   // (A)
  b := ⟨e ∈ s : e = p⟩                   // (B)
  c := ⟨e ∈ s : e > p⟩                   // (C)
  return concatenation of quickSort(a), b, and quickSort(c)
Figure 3.1: Quicksort (high-level implementation)
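A direct C++ transcription of Figure 3.1, useful as a reference implementation; it is deliberately not tuned (the refined in-place version follows in Section 3.2):

#include <random>
#include <utility>
#include <vector>

std::vector<int> quickSort(std::vector<int> s) {
    if (s.size() <= 1) return s;                      // base case
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> pick(0, s.size() - 1);
    const int p = s[pick(rng)];                       // random pivot key
    std::vector<int> a, b, c;
    for (int e : s) {
        if (e < p) a.push_back(e);                    // (A)
        else if (e == p) b.push_back(e);              // (B)
        else c.push_back(e);                          // (C)
    }
    a = quickSort(std::move(a));
    c = quickSort(std::move(c));
    a.insert(a.end(), b.begin(), b.end());            // concatenate a, b, c
    a.insert(a.end(), c.begin(), c.end());
    return a;
}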
Analysis shows that Quicksort picking pivots randomly will perform an expected number of ≈ 1.4n log(n) comparisons1. A proof for this bound is given in the appendix
1 With other strategies for selecting a pivot, better constant factors can be achieved: e.g., "median of three" reduces the expected number of comparisons to ≈ 1.2n log(n).
Figure 3.2: Execution of both the high-level and the refined version of quickSort (Figure 3.1 and Figure 3.3) on ⟨3, 6, 8, 1, 0, 7, 2, 4, 5, 9⟩, using the first element of a subsequence as the pivot. The right block shows the first execution of the repeat loop for partitioning the input in qSort.

in 12.3. The worst case occurs if all elements are different and we are always so unlucky as to pick the largest or smallest element as the pivot; it results in Θ(n²) comparisons. As the number of executed instructions and cache faults is proportional to the number of comparisons, the comparison count is (at least in theory) a good measure for the total runtime of Quicksort.
3.2 Refined Quicksort
Figure 3.3 gives pseudocode for an array-based quicksort that works in-place and uses several implementation tricks that make it faster and very space efficient. To make a recursive algorithm compatible with the requirement of in-place sorting of an array, quicksort is called with a reference to the array and the range of array indices to be sorted. Very small subproblems, of size up to n0, are sorted faster using a simple algorithm like insertion sort2. The best choice for the constant n0 depends on many details of the machine and the compiler; usually one should expect values around 10-40. An efficient implementation of insertion sort is given in the appendix in 12.4. The pivot element is chosen by a function pickPivotPos that we have not specified here. The idea is to find a pivot that splits the input more accurately than just choosing a random element. A method frequently used in practice chooses the median ("middle") of three elements. An even better method would choose the exact median of a random sample of elements. The repeat-until loop partitions the subarray into two smaller subarrays. Elements
2 Some books propose to leave small pieces unsorted and clean up at the end using a single insertion sort that will be fast as the sequence is already almost sorted. Although this nice trick reduces the number of instructions executed by the processor, our solution is faster on modern machines because the subarray to be sorted will already be in cache.
//Sort the subarray a[ℓ..r]
Procedure qSort(a : Array of Element; ℓ, r : N)
  while r − ℓ ≥ n0 do                     // Use divide-and-conquer
    j := pickPivotPos(a, ℓ, r)
    swap(a[ℓ], a[j])                      // Helps to establish the invariant
    p := a[ℓ]
    i := ℓ; j := r
    repeat                                // a:  ℓ   i→        ←j   r
      invariant 1: ∀i′ ∈ ℓ..i−1 : a[i′] ≤ p
      invariant 2: ∀j′ ∈ j+1..r : a[j′] ≥ p
      invariant 3: ∃i′ ∈ i..r : a[i′] ≥ p
      invariant 4: ∃j′ ∈ ℓ..j : a[j′] ≤ p
      while a[i] < p do i++               // Scan over elements (A)
      while a[j] > p do j−−               // on the correct side (B)
      if i ≤ j then swap(a[i], a[j]); i++; j−−
    until i > j                           // Done partitioning
    if i < (ℓ + r)/2 then qSort(a, ℓ, j); ℓ := i
    else qSort(a, i, r); r := j
  insertionSort(a[ℓ..r])                  // faster for small r − ℓ

Figure 3.3: Refined quicksort
equal to the pivot can end up on either side or between the two subarrays. Since quicksort spends most of its time in this partitioning loop, its implementation details are important. Index variable i scans the input from left to right and j scans from right to left. The key invariant is that elements left of i are no larger than the pivot, whereas elements right of j are no smaller than the pivot. Loops (A) and (B) scan over elements that already satisfy this invariant. When a[i] ≥ p and a[j] ≤ p, scanning can be continued after swapping these two elements. Once indices i and j meet, the partitioning is completed. Now, a[ℓ..j] represents the left partition and a[i..r] represents the right partition. This sounds simple enough, but for a correct and fast implementation, some subtleties come into play. To ensure termination, we verify that no single piece represents all of a[ℓ..r], even if p is the smallest or largest array element. So, suppose p is the smallest element. Then loop A first stops at i = ℓ; loop B stops at the last occurrence of p. Then a[i] and a[j] are swapped (even if i = j) and i is incremented. Since i is never decremented, the right partition a[i..r] will not represent the entire subarray a[ℓ..r]. The case that p is the largest element can be handled by a symmetric argument. The scanning loops A and B are very fast because they make only a single test. At first glance, that looks dangerous. For example, index i could run beyond the right boundary r if all elements in a[i..r] were smaller than the pivot. But this cannot happen. Initially, the pivot is in a[i..r] and serves as a sentinel that can stop scanning loop A. Later, the elements swapped to the right are large enough to play the role of a sentinel. Invariant 3 expresses this requirement that ensures termination of scanning loop A. Symmetric arguments apply for invariant 4 and scanning loop B. Our array quicksort handles recursion in a seemingly strange way. It is something like "semi-recursive": the smaller partition is sorted recursively, while the larger partition is sorted iteratively by adjusting ℓ and r. This measure ensures that recursion can never go deeper than ⌈log(n/n0)⌉ levels. Hence, the space needed for the recursion stack is O(log n). Note that a completely recursive algorithm could reach a recursion depth of n − 1, so the space needed for the recursion stack could be considerably larger than for the input array itself.
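The depth bound follows from a one-line calculation: the recursive call always receives the smaller partition, hence at most half of the current elements, and recursion stops once a subproblem has at most n0 elements. A subproblem at recursion depth d therefore has size at most n/2^d, and

    n/2^d > n0  ⟹  d < log(n/n0),

so the recursion depth is bounded by ⌈log(n/n0)⌉.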
3.3 Lessons from experiments
We now run Quicksort on real machines to check whether it behaves differently than our analysis in the RAM model predicted. We will see that modern hardware architecture can influence the runtime, and we will try to find algorithmic solutions to these problems. In the analysis, we saw that the number of comparisons determines the runtime of Quicksort. On a real machine, a comparison and the corresponding if-clause are mapped to a branch instruction. In modern processors with long execution pipelines and superscalar execution, dozens of subsequent instructions are executed in parallel to achieve a high
peak throughput. To keep the pipeline filled, the outcome of each branch is predicted by the hardware (based on several possible heuristics). When a branch is mispredicted, much of the work already done on the instructions following the predicted branch direction turns out to be wasted. Therefore, ingenious and very successful schemes have been devised to accurately predict the direction a branch takes. Unfortunately, we are facing a dilemma here. Information theory tells us that the optimal number of n log n element comparisons for sorting can only be achieved if each element comparison yields one bit of information, i.e., if there is a 50% chance for the branch to take either direction. In this situation, even the most clever branch prediction algorithm is helpless. A painfully large number of branch mispredictions seems to be unavoidable.

[Plot omitted: seconds / n lg n versus n for random pivot, median of 3, exact median, skewed pivot n/10, and skewed pivot n/11.]

Figure 3.4: Runtime for Quicksort using different strategies for pivot selection

Figure 3.4 compares the runtime of Quicksort implementations using different strategies for selecting a pivot. Together with standard techniques (random, median of three, ...), α-skewed pivots are used, i.e., pivots whose rank is αn. Theory suggests large constant factors in execution time for these strategies with α ≠ 1/2, compared to a perfect median. In practice, Figure 3.4 shows that these implementations are actually faster than those that use an (approximated) median as the pivot. An explanation for this can be found in Figure 3.5: a pivot with rank close to n/2 produces many more branch mispredictions than a pivot that separates the sequence into two parts of very different sizes. The costs to flush the entire instruction pipeline outweigh
the fewer partition steps of these variants.

[Plot omitted: branch misses / n lg n versus n for the same five pivot-selection strategies.]

Figure 3.5: Number of branch mispredictions for Quicksort using different strategies for pivot selection
We now study a sorting algorithm that is aware of hardware phenomena like branch mispredictions and superscalar execution. This algorithm is called Super Scalar Sample Sort (SSSS); it is an engineered version of Sample Sort, which in turn is a generalization of Quicksort.
Function sampleSort(e = ⟨e_1, ..., e_n⟩ : Sequence of Element; k : Z) : Sequence of Element
  if n/k is "small" then return smallSort(e)   // base case, e.g. quicksort
  let ⟨S_1, ..., S_{ak−1}⟩ denote a random sample of e
  sort S        // or at least locate the elements whose rank is a multiple of a
  ⟨s_0, s_1, s_2, ..., s_{k−1}, s_k⟩ := ⟨−∞, S_a, S_{2a}, ..., S_{(k−1)a}, ∞⟩   // determine splitters
  for i := 1 to n do
    find j ∈ {1, ..., k} such that s_{j−1} < e_i ≤ s_j
    place e_i in bucket b_j
  return concatenate(sampleSort(b_1), ..., sampleSort(b_k))
Figure 3.6: Standard Sample Sort
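A small C++17 sketch of the sampling step of Figure 3.6 (the function name and parameter handling are ours; it requires |e| ≥ ak − 1):

#include <algorithm>
#include <iterator>
#include <limits>
#include <random>
#include <vector>

// Draw a random sample of size a*k - 1, sort it, and keep every a-th
// element as a splitter; s[0] and s[k] are the sentinels -inf and +inf.
std::vector<int> pickSplitters(const std::vector<int>& e, int k, int a,
                               std::mt19937& rng) {
    std::vector<int> sample;
    std::sample(e.begin(), e.end(), std::back_inserter(sample),
                a * k - 1, rng);                  // random sample of e
    std::sort(sample.begin(), sample.end());
    std::vector<int> s;
    s.push_back(std::numeric_limits<int>::min()); // s_0 = -infinity
    for (int i = 1; i < k; ++i)
        s.push_back(sample[i * a - 1]);           // rank is a multiple of a
    s.push_back(std::numeric_limits<int>::max()); // s_k = +infinity
    return s;
}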
Our starting point is ordinary sample sort. Fig. 3.6 gives high-level pseudocode. Small inputs are sorted using some other algorithm like quicksort. For larger inputs, we first take a sample of s = ak randomly chosen elements. The oversampling factor a allows a flexible tradeoff between the overhead of handling the sample and the accuracy of splitting. Our splitters are those elements whose rank in the sample is a multiple of a. Now each input element is located among the splitters and placed into the corresponding bucket. The buckets are sorted recursively, and their concatenation is the sorted output. A first advantage of Sample Sort over Quicksort is its log_k n recursion levels, a factor of log_2 k fewer than Quicksort's recursion depth of log_2 n. Every element is moved once during each level, resulting in fewer cache faults for Sample Sort. However, this alone does not resolve the central issue of branch mispredictions and only comes to bear for very large inputs. SSSS is an implementation strategy for the basic sample sort algorithm. All sequences are represented as arrays. More precisely, we need two arrays of size n: one for the original input and one for temporary storage. The flow of data between these two arrays alternates between levels of recursion. If the number of recursion levels is odd, a final copy operation makes sure that the output is in the same place as the input. Using an array of size n to accommodate all buckets means that we need to know exactly how big each bucket is. In radix sort implementations this is done by locating each element twice, but that would be prohibitive in a comparison-based algorithm. Therefore we use an additional auxiliary array o of n oracles: o(i) stores the bucket index of element e_i. A first pass computes the oracles and the bucket sizes. A second pass reads the elements again and places element e_i into bucket b_{o(i)}. This two-pass approach incurs costs in space and time. However, these costs are rather small, since bytes suffice for the oracles and the additional memory accesses are sequential and can thus almost completely be hidden via software or hardware prefetching3. In exchange we get simplified memory management and no need to test for bucket overflows. Perhaps more importantly, decoupling the expensive tasks of finding buckets and distributing elements to buckets facilitates software pipelining by the compiler and prevents cache interference between the two parts. This optimization is also known as loop distribution.
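The second pass can be sketched in C++ as follows (the names are ours; oracle entries are bytes, so k ≤ 256, matching footnote 3). The bucket sizes produced by the first pass are turned into starting positions by a prefix sum, and then every element is moved exactly once:

#include <cstddef>
#include <cstdint>
#include <vector>

// Second pass: given oracles o[i] and bucket sizes, place each a[i]
// into its bucket inside the temporary array out (|out| == |a|).
void distribute(const std::vector<int>& a,
                const std::vector<std::uint8_t>& o,
                const std::vector<std::size_t>& bucketSize,
                std::vector<int>& out) {
    std::vector<std::size_t> pos(bucketSize.size(), 0);
    for (std::size_t j = 1; j < bucketSize.size(); ++j)
        pos[j] = pos[j - 1] + bucketSize[j - 1];  // exclusive prefix sum
    for (std::size_t i = 0; i < a.size(); ++i)
        out[pos[o[i]]++] = a[i];                  // sequential writes per bucket
}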
Figure 3.7: Two-pass element distribution in Super Scalar Sample Sort

Using an array of size n to accommodate all buckets means that we need to know exactly how big each bucket is. In radix sort implementations this is done by locating each element twice. But this would be prohibitive in a comparison based algorithm. Therefore we use an additional auxiliary array, o, of n oracles: o(i) stores the bucket index for element e_i. A first pass computes the oracles and the bucket sizes. A second pass reads the elements again and places element e_i into bucket b_{o(i)}. This two-pass approach incurs costs in space and time. However, these costs are rather small, since bytes suffice for the oracles and the additional memory accesses are sequential and thus can almost completely be hidden via software or hardware prefetching³. In exchange we get simplified memory management: there is no need to test for bucket overflows. Perhaps more importantly, decoupling the expensive tasks of finding buckets and distributing elements to buckets facilitates software pipelining by the compiler and prevents cache interferences of the two parts. This optimization is also known as loop distribution.
Theoretically the most expensive and algorithmically the most interesting part is how to locate elements with respect to the splitters. Fig. 3.8 gives pseudocode and a picture for this part. Assume k is a power of two. The splitters are placed into an array t such that they form a complete binary search tree with root t_1 = s_{k/2}. The left successor of t_j is stored at t_{2j} and the right successor is stored at t_{2j+1}. This is the arrangement well known from binary heaps, but used for representing a search tree here. To locate an element a_i, it suffices to travel down this tree, multiplying the index j by two in each level and adding one if the element is larger than the current splitter. This increment is the only instruction that depends on the outcome of the comparison.
³This is true as long as we can accommodate one buffer per bucket in the cache, limiting the parameter k. Other limiting factors are the size of the TLB (translation lookaside buffer, storing mappings of virtual to physical memory addresses) and the constraint k ≤ 256 if we want to store the bucket indices in one byte.
t := ⟨s_{k/2}, s_{k/4}, s_{3k/4}, s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}, . . .⟩
for i := 1 to n do             // locate each element
    j := 1                     // current tree node := root
    repeat log k times         // will be unrolled
        j := 2j + (a_i > t_j)  // left or right?
    j := j − k + 1             // bucket index
    |b_j|++                    // count bucket size
    o_i := j                   // remember oracle
Figure 3.8: Finding buckets using implicit search trees. The picture is for k = 8. We adopt the convention from C that “x > y” is one if x > y holds, and zero else.

cmp.gt p7=r1,r2        cmp.gt p6=r1,r2
(p7) br.cond .label    (p6) add r3=4,r3
add r3=4,r3
.label:
Table 3.1: Translation of if(r1 > r2) r3 := r3 + 4 with branches (left) and predicated instructions (right)

Some architectures like IA-64 have predicated arithmetic instructions that are only executed if the previously computed condition code in the instruction's predicate register is set. Others at least have a conditional move, so that we can compute j := 2j and then, speculatively, j′ := j + 1. Then we conditionally move j′ to j. The difference between such predicated instructions and ordinary branches is that they do not affect the instruction flow and hence cannot suffer from branch mispredictions.
Experiments (conducted on an Intel Itanium processor with Intel's compiler to have support for predicated instructions and software pipelining) show that our implementation of SSSS outperforms two well known library implementations for sorting. In the experiment, 32 bit random integers in the range [0, 10⁹] were sorted⁴.
For this first version of SSSS, several improvements are possible. For example, the current implementation suffers from many identical keys. This could be fixed without much overhead: If s_{i−1} < s_i = s_{i+1} = ··· = s_j with j > i (identical splitters are an indicator for many identical keys), change s_i to s_i − 1. Do not recurse on buckets b_{i+1}, . . . , b_j – they all contain identical keys. Now SSSS can even profit from an input like this.
Another disadvantage compared to quicksort is that SSSS is not inplace. One could make it almost inplace, however.
⁴Note that the algorithm's runtime is not influenced by the distribution of elements, so a random distribution of elements is no unfair advantage for SSSS.
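To make the two-pass strategy concrete, here is a compact C++ sketch of the SSSS inner loops (a simplified sequential rendering with int keys and my own helper names, not the authors' tuned code; it assumes k is a power of two and k ≤ 256 so that oracle bytes suffice):

#include <cstdint>
#include <vector>

// Pass 1 computes oracles and bucket sizes via a branchless descent of the
// implicit splitter tree t[1..k-1] (root t[1] = s_{k/2}); pass 2 distributes.
void distribute(const std::vector<int>& a, const std::vector<int>& t,
                unsigned k, unsigned logK,
                std::vector<int>& out, std::vector<std::uint8_t>& oracle) {
    std::vector<std::size_t> bucketSize(k, 0);
    for (std::size_t i = 0; i < a.size(); ++i) {   // pass 1: find buckets
        unsigned j = 1;
        for (unsigned l = 0; l < logK; ++l)
            j = 2 * j + (a[i] > t[j]);             // branchless: becomes cmov/setcc
        j -= k;                                    // bucket index in 0..k-1
        oracle[i] = static_cast<std::uint8_t>(j);
        ++bucketSize[j];
    }
    std::vector<std::size_t> start(k + 1, 0);      // prefix sums = bucket starts
    for (unsigned b = 0; b < k; ++b) start[b + 1] = start[b] + bucketSize[b];
    for (std::size_t i = 0; i < a.size(); ++i)     // pass 2: distribute
        out[start[oracle[i]]++] = a[i];
}

Decoupling the two loops is exactly the loop distribution mentioned above: the first loop is all comparisons and index arithmetic, the second all data movement.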
Figure 3.9: Runtime for sorting using SSSS and other algorithms (Intel STL, GCC STL, sss-sort; time / n log n [ns] over n)
Figure 3.10: Breakdown of the execution time of SSSS (divided by n log n) into phases. “FB” denotes the finding of buckets for the elements, “DIST” the distribution of the elements to the buckets, “BA” the base sorting routines. The remaining time is spent in finding the splitters etc.
Figure 3.11: Run formation (example input: “make things as simple as possible but no simpler”)
Figure 3.12: Example of 4-way merging with M = 12, B = 2

This is easiest to explain for the case that both input and output are a sequence of blocks (compare chapter 2). Sampling takes sublinear space and time. Distribution needs at most 2k additional blocks and can otherwise recycle freed blocks of the input sequence. Although software pipelining may be more difficult for this distribution loop, the block representation facilitates a single pass implementation without the time and space overhead for oracles, so that good performance may be possible. Since it is possible to convert inplace between block list representation and an array representation in linear time, one could actually attempt an almost inplace implementation of SSSS.
3.5 Multiway Merge Sort
We will now study another algorithm based on the concept of Merge Sort which is especially well suited for external sorting. For external algorithms, an efficient sorting subroutine is even more important than for main memory algorithms, because one often tries to avoid random disk accesses by ordering the data, allowing a sequential scan.
Multiway Merge Sort first splits the data into ⌈n/M⌉ runs which fit into main memory, where they are sorted. We merge these runs until only one is left. Instead of ordinary 2-way merging, we merge k := M/B runs in a single pass, resulting in a smaller number of merge phases. We only have to keep one block (containing the currently smallest elements) per run in main memory. We maintain a priority queue containing the smallest elements of each run in the current merging step to efficiently keep track of the overall smallest element.
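As a compact in-memory illustration of this merging step (a sketch using std::priority_queue; an external version would refill each run's block from disk when it is exhausted, and a tuned one would use the tournament tree of section 3.7):

#include <functional>
#include <queue>
#include <vector>

// k-way merge: the priority queue always holds the current smallest
// element of each run together with the run it came from.
std::vector<int> multiwayMerge(const std::vector<std::vector<int>>& runs) {
    using Entry = std::pair<int, std::size_t>;          // (key, run index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    std::vector<std::size_t> pos(runs.size(), 0);
    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) pq.push({runs[r][0], r});
    std::vector<int> out;
    while (!pq.empty()) {
        auto [key, r] = pq.top(); pq.pop();             // overall smallest
        out.push_back(key);
        if (++pos[r] < runs[r].size())                  // refill from run r
            pq.push({runs[r][pos[r]], r});
    }
    return out;
}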
Figure 3.13: Striping: one logical block consists of D physical blocks.
Every element is read/written twice for forming the runs (in blocks of size B) and twice for every merging phase. Access granularity is blocks. This leads to the following (asymptotically optimal) total number of I/Os:

    2n/B · (1 + ⌈log_k #runs⌉) = 2n/B · (1 + ⌈log_{M/B} (n/M)⌉) =: sort(n)    (3.1)

Let us consider the following realistic parameters: B = 2 MB, M = 1 GB. For inputs up to a size of n = 512 GB, we get only one merging phase! In general, this is the case if we can store ⌈n/M⌉ buffers (one for each run) of size B in internal memory (i.e., n ≤ M²/B). Note that a single additional merge phase increases the I/O volume by 50%.
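For concreteness, the arithmetic behind these numbers (a straightforward check, not part of the original text):

    k = M/B = 2^30 / 2^21 = 512               (merge arity)
    #runs = ⌈n/M⌉ = 2^39 / 2^30 = 512         (for n = 512 GB)
    ⌈log_k #runs⌉ = ⌈log_512 512⌉ = 1         (a single merge phase)
    M²/B = 2^60 / 2^21 = 2^39 bytes = 512 GB  (the largest such n)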
3.6 Sorting with parallel disks
We now consider a system with D disks. There are different ways to model this situation (see Figure 3.15), but all have in common that in one I/O step we can fetch up to D blocks, so we can hope to reduce the number of I/Os by this factor:

    2n/(DB) · (1 + ⌈log_{M/B} (n/M)⌉)    (3.2)

An obvious idea to handle multiple disks is the concept of striping: An emulated disk contains logical blocks of size DB consisting of one physical block per disk. The algorithms for run formation and writing the output can be used unchanged on this emulated disk. For the merging step, however, we have to be careful: With larger (logical) blocks the number of I/Os becomes:

    2n/(DB) · (1 + ⌈log_{M/(DB)} (n/M)⌉)    (3.3)

The algorithm will move more data in one I/O step (compared to the setup with one disk) but possibly requires a deeper recursion. In practice, this can make the difference between one or two merge phases.
Figure 3.14: The smallest element of each block triggers fetch: the prediction sequence controls which blocks move into the prefetch buffers and on into the internal merge buffers.

We therefore have to work on the level of physical blocks to achieve optimal constant factors. This comes with the necessity to distribute the runs in an intelligent way among the disks and to find a schedule for fetching blocks into the merger. For starters, it is necessary to find out which block on which disk will be required next when one of the merging buffers runs out of elements. This can be computed offline when all runs are formed: A block is required the moment its smallest element is required. We can therefore sort the set of all smallest elements to get a prediction sequence. To be able to refill the merge buffers in time, we maintain prefetch buffers which we fill (if necessary) while the merging of the current elements takes place. This allows fetching the blocks due next in parallel and helps achieve an efficiency near 1 (i.e., fetching D blocks in one I/O step).
How many prefetch buffers should we use? We first approach this question using a simplified model ((a) in Figure 3.15) where we have D read-/write-heads on one large disk. Here, D prefetch buffers suffice: In one I/O step we can refill all buffers, transferring D blocks of size B, which leads to a total (optimal) number of I/Os as in equation 3.2. If we replace the multihead model with D independent disks (each with its own read-/write-head), we get a more realistic model. But now D prefetch buffers seem too few, as it is possible that the next k blocks all reside on the same disk, which would need that many I/O steps for filling the buffers while the other disks lie idle, leading to a non-optimal efficiency.
A first solution is to increase the number of prefetch buffers to kD. But that would leave us with less space for merge buffers, write buffers and other data that we have to keep in main memory.
Figure 3.15: Different models for systems with several disks: (a) the multihead model [Aggarwal Vitter 88] (one disk with D read-/write-heads) and (b) D independent disks [Vitter Shriver 94]
Figure 3.16: Distribution of runs using randomized cycling.

Instead, we use the randomized cycling pattern while forming runs: For every run j, we map block i to disk π_j(i mod D) for a random permutation π_j. This makes the event of getting a “difficult” distribution highly unlikely. With a naive prefetching strategy and randomized cycling, we can achieve good performance with only O(D log D) buffers. Is it possible to reduce this even to O(D)?
The prefetching strategy leaves more room for optimization. The naive approach fetches in one I/O step the next blocks from the prediction sequence until all free buffers are filled or a disk would be accessed twice. The problem is now to find an optimal offline prefetching schedule (offline, because the prediction sequence yields the order in which the blocks on each disk are needed). For the solution, we make a digression to online buffered writing and use the principle of duality to transform the result into a schedule for offline prefetching.
In online buffered writing, we have a sequence Σ of blocks to be written to one of D disks.
Figure 3.17: The online buffered writing problem and its optimal solution: write to a random free buffer whenever one of the W buffers is free; otherwise, output one block from each nonempty queue.

We also have W buffers, W/D for each disk. It can be shown that randomized, uniformly distributed writing to one of the free buffers, outputting one block of each nonempty queue whenever no buffer capacity is left, is an optimal strategy and achieves an expected efficiency of 1 − O(D/W). We can now reverse this process to obtain an optimal offline prefetching algorithm called lazy prefetching: Given the prediction sequence Σ, we calculate the optimal online writing schedule T for the reversed sequence Σ^R and use T^R as prefetching schedule. Note that we do not use the distribution of blocks among the disks that the writing algorithm produces, and that the random distribution during the writing process corresponds to randomized cycling. Figure 3.19 gives an example in which our optimal strategy yields a better result than a naive prefetching approach: The upper half shows the schedule from Figure 3.18, created by inverting a writing schedule. The bottom half shows the result of naive prefetching, always fetching the next block from every disk in one step (as long as there are free buffers).
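The duality can be made concrete in a few lines. The following is a hypothetical sketch (names and interfaces are my own): it simulates the write-whenever-a-buffer-is-free strategy on the reversed prediction sequence and reverses the resulting step numbers to obtain the prefetching schedule.

#include <cstddef>
#include <deque>
#include <vector>

// prediction[p] is the p-th block needed by the merger, blockDisk[b] the disk
// holding block b, W the total number of buffers. Returns for every block the
// I/O step in which lazy prefetching fetches it (blocks sharing a step are
// fetched in one parallel I/O).
std::vector<int> lazyPrefetchSchedule(const std::vector<int>& prediction,
                                      const std::vector<int>& blockDisk,
                                      int D, int W) {
    std::vector<std::deque<int>> q(D);        // per-disk queues of the writer
    std::vector<int> writeStep(blockDisk.size(), 0);
    int buffered = 0, step = 0;
    auto outputStep = [&] {                   // one I/O step of the writer:
        ++step;                               // each nonempty queue writes a block
        for (auto& dq : q)
            if (!dq.empty()) {
                writeStep[dq.front()] = step;
                dq.pop_front();
                --buffered;
            }
    };
    // feed the writer with the REVERSED prediction sequence
    for (std::size_t p = prediction.size(); p-- > 0; ) {
        if (buffered == W) outputStep();      // no free buffer: flush one step
        q[blockDisk[prediction[p]]].push_back(prediction[p]);
        ++buffered;
    }
    while (buffered > 0) outputStep();        // drain the remaining buffers
    // duality: a block written in step t is prefetched in step (step_max - t + 1)
    std::vector<int> fetchStep(blockDisk.size());
    for (std::size_t b = 0; b < blockDisk.size(); ++b)
        fetchStep[b] = step - writeStep[b] + 1;
    return fetchStep;
}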
3.7 Internal work of Multiway Mergesort
Until now we have only regarded the number of I/Os. In fact, when running with several disks our sorting algorithm can very well be compute bound, i.e. prefetching D new blocks requires less time than merging them. We use the technique of overlapping to minimize wait time for whichever task is bounding our algorithm in a given environment. Take the following example on run formation (i denotes a run):
Thread A: Loop { wait-read i; sort i; post-write i}; Thread B: Loop { wait-write i; post-read i+2};
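A compilable miniature of this scheme (a simplified double-buffering variant with stub I/O routines; the names and the std::async mechanics are my own, not the lecture's code):

#include <algorithm>
#include <future>
#include <vector>

// Stand-ins for the real, block-wise disk transfers:
std::vector<int> readRun(int i) { return std::vector<int>(1 << 20, i); }
void writeRun(int /*i*/, const std::vector<int>&) { /* write block-wise */ }

// While run i is sorted and written, run i+1 is already being read in the
// background, so the slower of I/O and computation dictates the runtime.
void formRuns(int numRuns) {
    if (numRuns == 0) return;
    auto next = std::async(std::launch::async, readRun, 0);
    for (int i = 0; i < numRuns; ++i) {
        std::vector<int> run = next.get();                 // wait-read i
        if (i + 1 < numRuns)                               // post-read i+1
            next = std::async(std::launch::async, readRun, i + 1);
        std::sort(run.begin(), run.end());                 // sort i
        writeRun(i, run);                                  // post-write i
    }
}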
Figure 3.18: Example: Optimal randomized online writing (panels (a)–(i) show the queue states after successive input and output steps)
Figure 3.19: Example: the resulting offline reading schedule (top: optimal, obtained by reversing the writing schedule, finishing in 8 input steps; bottom: the naive prefetching of [Barve-Grove-Vitter 97], which needs 9 input steps)
During initialization, runs 1 and 2 are read and i is set to 1. Thread A sorts runs in memory and writes them to disk. Thread B waits until run i is finished (and thread A works on i + 1), then reads the next run i + 2 into the freed space. The thread doing the more intense work never waits for the other one. A similar result can be achieved during the merging step, but this is considerably more complicated and beyond the scope of this course.
As internal work influences running time, we need a fast solution for the most compute-intensive step during merging: A Tournament Tree (or Loser Tree) is a specialized data structure for finding the smallest element of all runs. For k = 2^K, it is a complete binary tree with K levels, where each leaf contains the currently smallest element of one run. Each internal node contains the “loser” (i.e., the greater) of the “competition” between its two child nodes. Above the root node, we store the global winner along with a pointer to the corresponding run. After writing this element to the output buffer, we simply move the next element of its run up the tree until there is a new global winner. Compared to general purpose data structures like binary heaps, we get exactly log k comparisons (no hidden constant factors). Similar to the implicit search trees we used for Sample Sort, Tournament Trees can be implemented as arrays, where finding the parent node simply maps to a right shift of the index. The inner loop for moving from leaf to root can be unrolled; it contains predictable load instructions and index computations, allowing exploitation of instruction parallelism.

Figure 3.20: A tournament tree (the picture shows a deleteMin followed by an insertNext)

for (int i = (winnerIndex + kReg) >> 1; i > 0; i >>= 1) {
    currentPos = entry + i;
    currentKey = currentPos->key;
    if (currentKey < winnerKey) {
        currentIndex = currentPos->index;
        currentPos->key = winnerKey;
        currentPos->index = winnerIndex;
        winnerKey = currentKey;
        winnerIndex = currentIndex;
    }
}

Figure 3.21: Inner loop of Tournament Tree computation
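To make the array layout concrete, here is a compact, hypothetical C++ loser tree built around exactly this inner loop (recursive initialization, int keys, k a power of two; the original tuned code differs in details):

#include <limits>
#include <utility>
#include <vector>

// entry[1..k-1] store the losers of the internal nodes, entry[0] the global
// winner; the parent of tree position p is p/2 (a right shift by one).
struct LoserTree {
    struct Entry { int key; int index; };
    int k;
    std::vector<Entry> entry;
    std::vector<const int*> cur, end;          // read positions of the runs

    int next(int run) {                        // next key of a run, or sentinel
        return cur[run] != end[run] ? *cur[run]++
                                    : std::numeric_limits<int>::max();
    }
    Entry build(int node) {                    // returns winner of the subtree
        if (node >= k) return {next(node - k), node - k};
        Entry l = build(2 * node), r = build(2 * node + 1);
        entry[node] = (l.key <= r.key) ? r : l;   // loser stays in the node
        return (l.key <= r.key) ? l : r;          // winner moves up
    }
    LoserTree(const std::vector<std::vector<int>>& runs)
        : k(int(runs.size())), entry(k), cur(k), end(k) {
        for (int i = 0; i < k; ++i) {
            cur[i] = runs[i].data();
            end[i] = runs[i].data() + runs[i].size();
        }
        entry[0] = build(1);
    }
    int deleteMinInsertNext() {                // the replay loop of Figure 3.21
        int result = entry[0].key;
        int winnerIndex = entry[0].index;
        int winnerKey = next(winnerIndex);     // next element of winner's run
        for (int i = (winnerIndex + k) >> 1; i > 0; i >>= 1)
            if (entry[i].key < winnerKey) {    // stored loser wins: swap
                std::swap(entry[i].key, winnerKey);
                std::swap(entry[i].index, winnerIndex);
            }
        entry[0] = {winnerKey, winnerIndex};
        return result;
    }
};

Merging then just calls deleteMinInsertNext repeatedly until the sentinel value appears.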
3.8 Experiments
Experiments on Multiway Merge Sort were performed in 2001 on a machine with two 2 GHz Xeon processors (Intel NetBurst architecture, two hardware threads each), several 66 MHz PCI buses, 4 fast IDE controllers (Promise Ultra100 TX2) and 8 fast IDE disks (IBM IC35L080AVVA07). This inexpensive (mid 2002) setup gave a high I/O bandwidth of 360 MB/s. The input consisted of 16 GByte of random 32 bit integers; run size was 256 MByte, block size B was 2 MB (if not otherwise mentioned).
Figure 3.22 shows the running time for different element sizes (for a constant total data volume of 16 GByte). The smaller the elements, the costlier becomes the internal work, especially during run formation (there are more elements to sort). With a high I/O throughput and intelligent prefetching algorithms, I/O wait time never makes up more than half of the total running time. This proves the point that overlapping and tuning of internal work are important.

Figure 3.22: Multiway Merge Sort with different element sizes (time [s] for run formation, merging, and the I/O wait in both phases)

Figure 3.23: Performance using different block sizes (sort time [ns/byte] over block size [KByte], for 16 GBytes and for 128 GBytes with one or two merge phases)

What is a good block size B? An intuitive approach would link B to the size of a physical disk block. However, Figure 3.23 shows that B is no technology constant but a tuning parameter: A larger B is better (as it reduces the amortized cost of O(1/B) I/Os per element), as long as the resulting smaller k = M/B still allows for a single merge phase (see the curve for 128 GB).
Chapter 4
Priority Queues
The material on external priority queues was first published in [5].
4.1 Introduction
Priority queues are an important data structure for many applications, including: shortest path search (Dijkstra's Algorithm), sorting, construction of minimum spanning trees, branch-and-bound search, discrete event simulation and many more. While the first examples are widely known and also covered in other chapters, we give a short explanation of the latter two applications: In the best-first branch-and-bound approach to optimization, elements are partial solutions of an optimization problem and the keys are optimistic estimates of the obtainable solution quality. The algorithm repeatedly removes the best looking partial solution, refines it, and inserts zero or more new partial solutions. In a discrete event simulation one has to maintain a set of pending events. Each event happens at some scheduled point in time and creates zero or more new events scheduled to happen at some time in the future. Pending events are kept in a priority queue. The main loop of the simulation deletes the next event from the queue, executes it, and inserts newly generated events into the priority queue.
Our (non-addressable) priority queue M needs to support the following operations:
Procedure build({e_1, . . . , e_n})    M := {e_1, . . . , e_n}
Procedure insert(e)                    M := M ∪ {e}
Function deleteMin                     e := min M; M := M \ {e}; return e

There are different approaches to implementing priority queues, but most of them resort to an implicit or explicit tree representation which is heap-ordered¹: If w is a successor of v, the key stored in w is not greater than the key stored in v. This way, the overall smallest key is stored in the root.

¹In 4.4 we will see implementations using a whole forest of heap-ordered trees.
4.2 Binary Heaps
Priority queues are often implemented as binary heaps, stored in an array h where the successors of an element at position i are stored at positions 2i and 2i + 1. This is an implicit representation of a near-perfect binary tree which might only lack some leaves in the bottom level. We require that this array is heap-ordered, i.e.,
if 2 ≤ j ≤ n then h[⌊j/2⌋] ≤ h[j].
Binary heaps with arrays are bounded in space, but they can be made unbounded in the same way as bounded arrays are made unbounded. Assuming non-hierarchical memory, we can implement all desired operations efficiently: An insert puts a new element e tentatively at the end of the heap h, i.e., e is put at a leaf of the tree represented by h. Then e is moved to an appropriate position on the path from the leaf h[n] to the root.
Procedure insert(e : Element)
    assert n < w
    n++; h[n] := e
    siftUp(n)

where siftUp(s) moves the contents of node s towards the root until the heap property holds.
Procedure siftUp(i : ℕ)
    assert the heap property holds except maybe for j = i
    if i = 1 ∨ h[⌊i/2⌋] ≤ h[i] then return
    assert the heap property holds except for j = i
    swap(h[i], h[⌊i/2⌋])
    assert the heap property holds except maybe for j = ⌊i/2⌋
    siftUp(⌊i/2⌋)
Since siftUp may move the element all the way up to the root, performing a comparison on every level, insert takes O(log n) time. On average, a constant number of comparisons suffices.
deleteMin in its basic form replaces the root with the rightmost leaf h[n], which is then sifted down (analogously to siftUp), resulting in 2 log n key comparisons (on every level, we have to find the minimum of three elements). The bottom-up heuristic suggests an improvement for that operation: The hole left by the removed minimum is “sifted down” to a leaf (requiring only one comparison per level, between the two successors of the hole), and only then is it filled with the rightmost leaf, which is sifted up again (costing constant time on average, like an insertion).

Figure 4.1: The bottom-up heuristic: deleteMin sifts the hole down with log n compares, then sifts the last element up with O(1) compares on average

int i=1, m=2, t = a[1];
m += (m != n && a[m] > a[m + 1]);
if (t > a[m]) {
    do {
        a[i] = a[m];
        i = m;
        m = 2*i;
        if (m > n) break;
        m += (m != n && a[m] > a[m + 1]);
    } while (t > a[m]);
    a[i] = t;
}

Figure 4.2: An efficient version of standard deleteMin
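For comparison with Figure 4.2, a sketch of the bottom-up deleteMin (an illustration on a 1-based int array under my own conventions, not taken from the lecture's code):

#include <vector>

// Bottom-up deleteMin: sift the hole left by the minimum down to a leaf
// (one comparison per level), then fill it with the last element and sift
// that element up again (constant cost on average).
int deleteMinBottomUp(std::vector<int>& h, int& n) {
    int result = h[1];
    int hole = 1;
    while (2 * hole <= n) {                  // move the hole to a leaf
        int child = 2 * hole;
        if (child < n && h[child + 1] < h[child]) ++child;
        h[hole] = h[child];
        hole = child;
    }
    int e = h[n--];                          // last element fills the hole
    while (hole > 1 && e < h[hole / 2]) {    // sift up (O(1) on average)
        h[hole] = h[hole / 2];
        hole /= 2;
    }
    h[hole] = e;
    return result;
}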
This approach should be a factor of two faster than the naive implementation. However, if the latter is programmed properly (see Figure 4.2), there are no measurable differences in runtime: The given implementation performs log n more comparisons than bottom-up, but these serve as stop criteria for the loop and are thus easy to handle for the branch prediction. Notice how the increment of m avoids branches within the loop.
For the initial construction of a heap there are also two competing approaches: buildHeapBackwards moves from the leaves to the root, ensuring the heap property on every level. buildHeapRecursive first establishes this property recursively on the two subtrees of the root and then sifts the remaining node down. Here, we have the reverse situation compared to deleteMin: Both algorithms asymptotically cost O(n) time, but on a real machine the recursive variant is faster by a factor of two: It is more cache efficient. Note that a subtree with B leaves and therefore log B levels can be stored in B log B blocks of size B. If these blocks fit into the cache, we only require O(n/B) I/O operations.

Procedure buildHeapBackwards
    for i := ⌊n/2⌋ downto 1 do siftDown(i)

Procedure buildHeapRecursive(i : ℕ)
    if 4i ≤ n then
        buildHeapRecursive(2i)
        buildHeapRecursive(2i + 1)
    siftDown(i)

Figure 4.3: Two implementations for buildHeap

Figure 4.4: A simple external PQ for n < km (k sorted sequences of length up to m, a k-way merge using a tournament tree data structure, an m-element buffer, and an m-element insertion buffer)
4.3 External Priority Queues
We now study a variant of external priority queues² called sequence heaps. Merging k sorted sequences into one sorted sequence (k-way merging) is an I/O-efficient subroutine used for sorting – we saw this in chapter 3.5. The basic idea of sequence heaps is to adapt k-way merging to the related but more dynamic problem of priority queues.
Let us start with the simple case that at most km insertions take place, where m is the size of a buffer that fits into fast memory. Then the data structure could consist of k sorted sequences of length up to m. We can use k-way merging for deleting a batch of the m smallest elements from the k sorted sequences. The next m deletions can then be served from a buffer in constant time. A separate binary heap with capacity m allows an arbitrary mix of insertions and deletions by holding the recently inserted elements. Deletions have to check whether the smallest element has to come from this insertion buffer. When this buffer is full, it is sorted, and the resulting sequence becomes one of the sequences for the k-way merge.

²If “I/O” is replaced by “cache fault”, we can use this approach also one level higher in the memory hierarchy.

Figure 4.5: Overview of the complete data structure for R = 3 merge groups

How can we generalize this approach to handle more than km elements? We cannot increase m beyond M, since the insertion heap would not fit into fast memory. We cannot arbitrarily increase k, since eventually k-way merging would start to incur cache faults. Sequence heaps make room by merging all the k sequences, producing a larger sequence of size up to km.
Now the question arises how to handle the larger sequences. Sequence heaps employ R merge groups G_1, . . . , G_R where G_i holds up to k sequences of size up to mk^{i−1}. When group G_i overflows, all its sequences are merged, and the resulting sequence is put into group G_{i+1}. Each group is equipped with a group buffer of size m to allow batched deletion from the sequences. The smallest elements of these buffers are deleted in batches of size m′ (m′ ≤ m). They are stored in the deletion buffer. Fig. 4.5 summarizes the data structure. We now have enough information to understand how deletion works:
DeleteMin: The smallest elements of the deletion buffer and insertion buffer are compared, and the smaller one is deleted and returned. If this empties the deletion buffer, it is refilled from the group buffers using an R-way merge. Before the refill, group buffers with less than m′ elements are refilled from the sequences in their group (if the group is nonempty).
DeleteMin works correctly provided the data structure fulfills the heap property, i.e., elements in the group buffers are not smaller than elements in the deletion buffer, and in turn, elements in a sorted sequence are not smaller than the elements in the respective group buffer. Maintaining this invariant is the main difficulty for implementing insertion.
Insert: New elements are inserted into the insert heap. When its size reaches m, its elements are sorted (e.g. using merge sort or heap sort). The result is then merged with the concatenation of the deletion buffer and the group buffer 1. The smallest resulting elements replace the deletion buffer and group buffer 1. The remaining elements form a new sequence of length at most m. The new sequence is finally inserted into a free slot of group G1. If there is no free slot initially, G1 is emptied by merging all its sequences into a single sequence of size at most km, which is then put into G2. The same strategy is used recursively to free higher level groups when necessary. When group GR overflows, R is incremented and a new group is created. When a sequence is moved from one group to the other, the heap property may be violated. Therefore, when G1 through Gi have been emptied, the group buffers 1 through i + 1 are merged, and put into G1.
For cached memory, where the speed of internal computation matters, it is also crucial how to implement the operation of k-way merging. How it can be done in an efficient way is described in the chapter about Sorting (3.7).
Analysis: We will now give a sketch of the I/O analysis of our priority queues. Let I denote the number of insertions, which is also an upper bound to the number of deleteMin operations.
First note that group G_i can overflow at most every m(k^i − 1) insertions: The only complication is the slot in group G_1 used for invalid group buffers. Nevertheless, when groups G_1 through G_i contain k sequences each, at least

    Σ_{j=1}^{i} m(k − 1)k^{j−1} = m(k^i − 1)

insertions must have taken place. Therefore, R = ⌈log_k(I/m)⌉ groups suffice.
Now consider the I/Os performed for an element moving on the following canonical data path: It is first inserted into the insert buffer and then written to a sequence in group G_1 in a batched manner, i.e., 1/B I/Os are charged to the insertion of this element. Then it is involved in emptying groups until it arrives in group G_R. For each emptying operation, the element is involved in one batched read and one batched write, i.e., it is charged with 2(R − 1)/B I/Os for the group emptying operations.
Figure 4.6: Example of an insertion on the sequence heap: (a) inserting element 3 leads to an overflow of the insert heap: it is merged with the deletion buffer and group buffer 1 and then inserted into group 1; (b) overflow in group 1: all old elements are merged and inserted into the next group; (c) overflow in group 2: all old elements are merged and inserted into the next group; (d) the group buffers are now invalid: merge them and insert the result into group 1.
a 3 (a) Deletion of two elements empties insert heap and deletion buffer n o p n o p m q m q k l r k l r d j i s w d j i s w x b g h t u v x t u v e b f c e
b g h f c
(b) Every Group fills its buffer via k-way-merging, the deletion buffer is filled from group buffers via M-way-merging
Figure 4.7: Example of a deletion on the sequence
51 charged with 2(R − 1)/B I/Os for tree emptying operations. Eventually, the element is read into group buffer R yielding a charge of 1/B I/Os for. All in all, we get a charge of 2R/B I/Os for each insertion. What remains to be shown (and is ommited here) is that the remaining I/Os only contribute lower order terms or replace I/Os done on the canonical path. For example, we save I/Os when an element is extracted before it reaches the last group. We use the costs charged for this to pay for swapping the group buffers in and out. Eventually, we have O(sort(I)) I/Os. In a similar fashion, we can show that I operations inflict I log I key comparisons on average. As for sorting, this is a good measure for the internal work, since in efficient implementations of priority queues for the comparison model, this number is close to the number of unpredictable branch instructions (whereas loop control branches are usually well predictable by the hardware or the compiler), and the number of key comparisons is also proportional to the number of memory accesses. These two types of operations often have the largest impact on the execution time, since they are the most severe limit to instruction parallelism in a super-scalar processor.
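As a quick sanity check with the parameter values used in the experiments below (m = 256, k = 128) and a hypothetical workload of I = 10⁷ operations:

    R = ⌈log_k(I/m)⌉ = ⌈log_128(10⁷/256)⌉ = ⌈log_128 39063⌉ = ⌈2.2⌉ = 3

so each insertion is charged 2R/B = 6/B I/Os on its canonical path.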
Experiments: We now present the results of some experiments conducted to compare our sequence heap with other priority queue implementations. Random 32 bit integers were used as keys, with another 32 bits of associated information. The operation sequence used was (Insert deleteMin Insert)^N (deleteMin Insert deleteMin)^N. The choice of this sequence is nontrivial, as it can have measurable influence (a factor of two and more) on the performance. Figure 4.9 shows this: Here we have the sequence (Insert (deleteMin Insert)^s)^N (deleteMin (Insert deleteMin)^s)^N for several values of s. For larger s, the performance gets better when N is large enough. This can be explained with a “locality effect”: New elements tend to be smaller than most old elements (the smallest of the old elements have long been removed before). Therefore, many elements never make it into group G_1, let alone the groups for larger sequences. Since most work is performed while emptying groups, this work is saved. The instances with small s should hence come close to the worst case. To make clear that sequence heaps are nevertheless still much better than binary or 4-ary heaps, Figure 4.9 additionally contains their timings for s = 0.
The parameters chosen for the experiments were m′ = 32, m = 256 and k = 128 on all machines tried. While there were better settings for individual machines, these global values gave near optimal performance in all cases.

Figure 4.8: Runtime comparison for several PQ implementations: bottom-up binary heap, bottom-up aligned 4-ary heap, sequence heap (on a 180 MHz MIPS R10000)
Figure 4.9: Runtime comparison for different operation sequences ((T(deleteMin) + T(insert))/log N [ns] over N; s = 0 also shown for binary and 4-ary heaps, s ∈ {0, 1, 4, 16} for the sequence heap)
Figure 4.10: Link: merge two trees, preserving the heap property
Figure 4.11: Cut: remove subtree and add it to the forest
4.4 Addressable Priority Queues
For addressable priority queues, we want to add the following functionality to the interface of our basic data structure:
Function remove(h : Handle)                    e := h; M := M \ {e}; return e
Procedure decreaseKey(h : Handle, k : Key)     assert key(h) ≥ k; key(h) := k
Procedure merge(M′)                            M := M ∪ M′
This extended interface is required to efficiently implement Dijkstra's Algorithm for shortest paths or the Jarnik-Prim Algorithm for calculating Minimum Spanning Trees (both make use of the decreaseKey operation). It is not possible to make our previous approach addressable, as keys are constantly swapped around in the array by deleteMin and other operations. For this domain, we implement priority queues as a forest of heap-ordered trees together with a pointer to the tree containing the globally minimal element. The most elementary form of these priority queues is called a Pairing Heap. With just the two basic operations link (Figure 4.10) and cut (Figure 4.11), we can give a high-level implementation of all necessary operations:
Procedure insertItem(h : Handle)    newTree(h)
Procedure newTree(h : Handle)
    forest := forest ∪ {h}
    if key(h) < min then minPtr := h
Procedure decreaseKey(h : Handle, k : Key)
    key(h) := k
    if h is not a root then cut(h)
Function deleteMin : Handle
    m := minPtr
    forest := forest \ {m}
    foreach child h of m do newTree(h)
    perform a pairwise link of the tree roots in forest
    return m
Procedure merge(o : AddressablePQ)
    if minPtr > o.minPtr then minPtr := o.minPtr
    forest := forest ∪ o.forest
    o.forest := ∅
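Before walking through these operations in prose, here is a hypothetical C++ sketch (the forest kept in a std::vector, int keys; the pointer layout follows the item structure of Figure 4.12 further below; illustrative only, not a tuned implementation):

#include <algorithm>
#include <cstddef>
#include <vector>

struct Item {
    int key;
    Item* child = nullptr;         // leftmost child
    Item* right = nullptr;         // right sibling
    Item* leftOrParent = nullptr;  // left sibling, or parent for first children
    bool firstChild = false;       // disambiguates leftOrParent
};

struct PairingHeap {
    std::vector<Item*> forest;     // the roots
    Item* minPtr = nullptr;

    void newTree(Item* h) {
        h->leftOrParent = nullptr; h->right = nullptr; h->firstChild = false;
        forest.push_back(h);
        if (!minPtr || h->key < minPtr->key) minPtr = h;
    }
    Item* link(Item* a, Item* b) {             // larger key becomes first child
        if (b->key < a->key) std::swap(a, b);
        if (a->child) { a->child->leftOrParent = b; a->child->firstChild = false; }
        b->right = a->child; b->leftOrParent = a; b->firstChild = true;
        a->child = b;
        return a;
    }
    void cut(Item* h) {                        // detach subtree, make it a root
        if (h->firstChild) h->leftOrParent->child = h->right;
        else h->leftOrParent->right = h->right;
        if (h->right) {
            h->right->leftOrParent = h->leftOrParent;
            h->right->firstChild = h->firstChild;
        }
        newTree(h);
    }
    void insertItem(Item* h) { newTree(h); }
    void decreaseKey(Item* h, int k) {
        h->key = k;
        if (h->leftOrParent) cut(h);           // not a root
        else if (k < minPtr->key) minPtr = h;
    }
    Item* deleteMin() {
        Item* m = minPtr;
        forest.erase(std::find(forest.begin(), forest.end(), m));
        for (Item* c = m->child; c != nullptr; ) {
            Item* next = c->right;
            newTree(c);                        // children become roots
            c = next;
        }
        std::vector<Item*> paired;             // one round of pairwise linking
        for (std::size_t i = 0; i + 1 < forest.size(); i += 2)
            paired.push_back(link(forest[i], forest[i + 1]));
        if (forest.size() % 2) paired.push_back(forest.back());
        forest = paired;
        minPtr = nullptr;                      // rescan roots for the minimum
        for (Item* r : forest) if (!minPtr || r->key < minPtr->key) minPtr = r;
        return m;
    }
};

The pairwise linking pass in deleteMin is what gives the pairing heap its name.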
An insert adds a new single node tree to the forest. So a sequence of n inserts into an initially empty heap will simply create n single node trees. The cost of an insert is clearly O(1).
A deleteMin operation removes the node indicated by minPtr. This turns all children of the removed node into roots. We then scan the set of roots (old and new) to find the new minimum, a potentially very costly process. We make the process even more expensive (by a constant factor) by doing some useful work on the side, namely combining some trees into larger trees. Pairing heaps do this by just doing one step of pairwise linking of arbitrary trees. There are variants doing more complicated operations to prove better theoretical bounds.
We turn to the decreaseKey operation next. It is given a handle h and a new key k and decreases the key value of h to k. In order to maintain the heap property, we cut the subtree rooted at h and turn h into a root. Cutting out subtrees causes the more subtle problem that it may leave trees that have an awkward shape. While pairing heaps do nothing to prevent this, some variants of addressable priority queues perform additional operations to keep the trees in shape.
The remaining operations are easy. We can remove an item from the queue by first decreasing its key so that it becomes the minimum item in the queue and then performing a deleteMin. To merge a queue o into another queue we compute the union of their root sets. To update minPtr it suffices to compare the minima of the merged queues. If the root sets are represented by linked lists, and no additional balancing is done, a merge needs only constant time.
Pairing heaps are the simplest form of forest-based addressable priority queues. A more elaborate and (in theory, at least) faster variant are Fibonacci Heaps. They maintain a rank (initially zero, denoting the number of children) for every element, which is increased for root nodes when another tree is linked to them, and a mark flag that is set when the node loses a child due to a decreaseKey. Root nodes of the same rank are linked after a deleteMin to limit the number of trees. If a cut is executed on a node with an already marked parent, the parent is cut as well. These rules lead to an amortized complexity of O(log n) for deleteMin and O(1) for all other operations. However, both the constant factors and the worst case performance for a single operation are high, making Fibonacci Heaps a mainly theoretical tool. In addition, the extra meta-information per node increases the memory overhead of Fibonacci Heaps.

Figure 4.12: Structure of one item in a Pairing Heap or a Fibonacci Heap (Fibonacci Heap item: parent, data, rank, mark, left sibling, right sibling, one child; Pairing Heap item: left sibling or parent, data, right sibling, one child)
Chapter 5
External Memory Algorithms
The introduction of this chapter is based on [6]. The sections on time-forward processing, graph algorithms and cache oblivious algorithms use material from the book chapters [10], [8] and [9]. The cache oblivious model was first presented in [11]. The section on Funnelsort is based on [19]. The external BFS section is from [12] for the presentation of the algorithm and from [13] for tuning and experiments. Additional material in multiple sections is from [7].
5.1 Introduction
Massive data sets arise naturally in many domains. Spatial data bases of geographic information systems like GoogleEarth and NASA's World Wind store terabytes of geographically-referenced information that includes the whole Earth. In computer graphics one has to visualize huge scenes using only a conventional workstation with limited memory. Billing systems of telecommunication companies evaluate terabytes of phone call log files. One is interested in analyzing huge network instances like a web graph or a phone call graph. Search engines like Google and Yahoo provide fast text search in their data bases indexing billions of web pages. A precise simulation of the Earth's climate needs to manipulate petabytes of data. These examples are only a sample of the numerous applications which have to process huge amounts of data.
For economic reasons, it is not feasible to build all of the computer's memory of the fastest type or to extend the fast memory to dimensions that could hold all relevant data. Instead, modern computer architectures contain a memory hierarchy of increasing size, decreasing speed and cost from top to bottom: On top, we have the registers integrated in the CPU, a number of caches, main memory and finally disks, which are often referred to as external memory as opposed to internal memory. The internal memories of computers can keep only a small fraction of these large data sets. During the processing, the applications need to access the external memory (e.g. hard disks) very frequently. One such access can be about 10⁶ times slower than a main memory access. Therefore, the disk accesses (I/Os) become the main bottleneck.

Figure 5.1: Schematic construction of a hard disk

The reason for this high latency is the mechanical nature of the disk access. Figure 5.1 shows the schematic construction of a hard disk. The time needed for finding the data position on the disk is called seek time or (seek) latency and averages to about 3–10 ms for modern disks. The seek time depends on the surface data density and the rotational speed and can hardly be reduced because of the mechanical nature of hard disk technology, which still remains the best way to store massive amounts of data. Note that after finding the required position on the surface, the data can be transferred at a higher speed which is limited only by the surface data density and the bandwidth of the interface connecting CPU and hard disk. This speed is called sustained throughput and achieves up to 80 MByte/s nowadays. In order to amortize the high seek latency, one reads or writes the data in chunks (blocks). The block size is balanced when the seek latency is a fraction of the sustained transfer time for the block. Good results are achieved with blocks containing a full track. For older low density disks of the early 90's, the track capacities were about 16-64 KB. Nowadays, disk tracks have a capacity of several megabytes.
Operating systems implement the so called virtual memory mechanism that provides an additional working space for an application, mapping an external memory file (page file) to virtual main memory addresses. This idea supports the Random Access Machine model in which a program has an infinitely large main memory with uniform random access cost. Since the memory view is unified in operating systems supporting virtual memory, the application does not know where its working space and program code are located: in the main memory or (partially) swapped out to the page file. For many applications and algorithms with non-linear access pattern, these remedies are not useful and
even counterproductive: the swap file is accessed very frequently; the program code can be swapped out in favor of data blocks; the swap file is highly fragmented and thus many random input/output operations (I/Os) are needed even for scanning.
5.2 The external memory model and things we already saw
If we bypass the virtual memory mechanism, we cannot apply the RAM model for analysis anymore, since we now have to explicitly handle different levels of the memory hierarchy, while the RAM model uses one large, uniform memory. Several simple models have been introduced for designing I/O-efficient algorithms and data structures (also called external memory algorithms and data structures). The most popular and realistic model is the Parallel Disk Model (PDM) of Vitter and Shriver. In this model, I/Os are handled explicitly by the application. An I/O operation transfers a block of B consecutive bytes from/to a disk to amortize the latency. The application tries to transfer D blocks between the main memory of size M bytes and D independent disks in one I/O step to improve bandwidth. The input size is N bytes which is (much) larger than M. The main complexity metrics of an I/O-efficient algorithm in this model are:
• I/O complexity: the number of I/O steps should be minimized (the main metric),
• CPU work complexity: the number of operations executed by the CPU should be minimized as well.
The PDM model has become the standard theoretical model for designing and analyzing I/O-efficient algorithms. There are some “golden rules” that can guide the process of designing I/O efficient algorithms: Unstructured memory access is often very expensive, as it comes with 1 I/O per operation, whereas we want 1/B I/Os per operation for an efficient algorithm. Instead, we want to scan the external memory, always loading the next block of size B in one step and processing it internally. An optimal scan will only cost scan(N) := Θ(N/(D·B)) I/Os. If the data is not stored in a way that allows linear scanning, we can often use sorting to reorder it and then scan it. As we saw in chapter 3, external sorting can be implemented with sort(N) := Θ(N/(D·B) · log_{M/B}(N/B)) I/Os.
A simple example of this technique is the following task: We want to fill an array A with elements of an array B selected by a given “rank” stored in array C, i.e., set A[i] := B[C[i]] for all i. This should be done in an I/O efficient way.
int[1..N] A, B, C;
for i := 1 to N do A[i] := B[C[i]];
Figure 5.2: Vitter’s I/O model with several independent disks
The literal implementation would have worst case costs of Ω(N) I/Os. For N = 10⁶ and a latency of about 10 ms per random access, this would take T ≈ 10000 seconds ≈ 3 hours. Using the sort-and-scan technique, we can lower this to sort(N), and the algorithm would finish in less than a second:
SCAN C:                      (C[1]=17, 1), (C[2]=5, 2), . . .
SORT (by 1st component):     (C[73]=1, 73), (C[12]=2, 12), . . .
parallel SCAN B:             (B[1], 73), (B[2], 12), . . .
SORT (by 2nd component):     (B[C[1]], 1), (B[C[2]], 2), . . .
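An in-memory illustration of the same pipeline (a sketch with 0-based indices and my own function name; externally, every SCAN is a linear pass and every SORT an external merge sort, so the total cost is O(sort(N)) instead of N random I/Os):

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

std::vector<int> gatherBySortAndScan(const std::vector<int>& B,
                                     const std::vector<int>& C) {
    std::size_t N = C.size();
    std::vector<std::pair<int, int>> req(N);
    for (std::size_t i = 0; i < N; ++i)       // SCAN C: emit (C[i], i)
        req[i] = {C[i], int(i)};
    std::sort(req.begin(), req.end());        // SORT by first component
    for (auto& r : req)                       // parallel SCAN of B: (B[C[i]], i)
        r.first = B[r.first];                 // accesses to B are now sequential
    std::sort(req.begin(), req.end(),         // SORT back by original index i
              [](auto& x, auto& y) { return x.second < y.second; });
    std::vector<int> A(N);
    for (std::size_t i = 0; i < N; ++i)       // final SCAN writes the result
        A[i] = req[i].first;
    return A;
}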
We already saw some I/O efficient algorithms using this model in previous chapters: Chapter 2 presented an external stack, a large section of chapter 3 dealt with external sort- ing and in chapter 4 we saw external priority queues. Chapter 8 will present an external approach to minimal spanning trees. In this chapter, we will see some more algorithms, study how these algorithms and data structures can be implemented in a convenient way using an algorithm library and learn about other models of external computation.
5.3 The Stxxl library
The Stxxl library is an algorithm library that aims to speed up the process of implementing I/O-efficient algorithms by abstracting away the details of how I/O is performed. Many high-performance features are supported: disk parallelism, explicit overlapping of I/O and computation, and more.
Figure: The layered design of Stxxl. Applications use the STL-user layer (containers: vector, stack, set, priority_queue, map; algorithms: sort, for_each, merge) and the streaming layer (pipelined sorting, zero-I/O scanning); both build on the block management (BM) layer (typed block, block manager, buffered streams, block prefetcher, buffered block writer), which in turn uses the asynchronous I/O primitives (AIO) layer (files, I/O requests, disk queues, completion handlers).
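As a taste of the STL-user layer, a minimal usage sketch in the style of the classic STXXL interface (an assumption-laden illustration, not documentation: the exact headers, the comparator's min_value/max_value requirement and the memory argument of stxxl::sort should be checked against the library version at hand):

#include <stxxl/vector>
#include <stxxl/sort>
#include <cstdlib>
#include <limits>

// STXXL sort comparators must, besides operator(), provide min_value() and
// max_value() sentinels used to pad blocks during merging.
struct CmpIntLess {
    bool operator()(const int& a, const int& b) const { return a < b; }
    int min_value() const { return std::numeric_limits<int>::min(); }
    int max_value() const { return std::numeric_limits<int>::max(); }
};

int main() {
    stxxl::vector<int> v;                 // an external-memory vector
    for (int i = 0; i < 100 * 1024 * 1024; ++i)
        v.push_back(std::rand());         // writes go to disk block-wise
    // external sort using 512 MiB of internal memory
    stxxl::sort(v.begin(), v.end(), CmpIntLess(), 512 * 1024 * 1024);
}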