Engineering Cache-Oblivious Sorting Algorithms
Total Page:16
File Type:pdf, Size:1020Kb
Engineering Cache-Oblivious Sorting Algorithms Master’s Thesis by Kristoffer Vinther June 2003 Abstract Modern computers are far more sophisticated than simple sequential programs can lead one to believe; instructions are not executed sequentially and in constant time. In particular, the memory of a modern computer is structured in a hierarchy of increasingly slower, cheaper, and larger storage. Accessing words in the lower, faster levels of this hierarchy can be done virtually immediately, but accessing the upper levels may cause delays of millions of processor cycles. Consequently, recent developments in algorithm design have had a focus on developing algorithms that sought to minimize accesses to the higher levels of the hierarchy. Much experimental work has been done showing that using these algorithms can lead to higher performing algorithms. However, these algorithms are designed and implemented with a very specific level in mind, making it infeasible to adapt them to multiple levels or use them efficiently on different architectures. To alleviate this, the notion of cache-oblivious algorithms was developed. The goal of a cache-oblivious algorithm is to be optimal in the use of the memory hierarchy, but without using specific knowledge of its structure. This automatically makes the algorithm efficient on all levels of the hierarchy and on all implementations of such hierarchies. The experimental work done with these types of algorithms remain sparse, however. In this thesis, we present a thorough theoretical and experimental analysis of known optimal cache-oblivious sorting algorithms. We develop our own simpler variants and present the first optimal sub-linear working space cache-oblivious sorting algorithm. These algorithms are implemented and highly optimized to yield high performing sorting algorithms. We then do a thorough performance investigation, comparing our algorithms with popular alternatives. This thesis is among the first to provide evidence that using cache-oblivious algorithm designs can indeed yield superior performance. Indeed, our algorithms are able to outperform popular sorting algorithms using cache-oblivious sorting algorithms. We conclude that cache-oblivious techniques can be applied to yield significant performance gains. iii Contents Abstract iii Chapter 1 Introduction 1 1.1 Algorithm Analysis..........................................................................1 1.1.1 The Random-Access Machine Model.....................................2 1.1.2 Storage Issues..........................................................................4 1.2 Sorting in the Memory Hierarchy ....................................................4 1.3 Previous Work..................................................................................5 1.4 This Thesis.......................................................................................6 1.4.1 Main Contributions.................................................................6 1.4.2 Structure..................................................................................7 Chapter 2 Modern Processor and Memory Technology 9 2.1 Historical Background......................................................................9 2.2 Modern Processors.........................................................................10 2.2.1 Instruction Pipelining............................................................10 2.2.2 Super-Scalar Out-of-Order Processors .................................14 2.2.3 State of the Art......................................................................15 2.3 Modern Memory Subsystems.........................................................16 2.3.1 Low-level Caches..................................................................17 2.3.2 Virtual Memory ....................................................................20 2.3.3 Cache Similarities .................................................................24 2.4 Summary ........................................................................................25 Chapter 3 Theory of IO Efficiency 27 3.1 Models............................................................................................27 3.1.1 The RAM Revisited ..............................................................27 3.1.2 The External Memory Model................................................28 3.1.3 The Cache-Oblivious Model.................................................32 3.2 External Memory Sorting...............................................................36 3.2.1 Lower Bound ........................................................................36 3.2.2 Multiway Merge Sort............................................................38 3.2.3 Distribution Sort....................................................................40 3.2.4 Cache-Oblivious Sorting.......................................................40 Chapter 4 Cache-Oblivious Sorting Algorithms 43 4.1 Funnelsort.......................................................................................43 4.1.1 Merging with Funnels...........................................................43 4.1.2 Funnel Analysis ....................................................................49 4.1.1 The Sorting Algorithm..........................................................54 4.2 LOWSCOSA..................................................................................56 4.2.1 Refilling ................................................................................57 v 4.2.2 Sorting .................................................................................. 57 Chapter 5 Engineering the Algorithms 61 5.1 Measuring Performance................................................................. 61 5.1.1 Programming Language....................................................... 62 5.1.2 Benchmark Platforms........................................................... 62 5.1.3 Data Types............................................................................ 63 5.1.4 Performance Metrics ............................................................ 64 5.1.5 Validity................................................................................. 65 5.1.6 Presenting the Results .......................................................... 66 5.1.7 Engineering Effort Evaluation.............................................. 66 5.2 Implementation Structure.............................................................. 67 5.2.1 Iterators................................................................................. 67 5.2.2 Streams ................................................................................. 68 5.2.3 Mergers................................................................................. 68 5.3 Funnel ............................................................................................ 71 5.3.1 Merge Tree........................................................................... 71 5.3.2 Layout................................................................................... 72 5.3.3 Navigation ............................................................................ 76 5.3.4 Basic Mergers....................................................................... 88 5.4 Funnelsort ...................................................................................... 99 5.4.1 Workspace Recycling........................................................... 99 5.4.2 Merger Caching.................................................................. 100 5.4.3 Base Sorting Algorithms .................................................... 103 5.4.4 Buffer Sizes........................................................................ 106 5.5 LOWSCOSA ............................................................................... 108 5.5.1 Partitioning ......................................................................... 108 5.5.2 Strategy for Handling Uneven Partitions ........................... 109 5.5.3 Performance Expectation. .................................................. 111 Chapter 6 Experimental Results 113 6.1 Methodology................................................................................ 113 6.1.1 The Sorting Problem .......................................................... 113 6.1.2 Competitors ........................................................................ 114 6.1.3 Benchmark Procedure ........................................................ 116 6.2 Straight Sorting............................................................................ 117 6.2.1 Key/Pointer pairs................................................................ 118 6.2.2 Integer Keys ....................................................................... 124 6.2.3 Records............................................................................... 125 6.3 Special Cases............................................................................... 127 6.3.1 Almost Sorted..................................................................... 127 6.3.2 Few Distinct Elements........................................................ 128 Chapter 7 Conclusion 131 7.1 Further Improvements of the Implementation............................. 132 7.2 Further Work................................................................................ 132 Appendix A Electronic Resources 133 A.1 The Thesis.................................................................................... 133 A.2 Source Code................................................................................