Chapter 3 Cache-Oblivious Algorithms


Portable High-Performance Programs
by Matteo Frigo
Laurea, Università di Padova (1992)
Dottorato di Ricerca, Università di Padova (1996)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, June 1999.

© Matteo Frigo, MCMXCIX. All rights reserved. The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part, and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, June 22, 1999
Certified by: Charles E. Leiserson, Professor of Computer Science and Engineering, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Departmental Committee on Graduate Students

Copyright © 1999 Matteo Frigo. Permission is granted to make and distribute verbatim copies of this thesis provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this thesis under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this thesis into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the Free Software Foundation.

Portable High-Performance Programs
by Matteo Frigo
Submitted to the Department of Electrical Engineering and Computer Science on June 22, 1999, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

This dissertation discusses how to write computer programs that attain both high performance and portability, despite the fact that current computer systems have different degrees of parallelism, deep memory hierarchies, and diverse processor architectures.

To cope with parallelism portably in high-performance programs, we present the Cilk multithreaded programming system. In the Cilk-5 system, parallel programs scale up to run efficiently on multiple processors, but unlike existing parallel-programming environments, such as MPI and HPF, Cilk programs "scale down" to run on one processor as efficiently as a comparable C program. The typical cost of spawning a parallel thread in Cilk-5 is only between 2 and 6 times the cost of a C function call. This efficient implementation was guided by the work-first principle, which dictates that scheduling overheads should be borne by the critical path of the computation and not by the work. We show how the work-first principle inspired Cilk's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.

To cope portably with the memory hierarchy, we present asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality.
Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size Z and cache-line length L, where Z = Ω(L²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache faults.

To attain portability in the face of both parallelism and the memory hierarchy at the same time, we examine the location consistency memory model and the BACKER coherence algorithm for maintaining it. We prove good asymptotic bounds on the execution time of Cilk programs that use location-consistent shared memory.

To cope with the diversity of processor architectures, we develop the FFTW self-optimizing program, a portable C library that computes Fourier transforms. FFTW is unique in that it can automatically tune itself to the underlying hardware in order to achieve high performance. Through extensive benchmarking, FFTW has been shown to be typically faster than all other publicly available FFT software, including codes such as Sun's Performance Library and IBM's ESSL that are tuned to a specific machine. Most of the performance-critical code of FFTW was generated automatically by a special-purpose compiler written in Objective Caml, which uses symbolic evaluation and other compiler techniques to produce "codelets": optimized sequences of C code that can be assembled into "plans" to compute a Fourier transform. At runtime, FFTW measures the execution time of many plans and uses dynamic programming to select the fastest. Finally, the plan drives a special interpreter that computes the actual transforms.

Thesis Supervisor: Charles E. Leiserson
Title: Professor of Computer Science and Engineering
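To make these bounds concrete, the following is a minimal sketch, not taken from the dissertation, of the kind of divide-and-conquer matrix multiplication that achieves the Θ(mnp/(L√Z)) cache-miss term: the largest of the three dimensions is halved recursively, so some level of the recursion works on blocks that fit in cache without the code ever knowing Z or L. The names and the base-case cutoff THRESHOLD are illustrative choices, not parameters the analysis depends on.

```c
#include <stddef.h>

/* Computes C[i][j] += sum_k A[i][k] * B[k][j], where A is m x n, B is
 * n x p, and C is m x p (the caller zeroes C first).  lda, ldb, ldc are
 * the row strides of the enclosing matrices, so recursive calls can work
 * on submatrices in place.  The algorithm never consults the cache size
 * or the line length. */
#define THRESHOLD 16   /* illustrative base-case size, not a cache parameter */

static void rec_matmul(size_t m, size_t n, size_t p,
                       const double *A, size_t lda,
                       const double *B, size_t ldb,
                       double *C, size_t ldc)
{
    if (m <= THRESHOLD && n <= THRESHOLD && p <= THRESHOLD) {
        /* Base case: plain triple loop on a block that fits in cache. */
        for (size_t i = 0; i < m; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < p; j++)
                    C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
    } else if (m >= n && m >= p) {          /* split the rows of A and C */
        size_t h = m / 2;
        rec_matmul(h,     n, p, A,           lda, B, ldb, C,           ldc);
        rec_matmul(m - h, n, p, A + h * lda, lda, B, ldb, C + h * ldc, ldc);
    } else if (n >= p) {                    /* split the shared dimension n */
        size_t h = n / 2;
        rec_matmul(m, h,     p, A,     lda, B,           ldb, C, ldc);
        rec_matmul(m, n - h, p, A + h, lda, B + h * ldb, ldb, C, ldc);
    } else {                                /* split the columns of B and C */
        size_t h = p / 2;
        rec_matmul(m, n, h,     A, lda, B,     ldb, C,     ldc);
        rec_matmul(m, n, p - h, A, lda, B + h, ldb, C + h, ldc);
    }
}
```

For row-major matrices a caller would zero C and invoke rec_matmul(m, n, p, A, n, B, p, C, p); the same recursion, expressed with spawns, parallelizes naturally in Cilk.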
Contents

1 Portable high performance
  1.1 The scope of this dissertation
    1.1.1 Coping with parallelism
    1.1.2 Coping with the memory hierarchy
    1.1.3 Coping with parallelism and memory hierarchy together
    1.1.4 Coping with the processor architecture
  1.2 The methods of this dissertation
  1.3 Contributions
2 Cilk
  2.1 History of Cilk
  2.2 The Cilk language
  2.3 The work-first principle
  2.4 Example Cilk algorithms
  2.5 Cilk's compilation strategy
  2.6 Implementation of work-stealing
  2.7 Benchmarks
  2.8 Related work
  2.9 Conclusion
3 Cache-oblivious algorithms
  3.1 Matrix multiplication
  3.2 Matrix transposition and FFT
  3.3 Funnelsort
  3.4 Distribution sort
  3.5 Other cache models
    3.5.1 Two-level models
    3.5.2 Multilevel ideal caches
    3.5.3 The SUMH model
  3.6 Related work
  3.7 Conclusion
4 Portable parallel memory
  4.1 Performance model and summary of results
  4.2 Location consistency and the BACKER coherence algorithm
  4.3 Analysis of execution time
  4.4 Analysis of space utilization
  4.5 Related work
  4.6 Conclusion
5 A theory of memory models
  5.1 Computation-centric memory models
  5.2 Constructibility
  5.3 Models based on topological sorts
  5.4 Dag-consistent memory models
  5.5 Dag consistency and location consistency
  5.6 Discussion
6 FFTW
  6.1 Background
  6.2 Performance results
  6.3 FFTW's runtime structure
  6.4 The FFTW codelet generator
  6.5 Creation of the expression dag
  6.6 The simplifier
    6.6.1 What the simplifier does
    6.6.2 Implementation of the simplifier
  6.7 The scheduler
  6.8 Real and multidimensional transforms
  6.9 Pragmatic aspects of FFTW
  6.10 Related work
  6.11 Conclusion
7 Conclusion
  7.1 Future work
  7.2 Summary

Acknowledgements

This brief chapter is the most important of all. Computer programs will be outdated, and theorems will be shown to be imprecise, incorrect, or just irrelevant, but the love and dedication of all the people who knowingly or unknowingly have contributed to this work is a lasting proof that life is supposed to be beautiful, and indeed it is.

Thanks to Charles Leiserson, my most recent advisor, for being a great teacher. He is always around when you need him, and he always gets out of the way when you don't. (Almost always, that is. I wish he had not been around that day in Singapore when he convinced me to eat curried fish heads.)

I remember the first day I met Gianfranco Bilardi, my first advisor. He was having trouble with a computer, and he did not seem to understand how computers work. Later I learned that real computers are the only thing Gianfranco has trouble with. In any other branch of human knowledge he is perfectly comfortable.

Thanks to Arvind and Martin Rinard for serving on my thesis committee. Arvind and his student Jan-Willem Maessen acquainted me with functional programming, and they had a strong influence on my coding style and philosophy. Thanks to Toni Mian for first introducing me to Fourier transforms. Thanks to Alan Edelman for teaching me numerical analysis and algorithms. Thanks to Guy Steele and Gerry Sussman for writing the papers from which I learned what computer science is all about.

It was a pleasure to develop Cilk together with Keith Randall, one of the most talented hackers I have ever met. Thanks to Steven Johnson for sharing the burden of developing FFTW, and for many joyful moments. Volker Strumpen influenced many of my current thoughts about computer science as well as much of my personality. From him I learned a lot about computer systems.

Members of the Cilk group were a constant source of inspiration, hacks, and fun. Over the years, I was honored to work with Bobby Blumofe, Guang-Ien Cheng, Don Dailey, Mingdong Feng, Chris Joerg, Bradley Kuszmaul, Phil Lisiecki, Alberto Medina, Rob Miller, Aske Plaat, Harald Prokop, Sridhar Ramachandran, Bin Song, Andrew Stark, and Yuli Zhou. Thanks to my officemates, Derek Chiou and James Hoe, for many hours of helpful and enjoyable conversations. Thanks to Tom Toffoli for hosting me in his house when I first arrived in Boston.
Recommended publications
  • Cache-Oblivious Algorithms (Extended Abstract), Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran
    Cache-Oblivious Algorithms (Extended Abstract). Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran. MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139.

    Abstract: This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size Z and cache-line length L, where Z = Ω(L²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache faults. We introduce an "ideal-cache" model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm [...] We shall assume that word size is constant; the particular constant does not affect our asymptotic analyses.

    [Figure 1: The ideal-cache model. A CPU performing W work is connected to an arbitrarily large main memory through an ideal cache of Z words, organized into Z/L lines of length L and managed with an optimal replacement strategy; Q counts the cache misses.]
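The ideal-cache model just described is parameterized only by the cache size Z and the line length L, and charges one miss whenever a referenced line is absent. As a hedged illustration, not code from the paper, the sketch below counts the misses Q that an access trace incurs in a small fully associative cache; it substitutes LRU for the model's optimal replacement (the usual constant-factor stand-in), and the trace in main() is purely illustrative.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy simulator for counting cache misses Q in a (Z, L) cache.  The cache
 * is fully associative and holds Z/L lines of L words each.  The ideal
 * cache uses optimal (clairvoyant) replacement; this sketch uses LRU. */
typedef struct {
    size_t Z, L, nlines;
    long *tag;      /* line tag = word address / L, or -1 if empty */
    size_t *stamp;  /* last-use time stamp, for LRU eviction       */
    size_t clock, misses;
} cache_t;

static cache_t *cache_new(size_t Z, size_t L) {
    cache_t *c = malloc(sizeof *c);
    c->Z = Z; c->L = L; c->nlines = Z / L;
    c->tag = malloc(c->nlines * sizeof *c->tag);
    c->stamp = calloc(c->nlines, sizeof *c->stamp);
    for (size_t i = 0; i < c->nlines; i++) c->tag[i] = -1;
    c->clock = 0; c->misses = 0;
    return c;
}

/* Simulate one access to word address a; count a miss if its line is absent. */
static void cache_access(cache_t *c, size_t a) {
    long want = (long)(a / c->L);
    size_t victim = 0;
    c->clock++;
    for (size_t i = 0; i < c->nlines; i++) {
        if (c->tag[i] == want) {             /* hit: refresh its time stamp */
            c->stamp[i] = c->clock;
            return;
        }
        if (c->stamp[i] < c->stamp[victim])  /* remember least recently used */
            victim = i;
    }
    c->misses++;                             /* miss: evict the LRU line */
    c->tag[victim] = want;
    c->stamp[victim] = c->clock;
}

int main(void) {
    /* Example: stream over 2^20 words with Z = 32768 words and L = 16;
     * a sequential scan should incur roughly n/L misses. */
    cache_t *c = cache_new(32768, 16);
    for (size_t a = 0; a < (1u << 20); a++)
        cache_access(c, a);
    printf("Q = %zu misses\n", c->misses);
    free(c->tag); free(c->stamp); free(c);
    return 0;
}
```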
  • Engineering Cache-Oblivious Sorting Algorithms
    Engineering Cache-Oblivious Sorting Algorithms. Master's Thesis by Kristoffer Vinther, June 2003.

    Abstract: Modern computers are far more sophisticated than simple sequential programs can lead one to believe; instructions are not executed sequentially and in constant time. In particular, the memory of a modern computer is structured as a hierarchy of increasingly slower, cheaper, and larger storage. Accessing words in the lower, faster levels of this hierarchy can be done virtually immediately, but accessing the upper levels may cause delays of millions of processor cycles. Consequently, recent developments in algorithm design have focused on algorithms that minimize accesses to the higher levels of the hierarchy. Much experimental work has shown that using such algorithms can lead to higher-performing programs. However, these algorithms are designed and implemented with a very specific level in mind, making it infeasible to adapt them to multiple levels or to use them efficiently on different architectures. To alleviate this, the notion of cache-oblivious algorithms was developed. The goal of a cache-oblivious algorithm is to be optimal in its use of the memory hierarchy, but without using specific knowledge of its structure. This automatically makes the algorithm efficient on all levels of the hierarchy and on all implementations of such hierarchies. The experimental work done with these types of algorithms remains sparse, however. In this thesis, we present a thorough theoretical and experimental analysis of known optimal cache-oblivious sorting algorithms. We develop our own simpler variants and present the first optimal cache-oblivious sorting algorithm with sublinear working space.
  • Cache-Oblivious Algorithms by Harald Prokop
    Cache-Oblivious Algorithms, by Harald Prokop. Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, June 1999. © Massachusetts Institute of Technology 1999. All rights reserved.

    Author: Department of Electrical Engineering and Computer Science, May 21, 1999
    Certified by: Charles E. Leiserson, Professor of Computer Science and Engineering, Thesis Supervisor
    Accepted by: Arthur C. Smith, Chairman, Departmental Committee on Graduate Students

    Cache-Oblivious Algorithms, by Harald Prokop. Submitted to the Department of Electrical Engineering and Computer Science on May 21, 1999, in partial fulfillment of the requirements for the degree of Master of Science.

    Abstract: This thesis presents "cache-oblivious" algorithms that use asymptotically optimal amounts of work, and move data asymptotically optimally among multiple levels of cache. An algorithm is cache oblivious if no program variables dependent on hardware configuration parameters, such as cache size and cache-line length, need to be tuned to minimize the number of cache misses. We show that the ordinary algorithms for matrix transposition, matrix multiplication, sorting, and Jacobi-style multipass filtering are not cache optimal. We present algorithms for rectangular matrix transposition, FFT, sorting, and multipass filters, which are asymptotically optimal on computers with multiple levels of caches. For a cache with size Z and cache-line length L, where Z = Ω(L²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)).
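To illustrate the recursive structure behind the transpose bound above, here is a minimal sketch, not taken from the thesis, of a cache-oblivious transpose: the larger dimension is halved until the blocks are small, so at some recursion depth every block fits in any cache regardless of its Z and L. The cutoff of 16 only amortizes the recursion overhead; it is not a cache parameter.

```c
#include <stddef.h>

/* Transpose the m x n block A (row stride lda) into the n x m block B
 * (row stride ldb), i.e. B[j][i] = A[i][j].  Divide and conquer on the
 * larger dimension; the code never consults cache size or line length. */
static void rec_transpose(size_t m, size_t n,
                          const double *A, size_t lda,
                          double *B, size_t ldb)
{
    if (m <= 16 && n <= 16) {               /* small base case */
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < n; j++)
                B[j * ldb + i] = A[i * lda + j];
    } else if (m >= n) {                    /* split the rows of A */
        size_t h = m / 2;
        rec_transpose(h,     n, A,           lda, B,     ldb);
        rec_transpose(m - h, n, A + h * lda, lda, B + h, ldb);
    } else {                                /* split the columns of A */
        size_t h = n / 2;
        rec_transpose(m, h,     A,     lda, B,           ldb);
        rec_transpose(m, n - h, A + h, lda, B + h * ldb, ldb);
    }
}
```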
  • Design Implement Analyze Experiment
    Algorithm Engineering: Course Notes, by Felix Putze and Peter Sanders, TU Karlsruhe, October 19, 2009. (The cover figure shows the algorithmics cycle: design, implement, analyze, experiment.)

    Preface: These course notes cover a lecture on algorithm engineering for the basic toolbox that Peter Sanders has been giving at Universität Karlsruhe since 2004. The text is compiled from slides, scientific papers, and other manuscripts. Most of this material is in English, so that language was adopted as the main language. However, some parts are in German. The primary sources of our material are given at the beginning of each chapter. Please refer to the original publications for further references. This document is still work in progress. Please report bugs of any type (content, language, layout, ...) to [email protected]. Thank you!

    Contents: 1 What is Algorithm Engineering? (1.1 Introduction; 1.2 State of research and open problems); 2 Data Structures (2.1 Arrays & Lists; 2.2 External Lists; 2.3 Stacks, Queues & Variants); 3 Sorting (3.1 Quicksort Basics; 3.2 Refined Quicksort; 3.3 Lessons from experiments; 3.4 Super Scalar Sample Sort; 3.5 Multiway Merge Sort; 3.6 Sorting with parallel disks; 3.7 Internal work of Multiway Mergesort; 3.8 Experiments); 4 Priority Queues (4.1 Introduction; 4.2 Binary Heaps; 4.3 External Priority Queues; 4.4 Addressable Priority Queues); 5 External Memory Algorithms (5.1 Introduction; 5.2 The external memory model and things we already saw ...)
  • Cache-Oblivious and Data-Oblivious Sorting and Applications
    Cache-Oblivious and Data-Oblivious Sorting and Applications. T-H. Hubert Chan (Department of Computer Science, The University of Hong Kong), Yue Guo, Wei-Kai Lin, and Elaine Shi (Computer Science Department, Cornell University).

    Abstract: Although external-memory sorting has been a classical algorithms abstraction and has been heavily studied in the literature, perhaps somewhat surprisingly, when data-obliviousness is a requirement, even very rudimentary questions remain open. Prior to our work, it was not even known how to construct a comparison-based, external-memory oblivious sorting algorithm that is optimal in IO-cost. We make a significant step forward in our understanding of external-memory, oblivious sorting algorithms. Not only do we construct a comparison-based, external-memory oblivious sorting algorithm that is optimal in IO-cost, our algorithm is also cache-agnostic in that the algorithm need not know the storage hierarchy's internal parameters such as the cache and cache-line sizes. Our result immediately implies a cache-agnostic ORAM construction whose asymptotic IO-cost matches the best known cache-aware scheme. Last but not least, we propose and adopt a new and stronger security notion for external-memory, oblivious algorithms and argue that this new notion is desirable for resisting possible cache-timing attacks. Thus our work also lays a foundation for the study of oblivious algorithms in the cache-agnostic model.

    1 Introduction: In data-oblivious algorithms, a client (also known as a CPU) performs some computation over sensitive data, such that a possibly computationally bounded adversary, who can observe data access patterns, gains no information about the input or the data.
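For context on what "data-oblivious" means in practice, a classical example of an oblivious comparison-based sort is a sorting network such as bitonic sort: the sequence of compare-exchange positions is fixed by n alone and never depends on the values, so an adversary who sees the access pattern learns nothing about the input. The sketch below is a textbook bitonic network for power-of-two n, shown only to illustrate obliviousness; it is not the paper's construction and has none of its IO-optimality or cache-agnostic guarantees.

```c
#include <stdio.h>

/* Compare-exchange: the pair of positions touched never depends on the
 * data, only on the indices, so the access pattern is fixed for a given n. */
static void cmp_exchange(int *a, int i, int j, int ascending) {
    if ((a[i] > a[j]) == ascending) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* Iterative bitonic sorting network; n must be a power of two. */
static void bitonic_sort(int *a, int n) {
    for (int k = 2; k <= n; k <<= 1)           /* size of bitonic sequences */
        for (int j = k >> 1; j > 0; j >>= 1)   /* compare distance */
            for (int i = 0; i < n; i++) {
                int partner = i ^ j;
                if (partner > i)
                    cmp_exchange(a, i, partner, (i & k) == 0);
            }
}

int main(void) {
    int a[8] = {5, 1, 7, 3, 8, 2, 6, 4};
    bitonic_sort(a, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);
    printf("\n");   /* prints: 1 2 3 4 5 6 7 8 */
    return 0;
}
```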