Chapter 3 Cache-Oblivious Algorithms
Portable High-Performance Programs

by Matteo Frigo

Laurea, Università di Padova (1992)
Dottorato di Ricerca, Università di Padova (1996)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, June 1999.

© Matteo Frigo, MCMXCIX. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part, and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, June 22, 1999

Certified by: Charles E. Leiserson, Professor of Computer Science and Engineering, Thesis Supervisor

Accepted by: Arthur C. Smith, Departmental Committee on Graduate Students

Copyright © 1999 Matteo Frigo. Permission is granted to make and distribute verbatim copies of this thesis provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this thesis under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this thesis into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the Free Software Foundation.
Portable High-Performance Programs
by Matteo Frigo

Submitted to the Department of Electrical Engineering and Computer Science on June 22, 1999, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

This dissertation discusses how to write computer programs that attain both high performance and portability, despite the fact that current computer systems have different degrees of parallelism, deep memory hierarchies, and diverse processor architectures.

To cope with parallelism portably in high-performance programs, we present the Cilk multithreaded programming system. In the Cilk-5 system, parallel programs scale up to run efficiently on multiple processors, but unlike existing parallel-programming environments, such as MPI and HPF, Cilk programs "scale down" to run on one processor as efficiently as a comparable C program. The typical cost of spawning a parallel thread in Cilk-5 is only between 2 and 6 times the cost of a C function call. This efficient implementation was guided by the work-first principle, which dictates that scheduling overheads should be borne by the critical path of the computation and not by the work. We show how the work-first principle inspired Cilk's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.

To cope portably with the memory hierarchy, we present asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache.
For a cache of size Z with cache-line length L, where Z = Ω(L²), the number of cache misses for an m x n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)). We also give a Θ(mnp)-work algorithm to multiply an m x n matrix by an n x p matrix that incurs Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache faults.

To attain portability in the face of both parallelism and the memory hierarchy at the same time, we examine the location consistency memory model and the BACKER coherence algorithm for maintaining it. We prove good asymptotic bounds on the execution time of Cilk programs that use location-consistent shared memory.

To cope with the diversity of processor architectures, we develop the FFTW self-optimizing program, a portable C library that computes Fourier transforms. FFTW is unique in that it can automatically tune itself to the underlying hardware in order to achieve high performance. Through extensive benchmarking, FFTW has been shown to be typically faster than all other publicly available FFT software, including codes such as Sun's Performance Library and IBM's ESSL that are tuned to a specific machine. Most of the performance-critical code of FFTW was generated automatically by a special-purpose compiler written in Objective Caml, which uses symbolic evaluation and other compiler techniques to produce "codelets"-optimized sequences of C code that can be assembled into "plans" to compute a Fourier transform. At runtime, FFTW measures the execution time of many plans and uses dynamic programming to select the fastest. Finally, the plan drives a special interpreter that computes the actual transforms.

Thesis Supervisor: Charles E. Leiserson
Title: Professor of Computer Science and Engineering

Contents

1 Portable high performance
1.1 The scope of this dissertation
1.1.1 Coping with parallelism
1.1.2 Coping with the memory hierarchy
1.1.3 Coping with parallelism and memory hierarchy together
1.1.4 Coping with the processor architecture
1.2 The methods of this dissertation
1.3 Contributions
2 Cilk
2.1 History of Cilk
2.2 The Cilk language
2.3 The work-first principle
2.4 Example Cilk algorithms
2.5 Cilk's compilation strategy
2.6 Implementation of work-stealing
2.7 Benchmarks
2.8 Related work
2.9 Conclusion
3 Cache-oblivious algorithms
3.1 Matrix multiplication
3.2 Matrix transposition and FFT
3.3 Funnelsort
3.4 Distribution sort
3.5 Other cache models
3.5.1 Two-level models
3.5.2 Multilevel ideal caches
3.5.3 The SUMH model
3.6 Related work
3.7 Conclusion
4 Portable parallel memory
4.1 Performance model and summary of results
4.2 Location consistency and the BACKER coherence algorithm
4.3 Analysis of execution time
4.4 Analysis of space utilization
4.5 Related work
4.6 Conclusion
5 A theory of memory models
5.1 Computation-centric memory models
5.2 Constructibility
5.3 Models based on topological sorts
5.4 Dag-consistent memory models
5.5 Dag consistency and location consistency
5.6 Discussion
6 FFTW
6.1 Background
6.2 Performance results
6.3 FFTW's runtime structure
6.4 The FFTW codelet generator
6.5 Creation of the expression dag
6.6 The simplifier
6.6.1 What the simplifier does
6.6.2 Implementation of the simplifier
6.7 The scheduler
6.8 Real and multidimensional transforms
6.9 Pragmatic aspects of FFTW
6.10 Related work
6.11 Conclusion
7 Conclusion
7.1 Future work
7.2 Summary

Acknowledgements

This brief chapter is the most important of all. Computer programs will become outdated, and theorems will be shown to be imprecise, incorrect, or just irrelevant, but the love and dedication of all the people who knowingly or unknowingly have contributed to this work is a lasting proof that life is supposed to be beautiful, and indeed it is.

Thanks to Charles Leiserson, my most recent advisor, for being a great teacher. He is always around when you need him, and he always gets out of the way when you don't. (Almost always, that is. I wish he had not been around that day in Singapore when he convinced me to eat curried fish heads.)

I remember the first day I met Gianfranco Bilardi, my first advisor. He was having trouble with a computer, and he did not seem to understand how computers work. Later I learned that real computers are the only thing Gianfranco has trouble with. In any other branch of human knowledge he is perfectly comfortable.

Thanks to Arvind and Martin Rinard for serving on my thesis committee. Arvind and his student Jan-Willem Maessen acquainted me with functional programming, and they had a strong influence on my coding style and philosophy. Thanks to Toni Mian for first introducing me to Fourier transforms. Thanks to Alan Edelman for teaching me numerical analysis and algorithms. Thanks to Guy Steele and Gerry Sussman for writing the papers from which I learned what computer science is all about.

It was a pleasure to develop Cilk together with Keith Randall, one of the most talented hackers I have ever met. Thanks to Steven Johnson for sharing the burden of developing FFTW, and for many joyful moments.
Volker Strumpen influenced many of my current thoughts about computer science as well as much of my personality. From him I learned a lot about computer systems.

Members of the Cilk group were a constant source of inspiration, hacks, and fun. Over the years, I was honored to work with Bobby Blumofe, Guang-Ien Cheng, Don Dailey, Mingdong Feng, Chris Joerg, Bradley Kuszmaul, Phil Lisiecki, Alberto Medina, Rob Miller, Aske Plaat, Harald Prokop, Sridhar Ramachandran, Bin Song, Andrew Stark, and Yuli Zhou.

Thanks to my officemates, Derek Chiou and James Hoe, for many hours of helpful and enjoyable conversations. Thanks to Tom Toffoli for hosting me in his house when I first arrived in Boston.