Cache-Oblivious Algorithms
EXTENDED ABSTRACT

Matteo Frigo   Charles E. Leiserson   Harald Prokop   Sridhar Ramachandran
MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139

Abstract

This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size Z and cache-line length L where Z = Ω(L²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache faults.

We introduce an "ideal-cache" model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.

[Figure 1: The ideal-cache model. A CPU performing W work operates on a cache of Z words organized into Z/L lines of length L; Q cache misses move lines between the cache, which uses an optimal replacement strategy, and an arbitrarily large main memory.]

1. Introduction

Resource-oblivious algorithms that nevertheless use resources efficiently offer advantages of simplicity and portability over resource-aware algorithms whose resource usage must be programmed explicitly. In this paper, we study cache resources, specifically, the hierarchy of memories in modern computers. We exhibit several "cache-oblivious" algorithms that use cache as effectively as "cache-aware" algorithms.

Before discussing the notion of cache obliviousness, we first introduce the (Z, L) ideal-cache model to study the cache complexity of algorithms. This model, which is illustrated in Figure 1, consists of a computer with a two-level memory hierarchy consisting of an ideal (data) cache of Z words and an arbitrarily large main memory. Because the actual size of words in a computer is typically a small, fixed size (4 bytes, 8 bytes, etc.), we shall assume that word size is constant; the particular constant does not affect our asymptotic analyses. The cache is partitioned into cache lines, each consisting of L consecutive words which are always moved together between cache and main memory. Cache designers typically use L > 1, banking on spatial locality to amortize the overhead of moving the cache line. We shall generally assume in this paper that the cache is tall:

    Z = Ω(L²),    (1)

which is usually true in practice.

The processor can only reference words that reside in the cache. If the referenced word belongs to a line already in cache, a cache hit occurs, and the word is delivered to the processor. Otherwise, a cache miss occurs, and the line is fetched into the cache. The ideal cache is fully associative [20, Ch. 5]: cache lines can be stored anywhere in the cache. If the cache is full, a cache line must be evicted. The ideal cache uses the optimal off-line strategy of replacing the cache line whose next access is furthest in the future [7], and thus it exploits temporal locality perfectly.

Unlike various other hierarchical-memory models [1, 2, 5, 8] in which algorithms are analyzed in terms of a single measure, the ideal-cache model uses two measures. An algorithm with an input of size n is measured by its work complexity W(n), that is, its conventional running time in a RAM model [4], and by its cache complexity Q(n; Z, L), the number of cache misses it incurs as a function of the size Z and line length L of the ideal cache. When Z and L are clear from context, we denote the cache complexity simply as Q(n) to ease notation.
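To make the two measures concrete, the following sketch (not part of the paper) simulates a fully associative (Z, L) cache and counts Q. It substitutes LRU replacement for the optimal off-line strategy; as the abstract notes, LRU simulates the ideal cache efficiently. All names here are illustrative.

```python
# Illustrative sketch of the (Z, L) ideal-cache model, with LRU
# standing in for the optimal off-line replacement strategy.
from collections import OrderedDict

class Cache:
    def __init__(self, Z, L):
        assert Z % L == 0
        self.L = L
        self.nlines = Z // L        # cache holds Z/L lines of length L
        self.lines = OrderedDict()  # fully associative: any line anywhere
        self.Q = 0                  # cache-miss count

    def access(self, addr):
        line = addr // self.L       # L consecutive words move together
        if line in self.lines:      # cache hit
            self.lines.move_to_end(line)
        else:                       # cache miss: fetch line, evict LRU
            self.Q += 1
            if len(self.lines) == self.nlines:
                self.lines.popitem(last=False)
            self.lines[line] = True

# Scanning n contiguous words incurs roughly n/L misses.
c = Cache(Z=64, L=8)
for a in range(1000):
    c.access(a)
print(c.Q)  # 125, i.e., 1000/8
```

The sequential scan exhibits pure spatial locality: each line is missed once and then serves L - 1 hits.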
(This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F30602-97-1-0270. Matteo Frigo was supported in part by a Digital Equipment Corporation fellowship.)

We define an algorithm to be cache aware if it contains parameters (set at either compile-time or runtime) that can be tuned to optimize the cache complexity for the particular cache size and line length. Otherwise, the algorithm is cache oblivious. Historically, good performance has been obtained using cache-aware algorithms, but we shall exhibit several optimal cache-oblivious algorithms.

To illustrate the notion of cache awareness, consider the problem of multiplying two n × n matrices A and B to produce their n × n product C. We assume that the three matrices are stored in row-major order, as shown in Figure 2(a).
We further assume that n is "big," i.e., n > L, in order to simplify the analysis. The conventional way to multiply matrices on a computer with caches is to use a blocked algorithm [19, p. 45]. The idea is to view each matrix M as consisting of (n/s) × (n/s) submatrices M_ij (the blocks), each of which has size s × s, where s is a tuning parameter. The following algorithm implements this strategy:

BLOCK-MULT(A, B, C, n)
1  for i ← 1 to n/s
2    do for j ← 1 to n/s
3      do for k ← 1 to n/s
4        do ORD-MULT(A_ik, B_kj, C_ij, s)

The ORD-MULT(A, B, C, s) subroutine computes C ← C + AB on s × s matrices using the ordinary O(s³) algorithm. (This algorithm assumes for simplicity that s evenly divides n, but in practice s and n need have no special relationship, yielding more complicated code in the same spirit.)
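The pseudocode above can be sketched as runnable Python, again assuming (as the pseudocode does) that the tuning parameter s divides n evenly; the matrix representation as nested lists is a simplification for illustration.

```python
# Sketch of the cache-aware BLOCK-MULT; s is the tuning parameter,
# assumed to divide n evenly as in the pseudocode.
def ord_mult(A, B, C, i0, k0, j0, s):
    # Ordinary O(s^3) multiply-accumulate on s x s blocks:
    # C[i0:i0+s, j0:j0+s] += A[i0:i0+s, k0:k0+s] * B[k0:k0+s, j0:j0+s]
    for i in range(s):
        for j in range(s):
            acc = C[i0 + i][j0 + j]
            for k in range(s):
                acc += A[i0 + i][k0 + k] * B[k0 + k][j0 + j]
            C[i0 + i][j0 + j] = acc

def block_mult(A, B, C, n, s):
    # Triply nested loop over the (n/s) x (n/s) grid of blocks.
    for i in range(0, n, s):
        for j in range(0, n, s):
            for k in range(0, n, s):
                ord_mult(A, B, C, i, k, j, s)

n, s = 4, 2
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # identity
C = [[0] * n for _ in range(n)]
block_mult(A, B, C, n, s)
print(C == A)  # True: multiplying by the identity returns A
```

The cache-awareness lies entirely in the choice of s, which must be retuned for every cache size.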
Depending on the cache size of the machine on which BLOCK-MULT is run, the parameter s can be tuned to make the algorithm run fast, and thus BLOCK-MULT is a cache-aware algorithm. To minimize the cache complexity, we choose s to be the largest value such that the three s × s submatrices simultaneously fit in cache. An s × s submatrix is stored on Θ(s + s²/L) cache lines. From the tall-cache assumption (1), we can see that s = Θ(√Z). Thus, each of the (n/s)³ calls to ORD-MULT runs with at most Z/L = Θ(s²/L) cache misses needed to bring the three matrices into the cache. Consequently, the cache complexity of the entire algorithm is Θ(1 + n²/L + (n/√Z)³(Z/L)) = Θ(1 + n²/L + n³/(L√Z)), since the algorithm has to read n² elements, which reside on ⌈n²/L⌉ cache lines.

The same bound can be achieved using a simple cache-oblivious algorithm that requires no tuning parameters such as the s in BLOCK-MULT. We present such an algorithm, which works on general rectangular matrices, in Section 2. The problems of computing a matrix transpose and of performing an FFT also succumb to remarkably simple algorithms, which are described in Section 3. Cache-oblivious sorting poses a more formidable challenge. In Sections 4 and 5, we present two sorting algorithms, one based on mergesort and the other on distribution sort, both of which are optimal in both work and cache misses.

The ideal-cache model makes the perhaps-questionable assumptions that there are only two levels in the memory hierarchy, that memory is managed automatically by an optimal cache-replacement strategy, and that the cache is fully associative. We address these assumptions in Section 6, showing that to a certain extent, these assumptions entail no loss of generality. Section 7 discusses related work, and Section 8 offers some concluding remarks, including some preliminary empirical results.

2. Matrix multiplication

This section describes and analyzes a cache-oblivious algorithm for multiplying an m × n matrix by an n × p matrix using Θ(mnp) work and incurring Θ(m + n + p + (mn + np + mp)/L + mnp/(L√Z)) cache misses. These results require the tall-cache assumption (1) for matrices stored in row-major layout format, but the assumption can be relaxed for certain other layouts. We also show that Strassen's algorithm [31] for multiplying n × n matrices, which uses Θ(n^lg 7) work, incurs Θ(1 + n²/L + n^lg 7/(L√Z)) cache misses.

In [9] with others, two of the present authors analyzed an optimal divide-and-conquer algorithm for n × n matrix multiplication that contained no tuning parameters, but we did not study cache-obliviousness per se. That algorithm can be extended to multiply rectangular matrices. To multiply an m × n matrix A and an n × p matrix B, the REC-MULT algorithm halves the largest of the three dimensions and recurs according to one of the following three cases:

    (A1; A2) B = (A1 B; A2 B),    (2)
    (A1  A2)(B1; B2) = A1 B1 + A2 B2,    (3)
    A (B1  B2) = (A B1  A B2),    (4)

where a semicolon denotes vertical stacking: in case (2) A is split into the stacked blocks A1 and A2, in case (3) A is split vertically and B horizontally, and in case (4) B is split vertically.
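The three-way recursion above can be sketched as follows, under simplifying assumptions not made by the paper: matrices are nested Python lists, the recursion works on explicit index ranges, and the base case is a single scalar multiply-add (a practical implementation would coarsen the base case).

```python
# Sketch of REC-MULT: halve the largest of m, n, p and recurse
# according to cases (2)-(4). Computes
# C[mi:mi+m, pi:pi+p] += A[mi:mi+m, ni:ni+n] * B[ni:ni+n, pi:pi+p].
def rec_mult(A, B, C, mi, m, ni, n, pi, p):
    if m == 1 and n == 1 and p == 1:
        C[mi][pi] += A[mi][ni] * B[ni][pi]
    elif m >= n and m >= p:          # case (2): split A into A1; A2
        rec_mult(A, B, C, mi, m // 2, ni, n, pi, p)
        rec_mult(A, B, C, mi + m // 2, m - m // 2, ni, n, pi, p)
    elif n >= m and n >= p:          # case (3): A1 B1 + A2 B2
        rec_mult(A, B, C, mi, m, ni, n // 2, pi, p)
        rec_mult(A, B, C, mi, m, ni + n // 2, n - n // 2, pi, p)
    else:                            # case (4): split B into B1  B2
        rec_mult(A, B, C, mi, m, ni, n, pi, p // 2)
        rec_mult(A, B, C, mi, m, ni, n, pi + p // 2, p - p // 2)

A = [[1, 2, 3], [4, 5, 6]]           # 2 x 3
B = [[1, 0], [0, 1], [1, 1]]         # 3 x 2
C = [[0, 0], [0, 0]]
rec_mult(A, B, C, 0, 2, 0, 3, 0, 2)
print(C)  # [[4, 5], [10, 11]]
```

No cache parameter appears anywhere: the recursion refines the problem until the submatrices fit in cache, whatever Z and L happen to be, which is exactly what makes the algorithm cache oblivious.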