Cache-Oblivious Algorithms
EXTENDED ABSTRACT

Matteo Frigo   Charles E. Leiserson   Harald Prokop   Sridhar Ramachandran
MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139

This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F30602-97-1-0270. Matteo Frigo was supported in part by a Digital Equipment Corporation fellowship.

Abstract

This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size Z and cache-line length L where Z = Ω(L²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache faults.

We introduce an "ideal-cache" model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.

1. Introduction

Resource-oblivious algorithms that nevertheless use resources efficiently offer advantages of simplicity and portability over resource-aware algorithms whose resource usage must be programmed explicitly. In this paper, we study cache resources, specifically, the hierarchy of memories in modern computers. We exhibit several "cache-oblivious" algorithms that use cache as effectively as "cache-aware" algorithms.

Before discussing the notion of cache obliviousness, we first introduce the (Z, L) ideal-cache model to study the cache complexity of algorithms. This model, which is illustrated in Figure 1, consists of a computer with a two-level memory hierarchy consisting of an ideal (data) cache of Z words and an arbitrarily large main memory. Because the actual size of words in a computer is typically a small, fixed size (4 bytes, 8 bytes, etc.), we shall assume that word size is constant; the particular constant does not affect our asymptotic analyses. The cache is partitioned into cache lines, each consisting of L consecutive words which are always moved together between cache and main memory. Cache designers typically use L > 1, banking on spatial locality to amortize the overhead of moving the cache line. We shall generally assume in this paper that the cache is tall:

    Z = Ω(L²) ,                                                    (1)

which is usually true in practice.

[Figure 1: The ideal-cache model. A CPU with work measure W and an arbitrarily large main memory communicate through a cache of Z words, organized by an optimal replacement strategy into Z/L lines of length L; the measure Q counts cache misses.]

The processor can only reference words that reside in the cache. If the referenced word belongs to a line already in cache, a cache hit occurs, and the word is delivered to the processor. Otherwise, a cache miss occurs, and the line is fetched into the cache. The ideal cache is fully associative [20, Ch. 5]: cache lines can be stored anywhere in the cache. If the cache is full, a cache line must be evicted. The ideal cache uses the optimal off-line strategy of replacing the cache line whose next access is furthest in the future [7], and thus it exploits temporal locality perfectly.

Unlike various other hierarchical-memory models [1, 2, 5, 8] in which algorithms are analyzed in terms of a single measure, the ideal-cache model uses two measures. An algorithm with an input of size n is measured by its work complexity W(n), that is, its conventional running time in a RAM model [4], and by its cache complexity Q(n; Z, L), the number of cache misses it incurs as a function of the size Z and line length L of the ideal cache. When Z and L are clear from context, we denote the cache complexity simply as Q(n) to ease notation.

We define an algorithm to be cache aware if it contains parameters (set at either compile-time or runtime) that can be tuned to optimize the cache complexity for the particular cache size and line length. Otherwise, the algorithm is cache oblivious. Historically, good performance has been obtained using cache-aware algorithms, but we shall exhibit several optimal¹ cache-oblivious algorithms.

To illustrate the notion of cache awareness, consider the problem of multiplying two n × n matrices A and B to produce their n × n product C. We assume that the three matrices are stored in row-major order, as shown in Figure 2(a). We further assume that n is "big," i.e., n > L, in order to simplify the analysis. The conventional way to multiply matrices on a computer with caches is to use a blocked algorithm [19, p. 45]. The idea is to view each matrix M as consisting of (n/s) × (n/s) submatrices M_ij (the blocks), each of which has size s × s, where s is a tuning parameter. The following algorithm implements this strategy:

BLOCK-MULT(A, B, C, n)
1  for i ← 1 to n/s
2       do for j ← 1 to n/s
3            do for k ← 1 to n/s
4                 do ORD-MULT(A_ik, B_kj, C_ij, s)

The ORD-MULT(A, B, C, s) subroutine computes C ← C + AB on s × s matrices using the ordinary O(s³) algorithm. (This algorithm assumes for simplicity that s evenly divides n, but in practice s and n need have no special relationship, yielding more complicated code in the same spirit.)

Depending on the cache size of the machine on which BLOCK-MULT is run, the parameter s can be tuned to make the algorithm run fast, and thus BLOCK-MULT is a cache-aware algorithm. To minimize the cache complexity, we choose s to be the largest value such that the three s × s submatrices simultaneously fit in cache. An s × s submatrix is stored on Θ(s + s²/L) cache lines. From the tall-cache assumption (1), we can see that s = Θ(√Z). Thus, each of the calls to ORD-MULT runs with at most Θ(Z/L) = Θ(s²/L) cache misses needed to bring the three matrices into the cache. Consequently, the cache complexity of the entire algorithm is Θ(1 + n²/L + (n/√Z)³(Z/L)) = Θ(1 + n²/L + n³/(L√Z)), since the algorithm has to read n² elements, which reside on ⌈n²/L⌉ cache lines.

The same bound can be achieved using a simple cache-oblivious algorithm that requires no tuning parameters such as the s in BLOCK-MULT. We present such an algorithm, which works on general rectangular matrices, in Section 2. The problems of computing a matrix transpose and of performing an FFT also succumb to remarkably simple algorithms, which are described in Section 3. Cache-oblivious sorting poses a more formidable challenge. In Sections 4 and 5, we present two sorting algorithms, one based on mergesort and the other on distribution sort, both of which are optimal in both work and cache misses.

The ideal-cache model makes the perhaps-questionable assumptions that there are only two levels in the memory hierarchy, that memory is managed automatically by an optimal cache-replacement strategy, and that the cache is fully associative. We address these assumptions in Section 6, showing that to a certain extent, these assumptions entail no loss of generality. Section 7 discusses related work, and Section 8 offers some concluding remarks, including some preliminary empirical results.

2. Matrix multiplication

This section describes and analyzes an algorithm for multiplying an m × n matrix by an n × p matrix cache-obliviously using Θ(mnp) work and incurring Θ(m + n + p + (mn + np + mp)/L + mnp/(L√Z)) cache misses. These results require the tall-cache assumption (1) for matrices stored in row-major layout format, but the assumption can be relaxed for certain other layouts. We also show that Strassen's algorithm [31] for multiplying n × n matrices, which uses Θ(n^lg 7) work, incurs Θ(1 + n²/L + n^lg 7/(L√Z)) cache misses.

In [9] with others, two of the present authors analyzed an optimal divide-and-conquer algorithm for n × n matrix multiplication that contained no tuning parameters, but we did not study cache-obliviousness per se. That algorithm can be extended to multiply rectangular matrices. To multiply an m × n matrix A and an n × p matrix B, the REC-MULT algorithm halves the largest of the three dimensions and recurs according to one of the following three cases:

    [ A1 ]        [ A1 B ]
    [    ] B   =  [      ] ,                                       (2)
    [ A2 ]        [ A2 B ]

                [ B1 ]
    [ A1 A2 ]   [    ]   =  A1 B1 + A2 B2 ,                        (3)
                [ B2 ]

    A [ B1 B2 ]  =  [ A B1   A B2 ] .                              (4)
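The cache-miss measure Q of the ideal-cache model can be made concrete with a small simulator. The sketch below is our own illustration, not code from the paper: it counts misses for a fully associative cache of Z words with L-word lines, using LRU replacement, which the paper shows can efficiently stand in for the optimal off-line strategy.

```python
from collections import OrderedDict

def count_misses(trace, Z, L):
    """Count cache misses for a word-address trace on a fully
    associative cache of Z words with lines of L words, under LRU
    replacement (a stand-in for the ideal cache's optimal policy)."""
    assert Z % L == 0
    capacity = Z // L           # the cache holds Z/L lines
    lines = OrderedDict()       # line number -> None, kept in LRU order
    misses = 0
    for addr in trace:
        line = addr // L        # words addr rounded down to a line
        if line in lines:
            lines.move_to_end(line)        # hit: now most recently used
        else:
            misses += 1                    # miss: fetch the line
            lines[line] = None
            if len(lines) > capacity:
                lines.popitem(last=False)  # evict least recently used
    return misses

# Scanning n consecutive words incurs n/L misses: spatial locality
# amortizes one miss over L words, as the tall-cache discussion assumes.
n, Z, L = 4096, 256, 16
assert count_misses(range(n), Z, L) == n // L
```

Replaying the address trace of an algorithm through such a counter is one way to measure Q(n; Z, L) empirically for small inputs.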
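The cache-aware BLOCK-MULT pseudocode translates directly into code. Here is a minimal Python sketch under the same simplifying assumption (s evenly divides n); addressing blocks by index offsets rather than copying submatrices is a choice of this sketch, not something the paper prescribes.

```python
def ord_mult(A, B, C, i0, j0, k0, s):
    """ORD-MULT: C_block += A_block * B_block on s x s submatrices,
    using the ordinary O(s^3) algorithm.  (i0, k0), (k0, j0), and
    (i0, j0) locate the blocks of A, B, and C respectively."""
    for i in range(s):
        for j in range(s):
            acc = C[i0 + i][j0 + j]
            for k in range(s):
                acc += A[i0 + i][k0 + k] * B[k0 + k][j0 + j]
            C[i0 + i][j0 + j] = acc

def block_mult(A, B, C, n, s):
    """BLOCK-MULT: multiply n x n matrices block by block.  The tuning
    parameter s is what makes this algorithm cache *aware*: it must be
    chosen (ideally s = Theta(sqrt(Z))) to fit three blocks in cache."""
    for i in range(0, n, s):
        for j in range(0, n, s):
            for k in range(0, n, s):
                ord_mult(A, B, C, i, j, k, s)

# Example: 4 x 4 matrices with 2 x 2 blocks.
n, s = 4, 2
A = [[1.0] * n for _ in range(n)]
B = [[2.0] * n for _ in range(n)]
C = [[0.0] * n for _ in range(n)]
block_mult(A, B, C, n, s)
assert all(C[i][j] == 2.0 * n for i in range(n) for j in range(n))
```

On a real machine one would tune s per cache level; the point of the rest of the paper is that the recursive algorithm below needs no such parameter.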
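The three REC-MULT cases (2)-(4) can likewise be sketched in a few lines. The index-offset parameters below are this sketch's own device for addressing submatrices in place; the base case and the rule of halving the largest dimension follow the paper's description.

```python
def rec_mult(A, B, C, mi, ni, pi, m, n, p):
    """REC-MULT sketch: C_block += A_block * B_block, where A's m x n
    block starts at (mi, ni), B's n x p block at (ni, pi), and C's
    m x p block at (mi, pi).  Halve the largest of m, n, p and recurse;
    a 1 x 1 x 1 problem is a scalar multiply-add."""
    if m == 1 and n == 1 and p == 1:
        C[mi][pi] += A[mi][ni] * B[ni][pi]
    elif m >= n and m >= p:
        h = m // 2                     # case (2): split A's rows
        rec_mult(A, B, C, mi, ni, pi, h, n, p)
        rec_mult(A, B, C, mi + h, ni, pi, m - h, n, p)
    elif n >= p:
        h = n // 2                     # case (3): split the shared dimension;
        rec_mult(A, B, C, mi, ni, pi, m, h, p)        # A1*B1 and A2*B2
        rec_mult(A, B, C, mi, ni + h, pi, m, n - h, p)  # sum via accumulation
    else:
        h = p // 2                     # case (4): split B's columns
        rec_mult(A, B, C, mi, ni, pi, m, n, h)
        rec_mult(A, B, C, mi, ni, pi + h, m, n, p - h)

# Example: a 2 x 3 matrix times a 3 x 2 matrix.
A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 0], [0, 1], [1, 1]]
C = [[0, 0], [0, 0]]
rec_mult(A, B, C, 0, 0, 0, 2, 3, 2)
assert C == [[4, 5], [10, 11]]
```

Note the absence of any tuning parameter: once the subproblem fits in cache, every further recursive call is cache-resident, whatever Z and L happen to be.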
