Cache-Oblivious Algorithms
(Extended Abstract)

Matteo Frigo    Charles E. Leiserson    Harald Prokop    Sridhar Ramachandran

MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139

Abstract

This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size Z and cache-line length L, where Z = Ω(L²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache faults.

We introduce an "ideal-cache" model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.

[Figure 1: The ideal-cache model. A CPU performing work W operates on a cache of Z words organized as Z/L cache lines of length L; cache misses Q are served from an arbitrarily large main memory, and the cache is managed by an optimal replacement strategy.]

This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F30602-97-1-0270. Matteo Frigo was supported in part by a Digital Equipment Corporation fellowship.

1. Introduction

Resource-oblivious algorithms that nevertheless use resources efficiently offer advantages of simplicity and portability over resource-aware algorithms whose resource usage must be programmed explicitly. In this paper, we study cache resources, specifically, the hierarchy of memories in modern computers. We exhibit several "cache-oblivious" algorithms that use cache as effectively as "cache-aware" algorithms.

Before discussing the notion of cache obliviousness, we first introduce the (Z, L) ideal-cache model to study the cache complexity of algorithms. This model, which is illustrated in Figure 1, consists of a computer with a two-level memory hierarchy consisting of an ideal (data) cache of Z words and an arbitrarily large main memory. Because the actual size of words in a computer is typically a small, fixed size (4 bytes, 8 bytes, etc.), we shall assume that word size is constant; the particular constant does not affect our asymptotic analyses. The cache is partitioned into cache lines, each consisting of L consecutive words which are always moved together between cache and main memory. Cache designers typically use L > 1, banking on spatial locality to amortize the overhead of moving the cache line. We shall generally assume in this paper that the cache is tall:

    Z = Ω(L²) ,    (1)

which is usually true in practice.

The processor can only reference words that reside in the cache. If the referenced word belongs to a line already in cache, a cache hit occurs, and the word is delivered to the processor. Otherwise, a cache miss occurs, and the line is fetched into the cache. The ideal cache is fully associative [20, Ch. 5]: cache lines can be stored anywhere in the cache. If the cache is full, a cache line must be evicted. The ideal cache uses the optimal off-line strategy of replacing the cache line whose next access is furthest in the future [7], and thus it exploits temporal locality perfectly.

Unlike various other hierarchical-memory models [1, 2, 5, 8] in which algorithms are analyzed in terms of a single measure, the ideal-cache model uses two measures. An algorithm with an input of size n is measured by its work complexity W(n) (its conventional running time in a RAM model [4]) and its cache complexity Q(n; Z, L), the number of cache misses it incurs as a function of the size Z and line length L of the ideal cache.
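The model is easy to make concrete in code. The sketch below (our own illustration; every name in it is invented for this example, and it is not part of the paper) counts the misses that a trace of word addresses incurs in a fully associative cache of Z words with lines of L words, using the furthest-in-the-future replacement rule just described:

```python
def ideal_cache_misses(trace, Z, L):
    """Count misses for a fully associative cache of Z words
    (Z // L lines of L words each) on a word-address trace,
    using Belady's furthest-in-the-future replacement."""
    lines = Z // L
    # Precompute, for each position i, when the line touched at i
    # is next accessed (scan the trace backwards).
    next_use = [0] * len(trace)
    last = {}
    for i in range(len(trace) - 1, -1, -1):
        ln = trace[i] // L
        next_use[i] = last.get(ln, float("inf"))
        last[ln] = i
    cache = {}        # line number -> time of its next access
    misses = 0
    for i, addr in enumerate(trace):
        ln = addr // L
        if ln not in cache:
            misses += 1
            if len(cache) == lines:
                # Evict the line whose next access is furthest away.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[ln] = next_use[i]   # refresh next-use time on every access
    return misses
```

For example, a single sequential scan of 32 consecutive words with Z = 16 and L = 4 misses once per line, i.e., 8 times, while a repeated scan of a working set that fits in cache misses only on the first pass.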

When Z and L are clear from context, we denote the cache complexity simply as Q(n) to ease notation.

We define an algorithm to be cache aware if it contains parameters (set at either compile-time or runtime) that can be tuned to optimize the cache complexity for the particular cache size and line length. Otherwise, the algorithm is cache oblivious. Historically, good performance has been obtained using cache-aware algorithms, but we shall exhibit several optimal¹ cache-oblivious algorithms.

¹ For simplicity in this paper, we use the term "optimal" as a synonym for "asymptotically optimal," since all our analyses are asymptotic. We use the notation lg to denote log₂.

To illustrate the notion of cache awareness, consider the problem of multiplying two n × n matrices A and B to produce their n × n product C. We assume that the three matrices are stored in row-major order, as shown in Figure 2(a). We further assume that n is "big," i.e., n > L, in order to simplify the analysis. The conventional way to multiply matrices on a computer with caches is to use a blocked algorithm [19, p. 45]. The idea is to view each matrix M as consisting of (n/s) × (n/s) submatrices M_ij (the blocks), each of which has size s × s, where s is a tuning parameter. The following algorithm implements this strategy:

BLOCK-MULT(A, B, C, n)
1  for i ← 1 to n/s
2    do for j ← 1 to n/s
3      do for k ← 1 to n/s
4        do ORD-MULT(A_ik, B_kj, C_ij, s)

The ORD-MULT(A, B, C, s) subroutine computes C ← C + AB on s × s matrices using the ordinary O(s³) algorithm. (This algorithm assumes for simplicity that s evenly divides n, but in practice s and n need have no special relationship, yielding more complicated code in the same spirit.)

Depending on the cache size of the machine on which BLOCK-MULT is run, the parameter s can be tuned to make the algorithm run fast, and thus BLOCK-MULT is a cache-aware algorithm. To minimize the cache complexity, we choose s to be the largest value such that the three s × s submatrices simultaneously fit in cache. An s × s submatrix is stored on Θ(s + s²/L) cache lines. From the tall-cache assumption (1), we can see that s = Θ(√Z). Thus, each of the calls to ORD-MULT runs with at most Z/L = Θ(s²/L) cache misses needed to bring the three matrices into the cache. Consequently, the cache complexity of the entire algorithm is Θ(1 + n²/L + (n/√Z)³(Z/L)) = Θ(1 + n²/L + n³/(L√Z)), since the algorithm has to read n² elements, which reside on ⌈n²/L⌉ cache lines.

The same bound can be achieved using a simple cache-oblivious algorithm that requires no tuning parameters such as the s in BLOCK-MULT. We present such an algorithm, which works on general rectangular matrices, in Section 2. The problems of computing a matrix transpose and of performing an FFT also succumb to remarkably simple algorithms, which are described in Section 3. Cache-oblivious sorting poses a more formidable challenge. In Sections 4 and 5, we present two sorting algorithms, one based on mergesort and the other on distribution sort, both of which are optimal in both work and cache misses.

The ideal-cache model makes the perhaps-questionable assumptions that there are only two levels in the memory hierarchy, that memory is managed automatically by an optimal cache-replacement strategy, and that the cache is fully associative. We address these assumptions in Section 6, showing that to a certain extent, these assumptions entail no loss of generality. Section 7 discusses related work, and Section 8 offers some concluding remarks, including some preliminary empirical results.

2. Matrix multiplication

This section describes and analyzes a cache-oblivious algorithm for multiplying an m × n matrix by an n × p matrix using Θ(mnp) work and incurring Θ(m + n + p + (mn + np + mp)/L + mnp/(L√Z)) cache misses. These results require the tall-cache assumption (1) for matrices stored in row-major layout format, but the assumption can be relaxed for certain other layouts. We also show that Strassen's algorithm [31] for multiplying n × n matrices, which uses Θ(n^{lg 7}) work, incurs Θ(1 + n²/L + n^{lg 7}/(L√Z)) cache misses.

In [9] with others, two of the present authors analyzed an optimal divide-and-conquer algorithm for n × n matrix multiplication that contained no tuning parameters, but we did not study cache-obliviousness per se. That algorithm can be extended to multiply rectangular matrices. To multiply an m × n matrix A and an n × p matrix B, the REC-MULT algorithm halves the largest of the three dimensions and recurs according to one of the following three cases (semicolons denote vertical stacking):

    (A1; A2) B = (A1B; A2B) ,           (2)
    (A1 A2) (B1; B2) = A1B1 + A2B2 ,    (3)
    A (B1 B2) = (AB1 AB2) .             (4)

In case (2), we have m ≥ max{n, p}. Matrix A is split horizontally, and both halves are multiplied by matrix B. In case (3), we have n ≥ max{m, p}. Both matrices are split, and the two halves are multiplied.

In case (4), we have p ≥ max{m, n}. Matrix B is split vertically, and each half is multiplied by A. For square matrices, these three cases together are equivalent to the recursive multiplication algorithm described in [9]. The base case occurs when m = n = p = 1, in which case the two elements are multiplied and added into the result matrix.

Although this straightforward divide-and-conquer algorithm contains no tuning parameters, it uses cache optimally. To analyze the REC-MULT algorithm, we assume that the three matrices are stored in row-major order, as shown in Figure 2(a). Intuitively, REC-MULT uses the cache effectively, because once a subproblem fits into the cache, its smaller subproblems can be solved in cache with no further cache misses.

[Figure 2: Layout of a 16 × 16 matrix in (a) row major, (b) column major, (c) 4 × 4-blocked, and (d) bit-interleaved layouts.]

Theorem 1  The REC-MULT algorithm uses Θ(mnp) work and incurs Θ(m + n + p + (mn + np + mp)/L + mnp/(L√Z)) cache misses when multiplying an m × n matrix by an n × p matrix.

Proof. It can be shown by induction that the work of REC-MULT is Θ(mnp). To analyze the cache misses, let α > 0 be the largest constant sufficiently small that three submatrices of sizes m′ × n′, n′ × p′, and m′ × p′, where max{m′, n′, p′} ≤ α√Z, all fit completely in the cache. We distinguish four cases depending on the initial size of the matrices.

Case I: m, n, p > α√Z. This case is the most intuitive. The matrices do not fit in cache, since all dimensions are "big enough." The cache complexity can be described by the recurrence

    Q(m, n, p) =                                                   (5)
      { Θ((mn + np + mp)/L)     if m, n, p ∈ [α√Z/2, α√Z] ,
      { 2Q(m/2, n, p) + O(1)    otherwise if m ≥ n and m ≥ p ,
      { 2Q(m, n/2, p) + O(1)    otherwise if n > m and n ≥ p ,
      { 2Q(m, n, p/2) + O(1)    otherwise .

The base case arises as soon as all three submatrices fit in cache. The total number of lines used by the three submatrices is Θ((mn + np + mp)/L). The only cache misses that occur during the remainder of the recursion are the Θ((mn + np + mp)/L) cache misses required to bring the matrices into cache. In the recursive cases, when the matrices do not fit in cache, we pay for the cache misses of the recursive calls, which depend on the dimensions of the matrices, plus O(1) cache misses for the overhead of manipulating submatrices. The solution to this recurrence is Q(m, n, p) = Θ(mnp/(L√Z)).

Case II: (m ≤ α√Z and n, p > α√Z) or (n ≤ α√Z and m, p > α√Z) or (p ≤ α√Z and m, n > α√Z). Here, we shall present the case where m ≤ α√Z and n, p > α√Z. The proofs for the other cases are only small variations of this proof. The REC-MULT algorithm always divides n or p by 2 according to cases (3) and (4). At some point in the recursion, both are small enough that the whole problem fits into cache. The number of cache misses can be described by the recurrence

    Q(m, n, p) =                                                   (6)
      { Θ(1 + n + np/L + m)     if n, p ∈ [α√Z/2, α√Z] ,
      { 2Q(m, n/2, p) + O(1)    otherwise if n ≥ p ,
      { 2Q(m, n, p/2) + O(1)    otherwise ,

whose solution is Q(m, n, p) = Θ(np/L + mnp/(L√Z)).

Case III: (n, p ≤ α√Z and m > α√Z) or (m, p ≤ α√Z and n > α√Z) or (m, n ≤ α√Z and p > α√Z). In each of these cases, one of the matrices fits into cache, and the others do not. Here, we shall present the case where n, p ≤ α√Z and m > α√Z. The other cases can be proven similarly. The REC-MULT algorithm always divides m by 2 according to case (2). At some point in the recursion, m falls into the range α√Z/2 ≤ m ≤ α√Z, and the whole problem fits in cache. The number of cache misses can be described by the recurrence

    Q(m, n, p) =                                                   (7)
      { Θ(1 + m)                if m ∈ [α√Z/2, α√Z] ,
      { 2Q(m/2, n, p) + O(1)    otherwise ,

whose solution is Q(m, n, p) = Θ(m + mnp/(L√Z)).

Case IV: m, n, p ≤ α√Z. From the choice of α, all three matrices fit into cache. The matrices are stored on Θ(1 + mn/L + np/L + mp/L) cache lines. Therefore, we have Q(m, n, p) = Θ(1 + (mn + np + mp)/L).
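To make the recursion concrete, here is a minimal Python sketch of the REC-MULT splitting rule (our own illustrative code, not the authors' implementation; it operates on explicit index ranges rather than physical submatrices, and makes no attempt to mirror an optimized in-place version):

```python
def rec_mult(A, B, C, i0=0, m=None, k0=0, n=None, j0=0, p=None):
    """Add A[i0:i0+m, k0:k0+n] * B[k0:k0+n, j0:j0+p] into
    C[i0:i0+m, j0:j0+p], halving the largest dimension each step."""
    if m is None:
        m, n, p = len(A), len(B), len(B[0])
    if m == 1 and n == 1 and p == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]
    elif m >= n and m >= p:            # case (2): split A (and C) horizontally
        h = m // 2
        rec_mult(A, B, C, i0, m - h, k0, n, j0, p)
        rec_mult(A, B, C, i0 + m - h, h, k0, n, j0, p)
    elif n >= m and n >= p:            # case (3): split the shared dimension
        h = n // 2
        rec_mult(A, B, C, i0, m, k0, n - h, j0, p)
        rec_mult(A, B, C, i0, m, k0 + n - h, h, j0, p)
    else:                              # case (4): split B (and C) vertically
        h = p // 2
        rec_mult(A, B, C, i0, m, k0, n, j0, p - h)
        rec_mult(A, B, C, i0, m, k0, n, j0 + p - h, h)
```

Because the split always halves the largest dimension, a subproblem eventually fits entirely in cache, at which point (in the ideal-cache model) its whole recursive subtree runs without further misses.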

We require the tall-cache assumption (1) in these analyses, because the matrices are stored in row-major order. Tall caches are also needed if matrices are stored in column-major order (Figure 2(b)), but the assumption that Z = Ω(L²) can be relaxed for certain other matrix layouts. The s × s-blocked layout (Figure 2(c)), for some tuning parameter s, can be used to achieve the same bounds with the weaker assumption that the cache holds at least some sufficiently large constant number of lines. The cache-oblivious bit-interleaved layout (Figure 2(d)) has the same advantage as the blocked layout, but no tuning parameter need be set, since submatrices of size O(√L) × O(√L) are cache-obliviously stored on O(1) cache lines. The advantages of bit-interleaved and related layouts have been studied in [11, 12, 16]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on conventional microprocessors can be costly, a deficiency we hope that processor architects will remedy.

For square matrices, the cache complexity Q(n) = Θ(n + n²/L + n³/(L√Z)) of the REC-MULT algorithm is the same as the cache complexity of the cache-aware BLOCK-MULT algorithm and also matches the lower bound by Hong and Kung [21]. This lower bound holds for all algorithms that execute the Θ(n³) operations given by the definition of matrix multiplication

    c_ij = Σ_{k=1}^{n} a_ik b_kj .

No tight lower bounds for the general problem of matrix multiplication are known.

By using an asymptotically faster algorithm, such as Strassen's algorithm [31] or one of its variants [37], both the work and cache complexity can be reduced. When multiplying n × n matrices, Strassen's algorithm, which is cache oblivious, requires only 7 recursive multiplications of n/2 × n/2 matrices and a constant number of matrix additions, yielding the recurrence

    Q(n) ≤ { Θ(1 + n + n²/L)      if n² ≤ αZ ,        (8)
           { 7Q(n/2) + O(n²/L)    otherwise ,

where α is a sufficiently small constant. The solution to this recurrence is Θ(n + n²/L + n^{lg 7}/(L√Z)).

3. Matrix transposition and FFT

This section describes a recursive cache-oblivious algorithm for transposing an m × n matrix which uses O(mn) work and incurs O(1 + mn/L) cache misses, which is optimal. Using matrix transposition as a subroutine, we convert a variant [36] of the "six-step" fast Fourier transform (FFT) algorithm [6] into an optimal cache-oblivious algorithm. This FFT algorithm uses O(n lg n) work and incurs O(1 + (n/L)(1 + log_Z n)) cache misses.

The problem of matrix transposition is defined as follows. Given an m × n matrix A stored in a row-major layout, compute and store Aᵀ into an n × m matrix B also stored in a row-major layout. The straightforward algorithm for transposition that employs doubly nested loops incurs Θ(mn) cache misses on one of the matrices when m ≫ Z/L and n ≫ Z/L, which is suboptimal.

Optimal work and cache complexities can be obtained with a divide-and-conquer strategy, however. If n ≥ m, the REC-TRANSPOSE algorithm partitions

    A = (A1 A2) ,   B = (B1; B2)

and recursively executes REC-TRANSPOSE(A1, B1) and REC-TRANSPOSE(A2, B2). Otherwise, it divides matrix A horizontally and matrix B vertically and likewise performs two transpositions recursively. The next two lemmas provide upper and lower bounds on the performance of this algorithm.

Lemma 2  The REC-TRANSPOSE algorithm involves O(mn) work and incurs O(1 + mn/L) cache misses for an m × n matrix.

Proof. That the algorithm does O(mn) work is straightforward. For the cache analysis, let Q(m, n) be the cache complexity of transposing an m × n matrix. We assume that the matrices are stored in row-major order, the column-major layout having a similar analysis.

Let α be a constant sufficiently small such that two submatrices of size m × n and n × m, where max{m, n} ≤ αL, fit completely in the cache even if each row is stored in a different cache line. We distinguish the three cases.

Case I: max{m, n} ≤ αL. Both the matrices fit in O(1) + 2mn/L lines. From the choice of α, the number of lines required is at most Z/L. Therefore Q(m, n) = O(1 + mn/L).

Case II: m ≤ αL < n or n ≤ αL < m. Suppose first that m ≤ αL < n. The REC-TRANSPOSE algorithm divides the greater dimension n by 2 and performs divide and conquer. At some point in the recursion, n falls into the range αL/2 ≤ n ≤ αL, and the whole problem fits in cache. Because the layout is row-major, at this point the input array has n rows and m columns, and it is laid out in contiguous locations, requiring at most O(1 + nm/L) cache misses to be read. The output array consists of nm elements in m rows, where in the worst case every row lies on a different cache line. Consequently, we incur at most O(m + nm/L) cache misses for writing the output array. Since n ≥ αL/2, the total cache complexity for this base case is O(1 + m). These observations yield the recurrence

    Q(m, n) ≤ { O(1 + m)             if n ∈ [αL/2, αL] ,
              { 2Q(m, n/2) + O(1)    otherwise ,

whose solution is Q(m, n) = O(1 + mn/L). The case n ≤ αL < m is analogous.

Case III: m, n > αL. As in Case II, at some point in the recursion both n and m fall into the range [αL/2, αL].

The whole problem fits into cache and can be solved with at most O(m + n + mn/L) cache misses. The cache complexity thus satisfies the recurrence

    Q(m, n) ≤ { O(m + n + mn/L)      if m, n ∈ [αL/2, αL] ,
              { 2Q(m/2, n) + O(1)    if m ≥ n ,
              { 2Q(m, n/2) + O(1)    otherwise ,

whose solution is Q(m, n) = O(1 + mn/L).

Theorem 3  The REC-TRANSPOSE algorithm has optimal cache complexity.

Proof. For an m × n matrix, the algorithm must write to mn distinct elements, which occupy at least ⌈mn/L⌉ = Ω(1 + mn/L) cache lines.

As an example of an application of this cache-oblivious transposition algorithm, in the rest of this section we describe and analyze a cache-oblivious algorithm for computing the discrete Fourier transform of a complex array of n elements, where n is an exact power of 2. The basic algorithm is the well-known "six-step" variant [6, 36] of the Cooley-Tukey FFT algorithm [13]. Using the cache-oblivious transposition algorithm, however, the FFT becomes cache-oblivious, and its performance matches the lower bound by Hong and Kung [21].

Recall that the discrete Fourier transform (DFT) of an array X of n complex numbers is the array Y given by

    Y[i] = Σ_{j=0}^{n-1} X[j] ω_n^{-ij} ,    (9)

where ω_n = e^{2π√-1/n} is a primitive nth root of unity, and 0 ≤ i < n. Many algorithms evaluate Equation (9) in O(n lg n) time for all integers n [15]. In this paper, however, we assume that n is an exact power of 2, and we compute Equation (9) according to the Cooley-Tukey algorithm, which works recursively as follows. In the base case where n = O(1), we compute Equation (9) directly. Otherwise, for any factorization n = n1 n2 of n, we have

    Y[i1 + i2·n1] = Σ_{j2=0}^{n2-1} [ ( Σ_{j1=0}^{n1-1} X[j1·n2 + j2] ω_{n1}^{-i1·j1} ) ω_n^{-i1·j2} ] ω_{n2}^{-i2·j2} .    (10)

Observe that both the inner and outer summations in Equation (10) are DFT's. Operationally, the computation specified by Equation (10) can be performed by computing n2 transforms of size n1 (the inner sum), multiplying the result by the factors ω_n^{-i1·j2} (called the twiddle factors [15]), and finally computing n1 transforms of size n2 (the outer sum).

We choose n1 to be 2^{⌈lg n / 2⌉} and n2 to be 2^{⌊lg n / 2⌋}. The recursive step then operates as follows:

1. Pretend that the input is a row-major n1 × n2 matrix A. Transpose A in place, i.e., use the cache-oblivious REC-TRANSPOSE algorithm to transpose A onto an auxiliary array B, and copy B back onto A. Notice that if n1 = 2n2, we can consider the matrix to be made up of records containing two elements.
2. At this stage, the inner sum corresponds to a DFT of the n2 rows of the transposed matrix. Compute these n2 DFT's of size n1 recursively. Observe that, because of the previous transposition, we are transforming a contiguous array of elements.
3. Multiply A by the twiddle factors, which can be computed on the fly with no extra cache misses.
4. Transpose A in place, so that the inputs to the next stage are arranged in contiguous locations.
5. Compute n1 DFT's of the rows of the matrix recursively.
6. Transpose A in place so as to produce the correct output order.

It can be proven by induction that the work complexity of this FFT algorithm is O(n lg n). We now analyze its cache complexity. The algorithm always operates on contiguous data, by construction. Thus, by the tall-cache assumption (1), the transposition operations and the twiddle-factor multiplication require at most O(1 + n/L) cache misses. Thus, the cache complexity satisfies the recurrence

    Q(n) ≤ { O(1 + n/L)                              if n ≤ αZ ,    (11)
           { n1 Q(n2) + n2 Q(n1) + O(1 + n/L)        otherwise ,

where α > 0 is a constant sufficiently small that a subproblem of size αZ fits in cache. This recurrence has solution

    Q(n) = O(1 + (n/L)(1 + log_Z n)) ,

which is optimal for a Cooley-Tukey algorithm, matching the lower bound by Hong and Kung [21] when n is an exact power of 2. As with matrix multiplication, no tight lower bounds for cache complexity are known for the general DFT problem.

¥ i1 j2 tiplying the result by the factors ωn (called the twid- parently works well in practice [23], it is cache aware. dle factors [15]), and finally computing n1 transforms of This section describes a cache-oblivious sorting algo- size n2 (the outer sum). rithm called “funnelsort.” This algorithm has optimal of ¦ k buffers. Each buffer is a FIFO queue that can

hold 2k3 § 2 elements. Finally, the outputs of the buffers ¦ are connected to the ¦ k inputs of the k-merger R in

the right part of the figure. The output of this final ¦ k- L1 R merger becomes the output of the whole k-merger. The

intermediate buffers are overdimensioned, since each § 3 § 2 3 2 buffers can hold 2k elements, which is twice the number k

of elements output by a ¦ k-merger. This additional buffer space is necessary for the correct behavior of the

algorithm, as will be explained below. The base case of < L the recursion is a k-merger with k 2, which produces k 3

k < 8 elements whenever invoked. k-merger A k-merger operates recursively in the following way. In order to output k3 elements, the k-merger invokes

R k3 § 2 times. Before each invocation, however, the k- merger fills all buffers that are less than half full, i.e.,

Figure 3: Illustration of a k-merger. A k-merger is built all buffers that contain less than k3 § 2 elements. In order

¦ ¦

  recursively out of k “left” k-mergers L , L ,  , L , 1 2 ¦ k to fill buffer i, the algorithm invokes the corresponding

3 § 2 a series of buffers, and one “right” ¦ k-merger R. left merger Li once. Since Li outputs k elements, the

3 § 2

¡ buffer contains at least k elements after Li finishes.

: 8 ¥ 8 : 8 ¥

O 8 nlgn work complexity, and optimal O 1 n L 1 It can be proven by induction that the work com- :

log n : cache complexity. : Z plexity of funnelsort is O 8 nlgn . We will now analyze

Funnelsort is similar to mergesort. In order to sort the cache complexity. The goal of the analysis is to : a (contiguous) array of n elements, funnelsort performs show that funnelsort on n elements requires at most Q 8 n

the following two steps: cache misses, where ¡

1 § 3

:#< 8 ¥ 8 : 8 ¥ :*: 1. Split the input into n contiguous arrays of size 8 Q n O 1 n L 1 logZ n  n2 § 3, and sort these arrays recursively.

In order to prove this result, we need three auxiliary lem- § 1 § 3 1 3 2. Merge the n sorted sequences using a n - mas. The first lemma bounds the space required by a merger, which is described below. k-merger.

Funnelsort differs from mergesort in the way the 2 : Lemma 4 A k-merger can be laid out in O 8 k contigu- merge operation works. Merging is performed by a de- ous memory locations. vice called a k-merger, which inputs k sorted sequences

2 : and merges them. A k-merger operates by recursively Proof. A k-merger requires O 8 k memory locations

merging sorted sequences which become progressively for the buffers, plus the space required by the ¦ k- : longer as the algorithm proceeds. Unlike mergesort, mergers. The space S 8 k thus satisfies the recurrence

however, a k-merger suspends work on a merging sub- ¡ 2

¦ ¦

: 8 ¥ : 8 : ¥ 8 :9 S 8 k k 1 S k O k problem when the merged output sequence becomes

2

:< 8 :

“long enough” and resumes work on another merging whose solution is S 8 k O k . : subproblem. In order to achieve the bound on Q 8 n , the buffers This complicated flow of control makes a k-merger in a k-merger must be maintained as circular queues of a bit tricky to describe. Figure 3 shows a representa- size k. This requirement guarantees that we can man- tion of a k-merger, which has k sorted sequences as in- age the queue cache-efficiently, in the sense stated by puts. Throughout its execution, the k-merger maintains the next lemma. the following invariant.

Lemma 5 Performing r insert and remove operations

¡

¥ : Invariant Each invocation of a k-merger outputs the on a circular queue causes in O 8 1 r L cache misses next k3 elements of the sorted sequence obtained by as long as two cache lines are available for the buffer. merging the k input sequences. Proof. Associate the two cache lines with the head and tail of the circular queue. If a new cache line is read A k-merger is built recursively out of ¦ k-mergers in

during a insert (delete) operation, the next L ¡ 1 insert the following way. The k inputs are partitioned into ¦ k

(delete) operations do not cause a cache miss. ¦

sets of ¦ k elements, which form the input to the k

¦

9 9 9

  k-mergers L L  L in the left part of the figure. The next lemma bounds the cache complexity of a 1 2 ¦ k The outputs of these mergers are connected to the inputs k-merger.
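The data structure assumed by Lemma 5 is an ordinary circular FIFO; a sketch (our own illustrative code) follows. The cache-miss bound comes from the access pattern rather than from anything visible in the code: the head and tail each advance through consecutive array slots, so each endpoint moves onto a new cache line only once every L operations.

```python
class CircularQueue:
    """Fixed-capacity FIFO stored in one contiguous array."""
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0          # next slot to remove from
        self.tail = 0          # next slot to insert into
        self.size = 0
    def insert(self, x):
        assert self.size < len(self.buf), "queue full"
        self.buf[self.tail] = x
        self.tail = (self.tail + 1) % len(self.buf)   # advance, wrapping
        self.size += 1
    def remove(self):
        assert self.size > 0, "queue empty"
        x = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)   # advance, wrapping
        self.size -= 1
        return x
```

In a k-merger, one such queue sits between each left merger Li and the right merger R, letting Li deposit a batch of k^{3/2} elements that R later drains.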

Ω 2

8 : Lemma 6 If Z < L , then a k-merger operates with Before invoking R, the algorithm must check every

at most buffer to see whether it is empty. One such check re-

¡ ¡ ¦

3 3 quires at most ¦ k cache misses, since there are k

8 : < 8 ¥ ¥ ¥ :

Q k O 1 k k L k log k L § M Z buffers. This check is repeated exactly k3 2 times, lead- cache misses. ing to at most k2 cache misses for all checks. These considerations lead to the recurrence

α £ ;

Proof. There are two cases: either k ¦ Z or k ¡

3 § 2 2

¦ ¦

: ¥ ¡ 8 : ¥

α α 8 

¦ Z, where is a sufficiently small constant. QM k 2k 2 k QM k k α ¦ Case I: k £ Z. By Lemma 4, the data structure Application of the inductive hypothesis and the choice

2 ¡

8 :¨<

:¨< 8 ¥ :

associated with the k-merger requires at most O k A 8 k k 1 2clogZ k L yields Inequality (12) as fol- :

O 8 Z contiguous memory locations, and therefore it fits lows:

¡

into cache. The k-merger has k input queues from 3 § 2 2

¦ ¦

: ¥ ¡ 8 : ¥

3 QM 8 k 2k 2 k QM k k :

which it loads O 8 k elements. Let ri be the number

¨

of elements extracted from the ith input queue. Since 3 § 2

¡ ck log k

3 § 2 Z 2

¦ ¦

¡

¡ 8 : ¥

α ¥ ¦ £ 2 k k A k k k Z and the tall-cache assumption (1) implies that

¡ Ω 2L

8§¦ : < 8 :

L < O Z , there are at least Z L k cache lines

¡ ¡

available for the input buffers. Lemma 5 applies, whence the total number of cache misses for accessing the input queues is

∑_{i=1}^{k} O(1 + r_i/L) = O(k + k^3/L).

Similarly, Lemma 4 implies that the cache complexity of writing the output queue is O(1 + k^3/L). Finally, the algorithm incurs O(1 + k^2/L) cache misses for touching its internal data structures. The total cache complexity is therefore Q_M(k) = O(1 + k + k^3/L).

Case II: k > α√Z. We prove by induction on k that whenever k > α√Z, we have

Q_M(k) ≤ c k^3 log_Z k / L − A(k),    (12)

where A(k) = k(1 + 2c log_Z k / L) = o(k^3). This particular value of A(k) will be justified at the end of the analysis.

The base case of the induction consists of values of k such that αZ^{1/4} < k ≤ α√Z. (It is not sufficient only to consider k = Θ(√Z), since k can become as small as Θ(Z^{1/4}) in the recursive calls.) The analysis of the first case applies, yielding Q_M(k) = O(1 + k + k^3/L). Because k^2 > α^2 √Z = Ω(L) and k = Ω(1), the last term dominates, which implies Q_M(k) = O(k^3/L). Consequently, a big enough value of c can be found that satisfies Inequality (12).

For the inductive case, suppose that k > α√Z. The k-merger invokes the √k-mergers recursively. Since αZ^{1/4} < √k < k, the inductive hypothesis can be used to bound the number Q_M(√k) of cache misses incurred by the submergers. The "right" merger R is invoked exactly k^{3/2} times. The total number l of invocations of "left" mergers is bounded by l ≤ k^{3/2} + 2√k. To see why, consider that every invocation of a left merger puts k^{3/2} elements into some buffer. Since k^3 elements are output and the buffer space is 2k^2, the bound l ≤ k^{3/2} + 2√k follows.

Theorem 7. To sort n elements, funnelsort incurs O(1 + (n/L)(1 + log_Z n)) cache misses.

Proof. If n ≤ αZ for a small enough constant α, then the algorithm fits into cache. To see why, observe that only one k-merger is active at any time. The biggest k-merger is the top-level n^{1/3}-merger, which requires O(n^{2/3}) = O(n) space. The algorithm thus can operate in O(1 + n/L) cache misses.

If n > αZ, we have the recurrence

Q(n) = n^{1/3} Q(n^{2/3}) + Q_M(n^{1/3}).

By Lemma 6, we have Q_M(n^{1/3}) = O(1 + n^{1/3} + n/L + n log_Z n / L). By the tall-cache assumption (1), we have n/L = Ω(n^{1/3}). Moreover, we also have n^{1/3} = Ω(1) and lg n = Ω(lg Z). Consequently, Q_M(n^{1/3}) = O(n log_Z n / L) holds, and the recurrence simplifies to

Q(n) = n^{1/3} Q(n^{2/3}) + O(n log_Z n / L).

The result follows by induction on n.

This upper bound matches the lower bound stated by the next theorem, proving that funnelsort is cache-optimal.

Theorem 8. The cache complexity of any sorting algorithm is Q(n) = Ω(1 + (n/L)(1 + log_Z n)).

Proof. Aggarwal and Vitter [3] show that there is an Ω((n/L) log_{Z/L}(n/Z)) bound on the number of cache misses made by any sorting algorithm on their "out-of-core" memory model, a bound that extends to the ideal-cache model. The theorem can be proved by applying the tall-cache assumption Z = Ω(L^2) and the trivial lower bounds of Q(n) = Ω(1) and Q(n) = Ω(n/L).
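The simplified recurrence at the end of the proof of Theorem 7 can be sanity-checked numerically. The sketch below is an illustration, not part of the paper: the values of Z and L and the unit constants standing in for the O(·) terms are arbitrary assumptions. It evaluates Q(n) = n^{1/3} Q(n^{2/3}) + n log_Z n / L with base case Q(n) = 1 + n/L for n ≤ Z, and checks that Q(n) stays within a small constant factor of the claimed bound (n/L)(1 + log_Z n).

```python
import math

Z, L = 10_000.0, 100.0  # hypothetical cache size and line length, in words

def Q(n):
    """Funnelsort cache-miss recurrence with all O-constants set to 1."""
    if n <= Z:                      # problem fits in cache
        return 1.0 + n / L
    # n^{1/3} recursive sorts of size n^{2/3}, plus the n^{1/3}-merger
    return n ** (1 / 3) * Q(n ** (2 / 3)) + (n / L) * math.log(n, Z)

def bound(n):
    """Claimed bound (n/L)(1 + log_Z n), up to a constant factor."""
    return (n / L) * (1.0 + math.log(n, Z))

ratios = [Q(10.0 ** e) / bound(10.0 ** e) for e in range(5, 16)]
print(max(ratios))  # stays bounded by a small constant, as Theorem 7 predicts
```

The ratio converges because the log terms contributed by the recursion form a geometric series: log_Z n^{(2/3)^i} = (2/3)^i log_Z n, which sums to 3 log_Z n.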

5. Distribution sort

In this section, we describe another cache-oblivious optimal sorting algorithm based on distribution sort. Like the funnelsort algorithm from Section 4, the distribution-sorting algorithm uses O(n lg n) work to sort n elements, and it incurs O(1 + (n/L)(1 + log_Z n)) cache misses. Unlike previous cache-efficient distribution-sorting algorithms [1, 3, 25, 34, 36], which use sampling or other techniques to find the partitioning elements before the distribution step, our algorithm uses a "bucket splitting" technique to select pivots incrementally during the distribution step.

Given an array A (stored in contiguous locations) of length n, the cache-oblivious distribution sort operates as follows:

1. Partition A into √n contiguous subarrays of size √n. Recursively sort each subarray.

2. Distribute the sorted subarrays into q buckets B_1, ..., B_q of size n_1, ..., n_q, respectively, such that

   1. max{x | x ∈ B_i} ≤ min{x | x ∈ B_{i+1}} for i = 1, 2, ..., q − 1.
   2. n_i ≤ 2√n for i = 1, 2, ..., q.

   (See below for details.)

3. Recursively sort each bucket.

4. Copy the sorted buckets to array A.

A stack-based memory allocator is used to exploit spatial locality.

The goal of Step 2 is to distribute the sorted subarrays of A into q buckets B_1, B_2, ..., B_q. The algorithm maintains two invariants. First, at any time each bucket holds at most 2√n elements, and any element in bucket B_i is smaller than any element in bucket B_{i+1}. Second, every bucket has an associated pivot. Initially, only one empty bucket exists with pivot ∞.

The idea is to copy all elements from the subarrays into the buckets while maintaining the invariants. We keep state information for each subarray and bucket. The state of a subarray consists of the index next of the next element to be read from the subarray and the bucket number bnum where this element should be copied. By convention, bnum = ∞ if all elements in a subarray have been copied. The state of a bucket consists of the pivot and the number of elements currently in the bucket.

We would like to copy the element at position next of a subarray to bucket bnum. If this element is greater than the pivot of bucket bnum, we would increment bnum until we find a bucket for which the element is smaller than the pivot. Unfortunately, this basic strategy has poor caching behavior, which calls for a more complicated procedure.

The distribution step is accomplished by the recursive procedure DISTRIBUTE(i, j, m), which distributes elements from the ith through (i + m − 1)th subarrays into buckets starting from B_j. Given the precondition that each subarray i, i + 1, ..., i + m − 1 has its bnum ≥ j, the execution of DISTRIBUTE(i, j, m) enforces the postcondition that subarrays i, i + 1, ..., i + m − 1 have their bnum ≥ j + m. Step 2 of the distribution sort invokes DISTRIBUTE(1, 1, √n). The following is a recursive implementation of DISTRIBUTE:

DISTRIBUTE(i, j, m)
1  if m = 1
2    then COPYELEMS(i, j)
3    else DISTRIBUTE(i, j, m/2)
4         DISTRIBUTE(i + m/2, j, m/2)
5         DISTRIBUTE(i, j + m/2, m/2)
6         DISTRIBUTE(i + m/2, j + m/2, m/2)

In the base case, the procedure COPYELEMS(i, j) copies all elements from subarray i that belong to bucket j. If bucket j has more than 2√n elements after the insertion, it can be split into two buckets of size at least √n. For the splitting operation, we use the deterministic median-finding algorithm [14, p. 189] followed by a partition.

Lemma 9. The median of n elements can be found cache-obliviously using O(n) work and incurring O(1 + n/L) cache misses.

Proof. See [14, p. 189] for the linear-time median finding algorithm and the work analysis. The cache complexity is given by the same recurrence as the work complexity with a different base case:

Q(m) = O(1 + m/L)                                 if m ≤ αZ,
Q(m) = Q(⌈m/5⌉) + Q(7m/10 + 6) + O(1 + m/L)       otherwise,

where α is a sufficiently small constant. The result follows.

In our case, we have buckets of size 2√n + 1. In addition, when a bucket splits, all subarrays whose bnum is greater than the bnum of the split bucket must have their bnum's incremented. The analysis of DISTRIBUTE is given by the following lemma.

Lemma 10. The distribution step involves O(n) work, incurs O(1 + n/L) cache misses, and uses O(n) stack space to distribute n elements.

Proof. In order to simplify the analysis of the work used by DISTRIBUTE, assume that COPYELEMS uses O(1) work for procedural overhead. We will account for the work due to copying elements and splitting of buckets separately. The work of DISTRIBUTE is described by the recurrence

T(c) = 4 T(c/2) + O(1).

It follows that T(c) = O(c^2), where c = √n initially. The work due to copying elements is also O(n).

The total number of bucket splits is at most √n. To see why, observe that there are at most √n buckets at the end of the distribution step, since each bucket contains at least √n elements. Each split operation involves O(√n) work, and so the net contribution to the work is O(n). Thus, the total work used by DISTRIBUTE is W(n) = O(T(√n)) + O(n) + O(n) = O(n).

For the cache analysis, we distinguish two cases. Let α be a sufficiently small constant such that the stack space used fits into cache.

Case I, n ≤ αZ: The input and the auxiliary space of size O(n) fit into cache using O(1 + n/L) cache lines. Consequently, the cache complexity is O(1 + n/L).

Case II, n > αZ: Let R(c, m) denote the cache misses incurred by an invocation of DISTRIBUTE(a, b, c) that copies m elements from subarrays to buckets. We first prove that R(c, m) = O(L + c^2/L + m/L), ignoring the cost of splitting of buckets, which we shall account for separately. We argue that R(c, m) satisfies the recurrence

R(c, m) ≤ O(L + m/L)                  if c ≤ αL,    (13)
R(c, m) ≤ ∑_{i=1}^{4} R(c/2, m_i)     otherwise,

where ∑_{i=1}^{4} m_i = m, whose solution is R(c, m) = O(L + c^2/L + m/L). The recursive case c > αL follows immediately from the algorithm. The base case c ≤ αL can be justified as follows. An invocation of DISTRIBUTE(a, b, c) operates with c subarrays and c buckets. Since there are Ω(L) cache lines, the cache can hold all the auxiliary storage involved and the currently accessed element in each subarray and bucket. In this case there are O(L + m/L) cache misses. The initial access to each subarray and bucket causes O(c) = O(L) cache misses. Copying the m elements to and from contiguous locations causes O(1 + m/L) cache misses.

We still need to account for the cache misses caused by the splitting of buckets. Each split causes O(1 + √n/L) cache misses due to median finding (Lemma 9) and partitioning of √n contiguous elements. An additional O(1 + √n/L) misses are incurred by restoring the cache. As proven in the work analysis, there are at most √n split operations. By adding R(√n, n) to the split complexity, we conclude that the total cache complexity of the distribution step is

O(L + n/L + √n (1 + √n/L)) = O(n/L).

The analysis of distribution sort is given in the next theorem. The work and cache complexity match the lower bounds specified in Theorem 8.

Theorem 11. Distribution sort uses O(n lg n) work and incurs O(1 + (n/L)(1 + log_Z n)) cache misses to sort n elements.

Proof. The work done by the algorithm is given by

W(n) = √n W(√n) + ∑_{i=1}^{q} W(n_i) + O(n),

where each n_i ≤ 2√n and ∑ n_i = n. The solution to this recurrence is W(n) = O(n lg n).

The space complexity of the algorithm is given by

S(n) ≤ S(2√n) + O(n),

where the O(n) term comes from Step 2. The solution to this recurrence is S(n) = O(n).

The cache complexity of distribution sort is described by the recurrence

Q(n) ≤ O(1 + n/L)                                          if n ≤ αZ,
Q(n) ≤ √n Q(√n) + ∑_{i=1}^{q} Q(n_i) + O(1 + n/L)          otherwise,

where α is a sufficiently small constant such that the stack space used by a sorting problem of size αZ, including the input array, fits completely in cache. The base case n ≤ αZ arises when both the input array A and the contiguous stack space of size S(n) = O(n) fit in O(1 + n/L) cache lines of the cache. In this case, the algorithm incurs O(1 + n/L) cache misses to touch all involved memory locations once. In the case where n > αZ, the recursive calls in Steps 1 and 3 cause √n Q(√n) + ∑_{i=1}^{q} Q(n_i) cache misses, and O(1 + n/L) is the cache complexity of Steps 2 and 4, as shown by Lemma 10. The theorem follows by solving the recurrence.

6. Theoretical justifications for the ideal-cache model

How reasonable is the ideal-cache model for algorithm design? The model incorporates four major assumptions that deserve scrutiny:

- optimal replacement,
- exactly two levels of memory,
- automatic replacement,
- full associativity.

Designing algorithms in the ideal-cache model is easier than in models lacking these properties, but are these assumptions too strong? In this section we show that cache-oblivious algorithms designed in the ideal-cache model can be efficiently simulated by weaker models.

The first assumption that we shall eliminate is that of optimal replacement. Our strategy for the simulation is to use an LRU (least-recently used) replacement strategy [20, p. 378] in place of the optimal and omniscient replacement strategy. We start by proving a lemma that bounds the effectiveness of the LRU simulation. We then show that algorithms whose complexity bounds satisfy a simple regularity condition (including all algorithms heretofore presented) can be ported to caches incorporating an LRU replacement policy.
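The substitution of LRU for optimal replacement can be illustrated with a small simulation. The sketch below is illustrative and not part of the paper: the random trace, the cache sizes, and the choice of single-word lines (L = 1) are arbitrary assumptions. It counts misses under LRU with Z lines and under Belady's optimal farthest-in-future policy with Z/2 lines, and checks the factor-of-2 relationship implied by the competitive bound of Sleator and Tarjan [30].

```python
import random
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count misses of an LRU cache with `capacity` lines (L = 1 word)."""
    cache, misses = OrderedDict(), 0
    for x in trace:
        if x in cache:
            cache.move_to_end(x)            # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least-recently used
            cache[x] = True
    return misses

def opt_misses(trace, capacity):
    """Count misses of Belady's optimal (farthest-in-future) policy."""
    # nxt[i] = index of the next occurrence of trace[i] after position i
    nxt, last = [0] * len(trace), {}
    for i in range(len(trace) - 1, -1, -1):
        nxt[i] = last.get(trace[i], float('inf'))
        last[trace[i]] = i
    cache, misses = {}, 0                   # line -> index of its next use
    for i, x in enumerate(trace):
        if x in cache:
            cache[x] = nxt[i]
        else:
            misses += 1
            if len(cache) >= capacity:
                victim = max(cache, key=cache.get)  # farthest next use
                del cache[victim]
            cache[x] = nxt[i]
    return misses

random.seed(0)
trace = [random.randrange(40) for _ in range(5000)]
Z = 16
# LRU with Z lines misses at most twice as often as the optimal
# policy with Z/2 lines, when both caches start empty.
assert lru_misses(trace, Z) <= 2 * opt_misses(trace, Z // 2)
print(lru_misses(trace, Z), opt_misses(trace, Z // 2))
```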

Lemma 12. Consider an algorithm that causes Q*(n; Z, L) cache misses on a problem of size n using a (Z, L) ideal cache. Then, the same algorithm incurs Q(n; Z, L) ≤ 2 Q*(n; Z/2, L) cache misses on a (Z, L) cache that uses LRU replacement.

Proof. Sleator and Tarjan [30] have shown that the cache misses on a (Z, L) cache using LRU replacement are (Z/L)/((Z − Z′)/L + 1)-competitive with optimal replacement on a (Z′, L) ideal cache if both caches start empty. It follows that the number of misses on a (Z, L) LRU-cache is at most twice the number of misses on a (Z/2, L) ideal cache.

Corollary 13. For any algorithm whose cache-complexity bound Q(n; Z, L) in the ideal-cache model satisfies the regularity condition

Q(n; Z, L) = O(Q(n; 2Z, L)),    (14)

the number of cache misses with LRU replacement is Θ(Q(n; Z, L)).

Proof. Follows directly from (14) and Lemma 12.

The second assumption we shall eliminate is the assumption of only two levels of memory. Although models incorporating multiple levels of caches may be necessary to analyze some algorithms, for cache-oblivious algorithms, analysis in the two-level ideal-cache model suffices. Specifically, optimal cache-oblivious algorithms also perform optimally in computers with multiple levels of LRU caches. We assume that the caches satisfy the inclusion property [20, p. 723], which says that the values stored in cache i are also stored in cache i + 1 (where cache 1 is the cache closest to the processor). We also assume that if two elements belong to the same cache line at level i, then they belong to the same line at level i + 1. Moreover, we assume that cache i + 1 has strictly more cache lines than cache i. These assumptions ensure that cache i + 1 includes the contents of cache i plus at least one more cache line.

The multilevel LRU cache operates as follows. A hit on an element in cache i is served by cache i and is not seen by higher-level caches. We consider a line in cache i + 1 to be marked if any element stored on the line belongs to cache i. When cache i misses on an access, it recursively fetches the needed line from cache i + 1, replacing the least-recently accessed unmarked cache line. The replaced cache line is then brought to the front of cache (i + 1)'s LRU list. Because marked cache lines are never replaced, the multilevel cache maintains the inclusion property. The next lemma, whose proof is omitted, asserts that even though a cache in a multilevel model does not see accesses that hit at lower levels, it nevertheless behaves like the first-level cache of a simple two-level model, which sees all the memory accesses.

Lemma 14. A (Z_i, L_i)-cache at a given level i of a multilevel LRU model always contains the same cache lines as a simple (Z_i, L_i)-cache managed by LRU that serves the same sequence of memory accesses.

Lemma 15. An optimal cache-oblivious algorithm whose cache complexity satisfies the regularity condition (14) incurs an optimal number of cache misses on each level³ of a multilevel cache with LRU replacement.

Proof. Let cache i in the multilevel LRU model be a (Z_i, L_i) cache. Lemma 14 says that the cache holds exactly the same elements as a (Z_i, L_i) cache in a two-level LRU model. From Corollary 13, the cache complexity of a cache-oblivious algorithm working on a (Z_i, L_i) LRU cache lower-bounds that of any cache-aware algorithm for a (Z_i, L_i) ideal cache. A (Z_i, L_i) level in a multilevel cache incurs at least as many cache misses as a (Z_i, L_i) ideal cache when the same algorithm is executed.

Finally, we remove the two assumptions of automatic replacement and full associativity. Specifically, we shall show that a fully associative LRU cache can be maintained in ordinary memory with no asymptotic loss in expected performance.

Lemma 16. A (Z, L) LRU-cache can be maintained using O(Z) memory locations such that every access to a cache line in memory takes O(1) expected time.

Proof. Given the address of the memory location to be accessed, we use a 2-universal hash function [24, p. 216] to maintain a hash table of cache lines present in the memory. The Z/L entries in the hash table point to linked lists in a heap of memory that contains Z/L records corresponding to the cache lines. The 2-universal hash function guarantees that the expected size of a chain is O(1). All records in the heap are organized as a doubly linked list in the LRU order. Thus, the LRU policy can be implemented in O(1) expected time using O(Z/L) records of O(L) words each.

Theorem 17. An optimal cache-oblivious algorithm whose cache-complexity bound satisfies the regularity condition (14) can be implemented optimally in expectation in multilevel models with explicit memory management.

³Alpern, Carter and Feig [5] show that optimality on each level of memory in the UMH model does not necessarily imply global optimality. The UMH model incorporates a single cost measure that combines the costs of work and cache faults at each of the levels of memory. By analyzing the levels independently, our multilevel ideal-cache model remains agnostic about the various schemes by which work and cache faults might be combined.
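The construction in the proof of Lemma 16 — a hash table of resident lines plus a doubly linked list kept in LRU order — can be sketched concretely. The code below is an illustration, not part of the paper: Python's built-in dict stands in for the 2-universal hash table (giving O(1) expected chains), the linked-list records mirror the heap of Z/L line records, and the parameters Z and L are hypothetical.

```python
class LRUCache:
    """Fully associative (Z, L) LRU cache kept in ordinary memory:
    a hash table maps line addresses to records, and the records form
    a doubly linked list in LRU order, so each access touches only a
    constant number of pointers (O(1) expected time per access)."""

    def __init__(self, Z, L):
        self.L = L                          # line length in words
        self.capacity = Z // L              # number of cache lines
        self.table = {}                     # hash table: line -> record
        # sentinel records for the doubly linked LRU list
        self.head = {'prev': None, 'next': None}   # most-recent end
        self.tail = {'prev': None, 'next': None}   # least-recent end
        self.head['next'], self.tail['prev'] = self.tail, self.head

    def _unlink(self, node):
        node['prev']['next'], node['next']['prev'] = node['next'], node['prev']

    def _push_front(self, node):
        node['prev'], node['next'] = self.head, self.head['next']
        self.head['next']['prev'] = node
        self.head['next'] = node

    def access(self, addr):
        """Touch the word at `addr`; return True on a hit."""
        line = addr // self.L
        node = self.table.get(line)         # O(1) expected hash lookup
        if node is not None:                # hit: move to front of LRU list
            self._unlink(node)
            self._push_front(node)
            return True
        if len(self.table) >= self.capacity:    # evict least-recently used
            victim = self.tail['prev']
            self._unlink(victim)
            del self.table[victim['line']]
        node = {'line': line}
        self.table[line] = node
        self._push_front(node)
        return False

cache = LRUCache(Z=8, L=2)                  # 4 lines of 2 words each
hits = [cache.access(a) for a in [0, 1, 2, 4, 6, 8, 0]]
print(hits)  # [False, True, False, False, False, False, False]
```

Each access does one hash lookup plus a constant number of pointer updates, matching the O(1) expected-time bound of the lemma.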

Proof. Combine Lemma 15 and Lemma 16.

Corollary 18. The recursive cache-oblivious algorithms for matrix multiplication, matrix transpose, FFT, and sorting are optimal in multilevel models with explicit memory management.

Proof. Their complexity bounds satisfy the regularity condition (14).

It can also be shown [26] that cache-oblivious algorithms satisfying (14) are also optimal (in expectation) in the previously studied SUMH [5, 34] and HMM [1] models. Thus, all the algorithmic results in this paper apply to these models, matching the best bounds previously achieved.

Other simulation results can be shown. For example, by using the copying technique of [22], cache-oblivious algorithms for matrix multiplication and other problems can be designed that are provably optimal on direct-mapped caches.

7. Related work

In this section, we discuss the origin of the notion of cache-obliviousness. We also give an overview of other hierarchical memory models.

Our research group at MIT noticed as far back as 1994 that divide-and-conquer matrix multiplication was a cache-optimal algorithm that required no tuning, but we did not adopt the term "cache-oblivious" until 1997. This matrix-multiplication algorithm, as well as a cache-oblivious algorithm for LU-decomposition without pivoting, eventually appeared in [9]. Shortly after leaving our research group, Toledo [32] independently proposed a cache-oblivious algorithm for LU-decomposition with pivoting. For n × n matrices, Toledo's algorithm uses Θ(n^3) work and incurs Θ(1 + n^2/L + n^3/(L√Z)) cache misses. More recently, our group has produced an FFT library called FFTW [18], which in its most recent incarnation [17] employs a register-allocation and scheduling algorithm inspired by our cache-oblivious FFT algorithm. The general idea that divide-and-conquer enhances memory locality has been known for a long time [29].

Previous theoretical work on understanding hierarchical memories and the I/O-complexity of algorithms has been studied in cache-aware models lacking an automatic replacement strategy, although [10, 28] are recent exceptions. Hong and Kung [21] use the red-blue pebble game to prove lower bounds on the I/O-complexity of matrix multiplication, FFT, and other problems. The red-blue pebble game models temporal locality using two levels of memory. The model was extended by Savage [27] for deeper memory hierarchies. Aggarwal and Vitter [3] introduced spatial locality and investigated a two-level memory in which a block of P contiguous items can be transferred in one step. They obtained tight bounds for matrix multiplication, FFT, sorting, and other problems. The hierarchical memory model (HMM) by Aggarwal et al. [1] treats memory as a linear array, where the cost of an access to the element at location x is given by a cost function f(x). The BT model [2] extends HMM to support block transfers. The UMH model by Alpern et al. [5] is a multilevel model that allows I/O at different levels to proceed in parallel. Vitter and Shriver introduce parallelism, and they give algorithms for matrix multiplication, FFT, sorting, and other problems in both a two-level model [35] and several parallel hierarchical memory models [36]. Vitter [33] provides a comprehensive survey of external-memory algorithms.

8. Conclusion

The theoretical work presented in this paper opens two important avenues for future research. The first is to determine the range of practicality of cache-oblivious algorithms, or indeed, of any algorithms developed in the ideal-cache model. The second is to resolve, from a complexity-theoretic point of view, the relative strengths of cache-oblivious and cache-aware algorithms. To conclude, we discuss each of these avenues in turn.

[Figure 4: Average time to transpose an N × N matrix, divided by N² (microseconds), for the iterative and recursive algorithms, with N up to 1200.]

Figure 4 compares the per-element time to transpose a matrix using the naive iterative algorithm, which employs a doubly nested loop, with the recursive cache-oblivious REC-TRANSPOSE algorithm from Section 3. The two algorithms were evaluated on a 450-megahertz AMD K6-III processor with a 32-kilobyte 2-way set-associative L1 cache, a 64-kilobyte 4-way set-associative L2 cache, and a 1-megabyte L3 cache of unknown associativity, all with 32-byte cache lines. The code for REC-TRANSPOSE was the same as presented in Section 3, except that the divide-and-conquer structure was modified to produce exact powers of 2 as submatrix sizes wherever possible. In addition, the base cases were "coarsened" by inlining the recursion near the leaves to increase their size and overcome the overhead of procedure calls. (A good research problem is to determine an effective compiler strategy for coarsening base cases automatically.)

Although these results must be considered preliminary, Figure 4 strongly indicates that the recursive algorithm outperforms the iterative algorithm throughout the range of matrix sizes. Moreover, the iterative algorithm behaves erratically, apparently due to so-called "conflict" misses [20, p. 390], where limited cache associativity interacts with the regular addressing of the matrix to cause systematic interference. Blocking the iterative algorithm should help with conflict misses [22], but it would make the algorithm cache aware. For large matrices, the recursive algorithm executes in less than 70% of the time used by the iterative algorithm, even though the transpose problem exhibits no temporal locality.

[Figure 5: Average time taken to multiply two N × N matrices, divided by N³ (microseconds), for the iterative and recursive algorithms, with N up to 600.]

Figure 5 makes a similar comparison between the naive iterative matrix-multiplication algorithm, which uses three nested loops, and the O(n^3)-work recursive REC-MULT algorithm described in Section 2. This problem exhibits a high degree of temporal locality, which REC-MULT exploits effectively. As the figure shows, the average time used per integer multiplication in the recursive algorithm is almost constant, which for large matrices is less than 50% of the time used by the iterative variant. A similar study for Jacobi multipass filters can be found in [26].

Several researchers [12, 16] have also observed that recursive algorithms exhibit performance advantages over iterative algorithms for computers with caches. A comprehensive empirical study has yet to be done, however. Do cache-oblivious algorithms perform nearly as well as cache-aware algorithms in practice, where constant factors matter? Does the ideal-cache model capture the substantial caching concerns for an algorithms designer?

An anecdotal affirmative answer to these questions is exhibited by the popular FFTW library [17, 18], which uses a recursive strategy to exploit caches in Fourier transform calculations. FFTW's code generator produces straight-line "codelets," which are coarsened base cases for the FFT algorithm. Because these codelets are cache oblivious, a C compiler can perform its register allocation efficiently, and yet the codelets can be generated without knowing the number of registers on the target architecture.

To close, we mention two theoretical avenues that should be explored to determine the complexity-theoretic relationship between cache-oblivious algorithms and cache-aware algorithms.

Separation: Is there a gap in asymptotic complexity between cache-aware and cache-oblivious algorithms? It appears that cache-aware algorithms should be able to use caches better than cache-oblivious algorithms, since they have more knowledge about the system on which they are running. Do there exist problems for which this advantage is asymptotically significant, for example an Ω(lg Z) advantage? Bilardi and Peserico [8] have recently taken some steps toward proving a separation.

Simulation: Is there a limit to how much better a cache-aware algorithm can be than a cache-oblivious algorithm for the same problem? That is, given a class of optimal cache-aware algorithms to solve a single problem, can we construct a good cache-oblivious algorithm that solves the same problem with only, for example, O(lg Z) loss of efficiency? Perhaps simulation techniques can be used to convert a class of efficient cache-aware algorithms into a comparably efficient cache-oblivious algorithm.

Acknowledgments

Thanks to Bobby Blumofe, now of the University of Texas at Austin, who sparked early discussions at MIT about what we now call cache obliviousness. Thanks to Gianfranco Bilardi of University of Padova, Sid Chatterjee of University of North Carolina, Chris Joerg of Compaq CRL, Martin Rinard of MIT, Bin Song of MIT, Sivan Toledo of Tel Aviv University, and David Wise of Indiana University for helpful discussions. Thanks also to our anonymous reviewers.

References

[1] A. Aggarwal, B. Alpern, A. K. Chandra, and M. Snir. A model for hierarchical memory. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing (STOC), pages 305–314, May 1987.
[2] A. Aggarwal, A. K. Chandra, and M. Snir. Hierarchical memory with block transfer. In 28th Annual Symposium on Foundations of Computer Science (FOCS), pages 204–216, Los Angeles, California, 12–14 Oct. 1987. IEEE.
[3] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, Sept. 1988.
[4] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, 1974.
[5] B. Alpern, L. Carter, and E. Feig. Uniform memory hierarchies. In Proceedings of the 31st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 600–608, Oct. 1990.
[6] D. H. Bailey. FFTs in external or hierarchical memory. Journal of Supercomputing, 4(1):23–35, May 1990.
[7] L. A. Belady. A study of replacement algorithms for virtual storage computers. IBM Systems Journal, 5(2):78–101, 1966.
[8] G. Bilardi and E. Peserico. Efficient portability across memory hierarchies. Unpublished manuscript, 1999.
[9] R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall. An analysis of dag-consistent distributed shared-memory algorithms. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 297–308, Padua, Italy, June 1996.
[10] L. Carter and K. S. Gatlin. Towards an optimal bit-reversal permutation program. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science, pages 544–555. IEEE Computer Society Press, 1998.
[11] S. Chatterjee, V. V. Jain, A. R. Lebeck, and S. Mundhra. Nonlinear array layouts for hierarchical memory systems. In Proceedings of the ACM International Conference on Supercomputing, Rhodes, Greece, June 1999.
[12] S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of the Eleventh ACM Symposium on Parallel Algorithms and Architectures (SPAA), June 1999.
[13] J. W. Cooley and J. W. Tukey. An algorithm for the machine computation of the complex Fourier series. Mathematics of Computation, 19:297–301, Apr. 1965.
[14] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press and McGraw Hill, 1990.
[15] P. Duhamel and M. Vetterli. Fast Fourier transforms: a tutorial review and a state of the art. Signal Processing, 19:259–299, Apr. 1990.
[16] J. D. Frens and D. S. Wise. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 206–216, Las Vegas, NV, June 1997.
[17] M. Frigo. A fast Fourier transform compiler. In Proceedings of the ACM SIGPLAN'99 Conference on Programming Language Design and Implementation (PLDI), Atlanta, Georgia, May 1999.
[18] M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.
[19] G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, 1989.
[20] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2nd edition, 1996.
[21] J.-W. Hong and H. T. Kung. I/O complexity: the red-blue pebbling game. In Proceedings of the 13th Annual ACM Symposium on Theory of Computing (STOC), pages 326–333, Milwaukee, 1981.
[22] M. S. Lam, E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 63–74, Santa Clara, CA, Apr. 1991. ACM SIGPLAN Notices 26:4.
[23] A. LaMarca and R. E. Ladner. The influence of caches on the performance of sorting. In Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 370–377, 1997.
[24] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[25] M. H. Nodine and J. S. Vitter. Deterministic distribution sort in shared and distributed memory multiprocessors. In Proceedings of the Fifth ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 120–129, Velen, Germany, 1993.
[26] H. Prokop. Cache-oblivious algorithms. Master's thesis, Massachusetts Institute of Technology, June 1999.
[27] J. E. Savage. Extending the Hong-Kung model to memory hierarchies. In D.-Z. Du and M. Li, editors, Computing and Combinatorics, volume 959 of Lecture Notes in Computer Science, pages 270–281. Springer Verlag, 1995.
[28] S. Sen and S. Chatterjee. Towards a theory of cache-efficient algorithms. Unpublished manuscript, 1999.
[29] R. C. Singleton. An algorithm for computing the mixed radix fast Fourier transform. IEEE Transactions on Audio and Electroacoustics, AU-17(2):93–103, June 1969.
[30] D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28(2):202–208, Feb. 1985.
[31] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969.
[32] S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal on Matrix Analysis and Applications, 18(4):1065–1081, Oct. 1997.
[33] J. S. Vitter. External memory algorithms and data structures. In J. Abello and J. S. Vitter, editors, External Memory Algorithms and Visualization, DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society Press, Providence, RI, 1999.
[34] J. S. Vitter and M. H. Nodine. Large-scale sorting in uniform memory hierarchies. Journal of Parallel and Distributed Computing, 17(1–2):107–114, January and February 1993.
[35] J. S. Vitter and E. A. M. Shriver. Algorithms for parallel memory I: Two-level memories. Algorithmica, 12(2/3):110–147, August and September 1994.
[36] J. S. Vitter and E. A. M. Shriver. Algorithms for parallel memory II: Hierarchical multilevel memories. Algorithmica, 12(2/3):148–169, August and September 1994.
[37] S. Winograd. On the algebraic complexity of functions. Actes du Congrès International des Mathématiciens, 3:283–288, 1970.