The Influence of Caches on the Performance of Sorting
Anthony LaMarca* & Richard E. Ladner
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195
[email protected] [email protected]

Abstract

We investigate the effect that caches have on the performance of sorting algorithms both experimentally and analytically. To address the performance problems that high cache miss penalties introduce we restructure heapsort, mergesort and quicksort in order to improve their cache locality. For all three algorithms the improvement in cache performance leads to a reduction in total execution time. We also investigate the performance of radix sort. Despite the extremely low instruction count incurred by this linear sorting algorithm, its relatively poor cache performance results in worse overall performance than the efficient comparison based sorting algorithms.

1 Introduction.

Since the introduction of caches, main memory has continued to grow slower relative to processor cycle times. The time to service a cache miss to memory has grown from 6 cycles for the VAX 11/780 to 120 for the AlphaServer 8400 [3, 7]. Cache miss penalties have grown to the point where good overall performance cannot be achieved without good cache performance. As a consequence of this change in computer architectures, algorithms which have been designed to minimize instruction count may not achieve the performance of algorithms which take into account both instruction count and cache performance.

One of the most common tasks computers perform is sorting a set of unordered keys. Sorting is a fundamental task and hundreds of sorting algorithms have been developed. In this paper we explore the potential performance gains that cache-conscious design offers in understanding and improving the performance of four popular sorting algorithms: heapsort [25], mergesort¹, quicksort [12], and radix sort². Heapsort, mergesort, and quicksort are all comparison based sorting algorithms while radix sort is not.

* LaMarca was supported by an AT&T fellowship. He is currently at Xerox PARC, 3333 Coyote Hill Road, Palo Alto, CA 94304.
¹ Knuth [15] traces mergesort back to card sorting machines of the 1930s.
² Knuth [15] traces the radix sorting method to the Hollerith sorting machine that was first used to assist the 1890 United States census.

For each of the four sorting algorithms we choose an implementation variant with potential for good overall performance and then heavily optimize this variant using traditional techniques to minimize the number of instructions executed. These heavily optimized algorithms form the baseline for comparison. For each of the comparison sort baseline algorithms we develop and apply memory optimizations in order to improve cache performance and, hopefully, overall performance. For radix sort we optimize cache performance by varying the radix. In the process we develop some simple analytic techniques which enable us to predict the memory performance of these algorithms in terms of cache misses.

For comparison purposes we focus on sorting an array (4,000 to 4,096,000 keys) of 64 bit integers chosen uniformly at random. Our study uses trace-driven simulations and actual executions to measure the impact that our memory optimizations have on performance. We concentrate on three performance measures: instruction count, cache misses, and overall performance (time) on machines with modern memory systems. Our results can be summarized as follows:

1. For the three comparison based sorting algorithms, memory optimizations improve both cache and overall performance. The improvements in overall performance for heapsort and mergesort are significant, while the improvement for quicksort is modest. Interestingly, memory optimizations to heapsort also reduce its instruction count. For radix sort the radix that minimizes cache misses also minimizes instruction count.
2. For large arrays, radix sort has the lowest instruction count, but because of its relatively poor cache performance, its overall performance is worse than the memory optimized versions of mergesort and quicksort.

3. Although our study was done on one machine, we demonstrate the robustness of the results by showing that comparable speedups due to improved cache performance can be achieved on several other machines.

4. There are effective analytic approaches to predicting the number of cache misses these sorting algorithms incur.

The main general lesson to be learned from this study is that because cache miss penalties are large, and growing larger with each new generation of processor, selecting the fastest algorithm to solve a problem entails understanding cache performance. Improving an algorithm's overall performance may require increasing the number of instructions executed while, at the same time, reducing the number of cache misses. Consequently, cache-conscious design of algorithms is required to achieve the best performance.

2 Caches.

In order to speed up memory accesses, small high speed memories called caches are placed between the processor and the main memory. Accessing the cache is typically much faster than accessing main memory. Unfortunately, since caches are smaller than main memory they can hold only a subset of its contents. Memory accesses first consult the cache to see if it contains the desired data. If the data is found in the cache, the main memory need not be consulted and the access is considered to be a cache hit. If the data is not in the cache it is considered a miss, and the data must be loaded from main memory. On a miss, the block containing the accessed data is loaded into the cache in the hope that it will be used again in the future. The hit ratio is a measure of cache performance and is the total number of hits divided by the total number of accesses.

The major design parameters of caches are:

• Capacity, which is the total number of bytes that the cache can hold.

• Block size, which is the number of bytes that are loaded from and written to memory at a time.

• Associativity, which indicates the number of different locations in the cache where a particular block can be loaded. In an N-way set-associative cache, a particular block can be loaded in N different cache locations. Direct-mapped caches have an associativity of one, and can load a particular block only in a single location. Fully associative caches are at the other extreme and can load blocks anywhere in the cache.

• Replacement policy, which indicates the policy of which block to remove from the cache when a new block is loaded. For the direct-mapped cache the replacement policy is simply to remove the block currently residing in the cache.

In most modern machines, more than one cache is placed between the processor and main memory. These hierarchies of caches are configured with the smallest, fastest cache next to the processor and the largest, slowest cache next to main memory. The largest miss penalty is typically incurred with the cache closest to main memory and this cache is usually direct-mapped. Consequently, our design and analysis techniques will focus on improving the performance of direct-mapped caches. We will assume that the cache parameters, block size and capacity, are known to the programmer.

High cache hit ratios depend on a program's stream of memory references exhibiting locality. A program exhibits temporal locality if there is a good chance that an accessed data item will be accessed again in the near future. A program exhibits spatial locality if there is a good chance that subsequently accessed data items are located near each other in memory. Most programs tend to exhibit both kinds of locality and typical hit ratios are greater than 90% [18]. Our design techniques will attempt to improve both the temporal and spatial locality of the sorting algorithms.
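To make these parameters concrete, the following C sketch models a direct-mapped cache of the kind a trace-driven simulation maintains. It is our illustration, not code from the paper or from any simulator it used; the 16 KB capacity, the 32-byte block size, and the names cache_access and NUM_BLOCKS are hypothetical choices for the example.

    #include <stdio.h>

    /* Hypothetical parameters: a 16 KB direct-mapped cache, 32-byte blocks. */
    #define CAPACITY   (16 * 1024)
    #define BLOCK_SIZE 32
    #define NUM_BLOCKS (CAPACITY / BLOCK_SIZE)

    static long tags[NUM_BLOCKS];   /* tag of the block resident in each line */
    static long hits, misses;

    /* Simulate one memory access.  A direct-mapped cache has associativity
       one: each block can live in exactly one line, so on a miss the
       replacement policy is forced to evict whatever is resident. */
    static void cache_access(unsigned long addr)
    {
        unsigned long block = addr / BLOCK_SIZE;   /* which memory block */
        unsigned long line  = block % NUM_BLOCKS;  /* its single possible line */

        if (tags[line] == (long)block) {
            hits++;
        } else {
            misses++;
            tags[line] = block;
        }
    }

    int main(void)
    {
        for (int i = 0; i < NUM_BLOCKS; i++) tags[i] = -1;  /* empty cache */

        /* A sequential scan of 8-byte keys exhibits spatial locality: one
           miss loads a block, then the next three accesses hit in it. */
        for (unsigned long a = 0; a < (1UL << 20); a += 8) cache_access(a);

        printf("hit ratio: %.3f\n", (double)hits / (hits + misses));
        return 0;
    }

With 32-byte blocks and 8-byte keys the scan misses once per four accesses, so the printed hit ratio is 0.75; a reference stream that also revisited recently touched blocks would raise the ratio further through temporal locality.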
3 Design and Evaluation Methodology.

Cache locality is a good thing. When spatial and temporal locality can be improved at no cost it should always be done. In this paper, however, we develop techniques for improving locality even when it results in an increase in the total number of executed instructions. This represents a significant departure from traditional design and optimization methodology. We take this approach in order to show how large an impact cache performance can have on overall performance. Interestingly, many of the design techniques are not particularly new. Some have already been used in optimizing compilers, in algorithms which use external storage devices, and in parallel algorithms. Similar techniques have also been used successfully in the development of the cache-efficient AlphaSort algorithm [19].

As mentioned earlier we focus on three measures of performance: instruction count, cache misses, and overall performance in terms of execution time. All of the dynamic instruction counts and cache simulation results were measured using Atom [23]. Atom is a toolkit developed by DEC for instrumenting program executables on Alpha workstations. Dynamic instruction counts are obtained by inserting an increment to an instruction counter after each instruction executed by the algorithm. Cache performance is determined by inserting calls after every load and store to maintain the state of a simulated cache and to keep track of hit and miss statistics.

…algorithm [25] for building a binary heap incurs fewer cache misses than Floyd's method. In addition we have shown [17, 16] that two other optimizations reduce the number of cache misses incurred by the remove-min operation. The first optimization is to replace the traditional binary heap with a d-heap [14] where each non-leaf node has d children instead of two. The fanout d is chosen so that exactly d keys are the size of a cache block.
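The d-heap optimization in that last fragment comes down to a little index arithmetic plus alignment. The sketch below is our own illustration, not the paper's code, and it assumes 8-byte keys and a hypothetical 32-byte cache block, giving a fanout of d = 4; the helper names alloc_heap, first_child, parent, and min_child are invented for the example. The point of the alignment trick is that every sibling group starts at an index congruent to 1 modulo d, so shifting the array base so that index 1 falls on a block boundary makes each group of d children occupy exactly one cache block.

    #define _POSIX_C_SOURCE 200112L
    #include <stdlib.h>

    /* Hypothetical sizes: 32-byte cache blocks and 8-byte keys give a
       fanout of d = 4, so the d children of a node fill one block. */
    #define BLOCK_SIZE 32
    #define KEY_SIZE   sizeof(long)
    #define D          (BLOCK_SIZE / KEY_SIZE)

    /* With the root at index 0, the children of node i occupy indices
       D*i+1 .. D*i+D, and the parent of node j is (j-1)/D (the parent
       function is used by insert, which is not shown here). */
    static size_t first_child(size_t i) { return D * i + 1; }
    static size_t parent(size_t j)      { return (j - 1) / D; }

    /* Allocate so that &heap[1] lies on a block boundary.  Every sibling
       group starts at an index = 1 (mod D), so this aligns all of them. */
    static long *alloc_heap(size_t n)
    {
        void *raw;
        if (posix_memalign(&raw, BLOCK_SIZE, (n + D) * KEY_SIZE) != 0)
            return NULL;
        return (long *)raw + (D - 1);   /* shift base so index 1 is aligned */
    }

    /* Find the smallest of node i's children (caller ensures the node has
       at least one child, i.e. first_child(i) < n).  With aligned sibling
       groups this loop touches a single cache block: at most one miss. */
    static size_t min_child(const long *heap, size_t n, size_t i)
    {
        size_t c = first_child(i), best = c;
        for (size_t k = c + 1; k < c + D && k < n; k++)
            if (heap[k] < heap[best]) best = k;
        return best;
    }

Remove-min's re-heapify loop then repeatedly promotes min_child down the heap; each level examined costs at most one block load, and the larger fanout also makes the heap shallower, which is where the cache savings come from.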