Cache Performance of Fast-Allocating Programs

Marcelo J. R. Gonçalves and Andrew W. Appel
Princeton University, Department of Computer Science
{mjrg,appel}@cs.princeton.edu

Abstract

We study the cache performance of a set of ML programs, compiled by the Standard ML of New Jersey compiler. We find that more than half of the reads are for objects that have just been allocated. We also consider the effects of varying software (garbage collection frequency) and hardware (cache) parameters. Confirming results of related experiments, we found that ML programs can have good cache performance when there is no penalty for allocation. Even on caches that have an allocation penalty, we found that ML programs can have lower miss ratios than the C and Fortran SPEC92 benchmarks.

Topics: 4 benchmarks, performance analysis; 21 hardware design, measurements; 17 garbage collection, storage allocation; 46 runtime systems.

To appear in SIGPLAN/SIGARCH/WG2.8 Conf. on Functional Programming Languages and Computer Architecture (FPCA '95). © 1995 Association for Computing Machinery.

1 Introduction

With the gap between CPU and memory speed widening, good cache performance is increasingly important for programs to take full advantage of the speed of current microprocessors. Most recent microprocessors come with a small on-chip cache, and many machines add a large second-level off-chip cache to that. It is therefore important to understand the memory behavior of programs, so that computer architects can design cache organizations that can exploit the locality of reference of these programs, or programmers can modify the programs to take advantage of cache properties, or both. This work studies the cache performance of programs with intensive heap allocation and generational garbage collection, such as ML programs compiled by the Standard ML of New Jersey (SML/NJ) compiler [3].

Dynamic heap allocation is used in the implementation of many programming languages, but the SML/NJ implementation is characterized by a much higher allocation rate than most other implementations: a typical program allocates one word for every six machine instructions. SML/NJ allocates so frequently because it allocates function activation records (closures) on the heap, instead of using a stack as do most language implementations [2]. In order to make heap allocation efficient it is essential that both allocation and deallocation be very fast [1]. Efficient allocation can be achieved by allocating sequentially from a large contiguous memory area, the allocation space. We only need to keep two pointers, one to the next free address, the allocation pointer, and one to the last usable address, the limit pointer. Before the allocation of a new object, if the allocation pointer plus the size of the object is greater than the limit pointer, a garbage collection must be performed to reclaim the space of objects that are no longer needed. Otherwise, the object is allocated and the allocation pointer is incremented by the size of the object. Efficient deallocation can be achieved with the use of a generational garbage collector. Early versions of SML/NJ used a simple two-generational garbage collector. Future releases will use a multi-generational collector, which we have measured in this work.
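To make the allocation sequence concrete, here is a minimal sketch of bump-pointer allocation under the assumptions just stated; the names (alloc_ptr, limit_ptr, minor_gc) are illustrative, not the actual SML/NJ runtime interface, and the stub collector below merely resets the pointer instead of copying live objects.

```c
#include <stddef.h>
#include <stdint.h>

#define ALLOC_WORDS (512 * 1024 / sizeof(uintptr_t))  /* a 512K-byte allocation space */

static uintptr_t heap[ALLOC_WORDS];
static uintptr_t *alloc_ptr = heap;                   /* next free address            */
static uintptr_t *limit_ptr = heap + ALLOC_WORDS;     /* last usable address          */

static void minor_gc(void)
{
    /* A real generational collector copies live objects into the first
       generation here; this stub merely resets the allocation pointer. */
    alloc_ptr = heap;
}

/* Allocate an object of n words; n includes the one-word descriptor (tag). */
static uintptr_t *allocate(size_t n, uintptr_t descriptor)
{
    if (alloc_ptr + n > limit_ptr)  /* would the new object pass the limit pointer? */
        minor_gc();                 /* reclaim space, then allocate as usual        */
    uintptr_t *obj = alloc_ptr;
    obj[0] = descriptor;            /* objects are preceded by their descriptor     */
    alloc_ptr += n;                 /* bump the allocation pointer by the size      */
    return obj;
}
```

The limit check and the pointer bump compile to a handful of instructions, which is what makes an allocation rate of one word per six instructions sustainable.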
It might seem that allocating closures sequentially on the heap would be terrible for cache performance. After all, stacks provide a simple and effective way of recycling memory, and have very good spatial locality. With sequential heap allocation, on the other hand, the first allocation in each cache block can be a cache miss (allocation miss), which could result in bad cache performance.

We have comprehensively studied the cache performance of a set of ML programs. We consider the effects of varying both software and hardware (cache) parameters. Our main contributions are the following:

• We examine in detail the memory reference patterns of a set of ML programs. We find that more than half of the read references are to objects (typically closures) that have just been allocated. This means that heap-allocated closures can have good locality; on suitable cache architectures, stack allocation should not be preferred over heap allocation just for reasons of locality. Diwan et al. [14] suggest that ML programs tend to read objects soon after they have been written, based on indirect evidence from cache behavior on different architectures, and we have confirmed this by direct experiment.

• Varying minor garbage collection frequency (i.e., changing the size of the allocation space) may have an impact on cache performance. In the case of caches without an allocation miss penalty, the impact is very small and is largely offset by changes in garbage collection overhead. On caches with a penalty for allocation misses, however, making the allocation space fit in the cache can result in a significant improvement in performance. We show quantitatively for what cache sizes and miss penalties fitting the allocation space in the cache improves performance.

• We report a comprehensive set of miss ratios, for varying cache sizes, block sizes, associativity, and write miss policies. We find that ML programs can have very good cache performance on some architectures, comparable to or even better than that of the C and Fortran SPEC92 benchmarks. We also found that allocation misses depend mostly on block size, and if blocks are not too small (at least 32 bytes), even on architectures that do not eliminate allocation misses, ML programs can have much lower miss ratios than the C and Fortran programs.

• On the DEC3000 Alpha workstation, we find that fitting the allocation space in the secondary cache can result in significant performance improvement, even though the machine has a deep write buffer that should eliminate penalties for allocation misses. The ineffectiveness of the write buffer is related to the fact that so many reads are to recently allocated objects: some of those reads cause the write buffer to be flushed prematurely.

• The frequency of major collections and the number of active generations can have a significant impact on cache performance. We found that a large number of active generations can increase the number of conflict misses.

The problem of the cache performance of programs with sequential allocation has been addressed previously. Peng and Sohi [29] studied the cache performance of a set of Lisp programs. They show that conventional cache memories were inadequate for these programs and proposed the use of an allocate instruction to eliminate allocation misses. They also show that the LRU replacement policy commonly used in associative caches is bad for this class of programs and proposed an alternative replacement policy. Finally they proposed the use of a control bit to avoid writing back cache blocks filled with garbage. Wilson et al. [38] and Zorn [40, 39] proposed fitting the allocation space in the cache as an alternative, software-based method for eliminating allocation misses on large caches. Wilson et al. also show that modestly set-associative caches can achieve a significant performance improvement over direct-mapped caches. In essence, the main conclusion of these papers is that programs with dynamic heap allocation tend to have bad cache performance—and either hardware techniques, or software techniques, or a combination of both, must be used in order to improve cache performance.

More recent work, by Koopman et al. [27], Diwan et al. [14], and Reinhold [31, 32], has shown that some cache design features, already available on current machines, can eliminate all of the allocation misses. Moreover, Reinhold shows that sequential allocation, due to the fact that it tends to spread memory references uniformly across memory, is naturally suited to direct-mapped caches.

Following Jouppi [24] we classify cache architectures as follows. A fetch-on-write cache is one that allocates a block and fetches the contents of the block from memory on a write miss. A write-validate cache allocates a block, but does not fetch the contents from memory on a write miss (and thus must mark each word of every cache line as valid or invalid). And a write-around cache bypasses the cache entirely on a write miss.

Table 1 describes the basic features of the caches of some current machines.
2 The SML/NJ Runtime System

The SML/NJ runtime system uses a generational garbage collector. The two-generation collector used up to version 0.93 proved inadequate on long-running programs, so John Reppy has implemented a multi-generation collector [33]. Our measurements reported here are of an early prototype of Reppy's collector, which will be distributed with future versions of SML/NJ.

The heap is divided into an allocation space and from one to seven generations. The generations are numbered from one to seven, and each higher-numbered generation contains objects that are older (i.e., have been allocated at an earlier time) than the objects in the previous generations. In addition, within each generation, objects are divided into two age groups.

Most objects (closures and user data structures) are allocated in the allocation space. The only exception is for big objects (more than 512 words), which are allocated in the first generation. Each generation (but not the allocation space) is divided into five arenas, one for each class of objects: records, pairs, arrays, strings, and code. All objects are preceded by a one-word descriptor (tag), except pairs copied into the first or older generations. Removing the descriptor from pairs not only saves space, but might also improve cache performance, since more pairs can fit into a single block, and pairs can be aligned to block boundaries.

There are two types of garbage collections, minor and major. Minor collections are very frequent but brief, and copy live objects from the allocation space to the first generation. Major collections are longer, but much less frequent, and collect from the older generations. A kth-generation major collection promotes the older objects of each generation i into generation i + 1, for 1 ≤ i ≤ k; a fixed number of kth-generation collections occur for each (k+1)th-generation collection.

The early prototype we measured lacks certain features of the collector Reppy describes [33], particularly the use of a mark-and-sweep collector for code and other big objects. This change is likely to decrease major garbage collection overhead (from the amount we measured) on some of the benchmark programs, particularly when a small allocation space is used.

3 Methodology

We used a benchmark suite of ten ML programs, which were compiled and run in SML/NJ, version 0.93, with the new runtime system and generational collector. We compiled each program with the default parameters of the compiler, and saved a heap image of the system. We then used a MIPS instruction-level simulator, written by E. Gün Sirer, to simulate execution of the entire SML/NJ system—including the runtime system, which loads the heap image and executes the programs. We have extended the MIPS simulator with a cache simulator, so that cache misses are determined on the fly, with no need to generate reference traces. Our simulation results do not include operating system data.

We also ran some experiments on a DEC3000 Alpha workstation. For these experiments we used SML/NJ version 1.05a, since version 0.93 was not ported to the Alpha architecture. The main difference of version 1.05a from version 0.93 is the use of more efficient closure representations [34], which reduces the amount of closure allocation. Version 1.05a also uses the newest version of the multi-generational collector, which uses a mark-and-sweep collector for large objects.

The benchmark programs are described in Table 2. Some or all of these programs have been used as benchmark programs in other works on the performance of SML/NJ programs [2, 34, 14, 36]. Each of these programs is run in its entirety by the simulator. Four of the programs—Barnes-Hut, Mandelbrot, Ray, and Simple—use floating-point intensively; the other six use only integer instructions.

Table 3 shows the run time of the programs, measured on a DECstation 5000/240, with time breakdowns by user code, garbage collection (GC) time, and system time. We ran each program a number of times, and took the run with minimum total time (as recommended by the SPEC consortium [20]). Table 4 shows instruction counts and numbers of reads and writes for user and garbage collection code. Garbage collection time and instruction counts can vary considerably if the size of the allocation space is changed. User times can also vary, as the size of the allocation space may have an impact on cache performance. User code instruction counts are not affected by the size of the allocation space. The data shown in these tables are for a 512-Kbyte allocation space.

Table 1: Cache organization of some current machines and CPUs. (I = instruction cache; D = data cache; d-m = direct-mapped; L1/L2 = first/second-level cache; ? = unknown. "WV" in the explicit-allocation column means an explicit allocate is not needed, because of a write-validate policy or a 1-word cache line.)

DEC5000/240 (1990): 64K I + 64K D; 4-byte blocks; d-m; write-allocate: yes; write-miss penalty: no; explicit alloc: WV. Fetches 8 blocks (32 bytes) on a read miss. [13]

Alpha 21064 (1992): 8K I + 8K D; 32-byte blocks; d-m; write-allocate: no; write-miss penalty: no; explicit alloc: load instruction (for explicit cache-line allocation by prefetch), which "semi-stalls." On-chip first-level cache, with on-chip control for second-level cache. Write-around. [12]

Alpha 21164 (1994): 8K I (L1) + 8K D (L1), 32-byte blocks, d-m; 96K (L2), 32/64-byte blocks, 3-way; write-allocate: no; write-miss penalty: no; explicit alloc: fetch. On-chip control for a third-level direct-mapped cache. The load instruction, usable for explicit cache-line allocation by prefetch, is non-blocking.

DEC3000/500 (1992): 8K I (L1) + 8K D (L1), 32-byte blocks, d-m, write-allocate: no, write-miss penalty: no; 512K (L2), 32-byte blocks, d-m, write-allocate: yes, write-miss penalty: yes (fetch-on-write). Uses an Alpha 21064 CPU. [15]

PowerPC 601 (1993): 32K; 64-byte blocks; 8-way; write-allocate: yes; write-miss penalty: no (non-blocking fetch-on-write [23]); explicit alloc: yes (cache-line-allocate-and-zero instruction). On-chip first-level cache. LRU replacement. [7]

PowerPC 603 (1993): 8K I + 8K D; 32-byte blocks; 2-way; write-allocate: yes; write-miss penalty: no (non-blocking fetch-on-write); explicit alloc: yes (cache-line-allocate-and-zero). On-chip first-level cache. LRU replacement. [8]

PowerPC 604 (1994): 16K I + 16K D; 32-byte blocks; 4-way; write-allocate: yes; write-miss penalty: no (non-blocking fetch-on-write); explicit alloc: yes (cache-line-allocate-and-zero). On-chip first-level cache. LRU replacement. [35]

Pentium (1993): 8K I + 8K D; 32-byte blocks; 2-way; write-allocate: no; write-miss penalty: no; explicit alloc: no. On-chip first-level cache. Pentium systems usually have a 256K second-level cache. [10]

Intel P6 (1995): 8K I (L1), 2-way + 8K D (L1), 4-way; 256K (L2), 4-way; 32-byte blocks; write-allocate: ?; write-miss penalty: ?; explicit alloc: fetch (for prefetch, 4 concurrent L2-cache accesses). L2 cache on a separate die, packaged with the CPU. [22]

SuperSPARC (1993): 20K I (L1), 4/8-byte blocks, 5-way; 16K D (L1), 4-byte blocks, 4-way; write-allocate: yes; write-miss penalty: no; explicit alloc: WV. On-chip first-level cache. LRU replacement. [17]

HP PA-RISC (1992): 4K–1M I + 4K–4M D; 32-byte blocks; d-m; write-allocate: yes; write-miss penalty: no; explicit alloc: WV. On-chip support for dual off-chip caches. Cache size is system dependent. [18]

Table 2: General information about the benchmark programs.

Barnes-Hut (1060 lines): The Barnes-Hut N-body simulation program [6, 5], translated into ML by John H. Reppy.
Boyer (910 lines): An ML implementation of the Boyer benchmark [19].
Knuth-Bendix (580 lines): An implementation by Gerard Huet of the Knuth-Bendix completion algorithm, translated into ML by Xavier Leroy.
Lexgen (1178 lines): A lexical-analyzer generator, written by James S. Mattson and David R. Tarditi [4], processing the lexical description of Standard ML.
Life (140 lines): Reade's implementation of the game of Life [30].
Mandelbrot (60 lines): A program to generate Mandelbrot sets.
MLYACC (7422 lines): A LALR(1) parser generator, written by David Tarditi [37], processing the grammar of Standard ML.
Ray (423 lines): A ray tracer, written by Don Mitchell, and translated into ML by John H. Reppy.
Simple (906 lines): A spherical fluid-dynamics program [11, 16], translated into ML by Lal George.
VLIW (3571 lines): A VLIW instruction scheduler, written by John Danskin.

Table 3: Total, user, garbage collection, and system times in seconds for the benchmark programs, measured on a DEC5000/240 workstation, with an allocation space of 512K bytes.

Program        Total   User    GC     Sys.
Barnes-Hut     26.41   24.96   1.46   0.42
Boyer           2.14    1.32   0.82   0.18
Knuth-Bendix   12.49    9.25   3.24   0.70
Lexgen         10.87    9.97   0.89   0.25
Life           11.66   11.50   0.16   0.05
Mandelbrot     11.15   11.12   0.04   0.04
MLYacc          4.50    3.52   0.98   0.33
Ray            22.34   22.21   0.12   0.71
Simple         37.69   29.14   8.55   1.39
VLIW           21.17   20.43   0.75   0.41

Table 5: Measured time on a DECstation 5000/240 compared to the time predicted by the simple machine model. All times are in seconds. The size of the allocation space is 512K bytes. The value in the column labeled Difference is given by ((Measured time / Predicted time) − 1) × 100.

Program        Measured (sec.)   Predicted (sec.)   Difference
Barnes-Hut     26.41             21.54              23%
Boyer           2.14              2.17              -1%
Knuth-Bendix   12.49             10.63              17%
Lexgen         10.87             10.65               2%
Life           11.66             11.35               3%
Mandelbrot     11.15              9.33              20%
MLYacc          4.50              4.18               8%
Ray            22.34             22.72              -2%
Simple         37.69             32.76              15%
VLIW           21.17             15.72              35%

Table 4: Instruction counts, reads, and writes for user code and garbage collection. Garbage collection counts are for an allocation space of 512K bytes. All numbers are in millions.

Program        User instr.  reads    writes     GC instr.  reads   writes
Barnes-Hut       617.5      179.8    125.5        39.7       6.1     4.7
Boyer             42.9        9.5      9.0        27.1       5.0     3.7
Knuth-Bend.      281.0       57.2     67.0        73.8      14.2    10.2
Lexgen           296.0       55.4     38.2        25.6       4.7     3.3
Life             422.9       50.0     38.3         4.9       0.9     0.6
Mandelbrot       367.6       82.7     53.1         1.2       0.2     0.1
MLYACC            97.8       21.3     18.7        32.1       5.8     4.3
Ray              679.2      192.4    141.3         2.7       0.5     0.3
Simple           764.3      174.3    130.3       232.0      38.0    27.5
VLIW             427.2       82.1     73.1        19.7       3.3     2.2
Total           3664.6      877.0    694.4       458.8      78.7    56.9

Table 6: Total allocation (in words), allocation rate (total allocation divided by number of user instructions), and distribution of allocation and non-allocation writes for the benchmark programs.

Program        Allocation writes (words)   Allocation rate   Fraction of alloc. writes
Barnes-Hut     124,368,910                 0.201             99.07%
Boyer            8,802,524                 0.205             98.25%
Knuth-Bendix    66,837,173                 0.238             99.74%
Lexgen          33,024,196                 0.112             86.52%
Life            38,104,645                 0.090             99.59%
Mandelbrot      48,712,383                 0.133             91.81%
MLYACC          17,611,127                 0.180             94.22%
Ray            139,506,521                 0.205             98.72%
Simple         129,215,509                 0.169             99.18%
VLIW            67,076,362                 0.157             91.75%
Average                                    0.169             95.885%

Performance Model

We use a simple model to predict the performance of the programs under different cache organizations. We assume that the machine can issue one instruction per cycle, with no pipeline stalls, except for memory accesses. Loads or stores that hit the cache take one cycle, and cache misses take a fixed number of cycles. We allow for different read and write miss penalties. We assume that a cache miss stalls the instruction pipeline. The total number of cycles of a program is given by:

TotalCycles = IC + IMiss · Rpenalty + RMiss · Rpenalty + WMiss · Wpenalty

where IC is the instruction count, IMiss the number of instruction fetch cache misses, RMiss the number of read misses, WMiss the number of write misses, Rpenalty the penalty for read misses in cycles, and Wpenalty the penalty for write misses in cycles.

Table 5 compares the run time of each program, measured on a DEC5000/240 workstation¹, with the time predicted by the model. Most of the error probably comes from the assumption that each instruction takes one cycle. The program with the highest relative error, VLIW, executes many integer division operations, which take several cycles to complete. Barnes-Hut, Mandelbrot, and Simple execute many floating-point instructions, which may take many cycles to complete.

¹ The DEC5000/240 has dual 64K-byte direct-mapped caches, with one-word (4-byte) blocks, a 32-byte fetch size on read misses, and a 34-cycle miss penalty. Instruction and data miss ratios were obtained by simulating caches similar to the real caches. We approximated the data cache by a 32-byte-block write-validate cache, which should behave similarly to the real cache.

Because we are interested in the performance of the programs with different settings for garbage collection parameters, which may affect the instruction count of the programs, we also use a performance metric that takes this change into account. This metric is the number of cycles per user program instruction, CPUI. The number of user program instructions is the total number of instructions minus garbage collection instructions. That is, we charge garbage collection and cache miss cycles to user program instructions. CPUI provides a simple way of evaluating the total cost of cache and garbage collection overhead. We define CPUI as:

CPUI = TotalCycles / (number of user instructions)

All the results presented in this paper are for separate data and instruction caches, with both caches of the same size. Only data miss ratios are shown here, but instruction miss ratios are used to compute CPUI.
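As a concrete restatement of the model (with made-up counts, not measurements from the paper), the following sketch plugs numbers into the two formulas above:

```c
#include <stdio.h>

/* Sketch of the simple performance model; all counts below are illustrative. */
int main(void)
{
    double IC       = 700e6;   /* total instructions (user + GC)   */
    double user_IC  = 650e6;   /* user program instructions        */
    double IMiss    = 4e6;     /* instruction-fetch misses         */
    double RMiss    = 10e6;    /* read misses                      */
    double WMiss    = 12e6;    /* write misses                     */
    double Rpenalty = 20.0;    /* read-miss penalty in cycles      */
    double Wpenalty = 20.0;    /* write-miss penalty in cycles     */

    /* TotalCycles = IC + IMiss*Rpenalty + RMiss*Rpenalty + WMiss*Wpenalty */
    double total_cycles = IC + (IMiss + RMiss) * Rpenalty + WMiss * Wpenalty;

    /* CPUI charges GC and cache-miss cycles to the user instructions. */
    double cpui = total_cycles / user_IC;

    printf("TotalCycles = %.0f, CPUI = %.2f\n", total_cycles, cpui);
    return 0;
}
```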

Figure 1: Write reference patterns for Lexgen. (Plot: address written, mod 512K bytes, versus time in K writes, from time 2000K to 5000K.)

Figure 2: Read reference patterns for Lexgen. (Plot: address read, mod 512K bytes, versus time in K writes, over the same interval.)

4 Memory Reference Patterns

4.1 Write References

We can divide writes into two groups: allocation writes and non-allocation writes. Most writes from all programs are for allocation. Table 6 shows the number of allocated words for each program, the allocation rate, and the fraction of all writes that are for allocation (these numbers do not include garbage collection writes). The allocation rate is the number of words allocated divided by the number of user instructions. All programs allocate at a very fast rate, about one word for each six user program instructions. On average for the 10 programs, approximately 96% of all writes are allocation writes. User-code non-allocation writes update user data structures; but most non-allocation writes are performed by the runtime system.

Allocation writes are sequential, and there is a cyclic pattern with a period that is equal to the size of the allocation space. Figure 1 shows this cyclic pattern for the program Lexgen, with an allocation space of 512K bytes. The graph plots the address written, modulo 512K, versus time, where time is measured in words written. The plot shows the write patterns of Lexgen from time 2000K to 5000K. We can see clearly each allocation cycle, followed by a vertical line, which is a minor garbage collection. One of the minor collections, near time 2500K, is followed by a major collection, which appears as the unusual pattern in the graph. The horizontal lines are non-allocation writes. For this program, most non-allocation writes are concentrated on a few locations that are updated more or less uniformly throughout the program execution. The difference in the density of the lines means that some locations are written more frequently than others. We can see that one of the horizontal lines (the second from the top) seems to move down after the major collection. This is probably because the collector relocates the object that is being updated. As the numbers in Table 6 show, Lexgen is the program with the most non-allocation writes. The horizontal lines that appear in Figure 1 do not appear on the patterns for other programs.

Table 7: Allocation profile for closures and user data. All sizes include the descriptor.

Program        Closures: fraction of alloc. / avg. size (words)    User data: fraction of alloc. / avg. size (words)
Barnes-Hut     44.81% / 7.81                                       55.19% / 3.10
Boyer          81.98% / 6.53                                       18.02% / 3.02
Knuth-Bendix   86.97% / 5.37                                       13.03% / 3.02
Lexgen         80.51% / 6.52                                       19.49% / 3.30
Life           77.83% / 5.53                                       22.17% / 3.00
Mandelbrot     60.68% / 7.00                                       39.32% / 3.00
MLYACC         71.39% / 8.24                                       28.61% / 3.36
Ray            42.07% / 5.58                                       57.93% / 3.58
Simple         60.88% / 5.78                                       39.12% / 3.02
VLIW           76.91% / 6.33                                       23.09% / 2.95
Average        68.40% / 6.47                                       31.60% / 3.13

Objects allocated by each program can be divided into two groups: closures and user data structures. Table 7 shows the distribution of allocation by closures and user data structures, and the average sizes of objects in each group. Closures account for the greatest fraction of all words allocated, 68% on average for the benchmark programs. They are also larger, approximately six and a half words on average. User data, on the other hand, are much smaller, about 3.13 words on average. Most objects tend to have a very short life. Only a small fraction of objects survive a garbage collection, and most surviving objects are user data structures [21, 36].

All objects are preceded by a one-word descriptor, which is included in the size of the objects. In most cases, descriptors are only used by the garbage collector, and since most objects do not survive to a garbage collection, most descriptors are never used. Descriptors account for approximately twenty percent of words allocated. Most objects are allocated in the allocation space. Big objects (more than 2K bytes) are allocated in the first generation, but they account for a negligible fraction of the allocation of the programs in the benchmark suite.

4.2 Read References

Non-garbage-collection reads are to closures, user data structures, user constants, and runtime system objects. Closures and user data structures are allocated in the allocation space. If they are still "alive" at the end of an allocation cycle, they are copied to the first generation, and they can be promoted to older generations after major collections. Floating-point and string literals reside in machine-code objects in older generations. Runtime system objects are located mostly outside the ML heap, but a few may be located in older generations. Garbage collection references are to objects in the heap that are being traced and collected, or to garbage collection data structures outside the ML heap.

Table 8: Read references by class.

Program        Closure   User     Compile-time   Run-time sys.
Barnes-Hut     30.46%    52.43%   14.10%          3.01%
Boyer          44.42     20.06    33.95           1.57
Knuth-Bendix   59.50     36.13     4.09           0.28
Lexgen         35.34     45.47     7.55          11.64
Life           45.17     42.84    11.70           0.29
Mandelbrot     30.64     30.87    38.32           0.17
MLYACC         52.49     31.88    11.77           3.86
Ray            23.00     50.60    23.40           3.00
Simple         42.86     30.51    26.39           0.24
VLIW           51.15     27.94    15.86           5.05
Average        41.50     36.87    18.71           2.91

Table 8 shows the distribution of reads by closures, user data structures, compile-time allocated data (constants and old heap-allocated runtime-system objects), and references to locations outside the ML heap (static runtime-system objects). Most references are to closures and user data, but there is also a significant number of references to compile-time-allocated data. These numbers do not include garbage collection references.

References to closures and user data can potentially be spread across large areas of memory. The allocation space and the older generations can be several megabytes in size. But, in fact, the majority of these references do have some form of locality. This can be seen in Figure 3. The graph shows the cumulative distribution of reads by age. Here, we use allocation to measure time. The age of the target of a read reference is given by the number of words allocated since its allocation. For objects in the allocation space, the age is given by the distance of the object to the allocation pointer. The graph shows that about 60% of the references are to objects which are only 256 words (1K bytes) from the allocation pointer. 70% of those references are to closures, and 84% of all closure references are within this range. Most of the references that are farther than 256 words from the allocation pointer are to user data structures. The remainder, the difference between 1 and the maximum of each curve, are references to constants and user data structures. Garbage collection references are not shown in this graph.

Figure 3: Cumulative distribution of reads as a function of the distance of the address referenced from the allocation pointer. (Plot: fraction of reads within a given distance, per benchmark, versus distance from the allocation pointer in words, from 1 to 256M.)

Read references also show a cyclic pattern, following the cyclic allocation pattern. Figure 2 shows a plot of address read (modulo 512K) versus time (measured in words written) for Lexgen. The graph shows a cyclic pattern similar to the one that can be seen in Figure 1. This pattern corresponds to closure reads. For this program, most other reads tend to be concentrated within a relatively small number of locations, which cause the horizontal stripes on the graph.

In summary, all programs in the benchmark suite have a very regular, cyclic write pattern. More than half of reads are to objects that have just been allocated.

5 Miss Ratios

In this section we report miss ratios for the programs in the benchmark suite. We vary cache size from 4K to 4M bytes, block size from 16 bytes to 256 bytes, and associativity from direct-mapped to 16-way associative. We report read and write miss ratios separately. Due to space limitations, we report only the weighted arithmetic mean of the miss ratios of the 10 benchmark programs². The complete set of miss ratios is available in [21].

² Let m1, m2, r1, r2, i1, and i2 be the number of misses, cache references, and instructions for programs 1 and 2. The average miss ratio is (m1/i1 + m2/i2) / (r1/i1 + r2/i2).

Figure 4 shows read and write miss ratios for direct-mapped caches, with varying block sizes. The allocation space is 512K bytes, which explains the sharp drop in write miss ratios for caches of 512K bytes or more. When the allocation space fits in the cache, there are no allocation misses, except when there are conflicts between blocks in the allocation space and older generations, which are not too common. When the allocation space is larger than the cache, the write miss ratios depend mostly on block size, being approximately 1/(block size in words). The actual miss ratios tend to decrease slightly as cache size increases, because some blocks may be read again after they are evicted from the cache by the allocation cycle. For large blocks and small caches, the write miss rate is higher than expected, because of conflict misses.

Figure 4: Read and write miss ratios on direct-mapped fetch-on-write data caches, for different block sizes (16 to 256 bytes) and cache sizes from 4K to 4M bytes. The allocation space is 512K bytes.

Figure 5: Overall miss ratios on direct-mapped fetch-on-write data caches, for different block sizes, for the ML programs and the SPEC92 benchmarks. The allocation space for the ML programs is 512K bytes.

Figure 6: Read and write miss ratios for direct-mapped and set-associative data caches (direct-mapped to 16-way). Allocation space size is 512K bytes.

For reads, the lowest miss ratios are seen for 32- and 64-byte blocks, on caches up to 128K bytes. For bigger caches, all block sizes from 32 to 256 bytes have similar miss ratios. When read and write miss ratios are combined (Figure 5), blocks of at least 64 bytes tend to have significantly lower miss ratios.

Figure 5 also compares the average miss ratios of the ML programs with those of the SPEC92 benchmarks [20]. In many cases the ratios for the ML programs are lower than those of the SPEC programs. In particular, on very small caches (4K and 8K bytes), and 32-byte or bigger blocks, the miss ratios of the ML programs are significantly lower than those of the SPEC92 benchmarks. On very small caches, the ML programs suffer many allocation misses, but since they have very good locality (i.e., many references to addresses close to the allocation pointer), there are fewer misses of other types. As cache size increases, since the number of allocation misses remains almost constant, the miss ratios of the SPEC programs become lower. Even if we take into account the higher fraction of memory references for the ML programs, we can still say that they have better cache performance than the SPEC programs. For an 8-Kbyte cache, with 32-byte blocks and a 10-cycle miss penalty, and assuming one instruction per cycle, the SPEC programs spend an average of 35% of their time on cache misses, compared with 32% for the ML programs. For 64-byte-block caches, the cache overheads are 36% for SPEC and 23% for ML.

Figure 6 shows the miss ratios for direct-mapped and set-associative caches, with 32-byte blocks. The size of the allocation space is 512K bytes. Associativity improves miss ratios significantly, since it eliminates many conflict misses between objects in the allocation space and older generations. Four-way associativity seems to be good enough, however: there is only a very small decrease in miss ratios for 8-way and 16-way associative caches. As expected, write miss ratios fall sharply when the allocation space fits in the cache. Associativity has unusual results for the case where the cache matches the allocation space size (512K bytes). At this point there is an inversion, with higher miss ratios for higher associativity. Wilson et al. observed something similar [38] in their study of Lisp programs. The problem is that the LRU replacement policy of the associative caches is not good for cyclic reference patterns. For a 512K-byte cache there should be no allocation misses, because the allocation space is also 512K bytes. But there are many references to objects in the older generation, and these references will cause some blocks to be removed from the cache. In this case, the LRU replacement policy would pick the block that is to be written next as the victim, which is obviously not a good idea. But why does this problem happen only for a cache that is the same size as the allocation space? The LRU replacement policy is actually good for objects from the older generations. These objects are likely to have a very long life, and be referenced many times. Thus the LRU policy would tend to keep these objects in the cache, which would reduce the number of read misses. For caches that are larger than the allocation space, there is plenty of room in the cache to hold many objects from the older generation without evicting blocks from the allocation space. And for caches that are smaller than the allocation space, the write to the first word of each block will cause a miss in both the direct-mapped and set-associative caches.

Figure 7: Miss ratios on write-validate and write-around caches, compared to the read miss ratios of fetch-on-write caches and to the sum of miss ratios of fetch-on-write caches when the write misses are counted as read misses.

Figure 8: Effects of varying allocation space size (16K bytes to "infinite") on read and write miss ratios, for direct-mapped data caches from 4K to 4M bytes, with 32-byte blocks.

Figure 9: Average cycles per user instruction on 512K-byte, 32-byte-block, write-validate, direct-mapped separate data and instruction caches, with a 20-cycle read miss penalty. (Plot: CPUI versus allocation space size, 16K bytes to infinite; components: Dcache, Icache, major GC, minor GC.)

In summary, write miss ratios depend mostly on block size, and can be very high for very small blocks. If block sizes are not too small, miss ratios of ML programs can be lower than those of SPEC programs.

5.1 Write-validate and write-around caches

Write-validate caches always, and write-around caches usually, outperform fetch-on-write caches [24]. Write-validate and write-around avoid fetching data on write misses, but they may have to fetch data later on a read. In other words, write-validate and write-around do not have write misses, but they may have more read misses, although they never have more misses than the sum of read and write misses of a fetch-on-write cache. Figure 7 compares the miss ratios of the ML programs on write-validate and write-around caches to the miss ratios on fetch-on-write caches. This graph shows the read miss ratios of fetch-on-write, write-validate, and write-around caches. The curve labeled "fetch-on-write read+write misses" is the sum of read and write misses on a fetch-on-write cache divided by the number of reads. This curve is shown to give an idea of the fraction of misses that are eliminated by write-validate and write-around caches.

Write-validate is the best organization for fast-allocating programs, a result already shown by Koopman et al. [27] and Diwan et al. [14]. It eliminates all of the write misses without adding many read misses, as can be seen from the comparison with the read miss ratio of fetch-on-write caches. On write-around caches, on the other hand, most write misses of the fetch-on-write cache simply become read misses. This is because most objects are read soon after they are allocated, and the reads will cause a miss if the writes bypass the cache. So the miss ratio of write-around is only a little better than that of fetch-on-write.

Part of the difference between write-around and fetch-on-write misses comes from garbage collection. Since many objects are not read right after they are collected, bypassing the cache may save some misses. Some user data structures may also not be read soon after being allocated, and some misses may also be saved in this case.

6 Effect of garbage collection frequency

We varied the size of the allocation space from 16K bytes to "infinite" (i.e., the allocation space can hold all data allocated by the programs, with no need for garbage collections). A small allocation space means very frequent garbage collections. As the size of the allocation space increases, the frequency of collections decreases. Figure 8 shows how miss ratios vary with the size of the allocation space for direct-mapped caches, with 32-byte blocks. The size of the allocation space has a significant impact on write misses, but almost no effect on read misses. On small caches, there is a tendency for a slight increase in read miss ratios as the frequency of collections decreases, but this is not sufficient to compensate for the extra garbage collection overhead. There is a slightly higher increase in read miss ratios on small 2-way associative caches, but still not high enough to justify the use of an aggressive garbage collection policy.

The most unusual result was found for the program Simple. Frequent collections increase miss ratios for this program significantly, especially on large direct-mapped caches.
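Returning to the write-miss policies compared in Section 5.1, the sketch below illustrates how a toy direct-mapped data cache model might treat a write miss under each of the three policies; the structure and function names are ours, not those of the simulator used for the measurements.

```c
#include <stdbool.h>
#include <stdint.h>

enum policy { FETCH_ON_WRITE, WRITE_VALIDATE, WRITE_AROUND };

#define BLOCK_WORDS 8     /* 32-byte blocks of 4-byte words */
#define NUM_SETS    1024

struct line {
    bool     present;
    uint32_t tag;
    bool     valid[BLOCK_WORDS];  /* per-word valid bits (used by write-validate) */
};

static struct line cache[NUM_SETS];

/* Handle a data write to word address addr.  Returns true if a block had to be
   fetched from the next level, i.e., a write-miss penalty was paid. */
static bool write_word(enum policy p, uint32_t addr)
{
    uint32_t set = (addr / BLOCK_WORDS) % NUM_SETS;
    uint32_t tag = addr / (BLOCK_WORDS * NUM_SETS);
    struct line *l = &cache[set];
    int w;

    if (l->present && l->tag == tag) {          /* write hit */
        l->valid[addr % BLOCK_WORDS] = true;
        return false;
    }
    switch (p) {
    case FETCH_ON_WRITE:                        /* allocate the block and fetch it */
        l->present = true;  l->tag = tag;
        for (w = 0; w < BLOCK_WORDS; w++) l->valid[w] = true;
        return true;                            /* fetched from memory: penalty paid */
    case WRITE_VALIDATE:                        /* allocate, validate only this word */
        l->present = true;  l->tag = tag;
        for (w = 0; w < BLOCK_WORDS; w++) l->valid[w] = false;
        l->valid[addr % BLOCK_WORDS] = true;
        return false;                           /* no fetch, so no write-miss penalty */
    case WRITE_AROUND:                          /* bypass the cache entirely */
    default:
        return false;                           /* the write goes to the next level;
                                                   a later read of this block will miss */
    }
}
```

The allocation-miss behavior discussed above corresponds to the FETCH_ON_WRITE and WRITE_AROUND cases: for sequentially allocated data, the first write to each block misses, and either the block is fetched needlessly or the subsequent read of the freshly written object misses instead.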

Figure 10: Average cycles per user instruction on 512K-byte, 32-byte-block, fetch-on-write, direct-mapped separate data and instruction caches, with a 20-cycle read/write miss penalty. (Plot: CPUI versus allocation space size, 16K bytes to infinite; components: Dcache, Icache, major GC, minor GC.)

Figure 11: The minimum penalty for which it is better to fit the allocation space in the cache, instead of using a large allocation space of 8M bytes, assuming separate I & D caches, 32-byte blocks, direct-mapped, fetch-on-write data caches, and the same penalty for reads and writes. (Plot: penalty, per benchmark, versus cache size from 16K to 4M bytes.)

Figure 12: Same as Figure 11, but with the use of a very large first generation, so that there are no major collections.

We found that the problem is that the program has a large working set, and doing frequent collections causes objects to be spread among many generations. These may cause many conflict misses, since it is possible that different generations may be aligned to the same cache blocks. When each generation has only a few objects in it, the alignment of the beginning of each generation is particularly bad. A "randomized" alignment might help.

When we ran the program with a large first generation, so that there is no need for major collections, we noticed a significant decrease in miss ratios. Moreover, the miss ratios become insensitive to the frequency of minor collections, much like the other programs. So the problem here is the frequency of major collections, and not the frequency of minor collections.

Figure 9 shows average CPUI (the arithmetic mean of the CPUI for each program) on a 512-Kbyte direct-mapped write-validate cache³ with a 20-cycle miss penalty. CPUI decreases with the increase in allocation-space size because of the decrease in garbage collection overhead. However, when the allocation space becomes larger than 2M bytes, the variation in CPUI is almost negligible (unless the allocation space becomes very large, close to "infinity", which should be impractical for most programs). Therefore, an allocation space of about 2M bytes should be a good choice for most programs.

³ In fact, the data is for a fetch-on-write cache with 0-cycle write penalty. As shown earlier, this should be a very good approximation of a write-validate cache.

The situation is different for a fetch-on-write cache. Figure 10 shows average CPUI for a 512K-byte fetch-on-write cache, with a 20-cycle miss penalty. Using an allocation space that fits in the cache improves performance by about 16% in comparison with the run with an infinite heap, and by 24% in comparison with the run with an 8M-byte allocation space. For large fetch-on-write (and write-around) caches, the number of cycles saved by the elimination of allocation misses offsets the extra garbage collection cycles.

In conclusion, the size of the allocation space, which determines garbage collection frequency, can have a significant impact on the cache performance of caches that have an allocation-miss penalty, such as fetch-on-write and write-around. The size of the allocation space has almost no effect on the cache performance of ML programs on write-validate caches.

6.1 Adapting to caches with allocation misses

Ideally, we would like to be able to run programs with intensive allocation on machines with caches that have no penalty for allocation writes. But there are many machines that have a cache miss penalty for allocation, and for those machines, it may be possible to eliminate allocation misses by fitting the allocation space in the cache. But doing this can increase garbage collection overhead, which may offset gains from eliminating cache misses. Whether it is good to fit the allocation space in the cache depends on the cache size, miss penalty, and garbage collection overhead of each program.

We performed the analysis for our benchmark programs. Figure 11 shows the results. Each curve plots the minimum miss penalty, for each cache size, for which fitting the allocation space in the cache gives better performance than using a very large allocation space (8M bytes).
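The curves in Figure 11 can be read as solutions to a simple break-even inequality. The sketch below restates it with illustrative names; under the paper's one-cycle-per-instruction model the break-even penalty is approximately this ratio, with both quantities taken from the simulations of each program at each cache size.

```c
/* Illustrative restatement of the break-even test behind Figure 11.
   Shrinking the allocation space to fit the cache eliminates allocation misses
   but adds garbage-collection work; it pays off when
       alloc_misses_eliminated * miss_penalty > extra_gc_cycles.          */
static double break_even_penalty(double alloc_misses_eliminated,
                                 double extra_gc_cycles)
{
    /* The smallest miss penalty for which fitting the allocation space wins. */
    return extra_gc_cycles / alloc_misses_eliminated;
}
```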

For a 512K-byte cache, for example, it is better to keep the allocation space in the cache if the penalty is at least 9 cycles. In most cases, a 16K-byte cache would be too small to justify fitting the allocation space in the cache, except if the penalty is very high. But for some programs it may be a good idea to do so for 32K-byte caches.

One problem with Reppy's generational collector is that the sizes of the older generations depend on the size of the allocation space. This may not be a good idea when we want to reduce the size of the allocation space so that it fits in the cache. We may want to reduce the size of the allocation space, and increase the size of the older generations, so as to compensate for the increase in minor collection overhead with a decrease (or at least a non-increase) in major collection overhead. We did the same experiments as above using a very large first generation, so that the programs run with no major collections. The results are shown in Figure 12. Because in many of the programs most garbage collection overhead comes from major collections, in most cases it is better to keep the allocation space inside the cache.

We conclude that reducing the size of the allocation space to fit in large and medium-sized fetch-on-write or write-around caches can improve the performance of most programs, even though it may increase garbage collection overhead. Reducing major collection frequency to compensate for the increase in minor collections can improve performance even more.

The results presented here should be used only as an approximation. Many features found on real machines are ignored by the performance model and can affect the results, such as multi-cycle instructions and non-blocking misses, which should reduce the relative effect of cache misses on performance. On machines where these features have an impact on performance, the actual break-even miss penalties are likely to be a little higher than the values shown here.

In this paper we have discussed only split I/D caches, but Gonçalves's Ph.D. thesis [9] analyzes both split and unified caches. We have shown measurements for many sizes of single-level cache. Most modern machines have two or three levels of cache. We believe that most of our results are applicable to these machines, taken one cache at a time. For example, consider a machine with an 8k primary and a 512k secondary cache, with specific (perhaps different) write-miss policies for the two caches. Our analysis might show that the allocation space should not be kept in the 8k cache, but should be kept in the 512k cache. In this case, it is obvious how to apply the results. Only in a contrived configuration (fetch-on-write 500-cycle-miss 8k primary cache, write-validate secondary cache) would our data give contradictory recommendations (keep allocation space in primary cache but not secondary).

7 A case study: the DEC3000 Alpha workstation

The DEC3000/500 Alpha workstation [15] uses a DEC Alpha 21064 microprocessor that has 8-Kbyte on-chip instruction and data caches, and support for a second-level cache. The first-level data cache is write-around, and write misses are non-blocking. There is a four-entry write buffer, with one 32-byte block per entry, and writes that miss the first-level cache go to the write buffer. The machine has a 512K-byte second-level fetch-on-write cache (Bcache), but if a full block is in the write buffer, there is no need to fetch the block from main memory, since the entire block is going to be overwritten.

Because ML programs allocate sequentially at a very fast rate, we may expect that few partially filled blocks will be written to the second-level cache, and thus we would not have a (second-level cache-miss) penalty for allocation misses. We ran the benchmark programs on a DEC3000/500, varying the allocation space from 16K to 16M bytes. The results are shown in Table 9. Surprisingly, in most cases the best performance occurs when the allocation space fits in the 512K-byte Bcache, which suggests that there is a penalty being paid for allocation misses; that is, there must be some fetches from main memory that are causing the machine to stall when the allocation space does not fit in the Bcache.

These fetches are probably caused by reads to recently written words that are in the write buffer, but have not yet been flushed to the Bcache. Because the first-level data cache is write-around, many of these reads will be misses to the first-level cache (the first read to a recently written block should be a miss), and because the miss policy for this cache is to allocate and fetch a block on read misses, the block that contains the missing word must be fetched — either from the Bcache, or from memory if the reference misses the Bcache. If the allocation space does not fit in the Bcache, and the read is to a recently written word that is still in the write buffer, then this should also be a miss to the Bcache. There are two possibilities: the write buffer entry containing this word is full, or it is only partially filled. If the write buffer entry is full, then the entry can be flushed to the Bcache without causing a fetch from memory, and then the block is fetched into the first-level data cache. The problem is when the write buffer entry is only partially filled. In this case, the whole block must be fetched from memory, incurring a main-memory access penalty, to obtain the remaining words.

We computed the probability of a write buffer entry being flushed because of a read to a partially filled block. For the benchmark programs, on average, 74% of the write buffer entries would be flushed. Considering a penalty of 27 cycles to fetch a block from memory [15], this should add a cost of 0.43 CPUI (cycles per user instruction).⁴ Keeping the allocation space in the cache can eliminate the memory fetch penalty, but it increases garbage collection time. On average, the increased garbage-collection time causes a penalty of about 4% in performance.

⁴ For these calculations, it was assumed that writes are strictly sequential, and that there is at most one entry in the buffer (it also ignores non-allocation writes). But, in fact, when an object is allocated its fields are initialized in reverse order, from last to first, which means that it is possible for more than one entry in the buffer to be partially filled. Therefore, the probability of a write buffer entry being flushed before it is complete can be even higher.

In summary, because ML programs tend to have so many reads to objects that have just been allocated, the Alpha's write buffer must be flushed very frequently, and therefore it is not effective in eliminating allocation misses.

8 Prefetching

Write-allocate caches with no write-miss penalty perform well for fast-allocating programs. On machines with fetch-on-write (with a nontrivial write-miss penalty) or write-around (where write misses will inevitably be followed by expensive read misses), what strategy can the software use? As we have discussed above, avoiding misses by keeping the allocation space in the cache is impractical for small caches (under 100 kbytes).

Some machines (such as the IBM RS/6000 [23]) have an instruction to allocate (and zero) a specified cache line. If this is done in advance of writing to the line, no allocation miss will occur. The compiler can easily insert such instructions when incrementing the allocation pointer: whenever ap ← ap + c, allocate and zero the cache line surrounding address ap + 100. Then, when allocation reaches that line, it will already be in the cache.

On write-around machines without a cache-line-allocate instruction, it is always possible to allocate a specified cache line: just read from it! Reads always allocate in the cache. If, in addition, read misses do not stall the processor, then the compiler can simulate cache-line allocate: whenever incrementing ap, just read from address ap + K into a dead register. If the program allocates (for example) 4 bytes for every 6 instructions executed, then K = 100 will allow 150 instructions to execute before the attempt to write into that line—enough time for the read miss to complete.
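The following is a rough sketch of that garbage-prefetch idea in C; in reality the SML/NJ code generator emits an Alpha load into a dead register when it increments the allocation pointer, so the names and the constant below (PREFETCH_DISTANCE standing in for K) are purely illustrative, and the limit check is omitted.

```c
#include <stddef.h>

#define HEAP_BYTES        (1 << 20)
#define PREFETCH_DISTANCE 100        /* the constant K from the text */

static char heap_area[HEAP_BYTES];
static char *ap = heap_area;         /* hypothetical allocation pointer */

/* Advance the allocation pointer by c bytes and touch the line that allocation
   will reach soon, so that a write-around cache (read-)allocates it in advance. */
static inline void bump_with_garbage_prefetch(size_t c)
{
    volatile char *future = ap + PREFETCH_DISTANCE;
    (void)*future;                   /* "just read from it": a dead read whose only
                                        effect is to allocate the cache line       */
    ap += c;
}
```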

Table 9: Total run time (user, garbage collection, and system) on a DEC3000/500 Alpha workstation, with a 512K second-level cache, running OSF-1 version 3.0, for different sizes of the allocation space. The version of the compiler used is 1.05a. Times are in seconds.

Program        16K    32K    64K    128K   256K   512K   1M     2M     4M     8M     16M
Barnes-Hut     19.78  17.42  16.52  15.73  15.28  15.31  18.12  18.06  18.14  18.14  18.14
Boyer           1.11   0.97   0.88   0.83   0.81   0.78   0.90   0.89   0.88   0.89   1.00
Knuth-Bendix    4.68   4.02   3.73   3.58   3.50   3.53   4.36   4.44   4.55   4.60   4.75
Lexgen          4.43   4.04   3.79   3.72   3.67   3.64   3.91   3.84   3.87   3.96   4.21
Life            0.59   0.58   0.55   0.55   0.53   0.53   0.57   0.58   0.60   0.64   0.63
Mandelbrot      6.12   5.79   5.44   5.28   5.16   5.11   7.10   7.11   7.11   7.16   7.28
MLYACC          2.31   2.01   1.87   1.77   1.69   1.65   1.76   1.71   1.73   1.76   1.84
Ray            36.68  34.02  32.84  32.21  32.19  32.31  36.00  36.36  36.69  35.99  36.13
Simple          9.47   8.29   7.67   7.28   7.15   7.21   8.79   8.73   8.74   8.76   8.92
VLIW            7.83   6.96   6.46   6.28   6.33   6.27   6.93   6.87   6.87   6.90   7.04

Koopman [27] found that this improved the performance of combinator graph reduction by "up to 20%" on the VAX 8800.

We have implemented this technique on the DEC Alpha 21064. On this chip, read misses are "semi-stalling" (they lock up the address box for 5 cycles, but allow the processor to issue and complete arithmetic instructions). Thus, we could not predict whether the extra "semi-stalls" from prefetching would be justified by improved read hits on recently allocated data.

Table 10 shows that prefetching improves the performance of four benchmarks (the only ones we measured) by about 18% on a DEC 3000/400. George Necula has recently improved the instruction scheduler of SML/NJ on the DEC Alpha, and made more comprehensive and accurate measurements (of all the benchmarks) than the ones in Table 10. Using the improved scheduler, with and without prefetching, prefetching causes a 1.5% increase in instructions executed, and a 4.5% decrease in total cycles on a DEC 3000/600 [28].

The difference between Necula's 4.5% and our 18% can be explained as follows:

• Our measurement is for an allocation space much larger than the cache; Necula's allocation space is the same size as the secondary cache (512k), as recommended in earlier sections of this paper. Thus, prefetching avoids secondary-cache misses in our measurement, and only avoids primary-cache misses in Necula's.

• We use (for this measurement) SML/NJ 0.93 with a two-generation garbage collector [1]; Necula uses a more recent version of the compiler with improved instruction scheduling and the multi-generation collector.

• We use a DEC 3000/400; he uses a DEC 3000/600.

The DEC Alpha 21164 allows up to six outstanding memory references at a time (including cache misses) without stalling. We expect that on such a machine, prefetching will make a much greater difference than on the Alpha 21064. It appears that many machines of the near future will have write-around caches with fully nonblocking read misses, so prefetching garbage should become very important.

9 Full-line write-allocate

Write-validate (write-allocate with partial fill, but not fetch-on-write) gives much better performance than write-around for fast-allocating programs. Even for "conventional" C programs, it is well known [24] that write-validate gives significantly better performance than write-around. Why, then, do the Alpha and several other modern processors use write-around? A likely motivation is the desire to keep the primary cache simple (without a "valid bit" for each word) so it can attain the fast access time needed by a fast CPU.

Another important reason is that write-validate makes multiprocessor cache coherence much more difficult to achieve. If Processor P writes word 1 of a cache line, and Processor Q writes word 2, then (with write-around) an external cache-coherence controller can order the writes appropriately. But with write-validate, part of the cache line is valid on processor P, and a different part on processor Q, and perhaps a third part is valid only in main memory. This is a difficult situation.

The cache coherence problem applies to uniprocessors as well, because uniprocessors are now built from processor chips which must be usable (in other products) in multiprocessor configurations. One way to solve the problem is to have the cache-coherence controller merge the valid portions of lines in P's and Q's caches. Merging has been demonstrated at the virtual-memory page level in software [26], and at the cache level in hardware [25]. If merging can be done cheaply, then we regard it as a good solution—any technique that allows write-validate is acceptable.

But we now propose another solution that may be easier to implement than merging. Our solution is less powerful than merging for the shared-memory multiple-writers situation, but is just as good for sequential heap allocation where each processor allocates in a different region of a shared memory (or on a uniprocessor). Our solution should be quite simple and cheap to implement, with only small changes to the write buffer and the primary-cache miss policy.

We propose a new cache write-miss policy called full-line write-allocate. In our scheme, any cache write miss goes into the write buffer and not into the cache. But when the write buffer accumulates a full cache line, that line is allocated into the cache. (If the cache is write-through, then the line is also written to the next level of the memory hierarchy.) Cache read misses that hit in the write buffer should be satisfied from the write buffer, without flushing it.

For sequential writes (such as heap allocations), this is as good as write-validate. It doesn't require valid bits on each word of every line in the cache (as only full lines are put there). It doesn't require much extra hardware: when the write buffer is flushing a complete line to the secondary cache, it must also send it to the primary cache; and the write buffer must be able to handle reads without flushing.

No latency is added to critical execution paths (primary cache hits). Note, for example, that this technique does not require reads from the write buffer to be as fast as primary cache hits; what we wish to avoid is the extremely large latency of a fetch-on-write from main memory to the secondary cache.
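To make the proposed policy concrete, here is a sketch of the write-buffer bookkeeping; the data-structure and helper names are ours, the helpers are stubs standing in for the primary cache and the next level of the hierarchy, and the lookup of the right buffer entry for an address is omitted.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_WORDS 8                 /* one 32-byte write-buffer entry, 4-byte words */

struct wb_entry {
    uint32_t block_addr;              /* block-aligned word address                   */
    bool     valid[BLOCK_WORDS];      /* which words of the line have been written    */
    uint32_t data[BLOCK_WORDS];
};

/* Stubs standing in for the rest of the memory hierarchy. */
static void install_line_in_cache(uint32_t block_addr, const uint32_t *data)
{ (void)block_addr; (void)data; }
static void write_line_to_next_level(uint32_t block_addr, const uint32_t *data)
{ (void)block_addr; (void)data; }

/* A write that missed the primary cache goes into the write-buffer entry for its
   block.  Only when the entry holds a full line is the line allocated into the cache. */
static void full_line_write_allocate(struct wb_entry *e, uint32_t addr, uint32_t word)
{
    uint32_t i = addr % BLOCK_WORDS;

    e->block_addr = addr - i;
    e->data[i]  = word;
    e->valid[i] = true;

    for (int w = 0; w < BLOCK_WORDS; w++)
        if (!e->valid[w])
            return;                   /* line still partial: it stays in the buffer;
                                         reads that hit it are served from e->data   */

    install_line_in_cache(e->block_addr, e->data);      /* full line: allocate it     */
    write_line_to_next_level(e->block_addr, e->data);   /* and write through, if the
                                                           cache is write-through     */
    memset(e->valid, 0, sizeof e->valid);               /* the entry becomes free     */
}
```

For sequential allocation writes each entry fills completely before it is needed elsewhere, so the effect is the same as write-validate without per-word valid bits in the cache itself.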

Table 10: Performance of garbage-prefetch technique on the Alpha 21064.

    Program        Normal (user / sys / real)   Garbage-prefetch (user / sys / real)   Speedup (user+sys)   Speedup (real)
    Life           11.52 / 0.02 / 11.70          8.20 / 0.02 /  8.32                   1.40                 1.41
    Knuth-Bendix    6.05 / 0.10 /  7.38          4.87 / 0.11 /  6.12                   1.23                 1.21
    Lexgen          8.42 / 3.26 / 12.62          6.53 / 3.29 / 10.72                   1.19                 1.12
    Yacc            2.54 / 0.41 /  3.75          2.35 / 0.43 /  3.83                   1.06                 0.97
    Average                                                                            1.22                 1.18

Averages of three runs of each benchmark are shown. The variation in the run times of each benchmark was approximately 10% in each component.
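The speedup columns are simply the ratios of the corresponding times: for Life, for example, the user+sys speedup is (11.52 + 0.02)/(8.20 + 0.02) ≈ 1.40 and the real-time speedup is 11.70/8.32 ≈ 1.41.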

We also claim that full-line write-allocate does not pose more problems for cache coherence protocols than write-around does. First, for partial-line writes, our protocol behaves just like write-around. For full-line writes, a processor would need to claim ownership of an entire cache line, which it would do by sending an invalidate message to other processors. This is a standard technique, and is simpler than merging.

Full-line write-allocate does not guarantee sequential consistency, and neither does write-around. Obtaining sequential consistency is difficult in high-performance protocols, so many machines (such as the Alpha) do not guarantee sequential consistency unless a "memory barrier" instruction is executed.

The IBM RS/6000 [23] (and perhaps the PowerPC) has a "cache reload buffer" that (among other things) serves as a write buffer, satisfies read requests directly, implements non-blocking fetch-on-write, and puts full lines into the cache. This is at least as good as (and perhaps not much more complicated to implement than) full-line write-allocate.
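In terms of the toy model sketched earlier, the full-line path would gain exactly one step: before the assembled line is installed, the processor claims ownership by invalidating the line in other caches. The fragment below reuses the hypothetical names from that write-buffer sketch, and coherence_invalidate_line is likewise an invented hook, not a real controller interface.

    /* Extends the earlier write-buffer sketch: when a full line has been
     * assembled, claim exclusive ownership before allocating it into the
     * cache.  coherence_invalidate_line is a hypothetical hook that asks
     * the coherence controller to invalidate this line in all other caches. */
    extern void coherence_invalidate_line(uintptr_t line_addr);

    static void wb_install_full_line(struct write_buffer *wb)
    {
        coherence_invalidate_line(wb->line_addr);          /* claim the whole line            */
        cache_fill_line(wb->line_addr, wb->words);         /* then allocate it into the cache */
        next_level_write_line(wb->line_addr, wb->words);   /* write-through to the next level */
        wb->valid = 0;
    }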

10 Conclusion

We have studied the cache performance of ML programs compiled by the SML/NJ compiler. These programs have a very regular, sequential pattern of write references, and more than half of the read references are to objects that have just been allocated, which means that the programs have good temporal and spatial locality. Confirming the results of previous studies, we found that ML programs have good cache performance on write-validate cache architectures. On other architectures, the cache performance of the ML programs can be bad. However, for the cache architectures of many current machines—with a small on-chip first-level cache and a large second-level cache—the cache performance of ML programs can be better than that of C and Fortran programs.

Garbage collection frequency has little impact on cache performance for write-validate caches, but for large or mid-sized fetch-on-write and write-validate caches, fitting the allocation space in the cache can improve performance, even though a penalty can be paid for extra garbage collection overhead. For small fetch-on-write and write-around nonblocking caches, prefetching can be used to simulate write-allocate and reduce the penalty for allocation misses. Finally, we commend write-allocate (without stalling) to machine designers, and suggest that a cache-allocate instruction, nonblocking cache read misses, write-merging, and full-line write-allocate are all reasonable ways of implementing write-allocate.

Acknowledgements

We thank E. Gün Sirer for writing the MIPS instruction-level simulator that we use in our experiments, and John Reppy for writing (and helping us use) the multi-generational collector. We thank Doug Clark and Anne Rogers for their helpful suggestions, and Kai Li for enlightening discussions about cache coherence.

The authors were supported in part by National Science Foundation Grant CCR-9200790. Marcelo Gonçalves was also supported in part by CNPq, Brazil.

References

[1] Andrew W. Appel. Simple generational garbage collection and fast allocation. Software—Practice and Experience, 19(2):171–183, 1989.

[2] Andrew W. Appel. Compiling with Continuations. Cambridge University Press, 1992.

[3] Andrew W. Appel and David B. MacQueen. Standard ML of New Jersey. In Martin Wirsing, editor, Third Int'l Symp. on Prog. Lang. Implementation and Logic Programming, pages 1–13, New York, August 1991. Springer-Verlag.

[4] Andrew W. Appel, James S. Mattson, and David R. Tarditi. A lexical analyzer. Distributed with Standard ML of New Jersey, December 1989.

[5] Joshua Barnes and Piet Hut. Error analysis of a tree code. Astrophysical Journal Supplement, 70:389, 1989.

[6] Joshua Barnes and Piet Hut. A hierarchical O(N log N) force calculation algorithm. Nature, 324:446–449, 1986.

[7] M. C. Becker, M. S. Allen, C. R. Moore, J. S. Muhich, and D. P. Tuttle. The PowerPC 601 microprocessor. IEEE Micro, 13(5):54–68, October 1993.

[8] Brad Burgess, Nasr Ullah, Peter van Overen, and Deene Ogden. The PowerPC 603 microprocessor. Communications of the ACM, 37(6):34–42, June 1994.

[9] Marcelo J. R. Gonçalves. Cache Performance of Programs with Intensive Heap Allocation and Generational Garbage Collection. PhD thesis, Princeton University, Princeton, NJ, May 1995.

[10] Intel Corporation. Pentium Processor User's Manual, Volume 2: 82496 Cache Controller and 82491 Cache SRAM Data Book. Intel Literature Sales, Mt. Prospect, Illinois, 1993.

[11] W. Crowley, C. Hendrickson, and T. Rudy. The SIMPLE code. Technical Report UCID 17715, Lawrence Livermore Laboratory, Livermore, CA, February 1978.

[12] Digital Equipment Corporation, Maynard, Massachusetts. DECchip 21064-AA Microprocessor Hardware Reference Manual, first edition, October 1992. Order number EC-N0079-72.

[13] Digital Equipment Corporation, Palo Alto, California. DECstation and DECsystem 5000 Model 240 Technical Overview, version 2 edition, February 1992. Order number EC-N0194-51.

[14] Amer Diwan, David Tarditi, and Eliot Moss. Memory subsystem performance of programs with copying garbage collection. In Proceedings of the 21st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 1–13, January 1994.

[15] Todd A. Dutton, Daniel Eiref, Hugh R. Kurth, James J. Reisert, and Robin L. Stewart. The design of the DEC 3000 AXP systems, two high-performance workstations. Digital Technical Journal, 4(4), 1992.

[16] K. Ekanadham and Arvind. SIMPLE: An exercise in future scientific programming. Technical Report Computation Structures Group Memo 273, MIT, Cambridge, MA, July 1987. Simultaneously published as IBM/T. J. Watson Research Center Research Report 12686, Yorktown Heights, NY.

[17] Jean-Marc Frailong et al. The next generation SPARC multiprocessing system architecture. In Proceedings of COMPCON, pages 475–480, San Francisco, California, February 1993. IEEE Computer Society Press.

[18] Tom Asprey et al. Performance features of the PA7100 microprocessor. IEEE Micro, 13(3):22–35, June 1993.

[19] Richard P. Gabriel. Performance and Evaluation of Lisp Systems. MIT Press, Cambridge, MA, 1985.

[20] Jeffrey D. Gee, Mark D. Hill, Dionisios N. Pnevmatikatos, and Alan J. Smith. Cache performance of the SPEC92 benchmark suite. IEEE Micro, 13(4):17–27, August 1993.

[21] Marcelo J. R. Gonçalves. Cache Performance of Programs with Intensive Allocation and Generational Garbage Collection. PhD thesis, Princeton University, Department of Computer Science, 1995. (In preparation.)

[22] Tom R. Halfhill. Intel's P6. Byte, 20(4):42–58, April 1995.

[23] William R. Hardell, Dwain A. Hicks, Lawrence C. Howell, Warren E. Maule, Robert Montoye, and David P. Tuttle. Data cache and storage control units. In IBM RISC System/6000 Technology, pages 44–50. IBM, 1990.

[24] Norman P. Jouppi. Cache write policies and performance. In Proceedings of the 20th International Symposium on Computer Architecture, pages 191–201, San Diego, California, May 1993.

[25] Alan H. Karp and Rajiv Gupta. Hardware support for data merging. In Proceedings Intl. Parallel Proc. Symp., April 1994.

[26] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13–21, May 1992.

[27] Philip J. Koopman, Jr., Peter Lee, and Daniel P. Siewiorek. Cache behavior of combinator graph reduction. ACM TOPLAS, 14(2):265–297, April 1992.

[28] George Necula. Personal communication, 1995.

[29] Chih-Jui Peng and Gurindar S. Sohi. Cache memory design considerations to support languages with dynamic heap allocation. Technical Report 860, Computer Science Department, University of Wisconsin-Madison, 1989.

[30] Chris Reade. Elements of Functional Programming. Addison-Wesley, Reading, MA, 1989.

[31] Mark B. Reinhold. Cache performance of garbage-collected programming languages. Technical Report 581, MIT Laboratory for Computer Science, September 1993.

[32] Mark B. Reinhold. Cache performance of garbage-collected programs. In Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, Orlando, Florida, June 1994.

[33] John H. Reppy. A high-performance garbage collector for Standard ML. Technical report, AT&T Bell Laboratories, 1994.

[34] Zhong Shao and Andrew Appel. Space-efficient closure representations. In Proceedings of the 1994 ACM Conference on Lisp and Functional Programming, 1994.

[35] S. Peter Song, Marvin Denman, and Joe Chang. The PowerPC 604 RISC microprocessor. IEEE Micro, 14(5):8–17, October 1994.

[36] Darko Stefanovic and J. E. B. Moss. Characterization of object behaviour in Standard ML of New Jersey. In Proceedings of the 1994 ACM Conference on Lisp and Functional Programming, Orlando, Florida, June 1994.

[37] David R. Tarditi and Andrew W. Appel. ML-Yacc, version 2.0. Distributed with Standard ML of New Jersey, April 1990.

[38] Paul R. Wilson, Michael S. Lam, and Thomas G. Moher. Caching considerations for generational garbage collection. In Proceedings of the 1992 Conference on Lisp and Functional Programming, pages 32–42, San Francisco, California, June 1992.

[39] Benjamin Zorn. Comparative Performance Evaluation of Garbage Collection Algorithms. PhD thesis, University of California at Berkeley, EECS Department, 1989. Technical Report UCB/CSD 89/544.

[40] Benjamin Zorn. The effect of garbage collection on cache performance. Technical Report CU-CS-528-91, Department of Computer Science, University of Colorado, Boulder, Colorado, May 1991.
