Cache Memories

ALAN JAY SMITH, University of California, Berkeley, California 94720

Cache memories are used in modern, medium and high-speed CPUs to hold temporarily those portions of the contents of main memory which are (believed to be) currently in use. Since instructions and data in cache memories can usually be referenced in 10 to 25 percent of the time required to access main memory, cache memories permit the execution rate of the machine to be substantially increased. In order to function effectively, cache memories must be carefully designed and implemented. In this paper, we explain the various aspects of cache memories and discuss in some detail the design features and trade-offs. A large number of original, trace-driven simulation results are presented. Consideration is given to practical implementation questions as well as to more abstract design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles--cache memories; B.3.3 [Memory Structures]: Performance Analysis and Design Aids; C.0 [Computer Systems Organization]: General; C.4 [Computer Systems Organization]: Performance of Systems

General Terms: Design, Experimentation, Measurement, Performance

Additional Key Words and Phrases: Buffer memory, paging, prefetching, TLB, store-through, Amdahl 470, IBM 3033, BIAS

CONTENTS

INTRODUCTION
  Definition and Rationale
  Overview of Cache Design
  Cache Aspects
1. DATA AND MEASUREMENTS
  1.1 Rationale
  1.2 Trace-Driven Simulation
  1.3 Simulation Evaluation
  1.4 The Traces
  1.5 Simulation Methods
2. ASPECTS OF CACHE DESIGN AND OPERATION
  2.1 Cache Fetch Algorithm
  2.2 Placement Algorithm
  2.3 Line Size
  2.4 Replacement Algorithm
  2.5 Write-Through versus Copy-Back
  2.6 Effect of Multiprogramming: Cold-Start and Warm-Start
  2.7 Multicache Consistency
  2.8 Data/Instruction Cache
  2.9 Virtual Address Cache
  2.10 User/Supervisor Cache
  2.11 Input/Output through the Cache
  2.12 Cache Size
  2.13 Cache Bandwidth, Data Width, and Access Resolution
  2.14 Multilevel Cache
  2.15 Pipelining
  2.16 Translation Lookaside Buffer
  2.17 Translator
  2.18 Memory-Based Cache
  2.19 Specialized Caches and Cache Components
3. DIRECTIONS FOR RESEARCH AND DEVELOPMENT
  3.1 On-Chip Cache and Other Technology Advances
  3.2 Multicache Consistency
  3.3 Implementation Evaluation
  3.4 Hit Ratio versus Size
  3.5 TLB Design
  3.6 Cache Parameters versus Architecture and Workload
APPENDIX: EXPLANATION OF TRACE NAMES
ACKNOWLEDGMENTS
REFERENCES

INTRODUCTION

Definition and Rationale

Cache memories are small, high-speed buffer memories used in modern computer systems to hold temporarily those portions of the contents of main memory which are (believed to be) currently in use. Information located in cache memory may be accessed in much less time than that located in main memory (for reasons discussed throughout this paper). Thus, a central processing unit (CPU) with a cache memory needs to spend far less time waiting for instructions and operands to be fetched and/or stored. For example, in typical large, high-speed computers (e.g., Amdahl 470V/7, IBM 3033), main memory can be accessed in 300 to 600 nanoseconds; information can be obtained from a cache, on the other hand, in 50 to 100 nanoseconds. Since the performance of such machines is already limited in instruction execution rate by cache memory access time, the absence of any cache memory at all would produce a very substantial decrease in execution speed.

Virtually all modern large computer systems have cache memories; for example, the Amdahl 470, the IBM 3081 [IBM82, REIL82, GUST82], 3033, 370/168, 360/195, the Univac 1100/80, and the Honeywell 66/80. Also, many medium and small size machines have cache memories; for example, the DEC VAX 11/780, 11/750 [ARMS81], and PDP-11/70 [SIRE76, SNOW78], and the Apollo, which uses a Motorola 68000. We believe that within two to four years, circuit speed and density will progress sufficiently to permit cache memories in one-chip microcomputers. (On-chip addressable memory is planned for the Texas Instruments 99000 [LAFF81, ELEC81].) Even microcomputers could benefit substantially from an on-chip cache, since on-chip access times are much smaller than off-chip access times. Thus, the material presented in this paper should be relevant to almost the full range of computer architecture implementations.

The success of cache memories has been explained by reference to the "property of locality" [DENN72]. The property of locality has two aspects, temporal and spatial. Over short periods of time, a program distributes its memory references nonuniformly over its address space, and the portions of the address space which are favored remain largely the same for long periods of time. The first property, called temporal locality, or locality by time, means that the information which will be in use in the near future is likely to be in use already. This type of behavior can be expected from program loops in which both data and instructions are reused. The second property, locality by space, means that portions of the address space which are in use generally consist of a fairly small number of individually contiguous segments of that address space. Locality by space, then, means that the loci of reference of the program in the near future are likely to be near the current loci of reference. This type of behavior can be expected from common knowledge of programs: related data items (variables, arrays) are usually stored together, and instructions are mostly executed sequentially. Since the cache memory buffers segments of information that have been recently used, the property of locality implies that needed information is also likely to be found in the cache.

Optimizing the design of a cache memory generally has four aspects:

(1) maximizing the probability of finding a memory reference's target in the cache (the hit ratio),
(2) minimizing the time to access information that is indeed in the cache (access time),
(3) minimizing the delay due to a miss, and
(4) minimizing the overheads of updating main memory, maintaining multicache consistency, etc.

(All of these have to be accomplished under suitable cost constraints, of course.)
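These four goals interact; in particular, the hit ratio and the access time together determine the average time a memory reference takes. The following minimal sketch makes that relation concrete. Its timing parameters are illustrative values in the ranges quoted above, not measurements of any machine discussed in this paper.

    #include <stdio.h>

    /* Effective memory access time as a function of the miss ratio.
       The parameter values are illustrative only, chosen to resemble
       the 300-600 ns main memory and 50-100 ns cache access times
       quoted in the text. */
    int main(void)
    {
        double cache_ns = 80.0;    /* time to access the cache         */
        double memory_ns = 450.0;  /* additional delay on a cache miss */
        double miss_ratio;

        for (miss_ratio = 0.00; miss_ratio <= 0.20; miss_ratio += 0.02) {
            double t_eff = cache_ns + miss_ratio * memory_ns;
            printf("miss ratio %4.2f  effective access time %6.1f ns\n",
                   miss_ratio, t_eff);
        }
        return 0;
    }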


There is also a trade-off between hit ratio and access time. This trade-off has not been sufficiently stressed in the literature and it is one of our major concerns in this paper. In this paper, each aspect of cache memories is discussed at length and, where available, measurement results are presented. In order for these detailed discussions to be meaningful, a familiarity with many of the aspects of cache design is required. In the remainder of this section, we explain the operation of a typical cache memory, and then we briefly discuss several aspects of cache memory design. These discussions are expanded upon in Section 2. At the end of this paper, there is an extensive bibliography in which we have attempted to cite all relevant literature. Not all of the items in the bibliography are referenced in the paper, although we have referred to items there as appropriate. The reader may wish in particular to refer to BADE79, BARS72, GIBS67, and KAPL73 for other surveys of some aspects of cache design. CLAR81 is particularly interesting as it discusses the design details of a real cache. (See also LAMP80.)

Overview of Cache Design

Many CPUs can be partitioned, conceptually and sometimes physically, into three parts: the I-unit, the E-unit, and the S-unit. The I-unit (instruction) is responsible for instruction fetch and decode. It may have some local buffers for lookahead prefetching of instructions. The E-unit (execution) does most of what is commonly referred to as executing an instruction, and it contains the logic for arithmetic and logical operations. The S-unit (storage) provides the memory interface between the I-unit and E-unit. (IBM calls the S-unit the PSCF, or storage control function.)

The S-unit is the part of the CPU of primary interest in this paper. It contains several parts or functions, some of which are shown in Figure 1. The major component of the S-unit is the cache memory. There is usually a translator, which translates virtual to real memory addresses, and a TLB (translation lookaside buffer) which buffers (caches) recently generated (virtual address, real address) pairs. Depending on the machine design, there can be an ASIT (address space identifier table), a BIAS (buffer invalidation address stack), and some write-through buffers. Each of these is discussed in later sections of this paper.

Figure 1. A typical CPU design and the S-unit.

Figure 2 is a diagram of portions of a typical S-unit, showing only the more important parts and data paths, in particular the cache and the TLB. This design is typical of that used by IBM (in the 370/168 and 3033) and by Amdahl (in the 470 series). Figure 3 is a flowchart that corresponds to the operation of the design in Figure 2. A discussion of this flowchart follows.

The operation of the cache commences with the arrival of a virtual address, generally from the CPU, and the appropriate control signal. The virtual address is passed to both the TLB and the cache storage. The TLB is a small associative memory which maps virtual to real addresses. It is often organized as shown, as a number of groups (sets) of elements, each consisting of a virtual address and a real address. The TLB accepts the virtual page number, randomizes it, and uses that hashed number to select a set of elements. That set of elements is then searched associatively for a match to the virtual address. If a match is found, the corresponding real address is passed along to the comparator to determine whether the target line is in the cache. Finally, the replacement status of each entry in the TLB set is updated.

If the TLB does not contain the (virtual address, real address) pair needed for the translation, then the translator (not shown in Figure 2) is invoked. It uses the high-order bits of the virtual address as an entry into the segment and page tables for the process, and then returns the address pair to the TLB (which retains it for possible future use), thus replacing an existing TLB entry.
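The TLB lookup just described can be summarized in code. The sketch below is illustrative only: the set count, the set size, the hash, and the field names are assumptions chosen for brevity, not the parameters of the machines discussed in this paper.

    #include <stdio.h>

    #define TLB_SETS     32   /* number of sets (a power of two) */
    #define TLB_SET_SIZE  2   /* elements searched associatively */

    struct tlb_entry {
        int           valid;
        unsigned long vpage;  /* virtual page number             */
        unsigned long rpage;  /* corresponding real page number  */
        int           lru;    /* 0 = most recently used in set   */
    };

    static struct tlb_entry tlb[TLB_SETS][TLB_SET_SIZE];

    /* Hash (randomize) the virtual page number, select a set, and
       search that set associatively.  Returns 1 and the real page
       number on a hit. */
    static int tlb_lookup(unsigned long vpage, unsigned long *rpage)
    {
        unsigned long set = (vpage ^ (vpage >> 5)) % TLB_SETS;
        int i;

        for (i = 0; i < TLB_SET_SIZE; i++) {
            struct tlb_entry *e = &tlb[set][i];
            if (e->valid && e->vpage == vpage) {
                e->lru = 0;               /* update replacement status...   */
                tlb[set][1 - i].lru = 1;  /* ...(written for a set size of 2) */
                *rpage = e->rpage;
                return 1;
            }
        }
        return 0;  /* miss: invoke the translator, then reload the TLB */
    }

    int main(void)
    {
        unsigned long rpage;
        struct tlb_entry *e = &tlb[(7UL ^ (7UL >> 5)) % TLB_SETS][0];

        e->valid = 1;                 /* preload one translation pair */
        e->vpage = 7;
        e->rpage = 1234;
        e->lru = 0;
        if (tlb_lookup(7UL, &rpage))
            printf("virtual page 7 -> real page %lu\n", rpage);
        return 0;
    }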


Figure 2. A typical cache and TLB design.


Figure 3. Cache operation flow chart.

The virtual address is also passed along initially to a mechanism which uses the middle part of the virtual address (the line number) as an index to select a set of entries in the cache. Each entry consists primarily of a real address tag and a line of data (see Figure 4). The line is the quantum of storage in the cache. The tags of the elements of the selected set are read into a comparator and compared with the real address from the TLB. (Sometimes the cache storage stores the data and address tags together, as shown in Figures 2 and 4. Other times, the address tags and data are stored separately in the "address array" and "data array," respectively.) If a match is found, the line (or a part of it) containing the target locations is read into a shift register and the replacement status of the entries in the cache set is updated. The shift register is then shifted to select the target bytes, which are in turn transmitted to the source of the original data request.

Figure 4. Structure of cache entry and cache set.

If a miss occurs (i.e., the address tags in the cache do not match), then the real address of the desired line is transmitted to the main memory. The replacement status information is used to determine which line to remove from the cache to make room for the target line. If the line to be removed from the cache has been modified, and main memory has not yet been updated with the modification, then the line is copied back to main memory; otherwise, it is simply deleted from the cache. After some number of machine cycles, the target line arrives from main memory and is loaded into the cache storage. The line is also passed to the shift register for the target bytes to be selected.
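The lookup and miss handling just described, including the update of the replacement status, can be outlined in code. The following sketch assumes one particular geometry (64 sets of 4 elements, 32-byte lines, LRU replacement) for concreteness; a real S-unit performs these steps in parallel hardware rather than sequentially.

    #include <stdio.h>

    #define LINE_BYTES 32
    #define SETS       64
    #define SET_SIZE    4

    struct cache_entry {
        int           valid;
        unsigned long tag;   /* real address tag                */
        int           age;   /* 0 = most recently used in set   */
    };

    static struct cache_entry cache[SETS][SET_SIZE];

    /* Returns 1 on a hit, 0 on a miss (loading the line). */
    static int cache_reference(unsigned long addr)
    {
        unsigned long line = addr / LINE_BYTES;
        unsigned long set  = line % SETS;   /* middle bits select the set */
        unsigned long tag  = line / SETS;
        int i, j, victim = 0;

        for (i = 0; i < SET_SIZE; i++)
            if (cache[set][i].valid && cache[set][i].tag == tag)
                break;

        if (i < SET_SIZE) {                 /* hit: update LRU status */
            for (j = 0; j < SET_SIZE; j++)
                if (cache[set][j].age < cache[set][i].age)
                    cache[set][j].age++;
            cache[set][i].age = 0;
            return 1;
        }

        for (i = 0; i < SET_SIZE; i++) {    /* miss: choose a victim  */
            if (!cache[set][i].valid) { victim = i; break; }
            if (cache[set][i].age > cache[set][victim].age)
                victim = i;
        }
        for (i = 0; i < SET_SIZE; i++)
            cache[set][i].age++;
        cache[set][victim].valid = 1;       /* line arrives from memory */
        cache[set][victim].tag   = tag;
        cache[set][victim].age   = 0;
        return 0;
    }

    int main(void)
    {
        printf("first access:  %s\n", cache_reference(0x1000UL) ? "hit" : "miss");
        printf("second access: %s\n", cache_reference(0x1000UL) ? "hit" : "miss");
        return 0;
    }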
Cache Aspects

The cache description given above is both simplified and specific; it does not show design alternatives. Below, we point out some of the design alternatives for the cache memory.

Cache Fetch Algorithm. The cache fetch algorithm is used to decide when to bring information into the cache. Several possibilities exist: information can be fetched on demand (when it is needed) or prefetched (before it is needed). Prefetch algorithms attempt to guess what information will soon be needed and obtain it in advance. It is also possible for the cache fetch algorithm to omit fetching some information (selective fetch) and designate some information, such as shared writeable code (semaphores), as unfetchable. Further, there may be no fetch-on-write in systems which use write-through (see below).

Cache Placement Algorithm. Information is generally retrieved from the cache associatively, and because large associative memories are usually very expensive and somewhat slow, the cache is generally organized as a group of smaller associative memories. Thus, only one of the associative memories has to be searched to determine whether the desired information is located in the cache. Each such (small) associative memory is called a set and the number of elements over which the associative search is conducted is called the set size. The placement algorithm is used to determine in which set a piece (line) of information will be placed. Later in this paper we consider the problem of selecting the number of sets, the set size, and the placement algorithm in such a set-associative memory.

Line Size. The fixed-size unit of information transfer between the cache and main memory is called the line. The line corresponds conceptually to the page, which is the unit of transfer between the main memory and secondary storage. Selecting the line size is an important part of the memory system design. (A line is also sometimes referred to as a block.)

Replacement Algorithm. When information is requested by the CPU from main memory and the cache is full, some information in the cache must be selected for replacement. Various replacement algorithms are possible, such as FIFO (first in, first out), LRU (least recently used), and random. Later, we consider the first two of these.

Main Memory Update Algorithm. When the CPU performs a write (store) to memory, that operation can actually be reflected in the cache and main memories in a number of ways. For example, the cache memory can receive the write and the main memory can be updated when that line is replaced in the cache. This strategy is known as copy-back. Copy-back may also require that the line be fetched if it is absent from the cache (i.e., fetch-on-write). Another strategy, known as write-through, immediately updates main memory when a write occurs. Write-through may specify that if the information is in the cache, the cache copy be either updated or purged; if the information is not in the cache, it may or may not be fetched. The choice between copy-back and write-through strategies is also influenced by the need to maintain consistency among the cache memories in a tightly coupled multiprocessor system. This requirement is discussed later. (Both strategies are sketched in code below.)

Cold-Start versus Warm-Start Miss Ratios and Multiprogramming. Most computer systems with cache memories are multiprogrammed; many processes run on the CPU, though only one can run at a time, and they alternate every few milliseconds. This means that a significant fraction of the cache miss ratio is due to loading data and instructions for a new process, rather than to a single process which has been running for some time. Miss ratios that are measured when starting with an empty cache are called cold-start miss ratios, and those that are measured from the time the cache becomes full are called warm-start miss ratios. Our simulation studies consider this multiprogramming environment.

User/Supervisor Cache. The frequent switching between user and supervisor state in most systems results in high miss ratios because the cache is often reloaded (i.e., cold-start). One way to address this is to incorporate two cache memories, and allow the supervisor to use one cache and the user programs to use the other. Potentially, this could result in both the supervisor and the user programs more frequently finding upon initiation what they need in the cache.

Multicache Consistency. A multiprocessor system with multiple caches faces the problem of making sure that all copies of a given piece of information (which potentially could exist in every cache, as well as in the main memory) are the same. A modification of any one of these copies should somehow be reflected in all others. A number of solutions to this problem are possible. The three most popular solutions are essentially: (1) to transmit all stores to all caches and memories, so that all copies are updated; (2) to transmit the addresses of all stores to all other caches, and purge the corresponding lines from all other caches; or (3) to permit data that are writeable (page or line flagged to permit modification) to be in only one cache at a time. A centralized or distributed directory may be used to control the making and updating of copies.

Input/Output. Input/output (from and to I/O devices) is an additional source of references to information in memory. It is important that an output request stream reference the most current values for the information transferred. Similarly, it is also important that input data be immediately reflected in any and all copies of those lines in memory. Several solutions to this problem are possible. One is to direct the I/O stream through the cache itself (in a single processor system); another is to use a write-through policy and broadcast all writes so as to update or invalidate the target line wherever found. In the latter case, the channel accesses main memory rather than the cache.

Data/Instruction Cache. Another cache design strategy is to split the cache into two parts: one for data and one for instructions. This has the advantages that the bandwidth of the cache is increased and the access time (for reasons discussed later) can be decreased. Several problems occur: the overall miss ratio may increase, the two caches must be kept consistent, and self-modifying code and execute instructions must be accommodated.
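As promised under Main Memory Update Algorithm above, the two strategies differ only in when main memory is updated. The sketch below reduces the cache and main memory to single words so that this difference stands out; it is an illustration of the two policies as defined in this section, not of any machine's implementation.

    #include <stdio.h>

    static int memory_word;     /* the backing main-memory copy */
    static int cache_word;      /* the cached copy              */
    static int cache_valid = 1;
    static int cache_dirty = 0;

    /* Copy-back: the write is absorbed by the cache; main memory is
       updated only when the modified (dirty) line is later replaced. */
    static void store_copy_back(int data)
    {
        cache_word = data;      /* assume fetch-on-write already done */
        cache_dirty = 1;
    }

    static void replace_line(void)   /* called at eviction time */
    {
        if (cache_dirty)
            memory_word = cache_word;   /* copy the line back */
        cache_valid = cache_dirty = 0;
    }

    /* Write-through: main memory is updated immediately; the cache
       copy, if present, is either updated (as here) or purged. */
    static void store_write_through(int data)
    {
        if (cache_valid)
            cache_word = data;
        memory_word = data;     /* often via the write-through buffers */
    }

    int main(void)
    {
        store_copy_back(1);
        printf("copy-back: memory=%d before eviction, ", memory_word);
        replace_line();
        printf("%d after\n", memory_word);

        cache_valid = 1;
        store_write_through(2);
        printf("write-through: memory=%d immediately\n", memory_word);
        return 0;
    }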

Virtual versus Real Addressing. In computer systems with virtual memory, the cache may potentially be accessed either with a real address (real address cache) or a virtual address (virtual address cache). If real addresses are to be used, the virtual addresses generated by the processor must first be translated, as in the example above (Figure 2); this is generally done by a TLB. The TLB is itself a cache memory which stores recently used address translation information, so that translation can occur quickly. Direct virtual address access is faster (since no translation is needed), but causes some problems. In a virtual address cache, inverse mapping (real to virtual address) is sometimes needed; this can be done by an RTB (reverse translation buffer).

Cache Size. It is obvious that the larger the cache, the higher the probability of finding the needed information in it. Cache sizes cannot be expanded without limit, however, for several reasons: cost (the most important reason in many machines, especially small ones), physical size (the cache must fit on the boards and in the cabinets), and access time. (The larger the cache, the slower it may become. Reasons for this are discussed in Section 2.12.) Later, we address the question of how large is large enough.

Multilevel Cache. As the cache grows in size, there comes a point where it may be usefully split into two levels: a small, high-level cache, which is faster, smaller, and more expensive per byte, and a larger, second-level cache. This two-level cache structure solves some of the problems that afflict caches when they become too large.

Cache Bandwidth. The cache bandwidth is the rate at which data can be read from and written to the cache. The bandwidth must be sufficient to support the proposed rate of instruction execution and I/O. Bandwidth can be improved by increasing the width of the data path, interleaving the cache, and decreasing the access time.

1. DATA AND MEASUREMENTS

1.1 Rationale

As noted earlier, our in-depth studies of some aspects of cache design and optimization are based on extensive trace-driven simulation. In this section, we explain the importance of this approach, and then discuss the presentation of our results.

One difficulty in providing definitive statements about aspects of cache operation is that the effectiveness of a cache memory depends on the workload of the computer system; further, to our knowledge, there has never been any (public) effort to characterize that workload with respect to its effect on the cache memory. Along the same lines, there is no generally accepted model for program behavior, and still less is there one for its effect on the uppermost level of the memory hierarchy. (But see AROR72 for some measurements, and LEHM78 and LEHM80, in which a model is used.)

For these reasons, we believe that for many aspects of cache design it is possible to make statements about relative performance only when those statements are based on trace-driven simulation or direct measurement. We have therefore tried throughout, when examining certain aspects of cache memories, to present a large number of simulation results and, if possible, to generalize from those measurements. We have also made an effort to locate and reference other measurement and trace-driven simulation results reported in the literature. The reader may wish, for example, to read WIND73, in which that author discusses the set of data used for his simulations.

1.2 Trace-Driven Simulation

Trace-driven simulation is an effective method for evaluating the behavior of a memory hierarchy. A trace is usually gathered by interpretively executing a program and recording every main memory location referenced by the program during its execution. (Each address may be tagged in any way desired, e.g., instruction fetch, data fetch, data store.) One or more such traces are then used to drive a simulation model of a cache (or main) memory. By varying the parameters of the simulation model, it is possible to simulate directly any cache size, placement, fetch, or replacement algorithm, line size, and so forth. Programming techniques allow a range of values for many of these parameters to be measured simultaneously, during the same simulation run [GECS74, MATT70, SLUT72].
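The heart of such a simulator is quite small. The following sketch replays one trace file against a cache model and accumulates the miss ratio. The record layout, the file name, and the toy direct-mapped model inside it are assumptions made for illustration; they are not the format or the models used for the experiments reported here.

    #include <stdio.h>

    enum ref_type { INST_FETCH, DATA_READ, DATA_WRITE };

    struct trace_record {        /* one tagged memory reference */
        enum ref_type type;
        unsigned long address;
    };

    /* A deliberately simple model: direct mapped, 1024 lines of 32
       bytes.  Any model (set-associative, prefetching, etc.) can be
       substituted here. */
    static int cache_reference(unsigned long addr)
    {
        static unsigned long tag[1024];
        static int valid[1024];
        unsigned long line = addr / 32, set = line % 1024;

        if (valid[set] && tag[set] == line)
            return 1;                   /* hit */
        valid[set] = 1;                 /* miss: load the line */
        tag[set] = line;
        return 0;
    }

    int main(void)
    {
        FILE *f = fopen("program.trace", "rb");
        struct trace_record r;
        long refs = 0, misses = 0;

        if (f == NULL)
            return 1;
        while (fread(&r, sizeof r, 1, f) == 1) {
            refs++;
            if (!cache_reference(r.address))
                misses++;
        }
        fclose(f);
        if (refs > 0)
            printf("%ld references, %ld misses, miss ratio %.4f\n",
                   refs, misses, (double)misses / (double)refs);
        return 0;
    }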

Trace-driven simulation has been a mainstay of memory hierarchy evaluation for the last 12 to 15 years; see BELA66 for an early example of this technique, or see POHM73. We assume only a single cache in the system, the one that we simulate. Note that our model does not include the additional buffers commonly found in the instruction decode and ALU portions of many CPUs.

In many cases, trace-driven simulation is preferred to actual measurement. Actual measurements require access to a computer and hardware measurement tools. Thus, if the results of the experiments are to be even approximately repeatable, standalone time is required. Also, if one is measuring an actual machine, one is unable to vary most (if any) hardware parameters. Trace-driven simulation has none of these difficulties; parameters can be varied at will and experiments can be repeated and reproduced precisely. The principal advantage of measurement over simulation is that it requires 1 to 0.1 percent as much running time and is thus very valuable in establishing a genuine, workload-based, actual level of performance (for validation). Actual workloads also include supervisor code, interrupts, context switches, and other aspects of workload behavior which are hard to imitate with traces. The results in this paper are mostly of the trace-driven variety.

1.3 Simulation Evaluation

There are two aspects to the performance of a cache memory. The first is access time: How long does it take to get information from or put information into the cache? It is very difficult to make exact statements about the effect of design changes on access time without specifying a circuit technology and a circuit diagram. One can, though, indicate trends, and we do that throughout this paper.

The second aspect of cache performance is the miss ratio: What fraction of all memory references attempt to access something which is not resident in the cache memory? Every such miss requires that the CPU wait until the desired information can be reached. Note that the miss ratio is a function not only of how the cache design affects the number of misses, but also of how the machine design affects the number of memory references. (A memory reference represents a cache access. A given instruction requires a varying number of memory references, depending on the specific implementation of the machine.) For example, a different number of memory references would be required if one word at a time were obtained from the cache than if two words were obtained at once. Almost all of our trace-driven studies assume a cache with a one-word data path (370 word = 4 bytes, PDP-11 word = 2 bytes). The WATEX, WATFIV, FFT, and APL traces assume a two-word (eight-byte) data path. We measure the miss ratio and use it as the major figure of merit for most of our studies. We display many of these results as x/y plots of miss ratios versus cache size in order to show the dependence of various cache design parameters on the cache size.

1.4 The Traces

We have obtained 19 program address traces, 3 of them for the PDP-11 and the other 16 for the IBM 360/370 series of computers. Each trace is for a program developed for normal production use. (These traces are listed in the Appendix, with a brief description of each.) They have been used in groups to simulate multiprogramming; five such groups were formed. Two represent a scientific workload (WFV, APL, WTX, FFT, and FGO1, FGO2, FGO3, FGO4), one a business (commercial) workload (CGO1, CGO2, CGO3, PGO2), one a miscellaneous workload, including compilations and a utility program (PGO1, CCOMP, FCOMP, IEBDG), and one a PDP-11 workload (ROFFAS, EDC, TRACE). The miss ratio as a function of cache size is shown in Figure 5 for most of the traces; see SMIT79 for the miss ratios of the remaining traces. The miss ratios for each of the traces in Figure 5 are cold-start values based on simulations of 250,000 memory references for the IBM traces, and 333,333 for the PDP-11 traces.


Figure 5. Individual trace miss ratios.

1.5 Simulation Methods

Almost all of the simulations that were run used 3 or 4 traces and simulated multiprogramming by switching the trace in use every Q time-units (where Q was usually 10,000, a cache memory reference takes 1 time-unit, and a miss requires 10). Multiprogrammed simulations are used for two reasons: they are considered to be more representative of usual computer system operation than uniprogrammed ones, and they also allow many more traces to be included without increasing the number of simulation runs. An acceptable alternative, though, would have been to use uniprogramming and purge the cache every Q memory references. A still better idea would have been to interleave user and supervisor code, but no supervisor traces were available.

All of the multiprogrammed simulations (i.e., Figures 6, 9-33) were run for one million memory references; thus approximately 250,000 memory references were used from each of the IBM 370 traces, and 333,333 from the PDP-11 traces.

The standard number of sets in the simulations was 64. The line size was generally 32 bytes for the IBM traces and 16 bytes for the PDP-11 traces.
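The switching discipline is easy to express in code. In the sketch below, only the structure (round-robin switching every Q = 10,000 references, one million references in all, four traces) follows the text; the stand-in trace generator and the small direct-mapped model are illustrative assumptions.

    #include <stdio.h>

    #define NTRACES 4
    #define Q       10000L     /* references per task-switch interval */
    #define TOTAL   1000000L   /* total memory references simulated   */

    /* Stand-in for reading the next reference of a given trace. */
    static unsigned long next_address(int t)
    {
        static unsigned long n[NTRACES];
        return ((unsigned long)t << 24) | ((n[t]++ % 4096UL) * 4UL);
    }

    /* A very small direct-mapped cache model, for illustration only. */
    static int cache_reference(unsigned long addr)
    {
        static unsigned long tag[1024];
        static int valid[1024];
        unsigned long line = addr / 32, set = line % 1024;

        if (valid[set] && tag[set] == line)
            return 1;
        valid[set] = 1;
        tag[set] = line;
        return 0;
    }

    int main(void)
    {
        long refs, misses = 0;
        int task = 0;

        for (refs = 0; refs < TOTAL; refs++) {
            if (refs > 0 && refs % Q == 0)
                task = (task + 1) % NTRACES;   /* round-robin switch */
            if (!cache_reference(next_address(task)))
                misses++;
        }
        printf("miss ratio %.4f\n", (double)misses / (double)TOTAL);
        return 0;
    }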

2. ASPECTS OF CACHE DESIGN AND OPERATION

2.1 Cache Fetch Algorithm

2.1.1 Introduction

As we noted earlier, one of the two aims of cache design is to minimize the miss ratio. Part of the approach to this goal is to select a cache fetch algorithm that is very likely to fetch the right information, if possible, before it is needed. The standard cache fetch algorithm is demand fetching, by which a line is fetched when and if it is needed. Demand fetches cannot be avoided entirely, but they can be reduced if we can successfully predict which lines will be needed and fetch them in advance. A cache fetch algorithm which gets information before it is needed is called a prefetch algorithm.

Prefetch algorithms have been studied in detail in SMIT78b. Below, we summarize those results and give one important extension. We also refer the reader to several other works [AICH76, BENN76, BERG78, ENGE73, PERK80, RAU76] for additional discussions of some of these issues.

We mention the importance of a technique known as fetch bypass or load-through. When a miss occurs, it can be rectified in two ways: either the line desired can be read into the cache, and the fetch then reinitiated (this was done in the original Amdahl 470V/6 [SMIT78b]), or, better, the desired bytes can be passed directly from the main memory to the instruction unit, bypassing the cache. In this latter strategy, the cache is loaded either simultaneously with the fetch bypass or after the bypass occurs. This method is used in the 470V/7, 470V/8, and the IBM 3033. (A wraparound load is usually used [KROF80], in which the transfer begins with the bytes accessed and wraps around to the rest of the line.)

2.1.2 Prefetching

A prefetch algorithm must be carefully designed if the machine performance is to be improved rather than degraded. In order to show this more clearly, we must first define our terms. Let the prefetch ratio be the ratio of the number of lines transferred due to prefetches to the total number of program memory references. And let the transfer ratio be the sum of the prefetch and miss ratios. There are two types of references to the cache: actual and prefetch lookup. Actual references are those generated by a source external to the cache, such as the rest of the CPU (I-unit, E-unit) or the channel. A prefetch lookup occurs when the cache interrogates itself to see if a given line is resident or if it must be prefetched. The ratio of the total accesses to the cache (actual plus prefetch lookup) to the number of actual references is called the access ratio.

There are costs associated with each of the above ratios. We can define these costs in terms of lost machine cycles per memory reference. Let D be the penalty for a demand miss (a miss that occurs because the target is needed immediately), which arises from machine idle time while the fetch completes. The prefetch cost, P, results from the cache cycles used (and thus otherwise unavailable) to bring in a prefetched line, used to move out (if necessary) a line replaced by a prefetch, and spent in delays while main memory modules are busy doing a prefetch move-in and move-out. The access cost, A, is the penalty due to additional cache prefetch lookup accesses which interfere with the executing program's use of the cache. A prefetch algorithm is effective only if the following inequality holds:

    D * miss ratio (demand) > D * miss ratio (prefetch)
                              + P * prefetch ratio
                              + A * (access ratio - 1)     (1)
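Inequality (1) is simple to apply once the ratios are known. The following fragment evaluates it for a single made-up design point; in practice D, P, and A come from the machine design, and the ratios come from trace-driven simulation.

    #include <stdio.h>

    /* Evaluates the prefetch-effectiveness inequality (1).  All of
       the values below are made-up example numbers, not measurements. */
    int main(void)
    {
        double D = 10.0;   /* cycles lost per demand miss          */
        double P = 4.0;    /* cycles of cost per prefetch transfer */
        double A = 1.0;    /* cycles per extra prefetch lookup     */

        double miss_demand    = 0.040;  /* miss ratio, demand fetching */
        double miss_prefetch  = 0.015;  /* miss ratio with prefetching */
        double prefetch_ratio = 0.030;  /* prefetches per reference    */
        double access_ratio   = 2.0;    /* e.g., always-prefetch       */

        double demand_cost   = D * miss_demand;
        double prefetch_cost = D * miss_prefetch
                             + P * prefetch_ratio
                             + A * (access_ratio - 1.0);

        printf("demand: %.4f cycles/ref, prefetch: %.4f cycles/ref\n",
               demand_cost, prefetch_cost);
        printf("prefetching %s\n",
               prefetch_cost < demand_cost ? "helps" : "hurts");
        return 0;
    }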

We should note also that the miss ratio when using prefetching may not be lower than the miss ratio for demand fetching. The problem here is cache memory pollution; prefetched lines may pollute memory by expelling other lines which are more likely to be referenced. This issue is discussed extensively and with some attempt at analysis in SMIT78c; in SMIT78b a number of experimental results are shown. We found earlier [SMIT78b] that the major factor in determining whether prefetching is useful was the line size. Lines of 256 or fewer bytes (such as are commonly used in caches) generally resulted in useful prefetching; larger lines (or pages) made prefetching ineffective. The reason for this is that a prefetch to a large line brings in a great deal of information, much or all of which may not be needed, and removes an equally large amount of information, some of which may still be in use.

A prefetch algorithm has three major concerns: (1) when to initiate a prefetch, (2) which line(s) to prefetch, and (3) what replacement status to give the prefetched block. We believe that in cache memories, because of the need for fast hardware implementation, the only possible line to prefetch is the immediately sequential one; this type of prefetching is also known as one block lookahead (OBL). That is, if line i is referenced, only line i + 1 is considered for prefetching. Other possibilities, which sometimes may result in a lower miss ratio, are not feasible for hardware implementation in a cache at cache speeds. Therefore, we consider only OBL.

If some lines in the cache have been referenced and others are resident only because they were prefetched, then the two types of lines may be treated differently with respect to replacement. Further, a prefetch lookup may or may not alter the replacement status of the line examined. In this paper we have made no distinction between the effect of a reference or a prefetch lookup on the replacement status of a line. That is, a line is moved to the top of the LRU stack for its set if it is referenced, prefetched, or is the target of a prefetch lookup; LRU is used for replacement for all prefetch experiments in this paper. (See Section 2.2.2.) The replacement status of these three cases was varied in SMIT78c, and in that paper it was found that such distinctions in replacement status had little effect on the miss ratio.

There are several possibilities for when to initiate a prefetch. For example, a prefetch can occur on instruction fetches, data reads and/or data writes, when a miss occurs, always, when the last nth of a line is accessed, when a sequential access pattern has already been observed, and so on. Prefetching when a sequential access pattern has been observed or when the last nth segment (n = 1/2, 1/4, etc.) of a line has been used is likely to be ineffective for reasons of timing: the prefetch will not be complete when the line is needed. In SMIT78b we showed that limiting prefetches only to instruction accesses or only to data accesses is less effective than making all memory accesses eligible to start prefetches. See also BENN82.

It is possible to create prefetch algorithms or mechanisms which employ information not available within the cache memory. For example, a special instruction could be invented to initiate prefetches. No machine, to our knowledge, has such an instruction, nor have any evaluations been performed of this idea, and we are inclined to doubt its utility in most cases. A prefetch instruction that specified the transfer of large amounts of information would run the substantial risk of polluting the cache with information that either would not be used for some time, or would not be used at all. If only a small amount of information were prefetched, the overhead of the prefetch might well exceed the value of the savings. However, some sophisticated versions of this idea might work. One such would be to make a record of the contents of the cache whenever the execution of a process was stopped, and after the process had been restarted, to restore the cache, or better, only its most recently used half. This idea is known as working set restoration and has been studied to some extent for paged main memories. The complexity of implementing it for cache makes it unlikely to be worthwhile, although further study is called for.

Another possibility would be to recognize when a base register is loaded by the process and then to cause some number of lines (one, two, or three) following the loaded address to be prefetched [POME80b, HOEV81a, HOEV81b]. Implementing this is easy, but architectural and software changes are required to ensure that the base registers are known or recognized, and that modifications to them initiate prefetches. No evaluation of this idea is available, but a decreased miss ratio appears likely to result from its implementation. The effect could be very minor, though, and needs to be evaluated experimentally before any modification of current software or hardware is justified.

We consider three types of prefetching in this paper: (1) always prefetch, (2) prefetch on misses, and (3) tagged prefetch. Always prefetch means that on every memory reference, access to line i (for all i) implies a prefetch access for line i + 1. Thus the access ratio in this case is always 2.0. Prefetch on misses implies that a reference to a block i causes a prefetch to block i + 1 if and only if the reference to block i itself was a miss. Here, the access ratio is 1 + miss ratio. Tagged prefetch is a little more complicated, and was first proposed by GIND77. We associate with each line a single bit called the tag, which is set to one whenever the line is accessed by a program. It is initially zero and is reset to zero when the line is removed from the cache. Any line brought to the cache by a prefetch operation retains its tag of zero. When a tag changes from 0 to 1 (i.e., when the line is referenced for the first time after prefetching or is demand-fetched), a prefetch is initiated for the next sequential line. The idea is very similar to prefetching on misses only, except that a miss which did not occur because the line was prefetched (i.e., had there not been a prefetch, there would have been a miss to this line) also initiates a prefetch.



Figure 6. Comparison of miss ratios for two prefetch strategies and no prefetch.

Two of these prefetch algorithms were tested in SMIT78b: always prefetch and prefetch on misses. It was found that always prefetching reduced the miss ratio by as much as 75 to 80 percent for large cache memory sizes, while increasing the transfer ratio by 20 to 80 percent. Prefetching only on misses was much less effective; it produced only one half, or less, of the decrease in miss ratio produced by always prefetching. The transfer ratio, of course, also increased by a much smaller amount, typically 10 to 20 percent.

The experiments in SMIT78b, while very thorough, used only one set of traces and also did not test the tagged prefetch algorithm. To remedy this, we ran additional experiments; the results are presented in Figure 6. (In this figure, 32-byte lines are used in all cases except for 16-byte lines for the PDP-11 traces, the task-switch interval Q is 10K, and there are 64 sets in all cases.) It can be seen that always prefetching cut the (demand) miss ratio by 50 to 90 percent for most cache sizes and tagged prefetch was almost equally effective. Prefetching only on misses was less than half as good as always prefetching or tagged prefetch in reducing the miss ratio. These results are seen to be consistent across all five sets of traces used.

These experiments are confirmed by the results in Table 1. There we have tabulated the miss, transfer, and access ratios for the three prefetch algorithms considered, as well as the demand miss ratio for each of the sets of traces used and for a variety of memory sizes.


We observe from this table that always prefetch and tagged prefetch are both very successful in reducing the miss ratio. Tagged prefetch has the significant additional benefit of requiring only a small increase in the access ratio over demand fetching. The transfer ratio is comparable for tagged prefetch and always prefetch.

It is important to note that taking advantage of the decrease in miss ratio obtained by these prefetch algorithms depends very strongly on the effectiveness of the implementation. For example, the Amdahl 470V/6-1 has a fairly sophisticated prefetch algorithm built in, but because of the design architecture of the machine, the benefit of prefetch cannot be realized. Although the prefetching cuts the miss ratio in this architecture, it uses too many cache cycles and interferes with normal program accesses to the cache. For that reason, prefetching is not used in the 470V/6-1 and is not available to customers. The more recent 470V/8, though, does contain a prefetch algorithm which is useful and improves machine performance. The V/8 cache prefetch algorithm prefetches (only) on misses, and was selected on the basis that it causes very little interference with normal machine operation. (Prefetch is implemented in the Dorado [CLAR81], but its success is not described.)

The prefetch implementation must at all times minimize its interference with regular machine functioning. For example, prefetch lookups should not block normal program memory accesses. This can be accomplished in three ways: (1) by instituting a second, parallel port to the cache, (2) by deferring prefetches until spare cache cycles are available, or (3) by not repeating recent prefetches. (Repeat prefetches can be eliminated by remembering the addresses of the last n prefetches in a small auxiliary cache. A potential prefetch could be tested against this buffer and not issued if found. This should cut the number of prefetch lookups by 80 to 90 percent for small n.) The move-in (transfer of a line from main to cache memory) and move-out (transfer from cache to main memory) required by a prefetch transfer can be buffered for performance during otherwise idle cache cycles. The main memory busy time engendered by a prefetch transfer seems unavoidable, but is not a serious problem. Also unavoidable is the fact that a prefetch may not be complete by the time the prefetched line is actually needed. This effect was examined in SMIT78b and was found to be minor although noticeable. Further comments and details of a suggested implementation are found in SMIT78b.

We note briefly that it is possible to consider the successful use of prefetching as an indication that the line size is too small; prefetch functions much as a larger line size would. A comparison of the results in Figure 6 and Table 1 with those in Figures 15-21 shows that prefetching on misses for 32-byte lines gives slightly better results than doubling the line size to 64 bytes. Always prefetching and tagged prefetch are both significantly better than the larger line size without prefetching. Therefore, it would appear that prefetching has benefits in addition to those that it provides by simulating a larger line size.

2.2 Placement Algorithm

The cache itself is not a user-addressable memory, but serves only as a buffer for main memory. Thus, in order to locate an element in the cache, it is necessary to have some function which maps the main memory address into a cache location, or to search the cache associatively, or to perform some combination of these two. The placement algorithm determines the mapping function from main memory address to cache location.

Figure 7. The cache is organized as S sets of E elements each.

The most commonly used form of placement algorithm is called set-associative mapping. It involves organizing the cache into S sets of E elements per set (see Figure 7). Given a memory address r(i), a function f will map r(i) into a set s(i), so that f(r(i)) = s(i). The reason for this type of organization may be observed by letting either S or E become one. If S becomes one, then the cache becomes a fully associative memory. The problem is that the large number of lines in a cache would make a fully associative memory both slow and very expensive. (Our comments here apply to non-VLSI implementations; VLSI MOS facilitates broad associative searches.)


Conversely, if E becomes one, we have an organization known as direct mapping [CONT69], in which there is only one element per set. (A more general classification has been proposed by HARD75.) Since the mapping function f is many to one, the potential for conflict in this latter case is quite high: two or more currently active lines may map into the same set. It is clear that, on the average, the conflict and the miss ratio decline with increasing E (as S * E remains constant), while the cost and access time increase. An effective compromise is to select E in the range of 2 to 16. Some typical values, depending on certain cost and performance trade-offs, are: 2 (Amdahl 470V/6-1, VAX 11/780, IBM 370/158-1), 4 (IBM 370/168-1, Amdahl 470V/8, Honeywell 66/80, IBM 370/158-3), 8 (IBM 370/168-3, Amdahl 470V/7), 16 (IBM 3033).

Another placement algorithm utilizes a sector buffer [CONT68], as in the IBM 360/85. In this machine, the cache is divided into 16 sectors of 1024 bytes each. When a word is accessed for which the corresponding sector has not been allocated a place in the cache (the sector search being fully associative), a sector is made available (the LRU sector--see Section 2.4), and a 64-byte block containing the information referenced is transferred. When a word is referenced whose sector is in the cache but whose block is not, the block is simply fetched. The hit ratio for this algorithm is now generally known to be lower than that of the set-associative organization (Private Communication: F. Bookett) and hence we do not consider it further. (This type of design may prove appropriate for on-chip microprocessor caches, since the limiting factor in many microprocessor systems is bandwidth. That topic is currently under study.)

There are two aspects to selecting a placement algorithm for the cache. First, the number of sets S must be chosen while S * E = M remains constant, where M is the number of lines in the cache. Second, the mapping function f, which translates a main memory address into a cache set, must be specified. The second question is most fully explored in SMIT78a; we summarize those results and present some new experimental measurements below. A number of other papers consider one or both of these questions to some extent, and we refer the reader to those [CAMP76, CONT68, CONT69, FUKU77, KAPL73, LIPT68, MATT71, STRE76, THAK78] for additional information.

2.2.1 Set Selection Algorithm

Several possible algorithms are used or have been proposed for mapping an address into a set number. The simplest and most popular is known as bit selection, and is shown in Figure 8. The number of sets S is chosen to be a power of 2 (e.g., S = 2^k). If there are 2^j bytes per line, the j bits 1 ... j select the byte within the line, and bits j + 1 ... j + k select the set. Performing the mapping is thus very simple, since all that is required is the decoding of a binary quantity. Bit selection is used in all computers to our knowledge, including particularly all Amdahl and IBM computers.

Figure 8. The set is selected by the low-order bits of the line number.

Some people have suggested that because bit selection is not random, it might result in more conflict than a random algorithm, which employs a pseudorandom calculation (hashing) to map the line into a set. It is difficult to generate random numbers quickly in hardware, and the usual suggestion is some sort of folding of the address followed by exclusive-ORing of the bits. That is, if bits j + 1 ... b are available for determining the line location, then these b - j bits are grouped into k groups, and within each group an exclusive OR is performed. The resulting k bits then designate a set. (This algorithm is used in TLBs--see Section 2.16.) In our simulations discussed later, we have used a randomizing function of the form s(i) = a * r(i) mod 2^k.

Simulations were performed to compare random and bit-selection set-associative mapping. The results are shown in Figure 9. (32 bytes is the line size used in all cases except the PDP-11 traces, for which 16 bytes are used.) It can be observed that random mapping seems to have a small advantage in most cases, but that the advantage is not significant. Random mapping would probably be preferable to bit-selection mapping if it could be done equally quickly and inexpensively, but several extra levels of logic appear to be necessary. Therefore, bit selection seems to be the most desirable algorithm.

Figure 9. Comparison of miss ratios when using random or bit-selection set-associative mapping.
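The two mappings can be stated compactly in code. The following sketch implements bit selection and one plausible exclusive-OR folding of the line-number bits; the constants match the 64-set, 32-byte-line configuration used in our simulations, but the folding shown is only one of many possible groupings.

    #include <stdio.h>

    #define J 5   /* log2(bytes per line) */
    #define K 6   /* log2(number of sets) */

    /* Bit selection: bits j+1 ... j+k of the address (the low-order
       bits of the line number) give the set directly. */
    static unsigned long set_bit_selection(unsigned long addr)
    {
        return (addr >> J) & ((1UL << K) - 1);
    }

    /* Randomized mapping by folding: the available line-number bits
       are grouped, k at a time, and exclusive-ORed together. */
    static unsigned long set_xor_fold(unsigned long addr)
    {
        unsigned long line = addr >> J, set = 0;

        while (line != 0) {
            set ^= line & ((1UL << K) - 1);
            line >>= K;
        }
        return set;
    }

    int main(void)
    {
        unsigned long a = 0x123456UL;

        printf("bit selection: set %lu\n", set_bit_selection(a));
        printf("xor fold:      set %lu\n", set_xor_fold(a));
        return 0;
    }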

Computing Surveys, Vol 14, No. 3, September 1982 Cache Memories • 489 although the CPU produces a virtual ad- that model here, and then we present some dress. The most common mechanism for new experimental results. avoiding the time to translate the virtual A commonly used model for program address to a real one is to overlap the cache behavior is what is known as the LRU lookup and the translation operation (Fig- Stack Model. (See COFF73 for a more thor- ures 1 and 2). We observe the following: the ough explanation and some basic results.) only address bits that get translated in a In this model, the pages or lines of the virtual memory system are the ones that program's address space are arranged in a specify the page address; the bits that spec- list, with the most recently referenced line ify the byte within the page are invariant at the top of the list, and with the lines with respect to virtual memory translation. arranged in decreasing recency of reference Let there be 2j bytes per line and 2 k sets in from top to bottom. Thus, referencing a the cache, as before. Also, let there be 2 p line moves it to the top of the list and bytes per page. Then (assuming bit-selec- moves all of the lines between the one tion mapping), p - j bits are immediately referenced and the top line down one posi- available to choose the set. If (p -j) _> k, tion. A reference to a line which is the ith then the set can be selected immediately, line in the list (stack) is referred to as a before translation; if (p -j) < k, then the stack distance of i. This model assumes search for the cache line can only be nar- that the stack distances are independently rowed down to a small number 2 (k-p +J) of and identically drawn from a distribution sets. It is quite advantageous if the set can (q(j)), j ffi 1,..., n. This model has been be selected before translation (since the shown not to hold in a formal statistical associative search can be started immedi- sense [LEwI71, LEw173], but the author ately upon completion of translation); thus and others have used this model with good there is a good reason to attempt to keep success in many modeling efforts. the number of sets less than or equal to Each set in the cache constitutes a sepa- 2 (p-J). (We note, though, that there is an rate associative memory. If each set is man- alternative. The Amdahl 470V/6 has 256 aged with LRU replacement, it is possible sets, but only 6 bits immediately available to determine the probability of referencing for set selection. The machine reads out the kth most recently used item in a given both elements of each of the four sets which set as a function of the overall LRU stack could possibly be selected, then after the distance probability distribution {q(j)}. .translation is complete, selects one of those Letp(i, S) be the probability of referencing sets before making the associative search. the ith most recently referenced line in a See also LEE80.) set, given S sets. Then, we show thatp(i, S) Set size is just a different term for the may be calculated from the {q(i)} with the scope of the associative search. The smaller following formula: the degree of associative search, the faster and less expensive the search (except, as p(i, S)flY, ql (l/S) i-I (S - l/S) J-+ - noted above, for MOS VLSI). This is be- JEt cause there are fewer comparators and sig- Note that p(i, 1) ffi q(i). 
In SMIT78a this model was shown to give accurate predictions of the effect of set size.

Experimental results are provided in Figures 10-14. In each of these cases, the number of sets has been varied. The rather curious shape of the curves (and the similarities between different plots) has to do with the task-switch interval and the fact that round-robin scheduling was used. Thus, when a program regained control of the processor, it might or might not, depending on the memory capacity, find any of its working set still in the cache.

Figure 10. Miss ratios as a function of the number of sets and memory capacity. (Traces WFV-APL-WTX-FFT; Q = 10,000; 32-byte lines; 32 to 512 sets.)

Figure 11. Miss ratios as a function of the number of sets and memory capacity. (Traces FGO1...4; Q = 10,000; 32-byte lines; 8 to 256 sets.)

Figure 12. Miss ratios as a function of the number of sets and memory capacity. (Traces CGO1...3, PGO2; Q = 10,000; 32-byte lines; 16 to 512 sets.)

Figure 13. Miss ratios as a function of the number of sets and memory capacity. (Traces PGO-CCOMP-FCOMP-IEBDG; Q = 10,000; 32-byte lines; 16 to 256 sets.)

Figure 14. Miss ratios as a function of the number of sets and memory capacity. (Traces ROFFAS-EDC-TRACE (PDP-11); Q = 10,000; 16-byte lines; 8 to 512 sets.)

Based on Figures 10-14, Figure 33, and the information given by SMIT78a, we believe that the minimum number of elements per set in order to get an acceptable miss ratio is 4 to 8. Beyond 8, the miss ratio is likely to decrease very little if at all. This has also been noted for TLB designs [SATY81]. The issue of maximum feasible set size also suggests that a set size of more than 4 or 8 will be inconvenient and expensive. The only machine known to the author with a set size larger than 8 is the IBM 3033 processor. The reason for such a large set size is that the 3033 has a 64-kbyte cache and 64-byte lines. The page size is 4096 bytes, which leaves only 6 bits for selecting the set, if the translation is to be overlapped. This implies that 16 lines are searched, which is quite a large number. The 3033 is also a very performance-oriented machine, and the extra expense is apparently not a factor.

Values for the set size (number of elements per set) and number of sets for a number of machines are as follows: Amdahl 470V/6 (2, 256), Amdahl 470V/7 (8, 128), Amdahl 470V/8 (4, 512), IBM 370/168-3 (8, 128), IBM 3033 (16, 64), DEC PDP-11/70 (2, 256), DEC VAX 11/780 (2, 512), Itel AS/6 (4, 128) [ROSS78], Honeywell 66/60 and 66/80 (4, 128) [DIET74].

2.3 Line Size

One of the most visible parameters to be chosen for a cache memory is the line size. Just as with paged main memory, there are a number of trade-offs and no single criterion dominates. Below we discuss the advantages of both small and large line sizes. Additional information relating to this problem may be found in other papers [ALSA78, ANAC67, GIBS67, KAPL73, MATT71, MEAD70, and STRE76].

Small line sizes have a number of advantages. The transmission time for moving a small line from main memory to cache is obviously shorter than that for a long line, and if the machine has to wait for the full transmission time, short lines are better. (A high-performance machine will use fetch bypass; see Section 2.1.1.)

Figure 15. Miss ratios as a function of the line size and memory capacity. (Traces CGO1...3, PGO2; Q = 10,000; 64 sets; line sizes including 16, 32, and 64 bytes.)

The small line is less likely to contain unneeded information; only a few extra bytes are brought in along with the actually requested information. The data width of main memory should usually be at least as wide as the line size, since it is desirable to transmit an entire line in one main memory cycle time. Main memory width can be expensive, and short lines minimize this problem.

Large line sizes, too, have a number of advantages. If more information in a line is actually being used, fetching it all at one time (as with a long line) is more efficient. The number of lines in the cache is smaller, so there are fewer logic gates and fewer storage bits (e.g., LRU bits) required to keep and manage address tags and replacement status. A larger line size permits fewer elements/set in the cache (see Section 2.2), which minimizes the associative search logic. Long lines also minimize the frequency of "line crossers," which are requests that span the contents of two lines; in most machines, such requests require two separate fetches within the cache (this is invisible to the rest of the machine).

Note that the advantages cited above for both long and short lines become disadvantages for the other.

Another important criterion for selecting a line size is the effect of the line size on the miss ratio. The miss ratio, however, only tells part of the story. It is inevitable that longer lines make processing a miss somewhat slower (no matter how efficient the overlapping and buffering), so that translating a miss ratio into a measure of machine speed is tricky and depends on the details of the implementation. The reader should bear this in mind when examining our experimental results.

Figures 15-21 show the miss ratio as a function of line size and cache size for five different sets of traces. Observe that we have also varied the multiprogramming quantum time Q. We do so because the miss ratio is affected by the task-switch interval nonuniformly with line size. This nonuniformity occurs because long line sizes load up the cache more quickly. Consider two cases.

Figure 16. Miss ratios as a function of the line size and memory capacity. (Traces PGO-CCOMP-FCOMP-IEBDG; Q = 10,000; 64 sets; line sizes 8 to 128 bytes.)

Figure 17. Miss ratios as a function of the line size and memory capacity. (Traces FGO1...4; Q = 10,000; 64 sets; line sizes 8 to 256 bytes.)

Figure 18. Miss ratios as a function of the line size and memory capacity. (Traces WFV-APL-WTX-FFT; Q = 250K; 64 sets; line sizes 8 to 128 bytes.)

Figure 19. Miss ratios as a function of the line size and memory capacity. (Traces WFV-APL-WTX-FFT; Q = 10,000; 64 sets; line sizes 8 to 128 bytes.)

Figure 20. Miss ratios as a function of the line size and memory capacity. (Traces ROFFAS-EDC-TRACE (PDP-11); Q = 333,334; 64 sets; line sizes 4 to 128 bytes.)

Figure 21. Miss ratios as a function of the line size and memory capacity. (Traces ROFFAS-EDC-TRACE (PDP-11); Q = 10,000; 64 sets.)

First, assume that most cache misses result from task switching. In this case, long lines load up the cache more quickly than small ones. Conversely, assume that most misses occur in the steady state; that is, that the cache is entirely full most of the time with the current process and most of the misses occur in this state. In this latter case, small lines cause less memory pollution and possibly a lower miss ratio. Some such effect is evident when comparing Figures 18 and 19 with Figures 20 and 21, but explanation is required. A quantum of 10,000 results not in a program finding an empty cache, but in it finding some residue of its previous period of activity (since the degree of multiprogramming is only 3 or 4); thus small lines are relatively more advantageous in this case than one would expect.

The most interesting number to be gleaned from Figures 15-21 is the line size which causes the minimum miss ratio for each memory size. This information has been collected in Table 2. The consistency displayed there for the 360/370 traces is surprising; we observe that one can divide the cache size by 128 or 256 to get the minimum miss ratio line size. This rule does not apply to the PDP-11 traces. Programs written for the PDP-11 not only use a different instruction set, but they have been written to run in a small (64K) address space. Without more data, generalizations from Figures 20 and 21 cannot be made.

Table 2. Line Size (in bytes) Giving Minimum Miss Ratio, for Given Memory Capacity and Traces

Traces                      Quantum    Memory size    Minimum miss ratio
                                       (kbytes)       line size (bytes)
CGO1, CGO2, CGO3, PGO2      10,000     4              32
                                       8              64
                                       16             128
                                       32             256
                                       64             256
PGO, CCOMP, FCOMP, IEBDG    10,000     4              32
                                       8              64
                                       16             128
                                       32             256
                                       64             256
FGO1, FGO2, FGO3, FGO4      10,000     4              32
                                       8              64
                                       16             128
                                       32             256
                                       64             128
WFV, APL, WTX, FFT          10,000     4              32
                                       8              64
                                       16             128
                                       32             256
                                       64             128
                            250,000    4              32
                                       8              64
                                       16             64
                                       32             128
                                       64             256
ROFFAS, EDC, TRACE          10,000     2              16
                                       4              32
                                       8              16
                                       16             32
                            333,333    2              8
                                       4              16
                                       8              32
                                       16             64

In comparing the minimum miss ratio line sizes suggested by Table 2 and the offerings of the various manufacturers, one notes a discrepancy. For example, the IBM 168-1 (32-byte line, 16K buffer) and the 3033 (64-byte line, 64K buffer) both have surprisingly small line sizes. The reason for this is almost certainly that the transmission time for longer lines would induce a performance penalty, and the main memory data path width required would be too large and therefore too expensive.

Kumar [KUMA79] also finds that the line sizes in the IBM 3033 and Amdahl 470 are too small. He creates a model for the working set size w of a program, of the form w(k) = c/k^a, where k is the block size, and c and a are constants. By making some convenient assumptions, Kumar derives from this an expression for the miss ratio as a function of the block size. Both expressions are verified for three traces, and a is measured to be in the range of 0.45 to 0.85 over the three traces and various working set window sizes. He then found that for those machines, the optimum block size lies in the range 64 to 256 bytes.

It is worth considering the relationship between prefetching and line size. Prefetching can function much as a larger line size would. In terms of miss ratio, it is usually even better; although a prefetched line that is not being used can be swapped out, a half of a line that is not being used cannot be removed independently. Comparisons between the results in Section 2.1 and this section show that the performance improvement from prefetching is significantly larger than that obtained by doubling the line size.

Line sizes in use include: 128 bytes (IBM 3081 [IBM82]), 64 bytes (IBM 3033), 32 bytes (Amdahl 470s, Itel AS/6, IBM 370/168), 16 bytes (Honeywell 66/60 and 66/80), 8 bytes (DEC VAX 11/780), and 4 bytes (PDP-11/70).

2.4 Replacement Algorithm

2.4.1 Classification

In the steady state, the cache is full, and a cache miss implies not only a fetch but also a replacement; a line must be removed from the cache. The problem of replacement has been studied extensively for paged main memories (see SMIT78d for a bibliography), but the constraints on a replacement algorithm are much more stringent for a cache memory. Principally, the cache replacement algorithm must be implemented entirely in hardware and must execute very quickly, so as to have no negative effect on processor speed. The set of feasible solutions is still large, but many of them can be rejected on inspection.

The usual classification of replacement algorithms groups them into usage-based versus non-usage-based, and fixed-space versus variable-space. Usage-based algorithms take the record of use of the line (or page) into account when doing replacement; examples of this type of algorithm are LRU (least recently used) [COFF73] and Working Set [DENN68]. Conversely, non-usage-based algorithms make the replacement decision on some basis other than and not related to usage; FIFO and Rand (random or pseudorandom) are in this class. (FIFO could arguably be considered usage-based, but since reuse of a line does not improve the replacement status of that line, we do not consider it as being such.) Fixed-space algorithms assume that the amount of memory to be allocated is fixed; replacement simply consists of selecting a specific line. If the algorithm varies the amount of space allocated to a specific process, it is known as a variable-space algorithm, in which case a fetch does not imply a replacement, and a swap-out can take place without a corresponding fetch. Working Set and Page Fault Frequency [CHU76] are variable-space algorithms.

The cache memory is fixed in size, and it is usually too small to hold the working set of more than one process (although the 470V/8 and 3033 may be exceptions). For this reason, we believe that variable-space algorithms are not suitable for a cache memory. To our knowledge, no variable-space algorithm has ever been used in a cache memory.

It should also be clear that in a set-associative memory, replacement must take place in the same set as the fetch. A line is being added to a given set because of the fetch, and thus a line must be removed. Since a line maps uniquely into a set, the replaced line in that set must be entirely removed from the cache.

The set of acceptable replacement algorithms is thus limited to fixed-space algorithms executed within each set. The basic candidates are LRU, FIFO, and Rand. It is our experience (based on prior experiments and on material in the literature) that non-usage-based algorithms all yield comparable performance. We have chosen FIFO within set as our example of a non-usage-based algorithm.

2.4.2 Comparisons

Comparisons between FIFO and LRU appear in Table 3, where we show results based on each set of traces for varying memory sizes, quantum sizes, and set numbers. We found (averaging over all of the numbers there) that FIFO yields a miss ratio approximately 12 percent higher than LRU, although the ratios of FIFO to LRU miss ratio range from 0.96 to 1.38. This 12 percent difference is significant in terms of performance, and LRU is clearly a better choice if the cost of implementing LRU is small and the implementation does not slow down the machine. We note that in minicomputers (e.g., PDP-11) cost is by far the major criterion; consequently, in such systems, this performance difference may not be worthwhile. The interested reader will find additional performance data and discussion in other papers [CHIA75, FURS78, GIBS67, LEE69, and SIRE76].

2.4.3 Implementation

It is important to be able to implement LRU cheaply, and so that it executes quickly; the standard implementation in software using linked lists is unlikely to be either cheap or fast.


Table 3. Miss Ratio Comparison for FIFO and LRU (within set) Replacement

Memory size   Quantum   Number    Miss ratio   Miss ratio   Ratio       Traces
(kbytes)      size      of sets   LRU          FIFO         FIFO/LRU
16            10K       64        0.02162      0.02254      1.04        WFV
32            10K       64        0.00910      0.01143      1.26        APL
16            250K      64        0.00868      0.01036      1.19        WTX
32            250K      64        0.00523      0.00548      1.05        FFT
16            10K       256       0.02171      0.02235      1.03
32            10K       256       0.01057      0.01186      1.12
4             10K       64        0.00845      0.00947      1.12        ROFFAS
8             10K       64        0.00255      0.00344      1.35        EDC
4             333K      64        0.00173      0.00214      1.24        TRACE
8             333K      64        0.00120      0.00120      1.00
4             10K       256       0.0218       0.02175      0.998
8             10K       256       0.00477      0.00514      1.08
4             333K      256       0.01624      0.01624      1.00
8             333K      256       0.00155      0.00159      1.03
64            10K       64        0.01335      0.01839      1.38        CGO1
128           10K       64        0.01147      0.01103      0.96        CGO2
64            10K       256       0.01461      0.01894      1.30        CGO3
128           10K       256       0.01088      0.01171      1.08        PGO2
16            10K       64        0.01702      0.01872      1.10        FGO1
32            10K       64        0.00628      0.00867      1.38        FGO2
16            10K       256       0.01888      0.01934      1.02        FGO3
32            10K       256       0.00857      0.00946      1.10        FGO4
16            10K       64        0.03428      0.03496      1.02        PGO1
32            10K       64        0.02356      0.02543      1.08        CCOMP
16            10K       256       0.03540      0.03644      1.03        FCOMP
32            10K       256       0.02394      0.02534      1.06        IEBDG

Average                                                     1.116

For a set size of two, only a hot/cold (toggle) bit is required. More generally, replacement in a set of E elements can be effectively implemented with E(E - 1)/2 bits of status. (We note that ⌈log2(E!)⌉ bits of status are the theoretical minimum.) One creates an upper triangular matrix (without the diagonal, that is, i < j) which we will call R and refer to as R(i, j). When line i is referenced, row i of R(i, j) is set to 1, and column i of R(j, i) is set to 0. The LRU line is the one for which the row is entirely equal to 0 (for those bits in the row; the row may be empty) and for which the column is entirely 1 (for all the bits in the column; the column may be empty). This algorithm can be easily implemented in hardware, and executes rapidly. See MARU75 for an extension and MARU76 for an alternative.

The above algorithm requires a number of LRU status bits that increases with the square of the set size. This number is acceptable for a set size of 4 (470V/8, Itel AS/6), marginal for a set size of eight (470V/7), and unacceptable for a set size of 16. For that reason, IBM has chosen to implement approximations to LRU in the 370/168 and the 3033. In the 370/168-3 [IBM75], the set size is 8, with the 8 lines grouped in 4 pairs. The LRU pair of lines is selected, and then the LRU block of the pair is the one used for replacement. This algorithm requires only 10 bits, rather than the 28 needed by the full LRU. A set size of 16 is found in the 3033 [IBM78]. The 16 lines that make up a set are grouped into four groups of two pairs of two lines. The line to be replaced is selected as follows: (1) find the LRU group of four lines (requiring 6 bits of status), (2) find the LRU pair of the two pairs (1 bit per group, thus 4 more bits), and (3) find the LRU line of that pair (1 bit per pair, thus 8 more bits). In all, 18 bits are used for this modified LRU algorithm, as opposed to the 120 bits required for a full LRU. No experiments have been published comparing these modified LRU algorithms with genuine LRU, but we would expect to find no measurable difference.

Implementing either FIFO or Rand is much easier than implementing LRU. FIFO is implemented by keeping a modulo E counter (E elements/set) for each set; it is incremented with each replacement and points to the next line for replacement. Rand is simpler still. One possibility is to use a single modulo E counter, incremented in a variety of ways: by each clock cycle, each memory reference, or each replacement anywhere in the cache. Whenever a replacement is to occur, the value of the counter is used to indicate the replaceable line within the set.
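The matrix method described above is simple enough to state in a few lines of C. This sketch is ours and illustrative only; it uses a full E x E bit array rather than the packed E(E - 1)/2 bits a hardware design would keep:

#include <stdbool.h>

#define E 4                        /* set size (elements per set) */

typedef struct { bool r[E][E]; } lru_matrix;

/* On a reference to line i: set row i to 1, clear column i to 0. */
static void lru_touch(lru_matrix *m, int i) {
    for (int j = 0; j < E; j++) {
        m->r[i][j] = true;
        m->r[j][i] = false;
    }
}

/* The LRU line is the one whose row is entirely 0. */
static int lru_victim(const lru_matrix *m) {
    for (int i = 0; i < E; i++) {
        bool row_zero = true;
        for (int j = 0; j < E; j++)
            if (j != i && m->r[i][j]) { row_zero = false; break; }
        if (row_zero) return i;
    }
    return 0;   /* not reached if the matrix is maintained consistently */
}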


Table 4. Percentage of Memory References That Are Reads, Writes, and Instruction Fetches for Each Trace (a)

              Partial trace                      Full trace
Trace         Data read  Data write  IFETCH     Data read  Data write  IFETCH
WATFIV        23.3       16.54       60.12      --         15.89       --
APL           21.5       8.2         70.3       --         8.90        --
WATEX         24.5       9.07        66.4       --         7.84        --
FFT1          23.4       7.7         68.9       --         7.59        --
ROFFAS        37.5       4.96        57.6       38.3       5.4         56.3
EDC           30.4       10.3        59.2       29.8       11.0        59.2
TRACE         47.9       10.2        41.9       48.6       10.0        41.3
CGO1          41.5       34.2        24.3       42.07      34.19       23.74
CGO2          41.1       32.4        26.5       36.92      15.42       47.66
CGO3          37.7       22.5        39.8       37.86      22.55       39.59
PGO2          31.6       15.4        53.1       30.36      12.77       56.87
FGO1          29.9       17.6        52.6       30.57      11.26       58.17
FGO2          30.6       5.72        63.7       32.54      10.16       57.30
FGO3          30.0       12.8        57.2       30.60      13.25       56.15
FGO4          28.5       17.2        54.3       28.38      17.29       54.33
PGO1          29.7       19.8        50.5       28.68      16.93       54.39
CCOMP1        30.8       9.91        59.3       33.42      17.10       49.47
FCOMP1        29.5       20.7        50.0       30.80      15.51       53.68
IEBDG         39.3       28.1        32.7       39.3       28.2        32.5
Average       32.0       15.96       52.02      34.55      14.80       49.38
Stand. Dev.   7.1        8.7         13.4       5.80       7.21        10.56

(a) Partial trace results are for the first 250,000 memory references for the IBM 370 traces, and 333,333 memory references for the PDP-11 traces. Full trace results refer to the entire length of the memory address trace (one to ten million memory references).

2.5 Write-Through versus Copy-Back

When the CPU executes instructions that modify the contents of the current address space, those changes must eventually be reflected in main memory; the cache is only a temporary buffer. There are two general approaches to updating main memory: stores can be immediately transmitted to main memory (called write-through or store-through), or stores can initially only modify the cache, and can later be reflected in main memory (copy-back). There are issues of performance, reliability, and complexity in making this choice; these issues are discussed in this section. Further information can be found in the literature [AGRA77a, BELL74, POHM75, and RIS77]. A detailed analysis of some aspects of this problem is provided in SMIT79 and YEN81.

To provide an empirical basis for our discussion in this section, we refer the reader to Table 4. There we show the percentage of memory references that resulted from data reads, data writes, and instruction fetches for each of the traces used in this paper. The leftmost three columns show the results for those portions of the traces used throughout this paper; that is, the 370 traces were run for the first 250,000 memory references and the PDP-11 traces for 333,333 memory references. When available, the results for the entire trace (1 to 10 million memory references) are shown in the rightmost columns. The overall average shows 16 percent of the references were writes, but the variation is wide (5 to 34 percent) and the values observed are clearly very dependent on the source language and on the machine architecture. In SMIT79 we observed that the fraction of lines from the cache that had to be written back to main memory (in a copy-back cache) ranged from 17 to 56 percent.

Several issues bear on the trade-off between write-through and copy-back.

1. Main Memory Traffic. Copy-back almost always results in less main memory traffic, since write-through requires a main memory access on every store, whereas copy-back only requires a store to main memory if the swapped out line (when a miss occurs) has been modified. Copy-back generally, though, results in the entire line being written back, rather than just one or two words, as would occur for each write memory reference (unless "dirty bits" are associated with partial lines; a dirty bit, when set, indicates the line has been modified while in the cache). For example, assume a miss ratio of 3 percent, a line size of 32 bytes, a memory module width of 8 bytes, a 16 percent store frequency, and 30 percent of all cache lines requiring a copy-back operation. Then the ratio of main memory store cycles to total memory references is 0.16 for write-through and 0.036 for copy-back.
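To make the arithmetic in item 1 explicit (our restatement of the example just given, with m the miss ratio, d the fraction of lines requiring copy-back, L the line size, W the memory module width, and f_store the store frequency):

\[ \left.\frac{\text{store cycles}}{\text{reference}}\right|_{\text{write-through}} = f_{\text{store}} = 0.16 \]

\[ \left.\frac{\text{store cycles}}{\text{reference}}\right|_{\text{copy-back}} = m \cdot d \cdot \frac{L}{W} = 0.03 \times 0.30 \times \frac{32}{8} = 0.036 \]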
2. Cache Consistency. If store-through is used, main memory always contains an up-to-date copy of all information in the system. When there are multiple processors in the system (including independent channels), main memory can serve as a common and consistent storage place, provided that additional mechanisms are used. Otherwise, either the cache must be shared or a complicated directory system must be employed to maintain consistency. This subject is discussed further in Section 2.7, but we note here that store-through simplifies the memory consistency problem.

3. Complicated Logic. Copy-back may complicate the cache logic. A dirty bit is required to determine when to copy a line back. In addition, arrangements have to be made to perform the copy-back before the fetch (on a miss) can be completed.

4. Fetch-on-write. Using either copy-back or write-through still leaves undecided the question of whether or not to fetch-on-write, if the information referenced is not in the cache. With copy-back, one will usually fetch-on-write, and with write-through, usually not. There are additional related possibilities and problems. For example, when using write-through, one could not only not fetch-on-write but one could choose actually to purge the modified line from the cache should it be found there. If the line is found in the cache, its replacement status (e.g., LRU) may or may not be updated. This is considered in item 6 below.

5. Buffering. Buffering is required for both copy-back and write-through. In copy-back, a buffer is required so that the line to be copied back can be held temporarily in order to avoid interfering with the fetch. One optimization worth noting for copy-back is to use spare cache/main memory cycles to do the copy-back of "dirty" (modified) lines [BALZ81b]. For write-through, it is important to buffer several stores so that the CPU does not have to wait for them to be completed. Each buffer consists of a data part (the data to be stored) and the address part (the target address). In SMIT79 it was shown that a buffer with capacity of four provided most of the performance improvement possible in a write-through system. This is the number used in the IBM 3033. We note that a great deal of extra logic may be required if buffering is used. There is not only the logic required to implement the buffers, but also there must be logic to test all memory access addresses and match them against the addresses in the address part of the buffers. That is, there may be accesses to the material contained in the store buffers before the data in those buffers has been transferred to main memory. Checks must be made to avoid possible inconsistencies.

6. Reliability. If store-through is used, main memory always has a valid copy of the total memory state at any given time. Thus, if a processor fails (along with its cache), a store-through system can often be restored more easily. Also, if the only valid copy of a line is in the cache, an error-correcting code is needed there. If a cache error can be corrected from main memory, then a parity check is sufficient in the cache.

Some experiments were run to look at the miss ratio for store-through and copy-back. A typical example is shown in Figure 22; the other sets of traces yield very similar results. (In the case of write-through, we have counted each write as a miss.) It is clear that write-through always produces a much higher miss ratio. The terms reorder and no reorder specify how the replacement status of the lines was updated.

Figure 22. Miss ratio for copy-back and write-through. All writes counted as misses for write-through. No reorder implies replacement status not modified on write; reorder implies replacement status is updated on write. (Traces PGO-CCOMP-FCOMP-IEBDG; 64 sets; 32-byte lines; Q = 10K.)

Reorder means that a modified line is moved to the top of the LRU stack within its set in the cache. No reorder implies that the replacement status of the line is not modified on write. From Figure 22, it can be seen that there is no significant difference in the two policies. For this reason, the IBM 3033, a write-through machine, does not update the LRU status of lines when a write occurs.

With respect to performance, there is no clear choice to be made between write-through and copy-back. This is because a good implementation of write-through seldom has to wait for a write to complete. A good implementation of write-through requires, however, both sufficient buffering of writes and sufficient interleaving of main memory so that the probability of the CPU becoming blocked while waiting on a store is small. This appears to be true in the IBM 3033, but at the expense of a great deal of buffering and other logic. For example, in the 3033 each buffered store requires a double-word datum buffer, a single buffer for the store address, and a 1-byte buffer to indicate which bytes have been modified in the double-word store. There are also comparators to match each store address against subsequent accesses to main memory, so that references to the modified data get the updated values. Copy-back could probably have been implemented much more cheaply.

The Amdahl computers all use copy-back, as does the IBM 3081. IBM uses store-through in the 370/168 [IBM75] and the 3033 [IBM78], as does DEC in the PDP-11/70 [SIRE76] and VAX 11/780 [DEC78], Honeywell in the 66/60 and 66/80, and Itel in the AS/6.

2.6 Effect of Multiprogramming: Cold-Start and Warm-Start

A significant factor in the cache miss ratio is the frequency of task switching or, inversely, the value of the mean intertask switch time, Q. The problem with task switching is that every time the active task is changed, a new process may have to be loaded from scratch into the cache. This issue was discussed by EAST75, where the terms warm-start and cold-start were coined to refer to the miss ratio starting with a full memory and the miss ratio starting with an empty memory, respectively. Other papers which discuss the problem include EAST78, KOBA80, MACD79, PEUT77, POHM75, SCHR71, and SIRE76.

Typically, a program executes for a period of time before an interruption (I/O, clock, etc.) of some type invokes the supervisor. The supervisor eventually relinquishes control of the processor to some user process, perhaps the same one as was running most recently. If it is not the same user process, the new process probably does not find any lines of its address space in the cache, and starts immediately with a number of misses. If the most recently executed process is restarted, and if the supervisor interruption has not been too long, some useful information may still remain. In PEUT77, some figures are given about the length of certain IBM supervisor interruptions and what fraction of the cache is purged. (One may also view the user as interrupting the supervisor and increasing the supervisor miss ratio.)

The effect of the task-switch interval on the miss ratio cannot be easily estimated. In particular, the effect depends on the workload and on the cache size. We also observe that the proportion of cache misses due to task switching increases with increasing cache size, even though the absolute miss ratio declines. This is because a small cache has a large inherent miss ratio (since it does not hold the program's working set) and this miss ratio is only slightly augmented by task-switch-induced misses. Conversely, the inherent low miss ratio of a large cache is greatly increased, in relative terms, by task switching. We are not aware of any current machine for which a breakdown of the miss ratio into these two components has been done.

Some experimental results bearing on this problem appear in Figures 23 and 24. In each, the miss ratio is shown as a function of the memory size and task-switch interval Q (Q is the number of memory references). The figures presented can be understood as follows. A very small Q (e.g., 100, 1,000) implies that the cache is shared between all of the active processes, and that when a process is restarted it finds a significant fraction of its previous information still in the cache. A very large Q (e.g., 100,000, 250,000) implies that when the program is restarted it finds an empty cache (with respect to its own working set), but that the new task runs long enough first to fill the cache and then to take advantage of the full cache. Intermediate values for Q result in the situation where a process runs for a while but does not fill the cache; however, when it is restarted, none of its information is still cache resident (since the multiprogramming degree is four). These three regions of operation are evident in Figures 23 and 24 as a function of Q and of the cache size. (In SATY81, Q is estimated to be about 25,000.)

There appear to be several possible solutions to the problem of high cache miss ratios due to task switching. (1) It may be possible to lengthen the task-switch interval. (2) The cache can be made so large that several programs can maintain information in it simultaneously. (3) The scheduling algorithm may be modified in order to give preference to a task likely to have information resident in the cache. (4) If the working set of a process can be identified (e.g., from the previous period of execution), it might be reloaded as a whole; this is called working set restoration. (5) Multiple caches may be created; for example, a separate cache could be established for the supervisor to use, so that, when invoked, it would not displace user data from the cache. This idea is considered in Section 2.10, and some of the problems of the approach are indicated.

A related concept is the idea of bypassing the cache for operations unlikely to result in the reuse of data. For example, long vector operations such as a very long move character (e.g., IBM MVCL) could bypass the cache entirely [LOSQ82] and thereby avoid displacing other data more likely to be reused.
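The three regions described above can be reproduced qualitatively even with a toy simulation. The sketch below is ours and entirely synthetic (the reference streams, cache organization, and all constants are invented for illustration; the results in this paper use real traces and a set-associative cache): it runs four programs round-robin with quantum Q over a small direct-mapped cache and reports the overall miss ratio.

#include <stdio.h>
#include <stdlib.h>

#define SETS  64
#define NPROG 4

typedef struct { int valid, pid; unsigned tag; } line_t;

int main(void) {
    line_t cache[SETS] = {0};
    unsigned pos[NPROG] = {0};       /* per-program locality pointer */
    long misses = 0, refs = 1000000, Q = 10000;

    srand(1);
    for (long t = 0; t < refs; t++) {
        int pid = (int)((t / Q) % NPROG);          /* round-robin */
        /* mostly sequential references with occasional jumps */
        pos[pid] += (rand() % 8 == 0) ? (unsigned)(rand() % 4096) : 1u;
        unsigned lineaddr = pos[pid] / 32;         /* 32-byte lines */
        unsigned set = lineaddr % SETS, tag = lineaddr / SETS;
        if (!cache[set].valid || cache[set].pid != pid
                              || cache[set].tag != tag) {
            misses++;                              /* miss: fetch line */
            cache[set].valid = 1;
            cache[set].pid = pid;
            cache[set].tag = tag;
        }
    }
    printf("Q = %ld: miss ratio = %.4f\n", Q, (double)misses / (double)refs);
    return 0;
}

Sweeping Q from 100 to 250,000 in such a model shows the same qualitative behavior: sharing at small Q, cold starts at intermediate Q, and amortization of the cold start at large Q.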
If it is not the same three regions of operation are evident in user process, the new process probably does Figures 23 and 24 as a function of Q and of not find any lines of its address space in the the cache size. (In SATY81, Q is estimated cache, and starts immediately with a num- to be about 25,000.) ber of misses. If the most recently executed There appear to be several possible so- process is restarted, and if the supervisor lutions to the problem of high cache miss interruption has not been too long, some ratios due to task switching. (1) It may be useful information may still remain. In possible to lengthen the task-switch inter- PEUT77, some figures are given about the val. (2) The cache can be made so large length of certain IBM that several programs can maintain infor- supervisor interruptions and what fraction mation in it simultaneously. (3) The sched- of the cache is purged. (One may also view uling algorithm may be modified in order the user as interrupting the supervisor and to give preference to a task likely to have increasing the supervisor miss ratio.) information resident in the cache. (4) If the The effect of the task-switch interval on working set of a process can be identified the miss ratio cannot be easily estimated. (e.g., from the previous period of execu- In particular, the effect depends on the tion), it might be reloaded as a whole; this workload and on the cache size. We also is called working set restoration. (5) Mul- observe that the proportion of cache misses tiple caches may be created; for example, a due to task switching increases with in- separate cache could be established for the creasing cache size, even though the abso- supervisor to use, so that, when invoked, it lute miss ratio declines. This is because a would not displace user data from the small cache has a large inherent miss ratio cache. This idea is considered in Section (since it does not hold the program's work- 2.10, and some of the problems of the ap- ing set) and this miss ratio is only slightly proach are indicated. augmented by task-switch-induced misses. A related concept is the idea of bypassing Conversely, the inherent low miss ratio of the cache for operations unlikely to result a large cache is greatly increased, in relative in the reuse of data. For example, long terms, by task switching. We are not aware vector operations such as a very long move of any current machine for which a break- character (e.g., IBM MVCL) could bypass down of the miss ratio into these two com- the cache entirely [LOSQ82] and thereby ponents has been done. avoid displacing other data more likely to Some experimental results bearing on be reused. this problem appear in Figures 23 and 24. In each, the miss ratio is shown as a func- 2.7 Multicache Consistency tion of the memory size and task-switch interval Q (Q is the number of memory Large modern computer systems often have references). The figures presented can be several independent processors, consisting understood as follows. A very small Q (e.g., sometimes of several CPUs and sometimes 100, 1,000) implies that the cache is shared of just a single CPU and several channels. Computing Surveys, Vol. 14, No. 3, September 1982 504 • A. J. Smith

Figure 23. Miss ratio versus memory capacity for a range of multiprogramming intervals Q. (Traces CGO1...3, PGO2; 64 sets; 32-byte lines; Q = 3,000 to 250,000.)

Figure 24. Miss ratio versus memory capacity for a variety of multiprogramming intervals Q. (Traces WFV-APL-WTX-FFT; 64 sets; 32-byte lines; Q = 3,000 to 250,000.)

2.7 Multicache Consistency

Large modern computer systems often have several independent processors, consisting sometimes of several CPUs and sometimes of just a single CPU and several channels. Each processor may have zero, one, or several caches. Unfortunately, in such a multiple processor system, a given piece of information may exist in several places at a given time, and it is important that all processors have access (as necessary) to the same, unique (at a given time) value. Several solutions exist and/or have been proposed for this problem. In this section, we discuss many of these solutions; the interested reader should refer to BEAN79, CENS78, DRIM81a, DUBO82, JONE76, JONE77a, MAZA77, McWI77, NGAI81, and TANG76 for additional explanation.

As a basis for discussion, consider a single CPU with a cache and with a main memory behind the cache. The CPU reads an item into the cache and then modifies it. A second CPU (of similar design and using the same main memory) also reads the item and modifies it. Even if the CPUs were using store-through, the modification performed by the second CPU would not be reflected in the cache of the first CPU unless special steps were taken. There are several possible special steps.

1. Shared Cache. All processors in the system can use the same cache. In general, this solution is infeasible because the bandwidth of a single cache usually is not sufficient to support more than one CPU, and because additional access time delays may be incurred because the cache may not be physically close enough to both (or all) processors. This solution is employed successfully in the Amdahl 470 computers, where the CPU and the channels all use the same cache; the 470 series does not, however, permit tightly coupled CPUs. The UNIVAC 1100/80 [BORG79] permits two CPUs to share one cache.

2. Broadcast Writes. Every time a CPU performs a write to the cache, it also sends that write to all other caches in the system. If the target of the store is found in some other cache, it may be either updated or invalidated. Invalidation may be less likely to create inconsistencies, since updates can possibly "cross," such that CPU A updates its own cache and then B's cache, while CPU B simultaneously updates its own cache and then A's. Updates also require more data transfer. The IBM 370/168 and 3033 processors use invalidation. A store by a CPU or channel is broadcast to all caches sharing the same main memory. This broadcast store is placed in the buffer invalidation address stack (BIAS), which is a list of addresses to be invalidated in the cache. The buffer invalidation address stack has a high priority for cache cycles, and if the target line is found in that cache, it is invalidated.

The major difficulty with broadcasting store addresses is that every cache memory in the system is forced to surrender a cycle for invalidation lookup to any processor which performs a write. The memory interference that occurs is generally acceptable for two processors (e.g., IBM's current MP systems), but significant performance degradation is likely with more than two processors. A clever way to minimize this problem appears in a recent patent [BEAN79]. In that patent, a BIAS Filter Memory (BFM) is proposed. A BFM is associated with each cache in a tightly coupled MP system. This filter memory works by filtering out repeated requests to invalidate the same block in a cache.

3. Software Control. If a system is being written from scratch and the architecture can be designed to support it, then software control can be used to guarantee consistency. Specifically, certain information can be designated noncacheable, and can be accessed only from main memory. Such items are usually semaphores and perhaps data structures such as the job queue. For efficiency, some shared writeable data has to be cached. The CPU must therefore be equipped with commands that permit it to purge any such information from its own cache as necessary. Access to shared writeable cacheable data is possible only within critical sections, protected by noncacheable semaphores. Within the critical regions, the code is responsible for restoring all modified items to main memory before releasing the lock. Just such a scheme is intended for the S-1 multiprocessor system under construction at the Lawrence Livermore Laboratory [HAIL79, McWI77]. The Honeywell Series 66 machines use a similar mechanism. In some cases, the simpler alternative of making shared writeable information noncacheable may be acceptable.

4. Directory Methods. It is possible to keep a centralized and/or distributed directory of all main memory lines, and use it to ensure that no lines are write-shared. One such scheme is as follows, though several variants are possible. The main memory maintains k + 1 bits for each line in main memory, when there are k caches in the system. Bit i, i = 1, ..., k, is set to 1 if the corresponding cache contains the line. The (k + 1)th bit is 1 if the line is being or has been modified in a cache, and otherwise is 0. If the (k + 1)th bit is on, then exactly one of the other bits is on. Each CPU has associated with each line in its cache a single bit (called the private bit). If that bit is on, that CPU has the only valid copy of that line. If the bit is off, other caches and main memory may also contain current copies. Exactly one private bit is set if and only if the main memory directory bit k + 1 is set.

A CPU can do several things which provoke activity in this directory system. If a CPU attempts to read a line which is not in its cache, the main memory directory is queried. There are two possibilities: either bit k + 1 is off, in which case the line is transferred to the requesting cache and the corresponding bit set to indicate this; or bit k + 1 is on, in which case the main memory directory must recall the line from the cache which contains the modified copy, update main memory, invalidate the copy in the cache that modified it, send the line to the requesting CPU/cache, and finally update itself to reflect these changes. (Bit k + 1 is then set to zero, since the request was a read.)

An attempt to perform a write causes one of three possible actions. If the line is already in cache and has already been modified, the private bit is on and the write takes place immediately. If the line is not in cache, then the main memory directory must be queried. If the line is in any other cache, it must be invalidated (in all other caches), and main memory must be updated if necessary. The main memory directory is then set to reflect the fact that the new cache contains the modified copy of the data, the line is transmitted to the requesting cache, and the private bit is set. The third possibility is that the cache already contains the line but that it does not have its private bit set. In this case, permission must be requested from the main memory directory to perform the write. The main memory directory invalidates any other copies of the line in the system, marks its own directory suitably, and then gives permission to modify the data.
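The directory bookkeeping for the scheme just described is compact. The following C sketch is ours and purely illustrative; the recall, update, and invalidation messages to the caches themselves are abstracted into comments, and only the main memory directory state is shown:

#include <stdbool.h>

#define K 4                     /* number of caches in the system */

typedef struct {
    bool present[K];            /* bit i: cache i holds the line        */
    bool modified;              /* the (k+1)th bit: modified in a cache */
} dir_entry;

/* Read miss by cache c. */
static void dir_read_miss(dir_entry *d, int c) {
    if (d->modified) {
        /* recall the line from its owner, update main memory,
           and invalidate the owner's copy */
        for (int i = 0; i < K; i++) d->present[i] = false;
        d->modified = false;    /* the request was a read */
    }
    d->present[c] = true;       /* line transferred to cache c */
}

/* Write by cache c (if c already holds the line modified, the state
   below is already correct and nothing changes). */
static void dir_write(dir_entry *d, int c) {
    for (int i = 0; i < K; i++)
        if (i != c) d->present[i] = false;  /* invalidate other copies */
    d->present[c] = true;
    d->modified = true;         /* cache c's private bit would now be set */
}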
The performance implications of this method are as follows. The cost of a miss may increase significantly due to the need to query the main memory directory and possibly retrieve data from other caches. The use of shared writeable information becomes expensive due to the high miss ratios that are likely to be associated with such information. In CENS78, there is some attempt at a quantitative analysis of these performance problems.

Another problem is that I/O overruns may occur. Specifically, an I/O data stream may be delayed while directory operations take place. In the meantime, some I/O data are lost. Care must be taken to avoid this problem. Either substantial I/O buffering or write-through is clearly needed.

Other variants of method 4 are possible. (1) The purpose of the central directory is to minimize the queries to the caches of other CPUs. The central directory is not logically necessary; sufficient information exists in the individual caches. It is also possible to transmit information from cache to cache, without going through main memory. (2) If the number of main memory directory bits is felt to be too high, locking could be on a page instead of on a line basis. (3) Store-through may be used instead of copy-back; thus main memory always has a valid copy and data do not have to be fetched from the other caches, but can simply be invalidated in these other caches.

The IBM 3081D, which contains two CPUs, essentially uses the directory scheme described. The higher performance 3081K functions similarly, but passes the necessary information from cache to cache rather than going through main memory.

Another version of the directory method is called the broadcast search [DRIM81b]. In this case, a miss is sent not only to the main memory but to all caches. Whichever memories (cache or main) contain the desired information send it to the requesting processor.

Liu [LIU82] proposes a multicache scheme to minimize the overhead of directory operations. He suggests that all CPUs

have two caches, only one of which can contain shared data. The overhead of directory access and maintenance would thus only be incurred when the shared data cache is accessed.

There are two practical methods among the above alternatives: method 4 and the BIAS Filter Memory version of method 2. Method 4 is quite general, but is potentially very complex. It may also have performance problems. No detailed comparison exists, and other and better designs may yet remain to be discovered. For new machines, it is not known whether software control is better than hardware control; clearly, for existing architectures and software, hardware control is required.

2.8 Data/Instruction Cache

Two aspects of the cache having to do with its performance are cache bandwidth and access time. Both of these can be improved by splitting the cache into two parts, one for data and one for instructions. This doubles the bandwidth since the cache can now service two requests in the time it formerly required for one. In addition, the two requests served are generally complementary. Fast computers are pipelined, which means that several instructions are simultaneously in the process of being decoded and executed. Typically, there are several stages in a pipeline, including instruction fetch, instruction decode, operand address generation, operand fetch, execution, and transmission of the results to their destination (e.g., to a register). Therefore, while one instruction is being fetched (from the instruction cache), another can be having its operands fetched from the operand cache. In addition, the logic that arbitrates priority between instruction and data accesses to the cache can be simplified or eliminated.

A split instruction/data cache also provides access time advantages. The CPU of a high-speed computer typically contains (exclusive of the S-unit) more than 100,000 logic gates and is physically large. Further, the logic having to do with instruction fetch and decode has little to do with operand fetch and store, except for execute instructions and possibly for the targets of branches. With a single cache system, it is not always possible simultaneously to place the cache immediately adjacent to all of the logic which will access it. A split cache, on the other hand, can have each of its halves placed in the physical location which is most useful, thereby saving from a fraction of a nanosecond to several nanoseconds.

There are, of course, some problems introduced by the split cache organization. First, there is the problem of consistency. Two copies now exist of information which formerly existed only in one place. Specifically, instructions can be modified, and this modification must be reflected before the instructions are executed. Further, it is possible that even if the programs are not self-modifying, both data and instructions may coexist in the same line, either for reasons of storage efficiency or because of immediate operands. The solutions for this problem are the same as those discussed in the section on multicache consistency (Section 2.7), and they work here as well. It is imperative that they be implemented in such a way as to not impair the access time advantage given by this organization.

Another problem of this cache organization is that it results in inefficient use of the cache memory. The size of the working set of a program varies constantly, and in particular, the fraction devoted to data and to instructions also varies. (A dynamically split design is suggested by FAVR78.) If the instructions and data are not stored together, they must each exist within their own memory, and be unable to share a larger amount of that resource. The extent of this problem has been studied both experimentally and analytically. In SHED76, a set of formulas is provided which can be used to estimate the performance of the unified cache from the performance of the individual ones. The experimental results were not found to agree with the mathematical ones, although the reason was not investigated. We believe that the nonstationarity of the workload was the major problem.

We compared the miss ratio of the split cache to that of the unified cache for each of the sets of traces; some of the results appear in Figures 25-28. (See BELL74 and THAK78 for additional results.) We note that there are several possible ways to split and manage the cache, and the various alternatives have been explored.

Figure 25. Miss ratio versus memory capacity for unified cache, cache split equally between instruction and data halves, cache split according to static optimum partition between instruction and data halves, and miss ratios individually for instruction and data halves. (Traces CGO1...3, PGO2; Q = 10,000; 64 sets; 32 bytes/line.)

Figure 26. Miss ratio versus memory capacity for instruction/data cache and for unified cache. (Traces CGO1...3, PGO2; Q = 10,000; 64 sets; 32 bytes/line; no duplicates.)

Figure 27. Miss ratio versus memory capacity for instruction/data cache and for unified cache. (Traces FGO1...4; Q = 10,000; 64 sets; 32 bytes/line.)

Figure 28. Miss ratio versus memory capacity for instruction/data cache and for unified cache. (Traces FGO1...4; Q = 10,000; 64 sets; 32 bytes/line; no duplicates.)

One could split the cache in two equal parts (labeled "SPLIT EQUAL"), or the observed miss ratios could be used (from this particular run) to determine the optimal static split ("SPLIT OPT"). Also, when a line is found to be in the side of the cache other than the one currently accessed, it could either be duplicated in the remaining side of the cache or it could be moved; this latter case is labeled "NO DUPLICATES." (No special label is shown if duplicates are permitted.) The results bearing on these distinctions are given in Figures 25-28.

We observe in Figures 25 and 26 that the unified cache, the equally split cache, and the cache split unequally (split optimally) all perform about equally well with respect to miss ratio. Note that the miss ratios for the instruction and data halves of the cache individually are also shown. Further, comparing Figures 25 and 26, it seems that barring duplicate lines has only a small negative effect on the miss ratio.

In sharp contrast to the measurements of Figures 25 and 26 are those of Figures 27 and 28. In Figure 27, although the equally and optimally split caches are comparable, the unified cache is significantly better. The unified cache is better by an order of magnitude when duplicates are not permitted (Figure 28), because the miss ratio is sharply increased by the constant movement of lines between the two halves. It appears that lines sharing instructions and data are very common in programs compiled with IBM's FORTRAN G compiler and are not common in programs compiled using the IBM COBOL or PL/I compiler. (Results similar to the FORTRAN case have been found for the two other sets of IBM traces, all of which include FORTRAN code but are not shown.)

Based on these experimental results, we can say that the miss ratio may increase significantly if the caches are split, but that the effect depends greatly on the workload. Presumably, compilers can be designed to minimize this effect by ensuring that data and instructions are in separate lines, and perhaps even in separate pages.

Despite the possible miss ratio penalty of splitting the cache, there are at least two experimental machines and two commercial ones which do so. The S-1 [HAIL79, McWI77] at Lawrence Livermore Laboratory is being built with just such a cache; it relies on (new) software to minimize the problems discussed here. The 801 minicomputer, built at IBM Research (Yorktown Heights) [ELEC76, RADI82], also has a split cache. The Hitachi H200 and Itel AS/6 [ROSS79] both have a split data/instruction cache. No measurements have been publicly reported for any of these machines.

2.9 Virtual Address Cache

Most cache memories address the cache using the real address (see Figure 2). As the reader recalls, we discussed (Introduction, Section 2.3) the fact that the virtual address was translated by the TLB to the real address, and that the line lookup and readout could not be completed until the real address was available. This suggests that the cache access time could be significantly reduced if the translation step could be eliminated. The way to do this is to address the cache directly with the virtual address. We call a cache organized this way a virtual address cache. The MU-5 [IBBE77] uses this organization for its name store. The S-1, the IBM 801, and the ICL 2900 series machines also use this idea. It is discussed in BEDE79. See also OLBE79.

There are some additional considerations in building a virtual address cache, and there is one serious problem. First, all addresses must be tagged with the identifier of the address space with which they are associated, or else the cache must be purged every time task switching occurs. Tagging is not a problem, but the address tag in the cache must be extended to include the address space ID. Second, the translation mechanism must still exist and must still be efficient, since virtual addresses must be translated to real addresses whenever main memory is accessed, specifically for misses and for writes in a write-through cache. Thus the TLB cannot be eliminated.

The most serious problem is that of "synonyms," two or more virtual addresses that map to the same real address. Synonyms occur whenever two address spaces share code or data. (Since the lines have address space tags, the virtual addresses are different even if the line occurs in the same place in both address spaces.) Also, the supervisor may exist in the address space of each user, and it is important that only one copy of supervisor tables be kept. The only way to detect synonyms is to take the virtual address, map it into a real address, and then see if any other virtual addresses in the cache map into the same real address. For this to be feasible, the inverse mapping must be available for every line in the cache; this inverse mapping is accessed on real address and indicates all virtual addresses associated with that real address. Since this inverse mapping is the opposite of the TLB, we choose to call the inverse mapping buffer (if a separate one is used) the RTB or reverse translation buffer. When a miss occurs, the virtual address is translated into the real address by the TLB. The access to main memory for the miss is overlapped with a similar search of the RTB to see if that line is already in the cache under a different name (virtual address). If it is, it must be renamed and moved to its new location, since multiple copies of the same line in the cache are clearly undesirable for reasons of consistency.

The severity of the synonym problem can be decreased if shared information can be forced to have the same location in all address spaces. Such information can be given a unique address space identifier, and the lookup algorithm always considers such a tag to match the current address space ID. A scheme like this is feasible only for the supervisor since other shared code could not conceivably be so allocated. Shared supervisor code does have a unique location in all address spaces in IBM's MVS operating system.

The RTB may or may not be a simple structure, depending on the structure of the rest of the cache. In one case it is fairly simple: if the bits used to select the set of the cache are the same for the real and virtual address (i.e., if none of the bits used to select the set undergo translation), the RTB can be implemented by associating with each cache line two address tags [BEDE79]. If a match is not found on the virtual address, then a search is made in that set on the real address. If that search finds the line, then the virtual address tag is changed to the current virtual address. A more complex design would involve a separate mapping buffer for the reverse translation.
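A minimal sketch of the two-tag arrangement follows. It is ours and illustrative only: it is direct-mapped for brevity, it assumes (as the text requires for this simple form) that none of the set-selection bits undergo translation, and the TLB lookup that produces the real tag is not shown. A reserved address-space identifier marks supervisor lines mapped identically in every address space:

#include <stdbool.h>
#include <stdint.h>

#define SETS        256
#define SHARED_ASID 0xFFFFu     /* unique ID matching every address space */

typedef struct {
    bool     valid;
    uint16_t asid;              /* address space identifier */
    uint32_t vtag, rtag;        /* virtual and real address tags */
} vline;

/* Returns true on a hit; renames the line if found under a synonym. */
static bool va_cache_lookup(vline cache[SETS], unsigned set,
                            uint32_t vtag, uint16_t asid, uint32_t rtag) {
    vline *l = &cache[set];
    if (l->valid && l->vtag == vtag &&
        (l->asid == asid || l->asid == SHARED_ASID))
        return true;                       /* hit on the virtual tag     */
    if (l->valid && l->rtag == rtag) {     /* synonym found via real tag */
        l->vtag = vtag;                    /* rename to the current      */
        l->asid = asid;                    /*   virtual address and ASID */
        return true;
    }
    return false;                          /* genuine miss */
}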
and for writes in a write-through cache. Presumably, compilers can be designed to Thus the TLB cannot be eliminated. minimize this effect by ensuring that data The most serious problem is that of and instructions are in separate lines, and "synonyms," two or more virtual addresses perhaps even in separate pages. that map to the same real address. Syn- Despite the possible miss ratio penalty of onyms occur whenever two address spaces splitting the cache, there are at least two share code or data. (Since the lines have experimental machines and two commer- address space tags, the virtual addresses cial ones which do so. The S-1 [HAIL79, are different even if the line occurs in the McWI77] at Lawrence Livermore Labora- same place in both address spaces.) Also, tory is being built with just such a cache; it the supervisor may exist in the address relies on (new) software to minimize the space of each user, and it is important that Computing Surveys, Vol. 14, No. 3, September 1982 Cache Memories • 511 only one copy of supervisor tables be kept. 2.10 User/Supervisor Cache The only way to detect synonyms is to take It was suggested earlier that a significant the virtual address, map it into a real ad- fraction of the miss ratio is due to task- dress, and then see if any other virtual switching. A possible solution to this prob- addresses in the cache map into the same lem is to use a cache which has been split real address. For this to be feasible, the into two parts, one of which is used only by inverse mapping must be available for every the supervisor and the other of which is line in the cache; this inverse mapping is used primarily by the user state programs. accessed on real address and indicates all If the scheduler were programmed to re- virtual addresses associated with that real start, when possible, the same user program address. Since this inverse mapping is the that was running before an , then opposite of the TLB, we choose to call the the user state miss ratio would drop appre- inverse mapping buffer (if a separate one is ciably. Further, if the same recur used) the RTB or reverse translation frequently, the supervisor state miss ratio buffer. When a miss occurs, the virtual ad- may also drop. In particular, neither the dress is translated into the real address by user nor the supervisor would purge the the TLB. The access to main memory for cache of the other's lines. (See PEUT77 for the miss is overlapped with a similar search some data relevant to this problem.) The of the RTB to see if that line is already in supervisor cache may have a high miss ratio the cache under a different name (virtual in any case due to its large working set. address). If it is, it must be renamed and (See MILA75 for an example.) moved to its new location, since multiple Despite the appealing rationale of the copies of the same line in the cache are above comments, there are a number of clearly undesirable for reasons of consist- problems with the user/supervisor cache. ency. First, it is actually unlikely to cut down the The severity of the synonym problem can miss ratio. Most misses occur in supervisor be decreased if shared information can be state [MILA75] and a supervisor cache half forced to have the same location in all as large as the unified cache is likely to be address spaces. Such information can be worse since the maximum cache capacity is given a unique address space identifier, and no longer available to the supervisor. 
2.10 User/Supervisor Cache

It was suggested earlier that a significant fraction of the miss ratio is due to task-switching. A possible solution to this problem is to use a cache which has been split into two parts, one of which is used only by the supervisor and the other of which is used primarily by the user state programs. If the scheduler were programmed to restart, when possible, the same user program that was running before an interrupt, then the user state miss ratio would drop appreciably. Further, if the same interrupts recur frequently, the supervisor state miss ratio may also drop. In particular, neither the user nor the supervisor would purge the cache of the other's lines. (See PEUT77 for some data relevant to this problem.) The supervisor cache may have a high miss ratio in any case due to its large working set. (See MILA75 for an example.)

Despite the appealing rationale of the above comments, there are a number of problems with the user/supervisor cache. First, it is actually unlikely to cut down the miss ratio. Most misses occur in supervisor state [MILA75], and a supervisor cache half as large as the unified cache is likely to be worse, since the maximum cache capacity is no longer available to the supervisor. Further, it is not clear what fraction of the time the scheduling algorithm can restart the same program. Second, the information used by the user and the supervisor is not entirely distinct, and cross-access must be permitted. This overlap introduces the problem of consistency.

We are aware of only one evaluation of the split user/supervisor cache [ROSS80]. In that case, an experiment was run on a Hitachi M180. The results seemed to show that the split cache performed about as well as a unified one, but poor experimental design makes the results questionable. We do not expect that a split cache will prove to be useful.

2.11 Input/Output through the Cache

In Section 2.7, the problem of multicache consistency was discussed. We noted that if all accesses to main memory use the same cache, then there would be no consistency problem. Precisely this approach has been used in one manufacturer's computers (Amdahl Corporation).
The cache size is usually dictated (1) the cache is already processing one or by a number of criteria having to do with more misses and cannot process the current the cost and performance of the machine. {additional) one quickly enough; (2) more The cache should not be so large that it than one I/O transfer is in progress, and represents an expense out of proportion to more active (in use) lines map into one set the added performance, nor should it oc- than the set size can accommodate; or (3) cupy an unreasonable fraction of the phys- the cache bandwidth is not adequate to ical space within the processor. A very large handle the current burst of simultaneous cache may also require more access cir- I/O from several devices. Overruns can be cuitry, which may increase access time. minimized if the set size of the cache is Aside from the warnings given in the largeenough, the bandwidth is high paragraph above, one can generally assume enough, and the ability to process misses is that the larger the cache, the higher the hit sufficient. Sufficient buffering should also ratio, and therefore the better the perform- be provided in the I/O paths to the devices. ance. The issue is then one of the relation between cache size and hit ratio. This is a very difficult problem, since the cache hit 2.1 1.2 Miss Ratio ratio varies with the workload and the ma- Directing the input/output data streams chine architecture. A cache that might yield through the cache also has an effect on the a 99.8 percent hit ratio on a PDP-11 pro- miss ratio. This I/O data occupies some gram could result in a 90 percent or lower fraction of the space in the cache, and this hit ratio for IBM (MVS) supervisor state increases the miss ratio for the other users code. This problem cannot be usefully stud- of the cache. Some experiments along these ied using trace-driven simulation because lines were run by the author and results are the miss ratio varies tremendously from shown in Figures 29-32. IORATE is the program to program and only a small num- ratio of the rate of I/O accesses to the cache ber of traces can possibly be analyzed. Typ- to the rate of CPU accesses. (I/O activity ical trace-driven simulation results appear is simulated by a purely sequential syn- throughout this paper, however, and the thetic address stream referencing a distinct reader may wish to scan that data for in- address space from the other programs.) sight. There is also a variety of data avail- The miss ratio as a function of memory size able in the literature and the reader may and I/O transfer rate is shown in Figures wish to inspect the results presented in 29 and 30 for two of the sets of traces. The ALSA78, BELL74, BERG76, GIBS67, LEE69,

Figure 29. Miss ratio versus memory capacity while I/O occurs through the cache at the specified rate. IORATE refers to the fraction of all memory references due to I/O.

Figure 30. Miss ratio versus memory capacity while I/O occurs through the cache.

Figure 31. Miss ratio versus I/O rate for a variety of memory capacities.

Figure 32. Increase in miss ratio versus I/O rate for a variety of memory capacities.

Figure 33. Miss ratio versus cache size (in kilobytes), classified by set size and user/supervisor state; each plotted digit equals the set size. Data gathered by hardware monitor from a machine running a standard benchmark.

2.12 Cache Size

Two very important questions when selecting a cache design are how large the cache should be and what kind of performance we can expect. The cache size is usually dictated by a number of criteria having to do with the cost and performance of the machine. The cache should not be so large that it represents an expense out of proportion to the added performance, nor should it occupy an unreasonable fraction of the physical space within the processor. A very large cache may also require more access circuitry, which may increase access time.

Aside from the warnings given in the paragraph above, one can generally assume that the larger the cache, the higher the hit ratio, and therefore the better the performance. The issue is then one of the relation between cache size and hit ratio. This is a very difficult problem, since the cache hit ratio varies with the workload and the machine architecture. A cache that might yield a 99.8 percent hit ratio on a PDP-11 program could result in a 90 percent or lower hit ratio for IBM (MVS) supervisor state code. This problem cannot be usefully studied using trace-driven simulation, because the miss ratio varies tremendously from program to program and only a small number of traces can possibly be analyzed. Typical trace-driven simulation results appear throughout this paper, however, and the reader may wish to scan that data for insight. There is also a variety of data available in the literature, and the reader may wish to inspect the results presented in ALSA78, BELL74, BERG76, GIBS67, LEE69, MEAD70, STRE76, THAK78, and YUVA75. Some direct measurements of cache miss ratios appear in CLAR81 and CLAR82. In the former, the Dorado was found to have hit ratios above 99 percent. In the latter, the VAX 11/780 was found to have hit ratios around 90 percent.

Another problem with trace-driven simulation is that in general user state programs are the only ones for which many traces exist. In IBM MVS systems, the supervisor typically uses 25 to 60 percent of the CPU time, and it provides by far the largest component of the miss ratio [MILA75]. User programs generally have very low miss ratios, in our experience, and many of those misses come from task switching.

Two models have been proposed in the literature for memory hierarchy miss ratios. Saltzer [SALT74] suggested, based on his data, that the mean time between faults was linearly related to the capacity of the memory considered. But later results, taken on the same system [GREE74], contradict Saltzer's earlier findings. CHOW75 and CHOW76 suggest that the miss ratio curve is of the form m = a(c^b), where a and b are constants, m is the miss ratio, and c is the memory capacity. They show no experimental results to substantiate this model, and it seems to have been chosen for mathematical convenience.

Actual cache miss ratios, from real machines running "typical" workloads, are the most useful source of good measurement data. In Figure 33 we show a set of such measurements taken from Amdahl 470 computers running a standard Amdahl internal benchmark. This data is reproduced from HARD80a. Each digit represents a measurement point, and shows either the supervisor or problem state miss ratio for a specific cache size and set size; the value of the digit at each point is the set size. Examination of the form of the measurements in Figure 33 suggests that the miss ratio can be approximated over the range shown by an equation of the form m = a(k^b) (consistent with CHOW75 and CHOW76), where m is the miss ratio, a and b are constants (b < 0), and k is the cache capacity in kilobytes.
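As an illustration, the fitted function can be evaluated directly. The constants a and b in this C sketch are placeholders; the measured values must be read from Figure 33 for each state and set size.

    /* The fitted miss ratio model in numeric form: m = a * k^b, b < 0.
     * The constants used below are hypothetical. */
    #include <math.h>
    #include <stdio.h>

    double miss_ratio(double a, double b, double kbytes)
    {
        return a * pow(kbytes, b);   /* m = a(k^b) */
    }

    int main(void)
    {
        double a = 0.25, b = -0.5;   /* illustrative constants only */
        for (int k = 16; k <= 64; k *= 2)
            printf("%2d kbytes: m = %.4f\n", k, miss_ratio(a, b, k));
        return 0;
    }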

The values of a and b are shown for four cases in Figure 33: supervisor and user state for a set size of two, and supervisor and user state for all other set sizes. We make no claims for the validity of this function for other workloads and/or other architectures, nor for cache sizes beyond the range shown. From Figure 33 it is evident, though, that the supervisor contributes the largest fraction of the miss ratio, and the supervisor state measurements are quite consistent. Within the range indicated, therefore, these figures can probably be used as a first approximation in estimating the performance of a cache design.

Typical cache sizes in use include 128 kbytes (NEC ACOS 9000), 64 kbytes (Amdahl 470V/8, IBM 3033, IBM 3081K per CPU), 32 kbytes (IBM 370/168-3, IBM 3081D per CPU, Amdahl 470V/7, Magnuson M80/43), 16 kbytes (Amdahl 470V/6, Itel AS/6, IBM 4341, Magnuson M80/42 and M80/44, DEC VAX 11/780), 8 kbytes (Honeywell 66/60 and 66/80, Xerox Dorado [CLAR81]), 4 kbytes (VAX 11/750, IBM 4331), and 1 kbyte (PDP-11/70).

2.13 Cache Bandwidth, Data Path Width, and Access Resolution

2.13.1 Bandwidth

For adequate performance, the cache bandwidth must be sufficient. Bandwidth refers to the aggregate data transfer rate, and is equal to the data path width divided by the access time. The bandwidth is important as well as the access time because (1) there may be several sources of requests for cache access (instruction fetch, operand fetch and store, channel activity, etc.), and (2) some requests may be for a large number of bytes. If there are other sources of requests for cache cycles, such as prefetch lookups and transfers, it must be possible to accommodate these as well.

In determining what constitutes an adequate data transfer rate, it is not sufficient that the cache bandwidth exceed the average demand placed on it by a small amount. It is important as well to avoid contention, since if the cache is busy for a cycle and one or more requests are blocked, these blocked requests can result in permanently wasted machine cycles. In the Amdahl 470V/8 and the IBM 3033, the cache bandwidth appears to exceed the average data rate by a factor of two to three, which is probably the minimum sufficient margin. We note that in the 470V/6 (when prefetch is used experimentally) prefetches are executed only during otherwise idle cycles, and it has been observed that not all of the prefetches are actually performed. (Newly arrived prefetch requests take the place of previously queued but never performed requests.)

For some instructions, the cache bandwidth can be extremely important. This is particularly the case for data movement instructions such as: (1) instructions which load or unload all of the registers (e.g., IBM 370 instructions STM, LM); (2) instructions which move long character strings (MVC, MVCL); and (3) instructions which operate on long character strings (e.g., CLC, OC, NC, XC). In these cases, especially the first two, there is little if any processing to be done; the question is simply one of physical data movement, and it is important that the cache data path be as wide as possible--in large machines, 8 bytes (3033, 3081, Itel AS/6) instead of 4 (470V/6, V/7, V/8); in small machines, 4 bytes (VAX 11/780) instead of 2 (PDP-11/70).

It is important to note that cache data path width is expensive. Doubling the path width means doubling the number of lines into and out of the cache (i.e., the bus widths) and all of the associated circuitry. This frequently implies some small increase in access time, due to larger physical packaging and/or additional levels of gate delay. Therefore, both the cost and performance aspects of cache bandwidth must be considered during the design process.

Another approach to increasing cache bandwidth is to interleave the cache [DRIS80, YAMO80]. If the cache is required to serve a large number of small requests very quickly, it may be efficient to replicate the cache (e.g., two or four times) and access each copy separately, depending on the low-order bits of the desired locations. This approach is very expensive and, to our knowledge, has not been used on any existing machine. (See POHM75 for some additional comments.)
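A sketch of the bank-selection logic implied by such an interleaved design follows; the number of banks and the line size are illustrative assumptions.

    /* Steering a request to a cache bank by the low-order bits of its
     * line address, so that independent small requests can proceed in
     * parallel.  Constants are illustrative. */
    #include <stdint.h>

    #define BANKS      4            /* e.g., two or four replications */
    #define LINE_SIZE 32

    static unsigned bank_of(uint32_t addr)
    {
        return (addr / LINE_SIZE) % BANKS;   /* low-order line bits */
    }

    /* Requests to different banks may be serviced in the same cycle;
     * two requests mapping to the same bank must be serialized. */
    int conflict(uint32_t addr1, uint32_t addr2)
    {
        return bank_of(addr1) == bank_of(addr2);
    }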


2.13.2 Priority Arbitration

An issue related to cache bandwidth is what to do when the cache has several requests competing for cache cycles and only one can be served at a given time. There are two criteria for making the choice: (1) give priority to any request that is "deadline scheduled" (e.g., an I/O request that would otherwise abort); and (2) give priority (after (1)) to requests in order to enhance machine performance. The second criterion may be sacrificed for implementation convenience, since the optimal scheduling of cache accesses may introduce unreasonable complexity into the cache design. Typically, fixed priorities are assigned to competing cache requests, but dynamic scheduling, though complex, is possible [BLOC80].

As an illustration of cache priority resolution, we consider two large, high-speed computers: the Amdahl 470V/7 and the IBM 3033. In the Amdahl machine, there are five "ports," or address registers, which hold the addresses for cache requests. Thus, there can be up to five requests queued for access. These ports are the operand port, the instruction port, the channel port, the translate port, and the prefetch port. The first three are used, respectively, for operand store and fetch, instruction fetch, and channel I/O (since channels use the cache also). The translate port is used in conjunction with the TLB and translator to perform virtual to real address translation. The prefetch port is for a number of special functions, such as setting the storage key or purging the TLB, and for prefetch operations. There are sixteen priorities for the 470V/6 cache; we list the important ones here in decreasing order of access priority: (1) move line in from main storage, (2) operand store, (3) channel store, (4) fetch second half of double word request, (5) move line out from cache to main memory, (6) translate, (7) channel fetch, (8) operand fetch, (9) instruction fetch, (10) prefetch.

The IBM 3033 has a similar list of cache access priorities [IBM78]: (1) main memory fetch transfer, (2) invalidate line in cache modified by channel or other CPU, (3) search for line modified by channel or other CPU, (4) buffer reset, (5) translate, (6) redo (some cache accesses are blocked and have to be restarted), and (7) normal instruction or operand access.
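A fixed-priority arbiter of the kind described can be sketched in a few lines of C. The request encoding below follows the 470V/6 list above, but the code itself is purely illustrative.

    /* Fixed-priority arbitration: the pending request with the
     * lowest-numbered (highest) priority wins the next cache cycle. */
    enum req {                 /* in decreasing order of priority */
        MOVE_LINE_IN, OPERAND_STORE, CHANNEL_STORE, FETCH_2ND_HALF,
        MOVE_LINE_OUT, TRANSLATE, CHANNEL_FETCH, OPERAND_FETCH,
        INSTRUCTION_FETCH, PREFETCH, N_REQ
    };

    /* pending[i] is nonzero if a request of type i is waiting */
    int grant(const int pending[N_REQ])
    {
        for (int i = 0; i < N_REQ; i++)
            if (pending[i])
                return i;      /* highest-priority pending request */
        return -1;             /* cache is idle this cycle */
    }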
2.14 Multilevel Cache

The largest existing caches (to our knowledge) can be found in the NEC ACOS 9000 (128 kbytes), and in the Amdahl 470V/8 and IBM 3033 processors (64 kbytes). Such large caches pose two problems: (1) their physical size and logical complexity increase the access time, and (2) they are very expensive. The cost of the chips in the cache can be a significant fraction (5-20 percent) of the parts cost of the CPU. The reason for the large cache, though, is to decrease the miss ratio. A possible solution to this problem is to build a two-level cache, in which the smaller, faster level is on the order of 4 kbytes and the larger, slower level is on the order of 64-512 kbytes. In this way, misses from the small cache could be satisfied, not in the six to twelve machine cycles commonly required, but in two to four cycles. Although the miss ratio from the small cache would be fairly high, the improved cycle time and decreased miss penalty would yield an overall improvement in performance. Suggestions to this effect may be found in BENN82, OHNO77, and SPAR78. It has also been suggested for the TLB [NGAI82].

As might be expected, the two-level or multilevel cache is not necessarily desirable. We suggested above that misses from the fast cache to the slow cache could be serviced quickly, but detailed engineering studies are required to determine if this is possible. The five-to-one or ten-to-one ratio of main memory to cache memory access times is not wide enough to allow another level to be easily placed between them. Expense is another consideration. A two-level cache implies another level of access circuitry, with all of the attendant complications. Also, the large amount of storage in the second level, while cheaper per bit than the low-level cache, is not inexpensive on the whole. The two-level or multilevel cache represents a possible approach to the problem of an overlarge single-level cache, but further study is needed.
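The performance argument can be made concrete with a simple effective-access-time calculation. All times and miss ratios in this C sketch are hypothetical, chosen only to illustrate how a small, fast first level can win despite its higher miss ratio.

    /* Effective access time: t1 + m1*(t2 + m2*tmain).
     * For a single level, set t2 = 0 and m2 = 1. */
    #include <stdio.h>

    double t_eff(double t1, double m1, double t2, double m2, double tmain)
    {
        return t1 + m1 * (t2 + m2 * tmain);
    }

    int main(void)
    {
        /* one large (64K) cache: slower access, low miss ratio */
        printf("one level : %.1f ns\n", t_eff(60.0, 0.02, 0.0, 1.0, 400.0));
        /* hypothetical 4K first level backed by a 64K second level */
        printf("two levels: %.1f ns\n", t_eff(40.0, 0.08, 120.0, 0.25, 400.0));
        return 0;
    }

With these (illustrative) numbers, the single level averages 68.0 ns per reference and the two-level organization 57.6 ns; whether real designs achieve such a margin is exactly the engineering question raised above.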


2.15 Pipelining

Referencing a cache memory is a multistep process. There is the need to obtain priority for a cache cycle. Then the TLB and the desired set are accessed in parallel. After this, the real address is used to select the correct line from the set, and finally, after the information is read out, the replacement bits are updated. In large, high-speed machines, it is common to pipeline the cache, as well as the rest of the CPU, so that more than one cache access can be in progress at the same time. This pipelining is of various degrees of sophistication, and we illustrate it by discussing two machines: the Amdahl 470V/7 and the IBM 3033.

In the 470V/7, a complete read requires four cycles, known as the P, B1, B2, and R cycles [SMIT78b]. The P (priority) cycle is used to determine which of several possible competing sources of requests to the cache will be permitted to use the next cycle. The B1 and B2 (buffer 1, buffer 2) cycles are used actually to access the cache and the TLB, to select the appropriate line from the cache, to check that the contents of the line are valid, and to shift to get the desired byte location out of the two-word (8-byte) segment fetched. The data are available at the end of the B2 cycle. The R cycle is used for "cleanup," and the replacement status is updated at that time. It is possible to have up to four fetches active at any one time in the cache, one in each of the four cycles mentioned above. The time required by a store is longer, since it is essentially a read followed by a modify and write-back; it takes six cycles all together, and one store requires two successive cycles in the cache pipeline.

The pipeline in the 3033 cache is similar [IBM78]. The cache in the 3033 can service one fetch or store in each machine cycle, where the turnaround time from the initial request for priority until the data is available is about 2½ cycles (½-cycle transmission time to the S-unit, 1½ cycles in the S-unit, and ½ cycle to return the data to the instruction unit). An important feature of the 3033 is that cache accesses do not have to be performed in the order in which they are issued. In particular, if an access causes a miss, it can be held up while the miss is serviced, and at the same time other requests which are behind it in the pipeline can proceed. There is an elaborate mechanism built in which prevents this out-of-order operation from producing incorrect results.
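The following toy model in C illustrates how a four-stage organization of this kind overlaps accesses: one request may occupy each stage, so up to four fetches are in progress at once. It is purely illustrative and models none of the machines' control logic.

    /* Toy four-stage cache pipeline (P, B1, B2, R): one request per
     * stage, advanced once per machine cycle. */
    enum stage { P, B1, B2, R, NSTAGE };

    struct fetch { int id; };

    struct fetch *pipe[NSTAGE];     /* one request per stage */

    /* Advance one cycle; returns the completed fetch leaving the
     * R stage, if any.  newreq is the request granted priority. */
    struct fetch *cycle(struct fetch *newreq)
    {
        struct fetch *done = pipe[R];
        for (int s = R; s > P; s--)
            pipe[s] = pipe[s - 1];  /* each request moves up one stage */
        pipe[P] = newreq;
        return done;
    }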
2.16 Translation Lookaside Buffer

The translation lookaside buffer (also called the translation buffer [DEC78], the associative memory [SCHR71], and the directory lookaside table [IBM78]) is a small, high-speed buffer which maintains the mapping between recently used virtual and real memory addresses (see Figure 2). The TLB performs an essential function, since otherwise an address translation would require two additional memory references: one each to the segment and page tables. In most machines, the cache is accessed using real addresses, and so the design and implementation of the TLB is intimately related to the cache memory. Additional information relevant to TLB design and operation may be found in JONE77b, LUDL77, RAMA81, SATY81, SCHR71, and WILK71. Discussions of the use of TLBs (TLB chips or units) in microcomputers can be found in JOHN81, STEV81, and ZOLN81.

The TLB itself is typically designed to look like a small set-associative memory. For example, the 3033 TLB (called the DLAT, or directory lookaside table) is set-associative, with 64 sets of two elements each. Similarly, the Amdahl 470V/6 uses 128 sets of two elements each, and the 470V/7 and V/8 have 256 sets of two elements each. The IBM 3081 TLB has 128 entries.

The TLB differs in some ways from the cache in its design. First, for most processes, address spaces start at zero and extend upward as far as necessary. Since the TLB translates page addresses from virtual to real, only the high-order (page number) address bits can be used to access the TLB. If the same method were used as that used for accessing the cache (bit selection using low-order bits), the low-order TLB entries would be used disproportionately and therefore the TLB would be used inefficiently. For this reason, both the 3033 and the 470 hash the address before accessing the TLB (see Figure 2).

Consider the 24-bit address used in the System/370, with the bits numbered from 1 to 24 (high order to low order). Bits 13 to 24 address the byte within the page (4096-byte page), and the remaining bits (1 to 12) can be used to access the TLB. The 3033 computes a 6-bit index into the TLB as follows. Let @ be the Exclusive OR operator; a 6-bit quantity is computed as [7, (8 @ 2), (9 @ 3), (10 @ 4), (11 @ 5), (12 @ 6)], where each number refers to the input bit it designates.

The Amdahl 470V/7 and 470V/8 use a different hashing algorithm, one which provides much more thorough randomization at the cost of significantly greater complexity. To explain the hashing algorithm, we first explain some other items. The 470 associates with each address space an 8-bit tag field called the address space identifier (ASID) (see Section 2.19.1). We refer to the bits that make up this tag field as S1, ..., S8. These bits are used in the hashing algorithm as shown below. Also, the 470V/7 uses a different algorithm to hash into each of the two elements of a set; the TLB is more like a pair of direct mapping buffers than a set-associative buffer. The first half is addressed using 8 bits calculated as follows: [(6 @ 1 @ S8), 7, (8 @ 3 @ S6), 9, (10 @ S4), 11, (12 @ S2), 5]; the second half is addressed as [6, (7 @ 2 @ S7), 8, (9 @ 4 @ S5), 10, (11 @ S3), 12, (5 @ S1)]. There are no published studies that indicate whether the algorithm used in the 3033 is sufficient or whether the extra complexity of the 470V/7 algorithm is warranted.
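The 3033 recipe can be written out directly. The following C function follows the bit list above literally, with bit 1 taken as the high-order bit of a 24-bit address; only the function and helper names are our own.

    /* The 3033 DLAT hash: a 6-bit set index computed as
     * [7, (8@2), (9@3), (10@4), (11@5), (12@6)]. */
    #include <stdint.h>

    static unsigned bit(uint32_t addr, int n)
    {
        return (addr >> (24 - n)) & 1;   /* bit 1 = high-order bit */
    }

    unsigned dlat_index(uint32_t addr)
    {
        unsigned ix = bit(addr, 7);
        ix = (ix << 1) | (bit(addr, 8)  ^ bit(addr, 2));
        ix = (ix << 1) | (bit(addr, 9)  ^ bit(addr, 3));
        ix = (ix << 1) | (bit(addr, 10) ^ bit(addr, 4));
        ix = (ix << 1) | (bit(addr, 11) ^ bit(addr, 5));
        ix = (ix << 1) | (bit(addr, 12) ^ bit(addr, 6));
        return ix;                       /* 0..63: one of 64 sets */
    }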
for more than one process can be in the The cache on the DEC VAX 11/780 TLB at one time. A protection field (in 370- [DEC78] is similar to but simpler than that type machines) is also included in the TLB in the IBM and Amdahl machines. A set- and is checked to make sure that the access associative TLB (called the translation is permissible. (Since keys are associated buffer) is used, with 64 sets of 2 entries on a page basis in the 370, this is much each. (The VAX 11/750 has 256 sets of 2 more efficient than placing the key with entries each.) The set is selected by using each line in the cache.) The real address the high-order address bit and the five low- corresponding to the virtual address is the order bits of the page address, so the ad- primary output of the TLB and occupies a dress need not be hashed at all. Since the

Since the high-order address bit separates the user from the supervisor address space, the user and supervisor TLB entries never map into the same locations. This is convenient because the user part of the TLB is purged on a task switch. (There are no address space IDs.) The TLB is also used to hold the dirty bit, which indicates whether the page has been modified, and the protection key.

Published figures for TLB performance are not generally available. The observed miss ratio for the Amdahl 470V/6 TLB is about 0.3 to 0.4 percent (private communication: W. J. Harding). Simulations of the VAX 11/780 TLB [SATY81] show miss ratios of 0.1 to 2 percent for TLB sizes of 64 to 256 entries.

2.17 Translator

When a virtual address must be translated into a real address and the translation does not already exist in the TLB, the translator must be used. The translator obtains the base of the segment table from the appropriate place (e.g., control register 1 in 370 machines), adds the segment number from the virtual address to obtain the page table address, and then adds the page number (from the virtual address) to the page table address to get the real page address. This real address is passed along to the cache so that the access can be made, and simultaneously the virtual address/real address pair is entered in the TLB. The translator is basically an adder which knows what to add.

It is important to note that the translator requires access to the segment and page table entries, and these entries may be either in the cache or in main memory. Provision must be made for the translator accesses to proceed unimpeded, independent of whether the target addresses are cache or main memory resident.

We also observe another problem related to translation: "page crossers." The target of a fetch or store may cross from one page to another, in a similar way as for "line crossers." The problem here is considerably more complicated than that of line crossers since, although the virtual addresses are contiguous, the real addresses may not be. Therefore, when a page crosser occurs, two separate translations are required; these may occur in the TLB and/or translator as the occasion demands.
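A sketch of the table walk just described is given below in C. The table layouts, field widths (64K segments, 4K pages, 24-bit addresses), and helper functions are hypothetical simplifications; a real 370 translator also checks table lengths, invalid bits, and protection.

    /* Two-step table walk: segment table entry -> page table,
     * page table entry -> real page address. */
    #include <stdint.h>

    extern uint32_t fetch_word(uint32_t real_addr);  /* cache or memory */
    extern uint32_t control_reg_1;       /* base of the segment table */

    uint32_t translate(uint32_t vaddr)
    {
        uint32_t seg  = vaddr >> 16;            /* 64K segments (assumed) */
        uint32_t page = (vaddr >> 12) & 0xF;    /* 4096-byte pages */
        uint32_t byte = vaddr & 0xFFF;

        /* segment table entry holds the page table address */
        uint32_t pt_base = fetch_word(control_reg_1 + 4 * seg);
        /* page table entry holds the real page address */
        uint32_t real_page = fetch_word(pt_base + 4 * page);

        return real_page + byte;   /* pair also entered into the TLB */
    }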
Figure 35. Diagram of a computer system in which caches are associated with memories rather than with processors.

2.18 Memory-Based Cache

It was stated at the beginning of this paper that caches are generally associated with the processor and not with the main memory. A different design would be to place the cache in the main memory itself. One way to do this is with a shared bus interfacing between one or more CPUs and several main memory modules, each with its own cache (see Figure 35).

There are two reasons for this approach. First, the access time at the memory module is decreased from the typical 200-500 nanoseconds (given high-density MOS RAM) to the 50-100 nanoseconds possible for a high-speed cache. Second, there is no consistency problem even though there are several CPUs. All accesses to data in memory module i go through cache i, and thus there is only one copy of a given piece of data.

Unfortunately, the advantages mentioned are not nearly sufficient to compensate for the shortcomings of this design. First, the design is too slow; with the cache on the far side of the memory bus, access time is not cut sufficiently. Second, it is too expensive; there is one cache per memory module. Third, if there are multiple CPUs, there will be memory bus contention. This slows down the system and causes memory access time to be highly variable. Overall, the scheme of associating the cache with the memory modules is very poor unless both the main memory and the processors are relatively slow. In that case, a large number of processors could be served by a small number of memory modules with built-in caches, over a fast bus.


2.19 Specialized Caches and Cache Components

This paper has been almost entirely concerned with the general-purpose cache memory found in most large, high-speed computers. There are other caches and buffers that can be used in such machines, and we briefly discuss them in this section.

2.19.1 Address Space Identifier Table

In many computers, the operating system identifier for an address space is quite long; in the IBM-compatible machines discussed (370/168, 3033, 470V), the identifier is the contents of control register 1. Therefore, these machines associate a much shorter tag with each address space for use in the TLB and/or the cache. This tag is assigned on a temporary basis by the hardware, and the correspondence between the address space and the tag is held in a hardware table which we name the Address Space Identifier Table (ASIT). It is also called the Segment Base Register Table in the 470V/7, the Segment Table Origin Address Stack in the 3033 [IBM78] and 370/168 [IBM75], and the Segment Base Register Stack in the 470V/6.

The 3033 ASIT has 32 entries, which are assigned starting at 1. When the table becomes full, all entries are purged and IDs are reassigned dynamically as address spaces are activated. (The TLB is also purged.) When a task switch occurs, the ASIT in the 3033 is searched starting at 1; when a match is found with control register 1, the index of that location becomes the address space identifier.

The 470V/6 has a somewhat more complex ASIT. The segment table origin address is hashed to provide an entry into the ASIT. The tag associated with that address is then read out. If the address space does not have a tag, a previously unused tag is assigned and placed in the ASIT. Whenever a new tag is assigned, a previously used tag is made available by deleting its entry in the ASIT and (in the background) purging all relevant entries in the TLB. (A complete TLB purge is not required.) Thirty-two valid tags are available, but the ASIT has the capability of holding up to 128 entries; thus, all 32 valid tags can usually be used, with little fear of hashing conflicts.

2.19.2 Execution Unit Buffers

In some machines, especially the IBM 360/91 [ANDE67b, IBM71, TOMA67], a number of buffers are placed internally in the execution unit to buffer the inputs and outputs of partially completed instructions. We refer the reader to the references just cited for a complete discussion of this.

2.19.3 Instruction Lookahead Buffers

In several machines, especially those without general-purpose caches, a buffer may be dedicated to lookahead buffering of instructions. Just such a scheme is used on the Cray I [CRAY76], the CDC 6600 [CDC74], the CDC 7600, and the IBM 360/91 [ANDE67a, BOLA67, IBM71]. These machines all have substantial buffers, and loops can be executed entirely within these buffers. Machines with general-purpose caches usually do not have much instruction lookahead buffering, although a few extra bytes are frequently fetched. See also BLAZ80 and KONE80.

2.19.4 Branch Target Buffer

One major impediment to high performance in pipelined computer systems is the existence of branches in the code. When a branch occurs, portions of the pipeline must be flushed and the correct instruction stream fetched. To minimize the effect of these disruptions, it is possible to implement a branch target buffer (BTB), which buffers the addresses of previous branches and their target addresses. The instruction fetch address is matched against the contents of the branch target buffer, and if a match occurs, the next instruction fetch takes place from the (previous) target of the branch. The BTB can correctly predict branch behavior more than 90 percent of the time [LEE82]. Something like a branch target buffer is used in the MU-5 [IBBE72, MORR79] and the S-1 [McWI77].
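A minimal sketch of such a branch target buffer follows; the direct-mapped organization and all sizes are our own assumptions, not a description of any machine cited above.

    /* A small table mapping addresses of previously taken branches
     * to their previous targets.  Organization is hypothetical. */
    #include <stdint.h>

    #define BTB_ENTRIES 64

    struct btb_entry {
        int      valid;
        uint32_t branch_addr;   /* address of the branch instruction */
        uint32_t target_addr;   /* its most recent target */
    } btb[BTB_ENTRIES];

    /* On a BTB match, predict fetching from the previous target;
     * otherwise predict sequential fetch. */
    uint32_t next_fetch(uint32_t fetch_addr, uint32_t seq_addr)
    {
        struct btb_entry *e = &btb[fetch_addr % BTB_ENTRIES];
        if (e->valid && e->branch_addr == fetch_addr)
            return e->target_addr;
        return seq_addr;
    }

    /* On a taken branch, record (or update) the branch/target pair. */
    void btb_update(uint32_t branch_addr, uint32_t target_addr)
    {
        struct btb_entry *e = &btb[branch_addr % BTB_ENTRIES];
        e->valid = 1;
        e->branch_addr = branch_addr;
        e->target_addr = target_addr;
    }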

2.19.5 Microcode Cache

Many modern computer systems are microcoded, and in some cases the amount of microcode is quite large. If the microcode is not stored in sufficiently fast storage, it is possible to build a special cache to buffer the microcode.

2.19.6 Buffer Invalidation Address Stack (BIAS)

The IBM 370/168 and 3033 both use a store-through mechanism in which any store to main memory causes the line affected to be invalidated in the caches of all processors other than the one which performed the store. Addresses of lines which are to be invalidated are kept in the buffer invalidation address stack (BIAS) in each processor, which is a small hardware-implemented queue inside the S-unit. The DEC VAX 11/780 functions in much the same way, although without a BIAS to queue requests. That is, invalidation requests in the VAX have high priority, and only one may be outstanding at a time.

2.19.7 Input/Output Buffers

As noted earlier, input/output streams must be aborted if the processor is not ready to accept or provide data when they are needed. For this reason, most machines have a few words of buffering in the I/O channels or I/O channel controller(s). This is the case in the 370/168 [IBM75] and the 3033 [IBM78].

2.19.8 Write-Through Buffers

In a write-through machine, it is important to buffer the writes so that the CPU does not become blocked waiting for previous writes to complete. In the IBM 3033 [IBM78], four such buffers, each holding a double word, are provided. The VAX 11/780 [DEC78], on the other hand, buffers one write. (Four buffers were recommended in SMIT79.)
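The following C sketch shows such a store buffer as a small FIFO queue, in the spirit of the 3033's four double-word buffers; the names and sizes are illustrative.

    /* Small write-through buffer: queued stores let the CPU proceed
     * while main memory completes the writes. */
    #include <stdint.h>

    #define NBUF 4

    struct wbuf { uint32_t addr; uint64_t data; } wbuf[NBUF];
    int head, count;

    /* returns 0 (CPU must wait) if all buffers are in use */
    int buffer_store(uint32_t addr, uint64_t data)
    {
        if (count == NBUF)
            return 0;
        wbuf[(head + count) % NBUF] = (struct wbuf){ addr, data };
        count++;
        return 1;
    }

    /* called when main memory completes the oldest write */
    void write_done(void)
    {
        head = (head + 1) % NBUF;
        count--;
    }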
2.19.9 Register Cache

It has been suggested that registers be automatically stacked, with the top stack frames maintained in a cache [DITZ82]. While this is much better (faster) than implementing registers as part of memory, as with the Texas Instruments 9900 microprocessor, it is unlikely to be as fast as regular, hardwired registers. The specific cache described, however, is not general purpose, but is dedicated to holding registers; it therefore should be much faster than a larger, general-purpose cache.

3. DIRECTIONS FOR RESEARCH AND DEVELOPMENT

Cache memories are moderately well understood, but there are problems which indicate directions both for research and development. First, we note that technology is changing; storage is becoming cheaper and faster, as is processor logic. Cost/performance trade-offs and compromises will change with technology, and the appropriate solutions to the problems discussed will shift. In addition to this general comment, we see some more specific issues.

3.1 On-Chip Cache and Other Technology Advances

The number of gates that can be placed on a microcomputer chip is growing quickly, and within a few years it will be feasible to build a general-purpose cache memory on the same chip as the processor. We expect that such an on-chip cache will appear. There is research to be done in designing this cache within the constraints of the VLSI state of the art. (See LIND81 and POHM82.)

Cache design is also affected by the implementation technology. MOS VLSI, for example, permits wide associative searches to be implemented easily. This implies that parameters such as set size may change with changing technology. This related aspect of technological change also needs to be studied.

3.2 Multicache Consistency

The problem of multicache consistency was discussed in Section 2.7, and a number of solutions were indicated. Additional commercial implementations are needed, especially of systems with four or more CPUs, before the cost/performance trade-offs can be evaluated.

3.3 Implementation Evaluation

A number of new or different cache designs were discussed earlier, such as the split instruction/data cache, the supervisor/user cache, the multilevel cache, and the virtual address cache. One or more implementations of such designs are required before their desirability can be fully evaluated.

3.4 Hit Ratio versus Size

There is no generally accepted model for the hit ratio of a cache as a function of its size. Such a model is needed, and it will probably have to be made specific to each machine architecture and workload type (e.g., 370 commercial, 370 scientific, and PDP-11).

3.5 TLB Design

A number of different TLB designs exist, but there are almost no published evaluations (but see SATY81). It would be useful to know what level of performance can be expected from the various designs and, in particular, to know whether the complexity of the Amdahl TLB is warranted.

3.6 Cache Parameters versus Architecture and Workload

Most of the studies in this paper have been based on IBM System/370 user program address traces. On the basis of that data, we have been able to suggest desirable parameter values for various aspects of the cache. Similar studies need to be performed for other machines and workloads.

APPENDIX. EXPLANATION OF TRACE NAMES

1. EDC PDP-11 trace of a text editor, written in C, compiled with the C compiler on the PDP-11.
2. ROFFAS PDP-11 trace of a text output and formatting program (called ROFF or runoff).
3. TRACE PDP-11 trace of the program tracer itself tracing EDC. (The tracer is written in assembly language.)
4. FGO1 FORTRAN Go step, factor analysis (1249 lines, single precision).
5. FGO2 FORTRAN Go step, double-precision analysis of satellite information, 2057 lines, FortG compiler.
6. FGO3 FORTRAN Go step, double-precision numerical analysis, 840 lines, FortG compiler.
7. FGO4 FORTRAN Go step, FFT of hole in rotating body. Double-precision FortG.
8. CGO1 COBOL Go step, fixed-assets program doing tax transaction selection.
9. CGO2 COBOL Go step, "fixed assets: year end tax select."
10. CGO3 COBOL Go step, projects depreciation of fixed assets.
11. PGO2 PL/I Go step, does CCW analysis.
12. IEBDG IBM utility that generates test data that can be used in program debugging. It will create multiple data sets of whatever form and contents are desired.
13. PGO1 PL/I Go step, SMF billing program.
14. FCOMP FORTRAN compile of a program that solves the Reynolds partial differential equation (2330 lines).
15. CCOMP COBOL compile, 240 lines, accounting report.
16. WATEX Execution of a FORTRAN program compiled using the WATFIV compiler. The program is a combinatorial search routine.
17. WATFIV FORTRAN compilation using the WATFIV compiler. (Compiles the program whose trace is the WATEX trace.)
18. APL Execution of an APL program which does plots at a terminal.
19. FFT Execution of an FFT program written in ALGOL, compiled using the ALGOL W compiler at Stanford.

ACKNOWLEDGMENTS

This research has been partially supported by the National Science Foundation under grants MCS77-28429 and MCS-8202591, and by the Department of Energy under contract EY-76-C-03-0515 (with the Stanford Linear Accelerator Center).

I would like to thank a number of people who were of help to me in the preparation of this paper. Matt Diethelm, George Rossman, Alan Knowles, and Dileep Bhandarkar helped me to obtain materials describing various machines. George Rossman, Bill Harding, Jim Goodman, Domenico Ferrari, Mark Hill, and Bob Doran read and commented upon a draft of this paper. Mac MacDougall provided a number of other useful comments. John Lee generated a number of the traces used for the experiments in this paper from trace data available at Amdahl Corporation. Four of the traces were generated by Len Shustek. The responsibility for the contents of this paper, of course, remains with the author.


REFERENCES

AGRA77a AGRAWAL, O. P., AND POHM, A. V. "Cache memory systems for multiprocessor architecture," in Proc. AFIPS National Computer Conference (Dallas, Tex., June 13-16, 1977), vol. 46, AFIPS Press, Arlington, Va., pp. 955-964.
AICH76 AICHELMANN, F. J. Memory prefetch. IBM Tech. Disclosure Bull. 18, 11 (April 1976), 3707-3708.
ALSA78 AL-SAYED, H. S. "Cache memory application to microcomputers," Tech. Rep. 78-6, Dep. of Computer Science, Iowa State Univ., Ames, Iowa, 1978.
ANAC67 ANACKER, W., AND WANG, C. P. Performance evaluation of computing systems with memory hierarchies. IEEE Trans. Comput. TC-16, 6 (Dec. 1967), 764-773.
ANDE67a ANDERSON, D. W., SPARACIO, F. J., AND TOMASULO, R. M. The IBM System/360 Model 91: Machine philosophy and instruction handling. IBM J. Res. Dev. 11, 1 (Jan. 1967), 8-24.
ANDE67b ANDERSON, S. F., EARLE, J. G., GOLDSCHMIDT, R. E., AND POWERS, D. M. The IBM System/360 Model 91: Floating-point execution unit. IBM J. Res. Dev. 11, 1 (Jan. 1967), 34-53.
ARMS81 ARMSTRONG, R. A. Applying CAD to gate arrays speeds 32-bit minicomputer design. Electronics (Jan. 31, 1981), 167-173.
AROR72 ARORA, S. R., AND WU, F. L. "Statistical quantification of instruction and operand traces," in Statistical Computer Performance Evaluation, Freiberger (ed.), Academic Press, New York, N.Y., 1972, pp. 227-239.
BADE79 BADEL, M., AND LEROUDIER, J. "Performance evaluation of a cache memory for a minicomputer," in Proc. 4th Int. Symp. on Modelling and Performance Evaluation of Computer Systems (Vienna, Austria, Feb. 1979).
BALZ81a BALZEN, D., GETZLAFF, K. J., HADJU, J., AND KNAUFT, G. Accelerating store in cache operations. IBM Tech. Disclosure Bull. 23, 12 (May 1981), 5428-5429.
BALZ81b BALZEN, D., HADJU, J., AND KNAUFT, G. Preventive cast out operations in cache hierarchies. IBM Tech. Disclosure Bull. 23, 12 (May 1981), 5426-5427.
BARS72 BARSAMIAN, H., AND DECEGAMA, A. "System design considerations of cache memories," in Proc. IEEE Computer Society Conference (1972), IEEE, New York, pp. 107-110.
BEAN79 BEAN, B. M., LANGSTON, K., PARTRIDGE, R., AND SY, K.-B. Bias filter memory for filtering out unnecessary interrogations of cache directories in a multiprocessor system. United States Patent 4,142,234, Feb. 17, 1979.
BEDE79 BEDERMAN, S. Cache management system using virtual and real tags in the cache directory. IBM Tech. Disclosure Bull. 21, 11 (April 1979), 4541.
BELA66 BELADY, L. A. A study of replacement algorithms for a virtual storage computer. IBM Syst. J. 5, 2 (1966), 78-101.
BELL74 BELL, J., CASASENT, D., AND BELL, C. G. An investigation of alternative cache organizations. IEEE Trans. Comput. TC-23, 4 (April 1974), 346-351.
BENN76 BENNETT, B. T., AND FRANASZEK, P. A. Cache memory with prefetching of data by priority. IBM Tech. Disclosure Bull. 18, 12 (May 1976), 4231-4232.
BENN82 BENNETT, B. T., POMERENE, J. H., PUZAK, T. R., AND RECHTSCHAFFEN, R. N. Prefetching in a multilevel memory hierarchy. IBM Tech. Disclosure Bull. 25, 1 (June 1982), 88.
BERG76 BERG, H. S., AND SUMMERFIELD, A. R. CPU busy not all productive utilization. Share Computer Measurement and Evaluation Newsletter, no. 39 (Sept. 1976), 95-97.
BERG78 BERGEY, A. L., JR. Increased computer throughput by conditioned memory data prefetching. IBM Tech. Disclosure Bull. 20, 10 (March 1978), 4103.
BLAZ80 BLAZEJEWSKI, T. J., DOBRZYNSKI, S. M., AND WATSON, W. D. Instruction buffer with simultaneous storage and fetch operations. IBM Tech. Disclosure Bull. 23, 2 (July 1980), 670-672.
BLOU80 BLOUNT, F. T., BULLIONS, R. J., MARTIN, D. B., MCGILVRAY, B. L., AND ROBINSON, J. R. Deferred cache storing method. IBM Tech. Disclosure Bull. 23, 1 (June 1980), 262-263.
BOLA67 BOLAND, L. J., GRANITO, G. D., MARCOTTE, A. V., MESSINA, B. U., AND SMITH, J. W. The IBM System/360 Model 91: Storage system. IBM J. Res. Dev. 11, 1 (Jan. 1967), 54-68.
BORG79 BORGERSON, B. R., GODFREY, M. D., HAGERTY, P. E., AND RYKKEN, T. R. "The architecture of the Sperry Univac 1100 series systems," in Proc. 6th Annual Symp. Computer Architecture (April 23-25, 1979), ACM, New York, N.Y., pp. 137-146.
CAMP76 CAMPBELL, J. E., STROHM, W. G., AND TEMPLE, J. L. Most recent class used search algorithm for a memory cache. IBM Tech. Disclosure Bull. 18, 10 (March 1976), 3307-3308.


CDC74 CONTROL DATA CORP. Control Data 6000 Series Computer Systems Reference Manual. Arden Hills, Minn., 1974.
CENS78 CENSIER, L., AND FEAUTRIER, P. A new solution to coherence problems in multicache systems. IEEE Trans. Comput. TC-27, 12 (Dec. 1978), 1112-1118.
CHIA75 CHIA, D. K. Optimum implementation of LRU hardware for a 4-way set-associative memory. IBM Tech. Disclosure Bull. 17, 11 (April 1975), 3161-3163.
CHOW75 CHOW, C. K. Determining the optimum capacity of a cache memory. IBM Tech. Disclosure Bull. 17, 10 (March 1975), 3163-3166.
CHOW76 CHOW, C. K. Determination of cache's capacity and its matching storage hierarchy. IEEE Trans. Comput. TC-25, 2 (Feb. 1976), 157-164.
CHU76 CHU, W. W., AND OPDERBECK, H. Program behavior and the page fault frequency replacement algorithm. Computer 9, 11 (Nov. 1976), 29-38.
CLAR81 CLARK, D. W., LAMPSON, B. W., AND PIER, K. A. The memory system of a high-performance personal computer. IEEE Trans. Comput. TC-30, 10 (Oct. 1981), 715-733.
CLAR82 CLARK, D. W. Cache performance in the VAX-11/780. To appear in ACM Trans. Comput. Syst. 1, 1 (Feb. 1983).
COFF73 COFFMAN, E. G., AND DENNING, P. J. Operating Systems Theory. Prentice-Hall, Englewood Cliffs, N.J., 1973.
CONT68 CONTI, C. J., GIBSON, D. H., AND PITKOWSKY, S. H. Structural aspects of the System/360 Model 85. IBM Syst. J. 7, 1 (1968), 2-21.
CONT69 CONTI, C. J. Concepts for buffer storage. IEEE Computer Group News 2, 8 (March 1969), 9-13.
COSC81 COSCARELLA, A. S., AND SELLERS, F. F. System for purging TLB. IBM Tech. Disclosure Bull. 24, 2 (July 1981), 910-911.
CRAY76 CRAY RESEARCH, INC. Cray-1 Computer System Reference Manual. Bloomington, Minn., 1976.
DEC78 DIGITAL EQUIPMENT CORP. "TB/Cache/SBI Control Technical Description--VAX-11/780 Implementation," Document No. EK-MM780-TD-001, First Edition (April 1978), Digital Equipment Corp., Maynard, Mass.
DENN68 DENNING, P. J. The working set model for program behavior. Commun. ACM 11, 5 (May 1968), 323-333.
DENN72 DENNING, P. J. "On modeling program behavior," in Proc. Spring Joint Computer Conference, vol. 40, AFIPS Press, Arlington, Va., 1972, pp. 937-944.
DIET74 DIETHELM, M. A. "Level 66 cache memory," Tech. Info. Notepad 1-114, Honeywell, Phoenix, Ariz., April 1974.
DITZ82 DITZEL, D. R. "Register allocation for free: The C machine stack cache," in Proc. Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., March 1-3, 1982), ACM, New York, N.Y., 1982.
DRIM81a DRIMAK, E. G., DUTTON, P. F., HICKS, G. L., AND SITLER, W. R. Multiprocessor locking with a bypass for channel references. IBM Tech. Disclosure Bull. 23, 12 (May 1981), 5329-5331.
DRIM81b DRIMAK, E. G., DUTTON, P. F., AND SITLER, W. R. Attached processor simultaneous data searching and transfer via main storage controls and intercache transfer controls. IBM Tech. Disclosure Bull. 24, 1A (June 1981), 26-27.
DRIS80 DRISCOLL, G. C., MATICK, R. E., PUZAK, T. R., AND SHEDLETSKY, J. J. Split cache with variable interleave boundary. IBM Tech. Disclosure Bull. 22, 11 (April 1980), 5183-5186.
DUBO82 DUBOIS, M., AND BRIGGS, F. A. "Effects of cache concurrency in multiprocessors," in Proc. 9th Annual Symp. Computer Architecture (Austin, Tex., April 1982), ACM, New York, N.Y., pp. 292-308.
EAST75 EASTON, M. C., AND FAGIN, R. "Cold-start vs. warm-start miss ratios and multiprogramming performance," IBM Res. Rep. RC 5715, Nov. 1975.
EAST78 EASTON, M. C. Computation of cold-start miss ratios. IEEE Trans. Comput. TC-27, 5 (May 1978), 404-408.
ELEC76 ELECTRONICS. Altering computer architecture is way to raise throughput, suggest IBM researchers. Dec. 23, 1976, 30-31.
ELEC81 ELECTRONICS. New TI 16-bit machine has on-chip memory. Nov. 3, 1981, 57.
ENGE73 ENGER, T. A. Paged control store prefetch mechanism. IBM Tech. Disclosure Bull. 16, 7 (Dec. 1973), 2140-2141.
FAVR78 FAVRE, P., AND KUHNE, R. Fast memory organization. IBM Tech. Disclosure Bull. 21, 2 (July 1978), 649-650.
FUKU77 FUKUNAGA, K., AND KASAI, T. "The efficient use of buffer storage," in Proc. ACM 1977 Annual Conference (Seattle, Wash., Oct. 16-19, 1977), ACM, New York, N.Y., pp. 399-403.
FURN78 FURNEY, R. W. Selection of least recently used slot with bad entry and locked slots involved. IBM Tech. Disclosure Bull. 21, 6 (Nov. 1978), 290.
GECS74 GECSEI, J. Determining hit ratios for multilevel hierarchies. IBM J. Res. Dev. 18, 4 (July 1974), 316-327.
GIBS67 GIBSON, D. H. "Considerations in block-oriented systems design," in Proc. Spring Jt. Computer Conf., vol. 30, Thompson Books, Washington, D.C., 1967, pp. 75-80.


GIND77 GINDELE, J. D. Buffer block prefetching method. IBM Tech. Disclosure Bull. 20, 2 (July 1977), 696-697.
GREE74 GREENBERG, B. S. "An experimental analysis of program reference patterns in the Multics virtual memory," Project MAC Tech. Rep. MAC-TR-127, 1974.
GUST82 GUSTAFSON, R. N., AND SPARACIO, F. J. IBM 3081 processor unit: Design considerations and design process. IBM J. Res. Dev. 26, 1 (Jan. 1982), 12-21.
HAIL79 HAILPERN, B., AND HITSON, B. "S-1 architecture manual," Tech. Rep. No. 161, Computer Systems Laboratory, Stanford Univ., Stanford, Calif., Jan. 1979.
HARD75 HARDING, W. J. "Hardware Controlled Memory Hierarchies and Their Performance." Ph.D. dissertation, Arizona State Univ., Dec. 1975.
HARD80a HARDING, W. J., MACDOUGALL, M. H., AND RAYMOND, W. J. "Empirical estimation of cache miss ratios as a function of cache size," Tech. Rep. PN 820-420-700A (Sept. 26, 1980), Amdahl Corp.
HOEV81a HOEVEL, L. W., AND VOLDMAN, J. Mechanism for cache replacement and prefetching driven by heuristic estimation of operating system behavior. IBM Tech. Disclosure Bull. 23, 8 (Jan. 1981), 3923.
HOEV81b HOEVEL, L. W., AND VOLDMAN, J. Cache line reclamation and cast out avoidance under operating system control. IBM Tech. Disclosure Bull. 23, 8 (Jan. 1981), 3912.
IBBE72 IBBETT, R. The MU5 instruction pipeline. Comput. J. 15, 1 (Jan. 1972), 42-50.
IBBE77 IBBETT, R. N., AND HUSBAND, M. A. The MU5 name store. Comput. J. 20, 3 (Aug. 1977), 227-231.
IBM71 IBM. "IBM System/360 and System/370 Model 195 Functional characteristics," Form GA22-6943-2 (Nov. 1971), IBM, Armonk, N.Y.
IBM75 IBM. "IBM System/370 Model 168 Theory of Operation/Diagrams Manual--Processor Storage Control Function (PSCF)," vol. 4, IBM, Poughkeepsie, N.Y., 1975.
IBM78 IBM. "3033 Processor Complex, Theory of Operation/Diagrams Manual--Processor Storage Control Function (PSCF)," vol. 4, IBM, Poughkeepsie, N.Y., 1978.
IBM82 IBM. "IBM 3081 Functional characteristics," Form GA22-7076, IBM, Poughkeepsie, N.Y., 1982.
JOHN81 JOHNSON, R. C. Microsystems exploit mainframe methods. Electronics, Aug. 11, 1981, 119-127.
JONE76 JONES, J. D., JUNOD, D. M., PARTRIDGE, R. L., AND SHAWLEY, B. L. Updating cache data arrays with data stored by other CPUs. IBM Tech. Disclosure Bull. 19, 2 (July 1976), 594-596.
JONE77a JONES, J. D., AND JUNOD, D. M. Cache address directory invalidation scheme for multiprocessing system. IBM Tech. Disclosure Bull. 20, 1 (June 1977), 295-296.
JONE77b JONES, J. D., AND JUNOD, D. M. Pretest lookaside buffer. IBM Tech. Disclosure Bull. 20, 1 (June 1977), 297-298.
KAPL73 KAPLAN, K. R., AND WINDER, R. O. Cache-based computer systems. IEEE Computer 6, 3 (March 1973), 30-36.
KOBA80 KOBAYASHI, M. "An algorithm to measure the buffer growth function," Tech. Rep. PN 820413-700A (Aug. 8, 1980), Amdahl Corp.
KONE80 KONEN, D. H., MARTIN, D. B., MCGILVRAY, B. L., AND TOMASULO, R. M. Demand driven instruction fetching inhibit mechanism. IBM Tech. Disclosure Bull. 23, 2 (July 1980), 716-717.
KROF81 KROFT, D. "Lockup-free instruction fetch/prefetch cache organization," in Proc. 8th Annual Symp. Computer Architecture (Minneapolis, Minn., May 12-14, 1981), ACM, New York, N.Y., pp. 81-87.
KUMA79 KUMAR, B. "A model of spatial locality and its application to cache design," Tech. Rep. (unpublished), Computer Systems Laboratory, Stanford Univ., Stanford, Calif., 1979.
LAFF81 LAFFITTE, D. S., AND GUTTAG, K. M. Fast on-chip memory extends 16-bit family's reach. Electronics, Feb. 24, 1981, 157-161.
LAMP80 LAMPSON, B. W., AND PIER, K. A. "A processor for a high-performance personal computer," in Proc. 7th Annual Symp. Computer Architecture (May 6-8, 1980), ACM, New York, N.Y., pp. 146-160.
LEE69 LEE, F. F. Study of "look-aside" memory. IEEE Trans. Comput. TC-18, 11 (Nov. 1969), 1062-1064.
LEE80 LEE, J. M., AND WEINBERGER, A. A solution to the synonym problem. IBM Tech. Disclosure Bull. 22, 8A (Jan. 1980), 3331-3333.
LEE82 LEE, J. K. F., AND SMITH, A. J. "Analysis of branch prediction strategies and branch target buffer design," Tech. Rep., Univ. of Calif., Berkeley, Calif., 1982.
LEHM78 LEHMAN, A., AND SCHMID, D. "The performance of small cache memories in minicomputer systems with several processors," in Digital Memory and Storage, Springer-Verlag, New York, 1978, pp. 391-407.
LEHM80 LEHMANN, A. Performance evaluation and prediction of storage hierarchies. Source unknown, 1980, pp. 43-54.
LEWI71 LEWIS, P. A. W., AND YUE, P. C. "Statistical analysis of program reference patterns in a paging environment," in Proc. IEEE Computer Society Conference, IEEE, New York, N.Y., 1971.


LEWI73  LEWIS, P. A. W., AND SHEDLER, G. S. Empirically derived micro models for sequences of page exceptions. IBM J. Res. Dev. 17, 2 (March 1973), 86-100.
LIND81  LINDSAY, D. C. Cache memories for microprocessors. Computer Architecture News 9, 5 (Aug. 1981), 6-13.
LIPT68  LIPTAY, J. S. Structural aspects of the System/360 Model 85, II: the cache. IBM Syst. J. 7, 1 (1968), 15-21.
LIU82   LIU, L. Cache-splitting with information of XI-sensitivity in tightly coupled multiprocessing systems. IBM Tech. Disclosure Bull. 25, 1 (June 1982), 54-55.
LOSQ82  LOSQ, J. J., PARKS, L. S., SACHAR, H. E., AND YAMOUR, J. Conditional cache miss facility for handling short/long cache requests. IBM Tech. Disclosure Bull. 25, 1 (June 1982), 110-111.
LUDL77  LUDLOW, D. M., AND MOORE, B. B. Channel DAT with pin bits. IBM Tech. Disclosure Bull. 20, 2 (July 1977), 683.
MACD79  MACDOUGALL, M. H. "The stack growth function model," Tech. Rep. 820228-700A (April 1979), Amdahl Corp.
MARU75  MARUYAMA, K. mLRU page replacement algorithm in terms of the reference matrix. IBM Tech. Disclosure Bull. 17, 10 (March 1975), 3101-3103.
MARU76  MARUYAMA, K. Implementation of the stack operation circuit for the LRU algorithm. IBM Tech. Disclosure Bull. 19, 1 (June 1976), 321-325.
MATT71  MATTSON, R. L. Evaluation of multilevel memories. IEEE Trans. Magnetics MAG-7, 4 (Dec. 1971), 814-819.
MATT70  MATTSON, R. L., GECSEI, J., SLUTZ, D. R., AND TRAIGER, I. L. Evaluation techniques for storage hierarchies. IBM Syst. J. 9, 2 (1970), 78-117.
MAZA77  MAZARE, G. "A few examples of how to use a symmetrical multi-micro-processor," in Proc. 4th Annual Symp. Computer Architecture (March 1977), ACM, New York, N.Y., pp. 57-62.
MCWI77  MCWILLIAMS, T., WIDDOES, L. C., AND WOOD, L. "Advanced digital processor technology base development for navy applications: The S-1 processor," Tech. Rep. UCID-17705, Lawrence Livermore Laboratory, Sept. 1977.
MEAD70  MEADE, R. M. "On memory system design," in Proc. Fall Joint Computer Conference, vol. 37, AFIPS Press, Arlington, Va., 1970, pp. 33-43.
MILA75  MILANDRE, G., AND MIKKOR, R. "VS2-R2 experience at the University of Toronto Computer Centre," in Share 44 Proc. (Los Angeles, Calif., March 1975), pp. 1887-1895.
MORR79  MORRIS, D., AND IBBETT, R. N. The MU5 Computer System. Springer-Verlag, New York, 1979.
NGAI81  NGAI, C. H., AND WASSEL, E. R. Shadow directory for attached processor system. IBM Tech. Disclosure Bull. 23, 8 (Jan. 1981), 3667-3668.
NGAI82  NGAI, C. H., AND WASSEL, E. R. Two-level DLAT hierarchy. IBM Tech. Disclosure Bull. 24, 9 (Feb. 1982), 4714-4715.
OHNO77  OHNO, N., AND HAKOZAKI, K. Pseudo random access memory system with CCD-SR and MOS RAM on a chip. 1977.
OLBE79  OLBERT, A. G. Fast DLAT load for V=R translations. IBM Tech. Disclosure Bull. 22, 4 (Sept. 1979), 1434.
PERK80  PERKINS, D. R. "The Design and Management of Predictive Caches." Ph.D. dissertation, Univ. of Calif., San Diego, Calif., 1980.
PEUT77  PEUTO, B. L., AND SHUSTEK, L. J. "An instruction timing model of CPU performance," in Proc. 4th Annual Symp. Computer Architecture (March 1977), ACM, New York, N.Y., pp. 165-178.
POHM73  POHM, A. V., AGRAWAL, O. P., CHENG, C.-W., AND SHIMP, A. C. An efficient flexible buffered memory system. IEEE Trans. Magnetics MAG-9, 3 (Sept. 1973), 173-179.
POHM75  POHM, A. V., AGRAWAL, O. P., AND MONROE, R. N. "The cost and performance tradeoffs of buffered memories," in Proc. IEEE 63, 8 (Aug. 1975), pp. 1129-1135.
POHM82  POHM, A. V., AND AGRAWAL, O. P. "A cache technique for bus oriented multiprocessor systems," in Proc. Compcon 82 (San Francisco, Calif., Feb. 1982), IEEE, New York, pp. 62-66.
POME80a POMERENE, J. H., AND RECHTSCHAFFEN, R. Reducing cache misses in a branch history table machine. IBM Tech. Disclosure Bull. 23, 2 (July 1980), 853.
POME80b POMERENE, J. H., AND RECHTSCHAFFEN, R. N. Base/displacement lookahead buffer. IBM Tech. Disclosure Bull. 22, 11 (April 1980), 5182.
POWE77  POWELL, M. L. "The DEMOS file system," in Proc. 6th Symp. on Operating Systems Principles (West LaFayette, Ind., Nov. 16-18, 1977), ACM, New York, N.Y., pp. 33-42.
RADI82  RADIN, G. M. "The 801 minicomputer," in Proc. Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., March 1-3, 1982), ACM, New York, N.Y., pp. 39-47.
RAMA81  RAMAMOHANARAO, K., AND SACKS-DAVIS, R. Hardware address translation for machines with a large virtual memory. Inf. Process. Lett. 13, 1 (Oct. 1981), 23-29.
RAU76   RAU, B. R. "Sequential prefetch strategies for instructions and data," Digital Systems Laboratory Tech. Rep. 131 (1976), Stanford Univ., Stanford, Calif.



REIL82  REILLY, J., SUTTON, A., NASSER, R., AND GRISCOM, R. Processor controller for the IBM 3081. IBM J. Res. Dev. 26, 1 (Jan. 1982), 22-29.
RIS77   RIS, F. N., AND WARREN, H. S., JR. Read-constant control line to cache. IBM Tech. Disclosure Bull. 20, 6 (Nov. 1977), 2509-2510.
ROSS79  ROSSMAN, G. Private communication. Palyn Associates, San Jose, Calif., 1979.
SALT74  SALTZER, J. H. A simple linear model of demand paging performance. Commun. ACM 17, 4 (April 1974), 181-186.
SATY81  SATYANARAYANAN, M., AND BHANDARKAR, D. Design trade-offs in VAX-11 translation buffer organization. IEEE Computer (Dec. 1981), 103-111.
SCHR71  SCHROEDER, M. D. "Performance of the GE-645 associative memory while Multics is in operation," in Proc. 1971 Conference on Computer Performance Evaluation (Harvard Univ., Cambridge, Mass.), pp. 227-245.
SHED76  SHEDLER, G. S., AND SLUTZ, D. R. Derivation of miss ratios for merged access streams. IBM J. Res. Dev. 20, 5 (Sept. 1976), 505-517.
SLUT72  SLUTZ, D. R., AND TRAIGER, I. L. "Evaluation techniques for cache memory hierarchies," IBM Res. Rep. RJ 1045, May 1972.
SMIT78a SMITH, A. J. A comparative study of set associative memory mapping algorithms and their use for cache and main memory. IEEE Trans. Softw. Eng. SE-4, 2 (March 1978), 121-130.
SMIT78b SMITH, A. J. Sequential program prefetching in memory hierarchies. IEEE Computer 11, 12 (Dec. 1978), 7-21.
SMIT78c SMITH, A. J. Sequentiality and prefetching in database systems. ACM Trans. Database Syst. 3, 3 (Sept. 1978), 223-247.
SMIT78d SMITH, A. J. Bibliography on paging and related topics. Operating Systems Review 12, 4 (Oct. 1978), 39-56.
SMIT79  SMITH, A. J. Characterizing the storage process and its effect on the update of main memory by write-through. J. ACM 26, 1 (Jan. 1979), 6-27.
SNOW78  SNOW, E. A., AND SIEWIOREK, D. P. "Impact of implementation design tradeoffs on performance: The PDP-11, a case study," Dep. of Computer Science Report (Feb. 1978), Carnegie-Mellon University, Pittsburgh, Pa.
SPAR78  SPARACIO, F. J. Data processing system with second level cache. IBM Tech. Disclosure Bull. 21, 6 (Nov. 1978), 2468-2469.
STEV81  STEVENSON, D. "Virtual memory on the Z8003," in Proc. IEEE Compcon (San Francisco, Calif., Feb. 1981), pp. 355-357.
STRE76  STRECKER, W. D. "Cache memories for PDP-11 family computers," in Proc. 3rd Annual Symp. Computer Architecture (Jan. 19-21, 1976), ACM, New York, N.Y., pp. 155-158.
TANG76  TANG, C. K. "Cache system design in the tightly coupled multiprocessor system," in Proc. AFIPS National Computer Conference (New York, N.Y., June 7-10, 1976), vol. 45, AFIPS Press, Arlington, Va., pp. 749-753.
THAK78  THAKKAR, S. S. "Investigation of Buffer Store Organization." Master of Science thesis, Victoria University of Manchester, England, Oct. 1978.
TOMA67  TOMASULO, R. M. An efficient algorithm for exploiting multiple arithmetic units. IBM J. Res. Dev. 11, 1 (Jan. 1967), 25-33.
WILK71  WILKES, M. V. Slave memories and segmentation. IEEE Trans. Comput. (June 1971), 674-675.
WIND73  WINDER, R. O. A data base for computer performance evaluation. IEEE Computer 6, 3 (March 1973), 25-29.
YAMO80  YAMOUR, J. Odd/even interleave cache with optimal hardware array cost, cycle time and variable data part width. IBM Tech. Disclosure Bull. 23, 7B (Dec. 1980), 3461-3463.
YEN81   YEN, W. C., AND FU, K. S. "Analysis of multiprocessor cache organizations with alternative main memory update policies," in Proc. 8th Annual Symp. Computer Architecture (Minneapolis, Minn., May 12-14, 1981), ACM, New York, N.Y., pp. 89-105.
YUVA75  YUVAL, A. "165/HSB analysis," Share Inc., Computer Measurement and Evaluation, Selected Papers from the Share Project, vol. III, pp. 595-606, 1975.
ZOLN81  ZOLNOWSKY, J. "Philosophy of the MC68451 memory management unit," in Proc. IEEE Compcon (San Francisco, Calif., Feb. 1981), IEEE, New York, pp. 358-361.

BIBLIOGRAPHY

ACKL75  ACKLAND, B. D., AND PUCKNELL, D. A. Studies of cache store behavior in a real time minicomputer environment. Electronics Letters 11, 24 (Nov. 1975), 588-590.
ACKL79  ACKLAND, B. D. "A bit-slice cache controller," in Proc. 6th Annual Symp. Computer Architecture (April 23-25, 1979), ACM, New York, N.Y., pp. 75-82.
AGRA77b AGRAWAL, O. P., ZINGG, R. J., AND POHM, A. V. "Applicability of 'cache' memories to dedicated multiprocessor systems," in Proc. IEEE Computer Society Conference (San Francisco, Calif., Spring 1977), IEEE, New York, pp. 74-76.


AMDA76  AMDAHL CORP. 470V/6 Machine Reference Manual. 1976.
BELL71  BELL, C. G., AND CASASENT, D. Implementation of a buffer memory in minicomputers. Comput. Des. 10 (Nov. 1971), 83-89.
BLOO61  BLOOM, L., COHEN, M., AND PORTER, S. "Considerations in the design of a computer with high logic-to-memory speed ratio," in Proc. Gigacycle Computing Systems (Jan. 1962), AIEE Special Publication S-136, pp. 53-63.
BORG81  BORGWARDT, P. A. "Cache structures based on the execution stack for high level languages," Tech. Rep. 81-08-04, Dep. of Computer Science, Univ. of Washington, Seattle, Wash., 1981.
BRIG81  BRIGGS, F. A., AND DUBOIS, M. "Performance of cache-based multiprocessors," in Proc. ACM/SIGMETRICS Conf. on Measurement and Modeling of Computer Systems (Las Vegas, Nev., Sept. 14-16, 1981), ACM, New York, N.Y., pp. 181-190.
CANN81  CANNON, J. W., GRIMES, D. W., AND HERMANN, S. D. Storage protect operations. IBM Tech. Disclosure Bull. 24, 2 (July 1981), 1184-1186.
CAPO81  CAPOWSKI, R. S., DEVEER, J. A., HELLER, A. R., AND MESCHI, J. W. Dynamic address translator for I/O channels. IBM Tech. Disclosure Bull. 23, 12 (May 1981), 5503-5508.
CASP79  CASPERS, P. G., FAIX, M., GOETZE, V., AND ULLAND, H. Cache resident processor registers. IBM Tech. Disclosure Bull. 22, 6 (Nov. 1979), 2317-2318.
CORD81  CORDERO, H., AND CHAMBERS, J. B. Second group of IBM 4341 machines outdoes the first. Electronics (April 7, 1981), 149-152.
ELLE79  ELLER, J. E., III. "Cache design and the X-tree high speed memory buffers." Master of Science project report, Computer Science Division, EECS Department, Univ. of Calif., Berkeley, Calif., Sept. 1979.
FARI80  FARIS, S. M., HENKELS, W. H., VALSAMAKIS, E. A., AND ZAPPE, H. H. Basic design of a Josephson technology cache memory. IBM J. Res. Dev. 24, 2 (March 1980), 143-154.
FARM81  FARMER, D. Comparing the 4341 and M80/42. Computerworld, Feb. 9, 1981.
FERE79  FERETICH, R. A., AND SACHAR, H. E. Interleaved multiple speed memory controls with high speed buffer. IBM Tech. Disclosure Bull. 22, 5 (Oct. 1979), 1999-2000.
GARC78  GARCIA, L. C. Instruction buffer design. IBM Tech. Disclosure Bull. 20, 11B (April 1978), 4832-4833.
HEST78  HESTER, R. L., AND MOYER, J. T. Split cycle for a shared array. IBM Tech. Disclosure Bull. 21, 6 (Nov. 1978), 2293-2294.
HOFF81  HOFFMAN, R. L., MITCHELL, G. R., AND SOLTIS, F. G. Reference and change bit recording. IBM Tech. Disclosure Bull. 23, 12 (May 1981), 5516-5519.
HRUS81  HRUSTICH, J., AND SITLER, W. R. Cache reconfiguration. IBM Tech. Disclosure Bull. 23, 9 (Feb. 1981), 4117-4118.
IBM70   IBM. "IBM field engineering theory of operation, System/360, Model 195 storage buffer storage," first edition (Aug. 1970), IBM, Poughkeepsie, N.Y.
IIZU73  IIZUKA, H., AND TERU, T. Cache memory simulation by a new method of address pattern generation. J. IPSJ 14, 9 (1973), 669-676.
KNEP79  KNEPPER, R. W. Cache bit selection circuit. IBM Tech. Disclosure Bull. 22, 1 (June 1979), 142-143.
KNOK69  KNOKE, P. "An analysis of buffered memories," in Proc. 2nd Hawaii Int. Conf. on System Sciences (Jan. 1969), pp. 397-400.
KOTO76  KOTOK, A. "Lecture notes for CS252," course notes (Spring 1976), Univ. of Calif., Berkeley, Calif., 1976.
LANG75  LANGE, R. E., DIETHELM, M. A., AND ISHMAEL, P. C. Cache memory store in a processor of a data processing system. United States Patent 3,896,419, July 1975.
LARN80  LARNER, R. A., LASSETTRE, E. R., MOORE, B. B., AND STRICKLAND, J. P. Channel DAT and page pinning for block unit transfers. IBM Tech. Disclosure Bull. 23, 2 (July 1980), 704-705.
LEE80   LEE, P. A., GHANI, N., AND HERON, K. A recovery cache for the PDP-11. IEEE Trans. Comput. TC-29, 6 (June 1980), 546-549.
LORI80  LORIN, H., AND GOLDSTEIN, B. "An inversion of the memory hierarchy," IBM Res. Rep. RC 8171, March 1980.
MADD81  MADDOCK, R. F., MARKS, B. L., MINSHULL, J. F., AND PINNELL, M. C. Hardware address relocation for variable length segments. IBM Tech. Disclosure Bull. 23, 11 (April 1981), 5186-5187.
MATH81  MATHIS, J. R., MAYFIELD, M. J., AND ROWLAND, R. E. Reference associative cache mapping. IBM Tech. Disclosure Bull. 23, 9 (Feb. 1981), 3969-3971.
MEAD71  MEADE, R. M. Design approaches for cache memory control. Comput. Des. 10, 1 (Jan. 1971), 87-93.
MERR74  MERRILL, B. 370/168 cache memory performance. Share Computer Measurement and Evaluation Newsletter, no. 26 (July 1974), 98-101.


MOOR80  MOORE, B. B., RODELL, J. T., SUTTON, A. J., AND VOWELL, J. D. Vary storage physical on/off line in a non-store-through cache system. IBM Tech. Disclosure Bull. 23, 7B (Dec. 1980), 3329.
NAKA74  NAKAMURA, T., HAGIWARA, H., KITAGAWA, H., AND KANAZAWA, M. Simulation of a computer system with buffer memory. J. IPSJ 15, 1 (1974), 26-33.
NGAI80  NGAI, C. H., AND SITLER, W. R. Two-bit DLAT LRU algorithm. IBM Tech. Disclosure Bull. 22, 10 (March 1980), 4488-4490.
RAO75   RAO, G. S. "Performance analysis of cache memories," Digital Systems Laboratory Tech. Rep. 110 (Aug. 1975), Stanford Univ., Stanford, Calif.
RECH80  RECHTSCHAFFEN, R. N. Using a branch history table to prefetch cache lines. IBM Tech. Disclosure Bull. 22, 12 (May 1980), 5539.
SCHU78  SCHUENEMANN, C. Fast address translation in systems using virtual addresses and a cache memory. IBM Tech. Disclosure Bull. 21, 2 (Jan. 1978), 663-664.
THRE82  THREEWITT, B. A VLSI approach to cache memory. Comput. Des. (Jan. 1982), 169-173.
TRED77  TREDENNICK, H. L., AND WELCH, T. A. "High-speed buffering for variable length operands," in Proc. 4th Annual Symp. Computer Architecture (March 1977), ACM, New York, N.Y., pp. 205-210.
VOLD81a VOLDMAN, J., AND HOEVEL, L. W. "The Fourier cache connection," in Proc. IEEE Compcon (San Francisco, Calif., Feb. 1981), IEEE, New York, pp. 344-354.
VOLD81b VOLDMAN, J., AND HOEVEL, L. W. The software-cache connection. IBM J. Res. Dev. 25, 6 (Nov. 1981), 877-893.
WILK65  WILKES, M. V. Slave memories and dynamic storage allocation. IEEE Trans. Comput. TC-14, 2 (April 1965), 270-271.

Received December 1979; final revision accepted January 1982
