A Line Fill Circuit for a Micropipelined, Asynchronous Microprocessor

R. Mehra, J.D. Garside [email protected], [email protected]

Department of Computer Science, The University, Oxford Road, Manchester, M13 9PL, U.K.

Abstract

In microprocessor architectures featuring on-chip cache the majority of memory read operations are satisfied without external access. There is, however, a significant penalty associated with cache misses, which require off-chip accesses while the processor is stalled for some or all of the cache line refill time.

This paper investigates the magnitude of the penalties associated with different cache line fetch schemes and demonstrates the desirability of an independent, parallel line fetch mechanism. Such a mechanism for an asynchronous microprocessor is then described. This resolves some potentially complex interactions deterministically and automatically provides a non-blocking mechanism similar to those in the most sophisticated synchronous systems.

1: Introduction

Although any properly designed cache will intercept a large proportion (greater than ninety percent [12]) of memory read requests and satisfy them at high speed, it is inevitable that there will be occasional cache misses. Such misses usually instigate a cache line fetch where a number of words are read from the main memory and copied into the cache. This requirement for external memory cycles can impose a significant penalty on overall performance as a complete line fetch can take an order of magnitude longer than an internal memory cycle.

When evaluating the performance of a CPU and cache sub-system it is usual to consider the overall execution time of a chosen application program. Assuming that the cache and CPU cycle at the same speed (t_hit) when the cache hits and the CPU is only stalled when a cache miss occurs, the execution time for a program with N_memreqs memory requests can then be expressed by equation 1.1.

t_total = N_memreqs · (h_r · t_hit + (1 – h_r) · t_miss)    EQ 1.1

To reduce the total execution time (t_total) the hit ratio (h_r) can be increased or the miss penalty (t_miss) reduced. Increasing the size of the cache or changing its structure (e.g. making it more associative) can improve the hit rate. Having a faster external memory (e.g. adding further levels of cache) speeds up the line fetch; alternatively a wider external bus means that fewer transactions are required to fetch a complete line.

Such direct methods for reducing the miss penalty are not always available or desirable for reasons such as chip area, pin count or power consumption. A complementary approach is to hide the time taken for line fetch operations.

This paper examines the magnitudes of the various line fetch penalties and the methods available for reducing them. It goes on to present a micropipeline [13] circuit which not only manages the line fetch problem correctly but provides several advanced features at very low hardware cost.

2: Line Fetch Strategies

Broadly speaking there are two approaches to reducing the performance penalty due to the line fetch operation. The first method is to ensure that the line fetch starts prematurely and completes before any words from the line are needed (prefetching). Failing this, the penalty imposed in waiting for a line fetch may be minimised by reducing any stall times.

2.1: Reducing the miss penalty

When a cache miss occurs, the simplest procedure is to stall the processor and fetch the entire missing line starting at the lowest address. When the line fetch is complete the requested word is sent to the CPU and processing continues. This is easy to implement and is commonly used in mid-range processors (e.g. MIPS-X [5], 486 DX2 [14]).
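Before looking at specific fetch schemes, the cost model of equation 1.1 can be made concrete with a short sketch. The parameter values below are purely illustrative and are not taken from the measurements in this paper.

```python
def total_execution_time(n_memreqs, hit_ratio, t_hit, t_miss):
    """Equation 1.1: t_total = N_memreqs * (h_r * t_hit + (1 - h_r) * t_miss)."""
    return n_memreqs * (hit_ratio * t_hit + (1.0 - hit_ratio) * t_miss)

# Hypothetical numbers: 1M memory requests, 95% hit ratio,
# 10ns hit time, 150ns miss penalty.
t = total_execution_time(1_000_000, 0.95, 10e-9, 150e-9)
print(f"t_total = {t * 1e3:.2f} ms")   # prints "t_total = 17.00 ms"
```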

[Figure 1: CPU and Cache Activity for Different Line Fetch Strategies. Timelines for the CPU address request sequence ... Y1 Y2 X2 X3 A B C X0 X1 ... under Stall On Miss, Early Restart with Streaming and Non-Blocking fetch; t_lat, t_wd and t_hit mark the initial memory latency, per-word transfer time and cache hit time.]

This stall-on-miss approach is clearly sub-optimal as it requires the time for a number of memory cycles (typically four or eight [9]) to be added to the processing time for each cache miss. In more detail it means that t_miss is:

t_miss = t_lat + n_w · t_wd    EQ 2.1

where t_lat is the first word memory latency, n_w is the number of memory transfers needed to fill a line (i.e. the number of words in a line) and t_wd is the time taken to transfer one word.

There are many ways of alleviating this penalty. An obvious improvement is to allow processing to continue as soon as the required word is available (early-restart [11]). This introduces some parallelism in that the CPU may continue whilst the line fetch completes. A further optimisation is to fetch the required word first – wrapping the line fetch addresses appropriately – (desired word first/wrapping-fetch [4]) thus reducing t_miss to a minimum (t_lat + t_wd). Both the early-restart and desired word first optimisations are employed in high performance architectures (e.g. RS/6000-560 [15]).

When employing early-restart a problem can occur when the processor attempts its subsequent memory access, in that the cache may still be fetching the remainder of the previous missed line. The next cycle may be any of the following:

1. A write operation.
2. A read operation which may be satisfied from the pre-existing cache.
3. A read operation which is a cache miss and requires another line fetch.
4. A read operation on a value which is in the last line to be requested.

Once it has been determined that a write operation will complete successfully (i.e. no exception occurs) it can be treated as fire and forget. The data can be written into the cache (or write-buffer [4]) thus hiding any further memory access penalty. For this reason write operations will be neglected for the remainder of this discussion.

It is clear that case (2) has no explicit dependency on the previously fetched cache line and such operations may – in principle – proceed independently of the line fetch, although cache access conflicts between these processes must be resolved. Simpler mechanisms resolve any conflicts by stalling the processor until the line fetch is complete; more sophisticated systems employ hit under miss [6] which allows the line fetch to proceed in the background whilst the processor continues in parallel.

It is apparent that case (3) requires its own cache miss processing and line fetch. If there is contention for resources it must either abandon the current line fetch or wait until it has completed before proceeding.

In situation (4) various options are available. One method, known as streaming, is to allow the read request to be synchronised with the incoming line fetch data. When the required value is available (the request might have to wait) it is forwarded to the processor. Another, simpler, option is to stall the request, allow the line fetch to complete and then send the required data (which is now known to be present) to the processor.
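As a rough illustration of the penalties discussed in this subsection, the sketch below compares the processor stall time under a whole-line fetch (equation 2.1), under early-restart, and under a desired-word-first fetch. The timings are borrowed from the DRAM figures used later in the paper (70ns initial latency, 40ns per word), but the example itself is hypothetical.

```python
def miss_penalty_whole_line(t_lat, t_wd, n_w):
    """Equation 2.1: fetch the whole line (lowest address first) before restarting."""
    return t_lat + n_w * t_wd

def miss_penalty_early_restart(t_lat, t_wd, word_index):
    """Restart as soon as the requested word (position word_index, fetched in
    address order from word 0) has arrived."""
    return t_lat + (word_index + 1) * t_wd

def miss_penalty_wrapping(t_lat, t_wd):
    """Desired word first: the requested word is always the first to arrive."""
    return t_lat + t_wd

# Illustrative timings: 70ns latency, 40ns per word, 4-word line,
# miss on the third word of the line (index 2).
t_lat, t_wd, n_w, idx = 70e-9, 40e-9, 4, 2
print(f"{miss_penalty_whole_line(t_lat, t_wd, n_w) * 1e9:.0f} ns")     # 230 ns
print(f"{miss_penalty_early_restart(t_lat, t_wd, idx) * 1e9:.0f} ns")  # 190 ns
print(f"{miss_penalty_wrapping(t_lat, t_wd) * 1e9:.0f} ns")            # 110 ns
```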

Options such as early-restart, streaming, hit under miss etc. can be combined to produce many different line fetch strategies. Three possible strategies are depicted graphically in Figure 1 and summarised below:

• Stall On Miss. A miss stalls the processor for the whole duration of the line fetch.

• Early Restart with Streaming. The desired word is fetched first and the CPU restarted early. Fetched data is streamed to the CPU as long as it requests it. If the CPU requests anything else it is stalled until the fetch completes.

• Non-Blocking. The previous strategy with the addition of hit under miss support. Requested data is forwarded to the processor as soon as possible in all cases, thus giving the maximum degree of parallelism.

3: Simulation Results

Simulation was performed to determine the benefits of the various strategies. An ARM [1] instruction level simulator [2] was augmented with code modelling different cache architectures. A selection of programs was run (single tasked) and statistics were recorded.

The results of cache read requests were first classified into three categories: misses, hits in the main array and hits in the last cache line to be fetched. Figure 2 shows the proportions of these when a representative program (MPEG decoder) executes on one example cache architecture.

[Figure 2: Hit Rate Breakdown for MPEG Decoder. Main array and LFL (last line fetched) hit rates vs. line length (2–32 words) for an 8Kbyte, 4-way set associative cache.]

The total height of a bar gives the classically measured hit rate, with the remainder representing read misses. The hit rate is divided into two parts: hits in the main array – which constitute the majority – and the smaller proportion of hits from the last line fetched.

This figure shows how both the total hit rate and the proportional contribution of the last line fetched to the hit rate increase as the size of the cache line increases. It should be noted that the proportional contribution of hits from the last line fetched to the total hit rate is significantly higher than if each cache line had equal importance. In an 8Kbyte, 4-way set associative cache with a 2 word line, for example, one cache line represents only 0.39% of cache storage; this is much smaller than the observed contribution of 4% from the last line fetched.

The "last line fetched" is an ephemeral location, rapidly replaced by the next line fetch. Further experiments were conducted to examine the degree of temporal locality exhibited during the line fetch. For each line fetch operation it was determined how many words contained within the line were used before a request for other data was issued. Since line fetching is performed on a demand basis, all line fetches result in at least one word being used, but they may use more before moving off the line.
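The classification and the "unbroken reads" statistic described above could be gathered roughly as follows. This is an illustrative sketch only – an idealised model with no evictions – and not the instrumentation used in the ARM simulator.

```python
from collections import Counter

def classify_reads(addresses, line_words=8):
    """Classify each read as a main-array hit, a hit in the last line fetched
    (LFL), or a miss, and record how many words of each fetched line are read
    before the first request off that line ('unbroken' reads)."""
    main_array = set()      # lines already copied back into the main array
    last_line = None        # line currently held in the line fetch latches
    stats = Counter()
    unbroken = []           # per line fetch: words used before moving off the line
    run, run_open = 0, False
    for addr in addresses:
        line = addr // line_words
        if line == last_line:
            stats['lfl_hit'] += 1
            if run_open:
                run += 1                  # still reading the freshly fetched line
        elif line in main_array:
            stats['main_hit'] += 1
            run_open = False              # moved off the fetched line
        else:
            stats['miss'] += 1            # instigates a new line fetch
            if last_line is not None:
                unbroken.append(run)
                main_array.add(last_line) # previous line copied back
            last_line, run, run_open = line, 1, True  # miss word itself is used
    if last_line is not None:
        unbroken.append(run)
    return stats, Counter(unbroken)

# Toy trace of word addresses (hypothetical):
stats, histogram = classify_reads([0, 1, 2, 40, 41, 0, 3, 80])
print(stats)      # Counter({'miss': 3, 'lfl_hit': 3, 'main_hit': 2})
print(histogram)  # unbroken-read counts per line fetch, cf. Figure 3
```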
[Figure 3: Histogram of Unbroken LFL Reads Following a Line Fetch. Distribution of the number of unbroken line fetch words read (1–8) following a line fetch, for an 8Kbyte, 4-way set associative, 8 word line cache running the MPEG decoder.]

Figure 3 shows a typical distribution for the MPEG decoder. It can be seen that for about 60% of line fetch operations only the word that caused the miss is used – indicating that streaming would fail to give any performance improvement in these cases. A non-blocking cache would, however, allow the following request to be serviced whilst allowing the line fetch to proceed in parallel.

4: Synchronous Implementations

As mentioned in §2.1, Stall On Miss is the most commonly used line fetch strategy since it is simple to implement. To improve performance some synchronous designs use an Early Restart based strategy (RS/6000-560 [15], ARM [1]); in particular the ARM series of cached microprocessors use a variant which will be described below.

In the ARM a cache miss stalls the processor and causes a line fetch to be started from the address of the lowest word in the line. At the end of each memory read cycle the processor's request is compared¹ with the address of the fetched data entering the cache. If they match it is sent to the CPU; if they do not, the CPU remains stalled for this cycle. Both the CPU and memory are clocked synchronously at the memory clock speed during this process. Thus streaming occurs as long as the processor continues to request data from sequential addresses.

The demand for greater performance has led to decoupled architectures becoming more common (PA-8000, MIPS, HaL R1, …), many of which utilise non-blocking caches in order to retain memory bandwidth during cache miss processing (HaL [7], P6 [6]). A decoupled architecture is one where the CPU is fed instructions from a prefetch buffer large enough to hide some or all of the latency from a line fetch. This requires that the prefetch buffer can subsequently be refilled when the line fetch is complete, so the cache must be able to supply instructions at a greater rate than the CPU requires.

Synchronous implementations to date are expensive from a hardware perspective, utilising multi-ported, split caches with wide cache to CPU buses. The implementation complexity of these synchronous non-blocking caches is inappropriate for a small scale asynchronous microprocessor.

4.1: An Asynchronous Implementation

The cache being designed is targeted for the next version of the AMULET asynchronous ARM [3]: AMULET2e. This includes an improved processor core, a cache and other peripherals on a single chip. To save space there will be a single, unified instruction and data cache. Data and instructions share the same memory interface, with data requests being arbitrated into the prefetched instruction stream. In this situation a non-blocking strategy should be particularly effective because it allows the independent prefetch and data streams to merge with minimum interference.

If the asynchronous fetch strategy is to allow for concurrent cache and line fetch activity, then there must be methods by which the two processes can be synchronised when necessary (e.g. when reading data that has just been fetched). In a synchronous system this is done by synchronising on clock edges, comparing the state of the two systems when they are known to be stable. In an asynchronous system no clock exists, thus an arbiter is usually employed. Arbiters, however, are potentially slow circuits and thus it is desirable to avoid them in a design.

Figure 4 illustrates the positioning of a pipelined cache between the micropipelined processor and memory. The cache is divided into three stages: the Tag and Data stores and the Memory Interface.

[Figure 4: A Micropipelined Cache. The CPU connects through the Tag Store and Data Store to the Memory Interface (MEM I/F) and main memory (MEM).]

Cache hit determination is done in the tag store, with all the cached data being held in the data store. The memory interface is responsible for all transactions that involve main memory and, in particular, it issues the multiple data read requests that constitute a line fetch operation.
¹ There is a match if either i) the word that has arrived is the word that caused the original miss or ii) the processor has requested a sequential address (a non-sequential request causes the CPU to stall until the line fetch has completed).

The key to this design is the division of the data store into two parts: the main (conventional) cache RAM array holding the majority of the cached data, and a set of line fetch latches. The line fetch latches comprise a data latch for each word in a line and are used to store data as it arrives from memory. This data is held here, rather than in the cache body, until the next line fetch is initiated.

When a read request arrives at the tag store it is classified as a hit in the main cache array, a hit in the line fetch latches or a miss. In the first two cases the data is read from the appropriate source and then sent to the CPU; the last must instigate a new line fetch.

The Hit and LFL Hit control signals, shown in Figure 5, are generated by tag comparisons. Hit indicates that the required data is cached and LFL Hit specifically indicates that it is in the line fetch latches rather than the main array. LFL Hit is determined by comparing the input address with an additional tag which indicates the range of addresses currently being held in the line fetch latches. Note that a cache miss, after instigating a line fetch, looks for its data in the line fetch latches.

[Figure 5: Control Circuit Request Steering. The Hit and LFL Hit signals steer a read request (Read Req/Ack) through select stages to READ LF, READ MAIN ARRAY or the LF ENGINE.]
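In software terms, the tag comparisons behind the Hit and LFL Hit signals of Figure 5 might be expressed as below. This is an interpretation for illustration only; the names (main_tags, lfl_tag) are invented, and the real circuit compares address tags rather than Python sets.

```python
def hit_signals(addr, main_tags, lfl_tag, line_words=4):
    """Generate the Hit and LFL Hit signals of Figure 5 (illustrative sketch).
    lfl_tag is the line number currently held in the line fetch latches,
    or None if the latches hold nothing useful."""
    line = addr // line_words
    lfl_hit = (lfl_tag is not None) and (line == lfl_tag)
    hit = lfl_hit or (line in main_tags)
    return hit, lfl_hit

# A request is steered to READ LF when lfl_hit is true, to READ MAIN ARRAY
# when hit is true but lfl_hit is false, and to the LF engine (a new line
# fetch, followed by a read of the line fetch latches) when hit is false.
```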
Assuming that the previous line fetch has completed, a read miss first instigates a new line fetch and then issues a request for the required word within the line fetch latches. It must then synchronise with the incoming data before carrying it back to the processor. Any following read operations that require data from this line are directed solely to the line fetch latches, where their data may already be present; if not, they too must wait until the data arrive.

A read hit in the main array requires no synchronisation, is routed to the main array, and can proceed unhindered. This route completely bypasses the line fetch process and so may be serviced independently and at any time. The cache is thus non-blocking.

The key point in this mechanism is the line fetch synchronisation block which must delay any line fetch reads until the correct data is present. Details of this block are shown in Figure 6.

For each word in the cache line there is both a line fetch latch and a synchronisation latch. The line fetch latches are edge triggered latches which hold data that has been fetched from memory. The synchronisation latches are transparent latches which stall latch read requests until the data that they require has arrived. A request to read a line fetch latch (Request In) is decoded and steered onto the correct individual word request line (LF Read Req0–3).

When a line fetch starts all the line fetch latches are empty and all the synchronisation latches are opaque. Opaque synchronisation latches block the progress of latch read requests until they are made transparent. This occurs when the line fetch engine receives data from memory (on the LF Data bus) and asserts a LF Latch Control signal. This causes the appropriate line fetch latch to latch the fetched data and makes the corresponding synchronisation latch transparent, allowing any stalled LF read request to proceed.

Using a transparent latch for synchronisation is not hazardous in this case because, although a synchronisation latch may be opened before, after or at the same time as a request arrives, it is never closed when a request may arrive and so there is no possibility of metastability.

[Figure 6: LF Read Synchronisation Circuit. One line fetch latch and one transparent synchronisation latch per word: LF Data and LF Latch Control arrive from the LF Engine; Request In is decoded by a word SELECT onto LF Read Req0–3, each gated by its synchronisation latch before Request Out; LF Complete indicates that the line is full.]

The only further synchronisation required is between successive line fetch operations, because they each require use of the line fetch engine and line fetch latches. Thus when a new read miss occurs it must wait for the preceding fetch to complete. Once the previous line fetch has completed, the contents of the line fetch latches must be copied into the main cache array before allowing the new fetch to begin.

In this scheme only the autonomous line fetch process is aware that a cache line fetch has completed. This is safe because it can stall a request for another line fetch arbitrarily. This mechanism also leaves the last line fetched in the line fetch latches until the next request arrives.

The final task of the line fetch engine is to copy the last fetched line into the main RAM array before accepting data from the next fetch. It does this when it is both ready (LF Complete is asserted) and a new read miss has occurred, in which instance it has control of the complete subsystem.

At this time the synchronisation latches are closed and the contents of the line fetch latches are copied – in parallel – into the RAM array as a standard cache line replacement. This is a rapid operation and may be hidden under the start-up time of the new line fetch.
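The behaviour of the synchronisation latches can be mimicked in software with one event per word: a request to read a line fetch latch blocks until the corresponding word has arrived, while main-array reads never touch these events. The sketch below is a behavioural analogy only (an event is not a transparent latch, and software raises no metastability question); all names are invented.

```python
import threading

class LineFetchLatches:
    """Behavioural model of the LF read synchronisation block (cf. Figure 6)."""
    def __init__(self, line_words=4):
        self.data = [None] * line_words
        # One 'synchronisation latch' per word: opaque (cleared) until the
        # word arrives, transparent (set) afterwards.
        self.ready = [threading.Event() for _ in range(line_words)]

    def start_new_fetch(self):
        """New miss takes over the latches: close all synchronisation latches
        (and discard the previous line, assumed already copied back)."""
        for ev in self.ready:
            ev.clear()
        self.data = [None] * len(self.data)

    def lf_latch_control(self, word_index, value):
        """Word arrives from memory: latch it and open its synchronisation
        latch, releasing any stalled read request for that word."""
        self.data[word_index] = value
        self.ready[word_index].set()

    def read(self, word_index):
        """LF Read Req: stalls until the requested word has been fetched."""
        self.ready[word_index].wait()
        return self.data[word_index]

# Usage: a reader blocks on word 2 until the 'line fetch engine' delivers it.
lfl = LineFetchLatches()
t = threading.Thread(target=lambda: print("word 2 =", lfl.read(2)))
t.start()
lfl.lf_latch_control(2, 0xDEADBEEF)   # releases the stalled read
t.join()
```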

5: Performance

A number of cache line fetch mechanisms have been described and combinations of these allow for a large number of different designs. The interaction between the different read requests can extend over several memory cycles and it is difficult to draw a clear picture of this.

Ultimately the desire is to maximise parallelism to increase performance. To determine the success of the proposed line fetch mechanism it has therefore been simulated using realistic estimated delays for units such as cache and main memory speeds. The speed of main memory was chosen to reflect the types of DRAM memory commonly available – a total latency of 70ns incurred when starting a line fetch, with words arriving 40ns apart thereafter. The speed at which the final asynchronous system would cycle was not well defined and so both 10ns and 20ns cycle times were simulated.

The base system consists of an 8Kbyte, 4-way set associative, unified cache in which external (buffered) writes do not block line fetch activity. Figure 7 shows how the total execution time for the MPEG decoder varies as the line length is increased. The performance of both the stalling (hollow symbols) and non-blocking (solid symbols) mechanisms is shown relative to the execution time of the fastest configuration.

[Figure 7: Overall Execution Time for Stalling and Non-Blocking Line Fetch Strategies. Relative execution time vs. line length (1–32 words) for an 8Kbyte, 4-way set associative cache running the MPEG decoder, with Non-Block and Stalling curves at both 10:40:70 and 20:40:70 timings.]

The ratio of CPU/cache hit cycle time to memory latency and transfer rate is shown in brackets; thus the upper pair of curves relates to the CPU/cache configuration with a slower 20ns cycle time and the lower to one with a 10ns cycle time. It can be seen that the non-blocking strategy consistently provides a speed advantage over the stalling mechanism; the speed-up ranges from 2.0% to 15.2%. These figures for speed-up are somewhat greater than those presented by Przybylski [10][11]. This is explained by the nature of this processor and cache combination, which is small as modern caches go, and whose unified nature means that data references are arbitrated into the instruction stream.

The graph confirms that the line length giving the lowest total execution time for the cache with a stalling fetch policy is two words [8]. The preferred line length for the non-blocking cache, however, is four words. When the line length is increased to eight words the non-blocking cache performance is degraded by 5.2% in the 10ns system and by 1.9% in the 20ns system. Even with this degradation in overall performance the non-blocking caches with eight-word lines still perform better than the optimal stalling cache configuration. In general the non-blocking mechanism is more amenable to an increasing line length since longer lines allow greater time for concurrent cache and CPU activity.
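For intuition about why longer lines penalise the stalling cache at these memory timings, the whole-line stall of equation 2.1 can be expressed in CPU cycles for the 10ns and 20ns systems. This is a back-of-the-envelope sketch using the stated 70ns/40ns DRAM timings, not a reproduction of the simulation.

```python
def stall_cycles(line_words, cycle_ns, t_lat_ns=70.0, t_wd_ns=40.0):
    """CPU cycles lost per miss when the processor stalls for a whole-line fetch."""
    return (t_lat_ns + line_words * t_wd_ns) / cycle_ns

for n in (2, 4, 8, 16, 32):
    print(n, round(stall_cycles(n, 10.0)), round(stall_cycles(n, 20.0)))
# A 2-word line costs about 15 cycles (10ns CPU) or 8 cycles (20ns CPU) per miss,
# while a 32-word line costs about 135 or 68 cycles respectively.
```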
6: Conclusions

A cache's performance is ultimately limited by its miss penalty, which is largely dominated by the length of the cache line. The time taken to fetch a cache line completely can impede the processor's operation.

To alleviate the miss penalty it is desirable to have parallel CPU and cache activity during line fetch operations, particularly in systems with small, unified caches where occasional data references can break up the largely sequential instruction stream. In this situation concurrent CPU and cache activity can improve the performance of an optimally configured cache and processor system by up to 10%.

A non-blocking cache can also be employed in situations where an increased line length is desired, for example to increase the overall cache size without increasing the tag store. In this situation the non-blocking cache performs better than the best blocking configuration.

Traditional synchronous processors provide some parallelism by streaming which, in an asynchronous framework, required some complex synchronisation problems to be solved. In designing an arbiter-free micropipelined circuit to provide this synchronisation it was found that a non-blocking cache, which further parallelises CPU and cache activity at no extra cost, was the simplest solution.

The design presented is modular. For example, few constraints are placed on the line fetch engine. This may fetch a line starting at any address (lowest, requested etc.) at the designer's discretion, and words within the line may be fetched in any order. This allows design decisions to be made here which will not affect the function of other parts of the circuit, although they may alter overall performance.

Advanced features such as cache hints may also be included. These start a line fetch in advance of the need for the data and cannot be allowed to stall the cache. Such software controlled cache prefetch directives have been shown to be highly effective at increasing the performance of code. Again this enhancement does not change the basic design of the circuit.

Finally it may be possible to abort a line fetch if it seems unnecessary. Again this would be a local function of the line fetch engine, but in this instance arbitration would be required to allow a new line fetch request to interrupt one already in progress. Although this process has not yet been investigated it does not seem unlikely that a further performance benefit could accrue.

7: References

[1] "ARM600 Datasheet", ARM Ltd., Cambridge, England, September 1991.
[2] "ARM Tools 200", ARM Ltd., Cambridge, England, January 1995.
[3] Furber, S.B., et al., "A Micropipelined ARM", in Proceedings of VLSI '93, Grenoble, France, September 1993.
[4] Hennessy, J.L., Patterson, D.A., "Computer Architecture: A Quantitative Approach", Morgan Kaufmann, 1990.
[5] Horowitz, M., et al., "MIPS-X: A 20-MIPS Peak, 32-bit Microprocessor with On-Chip Cache", IEEE Journal of Solid-State Circuits, Vol. 22, No. 5, October 1987.
[6] Gwennap, L., "Intel's P6 Uses Decoupled Superscalar Design", Microprocessor Report, Vol. 9, No. 2, February 16, 1995.
[7] Gwennap, L., "HaL Reveals Multichip SPARC Processor", Microprocessor Report, Vol. 9, No. 3, March 6, 1995.
[8] Mehra, R., "Micropipeline Cache Design Strategies for an Asynchronous Microprocessor", M.Sc. Thesis, University of Manchester, October 1992.
[9] Przybylski, S., Horowitz, M., Hennessy, J., "Performance Tradeoffs in Cache Design", Proceedings of the 15th Annual International Symposium on Computer Architecture, Honolulu, May 1988, pp. 290–298.
[10] Przybylski, S., "The Performance Impact of Block Sizes and Fetch Strategies", ACM SIGARCH Computer Architecture News, Vol. 18, No. 2, pp. 160–169, June 1990.
[11] Przybylski, S., "Cache and Memory Hierarchy Design", Morgan Kaufmann, 1990.
[12] Smith, A.J., "Cache Memories", ACM Computing Surveys, Vol. 14, No. 3, September 1982, pp. 473–530.
[13] Sutherland, I.E., "Micropipelines", Communications of the ACM, Vol. 32, No. 6, June 1989, pp. 720–738.
[14] Taufik, T., "Cache and Memory Design Considerations for the Intel486 DX2 Microprocessor", Intel Application Note AP-469, May 1992.
[15] Weiss, S., Smith, J.E., "POWER and PowerPC", Morgan Kaufmann, 1994.