Memory Hierarchy Design
Chapter 5 and Appendix C

Overview
• Problem
  – CPU vs. memory performance imbalance
• Solution
  – Driven by temporal and spatial locality
  – Memory hierarchies
    • Fast L1, L2, L3 caches
    • Larger but slower memories
    • Even larger but even slower secondary storage
• Keep most of the action in the higher levels

Locality of Reference
• Temporal and spatial
• Sequential access to memory
• Unit-stride loop (cache lines = 256 bits)
    for (i = 1; i < 100000; i++)
        sum = sum + a[i];
• Non-unit-stride loop (cache lines = 256 bits)
    for (i = 0; i <= 100000; i = i+8)
        sum = sum + a[i];

Cache Systems
[Figure: a 400 MHz CPU attached over a 66 MHz bus to a 10 MHz main memory, without and with a cache; data objects move between the CPU and the cache, while blocks are transferred between the cache and main memory.]

Example: Two-level Hierarchy Access Time
[Figure: average access time falls from T1+T2 toward T1 as the hit ratio increases from 0 to 1.]

Basic Cache Read Operation
• CPU requests the contents of a memory location
• Check the cache for this data
• If present, get it from the cache (fast)
• If not present, read the required block from main memory into the cache
• Then deliver it from the cache to the CPU
• The cache includes tags to identify which block of main memory is in each cache slot

Elements of Cache Design
• Cache size
• Line (block) size
• Number of caches
• Mapping function
  – Block placement
  – Block identification
• Replacement algorithm
• Write policy

Cache Size
• Cache size << main memory size
• Small enough
  – Minimize cost
  – Speed up access (fewer gates to address the cache)
  – Keep the cache on chip
• Large enough
  – Minimize average access time
• Optimum size depends on the workload
• Practical size?

Line Size
• Optimum size depends on the workload
• Small blocks do not exploit the locality-of-reference principle
• Larger blocks reduce the number of blocks
  – Replacement overhead
• Practical sizes?

Number of Caches
• Increased logic density => on-chip cache
  – Internal cache: level 1 (L1)
  – External cache: level 2 (L2)
• Unified cache
  – Balances the load between instruction and data fetches
  – Only one cache needs to be designed / implemented
• Split caches (data and instruction)
  – Pipelined, parallel architectures

Mapping Function
• Cache lines << main memory blocks
• Direct mapping
  – Maps each block into only one possible line
  – (block address) MOD (number of lines)
• Fully associative
  – Block can be placed anywhere in the cache
• Set associative
  – Block can be placed in a restricted set of lines
  – (block address) MOD (number of sets in cache)

Cache Addressing
Block address = Tag and Index fields, followed by the Block offset
• Block offset – selects the data object from the block
• Index – selects the block set
• Tag – used to detect a hit
[Figure: the Tag, Index, and Block offset fields of an address, and how they select a block in the cache and in main memory.]

Direct Mapping
[Figure]

Associative Mapping
[Figure]

K-Way Set Associative Mapping
[Figure]

Replacement Algorithm
• Simple for direct-mapped: no choice
• Random
  – Simple to build in hardware
• LRU

Miss rates, LRU vs. random:

              Two-way           Four-way          Eight-way
  Size        LRU     Random    LRU     Random    LRU     Random
  16 KB       5.18%   5.69%     4.67%   5.29%     4.39%   4.96%
  64 KB       1.88%   2.01%     1.54%   1.66%     1.39%   1.53%
  256 KB      1.15%   1.17%     1.13%   1.13%     1.12%   1.12%

Write Policy
• Write is more complex than read
  – Write and tag comparison cannot proceed simultaneously
  – Only a portion of the line has to be updated
• Write policies
  – Write through – write to the cache and to memory
  – Write back – write only to the cache (dirty bit)
• Write miss:
  – Write allocate – load the block on a write miss
  – No-write allocate – update directly in memory

Alpha AXP 21064 Cache
[Figure: the address is split into a 21-bit tag, an 8-bit index, and a 5-bit offset; each cache line holds a valid bit, a tag, and 256 bits of data; a comparator (=?) checks the tag to detect a hit; a write buffer sits between the cache and lower-level memory.]
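The mapping and addressing rules above can be made concrete with a few lines of C. The sketch below decomposes a byte address for a hypothetical direct-mapped cache; the 32-byte line size and 256-line capacity are assumptions chosen for illustration, not parameters taken from these slides.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical direct-mapped cache: 256 lines of 32 bytes (8 KB). */
    #define BLOCK_SIZE 32u    /* bytes per line -> 5 offset bits */
    #define NUM_LINES  256u   /* lines in cache -> 8 index bits  */

    int main(void)
    {
        uint32_t addr = 0x12345678u;

        uint32_t offset = addr % BLOCK_SIZE;   /* block offset: selects the data object within the line  */
        uint32_t block  = addr / BLOCK_SIZE;   /* block address                                           */
        uint32_t index  = block % NUM_LINES;   /* (block address) MOD (number of lines): selects the line */
        uint32_t tag    = block / NUM_LINES;   /* compared with the stored tag to detect a hit            */

        printf("addr=0x%08x tag=0x%x index=%u offset=%u\n", addr, tag, index, offset);
        return 0;
    }

For a set-associative cache the same arithmetic applies with (number of sets) in place of (number of lines).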
Write Merging
[Figure: without merging, four sequential writes to addresses 100, 104, 108, and 112 occupy four write-buffer entries, each with a single valid word; with write merging they are combined into one entry at address 100 with all four valid bits set.]

DECstation 5000 Miss Rates
[Figure: miss rate (%) for the instruction cache, data cache, and a unified cache as cache size grows from 1 KB to 128 KB; direct-mapped caches with 32-byte blocks; the percentage of instruction references is 75%.]

Cache Performance Measures
• Hit rate: fraction of accesses found in that level
  – Usually so high that we talk about the miss rate instead
  – Miss-rate fallacy: miss rate can be as misleading a measure as MIPS is for CPU performance
• Average memory-access time = Hit time + Miss rate x Miss penalty (ns)
• Miss penalty: time to replace a block from the lower level, including the time to replace it in the CPU
  – Access time to the lower level = f(latency to lower level)
  – Transfer time: time to transfer the block = f(bandwidth)

Cache Performance Improvements
• Average memory-access time = Hit time + Miss rate x Miss penalty
• Cache optimizations
  – Reducing the miss rate
  – Reducing the miss penalty
  – Reducing the hit time

Example
Which has the lower average memory access time:
  – a 16-KB instruction cache with a 16-KB data cache, or
  – a 32-KB unified cache?
Hit time = 1 cycle
Miss penalty = 50 cycles
Load/store hit = 2 cycles on a unified cache
Given: 75% of memory accesses are instruction references.
Overall miss rate for split caches = 0.75 * 0.64% + 0.25 * 6.47% = 2.10%
Miss rate for unified cache = 1.99%
Average memory access times:
  Split   = 0.75 * (1 + 0.0064 * 50) + 0.25 * (1 + 0.0647 * 50) = 2.05
  Unified = 0.75 * (1 + 0.0199 * 50) + 0.25 * (2 + 0.0199 * 50) = 2.24

Cache Performance Equations
CPU time = (CPU execution cycles + Memory stall cycles) * Cycle time
Memory stall cycles = Memory accesses * Miss rate * Miss penalty
CPU time = IC * (CPI_execution + Memory accesses per instruction * Miss rate * Miss penalty) * Cycle time
Misses per instruction = Memory accesses per instruction * Miss rate
CPU time = IC * (CPI_execution + Misses per instruction * Miss penalty) * Cycle time

Reducing Miss Penalty
• Multi-level caches
• Critical word first and early restart
• Priority to read misses over writes
• Merging write buffers
• Victim caches

Multi-Level Caches
• Avg memory access time = Hit time (L1) + Miss rate (L1) x Miss penalty (L1)
• Miss penalty (L1) = Hit time (L2) + Miss rate (L2) x Miss penalty (L2)
• Avg memory access time = Hit time (L1) + Miss rate (L1) x (Hit time (L2) + Miss rate (L2) x Miss penalty (L2))
• Local miss rate: number of misses in a cache divided by the total number of accesses to that cache
• Global miss rate: number of misses in a cache divided by the total number of memory accesses generated by the CPU

Performance of Multi-Level Caches
[Figure]

Critical Word First and Early Restart
• Critical word first: request the missed word first from memory
• Early restart: fetch in normal order, but as soon as the requested word arrives, send it to the CPU

Giving Priority to Read Misses over Writes
    SW R3, 512(R0)
    LW R1, 1024(R0)
    LW R2, 512(R0)
• Direct-mapped, write-through cache mapping 512 and 1024 to the same block, and a four-word write buffer
• Will R2 = R3?
• Priority for the read miss?

Victim Caches
[Figure]
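The split-versus-unified arithmetic in the Example slide above can be checked with a short C program; the miss rates, hit times, and 50-cycle penalty are the values quoted on that slide.

    #include <stdio.h>

    int main(void)
    {
        /* Parameters from the example: 75% instruction references,
           50-cycle miss penalty, 1-cycle hit (2 cycles for a load/store
           hit on the unified cache because of the structural hazard). */
        double instr_frac = 0.75, data_frac = 0.25, penalty = 50.0;

        /* Split caches: 16-KB I-cache (0.64% misses), 16-KB D-cache (6.47% misses). */
        double amat_split = instr_frac * (1.0 + 0.0064 * penalty)
                          + data_frac  * (1.0 + 0.0647 * penalty);

        /* Unified 32-KB cache: 1.99% miss rate, 2-cycle hit for data accesses. */
        double amat_unified = instr_frac * (1.0 + 0.0199 * penalty)
                            + data_frac  * (2.0 + 0.0199 * penalty);

        printf("split   = %.2f cycles\n", amat_split);    /* ~2.05 */
        printf("unified = %.2f cycles\n", amat_unified);  /* ~2.24 */
        return 0;
    }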
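To make the two-level formulas on the Multi-Level Caches slide concrete, the sketch below plugs in hypothetical values (1-cycle L1 hit, 10-cycle L2 hit, 100-cycle memory penalty, 4% local L1 miss rate, 50% local L2 miss rate); none of these numbers come from the slides.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical two-level hierarchy -- values chosen for illustration only. */
        double hit_L1 = 1.0, hit_L2 = 10.0, mem_penalty = 100.0;
        double miss_L1 = 0.04;   /* local miss rate of L1                        */
        double miss_L2 = 0.50;   /* local miss rate of L2 (misses / L2 accesses) */

        /* The L1 miss penalty is itself the average access time of L2. */
        double penalty_L1 = hit_L2 + miss_L2 * mem_penalty;
        double amat = hit_L1 + miss_L1 * penalty_L1;

        /* Global L2 miss rate: fraction of all CPU references missing in both levels. */
        double global_L2 = miss_L1 * miss_L2;

        printf("AMAT = %.2f cycles, global L2 miss rate = %.1f%%\n",
               amat, 100.0 * global_L2);
        return 0;
    }

Note that the global L2 miss rate (misses per CPU reference) is the product of the two local miss rates, which is why second-level caches are judged by their global rather than local miss rate.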
Types of Cache Misses
• Compulsory
  – First-reference or cold-start misses
• Capacity
  – The working set is too big for the cache
  – Occur even in fully associative caches
• Conflict (collision)
  – Many blocks map to the same block frame (line)
  – Affects
    • Set-associative caches
    • Direct-mapped caches

Miss Rates: Absolute and Distribution
[Figure]

Reducing the Miss Rates
1. Larger block size
2. Larger caches
3. Higher associativity
4. Pseudo-associative caches
5. Compiler optimizations

Reducing Miss Rates: 1. Larger Block Size
• Effects of larger block sizes
  – Reduction of compulsory misses
    • Spatial locality
  – Increase of miss penalty (transfer time)
  – Reduction of the number of blocks
    • Potential increase of conflict misses
• Latency and bandwidth of lower-level memory
  – High latency and high bandwidth => large block size
    • Small increase in miss penalty

Example
[Figure]

2. Larger Caches
• More blocks
• Higher probability of getting the data
• Longer hit time and higher cost
• Primarily used in second-level caches

3. Higher Associativity
• Eight-way set associative is good enough
• 2:1 cache rule:
  – Miss rate of a direct-mapped cache of size N = miss rate of a 2-way set-associative cache of size N/2
• Higher associativity can increase
  – Clock cycle time
  – Hit time for 2-way vs. 1-way: external cache +10%, internal +2%

4. Pseudo-Associative Caches
• Can we get the fast hit time of a direct-mapped cache and the lower conflict misses of a 2-way set-associative cache?
• Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, we have a pseudo-hit (slow hit)
[Figure: access time on a hit, a pseudo-hit, and a miss.]
• Drawbacks:
  – CPU pipeline design is hard if a hit takes 1 or 2 cycles
  – Better for caches not tied directly to the processor (L2)
  – Used in the MIPS R10000 L2 cache; similar in UltraSPARC

Pseudo-Associative Cache
[Figure: CPU address and data paths into the cache; the tag comparison (=?) is performed first on the primary line and then on the alternate line; a write buffer connects the cache to lower-level memory.]

5. Compiler Optimizations
• Avoid hardware changes
• Instructions
  – Profiling to look at conflicts between groups of instructions
• Data
  – Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  – Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

Loop Interchange
    /* Before */
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

    /* After */
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

• Sequential accesses instead of striding through memory every 100 words; improved spatial locality
• Same number of executed instructions

Blocking (1/2)
    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k]*z[k][j];
            x[i][j] = r;
        };

• Two inner loops:
  – Read all NxN elements of z[]
  – Read N elements of one row of y[] repeatedly
  – Write N elements of one row of x[]
• Capacity misses are a function of N and the cache size:
  – If the cache can hold all 3 NxN matrices (4 bytes per element), there are no capacity misses
  – Idea: compute on a BxB submatrix that fits

Blocking (2/2)
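The section breaks off at the "Blocking (2/2)" slide. As a hedged sketch of what the blocked version of the loop nest above typically looks like: the block factor B is an assumed parameter chosen so that three BxB submatrices fit in the cache, N is assumed to be a multiple of B, and x[][] is assumed to start at zero. This is a standard illustration of the technique, not code recovered from the missing slide.

    /* Blocked matrix multiply: the two inner loops touch only BxB
       submatrices of y[] and z[], so the working set fits in the cache. */
    #define N 512
    #define B 32   /* block factor: assumed; pick B so 3*B*B elements fit in the cache */

    void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
    {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double r = 0;
                        for (int k = kk; k < kk + B; k++)
                            r = r + y[i][k] * z[k][j];
                        x[i][j] = x[i][j] + r;   /* accumulate partial sums across kk blocks */
                    }
    }

With this structure, each pass reads a BxB block of z[] and a row strip of y[] repeatedly instead of streaming through all NxN elements of z[] per row of x[], which is the temporal-locality improvement the Blocking (1/2) slide motivates.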