Memory Hierarchy Design (Chapter 5 and Appendix C)

Overview

• Problem
  – CPU vs. memory performance imbalance
• Solution
  – Memory hierarchies, driven by temporal and spatial locality
    • Fast L1, L2, L3 caches
    • Larger but slower main memory
    • Even larger but even slower secondary storage
  – Keep most of the action in the higher levels

Locality of Reference

• Temporal and spatial locality
• Sequential access to memory
• Unit-stride loop (cache lines = 256 bits):

for (i = 1; i < 100000; i++)
    sum = sum + a[i];

• Non-unit-stride loop (cache lines = 256 bits):

for (i = 0; i <= 100000; i = i+8)
    sum = sum + a[i];

Cache Systems

[Figure: a 400 MHz CPU on a 66 MHz bus to 10 MHz main memory, without and with a cache. Data objects transfer between the CPU and the cache; blocks transfer between the cache and main memory.]

Example: Two-level Hierarchy

[Figure: average access time vs. hit ratio, falling from T1+T2 at a hit ratio of 0 to T1 at a hit ratio of 1.]

Basic Cache Read Operation

• CPU requests the contents of a memory location
• Check the cache for this data
• If present, get it from the cache (fast)
• If not present, read the required block from main memory into the cache
• Then deliver the data from the cache to the CPU
• The cache includes tags to identify which block of main memory is in each cache slot

Elements of Cache Design

• Cache size
• Line (block) size
• Number of caches
• Mapping function
  – Block placement
  – Block identification
• Replacement algorithm
• Write policy

Cache Size

• Cache size << main memory size
• Small enough to
  – Minimize cost
  – Speed up access (fewer gates to address the cache)
  – Keep the cache on chip
• Large enough to
  – Minimize average access time
• The optimum size depends on the workload
• Practical size?

Line Size

• The optimum size depends on the workload
• Small blocks do not exploit the locality-of-reference principle
• Larger blocks reduce the number of blocks in the cache
  – Replacement overhead
• Practical sizes?

Number of Caches

• Increased logic density => on-chip cache
  – Internal cache: level 1 (L1)
  – External cache: level 2 (L2)
• Unified cache
  – Balances the load between instruction and data fetches
  – Only one cache needs to be designed / implemented
• Split caches (data and instruction)
  – Pipelined, parallel architectures

Mapping Function

• Cache lines << main memory blocks
• Direct mapping
  – Maps each block into only one possible line
  – line = (block address) MOD (number of lines)
• Fully associative
  – A block can be placed anywhere in the cache
• Set associative
  – A block can be placed in a restricted set of lines
  – set = (block address) MOD (number of sets in the cache)

Cache Addressing

• Block address = Tag | Index; the full address appends the Block offset
• Block offset – selects the data object from the block
• Index – selects the block set
• Tag – used to detect a hit
(A small C sketch of this address decomposition appears at the end of this section.)

Direct Mapping

[Figure: direct-mapped lookup, with the index selecting a single cache line whose tag is compared against the address tag.]

Associative Mapping

[Figure: fully associative lookup, with the address tag compared against every cache line.]

K-Way Set Associative Mapping

[Figure: set-associative lookup, with the index selecting a set and the address tag compared against the K lines of that set.]

Replacement Algorithm

• Simple for direct-mapped: no choice
• Random
  – Simple to build in hardware
• LRU (see the bookkeeping sketch at the end of this section)

Miss rates, LRU vs. random replacement:

          Two-way           Four-way          Eight-way
Size      LRU     Random    LRU     Random    LRU     Random
16 KB     5.18%   5.69%     4.67%   5.29%     4.39%   4.96%
64 KB     1.88%   2.01%     1.54%   1.66%     1.39%   1.53%
256 KB    1.15%   1.17%     1.13%   1.13%     1.12%   1.12%

Write Policy

• A write is more complex than a read
  – The write and the tag comparison cannot proceed simultaneously
  – Only a portion of the line has to be updated
• Write policies
  – Write through – write to both the cache and memory
  – Write back – write only to the cache (dirty bit)
• Write miss:
  – Write allocate – load the block on a write miss
  – No-write allocate – update directly in memory

Alpha AXP 21064 Cache

[Figure: the address splits into a 21-bit tag, an 8-bit index, and a 5-bit block offset; each cache line holds a valid bit, a tag, and 256 bits of data; a comparator on the tag detects a hit, and a write buffer sits between the cache and the lower-level memory.]
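To make the Tag | Index | Block offset split concrete, here is a minimal C sketch of how a direct-mapped cache could decompose an address. The 32-bit address width and the field sizes are assumptions chosen to echo the 21064-style split above (5 offset bits for 32-byte blocks, 8 index bits for 256 lines); they are not parameters fixed by these slides.

#include <stdio.h>

#define OFFSET_BITS 5   /* 32-byte blocks  (assumed) */
#define INDEX_BITS  8   /* 256 cache lines (assumed) */

static unsigned block_offset(unsigned addr) { return addr & ((1u << OFFSET_BITS) - 1); }
static unsigned index_field(unsigned addr)  { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
static unsigned tag_field(unsigned addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }

int main(void) {
    unsigned addr = 0x12345678;
    /* The index selects the line, the tag is compared to detect a hit,
       and the offset selects the data object within the block. */
    printf("tag=0x%x index=%u offset=%u\n",
           tag_field(addr), index_field(addr), block_offset(addr));
    return 0;
}

Note that the index field computes exactly the direct-mapping rule from the Mapping Function slide, line = (block address) MOD (number of lines), since keeping the low INDEX_BITS of the block address is a MOD by a power of two.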
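The LRU policy from the Replacement Algorithm slide needs almost no state for a two-way cache: a single "used last" bit per set identifies the LRU way. The sketch below is an illustrative toy, not code from the slides; all names in it are invented for the example.

#include <stdio.h>

struct set2 {
    unsigned tag[2];
    int      valid[2];
    int      mru;          /* which way was touched most recently */
};

/* Look up a tag in a two-way set; on a miss, evict the LRU way. */
static int access_set(struct set2 *s, unsigned tag) {
    for (int w = 0; w < 2; w++)
        if (s->valid[w] && s->tag[w] == tag) {
            s->mru = w;            /* hit: update recency */
            return w;
        }
    int victim = 1 - s->mru;       /* the way not used last is LRU */
    s->tag[victim] = tag;
    s->valid[victim] = 1;
    s->mru = victim;
    return victim;
}

int main(void) {
    struct set2 s = {{0, 0}, {0, 0}, 0};
    unsigned refs[] = {1, 2, 1, 3};    /* after 1,2,1 the LRU tag is 2 */
    for (int i = 0; i < 4; i++)
        printf("tag %u -> way %d\n", refs[i], access_set(&s, refs[i]));
    return 0;                          /* tag 3 evicts tag 2, as LRU demands */
}

For four or more ways, real hardware keeps per-set recency state (true LRU bits or a cheaper pseudo-LRU tree) rather than a single bit.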
Write Merging

[Figure: a four-entry write buffer before merging holds sequential writes to addresses 100, 104, 108, and 112 in separate entries, each with a single valid word; after merging, all four words occupy the valid slots of one entry at address 100.]

DECstation 5000 Miss Rates

[Figure: miss rate (%) versus cache size, from 1 KB to 128 KB, for the instruction cache, data cache, and a unified cache; direct-mapped caches with 32-byte blocks; the percentage of instruction references is 75%.]

Cache Performance Measures

• Hit rate: the fraction of accesses found in that level
  – Usually so high that we quote the miss rate instead
  – Miss-rate fallacy: the miss rate can mislead about memory performance just as MIPS can mislead about CPU performance
• Average memory-access time = Hit time + Miss rate × Miss penalty (ns)
• Miss penalty: the time to replace a block from the lower level, including the time to deliver it to the CPU
  – Access time to the lower level = f(latency of the lower level)
  – Transfer time: the time to transfer the block = f(bandwidth between the levels)

Cache Performance Improvements

• Average memory-access time = Hit time + Miss rate × Miss penalty
• Cache optimizations
  – Reducing the miss rate
  – Reducing the miss penalty
  – Reducing the hit time

Example

Which has the lower average memory access time: a 16-KB instruction cache paired with a 16-KB data cache, or a 32-KB unified cache?
• Hit time = 1 cycle
• Miss penalty = 50 cycles
• Load/store hit = 2 cycles on the unified cache (its single port is busy with the instruction fetch)
• Given: 75% of memory accesses are instruction references

Overall miss rate for the split caches = 0.75 × 0.64% + 0.25 × 6.47% = 2.10%
Miss rate for the unified cache = 1.99%

Average memory access times (see the C sketch at the end of this section):
Split   = 0.75 × (1 + 0.0064 × 50) + 0.25 × (1 + 0.0647 × 50) = 2.05
Unified = 0.75 × (1 + 0.0199 × 50) + 0.25 × (2 + 0.0199 × 50) = 2.24

The split caches win despite their higher overall miss rate.

Cache Performance Equations

CPU time = (CPU execution cycles + Memory stall cycles) × Cycle time
Memory stall cycles = Memory accesses × Miss rate × Miss penalty
CPU time = IC × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Cycle time
Misses per instruction = Memory accesses per instruction × Miss rate
CPU time = IC × (CPI_execution + Misses per instruction × Miss penalty) × Cycle time

Reducing Miss Penalty

• Multi-level caches
• Critical word first and early restart
• Priority to read misses over writes
• Merging write buffers
• Victim caches

Multi-Level Caches

• Avg memory access time = Hit time(L1) + Miss rate(L1) × Miss penalty(L1)
• Miss penalty(L1) = Hit time(L2) + Miss rate(L2) × Miss penalty(L2)
• Avg memory access time = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))
• Local miss rate: the number of misses in a cache divided by the total number of accesses to that cache
• Global miss rate: the number of misses in a cache divided by the total number of memory accesses generated by the CPU

Performance of Multi-Level Caches

[Figure.]

Critical Word First and Early Restart

• Critical word first: request the missed word first from memory
• Early restart: fetch in normal order, but as soon as the requested word arrives, send it to the CPU

Giving Priority to Read Misses over Writes

SW R3, 512(R0)
LW R1, 1024(R0)
LW R2, 512(R0)

• A direct-mapped, write-through cache in which 512 and 1024 map to the same block, and a four-word write buffer
• Will R2 = R3? (Only if the read miss on address 512 checks the write buffer, or waits for it to drain; otherwise the load reads the stale value from memory.)
• Priority for the read miss?

Victim Caches

[Figure.]
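The split-versus-unified comparison on the Example slide above is just the average memory-access time formula applied twice. Here is a minimal C sketch of that calculation, using only the numbers given on the slide:

#include <stdio.h>

/* Average memory-access time = Hit time + Miss rate * Miss penalty. */
int main(void) {
    double penalty = 50.0;                 /* miss penalty, cycles          */
    double f_instr = 0.75, f_data = 0.25;  /* instruction / data fractions  */

    /* Split caches: 1-cycle hits, miss rates 0.64% (instr) and 6.47% (data). */
    double split = f_instr * (1 + 0.0064 * penalty)
                 + f_data  * (1 + 0.0647 * penalty);

    /* Unified cache: 1.99% miss rate; a load/store hit costs 2 cycles. */
    double unified = f_instr * (1 + 0.0199 * penalty)
                   + f_data  * (2 + 0.0199 * penalty);

    printf("split = %.3f cycles, unified = %.3f cycles\n", split, unified);
    return 0;
}

This prints 2.049 and 2.245 cycles, the values the slide rounds to 2.05 and 2.24.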
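The Multi-Level Caches formulas compose in the same way: the L1 miss penalty is itself an L2 access. A small sketch follows; the sample parameters are invented for illustration and do not come from the slides.

#include <stdio.h>

/* AMAT = Hit(L1) + MissRate(L1) * (Hit(L2) + MissRate(L2) * MissPenalty(L2)).
   mr2 is the *local* L2 miss rate: L2 misses / L2 accesses. */
static double amat_two_level(double hit1, double mr1,
                             double hit2, double mr2, double penalty2) {
    return hit1 + mr1 * (hit2 + mr2 * penalty2);
}

int main(void) {
    /* Assumed numbers: 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit,
       25% local L2 miss rate, 100-cycle penalty to main memory. */
    printf("AMAT = %.2f cycles\n", amat_two_level(1.0, 0.04, 10.0, 0.25, 100.0));
    return 0;   /* 1 + 0.04 * (10 + 0.25 * 100) = 2.40 cycles */
}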
Types of Cache Misses

• Compulsory
  – First-reference or cold-start misses
• Capacity
  – The working set is too big for the cache
  – Occur even in fully associative caches
• Conflict (collision)
  – Many blocks map to the same block frame (line)
  – Affects
    • Set associative caches
    • Direct mapped caches

Miss Rates: Absolute and Distribution

[Figure: miss rates broken down into compulsory, capacity, and conflict components, in absolute terms and as a distribution, across cache sizes.]

Reducing the Miss Rates

1. Larger block size
2. Larger caches
3. Higher associativity
4. Pseudo-associative caches
5. Compiler optimizations

1. Larger Block Size

• Effects of larger block sizes
  – Reduction of compulsory misses
    • Spatial locality
  – Increase of the miss penalty (transfer time)
  – Reduction of the number of blocks
    • Potential increase of conflict misses
• Latency and bandwidth of the lower-level memory
  – High latency and high bandwidth => large block size
    • Small increase in miss penalty

Example

[Figure.]

2. Larger Caches

• More blocks
• Higher probability of getting the data
• Longer hit time and higher cost
• Primarily used in 2nd-level caches

3. Higher Associativity

• Eight-way set associative is good enough
• 2:1 cache rule:
  – The miss rate of a direct-mapped cache of size N equals the miss rate of a 2-way set-associative cache of size N/2
• Higher associativity can increase
  – The clock cycle time
  – The hit time: 2-way vs. 1-way is +10% for an external cache, +2% for an internal one

4. Pseudo-Associative Caches

• Can we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
• Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, we have a pseudo-hit (slow hit)
• Timing: hit time < pseudo-hit time < miss penalty
• Drawbacks:
  – CPU pipeline design is hard if a hit takes 1 or 2 cycles
  – Better for caches not tied directly to the processor (L2)
  – Used in the MIPS R10000 L2 cache; similar in UltraSPARC

Pseudo Associative Cache

[Figure: CPU address and data paths into the cache; on a miss, the tag of the alternate line is checked before the request goes through the write buffer to the lower-level memory.]

5. Compiler Optimizations

• Avoid hardware changes
• Instructions
  – Profiling to look at conflicts between groups of instructions
• Data
  – Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  – Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

Loop Interchange

/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];

• Sequential accesses instead of striding through memory every 100 words; improved spatial locality
• Same number of executed instructions

Blocking (1/2)

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    };

• Two inner loops:
  – Read all N×N elements of z[]
  – Read N elements of one row of y[] repeatedly
  – Write N elements of one row of x[]
• Capacity misses are a function of N and the cache size:
  – If all three N×N matrices (3 × N × N × 4 bytes) fit, there are no capacity misses
  – Idea: compute on a B×B submatrix that fits

Blocking (2/2)
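The "Blocking (2/2)" slide is cut off in this copy. Below is a sketch of the standard blocked form of the kernel above, with blocking factor B; it shows the idea the previous slide names (compute on a B×B submatrix that fits in the cache), though it is not necessarily the exact code from the missing slide.

#include <stdio.h>

#define N 64
#define B 16   /* blocking factor; assumes a BxB working set fits in the cache */

static double x[N][N], y[N][N], z[N][N];   /* x starts zeroed (static storage) */

int main(void) {
    /* Deterministic inputs, just so the program does observable work. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            y[i][j] = i + j;
            z[i][j] = i - j;
        }

    /* Blocked matrix multiply: the jj/kk loops walk BxB submatrices, so each
       pass reuses a small strip of y and z while it is still resident in the
       cache, converting capacity misses into hits. The extra bound checks
       against N handle the case where B does not divide N evenly. */
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B && j < N; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B && k < N; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;   /* accumulate the partial dot product */
                }

    printf("x[0][0] = %g\n", x[0][0]);
    return 0;
}

The payoff is that each (jj, kk) pass reuses data the unblocked loop would have evicted before touching it again; the cost is two extra loop levels and a little more index arithmetic.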
