332 Advanced Computer Architecture, Chapter 2: Caches and Memory Systems
January 2007, Paul H J Kelly
These lecture notes are partly based on the course text, Hennessy and Patterson's Computer Architecture: a quantitative approach (3rd and 4th eds), and on the lecture slides of David Patterson and John Kubiatowicz's Berkeley course.

Review: Cache performance
Miss-oriented approach to memory access:
  CPUtime = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime
  CPUtime = IC × (CPI_Execution + MemMisses/Inst × MissPenalty) × CycleTime
  CPI_Execution includes ALU and memory instructions

Separating out the memory component entirely:
  AMAT = Average Memory Access Time
  CPI_AluOps does not include memory instructions
  CPUtime = IC × (AluOps/Inst × CPI_AluOps + MemAccess/Inst × AMAT) × CycleTime
  AMAT = HitTime + MissRate × MissPenalty
       = (HitTime_Inst + MissRate_Inst × MissPenalty_Inst) + (HitTime_Data + MissRate_Data × MissPenalty_Data)
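A quick worked example may help fix the formulas (illustrative numbers, not taken from the slides): assume a 1-cycle hit time, a 5% miss rate, a 20-cycle miss penalty, 1.3 memory accesses per instruction and CPI_Execution = 1.

  AMAT = HitTime + MissRate × MissPenalty = 1 + 0.05 × 20 = 2 cycles
  CPI  = CPI_Execution + (MemAccess/Inst) × MissRate × MissPenalty = 1 + 1.3 × 0.05 × 20 = 2.3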

Advanced Computer Architecture Chapter 2.1 Advanced Computer Architecture Chapter 2.2

Average memory access time: Reducing Misses
AMAT = HitTime + MissRate × MissPenalty
There are three ways to improve cache performance:
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Classifying Misses: 3 Cs
Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses that occur even in an infinite cache.)
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.)
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision or interference misses. (Misses in an N-way associative cache of size X.)
More recent, a 4th "C": Coherence - misses caused by cache coherence.

Advanced Computer Architecture Chapter 2.3 Advanced Computer Architecture Chapter 2.4

Page 1 3Cs Absolute Miss Rate (SPEC92)
[chart: absolute miss rate (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-, 2-, 4- and 8-way associativity, split into conflict, capacity and compulsory components; the compulsory component is vanishingly small]

2:1 Cache Rule (of thumb!)
miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2
[chart: the same miss-rate data, illustrating the rule]

Advanced Computer Architecture Chapter 2.5 Advanced Computer Architecture Chapter 2.6

3Cs Relative Miss Rate How We Can Reduce Misses?

[chart: relative miss rate (0% to 100%) vs. cache size (1 KB to 128 KB), showing the conflict, capacity and compulsory shares for 1-, 2-, 4- and 8-way associativity]

3 Cs: Compulsory, Capacity, Conflict
In all cases, assume the total cache size is not changed. What happens if we:
1) Change block size: which of the 3 Cs is obviously affected?
2) Change associativity: which of the 3 Cs is obviously affected?
3) Change the compiler: which of the 3 Cs is obviously affected?
Flaws: for fixed block size. Good: insight => invention

Advanced Computer Architecture Chapter 2.7 Advanced Computer Architecture Chapter 2.8

Page 2 1. Reduce Misses via Larger Block Size 2. Reduce Misses via Higher Associativity

2:1 Cache Rule of thumb:

The miss rate of a direct-mapped cache of size N is about the same as the miss rate of a 2-way set-associative cache of size N/2, on average over a large suite of benchmarks.
Beware: execution time is the only thing that matters in the end. Will the clock cycle time increase? Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache, +2% for an internal cache.
[chart: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K, 64K and 256K]

Advanced Computer Architecture Chapter 2.9 Advanced Computer Architecture Chapter 2.10

Example: Avg. Memory Access Time vs. Miss Rate
Assume clock cycle time (CCT) = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the CCT of direct mapped.

Cache Size (KB)   1-way   2-way   4-way   8-way
  1               2.33    2.15    2.07    2.01
  2               1.98    1.86    1.76    1.68
  4               1.72    1.67    1.61    1.53
  8               1.46    1.48    1.47    1.43
 16               1.29    1.32    1.32    1.32
 32               1.20    1.24    1.25    1.27
 64               1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20
(Red entries in the original table mark cases where A.M.A.T. is not improved by more associativity.)

3. Reducing Misses via a "Victim Cache"
How to combine the fast hit time of direct mapped yet still avoid conflict misses? Add a small buffer to hold data discarded from the cache.
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache.
Used in AMD Opteron, Alpha, and HP machines.
[diagram: a direct-mapped cache backed by a small fully associative victim cache (tag + comparator per cache line of data), sitting between the cache and the next lower level in the hierarchy]

Advanced Computer Architecture Chapter 2.11 Advanced Computer Architecture Chapter 2.12

Page 3 4. Reducing Misses via "Pseudo-Associativity"
How to combine the fast hit time of direct mapped and have the lower conflict misses of a 2-way set-associative cache? Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit).
[diagram: access time split into hit time, pseudo hit time and miss penalty]
Drawback: CPU design is hard if a hit takes 1 or 2 cycles. Better for caches not tied directly to the processor (L2). Used in the MIPS L2 cache, similar in UltraSPARC.

5. Reducing Misses by Hardware Prefetching of Instructions & Data
E.g., instruction prefetching: the Alpha 21064 fetches 2 blocks on a miss; the extra block is placed in a "stream buffer"; on a miss, check the stream buffer.
Works with data blocks too: Jouppi [1990] found 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%. Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set-associative caches.
Prefetching relies on having extra memory bandwidth that can be used without penalty.

Advanced Computer Architecture Chapter 2.13 Advanced Computer Architecture Chapter 2.14

6. Reducing Misses by Software Prefetching of Data
7. Reducing Misses by Compiler Optimizations

Data Prefetch
  Load data into register (HP PA-RISC loads)
  Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v.9)
  Special prefetching instructions cannot cause faults; a form of speculative execution.
Prefetching comes in two flavors:
  Binding prefetch: requests load directly into a register. Must be the correct address and register!
  Non-binding prefetch: load into the cache. Can be incorrect. Frees HW/SW to guess!
Issuing prefetch instructions takes time: is the cost of prefetch issues < the savings in reduced misses? Wider superscalar issue reduces the difficulty of finding issue bandwidth for prefetches. (A minimal code sketch follows at the end of this slide pair.)

7. Reducing Misses by Compiler Optimizations
McFarling [1989]* reduced instruction cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software.
Instructions: by choosing the instruction memory layout based on the call graph, branch structure and profile data; reorder procedures in memory so as to reduce conflict misses.
Similar (but different) ideas work for data:
  Merging arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays.
  Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
  Loop fusion: combine 2 independent loops that have the same looping and some variables in common.
  Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows.
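As an aside, here is a minimal sketch of what compiler- or hand-inserted software prefetching can look like. It uses the GCC/Clang __builtin_prefetch builtin; the loop and PREFETCH_DIST are illustrative assumptions, not taken from the slides.

#include <stddef.h>

#define PREFETCH_DIST 16   /* elements ahead; must roughly cover the miss latency */

void scale(double *x, size_t n, double a)
{
    for (size_t i = 0; i < n; i++) {
        /* Non-binding prefetch: a hint only, cannot fault, may be wasted. */
        __builtin_prefetch(&x[i + PREFETCH_DIST], 1 /* write */, 1 /* low temporal locality */);
        x[i] = a * x[i];
    }
}

In practice one prefetch per cache line (not per element) is enough, and the distance must be tuned so the issue cost stays below the saved miss time.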

* “Program optimization for instruction caches”, ASPLOS89, http://doi.acm.org/10.1145/70082.68200 Advanced Computer Architecture Chapter 2.15 Advanced Computer Architecture Chapter 2.16

Page 4 Merging Arrays Example Loop Interchange Example

/* Merging arrays - before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* Merging arrays - after: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

/* Loop interchange - before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* Loop interchange - after */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Merging arrays: reduces conflicts between val and key, and improves spatial locality.
Loop interchange: sequential accesses instead of striding through memory every 100 words; improves spatial locality.

Advanced Computer Architecture Chapter 2.17 Advanced Computer Architecture Chapter 2.18

Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After fusion */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j]; }

/* After array contraction (scalars aa, cc replace the array temporaries) */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { cc = c[i][j];
    aa = 1/b[i][j] * cc;
    d[i][j] = aa + cc; }

2 misses per access to a & c vs. one miss per access: each element is now touched once, back to back, which improves locality. The real payoff comes if fusion enables array contraction: values are transferred in scalars instead of via the array.

Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  };
Two inner loops: read all NxN elements of z[]; read N elements of one row of y[] repeatedly; write N elements of one row of x[].
Capacity misses are a function of N and cache size: 2N^3 + N^2 words accessed (assuming no conflicts; otherwise ...).
Idea: compute on a BxB submatrix that fits in the cache.

Advanced Computer Architecture Chapter 2.19 Advanced Computer Architecture Chapter 2.20

Page 5 Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1)
      { r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      };
B is called the Blocking Factor. Capacity misses drop from 2N^3 + N^2 to N^3/B + 2N^2.
Conflict misses too? (We return to this example and this technique in Chapter 5.)

Reducing Conflict Misses by Blocking
[chart: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a direct-mapped cache vs. a fully associative cache]
Conflict misses in caches that are not fully associative vs. blocking size: Lam et al [1991] found a blocking factor of 24 had a fifth the misses of 48, despite both fitting in the cache.

Advanced Computer Architecture Chapter 2.21 Advanced Computer Architecture Chapter 2.22

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
[chart: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion and blocking, for vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7) and compress]

Summary: Miss Rate Reduction
CPUtime = IC × (CPI_Execution + MemoryAccesses/Instruction × MissRate × MissPenalty) × ClockCycleTime
3 Cs: Compulsory, Capacity, Conflict
1. Reduce misses via larger block size
2. Reduce misses via higher associativity
3. Reduce misses via a victim cache
4. Reduce misses via pseudo-associativity
5. Reduce misses by HW prefetching of instructions and data
6. Reduce misses by SW prefetching of data
7. Reduce misses by compiler optimizations
Prefetching comes in two flavors:
  Binding prefetch: requests load directly into a register. Must be the correct address and register!
  Non-binding prefetch: load into the cache. Can be incorrect. Frees HW/SW to guess!

Advanced Computer Architecture Chapter 2.23 Advanced Computer Architecture Chapter 2.24

Page 6 Review: Improving Cache Performance
Write Policy: Write-Through vs Write-Back

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Write-through: all writes update the cache and the underlying memory/cache.
  Can always discard cached data: the most up-to-date data is in memory.
  Cache control bit: only a valid bit.
Write-back: all writes simply update the cache.
  Can't just discard cached data: may have to write it back to memory.
  Cache control bits: both valid and dirty bits.
Other advantages:
  Write-through: memory (or other processors) always have the latest data; simpler management of the cache.
  Write-back: much lower bandwidth, since data is often overwritten multiple times; better tolerance to long-latency memory?

Advanced Computer Architecture Chapter 2.25 Advanced Computer Architecture Chapter 2.26

Write Policy 2: Write Allocate vs Non-Allocate
1. Reducing Miss Penalty: Read Priority over Write on Miss

What happens on a write miss?
Write allocate: allocate a new cache line in the cache.
  Usually means that you have to do a "read miss" to fill in the rest of the cache line! Alternative: per-word valid bits.
Write non-allocate (or "write-around"): simply send the write data through to the underlying memory/cache; don't allocate a new cache line.

Consider write-through with write buffers: [diagram: CPU, cache and a write buffer in front of DRAM (or lower memory)]
  RAW conflicts with main-memory reads on cache misses.
  Could simply wait for the write buffer to empty before allowing the read; this risks a serious increase in read miss penalty (old MIPS 1000: by 50%).
  Solution: check the write buffer contents before the read; if there are no conflicts, let the memory access continue (a small software model of this check is sketched below).
Write-back also needs a buffer to hold displaced blocks:
  Read miss replacing a dirty block.
  Normal: write the dirty block to memory, and then do the read.
  Instead: copy the dirty block to a write buffer, then do the read, and then do the write.
  The CPU stalls less since it restarts as soon as the read is done.
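The following is a hedged sketch, not from the slides (and of course the real check is done in hardware): a small software model of "check the write buffer before the read" for a write-through cache with a 4-entry buffer.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4                 /* e.g. a 4-entry write buffer */

struct wb_entry { bool valid; uint32_t addr; uint32_t data; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* On a read miss, first look for a buffered write to this address; only if
 * none matches do we go to the next level of the hierarchy. */
uint32_t read_miss(uint32_t addr, uint32_t (*read_next_level)(uint32_t))
{
    for (int i = WB_ENTRIES - 1; i >= 0; i--)          /* newest entry first */
        if (write_buffer[i].valid && write_buffer[i].addr == addr)
            return write_buffer[i].data;                /* forward from the buffer */
    return read_next_level(addr);                       /* no conflict: read memory */
}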

Advanced Computer Architecture Chapter 2.27 Advanced Computer Architecture Chapter 2.28

Page 7 2. Reduce Miss Penalty: Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU.
  Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
  Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.
Generally useful only with large blocks. (Access to contiguous sequential words is very common, but doesn't benefit from either scheme: are they worthwhile?)

3. Reduce Miss Penalty: Non-blocking Caches to reduce stalls on misses
A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss.
  Requires full/empty bits on registers or out-of-order execution.
  Requires multi-bank memories.
"Hit under miss" reduces the effective miss penalty by working during the miss instead of ignoring CPU requests.
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
  Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses.
  Requires multiple memory banks (otherwise it cannot be supported).
  The Pentium Pro allows 4 outstanding memory misses.
Compare: prefetching overlaps memory access with pre-miss instructions; a non-blocking cache overlaps memory access with post-miss instructions.
Advanced Computer Architecture Chapter 2.29 Advanced Computer Architecture Chapter 2.30

What happens on a Cache miss?
For an in-order pipeline, 2 options:
  Freeze the pipeline in the Mem stage (popular early on: Sparc, ...):
    IF ID EX Mem stall stall stall ... stall Mem Wr
       IF ID EX  stall stall stall ... stall stall Ex Wr
  Use Full/Empty bits in registers + an MSHR queue
    MSHR = "Miss Status/Handler Registers" (Kroft*). Each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line.
      Per cache line: keep info about the memory address.
      For each word: the register (if any) that is waiting for the result.
      Used to "merge" multiple requests to one memory line.
    A new load creates an MSHR entry and sets the destination register to "Empty"; the load is "released" from the pipeline.
    An attempt to use the register before the result returns causes the instruction to block in the decode stage.
    Limited "out-of-order" execution with respect to loads. Popular with in-order superscalar architectures.
  Out-of-order pipelines already have this functionality built in (load queues, etc.). (A data-structure sketch of an MSHR entry follows below.)

Value of Hit Under Miss for SPEC
[chart: AMAT in cycles for the SPEC92 benchmarks (eqntott, espresso, xlisp, compress, mdljsp2, ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora), comparing hit under 0, 1, 2 and 64 outstanding misses against the blocking base case]
FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
8 KB data cache, direct mapped, 32B blocks, 16-cycle miss penalty
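A hedged sketch (not from the slides) of one possible shape for an MSHR entry, following Kroft's description; WORDS_PER_LINE and the field widths are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_LINE 8

struct mshr_entry {
    bool     valid;                      /* entry tracks an outstanding miss      */
    uint64_t line_addr;                  /* address of the memory line in flight  */
    struct {
        bool    waiting;                 /* is an instruction waiting on this word? */
        uint8_t dest_reg;                /* register marked "Empty" until data returns */
    } word[WORDS_PER_LINE];              /* lets later misses to the same line merge */
};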

* David Kroft, Lockup-free instruction fetch/prefetch cache organization, ICCA81 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss http://portal.acm.org/citation.cfm?id=801868 Advanced Computer Architecture Chapter 2.31 Advanced Computer Architecture Chapter 2.32

Page 8 4: Add a second-level cache
L2 Equations:
  AMAT = HitTime_L1 + MissRate_L1 × MissPenalty_L1
  MissPenalty_L1 = HitTime_L2 + MissRate_L2 × MissPenalty_L2
  AMAT = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × MissPenalty_L2)
Definitions:
  Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (MissRate_L2).
  Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (MissRate_L1 × MissRate_L2).
  The global miss rate is what matters: since hits in L2 are few, target miss reduction.

Comparing Local and Global Miss Rates
[chart (Fig 5.10 pg 416): local and global miss rates for a 32 KByte first-level cache with an increasing second-level cache]
  The global miss rate is close to the single-level-cache miss rate, provided L2 >> L1.
  Don't use the local miss rate.
  L2 is not tied to the CPU clock cycle!
  Cost & A.M.A.T.: generally fast hit times and fewer misses.
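A quick worked example with illustrative numbers (not from the slides): suppose the L1 hit time is 1 cycle, the L2 hit time is 10 cycles, the L2 miss penalty is 100 cycles, 4% of CPU accesses miss in L1, and half of those also miss in L2.

  GlobalMissRate_L2 = MissRate_L1 × MissRate_L2 = 0.04 × 0.5 = 0.02
  MissPenalty_L1 = 10 + 0.5 × 100 = 60 cycles
  AMAT = 1 + 0.04 × 60 = 3.4 cycles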

Advanced Computer Architecture Chapter 2.33 Fig 5.10 pg416 Advanced Computer Architecture Chapter 2.34

Reducing Misses: L2 cache block size & A.M.A.T. Which apply to L2 Cache?

[chart: relative CPU time (between 1.27 and 1.95) vs. L2 block size (16, 32, 64, 128, 256, 512 bytes), for a 32KB L1 and an 8-byte path to memory]

Reducing Miss Rate: which apply to an L2 cache?
1. Reduce misses via larger block size
2. Reduce conflict misses via higher associativity
3. Reduce conflict misses via a victim cache
4. Reduce conflict misses via pseudo-associativity
5. Reduce misses by HW prefetching of instructions and data
6. Reduce misses by SW prefetching of data
7. Reduce capacity/conflict misses by compiler optimizations

Advanced Computer Architecture Chapter 2.35 Advanced Computer Architecture Chapter 2.36

Page 9 Reducing Miss Penalty Summary
CPUtime = IC × (CPI_Execution + MemoryAccesses/Instruction × MissRate × MissPenalty) × ClockCycleTime
Four techniques:
  Read priority over write on miss
  Early restart and critical word first on miss
  Non-blocking caches (hit under miss, miss under miss)
  Second-level cache
Can be applied recursively to multilevel caches. The danger is that the time to DRAM will grow with multiple levels in between; first attempts at L2 caches can make things worse, since the increased worst case is worse.

Average memory access time:
AMAT = HitTime + MissRate × MissPenalty
There are three ways to improve cache performance:
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Advanced Computer Architecture Chapter 2.37 Advanced Computer Architecture Chapter 2.38

Reducing the time to hit in the cache
Why does the Alpha 21164 have an 8KB instruction cache, an 8KB data cache and a 96KB second-level cache, all on-chip?
1. Keep the cache small and simple
2. Keep address translation off the critical path
3. Pipeline the cache access

2. Fast hits by Avoiding Address Translation
[diagram: three organisations. Conventional: CPU -> TLB -> physically addressed cache -> memory. Virtually addressed cache: CPU -> virtual-address cache, translate only on a miss (synonym problem). Overlapped: the virtual address indexes the cache tags in parallel with TLB translation, with a physically addressed L2 beneath; requires the cache index to remain invariant across translation]
Advanced Computer Architecture Chapter 2.39 Advanced Computer Architecture Chapter 2.40

Page 10 Paging - Address Mapping

Virtual address space is divided into pages of equal size; main memory is divided into page frames of the same size.
  A running or ready process has some pages in main memory.
  A waiting process may have all its pages on disk.
  Paging is transparent to the programmer.
The paging mechanism provides (1) address mapping and (2) page transfer.
[diagram: a program address is split into P (page number) and W (word number); the process page table (PPT), located by a pointer to the current page table, maps page P to a page frame address B; the main-store address is B+W. Active pages live in real memory, inactive pages on disk.]
Example: word-addressed machine, W = 8 bits, page size = 256 words:
Amap(P,W) := PPT[P] * 256 + W
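The address-mapping example above, written out as a hedged C sketch; the page-table representation and types are illustrative assumptions, not from the slides.

#include <stdint.h>

#define PAGE_SIZE 256              /* words per page, so W is 8 bits */

/* Amap(P, W) := PPT[P] * 256 + W */
uint32_t amap(const uint32_t *ppt, uint32_t p, uint32_t w)
{
    uint32_t frame_base = ppt[p] * PAGE_SIZE;    /* B: page frame address   */
    return frame_base + (w & (PAGE_SIZE - 1));   /* main-store address B + W */
}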

Swapping: pages are transferred between main memory and disc. Note: the Process Page Table (PPT) itself can be paged.
(Review introductory operating systems material for students lacking a CS background.)
Advanced Computer Architecture Chapter 2.41 Advanced Computer Architecture Chapter 2.42

Paging - Address Mapping Paging - Page Transfer

Paging - Address Mapping with a TLB
The TLB (Translation Lookaside Buffer) is a small cache of the PPT containing recently accessed page-table values, e.g. 64-entry and fully associative; it is closely integrated with the L1 cache. If the page is absent in the TLB, look in the PPT.
[diagram: the program address P,W is translated via the TLB (or, on a TLB miss, the PPT entry) to a page frame address B; the main-store address is B+W]

Paging - Page Transfer
What happens when we access a page which is currently not in main memory (i.e. the page table entry is empty)?
  Page Fault → suspend the running process → get the page from disk → update the page table → resume the process (re-execute the instruction).
  Can one instruction cause more than one page fault?
  Note: we can run another ready process while the page fault is being serviced.
The location of a page on disk can be recorded in a separate table or in the page table itself using a presence bit: presence bit set → the entry holds the main-memory page-frame location; presence bit clear → the entry holds the disk page location.
(Review introductory operating systems material for students lacking a CS background.)
Advanced Computer Architecture Chapter 2.43 Advanced Computer Architecture Chapter 2.44

Page 11 Synonyms and homonyms in address translation
Homonyms (same sound, different meaning):
  The same virtual address points to two different physical addresses in different processes.
  If you have a virtually-indexed cache, flush it between context switches, or include the PID in the cache tag.
Synonyms (different sound, same meaning):
  Two different virtual addresses (from the same or different processes) point to the same physical address.
  In a virtually addressed cache, such an address could be cached twice under different virtual addresses; updates to one cached copy would not be reflected in the other cached copy.
  Solution: make sure synonyms can't co-exist in the cache; e.g., the OS can force synonyms to have the same index bits in a direct-mapped cache, called page coloring (sometimes spelled page colouring).

2. Fast hits by Avoiding Address Translation
Send the virtual address to the cache? Called a virtually addressed cache or just virtual cache, vs. a physical cache.
  Every time the process is switched we must logically flush the cache, otherwise we get false hits.
    Cost is the time to flush + "compulsory" misses from an empty cache.
  Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address.
  I/O must interact with the cache, so it needs virtual addresses.
Solution to aliases: as long as the matching low-order bits cover the index field and the cache is direct mapped, synonyms land in the same place and must be unique; ensuring this by OS page placement is page coloring.
Solution to cache flush: add a process-identifier tag that identifies the process as well as the address within the process: you can't get a hit if the process is wrong.
(A nice explanation in more detail can be found at http://www.ece.cmu.edu/~jhoe/course/ece447/handouts/L22.pdf)
Advanced Computer Architecture Chapter 2.45 Advanced Computer Architecture Chapter 2.46

2. Fast Cache Hits by Avoiding Translation: Process ID impact
[chart: miss rate (up to 20%) vs. cache size (2 KB to 1024 KB). Black: uniprocess. Light gray: multiprocess when flushing the cache. Dark gray: multiprocess using a process-ID tag.]

2. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address
If the index is in the physical part of the address (the page offset), we can start the tag access in parallel with translation, and then compare against the physical tag.
[diagram: the virtual address splits into page number and page offset; the page offset indexes the cache while the TLB translates the page number; a physically addressed L2 sits below]
This limits the cache to the page size: what if we want bigger caches and the same trick?
  Higher associativity
  Page coloring: a cache conflict occurs if two cache blocks that have the same tag (physical address) are mapped to two different virtual addresses; make sure the OS never creates a page-table mapping with this property.
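A small worked example of the "limits cache to page size" point (illustrative numbers, not from the slides; assuming 4 KB pages, 32-byte blocks and 2-way associativity): the index plus block-offset bits must fit in the 12-bit page offset, so a virtually indexed, physically tagged cache can be at most page size times associativity.

  max cache size = page size × associativity = 4 KB × 2 = 8 KB
  sets = 8 KB / (32 B × 2) = 128  =>  7 index bits + 5 block-offset bits = 12 page-offset bits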

Advanced Computer Architecture Chapter 2.47 Advanced Computer Architecture Chapter 2.48

Page 12 3: Fast Hits by pipelining the Cache: Case Study, MIPS R4000
8-stage pipeline: IF IS RF EX DF DS TC WB
  IF: first half of instruction fetch; PC selection happens here, as well as initiation of the instruction cache access.
  IS: second half of the instruction cache access.
  RF: instruction decode and register fetch, hazard checking, and instruction cache hit detection.
  EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
  DF: data fetch, first half of the data cache access.
  DS: second half of the data cache access.
  TC: tag check, to determine whether the data cache access hit.
  WB: write back for loads and register-register operations.
What is the impact on load delay? TWO-cycle load latency: need 2 instructions between a load and its use!
THREE-cycle branch latency (conditions are evaluated during the EX phase): delay slot plus two stalls; branch-likely cancels the delay slot if not taken.
[pipeline diagrams: overlapped IF IS RF EX DF DS TC WB stages illustrating the two-cycle load latency and three-cycle branch latency]

Advanced Computer Architecture Chapter 2.49 Advanced Computer Architecture Chapter 2.50

R4000 Performance
Not the ideal CPI of 1:
  Load stalls (1 or 2 clock cycles)
  Branch stalls (2 cycles + unfilled delay slots)
  FP result stalls: RAW data hazard (latency)
  FP structural stalls: not enough FP hardware (parallelism)
[chart: CPI (0 to 4.5) for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, tomcatv, broken down into base, load stalls, branch stalls, FP result stalls and FP structural stalls]

What is the Impact of What You've Learned About Caches?
[chart: processor vs. DRAM performance, 1980-2000, log scale]
1960-1985: speed = f(no. operations)
1990s: pipelined execution and fast clock rate, out-of-order execution, superscalar instruction issue
1998: speed = f(non-cached memory accesses)
Superscalar, out-of-order machines hide an L1 data cache miss (~5 clocks) but not an L2 cache miss (~50 clocks)?

Advanced Computer Architecture Chapter 2.51 Advanced Computer Architecture Chapter 2.52

Page 13 Alpha 21064
  Processor issues 48-bit virtual addresses
  Separate instruction & data TLBs and caches
  TLBs fully associative; TLB updates in SW ("Priv Arch Libr")
  Caches 8KB direct mapped, write through, virtually indexed, physically tagged
  Critical 8 bytes first
  Prefetch instruction stream buffer
  4-entry write buffer between D$ & L2$; incorporates a victim buffer to give reads priority over writes
  2 MB L2 cache, direct mapped, write-back (off-chip)
  256-bit path to main memory, 4 x 64-bit modules

Alpha Memory Performance: Miss Rates of SPEC92
[chart: I$ (8K), D$ (8K) and L2 (2M) miss rates, log scale from 0.01% to 100%, for AlphaSort, TPC-B (db1), Li, Sc, Compress, Ora, Ear, Doduc, Su2cor, Mdljp2, Spice, Tomcatv]
  Overall average: I$ miss = 6%, D$ miss = 32%, L2 miss = 10%
  Integer benchmark average: I$ miss = 2%, D$ miss = 13%, L2 miss = 0.6%
  Floating-point benchmark average: I$ miss = 1%, D$ miss = 21%, L2 miss = 0.3%

Advanced Computer Architecture Chapter 2.53 Advanced Computer Architecture Chapter 2.54

Alpha CPI Components
[chart: CPI (0 to 5.0) for Li, Sc, Ear, Ora, Doduc, Mdljp2, Tomcatv, AlphaSort, Compress, TPC-B (db1), broken down into instruction stall (branch mispredict, green), I$ (yellow), D$ (blue), L2$ (pink) and other (compute + register conflicts, structural conflicts)]

Cache Optimization Summary

Technique                               MR   MP   HT   Complexity
Larger block size                       +    -         0
Higher associativity                    +         -    1
Victim caches                           +              2
Pseudo-associative caches               +              2
HW prefetching of instructions/data     +              2
Compiler-controlled prefetching         +              3
Compiler techniques to reduce misses    +              0
Priority to read misses                      +         1
Early restart & critical word first          +         2
Non-blocking caches                          +         3
Second-level caches                          +         2
(MR = miss rate, MP = miss penalty, HT = hit time)

Advanced Computer Architecture Chapter 2.55 Advanced Computer Architecture Chapter 2.56

Page 14 Practical exercise: explore the memory hierarchy on your favourite computer
Download Stefan Manegold's "cache and TLB calibrator":
  http://www.cwi.nl/~manegold/Calibrator/calibrator.shtml
  (or find an installed copy in ~phjk/ToyPrograms/C/ManegoldCalibrator)
This program consists of a loop which runs over an array repeatedly (see the sketch below).
  The size of the array is varied to evaluate cache size.
  The stride is varied to explore block size: what is the level-1 cache block size? What is the level-2 cache block size?

Memory hierarchy of a 2.2GHz Intel Pentium 4 Xeon
[chart: measured access latency vs. array size and stride]
  Memory access latency is close to 1ns when the loop reuses an array smaller than the 8KB level-1 cache.
  While the array is smaller than 512KB, access time is 2-8ns, depending on stride.
  When the array exceeds 512KB, accesses miss both the level-1 and level-2 caches; the worst case (large stride) suffers 158ns access latency.
  Q: How many instructions could be executed in 158ns?
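A hedged sketch of the idea (this is not Manegold's actual code): walk an array of a given size with a given stride many times and time it; repeating over sizes and strides exposes cache capacities and block sizes as jumps in the per-access latency.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double walk(volatile char *buf, size_t size, size_t stride, long reps)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long r = 0; r < reps; r++)
        for (size_t i = 0; i < size; i += stride)
            buf[i]++;                                  /* one memory access per stride */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / ((double)reps * (size / stride));      /* ns per access */
}

int main(void)
{
    for (size_t size = 4 * 1024; size <= 64 * 1024 * 1024; size *= 2) {
        char *buf = malloc(size);
        for (size_t stride = 16; stride <= 512; stride *= 2)
            printf("size %8zu  stride %4zu  %6.2f ns/access\n",
                   size, stride, walk(buf, size, stride, 64));
        free(buf);
    }
    return 0;
}

Note that on modern processors hardware prefetching hides much of the latency of this simple sequential walk; real calibrators use dependent (pointer-chasing) loads to defeat it.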

Advanced Computer Architecture Chapter 2.57 Advanced Computer Architecture Chapter 2.58

Instructions for running the Manegold calibrator (extra material, for interest)
Get a copy: cp /homes/phjk/ToyPrograms/C/ManegoldCalibrator/calibrator.c ./
Compile it: gcc -O3 -o calibrator calibrator.c
Find out the CPU MHz: cat /proc/cpuinfo
Run it: ./calibrator   E.g. on media03: ./calibrator 3000 64M media03
Output is delivered to a set of files "media03.*"
Plot postscript graphs using the generated gnuplot scripts:
  gnuplot media03.cache-miss-latency.gp
  gnuplot media03.cache-replace-time.gp
  gnuplot media03.TLB-miss-latency.gp
View the generated postscript files: gv media03.cache-miss-latency.ps &

Advanced Computer Architecture Chapter 2.59 Advanced Computer Architecture Chapter 2.60

Page 15 Main Memory Background
Performance of main memory:
  Latency: cache miss penalty
    Access time: time between the request and the word arriving
    Cycle time: time between requests
  Bandwidth: I/O & large block miss penalty (L2)
Main memory is DRAM: Dynamic Random Access Memory
  Dynamic since it needs to be refreshed periodically (8 ms, 1% of time)
  Addresses divided into 2 halves (memory as a 2D matrix):
    RAS or Row Access Strobe
    CAS or Column Access Strobe
Cache uses SRAM: Static Random Access Memory
  No refresh (6 transistors/bit vs. 1 transistor)
  Size: DRAM/SRAM ~ 4-8; Cost & cycle time: SRAM/DRAM ~ 8-16

Main Memory Deep Background
The first real "random-access memory" technology was based on magnetic "cores": tiny ferrite rings threaded with copper wires. That's why people talk about "out-of-core", "in-core", "core dump".
  A pulse appears on the sense line if any core flips its magnetisation state.
  Non-volatile, magnetic.
  Lost out when the 4 Kbit DRAM became available.
  Access time 750 ns, cycle time 1500-3000 ns.
http://www.faqs.org/docs/electric/Digital/DIGI_15.html
http://www.psych.usyd.edu.au/pdp-11/core.html
Advanced Computer Architecture Chapter 2.61 Advanced Computer Architecture Chapter 2.62

Atlas

[photo caption: the first magnetic core memory, from the IBM 405 Alphabetical Accounting Machine. The photo shows the single drive lines through the cores in the long direction and fifty turns in the short direction. The cores are 150 mil inside diameter, 240 mil outside, 45 mil high. This experimental system was tested successfully in April 1952.]
524,000 36-bit words and a total cycle time of eight microseconds in each memory (1964, for the IBM 7094)
Sources: http://www-03.ibm.com/ibm/history/exhibits/space/space_2361.html  http://www.columbia.edu/acis/history/core.html

Atlas
  The first was at the University of Manchester; the University of London had the second one.
  Commissioned May 1964; shut down Sept 1972.
  Source: http://www.chilton-computing.org.uk/acl/technology/atlas/overview.htm
Advanced Computer Architecture Chapter 2.63 Advanced Computer Architecture Chapter 2.64

Page 16 Pipelined instruction processing in Atlas DRAM design

Atlas is most famous for pioneering virtual memory. Also: pipelined execution, cache memory ("slave store", 32 words), floating point arithmetic hardware.
http://www.chilton-computing.org.uk/acl/technology/atlas/overview.htm

DRAM design: a single transistor plus a capacitor; the capacitor stores charge, which decays with time; destructive read-out.
http://www.research.ibm.com/journal/rd/462/mandelman.html
Advanced Computer Architecture Chapter 2.65 Advanced Computer Architecture Chapter 2.66

DRAM array design 4 Key DRAM Timing Parameters Square array of cells tRAC: minimum time from RAS line falling to the valid Address split into Row data output. address and Column Quoted as the speed of a DRAM when buy

Address bits A typical 4Mb DRAM tRAC = 60 ns Row address selects row Speed of DRAM since on purchase sheet? of cells to be activated tRC: minimum time from the start of one row access Cells discharge to the start of the next. t = 110 ns for a 4Mbit DRAM with a t of 60 ns Cell state latched by per- RC RAC column sense amplifiers tCAC: minimum time from CAS line falling to valid data output. Column address selects 15 ns for a 4Mbit DRAM with a t of 60 ns data for output RAC tPC: minimum time from the start of one column Data must be written access to the start of the next. back to selected row 35 ns for a 4Mbit DRAM with a tRAC of 60 ns

http://www.faculty.iu-bremen.de/birk/lectures/PC101-2003/08dram/Principles/DRAM02.htm Advanced Computer Architecture Chapter 2.67 Advanced Computer Architecture Chapter 2.68

Page 17 DRAM Read Timing
Every DRAM access begins at the assertion of RAS_L.
2 ways to read: early or late v. CAS.
[timing diagram for a 256K x 8 DRAM (RAS_L, CAS_L, WE_L, OE_L, address, data): in a DRAM read cycle the row address is presented while RAS_L falls, then the column address while CAS_L falls; data out is high-Z until the read access time and output-enable delay have elapsed. Early read cycle: OE_L asserted before CAS_L. Late read cycle: OE_L asserted after CAS_L.]

DRAM Performance
A 60 ns (tRAC) DRAM can:
  perform a row access only every 110 ns (tRC);
  perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC).
    In practice, external address delays and turning around buses make it 40 to 50 ns.
These times do not include the time to drive the addresses off the microprocessor, nor the overhead!

Advanced Computer Architecture Chapter 2.69 Advanced Computer Architecture Chapter 2.70

DRAM Today: 1 Gbit DRAM and more
Infineon (Dresden): organisation x4, x8, x16; clock 133-200 MHz; data pins 68; die size 160 mm2; metal layers 3; technology 110nm
  Video: http://registration.infineon.com/registration/video/video.asp
Future (in the lab 2005): 0.7 micron, hi-k dielectric (Al2O3), 75:1 trench capacitor aspect ratio

DRAM History
DRAMs: capacity +60%/yr, cost -30%/yr; 2.5X cells/area, 1.5X die size in ~3 years
A 2007 DRAM fab line costs $4.6B (2004 prices)
DRAM only: density, leakage v. speed; rely on an increasing number of computers & memory per computer (60% market)
The SIMM or DIMM is the replaceable unit => computers use any generation of DRAM
Commodity, second-source industry => high volume, low profit, conservative; little organisational innovation in 20 years
Order of importance: 1) cost/bit 2) capacity; the first RAMBUS: 10X BW, +30% cost => little impact
"Elpida to Build $4.6B DRAM Fab in Japan" (Electronic News, 6/9/2004) http://www.reed-electronics.com/electronicnews/article/CA424812.html
Infineon News 2004-12-14 http://www.infineon.com/cgi/ecrm.dll/jsp/showfrontend.do?lang=EN&news_nav_oid=-9979&content_type=NEWS&content_oid=117074
Infineon Samples 1Gbit DDR SDRAMs, Electronics News, 8/26/2003 http://www.reed-electronics.com/electronicnews/article/CA318799.html
Advanced Computer Architecture Chapter 2.71 Advanced Computer Architecture Chapter 2.72

Page 18 Main Memory Organizations
[diagrams: three organisations]
  Simple: CPU, cache, bus, memory all the same width (32 or 64 bits)
  Wide: CPU/mux 1 word; mux/cache, bus, memory N words (Alpha: 64 bits & 256 bits; UltraSPARC 512)
  Interleaved: CPU, cache, bus 1 word; memory of N modules (4 modules in the example), word interleaved

Fast Memory Systems: DRAM specific
Multiple CAS accesses: several names (page mode); Extended Data Out (EDO): 30% faster in page mode
New DRAMs to address the gap: what will they cost, will they survive?
  RAMBUS: "reinvent the DRAM interface"
    Each chip is a module vs. a slice of memory; short bus between CPU and chips; does its own refresh; variable amount of data returned; originally 1 byte / 2 ns (500 MB/s per chip); 20% increase in DRAM area
    Direct Rambus DRAM (DRDRAM): 16 bits at 400MHz, with a transfer on both clock edges, leading to 1.6GB/s
  Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66 - 150 MHz); "Double Data Rate" DDR SDRAM also transfers on both clock edges
  Intel claims RAMBUS Direct (16 b wide) is the future of PC memory?
  Niche memory or main memory? e.g., Video RAM for frame buffers: DRAM + fast serial output
Advanced Computer Architecture Chapter 2.73 Advanced Computer Architecture Chapter 2.74

Main Memory Performance Independent Memory Banks

Main Memory Performance: timing model (word size is 32 bits)
  1 cycle to send the address, 6 cycles access time, 1 cycle to send the data; cache block is 4 words
  Simple M.P. = 4 x (1+6+1) = 32
  Wide M.P. = 1 + 6 + 1 = 8
  Interleaved M.P. = 1 + 6 + 4x1 = 11

Independent Memory Banks
Memory banks for independent accesses vs. faster sequential accesses:
  Multiprocessor; I/O; CPU with hit under n misses, non-blocking cache
Superbank: all memory active on one block transfer (or bank)
Bank: portion within a superbank that is word interleaved (or subbank)
[address layout: Superbank Number | Superbank Offset, where the offset splits into Bank Number | Bank Offset]

Advanced Computer Architecture Chapter 2.75 Advanced Computer Architecture Chapter 2.76

Page 19 Independent Memory Banks Avoiding Bank Conflicts

How many banks?
  number of banks ≥ number of clocks to access a word in a bank
  For sequential accesses; otherwise we return to the original bank before it has the next word ready (as in the vector case).
  Increasing DRAM sizes => fewer chips => harder to have banks

Avoiding Bank Conflicts: lots of banks
int x[256][512];
for (j = 0; j < 512; j = j+1)
  for (i = 0; i < 256; i = i+1)
    x[i][j] = 2 * x[i][j];
Conflicts occur even with 128 banks, since 512 is a multiple of 128: successive accesses down a column hit the same bank.
SW: loop interchange, or declaring the array not a power of 2 ("array padding", sketched below)
HW: a prime number of banks
  bank number = address mod number of banks
  address within bank = address / number of words in bank
    modulo & divide per memory access with a prime number of banks?
  address within bank = address mod number of words in bank
    bank number? easy if 2^N words per bank
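A hedged sketch of the "array padding" fix for the example above (the padding amount is an illustrative assumption): with 512-word rows, every element of a column falls in the same bank of a 128-bank memory; padding the row length to an odd 513 spreads consecutive rows across banks.

int x[256][513];   /* 512 data words per row + 1 unused padding word */

void scale_column(int j)
{
    for (int i = 0; i < 256; i++)
        x[i][j] = 2 * x[i][j];   /* successive rows now map to different banks */
}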

Advanced Computer Architecture Chapter 2.77 Advanced Computer Architecture Chapter 2.78

Fast Bank Number: Chinese Remainder Theorem
As long as two sets of integers ai and bi follow these rules:
  bi = x mod ai, 0 ≤ bi < ai, 0 ≤ x < a0 × a1 × a2 × ...
and ai and aj are co-prime if i ≠ j, then the integer x has only one solution (an unambiguous mapping):
  bank number = b0, number of banks = a0 (= 3 in the example)
  address within bank = b1, number of words in bank = a1 (= 8 in the example)
  N-word address 0 to N-1, prime number of banks, words per bank a power of 2 (a small code sketch follows)

Address within    Seq. Interleaved           Modulo Interleaved
bank              bank 0  bank 1  bank 2     bank 0  bank 1  bank 2
0                   0       1       2          0      16       8
1                   3       4       5          9       1      17
2                   6       7       8         18      10       2
3                   9      10      11          3      19      11
4                  12      13      14         12       4      20
5                  15      16      17         21      13       5
6                  18      19      20          6      22      14
7                  21      22      23         15       7      23

DRAMs per PC over Time
[table: number of DRAM chips needed for a given minimum PC memory size (4 MB up to 256 MB), by DRAM generation: '86 1 Mb, '89 4 Mb, '92 16 Mb, '96 64 Mb, '99 256 Mb, '02 1 Gb]
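A hedged C sketch of the modulo-interleaved mapping above; NUM_BANKS and WORDS_PER_BANK follow the slide's 3-bank, 8-word example, and the function names are illustrative.

#include <stdint.h>

#define NUM_BANKS      3    /* prime, as in the example */
#define WORDS_PER_BANK 8    /* power of two */

/* bank = addr mod 3, offset = addr mod 8; the Chinese Remainder Theorem
 * guarantees the (bank, offset) pair is unique for addresses 0..23. */
void map_address(uint32_t addr, uint32_t *bank, uint32_t *offset)
{
    *bank   = addr % NUM_BANKS;             /* the only real modulo          */
    *offset = addr & (WORDS_PER_BANK - 1);  /* == addr mod 8, just low bits  */
}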

Advanced Computer Architecture Chapter 2.79 Advanced Computer Architecture Chapter 2.80

Page 20 Need for Error Correction!
Motivation: failures per unit time are proportional to the number of bits! As DRAM cells shrink, they become more vulnerable.
We went through a period in which the failure rate was low enough without error correction that people didn't do correction; DRAM banks are too large now, and servers have always had corrected memory systems.
Basic idea: add redundancy through parity bits.
  Simple but wasteful version: keep three copies of everything and vote to find the right value; 200% overhead, so not good!
  Common configuration: random error correction, SEC-DED (single error correct, double error detect); one example: 64 data bits + 8 parity bits (11% overhead). Papers on the reading list from last term tell you how to do these types of codes.
Really want to handle failures of physical components as well: the organisation is multiple DRAMs per SIMM and multiple SIMMs, so we want to recover from a failed DRAM and a failed SIMM! This requires more redundancy; all major vendors are thinking about it in high-end machines.

Architecture in practice (as reported in Microprocessor Report, Vol 13, No. 5)
  Emotion Engine: 6.2 GFLOPS, 75 million polygons per second
  Graphics Synthesizer: 2.4 billion pixels per second
  Claim: Toy Story realism brought to games!

Advanced Computer Architecture Chapter 2.81 Advanced Computer Architecture Chapter 2.82

FLASH
  MOSFET cell with two gates; one "floating" gate stores the bit: to program, charge tunnels via a <7nm dielectric.
  Cells can only be erased (reset to 0) in blocks.
  NAND design: sequential read, high density.
  Jan 2005: $1000; 16GB Q2'05.
  1 Gbit NAND Flash memory. bwrc.eecs.berkeley.edu/Classes/ICDesign/EE241_s02/Lectures/lecture28-Flash.pdf

More esoteric Storage Technologies? FRAM
  Perovskite ferroelectric crystal forms the dielectric in a capacitor; stores a bit via a phase change.
  100ns read, 100ns write; very low write energy (ca. 1nJ).
  Fully integrated with a logic fab process; currently used in Smartcards/RFID. Soon to overtake Flash?
  http://www.fma.fujitsu.com/fram/framDocs01.asp?grOut=Documentation&sec=Documentation
Advanced Computer Architecture Chapter 2.83 Advanced Computer Architecture Chapter 2.84

Page 21 Main Memory Summary

Wider memory
Interleaved memory: for sequential or independent accesses
Avoiding bank conflicts: SW & HW
DRAM-specific optimizations: page mode & specialty DRAM
Need error correction

Advanced Computer Architecture Chapter 2.85
