Chapter 2: Memory Hierarchy Design
Introduction

● Goal: unlimited amount of memory with low latency
● Fast memory technology is more expensive per bit than slower memory
  – Use the principle of locality (spatial and temporal)
● Solution: organize the memory system into a hierarchy
  – Entire addressable memory space available in the largest, slowest memory
  – Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
● Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
  – Gives the illusion of a large, fast memory being presented to the processor

Memory Hierarchies

Level        Server                    Personal Mobile Device
Registers    1000 B, 300 ps            500 B, 500 ps             (on-die)
L1 cache     64 kB, 1 ns               64 kB, 2 ns               (on-die)
L2 cache     256 kB, 3-10 ns           256 kB, 10-20 ns          (on-die)
L3 cache     2-4 MB, 10-20 ns          —                         (in package)
Memory       4-16 GB, 50-100 ns        256-512 MB, 50-100 ns     (off-chip)
Storage      Disk, 4-16 TB, 5-10 ms    Flash, 4-6 GB, 25-50 µs   (off-chip)

CPU vs. Memory: Performance vs Latency (figure)

Memory Hierarchy Design Considerations

● Memory hierarchy design becomes more crucial with recent multi-core processors:
  – Aggregate peak bandwidth grows with the number of cores:
    ● Intel Core i7 can generate two references per core per clock
    ● Four cores and a 3.2 GHz clock:
      – 25.6 billion 64-bit data references/second, plus
      – 12.8 billion 128-bit instruction references/second
      – = 409.6 GB/s!
    ● DRAM bandwidth is only 6% of this (25 GB/s)
● Requires:
  – Multi-port, pipelined caches
  – Two levels of cache per core
  – Shared third-level cache on chip

Performance and Power for Caches

● High-end microprocessors have >10 MB of on-chip cache
  – Consumes a large amount of the area and power budget
  – Both static (idle) and dynamic power are an issue
● Personal/mobile devices have
  – A 20-50x smaller power budget
  – 25%-50% of which is consumed by memory

Handling of Cache Misses

● When a word is not found in the cache, a miss occurs:
  – Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference
  – The lower level may be another cache or the main memory
  – Also fetch the other words contained within the block
    ● Takes advantage of spatial locality
  – Place the block into any location within its set, determined by the address
    ● block address MODULO number of sets

Cache Associativity and Writing Strategies

● n blocks per set → n-way set associative
  – Direct-mapped cache → one block per set
  – Fully associative → one set
● Writing to the cache: two strategies
  – Write-through
    ● Immediately update lower levels of the hierarchy
  – Write-back
    ● Only update lower levels of the hierarchy when an updated block is replaced
  – Both strategies use a write buffer to make writes asynchronous

Miss Rate and Types of Cache Misses

● Miss rate is...
  – The fraction of cache accesses that result in a miss
● Reasons for cache misses and their names
  – Compulsory
    ● The cache block is referenced for the first time
    ● Solution: hardware and software prefetchers
  – Capacity
    ● The cache block was discarded and later retrieved
    ● Solution: build a bigger cache or reorganize the code
  – Conflict
    ● The program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
    ● Solution: add padding or stride to the code; change the size of data
  – Coherency

Calculating Miss Rate

Misses / Instruction = (Miss rate × Memory accesses) / Instruction count
                     = Miss rate × (Memory accesses / Instruction)

Average memory access time = Hit time + Miss rate × Miss penalty

● Note that speculative and multithreaded processors may execute other instructions during a miss
  – This reduces the performance impact of misses
  – But it complicates analysis and introduces runtime dependence
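A minimal worked example of the two formulas above. All of the numeric values (hit time, miss rate, miss penalty, memory accesses per instruction) are assumptions chosen for illustration, not numbers from the slides:

/* Worked example of the miss-rate formulas; every parameter value below
 * is an assumption made for illustration. */
#include <stdio.h>

int main(void)
{
    double hit_time     = 1.0;   /* cycles              */
    double miss_rate    = 0.02;  /* 2% of accesses miss */
    double miss_penalty = 100.0; /* cycles              */
    double accesses_per_instruction = 1.5;

    /* Average memory access time = Hit time + Miss rate x Miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;   /* 1 + 0.02*100 = 3 cycles */

    /* Misses/Instruction = Miss rate x (Memory accesses / Instruction) */
    double misses_per_instruction = miss_rate * accesses_per_instruction;  /* 0.03 */

    printf("Average memory access time = %.2f cycles\n", amat);
    printf("Misses per instruction     = %.3f\n", misses_per_instruction);
    return 0;
}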
Cache and TLB Mapping Illustrated

(figure: a physical address split into tag, set, and block-offset fields; the set field selects a set of the cache and the tag is compared against the stored tags; the TLB maps the virtual address space — OS kernel, stack, heap, hardware registers — onto the physical address space)

Cache Design Considerations

● Larger block size
  – Reduced compulsory misses
  – Slightly reduced static power (smaller tag)
  – Sometimes increased capacity and conflict misses
● Bigger cache
  – Reduced miss rate
  – Increased hit time
  – Increased static & dynamic power
● Higher associativity
  – Reduced conflict miss rate
  – Increased hit time
  – Increased power
● More levels of cache
  – Reduced miss penalty
  – Access time = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
● Prioritize read misses over write misses
  – Reduced miss penalty
  – Need for a write buffer and hazard resolution
● Avoid address translation during indexing of the cache
  – Reduced hit time
  – Limits the size and structure of the cache

Categories of Metrics for Cache Optimization

● Reduce hit time
  – Small and simple first-level caches
  – "Way prediction"
  – Side effect: reduction in power consumption
● Increase cache bandwidth
  – Pipelined caches
  – Multibanked caches
  – Nonblocking caches
● Reduce miss penalty
  – "Critical word first"
  – Merging write buffers
● Reduce miss rate
  – Compiler optimization
  – Side effect: reduced power
● Reduce miss penalty and/or miss rate via parallelism
  – Hardware prefetching
  – Software and compiler prefetching
  – Side effect: increased power due to unused data

Optimization 1: Small/Simple 1st-level Caches

● The 1st-level cache should match the clock cycle of the CPU
● Cache addressing is a three-step process (sketched in code below)
  – Address the tag memory with the index portion of the address
  – Compare the found tag with the address tag
  – Select the matching block within the set
● High associativity helps with...
  – Address aliasing
  – Dealing with TLB and multiprocessing conflicts
● Low associativity...
  – Is faster
    ● 10% or more for each doubling of the set count
  – Consumes less power
  – Allows overlapping the tag check with data transmission
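A minimal sketch of the address decomposition behind the three-step lookup above and the "block address MODULO number of sets" placement rule. The cache geometry (64-byte blocks, 64 sets) and the example address are assumptions for illustration:

/* Splitting an address into block offset, set index, and tag.
 * The geometry (64-byte blocks, 64 sets) is an assumed example. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 64u   /* bytes per cache block       */
#define NUM_SETS   64u   /* number of sets in the cache */

int main(void)
{
    uint64_t addr = 0x7ffd1234abcdULL;            /* arbitrary example address        */

    uint64_t block_addr = addr / BLOCK_SIZE;      /* drop the block-offset bits       */
    uint64_t offset     = addr % BLOCK_SIZE;      /* byte position within the block   */
    uint64_t set        = block_addr % NUM_SETS;  /* block address MOD number of sets */
    uint64_t tag        = block_addr / NUM_SETS;  /* remaining high-order bits        */

    /* The set index addresses the tag memory; the stored tags of that set
     * are then compared against 'tag' to detect a hit. */
    printf("offset = %llu, set = %llu, tag = 0x%llx\n",
           (unsigned long long)offset,
           (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}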
Optimization 2: "Way Prediction"

● Idea: predict "the way"
  – Which block within the set will be accessed next
● The index multiplexor (mux) starts working early
● Implemented as extra bits kept in the cache for each block
● Prediction accuracy (simulated)
  – More effective for instruction caches
  – > 90% for two-way associative
  – > 80% for four-way associative
● On a misprediction
  – Try the other block
  – Change the prediction bits
  – Incur a penalty (commonly 1 cycle for slow CPUs)
● Examples: first used in the MIPS R10000 in the 1990s; ARM Cortex-A8

Optimization 3: Pipelined Cache Access

● Improves bandwidth
● History
  – Pentium (1993): 1 cycle
  – Pentium Pro (1995): 2 cycles
  – Pentium III (1999): 2 cycles
  – Pentium 4 (2000): 4 cycles
  – Intel Core i7 (2010): 4 cycles
● Interaction with branch prediction
  – Increased penalty for mispredicted branches
● Load instructions take longer
  – Waiting for the cache pipeline to finish
● Pipelining cache access makes high degrees of associativity easier to implement

Optimization 4: Nonblocking Caches

● If one instruction stalls on a cache miss, should the following instruction stall if its data is in the cache?
  – NO
● But you have to design a nonblocking cache (lockup-free cache)
  – Call it "hit under miss"
● Why stop at two instructions?
  – Make it "hit under multiple miss"
● What about two outstanding misses?
  – "Miss under miss"
  – The next-level cache has to be able to handle multiple misses
    ● Rarely the case in contemporary CPUs
● How long before our caches become...
  – Out-of-order, superscalar, …
  – Moving the CPU innovation into the memory hierarchy?

Optimization 5: Multibanked Caches

● Main memory has been organized in banks for increased bandwidth
● Caches can do this too
● Cache blocks are spread evenly across the banks
  – Sequential interleaving
    ● Bank 0: blk[0]   Bank 1: blk[1]   Bank 2: blk[2]   Bank 3: blk[3]
    ● Bank 0: blk[4]   Bank 1: blk[5]   Bank 2: blk[6]   Bank 3: blk[7]
● Modern use
  – ARM Cortex-A8
    ● 1-4 banks in the L2 cache
  – Intel Core i7
    ● 4 banks in L1 (2 memory accesses/cycle)
    ● 8 banks in L2
● Reduced power usage

Optimization 6: Critical Word First, Early Restart

● Forget cache blocks (lines); deal with words
  – Start executing an instruction when its word arrives, not when the entire block containing the word does
● Critical word first
  – What if the instruction needs the last word in the cache block?
    ● Go ahead and request that word from memory before requesting the others
  – Out-of-order loading of cache-block words
    ● As soon as the word arrives, pass it on to the CPU
    ● Continue fetching the remaining words
● Early restart
  – Don't change the order of words, but supply the missing word as soon as it arrives
    ● Won't help if the last word in the block is needed
● Useful for large cache blocks; effectiveness depends on the data stream

Optimization 7: Merging Write Buffer (Intro)

● Write buffer basics
  – The write buffer sits between the cache and memory
  – The write buffer stores both the data and its address
  – The write buffer allows the store instruction to finish immediately
    ● Unless the write buffer is full
  – Especially useful for write-through caches
    ● Write-back caches benefit when a block is replaced

Optimization 7: Merging Write Buffer

● Merging write buffer: a buffer that merges write requests
● When storing to a block that is already pending in the write buffer, the new data is merged into the existing entry

Optimization 8: Compiler Optimizations

● No hardware changes required
● Two main techniques
  – Loop interchange
    ● Requires 2 or more nested loops
    ● The order of the loops is changed to walk through memory in a more cache-friendly manner
  – Loop blocking
    ● Additional loop nests are introduced to deal with a small portion of an array at a time (called a block, but also a tile)

Optimization 8: Loop Interchange

/* Before: memory stride = 100 */
for (j = 0; j < 100; ++j)
    for (i = 0; i < 5000; ++i)
        x[i][j] = 2 * x[i][j];

/* After: memory stride = 1
 * Uses all words in a single cache block */
for (i = 0; i < 5000; ++i)
    for (j = 0; j < 100; ++j)
        x[i][j] = 2 * x[i][j];

Optimization 8: Blocking

/* Before: memory stride: A → 1, B → N */
for (i = 0; i < N; ++i)
    for (j = 0; j ...
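The "before" loop nest above walks A with stride 1 and B with stride N. A common way to illustrate blocking for this kind of kernel is a tiled matrix multiply; the array names, the size N, and the blocking factor BLOCK in this sketch are assumptions for illustration, not the original slide's code:

/* Sketch of loop blocking (tiling) for a matrix-multiply-style kernel.
 * Array names, N, and BLOCK are illustrative assumptions. */
#define N     512
#define BLOCK 32                  /* tile edge; chosen so a tile fits in cache */

double A[N][N], B[N][N], C[N][N]; /* globals are zero-initialized */

/* Unblocked: for every (i, j) the inner k loop streams through a whole
 * column of B (stride N), so B keeps missing in the cache for large N. */
void matmul(void)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double r = 0.0;
            for (int k = 0; k < N; ++k)
                r += A[i][k] * B[k][j];
            C[i][j] = r;
        }
}

/* Blocked: two extra loops (jj, kk) restrict the work to one BLOCK x BLOCK
 * tile of B at a time, so its elements are reused while still cached. */
void blocked_matmul(void)
{
    for (int jj = 0; jj < N; jj += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int i = 0; i < N; ++i)
                for (int j = jj; j < jj + BLOCK; ++j) {
                    double r = 0.0;
                    for (int k = kk; k < kk + BLOCK; ++k)
                        r += A[i][k] * B[k][j];
                    C[i][j] += r;   /* accumulate partial sums across kk tiles */
                }
}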