Memory Hierarchy Design

Chapter 2

Introduction

● Goal: an unlimited amount of memory with low latency
● Fast memory technology is more expensive per bit than slower memory
  – Use the principle of locality (spatial and temporal)
● Solution: organize the memory system into a hierarchy
  – The entire addressable memory space is available in the largest, slowest memory
  – Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
● Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
  – Gives the illusion of a large, fast memory being presented to the processor

Memory Hierarchies

Level        Server                Personal mobile device   Location
Registers    1000 B, 300 ps        500 B, 500 ps            On-die
L1 cache     64 kB, 1 ns           64 kB, 2 ns              On-die
L2 cache     256 kB, 3-10 ns       256 kB, 10-20 ns         On-die
L3 cache     2-4 MB, 10-20 ns      (none)                   In package
Memory       4-16 GB, 50-100 ns    256-512 MB, 50-100 ns    Off-chip
Disk/Flash   4-16 TB, 5-10 ms      4-6 GB, 25-50 µs         Storage

CPU vs. Memory: Performance vs Latency

(Figure.)

Memory Hierarchy Design Considerations

● Memory hierarchy design becomes more crucial with recent multi-core processors:
  – Aggregate peak bandwidth grows with the number of cores:
    ● An Intel Core i7 can generate two references per core per clock
    ● Four cores and a 3.2 GHz clock:
      – 25.6 billion 64-bit data references/second, plus
      – 12.8 billion 128-bit instruction references/second
      – = 409.6 GB/s!
● DRAM bandwidth is only 6% of this (25 GB/s)
● Requires:
  – Multi-port, pipelined caches
  – Two levels of cache per core
  – A shared third-level cache on chip

Performance and Power for Caches

● High-end microprocessors have >10 MB of on-chip cache
  – It consumes a large share of the area and power budget
  – Both static (idle) and dynamic power are an issue
● Personal/mobile devices have
  – A 20-50x smaller power budget
  – 25%-50% of which is consumed by memory

Handling of Cache Misses

● When a word is not found in the cache, a miss occurs:
  – Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference
    ● The lower level may be another cache or the main memory
  – Also fetch the other words contained within the block
    ● This takes advantage of spatial locality
  – Place the block into any location within its set, determined by the address:
    ● set = (block address) MOD (number of sets)

Cache Associativity and Writing Strategies

(Figure: a physical address split into tag, set, and block-location fields; the low bits select the location within the block.)

● n blocks per set → n-way set associative
  – Direct-mapped cache → one block per set
  – Fully associative → one set
● Writing to the cache: two strategies
  – Write-through
    ● Immediately update lower levels of the hierarchy
  – Write-back
    ● Only update lower levels of the hierarchy when an updated block is replaced
  – Both strategies use a write buffer to make writes asynchronous
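
To make the set mapping concrete, here is a minimal C sketch that decodes an address into block offset, set index, and tag. The geometry (64 B blocks, 128 sets) and all names are illustrative assumptions, not values from the slides.

#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 64 B blocks (6 offset bits) and 128 sets
 * (7 index bits); everything above those bits is the tag. */
#define BLOCK_BITS 6u
#define SET_BITS   7u
#define NUM_SETS   (1u << SET_BITS)

typedef struct {
    uint64_t tag;
    uint32_t set;    /* set = (block address) MOD (number of sets) */
    uint32_t offset; /* byte within the block */
} cache_addr;

static cache_addr decode(uint64_t paddr)
{
    cache_addr a;
    a.offset = (uint32_t)(paddr & ((1u << BLOCK_BITS) - 1));
    a.set    = (uint32_t)((paddr >> BLOCK_BITS) & (NUM_SETS - 1));
    a.tag    = paddr >> (BLOCK_BITS + SET_BITS);
    return a;
}

int main(void)
{
    cache_addr a = decode(0x7f3a92c4u);
    printf("tag=0x%llx set=%u offset=%u\n",
           (unsigned long long)a.tag, a.set, a.offset);
    return 0;
}

With this decomposition, a direct-mapped cache holds one block per set, while an n-way cache compares the n tags of the selected set in parallel, exactly the associativity spectrum described above.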
Miss Rate and Types of Cache Misses

● Miss rate is...
  – The fraction of cache accesses that result in a miss
● Reasons for cache misses, and their names
  – Compulsory
    ● A cache block is referenced for the first time
    ● Solution: hardware and software prefetchers
  – Capacity
    ● A cache block was discarded and later retrieved
    ● Solution: build a bigger cache, or reorganize the code
  – Conflict
    ● The program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
    ● Solution: add padding or stride to the code; change the size of the data
  – Coherency

Calculating Miss Rate

Misses / Instruction = (Miss rate × Memory accesses) / Instruction count
                     = Miss rate × (Memory accesses / Instruction)

Average memory access time = Hit time + Miss rate × Miss penalty

● Note that speculative and multithreaded processors may execute other instructions during a miss
  – This reduces the performance impact of misses
  – But it complicates analysis and introduces runtime dependence

Cache and TLB Mapping Illustrated

(Figure: a virtual address space, with OS kernel, stack, heap, and hardware registers, translated through the TLB into the physical address space, whose addresses are in turn mapped onto cache sets.)

Cache Design Considerations

● Larger block size
  – Reduced compulsory misses
  – Slightly reduced static power (smaller tag)
  – Sometimes increased capacity and conflict misses
● Bigger cache
  – Reduced miss rate
  – Increased hit time
  – Increased static & dynamic power
● Higher associativity
  – Reduced conflict miss rate
  – Increased hit time
  – Increased power
● More levels of cache
  – Reduced miss penalty:
    Access time = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))
● Prioritize read misses over write misses
  – Reduced miss penalty
  – Requires a write buffer and hazard resolution
● Avoid address translation during indexing of the cache
  – Reduced hit time
  – Limits the size and structure of the cache
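
As a sanity check on the access-time formulas above, here is a small worked example in C; every rate and latency is a made-up illustration, not a number from the slides.

#include <stdio.h>

/* One-level formula: AMAT = hit_time + miss_rate * miss_penalty */
static double amat1(double hit, double mrate, double penalty)
{
    return hit + mrate * penalty;
}

int main(void)
{
    /* Illustrative numbers only. */
    double l1_hit = 1.0,  l1_miss = 0.05;  /* cycles, fraction      */
    double l2_hit = 10.0, l2_miss = 0.20;  /* local L2 miss rate    */
    double mem    = 200.0;                 /* penalty out to DRAM   */

    /* Two levels: the L1 miss penalty is itself an AMAT:
     * AMAT = hit(L1) + mrate(L1) * (hit(L2) + mrate(L2) * penalty(L2)) */
    double l1_penalty = amat1(l2_hit, l2_miss, mem);
    double amat       = amat1(l1_hit, l1_miss, l1_penalty);

    printf("L1 miss penalty = %.1f cycles\n", l1_penalty); /* 50.0 */
    printf("AMAT            = %.2f cycles\n", amat);       /* 3.50 */
    return 0;
}

Note how the two-level expression is just the one-level formula applied twice, with the inner AMAT serving as the L1 miss penalty.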
Categories of Metrics for Cache Optimization

● Reduce hit time
  – Small and simple first-level caches
  – "Way prediction"
  – Side effect: reduction in power consumption
● Increase cache bandwidth
  – Pipelined caches
  – Multibanked caches
  – Nonblocking caches
● Reduce miss penalty
  – "Critical word first"
  – Merging write buffers
● Reduce miss rate
  – Compiler optimizations
  – Side effect: reduced power
● Reduce miss penalty and/or miss rate via parallelism
  – Hardware prefetching
  – Software and compiler prefetching
  – Side effect: increased power due to unused data

Optimization 1: Small/Simple 1st-level Caches

● The 1st-level cache should match the clock cycle of the CPU
● Cache addressing is a three-step process
  – Address the tag memory with the index portion of the address
  – Compare the found tag with the address tag
  – Choose the cache set
● High associativity helps with...
  – Address aliasing
  – Dealing with TLB and multiprocessing conflicts
● Low associativity...
  – Is faster
    ● Overlaps the tag check with the data transmission
    ● 10% or more for each doubling of the set count
  – Consumes less power

Optimization 2: "Way Prediction"

● Idea: predict "the way", i.e., which block within the set will be accessed next
● The index multiplexor (mux) can start working early
● Implemented as extra bits kept in the cache for each block
● Prediction accuracy (simulated)
  – More effective for instruction caches
  – > 90% for two-way associative
  – > 80% for four-way associative
● On a misprediction
  – Try the other block
  – Change the prediction bits
  – Incur a penalty (commonly 1 cycle for slow CPUs)
● Examples: first used in the MIPS R10000 in the 1990s; ARM Cortex-A8

Optimization 3: Pipelined Cache Access

● Improves bandwidth
● History
  – Pentium (1993): 1 cycle
  – Pentium Pro (1995): 2 cycles
  – Pentium III (1999): 2 cycles
  – Pentium 4 (2000): 4 cycles
  – Intel Core i7 (2010): 4 cycles
● Interaction with branch prediction
  – Increased penalty for mispredicted branches
● Load instructions take longer
  – They wait for the cache pipeline to finish
● Pipelined cache cycles make high degrees of associativity easier to implement

Optimization 4: Nonblocking Caches

● If one instruction stalls on a cache miss, should the following instruction stall if its data is in the cache?
  – No
● But you have to design a nonblocking cache (lockup-free cache)
  – Call it "hit under miss"
● Why stop at two instructions?
  – Make it "hit under multiple miss"
● What about two outstanding misses?
  – "Miss under miss"
  – The next-level cache has to be able to handle multiple misses
    ● Rarely the case in contemporary CPUs
● How long before our caches become...
  – Out-of-order, superscalar, …
  – Moving the CPU innovation into the memory hierarchy?

Optimization 5: Multibanked Caches

● Main memory has long been organized into banks for increased bandwidth
● Caches can do this too
● Cache blocks are spread evenly across the banks
  – Sequential interleaving:
    ● Bank 0: blk[0]   Bank 1: blk[1]   Bank 2: blk[2]   Bank 3: blk[3]
    ● Bank 0: blk[4]   Bank 1: blk[5]   Bank 2: blk[6]   Bank 3: blk[7]
● Modern use
  – ARM Cortex-A8
    ● 1-4 banks in the L2 cache
  – Intel Core i7
    ● 4 banks in L1 (2 memory accesses/cycle)
    ● 8 banks in L2
● Reduced power usage

Optimization 6: Critical Word 1st, Early Restart

● Forget cache blocks (lines); deal with words
  – Start executing an instruction as soon as its word arrives, not when the entire block containing the word does
● Critical word first
  – What if the instruction needs the last word in the cache block?
    ● Go ahead and request that word from memory before requesting the others
  – Out-of-order loading of the cache block's words
    ● As soon as the word arrives, pass it on to the CPU
    ● Continue fetching the remaining words
● Early restart
  – Don't change the order of the words, but supply the missing word as soon as it arrives
    ● It won't help if the last word of the block is needed
● Useful for large cache blocks; the benefit depends on the data stream

Optimization 7: Merging Write Buffer (Intro)

● Write buffer basics
  – The write buffer sits between the cache and memory
  – The write buffer stores both the data and its address
  – The write buffer allows the store instruction to finish immediately
    ● Unless the write buffer is full
  – Especially useful for write-through caches
    ● Write-back caches benefit when a block is replaced

Optimization 7: Merging Write Buffer

● Merging write buffer: a buffer that merges write requests
● When storing to a block that is already pending in the write buffer, the new data is combined into the existing entry rather than taking up a new one
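
Below is a minimal sketch of the merge check such a write buffer performs, assuming four entries of 64-byte blocks, byte-granularity dirty bits, and aligned stores of at most 8 bytes; all sizes and names are illustrative assumptions, not the slides' design.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WB_ENTRIES 4   /* assumed buffer depth  */
#define BLOCK_SIZE 64  /* assumed block size    */

typedef struct {
    bool     valid;
    uint64_t block_addr;          /* which block is pending      */
    uint8_t  data[BLOCK_SIZE];
    uint64_t dirty;               /* one bit per dirty byte      */
} wb_entry;

static wb_entry wbuf[WB_ENTRIES];

/* Returns true if the store was absorbed; false means the buffer
 * is full and the CPU would stall while it drains to memory.
 * Assumes len <= 8 and that the store does not cross a block. */
bool wb_store(uint64_t addr, const void *src, unsigned len)
{
    uint64_t blk = addr / BLOCK_SIZE;
    unsigned off = addr % BLOCK_SIZE;

    for (int i = 0; i < WB_ENTRIES; ++i) {      /* merge path */
        if (wbuf[i].valid && wbuf[i].block_addr == blk) {
            memcpy(wbuf[i].data + off, src, len);
            wbuf[i].dirty |= ((1ull << len) - 1) << off;
            return true;
        }
    }
    for (int i = 0; i < WB_ENTRIES; ++i) {      /* allocate path */
        if (!wbuf[i].valid) {
            wbuf[i].valid = true;
            wbuf[i].block_addr = blk;
            memcpy(wbuf[i].data + off, src, len);
            wbuf[i].dirty = ((1ull << len) - 1) << off;
            return true;
        }
    }
    return false;
}

Without the merge path, four 8-byte stores to the same block would occupy four entries and drain as four memory writes; with it, they occupy one entry and drain as a single block write.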

Optimization 8: Compiler Optimizations

● No hardware changes required
● Two main techniques
  – Loop interchange
    ● Requires 2 or more nested loops
    ● The order of the loops is changed to walk through memory in a more cache-friendly manner
  – Loop blocking
    ● Additional loop nests are introduced to deal with a small portion of an array at a time (called a block, but also a tile)

Optimization 8: Loop Interchange

/* Before: memory stride = 100 */
for (j = 0; j < 100; ++j)
    for (i = 0; i < 5000; ++i)
        x[i][j] = 2 * x[i][j];

/* After: memory stride = 1
 * Uses all words in a single cache block */
for (i = 0; i < 5000; ++i)
    for (j = 0; j < 100; ++j)
        x[i][j] = 2 * x[i][j];

Optimization 8: Blocking

/* Before: memory stride: A → 1, B → N */
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        A[i][j] = B[j][i];   /* loop body assumed: A written with
                                stride 1, B read with stride N */
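
The "Before" loop above is truncated in this copy of the slides; the body shown (A written with stride 1, B read with stride N) is an assumption consistent with the stride comment. Under that assumption, a blocked (tiled) version might look like the following sketch, where TILE is an illustrative tuning parameter rather than a value from the slides.

#include <stddef.h>

enum { TILE = 64 };  /* assumed tile size, tuned per cache in practice */

/* Process one TILE x TILE tile at a time, so the stride-n accesses
 * to B stay inside a region small enough to remain cache-resident. */
void copy_transposed_blocked(size_t n, double A[n][n], double B[n][n])
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < ii + TILE && i < n; ++i)
                for (size_t j = jj; j < jj + TILE && j < n; ++j)
                    A[i][j] = B[j][i];
}

The tile size is chosen so that one tile of each array fits comfortably in the targeted cache level; with TILE = 64, two tiles of doubles occupy 2 × 64 × 64 × 8 B = 64 kB, on the order of the L1 sizes in the hierarchy table above.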