CS/CoE1541: Introduction to Computer Architecture

Memory Hierarchy

Sangyeun Cho
Computer Science Department, University of Pittsburgh

CPU clock rate
. Apple II – MOS 6502 (1975): 1~2 MHz
. Original IBM PC (1981) – Intel 8088: 4.77 MHz
. Intel 486: 50 MHz
. DEC Alpha
  • EV4 (1991): 100~200 MHz
  • EV5 (1995): 266~500 MHz
  • EV6 (1998): 450~600 MHz
  • EV67 (1999): 667~850 MHz
. Today's processors: 2~5 GHz

DRAM access latency
. How long would it take to fetch a memory block from main memory?
  • Time to determine that the data is not on-chip (cache miss resolution time)
  • Time to talk to the memory controller (possibly on a separate chip)
  • Possible queuing delay at the memory controller
  • Time for opening a DRAM row
  • Time for selecting a portion of the selected row (CAS latency)
  • Time to transfer data from DRAM to the CPU, possibly through a separate chip
  • Time to fill caches as needed
. The total latency can be anywhere from 100~400 CPU cycles

Growing gap
[Figure: the growing gap between CPU clock rates and DRAM access latency]

Idea of memory hierarchy
[Figure: registers, L1 cache (SRAM), L2 cache (SRAM), main memory (DRAM), and hard disk (magnetics); levels closer to the CPU are smaller, faster, and more expensive per byte, while levels farther away are larger, slower, and cheaper per byte]

Memory hierarchy goals
. To create an illusion of "fast and large" main memory
  • As fast as a small SRAM-based memory
  • As large (and cheap) as a DRAM-based memory
. To provide the CPU with necessary data and instructions as quickly as possible
  • Frequently used data must be placed near the CPU
  • "Cache hit" when the CPU finds its data in the cache
  • Cache hit rate = # of cache hits / # of cache accesses
  • Average memory access latency (AMAL) = cache hit time + (1 – cache hit rate) × miss penalty
  • To decrease AMAL, decrease ____, increase ____, and reduce ____.
. Cache helps reduce bandwidth pressure on the memory bus
  • Cache becomes a "filter"

Cache
. (From Webster online): 1. a: a hiding place especially for concealing and preserving provisions or implements b: a secure place of storage … 3. a computer memory with very short access time used for storage of frequently or recently used instructions or data – called also cache memory
. Notions of cache memory were used as early as 1967 in IBM mainframe machines

Cache operations
[Figure: the CPU asks the cache to "Get A[0]"; the cache checks "Do I have A[0]?", misses, sends &A[0] to main memory, and returns the data to the CPU]

for (i = 0; i < N; i++) {
  C[i] = A[i] + B[i];   /* A, B, and C are scanned sequentially */
  SUM = SUM + C[i];
}

Cache operations
. CPU–cache operations
  • CPU: read request
    Cache: hit – provide data
    Cache: miss – talk to main memory; hold the CPU until the data arrives; fill the cache (if needed, kick out old data); provide data
  • CPU: write request
    Cache: hit – update data
    Cache: miss – put the data in a buffer (pretend the data is written); let the CPU proceed
. Cache–main memory operations
  • Cache: read request – Memory: provide data
  • Cache: write request – Memory: update memory

Cache organization
. Caches use "blocks" or "lines" (a block is larger than a byte) as their granule of management
. Memory > cache: we can only keep a subset of memory blocks
. A cache is in essence a fixed-width hash table; the memory blocks kept in a cache are thus associated with their addresses (or "tagged")
[Figure: an address table (tags) and a data table of 2^N words plus supporting circuitry; an N-bit address probes both, producing hit/miss and the specified data word]

L1 cache vs. L2 cache
. Their basic parameters are similar
  • Associativity, block size, and cache size (the capacity of the data array)
. Address used to index
  • L1: typically the virtual address (to quickly index first)
    Using a virtual address causes some complexities
  • L2: typically the physical address
    The physical address is available by then
. System visibility
  • L1: not visible
  • L2: page coloring can affect the hit rate
. Hardware organization (esp. in multicores)
  • L1: private
  • L2: often shared among cores

Key questions
. Where to place a block?
. How to find a block?
. Which block to replace for a new block?
. How to handle a write?
  • Writes make a cache design much more complex!

Where to place a block?
. Block placement is a matter of mapping
. If you have a simple rule to place data, you can find it later using the same rule

Direct-mapped cache
. 2^B-byte block
. 2^M entries
. N-bit address
[Figure: the address is split into an (N-B-M)-bit tag, an M-bit index, and a B-bit block offset; the index selects one entry, the stored tag is compared (=?) against the address tag to produce hit/miss, and the data array supplies the data]

2-way set-associative cache
. 2^B-byte block
. 2^M entries
. N-bit address
[Figure: two direct-mapped halves of 2^(M-1) entries each; both ways of the selected set are probed in parallel, and a tag match in either way produces a hit and supplies the data]

Fully associative cache
. 2^B-byte block
. 2^M entries
. N-bit address
. CAM (content-addressable memory)
  • Input: content
  • Output: index
  • Used for the tag memory
[Figure: the (N-B)-bit tag is matched against all entries at once; the address has no index field, only the B-bit block offset]

Why caches work (or do not work)
. Principle of locality
  • Temporal locality: if location A is accessed now, it will be accessed again soon
  • Spatial locality: if location A is accessed now, a nearby location (e.g., A+1) will be accessed soon
. Can you explain how locality is manifested in your program (at the source code level)?
  • Data
  • Instructions
. Can you write two programs doing the same thing, each having (see the sketch below for one such pair)
  • A high degree of locality
  • Very low locality
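
One answer to the question above, as a hedged sketch that is not taken from the slides: the two functions below compute the same sum over a large array, but the row-major walk matches C's array layout and enjoys high spatial locality, while the column-major walk strides across rows and shows very low locality.

#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static double a[ROWS][COLS];    /* C lays this array out in row-major order */

/* Good locality: consecutive iterations touch consecutive addresses,
 * so most accesses fall in the same cache block (spatial locality). */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* Poor locality: consecutive iterations are COLS * sizeof(double) bytes
 * apart, so nearly every access can land in a different cache block. */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}

On a typical machine the column-major version misses in the cache far more often, even though both versions execute the same number of arithmetic operations.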

Which block to replace?
. Which block to replace, to make room for a new block on a miss?
. Goal: minimize the total number of misses
. Trivial in a direct-mapped cache
. N choices in an N-way set-associative cache
. What is the optimal policy?
  • MRU (most remotely used) – evict the block whose next use lies farthest in the future – is considered optimal
  • This is an oracle scheme – we do not know the future
. Replacement approaches (a small LRU sketch follows this list)
  • LRU (least recently used) – look at the past to predict the future
  • FIFO (first in, first out) – honor the new ones
  • Random – don't remember anything
  • Cost-based – what is the cost (e.g., latency) of bringing this block in again?
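
To make the LRU approach concrete, here is a rough sketch of my own (not the slides' design) of a lookup in a set-associative cache with LRU replacement, assuming 32-bit addresses, 64-byte blocks, 128 sets, and 4 ways; the names cache_access and now_tick are invented for the example.

#include <stdint.h>
#include <stdbool.h>

#define WAYS        4
#define SETS        128                 /* 2^7 sets */
#define OFFSET_BITS 6                   /* 64-byte blocks */
#define INDEX_BITS  7

struct way {
    bool     valid;
    uint32_t tag;
    uint32_t last_used;                 /* timestamp used to find the LRU way */
};

static struct way cache[SETS][WAYS];
static uint32_t now_tick;               /* global access counter */

/* Probe the cache: return true on a hit; on a miss, install the tag in the
 * victim way chosen by the LRU policy. */
bool cache_access(uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (SETS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    struct way *set = cache[index];
    int victim = -1;

    now_tick++;
    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {   /* hit: refresh LRU stamp */
            set[w].last_used = now_tick;
            return true;
        }
        if (victim < 0 || !set[w].valid ||
            (set[victim].valid && set[w].last_used < set[victim].last_used))
            victim = w;                            /* prefer empty ways, else oldest */
    }

    /* Miss: a real cache would fetch the block from the next level here,
     * then fill the victim way with its data. */
    set[victim].valid = true;
    set[victim].tag = tag;
    set[victim].last_used = now_tick;
    return false;
}

Real designs typically approximate this with a few status bits per set (e.g., pseudo-LRU) rather than full timestamps.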

How to handle a write?
. Design considerations
  • Performance
  • Design complexity
. Allocation policy (on a miss)
  • Write-allocate
  • No-write-allocate
  • Write-validate
. Update policy
  • Write-through
  • Write-back
. Typical combinations
  • Write-back with write-allocate
  • Write-through with no-write-allocate

Write-through vs. write-back
. L1 cache: advantages of write-through + no-write-allocate
  • Simple control
  • No stalls for evicting dirty data on an L1 miss with an L2 hit
  • Avoids L1 cache pollution with results that are not read for a while
  • Avoids problems with coherence (L2 is consistent with L1)
  • Allows efficient transient error handling: parity protection in L1 and ECC in L2
  • What about high traffic between L1 and L2, esp. in a multicore processor?
. L2 cache: advantages of write-back + write-allocate
  • Typically reduces overall bus traffic by filtering all L1 write-through traffic
  • Better able to capture temporal locality of infrequently written memory locations
  • Provides a safety net for programs where write-allocate helps a lot
    Garbage-collected heaps
    Write-followed-by-read situations
    Linking loaders (with a unified cache, the cache need not be flushed before execution)
. Some ISAs/caches support explicitly installing cache blocks with empty contents or common values (e.g., zero)

Write buffer
. Waiting for a write to main memory is inefficient for a write-through scheme (slow)
  • Every write to the cache causes a write to main memory, and the processor stalls for that write
. Example: base CPI = 1.2, 13% of all instructions are stores, 10 cycles to write to memory
  • CPI = 1.2 + 0.13 × 10 = 2.5
  • More than doubled the CPI by waiting…

Write buffer
. A write buffer holds data while it is waiting to be written to memory; it frees the processor to continue executing instructions
  • The processor thinks the write was done (quickly)
. When the write buffer has room
  • Send the written data to the buffer and eventually to main memory
. When the write buffer is full
  • Wait for the write buffer to drain; then send the data to the write buffer

Write buffer
. Is one write buffer entry enough? Probably not
  • Writes may occur faster than they can be handled by the memory system
  • Writes tend to occur in bursts; memory may be fast but can't handle several writes at once
. Typical buffer size is 4~8 entries, possibly more

Revisiting write-back cache
[Figure: a tag array and a data array; each tag entry holds a dirty bit (D), a valid bit (V), and an address tag]
. The address tag is the "id" of the associated data
. The "valid" bit is set when data is brought in
. The "dirty" bit is set when data is modified (see the sketch after the examples below)

Alpha 21264 example / More examples
[Figures: Alpha 21264 cache organization with 64-byte blocks, and further example caches]
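
To tie the valid and dirty bits above to the write-back + write-allocate policy, here is a minimal hedged sketch; the line structure, the toy backing store, and the name cache_write_byte are assumptions for illustration, and only a single line is shown rather than a full indexed cache.

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64                     /* bytes per block, as in the 64B-block example */

static uint8_t main_memory[1u << 20];     /* toy 1 MiB backing store; addresses are
                                           * assumed to fall inside it */

struct cache_line {                       /* assumed zero-initialized (invalid) before use */
    int      valid;                       /* set when a block has been brought in */
    int      dirty;                       /* set when the cached copy has been modified */
    uint32_t block_addr;                  /* base address of the cached block */
    uint8_t  data[BLOCK_SIZE];
};

/* Write one byte with a write-back + write-allocate policy. */
void cache_write_byte(struct cache_line *line, uint32_t addr, uint8_t value) {
    uint32_t block  = addr & ~(uint32_t)(BLOCK_SIZE - 1);    /* block base address */
    uint32_t offset = addr & (BLOCK_SIZE - 1);

    if (!line->valid || line->block_addr != block) {          /* write miss */
        if (line->valid && line->dirty)                       /* evict: flush dirty victim */
            memcpy(&main_memory[line->block_addr], line->data, BLOCK_SIZE);
        memcpy(line->data, &main_memory[block], BLOCK_SIZE);  /* write-allocate: fetch block */
        line->block_addr = block;
        line->valid = 1;
        line->dirty = 0;
    }
    line->data[offset] = value;           /* update the cached copy only */
    line->dirty = 1;                      /* main memory is now stale until eviction */
}

The key point is the last two statements: the write updates only the cached copy and marks it dirty, so main memory is brought up to date only when the block is eventually evicted.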
