CS/CoE1541: Introduction to Computer Architecture
Memory hierarchy
Sangyeun Cho
Computer Science Department, University of Pittsburgh

CPU clock rate
. Apple II – MOS 6502 (1975): 1~2MHz
. Original IBM PC (1981) – Intel 8088: 4.77MHz
. Intel 486: 50MHz
. DEC Alpha
  • EV4 (1991): 100~200MHz
  • EV5 (1995): 266~500MHz
  • EV6 (1998): 450~600MHz
  • EV67 (1999): 667~850MHz
. Today's processors: 2~5GHz

DRAM access latency
. How long would it take to fetch a memory block from main memory?
  • Time to determine that the data is not on-chip (cache miss resolution time)
  • Time to talk to the memory controller (possibly on a separate chip)
  • Possible queuing delay at the memory controller
  • Time for opening a DRAM row
  • Time for selecting a portion of the selected row (CAS latency)
  • Time to transfer data from DRAM to the CPU, possibly through a separate chip
  • Time to fill caches as needed
. The total latency can be anywhere from 100~400 CPU cycles

Growing gap
. CPU clock rates have improved far faster than DRAM access latency, so the gap between processor speed and memory speed keeps growing.

Idea of memory hierarchy
. A ladder of levels between the CPU and storage, from smaller, faster, and more expensive per byte to larger, slower, and cheaper per byte:
  • Registers (in the CPU)
  • L1 cache (SRAM)
  • L2 cache (SRAM)
  • Main memory (DRAM)
  • Hard disk (magnetics)

Memory hierarchy goals
. To create an illusion of "fast and large" main memory
  • As fast as a small SRAM-based memory
  • As large (and cheap) as DRAM-based memory
. To provide the CPU with necessary data and instructions as quickly as possible
  • Frequently used data must be placed near the CPU
  • "Cache hit" when the CPU finds its data in the cache
  • Cache hit rate = # of cache hits / # of cache accesses
  • Avg. mem. access lat. (AMAL) = cache hit time + (1 – cache hit rate) × miss penalty
  • To decrease AMAL, decrease ____, increase ____, and reduce ____.
. Cache helps reduce bandwidth pressure on the memory bus
  • Cache becomes a "filter"

Cache
. (From Webster online): 1. a: a hiding place especially for concealing and preserving provisions or implements b: a secure place of storage … 3. a computer memory with very short access time used for storage of frequently or recently used instructions or data – called also cache memory
. Notions of cache memory were in use as early as 1967 in IBM mainframe machines

Cache operations
. Example: every A[i], B[i], and C[i] access below goes through the cache

    for (i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
        SUM = SUM + C[i];
    }

. The CPU asks the cache for A[0] ("Get A[0]" at address &A[0]); the cache checks "Do I have A[0]?"; if not, it fetches the block from main memory and then returns the data to the CPU.
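To make the AMAL formula from the "Memory hierarchy goals" slide concrete, here is a minimal C sketch; the hit time, hit rate, and miss penalty values are illustrative assumptions, not numbers taken from the slides.

    #include <stdio.h>

    /* AMAL = cache hit time + (1 - cache hit rate) * miss penalty.
       The numbers below are assumed for illustration only. */
    int main(void) {
        double hit_time     = 2.0;    /* cycles for a cache hit            */
        double hit_rate     = 0.95;   /* fraction of accesses that hit     */
        double miss_penalty = 200.0;  /* cycles to fetch a block from DRAM */

        double amal = hit_time + (1.0 - hit_rate) * miss_penalty;
        printf("AMAL = %.1f cycles\n", amal);   /* 2.0 + 0.05 * 200 = 12.0 */
        return 0;
    }

Lowering the hit time, raising the hit rate, or reducing the miss penalty each lowers AMAL, which is one way to read the fill-in-the-blank line on that slide.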
Cache operations (continued)
. CPU-cache operations
  • CPU: read request
    Cache: hit – provide data
    Cache: miss – talk to main memory; hold the CPU until the data arrives; fill the cache (if needed, kick out old data); provide data
  • CPU: write request
    Cache: hit – update data
    Cache: miss – put the data in a buffer (pretend the data is written); let the CPU proceed
. Cache-main memory operations
  • Cache: read request → Memory: provide data
  • Cache: write request → Memory: update memory

Cache organization
. Caches use "blocks" or "lines" (block > byte) as their granule of management
. Memory > cache: we can only keep a subset of memory blocks
. Cache is in essence a fixed-width hash table; the memory blocks kept in a cache are thus associated with their addresses (or "tagged")
. (Figure: an N-bit address indexes an address (tag) table and a data table; supporting circuitry compares the stored tag with the address and produces hit/miss and, on a hit, the specified data word.)

L1 cache vs. L2 cache
. Their basic parameters are similar
  • Associativity, block size, and cache size (the capacity of the data array)
. Address used to index
  • L1: typically the virtual address (to quickly index first)
    Using a virtual address causes some complexities
  • L2: typically the physical address
    The physical address is available by then
. System visibility
  • L1: not visible
  • L2: page coloring can affect hit rate
. Hardware organization (esp. in multicores)
  • L1: private
  • L2: often shared among cores

Key questions
. Where to place a block?
. How to find a block?
. Which block to replace for a new block?
. How to handle a write?
  • Writes make a cache design much more complex!

Where to place a block?
. Block placement is a matter of mapping
. If you have a simple rule to place data, you can find them later using the same rule

Direct-mapped cache
. 2^B-byte blocks, 2^M entries, N-bit address
. (Figure: the address is split into an (N−B−M)-bit tag, an M-bit index, and a B-bit block offset; the index selects one entry, the stored tag is compared (=?) with the address tag to produce hit/miss, and the selected 2^B-byte data entry supplies the data.)

2-way set-associative cache
. 2^B-byte blocks, 2^M entries, N-bit address
. (Figure: two direct-mapped halves ("ways") of 2^(M−1) entries each; the same index selects one entry from each way, both tags are compared in parallel, and at most one way signals a hit and supplies the data.)

Fully associative cache
. 2^B-byte blocks, 2^M entries, N-bit address
. CAM (content-addressable memory)
  • Input: content
  • Output: index
  • Used for the tag memory
. (Figure: the (N−B)-bit tag is compared against all 2^M stored tags at once; the B-bit offset selects the word within the 2^B-byte block; the match produces hit/miss and the data.)

Why caches work (or do not work)
. Principle of locality
  • Temporal locality: if location A is accessed now, it'll be accessed again soon
  • Spatial locality: if location A is accessed now, a nearby location (e.g., A+1) will be accessed soon
. Can you explain how locality is manifested in your program (at the source code level)?
  • Data
  • Instructions
. Can you write two programs doing the same thing, each having
  • A high degree of locality
  • Very low locality

Which block to replace?
. Which block to replace, to make room for a new block on a miss?
. Goal: minimize the # of total misses
. Trivial in a direct-mapped cache
. N choices in an N-way associative cache
. What is the optimal policy?
  • Replacing the "most remotely used" block – the one whose next use lies farthest in the future – is considered optimal
  • This is an oracle scheme – we do not know the future
. Replacement approaches
  • LRU (least recently used) – look at the past to predict the future
  • FIFO (first in, first out) – honor the new ones
  • Random – don't remember anything
  • Cost-based – what is the cost (e.g., latency) of bringing this block back again?
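As a rough sketch of the tag/index/offset split shown on the direct-mapped cache slide above, the following C fragment checks a single tag array for a hit; the field widths (B = 6, M = 10) and the tag/valid arrays are assumptions chosen for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    #define B 6    /* 2^B = 64-byte blocks (assumed)     */
    #define M 10   /* 2^M = 1024 cache entries (assumed) */

    static uint64_t tag_array[1u << M];  /* stored (N-B-M)-bit tags */
    static bool     valid[1u << M];      /* valid bit per entry     */

    /* Split the address into offset (B bits), index (M bits), and tag,
       then compare the stored tag to decide hit or miss. */
    bool dm_lookup(uint64_t addr) {
        uint64_t index = (addr >> B) & ((1u << M) - 1);
        uint64_t tag   = addr >> (B + M);
        return valid[index] && tag_array[index] == tag;
    }

    int main(void) {
        uint64_t a   = 0x12345678u;
        uint64_t idx = (a >> B) & ((1u << M) - 1);
        tag_array[idx] = a >> (B + M);   /* simulate filling the block */
        valid[idx]     = true;
        return dm_lookup(a) ? 0 : 1;     /* expect a hit */
    }

A set-associative cache repeats this comparison once per way within the indexed set; a fully associative cache compares the tag against every entry at once, which is what the CAM on the slide provides.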
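For the "two programs doing the same thing" question on the locality slide, a standard C illustration is summing a 2-D array in row-major versus column-major order; the array dimensions are arbitrary assumptions.

    #include <stdio.h>

    #define N_ROWS 1024
    #define N_COLS 1024
    static double a[N_ROWS][N_COLS];

    /* High spatial locality: C stores a[i][j] and a[i][j+1] next to each
       other, so consecutive iterations touch the same cache block. */
    double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < N_ROWS; i++)
            for (int j = 0; j < N_COLS; j++)
                s += a[i][j];
        return s;
    }

    /* Low spatial locality: consecutive iterations stride by a whole row,
       touching a different cache block almost every time. */
    double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < N_COLS; j++)
            for (int i = 0; i < N_ROWS; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }

Both functions compute the same sum; only their access order, and hence their cache behavior, differs.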
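To make the LRU policy on the replacement slide concrete, here is a minimal sketch of victim selection within one set of a set-associative cache; the 4-way set and the per-way timestamps are assumptions (real hardware usually uses cheaper approximations such as pseudo-LRU).

    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4   /* assumed associativity */

    struct way {
        uint64_t tag;
        uint64_t last_use;  /* "time" of the most recent access */
        int      valid;
    };

    /* Return the way to evict: an invalid way if one exists,
       otherwise the way whose last use is oldest (least recently used). */
    int lru_victim(const struct way set[WAYS]) {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid)
                return w;                    /* free slot: no eviction needed */
            if (set[w].last_use < set[victim].last_use)
                victim = w;                  /* older access => better victim */
        }
        return victim;
    }

    int main(void) {
        struct way set[WAYS] = {
            { .tag = 1, .last_use = 40, .valid = 1 },
            { .tag = 2, .last_use = 10, .valid = 1 },   /* least recently used */
            { .tag = 3, .last_use = 30, .valid = 1 },
            { .tag = 4, .last_use = 20, .valid = 1 },
        };
        printf("evict way %d\n", lru_victim(set));      /* prints 1 */
        return 0;
    }

FIFO would instead track fill order, and random replacement needs no per-way state at all.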
How to handle a write?
. Design considerations
  • Performance
  • Design complexity
. Allocation policy (on a miss)
  • Write-allocate
  • No-write-allocate
  • Write-validate
. Update policy
  • Write-through
  • Write-back
. Typical combinations
  • Write-back with write-allocate
  • Write-through with no-write-allocate

Write-through vs. write-back
. L1 cache: advantages of write-through + no-write-allocate
  • Simple control
  • No stalls for evicting dirty data on an L1 miss with an L2 hit
  • Avoids L1 cache pollution with results that are not read for a while
  • Avoids problems with coherence (L2 is consistent with L1)
  • Allows efficient transient error handling: parity protection in L1 and ECC in L2
  • What about high traffic between L1 and L2, esp. in a multicore processor?
. L2 cache: advantages of write-back + write-allocate
  • Typically reduces overall bus traffic by filtering all L1 write-through traffic
  • Better able to capture temporal locality of infrequently written memory locations
  • Provides a safety net for programs where write-allocate helps a lot
    Garbage-collected heaps
    Write-followed-by-read situations
    Linking loaders (if unified cache, need not be flushed before execution)
. Some ISAs/caches support explicitly installing cache blocks with empty contents or common values (e.g., zero)

Write buffer
. Waiting for a write to main memory is inefficient for a write-through scheme (slow)
  • Every write to the cache causes a write to main memory, and the processor stalls for that write
. Example: base CPI = 1.2, 13% of all instructions are a store, 10 cycles to write to memory
  • CPI = 1.2 + 0.13 × 10 = 2.5
  • More than doubled the CPI by waiting…

Write buffer (continued)
. A write buffer holds data while it is waiting to be written to memory; it frees the processor to continue executing instructions
  • The processor thinks the write was done (quickly)
. When the write buffer is not full
  • Send the written data to the buffer and eventually to main memory
. When the write buffer is full
  • Wait for the write buffer to become empty; then send the data to the write buffer

Write buffer (continued)
. Is one write buffer enough? Probably not
  • Writes may occur faster than they can be handled by the memory system
  • Writes tend to occur in bursts; memory may be fast but can't handle several writes at once
. Typical buffer size is 4~8 entries, possibly more

Revisiting write-back cache
. The cache is made up of a tag array and a data array
. In each tag-array entry, the address tag is the "id" of the associated data
. The "valid" (V) bit is set when the data is brought in
. The "dirty" (D) bit is set when the data is modified

Alpha 21264 example
. 64B block

More examples
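Tying the write-back discussion to the tag-array fields on the "Revisiting write-back cache" slide, here is a small C sketch of a write hit and an eviction in a write-back, write-allocate cache; the tag_entry structure and the writeback() stand-in are illustrative assumptions, not the course's reference design.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* One tag-array entry: address tag plus valid (V) and dirty (D) bits. */
    struct tag_entry {
        uint64_t tag;
        bool     valid;   /* set when the block is brought in     */
        bool     dirty;   /* set when the cached copy is modified */
    };

    /* Stand-in for copying a dirty block back to main memory. */
    static void writeback(uint64_t tag) {
        printf("write back dirty block, tag 0x%llx\n", (unsigned long long)tag);
    }

    /* A write hit only updates the cached copy and sets the dirty bit;
       main memory is updated later, when the block is evicted. */
    static void write_hit(struct tag_entry *e) {
        e->dirty = true;
    }

    /* On eviction, a dirty block must be written back before it is replaced. */
    static void evict_and_fill(struct tag_entry *e, uint64_t new_tag) {
        if (e->valid && e->dirty)
            writeback(e->tag);
        e->tag   = new_tag;
        e->valid = true;      /* newly filled block is valid ...         */
        e->dirty = false;     /* ... and clean until it is written again */
    }

    int main(void) {
        struct tag_entry e = {0};
        evict_and_fill(&e, 0x1a);  /* fill: nothing to write back          */
        write_hit(&e);             /* modify the cached copy               */
        evict_and_fill(&e, 0x2b);  /* replace: dirty block is written back */
        return 0;
    }

In a write-through cache the writeback() step disappears because memory was already updated on every store, which is exactly the traffic that the write buffer above is meant to hide.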