CS/CoE1541: Introduction to Computer Architecture
Memory hierarchy
Sangyeun Cho
Computer Science Department, University of Pittsburgh

CPU clock rate

. Apple II – MOS 6502 (1975): 1~2 MHz
. Original IBM PC (1981) – Intel 8088: 4.77 MHz
. Intel 486: 50 MHz
. DEC Alpha
  • EV4 (1991): 100~200 MHz
  • EV5 (1995): 266~500 MHz
  • EV6 (1998): 450~600 MHz
  • EV67 (1999): 667~850 MHz
. Today's processors – 2~5 GHz

Growing gap

(processor speed has improved much faster than DRAM access time)

DRAM access latency

. How long would it take to fetch a memory block from main memory?
  • Time to determine that the data is not on-chip (cache miss resolution time)
  • Time to talk to the memory controller (possibly on a separate chip)
  • Possible queuing delay at the memory controller
  • Time to open a DRAM row
  • Time to select a portion of the selected row (CAS latency)
  • Time to transfer the data from DRAM to the CPU (possibly through a separate chip)
  • Time to fill caches as needed

. The total latency can be anywhere from 100~400 CPU cycles

Idea of memory hierarchy

. Levels, from smallest/fastest/most expensive per byte to largest/slowest/cheapest per byte:
  • CPU registers
  • L1 cache (SRAM)
  • L2 cache (SRAM)
  • Main memory (DRAM)
  • Hard disk (magnetics)

Memory hierarchy goals

. To create an illusion of "fast and large" main memory
  • As fast as a small SRAM-based memory
  • As large (and cheap) as a DRAM-based memory

. To provide the CPU with necessary data and instructions as quickly as possible
  • Frequently used data must be placed near the CPU
  • "Cache hit": the CPU finds its data in the cache
  • Cache hit rate = # of cache hits / # of cache accesses
  • Avg. memory access latency (AMAL) = cache hit time + (1 – cache hit rate) × miss penalty
  • To decrease AMAL, decrease ____, increase ____, and reduce ____

. Cache helps reduce bandwidth pressure on the memory bus
  • Cache becomes a "filter"


Cache

. (From Webster online): 1. a: a hiding place especially for concealing and preserving provisions or implements  b: a secure place of storage … 3. a computer memory with very short access time used for storage of frequently or recently used instructions or data – called also cache memory

. Notions of cache memory were used as early as 1967 in IBM mainframe machines

Cache operations

(figure: a loop reads A[0]; the CPU asks the cache "Do I have A[0]?"; on a miss ("No") the cache requests &A[0] from main memory ("Get A[0]"), and the data is returned to the cache and then to the CPU)

Cache operations

. CPU–cache operations
  • CPU: read request
    Cache: hit – provide the data
    Cache: miss – talk to main memory; hold the CPU until the data arrives; fill the cache (if needed, kick out old data); provide the data
  • CPU: write request
    Cache: hit – update the data
    Cache: miss – put the data in a buffer (pretend the data is written); let the CPU proceed

. Cache–main memory operations
  • Cache: read request
    Memory: provide the data
  • Cache: write request
    Memory: update memory

Cache organization

. Caches use "blocks" or "lines" (block > byte) as their granule of management
. Memory > cache: we can only keep a subset of memory blocks
. Cache is in essence a fixed-width hash table; the memory blocks kept in a cache are thus associated with their addresses (or "tagged")

(figure: an N-bit address indexes an address (tag) table and a data table of 2^N words; supporting circuitry compares tags to produce hit/miss and the specified data word)


L1 cache vs. L2 cache

. Their basic parameters are similar
  • Associativity, block size, and cache size (the capacity of the data array)
. Address used to index
  • L1: typically the virtual address (to start indexing quickly)
    Using a virtual address causes some complexities
  • L2: typically the physical address
    The physical address is available by then
. System visibility
  • L1: not visible
  • L2: page coloring can affect the hit rate
. Hardware organization (esp. in multicores)
  • L1: private
  • L2: often shared among cores

Key questions

. Where to place a block?
. How to find a block?
. Which block to replace for a new block?
. How to handle a write?
  • Writes make a cache design much more complex!

Where to place a block?

. Block placement is a matter of mapping
. If you have a simple rule to place data, you can find them later using the same rule

Direct-mapped cache

. 2^B-byte block
. 2^M entries
. N-bit address, split into:
  • Tag: (N − B − M) bits
  • Index: M bits
  • Block offset: B bits

(figure: the M index bits select one of the 2^M entries; the stored tag is compared ("=?") against the address tag to produce hit/miss; the B offset bits select the data within the 2^B-byte block)
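As a concrete illustration of the tag/index/offset split, here is a minimal C sketch of a direct-mapped lookup. The block and entry counts (B = 6, M = 10), the 32-bit address width, and the helper names are illustrative assumptions, not part of the slides.

    #include <stdint.h>
    #include <stdbool.h>

    /* Minimal direct-mapped lookup sketch (illustrative, not a full simulator). */
    #define B 6                       /* 2^6 = 64-byte blocks (assumed)  */
    #define M 10                      /* 2^10 = 1024 entries (assumed)   */

    typedef struct {
        bool     valid;
        uint32_t tag;
    } line_t;

    static line_t cache[1u << M];

    bool lookup(uint32_t addr)
    {
        uint32_t index = (addr >> B) & ((1u << M) - 1);  /* middle M bits      */
        uint32_t tag   = addr >> (B + M);                /* upper N-B-M bits   */
        line_t *line = &cache[index];

        if (line->valid && line->tag == tag)
            return true;              /* hit */

        /* miss: fetch the block from the next level, then install it */
        line->valid = true;
        line->tag   = tag;
        return false;
    }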


2-way set-associative cache

. 2^B-byte block
. 2^M entries
. N-bit address

(figure: the address indexes two direct-mapped halves, each with 2^(M-1) entries; both tags are checked in parallel, and the matching way supplies the data, producing hit/miss)

Fully associative cache

. 2^B-byte block
. 2^M entries
. N-bit address
. CAM (content-addressable memory)
  • Input: content
  • Output: index
  • Used for the tag memory

(figure: the (N − B)-bit tag is matched against all 2^M entries in the CAM; the resulting M-bit index selects the 2^B-byte data block, producing hit/miss and the data)

Why caches work (or do not work)

. Principle of locality
  • Temporal locality
    If location A is accessed now, it'll be accessed again soon
  • Spatial locality
    If location A is accessed now, a nearby location (e.g., A+1) will be accessed soon

. Can you explain how locality is manifested in your program (at the source code level)?
  • Data
  • Instructions

. Can you write two programs doing the same thing, each having
  • A high degree of locality
  • Very low locality
  (see the sketch below)
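A minimal illustration of the two-program question above: both loops compute the same sum, but the row-major walk enjoys spatial locality while the column-major walk defeats it. The array size N is an assumption chosen to exceed typical cache sizes.

    /* Two loops that compute the same sum over a large 2D array. */
    #define N 2048
    static double a[N][N];

    double sum_row_major(void)          /* high locality */
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];           /* consecutive addresses */
        return s;
    }

    double sum_col_major(void)          /* low locality */
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];           /* stride of N*sizeof(double) bytes */
        return s;
    }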

Which block to replace?

. Which block to replace, to make room for a new block on a miss?
. Goal: minimize the # of total misses
. Trivial in a direct-mapped cache
. N choices in an N-way associative cache

. What is the optimal policy?
  • MRU ("most remotely used") is considered optimal
  • This is an oracle scheme – we do not know the future

. Replacement approaches
  • LRU (least recently used) – look at the past to predict the future
  • FIFO (first in, first out) – honor the new ones
  • Random – don't remember anything
  • Cost-based – what is the cost (e.g., latency) of bringing this block again?


How to handle a write?

. Design considerations
  • Performance
  • Design complexity

. Allocation policy (on a miss)
  • Write-allocate
  • No-write-allocate
  • Write-validate

. Update policy
  • Write-through
  • Write-back

. Typical combinations
  • Write-back with write-allocate
  • Write-through with no-write-allocate

Write-through vs. write-back

. L1 cache: advantages of write-through + no-write-allocate
  • Simple control
  • No stalls for evicting dirty data on an L1 miss with an L2 hit
  • Avoids L1 cache pollution with results that are not read for a while
  • Avoids problems with coherence (L2 is consistent with L1)
  • Allows efficient transient error handling: parity protection in L1 and ECC in L2
  • What about high traffic between L1 and L2, esp. in a multicore processor?

. L2 cache: advantages of write-back + write-allocate
  • Typically reduces overall bus traffic by filtering all L1 write-through traffic
  • Better able to capture temporal locality of infrequently written memory locations
  • Provides a safety net for programs where write-allocate helps a lot
    Garbage-collected heaps
    Write-followed-by-read situations
    Linking loaders (if unified cache, need not be flushed before execution)

. Some ISAs/caches support explicitly installing cache blocks with empty contents or common values (e.g., zero)

Write buffer

. Waiting for a write to main memory is inefficient for a write-through scheme (slow)
  • Every write to the cache causes a write to main memory, and the processor stalls for that write

. Example: base CPI = 1.2, 13% of all instructions are stores, 10 cycles to write to memory
  • CPI = 1.2 + 0.13 × 10 = 2.5
  • More than doubled the CPI by waiting…

Write buffer

. A write buffer holds data while it is waiting to be written to memory; it frees the processor to continue executing instructions
  • The processor thinks the write was done (quickly)

. When the write buffer is not full
  • Send the written data to the buffer and eventually to main memory

. When the write buffer is full
  • Wait for the write buffer to drain; then send the data to the write buffer
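A minimal sketch of the write-buffer idea: stores are queued and drained to memory in the background, and the processor only stalls when the buffer is full. The depth of 4 and the function names are assumptions (the next slide notes typical sizes of 4~8 entries).

    #include <stdint.h>
    #include <stdbool.h>

    /* Minimal FIFO write buffer sketch for a write-through cache. */
    #define WB_DEPTH 4

    typedef struct { uint32_t addr; uint32_t data; } wb_entry_t;

    static wb_entry_t buf[WB_DEPTH];
    static unsigned head, tail, count;

    /* Returns false (stall) when the buffer is full; otherwise the store is
       queued and the processor can continue immediately.                   */
    bool wb_enqueue(uint32_t addr, uint32_t data)
    {
        if (count == WB_DEPTH)
            return false;                 /* processor must wait */
        buf[tail] = (wb_entry_t){ addr, data };
        tail = (tail + 1) % WB_DEPTH;
        count++;
        return true;
    }

    /* Drained in the background whenever the memory bus is free. */
    bool wb_drain_one(wb_entry_t *out)
    {
        if (count == 0)
            return false;
        *out = buf[head];
        head = (head + 1) % WB_DEPTH;
        count--;
        return true;
    }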


Write buffer

. Is one write buffer enough?

. Probably not
  • Writes may occur faster than they can be handled by the memory system
  • Writes tend to occur in bursts; memory may be fast but can't handle several writes at once

. Typical buffer size is 4~8 entries, possibly more

Revisiting write-back cache

(figure: each entry in the tag array holds a dirty bit (D), a valid bit (V), and an address tag, alongside the data array)

. The address tag is the "id" of the associated data
. The "dirty" bit is set when the data is modified
. The "valid" bit is set when the data is brought in
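A small C sketch of the per-line state shown above, assuming 64-byte blocks; the struct and helper names are illustrative, not from the slides.

    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_BYTES 64

    typedef struct {
        bool     valid;                 /* set when the block is brought in      */
        bool     dirty;                 /* set when the block is modified        */
        uint32_t tag;                   /* identifies which memory block is here */
        uint8_t  data[BLOCK_BYTES];
    } wb_line_t;

    /* On a write hit we only mark the line dirty; memory is updated later,
       when the line is evicted.                                             */
    void write_hit(wb_line_t *line, unsigned offset, uint8_t value)
    {
        line->data[offset] = value;
        line->dirty = true;
    }

    /* On eviction, a dirty line must be written back before being replaced. */
    bool needs_writeback(const wb_line_t *line)
    {
        return line->valid && line->dirty;
    }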

Alpha 21264 example

(figure: the 21264 data cache – 64B blocks, 64 × 1024 = 64kB, 2 ways)

More examples

. IBM Power5
  • L1I: 64kB, 2-way, 128B block, LRU
  • L1D: 32kB, 4-way, 128B block, write-through, LRU
  • L2: 1.875MB (3 banks), 10-way, 128B block, pseudo-LRU

. Intel Core Duo
  • L1I: 32kB, 8-way, 64B block, LRU
  • L1D: 32kB, 8-way, 64B block, LRU, write-through
  • L2: 2MB, 8-way, 64B line, LRU, write-back

. Sun Niagara
  • L1I: 16kB, 4-way, 32B block, random
  • L1D: 8kB, 4-way, 16B block, random, write-through, no-write-allocate
  • L2: 3MB, 4 banks, 64B block, 12-way, write-back


Impact of caches on performance

. Average memory access latency (AMAL) = cache hit time + (1 – cache hit rate) × miss penalty

. Example 1
  • Hit time = 1 cycle
  • Miss penalty = 100 cycles
  • Miss rate = 2%
  • Average memory access latency?

. Example 2
  • 1GHz processor
  • Two configurations: 16kB direct-mapped, 16kB 2-way
  • Two miss rates: 3%, 2%
  • Hit time = 1 cycle, but the clock cycle time is stretched by 1.1× for the 2-way cache
  • Miss penalty = 100ns (how many cycles?)
  • Average memory access latency?
  (a worked sketch follows below)
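One way to work the two examples, treating the stated numbers at face value and computing Example 2 in nanoseconds so the stretched 2-way clock is easy to account for; this is a sketch, not part of the slides.

    #include <stdio.h>

    /* AMAL = hit time + (1 - hit rate) * miss penalty
            = hit time + miss rate * miss penalty       */
    static double amal(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Example 1 (units: cycles): 1 + 0.02 * 100 = 3 cycles */
        printf("Example 1: %.1f cycles\n", amal(1.0, 0.02, 100.0));

        /* Example 2 (units: ns, 1 GHz => 1 ns base cycle):
           16kB direct-mapped: 1.0 ns hit, 3% misses, 100 ns penalty -> 4.0 ns
           16kB 2-way: clock stretched to 1.1 ns, 2% misses          -> 3.1 ns
           (the 100 ns penalty is about 91 of the stretched cycles)          */
        printf("Direct-mapped: %.1f ns\n", amal(1.0, 0.03, 100.0));
        printf("2-way:         %.1f ns\n", amal(1.1, 0.02, 100.0));
        return 0;
    }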

1. Reducing miss penalty

. Multi-level caches
  • miss penalty(L1) = hit time(L2) + miss rate(L2) × miss penalty(L2)
. Critical word first and early restart
  • When the L2–L1 bus width is smaller than the L1 cache block
. Giving priority to read misses
  • Esp. in dynamically scheduled processors
. Merging write buffer
. Victim caches
. Non-blocking caches
  • Esp. in dynamically scheduled processors

Victim cache

(figure: without a victim cache, all requests and replacements go directly between the L1 and L2 caches; with a victim cache, blocks evicted from L1 go into a small fully associative victim cache, which is checked alongside L1 before going to L2)

. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," ISCA 1990.

Categorizing misses

. Compulsory
  • "I've not met this block before…"
. Capacity
  • "Working set is larger than my cache…"
. Conflict
  • "Cache has space, but blocks in use map to busy sets…"

. How can you measure contributions from different miss categories?


2. Reducing miss rate

. Larger block size
  • Reduces compulsory misses
. Larger cache size
  • Tackles capacity misses
. Higher associativity
  • Attacks conflict misses
. Prefetching
  • Relevancy issues (due to pollution)
. Pseudo-associative caches
. Compiler optimizations
  • Loop interchange
  • Blocking
  (see the blocking sketch below)
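A sketch of cache blocking (tiling) for matrix multiply; the matrix size N and tile size BLK are assumptions, chosen so that three BLK×BLK tiles of doubles fit comfortably in the cache, and C is assumed zero-initialized by the caller.

    #define N   512
    #define BLK 32

    void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N])
    {
        for (int ii = 0; ii < N; ii += BLK)
            for (int jj = 0; jj < N; jj += BLK)
                for (int kk = 0; kk < N; kk += BLK)
                    /* work on one BLK x BLK tile at a time so the data
                       stays resident in the cache while it is reused   */
                    for (int i = ii; i < ii + BLK; i++)
                        for (int j = jj; j < jj + BLK; j++) {
                            double sum = C[i][j];
                            for (int k = kk; k < kk + BLK; k++)
                                sum += A[i][k] * B[k][j];
                            C[i][j] = sum;
                        }
    }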

3. Reducing hit time

. Small & simple cache (e.g., direct-mapped cache)
  • Test different cache configurations using CACTI (http://quid.hpl.hp.com:9081/cacti/)
. Avoid address translation during cache indexing
. Pipeline the access

Prefetching

. Memory hierarchy generally works well
  • L1 cache: 1~3-cycle latency
  • L2 cache: 8~13-cycle latency
  • Main memory: 100~300-cycle latency
  • Cache hit rates are critical to high performance

. Prefetching: if we know a priori what data we'll need from the level-N cache, get the data from level-(N+1) and place it in the level-N cache before the data is accessed by the processor

. What are the design goals and issues?
  • E.g., shall we load the prefetched block into L1 or not?
  • E.g., instruction prefetch vs. data prefetch

Evaluating a prefetching scheme

. Coverage
  • By applying a prefetching scheme, how many misses are covered?
. Timeliness
  • Are prefetched data arriving early enough so that the (otherwise) miss latency is effectively hidden?
. Relevance and usefulness
  • Are prefetched blocks actually used?
  • Do they replace other useful data?

. How would program execution time change?
  • Especially, dynamically scheduled processors have an intrinsic ability to tolerate long latencies to a degree


Hardware schemes

. Prefetch data under hardware control
  • Without software modification
  • Without increasing instruction count

. Hardware prefetch engines
  • Programmable engines
    Program this engine with data access pattern information
    Who programs this engine, then?
  • Automatic
    Hardware detects access behavior

Stride detection

. A popular method uses a reference prediction table; each entry holds
  • Load instruction PC
  • Last address A(i-1)
  • Last stride S = A(i-1) − A(i-2)
  • Other flags, e.g., confidence, time stamp, …
  • The next address predicted for this PC is A(i) = A(i-1) + S
  (a table sketch follows below)

. In practice, simpler stream-buffer-like methods are often used
  • IBM Power4/5 use 8 stream buffers between L1 & L2, and L2 & L3 (or main memory)
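A simplified C sketch of such a reference prediction table; the table size, the direct-mapped PC indexing, and the confidence threshold are assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    #define RPT_ENTRIES 64

    typedef struct {
        uint64_t pc;            /* load instruction PC   */
        uint64_t last_addr;     /* A(i-1)                */
        int64_t  last_stride;   /* S = A(i-1) - A(i-2)   */
        unsigned confidence;    /* saturating counter    */
    } rpt_entry_t;

    static rpt_entry_t rpt[RPT_ENTRIES];

    /* Called on every load; returns true and sets *prefetch_addr when a
       stable stride suggests prefetching A(i) = A(i-1) + S.              */
    bool rpt_access(uint64_t pc, uint64_t addr, uint64_t *prefetch_addr)
    {
        rpt_entry_t *e = &rpt[(pc >> 2) % RPT_ENTRIES];

        if (e->pc != pc) {                   /* new load: (re)allocate entry */
            e->pc = pc;
            e->last_addr = addr;
            e->last_stride = 0;
            e->confidence = 0;
            return false;
        }

        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride == e->last_stride && stride != 0) {
            if (e->confidence < 3) e->confidence++;
        } else {
            e->confidence = 0;
        }
        e->last_stride = stride;
        e->last_addr   = addr;

        if (e->confidence >= 2) {            /* stride looks stable */
            *prefetch_addr = addr + (uint64_t)stride;
            return true;
        }
        return false;
    }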

Power4 example

. 8 stream buffers: ascending/descending
  • Requires at least 4 sequential misses to install a stream
. Supports L2 to L1, L3 to L2, and memory to L3
. Based on physical addresses
  • When a page boundary is met, stop there
. Software interface
  • To explicitly (and quickly) install a stream

Software schemes

. Use a prefetch instruction – needs ISA support
  • Check if the desired block is in the cache already
    If not, bring the cache block into the cache
    If yes, do nothing
  • In any case, a prefetch request is a performance hint and not a correctness requirement
    Not acting on it must not affect program output

. Compiler or programmer then inserts prefetch instructions in programs

. Hardware prefetch vs. software prefetch


Software prefetch example

Without prefetching:

    for (i = 0; i < 100; i++) {
        a[i] = b[i] + c[i];
    }

With software prefetching:

    prefetch(&b[0]);
    prefetch(&c[0]);
    for (int i = 0; i < 100; ++i) {
        prefetch(&b[i+4]);   /* prefetch a few iterations ahead */
        prefetch(&c[i+4]);
        a[i] = b[i] + c[i];
    }
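For reference, on GCC or Clang the prefetch() calls above could be expressed with the __builtin_prefetch intrinsic (a hint that never affects correctness); the loop below is just a restatement of the example in that form, with hypothetical function and array names.

    /* One concrete realization (assuming GCC/Clang): __builtin_prefetch
       compiles to the target's prefetch instruction, or to nothing if
       the target has none.                                              */
    void add_arrays(int *a, const int *b, const int *c)
    {
        for (int i = 0; i < 100; ++i) {
            __builtin_prefetch(&b[i + 4]);   /* hint only */
            __builtin_prefetch(&c[i + 4]);
            a[i] = b[i] + c[i];
        }
    }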

Virtual memory

(figure: three processes, each with its own virtual pages (A–I); some pages are mapped into physical memory while the rest reside on the hard disk)

. Motivation
  • Large, private memory space view
    What is the typical physical memory size today?
  • Virtual view to programmers
  • Virtual view to each process

. Management goals
  • Efficient sharing of physical memory
  • Protection

. Paging vs. segmentation

. Architectural issues
  • Fast translation of virtual addresses
  • Support for protection
  • Impact on cache design

Cache vs. virtual memory

                        Cache                        Virtual memory
  Block size            16~128B                      4kB~64kB
  Hit time              1~3 cycles (L1)              50~150 cycles (main memory)
  Miss penalty          8~150 cycles (L2/memory)     1M~10M cycles (hard drive)
  Miss rate             0.1~10%                      0.00001~0.001%
  Address mapping       25~45 bits to 14~20 bits     32~64 bits to 25~45 bits
  Replacement           hardware-based               OS


Paging vs. segmentation

                            Paging                        Segmentation
  Words per address         One                           Two (segment & offset)
  Programmer visible?       No                            Maybe
  Replacing a block         Trivial (same block size)     Can be difficult
  Memory use inefficiency   Internal fragmentation        External fragmentation
  Efficient disk traffic    Yes                           Not always

Paging basics

. Block placement
  • Anywhere in memory (fully associative)
. Block identification
  • The page table keeps the mapping information
. Block replacement
  • LRU, by the OS
. Write policy
  • Write-back

Page table

. Implemented in OS kernel memory space
. Per process
. Table size?
. Architectural support for fast address translation
  • Translation look-aside buffer (TLB)
  • A small cache that keeps frequently used page table entries
  (a translation sketch follows below)
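A toy, single-level illustration of what the page table stores and how a virtual address is translated; the 8kB page size matches the Alpha example later, while the flat table and field names are assumptions (real page tables are multi-level or hashed).

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 13
    #define PAGE_SIZE  (1u << PAGE_SHIFT)
    #define NUM_VPAGES (1u << 19)            /* assumes a 32-bit virtual space */

    typedef struct {
        bool     valid;                      /* page resident in memory? */
        uint32_t ppn;                        /* physical page number     */
    } pte_t;

    static pte_t page_table[NUM_VPAGES];

    /* Returns true and fills *pa on success; false means a page fault
       that the OS would have to handle.                                 */
    bool translate(uint32_t va, uint32_t *pa)
    {
        uint32_t vpn    = va >> PAGE_SHIFT;
        uint32_t offset = va & (PAGE_SIZE - 1);
        pte_t *pte = &page_table[vpn];

        if (!pte->valid)
            return false;

        *pa = (pte->ppn << PAGE_SHIFT) | offset;
        return true;
    }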

Virtually indexed L1 caches

. The address from a load/store instruction is virtual
. VA-to-PA translation is done via the TLB
. If the L1 cache access is done only after the TLB access, the L1 latency becomes longer
. → Virtually indexed but physically tagged cache
  • Allows simultaneous access to the TLB and the cache
  • Overlaps the cache access time and the TLB access time


Virtually indexed L1 caches

. Virtually indexed L1 caches
  • Validity of data across context switches (there are many virtual address spaces!)
    Flush at context switches
    Use physical tags
    Use address space identifiers
  • Handling of aliases (two VAs for one PA)
    Hardware scheme: limit cache size
    Software scheme: page coloring

(figure: the TLB and the virtually indexed cache are accessed in parallel; the physical address governs the tag comparison)

Issues

. I/O and cache coherence
  • I/O can result in inconsistency between data in main memory and in the caches
    I/O is between main memory and the outside world
  • Solutions
    Flush the cache before & after I/O operations in this region
    Address bus snooping for cache line invalidation
    Allocate I/O buffers in a non-cacheable memory region

Protection

. Between user and kernel processes
  • User mode vs. privileged mode
. Between users

. Address boundary & access type checking
  • Base ≤ address ≤ upper bound
  • Write allowed?
. Information maintained by the OS, not accessible by users
. Protections are enforced when accessing the page table (TLB)

Alpha 21264 example

. 8kB page size
. 8B page table entry: 4B for the physical page number (PPN) and 4B for other information
. The 64-bit address is not fully used
  • seg1: bits 63~47 = 11…11
  • kseg: bits 63~47 = 11…10
    Kernel-accessible physical addresses
    No address translation performed
  • seg0: bits 63~47 = 00…00
    User space

(figure: user address space layout, from high to low – Reserved at 3FF FFFF FFFF; shared libraries at 3FF 8000 0000; not yet allocated; Dynamic Data; Static Data at 001 4000 0000; not used; Text (Code) at 001 2000 0000; Stack ($sp); not yet allocated; Reserved below 000 0001 0000)


Alpha 21264 example

DRAM technology

. DRAM = dynamic RAM
  • Each memory cell is based on a capacitor
    If the capacitor has enough charge, it's a '1'
    If not, it means '0'
  • Memory content decays as the capacitors lose their charge over time
    Periodic "refresh" actions are required
  • Reads are also destructive – the data read must be written back before reading another row

. DRAM designs are driven by standards (www.jedec.org)
  • SDRAM (synchronous DRAM) is the most widely used
  • RDRAM (Rambus DRAM) is a high-performance proprietary design
    The PlayStation series has been the major consumer

DRAM internal

(figure: a DRAM chip comprises a command interface, memory banks, row-address and column-address paths, and a data I/O interface)

1. A row in a bank is selected and read using {BA1, BA0, A13~A0}
2. The data is moved to the I/O latch
3. Part of the data is selected using the column address
4. The selected data is moved to the chip I/O


DRAM-related concepts

. Row address & column address
  • Transfer the address (e.g., from CPU to DRAM) in two steps
    Row address first, column address later (why?)
    Halves the # of address bits
. Fast page mode
  • Once you've read a row, it's fast to read parts of it (rather than data from other rows)
  • Cycle time determines bandwidth
. Synchronous DRAM has a clock
. DDR (double data rate) – uses both edges of the clock when transferring data
  (an address-mapping sketch follows below)
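A sketch of how a memory controller might split a physical address into row, bank, and column components before sending the row and column halves in two steps; the exact bit allocation below is an assumption and varies widely across controllers.

    #include <stdint.h>

    #define COL_BITS  10
    #define BANK_BITS 2
    #define ROW_BITS  14

    typedef struct { uint32_t row, bank, col; } dram_addr_t;

    /* Low bits pick the column, middle bits the bank, upper bits the row;
       the row part is driven onto the address pins first, then the column. */
    dram_addr_t map_address(uint32_t paddr)
    {
        dram_addr_t d;
        d.col  =  paddr                            & ((1u << COL_BITS)  - 1);
        d.bank = (paddr >> COL_BITS)               & ((1u << BANK_BITS) - 1);
        d.row  = (paddr >> (COL_BITS + BANK_BITS)) & ((1u << ROW_BITS)  - 1);
        return d;
    }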

DRAM modules

. SIMM (single in-line memory module)
  • 32-bit data bus
  • Primarily in 386, 486, 586, and Apple computers
. DIMM (dual in-line …)
  • 64-bit data bus
. RIMM (Rambus in-line …)
  • Heat spreader
  • 1.6 GB/s
. FB-DIMM (fully buffered …)
  • New standard

FB-DIMM

. Serial, packet-based interface (new design trend)
. Scalability and throughput
  • 192 GB, 6 channels, 8 DIMMs/channel
  • 6.7 GB/s per channel
. Improved board layouts
  • 69 pins per channel (cf. 240 pins per channel today)
. Reliability considerations
  • ECC on commands, addresses, and data
  • Automatic retry on errors
. Headroom
  • More open-ended serial differential signaling scheme
. Industry acceptance is still low (10% or lower)


Memory bank interleaving

. An organization to improve memory performance

(figure: the processor issues successive requests to main memory, which is organized as multiple banks)

. With a proper # of banks, successive memory access latencies are hidden
  (a mapping sketch follows below)
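A minimal sketch of block-interleaved bank mapping: consecutive blocks land in successive banks, so a sequence of accesses overlaps the banks' latencies. Block size and bank count are assumptions.

    #include <stdint.h>

    #define BLOCK_BYTES 64
    #define NUM_BANKS   8

    /* Consecutive BLOCK_BYTES-sized chunks of the address space go to
       successive banks; e.g., addresses 0, 64, 128, ... map to banks
       0, 1, 2, ..., wrapping around after NUM_BANKS blocks.            */
    static inline unsigned bank_of(uint64_t paddr)
    {
        return (unsigned)((paddr / BLOCK_BYTES) % NUM_BANKS);
    }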

Page coloring

. The OS virtual-to-physical address mapping affects L2 cache performance

(figure: virtual pages (V-Page A, V-Page B) are mapped to physical pages (P-Page 0~7); each physical page falls into one of the L2 cache bins (Bin 0~3), so the page mapping determines which cache bins a process's pages occupy)


. This is about deciding which physical page to give to a virtual page on a page allocation event
. Goal: minimize the # of conflict misses

. Approaches
  • Basic method: match a few lower bits of the PPN with those of the VPN
    "Let neighboring pages (in the virtual address space) occupy different cache bins"
  • Bin hopping: a consecutive page allocation gets the next bin
    "Let pages accessed in the same time frame occupy different cache bins"
  • Best bin: track the page count for each bin; choose the bin with the minimum page count
    "Use the OS's knowledge about page allocations"
  (a sketch of the basic method follows below)

. Example
  • Solaris supports several different page coloring strategies and lets the user choose one
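A sketch of the basic matching rule: a physical frame is preferred when its low page-number bits (its cache bin) equal those of the virtual page. The bin count and function names are assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_BINS 64            /* assumed: L2 size / (associativity * page size) */
    #define BIN_MASK (NUM_BINS - 1)

    static inline unsigned bin_of(uint64_t page_number)
    {
        return (unsigned)(page_number & BIN_MASK);
    }

    /* Returns true if giving this physical frame to this virtual page
       preserves the virtual-address layout in the L2 cache.             */
    bool colors_match(uint64_t vpn, uint64_t ppn)
    {
        return bin_of(vpn) == bin_of(ppn);
    }

    /* A page-allocation loop would scan the free list and prefer a frame
       for which colors_match(vpn, ppn) holds, falling back to any frame. */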
