Lecture 13: Cache and Virtual Memory Review

Topics: memory hierarchy, cache design, cache miss classification, cache optimization approaches, virtual memory.

What Is Memory Hierarchy?

A typical memory hierarchy today, from fastest and smallest at the top to biggest and slowest at the bottom:

- Proc/Regs
- L1-Cache
- L2-Cache
- L3-Cache (optional)
- Memory
- Disk, Tape, etc.

Here we focus on the L1/L2/L3 caches and main memory.

Adapted from UCB CS252 S01

Why Memory Hierarchy?

[Figure: processor-memory performance gap, 1980-2000. CPU performance grows ~60%/yr ("Moore's Law"); DRAM performance grows ~7%/yr; the gap grows ~50%/yr.]

Time of a full cache miss, in instructions executed:
- 1st-gen Alpha: 340 ns / 5.0 ns = 68 clks x 2 issue = 136 instructions
- 2nd-gen Alpha: 266 ns / 3.3 ns = 80 clks x 4 issue = 320 instructions
- 3rd-gen Alpha: 180 ns / 1.7 ns = 108 clks x 6 issue = 648 instructions
(1/2X the latency x 3X the clock rate x 3X the instructions/clock => ~4.5X the miss cost)

1980: no cache in microprocessors; 1995: two-level cache on chip (1989: first microprocessor with an on-chip cache).

What Is a Cache, Exactly?

A cache is small, fast storage used to improve the average access time to slow memory; it is usually built from SRAM and exploits locality, both spatial and temporal.

In computer architecture, almost everything is a cache!
- Registers are the fastest place to cache variables
- The first-level cache is a cache on the second-level cache
- The second-level cache is a cache on memory
- Memory is a cache on disk (virtual memory)
- The TLB is a cache on the page table
- Branch prediction: a cache on prediction information?
- The branch-target buffer can be implemented as a cache
Beyond architecture: file cache, browser cache, proxy cache.

Caches store redundant data, only to close the performance gap. Here we focus on L1 and L2 caches (L3 optional) as buffers to main memory.

Area Costs of Caches

Processor         % Area (cost)   % Transistors (power)
Intel 80386       0%              0%
Alpha 21164       37%             77%
StrongArm SA110   61%             94%
Pentium Pro       64%             88%
(Pentium Pro: 2 dies per package, Proc/I$/D$ + L2$)

Four Questions About Cache Design

- Block placement: where can a block be placed?
- Block identification: how to find a block in the cache?
- Block replacement: if a new block is to be fetched, which of the existing blocks should be replaced (if there are multiple choices)?
- Write policy: what happens on a write?

Example: 1 KB Direct Mapped Cache

Assume a cache of 2^N bytes with 2^K blocks of 2^M bytes each, so N = M + K (cache size = #blocks times block size):
- (32 - N)-bit cache tag, K-bit cache index, and M-bit block offset
- The cache stores a tag, data, and a valid bit for each block
- The cache index is used to select a block in SRAM (recall the BHT and BTB)
- The stored block tag is compared with the input tag
- A word in the data block may be selected as the output

For the 1 KB cache with 32-byte blocks: 5-bit block offset (bits 4-0), 5-bit index (bits 9-5), and 22-bit tag (bits 31-10), with the tag stored as part of the cache "state" alongside the valid bit. Example address fields: tag 0x50, index 0x01, block offset 0x00.

[Figure: cache array of 32 blocks (0-31), each row holding a valid bit, a cache tag, and 32 bytes of cache data (Byte 0-31, Byte 32-63, ..., Byte 992-1023).]
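The tag/index/offset breakdown above can be sketched in a few lines of code. This is a minimal illustration (not from the lecture), with defaults matching the 1 KB direct mapped cache with 32-byte blocks:

```python
def split_address(addr: int, offset_bits: int = 5, index_bits: int = 5):
    """Split a 32-bit address into (tag, index, offset).

    Defaults model the 1 KB cache above: 32-byte blocks (5 offset bits)
    and 32 blocks (5 index bits); the remaining 22 bits are the tag.
    """
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# A lookup then hits when the indexed block is valid and its stored tag
# matches:  hit = valid[index] and tags[index] == tag
```

With the slide's example fields (tag 0x50, index 0x01, offset 0x00), `split_address((0x50 << 10) | (0x01 << 5))` recovers exactly those three values.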

Where Can a Block Be Placed?

What is a block: the memory space is divided into blocks just as the cache is; a memory block is the basic unit to be cached.
- Direct mapped cache: there is only one place in the cache to buffer a given memory block
- N-way set associative cache: N places for a given memory block; like N direct mapped caches operating in parallel; reduces miss rates at the cost of increased complexity, cache access time, and power consumption
- Fully associative cache: a memory block can be put anywhere in the cache

Set Associative Cache

Example: a two-way set associative cache
- The cache index selects a set of two blocks
- The two tags in the set are compared with the input tag in parallel
- Data is selected based on the tag comparison
Set associative or direct mapped? Discussed later.

[Figure: two-way set associative lookup — the cache index selects one block from each way; both stored tags are compared with the address tag, the comparison results (Sel1/Sel0) drive a mux that selects the output cache block, and Hit is the OR of the two comparisons.]
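The two-way lookup can be modeled compactly. The sketch below is illustrative only (the data layout is invented for the example); it mirrors the figure: the index selects a set, both tags are checked, and the matching way supplies the data:

```python
def lookup(sets, tag, index):
    """Two-way set associative lookup (illustrative model).

    sets[index] is a list of ways; each way is a dict with 'valid',
    'tag', and 'data' fields. Returns (hit, data). In hardware the two
    tag comparisons happen in parallel; here we just scan the set.
    """
    for way in sets[index]:
        if way["valid"] and way["tag"] == tag:
            return True, way["data"]   # the mux selects this way's data
    return False, None                 # miss: no tag matched
```

The same function models any associativity by putting more ways in each set; with one way per set it degenerates to a direct mapped lookup.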

How to Find a Cached Block?

- Direct mapped cache: hit if the stored tag for the selected cache block matches the input tag
- Set associative cache: hit if any of the K stored tags for the selected cache set matches the input tag
- Fully associative cache: hit if any of the stored tags matches the input tag

Cache hit time is decided by both tag comparison and data access; it can be estimated with the CACTI model.

Which Block to Replace?

Direct mapped cache: not an issue. For set associative or fully associative* caches:
- Random: select a candidate block randomly from the cache set
- LRU (Least Recently Used): replace the block that has been unused for the longest time
- FIFO (First In, First Out): replace the oldest block
LRU usually performs the best, but it can be hard (and expensive) to implement.

*Think of a fully associative cache as a set associative cache with a single set.
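The LRU policy for one set can be sketched with an ordered list of tags (a software model, not how hardware tracks recency). Least recently used at the front, most recently used at the back:

```python
def touch(set_blocks, tag):
    """Record a use of `tag`: move it to the most-recently-used position."""
    set_blocks.remove(tag)
    set_blocks.append(tag)

def replace_lru(set_blocks, new_tag):
    """Evict the LRU block (front of the list), insert the new tag as MRU."""
    victim = set_blocks.pop(0)
    set_blocks.append(new_tag)
    return victim
```

For N ways this takes O(N) bookkeeping per access, which hints at why true LRU is expensive in hardware for high associativity and why approximations (as with VM page replacement later) are common.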

What Happens on Writes?

Where to write the data if the block is found in the cache?
- Write through: new data is written to both the cache block and the lower-level memory; helps to maintain cache consistency
- Write back: new data is written only to the cache block, and the lower-level memory is updated when the block is replaced; a dirty bit indicates whether a writeback is necessary; helps to reduce memory traffic

What happens if the block is not found in the cache?
- Write allocate: fetch the block into the cache, then write the data (usually combined with write back)
- No-write allocate: do not fetch the block into the cache (usually combined with write through)

Real Example: Caches

[Figure: die photo with a 64KB 2-way associative instruction cache (I-cache) and a 64KB 2-way associative data cache (D-cache).]
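The two write-hit policies above can be contrasted in a small sketch (function and field names are invented for illustration):

```python
def write_through(block, data, memory, addr):
    """Write hit, write-through: update cache and lower level together."""
    block["data"] = data
    memory[addr] = data          # lower-level memory sees every write

def write_back(block, data):
    """Write hit, write-back: update only the cache, mark the block dirty."""
    block["data"] = data
    block["dirty"] = True        # memory is stale until eviction

def evict(block, memory, addr):
    """On replacement, a dirty block must be written back first."""
    if block.get("dirty"):
        memory[addr] = block["data"]
        block["dirty"] = False
```

Note how write-back turns many writes to the same block into a single memory update at eviction time, which is exactly the "reduce memory traffic" benefit, at the cost of tracking the dirty bit.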

Alpha 21264 Data Cache

- D-cache: 64KB, 2-way set associative
- The 48-bit virtual address is used to index the cache; the tag comes from the physical address (48-bit virtual => 44-bit physical address)
- 512 sets (9-bit block index)
- Cache block size of 64 bytes (6-bit block offset)
- Tag has 44 - (9 + 6) = 29 bits
- Writeback and write allocate
(We will study virtual-physical address translation later.)

Cache Performance

Calculate the average memory access time (AMAT):
AMAT = Hit time + Miss rate x Miss penalty
Example: hit time = 1 cycle, miss penalty = 100 cycles, miss rate = 4%; then AMAT = 1 + 100 x 4% = 5 cycles.

Calculate the cache impact on processor performance:
CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time
CPU time = IC x (CPI_execution + Memory stall cycles / Instruction) x Cycle time
Note that cycles spent on cache hits are usually counted in the execution cycles.
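The AMAT formula is simple enough to check numerically. A one-line helper reproducing the slide's example:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit time plus the
    expected miss cost (miss rate x miss penalty)."""
    return hit_time + miss_rate * miss_penalty

# Slide example: 1 cycle hit, 4% miss rate, 100-cycle miss penalty
# => 1 + 0.04 * 100 = 5 cycles.
```

The same helper also composes for multi-level caches by using the L2 AMAT as the L1 miss penalty, e.g. `amat(1, 0.04, amat(10, 0.2, 100))`.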

Disadvantage of Set Associative Cache

Compare an n-way set associative cache with a direct mapped cache:
- It has n comparators vs. 1 comparator
- It has extra MUX delay for the data
- Data comes only after the hit/miss decision and set selection
In a direct mapped cache, the cache block is available before the hit/miss decision: use the data assuming the access is a hit, and recover if it is found to be a miss.

[Figure: the two-way set associative datapath again — the cache index selects a set, the two tag comparisons (Sel1/Sel0) drive a mux selecting the output cache block, and Hit is the OR of the comparisons.]

Virtual Memory

Virtual memory (VM) allows programs to have the illusion of a very large memory that is not limited by the physical memory size.
- It makes main memory (DRAM) act like a cache for secondary storage (magnetic disk)
- Otherwise, application programmers would have to move data in and out of main memory themselves; that is how virtual memory was first proposed
Virtual memory also provides the following functions:
- Allowing multiple processes to share physical memory in a multiprogramming environment
- Providing protection for processes (compare: without VM, applications can overwrite the OS kernel)
- Facilitating program relocation in the physical memory space

VM Example

[Figure: example mapping of a process's virtual pages to physical page frames in main memory and to disk.]

Virtual Memory and Cache

VM address translation provides a mapping from the virtual address of the processor to the physical address in main memory or secondary storage.

Cache terms vs. VM terms:
- Cache block => page
- Cache miss => page fault

Tasks of hardware and the OS:
- The TLB does fast address translations
- The OS handles the less frequent events: page faults, and TLB misses (when a software approach is used)

Virtual Memory and Cache Parameters

Parameter          L1 Cache                    Main Memory
Block (page) size  16-128 bytes                4KB - 64KB
Hit time           1-3 cycles                  50-150 cycles
Miss penalty       8-300 cycles                1M to 10M cycles
Miss rate          0.1-10%                     0.00001-0.001%
Address mapping    25-45 bits => 13-21 bits    32-64 bits => 25-45 bits

Four Questions for Virtual Memory

Q1: Where can a block be placed in the upper level?
- The miss penalty for virtual memory is very high => full associativity is desirable (so blocks can be placed anywhere in memory)
- Software can determine the location while accessing disk (10M cycles is enough to do sophisticated replacement)

Q2: How is a block found if it is in the upper level?
- The address is divided into a page number and a page offset
- A page table and translation buffer are used for address translation
- Q: why does full associativity not affect hit time?

Four Questions for Virtual Memory (cont.)

Q3: Which block should be replaced on a miss?
- We want to reduce the miss rate, and replacement can be handled in software
- Least Recently Used (LRU) is typically used; a typical approximation of LRU:
  - Hardware sets reference bits
  - The OS records the reference bits and clears them periodically
  - The OS selects a page among the least recently referenced ones for replacement

Q4: What happens on a write?
- Writing to disk is very expensive
- Use a write-back strategy

Virtual-Physical Translation

A virtual address consists of a virtual page number and a page offset. The virtual page number gets translated to a physical page number; the page offset is not changed.

Virtual address:   36-bit virtual page number  | 12-bit page offset
                          | Translation
                          v
Physical address:  33-bit physical page number | 12-bit page offset
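The translation above is pure bit manipulation once the page table provides the mapping. A minimal sketch (the dict-based page table is an illustration, not the lecture's structure), using the figure's 12-bit offset for 4 KB pages:

```python
PAGE_OFFSET_BITS = 12  # 4 KB pages, as in the figure

def translate(vaddr, page_table):
    """Translate a virtual address to a physical address.

    page_table maps virtual page numbers to physical page numbers;
    the page offset passes through unchanged. A missing VPN raises
    KeyError here - a real OS would service a page fault instead.
    """
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    ppn = page_table[vpn]
    return (ppn << PAGE_OFFSET_BITS) | offset
```

Because only the page number is translated, two addresses in the same virtual page always land in the same physical page, offset intact.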

Address Translation via Page Table

[Figure: the virtual page number indexes the page table; the page table entry provides the physical page number, which is combined with the unchanged page offset to form the physical address. Assume the access hits in main memory.]

TLB: Improving Page Table Access

We cannot afford to access the page table for every access, including cache hits (then the cache itself would make no sense). Again, use a cache to speed up accesses to the page table (a cache for a cache?).

The TLB (translation lookaside buffer) stores frequently accessed page table entries. A TLB entry is like a cache entry:
- The tag holds portions of the virtual address
- The data portion holds the physical page number, a protection field, a valid bit, a use bit, and a dirty bit (like a page table entry)
- Usually fully associative or highly set associative
- Usually 64 or 128 entries
The page table is accessed only on TLB misses.
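The TLB-then-page-table flow can be sketched as follows. This is a hypothetical model (the fully associative TLB is reduced to a dict from VPN to PPN, ignoring protection and dirty bits):

```python
def tlb_translate(vpn, tlb, page_table):
    """Return (ppn, tlb_hit) for a virtual page number.

    On a TLB hit the translation is immediate; on a miss we walk the
    page table (the miss penalty) and fill the TLB for next time.
    """
    if vpn in tlb:
        return tlb[vpn], True
    ppn = page_table[vpn]   # page-table access, only on a TLB miss
    tlb[vpn] = ppn          # cache the translation
    return ppn, False
```

Repeated accesses to the same page hit in the TLB after the first miss, which is why the page table is consulted only rarely despite every access needing translation. (A real TLB would also evict entries when full; that is omitted here.)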

TLB Characteristics

The following are typical characteristics of TLBs:
- TLB size: 32 to 4,096 entries
- Block size: 1 or 2 page table entries (4 or 8 bytes each)
- Hit time: 0.5 to 1 clock cycle
- Miss penalty: 10 to 30 clock cycles (go to the page table)
- Miss rate: 0.01% to 0.1%
- Associativity: fully associative or set associative
- Write policy: write back (replacement is infrequent)

