Why Care About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency) – Reasons:
■ µProc ("Moore's Law"): 60%/yr (2X / 1.5 yr)
■ DRAM: 9%/yr (2X / 10 yrs)
■ Processor-Memory Performance Gap: grows 50% / year

[Figure: CPU vs DRAM performance on a log scale from 1 to 1000, over the years 1980 to 2000; the widening area between the two curves is the gap.]

Topics:
■ Memory Hierarchy
■ Memory
■ Translation Lookaside Buffer
  – Address translation


DRAMs over Time

DRAM Generation (1st Gen. Sample):  '84     '87    '90    '93    '96     '99
Memory Size:                        1 Mb    4 Mb   16 Mb  64 Mb  256 Mb  1 Gb
Die Size (mm2):                     55      85     130    200    300     450
Memory Area (mm2):                  30      47     72     110    165     250
Cell Area (µm2):                    28.84   11.1   4.26   1.64   0.61    0.23

(from Kazuhiro Sakashita, Mitsubishi)

Recap:

■ Two Different Types of Locality:
  – Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon.
  – Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon.
■ By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.
■ DRAM is slow but cheap and dense:
  – A good choice for presenting the user with a BIG memory system.
■ SRAM is fast but expensive and not very dense:
  – A good choice for providing the user FAST access time.
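To make the recap concrete, here is a minimal C sketch (illustrative only, not from the slides) of a loop that exhibits both kinds of locality:

#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];    /* zero-initialized array */
    int sum = 0;

    /* Spatial locality: a[0], a[1], a[2], ... are adjacent in memory,
       so each block fetched on a miss serves several later accesses. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    /* Temporal locality: the same few locations (sum, i, and the loop
       code itself) are touched over and over, so they stay cached. */
    printf("%d\n", sum);
    return 0;
}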


Memory Hierarchy of a Modern Computer System

■ By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.

[Figure: Processor (Control, Datapath, Reg:s, On-Chip Cache) -> Second Level Cache (SRAM) -> Main Memory (DRAM) -> Secondary Storage (Disk) -> Tertiary Storage (Disk/Tape). Speed: 1 ns, X ns, 10 ns, 100 ns, 10 ms, 10 sec. Size (bytes): 100s, K, M, G, T.]

Levels of the Memory Hierarchy

Level (faster at top,   Staging Xfer Unit   Managed by       Xfer size
larger at bottom)
Registers               Instr. Operands     prog./compiler   1-8 bytes
Cache                   Blocks              cache cntl       8-128 bytes
Memory                  Pages               OS               512-4K bytes
Disk                    Files               user/operator    Mbytes
Tape                    -                   -                -

The Art of Memory System Design

A Workload or Benchmark programs drive the Processor, which issues a reference stream
  <op, addr>, <op, addr>, ...   op: i-fetch, read, write
into the Memory system ($ cache, mem, disk).
Goal: optimize the memory system organization to minimize the average memory access time for typical workloads.

Virtual Memory System Design

Key decisions:
  – the size of the information blocks that are transferred from secondary to main storage (M)
  – when a block is brought into M and M is full, some region of M must be released to make room for the new block --> replacement policy
  – which region of M is to hold the new block --> placement policy
  – a missing item is fetched from secondary memory only on the occurrence of a fault --> demand load policy

Paging Organization: the virtual and physical address spaces are partitioned into blocks of equal size: pages (virtual) and frames (physical).


Address Map

V = {0, 1, . . . , n - 1}  virtual address space      (n > m)
M = {0, 1, . . . , m - 1}  physical address space

MAP: V --> M U {0}   address mapping function
  MAP(a) = a'  if data at virtual address a is present at physical address a', a' in M
  MAP(a) = 0   if data at virtual address a is not present in M

[Figure: the Processor issues address a in name space V; the address translation mechanism either produces physical address a' for Main Memory, or raises a missing-item fault, whereupon the OS fault handler transfers the page from Secondary Memory into physical memory.]

Paging Organization

The unit of mapping (and also the unit of transfer from virtual to physical memory) is the 1K page.

V.A. (pages, 1K each):   page 0 at 0, page 1 at 1024, ..., page 31 at 31744
P.A. (frames, 1K each):  frame 0 at 0, frame 1 at 1024, ..., frame 7 at 7168

Virtual Memory Address Mapping:
  VA = | page no. | disp (10 bits) |
  The page number indexes a Page Table (located in physical memory, found via a Page Table Base Reg). Each entry holds V (valid), Access Rights, and PA; PA + disp (actually, concatenation is more likely) forms the physical memory address.
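As a rough software model of the MAP function above, here is a C sketch assuming the 1K-page layout on this slide (10-bit displacement); the PageTableEntry type and its field layout are invented for illustration:

#include <stdint.h>

#define PAGE_BITS 10                 /* 1K pages, as on the slide */
#define PAGE_SIZE (1u << PAGE_BITS)

/* Hypothetical page-table entry, for illustration only. */
typedef struct {
    unsigned valid  : 1;  /* V: page resident in physical memory? */
    unsigned rights : 3;  /* access rights */
    uint32_t frame;       /* physical frame number */
} PageTableEntry;

/* MAP(a): return the physical address, or 0 on a missing-item fault
   (the slide uses 0 as the "not present" value). */
uint32_t map(PageTableEntry *page_table, uint32_t va) {
    uint32_t page = va >> PAGE_BITS;        /* page no. */
    uint32_t disp = va & (PAGE_SIZE - 1);   /* displacement within page */

    PageTableEntry *pte = &page_table[page];
    if (!pte->valid)
        return 0;                           /* fault: OS must load the page */

    /* Concatenation of frame number and displacement. */
    return (pte->frame << PAGE_BITS) | disp;
}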


Address Mapping

[Figure: MIPS PIPELINE with CP0. 32-bit virtual instruction and data addresses from the pipeline are mapped to 24-bit physical addresses in User Memory. Kernel Memory holds one Page Table per process (Page Table 1, Page Table 2, ..., Page Table n); with User 2 running, we need page table 2 for address mapping.]

Translation Lookaside Buffer (TLB)

On a TLB hit, the 32-bit virtual address is translated into a 24-bit physical address by hardware. We never call the Kernel!

[Figure: as above, but CP0 now holds a TLB; each entry holds V, D, R bits and Physical Addr [23:10].]
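A hedged C sketch of what the hardware hit check does, assuming 1K pages and the PA[23:10] frame field from the slide; the TlbEntry layout is illustrative, and the loop stands in for the concurrent compare done in hardware:

#include <stdint.h>
#include <stdbool.h>

#define TLB_LINES 64      /* a later slide mentions a 64-line TLB */
#define PAGE_BITS 10      /* 1K pages: PA[23:10] is the frame field */

/* Hypothetical TLB entry, for illustration only. */
typedef struct {
    bool     valid;       /* V */
    bool     dirty;       /* D */
    uint32_t vpage;       /* 22-bit virtual page number */
    uint32_t pframe;      /* Physical Addr [23:10] */
} TlbEntry;

/* On a hit: write the 24-bit PA to *pa and return true.
   On a miss: return false and let the kernel's handler take over. */
bool tlb_translate(TlbEntry tlb[TLB_LINES], uint32_t va, uint32_t *pa) {
    uint32_t vpage = va >> PAGE_BITS;
    uint32_t disp  = va & ((1u << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_LINES; i++) {
        if (tlb[i].valid && tlb[i].vpage == vpage) {
            *pa = (tlb[i].pframe << PAGE_BITS) | disp;
            return true;   /* TLB hit: no kernel involvement */
        }
    }
    return false;          /* TLB miss */
}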

So Far, NO GOOD

[Figure: MIPS pipeline IM DE EX DM with the TLB (5 ns) in CP0 and 60 ns RAM; memory stages STALL.]

■ The MIPS pipe is clocked at 50 MHz; the critical path is 20 ns
■ But the 60 ns RAM needs 3 cycles to read/write: every memory access STALLS the pipe

Let's put in a Cache

[Figure: the same pipeline, but a 15 ns Cache sits between the TLB and the 60 ns RAM.]

■ The TLB (5 ns) plus the Cache (15 ns) fit within the 20 ns critical path
■ A Cache Hit never STALLS the pipe


Fully Associative Cache

24-bit PA: | TAG PA[23:2] | byte PA[1:0] |

Check all 2^16 Cache lines concurrently; Cache Hit if PA[23:2] = TAG.
Each line holds a Tag (PA[23:2]) and a Data Word; PA[1:0] selects the byte.
Cache size: 2^16 lines * 4 bytes = 256 KB.

■ Very good hit ratio (nr hits / nr accesses)
But!
■ Too expensive to check all 2^16 Cache lines concurrently
  – A comparator for each line! A lot of hardware


Direct Mapped Cache

24-bit PA: | TAG PA[23:18] | index PA[17:2] | byte PA[1:0] |

The index PA[17:2] selects ONE cache line; Cache Hit if PA[23:18] = TAG.
Each line holds a Tag (PA[23:18]) and a Data Word; PA[1:0] selects the byte.
Cache size: 2^16 lines * 4 bytes = 256 KB.

■ Not so good hit ratio
  – Each line can hold only certain addresses, less freedom
But!
■ Much cheaper to implement, only one line is checked
  – Only one comparator
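A C sketch of the direct mapped lookup with the bit fields above (a single comparison per access); the CacheLine type is invented for illustration:

#include <stdint.h>
#include <stdbool.h>

#define INDEX_BITS 16     /* 2^16 lines of one 4-byte word = 256 KB */

/* Hypothetical line layout, for illustration only. */
typedef struct {
    bool     valid;
    uint32_t tag;         /* PA[23:18] */
    uint32_t data;        /* one 4-byte word */
} CacheLine;

/* Direct mapped: PA[17:2] picks exactly one line; one tag compare. */
bool cache_read(CacheLine cache[1 << INDEX_BITS],
                uint32_t pa, uint32_t *word) {
    uint32_t index = (pa >> 2) & ((1u << INDEX_BITS) - 1);  /* PA[17:2]  */
    uint32_t tag   = pa >> 18;                              /* PA[23:18] */

    CacheLine *line = &cache[index];
    if (line->valid && line->tag == tag) {
        *word = line->data;   /* hit: PA[1:0] would then select the byte */
        return true;
    }
    return false;             /* miss: fetch from RAM, refill this line */
}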

Set Associative Cache

24-bit PA: | TAG PA[23:18-z] | index PA[17-z:2] | byte PA[1:0] |

The index selects ONE set of 2^z lines; Cache Hit if PA[23:18-z] = TAG for some line in the set. Each line holds a Tag (PA[23:18-z]) and a Data Word; PA[1:0] selects the byte.
Cache size: 2^16 lines * 4 bytes = 256 KB.

■ Quite good hit ratio
  – The number (set) of different addresses for each line is greater than that of a directly mapped cache
■ The larger z, the better the hit ratio, but more expensive
  – 2^z comparators
  – Cost-performance tradeoff
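The same sketch generalized to 2^z ways; in hardware the 2^z tag comparators work concurrently, here a loop stands in for them (types and names again illustrative):

#include <stdint.h>
#include <stdbool.h>

#define Z 2                       /* 2^z lines per set (4-way) */
#define INDEX_BITS (16 - Z)       /* still 2^16 lines in total  */

typedef struct {
    bool     valid;
    uint32_t tag;                 /* PA[23:18-z] */
    uint32_t data;
} CacheLine;

bool cache_read(CacheLine cache[1 << INDEX_BITS][1 << Z],
                uint32_t pa, uint32_t *word) {
    uint32_t index = (pa >> 2) & ((1u << INDEX_BITS) - 1); /* PA[17-z:2]  */
    uint32_t tag   = pa >> (2 + INDEX_BITS);               /* PA[23:18-z] */

    for (int way = 0; way < (1 << Z); way++) {
        CacheLine *line = &cache[index][way];
        if (line->valid && line->tag == tag) {
            *word = line->data;
            return true;          /* hit in one of the set's lines */
        }
    }
    return false;                 /* miss: pick a victim in this set */
}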


Cache Miss

■ A Cache Miss should be handled by hardware
  – If handled by the OS it would be very slow (>> 60 ns)
■ On a Cache Miss
  – Stall the pipe
  – Read the new data into the cache
  – Release the pipe; now we get a Cache Hit

A Summary on Sources of Cache Misses

■ Compulsory (cold start or process migration, first reference): the first access to a block
  – "Cold" fact of life: not a whole lot you can do about it
  – Note: if you are going to run "billions" of instructions, Compulsory Misses are insignificant
■ Conflict (collision):
  – Multiple memory locations mapped to the same cache location
  – Solution 1: increase cache size
  – Solution 2: increase associativity
■ Capacity:
  – The cache cannot contain all blocks accessed by the program
  – Solution: increase cache size
■ Invalidation: another process (e.g., I/O) updates memory


Example: 1 KB Direct Mapped Cache with 32 B Blocks

■ For a 2^N byte cache:
  – The uppermost (32 - N) bits are always the Cache Tag
  – The lowest M bits are the Byte Select (Block Size = 2^M)

31                    9                     4                    0
| Cache Tag (ex: 0x50) | Cache Index (ex: 0x01) | Byte Select (ex: 0x00) |

[Figure: the Cache Tag (e.g., 0x50) is stored as part of the cache "state" along with a Valid Bit; cache entries 0..31 each hold 32 bytes (Byte 0..Byte 31 in entry 0, Byte 32..Byte 63 in entry 1, ..., up to Byte 1023).]

Block Size Tradeoff

■ In general, a larger block size takes advantage of spatial locality, BUT:
  – A larger block size means a larger miss penalty: it takes longer to fill up the block
  – If the block size is too big relative to the cache size, the miss rate will go up: too few cache blocks
■ In general, the Average Access Time is:
  TimeAv = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate

[Figure: three curves vs Block Size: the Miss Penalty increases with block size; the Miss Rate first falls (exploits spatial locality) and then rises again (fewer blocks compromises temporal locality); the Average Access Time therefore has a minimum before the increased miss penalty and miss rate dominate.]
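A small numeric sketch of the TimeAv formula, reusing the 20 ns cycle and 60 ns RAM figures from earlier slides; the miss rates themselves are made-up illustration values:

#include <stdio.h>

/* TimeAv = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate */
int main(void) {
    const double hit_time     = 20.0;   /* ns, one 50 MHz cycle   */
    const double miss_penalty = 60.0;   /* ns, the 60 ns RAM      */

    for (int pct = 1; pct <= 10; pct += 3) {
        double miss_rate = pct / 100.0;
        double time_av = hit_time * (1.0 - miss_rate)
                       + miss_penalty * miss_rate;
        printf("miss rate %4.2f -> TimeAv = %.1f ns\n", miss_rate, time_av);
    }
    return 0;
}

For example, a 1% miss rate gives TimeAv = 20 x 0.99 + 60 x 0.01 = 20.4 ns.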

Extreme Example: single big line

■ Cache Size = 4 bytes, Block Size = 4 bytes
  – Only ONE entry in the cache (Valid Bit, Cache Tag, Cache Data: Byte 3..Byte 0)
■ If an item is accessed, it is likely to be accessed again soon
  – But it is unlikely that it will be accessed again immediately!!!
  – The next access will likely be a miss again
■ We continually load data into the cache but discard (force out) it before it is used again
■ Worst nightmare of a cache designer: the Ping Pong Effect
■ Conflict Misses are misses caused by:
  – Different memory locations mapped to the same cache index
■ Solution 1: make the cache size bigger
■ Solution 2: multiple entries for the same Cache Index

Hierarchy

Small, fast and expensive VS slow, big and inexpensive.
The Cache contains copies. What if the copies are changed? INCONSISTENCY!

[Figure: Cache 256 KB (I and D) <-> RAM 16 Mb <-> HD 2 Gb]


Cache Miss, Write Through/Back

To avoid INCONSISTENCY we can:
■ Write Through
  – Always write data to RAM
  – Not so good performance (each write takes 60 ns)
  – Therefore, WT is always combined with write buffers so that we don't wait for the slower memory
■ Write Back
  – Write data to memory only when the cache line is replaced
  – We need a Dirty bit (D) for each cache line
  – The D-bit is set by hardware on a write operation
  – Much better performance, but more complex hardware

Write Buffer for Write Through

[Figure: Processor <-> Cache, with a Write Buffer between the Cache and DRAM.]

■ A Write Buffer is needed between the Cache and Memory
  – Processor: writes data into the cache and the write buffer
  – Memory controller: writes the contents of the buffer to memory
■ The Write Buffer is just a FIFO:
  – Typical number of entries: 4
  – Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
■ Memory system designer's nightmare:
  – Store frequency (w.r.t. time) -> 1 / DRAM write cycle
  – Write buffer saturation
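A C sketch of the 4-entry FIFO write buffer described above; the WriteBuffer type and both function names are invented for illustration:

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4        /* typical depth, per the slide */

/* Hypothetical write-buffer FIFO, for illustration only. */
typedef struct {
    uint32_t addr[WB_ENTRIES];
    uint32_t data[WB_ENTRIES];
    int head, tail, count;
} WriteBuffer;

/* Processor side: enqueue the store and keep running.
   Returns false when the buffer is saturated (the CPU must stall). */
bool wb_push(WriteBuffer *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES)
        return false;                       /* write buffer saturation */
    wb->addr[wb->tail] = addr;
    wb->data[wb->tail] = data;
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* Memory-controller side: drain one entry per DRAM write cycle. */
bool wb_drain_one(WriteBuffer *wb, uint32_t *addr, uint32_t *data) {
    if (wb->count == 0)
        return false;
    *addr = wb->addr[wb->head];
    *data = wb->data[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return true;
}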


Write Buffer Saturation

■ Store frequency (w.r.t. time) -> 1 / DRAM write cycle
  – If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
    ■ the store buffer will overflow no matter how big you make it
    ■ the CPU Cycle Time <= DRAM Write Cycle Time
■ Solutions for write buffer saturation:
  – Use a write back cache
  – Install a second level (L2) cache

[Figure: Processor <-> L2 Cache <-> DRAM, with the Write Buffer between the L2 Cache and DRAM.]

Replacement Strategy in Hardware

■ A Direct mapped cache selects ONE cache line
  – No replacement strategy needed
■ A Set/Fully Associative Cache selects a set of lines. Strategies to select one cache line for replacement (see the sketch below):
  – Random, Round Robin
    ■ Not so good, spoils the idea of an Associative Cache
  – Least Recently Used (move-to-top strategy)
    ■ Good, but complex and costly for large z
  – We could use an approximation (heuristic):
    Not Recently Used (replace a line not used for a certain time)
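A sketch of the move-to-top (LRU) bookkeeping for one set, in C; the LruState type is invented for illustration, and order[] must start out as a permutation of the way numbers:

#include <stdint.h>

#define WAYS 4   /* 2^z lines per set */

/* Hypothetical per-set LRU bookkeeping: order[0] is the most recently
   used way, order[WAYS-1] the least recently used. */
typedef struct {
    uint8_t order[WAYS];   /* initialize to 0, 1, ..., WAYS-1 */
} LruState;

/* "Move to top": on an access to `way`, shift the others down. */
void lru_touch(LruState *s, uint8_t way) {
    int i = 0;
    while (s->order[i] != way)
        i++;                            /* find its current position */
    for (; i > 0; i--)
        s->order[i] = s->order[i - 1];  /* shift more-recent ways down */
    s->order[0] = way;                  /* accessed way goes on top */
}

/* Victim selection: replace the least recently used way. */
uint8_t lru_victim(const LruState *s) {
    return s->order[WAYS - 1];
}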

Sequential RAM Access

■ Accessing sequential words from RAM is faster than accessing RAM randomly
  – Only the lower address bits will change
■ How could we exploit this?
  – Let each Cache Line hold an Array of Data words
  – Give the base address and the array size
  – "Burst Read" the array from RAM to Cache
  – "Burst Write" the array from Cache to RAM

System Startup, RESET

■ Random Cache Contents
  – We might read incorrect values from the Cache
■ We need to know if the contents are Valid: a V-bit for each cache line
  – Let the hardware clear all V-bits on RESET
  – Set the V-bit and clear the D-bit for each line copied from RAM to Cache


Final Cache Model

24-bit PA: | TAG PA[23:18-z] | index PA[17-z:2+j] | word PA[1+j:2] | byte PA[1:0] |

■ The index selects ONE set of 2^z lines
■ Cache Hit if (PA[23:18-z] = TAG) and V, for some line in the set
■ Set the D bit on a Write
■ Each line holds V, D, a Tag (PA[23:18-z]) and an array of 2^j Data Words; PA[1+j:2] selects the word, PA[1:0] the byte

Translation Lookaside Buffer (TLB)

On a TLB hit, the 32-bit virtual address is translated into a 24-bit physical address by hardware. We never call the Kernel!

[Figure: MIPS PIPELINE with the TLB in CP0 (entries: V, D, R, Physical Addr [23:10]); Kernel Memory holds Page Table 1 .. Page Table n.]
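The final model's per-line state, written out as a C struct for illustration (j is a free parameter; the names are invented):

#include <stdint.h>
#include <stdbool.h>

#define J 3                       /* 2^3 = 8 words per line, for example */

/* One line of the final model: V- and D-bits, a tag, and an array of
   2^j data words (a block, filled by a burst read). */
typedef struct {
    bool     valid;               /* V: cleared by hardware on RESET */
    bool     dirty;               /* D: set on write (write back)    */
    uint32_t tag;                 /* PA[23:18-z]                     */
    uint32_t word[1 << J];        /* PA[1+j:2] selects the word      */
} CacheLine;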


TLBs

A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB.

TLB entry: | Virtual Address | Physical Address | Dirty | Ref | Valid | Access |

TLB access time is comparable to cache access time (much less than main memory access time).

Virtual Address and a Cache

[Figure: CPU --VA--> Translation --PA--> Cache; on a hit the cache returns data, on a miss it goes on to Main Memory.]

It takes an extra memory access to translate a VA to a PA. This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible.

ASIDE: Why access the cache with a PA at all? VA caches have a problem!
  – synonym / alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
  – For an update: we must update all cache entries with the same physical address, or memory becomes inconsistent.
  – Determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits; or a software-enforced alias boundary: the low-order bits of VA and PA must be the same up to the cache size.

Translation Look-Aside Buffers

Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.

TLBs are usually small, typically not more than 128 - 256 entries even on high end machines. This permits a fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations.

[Figure: CPU --VA--> TLB Lookup; on a hit, the PA goes to the Cache (data on a cache hit, Main Memory on a cache miss); on a TLB miss, Translation refills the TLB. Access times roughly 1/2 t (TLB), t (cache), 20 t (memory).]

Reducing Translation Time

Machines with TLBs go one step further to reduce cycles/cache access: they overlap the cache access with the TLB access. This works because the high order bits of the VA are used to look in the TLB while the low order bits are used as the index into the cache.


Overlapped Cache & TLB Access

[Figure: the 20-bit virtual page # goes to an associative 32-entry TLB lookup, while the 12-bit disp simultaneously indexes a 4 KB direct-mapped cache (1K lines of 4 bytes: 10-bit index, 2-bit byte offset); the cache tag is then compared with the PA from the TLB, producing Hit/Miss for both.]

IF cache hit AND (cache tag = PA) THEN deliver data to CPU
ELSE IF [cache miss OR (cache tag != PA)] AND TLB hit THEN
    access memory with the PA from the TLB
ELSE do standard VA translation

Problems With Overlapped TLB Access

Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation.

This usually limits things to small caches, large page sizes, or high n-way set associative caches if you want a large cache.

Example: suppose everything is the same except that the cache is increased to 8 K bytes instead of 4 K. The cache index is now 11 bits, so its top bit overlaps the 20-bit virtual page number: that bit is changed by VA translation, but is needed for the cache lookup.

Solutions:
  – go to 8 K byte page sizes;
  – go to a 2-way set associative cache (1K 2-way: a 10-bit index again); or
  – SW guarantee VA[13] = PA[13]


Startup a User process

■ Allocate Stack pages; Make a Page Table:
  – Set Instruction (I), Global Data (D) and Stack (S) pages
  – Clear Resident (R) and Dirty (D) bits
■ Clear the V-bits in the TLB

[Figure: the Page Table in Kernel Memory holds V/D/R bits (all 0) and the Hard Disk address for each I, D and S page; all pages are placed on the Hard Disk, and all TLB entries are invalid.]

Demand Paging

■ IM Stage: we get a TLB Miss (page 0 not resident)
  – The Page Table (Kernel memory) holds the HD address for page 0 (P0)
  – Read the page into RAM page X; update PA[23:10] in the Page Table
  – Update the TLB: set V, clear D, write Page # and PA[23:10]
■ Restart the failing instruction: TLB hit!

[Figure: page 0 (P0) now in RAM at XX…X00..0; the TLB entry holds V=1, 22-bit Page # = 00…0, D=0, Physical Addr PA[23:10] = XX…X.]
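A hedged C sketch of the miss-handling sequence on this slide; the types and the two helper functions are assumed for illustration, not real kernel APIs:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical structures mirroring the slide's Page Table and TLB. */
typedef struct { bool resident, dirty; uint32_t frame, hd_addr; } Pte;
typedef struct { bool valid, dirty; uint32_t vpage, frame; } TlbEntry;

extern uint32_t read_page_from_hd(uint32_t hd_addr); /* returns RAM frame */
extern TlbEntry *tlb_pick_line(void);                /* replacement choice */

/* TLB miss handler (in the kernel): make the page resident, refill the
   TLB, then restart the failing instruction, which now gets a TLB hit. */
void tlb_miss(Pte *page_table, uint32_t vpage) {
    Pte *pte = &page_table[vpage];

    if (!pte->resident) {                             /* page fault */
        pte->frame = read_page_from_hd(pte->hd_addr); /* read page to RAM */
        pte->resident = true;                         /* set R, PA[23:10] */
    }

    TlbEntry *line = tlb_pick_line();
    line->valid = true;                               /* set V            */
    line->dirty = false;                              /* clear D          */
    line->vpage = vpage;                              /* Page #           */
    line->frame = pte->frame;                         /* PA[23:10]        */
}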

Demand Paging

■ DM Stage: we get a TLB Miss and a Page Fault (page 3 not resident)
  – The Page Table (Kernel memory) holds the HD address for page 3 (P3)
  – Read the page into RAM page Y; update PA[23:10] in the Page Table
  – Update the TLB: set V, clear D, write Page # and PA[23:10]
■ Restart the failing instruction: TLB hit!

[Figure: RAM now holds page 0 (at XX…X00..0) and page 3 (at YY…Y00..0); the TLB holds two valid entries: (V=1, Page # = 00…0, D=0, PA[23:10] = XX…X) and (V=1, Page # = 00…011, D=0, PA[23:10] = YY…Y). Pages P0..P3 remain on the HD.]

Spatial and Temporal Locality

Spatial Locality
■ The TLB now holds the page translation: 1024 bytes, 256 instructions
  – The next instruction (PC+4) will cause a TLB Hit
  – Access a data array, e.g., 0($t0), 4($t0) etc
Temporal Locality
■ The TLB holds the translation
  – Branch within the same page, access the same instruction address
  – Access the array again, e.g., 0($t0), 4($t0) etc

THIS IS THE ONLY REASON A SMALL TLB WORKS


Replacement Strategy

If the TLB is full, the OS selects the TLB line to replace
■ Any line will do; they are all the same and concurrently checked
Strategy to select one:
■ Random
  – Not so good
■ Round Robin
  – Not so good, about the same as Random
■ Least Recently Used (move-to-top strategy)
  – Much better (the best we can do without knowing or predicting page accesses). Based on temporal locality

Hierarchy

Small, fast and expensive VS slow, big and inexpensive.
The TLB/RAM contain copies. What if the copies are changed? INCONSISTENCY!

[Figure: TLB 64 Lines <-> RAM 256 Mb (Kernel Memory with Page Table) <-> HD 32 Gb (>> 64 pages).]


Inconsistency

Replace a TLB entry (caused by a TLB Miss):
■ If the old TLB entry is dirty (D-bit), we update the Page Table (Kernel memory)

Replace a page in RAM (swapping, caused by a Page Fault):
■ If the old page is in the TLB
  – Check the old page's TLB D-bit; if Dirty, write the page to HD
  – Clear the TLB V-bit and the Page Table R-bit (now not resident)
■ If the old page is not in the TLB
  – Check the old page's Page Table D-bit; if Dirty, write the page to HD
  – Clear the Page Table R-bit (the page is not resident any more)

Current Working Set

If RAM is full, the OS selects a page to replace (Page Fault)
■ Note! The RAM is shared by many User processes
■ Least Recently Used (move-to-top strategy)
  – Much better (the best we can do without knowing or predicting page accesses)
■ Swapping is VERY expensive (maybe > 100 ms)
  – Why not try harder to keep the pages needed (the working set) in RAM, using advanced algorithms

The current working set of process P = {p0, p3, ...}: the set of pages used during the interval from t to t_now.

Thrashing

[Figure: Probability of Page Fault (0 to 1) vs the fragment of the working set that is not resident (0 to 1); as it approaches 1 we get thrashing.]

When thrashing, no useful work gets done! This we want to avoid.

Summary: Cache, TLB, Virtual Memory

■ Caches, TLBs, and Virtual Memory are all understood by examining how they deal with 4 questions:
  – Where can a block be placed?
  – How is a block found?
  – What block is replaced on a miss?
  – How are writes handled?
■ Page tables map virtual addresses to physical addresses
■ TLBs are important for fast translation
■ TLB misses are significant in processor performance:
  – (most systems can't access all of the 2nd level cache without TLB misses!)


Summary: Memory Hierarchy

■ Virtual memory was controversial at the time: can SW automatically manage 64 KB across many programs?
  – The 1000X DRAM growth removed the controversy

■ Today VM allows many processes to share a single memory without having to swap all processes to disk

■ VM protection is more important than memory space increase

■ Today CPU time is a function of (ops, cache misses) vs. just of (ops):
  – What does this mean to Compilers, Data structures, Algorithms?
