CS61C Review of Cache/VM/TLB
Lecture 26
April 30, 1999
Dave Patterson (http.cs.berkeley.edu/~patterson)
www-inst.eecs.berkeley.edu/~cs61c/schedule.html

Outline
° Review Pipelining
° Review Cache/VM/TLB slides
° Administrivia, "What's this Stuff Good for?"
° 4 Questions on Memory Hierarchy
° Detailed Example
° 3Cs of Caches (if time permits)
° Cache Impact on Algorithms (if time permits)
° Conclusion


Review 1/3: Pipelining Introduction
° Pipelining is a fundamental concept
¥ Multiple steps using distinct resources
¥ Exploiting parallelism in instructions
° What makes it easy? (MIPS vs. 80x86)
¥ All instructions are the same length ⇒ simple instruction fetch
¥ Just a few instruction formats ⇒ read registers while instruction is decoded
¥ Memory operands only in loads and stores ⇒ fewer pipeline stages
¥ Data aligned ⇒ 1 memory access / load, store

Review 2/3: Pipelining Introduction
° What makes it hard?
° Structural hazards: suppose we had only one cache? ⇒ Need more HW resources
° Control hazards: need to worry about branch instructions? ⇒ Branch prediction, delayed branch
° Data hazards: an instruction depends on a previous instruction? ⇒ need forwarding, compiler scheduling

Review 3/3: Advanced Concepts
° Superscalar Issue, Execution, Retire:
¥ Start several instructions each clock cycle (1999: 3-4 instructions)
¥ Execute on multiple units in parallel
¥ Retire in parallel; HW guarantees appearance of simple single instruction execution
° Out-of-order Execution:
¥ Instructions issue in-order, but execute out-of-order when hazards occur (load-use, cache miss, multiplier busy, ...)
¥ Instructions retire in-order; HW guarantees appearance of simple in-order execution

Memory Hierarchy Pyramid
[Figure: pyramid with the Central Processor Unit (CPU) at the top ("Upper") and Level 1, Level 2, Level 3, ..., Level n below ("Lower"); distance from CPU increases and cost/MB decreases moving down the hierarchy, while the size of memory at each level grows]
Principle of Locality (in time, in space) + Hierarchy of Memories of different speed, cost; exploit to improve cost-performance


Why Caches?
[Figure: processor vs. memory performance, 1980-2000 — CPU improves ~60%/yr ("Moore's Law"), DRAM ~7%/yr; the processor-memory performance gap grows ~50%/yr]
° 1989: first Intel CPU with cache on chip
° Today caches dominate die area: 37% of the Alpha 21164, 61% of the StrongArm SA110, 64% of the Pentium Pro

Why virtual memory? (1/2)
° Protection
¥ regions of the address space can be read only, execute only, . . .
° Flexibility
¥ portions of a program can be placed anywhere, without relocation
° Expandability
¥ can leave room in virtual address space for objects to grow
° Storage management
¥ allocation/deallocation of variable sized blocks is costly and leads to (external) fragmentation


Why virtual memory? (2/2)
° Generality
¥ ability to run programs larger than size of physical memory
° Storage efficiency
¥ retain only most important portions of the program in memory
° Concurrent I/O
¥ execute other processes while loading/dumping page

Why Translation Lookaside Buffer (TLB)?
° Paging is the most popular implementation of virtual memory (vs. base/bounds)
° Every paged virtual memory access must be checked against an entry of the Page Table in memory
° Cache of Page Table Entries makes address translation possible without a memory access in the common case
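To make the common case concrete, here is a minimal sketch in C of a TLB in front of a one-level page table; the sizes and names (`tlb`, `page_table`, `PAGE_SHIFT`, a 32-entry fully associative TLB) are illustrative assumptions, not any particular hardware design:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT  12                 /* assume 4 KB pages                  */
#define TLB_ENTRIES 32                 /* assume a small fully assoc. TLB    */

typedef struct { bool valid; uint32_t vpn, ppn; } TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];
static uint32_t page_table[1 << 20];   /* flat 1-level table: VPN -> PPN     */

/* Translate a virtual address; check the TLB first so the common
   case needs no page-table (memory) access. */
uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)        /* common case: TLB hit */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << PAGE_SHIFT) | offset;

    /* TLB miss: walk the page table in memory (slow), then refill a
       TLB entry (entry 0 here just to keep the sketch short). */
    uint32_t ppn = page_table[vpn];
    tlb[0] = (TlbEntry){ true, vpn, ppn };
    return (ppn << PAGE_SHIFT) | offset;
}
```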


Paging/Virtual Memory Review
[Figure: User A and User B each see an ∞ virtual memory (Code, Static, Heap, Stack) mapped through Page Table A and Page Table B, with a TLB, onto a single 64 MB Physical Memory]

Three Advantages of Virtual Memory
1) Translation:
¥ Program can be given consistent view of memory, even though physical memory is scrambled
¥ Makes multiple processes reasonable
¥ Only the most important part of program ("Working Set") must be in physical memory
¥ Contiguous structures (like stacks) use only as much physical memory as necessary yet still grow later
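A tiny illustration of the translation advantage, assuming 4 KB pages and made-up frame numbers: the same virtual address in processes A and B reaches different physical addresses through their separate page tables, so each gets a consistent private view:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12   /* assume 4 KB pages */

/* Per-process page tables; the frame numbers are invented. */
static uint32_t page_table_A[1024] = { [1] = 0x42 };  /* A: VPN 1 -> PPN 0x42 */
static uint32_t page_table_B[1024] = { [1] = 0x77 };  /* B: VPN 1 -> PPN 0x77 */

uint32_t translate(const uint32_t *pt, uint32_t vaddr) {
    return (pt[vaddr >> PAGE_SHIFT] << PAGE_SHIFT)
         | (vaddr & ((1u << PAGE_SHIFT) - 1));
}

int main(void) {
    uint32_t va = 0x1234;  /* VPN 1, offset 0x234 in both processes */
    printf("A: VA 0x%x -> PA 0x%x\n", va, translate(page_table_A, va));
    printf("B: VA 0x%x -> PA 0x%x\n", va, translate(page_table_B, va));
    return 0;
}
```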

Three Advantages of Virtual Memory
2) Protection:
¥ Different processes protected from each other
¥ Different pages can be given special behavior (Read Only, Invisible to user programs, etc.)
¥ Kernel data protected from User programs
¥ Very important for protection from malicious programs ⇒ Far more "viruses" under Microsoft Windows
3) Sharing:
¥ Can map same physical page to multiple users ("Shared memory")

Virtual Memory Summary
° Virtual Memory allows protected sharing of memory between processes, with less swapping to disk and less fragmentation than always-swap or base/bound
° 3 Problems:
1) Not enough memory: Spatial Locality means small Working Set of pages OK
2) TLB to reduce performance cost of VM
3) Need more compact representation to reduce memory size cost of simple 1-level page table, especially for 64-bit addresses (See CS 162)


Administrivia
° 11th homework (last): Due Today
° Next Readings: A.7
° M 5/3: Deadline to correct your grade record (up to 11th lab, 10th homework, 5th project)
° W 5/5: Review: Interrupts / Polling; A.7
° F 5/7: 61C Summary / Your Cal heritage / HKN Course Evaluation (Due: Final 61C Survey in lab)
° Sun 5/9: Final Review starting 2PM (1 Pimentel)
° W 5/12: Final (5-8PM, 1 Pimentel)
¥ Need Early Final? Contact mds@cory

"What's This Stuff (Potentially) Good For?"
Private Spy in Space to Rival Military's: The Ikonos 1 spacecraft is just 0.8 tons and 15 feet long with its solar panels extended, and it runs on just 1,200 watts of power -- a bit more than a toaster. It sees objects as small as one meter, and covers Earth every 3 days. The company plans to sell the imagery for $25 to $300 per square mile via www.spaceimaging.com. "We believe it will fundamentally change the approach to many forms of information that we use in business and our private lives." N.Y. Times, 4/27/99.
Allow civilians to find mines, mass graves, plan disaster relief?

4 Questions for Memory Hierarchy
° Q1: Where can a block be placed in the upper level? (Block placement)
° Q2: How is a block found if it is in the upper level? (Block identification)
° Q3: Which block should be replaced on a miss? (Block replacement)
° Q4: What happens on a write? (Write strategy)

Q1: Where block placed in upper level?
° Block 12 placed in 8 block cache:
¥ Fully associative, direct mapped, 2-way set associative
¥ S.A. Mapping = Block Number Mod Number Sets
[Figure: block-frame address 12 of a 32-block memory placed in an 8-block cache — fully associative: block 12 can go anywhere; direct mapped: block 12 can go only into block 4 (12 mod 8); 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4)]
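A minimal sketch of the slide's three placement rules for block 12 in an 8-block cache; the arithmetic is exactly the mod mapping above (frames within a set are numbered set-by-set here, an assumption of the sketch):

```c
#include <stdio.h>

/* Where can memory block 12 go in an 8-block cache? */
int main(void) {
    int block = 12, cache_blocks = 8;

    /* Direct mapped: exactly one candidate frame. */
    printf("direct mapped: frame %d\n", block % cache_blocks);  /* 12 mod 8 = 4 */

    /* 2-way set associative: 4 sets; block may go in either way of its set. */
    int sets = cache_blocks / 2;
    int set  = block % sets;                                    /* 12 mod 4 = 0 */
    printf("2-way set assoc.: set %d (frames %d or %d)\n",
           set, set * 2, set * 2 + 1);

    /* Fully associative: any of the 8 frames. */
    printf("fully associative: any frame 0..%d\n", cache_blocks - 1);
    return 0;
}
```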

Q2: How is a block found in upper level?
[Figure: address fields — Block Address = Tag | Index, followed by Block offset; the index selects the set, the tag identifies the block, and the block offset selects the data within the block]
° Direct indexing (using index and block offset), tag compares, or combination
° Increasing associativity shrinks index, expands tag

Q3: Which block replaced on a miss?
° Easy for Direct Mapped
° Set Associative or Fully Associative:
¥ Random
¥ LRU (Least Recently Used)

Miss rates, LRU vs. Random (Ran), by associativity:

Size     2-way LRU  2-way Ran  4-way LRU  4-way Ran  8-way LRU  8-way Ran
16 KB    5.2%       5.7%       4.7%       5.3%       4.4%       5.0%
64 KB    1.9%       2.0%       1.5%       1.7%       1.4%       1.5%
256 KB   1.15%      1.17%      1.13%      1.13%      1.12%      1.12%
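For concreteness, here is how the three fields fall out of an address in C, assuming (as an example, not fixed by the slide) 32-byte blocks and 256 sets. Making the cache 2-way associative would halve the number of sets, shrinking the index by one bit and growing the tag by one:

```c
#include <stdint.h>

#define OFFSET_BITS 5   /* 32-byte block -> 5 offset bits (assumed) */
#define INDEX_BITS  8   /* 256 sets      -> 8 index bits  (assumed) */

static inline uint32_t block_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);          /* byte within block */
}
static inline uint32_t cache_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* selects set */
}
static inline uint32_t cache_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);        /* compared on lookup */
}
```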

Q4: What happens on a write?
° Write through—The information is written to both the block in the cache and to the block in the lower-level memory.
° Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
¥ is block clean or dirty?
° Pros and Cons of each?
¥ WT: read misses cannot result in writes
¥ WB: no writes of repeated writes
° WT always combined with write buffers

Comparing the 2 levels of hierarchy

               Cache version                     Virtual Memory version
Unit:          Block or Line                     Page
Miss:          Miss                              Page Fault
Size:          Block Size: 32-64B                Page Size: 4K-8KB
Placement:     Direct Mapped, N-way Set Assoc.   Fully Associative
Replacement:   LRU or Random                     Least Recently Used (LRU)
Writes:        Write Thru or Back                Write Back
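A minimal sketch of the two write-hit policies; the `Line` structure and the `memory[]` stand-in for the lower level are assumptions for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { bool valid, dirty; uint32_t tag; uint8_t data[32]; } Line;

static uint8_t memory[1 << 20];    /* stand-in for lower-level memory */

/* Write through: update the cache AND the lower level on every store. */
void store_write_through(Line *line, uint32_t addr, uint8_t byte) {
    line->data[addr % 32] = byte;
    memory[addr] = byte;           /* in practice buffered by a write buffer */
}

/* Write back: update the cache only; mark the block dirty so it is
   written to memory when it is eventually replaced. */
void store_write_back(Line *line, uint32_t addr, uint8_t byte) {
    line->data[addr % 32] = byte;
    line->dirty = true;
}

/* On replacement, a write-back cache must flush a dirty block. */
void evict(Line *line, uint32_t block_addr) {
    if (line->valid && line->dirty)
        for (int i = 0; i < 32; i++)
            memory[block_addr + i] = line->data[i];
    line->valid = false;
}
```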

Alpha 21064
[Figure: Alpha 21064 memory hierarchy datapath — separate instruction and data sides, each with a fully associative TLB (page-frame address <30>, page offset <13>) and an 8 KB cache (8-bit index, 5-bit block offset, 256 valid/tag/data blocks); an instruction prefetch stream buffer, a delayed write buffer, a 4-entry write buffer and a victim buffer between the data cache and L2; an off-chip 2 MB L2 cache (65,536 blocks) in front of main memory and magnetic disk]
° Separate Instr & Data TLBs and Caches
° TLBs fully associative
° TLB updates in SW
° Caches 8KB direct mapped, write thru
° 2 MB L2 cache, direct mapped, WB (off-chip), 32B block
° 4 entry write buffer between D$ & L2$

Picking Optimal Page Size
° Minimize wasted storage
— small page minimizes internal fragmentation
— small page increases size of page table
° Minimize transfer time
— large pages (multiple disk sectors) amortize access cost
— sometimes transfer unnecessary info
— sometimes prefetch useful data
— sometimes discard useless data early
° General trend toward larger pages because
— big cheap RAM
— increasing mem / disk performance gap
— larger address spaces
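Worked numbers for the wasted-storage tradeoff, assuming (for illustration only) a 32-bit virtual address space, a flat one-level table, and 4-byte page table entries:

```c
#include <stdio.h>

int main(void) {
    for (long long page = 4096; page <= 8192; page *= 2) {
        long long entries = (1LL << 32) / page;   /* one PTE per page */
        printf("%lld KB pages: %lld PTEs -> %lld MB flat table, "
               "~%lld KB avg internal frag per region\n",
               page / 1024, entries,
               entries * 4 / (1 << 20),           /* 4-byte PTEs       */
               page / 2 / 1024);                  /* ~half a page lost */
    }
    return 0;
}
```

Doubling the page size from 4 KB to 8 KB halves the flat table (4 MB to 2 MB) but doubles the expected internal fragmentation per region, which is the tension the slide describes.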

Starting an Alpha 21064: 1/6
° Starts in kernel mode where memory is not translated
° Since software-managed TLB, 1st load TLB with 12 valid mappings for process in Instruction TLB
° Sets the PC to appropriate address for user program and goes into user mode
° Page frame portion of 1st instruction address is sent to Instruction TLB: checks for valid PTE and that access rights match (otherwise an exception for page fault or access violation)

Starting an Alpha 21064: 2/6
° Since 256 32-byte blocks in instruction cache, 8-bit index selects the I cache block
° Translated address is compared to cache block tag (assuming tag entry is valid)
° If hit, proper 4 bytes of block sent to CPU
° If miss, address sent to L2 cache; the 2MB cache has 64K 32-byte blocks, so 16-bit index selects the L2 cache block; address tag compared (assuming it is valid)
° If L2 cache hit, 32 bytes in 10 clock cycles
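The index and tag arithmetic in this walkthrough can be sketched directly; this is pure address arithmetic under the sizes the slides give (8 KB direct-mapped L1, 2 MB direct-mapped L2, 32-byte blocks), not the Alpha's datapath:

```c
#include <stdint.h>

#define BLOCK_BITS 5   /* 32-byte blocks */

/* L1: 8 KB / 32 B = 256 blocks -> 8-bit index. */
static inline uint32_t l1_index(uint32_t pa) { return (pa >> BLOCK_BITS) & 0xFF; }
static inline uint32_t l1_tag(uint32_t pa)   { return  pa >> (BLOCK_BITS + 8);   }

/* L2: 2 MB / 32 B = 64K blocks -> 16-bit index. */
static inline uint32_t l2_index(uint32_t pa) { return (pa >> BLOCK_BITS) & 0xFFFF; }
static inline uint32_t l2_tag(uint32_t pa)   { return  pa >> (BLOCK_BITS + 16);    }
```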


Starting an Alpha 21064: 3/6
° Suppose 1st instruction is a load; 1st sends page frame to data TLB
° If TLB miss, software loads TLB with the PTE
° If data page fault, OS will switch to another process (since disk access = time to execute millions of instructions)
° If TLB hit and access check OK, send translated page frame address to Data Cache
° 8-bit portion of address selects data cache block; tags compared to see if a hit (if valid)
° If miss, send to L2 cache as before

Starting an Alpha 21064: 4/6
° If L2 cache miss, 32 bytes in 36 clocks
° Since L2 cache is write back, any miss can result in a write to main memory; old block placed into buffer and new block fetched first; after new block is loaded, the old block is written to memory if it is dirty

Starting an Alpha 21064: 5/6
° Suppose instead 1st instruction is a store
° TLB still checked (no protection violations and valid PTE); send translated address to data cache
° Access to data cache as before for a hit, except write new data into block
° Since Data Cache is write through, also write data to Write Buffer
° If Data Cache access on store is a miss, write only to write buffer, since policy is no allocate on write miss

Starting an Alpha 21064: 6/6
° Write buffer checks to see if write is already to an address within an entry, and if so updates the block
° If 4-entry write buffer is full, stall until an entry is written to L2 cache block
° All writes eventually passed to L2 cache
° If L2 miss, put old block in buffer, and since L2 allocates on write miss, load missing block from memory; write to portion of block, and mark new L2 cache block as dirty
° If old block in buffer is dirty, then write it to memory
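A sketch of the merging write-buffer behavior described above; the structure and helper are assumptions for illustration (the real buffer is hardware, not a loop):

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4    /* the slide's 4-entry buffer */
#define BLOCK      32   /* 32-byte blocks             */

typedef struct { bool valid; uint32_t block_addr; uint8_t data[BLOCK]; } WbEntry;
static WbEntry wb[WB_ENTRIES];

/* Stub: pretend the oldest entry has been written to the L2 cache. */
static void drain_one_entry_to_L2(void) { wb[0].valid = false; }

void write_buffer_insert(uint32_t addr, uint8_t byte) {
    uint32_t block_addr = addr / BLOCK;

    for (int i = 0; i < WB_ENTRIES; i++)     /* merge into an existing entry */
        if (wb[i].valid && wb[i].block_addr == block_addr) {
            wb[i].data[addr % BLOCK] = byte;
            return;
        }
    for (int i = 0; i < WB_ENTRIES; i++)     /* otherwise take a free entry  */
        if (!wb[i].valid) {
            wb[i].valid = true;
            wb[i].block_addr = block_addr;
            wb[i].data[addr % BLOCK] = byte;
            return;
        }
    drain_one_entry_to_L2();                 /* full: the CPU stalls here    */
    write_buffer_insert(addr, byte);
}
```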

Classifying Misses: 3 Cs
¥ Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. (Misses in even an Infinite Cache)
¥ Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache)
¥ Conflict—If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. (Misses in N-way Associative, Size X Cache)

3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type vs. cache size (1 KB to 128 KB) for 1-, 2-, 4-, and 8-way associativity; the area divides into Conflict (shrinks with associativity), Capacity, and a vanishingly small Compulsory component]
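As a concrete instance of a conflict miss, consider two arrays that happen to map to the same frames of a direct-mapped cache; the 8 KB size and the exact layout are assumptions, and with 2-way associativity the same loop would not thrash:

```c
#define CACHE_SIZE 8192   /* hypothetical 8 KB direct-mapped cache */

/* If the linker places a and b a multiple of CACHE_SIZE apart, a[i]
   and b[i] map to the same cache frame for every i, so alternating
   accesses evict each other: conflict misses even though 16 KB of
   data would fit a larger (or more associative) cache. */
char a[CACHE_SIZE], b[CACHE_SIZE];

long sum_both(void) {
    long sum = 0;
    for (int i = 0; i < CACHE_SIZE; i++)
        sum += a[i] + b[i];   /* each pair of accesses can miss twice */
    return sum;
}
```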

3Cs Relative Miss Rate
[Figure: the same SPEC92 data normalized so each cache size totals 100% — relative shares of Conflict (1-way through 8-way), Capacity, and Compulsory misses vs. cache size (1 KB to 128 KB)]
° 3Cs Flaws: for fixed block size
° 3Cs Pros: insight ⇒ invention

Impact of What Learned About Caches?
[Figure: processor vs. DRAM performance, 1980-2000, as before — the CPU-DRAM gap keeps widening]
° 1960-1985: Speed = f(no. operations)
° 1990s:
¥ Pipelined Execution & Fast Clock Rate
¥ Out-of-Order execution
¥ Superscalar
° 1999: Speed = f(non-cached memory accesses)
° Superscalar, Out-of-Order machines hide L1 data cache miss (5 clocks) but not L2 cache miss (50 clocks)?

Quicksort vs. Radix as vary number keys: Instructions
[Figure: instructions per key (0 to 800) vs. set size (1000 to 10 million keys) for Quicksort and Radix sort — Radix executes fewer instructions per key at large set sizes]

Quicksort vs. Radix as vary number keys: Instructions and Time
[Figure: instructions per key and clocks per key vs. set size for Quicksort and Radix sort — despite its lower instruction count, Radix's clocks per key grow sharply at large set sizes while Quicksort's stay low]

Quicksort vs. Radix as vary number keys: Cache misses
[Figure: cache misses per key (0 to 5) vs. set size (1000 to 10 million keys) — Quicksort stays near the bottom while Radix sort climbs at large set sizes]
What is the proper approach to fast algorithms?

Cache/VM/TLB Summary: #1/3
° The Principle of Locality:
¥ Programs access a relatively small portion of the address space at any instant of time.
- Temporal Locality: Locality in Time
- Spatial Locality: Locality in Space
° 3 Major Categories of Cache Misses:
¥ Compulsory Misses: sad facts of life. Example: cold start misses.
¥ Capacity Misses: increase cache size
¥ Conflict Misses: increase cache size and/or associativity.
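The sorting data above makes the point in the large; in the small, the classic illustration of "same operation count, very different miss count" is matrix traversal order (a generic C example, not from the slides). C stores matrices row-major, so the first loop walks memory sequentially and exploits spatial locality, while the second strides by a whole row on every access:

```c
#define N 1024
static double m[N][N];

double sum_rowwise(void) {            /* cache friendly */
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];             /* consecutive addresses */
    return s;
}

double sum_colwise(void) {            /* same ops, many more misses */
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];             /* stride of N doubles */
    return s;
}
```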

Cache/VM/TLB Summary: #2/3
° Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions: 1) Where can block be placed? 2) How is block found? 3) What block is replaced on miss? 4) How are writes handled?
° Page tables map virtual address to physical address
° TLBs are important for fast translation
° TLB misses are significant in processor performance

Cache/VM/TLB Summary: #3/3
° Virtual memory was controversial at the time: can SW automatically manage 64KB across many programs?
¥ 1000X DRAM growth removed controversy
° Today VM allows many processes to share single memory without having to swap all processes to disk; VM protection is more important than memory hierarchy
° Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms?
