CS61C Review of Cache/VM/TLB
Lecture 26
April 30, 1999
Dave Patterson (http.cs.berkeley.edu/~patterson)
www-inst.eecs.berkeley.edu/~cs61c/schedule.html

Outline
° Review Pipelining
° Review Cache/VM/TLB slides
° Administrivia, "What's this Stuff Good for?"
° 4 Questions on Memory Hierarchy
° Detailed Example
° 3Cs of Caches (if time permits)
° Cache Impact on Algorithms (if time permits)
° Conclusion


Review 1/3: Pipelining Introduction
° Pipelining is a fundamental concept
¥ Multiple steps using distinct resources
¥ Exploiting parallelism in instructions
° What makes it easy? (MIPS vs. 80x86)
¥ All instructions are the same length ⇒ simple instruction fetch
¥ Just a few instruction formats ⇒ read registers while instruction is decoded
¥ Memory operands only in loads and stores ⇒ fewer pipeline stages
¥ Data aligned ⇒ 1 memory access / load, store

Review 2/3: Pipelining Introduction
° What makes it hard?
° Structural hazards: suppose we had only one cache? ⇒ Need more HW resources
° Control hazards: need to worry about branch instructions? ⇒ Branch prediction, delayed branch
° Data hazards: an instruction depends on a previous instruction? ⇒ need forwarding, compiler scheduling

Review 3/3: Advanced Concepts
° Superscalar Issue, Execution, Retire:
¥ Start several instructions each clock cycle (1999: 3-4 instructions)
¥ Execute on multiple units in parallel
¥ Retire in parallel; HW guarantees appearance of simple single instruction execution
° Out-of-order Execution:
¥ Instructions issue in-order, but execute out-of-order when hazards occur (load-use, cache miss, multiplier busy, ...)
¥ Instructions retire in-order; HW guarantees appearance of simple in-order execution

Memory Hierarchy Pyramid
[Figure: pyramid with the Central Processor Unit (CPU) at the top ("Upper") and Level 1, Level 2, Level 3, ..., Level n below ("Lower"); distance from CPU increases and cost/MB decreases moving down the hierarchy, while the size of memory at each level grows]
Principle of Locality (in time, in space) + Hierarchy of Memories of different speed, cost; exploit to improve cost-performance


Why Caches?
[Figure: processor vs. memory performance, 1980-2000 — CPU improves ~60%/yr ("Moore's Law"), DRAM ~7%/yr; the processor-memory performance gap grows ~50%/yr]
° 1989: first Intel CPU with cache on chip
° Today caches dominate die area: 37% of the Alpha 21164, 61% of the StrongArm SA110, 64% of the Pentium Pro

Why virtual memory? (1/2)
° Protection
¥ regions of the address space can be read only, execute only, . . .
° Flexibility
¥ portions of a program can be placed anywhere, without relocation
° Expandability
¥ can leave room in virtual address space for objects to grow
° Storage management
¥ allocation/deallocation of variable sized blocks is costly and leads to (external) fragmentation


Why virtual memory? (2/2)
° Generality
¥ ability to run programs larger than size of physical memory
° Storage efficiency
¥ retain only most important portions of the program in memory
° Concurrent I/O
¥ execute other processes while loading/dumping page

Why Translation Lookaside Buffer (TLB)?
° Paging is the most popular implementation of virtual memory (vs. base/bounds)
° Every paged virtual memory access must be checked against an entry of the Page Table in memory
° Cache of Page Table Entries makes address translation possible without a memory access in the common case
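To make the common case concrete, here is a minimal sketch in C of a TLB in front of a one-level page table; the sizes and names (`tlb`, `page_table`, `PAGE_SHIFT`, a 32-entry fully associative TLB) are illustrative assumptions, not any particular hardware design:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT  12                 /* assume 4 KB pages                  */
#define TLB_ENTRIES 32                 /* assume a small fully assoc. TLB    */

typedef struct { bool valid; uint32_t vpn, ppn; } TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];
static uint32_t page_table[1 << 20];   /* flat 1-level table: VPN -> PPN     */

/* Translate a virtual address; check the TLB first so the common
   case needs no page-table (memory) access. */
uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)        /* common case: TLB hit */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << PAGE_SHIFT) | offset;

    /* TLB miss: walk the page table in memory (slow), then refill a
       TLB entry (entry 0 here just to keep the sketch short). */
    uint32_t ppn = page_table[vpn];
    tlb[0] = (TlbEntry){ true, vpn, ppn };
    return (ppn << PAGE_SHIFT) | offset;
}
```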


Paging/Virtual Memory Review
[Figure: User A and User B each see an ∞ virtual memory (Code, Static, Heap, Stack) mapped through Page Table A and Page Table B, with a TLB, onto a single 64 MB Physical Memory]

Three Advantages of Virtual Memory
1) Translation:
¥ Program can be given consistent view of memory, even though physical memory is scrambled
¥ Makes multiple processes reasonable
¥ Only the most important part of program ("Working Set") must be in physical memory
¥ Contiguous structures (like stacks) use only as much physical memory as necessary yet still grow later
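A tiny illustration of the translation advantage, assuming 4 KB pages and made-up frame numbers: the same virtual address in processes A and B reaches different physical addresses through their separate page tables, so each gets a consistent private view:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12   /* assume 4 KB pages */

/* Per-process page tables; the frame numbers are invented. */
static uint32_t page_table_A[1024] = { [1] = 0x42 };  /* A: VPN 1 -> PPN 0x42 */
static uint32_t page_table_B[1024] = { [1] = 0x77 };  /* B: VPN 1 -> PPN 0x77 */

uint32_t translate(const uint32_t *pt, uint32_t vaddr) {
    return (pt[vaddr >> PAGE_SHIFT] << PAGE_SHIFT)
         | (vaddr & ((1u << PAGE_SHIFT) - 1));
}

int main(void) {
    uint32_t va = 0x1234;  /* VPN 1, offset 0x234 in both processes */
    printf("A: VA 0x%x -> PA 0x%x\n", va, translate(page_table_A, va));
    printf("B: VA 0x%x -> PA 0x%x\n", va, translate(page_table_B, va));
    return 0;
}
```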

Three Advantages of Virtual Memory
2) Protection:
¥ Different processes protected from each other
¥ Different pages can be given special behavior (Read Only, Invisible to user programs, etc.)
¥ Kernel data protected from User programs
¥ Very important for protection from malicious programs ⇒ Far more "viruses" under Microsoft Windows
3) Sharing:
¥ Can map same physical page to multiple users ("Shared memory")

Virtual Memory Summary
° Virtual Memory allows protected sharing of memory between processes, with less swapping to disk and less fragmentation than always-swap or base/bound
° 3 Problems:
1) Not enough memory: Spatial Locality means small Working Set of pages OK
2) TLB to reduce performance cost of VM
3) Need more compact representation to reduce memory size cost of simple 1-level page table, especially for 64-bit addresses (See CS 162)


Administrivia
° 11th homework (last): Due Today
° Next Readings: A.7
° M 5/3: Deadline to correct your grade record (up to 11th lab, 10th homework, 5th project)
° W 5/5: Review: Interrupts / Polling; A.7
° F 5/7: 61C Summary / Your Cal heritage / HKN Course Evaluation (Due: Final 61C Survey in lab)
° Sun 5/9: Final Review starting 2PM (1 Pimentel)
° W 5/12: Final (5-8PM, 1 Pimentel)
¥ Need Early Final? Contact mds@cory

"What's This Stuff (Potentially) Good For?"
Private Spy in Space to Rival Military's: The Ikonos 1 spacecraft is just 0.8 tons and 15 feet long with its solar panels extended, and it runs on just 1,200 watts of power -- a bit more than a toaster. It sees objects as small as one meter, and covers Earth every 3 days. The company plans to sell the imagery for $25 to $300 per square mile via www.spaceimaging.com. "We believe it will fundamentally change the approach to many forms of information that we use in business and our private lives." N.Y. Times, 4/27/99.
Allow civilians to find mines, mass graves, plan disaster relief?

4 Questions for Memory Hierarchy
° Q1: Where can a block be placed in the upper level? (Block placement)
° Q2: How is a block found if it is in the upper level? (Block identification)
° Q3: Which block should be replaced on a miss? (Block replacement)
° Q4: What happens on a write? (Write strategy)

Q1: Where block placed in upper level?
° Block 12 placed in 8 block cache:
¥ Fully associative, direct mapped, 2-way set associative
¥ S.A. Mapping = Block Number Mod Number Sets
[Figure: block-frame address 12 of a 32-block memory placed in an 8-block cache — fully associative: block 12 can go anywhere; direct mapped: block 12 can go only into block 4 (12 mod 8); 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4)]
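A minimal sketch of the slide's three placement rules for block 12 in an 8-block cache; the arithmetic is exactly the mod mapping above (frames within a set are numbered set-by-set here, an assumption of the sketch):

```c
#include <stdio.h>

/* Where can memory block 12 go in an 8-block cache? */
int main(void) {
    int block = 12, cache_blocks = 8;

    /* Direct mapped: exactly one candidate frame. */
    printf("direct mapped: frame %d\n", block % cache_blocks);  /* 12 mod 8 = 4 */

    /* 2-way set associative: 4 sets; block may go in either way of its set. */
    int sets = cache_blocks / 2;
    int set  = block % sets;                                    /* 12 mod 4 = 0 */
    printf("2-way set assoc.: set %d (frames %d or %d)\n",
           set, set * 2, set * 2 + 1);

    /* Fully associative: any of the 8 frames. */
    printf("fully associative: any frame 0..%d\n", cache_blocks - 1);
    return 0;
}
```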

Q2: How is a block found in upper level?
[Figure: address fields — Block Address = Tag | Index, followed by Block offset; the index selects the set, the tag identifies the block, and the block offset selects the data within the block]
° Direct indexing (using index and block offset), tag compares, or combination
° Increasing associativity shrinks index, expands tag

Q3: Which block replaced on a miss?
° Easy for Direct Mapped
° Set Associative or Fully Associative:
¥ Random
¥ LRU (Least Recently Used)

Miss rates, LRU vs. Random (Ran), by associativity:

Size     2-way LRU  2-way Ran  4-way LRU  4-way Ran  8-way LRU  8-way Ran
16 KB    5.2%       5.7%       4.7%       5.3%       4.4%       5.0%
64 KB    1.9%       2.0%       1.5%       1.7%       1.4%       1.5%
256 KB   1.15%      1.17%      1.13%      1.13%      1.12%      1.12%
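For concreteness, here is how the three fields fall out of an address in C, assuming (as an example, not fixed by the slide) 32-byte blocks and 256 sets. Making the cache 2-way associative would halve the number of sets, shrinking the index by one bit and growing the tag by one:

```c
#include <stdint.h>

#define OFFSET_BITS 5   /* 32-byte block -> 5 offset bits (assumed) */
#define INDEX_BITS  8   /* 256 sets      -> 8 index bits  (assumed) */

static inline uint32_t block_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);          /* byte within block */
}
static inline uint32_t cache_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* selects set */
}
static inline uint32_t cache_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);        /* compared on lookup */
}
```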

Q4: What happens on a write?
° Write through—The information is written to both the block in the cache and to the block in the lower-level memory.
° Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
¥ is block clean or dirty?
° Pros and Cons of each?
¥ WT: read misses cannot result in writes
¥ WB: no writes of repeated writes
° WT always combined with write buffers

Comparing the 2 levels of hierarchy

               Cache version                     Virtual Memory version
Unit:          Block or Line                     Page
Miss:          Miss                              Page Fault
Size:          Block Size: 32-64B                Page Size: 4K-8KB
Placement:     Direct Mapped, N-way Set Assoc.   Fully Associative
Replacement:   LRU or Random                     Least Recently Used (LRU)
Writes:        Write Thru or Back                Write Back
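A minimal sketch of the two write-hit policies; the `Line` structure and the `memory[]` stand-in for the lower level are assumptions for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { bool valid, dirty; uint32_t tag; uint8_t data[32]; } Line;

static uint8_t memory[1 << 20];    /* stand-in for lower-level memory */

/* Write through: update the cache AND the lower level on every store. */
void store_write_through(Line *line, uint32_t addr, uint8_t byte) {
    line->data[addr % 32] = byte;
    memory[addr] = byte;           /* in practice buffered by a write buffer */
}

/* Write back: update the cache only; mark the block dirty so it is
   written to memory when it is eventually replaced. */
void store_write_back(Line *line, uint32_t addr, uint8_t byte) {
    line->data[addr % 32] = byte;
    line->dirty = true;
}

/* On replacement, a write-back cache must flush a dirty block. */
void evict(Line *line, uint32_t block_addr) {
    if (line->valid && line->dirty)
        for (int i = 0; i < 32; i++)
            memory[block_addr + i] = line->data[i];
    line->valid = false;
}
```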

Alpha 21064
[Figure: Alpha 21064 memory hierarchy datapath — separate instruction and data sides, each with a fully associative TLB (page-frame address <30>, page offset <13>) and an 8 KB cache (8-bit index, 5-bit block offset, 256 valid/tag/data blocks); an instruction prefetch stream buffer, a delayed write buffer, a 4-entry write buffer and a victim buffer between the data cache and L2; an off-chip 2 MB L2 cache (65,536 blocks) in front of main memory and magnetic disk]
° Separate Instr & Data TLBs and Caches
° TLBs fully associative
° TLB updates in SW
° Caches 8KB direct mapped, write thru
° 2 MB L2 cache, direct mapped, WB (off-chip), 32B block
° 4 entry write buffer between D$ & L2$

Picking Optimal Page Size
° Minimize wasted storage
— small page minimizes internal fragmentation
— small page increases size of page table
° Minimize transfer time
— large pages (multiple disk sectors) amortize access cost
— sometimes transfer unnecessary info
— sometimes prefetch useful data
— sometimes discard useless data early
° General trend toward larger pages because
— big cheap RAM
— increasing mem / disk performance gap
— larger address spaces
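Worked numbers for the wasted-storage tradeoff, assuming (for illustration only) a 32-bit virtual address space, a flat one-level table, and 4-byte page table entries:

```c
#include <stdio.h>

int main(void) {
    for (long long page = 4096; page <= 8192; page *= 2) {
        long long entries = (1LL << 32) / page;   /* one PTE per page */
        printf("%lld KB pages: %lld PTEs -> %lld MB flat table, "
               "~%lld KB avg internal frag per region\n",
               page / 1024, entries,
               entries * 4 / (1 << 20),           /* 4-byte PTEs       */
               page / 2 / 1024);                  /* ~half a page lost */
    }
    return 0;
}
```

Doubling the page size from 4 KB to 8 KB halves the flat table (4 MB to 2 MB) but doubles the expected internal fragmentation per region, which is the tension the slide describes.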

Starting an Alpha 21064: 1/6
° Starts in kernel mode where memory is not translated
° Since software-managed TLB, 1st load TLB with 12 valid mappings for process in Instruction TLB
° Sets the PC to appropriate address for user program and goes into user mode
° Page frame portion of 1st instruction address is sent to Instruction TLB: checks for valid PTE and that access rights match (otherwise an exception for page fault or access violation)

Starting an Alpha 21064: 2/6
° Since 256 32-byte blocks in instruction cache, 8-bit index selects the I cache block
° Translated address is compared to cache block tag (assuming tag entry is valid)
° If hit, proper 4 bytes of block sent to CPU
° If miss, address sent to L2 cache; the 2MB cache has 64K 32-byte blocks, so 16-bit index selects the L2 cache block; address tag compared (assuming it is valid)
° If L2 cache hit, 32 bytes in 10 clock cycles
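The index and tag arithmetic in this walkthrough can be sketched directly; this is pure address arithmetic under the sizes the slides give (8 KB direct-mapped L1, 2 MB direct-mapped L2, 32-byte blocks), not the Alpha's datapath:

```c
#include <stdint.h>

#define BLOCK_BITS 5   /* 32-byte blocks */

/* L1: 8 KB / 32 B = 256 blocks -> 8-bit index. */
static inline uint32_t l1_index(uint32_t pa) { return (pa >> BLOCK_BITS) & 0xFF; }
static inline uint32_t l1_tag(uint32_t pa)   { return  pa >> (BLOCK_BITS + 8);   }

/* L2: 2 MB / 32 B = 64K blocks -> 16-bit index. */
static inline uint32_t l2_index(uint32_t pa) { return (pa >> BLOCK_BITS) & 0xFFFF; }
static inline uint32_t l2_tag(uint32_t pa)   { return  pa >> (BLOCK_BITS + 16);    }
```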


Starting an Alpha 21064: 3/6
° Suppose 1st instruction is a load; 1st sends page frame to data TLB
° If TLB miss, software loads TLB with the PTE
° If data page fault, OS will switch to another process (since disk access = time to execute millions of instructions)
° If TLB hit and access check OK, send translated page frame address to Data Cache
° 8-bit portion of address selects data cache block; tags compared to see if a hit (if valid)
° If miss, send to L2 cache as before

Starting an Alpha 21064: 4/6
° If L2 cache miss, 32 bytes in 36 clocks
° Since L2 cache is write back, any miss can result in a write to main memory; old block placed into buffer and new block fetched first; after new block is loaded, the old block is written to memory if it is dirty

Starting an Alpha 21064: 5/6
° Suppose instead 1st instruction is a store
° TLB still checked (no protection violations and valid PTE); send translated address to data cache
° Access to data cache as before for a hit, except write new data into block
° Since Data Cache is write through, also write data to Write Buffer
° If Data Cache access on store is a miss, write only to write buffer, since policy is no allocate on write miss

Starting an Alpha 21064: 6/6
° Write buffer checks to see if write is already to an address within an entry, and if so updates the block
° If 4-entry write buffer is full, stall until an entry is written to L2 cache block
° All writes eventually passed to L2 cache
° If L2 miss, put old block in buffer, and since L2 allocates on write miss, load missing block from memory; write to portion of block, and mark new L2 cache block as dirty
° If old block in buffer is dirty, then write it to memory
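A sketch of the merging write-buffer behavior described above; the structure and helper are assumptions for illustration (the real buffer is hardware, not a loop):

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4    /* the slide's 4-entry buffer */
#define BLOCK      32   /* 32-byte blocks             */

typedef struct { bool valid; uint32_t block_addr; uint8_t data[BLOCK]; } WbEntry;
static WbEntry wb[WB_ENTRIES];

/* Stub: pretend the oldest entry has been written to the L2 cache. */
static void drain_one_entry_to_L2(void) { wb[0].valid = false; }

void write_buffer_insert(uint32_t addr, uint8_t byte) {
    uint32_t block_addr = addr / BLOCK;

    for (int i = 0; i < WB_ENTRIES; i++)     /* merge into an existing entry */
        if (wb[i].valid && wb[i].block_addr == block_addr) {
            wb[i].data[addr % BLOCK] = byte;
            return;
        }
    for (int i = 0; i < WB_ENTRIES; i++)     /* otherwise take a free entry  */
        if (!wb[i].valid) {
            wb[i].valid = true;
            wb[i].block_addr = block_addr;
            wb[i].data[addr % BLOCK] = byte;
            return;
        }
    drain_one_entry_to_L2();                 /* full: the CPU stalls here    */
    write_buffer_insert(addr, byte);
}
```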

Classifying Misses: 3 Cs
¥ Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. (Misses in even an Infinite Cache)
¥ Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache)
¥ Conflict—If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. (Misses in N-way Associative, Size X Cache)

3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type vs. cache size (1 KB to 128 KB) for 1-, 2-, 4-, and 8-way associativity; the area divides into Conflict (shrinks with associativity), Capacity, and a vanishingly small Compulsory component]
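As a concrete instance of a conflict miss, consider two arrays that happen to map to the same frames of a direct-mapped cache; the 8 KB size and the exact layout are assumptions, and with 2-way associativity the same loop would not thrash:

```c
#define CACHE_SIZE 8192   /* hypothetical 8 KB direct-mapped cache */

/* If the linker places a and b a multiple of CACHE_SIZE apart, a[i]
   and b[i] map to the same cache frame for every i, so alternating
   accesses evict each other: conflict misses even though 16 KB of
   data would fit a larger (or more associative) cache. */
char a[CACHE_SIZE], b[CACHE_SIZE];

long sum_both(void) {
    long sum = 0;
    for (int i = 0; i < CACHE_SIZE; i++)
        sum += a[i] + b[i];   /* each pair of accesses can miss twice */
    return sum;
}
```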

3Cs Relative Miss Rate
[Figure: the same SPEC92 data normalized so each cache size totals 100% — relative shares of Conflict (1-way through 8-way), Capacity, and Compulsory misses vs. cache size (1 KB to 128 KB)]
° 3Cs Flaws: for fixed block size
° 3Cs Pros: insight ⇒ invention

Impact of What Learned About Caches?
[Figure: processor vs. DRAM performance, 1980-2000, as before — the CPU-DRAM gap keeps widening]
° 1960-1985: Speed = f(no. operations)
° 1990s:
¥ Pipelined Execution & Fast Clock Rate
¥ Out-of-Order execution
¥ Superscalar
° 1999: Speed = f(non-cached memory accesses)
° Superscalar, Out-of-Order machines hide L1 data cache miss (5 clocks) but not L2 cache miss (50 clocks)?

Quicksort vs. Radix as vary number keys: Instructions
[Figure: instructions per key (0 to 800) vs. set size (1000 to 10 million keys) for Quicksort and Radix sort — Radix executes fewer instructions per key at large set sizes]

Quicksort vs. Radix as vary number keys: Instructions and Time
[Figure: instructions per key and clocks per key vs. set size for Quicksort and Radix sort — despite its lower instruction count, Radix's clocks per key grow sharply at large set sizes while Quicksort's stay low]

Quicksort vs. Radix as vary number keys: Cache misses
[Figure: cache misses per key (0 to 5) vs. set size (1000 to 10 million keys) — Quicksort stays near the bottom while Radix sort climbs at large set sizes]
What is the proper approach to fast algorithms?

Cache/VM/TLB Summary: #1/3
° The Principle of Locality:
¥ Programs access a relatively small portion of the address space at any instant of time.
- Temporal Locality: Locality in Time
- Spatial Locality: Locality in Space
° 3 Major Categories of Cache Misses:
¥ Compulsory Misses: sad facts of life. Example: cold start misses.
¥ Capacity Misses: increase cache size
¥ Conflict Misses: increase cache size and/or associativity.
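The sorting data above makes the point in the large; in the small, the classic illustration of "same operation count, very different miss count" is matrix traversal order (a generic C example, not from the slides). C stores matrices row-major, so the first loop walks memory sequentially and exploits spatial locality, while the second strides by a whole row on every access:

```c
#define N 1024
static double m[N][N];

double sum_rowwise(void) {            /* cache friendly */
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];             /* consecutive addresses */
    return s;
}

double sum_colwise(void) {            /* same ops, many more misses */
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];             /* stride of N doubles */
    return s;
}
```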

Cache/VM/TLB Summary: #2/3
° Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions: 1) Where can block be placed? 2) How is block found? 3) What block is replaced on miss? 4) How are writes handled?
° Page tables map virtual address to physical address
° TLBs are important for fast translation
° TLB misses are significant in processor performance

Cache/VM/TLB Summary: #3/3
° Virtual memory was controversial at the time: can SW automatically manage 64KB across many programs?
¥ 1000X DRAM growth removed controversy
° Today VM allows many processes to share single memory without having to swap all processes to disk; VM protection is more important than memory hierarchy
° Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms?
