
CS 61C: Great Ideas in Computer Architecture
Instructors: Krste Asanovic, Randy H. Katz
http://inst.eecs.Berkeley.edu/~cs61c/fa12

Review
• Implementing precise interrupts in in-order pipelines:
– Save exceptions in pipeline until commit point
– Check for traps and interrupts before commit
– No architectural state overwritten before commit
• Support multiprogramming with translation and protection
– Base and bound: simple scheme, suffers from memory fragmentation
– Paged systems remove external fragmentation but add indirection through page table


You Are Here!
• Parallel Requests: assigned to computer, e.g., Search "Katz" (Warehouse Scale Computer)
• Parallel Threads: assigned to core, e.g., Lookup, Ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words
• Hardware descriptions: all gates @ one time
• Programming Languages
(Figure: levels of the machine, from Warehouse Scale Computer and Smart Phone down through Core, Memory (Cache), Input/Output, Instruction and Functional Units, Main Memory, and Logic Gates, with today's lecture highlighted — harnessing parallelism to achieve high performance)

Private Address Space per User
• Each user has a page table (a minimal struct sketch follows)
• Page table contains an entry for each user page
(Figure: Users 1, 2, and 3 each map their own VA1 through a private page table into physical memory; the OS keeps its own pages, and unmapped frames remain free)
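To make the picture concrete, here is a minimal C sketch (not from the lecture) of the per-user state the slide describes: one page table per user, so the same virtual address in two users resolves to different physical pages. The type names and sizes (32-bit VAs, 4 KB pages, a flat one-level table) are illustrative assumptions.

    #include <stdint.h>

    #define PAGE_BITS 12                      /* assumed 4 KB pages            */
    #define NPAGES (1u << (32 - PAGE_BITS))   /* pages per 32-bit address space */

    typedef struct {
        uint32_t ppn   : 20;                  /* physical page number          */
        uint32_t valid : 1;                   /* entry maps a resident page?   */
    } pte_t;

    /* One of these per user: the OS switches tables on a context switch,
       so each user sees a private address space. */
    typedef struct {
        pte_t page_table[NPAGES];             /* one entry per user page       */
    } user_address_space_t;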


A Problem in the Early Sixties
• There were many applications whose data could not fit in the main memory, e.g., payroll
– Paged memory system reduced fragmentation but still required the whole program to be resident in the main memory
(Figure: Mercury, 1956 — 40k bits of main memory, 640k bits of drum as central store)

Manual Overlays
• Assume an instruction can address all the storage on the drum
• Method 1: programmer keeps track of addresses in the main memory and initiates an I/O transfer when required
– Difficult, error-prone!
• Method 2: automatic initiation of I/O transfers by software address translation
– Brooker's interpretive coding, 1960
– Inefficient!

Not just an ancient black art, e.g., the IBM Cell used in the Playstation-3 has an explicitly managed local store!



Demand Paging in Atlas (1962)
"A page from secondary storage is brought into the primary storage whenever it is (implicitly) demanded by the processor."
• Primary memory acts as a cache for secondary memory
• Primary: 32 pages of 512 words each (48-bit words)
• Secondary (drum): 32 × 6 pages
• User sees 32 × 6 × 512 words of storage

Hardware Organization of Atlas
• Effective address goes through an initial address decode
• 16 ROM pages (0.4–1 µsec, system code, not swapped); 2 subsidiary pages (1.4 µsec, system data, not swapped)
• Main memory: 32 pages, 1.4 µsec; one Page Address Register (PAR) per page frame (PARs 0–31)
• Drum (4): 192 pages; 8 tape decks: 88 µsec/word
• Compare the effective page address against all 32 PARs:
– match ⇒ normal access
– no match ⇒ page fault; save the state of the partially executed instruction


Atlas Demand Paging Scheme
• On a page fault:
– Input transfer into a free page is initiated
– The Page Address Register (PAR) is updated
– If no free page is left, a page is selected to be replaced (based on usage)
– The replaced page is written on the drum
• To minimize the drum latency effect, the first empty page on the drum was selected
– The page table is updated to point to the new location of the page on the drum

Recap: Typical Memory Hierarchy
• Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology
• On-chip components: control, RegFile, instruction and data caches (SRAM); then a second-level cache, main memory (DRAM), and secondary memory (disk)

Level:           RegFile   L1 caches   L2 cache   Main memory   Secondary (disk)
Speed (cycles):  ½'s       1's         10's       100's         10,000's
Size (bytes):    100's     10K's       M's        G's           T's
Cost:            highest   →           →          →             lowest


Modern Virtual Memory Systems
Illusion of a large, private, uniform store
• Protection & privacy: several users, each with their private address space and one or more shared address spaces (page table ≡ name space)
• Demand paging: provides the ability to run programs larger than the primary memory (pages move between a swapping store and primary memory)
• Hides differences in machine configurations
• The price is address translation on each memory reference (VA → mapping/TLB → PA)

Administrivia
• Regrade request deadline Monday Nov 26
– For everything up to Project 4


CS61C in the News
"World's oldest digital computer successfully reboots"
Iain Thomson, The Register, 11/20/2012
"After three years of restoration by The National Museum of Computing (TNMOC) and staff at Bletchley Park, the world's oldest functioning digital computer has been successfully rebooted at a ceremony attended by two of its original developers. The 2.5 ton Harwell Dekatron, later renamed the Wolverhampton Instrument for Teaching Computation from Harwell (WITCH), was first constructed in 1949 and from 1951 ran at the UK's Harwell Atomic Energy Research Establishment, where it was used for mathematical calculations for Britain's nuclear program.
The system uses 828 flashing Dekatron valves, each capable of holding a single digit, for volatile memory, plus 480 GPO 3000 type relays to shift calculations and six paper tape readers. It was very slow, taking a couple of seconds for each addition or subtraction, five seconds for multiplication and up to 15 for division."

Hierarchical Page Table
• 32-bit virtual address: bits 31–22 = p1 (10-bit L1 index), bits 21–12 = p2 (10-bit L2 index), bits 11–0 = offset
• Root of the current page table is held in a processor register
• p1 indexes the Level 1 page table, whose entry points to a Level 2 page table; p2 indexes that table to find the data page (see the index-extraction sketch below)
• A PTE may mark a page in primary memory, a page in secondary memory, or a nonexistent page
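As a minimal sketch of the indexing above (assuming the slide's 10/10/12 bit split; the helper names are mine, not the lecture's):

    #include <stdint.h>
    #include <assert.h>

    /* Extract the two page-table indices and the page offset from a
       32-bit virtual address. */
    static uint32_t p1(uint32_t va)     { return (va >> 22) & 0x3FF; } /* bits 31-22 */
    static uint32_t p2(uint32_t va)     { return (va >> 12) & 0x3FF; } /* bits 21-12 */
    static uint32_t offset(uint32_t va) { return va & 0xFFF; }         /* bits 11-0  */

    int main(void)
    {
        uint32_t va = 0x00403ABC;   /* p1 = 1, p2 = 3, offset = 0xABC */
        assert(p1(va) == 1 && p2(va) == 3 && offset(va) == 0xABC);
        return 0;
    }

The walk then reads the Level 1 table at index p1 to find a Level 2 table, and that table at index p2 to find the data page; a hardware version of a (deeper) walk appears in the SPARC v8 slide later in the lecture.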


Two-Level Page Tables in Physical Memory
(Figure: each user's Level 1 and Level 2 page tables live in physical memory; User 1's VA1 and User 2's VA1 translate through their own tables to different physical pages)

Address Translation & Protection
• Virtual address = Virtual Page No. (VPN) + offset; physical address = Physical Page No. (PPN) + offset
• Kernel/user mode and read/write intent feed a protection check alongside address translation; a failed check raises an exception
• Every instruction and data access needs address translation and protection checks
• A good VM design needs to be fast (~ one cycle) and space efficient

Translation Lookaside Buffers (TLB)
(really an Address Translation Cache!)
• Address translation is very expensive! In a two-level page table, each reference becomes several memory accesses
• Solution: cache translations in a TLB (see the lookup sketch after this slide)
– TLB hit ⇒ single-cycle translation
– TLB miss ⇒ page-table walk to refill
• Each entry holds V, R, W, D bits, a tag (VPN = virtual page number), and data (PPN = physical page number); on a hit, PPN + offset forms the physical address

TLB Designs
• Typically 32–128 entries, usually fully associative
– Each entry maps a large page, hence less spatial locality across pages ⇒ more likely that two entries conflict
– Sometimes larger TLBs (256–512 entries) are 4–8 way set-associative
– Larger systems sometimes have multi-level (L1 and L2) TLBs
• Random or FIFO replacement policy
• No process information in TLB?
• TLB Reach: size of the largest virtual address space that can be simultaneously mapped by the TLB
Example: 64 TLB entries, 4KB pages, one page per entry
TLB Reach = ______?
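A minimal C sketch of the fully associative lookup described above, using the slide's entry format (V, R, W, D, tag = VPN, data = PPN); the sizes and names are assumptions for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64
    #define PAGE_BITS   12                      /* 4 KB pages */

    typedef struct {
        bool     valid, read, write, dirty;     /* V, R, W, D */
        uint32_t vpn;                           /* tag        */
        uint32_t ppn;                           /* data       */
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_ENTRIES];

    /* Returns true on a hit and fills *pa; hardware compares all entries
       in parallel, so a hit costs a single cycle. A miss triggers a
       page-table walk (hardware or software) to refill the TLB. */
    bool tlb_lookup(uint32_t va, uint32_t *pa)
    {
        uint32_t vpn = va >> PAGE_BITS;
        uint32_t off = va & ((1u << PAGE_BITS) - 1);
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *pa = (tlb[i].ppn << PAGE_BITS) | off;
                return true;
            }
        }
        return false;                           /* TLB miss */
    }

With these assumed parameters, the reach example works out as entries × page size: 64 × 4 KB = 256 KB of virtual address space mapped at once.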



Handling a TLB Miss
• Software (MIPS, Alpha)
– TLB miss causes an exception and the operating system walks the page tables and reloads the TLB. A privileged "untranslated" addressing mode is used for the walk.
• Hardware (SPARC v8, x86, PowerPC, RISC-V)
– A memory management unit (MMU) walks the page tables and reloads the TLB
– If a missing (data or PT) page is encountered during the TLB reload, the MMU gives up and signals a page-fault exception for the original instruction

Flashcard Quiz: Which statement is false?


Hierarchical Page Table Walk: SPARC v8
• Virtual address: Index 1 (bits 31–24), Index 2 (bits 23–18), Index 3 (bits 17–12), Offset (bits 11–0)
• The Context Table Register points to the Context Table; the Context Register selects the root pointer for the current process
• The root pointer indexes the L1 Table; its PTP (page table pointer) indexes the L2 Table; that PTP indexes the L3 Table, whose PTE yields the physical address (PPN, bits 31–12, + offset)
• The MMU does this table walk in hardware on a TLB miss (a simplified software rendering follows)

Page-Based Virtual-Memory Machine (Hardware Page-Table Walk)
• Pipeline: PC → Inst. TLB → Inst. Cache → Decode → E → + → M → Data TLB → Data Cache → W
• Both TLBs can miss, and both accesses can raise page faults or protection violations
• On a miss, a hardware page-table walker uses the Page-Table Base Register to walk the tables in main memory (DRAM) via the memory controller
• Assumes page tables held in untranslated physical memory
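A hedged software rendering of the hardware walk above. The descriptor encodings are simplified (real SPARC v8 SRMMU descriptors carry type and permission bits omitted here), and phys_read() is an assumed stand-in for the walker's untranslated physical-memory accesses:

    #include <stdint.h>

    extern uint32_t phys_read(uint32_t pa);   /* assumed helper: untranslated read */
    extern uint32_t context_table_reg;        /* Context Table Register            */
    extern uint32_t context_reg;              /* Context Register                  */

    #define IDX1(va) (((va) >> 24) & 0xFFu)   /* bits 31-24: 8-bit index  */
    #define IDX2(va) (((va) >> 18) & 0x3Fu)   /* bits 23-18: 6-bit index  */
    #define IDX3(va) (((va) >> 12) & 0x3Fu)   /* bits 17-12: 6-bit index  */
    #define OFF(va)  ((va) & 0xFFFu)          /* bits 11-0: page offset   */

    uint32_t walk(uint32_t va)
    {
        uint32_t l1  = phys_read(context_table_reg + 4 * context_reg); /* root ptr */
        uint32_t l2  = phys_read(l1 + 4 * IDX1(va));                   /* PTP      */
        uint32_t l3  = phys_read(l2 + 4 * IDX2(va));                   /* PTP      */
        uint32_t pte = phys_read(l3 + 4 * IDX3(va));                   /* PTE      */
        /* A real walker checks each descriptor's valid/type bits and raises
           a page fault or protection violation on a bad level. */
        return ((pte >> 12) << 12) | OFF(va);                          /* PPN+off  */
    }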

Address Translation: putting it all together
• Virtual address → TLB lookup (hardware)
– hit ⇒ protection check: if permitted, the physical address goes to the cache; if denied, protection fault (SEGFAULT)
– miss ⇒ page table walk (hardware or software):
• page ∈ memory ⇒ update TLB and retry the access
• page ∉ memory ⇒ page fault; OS loads the page
(a sketch of this flow appears after these slides)

Handling VM-related traps
• Handling a TLB miss needs a hardware or software mechanism to refill the TLB
• Handling a page fault (e.g., page is on disk) needs a restartable trap so the software handler can resume after retrieving the page
– Precise exceptions are easy to restart
– Can be imprecise but restartable, but this complicates OS software
• Handling a protection violation may abort the process
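The whole flowchart condenses into a few lines. This is a sketch only, reusing the hypothetical tlb_lookup() and walk() helpers from the earlier sketches, with page_in_memory() and access_permitted() as assumed stand-ins for the PTE checks:

    #include <stdint.h>
    #include <stdbool.h>

    extern bool     tlb_lookup(uint32_t va, uint32_t *pa);
    extern uint32_t walk(uint32_t va);
    extern bool     page_in_memory(uint32_t va);
    extern bool     access_permitted(uint32_t va, bool is_write);

    typedef enum { OK, PAGE_FAULT, PROTECTION_FAULT } vm_result_t;

    vm_result_t translate(uint32_t va, bool is_write, uint32_t *pa)
    {
        if (!tlb_lookup(va, pa)) {                /* TLB miss                  */
            if (!page_in_memory(va))              /* page on disk?             */
                return PAGE_FAULT;                /* restartable trap: OS loads
                                                     the page, then retries    */
            *pa = walk(va);                       /* walk refills the TLB      */
        }
        if (!access_permitted(va, is_write))      /* protection check          */
            return PROTECTION_FAULT;              /* may abort (SEGFAULT)      */
        return OK;                                /* physical address to cache */
    }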


Address Translation in CPU Pipeline
• PC → Inst. TLB → Inst. Cache → Decode → E → + → M → Data TLB → Data Cache → W; TLB misses, page faults, and protection violations can arise at both the instruction and data accesses
• Need to cope with the additional latency of the TLB:
– slow down the clock?
– pipeline the TLB and cache access?
– virtual address caches (see CS152)
– parallel TLB/cache access

Concurrent Access to TLB & Cache (Virtual Index/Physical Tag)
• Split the VA into VPN and k-bit page offset; a direct-mapped cache of 2^L blocks of 2^b bytes uses L index bits and b block-offset bits
• Index L is available without consulting the TLB ⇒ cache and TLB accesses can begin simultaneously!
• The TLB supplies the PPN, which is compared against the cache's physical tag after both accesses complete
• Cases: L + b = k, L + b < k, L + b > k (a worked instance follows)
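A worked instance of those cases, under assumed parameters: with 4 KB pages the page offset has k = 12 bits. A direct-mapped cache with 32-byte blocks (b = 5) and 128 sets (L = 7) gives L + b = 12 = k, so the whole cache index lies inside the untranslated page offset and the 4 KB cache can be indexed in parallel with the TLB. Doubling it to 8 KB (L = 8) gives L + b > k: one index bit now comes from the VPN, so the same physical block can land in two different sets (aliasing), which is why larger virtually indexed caches instead raise associativity, as the next slide shows.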


Virtual-Index Physical-Tag Caches: Associative Organization
• VA: VPN + index of L = k − b bits + b-bit block offset; a 2^a-way set-associative cache, each way direct-mapped with 2^L blocks
• After the PPN is known, the 2^a physical tags are compared in parallel
• How does this scheme scale to larger caches? (see the worked example after these slides)

VM features track historical uses:
• Bare machine, only physical addresses
– One program owned entire machine
• Batch-style multiprogramming
– Several programs sharing CPU while waiting for I/O
– Base & bound: translation and protection between programs (not virtual memory)
– Problem with external fragmentation (holes in memory); needed occasional memory defragmentation as new jobs arrived
• Time sharing
– More interactive programs, waiting for user. Also, more jobs/second.
– Motivated move to fixed-size page translation and protection, no external fragmentation (but now internal fragmentation, wasted bytes in page)
– Motivated adoption of virtual memory to allow more jobs to share limited physical memory resources while holding working set in memory
• Virtual Machine Monitors
– Run multiple operating systems on one machine
– Idea from 1970s IBM mainframes, now common on laptops
• e.g., run Windows on top of Mac OS X
– Hardware support for two levels of translation/protection
• Guest OS virtual → Guest OS physical → Host machine physical
– Also basis of Cloud Computing
• Virtual machine instances for Project 1
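A worked answer to the scaling question on the associative-organization slide above, under assumed parameters: with 4 KB pages (k = 12) and 64-byte blocks (b = 6), each direct-mapped way can hold at most 2^(k−b) = 64 sets, i.e., 4 KB. A 32 KB cache therefore needs 2^a = 8 ways, and the hit path must compare 8 physical tags after the TLB delivers the PPN; growing the cache further keeps multiplying the ways, which is the scheme's scaling limit.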


Acknowledgements

• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)

• MIT material derived from course 6.823
• UCB material derived from course CS252

