CS 61C: Great Ideas in Computer Architecture
Virtual Memory

Instructors: Krste Asanovic, Randy H. Katz
http://inst.eecs.Berkeley.edu/~cs61c/fa12
11/20/12

Review
• Implementing precise interrupts in in-order pipelines:
  – Save exceptions in pipeline until commit point
  – Check for traps and interrupts before commit
  – No architectural state overwritten before commit
• Support multiprogramming with translation and protection
  – Base and bound: simple scheme, suffers from memory fragmentation
  – Paged systems remove external fragmentation but add indirection through page table
You Are Here!
• Parallel Requests: assigned to computer, e.g., Search “Katz” (Warehouse Scale Computer)
• Parallel Threads: assigned to core, e.g., Lookup, Ads (Harness Parallelism & Achieve High Performance)
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3)
• Hardware descriptions: all gates @ one time
• Programming Languages
[Figure: software/hardware stack from Smart Phone and Warehouse Scale Computer down through Cores, Memory (Cache), Input/Output, Functional Unit(s), Main Memory, and Logic Gates; today’s lecture sits at the memory level]

Private Address Space per User
[Figure: OS and Users 1–3; each user’s VA1 maps through that user’s own Page Table into Physical Memory, with a free list of unused frames]
• Each user has a page table
• Page table contains an entry for each user page
A Problem in the Early Sixties
• There were many applications whose data could not fit in the main memory, e.g., payroll
  – Paged memory system reduced fragmentation but still required the whole program to be resident in the main memory
[Figure: Ferranti Mercury, 1956: a 40k-bit Central Store backed by a 640k-bit drum, all in one address space]

Manual Overlays
• Assume an instruction can address all the storage on the drum
• Method 1: programmer keeps track of addresses in the main memory and initiates an I/O transfer when required
  – Difficult, error-prone!
• Method 2: automatic initiation of I/O transfers by software address translation
  – Brooker’s interpretive coding, 1960
  – Inefficient!
Not just an ancient black art, e.g., IBM Cell microprocessor used in Playstation-3 has explicitly managed local store!
Demand Paging in Atlas (1962)
“A page from secondary storage is brought into the primary storage whenever it is (implicitly) demanded by the processor.” – Tom Kilburn
• Primary memory: 32 pages, 512 words/page
• Secondary memory (drum): 192 pages
[Figure: the 32-page primary store acts as a cache for the 192-page drum]

Hardware Organization of Atlas
• 48-bit words, 512-word pages
• Effective address goes through an initial address decode
• 16 ROM pages (system code, not swapped), 0.4–1 µsec
• 2 subsidiary pages (system data, not swapped), 1.4 µsec
• Main store: 32 pages, 1.4 µsec, one Page Address Register (PAR) per frame (PARs 0–31)
• Drum (4): 192 pages; 8 tape decks, 88 sec/word
[Figure: effective address → initial address decode → PARs → main store, drum, and tape]
Atlas Demand Paging Scheme
• On a page fault:
  – Input transfer into a free page is initiated
  – The Page Address Register (PAR) is updated
  – If no free page is left, a page is selected to be replaced (based on usage)
  – The replaced page is written on the drum
    • To minimize drum latency effect, the first empty page on the drum was selected
  – The page table is updated to point to the new location of the page on the drum (see the sketch after the recap below)

Recap: Typical Memory Hierarchy
• Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology
[Figure: on-chip RegFile, Instr and Data Caches, and Second-Level Cache (SRAM), backed by Main Memory (DRAM) and Secondary Memory (Disk)]
Speed (cycles):  ½’s     1’s     10’s    100’s   10,000’s
Size (bytes):    100’s   10K’s   M’s     G’s     T’s
Cost:            highest → lowest
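The replacement steps above translate almost line-for-line into code. Here is a minimal sketch in C, assuming hypothetical helpers (allocate_free_page, pick_victim_by_usage, first_empty_drum_page, drum_read/drum_write) and a par[] array standing in for the Page Address Registers; none of these names come from the Atlas itself:

```c
#include <stdint.h>

#define NUM_FRAMES 32                 /* Atlas: 32 primary pages */

extern int  allocate_free_page(void);     /* frame number, or -1 if none free */
extern int  pick_victim_by_usage(void);   /* replacement based on usage info  */
extern int  first_empty_drum_page(void);  /* minimizes drum latency           */
extern void drum_read(uint32_t drum_page, int frame);
extern void drum_write(int frame, int drum_page);

uint32_t par[NUM_FRAMES];      /* PAR per frame: which virtual page it holds */
extern uint32_t page_table[];  /* virtual page number -> drum page           */

/* Handle a fault on virtual page vpn whose contents live on the drum. */
void atlas_page_fault(uint32_t vpn) {
    int frame = allocate_free_page();
    if (frame < 0) {                          /* no free page left          */
        frame = pick_victim_by_usage();       /* select a page to replace   */
        int new_home = first_empty_drum_page();
        drum_write(frame, new_home);          /* replaced page goes to drum */
        page_table[par[frame]] = (uint32_t)new_home;  /* its new location   */
    }
    drum_read(page_table[vpn], frame);        /* initiate input transfer    */
    par[frame] = vpn;                         /* update the PAR             */
}
```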
Modern Virtual Memory Systems
Illusion of a large, private, uniform store
• Protection & Privacy: several users, each with their private address space and one or more shared address spaces (page table ≡ name space)
• Demand Paging: provides the ability to run programs larger than the primary memory, using a swapping store behind the primary memory
• Hides differences in machine configurations
• The price is address translation on each memory reference: VA → mapping (TLB) → PA

Administrivia
• Regrade request deadline Monday Nov 26
  – For everything up to Project 4
CS61C in the News
“World’s oldest digital computer successfully reboots”
Iain Thomson, The Register, 11/20/2012
“After three years of restoration by the National Museum of Computing (TNMOC) and staff at Bletchley Park, the world’s oldest functioning digital computer has been successfully rebooted at a ceremony attended by two of its original developers. The 2.5 ton Harwell Dekatron, later renamed the Wolverhampton Instrument for Teaching Computation from Harwell (WITCH), was first constructed in 1949 and from 1951 ran at the UK’s Harwell Atomic Energy Research Establishment, where it was used to process mathematical calculations for Britain’s nuclear program.
The system uses 828 flashing Dekatron valves, each capable of holding a single digit, for volatile memory, plus 480 GPO 3000 type relays to shift calculations and six paper tape readers. It was very slow, taking a couple of seconds for each addition or subtraction, five seconds for multiplication and up to 15 for division.”

Hierarchical Page Table
Virtual address (32 bits): p1 (10-bit L1 index, bits 31–22) | p2 (10-bit L2 index, bits 21–12) | offset (bits 11–0)
• A processor register holds the root of the current page table
• p1 indexes the Level 1 page table; its entry points to one of the Level 2 page tables
• p2 indexes that Level 2 page table; its PTE points to the data page
[Figure: root pointer (processor register) → Level 1 Page Table → Level 2 Page Tables → Data Pages; a PTE may reference a page in primary memory, a page in secondary memory, or a nonexistent page]
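In software terms, the two-level lookup above is just two indexed loads. Here is a minimal sketch assuming the 10/10/12 split from the slide; the PTE layout (a valid bit in the low bits, the physical page number in the high bits) and the helper page_fault are illustrative assumptions, not any real architecture’s format:

```c
#include <stdint.h>

#define VALID 0x1u

extern uint32_t page_fault(uint32_t va);   /* hypothetical miss handler */

/* Two-level walk for the 10-bit/10-bit/12-bit split shown above.
 * Each table holds 1024 word-sized entries; a set VALID bit means the
 * entry points at a Level 2 table (in L1) or a physical page (in L2). */
uint32_t translate(const uint32_t *root, uint32_t va) {
    uint32_t p1     = (va >> 22) & 0x3FFu;  /* bits 31-22: L1 index */
    uint32_t p2     = (va >> 12) & 0x3FFu;  /* bits 21-12: L2 index */
    uint32_t offset =  va        & 0xFFFu;  /* bits 11-0            */

    uint32_t l1e = root[p1];
    if (!(l1e & VALID)) return page_fault(va);        /* no L2 table       */

    const uint32_t *l2 = (const uint32_t *)(uintptr_t)(l1e & ~0xFFFu);
    uint32_t pte = l2[p2];
    if (!(pte & VALID)) return page_fault(va);        /* page not resident */

    return (pte & ~0xFFFu) | offset;                  /* PPN || offset     */
}
```

Note the cost: even when both levels hit, every reference pays two extra memory accesses, which is exactly what the TLB slides below attack.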
Two-Level Page Tables in Physical Memory
[Figure: physical memory holds the Level 1 page tables for User 1 and User 2, plus only those Level 2 page tables and data pages (e.g., User1/VA1, User2/VA1) currently in use by each virtual address space]

Address Translation & Protection
Virtual address: Virtual Page No. (VPN) | offset, plus Kernel/User mode and Read/Write intent
  → Protection Check (Exception?) and Address Translation
Physical address: Physical Page No. (PPN) | offset
• Every instruction and data access needs address translation and protection checks
• A good VM design needs to be fast (~ one cycle) and space efficient
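The protection check amounts to a few bit tests on the translated entry before the PPN is used. A minimal sketch with assumed permission-bit names (PTE_R, PTE_W, PTE_U); real encodings differ across architectures:

```c
#include <stdint.h>
#include <stdbool.h>

#define PTE_R (1u << 1)   /* readable             */
#define PTE_W (1u << 2)   /* writable             */
#define PTE_U (1u << 3)   /* user-mode accessible */

/* True if this access may proceed; false means raise an exception. */
bool access_permitted(uint32_t pte, bool is_write, bool user_mode) {
    if (user_mode && !(pte & PTE_U)) return false;   /* kernel-only page */
    return is_write ? (pte & PTE_W) != 0
                    : (pte & PTE_R) != 0;
}
```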
Translation Lookaside Buffers (TLB)
(really an Address Translation Cache!)
• Address translation is very expensive!
  – In a two-level page table, each reference becomes several memory accesses
• Solution: cache translations in the TLB
  – TLB hit ⇒ single-cycle translation
  – TLB miss ⇒ page-table walk to refill
[Figure: virtual address VPN | offset matched against TLB entries (V R W D | tag | PPN); on hit?, physical address = PPN | offset]
(VPN = virtual page number, PPN = physical page number)

TLB Designs
• Typically 32–128 entries, usually fully associative
  – Each entry maps a large page, hence less spatial locality across pages ⇒ more likely that two entries conflict
  – Sometimes larger TLBs (256–512 entries) are 4–8 way set-associative
  – Larger systems sometimes have multi-level (L1 and L2) TLBs
• Random or FIFO replacement policy
• No process information in TLB?
• TLB Reach: size of largest virtual address space that can be simultaneously mapped by the TLB
  Example: 64 TLB entries, 4KB pages, one page per entry
  TLB Reach = ______?
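A fully associative lookup can be sketched as a scan over the entries; real hardware compares every tag in parallel in one cycle. The entry fields (V bit, tag, PPN) below come from the figure, while the entry count and 4 KB page size are illustrative assumptions; TLB reach is then just entries × page size:

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12                 /* assumed 4 KB pages */
#define PAGE_MASK   ((1u << PAGE_SHIFT) - 1)

struct tlb_entry {
    bool     valid;                    /* V bit */
    uint32_t vpn;                      /* tag   */
    uint32_t ppn;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a hit and fills *pa; hardware does all TLB_ENTRIES
 * tag comparisons at once rather than looping. */
bool tlb_lookup(uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].ppn << PAGE_SHIFT) | (va & PAGE_MASK);
            return true;
        }
    }
    return false;                      /* miss: walk the page table */
}
```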
Handling a TLB Miss
• Software (MIPS, Alpha)
  – TLB miss causes an exception and the operating system walks the page tables and reloads the TLB
  – A privileged “untranslated” addressing mode is used for the walk
• Hardware (SPARC v8, x86, PowerPC, RISC-V)
  – A memory management unit (MMU) walks the page tables and reloads the TLB
  – If a missing (data or PT) page is encountered during the TLB reload, the MMU gives up and signals a Page-Fault exception for the original instruction

Flashcard Quiz: Which statement is false?
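On the software-managed designs above (MIPS/Alpha style), the refill path is an ordinary trap handler. Here is a minimal sketch of that flow, reusing the hypothetical helpers from the earlier examples; no real MIPS or Alpha handler looks exactly like this:

```c
#include <stdint.h>
#include <stdbool.h>

extern uint32_t walk_page_table(uint32_t va);   /* hypothetical helpers */
extern bool     pte_valid(uint32_t pte);
extern void     raise_page_fault(uint32_t va);
extern void     tlb_refill(uint32_t va, uint32_t pte);

/* Trap handler invoked on a TLB miss (software-managed TLB). It runs in
 * a privileged, untranslated addressing mode so that its own page-table
 * loads cannot themselves miss in the TLB. */
void tlb_miss_handler(uint32_t faulting_va) {
    uint32_t pte = walk_page_table(faulting_va);    /* several loads */
    if (!pte_valid(pte)) {
        raise_page_fault(faulting_va);  /* OS fetches the page first */
        return;
    }
    tlb_refill(faulting_va, pte);       /* write entry, e.g., random slot */
    /* Return from exception: the faulting instruction re-executes. */
}
```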
Hierarchical Page Table Walk: SPARC v8
Virtual address: Index 1 (bits 31–24) | Index 2 (bits 23–18) | Index 3 (bits 17–12) | Offset (bits 11–0)
• Context Table Register → Context Table; the Context Register selects the root ptr
• root ptr → L1 Table; PTP → L2 Table; PTP → L3 Table; PTE → physical address
Physical address: PPN (bits 31–12) | Offset (bits 11–0)
• MMU does this table walk in hardware on a TLB miss

Page-Based Virtual-Memory Machine (Hardware Page-Table Walk)
[Figure: pipeline PC → Inst. TLB → Inst. Cache → Decode → Execute → Memory (Data TLB → Data Cache) → Writeback; Page Fault? and Protection violation? can arise at both TLBs; on a Miss? in either TLB, a Hardware Page Table Walker rooted at the Page-Table Base Register fetches PTEs from Main Memory (DRAM) through the memory controller]
• Assumes page tables held in untranslated physical memory
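The hardware walker performs the same loads a software handler would, one level at a time. A minimal sketch of the three-level SPARC v8 walk, treating physical memory as a flat array; the two-bit entry-type tags (ET_PTP/ET_PTE) and field positions are simplified assumptions, not the exact SPARC Reference MMU encodings:

```c
#include <stdint.h>

extern uint32_t phys_mem_read(uint32_t pa);  /* untranslated physical load */
extern uint32_t page_fault(uint32_t va);

#define ET_PTP 1u   /* entry is a Page Table Pointer (assumed encoding) */
#define ET_PTE 2u   /* entry is a Page Table Entry   (assumed encoding) */

/* 8-bit / 6-bit / 6-bit indices plus a 12-bit offset, as on the slide. */
uint32_t sparc_v8_walk(uint32_t root_ptr, uint32_t va) {
    uint32_t idx[3] = { (va >> 24) & 0xFFu,      /* Index 1 */
                        (va >> 18) & 0x3Fu,      /* Index 2 */
                        (va >> 12) & 0x3Fu };    /* Index 3 */
    uint32_t entry = root_ptr;

    for (int level = 0; level < 3; level++) {
        uint32_t table = entry & ~0x3u;                 /* strip type bits */
        entry = phys_mem_read(table + 4u * idx[level]); /* one load/level  */
        if ((entry & 0x3u) == ET_PTE) break;            /* reached a leaf  */
        if ((entry & 0x3u) != ET_PTP) return page_fault(va);
    }
    if ((entry & 0x3u) != ET_PTE) return page_fault(va);
    return (entry & ~0xFFFu) | (va & 0xFFFu);           /* PPN || offset   */
}
```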
Address Translation: Putting It All Together
Virtual Address
  → TLB Lookup (hardware)
    – miss → Page Table Walk (hardware or software)
      • page ∉ memory → Page Fault: OS loads the page, then restarts (software)
      • page ∈ memory → Update TLB, retry the access
    – hit → Protection Check (hardware)
      • denied → Protection Fault → SEGFAULT (software)
      • permitted → Physical Address (to cache)

Handling VM-Related Traps
[Figure: pipeline with Inst. and Data TLBs; TLB miss?, Page Fault?, and Protection violation? can arise at both the fetch and memory stages]
• Handling a TLB miss needs a hardware or software mechanism to refill the TLB
• Handling a page fault (e.g., page is on disk) needs a restartable trap so the software handler can resume after retrieving the page
  – Precise exceptions are easy to restart
  – Can be imprecise but restartable, but this complicates OS software
• Handling a protection violation may abort the process
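The flowchart composes the earlier sketches into one access path. A minimal sketch under the same illustrative assumptions (hypothetical helpers; the TLB is modeled as caching the PTE itself so permission bits travel with a hit):

```c
#include <stdint.h>
#include <stdbool.h>

extern bool     tlb_lookup_pte(uint32_t va, uint32_t *pte); /* hit? + PTE  */
extern uint32_t walk_page_table(uint32_t va);
extern bool     pte_valid(uint32_t pte);
extern void     tlb_refill(uint32_t va, uint32_t pte);
extern void     page_fault(uint32_t va);    /* OS loads page, returns here */
extern void     segfault(uint32_t va);      /* aborts the process          */
extern bool     access_permitted(uint32_t pte, bool is_write, bool user_mode);

/* One memory access, following the flowchart above. */
uint32_t mem_access(uint32_t va, bool is_write, bool user_mode) {
    uint32_t pte;
    while (!tlb_lookup_pte(va, &pte)) {       /* TLB miss path             */
        pte = walk_page_table(va);
        if (!pte_valid(pte)) page_fault(va);  /* not in memory: trap, then
                                                 retry the lookup          */
        else tlb_refill(va, pte);
    }
    if (!access_permitted(pte, is_write, user_mode))
        segfault(va);                         /* protection fault          */
    return (pte & ~0xFFFu) | (va & 0xFFFu);   /* physical address to cache */
}
```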
Address Translation in CPU Pipeline
[Figure: PC → Inst TLB → Inst. Cache → Decode → Execute → Data TLB → Data Cache → Writeback; TLB miss?, Page Fault?, and Protection violation? checks at both TLBs]
• Need to cope with additional latency of TLB:
  – slow down the clock?
  – pipeline the TLB and cache access?
  – virtual address caches (see CS152)
  – parallel TLB/cache access

Concurrent Access to TLB & Cache (Virtual Index/Physical Tag)
[Figure: VA = VPN | L | b, where k is the number of page-offset bits; index L and block offset b address a direct-map cache of 2^L blocks of 2^b bytes while the TLB translates VPN → PPN; the cache’s physical tag is then compared against the PPN (hit?)]
• Index L is available without consulting the TLB ⇒ cache and TLB accesses can begin simultaneously!
• Tag comparison is made after both accesses are completed
• Cases: L + b = k, L + b < k, L + b > k
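Whether the index really is untranslated is pure arithmetic on the address bits. A small self-contained check for an assumed configuration (the page, block, and cache sizes are illustrative):

```c
#include <stdio.h>

/* A virtually-indexed, physically-tagged direct-map cache can be indexed
 * in parallel with the TLB only if its index and block-offset bits lie
 * entirely within the untranslated page offset: L + b <= k. */
int main(void) {
    int k = 12;   /* 4 KB pages: 12 untranslated page-offset bits */
    int b = 6;    /* 64-byte cache blocks                         */
    int L = 6;    /* 2^6 = 64 blocks -> a 4 KB direct-map cache   */

    long cache_bytes = (1L << L) * (1L << b);
    printf("cache = %ld bytes, L + b = %d, k = %d\n", cache_bytes, L + b, k);
    if (L + b <= k)
        printf("index is untranslated: concurrent TLB/cache access works\n");
    else
        printf("L + b > k: add associativity (2^a ways) or restrict mappings\n");
    return 0;
}
```

The L + b > k case is exactly what the associative organization on the next slide addresses: adding 2^a ways removes a bits from the index.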
Virtual-Index Physical-Tag Caches: Associative Organization
[Figure: VA = VPN | a | L = k−b | b; 2^a direct-map banks of 2^L blocks each are indexed with untranslated bits while the TLB translates VPN → PPN; 2^a physical tags are then compared in parallel (hit?)]
• After the PPN is known, 2^a physical tags are compared
• How does this scheme scale to larger caches?

VM features track historical uses
• Bare machine, only physical addresses
  – One program owned entire machine
• Batch-style multiprogramming
  – Several programs sharing CPU while waiting for I/O
  – Base & bound: translation and protection between programs (not virtual memory)
  – Problem with external fragmentation (holes in memory), needed occasional memory defragmentation as new jobs arrived
• Time sharing
  – More interactive programs, waiting for user. Also, more jobs/second.
  – Motivated move to fixed-size page translation and protection, no external fragmentation (but now internal fragmentation, wasted bytes in page)
  – Motivated adoption of virtual memory to allow more jobs to share limited physical memory resources while holding working set in memory
• Virtual Machine Monitors
  – Run multiple operating systems on one machine
  – Idea from 1970s IBM mainframes, now common on laptops
    • e.g., run Windows on top of Mac OS X
  – Hardware support for two levels of translation/protection
    • Guest OS virtual → Guest OS physical → Host machine physical
  – Also basis of Cloud Computing
    • Virtual machine instances for Project 1
Acknowledgements
• These slides contain material developed and copyright by:
  – Arvind (MIT)
  – Krste Asanovic (MIT/UCB)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
  – David Patterson (UCB)
• MIT material derived from course 6.823 • UCB material derived from course CS252