Announcements

• Reminders – Lab 3 and PA2 (part a) are due today

EE108B Lecture 14

Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b

C. Kozyrakis EE108b Lecture 14

Review: Hardware Support for Operating System

• Operating system
  – Manages hardware resources (CPU, memory, I/O devices) on behalf of user applications
  – Provides protection and isolation
  – Virtualization
    • Processes and virtual memory
• What hardware support is required for the OS?
  – Kernel and user mode
    • Kernel mode: access to all state, including privileged state
    • User mode: access to user state ONLY
  – Exceptions/interrupts
  – Virtual memory support

MIPS Interrupts

• What does the CPU do on an exception?
  – Sets the EPC register to point to the restart location
  – Changes CPU mode to kernel and disables interrupts
  – Sets the Cause register to the reason for the exception; also sets the BadVaddr register on an address exception
  – Jumps to the interrupt vector
• Interrupt vectors
  – 0x8000 0000: TLB refill
  – 0x8000 0180: all other exceptions
• Privileged state
  – EPC
  – Cause
  – BadVaddr

A Really Simple Exception Handler

    xcpt_cnt:
        la   $k0, xcptcount   # get address of counter
        lw   $k1, 0($k0)      # load counter
        addu $k1, 1           # increment counter
        sw   $k1, 0($k0)      # store counter
        eret                  # restore status reg: enable interrupts
                              # and user mode; return to program (EPC)

• Can’t survive nested exceptions
  – OK, since it does not re-enable interrupts
• Does not use any user registers
  – No need to save them

Accelerating Handler Dispatch

• 1. Each type of interrupt has a unique exception number k
• 2. Jump table (interrupt vector) entry k points to a function (exception handler)
• 3. Handler k is called each time exception k occurs

[Figure: exception numbers 0, 1, 2, …, n-1 index a jump table whose entries point to the code for exception handlers 0 through n-1.]

But the CPU is much faster than I/O devices, and most OSs use a common handler anyway. Make the common case fast!


Exception Example #1

• Memory Reference
  – User writes to a memory location:

        int a[1000];
        main ()
        {
            a[500] = 13;
        }

  – That portion (page) of the user’s memory is currently not in memory
  – Page handler must load the page into physical memory
  – Returns to the faulting instruction
  – Successful on second try

Exception Example #2

• Memory Reference
  – User writes to a memory location:

        int a[1000];
        main ()
        {
            a[5000] = 13;
        }

  – Address is not valid
  – Page handler detects the invalid address
  – Sends SIGSEGV signal to the user process
  – User process exits with “segmentation fault”

[Figures: timelines of user process vs. OS. Example #1 — the lw faults, the OS creates the page and loads it into memory, then returns to the faulting instruction. Example #2 — the lw faults, the OS detects the invalid address and signals the process.]

Review: OS ↔ User Code Switches

• System call instructions (syscall)
  – Voluntary transfer to OS code to open a file, write data, change working directory, …
  – Jumps to predefined location(s) in the operating system
  – Automatically switches from user mode to kernel mode
• Exceptions & interrupts
  – Illegal opcode, divide by zero, …, timer expiration, I/O, …
  – Exceptional instruction becomes a jump to OS code
    • Precise exceptions
  – External interrupts attached to some instruction in the pipeline
  – New state: EPC, Cause, BadVaddr registers

[Figure: next-PC mux selects among PC + 4, branch/jump PC, and the exception vector; EPC captures the restart PC.]

Process

• Each process has its own private address space

[Figure: process memory layout — 0xffffffff: kernel memory (code, data, heap, stack), invisible to user code; 0x80000000: user stack (created at runtime), $sp; run-time heap (managed by malloc), growing up to brk; 0x10000000: read/write segment (.data, .bss), loaded from the executable file; 0x00400000: read-only segment (.text); below that, unused; 0.]

Motivations for Virtual Memory

• #1 Use physical DRAM as a cache for the disk
  – Address space of a process can exceed physical memory size
  – Sum of address spaces of multiple processes can exceed physical memory
• #2 Simplify memory management
  – Multiple processes resident in main memory
    • Each process with its own address space
  – Only “active” code and data is actually in memory
    • Allocate more memory to a process as needed
• #3 Provide protection
  – One process can’t interfere with another
    • because they operate in different address spaces
  – User process cannot access privileged information
    • different sections of address spaces have different permissions

Motivation #1: DRAM as a “Cache” for Disk

• Full address space is quite large:
  – 32-bit addresses: ~4,000,000,000 (4 billion) bytes
  – 64-bit addresses: ~16,000,000,000,000,000,000 (16 quintillion) bytes
• Disk storage is ~100X cheaper than DRAM storage
  – SRAM: 4 MB: ~$120
  – DRAM: 1 GB: ~$128
  – Disk: 100 GB: ~$125
• To access large amounts of data in a cost-effective manner, the bulk of the data must be stored on disk

Levels in Memory Hierarchy

[Figure: CPU registers exchange 8 B blocks with the cache; the cache exchanges 32 B blocks with memory; memory exchanges 8 KB pages with disk. Each level is larger, slower, cheaper.]

                   Register   Cache      Memory      Disk Memory
  size:            128 B      < 4 MB     < 16 GB     > 100 GB
  speed (cycles):  0.5–1      1–20       80–100      5–10 M
  $/Mbyte:                    $30/MB     $0.128/MB   $0.001/MB
  line size:       8 B        32 B       8 KB

DRAM vs. SRAM as a “Cache”

• DRAM vs. disk is more extreme than SRAM vs. DRAM
  – Access latencies:
    • DRAM ~10X slower than SRAM
    • Disk ~100,000X slower than DRAM
  – Importance of exploiting spatial locality:
    • First byte is ~100,000X slower than successive bytes on disk
    • vs. ~4X improvement for page-mode vs. regular accesses to DRAM
  – Bottom line:
    • Design decisions made for DRAM caches are driven by the enormous cost of misses


Impact of These Properties on Design

• If DRAM were to be organized similar to an SRAM cache, how would we set the following design parameters?
  – Line size?
    • Large, since disk is better at transferring large blocks, and this minimizes the miss rate
  – Associativity?
    • High, to minimize the miss rate
  – Write through or write back?
    • Write back, since we can’t afford to perform small writes to disk

Locating an Object in a “Cache”

• SRAM Cache
  – Tag stored with cache line
  – Maps from cache block to memory blocks
    • From cached to uncached form
  – No tag for a block not in the cache
  – Hardware retrieves information
    • can quickly match against multiple tags

[Figure: looking up object X in an SRAM cache — the name X is compared against the tags (D, X, …, J) stored with each line; the matching line returns its data (17).]

Locating an Object in a “Cache” (cont.)

• DRAM Cache
  – Each allocated page of virtual memory has an entry in the page table
  – Mapping from virtual pages to physical pages
    • From uncached form to cached form
  – Page table entry exists even if the page is not in memory
    • Specifies disk address
  – OS retrieves information

[Figure: looking up object X through a page table — each name maps to a location (D: 0, J: on disk, X: 1), and the data is read from that location in memory.]

A System with Physical Memory Only

• Examples:
  – most Cray machines, early PCs, nearly all embedded systems, etc.

[Figure: CPU physical addresses point directly into memory locations 0 … N-1.]

Addresses generated by the CPU point directly to bytes in physical memory

A System with Virtual Memory

• Examples:
  – workstations, servers, modern PCs, etc.

[Figure: CPU virtual addresses 0 … P-1 are translated through the page table to physical addresses in memory 0 … N-1, or to locations on disk.]

Address translation: hardware converts virtual addresses to physical addresses via an OS-managed lookup table (page table)

Page Faults (Similar to “Cache Misses”)

• What if an object is on disk rather than in memory?
  – Page table entry indicates the virtual address is not in memory
  – OS exception handler invoked to move data from disk into memory
    • current process suspends, others can resume
    • OS has full control over placement, etc.

[Figure: before the fault, the page table entry points to disk; after the fault, it points to the page’s new location in memory.]

Servicing a Page Fault

• (1) Processor signals controller
  – Read block of length P starting at disk address X and store starting at memory address Y
• (2) Read occurs
  – Direct Memory Access (DMA)
  – Under control of I/O controller
• (3) I/O controller signals completion
  – Interrupts processor
  – OS resumes suspended process

[Figure: processor, cache, memory, and I/O controller on a memory–I/O bus; the disk transfers the block into memory via DMA.]

Motivation #2: Memory Management

• Multiple processes can reside in physical memory
• How do we resolve address conflicts?
  – what if two processes access something at the same address?

[Figure: process memory image — kernel virtual memory (invisible to user code); stack, $sp; runtime heap (via malloc), bounded by the “brk” ptr; uninitialized data (.bss); initialized data (.data); program text (.text); forbidden; 0.]


Solution: Separate Virtual Address Spaces

– Virtual and physical address spaces divided into equal-sized blocks
  • blocks are called “pages” (both virtual and physical)
– Each process has its own virtual address space
  • operating system controls how virtual pages are assigned to physical memory

[Figure: virtual address spaces 0 … N-1 for process 1 and process 2 map their virtual pages (VP 1, VP 2) to frames of the physical address space (DRAM), 0 … M-1; a frame such as PP 7 can be shared by both processes, e.g., read-only library code.]

Motivation #3: Protection

• Page table entry contains access-rights information
  – hardware enforces this protection (trap into OS if a violation occurs)

  Process i:    Read?   Write?   Physical Addr
    VP 0:       Yes     No       PP 9
    VP 1:       Yes     Yes      PP 4
    VP 2:       No      No       XXXXXXX

  Process j:    Read?   Write?   Physical Addr
    VP 0:       Yes     Yes      PP 6
    VP 1:       Yes     No       PP 9
    VP 2:       No      No       XXXXXXX

Paging Address Translation

• 32-bit virtual address with 1 KB pages: 22-bit page number + 10-bit offset
• 30-bit physical address: 20-bit frame number + 10-bit offset
• Virtual pages can map to any physical frame (fully associative)

[Figure: virtual memory pages 0, 1, 2 (1 KB each, at addresses 0, 1024, 2048) map to physical frames 0, 1, 2.]

Page Table/Translation

[Figure: the page number indexes the page table, starting from the Page Table Base Register; each entry holds a valid bit, access rights, and a frame #; the frame # is concatenated with the offset to form the physical address sent to memory.]

• Valid bit indicates if the virtual page is currently mapped
• Access rights define legal accesses (Read, Write, Execute)
• Table located in physical memory


Process Address Space

• Each process has its own address space
  – Each process must therefore have its own translation table
  – Translation tables are changed only by the OS
  – A process cannot gain access to another process’s address space
• Context switches
  – Steps
    1. Save previous process’ state (registers, PC) in the Process Control Block (PCB)
    2. Change page table pointer to the new process’s table
    3. Flush the TLB (discussed later)
    4. Initialize PC and registers from the new process’ PCB
  – Context switches occur on a timeslice (when the scheduler determines the next process should get the CPU) or when blocking for I/O

Translation Process

• Valid page
  – Check access rights (R, W, X) against the access type
  – Generate physical address if allowed
  – Generate a protection fault if the access is illegal
• Invalid page
  – Page is not currently mapped and a page fault is generated
• Faults are handled by the operating system
  – Protection fault is often a program error, and the program should be terminated
  – Page fault requires that a new frame be allocated, the page entry be marked valid, and the instruction restarted
• Frame table has the mapping from physical address to virtual address and tracks used frames

Unaligned Memory Access

• Memory access might be aligned or unaligned

[Figure: aligned accesses fall on 4-byte boundaries (0, 4, 8, 12, 16); unaligned accesses straddle them.]

• What happens if an unaligned access straddles a page boundary?
  – What if one page is present and the other is not?
  – Or, what if neither is present?
• MIPS architecture disallows unaligned memory access
• Interesting legacy problem on 80x86, which does support unaligned access

Page Table Problems

• “Internal” fragmentation resulting from fixed-size pages, since not all of a page will be filled with data
  – Problem gets worse as page size is increased
• Each data access now takes two memory accesses
  – First memory reference uses the page table base register to look up the frame
  – Second memory access to actually fetch the value
• Page tables are large
  – Consider the case where the machine offers a 32-bit address space and uses 4 KB pages
    • Page table contains 2^32 / 2^12 = 2^20 entries
    • Assume each entry contains ~32 bits
    • Page table requires 4 MB of RAM per process!


Improving Access Time

• But we know how to fix this problem: cache the page table and leverage the principles of locality
• Rather than use the normal caches, we use a special cache known as the Translation Lookaside Buffer (TLB)
  – All accesses are looked up in the TLB
  – A hit provides the physical frame number
  – On a miss, look in the page table for the frame
  – TLBs are usually smaller than the cache
    • Higher associativity is common
    • Similar speed to a cache access

[Figure: CPU sends the virtual address to the TLB lookup; a hit yields the physical address for the cache/main-memory access; a miss accesses the page tables for the translation (from memory!) and fills in the TLB entry.]

TLB Entries

• Just holds a cached Page Table Entry (PTE)
• Additional bits for LRU, etc. may be added

  Virtual Page | Physical Frame | Dirty | LRU | Valid | Access Rights

• Treat like a normal cache and choose write back vs. write through
• Optimization is to access TLB and cache in parallel

C. Kozyrakis EE108b Lecture 14 31 C. Kozyrakis EE108b Lecture 14 32 Address Translation with a TLB TLB Miss Handler n–1p p–1 0 virtual page number page offset virtual address • A TLB miss can be handled by software or hardware – Miss rate : 0.01% – 1% validtag physical page number TLB • In software, a special OS handler looks up the value in the page . . . table – Ten instructions on R3000/R4000 = • In hardware, microcode or a dedicated finite state machine (FSM) TLB hit looks up the value physical address – Page tables are stored in regular physical memory tag byte offset index – It is therefore likely that the data is in the L2 cache valid tag data – Advantages Cache • No need to take an exception • Better performance but may be dominated by cache miss

= cache hit data C. Kozyrakis EE108b Lecture 14 33 C. Kozyrakis EE108b Lecture 14 34

TLB Case Study: MIPS R2000/R3000

• Consider the MIPS R2000/R3000 processors
  – Addresses are 32 bits with 4 KB pages (12-bit offset)
  – TLB has 64 entries, fully associative
  – Each entry is 64 bits wide:

      | Virtual Page (20) | PID (6) | 0 (6) | Physical Page (20) | N | D | V | G | 0 (8) |

    PID  Process ID
    N    Do not cache memory address
    D    Dirty bit
    V    Valid bit
    G    Global (valid regardless of PID)

TLB Caveats

• What happens on a context switch?
  – If the TLB entries have a Process ID (PID) associated with them, then nothing needs to be done
  – Otherwise, the OS must flush the entries in the TLB
• Limited reach
  – 64-entry TLB with 8 KB pages maps only 0.5 MB
  – Smaller than many L2 caches
  – TLB miss rate > L2 miss rate!
  – Motivates larger page size

Page Size

• Larger pages
  – Advantages
    • Smaller page tables
    • Fewer page faults and more efficient transfer with larger applications
    • Improved TLB coverage
  – Disadvantages
    • Higher internal fragmentation
• Smaller pages
  – Advantages
    • Improved time to start up small processes with fewer pages
    • Internal fragmentation is low (important for small programs)
  – Disadvantages
    • High overhead of large page tables
• General trend toward larger pages
  – 1978: 512 B
  – 1984: 4 KB
  – 1990: 16 KB
  – 2000: 64 KB

Multiple Page Sizes

• Some machines support multiple page sizes to balance competing goals
  – Page size dependent upon application
    • OS kernel uses large pages
    • User applications use smaller pages
  – Examples
    • SPARC: 8 KB, 64 KB, 1 MB, 4 MB
    • MIPS R4000: 4 KB – 16 MB
  – Results
    • Complicates TLB design, since it must account for page size


Page Sharing

• Another benefit of paging is that we can easily share pages by mapping multiple pages to the same physical frame

[Figure: 1 KB pages from the virtual memories of process 1 and process 2 map to the same frames of physical memory.]

Page Sharing (cont)

• Useful in many different circumstances
  – Useful for applications that need to share data
  – Read-only data for applications, OS, etc.
  – Example: the Unix fork() system call creates a second process with a “copy” of the current address space
    • Often implemented using copy-on-write
      – Second process has read-only access to the page
      – If the second process wants to write, then a page fault occurs and the page is copied

Copy-on-Write

[Figure: after fork(), the 1 KB pages of the virtual memories of process 1 and process 2 map to the same physical frames; a page is copied only when one process writes to it.]

Page Table Size

• Page table size is proportional to the size of the address space
  – Even when applications use few pages, the page table is still very large
• Example: Intel 80x86 page tables
  – Virtual addresses are 32 bits
  – Page size is 4 KB
  – Total number of pages: 2^32 / 2^12 = 1 million
  – Page Table Entries (PTEs) are 32 bits wide
    • 20-bit frame address
    • Dirty bit, accessed bit, valid (present) bit
    • 9 flag and unused bits
  – Total page table size is therefore 2^20 × 4 bytes = 4 MB
  – But only a small fraction of those pages are actually used!
• The large page table size is made even worse by the fact that the page table must be in memory
  – Otherwise, what would happen if the page table were suddenly swapped out?

Paging the Page Tables

• One solution is to use multi-level page tables
  – If each PTE is 4 bytes, then each 4 KB frame can hold 1024 PTEs
  – Use an additional “master block” 4 KB frame to index frames containing PTEs
    • Master frame has 1024 PFEs (Page Frame Entries)
    • Each PFE points to a frame containing 1024 PTEs
    • Each PTE points to a page
  – Now, only the “master block” must be fixed in memory

Two-level Paging

[Figure: the virtual address is split into PFE index, PTE index, and offset; the page table register (PTR) locates the master block, the PFE indexes into it to find the page table, the PTE indexes into that to find the desired page, and the offset selects the byte.]

• Problems
  – Multiple page faults
    • Accessing a PTE page table can cause a page fault
    • Accessing the actual page can cause a second page fault
  – TLB plays an even more important role

Virtual Memory Overview

• We can extend our caching idea one step further and use secondary storage as one of the elements in our memory hierarchy
• Virtual memory is the process of using secondary storage to create the illusion of an even larger memory space without incurring the cost
• Removes the programming burden of managing limited physical memory

Virtual Memory Approach

• Presents the illusion of a large address space to every application
  – Each process or program believes it has its own complete copy of physical memory (its own copy of the address space)
• Strategy
  – Pages not currently in use can be written back to secondary storage
  – When page ownership changes, save the current contents to disk so they can be restored later
• Costs
  – Address translation is still required to convert virtual addresses into physical addresses
  – Writing to the disk is a slow operation
  – TLB misses can limit performance


Virtual Addresses

• Each virtual address can be mapped into one of three categories
  – A physical memory address (by way of the TLB or page table)
  – A location on disk (we suffer a page miss)
  – Nothing – corresponds to an invalid memory address
• Pages are fetched only on demand, when accessed

Page Misses

• If a page miss corresponds to a valid address that is no longer available, then the page must be reloaded
  – Locate the contents on disk that were swapped out
  – Find an available physical frame (or evict another frame)
  – Write the contents into the frame
  – Update the page table mappings (for both this process and the one that was evicted, if necessary)

[Figure: memory hierarchy — registers, cache, memory, and disk; frames of memory hold pages that move to and from disk.]

Page Fault

• Programs use virtual addresses, which may include pages that currently reside in secondary storage
• If an entry is not found, then the address is not currently located in physical memory, and a page fault is said to have occurred
  – A page fault is handled by the operating system
  – The item must first be fetched from disk storage and loaded into a physical frame
  – A page table entry must be created
  – The access is repeated

[Figure: a virtual address goes through address translation; a valid translation yields a physical address into main memory, while a missing item raises a fault handled by the OS fault handler, which transfers the page from disk.]
