9.1 CS356 Unit 9

Virtual Memory & Address Translation

9.2 Indirection

• Indirection means using one entity to refer to another
• Examples:
  – A variable name vs. its physical memory address
  – A phone number vs. the cell tower location/phone ID
  – Titles like "CEO" or "head coach" are virtual titles that can be applied to different people at different times
• The benefit is that we can change one without changing the other
  – We can change the underlying implementation without changing the higher-level task. For example, a job description would read "The CEO shall perform this duty or that," and it need not be changed if the company replaces John Doe with Jane Doe.
• "All problems in computer science can be solved by another level of indirection" – attributed to David Wheeler

9.3 Virtual Memory & Address Translation

• We are going to indirect the addresses used by computer programs
• Primary idea = compile the program with fictitious (virtual) addresses and have a translator convert these to physical addresses as the program runs (this is Address Translation)
  – Efficiently share the physical memory between several running programs/processes and provide protection from accessing each other's information
• Secondary idea = use main memory (MM) as a "cache" for multiple programs' data as they run, using secondary storage (hard drive) as the home location (this is Virtual Memory)
  – Remove the need for the programmer to know how much memory is physically present, and/or give the illusion of more or less physical memory than is present
• These two terms are often used interchangeably

9.4 Benefits of Address Translation

• What is enabled through virtual memory and address translation?
  – Illusion of more or less memory than physically present
  – Isolation
  – *Controlled sharing of code or data
  – *Efficient I/O (memory-mapped files)
  – *Dynamic allocation (heap / stack growth)
  – *Process migration
• *Will be discussed in a subsequent unit or an Operating Systems class

9.5 Memory Hierarchy & Caching

• Each level acts as a cache for the larger, slower level below it:
  – Registers
  – L1 Cache (~1 ns)
  – L2 Cache (~10 ns); L1/L2 serve as a "cache" for main memory
  – Main Memory (~100 ns)
  – Disk / Secondary Storage (~1-10 ms)
• Virtual memory provides each process its own address space in secondary storage and uses main memory as a cache

9.6 Secondary Storage: Magnetic Disks

• A magnetic hard drive consists of:
  – Double-sided surfaces/platters (each with a R/W head)
  – Each platter is divided into concentric tracks of small sectors that each store several thousand bits
• Performance is slow, primarily due to the moving mechanical parts:
  – Seek time: time needed to position the read head above the proper track (3-12 ms)
  – Rotational delay: time needed to bring the right sector under the read head (5-6 ms); depends on rotation speed (e.g., at 5400 RPM one revolution takes 60/5400 ≈ 11.1 ms, and on average we wait half a revolution ≈ 5.6 ms)
  – Transfer time: 0.1 ms
  – Disk controller overhead: +2.0 ms
  – Total: ~20 ms
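A quick back-of-the-envelope check of these numbers in C (a sketch using the slide's illustrative values and the upper end of the seek range, not measurements):

/* Average disk access time from the component delays above. */
#include <stdio.h>

int main(void) {
    double rpm        = 5400.0;
    double rev_ms     = 60000.0 / rpm;   /* one revolution ~= 11.1 ms */
    double rotational = rev_ms / 2.0;    /* average = half a turn     */
    double seek       = 12.0;            /* upper end of 3-12 ms      */
    double transfer   = 0.1;
    double controller = 2.0;
    printf("avg rotational delay = %.1f ms\n", rotational);
    printf("total access time  ~= %.1f ms\n",
           seek + rotational + transfer + controller);   /* ~20 ms */
    return 0;
}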

9.7 Secondary Storage: Flash

• Flash (solid-state) drives store bits using special transistors that retain their values even when power is turned off
• Performance is higher than magnetic disks but still slower compared to main memory
  – Better sequential read throughput: HD (magnetic) 122-204 MB/s; SSD 210-270 MB/s
  – MUCH better random reads: max latency for a single read/write is 75 µs, and when many requests are present we can overlap them and achieve a latency of around 26 µs (1/38500 s) [OS:PP 2nd Ed., Fig. 12.6; Intel 710 SSD specs]
• Flash drives "wear out" after some number of writes/erasures

9.8 Address Spaces

• The physical address space corresponds to the actual system address range (based on the width of the address bus of the processor) and how much main memory is physically present
• Each process/program runs in its own private "virtual" address space
  – A virtual address space can be larger (or smaller) than physical memory
  – Virtual address spaces are protected from each other
[Figure: a 32-bit physical address space with only 1 GB of memory present (the rest not used or memory-mapped I/O) next to fictitious 32-bit virtual address spaces (> 1 GB) for processes 1, 2, 3, …: code at 0x00000000, data at 0x10000000, heap growing up, stack growing down from 0xc0000000, memory-mapped I/O up to 0xffffffff]

9.9 Processes

• Process
  – (def 1) Address space + threads
    • (Virtual) address space = a protected view of memory
    • 1 or more threads
  – (def 2) A running instance of a program that has limited rights
    • Memory is protected: address translation (VM) ensures no access to any other process's memory
    • I/O is protected: processes execute in user mode (not kernel mode), which generally means direct I/O access is disallowed, requiring system calls into the kernel instead
• The OS kernel is not considered a "process"
  – It has access to all resources, and much of its code is invoked under the execution of a user process's thread

9.10 Virtual Address Spaces (VAS)

• Virtual address spaces (VASs) are broken into blocks called "pages"
• Depending on the program, much of the virtual address space will be unused
• Pages can be allocated "on demand" (i.e., when the stack grows, etc.)
• All allocated pages can be stored in secondary storage (hard drive)
[Figure: used/unused page blocks in each process's virtual address space (0-Code, 1-Code, 2-Data, 3-Stack, 4-Heap, unallocated) with the allocated pages kept in secondary storage]

9.11 Physical Address Space (PAS)

• Physical memory is broken into page-size blocks called "frames"
• Multiple programs can be running, and their pages can share the physical memory
• Physical memory acts as a cache for pages, with secondary storage acting as the backing store (the next lower level in the hierarchy)
• A page can be:
  – Unallocated (not needed yet, e.g., unused stack/heap area)
  – Allocated and residing in secondary storage (uncached)
  – Allocated and residing in main memory (cached)
[Figure: two fictitious 32-bit virtual address spaces whose pages map into the frames of a 1 GB physical memory (with an unused/I/O region above 0x3fffffff); the remaining allocated pages live in secondary storage]

9.12 Paging

• The virtual address space is divided into equal-size "pages" (often around 4KB)
• Physical memory is broken into page frames, each of which can hold any page of virtual memory (and then be swapped for another page)
• Virtual address spaces can be contiguous while the physical layout is not
  – Since all pages are the same size, any page can go in any frame and be swapped at our desire

9.13 Virtual vs. Physical Addresses

• Key: programs are written using virtual addresses
• HW & the OS will translate the virtual addresses used by the program to the physical addresses where those pages reside
• If an attempt is made to access a page that is not in physical memory, HW generates a "page fault exception" and the OS is invoked to bring the page into physical memory (possibly evicting another page)
• Notice: virtual addresses are not unique; each program/process has its own
[Figure: the Translation Unit / MMU (Memory Management Unit) sits between the processor core (virtual addresses) and physical memory (physical addresses); e.g., VA 0x040000 → PA 0x21b000 and VA 0x100080 → PA 0x11f000]

9.14 Summary

• A program takes an abstract (virtual) view of memory and uses virtual addresses; the necessary data is broken into large chunks called pages
• HW and the OS work together to bring pages into main memory, which acts as a cache, while allowing sharing
• HW and the OS work together to perform translation between:
  – Virtual address: the address used by the process (programmer)
  – Physical address: the physical memory location of the desired data
• Translation allows protection against other programs

9.15 VM Design Implications

• SLOW secondary storage access on page faults (100 µs - 10 ms)
  – Implies the page size should be fairly large (i.e., once we've taken the time to find data on disk, make it worthwhile by accessing a reasonably large amount of data)
  – Implies the placement of pages in main memory should be fully associative, to reduce conflicts and maximize page hit rates
  – Implies a "page fault" is going to take so much time to even access the data that we can handle it in software (via an exception) rather than in HW like typical cache misses
  – Implies we should use a write-back policy for pages (since write-through would be too expensive)

9.16 ADDRESS TRANSLATION: Page Tables

9.17 Page Size and Address Translation

• Since pages are usually retrieved from disk, we size them to be fairly large (several KB) to amortize the large access time
• Virtual page number (VPN) to physical page frame number (PPFN) translation is performed by a HW unit = the MMU (Memory Management Unit)
• The page table is an in-memory table that the HW MMU will use to look up translations from VPN to PPFN
[Figure: a 32-bit virtual address = Virtual Page Number (bits 31-12, 20 bits) + offset within page (bits 11-0, 12 bits); the translation (MMU + page table) looks up VPN 0x00040 and finds the page lives in PPFN 0x0021b; the offset is copied unchanged, so the physical address = PPFN + offset = 0x0021b000]
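A minimal C sketch of this split/concatenate step, using the figure's example values (the PPFN is hard-coded here in place of the page table lookup):

/* VPN/offset split for 4 KB pages (12 offset bits). */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t va   = 0x00040000;
    uint32_t vpn  = va >> 12;      /* upper 20 bits                  */
    uint32_t off  = va & 0xFFF;    /* lower 12 bits, copied unchanged */
    uint32_t ppfn = 0x0021b;       /* pretend the page table said so  */
    uint32_t pa   = (ppfn << 12) | off;
    printf("VPN=0x%05x offset=0x%03x PA=0x%08x\n", vpn, off, pa);
    return 0;                      /* prints PA=0x0021b000            */
}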

9.18 Address Translation Issues

• We want to take advantage of all the physical memory, so page placement should be fully associative
  – For 1 GB of physical memory and 4KB pages, a page can be placed in any of the 256K = 2^18 page frames
• We could potentially track the contents of physical memory using techniques similar to a cache's
  – TAG = the VPN that is currently stored in the frame
  – This would be too many tags to check
• Instead, most systems implement full associativity using a look-up table = PAGE TABLE
[Figure: each of the 2^18 physical frames would need a V / Tag (VPN) / M entry against which the processor's virtual address (VPN + offset) is compared]

9.19 Analogy for Page Tables

• Suppose we want to build a caller-ID mechanism for the contacts on your cell phone
  – Assume 1000 contacts, each represented by a 3-digit integer (0-999) in the phone (this ID can be used to look up their names)
  – We want to use a simple array (Look-Up Table (LUT)) to translate phone numbers to contact IDs; how shall we organize/index our LUT?
• Option 1: LUT indexed by contact ID (entries hold the phone numbers)
  – O(n): doesn't work; we are given a phone # and would have to scan all entries to translate it to an ID (1000 accesses)
• Option 2: sorted LUT holding only the phone #s actually used
  – O(log n): could work; since it is in sorted order we could use a binary search (log2(1000) ≈ 10 accesses)
• Option 3: LUT indexed by all possible phone #s (unused entries = null)
  – O(1): could work; easy to index & find (1 access), but LARGE

9.20 Page Tables

• The VA is broken into:
  – VPN (upper bits) + page offset (sized by the page, i.e., 0 to 4K-1 for a 4KB page)
• The MMU uses the VPN & PTBR (Page Table Base Register) to access the page table in memory and look up the physical frame (like an array access where the VPN is the index: PTBR[VPN])
  – Each entry is referred to as a Page Table Entry (PTE) and holds the physical frame number & bookkeeping info
• The physical frame number is combined with the offset to form the physical address
• For a 20-bit VPN, how big is the page table?
  – 2^20 entries * (18-bit frame number + other flags) ≈ 2^20 * 4 bytes = 4 MB
[Figure: PTBR = 0xc0008000 points to the page table in main memory; VA 0x000022d8 → VPN 0x00002, offset 0x2d8; PTBR[2] holds frame 0x0021b → PA 0x0021b2d8]
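A conceptual C version of the lookup the MMU performs in HW; the PTE layout here (valid bit in bit 31, an 18-bit frame number in the low bits) is made up for illustration:

/* Single-level page table walk: PA = PTBR[VPN].frame || offset. */
#include <stdint.h>

#define PTE_VALID    (1u << 31)
#define PTE_FRAME(p) ((p) & 0x3FFFF)        /* 18-bit frame number */

uint32_t *ptbr;                             /* page table base register */

int translate(uint32_t va, uint32_t *pa) {
    uint32_t pte = ptbr[va >> 12];          /* index by VPN: PTBR[VPN] */
    if (!(pte & PTE_VALID))
        return -1;                          /* page fault              */
    *pa = (PTE_FRAME(pte) << 12) | (va & 0xFFF);
    return 0;
}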

9.21 Page Table Example

• Suppose a system with 8-bit VAs, 10-bit PAs, and 32-byte pages
  – VA = VPN (bits 7-5) + offset (bits 4-0); PA = PFN (bits 9-5) + offset (bits 4-0)
• Page table for P1 (owned by the OS), one (V, PFN) entry per VPN:
  – VPN 0: V=0, 0x1a | VPN 1: V=1, 0x02 | VPN 2: V=1, 0x18 | VPN 3: V=1, 0x00 | VPN 4: V=0, 0x10 | VPN 5: V=1, 0x1F | VPN 6: V=0, 0x15 | VPN 7: V=0, 0x0A
• E.g., VP 3 resides in frame 0x00 (PAs 0x000-0x01F), VP 1 in frame 0x02 (PAs 0x040-0x05F), and VP 5 in frame 0x1F (PAs 0x3E0-0x3FF)

9.22 Page Table Exercise

• Suppose a system with 8-bit VAs, 10-bit PAs, and 32-byte pages
• Page table (V, entry): VPN 0: (0, 0x0E) | 1: (1, 0x1E) | 2: (1, 0x16) | 3: (1, 0x06) | 4: (0, 0x0B) | 5: (1, 0x1F) | 6: (0, 0x15) | 7: (0, 0x0A)
• Fill in the table below showing the corresponding physical or virtual address based on the other; if no translation can be made, indicate "INVALID"
  – VA 0x2D = 001 01101 → VPN 1 → PFN 0x1E → PA 0x3CD
  – VA 0x7A = 011 11010 → VPN 3 → PFN 0x06 → PA 0x0DA
  – VA 0xEF = 111 01111 → VPN 7 (V=0) → INVALID
  – VA 0xA8 = 101 01000 → VPN 5 → PFN 0x1F → PA 0x3E8
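The exercise mechanics as a runnable C check (same page table, same four VAs):

/* 8-bit VA = 3-bit VPN + 5-bit offset; 10-bit PA = 5-bit PFN + offset. */
#include <stdio.h>

struct pte { int v; unsigned pfn; };
struct pte pt[8] = { {0,0x0E},{1,0x1E},{1,0x16},{1,0x06},
                     {0,0x0B},{1,0x1F},{0,0x15},{0,0x0A} };

int main(void) {
    unsigned vas[] = { 0x2D, 0x7A, 0xEF, 0xA8 };
    for (int i = 0; i < 4; i++) {
        unsigned vpn = vas[i] >> 5, off = vas[i] & 0x1F;
        if (pt[vpn].v)
            printf("VA 0x%02X -> PA 0x%03X\n", vas[i],
                   (pt[vpn].pfn << 5) | off);
        else
            printf("VA 0x%02X -> INVALID\n", vas[i]);
    }
    return 0;   /* prints 0x3CD, 0x0DA, INVALID, 0x3E8 */
}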

9.23 Paging

• Each process has its own virtual address space and thus needs its own page table
• On a context switch to a new process, reload the PTBR using info in the GDT
  – GDT = Global Descriptor Table (Intel-prescribed structure that holds info about each program)
  – CR3 = Control Register 3 (the x86 register that holds the base address of the current page table)
[Figure: Process 1's page table (PTBR/CR3 = 0xc4000; VPN 0 → frame 0x7e000 R/W, VPN 1 → 0x6e000 R/W, VPN 2 → 0x08000 R) and Process 2's page table (at 0xd0000; VPN 0 → 0x3d000 R/W, VPN 1 → 0xa1000 R/W, VPN 2 → 0xb4000 R) both live in OS ("kernel") memory; the MMU translates Process 1's VA 0x001040 (VPN 0x001 + offset 0x040) to PA 0x6e040 and VA 0x002eac (VPN 0x002 + offset 0xeac) to PA 0x08eac]
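In spirit, the PTBR reload looks like this on x86-64; this is a kernel-mode-only sketch, and "struct proc" with its field is hypothetical:

/* Writing CR3 points the MMU at the new process's page table (and,
 * absent ASIDs/PCIDs, flushes the TLB as a side effect). */
#include <stdint.h>

struct proc { uint64_t cr3; /* phys. base addr of top-level page table */ };

static void switch_address_space(struct proc *next) {
    __asm__ volatile("mov %0, %%cr3" : : "r"(next->cr3) : "memory");
}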

9.24 Page Table Entries (PTEs)

• A PTE usually fits within a 32-bit (4-byte) or 64-bit (8-byte) value:
  – Valid/Present bit (1 = desired page in memory / 0 = page not present → page fault)
  – Modified/Dirty bit
  – Referenced bit = used to implement pseudo-LRU replacement
  – Cacheable bit
  – Protection: Read/Write/eXecute
  – Page Frame Number
• For a 32-bit VA, 1 GB of physical memory, and 4KB pages, how many bits do we need for the frame number?
  – 1 GB → 30 physical address bits; 4KB → 12 offset bits
  – Thus we need 30-12 = 18 bits for the frame number
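One plausible way to pack these fields into a 32-bit PTE, sketched in C; the bit positions are invented for illustration, and real layouts vary by architecture:

/* 18-bit frame number in the high bits, flags in the low bits. */
#include <stdint.h>

#define PTE_P  (1u << 0)      /* present/valid                    */
#define PTE_W  (1u << 1)      /* writable                         */
#define PTE_X  (1u << 2)      /* executable                       */
#define PTE_R  (1u << 3)      /* referenced (for pseudo-LRU)      */
#define PTE_D  (1u << 4)      /* dirty/modified                   */
#define PTE_C  (1u << 5)      /* cacheable                        */
#define PTE_FRAME_SHIFT 14    /* 18-bit frame # in bits 31..14    */

static inline uint32_t pte_frame(uint32_t pte) {
    return pte >> PTE_FRAME_SHIFT;
}
static inline uint32_t make_pte(uint32_t frame, uint32_t flags) {
    return (frame << PTE_FRAME_SHIFT) | flags;
}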

9.25 Multi-level Page Table Concept

• Much of the VAS is often unused, which implies many of the page table entries would be unused
• Can we reduce the page table size and still do a lookup in constant time?
  – Do you have friends from every area code? Likely your contacts are clustered in only a few area codes
• Use a 2-level organization
  – The 1st-level LUT is indexed by area code and contains pointers to 2nd-level tables
  – The 2nd-level LUTs are indexed by local phone number and contain the contact ID entries
• The first level is often called the page directory, while the 2nd-level tables are called page tables
  – PDEs (Page Directory Entries) contain pointers to the 2nd-level page tables
[Example: 1st-level index = area code (page directory); 2nd-level index = local phone #; if only 2 area codes (213 and 323) are used, then only 1000 + 2*(10^7) entries are needed rather than 10^10]

9.26 Analogy for Page Tables

• Could extend to 3 levels if desired
  – 1st level: indices are area codes; values are pointers to 2nd-level tables
  – 2nd level: indices are the first 3 digits of the local number; values are pointers to 3rd-level tables
  – 3rd level: indices are the last 4 digits of the local number; values are contact IDs (i.e., translations)
[Example: 213-740-9823 → 1st-level entry "213" → 213 table, entry "740" → 213-740 table, entry "9823" → contact ID 003; similarly 323-821-7104 → contact ID 248; all other entries are null]

9.27 Multi-level Page Tables

• Think of a multi-level page table as a tree
  – Internal nodes contain pointers to other page tables
  – Leaves hold the actual translations
• Unused entries in one level mean no table is needed at the next (saving space)
[Figure: the VPN is split into Idx1 (6 bits), Idx2 (7 bits), Idx3 (7 bits), and a 12-bit offset, e.g., 0x3f / 0x40 / 0x35 / 0x040; PDBR/CR3 = 0xd0000 points to the page directory (level 1); PD[0x3f] → start address of a level-2 table; PT2[0x40] → start address of a level-3 table; PT3[0x35] → physical frame address; translations live in the last level]
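A C sketch of the 3-level walk in the figure above; the table layouts are invented for illustration (real entries also carry valid/permission bits, and HW performs this walk, not software):

/* 6-bit level-1 index + 7-bit level-2 + 7-bit level-3 + 12-bit offset. */
#include <stdint.h>

typedef struct { uint32_t frame[128]; } pt3_t;   /* leaves: translations */
typedef struct { pt3_t  *next[128];  } pt2_t;
typedef struct { pt2_t  *next[64];   } pt1_t;

pt1_t *pd;   /* page directory (level 1), found via PDBR/CR3 */

int walk(uint32_t va, uint32_t *pa) {
    uint32_t i1 = (va >> 26) & 0x3F;
    uint32_t i2 = (va >> 19) & 0x7F;
    uint32_t i3 = (va >> 12) & 0x7F;
    pt2_t *t2 = pd->next[i1];
    if (!t2) return -1;              /* no level-2 table: unmapped */
    pt3_t *t3 = t2->next[i2];
    if (!t3) return -1;              /* no level-3 table: unmapped */
    *pa = (t3->frame[i3] << 12) | (va & 0xFFF);
    return 0;
}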

9.28 SPARC Processor VM Implementation

• Virtual address fields: Index 1 (8 bits) | Index 2 (6 bits) | Index 3 (6 bits) | offset within a 4K page (bits 11-0)
• The MMU holds a 4096-entry context table (one entry per context/process) [essentially a PTBR for each process] that points to that process's first-level table
• Table sizes: first level = 2^8 * 4 bytes; second and third levels = 2^6 * 4 bytes each; the third-level PPFN plus the offset selects the desired word within the 4K page (bytes 0-4095)
• How many accesses to memory does it take to get the desired word that corresponds to a given virtual address? Would that change for a 1- or 2-level table?

9.29 Analogy for Page Tables

• If we add a friend from area code 408, we would have to add a second- and third-level table for just this one entry
• If we had 1 friend from every area code and every 3-digit local prefix, would this scheme save us any storage? No!
[Figure: the same 3-level area-code / local-prefix / last-4-digits tables as in 9.26]

9.30 Page Faults

• When HW encounters a PTE whose page is not in physical memory, it will generate a page fault exception, and the OS will take over and retrieve the page before resuming the program
[Figure: a 2-level page table (10-bit Level 1 index in bits 31-22, 10-bit Level 2 index in bits 21-12, 12-bit offset within page; 1024 entries per table); a 1st-level entry holds a pointer to the start of a 2nd-level table; several pages are marked "uncached" (resident only in secondary storage), so referencing them faults]

9.31 Page Fault Steps

• What happens when you reference a page that is not present?
• HW will…
  – Record the offending address and generate a page fault exception
• SW (the OS) will…
  – Pick an empty frame or select a page to evict
  – Write back the evicted page if it has been modified
    • May block the process while waiting and yield the processor
  – Bring in the desired page and update the page table
    • May block the process while waiting and yield the processor
  – Restart the offending instruction
• Key idea: the handler can bring in the page or do anything appropriate to handle the page fault
  – Allocate a new page, zero it out, retrieve it from secondary storage, etc.
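The SW steps above as a C skeleton; every routine called here is a hypothetical OS helper named for this sketch, not a real API:

#include <stdint.h>

typedef uint64_t vaddr_t;
typedef int frame_t;
typedef int page_t;
#define NO_FRAME (-1)

/* Hypothetical OS helpers (declarations only): */
frame_t find_free_frame(void);
page_t  choose_victim(void);
int     is_dirty(page_t p);
void    write_back(page_t p);                 /* may block & yield  */
void    invalidate_mapping(page_t p);         /* update PT (and TLB) */
frame_t frame_of(page_t p);
void    load_page_from_storage(vaddr_t va, frame_t f);  /* may block */
void    map_page(vaddr_t va, frame_t f);      /* update page table  */

void page_fault_handler(vaddr_t fault_va) {
    frame_t f = find_free_frame();            /* 1. find a frame     */
    if (f == NO_FRAME) {
        page_t victim = choose_victim();      /* 2. pick a victim    */
        if (is_dirty(victim))
            write_back(victim);               /* 3. write back if modified */
        invalidate_mapping(victim);
        f = frame_of(victim);
    }
    load_page_from_storage(fault_va, f);      /* 4. bring in the page */
    map_page(fault_va, f);
    /* 5. return from the exception; HW restarts the faulting instr. */
}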

9.32 Page Replacement Policies

• Possible algorithms: LRU, FIFO, Random
• Since page misses are so costly (slow), we can afford to spend some time keeping statistics to implement pseudo-LRU
• HW implements a simple mechanism that allows SW to implement a pseudo-LRU algorithm (see the sketch below)
  – HW will set the "Referenced" bit when a page is used
  – At certain intervals, SW will use these reference bits to keep statistics on which pages have been used in that interval and then clear the reference bits
  – On replacement, these statistics can be used to find the pseudo-LRU page
• Other simpler replacement algorithms (e.g., variants of the clock algorithm) may also be used
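A sketch of one such SW scheme, the classic "aging" approximation of LRU; NPAGES and the two HW-bit accessors are stand-ins for this example:

/* Each tick shifts every page's history right and ORs the HW
 * Referenced bit into the MSB; the page with the smallest counter
 * is the pseudo-LRU victim. */
#include <stdint.h>

#define NPAGES 1024
static uint8_t age[NPAGES];          /* per-page usage history      */

int  referenced_bit(int page);       /* read the HW-set R bit (stub) */
void clear_referenced(int page);     /* clear it for the next epoch  */

void aging_tick(void) {              /* run at a fixed interval      */
    for (int p = 0; p < NPAGES; p++) {
        age[p] = (age[p] >> 1) | (referenced_bit(p) ? 0x80 : 0);
        clear_referenced(p);
    }
}

int choose_victim(void) {            /* pseudo-LRU = smallest age    */
    int victim = 0;
    for (int p = 1; p < NPAAGES; p++)
        if (age[p] < age[victim]) victim = p;
    return victim;
}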

9.33 Cache & VM Comparison

                            Cache                           Virtual Memory
Block size                  16-64 B                         4 KB - 64 MB
Mapping scheme              Direct or set associative       Fully associative
Miss handling/replacement   HW                              SW
Replacement policy          Full LRU if low associativity   Pseudo-LRU can be
                            (random is also used)           implemented

9.34 Inverted Page Tables

• Page tables may seem expensive in terms of memory overhead
  – Though they really aren't that big
• One option to consider is an "inverted" page table
  – One entry per physical frame
  – Hash the virtual address, and whatever results is where that page must reside
• What about collisions?
  – They become hard to maintain in hardware, but inverted tables can be used by secondary software structures
[Example: a hash function maps a phone number (e.g., 626-454-9985) directly to its slot (contact ID 001) in the LUT indexed by contact ID]

9.35 Achieving faster translations… TLB (TRANSLATION LOOKASIDE BUFFERS)

9.36 Page Table Performance

• How many accesses to memory does it take to get the desired word corresponding to a given virtual address?
  – With the 2-level table below: one access to the page directory, one to the 2nd-level page table, and then the data access itself
• So for each needed memory access, we need 2 additional ones? That sounds BAD!
• Would that change for a 1- or 3-level table?
• An M-level page table may require M memory accesses just to find the translation… EXPENSIVE!!
[Figure: the PDR points to the first level (page directory, 2^10 entries * 4 bytes); a PDE points to a second-level table (2^10 entries * 4 bytes), whose PPFN plus the offset selects the desired word in a 4K page]

9.37 Translation Lookaside Buffer (TLB)

• Solution: let's create a cache for translations = Translation Lookaside Buffer (TLB)
• It needs to be small (64-128 entries) so it can be fast, with a high degree of associativity (at least 4-way, and many times fully associative) to avoid conflicts
  – On a hit, the PPFN is produced and concatenated with the offset
  – On a miss, a page table walk is needed
[Figure: the CPU sends the VA (VPN + page offset) to the Translation Unit / MMU; a TLB lookup (~10 ns) either hits, forming the PA for the I- or D-cache, or misses, requiring a walk of the in-memory page table]

9.38 Translation Lookaside Buffer (TLB)

• T(translation) = T(TLB lookup) + (1 - P(TLB hit)) * T(page table walk)
• What is P(TLB hit)?
  – Suppose a 4KB page size and that we are walking an array of integers in sequential order
  – What fraction of accesses will miss in the TLB?
  – 4KB / 4 bytes per int = 1K integer addresses per page = 1 TLB miss every 1K accesses (P(TLB hit) ≈ 1023/1024)
[Figure: a fully associative TLB (an entry can be anywhere, so all locations in the TLB must be checked for a hit); e.g., VA 0x7ffe16d8 → VPN 0x7ffe1 matches an entry with page frame # 0x308ac → PA 0x308ac6d8]

9.39 A 4-Way Set Associative TLB

• 64-entry, 4-way SA TLB (the set field indexes into each "way")
  – On a hit, the page frame # is supplied quickly w/o a page table access
[Figure: VA 0x7ffe16d8 → 20-bit VPN 0x7ffe1 = 16-bit tag 0x7ffe + 4-bit set 0x1; the tag is compared against all 4 ways of the selected set in parallel; the matching way supplies page frame # 0x308ac → PA 0x308ac6d8]
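A C sketch of this lookup; HW compares the four ways in parallel, which a loop can only approximate:

/* 64 entries, 4 ways => 16 sets: low 4 VPN bits select the set,
 * the remaining 16 VPN bits are the tag. */
#include <stdint.h>

struct tlbe { uint8_t valid; uint16_t tag; uint32_t pfn; };
struct tlbe tlb[16][4];                   /* [set][way] */

int tlb_lookup(uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> 12;
    uint32_t set = vpn & 0xF;             /* low 4 VPN bits   */
    uint16_t tag = vpn >> 4;              /* high 16 VPN bits */
    for (int w = 0; w < 4; w++)
        if (tlb[set][w].valid && tlb[set][w].tag == tag) {
            *pa = (tlb[set][w].pfn << 12) | (va & 0xFFF);
            return 1;                     /* hit */
        }
    return 0;                             /* miss: walk the page table */
}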

9.40 TLB + Data Cache

• If the data cache is tagged with physical addresses, then we must translate the VA before we can access the data cache
[Figure: the VA (20-bit VPN + 12-bit page offset) is translated by the TLB (fully-, direct-, or set-associative) into a physical address; the PA is then split into tag (16 bits), index (14 bits), and byte offset to access the physically tagged data cache, whose tag comparators produce the desired data on a hit]

9.41 Differences of TLB & Data Cache

• Data cache
  – 1 tag (identifying the block) corresponds to MANY bytes (16-64 bytes); an offset is needed since one tag corresponds to many values
• TLB
  – 1 tag (VPN) corresponds to 1 translation (PTE/PPFN), which is good for 4KB; no offset is needed
• Main point: TLBs are smaller than normal data caches and faster to access

9.42 TLB Exercise

• This TLB is 2-way set associative, with 4 sets
• Page size is 256 bytes, with 16-bit VAs and PAs (8-bit offset; the 8-bit VPN = 6-bit tag + 2-bit set index)

TLB contents (set: way 0, way 1):
Set 0: (V=0, tag 0x13, PPFN 0x30), (V=1, tag 0x34, PPFN 0x58)
Set 1: (V=0, tag 0x1F, PPFN 0x80), (V=1, tag 0x2A, PPFN 0x30)
Set 2: (V=1, tag 0x1F, PPFN 0x95), (V=1, tag 0x20, PPFN 0xAA)
Set 3: (V=1, tag 0x3F, PPFN 0x20), (V=0, tag 0x3E, PPFN 0xFF)

• What is the physical address of virtual address 0x7E85?
  – A: 0x9585 (VPN 0x7E → set 2, tag 0x1F → valid match with PPFN 0x95; PA = 0x95 << 8 | 0x85)
• What is the virtual address of physical address 0x3020?
  – A: 0xA920 (PPFN 0x30 appears valid in set 1, way 1 with tag 0x2A; VPN = (0x2A << 2) | 1 = 0xA9)

9.43 Page Fault Steps

• On a page fault, the handler will access the disk to evict an old page (if dirty) and to bring in the desired page
  – A context switch is likely on each disk access, since the disk is slow
• Make sure the page table & TLB are updated appropriately
[Figure steps: (1) the access misses in the TLB; (2) the page table walk finds the PTE invalid / not present, raising a page fault exception to the OS handler; (3) the handler evicts (writes back) a page if no frame is free, updating the PT & TLB; (4) it then brings in the needed page via the disk driver (interrupt-driven) and updates the PT; (5) the faulting instruction is restarted; (6) the TLB miss / PT walk now succeeds, the TLB is updated, and the access proceeds to the cache]

9.44 Page Eviction Bookkeeping

• When we want to remove a page from memory:
  – Data/instruction cache:
    • Write back any modified blocks belonging to that page
    • Invalidate (set V=0) all blocks belonging to that page
  – TLB (check if a translation for that page is even in the TLB); if so…
    • If the Modified/Dirty bit is set for that translation, set the modified bit in the page table
    • Invalidate (V=0) the translation
  – Write the page back to disk if the modified/dirty bit in its page table entry is set
  – Update the page table entry to indicate the page is no longer present in memory
  – A simple way to remember this: children (cache & TLB entries related to a page) must leave when the parent (the actual page) leaves
• Bring in the new page and update the page directory/page table

9.45 Cache, VM, and Main Memory

TLB    VM     Cache   Possible? Y/N & Description
Hit    Hit    Hit     Possible: best-case scenario
Hit    Hit    Miss    Possible: TLB hits (hit in VM is implied), then cache miss
Miss   Hit    Hit     Possible: TLB misses, then hits in page table, then cache hit
Miss   Hit    Miss    Possible: TLB misses, then hits in page table, then cache miss
Miss   Miss   Miss    Possible: TLB misses, then page fault, then miss in cache
Hit    Miss   Miss    Impossible: cannot hit in TLB if page is not present
Hit    Miss   Hit     Impossible: cannot hit in TLB if page is not present
Miss   Miss   Hit     Impossible: data cannot be in cache if page is not present

Taken from H & P, "Computer Organization", 3rd Ed.

9.46 x86 HW Cache/VM Support

• Cache and TLB configuration of the Intel Core i7 memory system:
  – Per core (x4): registers; L1 I$ (32KB, 8-way); L1 D$ (32KB, 8-way); L2 unified $ (256KB, 8-way)
  – Shared L3 cache (8MB, 16-way)
  – MMU: L1 dTLB (128-entry, 4-way); L1 iTLB (64-entry, 4-way); L2 unified TLB (512-entry, 4-way); CR3/PDBR
  – DDR3 memory controller to main memory; QuickPath Interconnect to the I/O bridge

9.47 Core i7 Page Table & Entries Format

• Specs: 48-bit VA, 52-bit PA, 4KB pages, 4-level page table
• The VA is split into: Level 1 (9 bits) | Level 2 (9 bits) | Level 3 (9 bits) | Level 4 (9 bits) | page offset (12 bits)
• CR3 points to the L1 PT (Page Global Directory); an L1 PTE selects an L2 PT (Page Upper Directory); an L2 PTE selects an L3 PT (Page Middle Directory); an L3 PTE selects an L4 PT (Page Table), whose PTEs hold the translations
  – Address range mapped per PTE: 512 GB (L1), 1 GB (L2), 2 MB (L3), 4 KB (L4)
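A small C sketch of the index extraction for the 48-bit VA split above (the example address is arbitrary):

/* Four 9-bit indices (Level 1 highest, bits 47-39) plus 12 offset bits. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t va  = 0x00007f5a3c2d1e98ULL;    /* arbitrary example VA */
    unsigned off = va & 0xFFF;
    unsigned l4  = (va >> 12) & 0x1FF;       /* Page Table index        */
    unsigned l3  = (va >> 21) & 0x1FF;       /* Page Middle Dir. index  */
    unsigned l2  = (va >> 30) & 0x1FF;       /* Page Upper Dir. index   */
    unsigned l1  = (va >> 39) & 0x1FF;       /* Page Global Dir. index  */
    printf("L1=%u L2=%u L3=%u L4=%u offset=0x%03x\n", l1, l2, l3, l4, off);
    return 0;
}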

• Resulting physical address: physical page number (40 bits) + page offset (12 bits)
• L4 PTE format (bits 63 → 0): XD (63) | unused (62-52) | physical page number (51-12) | unused (11-9) | G (8) | unused (7) | D (6) | A (5) | CD (4) | WT (3) | U/S (2) | R/W (1) | P=1 (0)
  – XD (eXecute Disable), G (Global Page), D (Dirty Bit), A (Referenced Bit), CD (Caching Disabled), WT (Write-Through/Write-Back), U/S (User or Supervisor (Kernel) Mode Permission), R/W (Read-only or Read-Write), P (Present/Valid)
  – If P=0, all 63 other bits may be used as the OS desires to store information about the page (e.g., its disk location)

9.48 Multiple Processes

• On a process context switch, can the TLB keep its entries? Can the TLB share mappings from multiple processes? Not without special care!!
• Recall each process has its own virtual address space, page table, and translations
  – Virtual addresses are not unique between processes
• How does the TLB handle a context switch?
  – It can choose to hold translations only for the current process, and thus invalidate all entries on a context switch
  – It can hold translations for multiple processes concurrently by concatenating a process or address space ID (PID or ASID) with the VPN tag
[Figure: VA = ASID | VPN | offset; each TLB entry's tag includes the ASID (a unique ID for each process), and both the ASID and the VPN must match for a hit]

9.49 Shared Memory

• In the current system, all memory is private to each process
• To share memory between two processes, the OS can allocate an entry in each process's page table that points to the same physical page
• Different protection bits can be used in each page table entry (e.g., P1's mapping can be R/W while P2's is read-only)
[Figure: P1 maps VA 0x040000 and P2 maps VA 0x2c8000 to the same shared physical page]
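From user space, this sharing is exactly what POSIX shared memory asks the OS to set up: both processes that map the object get page table entries pointing at the same physical page(s). A minimal sketch (the name "/demo_page" is arbitrary, error checking is omitted, and older systems need -lrt when linking); a second process mapping with PROT_READ only would get the read-only view described above:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = shm_open("/demo_page", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);                       /* one page */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    strcpy(p, "hello from P1");                /* visible to P2 */
    munmap(p, 4096);
    close(fd);
    return 0;
}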

9.50 IF TIME PERMITS: Overlapping TLB access with Data/Instruction Cache access

9.51 Cache Addressing with VM

• Review of cache
  – Stores copies of data, indexed based on the (main-memory) address they came from
  – Simplified view: 2 steps to determine a hit
    • Index: hash a portion of the address to find the "set" to look in
    • Tag match: compare the remaining address bits to all entries in the set to determine a hit
  – There is a sequential connection between these two steps (index, then tag match)
• Rather than waiting for the full address translation and then performing this two-step hit process, can we overlap the translation with portions of the hit sequence?
  – Yes, if we choose the page size, block size, and set/direct mapping carefully
[Figure: a small cache holding (addr, data) pairs; the address is divided into Tag | Index/Hash | Offset fields]

9.52 Virtual vs. Physical Addressed Cache

• Physically indexed, physically tagged (PIPT)
  – Wait for the full address translation, then use the physical address for both indexing and tag comparison
• Virtually indexed, physically tagged (VIPT)
  – Use a portion of the virtual address for indexing, then wait for the address translation and use the physical address for the tag comparisons
  – Easiest when the index portion of the virtual address lies within the offset (page-size) address bits; otherwise aliasing may occur
• Virtually indexed, virtually tagged (VIVT)
  – Use the virtual address for both indexing and tagging… no TLB access unless there is a cache miss
  – Requires invalidation of cache lines on a context switch, or use of a process ID as part of the tags
[Figure: for each scheme, the VA = VPN | offset and PA = PFN | offset fields show where the tag and set/block index come from: PIPT takes both from the PA, VIPT takes the set/block index from the VA and the tag from the PA, and VIVT takes both from the VA]

9.53 Virtual vs. Physical Addressed Cache

• Another view:
[Figure: a virtually addressed cache is accessed with the VA, before translation; a physically addressed cache is accessed with the PA, after translation]
• In a modern system, the L1 caches may be virtually addressed while L2 may be physically addressed