
Address Translation with Paging
Case studies for X86, SPARC, and PowerPC

Overview

• Page tables
  – What are they? (review)
  – What does a page table entry (PTE) contain?
  – How are page tables organized?
• Making page table access fast
  – Caching entries
  – Translation lookaside buffer (TLB)
  – TLB management

Generic Page Table

• Memory divided into pages
• Page table is a collection of PTEs that maps a virtual page number to a PTE
• Organization and content vary with architecture
• If no virtual-to-physical mapping exists => page fault

[Diagram: a virtual page # is looked up in the page table, yielding a PTE or a page fault]

Some acronyms used in this lecture:
• PTE = page table entry
• PDE = page directory entry
• VA = virtual address
• PA = physical address
• VPN = virtual page number
• {R,P}PN = {real, physical} page number

Generic PTE

• PTE maps virtual page to physical page
• Includes some page properties
  – Valid?, writable?, dirty?, cacheable?

[Diagram: Virtual Page # → Physical Page # plus property bits]
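As a concrete illustration, a generic PTE could be modeled in C as one word holding a physical page number plus property bits. This is a hypothetical layout for the example only; field widths and ordering differ on every real architecture:

    #include <stdint.h>

    /* Illustrative generic PTE: a 32-bit word holding a physical
       page number plus property bits (made-up field widths). */
    typedef struct {
        uint32_t valid     : 1;   /* mapping present? */
        uint32_t writable  : 1;   /* write access allowed? */
        uint32_t dirty     : 1;   /* page has been modified? */
        uint32_t cacheable : 1;   /* may the page be cached? */
        uint32_t ppn       : 20;  /* physical page number */
        uint32_t unused    : 8;   /* available to the OS */
    } pte_t;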

Real Page Tables

• Design requirements
  – Minimize memory use (page tables are pure overhead)
  – Fast (logically accessed on every memory reference)
• Requirements lead to
  – Compact data structures
  – O(1) access (e.g. indexed lookup, hashtable)
• Examples: X86 and PowerPC

X86-32 Address Translation

• Page tables organized as a two-level tree
• Efficient because address space usage is sparse
• Each level of the tree indexed using a piece of the virtual page number for fast lookups
• One set of page tables per process
• Current set of page tables pointed to by CR3
• CPU walks the page tables to find translations
• Accessed and dirty bits updated by CPU
• 4K or 4M (sometimes 2M) pages

X86-32 PDE and PTE Details

Page Directory Entry (PDE)
• 20-bit page number of a page table plus 12 property bits:
  Present, Read/Write, User/Supervisor, Write-through, Cache Disabled,
  Accessed, Reserved, 4K or 4M Page, Global, and 3 bits Available to the OS

Page Table Entry (PTE)
• 20-bit page number of a physical memory page plus 12 property bits:
  Present, Read/Write, User/Supervisor, Write-through, Cache Disabled,
  Accessed, Dirty, PAT, Global, and 3 bits Available to the OS

• Where is the virtual page number?
• If a page is not present, all but bit 0 are available for the OS

IA-32 Architecture Software Developer’s Manual, Volume 3, pg. 3-24

X86-32 Page Table Lookup

32-bit virtual address: 10-bit page dir index | 10-bit page tbl index | 12-bit offset

• Top 10 address bits index the page directory and return a page directory entry that points to a page table
• Middle 10 bits index the page table and return a page table entry that points to a physical memory page
• Bottom 12 bits are an offset to a single byte within the physical page
• Checks are made at each step to ensure the desired page is present in memory and that the process making the request has sufficient rights to access it

[Diagram: CR3 → page directory (1024 entries) → page tables (1024 entries each) → physical memory]
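The two-level lookup can be sketched as a simplified software model of the hardware walk. This is only a sketch: the helper name is invented, only 4K pages are handled, no permission checks are made, and physical memory is assumed to be directly addressable:

    #include <stdint.h>
    #include <stdbool.h>

    #define PTE_PRESENT 0x1u

    /* Model of the X86-32 two-level page table walk (4K pages). */
    bool translate(uint32_t *page_directory, uint32_t vaddr,
                   uint32_t *paddr_out)
    {
        uint32_t pd_index = (vaddr >> 22) & 0x3FF;  /* top 10 bits */
        uint32_t pt_index = (vaddr >> 12) & 0x3FF;  /* middle 10 bits */
        uint32_t offset   = vaddr & 0xFFF;          /* bottom 12 bits */

        uint32_t pde = page_directory[pd_index];
        if (!(pde & PTE_PRESENT))
            return false;                           /* page fault */

        /* Top 20 bits of the PDE locate the page holding the page
           table; pretend we can dereference physical addresses. */
        uint32_t *page_table = (uint32_t *)(uintptr_t)(pde & 0xFFFFF000u);
        uint32_t pte = page_table[pt_index];
        if (!(pte & PTE_PRESENT))
            return false;                           /* page fault */

        *paddr_out = (pte & 0xFFFFF000u) | offset;
        return true;
    }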

X86-32 and PAE

• Intel added support for up to 64GB of physical memory in the Pentium Pro – called Physical Address Extensions (PAE)
• Introduced a new CPU mode and another layer in the page tables
• In PAE mode, 32-bit VAs map to 36-bit PAs
• Single-process address space is still 32 bits
• 4-entry page-directory-pointer-table (PDPT) points to a page directory, and then translation proceeds as normal
• Page directory and page table entries expanded to 64 bits to hold 36-bit physical addresses
• Only 512 entries per 4K page
• 4K or 2M page sizes

What about 64-bit X86?

• X86-64 (AMD64 or EM64T) supports a 64-bit virtual address (only 48 bits effective)
• Three modes
  – Legacy 32-bit (32-bit VA, 32-bit PA)
  – Legacy PAE (32-bit VA, up to 52-bit PA)
  – Long PAE mode (64-bit VA, 52-bit PA)
• Long mode requires four levels of page tables to map a 48-bit VA to a 52-bit PA

AMD64 Architecture Programmer’s Manual Volume 2: System Programming, Ch. 5
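To make the four-level split concrete, here is a small sketch of how a 48-bit long-mode virtual address decomposes into four 9-bit table indices plus a 12-bit page offset (standard 4K-page layout; the variable names are my own):

    #include <stdint.h>
    #include <stdio.h>

    /* 9-bit indices follow from 512 (2^9) 8-byte entries fitting
       in each 4K table page; 9+9+9+9+12 = 48 effective bits. */
    int main(void)
    {
        uint64_t va   = 0x00007f00deadbeef;  /* arbitrary canonical VA */
        unsigned pml4 = (va >> 39) & 0x1FF;  /* level 4 table index */
        unsigned pdpt = (va >> 30) & 0x1FF;  /* level 3 table index */
        unsigned pd   = (va >> 21) & 0x1FF;  /* level 2 table index */
        unsigned pt   = (va >> 12) & 0x1FF;  /* level 1 table index */
        unsigned off  =  va        & 0xFFF;  /* byte offset in page */
        printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
               pml4, pdpt, pd, pt, off);
        return 0;
    }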

PowerPC Address Translation

• 80-bit virtual address obtained via the PowerPC segmentation mechanism
• 62-bit physical (“real”) address
• PTEs organized in a hash table (HTAB)
• Each HTAB entry is a page table entry group (PTEG)
• Each PTEG has (8) 16-byte PTEs
• Hash function on the VPN gives the index of two PTEGs (primary and secondary PTEGs)
• Resulting 16 PTEs searched for a VPN match
• No match => page fault

PowerPC Segmentation

64-bit “effective” address generated by a program: 36-bit ESID | 28 address bits

• Top 36 bits of a program-generated “effective” address used as a tag called the effective segment id (ESID)
• SLB is an “associative memory” – search for the tag value in the Segment Lookaside Buffer (SLB)
• If a match exists, property bits (U/S, X, …) are validated for access
• A failed match causes a segment fault
• The matching entry’s associated 52-bit virtual segment id (VSID) is concatenated with the remaining 28 address bits to form an 80-bit “virtual” address used for the page table lookup
• Segmentation used to separate processes within the large virtual address space

PowerPC Page Table Lookup

• Variable size hash table
• A register points to the hash table base and gives the table’s size
• Architecture-defined hash function on the virtual address returns two possible hash table entries (a primary and a secondary hash index)
• Each of the 16 possible PTEs is checked for a VA match
• If no match, then page fault
• Possibility that a translation exists but can’t fit in the hash table – the OS must handle this

[Diagram: 80-bit virtual address → hash function → primary and secondary hash indexes → primary and secondary PTEGs in the hash table (HTAB); a match selects a 16-byte PTE, no match means a page fault]

PowerPC PTE Details

16-byte PTE; both the VPN and the RPN are stored:
• Doubleword 0: Abbreviated Virtual Page Number (bits 0-56), SW (57-60), H (62), V (63)
• Doubleword 1: Real Page Number (bits 2-51), AC (54), R (55), C (56), WIMG (57-60), N (61), PP (62-63)

Key:
SW = Available for OS use
H = Hash function ID
V = Valid bit
AC = Address compare bit
R = Referenced bit
C = Changed bit
WIMG = Storage control bits
N = No execute bit
PP = Page protection bits

• Why only a 57-bit VPN?

PowerPC Operating Environment Architecture, Book III, Version 2.01, Sections 4.3-4.5
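The primary/secondary PTEG search pattern can be sketched as follows. A toy hash stands in for the architecture-defined function, and the type and helper names are invented; the secondary index being the complement of the primary mirrors the PowerPC scheme:

    #include <stdint.h>
    #include <stddef.h>

    #define PTES_PER_PTEG 8

    typedef struct { uint64_t vpn_tag; uint64_t rpn_bits; } ppc_pte_t;

    /* Search the primary PTEG, then the secondary PTEG, for a VPN
       match. NULL means no match: a page fault, or a translation
       the OS evicted from the hash table and must recover itself. */
    ppc_pte_t *htab_lookup(ppc_pte_t (*htab)[PTES_PER_PTEG],
                           uint64_t htab_mask, uint64_t vpn)
    {
        uint64_t primary   = (vpn ^ (vpn >> 23)) & htab_mask; /* toy hash */
        uint64_t secondary = ~primary & htab_mask;
        uint64_t groups[2] = { primary, secondary };

        for (int g = 0; g < 2; g++)
            for (int i = 0; i < PTES_PER_PTEG; i++)
                if (htab[groups[g]][i].vpn_tag == vpn)
                    return &htab[groups[g]][i];
        return NULL;
    }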

Making Translation Fast

• Page table logically accessed on every instruction
• Paging has turned each memory reference into at least three memory references
• Page table access has temporal locality
• Use a cache to speed up access
• Translation Lookaside Buffer (TLB)

Generic TLB

• Cache of recently used PTEs
• Small – usually about 64 entries
• Huge impact on performance
• Various organizations, search strategies, and levels of OS involvement possible
• Consider X86 and SPARC

[Diagram: Virtual Address → TLB → Physical Address, TLB miss, or access fault]

TLB Organization

TLB entry: Tag (virtual page number) | Value (page table entry)

[Diagram: four ways to organize a 16-entry TLB – direct mapped (16 sets of 1 entry), two-way set associative (8 sets of 2), four-way set associative (4 sets of 4), and fully associative (1 set of 16)]

Lookup
• Calculate the index (index = tag % num_sets)
• Search for the tag within the resulting set
• Why not use the upper bits of the tag value for the index?

Associativity Trade-offs

• Higher associativity
  – Better utilization, fewer collisions
  – Slower
  – More hardware
• Lower associativity
  – Fast
  – Simple, less hardware
  – Greater chance of collisions
• How does page size affect TLB performance?
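A minimal sketch of the set-associative lookup described above, assuming invented types and a 16-entry, two-way TLB:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS 8   /* 16 entries / 2 ways */
    #define NUM_WAYS 2

    typedef struct {
        bool     valid;
        uint32_t tag;    /* virtual page number */
        uint32_t value;  /* cached PTE */
    } tlb_entry_t;

    tlb_entry_t tlb[NUM_SETS][NUM_WAYS];

    /* index = tag % num_sets, then search every way in that set */
    bool tlb_lookup(uint32_t vpn, uint32_t *pte_out)
    {
        tlb_entry_t *set = tlb[vpn % NUM_SETS];
        for (int way = 0; way < NUM_WAYS; way++) {
            if (set[way].valid && set[way].tag == vpn) {
                *pte_out = set[way].value;
                return true;   /* TLB hit */
            }
        }
        return false;          /* TLB miss */
    }

On the slide’s closing question: indexing with the low VPN bits spreads consecutive pages across sets, while the upper bits change rarely and would pile neighboring pages into the same set.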

X86 TLB

• TLB management shared by processor and OS
• CPU fills the TLB on demand from the page table (the OS is unaware of TLB misses)
• CPU evicts entries when a new entry must be added and no free slots exist
• OS ensures TLB/page table consistency by flushing entries as needed when the page tables are updated or switched (e.g. during a context switch)
• TLB entries can be removed by the OS one at a time using the INVLPG instruction, or the entire TLB can be flushed at once by writing a new entry into CR3

Example: Pentium-M TLBs

• Four different TLBs
  – Instruction TLB for 4K pages
    • 128 entries, 4-way set associative
  – Instruction TLB for large pages
    • 2 entries, fully associative
  – Data TLB for 4K pages
    • 128 entries, 4-way set associative
  – Data TLB for large pages
    • 8 entries, 4-way set associative
• All TLBs use an LRU replacement policy
• Why different TLBs for instruction, data, and page sizes?
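As an illustration of the two flush mechanisms, kernel code typically wraps them in helpers along these lines (GCC inline assembly; a sketch under that assumption, not taken from any particular kernel):

    /* Flush the TLB entry for a single virtual address. */
    static inline void invlpg(void *vaddr)
    {
        __asm__ volatile("invlpg (%0)" : : "r"(vaddr) : "memory");
    }

    /* Flush the entire (non-global) TLB by rewriting CR3. */
    static inline void flush_tlb_all(void)
    {
        unsigned long cr3;
        __asm__ volatile("mov %%cr3, %0" : "=r"(cr3));
        __asm__ volatile("mov %0, %%cr3" : : "r"(cr3) : "memory");
    }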

SPARC TLB

• SPARC is RISC (simpler is better)
• Example of a “software-managed” TLB
• A TLB miss causes a fault, handled by the OS
• OS explicitly adds entries to the TLB
• OS is free to organize its page tables in any way it wants because the CPU does not use them
• E.g. Linux uses a tree like X86, Solaris uses a hash table

Minimizing Flushes

• On SPARC, TLB misses trap to the OS (SLOW)
• We want to avoid TLB misses
• Retain TLB contents across context switches
• SPARC TLB entries enhanced with a context id
• Context id allows entries with the same VPN to coexist in the TLB (e.g. entries from different process address spaces)
• When a process is switched back onto a processor, chances are that some of its TLB state has been retained from the last time it ran
• Some TLB entries shared (OS kernel memory)
  – Mark as global
  – Context id ignored during matching
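The match rule a context-tagged entry implies can be sketched as follows (invented types and names; global entries skip the context comparison):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        bool     valid;
        bool     global;   /* shared kernel mapping */
        uint16_t ctx;      /* context id (address space tag) */
        uint64_t vpn;
        uint64_t pte;
    } stlb_entry_t;

    /* An entry hits when the VPN matches and either the entry is
       global or its context id matches the running process’s. */
    bool entry_matches(const stlb_entry_t *e, uint64_t vpn, uint16_t cur_ctx)
    {
        return e->valid && e->vpn == vpn &&
               (e->global || e->ctx == cur_ctx);
    }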

Example: UltraSPARC III TLBs

• Five different TLBs
• Instruction TLBs
  – 16 entries, fully associative (supports all page sizes)
  – 128 entries, 2-way set associative (8K pages only)
• Data TLBs
  – 16 entries, fully associative (supports all page sizes)
  – 2 x 512 entries, 2-way set associative (each supports one page size per process)
• Valid page sizes – 8K (default), 64K, 512K, and 4M
• 13-bit context id – 8192 different concurrent address spaces
• What happens if you have > 8192 processes?

Speeding Up TLB Miss Handling

• In some cases a huge amount of time can be spent handling TLB misses (2-50% in one study of SuperSPARC and SunOS)
• Many architectures that use software-managed TLBs have hardware-assisted TLB miss handling
• SPARC uses a large, virtually-indexed, direct-mapped, physically contiguous table of recently used TLB entries called the Translation Storage Buffer (TSB)
• The location of the TSB is loaded into the processor on context switch (implies one TSB per process)
• On a TLB miss, hardware calculates the offset of the matching entry in the TSB and supplies it to the software TLB miss handler
• In most cases, the software TLB miss handler only needs to make a tag comparison to the TSB entry, load it into the TLB, and return
• If an access misses in the TSB, then a slow software search of the page tables is required
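A sketch of the fast path, assuming a direct-mapped TSB indexed by the low VPN bits; the names and the two hooks are invented for illustration:

    #include <stdint.h>

    typedef struct { uint64_t tag; uint64_t data; } tsb_entry_t;

    extern tsb_entry_t tsb[];     /* per-process TSB base */
    extern uint64_t    tsb_mask;  /* num_entries - 1, power of two */
    void tlb_insert(uint64_t tag, uint64_t data);  /* invented hook */
    void slow_page_table_walk(uint64_t vpn);       /* invented hook */

    /* Software TLB miss handler fast path: one direct-mapped TSB
       probe (the hardware would precompute &tsb[vpn & tsb_mask]). */
    void tlb_miss(uint64_t vpn)
    {
        tsb_entry_t *e = &tsb[vpn & tsb_mask];
        if (e->tag == vpn)
            tlb_insert(e->tag, e->data);  /* hit: load TLB, return */
        else
            slow_page_table_walk(vpn);    /* miss: search page tables */
    }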
