
Address Translation with Paging
Case studies for X86, SPARC, and PowerPC

Overview

• Page tables
  – What are they? (review)
  – What does a page table entry (PTE) contain?
  – How are page tables organized?
• Making page table access fast
  – Caching entries
  – Translation lookaside buffer (TLB)
  – TLB management

Generic Page Table

• Memory divided into pages
• Page table is a collection of PTEs that maps a virtual page number to a PTE
• Organization and content vary with architecture
• If no virtual-to-physical mapping exists => page fault

[Diagram: a virtual page # is looked up in the page table, yielding a PTE or a page fault]

Some acronyms used in this lecture:
• PTE = page table entry
• PDE = page directory entry
• VA = virtual address
• PA = physical address
• VPN = virtual page number
• {R,P}PN = {real, physical} page number

Generic PTE

• PTE maps virtual page to physical page
• Includes some page properties
  – Valid?, writable?, dirty?, cacheable?

[Diagram: Virtual Page # → Physical Page # plus property bits]
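As a concrete illustration, a generic PTE could be modeled in C as one word holding a physical page number plus property bits. This is a hypothetical layout for the example only; field widths and ordering differ on every real architecture:

    #include <stdint.h>

    /* Illustrative generic PTE: a 32-bit word holding a physical
       page number plus property bits (made-up field widths). */
    typedef struct {
        uint32_t valid     : 1;   /* mapping present? */
        uint32_t writable  : 1;   /* write access allowed? */
        uint32_t dirty     : 1;   /* page has been modified? */
        uint32_t cacheable : 1;   /* may the page be cached? */
        uint32_t ppn       : 20;  /* physical page number */
        uint32_t unused    : 8;   /* available to the OS */
    } pte_t;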

Real Page Tables

• Design requirements
  – Minimize memory use (page tables are pure overhead)
  – Fast (logically accessed on every memory reference)
• Requirements lead to
  – Compact data structures
  – O(1) access (e.g. indexed lookup, hashtable)
• Examples: X86 and PowerPC

X86-32 Address Translation

• Page tables organized as a two-level tree
• Efficient because address space usage is sparse
• Each level of the tree indexed using a piece of the virtual page number for fast lookups
• One set of page tables per process
• Current set of page tables pointed to by CR3
• CPU walks the page tables to find translations
• Accessed and dirty bits updated by CPU
• 4K or 4M (sometimes 2M) pages

X86-32 PDE and PTE Details

Page Directory Entry (PDE)
• 20-bit page number of a page table plus 12 property bits:
  Present, Read/Write, User/Supervisor, Write-through, Cache Disabled,
  Accessed, Reserved, 4K or 4M Page, Global, and 3 bits Available to the OS

Page Table Entry (PTE)
• 20-bit page number of a physical memory page plus 12 property bits:
  Present, Read/Write, User/Supervisor, Write-through, Cache Disabled,
  Accessed, Dirty, PAT, Global, and 3 bits Available to the OS

• Where is the virtual page number?
• If a page is not present, all but bit 0 are available for the OS

IA-32 Architecture Software Developer’s Manual, Volume 3, pg. 3-24

X86-32 Page Table Lookup

32-bit virtual address: 10-bit page dir index | 10-bit page tbl index | 12-bit offset

• Top 10 address bits index the page directory and return a page directory entry that points to a page table
• Middle 10 bits index the page table and return a page table entry that points to a physical memory page
• Bottom 12 bits are an offset to a single byte within the physical page
• Checks are made at each step to ensure the desired page is present in memory and that the process making the request has sufficient rights to access it

[Diagram: CR3 → page directory (1024 entries) → page tables (1024 entries each) → physical memory]
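The two-level lookup can be sketched as a simplified software model of the hardware walk. This is only a sketch: the helper name is invented, only 4K pages are handled, no permission checks are made, and physical memory is assumed to be directly addressable:

    #include <stdint.h>
    #include <stdbool.h>

    #define PTE_PRESENT 0x1u

    /* Model of the X86-32 two-level page table walk (4K pages). */
    bool translate(uint32_t *page_directory, uint32_t vaddr,
                   uint32_t *paddr_out)
    {
        uint32_t pd_index = (vaddr >> 22) & 0x3FF;  /* top 10 bits */
        uint32_t pt_index = (vaddr >> 12) & 0x3FF;  /* middle 10 bits */
        uint32_t offset   = vaddr & 0xFFF;          /* bottom 12 bits */

        uint32_t pde = page_directory[pd_index];
        if (!(pde & PTE_PRESENT))
            return false;                           /* page fault */

        /* Top 20 bits of the PDE locate the page holding the page
           table; pretend we can dereference physical addresses. */
        uint32_t *page_table = (uint32_t *)(uintptr_t)(pde & 0xFFFFF000u);
        uint32_t pte = page_table[pt_index];
        if (!(pte & PTE_PRESENT))
            return false;                           /* page fault */

        *paddr_out = (pte & 0xFFFFF000u) | offset;
        return true;
    }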

X86-32 and PAE

• Intel added support for up to 64GB of physical memory in the Pentium Pro – called Physical Address Extensions (PAE)
• Introduced a new CPU mode and another layer in the page tables
• In PAE mode, 32-bit VAs map to 36-bit PAs
• Single-process address space is still 32 bits
• 4-entry page-directory-pointer-table (PDPT) points to a page directory, and then translation proceeds as normal
• Page directory and page table entries expanded to 64 bits to hold 36-bit physical addresses
• Only 512 entries per 4K page
• 4K or 2M page sizes

What about 64-bit X86?

• X86-64 (AMD64 or EM64T) supports a 64-bit virtual address (only 48 bits effective)
• Three modes
  – Legacy 32-bit (32-bit VA, 32-bit PA)
  – Legacy PAE (32-bit VA, up to 52-bit PA)
  – Long PAE mode (64-bit VA, 52-bit PA)
• Long mode requires four levels of page tables to map a 48-bit VA to a 52-bit PA

AMD64 Architecture Programmer’s Manual Volume 2: System Programming, Ch. 5
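To make the four-level split concrete, here is a small sketch of how a 48-bit long-mode virtual address decomposes into four 9-bit table indices plus a 12-bit page offset (standard 4K-page layout; the variable names are my own):

    #include <stdint.h>
    #include <stdio.h>

    /* 9-bit indices follow from 512 (2^9) 8-byte entries fitting
       in each 4K table page; 9+9+9+9+12 = 48 effective bits. */
    int main(void)
    {
        uint64_t va   = 0x00007f00deadbeef;  /* arbitrary canonical VA */
        unsigned pml4 = (va >> 39) & 0x1FF;  /* level 4 table index */
        unsigned pdpt = (va >> 30) & 0x1FF;  /* level 3 table index */
        unsigned pd   = (va >> 21) & 0x1FF;  /* level 2 table index */
        unsigned pt   = (va >> 12) & 0x1FF;  /* level 1 table index */
        unsigned off  =  va        & 0xFFF;  /* byte offset in page */
        printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
               pml4, pdpt, pd, pt, off);
        return 0;
    }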

PowerPC Address Translation

• 80-bit virtual address obtained via the PowerPC segmentation mechanism
• 62-bit physical (“real”) address
• PTEs organized in a hash table (HTAB)
• Each HTAB entry is a page table entry group (PTEG)
• Each PTEG has (8) 16-byte PTEs
• Hash function on the VPN gives the index of two PTEGs (primary and secondary PTEGs)
• Resulting 16 PTEs searched for a VPN match
• No match => page fault

PowerPC Segmentation

64-bit “effective” address generated by a program: 36-bit ESID | 28 address bits

• Top 36 bits of a program-generated “effective” address used as a tag called the effective segment id (ESID)
• SLB is an “associative memory” – search for the tag value in the Segment Lookaside Buffer (SLB)
• If a match exists, property bits (U/S, X, …) are validated for access
• A failed match causes a segment fault
• The matching entry’s associated 52-bit virtual segment id (VSID) is concatenated with the remaining 28 address bits to form an 80-bit “virtual” address used for the page table lookup
• Segmentation used to separate processes within the large virtual address space

PowerPC Page Table Lookup

• Variable size hash table
• A register points to the hash table base and gives the table’s size
• Architecture-defined hash function on the virtual address returns two possible hash table entries (a primary and a secondary hash index)
• Each of the 16 possible PTEs is checked for a VA match
• If no match, then page fault
• Possibility that a translation exists but can’t fit in the hash table – the OS must handle this

[Diagram: 80-bit virtual address → hash function → primary and secondary hash indexes → primary and secondary PTEGs in the hash table (HTAB); a match selects a 16-byte PTE, no match means a page fault]

PowerPC PTE Details

16-byte PTE; both the VPN and the RPN are stored:
• Doubleword 0: Abbreviated Virtual Page Number (bits 0-56), SW (57-60), H (62), V (63)
• Doubleword 1: Real Page Number (bits 2-51), AC (54), R (55), C (56), WIMG (57-60), N (61), PP (62-63)

Key:
SW = Available for OS use
H = Hash function ID
V = Valid bit
AC = Address compare bit
R = Referenced bit
C = Changed bit
WIMG = Storage control bits
N = No execute bit
PP = Page protection bits

• Why only a 57-bit VPN?

PowerPC Operating Environment Architecture, Book III, Version 2.01, Sections 4.3-4.5
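The primary/secondary PTEG search pattern can be sketched as follows. A toy hash stands in for the architecture-defined function, and the type and helper names are invented; the secondary index being the complement of the primary mirrors the PowerPC scheme:

    #include <stdint.h>
    #include <stddef.h>

    #define PTES_PER_PTEG 8

    typedef struct { uint64_t vpn_tag; uint64_t rpn_bits; } ppc_pte_t;

    /* Search the primary PTEG, then the secondary PTEG, for a VPN
       match. NULL means no match: a page fault, or a translation
       the OS evicted from the hash table and must recover itself. */
    ppc_pte_t *htab_lookup(ppc_pte_t (*htab)[PTES_PER_PTEG],
                           uint64_t htab_mask, uint64_t vpn)
    {
        uint64_t primary   = (vpn ^ (vpn >> 23)) & htab_mask; /* toy hash */
        uint64_t secondary = ~primary & htab_mask;
        uint64_t groups[2] = { primary, secondary };

        for (int g = 0; g < 2; g++)
            for (int i = 0; i < PTES_PER_PTEG; i++)
                if (htab[groups[g]][i].vpn_tag == vpn)
                    return &htab[groups[g]][i];
        return NULL;
    }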

Making Translation Fast

• Page table logically accessed on every instruction
• Paging has turned each memory reference into at least three memory references
• Page table access has temporal locality
• Use a cache to speed up access
• Translation Lookaside Buffer (TLB)

Generic TLB

• Cache of recently used PTEs
• Small – usually about 64 entries
• Huge impact on performance
• Various organizations, search strategies, and levels of OS involvement possible
• Consider X86 and SPARC

[Diagram: Virtual Address → TLB → Physical Address, TLB miss, or access fault]

TLB Organization

TLB entry: Tag (virtual page number) | Value (page table entry)

[Diagram: four ways to organize a 16-entry TLB – direct mapped (16 sets of 1 entry), two-way set associative (8 sets of 2), four-way set associative (4 sets of 4), and fully associative (1 set of 16)]

Lookup
• Calculate the index (index = tag % num_sets)
• Search for the tag within the resulting set
• Why not use the upper bits of the tag value for the index?

Associativity Trade-offs

• Higher associativity
  – Better utilization, fewer collisions
  – Slower
  – More hardware
• Lower associativity
  – Fast
  – Simple, less hardware
  – Greater chance of collisions
• How does page size affect TLB performance?
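A minimal sketch of the set-associative lookup described above, assuming invented types and a 16-entry, two-way TLB:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS 8   /* 16 entries / 2 ways */
    #define NUM_WAYS 2

    typedef struct {
        bool     valid;
        uint32_t tag;    /* virtual page number */
        uint32_t value;  /* cached PTE */
    } tlb_entry_t;

    tlb_entry_t tlb[NUM_SETS][NUM_WAYS];

    /* index = tag % num_sets, then search every way in that set */
    bool tlb_lookup(uint32_t vpn, uint32_t *pte_out)
    {
        tlb_entry_t *set = tlb[vpn % NUM_SETS];
        for (int way = 0; way < NUM_WAYS; way++) {
            if (set[way].valid && set[way].tag == vpn) {
                *pte_out = set[way].value;
                return true;   /* TLB hit */
            }
        }
        return false;          /* TLB miss */
    }

On the slide’s closing question: indexing with the low VPN bits spreads consecutive pages across sets, while the upper bits change rarely and would pile neighboring pages into the same set.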

X86 TLB

• TLB management shared by processor and OS
• CPU fills the TLB on demand from the page table (the OS is unaware of TLB misses)
• CPU evicts entries when a new entry must be added and no free slots exist
• OS ensures TLB/page table consistency by flushing entries as needed when the page tables are updated or switched (e.g. during a context switch)
• TLB entries can be removed by the OS one at a time using the INVLPG instruction, or the entire TLB can be flushed at once by writing a new entry into CR3

Example: Pentium-M TLBs

• Four different TLBs
  – Instruction TLB for 4K pages
    • 128 entries, 4-way set associative
  – Instruction TLB for large pages
    • 2 entries, fully associative
  – Data TLB for 4K pages
    • 128 entries, 4-way set associative
  – Data TLB for large pages
    • 8 entries, 4-way set associative
• All TLBs use an LRU replacement policy
• Why different TLBs for instruction, data, and page sizes?
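As an illustration of the two flush mechanisms, kernel code typically wraps them in helpers along these lines (GCC inline assembly; a sketch under that assumption, not taken from any particular kernel):

    /* Flush the TLB entry for a single virtual address. */
    static inline void invlpg(void *vaddr)
    {
        __asm__ volatile("invlpg (%0)" : : "r"(vaddr) : "memory");
    }

    /* Flush the entire (non-global) TLB by rewriting CR3. */
    static inline void flush_tlb_all(void)
    {
        unsigned long cr3;
        __asm__ volatile("mov %%cr3, %0" : "=r"(cr3));
        __asm__ volatile("mov %0, %%cr3" : : "r"(cr3) : "memory");
    }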

SPARC TLB

• SPARC is RISC (simpler is better)
• Example of a “software-managed” TLB
• A TLB miss causes a fault, handled by the OS
• OS explicitly adds entries to the TLB
• OS is free to organize its page tables in any way it wants because the CPU does not use them
• E.g. Linux uses a tree like X86, Solaris uses a hash table

Minimizing Flushes

• On SPARC, TLB misses trap to the OS (SLOW)
• We want to avoid TLB misses
• Retain TLB contents across context switches
• SPARC TLB entries enhanced with a context id
• Context id allows entries with the same VPN to coexist in the TLB (e.g. entries from different process address spaces)
• When a process is switched back onto a processor, chances are that some of its TLB state has been retained from the last time it ran
• Some TLB entries shared (OS kernel memory)
  – Mark as global
  – Context id ignored during matching
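The match rule a context-tagged entry implies can be sketched as follows (invented types and names; global entries skip the context comparison):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        bool     valid;
        bool     global;   /* shared kernel mapping */
        uint16_t ctx;      /* context id (address space tag) */
        uint64_t vpn;
        uint64_t pte;
    } stlb_entry_t;

    /* An entry hits when the VPN matches and either the entry is
       global or its context id matches the running process’s. */
    bool entry_matches(const stlb_entry_t *e, uint64_t vpn, uint16_t cur_ctx)
    {
        return e->valid && e->vpn == vpn &&
               (e->global || e->ctx == cur_ctx);
    }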

Example: UltraSPARC III TLBs

• Five different TLBs
• Instruction TLBs
  – 16 entries, fully associative (supports all page sizes)
  – 128 entries, 2-way set associative (8K pages only)
• Data TLBs
  – 16 entries, fully associative (supports all page sizes)
  – 2 x 512 entries, 2-way set associative (each supports one page size per process)
• Valid page sizes – 8K (default), 64K, 512K, and 4M
• 13-bit context id – 8192 different concurrent address spaces
• What happens if you have > 8192 processes?

Speeding Up TLB Miss Handling

• In some cases a huge amount of time can be spent handling TLB misses (2-50% in one study of SuperSPARC and SunOS)
• Many architectures that use software-managed TLBs have hardware-assisted TLB miss handling
• SPARC uses a large, virtually-indexed, direct-mapped, physically contiguous table of recently used TLB entries called the Translation Storage Buffer (TSB)
• The location of the TSB is loaded into the processor on context switch (implies one TSB per process)
• On a TLB miss, hardware calculates the offset of the matching entry in the TSB and supplies it to the software TLB miss handler
• In most cases, the software TLB miss handler only needs to make a tag comparison to the TSB entry, load it into the TLB, and return
• If an access misses in the TSB, then a slow software search of the page tables is required
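A sketch of the fast path, assuming a direct-mapped TSB indexed by the low VPN bits; the names and the two hooks are invented for illustration:

    #include <stdint.h>

    typedef struct { uint64_t tag; uint64_t data; } tsb_entry_t;

    extern tsb_entry_t tsb[];     /* per-process TSB base */
    extern uint64_t    tsb_mask;  /* num_entries - 1, power of two */
    void tlb_insert(uint64_t tag, uint64_t data);  /* invented hook */
    void slow_page_table_walk(uint64_t vpn);       /* invented hook */

    /* Software TLB miss handler fast path: one direct-mapped TSB
       probe (the hardware would precompute &tsb[vpn & tsb_mask]). */
    void tlb_miss(uint64_t vpn)
    {
        tsb_entry_t *e = &tsb[vpn & tsb_mask];
        if (e->tag == vpn)
            tlb_insert(e->tag, e->data);  /* hit: load TLB, return */
        else
            slow_page_table_walk(vpn);    /* miss: search page tables */
    }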
