Virtual Memory in Contemporary Microprocessors

VIRTUAL MEMORY IN CONTEMPORARY MICROPROCESSORS THIS SURVEY OF SIX COMMERCIAL MEMORY-MANAGEMENT DESIGNS DESCRIBES HOW EACH PROCESSOR ARCHITECTURE SUPPORTS THE COMMON FEATURES OF VIRTUAL MEMORY: ADDRESS SPACE PROTECTION, SHARED MEMORY, AND LARGE ADDRESS SPACES. Virtual memory is a technique for from the most popular commercial microar- managing the resource of physical memory. It chitectures: the MIPS R10000, Alpha 21164, gives an application the illusion of a very large PowerPC 604, PA-8000, UltraSPARC-I, and amount of memory, typically much larger than Pentium II. Table 1 points out a few of their what is actually available. It protects the code similarities by comparing their support for and data of user-level applications from the some core virtual-memory functions. actions of other programs but also allows programs to share portions of their address spaces Memory management if desired. It supports the execution of process- The classic MMU, as in the DEC VAX, GE es partially resident in memory. Only the most 645, and Intel Pentium architectures,2-4 recently used portions of a process’s address includes a translation look-aside buffer that Bruce Jacob space actually occupy physical memory—the translates addresses and a finite-state machine rest of the address space is stored on disk until that walks the page table. The TLB is an on- University of Maryland needed. For a primer on virtual memory, see chip memory structure that caches only page our companion article in Computer magazine.1 table entries (PTEs). If the necessary transla- Most contemporary general-purpose proces- tion information is in the TLB, the system can Trevor Mudge sors support virtual memory through a hard- translate a virtual address to a physical address ware memory management unit (MMU) that without accessing the page table. If the trans- University of Michigan translates virtual addresses to physical address- lation information is not found in the TLB es. Unfortunately, the various microarchitec- (called a TLB miss), one must search the page tures define the virtual-memory interface table for the mapping and insert it into the TLB differently, and, as explained in the next sec- before processing can continue. In early designs tion, this is becoming a significant problem. a hardware state machine performed this activ- Here, we consider the memory management ity; on a TLB miss, the state machine walked designs of a sampling of six recent processors, the page table, loaded the mapping, refilled the focusing primarily on their architectural dif- TLB, and restarted the computation. ferences, and hint at optimizations that some- TLBs usually have on the order of 100 one designing or porting system software entries, are often fully associative, and are typ- might want to consider. We selected examples ically accessed every clock cycle. They trans- 60 0272-1732/98/$10.00 1998 IEEE Table 1. Comparison of architectural support for virtual memory in six commercial MMUs. Item MIPS Alpha PowerPC PA-RISC UltraSPARC IA-32 Address space ASIDs ASIDs Segmentation Multiple ASIDs ASIDs Segmentation protection Shared memory GLOBAL bit GLOBAL bit Segmentation Multiple ASIDs; Indirect Segmentation in TLB entry in TLB entry segmentation specification of ASIDs Large address 64-bit 64-bit 52-/80-bit 96-bit 64-bit None spaces addressing addressing segmented segmented addressing addressing addressing Fine-grained In TLB entry In TLB entry In TLB entry In TLB entry In TLB entry In TLB entry; protection per segment Page table Software- Software- Hardware- Software- Software- Hardware- support managed managed managed managed managed managed TLB TLB TLB; inverted TLB TLB TLB/hierarchical page table page table Superpages Variable page Groupings of Block address Variable page Variable page Segmentation/ size set in TLB 8, 64, 512 translation: size set in TLB size set in variable page entry: 4 Kbyte pages (set 128 Kbytes entry:4 Kbytes TLB entry: size set in TLB to 16 Mbyte, by 4 in TLB entry) to 256 Mbytes, to 64 Mbytes, 8, 64, 512 entry: 4 Kbytes by 2 by 4 Kbytes, and or 4 Mbytes 4 Mbytes late both instruction and data stream address- primary disadvantage of the state machine is es. They can constrain the chip’s clock cycle that the page table organization is effectively as they tend to be fairly slow, and they are also etched in stone; the operating system (OS) has power-hungry—both are a consequence of little flexibility in tailoring a design. the TLB’s high degree of associativity. Today’s In response, recent memory management systems require both high clock speeds and designs have used a software-managed TLB, in low power; in response, two-way and four- which the OS handles TLB misses. MIPS was way set-associative TLB designs are popular, one of the earliest commercial architectures to as lower degrees of associativity have far less offer a software-managed TLB,5 though the impact on clock speed and power consump- Astronautics Corporation of America holds a tion than fully associative designs. To provide patent for a software-managed design.6 In a increased translation bandwidth, designers software-managed TLB miss, the hardware often use split TLB designs. interrupts the OS and vectors to a software rou- The state machine is an efficient design as tine that walks the page table. The OS thus it disturbs the processor pipeline only slightly. defines the page table organization, since hard- During a TLB miss, the instruction pipeline ware never directly manages the table. effectively freezes: in contrast to taking an The flexibility of the software-managed exception, the pipeline is not disturbed, and mechanism comes at a performance cost. The the reorder buffer need not be flushed. The TLB miss handler that walks the page table is instruction cache is not used, and the data an OS primitive usually 10 to 100 instruc- cache is used only if the page table is located in tions long. If the handler code is not in the cacheable space. At the worst, the execution of instruction cache at the time of the TLB miss, the state machine will replace a few lines in the the time to handle the miss can be much data cache. Many designs do not even freeze longer than in the hardware-walked scheme. the pipeline; for instance, the Intel Pentium In addition, the use of the interrupt mecha- Pro allows instructions that are independent nism adds a number of cycles to the cost by of the faulting instruction to continue pro- flushing the pipeline—possibly flushing a cessing while the TLB miss is serviced. The large number of instructions from the reorder JULY–AUGUST 1998 61 VIRTUAL MEMORY 64-bit virtual address concentrate on those mecha- Region (Sign-extended) 32-bit virtual page number 12-bit page offset nisms that are unique to each architecture. 8-bit ASID Cache index MIPS MIPS (Figure 1) defines Eight-entry one of the simplest memory instruction TLB, 32-Kbyte, 64-paired-entry, management architectures two-way set-associative among recent microproces- fully associative instruction or data cache main TLB sors. The OS handles TLB misses entirely in software: Tag the software fills in the TLB, 28-bit page frame number compare and the OS defines the TLB (cache hit/miss) Tag: page frame number replacement policy. Address space Cache data The R2000/R3000 virtual address is 32 bits wide; the Figure 1. MIPS R10000 address translation mechanism. The split instruction and data R10000 virtual address is 64 caches are virtually indexed, both requiring an index larger than the page offset. The TLB bits wide, though not all 64 lookup proceeds in parallel with the cache lookup. The earliest MIPS designs had physically bits are translated in the indexed caches, and if the cache was larger than the page size, the cache was accessed in R10000. The top “region” series with the TLB. ASID: address space identifier. bits divide the virtual space into areas of different behavior. The top two bits distin- buffer. This can add hundreds of cycles to the guish between user, supervisor, and kernel overhead of walking the page table. Nonethe- spaces (the R10000 offers three levels of exe- less, the flexibility afforded by the software- cution/access privileges). Further bits divide managed scheme can outweigh the potentially the kernel and supervisor regions into areas of higher per-miss cost of the design.7 different memory behavior (that is, cached/ Given the few details presented so far, one uncached, mapped/unmapped). can easily see that the use of different virtual- In the R2000/R3000, the top bit divides memory interface definitions in every the 4-Gbyte address space into user and ker- microarchitecture is becoming a significant nel regions, and the next two bits further problem. More often than not, the OS run- divide the kernel’s space into cached/uncached ning on a microprocessor was not initially and mapped/unmapped regions. In both designed for that processor: OSs often outlast architectures, virtual addresses are extended the hardware on which they were designed with an address space identifier (ASID) to dis- and built, and the more popular OSs are port- tinguish between contexts. The 6-bit-wide ed to many different architectures. Hardware ASID on the R2000/R3000 uniquely identi- abstraction layers (for example, see Rashid et fies 64 processes; the 8-bit-wide ASID on the al.8 and Custer9) hide hardware particulars R10000 uniquely identifies 256. from most of the OS, and they can prevent Since it is simpler, we describe the 32-bit system designers from fully optimizing their address space of the R2000/R3000. User space, software. These types of mismatches between called kuseg, occupies the bottom 2 Gbytes of OSs and microarchitectures cause significant the address space.

Virtual Memory in Contemporary Microprocessors

Chapter 8 Instruction Set

Virtual Memory

Mipspro C++ Programmer's Guide

Book E: Enhanced Powerpc™ Architecture

The Case for Compressed Caching in Virtual Memory Systems

Memory Protection at Option

Using the GNU Compiler Collection (GCC)

Microprocessors History of Computing Nouf Assaid

Computer Organization and Architecture Designing for Performance Ninth Edition

MIPS IV Instruction Set

A Minimal Powerpc™ Boot Sequence for Executing Compiled C Programs

Intermediate X86 Part 2