<<

IN CONTEMPORARY

THIS SURVEY OF SIX COMMERCIAL MEMORY-MANAGEMENT DESIGNS

DESCRIBES HOW EACH ARCHITECTURE SUPPORTS THE COMMON

FEATURES OF VIRTUAL MEMORY: PROTECTION, SHARED

MEMORY, AND LARGE ADDRESS SPACES.

Virtual memory is a technique for from the most popular commercial microar- managing the resource of physical memory. It chitectures: the MIPS R10000, , gives an application the illusion of a very large PowerPC 604, PA-8000, UltraSPARC-I, and amount of memory, typically much larger than Pentium II. Table 1 points out a few of their what is actually available. It protects the code similarities by comparing their support for and of user-level applications from the some core virtual-memory functions. actions of other programs but also allows pro- grams to share portions of their address spaces if desired. It supports the execution of - The classic MMU, as in the DEC VAX, GE es partially resident in memory. Only the most 645, and Pentium architectures,2-4 recently used portions of a process’s address includes a translation look-aside buffer that Bruce Jacob space actually occupy physical memory—the translates addresses and a finite-state machine rest of the address space is stored on disk until that walks the table. The TLB is an on- University of Maryland needed. For a primer on virtual memory, see chip memory structure that caches only page our companion article in magazine.1 table entries (PTEs). If the necessary transla- Most contemporary general-purpose proces- tion information is in the TLB, the system can Trevor Mudge sors support virtual memory through a hard- translate a virtual address to a ware (MMU) that without accessing the . If the trans- translates virtual addresses to physical address- lation information is not found in the TLB es. Unfortunately, the various microarchitec- (called a TLB miss), one must search the page tures define the virtual-memory interface table for the mapping and insert it into the TLB differently, and, as explained in the sec- before processing can continue. In early designs tion, this is becoming a significant problem. a hardware state machine performed this activ- Here, we consider the memory management ity; on a TLB miss, the state machine walked designs of a sampling of six recent processors, the page table, loaded the mapping, refilled the focusing primarily on their architectural dif- TLB, and restarted the computation. ferences, and hint at optimizations that some- TLBs usually have on the order of 100 one designing or porting system software entries, are often fully associative, and are typ- might want to consider. We selected examples ically accessed every clock cycle. They trans-

60 0272-1732/98/$10.00  1998 IEEE Table 1. Comparison of architectural support for virtual memory in six commercial MMUs.

Item MIPS Alpha PowerPC PA-RISC UltraSPARC IA-32 Address space ASIDs ASIDs Segmentation Multiple ASIDs ASIDs Segmentation protection GLOBAL bit GLOBAL bit Segmentation Multiple ASIDs; Indirect Segmentation in TLB entry in TLB entry segmentation specification of ASIDs Large address 64-bit 64-bit 52-/80-bit 96-bit 64-bit None spaces addressing addressing segmented segmented addressing addressing addressing Fine-grained In TLB entry In TLB entry In TLB entry In TLB entry In TLB entry In TLB entry; protection per segment Page table Software- Software- Hardware- Software- Software- Hardware- support managed managed managed managed managed managed TLB TLB TLB; inverted TLB TLB TLB/hierarchical page table page table Superpages Variable page Groupings of Block address Variable page Variable page Segmentation/ size set in TLB 8, 64, 512 translation: size set in TLB size set in variable page entry: 4 Kbyte pages (set 128 Kbytes entry:4 Kbytes TLB entry: size set in TLB to 16 Mbyte, by 4 in TLB entry) to 256 Mbytes, to 64 Mbytes, 8, 64, 512 entry: 4 Kbytes by 2 by 4 Kbytes, and or 4 Mbytes 4 Mbytes late both instruction and data stream address- primary disadvantage of the state machine is es. They can constrain the chip’s clock cycle that the page table organization is effectively as they tend to be fairly slow, and they are also etched in stone; the (OS) has power-hungry—both are a consequence of little flexibility in tailoring a design. the TLB’s high degree of associativity. Today’s In response, recent memory management systems require both high clock speeds and designs have used a software-managed TLB, in low power; in response, two-way and four- which the OS handles TLB misses. MIPS was way set-associative TLB designs are popular, one of the earliest commercial architectures to as lower degrees of associativity have far less offer a software-managed TLB,5 though the impact on clock speed and power consump- Astronautics Corporation of America holds a tion than fully associative designs. To provide patent for a software-managed design.6 In a increased translation bandwidth, designers software-managed TLB miss, the hardware often use split TLB designs. the OS and vectors to a software rou- The state machine is an efficient design as tine that walks the page table. The OS thus it disturbs the processor pipeline only slightly. defines the page table organization, since hard- During a TLB miss, the instruction pipeline ware never directly manages the table. effectively freezes: in contrast to taking an The flexibility of the software-managed exception, the pipeline is not disturbed, and mechanism comes at a performance cost. The the reorder buffer need not be flushed. The TLB miss handler that walks the page table is instruction is not used, and the data an OS primitive usually 10 to 100 instruc- cache is used only if the page table is located in tions long. If the handler code is not in the cacheable space. At the worst, the execution of instruction cache at the time of the TLB miss, the state machine will replace a few lines in the the time to the miss can be much data cache. Many designs do not even freeze longer than in the hardware-walked scheme. the pipeline; for instance, the Intel Pentium In addition, the use of the mecha- Pro allows instructions that are independent nism adds a number of cycles to the cost by of the faulting instruction to continue pro- flushing the pipeline—possibly flushing a cessing while the TLB miss is serviced. The large number of instructions from the reorder

JULY–AUGUST 1998 61 VIRTUAL MEMORY

64-bit virtual address concentrate on those mecha- Region (Sign-extended) 32-bit virtual page number 12-bit page offset nisms that are unique to each architecture. 8-bit ASID Cache index MIPS MIPS (Figure 1) defines Eight-entry one of the simplest memory instruction TLB, 32-Kbyte, 64-paired-entry, management architectures two-way set-associative among recent microproces- fully associative instruction or data cache main TLB sors. The OS handles TLB misses entirely in software: Tag the software fills in the TLB, 28-bit page frame number compare and the OS defines the TLB (cache hit/miss) Tag: page frame number replacement policy. Address space Cache data The / virtual address is 32 bits wide; the Figure 1. MIPS R10000 address translation mechanism. The split instruction and data R10000 virtual address is 64 caches are virtually indexed, both requiring an index larger than the page offset. The TLB bits wide, though not all 64 lookup proceeds in parallel with the cache lookup. The earliest MIPS designs had physically bits are translated in the indexed caches, and if the cache was larger than the page size, the cache was accessed in R10000. The top “region” series with the TLB. ASID: address space identifier. bits divide the virtual space into areas of different behav- ior. The top two bits distin- buffer. This can add hundreds of cycles to the guish between user, supervisor, and kernel overhead of walking the page table. Nonethe- spaces (the R10000 offers three levels of exe- less, the flexibility afforded by the software- cution/access privileges). Further bits divide managed scheme can outweigh the potentially the kernel and supervisor regions into areas of higher per-miss cost of the design.7 different memory behavior (that is, cached/ Given the few details presented so far, one uncached, mapped/unmapped). can easily see that the use of different virtual- In the R2000/R3000, the top bit divides memory interface definitions in every the 4-Gbyte address space into user and ker- is becoming a significant nel regions, and the next two bits further problem. More often than not, the OS run- divide the kernel’s space into cached/uncached ning on a was not initially and mapped/unmapped regions. In both designed for that processor: OSs often outlast architectures, virtual addresses are extended the hardware on which they were designed with an address space identifier (ASID) to dis- and built, and the more popular OSs are port- tinguish between contexts. The 6-bit-wide ed to many different architectures. Hardware ASID on the R2000/R3000 uniquely identi- abstraction layers (for example, see Rashid et fies 64 processes; the 8-bit-wide ASID on the al.8 and Custer9) hide hardware particulars R10000 uniquely identifies 256. from most of the OS, and they can prevent Since it is simpler, we describe the 32-bit system designers from fully optimizing their address space of the R2000/R3000. , software. These types of mismatches between called kuseg, occupies the bottom 2 Gbytes of OSs and cause significant the address space. All kuseg references are performance problems;10 an OS not tuned to mapped through the TLB and considered the hardware on which it operates is unlikely cacheable by the hardware unless otherwise to live up to its potential performance. noted in a TLB entry. The top half of the virtual The following sections describe the differ- address space belongs to the kernel: an address ent commercial virtual memory interfaces. generated with the top bit set while in user First is the MIPS organization, which has the mode causes an exception. Kernel space is divid- most in common with the others. Then, we ed into three regions: the 1-Gbyte kseg2 region

62 IEEE MICRO is cacheable and mapped through the TLB like 20 6 6 kuseg. The other two 512-Mbyte regions (kseg0 Virtual page number ASID 0 and kseg1) are mapped directly onto physical (a) memory; the hardware zeroes out the top three Dirty N Noncacheable bits to generate the physical address directly. G Global Valid The hardware then caches references to kseg0, but not the references to kseg1. 201111 8 Page frame number N D V G 0

TLB (b) The MIPS TLB is a unified 64-entry, fully associative cache. The OS loads page table Figure 2. MIPS R2000/3000 TLB entry format: EntryHi (a) and EntryLo (b). entries (PTEs) into the TLB, using either ran- dom replacement (the hardware chooses a TLB slot randomly) or specified placement (the OS (PFN) and status bits. The design saves die tells the hardware which slot to choose). The area and power. It is nearly as flexible as a 128- TLB’s 64 entries are partitioned between entry TLB but requires half the tag area— “wired” and “random” entries. While the because two mappings share each virtual page R2000/R3000 has eight wired entries, the par- number (VPN)—and half the comparators. tition between wired and random entries is set In the MIPS R2000/R3000, the status by the R10000 software. The hardware pro- fields in EntryLo are vides a mechanism to choose one of the ran- dom slots. On request, it produces a random • N, noncacheable. If this bit is set for a number between index values of 8 and 63, TLB entry, the page it maps is not inclusive (the R10000 produces values between cached; the processor sends the address N and 63, inclusive, where N is set by software). out to main memory without accessing This random number references only the TLB’s the cache. random entries; by not returning values corre- • D, dirty. If this bit is set, the page is sponding to wired entries, it effectively protects writable. The bit can be set by software, those entries. The TLBWR (TLB write ran- so it is effectively a write-enable bit. A dom) instruction uses this mechanism to insert store to a page with the dirty bit cleared a mapping randomly into the TLB, and the causes a protection violation. TLBWI (TLB write indexed) instruction • V, valid. If this bit is set, the entry con- inserts mappings at any specified location. tains a valid mapping. Most OSs use the wired partition to store root- • G, global. If this bit is set, the TLB level PTEs and kernel mappings in the pro- ignores the ASID match requirement for tected slots, keeping user mappings in the a TLB hit on this page. This feature sup- random slots and using a low-cost random ports shared memory and allows the ker- replacement policy to manage them. nel to access its own code and data while The OS interacts with the TLB through the operating on behalf of a user process EntryHi and EntryLo registers, pictured in without having to save or restore ASIDs. Figure 2. EntryHi contains a virtual page number and an ASID; EntryLo corresponds The R10000 inherits this organization and to a PTE and contains a page frame number adds more powerful control-status bits that and status bits. A TLB entry is equivalent to support features such as more complex the concatenation of these structures. caching behavior and specification for The R10000 structure is similar but larger. coherency protocols. Also, if the G bit is set It also has two separate EntryLo registers— for either page in a paired entry, the ASID one for each of two paired virtual page num- check is disabled for both pages. bers. This allows the R10000 to effectively When the OS reads an entry from the TLB, double the reach of the TLB without adding the hardware places the information into more entries. A single TLB entry maps every EntryHi and EntryLo. When the OS inserts two contiguous even-odd virtual pages, though a mapping into the TLB, it first loads the each receives its own page frame number desired values into these registers. It then exe-

JULY–AUGUST 1998 63 VIRTUAL MEMORY

Virtual page number Page offset ware TLB miss handler: the TLB context reg- (a) ister, as depicted in Figure 3. At the time of a Virtual address for PTE user-level TLB miss, the context register con- PTEBase Virtual page number 0 tains the virtual address of the PTE that maps the faulting page. The system software loads (b) Load the top bits of the TLB context register, called Page table entry PTEBase. PTEBase represents the virtual base Page frame number Stat. bits address of the current process’s user page table. When a user address misses the TLB, hard- ware fills in the next bits of the context regis- Page frame number Stat. bits ter with the VPN of the faulting address. The (c) bottom bits of the context register are defined Figure 3. Using the TLB context register. The VPN of the to be zero (the R2000/3000 PTE is 4 , faulting virtual address (a) is placed into the context register the R10000 PTE is 8 bytes), so the faulting (b), creating the virtual address of the mapping PTE. This VPN is an index into the linear user page table PTE goes into EntryLo (c). structure and identifies the user PTE that maps the faulting address. The TLB miss handler can use the context cutes a TLBWR instruction, or it loads a slot register immediately, and the handler looks number into the and executes a much like the following (those who know the TLBWI instruction. Thus, the OS has the MIPS instruction set will notice that a few tools to implement a wide range of replace- NOPs have been omitted for clarity): ment policies. Periodic TLB flushes are unavoidable in mfc0 k0,tlbcxt #move the contents of TLB these MIPS processors, as there are 64 unique #context register into k0 context identifiers in the R2000/R3000 and mfc0 k1,epc #move PC of faulting load 256 in the R10000. Many systems have more #instruction into k1 active processes than this, requiring ASID shar- lw k0,0(k0) #load thru address that was ing and periodic remapping. When an ASID #inTLB context register is temporarily reassigned from one process to mtc0 k0,entry_lo #move the loaded value another, it is necessary to first flush TLB entries #into the EntryLo register with that ASID. It is possible to avoid flushing tlbwr #write entry into the TLB the cache by flushing the TLB; since the caches #at a random slot number are physically tagged, the new process cannot j k1 #jump to PC of faulting overwrite the old process’s data. #load instruction to retry rfe #RESTORE FROM Address translation and TLB-miss handling #EXCEPTION MIPS supports a simple bottom-up hierar- chical page table organization,1 though an OS The handler first moves the address out of is free to choose a different page table organi- the TLB context register into general-purpose zation. We describe the R2000/3000 transla- register k0, through which it can use the tion mechanism here; the R10000 mechanism address as a load target. The program is similar. The VPN of any page in a user’s of the instruction that caused the TLB miss is address space is also an index into the user page found in exception register epc and is moved table. On a user-level TLB miss, one can use into general-purpose register k1. The handler the faulting VPN to create a virtual address for loads the PTE into k0 then moves it directly the mapping PTE. Frequently, the OS will suc- into EntryLo. EntryHi is already filled in by cessfully load the PTE with this address, re- the hardware; it contains the faulting VPN and quiring only one memory reference to handle the ASID of the currently executing process a TLB miss. In the worst case, a PTE lookup (the one that caused the TLB miss). The will require an additional memory reference TLBWR instruction writes the PTE into the to look up the root-level PTE as well. TLB at a randomly selected location. MIPS offers a hardware assist for the soft- At this point, the mapping for the faulting

64 IEEE MICRO 64-bit virtual address 7-bit address (Sign-extended) 30-bit virtual page number 13-bit page offset space register ASID Cache index

8-Kbyte, 8-Kbyte 48-entry, fully virtually indexed, virtually-indexed, associative instruction TLB physically tagged, virtually-tagged 64-entry, fully direct-mapped direct-mapped associative data TLB data cache instruction cache

Tag compare 30-bit virtual page number 27-bit page frame number (cache hit/miss) ASID 7-bit address space register Tag: 27-bit page frame number Cache tag ASID 30-bit virtual page number Tag compare (cache hit/miss) Cache data

Figure 4. Alpha 21164 address translation mechanism. The virtual bits of the page offset effectively physically index each of the 8-Kbyte split caches. The data cache is physically tagged and the instruction cache is virtually tagged. The instruction TLB is accessed every reference to provide page protection. address is in the TLB, and the hardware can privileged access . For example, PAL- retry the faulting instruction by jumping to code, not OS code, loads PTEs from a hard- its address (previously moved from epc to k1). ware register (like EntryHi/EntryLo in MIPS) MIPS uses delayed branches, so the instruc- into the TLB. Though software manages the tion immediately after any branch instruction TLB, the TLB replacement policy in the executes with the branch. Thus, the Restore 21164 is fixed at not-most-recently-used— From Exception instruction executes before the OS does not define it. The 21164 sup- the target instruction of the jump. The ports a fixed 8-Kbyte page size, a 43-bit virtual Restore returns the processor (which runs in address, and a 40-bit physical address. kernel mode during exceptions) to user mode. Virtual and physical address spaces Alpha OSF/1, one of Alpha’s main OSs, divides Alpha (Figure 4) is a 64-bit architecture the virtual space into three regions: seg0 is the with a memory management design similar bottom half, seg1 is the top quarter, and kseg to MIPS. Applications generate 64-bit point- is the middle quarter. The regions are divid- ers, of which anywhere from 43 to 64 bits are ed by the two most significant bits of the translated, depending on the processor imple- translated address, as opposed to the MIPS mentation. Alpha implementations that sup- division, which is based on the high-order bits port less than 64 bits of addressing must of the 64-bit address, regardless of how many ensure that the high-order bits of a virtual bits the hardware translates. address are sign-extended from the most sig- Seg0 and seg1 comprise the user space and nificant translated bit. are mapped through the TLBs. Kseg maps The Alpha architecture has split TLBs and directly onto the physical address space by zero- split caches, and, as in the MIPS architecture, ing the top two bits. Note that the kernel may software manages the TLBs. However, the OS use seg0 and seg1 for its own virtual data struc- does not have direct access to the TLB but tures (the page tables). While the MIPS archi- indirect access through the PALcode—the tecture distinguishes between user-level virtual

JULY–AUGUST 1998 65 VIRTUAL MEMORY

32-bit effective address Superpages and shared memory The Alpha TLB entry Seg# 16-bit segment offset 12-bit page offset includes such fields as address- space-match, similar to the Segment registers (16 IDs × 24 bits) global bit in the MIPS TLB; if the bit is set, the TLB entry matches positively for all ASIDs. The flags also include granularity-hint (GH), a two- bit field that supports super- pages. GH indicates that the entry maps a set of 8GH con- tiguous pages, that is, a block 52-bit of 1, 8, 64, or 512 pages. virtual address The 21164 uses 7-bit ad- 24-bit segment ID 16-bit segment offset 12-bit page offset dress space identifiers; this scheme has the same need for flushing as the MIPS archi- 40-bit virtual page number tecture due to a small number of contexts. The 21164 split- cache sizes are 8 Kbytes each, 128-entry, two-way Split L1 cache, which is the processor’s page (instruction instruction TLB size, making the caches effec- 128-entry, two-way data TLB and data) each 16-Kbyte, four-way tively indexed by the physical set associative address. Thus the caches have the access time of a virtually Tag 20-bit page frame number compare indexed cache but the cache- (cache consistency behavior of a hit/miss) Tag: 20-bit page frame number physically indexed cache.

Cache data PowerPC The PowerPC (Figure 5) Figure 5. PowerPC 604 address translation mechanism. Each 16-Kbyte, four-way set-associative features a hardware-defined cache uses the page offset as a cache index, effectively making the cache physically indexed. A inverted page table structure, block address translation mechanism occurs in parallel with the TLB access, but is not shown. a hardware-managed TLB miss mechanism, block ad- dress translation for super- access and kernel-level virtual access by the top pages, and a 52-bit segmented virtual address bit of the virtual address, the Alpha makes no space for 32-bit implementations of the archi- such distinction. The OS manages the division tecture. The 64-bit PowerPC implementa- of the virtual space through the page tables. tions have an 80-bit segmented virtual address Unlike MIPS, which uses the top bits of the space. The block address translation mecha- virtual address to demarcate regions of differ- nism is not shown in the figure—it is per- ing memory behavior, Alpha’s physical mem- formed in parallel with the pictured ory is divided into regions of different behavior translation mechanism. It is similar to TLB to support features like cacheable and non- translation, except that the effective address, cacheable memory. Physical space is divided not the extended virtual address, references by the top two bits of the physical address into the entries. four regions, each with potentially different behavior. The behavior is left to the imple- Segmented address translation mentation to define; possibilities include “nor- PowerPC’s memory management uses mal” memory, I/O space, and noncached paged segmentation; the TLBs and page table uncacheable memory, among others. map an extended , not

66 IEEE MICRO Process 7 Process 4

Process 6 Process 0 One address space (4 Gbytes)

Process 4 Virtual segment (1 Gbyte) Process 1 Process 3 Process 3

Process 1 Process 2 (a) (b)

Figure 6. Difference between ASIDs and segmentation in two virtual addressing schemes. A 3-bit ASID extends the 32-bit address space (a). A 35-bit segmented address space is divided into fixed-length, 1-Gbyte segments (b). In both schemes, processes generate 32-bit virtual addresses, and the hardware translates a 35-bit virtual address. the application’s effective address space. Pow- while the number of processes in Figure 6b has erPC segments are 256-Mbyte continuous a combinatorial limit. regions of virtual space; 16 segments comprise The PowerPC does not provide explicit an application’s address space; and the global ASIDs; address spaces are protected through space contains 224 segments. the segment registers, which only the OS can Sixteen segment registers map the user modify. The OS enforces protection by con- address space onto the segmented global space. trolling the degree to which segment identi- The top four bits of an effective address index fiers may be overlapped and can even share a the segment registers to obtain a 24-bit segment segment safely between a small set of process- identifier. This segment ID replaces the top four es. This is in sharp contrast to most other shar- bits of the 32-bit effective address to form a 52- ing schemes in which processes and pages have bit virtual address. The 64-bit PowerPC imple- but one access ID each, and in which the only mentations use the top 36 bits of the effective way to support hardware access to shared address to index the segment registers. This cor- pages with different access IDs is to mark the responds to an enormous table that would never pages visible to all processes. fit in hardware. As a result, the segment map- The 24-bit-wide segment identifiers support ping in 64-bit implementations uses a segment over a million unique processes on a system. look-aside buffer (a fully associative cache much Therefore, required cache or TLB flushing like a TLB) rather than a hardware table. occurs very infrequently, assuming shared mem- Figure 6 compares ASIDs with PowerPC- ory is implemented through segmentation. like segmentation using fixed-length segments. In Figure 6a, a 32-Gbyte space is organized Block address translation into eight 4-Gbyte regions, each of which cor- To support superpages, the PowerPC defines responds to exactly one process determined by a block address translation (BAT) mechanism the process’s ASID. In Figure 6b, the global that operates in parallel with the TLB. The 32-Gbyte segmented space is an array of 1- BAT takes precedence over the TLB whenever Gbyte virtual segments. Each segment can the BAT mechanism signals a hit on any given map into any number of process address spaces translation. Blocks can range in size from 128 at any location (or at multiple locations) with- Kbytes to 256 Mbytes. The mechanism uses a in each process address space. The size of the fully associative lookup on four BAT registers, ASID (in this case, eight processes) limits the each of which is similar to a TLB entry. number of processes in the Figure 6a scenario, BAT registers contain information such as

JULY–AUGUST 1998 67 VIRTUAL MEMORY

TLB miss, the hardware loads an entire PTE 40-bit virtual page number 8 page table entries: group and performs a linear search for a a PTE group matching virtual address. If the PTE is not Hash found in the first PTE group, the hardware function performs a second lookup to a different PTE Index into the group, based on a secondary hash value (the hashed page table 8-bit one’s complement of the primary hash value). PTE Since the location of a PTE in the table N PTE groups in the page bears no relation to either the VPN or the table, where N is power of 2 and at least half the PFN in the mapping, both VPN and PFN number of physical pages must be stored in the PTE. Each PTE is there- fore 8 bytes wide. In 64-bit implementations, Hashed page table Physical memory each PTE is 16 bytes wide.

Figure 7. PowerPC hashed page table structure. PA-RISC 2.0 The PA-RISC 2.0 (Figure 8) is a 64-bit design with a segmentation mechanism sim- fine-grained protection (whether the block is ilar to a PowerPC’s. Processes generate 64-bit readable and/or writable) and storage access addresses that are mapped through space reg- control (whether the block is write-through isters to form 96-bit virtual addresses trans- and/or caching-inhibited, and/or needs to be lated by the TLB and page table. The maintained as cache-coherent). A BAT regis- architecture differs from the PowerPC by its ter provides no address-space protection addition of two types of page-access keys: the besides making a block available only to priv- protection ID, which identifies a running ileged (kernel) access. If a block is available to process, and the access ID, which identifies a a single user-level process, all user processes virtual page. The architecture allows user-level share it. In this case, if sharing is not desired, processes to change the contents of the space the BAT register contents need to be flushed registers, enabling an application to extend its on a context . The PowerPC 604 has address space at will. This user-level access to two sets of BAT registers: one for the instruc- the space registers creates the need for the tion stream and one for the data stream. additional protection mechanism. Hashed page table Spaces and space registers The PowerPC architecture defines a hashed Every 64-bit user address combines with a page table, pictured in Figure 7. A variation 64-bit space ID to create a 96-bit virtual on the inverted page table, it is an eight-way address that references the TLB and virtual set-associative software cache for PTEs,11 caches. The space IDs are held in eight space walked by hardware. The organization scales registers that are selected by fields within the well to a 64-bit implementation—inverted instruction or by the top bits of the effective tables scale with the size of the physical address address. As in the PowerPC, the space IDs are space, not the virtual address space. However, concatenated with the user’s effective address, since the table is not guaranteed to hold all but the mechanism is designed to allow a vari- active mappings, the OS must manage a back- able-size segment. up page table as well. Unlike the PowerPC segment identifier, The design is similar to the canonical which replaces the top four bits, the space ID inverted page table, except that it eliminates in the PA-RISC replaces anything from the the hash anchor table (which reduces the top 2 bits to the top 32 bits. The top 34 bits number of memory references by one), and of the 96-bit virtual address come from the the fixed-length PTE group replaces the col- space ID; the bottom 32 bits come from the lision chain. The table is eight PTEs wide. If user’s effective address; and the middle 30 bits more than eight VPNs hash to the same PTE are the logical OR of the 30-bit overlap of the group, any extra PTEs are simply left out or two. This allows the OS to define a segmen- placed into a secondary PTE group. On a tation granularity. A logical OR suggests a

68 IEEE MICRO 64-bit effective address 2 30-bit OR field 32-bit space offset

Space registers (8 IDs × 64 bits) User-modifiable space registers Not user-modifiable

96-bit virtual OR address 34-bit space identifier 30-bit logical OR 32-bit space offset

84-bit virtual page number Protection identifiers (8 IDs × 31 bits) TLB (for example, PA-8000 uses a four-entry instruction micro-TLB and a 96-entry, fully associative main TLB)

52-bit page frame number

Figure 8. PA-RISC 2.0 address translation mechanism. The architecture uses large virtual caches located off chip. simple and effective organization; one can portion of the global space. The naming mech- choose a partition (for example, half way, at anism—appending an ASID to every virtual the 15-bit mark), and allow processes to gen- address—prevents processes from generating erate only virtual addresses with the top 15 addresses corresponding to locations in other bits of this 30-bit field set to 0. Similarly, all address spaces. Even systems based on the seg- space IDs would have the bottom 15 bits set mentation scheme in Figure 6b can use nam- to 0, so that the logical OR would effectively ing for protection. For example, the PowerPC yield a concatenation. This example gives us segmentation mechanism allows any process a PowerPC segment-like concatenation of a to generate any address within the global space, (34 + 15 = 49-bit) space ID with the (32 + 15 but only the OS can put specific values into = 47-bit) user virtual address. the segment registers. This effectively restricts The top two bits of the effective address any given process from generating addresses select among four of the eight space registers. outside of its 32-bit space. The IA-32 seg- The rest of the registers are accessed rarely and mentation mechanism is very similar. only by the OS. The PA-RISC mechanism differs from other segmented architectures in a simple but impor- Naming, protection, and access keys tant way: it does not prevent a user-level appli- PA-RISC is unique in its address-space orga- cation from modifying the contents of the space nization and protection. In most other archi- registers. Thus, a process can produce any vir- tectures, address-space protection is guaranteed tual address in the 96-bit global space without through naming. That is, addresses are pro- requiring permission from the OS. The PA- tected from a process by preventing that RISC, unlike the other schemes, does not use process from generating the appropriate vir- naming to provide protection, and so it must tual address. For instance, in Figure 6a, provide protection through other means. The processes in ASID-based systems are prevent- PA-RISC solution is to tag every page with a 31- ed from generating addresses outside of their bit access identifier, roughly corresponding to

JULY–AUGUST 1998 69 VIRTUAL MEMORY

fronting a main page table of any organiza- 40-bit virtual page number tion, even a hierarchical table.12 HPT CRT UltraSPARC The UltraSPARC (Figure 10) is another 64- Index into the hashed page bit design. Like the MIPS and Alpha designs, table PTE each implementation does not recognize all of collisions the 64 bits; in the UltraSPARC-I the top 20 are chained bits must be sign-extended from the 44th bit. into the The implementation sets the size of the phys- collision ical address. The UltraSPARC memory man- resolution table agement organization is notable for the way it uses ASIDs, called ASIs in Sun terminology. Hashed page Physical ASIs translation table memory These are essentially opcodes to the MMU. Figure 9. PA-RISC hashed page translation table (HPT). When different VPNs The 8-bit ASI is not used to identify different produce the same hash value, the resulting PTE collisions are chained into contexts directly, but to identify data formats the collision resolution table (CRT), which can be a separate data structure. and privileges and to reference indirectly one of a set of context identifiers. The following are a few of the basic ASIs reserved and an ASID in other architectures. Instead of assign- defined by the UltraSPARC architecture; indi- ing a single ASID to every executing process, the vidual implementations may add more. PA-RISC assigns eight IDs; each running process has eight 31-bit protection identifiers Non-restricted: associated with it. A process may access a page if ASI_PRIMARY {_LITTLE} any of its protection IDs match the page’s access ASI_PRIMARY_NOFAULT {_LITTLE} ID. As in the PowerPC, this allows a small set of ASI_SECONDARY {_LITTLE} user-level processes to access a given page, which ASI_SECONDARY_NOFAULT {_LITTLE} differs markedly from the all-or-nothing shar- ing approach found in most processors. Restricted: ASI_NUCLEUS {_LITTLE} Hashed page translation table ASI_AS_IF_USER_PRIMARY {_LITTLE} While the architecture does not define a ASI_AS_IF_USER_SECONDARY {_LITTLE} particular page table organization, PA-RISC engineers have published a variant of the Only processes operating in privileged mode inverted page table.12 The hashed page trans- may generate the restricted ASIs; user-level lation table is pictured in Figure 9. As in the processes can generate nonrestricted ASIs. Any PowerPC, it dispenses with the hash anchor ASI labeled PRIMARY refers to the context table, eliminating one memory reference on ID held in the primary context register; any every PTE lookup; the table can hold more ASI labeled SECONDARY refers to the con- entries than physical pages in the system. The text ID held in the secondary context register; collision chain is held in the page table or in and any ASI labeled NUCLEUS refers to the a separate structure called the collision reso- context ID held in the nucleus context regis- lution table. The PFN cannot be deduced ter. The {_LITTLE} suffixes indicate that the from an entry’s location in the page table, so target of the load or store should be interpret- each PTE contains both the VPN and the ed as a little-endian object; otherwise the PFN for the mapped page. MMU treats the data as big-endian. The This mechanism is implemented in hard- NOFAULT directives tell the MMU to load ware in the PA-7200, which performs a single the data into the cache and to fail silently if the probe of the table in hardware and defers to data is not actually in memory. This function software if the initial probe fails. Thus, the can be used for speculative prefetching. hashed table can act as a software TLB11 The default ASI is ASI_PRIMARY; it indi-

70 IEEE MICRO ASI register 64-bit

8-bit ASI Sign-extended (20 bits) 31-bit virtual page number 13-bit page offset

31-bit virtual page number

14-bit data cache index 13-bit context ID 13-bit Virtually indexed, instruction physically tagged, cache index 64-entry, fully associative direct-mapped instruction TLB 16-Kbyte, L1 data cache 64-entry, fully associative data TLB Physically indexed, physically tagged, pseudo two-way set associative, 16-Kbyte Tag Page frame number compare L2 instruction cache (cache hit/miss) Tag: (top bits of) page frame number

Cache data

Figure 10. UltraSPARC-I address translation mechanism. The instruction cache is pseudo two-way set-associative (the most significant bit of the cache index is predicted), so only the page offset bits are needed to index the cache. The instruction cache is effectively indexed by the physical address, and there are no cache synonym problems. The direct-mapped data cache is twice the page size; it is indexed by the virtual address, and synonym problems can occur. As a solution, the OS aligns all aliases on 16-Mbyte boundaries, which is the largest L2 cache the architecture supports. cates to the MMU that the current user-level ence data in its own address space. process is executing. When the OS executes, Processes can generate ASIs either directly or it runs as ASI_NUCLEUS and can peek into indirectly. Certain instructions, such as the the user’s address space by generating PRI- Load/Store From Alternate Space as well as MARY or SECONDARY ASIs (the atomic memory-access instructions like Swap AS_IF_USER ASIs). Register Memory and Compare and Swap System software can move values in and out Word, each contain an ASI in an immediate of the MMU’s context registers that hold the field, or reference the ASI register. These instruc- PRIMARY and SECONDARY context IDs. tions directly specify an ASI whenever an imme- This supports shared memory and resembles diate bit is zero, indicating that the ASI is to be the PA-RISC space IDs and protection IDs taken from another immediate field within the in its ability to give applications simultaneous instruction. When the bit is 1, the ASI is taken access to multiple protection domains. from the ASI register. Other instructions can For instance, user-level OS servers that peek only generate ASI_PRIMARY, the default ASI. into the address spaces of normal processes can use the ASI_SECONDARY identifier. A user- IA-32 level dynamic or server implementing a The Intel architecture (Figure 11, next particular OS API, as in or Windows page) is one of the older memory manage- NT, must set up and manage the address spaces ment architectures manufactured today. It is of processes running under its environment. It also one of the most widely used and one of could run as PRIMARY and move the context the most complex. Its design is an amalgam ID of the child process into the secondary con- of several techniques, making it difficult to text register. It could then explicitly use SEC- cover the entire organization briefly. We there- ONDARY ASIs to load and store data values to fore complement the description of Intel’s sys- the child’s address space, and use PRIMARY tem call and protection mechanisms in ASIs to execute its own instructions and refer- Hennessy and Patterson13 by describing the

JULY–AUGUST 1998 71 VIRTUAL MEMORY

tecture with no explicit ASIDs. For perfor- 48-bit logical address mance reasons, its segmentation mechanism 16-bit segment selector 32-bit segment offset is often unused by today’s operating systems, Local and global which typically flush the TLBs on context descriptor tables switch to provide protection. The caches are physically indexed and tagged and therefore 32-bit linear address need no flushing on , provided 20-bit virtual page number 12-bit page offset the TLBs are flushed. The location of the root page table is loaded into one of a set of control registers, CR3, and on a TLB miss the hardware walks the table to 32-entry, refill the TLB. If every process has its own four-way 16-Kbyte, four-way page table, the TLBs are guaranteed to con- instruction TLB data cache 64-entry, 16-Kbyte, four-way tain only entries belonging to the current four-way data TLB instruction cache process—those from the current page table— if the TLBs are flushed and the value in CR3 Tag changes on context switch. Shared memory is Page frame number compare often implemented by aliasing—duplicating (cache 1 hit/miss) Tag: page frame number mapping information across page tables. Writing to the CR3 flush- Cache data es the entire TLB; during a context switch, the hardware writes to CR3, so flushing the Figure 11. The Pentium II address translation mechanism. Each 16-Kbyte, TLB is part of the hardware-defined context- four-way set-associative cache uses the page offset as a cache index, effec- switch protocol. Globally shared pages (pro- tively making the cache physically indexed. The architecture defines sepa- tected kernel pages or library pages) can have rate block TLBs to hold superpage mappings, but they are not shown. the global bit of their PTE set. This bit keeps the entries from being flushed from the TLB; an entry so marked remains indefinitely in the Pentium II’s address translation mechanism. TLB until it is intentionally removed. The Pentium II memory management design features a segmented 32-bit address Segmented addressing space, split TLBs, and a hardware-managed The IA-32 segmentation mechanism sup- TLB miss mechanism. Processes generate 32- ports variable-size (1- to 4-Gbyte) seg- bit near pointers and 48-bit far pointers that ments; the size is set by software and can differ are mapped by the segmentation mechanism for every segment. Unlike other segmented onto a 32-bit linear address space. The linear schemes, in which the global space is much address space is mapped onto the physical larger than each process’s virtual space, the IA- space through the TLBs and hardware-defined 32 global virtual space (the linear address space) page table. The canonical two-tiered hierar- is the same size or, from one viewpoint, small- chical page table uses a 4-Kbyte root-level er than an individual user-level address space. table, 4-Kbyte PTE pages, and 4-byte PTEs. Processes generate 32-bit addresses that are The processor supports 4-Kbyte and 4- extended by 16-bit segment selectors. Two Mbyte page sizes, as well as 2-Mbyte super- bits of the 16-bit selector contain protection pages in some modes. Physical addresses in information, 13 bits select an entry within a the IA-32 are 32 bits wide, though the Pen- software descriptor table (similar to the Pow- tium II supports an even larger physical erPC segment registers or the PA-RISC space address through its physical address extension registers), and the last bit chooses between two or 36-bit page-size extension modes. In either different descriptor tables. of these modes, physical addresses produced Conceptually, an application can access sev- by the TLB are 36 bits wide. eral thousand segments, each of which can range from 1 byte to 4 Gbytes. This may seem to Protection and shared memory imply an enormous virtual space, but a process’s The Pentium II is another segmented archi- address space is actually 4 Gbytes. During

72 IEEE MICRO address generation, the segment’s base address is that they hold the IDs of segments that com- added to the 32-bit address the process gener- prise a process’s address space. They differ in ates. A process actually has access to several that they are referenced by context rather than thousand segments, each of which ultimately by a field in the virtual address. Instruction lies completely within the 4-Gbyte linear fetches implicitly reference CS, the register address space. The processor cannot distinguish holding the code segment selector. Any stack more than four unique Gbytes of data at a time; references (Push or Pop instructions) use SS, it is limited by the linear address space. the register holding the stack segment selec- The segment selector indexes the global and tor. Destinations of string instructions like local descriptor tables. The entries in these MOVS, CMPS, LODS, or STOS, implicitly tables are called segment descriptors and con- use ES, one of the four registers holding data tain information including the segment’s segment selectors. Other data references use length, protection parameters, and location DS by default, but applications can override in the linear address space. A 20-bit limit field this explicitly if desired, making available the specifies the segment’s maximum legal length remaining data-segment registers FS and GS. from 1 byte to 4 Gbytes. A granularity bit A near pointer references a location within determines how the 20-bit limit field is to be one of the currently active segments, that is, interpreted. If the granularity bit is clear, the the segments whose selectors are in the six seg- limit field specifies a maximum length from ment registers. On the other hand, the appli- 1 byte to 1 Mbyte in increments of 1 byte. If cation may reference a location outside the the granularity bit is set, the limit field speci- currently active segments by using a 48-bit far fies a maximum length from 4 Kbytes to 4 pointer. This loads a selector and corre- Gbytes in increments of 4 Kbytes. sponding descriptor into a segment register; The segment descriptor also contains a 2-bit then, the segment is referenced. Near point- field specifying one of four privilege levels ers are less flexible but incur less overhead than (highest is usually reserved for the OS kernel, far pointers (one fewer memory reference), lowest for user-level processes, and intermedi- and so they tend to be more popular. ate levels are for OS services). Other bits indi- If used properly, the IA-32 segmentation cate fine-grained protection, whether the mechanism could provide address-space pro- segment is allowed to grow (for example, a stack tection, obviating the need to flush the TLBs segment), and whether the descriptor allows on context switch. Protection is one of the interprivilege-level transfers. These transfers original intents of segmentation.3 The seg- support special functions such as task switch- ments guarantee protection if the 4-Gbyte lin- ing, calling exception handlers, calling inter- ear address space is divided among all rupt handlers, and accessing shared libraries or processes, in the way that the PowerPC divides code within the OS from the user level. its 52-bit virtual space among all processes. However, 4 Gbytes is an admittedly small Segment registers and pointers amount of space in which to work. For improved performance, the hardware caches six of a process’s segment selectors and Superpages and large physical addresses descriptors in six segment registers. The IA- Note that an entry in the IA-32 root-level 32 divides each segment register into “visible” table maps a 4-Kbyte PTE page, which in turn and “hidden” parts. Software can modify the maps 1,024 4-Kbyte user pages. Thus, a root- visible part, the segment selector. The soft- level PTE effectively maps a 4-Mbyte region. ware cannot modify the hidden part, the cor- IA-32 supports a simple superpage mecha- responding segment descriptor. Hardware nism: a bit in each root-level PTE determines loads the corresponding segment descriptor whether the entry maps a 4-Kbyte PTE page from the local or global descriptor table into or a 4-Mbyte user-level superpage directly. the hidden part of the segment register when- This reduces contention for TLB entries, and ever a new selector is placed into the visible when servicing a TLB miss for a superpage, part of the segment register. the hardware makes only one memory refer- The segment registers are similar to the seg- ence to obtain the mapping PTE, not two. ment registers of the PowerPC architecture in Normally, physical addresses in IA-32 are

JULY–AUGUST 1998 73 VIRTUAL MEMORY

32 bits. P6 processors offer two modes in cannot be masked with a software layer. It is which the physical address space is larger than the hardware details that are most difficult to 4 Gbytes: physical address extension and 36- encapsulate in a layer bit page size extension. In either of these that cause most of the problems, for example, modes, physical addresses can be 36 bits wide. The physical address extension mode requires • the use of virtual vs. physical caches changes to the page table format. The PTEs • protection methods (ASIDs, segments, expand to 8 bytes when this mode is enabled multiple IDs) (double their normal size), and so the mode • address-space organization (segmented/ requires the addition of another level in the flat, page table support, superpage page table. On every TLB miss, the hardware support). must make three memory references to find the mapping PTE, not two. The mode also These hardware features force designers to changes the semantics of the IA-32 superpage make decisions about the high-level organiza- mechanism. In the reorganized IA-32 page tion of the OS. Designers must make these table, a root-level PTE only maps a 2-Mbyte decisions early on in the design process, and region; therefore superpages are 2 Mbytes the decisions are difficult to undo when the OS when the physical address extension mode is has been implemented. For example, a virtual enabled, not 4 Mbytes. cache forces the OS to be aware of data-con- To limit confusion, designers added anoth- sistency issues. The hardware protection mech- er mode that requires no changes to the page anism influences the OS’s implementations of table or PTE size. The 36-bit page-size exten- interprocess communication, memory- sion mode uses previously reserved bits in the mapped files, and shared memory. The address- PTEs to extend physical addresses. Therefore, space organization mechanisms influence the with this mode the OS can support 4-Mbyte OS’s model of object allocation—how easy is it superpages and 36-bit physical addressing at to allocate and destroy large/small objects, the same time. whether objects can have different characteris- tics (such as protection, cache coherency) based he use of one memory management on the process accessing them, and how sparse- Torganization over another has not cata- ly populated the address space can be. Clearly, pulted any architecture to the top of the per- it is very difficult to encapsulate these effects in formance ladder, nor has the lack of any a transparent software layer. memory management function been the lead- We thus have two choices: to live with ing cause of an architecture’s downfall. So, diversity that serves no significant purpose or while it may seem refreshing to have so many to standardize on support for memory man- choices of VM interface, the diversity serves agement. Within standardization, there are little purpose other than to impede the port- further choices, including the elimination of ing of system software. most hardware support for memory manage- Designers have recognized this and have ment so that the software can define support,14 defined hardware abstraction layers8,9 to hide or decision by fiat such as the adoption of the hardware details from the OS; this simplifies already de facto standard, the IA-32 (or a sub- both the transition to new versions of hard- set thereof). The advantage of adopting the ware platforms and the porting to entirely new IA-32 interface is the large number of hard- architectures. Primitives in a memory- ware and software developers already familiar management abstraction layer include cre- with the interface, as well as its relatively good ate/destroy_map, protect_page/region, performance. The disadvantage is that it wire_down_page/region, and so on. The types might not scale well to 64-bit architectures. If of mechanisms hidden by this software layer one were starting with a clean slate, there is include TLB and data cache management evidence to recommend a hardware-walked instructions. However, these mechanisms inverted page table,16 which would resemble cause little problem when porting system soft- both IA-32 and PA-RISC, and would offer ware; there are other underlying models that good performance and scale well to 64-bit are so different from one another that they address spaces and beyond. MICRO

74 IEEE MICRO Acknowledgments IEEE Computer Society, Los Alamitos, Calif., We thank the many reviewers who helped May 1993, pp. 39-50. shape this article, especially Joel Emer, Jerry 13. J.L. Hennessy and D.A. Patterson, Comput- Huck, Mike Upton, and Robert Yung for er Architecture: A Quantitative Approach, their comments and insights into the work- 2nd ed., Morgan Kaufmann Publishers, Inc., ings of the Alpha, PA-RISC, IA-32, and San Francisco, Calif., 1996. SPARC architectures. The Defense Advanced 14. B.L. Jacob and T.N. Mudge, “Software-Man- Research Projects Agency under DARPA/ aged Address Translation,” Proc. HPCA-3, ARO contract DAAH04-94-G-0327 partially IEEE CS Press, Feb. 1997, pp. 156-167; http:// supported this work. computer.org/conferen/hpca97/77640156.pdf. 15. B.L. Jacob and T.N. Mudge, “A Look at Sev- References eral Memory Management Units, TLB-Refill 1. B.L. Jacob and T.N. Mudge, “Virtual Memo- Mechanisms, and Page Table Organizations,” ry: Issues of Implementation,” Computer, ASPLOS-8, ACM, Oct. 1998, to appear. Vol. 31, No. 6, June 1998, pp. 33-43. 2. D.W. Clark and J.S. Emer, “Performance of Bruce Jacob is an assistant professor of elec- the VAX-11/780 Translation Buffer: Simula- trical and computer engineering at the Uni- tion and Measurement,” ACM Trans. Com- versity of Maryland, College Park. His present puter Systems, ACM, New York, Vol. 3, No. research includes the design of memory man- 1, Feb. 1985, pp. 31-62. agement organizations and hardware archi- 3. E.I. Organick, The System: An Exam- tectures for real-time and embedded systems. ination of Its Structure, MIT Press, Cam- Jacob received the AB degree in mathematics bridge, Mass., 1972. from Harvard University, and the MS and PhD 4. Pentium Processor User’s Manual, Intel Cor- degrees in computer science and engineering poration, Mt. Prospect, Ill., 1993. from the University of Michigan, Ann Arbor. 5. G. Kane and J. Heinrich, MIPS RISC Archi- He is a member of the IEEE and the ACM, and tecture, Prentice-Hall, Englewood Cliffs, has authored papers on , N.J., 1992. analytical modeling, distributed systems, astro- 6. J.E. Smith, G.E. Dermer, and M.A. Gold- physics, and algorithmic composition. smith, Computer System Employing Virtual Memory, patent 4,774,659, US Patent Trevor Mudge is a professor of electrical engi- Office, Wash., D.C., Sep. 1988. neering and computer science at the Univer- 7. D. Nagle et al., “Design Trade-Offs for Soft- sity of Michigan, Ann Arbor, and director of ware-Managed TLBs,” ACM Trans. Com- the Advanced Computer Architecture Labo- puter Systems, Vol. 12, No. 3, Aug. 1994, ratory, a group of eight faculty and 70 gradu- pp. 175-205. ate research assistants. He also consults for 8. R. Rashid et al., “Machine-Independent Vir- several computer companies. His research tual Memory Management for Paged interests include computer architecture, com- Uniprocessor and Multiprocessor Architec- puter-aided design, and compilers. tures,” IEEE Trans. , Vol. 37, No. Mudge received the BSc degree in cyber- 8, Aug. 1988, pp. 896-908. netics from the University of Reading, Eng- 9. H. Custer, Inside Windows NT, land, and the MS and PhD degrees in Press, Redmond, Wash., 1993. computer science from the University of Illi- 10. J. Liedtke, “On Micro-Kernel Construction,” nois, Urbana. He is an associate editor for SOSP-15, ACM, Dec. 1995, pp. 237-250. IEEE Transaction on Computers and ACM 11. K. Bala, M.F. Kaashoek, and W.E. Weihl, Computing Surveys, a Fellow of the IEEE, and “Software Prefetching and Caching for a member of the ACM, the IEE, and the Translation Lookaside Buffers,” Proc. OSDI- British Computer Society. 1, Usenix Corporation, Berkeley, Calif., Nov. 1994, pp. 243-253. Direct comments about this article to Bruce 12. J. Huck and J. Hays, “Architectural Support Jacob, Dept. of Electrical and Computer for Translation Table Management in Large Engineering, University of Maryland, College Address Space Machines,” Proc. ISCA-20, Park, MD 20742; [email protected].

JULY–AUGUST 1998 75