SpecTLB: A Mechanism for Speculative Address Translation

Thomas W. Barr, Alan L. Cox, Scott Rixner

Rice University, Houston, TX
{twb, alc, rixner}@rice.edu

ABSTRACT

Data-intensive computing applications are using more and more memory and are placing an increasing load on the virtual memory system. While the use of large pages can help alleviate the overhead of address translation, they limit the control the operating system has over memory allocation and protection. We present a novel device, the SpecTLB, that exploits the predictable behavior of reservation-based physical memory allocators to interpolate address translations.

Our device provides speculative translations for many TLB misses on small pages without referencing the page table. While these interpolations must be confirmed, doing so can be done in parallel with speculative execution. This effectively hides the execution latency of these TLB misses. In simulation, the SpecTLB is able to overlap an average of 57% of page table walks with successful speculative execution over a wide variety of applications. We also show that the SpecTLB outperforms a state-of-the-art TLB prefetching scheme for virtually all tested applications with significant TLB miss rates. Moreover, we show that the SpecTLB is efficient, since mispredictions are extremely rare, occurring in less than 1% of TLB misses. In essence, the SpecTLB effectively enables the use of small pages to achieve fine-grained allocation and protection, while avoiding the associated latency penalties of small pages.

Categories and Subject Descriptors

C.0 [General]: Modelling of computer architecture; C.4 [Performance of systems]: Design studies; D.4.2 [Operating Systems]: Virtual Memory

General Terms

Performance, Design

Keywords

TLB, Memory Management

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISCA'11, June 4–8, 2011, San Jose, California, USA.
Copyright 2011 ACM 978-1-4503-0472-6/11/06 ...$10.00.

1. INTRODUCTION

The memory capacity of modern computers is growing at a much faster rate than the size of translation-lookaside buffers (TLBs). Therefore, TLB misses are becoming more frequent, and address translation is an increasingly significant performance bottleneck. Recent work has shown that the impact of address translation on overall system performance ranges from 5-14% even for nominally sized applications in a non-virtualized environment [5]. Larger applications have even higher overhead, approaching 50% in some cases [17].

Virtualization compounds address translation overhead. Nested paging, a popular mechanism for hardware-supported memory virtualization, increases overhead dramatically. Translating a single guest virtual address to a machine physical address can take as many as twenty-four memory accesses on x86-64. Consequently, the overhead of virtual memory increases to up to 89% for real-world workloads when using nested paging [5].

The use of large pages reduces the performance overhead of virtual memory by increasing TLB coverage. Each entry in the TLB that holds a large page covers a larger region of virtual memory, so the entire TLB is able to translate a larger region of the address space. However, this increased coverage does not come for free. The page is the unit of a program's address space that the operating system can allocate and protect. The operating system must allocate an entire page of physical memory at a time, regardless of how much of that space will be used by the application. Always allocating large pages leads to excessive physical memory use if a program fragments its virtual memory use. More critically, application permissions (read/write/execute) must be consistent across an entire page, large or small. Finally, the operating system can only tell whether memory has been modified or accessed at a page granularity. It uses this facility to know when to write back portions of memory mapped files. This means that if large pages are used for memory mapped files, the operating system must write back an entire large page even if only a single byte was modified.

On x86-64, pages are only available in three sizes: 4KB, 2MB and 1GB. The large gap between these sizes makes page size selection critical. Using too small of a page overburdens the TLB, while using too large of a page can waste physical memory and incur high I/O overhead. The proper page size for a region of memory may change from application to application or even from execution to execution. Therefore, some modern operating systems take an automatic approach to selecting page size.

FreeBSD's reservation-based physical memory allocator uses small pages for all memory by default. After a program uses every 4KB page within an entire 2MB region of virtual memory, that contiguous region is promoted to a large page. To prepare for this, the operating system places small pages that might be promoted into large page reservations.

In a reservation, 4KB pages are aligned within a 2MB region of physical memory corresponding to their alignment within their 2MB region of virtual memory. This means that within a reservation, consecutive virtual pages will also be consecutive physical pages [19].

Architectures must evolve to meet the demands of modern operating systems [18]. The challenges involved in address translation require innovative architectural solutions that provide support for such reservation-based physical memory allocators. In this paper, we present the SpecTLB, a novel TLB-like structure that exploits the contiguity and alignment generated by a reservation-based physical memory allocator to interpolate address translations based on the physical addresses of nearby pages. While the underlying page table must still be walked to verify these speculative translations, this walk can be done concurrently with speculative memory access and execution. This new capability allows the operating system to maintain fine-grained protection and allocation over memory while hiding the latency of the resulting TLB misses.

We show that the SpecTLB is able to eliminate the latency penalty of a majority of TLB misses with an unmodified version of FreeBSD. However, one of the key contributions of the SpecTLB is its ability to achieve large-page-like performance when large pages are impractical to use, such as under virtualization. Traditional hypervisors implement I/O by marking pages of the guest physical address space that contain memory-mapped I/O as unavailable. When the guest system accesses them, the hypervisor is invoked and emulates the I/O device. Ideally, the physical memory space of a virtualized guest would be stored as a small number of 1GB pages. However, the guest physical pages that the guest operating system will use for I/O are fixed, defined by the x86 architecture itself. Therefore, the hypervisor must have fine-grained protection control over those pages. With a speculative TLB, this space can be stored in a 1GB reservation with both data and emulated I/O pages mixed together. All accesses are made speculatively, as if they were to a data page. Therefore, all guest physical memory accesses can proceed without blocking for a TLB miss. If the address turns out to be part of a data region, the speculative work is committed. However, if an address is part of an emulated I/O region, the speculative execution is not committed, and the hypervisor is invoked as before.

We evaluate the SpecTLB using a microarchitecture-independent simulation. We quantify how often TLB misses are predictable and how many MMU-related memory accesses can be parallelized with our system. Our results show that the SpecTLB can remove 40% of the high-latency DRAM accesses related to memory management from the critical path of execution. Moreover, prediction accuracy is very high, exceeding 99% for most benchmarks. Finally, a SpecTLB only needs to be of moderate size, approaching its maximum hit rate with tens of entries on our benchmarks.

Additionally, we show that the SpecTLB compares well against TLB prefetching. We simulate a previously described TLB prefetching scheme [12] and show that our system adapts more readily to applications with random access patterns and performs unused speculative work less frequently.

The rest of this paper is organized as follows. Section 2 describes the format of the x86-64 page table and the operation of a reservation-based physical memory manager. Section 3 discusses the design of the SpecTLB. Section 4 describes our simulator. Sections 5 and 6 discuss our simulation results for applications running natively and under nested paging, respectively. Section 7 discusses related work. Finally, we conclude in Section 8.

Figure 1: Decomposition of the x86-64 virtual address. (Bits 63:48 are the sign extension, se; 47:39 the L4 index; 38:30 the L3 index; 29:21 the L2 index; 20:12 the L1 index; and 11:0 the page offset.)

2. BACKGROUND

The SpecTLB is intricately tied to the operation of address translation and physical memory allocation. This section provides background information that explains how traditional x86 address translation works, how address translation is extended to handle virtualization, and how a reservation-based physical memory allocator operates. This background provides context to help explain how the SpecTLB is able to improve performance by hiding the latency of TLB misses.

2.1 x86 Address Translation

All x86 processors since the Intel 80386 have used a radix tree to record the mapping from virtual to physical addresses. Although the depth of this tree has increased to accommodate larger physical and virtual address spaces, the procedure for translating virtual addresses to physical addresses using this tree is essentially unchanged. A virtual address is split into a page number and a page offset. The page number is further split into a sequence of indices. The first index is used to select an entry from the root of the tree, which may contain a pointer to a node at the next lower level of the tree. If the entry does contain a pointer, the next index is used to select an entry from this node, which may again contain a pointer to a node at the next lower level of the tree. These steps repeat until the selected entry is either invalid (in essence, a NULL pointer indicating there is no valid translation for that portion of the address space) or the entry instead points to a data page using its physical address. In the latter case, the page offset from the virtual address is added to the physical address of this data page to obtain the full physical address. In a simple memory management unit (MMU) design, this procedure requires one memory access per level in the tree.

Figure 1 shows the precise decomposition of a virtual address by x86-64 processors [1]. Standard x86-64 pages are 4KB, so there is a single 12-bit page offset. The remainder of the 48-bit virtual address is divided into four 9-bit indices, which are used to select entries from the four levels of the page table. The four levels of the x86-64 page table are named PML4 (Page Map Level 4), PDP (Page Directory Pointer), PD (Page Directory) and PT (Page Table). In this paper, however, for clarity, we will refer to these levels as L4 (PML4), L3 (PDP), L2 (PD) and L1 (PT). Finally, the 48-bit virtual address is sign extended to 64 bits (the se field). As the virtual address space grows, additional index fields (e.g., L5) may be added, reducing the size of the sign extension.

An entry in the page table is 8 bytes in size regardless of its level within the tree. Since a 9-bit index is used to select an entry at every level of the tree, the overall size of a node is always 4KB, the same as the page size. Hence, nodes are commonly called page table pages. The tree can be sparsely populated with nodes—if, at any level, there are no valid virtual addresses with a particular 9-bit index, the sub-tree beneath that index is not instantiated. For example, if there are no valid virtual addresses with L4 index 0x03a, that entry in the top level of the page table will indicate so, and the 262,657 page table pages (1 L3 page, 512 L2 pages, and 262,144 L1 pages) beneath that entry in the radix tree page table will not exist. This yields significant memory savings.
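The field boundaries of Figure 1 can be illustrated with a short sketch (ours, not the paper's; the function name and output formatting are invented for the example):

```python
# Sketch: decompose an x86-64 virtual address into the fields of Figure 1.
# Bit positions follow the text: 47:39 (L4), 38:30 (L3), 29:21 (L2),
# 20:12 (L1), 11:0 (page offset), with bits 63:48 the sign extension.

def decompose(va: int) -> dict:
    """Split a 48-bit (sign-extended to 64) virtual address into indices."""
    return {
        "se":     (va >> 48) & 0xFFFF,   # sign extension of bit 47
        "l4":     (va >> 39) & 0x1FF,    # 9-bit index into PML4 (L4)
        "l3":     (va >> 30) & 0x1FF,    # 9-bit index into PDP  (L3)
        "l2":     (va >> 21) & 0x1FF,    # 9-bit index into PD   (L2)
        "l1":     (va >> 12) & 0x1FF,    # 9-bit index into PT   (L1)
        "offset": va & 0xFFF,            # 12-bit page offset
    }

# The paper's running example, 0x0000 5c83 15cc 2016, decomposes to
# (0b9, 00c, 0ae, 0c2, 016):
f = decompose(0x00005C8315CC2016)
print(", ".join(f"{f[k]:03x}" for k in ("l4", "l3", "l2", "l1", "offset")))
# prints: 0b9, 00c, 0ae, 0c2, 016
```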

Figure 2: An example page walk for virtual address (0b9, 00c, 0ae, 0c2, 016). Each page table entry stores the physical page number of either the next lower level page table page (for L4, L3, and L2) or the data page (for L1). Only 12 bits of the 40-bit physical page number are shown in these figures for simplicity.

Figure 3: The two-dimensional page walk used in nested paging. Each intermediate guest physical address must be translated through the hypervisor page table.

Large portions of the 256 TB virtual address space are never allocated for typical applications.

Figure 2 illustrates the radix tree page table walk for the virtual address 0x0000 5c83 15cc 2016. For the remainder of the paper, such 64-bit virtual addresses will be denoted as (L4 index, L3 index, L2 index, L1 index, page offset) for clarity. In this case, the virtual address being translated is (0b9, 00c, 0ae, 0c2, 016). Furthermore, for simplicity of the examples, only 3 hexadecimal digits (12 bits) will be used to indicate the physical page number, which is actually 40 bits in x86-64 processors.

As shown in the figure, the translation for this address proceeds as follows. First, the page walk hardware must locate the top-level page table page, which stores L4 entries. The physical address of this page is stored in the processor's CR3 register. To translate the address, the L4 index field (9 bits) is extracted from the virtual address and appended to the physical page number (40 bits) from the CR3 register. This yields a 49-bit physical word address that is used to obtain the appropriate 8-byte L4 entry (offset 0b9 in the L4 page table page in the figure). The L4 entry may contain the physical page number of an L3 page table page (in this case, 042), in which case the process is repeated by extracting the L3 index field from the virtual address and appending it to this physical page number to obtain the appropriate L3 entry. This process repeats until the selected entry is invalid or specifies the physical page number of the actual data in memory, as shown in the figure. Each page table entry along this path is highlighted in grey in the figure. The page offset from the virtual address is then appended to this physical page number to yield the data's physical address. Note that since page table pages are always aligned on page boundaries, the low-order bits of the physical addresses of the page table pages are not stored in the entries of the page table.

Given this structure, the current 48-bit x86-64 virtual address space requires four memory references to "walk" the page table from top to bottom to translate a virtual address (one for each level of the radix tree page table). As the address space continues to grow, more levels will be added to the page table, further increasing the cost of address translation. A full 64-bit virtual address space will require six levels, leading to six memory accesses per translation.

Alternatively, an L2 entry can directly point to a contiguous and aligned 2MB data page instead of pointing to an L1 page table page. In Figure 2, virtual address (0b9, 00d, 0bd, 123f5d7) is within a large page. This large-page support greatly increases maximum TLB coverage. In addition, it lowers the number of memory accesses needed to locate one of these pages from four to three. Finally, it greatly reduces the total number of page table entries required, since each entry maps a much larger region of memory. Similarly, an L3 entry can directly point to a 1GB page.

2.2 Nested Paging

To reduce virtualization overhead, AMD and Intel have implemented nested paging on recent processors [5, 26]. Nested paging uses two sets of page tables to translate the virtual addresses used by programs in guest virtual machines into host physical addresses. A process running in the guest accesses memory through a guest virtual address. This is first translated into an intermediate "guest physical address" through the guest's page table. Then, the guest physical address is itself translated into a host physical address using the hypervisor's page table.

On a TLB miss, the guest virtual address must first be translated to a guest physical address, then to a host physical address. This process is complicated by the fact that the guest page table uses guest physical addresses to point to each lower level of the page table. These references must themselves be translated. An example of this is shown in Figure 3. The guest page table walk is shown vertically, starting with the guest CR3 register in the upper left and proceeding downwards. However, each pointer in the guest page table must be translated through the hypervisor page table. These translations are shown as the five horizontal page table walks (hL4 through hL1). Overall, twenty-four accesses to various page tables are required per guest TLB miss.

To reduce the number of page table references required, AMD introduced a Nested TLB (NTLB). This caches the results of the hypervisor translations from guest to host physical addresses for the first four translations, those for the guest page table, allowing a nested page walk to be skipped on an NTLB hit, as shown in Figure 4.
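The native walk described in Section 2.1, including the 2MB large-page shortcut at L2, can be sketched as follows. This is a minimal model using nested dictionaries as page table pages; the entry layout and the final physical page numbers are hypothetical, not the grey entries of Figure 2:

```python
# Sketch of the four-level x86-64 radix tree walk (illustrative, not the
# paper's simulator). An L2 entry may map a 2MB page directly.

PAGE_SHIFT, INDEX_BITS = 12, 9

def walk(cr3_root, va):
    """Return the physical address for va, or None if unmapped."""
    indices = [(va >> (PAGE_SHIFT + INDEX_BITS * lvl)) & 0x1FF
               for lvl in (3, 2, 1, 0)]          # L4, L3, L2, L1 indices
    node = cr3_root
    for level, idx in zip((4, 3, 2, 1), indices):
        entry = node.get(idx)                     # one memory access
        if entry is None:
            return None                           # invalid entry: page fault
        if level == 2 and entry.get("large"):     # L2 entry maps a 2MB page
            return (entry["ppn2m"] << 21) | (va & 0x1FFFFF)
        if level == 1:                            # L1 entry maps a 4KB page
            return (entry["ppn"] << PAGE_SHIFT) | (va & 0xFFF)
        node = entry["next"]                      # descend one level

# A toy page table mapping the paper's example address through four levels,
# plus a hypothetical 2MB large page at (0b9, 00d, 0bd); ppns are made up.
table = {0xB9: {"next": {
    0x0C: {"next": {0xAE: {"next": {0xC2: {"ppn": 0x0DA}}}}},
    0x0D: {"next": {0xBD: {"large": True, "ppn2m": 0x07}}},
}}}
```

A lookup such as `walk(table, 0x00005C8315CC2016)` performs one dictionary access per level, mirroring the one-memory-access-per-level cost of a simple MMU.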

Figure 4: An example of a typical nested page walk.

The NTLB does not cache the fifth translation, since those translations are already cached by the normal system TLB, which is far larger than the NTLB.

The cost of the fifth hypervisor translation can be reduced by using large pages in the hypervisor page table. This reduces the number of memory or cache references from four to three. Ideally, a hypervisor would use a page size that is close in magnitude to the size of the entire guest physical memory space, such as the 1GB page available in x86-64. However, the hypervisor emulates guest I/O devices using permissions in the hypervisor page table. This precludes the use of very large pages because memory-mapped I/O, including devices using fixed addresses within the "ISA hole", must be emulated by the hypervisor using permission bits, which can only be set at a page granularity.

2.3 FreeBSD Reservation System

The ability to dynamically select the optimal page size is important because incorrect page size selection (either too big or too small) can have significant execution time overhead. For example, in 2009, Google identified "dynamic huge pages, which can be assembled and broken down on demand" as a key objective for their Linux kernel work [8]. Allocating a large page when only a small part of it will be used has the obvious effect of wasting physical memory. However, a more important side-effect is increased I/O traffic. Since the OS can only track application memory use at a page granularity, it must write back an entire page to disk when a memory mapped file is modified. The overhead of writing back two megabytes to disk when only a single byte has changed can overwhelm the performance gained by using large pages [19].

One approach to dynamically selecting page size is a reservation-based physical memory allocator, as first suggested by Talluri and Hill [24]. Navarro et al. [19] extended this idea to a practical memory allocation system and implemented it under FreeBSD. For example, this extended design reclaims partially filled reservations, allowing the empty pages to be used by other processes.

When a process allocates virtual memory, through mmap() or indirectly through malloc(), the operating system does not immediately allocate any physical memory. Instead, it creates a bookkeeping entry that describes the allocation and what virtual addresses are being used. Physical memory is only allocated, and the page table updated, when the program first tries to access that virtual memory. Since the page table does not have mappings for those virtual addresses until then, that first access causes a page fault. The fault handler then locates the bookkeeping entry associated with the faulting virtual address and actually allocates memory.

The fault handler uses that bookkeeping entry to predict whether the virtual memory space used is likely to be contiguous and larger than a large page. For example, a memory mapped 5KB file will not use an entire 2MB page, so it will not be placed in a reservation. If the handler decides that a superpage is appropriate, it will reserve an entire large physical page and allocate the small page within it that is virtually and physically aligned to the faulting virtual address. When the program faults again on the same region of virtual memory, the fault handler will recognize that the address is part of a reservation, and it will again return the virtually and physically aligned small page within that reservation. When all small pages in the reservation are allocated, the mapping is promoted; that is, the small page mappings are replaced by a single large page mapping. However, under memory pressure, this reservation may be broken down if it is never fully utilized. The virtual memory system can then return the unused pages back to the free pool of small pages.

3. PAGE TABLE SPECULATION

The SpecTLB is a translation-lookaside buffer that tracks partially filled large-page reservations instead of large pages themselves. On a TLB miss, the SpecTLB is consulted to see if the virtual page that missed in the TLB may be part of a large page reservation. If so, the physical address of the missing page can be interpolated between the starting and ending physical addresses of the reservation using the small page's position within the large reservation.

The primary difference between the SpecTLB and a traditional TLB is that mappings generated by the SpecTLB are predictions; they are not guaranteed to be correct. An entry in the SpecTLB indicates that the operating system may have placed small pages within a particular physical reservation in the past, but it does not guarantee that any particular virtual page was placed in the reservation or even that the page is valid. This distinction means that a TLB miss that does hit in the SpecTLB must still be validated against the underlying page table. However, the interpolated translation can be used for speculative execution while the page walk itself continues in parallel. If the interpolated translation matches the result from the page walk, the speculative work can be committed and execution continues. If the results differ, the speculative work is not committed and execution restarts from the first incorrect prediction. While the SpecTLB does not reduce the overall memory bandwidth required by the MMU, it does reduce the latency penalty of TLB misses by removing the page walk from the critical path of execution.

This relaxed requirement of correctness allows two different variants of the SpecTLB to be built: one that requires software support and one that does not. We describe both in this section.

3.1 Explicit page table marking

The SpecTLB maintains a set of large page reservations it believes the operating system is assigning to the current process. This set is maintained by monitoring which small-page page table entries are marked by the operating system as being part of a large page reservation. These marks are implemented using one of the currently unused bits in bottom-level (L1) page table entries in x86-64. They do not affect how these entries are translated; they are still standard small pages, and the marks are only used to maintain the contents of the SpecTLB.
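The reservation behavior described in Section 2.3 can be sketched as follows. This is an illustrative model, not FreeBSD code; the class name, bookkeeping structures, and the starting physical page number are invented for the example:

```python
# Sketch of reservation-based allocation: the first fault in a 2MB virtual
# region reserves an aligned 2MB physical region; later faults get the small
# page at the matching offset; once all 512 small pages have been touched,
# the mapping can be promoted to a single large page mapping.

PAGES_PER_2MB = 512

class ReservationAllocator:
    def __init__(self):
        self.reservations = {}     # 2MB virtual page number -> base PPN
        self.touched = {}          # 2MB virtual page number -> set of slots
        self.next_base = 0x0AC00   # hypothetical 2MB-aligned free PPN pool

    def fault(self, vpn):
        """Handle a fault on 4KB virtual page vpn; return its 4KB PPN."""
        big_vpn, slot = divmod(vpn, PAGES_PER_2MB)
        if big_vpn not in self.reservations:
            self.reservations[big_vpn] = self.next_base  # reserve 2MB region
            self.next_base += PAGES_PER_2MB
        self.touched.setdefault(big_vpn, set()).add(slot)
        # Aligned placement: the page's offset within the physical reservation
        # equals its offset within its 2MB virtual region, so consecutive
        # virtual pages get consecutive physical pages.
        return self.reservations[big_vpn] + slot

    def promotable(self, vpn):
        """True once every 4KB page of the 2MB region has been allocated."""
        return len(self.touched.get(vpn // PAGES_PER_2MB, set())) == PAGES_PER_2MB
```

Under this policy, two consecutive faulting virtual pages receive consecutive physical pages, which is exactly the contiguity the SpecTLB later exploits.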

For purposes of describing the operation of the SpecTLB, we will use a simplified address representation similar to that described in Section 2.1. Virtual addresses are split into four indices and an offset: e.g., (0b9, 00c, 0ae, 0c2, a2e). Physical addresses are similarly divided into the nine-bit parts of a physical page number and a page offset: e.g., {0ac, 0c2, a2e}. For simplicity, only eighteen bits (two parts) of the 40-bit physical page number are used in the following description.

The SpecTLB associates the bits of the virtual address corresponding to a 2MB virtual page number with the physical address of the reservation. Note that only the upper bits of the translation, those that select a particular large physical or virtual page, are stored. A SpecTLB entry for a particular address is effectively whatever the standard TLB entry would be if that address were part of a large page.

On a subsequent TLB miss, the SpecTLB is searched like a normal TLB. If a subsequent TLB miss is for virtual address (0b9, 00c, 0ae, 0c3, 001), the newly added SpecTLB entry will match. This means that the new virtual address is part of that same virtual large page and that the operating system likely placed the corresponding physical page within the physical reservation. The speculative translation concatenates the stored physical page number of the matching reservation with the large page offset from the virtual address, yielding {0ac, 0c3, 001}. This speculative translation is provided to the core, which uses it to speculatively execute subsequent instructions.

Of course, this translation may not be valid. The underlying reservation may have been broken down by the operating system, or the virtual address may not even be valid. Therefore, while the processor can use this translation, it must do so speculatively. While program execution continues, the standard x86 page walk happens concurrently. If the predicted physical address matches the actual address, the speculative work may be committed. Otherwise, execution must roll back to the point of the misprediction. This rollback capability is important because it is useful to be able to make predictions even on reservations that were previously broken down. In this case, the pages that were allocated when the reservation still existed will still be stored at predictable physical page numbers.

Since SpecTLB translations are always confirmed against the underlying architectural page table, implementations can be lazier about SpecTLB invalidation than they can be with a TLB. Stale entries do not lead to incorrect operation, as they do in a TLB. Invalid translations generated by stale entries will be corrected automatically. Our implementation only invalidates the SpecTLB on a context switch, though even this is not strictly necessary.

The SpecTLB can support multiple reservation sizes by maintaining a structure for both 2MB and 1GB reservations. The contents of the 1GB structure are maintained by monitoring the reservation bit of 2MB pages, while the contents of the 2MB structure are maintained by monitoring the bit in 4KB pages.

3.2 Heuristic reservation detection

The above description of the SpecTLB requires explicit marking of the page table. This requires modification of both the x86 page table architecture and the operating system itself. However, it is possible to build a variant of the SpecTLB that detects reservations and requires no modification to existing system software that supports large page reservations.

To allow a region of memory to be promoted to a large page, small pages must be aligned within their large page reservation. In the example above, virtual address (0b9, 00c, 0ae, 0c2, a2e) is mapped to physical address {0ac, 0c2, a2e}. The virtual page's offset within a 2MB virtual page is 0c2, equal to the physical page's offset within a 2MB physical page. Specifically, (virtual address)[20:12] == (physical address)[20:12]. This equality can be used as a heuristic signal that the operating system has placed this page within a reservation. While it has a false positive rate of 1:512 (assuming 4K pages that are not part of a reservation are placed randomly), it has a zero false negative rate.

The heuristic-based SpecTLB uses this detection scheme to maintain its contents. Translations are inserted when (virtual address)[20:12] == (physical address)[20:12], and entries are removed when they lead to false predictions.

3.3 Memory side-effects

TLB entries in modern processors store the effective memory type of the memory region, enabling the processor to know that the memory region has special properties, such as being uncacheable and/or being used for I/O. Care must be taken when memory accesses are made to such regions, as they can have irreversible side effects.

With the use of a SpecTLB, it is important to prevent memory references made using speculative translations from causing irreversible side effects before the translation has been validated. This may occur when a reservation is broken down and pages are reclaimed from that reservation. The reclaimed pages could be retasked for any purpose, including for I/O. With heuristic reservation detection, it is possible to improperly speculate on these reclaimed regions. Therefore, hardware mechanisms may be necessary to prevent speculative I/O accesses.

Tagging requests throughout the memory system as being speculative should be sufficient to ensure safety. Current systems do not need such a tag because speculative accesses are always made to a page that is already known to be safe. While the exact architectural implementation is beyond the scope of this paper, this tag could be used to prevent speculative accesses from reaching an I/O controller or writing to memory. Additionally, this would allow new speculative execution from other systems, such as the prefetcher, to make similar memory accesses without going through the TLB.

Alternatively, the explicit-marking SpecTLB can avoid these problems entirely by ensuring that uncacheable and I/O memory is never mapped into 2MB regions of memory that also contain data pages with the reservation bit set. In effect, the reservation bit becomes a reservation-safe bit, indicating that speculative accesses to nearby pages are safe.

3.4 Virtualization

The extra overhead of nested paging (see Section 2.2) magnifies the possible benefit of the SpecTLB. While I/O emulation requirements preclude the use of 1GB pages in the hypervisor page table (see Section 2.2), it is still possible to use 2MB pages within a 1GB reservation. This means that even a small SpecTLB will be able to keep track of all the reservations used by the hypervisor. Thus, it can speculatively translate any guest physical address into a host physical address.

In the nested page walk, if there is an NTLB miss, the SpecTLB provides a speculative translation that allows the next guest page table entry to be read immediately. Similarly, the SpecTLB can provide a speculative translation for the fifth hypervisor translation (the guest physical to host physical translation for the underlying data), which is never provided by the NTLB. We call this process "horizontal speculation" (Figure 5). In our design, the nested page walk can defer all accesses to the hypervisor page table until after the speculative page walk.
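The alignment heuristic of Section 3.2 and the interpolation it enables can be sketched as follows. This is a simplified model using the paper's two-part physical page numbers; real entries would also carry permissions and memory types, and the structure would have a fixed capacity:

```python
# Sketch of heuristic reservation detection plus interpolation
# (illustrative; not the hardware design itself).

def looks_reserved(vpn: int, ppn: int) -> bool:
    """Heuristic: a 4KB page is assumed to sit in a 2MB reservation when its
    offset within the 2MB virtual region equals its offset within the 2MB
    physical region, i.e. VA[20:12] == PA[20:12]."""
    return (vpn & 0x1FF) == (ppn & 0x1FF)

class SpecTLB:
    def __init__(self):
        self.entries = {}   # 2MB virtual page number -> 2MB physical page number

    def insert(self, vpn, ppn):
        """Record a completed small-page translation that passes the heuristic."""
        if looks_reserved(vpn, ppn):
            self.entries[vpn >> 9] = ppn >> 9

    def predict(self, vpn):
        """Interpolated PPN for a TLB miss, or None if no reservation matches.
        The prediction concatenates the reservation's physical page number
        with the missing page's offset within the large page."""
        big_ppn = self.entries.get(vpn >> 9)
        if big_ppn is None:
            return None
        return (big_ppn << 9) | (vpn & 0x1FF)
```

Replaying the paper's example: inserting the translation (0b9, 00c, 0ae, 0c2) to {0ac, 0c2} makes a later miss on L1 index 0c3 predict {0ac, 0c3}, which the concurrent page walk must then confirm.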

Figure 5: Horizontal and diagonal speculation.

These deferred accesses are only needed to verify the SpecTLB translations and can occur in parallel with other speculative work. If any of the speculative translations turn out to be incorrect (primarily due to a region of memory being an emulated I/O device), the speculative work is canceled and the page walk starts over. In the common case, however, all the additional memory and cache accesses to the hypervisor page table are removed from the critical path of execution. This leaves five SpecTLB or NTLB accesses as the only additional unhidden latency of the nested page walk over the native page walk.

The largest possible performance benefit from the SpecTLB is seen when large page reservations are used in both the hypervisor and the guest. In this case, the SpecTLB is modified to associate the guest virtual address of a guest reservation directly with a host physical address. This is analogous to how the normal TLB functions. On a TLB miss, the SpecTLB can immediately provide a speculative host physical address while the entire nested page walk occurs in parallel. We call this process diagonal speculation because both the guest and hypervisor page table entries are skipped.

If a speculative diagonal translation is not available for a particular address, the CPU can perform horizontal speculation to hide the latency of the nested portions of the walk.

4. METHODOLOGY

The actual execution time benefit of speculative execution depends greatly on the parallelism of the underlying microarchitecture. Instead of evaluating the SpecTLB against any particular current-generation machine, we used a microarchitecture-independent simulation that shows the memory-level parallelism that we expose. As memory systems continue to become more and more parallel, the SpecTLB will have an increasingly significant impact on the overhead of address translation.

4.1 Full-system simulator

The AMD SimNow [4] full-system simulator was used to run various benchmarks under an unmodified version of FreeBSD 8.0-Release for x86-64. A custom analyzer plugin to SimNow records each virtual memory access made by the simulated system along with the associated physical address and page size. This trace includes all memory loads and stores made by the guest operating system; instruction fetches are not included in this study. Page table loads are simulated by the SpecTLB simulator. Context switches are monitored by tracking the value of the CR3 register; these trigger TLB invalidations in our simulator. Finally, our custom analyzer plugin to SimNow counts the total number of instructions executed during the trace.

4.2 Benchmarks

Traces were collected for several popular benchmarks, including the SPEC CPU2006 suite [11], SPECjbb2005 [23], the NASA Advanced Computing Parallel Benchmarks (NAS) suite [2], benchw [7] and an ad-hoc Python microbenchmark. However, not all of the benchmarks in the SPEC CFP2006 suite could be compiled with the standard tool chain in FreeBSD 8.0, so soplex, calculix and wrf are not included in this study. SPECjbb2005 was run with one warehouse, and Sweep3d was run on a 150x150x150 grid. The NAS benchmarks are configured to use a class-C problem size for all benchmarks but dc, which was too large to run within our simulator; the class-B size was used for this benchmark. The benchw benchmark functions similarly to TPC-H. Specifically, this benchmark executes a JOIN between two tables of approximately one gigabyte in size under PostgreSQL 8.4 using default tunings. The Python benchmark is a custom microbenchmark that initializes a large array of long integers.

4.3 SpecTLB Simulation

A custom memory system simulator was built to simulate the SpecTLB design with heuristic reservation detection, in both nested and native execution. This simulator models an MMU that closely resembles that of the Family 10h AMD Opteron [5]. It includes a 64-entry, fully-associative L1 TLB with random replacement, a 512-entry, 4-way set-associative L2 TLB with LRU replacement for small pages, a 128-entry L2 TLB for large pages, a 16-entry Nested TLB, and a 24-entry Page Walk Cache.

A 1MB L2 cache was included in the model, simulated using the Dinero IV cache simulator [9]. Both application data accesses and MMU page table accesses are simulated using a shared L2 cache model. Instruction loads are not instrumented and are not included in this study. The L1 cache was not simulated, since the page walk hardware does not use it on the Opteron.

The simulator reads the memory trace, running each virtual address through the TLB and data cache model. On a TLB miss, the SpecTLB is searched to see if a possible matching reservation can be found. If so, a speculative physical page number is generated. If this page number matches the actual page number stored in the trace, the speculation is correct. The MMU simulator then performs the native or nested page walk and keeps track of memory accesses made to the page table.

Since explicit reservation marking is not yet included in any operating system, the heuristic-based SpecTLB is implemented. As a baseline, the number of reservations tracked by the SpecTLB is set at 24, the number of entries in the AMD Page Walk Cache. This size is varied from one to forty-eight.

We included a model of the distance-based TLB prefetcher described by Kandiraju and Sivasubramaniam [12] in our simulator for comparison. Our model was configured to use 32 entries for both the distance and prefetch buffers, which is the size explored by Kandiraju and Sivasubramaniam that most closely matched the size of our simulated SpecTLB. We had to modify the original design of the TLB prefetcher to support multiple page sizes, which was not addressed by the original paper. We chose a simple approach, replicating the entire prefetcher for each page size.
After system and processes, but it does not include instruction or page each TLB miss which is filled by a page of a particular size, that table loads. Instruction loads are not modeled by SimNow, so they size TLB prefetcher attempts to fetch the next page of that size that

This is an optimistic design, since it greatly increases the amount of history tracked by the overall system. A more practical implementation would combine prefetch history from different page sizes, which could lead to conflicts between entries.

For nested operation, we assume that the 1GB SpecTLB is large enough to hold all reservations in use by the hypervisor and is never flushed.

5. SIMULATION RESULTS

This section evaluates the SpecTLB using the aforementioned applications. The ability to accurately predict translations and hide MMU-related cache and DRAM accesses is examined. These results show that even a small SpecTLB is effective at accurately predicting the majority of translations for many benchmarks and enabling parallelization of much of the overhead of address translation.

5.1 TLB miss parallelization

When a translation is able to be accurately predicted, all memory and page walk cache accesses required to serve the TLB miss are removed from the critical path of execution. Instead, the predicted translation can be used immediately to enable the memory reference to proceed in parallel with the page walk.

Table 2 shows the potential TLB miss parallelization benefits of the SpecTLB. The first columns of the table show the average number of instructions between TLB misses as well as the average number of memory accesses and DRAM accesses made by the program between TLB misses. Some applications, like povray, incur a TLB miss for every 1.3 DRAM accesses. Since TLB miss handling is memory intensive, we report the frequency of TLB misses in relation to memory system activity.

A speculative translation is only attempted when a matching reservation is found. The rate at which this happens varies significantly by workload. As shown in the attempts/walk column of Table 2, some benchmarks, such as mcf, find a reservation and attempt a speculative translation over 99% of the time. Others, such as libquantum, never attempt a speculative translation. Over half the benchmarks have speculation rates greater than 62%. The ability to speculatively translate addresses depends on the locality with which a program uses its address space and on whether the address space is used in a fragmented manner, leading to partially filled reservations that were never able to be promoted to a large page.

Even using heuristic-based reservation detection, most benchmarks have a prediction accuracy above 99%. While some benchmarks have particularly low prediction accuracies (less than 35% in the case of dc.B), speculation happens comparatively rarely in these workloads. For dc.B, a speculative prediction is only attempted for 8.3% of TLB misses. The low rate of speculation here is likely due to its small working set, which would lead to comparatively few reservations being made. An explicit reservation-marking-based SpecTLB would eliminate these mispredictions.

When addresses are accurately predicted, the page table walk can be overlapped with speculative execution using the predicted address. A high prediction rate combined with high accuracy allows useful work to be performed in parallel with much of the overhead of virtual memory. This is quantified by counting the number of total L2 data cache misses made by the MMU as well as the fraction that are overlapped through successful prediction. Since DRAM accesses are so slow, this fraction should translate into a proportional speedup of overall TLB miss handling. Tested benchmarks have an average of 40% of their MMU-related DRAM accesses overlapped with speculative execution by successful prediction. The benchmark which sees a particularly large number of TLB misses per program DRAM access, mcf, has 96% of its MMU-related DRAM accesses overlapped.

5.2 Overlap opportunity

Superscalar processors can speculatively execute only a limited number of instructions, as constrained by the size of their reorder buffer. If the instructions after a TLB miss complete quickly, the reorder buffer may fill before the parallelized page walk completes. The processor then has to stall until the page walk is complete, as it does without the SpecTLB. However, the load or store which triggered the TLB miss is by definition to an address not recently accessed, or else it would have hit in the TLB. Therefore, it is less likely to be cached and may be a long-latency operation itself. The high-latency TLB miss can be overlapped with the high-latency load or store before the reorder buffer starts to fill.

To examine this, our simulator was instrumented to determine the fraction of memory operations that cause both a TLB miss and a read miss in the L2 cache for the data itself. The per-benchmark average of this fraction is 40%, and it is as high as 93% (sphinx3). These results are presented for all workloads in Table 2. These long-latency read operations present significant latency in a single instruction that can be executed while the parallelized page walk completes.

5.3 Misprediction Cost

On a misprediction, unnecessary work will be performed, which may lead to increased resource contention and power consumption. However, since the SpecTLB does not provide a speculative translation when a request is not part of a tracked reservation, misprediction rates are extremely low (Table 2). Therefore, even when the SpecTLB is unable to provide predicted translations, the penalty from its use should be low. Misprediction rates are below 1% for over half the benchmarks tested.

The mispredictions that are present are either for pages that are not present, for pages that were allocated after a reservation was broken down, or for predictions made using an incorrect SpecTLB entry. Incorrect SpecTLB entries are generated when a page is aligned by chance, not because it was part of a reservation. Using explicit marking avoids these incorrect entries.

5.4 Sizing and replacement considerations

Even a relatively small SpecTLB is effective, since each entry covers 512 small pages, or 2MB of virtual address space. For SPECjbb2005, the benchmark that requires the most entries, reducing the device size from 24 to 12 entries only reduces the successful speculation rate by 3%. Other benchmarks are impacted even less. Additionally, we simulated both LRU and random replacement policies and found both to be equally effective.

5.5 Comparison to TLB prefetching

Both the SpecTLB and TLB prefetching are speculative techniques for hiding the latency of TLB misses. The SpecTLB uses large-page reservations to predict the translation for the current TLB miss, and the TLB prefetcher uses access patterns to predict future TLB misses. The SpecTLB does speculative work in the form of the instructions that are executed while a speculative translation is confirmed. The TLB prefetcher does speculative work in the form of page table walks to prefetch page table entries.

When simulated with the parameters described in Section 4, the SpecTLB handles a larger fraction of TLB misses than the TLB prefetcher does in all but seven of the benchmarks tested (dc.B, ep.C, bzip2, GemsFDTD, lbm, libquantum and milc). Of these, the TLB prefetcher handles less than 20% of TLB misses for all benchmarks other than ep.C, bzip2 and milc.

These three benchmarks (ep.C, bzip2 and milc) are well suited for a TLB prefetcher because they have a very regular access pattern. However, this regular access pattern also means that TLB miss rates are comparatively low, and thus accelerating address translation will have a proportionally small impact on execution time.

Finally, simulation shows that while misprediction rates are very low for the SpecTLB (less than 1% for over half the benchmarks), the TLB prefetcher must perform a large number of unnecessary page walks to achieve a reasonable hit rate. For example, with a TLB prefetcher, the MMU performs 1.118 page walks per TLB miss for PostgreSQL, an application which has a relatively high TLB miss rate. Over half the benchmarks tested require more than 1.3 page walks per TLB miss.

While TLB prefetching does not rely on any specific operating system behavior, it does require that applications access memory in predictable patterns. As a result, the SpecTLB handles a higher fraction of TLB misses for nearly all applications tested that have significant TLB miss rates.

6. NESTED PAGING SUPPORT

Virtualization compounds the overhead of virtual memory by forcing every guest physical address to be translated into a host physical address. While the use of large pages in the guest reduces this overhead, recent work has shown that the overall execution time penalty as compared to native execution is still between 7-14% [5]. In this section, we argue that diagonal and horizontal speculation (see Section 3.4) can significantly reduce this overhead.

Figure 4 shows a timeline of a typical nested page walk. In the common case, the Nested TLB translates the guest physical addresses of guest page table entries to host physical addresses. Since the Nested TLB does not translate guest physical addresses of data (non-page table) pages, the bottom level nested translation (nL0) must be performed by walking the hypervisor page table.

If both the guest and hypervisor use large page reservations, diagonal speculation can allow this entire process to occur during speculative execution. The rate at which diagonal speculation is possible in nested paging is identical to the rate at which speculation is possible in native execution. These results are shown in detail in Table 2. However, when diagonal speculation fails or is not possible, horizontal speculation can allow the nested phases of the nested page walk to occur in parallel with speculative execution.

Table 1 shows the number of memory and DRAM accesses made in each phase of the nested page walk as shown in Figure 4. These results are split into those TLB misses where diagonal speculation was successful and those where it was not. When diagonal speculation is successful, all cache and DRAM accesses are overlapped with speculative execution. When diagonal speculation is not successful, horizontal speculation can overlap the memory accesses in the nested (nL4, nL3, nL2, nL1, nL0) phases with speculative work.

                               nL4    gL4    nL3    gL3    nL2    gL2    nL1    gL1    nL0
No diagonal     L2 accesses   0.005  0.109  0.066  0.149  0.105  0.359  0.279  1.000  1.062
speculation     DRAM accesses 0.000  0.048  0.033  0.075  0.049  0.161  0.142  0.372  0.199
After diagonal  L2 accesses   0.000  0.029  0.028  0.033  0.031  0.044  0.057  0.946  0.946
speculation     DRAM accesses 0.000  0.028  0.028  0.028  0.029  0.031  0.033  0.197  0.046

Table 1: Memory accesses to the page table in different phases of the nested page walk, averaged over all benchmarks examined. The table is split into those TLB misses where diagonal speculation was not attempted or failed and those where diagonal speculation was successful.

When the SpecTLB performs horizontal speculation, the hypervisor page walks (and the memory accesses contained in the nested phases) do not contribute to the critical-path latency of the TLB miss. The SpecTLB can immediately translate any guest physical address into a speculative host physical address. This means that all four guest page table entries can be loaded in succession, just as in the native translation, without access to the hypervisor page table. The five nested phases must be performed at some point to verify that the speculative translations were correct, but these can be done later in execution. As long as sufficient memory bandwidth is available, this hides all of the additional latency generated by using nested paging.

For example, on a TLB miss, the processor will use the SpecTLB to translate the guest physical address of the gL4 page table entry into a host physical address. This entry is loaded, which provides the guest physical address of the gL3 page table entry. Again, the SpecTLB translates this into a host physical address. This process repeats until the gL1 entry provides the guest physical address of the underlying data. This is translated with a fifth and final SpecTLB access, and the data is loaded. At this point, the nL4, nL3, nL2, nL1 and nL0 phases are performed in parallel with speculative execution, and in any order.

7. RELATED WORK

The SpecTLB implements a very limited form of value prediction. Martin, et al. show that simply verifying that the value predicted and the value actually fetched match is not sufficient to guarantee correctness on shared-memory multiprocessors [14]. One process could incorrectly predict a value while another process modifies the underlying address. This could lead to a false positive verification, which could be a correctness violation under some strict memory consistency models. Fortunately, the page table uses a weak ordering consistency model, since the TLB must be explicitly flushed when the page table is changed. When a TLB flush occurs for a given address space, any speculative translations not yet verified by the SpecTLB should be cancelled. This should be sufficient to guarantee correctness. Romanescu, et al. discuss memory consistency issues in virtual memory in more detail [20]. They show that in multithreaded applications, global TLB synchronization may be necessary to guarantee correctness. This will also be needed for the SpecTLB.

Talluri and Hill first proposed a reservation-based memory allocator for use with a partial-subblock TLB [24]. This device, like the SpecTLB, provides direct support for translating small pages within a large page reservation. However, unlike the SpecTLB, the partial-subblock TLB explicitly tracks which small pages within the reservation are valid. While this also increases TLB coverage compared to a standard TLB, the partial-subblock TLB needs a much larger structure than the SpecTLB does. Each entry must have a valid and a modified bit for each small page within a large page reservation. With the 4KB and 2MB page sizes explored in this paper, 1024 bits per entry are required in a partial-subblock TLB.

This explicit tracking approach also means that the partial-subblock TLB is unable to hide access latencies that the SpecTLB can. Assuming each subblock TLB entry is not preloaded (which would require accessing 512 page table entries for each TLB miss), the subblock TLB must suffer a TLB miss and an access to the page table upon the first access to each 4KB page. In contrast, the SpecTLB also hides the TLB miss latency of the first access to each 4KB page in the region, as the mapping can be predicted immediately.

Navarro, et al. extended this idea into a completely automatic system for superpage management and implemented it under FreeBSD [19]. Most critically, this extension developed the reclamation of partially filled superpage reservations. Additionally, their work extended the system to support multiple page sizes and defragmentation of physical memory. While these systems are trying to create promotable reservations, as a by-product they also create the contiguity in the page table that is exploited by the SpecTLB.

Earlier approaches to supporting superpages in general-purpose operating systems, such as IRIX, Solaris, and HP-UX, relied upon the program to explicitly specify a static page size for a region of the virtual address space [10, 16, 15]. These operating systems would then allocate and map the entire superpage at once. Consequently, there would never be any partially filled reservations to be exploited by the SpecTLB. However, in such systems, a program had to be conservative in its use of superpages to avoid wasted physical memory and increased I/O. In contrast, with FreeBSD's promotion-based approach, it is possible for the system to be very aggressive in speculatively allocating reservations, most of which may never be promoted, because of the low impact of breaking reservations. Thus, the SpecTLB may have many opportunities to exploit partially filled reservations in situations where the same program running under these earlier operating systems would not have used superpages.

Romer, et al. propose the creation of superpages by moving existing pages, previously scattered throughout physical memory, into contiguous blocks [21]. While this process may be prohibitively expensive for very large superpages, it may have more success with the architecture described here, which does not require a full reservation.

Saulsbury, et al. propose a prefetching scheme for TLBs that preloads pages based on recently accessed pages [22]. Unlike the system presented in this paper, their techniques require page table modification. More recent work has proposed architecturally independent prefetching techniques based on access patterns and inter-core cooperation [12, 6]. We simulated the work of Kandiraju and Sivasubramaniam [12] in Section 5.5.

Bhargava, et al. first described the AMD page walk cache and their extensions to it to support virtualization [5]. Barr, et al. explored the design space of these dedicated MMU caches and showed that they are often successful at eliminating memory accesses for upper-level page table entries [3].

Alternatively, page table accesses can be accelerated by replacing the page table format with one that requires fewer accesses. In terms of space, a radix-tree-based page table can be an inefficient representation for a large, sparsely-populated virtual address space. Liedtke introduced Guarded Page Tables to address this problem [13]. In particular, Guarded Page Tables allow for path compression: if there is only one valid path through multiple levels of the tree, then the entry prior to this path can be configured such that the page walk will skip these levels. Talluri and Hill presented an alternate form of the inverted page table, the clustered page table, that derives from their subblock TLB [25]. This technique can dramatically reduce the size of an inverted page table.

Aside from the prefetchers simulated in Section 5.5, these techniques focus on reducing either the frequency of TLB misses or their cost in terms of main memory accesses. In contrast, the SpecTLB allows whatever cost a TLB miss has to be parallelized with other work. It does not prevent the page table walk; it hides its latency. Therefore, all of the hardware techniques discussed above could be combined with the SpecTLB to reduce the memory bandwidth required by the parallelized page table walks. They are complementary to, not competitive with, the SpecTLB.

8. CONCLUSIONS

With memory capacity growing faster than TLB sizes, the increasing footprint of modern applications will lead to higher and higher virtual memory overhead. While much progress has been made to accelerate TLB miss handling with various techniques such as caching and large pages, the footprint and access pattern of popular workloads is making address translation more and more difficult. We have presented a device, the SpecTLB, that effectively decouples address translation from the allocation and permission setting of large pages. Our device exploits a reservation-based physical memory allocator to deliver much of the benefit of large pages without incurring their limitations.

Operating systems that use reservation-based memory allocation are currently designed to use reservations in the hope that those reservations will fill and memory will be promoted to a large page. FreeBSD is tuned to reserve memory only when there is a possibility that it will be able to promote a region to a large page. Our results show that even with these tunings, there are enough partially filled reservations that an accurate prediction can be made for over 62% of TLB misses for the majority of benchmarks tested. Moreover, we show that these predictions can enable 40% of the MMU-related DRAM accesses to be performed in parallel with data access.

When compared to the TLB prefetcher, a different architectural approach to hiding the latency of TLB miss handling, we show that the SpecTLB is able to handle a higher percentage of TLB misses in virtually all applications tested that have significant TLB miss rates. Moreover, the SpecTLB is far more selective in when it performs speculative work. It mispredicts translations less than 1% of the time for most benchmarks, while the TLB prefetcher causes a 30% increase in the number of page walks which must be done.

Finally, we show that our device can hide the increased latency from nested paging. Recent work has shown that nested paging comes with a significant performance penalty. While the use of very large pages (1GB on x86-64) would greatly reduce the frequency of TLB misses, they cannot typically be used because the hypervisor needs to maintain fine-grained control over guest physical address space permissions. Speculative address translation allows the performance of large pages while maintaining fine-grained control.

Although neither Windows nor Linux currently implements a reservation-based physical memory allocator, the limitations of the current large page support in Linux are widely recognized [8]. Therefore, rather than restricting the scope of our work to developing hardware features that benefit the current versions of these operating systems, this paper has stepped outside that box. It has, instead, explored innovative hardware that complements a new approach to physical memory management that is now being used by at least one operating system, FreeBSD. Arguably, this paper's results strengthen the case for other operating systems to adopt a reservation-based approach to physical memory management, because of the potential gains from synergistic hardware. More broadly, this paper provides a compelling example of the benefits of coordinated innovation in hardware and software.
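As a concrete restatement of the horizontal-speculation walk described in Section 6, the five speculative SpecTLB translations and the deferred nested verification walks can be sketched as follows. Everything here is an illustrative stand-in — page-number granularity, a dict in place of guest page table memory, and hypothetical names — not the hardware interface.

```python
# Sketch of horizontal speculation in a nested page walk (Section 6).
# Guest-physical addresses (gPAs) are translated speculatively, so the
# four guest levels plus the data access proceed without touching the
# hypervisor page table; the nested (nL4..nL0) walks are deferred.

PAGES_PER_RESERVATION = 512  # 2MB reservation of 4KB pages

def spec_translate(gpa, reservations):
    """SpecTLB-style gPA -> hPA interpolation within a tracked 2MB
    hypervisor reservation (illustrative)."""
    frame, offset = divmod(gpa, PAGES_PER_RESERVATION)
    return reservations[frame] * PAGES_PER_RESERVATION + offset

def speculative_nested_walk(gl4_gpa, guest_entries, reservations):
    """Load the four guest levels plus the data page using only
    speculative translations; collect the nested verification walks
    to run later, off the critical path."""
    deferred, gpa = [], gl4_gpa
    for _level in ("gL4", "gL3", "gL2", "gL1"):
        hpa = spec_translate(gpa, reservations)  # speculative hPA of entry
        deferred.append(gpa)                     # nested walk verifies later
        gpa = guest_entries[hpa]                 # entry holds gPA of next level
    data_hpa = spec_translate(gpa, reservations) # fifth, final translation
    deferred.append(gpa)                         # nL0 verification, deferred
    return data_hpa, deferred

reservations = {0: 10, 1: 20}  # gPA frame -> hPA frame
entries = {5123: 519, 10247: 5, 5125: 600, 10328: 9}
data_hpa, deferred = speculative_nested_walk(3, entries, reservations)
assert data_hpa == 5129 and len(deferred) == 5  # five deferred nL walks
```

The five entries collected in `deferred` correspond to the nL4..nL0 phases, which can then be verified in parallel with speculative execution and in any order, as the section above describes.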

9. REFERENCES

[1] Advanced Micro Devices. AMD x86-64 Architecture Programmer's Manual, Volume 2, 2002.
[2] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS parallel benchmarks. Technical report, The International Journal of Supercomputer Applications, 1991.
[3] T. W. Barr, A. L. Cox, and S. Rixner. Translation caching: Skip, don't walk (the page table). In ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture, New York, NY, USA, 2010. ACM.
[4] R. Bedicheck. SimNow: Fast platform simulation purely in software. In Hot Chips 16, 2004.
[5] R. Bhargava, B. Serebrin, F. Spadini, and S. Manne. Accelerating two-dimensional page walks for virtualized systems. In ASPLOS XIII: Proceedings of the 13th international conference on Architectural support for programming languages and operating systems, pages 26–35, New York, NY, USA, 2008. ACM.
[6] A. Bhattacharjee and M. Martonosi. Inter-core cooperative TLB for chip multiprocessors. In ASPLOS '10: Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, pages 359–370, New York, NY, USA, 2010. ACM.
[7] J. Corbet. benchw. http://benchw.sourceforge.net/.
[8] J. Corbet. KS2009: How Google uses Linux. http://lwn.net/Articles/357658/.
[9] J. Edler and M. D. Hill. Dinero IV Trace-Driven Uniprocessor Cache Simulator, 1998.
[10] N. Ganapathy and C. Schimmel. General purpose operating system support for multiple page sizes. In Proceedings of the USENIX Conference, pages 91–104. USENIX, 1998.
[11] J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34(4):1–17, 2006.
[12] G. B. Kandiraju and A. Sivasubramaniam. Going the distance for TLB prefetching: an application-driven study. In ISCA '02: Proceedings of the 29th annual international symposium on Computer architecture, pages 195–206, Washington, DC, USA, 2002. IEEE Computer Society.
[13] J. Liedtke. Address space sparsity and fine granularity. In EW 6: Proceedings of the 6th workshop on ACM SIGOPS European workshop, pages 78–81, New York, NY, USA, 1994. ACM.
[14] M. M. K. Martin, D. J. Sorin, H. W. Cain, M. D. Hill, and M. H. Lipasti. Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing. In Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture, MICRO 34, pages 328–337, Washington, DC, USA, 2001. IEEE Computer Society.
[15] C. Mather, I. Subramanian, K. Peterson, and B. Raghunath. Implementation of Multiple Pagesize Support in HP-UX, 1998.
[16] J. Mauro and R. McDougall. Solaris Internals. Sun Microsystems Press, 2000.
[17] C. McCurdy, A. L. Cox, and J. Vetter. Investigating the TLB behavior of high-end scientific applications on commodity microprocessors. In ISPASS '08: Proceedings of the 2008 IEEE International Symposium on Performance Analysis of Systems and Software, pages 95–104, Washington, DC, USA, 2008. IEEE Computer Society.
[18] J. C. Mogul. SIGOPS to SIGARCH: Now it's our turn to push you around. In OSDI 2010, Berkeley, CA, USA, 2010. USENIX Association.
[19] J. Navarro, S. Iyer, P. Druschel, and A. Cox. Practical, transparent operating system support for superpages. SIGOPS Oper. Syst. Rev., 36(SI):89–104, 2002.
[20] B. F. Romanescu, A. R. Lebeck, and D. J. Sorin. Specifying and dynamically verifying address translation-aware memory consistency. In Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ASPLOS '10, pages 323–334, New York, NY, USA, 2010. ACM.
[21] T. H. Romer, W. H. Ohlrich, A. R. Karlin, and B. N. Bershad. Reducing TLB and memory overhead using online superpage promotion. In ISCA '95: Proceedings of the 22nd annual international symposium on Computer architecture, pages 176–187, New York, NY, USA, 1995. ACM.
[22] A. Saulsbury, F. Dahlgren, and P. Stenström. Recency-based TLB preloading. In ISCA '00: Proceedings of the 27th annual international symposium on Computer architecture, pages 117–127, New York, NY, USA, 2000. ACM.
[23] Standard Performance Evaluation Corporation. The SPECjbb2005 Benchmark, 2005.
[24] M. Talluri and M. D. Hill. Surpassing the TLB performance of superpages with less operating system support. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, 1994.
[25] M. Talluri, M. D. Hill, and Y. A. Khalidi. A new page table for 64-bit address spaces. In SOSP '95: Proceedings of the fifteenth ACM symposium on Operating systems principles, pages 184–200, New York, NY, USA, 1995. ACM.
[26] R. Uhlig, G. Neiger, D. Rodgers, A. Santoni, F. Martins, A. Anderson, S. Bennett, A. Kagi, F. Leung, and L. Smith. Intel Virtualization Technology. Computer, 38(5):48–56, May 2005.

Columns: TLB miss rate (ins/miss, mem/miss, dram/miss); Speculations (total attempts, attempts/walk); Prediction (accuracy); DRAM accesses (per walk, fraction of total); Overlapped data cache misses after TLB misses; TLB Prefetcher (accuracy, overhead).

name          ins/miss  mem/miss  dram/miss  attempts  att/walk  accuracy  per walk  fraction  dcache   pf acc.  pf ovhd.
bzip2              950      2342      250.0      5393     0.293     0.998     0.040     0.235    0.700    0.978     1.890
gcc               8020      3578      235.4     10516     0.852     0.988     0.288     0.664    0.831    0.330     1.568
mcf                210        87       12.6    502933     0.992     1.000     0.143     0.956    0.781    0.051     1.065
gobmk             8722      3326       22.4      3648     0.279     0.994     0.091     0.338    0.423    0.261     1.320
hmmer            16145      8908       58.1      4713     0.903     0.996     0.169     0.703    0.208    0.712     1.366
sjeng            18433      6522       10.4      2561     0.383     0.985     0.121     0.393    0.556    0.096     1.091
libquantum     1275748    224895    40091.7         0     0.000     0.000     0.000     0.000    0.806    0.024     1.076
h264ref          18605      9171       19.3      3878     0.785     0.999     0.096     0.475    0.295    0.506     1.634
omnetpp           5410      1407       33.3     22875     0.754     0.994     0.100     0.602    0.676    0.325     1.465
astar            51888     21769       52.5      1287     0.625     0.981     0.092     0.317    0.067    0.350     1.437
bwaves            8763      5260       78.6      7918     0.963     0.999     0.141     0.720    0.084    0.850     1.391
gamess           40429     18410        1.5      1450     0.590     1.000     0.013     0.304    0.057    0.032     1.178
milc             18490      8048      192.4      4525     0.828     0.999     0.076     0.280    0.818    0.862     1.723
zeusmp           42484      8018      824.2      4570     0.952     1.000     0.131     0.521    0.746    0.924     1.335
gromacs          25142     10828       27.7      3434     0.857     0.992     0.142     0.603    0.314    0.148     1.318
cactus           16674     12433      105.9      3010     0.846     1.000     0.258     0.609    0.508    0.628     1.504
leslie3d         17028      6727      254.1      5924     0.931     1.000     0.209     0.706    0.630    0.707     1.589
namd             44411     15728       16.1      1437     0.518     0.990     0.051     0.311    0.261    0.067     1.229
deal             32357     19022        4.4      1134     0.472     0.984     0.016     0.260    0.121    0.168     1.240
sphinx3           3740      1149       55.7     31090     0.833     0.995     0.179     0.700    0.934    0.393     1.339
calculix         72338     24671        4.6       420     0.234     0.933     0.046     0.336    0.249    0.022     1.169
Gems             51135     19572        1.5        50     0.023     0.000     0.000     0.000    0.047    0.012     1.046
tonto            33002     16266        2.1      1123     0.412     0.978     0.008     0.196    0.060    0.065     1.209
lbm             666336    324572    20150.3         2     0.015     1.000     0.061     0.027    0.439    0.068     1.114
povray           61836     31687        1.3       528     0.375     1.000     0.016     0.267    0.071    0.001     1.050
bt.C            109448     45809       54.4       815     0.815     0.996     0.108     0.434    0.084    0.715     1.409
cg.C             33512      4776       26.4       152     0.019     0.928     0.013     0.013    0.892    0.013     1.044
dc.B             43508     21323       23.6       170     0.083     0.353     0.011     0.073    0.067    0.190     1.107
ep.C             23927      7633       77.3        79     0.014     0.962     0.004     0.023    0.299    0.897     1.899
is.C            140525     35884       53.1       860     0.721     0.995     0.103     0.346    0.078    0.617     1.565
lu.C            103736     41975       52.4       796     0.743     0.996     0.104     0.335    0.083    0.649     1.562
sp.C             18346      7709       20.4      5664     0.965     0.999     0.102     0.720    0.500    0.883     1.789
ua.C            117350     38486       48.5       470     0.415     0.994     0.068     0.193    0.236    0.117     1.199
PostgreSQL        2450      1354       13.4     29996     0.762     0.989     0.060     0.448    0.331    0.106     1.118
Python 2.6       29200     13028       65.1      3323     0.760     0.998     0.112     0.419    0.535    0.633     1.302
SPECjbb2005       5432      2506       49.9      8848     0.418     0.971     0.142     0.310    0.535    0.151     1.176

Table 2: SpecTLB simulation results for SPEC CPU2006 (integer and floating point), NAS, PostgreSQL, Python and SPECjbb2005 benchmarks. TLB miss rates are shown as the number of program instructions, memory references and DRAM accesses between TLB misses. TLB prefetcher accuracy is the fraction of TLB misses that are served from the TLB prefetch buffer, and overhead is the number of page walks performed per TLB miss.