Optimizing the TLB Shootdown Algorithm with Page Access Tracking
Nadav Amit, VMware Research

In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17), July 12–14, 2017, Santa Clara, CA, USA. ISBN 978-1-931971-38-6.
https://www.usenix.org/conference/atc17/technical-sessions/presentation/amit

Abstract

The operating system is tasked with maintaining the coherency of per-core TLBs, necessitating costly synchronization operations, notably to invalidate stale mappings. As core-counts increase, the overhead of TLB synchronization likewise increases and hinders scalability, whereas existing software optimizations that attempt to alleviate the problem (like batching) are lacking.

We address this problem by revising the TLB synchronization subsystem. We introduce several techniques that detect cases whereby soon-to-be invalidated mappings are cached by only one TLB or not cached at all, allowing us to entirely avoid the cost of synchronization. In contrast to existing optimizations, our approach leverages hardware page access tracking. We implement our techniques in Linux and find that they reduce the number of TLB invalidations by up to 98% on average and thus improve performance by up to 78%. Evaluations show that while our techniques may introduce overheads of up to 9% when memory mappings are never removed, these overheads can be avoided by simple hardware enhancements.

1. Introduction

Translation lookaside buffers (TLBs) are perhaps the most frequently accessed caches whose coherency is not maintained by modern CPUs. The TLB is tasked with caching virtual-to-physical translations ("mappings") of memory addresses, and so it is accessed upon every memory read or write operation. Maintaining TLB coherency in hardware hampers performance [33], so CPU vendors require OSes to maintain coherency in software. But it is difficult for OSes to achieve this goal efficiently [27, 38, 39, 41, 48].

To maintain TLB coherency, OSes employ the TLB shootdown protocol [8]. If a mapping m that possibly resides in the TLB becomes stale (due to memory mapping changes), the OS flushes m from the local TLB to restore coherency. Concurrently, the OS directs remote cores that might house m in their TLB to do the same, by sending them an inter-processor interrupt (IPI). The remote cores flush their TLBs according to the information supplied by the initiator core, and they report back when they are done. TLB shootdown can take microseconds, causing a notable slowdown [48]. Performing TLB shootdown in hardware, as certain CPUs do, is faster but still incurs considerable overheads [22].
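The shootdown flow just described can be sketched as a minimal user-space model, with threads standing in for cores and atomic flags standing in for IPIs. All names here (tlb_shootdown, core_loop, and so on) are illustrative; this is not the kernel's actual implementation, only the shape of the protocol:

```c
/* Minimal model of IPI-based TLB shootdown: the initiator flushes
 * locally, "sends an IPI" to each remote core, and waits until all
 * remote cores acknowledge. Compile with: gcc -pthread shootdown.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NCORES 4

static atomic_int ipi_pending[NCORES]; /* per-core "IPI" delivery flag */
static atomic_int acks;                /* remote acknowledgment counter */
static atomic_int done;

static void flush_local_tlb(int core) {
    /* Stand-in for an INVLPG instruction or CR3 reload on a real CPU. */
    printf("core %d: flushed local TLB\n", core);
}

static void *core_loop(void *arg) {    /* remote cores poll for "IPIs" */
    int core = (int)(long)arg;
    while (!atomic_load(&done)) {
        if (atomic_exchange(&ipi_pending[core], 0)) {
            flush_local_tlb(core);     /* the interrupt handler's work */
            atomic_fetch_add(&acks, 1);/* report back to the initiator */
        }
    }
    return NULL;
}

static void tlb_shootdown(int initiator) {
    flush_local_tlb(initiator);        /* restore local coherency first */
    atomic_store(&acks, 0);
    for (int c = 0; c < NCORES; c++)   /* direct the remote cores */
        if (c != initiator)
            atomic_store(&ipi_pending[c], 1);
    while (atomic_load(&acks) < NCORES - 1)
        ;                              /* wait until all remotes are done */
}

int main(void) {
    pthread_t t[NCORES];
    for (long c = 1; c < NCORES; c++)
        pthread_create(&t[c], NULL, core_loop, (void *)c);
    tlb_shootdown(0);
    atomic_store(&done, 1);
    for (long c = 1; c < NCORES; c++)
        pthread_join(t[c], NULL);
    return 0;
}
```

Even in this idealized model the initiator busy-waits for every remote acknowledgment, which is exactly the synchronization cost the paper sets out to avoid.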
In addition to reducing performance, shootdown overheads can negatively affect the way applications are constructed. Notably, to avoid shootdown latency, programmers are advised against using memory mappings, against unmapping them, and even against building multithreaded applications [28, 42]. But memory mappings are the efficient way to use persistent memory [18, 47], and avoiding unmappings might cause corruption of persistent data [12].

OSes try to cope with shootdown overheads by batching them [21, 43], avoiding them on idle cores, or, when possible, performing them faster [5]. But the potential of these existing solutions is inherently limited to certain specific scenarios. To have a generally applicable, efficient solution, OSes need to know which mappings are cached by which cores. Such information can in principle be obtained by replicating the translation data structures for each core [11], but this approach might result in significantly degraded performance and wasted memory.

We propose to avoid unwarranted TLB shootdowns in a different manner: by monitoring access bits. While TLB coherency is not maintained by the CPU, CPU architectures can maintain the consistency of access bits, which are set when a mapping is cached. We contend that these bits can therefore be used to reveal which mappings are cached by which cores. To our knowledge, we are the first to use access bits in this way.

In the x86 architecture, which we study in this paper, access bit consistency is maintained by the memory subsystem. Exploiting it, we propose techniques to identify two types of common mappings whose shootdown can be avoided: (1) short-lived private mappings, which are only cached by a single core; and (2) long-lived idle mappings, which are reclaimed after the corresponding pages have not been used for a while and are not cached at all. Using these techniques, we implement a fully functional prototype in Linux 4.5. Our evaluation shows that our proposal can eliminate more than 90% of TLB shootdowns and improve the performance of memory migration by 78%, of copy-on-write events by 18–25%, and of multithreaded applications (Apache and parallel bzip2) by up to 12%.

Our system introduces a worst-case slowdown of up to 9% when mappings are only set and never removed or changed, which means no shootdown activity is conducted. According to our measurements, this slowdown is caused by the overhead of our software TLB-manipulation techniques. To eliminate it, we propose a CPU extension that would allow OSes to write entries directly into the TLB, resembling the functionality provided by CPUs that employ a software-managed TLB.

2. Background and Motivation

2.1 Memory Management Hardware

Virtual memory is supported by most modern CPUs and used by all the major OSes [9, 32]. Using virtual memory allows the OS to utilize the physical memory more efficiently and to isolate the address space of each process. The CPU translates virtual addresses to physical addresses before memory accesses are performed. The OS sets the virtual address translations (also called "mappings") according to its policies and considerations.

The memory mappings of each address space are kept in a memory-resident data structure, which is defined by the CPU architecture. The most common data structure, used by the x86 architecture, is a radix tree, also known as a page-table hierarchy. The leaves of the tree, called page-table entries (PTEs), hold the translations of fixed-size virtual memory pages to physical frames. To translate a virtual address into a physical address, the CPU incorporates a memory management unit (MMU), which performs a "page-table walk" on the page-table hierarchy, checking access permissions at every level. During a page walk, the MMU updates the status bits in each PTE, indicating whether the page was read from and/or written to (dirtied).

To avoid frequent page-table walks and their associated latency, the MMU caches translations of recently used pages in a translation lookaside buffer (TLB). In the x86 architecture, these caches are maintained by the hardware, bringing translations into the cache after page walks and evicting them according to an implementation-specific cache replacement policy. Each x86 core holds a logically private TLB. Unlike memory caches, TLBs of different CPUs are not kept coherent by hardware. Specifically, x86 CPUs do not maintain coherence between the TLB and the page tables, nor among the TLBs of different cores. As a result, page-table changes may leave stale entries in the TLBs until coherence is restored by the OS. The instruction set enables the OS to do so by flushing ("invalidating") individual PTEs or the entire TLB. Global and individual TLB flushes can only be performed locally, on the TLB of the core that executes the flush instruction.

Although the TLB is essential to attain reasonable translation latency, some workloads experience frequent TLB misses [4]. Recently, new features were introduced into the x86 architecture to reduce the number and latency of TLB misses. An instruction-set extension allows each page-table hierarchy to be associated with an address-space ID (ASID), avoiding TLB flushes during address-space switches and thus reducing the number of TLB misses. Microarchitectural enhancements introduced page-walk caches that enable the hardware to cache internal nodes of the page-table hierarchy, thereby reducing TLB-miss latencies [3].
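The walk and status-bit updates described above can be sketched with an in-memory radix tree. This is a simplified model assuming 4-KiB pages only, with permission checks and huge pages omitted; in this user-space model a non-leaf entry's address field holds the (4-KiB-aligned) address of the next table. The accessed and dirty bit positions (bits 5 and 6) match the x86 PTE format:

```c
/* Model of the 4-level x86-64 page-table walk, including the
 * accessed/dirty ("status") bit updates the MMU performs. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PTE_PRESENT  (1ULL << 0)
#define PTE_ACCESSED (1ULL << 5)          /* set by the MMU on any walk */
#define PTE_DIRTY    (1ULL << 6)          /* set on writes, leaf level only */
#define ADDR_MASK    0x000ffffffffff000ULL

typedef uint64_t pte_t;

/* Each paging level selects 9 bits of the virtual address. */
static unsigned idx(uint64_t va, int level) {
    return (va >> (12 + 9 * level)) & 0x1ff;
}

/* Walk from the root table; returns the physical address, or 0 if a
 * level is not present (where real hardware would raise a page fault). */
uint64_t page_walk(pte_t *root, uint64_t va, int is_write) {
    pte_t *table = root;
    for (int level = 3; level >= 0; level--) {
        pte_t *pte = &table[idx(va, level)];
        if (!(*pte & PTE_PRESENT))
            return 0;                     /* would raise #PF */
        *pte |= PTE_ACCESSED;             /* the status-bit update */
        if (level == 0) {
            if (is_write)
                *pte |= PTE_DIRTY;
            return (*pte & ADDR_MASK) | (va & 0xfff);
        }
        table = (pte_t *)(uintptr_t)(*pte & ADDR_MASK);
    }
    return 0; /* unreachable */
}

int main(void) {
    /* Build one mapping: va 0x0..0xfff -> frame 0x42000. Tables are
     * 4-KiB aligned so their addresses fit the PTE address field. */
    pte_t *t[4];
    for (int i = 0; i < 4; i++) {
        t[i] = aligned_alloc(4096, 4096); /* 512 eight-byte entries */
        memset(t[i], 0, 4096);
    }
    t[3][0] = (uint64_t)(uintptr_t)t[2] | PTE_PRESENT;
    t[2][0] = (uint64_t)(uintptr_t)t[1] | PTE_PRESENT;
    t[1][0] = (uint64_t)(uintptr_t)t[0] | PTE_PRESENT;
    t[0][0] = 0x42000ULL | PTE_PRESENT;

    printf("pa = 0x%llx\n",               /* prints 0x42123 */
           (unsigned long long)page_walk(t[3], 0x123, 1));
    printf("accessed=%d dirty=%d\n",      /* both set by the walk */
           !!(t[0][0] & PTE_ACCESSED), !!(t[0][0] & PTE_DIRTY));
    return 0;
}
```

The accessed bit set by the walk at the leaf level is exactly the signal the paper later exploits: a clear bit means no core has walked to, and hence cached, that mapping.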
2.2 TLB Software Challenges

The x86 architecture leaves maintaining TLB coherency to the OSes, which often requires frequent TLB invalidations after PTE changes. OS kernels can make such PTE changes independently of the running processes, upon memory migration across NUMA nodes [2], memory deduplication [49], memory reclamation, and memory compaction for accommodating huge pages [14]. Processes can also trigger PTE changes by using system calls, for example mprotect, which changes the protection of a memory range, or by writing to copy-on-write (COW) pages.

These PTE changes can require a TLB flush to avoid caching of stale PTEs in the TLB. We distinguish between two types of flushes, local and remote, in accordance with the core that initiated the PTE change. Remote TLB flushes are significantly more expensive, since most CPUs cannot flush remote TLBs directly. OSes therefore perform a TLB shootdown: the initiating core sends an inter-processor interrupt (IPI) to the remote cores and waits for their interrupt handlers to invalidate their TLBs and acknowledge that they are done.

TLB shootdowns introduce a variety of overheads. IPI delivery can take several hundreds of cycles [5]. Then, the IPI may be kept pending if the remote core has interrupts disabled, for instance while running a device driver [13]. The x86 architecture does not allow OSes to flush multiple PTEs efficiently, requiring the OS to either incur the overhead of multiple flushes or flush the entire TLB and increase the TLB miss rate. In addition, TLB flushes may indirectly cause lock contention, since they are often performed while the OS holds a lock [11, 15] (a user-space sketch of such shootdown-triggering PTE changes appears below).

2.4 Per-Core Page Tables
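The shootdown-triggering PTE changes of §2.2 can be reproduced from user space with a minimal Linux sketch: worker threads keep a region's PTEs cached in their cores' TLBs while the main thread repeatedly changes the region's protection with mprotect. On x86 SMP kernels the resulting IPIs can typically be observed in the "TLB" row of /proc/interrupts. The page and thread counts are arbitrary:

```c
/* Demonstrates mprotect-induced TLB shootdowns in a multithreaded
 * process. Compile with: gcc -pthread mprotect_demo.c */
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 256
#define PAGE   4096

static volatile char *buf;
static volatile int stop;

/* Workers touch every page so their cores cache the PTEs in the TLB. */
static void *toucher(void *arg) {
    (void)arg;
    while (!stop)
        for (int i = 0; i < NPAGES; i++)
            (void)buf[(size_t)i * PAGE];
    return NULL;
}

int main(void) {
    buf = mmap(NULL, NPAGES * PAGE, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset((void *)buf, 1, NPAGES * PAGE); /* populate the pages */

    pthread_t t[3];
    for (int i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, toucher, NULL);
    sleep(1); /* let the worker cores fill their TLBs */

    for (int i = 0; i < 1000; i++) {
        /* Each protection change modifies PTEs that remote TLBs may
         * cache, so the kernel must perform a TLB shootdown. */
        mprotect((void *)buf, NPAGES * PAGE, PROT_READ);
        mprotect((void *)buf, NPAGES * PAGE, PROT_READ | PROT_WRITE);
    }

    stop = 1;
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

Running the same loop in a single-threaded process avoids the remote flushes entirely, which is one way to observe the local/remote cost gap that §2.2 describes.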