Fast Local Page-Tables for Virtualized NUMA Servers with vMitosis
Ashish Panwar (Indian Institute of Science), Reto Achermann (University of British Columbia), Arkaprava Basu (Indian Institute of Science), Abhishek Bhattacharjee (Yale University), K. Gopinath (Indian Institute of Science), Jayneel Gandhi (VMware Research)

ABSTRACT

Increasing heterogeneity in the memory system mandates careful data placement to hide the non-uniform memory access (NUMA) effects on applications. However, NUMA optimizations have predominantly focused on application data in the past decades, largely ignoring the placement of kernel data structures due to their small memory footprint; this is evident in typical OS designs that pin kernel objects in memory. In this paper, we show that careful placement of kernel data structures is gaining importance in the context of page-tables: sub-optimal placement of page-tables causes severe slowdown (up to 3.1×) on virtualized NUMA servers.

In response, we present vMitosis, a system for explicit management of two-level page-tables, i.e., the guest and extended page-tables, on virtualized NUMA servers. vMitosis enables faster address translation by migrating and replicating page-tables. It supports two prevalent virtualization configurations: first, where the hypervisor exposes the NUMA architecture to the guest OS, and second, where such information is hidden from the guest OS. vMitosis is implemented in Linux/KVM, and our evaluation on a recent 1.5TiB 4-socket server shows that it effectively eliminates NUMA effects on 2D page-table walks, resulting in a speedup of 1.8–3.1× for Thin (single-socket) and 1.06–1.6× for Wide (multi-socket) workloads.

CCS CONCEPTS

· Software and its engineering → Operating systems; Virtual memory.

KEYWORDS

NUMA; TLB; Linux; KVM; 2D page-tables; replication; migration

ACM Reference Format:
Ashish Panwar, Reto Achermann, Arkaprava Basu, Abhishek Bhattacharjee, K. Gopinath, and Jayneel Gandhi. 2021. Fast Local Page-Tables for Virtualized NUMA Servers with vMitosis. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21), April 19–23, 2021, Virtual, USA. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3445814.3446709

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ASPLOS '21, April 19–23, 2021, Virtual, USA
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8317-2/21/04...$15.00
https://doi.org/10.1145/3445814.3446709

1 INTRODUCTION

Applications suffer non-uniform memory access (NUMA) latencies on modern multi-tier memory systems. As computer systems embrace even more heterogeneity in the memory subsystem, with innovations in die-stacked DRAM, high-bandwidth memory (HBM), higher socket counts, and multi-chip module-based designs, the speed differences between local and remote memory continue to grow and become more complex to reason about [52, 77]. Carefully placing, replicating, and migrating data among memory devices with variable latency and bandwidth is of paramount importance to the success of these technologies, and much work remains to be done on these topics.

However, while there is at least some prior work on data placement for application pages [2, 18, 24, 41, 61], kernel data structures have been largely left out of this discussion, primarily due to their small memory footprint. Consequently, most kernel objects are pinned and unmovable in typical OS designs [33, 58]. We argue that the access latency of kernel objects is gaining importance. This paper focuses on one critical kernel data structure, the page-table, and shows that virtualized NUMA servers must carefully reason about its placement to enable high performance.

Why focus on page-tables? Page-tables are vital to overall system performance. First, big-memory workloads require DRAM accesses on frequent page-table walks due to high TLB miss rates [4, 14, 30]. As system memory capacities grow to meet ever-increasing workload data demand, page-tables grow proportionally, outstripping the coverage of hardware TLBs. Larger address spaces require additional levels in page-tables (e.g., Intel's 5-level page-tables). TLB misses under virtualization are already expensive: a 2D page-table walk over guest page-tables (henceforth gPT) and extended page-tables (henceforth ePT) requires up to 24 memory accesses, which will increase to 35 with 5-level page-tables. Finally, a page-table walk does not benefit from memory-level parallelism as it is an inherently serial process; each long DRAM access adds latency to address translation [14].

On a single-socket machine, frequent 2D page-table walks have been shown to add 10–50% execution overhead on important applications [4, 30, 44]. However, in this paper, we show that these overheads are as high as 3.1× on multi-socket machines due to sub-optimal page-table placement.

Misplacement of page-tables occurs in two use-cases on NUMA machines. First, Thin workloads/VMs (i.e., those fitting within a single NUMA socket) are occasionally migrated across NUMA sockets. Commodity OSes and hypervisors use migration to improve resource utilization and performance isolation. Some real-world examples include VMware vSphere, which performs periodic NUMA re-balancing of VMs every two seconds [64], and Linux/KVM, which migrates processes to improve load-balancing and performance under memory sharing on NUMA systems [22, 73]. While data pages are often migrated along with the threads, current systems usually pin kernel objects in memory [33, 58]. Consequently, workload/VM migration can make page-tables permanently remote. Second, Wide workloads/VMs (i.e., those spanning multiple NUMA sockets) experience remote page-table walks due to a single copy of the page-table; their virtual-to-physical address translations are requested from multiple sockets, but each page-table entry is local to only one of the sockets.

We highlighted the importance of careful page-table placement for native systems in our recent work, Mitosis [1]. In contrast, in this paper, we analyze the effect of NUMA on 2D page-tables and present vMitosis, a system that extends the design principles of Mitosis to virtualized environments.

vMitosis provides various mechanisms to mitigate NUMA effects on 2D page-table walks for both of the use-cases discussed above. Our design applies migration and replication to page-tables to ensure that TLB misses are serviced from local memory. While replication and migration are well-known NUMA management techniques, virtualization-specific challenges make their practical realization non-trivial. For instance, a hypervisor may or may not expose the host platform's NUMA topology to a VM. We refer to VMs exposed to the host NUMA topology as NUMA-visible and to VMs not exposed to such information as NUMA-oblivious.

In the NUMA-oblivious configuration, the guest OS is exposed to a flat topology in which all memory and virtual CPUs (vCPUs) are grouped in a single virtual socket. This configuration provides great flexibility to cloud service providers, as NUMA-oblivious VMs can be freely migrated for maintenance, power management, or better consolidation. Further, CPUs and memory can be added to or removed from NUMA-oblivious VMs dynamically, irrespective of NUMA locality. Most of the VMs on major cloud platforms are available under this configuration [66].

              Guest page-tables               Extended page-tables
System        Migration         Replication  Migration  Replication
Linux/KVM     No                No           No         No
Mitosis       via Replication*  Yes*         No         No
vMitosis      Yes               Yes          Yes        Yes

Table 1: NUMA support for page-tables in state-of-the-art systems. (*) Replication is possible in Mitosis only if the server's NUMA topology is exposed to the guest OS.

Contributions over Mitosis: Our recent proposal Mitosis replicates the page-tables on native systems. vMitosis makes several novel contributions over Mitosis (see Table 1). First, Mitosis relies on the availability of the server's NUMA topology for replicating the page-tables. In bare-metal servers, the OS extracts platform specifications from ACPI tables. However, the hardware abstraction layer in virtualized systems often hides platform details from the guest OS. Therefore, Mitosis can replicate gPT only in NUMA-visible VMs. vMitosis supports both VM configurations; it re-uses Mitosis in NUMA-visible VMs but introduces two novel techniques for replicating gPT in NUMA-oblivious VMs (§3.3).

Second, Mitosis does not provide ePT-level optimizations; vMitosis does. Finally, vMitosis handles page-table migration differently from Mitosis. To migrate page-tables, Mitosis first replicates them on the destination socket, configures the system to use the new replica, and then releases the old replica. In contrast, vMitosis incrementally migrates page-tables when the OS/hypervisor migrates data pages (§3.2). For single-socket workloads, the incremental page-table migration of vMitosis provides similar address translation performance as the pre-replicated page-tables of Mitosis, but with lower space and runtime overheads.

2 ANALYSIS OF 2D PAGE-TABLE PLACEMENT

We start by uncovering the sources of remote DRAM accesses during page-table walks and quantifying their impact on performance with a range of memory-intensive applications listed in Table 2.