
SpaceJMP: Programming with Multiple Virtual Address Spaces

Izzat El Hajj (3,4,*), Alexander Merritt (2,4,*), Gerd Zellweger (1,4,*), Dejan Milojicic (4), Reto Achermann (1), Paolo Faraboschi (4), Wen-mei Hwu (3), Timothy Roscoe (1), and Karsten Schwan (2)

(1) Department of Computer Science, ETH Zürich  (2) Georgia Institute of Technology  (3) University of Illinois at Urbana-Champaign  (4) Hewlett Packard Labs

*Co-primary authors

Abstract

Memory-centric computing demands careful organization of the virtual address space, but traditional methods for doing so are inflexible and inefficient. If an application wishes to address larger physical memory than virtual address bits allow, if it wishes to maintain pointer-based data structures beyond process lifetimes, or if it wishes to share large amounts of memory across simultaneously executing processes, legacy interfaces for managing the address space are cumbersome and often incur excessive overheads.

We propose a new operating system design that promotes virtual address spaces to first-class citizens, enabling process threads to attach to, detach from, and switch between multiple virtual address spaces. Our work enables data-centric applications to utilize vast physical memory beyond the virtual range, represent persistent pointer-rich data structures without special pointer representations, and share large amounts of memory between processes efficiently.

We describe our prototype implementations in the DragonFly BSD and Barrelfish operating systems. We also present programming semantics and a compiler transformation to detect unsafe pointer usage. We demonstrate the benefits of our work on data-intensive applications such as the GUPS benchmark, the SAMTools genomics workflow, and the Redis key-value store.

1. Introduction

The volume of data processed by applications is increasing dramatically, and the amount of physical memory in machines is growing to meet this demand. However, effectively using this memory poses challenges for programmers. The main challenges tackled in this paper are addressing more physical memory than the size of a virtual address space, maintaining pointer-based data structures across process lifetimes, and sharing very large memory objects between processes.

We expect that main memory capacity will soon exceed the virtual address space size supported by CPUs today (typically 256 TiB with 48 virtual address bits). This is made particularly likely with non-volatile memory (NVM) devices – having larger capacity than DRAM – expected to appear in memory systems by 2020. While some vendors have already increased the number of virtual address (VA) bits in their processors, adding VA bits has implications on performance, power, production, and estate cost that make it undesirable for low-end and low-power processors. Adding VA bits also implies longer TLB miss latency, which is already a bottleneck in current systems (up to 50% overhead in scientific apps [55]). Processors supporting virtual address spaces smaller than available physical memory require large data structures to be partitioned across multiple processes or these structures to be mapped in and out of the virtual address space.

Representing pointer-based data structures beyond process lifetimes requires data serialization, which incurs a large performance overhead, or the use of special pointer representations, which results in awkward programming techniques. Sharing and storing pointers in their original form across process lifetimes requires guaranteed acquisition of specific VA locations. Providing this guarantee is not always feasible, and when feasible, may necessitate mapping datasets residing at conflicting VA locations in and out of the virtual address space.

Sharing data across simultaneously executing processes requires special communication with data-serving processes (e.g., using sockets), impacting programmability and incurring communication channel overheads. Sharing data via traditional shared memory requires tedious communication and synchronization between all client processes for growing the shared region or guaranteeing consistency on writes.
To address these challenges, we present SpaceJMP, a set of APIs, OS mechanisms, and compiler techniques that constitute a new way to manage memory. SpaceJMP applications create and manage Virtual Address Spaces (VASes) as first-class objects, independent of processes. Decoupling VASes from processes enables a single process to activate multiple VASes, such that threads of that process can switch between these VASes in a lightweight manner. It also enables a single VAS to be activated by multiple processes or to exist in the system alone in a self-contained manner.

Figure 1: Memory region construction (mmap) and removal (munmap) costs using 4 KiB pages. Does not include page-zeroing costs.

SpaceJMP solves the problem of insufficient VA bits by allowing a process to place data in multiple address spaces. If a process runs out of virtual memory, it does not need to modify data mappings or create new processes. It simply creates more address spaces and switches between them. SpaceJMP solves the problem of managing pointer-based data structures because SpaceJMP processes can always guarantee the availability of a VA location by creating a new address space. SpaceJMP solves the problem of data sharing by allowing processes to switch into a shared address space. Clients thus need not communicate with a server process or synchronize with each other on shared region management, and synchronization on writes can be lightweight.

SpaceJMP builds on a wealth of historical techniques in memory management and virtualization but occupies a novel "sweet spot" for modern data-centric applications: it maps well to existing hardware while delivering more flexibility and performance for large data-centric applications compared with current OS facilities.

We make the following contributions:

1. We present SpaceJMP (Section 3): the OS facilities provided for processes to each create, structure, compose, and access multiple virtual address spaces, together with the compiler support for detecting unsafe program behavior.

2. We describe and compare (Section 4) implementations of SpaceJMP for the 64-bit x86 architecture in two different OSes: DragonFly BSD and Barrelfish.

3. We empirically show (Section 5) how SpaceJMP improves performance with microbenchmarks and three data-centric applications: the GUPS benchmark, the SAMTools genomics workflow, and the Redis key-value store.
2. Motivation

Our primary motivations are the challenges we have already mentioned: insufficient VA bits, managing pointer-based data structures, and sharing large amounts of memory – discussed in Sections 2.1, 2.2, and 2.3 respectively.

2.1 Insufficient Virtual Address Bits

Non-Volatile Memory (NVM) technologies, besides persistence, promise better scaling and lower power than DRAM, which will enable large-scale, densely packed memory systems with much larger capacity than today's DRAM-based computers. Combined with high-radix optical switches [76], these future memory systems will appear to the processing elements as single, petabyte-scale "load-store domains" [33].

One implication of this trend is that the physical memory accessible from a CPU will exceed the size that VA bits can address. While almost all modern processor architectures support 64-bit addressing, CPU implementations pass fewer bits to the virtual memory translation unit because of power and performance implications. Most CPUs today are limited to 48 virtual address bits (i.e., 256 TiB) and 44–46 physical address bits (16–64 TiB), and we expect a slow growth in these numbers.

The challenge presented is how to support applications that want to address large physical memories without paying the cost of increasing address bits across the board. One solution is partitioning physical memory across multiple OS processes, which incurs unnecessary inter-process communication overhead and is tedious to program. Another solution is mapping memory partitions in and out of the VAS, which has overheads discussed in Section 2.4.

2.2 Maintaining Pointer-based Data Structures

Maintaining pointer-based data structures beyond process lifetimes without serialization overhead has motivated emerging persistent memory programming models to adopt region-based programming paradigms [15, 21, 78]. While these approaches provide a more natural means for applications to interact with data, they present the challenge of how to represent pointers meaningfully across processes.

One solution is the use of special pointers, but this hinders programmability. Another solution is requiring memory regions to be mapped to fixed virtual addresses by all users. This solution creates degenerate scenarios where memory is mapped in and out when memory regions overlap, the drawbacks of which are discussed in Section 2.4.

2.3 Sharing Large Memory

Collaborative environments where multiple processes share large amounts of data require clean mechanisms for processes to access and manage shared data efficiently and safely. One approach uses a client-server model whereby client processes communicate with key-value stores [24, 63, 67] that manage and serve the data. This approach requires code to be rewritten to use specific interfaces and incurs communication overhead. Another approach is using shared memory regions, but current OS mechanisms for doing so have many limitations discussed in Section 2.4.

2.4 Problems with Legacy Methods

Legacy methods for interacting with memory are based on interfaces exposed over a file abstraction. As systems become increasingly memory-centric, these interfaces introduce performance bottlenecks that limit scalability [35], and their complexity and usability create challenges for programmers.

Changing memory maps in the critical path of an application has significant performance implications, and scaling to large memories further exacerbates the problem. Using file-based methods such as mmap is slow and not scalable [20]. Figure 1 shows that constructing page tables for a 1 GiB region using 4 KiB pages takes about 5 ms; for 64 GiB the cost is about 2 seconds. Frequent use of such interfaces quickly becomes an expensive operation, especially when applications must access all pages within a region.

In addition to performance implications, the flexibility of legacy interfaces is also a major concern. Memory-centric computing demands careful organization of the virtual address space, but interfaces such as mmap only give limited control. Some systems do not support creation of address regions at specific offsets. In Linux, for example, mmap does not safely abort if a request is made to open a region of memory over an existing region; it simply writes over it. Moreover, in current systems, memory sharing is tied to coarse-grained protection bits configured on the backing file, such as user/group bits, or access control lists (ACLs). This requires a translation between different security models which may want to operate at different granularities. That, along with linkers that dynamically relocate libraries or OSes that randomize the stack allocation, prevents applications from sharing memory effectively.

Figure 2: Contrasting SpaceJMP and Unix address spaces. Text, data, and stack segments are shared across address spaces.

    VAS API – for applications:
        vas_find(name) → vid
        vas_create(name, perms) → vid
        vas_clone(vid) → vid
        vas_attach(vid) → vh
        vas_detach(vh)
        vas_switch(vh)
        vas_ctl(cmd, vid[, arg])

    Segment API – for library developers:
        seg_find(name) → sid
        seg_alloc(name, base, size, perms) → sid
        seg_clone(sid) → sid
        seg_attach(vid, sid)    seg_attach(vh, sid)
        seg_detach(vid, sid)    seg_detach(vh, sid)
        seg_ctl(sid, cmd[, arg])

Figure 3: The SpaceJMP API.

3. Design

We now describe the design of SpaceJMP and define the terminology used in this paper. SpaceJMP provides two key abstractions: lockable segments encapsulate sharing of in-memory data, and virtual address spaces represent sets of non-overlapping segments. We also describe semantics for safe programming of segments and address spaces.

3.1 Lockable Segments

In SpaceJMP, all data and code used by the system exist within segments. The term has been used to refer to many different concepts over the years; in SpaceJMP, a segment should be thought of as an extension of the model used in Unix to hold code, data, stack, etc.: a segment is a single, contiguous area of virtual memory containing code and data, with a fixed virtual start address and size, together with metadata to describe how to access the content in memory. With every segment we store the backing physical frames, the mapping from its virtual addresses to physical frames, and the associated access rights.

For safe access to regions of memory, all SpaceJMP segments can be lockable. In order to switch into an address space, the OS must acquire a reader/writer lock on each lockable segment in that address space. Each lock acquisition is tied to the access permissions for its corresponding segment: if the segment is mapped read-only, the lock will be acquired in a shared mode, supporting multiple readers (i.e., multiple reading address spaces) and no writers. Conversely, if the segment is mapped writable, the lock is acquired exclusively, ensuring that only one client at a time is allowed in an address space with that segment mapped.

Lockable segments are the unit of data sharing and protection provided in SpaceJMP. Together with address space switching, they provide a fast and secure way to guarantee safe and concurrent access to memory regions to many independent clients.

3.2 Multiple Virtual Address Spaces

In most contemporary OSes, a process or kernel thread is associated with a single virtual address space (VAS), assigned when the execution context is created.
In contrast, SpaceJMP virtual address spaces are first-class OS objects, created and manipulated independently of threads or processes. The contrast between SpaceJMP and traditional Unix-like (or other) OSes is shown in Figure 2. SpaceJMP can emulate regular Unix processes (and does, as we discuss in the next section), but provides a way to rapidly switch sections of mappings to make different sets of segments accessible.

The SpaceJMP API is shown in Figure 3. A process in SpaceJMP can create, delete, and enumerate VASes. A VAS can then be attached by one or more processes (via vas_attach), and a process can switch its context between different VASes to which it is attached at any point during its execution. A VAS can also continue to exist beyond the lifetime of its creating process (e.g., it may be attached to other processes).

The average user only needs to worry about manipulating VASes. Advanced users and library developers may directly manipulate segments within the VAS. Heap allocators, for instance, must manage the segments that service the allocations (see Section 4). Segments can be attached to, or detached from, a VAS. In practice, some segments (such as global OS mappings, code segments, and thread stacks) are widely shared between VASes attached to the same process. seg_attach and seg_detach allow installing segments that are process-specific, by using the VAS handle (vh), or globally for all processes attached to a VAS, by using the vid.

Rather than reinventing the wheel, our permission model for segments and address spaces uses existing security models available in the OS. For example, in DragonFly BSD, we rely on ACLs to restrict access to segments and address spaces for processes or process groups. In Barrelfish, we use the capability system provided by the OS (Section 4.2) to control access. To modify permissions, the user can clone a segment or VAS and use seg_ctl or vas_ctl to change the metadata of the new object (e.g., permissions) accordingly.

A spawned process still receives its initial VAS from the OS. The exposed functionality to the user process, however, allows it to construct and use additional VASes. A typical call sequence to create a new VAS is shown in Figure 4.

    type *t; vasid_t vid; vhandle_t vh; segid_t sid;

    // Example use of segment API.
    va = 0xC0DE; sz = (1UL << 35);
    vid = vas_create("v0", 660);
    sid = seg_alloc("s0", va, sz, 660);
    seg_attach(vid, sid);

    // Example use of VAS API.
    vid = vas_find("v0");
    vh = vas_attach(vid);
    vas_switch(vh);
    t = malloc(...); *t = 42;

Figure 4: Example SpaceJMP usage.

3.3 Programming Semantics for Safe Execution

Multiple VASes per process is a powerful feature for writing applications in a data-driven world. However, this feature introduces new kinds of unsafe memory access behavior that programmers must carefully avoid. We provide the user with a compiler tool to detect such unsafe behavior.

This section defines semantics for programming with SpaceJMP safely. The common region refers to the memory segments that a process maps to all VASes, such as the stack, code, and globals. A VAS should not store a pointer that points to another VAS, except in the common region, which is shared by all VASes. Therefore any pointer not in the common region of a VAS is valid only when that VAS is active. Moreover, pointers to the common region should only be stored in the common region because they are not meaningful to other processes that may attach a VAS. This ensures that a VAS remains independent from any process.

The following rules define safe and unsafe behavior:

• If a pointer p is obtained while a VAS v is active, via allocation in or loading from the non-common region:
  – Dereferencing p is safe in v and unsafe otherwise.
  – Storing to p a pointer q is safe if q points to the non-common region of v, and unsafe otherwise.
• If a pointer p is obtained via allocation on the stack or is a global or function pointer:
  – Dereferencing and storing to p is always safe.
• If a pointer p is loaded from the common region:
  – The safety properties of p must be the same as those of the originally stored pointer.

Compiler support for detecting violations of these rules is discussed in Section 4.3.
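To make these rules concrete, the sketch below shows one safe and one unsafe access pattern. It assumes the VAS API of Figure 3 with illustrative C prototypes; the handles v1 and v2 and the overall structure are invented for the example and are not taken from the SpaceJMP distribution.

    #include <stdlib.h>

    /* Assumed SpaceJMP API from Figure 3 (prototypes are illustrative). */
    typedef int vasid_t;  typedef int vhandle_t;
    extern vasid_t   vas_find(const char *name);
    extern vhandle_t vas_attach(vasid_t vid);
    extern void      vas_switch(vhandle_t vh);

    long *shared_slot;                /* global: lives in the common region */

    void example(vhandle_t v1, vhandle_t v2)
    {
        vas_switch(v1);
        long *p = malloc(sizeof *p);  /* p is valid only in VAS v1          */
        *p = 1;                       /* safe: v1 is the active VAS         */

        long on_stack = 2;
        long *q = &on_stack;          /* stack pointer: common region       */
        *q = 3;                       /* safe in any VAS                    */

        vas_switch(v2);
        /* *p = 4;  -- unsafe: p points into v1's non-common region while
         * v2 is active; the pass of Section 4.3 would flag this and
         * insert a runtime check. */
        shared_slot = q;              /* safe: common-region pointer stored
                                       * in the common region               */

        vas_switch(v1);
        *p = 5;                       /* safe again: v1 is active           */
        free(p);
    }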
4. Implementation

We built 64-bit x86 implementations of SpaceJMP for the DragonFly BSD and Barrelfish operating systems, which occupy very different points in the design space for memory management and provide a broad perspective on the implementation trade-offs for SpaceJMP. We first discuss the DragonFly BSD implementation, since it is the most mature and the platform for the majority of results in Section 5.

4.1 DragonFly BSD

DragonFly BSD [27] is a scalable, minimalistic kernel originally derived from FreeBSD 4, supporting only the x86-64 architecture. Our SpaceJMP implementation includes modifications to the BSD memory subsystem and an address-space-aware implementation of the user-space malloc interfaces.

Memory subsystem: DragonFly BSD has a nearly identical memory system to FreeBSD, both of which derive from the memory system in the Mach kernel [65]. The following descriptions therefore also apply to similar systems.

The BSD memory subsystem is based on the concept of VM objects [65] which abstract storage (files, swap, raw memory, etc.). Each object contains a set of physical pages that hold the associated content. A SpaceJMP segment is a wrapper around such an object, backed only by physical memory, additionally containing global identifiers (e.g., a name) and protection state. Physical pages are reserved at the time a segment is created, and are not swappable. Furthermore, a segment may contain a set of cached translations to accelerate attachment to an address space. The translations may be globally cached in the kernel if the segment is shared with other processes.

Address spaces in BSD, as in Linux, are represented by two layers: a high-level set of region descriptors (virtual offset, length, permissions), and a single instance of the architecture-specific translation structures used by the CPU. Each region descriptor references a single VM object, to inform the page fault handler where to ask for physical pages.
A SpaceJMP segment can therefore be created as a VM object and added directly to an address space, with minimal modifications made to the core OS kernel implementation. Sharing segments with other processes is straightforward to implement, as a segment can always be attached to an existing address space, provided it does not overlap a previously mapped region. Sharing an address space is slightly more complicated: as mentioned in Section 3, we only share segments that are also visible to other processes. Doing so allows applications to attach their own private segments (such as their code or their stack) into an address space before they can switch into it. Those segments would typically conflict heavily, as every process tends to have a stack and program code at roughly the same location. The underlying address space representation that BSD uses (the vmspace object) always represents a particular instance of an address space with concrete segments mapped. Therefore, sharing a vmspace instance directly with other processes will not work. Instead, when sharing an address space, the OS shares just the set of memory segments that comprise the VAS. A process then attaches to the VAS, creating a vmspace, which it can then switch into.

A slight modification of the process context structure was necessary to hold references to more than one vmspace object, along with a pointer to the current address space. Knowing the active address space is required for correct operation of the page fault handler and of system calls which modify the address space, such as the segment attach operation.

Inadvertent address collisions may arise between segments, such as those for stack, heap, and code. For example, attaching to a VAS may fail if a (global) segment within it conflicts with the attaching process' (private, fixed-address) code segments. Our current implementation in DragonFly BSD avoids this by ensuring both globally visible and process-private segments are created in disjoint address ranges.

Runtime library: We developed a library to facilitate application development using the SpaceJMP kernel interface, avoiding complexities such as locating boundaries of process code, globals, and stack for use within an address space across switches. The library performs much of the bookkeeping work involved in attaching an address space to a process: private segments holding the process' program text, globals, and thread-specific stacks are attached to the process-local vmspace object by using the VAS handle.

Furthermore, the library provides allocation of heap space (malloc) within a specific segment while inside an address space. SpaceJMP complicates heap management since programs need to allocate memory from different segments depending on their needs. These segments may not be attached to every address space of the process, and moreover a call to free up memory can only be executed by a process if it is currently in an address space which has the corresponding segment attached.

To manage this complexity, the SpaceJMP allocator is built over Doug Lea's dlmalloc [48], providing the notion of a memory space (mspace). An mspace is an allocator's internal state and may be placed at arbitrary locations. Our library supports distinct mspaces for individual segments, and provides wrapper functions for malloc and free which supply the correct mspace instance to dlmalloc, depending on the currently active address space and segment.
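As an illustration of this design, the sketch below shows how such a wrapper might route requests to a per-segment dlmalloc mspace. The dlmalloc mspace calls are real, but the bookkeeping structure and the current_seg_heap lookup are hypothetical stand-ins for the library's internal state, not the actual SpaceJMP implementation.

    #include <stddef.h>

    /* dlmalloc's mspace interface (Doug Lea's malloc, MSPACES build). */
    typedef void *mspace;
    extern mspace create_mspace_with_base(void *base, size_t capacity, int locked);
    extern void  *mspace_malloc(mspace msp, size_t bytes);
    extern void   mspace_free(mspace msp, void *mem);

    /* Hypothetical per-segment bookkeeping kept by the runtime library. */
    struct seg_heap {
        void  *base;    /* segment's virtual base address   */
        size_t size;    /* segment size                     */
        mspace msp;     /* dlmalloc state placed inside it  */
    };

    /* Hypothetical lookup: heap segment of the currently active VAS. */
    extern struct seg_heap *current_seg_heap(void);

    void seg_heap_init(struct seg_heap *h)
    {
        /* Place the allocator state inside the segment itself, so the
         * heap metadata travels with the segment across address spaces. */
        h->msp = create_mspace_with_base(h->base, h->size, 0);
    }

    void *vas_malloc(size_t bytes)
    {
        return mspace_malloc(current_seg_heap()->msp, bytes);
    }

    void vas_free(void *mem)
    {
        /* Only valid while an address space with the owning segment
         * attached is active, as described above. */
        mspace_free(current_seg_heap()->msp, mem);
    }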
VAS switching: Attaching to a global VAS creates a new process-private instance of a vmspace object, where the process code, stack, and global regions are mapped in. A program or runtime initiates a switch via a system call; the kernel identifies the vmspace object specified by the call, then simply overwrites CR3 (the control register indicating the currently active page table on an x86 CPU core) with the root page table of the new VAS. Other registers, such as the stack pointer, are not modified. A kernel may also transparently initiate a switch, e.g., after a page fault, or triggered via any other event.

4.2 Barrelfish

Barrelfish [7] is a research operating system structured as a "multikernel", a distributed system of cores communicating solely via asynchronous messages.

In contrast to BSD, Barrelfish prohibits dynamic memory allocation in the kernel. Instead, Barrelfish has user-space memory servers which allocate physical memory to applications. Each memory region is typed using a capability system inspired by seL4 [31]. Retyping of memory is checked by the kernel and performed by system calls. The security model guarantees that a user-space process can allocate memory for its own page tables (and other kernel-level objects) and frames for mapping memory into the virtual address spaces.

The capability system provides fine-grained access control to resources, allowing processes great flexibility in implementing policies, and safely performs security-relevant actions via explicit capability invocations. Therefore, SpaceJMP is implemented almost entirely in user space with no additional logic added to the kernel: all VAS management operations translate into explicit capability invocations.

We use this functionality to explicitly share and modify page tables without kernel participation. In particular, we implemented a SpaceJMP service to track created VASes in the system, together with attached segments and attaching processes, similar to the kernel extensions employed in BSD. Processes interact with the user-level service via RPCs.

Upon attaching to a VAS, a process obtains a new capability to a root page table to be filled in with mapping information. Switching to the VAS is a capability invocation to replace the thread's root page table with the one of the VAS. This is guaranteed to be a safe operation as the capability system enforces only valid mappings. Initially, all page tables other than the root of a VAS are shared among the attached processes, allowing easy propagation of updated mappings.

The Barrelfish kernel is therefore unaware of SpaceJMP objects. To enforce their proper reclamation we can rely on the capability revocation mechanism: revoking the process' root page table prohibits the process from switching into the VAS. The result is a pure user-space implementation of SpaceJMP, enabling us to enforce custom policies, such as alignment constraints or selecting alternative page sizes.
4.3 Compiler Support

This section describes the implementation of a compiler tool for detecting unsafe behavior according to the semantics presented in Section 3.3. A trivial solution would be to tag all pointers with the ID of the address space they belong to, and insert checks on the tags before any pointer dereference. However, because checking every pointer dereference is too conservative, we present a compiler analysis to prove when dereferences are safe, and a transformation that only inserts checks where safety cannot be proven statically. These are briefly described in this section while detailed algorithms, optimizations, and evaluation are left for future work.

The analysis begins by finding the potentially active VASes at each program point and the VASes each pointer may be valid in. It produces the following sets of information:

• VAS_valid(p): For each pointer p, the set of IDs for VASes that p may be valid in. VAS_valid(p) may also contain two special values:
  – v_common: p may point to the common region.
  – v_unknown: the ID of the VAS p may point to is unknown statically.
• VAS_in(i) and VAS_out(i): For each relevant program instruction i, the set of IDs for VASes which may be current before and after i executes.

The instructions relevant to the analysis are shown in Figure 5 along with how they impact the analysis information. We provide an additional instruction vcast that enables users to cast pointers across VASes to override the safety rules.

    Instruction        Description             Impact on Analysis
    switch v           Switch to VAS v         VAS_out(i) = {v}
    x = vcast y v      Cast y to VAS v         VAS_valid(x) = {v}
    x = alloca         Stack allocation        VAS_valid(x) = {v_common}
    x = global         Global variable         VAS_valid(x) = {v_common}
    x = malloc         Heap allocation         VAS_valid(x) = VAS_in(i)
    x = y              Copy (arith., casts)    VAS_valid(x) = VAS_valid(y)
    x = phi y z ...    Phi instructions        VAS_valid(x) = VAS_valid(y) ∪ VAS_valid(z) ∪ ...
    x = *y             Loads                   VAS_valid(x) = VAS_in(i), or v_unknown if stack dereference
    *x = y             Stores                  No impact
    x = foo(y, ...)    Function calls          Update VAS_valid of parameters and VAS_in at function entry
    ret x              Returns                 Update VAS_valid and VAS_out at call sites

Figure 5: SSA instructions relevant to the analysis.

After obtaining these two sets of information, the analysis uses them to identify all load or store instructions that may dereference a pointer in the wrong address space. These include any load or store instruction i that dereferences a pointer p such that at least one of the following three conditions holds:

1. (|VAS_valid(p)| > 1) ∨ (VAS_valid(p) ∋ v_unknown) (The VAS that p points to is ambiguous)
2. |VAS_in(i)| > 1 (The current VAS when i is executed is ambiguous)
3. VAS_valid(p) ≠ VAS_in(i) (The VAS that p points to may not be the same as the current VAS)

For these instructions, the transformation inserts a check before i that checks whether the VAS which p points to matches the currently active VAS.

The analysis information is also used to identify stores that may store pointers to illegal locations. These include any store instruction i of a pointer v to a pointer p such that i does not meet any of the following conditions:

1. VAS_valid(p) = {v_common} (The store is to the common region)
2. (|VAS_valid(p)| = 1) ∧ (VAS_valid(p) = VAS_valid(v)) (The stored pointer points within the region it is stored to)

For these instructions, the transformation inserts checks that ensure that either p points to the common region or that p and v both point to the current VAS.

The VASes of pointers used in checking code are obtained by inserting appropriate tracking code where those pointers are defined. VASes of pointers across function boundaries are tracked via a global array. VASes of pointers that escape to the common region are tracked via tagged pointers (using the unused bits of the pointer), though shadow memory may also be used for this purpose.
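As a rough illustration, the kind of check inserted before an ambiguous dereference or store might look like the following. The tag-extraction helpers, the choice of the high 16 pointer bits, and the global current-VAS variable are all hypothetical; the paper leaves the exact instrumentation and its optimization to future work.

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical instrumentation state. */
    extern int vas_current;              /* ID of the currently active VAS   */
    #define VAS_COMMON 0                 /* tag reserved for the common region */

    static inline int vas_tag_of(const void *p)
    {
        /* Tag kept in the unused high bits of the pointer (Section 4.3). */
        return (int)((uintptr_t)p >> 48);
    }

    /* Check emitted before a dereference *p that could not be proven safe. */
    static inline void vas_check_deref(const void *p)
    {
        int tag = vas_tag_of(p);
        if (tag != VAS_COMMON && tag != vas_current)
            abort();                     /* dereference in the wrong VAS     */
    }

    /* Check emitted before a store *p = v of a pointer value v. */
    static inline void vas_check_store(const void *p, const void *v)
    {
        if (vas_tag_of(p) != VAS_COMMON &&
            (vas_tag_of(p) != vas_current || vas_tag_of(v) != vas_current))
            abort();
    }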
4.4 Discussion

Current implementations of SpaceJMP are relatively unoptimized, and there are several directions for improvement. On the compiler analysis side, there are situations where our conservative algorithm will insert unnecessary safety checks which a more involved analysis would elide.

More effective caching of page table structures would be beneficial for performance, but imposes additional constraints on virtual addresses to be efficient. For example, mapping an 8 KiB segment on the boundary of a PML4 slot requires 7 page tables (one PML4 table, and two tables each of PDPT, PDT, and PT). By restricting the virtual addresses of segments to avoid crossing such high-level page table boundaries, page table structures can be cached more efficiently.

The hardware side is also an interesting source of future work. Modern x86-64 CPUs support tagging of TLB entries with a compact (e.g., 12-bit) address space identifier to reduce overheads incurred by a full TLB flush on every address space switch. Our current implementations reserve the tag value zero to always trigger a TLB flush on a context switch. By default, all address spaces use tag value zero. The user has the ability to pass hints to the kernel (vas_ctl) to request a tag be assigned to an address space.

The trade-off can be complex: use of many tags can decrease overall TLB coverage (particularly since in SpaceJMP many address spaces share the same translations) and result in lower performance instead of faster context switch time. Furthermore, this trade-off is hardware-specific. Only a single TLB tag can be current in an x86 MMU, whereas other hardware architectures (such as the "domain bits" in the ARMv7-A architecture [3] or the protection attributes in PA-RISC [81]) offer more flexible specification of translations, which would allow TLB entries to be shared across VAS boundaries. In any case, efficient dynamic use of TLB identifiers is a topic for future work.

5. Evaluation

In this section, we evaluate potential benefits afforded through SpaceJMP using three applications across domains: (i) a single-threaded benchmark derived from the HPCC GUPS [64] code, to demonstrate the advantages of fast switching and scalable access to many address spaces; (ii) RedisJMP, an enhanced implementation of the data-center object-caching middleware Redis, designed to leverage the multi-address-space programming model and lockable segments to ensure consistency among clients; and (iii) a genomics tool, to demonstrate ease of switching with complex, pointer-rich data structures. The applications have not been instrumented with the safety checks described in Section 4.3.

Three hardware platforms support our measurements, code-named M1, M2, and M3, as shown in Table 1. All are dual-socket server systems, varying in total memory capacity, core count, and CPU micro-architecture; symmetric multi-threading and dynamic frequency scaling are disabled. Unless otherwise mentioned, the DragonFly BSD SpaceJMP implementation is used.

    Name   Memory    Processors              Freq.
    M1     92 GiB    2x12c Xeon X5650        2.66 GHz
    M2     256 GiB   2x10c Xeon E5-2670v2    2.50 GHz
    M3     512 GiB   2x18c Xeon E5-2699v3    2.30 GHz

Table 1: Large-memory platforms used in our study.

    Operation     DragonFly BSD    Barrelfish
    CR3 load      130 / 224        130 / 224
    system call   357 / –          130 / –
    vas_switch    1127 / 807       664 / 462

Table 2: Breakdown of context switching. Measurements on M2 in cycles; in each pair, the second number is with TLB tags enabled.

5.1 Microbenchmarks

This section explores trade-offs available with our solution for address space manipulation, and further evaluates its performance characteristics as an RPC mechanism (Figure 7). We begin first with an evaluation of VAS modification overheads, followed by a breakdown of context switching costs with and without TLB tagging optimizations.

Recall from Figure 1 that page table modification does not scale, even in optimized, mature OS implementations. The reason is that entire page table subtrees must be created – costs which are directly proportional to the region size and inversely proportional to page size. When restricted to a single per-process address space, changing translations for a range of virtual addresses using mmap and munmap incurs these costs. Copy-on-write optimizations can ultimately only reduce these costs for large, sparsely accessed regions, and random-access workloads that stress large areas incur higher page-fault overheads using this technique. With SpaceJMP, these costs are removed from the critical path by switching translations instead of modifying them. We demonstrate this performance impact on the GUPS workload in Section 5.2.
Given the ability to switch into other address spaces at arbitrary instances, a process in SpaceJMP will cause context switching to occur at more frequent intervals than is typical, e.g., due to task rescheduling, potentially thousands of times per second. Table 2 breaks down the immediate costs imposed by an address space switch in our DragonFly BSD implementation; a system call imposes the largest cost. Subsequent costs are incurred as TLB misses trigger the page-walking hardware to fetch new translations. While immediate costs may only be improved within the micro-architecture, subsequent costs can be improved with TLB tagging.

Notice in Table 2 that changing the CR3 register becomes more expensive with tagging enabled, as it invokes additional hardware circuitry that must consider extra TLB-resident state upon a write. Naturally, these are hardware-dependent costs. However, the overall performance of switching is improved, due to reduced TLB misses from shared OS entries.

Figure 6: Impact of TLB tagging (M3) on a random-access workload. Tagging retains translations, and can lower the cost of VAS switching.

We directly measured the impact of tagging in the TLB using a random page-walking benchmark we wrote. For a given set of pages, it will load one line from a randomly chosen page. A write to CR3 is then introduced between each iteration, and the cost in cycles to access the cache line is measured. Lastly, we enable tags, shown in Figure 6. The benefits, however, tail off: with tags, the cost of accessing a cache line reduces to the cost incurred without writes to CR3 as the working set grows. As expected, benefits gained using tags are limited by a combination of TLB capacity (for a given page size) and the sophistication of TLB prefetchers. In our experiment, access latencies with larger working sets match those of latencies where the TLB is flushed.
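A minimal sketch of such a page-walking loop is shown below. Since CR3 cannot be written from user space, the sketch assumes the vas_switch call from Figure 3 is used to force the page-table reload between iterations; the rdtsc-based timing helper and the parameterization are our own additions, not the benchmark's actual code.

    #include <stdint.h>
    #include <stdlib.h>

    typedef int vhandle_t;
    extern void vas_switch(vhandle_t vh);        /* Figure 3 API (assumed) */

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    #define PAGE_SIZE 4096

    /* Touch one cache line in a randomly chosen page, reloading the page
     * table (via a VAS switch) between iterations, and report the average
     * access cost in cycles. */
    uint64_t touch_random_pages(volatile char *region, size_t npages,
                                vhandle_t vh, size_t iters)
    {
        uint64_t total = 0;
        for (size_t i = 0; i < iters; i++) {
            size_t page = (size_t)rand() % npages;
            vas_switch(vh);                      /* forces a CR3 write      */
            uint64_t t0 = rdtsc();
            (void)region[page * PAGE_SIZE];      /* load one cache line     */
            total += rdtsc() - t0;
        }
        return total / iters;
    }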
Finally, we compare the latency of switching address spaces in SpaceJMP with issuing a remote procedure call to another core. In this benchmark, an RPC client issues a request to a server process on a different core and waits for the acknowledgment. The exchange consists of two messages, each containing either a 64-bit key or a variable-sized payload. We compare with the same semantics in SpaceJMP by switching into the server's VAS and accessing the data directly, copying it into the process-local address space. Figure 7 shows the comparison between SpaceJMP and different RPC backends.

Our point of comparison is the (highly optimized) Barrelfish low-latency RPC, stripped of stub code to expose only the raw low-level mechanism. In the low-latency case, both client and server busy-wait, polling different circular buffers of cache-line-sized messages in a manner similar to FastForward [36]. This is the best-case scenario for Barrelfish RPCs – a real-world case would add overhead for marshalling, polling multiple channels, etc. We see a slight difference between intra-socket (URPC L) and inter-socket (URPC X) performance.

In all cases, the latency grows once the payload exceeds the buffer size. SpaceJMP is only out-performed by intra-socket URPC for small messages (due to system call and context switch overheads). Between sockets, the interconnect overhead for RPC dominates the cost of switching the VAS. In this case, using TLB tags further reduced latency.

Figure 7: Comparison of URPC and SpaceJMP on Barrelfish (M2) as an alternative solution for fast local RPC communication.

5.2 GUPS: Addressing Large Memory

To deal with the limits of addressability for virtual memory, applications adopt various solutions for accessing larger physical memories. In this section, we use the GUPS benchmark [64] to compare two approaches in use today with a design using SpaceJMP. We ask two key questions: (i) how can applications address large physical memory regions, and (ii) what are the limitations of these approaches?

GUPS is appropriate for this: it measures the ability of a system to scale when applying random updates to a large in-memory array. This array is one large logical table of integers, partitioned into some number of windows. GUPS executes a tight loop that, for some number of updates per iteration, computes a random index within a given window for each update and then mutates the value at that index. After each set of updates is applied, a new window is randomly chosen. Figure 8 illustrates the performance comparison between three different approaches, where performance is reported as the rate of millions of updates applied per second.

Figure 8: Comparison of three designs to program large memories with GUPS (M3). Update set sizes 16 and 64.
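A condensed sketch of the SpaceJMP variant of this loop appears below: each window's table segment lives in its own VAS, so changing windows is an address space switch rather than a remap. The handle array, window size, and update-set size are illustrative parameters rather than the exact GUPS code.

    #include <stdint.h>
    #include <stdlib.h>

    typedef int vhandle_t;
    extern void vas_switch(vhandle_t vh);          /* Figure 3 API (assumed) */

    #define UPDATES_PER_SET 64                     /* update set size        */

    /* GUPS-style updates with SpaceJMP: pick a window, switch into its VAS,
     * then apply random read-modify-write updates to the table segment
     * mapped at 'table' in every window's VAS. */
    void gups_spacejmp(uint64_t *table, size_t window_entries,
                       vhandle_t *windows, size_t nwindows, size_t sets)
    {
        for (size_t s = 0; s < sets; s++) {
            size_t w = (size_t)rand() % nwindows;
            vas_switch(windows[w]);                /* no mmap/munmap needed  */
            for (int u = 0; u < UPDATES_PER_SET; u++) {
                size_t idx = (size_t)rand() % window_entries;
                table[idx] ^= (uint64_t)rand();    /* mutate value at index  */
            }
        }
    }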




The first approach (MAP) leverages address space remapping: a traditional process may logically extend the reachability of its VAS by dynamically remapping portions of it to different regions of physical memory. Our implementation of this design uses the BSD mmap and munmap system calls (configured to attach to existing pages in the kernel's page cache), opening new windows for writing.

The second traditional approach (MP) uses multiple processes: each process is assigned a distinct portion of the physical memory (a window). In our experiment, one process acts as master and the rest as slaves, whereby the master process sends RPC messages using OpenMPI to the slave process holding the appropriate portion of physical memory. It then blocks, waiting for the slave to apply the batch of updates before continuing, simulating a single thread applying updates to a large global table. Each process is pinned to a core.

We compare these techniques with VAS switching, modifying GUPS to take advantage of SpaceJMP. Unlike the first technique, we do not modify mappings when changing windows, but instead represent a window as a segment in its own VAS. Pending updates are maintained in a shared heap segment mapped into all attached address spaces. Figure 9 illustrates the rate of VAS switching and TLB misses.

Figure 9: Rate of VAS switching and TLB misses for GUPS executed with SpaceJMP, averaged across 16 iterations. TLB tagging is disabled. A larger window size would produce a greater TLB miss rate using one window.

Discussion. For a single window – no remapping, RPC, or switching – all design configurations perform equally well. Changing windows immediately becomes prohibitively expensive for the MAP design, as it requires modification of the address space on the critical path. For the MP and SpaceJMP designs, the amount of CPU cache used grows with each added window, due to cache line fills from updates, as well as the growth in required page translation structures pulled in by the page-walker. The SpaceJMP implementation performs at least as well as the multi-process implementation, despite frequent context switches, with all data and multiple translation tables competing for the same set of CPU caches (for MP, only one translation table resides on each core). At greater than 36 cores on M3, the performance of MP drops, due to the busy-wait characteristics of the OpenMPI implementation. The same trends are visible across a range of update set sizes (16 and 64 in the figure). Finally, a design leveraging SpaceJMP is more flexible, as a single process can independently address multiple address spaces without message passing.

This experiment shows that SpaceJMP occupies a useful point in the design space between multiple-process and page-remapping techniques – there is tangible benefit from switching between multiple address spaces rapidly on a single thread.
5.3 Redis with Multiple Address Spaces

In this section, we investigate the trade-offs in adapting an existing application to use SpaceJMP and the potential benefits over traditional programming models.

We use Redis (v3.0.2) [67], a popular in-memory key-value store. Clients interact with Redis using UNIX domain or TCP/IP sockets by sending commands, such as SET and GET, to store and retrieve data. Our experiments use local clients and UNIX domain sockets for performance.

We compare a single-threaded Redis instance with RedisJMP, which exploits SpaceJMP by eliding the use of socket-based communications. RedisJMP avoids a server process entirely, retaining only the server data, and clients access the server data by switching into its address space. RedisJMP is therefore implemented as a client-side library, and the server data is initialized lazily by its first client. This client creates a new lockable segment for the state, maps it into a newly created address space, switches into this address space, and runs the initialization code to set up Redis data structures. We replaced the Redis memory allocator with one based on SpaceJMP, and moved all Redis global variables to a statically allocated region inside the lockable segment.

Each client creates either one or two address spaces with the lockable segment mapped read-only or read-write, and invokes commands by switching into the newly created address space, executing server code directly. Our RedisJMP implementation currently supports only basic operations for simple data types. Some Redis features such as publish–subscribe would be more challenging to support, but could be implemented in a dedicated notification service.

RedisJMP uses locked segments to provide parallel read access to Redis state, but two further modifications were needed to the Redis code. First, Redis creates heap objects even on GET (read-only) requests for parsing commands, which would require read-write access to the lockable segment for all commands. We therefore attached a small, per-client scratch heap to each client's server address space. Second, Redis dynamically grows and shrinks its hash tables asynchronously with respect to queries; we modified it to resize and rehash entries only when a client has an exclusive lock on the address space.
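The client-side flow can be pictured roughly as follows. The segment and VAS names, the redis_execute entry point, the seg_attach_handle wrapper (standing in for the seg_attach(vh, sid) variant of Figure 3), and the idea of switching back to a "home" VAS are all placeholders for RedisJMP library internals that the paper does not list.

    typedef int vasid_t;  typedef int vhandle_t;  typedef int segid_t;
    extern vasid_t   vas_create(const char *name, int perms);
    extern vhandle_t vas_attach(vasid_t vid);
    extern void      vas_switch(vhandle_t vh);
    extern segid_t   seg_find(const char *name);
    extern void      seg_attach_handle(vhandle_t vh, segid_t sid);

    /* Placeholder for the server command-processing code that RedisJMP
     * links into the client library. */
    extern char *redis_execute(const char *cmd);

    static vhandle_t client_vh;   /* client's private view of the server data */
    static vhandle_t home_vh;     /* assumed set to the process's initial VAS */

    void redisjmp_client_init(void)
    {
        /* Build a read-only address space containing the shared Redis
         * state plus a small per-client scratch heap (see above). */
        vasid_t vid = vas_create("redis-client-ro", 0440);
        vhandle_t vh = vas_attach(vid);
        seg_attach_handle(vh, seg_find("redis-state"));     /* shared, read-only */
        seg_attach_handle(vh, seg_find("client-scratch"));  /* per-client heap   */
        client_vh = vh;
    }

    char *redisjmp_get(const char *cmd)
    {
        vas_switch(client_vh);        /* acquires the segment lock in shared mode */
        char *reply = redis_execute(cmd);
        vas_switch(home_vh);          /* release by switching back; a real client
                                       * would copy the reply out first            */
        return reply;
    }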

We used the redis-benchmark from the Redis distribution to compare the throughput of RedisJMP with the original on a single machine (M1). The benchmark simulates multiple clients by opening multiple file descriptors and sending commands while asynchronously polling and waiting for a response before sending the next command. For SpaceJMP we modified redis-benchmark to use our API and individual processes as clients that all attach the same server segment.

Figure 10: Performance comparison of Redis vs. a version of Redis using SpaceJMP: (a) throughput of GET requests; (b) throughput of SET requests; (c) throughput under a varying rate of SET requests.

Fig. 10a and Fig. 10b show the performance of the GET and SET commands (4-byte payload) for a regular Redis server and RedisJMP. In the case of a single client (one thread), SpaceJMP outperforms a single server instance of Redis by a factor of 4x for GET and SET requests, by reducing communication overhead.

The maximum read throughput of a single-threaded Redis server is naturally limited by the clock speed of a single core, whereas RedisJMP allows multiple readers to access the address space concurrently. Therefore, we also compare the throughput of a single RedisJMP instance with six independent Redis instances (Redis 6x) paired with six instances of redis-benchmark running on the twelve-core machine. Even in this case, at full utilization RedisJMP is still able to serve 36% more requests than 6 regular Redis instances.

We also compare SpaceJMP with and without TLB tagging enabled and notice a slight performance improvement using tags until the synchronization overhead limits scalability. For TLB misses, we measured a rate of 8.9M misses per second with a single client and 3.6M per core per second at full utilization (using 12 clients). With TLB tagging, the miss rate was lower: 2.8M misses per second for a single client and 0.9M per core per second (using 12 clients). The total number of address space switches per second is equal to two times the request rate for any number of clients.

For SET requests we sustain a high request rate until too many clients contend on the segment lock.
This is a Each process operating on the data switches into the address fundamental SpaceJMP limit, but we anticipate that a more space, performs its operation on the , and keep scalable lock design than our current implementation would its results in the address space for the next process to use. yield further improvements. Fig. 10c shows maximum system Figure 11 compares the performance of using SpaceJMP throughput while increasing the percentage of SET requests. to the original SAMTools operations on Sequence Alignmen- The write lock has a large impact on throughput even when t/Map (SAM) and BGZF compressed (BAM) files of sizes 10% of the requests are SETs, but RedisJMP still outperforms 3.1 GiB and 0.9 GiB respectively. It is evident from the graph traditional file-based communication. that keeping data in memory with SpaceJMP results in sig- nificant speedup. The SAM and BAM files are stored using an in-memory file-system so the impact of disk access in the original tool is completely factored out. The performance virtual addresses. Quickstore, however, incurs the cost of in- improvement comes mainly from avoiding data conversion place pointer swizzling for objects whose virtual address between pointer-rich and serialized formats. ranges overlap, and the cost itself of mapping memory. We also implement a version of SAMTools that uses SpaceJMP elides swizzling by reattaching alternate VASes, memory-mapped files to keep data structures in memory. and combines VAS-aware pointer tracking for safe execution. Processes access data by calling mmap on a file. Region-based BSD operating system kernels are designed around the programming is used to build the data structures within the concept of virtual memory objects, enabling more efficient file. To maximize fairness, we make mmap as lightweight as memory mappings, similar to UVM [23]. possible by using an in-memory file system for the files, and Prior 32-bit editions of Windows had only 2 GiB of virtual pass flags to mmap to exclude regions from the core file and address space available for applications but allowed changing inform the pager to not gratuitously sync dirty pages back to this limit to 3 GiB for memory intensive applications [56]. disk. Moreover, we stop timers before process exit to exclude Furthermore, Address Windowing Extensions [57] gave ap- the implicit cost of unmapping data. plications means to allocate physical memory above 4 GiB Figure 12 compares the performance of SpaceJMP to its and quickly place it inside reserved windows in the 32-bit counter-part using memory-mapped files. It is evident that address space, allowing applications to continue using regular the SpaceJMP has comparable performance. Therefore, the 32-bit pointers. flexibility provided by SpaceJMP over memory-mapped files Some single-address-space operating systems (SASOSes) (i.e., not using special pointers or managing address conflicts) (e.g., Opal [17, 18] and IVY [51]) eliminated duplicate page comes at a free cost. Moreover, with caching of translations tables. Nevertheless, protection domains need to be enforced enabled, it is expected that SpaceJMP’s performance will by additional hardware (protection lookaside buffers) or page improve further. groups [47]. Systems like Mondrix [83] employ isolation It is noteworthy that flagstat shows more significant im- by enforcing protection on fixed entry and exit points of provement from SpaceJMP than the other operations in Fig- cross-domain calls, much like traditional “call gates” or the ure 12. 
6. Related work

SpaceJMP is related to a wide range of work both in OS design and in hardware support for memory management.

System software. Memory management is one of the oldest areas of computer systems, with many different design considerations proposed so far. The most familiar OSes support address spaces for protection and isolation. Strong isolation limits flexibility to share complex, pointer-rich data structures, and shared memory objects duplicate page table structures per process. OpenVMS [61] shares page tables for the subtree of a global section. Munroe patented sharing data among multiple virtual address spaces [59].

Threads of a process usually share a single virtual address space. The OS uses locks to ensure thread safety, which leads to contention, forcing memory-management-intensive programs to be split up into multiple processes to avoid lock contention [79]. RadixVM [20] maintains a radix tree from virtual addresses to metadata and allows concurrent operations on non-overlapping memory regions. VAS abstractions as first-class citizens provide a more natural expression of sharing and isolation semantics, while extending the use of scalable virtual memory optimizations.

Cyclone [37] has memory regions with varying lifetimes and checks the validity of pointers in the scopes where they are dereferenced. Systems like QuickStore [80] provide shared-memory implementations of object stores, leveraging the virtual memory system to make objects available via their virtual addresses. QuickStore, however, incurs the cost of in-place pointer swizzling for objects whose virtual address ranges overlap, and the cost itself of mapping memory. SpaceJMP elides swizzling by reattaching alternate VASes, and combines VAS-aware pointer tracking for safe execution.

BSD operating system kernels are designed around the concept of virtual memory objects, enabling more efficient memory mappings, similar to UVM [23]. Prior 32-bit editions of Windows had only 2 GiB of virtual address space available for applications but allowed changing this limit to 3 GiB for memory-intensive applications [56]. Furthermore, Address Windowing Extensions [57] gave applications means to allocate physical memory above 4 GiB and quickly place it inside reserved windows in the 32-bit address space, allowing applications to continue using regular 32-bit pointers.

Some single-address-space operating systems (SASOSes) (e.g., Opal [17, 18] and IVY [51]) eliminated duplicate page tables. Nevertheless, protection domains need to be enforced by additional hardware (protection lookaside buffers) or page groups [47]. Systems like Mondrix [83] employ isolation by enforcing protection on fixed entry and exit points of cross-domain calls, much like traditional "call gates" or the CAP computer's "enter capabilities" [82]. Such mechanisms complement SpaceJMP by providing protection capabilities at granularities other than a page within a VAS itself. Also, SASOSes assume large virtual address spaces, but we predict a limit on virtual address bits in the future. More recent work such as SMARTMAP [12] evaluates data sharing on high-performance machines. Distributed shared memory systems in the past [46] gained little adoption, but have recently acquired renewed interest [60].

Our work is influenced by the design of Mach [65], particularly the concept of a memory object mappable by various processes. Such techniques were adopted in the Opal [17] and Nemesis [38] SASOSes to describe memory segments characterized by fixed virtual offsets. Lindstrom [52] expands these notions to shareable containers that contain code segments and private memory, leveraging a capability model to enforce protection. Application threads called loci enter containers to execute and manipulate data. Loci motivated our approach to the client–server model.

In the 90s, research focused on exporting functionality to user space, reducing OS complexity, and enabling applications to set memory management policies [32, 38, 66]. Our work is supported by such ideas, as applications are exposed to an interface to compose virtual address spaces, with protection enforced by the OS, as with seL4 [31] and Barrelfish [7].

The idea of multiple address spaces has mainly been applied to achieve protection in a shared environment [17, 73], and to address the mismatch of abstractions provided by OS interfaces [2, 66]. Some operating systems also provide similar APIs in order to simplify kernel development by executing the kernel directly in user space [26, 30]. Dune [8] uses virtualization hardware to sandbox threads of a process in their own VAS. Further work has used multiple address spaces to transparently manage costs in NUMA systems [11, 25, 43, 54, 62, 68, 77, 85] or HPC applications [13]. SpaceJMP goes further by promoting a virtual address space to a first-class citizen in the OS, allowing applications to compose logically related memory regions for more efficient sharing.
Figure 12 compares the performance of SpaceJMP to its counterpart using memory-mapped files. It is evident that SpaceJMP has comparable performance. Therefore, the flexibility provided by SpaceJMP over memory-mapped files (i.e., not having to use special pointers or manage address conflicts) comes at no extra cost. Moreover, with caching of translations enabled, we expect SpaceJMP's performance to improve further.

It is noteworthy that flagstat shows a more significant improvement from SpaceJMP than the other operations in Figure 12. That is because flagstat runs much more quickly than the others, so the time spent performing a VAS switch or mmap takes up a larger fraction of the total time.

6. Related work

SpaceJMP is related to a wide range of work both in OS design and in hardware support for memory management.

System software. Memory management is one of the oldest areas of computer systems, and many different designs have been proposed. The most familiar OSes support address spaces for protection and isolation. Strong isolation limits the flexibility to share complex, pointer-rich data structures, and shared memory objects duplicate page-table structures per process. OpenVMS [61] shares page tables for the subtree of a global section. Munroe patented sharing data among multiple virtual address spaces [59].

Threads of a process usually share a single virtual address space. The OS uses locks to ensure thread safety, which leads to contention and forces memory-management-intensive programs to be split into multiple processes to avoid lock contention [79]. RadixVM [20] maintains a radix tree from virtual addresses to metadata and allows concurrent operations on non-overlapping memory regions. Treating VAS abstractions as first-class citizens provides a more natural expression of sharing and isolation semantics, while extending the use of scalable virtual memory optimizations.

Cyclone [37] has memory regions with varying lifetimes and checks the validity of pointers in the scopes where they are dereferenced. Systems like QuickStore [80] provide shared-memory implementations of object stores, leveraging the virtual memory system to make objects available via their virtual addresses. QuickStore, however, incurs the cost of in-place pointer swizzling for objects whose virtual address ranges overlap, as well as the cost of mapping memory itself. SpaceJMP elides swizzling by reattaching alternate VASes, and combines this with VAS-aware pointer tracking for safe execution.

BSD operating system kernels are designed around the concept of virtual memory objects, enabling more efficient memory mappings, similar to UVM [23]. Prior 32-bit editions of Windows had only 2 GiB of virtual address space available to applications but allowed raising this limit to 3 GiB for memory-intensive applications [56]. Furthermore, Address Windowing Extensions [57] gave applications a means to allocate physical memory above 4 GiB and quickly place it inside reserved windows in the 32-bit address space, allowing applications to continue using regular 32-bit pointers.

Some single-address-space operating systems (SASOSes), e.g., Opal [17, 18] and IVY [51], eliminated duplicate page tables. Nevertheless, protection domains need to be enforced by additional hardware (protection lookaside buffers) or page groups [47]. Systems like Mondrix [83] employ isolation by enforcing protection at fixed entry and exit points of cross-domain calls, much like traditional "call gates" or the CAP computer's "enter capabilities" [82]. Such mechanisms complement SpaceJMP by providing protection capabilities at granularities other than a page within a VAS itself. Also, SASOSes assume large virtual address spaces, but we predict a limit on virtual address bits in the future. More recent work such as SMARTMAP [12] evaluates data sharing on high-performance machines. Distributed shared memory systems [46] gained little adoption in the past but have recently attracted renewed interest [60].

Our work is influenced by the design of Mach [65], particularly the concept of a memory object mappable by various processes. Such techniques were adopted in the Opal [17] and Nemesis [38] SASOSes to describe memory segments characterized by fixed virtual offsets. Lindstrom [52] expands these notions to shareable containers that hold code segments and private memory, leveraging a capability model to enforce protection. Application threads called loci enter containers to execute and manipulate data; loci motivated our approach to the client–server model.

In the 90s, research focused on exporting functionality to user space, reducing OS complexity, and enabling applications to set memory management policies [32, 38, 66]. Our work builds on such ideas, as applications are exposed to an interface to compose virtual address spaces, with protection enforced by the OS, as in seL4 [31] and Barrelfish [7].

The idea of multiple address spaces has mainly been applied to achieve protection in a shared environment [17, 73] and to address the mismatch of abstractions provided by OS interfaces [2, 66]. Some operating systems also provide similar APIs in order to simplify kernel development by executing the kernel directly in user space [26, 30]. Dune [8] uses virtualization hardware to sandbox threads of a process in their own VAS. Further work has used multiple address spaces to transparently manage costs in NUMA systems [11, 25, 43, 54, 62, 68, 77, 85] or HPC applications [13]. SpaceJMP goes further by promoting the virtual address space to a first-class citizen in the OS, allowing applications to compose logically related memory regions for more efficient sharing.

Anderson et al. [1] recognized that the increased use of RPC-based communication requires both sender and receiver to be scheduled to transfer data, involving a context switch and kernel support (LRPC [9]) or shared memory between cores (URPC [10]). In both cases, data marshalling, cache misses, and buffering lead to overhead [16, 45, 75]. Compared to existing IPC mechanisms like Mach Local RPC [28] or overlays [49], SpaceJMP distinguishes itself by crossing VAS boundaries rather than task or process boundaries.

Ideally, bulk-data transfer between processes avoids copies [19, 41, 70, 72], although there are cases where security precludes this [29, 41].

Collaborative data processing with multiple processes often requires pointer-rich data structures to be serialized into a self-contained format and sent to other processes, which induces overhead in web services but also on single multi-core machines [14, 39, 42, 71].

Hardware support. Multiple address spaces are also supported by hardware. We have already discussed the trade-offs involved in tagged TLBs, and observed that SpaceJMP would benefit from a more advanced tagging architecture than that found in x86 processors. The HP PA-RISC [53] and Intel Itanium [40, 84] architectures both divide the virtual address space into windows onto distinct address regions. The mapping is controlled by region registers holding the VAS identifier.

Our research is further complemented by Direct Segments [6]. SpaceJMP segments are currently backed by the underlying page table structure, but integrating our segment API with Direct Segments would yield further benefits. Alternatively, RMM [44] proposes hardware support that adds a redundant mapping for large ranges of contiguous pages. RMM seems like a good match for SpaceJMP segments, requiring fewer TLB entries than a traditional page-based system. SpaceJMP could be extended straightforwardly to implement RMM's eager allocation strategy. Other work proposes hardware-level optimizations that enable granularities finer than a page [44, 69].

Large pages have been touted as a way to mitigate TLB flushing cost, but such changes require substantial kernel modifications [74] and provide uncertain benefit to large-memory analytics workloads [6], as superpage TLBs can be small. Superpages on NUMA systems may also be unfavorable [34].

Bailey et al. [5] argue that the availability of huge, fast, byte-addressable non-volatile memory (NVM) will fundamentally influence OS structure. Persistent storage will be managed by existing virtual memory mechanisms [22, 58] and accessed directly by the CPU. Atlas [15] examines durability when accessing persistent storage by using locks and explicit publishing to persistent memory. The approach in this paper provides a foundation for designing more memory-centric operating system architectures.

Compiler-based program analysis for pointer safety introduces runtime checks with associated overhead. These checks could be made more efficient by labeling pointers with address-space information. Some architectures, such as ARM's aarch64 [4], support the notion of "tagged pointers", where some bits (8, in aarch64) of every pointer can be made available to software and are not passed to the MMU.
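As an illustration of this idea (our own sketch, not the paper's compiler pass), the snippet below packs a small address-space identifier into the top byte of a 64-bit pointer. On aarch64 with top-byte-ignore the tagged pointer could be dereferenced as-is; on other architectures the tag must be masked off before use.

/* Illustrative only: label a pointer with an address-space tag in its top
 * byte. Assumes 64-bit pointers; not the paper's implementation. */
#include <stdint.h>
#include <stdio.h>

#define VAS_TAG_SHIFT 56
#define VAS_TAG_MASK  ((uintptr_t)0xff << VAS_TAG_SHIFT)

static inline void *vas_tag_ptr(void *p, uint8_t vas_id)
{
    return (void *)(((uintptr_t)p & ~VAS_TAG_MASK) |
                    ((uintptr_t)vas_id << VAS_TAG_SHIFT));
}

static inline unsigned vas_tag_of(const void *p)
{
    return (unsigned)(((uintptr_t)p & VAS_TAG_MASK) >> VAS_TAG_SHIFT);
}

static inline void *vas_untag_ptr(const void *p)
{
    return (void *)((uintptr_t)p & ~VAS_TAG_MASK);   /* strip tag before dereferencing */
}

int main(void)
{
    int x = 42;
    void *t = vas_tag_ptr(&x, 7);                    /* label the pointer with VAS id 7 */
    printf("vas=%u value=%d\n", vas_tag_of(t), *(int *)vas_untag_ptr(t));
    return 0;
}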
Lockable segments may benefit from hardware transactional memory (HTM) support: encapsulating the address space switch in an HTM-supported transaction could avoid the need for locking (or other synchronization) and improve the performance of shared regions.

7. Conclusion

SpaceJMP extends OS memory management with lockable segments and multiple address spaces per process as a way to more efficiently execute data-centric applications that use massive memory and/or pointer-rich data structures.

However, we also claim that SpaceJMP has broader applicability. For example, SpaceJMP can help in handling the growing complexity of memory. We expect future memory systems to include a combination of several heterogeneous hardware modules with quite different characteristics: a co-packaged volatile performance tier, a persistent capacity tier, different levels of memory-side caching, private memory (for example, accessible only by accelerators), and so on. Applications today use cumbersome techniques such as explicit copying and DMA to operate in these environments. SpaceJMP can be the basis for tying together such a complex heterogeneous memory system and providing the application support needed for efficient programming. Another potential application is sandboxing, using different address spaces to limit access only to trusted code.

Our ongoing work is exploring these possibilities, but we also plan to address other issues such as the persistence of multiple virtual address spaces (for example, across reboots), and to apply optimizations to address space creation, such as copy-on-write, snapshotting, and versioning.

So far, our experience has been encouraging, and we hope that SpaceJMP becomes a solid foundation for using memory in next-generation hardware platforms with high demands for data and very large memories.

8. Acknowledgements

We thank the anonymous reviewers and our shepherd, Galen Hunt, for their many helpful suggestions. We would also like to thank our university colleagues and the many Hewlett Packard Labs staff for their valuable insights and feedback.

References

[1] Thomas E. Anderson, Henry M. Levy, Brian N. Bershad, and Edward D. Lazowska. The Interaction of Architecture and Operating System Design. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 108–120, Santa Clara, California, USA, 1991.
[2] Andrew W. Appel and Kai Li. Virtual Memory Primitives for User Programs. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 96–107, Santa Clara, California, USA, 1991.
[3] ARM Ltd. ARM Architecture Reference Manual: ARMv7-A and ARMv7-R Edition, 2014. ARM DDI 0406C.c.
[4] ARM Ltd. ARMv8-A Architecture. Online, 2015. http://www.arm.com/products/processors/armv8-architecture.php.
[5] Katelin Bailey, Luis Ceze, Steven D. Gribble, and Henry M. Levy. Operating System Implications of Fast, Cheap, Non-volatile Memory. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, HotOS'13, pages 2–2, Napa, California, 2011.
[6] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. Efficient Virtual Memory for Big Memory Servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pages 237–248, Tel-Aviv, Israel, 2013.
[7] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The Multikernel: A New OS Architecture for Scalable Multicore Systems. In Proceedings of the ACM SIGOPS Twenty-Second Symposium on Operating Systems Principles, SOSP '09, pages 29–44, Big Sky, Montana, USA, 2009.
[8] Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis. Dune: Safe User-level Access to Privileged CPU Features. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, pages 335–348, Hollywood, CA, USA, 2012.
[9] Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. Lightweight Remote Procedure Call. ACM Trans. Comput. Syst., 8(1):37–55, February 1990.
[10] Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. User-level Interprocess Communication for Shared Memory Multiprocessors. ACM Trans. Comput. Syst., 9(2):175–198, May 1991.
[11] William J. Bolosky, Michael L. Scott, Robert P. Fitzgerald, Robert J. Fowler, and Alan L. Cox. NUMA Policies and Their Relation to Memory Architecture. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 212–221, Santa Clara, California, USA, 1991.
[12] Ron Brightwell, Kevin Pedretti, and Trammell Hudson. SMARTMAP: Operating System Support for Efficient Data Sharing Among Processes on a Multi-core Processor. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 25:1–25:12, Austin, Texas, 2008.
[13] Javier Bueno, Xavier Martorell, Rosa M. Badia, Eduard Ayguadé, and Jesús Labarta. Implementing OmpSs Support for Regions of Data in Architectures with Multiple Address Spaces. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS '13, pages 359–368, Eugene, Oregon, USA, 2013.
[14] Bryan Carpenter, Geoffrey Fox, Sung Hoon Ko, and Sang Lim. Object Serialization for Marshalling Data in a Java Interface to MPI. In Proceedings of the ACM 1999 Conference on Java Grande, JAVA '99, pages 66–71, San Francisco, California, USA, 1999.
[15] Dhruva R. Chakrabarti, Hans-J. Boehm, and Kumud Bhandari. Atlas: Leveraging Locks for Non-volatile Memory Consistency. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA '14, pages 433–452, Portland, Oregon, USA, 2014.
[16] Satish Chandra, James R. Larus, and Anne Rogers. Where is Time Spent in Message-passing and Shared-memory Programs? In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VI, pages 61–73, San Jose, California, USA, 1994.
[17] Jeff Chase, Miche Baker-Harvey, Hank Levy, and Ed Lazowska. Opal: A Single Address Space System for 64-bit Architectures. SIGOPS Oper. Syst. Rev., 26(2):9–, April 1992.
[18] Jeffrey S. Chase, Henry M. Levy, Miche Baker-Harvey, and Edward D. Lazowska. How to Use a 64-Bit Virtual Address Space. Technical report, Department of Computer Science and Engineering, University of Washington, 1992.
[19] Ting-Hsuan Chien, Chia-Jung Chen, and Rong-Guey Chang. An Adaptive Zero-Copy Strategy for Ubiquitous High Performance Computing. In Proceedings of the 21st European MPI Users' Group Meeting, EuroMPI/ASIA '14, pages 139:139–139:144, Kyoto, Japan, 2014.
[20] Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. RadixVM: Scalable Address Spaces for Multithreaded Applications. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 211–224, Prague, Czech Republic, 2013.
[21] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: Making Persistent Objects Fast and Safe with Next-generation, Non-volatile Memories. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 105–118, Newport Beach, California, USA, 2011.
[22] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. Better I/O Through Byte-addressable, Persistent Memory. In Proceedings of the ACM SIGOPS Twenty-Second Symposium on Operating Systems Principles, SOSP '09, pages 133–146, Big Sky, Montana, USA, 2009.
[23] Charles D. Cranor and Gurudatta M. Parulkar. The UVM Virtual Memory System. In Proceedings of the 1999 USENIX Annual Technical Conference, ATEC '99, pages 9–9, Monterey, California, 1999.
[24] Danga Interactive. memcached – a distributed memory object caching system. Online, April 2015. http://memcached.org/.
[25] Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth. Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 381–394, Houston, Texas, USA, 2013.
[26] Jeff Dike. User Mode Linux. Upper Saddle River, NJ, USA, 2006.
[27] Matt Dillon. DragonFly BSD Sources. http://www.dragonflybsd.org/, June 2015.
[28] Richard Draves. A Revised IPC Interface. In USENIX MACH Symposium, pages 101–122. USENIX, 1990.
[29] Peter Druschel and Larry L. Peterson. Fbufs: A High-bandwidth Cross-domain Transfer Facility. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, SOSP '93, pages 189–202, Asheville, North Carolina, USA, 1993.
[30] Aggelos Economopoulos. A peek at the DragonFly Virtual Kernel (part 1). https://lwn.net/Articles/228404/, March 2007.
[31] Dhammika Elkaduwe, Philip Derrin, and Kevin Elphinstone. Kernel Design for Isolation and Assurance of Physical Memory. In Proceedings of the 1st Workshop on Isolation and Integration in Embedded Systems, IIES '08, pages 35–40, Glasgow, Scotland, 2008.
[32] D. R. Engler, M. F. Kaashoek, and J. O'Toole, Jr. Exokernel: An Operating System Architecture for Application-level Resource Management. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, SOSP '95, pages 251–266, Copper Mountain, Colorado, USA, 1995.
[33] Paolo Faraboschi, Kimberly Keeton, Tim Marsland, and Dejan Milojicic. Beyond Processor-centric Operating Systems. In 15th Workshop on Hot Topics in Operating Systems, HotOS XV, Kartause Ittingen, Switzerland, May 2015.
[34] Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexandra Fedorova, and Vivien Quéma. Large Pages May Be Harmful on NUMA Systems. In Proceedings of the 2014 USENIX Annual Technical Conference, USENIX ATC '14, pages 231–242, Philadelphia, PA, 2014.
[35] Simon Gerber, Gerd Zellweger, Reto Achermann, Kornilios Kourtis, Timothy Roscoe, and Dejan Milojicic. Not Your Parents' Physical Address Space. In 15th Workshop on Hot Topics in Operating Systems, HotOS XV, Kartause Ittingen, Switzerland, May 2015.
[36] John Giacomoni, Tipp Moseley, and Manish Vachharajani. FastForward for Efficient Pipeline Parallelism: A Cache-optimized Concurrent Lock-free Queue. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08, pages 43–52, Salt Lake City, UT, USA, 2008.
[37] Dan Grossman, Greg Morrisett, Trevor Jim, Michael Hicks, Yanling Wang, and James Cheney. Region-based Memory Management in Cyclone. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, PLDI '02, pages 282–293, Berlin, Germany, 2002.
[38] Steven M. Hand. Self-paging in the Nemesis Operating System. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI '99, pages 73–86, New Orleans, Louisiana, USA, 1999.
[39] Marjan Hericko, Matjaz B. Juric, Ivan Rozman, Simon Beloglavec, and Ales Zivkovic. Object Serialization Analysis and Comparison in Java and .NET. SIGPLAN Not., 38(8):44–54, August 2003.
[40] Intel Corporation. Intel Itanium Architecture Software Developer's Manual. Document Number: 245315.
[41] Bär Jeremia and Föllmi Claudio. Bulk Transfer over Shared Memory. Technical report, ETH Zurich, February 2014. http://www.barrelfish.org/dsl-bulk-shm-report.pdf.
[42] Matjaz B. Juric, Bostjan Kezmah, Marjan Hericko, Ivan Rozman, and Ivan Vezocnik. Java RMI, RMI Tunneling and Web Services Comparison and Performance Analysis. SIGPLAN Not., 39(5):58–65, May 2004.
[43] Stefan Kaestle, Reto Achermann, Timothy Roscoe, and Tim Harris. Shoal: Smart Allocation and Replication of Memory For Parallel Programs. In Proceedings of the 2015 USENIX Annual Technical Conference, USENIX ATC 15, pages 263–276, Santa Clara, CA, USA, July 2015.
[44] Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, and Osman Ünsal. Redundant Memory Mappings for Fast Access to Large Memories. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA '15, pages 66–78, Portland, Oregon, 2015.
[45] Vijay Karamcheti and Andrew A. Chien. Software Overhead in Messaging Layers: Where Does the Time Go? In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VI, pages 51–60, San Jose, California, USA, 1994.
[46] Pete Keleher, Alan L. Cox, Sandhya Dwarkadas, and Willy Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the USENIX Winter 1994 Technical Conference, WTEC '94, pages 10–10, San Francisco, California, USA, 1994.
[47] Eric J. Koldinger, Jeffrey S. Chase, and Susan J. Eggers. Architecture Support for Single Address Space Operating Systems. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS V, pages 175–186, Boston, Massachusetts, USA, 1992.
[48] Doug Lea. dlmalloc: A Memory Allocator. http://g.oswego.edu/dl/html/malloc.html, April 2000.
[49] J. R. Levine. Linkers and Loaders. Morgan Kaufmann, 2000.
[50] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–2079, 2009.
[51] Kai Li. IVY: A Shared Virtual Memory System for Parallel Computing. ICPP (2), 88:94, 1988.
[52] A. Lindstrom, J. Rosenberg, and A. Dearle. The Grand Unified Theory of Address Spaces. In Proceedings of the Fifth Workshop on Hot Topics in Operating Systems, HotOS-V, pages 66–71, May 1995.
[53] Michael J. Mahon, Ruby Bei-Loh Lee, Terrence C. Miller, Jerome C. Huck, and William R. Bryg. The Hewlett-Packard Precision Architecture: The Processor. Hewlett-Packard Journal, 37(8):16–22, August 1986.
[54] Zoltan Majo and Thomas R. Gross. Memory Management in NUMA Multicore Systems: Trapped Between Cache Contention and Interconnect Overhead. In Proceedings of the International Symposium on Memory Management, ISMM '11, pages 11–20, San Jose, California, USA, 2011.
[55] Collin McCurdy, Alan L. Cox, and Jeffrey Vetter. Investigating the TLB Behavior of High-end Scientific Applications on Commodity Microprocessors. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS '08, pages 95–104, 2008.
[56] Microsoft Corp. 4-Gigabyte Tuning: BCDEdit and Boot.ini. https://msdn.microsoft.com/en-us/library/windows/desktop/bb613473(v=vs.85).aspx.
[57] Microsoft Corp. Address Windowing Extensions. https://msdn.microsoft.com/en-us/library/windows/desktop/aa366527(v=vs.85).aspx.
[58] Iulian Moraru, David G. Andersen, Michael Kaminsky, Niraj Tolia, Parthasarathy Ranganathan, and Nathan Binkert. Consistent, Durable, and Safe Memory Management for Byte-addressable Non Volatile Main Memory. In Proceedings of the First ACM SIGOPS Conference on Timely Results in Operating Systems, TRIOS '13, pages 1:1–1:17, Farmington, Pennsylvania, 2013.
[59] Steven Jay Munroe, Scott Alan Plaetzer, and James William Stopyro. Computer System Having Shared Address Space Among Multiple Virtual Address Spaces. Online, January 20 2004. US Patent 6,681,239, https://www.google.com/patents/US6681239.
[60] Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. Latency-Tolerant Software Distributed Shared Memory. In Proceedings of the 2015 USENIX Annual Technical Conference, USENIX ATC 15, pages 291–305, Santa Clara, CA, July 2015.
[61] Karen L. Noel and Nitin Y. Karkhanis. OpenVMS Alpha 64-bit Very Large Memory Design. Digital Tech. J., 9(4):33–48, April 1998.
[62] Stanko Novakovic, Alexandros Daglis, Edouard Bugnion, Babak Falsafi, and Boris Grot. Scale-out NUMA. In Proceedings of the Nineteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, pages 3–18, Salt Lake City, Utah, USA, 2014.
[63] John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Diego Ongaro, Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. The Case for RAMCloud. Commun. ACM, 54(7):121–130, July 2011.
[64] S. J. Plimpton, R. Brightwell, C. Vaughan, K. Underwood, and M. Davis. A Simple Synchronous Distributed-Memory Algorithm for the HPCC RandomAccess Benchmark. In Proceedings of the 2006 IEEE International Conference on Cluster Computing, pages 1–7, September 2006.
[65] Richard Rashid, Avadis Tevanian, Michael Young, David Golub, Robert Baron, David Black, William Bolosky, and Jonathan Chew. Machine-independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures. In Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS II, pages 31–39, Palo Alto, California, USA, 1987.
[66] Barret Rhoden, Kevin Klues, David Zhu, and Eric Brewer. Improving Per-node Efficiency in the Datacenter with New OS Abstractions. In Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC '11, pages 25:1–25:8, Cascais, Portugal, 2011.
[67] Salvatore Sanfilippo. Redis. http://redis.io/, August 2015.
[68] Scott Schneider, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. Scalable Locality-conscious Multithreaded Memory Allocation. In Proceedings of the 5th International Symposium on Memory Management, ISMM '06, pages 84–94, Ottawa, Ontario, Canada, 2006.
[69] Vivek Seshadri, Gennady Pekhimenko, Olatunji Ruwase, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry, and Trishul Chilimbi. Page Overlays: An Enhanced Virtual Memory Framework to Enable Fine-grained Memory Management. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA '15, pages 79–91, Portland, Oregon, 2015.
[70] Piyush Shivam, Pete Wyckoff, and Dhabaleswar Panda. EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing. In Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, SC '01, pages 57–57, Denver, Colorado, 2001.
[71] Audie Sumaray and S. Kami Makki. A Comparison of Data Serialization Formats for Optimal Efficiency on a Mobile Platform. In Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ICUIMC '12, pages 48:1–48:6, Kuala Lumpur, Malaysia, 2012.
[72] Brian Paul Swenson and George F. Riley. A New Approach to Zero-Copy Message Passing with Reversible Memory Allocation in Multi-core Architectures. In Proceedings of the 2012 ACM/IEEE/SCS 26th Workshop on Principles of Advanced and Distributed Simulation, PADS '12, pages 44–52, Washington, DC, USA, 2012.
[73] M. Takahashi, K. Kono, and T. Masuda. Efficient Kernel Support of Fine-grained Protection Domains for Mobile Code. In Proceedings of the 19th IEEE International Conference on Distributed Computing Systems, ICDCS '99, pages 64–73, Washington, DC, USA, 1999.
[74] Madhusudhan Talluri and Mark D. Hill. Surpassing the TLB Performance of Superpages with Less Operating System Support. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VI, pages 171–182, San Jose, California, USA, 1994.
[75] Chandramohan A. Thekkath, Henry M. Levy, and Edward D. Lazowska. Separating Data and Control Transfer in Distributed Operating Systems. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VI, pages 2–11, San Jose, California, USA, 1994.
[76] Dana Vantrease, Robert Schreiber, Matteo Monchiero, Moray McLaren, Norman P. Jouppi, Marco Fiorentino, Al Davis, Nathan Binkert, Raymond G. Beausoleil, and Jung Ho Ahn. Corona: System Implications of Emerging Nanophotonic Technology. In ACM SIGARCH Computer Architecture News, volume 36, pages 153–164. IEEE Computer Society, 2008.
[77] Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VII, pages 279–289, Cambridge, Massachusetts, USA, 1996.
[78] Haris Volos, Andres Jaan Tack, and Michael M. Swift. Mnemosyne: Lightweight Persistent Memory. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 91–104, Newport Beach, California, USA, 2011.
[79] David Wentzlaff and Anant Agarwal. Factored Operating Systems (Fos): The Case for a Scalable Operating System for Multicores. SIGOPS Oper. Syst. Rev., 43(2):76–85, April 2009.
[80] Seth J. White and David J. DeWitt. QuickStore: A High Performance Mapped Object Store. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, SIGMOD '94, pages 395–406, New York, NY, USA, 1994. ACM.
[81] John Wilkes and Bart Sears. A Comparison of Protection Lookaside Buffers and the PA-RISC Protection Architecture. Technical Report HPL-92-55, Computer Systems Laboratory, Hewlett-Packard Laboratories, Palo Alto, CA, USA, March 1992.
[82] M. V. Wilkes. The Cambridge CAP Computer and Its Operating System (Operating and Programming Systems Series). North-Holland Publishing Co., Amsterdam, The Netherlands, 1979.
[83] Emmett Witchel, Junghwan Rhee, and Krste Asanović. Mondrix: Memory Isolation for Linux Using Mondriaan Memory Protection. In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, SOSP '05, pages 31–44, Brighton, United Kingdom, 2005.
[84] Rumi Zahir, Jonathan Ross, Dale Morris, and Drew Hess. OS and Compiler Considerations in the Design of the IA-64 Architecture. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IX, pages 212–221, Cambridge, Massachusetts, USA, 2000.
[85] Jin Zhou and Brian Demsky. Memory Management for Many-core Processors with Software Configurable Locality Policies. In Proceedings of the 2012 International Symposium on Memory Management, ISMM '12, pages 3–14, Beijing, China, 2012.