Revisiting Swapping in User-Space with Lightweight Threading

Revisiting Swapping in User-space with Lightweight Threading Kan Zhongy, Wenlin Cuiy, Youyou Lux, Quanzhang Liuy, Xiaodan Yany, Qizhao Yuany, Siwei Luoy, and Keji Huangy yHuawei Technologies Co., Ltd Chengdu, China xDepartment of Computer Science and Technology, Tsinghua University Beijing, China Abstract Based on the virtual memory system, existing OS pro- Memory-intensive applications, such as in-memory databases, vides swapping to enlarge the main memory by writing caching systems and key-value stores, are increasingly de- inactive pages to a backend store, which today is usually manding larger main memory to fit their working sets. Con- backed by SSDs. Compared to SSDs, DRAM still has orders ventional swapping can enlarge the memory capacity by of magnitude performance advantage, providing memory- paging out inactive pages to disks. However, the heavy I/O like performance by paging with SSDs has been explored stack makes the traditional kernel-based swapping suffers for decades [3–13] and still remains great challenges. Espe- from several critical performance issues. cially, we find that the heavy kernel I/O stack introduces In this paper, We redesign the swapping system and pro- large performance penalty, more than 40% of the time is cost pose LightSwap, an high-performance user-space swapping by the I/O stack when swapping in/out a signal page, and scheme that supports paging with both local SSDs and re- this number will keep increasing if ultra-low latency storage mote memories. First, to avoids kernel-involving, a novel media, such as Intel Optane [14] and KIOXIA XL-FLash [15] page fault handling mechanism is proposed to handle page are adopted as the backend stores. faults in user-space and further eliminates the heavy I/O To avoid the high-latency of paging with SSDs, memory stack with the help of user-space I/O drivers. Second, we disaggregation architecture [16–20] proposes to expose a co-design Lightswap with light weight thread (LWT) to im- global memory address space to all machines to improve prove system throughput and make it be transparent to user memory utilization and avoid memory over-provisioning. applications. Finally, we propose a try-catch framework in However, existing memory disaggregation proposals require Lightswap to deal with paging errors which are exacerbated new hardware supports [17, 18, 21–23], making these solu- by the scaling in process technology. tions infeasible and expensive in real production environ- We implement Lightswap in our production-level system ments. Fortunately, recent researches, such as Infiniswap [24], and evaluate it with YCSB workloads running on memcached. Fluidmem [25], and AIFM [26] have shown that paging or Results show that Ligthswap reduces the page faults han- swapping with remote memory is a promising solution to en- dling latency by 3–5 times, and improves the throughput of able memory disaggregation. However, we still find several memcached by more than 40% compared with the stat-of-art critical issues of these existing approaches. swapping systems. First, kernel-based swapping, which relies on the heavy kernel I/O stack exhibits large software stack overheads, CCS Concepts: • Software and its engineering ! Oper- making it cannot fully exploit the high-performance and low- ating systems. arXiv:2107.13848v1 [cs.OS] 29 Jul 2021 latency characteristics of emerging storage media (e.g., Intel Optane) or networks (e.g., RDMA). We measured the remote Keywords: user-space swapping, memory disaggregation, memory access latency of Infiniswap, which is based the light weight thread kernel swapping, can be as high as 40`s even using one-side RDMA. Further, modern applications exhibit diverse mem- 1 Introduction ory access patterns, kernel-based swapping fails to provide Memory-intensive applications[1, 2], such as in-memory enough flexibility to customize the eviction and prefetch- databases, caching systems, in-memory key-value stores are ing policy. Second, recent research [25] has also explored increasingly demanding more and more memory to meet the user-space swapping with userfaultfd, which however their low-latency and high-throughput requirements as these needs extra memory copy operations and exhibits ultra-high applications often experience significant performance loss page fault latency under high-concurrency (i.e., 107`s un- once their working set cannot fit in memory. Therefore, ex- der 64 threads), leading to systems that based on userfaultfd tending the memory capacity becomes a mission-critical task cannot tolerate frequent page faults. Finally, new proposals for both researchers and system administrators. like AIFM [26] and Semeru [27], that do not rely on virtual 40 memory abstraction, provide a runtime-managed swapping Disk Kernel stack User stack to applications and can largely reduce the I/O amplification. 30 60.2% 56.6% However, these schemes break the compatibility and require 66.5% 62.9% 20 large efforts to rewrite existing applications. 4.1% 6.3% 5.3% 5.2% Therefore, we argue that swapping need to be redesigned 10 Latency (us) Latency to become high-performance and remains transparent to 0 applications. In this paper, we propose Lightswap, an user- 4K-Rnd-Rd 4K-Rnd-Wr 8K-Rnd-Rd 8K-Rnd-Wr space swapping that supports paging with both SSDs and Figure 1. remote memories. First, Lightswap handles page faults in Random read and write latency breakdown. The user-space and utilizes the high-performance user I/O stack numbers on top of each bar denote the relative fraction of I/O to eliminate the software stack overheads. To this end, we stack time in the total latency. propose an ultra-low latency page fault handling mechanism 2 Background and Motivation to handle page faults in user-space (§4.2). Second, when page faults happen, existing swapping schemes will block the 2.1 Swapping faulting thread to wait for data fetch, which lowers the sys- Existing swapping approaches [3–10] depend on the existing tem throughput. In Lightswap, we co-design swapping with kernel data path that is optimized for slow block devices, light weight thread (LWT) to achieve both high-performance both reading and writing pages from/to the backend store and application-transparent (§4.3). When page fault happens, would introduce high software stack overheads. Figure1 leveraging the low context switch cost of LWT, we use a ded- compares the I/O latency breakdown for random read and icated swap-in LWT to handle page fault and allow other write of XL-Flash [15] while using the kernel data path and LWTs to occupy the CPU while waiting for data fetch. Finally, user-space SPDK [28] driver. The figure shows that over half due to the scaling in process technology and the ever increas- (56.6% – 66.5%) of the time is spent on the kernel I/O stack ing capacity, we find that both DRAM and SSD become more for both read and write while using the kernel data path. prone to errors. Therefore, we propose a try-catch exception This overhead mostly comes from the generic block layer framework in Lightswap to deal with paging errors (§4.4). and device driver. Comparably, the I/O stack overhead is We evaluate Lightswap with memcached workloads. Eval- negligible while using the user-space SPDK driver. uation results show that Lightswap can reduce the page fault Besides using local SSDs as the backend store, the high handling latency by 3–5 times and improve the throughput bandwidth and low latency RDMA network offers the oppor- by more than 40% when compared to Infiniswap. In summary, tunity for swapping pages to remote memories. The newly we make the following contributions: proposed memory disaggregation architecture [16–20] takes a first look on remote memory paging. In disaggregation model, computing nodes can be composed of large amount of memory borrowing space from remote memory servers. Ex- • We propose an user-space page fault handling mecha- isting works, such as Infiniswap [24] and FluidMem [25] have nism that can achieve ultra-low page fault notification showed that swapping to remote memories is a promising so- latency (i.e., 2.4`s under 64 threads). lution for memory disaggregation. Nevertheless, Infniswap • We demonstrate that with the help of LWT, user-space exposes remote memory as a local block device, paging in/out swapping system can achieve both high-performance still needs to go through the whole kernel I/O stack. Our and application-transparent. evaluation shows that the remote memory access in Infin- • We propose a try-catch based exception framework to iswap can as high as 40`s, which is about 10x higher than handle paging errors. To best of our knowledge, we the latency of a 4KB page RDMA read. The difference is all are the first work to explore handling both memory caused by the slow kernel data path. errors and storage device errors in a swapping system. • We show the performance benefits of Lightswap with 2.2 Linux eBPF YSCB workloads on memcached, and compare it to other swapping schemes. eBPF (for extended Berkeley Packet Filter) is a general virtual machine that running inside the Linux kernel. It provides an instruction set and an execution environment to run eBPF programs in kernel. Thus, user-space applications can instru- The rest of the paper is organized as follows. Section2 ment the kernel by eBPF programs without changing kernel presents the background and motivation. Section3 and4 source code or loading kernel modules. eBFP programs are discuss our design considerations and the design details of written in a special assembly language. Figure2 shows how Lightswap, respectively. Section5 presents the implementa- eBPF works. As shown, eBPF bytecode can be loaded into tion and evaluation of Lightswap. We cover the related work the kernel using bpf() system call.

Revisiting Swapping in User-Space with Lightweight Threading

Memory Deduplication: An

Overcoming Traditional Problems with OS Huge Page Management

Murciano Soto, Joan; Rexachs Del Rosario, Dolores Isabel, Dir

Asynchronous Page Faults Aix Did It Red Hat Author Gleb Natapov August 10, 2010

What Is an Operating System III 2.1 Compnents II an Operating System

Strict Memory Protection for Microcontrollers

Measuring Software Performance on Linux Technical Report

Hiding Process Memory Via Anti-Forensic Techniques

Guide to Openvms Performance Management

Virtual Memory §5.4 Virt Virtual Memory U Al Memo Use Main Memory As a “Cache” For

Virtual Memory - Paging

Chapter 4 Memory Management and Virtual Memory Asst.Prof.Dr