Mitigating the Performance-Efficiency Tradeoff in Resilient Memory Disaggregation

Youngmoon Lee∗, Hasan Al Maruf∗, Mosharaf Chowdhury∗, Asaf Cidon§, Kang G. Shin∗
∗University of Michigan  §Columbia University

ABSTRACT

We present the design and implementation of a low-latency, low-overhead, and highly available resilient disaggregated cluster memory. Our proposed framework can access erasure-coded remote memory within a single-digit μs read/write latency, significantly improving the performance-efficiency tradeoff over the state-of-the-art – it performs similarly to in-memory replication with 1.6× lower memory overhead. We also propose a novel coding group placement algorithm for erasure-coded data that provides load balancing while reducing the probability of data loss under correlated failures by an order of magnitude.

1 INTRODUCTION

To address the increasing memory pressure in datacenters, two broad approaches toward cluster-wide remote memory have emerged in recent years. The first requires redesigning applications from scratch using the key-value interface and/or RDMA [17, 26, 38, 59, 61, 72]. The explicit nature of such solutions makes them efficient and exposes cluster memory availability concerns to the applications. However, rewriting applications may often not be possible.

Memory disaggregation, in contrast, exposes remote memory throughout the cluster in an implicit manner via well-known abstractions such as the distributed virtual file system (VFS) and the distributed virtual memory manager (VMM) [12, 33, 36, 45, 66, 74]. With the advent of RDMA, recent solutions are now close to meeting the single-digit μs latency required to support acceptable application-level performance [33, 45]. Indeed, unmodified in-memory DBMSs can often achieve significant performance benefits using disaggregated memory by transparently replacing slower disk accesses with orders-of-magnitude faster remote memory accesses.

However, realizing memory disaggregation for heterogeneous workloads running in a large-scale cluster faces considerable challenges [13, 20] stemming from two root causes:

(1) Expanded failure domains: Because applications rely on memory across multiple machines in a disaggregated cluster, they become susceptible to a wide variety of failure scenarios. Potential failures include independent and correlated failures of remote machines, evictions from and corruptions of remote memory, and network partitions.

(2) Tail at scale: Applications also suffer from stragglers or late-arriving remote responses. Stragglers can arise from many sources, including latency variabilities in a large network due to congestion and background traffic [25].

While one leads to catastrophic failures and the other manifests as service-level objective (SLO) violations, both are unacceptable in production [45, 53].

State-of-the-art solutions take three primary approaches: (i) local disk backup [36, 66], (ii) remote in-memory replication [31, 50], and (iii) remote in-memory erasure coding [61, 64, 70, 73] and compression [45]. Unfortunately, they suffer from some combinations of the following problems.

High latency: The first approach has no additional memory overhead, but its access latency is intolerably high in the presence of any of the aforementioned failure scenarios. The systems that take the third approach do not meet the single-digit μs latency requirement of disaggregated cluster memory even when paired with RDMA (Figure 1).

High cost: While the second approach has low latency, it doubles memory consumption as well as network bandwidth requirements. The first and second approaches represent the two extreme points in the performance-vs-efficiency tradeoff space for resilient cluster memory (Figure 1).

Low availability: All three approaches lose availability to low-latency memory when even a very small number of servers become unavailable. With the first approach, if a single server fails, its data must be reconstituted from disk, which is a slow process. With the second and third approaches, when even a small number of servers (e.g., three) fail simultaneously, some users lose access to data. This is because replication and erasure coding assign replicas and coding groups to random servers, and random data placement is susceptible to data loss when a small number of servers fail at the same time [23, 24] (Figure 2).

In this paper, we consider how to mitigate these three problems and present Hydra, a low-latency, low-overhead, and highly available resilience mechanism for disaggregated memory. While erasure codes are known for reducing storage overhead and for better load balancing, we demonstrate how to achieve erasure-coded disaggregated cluster memory with single-digit μs latency that is also resilient to simultaneous server failures.

We explore the challenges and tradeoffs for resilient memory disaggregation without sacrificing application-level performance or incurring high overhead in the presence of correlated failures (§2). We also explore the trade-off between load balancing and high availability in the presence of simultaneous server failures. Our solution, Hydra, is a configurable resilience mechanism that applies online erasure coding to individual remote memory pages while maintaining high availability (§3). Hydra's carefully designed data path enables it to access remote memory pages within single-digit μs median and tail latency (§4). Furthermore, we develop CodingSets, a novel coding group placement algorithm for erasure codes that provides load balancing while reducing the probability of data loss under correlated failures (§5).

We implement Hydra on Linux kernel 4.11.0 and integrate it with the two major logical memory disaggregation approaches widely embraced today: the disaggregated virtual memory manager (VMM), used by Infiniswap [36] and LegoOS [66], and the disaggregated virtual file system (VFS), used by Remote Regions [12] (§6).

Our evaluation using multiple memory-intensive applications with production workloads shows that Hydra achieves the best of both worlds (§7). On the one hand, it closely matches the performance of replication-based resilience with 1.6× lower memory overhead, with or without the presence of failures. On the other hand, it improves the latency and throughput of the benchmark applications by up to 64.78× and 20.61×, respectively, over SSD backup-based resilience with only 1.25× higher memory overhead. Hydra also provides performance benefits for real-world application and workload combinations. It improves the throughput of VoltDB over the state-of-the-art by 2.14× during online transaction processing. Similarly, during online analytics, Hydra improves the completion time of Apache Spark/GraphX by 4.35× over its counterparts. Using CodingSets, Hydra reduces the probability of data loss under simultaneous server failures by about 10×.

Contributions. We make the following contributions:
• Hydra is the first in-memory erasure coding scheme that achieves the single-digit μs median and tail latency needed for in-memory transaction processing and analytics DBMSs.
• Novel analysis of the load balancing and availability trade-off for distributed erasure codes.
• CodingSets is a new data placement scheme that balances availability and load balancing, while reducing the probability of data loss by an order of magnitude under correlated failures.

Figure 1: Performance-vs-efficiency tradeoff in the resilient cluster memory design space. Here, the Y-axis is in log scale.

Figure 2: Availability-vs-efficiency tradeoff considering 1% simultaneous server failures in a 1000-machine cluster.

2 BACKGROUND AND MOTIVATION

2.1 Memory Disaggregation

Memory disaggregation exposes memory available in remote machines as a pool of memory shared by many machines. It is often implemented logically by leveraging unused/stranded memory in remote machines via well-known abstractions, such as the file abstraction [12], remote memory paging [33, 36, 46], and virtual memory management for a distributed OS [66]. In the past, specialized memory appliances for physical memory disaggregation were proposed as well [48, 49].

All existing memory disaggregation solutions use the 4KB page granularity. While some applications use huge pages for performance enhancement [44], the Linux kernel still performs paging at the basic 4KB level by splitting individual huge pages, because huge pages can result in high amplification for dirty data tracking [19]. In such systems, as an application's working set spans multiple machines, the probability of being affected by a failure event increases as well. Existing memory disaggregation systems use disk backup [36, 66] and in-memory replication [31, 50] to provide availability in the event of failures.

2.2 Failures in Disaggregated Memory

The probability of failure or temporary unavailability is higher in a large memory-disaggregated cluster, since memory is being accessed remotely. To illustrate possible performance penalties in the presence of such unpredictable events, we consider a resilience solution from the existing literature [36], where each page is asynchronously backed up to a local SSD. We run the transaction processing benchmark TPC-C [10] on an in-memory database system, VoltDB [11]. We set VoltDB's available memory to 50% of its peak memory to force remote paging for up to 50% of its working set.

1. Remote Failures and Evictions. Machine failures are the norm in large-scale clusters [75]. Without redundancy, applications relying on remote memory may fail when a remote machine fails or remote memory pages are evicted. Because disk operations are significantly slower than the latency requirement of memory disaggregation, disk-based fault-tolerance is also far from practical. In the presence of a remote failure, VoltDB experiences an almost 90% throughput loss (Figure 3a); throughput recovery takes a long time after the failure happens.

2. Background Network Load. Network load throughout a large cluster can experience significant fluctuations [25, 40], which can inflate RDMA latency and cause application-level stragglers, leading to unpredictable performance [76]. In the presence of an induced bandwidth-intensive background load, VoltDB throughput drops by about 50% (Figure 3b).

Figure 3: TPC-C throughput over time on VoltDB when 50% of the working set fits in memory: (a) remote failure, (b) background network load, (c) request burst.

3. Request Bursts. Applications themselves can have bursty memory access patterns. Existing solutions maintain an in-memory buffer to absorb temporary bursts [12, 36, 59]. However, because the buffer ties remote access latency to disk latency when it is full, it can become the bottleneck when a workload experiences a prolonged burst. While a page read from remote memory is still fast, backup page writes to the local disk become the bottleneck after the 100th second in Figure 3c. As a result, throughput drops by about 60%.

2.3 Performance vs. Efficiency Tradeoff for Resilience

In all of these scenarios, the obvious alternative – in-memory 2× or 3× replication [31, 50] – is effective in mitigating a small-scale failure, such as the loss of a single server (Figure 3). When an in-memory copy becomes unavailable, we can switch to an alternative. Unfortunately, replication incurs high memory overhead in proportion to the number of replicas, which defeats the purpose of memory disaggregation. Hedging requests to avoid stragglers [25] in a replicated system doubles its bandwidth requirement as well. This leads to an impasse: one has to either settle for high latency in the presence of a failure or incur high memory overhead. Figure 1 depicts this performance-vs-efficiency tradeoff, measured in terms of remote memory read latency under failures and the memory usage overhead needed to provide resilience.

Beyond the two extremes in the tradeoff space, there are two primary alternatives to achieve high resilience with low overhead. The first is replicating pages to remote memory after compressing them (e.g., using zswap) [45], which improves the tradeoff in both dimensions. However, its latency can be more than 10μs when data is in remote memory. Especially during resource scarcity, a prolonged burst of accesses to remote compressed pages can lead to orders-of-magnitude higher latency due to the demand spike in both CPU and local DRAM consumption for decompression. Additionally, this approach suffers from issues similar to replication, such as latency inflation due to stragglers.

The alternative is erasure coding, which has recently made its way from disk-based storage to in-memory cluster caching to reduce storage overhead and improve load balancing [14, 21, 61, 69, 70, 73]. Typically, an object is divided into k data splits and encoded to create r equal-sized parity splits (k > r), which are then distributed across (k + r) failure domains.¹ Existing erasure-coded memory solutions deal with large objects (e.g., larger than 1 MB [61]), where the hundreds-of-μs latency of the TCP/IP stack can be ignored. Simply replacing TCP with RDMA is not enough either. For example, EC-Cache with RDMA (shown in Figure 1) provides a lower storage overhead than compression but with a latency around 20μs. Last but not least, all of these approaches experience high unavailability in the presence of correlated failures [24].

¹A failure domain is a set of remote machines that are likely to experience a correlated failure, e.g., servers in a rack sharing a single power source.

2.4 Challenges in Erasure-Coded Memory

High Latency. Individually erasure coding 4 KB pages that are already small leads to even smaller data chunks (4/k KB), which contributes to the higher latency of erasure-coded remote memory over RDMA due to the following primary reasons:

(1) Non-negligible coding overhead: When using erasure codes with on-disk data or over slower networks that have hundreds-of-μs latency, the 0.7μs encoding and 1.5μs decoding overheads can be ignored. However, they become non-negligible when dealing with DRAM and RDMA.

(2) Stragglers and errors: Because erasure codes require k splits before the original data can be reconstructed, any straggler can slow down a remote read. To detect and correct an error, erasure codes require additional splits; an extra read adds another round-trip, doubling the overall read latency.

(3) Interruption overhead: Splitting data also increases the total number of RDMA operations for each request. Any context switch in between can further add to the latency.

(4) Data copy overhead: In a latency-sensitive system, additional data movement can limit the lowest possible latency. During erasure coding, additional data copies into different buffers for data and parity splits can quickly add up.

Availability Under Simultaneous Server Failures. While existing erasure coding schemes can tolerate a small-scale failure without interruptions, when even a relatively modest number of servers fail or become unavailable at the same time (e.g., due to a network partition or a correlated server failure event), they are highly susceptible to losing availability to some of the data. This is because existing erasure coding schemes generate coding groups on random sets of servers [61]. In a coding scheme with k data and r parity splits, an individual coding group will fail to decode the data if r + 1 of its servers fail simultaneously. Now consider a large cluster of servers that experiences r + 1 failures. The probability that those r + 1 servers belong to one specific coding group is low. However, when coding groups are generated randomly (i.e., each of them comprises a random set of (k + r) servers) and there are a large number of coding groups per server, the probability that those r + 1 servers affect any coding group in the cluster is much higher. Therefore, state-of-the-art erasure coding schemes, such as EC-Cache, experience a very high probability of unavailability even when a very small number of servers fail simultaneously.
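To make these overheads concrete, the following back-of-the-envelope sketch (ours, in Python for illustration; not Hydra's code) computes the split size and the resulting memory and bandwidth overheads for a given (k, r, Δ) configuration. With the (8, 2, 1) defaults used later in the paper, it reproduces the 1.25× memory and 1.125× read-bandwidth overheads quoted in §7.

```python
# Back-of-the-envelope (k, r, Delta) arithmetic for erasure-coded pages.
# The helper names are ours; the formulas come from the text above.
PAGE_SIZE_KB = 4

def split_size_kb(k):
    # A 4 KB page is divided into k data splits of 4/k KB each.
    return PAGE_SIZE_KB / k

def memory_overhead(k, r):
    # Storing k data + r parity splits costs (k + r)/k of the page size.
    return (k + r) / k

def read_bandwidth_overhead(k, delta):
    # Reading (k + Delta) splits instead of k costs (k + Delta)/k bandwidth.
    return (k + delta) / k

k, r, delta = 8, 2, 1  # Hydra's defaults (Section 7)
print(split_size_kb(k))                   # 0.5 KB splits
print(memory_overhead(k, r))              # 1.25x memory/bandwidth overhead
print(read_bandwidth_overhead(k, delta))  # 1.125x read-bandwidth overhead
```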

Figure 4: Hydra consists of a Resilience Manager and a Resource Monitor – both can be present in a machine. Hydra can provide resilience for different remote memory systems.

Figure 5: Hydra's address space is divided into fixed-size address ranges, each of which spans (k + r) memory slabs in remote machines; i.e., k for data and r for parity (k=2 and r=1 in this figure).

3 HYDRA ARCHITECTURE

Hydra is an erasure-coded resilience mechanism for existing memory disaggregation techniques that provides a better performance-efficiency tradeoff under remote failures while ensuring high availability under simultaneous failures. It has two main components (Figure 4): (i) the Resilience Manager coordinates erasure-coded resilience operations during remote reads/writes; (ii) the Resource Monitor handles memory management in a remote machine. Both can be present in every machine and work together without central coordination.

3.1 Resilience Manager

The Hydra Resilience Manager enables configurable resilience when applications use remote memory through different state-of-the-art memory disaggregation solutions – e.g., via a virtual file system (VFS) [12] or via paging through a virtual memory manager (VMM) [36, 66]. It transparently handles all aspects of RDMA communication and erasure coding.

Following the typical (k, r) erasure coding construction, the Resilience Manager divides its remote address space into fixed-size address ranges. Each address range resides in (k + r) remote slabs: k slabs for page data and r slabs for parity (Figure 5). Each of the (k + r) slabs of an address range is distributed across (k + r) independent failure domains using our CodingSets placement algorithm (§5). Page accesses are directed to the designated (k + r) machines according to the address–slab mapping. Although remote memory access happens at the page level, the Resilience Manager coordinates with remote Resource Monitors to manage coarse-grained memory slabs to reduce metadata overhead and connection management complexity.

3.2 Resource Monitor

The Hydra Resource Monitor manages a machine's local memory and exposes it to remote Resilience Managers in terms of fixed-size (SlabSize) memory slabs. Different slabs can belong to different machines' Resilience Managers. During each control period (ControlPeriod), the Resource Monitor tracks the available memory in its local machine and proactively allocates (reclaims) slabs to (from) remote mapping when memory usage is low (high). It is also responsible for slab regeneration during remote machine failures or slab corruptions.
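The following is a minimal sketch (Python) of the address-range-to-slab mapping described in §3.1. The class layout, the lazy mapping, and the per-slab offset computation are our assumptions for illustration; the text only specifies that each fixed-size address range maps onto (k + r) slabs chosen by the placement algorithm.

```python
RANGE_SIZE = 1 << 30   # assumed: one range spans 1 GB (the SlabSize of Section 7)
PAGE_SIZE = 4096

class RemoteAddressSpace:
    def __init__(self, k, r, place):
        self.k, self.r = k, r
        # place(range_id) -> (k + r) machine ids, e.g., chosen by CodingSets
        # (Section 5); the first k hold data splits, the last r hold parity.
        self.place = place
        self.slab_map = {}  # range_id -> list of machine ids

    def slabs_for(self, addr):
        range_id = addr // RANGE_SIZE
        if range_id not in self.slab_map:
            machines = self.place(range_id)
            assert len(machines) == self.k + self.r
            self.slab_map[range_id] = machines
        return self.slab_map[range_id]

    def split_offset(self, addr):
        # The i-th page of a range occupies the i-th (4/k KB) slot in each slab.
        page_index = (addr % RANGE_SIZE) // PAGE_SIZE
        return page_index * (PAGE_SIZE // self.k)

# Toy placement mirroring Figure 5's (k=2, r=1) example.
aspace = RemoteAddressSpace(2, 1, lambda rid: [f"m{(rid + i) % 3}" for i in range(3)])
print(aspace.slabs_for(0))        # the (k + r) slabs backing the first range
print(aspace.split_offset(4096))  # second page -> offset 2048 with k = 2
```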
4 RESILIENT DATA PATH

Hydra encodes and decodes each 4 KB page independently instead of batch-coding across multiple pages. This decreases latency by avoiding "batch waiting" time. Moreover, the Resilience Manager does not have to read unnecessary pages within the same batch during remote reads, which reduces bandwidth overhead. Distributing remote I/O across many remote machines increases I/O parallelism too. These benefits, however, come with a latency penalty. In this section, we discuss Hydra's data path design and how it addresses the resilience challenges mentioned in Section 2.4.

4.1 Hydra Remote Memory Data Path

To minimize erasure coding-associated latency overheads, we incorporate four design principles in Hydra's data path to remote memory in the Resilience Manager. A breakdown of the contribution of each of these techniques to Hydra's data path is discussed in our evaluation (Figure 10).

4.1.1 Asynchronously Encoded Write. During a remote write, the Hydra Resilience Manager applies erasure coding within each individual page by dividing it into k splits (the size of each split is 4/k KB for a 4 KB page) and encoding these splits using Reed-Solomon (RS) codes [62] to generate r parity splits. It then writes these (k + r) splits to (k + r) different slabs that have already been mapped to unique remote machines. Each Resilience Manager can have a different choice of k and r.

To hide encoding latency, the Resilience Manager sends the data splits first and responds back; then it encodes and sends the parity splits asynchronously. Any k successful writes out of the (k + r) allow the page to be recovered, but all (k + r) must be written to guarantee resilience in the event of r failures. Decoupling the two hides the encoding latency and the subsequent write latency for the parities without affecting the resilience guarantee. A write is considered complete after all (k + r) splits have been written. Figure 6a depicts the timeline of a page write.
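This write path can be sketched as follows (Python, for illustration only). We assume r = 1 so that the single parity split is a plain XOR of the data splits – the simplest Reed-Solomon instance – whereas Hydra itself computes r RS parities with ISA-L; send_split() and ack() are placeholder callbacks standing in for a one-sided RDMA WRITE and the application acknowledgment, not Hydra's API.

```python
import threading

PAGE_SIZE = 4096

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def write_page(page, k, send_split, ack):
    assert len(page) == PAGE_SIZE and PAGE_SIZE % k == 0
    size = PAGE_SIZE // k
    splits = [page[i * size:(i + 1) * size] for i in range(k)]

    # 1. Send the k data splits and acknowledge the application right away.
    for i, s in enumerate(splits):
        send_split(i, s)
    ack()

    # 2. Encode and send the parity off the critical path; the write is
    #    fully resilient only once all (k + r) splits have been written.
    def parity():
        p = splits[0]
        for s in splits[1:]:
            p = xor_bytes(p, s)
        send_split(k, p)

    threading.Thread(target=parity, daemon=True).start()
```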

Figure 6: Hydra remote write/read: (a) Hydra first writes data splits, then encodes/writes parities to hide encoding latency; (b) Hydra reads from (k + Δ) slabs and completes as soon as the first k splits arrive, avoiding stragglers.

Figure 7: Hydra performs in-place coding with an extra buffer of r splits to reduce the data-copy latency.

4.1.2 Late-Binding Resilient Read. Because Hydra uses RS codes, any k out of the (k + r) splits suffice to reconstruct a page. However, to be resilient in the presence of Δ failures, during a remote read the Hydra Resilience Manager reads from (k + Δ) randomly chosen splits in parallel. A page can be decoded as soon as any k of the (k + Δ) splits arrive. The additional Δ reads mitigate the impact of stragglers on tail latency as well. Figure 6b provides an example of a read operation with k = 2 and Δ = 1, where one of the data slabs (Data Slab 2) is a straggler. Δ = 1 is often enough in practice.
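A sketch of the late-binding read (Python; read_split() and rs_decode() are placeholder callbacks, not Hydra's API): issue (k + Δ) split reads in parallel and return as soon as any k of them arrive, ignoring stragglers.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def read_page(slabs, k, delta, read_split, rs_decode):
    chosen = random.sample(range(len(slabs)), k + delta)
    pool = ThreadPoolExecutor(max_workers=k + delta)
    futures = {pool.submit(read_split, slabs[i], i): i for i in chosen}
    arrived = {}
    try:
        for fut in as_completed(futures):
            arrived[futures[fut]] = fut.result()
            if len(arrived) == k:     # any k splits suffice with RS codes
                return rs_decode(arrived)
        raise IOError("fewer than k splits arrived")
    finally:
        pool.shutdown(wait=False)     # do not block on stragglers
```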
4.1.3 Latency Optimizations. In addition to asynchronous coding during writes and late binding during reads, Hydra employs two additional optimizations in both cases.

Run-to-Completion. As the Resilience Manager divides a 4 KB page into k smaller pieces, RDMA messages become smaller. In fact, their network latency decreases to the point that run-to-completion becomes more beneficial than a context switch. Hence, to avoid interruption-related overheads, the remote I/O request thread waits until the RDMA operations are done.

In-Place Coding. To reduce the number of data copies, the Hydra Resilience Manager uses in-place coding with an extra buffer of r splits. During a write, the data splits are always kept in-page while the encoded r parities are put into the buffer (Figure 7a). Likewise, during a read, the data splits arrive at the page address, and the parity splits find their way into the buffer (Figure 7b).

Because a read can complete as soon as any k valid splits arrive, corrupted/straggler data split(s) can arrive late and overwrite valid page data. To address this, as soon as the Hydra Resilience Manager detects the arrival of k valid splits, it deregisters the relevant RDMA memory regions. It then performs decoding and directly places the decoded data in the page destination. Because the memory region has already been deregistered, any late data split cannot access the page anymore. During all remote I/O, requests are forwarded directly to RDMA dispatch queues without additional copying.

4.2 Handling Failures

Late binding during remote reads automatically mitigates stragglers due to a slow network. We discuss Hydra's mechanisms for handling failures, evictions, and corruptions below.

Remote Failure and Eviction. Hydra uses reliable connections (RC) for all RDMA communication. Hence, we consider unreachability of remote machines – due to machine failures/reboots or network partitions – as the primary cause of failure. When a remote machine becomes unreachable, the Resilience Manager is notified by the RDMA connection manager. Upon disconnection, it first processes all in-flight requests in order. For ongoing I/O operations, it resends the I/O request to other available machines. Since RDMA guarantees strict ordering, in the read-after-write case, read requests arrive at the same RDMA dispatch queue after write requests; hence, read requests will not be served with stale data. Finally, the Resilience Manager marks the failed slab, and future requests are directed only to the available ones.

Eviction handling is similar to that of a failure, except that the Resource Monitor sends an explicit message to the corresponding Resilience Manager.

Corruption. So far we have considered the unavailability of a correct split due to failures and stragglers. However, in the presence of remote memory corruption, the Resilience Manager needs (k + Δ) splits to detect Δ errors and (k + 2Δ + 1) splits to locate and correct them (§5.2). If the Resilience Manager detects an error, it requests Δ + 1 additional reads from the rest of the (k + r) machines. It also marks the machine(s) with corrupted splits as probably erroneous. If the error rate for a remote machine exceeds a user-defined threshold (ErrorCorrectionLimit), subsequent read requests involving that machine are initiated with (k + 2Δ + 1) split requests, since there is a high probability of reaching an erroneous machine; this avoids the wait for the Δ + 1 additional reads. The Resilience Manager continues this until the error rate of the involved machine drops below the ErrorCorrectionLimit. If this continues for long and/or the error rate of a machine goes beyond another threshold (SlabRegenerationLimit), the Resilience Manager initiates a slab regeneration request for that machine.
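The adaptive fan-out rule above fits in a few lines (the error-rate bookkeeping and the threshold value are our assumptions; the (k + Δ) versus (k + 2Δ + 1) rule is from the text):

```python
ERROR_CORRECTION_LIMIT = 0.01  # user-defined threshold; value assumed here

def read_fanout(k, delta, machines, error_rate):
    # If any involved machine is flagged as error-prone, start with enough
    # splits to locate and correct Delta corruptions in one shot, instead of
    # paying a second round-trip for Delta + 1 additional reads.
    if any(error_rate.get(m, 0.0) > ERROR_CORRECTION_LIMIT for m in machines):
        return k + 2 * delta + 1
    return k + delta
```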
Adaptive Slab Allocation/Eviction. The Hydra Resource Monitor allocates memory slabs for remote Resilience Managers and proactively frees/evicts them to avoid local performance impacts (Figure 8). It periodically monitors local memory usage and maintains a headroom to provide enough memory for local applications. When the amount of free memory shrinks below the headroom (Figure 8a), the Resource Monitor proactively frees/evicts slabs to ensure that local applications are unaffected. It uses the decentralized batch eviction algorithm [36], whereby (E + E′) candidate slabs are contacted to determine and evict the E least-frequently-accessed ones. When the amount of free memory grows above the headroom (Figure 8b), the Resource Monitor first attempts to make the local Resilience Manager reclaim its pages from remote memory and unmap the corresponding remote slabs. Furthermore, it proactively allocates new, unmapped slabs that can be readily mapped and used by remote Resilience Managers.

Figure 8: Resource Monitor proactively allocates memory for remote machines and frees local memory under pressure.

Background Slab Regeneration. The Resource Monitor also regenerates unavailable slabs – marked by the Resilience Manager – in the background. During regeneration, writes to the slab are disabled to prevent overwriting new pages with stale ones; reads can still be served without interruption. The Hydra Resilience Manager uses the placement algorithm to find a new regeneration slab in a remote Resource Monitor with lower memory usage. It then hands over the task of slab regeneration to that Resource Monitor. The selected Resource Monitor decodes the unavailable slab by directly reading k randomly-selected remaining valid slabs for that address region. Once regeneration completes, it contacts the Resilience Manager to mark the slab as available. Requests thereafter go to the regenerated slab.
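A sketch of the Resource Monitor's control loop under the parameters used in the evaluation (25% headroom, 1-second control period); the callbacks abstract the mechanisms described above and are our assumptions.

```python
import time

HEADROOM = 0.25        # free-memory headroom (Section 7's setting)
CONTROL_PERIOD = 1.0   # seconds

def resource_monitor_loop(free_fraction, batch_evict, reclaim_and_unmap,
                          preallocate_unmapped):
    while True:
        if free_fraction() < HEADROOM:
            # Local memory pressure: evict the E least-frequently-accessed
            # of (E + E') candidate slabs (decentralized batch eviction).
            batch_evict()
        else:
            # Plenty of local memory: pull pages back from remote memory,
            # unmap their slabs, and pre-allocate unmapped slabs for remote
            # Resilience Managers to map readily.
            reclaim_and_unmap()
            preallocate_unmapped()
        time.sleep(CONTROL_PERIOD)
```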
5 HIGHLY AVAILABLE ERASURE CODING

We now describe CodingSets, a novel coding group placement scheme used by Hydra, which performs load balancing while reducing the probability of data loss. We also show how Hydra deals with data corruption by leveraging distributed redundancy.

5.1 Availability vs. Load Balancing

To motivate the design of CodingSets, we analyze the fault-tolerance of existing erasure coding schemes under simultaneous server unavailability events. Prior work has shown that events causing multiple nodes to fail simultaneously lead to orders-of-magnitude more frequent data loss than independent node failures [23, 32]. Several scenarios can cause multiple servers to fail or become unavailable simultaneously, such as network partitions, partial power outages, and software bugs. An example scenario is a power outage, which can cause 0.5%–1% of machines to fail or go offline concurrently [24]. In the case of Hydra, data loss happens if a concurrent failure kills at least r + 1 of the (k + r) machines of a particular coding group.

We are inspired by copysets, a scheme for preventing data loss under correlated failures in replication [23, 24] by constraining the number of replication groups. Using the same terminology as prior work, we define each unique set of r + 1 servers within a coding group as a copyset. The number of copysets in a single coding group is C(k+r, r+1), where C(n, m) denotes the binomial coefficient. For example, in an (8+2) configuration with nodes numbered 1, 2, ..., 10, the sets of 3 nodes whose simultaneous failure causes data loss (i.e., the copysets) are all 3-combinations of the 10 nodes: (1, 2, 3), (1, 2, 4), ..., (8, 9, 10) – C(10, 3) = 120 copysets in total.

For a data loss event impacting exactly r + 1 random nodes simultaneously, the probability of losing the data of a single specific coding group is:

  P[Group] = C(k+r, r+1) / C(N, r+1),

where N is the total number of servers.

In a cluster with more than (k + r) servers, we need more than one coding group. However, if each server is a member of a single coding group, hot spots can occur when one or more members of that group are overloaded. Therefore, for load-balancing purposes, a simple solution is to allow each server to be a member of multiple coding groups, so that online coding can avoid the overloaded members of any particular group.

Assuming we have G disjoint coding groups and a correlated failure rate of f%, the total probability of data loss is:

  1 − (1 − P[Group] · G)^C(N·f, r+1).

We call coding groups disjoint when they do not share any copysets; in other words, when they do not overlap by more than r nodes.

Strawman: Multiple Coding Groups per Server. To equalize load, the first scheme we considered forms, for each slab, a coding group with the least-loaded nodes in the cluster at coding time. We assume that the least-loaded nodes at a given time are distributed randomly, and that the number of slabs per server is S. When S·(k + r) ≪ N, the coding groups are highly likely to be disjoint [24], and the number of groups is:

  G = N · S / (k + r).

We call this placement strategy the EC-Cache scheme, as it produces the random coding group placement used by the prior state-of-the-art in-memory erasure coding system, EC-Cache [61]. In this scheme, even with a modest number of slabs per server, a large number of combinations of r + 1 machines are copysets; in other words, even a small number of simultaneous node failures in the cluster results in data loss. When the number of slabs per server is high, almost every combination of only r + 1 failures across the cluster causes data loss. Therefore, to reduce the probability of data loss, we need to minimize the number of copysets while achieving sufficient load balancing.

CodingSets: Reducing Copysets for Erasure Coding. To this end, we propose CodingSets, a novel load-balancing scheme that reduces the number of copysets for distributed erasure coding. Instead of having each node participate in several coding groups as in EC-Cache, in our scheme each server belongs to a single, extended coding group. At the time of coding, (k + r) slabs are still considered together, but the participating nodes are chosen from a set of (k + r + l) nodes, where l is the load-balancing factor; the nodes chosen within the extended group are the least-loaded ones. Extending the coding group increases the number of copysets – each extended coding group creates C(k+r+l, r+1) copysets instead of C(k+r, r+1), while the number of groups becomes G = N / (k + r + l) – but it still has a significantly lower probability of data loss than having each node belong to multiple coding groups. Hydra uses CodingSets as its load balancing and slab placement policy. We evaluate its data loss probability and load balancing in Section 7.2.
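This analysis is easy to reproduce numerically. The sketch below (Python) evaluates the formulas above with the evaluation's default parameters (k=8, r=2, l=2, S=16, f=1%, N=1000); it yields roughly 13% data-loss probability for random EC-Cache-style placement versus roughly 1.3% for CodingSets – the order-of-magnitude gap reported in §7.2.

```python
from math import comb

def p_group(group_size, r, n):
    # P[Group]: a given (r + 1)-node failure hits one specific coding group.
    return comb(group_size, r + 1) / comb(n, r + 1)

def p_data_loss(group_size, r, n, groups, f):
    # 1 - (1 - P[Group] * G) ^ C(N*f, r+1), for correlated failure rate f.
    return 1 - (1 - p_group(group_size, r, n) * groups) ** comb(int(n * f), r + 1)

k, r, l, n, s, f = 8, 2, 2, 1000, 16, 0.01
print(comb(k + r, r + 1))                                # 120 copysets per (8+2) group
print(p_data_loss(k + r, r, n, n * s // (k + r), f))     # ~0.13 (random groups)
print(p_data_loss(k + r + l, r, n, n // (k + r + l), f)) # ~0.013 (CodingSets)
```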

Scenario            # of Errors   Minimum # of Splits   Memory Overhead
Failure             r             k + r                 1 + r/k
Error Detection     Δ             k + Δ                 1 + Δ/k
Error Correction    Δ             k + 2Δ + 1            1 + (2Δ + 1)/k

Table 1: Minimum number of splits that need to be in remote machines for resilience under uncertainties.
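Table 1's rules can also be written as a small helper (our naming, for illustration):

```python
def min_remote_splits(k, r, delta, scenario):
    # Splits that must reach remote machines before an I/O is safe.
    return {"failure": k + r,
            "detection": k + delta,
            "correction": k + 2 * delta + 1}[scenario]

def table1_overhead(k, splits):
    return splits / k  # e.g., (k + r)/k = 1 + r/k for the failure row

print(min_remote_splits(8, 2, 1, "correction"))  # 11 splits when k=8, Delta=1
```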

5.2 Fixing Memory Corruption

Besides handling entire-server unavailability, Hydra can also correct memory corruption. In the absence of corruption, waiting for any k writes out of the (k + r) splits is enough to guarantee resilience to failures. The same holds for reads: a read request can complete just after the arrival of the k-th split. However, to guarantee correction in the presence of corruption, remote I/O operations must wait for additional splits. Hydra needs Δ additional splits to detect Δ corruptions and 2Δ + 1 additional splits to locate and fix them. Hence, to provide the correctness guarantee over Δ corruptions, both asynchronous encoding and late binding need to wait for (k + 2Δ + 1) splits to be written to or read from remote machines.

This raises an interesting trade-off. Since memory corruptions are corrected at slab granularity, using smaller slabs – which result in more splits for the same memory capacity – would allow Hydra to correct more memory corruptions. Prior work analyzes this trade-off in the context of distributed storage systems [67]. Therefore, from the memory corruption perspective, minimizing the slab size is optimal. Table 1 summarizes the requirements for the different scenarios.

What About Replication? During in-memory replication, to remain operational after r failures, there should be at least r + 1 copies of the entire 4 KB page; hence the memory overhead is (r + 1)×. However, a remote I/O operation can complete just after confirmation from one of the r + 1 machines. To detect and fix Δ corruptions, replication needs Δ + 1 and 2Δ + 1 copies of the entire page, respectively. Thus, to provide the correctness guarantee over Δ corruptions, replication needs to wait until it writes to or reads from at least 2Δ + 1 of the replicas, along with a memory overhead of (2Δ + 1)×.

6 IMPLEMENTATION

The Hydra Resilience Manager is implemented as a loadable kernel module for Linux kernel 4.11 or later. A kernel-level implementation facilitates its deployment as an underlying block device for different memory disaggregation systems [12, 36, 66]. We integrated Hydra with two memory disaggregation systems: Infiniswap, a disaggregated virtual memory manager (VMM), and Remote Regions, a disaggregated virtual file system (VFS). All I/O operations (e.g., slab mapping, memory registration, RDMA posting/polling, erasure coding) are independent across threads and processed without synchronization. All RDMA operations use reliable connections and one-sided RDMA verbs (RDMA WRITE/READ). Each Resilience Manager maintains one connection to each active remote machine. For erasure coding, we use AVX instructions and the ISA-L library [5], which achieves over 4 GB/s encoding throughput per core for the (8+2) configuration on our evaluation platform.

The Hydra Resource Monitor is implemented as a user-space program. It uses RDMA SEND/RECV operations for all control messages.
7 EVALUATION

We have integrated Hydra with both Infiniswap [36] and Remote Regions [12] and evaluated it on a 50-machine 56 Gbps InfiniBand CloudLab cluster. Our evaluation addresses the following questions:
• Does Hydra improve the resilience of disaggregated cluster memory? (§7.1)
• Does it improve availability? (§7.2)
• How sensitive is Hydra to its parameters, and what are its overheads? (§7.3)
• How much TCO savings can we expect? (§7.4)

Methodology. Unless otherwise specified, we use k=8, r=2, and Δ=1, targeting 1.25× memory and bandwidth overhead. We select r=2 because late binding is still possible even when one of the remote slabs fails. The additional read (Δ=1) incurs 1.125× bandwidth overhead during reads. We use a 1 GB SlabSize, and the additional number of choices for eviction is E′ = 2. The free memory headroom is set to 25%, and the control period is set to 1 second. Each machine has 64 GB of DRAM and 2× Intel Xeon E5-2650v2 with 32 virtual cores.

We compare Hydra against the following alternatives:
• SSD Backup: A copy of each page is backed up on a local SSD for the minimum 1× remote memory overhead. We consider both a disaggregated VMM system (Infiniswap) and a disaggregated VFS system (Remote Regions).
• Replication: We directly write each page over RDMA to two remote machines' memory for a 2× overhead.
• EC-Cache w/ RDMA: An implementation of the erasure coding scheme in EC-Cache [61], but on RDMA.

Our evaluation of Hydra consists of both microbenchmarks and cluster-scale evaluations.

Late-binding 12.3 9.8 9.2 8.8 9.0

100 8.3 8.3 5.9 5.9 5.5 100 10 5.5 10 4.5 4.6 4.6 4.6

1 1 Latency (us) Latency 10 10 (us) Latency Median 99th Median 99th Median 99th Median 99th Read Write Read Write CCDF (%) CCDF CCDF (%) CCDF 1 1 (a) Background network flow (b) Remote Failures 0 5 10 15 20 0 5 10 15 20 25 30 Random Read Latency (μs) Random Write Latency (μs) (a) Remote Read (b) Remote Write Figure 11: Hydra latency in the presence of background net- work flow and remote failure. Figure 10: Latency breakdown of Hydra. TPS/OPS Latency (ms) (thousands) • Replication: We directly write each page over RDMA to two 50th 99th remote machines’ memory for a 2× overhead. HYD REP HYD REP HYD REP 100% 39.4 39.4 52.8 52.8 134.0 134.0 • EC-Cache w/ RDMA: Implementation of the erasure coding VoltDB 75% 36.1 35.3 56.3 56.1 142.0 143.0 scheme in EC-Cache [61], but implemented on RDMA. 50% 32.3 34.0 57.8 59.0 161.0 168.0 100% 123.0 123.0 123.0 123.0 257.0 257.0 Our evaluation of Hydra consists of both microbenchmarks and ETC 75% 119.0 125.0 120.0 121.0 255.0 257.0 cluster-scale evaluations. 50% 119.0 119.0 118.0 122.0 254.0 264.0 100% 108.0 108.0 125.0 125.0 267.0 267.0 SYS 75% 100.0 104.0 120.0 125.0 262.0 305.0 7.1 Resilience Evaluation 50% 101.0 102.0 117.0 123.0 257.5 430.0 We evaluate Hydra’s performance both in the presence and absence Table 2: Hydra (HYD) provides similar performance to repli- of failures with microbenchmarks and real-world memory-sensitive cation (REP) for VoltDB and Memcached workloads (ETC applications. and SYS). Higher is better for throughput-related, and the Lower is better for the latency-related numbers. 7.1.1 Latency Characteristics. First, we measure Hydra’s la- tency characteristics for disaggregated VMM and VFS systems in the absence of failures. Then we analyze the impact of each of its design components. (3) Late binding specifically improves the tail latency during remote Disaggregated VMM Latency. We use a simple application read by 61% by avoiding stragglers. The additional read request with its working set size set to 2GB. It is provided 1GB memory increases the median latency only by 6%. to ensure that 50% of its memory accesses cause paging. While (4) Asynchronous encoding hides erasure coding overhead during using disaggregated memory for remote page-in, Hydra improves writes, reducing the median write latency by 38%. page-in latency over Infiniswap by 1.79× at median and 1.93× at the 99th percentile. Page-out latency is improved by 1.9× and 2.2× over 7.1.2 Latency Under Failures. Infiniswap at median and 99th percentile, respectively. Replication Background Flows. In this case, we generate RDMA flows on provides at most 1.1× improved latency over Hydra, while incurring the remote machine constantly sending 1 GB messages. Unlike SSD 2× memory and bandwidth overhead (Figure 9a). backup and replication, Hydra maintains consistent latency due to Disaggregated VFS Latency. We use fio [3] to generate one late binding. Hydra’s latency improvement over SSD backup is 1.97– million random read/write requests of 4 KB block I/O. During reads, 2.56×, and it even outperforms replication at the 99th percentile by Hydra provides improved latency over Remote Regions by 2.13× 1.33× for read and 1.50× for write (Figure 11a). at median and 2.04× at the 99th percentile. During writes, Hydra also improves the latency over Remote Regions by 2.22× at median Remote Failures. Figure 11b shows that both read and write la- and 1.74× at the 99th percentile. 
7.1.1 Latency Characteristics. First, we measure Hydra's latency characteristics for disaggregated VMM and VFS systems in the absence of failures. Then we analyze the impact of each of its design components.

Disaggregated VMM Latency. We use a simple application with its working set size set to 2 GB. It is provided 1 GB of memory to ensure that 50% of its memory accesses cause paging. While using disaggregated memory for remote page-in, Hydra improves page-in latency over Infiniswap by 1.79× at the median and 1.93× at the 99th percentile. Page-out latency is improved by 1.9× and 2.2× over Infiniswap at the median and 99th percentile, respectively. Replication provides at most 1.1× better latency than Hydra, while incurring 2× memory and bandwidth overhead (Figure 9a).

Disaggregated VFS Latency. We use fio [3] to generate one million random read/write requests of 4 KB block I/O. During reads, Hydra improves latency over Remote Regions by 2.13× at the median and 2.04× at the 99th percentile. During writes, Hydra also improves latency over Remote Regions by 2.22× at the median and 1.74× at the 99th percentile. Replication does not show a significant latency gain over Hydra, improving latency by at most 1.18× (Figure 9b).

Latency Breakdown. Due to its coding overhead, naive erasure coding over RDMA (i.e., EC-Cache with RDMA) performs worse than the disk backup-based solution. Figure 10 shows how each component of Hydra's data path design contributes toward reducing its latency.
(1) Run-to-completion avoids interruptions during remote I/O, reducing the median read and write latency by 51%.
(2) In-place coding saves the additional time for data copying, which substantially adds up in memory disaggregation systems, reducing the read and write latency by 28%.
(3) Late binding specifically improves the tail latency of remote reads by 61% by avoiding stragglers. The additional read request increases the median latency by only 6%.
(4) Asynchronous encoding hides the erasure coding overhead during writes, reducing the median write latency by 38%.

7.1.2 Latency Under Failures.

Background Flows. In this case, we generate RDMA flows on the remote machine that constantly send 1 GB messages. Unlike SSD backup and replication, Hydra maintains consistent latency due to late binding. Hydra's latency improvement over SSD backup is 1.97–2.56×, and it even outperforms replication at the 99th percentile, by 1.33× for reads and 1.50× for writes (Figure 11a).

Remote Failures. Figure 11b shows that both read and write latency become disk-bound when it is necessary to write to and read from the backup SSD. Hydra reduces latency over SSD backup by 8.37–13.6× and 4.79–7.30× during remote reads and writes, respectively. Furthermore, it matches replication's performance.

7.1.3 Application-Level Performance. We now focus on Hydra's benefits for memory-intensive applications and compare it with SSD backup and replication. We evaluate using the following workloads:
• TPC-C benchmark [10] on VoltDB [11];
• Facebook workloads [16] on Memcached [8];
• PageRank with the Twitter dataset [43] on PowerGraph [34] and Apache Spark/GraphX [35].

We consider container-based application deployment [68] and run each application in a container with a memory limit set to fit 100%, 75%, or 50% of the peak memory usage of each application. At 100%, applications run completely in memory. At 75% and 50%, applications hit their memory limits and access pages in remote memory via Hydra.

We present Hydra's application-level performance against replication (Table 2 and Figure 12) to show that it achieves similar performance with lower memory overhead, even in the absence of failures. For brevity, we omit the plots for SSD backup, which performs much worse than both Hydra and replication – albeit with no memory overhead.

For VoltDB, when half of its data is in remote memory (50%), Hydra achieves 0.82× throughput and almost transparent latency characteristics compared to the fully in-memory case (100%). For Memcached, Hydra achieves 0.97× throughput with the GET-dominant ETC workload and 0.93× throughput with the SET-intensive SYS workload; additionally, it provides almost zero latency overhead compared to the fully in-memory scenario (100%). For graph analytics, Hydra achieves almost transparent application performance for PowerGraph, thanks to its optimized heap management. However, it suffers from increased job completion time for GraphX due to massive thrashing of in-memory and remote memory data. Replication does not have significant gains over Hydra.

Figure 12: Hydra also provides completion times similar to replication for graph analytics applications: (a) Hydra, (b) replication.

7.1.4 Application Performance Under Failures. We now analyze Hydra's performance in the presence of failures and compare it against the alternatives. In terms of impact on applications, we first go back to the scenarios discussed in Section 2.2 regarding VoltDB running with a 50% memory constraint. Except for the corruption scenario, where we set r=3, we use Hydra's default parameters. At a high level, we observe that Hydra performs similar to replication with 1.6× lower memory overhead (Figure 13).

Next, we start each benchmark application in the 50% setting and introduce one remote failure while it is running. Hydra's application-level performance is transparent to the presence of remote failures. Figure 14 shows that Hydra provides completion times similar to those of replication at a lower memory overhead in the presence of a remote failure. In comparison to SSD backup, workloads experience 1.3–5.75× lower completion times using Hydra.

7.2 Availability Evaluation

In this section, we evaluate Hydra's availability and load balancing characteristics in large clusters.

7.2.1 Analysis. We compare the availability and load balancing of Hydra with EC-Cache and power-of-two-choices [52]. In CodingSets, each server is attached to a disjoint coding group, and during an encoded write, the (k + r) least-loaded nodes are chosen from the (k + r + l) nodes of the coding group. EC-Cache simply assigns slabs to coding groups comprising random nodes. Power-of-two-choices chooses two candidate nodes at random for each slab and picks the less-loaded one.

Probability of Data Loss Under Simultaneous Failures. To evaluate the probability of data loss of Hydra under different scenarios in a large cluster setting, we compute the probability of data loss under the three schemes. Note that, in terms of data loss probability, we assume EC-Cache and power-of-two-choices select random servers and are therefore equivalent. Figure 15 compares the probabilities of loss for different parameters on a 1000-machine cluster. Our baseline comparison is against the best-case scenario for EC-Cache and power-of-two-choices, where the number of slabs per server is low (1 GB slabs, with 16 GB of memory per server). Even for a small number of slabs per server, Hydra reduces the probability of data loss by an order of magnitude. With a large number of slabs per server (e.g., 100), the probability of failure for EC-Cache becomes very high in case of a correlated failure. The figure also shows that there is an inherent trade-off between the load-balancing factor (l) and the probability of data loss under correlated failures.

Load Balancing of CodingSets. Figure 16 compares the load balancing of the three policies. EC-Cache's random selection of (k + r) nodes causes a higher load imbalance, since some nodes will randomly be more overloaded than others. As a result, CodingSets improves load balancing over the EC-Cache scheme by 1.1× even when l = 0, since CodingSets' coding groups are non-overlapping. For l = 4, CodingSets provides 1.5× better load balancing over EC-Cache at 1M machines. Power-of-two-choices improves load balancing by 0%–20% compared to CodingSets with l = 2, because it has more degrees of freedom in choosing nodes, but it suffers from an order-of-magnitude higher failure rate (Figure 15).
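A toy Monte Carlo version of this comparison (Python; our simplification – "load" is just the number of split placements per node) shows the same trend: random (k + r)-node selection yields a visibly higher max-to-mean load ratio than picking the least-loaded members of disjoint extended groups.

```python
import random

def ec_cache_load(n, k, r, writes, rng):
    load = [0] * n
    for _ in range(writes):
        for m in rng.sample(range(n), k + r):  # random (k + r) nodes per slab
            load[m] += 1
    return load

def codingsets_load(n, k, r, l, writes, rng):
    width = k + r + l
    groups = [range(i, i + width) for i in range(0, n, width)]
    load = [0] * n
    for _ in range(writes):
        group = rng.choice(groups)
        for m in sorted(group, key=lambda x: load[x])[:k + r]:
            load[m] += 1                       # (k + r) least-loaded members
    return load

def imbalance(load):
    return max(load) / (sum(load) / len(load))

rng = random.Random(0)
n, k, r, l, writes = 980, 8, 2, 4, 20000       # 980 = 70 disjoint groups of 14
print(imbalance(ec_cache_load(n, k, r, writes, rng)))      # higher
print(imbalance(codingsets_load(n, k, r, l, writes, rng))) # lower
```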

Figure 13: Hydra throughput with the same setup as in Figure 3: (a) remote failure, (b) background network load, (c) request burst.

Figure 14: Hydra provides transparent completion times in the presence of failure. Note that the Y-axis is in log scale.

7.2.2 Cluster Deployment. We run 250 containerized applications across 50 machines. For each application and workload, we create a container and randomly distribute it across the cluster. Here, the total memory footprint is 2.76 TB; our cluster has 3.20 TB of total memory. Half of the containers use the 100% configuration, about 30% use the 75% configuration, and the rest use the 50% configuration. There are at most two simultaneous failures.

Application Performance. We compare application performance in terms of completion time (Figure 17) and latency (Table 3), which demonstrate Hydra's performance benefits in the presence of cluster dynamics. Hydra's improvements increase with decreasing local memory ratio. Its throughput improvements w.r.t. SSD backup were up to 4.87× for 75% and up to 20.61× for 50%. Its latency improvements were up to 64.78× for 75% and up to 51.47× for 50%. Hydra's performance benefits are similar to replication (Figure 17c), but with lower memory overhead.

Figure 16: CodingSets enhances Hydra with better load balancing across the cluster (base parameters k=8, r=2).

              50th latency (ms)        99th latency (ms)
              SSD    HYD    REP        SSD      HYD    REP
VoltDB 100%   55     60     48         179      173    177
VoltDB  75%   60     57     48         217      185    225
VoltDB  50%   78     61     48         305      243    225
ETC    100%   138    119    118        260      245    247
ETC     75%   148    113    120        9912     240    263
ETC     50%   167    117    111        10175    244    259
SYS    100%   145    127    125        249      269    267
SYS     75%   154    119    113        17557    271    321
SYS     50%   124    111    117        22828    452    356

Table 3: Latencies for VoltDB and Memcached workloads (ETC and SYS) for SSD backup (SSD), Hydra (HYD), and replication (REP) in the cluster experiment.

Figure 15: Probability of data loss for Hydra with different r, l, S, and f parameters (base parameters k = 8, r = 2, l = 2, S = 16, f = 1%) on a 1000-machine cluster: (a) varied parities r, (b) varied load-balancing factors l, (c) varied slabs per machine S, (d) varied failure rates f.

Impact on Memory Imbalance and Stranding. Figure 18 shows that Hydra reduces memory usage imbalance w.r.t. coarser-grained memory management systems: in comparison to SSD backup-based (replication-based) systems, memory usage variation decreased from 18.5% (12.9%) to 5.9%, and the maximum-to-minimum utilization ratio decreased from 6.92× (2.77×) to 1.74×. Hydra better exploits unused memory in under-utilized machines, increasing the minimum memory utilization of any individual machine by 46%. Hydra incurs about 5% additional total memory usage compared to disk backup, whereas replication incurs 20% overhead.

7.3 Sensitivity Evaluation

Finally, we focus on Hydra's sensitivity to its different parameters as well as its overheads.

Impact of (k, r, Δ) Choices. Figure 19a shows read latency characteristics for varying k. Increasing from k=1 to k=2 reduces the median latency by parallelizing data transfers. Further increasing k improves space efficiency (measured as k/(k+r)) and load balancing, but latency deteriorates as well. Figure 19b shows read latency for varying values of Δ. Although just one additional read (from Δ=0 to Δ=1) helps tail latency, more additional reads bring diminishing returns; instead, they hurt latency due to proportionally increasing communication overheads. Figure 19c shows write latency variations for different r values. Increasing r does not affect the median write latency. However, the tail latency increases from r = 3 onward due to the increase in overall communication overheads.

Figure 17: Median completion times (i.e., throughput) of 250 containers on a 50-machine cluster: (a) SSD backup, (b) Hydra, (c) replication.

Resource Overhead. We measure the average CPU utilization of Hydra components during I/O requests. The Hydra Resilience Manager uses event-driven I/O and consumes only 0.001% of the CPU cycles of each core. Erasure coding adds an additional 0.09% CPU usage per core. Because Hydra uses one-sided RDMA, the remote Hydra Resource Monitor has no CPU overhead in the data path. In the cluster deployment, Hydra increased kernel CPU utilization by 2.2% on average and generated 291 Mbps of RDMA traffic per machine, which is only 0.5% of its 56 Gbps bandwidth. Replication had negligible CPU usage but generated more than 1 Gbps of traffic per machine.

Background Slab Regeneration. We manually evict one of the remote slabs and measure the time for it to be regenerated. When it is evicted, the Hydra Resilience Manager places a new slab and provides the evicted slab's information to the corresponding Hydra Resource Monitor, which takes 54 ms. The Resource Monitor then randomly selects k of the remaining remote slabs and reads the page data, which takes 170 ms for a 1 GB slab. Finally, it decodes the page data into the local memory slab within 50 ms. The total regeneration time for a 1 GB slab is therefore 274 ms, as opposed to the several minutes it takes to restart a server after a failure.

Figure 18: Memory load distribution across servers in terms of the average memory usage in each machine.

7.4 TCO Savings

We limit our TCO analysis to memory provisioning only. The TCO savings of Hydra is the revenue from leveraging unused (disaggregated) memory after deducting the TCO of the RDMA hardware. We consider the capital expenditure (CAPEX) of acquiring RDMA hardware and the operational expenditure (OPEX), including its power usage, over three years. An RDMA adapter costs $600 [6], an RDMA switch costs $318 [7] per machine, and the operating cost is $52 over three years [36] – overall, the three-year TCO is $970 for each machine. We consider the standard machine configurations and pricing from Google Cloud Compute [4], Amazon EC2 [2], and Microsoft Azure [2] to build revenue models and calculate the TCO savings for 30% of leveraged memory per machine (Table 4). For example, on Google, the savings of disaggregation over 3 years using Hydra is (($5.18*30*36)/1.25 − $970)/($1553*36)*100% = 6.3%.

Monthly Pricing               Google    Amazon    Azure
Standard machine              $1,553    $2,211    $2,242
1% of memory                  $5.18     $9.21     $5.92
TCO savings: Hydra            6.3%      8.8%      5.1%
TCO savings: Replication      3.3%      5.0%      2.8%

Table 4: Revenue model and TCO savings over three years for each machine with 30% unused memory on average.
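The savings columns of Table 4 follow mechanically from this formula; the sketch below (Python) reproduces all six numbers, using a 1.25× overhead for Hydra and 2× for replication.

```python
# All constants come from the text above (30% unused memory, 36 months,
# $970 three-year RDMA TCO per machine).
def tco_savings_pct(price_1pct_mem, machine_price, overhead,
                    unused_pct=30, months=36, rdma_tco=970.0):
    revenue = price_1pct_mem * unused_pct * months / overhead
    return (revenue - rdma_tco) / (machine_price * months) * 100

pricing = {"Google": (5.18, 1553), "Amazon": (9.21, 2211), "Azure": (5.92, 2242)}
for cloud, (mem, machine) in pricing.items():
    print(cloud,
          round(tco_savings_pct(mem, machine, 1.25), 1),  # Hydra: 6.3 / 8.8 / 5.1
          round(tco_savings_pct(mem, machine, 2.0), 1))   # Replication: 3.3 / 5.0 / 2.8
```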
8 RELATED WORK

Memory Disaggregation Systems. In the last few decades, many software systems have tried leveraging remote machines' memory for paging [1, 22, 27, 29, 36, 45, 46, 50, 56, 63], for global virtual memory abstractions [9, 28, 42], and to create distributed data stores [26, 41, 47, 58]. There are also proposals for hardware-based remote access to disaggregated memory using PCIe interconnects [48] and extended NUMA fabric [57]. Table 5 compares a selected few across key characteristics.

Cluster Memory Solutions. With the advent of RDMA, there has been a renewed interest in cluster memory solutions. The primary ways of leveraging cluster memory are through key-value interfaces [26, 39, 51, 58], distributed shared memory [55, 60], or distributed lock management [71]. However, these solutions are either limited by their interfaces or by their replication overheads. Hydra, on the contrary, is a transparent, memory-efficient, and load-balanced mechanism for resilient remote memory.
System                       Year  Deployability  Fault Tolerance              Load Balancing          Latency Tolerance
Global Memory [28]           '95   OS Change      Local Disk                   Global Manager          None
Memory Pager [50]            '96   Unmodified     Replication/RAID             None                    None
Network RAM [15]             '98   None           Local Disk/Replication/RAID  None                    None
HPBD [46]                    '05   Unmodified     None                         None                    None
Memory Blade [48]            '09   HW Change      Reprovision                  None                    None
RamCloud [58]                '10   App. Change    Remote Disks                 Power of Choices        None
FaRM [26]                    '14   App. Change    Replication                  Central Coordinator     None
EC-Cache [61]                '16   App. Change    Erasure Coding               Multiple Coding Groups  Late Binding
Infiniswap [36]              '17   Unmodified     Local Disk                   Power of Choices        None
Remote Regions [12]          '18   App. Change    None                         Central Manager         None
LegoOS [66]                  '18   OS Change      Remote Disk                  None                    None
Compressed Far Memory [45]   '19   OS Change      None                         None                    None
Hydra                              Unmodified     Erasure Coding               CodingSets              Late Binding
Table 5: Selected proposals on leveraging remote memory over the years.

Remote Memory for DBMSs. Over the last few years, memory management systems have been proposed for DBMSs to utilize remote memory using distributed shared memory abstractions [17, 18, 26, 30, 65] and even over a disaggregated OS [74]. These frameworks consider either the applicability or the performance issues of DBMSs over remote memory. None of them considers improving the performance-vs-efficiency tradeoff for a resilient cluster memory.

Erasure Coding in Storage. Erasure coding has been widely employed in RAID systems to achieve space-efficient fault tolerance [64, 77]. Recent large-scale clusters leverage erasure coding to store cold data in a space-efficient, fault-tolerant manner [37, 54, 69]. EC-Cache [61] is an erasure-coded in-memory cache for objects of 1 MB or larger, but it is highly susceptible to data loss under correlated failures, and its scalability is limited due to communication overhead. In contrast, Hydra achieves resilient disaggregated cluster memory with single-digit 휇s page access latency while maintaining the benefits of erasure codes.

9 CONCLUSION

With the ever-increasing memory demand of in-memory DBMSs, the memory TCO of datacenters is rapidly increasing. Memory disaggregation promises to alleviate this, but its deployment is hindered by larger failure domains and susceptibility to stragglers [13, 20, 45], leading to tradeoffs between performance, efficiency, and availability. Hydra leverages online erasure coding to achieve single-digit 휇s latency under failures, while judiciously placing erasure-coded data to improve availability and load balancing. On the one hand, it matches the resilience of replication with 1.6× lower memory overhead. On the other hand, it significantly improves the latency and throughput of both micro-benchmarks and real-world online transaction and analytics processing applications over SSD backup-based resilience, with only 1.25× higher memory overhead. Furthermore, CodingSets allows Hydra to reduce the probability of data loss under simultaneous failures by about 10×. Overall, Hydra makes resilient disaggregated memory practical for DBMSs.
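The intuition behind constrained placement can be sanity-checked with a toy Monte-Carlo experiment. The sketch below (ours) compares fully random placement of (푘+푟)-sized coding groups against a simplified copyset-style rule [24] that restricts slabs to a few preconstructed disjoint groups; the cluster size, failure count, and trial counts are arbitrary, so the output is qualitative only and is not the paper's 10× figure.

import random

N, K, R = 100, 8, 2                       # servers; (k, r) erasure code
GROUP, SLABS, FAILS, TRIALS = K + R, 1000, 5, 500

def random_groups():
    # Every slab picks an arbitrary (k + r)-server subset.
    return [random.sample(range(N), GROUP) for _ in range(SLABS)]

def copyset_groups():
    # Slabs may only use a handful of fixed, disjoint groups.
    servers = random.sample(range(N), N)
    sets = [servers[i:i + GROUP] for i in range(0, N, GROUP)]
    return [random.choice(sets) for _ in range(SLABS)]

def loss_probability(make_groups) -> float:
    losses = 0
    for _ in range(TRIALS):
        failed = set(random.sample(range(N), FAILS))
        # A slab is lost if more than r servers of its group failed.
        if any(len(failed & set(g)) > R for g in make_groups()):
            losses += 1
    return losses / TRIALS

print("random placement :", loss_probability(random_groups))
print("copyset placement:", loss_probability(copyset_groups))

With these parameters, random placement loses some slab in nearly every trial, while the copyset-style rule does so in only a few percent of trials, illustrating why limiting the number of distinct coding groups sharply reduces the chance that any one group exceeds its parity budget.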

REFERENCES
[1] [n.d.]. Accelio based network block device. https://github.com/accelio/NBDX.
[2] [n.d.]. Amazon EC2 Pricing. https://aws.amazon.com/ec2/pricing. Accessed: 2019-08-05.
[3] [n.d.]. Fio - Flexible I/O Tester. https://github.com/axboe/fio.
[4] [n.d.]. Google Compute Engine Pricing. https://cloud.google.com/compute/pricing. Accessed: 2019-08-05.
[5] [n.d.]. Intel Intelligent Storage Acceleration Library (Intel ISA-L). https://software.intel.com/en-us/storage/ISA-L.
[6] [n.d.]. Mellanox InfiniBand Adapter Cards. https://www.mellanoxstore.com/categories/adapters/infiniband-and-vpi-adapter-cards.html.
[7] [n.d.]. Mellanox Switches. https://www.mellanoxstore.com/categories/switches/infiniband-and-vpi-switch-systems.html.
[8] [n.d.]. Memcached - A distributed memory object caching system. http://memcached.org.
[9] [n.d.]. The Versatile SMP (vSMP) Architecture. http://www.scalemp.com/technology/versatile-smp-vsmp-architecture/.
[10] [n.d.]. TPC Benchmark C (TPC-C). http://www.tpc.org/tpcc/.
[11] [n.d.]. VoltDB. https://github.com/VoltDB/voltdb.
[12] Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Jayneel Gandhi, Stanko Novaković, Arun Ramanathan, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and Michael Wei. 2018. Remote regions: a simple abstraction for remote memory. In USENIX ATC.
[13] Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Jayneel Gandhi, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and Michael Wei. 2017. Remote memory in the age of fast networks. In SoCC.
[14] M. K. Aguilera, R. Janakiraman, and L. Xu. 2005. Using erasure codes efficiently for storage in a distributed system. In DSN.
[15] Eric A. Anderson and Jeanna M. Neefe. 1994. An Exploration of Network RAM. Technical Report UCB/CSD-98-1000. EECS Department, University of California, Berkeley.
[16] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. 2012. Workload Analysis of a Large-Scale Key-Value Store. In SIGMETRICS.
[17] Claude Barthels, Simon Loesing, Gustavo Alonso, and Donald Kossmann. 2015. Rack-Scale In-Memory Join Processing Using RDMA. In SIGMOD.
[18] Qingchao Cai, Wentian Guo, Hao Zhang, Divyakant Agrawal, Gang Chen, Beng Chin Ooi, Kian-Lee Tan, Yong Meng Teo, and Sheng Wang. 2018. Efficient Distributed Memory Management with RDMA and Caching. (2018).
[19] Irina Calciu, Ivan Puddu, Aasheesh Kolli, Andreas Nowatzyk, Jayneel Gandhi, Onur Mutlu, and Pratap Subrahmanyam. 2019. Project PBerry: FPGA Acceleration for Remote Memory. In HotOS.
[20] Amanda Carbonari and Ivan Beschasnikh. 2017. Tolerating Faults in Disaggregated Datacenters. In HotNets.
[21] Jeremy C. W. Chan, Qian Ding, Patrick P. C. Lee, and Helen H. W. Chan. 2014. Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage. In FAST.
[22] Haogang Chen, Yingwei Luo, Xiaolin Wang, Binbin Zhang, Yifeng Sun, and Zhenlin Wang. 2008. A transparent remote paging model for virtual machines. In International Workshop on Virtualization Technology.
[23] Asaf Cidon, Robert Escriva, Sachin Katti, Mendel Rosenblum, and Emin Gun Sirer. 2015. Tiered Replication: A Cost-effective Alternative to Full Cluster Geo-replication. In USENIX ATC.
[24] Asaf Cidon, Stephen M Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout, and Mendel Rosenblum. 2013. Copysets: Reducing the Frequency of Data Loss in Cloud Storage. In USENIX ATC.
[25] Jeffrey Dean and Luiz André Barroso. 2013. The tail at scale. Commun. ACM 56, 2 (2013), 74–80.
[26] Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. 2014. FaRM: Fast Remote Memory. In NSDI.
[27] Sandhya Dwarkadas, Nikolaos Hardavellas, Leonidas Kontothanassis, Rishiyur Nikhil, and Robert Stets. 1999. Cashmere-VLM: Remote memory paging for software distributed shared memory. In IPPS/SPDP.
[28] Michael J Feeley, William E Morgan, EP Pighin, Anna R Karlin, Henry M Levy, and Chandramohan A Thekkath. 1995. Implementing global memory management in a workstation cluster. In SOSP.
[29] Edward W. Felten and John Zahorjan. 1991. Issues in the implementation of a remote memory paging system. Technical Report 91-03-09. University of Washington.
[30] Feng Li, Sudipto Das, Manoj Syamala, and Vivek R. Narasayya. 2016. Accelerating Relational Databases by Leveraging Remote Memory and RDMA. In SIGMOD.
[31] Michail D. Flouris and Evangelos P. Markatos. 1999. The network RamDisk: Using remote memory on heterogeneous NOWs. Journal of Cluster Computing 2, 4 (1999), 281–293.
[32] Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. 2010. Availability in Globally Distributed Storage Systems. In OSDI.
[33] Peter X Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network requirements for resource disaggregation. In OSDI.
[34] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI.
[35] Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J Franklin, and Ion Stoica. 2014. GraphX: Graph processing in a distributed dataflow framework. In OSDI.
[36] Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. 2017. Efficient Memory Disaggregation with Infiniswap. In NSDI.
[37] Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin. 2012. Erasure Coding in Windows Azure Storage. In USENIX ATC.
[38] Anuj Kalia, Michael Kaminsky, and David G Andersen. 2014. Using RDMA efficiently for key-value services. In SIGCOMM.
[39] Anuj Kalia, Michael Kaminsky, and David G Andersen. 2014. Using RDMA efficiently for key-value services. In SIGCOMM.
[40] Srikanth Kandula, Sudipta Sengupta, Albert Greenberg, Parveen Patel, and Ronnie Chaiken. 2009. The Nature of Datacenter Traffic: Measurements and Analysis. In IMC.
[41] Chinmay Kulkarni, Aniraj Kesavan, Tian Zhang, Robert Ricci, and Ryan Stutsman. 2017. Rocksteady: Fast Migration for Low-latency In-memory Storage. In SOSP.
[42] Yossi Kuperman, Joel Nider, Abel Gordon, and Dan Tsafrir. 2016. Paravirtual Remote I/O. In ASPLOS.
[43] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. In WWW.
[44] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J Rossbach, and Emmett Witchel. 2016. Coordinated and efficient huge page management with Ingens. In OSDI.
[45] Andres Lagar-Cavilla, Junwhan Ahn, Suleiman Souhlal, Neha Agarwal, Radoslaw Burny, Shakeel Butt, Jichuan Chang, Ashwin Chaugule, Nan Deng, Junaid Shahid, Greg Thelen, Kamil Adam Yurtsever, Yu Zhao, and Parthasarathy Ranganathan. 2019. Software-Defined Far Memory in Warehouse-Scale Computers. In ASPLOS.
[46] Shuang Liang, Ranjit Noronha, and Dhabaleswar K Panda. 2005. Swapping to remote memory over InfiniBand: An approach using a high performance network block device. In Cluster Computing.
[47] Hyeontaek Lim, Dongsu Han, David G Andersen, and Michael Kaminsky. 2014. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage. In NSDI.
[48] Kevin Lim, Jichuan Chang, Trevor Mudge, Parthasarathy Ranganathan, Steven K Reinhardt, and Thomas F Wenisch. 2009. Disaggregated memory for expansion and sharing in blade servers. In ISCA.
[49] Kevin Lim, Yoshio Turner, Jose Renato Santos, Alvin AuYoung, Jichuan Chang, Parthasarathy Ranganathan, and Thomas F Wenisch. 2012. System-level implications of disaggregated memory. In HPCA.
[50] Evangelos P Markatos and George Dramitinos. 1996. Implementation of a Reliable Remote Memory Pager. In USENIX ATC.
[51] Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store. In USENIX ATC.
[52] Michael Mitzenmacher, Andrea W. Richa, and Ramesh Sitaraman. 2001. The Power of Two Random Choices: A Survey of Techniques and Results. Handbook of Randomized Computing (2001), 255–312. Issue 1.
[53] Jeffrey C Mogul and John Wilkes. 2019. Nines are Not Enough: Meaningful Metrics for Clouds. In HotOS.
[54] Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, and Sanjeev Kumar. 2014. f4: Facebook's Warm BLOB Storage System. In OSDI.
[55] Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. 2015. Latency-tolerant software distributed shared memory. In USENIX ATC.
[56] Tia Newhall, Sean Finney, Kuzman Ganchev, and Michael Spiegel. 2003. Nswap: A network swapping module for Linux clusters. In Euro-Par.
[57] Stanko Novakovic, Alexandros Daglis, Edouard Bugnion, Babak Falsafi, and Boris Grot. 2014. Scale-out NUMA. In ASPLOS.
[58] Diego Ongaro, Stephen M Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. 2011. Fast Crash Recovery in RAMCloud. In SOSP.
[59] John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. 2010. The Case for RAMClouds: Scalable High Performance Storage Entirely in DRAM. SIGOPS OSR 43, 4 (2010).
[60] Russell Power and Jinyang Li. 2010. Building fast, distributed programs with partitioned tables. In OSDI.
[61] K V Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, and Kannan Ramchandran. 2016. EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding. In OSDI.
[62] I. Reed and G. Solomon. 1960. Polynomial Codes Over Certain Finite Fields. J. Soc. Indust. Appl. Math. 8, 2 (1960), 300–304.
[63] Ahmad Samih, Ren Wang, Christian Maciocco, Tsung-Yuan Charlie Tai, Ronghui Duan, Jiangang Duan, and Yan Solihin. 2012. Evaluating dynamics and bottlenecks of memory collaboration in cluster systems. In CCGrid.
[64] Maheswaran Sathiamoorthy, Megasthenis Asteris, Dimitris S. Papailiopoulos, Alexandros G. Dimakis, Ramkumar Vadali, Scott Chen, and Dhruba Borthakur. 2013. XORing Elephants: Novel Erasure Codes for Big Data. In VLDB.
[65] Alex Shamis, Matthew Renzelmann, Stanko Novakovic, Georgios Chatzopoulos, Aleksandar Dragojević, Dushyanth Narayanan, and Miguel Castro. 2019. Fast General Distributed Transactions with Opacity. In SIGMOD.
[66] Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang. 2018. LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation. In OSDI.
[67] Amy Tai, Andrew Kryczka, Shobhit O. Kanaujia, Kyle Jamieson, Michael J. Freedman, and Asaf Cidon. 2019. Who's Afraid of Uncorrectable Bit Errors? Online Recovery of Flash Errors with Distributed Redundancy. In USENIX ATC.
[68] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. 2015. Large-scale cluster management at Google with Borg. In EuroSys.
[69] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. 2006. Ceph: A Scalable, High-performance Distributed File System. In OSDI.
[70] Matt M. T. Yiu, Helen H. W. Chan, and Patrick P. C. Lee. 2017. Erasure Coding for Small Objects in In-memory KV Storage. In SYSTOR.
[71] Dong Young Yoon, Mosharaf Chowdhury, and Barzan Mozafari. 2018. Distributed lock management with RDMA: decentralization without starvation. In SIGMOD.
[72] Erfan Zamanian, Carsten Binnig, Tim Harris, and Tim Kraska. 2017. The End of a Myth: Distributed Transactions Can Scale. In VLDB.
[73] Heng Zhang, Mingkai Dong, and Haibo Chen. 2016. Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication. In FAST.
[74] Qizhen Zhang, Yifan Cai, Xinyi Chen, Sebastian Angel, Ang Chen, Vincent Liu, and Boon Thau Loo. 2020. Understanding the Effect of Data Center Resource Disaggregation on Production DBMSs. In VLDB.
[75] Qi Zhang, Mohamed Faten Zhani, Shuo Zhang, Quanyan Zhu, Raouf Boutaba, and Joseph L Hellerstein. 2012. Dynamic energy-aware capacity provisioning for cloud computing environments. In ICAC.
[76] Yiwen Zhang, Juncheng Gu, Youngmoon Lee, Mosharaf Chowdhury, and Kang G. Shin. 2017. Performance Isolation Anomalies in RDMA. In KBNets.
[77] Zhe Zhang, Amey Deshpande, Xiaosong Ma, Eno Thereska, and Dushyanth Narayanan. 2010. Does erasure coding have a role to play in my data center? Technical Report MSR-TR-2010-52. Microsoft Research.