Mitigating the Performance-Efficiency Tradeoff in Resilient Memory Disaggregation

Youngmoon Lee∗, Hasan Al Maruf∗, Mosharaf Chowdhury∗, Asaf Cidon§, Kang G. Shin∗
∗University of Michigan  §Columbia University

ABSTRACT

We present the design and implementation of a low-latency, low-overhead, and highly available resilient disaggregated cluster memory. Our proposed framework can access erasure-coded remote memory within a single-digit μs read/write latency, significantly improving the performance-efficiency tradeoff over the state-of-the-art – it performs similarly to in-memory replication with 1.6× lower memory overhead. We also propose a novel coding group placement algorithm for erasure-coded data that provides load balancing while reducing the probability of data loss under correlated failures by an order of magnitude.

1 INTRODUCTION

To address the increasing memory pressure in datacenters, two broad approaches toward cluster-wide remote memory have emerged in recent years. The first requires redesigning applications from scratch using the key-value interface and/or RDMA [17, 26, 38, 59, 61, 72]. The explicit nature of such solutions makes them efficient and exposes cluster memory availability concerns to the applications. However, rewriting applications may often not be possible.

Memory disaggregation, in contrast, exposes remote memory throughout the cluster in an implicit manner via well-known abstractions such as the distributed virtual file system (VFS) and the distributed virtual memory manager (VMM) [12, 33, 36, 45, 66, 74]. With the advent of RDMA, recent solutions are now close to meeting the single-digit μs latency required to support acceptable application-level performance [33, 45]. Indeed, unmodified in-memory DBMSs can often achieve significant performance benefits using disaggregated memory by transparently replacing slower disk accesses with orders-of-magnitude faster remote memory accesses.

However, realizing memory disaggregation for heterogeneous workloads running in a large-scale cluster faces considerable challenges [13, 20] stemming from two root causes:

(1) Expanded failure domains: Because applications rely on memory across multiple machines in a disaggregated cluster, they become susceptible to a wide variety of failure scenarios. Potential failures include independent and correlated failures of remote machines, evictions from and corruptions of remote memory, and network partitions.

(2) Tail at scale: Applications also suffer from stragglers or late-arriving remote responses. Stragglers can arise from many sources, including latency variabilities in a large network due to congestion and background traffic [25].

While one leads to catastrophic failures and the other manifests as service-level objective (SLO) violations, both are unacceptable in production [45, 53].

State-of-the-art solutions take three primary approaches: (i) local disk backup [36, 66], (ii) remote in-memory replication [31, 50], and (iii) remote in-memory erasure coding [61, 64, 70, 73] and compression [45]. Unfortunately, they suffer from some combinations of the following problems.

High latency: The first approach has no additional memory overhead, but its access latency is intolerably high in the presence of any of the aforementioned failure scenarios. The systems that take the third approach do not meet the single-digit μs latency requirement of disaggregated cluster memory even when paired with RDMA (Figure 1).

High cost: While the second approach has low latency, it doubles memory consumption as well as network bandwidth requirements. The first and second approaches represent the two extreme points in the performance-vs-efficiency tradeoff space for resilient cluster memory (Figure 1).

Low availability: All three approaches lose availability to low-latency memory when even a very small number of servers become unavailable. With the first approach, if a single server fails, its data must be reconstituted from disk, which is a slow process. With the second and third approaches, when even a small number of servers (e.g., three) fail simultaneously, some users lose access to data. This is because replication and erasure coding assign replicas and coding groups to random servers, and random data placement is susceptible to data loss when a small number of servers fail at the same time [23, 24] (Figure 2).

In this paper, we consider how to mitigate these three problems and present Hydra, a low-latency, low-overhead, and highly available resilience mechanism for disaggregated memory. While erasure codes are known for reducing storage overhead and for better load balancing, we demonstrate how to achieve erasure-coded disaggregated cluster memory with single-digit μs latency that is also resilient to simultaneous server failures.

We explore the challenges and tradeoffs for resilient memory disaggregation without sacrificing application-level performance or incurring high overhead in the presence of correlated failures (§2). We also explore the trade-off between load balancing and high availability in the presence of simultaneous server failures. Our solution, Hydra, is a configurable resilience mechanism that applies online erasure coding to individual remote memory pages while maintaining high availability (§3). Hydra's carefully designed data path enables it to access remote memory pages within single-digit μs median and tail latency (§4). Furthermore, we develop CodingSets, a novel coding group placement algorithm for erasure codes that provides load balancing while reducing the probability of data loss under correlated failures (§5).

We implement Hydra on Linux kernel 4.11.0 and integrate it with the two major logical memory disaggregation approaches widely embraced today: the disaggregated virtual memory manager (VMM), used by Infiniswap [36] and LegoOS [66], and the disaggregated virtual file system (VFS), used by Remote Regions [12] (§6).

Our evaluation using multiple memory-intensive applications with production workloads shows that Hydra achieves the best of both worlds (§7). On the one hand, it closely matches the performance of replication-based resilience with 1.6× lower memory overhead, with or without the presence of failures. On the other hand, it improves the latency and throughput of the benchmark applications by up to 64.78× and 20.61×, respectively, over SSD backup-based resilience with only 1.25× higher memory overhead. Hydra also provides performance benefits for real-world application and workload combinations. It improves the throughput of VoltDB over the state-of-the-art by 2.14× during online transaction processing. Similarly, during online analytics, Hydra improves the completion time of Apache Spark/GraphX by 4.35× over its counterparts. Using CodingSets, Hydra reduces the probability of data loss under simultaneous server failures by about 10×.

Contributions. We make the following contributions:
• Hydra is the first in-memory erasure coding scheme that achieves the single-digit μs median and tail latency needed for in-memory transaction processing and analytics DBMSs.
• Novel analysis of the load balancing and availability trade-off for distributed erasure codes.
• CodingSets is a new data placement scheme that balances availability and load balancing, while reducing the probability of data loss by an order of magnitude under correlated failures.

Figure 1: Performance-vs-efficiency tradeoff in the resilient cluster memory design space. Here, the Y-axis is in log scale.

Figure 2: Availability-vs-efficiency tradeoff considering 1% simultaneous server failures in a 1000-machine cluster.

2 BACKGROUND AND MOTIVATION

2.1 Memory Disaggregation

Memory disaggregation exposes memory available in remote machines as a pool of memory shared by many machines. It is often implemented logically by leveraging unused/stranded memory in remote machines via well-known abstractions, such as the file abstraction [12], remote memory paging [33, 36, 46], and virtual memory management for a distributed OS [66]. In the past, specialized memory appliances for physical memory disaggregation were proposed as well [48, 49].

All existing memory disaggregation solutions use the 4KB page granularity. While some applications use huge pages for performance enhancement [44], the Linux kernel still performs paging at the basic 4KB level by splitting individual huge pages, because huge pages can result in high amplification for dirty data tracking [19]. In such systems, as an application's working set spans multiple machines, the probability of being affected by a failure event increases as well. Existing memory disaggregation systems use disk backup [36, 66] and in-memory replication [31, 50] to provide availability in the event of failures.

2.2 Failures in Disaggregated Memory

The probability of failure or temporary unavailability is higher in a large memory-disaggregated cluster, since memory is being accessed remotely. To illustrate possible performance penalties in the presence of such unpredictable events, we consider a resilience solution from the existing literature [36], where each page is asynchronously backed up to a local SSD. We run the transaction processing benchmark TPC-C [10] on an in-memory database system, VoltDB [11]. We set VoltDB's available memory to 50% of its peak memory to force remote paging for up to 50% of its working set.

1. Remote Failures and Evictions. Machine failures are the norm in large-scale clusters [75]. Without redundancy, applications relying on remote memory may fail when a remote machine fails or remote memory pages are evicted. Because disk operations are significantly slower than the latency requirement of memory disaggregation, disk-based fault-tolerance is also far from practical. In the presence of a remote failure, VoltDB experiences an almost 90% throughput loss (Figure 3a); throughput recovery takes a long time after the failure happens.

2. Background Network Load. Network load throughout a large cluster can experience significant fluctuations [25, 40], which can inflate RDMA latency and cause application-level stragglers, leading to unpredictable performance [76]. In the presence of an induced bandwidth-intensive background load, VoltDB throughput drops by about 50% (Figure 3b).

Figure 3: TPC-C throughput over time on VoltDB when 50% of the working set fits in memory: (a) remote failure, (b) background network load, (c) request burst.

3. Request Bursts. Applications themselves can have bursty memory access patterns. Existing solutions maintain an in-memory buffer to absorb temporary bursts [12, 36, 59]. However, because the buffer ties remote access latency to disk latency when it is full, it can become the bottleneck when a workload experiences a prolonged burst. While a page read from remote memory is still fast, backup page writes to the local disk become the bottleneck after the 100th second in Figure 3c. As a result, throughput drops by about 60%.

2.3 Performance vs. Efficiency Tradeoff for Resilience

In all of these scenarios, the obvious alternative – in-memory 2× or 3× replication [31, 50] – is effective in mitigating a small-scale failure, such as the loss of a single server (Figure 3). When an in-memory copy becomes unavailable, we can switch to an alternative. Unfortunately, replication incurs high memory overhead in proportion to the number of replicas, which defeats the purpose of memory disaggregation. Hedging requests to avoid stragglers [25] in a replicated system doubles its bandwidth requirement as well. This leads to an impasse: one has to either settle for high latency in the presence of a failure or incur high memory overhead. Figure 1 depicts this performance-vs-efficiency tradeoff, measured in terms of remote memory read latency under failures and the memory usage overhead needed to provide resilience.

Beyond the two extremes in the tradeoff space, there are two primary alternatives to achieve high resilience with low overhead. The first is replicating pages to remote memory after compressing them (e.g., using zswap) [45], which improves the tradeoff in both dimensions. However, its latency can be more than 10μs when data is in remote memory. Especially during resource scarcity, a prolonged burst of accesses to remote compressed pages can lead to orders-of-magnitude higher latency due to the demand spike in both CPU and local DRAM consumption for decompression. Additionally, this approach suffers from issues similar to replication, such as latency inflation due to stragglers.

The alternative is erasure coding, which has recently made its way from disk-based storage to in-memory cluster caching to reduce storage overhead and improve load balancing [14, 21, 61, 69, 70, 73]. Typically, an object is divided into k data splits and encoded to create r equal-sized parity splits (k > r), which are then distributed across (k + r) failure domains.¹ Existing erasure-coded memory solutions deal with large objects (e.g., larger than 1 MB [61]), where the hundreds-of-μs latency of the TCP/IP stack can be ignored. Simply replacing TCP with RDMA is not enough either. For example, EC-Cache with RDMA (shown in Figure 1) provides a lower storage overhead than compression but with a latency around 20μs. Last but not least, all of these approaches experience high unavailability in the presence of correlated failures [24].

¹A failure domain is a set of remote machines that are likely to experience a correlated failure, e.g., servers in a rack sharing a single power source.

2.4 Challenges in Erasure-Coded Memory

High Latency. Individually erasure coding 4 KB pages that are already small leads to even smaller data chunks (4/k KB), which contributes to the higher latency of erasure-coded remote memory over RDMA due to the following primary reasons:

(1) Non-negligible coding overhead: When using erasure codes with on-disk data or over slower networks that have hundreds-of-μs latency, the 0.7μs encoding and 1.5μs decoding overheads can be ignored. However, they become non-negligible when dealing with DRAM and RDMA.

(2) Stragglers and errors: Because erasure codes require k splits before the original data can be reconstructed, any straggler can slow down a remote read. To detect and correct an error, erasure codes require additional splits; an extra read adds another round-trip, doubling the overall read latency.

(3) Interruption overhead: Splitting data also increases the total number of RDMA operations for each request. Any context switch in between can further add to the latency.

(4) Data copy overhead: In a latency-sensitive system, additional data movement can limit the lowest possible latency. During erasure coding, additional data copies into different buffers for data and parity splits can quickly add up.

Availability Under Simultaneous Server Failures. While existing erasure coding schemes can tolerate a small-scale failure without interruptions, when even a relatively modest number of servers fail or become unavailable at the same time (e.g., due to a network partition or a correlated server failure event), they are highly susceptible to losing availability to some of the data. This is because existing erasure coding schemes generate coding groups on random sets of servers [61]. In a coding scheme with k data and r parity splits, an individual coding group will fail to decode the data if r + 1 of its servers fail simultaneously. Now consider a large cluster of servers that experiences r + 1 failures. The probability that those r + 1 servers belong to one specific coding group is low. However, when coding groups are generated randomly (i.e., each of them comprises a random set of (k + r) servers) and there are a large number of coding groups per server, the probability that those r + 1 servers affect any coding group in the cluster is much higher. Therefore, state-of-the-art erasure coding schemes, such as EC-Cache, experience a very high probability of unavailability even when a very small number of servers fail simultaneously.
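To make these overheads concrete, the following back-of-the-envelope sketch (ours, in Python for illustration; not Hydra's code) computes the split size and the resulting memory and bandwidth overheads for a given (k, r, Δ) configuration. With the (8, 2, 1) defaults used later in the paper, it reproduces the 1.25× memory and 1.125× read-bandwidth overheads quoted in §7.

```python
# Back-of-the-envelope (k, r, Delta) arithmetic for erasure-coded pages.
# The helper names are ours; the formulas come from the text above.
PAGE_SIZE_KB = 4

def split_size_kb(k):
    # A 4 KB page is divided into k data splits of 4/k KB each.
    return PAGE_SIZE_KB / k

def memory_overhead(k, r):
    # Storing k data + r parity splits costs (k + r)/k of the page size.
    return (k + r) / k

def read_bandwidth_overhead(k, delta):
    # Reading (k + Delta) splits instead of k costs (k + Delta)/k bandwidth.
    return (k + delta) / k

k, r, delta = 8, 2, 1  # Hydra's defaults (Section 7)
print(split_size_kb(k))                   # 0.5 KB splits
print(memory_overhead(k, r))              # 1.25x memory/bandwidth overhead
print(read_bandwidth_overhead(k, delta))  # 1.125x read-bandwidth overhead
```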

Figure 4: Hydra consists of a Resilience Manager and a Resource Monitor – both can be present in a machine. Hydra can provide resilience for different remote memory systems.

Figure 5: Hydra's address space is divided into fixed-size address ranges, each of which spans (k + r) memory slabs in remote machines; i.e., k for data and r for parity (k=2 and r=1 in this figure).

3 HYDRA ARCHITECTURE

Hydra is an erasure-coded resilience mechanism for existing memory disaggregation techniques that provides a better performance-efficiency tradeoff under remote failures while ensuring high availability under simultaneous failures. It has two main components (Figure 4): (i) the Resilience Manager coordinates erasure-coded resilience operations during remote reads/writes; (ii) the Resource Monitor handles memory management in a remote machine. Both can be present in every machine and work together without central coordination.

3.1 Resilience Manager

The Hydra Resilience Manager enables configurable resilience when applications use remote memory through different state-of-the-art memory disaggregation solutions – e.g., via a virtual file system (VFS) [12] or via paging through a virtual memory manager (VMM) [36, 66]. It transparently handles all aspects of RDMA communication and erasure coding.

Following the typical (k, r) erasure coding construction, the Resilience Manager divides its remote address space into fixed-size address ranges. Each address range resides in (k + r) remote slabs: k slabs for page data and r slabs for parity (Figure 5). Each of the (k + r) slabs of an address range is distributed across (k + r) independent failure domains using our CodingSets placement algorithm (§5). Page accesses are directed to the designated (k + r) machines according to the address–slab mapping. Although remote memory access happens at the page level, the Resilience Manager coordinates with remote Resource Monitors to manage coarse-grained memory slabs to reduce metadata overhead and connection management complexity.

3.2 Resource Monitor

The Hydra Resource Monitor manages a machine's local memory and exposes it to remote Resilience Managers in terms of fixed-size (SlabSize) memory slabs. Different slabs can belong to different machines' Resilience Managers. During each control period (ControlPeriod), the Resource Monitor tracks the available memory in its local machine and proactively allocates (reclaims) slabs to (from) remote mapping when memory usage is low (high). It is also responsible for slab regeneration during remote machine failures or slab corruptions.
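The following is a minimal sketch (Python) of the address-range-to-slab mapping described in §3.1. The class layout, the lazy mapping, and the per-slab offset computation are our assumptions for illustration; the text only specifies that each fixed-size address range maps onto (k + r) slabs chosen by the placement algorithm.

```python
RANGE_SIZE = 1 << 30   # assumed: one range spans 1 GB (the SlabSize of Section 7)
PAGE_SIZE = 4096

class RemoteAddressSpace:
    def __init__(self, k, r, place):
        self.k, self.r = k, r
        # place(range_id) -> (k + r) machine ids, e.g., chosen by CodingSets
        # (Section 5); the first k hold data splits, the last r hold parity.
        self.place = place
        self.slab_map = {}  # range_id -> list of machine ids

    def slabs_for(self, addr):
        range_id = addr // RANGE_SIZE
        if range_id not in self.slab_map:
            machines = self.place(range_id)
            assert len(machines) == self.k + self.r
            self.slab_map[range_id] = machines
        return self.slab_map[range_id]

    def split_offset(self, addr):
        # The i-th page of a range occupies the i-th (4/k KB) slot in each slab.
        page_index = (addr % RANGE_SIZE) // PAGE_SIZE
        return page_index * (PAGE_SIZE // self.k)

# Toy placement mirroring Figure 5's (k=2, r=1) example.
aspace = RemoteAddressSpace(2, 1, lambda rid: [f"m{(rid + i) % 3}" for i in range(3)])
print(aspace.slabs_for(0))        # the (k + r) slabs backing the first range
print(aspace.split_offset(4096))  # second page -> offset 2048 with k = 2
```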
4 RESILIENT DATA PATH

Hydra encodes and decodes each 4 KB page independently instead of batch-coding across multiple pages. This decreases latency by avoiding "batch waiting" time. Moreover, the Resilience Manager does not have to read unnecessary pages within the same batch during remote reads, which reduces bandwidth overhead. Distributing remote I/O across many remote machines increases I/O parallelism too. These benefits, however, come with a latency penalty. In this section, we discuss Hydra's data path design and how it addresses the resilience challenges mentioned in Section 2.4.

4.1 Hydra Remote Memory Data Path

To minimize erasure coding-associated latency overheads, we incorporate four design principles in Hydra's data path to remote memory in the Resilience Manager. A breakdown of the contribution of each of these techniques to Hydra's data path is discussed in our evaluation (Figure 10).

4.1.1 Asynchronously Encoded Write. During a remote write, the Hydra Resilience Manager applies erasure coding within each individual page by dividing it into k splits (the size of each split is 4/k KB for a 4 KB page) and encoding these splits using Reed-Solomon (RS) codes [62] to generate r parity splits. It then writes these (k + r) splits to (k + r) different slabs that have already been mapped to unique remote machines. Each Resilience Manager can have a different choice of k and r.

To hide encoding latency, the Resilience Manager sends the data splits first and responds back; then it encodes and sends the parity splits asynchronously. Any k successful writes out of the (k + r) allow the page to be recovered, but all (k + r) must be written to guarantee resilience in the event of r failures. Decoupling the two hides the encoding latency and the subsequent write latency for the parities without affecting the resilience guarantee. A write is considered complete after all (k + r) splits have been written. Figure 6a depicts the timeline of a page write.
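This write path can be sketched as follows (Python, for illustration only). We assume r = 1 so that the single parity split is a plain XOR of the data splits – the simplest Reed-Solomon instance – whereas Hydra itself computes r RS parities with ISA-L; send_split() and ack() are placeholder callbacks standing in for a one-sided RDMA WRITE and the application acknowledgment, not Hydra's API.

```python
import threading

PAGE_SIZE = 4096

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def write_page(page, k, send_split, ack):
    assert len(page) == PAGE_SIZE and PAGE_SIZE % k == 0
    size = PAGE_SIZE // k
    splits = [page[i * size:(i + 1) * size] for i in range(k)]

    # 1. Send the k data splits and acknowledge the application right away.
    for i, s in enumerate(splits):
        send_split(i, s)
    ack()

    # 2. Encode and send the parity off the critical path; the write is
    #    fully resilient only once all (k + r) splits have been written.
    def parity():
        p = splits[0]
        for s in splits[1:]:
            p = xor_bytes(p, s)
        send_split(k, p)

    threading.Thread(target=parity, daemon=True).start()
```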

Figure 6: Hydra remote write/read: (a) Hydra first writes data splits, then encodes/writes parities to hide encoding latency; (b) Hydra reads from (k + Δ) slabs and completes as soon as the first k splits arrive, avoiding stragglers.

Figure 7: Hydra performs in-place coding with an extra buffer of r splits to reduce the data-copy latency.

4.1.2 Late-Binding Resilient Read. Because Hydra uses RS codes, any k out of the (k + r) splits suffice to reconstruct a page. However, to be resilient in the presence of Δ failures, during a remote read the Hydra Resilience Manager reads from (k + Δ) randomly chosen splits in parallel. A page can be decoded as soon as any k of the (k + Δ) splits arrive. The additional Δ reads mitigate the impact of stragglers on tail latency as well. Figure 6b provides an example of a read operation with k = 2 and Δ = 1, where one of the data slabs (Data Slab 2) is a straggler. Δ = 1 is often enough in practice.
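A sketch of the late-binding read (Python; read_split() and rs_decode() are placeholder callbacks, not Hydra's API): issue (k + Δ) split reads in parallel and return as soon as any k of them arrive, ignoring stragglers.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def read_page(slabs, k, delta, read_split, rs_decode):
    chosen = random.sample(range(len(slabs)), k + delta)
    pool = ThreadPoolExecutor(max_workers=k + delta)
    futures = {pool.submit(read_split, slabs[i], i): i for i in chosen}
    arrived = {}
    try:
        for fut in as_completed(futures):
            arrived[futures[fut]] = fut.result()
            if len(arrived) == k:     # any k splits suffice with RS codes
                return rs_decode(arrived)
        raise IOError("fewer than k splits arrived")
    finally:
        pool.shutdown(wait=False)     # do not block on stragglers
```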
4.1.3 Latency Optimizations. In addition to asynchronous coding during writes and late binding during reads, Hydra employs two additional optimizations in both cases.

Run-to-Completion. As the Resilience Manager divides a 4 KB page into k smaller pieces, RDMA messages become smaller. In fact, their network latency decreases to the point that run-to-completion becomes more beneficial than a context switch. Hence, to avoid interruption-related overheads, the remote I/O request thread waits until the RDMA operations are done.

In-Place Coding. To reduce the number of data copies, the Hydra Resilience Manager uses in-place coding with an extra buffer of r splits. During a write, the data splits are always kept in-page while the encoded r parities are put into the buffer (Figure 7a). Likewise, during a read, the data splits arrive at the page address, and the parity splits find their way into the buffer (Figure 7b).

Because a read can complete as soon as any k valid splits arrive, corrupted/straggler data split(s) can arrive late and overwrite valid page data. To address this, as soon as the Hydra Resilience Manager detects the arrival of k valid splits, it deregisters the relevant RDMA memory regions. It then performs decoding and directly places the decoded data in the page destination. Because the memory region has already been deregistered, any late data split cannot access the page anymore. During all remote I/O, requests are forwarded directly to RDMA dispatch queues without additional copying.

4.2 Handling Failures

Late binding during remote reads automatically mitigates stragglers due to a slow network. We discuss Hydra's mechanisms for handling failures, evictions, and corruptions below.

Remote Failure and Eviction. Hydra uses reliable connections (RC) for all RDMA communication. Hence, we consider unreachability of remote machines – due to machine failures/reboots or network partitions – as the primary cause of failure. When a remote machine becomes unreachable, the Resilience Manager is notified by the RDMA connection manager. Upon disconnection, it first processes all in-flight requests in order. For ongoing I/O operations, it resends the I/O request to other available machines. Since RDMA guarantees strict ordering, in the read-after-write case, read requests arrive at the same RDMA dispatch queue after write requests; hence, read requests will not be served with stale data. Finally, the Resilience Manager marks the failed slab, and future requests are directed only to the available ones.

Eviction handling is similar to that of a failure, except that the Resource Monitor sends an explicit message to the corresponding Resilience Manager.

Corruption. So far we have considered the unavailability of a correct split due to failures and stragglers. However, in the presence of remote memory corruption, the Resilience Manager needs (k + Δ) splits to detect Δ errors and (k + 2Δ + 1) splits to locate and correct them (§5.2). If the Resilience Manager detects an error, it requests Δ + 1 additional reads from the rest of the (k + r) machines. It also marks the machine(s) with corrupted splits as probably erroneous. If the error rate for a remote machine exceeds a user-defined threshold (ErrorCorrectionLimit), subsequent read requests involving that machine are initiated with (k + 2Δ + 1) split requests, since there is a high probability of reaching an erroneous machine; this avoids the wait for the Δ + 1 additional reads. The Resilience Manager continues this until the error rate of the involved machine drops below the ErrorCorrectionLimit. If this continues for long and/or the error rate of a machine goes beyond another threshold (SlabRegenerationLimit), the Resilience Manager initiates a slab regeneration request for that machine.
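The adaptive fan-out rule above fits in a few lines (the error-rate bookkeeping and the threshold value are our assumptions; the (k + Δ) versus (k + 2Δ + 1) rule is from the text):

```python
ERROR_CORRECTION_LIMIT = 0.01  # user-defined threshold; value assumed here

def read_fanout(k, delta, machines, error_rate):
    # If any involved machine is flagged as error-prone, start with enough
    # splits to locate and correct Delta corruptions in one shot, instead of
    # paying a second round-trip for Delta + 1 additional reads.
    if any(error_rate.get(m, 0.0) > ERROR_CORRECTION_LIMIT for m in machines):
        return k + 2 * delta + 1
    return k + delta
```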
Adaptive Slab Allocation/Eviction. The Hydra Resource Monitor allocates memory slabs for remote Resilience Managers and proactively frees/evicts them to avoid local performance impacts (Figure 8). It periodically monitors local memory usage and maintains a headroom to provide enough memory for local applications. When the amount of free memory shrinks below the headroom (Figure 8a), the Resource Monitor proactively frees/evicts slabs to ensure that local applications are unaffected. It uses the decentralized batch eviction algorithm [36], whereby (E + E′) candidate slabs are contacted to determine and evict the E least-frequently-accessed ones. When the amount of free memory grows above the headroom (Figure 8b), the Resource Monitor first attempts to make the local Resilience Manager reclaim its pages from remote memory and unmap the corresponding remote slabs. Furthermore, it proactively allocates new, unmapped slabs that can be readily mapped and used by remote Resilience Managers.

Figure 8: Resource Monitor proactively allocates memory for remote machines and frees local memory under pressure.

Background Slab Regeneration. The Resource Monitor also regenerates unavailable slabs – marked by the Resilience Manager – in the background. During regeneration, writes to the slab are disabled to prevent overwriting new pages with stale ones; reads can still be served without interruption. The Hydra Resilience Manager uses the placement algorithm to find a new regeneration slab in a remote Resource Monitor with lower memory usage. It then hands over the task of slab regeneration to that Resource Monitor. The selected Resource Monitor decodes the unavailable slab by directly reading k randomly-selected remaining valid slabs for that address region. Once regeneration completes, it contacts the Resilience Manager to mark the slab as available. Requests thereafter go to the regenerated slab.
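A sketch of the Resource Monitor's control loop under the parameters used in the evaluation (25% headroom, 1-second control period); the callbacks abstract the mechanisms described above and are our assumptions.

```python
import time

HEADROOM = 0.25        # free-memory headroom (Section 7's setting)
CONTROL_PERIOD = 1.0   # seconds

def resource_monitor_loop(free_fraction, batch_evict, reclaim_and_unmap,
                          preallocate_unmapped):
    while True:
        if free_fraction() < HEADROOM:
            # Local memory pressure: evict the E least-frequently-accessed
            # of (E + E') candidate slabs (decentralized batch eviction).
            batch_evict()
        else:
            # Plenty of local memory: pull pages back from remote memory,
            # unmap their slabs, and pre-allocate unmapped slabs for remote
            # Resilience Managers to map readily.
            reclaim_and_unmap()
            preallocate_unmapped()
        time.sleep(CONTROL_PERIOD)
```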
5 HIGHLY AVAILABLE ERASURE CODING

We now describe CodingSets, a novel coding group placement scheme used by Hydra, which performs load balancing while reducing the probability of data loss. We also show how Hydra deals with data corruption by leveraging distributed redundancy.

5.1 Availability vs. Load Balancing

To motivate the design of CodingSets, we analyze the fault-tolerance of existing erasure coding schemes under simultaneous server unavailability events. Prior work has shown that events causing multiple nodes to fail simultaneously lead to orders-of-magnitude more frequent data loss than independent node failures [23, 32]. Several scenarios can cause multiple servers to fail or become unavailable simultaneously, such as network partitions, partial power outages, and software bugs. An example scenario is a power outage, which can cause 0.5%–1% of machines to fail or go offline concurrently [24]. In the case of Hydra, data loss happens if a concurrent failure kills at least r + 1 of the (k + r) machines of a particular coding group.

We are inspired by copysets, a scheme for preventing data loss under correlated failures in replication [23, 24] by constraining the number of replication groups. Using the same terminology as prior work, we define each unique set of r + 1 servers within a coding group as a copyset. The number of copysets in a single coding group is C(k+r, r+1), where C(n, m) denotes the binomial coefficient. For example, in an (8+2) configuration with nodes numbered 1, 2, ..., 10, the sets of 3 nodes whose simultaneous failure causes data loss (i.e., the copysets) are all 3-combinations of the 10 nodes: (1, 2, 3), (1, 2, 4), ..., (8, 9, 10) – C(10, 3) = 120 copysets in total.

For a data loss event impacting exactly r + 1 random nodes simultaneously, the probability of losing the data of a single specific coding group is:

  P[Group] = C(k+r, r+1) / C(N, r+1),

where N is the total number of servers.

In a cluster with more than (k + r) servers, we need more than one coding group. However, if each server is a member of a single coding group, hot spots can occur when one or more members of that group are overloaded. Therefore, for load-balancing purposes, a simple solution is to allow each server to be a member of multiple coding groups, so that online coding can avoid the overloaded members of any particular group.

Assuming we have G disjoint coding groups and a correlated failure rate of f%, the total probability of data loss is:

  1 − (1 − P[Group] · G)^C(N·f, r+1).

We call coding groups disjoint when they do not share any copysets; in other words, when they do not overlap by more than r nodes.

Strawman: Multiple Coding Groups per Server. To equalize load, the first scheme we considered forms, for each slab, a coding group with the least-loaded nodes in the cluster at coding time. We assume that the least-loaded nodes at a given time are distributed randomly, and that the number of slabs per server is S. When S·(k + r) ≪ N, the coding groups are highly likely to be disjoint [24], and the number of groups is:

  G = N · S / (k + r).

We call this placement strategy the EC-Cache scheme, as it produces the random coding group placement used by the prior state-of-the-art in-memory erasure coding system, EC-Cache [61]. In this scheme, even with a modest number of slabs per server, a large number of combinations of r + 1 machines are copysets; in other words, even a small number of simultaneous node failures in the cluster results in data loss. When the number of slabs per server is high, almost every combination of only r + 1 failures across the cluster causes data loss. Therefore, to reduce the probability of data loss, we need to minimize the number of copysets while achieving sufficient load balancing.

CodingSets: Reducing Copysets for Erasure Coding. To this end, we propose CodingSets, a novel load-balancing scheme that reduces the number of copysets for distributed erasure coding. Instead of having each node participate in several coding groups as in EC-Cache, in our scheme each server belongs to a single, extended coding group. At the time of coding, (k + r) slabs are still considered together, but the participating nodes are chosen from a set of (k + r + l) nodes, where l is the load-balancing factor; the nodes chosen within the extended group are the least-loaded ones. Extending the coding group increases the number of copysets – each extended coding group creates C(k+r+l, r+1) copysets instead of C(k+r, r+1), while the number of groups becomes G = N / (k + r + l) – but it still has a significantly lower probability of data loss than having each node belong to multiple coding groups. Hydra uses CodingSets as its load balancing and slab placement policy. We evaluate its data loss probability and load balancing in Section 7.2.
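This analysis is easy to reproduce numerically. The sketch below (Python) evaluates the formulas above with the evaluation's default parameters (k=8, r=2, l=2, S=16, f=1%, N=1000); it yields roughly 13% data-loss probability for random EC-Cache-style placement versus roughly 1.3% for CodingSets – the order-of-magnitude gap reported in §7.2.

```python
from math import comb

def p_group(group_size, r, n):
    # P[Group]: a given (r + 1)-node failure hits one specific coding group.
    return comb(group_size, r + 1) / comb(n, r + 1)

def p_data_loss(group_size, r, n, groups, f):
    # 1 - (1 - P[Group] * G) ^ C(N*f, r+1), for correlated failure rate f.
    return 1 - (1 - p_group(group_size, r, n) * groups) ** comb(int(n * f), r + 1)

k, r, l, n, s, f = 8, 2, 2, 1000, 16, 0.01
print(comb(k + r, r + 1))                                # 120 copysets per (8+2) group
print(p_data_loss(k + r, r, n, n * s // (k + r), f))     # ~0.13 (random groups)
print(p_data_loss(k + r + l, r, n, n // (k + r + l), f)) # ~0.013 (CodingSets)
```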

Scenario            # of Errors   Minimum # of Splits   Memory Overhead
Failure             r             k + r                 1 + r/k
Error Detection     Δ             k + Δ                 1 + Δ/k
Error Correction    Δ             k + 2Δ + 1            1 + (2Δ + 1)/k

Table 1: Minimum number of splits that need to be in remote machines for resilience under uncertainties.
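Table 1's rules can also be written as a small helper (our naming, for illustration):

```python
def min_remote_splits(k, r, delta, scenario):
    # Splits that must reach remote machines before an I/O is safe.
    return {"failure": k + r,
            "detection": k + delta,
            "correction": k + 2 * delta + 1}[scenario]

def table1_overhead(k, splits):
    return splits / k  # e.g., (k + r)/k = 1 + r/k for the failure row

print(min_remote_splits(8, 2, 1, "correction"))  # 11 splits when k=8, Delta=1
```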

5.2 Fixing Memory Corruption

Besides handling entire-server unavailability, Hydra can also correct memory corruption. In the absence of corruption, waiting for any k writes out of the (k + r) splits is enough to guarantee resilience to failures. The same holds for reads: a read request can complete just after the arrival of the k-th split. However, to guarantee correction in the presence of corruption, remote I/O operations must wait for additional splits. Hydra needs Δ additional splits to detect Δ corruptions and 2Δ + 1 additional splits to locate and fix them. Hence, to provide the correctness guarantee over Δ corruptions, both asynchronous encoding and late binding need to wait for (k + 2Δ + 1) splits to be written to or read from remote machines.

This raises an interesting trade-off. Since memory corruptions are corrected at slab granularity, using smaller slabs – which result in more splits for the same memory capacity – would allow Hydra to correct more memory corruptions. Prior work analyzes this trade-off in the context of distributed storage systems [67]. Therefore, from the memory corruption perspective, minimizing the slab size is optimal. Table 1 summarizes the requirements for the different scenarios.

What About Replication? During in-memory replication, to remain operational after r failures, there should be at least r + 1 copies of the entire 4 KB page; hence the memory overhead is (r + 1)×. However, a remote I/O operation can complete just after confirmation from one of the r + 1 machines. To detect and fix Δ corruptions, replication needs Δ + 1 and 2Δ + 1 copies of the entire page, respectively. Thus, to provide the correctness guarantee over Δ corruptions, replication needs to wait until it writes to or reads from at least 2Δ + 1 of the replicas, along with a memory overhead of (2Δ + 1)×.

6 IMPLEMENTATION

The Hydra Resilience Manager is implemented as a loadable kernel module for Linux kernel 4.11 or later. A kernel-level implementation facilitates its deployment as an underlying block device for different memory disaggregation systems [12, 36, 66]. We integrated Hydra with two memory disaggregation systems: Infiniswap, a disaggregated virtual memory manager (VMM), and Remote Regions, a disaggregated virtual file system (VFS). All I/O operations (e.g., slab mapping, memory registration, RDMA posting/polling, erasure coding) are independent across threads and processed without synchronization. All RDMA operations use reliable connections and one-sided RDMA verbs (RDMA WRITE/READ). Each Resilience Manager maintains one connection to each active remote machine. For erasure coding, we use AVX instructions and the ISA-L library [5], which achieves over 4 GB/s encoding throughput per core for the (8+2) configuration on our evaluation platform.

The Hydra Resource Monitor is implemented as a user-space program. It uses RDMA SEND/RECV operations for all control messages.
7 EVALUATION

We have integrated Hydra with both Infiniswap [36] and Remote Regions [12] and evaluated it on a 50-machine 56 Gbps InfiniBand CloudLab cluster. Our evaluation addresses the following questions:
• Does Hydra improve the resilience of disaggregated cluster memory? (§7.1)
• Does it improve availability? (§7.2)
• How sensitive is Hydra to its parameters, and what are its overheads? (§7.3)
• How much TCO savings can we expect? (§7.4)

Methodology. Unless otherwise specified, we use k=8, r=2, and Δ=1, targeting 1.25× memory and bandwidth overhead. We select r=2 because late binding is still possible even when one of the remote slabs fails. The additional read (Δ=1) incurs 1.125× bandwidth overhead during reads. We use a 1 GB SlabSize, and the additional number of choices for eviction is E′ = 2. The free memory headroom is set to 25%, and the control period is set to 1 second. Each machine has 64 GB of DRAM and 2× Intel Xeon E5-2650v2 with 32 virtual cores.

We compare Hydra against the following alternatives:
• SSD Backup: A copy of each page is backed up on a local SSD for the minimum 1× remote memory overhead. We consider both a disaggregated VMM system (Infiniswap) and a disaggregated VFS system (Remote Regions).
• Replication: We directly write each page over RDMA to two remote machines' memory for a 2× overhead.
• EC-Cache w/ RDMA: An implementation of the erasure coding scheme in EC-Cache [61], but on RDMA.

Our evaluation of Hydra consists of both microbenchmarks and cluster-scale evaluations.

Late-binding 12.3 9.8 9.2 8.8 9.0

100 8.3 8.3 5.9 5.9 5.5 100 10 5.5 10 4.5 4.6 4.6 4.6

1 1 Latency (us) Latency 10 10 (us) Latency Median 99th Median 99th Median 99th Median 99th Read Write Read Write CCDF (%) CCDF CCDF (%) CCDF 1 1 (a) Background network flow (b) Remote Failures 0 5 10 15 20 0 5 10 15 20 25 30 Random Read Latency (μs) Random Write Latency (μs) (a) Remote Read (b) Remote Write Figure 11: Hydra latency in the presence of background net- work flow and remote failure. Figure 10: Latency breakdown of Hydra. TPS/OPS Latency (ms) (thousands) • Replication: We directly write each page over RDMA to two 50th 99th remote machines’ memory for a 2× overhead. HYD REP HYD REP HYD REP 100% 39.4 39.4 52.8 52.8 134.0 134.0 • EC-Cache w/ RDMA: Implementation of the erasure coding VoltDB 75% 36.1 35.3 56.3 56.1 142.0 143.0 scheme in EC-Cache [61], but implemented on RDMA. 50% 32.3 34.0 57.8 59.0 161.0 168.0 100% 123.0 123.0 123.0 123.0 257.0 257.0 Our evaluation of Hydra consists of both microbenchmarks and ETC 75% 119.0 125.0 120.0 121.0 255.0 257.0 cluster-scale evaluations. 50% 119.0 119.0 118.0 122.0 254.0 264.0 100% 108.0 108.0 125.0 125.0 267.0 267.0 SYS 75% 100.0 104.0 120.0 125.0 262.0 305.0 7.1 Resilience Evaluation 50% 101.0 102.0 117.0 123.0 257.5 430.0 We evaluate Hydra’s performance both in the presence and absence Table 2: Hydra (HYD) provides similar performance to repli- of failures with microbenchmarks and real-world memory-sensitive cation (REP) for VoltDB and Memcached workloads (ETC applications. and SYS). Higher is better for throughput-related, and the Lower is better for the latency-related numbers. 7.1.1 Latency Characteristics. First, we measure Hydra’s la- tency characteristics for disaggregated VMM and VFS systems in the absence of failures. Then we analyze the impact of each of its design components. (3) Late binding specifically improves the tail latency during remote Disaggregated VMM Latency. We use a simple application read by 61% by avoiding stragglers. The additional read request with its working set size set to 2GB. It is provided 1GB memory increases the median latency only by 6%. to ensure that 50% of its memory accesses cause paging. While (4) Asynchronous encoding hides erasure coding overhead during using disaggregated memory for remote page-in, Hydra improves writes, reducing the median write latency by 38%. page-in latency over Infiniswap by 1.79× at median and 1.93× at the 99th percentile. Page-out latency is improved by 1.9× and 2.2× over 7.1.2 Latency Under Failures. Infiniswap at median and 99th percentile, respectively. Replication Background Flows. In this case, we generate RDMA flows on provides at most 1.1× improved latency over Hydra, while incurring the remote machine constantly sending 1 GB messages. Unlike SSD 2× memory and bandwidth overhead (Figure 9a). backup and replication, Hydra maintains consistent latency due to Disaggregated VFS Latency. We use fio [3] to generate one late binding. Hydra’s latency improvement over SSD backup is 1.97– million random read/write requests of 4 KB block I/O. During reads, 2.56×, and it even outperforms replication at the 99th percentile by Hydra provides improved latency over Remote Regions by 2.13× 1.33× for read and 1.50× for write (Figure 11a). at median and 2.04× at the 99th percentile. During writes, Hydra also improves the latency over Remote Regions by 2.22× at median Remote Failures. Figure 11b shows that both read and write la- and 1.74× at the 99th percentile. 
7.1.1 Latency Characteristics. First, we measure Hydra's latency characteristics for disaggregated VMM and VFS systems in the absence of failures. Then we analyze the impact of each of its design components.

Disaggregated VMM Latency. We use a simple application with its working set size set to 2 GB. It is provided 1 GB of memory to ensure that 50% of its memory accesses cause paging. While using disaggregated memory for remote page-in, Hydra improves page-in latency over Infiniswap by 1.79× at the median and 1.93× at the 99th percentile. Page-out latency is improved by 1.9× and 2.2× over Infiniswap at the median and 99th percentile, respectively. Replication provides at most 1.1× better latency than Hydra, while incurring 2× memory and bandwidth overhead (Figure 9a).

Disaggregated VFS Latency. We use fio [3] to generate one million random read/write requests of 4 KB block I/O. During reads, Hydra improves latency over Remote Regions by 2.13× at the median and 2.04× at the 99th percentile. During writes, Hydra also improves latency over Remote Regions by 2.22× at the median and 1.74× at the 99th percentile. Replication does not show a significant latency gain over Hydra, improving latency by at most 1.18× (Figure 9b).

Latency Breakdown. Due to its coding overhead, naive erasure coding over RDMA (i.e., EC-Cache with RDMA) performs worse than the disk backup-based solution. Figure 10 shows how each component of Hydra's data path design contributes toward reducing its latency.
(1) Run-to-completion avoids interruptions during remote I/O, reducing the median read and write latency by 51%.
(2) In-place coding saves the additional time for data copying, which substantially adds up in memory disaggregation systems, reducing the read and write latency by 28%.
(3) Late binding specifically improves the tail latency of remote reads by 61% by avoiding stragglers. The additional read request increases the median latency by only 6%.
(4) Asynchronous encoding hides the erasure coding overhead during writes, reducing the median write latency by 38%.

7.1.2 Latency Under Failures.

Background Flows. In this case, we generate RDMA flows on the remote machine that constantly send 1 GB messages. Unlike SSD backup and replication, Hydra maintains consistent latency due to late binding. Hydra's latency improvement over SSD backup is 1.97–2.56×, and it even outperforms replication at the 99th percentile, by 1.33× for reads and 1.50× for writes (Figure 11a).

Remote Failures. Figure 11b shows that both read and write latency become disk-bound when it is necessary to write to and read from the backup SSD. Hydra reduces latency over SSD backup by 8.37–13.6× and 4.79–7.30× during remote reads and writes, respectively. Furthermore, it matches replication's performance.

7.1.3 Application-Level Performance. We now focus on Hydra's benefits for memory-intensive applications and compare it with SSD backup and replication. We evaluate using the following workloads:
• TPC-C benchmark [10] on VoltDB [11];
• Facebook workloads [16] on Memcached [8];
• PageRank with the Twitter dataset [43] on PowerGraph [34] and Apache Spark/GraphX [35].

We consider container-based application deployment [68] and run each application in a container with a memory limit set to fit 100%, 75%, or 50% of the peak memory usage of each application. At 100%, applications run completely in memory. At 75% and 50%, applications hit their memory limits and access pages in remote memory via Hydra.

We present Hydra's application-level performance against replication (Table 2 and Figure 12) to show that it achieves similar performance with lower memory overhead, even in the absence of failures. For brevity, we omit the plots for SSD backup, which performs much worse than both Hydra and replication – albeit with no memory overhead.

For VoltDB, when half of its data is in remote memory (50%), Hydra achieves 0.82× throughput and almost transparent latency characteristics compared to the fully in-memory case (100%). For Memcached, Hydra achieves 0.97× throughput with the GET-dominant ETC workload and 0.93× throughput with the SET-intensive SYS workload; additionally, it provides almost zero latency overhead compared to the fully in-memory scenario (100%). For graph analytics, Hydra achieves almost transparent application performance for PowerGraph, thanks to its optimized heap management. However, it suffers from increased job completion time for GraphX due to massive thrashing of in-memory and remote memory data. Replication does not have significant gains over Hydra.

Figure 12: Hydra also provides completion times similar to replication for graph analytics applications: (a) Hydra, (b) replication.

7.1.4 Application Performance Under Failures. We now analyze Hydra's performance in the presence of failures and compare it against the alternatives. In terms of impact on applications, we first go back to the scenarios discussed in Section 2.2 regarding VoltDB running with a 50% memory constraint. Except for the corruption scenario, where we set r=3, we use Hydra's default parameters. At a high level, we observe that Hydra performs similar to replication with 1.6× lower memory overhead (Figure 13).

Next, we start each benchmark application in the 50% setting and introduce one remote failure while it is running. Hydra's application-level performance is transparent to the presence of remote failures. Figure 14 shows that Hydra provides completion times similar to those of replication at a lower memory overhead in the presence of a remote failure. In comparison to SSD backup, workloads experience 1.3–5.75× lower completion times using Hydra.

7.2 Availability Evaluation

In this section, we evaluate Hydra's availability and load balancing characteristics in large clusters.

7.2.1 Analysis. We compare the availability and load balancing of Hydra with EC-Cache and power-of-two-choices [52]. In CodingSets, each server is attached to a disjoint coding group, and during an encoded write, the (k + r) least-loaded nodes are chosen from the (k + r + l) nodes of the coding group. EC-Cache simply assigns slabs to coding groups comprising random nodes. Power-of-two-choices chooses two candidate nodes at random for each slab and picks the less-loaded one.

Probability of Data Loss Under Simultaneous Failures. To evaluate the probability of data loss of Hydra under different scenarios in a large cluster setting, we compute the probability of data loss under the three schemes. Note that, in terms of data loss probability, we assume EC-Cache and power-of-two-choices select random servers and are therefore equivalent. Figure 15 compares the probabilities of loss for different parameters on a 1000-machine cluster. Our baseline comparison is against the best-case scenario for EC-Cache and power-of-two-choices, where the number of slabs per server is low (1 GB slabs, with 16 GB of memory per server). Even for a small number of slabs per server, Hydra reduces the probability of data loss by an order of magnitude. With a large number of slabs per server (e.g., 100), the probability of failure for EC-Cache becomes very high in case of a correlated failure. The figure also shows that there is an inherent trade-off between the load-balancing factor (l) and the probability of data loss under correlated failures.

Load Balancing of CodingSets. Figure 16 compares the load balancing of the three policies. EC-Cache's random selection of (k + r) nodes causes a higher load imbalance, since some nodes will randomly be more overloaded than others. As a result, CodingSets improves load balancing over the EC-Cache scheme by 1.1× even when l = 0, since CodingSets' coding groups are non-overlapping. For l = 4, CodingSets provides 1.5× better load balancing over EC-Cache at 1M machines. Power-of-two-choices improves load balancing by 0%–20% compared to CodingSets with l = 2, because it has more degrees of freedom in choosing nodes, but it suffers from an order-of-magnitude higher failure rate (Figure 15).
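A toy Monte Carlo version of this comparison (Python; our simplification – "load" is just the number of split placements per node) shows the same trend: random (k + r)-node selection yields a visibly higher max-to-mean load ratio than picking the least-loaded members of disjoint extended groups.

```python
import random

def ec_cache_load(n, k, r, writes, rng):
    load = [0] * n
    for _ in range(writes):
        for m in rng.sample(range(n), k + r):  # random (k + r) nodes per slab
            load[m] += 1
    return load

def codingsets_load(n, k, r, l, writes, rng):
    width = k + r + l
    groups = [range(i, i + width) for i in range(0, n, width)]
    load = [0] * n
    for _ in range(writes):
        group = rng.choice(groups)
        for m in sorted(group, key=lambda x: load[x])[:k + r]:
            load[m] += 1                       # (k + r) least-loaded members
    return load

def imbalance(load):
    return max(load) / (sum(load) / len(load))

rng = random.Random(0)
n, k, r, l, writes = 980, 8, 2, 4, 20000       # 980 = 70 disjoint groups of 14
print(imbalance(ec_cache_load(n, k, r, writes, rng)))      # higher
print(imbalance(codingsets_load(n, k, r, l, writes, rng))) # lower
```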

Figure 13: Hydra throughput with the same setup as in Figure 3: (a) remote failure, (b) background network load, (c) request burst.

Figure 14: Hydra provides transparent completion times in the presence of failure. Note that the Y-axis is in log scale.

7.2.2 Cluster Deployment. We run 250 containerized applications across 50 machines. For each application and workload, we create a container and randomly distribute it across the cluster. Here, the total memory footprint is 2.76 TB; our cluster has 3.20 TB of total memory. Half of the containers use the 100% configuration, about 30% use the 75% configuration, and the rest use the 50% configuration. There are at most two simultaneous failures.

Application Performance. We compare application performance in terms of completion time (Figure 17) and latency (Table 3), which demonstrate Hydra's performance benefits in the presence of cluster dynamics. Hydra's improvements increase with decreasing local memory ratio. Its throughput improvements w.r.t. SSD backup were up to 4.87× for 75% and up to 20.61× for 50%. Its latency improvements were up to 64.78× for 75% and up to 51.47× for 50%. Hydra's performance benefits are similar to replication (Figure 17c), but with lower memory overhead.

Figure 16: CodingSets enhances Hydra with better load balancing across the cluster (base parameters k=8, r=2).

              50th latency (ms)        99th latency (ms)
              SSD    HYD    REP        SSD      HYD    REP
VoltDB 100%   55     60     48         179      173    177
VoltDB  75%   60     57     48         217      185    225
VoltDB  50%   78     61     48         305      243    225
ETC    100%   138    119    118        260      245    247
ETC     75%   148    113    120        9912     240    263
ETC     50%   167    117    111        10175    244    259
SYS    100%   145    127    125        249      269    267
SYS     75%   154    119    113        17557    271    321
SYS     50%   124    111    117        22828    452    356

Table 3: Latencies for VoltDB and Memcached workloads (ETC and SYS) for SSD backup (SSD), Hydra (HYD), and replication (REP) in the cluster experiment.

Figure 15: Probability of data loss for Hydra with different r, l, S, and f parameters (base parameters k = 8, r = 2, l = 2, S = 16, f = 1%) on a 1000-machine cluster: (a) varied parities r, (b) varied load-balancing factors l, (c) varied slabs per machine S, (d) varied failure rates f.

Impact on Memory Imbalance and Stranding. Figure 18 shows that Hydra reduces memory usage imbalance w.r.t. coarser-grained memory management systems: in comparison to SSD backup-based (replication-based) systems, memory usage variation decreased from 18.5% (12.9%) to 5.9%, and the maximum-to-minimum utilization ratio decreased from 6.92× (2.77×) to 1.74×. Hydra better exploits unused memory in under-utilized machines, increasing the minimum memory utilization of any individual machine by 46%. Hydra incurs about 5% additional total memory usage compared to disk backup, whereas replication incurs 20% overhead.

7.3 Sensitivity Evaluation

Finally, we focus on Hydra's sensitivity to its different parameters as well as its overheads.

Impact of (k, r, Δ) Choices. Figure 19a shows read latency characteristics for varying k. Increasing from k=1 to k=2 reduces the median latency by parallelizing data transfers. Further increasing k improves space efficiency (measured as k/(k+r)) and load balancing, but latency deteriorates as well. Figure 19b shows read latency for varying values of Δ. Although just one additional read (from Δ=0 to Δ=1) helps tail latency, more additional reads bring diminishing returns; instead, they hurt latency due to proportionally increasing communication overheads. Figure 19c shows write latency variations for different r values. Increasing r does not affect the median write latency. However, the tail latency increases from r = 3 onward due to the increase in overall communication overheads.

Figure 17: Median completion times (i.e., throughput) of 250 containers on a 50-machine cluster: (a) SSD backup, (b) Hydra, (c) replication.

Resource Overhead. We measure the average CPU utilization of Hydra components during I/O requests. The Hydra Resilience Manager uses event-driven I/O and consumes only 0.001% of the CPU cycles of each core. Erasure coding adds an additional 0.09% CPU usage per core. Because Hydra uses one-sided RDMA, the remote Hydra Resource Monitor has no CPU overhead in the data path. In the cluster deployment, Hydra increased kernel CPU utilization by 2.2% on average and generated 291 Mbps of RDMA traffic per machine, which is only 0.5% of its 56 Gbps bandwidth. Replication had negligible CPU usage but generated more than 1 Gbps of traffic per machine.

Background Slab Regeneration. We manually evict one of the remote slabs and measure the time for it to be regenerated. When it is evicted, the Hydra Resilience Manager places a new slab and provides the evicted slab's information to the corresponding Hydra Resource Monitor, which takes 54 ms. The Resource Monitor then randomly selects k of the remaining remote slabs and reads the page data, which takes 170 ms for a 1 GB slab. Finally, it decodes the page data into the local memory slab within 50 ms. The total regeneration time for a 1 GB slab is therefore 274 ms, as opposed to the several minutes it takes to restart a server after a failure.

Figure 18: Memory load distribution across servers in terms of the average memory usage in each machine.

7.4 TCO Savings

We limit our TCO analysis to memory provisioning only. The TCO savings of Hydra is the revenue from leveraging unused (disaggregated) memory after deducting the TCO of the RDMA hardware. We consider the capital expenditure (CAPEX) of acquiring RDMA hardware and the operational expenditure (OPEX), including its power usage, over three years. An RDMA adapter costs $600 [6], an RDMA switch costs $318 [7] per machine, and the operating cost is $52 over three years [36] – overall, the three-year TCO is $970 for each machine. We consider the standard machine configurations and pricing from Google Cloud Compute [4], Amazon EC2 [2], and Microsoft Azure [2] to build revenue models and calculate the TCO savings for 30% of leveraged memory per machine (Table 4). For example, on Google, the savings of disaggregation over 3 years using Hydra is (($5.18*30*36)/1.25 − $970)/($1553*36)*100% = 6.3%.

Monthly Pricing               Google    Amazon    Azure
Standard machine              $1,553    $2,211    $2,242
1% of memory                  $5.18     $9.21     $5.92
TCO savings: Hydra            6.3%      8.8%      5.1%
TCO savings: Replication      3.3%      5.0%      2.8%

Table 4: Revenue model and TCO savings over three years for each machine with 30% unused memory on average.
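The savings columns of Table 4 follow mechanically from this formula; the sketch below (Python) reproduces all six numbers, using a 1.25× overhead for Hydra and 2× for replication.

```python
# All constants come from the text above (30% unused memory, 36 months,
# $970 three-year RDMA TCO per machine).
def tco_savings_pct(price_1pct_mem, machine_price, overhead,
                    unused_pct=30, months=36, rdma_tco=970.0):
    revenue = price_1pct_mem * unused_pct * months / overhead
    return (revenue - rdma_tco) / (machine_price * months) * 100

pricing = {"Google": (5.18, 1553), "Amazon": (9.21, 2211), "Azure": (5.92, 2242)}
for cloud, (mem, machine) in pricing.items():
    print(cloud,
          round(tco_savings_pct(mem, machine, 1.25), 1),  # Hydra: 6.3 / 8.8 / 5.1
          round(tco_savings_pct(mem, machine, 2.0), 1))   # Replication: 3.3 / 5.0 / 2.8
```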
8 RELATED WORK

Memory Disaggregation Systems. In the last few decades, many software systems have tried leveraging remote machines' memory for paging [1, 22, 27, 29, 36, 45, 46, 50, 56, 63], for global virtual memory abstractions [9, 28, 42], and to create distributed data stores [26, 41, 47, 58]. There are also proposals for hardware-based remote access to disaggregated memory using PCIe interconnects [48] and extended NUMA fabric [57]. Table 5 compares a selected few across key characteristics.

Cluster Memory Solutions. With the advent of RDMA, there has been a renewed interest in cluster memory solutions. The primary ways of leveraging cluster memory are through key-value interfaces [26, 39, 51, 58], distributed shared memory [55, 60], or distributed lock management [71]. However, these solutions are either limited by their interfaces or by their replication overheads. Hydra, on the contrary, is a transparent, memory-efficient, and load-balanced mechanism for resilient remote memory.
System                       Year  Deployability  Fault Tolerance              Load Balancing          Latency Tolerance
Global Memory [28]           '95   OS Change      Local Disk                   Global Manager          None
Memory Pager [50]            '96   Unmodified     Replication/RAID             None                    None
Network RAM [15]             '98   None           Local Disk/Replication/RAID  None                    None
HPBD [46]                    '05   Unmodified     None                         None                    None
Memory Blade [48]            '09   HW Change      Reprovision                  None                    None
RamCloud [58]                '10   App. Change    Remote Disks                 Power of Choices        None
FaRM [26]                    '14   App. Change    Replication                  Central Coordinator     None
EC-Cache [61]                '16   App. Change    Erasure Coding               Multiple Coding Groups  Late Binding
Infiniswap [36]              '17   Unmodified     Local Disk                   Power of Choices        None
Remote Regions [12]          '18   App. Change    None                         Central Manager         None
LegoOS [66]                  '18   OS Change      Remote Disk                  None                    None
Compressed Far Memory [45]   '19   OS Change      None                         None                    None
Hydra                              Unmodified     Erasure Coding               CodingSets              Late Binding
Table 5: Selected proposals on leveraging remote memory over the years.

Remote Memory for DBMSs. Over the last few years, memory management systems have been proposed for DBMSs to utilize remote memory using distributed shared memory abstractions [17, 18, 26, 30, 65] and even over a disaggregated OS [74]. These frameworks consider either the applicability or the performance issues of DBMSs over remote memory. None of them considers improving the performance-vs-efficiency tradeoff for a resilient cluster memory.

Erasure Coding in Storage. Erasure coding has been widely employed in RAID systems to achieve space-efficient fault tolerance [64, 77]. Recent large-scale clusters leverage erasure coding to store cold data in a space-efficient, fault-tolerant manner [37, 54, 69]. EC-Cache [61] is an erasure-coded in-memory cache for objects of 1 MB or larger, but it is highly susceptible to data loss under correlated failures, and its scalability is limited due to communication overhead. In contrast, Hydra achieves resilient disaggregated cluster memory with single-digit 휇s page access latency while maintaining the benefits of erasure codes.

9 CONCLUSION

With the ever-increasing memory demand of in-memory DBMSs, the memory TCO of datacenters is rapidly increasing. Memory disaggregation promises to alleviate this, but its deployment is hindered by larger failure domains and susceptibility to stragglers [13, 20, 45], leading to tradeoffs between performance, efficiency, and availability. Hydra leverages online erasure coding to achieve single-digit 휇s latency under failures, while judiciously placing erasure-coded data to improve availability and load balancing. On the one hand, it matches the resilience of replication with 1.6× lower memory overhead. On the other hand, it significantly improves the latency and throughput of both micro-benchmarks and real-world online transaction and analytics processing applications over SSD backup-based resilience, with only 1.25× higher memory overhead. Furthermore, CodingSets allows Hydra to reduce the probability of data loss under simultaneous failures by about 10×. Overall, Hydra makes resilient disaggregated memory practical for DBMSs.
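The intuition behind constrained placement can be sanity-checked with a toy Monte-Carlo experiment. The sketch below (ours) compares fully random placement of (푘+푟)-sized coding groups against a simplified copyset-style rule [24] that restricts slabs to a few preconstructed disjoint groups; the cluster size, failure count, and trial counts are arbitrary, so the output is qualitative only and is not the paper's 10× figure.

import random

N, K, R = 100, 8, 2                       # servers; (k, r) erasure code
GROUP, SLABS, FAILS, TRIALS = K + R, 1000, 5, 500

def random_groups():
    # Every slab picks an arbitrary (k + r)-server subset.
    return [random.sample(range(N), GROUP) for _ in range(SLABS)]

def copyset_groups():
    # Slabs may only use a handful of fixed, disjoint groups.
    servers = random.sample(range(N), N)
    sets = [servers[i:i + GROUP] for i in range(0, N, GROUP)]
    return [random.choice(sets) for _ in range(SLABS)]

def loss_probability(make_groups) -> float:
    losses = 0
    for _ in range(TRIALS):
        failed = set(random.sample(range(N), FAILS))
        # A slab is lost if more than r servers of its group failed.
        if any(len(failed & set(g)) > R for g in make_groups()):
            losses += 1
    return losses / TRIALS

print("random placement :", loss_probability(random_groups))
print("copyset placement:", loss_probability(copyset_groups))

With these parameters, random placement loses some slab in nearly every trial, while the copyset-style rule does so in only a few percent of trials, illustrating why limiting the number of distinct coding groups sharply reduces the chance that any one group exceeds its parity budget.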

REFERENCES
[1] [n.d.]. Accelio based network block device. https://github.com/accelio/NBDX.
[2] [n.d.]. Amazon EC2 Pricing. https://aws.amazon.com/ec2/pricing. Accessed: 2019-08-05.
[3] [n.d.]. Fio - Flexible I/O Tester. https://github.com/axboe/fio.
[4] [n.d.]. Google Compute Engine Pricing. https://cloud.google.com/compute/pricing. Accessed: 2019-08-05.
[5] [n.d.]. Intel Intelligent Storage Acceleration Library (Intel ISA-L). https://software.intel.com/en-us/storage/ISA-L.
[6] [n.d.]. Mellanox InfiniBand Adapter Cards. https://www.mellanoxstore.com/categories/adapters/infiniband-and-vpi-adapter-cards.html.
[7] [n.d.]. Mellanox Switches. https://www.mellanoxstore.com/categories/switches/infiniband-and-vpi-switch-systems.html.
[8] [n.d.]. Memcached - A distributed memory object caching system. http://memcached.org.
[9] [n.d.]. The Versatile SMP (vSMP) Architecture. http://www.scalemp.com/technology/versatile-smp-vsmp-architecture/.
[10] [n.d.]. TPC Benchmark C (TPC-C). http://www.tpc.org/tpcc/.
[11] [n.d.]. VoltDB. https://github.com/VoltDB/voltdb.
[12] Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Jayneel Gandhi, Stanko Novaković, Arun Ramanathan, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and Michael Wei. 2018. Remote regions: a simple abstraction for remote memory. In USENIX ATC.
[13] Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Jayneel Gandhi, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and Michael Wei. 2017. Remote memory in the age of fast networks. In SoCC.
[14] M. K. Aguilera, R. Janakiraman, and L. Xu. 2005. Using erasure codes efficiently for storage in a distributed system. In DSN.
[15] Eric A. Anderson and Jeanna M. Neefe. 1994. An Exploration of Network RAM. Technical Report UCB/CSD-98-1000. EECS Department, University of California, Berkeley.
[16] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. 2012. Workload Analysis of a Large-Scale Key-Value Store. In SIGMETRICS.
[17] Claude Barthels, Simon Loesing, Gustavo Alonso, and Donald Kossmann. 2015. Rack-Scale In-Memory Join Processing Using RDMA. In SIGMOD.
[18] Qingchao Cai, Wentian Guo, Hao Zhang, Divyakant Agrawal, Gang Chen, Beng Chin Ooi, Kian-Lee Tan, Yong Meng Teo, and Sheng Wang. 2018. Efficient Distributed Memory Management with RDMA and Caching. (2018).
[19] Irina Calciu, Ivan Puddu, Aasheesh Kolli, Andreas Nowatzyk, Jayneel Gandhi, Onur Mutlu, and Pratap Subrahmanyam. 2019. Project PBerry: FPGA Acceleration for Remote Memory. In HotOS.
[20] Amanda Carbonari and Ivan Beschasnikh. 2017. Tolerating Faults in Disaggregated Datacenters. In HotNets.
[21] Jeremy C. W. Chan, Qian Ding, Patrick P. C. Lee, and Helen H. W. Chan. 2014. Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage. In FAST.
[22] Haogang Chen, Yingwei Luo, Xiaolin Wang, Binbin Zhang, Yifeng Sun, and Zhenlin Wang. 2008. A transparent remote paging model for virtual machines. In International Workshop on Virtualization Technology.
[23] Asaf Cidon, Robert Escriva, Sachin Katti, Mendel Rosenblum, and Emin Gun Sirer. 2015. Tiered Replication: A Cost-effective Alternative to Full Cluster Geo-replication. In USENIX ATC.
[24] Asaf Cidon, Stephen M Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout, and Mendel Rosenblum. 2013. Copysets: Reducing the Frequency of Data Loss in Cloud Storage. In USENIX ATC.
[25] Jeffrey Dean and Luiz André Barroso. 2013. The tail at scale. Commun. ACM 56, 2 (2013), 74–80.
[26] Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. 2014. FaRM: Fast Remote Memory. In NSDI.
[27] Sandhya Dwarkadas, Nikolaos Hardavellas, Leonidas Kontothanassis, Rishiyur Nikhil, and Robert Stets. 1999. Cashmere-VLM: Remote memory paging for software distributed shared memory. In IPPS/SPDP.
[28] Michael J Feeley, William E Morgan, EP Pighin, Anna R Karlin, Henry M Levy, and Chandramohan A Thekkath. 1995. Implementing global memory management in a workstation cluster. In SOSP.
[29] Edward W. Felten and John Zahorjan. 1991. Issues in the implementation of a remote memory paging system. Technical Report 91-03-09. University of Washington.
[30] Feng Li, Sudipto Das, Manoj Syamala, and Vivek R. Narasayya. 2016. Accelerating Relational Databases by Leveraging Remote Memory and RDMA. In SIGMOD.
[31] Michail D. Flouris and Evangelos P. Markatos. 1999. The network RamDisk: Using remote memory on heterogeneous NOWs. Journal of Cluster Computing 2, 4 (1999), 281–293.
[32] Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. 2010. Availability in Globally Distributed Storage Systems. In OSDI.
[33] Peter X Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network requirements for resource disaggregation. In OSDI.
[34] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI.
[35] Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J Franklin, and Ion Stoica. 2014. GraphX: Graph processing in a distributed dataflow framework. In OSDI.
[36] Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. 2017. Efficient Memory Disaggregation with Infiniswap. In NSDI.
[37] Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin. 2012. Erasure Coding in Windows Azure Storage. In USENIX ATC.
[38] Anuj Kalia, Michael Kaminsky, and David G Andersen. 2014. Using RDMA efficiently for key-value services. In SIGCOMM.
[39] Anuj Kalia, Michael Kaminsky, and David G Andersen. 2014. Using RDMA efficiently for key-value services. In SIGCOMM.
[40] Srikanth Kandula, Sudipta Sengupta, Albert Greenberg, Parveen Patel, and Ronnie Chaiken. 2009. The Nature of Datacenter Traffic: Measurements and Analysis. In IMC.
[41] Chinmay Kulkarni, Aniraj Kesavan, Tian Zhang, Robert Ricci, and Ryan Stutsman. 2017. Rocksteady: Fast Migration for Low-latency In-memory Storage. In SOSP.
[42] Yossi Kuperman, Joel Nider, Abel Gordon, and Dan Tsafrir. 2016. Paravirtual Remote I/O. In ASPLOS.
[43] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. In WWW.
[44] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J Rossbach, and Emmett Witchel. 2016. Coordinated and efficient huge page management with Ingens. In OSDI.
[45] Andres Lagar-Cavilla, Junwhan Ahn, Suleiman Souhlal, Neha Agarwal, Radoslaw Burny, Shakeel Butt, Jichuan Chang, Ashwin Chaugule, Nan Deng, Junaid Shahid, Greg Thelen, Kamil Adam Yurtsever, Yu Zhao, and Parthasarathy Ranganathan. 2019. Software-Defined Far Memory in Warehouse-Scale Computers. In ASPLOS.
[46] Shuang Liang, Ranjit Noronha, and Dhabaleswar K Panda. 2005. Swapping to remote memory over InfiniBand: An approach using a high performance network block device. In Cluster Computing.
[47] Hyeontaek Lim, Dongsu Han, David G Andersen, and Michael Kaminsky. 2014. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage. In NSDI.
[48] Kevin Lim, Jichuan Chang, Trevor Mudge, Parthasarathy Ranganathan, Steven K Reinhardt, and Thomas F Wenisch. 2009. Disaggregated memory for expansion and sharing in blade servers. In ISCA.
[49] Kevin Lim, Yoshio Turner, Jose Renato Santos, Alvin AuYoung, Jichuan Chang, Parthasarathy Ranganathan, and Thomas F Wenisch. 2012. System-level implications of disaggregated memory. In HPCA.
[50] Evangelos P Markatos and George Dramitinos. 1996. Implementation of a Reliable Remote Memory Pager. In USENIX ATC.
[51] Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store. In USENIX ATC.
[52] Michael Mitzenmacher, Andrea W. Richa, and Ramesh Sitaraman. 2001. The Power of Two Random Choices: A Survey of Techniques and Results. Handbook of Randomized Computing (2001), 255–312. Issue 1.
[53] Jeffrey C Mogul and John Wilkes. 2019. Nines are Not Enough: Meaningful Metrics for Clouds. In HotOS.
[54] Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, and Sanjeev Kumar. 2014. f4: Facebook's Warm BLOB Storage System. In OSDI.
[55] Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. 2015. Latency-tolerant software distributed shared memory. In USENIX ATC.
[56] Tia Newhall, Sean Finney, Kuzman Ganchev, and Michael Spiegel. 2003. Nswap: A network swapping module for Linux clusters. In Euro-Par.
[57] Stanko Novakovic, Alexandros Daglis, Edouard Bugnion, Babak Falsafi, and Boris Grot. 2014. Scale-out NUMA. In ASPLOS.
[58] Diego Ongaro, Stephen M Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. 2011. Fast Crash Recovery in RAMCloud. In SOSP.
[59] John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. 2010. The Case for RAMClouds: Scalable High Performance Storage Entirely in DRAM. SIGOPS OSR 43, 4 (2010).
[60] Russell Power and Jinyang Li. 2010. Building fast, distributed programs with partitioned tables. In OSDI.
[61] K V Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, and Kannan Ramchandran. 2016. EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding. In OSDI.
[62] I. Reed and G. Solomon. 1960. Polynomial Codes Over Certain Finite Fields. J. Soc. Indust. Appl. Math. 8, 2 (1960), 300–304.
[63] Ahmad Samih, Ren Wang, Christian Maciocco, Tsung-Yuan Charlie Tai, Ronghui Duan, Jiangang Duan, and Yan Solihin. 2012. Evaluating dynamics and bottlenecks of memory collaboration in cluster systems. In CCGrid.
[64] Maheswaran Sathiamoorthy, Megasthenis Asteris, Dimitris S. Papailiopoulos, Alexandros G. Dimakis, Ramkumar Vadali, Scott Chen, and Dhruba Borthakur. 2013. XORing Elephants: Novel Erasure Codes for Big Data. In VLDB.
[65] Alex Shamis, Matthew Renzelmann, Stanko Novakovic, Georgios Chatzopoulos, Aleksandar Dragojević, Dushyanth Narayanan, and Miguel Castro. 2019. Fast General Distributed Transactions with Opacity. In SIGMOD.
[66] Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang. 2018. LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation. In OSDI.
[67] Amy Tai, Andrew Kryczka, Shobhit O. Kanaujia, Kyle Jamieson, Michael J. Freedman, and Asaf Cidon. 2019. Who's Afraid of Uncorrectable Bit Errors? Online Recovery of Flash Errors with Distributed Redundancy. In USENIX ATC.
[68] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. 2015. Large-scale cluster management at Google with Borg. In EuroSys.
[69] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. 2006. Ceph: A Scalable, High-performance Distributed File System. In OSDI.
[70] Matt M. T. Yiu, Helen H. W. Chan, and Patrick P. C. Lee. 2017. Erasure Coding for Small Objects in In-memory KV Storage. In SYSTOR.
[71] Dong Young Yoon, Mosharaf Chowdhury, and Barzan Mozafari. 2018. Distributed lock management with RDMA: decentralization without starvation. In SIGMOD.
[72] Erfan Zamanian, Carsten Binnig, Tim Harris, and Tim Kraska. 2017. The End of a Myth: Distributed Transactions Can Scale. In VLDB.
[73] Heng Zhang, Mingkai Dong, and Haibo Chen. 2016. Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication. In FAST.
[74] Qizhen Zhang, Yifan Cai, Xinyi Chen, Sebastian Angel, Ang Chen, Vincent Liu, and Boon Thau Loo. 2020. Understanding the Effect of Data Center Resource Disaggregation on Production DBMSs. In VLDB.
[75] Qi Zhang, Mohamed Faten Zhani, Shuo Zhang, Quanyan Zhu, Raouf Boutaba, and Joseph L Hellerstein. 2012. Dynamic energy-aware capacity provisioning for cloud computing environments. In ICAC.
[76] Yiwen Zhang, Juncheng Gu, Youngmoon Lee, Mosharaf Chowdhury, and Kang G. Shin. 2017. Performance Isolation Anomalies in RDMA. In KBNets.
[77] Zhe Zhang, Amey Deshpande, Xiaosong Ma, Eno Thereska, and Dushyanth Narayanan. 2010. Does erasure coding have a role to play in my data center? Technical Report MSR-TR-2010-52. Microsoft Research.