Scalable, Near-Zero Loss Disaster Recovery for Distributed Data Stores

Ahmed Alquraan∗ (University of Waterloo), Alex Kogan (Oracle Labs), Virendra J. Marathe (Oracle Labs), Samer Al-Kiswany (University of Waterloo)

∗This work was done when the author was an intern at Oracle Labs.

ABSTRACT

This paper presents a new Disaster Recovery (DR) system, called Slogger, that differs from prior works in two principal ways: (i) Slogger enables DR for a linearizable distributed data store, and (ii) Slogger adopts the continuous backup approach that strives to maintain a tiny lag on the backup site relative to the primary site, thereby restricting the data loss window, due to disasters, to milliseconds. These goals pose a significant set of challenges related to consistency of the backup site's state, failures, and scalability. Slogger employs a combination of asynchronous log replication, intra-data center synchronized clocks, pipelining, batching, and a novel watermark service to address these challenges. Furthermore, Slogger is designed to be deployable as an "add-on" module in an existing distributed data store with few modifications to the original code base. Our evaluation, conducted on Slogger extensions to a 32-sharded version of LogCabin, an open source key-value store, shows that Slogger maintains a very small data loss window of 14.2 milliseconds, which is near the optimal value in our evaluation setup. Moreover, Slogger reduces the length of the data loss window by 50% compared to an incremental snapshotting technique, without having any performance penalty on the primary data store. Furthermore, our experiments demonstrate that Slogger achieves our other goals of scalability, fault tolerance, and efficient failover to the backup data store when a disaster is declared at the primary data store.

PVLDB Reference Format:
Ahmed Alquraan, Alex Kogan, Virendra J. Marathe, and Samer Al-Kiswany. Scalable, Near-Zero Loss Disaster Recovery for Distributed Data Stores. PVLDB, 13(9): 1429-1442, 2020.
DOI: https://doi.org/10.14778/3397230.3397239

Figure 1: Local vs. Geo replication performance (throughput vs. latency) of LogCabin under 100% writes load, varying number of concurrent clients (numbers on the curves). Full details of the setup are provided in §7.

1. INTRODUCTION

The importance of distributed systems has dramatically grown in the cloud era. They provide the highly desirable scale-out and fault tolerance capabilities that are central to distributed infrastructures and services hosted by cloud vendors. This paper focuses on disaster recovery (DR), a critical feature required in production distributed systems long before the cloud era, and certainly since it started. DR enables tolerance of data center wide outages, where the original data center is rendered inoperable for extended periods. DR of a distributed data store (e.g., key-value stores, file storage, etc.) is enabled by creating an additional copy of the data store at a remote backup site (data center) while the primary site's data store is online [30]. The backup copy, which typically lags behind the primary data store, serves as the new basis of the data store to create and/or start a new primary data store. The latest data updates at the old primary data store may be lost during disasters. Nonetheless, concerns about data loss due to disasters have forced key DR design decisions in production data center infrastructures and distributed data stores [2, 8, 45, 46].

Traditional means of DR is through snapshots [18, 20, 33, 40, 48] – a data store snapshot is asynchronously created and replicated to the backup site. While a sufficient solution for many use cases, the key limitation of this approach is a potentially large window of data loss (seconds, minutes, hours/days) between the time the last snapshot was replicated and the time the disaster occurred.
An alternate approach is to build synchronous geo-replicated data stores [5, 8, 46], which can trivially tolerate data center wide failures. The key benefit of geo-replicated data stores is that zero data loss is guaranteed even in the presence of data center wide outages. However, synchronous replication across data centers has a significant performance cost in the data store's critical path [17]. Our experiments on LogCabin [37], a highly available Key-Value (K-V) store that uses the Raft consensus protocol [38] for replication, compared performance between synchronous 3-way intra and inter data center replication. Figure 1 shows the experiment's results.

We found that the inter data center replicated LogCabin cluster performs over an order of magnitude worse than the intra data center cluster. The performance degradation simply reflects the effects of geographic distance between machines interacting in the replication protocol.

Clearly, intra data center replication is highly attractive from a performance perspective. Disasters are naturally relevant to distributed data stores constrained within a single geographic location (e.g. a single data center). As a result, our work's scope is restricted to such systems. The first question we want to answer is whether one can engineer a DR scheme that asynchronously replicates updates to a backup site with a near zero lag – in the event of a disaster at the primary site, the DR scheme may lose updates accepted at the primary site from just the last few milliseconds.

Figure 2: Example data store with 3 shards, each 3-way replicated. The inner workings of our backup system are described in §5.

Another class of solutions to DR that form a reasonable starting point for our work fall under a category we call continuous backups. In these solutions, the primary data store is continuously, asynchronously, and incrementally replicated to the backup site [11, 21, 22, 30]. Unfortunately, none of these solutions work correctly for linearizable [16] distributed data stores¹. In particular, we show that the order in which updates are backed up, using prior techniques, may lead to an update order at the backup site that is inconsistent with the update order observed at the primary site (§3).

¹Linearizability is also referred to as external consistency in some works [8].

In recent times, a growing number of commercially successful data stores that support linearizability, exclusively or optionally, have emerged [1, 3, 5, 8, 23, 41]. The DR problem we discuss here is relevant to intra data center deployments of these systems. Thus the real question we want to address is, can a near-zero lag DR scheme be designed for linearizable distributed data stores? To that end, we propose a solution based on timestamps generated by synchronized distributed clocks (§4).
Furthermore, from a pragmatic view, can the solution be easily pluggable into existing distributed data stores, where the changes needed in the original data store are few and non-invasive?

We introduce Slogger, a new DR framework, to address the above questions (§5). Slogger plugs into any linearizable (even non-linearizable) distributed data store that uses write-ahead logs [32] to apply changes to its state. Slogger asynchronously replicates and applies the logs to a designated backup site. It preserves linearizability by tracking temporal order between log records using synchronized distributed clocks [8, 14, 28, 29, 31]. Specifically, Slogger assumes that the data store can tag each log record with a timestamp derived from the distributed clock that is synchronized across nodes in a data center. This timestamp is used to identify a partial order between updates. Log records are applied to the backup site in tagged timestamp order. To make our solution work in practice, we needed to employ several techniques such as concurrency, pipelining, and batching; we also employed a coarse-grain synchronization technique based on a novel watermark service on the backup site. All backup work happens continuously in the background and off the primary data store's critical path.

Slogger is the first backup system that guarantees prefix linearizability [49]: The backup system stores updates in the same linearizable order as observed by the primary data center as long as the latter is operational. If the primary data center fails, the system guarantees backup of a prefix of the linearization order of updates that happened at the primary data center. Prefix linearizability, which may lead to some data loss during a disaster, is far more preferable to inconsistencies in the backup site after a disaster at the primary site, particularly for mission critical applications such as databases [4, 11, 35, 44] and enterprise storage systems [36].

We present extensive empirical evaluation (§7) of Slogger extensions to a 32-way sharded version of LogCabin, a highly available open source key-value store. Our experiments show that the data loss window stays surprisingly low (as low as 14.2 milliseconds in our experiments) even though the backup happens asynchronously in the background. This low data loss window comes at virtually no performance penalty on LogCabin. Furthermore, our experiments demonstrate that Slogger achieves our other goals of scalability, fault tolerance, and efficient failover to the backup data store when a disaster is declared at the primary data store.

2. DATA STORE ARCHITECTURE

The Primary Data Store. Slogger makes critical assumptions about some properties of the distributed data store's design. We believe these properties are commonly supported by most data stores that provide high-availability guarantees using synchronous replication [2, 8, 23, 45, 46].

The data store is logically partitioned into a multitude of non-overlapping shards. Multiple shards can co-habit the same physical machine. Each shard is synchronously replicated for high availability. We assume a shard's replica set contains a collection of copies of the shard, termed replicas, hosted on different machines. A replica set typically contains a single leader replica that processes all the updates directed to the shard and propagates them to the rest of the replicas, called followers, in its replica set. Leaders can be chosen statically or dynamically using a leader election algorithm. The replication scheme itself may have a simple primary-secondary form [12, 19], or may rely on a more sophisticated consensus protocol such as Paxos [15, 26] or Raft [38]. Figure 2 depicts an example setup.

We assume that all updates to a shard are first appended to a local write-ahead log containing logical update records. The leader performs replication by sending log records to its followers. Replication is synchronous. After replication completes, the updates may propagate to the shard's state machine layer [43]. We further assume that the data store contains the mechanics to instantiate a new follower copy of a shard from its existing replica set. The data store can tolerate transient network problems such as unreliable delivery, local outages, asynchrony, etc. Finally, we assume a fail-stop failure model for machines used by the data store, in which machines may stop working, but will never send erroneous messages.
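Slogger does not prescribe the data store's internals; the following is a minimal C++ sketch of the leader-side update path that the above assumptions imply. All names (LogRecord, ShardLeader, replicateToFollowers, applyToStateMachine) are illustrative and are not LogCabin's or Slogger's actual API.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical logical update record kept in a shard's write-ahead log.
    struct LogRecord {
        uint64_t index;       // position in this shard's log
        std::string opcode;   // e.g., "PUT" or "DELETE"
        std::string key;
        std::string value;
    };

    // Sketch of the assumed leader-side update path: append to the local
    // write-ahead log, synchronously replicate to a threshold of followers,
    // and only then let the update propagate to the state machine layer.
    class ShardLeader {
    public:
        bool handleUpdate(const LogRecord& update) {
            log_.push_back(update);                 // 1. local append
            if (!replicateToFollowers(update))      // 2. synchronous replication
                return false;                       //    (e.g., majority ack)
            applyToStateMachine(update);            // 3. apply after commit
            return true;
        }

    private:
        std::vector<LogRecord> log_;

        // Placeholders: the real replication and state-machine logic is
        // data-store specific (e.g., Raft in LogCabin).
        bool replicateToFollowers(const LogRecord&) { return true; }
        void applyToStateMachine(const LogRecord&) {}
    };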
The Backup Site. We assume the backup and primary sites have similar infrastructures. The backup site will host a logical mirror copy of the primary site. This means the backup site will have logically identical shards, and use identical algorithms and schemes to control data it receives from the primary site, e.g., the same replication, leader election, and failure detection algorithms.

However, the backup site will likely have a different physical structure: e.g., the network topology, the relative locations of replicas (host machines), and the replication factor will likely be different.

We assume that the network between the primary and backup sites is unreliable and asynchronous: there are no guarantees that packets will be received at all, or in a timely manner, and there is no limit on the time a packet takes to be delivered. Like the primary machines, the backup machines also have a fail-stop failure model.

Disasters. We assume a disaster is declared manually by the data center administrator, who is informed about the outage through an out-of-band process. The administrator then triggers the fail-over steps to the backup site.

3. THE WRITE ORDERING PROBLEM

Our starting point is the aforementioned asynchronous log record shipping technique for backing up a distributed data store [11, 21, 22, 30]. That technique enforces ordering between writes (log appends) on different shards by tracking explicit data dependencies between them. For instance, King et al. [21, 22] propose to track the write-write and write-read ordering dependencies of distributed transactions via their log records, determined by tracking overlapping read and write sets of distributed transactions using Lamport style clock counters [25]. The dependencies are used to apply transactions in the correct order at the backup site.

While the above dependency tracking may be sufficient for serializable updates, it is insufficient for linearizable updates. The nub of the problem is that ordering dependencies implied by writes and reads may not necessarily be enforced via the data store; they may be enforced by the logic of an application that uses the data store. Consider an application A that issues a write to object OS1 that resides in shard S1. After the write completes, A issues a write to object OS2 that resides in another shard S2. Thus, A's program logic, which is opaque to the data store, has enforced an externally visible happens-before relation between the two writes. This relation is enforced by a linearizable data store since linearizability guarantees that an operation takes effect in real-time between its invocation and response [16]. The above happens-before relation must be preserved by the backup system. It however cannot be captured by simply tracking overlapping reads and writes, since there are no overlaps in the example. Thus the backup site cannot correctly determine the order in which the writes to OS1 and OS2 must be applied. In general, the two writes may be causally ordered in arbitrarily complex ways that involve multiple threads of execution within applications, and even multiple applications that communicate with each other via an out-of-band communication channel.

Interestingly, ordered reads can also enforce an order between concurrent writes. Consider two threads of execution in an application: one thread writes to OS1 and the other thread concurrently writes to OS2. Since the writes are concurrent, no specific order is imposed on them. However, if a third thread of execution happens to read the two objects in a specific order, say OS1 followed by OS2, the reads may imply an order in which the two writes need to be backed up – if the read of OS1 returns its new value, whereas the read of OS2 returns its old value, OS2's write must be backed up only after OS1's write. This visibility-based order must be correctly preserved in the backup. For instance, backing up the two writes in the opposite order – OS2 followed by OS1 – may lead to an inconsistent view in the backup site if the primary site fails between the two backups.

Figure 3: Two scenarios that can lead to observed inconsistencies between primary and backup data stores: (a) Order implied by causally related non-overlapping writes to distinct shards. (b) Order implied by visibility of concurrent non-overlapping writes.

Figure 3 depicts both of our problem scenarios. Both the scenarios above created a happens-before relation between the writes to OS1 and OS2. The first is a causal dependency explicitly enforced through application-specific program logic, whereas the second is implicit through visibility of the updates. In both of these scenarios, the backup system needs to preserve the happens-before relation between the writes that was established at the primary site. But how do we do it?

One can consider another solution that forces the application to explicitly track ordering dependencies, and communicate them to the data store. This however entails a prohibitively intrusive change in the application, since the dependency tracking may be hard, even infeasible, in sufficiently complex applications.

4. A TIMESTAMP BASED SOLUTION

We expect that in most cases writes are applied to the data store through per shard write-ahead logs. Thus, in this model, the writes discussed in §3 will first be appended to the leader's log and then to the follower logs. These logs will contain the unique writes appended to the data store, and the log records can then be shipped to the backup asynchronously.

We observe that, at the backup, out-of-order reception of log records for different shards is acceptable. It is the order in which the log records are applied that is crucial to preserve the correct happens-before relation between writes. For instance, in the examples shown in Figure 3, even if the backup data store receives the write of OS2 before it receives the write of OS1, it must guarantee that OS2's write is applied to the backup data store only after OS1's write is applied. This requirement permits a relaxation in that a log record's write can be applied to the backup data store after all the writes it causally depends on are available for application at the backup. Recall our assumption from §2 that all writes go through the write-ahead logs; we further assume that no application directly reads from the backup data store. The causal dependency, however, cannot always be reliably determined by just observing the (replicated, if necessary) log records received at the backup from the different logs, particularly in the problematic cases described earlier (§3).

Timestamps tagged on log records can help establish a happens-before relation between log records: the happens-before relation between two writes generated in different logs can be indirectly embodied by links between monotonically increasing timestamps on the log records. If the two writes are causally dependent, they will be correctly tagged with different timestamps embodying the order of the writes. This is a conservative approach since it creates unnecessary happens-before relations between writes that are not causally related: absent additional dependency information, any writes that happened at time T are assumed to be causally dependent on all writes that happened at all shards at time T − K, for any integer K > 0.
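To make the timestamp-order rule concrete, the following is a minimal C++ sketch, not Slogger's actual code, of a log record tagged with a synchronized-clock timestamp and of the conservative cross-shard ordering predicate described above. The type and field names are illustrative.

    #include <cstdint>
    #include <string>

    // Illustrative log record: the payload is the shard's own logical update;
    // timestamp_us is taken from the synchronized clock at the start of the append.
    struct TimestampedRecord {
        uint32_t shard_id;
        uint64_t timestamp_us;   // synchronized-clock time, in microseconds
        std::string payload;     // opaque update (key, value, opcode, ...)
    };

    // Conservative cross-shard ordering rule from Section 4: a record r1 is
    // treated as happening before r2 whenever r1 carries a smaller timestamp,
    // even if the two writes are not actually causally related.
    inline bool mustApplyBefore(const TimestampedRecord& r1,
                                const TimestampedRecord& r2) {
        return r1.timestamp_us < r2.timestamp_us;
    }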


Figure 4: High level architecture of Slogger.

Figure 5: Write-ahead log and PSA state of a shard's replica in the data store. Slogger related changes in the log implementation are in bold font (all PSA state variables are in bold font).

However, so far this seems to be the only viable approach, given that ordering relations established through application logic, or through visibility of updates, are opaque to the data store (e.g., inspecting just the log records on different logs does not reveal such ordering relations). Furthermore, there are interesting optimizations, such as pipelining, batching, and some key design decisions made in Slogger, that significantly mitigate performance overheads due to this superfluous order enforced between unrelated log records. However, the fundamental question remains: how do we enable synchronized timestamps in an asynchronous distributed infrastructure?

High Accuracy Distributed Clocks. Synchronizing clocks between network-connected nodes is an old problem. The Network Time Protocol (NTP) [31] is a standardized distributed clock protocol used most widely in the industry for several decades. NTP achieves accuracy (maximum deviation from the time, also called clock drift) within 10s of microseconds to 10s of milliseconds [14] based on the network topology and data center size. Events occurring at a coarser granularity can be tagged with NTP timestamps to establish a happens-before relation in terms of global time.

Existing scalable distributed systems such as Google's Spanner [8] use a combination of atomic clock appliances and GPS devices to determine the clock drift, perceived by all the clock appliances and GPSes. This clock drift is embodied by a time interval abstraction called TrueTime. Spanner's transaction commit protocol is managed to ensure that a transaction's beginning and end do not fall in even partially overlapping TrueTime (time ranges). This guarantee is used to establish a linearizable order of Spanner transactions. Spanner's TrueTime range was reported to be in the order of single digit milliseconds [8]. This clock accuracy however spans the multi-data center, geo-replicated system. We require clock synchronization within a data center where the data store is hosted; the backup site's clock does not need to be synchronized with the primary site's clock.

Orthogonal to the Spanner work, the Precision Time Protocol (PTP) [28], which was standardized about a decade earlier, uses dedicated hardware resources on network routers, switches, and end points to achieve much higher accuracy, in the order of sub-microsecond ranges [29, 34, 47], within a data center. More recent work [14, 29] has proposed schemes that achieve even greater accuracy, close to single or double digit nanoseconds.

Using Synchronized Clocks. Slogger requires clock synchronization between shard leaders at the primary data center only. No clock synchronization is required across data centers. We assume that the data center used to host our Slogger-augmented distributed data store has PTP or other more recent clock synchronization infrastructure. This delivers clock drifts at a sub-microsecond scale.

We also assume that replication of log appends is a comparatively much coarser operation, taking tens to hundreds of microseconds, or even milliseconds. Thus causally dependent updates of OS1 and OS2, as depicted in Figure 3 (a), will be separated by timestamps that are consistent with the order of the updates.

Note that if a log record is timestamped at the beginning of its append operation, the ordering indirectly created by concurrent reads, as depicted in Figure 3 (b), is also correctly embodied by the timestamps. This is because the append itself is expected to take an interval that is much greater than the clock's drift, due to the much coarser replication latency. As a result, the read of OS1 will happen at a time strictly greater than the write's timestamp by more than the clock drift window. Furthermore, the read of OS2 happens after the read of OS1. Even if the two reads happen quickly, within the clock accuracy window, if the read of OS2 returns its old value, it is guaranteed that the write of OS2 will have a timestamp that is greater than the timestamp of OS1's write: we assume that the data store's implementation guarantees that if the write of OS2 had a timestamp less than the time at which its read happened, the read would be ordered after the write in shard S2, and thus return its new value. This can be easily enforced by introducing synchronization between the competing read and write of OS2.

If the PTP daemon returns a synchronization error indicating that the local clock could be out of sync, that shard will stop accepting new updates. If the PTP implementation on a shard leader fails in a byzantine way (i.e., returns random time), Slogger may not be able to guarantee linearizability among operations that touch a set of shards that includes the faulty shard. This failure semantics is similar to the semantics of modern systems that rely on synchronized clocks [8].

5. THE SLOGGER ARCHITECTURE

Synchronized timestamps are foundational to Slogger's functioning. However, there are several other key components of Slogger that need to be architected carefully to deliver efficient and scalable DR. In particular, we would like to answer the following questions: How does Slogger ensure that the backing up process does not affect performance of the primary data store? How does Slogger absorb and apply a large number of logs from the primary to the backup site correctly and in a scalable way? How does Slogger tolerate local faults in the primary site as well as the backup site? How does Slogger bootstrap a new backup data store from an existing data store? How does failover work in Slogger when a disaster occurs at the primary site? Does integration of Slogger in an existing distributed data store entail significant changes to the data store's architecture and implementation? This section addresses all these questions in detail.
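The paper does not show Slogger's clock interface; the sketch below only illustrates the assumption that a shard leader can read a PTP-disciplined system clock when it begins a log append. The helper name is hypothetical, and whether CLOCK_REALTIME is the PTP-disciplined clock depends on the host's time-synchronization setup.

    #include <cstdint>
    #include <time.h>

    // Hypothetical helper: read the local clock, assumed to be disciplined by
    // PTP (or a similar intra-data center synchronization protocol), so that
    // timestamps taken on different shard leaders are comparable.
    inline uint64_t synchronizedNowMicros() {
        timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return static_cast<uint64_t>(ts.tv_sec) * 1000000ULL +
               static_cast<uint64_t>(ts.tv_nsec) / 1000ULL;
    }

    // Per Section 4, the timestamp is taken at the *beginning* of the append,
    // before replication starts, so the much coarser replication latency
    // dominates any residual clock drift between leaders, e.g.:
    //   record.timestamp_us = synchronizedNowMicros();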

Figure 4 depicts the high level architecture of Slogger. It contains three principal components: (i) Primary Slogger Agent (PSA), (ii) Backup Slogger Agent (BSA), and (iii) Watermark Service. A PSA is embedded in each replica of every shard in the data store on the primary site. The PSA of a shard's leader asynchronously sends newly committed log records to the corresponding backup shard leader's BSA (each backup replica has a BSA). For scalability on the backup site, each shard's log grows independently of other shards' logs. As a result, an out-of-band mechanism is needed to construct the most recent consistent state of the backup data store. The mechanism must ensure that the backup site constructs a version of the data store that is consistent with the current (or a prior) version of the primary data store. Slogger uses the watermark service to that end. Slogger also requires a few minor modifications to the data store's write-ahead log.

5.1 The Write-Ahead Log

Slogger's functionality relies on the existence of a write-ahead log in the data store it augments. We assume commonly found attributes in this log: The log is finite sized and circular. Appends happen at the tail end of the log, marked by a tail index that is advanced by an append. The head end is marked by a head index. It is advanced (also called log trimming in the literature [32, 42]) after a collection of log records at the head end are applied to the shard's state machine. The log also contains a commit index marking the index up to which log records have been replicated to a threshold number of replicas on the primary site (e.g. a majority of the replicas in a consensus based replication scheme). In addition, the log contains an applied index marking the log index up to which log records have been applied to the shard's state machine.

Figure 5 depicts a data store's write-ahead log and the relevant changes needed to support Slogger. For Slogger, we need to add a new index to the log called b-commit, which represents the index up to which the backup shard has received and committed (replicated) the log. b-commit is never greater than commit, and is never less than head. The difference between b-commit and commit essentially represents the lag in the backup shard in comparison with the primary shard. There is no correlation between b-commit and applied other than both have to be between head and commit. However, they do have an effect on the log trimming task in that head can be advanced to the smallest of b-commit and applied. In Figure 5, since b-commit lags behind applied, a log trim operation can advance head only up to b-commit (shown by the dashed arrow). Since its value can be reconstructed by consulting the backup shard, b-commit does not need to be a part of the persistent metadata of the log at the primary site.

Slogger augments each log record with a timestamp field that contains the time returned by the synchronized clock at the beginning of the corresponding log append. Slogger also needs each log record's index embedded in the log record itself. This p-log-indx index is used by the backup shard to pass on to the primary shard the log index of the latest log record that the backup shard has appended to its log. Since the backup data store is not necessarily a physical mirror of the primary data store (e.g. replication factors may be different on both sites), the configuration updates at the backup shard may be different from the configuration updates in the primary shard. As a result, the p-log-indx of a log record may be different from its index in the corresponding backup shard's log. For instance, a log record appended at the primary shard's log may have an index of 10, which would become that log record's p-log-indx. However, the corresponding backup shard's log may append that same log record at index 8. The backup shard can use that log record's p-log-indx (10) to inform the primary shard that it has received log records up to index 10. The primary shard then knows that it needs to back up log records from index 11.

5.2 Primary Slogger Agent (PSA)

A PSA is hosted in each replica of each primary shard. It is an independent thread of execution that continuously monitors the shard's log and forwards any newly committed log records to the backup site. At any given time, only the primary shard's leader's PSA (henceforth, just PSA for brevity) is actively engaged in replication to the backup site.

PSA sends log records to the backup site in the order they were appended at the primary shard's log. It needs to track the latest log record that was successfully received (but not necessarily committed) at the backup site. It does so with the sent-acked index – after replicating a log record, the backup shard sends an acknowledgment (ack) back to the primary leader's PSA, which the latter uses to update sent-acked.

We add another field to PSA's metadata, the sent-spec index, which indicates the log index up to which log records have been sent to the backup site. This index enables PSA to send additional (committed) log records while acks for previously sent log records are pending. It is advanced by PSA immediately after new log records are sent to the backup site. The number of pending acks can be capped by a configurable threshold after which PSA blocks, waiting for those acks. Log records are resent if their acks are not received in a preconfigured timeout interval. sent-spec always trails the log's commit, and is in turn trailed by sent-acked. sent-spec and sent-acked become equal when PSA receives acks for all sent log records. Figure 5 depicts PSA's sent-acked and sent-spec pointing to different log records.

sent-acked is never behind the log's b-commit. PSA also hosts a local (cached) copy of the log's b-commit for expedited lookup of its value. Note that PSA's state variables do not need to persist; they can be reconstructed from the primary and backup shards' states.

Figure 6: The BSA and Watermark service.

5.3 Backup Slogger Agent (BSA)

A BSA is hosted in each replica of each backup shard. Like the PSA, the BSA is an independent thread of execution that performs a sequence of operations that interact with the corresponding PSA, the backup shard's log, and the watermark service. Only the backup shard's leader's BSA (henceforth, just BSA for brevity) performs all these operations.

Figure 6 depicts BSA's architectural details. BSA's responsibilities include receiving log records sent by PSA. Log records are received in the order they were appended at the primary shard's log; log records received out-of-order are nacked. BSA sends the received log records to its host replica – the append() call in Figure 6. It also stores the p-log-indx of the last received log record in the local variable append-indx. While the host replica (leader of the backup shard) replicates these log appends, BSA sends an ack for those log records back to PSA. Both the ack/nack messages contain the value of append-indx, which confirms BSA's progress with PSA. At PSA, the received append-indx value overwrites its sent-acked.

Recall that only committed log records are sent by PSA to BSA. We consider the backup shard's log records committed only after they are successfully replicated to its replica set (based on the replication criterion, e.g. majority, as dictated by the data store's implementation). The backup shard's log maintains its own commit that is independent of the primary shard's log's commit. The backup shard's commit is advanced as log records are successfully replicated to the shard's replicas. BSA monitors this commit and copies the last committed log record's p-log-indx into a local variable called commit-indx. BSA adds the updated commit-indx in the aforementioned ack/nack it sends back to PSA, which uses the value to advance its b-commit.

Committed log records of the backup shard cannot be immediately applied to the shard's state machine since simply applying those log records can lead to inconsistencies in the backup site if the primary happens to fail at an arbitrary time (§3). A committed log record r becomes eligible for application only after all shards on the backup site have committed all the log records with timestamps less than r's timestamp. To accomplish that efficiently, BSA periodically performs two actions: (i) it queries the backup shard's log to determine the timestamp of the latest committed log record, and stores it in a local variable called lcommit-ts; and (ii) it thereafter sends the lcommit-ts to the watermark service, which maintains a repository of the largest timestamps of committed log records observed at each shard.

5.4 The Watermark Service

The watermark service determines what log records can be applied safely to the backup data store. Each BSA periodically sends its lcommit-ts to the watermark service. The watermark service maintains a vector of timestamps, called the shard timestamp vector (STV), each element of which represents the largest timestamp received from each backup shard's BSA. The minimum of these timestamps, which we call the backup's watermark, indicates the time up to which all backup shards have committed log records. After the watermark service receives lcommit-ts from a BSA, it writes the received lcommit-ts to that backup shard's slot in its STV. It then computes a new minimum watermark available in the STV and sends it back to BSA. Figure 6 depicts the STV for N shards, and shows receipt of a lcommit-ts in the STV slot labeled ts2. The figure also shows computation of the new watermark using the MIN() function. The watermark service responds to BSA with the newly computed watermark.

The new watermark received at BSA is used to update its watermark, and is then forwarded, via the advance-apply-to() call shown in Figure 6, to the shard to advance its log's apply-to index. The backup shard updates apply-to based on the received watermark. In the end, apply-to points to the shard's latest committed log record that can be applied to the shard's state machine. Note that apply-to does not need to be a part of the persistent metadata of the backup shard, and remains a simple counter in the non-persistent DRAM-resident metadata of the backup shard's log. Each shard periodically exchanges timestamp messages with the watermark service to stay up-to-date with the backup data store's progress.

Throughout its execution, the watermark service maintains the invariant that its watermark is monotonically increasing. Note that the watermark service itself is an optimization of the more straightforward solution where each BSA periodically broadcasts its lcommit-ts values to all other BSAs on the backup data store. In the end, the watermark's ground truth, the lcommit-ts values in log records of each shard's log, remains persistently stored in the shard logs. Thus even if the watermark service fails, it can be easily reinitialized from all the backup shard logs' lcommit-ts values.

5.5 Putting it all together

5.5.1 Replica Initialization

When a replica in a backup shard is created, it first performs all the initialization work relevant to the data store's implementation. It then launches its BSA. The launched BSA remains idle until its host replica becomes the shard's leader. When the BSA detects that its host replica has become the leader, it performs the following steps: First, it determines the shard's lcommit-ts by reading the last committed log record's timestamp. Second, it establishes a connection with the watermark service to determine the backup site's global watermark. Third, if it detects a change in the global watermark, it advances the shard's apply-to to inform the shard that more log records may be applied to its state machine. Finally, the BSA listens for a request from the PSA in the corresponding primary shard.

Similar to BSAs on the backup site, at the primary site, each PSA is launched in its host replica and remains idle until it detects that the host replica has become the leader of the primary shard. A new leader's PSA builds its state by querying the corresponding backup shard's BSA. The BSA responds with its append-indx and commit-indx values to the newly identified PSA. If the BSA is not the backup leader, it responds with a NOT LEADER message. PSA then queries all backup shard replicas to determine the correct leader. The correct leader's BSA responds with its append-indx and commit-indx values. commit-indx serves as the initial value of the primary shard's b-commit, whereas append-indx is used to initialize PSA's sent-spec and sent-acked. Thereafter, PSA can asynchronously start shipping its shard's newly committed log records to BSA.
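The core of the watermark service is small enough to sketch. The following C++ fragment is illustrative only (RPC plumbing, persistence, and failure handling are omitted, and the names are not Slogger's); it shows the STV update and the MIN() computation described in §5.4.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Illustrative sketch of the watermark service state: a shard timestamp
    // vector (STV) holding the largest committed-record timestamp (lcommit-ts)
    // reported by each backup shard's BSA.
    class WatermarkService {
    public:
        explicit WatermarkService(size_t numShards) : stv_(numShards, 0) {}

        // Called when shard `shardId` reports a new lcommit-ts. The slot only
        // moves forward, keeping the global watermark monotonically increasing.
        uint64_t onLcommitTs(size_t shardId, uint64_t lcommitTs) {
            stv_[shardId] = std::max(stv_[shardId], lcommitTs);
            return watermark();   // returned to the reporting BSA
        }

        // The global watermark: the minimum over all shards' reported timestamps.
        // Every log record with a timestamp <= watermark has been committed by
        // all backup shards and is therefore safe to apply.
        uint64_t watermark() const {
            if (stv_.empty()) return 0;
            return *std::min_element(stv_.begin(), stv_.end());
        }

    private:
        std::vector<uint64_t> stv_;
    };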

Figure 7: Detailed control flow of Slogger.

5.5.2 The Backup Process

Figure 7 shows the full flow of Slogger's asynchronous backup process for each update received at the primary shard's leader. The whole process takes 10 steps (labeled in the figure), some of which happen concurrently.

1. The primary shard's leader achieves consensus on its recently received update request by replicating to a majority of its followers, and advances its log's commit. The leader timestamps the log record with the synchronized clock's value as a part of the append. Its PSA periodically checks commit for changes and detects that a new log record is available for asynchronous backup.

2. PSA forwards the recently committed log record, in the form of a BACKUP request, to its "peer" BSA (on the corresponding backup shard's leader). After sending the log record, PSA advances its sent-spec.

3. On receiving PSA's log record, BSA can send back one of three responses: (i) A NOT LEADER message if its host replica is not the backup shard's leader. (ii) A nack indicating that the log record was received out-of-order. The nack contains BSA's append-indx, which indicates the last log record it received in the correct order. PSA can use the received append-indx to resend log records that were not received by BSA in the correct order. (iii) An ack response indicating to PSA that the log record will be appended at the backup shard. Once PSA receives an ack it advances sent-acked to the log record for which it just received the ack. In both the nack and ack responses, BSA also embeds its append-indx and commit-indx values to help PSA adjust its sent-spec, sent-acked and b-commit indexes.

4. The newly received log record at BSA is then forwarded to the host replica to append to its log. Note that at this stage BSA assumes that its host replica is still the shard's leader. If that is not the case, BSA detects the change in the leader's state and goes back to a passive state where it no longer processes log records received from its peer PSA.

5. The backup shard's leader attempts to achieve consensus on the newly appended log record, advancing its commit if it succeeds. The details of this step are data store specific.

6. BSA periodically determines if new log records were committed at the backup shard. If so, it updates commit-indx, which is forwarded to PSA in response to a subsequent log append message it sends (see step 3). BSA also updates its lcommit-ts with the last committed log record's timestamp.

7. If lcommit-ts changed (increased), BSA sends its new value to the watermark service.

8. On receiving a new lcommit-ts of a shard from its BSA, the watermark service updates the shard's slot in its STV with the lcommit-ts, and then computes the new global watermark.

9. The watermark service responds to BSA with the new global watermark.

10. On receiving the new watermark value from the watermark service, BSA updates its local watermark and uses it to update the replica's apply-to to the latest log record with a timestamp less than or equal to the watermark. The replica is then enabled to apply all its log records up to the updated apply-to.

Note that the above described algorithm leads to an interesting backup model: an ever-evolving prefix of the history of the data store's state. It does so by creating a consistent cut across all shard logs, delineated by the monotonically increasing global watermark's value. In a sense, each advance in the global watermark represents a consistent snapshot of the entire data store. However, the snapshot's lifetime exists only briefly, until the global watermark advances to a higher value. This makes Slogger's approach quite distinct from traditional snapshot-based backups that identify (and embody) the consistent state of the data store at a specific point in time. In the traditional snapshot approach, a user can refer to the specific snapshot generated in the past even though the primary data store may have changed substantially since the snapshot was taken. Slogger creates a continuously evolving backup, where retrieving the state of the data store at a distinct point in the past is not possible. Nonetheless, Slogger achieves DR in its unique way. Prior works supporting continuous backups like Slogger [11, 21, 22, 30] do not use timestamps and watermarks to advance the backup data store's state evolution. Furthermore, they do not achieve DR for linearizable distributed data stores.

5.5.3 Keeping up with the Primary

As stated earlier, in Slogger's replication algorithm, all backup shards indefinitely receive (and replicate) log records independently of each other. However, backup shard logs are finite in size and hence must be applied to shard state machines to make space available for future log records.

As long as the watermark service keeps receiving monotonically increasing lcommit-ts values from all shards in the backup site, Slogger's replication algorithm ensures that the global watermark remains monotonically increasing, thus guaranteeing liveness. We require a foundational assumption for liveness though – each primary shard keeps sending log records with monotonically increasing timestamps at regular intervals. To that end, we require each primary shard to generate log records at regular intervals, even when the shard does not receive any update requests from the data store's client applications. This can be achieved by making the primary shard's leader append NOP log records at regular intervals in its log. This capability is available in most, if not all, production quality distributed data stores. Thus, even if a shard does not receive update requests for prolonged intervals, it will keep producing timestamped NOP log records that help continuously increase the global watermark, thus guaranteeing liveness.

For Slogger to work, the whole backup process must be able to keep up with the primary site's update rate. This implies that the log at each primary shard has enough capacity to receive new log records while older log records get replicated and applied to the backup site.

Another crucial aspect of Slogger that helps the backup site keep up with the primary is its ability to do batching at various stages in the entire backup lifecycle: First, PSA can send a collection of contiguous log records batched in a single request to its peer BSA. Second, the log records forwarded by BSA to its host replica can also leverage any batching capabilities supported by the data store itself. Third, determination of the lcommit-ts by BSA is amenable to covering a large batch of recently committed log records in one shot – BSA needs to simply determine the timestamp of the last committed log record. Lastly, traffic to the watermark service can also be modulated by controlling how often BSA sends its lcommit-ts to the watermark service.

5.5.4 Fault Tolerance

We assume that the original data store tolerates non-byzantine fail-stop failures. We leverage this capability of the data store to achieve fault tolerance in Slogger. The obvious points of failure are PSA, BSA, and the watermark service. Under the fail-stop failure model that we assume, failure of the PSA implies failure of its host replica. If it is a follower replica that fails, nothing needs to be done since its PSA (or BSA) is passive. However, if the leader replica fails, the data store initiates its leader election protocol to elect a new leader, after which the new leader's PSA (or BSA) becomes active.

When a new PSA begins execution, it needs to first initialize its variables (b-commit, sent-spec, and sent-acked). It does so by sending a BACKUP request to its peer BSA. BSA's response to the message, whether an ack or a nack, contains its current append-indx and commit-indx values that are used to initialize PSA's variables. Even if the primary shard contains multiple active PSAs at the same time, Slogger works correctly since PSAs replicate only committed log records, and BSAs ignore redundantly received log records.
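Step 10 of the backup process can be illustrated with a short sketch. The C++ fragment below is not Slogger's code; it only shows how a backup replica might advance apply-to to the latest committed record whose timestamp does not exceed the global watermark.

    #include <cstdint>
    #include <vector>

    // Illustrative record as seen in a backup shard's log.
    struct BackupRecord {
        uint64_t timestamp_us;  // synchronized-clock timestamp from the primary
        bool     committed;     // replicated to the backup shard's replica set
    };

    // Sketch of step 10: given the global watermark returned by the watermark
    // service, advance apply-to past every committed record whose timestamp is
    // <= the watermark. Records below apply-to may then be applied to the
    // backup shard's state machine. (Indexes here are positions in `log`.)
    inline size_t advanceApplyTo(const std::vector<BackupRecord>& log,
                                 size_t applyTo, uint64_t watermark) {
        size_t next = applyTo;
        for (size_t i = applyTo; i < log.size(); ++i) {
            if (!log[i].committed || log[i].timestamp_us > watermark)
                break;              // stop at the first record that is not yet safe
            next = i + 1;           // records [0, next) are safe to apply
        }
        return next;
    }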

BSA's failure is similarly tied to the failure of its host replica. Initialization of BSA's variables is done by consulting its host replica's log. In addition, BSA needs to communicate with the watermark service to take over the responsibility of forwarding lcommit-ts values for the shard. It also initializes its watermark variable by querying the watermark service.

The backup algorithm for a shard may experience some delays if the leader of the primary or backup shard fails. However, we expect this delay to be largely related to the leader election process that is triggered after a leader fails. In most leader failure scenarios, we expect this delay to be insufficient to stall the backup process to the extent that the backup shard cannot keep up with the primary shard.

As stated earlier, the watermark service is itself an optimization. If it fails, a new instance of the service is spun up and its STV is initialized by communicating with all the backup shards.

In principle, there exists a pathological scenario where one or more primary shards fail or cannot communicate with their backup counterparts. In such circumstances, the failed shard's lcommit-ts will not advance, thus stalling the global watermark. If other shards continue to serve client requests, their logs may fill up on the backup site. This would create a "back pressure" on the corresponding primary shards, which may need to be stalled, thus escalating a single shard "outage" to multiple shards in the data store. Such a pathological scenario can be handled pragmatically by dynamically allotting greater space for logs and, at the same time, informing the data store administrator of such circumstances.

5.5.5 Bootstrapping Backup Shards

We assume the distributed data store has capabilities to bootstrap a new replica from an existing replica set of a shard. Slogger leverages these same capabilities to spin up the backup shard (and its replicas) on the backup site.

5.6 Handling Disasters

While highly available data stores can tolerate failures of a significant number of resources, it is not possible to do so beyond a particular threshold (e.g. partial or full destruction of a data center due to an earthquake that makes some or all shards unavailable). We call such a scenario a disaster, and leave it up to the data store administrator to make that determination. Thus a disaster is designated, and recovery from it is triggered, manually by the administrator.

Designating a disaster at the primary site eventually leads to its shutdown. This can be done by leveraging metadata servers that are typically present in production quality distributed data stores. The data store administrator can flag a disaster by switching any live metadata servers to the disaster mode. Each primary replica periodically pings the metadata servers to determine its health. If it is still available, a metadata server responds with a DISASTER flag that signals the primary replica that it needs to shut down. If the metadata servers are unavailable, the ping times out. In both cases, the primary replica must assume a disaster and shut down. Thus eventually all replicas in the primary site will shut down.

Triggering recovery at the backup site leads to a number of steps that help the backup bring up a new primary site. First, the backup site's metadata servers are flagged by the administrator for a RECOVERY mode. Each backup replica periodically pings the backup metadata servers to determine its health. If a metadata server responds with a RECOVERY flag, the replica goes into recovery mode. If the replica is the leader of a shard, it computes its final lcommit-ts value, and sends it to the watermark service. The leader also marks its message as the final message. The watermark service thus receives final messages from all backup shards, and broadcasts the final global watermark to all backup shards. Then, the leader of each shard applies all log records based on the received watermark and sends a recovery completion message to the watermark service. Once the watermark service receives recovery completion messages from all shards, the backup is designated to have recovered. It can thereafter act as the new primary site for the data store or can be used to bootstrap a new backup site.
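The health-ping protocol above can be summarized in a small sketch. The C++ fragment below is illustrative only: the metadata-server API, the enum, the helper functions, and the ping period are all hypothetical, and the real shutdown/recovery logic is data-store specific.

    #include <chrono>
    #include <thread>

    // Hypothetical status returned by a metadata server health ping.
    enum class SiteStatus { OK, DISASTER, RECOVERY, TIMEOUT };

    // Placeholders standing in for the real RPC and data-store hooks.
    inline SiteStatus pingMetadataServer() { return SiteStatus::OK; }
    inline void shutDownReplica() {}        // primary side: stop the replica
    inline void enterRecoveryMode() {}      // backup side: leader sends final lcommit-ts

    // Sketch of the periodic health check in Section 5.6: a primary replica
    // shuts down on a DISASTER flag or a timed-out ping; a backup replica
    // switches to recovery mode on a RECOVERY flag.
    inline void healthCheckLoop(bool isPrimaryReplica) {
        for (;;) {
            SiteStatus s = pingMetadataServer();
            if (isPrimaryReplica &&
                (s == SiteStatus::DISASTER || s == SiteStatus::TIMEOUT)) {
                shutDownReplica();          // assume a disaster in both cases
                return;
            }
            if (!isPrimaryReplica && s == SiteStatus::RECOVERY) {
                enterRecoveryMode();
                return;
            }
            // Ping period is illustrative, not taken from the paper.
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
    }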
6. IMPLEMENTATION

To verify the safety property of Slogger, we wrote a complete system specification and used the TLA+ model checker [24, 27] to check it. Our specification verified the safety property under a range of failure scenarios including primary leader failure, backup leader failure, and failure of the watermark service.

To evaluate Slogger's effectiveness, we integrated it with LogCabin [37], a linearizable key-value (K-V) store that uses the Raft consensus protocol [38] for synchronous replication. LogCabin is a single shard K-V store. To build a prototype distributed K-V store, we added a key-space sharding layer that hosts each shard in a 3-way replicated LogCabin instance. The key-space sharding map, which is static in our implementation, is directly available to clients through a thin client-side stub library. The replica set for each shard is also statically defined in a configuration file accessible to all replicas. This configuration file was augmented with details of the backup shard replica sets as well as the watermark server. The static primary-backup configuration setting helped us simplify our prototyping effort for expedited experimentation. We assume that the machines' local clocks are synchronized with a synchronized clock infrastructure.

We implemented Slogger using C++, which was also used to implement LogCabin. We implemented Slogger as a module in LogCabin. Slogger's code implements all its components – PSA, BSA, and the watermark service. LogCabin uses an operation log to replicate its leader's updates to the follower nodes. We modified this log implementation to incorporate all the fields and processing relevant to Slogger. These modifications amounted to 130 lines of code, which attests to our assertion of little disruption in a data store to integrate Slogger with it. To support failover after a disaster (§5.6), we added the metadata server functionality to the watermark service.
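The paper does not specify the format of the augmented configuration file; the following hypothetical C++ structures only illustrate the kind of information it carries (primary and backup replica sets per shard, plus the watermark server endpoint).

    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical parsed form of the augmented configuration file (Section 6).
    struct Endpoint {
        std::string host;
        uint16_t    port;
    };

    struct ShardConfig {
        uint32_t              shardId;
        std::vector<Endpoint> primaryReplicas;   // e.g., 3-way replicated LogCabin instance
        std::vector<Endpoint> backupReplicas;    // added for Slogger
    };

    struct ClusterConfig {
        std::vector<ShardConfig> shards;         // static key-space sharding map
        Endpoint                 watermarkServer;  // added for Slogger
    };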

1436 Figure 8: Throughput vs. latency for different systems for various read (R) to write (W) ratios when varying the number of clients between 1 and 750. machine in Wisconsin data center has a dual-socket Intel Xeon Sil- issue a new request until the old request is successfully acknowl- ver CPU with 10-cores per socket, 192GB of RAM, a 10Gbps NIC edged by the system. Consequently, this workload has a long chain for local network traffic, and a 1Gbps NIC for non-local network of happens-before relation as each new operation causally depends traffic. The network round trip time (RTT) between Clemson data on all previous operations issued by the same client. Before send- center and Wisconsin data center is 25.5 milliseconds. The net- ing a request, the client hashes the key to determine the destination work RTT time within Clemson and Wisconsin data centers are shard. 0.068 milliseconds and 0.033 milliseconds, respectively. In all ex- periments, unless explicitly stated otherwise, we used a cluster of 7.1 Performance Evaluation 32 shards and each shard is 3-way replicated. In this section, we compare throughput and latency of Slogger Alternatives. We compare the following alternatives for design- with its baseline and incremental snapshotting. As the throughput ing a disaster recovery system: of incremental snapshotting is directly affected by the snapshot size • Baseline. We used LogCabin original implementation as a (we evaluate the effect of snapshot size in the next section), we use performance baseline. This option only serves client requests two snapshot sizes: small snapshots of 32KB, and large snapshots at the primary data center, i.e., the data is not georeplicated. of 2MB. This alternative presents the upper bound for system perfor- Figure8 shows the throughput and latency for various sys- mance. tems for different read-write ratios when varying the number of clients. LogCabin presents the baseline performance without geo- • Incremental Snapshotting. LogCabin supports snapshotting, replication. Slogger does not lead to noticeable overheads com- but it generates a complete system snapshot on every snap- pared to LogCabin for any of the workloads as it has been archi- shot interval. We modified LogCabin to perform incremental tected to do continuous backup of a distributed data store off the snapshotting [4]; a new snapshot will only include the set client-facing critical path. On the other hand, incremental snap- of objects that have been updated since the previous snap- shotting has a significant overhead on the performance of the sys- shot. The write-ahead log is used to identify the set of up- tem; for write-only workload, using small snapshots reduces the dated objects since previous snapshot. Then, these objects throughput of LogCabin by up to 50% and increases the latency by are sent to the backup site and used to update the state ma- up to 75%. Large 2MB snapshots reduces the overhead on the sys- chine of the backup shards. In our implementation, a new tem throughput, yet achieving considerably lower throughput than snapshot is taken when the snapshot size exceeds a config- both LogCabin and Slogger. At the same time, larger snapshots sig- urable threshold, or when the duration since the last snapshot nificantly increase the lag of the backup site (§7.2.3). For read-only exceeds a configurable threshold. 
Workload. We used the standard YCSB benchmark [7] to evaluate Slogger. We used a uniform workload, as Slogger reads committed log records from the replicated log, which is agnostic to the workload distribution; we also experimented with a Zipf distribution and obtained similar results. We experimented with a pre-populated data store containing 50 million key-value pairs, each with a key size of 24 bytes and a value size of 512 bytes. In all our experiments, the clients issue blocking requests in a closed loop, i.e., they do not issue a new request until the previous request has been acknowledged by the system. Consequently, this workload has a long chain of happens-before relations, as each new operation causally depends on all previous operations issued by the same client. Before sending a request, the client hashes the key to determine the destination shard.

Figure 8: Throughput vs. latency for different systems for various read (R) to write (W) ratios when varying the number of clients between 1 and 750.

7.1 Performance Evaluation
In this section, we compare the throughput and latency of Slogger with its baseline and with incremental snapshotting. As the throughput of incremental snapshotting is directly affected by the snapshot size (we evaluate the effect of snapshot size in the next section), we use two snapshot sizes: small snapshots of 32KB and large snapshots of 2MB.
Figure 8 shows the throughput and latency of the various systems for different read-write ratios when varying the number of clients. LogCabin presents the baseline performance without geo-replication. Slogger does not introduce noticeable overheads compared to LogCabin for any of the workloads, as it has been architected to perform continuous backup of a distributed data store off the client-facing critical path. On the other hand, incremental snapshotting imposes a significant overhead on system performance; for the write-only workload, using small snapshots reduces the throughput of LogCabin by up to 50% and increases the latency by up to 75%. Large 2MB snapshots reduce the overhead on system throughput, yet still achieve considerably lower throughput than both LogCabin and Slogger. At the same time, larger snapshots significantly increase the lag of the backup site (§7.2.3). For the read-only workload (Figure 8(d)), all systems have the same performance as no objects are being updated.

7.2 Data Loss Window
Slogger is a continuous backup system that has the promise of delivering a distributed data store backup that is closely behind the primary data store. The immediate consequence is a backup system with a tiny (millisecond-scale) data loss window. To assess this data loss window, we conducted an experiment that measures a metric called the backup site's lag. Intuitively, the lag measures how far behind, in terms of time, the backup site's state is from the primary site's state. For example, a lag of 100 milliseconds implies that the backup site did not receive at least some log records that were committed at the primary site during the last 100 milliseconds. If the primary site shuts down at that point, the backup will lose updates that happened over the last 100 milliseconds.
Understandably, the lag varies over time. Capturing the precise lag in a non-intrusive way over the duration of the distributed data store's lifetime is perhaps an infeasible undertaking. However, we can get an approximation of a typical lag by finding its bounds using some lightweight instrumentation. We now define the upper and lower bounds of the lag, detail the experiments we used to measure them, and show the results of these experiments.

Figure 9: Lower and upper bounds of the typically observed lag at the backup site, for a write-only workload.

Figure 10: Sequence of events until the PSA learns that a log record is backed up.

7.2.1 Lower Bound of the Lag
The lower bound (lb) of the lag is the minimum lag that can be achieved between the primary and the backup sites. It is the minimum time needed to send a log record from the primary site (i.e., the PSA) to the backup site (i.e., the BSA). Clearly, lb cannot be lower than half the round-trip time (RTT) between the primary and backup sites. In Slogger, the lower bound is the time needed to perform step 2 of the backup process (Figure 7).
We measured the lower bound of the lag on a cluster of 32 shards with a ping-pong experiment. In this experiment, each shard's PSA sends a BACKUP request containing only one log record to the corresponding BSA. When the BSA receives the request, it replies immediately with an ack. When the PSA receives the ack, it immediately sends another BACKUP request. We measured the lag's lower bound from the BSA side by taking the time difference between consecutively received BACKUP requests and dividing it by 2 to get an estimate of the one-way latency (we assume symmetric latencies between the PSA and BSA). Figure 9 shows the output of running this experiment for 60 seconds. In the figure, we present the output of one shard as it is representative of the other shards' output. The average value of the lag's lower bound is 13.01 milliseconds.

7.2.2 Upper Bound of the Lag
The upper bound (ub) of the lag is the time difference between the commit time of a log record at the primary site and the time at which the output of the watermark service, the global watermark, becomes greater than or equal to the timestamp of the log record (Figure 7, step 8).
Figure 10 shows all events that occur until a log record is backed up. Note that a log record that has committed on the backup site cannot be applied until the global watermark advances to the timestamp of that log record. So ideally, what we would like to measure is t5 − t2. However, computing this value is not trivial, as timestamps t2 and t5 are captured by different entities – the PSA and the watermark service, respectively – which run on different machines in different data centers. To tackle this issue, we measure the upper bound using only the PSA's clock by computing t7 − t2 at the PSA, where t7 is the time when the PSA receives the notification from the corresponding BSA. However, this period includes the time needed to send the notification from the BSA to the PSA, lb, which should not be included in the upper bound. As a result, we compute the final value of the upper bound as t7 − t2 − lb. Note that what we are effectively computing is t6 − t2. This is a conservative approximation of the precise ub, which would be the time when the watermark is advanced to a time tw, where tw ≥ t1.
Figure 9 shows the upper bound of the lag of Slogger measured using a cluster of 32 shards over a period of 60 seconds. Reported results were taken from a representative shard. The average value of the lag of Slogger remains at a fairly steady 14.2 milliseconds, with a maximum value of 16.05 milliseconds. The measured lag varies over time due to several factors. First, the PSA of each shard ships newly committed log records at the end of every 1-millisecond window, which introduces a noise of up to 1 millisecond for any log record. Second, at the backup shard, up to low single-digit millisecond noise is introduced by the Raft consensus protocol and the batching used by LogCabin. Finally, synchronization with the watermark service adds sub-millisecond noise.
The above results are largely dominated by the network latency between the data centers hosting the primary and backup stores (half of the RTT is 12.5 milliseconds). We expect the lag to ordinarily fall between the upper and lower bounds in common cases. We note that these numbers do not reflect worst-case scenarios for the lag, which can be triggered by certain rare failures (§7.4).
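In summary, the two measurements can be written as follows, where Δ_BACKUP denotes the interval between consecutively received BACKUP requests in the ping-pong experiment (our notation) and the t_i refer to the events of Figure 10:

\[
lb \;\approx\; \tfrac{1}{2}\,\Delta_{\mathrm{BACKUP}},
\qquad
ub \;\approx\; (t_7 - t_2) - lb \;=\; t_6 - t_2 .
\]

As noted above, t6 − t2 is a conservative stand-in for the ideal, but not directly measurable, t5 − t2.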
Figure 11: The throughput and the lag for different snapshot sizes with a write-only workload. In this experiment, all snapshots are taken due to exceeding the size threshold. Labels next to points represent the lag.

7.2.3 Incremental Snapshotting Lag
In this section we evaluate the lag of the incremental snapshotting approach. This lag is directly affected by the frequency with which a snapshot is created, which, in turn, is affected by the snapshot size. Figure 11 shows the throughput and the lag observed for different snapshot sizes. We define the lag for incremental snapshotting as the difference in receive time of consecutive snapshots at the backup site. As shown in Figure 11, increasing the snapshot size improves the throughput; however, it increases the lag of the backup site significantly. For instance, the lag observed with a snapshot size of 8MB is at least two orders of magnitude higher than that of a 32KB snapshot size. The minimum lag that can be achieved using snapshots is 29 milliseconds, with a 32KB snapshot size.
We note that Slogger's lag of 14.2 milliseconds is lower than the lag of the best incremental snapshotting configuration (i.e., 32KB) by 50%, and Slogger does not cause any performance degradation. Slogger outperforms incremental snapshotting for multiple reasons. First, Slogger is designed to perform the backup process off the client-facing critical path of the data store. As a result, it does not lead to any performance degradation. In incremental snapshotting, LogCabin is paused for a short duration to take a snapshot of the in-memory keys. Second, Slogger reduces the backup site lag by continuously shipping log records once they are committed. In contrast, in incremental snapshotting, updated objects are shipped all at once when the snapshot is taken, which increases the lag of the backup site.

Figure 12: Slogger backup lag measured during failures of (a) the PSA, (b) the BSA, and (c) the watermark service.

7.3 Watermark Service Scalability
The watermark service is, in a way, a central point of synchronization between all backup shards. Its scalability is therefore paramount to the viability of Slogger. Figure 13 shows the lag's upper bound for different numbers of shards. We observe slight performance degradation as the watermark service receives increasing load; the upper-bound lag for 32 shards is 4.6% higher than that for two shards. This indicates that the watermark service can scale to a large number of shards while imposing only a moderate overhead.
However, in Figure 13 we observe a significant increase in the lag's upper bound for two shards compared to that of one shard. The reason for this degradation is that each shard's PSA sends committed log records to its BSA counterpart every 1 millisecond. Since the PSAs of different shards are unsynchronized, this may increase the lag by up to 1 millisecond. To better evaluate this synchronization cost, we present another performance curve in Figure 13 (Without Watermark Service). This curve depicts the lag in a variant of Slogger where BSAs do not coordinate with the watermark service. Instead, log records committed on the backup shard are immediately applied to it. The comparison is somewhat unfair, since this variant does not guarantee correctness in the face of primary site disasters at arbitrary times. Nonetheless, it helps us gain better insight into the cost of synchronizing multiple shards. Clearly, the degradation of the lag's upper bound is largely dominated by the 1-millisecond delay that every PSA introduces before sending newly committed log records.

Figure 13: Watermark service scalability under a write-only workload.

7.4 Fault Tolerance
As discussed in §5.5.4, when a shard leader for a primary (or backup) shard or the watermark service fails, the log backup process for the affected shard stalls immediately. This invariably stalls the global watermark's progress, which in turn stalls backup progress on the other shards. Once the failure is resolved, backup for the temporarily stalled shard continues, and Slogger quickly returns to a steady state.
Figure 12 depicts the behavior of Slogger when failures happen in a shard leader (PSA or BSA) or in the watermark service. LogCabin is configured to detect replica failures at a 200-millisecond granularity, after which it triggers a leader election cycle. Leader election takes about 1 millisecond, after which the leader spins up a PSA instance that establishes a connection with the corresponding BSA in 2 network round trips. After that, the PSA resumes shipping log records to the BSA. The overall recovery takes less than 300 milliseconds. In case of a watermark service failure, we fail the watermark service for 500 milliseconds, after which a new watermark service instance is created, which quickly communicates with all backup shards and refreshes the global watermark.
As expected, in all cases, after a failure the observed lag at the backup shards grows steadily as time elapses. While the lag continues to grow, all the healthy primary shards continue to back up their log records to their healthy backup counterparts. In case of a primary shard's leader failure, the affected shard eventually gains a new leader that starts communicating with its backup counterpart. As a consequence, the lagging shard's timestamp is quickly updated to the latest time, which is bigger than the timestamps of almost all log records received in the interim by all other backup shards. This results in a quick and significant "bump" in the global watermark, because of which the lag rapidly drops to the steady state observed before the failure. In case of a backup shard's leader failure, the log records committed at the corresponding primary shard's leader are all shipped by its PSA to the new backup leader's BSA en masse, which also results in a quick bump in the global watermark. Similar reasoning holds for watermark service failures.
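Assuming the two connection-establishment round trips traverse the inter-data-center link (25.5 ms RTT in our setup), these components add up consistently with the sub-300-millisecond recovery we observe:

\[
T_{\mathrm{recovery}} \;\approx\;
\underbrace{200\,\mathrm{ms}}_{\text{failure detection}}
\;+\; \underbrace{1\,\mathrm{ms}}_{\text{leader election}}
\;+\; \underbrace{2 \times 25.5\,\mathrm{ms}}_{\text{PSA--BSA connection setup}}
\;\approx\; 252\,\mathrm{ms} \;<\; 300\,\mathrm{ms}.
\]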
7.5 Pipelining, Batching, and Log Size
A combination of pipelining and batching appears to be absolutely necessary for a backup shard to keep up with its primary counterpart – if the primary shard produces log records at a rate greater than the rate at which the backup shard receives and absorbs them, the primary shard will have to stall and stop processing client requests. Clearly, owing to the much longer communication latency between primary and backup shards, shipping one log record at a time and waiting for its ack before shipping the next one is insufficient for the backup shard to keep up with the primary shard. What we need is the right combination of pipelining and batching that yields a shard backup rate matching the maximum rate at which the primary shard can create log records. Assuming a maximum sustained rate at a primary shard of N ops/millisecond, with 1KB of space per log record, we need to find the right mix of pipelining and batching that can help the backup shard keep up with its primary counterpart. (We assume similar hardware resources – CPUs, memory, network bandwidth – at the backup and primary sites.) Clearly, sending a batch of N requests every millisecond will match the production rate at the primary shard. Note, however, that the primary shard must hold on to a shipped log record until it receives an ack from the backup shard confirming that the log record has been replicated and committed.
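A minimal sketch of the kind of shipping loop this argues for is shown below: it keeps at most a fixed number of unacknowledged batches in flight (pipelining) and packs every record committed since the previous send into the next batch (dynamic batching). The function and channel names are illustrative assumptions, not Slogger's implementation.

```go
// Sketch of a PSA-side shipping loop that combines pipelining (a bounded
// window of unacknowledged batches) with dynamic batching (each slot carries
// every record committed since the previous send).
package shipper

import "time"

type Batch struct {
	Records [][]byte
}

// Ship sends at most one batch per tick while no more than `depth` batches
// are unacknowledged. nextBatch returns all records committed since the last
// call; send transmits a batch to the BSA; acks signals completed batches.
func Ship(depth int, tick time.Duration,
	nextBatch func() Batch,
	send func(Batch),
	acks <-chan struct{},
	stop <-chan struct{}) {

	inflight := 0
	ticker := time.NewTicker(tick) // e.g., a 1 ms shipping interval
	defer ticker.Stop()

	for {
		select {
		case <-stop:
			return
		case <-acks:
			inflight-- // a pipeline slot freed by the backup shard's ack
		case <-ticker.C:
			if inflight >= depth {
				continue // pipeline full: the primary must keep records in its log
			}
			b := nextBatch()
			if len(b.Records) == 0 {
				continue
			}
			send(b)
			inflight++
		}
	}
}
```

Bounding the in-flight window is what forces the primary to hold shipped records in its log until they are acknowledged, which is exactly the log-capacity question addressed next.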

Figure 14: Pipelining, batching, and log size interplay.

In our experimental setup, the upper bound for the typical lag (t6 − t2 from Figure 10) averages to under 15 milliseconds (§7.2). We need to approximately double that time to get a conservative estimate of the round-trip time – 30 milliseconds. Thus, to keep a sustained rate of N ops/millisecond, we need a pipeline depth that matches this round-trip delay. A pipeline with a depth of 30 slots, where each slot contains a batch of N log records or more, is sufficient to help a backup shard keep up with its primary counterpart. Some over-provisioning is obviously needed to counter any jitters and spikes in the log record production rate at the primary shard.
In our experiments, each LogCabin shard processed client update requests at an approximate maximum rate of 13K ops/sec (13 ops/millisecond). A pipeline of 30 slots, with dynamic batching – grouping together all the newly committed log records in the shard's log – turned out to be sufficient. Dynamic batching typically yielded a batch size of 12−14 log records in our experiments.
The next question to address is: what is the minimum log capacity required to sustain the above log backup rate? In our experiments, the above constraints lead to a stream of around 400 in-flight log records in the pipeline between the primary and backup shards. This amounts to 400KB of sustained bandwidth consumption in the 30-millisecond window. Thus, in the absence of jitter, a log of size 400KB should be sufficient to help the backup keep up with the primary. In practice, however, occasional network and other jitters cause unexpected delays. We can compensate for those delays by provisioning 10X the minimum required log capacity – 4MB.
Figure 14 demonstrates the interplay between pipelining, batching, and the log size. In particular, it validates the above analysis by testing with different log sizes – a sufficiently big log (4 MB) is able to tolerate intermittent failures and delays, but a smaller log (200 KB) leads to significant interference. A log of 200 KB can tolerate an RTT of at most 15 milliseconds; however, the RTT in our setup is around 27 milliseconds. Hence, the primary shard's log fills up after 15 milliseconds. Consequently, the shard stalls and stops processing client requests until an ack is received from the backup shard after 27 milliseconds, freeing space in the primary shard's log. Thereafter, the primary shard starts processing client requests again. This behavior repeats every 27 milliseconds (one RTT).
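Using the numbers above (a conservative 30 ms round trip, one batch shipped per millisecond, roughly 13 one-kilobyte records per millisecond), the sizing works out as follows:

\begin{align*}
\text{pipeline depth} &\approx \mathrm{RTT} \times \text{batches/ms} = 30\,\mathrm{ms} \times 1\ \text{batch/ms} = 30\ \text{slots},\\
\text{in-flight data} &\approx 13\ \text{records/ms} \times 30\,\mathrm{ms} \times 1\,\mathrm{KB} \approx 400\,\mathrm{KB},\\
\text{provisioned log} &\approx 10 \times 400\,\mathrm{KB} = 4\,\mathrm{MB}.
\end{align*}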

7.6 Disaster Recovery
Slogger has been designed to enable rapid failover to the backup site after a disaster is declared at a distributed data store's primary site. We measured the latency of this failover mechanism. More specifically, we measured the interval between the event when the backup site (the watermark service, in our implementation) receives a RECOVERY message from the data store administrator, and the event when the watermark service receives recovery completion messages from all the backup shards. The backup site does not accept new client connections during this interval, which is observed as down-time by all clients. This "failover readiness" process happened in less than 7 milliseconds. This happens primarily because the backup site continuously applies log records it receives from the primary site during normal execution. Thus, the amount of log records that needs to be applied in the backup shards remains relatively small (at most 45 KB in our experiments).

8. RELATED WORK
Snapshots. Several projects [18, 20, 33, 40, 48] use snapshots to maintain a mirror replica in a remote data center. In these solutions, a snapshot of the primary data store is asynchronously generated and then transferred to the remote replica. SnapMirror [40] exploits the active maps in the WAFL file system to identify modified blocks and avoid sending deleted ones, which reduces the amount of data transferred to the remote replica. The main disadvantage of snapshotting, as our evaluation demonstrates, is the large data loss window, which might range from seconds to hours or days.
Remote Mirroring. Numerous commercial products provide DR for disk volumes, such as IBM Extended Remote Copy [10], HP XP arrays [6], and EMC SRDF [9]. In remote mirroring, a backup copy of the primary is maintained in a remote location. Typically, these solutions support multiple replication protocols with various reliability guarantees. For instance, HP XP arrays can operate in both synchronous and asynchronous modes. These solutions do not preserve causality between writes to different volumes, as the mirroring of different volumes is not coordinated.
Continuous Backups. Another class of DR solutions is continuous backups, where the primary data store is continuously, asynchronously, and incrementally replicated to the backup site. Invariably, these prior works asynchronously ship log records generated by the primary data store to the backup site [4, 11, 21, 22, 30, 35, 36, 44]. While this approach can drastically reduce the data loss window for a data store, we note that all existing solutions apply to data stores that either support serializability [39] or more relaxed consistency models. None of these solutions works correctly for linearizable [16] distributed data stores [8]. In particular, we show that the order in which updates are backed up using prior techniques may lead to an update order at the backup site that is inconsistent with the update order observed at the primary.

9. CONCLUSION
We presented Slogger, the first asynchronous DR system that provides correct continuous backup of a linearizable distributed data store with a tiny, millisecond-scale data loss window. Slogger backs up the data store continuously by asynchronously shipping log records of individual shards to the backup site. Slogger leverages synchronized distributed clocks to enable the backup site to apply distributed log records in the correct order. It strategically employs techniques such as concurrency, pipelining, batching, and a novel watermark-service-based coarse synchronization mechanism to create a viable backup system for distributed data stores. Empirical evaluation using LogCabin demonstrates Slogger's ability to deliver a millisecond-range backup site lag (14.2 milliseconds in our experiments), along with several other desirable characteristics such as high scalability, fault tolerance, and efficient "failover readiness" (less than 7 milliseconds in our experiments).

Acknowledgments
We thank the anonymous reviewers for their insightful feedback that improved the paper. We thank Matteo Frigo, Dan Nussbaum, and Michael Frasca for early discussions and feedback on ideas in this paper.

10. REFERENCES
[1] Aerospike: Strong Consistency mode. https://www.aerospike.com/docs/architecture/consistency.html.
[2] Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. Windows Azure Storage: A highly available cloud storage service with strong consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pages 143–157, 2011.
[3] Apache Cassandra. http://cassandra.apache.org/.
[4] Ann L. Chervenak, Vivekanand Vellanki, and Zachary Kurmas. Protecting file systems: A survey of backup techniques.
[5] CockroachDB. https://github.com/cockroachdb/cockroach.
[6] Hewlett-Packard Company. HP StorageWorks Business Copy XP user's guide.
[7] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 143–154, 2010.
[8] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. Spanner: Google's Globally-Distributed Database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, pages 251–264, 2012.
[9] EMC Corporation. EMC Symmetrix Remote Data Facility (SRDF) Product Guide.
[10] IBM Corporation. DFSMS/MVS Version 1 Remote Copy Administrator's Guide and Reference.
[11] Oracle Corporation. Disaster Recovery for Oracle Database, 2015.
[12] Aleksandar Dragojevic, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation, pages 401–414, 2014.
[13] Dmitry Duplyakin, Robert Ricci, Aleksander Maricq, Gary Wong, Jonathon Duerig, Eric Eide, Leigh Stoller, Mike Hibler, David Johnson, Kirk Webb, Aditya Akella, Kuangching Wang, Glenn Ricart, Larry Landweber, Chip Elliott, Michael Zink, Emmanuel Cecchet, Snigdhaswin Kar, and Prabodh Mishra. The Design and Operation of CloudLab. In Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference, pages 1–14, 2019.
[14] Yilong Geng, Shiyu Liu, Zi Yin, Ashish Naik, Balaji Prabhakar, Mendel Rosenblum, and Amin Vahdat. Exploiting a natural network effect for scalable, fine-grained clock synchronization. In 15th USENIX Symposium on Networked Systems Design and Implementation, pages 81–94, 2018.
[15] Jim Gray and Leslie Lamport. Consensus on Transaction Commit. CoRR, cs.DC/0408036, 2004.
[16] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, July 1990.
[17] Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, and Onur Mutlu. Gaia: Geo-distributed machine learning approaching LAN speeds. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation, pages 629–647, 2017.
[18] Minwen Ji, Alistair C. Veitch, and John Wilkes. Seneca: remote mirroring done write. In Proceedings of the General Track: 2003 USENIX Annual Technical Conference, pages 253–268, 2003.
[19] Anuj Kalia, Michael Kaminsky, and David G. Andersen. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In 12th USENIX Symposium on Operating Systems Design and Implementation, pages 185–201, 2016.
[20] Kimberly Keeton, Cipriano Santos, Dirk Beyer, Jeffrey Chase, and John Wilkes. Designing for disasters. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies, pages 5–5, 2004.
[21] Richard P. King, Nagui Halim, Hector Garcia-Molina, and Christos A. Polyzois. Overview of disaster recovery for transaction processing systems. In 10th International Conference on Distributed Computing Systems, pages 286–293, 1990.
[22] Richard P. King, Nagui Halim, Hector Garcia-Molina, and Christos A. Polyzois. Management of a remote backup copy for disaster recovery. ACM Transactions on Database Systems, 16(2):338–368, 1991.
[23] Bradley C. Kuszmaul, Matteo Frigo, Justin Mazzola Paluska, and Alexander (Sasha) Sandler. Everyone Loves File: File Storage Service (FSS) in Oracle Cloud Infrastructure. In 2019 USENIX Annual Technical Conference, 2019.
[24] Leslie Lamport. The TLA+ Home Page. https://lamport.azurewebsites.net/tla/tla.html.
[25] Leslie Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7):558–565, July 1978.
[26] Leslie Lamport. The Part-Time Parliament. ACM Transactions on Computer Systems, 16(2):133–169, 1998.
[27] Leslie Lamport. Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley Longman Publishing Co., Inc., 2002.
[28] Kang Lee and John Eidson. IEEE-1588 Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems. In 34th Annual Precise Time and Time Interval (PTTI) Meeting, pages 98–105, 2002.
[29] Ki Suh Lee, Han Wang, Vishal Shrivastav, and Hakim Weatherspoon. Globally Synchronized Time via Datacenter Networks. In Proceedings of the 2016 ACM SIGCOMM Conference, pages 454–467, 2016.
[30] David B. Lomet. High speed on-line backup when using logical log operations. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 34–45, 2000.
[31] David L. Mills. Internet Time Synchronization: the Network Time Protocol. IEEE Transactions on Communications, 39:1482–1493, 1991.
[32] C. Mohan, Don Haderle, Bruce G. Lindsay, Hamid Pirahesh, and Peter M. Schwarz. ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems, 17(1):94–162, 1992.
[33] C. Mohan and Inderpal Narang. An efficient and flexible method for archiving a data base. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 139–146, 1993.
[34] Pedro Moreira, Javier Serrano, Tomasz Wlostowski, Patrick Loschmidt, and Georg Gaderer. White Rabbit: Sub-nanosecond timing distribution over Ethernet. Pages 1–5, 2009.
[35] MySQL Documentation: Making a Differential or Incremental Backup. https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/.
[36] NetApp: SnapCenter 3.0 Archived Documentation. http://docs.netapp.com/ocsc-30/index.jsp?topic=%2Fcom.netapp.doc.df-cloud-bu-wg-cli%2FGUID-BD75E784-EDC0-4875-B6F4-ED40E000ECE2.html.
[37] Diego Ongaro. LogCabin. https://github.com/logcabin/logcabin.
[38] Diego Ongaro and John Ousterhout. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, pages 305–320, 2014.
[39] Christos H. Papadimitriou. The serializability of concurrent database updates. Journal of the ACM, 26(4):631–653, 1979.
[40] R. Hugo Patterson, Stephen Manley, Mike Federwisch, Dave Hitz, Steve Kleiman, and Shane Owara. SnapMirror: File-system-based asynchronous mirroring for disaster recovery. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, 2002.
[41] Riak KV. https://riak.com.
[42] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. In Proceedings of the Thirteenth ACM Symposium on Operating System Principles, pages 1–15, 1991.
[43] Fred B. Schneider. Implementing Fault-tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys, 22(4):299–319, December 1990.
[44] Tail-Log Backups (SQL Server). https://docs.microsoft.com/en-us/sql/relational-databases/backup-restore/tail-log-backups-sql-server?view=sql-server-ver15.
[45] Kaushik Veeraraghavan, Justin Meza, Scott Michelson, Sankaralingam Panneerselvam, Alex Gyori, David Chou, Sonia Margulis, Daniel Obenshain, Shruti Padmanabha, Ashish Shah, Yee Jiun Song, and Tianyin Xu. Maelstrom: Mitigating datacenter-level disasters by draining interdependent traffic safely and efficiently. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, pages 373–389, 2018.
[46] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1041–1052, 2017.
[47] S. T. Watt, S. Achanta, H. Abubakari, E. Sagen, Z. Korkmaz, and H. Ahmed. Understanding and applying precision time protocol. In 2015 Saudi Arabia Smart Grid (SASG), pages 1–7, 2015.
[48] Hakim Weatherspoon, Lakshmi Ganesh, Tudor Marian, Mahesh Balakrishnan, and Ken Birman. Smoke and mirrors: Reflecting files at a geographically remote location without loss of performance. In Proceedings of the 7th Conference on File and Storage Technologies, pages 211–224, 2009.
[49] Lidong Zhou, Vijayan Prabhakaran, Rama Ramasubramanian, Roy Levin, and Chandu Thekkath. Graceful degradation via versions: specifications and implementations. In Symposium on Principles of Distributed Computing (PODC 2007), August 2007.
