Scalable Replay-Based Replication for Fast Databases

Scalable Replay-Based Replication For Fast Databases Dai Qin Angela Demke Brown Ashvin Goel University of Toronto University of Toronto University of Toronto [email protected] [email protected] [email protected] ABSTRACT transfer requires a 10 Gb/s link for a single database. Fail- Primary-backup replication is commonly used for providing over and disaster recovery in enterprise environments, where fault tolerance in databases. It is performed by replaying the backup is located across buildings (possibly separated the database recovery log on a backup server. Such a scheme geographically), is thus an expensive proposition. These net- raises several challenges for modern, high-throughput multi- work links are expensive to operate or lease, and upgrades core databases. It is hard to replay the recovery log con- require major investment. A second challenge is that the currently, and so the backup can become the bottleneck. backup ensures consistency with the primary by replaying Moreover, with the high transaction rates on the primary, the database log in serial order. This replay is hard to per- the log transfer can cause network bottlenecks. Both these form concurrently, and so the backup performance may not bottlenecks can significantly slow the primary database. scale with the primary performance [13]. In this paper, we propose using record-replay for repli- These challenges can lead to the network or the backup cating fast databases. Our design enables replay to be per- becoming a bottleneck for the primary, which is otherwise formed scalably and concurrently, so that the backup per- scalable. Two trends worsen these issues: 1) the availability formance scales with the primary performance. At the same of increasing numbers of cores, and 2) novel databases [21, time, our approach requires only 15-20% of the network 15], both of which further improve database performance. bandwidth required by traditional logging, reducing network While there is much recent work on optimizing logging and infrastructure costs significantly. scalable recovery [17, 36, 35, 33], replication for fast databases requires a different set of tradeoffs, for two reasons. First, when logging for recovery, if the storage throughput becomes a bottleneck, storage performance can be easily up- 1. INTRODUCTION graded. For instance, a 1 TB Intel SSD 750 Series PCIe card Databases are often a critical part of modern computing costing less than $1000 can provide 1 GB/s sequential write infrastructures and hence many real-world database deploy- performance, which can sustain the logging requirements de- ments use backup and failover mechanisms to guard against scribed above. Much cheaper SSDs can also provide similar catastrophic failures. For example, many traditional data- performance in RAID configurations. In comparison, when bases use log shipping to improve database availability [19, logging for replication, if a network link becomes a bottle- 20, 24]. In this scheme, transactions run on a primary server neck, especially high-speed leased lines, an upgrade typically and after they commit on the primary, the database recov- has prohibitive costs and may not even be available. ery log is transferred asynchronously and replayed on the Second, unlike recovery, which is performed offline after a backup. If the primary fails, incoming requests can be redi- crash, a backup needs to be able to catch up with the pri- rected to the backup. mary, or it directly impacts primary performance [1]. Data- This replication scheme raises several challenges for in- bases perform frequent checkpointing, so the amount of data memory, multi-core databases. These databases support to recover is bounded. If the recovery mechanism doesn’t high transaction rates, in the millions of transactions per scale with the primary, the consequence is a little more time second, for online transaction processing workloads [14, 16, for recovery. However, for replication, if the backup cannot 30]. These fast databases can generate 50 GB of data per sustain the primary throughput, then it will fall increasingly minute on modern hardware [36]. Logging at this rate re- far behind and may not be able to catch up later. quires expensive, high-bandwidth storage [30] and leads to Our goal is to perform database replication with minimal significant CPU overheads [17]. For replication, the log performance impact on the primary database. We aim to 1) reduce the logging traffic, and 2) perform replay on the backup efficiently so that the backup scales with the primary. To reduce network traffic, we propose using deterministic record-replay designed for replicating databases. The This work is licensed under the Creative Commons Attribution- primary sends transaction inputs to the backup, which then NonCommercial-NoDerivatives 4.0 International License. To view a copy replays the transactions deterministically. This approach re- of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For duces network traffic significantly because, as we show later, any use beyond those covered by this license, obtain permission by emailing transaction inputs for OLTP workloads are much smaller [email protected]. Proceedings of the VLDB Endowment, Vol. 10, No. 13 than their output values. Copyright 2017 VLDB Endowment 2150-8097/17/08. 2025 For deterministic replay, we record and send the transac- While log shipping is commonly used for backup and fail- tion write-set so that the backup can determine the records over, it raises several challenges in a production environ- that were written by the transaction. On the backup, we use ment. A recent post by Uber describes these issues in de- multi-versioning, so that writers can execute concurrently, tail [31]. First, log shipping incurs significant network traffic and safely create new versions while readers are accessing the because it sends physical data, and this is expensive be- old versions. Our replay uses epoch-based processing, which cause backup machines are usually situated across buildings allows both readers and writers to efficiently determine the or data centers. Second, databases usually do not maintain correct versions to access, while allowing readers to execute storage compatibility between major releases. Because log concurrently with the writers. Together, these techniques shipping ships physical data, the backup and the primary allow highly concurrent and deterministic replay. database have to run exactly the same version of the data- Our main contribution is a generic and scalable replay- base software, making database upgrades a challenge. Tra- based database replication mechanism. By decoupling our ditional databases usually provide an offline storage format replication scheme from the primary database, we enable converter [25], and both the primary and the backup have to supporting different database designs. Our approach allows shut down to perform a major upgrade. This significantly in- the primary to use any concurrency control scheme that sup- creases the downtime and maintenance complexity. Finally, ports total ordering of transactions, and imposes no restric- if the primary database has bugs that can corrupt data, log tions on the programming model. In addition, the backup is shipping will propagate the corruption to the backup data- designed so that it makes no assumptions about the work- base as well. loads, data partitioning, or the load balancing mechanism Uber’s current solution to these problems is to use My- on the primary. For example, to support different kinds of SQL’s statement level replication [22]. This approach sim- applications fast databases often make various design trade- plifies the upgrade process and saves some network traffic offs, such as data partitioning [14,7] and a special program- because the logging granularity is row level. However, this ming model [8]. Our approach is designed to scale without approach can still generate close to 10Gb/s for a fast data- relying on any specific primary database optimizations, and base, and the re-execution needs to be performed serially. without requiring any developer effort for tuning the backup It also doesn’t help with data corruption. Our replay-based for these optimizations. approach is designed to address these issues. We have implemented replication for ERMIA [15], an in- memory database designed to support heterogeneous work- 3. RELATED WORK loads. Our backup database is specifically designed and In this section, we describe related work in the areas of optimized for replaying transactions concurrently. Our ex- database replication and deterministic replay schemes. Ta- periments with TPC-C workloads show that our approach ble1 provides a summary comparing our scheme with vari- requires 15-20% of the network bandwidth required by tra- ous logging and replay schemes. ditional logging. The backup scales well, replaying transac- Primary-backup replication based on traditional log ship- tions as fast as the primary, and the primary performance ping is easy to implement and commonly available in many is comparable to its performance with traditional logging. traditional databases such as SQL Server [20], DB2 [19], An added reliability benefit of our generic replication mech- MySQL [22], and PostgreSQL [24], but it has significant anism is that it can be used to validate the concurrency con- logging requirements as discussed earlier. While it can be trol scheme on the primary. We found and helped fix several executed relatively efficiently, Hong et al. suggest that seri- serious bugs in the ERMIA implementation that could lead ally replaying the recovery log on the backup database can to non-serializable schedules. become a bottleneck with increasing cores on the primary The rest of the paper describes our approach in detail. database [13]. To enable concurrent log replay, their KuaFu Section2 provides motivation for our approach, and Sec- system constructs a dependency graph on the backup, based tion3 describes related work in the area. Section4 describes on tracking write-write dependencies in the log. Then it uses our multi-version replay strategy and the design of our sys- topological ordering to concurrently apply the logs of non- tem.

Scalable Replay-Based Replication for Fast Databases

Quantum Dxi4500 User's Guide

Replication Using Group Communication Over a Partitioned Network

Replication Vs. Caching Strategies for Distributed Metadata Management in Cloud Storage Systems

Replication As a Prelude to Erasure Coding in Data-Intensive Scalable Computing

Middleware-Based Database Replication: the Gaps Between Theory and Practice

Manufacturing Equipment Technologies for Hard Disk's

A Crash Course in Wide Area Data Replication

Replication Under Scalable Hashing: a Family of Algorithms for Scalable Decentralized Data Distribution

Score: a Scalable One-Copy Serializable Partial Replication Protocol⋆

Overview of Data Replication Strategies in Various Mobile Environment

Storage Replication Compudyne Cloud Services

Scalable State-Machine Replication (S-SMR)