Hybrid Replication: Optimizing Network Bandwidth and Primary

Hybrid Replication: Optimizing Network Bandwidth and Primary Storage Performance for Remote Replication Assaf Natanzon∗‡, Philip Shilane∗, Mark Abashkin∗‡, Leehod Baruch∗, and Eitan Bachmat‡ ‡ Department of Computer Science ∗ EMC Corporation Ben-Gurion University of the Negev Email: fi[email protected] Email: [email protected] Abstract—Traditionally, there are two main forms of data the primary system, unlike snapshot replication. The main replication to secondary locations: continuous and snapshot- disadvantage of continuous replication is that the network based. Continuous replication mirrors every I/O to a remote bandwidth requirements can be quite high, and, specifically, server, which maintains the most up-to-date state, though at a large network bandwidth cost. Snapshot replication periodically as high as peak client writes to the primary system. transfers modified regions, so it has lower network bandwidth In snapshot-based replication, the storage system period- requirements since repeatedly overwritten regions are only trans- ically creates snapshots of a volume and tracks changes ferred once. Snapshot replication, though, comes with larger I/O between snapshots. When new writes enter the system, the loads on primary storage since it must read modified regions to snapshot content is protected so both the earlier and current transfer. To achieve the benefits of both approaches, we present hybrid replication, which is a novel mix of continuous replication content are available. Then, when replication starts, it reads and snapshot-based replication. Hybrid replication selects data the modified regions from disk and transfers the regions across regions with high overwrite characteristics to be protected by the network to the replica. Because there may be numerous snapshots, while data regions with fewer overwrites are protected overwrites to the same region, and only the final version by continuous replication. In experiments with real-world storage within a snapshot period needs to be transferred, the amount of traces, hybrid replication reduces network bandwidth up to 40% relative to continuous replication and I/O requirements on bandwidth used can be dramatically lower than for continuous primary storage up to 90% relative to snapshot replication. replication. Previous work shows bandwidth may be 80% lower with hourly snapshots [10], [14]. Snapshot replication Keywords: replication, data protection, continuous data pro- generally has longer RPOs than continuous replication, extra tection, snapshot storage space to track changes, and I/O to read modified data from primary storage. The goal of hybrid replication is to achieve the network I. INTRODUCTION bandwidth efficiencies of snapshot replication and low over- Remote replication is a widely used technique to protect heads on primary storage of continuous replication. The basic primary storage systems by generating the same state on approach is to partition a volume into extents, consisting of a secondary system [5], [10], [14], [16], [18]. Unlike the consecutive blocks, that are protected by either continuous or data protection options of RAID and creating explicit copies, snapshot replication. Hybrid replication assigns extents with remote replication creates a copy on a separate machine that the most overwrites to snapshot replication, and the remaining may be geographically near or distant. Depending on the style extents are assigned to continuous replication. As storage of remote replication, the secondary system may be part of a patterns change, extents are reassigned dynamically. Hybrid high-availability solution that can handle primary storage tasks replication will have the RPO of snapshot replication but in the event the main system is offline. Our work focuses on reduces primary I/O overheads up to 90% versus snapshot improving the efficiency of remote replication (simply called replication while also reducing network bandwidth overheads replication in the remainder of this paper) for backups and up to 40% versus continuous replication. high availability solutions. The remainder of our paper is organized as follows. Sec- Replication techniques have been studied extensively [5], tion II presents background on continuous and snapshot [10], [14], [16], [18], and we focus on the two most commonly replication, followed by related work in Section III. The implemented approaches to protect storage volumes: continu- hybrid replication algorithm is described in Section IV. Our ous [5] and snapshot [10] replication. Continuous replication is methodology for evaluating hybrid replication is described in a conceptually simple technique that intercepts every write I/O Section V. Experiments are presented in Section VI, and the to a protected volume and sends writes to both the local storage last section presents our conclusions and future work. volume as well as to the replica site. Continuous replication has two main advantages over snapshot replication. First, it II. REMOTE REPLICATION BACKGROUND supports a very low Recovery Point Objective (RPO) [17], We next present more detailed background information meaning a replica only lags the primary system by a short about continuous and snapshot replication that motivates our period, perhaps only a few seconds. Second, since data is work on creating a hybrid solution. We use Figure 1 as an sent to the replica inline, it is unnecessary to read data from ongoing example to motivate hybrid replication. 978-1-5090-3315-7/16/$31.00 ©2016 IEEE Continuous Replication Clients Primary Server I/O Sequence Replica 1 1 3 2 3 I/O Sequence I/O splitter Write I/Os 1 1 3 2 3 I/O Transferred 1 3 2 1 3 2 Snapshot Replication Clients Primary Server Replica I/O Sequence Changed I/O Sequence 1 1 3 2 3 Blocks 1 2 3 I/O Transferred 1 3 2 1 1 3 2 3 2 Original block Modified in snapshot Fig. 1. An example of continuous and snapshot replication. Continuous replication tends to have higher network bandwidth requirements, while snapshot replication tends to create more I/Os on the primary server. A. Continuous Replication for overwrites that take place within a short period of time. In this example, repeated writes to LBA 1 could have resulted in An example of continuous replication is shown in the top a single I/O sent to the replica with buffering. Buffering helps half of Figure 1. Starting at the left, client I/Os are sent to control bandwidth fluctuations, and even a relatively small the primary storage system. While the primary system serves buffer may reduce the maximum required bandwidth of the both reads and writes, we will focus on write traffic as it is system though the RPO increases with buffering delays. the modified data that must be replicated. In this example, One example of a commercially available continuous repli- write I/Os are directed to logical block addresses (LBAs) 1, cation system is EMC’s RecoverPoint, which installs an I/O 1, 3, 2, and 3, with the repeated writes to 1 and 3 creating splitter on the primary storage system such as a storage array overwrites. Each write is assigned a unique color to highlight or hypervisor and supports the spectrum of synchronous and differences between techniques. In continuous replication, all asynchronous replication [9]. The overhead of adding a splitter of the I/Os are intercepted by an I/O splitter in the primary may be as low as 150 microseconds added to writes, which is storage system that directs the writes to the primary system low relative to creating and maintaining snapshots. as well as to the replica, which receives the same I/O write pattern. Note that write I/Os can be sent to the replica without ever reading regions back from primary storage. This example B. Snapshot Replication shows that repeated writes to LBAs 1 and 3 are transferred to Snapshot replication is shown in the bottom half of Figure 1. the replica, which increases network bandwidth usage, though The main difference between snapshot and continuous repli- it does enable a shorter RPO. The final region state is shown cation is that client writes land on the primary storage system on disk after the sequence of I/Os. and are later read and transferred to the replica. Because Continuous replication techniques include synchronous only the most recent version of a region written within a replication [8], which send all of the write I/Os to the replica, snapshot period is transferred, network bandwidth is lower for as we described. Synchronous replication may be configured to snapshot replication as compared to continuous replication. In wait for confirmation from the replica before acknowledging this example, LBAs 1, 2, and 3 are transferred once at the end the write was received from the client, creating an RPO of of the snapshot period instead of transferred once per overwrite zero, though client acknowledgments have a delay due to from the clients. As the snapshot period increases, network network transfer. An alternative version sends the data to usage typically decreases until a plateau is reached. As part of the replica but responds to the client immediately without the snapshot process, though, extra copies are created on the waiting for a confirmation from the replica, which has a primary storage system for modified regions. This is shown in faster response to the client but the RPO then includes the the figure with the original regions outlined and the modified network transfer time. When a client requires the lowest versions colored. When replication has completed, the extra RPOs, continuous replication is preferred

Hybrid Replication: Optimizing Network Bandwidth and Primary

Quantum Dxi4500 User's Guide

Replication Using Group Communication Over a Partitioned Network

Replication Vs. Caching Strategies for Distributed Metadata Management in Cloud Storage Systems

Replication As a Prelude to Erasure Coding in Data-Intensive Scalable Computing

Middleware-Based Database Replication: the Gaps Between Theory and Practice

Manufacturing Equipment Technologies for Hard Disk's

A Crash Course in Wide Area Data Replication

Replication Under Scalable Hashing: a Family of Algorithms for Scalable Decentralized Data Distribution

Score: a Scalable One-Copy Serializable Partial Replication Protocol⋆

Overview of Data Replication Strategies in Various Mobile Environment

Storage Replication Compudyne Cloud Services

Scalable State-Machine Replication (S-SMR)