OpenEC: Toward Unified and Configurable Erasure Coding Management in Distributed Storage Systems Xiaolu Li, Runhui Li, and Patrick P. C. Lee, The Chinese University of Hong Kong; Yuchong Hu, Huazhong University of Science and Technology https://www.usenix.org/conference/fast19/presentation/li

This paper is included in the Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST ’19). February 25–28, 2019 • Boston, MA, USA 978-1-939133-09-0

Open access to the Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST ’19) is sponsored by OpenEC: Toward Unified and Configurable Erasure Coding Management in Distributed Storage Systems Xiaolu Li†, Runhui Li†, Patrick P. C. Lee†, and Yuchong Hu‡ †The Chinese University of Hong Kong ‡Huazhong University of Science and Technology

Abstract coupling between erasure coding management and the DSS workflows makes new erasure coding solutions hard to be gen- Erasure coding becomes a practical redundancy technique eralized for other DSSs and further enhanced. Some DSSs for distributed storage systems to achieve fault tolerance with with built-in erasure coding features (e.g., HDFS with era- low storage overhead. Given its popularity, research stud- sure coding [1,5], [54], and Swift [7]) provide certain ies have proposed theoretically proven erasure codes or effi- configuration capabilities, such as interfaces for implement- cient repair algorithms to make erasure coding more viable. ing various erasure codes and controlling erasure-coded data However, integrating new erasure coding solutions into ex- placement, yet the interfaces are rather limited and it is non- isting distributed storage systems is a challenging task and trivial to extend the DSSs with more advanced erasure codes requires non-trivial re-engineering of the underlying storage and repair algorithms (§2.2). How to fully realize the power workflows. We present OpenEC, a unified and configurable of erasure coding in DSSs remains a challenging issue. framework for readily deploying a variety of erasure coding We present OpenEC, a unified and configurable frame- solutions into existing distributed storage systems. OpenEC work for erasure coding management in DSSs, with the pri- decouples erasure coding management from the storage work- mary goal of bridging the gap between designing new erasure flows of distributed storage systems, and provides erasure coding solutions and enabling the feasible deployment of coding designers with configurable controls of erasure coding such new solutions in DSSs. Inspired by software-defined operations through a directed-acyclic-graph-based program- storage [16,48,51], which aims for configurable storage man- ming abstraction. We prototype OpenEC on two versions agement without being constrained by the underlying storage of HDFS with limited code modifications. Experiments on a architecture, we apply this concept into erasure coding man- local cluster and Amazon EC2 show that OpenEC preserves agement. Our main idea is to decouple erasure coding man- both the operational performance and the properties of erasure agement from the DSS workflows. Specifically, OpenEC coding solutions; OpenEC can also automatically optimize runs as a middleware system between upper-layer applications erasure coding operations to improve repair performance. and the underlying DSS, and is responsible for performing all erasure coding operations on behalf of the DSS. Such a 1 Introduction design relaxes the stringent dependence on the erasure cod- Erasure coding provides a low-cost redundancy mechanism ing support of DSSs. More importantly, OpenEC takes the for fault-tolerant storage, and is now widely deployed in full responsibility of erasure coding management, and hence today’s distributed storage systems (DSSs). Examples in- provides flexibility for erasure coding designers to (i) incor- clude enterprise-level DSSs [15,21,30] and many open-source porate a variety of erasure coding solutions, (ii) configure DSSs [1,3,7,31,54,55]. Unlike replication that simply cre- the workflows of erasure coding operations, and (iii) decide ates identical data copies for redundancy protection, erasure the placement of both erasure-coded data and erasure cod- coding introduces much less storage overhead through the ing operations across storage nodes. Our contributions are coding operations of data copies, while preserving the same summarized as follows: degree of fault tolerance [53]. Modern DSSs mostly realize • We propose a new programming model for erasure coding erasure coding based on the classical Reed-Solomon (RS) implementation and deployment. Our model builds on an codes [43], yet RS codes have high performance penalty, es- abstraction called an ECDAG, a directed acyclic graph that pecially in repairing lost data when failures happen. Thus, defines the workflows of erasure coding operations. We research studies have proposed new erasure coding solutions show how we feasibly realize a general erasure coding with improved performance, such as erasure codes with the- solution through the ECDAG abstraction. oretical guarantees and efficient repair algorithms that are • We design OpenEC, which translates an ECDAG into applicable to general erasure-coding-based storage (§6). erasure coding operations atop a DSS. OpenEC supports However, deploying new erasure coding solutions in DSSs encoding operations on or off the write path as well as is a daunting task. Existing studies often integrate new era- various state-of-the-art repair operations. In particular, it sure coding solutions into specific DSSs by re-engineering can automatically optimize an ECDAG for hierarchical the DSS workflows (e.g., the read/write paths). The tight topologies to improve repair performance.

USENIX Association 17th USENIX Conference on File and Storage Technologies 331 • We implement a prototype of OpenEC on HDFS-RAID single lost block of a coding group in degraded reads or a [5] and Hadoop 3.0 HDFS (HDFS-3) [1]. Its integrations single failed node in full-node recovery) are the most com- into HDFS-RAID and HDFS-3 only require limited code mon repair scenarios [21, 40], existing repair-friendly erasure changes (with no more than 450 LoC). codes aim to minimize the repair bandwidth or I/O in single- • We evaluate OpenEC on a local cluster and Amazon EC2. failure repairs. Examples are regenerating codes [14], in- OpenEC incurs negligible performance overhead in DSS cluding minimum-storage regenerating (MSR) and minimum- operations, supports various state-of-the-art erasure codes bandwidth regenerating (MBR) codes, as well as locally re- and repair algorithms, and increases the repair through- pairable codes (LRCs) [21, 23, 44, 49]. put by at least 82% through automatically customizing an Our work focuses on practical erasure codes. In particular, ECDAG for a hierarchical topology. we target linear codes, which include RS codes, MSR and MBR codes, as well as LRCs. Linear codes perform linear The source code of our OpenEC prototype is available at: coding operations based on the Galois field arithmetic [17]. http://adslab.cse.cuhk.edu.hk/software/openec. Mathematically, for an (n,k) code, let d0,···,dk−1 be the k data blocks, and p ,···, p be the n − k parity blocks. 2 Background and Motivation 0 n−k−1 Each parity block p j (0 ≤ j ≤ n − k − 1) can be expressed k−1 2.1 Erasure Coding Basics as p j = ∑i=0 γ jidi, where γ ji is some coding coefficient for Consider a DSS that comprises multiple storage nodes and computing p j. Note that the linear operations are additive organizes data in units of blocks. We construct erasure coding associative (i.e., independent of how additions are grouped). as an (n,k) code with two configurable parameters n and k, Also, our work addresses sub-packetization, which is used where k < n. For every k fixed-size original blocks (called in various designs of MSR and MBR codes [14, 18, 32, 39, 42, data blocks), an (n,k) code encodes them into n−k redundant 45,50,52]. Sub-packetization divides each block into smaller- blocks of the same size (called parity blocks), such that any size sub-blocks, so that repairs can be done by retrieving k out of the n erasure-coded blocks (including both data and sub-blocks rather than whole blocks. parity blocks) can decode the k data blocks; that is, any n − k Most DSSs assume that all erasure-coded blocks are im- block failures can be tolerated. We call the collection of n mutable and do not support in-place updates. Thus, we focus erasure-coded blocks a coding group. A DSS encodes differ- on four basic operations: writes, normal reads, degraded ent sets of k data blocks independently, and distributes the n reads, and full-node recovery (§4.2), while we address in- erasure-coded blocks of each coding group across n storage place updates in future work. nodes to protect against any n − k storage node failures. In 2.2 Limitations of Erasure Coding Management this paper, our discussion focuses on the coding operations (i.e., encoding or decoding) of a single coding group. Modern DSSs now support erasure coding, yet existing era- For performance reasons, a DSS implements coding opera- sure coding management in such DSSs remains stringent and tions in small-size units called packets, while the read/write still faces practical limitations. To motivate our study, we units are in blocks; for example, our experiments set the review six state-of-the-art DSSs that currently realize erasure- default packet and block sizes as 128 KiB and 64 MiB, respec- coded storage: HDFS-RAID [5], HDFS-3 [1], QFS [31], tively). It divides a block into multiple packets, and encodes Tahoe-LAFS [55], Ceph [54], and Swift [7]. HDFS-RAID is the packets at the same block offsets in a coding group to- the erasure coding extension of HDFS [46] in the earlier ver- gether. Thus, instead of first reading the whole blocks to start sion of Hadoop. Here, we focus on Facebook’s HDFS-RAID coding operations, a DSS can perform packet-level coding implementation [3], which builds on Hadoop version 0.20. operations, while reading the whole blocks, in a pipelined HDFS-3 builds on the newer Hadoop version 3.0, which in- manner. To simplify our discussion, we use blocks as the cludes erasure coding by design. QFS resembles HDFS and units of coding operations, and only differentiate packets and includes erasure coding by design. All HDFS-RAID, HDFS- blocks in our implementation (§4.5). 3, and QFS organize data in fixed-size blocks. In contrast, Given the prevalence of failures, repairs are frequent op- Tahoe-LAFS, Ceph, and Swift organize data in variable-size erations in DSSs [40]. We consider two types of repairs: (i) objects and partition each object into equal-size data blocks degraded reads, which decode the unavailable data blocks for erasure coding. that are being requested, and (ii) full-node recovery, which (L1) Limited support for adding advanced erasure codes: decodes all lost blocks of a failed storage node. Since re- Existing DSSs provide encoding/decoding interfaces for im- pairs trigger substantial traffic [40], achieving high repair plementing new erasure codes. However, most DSSs do performance is important in erasure coding deployment. RS not provide interfaces for adding erasure codes with sub- codes [43] are the most popular erasure codes that are widely packetization (e.g., MSR and MBR codes [14,18,32,39,42, used in production [5, 7, 15, 31, 54, 55], but they incur high 45,50,52]) and handling erasure-coded blocks at the granu- repair costs. Thus, many repair-friendly erasure codes have larity of sub-blocks, while only recently Ceph includes the been proposed. Since single-failure repairs (i.e., repairing a sub-packetization feature in its master codebase [52]. Also,

332 17th USENIX Conference on File and Storage Technologies USENIX Association recent erasure codes [19,38] address the hierarchical nature of flow) and how erasure-coded blocks are stored and accessed DSSs to reduce cross-rack [19] (or cross-cluster [38]) repair (i.e., the data flow). The current practice is that erasure coding traffic, yet realizing such hierarchy-aware erasure codes needs designers only define an erasure code and its coding opera- modifications to the DSS workflows. tions (e.g., the coding coefficients used in coding operations), (L2) Limited configurability for workflows of coding op- while DSS developers require dedicated engineering efforts erations: Enabling configurable workflows of coding opera- to integrate the coding operations into the read/write paths of tions allows better resource usage within a DSS. Take repairs DSSs without compromising the correctness of upper-layer (degraded reads or full-node recovery) as an example. DSSs applications. Such tight coupling makes the extensions of execute repairs at different entities upon the detection of fail- erasure coding features inflexible. ures. For a degraded read, it is executed at the client (in OpenEC decouples erasure coding management from the HDFS-RAID, HDFS-3, QFS, and Tahoe-LAFS), the proxy underlying DSS by providing a unified and configurable (in Swift), or a storage node (in Ceph); for full-node recovery, framework for erasure coding management, such that era- it is executed at either storage nodes (in HDFS-RAID, HDFS- sure coding designers can leverage OpenEC to realize new 3, QFS, Ceph, and Swift) or the client (in Tahoe-LAFS). Both erasure coding solutions and configure the workflows of cod- degraded reads and full-node recovery operate in a fetch-and- ing operations, without worrying how they are integrated into compute manner, in which the entity that executes the repair the DSS workflows. Specifically, OpenEC addresses the will retrieve available blocks from other non-failed storage limitations in §2.2 with the following goals: (i) extensibility nodes and reconstruct the lost blocks. On the other hand, be- of new erasure codes; (ii) configurable workflows of coding sides the fetch-and-compute approach, we cannot configure operations; and (iii) configurable placement of both erasure- a DSS to adopt different repair workflows or distribute the coded blocks and coding operations. To achieve these goals, repair loads across storage nodes. For example, recent repair OpenEC builds on a programming model for erasure coding algorithms [25,29] decompose a single-block repair operation management, as elaborated in the following sections. into partial sub-block repair operations that are parallelized across storage nodes for better bandwidth usage, but existing 3 Programming Model DSSs do not support this feature by design. We propose a programming model that allows erasure coding (L3) Limited configurability for placement of coding op- designers to not only define an erasure code structure and its erations: All DSSs we consider ensure that the n erasure- coding operations, but also configure how coding operations coded blocks of each coding group are stored in n distinct are performed in a DSS. We present a new erasure coding storage nodes, and most of them additionally allow config- abstraction called an ECDAG (§3.1), followed by three primi- urable block placement. For example, both HDFS-RAID and tives for constructing an ECDAG (§3.2). We then propose a HDFS-3 provide a base class for configuring block placement programming interface for realizing an erasure code based on policies; QFS provides an in-rack placement option to store the ECDAG abstraction (§3.3). placement groups multiple blocks in a rack; Ceph uses , while 3.1 ECDAG Overview Swift uses object rings, to control how erasure-coded blocks are placed in different storage nodes. At a high level, an ECDAG is a directed acyclic graph that de- However, existing DSSs focus on how erasure-coded scribes the workflows of coding operations of a coding group blocks are placed after encoding, but do not specify where of an erasure code. Each vertex represents a block in the cod- to perform the coding operations. For example, in encod- ing group, and the connections among vertices describe how ing operations, we may want to co-locate the computations vertices are related by linear combinations. To address the of parity blocks at one storage node (rather than distribute limitations in §2.2, we design ECDAGs to work for general the computations across different storage nodes) to limit the linear codes (L1 addressed). Also, we can construct different I/Os of retrieving data blocks. Also, the repair algorithms ECDAGs to configure how and where coding operations are in [25,29] require some storage nodes that store available data performed (L2 and L3 addressed, respectively). blocks to first compute partially decoded blocks and send the Consider a coding group of an (n,k) code with n erasure- results to other storage nodes for further decoding. In this coded blocks; to simplify our discussion, we do not con- case, we need to place the partial decoding operations at spe- sider sub-packetization first. We index the blocks from 0 to cific storage nodes. Such fine-grained placement of coding n − 1, and let bi denote the block with index i. Without loss operations is currently not supported in existing DSSs. of generality, we refer to b0,···,bk−1 as k data blocks, and bk,···,bn−1 as n−k parity blocks. In some cases (see below), 2.3 Lessons Learned and Goals the coding operations may generate some intermediately com- The root cause of the limitations in §2.2 is that the current puted blocks that will not be finally stored (as opposed to erasure coding management is tightly coupled with the DSS blocks b0,b1,···,bn−1). We call such blocks virtual blocks, 0 workflows. Realizing erasure coding in DSSs needs to ad- and denote a virtual block by bi0 for some i ≥ n. dress how coding operations are performed (i.e., the control In an ECDAG, let vi (i ≥ 0) be a vertex that maps to block

USENIX Association 17th USENIX Conference on File and Storage Technologies 333 00 11 22 33 11 22 33 44 11 22 33 44 void Join(int pidx, vector cidxs, vector coefs); int BindX(vector idxs); 55 66 void BindY(int pidx, int cidx);

44 00 00 Listing 1: Primitives for ECDAG construction. (a) Encoding (b) Decoding (c) PPR [29] indexed. Figure 2(b) shows the ECDAG for the encoding Figure 1: Example of an ECDAG for a (5,4) code. operation, in which the sub-blocks of parity blocks b2 and b3 are computed from the sub-blocks of data blocks b0 and b1. 0 2 4 6 0 1 2 3 222 33 3 44 4 55 5 66 677 7 00 22 44 66 00 11 22 33 Suppose that block b0 is lost. Figure 2(c) shows the ECDAG 111 333 555 777 888 99 9 1010 10 for decoding b0 based on MISER codes [45], in which we first 44 55 66 777 compute an encoded sub-block from each of other available b0bb00 bb1b11 bbb222 bbb333 00 0 11 1 blocks b1, b2, and b3 (represented by the virtual vertices v8, (a) Layout (b) Encoding (c) Decoding v9, and v10, respectively), followed by using the encoded Figure 2: Example for an ECDAG for a (4,2) code with w = 2. sub-blocks to decode the lost sub-blocks of b0.

0 3.2 ECDAG Primitives bi; we call a vertex vi0 (i ≥ n) that maps to a virtual block bi0 a virtual vertex. Let ei, j (i, j ≥ 0) be a directed edge from vi An ECDAG can be constructed from three primitives: Join, to v j indicating that bi is an input to the linear combination BindX, and BindY. Join is used for constructing an ECDAG, for computing b j. Each edge is associated with a coding while BindX and BindY control the placement of coding coefficient for the linear combination. If there exists an edge operations. Listing 1 shows their definitions in C++ format. e , we say that v is the parent of v , while v is a child of v . i, j j i i j Join: It specifies how a parent vertex (with index pidx) is A vertex can have any number of parents and children. formed by the linear combinations of a list of child vertices Both encoding and decoding operations are each associated (with indices in cidxs) and the corresponding coding coeffi- with an ECDAG. The ECDAG for encoding is constructed at cients (in coefs). For example, we deploy the (6,4) RS code the beginning of the encoding operation to describe how data and encode four data blocks b , b , b , and b into two new blocks are linearly combined to form each parity block. In 0 1 2 3 parity blocks b and b . We can construct an ECDAG with contrast, the ECDAG for decoding is constructed on demand 4 5 Join as follows (see Figure 3(a)): depending on what blocks are currently available. For example, consider a (5,4) code (i.e., (4+1)-RAID-5). ECDAG* ecdag = new ECDAG(); ecdag->Join(4, {0,1,2,3}, {1,1,1,1}); Figure 1(a) shows the ECDAG for the encoding operation, ecdag->Join(5, {0,1,2,3}, {1,2,4,8}); which states that the parity block b4 is a linear combination of the four data blocks b0, b1, b2, and b3. Suppose now BindX: It co-locates the coding operations of multiple ver- that block b0 is lost. Figure 1(b) shows the ECDAG for the tices (with indices in idxs) that reside at the same level of an decoding operation for b0, which can be computed from other ECDAG (i.e., in the x-direction), so as to reduce I/O in coding available blocks b1, b2, b3, and b4. operations. For example, in Figure 3(a), suppose that the data We can parallelize partial decoding operations as in PPR blocks being encoded are stored in different storage nodes. [29] by constructing another ECDAG for decoding b (see 0 Without BindX, we need to compute b4 and b5 separately and Figure 1(c)), in which we first compute in parallel the partially retrieve each data block twice. Instead, we can call BindX on decoded blocks b and b (both of which are virtual blocks) 5 6 vertices v4 and v5 to create a new virtual vertex v6 as follows from b1 and b2 and from b3 and b4, respectively, followed (see Figure 3(b)): by computing b0 from b5 and b6. This shows that we can flexibly configure coding operations by constructing different int vidx = ecdag->BindX({4,5}); ECDAGs. Note that PPR needs to compute b5 and b6 at the This indicates that blocks b4 and b5 are first computed to- storage nodes where data blocks (e.g., b2 and b4, respectively) gether at the same storage node before being distributed to are stored (see [29] for details). We address this issue in §3.2. different storage nodes. Now we only need to retrieve each We can also construct an ECDAG for erasure codes with data block once. Note that the index of v6 (i.e., 6) is generated sub-packetization. Let w be the number of sub-blocks per randomly and returned as vidx by BindX. block (w = 1 means no sub-packetization). We index the BindY: sub-blocks of block b from 0 to w − 1, those of b from w It co-locates the coding operations of a parent vertex 0 1 (with index pidx) and its child vertex (with index cidx) at to 2w − 1, and so on. Each vertex vi (i ≥ 0) now corresponds 0 different levels (i.e., in the y-direction). Consider the same to the sub-block with index i, while any vertex vi0 for i ≥ nw is a virtual vertex. For example, consider the (4,2) MISER example in Figure 3(b) after we call BindX. We can call v v code [45] (an MSR code based on interference alignment), BindY on vertices 0 and 6 as follows (see Figure 3(c)): where w = 2. Figure 2(a) shows how the sub-blocks are ecdag->BindY(vidx, 0);

334 17th USENIX Conference on File and Storage Technologies USENIX Association 000 111 222 333 00 0 11 1 22 2 33 3 00 0 11 1 22 2 33 3 ECDAG* Encode() { ECDAG* ecdag = new ECDAG(); 66 6 66 6 ecdag->Join(4, {0,1,2,3}, {1,1,1,1}); ecdag->Join(5, {0,1,2,3}, {1,2,4,8}); 444 555 44 4 55 5 44 4 55 5 int vidx = ecdag->BindX({4,5}); (a) Initial ECDAG (b) After BindX (c) After BindY ecdag->BindY(vidx, 0); return ecdag; Figure 3: Construction of ECDAGs for the (6,4) RS code. } Listing 3: Encode function. class ECBase { int n, k, w;// coding parameters vector ecoefs;// encoding coefficients ECDAG* Decode(vector from, vector to) { public: ECDAG* ecdag = new ECDAG(); ECDAG* Encode(); vector dcoefs;// decoding coefficients ECDAG* Decode(vector from, vector to); // compute dcoefs based on the available blocks vector> Place(); ecdag->Join(to[0], from, dcoefs); }; return ecdag; } Listing 2: Erasure coding programming interface. Listing 4: Decode function.

Thus, we compute parity blocks b4 and b5 at the same storage b b vector> Place() { node that stores 0, thereby saving the I/Os of retrieving 0. vector> groups; Note that BindY enables us to implement the repair algo- for(int i=0; i

USENIX Association 17th USENIX Conference on File and Storage Technologies 335 Applications 4.2 Basic Operations OpenEC supports four basic operations: (i) writes; (ii) nor- OECClient OECClient OECClient Controller ... control flow mal reads; (iii) degraded reads; and (iv) full-node recovery.

Agent Agent ... Agent data flow Writes: Note that HDFS-3 supports online encoding (i.e., HDFS Client HDFS Client HDFS Client clients perform encoding on the write path), while HDFS- RAID supports offline encoding (i.e., clients first write the NameNode data blocks in uncoded form, and the data blocks are later DataNode DataNode ... DataNode encoded in the background). OpenEC is currently designed to support both online and offline encoding. An OECClient specifies which encoding mode to use in a write request. For Figure 4: OpenEC architecture based on HDFS. online encoding, OpenEC encodes data on a per-file basis. k 4.1 Architectural Overview When an OECClient writes a file, its agent encodes every data blocks into n − k parity blocks and writes the n erasure- OpenEC runs as a middleware system atop a DSS. As a coded blocks to n DataNodes through the HDFS client. For proof of concept, we design OpenEC atop two implemen- offline encoding, an OECClient first writes file data via its tations of HDFS [46]: HDFS-RAID [5] and HDFS-3 [1]1. agent to HDFS. When OpenEC receives an encoding request, HDFS (including both HDFS-RAID and HDFS-3) comprises the controller parses the specified ECDAG (§4.3) and instructs a NameNode that coordinates the storage in units of blocks all agents to perform encoding, such that every k blocks are across multiple DataNodes (storage nodes). Figure 4 shows encoded into n erasure-coded blocks as a coding group. how OpenEC is integrated into HDFS. OpenEC comprises Normal reads: An OECClient issues normal reads (under no a centralized controller, which coordinates multiple agents. failures) via its agent, which connects to the DataNodes that An application interacts with OpenEC via an OECClient. store the uncoded data blocks and retrieves the data blocks Controller: The controller parses ECDAGs and instructs all from the DataNodes. agents how to perform coding operations and store erasure- Degraded reads: An OECClient issues degraded reads (un- coded blocks. It keeps erasure coding metadata and all der failures) via its agent, which connects to non-failed DataN- ECDAGs in local disk for persistence. There are three types odes and retrieves the available blocks for decoding the lost of metadata: (i) the information of blocks associated with blocks based on the ECDAG specification. each file; (ii) the information of blocks associated with each Full-node recovery: The controller coordinates the full-node coding group; and (iii) the block locations. The controller recovery operation. When it receives a report of lost blocks interacts with the NameNode in two aspects. First, it accesses from the NameNode, it informs the agents to repair the lost or updates the block locations of the NameNode to configure blocks based on the ECDAG specification. the placement of blocks. Second, it receives the reports of lost blocks from the NameNode and coordinates the repair 4.3 Parsing an ECDAG operations among the agents. OpenEC parses ECDAGs to perform coding operations in We assume that the controller is reliable (i.e., no single- writes (online or offline encoding), degraded reads, and full- point-of-failure). Our measurements show that the controller node recovery. Given an ECDAG, OpenEC decomposes a can serve a request of parsing an ECDAG for coding opera- coding operation into multiple tasks, each of which is exe- tions in less than 0.3 ms in our local cluster (§5), and hence it cuted by an agent. Each task operates in blocks (or sub-blocks incurs limited overhead to basic operations. in sub-packetization). There are four types of tasks: Agent: Each agent performs coding operations as instructed • Load: It loads a block into memory from the agent’s input by the controller. It accesses the erasure-coded blocks in stream, which could be either the OECClient if the block HDFS through the HDFS client interface. Note that agents is from upper-layer applications, or the HDFS client if the can communicate among themselves to perform coding op- block is from HDFS. erations and exchange erasure-coded blocks. We currently • Fetch: It retrieves blocks from other agents. deploy each agent at a DataNode, so that the agent can access • Compute: It computes a block based on the linear combi- the local storage of the DataNode without network transfers. nation of blocks and coding coefficients. OECClient: Each OECClient is associated with an agent, • Persist: It either writes a block to HDFS via the HDFS and serves as an interface between an upper-layer applica- client, or returns the block to an OECClient. tion and the agent. It connects to the agent via Redis-based communication (§4.5). An application now accesses HDFS Parsing procedure: OpenEC performs topological sorting through an OECClient instead of an HDFS client. of an ECDAG (based on depth-first search) to identify the vertex sequence of coding operations. It then assigns tasks to 1We also implement OpenEC atop QFS [31]. See [27] for details. each vertex based on the ECDAG structure. Depending on the

336 17th USENIX Conference on File and Storage Technologies USENIX Association Vertices Nodes Tasks the coding operations across storage nodes. To elaborate, v0 C Load b0 suppose that bi is stored in storage node Ni, for 0 ≤ i ≤ 5. v C Load b 1 1 First, since v0, v1, v2, and v3 have no child, OpenEC creates v C Load b 2 2 tasks for the agents in storage nodes N0, N1, N2, and N3 to v3 C Load b3 load the blocks b0, b1, b2, and b3, respectively, from HDFS Compute b4 from {b0, b1, b2, b3} with (via HDFS clients) into memory. Second, since vertex v is coding coefficients {1,1,1,1}; 6 v6 C created from BindX on vertices v4 and v5, OpenEC computes Compute b5 from {b0, b1, b2, b3} with coding coefficients {1,2,4,8} both b4 and b5 from the blocks in the child vertices (i.e., b0, b , b , and b ). Also, since BindY is called on v and v , v4 C – 1 2 3 6 0 OpenEC assigns the tasks of v to the agent in N . Finally, v5 C – 6 0 Persist b ; Persist b ; Persist b ; v4 and v5 retrieve blocks b4 and b5 from v6, respectively. – C 0 1 2 Persist b3; Persist b4; Persist b5 Since v4 and v5 have no parent and are the last vertices in the (a) Online encoding topological order, they persist the blocks to HDFS. Note that OpenEC can parallelize the coding operations Vertices Nodes Tasks on the vertices that have no dependencies on others. For v N Load b 0 0 0 example, OpenEC can simultaneously execute the tasks for v1 N1 Load b1 v0, v1, v2, and v3, and similarly the tasks for v4 and v5. v2 N2 Load b2 v3 N3 Load b3 4.4 Automated Optimizations Fetch b from N ; 1 1 In addition to letting erasure coding designers construct Fetch b2 from N2; ECDAGs, OpenEC can automatically customize ECDAGs Fetch b3 from N3; for performance optimizations to save manual configuration v6 N0 Compute b4 from {b0, b1, b2, b3} with coding coefficients {1,1,1,1}; efforts. We address this in two aspects. Compute b5 from {b0, b1, b2, b3} with Automated BindX and BindY: OpenEC can automatically coding coefficients {1,2,4,8} call BindX and BindY for some specific subgraph structures v4 N4 Fetch b4 from N0; Persist b4 of an ECDAG. For BindX, OpenEC examines all parent v5 N5 Fetch b5 from N0; Persist b5 vertices that have more than one child vertex in an ECDAG. (b) Offline encoding If multiple parent vertices have the same set of child vertices, Table 1: Vertex sequence of coding operations, including the nodes OpenEC calls BindX on those parent vertices (e.g., v4 and that are responsible for processing the vertices as well as the tasks v5 in Figure 3(b)). For BindY, for any parent vertex (with one that are performed. or more child vertices), OpenEC calls BindY on the parent vertex and any one of the child vertices (e.g., the parent vertex types of basic operations, OpenEC may perform coding op- v6 and the child vertex v0 in Figure 3(c)). erations on the client side (for online encoding and degraded Hierarchy awareness: OpenEC can further enhance the re- reads) or distribute the coding operations across storage nodes pair performance based on the physical DSS topology. One (for offline encoding and full-node recovery). scenario is that a DSS hierarchically organizes storage nodes OpenEC associates tasks with different types of vertices. in racks [19] (or clusters [38]), such that the cross-rack band- At a high level, the Load task is associated with a vertex with- width is much more constrained than the inner-rack band- out any child; the Fetch task is associated with a parent vertex width. OpenEC can transform an ECDAG into a pipelined that has a child vertex; the Compute task is associated with a ECDAG, so as to mitigate the cross-rack traffic. Our idea vertex with more than one child for the linear combination; is based on repair pipelining [25], which pipelines partial the Persist task is associated with a vertex without any parent, coding operations across multiple storage nodes. We addition- while it is also associated with a vertex without any child in ally perform all partial coding operations within a rack before the case of online encoding (see the example below). sending the partial coding results to another rack. To illustrate, Example: We show the parsing procedure via an example. suppose that we deploy an (n,k) RS code with k = 6. We Suppose that we encode four data blocks (i.e., b0, b1, b2, want to repair a lost block b0 from six other available blocks and b3) to generate two parity blocks (i.e., b4 and b5) using b1, b2, b3, b4, b5, and b6, such that blocks b1, b3, and b5 are the (6,4) RS code, based on the ECDAG in Figure 3(c) and in one rack, while blocks b2, b4, and b6 are in another rack. the Encode function in Listing 3. Table 1 shows the vertex We also want to store the reconstructed block b0 at the same sequence of tasks for both online and offline encoding. rack as b2, b4, and b6. The conventional repair approach is to For online encoding (see Table 1(a)), the client-side agent retrieve all six available blocks and construct an ECDAG as (denoted by C) performs all coding operations. It finally in Figure 5(a). Then we need to transfer three blocks (i.e., b1, persists all data blocks and parity blocks into HDFS. b3, and b5) across racks. Instead, OpenEC can automatically For offline encoding (see Table 1(b)), OpenEC distributes construct another ECDAG as in Figure 5(b), in which it first

USENIX Association 17th USENIX Conference on File and Storage Technologies 337 11 22 3 4 5 6 1 33 55 22 44 66 pp0 0 pp4 4 pp8 8 pp12 12 pp0 0 pp1 1 pp2 2 pp3 3

pp1 1 pp5 5 pp9 9 pp13 13 pp4 4 pp5 5 pp6 6 pp7 7 0 77 88 99 1010 1111 00 pp2 2 pp6 6 pp10 10 pp14 14 pp8 8 pp9 9 pp10 10 pp11 11 (a) Initial ECDAG (b) Pipelined ECDAG pp3 3 pp7 7 pp11 11 pp15 15 pp12 12 pp13 13 pp14 14 pp15 15 Figure 5: Example of constructing a pipelined ECDAG; vertices of 4 4data data blocks blocks 2 2parity parity blocks blocks 4 4data data blocks blocks 2 2parity parity b locksblocks the same color mean that their blocks are in the same rack. (a) Contiguous layout (b) Striped layout Figure 6: Block layouts for the (6,4) RS code. Suppose that each computes the partially decoded block b8 (corresponding to block stores four packets. We partition file data into 16 packets, vertex v8) based on b1, b3, and b5 in the same rack, followed ordered as p0, p1, ···, and p15. We place the packets across k = 4 by combining b8 with b0, b2, and b4 in another rack to recon- blocks. Packets at the same offset (e.g., in dashed boxes) are encoded struct block b0. In this case, we only need to transfer one together. In sub-packetization, each packet is further divided into sub-packets for encoding. block (i.e., b8) across racks.

4.5 Implementation called BlockPlacementPolicyOEC, which redirects block We implement an OpenEC prototype in C++ with around 7K placement requests to the controller to manage erasure-coded LoC. We use Intel’s Intelligent Storage Acceleration Library block placement. We also modify the FSNamesystem class (ISA-L) [6] to implement erasure coding functionalities. Here, in HDFS-RAID and the BlockManager class in HDFS-3 to we highlight several implementation details of OpenEC. redirect any lost block information to the controller so that From blocks to packets: OpenEC performs coding opera- OpenEC manages repair operations. Note that our integra- tions in units of packets to improve performance, while the tions into HDFS-RAID and HDFS-3 only require limited read/write operations are still in units of blocks (§2.1). By modifications to their codebases, with around 300 LoC and default, the packet size is 128 KiB. For encoding (both online 450 LoC, respectively2. and offline), OpenEC writes n erasure-coded packets to n DataNodes; in the case of sub-packetization, each packet is 5 Evaluation divided into sub-packets. If a DataNode receives an amount We conduct testbed experiments on OpenEC. We summarize of packet data equal to the HDFS block size (64 MiB by de- our major findings on OpenEC: (i) it preserves the perfor- fault), it seals the block and stores additional packets in a mance of HDFS-RAID and HDFS-3 in erasure coding deploy- different block. The n sealed erasure-coded blocks then form ment (§5.2); (ii) it supports various state-of-the-art erasure a coding group. Note that while OpenEC is sending packets coding solutions and preserves their properties, especially in to DataNodes, it can start encoding for the next group of pack- network-bound environments (§5.3); (iii) it can automatically ets. Thus, both the sending and encoding operations can be optimize the repair performance for a hierarchical topology done in parallel. Similarly, OpenEC performs decoding (for (§5.4); and (iv) it achieves scalable performance in real cloud degraded reads and full-node recovery) at the packet level. environments (§5.5). As OpenEC performs packet-level coding operations, the block layouts differ in online and offline encoding. For online 5.1 Setup encoding, OpenEC adopts a striped layout as in HDFS-3 [4], Testbeds: We evaluate OpenEC on both a local cluster (§5.2- as it stripes file data across blocks at the granularities of §5.4) and Amazon EC2 (§5.5). Our local cluster testbed com- packets. For offline encoding, OpenEC adopts a contiguous prises 16 machines, each of which has a quad-core 3.4 GHz In- layout, as the file data is first stored in a block before encoding. tel Core i5-3570, 16 GiB RAM, and a Seagate ST1000DM003 Figure 6 depicts both block layouts. 7200 RPM 1 TiB SATA hard disk. All machines are intercon- Internal communication: OpenEC uses Redis [8] for inter- nected via a 10 Gb/s Ethernet switch. On the other hand, nal communications among the controller, agents, and OEC- our Amazon EC2 testbed comprises 30 instances of type Client. Each agent maintains a local in-memory key-value m5.xlarge, connected via a 10 Gb/s network, in the US Redis store. The controller sends the task instructions of East (North Virginia) region. Each instance has four vCPUs coding operations to an agent via the Redis client, and the with Intel AVX-512 instruction sets and 16 GiB RAM. Both task instructions are buffered at the agent for subsequent pro- testbeds support optimized coding operations based on ISA-L. cessing. Agent-to-agent communications are pull-based via Default setup: We set the HDFS block size as 64 MiB and the Fetch tasks (§4.3), such that the sender agent buffers the the packet size for erasure coding as 128 KiB. We set HDFS-3 blocks to be sent in its local Redis store, and the receiver 2We compare the amounts of code changes in OpenEC with those in our fetches the buffer via the Redis client. Each OECClient also previously built prototypes CORE [26] and DoubleR [19], both of which communicates with its associated agent via Redis. modify HDFS-RAID to realize new erasure codes. Excluding the implemen- tation of erasure codes (e.g., coding operations), CORE and DoubleR make Integration: We integrate OpenEC into HDFS-RAID and around 2,300 LoC and 4,100 LoC of changes to the HDFS-RAID codebase HDFS-3 as follows. We realize a new block placement policy for the integration of erasure codes, respectively.

338 17th USENIX Conference on File and Storage Technologies USENIX Association ecompare We ernattlo v let,ec fwihwie l of file a writes which of each clients, five of total a run We ihoedt lc eee.Fgr ()sostethrough- the shows 7(a) Figure deleted. block data one with ie,sxtmsteboksz) n su omlra othe to read normal a issue and size), block the times six (i.e., USENIX Association reads. normal writes, degraded in and clients five reads, all of throughput aggregate the the under MiB 384 size and HDFS-3 between performance multi-client the encoding: online in performance Multi-client in respectively. 4.41% reads, and degraded 2.83% and by reads HDFS-3’s normal writes, than in higher 2.36% slightly by is HDFS-3’s and than less slightly is throughput Both reads. degraded and reads, OpenEC normal writes, of file results the put to read degraded a issue also MiB We 384 failures. size without of file file a write first We [31]). the QFS use in we (as Here, code data. erasure-coded generate to and HDFS-3 OpenEC between performance single-client encoding: the online compare in performance sizes. packet Single-client and block different for performance its uate cases, some compare in limited; We is any) overhead. OpenEC (if extra overhead incur such may that it show DSS, underlying the applications upper-layer and between layer As software cluster. another local adds our using operations basic of terms Operations Basic of Performance 5.2 runs. 10 the of showing minimum bars and error maximum the the the including plot runs, We 10 DataNode. over HDFS results an average and client, HDFS an an OECClient, agent, an serves machine remaining each the this evaluate we in until feature repairs hierarchy-aware disable also § without codes erasure of formance BindX ( features mization OpenEC for DSS default the as . n hnw evaluate we when and 5.3 Thpt (MiB/s) 200 400 600 800 OpenEC a nie(igecin)()Oln mlil let)()Ofln d niev.offline vs. Online (d) Offline (c) clients) (multiple Online (b) client) (single Online (a) 0 and rt Normal Write OpenEC oho hc r ofiue iholn encoding online with configured are which of both , 343 n DS3hv iia performance: similar have HDFS-3 and § vnsgicnl mrvspromne ealso We performance. improves significantly even ihHF-AD eadn h uoae opti- automated the Regarding HDFS-RAID. with DS3OpenEC HDFS-3 ..W sinaddctdmciet ev both serve to machine dedicated a assign We 5.4. BindY 334.9 OpenEC otolradteHF aeoe while NameNode, HDFS the and controller

Read 611.9 xetwe eeaut h rgnlper- original the evaluate we when except , § ihntv oigpromneadeval- and performance coding native with 629.2 .) u xeiet nbeautomated enable experiments our 4.4), Degraded Read OpenEC 573.8 ihHF-ADadHF- in HDFS-3 and HDFS-RAID with ( 9 599.1 , 6 BindX OpenEC ) iue7: Figure Scd.Fgr ()shows 7(b) Figure code. RS

xetwe ecompare we when except , Thpt (MiB/s) 1000 1500 2000 OpenEC 500 and 0 efrac fbscoeain noln n fieencoding. offline and online in operations basic of Performance a oe aggregate lower has BindY rt Normal Write 844 DS3OpenEC HDFS-3 827.5 piiainin optimization Read in ecompare We 1050.4 OpenEC OpenEC OpenEC § OpenEC ( 1185.4 ..We 5.4. efirst We 9 Degraded , 6 Read

) 1039

RS 1132.2 ’s 17th USENIX Conference onFileand StorageTechnologies 339 .

Thpt (MiB/s) 1000 1500 2000 29 n .7,rsetvl.Nvrhls,cnieigthe considering Nevertheless, respectively. 8.97%, and 12.9% 3% esuyteHF-ADsuc oeadfidthat find and code source HDFS-RAID the study We 137%. rt 8 lcs n s fieecdn ognrt erasure- generate to encoding offline use and blocks, 180 write deploy and HDFS-RAID encoding: Offline performance significant see not do between we differences figure, the in bars error by reads degraded and reads ag- normal higher in but throughput 1.95%, by gregate writes in HDFS-3 than throughput rm1MBt 4MB(suigta h l iei divisi- is size encoding, online file For the eight). that by (assuming ble MiB 64 to MiB 1 from the uses client only). currently encoding (which online HDFS-3 supports atop encoding offline and online We encoding). offline deploy (in layout contiguous online the (in and layout encoding) striped the between difference performance in encoding offline encoding: offline vs. Online bars. dif- error limited have the systems considering two ferences the yet full- 7.9%, For by HDFS-RAID regeneration. parity after recovery, file node blocks HDFS parity single all step a re-writing extra into and the reading to in due HDFS-RAID mainly of is difference performance the by HDFS-RAID of throughput encoding offline the creases recovery. full-node and ing HDFS-RAID. of evaluation that our Note in la- s) this 20 subtract around is and (which latency, empty tency its an measure start to we job evaluation, MapReduce our in overhead MapRe- startup the duce exclude To unit MapReduce. per and via encoding recovered recovery offline full-node being performs data HDFS-RAID that lost Note of time). through- amount recovery the full-node being (i.e., the data put and input time) unit of per amount encoded the offline (i.e., the throughput measure node we encoding storage Here, one recovery. of full-node blocks trigger the and delete then We groups). ing the using blocks coded 500 0 ecnie h igecin efrac,i hc a which in performance, single-client the consider We Interestingly, results. the shows 7(c) Figure Encoding Offline OpenEC OpenEC 624.9 OpenEC 1484.5 ( 12 OpenEC OpenEC HDFS-RAID Recovery Full-node tpHF-,adso hti losboth allows it that show and HDFS-3, atop

, 285.6 nHF-ADfrfi oprsn.We comparisons. fair for HDFS-RAID on 8 OpenEC OpenEC osntueMpeuei fieencod- offline in MapReduce use not does ) ecmaetepromnebetween performance the compare We OpenEC

Scd n rtsafieo ieranging size of file a writes and code RS 308.3 ( 9 a lgtyhge hogptthan throughput higher slightly has , 6 ) esstefiesz,adsuythe study and size, file the versus Scd ie,attlo 0cod- 30 of total a (i.e., code RS

nofln noig enow We encoding. offline in Thpt (MiB/s) efrhrcmaeoln and online compare further We 200 400 600 800 n HDFS-3. and 0 OpenEC 63 64 32 16 8 4 2 1 Offline (Degraded) Online (Degraded) Offline (Normal) Online (Normal) File Size(MiB) tie h file the stripes OpenEC in- ihtefiesz,snetedt rnfrpromnebecomes performance transfer data the since size, file the with w mtterslshr nteitrs fspace). of interest the in here results the omit (we and respectively) encoding, offline and online realize (which reads parallel the of one in slowdown any as 44-718%), (by oprsn ihntv oigoperations: coding native with Comparisons 340 17th USENIXConference onFile andStorage Technologies overhead. faster limited much incur are and coding 7), ECDAG-based (Figure of operations computations read/write the Neverthe- overall block. the single to a compared decoding less, is for there task as compute one decoding, only native than 0.6-3.2%) less (by slightly throughput only has one decoding ECDAG-based decoding which for in throughput block, decoding the shows 8(b) the Figure computing creating for for tasks overhead compute additional multiple is encod- there native because than mainly ing, throughput lower 29-38% has throughput encoding encoding the shows 8(a) for Figure ISA- HDFS-3. using in operations coding L native the of that coding with ECDAG-based operations the of performance computational the in as differences performance similar show they HDFS-RAID and HDFS-3 original in the implementations coding using validate experiments erasure To similar file. conduct the we recover results, to our additional file) seven original (i.e., the blocks over eight blocks retrieve to file needs larger it for as especially sizes, encoding online in is that encoding than offline less much in throughput read performance. degraded overall the the However, degrade can data online-encoded to encoding online normal in its that than encoding, higher offline much In is throughput parallel. read reads in issues blocks client eight the reads to which degraded in and performance, reads similar normal show both see encoding. encoding, also offline online We and In online larger. in differences become performance blocks the the as dominant more evaluation. read degraded our in the file delete the we stores encoding, that offline data block in data one file; (with the read to degraded deleted) block a and failures) other seven (without with read it encodes later and ( block blocks a in encoding, file offline the the than for stores less MiB); 64 is size block each block the that default after (note blocks completed the is seals write and file blocks eight across packets in Thpt (MiB/s) iue7d hw h eut.Tetruhu increases throughput The results. the shows 7(d) Figure 2000 4000 6000 8000 k 4MBbok under blocks 64-MiB iue8: Figure 0 § 64 96 (12,8)(14,10) (9,6) (6,4) .) ecmaeterpromnei normal a in performance their compare We 4.2). a noig()Decoding (b) Encoding (a) 7296.5 5176.4

oprsn ihntv oigoperations. coding native with Comparisons 5972.1 3659.6

4373.1 OpenEC Native 2746.7 4092.2

( 2617.6 n , k ) Thpt (MiB/s) Scds ECDAG-based codes. RS 1000 2000 3000 0 64 96 (12,8)(14,10) (9,6) (6,4)

n 2840.6

− 2751 k 1913.5 aiyblocks. parity ecompare We 1876.6 OpenEC OpenEC 1443.9 OpenEC Native 1424 1091.6 1085.1 0G/.Frte1G/ ae ewr rnfrbcmsthe becomes transfer network case, Gb/s 1 the For Gb/s. 10 eraiesvrlsaeo-h-r earfinl rsr cod- erasure repair-friendly state-of-the-art several realize We ihtefloigsolutions: following the with hs efcso vlaigterpromneo repairing of performance their evaluating on focus we Thus, the chooses setup default our performance, high achieve To since small too is size packet the if degrades performance The iei tlat6 i.Fgr ()sostetruhu ver- throughput the shows 9(b) Figure MiB. 64 block least the at when is network stabilizes size and and disk utilized, the better as are op- size bandwidths both block of the throughput with The KiB. increases 128 erations where as size, fixed is block size the packet versus the throughput the shows 9(a) ure the and under encoding reads online of degraded throughput single-client the on focus of mance sizes: packet and block of Impact • • fetch-and-compute a in ( block manner lost the decode to retrieves DataNodes it which in baseline, our to conforms performance gains. theoretical empirical the the that expect I/O), disk we and and computations coding to and (compared Gb/s bottleneck 1 cluster: local our in settings under bandwidth group two coding ure a in block lost repairs. one single-failure in I/O or bandwidth repair the imize from Recall abstraction. ECDAG § the on based solutions ing Designs Coding Erasure of Support KiB. 128 as 5.3 size packet the and MiB 64 parallelism. as less size is block there since large too is size packet packets, the individual if retrieving or for calls function MiB. many 64 are as there fixed is size block the where size, packet the sus

. hteitn earfinl oe r eindt min- to designed are codes repair-friendly existing that 2.1 Thpt (MiB/s) eset we mz h earbnwdh efcso w ainsof variants two on focus We bandwidth. repair min- to the sub-packetization imize leverage which [14], codes MSR 10(b)): are (Figure that codes blocks. MSR blocks data parity of six all global group from two local encoded a and from blocks, encoded data is three which of each blocks, set we codes, RS For [21]. LRC 10(a)): (Figure LRC as codes RS of approach repair conventional the use We 200 400 600 800 0 a ayn lc ie b ayn aktsizes packet Varying (b) sizes block Varying (a)

1 § ( .) ecmaetecnetoa earapproach repair conventional the compare We 2.2). Degraded Read Write OpenEC n iue9: Figure 2 Block Size(MiB) Block , k 4 ( = ) 8 16 10

mato lc n aktsizes. packet and block of Impact 32 aiswt lc n aktszs We sizes. packet and block with varies ,

6 64 )

nwihteeaetolclparity local two are there which in , 128 ecmaeR oe ihAzure’s with codes RS compare We 256 ( 9 , 6 ) ecmaeR oe with codes RS compare We

Scdsa in as codes RS Thpt (MiB/s) k 200 400 600 800 lcsfrom blocks esuyhwteperfor- the how study We ( 0 n , OpenEC k USENIX Association 8 ( = )

16 Degraded Read Write Packet Size(KiB) 32 9 64 , 6

k 128 econfig- We . ) o LRC, for ; § non-failed 256 ..Fig- 5.2. 512 1024 2048 hl hto utrycdsdost 1.35 to drops codes Butterfly of that while h esni htbt S oe ereedt from data retrieve codes MSR both that is reason The • • USENIX Association solutions. coding erasure the of Overall, properties overhead. the preserves I/O versus disk (seven higher codes incur Butterfly and than five) nodes con- storage codes more MISER to and nect repairs, for nodes storage non-failed 1.25 to drops codes MISER Gb/s of 10 gain the throughput in the codes network; Butterfly than MISER throughput example, less I/O have For codes disk significant. and more empirical computation become coding the overheads the network, since Gb/s 10 decrease the gains For theoretical gains. the with throughput degradations) are slight solutions em- only coding the (with erasure that consistent the observe of we gains network, throughput Gb/s pirical 1 the for For approach codes. erasure repair RS conventional the the of over gains solutions throughput coding theoretical the shows also Thpt (MiB/s) ecmaeR oe n R under DRC and codes RS compare We ( it ewe n w trg oe tdfeetlogical different at nodes storage two any between width logical three into cluster local our divide We setting. work the under repair conventional the with obeRgnrtn oe DC Fgr 10(d)): (Figure (DRC) Codes Regenerating Double 10(c)): (Figure algorithms Repair iue1 hw h eut;frorcmaios al 2 Table comparisons, our for results; the shows 10 Figure (with different racks across three evenly in group nodes coding each of blocks coded Gb/s. 10 remains rack two logical any same between the bandwidth within nodes the storage while [44], Gb/s 1 as racks the use net- We hierarchical racks. a in [19] DRC with codes RS compare them compare We by codes operations. RS repair pipelin- of partial performance parallelizing repair repair and the improve [29] [25], ing PPR namely algorithms, repair the require the (which consider [32] require codes (which Butterfly [45] and codes MISER codes: MSR 100 150 200 250 n 50 , 0 k ( ( = ) 8 , al 2: Table 4 bs10Gbps 1 Gbps 18.4 LRC(10, 6) RS (9,6) ) Gain 9 a R b S oe c earagrtm d DRC (d) algorithms Repair (c) codes MSR (b) LRC (a) IE code. MISER ,

6 36.5 ( ) 6 nbt ae,w itiueteerasure- the distribute we cases, both In . hoeia an fsaeo-h-r rsr oe rrpi loihsoe h ovninlrpi fR codes. RS of repair conventional the over algorithms repair or codes erasure state-of-the-art of gains Theoretical , 4 LRC s RS vs. ) 123.9 Scd,the code, RS 2 ( × 10 216.9 ( 9 , , 6 6 ) ) tc n / 3 omn olmtteband- the limit to command Butterfly s RS vs. rsr-oe lcseach). blocks erasure-coded Thpt (MiB/s) 100 200 300 400 500 ( 6 0 1.6 , 4 ( ) × 6 ( , esuyhwthe how study We utrycd,and code, Butterfly ( bs10Gbps 1 Gbps MISER (8,4) Butterfly (6,4) RS (6,4) 6 4 27.4 9 ( iue10: Figure , ) × 4 n , )

6 39.1 , n k ) Fgr 10(b)). (Figure = ( = ) 58.4 Scode. RS s RS vs. MISER k + OpenEC 2.28

6 160.6 n , 2 upr feauecdn designs. coding erasure of Support 4 ( .We ). × ≥

( 216.9 6 ) n 8 , , 4 and − We 2 4 201.3 ) × ) k 1 ) , 17th USENIX Conference onFileand StorageTechnologies 341 PPR s RS vs. Thpt (MiB/s) 100 200 300 400 500 eaancngr he-aklgcltplg norlocal our in topology logical three-rack a configure again We how evaluate now We efial evaluate finally We the consider We ( i)ol automated only (ii) i uoae piiain ( optimizations automated via ed aesmlrtruhu) lo h efrac of performance the Also, OpenEC throughput). similar have cluster reads local our in in ob- as experiments We patterns throughput HDFS-3. consistent atop serve encoding offline and online realizes in as the ations consider We DataNode. HDFS an and client, HDFS an remain- the of each and ing NameNode, HDFS the and controller § with settings three EC2 Amazon in Performance 5.5 of throughput shows repair the 11(b) OpenEC increases Figure optimization repair node. the same that trigger the and node on storage recovery pipelined one full-node a of blocks of all construction delete We the ECDAG. via performance repair the 38-44%. by throughput both enabling while 37-42%, 11(a) only Figure enabling racks. that shows three across blocks the via 30 distributes HDFS-3 evenly writing into by blocks of encoding groups offline coding of throughput the measure mated disabled, is optimization automated (i) configurations: three in experiments DRC our in as cluster Optimizations Automated with Improvements 5.4 2 ( 9 . o h ntnetp) n ntnehssthe hosts instance One type). instance the for 5.1 × 9 0 , , ( eas vlaehow evaluate also We for performance encoding offline the compare first We 6 9 6 N ) , ) 6 ) Scd,adall and code, RS −

bs10Gbps 1 Gbps 18.4 Pipeline (9,6) PPR (9,6) RS (9,6) BindX

1 35.4 Pipeline s RS vs. ntne ot nOClet an OECClient, an hosts instances clswl ihtenme finstances. of number the with well scales 82-128%. by

§ 101.1 ..Fgr 2sosterslswhen results the shows 12 Figure 5.2. 6 and × ( § 9

( 123.9 . eg,bt omlrasaddegraded and reads normal both (e.g., 5.2 ( , 9 6 8 N , ) BindY 6 OpenEC , 200.2 ) 6 ntne,where instances, BindX ) OpenEC N 322.7 , ( − s RS vs. DRC 10 BindX OpenEC r nbe ordfutsetting). default (our enabled are 1 , 1.5 seald n ii ohauto- both (iii) and enabled, is let su ifrn ai oper- basic different issue clients 8 § BindX ( nAao C.W configure We EC2. Amazon in

) Thpt (MiB/s) .)fraheacia topology. hierarchical a for 4.4) 6 × 20 40 60 80 ( and , , 6 0 4 , civspromnegains performance achieves nrae h hogptby throughput the increases ) 4 ) uoaial improves automatically and ( 36.8 N 12 § 6 )(9,6) (6, 4) s RS vs. DRC 5.3. = , BindY

10 54.7 2 0 0 n 0(see 30 and 20, 10, ( SDRC RS × OpenEC 9 ) ( OpenEC , 9 6 Scds We codes. RS , ) 6 ) nrae the increases 27.8 OpenEC OpenEC 55.6 which , agent, (MDS) e rsr oigsolutions: coding erasure New work Related 6 342 17th USENIXConference onFile andStorage Technologies tail mitigate [20] reads degraded Proactive and across nodes. repair [29] single-failure storage PPR a parallelize occurs. [25] failures pipelining of repair re- number repair Lazy a threshold deferring a by until codes. executions repair erasure reduces general [11,47] pair to apply that algorithms topologies. hierarchical in bandwidth repair rack repair [21 less (e.g., for codes I/O MDS than redundancy storage more with trade constructed easily be can any but MSR point bandwidth Aside minimum repair the more Ceph). than slightly incur in codes evaluated MDS some are codes, codes [32] Butterfly [52] while Clay HDFS, in and PM- evaluated (e.g., are DSSs [39] open-source codes in RBT 52], evaluated 50, are 42,45, which 32,39, [18, of codes some MSR Follow- new property. design MDS studies the up preserve and bandwidth repair minimize the [14] codes (MSR) regenerating Minimum-storage I/O. or bandwidth repair the reduce solutions to coding proposed erasure been new have many hence and costs, repair parameters coding general n support codes RS Second, data). (i.e., redundancy storage minimum the against tolerance fault two are for codes mainly RS 55]), First, 54, 31, 15, reasons. 7, [5, (e.g., today deployed a uoae id n id b earoptimization Repair (b) BindY and BindX Automated (a)

and Thpt (MiB/s)

nte ieto frsac st einefiin repair efficient design to is research of direction Another codes. erasure new design to is research of direction One Thpt (MiB/s) 2000 4000 6000 8000 200 400 600 ( n 0 0 k , enn htudrtecdn parameters coding the under that meaning , a nieecdn b fieencoding Offline (b) encoding Online (a) k poie that (provided )

rt Normal Write 976.5 86 1,)(12,10) (10,8) (8,6) 179.5 eg,[44],wiesm o-D rsr codes erasure non-MDS some while [24,41]), (e.g., N=30 N=20 N=10 BindX &BindY(default) BindX No Optimization iue12: Figure 2152.1 286.8

– 3221.1 290.5 iue11: Figure 3 3 4 9) R 1]mnmzstecross- the minimizes [19] DRC 49]). 44, 33, 23,

Read 2538.3 183 4907.2 298.4 7270 316.7 Degraded efrac nAao EC2. Amazon in Performance k Read n 2529.8 187 uoae optimization. Automated < − 5078.9 323.6

n 7235.4 k 336 .Hwvr Scdshv high have codes RS However, ). lc alrsi civdwith achieved is failures block aiu itneseparable distance maximum Thpt (MiB/s) Scds[3 r widely are [43] codes RS 1000 1500 2000 2500 Thpt (MiB/s) 500 120 40 80 0 0 n / k Encoding 1188.9 (12,10) (10,8) (8,6) 55.1 Offline ie h original the times 1825.7 100.8 2195.5 42.3 ( Recovery Full-node n 167.2 96.6 , k 425.9 N=30 N=20 N=10 ) Pipeline Default the , 36.7 541 74.1 5)adNF 6821,6529) uhiL snow is Li Runhui 61502191). (61872414, NSFC and 15G) ihSnfrTcnlge n stecrepnigauthor. corresponding the is and Technologies Sangfor C7036- with CRF 14216316, (GRF HKRGC by supported was Acknowledgments: how study we work, hspprpresents paper This ie nfidadcngrbeeauecdn management coding erasure configurable and unified vides eaue[6 saClbayta uprsvroserasure various supports that library C a is [36] Jerasure ofiual storage: Configurable implement abstraction. to ECDAG libraries the above via the codes leverage erasure can it and DSSs, in and Jerasure unifies both ISA-L. which including [2], libraries coding liberasurecode erasure on Open- different by builds used It library Python Swift. a Stack and is Ceph [9] (e.g., PyEClib production 3.0). in Hadoop used and widely Jerasure are Both libraries hardware. ISA-L C Intel for optimizes another arithmetic it is Field and [6] codes, en- Galois erasure ISA-L to various supports [34] that arithmetic. GF-Complete library Field with Galois extended fast later able is It codes. [55]. Tahoe-LAFS by used is and codes Zfec RS programming. implements coding [10] erasure for available are braries programming: coding Erasure framework. unified and a codes in erasure algorithms new repair different supports cod- It erasure configurable management. and ing unified on focuses and spective requests. read of balancing load the via latencies n h nnmu eiwr o hi omns hswork This comments. their for reviewers anonymous the and objects. variable-size in data Swift) organize and that Ceph (e.g., DSSs object-storage-based especially de- also [27] integrate report we how technical scribes Our blocks. fixed-size in data manner. flexible and simple a erasure in deploy solutions to coding designers on coding light erasure sheds facilitate work to Our how algorithms. repair and codes state-of-the-art erasure of EC2 variety Amazon a and supporting cluster while local environments, both in HDFS atop formance coding of Our workflows the operations. configure and to codes abstraction erasure ECDAG define the leverages It storage. distributed for Work Future and Conclusions 7 environments. distributed erasure in configurable management on coding focuses specifically but SDS, from 51]. 48, (SDS) [16, cen- storage framework a software-defined the by under coordination controller the tralized or client-side 37] 28, either 13, [12, on customization rely re- approaches application Existing on based quirements. policies storage con- different and management figuring system storage for flexibility providing nieteaoestudies, above the Unlike hspprcretyfcsso DS hc organizes which HDFS, on focuses currently paper This OpenEC OpenEC mhszstedpomn feauecodes erasure of deployment the emphasizes OpenEC OpenEC etakorsehr,BetWelch, Brent shepherd, our thank We OpenEC hr sa nraigdmn of demand increasing an is There OpenEC rttp civsefcieper- effective achieves prototype OpenEC a edpoe nohrDSSs, other in deployed be can e rmwr htpro- that framework new a , orw h aeprinciple same the borrows eea pnsuc li- open-source Several noQS[1.I future In [31]. QFS into agt ifrn per- different a targets USENIX Association References [18] Y. Hu, H. C. Chen, P. P. Lee, and Y. Tang. NCCloud: [1] 3.0.0. https://hadoop.apache. Applying network coding for the storage repair in a org/docs/r3.0.0/. cloud-of-clouds. In Proc. of USENIX FAST, 2012. [2] Erasure code API library written in c with plug- [19] Y. Hu, X. Li, M. Zhang, P. P. C. Lee, X. Zhang, P. Zhou, gable erasure code backends. https://github.com/ and D. Feng. Optimal repair layering for erasure-coded openstack/liberasurecode. data centers: From theory to practice. ACM Trans. on Storage, 13(4):33, 2017. [3] Facebook’s realtime distributed FS based on Apache Hadoop 0.20-append. https: [20] Y. Hu, Y. Wang, B. Liu, D. Niu, and C. Huang. Latency //github.com/facebookarchive/hadoop-20. reduction and load balancing in coded storage systems. In Proc. of ACM SoCC, 2017. [4] HDFS erasure coding. https://hadoop.apache. [21] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, org/docs/r3.0.0/hadoop-project-dist/ P. Gopalan, J. Li, S. Yekhanin, et al. Erasure coding hadoop-hdfs/HDFSErasureCoding.html. in Windows Azure storage. In Proc. of USENIX ATC, [5] HDFS-RAID. https://wiki.apache.org/hadoop/ 2012. HDFS-RAID. [22] O. Khan, R. C. Burns, J. S. Plank, W. Pierce, and [6] Intelligent storage acceleration library. https:// C. Huang. Rethinking erasure codes for cloud file sys- github.com/01org/isa-l. tems: Minimizing I/O for recovery and degraded reads. [7] OpenStack Swift. https://wiki.openstack.org/ In Proc. of USENIX FAST, 2012. wiki/Swift. [23] O. Kolosov, G. Yadgar, M. Liram, I. Tamo, and A. Barg. [8] Redis. http://redis.io/. On fault tolerance, locality, and optimality in locally repairable codes. In Proc. of USENIX ATC, 2018. [9] A simple Python interface for implementing erasure codes. https://github.com/openstack/pyeclib. [24] K. Kralevska, D. Gligoroski, R. E. Jensen, and H. Øverby. Hashtag erasure codes: From theory to [10] zfec – an efficient, portable erasure coding tool. https: practice. IEEE Trans. on Big Data, 2017. //github.com/tahoe-lafs/zfec. [25] R. Li, X. Li, P. P. C. Lee, and Q. Huang. Repair pipelin- [11] R. Bhagwan, K. Tati, Y. Cheng, S. Savage, and G. M. ing for erasure-coded storage. In Proc. of USENIX ATC, Voelker. Total recall: System support for automated 2017. availability management. In Proc. of USENIX NSDI, 2004. [26] R. Li, J. Lin, and P. P. C. Lee. Enabling concurrent failure recovery for regenerating-coding-based storage [12] F. Chen, M. P. Mesnier, and S. Hahn. Client-aware cloud systems: From theory to practice. IEEE Trans. on Com- storage. In Proc. of IEEE MSST, 2014. puters, 64(7):1898–1911, 2015. [13] J. Y. Chung, C. Joe-Wong, S. Ha, J. W.-K. Hong, and [27] X. Li, R. Li, P. P. C. Lee, and Y. Hu. OpenEC: To- M. Chiang. CYRUS: Towards client-defined cloud stor- ward unified and configurable erasure coding manage- age. In Proc. of ACM EuroSys, 2015. ment in distributed storage systems. Technical re- [14] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, port, CUHK, 2019. http://www.cse.cuhk.edu.hk/ and K. Ramchandran. Network coding for distributed ~pclee/www/pubs/tech_openec.pdf. storage systems. IEEE Trans. on Information Theory, [28] M. Mesnier, F. Chen, T. Luo, and J. B. Akers. Dif- 56(9):4539–4551, 2010. ferentiated storage services. In Proc. of ACM SOSP, [15] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. 2011. Truong, L. Barroso, C. Grimes, and S. Quinlan. Avail- [29] S. Mitra, R. Panta, M.-R. Ra, and S. Bagchi. Partial- ability in globally distributed storage systems. In Proc. parallel-repair (PPR): A distributed technique for repair- of USENIX OSDI, 2010. ing erasure coded storage. In Proc. of ACM EuroSys, [16] R. Gracia-Tinedo, J. Sampe,´ E. Zamora, M. Sanchez-´ 2016. Artigas, P. Garc´ıa-Lopez,´ Y. Moatti, and E. Rom. Crys- [30] S. Muralidhar, W. Lloyd, S. Roy, C. Hill, E. Lin, W. Liu, tal: Software-defined storage for multi-tenant object S. Pan, S. Shankar, V. Sivakumar, L. Tang, et al. f4: stores. In Proc. of USENIX FAST, 2017. Facebook’s warm blob storage system. In Proc. of [17] K. M. Greenan, E. L. Miller, and T. J. E. Schwarz. Op- USENIX OSDI, 2014. timizing Galois field arithmetic for diverse processor [31] M. Ovsiannikov, S. Rus, D. Reeves, P. Sutter, S. Rao, architectures and applications. In Proc. of IEEE MAS- and J. Kelly. The Quantcast file system. In Proc. of COTS, 2008. VLDB Endowment, 2013.

USENIX Association 17th USENIX Conference on File and Storage Technologies 343 [32] L. Pamies-Juarez, F. Blagojevic, R. Mateescu, C. Guyot, elephants: Novel erasure codes for big data. In Proc. of E. E. Gad, and Z. Bandic. Opening the chrysalis: On VLDB Endowment, 2013. the real repair performance of MSR codes. In Proc. of [45] N. B. Shah, K. Rashmi, P. V. Kumar, and K. Ramchan- USENIX FAST, 2016. dran. Interference alignment in regenerating codes for [33] D. S. Papailiopoulos, J. Luo, A. G. Dimakis, C. Huang, distributed storage: Necessity and code constructions. and J. Li. Simple regenerating codes: Network coding IEEE Trans. on Information Theory, 58(4):2134–2158, for cloud storage. In Proc. of IEEE INFOCOM, 2012. 2012. [34] J. S. Plank, K. Greenan, E. Miller, and W. Houston. [46] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The GF-Complete: A comprehensive open source library for Hadoop distributed file system. In Proc. of IEEE MSST, galois field arithmetic. Technical Report UT-CS-13-703, 2010. University of Tennessee, 2013. [47] M. Silberstein, L. Ganesh, Y. Wang, L. Alvisi, and [35] J. S. Plank and C. Huang. Tutorial: Erasure coding for M. Dahlin. Lazy means smart: Reducing repair band- storage applications. Slides presented at USENIX FAST width costs in erasure-coded distributed storage. In Proc. 2013, Feb 2013. of ACM SYSTOR, 2014. [36] J. S. Plank, S. Simmerman, and C. D. Schuman. Jera- [48] I. A. Stefanovici, B. Schroeder, G. O’Shea, and sure: A library in C/C++ facilitating erasure coding for E. Thereska. sRoute: Treating the storage stack like storage applications - version 1.2. Technical Report a network. In Proc. of USENIX FAST, 2016. CS-08-627, University of Tennessee, 2008. [49] I. Tamo and A. Barg. A family of optimal locally re- [37] R. Pontes, D. Burihabwa, F. Maia, J. Paulo, V. Schi- coverable codes. IEEE Trans. on Information Theory, avoni, P. Felber, H. Mercier, and R. Oliveira. SafeFS: A 60(8):4661–4676, 2014. modular architecture for secure user-space file systems [50] I. Tamo, Z. Wang, and J. Bruck. Zigzag codes: MDS (one FUSE to rule them all). In Proc. of ACM SYSTOR, array codes with optimal rebuilding. IEEE Trans. on 2017. Information Theory, 59(3):1597–1616, 2013. [38] N. Prakash, V. Abdrashitov, and M. Medard.´ The [51] E. Thereska, H. Ballani, G. O’Shea, T. Karagiannis, storage versus repair-bandwidth trade-off for clustered A. Rowstron, T. Talpey, R. Black, and T. Zhu. IOFlow: storage systems. IEEE Trans.on Information Theory, A software-defined storage architecture. In Proc. of 64(8):5783–5805, Aug 2018. ACM SOSP, 2013. [39] K. Rashmi, P. Nakkiran, J. Wang, N. B. Shah, and [52] M. Vajha, V. Ramkumar, B. Puranik, G. Kini, K. Ramchandran. Having your cake and eating it E. Lobo, B. Sasidharan, P. V. Kumar, A. Barg, M. Ye, too: Jointly optimal erasure codes for I/O, storage, and S. Narayanamurthy, et al. Clay codes: Moulding MDS network-bandwidth. In Proc. of USENIX FAST, 2015. codes to yield an MSR code. In Proc. of USENIX FAST, [40] K. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, 2018. and K. Ramchandran. A solution to the network chal- [53] H. Weatherspoon and J. D. Kubiatowicz. Erasure coding lenges of data recovery in erasure-coded distributed vs. replication: A quantitative comparison. In Proc. of storage systems: A study on the Facebook warehouse IPTPS, 2002. cluster. In Proc. of USENIX HotStorage, 2013. [54] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long, [41] K. V. Rashmi, N. B. Shah, D. Gu, H. Kuang, and C. Maltzahn. Ceph: A scalable, high-performance D. Borthakur, and K. Ramchandran. A ”hitchhiker’s” distributed file system. In Proc. of USENIX OSDI, 2006. guide to fast and efficient data reconstruction in erasure- [55] Z. Wilcox-O’Hearn and B. Warner. Tahoe: the least- coded data centers. In Proc. of ACM SIGCOMM, 2014. authority filesystem. In Proc. of ACM StorageSS, 2008. [42] K. V. Rashmi, N. B. Shah, and P. V. Kumar. Optimal exact-regenerating codes for distributed storage at the MSR and MBR points via a product-matrix construction. IEEE Trans. on Information Theory, 57(8):5227–5239, 2011. [43] I. S. Reed and G. Solomon. Polynomial codes over certain finite fields. Journal of the Society for Industrial and Applied Mathematics, 8(2):300–304, 1960. [44] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. Xoring

344 17th USENIX Conference on File and Storage Technologies USENIX Association