
Fractional repetition codes with flexible repair from combinatorial designs

Oktay Olmez and Aditya Ramamoorthy

arXiv:1408.5780v2 [cs.IT] 8 Feb 2016

This work was supported in part by the NSF under grants CCF-1320416, CCF-1149860, CCF-1116322 and DMS-1120597 and by TUBITAK project numbers 114F246 and 115F064. The material in this work has appeared in part at the 50th Annual Allerton Conference on Communication, Control and Computing, 2012, the 2013 International Symposium on Network Coding and the 2013 Asilomar Conference on Signals, Systems and Computers.

Oktay Olmez ([email protected]) is with the Department of Mathematics at Ankara University, Tandogan, Ankara, Turkey. Aditya Ramamoorthy ([email protected]) is with the Department of Electrical and Computer Engineering at Iowa State University, Ames, IA 50011.

Copyright (c) 2014 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

Abstract: Fractional repetition (FR) codes are a class of regenerating codes for distributed storage systems with an exact (table-based) repair process that is also uncoded, i.e., upon failure, a node is regenerated by simply downloading packets from the surviving nodes. In our work, we present constructions of FR codes based on Steiner systems and resolvable combinatorial designs such as affine geometries, Hadamard designs and mutually orthogonal Latin squares. The failure resilience of our codes can be varied in a simple manner. We construct codes with normalized repair bandwidth (β) strictly larger than one; these cannot be obtained trivially from codes with β = 1. Furthermore, we present the Kronecker product technique for generating new codes from existing ones and elaborate on their properties. FR codes with locality are those where the repair degree is smaller than the number of nodes contacted for reconstructing the stored file. For these codes we establish a tradeoff between the local repair property and failure resilience and construct codes that meet this tradeoff. Much of prior work only provided lower bounds on the FR code rate. In our work, for most of our constructions we determine the code rate for certain parameter ranges.

Index Terms: fractional repetition code, combinatorial design, Steiner systems, affine geometry, high girth, resolvable design, regenerating codes, local repair.

I. INTRODUCTION

Large scale data storage systems that are employed in social networks, video streaming websites and cloud storage are becoming increasingly popular. In these systems, the integrity of the stored data and the speed of the data access need to be maintained even in the presence of unreliable storage nodes. This issue is typically handled by introducing redundancy in the storage system, through the usage of replication and/or erasure coding. However, the large scale, distributed nature of the systems under consideration introduces another issue. Namely, if a given storage node fails, it needs to be regenerated so that the new system continues to have the properties of the original system. It is of course desirable to perform this regeneration in a distributed manner and optimize performance metrics associated with the regeneration process. Firstly, one would like to ensure that the regeneration process is fast. For this purpose we would like to minimize the data that needs to be downloaded from the surviving nodes. Moreover, we would like the surviving nodes and the new node to perform very little (ideally no) computation, as this also induces a substantial delay in the regeneration process that is comparable to the download time (since nowadays, memory access bandwidth is comparable to network bandwidth [1]). In addition, the regeneration induces a workload on the surviving storage nodes and it is desirable to perform the regeneration by connecting to a small number of nodes. Connecting to a small set of nodes also reduces the overall energy consumption of the system.

In recent years, codes which are designed to satisfy the needs of data storage systems have been the subject of much investigation and there is extensive literature on this topic. Depending upon the specific metrics that are optimized there are different requirements that the distributed storage system needs to satisfy. However, broadly speaking, all systems have the following general characteristics. A distributed storage system (henceforth abbreviated to DSS) consists of n storage nodes, each of which stores α packets (we use symbols and packets interchangeably). A given user, also referred to as the data collector, needs to have the ability to reconstruct the stored file by contacting any k nodes; this is referred to as the maximum distance separability (MDS) property of the system. To ensure reliability in the system, the DSS also needs to repair a failed node. This is accomplished by contacting a set of d surviving nodes and downloading β packets from each of them for a total repair bandwidth of γ = dβ packets. Thus, the system has a repair degree of d, normalized repair bandwidth β and total repair bandwidth γ. The new DSS should continue to have the MDS property.

A simple technique for obtaining a DSS is to treat the file that needs to be stored as a set of symbols over a large enough finite field, generate encoded symbols by using an MDS code (such as a Reed-Solomon (RS) code) and then store each encoded symbol on a different storage node. It is well recognized that the drawback of this method is that upon failure of a given storage node, a large amount of data needs to be downloaded from the remaining storage nodes (equivalent to recreating the file). To address this issue, the technique of regenerating codes was developed in the work of Dimakis et al. [2]. In the framework of [2], the repair degree d ≥ k and the system needs to have the property that a failed node can be repaired from any set of d surviving nodes. The principal idea of regenerating codes is to use subpacketization. In particular, one treats a given physical block as consisting of multiple symbols (unlike the MDS code that stores exactly one symbol in each node). Coding is now performed across the packets such that the file can be recovered by contacting a certain minimum number of nodes. In addition, one can regenerate a failed node by downloading appropriately coded data from the surviving nodes. The work of [2] identified a fundamental tradeoff between the amount of storage at each node and the amount of data downloaded for repairing a failed node under the mechanism of functional repair, where the new node is functionally equivalent to the failed node, though it may not be an exact copy of it. Two points on the curve deserve special mention and are arguably of the most interest from a practical perspective. The minimum bandwidth regenerating (MBR) point refers to the point where the repair bandwidth γ is minimum. Likewise, the minimum storage regenerating (MSR) point refers to the point where the storage per node α is minimum.

In a different line of work, it has been argued that repair bandwidth is not the only metric for evaluating the repair process. It has been observed that the number of nodes that are contacted for purposes of repair is also an important metric that needs to be considered. The model of [2], which enforces repair from any set of d surviving nodes, requires d to be at least k.

[...] DSS needs to satisfy the property of exact and uncoded repair, i.e., the regenerating node needs to produce an exact copy of the failed node by simply downloading packets from the surviving nodes. This allows the entire system to work without requiring any computation at the surviving nodes. In addition, they considered systems that are resilient to multiple (> 1) failures. However, the DSS only has the property that the repair can be conducted by contacting some set of d nodes, i.e., unlike the original setup, repair is not guaranteed by contacting any set of d nodes. This is reasonable as most practical systems operate via a table-based repair, where the new node is provided information on the set of surviving nodes that it needs to contact. The work of [7] proposed a construction whereby an outer MDS code is concatenated with an inner “fractional repetition” code that specifies the placement of the coded symbols on the storage nodes. The main challenge here is to design the inner fractional repetition (FR) code in a systematic manner.

In this work, we present several families of FR codes and analyze their properties. This paper is organized as follows. In Section II, we outline our precise problem formulation, elaborate on the related work in the literature and summarize the contributions of our work. We discuss our FR code

Fig. 1: (a) A DSS with $(n, k, d, \alpha) = (5, 3, 4, 4)$. Each node contains a subset of size 4 of the packets from $\{c_1, \ldots, c_{10}\}$. The first node, for instance, contains the symbols $c_i$, $i = 1, \ldots, 4$, that is, $\{c_1, c_2, c_3, c_4\}$. When there is no confusion we simply use the notation $c_1 c_2 c_3 c_4$ instead of the set notation $\{c_1, c_2, c_3, c_4\}$. (b) The DSS is constructed by applying a $(20, 18)$-MDS code followed by the inner fractional repetition code shown in the figure. It is specified with parameters $(5, 3, 4, 8)$. When a node fails we contact the remaining four nodes and download two packets from each to repair the failed node. This DSS can be obtained from the $(5, 3, 4, 4)$ DSS on the left by trivial β-expansion, where β = 2.
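To make the parameters of the $(5, 3, 4, 4)$ DSS of Fig. 1(a) concrete, the following is a minimal Python sketch of table-based, uncoded repair. The figure itself is not reproduced in the text, so the symbol placement used here is an assumption: the ten symbols are identified with the edges of the complete graph on five vertices, listed in lexicographic order, and each node stores the edges incident to it. This choice is consistent with the caption's statement that the first node holds c1c2c3c4, but the other nodes' contents are hypothetical. The helper names `repair_plan` and `min_distinct_symbols` are likewise illustrative.

```python
from itertools import combinations

# Parameters of the DSS in Fig. 1(a).
n, k, d, alpha = 5, 3, 4, 4

# Assumption: the 10 symbols are the edges of the complete graph K5 in
# lexicographic order, and node v stores every edge that touches v.
edges = list(combinations(range(1, n + 1), 2))
symbol = {edge: f"c{i}" for i, edge in enumerate(edges, start=1)}
storage = {v: [symbol[e] for e in edges if v in e] for v in range(1, n + 1)}


def repair_plan(failed):
    """Map each surviving node to the packets it sends to rebuild `failed`."""
    lost = set(storage[failed])
    return {
        v: [s for s in storage[v] if s in lost]
        for v in storage
        if v != failed
    }


def min_distinct_symbols(num_nodes):
    """Fewest distinct symbols seen by a collector over all node subsets."""
    return min(
        len(set().union(*(storage[v] for v in nodes)))
        for nodes in combinations(storage, num_nodes)
    )


if __name__ == "__main__":
    for v, packets in storage.items():
        print(f"node {v}: {' '.join(packets)}")
    plan = repair_plan(1)
    beta = max(len(p) for p in plan.values())
    print("repair plan for node 1:", plan)
    print("beta =", beta, "gamma = d * beta =", d * beta)
    print("distinct symbols from any", k, "nodes >=", min_distinct_symbols(k))
```

Under this assumed placement, every symbol is stored on exactly two nodes, a failed node is rebuilt by copying one packet from each of the d = 4 survivors (so β = 1 and γ = dβ = 4 = α), and any k = 3 nodes together hold at least 9 distinct symbols. The last figure is consistent with the $(20, 18)$-MDS outer code quoted for the β = 2 expansion in Fig. 1(b), since 18 = 2 × 9.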
