XORing Elephants: Novel Erasure Codes for Big Data
Maheswaran Sathiamoorthy (University of Southern California), Megasthenis Asteris (University of Southern California), Dimitris Papailiopoulos (University of Southern California), Alexandros G. Dimakis (University of Southern California), Ramkumar Vadali (Facebook), Scott Chen (Facebook), Dhruba Borthakur (Facebook)

ABSTRACT

Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability.

This paper shows how to overcome this limitation. We present a novel family of erasure codes that are efficiently repairable and offer higher reliability compared to Reed-Solomon codes. We show analytically that our codes are optimal on a recently identified tradeoff between locality and minimum distance.

We implement our new codes in Hadoop HDFS and compare to a currently deployed HDFS module that uses Reed-Solomon codes. Our modified HDFS implementation shows a reduction of approximately 2× on the repair disk I/O and repair network traffic. The disadvantage of the new coding scheme is that it requires 14% more storage compared to Reed-Solomon codes, an overhead shown to be information theoretically optimal to obtain locality. Because the new codes repair failures faster, this provides higher reliability, which is orders of magnitude higher compared to replication.

1. INTRODUCTION

MapReduce architectures are becoming increasingly popular for big data management due to their high scalability properties. At Facebook, large analytics clusters store petabytes of information and handle multiple analytics jobs using Hadoop MapReduce. Standard implementations rely on a distributed file system that provides reliability by exploiting triple block replication. The major disadvantage of replication is the very large storage overhead of 200%, which reflects on the cluster costs. This overhead is becoming a major bottleneck as the amount of managed data grows faster than data center infrastructure.

For this reason, Facebook and many others are transitioning to erasure coding techniques (typically, classical Reed-Solomon codes) to introduce redundancy while saving storage [4, 19], especially for data that is more archival in nature. In this paper we show that classical codes are highly suboptimal for distributed MapReduce architectures. We introduce new erasure codes that address the main challenges of distributed data reliability and information theoretic bounds that show the optimality of our construction. We rely on measurements from a large Facebook production cluster (more than 3000 nodes, 30 PB of logical data storage) that uses Hadoop MapReduce for data analytics. Facebook recently started deploying an open source HDFS module called HDFS RAID ([2, 8]) that relies on Reed-Solomon (RS) codes. In HDFS RAID, the replication factor of "cold" (i.e., rarely accessed) files is lowered to 1 and a new parity file is created, consisting of parity blocks.

Using the parameters of Facebook clusters, the data blocks of each large file are grouped in stripes of 10 and for each such set, 4 parity blocks are created. This system (called RS(10, 4)) can tolerate any 4 block failures and has a storage overhead of only 40%. RS codes are therefore significantly more robust and storage efficient compared to replication. In fact, this storage overhead is the minimal possible for this level of reliability [7]. Codes that achieve this optimal storage-reliability tradeoff are called Maximum Distance Separable (MDS) [32] and Reed-Solomon codes [27] form the most widely used MDS family.
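The storage-overhead arithmetic above is easy to verify. The following minimal Python sketch (illustrative only; the storage_overhead helper is not part of HDFS RAID, and the parameters are simply the stripe sizes quoted above) compares triple replication with RS(10, 4):

def storage_overhead(data_blocks: int, redundant_blocks: int) -> float:
    # Redundant storage expressed as a fraction of the logical data stored.
    return redundant_blocks / data_blocks

# 3-replication: every block is stored three times, i.e. two redundant copies,
# and a block survives the loss of any two of its copies.
replication_overhead = storage_overhead(data_blocks=1, redundant_blocks=2)

# RS(10, 4): stripes of 10 data blocks protected by 4 parity blocks,
# tolerating any 4 block failures within the stripe.
rs_overhead = storage_overhead(data_blocks=10, redundant_blocks=4)

print(f"3-replication overhead: {replication_overhead:.0%}")  # 200%
print(f"RS(10, 4) overhead:     {rs_overhead:.0%}")           # 40%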
Classical erasure codes are suboptimal for distributed environments because of the so-called Repair problem: when a single node fails, typically one block is lost from each stripe that is stored in that node. RS codes are usually repaired with the simple method that requires transferring 10 blocks and recreating the original 10 data blocks even if a single block is lost [28], hence creating a 10× overhead in repair bandwidth and disk I/O.

Recently, information theoretic results established that it is possible to repair erasure codes with much less network bandwidth compared to this naive method [6]. There has been a significant amount of very recent work on designing such efficiently repairable codes; see Section 6 for an overview of this literature.

Our Contributions: We introduce a new family of erasure codes called Locally Repairable Codes (LRCs) that are efficiently repairable both in terms of network bandwidth and disk I/O. We analytically show that our codes are information theoretically optimal in terms of their locality, i.e., the number of other blocks needed to repair single block failures. We present both randomized and explicit LRC constructions starting from generalized Reed-Solomon parities.
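The locality idea can be illustrated with a toy example. The Python sketch below is a simplified illustration only, not the LRC construction presented in this paper (which obtains its local parities from generalized Reed-Solomon coefficients rather than plain XOR): ten data blocks are split into two groups of five, each protected by one local XOR parity, so a single lost block is rebuilt from 5 block reads instead of the 10 a naive RS(10, 4) repair transfers.

from functools import reduce

def xor_blocks(blocks):
    # Bytewise XOR of equally sized byte blocks.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

BLOCK_SIZE = 4  # toy size in bytes; production HDFS blocks are 256 MB

data = [bytes([i]) * BLOCK_SIZE for i in range(10)]   # 10 data blocks of a stripe
groups = [data[0:5], data[5:10]]                      # two locality groups of size 5
local_parity = [xor_blocks(g) for g in groups]        # one local parity per group

# Lose data block 7 and rebuild it from its locality group only.
lost = 7
group_id, pos = divmod(lost, 5)
survivors = [blk for i, blk in enumerate(groups[group_id]) if i != pos]
repaired = xor_blocks(survivors + [local_parity[group_id]])

assert repaired == data[lost]
print(f"block {lost} rebuilt from {len(survivors) + 1} block reads instead of 10")

Storing local parities of this kind alongside the four RS parities is consistent with the roughly 14% extra storage mentioned above (two additional blocks on top of a 14-block RS(10, 4) stripe).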
We also design and implement HDFS-Xorbas, a module that replaces Reed-Solomon codes with LRCs in HDFS-RAID. We evaluate HDFS-Xorbas using experiments on Amazon EC2 and a cluster in Facebook. Note that while LRCs are defined for any stripe and parity size, our experimental evaluation is based on an RS(10, 4) code and its extension to a (10, 6, 5) LRC to compare with the current production cluster.

Our experiments show that Xorbas enables approximately a 2× reduction in disk I/O and repair network traffic compared to the Reed-Solomon code currently used in production. The disadvantage of the new code is that it requires 14% more storage compared to RS, an overhead shown to be information theoretically optimal for the obtained locality. One interesting side benefit is that because Xorbas repairs failures faster, it provides higher availability, due to more efficient degraded reading performance. Under a simple Markov model evaluation, Xorbas has 2 more zeros in Mean Time to Data Loss (MTTDL) compared to RS(10, 4) and 5 more zeros compared to 3-replication.

[Figure 1: Number of failed nodes over a single month period in a 3000 node production cluster of Facebook.]

1.1 Importance of Repair

At Facebook, large analytics clusters store petabytes of information and handle multiple MapReduce analytics jobs. In a 3000 node production cluster storing approximately 230 million blocks (each of size 256 MB), only 8% of the data is currently RS encoded ("RAIDed"). Fig. 1 shows a recent trace of node failures in this production cluster. It is quite typical to have 20 or more node failures per day that trigger repair jobs, even when most repairs are delayed to avoid transient failures. A typical data node will be storing approximately 15 TB, and the repair traffic with the current configuration is estimated at around 10-20% of the total average of 2 PB/day cluster network traffic. As discussed, (10, 4) RS encoded blocks require approximately 10× more network repair overhead per bit compared to replicated blocks. We estimate that if 50% of the cluster was RS encoded, the repair network traffic would completely saturate the cluster network links. Our goal is to design more efficient coding schemes that would allow a large fraction of the data to be coded without facing this repair bottleneck (a back-of-envelope version of this estimate is sketched at the end of this section).

There are four additional reasons why efficiently repairable codes are becoming increasingly important in coded storage systems. The first is the issue of degraded reads: transient errors with no permanent data loss correspond to the vast majority of data center failure events [9, 19]. During the period of a transient failure event, block reads of a coded stripe will be degraded if the corresponding data blocks are unavailable. In this case, the missing data block can be reconstructed by a repair process, which is not aimed at fault tolerance but at higher data availability. The only difference with standard repair is that the reconstructed block does not have to be written to disk. For this reason, efficient and fast repair can significantly improve data availability.

The second is the problem of efficient node decommissioning. Hadoop offers the decommission feature to retire a faulty data node. Functional data has to be copied out of the node before decommission, a process that is complicated and time consuming. Fast repairs make it possible to treat node decommissioning as a scheduled repair and to start a MapReduce job that recreates the blocks without creating very large network traffic.

The third reason is that repair influences the performance of other concurrent MapReduce jobs. Several researchers have observed that the main bottleneck in MapReduce is the network [5]. As mentioned, repair network traffic is currently consuming a non-negligible fraction of the cluster network bandwidth. This issue is becoming more significant as the storage used is increasing disproportionately fast compared to network bandwidth in data centers. This increasing storage density trend emphasizes the importance of local repairs when coding is used.

Finally, local repair would be key in facilitating geographically distributed file systems across data centers. Geo-diversity has been identified as one of the key future directions for improving latency and reliability [13]. Traditionally, sites used to distribute data across data centers via replication. This, however, dramatically increases the total storage cost. Reed-Solomon codes across geographic locations at this scale would be completely impractical due to the high bandwidth requirements across wide area networks. Our work makes local repairs possible at a marginally higher storage overhead cost.

Replication is obviously the winner in optimizing the four issues discussed, but requires a very large storage overhead.
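To put rough numbers on the repair-traffic estimate from the beginning of Section 1.1, here is a back-of-envelope Python sketch. The failure rate, per-node capacity, and total cluster traffic are the approximate figures quoted in that paragraph; the assumptions that every failure triggers a full re-creation of the node's data and that replicated blocks are repaired by a single copy are crude simplifications, so the absolute numbers overshoot the 10-20% currently observed (repairs are often delayed), but they illustrate how quickly repair traffic grows with the RS-encoded fraction:

FAILURES_PER_DAY = 20              # typical node failures per day (Fig. 1)
TB_PER_NODE = 15                   # approximate data stored on a data node
CLUSTER_TRAFFIC_TB_PER_DAY = 2000  # ~2 PB/day average cluster network traffic

# Network bytes read per lost byte: ~1x for replicated blocks, ~10x for RS(10, 4).
REPAIR_READ_FACTOR = {"replicated": 1, "rs_10_4": 10}

def daily_repair_traffic_tb(rs_fraction: float) -> float:
    # Data to re-create per day, split between replicated and RS-encoded blocks.
    lost_tb = FAILURES_PER_DAY * TB_PER_NODE
    return lost_tb * ((1 - rs_fraction) * REPAIR_READ_FACTOR["replicated"]
                      + rs_fraction * REPAIR_READ_FACTOR["rs_10_4"])

for rs_fraction in (0.08, 0.50):
    traffic = daily_repair_traffic_tb(rs_fraction)
    share = traffic / CLUSTER_TRAFFIC_TB_PER_DAY
    print(f"{rs_fraction:.0%} RS encoded: ~{traffic:.0f} TB/day of repair traffic "
          f"(~{share:.0%} of average cluster traffic)")

Under these simplified assumptions, moving from 8% to 50% RS-encoded data pushes daily repair traffic close to the cluster's average network traffic, which is the saturation concern raised in Section 1.1.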