
DiskReduce: RAID for Data-Intensive Scalable Computing

Bin Fan, Wittawat Tantisiriroj, Lin Xiao, and Garth Gibson
Carnegie Mellon University

ABSTRACT

Data-intensive file systems, developed for Internet services and popular in cloud computing, provide high reliability and availability by replicating data, typically three copies of everything. Alternatively, high performance computing, which has comparable scale, and smaller scale enterprise storage systems get similar tolerance for multiple failures from lower overhead erasure encoding, or RAID, organizations. DiskReduce is a modification of the Hadoop distributed file system (HDFS) enabling asynchronous compression of initially triplicated data down to RAID-class redundancy overheads. In addition to increasing a cluster's storage capacity as seen by its users by up to a factor of three, DiskReduce can delay encoding long enough to deliver the performance benefits of multiple data copies.

1. INTRODUCTION

The Google file system (GFS) [11] and the Hadoop distributed file system (HDFS) [5] defined data-intensive file systems. They provide reliable storage and access to large scale data by parallel applications, typically through the Map/Reduce programming framework [10]. To tolerate frequent failures, each data block is triplicated, and the system is therefore capable of recovering from two simultaneous node failures. Though simple, a triplication policy comes with a high overhead cost in terms of disk space: 200%. The goal of this work is to reduce the storage overhead significantly while retaining double node failure tolerance and the performance advantage of multiple copies.

We present DiskReduce, an application of RAID in HDFS to save storage capacity. In this paper, we elaborate and investigate the following key ideas:

• A framework is proposed and prototyped for HDFS to accommodate different double failure tolerant encoding schemes, including a simple "RAID 5 and mirroring" encoding combination and a "RAID 6" encoding. The framework is extensible by replacing the encoding/decoding module with other double failure tolerant codes, and could easily be extended to higher failure tolerance.

• Asynchronous and delayed encoding, based on a trace of the Yahoo! M45 cluster [2], enables most applications to attain the performance benefit of multiple copies with minimal storage overhead. With M45 usage as an example, delaying encoding by as little as an hour can allow almost all accesses to choose among three copies of the blocks being read.

The balance of the paper is as follows: In Section 2, we discuss work related to DiskReduce. We present its design and prototype in Section 3 and Section 4. Section 5 discusses deferred encoding for read performance. We conclude in Section 6.

2. RELATED WORK

Almost all enterprise and high performance computing storage systems protect data against disk failures using a variant of the erasure protecting scheme known as Redundant Arrays of Inexpensive Disks [16]. Presented originally as a single disk failure tolerant scheme, RAID was soon enhanced by various double disk failure tolerance encodings, collectively known as RAID 6, including two-dimensional parity [12], P+Q Reed-Solomon codes [20, 8], XOR-based EvenOdd [3], and NetApp's variant Row-Diagonal Parity [9]. Lately, research has turned to greater reliability through codes that protect more, but not all, sets of larger than two disk failures [13], and to careful evaluation of the tradeoffs between codes and their implementations [17].

Networked RAID has also been explored, initially as a block storage scheme [15], then later for symmetric multi-server logs [14], Redundant Arrays of Independent Nodes [4], and peer-to-peer file systems [22], and it is in use today in the PanFS supercomputer storage clusters [23]. This paper explores similar techniques, specialized to the characteristics of large-scale data-intensive distributed file systems.

Deferred encoding for compression, a technique we use to recover capacity without losing the benefits of multiple copies for read bandwidth, is similar to two-level caching-and-compression in file systems [7], delayed parity updates in RAID systems [21], and alternative mirror or RAID 5 representation schemes [24].

[Figure 1 depicts the codewords: (a) Triplication keeps 2xN redundant data entries for N data blocks; (b) RAID 5 and mirror keeps a mirror copy of the N data entries plus a single coding entry; (c) RAID 6 keeps N data entries and two coding entries.]

Figure 1: Codewords providing protection against double node failures

Finally, our basic approach of adding erasure coding to data-intensive distributed file systems has been introduced into the Google file system [19] and, as a result of an early version of this work, into the Hadoop Distributed File System [6]. This paper studies the advantages of deferring the act of encoding.

3. DESIGN

In this section we introduce DiskReduce, a modification of HDFS [5].

3.1 Hadoop Distributed File System

HDFS [5] is the native file system of Hadoop [1], an open source Map/Reduce parallel programming environment, and is highly similar to GFS [11]. HDFS supports write-once-read-many semantics on files. Each HDFS cluster consists of a single metadata node and a usually large number of data nodes. The metadata node manages the namespace, file layout information and permissions. To handle failures, HDFS replicates files three times.

In HDFS, all files are immutable once closed. Files are divided into blocks, typically 64 MB, each stored on a data node. Each data node manages all file data stored on its persistent storage. It handles read and write requests from clients and performs "make a replica" requests from the metadata node. A background process in HDFS periodically checks for missing blocks and, if any are found, assigns a data node to replicate each block having too few copies.

3.2 DiskReduce Basics

One principle of the DiskReduce design is to minimize changes to the original HDFS logic. Specifically, DiskReduce takes advantage of two important features of HDFS: (1) files are immutable after they are written to the system, and (2) all blocks in a file are triplicated initially. DiskReduce makes no change to HDFS when files are committed and triplicated. It then exploits the background re-replication process in HDFS, but in a different way: in HDFS the background process looks for blocks with an insufficient number of copies, while in DiskReduce it also looks for blocks with high overhead (i.e., triplicated blocks) that can be turned into blocks with lower overhead (i.e., RAID encodings). Redundant blocks are not deleted before the encoding is done, to ensure data reliability during the encoding phase. Since this process is inherently asynchronous, DiskReduce can further delay encoding, when space allows, so that temporally local accesses can choose among multiple copies.
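
To make the background encoding pass concrete, the following minimal sketch shows one way such a scan could be structured. It is illustrative only: the BlockInfo and EncodingQueue types and their methods are hypothetical stand-ins, not the DiskReduce or HDFS APIs.

// Illustrative sketch only: BlockInfo and EncodingQueue are hypothetical types,
// not part of HDFS or the DiskReduce prototype.
import java.util.List;

class DelayedEncodingScanner {
    private final long delaySeconds;   // do not encode blocks younger than this
    private final EncodingQueue queue; // hands candidate blocks to an encoder

    DelayedEncodingScanner(long delaySeconds, EncodingQueue queue) {
        this.delaySeconds = delaySeconds;
        this.queue = queue;
    }

    // One pass of the background scan: select triplicated blocks that are
    // old enough to encode; excess copies are deleted only after encoding succeeds.
    void scan(List<BlockInfo> blocks) {
        for (BlockInfo b : blocks) {
            if (b.ageSeconds() < delaySeconds) continue; // readers may still want three copies
            if (b.isFullyTriplicated()) queue.add(b);
        }
    }
}

interface BlockInfo { long ageSeconds(); boolean isFullyTriplicated(); }
interface EncodingQueue { void add(BlockInfo b); }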

3.3 Encoding

Files are written initially as three copies on three different data nodes. We later compress the capacity used by encoding redundancy and deleting the excess copies. In our prototype, we have implemented two codes:

• RAID 5 and Mirror: As shown in Figure 1(b), we maintain both a mirror of all data and a RAID 5 encoding. The RAID 5 encoding is needed only if both copies of a block are lost. In this way, the storage overhead is reduced to 1 + 1/N, where N is the number of blocks in the parity's RAID set.

• RAID 6: DiskReduce also implements the leading scheme for double disk failure protection, as shown in Figure 1(c). The storage overhead is 2/N, where N is the number of data blocks in a RAID set.
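
As a quick check on these formulas (illustrative only, not part of the prototype), a few lines of Java reproduce the nominal overheads that Figure 2 compares across RAID group sizes:

public class SpaceOverhead {
    public static void main(String[] args) {
        int[] groupSizes = {2, 4, 8, 16, 32};
        System.out.println("N    triplication   RAID5+mirror   RAID6");
        for (int n : groupSizes) {
            double triplication = 2.0;           // two extra copies of every block
            double raid5Mirror  = 1.0 + 1.0 / n; // one mirror copy plus one parity per N blocks
            double raid6        = 2.0 / n;       // two coding blocks per N data blocks
            System.out.printf("%-4d %12.1f%% %14.1f%% %7.1f%%%n",
                    n, 100 * triplication, 100 * raid5Mirror, 100 * raid6);
        }
    }
}

For N = 8, the RAID 5 and mirror and RAID 6 overheads come to 112.5% and 25%, consistent with the roughly 113% and 25% reported for the prototype experiment in Section 4; these nominal figures ignore the file-boundary effects that Figure 2 accounts for.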

[Figure 2 plots space overhead (%) against RAID group size (2 to 32) for four configurations: RAID 5 and mirror within a file, RAID 5 and mirror across files, RAID 6 within a file, and RAID 6 across files.]

Figure 2: Space overhead by different grouping strategies according to the file size distribution on the Yahoo! M45 cluster. The overhead of triplication is 200%.

Based on a talk about our previous DiskReduce work [6], a userspace RAID 5 and mirror encoding scheme has been implemented on top of HDFS and may appear in the next HDFS release. In that implementation, only blocks from the same file are grouped together. Alternatively, Figure 2 shows the capacity overhead derived from a file size distribution on the Yahoo! M45 cluster for the two encoding schemes, with blocks grouped for encoding either within a file or across files. M45 is a cluster with approximately 4,000 processors, three terabytes of memory, and 1.5 petabytes of disk. We can see that grouping across files may result in a 40% reduction in capacity overhead for RAID 5 and mirror and 70% for RAID 6, because files on M45 are typically small relative to 64 MB blocks, and users often split data sets into many files. Our prototype explores this difference by grouping consecutively created blocks regardless of file boundaries on each node.

4. DISKREDUCE PROTOTYPE

We have prototyped DiskReduce as a modification to the Hadoop Distributed File System (HDFS) version 0.20.0. Currently, the DiskReduce prototype supports only two encoding schemes: RAID 6, and RAID 5 and mirror. The RAID 6 encoding scheme uses the open-source Blaum-Roth RAID 6 code released in the Jerasure erasure coding library developed by the University of Tennessee [18]. The RAID 5 and mirror encoding scheme uses a simple block XOR operation written in Java to replace one of the three data copies of N blocks with a single-block RAID parity. Without comments, our prototype implementation consists of less than 2,000 lines of Java code, along with the Jerasure erasure coding library, which is itself about 7,000 lines of C code.
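
The parity computation at the core of the RAID 5 and mirror scheme is a block-wise XOR. The prototype's actual Java code is not reproduced here; a minimal illustrative equivalent is:

// Illustrative block-wise XOR parity: one parity block protects N data blocks.
public class XorParity {
    // Returns the XOR parity of N equal-length data blocks.
    public static byte[] encode(byte[][] dataBlocks) {
        byte[] parity = new byte[dataBlocks[0].length];
        for (byte[] block : dataBlocks) {
            for (int i = 0; i < parity.length; i++) {
                parity[i] ^= block[i];
            }
        }
        return parity;
    }

    // Rebuilds one lost block by XORing the parity with the surviving blocks.
    public static byte[] reconstruct(byte[] parity, byte[][] survivingBlocks) {
        byte[] lost = parity.clone();
        for (byte[] block : survivingBlocks) {
            for (int i = 0; i < lost.length; i++) {
                lost[i] ^= block[i];
            }
        }
        return lost;
    }
}

Because the mirror copy already covers single losses, this parity is consulted only when both copies of a block are lost, as described in Section 3.3.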

Our prototype runs in a cluster of 63 nodes, each containing two quad-core 2.83 GHz Xeon processors, 16 GB of memory, and four 7200 rpm SATA 1 TB Seagate Barracuda ES.2 disks. Nodes are interconnected by 10 Gigabit Ethernet. All nodes run the Linux 2.6.28.10 kernel and use a local file system to store HDFS blocks.

While our prototype is not fully functional, for example some reconstruction cases are still a work in progress, it functions well enough for preliminary testing. To get a feel for its basic encoding function, we set up a 32 node partition and had each node write a file of 16 GB into a DiskReduce-modified HDFS, spread over the same 32 nodes using RAID groups of 8 data blocks. Figure 3 shows the storage capacity recovered for this 512 GB of user data after it has been written three times.

[Figure 3 plots system space in GB (also normalized to user space) and encoding rate in GB/s against time in seconds for the RAID 6 and RAID 5 with mirror schemes.]

Figure 3: Storage utilized and rate of capacity recovered by encoding

While this experiment is simple, it shows the encoding process recovering 400 GB and 900 GB for the RAID 5 and mirror and RAID 6 schemes, respectively, bringing overhead down from 200% to 113% and 25%, respectively.

5. ASYNCHRONOUS ENCODING

The background re-replication process in HDFS and GFS makes it easy to shift in time when data is encoded.

It is generally believed that having multiple copies can improve read performance. In this section we bound the performance degradation that might result from decreasing copies with encoding.

If we reduce the number of data copies in HDFS, there are several reasons that Map/Reduce applications might suffer a performance penalty:

• Backup tasks: In Map/Reduce a backup task is the redundant execution of a task if it fails or runs slowly. Backup tasks run on a different node, preferably one with a local copy of the data the original node was reading; otherwise more network traffic is needed to support the backup task. More data copies give the scheduler more choices for assigning backup tasks to a node with a local copy.

• Disk bandwidth: Popular, small datasets may be read by many jobs at the same time, making the total number of disk spindles holding the data a bottleneck. Copies of data may increase the number of spindles with the desired data, increasing total bandwidth. GFS may make more than three copies for such "hot spots" [11].

• Load balance: When N blocks are spread at random over M nodes, they will not be perfectly evenly distributed, and the total work to be done by the most loaded node may determine the completion time. With more data copies, the job tracker has more flexibility to assign tasks to nodes with a local data copy, leading to better load balance.

The impact of these three factors depends on the encoding, the frequency of failures, slow nodes, hot small files, and the ratio of disk to network bandwidth. For this study we are looking to bound the impact of a simple delaying scheme, so we will model the penalty as a factor r, corresponding to a 100(1 - r)% degradation. Our strategy is to exploit locality in data accesses, delaying encoding until there is a small impact on overall performance regardless of r.

We obtained a data access trace from Yahoo!'s M45 cluster, recording HDFS activity from December 2007 to July 2009. We count all block access requests and calculate the "block age" when blocks are read. The cumulative distribution function (CDF) of block age at time of access is shown in Figure 4.

[Figure 4 plots Φ, the fraction of block accesses at age < t, against block age t from 1 second to 1 month.]

Figure 4: CDF of block age at time of access

From this figure, one can observe that most data accesses happen a short time after a block's creation. For instance, more than 99% of data accesses happen within the first hour of a data block's life.
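
The block-age statistic behind Figure 4 is simple to compute. As an illustrative sketch (assuming the trace has been reduced to per-access creation and access timestamps, which is not the trace's raw format), the empirical Φ(t) is:

public class BlockAgeCdf {
    // Fraction of accesses made to blocks younger than tSeconds, i.e. the
    // empirical value of Phi(t) plotted in Figure 4. creationTimes[i] and
    // accessTimes[i] are the creation and access times (in seconds) of the
    // block involved in the i-th access.
    public static double phi(long[] creationTimes, long[] accessTimes, long tSeconds) {
        int younger = 0;
        for (int i = 0; i < accessTimes.length; i++) {
            long ageAtAccess = accessTimes[i] - creationTimes[i];
            if (ageAtAccess < tSeconds) younger++;
        }
        return (double) younger / accessTimes.length;
    }
}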
If we delay the encoding by t seconds, i.e., the background process will not encode data blocks until they have lived for at least t seconds, we obtain the full performance of having three copies from a block's creation until it is t seconds old. The expected performance achieved by delaying encoding by t seconds can therefore be bounded by

1 × Φ(t) + r × (1 − Φ(t)),

where Φ is the CDF of block access with regard to block age, derived from the trace and shown in Figure 4. Different values of r give different expected performance bounds, as shown in Figure 5.
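
As a concrete illustration using the M45 numbers above: with Φ(1 hour) ≈ 0.99 and a pessimistic r = 1/3, the bound is 1 × 0.99 + (1/3) × 0.01 ≈ 0.993, i.e., an expected degradation of less than 1% for a one-hour delay.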

[Figure 5 plots the expected performance bound against the delay before encoding, from 1 second to 1 month, for r = 0.8, r = 0.5, and r = 1/3.]

Figure 5: Bound on performance degradation with delayed encoding, as a function of the degradation when insufficient copies are available

As we can see in Figure 5, even if one copy achieves only 1/3 of the performance of three copies, delaying the encoding for one hour incurs very little system performance penalty.

Delaying encoding also delays recovering the disk capacity used by the extra copies. Consider a bound on the disk capacity consumed by delaying the encoding by one hour. Each disk cannot write faster than about 100 MB/s, and is unlikely to sustain more than 25 MB/s through a file system today, because file systems cannot generally avoid fragmentation altogether. With disk capacities now 1-2 TB, a workload of continuous writing at 25 MB/s per disk for one hour would consume 6-12% of total capacity.

Combining these two bounds, regardless of the performance degradation a workload might suffer from not having extra copies, M45 usage suggests that at a capacity overhead of less than 6-12%, overall performance degradation will be negligible if all encoding is delayed for an hour.

6. CONCLUSION

Data-intensive file systems are part of the core of data-intensive computing paradigms like Map/Reduce. We envision a large increase in the use of large scale parallel programming tools for science analysis applications applied to massive data sets such as astronomy surveys, protein folding, public information data mining, machine translation, etc. But current data-intensive file systems protect data against disk and node failure with high overhead triplication schemes, which are undesirable when data sets are massive and resources are shared over many users, each with their own massive datasets.

DiskReduce is a modification of the Hadoop distributed file system (HDFS) to asynchronously replace multiple copies of data blocks with RAID 5 and RAID 6 encodings. Because this replacement is asynchronous, it can be delayed wherever spare capacity is available. If encoding is delayed long enough, most read accesses will occur while multiple copies are available, preserving all the potential performance benefits achievable with multiple copies. By examining a trace of block creation and use times on the Yahoo! M45 Hadoop cluster, we find that 99% of accesses are made to blocks younger than one hour old, and that far less than 12% of disk capacity is needed to delay encoding for an hour. We conclude that delaying encoding by about one hour is likely to be a good rule of thumb for balancing the capacity overhead and the performance benefits of multiple copies.

We are completing our implementation of DiskReduce, as well as exploring the costs of "cleaning" partially deleted RAID sets, the reliability differences between our encodings, and encoding schemes beyond those presented in this paper.

7. ACKNOWLEDGMENT

The work in this paper is based on research supported in part by the Department of Energy, under award number DE-FC02-06ER25767, by the National Science Foundation under awards OCI-0852543, CNS-0546551 and SCI-0430781, and by a Google research award. We also thank the member companies of the PDL Consortium (including APC, DataDomain, EMC, Facebook, Google, Hewlett-Packard, Hitachi, IBM, Intel, LSI, NEC, NetApp, Oracle, Seagate, Sun, Symantec, and VMware) for their interest, insights, feedback, and support.

8. REFERENCES

[1] Hadoop. http://hadoop.apache.org/.
[2] Yahoo! reaches for the stars with M45 supercomputing project. http://research.yahoo.com/node/1879.
[3] M. Blaum, J. Brady, J. Bruck, and J. Menon. EVENODD: An efficient scheme for tolerating double disk failures in RAID architectures. IEEE Trans. Comput., 44(2):192–202, 1995.
[4] V. Bohossian, C. C. Fan, P. S. LeMahieu, M. D. Riedel, L. Xu, and J. Bruck. Computing in the RAIN: A reliable array of independent nodes. IEEE Trans. Parallel and Distributed Systems, (2), 2001.
[5] D. Borthakur. The Hadoop distributed file system: Architecture and design, 2009. http://hadoop.apache.org/common/docs/current/hdfs-design.html.
[6] D. Borthakur. HDFS and erasure codes, Aug. 2009. http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html.
[7] V. Cate and T. Gross. Combining the concepts of compression and caching for a two-level filesystem. In ACM ASPLOS-IV, pages 200–211, April 1991.
[8] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. RAID: High-performance, reliable secondary storage. In ACM Computing Surveys, 1994.
[9] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, and S. Sankar. Row-diagonal parity for double disk failure correction. In USENIX FAST, pages 1–14, 2004.
[10] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In USENIX OSDI '04, Berkeley, CA, USA, 2004.
[11] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. ACM SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.
[12] G. A. Gibson, L. Hellerstein, R. M. Karp, R. H. Katz, and D. A. Patterson. Failure correction techniques for large disk arrays. ACM ASPLOS, pages 123–132, 1989.
[13] J. L. Hafner. WEAVER codes: Highly fault tolerant erasure codes for storage systems. In USENIX FAST, 2005.
[14] J. Hartman and J. Ousterhout. The Zebra striped network file system. In Proc. 14th ACM SOSP, 1994.
[15] D. D. E. Long, B. R. Montague, and L.-F. Cabrera. Swift/RAID: A distributed RAID system. ACM Computing Systems, (3):333–359, 1994.
[16] D. A. Patterson, G. Gibson, and R. H. Katz. A case for redundant arrays of inexpensive disks (RAID). ACM SIGMOD Rec., 17(3):109–116, 1988.
[17] J. S. Plank, J. Luo, C. D. Schuman, L. Xu, and Z. Wilcox-O'Hearn. A performance evaluation and examination of open-source erasure coding libraries for storage. In USENIX FAST, February 2009.
[18] J. S. Plank, S. Simmerman, and C. D. Schuman. Jerasure: A library in C/C++ facilitating erasure coding for storage applications, version 1.2. Technical Report UT-CS-08-627, University of Tennessee Department of Computer Science, August 2008.
[19] D. Presotto. Private communications, 2008.
[20] I. S. Reed and G. Solomon. Polynomial codes over certain finite fields. Journal of the Society for Industrial and Applied Mathematics, 8:300–304, 1960.
[21] S. Savage and J. Wilkes. AFRAID: A frequently redundant array of independent disks. In Proc. of the USENIX Annual Technical Conference, Berkeley, CA, USA, 1996.
[22] H. Weatherspoon and J. Kubiatowicz. Erasure coding vs. replication: A quantitative comparison. In Proc. of IPTPS, Mar. 2002.
[23] B. Welch, M. Unangst, Z. Abbasi, G. Gibson, B. Mueller, J. Small, J. Zelenka, and B. Zhou. Scalable performance of the Panasas parallel file system. In USENIX FAST, 2008.
[24] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID hierarchical storage system. ACM Trans. Comput. Syst., 14(1):108–136, 1996.