Differential RAID: Rethinking RAID for SSD Reliability

Mahesh Balakrishnan (1), Asim Kadav (2), Vijayan Prabhakaran (1), Dahlia Malkhi (1)
(1) Microsoft Research Silicon Valley, Mountain View, CA, USA
(2) University of Wisconsin, Madison, WI, USA

Abstract

SSDs exhibit very different failure characteristics compared to hard drives. In particular, the Bit Error Rate (BER) of an SSD climbs as it receives more writes. As a result, RAID arrays composed from SSDs are subject to correlated failures. By balancing writes evenly across the array, RAID schemes can wear out devices at similar times. When a device in the array fails towards the end of its lifetime, the high BER of the remaining devices can result in data loss. We propose Diff-RAID, a parity-based redundancy solution that creates an age differential in an array of SSDs. Diff-RAID distributes parity blocks unevenly across the array, leveraging their higher update rate to age devices at different rates. To maintain this age differential when old devices are replaced by new ones, Diff-RAID reshuffles the parity distribution on each drive replacement. We evaluate Diff-RAID's reliability by using real BER data from 12 flash chips on a simulator and show that it is more reliable than RAID-5, in some cases by multiple orders of magnitude. We also evaluate Diff-RAID's performance using a software implementation on a 5-device array of 80 GB Intel X25-M SSDs and show that it offers a trade-off between throughput and reliability.

Categories and Subject Descriptors D.4.2 [Operating Systems]: Storage Management; D.4.4 [Operating Systems]: Reliability; D.4.8 [Operating Systems]: Performance

General Terms Design, Performance, Reliability

Keywords RAID, SSD, Flash

1. Introduction

Solid State Devices (SSDs) have emerged in the last few years as viable replacements for hard drives in many settings. Commodity SSDs can offer thousands of random reads and writes per second, potentially eliminating I/O bottlenecks in high-performance data centers while driving down power consumption. Though early SSDs were prohibitively expensive, the emergence of Multi-Level Cell (MLC) technology has significantly driven down SSD cost in the recent past.

However, MLC devices are severely hamstrung by low endurance limits. Individual flash pages within an SSD require expensive erase operations between successive writes. Each erasure makes the device less reliable, increasing the Bit Error Rate (BER) observed by accesses. Consequently, SSD manufacturers specify not only a maximum BER (usually around 10^-14, as with hard disks), but also a limit on the number of erasures within which this guarantee holds. For MLC devices, this erasure limit is typically rated at 5,000 to 10,000 cycles per block. As flash bit density continues to increase, the erasure limit is expected to decrease as well.

Device-level redundancy is currently the first line of defense against storage failures. Existing redundancy options – such as any of the RAID levels – can be used without modification to guard against SSD failures, and to mask the high BER of aging SSDs. Unfortunately, existing RAID solutions do not provide adequate protection for data stored on SSDs. By balancing writes across devices, they cause multiple SSDs to wear out at approximately the same rate. Intuitively, such solutions end up trying to protect data on old SSDs by storing it redundantly on other, equally old SSDs. Later in this paper, we quantify the ineffectiveness of such an approach.

We propose Differential RAID (Diff-RAID), a new parity-based technique similar to RAID-5 designed explicitly for reliable SSD storage. Diff-RAID attempts to create an age differential across devices, limiting the number of high-BER SSDs in the array at any point in time. In other words, Diff-RAID balances the high BER of older devices in the array against the low BER of younger devices.

To create and maintain this age differential, Diff-RAID modifies two existing mechanisms in RAID-5. First, Diff-RAID distributes parity blocks unevenly across devices; since parity blocks are updated more often than data blocks due to random access patterns, devices holding more parity receive more writes and consequently age faster. Diff-RAID supports arbitrary parity assignments, providing a fine-grained trade-off curve between throughput and reliability. Second, Diff-RAID reshuffles parity on drive replacements to ensure that the oldest device in the array always holds the maximum parity and ages at the highest rate. This ensures that the age differential created through uneven parity assignment persists when new devices replace expired ones in the array.

Diff-RAID's ability to mask high BERs on aging SSDs confers multiple advantages. First, it offers higher reliability than RAID-5 or RAID-4 while retaining the low space overhead of these options. Second, it opens the door to using commodity SSDs past their erasure limit, protecting the data on expired SSDs by storing it redundantly on younger devices. Third, it potentially reduces the need for expensive hardware Error Correction Codes (ECC) in the devices; as MLC density continues to increase, the cost of such ECC is expected to rise prohibitively. These benefits are achieved at the cost of some degradation in throughput and complexity in device replacement.

We evaluate Diff-RAID performance using a software implementation running on a 5-device array of Intel X25-M SSDs, on a combination of synthetic and real traces. We also evaluate Diff-RAID reliability by plugging real-world flash error rates into a simulator. We show that Diff-RAID provides up to four orders of magnitude more reliability than conventional RAID-5 for specific failure modes.

The remainder of this paper is organized as follows: Section 2 describes the problem of correlated SSD failures in detail. Section 3 describes Diff-RAID. Section 4 evaluates Diff-RAID reliability and performance. Section 5 summarizes related work, Section 6 describes future goals for Diff-RAID, and Section 7 concludes.

2. Problem Description

2.1 Flash Primer

The smallest unit of NAND-based flash storage that can be read or programmed (written) is a page (typically 4 KB in size). All bits in a blank page are set to 1s, and writing data to the page involves setting some of the bits within it to 0s. Individual bits within a page cannot be reset to 1s; rewriting the page to a different bit pattern requires an intermediate erase operation that resets all bits back to 1. These erasures are performed over large blocks (e.g., of 128 KB) spanning multiple pages. Blocks wear out as they are erased, exhibiting increasing BERs that become unmanageably high once the erasure limit is breached.

Figure 1. RBER (Raw Bit Error Rate) and UBER (Uncorrectable Bit Error Rate, with 4 bits of ECC) for MLC flash rated at 10,000 cycles; data taken from [6].

As a result of these fundamental constraints on write operations, early flash-based devices that performed in-place page modification suffered from very poor random write latencies; writing to a randomly selected 4 KB page required the entire 128 KB erase block to be erased and rewritten. In addition, imbalanced loads that updated some pages much more frequently than others could result in uneven wear across the device.

To circumvent these problems, modern SSDs implement a log-based block store, exposing a logical address space that is decoupled from the physical address space on the raw flash. The SSD maintains a mapping between logical and physical locations at the granularity of an erase block. A write to a random 4 KB page involves reading the surrounding erasure block and writing it to an empty, previously erased block, with no expensive erase operations in the critical path. In addition, the mapping of logical to physical blocks is driven by wear-leveling algorithms that aim for even wear-out across the device. SSDs typically include more raw flash than advertised in order to continue logging updates even when the entire logical address space has been occupied; for example, an 80 GB SSD could include an extra 10 GB of flash.

SSDs come in two flavors, depending on the type of flash used: Single-Level Cell (SLC) and Multi-Level Cell (MLC). A cell is the basic physical unit of flash, storing voltage levels that represent bit values. SLC flash stores a single bit in each cell, and MLC stores multiple bits. SLC provides ten times the erasure limit of MLC (100,000 cycles versus 10,000), but is currently 3-4 times as expensive. Current industry trends point towards MLC technology with more bits per cell.

Flash Error Modes: MTTF (Mean Time To Failure) values are much higher for SSDs than hard drives due to the absence of moving parts. As a result, the dominant failure modes for SSDs are related to bit errors in the underlying flash. Bit errors can arise due to writes (program disturbs), reads (read disturbs) and bit-rot over time (retention errors) [6, 10]. All these failure modes occur with greater frequency as the device ages (throughout this paper, we define 'age' in erasure cycles and not in time). In practice, SSDs use hardware ECC to bring down error rates; the pre-ECC value is called the Raw Bit Error Rate (RBER), and the post-ECC value is called the Uncorrectable Bit Error Rate (UBER).

Figure 1 shows how the RBER and UBER of flash rise with the number of per-block erasures. We assume 4-bit ECC per 512-byte sector (the current industry standard for MLC flash [10]); this means that enough ECC is used to correct 4 or fewer bad bits in each sector. The RBER data in the figure corresponds to the D-MLC32-1 flash chip in [6], rated at 10,000 cycles; we compute the UBER using the methodology described in [10].
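The UBER computation referenced above reduces to a standard binomial calculation: a sector is lost only if it accumulates more bit errors than the ECC can correct. The sketch below is our own illustration (not code from the paper); it assumes independent bit errors, a 512-byte sector, and 4-bit ECC, and the function names are ours.

```python
from math import comb

def sector_failure_prob(rber, n_bits=512 * 8, t=4):
    """Probability that a 512-byte sector suffers more than t bit errors,
    i.e. more than the ECC can correct (binomial model, independent errors).
    Only the first few terms matter for small RBER, so the sum is truncated."""
    return sum(comb(n_bits, k) * rber**k * (1.0 - rber) ** (n_bits - k)
               for k in range(t + 1, t + 40))

def uber(rber, n_bits=512 * 8, t=4):
    """Uncorrectable Bit Error Rate: the sector failure probability
    expressed per bit read."""
    return sector_failure_prob(rber, n_bits, t) / n_bits

# RBER of the D-MLC32-1 chip at 10,000 cycles (see Table 1 in Section 4):
print(uber(1.18e-05))   # roughly 5e-13, in line with the UBER column of Table 1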
2.2 The Problem with RAID for SSDs

Device-level redundancy has been used successfully for decades with hard drives. The standard block-based interfaces exposed by SSDs allow existing redundancy and striping solutions – such as the RAID levels – to work without modification. We examine the reliability of a specific RAID option (RAID-5), though our observations apply in different degrees to any of the RAID levels. We will first describe the operation of RAID-5 and then outline the concerns with its usage.

In an N-device RAID-5 array, data blocks are striped across N-1 devices and a parity block is computed for each stripe and stored on the Nth device. The role of the parity device rotates for each stripe; as a result, each of the N devices stores an equal fraction of the parity blocks in the array. Whenever one of the N-1 data blocks in a stripe is written to, the corresponding parity block in the stripe must also be updated; consequently, parity blocks receive more write traffic than data blocks for random access workloads. By rotating the role of the parity device, RAID-5 aims to eliminate any parity bottlenecks in the array, spreading write load evenly across the array.

The key insight in this paper is that this load-balancing of writes can cause correlated failures in SSD arrays. Since all devices in the RAID-5 array receive similar write workloads, they use up their erasure cycles at similar rates. As a result, the array can be in a state where multiple devices have reached the end of their erasure limit and exhibit high UBERs, making correlated failures highly probable. In particular, we are concerned with the case where the array experiences a drive failure, and subsequently cannot reconstruct the missing drive due to bit errors in the remaining drives. Since the array has no redundancy once a drive fails, any existing bit errors in the array will be uncorrectable. The corruption of single bits in critical data structures – such as filesystem super-blocks – can result in massive data loss. This failure mode has been of increasing concern with multi-terabyte hard drive arrays, where the sheer quantity of data makes a bit error likely [12]. To the best of our knowledge, we are the first to point out its significance for SSD arrays of any size (in an earlier version of this paper [8]).

Figure 2. RAID-5 Reliability: the array becomes more unreliable as it ages, until all devices hit their erasure limit at the same time and are replaced with brand new devices.

Figure 2 shows how the probability of data loss for a RAID-5 array (5 devices, 80 GB each) changes as data is written to it; see Section 4 for more details on the simulation setup. At each point in the graph, we plot the probability that data loss will occur if one of the devices fails. Each device is replaced by a new device when it reaches its erasure limit of 10,000 cycles. In the beginning, all devices are brand new and have low UBERs, and hence the probability of data loss is very low. As the array receives more writes, the devices age in lock-step; the peaks on the curve correspond to points where all devices reach their erasure limits at the same time and are replaced, transitioning the array to a set of brand new devices.

Importantly, this phenomenon is not specific to RAID-5. For example, it occurs in a fixed-parity configuration (RAID-4) as well, since all data devices age at the same rate. It occurs to a lesser extent in mirrored configurations (RAID-1, RAID-10), since both mirrors age at the same rate. Having two parity devices (RAID-6) prevents data loss on one device failure; however, data loss is likely if two devices are lost. Essentially, any RAID solution used with SSDs offers less reliability than system administrators would like to believe.

3. Differential RAID

Diff-RAID is a parity-based RAID solution very similar to RAID-5. We believe a parity-based scheme to be a great fit for SSDs; their extremely high random read speed eliminates the traditional synchronous read bottleneck for random writes. On a 5-device array of Intel X25-Ms configured as RAID-5, we were able to achieve 14,000 random writes per second to the array; a similar experiment on hard drives would have been limited to a few hundred writes per second due to random seeks in the critical path. Accordingly, our focus is on improving the reliability of a parity-based scheme.

We start with the basic RAID-5 design and improve it in two key ways. First, uneven parity distribution allows Diff-RAID to age SSDs at different rates, creating an age differential in the array. Second, parity-shifting drive replacement allows Diff-RAID to maintain that age differential when old devices are retired from the array and replaced by new SSDs.

3.1 Uneven Parity Distribution

Diff-RAID controls the aging rates of different devices in the array by distributing parity blocks unevenly across them. Random writes cause parity blocks to attract much more write traffic than data blocks; on every random write to a data block, the corresponding parity block has to be updated as well. As a result, an SSD with more parity receives a higher number of writes and ages correspondingly faster.

In the following discussion, we measure the age of a device by the average number of cycles used by each block in it, assuming a perfect wear-leveling algorithm that equalizes this number across blocks. For example, if each block within a device has used up roughly 7,500 cycles, we say that the device has used up 7,500 cycles of its lifetime.

We represent parity assignments with n-tuples of percentages; for example, (40, 15, 15, 15, 15) represents a 5-device array where the first device holds 40% of the parity and the other devices store 15% each. An extreme example of uneven parity assignment is RAID-4, represented by (100, 0, 0, 0, 0) for 5 devices, where the first device holds all the parity. At the other extreme is RAID-5, represented by (20, 20, 20, 20, 20) for 5 devices, where parity is distributed evenly across all devices.

For a workload consisting only of random writes, it is easy to compute the relative aging rates of devices for any given parity assignment. For an n-device array, if a_ij represents the ratio of the aging rate of the i-th device to that of the j-th device, and p_i and p_j are the percentages of parity allotted to the respective devices, then:

    a_ij = (p_i * (n - 1) + (100 - p_i)) / (p_j * (n - 1) + (100 - p_j))        (1)

In the example of 5-device RAID-4, the ratio of the aging rate of the parity device to that of any other device would be (100 * 4 + 0) / (0 * 4 + 100) = 4; in other words, the parity device ages four times as fast as any other device. Since RAID-4 is an extreme example of uneven parity assignment, it represents an upper bound on the disparity of aging rates in an array; all other parity assignments will result in less disparity, with RAID-5 at the other extreme providing no disparity.
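As a concrete illustration of Equation 1 (our own sketch, not part of the paper's implementation; the function names are ours), the relative aging rates for a few assignments can be computed directly from the parity percentages, assuming a purely random write workload:

```python
def write_load(parity_pct, n):
    """Relative write traffic seen by a device holding `parity_pct` percent
    of the parity in an n-device array (numerator/denominator of Eq. 1):
    parity blocks absorb (n-1) times the writes of data blocks."""
    return parity_pct * (n - 1) + (100.0 - parity_pct)

def aging_rates(assignment):
    """Aging rate of every device relative to the first device."""
    n = len(assignment)
    base = write_load(assignment[0], n)
    return [write_load(p, n) / base for p in assignment]

print(aging_rates([100, 0, 0, 0, 0]))     # RAID-4: [1.0, 0.25, 0.25, 0.25, 0.25]
print(aging_rates([80, 5, 5, 5, 5]))      # [1.0, 0.338..., 0.338..., ...]
print(aging_rates([20, 20, 20, 20, 20]))  # RAID-5: every device ages equally
```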
3.2 Parity-Shifting Drive Replacement

In the long run, uneven parity distribution is not sufficient to maintain an age differential. To understand this point, consider a RAID-4 array with 5 devices. With a workload of random writes, the data devices in the array will be at approximately 2,500 erase cycles each when the parity device hits 10,000 cycles and is replaced. If we allow the new replacement device to be a verbatim copy of the old device – as conventional RAID would do – it will hold all parity and continue to age four times as fast as the data devices. After the third such replacement of the parity device, the data devices will all be at 7,500 cycles, resulting in an aging and unreliable array. The fourth replacement of the parity device will occur when all the devices are at 10,000 cycles.

Intuitively, after each replacement of the parity device, the new replacement drive ages at a much faster rate than the older data devices and hence catches up with them. As a result, the age differential created by the uneven parity distribution is not maintained across drive replacements. In order to ensure that the age differential is maintained, we would like the aging rate of a device to be proportional to its age: the older a device, the faster it ages. If this were the case, any age differential would persist across drive replacements.

Another reason for aging older devices faster is that the presence of an old SSD makes the array less reliable, and hence we want to use up its erase cycles rapidly and remove it as soon as possible. Conversely, young SSDs make the array more reliable, and hence we want to keep them young as long as possible. If two SSDs are at the same age, we want to age one faster than the other in order to create an imbalance in ages.

Consequently, on every device replacement, we shift the logical positions of the devices in the array to order them by age, before applying the parity assignment to them. In the simple example of RAID-4, each time a parity device is replaced, the oldest of the remaining data devices becomes the new parity device (breaking ties arbitrarily), whereas the new replacement device becomes a data device.

Figure 3. Parity-Shifting Drive Replacement for a 4-device array with parity assignment (70, 10, 10, 10): age distribution of devices when Drive 1 is replaced (Top), and when Drive 2 is replaced (Bottom). Drive 2 receives 70% of the parity after Drive 1 is replaced.

Figure 3 illustrates this scheme for a 4-device array with a parity assignment of (70, 10, 10, 10); by Equation 1, this assignment results in the first device aging twice as fast as the other three devices. As a result, when Drive 1 reaches 10,000 erase cycles, the other three SSDs are at 5,000 cycles. When Drive 1 is replaced, Drive 2 is assigned 70% of the parity blocks, and the other SSDs (including the new one) are assigned 10% each.

Diff-RAID provides a trade-off between reliability and throughput: the more skewed the parity distribution towards a single device, the higher the age differential, and consequently the higher the reliability. A skewed parity assignment results in low throughput and high reliability, and a balanced parity assignment results in high throughput and low reliability. The Diff-RAID parity assignment corresponding to RAID-4 provides the least throughput – with the single parity device being a bottleneck – but the highest reliability, and the one corresponding to RAID-5 provides the highest throughput but the least reliability. Since Diff-RAID supports any arbitrary parity assignment, administrators can choose a point on the trade-off curve that fits application requirements.
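A minimal sketch of the replacement step described above (our own illustration; the data structures and names are assumptions, not the paper's implementation): retire the worn-out device, add the fresh one, and re-apply the parity assignment in age order so that the largest share always lands on the oldest surviving device.

```python
def replace_worn_device(device_ages, new_id, assignment):
    """Parity-shifting drive replacement: `device_ages` maps device id to
    consumed erase cycles, `assignment` is the parity distribution in
    percentages. Returns (device_id, parity_pct) pairs, oldest device first."""
    worn = max(device_ages, key=device_ages.get)           # device at its limit
    survivors = {d: a for d, a in device_ages.items() if d != worn}
    survivors[new_id] = 0                                   # fresh replacement
    ordered = sorted(survivors, key=survivors.get, reverse=True)
    return list(zip(ordered, sorted(assignment, reverse=True)))

# Figure 3's example: Drive 1 wears out in a (70, 10, 10, 10) array.
ages = {1: 10_000, 2: 5_000, 3: 5_000, 4: 5_000}
print(replace_worn_device(ages, new_id=5, assignment=[70, 10, 10, 10]))
# Drive 2 (now the oldest) takes over the 70% share; the new drive gets 10%.
```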
3.3 Analysis of Age Distribution Convergence

Interestingly, if we continue shifting parity over multiple device replacements for a given parity assignment, the device ages observed at replacement time eventually converge to a stationary distribution. Figure 4 plots the distribution of device ages at replacement time for a 5-device array with a parity assignment of (80, 5, 5, 5, 5). After a number of replacements, the ages of the devices at replacement time converge so that the oldest remaining device is always at 5,750 erase cycles, with the other devices at 4,312.5, 2,875 and 1,437.5 cycles respectively. Figure 5 shows the stationary distributions for a number of different parity assignments; on the x-axis is the percentage of parity on a single device, with the remainder spread evenly across the other devices. Importantly, the number of replacements over time does not depend on the parity assignment: Diff-RAID uses exactly as many replacements as conventional RAID-5.

Figure 4. Distribution of device ages at replacement time for the (80, 5, 5, 5, 5) parity assignment: on each replacement, a new device is added to the end of the array and the second device moves into first position.

We can show that any parity assignment eventually converges to a stationary distribution. Let us denote the different rates at which disks age via a vector (t1, t2, ..., tk). Recall that Equation 1 allows us to compute this vector given any parity assignment, assuming a perfectly random workload. This vector is normalized and sorted, so that t1 = 1 and tj >= tj+1 for all j. This means that during a period of time in which device 1 ages 100 units, device 2 ages t2 * 100 units, device 3 ages t3 * 100 units, and so on. In practice, the aging vector is a function of both the workload and the parity assignment. In our analysis, we assume a random workload, with the aging vector reflecting only the parity assignment; however, the analysis should work with any aging vector derived from a real workload and parity assignment.

Suppose that the expected life of a device is A aging units. Now, consider a new device which is brought into the system when device 1 burns out. Let us number this new device k+1. Device 2 replaces the role of device 1 when device k+1 enters; at this point, it has aged A * t2, and it continues aging at rate t1; device 3 replaces device 2 at age A * t3 and continues aging at rate t2, and so on. Denote by A2 the number of aging units that device 2 suffers during the period until it, too, burns out. During this period, device k+1 ages A2 * tk. Then device 3 becomes the first device, and burns out after A3 aging units, during which time device k+1 ages A3 * tk-1. And so on. The last aging period before device k+1 must replace the first device in the array is Ak, and its aging increment is Ak * t2.

In summary, when device k+1 replaces the first device in the array it has age A2 * tk + A3 * tk-1 + ... + Ak * t2. We obtain the burnout time Ak+1 remaining for device k+1 from this point onward by putting Ak+1 = A - (A2 * tk + A3 * tk-1 + ... + Ak * t2). More generally, for any d, Ak+d is the burnout period remaining after the device reaches age A1+d * tk + A2+d * tk-1 + ... + Ak-1+d * t2.

We want to design the aging vector so that Ak+d is a certain fixed, positive value, so that when a new device replaces the first one, it has a reasonable remaining life expectancy. In practice, we probably need to worry only about a few device replacements, since it is hard to imagine the same RAID array of SSDs undergoing more than a handful of device replacements before the entire technology is upgraded.

Nevertheless, we can design a replacement strategy that keeps Ak+d fixed for any number of replacements. Denote by B the desired remaining life expectancy for a replacement device. During the period that device 1 ages A units, device 2 should age at rate r1,2 = (A - B)/A. Device 3 should reach age A - B after aging at rate r1,3 for period A, and at rate r2,3 for period B; that is, A - B = A * r1,3 + B * r2,3. And so on.

After performing k device replacements, there is a steady-state allocation that fixes B forever, simply by fixing t2 + t3 + ... + tk = (A - B)/B. We obtain Ak+d = A - (A1+d * tk + A2+d * tk-1 + ... + Ak-1+d * t2) = A - B * (A - B)/B = B, and the invariant continues to hold. Conversely, if the aging vector is fixed, then the remaining life of the second array device upon replacement converges to the B satisfying the equation above. Briefly, to see that this holds, observe that if the remaining lives of replacement devices exceed B by e, then the remaining life upon replacement of device k+1 is A - (B + e)((A - B)/B) = B - e * (A - B)/B. And conversely, if it was B - e, it becomes B + e * (A - B)/B. Hence, the life of a replacement device gradually converges to B.

For example, we could have three devices, with steady-state aging rates (1, 1/2, 1/2). In the first replacement period, we need to age device 3 at rate 1/4. After device 2 becomes the first in the array, we continue aging at rates (1, 1/2, 1/2) forever. Each device replacement occurs when the first device reaches its full life period; the second device is then at 1/2 of its life, and the third at 1/4. The second device moves to be the first in the array, having suffered 1/2 of its burnout load; the third device moves to be second, with 1/4 burnout, and a fresh new device is deployed in the third array position. Then, after another 1/2 life, the first device in the array burns out; the second device has age 1/4 + 1/2 * 1/2 = 1/2 and moves to replace the first one; the third device has age 1/2 * 1/2 = 1/4 and replaces the second. In numbers, the series of successive replacement-time age vectors is (1, 1/2, 1/4), (1/2 + 1/2, 1/4 + 1/2 * 1/2, 1/2 * 1/2), and so on.
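The convergence behaviour is easy to reproduce numerically. The sketch below is our own (written under the same random-workload assumption as Equation 1 and a 10,000-cycle erasure limit, with names of our choosing); it tracks survivor ages at each replacement for the (80, 5, 5, 5, 5) assignment, which drift towards the stationary ages quoted above.

```python
ERASE_LIMIT = 10_000

def aging_vector(assignment):
    """Normalized per-position aging rates for a parity assignment
    (Equation 1), position 0 being the oldest device."""
    n = len(assignment)
    loads = [p * (n - 1) + (100 - p) for p in assignment]
    return [l / loads[0] for l in loads]

def ages_at_replacement(assignment, rounds=20):
    """Simulate `rounds` replacements with parity shifting: position 0
    always holds the largest parity share, and a fresh device enters at
    the youngest position after every replacement."""
    rates = aging_vector(sorted(assignment, reverse=True))
    ages = [0.0] * len(assignment)
    history = []
    for _ in range(rounds):
        dt = (ERASE_LIMIT - ages[0]) / rates[0]           # time until oldest expires
        ages = [a + r * dt for a, r in zip(ages, rates)]
        history.append([round(a, 1) for a in ages[1:]])   # survivors' ages
        ages = ages[1:] + [0.0]                           # retire oldest, add new
    return history

history = ages_at_replacement([80, 5, 5, 5, 5])
print(history[0])    # first replacement: all survivors near 3,382 cycles
print(history[-1])   # converging towards roughly 5,750 / 4,312.5 / 2,875 / 1,437.5
```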

Figure 5. Convergent distribution of ages at replacement time for different parity assignments; on the x-axis is the percentage of parity on a single device, with the rest spread evenly across the other devices (e.g., x = 60 represents (60, 10, 10, 10, 10)).

3.4 Deployment Issues

Bootstrapping Convergence: The convergence property of Diff-RAID can be extremely useful in designing an array with stable reliability and performance properties. Given a predictable workload, it is possible to compute the stationary distribution for any parity assignment; if the array starts off with its constituent devices already fitting that age distribution, it will reach that distribution every time the oldest device in the array hits the erasure limit. As a result, the worst-case unreliability of the array is very easy to predict.

One possibility is that system administrators will bootstrap new Diff-RAID arrays by burning through flash cycles on the first set of devices used (or conversely, by throwing away the devices before they reach their erasure limit). At first glance, this seems like an expensive proposition, given the high cost of flash. However, the bootstrapping process represents a one-time cost involving a fixed amount of flash: in the example of an (80, 5, 5, 5, 5) array, it involves wasting a little over 14K erasure cycles, or around a device and a half worth of flash.
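To make the bootstrapping cost concrete (a consistency check we derived from the stationary ages reported in Section 3.3, not a figure stated separately in the paper): pre-aging the four older devices of an (80, 5, 5, 5, 5) array to their stationary ages consumes 5,750 + 4,312.5 + 2,875 + 1,437.5 = 14,375 erasure cycles in total, which matches the "little over 14K" figure above and is indeed close to one and a half devices' worth of 10,000-cycle lifetime.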
For example, consider a 5-device Diff-RAID array with a 5-year lifetime, where a single SSD has to be replaced every year. Assuming that a single SSD costs $300, the total cost of storage would be $3,000 over the five years (not taking dropping SSD costs into account). The cost of the wasted flash in this case would be around $450, or around 15% extra. Since this represents a one-time dollar cost, we believe it to be a practical solution.

Of course, it is entirely possible to run Diff-RAID without first putting it in a stationary distribution. As we will show in our evaluation, the resulting array will not be as predictably reliable, but it will still be orders of magnitude more reliable than conventional RAID-5 on average.

Implementation of Drive Replacement: We expect administrators to replace worn-out devices by simply removing them, replacing them with new devices, and triggering RAID reconstruction. In addition to reconstruction, parity has to be redistributed across the devices. A naive way to implement the redistribution is to retain the same parity mapping function, but simply shift the logical order of the devices to the left; if the replaced device was at position 0, the new device now occupies position 3 (in a 4-device array), and all the other devices shift one position to the left. While this simplifies the logic in the RAID controller immensely, it requires all parity blocks in the array to be moved.

A faster option is to retain the logical order of devices, but change the parity mapping function to a new function that satisfies the parity assignment percentages with minimal movement of parity blocks. For example, if the parity assignment is (70, 10, 10, 10) in a four-device array, and device 0 is replaced by device 4, we need to move 60% of the parity blocks from device 4 to device 1; device 1 needs 70% in all and already has 10%. Such a movement of parity blocks will leave 10% on device 4, and no changes occur on device 2 and device 3, which already have their allotted 10%. As a result, only 60% of the parity blocks need to be moved. This is Diff-RAID's default drive replacement implementation.

An alternative option is to delay parity redistribution and perform it incrementally during the execution of actual writes to the array. Whenever a random write occurs, we simply write the new parity to its new location, moving the data block at that location to the old parity location. However, this option slows down writes to the array for an unspecified period of time. Importantly, none of the redistribution options use up too many flash cycles by redistributing parity blocks; rewriting all the parity blocks uses up a total of a single erasure cycle, summed across all the devices in the array.
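The minimal-movement plan can be computed by matching per-device parity surpluses against deficits, as in the sketch below (our own illustration of the idea; the paper does not give this code, and the function and variable names are ours):

```python
def parity_moves(current, target):
    """Plan parity transfers (in percent of the array's parity blocks)
    from the shares held after reconstruction (`current`) to the shares
    required by the new age order (`target`). Total movement equals the
    sum of the deficits, which is the minimum possible."""
    surplus = {d: current[d] - target[d] for d in current if current[d] > target[d]}
    moves = []
    for dst in current:
        need = target[dst] - current[dst]
        for src in list(surplus):
            if need <= 0:
                break
            amount = min(need, surplus[src])
            moves.append((src, dst, amount))
            surplus[src] -= amount
            need -= amount
            if surplus[src] == 0:
                del surplus[src]
    return moves

# The example from the text: device 0 (70% parity) is replaced by device 4,
# which inherits that parity during reconstruction; device 1 is now oldest.
current = {1: 10, 2: 10, 3: 10, 4: 70}
target  = {1: 70, 2: 10, 3: 10, 4: 10}
print(parity_moves(current, target))   # [(4, 1, 60)]: move 60% of the parity
```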
Workload Sequentiality: Conventional RAID-5 controllers designed for hard drives do their utmost to extract full-stripe writes from the workload: writes that span the entire stripe and can be executed without first reading the old data and parity blocks. They do so using optimizations such as write caching within the controller to coalesce multiple random writes into single full-stripe writes. However, existing RAID-5 implementations have had limited success in this regard; existing enterprise applications are notoriously non-sequential, leading to the growing unpopularity of RAID-5.

For SSD arrays in general, reads and writes are extremely fast and full-stripe writes are not required for performance; our evaluation will bear out this observation. For Diff-RAID in particular, full-stripe writes are in fact detrimental to reliability; each device gets written to in a full-stripe write, reducing the disparity in write workloads observed by the devices. Full-stripe writes do have the advantage of conserving flash cycles by performing fewer parity writes; however, the level of write caching required to extract them creates a window of unreliability while the data is cached on the controller, and also adds complexity to the controller. As a result, we do not implement any optimizations such as write caching to leverage full-stripe writes.

4. Evaluation

In this section, we evaluate the reliability and performance of Diff-RAID and compare them to traditional RAID-5 arrays. To measure reliability, we implemented various RAID configurations on a simulator and evaluated them using bit error rates from real devices. To measure performance, we implemented a software Diff-RAID controller that runs on an array of 5 Intel X25-M SSDs [7]; performance measurements were gathered by running synthetic benchmarks as well as real workloads collected from enterprise servers.

Diff-RAID Array Configurations: In the following evaluations, we use a 5-SSD array, where each SSD is 80 GB in size. All the SSDs in the array are homogeneous, using the same controller logic and flash chips. The SSDs have MLC chips rated at an erasure limit of 10,000 cycles. For reliability simulations, we assume that SSDs in the array are replaced when they hit this limit.

Given an array of SSDs, various Diff-RAID configurations are possible depending upon how much parity is assigned to each device. In order to narrow down the solution space, we focus on a subset of possible assignments that concentrate a particular percentage of parity on the first device and evenly distribute the rest of the parity over the other devices in the array. Specifically, we evaluate the following 9 configurations: (100, 0, 0, 0, 0) (i.e., RAID-4), (90, 2.5, 2.5, 2.5, 2.5), ..., and (20, 20, 20, 20, 20) (i.e., RAID-5).

Bit Error Rates: To measure the reliability of a specific Diff-RAID configuration, we must first understand the bit error rates of the MLC flash chips. We use raw bit error rate curves for 12 different flash chips obtained from two published studies. The first study, by Grupp et al. [6], provides us with 10 different RBER curves. We obtain two more curves from the second study, by Mielke et al. [10] (specifically, we use the Diamond and Triangle flash chips, in the nomenclature of that paper).

We assume that the data in the flash are protected with ECC. We model 4-bit ECC protection per 512-byte sector, which is the current industry standard for MLC flash. We use the UBER equation from the second study (page 1 of Mielke et al. [10]) on the RBER values to compute the UBER for the 12 flash chips. Table 1 presents the RBER values from these two previous studies and the UBER values that we calculated for the various flash chips, when the chips have used up 5,000 and 10,000 erasure cycles. All but one of the flash chips are rated at 10,000 cycles; IM-1 is rated at 5,000 cycles. For our reliability simulations, we model the SSD array as consisting of one of these flash types, whose UBER values are subsequently used to compute RAID reliability estimates, as explained below.

Flash Type    RBER (5K)   RBER (10K)   UBER (5K)   UBER (10K)
B-MLC8-1      2.99e-07    3.60e-07     2.26e-20    6.80e-20
B-MLC8-2      1.78e-07    3.58e-07     0.00e+00    0.00e+00
B-MLC32-1     1.92e-06    4.84e-06     6.31e-17    6.40e-15
B-MLC32-2     8.17e-07    2.62e-06     7.47e-19    2.99e-16
C-MLC64-1     1.12e-06    3.67e-06     4.36e-18    1.61e-15
C-MLC64-2     1.85e-06    9.51e-06     5.26e-17    1.85e-13
D-MLC32-1     6.70e-07    1.18e-05     4.40e-19    5.31e-13
D-MLC32-2     7.14e-07    2.16e-06     6.03e-19    1.14e-16
E-MLC8-1      3.07e-08    1.79e-07     2.82e-21    4.44e-20
E-MLC8-2      1.93e-07    6.70e-07     0.00e+00    4.40e-19
IM-1          5.50e-07    1.16e-06     1.17e-19    5.32e-18
IM-2          5.22e-08    1.09e-07     2.20e-20    1.28e-19

Table 1. Bit error rates (raw and uncorrectable) for the various flash types when used for 5K and 10K erasure cycles.

Data Loss Probability (DLP): UBER only gives the probability of an uncorrectable error in a single bit. We use UBER to calculate the probability of data loss in a single SSD and then use it to calculate the same for an entire Diff-RAID array. Specifically, given that a single SSD has failed (or is removed after all its erase cycles are used), we calculate the probability of one or more uncorrectable errors occurring during a RAID reconstruction event; we denote this by 'Data Loss Probability' (DLP). For example, a DLP of 10^-6 means that 1 out of a million RAID reconstruction events will fail and result in some data loss.

We consider the DLP of an array in two cases: 'oldest' refers to the case where the oldest device in the Diff-RAID array has failed; 'any' refers to the case where a randomly chosen device in the array has failed. In typical scenarios, 'oldest' is the common case for SSD arrays, occurring routinely when a device hits the erasure limit and is replaced by a newer device. Note that in a RAID-5 array, since all the SSDs age at the same rate, no SSD is older than any other and therefore the 'oldest' and 'any' cases are the same.

4.1 Diff-RAID Reliability Evaluation

We measure the reliability of Diff-RAID configurations by running a simulator with UBER data for various flash chips. In most evaluations, we use a synthetic random write workload; we also evaluate the reliability under real-world enterprise workloads using a trace playback. In all the following measurements, we quantify reliability through the DLP; the lower the DLP, the better the reliability.

The simulator operates by issuing simulated writes to random blocks in the address space exposed by the array; each of these array writes results in two device writes, one to the device holding the data block and the other to the device holding the parity block. The simulator then tracks the corresponding increase in erasures experienced by each device (assuming a simplistic model of SSD firmware that translates a single write into a single erasure). At any given point in the simulation, the DLP is computed by first assuming that a device has failed and then using the flash UBER to calculate the resulting probability of data loss. If the 'oldest' failure mode is chosen, the failed device is the oldest one in the array. If the 'any' mode is chosen, the returned DLP value is an average across multiple DLP values, each of which is computed for the failure of a particular drive in the array.
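The DLP calculation itself is not spelled out in closed form in the paper; the sketch below shows one plausible formulation under the stated model. It rests on our own assumptions: reconstruction reads every bit of every surviving device, bit errors are independent, and the UBER between the two Table 1 points is interpolated on a log scale. All names are ours.

```python
import math

def uber_d_mlc32_1(cycles):
    """UBER for D-MLC32-1, log-linearly interpolated between the two
    Table 1 points (5,000 and 10,000 cycles) -- an assumption, since the
    paper derives UBER from the full measured RBER curve."""
    (x0, y0), (x1, y1) = (5_000, 4.40e-19), (10_000, 5.31e-13)
    frac = (cycles - x0) / (x1 - x0)
    return 10 ** (math.log10(y0) + frac * (math.log10(y1) - math.log10(y0)))

def data_loss_probability(device_cycles, failed, uber_at, device_bits=80 * 10**9 * 8):
    """Probability that at least one uncorrectable error is hit while
    reading all surviving devices to reconstruct the failed one.
    log1p/expm1 keep the arithmetic accurate for very small UBERs."""
    log_p_clean = sum(device_bits * math.log1p(-uber_at(c))
                      for dev, c in device_cycles.items() if dev != failed)
    return -math.expm1(log_p_clean)

# 'oldest' failure for a converged (80, 5, 5, 5, 5) array (ages from Section 3.3).
cycles = {0: 10_000, 1: 5_750, 2: 4_312, 3: 2_875, 4: 1_437}
print(data_loss_probability(cycles, failed=0, uber_at=uber_d_mlc32_1))
# on the order of 1e-6 with this interpolation
```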
4.1.1 Reliability of Diff-RAID

Figure 6 shows how the DLP of a Diff-RAID array changes as it is written to; in the process, SSDs get replaced at the end of their lifetime and new SSDs are added to the array. Eventually, the DLP converges to a steady state, where it is low and tightly bounded. For this experiment, we simulate a 5-SSD Diff-RAID array with a parity assignment of (90, 2.5, 2.5, 2.5, 2.5) and use the D-MLC32-1 flash (we choose this flash as it has the highest UBER). We assume that each SSD is replaced when it hits the erasure limit of 10,000 cycles.

Figure 6. Diff-RAID reliability changes over time and converges to a steady state; the RAID-5 curve from Figure 2 is reproduced here for comparison.

After 10 to 15 replacements, the reliability curve reaches a steady state; this is a direct result of the age distribution converging. Once the age distribution converges, the data loss probability is bounded between 10^-7 and 10^-5. As mentioned previously, one may start the array in the steady state by using SSDs whose erase cycles have already been used up by appropriate amounts (or by replacing them early, treating them as devices that have already used up erase cycles). For the remaining experiments in this section, we focus on the DLP of Diff-RAID at its steady state, ignoring the initial fluctuations.

4.1.2 Reliability of Diff-RAID Configurations

The reliability of Diff-RAID depends on its specific configuration. Figure 7 shows the DLP of different Diff-RAID parity assignments. The DLP is calculated on a model where each SSD in the array uses flash type D-MLC32-1 and the oldest device fails in a 5-SSD array. The left-most bar corresponds to an assignment with 100% of the parity on a single device (i.e., similar to RAID-4). The right-most bar corresponds to an assignment of 20% of the parity to each device (i.e., similar to RAID-5). Intermediate points concentrate some parity on a single device and spread the remainder across all other devices. As expected, the reliability of the array goes up as we concentrate more parity on a single device.

Figure 7. Diff-RAID provides tunable reliability. The x-axis shows the percentage of parity on the first device, with the remaining parity divided evenly across the other devices. The left-most bar corresponds to (100, 0, 0, 0, 0) and the right-most to (20, 20, 20, 20, 20).

4.1.3 Reliability with Different Flash Types

Next, we want to understand how the reliability of the overall array varies depending on the type of flash chip used. In addition, we also want to estimate the system reliability when any device (not just the oldest device) fails.

Figure 8 compares the DLP of conventional RAID-5 to a specific Diff-RAID parity assignment of (80, 5, 5, 5, 5), for different types of flash chips. For failures involving the oldest device in the array, Diff-RAID is one or more orders of magnitude more reliable than RAID-5 for 7 out of 12 flash types. For failures involving any device, Diff-RAID is twice as reliable as RAID-5 for 9 out of 12 flash types.

Figure 8. For various flash chips, Diff-RAID is more reliable than RAID-5 under two cases: when the oldest device fails and when any device fails. RAID-5 performs identically in both these cases.

4.1.4 Reliability with Different ECC Levels

Since Diff-RAID provides better reliability than conventional RAID-5 with the same number of ECC bits, it is also possible that Diff-RAID provides reliability similar to RAID-5 while using fewer ECC bits. Such guarantees can mitigate the need for expensive ECC protection in hardware. Figure 9 compares Diff-RAID and conventional RAID-5 reliability on flash type D-MLC32-1 with different levels of ECC. Diff-RAID uses a parity assignment of (80, 5, 5, 5, 5), and we consider the failure of the oldest SSD. On the x-axis, ECC level 't' represents the ability to correct t bit errors in a 512-byte sector.

Figure 9. Diff-RAID can lower hardware ECC requirements. For example, Diff-RAID (with (80, 5, 5, 5, 5) parity) on 3-bit ECC flash is equivalent to RAID-5 on 5-bit ECC flash.

The important point made by this graph is that Diff-RAID can be used to lower hardware ECC requirements; in this example, Diff-RAID on flash with 3-bit ECC is equivalent to conventional RAID-5 on the same flash with 5-bit ECC.

4.1.5 Reliability Beyond Erasure Limit

Figure 10 compares Diff-RAID and RAID-5 reliability on flash type D-MLC32-1 when the flash is used past its erasure limit of 10,000 cycles. Again, the Diff-RAID assignment used is (80, 5, 5, 5, 5), the failure mode is 'oldest', and we assume 4 bits of ECC. For each bar in the graph, devices in the array are replaced when they hit the extended erasure limit represented by the value on the x-axis. The RAID-5 data loss probability hits a value of 1.0 immediately beyond 10,000 cycles, while Diff-RAID's data loss probability climbs slowly and hits 1.0 at 20,000 cycles. It is clear that Diff-RAID allows flash to be used beyond its erasure limit; if each device in the array is used for 13,000 cycles, less than one out of 1,000 failure events involving the oldest device will result in some data loss.

Figure 10. Diff-RAID can extend SSD lifetime by a few thousand cycles without high data loss probability.

4.1.6 Reliability on Real Workloads

Thus far, we have evaluated Diff-RAID on a synthetic random write workload. In order to evaluate Diff-RAID's reliability under different workloads, we used a number of enterprise traces taken from a previous study [11]. For real workloads, the stripe size (the size of an individual stripe chunk) is also crucial in determining access pattern skews (i.e., a certain device being written to more often) and the frequency of full-stripe writes. We plot the reliability of our synthetic workload as a straight line; the RAID stripe size does not matter at all for purely random writes, since the fraction of full-stripe writes is zero irrespective of stripe size.

Figure 11. Diff-RAID reliability with different traces (exch, prxy, src1, stg, wdev) and stripe sizes (4 KB to 1024 KB) shows that there is no universally good stripe size, but Diff-RAID works well with different workloads if the stripe size is selected carefully.

In Figure 11, each set of bars corresponds to a particular trace; within a set of bars, the lightest bar corresponds to the smallest stripe size used (4 KB), and each darker bar increases the stripe size by a factor of 2. The darkest bar corresponds to a stripe size of 1 MB. We can see that increasing the stripe size does not affect different traces consistently. Importantly, Diff-RAID continues to provide excellent reliability for multiple real workloads, across all stripe sizes. Only one of the traces (src1) had a significant number of full-stripe writes; as the stripe size was increased to 16 KB, full stripes were converted into random writes, resulting in a 30% increase in the number of flash page writes. For all other traces, the fraction of full-stripe writes was low even with a small stripe size, and the difference in the number of flash page writes for different stripe sizes was less than 10% in all cases.

4.2 Diff-RAID Performance Evaluation

To measure the performance trade-offs between RAID-5 and other Diff-RAID configurations, we implemented a software Diff-RAID controller, which uses the Windows raw device interface to access the SSDs. Our software Diff-RAID controller can run synthetic benchmarks (e.g., random 4 KB writes) and also issue I/Os from trace files. Importantly, we disable the actual XOR parity computation to eliminate the potential CPU bottleneck; we believe that hardware implementations of Diff-RAID will compute XORs using specialized hardware. As mentioned previously, we do not implement write caching in the RAID controller, nor do we optimize half-stripe writes by reading missing data; these optimizations are usually used to improve RAID-5 performance on hard drive arrays, but are not required for SSDs and can hurt Diff-RAID reliability.
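The paper does not describe how the software controller maps stripes to parity devices under an uneven assignment; one simple way to realize an integer-percentage assignment is a repeating per-stripe pattern, sketched below (our own illustration; in practice a controller would likely interleave the pattern rather than use contiguous runs):

```python
def build_parity_pattern(assignment_pct):
    """Expand an integer-percentage parity assignment into a repeating
    100-stripe pattern of parity-device indices; with (80, 5, 5, 5, 5),
    device 0 is the parity device for 80 of every 100 stripes."""
    pattern = [dev for dev, pct in enumerate(assignment_pct) for _ in range(pct)]
    assert len(pattern) == 100, "percentages must sum to 100"
    return pattern

def stripe_layout(stripe, pattern, n_devices):
    """Return (parity_device, data_devices) for a given stripe number."""
    parity_dev = pattern[stripe % len(pattern)]
    return parity_dev, [d for d in range(n_devices) if d != parity_dev]

pattern = build_parity_pattern([80, 5, 5, 5, 5])
for s in (0, 79, 84, 99):
    print(s, stripe_layout(s, pattern, 5))
# stripes 0-79 put parity on device 0; 80-84 on device 1; and so on
```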

We use an array of 5 Intel X25-M SSDs, each 80 GB in size. According to Intel's data sheet, each SSD is specified to provide 3.3K IOPS of random write throughput, 35K IOPS of random read throughput, 70 MB/s of sequential write throughput and 250 MB/s of sequential read throughput [7]. In order to bring the SSDs' performance to a steady state, we ran each device through a burn-in process that involved writing sequentially to the entire address space (80 GB) and subsequently writing 1 million 4 KB pages at random locations (4 GB of data). In addition, we also ran many microbenchmarks on each device prior to the burn-in process; as a result, all the SSDs had used up between 3% and 8% of their erase cycles.

4.2.1 Diff-RAID Throughput

As explained earlier, Diff-RAID provides a trade-off between reliability and performance. Figure 12 presents the random write throughput on our SSD array setup. We evaluated various Diff-RAID parity assignments, from (100, 0, 0, 0, 0) (left-most bar), corresponding to a RAID-4 configuration, to (20, 20, 20, 20, 20) (right-most bar), corresponding to a RAID-5 assignment. In between these two extremes, we assign a specific percentage of parity to one device and distribute the rest evenly across the remaining SSDs. We notice two points from Figure 12: first, Diff-RAID provides excellent random I/O throughput, varying from 8K to 15K IOPS, driving home the point that parity-based schemes such as Diff-RAID are suitable for SSD arrays in enterprise settings; second, Diff-RAID is flexible, letting the administrator choose the reliability-performance trade-off suitable for the specific workload setting.

Figure 12. Diff-RAID throughput under various parity assignments.

4.2.2 Performance Under Real Workloads

In addition to measuring the performance of Diff-RAID under random writes, we also evaluated the system using the set of enterprise traces described previously. Figure 13 presents the trace playback time when each trace was issued at full speed under various Diff-RAID configurations. Since writes are crucial and take more time to complete, we only issued the writes from each trace. The Diff-RAID configurations show a clear trend in performance (RAID-5 performing the best, RAID-4 performing the worst, and the other configurations in the middle) for all the traces.

Figure 13. Full-speed trace playback times show that the throughput trade-off holds for non-random workloads.

4.2.3 Recovery Time

Finally, we evaluate the recovery performance of Diff-RAID compared to conventional RAID-5. Diff-RAID recovery time is expected to be higher than RAID-5's due to the required reshuffling of parity assignments. This parity shifting can be performed during RAID-5 reconstruction, either naively (by moving all parity) or using a minimal number of parity block movements. Table 2 presents the recovery time for RAID-5 and for Diff-RAID's naive and minimal strategies. By shifting parity minimally, Diff-RAID can recover and move to the next parity configuration with just 56% more time than RAID-5. Note that reconstruction times for SSDs are much shorter than for hard drives, reducing the possibility of another device failure occurring during reconstruction.

                               RAID-5    Naive    Minimal
Reconstruction Time (minutes)  16.68     62.14    26.08

Table 2. Reconstruction time for an 80 GB SSD in a 5-device array. Minimal parity shifting uses only 56% more time than standard RAID-5 reconstruction.

5. Related Work

Commodity SSDs are already being used in enterprise settings.
A recent case study by Fusion-io reports on the benefits when MySpace moved to solid state storage [4]. The results of our paper can assist administrators in assessing and improving the reliability of enterprise flash deployments.

An example of a research system that uses clustered flash is FAWN, a large array of low-power nodes with flash storage used as a low-power data-intensive computing cluster [1]. The principle behind Diff-RAID could be used to enable redundancy schemes within FAWN-like systems that avoid correlated failures.

Since an SSD consists of an array of flash chips, it is possible to provide built-in redundancy within the SSD to handle failures at the chip level. Greenan et al. [5] propose erasure-coded stripes which span pages from different chips. SSDs with such inherent redundancy can improve the DLP of the whole Diff-RAID array. On similar lines, the RamSan-500 is a commercially available SSD that includes built-in redundancy [13]. Unlike commodity SSDs, it uses smaller ECC-based code words (64 bytes instead of the conventional 512 bytes) and applies RAID-3 internally across replaceable flash memory modules. While better than RAID-5, RAID-3 still suffers from correlated failures if the conventional replacement strategy is used.

This paper is a full-length version of an earlier workshop paper on the Diff-RAID system [8]. Several previous works have analyzed the bit error rates of flash chips and shown how they correlate with the age of the flash [3, 6, 10]. None of these studies looks at the correlation of failures across multiple SSDs.

Multiple systems have been proposed that use SSDs in concert with hard drives, usually as read caches [2, 9]. While such systems are critical for introducing flash into the storage stack, we believe that there is much value in examining storage designs composed purely of commodity flash, especially if the associated reliability concerns can be countered with redundancy.

6. Future Work

An important avenue of future work involves extending Diff-RAID to multiple-parity schemes such as RAID-6 or more general erasure-coded storage. We believe that uneven parity placement could improve the reliability of such solutions. It is interesting to note that a simple version of the Diff-RAID technique can be applied to RAID-1, where one of the mirrors is burnt to 50% of its erase cycles; this places the two-device array in a convergent age distribution, with every device replacement occurring when the surviving mirror is at 50% of its erase cycles.

We are currently implementing a prototype version of Diff-RAID within the Windows 7 volume manager; this will enable users to create and manage Diff-RAID arrays via standard administrative interfaces in Windows.

7. Conclusion

Diff-RAID is a new RAID variant designed specifically for SSDs. It distributes parity unevenly across the array to force devices to age at different rates. The resulting age differential is maintained across device replacements by redistributing parity on each device replacement. Diff-RAID balances the high BER of aging devices against the low BER of younger devices, reducing the chances of correlated failures. Compared to conventional RAID-5, Diff-RAID provides a higher degree of reliability for SSDs for the same space overhead, and also provides a fine-grained trade-off curve between throughput and reliability. It can potentially be used to operate SSDs past their erasure limit, as well as to reduce the need for expensive ECC within SSDs.

References

[1] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: A Fast Array of Wimpy Nodes. In Proceedings of the 22nd Symposium on Operating Systems Principles (SOSP'09), Big Sky, MT, October 2009.
[2] R. Bitar. Deploying Hybrid Storage Pools With Sun Flash Technology and the Solaris ZFS File System. Technical Report SUN-820-5881-10, Sun Microsystems, October 2008.
[3] P. Desnoyers. Empirical Evaluation of NAND Flash Memory Performance. In The First Workshop on Hot Topics in Storage (HotStorage'09), Big Sky, MT, October 2009.
[4] Fusion-io. MySpace Case Study. http://www.fusionio.com/case-studies/myspace-case-study.pdf.
[5] K. M. Greenan, D. D. Long, E. L. Miller, T. J. E. Schwarz, and A. Wildani. Building Flexible, Fault-Tolerant Flash-based Storage Systems. In The Fifth Workshop on Hot Topics in Dependability (HotDep'09), Lisbon, Portugal, June 2009.
[6] L. M. Grupp, A. M. Caulfield, J. Coburn, S. Swanson, E. Yaakobi, P. H. Siegel, and J. K. Wolf. Characterizing Flash Memory: Anomalies, Observations, and Applications. In MICRO'09, New York, December 2009.
[7] Intel Corporation. Intel X18-M/X25-M SATA Solid State Drive. http://download.intel.com/design/flash/nand/mainstream/mainstream-sata-ssd-datasheet.pdf.
[8] A. Kadav, M. Balakrishnan, V. Prabhakaran, and D. Malkhi. Differential RAID: Rethinking RAID for SSD Reliability. In The First Workshop on Hot Topics in Storage (HotStorage'09), Big Sky, MT, October 2009.
[9] J. Matthews, S. Trika, D. Hensgen, R. Coulson, and K. Grimsrud. Intel Turbo Memory: Nonvolatile disk caches in the storage hierarchy of mainstream computer systems. ACM Transactions on Storage, 4(2):1–24, 2008.
[10] N. Mielke, T. Marquart, N. Wu, J. Kessenich, H. Belgal, E. Schares, F. Trivedi, E. Goodness, and L. R. Nevill. Bit error rate in NAND Flash memories. In IEEE International Reliability Physics Symposium (IRPS), pages 9–19, April 2008.
[11] D. Narayanan, E. Thereska, A. Donnelly, S. Elnikety, and A. Rowstron. Migrating server storage to SSDs: analysis of tradeoffs. In Proceedings of the 4th ACM European Conference on Computer Systems, pages 145–158, 2009.
[12] Robin Harris. Why RAID-5 stops working in 2009. http://blogs.zdnet.com/storage/?p=162.
[13] Woody Hutsell. An In-depth Look at the RamSan-500 Cached Flash Solid State Disk. http://www.texmemsys.com/files/f000233.pdf.