
Differential RAID: Rethinking RAID for SSD Reliability

Asim Kadav, University of Wisconsin, Madison, WI
Mahesh Balakrishnan, Microsoft Research Silicon Valley, Mountain View, CA
Vijayan Prabhakaran, Microsoft Research Silicon Valley, Mountain View, CA
Dahlia Malkhi, Microsoft Research Silicon Valley, Mountain View, CA

ABSTRACT

Deployment of SSDs in enterprise settings is limited by the low erase cycles available on commodity devices. Redundancy solutions such as RAID can potentially be used to protect against the high Bit Error Rate (BER) of aging SSDs. Unfortunately, such solutions wear out redundant devices at similar rates, inducing correlated failures as arrays age in unison. We present Diff-RAID, a new RAID variant that distributes parity unevenly across SSDs to create age disparities within arrays. By doing so, Diff-RAID balances the high BER of old SSDs against the low BER of young SSDs. Diff-RAID provides much greater reliability for SSDs compared to RAID-4 and RAID-5 for the same space overhead, and offers a trade-off curve between throughput and reliability.

1. INTRODUCTION

Solid State Devices (SSDs) have emerged in the last few years as a viable secondary storage option for laptops and personal computers. Now, SSDs are poised to replace conventional disks within large-scale data centers, potentially providing massive gains in I/O throughput and power efficiency for high-performance applications. The major barrier to SSD deployment in such settings is cost [3]. This barrier is gradually being lifted through high-capacity Multi-Level Cell (MLC) technology, which has driven down the cost of flash storage significantly.

Yet, MLC devices are severely hamstrung by low endurance limits. Individual flash blocks within an SSD require expensive erase operations between successive page writes, and each erasure makes the device less reliable, increasing the Bit Error Rate (BER) observed by accesses. Consequently, SSD manufacturers specify not only a maximum BER (usually between 10^-14 and 10^-15 [2], as with conventional hard disks), but also a limit on the number of erasures within which this BER guarantee holds. For MLC devices, the rated erasure limit is typically 5,000 to 10,000 cycles per block; as a result, a write-intensive workload can wear out the SSD within months. Also, this erasure limit continues to decrease as MLC devices increase in capacity and density. As a consequence, the reliability of MLC devices remains a paramount concern for their adoption in servers [4].

In this paper, we explore the possibility of using device-level redundancy to mask the effects of aging on SSDs. Clustering options such as RAID can potentially be used to tolerate the higher BERs exhibited by worn-out SSDs. However, these techniques do not automatically provide adequate protection for aging SSDs; by balancing write load across devices, solutions such as RAID-5 cause all SSDs to wear out at approximately the same rate. Intuitively, such solutions end up trying to protect data on old SSDs by storing it redundantly on other, equally old SSDs. Later in the paper, we quantify the ineffectiveness of such an approach.

We propose Differential RAID (Diff-RAID), a new RAID technique designed to provide a reliable storage solution using MLCs. Diff-RAID leverages the fact that the BER of an SSD increases continuously through its lifetime, starting out very low and reaching the maximum specified rate as it wears out. Accordingly, Diff-RAID attempts to create an age differential among devices in the array, balancing the high BER of older devices against the low BER of young devices.

To create and maintain this age differential, Diff-RAID modifies conventional RAID in two ways. First, it distributes parity unequally across devices; since parity blocks are updated more frequently than data blocks due to random writes, this unequal distribution forces some devices to age faster than others. Diff-RAID supports arbitrary parity assignments, providing a trade-off curve between throughput and reliability. Second, Diff-RAID redistributes parity during device replacements to ensure that the oldest device in the array always holds the most parity and ages at the highest rate. When the oldest device in the array is replaced at some threshold age, its parity blocks are assigned across all the devices in the array and not just to its replacement.

Diff-RAID's ability to tolerate high BERs on aging SSDs provides multiple advantages. First, it provides a much higher degree of reliability for SSD arrays compared to standard RAID-5 or RAID-4 configurations. Second, it opens the door to using commodity SSDs past their erasure limit, protecting the data on expired SSDs by storing it redundantly on younger devices. Third, it potentially reduces the need for expensive hardware Error Correction Codes (ECC) in the devices; as MLC densities continue to increase, the cost of masking high error rates with ECC is prohibitive. These benefits are achieved at a slight cost in throughput degradation and in device replacement complexity; these trade-offs are explored in detail below.

2. BACKGROUND

NAND-based SSDs exhibit failure behaviors which differ substantially from hard disks. Most crucially, the Bit Error Rate (BER) of modern SSDs is uniform across all disk pages, and grows strongly with the number of updates the SSD receives. In order to understand this behavior, we briefly overview basic features of SSDs.

SSDs are composed of pages, which are the smallest units that can be read or programmed (written). Pages are structured together into larger units called blocks, which are the smallest units that can be erased (all reset to 1s). Sharing erasure circuitry across multiple pages allows for higher bit densities by freeing up chip area, reducing overall cost. Bits can only be programmed in one direction, from 1s to 0s; as a result, a rewrite to a specific page requires an erasure of the block containing the page.

Modern SSDs provide excellent performance despite these design restrictions by using wear-levelling algorithms that store data in log format on flash. These wear-levelling algorithms ensure that wear is spread evenly across the SSD, resulting in a relatively uniform BER across the SSD. However, SSDs are still subject to an erasure limit, beyond which the BER of the device becomes unacceptably high. SSDs typically use hardware ECC to correct bit errors and extend the erasure limit; in this paper, we use the term BER to refer to the post-ECC error rate, also called the Uncorrectable Bit Error Rate (UBER), unless specified otherwise.

First-generation SSDs used SLC (Single-Level Cell) flash, where each flash cell stores a single bit value. This variant of flash has relatively high endurance limits – around 100,000 erase cycles per block – but is prohibitively expensive; as a result, it is now used predominantly in top-tier SSDs meant for high-performance applications. Recently, the emergence of MLC (Multi-Level Cell) technology has dropped the price of SSDs significantly (for example, the Intel X25-M [1] is a popular MLC SSD priced at $4 per GB at the time of writing), storing multiple bits within each flash cell to expand capacity without increasing cost. However, MLC devices have a major drawback: the endurance limit drops to around 10,000 erase cycles per block. As flash geometries shrink and MLC technology packs more bits into each cell, the price of SSDs is expected to continue decreasing; unfortunately, their pre-ECC BER is expected to increase, creating the possibility of lower erasure limits.

In this paper, we re-examine the use of standard device-level redundancy techniques in the face of the unreliability characteristics of MLC devices as they approach and cross the erasure limit. Since, as we noted above, SSDs differ from hard disks in fundamental ways, this requires a rethink of traditional redundancy solutions such as RAID.

3. WHY NOT RAID?

RAID techniques induce similar update loads on multiple disks, which in turn may induce correlated failures in SSDs, as noted above. For RAID-1, RAID-10 and RAID-01, mirrors are exposed to identical workloads and grow old together. All the devices in RAID-5 receive roughly equivalent workloads and age at similar rates. RAID-4 does impose a slightly staggered wear-out schedule; the parity device wears out at a much faster rate than the other devices for workloads dominated by random writes, and has to be replaced much before them. However, the non-parity devices in RAID-4 age at similar rates. Thus, each of these RAID options imposes a lock-step wear-out schedule on devices in the array.

As long as each SSD is replaced immediately as it hits the erasure limit, the SSD array is as reliable as an array constructed from hard disks of comparable maximum BER. However, if the SSDs are allowed to continue past their erasure limit, the reliability of the array falls drastically: because the devices wear out in lock-step, when one of them fails, the probability of another failure during the RAID reconstruction is very high. Consequently, the overall reliability of conventional RAID arrays of SSDs varies widely over time, with the probability of data loss peaking as multiple devices reach their erasure limit simultaneously. A more quantitative analysis of this is given in Section 5 below.

Essentially, the wear-out schedule imposed on different devices by a redundancy scheme heavily impacts the overall reliability of the array. The load-balancing properties of schemes such as RAID-5 – which make them so attractive for hard disks – result in lower array reliability for SSDs. They induce a lock-step SSD wear-out schedule that leaves the array vulnerable to correlated failures.

What is required is a redundancy scheme that can distribute writes unevenly across devices, creating an imbalance in update load across the array. Such a load imbalance will translate into different aging rates for SSDs, allowing staggered device wear-outs over the lifetime of the array and reducing the probability of correlated failure. This observation leads us to introduce novel RAID variants, which we collectively name Diff-RAID.
4. DESIGN

The goal of Diff-RAID is to use device-level redundancy to mask the high BERs of aging SSDs. The basic idea is to store data redundantly across devices of different ages, balancing the high BER of old SSDs past their erasure limit against the low BER of young SSDs. Diff-RAID implements this idea by creating and maintaining age differentials within a RAID array.

We measure the age of a device by the average number of cycles used by each block in it, assuming a perfect wear-levelling algorithm that equalizes this number across blocks. For example, if each block within a device has used up roughly 7,500 cycles, we say that the device has used up 7,500 cycles of its lifetime.

Diff-RAID controls the aging rates of different devices in the array by distributing parity blocks unevenly across them. Random writes cause parity blocks to attract much more write traffic than data blocks; every random write to a data block results in a write to the corresponding parity block as well. As a result, an SSD with more parity receives a higher number of writes and correspondingly ages faster. We represent parity assignments with n-tuples of percentages; for example, (40, 15, 15, 15, 15) represents a 5-device array where the first device holds 40% of the parity and the other devices store 15% each. An extreme example of uneven parity assignment is RAID-4, represented by (100, 0, 0, 0, 0) for 5 devices, where the first device holds all the parity. The other extreme is RAID-5, represented by (20, 20, 20, 20, 20) for 5 devices, where parity is distributed evenly across all devices.

For a workload consisting only of random writes, it is easy to compute the relative aging rates of the devices for any parity assignment. For an n-device array, if p_i and p_j are the percentages of parity allotted to the i-th and the j-th device respectively, then the ratio a_ij of the aging rate of the i-th device to that of the j-th device is:

    a_ij = [ p_i * (n - 1) + (100 - p_i) ] / [ p_j * (n - 1) + (100 - p_j) ]

In the example of 5-device RAID-4, the ratio of the aging rate of the parity device to that of any other device would be (100 * 4 + 0) / (0 * 4 + 100) = 4; in other words, the parity device ages four times as fast as any other device. Since RAID-4 is an extreme example of uneven parity assignment, it represents an upper bound on the disparity of aging rates in an array; all other parity assignments will result in less disparity, with RAID-5 at the other extreme providing no disparity.
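The formula is straightforward to evaluate programmatically. The sketch below is ours (not from the paper) and assumes the purely random-write model described above; aging_rates is a hypothetical helper name.

    # Sketch (ours, not the paper's code): relative aging rates under the
    # random-write model. parity[i] is the percentage of parity on device i.
    def aging_rates(parity):
        n = len(parity)
        # Each device absorbs parity updates for its share of stripes and
        # ordinary data writes for the rest.
        weights = [p * (n - 1) + (100 - p) for p in parity]
        base = min(weights)
        return [w / base for w in weights]

    # 5-device RAID-4: the parity device ages 4x faster than the others.
    print(aging_rates([100, 0, 0, 0, 0]))     # [4.0, 1.0, 1.0, 1.0, 1.0]
    # 5-device RAID-5: every device ages at the same rate.
    print(aging_rates([20, 20, 20, 20, 20]))  # [1.0, 1.0, 1.0, 1.0, 1.0]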
age, in as same other long the the as at than young are them SSDs keep two pos- to If as reliable, want more soon we array as hence the it and old make remove SSDs an and young Conversely, to rapidly of want cycles sible. we presence erase hence it its the and faster up unreliable, the simple: use more pro- array device, is the be a here makes to older SSD intuition the device The that a so of ages. age, rate its aging to the portional like would we disparity.Ideally, with no disparity, providing less extreme other in the result at will RAID-5 assignments parity other h otrlal aiyasgmn o ie throughput given a for assignment parity require- reliable their choose most fits adaptively the that could curve itself Diff-RAID trade-off Potentially, can the ments. administrators on sup- point assignment, a Diff-RAID parity choose Since arbitrary highest reliability. any the least ports provides the RAID-5 but to reliability, highest throughput corresponding the one but – the par- bottleneck to and single a the as corresponding acting with device – assignment ity throughput parity least the Diff-RAID provides RAID-4 The through- and put. reliability between trade-off a provides Diff-RAID omit we space. — of assignment 2,500 lack parity conver- 3,750, for any the details for at calculate distribution can devices at We age always other gent respectively. is the cycles device 1,250 with and time remaining cycles, of replacement oldest erase number at the 5,000 small remain- devices that a the oldest so After of the converge ages to replacement. the shifting each replacements, on parity device the ing array with distribu- (100 5-device RAID-4 a of the for is assignment plots time parity replacement 2 a at Figure with ages device of distribution. tion converge stationary eventually time a replacement to de- at over the observed assignment, shuffling ages parity vice given age-driven a this for replacements continue multiple we if Interestingly, the each. (including 10% SSDs assigned other assigned are the is and one) SSD-1 blocks, new replaced, when parity is the result, are SSD-0 of SSDs a 70% When three As other cycles. the devices. 5,000 cycles, at three erase other aging 10,000 the reaches device SSD-0 as first fast the as in twice results assignment this previously, , 10 , 10 , 0;b h oml given formula the by 10); , 0 , 0 , 0 , ) setal,this essentially, 0); 100 90 80 70 60 50 40 30 20

Interestingly, if we continue multiple replacements with this age-driven shuffling for a given parity assignment, the device ages observed at replacement time eventually converge to a stationary distribution. Figure 2 plots the distribution of device ages at replacement time for a 5-device array with the RAID-4 parity assignment (100, 0, 0, 0, 0), with the parity shifting to the oldest of the remaining devices on each replacement. After a small number of replacements, the ages of the remaining devices converge so that the oldest remaining device is always at 5,000 erase cycles at replacement time, with the other devices at 3,750, 2,500 and 1,250 cycles respectively. We can calculate the convergent distribution for any parity assignment — we omit the details for lack of space.

[Figure 2 here: bar chart of device ages (% of rated cycles) versus number of replacements (0-15).]

Figure 2: Age distribution of devices for Diff-RAID parity assignment (100, 0, 0, 0, 0) (i.e., RAID-4 with age-driven shuffling on device replacements). Each set of bars represents the age distribution of devices in the array when the oldest SSD wears out.
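A minimal simulation of this process (our sketch, not the authors' simulator; it assumes the random-write aging model above and replacement of the worn-out device at 10,000 cycles) reproduces the convergence:

    # Sketch: device ages at replacement time under age-driven shuffling.
    def simulate_replacements(parity, limit=10_000, rounds=15):
        n = len(parity)
        ages = [0.0] * n
        history = []
        for _ in range(rounds):
            # Oldest device gets the largest parity share, and so on down.
            order = sorted(range(n), key=lambda d: -ages[d])
            shares = sorted(parity, reverse=True)
            rank = {d: r for r, d in enumerate(order)}
            rates = [shares[rank[d]] * (n - 1) + (100 - shares[rank[d]])
                     for d in range(n)]
            # Advance until the fastest-aging (oldest) device hits the limit.
            dt = min((limit - ages[d]) / rates[d] for d in range(n))
            ages = [ages[d] + rates[d] * dt for d in range(n)]
            history.append(sorted(ages, reverse=True))
            # Swap in a brand-new device for the worn-out one.
            ages[max(range(n), key=lambda d: ages[d])] = 0.0
        return history

    for snapshot in simulate_replacements([100, 0, 0, 0, 0]):
        print([round(a) for a in snapshot])

For (100, 0, 0, 0, 0) the printed snapshots approach (10,000, 5,000, 3,750, 2,500, 1,250), matching the stationary distribution described above.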

Diff-RAID provides a trade-off between reliability and throughput. The parity assignment corresponding to RAID-4 provides the least throughput – with the single parity device acting as a bottleneck – but the highest reliability, and the one corresponding to RAID-5 provides the highest throughput but the least reliability. Since Diff-RAID supports any arbitrary parity assignment, administrators can choose a point on the trade-off curve that fits their requirements. Potentially, Diff-RAID could itself adaptively choose the most reliable parity assignment for a given throughput level as the write performance required by the workload fluctuates.

When a device is replaced and a new device takes its place, age-driven shuffling of the parity assignments has to occur. One implementation option involves treating a device replacement as a failure and triggering the RAID reconstruction process to rebuild the contents of the replaced SSD on a new device, but modifying the reconstruction logic so that it interchanges parity blocks and data blocks on different SSDs to achieve the required parity assignment over a shuffled ordering of the devices. We expect to explore different implementation options that perform device replacement without triggering RAID reconstruction logic, as well.
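The paper does not prescribe how a percentage assignment is laid out across stripes; one simple way to realize it over an age-ordered set of devices (our sketch, with hypothetical function names) is a repeating 100-stripe rotation:

    # Sketch (ours): map each stripe to the device holding its parity,
    # given a percentage assignment applied to devices ordered oldest-first.
    def make_parity_map(parity_pct, devices_oldest_first):
        pattern = []
        for share, dev in zip(parity_pct, devices_oldest_first):
            pattern.extend([dev] * share)  # shares sum to 100
        def parity_of(stripe):
            return pattern[stripe % len(pattern)]
        return parity_of

    # RAID-4-style assignment on a shuffled ordering: device 3 (the oldest
    # here) holds parity for every stripe.
    parity_of = make_parity_map([100, 0, 0, 0, 0], [3, 0, 1, 2, 4])
    print([parity_of(s) for s in range(6)])             # [3, 3, 3, 3, 3, 3]

    # A (40, 15, 15, 15, 15) assignment gives device 3 parity for 40% of
    # the stripes in every run of 100 consecutive stripes.
    parity_of = make_parity_map([40, 15, 15, 15, 15], [3, 0, 1, 2, 4])
    print(sum(parity_of(s) == 3 for s in range(1000)))  # 400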
5. SIMULATION RESULTS

In this section, we show through simulation that Diff-RAID provides much better reliability for SSD arrays than conventional RAID. Also, we show that Diff-RAID can push inexpensive, low-end SSDs to twice their rated erasure limit without compromising the overall reliability of the array. We model the Bit Error Rate of an individual SSD using data in a study published by Intel and Micron [2]. That paper contains raw (pre-ECC) BERs for four MLC devices up to 10,000 erase cycles, of which one device is rated at 5,000 cycles. We use the raw BER curve of this 5,000-cycle device as a starting point, assume 2-bit ECC, and modify the resulting UBER data slightly to obtain a smooth curve that hits an UBER of 10^-14 at 5,000 cycles, comparable to the UBER of SATA hard disks.

We quantify the reliability of a RAID array by considering a common failure mode — a disk fails in the array, and during reconstruction an uncorrectable bit error is encountered elsewhere in the array, resulting in the loss of data. For a 6-disk RAID-5 array of 80 GB SATA disks (with UBER equal to 10^-14), the probability of data loss given a disk failure is around 3%, which remains fairly constant throughout the lifetime of the array. In other words, for every 100 instances of a disk failure in such an array, roughly 3 will experience some data loss.
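That 3% figure can be reproduced with a back-of-the-envelope calculation (ours, assuming independent bit errors at the stated UBER):

    # Sketch: chance that rebuilding one failed disk hits an uncorrectable
    # bit error somewhere on the (n - 1) surviving disks.
    def reconstruction_loss_prob(n_disks, capacity_bytes, uber):
        bits_read = (n_disks - 1) * capacity_bytes * 8
        return 1.0 - (1.0 - uber) ** bits_read

    # 6-disk array of 80 GB disks with UBER 10^-14: roughly 3%.
    print(reconstruction_loss_prob(6, 80e9, 1e-14))  # ~0.03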

[Figure 3 here: probability of data loss on one disk failure (y-axis) versus total array writes in multiples of 20M pages (x-axis), for RAID-5, RAID-4, and Diff-RAID.]

Figure 3: RAID-5 and RAID-4 versus Diff-RAID for a 6-device array. Diff-RAID is configured to use a RAID- 4 parity assignment – (100, 0, 0, 0, 0, 0) – but differs by performing age-driven shuffling on device replacements.
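The paper does not spell out the simulator behind Figure 3; the sketch below (ours) shows the data-loss computation that the failure mode above implies when the surviving devices have different UBERs because of their different ages. The uber_at function stands in for the fitted UBER-versus-cycles curve described earlier, which the paper does not give explicitly.

    # Sketch (ours): probability of data loss when one device fails and all
    # surviving devices are read during reconstruction, each at the UBER
    # corresponding to its own age (erase cycles used so far).
    def array_loss_prob(survivor_ages, uber_at, capacity_bytes):
        bits_per_device = capacity_bytes * 8
        p_clean = 1.0
        for age in survivor_ages:
            p_clean *= (1.0 - uber_at(age)) ** bits_per_device
        return 1.0 - p_clean

    # Usage, with some fitted curve supplied as uber_at, e.g. the converged
    # 5-device Diff-RAID survivor ages after the oldest device fails:
    # p = array_loss_prob([5000, 3750, 2500, 1250], uber_at, 80e9)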

Figure 3 illustrates the difference between conventional RAID-5 and RAID-4 used with SSDs and a Diff-RAID parity assignment corresponding to RAID-4. The parity assignment used by Diff-RAID is (100, 0, 0, 0, 0, 0) — effectively, the only difference between conventional RAID-4 and the Diff-RAID version is the age-driven shuffling of parity on each device replacement. Conventional RAID-5 causes all SSDs to age in lock-step fashion, and conventional RAID-4 does so with the data devices; as a result, the probability of data loss on an SSD failure climbs to almost 1 for both solutions as the array ages, and periodically resets to almost zero whenever all SSDs are replaced simultaneously. In contrast, Diff-RAID never experiences a data loss probability of more than 17%, and once the age distribution converges, it consistently maintains a data loss probability under 4%, which is comparable to the 3% data loss probability of a SATA disk array. In this simulation, we assume that the oldest device in the array fails initially, and compute the probability of data loss in the remaining devices; if any device is allowed to fail initially with equal probability, the RAID-5 curve remains unchanged, the RAID-4 curve does not change significantly, and the Diff-RAID curve is bounded at 35%.

Figure 4 shows the space of possible Diff-RAID configurations for a 6-device array. Each point in this graph represents a different parity assignment; on the x-axis is the throughput of the solution as a fraction of the total array throughput, and on the y-axis is the probability of data loss upon a device failure. As expected, the parity assignment corresponding to RAID-4 is on one extreme of the trade-off curve, providing high reliability but very low throughput, and the assignment corresponding to RAID-5 is at the opposite end, with high throughput but very low reliability.

[Figure 4 here: data loss probability (y-axis) versus fraction of array throughput (x-axis), with the RAID-4 and RAID-5 assignments marked at the two extremes.]

Figure 4: Throughput and reliability at different Diff-RAID parity assignments for a six-device RAID array.

Sequential Writes: So far, we have assumed a workload consisting only of random writes. Sequential writes that update the entire stripe can potentially reduce the reliability of a Diff-RAID array, since they reduce the ability of parity blocks to imbalance write load. When we ran the experiment shown in Figure 3 with a workload consisting of 10% sequential writes, the worst-case unreliability for Diff-RAID in the initial period climbed to 23%; after convergence, the unreliability stays under 8%. With a workload where sequential writes comprise 50% of all writes, Diff-RAID is still bounded at 25% unreliability after convergence, providing 4 times the reliability of conventional RAID-4 and RAID-5. In the worst case where all writes are sequential, Diff-RAID degrades to provide reliability identical to conventional RAID.

6. FUTURE WORK

We are currently implementing Diff-RAID as a RAID driver. We hope that such an implementation will provide insight into the end-to-end performance of SSD-based storage stacks. With commodity RAID controllers – or software RAID running on conventional machines – it is possible that the RAID logic is the bottleneck for the random write throughput of the array. In this case, we will not see the throughput-reliability trade-off described in Figure 4; instead, the throughput of all configurations will lie under an upper threshold.

We expect the Diff-RAID implementation to have a policy component that determines which parity assignment to use. As mentioned previously, the policy could select the most reliable configuration below the application's throughput requirement; alternately, it could factor in the maximum throughput possible with the hardware setup. Also, Diff-RAID could allow the application to specify a threshold level of data loss probability and select the highest-throughput configuration under that threshold.

While this paper focuses on parity-based RAID, other schemes which use mirroring (such as RAID-1 or RAID-10) are vulnerable to correlated aging as well. Diff-RAID does not immediately generalize to mirroring, since it uses the higher write load of parity blocks to age SSDs differentially. Different techniques are required to combat correlated aging in mirrored RAID setups.

In general, the large-scale introduction of SSDs into clustered settings poses many interesting questions. System administrators may be required to track the age of individual devices and define proactive replacement policies. Also, considering a larger pool of SSDs – as opposed to a single array – may allow for solutions that are more effective at creating age differentials.

7. CONCLUSION

Diff-RAID is a new RAID variant designed specifically for SSDs. It distributes parity unevenly across the array to force devices to age at different rates, and redistributes this parity on each device replacement. By doing so, Diff-RAID maintains an age differential between devices in the RAID array, balancing the high BER of aging devices against the low BERs of younger devices and reducing the chances of correlated failures. Compared to conventional RAID-5 or RAID-4, Diff-RAID provides a higher degree of reliability for SSDs for the same space overhead, and provides a trade-off curve between throughput and reliability. It can potentially be used to operate SSDs past their erasure limit, as well as reduce the need for expensive ECC within SSDs.

8. REFERENCES

[1] Intel Corporation. Intel X18-M/X25-M SATA Solid State Drive. http://download.intel.com/design/flash/nand/mainstream/mainstream-sata-ssd-datasheet.pdf.
[2] N. Mielke, T. Marquart, N. Wu, J. Kessenich, H. Belgal, E. Schares, F. Trivedi, E. Goodness, and L. Nevill. Bit Error Rate in NAND Flash Memories. In IRPS 2008: IEEE International Reliability Physics Symposium, 2008.
[3] D. Narayanan, E. Thereska, A. Donnelly, S. Elnikety, and A. Rowstron. Migrating Server Storage to SSDs: Analysis of Tradeoffs. In EuroSys 2009. ACM, 2009.
[4] D. Roberts, T. Kgil, and T. Mudge. Integrating NAND Flash Devices onto Servers. Commun. ACM, 52(4):98–103, 2009.