Practical Approach to Determining SSD Reliability Todd Marquart, Ph.D, Fellow-Micron Technology
Total Page:16
File Type:pdf, Size:1020Kb
Practical Approach to Determining SSD Reliability Todd Marquart, Ph.D, Fellow-Micron Technology ©2014 Micron Technology, Inc. All rights reserved. Products are warranted only to meet Micron’s production data sheet specifications. Information, products, and/or specifications are subject to change without notice. All information is provided on an “AS IS” basis without warranties of any kind. Dates are estimates only. Drawings are not to scale. Micron and the Micron logo are trademarks of Micron Technology, Inc. All other trademarks are the property of their respective owners. 1 August 25, 2015 | ©2014 Micron Technology, Inc. | Micron Confidential Some Initial Definitions and Acronyms • Total Bytes Written (TBW): ▪ This is the amount of data transferred byt the host. Typically base 10. • Logical or User Density: ▪ This is the total amount of data a user can store on a drive at any given moment. Typically base 10. • Physical Density: ▪ The total amount of storage physically on the drive, typically base 2. • Write Amplification (WA): ▪ A measure of how much data is actually written to the SSD compared to what was requested by the host. • Mean Time To Failure (MTTF): ▪ The position parameter of the exponential distribution based on the time to first failure. ▪ The exponential distribution applies to failures that are truly random. ▪ MTBF (Mean Time Between Failure): Only valid for a repairable system. • Annualized Failure Rate (AFR): ▪ The failure probability in 1 year, constant for an exponential distribution. 2 August 25, 2015 | ©2014 Micron Technology, Inc. | Micron Confidential Reliability Statistics 1 β=4.0 β=1.0 0.8 F(t): Cumulative g n i l i 0.6 =0.1 a β F Distribution n o i t c a 0.4 Function (CDF). r F F(t) =1− S(t) 0.2 β=4.0 Where : S(t) = Survivor Function 0 0 0.5 1 1.5 2 H(t): Hazard Multiples of η Function. β=0.1 f (t) f (t) β=0.1 H (t) = = β=1.0 1− F(t) S(t) β=4.0 0 0.5 1 1.5 2 Multiples of η β=1.0 f(t): Probability Distribution Function (PDF) dF f (t) = 0 0.5 1 1.5 2 dt Multiples of η 3 | ©2014 Micron Technology, Inc. 3 Reliability Statistics • Weibull Distribution • What is it? ▪ Large family of distributions that cover the entire life of many products. β • What is it good for? & , t ) # F(t) =1− exp$− * ' ! ▪ Many failure modes can be described with or approximated by a *η ' Weibull distribution. The lognormal is not covered by the Weibull %$ + ( "! family. To distinguish Weibull from lognormal requires at least 20 points of GOOD data. β β ▪ Works well with very small sample sizes. Analysis is possible with , t ) β & , t ) # 1 or even 0 failures. f (t) = * ' exp$− * ' ! *η ' t *η ' ▪ Good starting point for most analyses. If the data is actually + ( %$ + ( "! lognormal, the Weibull distribution will give more conservative extrapolated lifetimes making it a safer starting point. β ▪ Weakest link distribution. ( t % β • What are the parameters? H (t) = & # & # t ▪ ß is the shape or slope parameter. It is the slope of the line when 'η $ plotted on Weibull paper. ▪ ß=1 gives the exponential distribution. ▪ η is the characteristic life or scale parameter. It is the time to 63.2% failure. 4 | ©2014 Micron Technology, Inc. 4 Reliability Statistics • In general it is advantageous to plot distributions on axes that generate straight lines in order to judge the fit of the data, examine signs of curvature etc… β & , t ) # Weibull CDF: F(t) =1− exp$− * ' ! %$ +η ( "! ' 1 $ ln ln% " = ß ln(t) − ß ln(η)= ln(−ln(1− F(t))) &1− F(t) # Y = m X + b 5 | ©2014 Micron Technology, Inc. 5 Reliability Statistics • A plot of the Weibull CDF gives a good , intuitive view of the product lifetime. • The same reliability mechanisms will tend to have the same Weibull slope even when changing from one technode to another. 6 | ©2014 Micron Technology, Inc. 6 Reliability Statistics • Historical view of overall reliability lifetime. • Not my favorite view. Useful Life End of Life Early Life Early Wear out Wear out 7 | ©2014 Micron Technology, Inc. 7 Acceleration Factors • Acceleration factors take the failure distribution and move it parallel to the original distribution. ▪ A change is slope 15X typically indicates a shift in mechanism. • Typically it is used to move failure distributions into the region (stress and sample size) of interest. 8 | ©2014 Micron Technology, Inc. 8 JEDEC JESD218 • From JEDEC JESD218, Section 1: • “This standard defines JEDEC requirements for solid state drives. For each defined class of solid state drive, the standard defines the conditions of use and the corresponding endurance verification requirements. Although endurance is to be rated based upon the standard conditions of use for the class, the standard also sets out requirements for possible additional use conditions as agreed to between manufacturer and purchaser. • Qualification of a solid state drive involves many factors beyond endurance and retention, so such qualification is beyond the scope of this standard, but this standard is sufficient for the endurance and retention part of a drive qualification. This standard applies to individual products and also to qualification families as defined in this standard. • The scope of this standard includes solid state drives based on solid-state non-volatile memory (NVM). NAND Flash memory is the most common form on memory used in solid state drives at the time of this writing, and this standard emphasizes certain features of NAND. The standard is also intended to apply to other forms of NVM. ” 9 August 25, 2015 | ©2014 Micron Technology, Inc. | Micron Confidential JESD218-Testing • JESD218, is really about validating that the drive can meet its endurance specification, including post TBW data retention. RDT TBW Qual (JESD218) Cycle-to-Death Testing 10 August 25, 2015 | ©2014 Micron Technology, Inc. | Micron Confidential JESD218 Requirements • The general requirements for an SSD are shown here, as per JESD218. ▪ 2 Classes are defined, Client and Enterprise. 11 August 25, 2015 | ©2014 Micron Technology, Inc. | Micron Confidential Functional Failure Rate • FFR is dominated by defectivity related issues. • Single pages, wordlines, blocks can fail as can entire die. ▪ In some cases failures can be graceful, not resulting in data loss, others are catastrophic. P. Muroke, Proc IRPS, 2006 12 | ©2014 Micron Technology, Inc. Functional Failure Rate • These are typically the defect related failures. ▪ Defective NAND, controller failures etc… • This is related to the MTTF/ AFR of the drive. ▪ Most drives specify the AFR/ MTTF of a drive that may not align with the JEDEC specification. • In most cases it is the purpose of the Reliability Demonstration Test (RDT) to demonstrate the MTTF. 13 August 25, 2015 | ©2014 Micron Technology, Inc. | Micron Confidential Functional Failure Rate-Sample Size 14 August 25, 2015 | ©2014 Micron Technology, Inc. Functional Failure Rate and RDT • The FFR at the max TBW gives an estimate of the average AFR over the JEDEC product lifetime, but may Region not represent the fallout in the first year. ▪ For a purely exponential RDT distribution, the numbers Region will be the same, however, in the non-exponential case it will not align. 15 August 25, 2015 | ©2014 Micron Technology, Inc. UBER-Uncorrectable Bit Error Rate • From JEDEC JESD218 16 August 25, 2015 | ©2014 Micron Technology, Inc. | Micron Confidential UBER and Fail Probability • UBER is related to the failure probability of a codeword. ▪ The third piece of the equation below is essentially a usage model. N failing _sectors UBER(DRSSD )= ' N $ N cycles 1 bits _in _ sample ⋅% + " & WA # F(t)codeword ⋅ Ndrives ⋅ Ncodewords _ per _ drive ⋅ Nsectors _ per _ cw Instantaneous UBER = ' N $ N N N cycles 1 drives ⋅ codewords _ per _ drives ⋅ bits _ per _ cw ⋅% + " & WA # F(t)codeword ⋅ Ndrives ⋅ Ncodewords _ per _ drive ⋅ Nsectors _ per _ cw = ' N $ N N N cycles 1 Usage Model drives ⋅ codewords _ per _ drives ⋅ bits _ per _ cw ⋅% + " & WA # F(t)codeword ⋅ Nsectors _ per _ cw F(t)codeword 1 = = ⋅ Nsectors _ per _ cw ⋅ ' N $ N Ncycles N cycles 1 bits _ per _ cw +1 bits _ per _ cw ⋅% + " & WA # WA 17 August 25, 2015 | ©2014 Micron Technology, Inc. Flash Memory Reliability-UBER F(t)codeword 1 UBER(DRSSD )= ⋅ Nsectors _ per _ cw ⋅ N N bits _ per _ cw cycles +1 WA 18 August 25, 2015 | ©2014 Micron Technology, Inc. UBER at the Component vs Drive • The drive and component UBERs may not be aligned for a variety of reasons. L L 0 L1 2 L3 ▪ The usage models that feed the UBER calculations may not be -2 aligned. 10 ▪ Additional error-management may be present at the drive y t i l i b a level. b o r P -3 ▪ RAID 10 ▪ LDPC vs BCH ▪ Different ECC levels -4 10 -2 -1 0 1 2 3 4 5 6 Vt (V) BCH Type Codes -4 HRER ≈ 3.0e-03 10 1 HRER ≈ 3.0e-04 -6 10 -8 10 R E -10 B 10 U -12 10 LDPC Type Codes -14 10 -16 10 -1 -2 -3 10 10 10 Correction Probability RBER 0 Number of errors in a Khayat, P. R. et. al., IMC, 2015 Codeword 19 August 25, 2015 | ©2014 Micron Technology, Inc. Sample Size and UBER 20 August 25, 2015 | ©2014 Micron Technology, Inc. High Temperature Data Retention • Shifts ▪ Detrapping of electrons. ▪ Causes the entire distribution to shift and widen. ▪ The shift is of the entire population, the shift by bit is highly variable. Before Bake - - - - Vt e- -e - e -e Pre Bake e e e ee- e- e- e- e- e- Post Bake After Bake - - Vt-dV - -e - - e- - e - e Log Bit Count e e e ee Vt 21 | ©2014 Micron Technology, Inc.