An Analysis of Data Corruption in the Storage Stack

Lakshmi Bairavasundaram, Garth Goodson (NetApp, Inc.); Bianca Schroeder (University of Toronto); Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau (University of Wisconsin-Madison)

Storage Developer Conference 2008

Corruption Anecdote
- There is much anecdotal evidence of data corruption
  - E.g., a corrupted photo stored on an author's laptop (image shown on the slide)
- System designers know of similar occurrences
- Data protection is often based on anecdotes
- Anecdotes are interesting, but not enough for system design; a more rigorous understanding is needed
Our Analysis

- First large-scale study of data corruption
  - 1.53 million disks in thousands of NetApp systems
  - Time period: 41 months (Jan 2004 – Jun 2007)
- Corruption detection using various data protection techniques
- Data from the NetApp Autosupport Database
  - Also used in the latent sector error [Bairavasundaram07] and disk and storage failure [Jiang08] studies
Questions we had about corruption

- What kinds of corruption occur, and how often?
- Does disk class matter?
  - Expensive enterprise (FC) disks versus cheaper nearline (SATA) disks
- Does disk drive family/product matter?
- Are corruption instances independent?
- Do corruption instances have spatial locality?
Talk Outline

- Introduction
- Background
  - Data corruption
  - Protection techniques
- Results
- Lessons
- Conclusion
Should we care about disk errors?

- Joint UIUC/NetApp system failure analysis
  - 44 months; 39,000 systems; 1.8 million disks
[Pie charts: system failures broken down into performance failures, protocol failures, disk failures, and physical interconnect failures, for High-End and Nearline systems]

*W. Jiang, et al., "Are disks the dominant contributor of storage failures?", USENIX FAST, 2008

Disk system failure rates
- From the failure-rate pie charts:
  - High-end: 29% of system failures are disk failures
  - Nearline: 57% of system failures are disk failures
- What's going on?
  - Software is generally the same
  - Hardware platforms are somewhat different
  - The real difference is the type of disk in use, i.e., Fibre Channel vs. SATA
Types of disk errors

- Operational/component failures
  - Fundamental problem with the drive hardware: bad servo, head, electronics, etc.
- Firmware bugs
  - Failure to flush the cache on power-down, etc.
- Partial failures
  - Affect only a small subset of disk sectors
  - Errors during writing: bad media, high-fly writes, vibration, etc.
  - Errors during reading (the write was successful): scratches, corrosion, thermal asperities, etc.
Unreported Disk Errors

- Operational failures are easy to detect
  - Usually fail-stop; something stops working
- Latent sector errors are reported via SCSI errors
  - Occur when a disk sector is read
- What about errors that go undetected?
  - Errors not corrected by the disk's ECC
  - They cannot be corrected unless they are detected first
  - The result is usually some form of corruption
Data Corruption

- Data stored in a disk block is incorrect
- Many sources
  - Software bugs: file system, software RAID, device drivers, etc.
  - Firmware bugs: disk drives, shelf controllers, adapters, etc.
- Corruption is silent
  - Not reported by the disk drive
  - Could have a greater impact than other errors
Forms of Data Corruption

- Bit corruption
  - Contents of an existing disk block are modified, or data being written to a disk block is corrupted
- Lost writes
  - Data is not written, but completion is reported
- Misdirected writes
  - Data is written to the wrong disk block
- Torn writes
  - Data is partially written, but completion is reported
- In all cases, the data passes the disk's internal ECC
Detecting data corruption

- Basic idea:
  1. Generate a checksum of the data (64 bytes per 4 KB)
  2. Store the checksum along with the data (4 KB FS block)
  3. Verify the checksum whenever reading the data
- A simple checksum has limited protection
  - Detects bit corruption and torn (partial) writes
  - No protection against lost or misdirected writes, since the data was not overwritten
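The three-step scheme above can be sketched in Python. The dict-backed block store and the CRC32 checksum are illustrative stand-ins, not the actual on-disk format:

```python
import zlib

BLOCK_SIZE = 4096  # 4 KB file-system block

def write_block(store, addr, data):
    """Steps 1-2: generate a checksum and store it alongside the data."""
    store[addr] = (data, zlib.crc32(data))

def read_block(store, addr):
    """Step 3: verify the checksum on every read."""
    data, cksum = store[addr]
    if zlib.crc32(data) != cksum:
        raise IOError(f"checksum mismatch at block {addr}")
    return data

store = {}
write_block(store, 7, b"A" * BLOCK_SIZE)

# Bit corruption is caught: flip a byte without updating the checksum.
data, cksum = store[7]
store[7] = (b"B" + data[1:], cksum)
try:
    read_block(store, 7)
except IOError as e:
    print(e)  # checksum mismatch detected
```

Note that a lost write would leave the old data with its old, still-matching checksum in place, which is exactly why this simple scheme cannot detect lost or misdirected writes.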
Checksum problems: lost writes

[Diagram: RAID stripe with data blocks A, B, C and parity P(ABC), each stored with its block checksum cksum(A), cksum(B), cksum(C), cksum(P)]

- Lost write: the overwrite C → C' is acknowledged but never reaches the disk, while the parity is updated to P(ABC')
- A later read of file ABC' passes checksum verification and returns corrupt data (C instead of C')

Write verify: a partial solution
- Attempts to solve the lost-write problem
- A costly solution, so we expect good protection
- Procedure:
  1. Write the data to disk
  2. Read it back to verify
  3. If a lost write is detected (cksum(C') ≠ cksum(C)), write C' again or remap to a new location
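The procedure can be sketched as follows; `FlakyDisk` is a hypothetical drive model that acknowledges one write without persisting it (a lost write):

```python
import zlib

class FlakyDisk:
    """Hypothetical drive that silently drops the next write (a lost write)."""
    def __init__(self, contents):
        self.contents = contents
        self.drop_next = True
    def write(self, data):
        if self.drop_next:
            self.drop_next = False  # reports success, but nothing is persisted
        else:
            self.contents = data
    def read(self):
        return self.contents

def write_verify(disk, data):
    disk.write(data)                                 # 1. write data to disk
    if zlib.crc32(disk.read()) != zlib.crc32(data):  # 2. read back to verify
        disk.write(data)                             # 3. lost write: write again
    return disk.read()

disk = FlakyDisk(b"C")
assert write_verify(disk, b"C'") == b"C'"  # the retry repairs the lost write
```

Even with read-back, this only catches a write lost before the verify; it doubles the I/O cost of every write, which is why it is only a partial solution.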
[Diagram: the overwrite C → C' is lost but reported as a success; the read-back returns C, the lost write is detected, and C' is written again]

Lost write protection: a better way
- Need logical information pertaining to block identity
  - Something external to the data being stored
- Store the inode and FS block number within the checksum area
  - Verified by the file system at read time
- We also add a checksum of the checksum structure itself

[Diagram: a 4 KB file-system block stored across eight 520-byte sectors, with a 64-byte checksum area containing the block checksum (protects the 4 KB FS block), the block identity (protects against lost writes), and an embedded checksum (protects the checksum structure)]
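A sketch of such a checksum structure, with all three fields. The field widths and CRC32 are illustrative, not the real 64-byte layout:

```python
import struct
import zlib

def make_checksum_area(data, inode, fbn):
    """Block checksum + block identity (inode, FS block number) + embedded checksum."""
    body = struct.pack("<IQQ", zlib.crc32(data), inode, fbn)
    return body + struct.pack("<I", zlib.crc32(body))  # embedded checksum last

def verify(data, area, inode, fbn):
    body, (embedded,) = area[:-4], struct.unpack("<I", area[-4:])
    if zlib.crc32(body) != embedded:
        return "embedded checksum mismatch"  # checksum structure itself corrupt
    data_ck, ino, blk = struct.unpack("<IQQ", body)
    if data_ck != zlib.crc32(data):
        return "checksum mismatch"           # bit corruption or torn write
    if (ino, blk) != (inode, fbn):
        return "identity mismatch"           # lost or misdirected write
    return "ok"

area = make_checksum_area(b"new data", inode=42, fbn=7)
assert verify(b"new data", area, inode=42, fbn=7) == "ok"
# A stale or misdirected block is self-consistent, but its stored identity
# does not match what the file system expects at this location:
assert verify(b"new data", area, inode=42, fbn=8) == "identity mismatch"
```

The key point: in a no-overwrite file system, a lost write leaves a stale block whose stored identity differs from the identity the file system expects at read time, so the mismatch is detectable even though the stale block's own checksum is valid.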
Summary: Data Corruption Classes

- Checksum mismatch
  - Causes: bit corruption, torn/misdirected write
  - Detection: block checksum mismatch
- Identity mismatch
  - Causes: lost or misdirected write
  - Detection: block identity mismatch
- Parity mismatch
  - Causes: lost write, bad parity
  - Detection: RAID parity computation mismatch
Talk Outline

- Introduction
- Background
- Results
  - System architecture
  - Overall results
  - Checksum mismatch results
- Lessons and Conclusion
NetApp® System

[Diagram: client interface (NFS) → WAFL® file system → RAID layer → storage layer → disk drives, with the Autosupport database collecting error reports]

1. Storage layer: stores and verifies checksums; detects checksum mismatches (bit corruptions, torn writes)
2. RAID layer: parity generation, reconstruction on failure, and data scrubbing (read blocks, verify parity); detects parity inconsistencies (lost or misdirected writes, parity miscalculations)
3. WAFL® file system: stores and verifies block identity (inode X, offset Y); detects identity discrepancies (lost or misdirected writes)
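The RAID layer's scrub check can be sketched as an XOR-parity recomputation. This is a minimal single-parity model for illustration, not the production RAID code:

```python
from functools import reduce

def xor_parity(blocks):
    """Compute byte-wise XOR parity across the data blocks of a stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def scrub_stripe(data_blocks, parity):
    """Read every block in the stripe, recompute parity, flag inconsistency."""
    return xor_parity(data_blocks) == parity

A, B, C = b"\x01" * 4, b"\x02" * 4, b"\x03" * 4
parity = xor_parity([A, B, C])
assert scrub_stripe([A, B, C], parity)       # healthy stripe

# Lost write: C -> C' is acknowledged but never persisted, while parity is
# updated to P(ABC'); the scrub catches the resulting inconsistency.
new_parity = xor_parity([A, B, b"\x07" * 4])
assert not scrub_stripe([A, B, C], new_parity)
```

This is how a scrub surfaces lost or misdirected writes that per-block checksums alone cannot see: the stale block is individually valid, but the stripe as a whole no longer adds up.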
Overall Numbers

- What percentage of disks is affected by the different kinds of corruption?
Overall Numbers (% of disks affected in 17 months of use)

Corruption type           | Nearline (SATA) | Enterprise (FC)
1. Checksum mismatches    | 0.661%          | 0.059%
2. Parity inconsistencies | 0.147%          | 0.017%
3. Identity discrepancies | 0.042%          | 0.006%

- Corruption affects ~10 times fewer disks than latent sector errors
- A higher percentage of nearline disks is affected: an order of magnitude more than enterprise disks
- Bit corruptions and torn writes affect more disks than lost or misdirected writes
Checksum Mismatch (CM) Analysis

1. Factors: disk class (nearline/enterprise), disk model, disk age, disk size (capacity), workload
2. Characteristics: CMs per corrupt disk, independence, spatial locality, temporal locality
3. Correlations with other errors: not-ready conditions, latent sector errors, system resets
4. Request type: scrubs vs. FS reads, etc.
Checksum Mismatch (CM) Analysis

1. Factors: disk class, disk model, disk age, disk size, workload
2. Characteristics
3. Correlations with other errors
4. Request type
Factors

- Do disk class, model, or age affect the development of checksum mismatches?
  - Disk class: nearline (SATA) or enterprise (FC)
  - Disk model: a specific disk drive product (say, vendor V's product P of capacity 80 GB)
  - Disk age: time in the field since ship date
- Can we use these factors to determine corruption handling policies or mechanisms?
  - Example: aggressive scrubbing for some disks
Class, Model, Age – Nearline

[Chart: % of disks with at least 1 CM vs. disk age (0–18 months), one curve per nearline disk model, y-axis 0–4%]

- The fraction of disks affected varies across models: from 0.27% to 3.51%
  - More than 3% for 4 out of 6 models
- The response to age also varies
Class, Model, Age – Enterprise

[Chart: % of disks with at least 1 CM vs. disk age (0–18 months), one curve per enterprise disk model, y-axis 0–0.18%]

- The fraction of disks affected varies across models: from 0% to 0.17%
  - All lower than the lowest nearline model (0.27%)
- The response to age also varies
Factors – Summary

- Class and model matter
  - Nearline disks require greater attention
- The effect of age is unclear
  - Cannot use age-specific corruption handling
Checksum Mismatch (CM) Analysis

1. Factors
2. Characteristics: CMs per corrupt disk, independence, spatial locality, temporal locality
3. Correlations with other errors
4. Request type
Checksum Mismatches per Corrupt Disk

- Corrupt disk: a disk with at least 1 checksum mismatch (CM)
- How many CMs does a corrupt disk have?
- Should we "fail out" disks when one corruption is detected?
CMs per Corrupt Disk – Nearline

[Chart: cumulative % of corrupt disks with ≤ X CMs, X from 1 to 1,000]

- The number of CMs per corrupt disk is low
  - 50% of corrupt disks have ≤ 2 CMs
  - 90% of corrupt disks have ≤ 100 CMs
- Anomaly: model E-1 develops many CMs
CMs per Corrupt Disk – Enterprise

[Chart: cumulative % of corrupt disks with ≤ X CMs, X from 1 to 1,000]

- The number of CMs per corrupt disk is higher
  - 50% of corrupt disks have ≤ 10 CMs (vs. 2 for nearline)
  - 90% of corrupt disks have ≤ 200 CMs (vs. 100 for nearline)
CMs per Corrupt Disk – Summary

- Class and model matter
- Fewer enterprise disks have CMs, but those that do have more of them
  - "Fail out" enterprise disks on the first CM
- Corrupt nearline disks develop fewer CMs
  - There can be anomalies (disk model E-1)
Other Characteristics

- Very high spatial locality
  - When multiple checksum mismatches occur, they are often for consecutive disk blocks
- High temporal locality
- Not independent
  - Across different disks in the same system
  - The defect may be in common hardware components (example: a shelf controller)
Checksum Mismatch (CM) Analysis

1. Factors
2. Characteristics
3. Correlations with other errors
4. Request type: scrubs vs. FS reads, etc.
Request Type

- What types of disk requests detect checksum mismatches?
- Is data scrubbing useful?
Request Type

[Chart: % of CMs discovered by each request type, per disk model]

- Data scrubbing finds most CMs
  - Nearline: 49%
  - Enterprise: 73%
- Reconstruction finds CMs
  - Nearline: 9%
  - Enterprise: 4%
Request Type – Summary

- Data scrubbing appears to be very useful
  - A study of scrub rates and workload is needed
- Mismatches are found during reconstruction
  - This means data loss unless double-disk-failure protection is used [Alvarez97, Blaum94, Corbett04, Park95, Hafner05]
  - More aggressive scrubbing may be needed
Interesting Behavior

- Do system designers need to factor in any abnormal behavior?
Block numbers are not created equal!

Disk model: E-1

[Chart: number of disks with a CM at block X, across the block number space]

- Typically, each block number has 1 disk where it is corrupt
- A series of block numbers is corrupt in many disks
  - A block-number-specific bug?
Talk Outline

- Introduction
- Background
- Results
- Lessons
- Conclusion
Lessons

- Data corruption does occur
  - Even rare errors like lost writes do occur
  - Corruption handling mechanisms are essential
- Very few enterprise disks develop corruption
  - "Fail out" these disks on the first corruption detected
- High spatial locality
  - Spread out redundant data within the same disk
Lessons (contd.)

- Temporal locality; consecutive blocks affected
  - Perhaps corruption occurs during the same write operation
  - Write redundant data with separate disk requests, spaced out over time
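The lesson above can be sketched as a write policy: issue each redundant copy as its own request, optionally spaced in time. The dict-backed disk and the delay parameter are illustrative, not a real driver interface:

```python
import time

def write_redundant(disk, data, addrs, delay_s=0.0):
    """Write each redundant copy with a separate request, spaced out in time,
    so one faulty write operation cannot corrupt all copies at once."""
    for addr in addrs:
        disk[addr] = data       # separate request per copy
        time.sleep(delay_s)     # temporal spacing between copies

disk = {}
# Copies at widely separated addresses (spatial-locality lesson), written
# 10 ms apart (temporal-locality lesson).
write_redundant(disk, b"payload", addrs=[0, 1_000_000], delay_s=0.01)
assert disk[0] == disk[1_000_000] == b"payload"
```

Separating the requests trades a little latency for independence: a single bad write operation, which the data suggests corrupts consecutive blocks at one moment in time, can then damage at most one copy.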
Conclusion

- Our analysis
  - First large-scale study of data corruption
  - Corruptions detected by NetApp production systems
- Data corruptions do occur
  - They affect ~10 times fewer disks than latent sector errors
  - Nearline (SATA) disks are the most affected
  - Corruption handling mechanisms are essential
- Data corruption characteristics
  - Depend on disk class and disk model
  - Not independent (both within a disk and within a system)
  - High spatial and temporal locality
  - May occur at specific block numbers
Thank You!
Advanced Technology Group (ATG) NetApp, Inc http://www.netapp.com/company/research/
Advanced Systems Lab (ADSL) University of Wisconsin-Madison http://www.cs.wisc.edu/adsl
Department of Computer Science University of Toronto http://www.cs.toronto.edu/~bianca