An Analysis of Data Corruption in the Storage Stack

Garth Goodson, NetApp, Inc.
Lakshmi Bairavasundaram, University of Wisconsin-Madison
Bianca Schroeder, University of Toronto
Andrea C. Arpaci-Dusseau, University of Wisconsin-Madison
Remzi H. Arpaci-Dusseau, University of Wisconsin-Madison

Storage Developer Conference 2008

Corruption Anecdote

• There is much anecdotal evidence of data corruption
  • E.g., a corrupted photo stored on an author's laptop (shown on slide)

• System designers know of similar occurrences
• Data protection is often based on anecdotes
• Anecdotes: interesting, but not enough for system design
• A more rigorous understanding is needed

Our Analysis

• First large-scale study of data corruption
  • 1.53 million disks in 1000s of NetApp systems
• Time period
  • 41 months (Jan 2004 – Jun 2007)
• Corruption detection
  • Using various data protection techniques
• Data from NetApp Autosupport Database
  • Also used in latent sector error [Bairavasundaram07] and disk and storage failure [Jiang08] studies

Questions we had about corruption

• What kinds of corruption occur, and how often?
• Does disk class matter?
  • Expensive enterprise (FC) disks versus cheaper nearline (SATA) disks
• Does disk drive family/product matter?
• Are corruption instances independent?
• Do corruption instances have spatial locality?

Talk Outline

• Introduction
• Background
  • Data corruption
  • Protection techniques
• Results
• Lessons
• Conclusion

Should we care about disk errors?

• Joint UIUC/NetApp system failure analysis
  • 44 months; 39,000 systems; 1.8 million disks

(Pie charts: breakdown of system failures into disk failures, physical interconnect failures, protocol failures, and performance failures, for high-end systems and nearline systems.)

*W. Jiang, et al., "Are disks the dominant contributor of storage failures?", USENIX FAST, 2008

Disk system failure rates

• From the failure rate pie charts:
  • High-end: 29% of system errors are disk errors
  • Nearline: 57% of system errors are disk errors
• What's going on?
  • Software is generally the same
  • Hardware platforms are somewhat different
  • But the real difference is in the type of disk in use
    • i.e., Fibre Channel vs. SATA

Types of disk errors

• Operational/component failures
  • Fundamental problem with the drive hardware
    • Bad servo, head, electronics, etc.
  • Firmware bugs
    • Failure to flush cache on power-down, etc.
• Partial failures
  • Only affect a small subset of disk sectors
  • Errors during writing
    • Bad media, high-fly write, vibration, etc.
  • Errors during reading (write was successful)
    • Scratches, corrosion, thermal asperities, etc.

Unreported Disk Errors

• Operational failures are easy to detect
  • Usually fail-stop; something stops working
• Latent sector errors are reported via SCSI errors
  • Occur when a disk sector is read
• What about errors that go undetected?
  • Observed errors not corrected by the disk's ECC
  • Cannot correct them unless they are detected first
  • Result is usually some form of corruption

Data Corruption

• Data stored on a disk block is incorrect
• Many sources
  • Software bugs
    • Software RAID, device drivers, etc.
  • Firmware bugs
    • Disk drives, shelf controllers, adapters, etc.
• Corruption is silent
  • Not reported by the disk drive
• Could have greater impact than other errors

Forms of Data Corruption

• Bit corruption
  • Contents of an existing disk block are modified
  • Data being written to a disk block is corrupted
• Lost writes
  • Data not written, but completion is reported
• Misdirected writes
  • Data is written to the wrong disk block
• Torn writes
  • Data partially written, but completion is reported

In all cases, the data passes the disk's internal ECC

Detecting data corruption

• Basic idea:
  1. Generate a checksum of the data (64B per 4KB FS block)
  2. Store the checksum along with the data
  3. Verify the checksum whenever reading data
• A simple checksum has limited protection
  • Detects bit corruption and torn (partial) writes
  • No protection against lost or misdirected writes
    • Since the data was not overwritten
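The three-step idea above can be sketched in a few lines of Python. This is an illustrative toy, not NetApp's implementation: the SHA-256 hash and the in-memory `store` dict are stand-ins for a real checksum algorithm and on-disk layout.

```python
import hashlib

BLOCK_SIZE = 4096  # 4KB file system block

store = {}  # block_number -> (data, checksum); stand-in for a disk

def checksum(data: bytes) -> bytes:
    # Any strong hash works for illustration; real systems typically use
    # cheaper CRC or Fletcher-style checksums.
    return hashlib.sha256(data).digest()

def write_block(blkno: int, data: bytes) -> None:
    # Step 1 and 2: generate the checksum and store it with the data.
    assert len(data) == BLOCK_SIZE
    store[blkno] = (data, checksum(data))

def read_block(blkno: int) -> bytes:
    # Step 3: verify the checksum on every read.
    data, stored_cksum = store[blkno]
    if checksum(data) != stored_cksum:
        raise IOError(f"checksum mismatch on block {blkno}")
    return data
```

Note how this catches a bit flip (the data no longer matches its checksum) but not a lost write: after a lost write, the old data and its old checksum are still mutually consistent, so verification succeeds.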

Checksum problems: lost writes

(Diagram: a RAID stripe with data blocks A, B, C and parity P, each protected by its own checksum, cksum(A) through cksum(P). An overwrite of C with C' is lost: parity is updated to P(ABC'), but the old data C and cksum(C) remain on disk. A later read of the file expects ABC'.)

• The checksum verifies, so corrupt (stale) data is returned: C instead of C'

Write verify: a partial solution

• Attempts to solve the lost write problem
• Costly solution; one expects good protection in return
• Procedure:
  1. Write data to disk
  2. Read it back to verify
  3. If a lost write is detected, write again or remap to a new location
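The read-back procedure can be sketched as follows. This is a toy model: the `lost_writes` set simulates a drive that silently drops a write while still reporting success.

```python
disk = {}            # block_number -> bytes; stand-in for a disk
lost_writes = set()  # block numbers whose next write is silently dropped

def disk_write(blkno: int, data: bytes) -> None:
    # The drive reports success even when the write is lost.
    if blkno in lost_writes:
        lost_writes.discard(blkno)  # treat the fault as transient
        return
    disk[blkno] = data

def write_verify(blkno: int, data: bytes) -> None:
    disk_write(blkno, data)
    if disk.get(blkno) != data:   # read back and compare
        disk_write(blkno, data)   # lost write detected: write again
        assert disk.get(blkno) == data
```

The cost is clear from the sketch: every write incurs an extra read, which is why write-verify is rarely enabled by default.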

(Diagram: an overwrite of C with C' is lost; the read-back returns C, the mismatch is detected, and C' is written again, successfully.)

Lost write protection: a better way

• Need logical information pertaining to the block's identity
  • Something external to the data being stored
• Store the inode number and FS block number within the checksum structure
  • Verified by the file system at read time
• We also add a checksum of the checksum structure itself

(Figure: a 4KB file system block plus a 64B checksum structure, laid out across eight 520-byte sectors. The checksum structure contains:)

• Block checksum: protects the 4KB FS block
• Block identity: protects against lost writes
• Embedded checksum: protects the checksum structure itself
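A minimal sketch of such an identity-protected checksum structure. Field names and sizes here are assumptions for illustration, not the WAFL on-disk format.

```python
import hashlib
import struct

def make_checksum_struct(data: bytes, inode: int, fbn: int) -> bytes:
    """Build a checksum structure for one 4KB block (illustrative layout)."""
    block_cksum = hashlib.sha256(data).digest()[:16]  # protects the data block
    identity = struct.pack("<QQ", inode, fbn)         # protects against lost writes
    body = block_cksum + identity
    embedded = hashlib.sha256(body).digest()[:8]      # protects the structure itself
    return body + embedded

def verify(data: bytes, inode: int, fbn: int, stored: bytes) -> bool:
    """Recompute the structure from what the FS expects and compare."""
    return stored == make_checksum_struct(data, inode, fbn)
```

In a no-overwrite file system, a lost write leaves the previous block's contents and identity in place, so the read-time comparison against the expected (inode, FS block number) fails and the lost write is caught.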

Summary: Data Corruption Classes

• Checksum mismatch
  • Causes: bit corruption, torn/misdirected write
  • Detection: block checksum mismatch
• Identity mismatch
  • Causes: lost or misdirected write
  • Detection: block identity mismatch
• Parity mismatch
  • Causes: lost write, bad parity
  • Detection: RAID parity computation mismatch

Talk Outline

• Introduction
• Background
• Results
  • System architecture
  • Overall results
  • Checksum mismatch results
• Lessons and Conclusion

NetApp® System

(Architecture diagram; detection layers are numbered:)

Client interface (NFS)

3. WAFL® file system
   • Store, verify block identity (inode X, offset Y)
   • Detect identity discrepancies: lost or misdirected writes

2. RAID layer
   • Parity generation; reconstruction on failure
   • Read blocks, verify parity
   • Detect parity inconsistencies: lost or misdirected writes, parity miscalculations

1. Storage layer
   • Store, verify checksum
   • Detect checksum mismatches: bit corruptions, torn writes

Disk drives (errors logged to the Autosupport database)
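The RAID layer's parity verification (layer 2 above) amounts to XOR-ing the data blocks of a stripe and comparing the result against the stored parity. A minimal sketch, assuming single-parity RAID-4/5 style stripes:

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    acc = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            acc[i] ^= b
    return bytes(acc)

def parity_consistent(data_blocks, parity_block) -> bool:
    # A lost data write or a lost parity write shows up as a mismatch here.
    return xor_blocks(data_blocks) == parity_block
```

This is exactly the check a parity scrub performs across every stripe; a mismatch tells you the stripe is inconsistent, though not by itself which block is stale.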

Overall Numbers

What percentage of disks are affected by the different kinds of corruption?

Overall Numbers (% disks affected in 17 months of use)

Corruption type             Nearline (SATA)   Enterprise (FC)
Checksum mismatches (1)     0.661%            0.059%
Parity inconsistencies (2)  0.147%            0.017%
Identity discrepancies (3)  0.042%            0.006%

• ~10 times fewer disks affected than by latent sector errors
• Higher % of nearline disks affected
  • An order of magnitude more than enterprise disks
• Bit corruptions or torn writes affect more disks than lost or misdirected writes

Checksum Mismatch (CM) Analysis

1. Factors
   • Disk class (Nearline / Enterprise)
   • Disk model
   • Disk age
   • Disk size (capacity)
   • Workload
2. Characteristics
   • CMs per corrupt disk
   • Independence
   • Spatial locality
   • Temporal locality
3. Correlations with other errors
   • Not-ready conditions
   • Latent sector errors
   • System resets
4. Request type
   • Scrubs vs. FS reads, etc.

Checksum Mismatch (CM) Analysis

1. Factors
   • Disk class
   • Disk model
   • Disk age
   • Disk size
   • Workload
2. Characteristics
3. Correlations with other errors
4. Request type

Factors

• Do disk class, model, or age affect the development of checksum mismatches?
  • Disk class: Nearline (SATA) or Enterprise (FC)
  • Disk model: specific disk drive product (say, vendor V's disk product P of capacity 80 GB)
  • Disk age: time in the field since ship date

• Can we use these factors to determine corruption-handling policies or mechanisms?
  • Ex: aggressive scrubbing for some disks

Class, Model, Age – Nearline

• Fraction of disks affected varies across models
  • From 0.27% to 3.51%
  • More than 3% for 4 out of 6 models
• Response to age also varies

(Plot: % of disks with at least 1 CM vs. disk age, 0–18 months, one curve per nearline disk model.)

Class, Model, Age – Enterprise

• Fraction of disks affected varies across models
  • From 0% to 0.17%
  • All less than the lowest nearline model (0.27%)
• Response to age also varies

(Plot: % of disks with at least 1 CM vs. disk age, 0–18 months, one curve per enterprise disk model.)

Factors – Summary

• Class and model matter
  • Nearline disks require greater attention

• Effect of age is unclear
  • Cannot use age-specific corruption handling

Checksum Mismatch (CM) Analysis

1. Factors
2. Characteristics
   • CMs per corrupt disk
   • Independence
   • Spatial locality
   • Temporal locality
3. Correlations with other errors
4. Request type

Checksum Mismatches per Corrupt Disk

• Corrupt disk: a disk with at least 1 checksum mismatch (CM)

• How many CMs does a corrupt disk have?

• Should we "fail-out" disks when one corruption is detected?

CMs per Corrupt Disk – Nearline

• CMs per corrupt disk is low
  • 50% of corrupt disks have ≤ 2 CMs
  • 90% of corrupt disks have ≤ 100 CMs
• Anomaly: model E-1
  • Develops many CMs

(Plot: cumulative % of corrupt disks with ≤ X checksum mismatches, X from 1 to 1K, per nearline disk model.)

CMs per Corrupt Disk – Enterprise

• CMs per corrupt disk is higher
  • 50% of corrupt disks have ≤ 10 CMs (2 for nearline)
  • 90% of corrupt disks have ≤ 200 CMs (100 for nearline)

(Plot: cumulative % of corrupt disks with ≤ X checksum mismatches, X from 1 to 1K, per enterprise disk model.)

CMs per Corrupt Disk – Summary

• Class and model matter

• Fewer enterprise disks have CMs, but corrupt disks have more CMs
  • Fail-out enterprise disks on first CM

• Corrupt nearline disks develop fewer CMs
  • There can be anomalies (disk model E-1)

Other Characteristics

• Very high spatial locality
  • When multiple checksum mismatches occur, they are often for consecutive disk blocks

• High temporal locality

• Not independent
  • Over different disks in the same system
  • Defect may be in common hardware components (example: shelf controller)

Checksum Mismatch (CM) Analysis

1. Factors
2. Characteristics
3. Correlations with other errors
4. Request type
   • Scrubs vs. FS reads, etc.

Request Type

• What types of disk requests detect checksum mismatches?

• Is data scrubbing useful?

Request Type

• Data scrubbing finds most CMs
  • Nearline: 49%
  • Enterprise: 73%
• Reconstruction also finds CMs
  • Nearline: 9%
  • Enterprise: 4%

(Bar chart: % of CMs discovered by each request type, per disk model.)

Request Type – Summary

• Data scrubbing appears to be very useful
  • A study of scrub rates and workload is needed

• Mismatches found during reconstruction
  • Can cause data loss without double disk failure protection [Alvarez97, Blaum94, Corbett04, Park95, Hafner05]
  • More aggressive scrubbing may be needed
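A scrub pass is conceptually just a proactive sweep that reads every block and verifies its checksum before a client read or a reconstruction depends on it. A minimal sketch, with dictionaries standing in for the disk and the checksum store:

```python
import hashlib

def scrub(disk, checksums):
    """disk: blkno -> bytes; checksums: blkno -> stored digest.
    Returns the block numbers whose contents no longer match their checksum."""
    mismatches = []
    for blkno, data in disk.items():
        if hashlib.sha256(data).digest() != checksums[blkno]:
            mismatches.append(blkno)
    return mismatches
```

Finding a mismatch during a scheduled scrub is benign (the block can be reconstructed from redundancy); finding the same mismatch during a reconstruction, after a disk has already failed, is what causes data loss.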

Interesting Behavior

Do system designers need to factor in any abnormal behavior?

Block numbers are not created equal!

Disk Model: E-1

• Typically, each block number has 1 disk where it is corrupt
• A series of block numbers are corrupt in many disks
  • A block-number-specific bug?

(Plot: number of disks with a CM at block X, across the block number space.)

Talk Outline

• Introduction
• Background
• Results
• Lessons
• Conclusion

Lessons

• Data corruption does occur
  • Even rare errors like lost writes do occur
  • Corruption-handling mechanisms are essential
• Very few enterprise disks develop corruption
  • "Fail-out" these disks on first corruption detection
• High spatial locality
  • Spread out redundant data within the same disk

Lessons (contd.)

• Temporal locality; consecutive blocks affected
  • Maybe corruption occurs during the same write operation
  • Write redundant data with separate disk requests, spaced out over time

Conclusion

• Our analysis
  • First large-scale study of data corruption
  • Corruptions detected by NetApp production systems
• Data corruptions do occur
  • Affect ~10 times fewer disks than latent sector errors
  • Nearline (SATA) disks are most affected
  • Corruption-handling mechanisms are essential
• Data corruption characteristics
  • Depend on disk class and disk model
  • Not independent (both within a disk and within a system)
  • High spatial and temporal locality
  • May occur at specific block numbers

Thank You!

Advanced Technology Group (ATG) NetApp, Inc http://www.netapp.com/company/research/

Advanced Systems Lab (ADSL) University of Wisconsin-Madison http://www.cs.wisc.edu/adsl

Department of Computer Science, University of Toronto http://www.cs.toronto.edu/~bianca
