An Analysis of Data Corruption in the Storage Stack

Garth Goodson, NetApp, Inc.
Lakshmi Bairavasundaram, University of Wisconsin-Madison
Bianca Schroeder, University of Toronto
Andrea C. Arpaci-Dusseau, University of Wisconsin-Madison
Remzi H. Arpaci-Dusseau, University of Wisconsin-Madison

Storage Developer Conference 2008

Corruption Anecdote

• There is much anecdotal evidence of data corruption
  • E.g., a corrupted photo stored on an author's laptop (shown on slide)

• System designers know of similar occurrences
• Data protection is often based on anecdotes
• Anecdotes: interesting, but not enough for system design
• A more rigorous understanding is needed

Our Analysis

• First large-scale study of data corruption
  • 1.53 million disks in 1000s of NetApp systems
• Time period
  • 41 months (Jan 2004 – Jun 2007)
• Corruption detection
  • Using various data protection techniques
• Data from NetApp Autosupport Database
  • Also used in latent sector error [Bairavasundaram07] and disk and storage failure [Jiang08] studies

Questions we had about corruption

• What kinds of corruption occur, and how often?
• Does disk class matter?
  • Expensive enterprise (FC) disks versus cheaper nearline (SATA) disks
• Does disk drive family/product matter?
• Are corruption instances independent?
• Do corruption instances have spatial locality?

Talk Outline

• Introduction
• Background
  • Data corruption
  • Protection techniques
• Results
• Lessons
• Conclusion

Should we care about disk errors?

• Joint UIUC/NetApp system failure analysis
  • 44 months; 39,000 systems; 1.8 million disks

(Pie charts: breakdown of system failures into disk failures, physical interconnect failures, protocol failures, and performance failures, for high-end systems and nearline systems.)

*W. Jiang, et al., "Are disks the dominant contributor of storage failures?", USENIX FAST, 2008

Disk system failure rates

• From the failure rate pie charts:
  • High-end: 29% of system errors are disk errors
  • Nearline: 57% of system errors are disk errors
• What's going on?
  • Software is generally the same
  • Hardware platforms are somewhat different
  • But the real difference is in the type of disk in use
    • i.e., Fibre Channel vs. SATA

Types of disk errors

• Operational/component failures
  • Fundamental problem with the drive hardware
    • Bad servo, head, electronics, etc.
  • Firmware bugs
    • Failure to flush cache on power-down, etc.
• Partial failures
  • Only affect a small subset of disk sectors
  • Errors during writing
    • Bad media, high-fly write, vibration, etc.
  • Errors during reading (write was successful)
    • Scratches, corrosion, thermal asperities, etc.

Unreported Disk Errors

• Operational failures are easy to detect
  • Usually fail-stop; something stops working
• Latent sector errors are reported via SCSI errors
  • Occur when a disk sector is read
• What about errors that go undetected?
  • Observed errors not corrected by the disk's ECC
  • Cannot correct them unless they are detected first
  • Result is usually some form of corruption

Data Corruption

• Data stored on a disk block is incorrect
• Many sources
  • Software bugs
    • Software RAID, device drivers, etc.
  • Firmware bugs
    • Disk drives, shelf controllers, adapters, etc.
• Corruption is silent
  • Not reported by the disk drive
• Could have greater impact than other errors

Forms of Data Corruption

• Bit corruption
  • Contents of an existing disk block are modified
  • Data being written to a disk block is corrupted
• Lost writes
  • Data not written, but completion is reported
• Misdirected writes
  • Data is written to the wrong disk block
• Torn writes
  • Data partially written, but completion is reported

In all cases, the data passes the disk's internal ECC

Detecting data corruption

• Basic idea:
  1. Generate a checksum of the data (64B per 4KB FS block)
  2. Store the checksum along with the data
  3. Verify the checksum whenever reading data
• A simple checksum has limited protection
  • Detects bit corruption and torn (partial) writes
  • No protection against lost or misdirected writes
    • Since the data was not overwritten
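The three-step idea above can be sketched in a few lines of Python. This is an illustrative toy, not NetApp's implementation: the SHA-256 hash and the in-memory `store` dict are stand-ins for a real checksum algorithm and on-disk layout.

```python
import hashlib

BLOCK_SIZE = 4096  # 4KB file system block

store = {}  # block_number -> (data, checksum); stand-in for a disk

def checksum(data: bytes) -> bytes:
    # Any strong hash works for illustration; real systems typically use
    # cheaper CRC or Fletcher-style checksums.
    return hashlib.sha256(data).digest()

def write_block(blkno: int, data: bytes) -> None:
    # Step 1 and 2: generate the checksum and store it with the data.
    assert len(data) == BLOCK_SIZE
    store[blkno] = (data, checksum(data))

def read_block(blkno: int) -> bytes:
    # Step 3: verify the checksum on every read.
    data, stored_cksum = store[blkno]
    if checksum(data) != stored_cksum:
        raise IOError(f"checksum mismatch on block {blkno}")
    return data
```

Note how this catches a bit flip (the data no longer matches its checksum) but not a lost write: after a lost write, the old data and its old checksum are still mutually consistent, so verification succeeds.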

Checksum problems: lost writes

(Diagram: a RAID stripe with data blocks A, B, C and parity P, each protected by its own checksum, cksum(A) through cksum(P). An overwrite of C with C' is lost: parity is updated to P(ABC'), but the old data C and cksum(C) remain on disk. A later read of the file expects ABC'.)

• The checksum verifies, so corrupt (stale) data is returned: C instead of C'

Write verify: a partial solution

• Attempts to solve the lost write problem
• Costly solution; one expects good protection in return
• Procedure:
  1. Write data to disk
  2. Read it back to verify
  3. If a lost write is detected, write again or remap to a new location
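The read-back procedure can be sketched as follows. This is a toy model: the `lost_writes` set simulates a drive that silently drops a write while still reporting success.

```python
disk = {}            # block_number -> bytes; stand-in for a disk
lost_writes = set()  # block numbers whose next write is silently dropped

def disk_write(blkno: int, data: bytes) -> None:
    # The drive reports success even when the write is lost.
    if blkno in lost_writes:
        lost_writes.discard(blkno)  # treat the fault as transient
        return
    disk[blkno] = data

def write_verify(blkno: int, data: bytes) -> None:
    disk_write(blkno, data)
    if disk.get(blkno) != data:   # read back and compare
        disk_write(blkno, data)   # lost write detected: write again
        assert disk.get(blkno) == data
```

The cost is clear from the sketch: every write incurs an extra read, which is why write-verify is rarely enabled by default.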

(Diagram: an overwrite of C with C' is lost; the read-back returns C, the mismatch is detected, and C' is written again, successfully.)

Lost write protection: a better way

• Need logical information pertaining to the block's identity
  • Something external to the data being stored
• Store the inode number and FS block number within the checksum structure
  • Verified by the file system at read time
• We also add a checksum of the checksum structure itself

(Figure: a 4KB file system block plus a 64B checksum structure, laid out across eight 520-byte sectors. The checksum structure contains:)

• Block checksum: protects the 4KB FS block
• Block identity: protects against lost writes
• Embedded checksum: protects the checksum structure itself
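A minimal sketch of such an identity-protected checksum structure. Field names and sizes here are assumptions for illustration, not the WAFL on-disk format.

```python
import hashlib
import struct

def make_checksum_struct(data: bytes, inode: int, fbn: int) -> bytes:
    """Build a checksum structure for one 4KB block (illustrative layout)."""
    block_cksum = hashlib.sha256(data).digest()[:16]  # protects the data block
    identity = struct.pack("<QQ", inode, fbn)         # protects against lost writes
    body = block_cksum + identity
    embedded = hashlib.sha256(body).digest()[:8]      # protects the structure itself
    return body + embedded

def verify(data: bytes, inode: int, fbn: int, stored: bytes) -> bool:
    """Recompute the structure from what the FS expects and compare."""
    return stored == make_checksum_struct(data, inode, fbn)
```

In a no-overwrite file system, a lost write leaves the previous block's contents and identity in place, so the read-time comparison against the expected (inode, FS block number) fails and the lost write is caught.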

Summary: Data Corruption Classes

• Checksum mismatch
  • Causes: bit corruption, torn/misdirected write
  • Detection: block checksum mismatch
• Identity mismatch
  • Causes: lost or misdirected write
  • Detection: block identity mismatch
• Parity mismatch
  • Causes: lost write, bad parity
  • Detection: RAID parity computation mismatch

Talk Outline

• Introduction
• Background
• Results
  • System architecture
  • Overall results
  • Checksum mismatch results
• Lessons and Conclusion

NetApp® System

(Architecture diagram; detection layers are numbered:)

Client interface (NFS)

3. WAFL® file system
   • Store, verify block identity (inode X, offset Y)
   • Detect identity discrepancies: lost or misdirected writes

2. RAID layer
   • Parity generation; reconstruction on failure
   • Read blocks, verify parity
   • Detect parity inconsistencies: lost or misdirected writes, parity miscalculations

1. Storage layer
   • Store, verify checksum
   • Detect checksum mismatches: bit corruptions, torn writes

Disk drives (errors logged to the Autosupport database)
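The RAID layer's parity verification (layer 2 above) amounts to XOR-ing the data blocks of a stripe and comparing the result against the stored parity. A minimal sketch, assuming single-parity RAID-4/5 style stripes:

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    acc = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            acc[i] ^= b
    return bytes(acc)

def parity_consistent(data_blocks, parity_block) -> bool:
    # A lost data write or a lost parity write shows up as a mismatch here.
    return xor_blocks(data_blocks) == parity_block
```

This is exactly the check a parity scrub performs across every stripe; a mismatch tells you the stripe is inconsistent, though not by itself which block is stale.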

Overall Numbers

What percentage of disks are affected by the different kinds of corruption?

Overall Numbers (% disks affected in 17 months of use)

Corruption type             Nearline (SATA)   Enterprise (FC)
Checksum mismatches (1)     0.661%            0.059%
Parity inconsistencies (2)  0.147%            0.017%
Identity discrepancies (3)  0.042%            0.006%

• ~10 times fewer disks affected than by latent sector errors
• Higher % of nearline disks affected
  • An order of magnitude more than enterprise disks
• Bit corruptions or torn writes affect more disks than lost or misdirected writes

Checksum Mismatch (CM) Analysis

1. Factors
   • Disk class (Nearline / Enterprise)
   • Disk model
   • Disk age
   • Disk size (capacity)
   • Workload
2. Characteristics
   • CMs per corrupt disk
   • Independence
   • Spatial locality
   • Temporal locality
3. Correlations with other errors
   • Not-ready conditions
   • Latent sector errors
   • System resets
4. Request type
   • Scrubs vs. FS reads, etc.

Checksum Mismatch (CM) Analysis

1. Factors
   • Disk class
   • Disk model
   • Disk age
   • Disk size
   • Workload
2. Characteristics
3. Correlations with other errors
4. Request type

Factors

• Do disk class, model, or age affect the development of checksum mismatches?
  • Disk class: Nearline (SATA) or Enterprise (FC)
  • Disk model: specific disk drive product (say, vendor V's disk product P of capacity 80 GB)
  • Disk age: time in the field since ship date

• Can we use these factors to determine corruption-handling policies or mechanisms?
  • Ex: aggressive scrubbing for some disks

Class, Model, Age – Nearline

• Fraction of disks affected varies across models
  • From 0.27% to 3.51%
  • More than 3% for 4 out of 6 models
• Response to age also varies

(Plot: % of disks with at least 1 CM vs. disk age, 0–18 months, one curve per nearline disk model.)

Class, Model, Age – Enterprise

• Fraction of disks affected varies across models
  • From 0% to 0.17%
  • All less than the lowest nearline model (0.27%)
• Response to age also varies

(Plot: % of disks with at least 1 CM vs. disk age, 0–18 months, one curve per enterprise disk model.)

Factors – Summary

• Class and model matter
  • Nearline disks require greater attention

• Effect of age is unclear
  • Cannot use age-specific corruption handling

Checksum Mismatch (CM) Analysis

1. Factors
2. Characteristics
   • CMs per corrupt disk
   • Independence
   • Spatial locality
   • Temporal locality
3. Correlations with other errors
4. Request type

Checksum Mismatches per Corrupt Disk

• Corrupt disk: a disk with at least 1 checksum mismatch (CM)

• How many CMs does a corrupt disk have?

• Should we "fail-out" disks when one corruption is detected?

CMs per Corrupt Disk – Nearline

• CMs per corrupt disk is low
  • 50% of corrupt disks have ≤ 2 CMs
  • 90% of corrupt disks have ≤ 100 CMs
• Anomaly: model E-1
  • Develops many CMs

(Plot: cumulative % of corrupt disks with ≤ X checksum mismatches, X from 1 to 1K, per nearline disk model.)

CMs per Corrupt Disk – Enterprise

• CMs per corrupt disk is higher
  • 50% of corrupt disks have ≤ 10 CMs (2 for nearline)
  • 90% of corrupt disks have ≤ 200 CMs (100 for nearline)

(Plot: cumulative % of corrupt disks with ≤ X checksum mismatches, X from 1 to 1K, per enterprise disk model.)

CMs per Corrupt Disk – Summary

• Class and model matter

• Fewer enterprise disks have CMs, but corrupt disks have more CMs
  • Fail-out enterprise disks on first CM

• Corrupt nearline disks develop fewer CMs
  • There can be anomalies (disk model E-1)

Other Characteristics

• Very high spatial locality
  • When multiple checksum mismatches occur, they are often for consecutive disk blocks

• High temporal locality

• Not independent
  • Over different disks in the same system
  • Defect may be in common hardware components (example: shelf controller)

Checksum Mismatch (CM) Analysis

1. Factors
2. Characteristics
3. Correlations with other errors
4. Request type
   • Scrubs vs. FS reads, etc.

Request Type

• What types of disk requests detect checksum mismatches?

• Is data scrubbing useful?

Request Type

• Data scrubbing finds most CMs
  • Nearline: 49%
  • Enterprise: 73%
• Reconstruction also finds CMs
  • Nearline: 9%
  • Enterprise: 4%

(Bar chart: % of CMs discovered by each request type, per disk model.)

Request Type – Summary

• Data scrubbing appears to be very useful
  • A study of scrub rates and workload is needed

• Mismatches found during reconstruction
  • Can cause data loss without double disk failure protection [Alvarez97, Blaum94, Corbett04, Park95, Hafner05]
  • More aggressive scrubbing may be needed
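A scrub pass is conceptually just a proactive sweep that reads every block and verifies its checksum before a client read or a reconstruction depends on it. A minimal sketch, with dictionaries standing in for the disk and the checksum store:

```python
import hashlib

def scrub(disk, checksums):
    """disk: blkno -> bytes; checksums: blkno -> stored digest.
    Returns the block numbers whose contents no longer match their checksum."""
    mismatches = []
    for blkno, data in disk.items():
        if hashlib.sha256(data).digest() != checksums[blkno]:
            mismatches.append(blkno)
    return mismatches
```

Finding a mismatch during a scheduled scrub is benign (the block can be reconstructed from redundancy); finding the same mismatch during a reconstruction, after a disk has already failed, is what causes data loss.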

Interesting Behavior

Do system designers need to factor in any abnormal behavior?

Block numbers are not created equal!

Disk Model: E-1

• Typically, each block number has 1 disk where it is corrupt
• A series of block numbers are corrupt in many disks
  • A block-number-specific bug?

(Plot: number of disks with a CM at block X, across the block number space.)

Talk Outline

• Introduction
• Background
• Results
• Lessons
• Conclusion

Lessons

• Data corruption does occur
  • Even rare errors like lost writes do occur
  • Corruption-handling mechanisms are essential
• Very few enterprise disks develop corruption
  • "Fail-out" these disks on first corruption detection
• High spatial locality
  • Spread out redundant data within the same disk

Lessons (contd.)

• Temporal locality; consecutive blocks affected
  • Maybe corruption occurs during the same write operation
  • Write redundant data with separate disk requests, spaced out over time

Conclusion

• Our analysis
  • First large-scale study of data corruption
  • Corruptions detected by NetApp production systems
• Data corruptions do occur
  • Affect ~10 times fewer disks than latent sector errors
  • Nearline (SATA) disks are most affected
  • Corruption-handling mechanisms are essential
• Data corruption characteristics
  • Depend on disk class and disk model
  • Not independent (both within a disk and within a system)
  • High spatial and temporal locality
  • May occur at specific block numbers

Thank You!

Advanced Technology Group (ATG) NetApp, Inc http://www.netapp.com/company/research/

Advanced Systems Lab (ADSL) University of Wisconsin-Madison http://www.cs.wisc.edu/adsl

Department of Computer Science, University of Toronto http://www.cs.toronto.edu/~bianca
