Disk Scrubbing Large Archival Storage Systems

Disk Scrubbing in Large Archival Storage Systems Thomas J. E. Schwarz, S. J.12† Qin Xin2 3‡ Ethan L. Miller2‡ Darrell D. E. Long2‡ Andy Hospodor1 Spencer Ng3 1Computer Engineering Department, Santa Clara University, Santa Clara, CA 95053 2Storage Systems Research Center, University of California, Santa Cruz, CA 95064 3Hitachi Global Storage Technologies, San Jose Research Center, San Jose, CA 95120 Abstract quent, necessitating redundant data storage. For instance, we can mirror disks or mirror blocks of data, or even use Large archival storage systems experience long periods higher degrees of replication. For better storage efficiency, of idleness broken up by rare data accesses. In such sys- we collect data into large reliability blocks, and group m tems, disks may remain powered off for long periods of of these large reliability blocks together in a redundancy time. These systems can lose data for a variety of reasons, group to which we add k parity blocks. We store all relia- including failures at both the device level and the block bility blocks in a redundancy group on different disks. We level. To deal with these failures, we must detect them early calculate the parity blocks with an erasure correcting code. enough to be able to use the redundancy built into the stor- The data blocks in a redundancy group can be recovered if · age system. We propose a process called “disk scrubbing” we can access m out of the n m k blocks making up the in a system in which drives are periodically accessed to de- redundancy group, making our system k-available. tect drive failure. By scrubbing all of the data stored on all of the disks, we can detect block failures and compensate for them by rebuilding the affected blocks. Our research Our redundancy scheme implies that the availability of shows how the scheduling of disk scrubbing affects over- data depends on the availability of data in another block. all system reliability, and that “opportunistic” scrubbing, The extreme scenario is one where we try to access data, in which the system scrubs disks only when they are pow- discover that the disk containing the block has failed or that ered on for other reasons, performs very well without the a sector in the block can no longer be read, access the other need to power on disks solely to check them. blocks in the redundancy group, and then find that enough of them have failed so that we are no longer able to recon- struct the data, which is now lost. In this scenario, the re- 1. Introduction construction unmasks failures elsewhere in the system. As disks outpace tapes in capacity growth, large scale To maintain high reliability, we must detect these storage systems based on disks are becoming increasingly masked failures early. Not only do we have to access attractive for archival storage systems. Such systems will disks periodically to test whether the device has failed, encompass a very large number of disks and store petabytes but we also have to periodically verify all of the data on of data (1015 bytes). Because the disks store archival data, it [6, 7], an operation called “scrubbing.” By merely read- they will remain powered off between accesses, conserv- ing the data, we verify that we can access them. Since we ing power and extending disk lifetime. Colarelli and Grun- are already reading the data, we should also verify its ac- wald [2] called such a system a Massive Array of mainly curacy using signatures—small bit strings calculated from Idle Disks (MAID). a block of data. We can use signatures to verify the co- In a large system encompassing thousands or even tens herence between mirrored copies of blocks or between of thousands of active disks, disk failure will become fre- client data and generalized parity data blocks. Sig- † Supported in part by an SCU-internal IBM research grant. natures can also detect corrupted data blocks with a ‡ Supported in part by Lawrence Livermore National Laboratory, Los tiny probability of error. Our work investigates the im- Alamos National Laboratory, and Sandia National Laboratory under pact of disk scrubbing through analysis and simula- contract B520714. tion. Proceedings of the 12th IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2004), Volendam, Netherlands, October 2004. 2. Disk Failures unread block. Therefore, scrubbing and subsequent recon- struction of failed blocks is necessary to ensure long-term Storage systems must deal with defects and failures in data viability. rotating magnetic media. We can distinguish three principal failure modes that affect disks: block failures, also known 2.2. Disk Failure Rates as media failures; device failures; and data corruption or unnoticed read error. Some block failures are correlated. For example, if a On a modern disk drive, each block consisting of one or particle left over from the manufacturing process remains more 512 B sectors contains error correcting code (ECC) within the disk drive, then this particle will be periodically information that is used to correct errors during a read. If caught between a head and a surface, scratching the sur- there are multiple errors within a block, the ECC can ei- face and damaging the head [3]. The user would observe a ther successfully correct them, flag the read as unsuccess- burst of hard read failures, followed eventually by the fail- ful, or in the worst case, mis-correct them. The latter is an ure of the drive. However, most block failures are not re- instance of data corruption, which fortunately is extremely lated, so we can model block failures with a constant fail- rare. If the ECC is unable to correct the error, a modern disk ure rate λbf . retries the read several times. If a retry is successful, the er- Device failure rates are specified by disk drive manu- ror is considered a soft error; otherwise, the error is hard. facturers as MTBF values. The actual observed values de- Failures to read data may reflect physical damage to the pend in practice heavily on operating conditions that are disk or result from faulty disk operations. For example if a frequently worse than the manufacturers’ implicit assump- disk suffers a significant physical shock during a write op- tions. Block failure rates are published as failures per num- eration, the head might swing off the track and thus destroy ber of bits read. A typical value of 1 in 10 14 bits for a com- data in a number of sectors adjacent to the one being writ- modity ATA drive means that if we access data under nor- ten. mal circumstances at a rate of 10 TB per year, we should expect one uncorrectable error per year. 2.1. Block Failures Block errors may occur even if we do not access the disk. Lack of measurements, lack of openness, and most Many block defects are the result of imperfections in the importantly, lack of a tested failure model make conjec- machining of the disk substrate, non-uniformity in the mag- tures on how to translate the standard errors per bits rate netic coating and contaminants within the head disk assem- into a failure rate per year hazardous. Based on interviews bly. These manufacturing defects cause repeatable hard er- with data storage professionals, we propose the following rors detected during read operations. Disk drive manufac- model: turers attempt to detect and compensate for these defects We assume that for server drives, about 1/3 of all field during a manufacturing process called self-scan. Worst- returns are due to hard errors. RAID system users, who buy case data patterns are written and read across all surfaces 90% of the server disks, do not return drives with hard er- of the disk drive to form a map of known defects called the rors. Thus, 10% of the disks sold account for one third of P-list. After the defect map is created (during self-scan), hard errors reported. Therefore, the mean-time-to-block- the drive is then formatted in a way that eliminates each failure is 3/10 times the MTBF of all disk failures and the defect from the set of logical blocks presented to the user. mean-time-to-disk-failure is 3/2 of the MTBF. If a drive These manufacturing defects are said to be “mapped-out” is rated at 1 million hours, then the mean-time-to-block- 5 and should not affect the function of the drive. failure is 3 ¢ 10 hours and the mean-time-to-disk-failure 6 ¢ Over the lifetime of a disk, additional defects can be en- is 15 10 hours. Thus, we propose that block failure rates countered. Additional spare sectors are reserved at the end are about five times higher than disk failure rates. of each track or cylinder for these grown defects. A spe- cial grown defect list, or G-List, maps the grown defects and complements the P-List. In normal operation, the user 3. System Overview should rarely encounter repeatable errors. An increase in repeatable errors, and subsequent grown defects in the G- We investigate the reliability of a very large archival List, can indicate a systemic problem in the heads, media, storage system that uses disks. We further assume that disks servo or recording channel. Self Monitoring Analysis and are powered down when they are not serving requests, as in Reporting Technology (SMART) [13] allows users to mon- a MAID [2]. Disk drives are not built for high reliability [1] itor these error rates. and any storage system containing many disks needs to de- Note that a drive only detects errors after reading the af- tect failures actively.

Disk Scrubbing Large Archival Storage Systems

An Analysis of Data Corruption in the Storage Stack

Data & Computer Recovery Guidelines

Data Remanence in Non-Volatile Semiconductor Memory (Part I)

Databridge ETL Solution Datasheet

Database Analyst Ii

Increasing Reliability and Fault Tolerance of a Secure Distributed Cloud Storage

Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid-State Drives

EEPROM Emulation

PROTECTING DATA from RANSOMWARE and OTHER DATA LOSS EVENTS a Guide for Managed Service Providers to Conduct, Maintain and Test Backup Files

Active Storage Activeraid ES Data Sheet

Algorithms for Data Cleaning in Knowledge Bases

And Intra-Set Write Variations