End-to-end Data Integrity for File Systems: A ZFS Case Study

Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Computer Sciences Department, University of Wisconsin-Madison

Abstract

We present a study of the effects of disk and memory corruption on file system data integrity. Our analysis focuses on Sun's ZFS, a modern commercial offering with numerous reliability mechanisms. Through careful and thorough fault injection, we show that ZFS is robust to a wide range of disk faults. We further demonstrate that ZFS is less resilient to memory corruption, which can lead to corrupt data being returned to applications or system crashes. Our analysis reveals the importance of considering both memory and disk in the construction of truly robust file and storage systems.

1 Introduction

One of the primary challenges faced by modern file systems is the preservation of data integrity despite the presence of imperfect components in the storage stack. Disk media, firmware, controllers, and the buses and networks that connect them all can corrupt data [4, 52, 54, 58]; higher-level storage software is thus responsible for both detecting and recovering from the broad range of corruptions that can (and do [7]) occur.

File and storage systems have evolved various techniques to handle corruption. Different types of checksums can be used to detect when corruption occurs [9, 14, 49, 52], and redundancy, likely in mirrored or parity-based form [43], can be applied to recover from it. While such techniques are not foolproof [32], they clearly have made file systems more robust to disk corruptions.

Unfortunately, the effects of memory corruption on data integrity have been largely ignored in file system design. Hardware-based memory corruption occurs as both transient soft errors and repeatable hard errors due to a variety of radiation mechanisms [11, 35, 62], and recent studies have confirmed their presence in modern systems [34, 41, 46]. Software can also cause memory corruption; bugs can lead to "wild writes" into random memory contents [18], thus polluting memory; studies confirm the presence of software-induced memory corruptions in operating systems [1, 2, 3, 60].

The problem of memory corruption is critical for file systems that cache a great deal of data in memory for performance. Almost all modern file systems use a page cache or buffer cache to store copies of on-disk data and metadata in memory. Moreover, frequently-accessed data and important metadata may be cached in memory for long periods of time, making them more susceptible to memory corruptions.

In this paper, we ask: how robust are modern file systems to disk and memory corruptions? To answer this query, we analyze a state-of-the-art file system, Sun Microsystems' ZFS, by performing fault injection tests representative of realistic disk and memory corruptions. We choose ZFS for our analysis because it is a modern and important commercial file system with numerous robustness features, including end-to-end checksums, data replication, and transactional updates; the result, according to the designers, is "provable data integrity" [14].

In our analysis, we find that ZFS is indeed robust to a wide range of disk corruptions, thus partially confirming that many of its design goals have been met. However, we also find that ZFS often fails to maintain data integrity in the face of memory corruption. In many cases, ZFS is either unable to detect the corruption, returns bad data to the user, or simply crashes. We further find that many of these cases could be avoided with simple techniques.

The contributions of this paper are:
• To our knowledge, the first study to empirically analyze the reliability of ZFS.
• To our knowledge, the first study to analyze local file system reliability techniques in the face of memory corruption.
• A novel holistic approach to analyzing both disk and memory corruptions using carefully-controlled fault-injection techniques.
• A simple framework to measure the likelihood of different memory corruption failure scenarios.
• Results that demonstrate the importance of both memory and disk in end-to-end data protection.

The rest of this paper is organized as follows. In Section 2, we motivate our work by discussing the problem of disk and memory corruption. In Section 3, we provide some background on the reliability features of ZFS. Section 4 and Section 5 present our analysis of data integrity in ZFS with disk and memory corruptions. Section 6 gives a preliminary analysis of the probabilities of different failure scenarios in ZFS due to memory errors. In Section 7, we present initial results of the data integrity analysis of ext2 with memory corruptions. Section 8 discusses related work and Section 9 concludes our work.

2 Motivation

This section provides the motivation for our study by describing how potent the problem of disk and memory corruptions is to file system data integrity. Here, we discuss why such corruptions happen, how frequently they occur, and how systems try to deal with them. We discuss disk and memory corruptions separately.

2.1 Disk corruptions

We define disk corruption as a state when any data accessed from disk does not have the expected contents due to some problem in the storage stack. This is different from latent sector errors, not-ready-condition errors and recovered errors (discussed in [6]) in disk drives, where there is an explicit notification from the drive about the error condition.

2.1.1 Why they happen

Disk corruptions happen due to many reasons originating at different layers of the storage stack. Errors in the magnetic media lead to the problem of "bit-rot" where the magnetic properties of a single bit or few bits are damaged. Spikes in power, erratic arm movements, and scratches in media can also cause corruptions in disk blocks [4, 47, 54]. On-disk ECC catches many (but not all) of these corruptions.

Errors are also induced due to bugs in complex drive firmware (modern drives contain hundreds of thousands of lines of firmware code [44]). Some reported firmware problems include a misdirected write where the firmware accidentally writes to the wrong location [58] or a lost write (or phantom write) where the disk reports a write as completed when in fact it never reaches the disk [52]. Bus controllers have also been found to incorrectly report disk requests as complete or to corrupt data [24, 57].

Finally, software bugs in operating systems are also potential sources of corruption. Buggy device drivers can issue disk requests with bad parameters or data [20, 22, 53]. Software bugs in the file system itself can cause incorrect data to be written to disk.

2.1.2 How frequently they happen

Disk corruptions are prevalent across a broad range of modern drives. In a recent study of 1.53 million disk drives over 41 months [7], Bairavasundaram et al. show that more than 400,000 blocks had checksum mismatches, 8% of which were discovered during RAID reconstruction, creating the possibility of real data loss. They also found that nearline disks develop checksum mismatches an order of magnitude more often than enterprise class disk drives. In addition, there is much anecdotal evidence of corruption in storage [9, 52, 58].

2.1.3 How to handle them

Systems use a number of techniques to handle disk corruptions. We discuss some of the most widely used techniques along with their limitations.

Checksums: Checksums are hashes computed with a collision-resistant function and are used to verify data integrity. For on-disk data integrity, checksums are stored or updated on disk during write operations and read back to verify the block or sector contents during reads.

Many storage systems have used checksums for on-disk data integrity, such as Tandem NonStop [9] and NetApp Data ONTAP [52]. Similar checksumming techniques have also been used in file systems [14, 42].

However, Krioukov et al. show that checksumming, if not carefully integrated into the storage system, can fail to protect against complex failures such as lost writes and misdirected writes [32]. Further, checksumming does not protect against corruptions that happen due to bugs in software, typically in large code bases [20, 61].
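To make the lost-write limitation concrete, the sketch below (not drawn from any of the cited systems; every name in it is hypothetical) stores a checksum next to each block. When the firmware silently drops an update, the stale block still matches its stale checksum, so verification passes even though the returned data is old.

/* Sketch: a per-block checksum stored next to the block cannot catch a
 * lost write, because the stale block still matches its stale checksum.
 * All structures and functions here are hypothetical illustrations. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLKSZ 512

struct sector {
    uint8_t  data[BLKSZ];
    uint32_t cksum;                      /* checksum stored with the block */
};

static uint32_t cksum32(const uint8_t *b, size_t n)
{
    uint32_t c = 0;
    for (size_t i = 0; i < n; i++)
        c = ((c << 1) | (c >> 31)) ^ b[i];
    return c;
}

static struct sector disk;               /* one-sector "disk"              */

static void disk_write(const uint8_t *buf, int lost)
{
    if (lost)                            /* firmware drops the write       */
        return;
    memcpy(disk.data, buf, BLKSZ);
    disk.cksum = cksum32(buf, BLKSZ);
}

static int disk_read(uint8_t *buf)       /* returns 1 if checksum matches  */
{
    memcpy(buf, disk.data, BLKSZ);
    return cksum32(buf, BLKSZ) == disk.cksum;
}

int main(void)
{
    uint8_t oldv[BLKSZ], newv[BLKSZ], out[BLKSZ];
    memset(oldv, 'A', BLKSZ);
    memset(newv, 'B', BLKSZ);

    disk_write(oldv, 0);                 /* initial version reaches disk   */
    disk_write(newv, 1);                 /* update is silently lost        */

    int ok = disk_read(out);
    printf("checksum passes: %d, data is stale: %d\n", ok, out[0] == 'A');
    return 0;
}

Keeping the expected checksum in a parent block that is rewritten on every update, as ZFS does (Section 3), closes this particular hole.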
Redundancy: Redundancy in on-disk structures also helps to detect and, in some cases, recover from disk corruptions. For example, some B-Tree file systems such as ReiserFS [15] store page-level information in each internal page in the B-Tree. Thus, a corrupt pointer that does not connect pages in adjacent levels is caught by checking this page-level information. Similarly, ext2 [16] and ext3 [56] use redundant copies of the superblock and group descriptors to recover from corruptions.

However, it has been shown that many of these file systems still sometimes fail to detect corruptions, leading to greater problems [44]. Further, Gunawi et al. show instances where ext2/ext3 file system checkers fail to use available redundant information for recovery [26].

RAID storage: Another popular technique is to use a RAID storage system [43] underneath the file system. However, RAID is designed to tolerate the loss of a certain number of disks or blocks (e.g., RAID-5 tolerates one, and RAID-6 two), and it may not be possible with RAID alone to accurately identify the block (in a stripe) that is corrupted. Secondly, some RAID systems have been shown to have flaws where a single block loss leads to data loss or silent corruption [32]. Finally, not all systems incorporate multiple disks, which limits the applicability of RAID.

2.2 Memory corruptions

We define memory corruption as the state when the contents accessed from the main memory have one or more bits changed from the expected value (from a previous store to the location). From the software perspective, it may not be possible to distinguish memory corruption from disk corruption on a read of a disk block.

2.2.1 Why they happen

Errors in the memory chip are one source of memory corruptions. Memory errors can be classified as soft errors, which randomly flip bits in RAM without leaving any permanent damage, and hard errors, which corrupt bits in a repeatable manner due to physical damage.

Researchers have discovered radiation mechanisms that cause errors in semiconductor devices at terrestrial altitudes. Nearly three decades ago, May and Woods found that if an alpha particle penetrates the die surface, it can cause a random, single-bit error [35]. Ziegler and Lanford found that cosmic rays can also disrupt electronic circuits [62]. More recent studies and measurements confirm the effect of atmospheric neutrons causing single event upsets (SEU) in memories [40, 41].

Memory corruption can also happen due to software bugs. The use of unsafe languages like C and C++ makes software vulnerable to bugs such as dangling pointers, buffer overflows and heap corruption [12], which can result in seemingly random memory corruptions.

2.2.2 How frequently they happen

Early studies and measurements on memory errors provided evidence of soft errors. Data collected from a vast storehouse of data at IBM over a 15-year period [41] confirmed the presence of errors in RAM and that the upset rates increase with elevation, indicating atmospheric neutrons as the likely cause.

In a recent measurement-based study of memory errors in a large fleet of commodity servers over a period of 2.5 years [46], Schroeder et al. observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 FIT per Mbit (1 FIT equals 1 failure in 10^9 device hours). They also find that more than 8% of the DIMMs they examined (from multiple vendors, with varying capacities and technologies) were affected by bit errors each year. Finally, they also provide strong evidence that memory errors are dominated by hard errors, rather than soft errors.

Another study [34] of production systems including 300 machines for a multi-month period found 2 cases of suspected soft errors and 9 cases of hard errors, suggesting the commonness of hard memory faults.

Besides hardware errors, software bugs that lead to memory corruption are widely extant. Reports from the Linux Kernel Bugzilla [2], the US-CERT Vulnerabilities Database [3], CERT/CC advisories [1], as well as other anecdotal evidence [18] show cases of memory corruption happening due to software bugs.

2.2.3 How to handle them

Systems use both hardware and software techniques to handle memory corruptions. Below, we discuss the most relevant hardware and software techniques.

ECC: Traditionally, memory systems have employed Error Correction Codes [19] to correct memory errors. Unfortunately, ECC is unable to address all soft-error problems. Studies found that the most commonly-used ECC scheme, SEC/DED (Single Error Correct/Double Error Detect), can recover from only 94% of the errors in DRAMs [23]. Further, many commodity systems simply do not use ECC protection in order to reduce cost [28].

More sophisticated techniques like Chipkill [30] have been proposed to withstand multi-bit failures in DRAMs. However, such techniques are expensive and have been restricted to proprietary systems, leaving the problem of memory corruptions unaddressed in commodity systems.

Programming models and tools: Another approach to deal with memory errors is to use recoverable programming models [38] at different levels (firmware, operating system, and applications). However, such techniques require support from hardware to detect memory corruptions. Further, a holistic change in software is required to provide a recovery solution at various levels.

Much effort has also gone into detecting software bugs which cause memory corruptions. Tools such as the static checker described in [27] and CSSV [21] apply static analysis to detect memory corruptions. Others such as Purify [29] and SafeMem [45] use dynamic monitoring to detect memory corruptions at runtime. However, as discussed in Section 2.2.2, software-induced memory corruptions still remain a problem.

2.3 Summary

In modern systems, corruption occurs both within the storage system and in memory. Many commercial systems apply sophisticated techniques to detect and recover from disk-level corruptions; beyond ECC, little is done to protect against memory-level problems. Therefore, the protection of critical user data against memory corruptions is largely left to software.

3 ZFS reliability features

ZFS is a state-of-the-art file system from Sun which takes a unified approach to data management. It provides data integrity, transactional consistency, scalability, and a multitude of useful features such as snapshots, copy-on-write clones, and simple administration [14].

In terms of reliability, ZFS claims to provide provable data integrity by using techniques like checksums, replication, and transactional updates. Further, the use of a pooled storage model in ZFS lends it additional RAID-like reliability features. In the words of the designers, ZFS is "The Last Word in File Systems." We now describe the reliability mechanisms in ZFS.

Checksums for data integrity checking: ZFS maintains data integrity by using checksums for on-disk blocks. The checksums are kept separate from the corresponding blocks by storing them in the parent blocks. ZFS provides for these parental checksums of blocks by using a generic block pointer structure to address all on-disk blocks.
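The sketch below illustrates this parental arrangement: a simplified block pointer holds the addresses of up to three identical copies (ditto blocks) and one checksum of the block they contain, and the read path tries each copy until one verifies. The structure layout and the toy checksum are our own simplifications, not the actual ZFS blkptr_t definition.

/* A minimal sketch of parental checksums and ditto blocks.  Field names,
 * the toy checksum, and the in-memory "copies" are simplifications,
 * not the real ZFS blkptr_t definition. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLKSZ 512
#define NDVAS 3                          /* up to three ditto copies       */

typedef struct blkptr {
    const uint8_t *dva[NDVAS];           /* addresses of identical copies  */
    int            ndvas;                /* 1 for data, 2-3 for metadata   */
    uint64_t       cksum;                /* checksum of the child block    */
} blkptr_t;

static uint64_t blk_cksum(const uint8_t *b, size_t n)    /* toy Fletcher   */
{
    uint64_t a = 0, c = 0;
    for (size_t i = 0; i < n; i++) { a += b[i]; c += a; }
    return (a << 32) ^ c;
}

/* Read path: try each ditto copy and return the first one that verifies
 * against the checksum stored in the parent's block pointer. */
static const uint8_t *read_verified(const blkptr_t *bp)
{
    for (int i = 0; i < bp->ndvas; i++)
        if (blk_cksum(bp->dva[i], BLKSZ) == bp->cksum)
            return bp->dva[i];
    return NULL;                         /* every copy is corrupt          */
}

int main(void)
{
    static uint8_t copy0[BLKSZ], copy1[BLKSZ];
    memset(copy0, 0xab, BLKSZ);
    memset(copy1, 0xab, BLKSZ);

    /* The parent stores the checksum; the child blocks do not. */
    blkptr_t bp = { { copy0, copy1 }, 2, 0 };
    bp.cksum = blk_cksum(copy0, BLKSZ);

    copy0[42] ^= 0x01;                   /* corrupt one ditto copy         */
    printf("recovered from corruption: %s\n",
           read_verified(&bp) ? "yes" : "no");
    return 0;
}

Because the checksum travels with the parent rather than with the block itself, a corrupted, lost, or misdirected child block cannot present a matching checksum, which is the property the rest of this section relies on.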

The block pointer structure contains the checksum of  £ ¢  the block it references. Before using a block, ZFS calcu- lates its checksum and verifies it against the stored check- Figure 1: Block pointer. The figure shows how the block in the block pointer. The checksum hierarchy forms pointer structure points to (up to) three copies of a block (ditto a self-validating [37]. With this mechanism, blocks), and keeps a single checksum. ZFS is able to detect silent , such as bit rot, phantom writes, and misdirected reads and writes. maintain data integrity under a variety of disk corruption Replication for : Besides using RAID scenarios. Specifically, we wish to find if ZFS can detect techniques (described below), ZFS provides for recov- and recover from all disk corruptions in data and meta- ery from disk corruption by keeping replicas of certain data and how ZFS reacts to multiple block corruptions at “important” on-disk blocks. Each block pointer contains the same time. pointers to up to three copies (ditto blocks) of the block We find that ZFS is able to detect all and recover from being referenced. By default ZFS stores multiple copies most disk corruptions. We present our analysis, includ- for metadata and one copy for data. Upon detecting a ing methodology and results in later sections. First, we corruption due to checksum mismatch, ZFS uses a re- present a brief background about the on-disk organiza- dundant copy with a correctly-matching checksum. tion in ZFS, focusing on how data integrity is maintained. COW transactions for atomic updates: ZFS maintains 4.1 ZFS on-disk organization data consistency in the event of system crashes by using a All on-disk data and metadata in ZFS are treated as ob- copy-on-write transactional update model. ZFS manages jects, where an object is a collection of blocks. Objects all metadata and data as objects. Updates to all objects are further grouped into object sets. Other structures are grouped together as a transaction group. To commit such as uberblocksare also used to organizedata on disk. a transaction group to disk, new copies are created for all We now discuss these on-disk structures and their the modified blocks (in a Merkle tree). The root of this usage in ZFS. tree (the uberblock) is updated atomically, thus main- taining an always-consistent . In effect, the 4.1.1 Basic structures copy-on-write transactions along with block checksums Block pointers: A block pointer is the basic structure in (in a Merkle tree) preclude the need for journaling [59], ZFS for addressing a block on disk. It provides a generic though ZFS occasionally uses a write-ahead log for per- mechanism to keep parental checksums and replicas of formance reasons. on-disk blocks. Figure 1 shows the block pointer used Storage pools for additional reliability: ZFS provides by ZFS. As shown, the block pointer contains up to three additional reliability by enabling RAID-like configura- block addresses, called DVAs (data virtual addresses), tion for devices using a common storage pool for all file each pointing to a different block having the same con- system instances. ZFS presents physical storage to file tents. These are referred to as ditto blocks. The num- systems in the form of a storage pool (called zpool). A ber of DVAs varies depending on the importance of the storage pool is made up of virtual devices (vdev). A vir- block. 
The current policy in ZFS is that there is one DVA tual device could be a physical device (e.g., disks) or a for user data, two DVAs for file system metadata, and logical device (e.g., a mirror that is constructed by two three DVAs for global metadata across all file system in- disks). This storage pool can be used to provide addi- stances in the pool [39]. As discussed earlier, the block tional reliability by using devices as RAID arrays. Fur- pointer also contains a single copy of the checksum of ther, ZFS also introduces a new data replication model, the block being pointed to. RAID-Z, a novel solution similar to RAID-5 but using Objects: All blocks on disk are organized in objects. a variable stripe width to eliminate the write-hole issue Physically, an object is represented on disk by a structure in RAID-5 [13]. Finally, ZFS provides automatic repairs called dnode phys t (hereafter referred to as dnode). in mirrored configurations and provides a disk scrubbing A dnode contains an array of up to three block point- facility to detect latent sector errors. ers, each of which points to either a leaf block (e.g., a data block) or an indirect block (full of block pointers). 4 On-disk data integrity in ZFS These blocks pointed to by the dnode form a block tree. In this section, we analyze the robustness of ZFS against A dnode also contains a bonus buffer at the end, which disk corruptions. Our aim is to find whether ZFS can stores an object-specific data structure for different types Level Object Name Simplified Explanation MOS dnode A dnode object that contains dnode blocks, which store dnodes representing pool-level objects. Object A ZAP object whose blocks contain name-value pairs referencing further objects in the MOS object set. zpool Dataset It represents an object set (e.g., a file system) and tracks its relationships with its snapshots and clones. Dataset directory It maintains an active dataset object along with its child datasets. It has a reference to a dataset child map object. It also maintains properties such as quotas for all datasets in this directory. Dataset child map A ZAP object whose blocks hold name-value pairs referencing child dataset directories. FS dnode A dnode object that contains dnode blocks, which store dnodes representing filesystem-level objects. Master node A ZAP object whose blocks contain name-value pairs referencing further objects in this file system. File An object whose blocks contain file data. Directory A ZAP object whose blocks contain name-value pairs referencing files and directories inside this directory.

Table 1: Summary of ZFS objects visited. The table presents a summary of all ZFS objects visited in the walkthrough, along with a simplified explanation. Note that ZAP stands for ZFS Attribute Processor. A ZAP object is used to store name-value pairs. of objects. For example, a dnode of a file object contains to begin with. a structure called znode phys t (znode) in the bonus Find pool metadata (steps 1-2): As the starting point, buffer, which stores file attributes such as access time, ZFS locates the active uberblock in the vdev of the file mode and size of the file. device. ZFS then uses the uberblock to and verify Object sets: Object sets are used in ZFS to group related the integrity of pool-wide metadata contained in an ob- objects. An exampleof a objectset is a file system, which ject set called Meta Object Set (MOS). There are three contains file objects and directory objects belonging to copies of the object set block representing the MOS. this file system. Find a file system (steps 3-10): To locate a file system, An object set is represented by a structure called ZFS accesses a series of objects in MOS, all of which objset phys t, which consists of a meta dnode and a have three ditto blocks. Once the dataset representing ZIL (ZFS Intent Log) header. The meta dnode points to “myfs” is found, it is used to access file system wide a group of dnode blocks; dnodes representing the objects metadata contained in an object set. The integrity of file in this object set are stored in these dnode blocks. The system metadata is checked using the block pointer in object described by the meta dnode is called “dnode ob- the dataset, which points to the object set block. All file ject”. The ZIL header points to a list of blocks, which system metadata blocks have two ditto copies. holds transaction records for ZFS’s logging mechanism. Find a file and a data block (steps 11-18): To locate Other structures: ZFS uses other structures to organize a file, ZFS then uses the directory objects in the “myfs” on-disk data. Each physical vdev is labeled with a vdev object set. Finally, by following the block pointers in label that describes this device and other related virtual the dnode of the file object, ZFS finds the required data devices. Four copies of the label are stored in each phys- block. The integrity of every traversed block is con- ical vdev to provide redundancy and a two-stage update firmed by verifying the checksum in its block pointers. mechanism is used to guarantee that there is always a The legend in Figure 2 shows a summary of all the on- valid vdev label in the device [51]. An uberblock (simi- disk block types visited during the traversal. Our fault lar to a superblock) inside the vdev label is used to pro- injection tests for analyzing robustness of ZFS against vide access to the pool data and verify its integrity. The disk corruptions (discussed in the next subsection) inject uberblock is self-checksummed and updated atomically. bit errors in the on-disk blocks shown in Figure 2. 4.1.2 On-disk layout 4.2 Methodology of analysis In this section, we present some details about ZFS on- In this section, we discuss the methodology of our relia- disk layout. This overview will help the reader to un- bility analysis of ZFS against disk corruptions. We dis- derstand the range of our fault injection experiments pre- cuss our fault injection framework first and then present sented in later sections. A complete description of ZFS our test procedures and workloads. 
on-disk structures can be found elsewhere [51]. For the purpose of illustration, we demonstrate the 4.2.1 Fault injection framework steps that ZFS takes to locate a file system and to locate Our experiments are performed on a 64-bit Solaris Ex- file data in it in a simple storage pool. Figure 2 shows the press Community Edition (build 108) virtual machine on-disk layout of the simplified pool with a sample file with 2GB non-ECC memory. We use ZFS pool version system called “myfs”, along with the sequence of objects 14 and ZFS filesystem version 3. We run ZFS on top of and blocks accessed by ZFS. A simple explanation of all a single disk for our experiments. visited objects is described in Table 1. Note that we skip To emulate disk corruptions, we developed a fault in- the details of how in-memory structures are set up and jection framework consisting of a pseudo-driver to per- assume that data and metadata are not cached in memory form fault injection on disk blocks and an application for

12 +

Figure 2: ZFS on-disk structures. The figure shows the on-disk structures of ZFS including the pool-wide metadata and file system metadata. In the example above, the zpool contains a sample file system named “myfs”. All ZFS on-disk data structures are shown by rounded boxes, and on-disk blocks are shown by rectangular boxes. Solid arrows point to allocated blocks and dotted arrows represent references to objects inside blocks. The legend at the top shows the types of on-disk blocks and their contents. controlling the experiments. The pseudo-driver is a stan- disk copies of metadata. The “remount” workload in- dard Solaris layered driver that interposes between the dicates that the target block is corrupted with “myfs” ZFS virtual device and the disk driver beneath. We an- mounted and we subsequently umountand it. ZFS alyze the behavior of ZFS by looking at return values, uses in-memory copies of metadata in this workload. checking system logs, and tracing system calls. For injecting faults in file and directory blocks in a file system, we used two simple operations as workloads: 4.2.2 Test procedure and workloads “create file” creates a new file in a directory, and “read In our tests, we wanted to understand the behavior of file” reads a file’s contents. ZFS to disk corruptions on different types of blocks. We injected faults by flipping bits at random offsets in 4.3 Results and observations disk blocks. Since we used the default setting in ZFS The results of our fault injection experiments are shown for compression (metadata compressed and data uncom- in Table 2. The table reportsthe results of experimentson pressed), our fault injection tests corrupted compressed pool-wide metadata and file system metadata and data. metadata and uncompressed data blocks on disk. We It also shows the results of corrupting a single copy as injected faults on nine different classes of ZFS on-disk well as all copies of blocks. We now explain the results blocks and for each class, we corrupted a single copy as in detail in terms of the observations we made from our well as all copies of blocks. fault injection experiments. In our fault injection experiments on pool-wide and Observation 1: ZFS detects all corruptions due to file system level metadata, we used “mount” and “re- the use of checksums. In our fault injection experiments mount” operations as our workload. The “mount” work- on all metadata and data, we found that bad data was load indicates that the target block is corrupted with the never returned to the user because ZFS was able to de- pool exported and “myfs” not mounted, and we subse- tect all corruptions due to the use of checksums in block quently mount it. This workload forces ZFS to use on- pointers. The parental checksums are used in ZFS to ver- Single All commit, making the old corrupted copy void. Similar ditto ditto results were obtained when corrupting other pool-wide metadata and file system metadata, and ZFS was able to recover from these multiple block corruptions (R).

Level Block mount remount create file read file mount remount create file read file Observation 5: ZFS cannot recover from multiple vdev label1 R R E R block corruptions affecting all ditto blocks when no in- uberblock R R E R zpool memory copy exists. For file system metadata, like di- MOS object set block R R E R MOS dnode block R R E R rectory ZAP blocks, ZFS does not always keep an in- myfs object set block R R E R memory copy unless the directory has been accessed. myfs indirect block R R E R Thus, on corruptions to both ditto blocks, ZFS reported zfs myfs dnode block R R E R an error. This behavior is shown by the results (E) for di- ZAP block R R EE file data block E E rectories indicating for the “create file” and “read file” 1 excluding the uberblocks contained in it. operations. Note that we performed these corruptions without first accessing the directory, so that there were no Table 2: On-disk corruption analysis. The table shows in-memory copies. Similarly, in the “mount” workload, the results of on-disk experiments. Each cell indicates whether when the pool was inactive (exported) and thus no in- ZFS was able to recover from the corruption (R), whether ZFS memory copies existed, ZFS was unable to recover from reported an error (E), whether ZFS returned bad data to the multiple disk corruptions and responded with errors (E). user (B), or whether the system crashed (C). Blank cells mean that the workload was not exercised for the block. Observation 4 and 5 also lead to an interesting conclu- sion that an active storage pool is likely to tolerate more ify the integrity of all the on-disk blocks accessed. The serious disk corruptions than an inactive one. only exception are uberblocks, which do not have parent In summary, ZFS successfully detects all corruptions block pointers. Corruptions to the uberblock are detected and recovers from them as long as one correct copy ex- by the use of checksums inside the uberblock itself. ists. The in-memory caching and periodic flushing of Observation 2: ZFS gracefully recovers from single metadata on transaction commits help ZFS recover from metadata block corruptions. For pool-widemetadata and serious disk corruptions affecting all copies of metadata. file system wide metadata, ZFS recovered from disk cor- For user data, ZFS does not keep redundant copies and ruptions by using the ditto blocks. ZFS keeps three ditto is unable to recover from corruptions. ZFS, however, de- blocks for pool-wide metadata and two for file system tects the corruptions and reports an error to the user. metadata. Hence, on single-block corruption to meta- 5 In-memory data integrity in ZFS data, ZFS was successfully able to detect the corruption and use other available correct copies to recover from it; In the last section we showed the robustness of ZFS to this is shown by the cells (R) in the “Single ditto” column disk corruptions. Although ZFS was not specifically de- for all metadata blocks. signed to tolerate memory corruptions, we still would Observation 3: ZFS does not recover from data block like to know how ZFS reacts to memory corruptions, i.e., corruptions. For data blocks belonging to files, ZFS whether ZFS can detect and recover from a single bit flip was not able to recover from corruptions. ZFS detected in data and metadata blocks. Our fault injection exper- the corruption and reported an error on reading the data iments indicate that ZFS has no precautions for mem- block. 
Since ZFS does not keep multiple copies of data ory corruptions: bad data blocks are returned to the user blocks by default, this behavior is expected; this is shown or written to disk, file system operations fail, and many by the cells (E) for the file data block. times the whole system crashes. Observation 4: In-memory copies of metadata help This section is organized as follows. First, we briefly ZFS to recover from serious multiple block corruptions. describe ZFS in-memory structures. Then, we discuss In an active storage pool, ZFS caches metadata in mem- the test methodology and workloads we used to conduct ory for performance. ZFS performs operations on these the analysis. Finally, we present the experimental results cached copies of metadata and writes them to disk on and our observations. transaction group commits. These in-memory copies of 5.1 ZFS in-memory structures metadata, along with periodic transaction commits, help In order to better understand the in-memory experiments, ZFS recover from multiple disk corruptions. we present some background information on ZFS in- In the “remount” workload that corrupted all copies of memory structures. uberblock, ZFS recovered from the corruptions because the in-memory copy of the active uberblock remains as 5.1.1 In-memory structures long as the pool exists. The in-memory copy is subse- ZFS in-memory structures can be classified into two cat- quently written to a new disk block in a transaction group egories: those that exist in the and those that

READ WRITE

‹

š

ˆ ‰

– – ‹

– – ‰ ˆ

š

‹ •

‰ ˆ ’ ’

š ‰

‰ †‡ †Š

‰ ˆ ‰

› †

” ” •– ˜  

™ ž ž — † Š † †  †

— † Š † œ œ

œ

† º

“ †Š ·



“ “ —“

Ÿ

Œ ‹ ‹ Ž

” 

‘’

  ‹ ‹ ˆ

  ‹ ‹

”  

™ ’ š

‰‰ ˆ Š Š Š

 “

— º— †

— º— †

“ º —» º

¦§

¥ ¨

” 

‘ ’

”



”

š’

š’

ˆ

ˆ

“

†

†

©¦©

ª¨ ´

² ³ ......

«¬­ ®¯ °®¯ ±

V G

•

š

«¬­ ®¯ °®¯ ± ‰ ‰ ˆ

† † † Š †

œ ·

µ ¶

˜ ¸ ˜ ¸

’ ¹ ’

 † “—  † “—

£

¡¢ ¤

–

‹

š

‰ˆ ‰‰ ˆ

 †Š

º —»

·

– ¸

– ¸ – ¸

º

º º

Figure 3: Lifecycle of a block. This figure illustrates one example of the lifecycle of a block. The left half represents the read timeline and the right half represents the write timeline. The black dotted line is a protection boundary, below which a block is protected by the checksum, otherwise unprotected. are in memory outside of the page cache; for convenience ing on many factors such as the workload and the cache we call the latter in-heap structures. Whenever a disk replacement policy. block is accessed, it is loaded into memory. Disk blocks After some time, the block is updated. The write time- containing data and metadata are cached in the ARC line is illustrated in the right half of Figure 3. All up- page cache [36], and stay there until evicted. Data blocks dates are first done in the page cache and then flushed are stored only in the page cache, while most metadata to disk. Thus before the updates occur, the target block structures are stored in both the page cache (as copies of is either in the page cache already or just loaded to the on-disk structures) and the heap. Note that block point- page cache from disk. After the write, the updated block ers inside indirect blocks are also metadata, but they only stays in the page cache for at most 30 seconds and then reside in the page cache. Uberblocks and vdev labels, on it is flushed to disk. During the flush, a new physical the other hand, only stay in the heap. block is allocated and a new checksum is generated for the dirty target block. The new disk address and check- 5.1.2 Lifecycle of a block sum are then written to the block pointer contained in To help the reader understand the vulnerability of ZFS to the parental block, thus making it dirty. After the target memory corruptions discussed in later sections, Figure 3 block is written to the disk, the flush procedure contin- illustrates one example of the lifecycle of a block (i.e., ues to allocate a new block and calculate a new check- how a block is read from and written asynchronously to sum for the parental block, which in turn dirties its sub- disk). To simplify the explanation, we consider a pair of sequent parental block. Following the updates of block blocks in which the target block to be read or written is pointers along the tree (solid arrows in Figure 2), it fi- pointed to by a block pointer contained in the parental nally reaches the uberblock which is self-checksummed. block. The target block could be a data block or a meta- After the flush, the target block is kept in the page cache data block. The parental block could be an indirect block until it is evicted. (full of block pointers), a dnode block (array of dnodes, 5.2 Methodology of analysis each of which contains block pointers) or an object set In this section, we discuss the fault injection framework, block (a dnode is embedded in it). The user of the block and the test procedure and workloads. The injection could be a user-level application or ZFS itself. Note that framework is similar to the one used for on-disk experi- only the target block is shown in the figure. ments. The only difference is the pseudo-driver,which in At first, the target block is read from disk to memory. this case, interacts with the ZFS stack by calling internal For read, there are two scenarios, as shown in the left functions to locate the in-memory structures. half of Figure 3. 
On first read of a target block not in the page cache, it is read from the disk and immediately 5.2.1 Test procedure and workloads verified against the checksum stored in the block pointer We wished to find out the behavior of ZFS in response in the parental block. Then the target block is returned to to corruptions in different in-memory objects. Since all the user. On a subsequent read of a block already in the data and metadata in memory are uncompressed, we per- page cache, the read request gets the cached block from formed a controlled fault injection in various objects. For the page cache directly, without verifying the checksum. metadata, we randomly flipped a bit in each individual In both cases, after the read, the target block stays in field of the structure separately; for data, we randomly the page cache until evicted. The block remains in the corrupteda bit in a data block of a file in memory. We re- page cache for an unbounded interval of time depend- peated each fault injection test five times. We performed Object Data Structures Workload Data Structure Fields MOS dnode dnode t, dnode phys t dnode t dn nlevels, dn bonustype, dn indblkshift, Object dnode t, dnode phys t, zfs create, dn nblkptr, dn datablkszsec, dn maxblkid, directory mzap phys t, mzap ent phys t zfs destroy, dn compress, dn bonuslen, dn checksum, Dataset dnode t, dnode phys t, zfs rename, dn type dsl dataset phys t zfs list, dnode phys t dn nlevels, dn bonustype, dn indblkshift, Dataset dnode t, dnode phys t, zfs mount, dn nblkptr, dn datablkszsec, dn maxblkid, directory dsl dir phys t zfs umount dn compress, dn bonuslen, dn checksum, Dataset dnode t, dnode phys t, dn type, dn used, dn flags, child map mzap phys t, mzap ent phys t mzap phys t mz block type, mz salt FS dnode dnode t, dnode phys t zfs umount, mzap ent phys t mze value, mze name Master node dnode t, dnode phys t, traversal znode phys t zp mode, zp size, zp links, mzap phys t, mzap ent phys t zp flags, zp parent File dnode t, dnode phys t, open, , lseek, dsl dir phys t dd head dataset obj, dd child dir zapobj, znode phys t read, write, access, dd parent obj Dir dnode t, dnode phys t, link, unlink, dsl dataset phys t ds dir obj znode phys t, rename, truncate mzap phys t, mzap ent phys t (chdir, mkdir, rmdir) Table 4: Summary of data structures and fields cor- Table 3: Summary of objects and data structures cor- rupted. The table lists all fields we corrupted in the in- rupted. The table presents a summary of all the ZFS objects and memory experiments. mzap phys t and mzap ent phys t structures corrupted in our in-memory analysis, along with their are metadata stored in ZAP blocks. The last three structures data structures and the workloads exercised on them. are object-specific structures stored in the dnode bonus buffer. fault injection tests on nine different types of objects at cases with apparent problems. In other cases that are ei- two levels (zfs and zpool) and exercised different set of ther indicated by a dot (.) in the result cells or not shown workloads as listed in Table 3. Table 4 shows all data at all in Table 5, the corresponding operation either did structures inside the objects and all the fields we cor- not access the corrupted field or completed successfully rupted during the experiments. with the corrupted field. However, in all cases, ZFS did For data blocks, we injected bit flips at an appropriate not correct the corrupted field. time as described below. 
For reads, we flipped a random Next we present our observationson ZFS behavior and bit in the data block after it was loaded to the page cache; user-visible results. The first five observations are about then we issued a subsequent read() on that block to see if ZFS behavior and the last five observations are about ZFS returned the corrupted block. In this case, the read() user-visible results of memory corruptions. call fetched the block from the page cache. For writes, Observation 1: ZFS does not use the checksums in we corrupted the block after the write() call finished but the page cache along with the blocks to detect memory before the target block was written to the disk. corruptions. Checksums are the first guard for detect- For metadata, in our fault injection experiments, we ing data corruption in ZFS. However, when a block is covered a broad range of metadata structures. However, already in the page cache, ZFS implicitly assumes that it to reduce the sample space for experiments to more in- is protected against corruptions. In the case of reads, the teresting cases, we made two choices. First, we always checksum is verified only when the block is being read injected faults to the in-memory structure after it was ac- from the disk. Following that, as long as the block stays cessed by the file system, so that both the in-heap version in the page cache, it is never checked against the check- and page cache version already exist in the memory. Sec- sum, despite the checksum also being in the page cache ond, among the in-heap structures, we only corrupted the (in the block pointer contained in its parental block). The dnode t structure (in-heap version of dnode phys t). result is that ZFS returns bad data to the user on reads. The dnode structure is the most widely used metadata structure in ZFS and every object in ZFS is represented For writes, the checksum is generated only when the by a dnode. Hence, we anticipate that corrupting the in- block is being written to disk. Before that, the dirty block heap dnode structure will cover many interesting cases. stays in the page cache with an outdated checksum in the block pointer pointing to it. If the block is corrupted in 5.3 Results and observations the page cache before it is flushed to disk, ZFS calcu- We present the results of our in-memory experiments in lates a checksum for the bad block and stores the new Table 5. As shown, ZFS fails to catch data block corrup- checksum in the block pointer. Both the block and its tions due to memory errors in both read and write exper- parental block containing the block pointer are written iments. Single bit flips in metadata blocks not only lead to disk. On subsequent reads of the block, it passes the to returning bad data blocks, but also cause more serious checksum verification and is returned to the user. problems like failure of operations and system crashes. Moreover, since the detection mechanisms already Note that Table 5 is a subset of the results showing only fail to detect memory corruptions, recovery mechanisms Dataset Dataset directory File Dir MOS dnode childmap Dataset Structure Field ORW A UNT OALUNTMCD cdrlmu cdr lmu c d r cdrlm dn type ...... CCCCC C ...... dn indblkshift . BC .. C .. .. EEE . E . E ...... dn nlevels .. C ... C .. CCC . C . C CCCCC C ...... CCC CCC .. dnode t dn checksum .. C ... C ...... dn compress .. C ...... dn maxblkid ...... C ...... C ...... dn indblkshift .... C ...... dn nlevels . BC C . C ...... C ...... C ...... dnode phys t dn nblkptr ...... C ... 
dn bonuslen .. C ...... C ...... C ... dn maxblkid . B .. C . C ...... C ...... C .... . C . . C ... zp size ...... E znode phys t zp flags E .. E . EE EEEEEEEEE dd head dataset obj EEEE .. dsl dir phys t dd child dir zapobj EC EC EC EC EC C dsl dataset phys t ds dir obj . EE .. data block B B

Table 5: In-memory corruption results. The table shows a subset of memory corruption results. The operations exercised are O(open), R(read), W(write), A(access), L(link), U(unlink), N(rename), T(truncate), M(mkdir), C(chdir), D(rmdir), c(zfs create), d(zfs destroy), r(zfs rename), l(zfs list), m(zfs mount) and u(zfs umount). Each result cell indicates whether the system crashed (C), whether the operation failed with wrong results or with a misleading message (E), whether a bad data block was returned (B) or whether the operation completed (.). Large blanks mean that the operations are not applicable. such as ditto blocks and the mirrored zpool are not trig- to disk in each flush, we conjecture that the likelihood gered to recover from the damage. of memory corruption leading to permanent on-disk cor- The results in Table 5 indicate that when a data block ruptions is high. was corrupted, the application that issued a read() or We did a block-based fault injection to verify this ob- write() request was returned bad data (B), as shown in servation. We injected a single bit flip to a dirty (or to-be- the last row. When metadata blocks were corrupted, ZFS dirtied) block before a flush; as long as the flipped bit in accessed the corrupted data structures and thus behaved the block was not overwritten by subsequent operations, wrongly, as shown by other cases in the result table. the corrupted block was written to disk permanently. Observation 2: The window of vulnerability of blocks Observation 4: Dirtying blocks due to updating file in the page cache is unbounded. As Figure 3 shows, af- access time increases the possibility of making corrup- ter a block is loaded into the page cache by first read, it tions permanent. By default, access time updates are en- stays there until evicted. During this interval, if a cor- abled in ZFS; therefore, a read-only workload will up- ruption happens to the block, any subsequent read will date the access time of any file accessed. Consequently, get the corrupted block because the checksum is not ver- when the structure containing the access time (znode) ified. Therefore, as long as the block is in the page cache goes inactive (or when there is another workload that up- (unbounded), it is susceptible to memory corruptions. dates the znode), ZFS writes the block holding the zn- Observation 3: Since checksums are created when ode to disk and updates and writes all its parental blocks. blocks are written to disk, any corruption to blocks that Therefore, any corruption to these blocks will become are dirty (or will be dirtied) is written to disk perma- permanent after the flush caused by the access time up- nently on a flush. As described in Section 5.1.2, dirty date. Further, as mentioned earlier, the time interval blocks in the page cache are written to disk during a when the corruption could happen is unbounded. flush. During the flush, any dirty block will further cause Observation 5: For most metadata blocks in the page updates of all its parental blocks; a new checksum is then cache, checksums are not valid and thus useless in de- calculated for each updated block and all of them are tecting memory corruptions. By default, most metadata flushed to disk. If a memorycorruptionhappensto any of blocks such as indirect blocks and dnode blocks are com- those blocks before a flush (above the black dotted line pressed on disk. 
Since the checksums for these blocks before G in Figure 3), the corrupted block is written to are used to prevent disk corruptions, they are only valid disk with a new checksum. The checksum is thus valid for compressed blocks, which are calculated after they for the corrupted block, which makes the corruption per- are compressed during writes and verified before they are manent. Since the window of vulnerability is long (30 decompressed during reads. When metadata blocks are seconds), and there are many blocks that will be flushed in the page cache, they are uncompressed. Therefore, the checksums contained in the corresponding block point- wrong blocks are returned to the user also have the po- ers are useless. tential for security vulnerabilities. We now discuss our observations about user-visible re- Observation 9: There is no recovery for corrupted sults of memory corruptions. metadata. In the cases where no apparent error happened Observation 6: When metadata is corrupted, oper- (as indicated by a dot or not shown) and the operation ations fail with wrong results, or give misleading error was not meant to update the corrupted field, the corrup- (E). As shown in Table 5, when zp flags in tion remained in the metadata block in the page cache. dnode phys t for a file object was corrupted, in one In summary, ZFS fails to detect and recover from case open() returned an error code EACCES (permis- many corruptions. Checksums in the page cache are not sion denied). This case occurred when the 41st bit of used to protect the integrity of blocks. Therefore, bad zp flags was flipped from 0 to 1, which signifies that data blocks are returned to the user or written to disk. the file is quarantined by an anti-virus software. There- Moreover, corrupted metadata blocks are accessed by fore, open() was incorrectly denied, giving an error code ZFS and lead to operation failure and system crashes. EACCES. The calls access(), rename() and truncate() also failed for the same reason. 6 Probability of bit-flip induced failures Another example of a misleading error mes- In this section, we present a preliminary analysis of the sage happened when dd head dataset obj in likelihood of different failure scenarios due to memory dsl dir phys t for a dataset directory object was errors in a system using ZFS. Specifically, given that one corrupted; there is one case where “zfs create” failed to random bit in memory is flipped, we compute the proba- create a new file system under the parent file system rep- bilities of four scenarios: reading corrupt data (R), writ- resented by the corrupted object. ZFS gave a misleading ing corrupt data (W), crashing/hanging (C) and running error message saying that the parent file system did not successfully to complete (S). These probabilities help us exist. ZFS gave similar error messages in other cases (E) to understand how severely filesystem data integrity is under “Dataset directory” and “Dataset”. affected by memory corruptions and how much effort A case where wrong results are returned occurred filesystem developers should make to add extra protec- when dd child dir zapobj was corrupted. This field tion to maintain data integrity. refers to a dataset child map object containing references to child file systems. On corrupting this field, “zfs list”, 6.1 Methodology which should list all file systems in the pool, did not list We apply fault-injection techniques to perform the analy- the child file systems of the corrupted dataset directory. sis. 
Considering one run of a specific workload as a trial, Observation 7: Many corruptions lead to a system we inject a fixed number number of random bit flips to crash (C). For example, when dn nlevels (the height of the memory and record how the system reacts. There- the block tree pointed to by the dnode) in dnode phys t fore, by doing multiple trials, we measure the number for a file object was corrupted and the file was read, the of trials where each scenario occurs, thus estimating the system crashed due to a NULL pointer dereference. In probability of each scenario given that certain number of this case, ZFS used the wrong value of dn nlevels to bits are flipped. Then, we calculate the probability of traverse the block tree of the file object and obtained an each scenario given the occurrence of one single bit flip. invalid block pointer. Therefore, the block size obtained We have extended our fault injection framework to from the block pointer was an arbitrary value, which was conduct the experiments. We replaced the pseudo-driver then used to index into an array whose size was much with a user-level “injector” which injects random bit flips less than the value. As a result, the system crashed when to the physical memory. We used filebench [50] to gener- a NULL pointer was dereferenced. ate complex workloads. We modified filebench such that Observation 8: The read() system call may return it always writes predefined data blocks (e.g., full of 1s) bad data. As shown in Table 5, for metadata corruptions, to disk. Therefore, we can check every read operation there were three cases where read() gave bad data block to verify that the returned data matches the predefined to the user. In these cases, ZFS simply trusted the value pattern. We can also verify the data written to disk by of the corrupted field and used it to traverse the block checking the contents of on-disk files. tree pointed to by the dnode, thus returning bad blocks. We used the framework as follows. For a specific For example, when dn nlevels in dnode phys t for a workload, we ran 100 trials. For each trial, we used the file object was changed from 3 to 1, ZFS gave an incor- injector to generate 16 random bit flips at the same time rect block to the user on a read request for the first block when the workload has been running for 3 minutes. We of the file. The bad block was returned because ZFS as- then kept the workload running for 5 minutes. Any oc- sumed that the tree only had one level, and incorrectly currence of reading corrupt data (R) was reported. When returned an indirect block to the user. Such cases where the workload was done, we checked all on-disk files to see if there was any corrupt data written to the disk (W). Workload P16(R) P16(W ) P16(C) P16(S) Since we only verify write operations after each run of varmail 9% [4, 17] 0% [0, 3] 5% [1, 12] 86% [77, 93] a workload, some intermediate corrupt data might have oltp 26% [17, 36] 2% [0, 8] 16% [9, 25] 60% [49, 70] webserver 11% [5, 19] 20% [12, 30] 19% [11, 29] 61% [50, 71] been overwritten and thus the actual number of occur- fileserver 69% [58, 78] 44% [34, 55] 23% [15, 33] 28% [19, 38] rence of writing corrupt data could be higher than mea- sured here. 
We also logged whether the system hung or Workload P1(R) P1(W ) P1(C) P1(S) crashed (C) during each trial, but we did not determine if varmail 0.6% [0.2, 1.2] 0% [0, 0.2] 0.3% [0.1, 0.8] 99.1% [98.4, 99.5] oltp 1.9% [1.2, 2.8] 0.1% [0, 0.5] 1.1% [0.6, 1.8] 96.9% [95.7, 97.8] it was due to corruption of ZFS metadata or other kernel webserver 0.7% [0.3, 1.3] 1.4% [0.8, 2.2] 1.3% [0.7, 2.1] 97.0% [95.8, 97.9] data structures. fileserver 7.1% [5.4, 9.0] 3.6% [2.5, 4.8] 1.6% [1.0, 2.5] 92.4% [90.2, 94.2] It is important to notice that we injected 16 bit flips in each trial because it let us observe a sufficient number Table 6: P16(X) and P1(X). The upper table presents of failure trials in 100 trials. However, we apply the fol- percentage values of the probabilities and 95% confidence in- lowing calculation to derive the probabilities of different tervals (in square brackets) of reading corrupt data (R), writ- ing corrupt data (W), crash/hang and everything being fine (S), failure scenarios given that 1 bit is flipped. given that 16 bits are flipped, on a machine of 2GB memory. The lower table gives the derived percentage values given that 6.2 Calculation 1 bit is corrupted. The working set size of each workload is We use Pk(X) to represent the probability of scenario X less than 2GB; the average amount of page cache consumed by given that k random bits are flipped, in which X could each workload after the bit flips are injected is 31MB (varmail), be R, W, C or S. Therefore, Pk(X¯)=1 − Pk(X) is 129MB (oltp), 441MB (webserver) and 915MB (fileserver). the probability of scenario X not happening given that Table 6 provides the probabilities and confidence in- k bits are flipped. In order to calculate P1(X), we first tervals given that 16 bits are flipped and the derived val- measure Pk(X) using the method described above and ues given that 1 bit is flipped. Note that for each work- then derive P1(X) from Pk(X), as explained below. load, the sum of Pk(R), Pk(W ), Pk(C) and Pk(S) is • Measure Pk(X) Given that k random bit flips are not necessary equal to 1, because there are cases where injected in each trial, we denote the total number of multiple failure scenarios occur in one trial. trials as N and the number of trials in which sce- From the lower table in Table 6, we see that a single nario X occurs at least once as NX . Therefore, bit flip in memory causes a small but non-negligible per- NX centage of runs to experience an failure. For all work- Pk(X)= N loads, the probability of reading corrupt data is greater than 0.6% and the probability of crashing or hanging is • Derive P1(X) Assume k bit flips are independent, higher than 0.3%. The probability of writing corrupt data then we have varies widely from 0 to 3.6%. Our results also show that ¯ ¯ k Pk(X) = (P1(X)) , when X = R, W or C in most cases, when the working set size is less than the k Pk(X) = (P1(X)) , when X = S memory size, the more page cache the workload con- sumes, the more likely that a failure would occur if one Substituting Pk(X¯)=1−Pk(X) into the equations bit is flipped. above, we can get, In summary, when a single bit flip occurs, the chances 1 P1(X)=1−(1−Pk(X)) k , when X = R, W or C of failure scenarios happening can not be ignored. There- 1 fore, efforts should be made to preserve data integrity in P1(X) = (Pk(X)) k , when X = S memory and prevent these failures from happening.
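For reference, the derivation used in Section 6.2 can be written compactly as follows; the independence of the k injected bit flips is the stated assumption.

\[
  P_k(X) \;=\; \frac{N_X}{N}, \qquad
  P_k(\bar{X}) \;=\; 1 - P_k(X).
\]
\[
  P_k(\bar{X}) = \bigl(P_1(\bar{X})\bigr)^k
  \;\Longrightarrow\;
  P_1(X) = 1 - \bigl(1 - P_k(X)\bigr)^{1/k}
  \quad (X \in \{R, W, C\}),
\]
\[
  P_k(S) = \bigl(P_1(S)\bigr)^k
  \;\Longrightarrow\;
  P_1(S) = \bigl(P_k(S)\bigr)^{1/k}.
\]

For example, plugging the fileserver measurement P16(R) = 69% from Table 6 into the first formula gives P1(R) = 1 - (1 - 0.69)^(1/16) ≈ 7.1%, the value reported in the lower half of the table.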

6.3 Results

The analysis is performed on the same virtual machine as mentioned in Section 4.2.1. The machine is configured with 2GB of non-ECC memory and a single disk running ZFS. We first ran some controlled micro-benchmarks (e.g., sequential read) to verify that the methodology and the calculation are correct (the results are not shown due to limited space). Then, we chose four workloads from filebench: varmail, oltp, webserver and fileserver, all of which were exercised with their default parameters. A detailed description of these workloads can be found elsewhere [50].

Workload     P_16(R)        P_16(W)        P_16(C)        P_16(S)
varmail      9%  [4, 17]    0%  [0, 3]     5%  [1, 12]    86% [77, 93]
oltp         26% [17, 36]   2%  [0, 8]     16% [9, 25]    60% [49, 70]
webserver    11% [5, 19]    20% [12, 30]   19% [11, 29]   61% [50, 71]
fileserver   69% [58, 78]   44% [34, 55]   23% [15, 33]   28% [19, 38]

Workload     P_1(R)            P_1(W)           P_1(C)           P_1(S)
varmail      0.6% [0.2, 1.2]   0%   [0, 0.2]    0.3% [0.1, 0.8]  99.1% [98.4, 99.5]
oltp         1.9% [1.2, 2.8]   0.1% [0, 0.5]    1.1% [0.6, 1.8]  96.9% [95.7, 97.8]
webserver    0.7% [0.3, 1.3]   1.4% [0.8, 2.2]  1.3% [0.7, 2.1]  97.0% [95.8, 97.9]
fileserver   7.1% [5.4, 9.0]   3.6% [2.5, 4.8]  1.6% [1.0, 2.5]  92.4% [90.2, 94.2]

Table 6: P_16(X) and P_1(X). The upper table presents percentage values of the probabilities and 95% confidence intervals (in square brackets) of reading corrupt data (R), writing corrupt data (W), crashing or hanging (C), and everything being fine (S), given that 16 bits are flipped, on a machine with 2GB of memory. The lower table gives the derived percentage values given that 1 bit is corrupted. The working set size of each workload is less than 2GB; the average amount of page cache consumed by each workload after the bit flips are injected is 31MB (varmail), 129MB (oltp), 441MB (webserver) and 915MB (fileserver).

Table 6 provides the probabilities and confidence intervals given that 16 bits are flipped, along with the derived values given that 1 bit is flipped. Note that for each workload, the sum of P_k(R), P_k(W), P_k(C) and P_k(S) is not necessarily equal to 1, because there are cases where multiple failure scenarios occur in one trial.

From the lower table in Table 6, we see that a single bit flip in memory causes a small but non-negligible percentage of runs to experience a failure. For all workloads, the probability of reading corrupt data is greater than 0.6% and the probability of crashing or hanging is higher than 0.3%. The probability of writing corrupt data varies widely, from 0 to 3.6%. Our results also show that, in most cases, when the working set size is less than the memory size, the more page cache the workload consumes, the more likely it is that a failure will occur if one bit is flipped.

In summary, when a single bit flip occurs, the chances of these failure scenarios happening cannot be ignored. Therefore, efforts should be made to preserve data integrity in memory and prevent these failures from happening.

7 Beyond ZFS

In addition to ZFS, we have applied the same fault injection framework used in Section 5 to a simpler filesystem, ext2. Our initial results indicate that ext2 is also vulnerable to memory corruptions. For example, corrupt data can be returned to the user or written to disk. When certain fields of a VFS inode are corrupted, operations on that inode fail or the whole system crashes. If the inode is dirty, the corrupted fields of the VFS inode are propagated to the inode in the page cache and are then written to disk, making the corruptions permanent. Moreover, if the superblock in the page cache is corrupted and flushed to disk, it might result in an unmountable filesystem.

In summary, so far we have studied two extremes: ZFS, a complex filesystem with many techniques to maintain on-disk data integrity, and ext2, a simpler filesystem with few mechanisms to provide extra reliability. Both are vulnerable to memory corruptions. It seems that regardless of the complexity of the file system and the amount of machinery used to protect against disk corruptions, memory corruptions are still a problem.

8 Related work

Software-implemented fault injection techniques have been widely used to analyze the robustness of systems [10, 17, 25, 31, 48, 55]. For example, FINE used fault injection to emulate hardware and software faults in the operating system [31]; Weining et al. [25] injected faults into the instruction streams of Linux kernel functions to characterize Linux kernel behavior.

More recent works [5, 8, 44] have applied type-aware fault injection to analyze the failure behavior of different file systems under disk corruptions. Our analysis of on-disk data integrity in ZFS is similar to these studies.

Further, fault injection has also been used to analyze the effects of memory corruptions on systems. FIAT [10] used fault injection to study the effects of memory corruptions in a distributed environment. Krishnan et al. applied a memory corruption framework to analyze the effects of metadata corruption on NFS [33]. Our study of in-memory data integrity shares with these studies the goal of understanding the effects of memory corruptions.

However, our work on ZFS is the first comprehensive reliability analysis of a local file system that covers carefully controlled experiments to analyze both on-disk and in-memory data integrity. Specifically, for our study of memory corruptions, we separately analyze ZFS behavior for faults in page cache metadata and data and for metadata structures in the heap. To the best of our knowledge, this is the first such comprehensive study of end-to-end file system data integrity.

9 Summary and discussion

In this paper, we analyzed a state-of-the-art file system, ZFS, to study the implications of disk and memory corruptions for data integrity. We used carefully controlled fault injection experiments to simulate realistic disk and memory errors and presented our observations about ZFS behavior and its robustness.

While the reliability mechanisms in ZFS are able to provide reasonable robustness against disk corruptions, memory corruptions still remain a serious problem for data integrity. Our results for memory corruptions indicate cases where bad data is returned to the user, operations silently fail, and the whole system crashes. Our probability analysis shows that a single bit flip has a small but non-negligible chance of causing failures such as reading or writing corrupt data and crashing the system.

We argue that file systems should be designed with end-to-end data integrity as a goal. File systems should not only provide protection against disk corruptions, but also aim to protect data from memory corruptions. Although dealing with memory corruptions is hard, we conclude by discussing some techniques that file systems can use to increase protection against memory corruptions.

Block-level checksums in the page cache: File systems could protect the vulnerable data and metadata blocks in the page cache by using checksums. For example, ZFS could use the checksums inside block pointers in the page cache, update them on block updates, and verify the checksums on reads. However, this does incur an overhead in computation as well as some complexity in implementation; these are the tradeoffs one always has to make for reliability.
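A minimal sketch of this idea is shown below; the structure and the toy checksum are illustrative stand-ins, not ZFS code (ZFS's actual cached buffers and checksum functions, such as arc_buf_t and fletcher4, are more involved). The checksum is recomputed whenever a cached block is modified and verified before the block's contents are handed back to a reader:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical cached-block descriptor protected by a checksum. */
    struct cached_block {
        uint8_t  data[4096];
        uint64_t cksum;                     /* checksum over data[] */
    };

    static uint64_t block_checksum(const uint8_t *p, size_t n)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++)      /* toy checksum; a real system   */
            sum = sum * 31 + p[i];          /* would use Fletcher or SHA-256 */
        return sum;
    }

    /* Recompute the checksum after every in-cache update... */
    static void cache_block_update_done(struct cached_block *b)
    {
        b->cksum = block_checksum(b->data, sizeof(b->data));
    }

    /* ...and verify it before returning cached contents to a reader;
     * a mismatch means the block was corrupted while sitting in memory. */
    static int cache_block_verify(const struct cached_block *b)
    {
        return block_checksum(b->data, sizeof(b->data)) == b->cksum;
    }

The cost is exactly the computation overhead mentioned above, paid on every in-cache update and every read.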
Metadata checksums in the heap: Even with block-level checksums in the page cache, there are still copies of metadata structures in the heap that are vulnerable to memory corruptions. To provide end-to-end data integrity, data-structure checksums may be useful in protecting these in-heap metadata structures.

Programming for error detection: Many serious effects of memory corruptions can be mitigated by using simple programming practices. One technique is to use existing redundancy in data structures for simple consistency checks. For instance, the case described in Observation 8 (Section 5.3) could be detected by comparing the expected level calculated from the dn_nlevels field of dnode_phys_t with the actual level stored inside the first block pointer. Another simple technique is to include magic numbers in metadata structures for sanity checking. For example, some "crash" cases happened due to bad block pointers obtained during the block tree traversal (Observation 7 in Section 5.3). Using a magic number in block pointers could help detect such cases and prevent unexpected behavior.
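A rough sketch of both checks follows. The structures are simplified stand-ins for dnode_phys_t and blkptr_t, the magic field is hypothetical, and we assume here that the top-level block pointer records level dn_nlevels − 1:

    #include <stdint.h>

    #define BLKPTR_MAGIC 0x0badcafe0badcafeULL   /* hypothetical magic value */

    /* Simplified stand-ins for blkptr_t and dnode_phys_t. */
    struct blkptr {
        uint64_t magic;      /* added magic number for sanity checking    */
        uint8_t  level;      /* level of the block this pointer refers to */
        /* ... block addresses, checksum, etc. ... */
    };

    struct dnode_phys {
        uint8_t       dn_nlevels;   /* height of the block tree        */
        struct blkptr dn_blkptr0;   /* first (top-level) block pointer */
    };

    /* Cross-check redundant fields before trusting them for traversal. */
    static int dnode_sane(const struct dnode_phys *dn)
    {
        if (dn->dn_blkptr0.magic != BLKPTR_MAGIC)
            return 0;                                 /* corrupted pointer   */
        if (dn->dn_blkptr0.level != dn->dn_nlevels - 1)
            return 0;                                 /* inconsistent height */
        return 1;
    }

If such a check fails, the file system can return an error instead of traversing the block tree with a bogus pointer.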
Acknowledgment

We thank the anonymous reviewers and Craig Soules (our shepherd) for their tremendous feedback and comments, which have substantially improved the content and presentation of this paper. We thank Asim Kadav for his initial work on ZFS on-disk fault injection. We also thank the members of the ADSL research group for their insightful comments.

This material is based upon work supported by the National Science Foundation under the following grants: CCF-0621487, CNS-0509474, CNS-0834392, CCF-0811697, CCF-0937959, as well as by generous donations from NetApp, Inc., Sun Microsystems, and Google.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or other institutions.

References

[1] CERT/CC Advisories. http://www.cert.org/advisories/.
[2] Kernel Bug Tracker. http://bugzilla.kernel.org/.
[3] US-CERT Vulnerabilities Notes Database. http://www.kb.cert.org/vuls/.
[4] D. Anderson, J. Dykes, and E. Riedel. More Than an Interface: SCSI vs. ATA. In FAST, 2003.
[5] L. N. Bairavasundaram, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Dependability Analysis of Virtual Memory Systems. In DSN, 2006.
[6] L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler. An Analysis of Latent Sector Errors in Disk Drives. In SIGMETRICS, 2007.
[7] L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. An Analysis of Data Corruption in the Storage Stack. In FAST, 2008.
[8] L. N. Bairavasundaram, M. Rungta, N. Agrawal, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and M. M. Swift. Analyzing the Effects of Disk-Pointer Corruption. In DSN, 2008.
[9] W. Bartlett and L. Spainhower. Commercial Fault Tolerance: A Tale of Two Systems. IEEE Trans. on Dependable and Secure Computing, 1(1), 2004.
[10] J. Barton, E. Czeck, Z. Segall, and D. Siewiorek. Fault Injection Experiments Using FIAT. IEEE Trans. on Comp., 39(4), 1990.
[11] R. Baumann. Soft errors in advanced computer systems. IEEE Des. Test, 22(3):258–266, 2005.
[12] E. D. Berger and B. G. Zorn. Diehard: probabilistic memory safety for unsafe languages. In PLDI, 2006.
[13] J. Bonwick. RAID-Z. http://blogs.sun.com/bonwick/entry/raid_z.
[14] J. Bonwick and B. Moore. ZFS: The Last Word in File Systems. http://opensolaris.org/os/community/zfs/docs/zfs_last.pdf.
[15] F. Buchholz. The structure of the Reiser file system. http://homes.cerias.purdue.edu/~florian/reiser/reiserfs.php.
[16] R. Card, T. Ts'o, and S. Tweedie. Design and Implementation of the Second Extended Filesystem. In Proceedings of the First Dutch International Symposium on Linux, 1994.
[17] J. Carreira, H. Madeira, and J. G. Silva. Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers. IEEE Trans. on Software Engg., 1998.
[18] J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, and A. Gupta. Hive: Fault Containment for Shared-Memory Multiprocessors. In SOSP, 1995.
[19] C. L. Chen. Error-correcting codes for semiconductor memories. SIGARCH Comput. Archit. News, 12(3):245–247, 1984.
[20] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An Empirical Study of Operating System Errors. In SOSP, 2001.
[21] N. Dor, M. Rodeh, and M. Sagiv. CSSV: towards a realistic tool for statically detecting all buffer overflows in C. In PLDI, 2003.
[22] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf. Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code. In SOSP, 2001.
[23] A. Eto, M. Hidaka, Y. Okuyama, K. Kimura, and M. Hosono. Impact of neutron flux on soft errors in MOS memories. In IEDM, 1998.
[24] R. Green. EIDE Controller Flaws Version 24. http://mindprod.com/jgloss/eideflaw.html.
[25] W. Gu, Z. Kalbarczyk, R. K. Iyer, and Z. Yang. Characterization of Linux Kernel Behavior Under Errors. In DSN, 2003.
[26] H. S. Gunawi, A. Rajimwale, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. SQCK: A Declarative File System Checker. In OSDI, 2008.
[27] S. Hallem, B. Chelf, Y. Xie, and D. Engler. A system and language for building system-specific, static analyses. In PLDI, 2002.
[28] J. Hamilton. Successfully Challenging the Server Tax. http://perspectives.mvdirona.com/2009/09/03/SuccessfullyChallengingTheServerTax.aspx.
[29] R. Hastings and B. Joyce. Purify: Fast detection of memory leaks and access errors. In USENIX Winter, 1992.
[30] T. J. Dell. A white paper on the benefits of chipkill-correct ECC for PC server main memory. IBM Microelectronics Division, 1997.
[31] W. Kao, R. K. Iyer, and D. Tang. FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior Under Faults. IEEE Trans. on Software Engg., 1993.
[32] A. Krioukov, L. N. Bairavasundaram, G. R. Goodson, K. Srinivasan, R. Thelen, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Parity Lost and Parity Regained. In FAST, 2008.
[33] S. Krishnan, G. Ravipati, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and B. P. Miller. The Effects of Metadata Corruption on NFS. In StorageSS, 2007.
[34] X. Li, K. Shen, M. C. Huang, and L. Chu. A memory soft error measurement on production systems. In USENIX, 2007.
[35] T. C. May and M. H. Woods. Alpha-particle-induced soft errors in dynamic memories. IEEE Trans. on Electron Dev., 26(1), 1979.
[36] N. Megiddo and D. Modha. ARC: A self-tuning, low overhead replacement cache. In FAST, 2003.
[37] R. C. Merkle. A digital signature based on a conventional encryption function. In CRYPTO, 1987.
[38] D. Milojicic, A. Messer, J. Shau, G. Fu, and A. Munoz. Increasing relevance of memory hardware errors: a case for recoverable programming models. In ACM SIGOPS European Workshop, 2000.
[39] B. Moore. Ditto Blocks - The Amazing Tape Repellent. http://blogs.sun.com/bill/entry/ditto_blocks_the_amazing_tape.
[40] E. Normand. Single event upset at ground level. IEEE Transactions on Nuclear Science, 43(6):2742–2750, 1996.
[41] T. J. O'Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh. Field testing for soft errors in semiconductor memories. IBM J. Res. Dev., 40(1):41–50, 1996.
[42] Oracle. Btrfs: A Checksumming Copy on Write Filesystem. http://oss.oracle.com/projects/btrfs/.
[43] D. Patterson, G. Gibson, and R. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In SIGMOD, 1988.
[44] V. Prabhakaran, L. N. Bairavasundaram, N. Agrawal, H. S. Gunawi, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. IRON File Systems. In SOSP, 2005.
[45] F. Qin, S. Lu, and Y. Zhou. SafeMem: Exploiting ECC-memory for detecting memory leaks and memory corruption during production runs. In HPCA, 2005.
[46] B. Schroeder, E. Pinheiro, and W.-D. Weber. DRAM errors in the wild: a large-scale field study. In SIGMETRICS, 2009.
[47] T. J. Schwarz, Q. Xin, E. L. Miller, D. D. Long, A. Hospodor, and S. Ng. Disk Scrubbing in Large Archival Storage Systems. In MASCOTS, 2004.
[48] D. Siewiorek, J. Hudak, B. Suh, and Z. Segal. Development of a Benchmark to Measure System Robustness. In FTCS-23, 1993.
[49] C. A. Stein, J. H. Howard, and M. I. Seltzer. Unifying File System Protection. In USENIX, 2001.
[50] Sun Microsystems. Solaris Internals: FileBench. http://www.solarisinternals.com/wiki/index.php/FileBench.
[51] Sun Microsystems. ZFS On-Disk Specification. http://www.opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf.
[52] R. Sundaram. The Private Lives of Disk Drives. http://partners.netapp.com/go/techontap/matl/sample/0206tot_resiliency.html.
[53] M. M. Swift, B. N. Bershad, and H. M. Levy. Improving the Reliability of Commodity Operating Systems. In SOSP, 2003.
[54] The Data Clinic. Hard Disk Failure. http://www.dataclinic.co.uk/hard-disk-failures.htm.
[55] T. K. Tsai and R. K. Iyer. Measuring Fault Tolerance with the FTAPE Fault Injection Tool. In The 8th International Conference on Modeling Techniques and Tools for Computer Performance Evaluation, 1995.
[56] S. C. Tweedie. Journaling the Linux ext2fs File System. In The Fourth Annual Linux Expo, Durham, North Carolina, 1998.
[57] J. Wehman and P. den Haan. The Enhanced IDE/Fast-ATA FAQ. http://thef-nym.sci.kun.nl/cgi-pieterh/atazip/atafq.html.
[58] G. Weinberg. The Solaris Dynamic File System. http://members.visi.net/~thedave/sun/DynFS.pdf.
[59] A. Wenas. ZFS FAQ. http://blogs.sun.com/awenas/entry/zfs_faq.
[60] Y. Xie, A. Chou, and D. Engler. ARCHER: using symbolic, path-sensitive analysis to detect memory access errors. In FSE, 2003.
[61] J. Yang, C. Sar, and D. Engler. EXPLODE: A Lightweight, General System for Finding Serious Storage System Errors. In OSDI, 2006.
[62] J. F. Ziegler and W. A. Lanford. Effect of cosmic rays on computer memories. Science, 206(4420):776–788, 1979.