Oracle ZFS Storage—Data Integrity

ORACLE WHITE PAPER | MAY 2017

Table of Contents

Introduction 1

Overview 2

Primary Design Goals 2

Shortcomings of Traditional Data Storage 2

RAID Design Problems 2

Traditional RAID Data Integrity Study 3

ZFS Data Integrity 4

ZFS Transactional Processing 4

ZFS Ditto Blocks 5

ZFS Self-Healing 5

University Research Testing and Validation of ZFS Data Integrity 5

Defining Corruption and How Often Corruption Happens 5

Problems Identified with Other File Systems and RAID 6

Research Testing Methodology 6

Testing Configuration and Data Layout 6

Oracle ZFS Storage Appliance Robustness 7

Reducing Risk Through Advanced Data Integrity 7

Robust Protection from Hardware Failures 7

ZFS Data Encryption 8

Conclusion 8

Related Links 9

ORACLE ZFS STORAGE—DATA INTEGRITY

Introduction

Studies have shown that traditional RAID technology, while effective for disk space savings, cannot provide sufficient end-to-end data integrity to ensure against data loss or corruption in contemporary cloud-scale data storage environments. For example, traditional RAID technology is unable to isolate and protect against firmware or driver bugs that can have a substantial impact on a storage system's ability to protect against data corruption or loss.

Modern storage architectures, like those incorporated into Oracle ZFS Storage Appliance, protect against these failure modes by providing advanced data integrity technology such as hierarchical checksums, redundant metadata, transactional processing, and integrated redundancy. These features provide more comprehensive and reliable protection against data corruption or loss.

This paper explores inherent design flaws in traditional RAID products that cause data loss, provides case study evidence of these deficiencies, and demonstrates how modern technology like Oracle ZFS Storage is better suited to today's contemporary cloud-scale data storage environments.1

1 Throughout this paper, Oracle ZFS Storage is used as an umbrella term that covers numerous products, such as the Oracle ZFS Storage Appliance systems, which use the Oracle Solaris ZFS file system.


Overview

Today's vast repositories of data represent valuable intellectual property that must be protected to ensure the livelihood of the enterprise. While backup and archival capabilities are designed to guard against data loss, they do not necessarily protect against silent data corruption. Furthermore, in the event of a problem, the process of restoring archived data generally entails downtime and thus lost productivity. Even worse, not all restore operations are successful, and this can result in permanent data loss.

Primary Design Goals

A primary design goal and foundation of Oracle ZFS Storage systems is data integrity. The modern end-to-end data integrity architecture of the ZFS file system is designed to overcome the deficiencies of traditional storage products.

Oracle ZFS Storage Appliance architecture is able to deliver industry-leading performance while providing contemporary end-to-end data protection and data integrity capabilities that ensure against data loss better than traditional storage systems.

» Redirect-on-write architecture: Data is never overwritten in place.
» Snapshots provide continuous file system replication.
» Checksum protection occurs throughout the data path.
» "Self-healing" ability replaces damaged data from redundant configurations.
» Multiple levels of RAID protection meet modern capacities.
» ZFS data path reliability is available throughout the OS, application, and hardware stack.

Shortcomings of Traditional Data Storage

The problem with traditional data storage products is that there is no defense against silent data corruption. Any defect in disk, controller, cable, driver, laser, or firmware can corrupt data silently.
» File systems rely on underlying hardware to detect and report errors.
» Hardware doesn't know that a firmware bug occurred.
» If a disk returns bad data, a traditional file system won't detect it.

Even without hardware problems, data is vulnerable to in-transit damage such as controller bugs, DMA parity errors, and SAN network bugs. Block-level checksums prove only that a block is self-consistent; they do not ensure that it is the right block. There is no fault isolation between the data and the checksum that is supposed to protect it, as illustrated in Figure 1 below.

RAID Design Problems

RAID 5 and other parity-based RAID schemes suffer from a fatal flaw known as the RAID 5 write hole. When data in a RAID stripe is updated, the parity must also be updated so that the stripe can be reconstructed correctly when a disk fails. Because no way exists to update two or more disks atomically, a stripe can be left damaged if the system crashes or a power outage occurs between the data write and the parity write. The data and the parity are then inconsistent and remain inconsistent. This is a silent failure that corrupts data.
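The write hole described above can be sketched in a few lines. The following is a toy illustration only (three tiny data "disks" plus XOR parity, with the crash simulated by simply skipping the parity update); real RAID operates on large on-disk stripes:

```python
# Toy RAID 5 write-hole demo: three data "disks" plus XOR parity.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def parity(blocks):
    p = blocks[0]
    for blk in blocks[1:]:
        p = xor(p, blk)
    return p

# A consistent stripe: parity = d0 ^ d1 ^ d2.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
par = parity(stripe)

# Update d1, but "crash" before the parity disk is updated.
stripe[1] = b"XXXX"          # data write completed
# par = parity(stripe)       # <-- never happens: power is lost here

# Later, disk 0 fails. Reconstruction XORs parity with surviving data.
reconstructed_d0 = xor(xor(par, stripe[1]), stripe[2])
print(reconstructed_d0 == b"AAAA")   # False: silently reconstructed wrong data
```

Nothing in the stripe itself signals the inconsistency, which is why the failure stays silent until reconstruction returns corrupt data.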

RAID systems can provide protection with NVRAM that survives power loss, but this solution is expensive, and if the battery-backed NVRAM cache fails, a risk of data inconsistency still exists.

A well-known performance problem with RAID systems is partial-stripe writes: when the data being updated is less than a single RAID stripe, the RAID system must read the old data and parity to calculate the new parity. This recalculation slows performance. A possible solution is for the RAID system to buffer the partial-stripe writes in NVRAM, which hides the latency from the user, but the NVRAM cache can fill up. One RAID vendor's solution is to sell more expensive cache.
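The read-modify-write cost described above follows from the parity arithmetic: new_parity = old_parity XOR old_data XOR new_data. A minimal sketch (values are illustrative; the point is the two extra reads per small write):

```python
# Read-modify-write for a partial-stripe update (illustrative only):
# updating one block requires reading the old data AND the old parity first.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

old_data   = b"BBBB"                  # must be READ from disk
old_parity = b"\x03\x03\x03\x03"      # must also be READ ("AAAA" ^ "BBBB")
new_data   = b"XXXX"

# new_parity = old_parity ^ old_data ^ new_data
new_parity = xor(xor(old_parity, old_data), new_data)

# Sanity check: same result as recomputing parity over the full stripe.
print(new_parity == xor(b"AAAA", b"XXXX"))
```

A full-stripe write avoids both reads, which is why RAID systems try to buffer small writes in NVRAM until they can coalesce them.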

Today’s savvy enterprise customers are looking for a storage solution without the costs associated with silent data corruption or slow performance.

Figure 1. Traditional RAID approach leaves I/O path vulnerable.

Traditional RAID Data Integrity Study

A CERN research study on data integrity with traditional RAID solutions found that data corruption occurs at lower levels of the stack and that probes and monitors must be implemented and deployed, leading to increased OpEx and CapEx costs.2
» Testing involved writing 2 GB files with special bit patterns and afterwards reading the files back to compare the patterns. The testing was deployed on more than 3,000 nodes (disk servers, CPU servers, and so on) and run every two hours. After about five weeks of testing on the 3,000 nodes, the results revealed 500 errors on 100 nodes.
» The study's findings uncovered the following RAID problems:
» RAID controllers don't always check parity when reading data from RAID 5 file systems.
» RAID controllers don't always report problems at the disk level to the upper-level OS.
» Running a verify operation on the RAID controller reads all data from disk and recalculates the RAID 5 checksum, but the controller has no notion of what "correct" data is from a user point of view.
» Running the verify operation on 492 systems over four weeks resulted in fixes for ~300 block problems.
» The discovery of a disk firmware problem required a manual update of the firmware on 3,000 disks.
» CERN concluded that they must implement and deploy constant RAID monitoring and intensive probes that would double their original disk I/O performance requirements, and that they must also increase CPU capacity on storage servers by 50 percent to accommodate the RAID monitoring.

2 Bernd Panzer-Steindel, CERN/IT, Data Integrity, April 2007


ZFS Data Integrity

ZFS is a modern file system that provides the following data integrity components to eliminate the problems of previous storage products:
» End-to-end data integrity requires each data block to be verified against an independent checksum after the data has arrived in the host's memory.
» The ZFS design provides fault isolation between data and checksum by storing the checksum of each block in its parent block pointer—not in the block itself.
» A ZFS pool is a self-validating Merkle tree of blocks, proven to provide cryptographically strong authentication for any component of the tree, and for the tree as a whole.
» Each ZFS block contains checksums for all its children, meaning the entire pool is self-validating.
» ZFS knows it can trust a checksum because the checksum is part of another block one level higher in the tree, and that block has already been validated.
» ZFS uses this end-to-end checksum hierarchy to detect and correct silent data corruption:
» If a disk returns bad data transiently, ZFS detects it and retries the read.
» If the disk is part of a mirror or RAID-Z group, ZFS both detects and corrects the error: it uses the checksum to determine which copy is correct, provides good data to the application, and repairs the damaged copy.
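The checksum-in-parent idea above can be shown in a minimal sketch. The class and names here are illustrative stand-ins (ZFS stores checksums in on-disk block pointers and uses fletcher or SHA-256 checksums, not Python objects), but the fault isolation is the same: the checksum lives outside the block it protects.

```python
# Minimal sketch of ZFS-style checksum-in-parent verification.
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class BlockPointer:
    """Points to a child block AND carries the child's expected checksum."""
    def __init__(self, data: bytes):
        self.data = data                # stands in for the on-disk child block
        self.expected = checksum(data)  # stored in the PARENT, not the child

    def read(self) -> bytes:
        # Validate against the parent-held checksum before trusting the data.
        if checksum(self.data) != self.expected:
            raise IOError("checksum mismatch: silent corruption detected")
        return self.data

ptr = BlockPointer(b"user data")
assert ptr.read() == b"user data"     # clean read validates

ptr.data = b"user dat8"               # simulate a misdirected/phantom write
try:
    ptr.read()
except IOError as e:
    print("detected:", e)
```

Because the block cannot vouch for itself, a "self-consistent but wrong" block (the failure mode of block-level checksums described earlier) is caught at read time.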

Figure 2. The ZFS hierarchical checksums can detect more types of errors than traditional block-level checksums.

ZFS Transactional Processing

ZFS maintains data consistency in the event of system crashes by using a redirect-on-write transactional update model that is similar to a copy-on-write (COW) model, only more efficient because it generates one less copy. ZFS file system metadata and data are represented as objects and are grouped into a transaction group for modification. New copies are created for all the modified blocks (in a Merkle tree) when a transaction group is committed to disk. See Figure 1. The root of the tree structure (the uberblock) is updated atomically, maintaining an always-consistent disk state. The redirect-on-write transactional processing model, along with hierarchical checksums, eliminates the need for the journal that is part of many traditional file systems.
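The commit sequence above can be sketched as follows. All structures are illustrative stand-ins for ZFS on-disk objects; the key property is that old blocks are never overwritten, and a single root-pointer switch makes the new tree visible.

```python
# Sketch of redirect-on-write: modified blocks get NEW locations, and only
# the final atomic 'uberblock' update makes the new tree visible.

disk = {}            # address -> block contents
next_addr = 0

def write_new(block):
    """Never overwrite in place: always allocate a fresh address."""
    global next_addr
    addr = next_addr
    next_addr += 1
    disk[addr] = block
    return addr

# Initial tree: root -> leaf.
leaf = write_new({"data": "v1"})
root = write_new({"leaf": leaf})
uberblock = root     # the single atomically updated root pointer

# Transaction: modify the leaf. Old blocks are left untouched.
new_leaf = write_new({"data": "v2"})
new_root = write_new({"leaf": new_leaf})
uberblock = new_root # atomic commit: one pointer switch

# A crash before the last line would leave the old, still-consistent tree.
print(disk[disk[uberblock]["leaf"]]["data"])   # new data is now visible
```

Because the commit is one pointer update, the on-disk state is always either entirely the old tree or entirely the new one, which is why no journal replay is needed after a crash.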

ZFS Ditto Blocks

ZFS uses block pointers that reference data blocks on disk through disk virtual addresses (DVAs). A ZFS block pointer can identify not just one DVA, but up to three. Using these extra DVAs, up to three copies of a block are stored in three separate locations. These replicated copies, called ditto blocks, are stored in addition to the ZFS storage pool's redundancy configuration. The ditto-block policy ensures that the more "important" a file system block is (the closer it is to the root of the tree, and thus the more blocks it points to), the more replicated it becomes: one DVA for user data, two DVAs for file system metadata, and three DVAs for metadata that is global across all file systems in the storage pool. According to a University of Wisconsin research team's testing results, ditto blocks play a key role in recovering from data corruption.
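The copy counts in the ditto-block policy can be captured in a short sketch. The function and dictionary here are hypothetical illustrations of the policy stated above, not ZFS code; in ZFS the copies land in physically separate locations, on top of any mirror or RAID-Z redundancy of the pool itself.

```python
# Sketch of the ditto-block policy: the number of copies (DVAs) written
# grows with a block's importance in the tree.

COPIES = {
    "user_data": 1,        # one DVA for file data
    "fs_metadata": 2,      # two DVAs for file system metadata
    "pool_metadata": 3,    # three DVAs for pool-wide (global) metadata
}

def write_block(kind: str, data: bytes):
    """Return the list of independent on-disk copies ('DVAs') for a block."""
    n = COPIES[kind]
    return [("dva%d" % i, data) for i in range(n)]

print(len(write_block("pool_metadata", b"MOS")))   # pool-wide metadata: 3 copies
print(len(write_block("user_data", b"payload")))   # file data: 1 copy
```

Losing one copy of pool-wide metadata therefore leaves two intact replicas to recover from, which is exactly the recovery behavior the University of Wisconsin tests observed.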

ZFS Self-Healing

Another advantage of ZFS checksumming is that it enables a self-healing architecture. In some traditional checksum approaches, whenever data blocks are replicated from one location to another there is an opportunity to propagate data corruption, because the newest data block is simply replicated as-is. With ZFS checksums, each block replica is validated against its checksum independently. If one replica is found to be corrupted, the healthy replica is used as a reference to repair the unhealthy one.

Figure 3. Example of how the ZFS self-healing architecture corrects a corrupted block.
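The repair path in Figure 3 can be sketched as follows. This is illustrative only (a two-way mirror with an in-memory list standing in for the disks); real ZFS performs this inside its I/O pipeline, using the parent-held checksum described earlier.

```python
# Sketch of self-healing on a two-way mirror: validate each replica
# against the independently stored checksum and repair the bad copy.
import hashlib

def csum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

good = b"payload"
expected = csum(good)                 # checksum held in the parent block

mirror = [b"payload", b"pay_oad"]     # replica 1 is silently corrupted

def self_healing_read(mirror, expected):
    # Find a replica whose checksum matches, serve it, and use it
    # to overwrite any replica that fails validation.
    healthy = next(m for m in mirror if csum(m) == expected)
    for i, m in enumerate(mirror):
        if csum(m) != expected:
            mirror[i] = healthy       # repair the damaged copy in place
    return healthy

data = self_healing_read(mirror, expected)
print(data == good and mirror[1] == good)   # read served AND replica healed
```

The read both returns correct data to the application and leaves the mirror fully repaired, rather than propagating whichever copy happened to be newest.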

University Research Testing and Validation of ZFS Data Integrity

The University of Wisconsin research team thoroughly tested and reported on the data integrity of the ZFS file system. Excerpts of their tests, methods, and results provided herein validate the exceptional data integrity that is the foundation of Oracle ZFS Storage.

Defining Corruption and How Often Corruption Happens

The University of Wisconsin research team identified various events that can impact data integrity, including bit rot caused by magnetic media errors when bits are damaged, power fluctuations, and erratic arm movements, some of which can be caught by disks with error-correcting code (ECC) features. Controller, firmware, and software bugs can cause misdirected writes, lost writes, and incorrect data written to disk.

The University of Wisconsin research team identified the following statistics on how often problems are found with traditional RAID products and commercial-quality drives:
» In a study of 1.53 million disk drives over 41 months, Bairavasundaram et al. showed that more than 400,000 blocks had checksum mismatches, 8 percent of which were discovered during RAID reconstruction, creating the possibility of real data loss.3
» They also found that nearline disks develop checksum mismatches an order of magnitude more often than enterprise-class disk drives.
» In addition, there is much anecdotal evidence of corruption in storage stacks.4

Problems Identified with Other File Systems and RAID

The team at the University of Wisconsin provided data on issues with other file systems and existing RAID products:
» Ext2/ext3 file system checkers fail to use available redundant information for recovery.5
» RAID is designed to tolerate the loss of a certain number of disks or blocks (for example, RAID 5 tolerates one and RAID 6 two), and it may not be possible with RAID alone to accurately identify which block in a stripe is corrupted. Second, some RAID systems have been shown to have flaws where a single block loss leads to data loss or silent corruption.6, 7

Research Testing Methodology

The University of Wisconsin team tested ZFS data integrity in the following ways:

» The team developed a fault injection framework to inject faults into ZFS disk blocks.
» The team injected random bit flips at random offsets in disk blocks.
» The team analyzed ZFS behavior by reviewing return values, checking system logs, and tracing system calls.
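The bit-flip injection in the second step above can be sketched in a few lines. The function below is our illustrative stand-in (the study used a kernel-level injection framework, not user-space Python), paired with a checksum to show that a single flipped bit is detectable:

```python
# Sketch of random bit-flip fault injection: flip one random bit at a
# random offset in a block, then confirm a checksum catches the change.
import hashlib
import random

def flip_random_bit(block: bytearray, rng: random.Random) -> None:
    offset = rng.randrange(len(block))   # random byte offset in the block
    bit = rng.randrange(8)               # random bit within that byte
    block[offset] ^= 1 << bit

rng = random.Random(42)                  # seeded for a repeatable experiment
block = bytearray(b"Z" * 512)            # a 512-byte "disk block"
before = hashlib.sha256(bytes(block)).digest()

flip_random_bit(block, rng)
after = hashlib.sha256(bytes(block)).digest()
print(before != after)                   # a single bit flip changes the checksum
```

Running many such injections against each on-disk block type, and then exercising mount/read/create workloads, yields pass/fail observations like those in the results table below.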

Testing Configuration and Data Layout

The University of Wisconsin research team tested ZFS data integrity on a 64-bit Solaris Express Community Edition (build 108) virtual machine with 2 GB non-ECC memory. The team used ZFS pool version 14 and ZFS file system version 3. They ran ZFS on top of a single disk for their experiments.

The University of Wisconsin research team's testing results are provided in the following table, which illustrates that ZFS either recovered (identified by R in the table) or reported an error (identified by E in the table) for the corruption caused by the simulated fault injections.

3 L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. An Analysis of Data Corruption in the Storage Stack. In FAST, 2008. Original footnote in End-to-End Data Integrity for File Systems: A ZFS Case Study.
4 Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau. End-to-End Data Integrity for File Systems: A ZFS Case Study, Computer Sciences Department, University of Wisconsin-Madison, FAST10.
5 H. S. Gunawi, A. Rajimwale, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. SQCK: A Declarative File System Checker. In OSDI, 2008. Original footnote in End-to-End Data Integrity for File Systems: A ZFS Case Study.
6 A. Krioukov, L. N. Bairavasundaram, G. R. Goodson, K. Srinivasan, R. Thelen, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Parity Lost and Parity Regained. In FAST, 2008. Original footnote in End-to-End Data Integrity for File Systems: A ZFS Case Study.
7 Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau. End-to-End Data Integrity for File Systems: A ZFS Case Study, Computer Sciences Department, University of Wisconsin-Madison, FAST10.


Fault injection results by block type (single ditto and all ditto copies corrupted); R = recovered, E = error reported:

Level   Block                   mount   remount   file create   file read
zpool   vdev label (8)          R       R         E             R
zpool   uberblock               R       R         E             R
zpool   MOS object set block    R       R         E             R
zpool   MOS dnode block         R       R         E             R
myfs    object set block        R       R         E             R
myfs    indirect block          R       R         E             R
myfs    dnode block             R       R         E             R
myfs    dir ZAP block           R       R         E             E
myfs    file data block         -       -         E             E

Oracle ZFS Storage Appliance Robustness

Oracle ZFS Storage Appliance builds on the existing and strong foundation of the ZFS file system with the following protection features:
» Integrated software protection
  » Redirect on write
  » Continuous snapshots
  » Fault Management Architecture (FMA), a feature of Oracle Solaris, for fault detection and alerts
  » Data encryption
» Integrated hardware protection
  » ECC memory
  » Dual paths for all components
  » NSPF (no single point of failure) options
  » Cluster configurations

Reducing Risk Through Advanced Data Integrity

The Oracle ZFS Storage Appliance software has several capabilities that extend data protection to additional levels. These advanced data protection features can help increase productivity by improving data availability, thus reducing the risk of time-consuming recovery procedures and protecting the integrity of archived data.

The Oracle ZFS Storage Appliance software keeps on-disk data self-consistent and eliminates silent data corruption. It combines a copy-on-write approach (data is written to a new block on the media before the pointers to the data are changed and the write is committed) with end-to-end checksumming (described earlier) to keep the file system internally consistent.

Robust Protection from Hardware Failures

Oracle ZFS Storage Appliance also provides robust protection from hardware failures. The most common hardware failure in enterprise storage is, of course, disk failure. ZFS provides multiple options for protection from disk failures. ZFS pools multiple disks into a storage pool, and each pool is assigned a layout at creation that defines how data should be protected. In addition, each redundancy configuration can be further protected by adding hot spares. Available redundancy configuration options are as follows:
» Double- or triple-mirror protection (survives the failure of one or two devices)
» RAIDZ1 single-parity protection (survives the failure of a single disk within a four-disk set)
» RAIDZ2 dual-parity protection (survives the failure of two disks within a 9-, 10-, or 12-disk set, depending on pool drive count)
» RAIDZ3 triple-parity protection (survives the failure of three disks within a multiple-disk set, where stripe width varies depending on pool disk count)

8 Excluding the uberblocks contained in it.

ZFS Data Encryption

Oracle ZFS Storage Appliance supports data encryption on the following models: Oracle ZFS Storage ZS3-4, Oracle ZFS Storage ZS4-4, Oracle ZFS Storage ZS5-2, and Oracle ZFS Storage ZS5-4. Encryption happens as a fully inline process upon ingest, so all data at rest is encrypted. The technology used is a highly secure AES 128/192/256-bit algorithm with a two-tier architecture: the first tier encrypts the data; the second tier then encrypts that key with another 256-bit encryption key. The encryption keys can be stored either locally within the ZFS key manager or centrally within the Oracle Key Manager. This provides robust privacy protection against security breaches and can help data centers meet security requirements. Encryption can be managed at either the project or share/LUN level for granularity in implementation and efficiency in administration. Competitive encryption options typically do not offer this fine granularity, nor do they generally offer this flexibility in key management. Many competitive options also require expensive, specialized self-encrypting drives, whereas Oracle ZFS Storage Appliance's encryption is entirely controller-based, drive-independent, and very flexible, yet easy to use.

Conclusion

Oracle ZFS Storage Appliance uses the advanced Oracle Solaris ZFS file system to ensure the highest data integrity in the industry, in combination with other architectural elements that further enhance that integrity, such as RAID-level protection and encryption.

The University of Wisconsin research team supports this conclusion: "In our analysis, we find that ZFS is indeed robust to a wide range of disk corruptions, thus partially confirming that many of its design goals have been met."9

Regarding their fault injection and ZFS error recovery results, the team stated: "ZFS gracefully recovers from single metadata block corruptions. For pool-wide metadata and file system-wide metadata, ZFS recovered from disk corruptions by using the ditto blocks. ZFS keeps three ditto blocks for pool-wide metadata and two for file system metadata. Hence, on single-block corruption to metadata, ZFS was successfully able to detect the corruption and use other available correct copies to recover from it."10

9 Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau. End-to-End Data Integrity for File Systems: A ZFS Case Study, Computer Sciences Department, University of Wisconsin-Madison, FAST10.
10 Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau. End-to-End Data Integrity for File Systems: A ZFS Case Study, Computer Sciences Department, University of Wisconsin-Madison, FAST10.


Related Links

Oracle ZFS Storage Appliance

Oracle Technology Network Oracle ZFS Storage Appliance

Product data sheet

Business value white paper

Analyst white paper on Oracle integration


Oracle Corporation, World Headquarters
500 Oracle Parkway
Redwood Shores, CA 94065, USA

Worldwide Inquiries
Phone: +1.650.506.7000
Fax: +1.650.506.7200

CONNECT WITH US

blogs.oracle.com/oracle
facebook.com/oracle
twitter.com/oracle
oracle.com

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. This document is provided for information purposes only, and the contents hereof are subject to change without notice. This document is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or fitness for a particular purpose. We specifically disclaim any liability with respect to this document, and no contractual obligations are formed either directly or indirectly by this document. This document may not be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without our prior written permission.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Intel and Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group. 0517

Oracle ZFS Storage—Data Integrity May 2017