ZFS and RAID-Z: The Über-FS?

Brian Hickmann and Kynan Shook
University of Wisconsin – Madison
Electrical and Computer Engineering Department
Madison, WI 53706
{bjhickmann and kashook} @ wisc.edu

Abstract

ZFS has been touted by Sun Microsystems as the "last word in file systems". It is a revolutionary file system with several new features that improve its reliability, manageability, and performance. ZFS also contains RAID-Z, an integrated software RAID system that claims to solve the RAID-5 write hole without special hardware. While these claims are impressive, very little information about ZFS and RAID-Z has been published, and no work has been done to verify these claims. In this work, we investigate RAID-Z and ZFS by examining the on-disk file layout. We also use a kernel extension to intercept reads and writes to the backing store and use this temporal information to further understand the file system's behavior. With these tools, we were able to discover how RAID-Z lays files out on the disk. We also verified that RAID-Z both solves the RAID-5 write hole and maintains its full-stripe write semantics even when the volume is full or heavily fragmented. Finally, during our testing, we uncovered serious performance degradation and erratic behavior in ZFS when the disk is nearly full and heavily fragmented.

1. Introduction

ZFS, a file system recently developed by Sun Microsystems, promises many advancements in reliability, scalability, and management. It stores checksums to detect corruption and allows for the easy creation of mirrored or RAID-Z volumes to facilitate correcting errors. It is a 128-bit file system, providing the ability to expand well beyond the inherent limitations of existing file systems. Additionally, ZFS breaks the traditional barriers between the file system, volume manager, and RAID controller, allowing file systems and the storage pool they use to grow flexibly as needed.

RAID-Z was also developed by Sun as an integral part of ZFS. It is similar to RAID-5, with one parity block per stripe, rotated among the disks to prevent any one disk from being a bottleneck. There is also the double-parity RAID-Z2, which is similar to RAID-6 in that it can recover from two simultaneous disk failures instead of one. However, unlike the standard RAID levels, RAID-Z allows a variable stripe width, so data that is smaller than a whole stripe can be written to fewer disks. Along with the copy-on-write behavior of ZFS, this allows RAID-Z to avoid the read-modify-write of parity that RAID-5 performs when updating only a single block on disk. Because of the variable stripe width, recovery requires reconstructing metadata in order to find where the parity blocks are located. Although RAID-Z therefore requires a more intelligent recovery algorithm, this also means that recovery time is much shorter when a volume has a lot of free space. In RAID-5, because the RAID controller is separate from the file system and has no knowledge of what is on the disk, it must recreate data for the entire drive, even unallocated space.

While RAID-Z and ZFS make several exciting claims, very little information is published on ZFS, and even less on RAID-Z. Sun has published a draft On-Disk Specification [3]; however, it barely mentions RAID-Z. Nothing is published to indicate how data is spread out across disks or how parity is written. Our objective in this paper is to show some of the behavior of RAID-Z that has not yet been documented.

In this paper, we cover a variety of topics relating to the behavior of RAID-Z. We look at file and parity placement on disk, including how RAID-Z always performs full-stripe writes, preventing RAID-5's read-modify-write when writing a single block of data. We discuss the RAID-5 write hole and show how RAID-Z avoids this problem. Additionally, we investigate the behavior of RAID-Z when the volume is full or when free space is fragmented, and we show how performance can degrade for a full or fragmented volume.

2. RAID-Z and ZFS Background Information

There are two pieces of ZFS's and RAID-Z's architecture that are important to this work. The first is ZFS's copy-on-write (CoW) transactional model. In this model, any update to the file system, including both metadata and data, is written in a CoW fashion to an empty portion of the disk. Old data or metadata is never immediately overwritten in ZFS when it is being updated. Once the updates have been propagated up the file system tree and eventually committed to disk, the 512-byte uberblock (akin to the superblock in ext3) is updated in a single atomic write. This ensures that updates are applied atomically to the file system, solving many consistent-update problems. RAID-Z must also follow this model and therefore guarantees to always perform full-stripe writes of new data to empty locations on the disk.

The second part of ZFS relevant to this work is how it manages free space. Since ZFS is a 128-bit file system, it can scale to extremely large volumes. Traditional methods of managing free space, such as bitmaps, do not scale to these large volumes due to their large space overhead. ZFS therefore uses a different structure, called a space map, to manage free space [9]. Each device in a ZFS volume is split into several hundred metaslabs, and each metaslab has an associated space map that contains a log of all allocations and frees performed within that metaslab. At allocation time, this log is replayed in order to find the free space available in the metaslab; the space map is also condensed at this time and written back to disk. Space maps save significant space, especially when the metaslab is nearly full; a completely full metaslab is represented by a single entry in the log. Their downside, however, is the time overhead needed to replay the log to find free space and to condense it for future use. This overhead becomes noticeable when the disk is nearly full.
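As a rough illustration of this structure, the following Python sketch models a single metaslab's space map as an append-only log of allocations and frees. The record layout, method names, and condensing policy are simplified stand-ins of our own, not ZFS's on-disk format; the sketch only mirrors the behavior described above: the log is replayed to recover free space, and condensing a completely full metaslab leaves a single entry.

```python
# Illustrative sketch of a log-structured space map for one metaslab
# (block granularity, in-memory).  The record format and condensing
# policy are simplified stand-ins, not ZFS's on-disk representation.

class SpaceMapSketch:
    def __init__(self, metaslab_blocks):
        self.metaslab_blocks = metaslab_blocks
        self.log = []                     # append-only: (op, offset, length)

    def alloc(self, offset, length):
        self.log.append(("alloc", offset, length))

    def free(self, offset, length):
        self.log.append(("free", offset, length))

    def replay(self):
        """Replay the log to recover the set of allocated block numbers."""
        allocated = set()
        for op, offset, length in self.log:
            blocks = range(offset, offset + length)
            if op == "alloc":
                allocated.update(blocks)
            else:
                allocated.difference_update(blocks)
        return allocated

    def condense(self):
        """Rewrite the log as merged 'alloc' extents only.

        A completely full metaslab condenses to a single log entry,
        which is why space maps stay compact as space runs out."""
        runs, start, prev = [], None, None
        for b in sorted(self.replay()):
            if start is None:
                start = prev = b
            elif b == prev + 1:
                prev = b
            else:
                runs.append(("alloc", start, prev - start + 1))
                start = prev = b
        if start is not None:
            runs.append(("alloc", start, prev - start + 1))
        self.log = runs


# Example: a metaslab that ends up completely allocated condenses to one entry.
sm = SpaceMapSketch(metaslab_blocks=1024)
sm.alloc(0, 512)
sm.alloc(512, 512)
sm.free(100, 4)
sm.alloc(100, 4)
sm.condense()
print(sm.log)                             # [('alloc', 0, 1024)]
```

The cost of this compactness is also visible in the sketch: every allocation pass must replay and condense the log, which is the overhead that becomes noticeable as the disk fills.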
3. Methodology

In order to investigate the properties of RAID-Z, we used binary files as backing stores instead of actual disks. This had several advantages. First, we did not need to use separate disks or disk partitions to create our RAID-Z volumes. This allowed us to easily test configurations with different numbers or sizes of backing stores. Second, it also allowed us to easily observe the backing stores while they were in use with a normal binary editor. Finally, the use of binary files as backing stores allowed us to place them on a special disk image and write a kernel extension to intercept all reads and writes to this disk image.

Figure 1: Storage Driver Stack [1]

Our kernel extension fits into the storage driver hierarchy pictured in Figure 1 and is based on sample code from the Apple Developer Connection website [7]. It acts as a media filter driver, consuming and producing a media object that represents a single write or read on the disk. It sits below the file system, so there is no concept of files or directories. The kernel matches this type of driver based on one of several properties of the media objects. Our kernel extension matched on the content hint key, which is set by the disk formatting utility when the volume is created. We then created a custom disk image with our content hint so that the kernel extension would be called only for traffic to this disk image.

This kernel extension allowed us to easily monitor file system activity to the ZFS backing stores. However, it does have several limitations. Because writing to a file from within the kernel is very difficult due to the potential for deadlock, we used the system logging facility to record the output of our kernel extension. This facility has a limited buffer size, causing some events to be missed during periods of high activity. To avoid this, we only printed messages relevant to our own reads and writes, which we recognized from a repeating data pattern. Also, because the driver is in the direct path of all reads and writes, if the extension takes too long to run, it can cause the file system to become too slow to even mount in a reasonable amount of time. In one particular example, we attempted to scan through all of the data being written to the disk image, but this caused the mount operation to take longer than an hour.

To facilitate directly examining the backing stores, we wrote repeating data patterns to the disk that we could search for using a binary file editor. We always wrote the pattern in multiples of 512-byte chunks to match the logical block size of the "disks" in our system. To be able to distinguish the parity blocks from the data blocks in our test files, we used a 32-bit pattern that was rotated and periodically incremented after each write to guarantee that the parity block would always be unique. The use of a 32-bit repeating pattern also allowed us to filter the reads and writes reported by our kernel extension by looking for this pattern.
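The sketch below shows one way to generate such a pattern: each 512-byte sector is filled with a repeating 32-bit word, the word is rotated after every sector, and it is incremented at a fixed interval. The rotation amount, increment interval, and seed shown here are illustrative values rather than the exact parameters of our tests.

```python
# Sketch of a recognizable test pattern: each 512-byte sector is filled
# with a repeating 32-bit word.  The word is rotated after every sector
# and incremented periodically so that the XOR parity of a stripe never
# matches any data sector.  ROTATE_BITS, INCREMENT_EVERY, and the seed
# are illustrative values, not the exact parameters used in our tests.

import struct

SECTOR_SIZE = 512          # logical block size of our "disks"
ROTATE_BITS = 8            # illustrative rotation per sector
INCREMENT_EVERY = 16       # illustrative: bump the word every 16 sectors


def rotl32(word, bits):
    """Rotate a 32-bit word left by the given number of bits."""
    return ((word << bits) | (word >> (32 - bits))) & 0xFFFFFFFF


def pattern_sectors(seed=0xDEADBEEF, count=64):
    """Yield `count` 512-byte sectors of the evolving test pattern."""
    word = seed
    for i in range(count):
        yield struct.pack(">I", word) * (SECTOR_SIZE // 4)
        word = rotl32(word, ROTATE_BITS)
        if (i + 1) % INCREMENT_EVERY == 0:
            word = (word + 1) & 0xFFFFFFFF


if __name__ == "__main__":
    with open("testfile.bin", "wb") as f:
        for sector in pattern_sectors():
            f.write(sector)
```

Because every sector carries a distinct repeating word, sectors in the backing stores whose contents do not match any generated word can be identified as parity when browsing with a binary editor.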
4. Investigating File Placement in RAID-Z

The first aspect of RAID-Z that we investigated was how it places file data on disk. Here we wanted to see how RAID-Z laid out data and parity blocks on the disk, how it chose to rotate parity across the disks, and how it guaranteed to always perform full-stripe writes.
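To make the terminology concrete before presenting our measurements, the sketch below models a variable-width, full-stripe write in the abstract: the data is split into 512-byte columns, a parity column is computed by XOR, the stripe spans only as many disks as it needs, and the parity position rotates from one stripe to the next. The rotation rule and layout here are deliberately naive illustrations, not ZFS's actual placement algorithm, which is what the experiments in this section examine.

```python
# Toy model of a variable-width, RAID-Z-style full-stripe write.  Parity is
# the XOR of the data sectors, the stripe spans only as many disks as the
# data needs, and the parity column rotates with the stripe index.  This is
# a conceptual illustration only, not ZFS's actual placement algorithm.

SECTOR = 512


def xor_parity(sectors):
    """XOR all data sectors together to form a single parity sector."""
    parity = bytearray(SECTOR)
    for sector in sectors:
        for i, byte in enumerate(sector):
            parity[i] ^= byte
    return bytes(parity)


def full_stripe_write(data, ndisks, stripe_index):
    """Return {disk_number: sector} for one full-stripe write of `data`.

    The write is never split across existing stripes and never updates
    parity in place, so no read-modify-write is required.
    """
    # Split the data into sector-sized columns, padding the last one.
    columns = [data[i:i + SECTOR].ljust(SECTOR, b"\0")
               for i in range(0, len(data), SECTOR)]
    assert len(columns) + 1 <= ndisks, "stripe (data + parity) must fit the array"

    parity = xor_parity(columns)
    parity_disk = stripe_index % ndisks      # illustrative rotation rule
    layout = {parity_disk: parity}
    disk = (parity_disk + 1) % ndisks
    for col in columns:
        layout[disk] = col
        disk = (disk + 1) % ndisks
    return layout
```

Under this model, a 1 KB write to a five-disk volume touches only three disks (two data sectors plus one parity sector), all written to previously free locations, which is the full-stripe, copy-on-write behavior we set out to verify against the real file system.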