Tracking Back References in a Write-Anywhere File System

Peter Macko, Harvard University ([email protected])
Margo Seltzer, Harvard University ([email protected])
Keith A. Smith, NetApp, Inc. ([email protected])

Abstract

Many file systems reorganize data on disk, for example to defragment storage, shrink volumes, or migrate data between different classes of storage. Advanced file system features such as snapshots, writable clones, and deduplication make these tasks complicated, as moving a single block may require finding and updating dozens, or even hundreds, of pointers to it.

We present Backlog, an efficient implementation of explicit back references, to address this problem. Back references are file system meta-data that map physical block numbers to the data objects that use them. We show that by using LSM-Trees and exploiting the write-anywhere behavior of modern file systems such as NetApp® WAFL® or btrfs, we can maintain back reference meta-data with minimal overhead (one extra disk I/O per 10² block operations) and provide excellent query performance for the common case of queries covering ranges of physically adjacent blocks.

1 Introduction

Today's file systems such as WAFL [12], btrfs [5], and ZFS [23] have moved beyond merely providing reliable storage to providing useful services, such as snapshots and deduplication. In the presence of these services, any data block can be referenced by multiple snapshots, multiple files, or even multiple offsets within a file. This complicates any operation that must efficiently determine the set of objects referencing a given block, for example when updating the pointers to a block that has moved during defragmentation or volume resizing. In this paper we present new file system structures and algorithms to facilitate such dynamic reorganization of file system data in the presence of block sharing.

In many problem domains, a layer of indirection provides a simple way to relocate objects in memory or on storage without updating any pointers held by users of the objects. Such virtualization would help with some of the use cases of interest, but it is insufficient for one of the most important—defragmentation.

Defragmentation can be a particularly important issue for file systems that implement block sharing to support snapshots, deduplication, and other features. While block sharing offers great savings in space efficiency, sub-file sharing of blocks necessarily introduces on-disk fragmentation. If two files share a subset of their blocks, it is impossible for both files to have a perfectly sequential on-disk layout.

Block sharing also makes it harder to optimize on-disk layout. When two files share blocks, defragmenting one file may hurt the layout of the other file. A better approach is to make reallocation decisions that are aware of block sharing relationships between files and can make more intelligent optimization decisions, such as prioritizing which files get defragmented, selectively breaking block sharing, or co-locating related files on the disk.

These decisions require that when we defragment a file, we determine its new layout in the context of other files with which it shares blocks. In other words, given the blocks in one file, we need to determine the other files that share those blocks. This is the key obstacle to using virtualization to enable block reallocation, as it would hide this mapping from physical blocks to the files that reference them. Thus we have sought a technique that will allow us to track, rather than hide, this mapping, while imposing minimal performance impact on common file operations. Our solution is to introduce and maintain back references in the file system.

Back references are meta-data that map physical block numbers to their containing objects. Such back references are essentially inverted indexes on the traditional file system meta-data that maps file offsets to physical blocks. The challenge in using back references to simplify maintenance operations, such as defragmentation, is in maintaining them efficiently.

We have designed Log-Structured Back References, or Backlog for short, a write-optimized back reference implementation with small, predictable overhead that remains stable over time. Our approach requires no disk reads to update the back reference database on block allocation, reallocation, or deallocation. We buffer updates in main memory and efficiently apply them en masse to the on-disk database during file system consistency points (checkpoints). Maintaining back references in the presence of snapshot creation, cloning or deletion incurs no additional I/O overhead. We use database compaction to reclaim space occupied by records referencing deleted snapshots. The only time that we read data from disk is during data compaction, which is an infrequent activity, and in response to queries for which the data is not currently in memory.

Figure 1: File System as a Tree. The conceptual view of a file system as a tree rooted at the volume root (superblock) [18], which is a parent of all inodes. An inode is a parent of data blocks and/or indirect blocks.

We present a brief overview of write-anywhere file systems in Section 2. Section 3 outlines the use cases that motivate our work and describes some of the challenges of handling them in a write-anywhere file system. We describe our design in Section 4 and our implementation in Section 5. We evaluate the maintenance overheads and query performance in Section 6. We present related work in Section 7, discuss future work in Section 8, and conclude in Section 9.
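As a concrete illustration of this inversion, the sketch below (our illustration, not code from the system) contrasts the forward map stored in ordinary file system meta-data, which goes from (inode, offset) to a physical block, with the back reference index, which goes from a physical block to the set of objects that use it. The names forward_map and back_refs are purely hypothetical.

```python
# Illustrative sketch: back references as an inverted index over the
# file system's forward block pointers.
from collections import defaultdict

# Forward meta-data: (inode, offset) -> physical block number.
forward_map = {
    (2, 0): 100,   # inode 2, offset 0 stored in block 100
    (2, 1): 101,
    (7, 0): 100,   # deduplicated: inode 7 shares block 100 with inode 2
}

# Back references: physical block number -> objects that reference it.
back_refs = defaultdict(set)
for (inode, offset), block in forward_map.items():
    back_refs[block].add((inode, offset))

# The query a maintenance tool needs: "tell me all the objects containing
# this block", e.g. before relocating block 100 during defragmentation.
print(sorted(back_refs[100]))   # -> [(2, 0), (7, 0)]
```
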

2 Background

Our work focuses specifically on tracking back references in write-anywhere (or no-overwrite) file systems, such as btrfs [5] or WAFL [12]. The terminology across such file systems has not yet been standardized; in this work we use WAFL terminology unless stated otherwise.

Write-anywhere file systems can be conceptually modeled as trees [18]. Figure 1 depicts a file system tree rooted at the volume root or superblock. Inodes are the immediate children of the root, and they in turn are parents of indirect blocks and/or data blocks. Many modern file systems also represent inodes, free space bitmaps, and other meta-data as hidden files (not shown in the figure), so every allocated block with the exception of the root has a parent inode.

Write-anywhere file systems never update a block in place. When overwriting a file, they write the new file data to newly allocated disk blocks, recursively updating the appropriate pointers in the parent blocks. Figure 2 illustrates this process. This recursive chain of updates is expensive if it occurs at every write, so the file system accumulates updates in memory and applies them all at once during a consistency point (CP or checkpoint). The file system writes the root node last, ensuring that it represents a consistent set of data structures. In the case of failure, the file system is guaranteed to find a consistent state with contents as of the last CP. File systems that support journaling to stable storage (disk or NVRAM) can then recover data written since the last checkpoint by replaying the log.

Figure 2: Write-Anywhere File System Maintenance. In write-anywhere file systems, block updates generate new block copies. For example, upon updating the block "Data 2", the file system writes the new data to a new block and then recursively updates the blocks that point to it – all the way to the volume root.

Write-anywhere file systems can capture snapshots, point-in-time copies of previous file system states, by preserving the file system images from past consistency points. These snapshots are space efficient; the only differences between a snapshot and the live file system are the blocks that have changed since the snapshot copy was created. In essence, a write-anywhere allocation policy implements copy-on-write as a side effect of its normal operation.

Many systems preserve a limited number of the most recent consistency points, promoting some to hourly, daily, weekly, etc. snapshots. An asynchronous process typically reclaims space by deleting old CPs, reclaiming blocks whose only references were from deleted CPs.

Several file systems, such as WAFL and ZFS, can create writable clones of snapshots, which are especially useful in development (such as the creation of a writable duplicate for testing of a production database) and virtualization [9].

It is helpful to conceptualize a set of snapshots and consistency points in terms of lines, as illustrated in Figure 3. A time-ordered set of snapshots of a file system forms a single line, while creation of a writable clone starts a new line. In this model, a (line ID, version) pair uniquely identifies a snapshot or a consistency point. In the rest of the paper, we use the global consistency point number during which a snapshot or consistency point was created as its version number.

Figure 3: Snapshot Lines. The tuple (line, version), where version is a global CP number, uniquely identifies a snapshot or consistency point. Taking a consistency point creates a new version of the latest snapshot within each line, while creating a writable clone of an existing snapshot starts a new line.

The use of copy-on-write to implement snapshots and clones means that a single physical block may belong to multiple file system trees and have many meta-data blocks pointing to it. In Figure 2, for example, two different indirect blocks, I-Block 2 and I-Block 2', reference the block Data 1. Block-level deduplication [7, 17] can further increase the number of pointers to a block by allowing files containing identical data blocks to share a single on-disk copy of the block. This block sharing presents a challenge for file system management operations, such as defragmentation or data migration, that reorganize blocks on disk. If the file system moves a block, it will need to find and update all of the pointers to that block.

3 Use Cases

The goal of Backlog is to maintain meta-data that facilitates the dynamic movement and reorganization of data in write-anywhere file systems. We envision two major cases for internal data reorganization in a file system. The first is support for bulk data migration. This is useful when we need to move all of the data off of a device (or a portion of a device), such as when shrinking a volume or replacing hardware. The challenge here for traditional file system designs is translating from the physical block addresses we are moving to the files referencing those blocks so we can update their block pointers. Such a file system can, for example, do this only by traversing the entire file system tree searching for block pointers that fall in the target range [2]. In a large file system, the I/O required for this brute-force approach is prohibitive.

Our second use case is the dynamic reorganization of on-disk data. This is traditionally thought of as defragmentation—reallocating files on-disk to achieve contiguous layout. We consider this use case more broadly to include tasks such as free space coalescing (to create contiguous expanses of free blocks for the efficient layout of new files) and the migration of individual files between different classes of storage in a file system.

To support these data movement functions in write-anywhere file systems, we must take into account the block sharing that emerges from features such as snapshots and clones, as well as from the deduplication of identical data blocks [7, 17]. This block sharing makes defragmentation both more important and more challenging than in traditional file system designs. Fragmentation is a natural consequence of block sharing; two files that share a subset of their blocks cannot both have an ideal sequential layout. And when we move a shared block during defragmentation, we face the challenge of finding and updating pointers in multiple files.

Consider a basic defragmentation scenario where we are trying to reallocate the blocks of a single file. This is simple to handle. We find the file's blocks by reading the indirect block tree for the file. Then we move the blocks to a new, contiguous, on-disk location, updating the pointer to each block as we move it.

But things are more complicated if we need to defragment two files that share one or more blocks, a case that might arise when multiple virtual machine images are cloned from a single master image. If we defragment the files one at a time, as described above, the shared blocks will ping-pong back and forth between the files as we defragment one and then the other. A better approach is to make reallocation decisions that are aware of the sharing relationship. There are multiple ways we might do this. We could select the most important file and only optimize its layout. Or we could decide that performance is more important than space savings and make duplicate copies of the shared blocks to allow sequential layout for all of the files that use them. Or we might apply multi-dimensional layout techniques [20] to achieve near-optimal layouts for both files while still preserving block sharing.

The common theme in all of these approaches to layout optimization is that when we defragment a file, we must determine its new layout in the context of the other files with which it shares blocks. Thus we have sought a technique that will allow us to easily map physical blocks to the files that use them, while imposing minimal performance impact on common file system operations. Our solution is to introduce and maintain back reference meta-data to explicitly track all of the logical owners of each physical data block.

4 Log-Structured Back References

Back references are updated significantly more frequently than they are queried; they must be updated on every block allocation, deallocation, or reallocation. It is crucial that they impose only a small performance overhead that does not increase with the age of the file system. Fortunately, it is not a requirement that the meta-data be space efficient, since disk is relatively inexpensive.

In this section, we present Log-Structured Back References (Backlog). We present our design in two parts. First, we present the conceptual design, which provides a simple model of back references and their use in querying. We then present a design that achieves the capabilities of the conceptual design efficiently.

4.1 Conceptual Design

A naïve approach to maintaining back references requires that we write a back reference record for every block at every consistency point. Such an approach would be prohibitively expensive both in terms of disk usage and performance overhead. Using the observation that a given block and its back references may remain unchanged for many consistency points, we improve upon this naïve representation by maintaining back references over ranges of CPs. We represent every such back reference as a record with the following fields:

• block: The physical block number
• inode: The inode number that references the block
• offset: The offset within the inode
• line: The line of snapshots that contains the inode
• from: The global CP number (time epoch) from which this record is valid (i.e., when the reference was allocated to the inode)
• to: The global CP number until which the record is valid (exclusive), or ∞ if the record is still alive

For example, the following table describes two blocks owned by inode 2, created at time 4 and truncated to one block at time 7:

    block  inode  offset  line  from  to
      100      2       0     0     4   ∞
      101      2       1     0     4   7

Although we present this representation as operating at the level of blocks, it can be extended to include a length field to operate on extents.

Let us now consider how a table of these records, indexed by physical block number, lets us answer the sort of query we encounter in file system maintenance. Imagine that we have previously run a deduplication process and found that many files contain a block of all 0's. We stored one copy of that block on disk and now have multiple inodes referencing that block. Now, let's assume that we wish to move the physical location of that block of 0's in order to shrink the size of the volume on which it lives. First we need to identify all the files that reference this block, so that when we relocate the block, we can update their meta-data to reference the new location. Thus, we wish to query the back references to answer the question, "Tell me all the objects containing this block." More generally, we may want to ask this query for a range of physical blocks. Such queries translate easily into indexed lookups on the structure described above. We use the physical block number as an index to locate all the records for the given physical block number. Those records identify all the objects that reference the block and all versions in which those blocks are valid.

Unfortunately, this representation, while elegantly simple, would perform abysmally. Consider what is required for common operations. Every block deallocation requires replacing the ∞ in the to field with the current CP number, translating into a read-modify-write on this table. Block allocation requires creating a new record, translating into an insert into the table. Block reallocation requires both a deallocation and an allocation, and thus a read-modify-write and an insert. We ran experiments with this approach and found that the file system slowed down to a crawl after only a few hundred consistency points. Providing back references with acceptable overhead during normal operation requires a feasible design that efficiently realizes the conceptual model described in this section.

4.2 Feasible Design

Observe that records in the conceptual table described in Section 4.1 are of two types. Complete records refer to blocks that are no longer part of the live file system; they exist only in snapshots. Such blocks are identified by having to < ∞. Incomplete records are part of the live file system and always have to = ∞. Our actual design maintains two separate tables, From and To. Both tables contain the first four columns of the conceptual table (block, inode, offset, and line). The From table also contains the from column, and the To table contains the to column. Incomplete records exist only in the From table, while complete records appear in both tables.

On a block allocation, regardless of whether the block is newly allocated or reallocated, we insert the corresponding entry into the From table with the from field set to the current global CP number, creating an incomplete record. When a reference is removed, we insert the appropriate entry into the To table, completing the record. We buffer new records in memory, committing them to disk at the end of the current CP, which guarantees that all entries with the current global CP number are present in memory. This facilitates pruning records where from = to, which refer to block references that were added and removed within the same CP.

For example, the Conceptual table from the previous subsection (describing the two blocks of inode 2) is broken down as follows:

    From:  block  inode  offset  line  from
             100      2       0     0     4
             101      2       1     0     4

    To:    block  inode  offset  line  to
             101      2       1     0   7

The record for block 101 is complete (it has both From and To entries), while the record for block 100 is incomplete (the block is currently allocated).

This design naturally handles block sharing arising from deduplication. When the file system detects that a newly written block is a duplicate of an existing on-disk block, it adds a pointer to that block and creates an entry in the From table corresponding to the new reference.

4.2.1 Joining the Tables

The conceptual table on which we want to query is the outer join of the From and To tables. A tuple F ∈ From joins with a tuple T ∈ To that has the same first four fields and that has the smallest value of T.to such that F.from < T.to. If there is a From entry without a matching To entry (i.e., a live, incomplete record), we outer-join it with an implicitly-present tuple T0 ∈ To with T0.to = ∞.

For example, assume that a file with inode 4 was created at time 10 with one block and then truncated at time 12. Then, the same block was assigned to the file at time 16, and the file was removed at time 20. Later on, the same block was allocated to a different file at time 30. These operations produce the following records:

    From:  block  inode  offset  line  from
             103      4       0     0    10
             103      4       0     0    16
             103      5       2     0    30

    To:    block  inode  offset  line  to
             103      4       0     0  12
             103      4       0     0  20

Observe that the first From and the first To record form a logical pair describing a single interval during which the block was allocated to inode 4. To reconstruct the history of this block allocation, a record with from = 10 has to join with to = 12. Similarly, the second From record should join with the second To record. The third From entry does not have a corresponding To entry, so it joins with an implicit entry with to = ∞.

The result of this outer join is the Conceptual view. Every tuple C ∈ Conceptual has both from and to fields, which together represent a range of global CP numbers within the given snapshot line, during which the specified block is referenced by the given inode from the given file offset. The range might include deleted consistency points or snapshots, so we must apply a mask of the set of valid versions before returning query results.

Coming back to our previous example, performing an outer join on these tables produces:

    block  inode  offset  line  from  to
      103      4       0     0    10  12
      103      4       0     0    16  20
      103      5       2     0    30   ∞

This design is feasible until we introduce writable clones. In the rest of this section, we explain how we have to modify the conceptual view to address them. Then, in Section 5, we discuss how we realize this design efficiently.

4.2.2 Representing Writable Clones

Writable clones pose a challenge in realizing the conceptual design. Consider a snapshot (l, v), where l is the line and v is the version or CP. Naïvely creating a writable clone (l′, v′) requires that we duplicate all back references that include (l, v) (that is, C.line = l ∧ C.from ≤ v < C.to, where C ∈ Conceptual), updating the line field to l′ and the from and to fields to represent all versions (range 0 − ∞). Using this technique, the conceptual table would continue to be the result of the outer join of the From and To tables, and we could express queries directly on the conceptual table. Unfortunately, this mass duplication is prohibitively expensive. Thus, our actual design cannot simply rely on the conceptual table. Instead we implicitly represent writable clones in the database using structural inheritance [6], a technique akin to copy-on-write. This avoids the massive duplication in the naïve approach.

The implicit representation assumes that every block of (l, v) is present in all subsequent versions of l′, unless explicitly overridden. When we modify a block, b, in a new writable clone, we do two things: First, we declare the end of b's lifetime by writing an entry in the To table recording the current CP. Second, we record the allocation of the new block b′ (a copy-on-write of b) by adding an entry into the From table.

For example, if the old block b = 103 was originally allocated at time 30 in line l = 0 and was replaced by a new block b′ = 107 at time 43 in line l′ = 1, the system produces the following records:

    From:  block  inode  offset  line  from
             103      5       2     0    30
             107      5       2     1    43

    To:    block  inode  offset  line  to
             103      5       2     1  43

The entry in the To table overrides the inheritance from the previous snapshot; however, notice that this new To entry now has no element in the From table with which to join, since no entry in the From table exists with the line l′ = 1. We join such entries with an implicit entry in the From table with from = 0. With the introduction of structural inheritance and implicit records in the From table, our joined table no longer matches our conceptual table. To distinguish the conceptual table from the actual result of the join, we call the join result the Combined table.

Summarizing, a back reference record C ∈ Combined of (l, v) is implicitly present in all versions of l′, unless there is an overriding record C′ ∈ Combined with C.block = C′.block ∧ C.inode = C′.inode ∧ C.offset = C′.offset ∧ C′.line = l′ ∧ C′.from = 0. If such a C′ record exists, then it defines the versions of l′ for which the back reference is valid (i.e., from C′.from to C′.to). The file system continues to maintain back references as usual by inserting the appropriate From and To records in response to allocation, deallocation and reallocation operations.

While the Combined table avoids the massive copy when creating writable clones, query execution becomes a bit more complicated. After extracting an initial result from the Combined table, we must iteratively expand those results as follows. Let Initial be the initial result extracted from Combined containing all records that correspond to blocks b0, . . . , bn. If any of the blocks bi has one or more override records, they are all guaranteed to be in this initial result. We then initialize the query Result to contain all records in Initial and proceed as follows. For every record R ∈ Result that references a snapshot (l, v) that was cloned to produce (l′, v′), we check for the existence of a corresponding override record C′ ∈ Initial with C′.line = l′. If no such record exists, we explicitly add a record with C′.line ← l′, C′.from ← 0 and C′.to ← ∞ to Result. This process repeats recursively until it fails to insert additional records. Finally, when the result is fully expanded, we mask the ranges to remove references to deleted snapshots as described in Section 4.2.1.

This approach requires that we never delete the back references for a cloned snapshot. Consequently, snapshot deletion checks whether the snapshot has been cloned, and if it has, it adds the snapshot ID to the list of zombies, ensuring that its back references are not purged during maintenance. The file system is then free to proceed with snapshot deletion. Periodically we examine the list of zombies and drop snapshot IDs that have no remaining descendants (clones).
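The following sketch (ours, not the system's implementation) illustrates the outer join of Section 4.2.1 and reproduces the inode-4 example; it omits the implicit from = 0 records used for clones and the masking of deleted snapshots.

```python
# Sketch of the outer join from Section 4.2.1 (illustrative implementation).
# A From tuple joins with the To tuple that has the same (block, inode,
# offset, line) and the smallest to such that from < to; a From tuple with
# no matching To joins an implicit to = infinity.
INF = float("inf")

def combine(from_rows, to_rows):
    """from_rows/to_rows: lists of (block, inode, offset, line, cp) tuples."""
    combined = []
    for (blk, ino, off, line, frm) in from_rows:
        candidates = [t for (b, i, o, l, t) in to_rows
                      if (b, i, o, l) == (blk, ino, off, line) and frm < t]
        to = min(candidates) if candidates else INF
        combined.append((blk, ino, off, line, frm, to))
    return combined

# The example from Section 4.2.1: block 103 is used by inode 4 during
# CPs [10, 12) and [16, 20), then by inode 5 from CP 30 onward.
From = [(103, 4, 0, 0, 10), (103, 4, 0, 0, 16), (103, 5, 2, 0, 30)]
To   = [(103, 4, 0, 0, 12), (103, 4, 0, 0, 20)]

for row in combine(From, To):
    print(row)
# (103, 4, 0, 0, 10, 12)
# (103, 4, 0, 0, 16, 20)
# (103, 5, 2, 0, 30, inf)
```
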

5 Implementation

With the feasible design in hand, we now turn towards the problem of efficiently realizing the design. First we discuss our implementation strategy and then discuss our on-disk data storage (Section 5.1). We then proceed to discuss database compaction and maintenance (Section 5.2), partitioning the tables (Section 5.3), and recovering the tables after system failure (Section 5.4). We implemented and evaluated the system in fsim, our custom file system simulator, and then replaced the native back reference support in btrfs with Backlog.

The implementation in fsim allows us to study the new feature in isolation from the rest of the file system. Thus, we fully realize the implementation of the back reference system, but embed it in a simulated file system rather than a real file system, allowing us to consider a broad range of file systems rather than a single specific implementation. Fsim simulates a write-anywhere file system with writable snapshots and deduplication. It exports an interface for creating, deleting, and writing to files, and an interface for managing snapshots, which are controlled either by a stochastic workload generator or an NFS trace player. It stores all file system meta-data in main memory, but it does not explicitly store any data blocks. It stores only the back reference meta-data on disk. Fsim also provides two parameters to configure deduplication emulation. The first specifies the percentage of newly created blocks that duplicate existing blocks. The second specifies the distribution of how those duplicate blocks are shared.

We implement back references as a set of callback functions on the following events: adding a block reference, removing a block reference, and taking a consistency point. The first two callbacks accumulate updates in main memory, while the consistency point callback writes the updates to stable storage, as described in the next section. We implement the equivalent of a user-level process to support database maintenance and query. We verify the correctness of our implementation by a utility program that walks the entire file system tree, reconstructs the back references, and then compares them with the database produced by our algorithm.

5.1 Data Storage and Maintenance

We store the From and To tables as well as the precomputed Combined table (if available) in a custom row-oriented database optimized for efficient insert and query. We use a variant of LSM-Trees [16] to hold the tables. The fundamental property of this structure is that it separates an in-memory write store (WS or C0 in the LSM-Tree terminology) and an on-disk read store (RS or C1).

We accumulate updates to each table in its respective WS, an in-memory balanced tree. Our fsim implementation uses a Berkeley DB 4.7.25 in-memory B-tree database [15], while our btrfs implementation uses red/black trees, but any efficient indexing structure would work. During consistency point creation, we write the contents of the WS into the RS, an on-disk, densely packed B-tree, which uses our own LSM-Tree/Stepped-Merge implementation, described in the next section.

In the original LSM-Tree design, the system selects parts of the WS to write to disk and merges them with the corresponding parts of the RS (indiscriminately merging all nodes of the WS is too inefficient). We cannot use this approach, because we require that a consistency point has all accumulated updates persistent on disk. Our approach is thus more like the Stepped-Merge variant [13], in which the entire WS is written to a new RS run file, resulting in one RS file per consistency point. These RS files are called the Level 0 runs, which are periodically merged into Level 1 runs, and multiple Level 1 runs are merged to produce Level 2 runs, etc., until we get to a large Level N file, where N is fixed. The Stepped-Merge Method uses these intermediate levels to ensure that the sizes of the RS files are manageable. For the back references use case, we found it more practical to retain the Level 0 runs until we run data compaction (described in Section 5.2), at which point we merge all existing Level 0 runs into a single RS (analogous to the Stepped-Merge Level N) and then begin accumulating new Level 0 files at subsequent CPs. We ensure that the individual files are of a manageable size using horizontal partitioning as described in Section 5.3.

Writing Level 0 RS files is efficient, since the records are already sorted in memory, which allows us to construct the compact B-tree bottom-up: The data records are packed densely into pages in the order they appear in the WS, creating a Leaf file. We then create an Internal 1 (I1) file, containing densely packed internal nodes with references to each block in the Leaf file. We continue building I files until we have an I file with only a single block (the root of the B-tree). As we write the Leaf file, we incrementally build the I1 file, and iteratively, as we write I file In to disk, we incrementally build the I(n+1) file in memory, so that writing the I files requires no disk reads.

Queries specify a block or a range of blocks, and those blocks may be present in only some of the Level 0 files that accumulate between data compaction runs. To avoid many unnecessary accesses, the query system maintains a Bloom filter [3] on the RS files that is used to determine which, if any, RS files must be accessed. If the blocks are in the RS, then we position an iterator in the Leaf file on the first block in the query result and retrieve successive records until we have retrieved all the blocks necessary to satisfy the query.

The Bloom filter uses four hash functions, and its default size for From and To RS files depends on the maximum number of operations in a CP. We use 32 KB for 32,000 operations (a typical setting for WAFL), which results in an expected false positive rate of up to 2.4%. If an RS contains a smaller number of records, we appropriately shrink its Bloom filter to save memory. This operation is efficient, since a Bloom filter can be halved in size in linear time [4]. The default filter size is expandable up to 1 MB for a Combined read store. False positives for the latter filter grow with the size of the file system, but this is not a problem, because the Combined RS is involved in almost all queries anyway.

Each time that we remove a block reference, we prune in real time by checking whether the reference was both created and removed during the same interval between two consistency points. If it was, we avoid creating records in the Combined table where from = to. If such a record exists in From, our buffering approach guarantees that the record resides in the in-memory WS, from which it can be easily removed. Conversely, upon block reference addition, we check the in-memory WS for the existence of a corresponding To entry with the same CP number and proactively prune those if they exist (thus a reference that exists between CPs 3 and 4 and is then reallocated in CP 4 will be represented with a single entry in Combined with a lifespan beginning at 3 and continuing to the present). We implement the WS for all the tables as balanced trees sorted first by block, inode, offset, and line, and then by the from and/or to fields, so that it is efficient to perform this proactive pruning.

During normal operation, there is no need to delete tuples from the RS. The masking procedure described in Section 4.2.1 addresses blocks deleted due to snapshot removal. During maintenance operations that relocate blocks, e.g., defragmentation or volume shrinking, it becomes necessary to remove blocks from the RS. Rather than modifying the RS directly, we borrow an idea from the C-Store column-oriented data manager [22] and retain a deletion vector, containing the set of entries that should not appear in the RS. We store this vector as a B-tree index, which is usually small enough to be entirely cached in memory. The query engine then filters records read from the RS according to the deletion vector in a manner that is completely opaque to query processing logic. If the deletion vector becomes sufficiently large, the system can optionally write a new copy of the RS with the deleted tuples removed.
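The sketch below (illustrative only; the real read stores are densely packed on-disk B-trees) shows the per-run Bloom filters used to skip Level 0 runs, using the parameters given above (32 KB of bits, four hash functions), and checks the expected false positive rate for 32,000 insertions.

```python
# Illustrative sketch of the per-run Bloom filters from Section 5.1.
# Each Level 0 run written at a CP gets a filter so that queries can skip
# runs that cannot contain the requested block.
import hashlib, math

M_BITS = 32 * 1024 * 8    # 262,144 bits (32 KB)
K = 4                     # number of hash functions

def _hashes(block, m_bits=M_BITS, k=K):
    for i in range(k):
        digest = hashlib.sha1(f"{i}:{block}".encode()).hexdigest()
        yield int(digest, 16) % m_bits

class RunBloomFilter:
    def __init__(self):
        self.bits = bytearray(M_BITS // 8)
    def add(self, block):
        for h in _hashes(block):
            self.bits[h // 8] |= 1 << (h % 8)
    def might_contain(self, block):
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in _hashes(block))

# Expected false-positive rate for n = 32,000 inserted keys:
n = 32000
fpr = (1 - math.exp(-K * n / M_BITS)) ** K
print(f"expected FPR ~ {fpr:.3f}")    # ~0.022, consistent with the ~2.4% bound

# A query consults the filters first and only opens the matching runs.
runs = [RunBloomFilter() for _ in range(3)]
runs[1].add(103)
print([i for i, r in enumerate(runs) if r.might_contain(103)])   # [1]
```
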

5.2 Database Maintenance

The system periodically compacts the back reference indexes. This compaction merges the existing Level 0 RS's, precomputes the Combined table by joining the From and To tables, and purges records that refer to deleted checkpoints. Merging RS files is efficient, because all the tuples are sorted identically.

After compaction, we are left with one RS containing the complete records in the Combined table and one RS containing the incomplete records in the From table. Figure 4 depicts this compaction process.

Figure 4: Database Maintenance. This query plan merges all on-disk RS's, represented by "From N", precomputes the Combined table, which is the join of the From and To tables, and purges old records. Incomplete records reside in the on-disk From table.

5.3 Horizontal Partitioning

We partition the RS files by block number to ensure that each of the files is of a manageable size. We maintain a single WS per table, but then during a checkpoint, we write the contents of the WS to separate partitions, and compaction processes each partition separately. Note that this arrangement provides the compaction process the option of selectively compacting different partitions. In our current implementation, each partition corresponds to a fixed sequential range of block numbers.

There are several interesting alternatives for partitioning that we plan to explore in future work. We could start with a single partition and then use a threshold-based scheme, creating a new partition when an existing partition exceeds the threshold. A different approach that might better exploit parallelism would be to use hashed partitioning.

Partitioning can also allow us to exploit the parallelism found in today's storage servers: different partitions could reside on different disks or RAID groups and/or could be processed by different CPU cores in parallel.

5.4 Recovery

This back reference design depends on the write-anywhere nature of the file system for its consistency. At each consistency point, we write the WS's to disk and do not consider the CP complete until all the resulting RS's are safely on disk. When the system restarts after a failure, it is thus guaranteed that it finds a consistent file system with consistent back references at a state as of the last complete CP. If the file system has a journal, it can rebuild the WS's together with the other parts of the file system state as the system replays the journal.
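A minimal sketch of the compaction pass described in Section 5.2 (ours, not the system's code): it merges already-sorted runs of Combined tuples and purges records whose lifetime covers no retained version. The survives predicate stands in for the real masking against the set of live snapshots and CPs.

```python
# Illustrative sketch of database compaction (Section 5.2): merge the sorted
# Level 0 runs into a single read store and purge records that only refer to
# deleted consistency points.
import heapq

def compact(runs, live_versions):
    """runs: lists of Combined tuples (block, inode, offset, line, frm, to),
    each already sorted in that key order, as LSM runs are."""
    def survives(rec):
        _, _, _, line, frm, to = rec
        # Keep the record if any retained version of this line falls in [frm, to).
        return any(frm <= v < to for v in live_versions.get(line, ()))
    return [rec for rec in heapq.merge(*runs) if survives(rec)]

run_a = [(100, 2, 0, 0, 4, float("inf"))]
run_b = [(101, 2, 1, 0, 4, 7), (103, 4, 0, 0, 10, 12)]
live = {0: {5, 6, 20, 25}}          # CPs/snapshots still retained in line 0
for rec in compact([run_a, run_b], live):
    print(rec)
# (100, 2, 0, 0, 4, inf)   -- still live
# (101, 2, 1, 0, 4, 7)     -- a retained snapshot (CP 5 or 6) references it
# the (103, ...) record is purged: no retained version falls in [10, 12)
```
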

6 Evaluation

Our goal is that back reference maintenance not interfere with normal file-system processing. Thus, maintaining the back reference database should have minimal overhead that remains stable over time. In addition, we want to confirm that query time is sufficiently low so that utilities such as volume shrinking can use back references freely. Finally, although space overhead is not of primary concern, we want to ensure that we do not consume excessive disk space.

We evaluated our algorithm first on a synthetically generated workload that submits write requests as rapidly as possible. We then proceeded to evaluate our system using NFS traces; we present results using part of the EECS03 data set [10]. Next, we report performance for an implementation of Backlog ported into btrfs. Finally, we present query performance results.

6.1 Experimental Setup

We ran the first part of our evaluation in fsim. We configured the system to be representative of a common write-anywhere file system, WAFL [12]. Our simulation used 4 KB blocks and took a consistency point after every 32,000 block writes or 10 seconds, whichever came first (a common configuration of WAFL). We configured the deduplication parameters based on measurements from a few file servers at NetApp. We treat 10% of incoming blocks as duplicates, resulting in a file system where approximately 75 – 78% of the blocks have reference counts of 1, 18% have reference counts of 2, 5% have reference counts of 3, etc. Our file system kept four hourly and four nightly snapshots.

We ran our simulations on a server with two dual-core Intel Xeon 3.0 GHz CPUs, 10 GB of RAM, running Linux 2.6.28. We stored the back reference meta-data from fsim on a 15K RPM Fujitsu MAX3073RC SAS drive that provides 60 MB/s of write throughput. For the micro-benchmarks, we used a 32 MB cache in addition to the memory consumed by the write stores and the Bloom filters.

We carried out the second part of our evaluation in a modified version of btrfs, in which we replaced the original implementation of back references by Backlog. As btrfs uses extent-based allocation, we added a length field to both the From and To tables, as described in Section 4.1. All fields in back reference records are 64-bit. The resulting From and To tuples are 40 bytes each, and a Combined tuple is 48 bytes long. All btrfs workloads were executed on an Intel Pentium 4 3.0 GHz, 512 MB RAM, running Linux 2.6.31.

6.2 Overhead

We evaluated the overhead of our algorithm in fsim using both synthetically generated workloads and NFS traces. We used the former to understand how our algorithm behaves under high system load and the latter to study lower, more realistic loads.

6.2.1 Synthetic Workload

We experimented with a number of different configurations and found that all of them produced similar results, so we selected one representative workload and used that throughout the rest of this section. We configured our workload generator to perform at least 32,000 block writes between two consistency points, which corresponds to the periods of high load on real systems. We set the rates of file create, delete, and update operations to mirror the rates observed in the EECS03 trace [10]. 90% of our files are small, reflecting what we observe on file systems containing mostly home directories of developers – which is similar to the file system from which the EECS03 trace was gathered. We also introduced creation and deletion of writable clones at a rate of approximately 7 clones per 100 CP's, although the original NFS trace did not have any analogous behavior. This is substantially more clone activity than we would expect in a home-directory workload such as EECS03, so it gives us a pessimal view of the overhead clones impose.

Figure 5: Fsim Synthetic Workload Overhead during Normal Operation. I/O overhead due to maintaining back references normalized per persistent block operation (adding or removing a reference with effects that survive at least one CP) and the time overhead normalized per block operation.

Figure 5 shows how the overhead of maintaining back references changes over time, ignoring the cost of periodic database maintenance. The average cost of a block operation is 0.010 block writes or 8-9 µs per block operation, regardless of whether the operation is adding or removing a reference. A single copy-on-write operation (involving both adding and removing a block from an inode) adds on average 0.020 disk writes and at most 18 µs. This amounts to at most 628 additional writes and 0.5–0.6 seconds per CP. More than 95% of this overhead is CPU time, most of which is spent updating the write store. Most importantly, the overhead is stable over time, and the I/O cost is constant even as the total data on the file system increases.

Figure 6: Fsim Synthetic Workload Database Size. The size of the back reference meta-data as a percentage of the total physical data size as it evolves over time. The disk usage at the end of the workload is 14.2 GB after deduplication.

Figure 6 illustrates meta-data size evolution as a percentage of the total physical data size for two frequencies of maintenance (every 100 or 200 CPs) and for no maintenance at all. The space overhead after maintenance drops consistently to 2.5%–3.5% of the total data size, and this low point does not increase over time.

The database maintenance tool processes the original database at a rate of 7.7 – 10.4 MB/s. In our experiments, compaction reduced the database size by 30 – 50%. The exact percentage depends on the fraction of records that could be purged, which can be quite high if the file system deletes an entire snapshot line as we did in this benchmark.

6.2.2 NFS Traces

We used the first 16 days of the EECS03 trace [10], which captures research activity in home directories of a university computer science department during February and March of 2003. This is a write-rich workload, with one write for every two read operations. Thus, it places more load on Backlog than workloads with higher read/write ratios. We ran the workload with the default configuration of 10 seconds between two consistency points.

Figure 7: Fsim NFS Trace Overhead during Normal Operation. The I/O and time overheads for maintaining back references normalized per block operation (adding or removing a reference).

Figure 7 shows how the overhead changes over time during normal file system operation, omitting the cost of database maintenance. The time overhead is usually between 8 and 9 µs, which is what we saw for the synthetically generated workload, and as we saw there, the overhead remains stable over time. Unlike the overhead observed with the synthetic workload, this workload exhibits occasional spikes and one period where the overhead dips (between hours 200 and 250).

The spikes align with periods of low system load, where the constant part of the CP overhead is amortized across a smaller number of block operations, making the per-block overhead greater. We do not consider this behavior to pose any problem, since the system is under low load during these spikes and thus can better absorb the temporarily increased overhead.

The period of lower time overhead aligns with periods of high system load with a large proportion of setattr commands, most of which are used for file truncation. During this period, we found that only a small fraction of the block operations survive past a consistency point. Thus, the operations in this interval tend to cancel each other out, resulting in smaller time overheads, because we never materialize these references in the read store. This workload exhibits I/O overhead of approximately 0.010 to 0.015 page writes per block operation with occasional spikes, most (but not all) of which align with the periods of low file system load.

Figure 8: Fsim NFS Traces: Space Overhead. The size of the back reference meta-data as a percentage of the total physical data size as it evolves over time. The disk usage at the end of the workload is 11.0 GB after deduplication.

Figure 8 shows how the space overhead evolves over time for the NFS workload. The general growth pattern follows that of the synthetically generated workload with the exception that database maintenance frees less space. This is expected, since unlike the synthetic workload, the NFS trace does not delete entire snapshot lines. The space overhead after maintenance is between 6.1% and 6.3%, and it does not increase over time. The exact magnitude of the space overhead depends on the actual workload, and it is in fact different from the synthetic workload presented in Section 6.2.1. Each maintenance operation completed in less than 25 seconds, which we consider acceptable, given the elapsed time between invocations (8 or 48 hours).

6.3 Performance in btrfs

We validated our simulation results by porting our implementation of Backlog to btrfs. Since btrfs natively supports back references, we had to remove the native implementation, replacing it with our own. We present results for three btrfs configurations—the Base configuration with no back reference support, the Original configuration with native btrfs back reference support, and the Backlog configuration with our implementation. Comparing Backlog to the Base configuration shows the absolute overhead for our back reference implementation. Comparing Backlog to the Original configuration shows the overhead of using a general purpose back reference implementation rather than a customized implementation that is more tightly coupled to the rest of the file system.

    Benchmark                                      Base          Original      Backlog       Overhead
    Creation of a 4 KB file (2048 ops. per CP)     0.89 ms       0.91 ms       0.96 ms       7.9%
    Creation of a 64 KB file (2048 ops. per CP)    2.10 ms       2.11 ms       2.11 ms       1.9%
    Deletion of a 4 KB file (2048 ops. per CP)     0.57 ms       0.59 ms       0.63 ms       11.2%
    Creation of a 4 KB file (8192 ops. per CP)     0.85 ms       0.87 ms       0.87 ms       2.0%
    Creation of a 64 KB file (8192 ops. per CP)    1.91 ms       1.92 ms       1.92 ms       0.6%
    Deletion of a 4 KB file (8192 ops. per CP)     0.45 ms       0.46 ms       0.48 ms       7.1%
    DBench CIFS workload, 4 users                  19.59 MB/s    19.20 MB/s    19.19 MB/s    2.1%
    FileBench /var/mail, 16 threads                852.04 ops/s  835.80 ops/s  836.70 ops/s  1.8%
    PostMark                                       2050 ops/s    2032 ops/s    2020 ops/s    1.5%

Table 1: Btrfs Benchmarks. The Base column refers to a customized version of btrfs, from which we removed its original implementation of back references. The Original column corresponds to the original btrfs back references, and the Backlog column refers to our implementation. The Overhead column is the overhead of Backlog relative to the Base.

Table 1 summarizes the benchmarks we executed on btrfs and the overheads Backlog imposes, relative to baseline btrfs. We ran microbenchmarks of create, delete, and clone operations and three application benchmarks. The create microbenchmark creates a set of 4 KB or 64 KB files in the file system's root directory. After recording the performance of the create microbenchmark, we sync the files to disk. Then, the delete microbenchmark deletes the files just created. We run these microbenchmarks in two different configurations. In the first, we take CPs every 2048 operations, and in the second, we take a CP after 8192 operations. The choice of 8192 operations per CP is still rather conservative, considering that WAFL batches up to 32,000 operations. We also report the case with 2048 operations per CP, which corresponds to periods of light server load, as a point for comparison (and we can thus tolerate higher overheads). We executed each benchmark five times and report the average execution time (including the time to perform sync) divided by the total number of operations.

The first three lines in the table present microbenchmark results of creating and deleting small 4 KB files, and creating 64 KB files, taking a CP (btrfs transaction) every 2048 operations. The second three lines present results for the same microbenchmarks with an inter-CP interval of 8192 operations. We show results for the three btrfs configurations—Base, Original, and Backlog. In general, the Backlog performance for writes is comparable to that of the native btrfs implementation. For 8192 operations per CP, it is marginally slower on creates than the file system with no back references (Base), but comparable to the original btrfs. Backlog is unfortunately slower on deletes – 7% as compared to Base, but only 4.3% slower than the original btrfs. Most of this overhead comes from updating the write-store.

The choice of 4 KB (one file system page) as our file size targets the worst case scenario, in which only a small number of pages are written in any given operation. The overhead decreases to as little as 0.6% for the creation of a 64 KB file, because btrfs writes all of its data in one extent. This generates only a single back reference, and its cost is amortized over a larger number of block I/O operations.

The final three lines in Table 1 present application benchmark results: dbench [8], a CIFS file server workload; FileBench's /var/mail [11] multi-threaded mail server; and PostMark [14], a small file workload. We executed each benchmark on a clean, freshly formatted volume. The application overheads are generally lower (1.5% – 2.1%) than the worst-case microbenchmark overheads (operating on 4 KB files) and in two cases out of three comparable to the original btrfs.

Our btrfs implementation confirms the low overheads predicted via simulation and also demonstrates that Backlog achieves nearly the same performance as the btrfs native implementation. This is a powerful result, as the btrfs implementation is tightly integrated with the btrfs data structures, while Backlog is a general-purpose solution that can be incorporated into any write-anywhere file system.

6.4 Query Performance

We ran an assortment of queries against the back reference database, varying two key parameters, the sequentiality of the requests (expressed as the length of a run) and the number of block operations applied to the database since the last maintenance run. We implement runs with length n by starting at a randomly selected allocated block, b, and returning back references for b and the next n − 1 allocated blocks. This holds the amount of work in each test case constant; we always return n back references, regardless of whether the area of the file system we select is densely or sparsely allocated. It also gives us conservative results, since it always returns data for n back references. By returning the maximum possible number of back references, we perform the maximum number of I/Os that could occur and thus report the lowest query throughput that would be observed.

We cleared both our internal caches and all file system caches before each set of queries, so the numbers we present illustrate worst-case performance. We found the query performance in both the synthetic and NFS workloads to be similar, so we will present only the former for brevity. Figure 9 summarizes the results.

Figure 9: Query Performance. The query performance as a function of run length and the number of CP's since the last maintenance on a 1000 CP-long workload. The plots show data collected from the execution of 8,192 queries with different run lengths.

We saw the best performance, 36,000 queries per second, when performing highly sequential queries immediately after database maintenance. As the time since database maintenance increases, and as the queries become more random, performance quickly drops. We can process 290 single-back-reference queries per second immediately after maintenance, but this drops to 43 – 197 as the interval since maintenance increases. We expect queries for large sorted runs to be the norm for maintenance operations such as defragmentation, indicating that such utilities will experience the higher throughput. Likewise, it is reasonable practice to run database maintenance prior to starting a query intensive task. For example, a tool that defragments a 100 MB region of a disk would issue a sorted run of at most 100 MB / 4 KB = 25,600 queries, which would execute in less than a second on a database immediately after maintenance. The query runs for smaller-scale applications, such as file defragmentation, would vary considerably – anywhere from a few blocks per run on fragmented files to thousands for the ones with a low degree of fragmentation.

Issuing queries in large sorted runs provides two benefits. It increases the probability that two consecutive queries can be satisfied from the same database page, and it reduces the total seek distance between operations. Queries on a recently maintained database are more efficient for two reasons: First, a compacted database occupies fewer RS files, so a query accesses fewer files. Second, the maintenance process shrinks the database size, producing better cache hit ratios.

Figure 10: Query Performance over Time. The evolution of query performance over time on a database 100 CP's after maintenance (left) and immediately after maintenance (right). The horizontal axis is the global CP number at the time the query workload was executed. Each run of queries starts at a randomly chosen physical block.

Figure 10 shows the result of an experiment in which we evaluated 8192 queries every 100 CP's just before and after the database maintenance operation, also scheduled every 100 CP's. The figure shows the improvement in the query performance due to maintenance, but more importantly, it also shows that once the database size reaches a certain point, query throughput levels off, even as the database grows larger.
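A small sketch of the query-run generator used in this experiment (our illustration; the function and parameter names are hypothetical), together with the sizing arithmetic quoted above.

```python
# Illustrative sketch of the run-based query workload from Section 6.4: a run
# of length n starts at a randomly chosen allocated block and returns the back
# references of that block and the next n - 1 allocated blocks.
import bisect, random

def query_run(allocated_blocks, back_ref_index, n):
    """allocated_blocks: sorted list of allocated physical block numbers.
    back_ref_index: mapping block -> list of back reference records."""
    start = random.choice(allocated_blocks)
    i = bisect.bisect_left(allocated_blocks, start)
    run = allocated_blocks[i:i + n]      # may be shorter near the end of the disk
    return {b: back_ref_index.get(b, []) for b in run}

# Sizing example from the text: defragmenting a 100 MB region of 4 KB blocks
# issues a sorted run of at most 100 MB / 4 KB = 25,600 queries.
print((100 * 1024 * 1024) // (4 * 1024))    # 25600
```
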
7 Related Work

Btrfs [2, 5] is the only file system of which we are aware that currently supports back references. Its implementation is efficient, because it is integrated with the entire file system's meta-data management. Btrfs maintains a single B-tree containing all meta-data objects.

A file extent back reference consists of four fields: the subvolume, the inode, the offset, and the number of times the extent is referenced by the inode. Btrfs encapsulates all meta-data operations in transactions analogous to WAFL consistency points. Therefore a btrfs transaction ID is analogous to a WAFL CP number. Btrfs supports efficient cloning by omitting transaction ID's from back reference records, while Backlog uses ranges of snapshot versions (the from and to fields) and structural inheritance. A naïve copy-on-write of an inode in btrfs would create an exact copy of the inode (with the same inode ID), marked with a more recent transaction ID. If the back reference records contained transaction IDs (as in early btrfs designs), the file system would also have to duplicate the back references of all of the extents referenced by the inode. By omitting the transaction ID, a single back reference points to both the old and new versions of the inode simultaneously. Therefore, btrfs performs inode copy-on-write for free, in exchange for query performance degradation, since the file system has to perform additional I/O to determine transaction ID's. In contrast, Backlog enables free copy-on-write by operating on ranges of global CP numbers and by using structural inheritance, which do not sacrifice query performance.

Btrfs accumulates updates to back references in an in-memory balanced tree analogous to our write store. The system inserts all the entries from the in-memory tree into the on-disk tree during a transaction commit (part of checkpoint processing). Btrfs stores most back references directly inside the B-tree records that describe the allocated extents, but on some occasions, it stores them as separate items close to these extent allocation records. This is different from our approach, in which we store all back references together, separately from block allocation bitmaps or records.

Perhaps the most significant difference between btrfs back references and Backlog is that the btrfs approach is deeply enmeshed in the file system design. The btrfs approach would not be possible without the existence of a global meta-store. In contrast, the only assumption necessary for our approach is the use of a write-anywhere or no-overwrite file system. Thus, our approach is easily portable to a broader class of file systems.

7 Related Work

Btrfs [2, 5] is the only file system of which we are aware that currently supports back references. Its implementation is efficient because it is integrated with the file system's entire meta-data management: btrfs maintains a single B-tree containing all meta-data objects.

A btrfs file extent back reference consists of four fields: the subvolume, the inode, the offset, and the number of times the extent is referenced by the inode. Btrfs encapsulates all meta-data operations in transactions analogous to WAFL consistency points, so a btrfs transaction ID is analogous to a WAFL CP number. Btrfs supports efficient cloning by omitting transaction IDs from back reference records, while Backlog uses ranges of snapshot versions (the from and to fields) and structural inheritance; the sketch at the end of this section contrasts the two record shapes. A naïve copy-on-write of an inode in btrfs would create an exact copy of the inode (with the same inode ID), marked with a more recent transaction ID. If the back reference records contained transaction IDs (as in early btrfs designs), the file system would also have to duplicate the back references of all of the extents referenced by the inode. By omitting the transaction ID, a single back reference points to both the old and new versions of the inode simultaneously. Btrfs therefore performs inode copy-on-write for free, in exchange for degraded query performance, since the file system has to perform additional I/O to determine transaction IDs. In contrast, Backlog enables free copy-on-write by operating on ranges of global CP numbers and by using structural inheritance, which do not sacrifice query performance.

Btrfs accumulates updates to back references in an in-memory balanced tree analogous to our write store. The system inserts all the entries from the in-memory tree into the on-disk tree during a transaction commit (part of checkpoint processing). Btrfs stores most back references directly inside the B-tree records that describe the allocated extents, but on some occasions it stores them as separate items close to these extent allocation records. This differs from our approach, in which we store all back references together, separately from block allocation bitmaps or records.

Perhaps the most significant difference between btrfs back references and Backlog is that the btrfs approach is deeply enmeshed in the file system design; it would not be possible without the existence of a global meta-data store. In contrast, the only assumption necessary for our approach is the use of a write-anywhere or no-overwrite file system. Thus, our approach is easily portable to a broader class of file systems.
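The sketch below contrasts the two record shapes discussed in this section. The field names follow the descriptions above, but the concrete layouts are illustrative only and do not reproduce the on-disk formats of either btrfs or Backlog.

    from dataclasses import dataclass

    @dataclass
    class BtrfsExtentBackref:
        # btrfs: four fields and no transaction ID, so one record covers
        # both the old and the new version of an inode after copy-on-write.
        subvolume: int
        inode: int
        offset: int
        refcount: int      # times the extent is referenced by the inode

    @dataclass
    class BacklogBackref:
        # Backlog: the referencing object plus a range of global CP numbers
        # (the from and to fields), so snapshot and clone creation or
        # deletion require no per-block record updates.
        block: int         # physical block number (the query key)
        inode: int
        offset: int
        from_cp: int       # first CP in which this reference is live
        to_cp: int         # CP in which the block was freed (open if still live)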
8 Future Work

The results presented in Section 6 provide compelling evidence that our LSM-Tree based implementation of back references is an efficient and viable approach. Our next steps are to explore options for further reducing the time overheads, to study the implications and effects of horizontal partitioning as described in Section 5.3, and to experiment with compression. Our tables of back reference records appear to be highly compressible, especially if we compress them by column [1]. Compression will cost additional CPU cycles, which must be carefully balanced against the expected improvements in space overhead.

We also plan to explore further uses of back references, implementing defragmentation and other functionality that uses back reference meta-data to efficiently maintain and improve the on-disk organization of data. Finally, we are currently experimenting with using Backlog in an update-in-place journaling file system.

9 Conclusion

As file systems are called upon to provide more sophisticated maintenance, back references represent an important enabling technology. They facilitate hard-to-implement features that involve block relocation, such as shrinking a partition or fast defragmentation, and they enable file system optimizations that involve reasoning about block ownership, such as defragmentation of files that share one or more blocks (Section 3).

We exploit several key aspects of this problem domain to provide an efficient database-style implementation of back references. By separately tracking when blocks come into use (via the From table) and when they are freed (via the To table), and by exploiting the relationship between writable clones and their parents (via structural inheritance), we avoid the cost of updating per-block meta-data on each snapshot or clone creation or deletion. LSM-Trees provide an efficient mechanism for sequentially writing back-reference data to storage. Finally, periodic background maintenance operations amortize the cost of combining this data and removing stale entries.

In our prototype implementation we showed that we can track back references with a low constant overhead of roughly 8–9 µs and 0.010 I/O writes per block operation, and achieve query performance of up to 36,000 queries per second.

10 Acknowledgments

We thank Hugo Patterson, our shepherd, and the anonymous reviewers for their careful and thoughtful reviews of our paper. We also thank the students of CS 261 (Fall 2009, Harvard University), many of whom reviewed our work and provided thoughtful feedback. We thank Alexei Colin for his insight and the experience of porting Backlog to other file systems. This work was made possible thanks to NetApp and its summer internship program.

References

[1] ABADI, D. J., MADDEN, S. R., AND FERREIRA, M. Integrating compression and execution in column-oriented database systems. In SIGMOD (2006), pp. 671–682.
[2] AURORA, V. A short history of btrfs. LWN.net (2009).
[3] BLOOM, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422–426.
[4] BRODER, A., AND MITZENMACHER, M. Network applications of Bloom filters: A survey. Internet Mathematics (2005).
[5] Btrfs. http://btrfs.wiki.kernel.org.
[6] CHAPMAN, A. P., JAGADISH, H. V., AND RAMANAN, P. Efficient provenance storage. In SIGMOD (2008), pp. 993–1006.
[7] CLEMENTS, A. T., AHMAD, I., VILAYANNUR, M., AND LI, J. Decentralized deduplication in SAN cluster file systems. In USENIX Annual Technical Conference (2009), pp. 101–114.
[8] DBench. http://samba.org/ftp/tridge/dbench/.
[9] EDWARDS, J. K., ELLARD, D., EVERHART, C., FAIR, R., HAMILTON, E., KAHN, A., KANEVSKY, A., LENTINI, J., PRAKASH, A., SMITH, K. A., AND ZAYAS, E. R. FlexVol: Flexible, efficient file volume virtualization in WAFL. In USENIX ATC (2008), pp. 129–142.
[10] ELLARD, D., AND SELTZER, M. New NFS tracing tools and techniques for system analysis. In LISA (Oct. 2003), pp. 73–85.
[11] FileBench. http://www.solarisinternals.com/wiki/index.php/FileBench/.
[12] HITZ, D., LAU, J., AND MALCOLM, M. A. File system design for an NFS file server appliance. In USENIX Winter (1994), pp. 235–246.
[13] JAGADISH, H. V., NARAYAN, P. P. S., SESHADRI, S., SUDARSHAN, S., AND KANNEGANTI, R. Incremental organization for data recording and warehousing. In VLDB (1997), pp. 16–25.
[14] KATCHER, J. PostMark: A new file system benchmark. NetApp Technical Report TR3022 (1997).
[15] OLSON, M. A., BOSTIC, K., AND SELTZER, M. I. Berkeley DB. In USENIX ATC (June 1999).
[16] O'NEIL, P. E., CHENG, E., GAWLICK, D., AND O'NEIL, E. J. The log-structured merge-tree (LSM-Tree). Acta Informatica 33, 4 (1996), 351–385.
[17] QUINLAN, S., AND DORWARD, S. Venti: A new approach to archival storage. In USENIX FAST (2002), pp. 89–101.
[18] RODEH, O. B-trees, shadowing, and clones. ACM Transactions on Storage 3, 4 (2008).
[19] ROSENBLUM, M., AND OUSTERHOUT, J. K. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems 10, 1 (1992), 26–52.
[20] SCHLOSSER, S. W., SCHINDLER, J., PAPADOMANOLAKIS, S., SHAO, M., AILAMAKI, A., FALOUTSOS, C., AND GANGER, G. R. On multidimensional data and modern disks. In USENIX FAST (2005), pp. 225–238.
[21] SELTZER, M. I., BOSTIC, K., MCKUSICK, M. K., AND STAELIN, C. An implementation of a log-structured file system for UNIX. In USENIX Winter (1993), pp. 307–326.
[22] STONEBRAKER, M., ABADI, D. J., BATKIN, A., CHEN, X., CHERNIACK, M., FERREIRA, M., LAU, E., LIN, A., MADDEN, S. R., O'NEIL, E. J., O'NEIL, P. E., RASIN, A., TRAN, N., AND ZDONIK, S. B. C-Store: A column-oriented DBMS. In VLDB (2005), pp. 553–564.
[23] ZFS at OpenSolaris community. http://opensolaris.org/os/community/zfs/.

NetApp, the NetApp logo, Go further, faster, and WAFL are trademarks or registered trademarks of NetApp, Inc. in the U.S. and other countries.