Tracking Back References in a Write-Anywhere File System

Peter Macko, Harvard University ([email protected])
Margo Seltzer, Harvard University ([email protected])
Keith A. Smith, NetApp, Inc. ([email protected])

Abstract

Many file systems reorganize data on disk, for example to defragment storage, shrink volumes, or migrate data between different classes of storage. Advanced file system features such as snapshots, writable clones, and deduplication make these tasks complicated, as moving a single block may require finding and updating dozens, or even hundreds, of pointers to it.

We present Backlog, an efficient implementation of explicit back references, to address this problem. Back references are file system meta-data that map physical block numbers to the data objects that use them. We show that by using LSM-Trees and exploiting the write-anywhere behavior of modern file systems such as NetApp® WAFL® or btrfs, we can maintain back reference meta-data with minimal overhead (one extra disk I/O per 10² block operations) and provide excellent query performance for the common case of queries covering ranges of physically adjacent blocks.

1 Introduction

Today's file systems such as WAFL [12], btrfs [5], and ZFS [23] have moved beyond merely providing reliable storage to providing useful services, such as snapshots and deduplication. In the presence of these services, any data block can be referenced by multiple snapshots, multiple files, or even multiple offsets within a file. This complicates any operation that must efficiently determine the set of objects referencing a given block, for example when updating the pointers to a block that has moved during defragmentation or volume resizing. In this paper we present new file system structures and algorithms to facilitate such dynamic reorganization of file system data in the presence of block sharing.

In many problem domains, a layer of indirection provides a simple way to relocate objects in memory or on storage without updating any pointers held by users of the objects. Such virtualization would help with some of the use cases of interest, but it is insufficient for one of the most important—defragmentation.

Defragmentation can be a particularly important issue for file systems that implement block sharing to support snapshots, deduplication, and other features. While block sharing offers great savings in space efficiency, sub-file sharing of blocks necessarily introduces on-disk fragmentation. If two files share a subset of their blocks, it is impossible for both files to have a perfectly sequential on-disk layout.

Block sharing also makes it harder to optimize on-disk layout. When two files share blocks, defragmenting one file may hurt the layout of the other file. A better approach is to make reallocation decisions that are aware of block sharing relationships between files and can make more intelligent optimization decisions, such as prioritizing which files get defragmented, selectively breaking block sharing, or co-locating related files on the disk.

These decisions require that when we defragment a file, we determine its new layout in the context of other files with which it shares blocks. In other words, given the blocks in one file, we need to determine the other files that share those blocks. This is the key obstacle to using virtualization to enable block reallocation, as it would hide this mapping from physical blocks to the files that reference them. Thus we have sought a technique that will allow us to track, rather than hide, this mapping, while imposing minimal performance impact on common file operations. Our solution is to introduce and maintain back references in the file system.

Back references are meta-data that map physical block numbers to their containing objects. Such back references are essentially inverted indexes on the traditional file system meta-data that maps file offsets to physical blocks. The challenge in using back references to simplify maintenance operations, such as defragmentation, is in maintaining them efficiently.

We have designed Log-Structured Back References, or Backlog for short, a write-optimized back reference implementation with small, predictable overhead that remains stable over time. Our approach requires no disk reads to update the back reference database on block allocation, reallocation, or deallocation. We buffer updates in main memory and efficiently apply them en masse to the on-disk database during file system consistency points (checkpoints). Maintaining back references in the presence of snapshot creation, cloning or deletion incurs no additional I/O overhead. We use database compaction to reclaim space occupied by records referencing deleted snapshots. The only time that we read data from disk is during data compaction, which is an infrequent activity, and in response to queries for which the data is not currently in memory.

Figure 1: File System as a Tree. The conceptual view of a file system as a tree rooted at the volume root (superblock) [18], which is a parent of all inodes. An inode is a parent of data blocks and/or indirect blocks.

We present a brief overview of write-anywhere file systems in Section 2. Section 3 outlines the use cases that motivate our work and describes some of the challenges of handling them in a write-anywhere file system. We describe our design in Section 4 and our implementation in Section 5. We evaluate the maintenance overheads and query performance in Section 6. We present related work in Section 7, discuss future work in Section 8, and conclude in Section 9.
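As a concrete illustration of this inversion, the sketch below (our illustration, not code from the system) contrasts the forward map stored in ordinary file system meta-data, which goes from (inode, offset) to a physical block, with the back reference index, which goes from a physical block to the set of objects that use it. The names forward_map and back_refs are purely hypothetical.

```python
# Illustrative sketch: back references as an inverted index over the
# file system's forward block pointers.
from collections import defaultdict

# Forward meta-data: (inode, offset) -> physical block number.
forward_map = {
    (2, 0): 100,   # inode 2, offset 0 stored in block 100
    (2, 1): 101,
    (7, 0): 100,   # deduplicated: inode 7 shares block 100 with inode 2
}

# Back references: physical block number -> objects that reference it.
back_refs = defaultdict(set)
for (inode, offset), block in forward_map.items():
    back_refs[block].add((inode, offset))

# The query a maintenance tool needs: "tell me all the objects containing
# this block", e.g. before relocating block 100 during defragmentation.
print(sorted(back_refs[100]))   # -> [(2, 0), (7, 0)]
```
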

2 Background

Our work focuses specifically on tracking back references in write-anywhere (or no-overwrite) file systems, such as btrfs [5] or WAFL [12]. The terminology across such file systems has not yet been standardized; in this work we use WAFL terminology unless stated otherwise.

Write-anywhere file systems can be conceptually modeled as trees [18]. Figure 1 depicts a file system tree rooted at the volume root or superblock. Inodes are the immediate children of the root, and they in turn are parents of indirect blocks and/or data blocks. Many modern file systems also represent inodes, free space bitmaps, and other meta-data as hidden files (not shown in the figure), so every allocated block with the exception of the root has a parent inode.

Write-anywhere file systems never update a block in place. When overwriting a file, they write the new file data to newly allocated disk blocks, recursively updating the appropriate pointers in the parent blocks. Figure 2 illustrates this process. This recursive chain of updates is expensive if it occurs at every write, so the file system accumulates updates in memory and applies them all at once during a consistency point (CP or checkpoint). The file system writes the root node last, ensuring that it represents a consistent set of data structures. In the case of failure, the file system is guaranteed to find a consistent state with contents as of the last CP. File systems that support journaling to stable storage (disk or NVRAM) can then recover data written since the last checkpoint by replaying the log.

Figure 2: Write-Anywhere File System Maintenance. In write-anywhere file systems, block updates generate new block copies. For example, upon updating the block "Data 2", the file system writes the new data to a new block and then recursively updates the blocks that point to it – all the way to the volume root.

Write-anywhere file systems can capture snapshots, point-in-time copies of previous file system states, by preserving the file system images from past consistency points. These snapshots are space efficient; the only differences between a snapshot and the live file system are the blocks that have changed since the snapshot copy was created. In essence, a write-anywhere allocation policy implements copy-on-write as a side effect of its normal operation.

Many systems preserve a limited number of the most recent consistency points, promoting some to hourly, daily, weekly, etc. snapshots. An asynchronous process typically reclaims space by deleting old CPs, reclaiming blocks whose only references were from deleted CPs.

Several file systems, such as WAFL and ZFS, can create writable clones of snapshots, which are especially useful in development (such as the creation of a writable duplicate for testing of a production database) and virtualization [9].

It is helpful to conceptualize a set of snapshots and consistency points in terms of lines, as illustrated in Figure 3. A time-ordered set of snapshots of a file system forms a single line, while creation of a writable clone starts a new line. In this model, a (line ID, version) pair uniquely identifies a snapshot or a consistency point. In the rest of the paper, we use the global consistency point number during which a snapshot or consistency point was created as its version number.

Figure 3: Snapshot Lines. The tuple (line, version), where version is a global CP number, uniquely identifies a snapshot or consistency point. Taking a consistency point creates a new version of the latest snapshot within each line, while creating a writable clone of an existing snapshot starts a new line.

The use of copy-on-write to implement snapshots and clones means that a single physical block may belong to multiple file system trees and have many meta-data blocks pointing to it. In Figure 2, for example, two different indirect blocks, I-Block 2 and I-Block 2', reference the block Data 1. Block-level deduplication [7, 17] can further increase the number of pointers to a block by allowing files containing identical data blocks to share a single on-disk copy of the block. This block sharing presents a challenge for file system management operations, such as defragmentation or data migration, that reorganize blocks on disk. If the file system moves a block, it will need to find and update all of the pointers to that block.

3 Use Cases

The goal of Backlog is to maintain meta-data that facilitates the dynamic movement and reorganization of data in write-anywhere file systems. We envision two major cases for internal data reorganization in a file system. The first is support for bulk data migration. This is useful when we need to move all of the data off of a device (or a portion of a device), such as when shrinking a volume or replacing hardware. The challenge here for traditional file system designs is translating from the physical block addresses we are moving to the files referencing those blocks so we can update their block pointers. Such a file system can, for example, do this only by traversing the entire file system tree searching for block pointers that fall in the target range [2]. In a large file system, the I/O required for this brute-force approach is prohibitive.

Our second use case is the dynamic reorganization of on-disk data. This is traditionally thought of as defragmentation—reallocating files on-disk to achieve contiguous layout. We consider this use case more broadly to include tasks such as free space coalescing (to create contiguous expanses of free blocks for the efficient layout of new files) and the migration of individual files between different classes of storage in a file system.

To support these data movement functions in write-anywhere file systems, we must take into account the block sharing that emerges from features such as snapshots and clones, as well as from the deduplication of identical data blocks [7, 17]. This block sharing makes defragmentation both more important and more challenging than in traditional file system designs. Fragmentation is a natural consequence of block sharing; two files that share a subset of their blocks cannot both have an ideal sequential layout. And when we move a shared block during defragmentation, we face the challenge of finding and updating pointers in multiple files.

Consider a basic defragmentation scenario where we are trying to reallocate the blocks of a single file. This is simple to handle. We find the file's blocks by reading the indirect block tree for the file. Then we move the blocks to a new, contiguous, on-disk location, updating the pointer to each block as we move it.

But things are more complicated if we need to defragment two files that share one or more blocks, a case that might arise when multiple virtual machine images are cloned from a single master image. If we defragment the files one at a time, as described above, the shared blocks will ping-pong back and forth between the files as we defragment one and then the other. A better approach is to make reallocation decisions that are aware of the sharing relationship. There are multiple ways we might do this. We could select the most important file and only optimize its layout. Or we could decide that performance is more important than space savings and make duplicate copies of the shared blocks to allow sequential layout for all of the files that use them. Or we might apply multi-dimensional layout techniques [20] to achieve near-optimal layouts for both files while still preserving block sharing.

The common theme in all of these approaches to layout optimization is that when we defragment a file, we must determine its new layout in the context of the other files with which it shares blocks. Thus we have sought a technique that will allow us to easily map physical blocks to the files that use them, while imposing minimal performance impact on common file system operations. Our solution is to introduce and maintain back reference meta-data to explicitly track all of the logical owners of each physical data block.

4 Log-Structured Back References

Back references are updated significantly more frequently than they are queried; they must be updated on every block allocation, deallocation, or reallocation. It is crucial that they impose only a small performance overhead that does not increase with the age of the file system. Fortunately, it is not a requirement that the meta-data be space efficient, since disk is relatively inexpensive.

In this section, we present Log-Structured Back References (Backlog). We present our design in two parts. First, we present the conceptual design, which provides a simple model of back references and their use in querying. We then present a design that achieves the capabilities of the conceptual design efficiently.

4.1 Conceptual Design

A naïve approach to maintaining back references requires that we write a back reference record for every block at every consistency point. Such an approach would be prohibitively expensive both in terms of disk usage and performance overhead. Using the observation that a given block and its back references may remain unchanged for many consistency points, we improve upon this naïve representation by maintaining back references over ranges of CPs. We represent every such back reference as a record with the following fields:

• block: The physical block number
• inode: The inode number that references the block
• offset: The offset within the inode
• line: The line of snapshots that contains the inode
• from: The global CP number (time epoch) from which this record is valid (i.e., when the reference was allocated to the inode)
• to: The global CP number until which the record is valid (exclusive), or ∞ if the record is still alive

For example, the following table describes two blocks owned by inode 2, created at time 4 and truncated to one block at time 7:

    block  inode  offset  line  from  to
      100      2       0     0     4   ∞
      101      2       1     0     4   7

Although we present this representation as operating at the level of blocks, it can be extended to include a length field to operate on extents.

Let us now consider how a table of these records, indexed by physical block number, lets us answer the sort of query we encounter in file system maintenance. Imagine that we have previously run a deduplication process and found that many files contain a block of all 0's. We stored one copy of that block on disk and now have multiple inodes referencing that block. Now, let's assume that we wish to move the physical location of that block of 0's in order to shrink the size of the volume on which it lives. First we need to identify all the files that reference this block, so that when we relocate the block, we can update their meta-data to reference the new location. Thus, we wish to query the back references to answer the question, "Tell me all the objects containing this block." More generally, we may want to ask this query for a range of physical blocks. Such queries translate easily into indexed lookups on the structure described above. We use the physical block number as an index to locate all the records for the given physical block number. Those records identify all the objects that reference the block and all versions in which those blocks are valid.

Unfortunately, this representation, while elegantly simple, would perform abysmally. Consider what is required for common operations. Every block deallocation requires replacing the ∞ in the to field with the current CP number, translating into a read-modify-write on this table. Block allocation requires creating a new record, translating into an insert into the table. Block reallocation requires both a deallocation and an allocation, and thus a read-modify-write and an insert. We ran experiments with this approach and found that the file system slowed down to a crawl after only a few hundred consistency points. Providing back references with acceptable overhead during normal operation requires a feasible design that efficiently realizes the conceptual model described in this section.

4.2 Feasible Design

Observe that records in the conceptual table described in Section 4.1 are of two types. Complete records refer to blocks that are no longer part of the live file system; they exist only in snapshots. Such blocks are identified by having to < ∞. Incomplete records are part of the live file system and always have to = ∞. Our actual design maintains two separate tables, From and To. Both tables contain the first four columns of the conceptual table (block, inode, offset, and line). The From table also contains the from column, and the To table contains the to column. Incomplete records exist only in the From table, while complete records appear in both tables.

On a block allocation, regardless of whether the block is newly allocated or reallocated, we insert the corresponding entry into the From table with the from field set to the current global CP number, creating an incomplete record. When a reference is removed, we insert the appropriate entry into the To table, completing the record. We buffer new records in memory, committing them to disk at the end of the current CP, which guarantees that all entries with the current global CP number are present in memory. This facilitates pruning records where from = to, which refer to block references that were added and removed within the same CP.

For example, the Conceptual table from the previous subsection (describing the two blocks of inode 2) is broken down as follows:

    From:  block  inode  offset  line  from
             100      2       0     0     4
             101      2       1     0     4

    To:    block  inode  offset  line  to
             101      2       1     0   7

The record for block 101 is complete (it has both From and To entries), while the record for block 100 is incomplete (the block is currently allocated).

This design naturally handles block sharing arising from deduplication. When the file system detects that a newly written block is a duplicate of an existing on-disk block, it adds a pointer to that block and creates an entry in the From table corresponding to the new reference.

4.2.1 Joining the Tables

The conceptual table on which we want to query is the outer join of the From and To tables. A tuple F ∈ From joins with a tuple T ∈ To that has the same first four fields and that has the smallest value of T.to such that F.from < T.to. If there is a From entry without a matching To entry (i.e., a live, incomplete record), we outer-join it with an implicitly-present tuple T0 ∈ To with T0.to = ∞.

For example, assume that a file with inode 4 was created at time 10 with one block and then truncated at time 12. Then, the same block was assigned to the file at time 16, and the file was removed at time 20. Later on, the same block was allocated to a different file at time 30. These operations produce the following records:

    From:  block  inode  offset  line  from
             103      4       0     0    10
             103      4       0     0    16
             103      5       2     0    30

    To:    block  inode  offset  line  to
             103      4       0     0  12
             103      4       0     0  20

Observe that the first From and the first To record form a logical pair describing a single interval during which the block was allocated to inode 4. To reconstruct the history of this block allocation, a record with from = 10 has to join with to = 12. Similarly, the second From record should join with the second To record. The third From entry does not have a corresponding To entry, so it joins with an implicit entry with to = ∞.

The result of this outer join is the Conceptual view. Every tuple C ∈ Conceptual has both from and to fields, which together represent a range of global CP numbers within the given snapshot line, during which the specified block is referenced by the given inode from the given file offset. The range might include deleted consistency points or snapshots, so we must apply a mask of the set of valid versions before returning query results.

Coming back to our previous example, performing an outer join on these tables produces:

    block  inode  offset  line  from  to
      103      4       0     0    10  12
      103      4       0     0    16  20
      103      5       2     0    30   ∞

This design is feasible until we introduce writable clones. In the rest of this section, we explain how we have to modify the conceptual view to address them. Then, in Section 5, we discuss how we realize this design efficiently.

4.2.2 Representing Writable Clones

Writable clones pose a challenge in realizing the conceptual design. Consider a snapshot (l, v), where l is the line and v is the version or CP. Naïvely creating a writable clone (l′, v′) requires that we duplicate all back references that include (l, v) (that is, C.line = l ∧ C.from ≤ v < C.to, where C ∈ Conceptual), updating the line field to l′ and the from and to fields to represent all versions (range 0 − ∞). Using this technique, the conceptual table would continue to be the result of the outer join of the From and To tables, and we could express queries directly on the conceptual table. Unfortunately, this mass duplication is prohibitively expensive. Thus, our actual design cannot simply rely on the conceptual table. Instead we implicitly represent writable clones in the database using structural inheritance [6], a technique akin to copy-on-write. This avoids the massive duplication in the naïve approach.

The implicit representation assumes that every block of (l, v) is present in all subsequent versions of l′, unless explicitly overridden. When we modify a block, b, in a new writable clone, we do two things: First, we declare the end of b's lifetime by writing an entry in the To table recording the current CP. Second, we record the allocation of the new block b′ (a copy-on-write of b) by adding an entry into the From table.

For example, if the old block b = 103 was originally allocated at time 30 in line l = 0 and was replaced by a new block b′ = 107 at time 43 in line l′ = 1, the system produces the following records:

    From:  block  inode  offset  line  from
             103      5       2     0    30
             107      5       2     1    43

    To:    block  inode  offset  line  to
             103      5       2     1  43

The entry in the To table overrides the inheritance from the previous snapshot; however, notice that this new To entry now has no element in the From table with which to join, since no entry in the From table exists with the line l′ = 1. We join such entries with an implicit entry in the From table with from = 0. With the introduction of structural inheritance and implicit records in the From table, our joined table no longer matches our conceptual table. To distinguish the conceptual table from the actual result of the join, we call the join result the Combined table.

Summarizing, a back reference record C ∈ Combined of (l, v) is implicitly present in all versions of l′, unless there is an overriding record C′ ∈ Combined with C.block = C′.block ∧ C.inode = C′.inode ∧ C.offset = C′.offset ∧ C′.line = l′ ∧ C′.from = 0. If such a C′ record exists, then it defines the versions of l′ for which the back reference is valid (i.e., from C′.from to C′.to). The file system continues to maintain back references as usual by inserting the appropriate From and To records in response to allocation, deallocation and reallocation operations.

While the Combined table avoids the massive copy when creating writable clones, query execution becomes a bit more complicated. After extracting an initial result from the Combined table, we must iteratively expand those results as follows. Let Initial be the initial result extracted from Combined containing all records that correspond to blocks b0, . . . , bn. If any of the blocks bi has one or more override records, they are all guaranteed to be in this initial result. We then initialize the query Result to contain all records in Initial and proceed as follows. For every record R ∈ Result that references a snapshot (l, v) that was cloned to produce (l′, v′), we check for the existence of a corresponding override record C′ ∈ Initial with C′.line = l′. If no such record exists, we explicitly add a record with C′.line ← l′, C′.from ← 0 and C′.to ← ∞ to Result. This process repeats recursively until it fails to insert additional records. Finally, when the result is fully expanded, we mask the ranges to remove references to deleted snapshots as described in Section 4.2.1.

This approach requires that we never delete the back references for a cloned snapshot. Consequently, snapshot deletion checks whether the snapshot has been cloned, and if it has, it adds the snapshot ID to the list of zombies, ensuring that its back references are not purged during maintenance. The file system is then free to proceed with snapshot deletion. Periodically we examine the list of zombies and drop snapshot IDs that have no remaining descendants (clones).
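The following sketch (ours, not the system's implementation) illustrates the outer join of Section 4.2.1 and reproduces the inode-4 example; it omits the implicit from = 0 records used for clones and the masking of deleted snapshots.

```python
# Sketch of the outer join from Section 4.2.1 (illustrative implementation).
# A From tuple joins with the To tuple that has the same (block, inode,
# offset, line) and the smallest to such that from < to; a From tuple with
# no matching To joins an implicit to = infinity.
INF = float("inf")

def combine(from_rows, to_rows):
    """from_rows/to_rows: lists of (block, inode, offset, line, cp) tuples."""
    combined = []
    for (blk, ino, off, line, frm) in from_rows:
        candidates = [t for (b, i, o, l, t) in to_rows
                      if (b, i, o, l) == (blk, ino, off, line) and frm < t]
        to = min(candidates) if candidates else INF
        combined.append((blk, ino, off, line, frm, to))
    return combined

# The example from Section 4.2.1: block 103 is used by inode 4 during
# CPs [10, 12) and [16, 20), then by inode 5 from CP 30 onward.
From = [(103, 4, 0, 0, 10), (103, 4, 0, 0, 16), (103, 5, 2, 0, 30)]
To   = [(103, 4, 0, 0, 12), (103, 4, 0, 0, 20)]

for row in combine(From, To):
    print(row)
# (103, 4, 0, 0, 10, 12)
# (103, 4, 0, 0, 16, 20)
# (103, 5, 2, 0, 30, inf)
```
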

5 Implementation

With the feasible design in hand, we now turn towards the problem of efficiently realizing the design. First we discuss our implementation strategy and then discuss our on-disk data storage (Section 5.1). We then proceed to discuss database compaction and maintenance (Section 5.2), partitioning the tables (Section 5.3), and recovering the tables after system failure (Section 5.4). We implemented and evaluated the system in fsim, our custom file system simulator, and then replaced the native back reference support in btrfs with Backlog.

The implementation in fsim allows us to study the new feature in isolation from the rest of the file system. Thus, we fully realize the implementation of the back reference system, but embed it in a simulated file system rather than a real file system, allowing us to consider a broad range of file systems rather than a single specific implementation. Fsim simulates a write-anywhere file system with writable snapshots and deduplication. It exports an interface for creating, deleting, and writing to files, and an interface for managing snapshots, which are controlled either by a stochastic workload generator or an NFS trace player. It stores all file system meta-data in main memory, but it does not explicitly store any data blocks. It stores only the back reference meta-data on disk. Fsim also provides two parameters to configure deduplication emulation. The first specifies the percentage of newly created blocks that duplicate existing blocks. The second specifies the distribution of how those duplicate blocks are shared.

We implement back references as a set of callback functions on the following events: adding a block reference, removing a block reference, and taking a consistency point. The first two callbacks accumulate updates in main memory, while the consistency point callback writes the updates to stable storage, as described in the next section. We implement the equivalent of a user-level process to support database maintenance and query. We verify the correctness of our implementation by a utility program that walks the entire file system tree, reconstructs the back references, and then compares them with the database produced by our algorithm.

5.1 Data Storage and Maintenance

We store the From and To tables as well as the precomputed Combined table (if available) in a custom row-oriented database optimized for efficient insert and query. We use a variant of LSM-Trees [16] to hold the tables. The fundamental property of this structure is that it separates an in-memory write store (WS or C0 in the LSM-Tree terminology) and an on-disk read store (RS or C1).

We accumulate updates to each table in its respective WS, an in-memory balanced tree. Our fsim implementation uses a Berkeley DB 4.7.25 in-memory B-tree database [15], while our btrfs implementation uses red/black trees, but any efficient indexing structure would work. During consistency point creation, we write the contents of the WS into the RS, an on-disk, densely packed B-tree, which uses our own LSM-Tree/Stepped-Merge implementation, described in the next section.

In the original LSM-Tree design, the system selects parts of the WS to write to disk and merges them with the corresponding parts of the RS (indiscriminately merging all nodes of the WS is too inefficient). We cannot use this approach, because we require that a consistency point has all accumulated updates persistent on disk. Our approach is thus more like the Stepped-Merge variant [13], in which the entire WS is written to a new RS run file, resulting in one RS file per consistency point. These RS files are called the Level 0 runs, which are periodically merged into Level 1 runs, and multiple Level 1 runs are merged to produce Level 2 runs, etc., until we get to a large Level N file, where N is fixed. The Stepped-Merge Method uses these intermediate levels to ensure that the sizes of the RS files are manageable. For the back references use case, we found it more practical to retain the Level 0 runs until we run data compaction (described in Section 5.2), at which point we merge all existing Level 0 runs into a single RS (analogous to the Stepped-Merge Level N) and then begin accumulating new Level 0 files at subsequent CPs. We ensure that the individual files are of a manageable size using horizontal partitioning as described in Section 5.3.

Writing Level 0 RS files is efficient, since the records are already sorted in memory, which allows us to construct the compact B-tree bottom-up: The data records are packed densely into pages in the order they appear in the WS, creating a Leaf file. We then create an Internal 1 (I1) file, containing densely packed internal nodes with references to each block in the Leaf file. We continue building I files until we have an I file with only a single block (the root of the B-tree). As we write the Leaf file, we incrementally build the I1 file, and iteratively, as we write I file In to disk, we incrementally build the I(n+1) file in memory, so that writing the I files requires no disk reads.

Queries specify a block or a range of blocks, and those blocks may be present in only some of the Level 0 files that accumulate between data compaction runs. To avoid many unnecessary accesses, the query system maintains a Bloom filter [3] on the RS files that is used to determine which, if any, RS files must be accessed. If the blocks are in the RS, then we position an iterator in the Leaf file on the first block in the query result and retrieve successive records until we have retrieved all the blocks necessary to satisfy the query.

The Bloom filter uses four hash functions, and its default size for From and To RS files depends on the maximum number of operations in a CP. We use 32 KB for 32,000 operations (a typical setting for WAFL), which results in an expected false positive rate of up to 2.4%. If an RS contains a smaller number of records, we appropriately shrink its Bloom filter to save memory. This operation is efficient, since a Bloom filter can be halved in size in linear time [4]. The default filter size is expandable up to 1 MB for a Combined read store. False positives for the latter filter grow with the size of the file system, but this is not a problem, because the Combined RS is involved in almost all queries anyway.

Each time that we remove a block reference, we prune in real time by checking whether the reference was both created and removed during the same interval between two consistency points. If it was, we avoid creating records in the Combined table where from = to. If such a record exists in From, our buffering approach guarantees that the record resides in the in-memory WS, from which it can be easily removed. Conversely, upon block reference addition, we check the in-memory WS for the existence of a corresponding To entry with the same CP number and proactively prune those if they exist (thus a reference that exists between CPs 3 and 4 and is then reallocated in CP 4 will be represented with a single entry in Combined with a lifespan beginning at 3 and continuing to the present). We implement the WS for all the tables as balanced trees sorted first by block, inode, offset, and line, and then by the from and/or to fields, so that it is efficient to perform this proactive pruning.

During normal operation, there is no need to delete tuples from the RS. The masking procedure described in Section 4.2.1 addresses blocks deleted due to snapshot removal. During maintenance operations that relocate blocks, e.g., defragmentation or volume shrinking, it becomes necessary to remove blocks from the RS. Rather than modifying the RS directly, we borrow an idea from the C-Store column-oriented data manager [22] and retain a deletion vector, containing the set of entries that should not appear in the RS. We store this vector as a B-tree index, which is usually small enough to be entirely cached in memory. The query engine then filters records read from the RS according to the deletion vector in a manner that is completely opaque to query processing logic. If the deletion vector becomes sufficiently large, the system can optionally write a new copy of the RS with the deleted tuples removed.
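The sketch below (illustrative only; the real read stores are densely packed on-disk B-trees) shows the per-run Bloom filters used to skip Level 0 runs, using the parameters given above (32 KB of bits, four hash functions), and checks the expected false positive rate for 32,000 insertions.

```python
# Illustrative sketch of the per-run Bloom filters from Section 5.1.
# Each Level 0 run written at a CP gets a filter so that queries can skip
# runs that cannot contain the requested block.
import hashlib, math

M_BITS = 32 * 1024 * 8    # 262,144 bits (32 KB)
K = 4                     # number of hash functions

def _hashes(block, m_bits=M_BITS, k=K):
    for i in range(k):
        digest = hashlib.sha1(f"{i}:{block}".encode()).hexdigest()
        yield int(digest, 16) % m_bits

class RunBloomFilter:
    def __init__(self):
        self.bits = bytearray(M_BITS // 8)
    def add(self, block):
        for h in _hashes(block):
            self.bits[h // 8] |= 1 << (h % 8)
    def might_contain(self, block):
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in _hashes(block))

# Expected false-positive rate for n = 32,000 inserted keys:
n = 32000
fpr = (1 - math.exp(-K * n / M_BITS)) ** K
print(f"expected FPR ~ {fpr:.3f}")    # ~0.022, consistent with the ~2.4% bound

# A query consults the filters first and only opens the matching runs.
runs = [RunBloomFilter() for _ in range(3)]
runs[1].add(103)
print([i for i, r in enumerate(runs) if r.might_contain(103)])   # [1]
```
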

5.2 Database Maintenance

The system periodically compacts the back reference indexes. This compaction merges the existing Level 0 RS's, precomputes the Combined table by joining the From and To tables, and purges records that refer to deleted checkpoints. Merging RS files is efficient, because all the tuples are sorted identically.

After compaction, we are left with one RS containing the complete records in the Combined table and one RS containing the incomplete records in the From table. Figure 4 depicts this compaction process.

Figure 4: Database Maintenance. This query plan merges all on-disk RS's, represented by "From N", precomputes the Combined table, which is the join of the From and To tables, and purges old records. Incomplete records reside in the on-disk From table.

5.3 Horizontal Partitioning

We partition the RS files by block number to ensure that each of the files is of a manageable size. We maintain a single WS per table, but then during a checkpoint, we write the contents of the WS to separate partitions, and compaction processes each partition separately. Note that this arrangement provides the compaction process the option of selectively compacting different partitions. In our current implementation, each partition corresponds to a fixed sequential range of block numbers.

There are several interesting alternatives for partitioning that we plan to explore in future work. We could start with a single partition and then use a threshold-based scheme, creating a new partition when an existing partition exceeds the threshold. A different approach that might better exploit parallelism would be to use hashed partitioning.

Partitioning can also allow us to exploit the parallelism found in today's storage servers: different partitions could reside on different disks or RAID groups and/or could be processed by different CPU cores in parallel.

5.4 Recovery

This back reference design depends on the write-anywhere nature of the file system for its consistency. At each consistency point, we write the WS's to disk and do not consider the CP complete until all the resulting RS's are safely on disk. When the system restarts after a failure, it is thus guaranteed that it finds a consistent file system with consistent back references at a state as of the last complete CP. If the file system has a journal, it can rebuild the WS's together with the other parts of the file system state as the system replays the journal.
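A minimal sketch of the compaction pass described in Section 5.2 (ours, not the system's code): it merges already-sorted runs of Combined tuples and purges records whose lifetime covers no retained version. The survives predicate stands in for the real masking against the set of live snapshots and CPs.

```python
# Illustrative sketch of database compaction (Section 5.2): merge the sorted
# Level 0 runs into a single read store and purge records that only refer to
# deleted consistency points.
import heapq

def compact(runs, live_versions):
    """runs: lists of Combined tuples (block, inode, offset, line, frm, to),
    each already sorted in that key order, as LSM runs are."""
    def survives(rec):
        _, _, _, line, frm, to = rec
        # Keep the record if any retained version of this line falls in [frm, to).
        return any(frm <= v < to for v in live_versions.get(line, ()))
    return [rec for rec in heapq.merge(*runs) if survives(rec)]

run_a = [(100, 2, 0, 0, 4, float("inf"))]
run_b = [(101, 2, 1, 0, 4, 7), (103, 4, 0, 0, 10, 12)]
live = {0: {5, 6, 20, 25}}          # CPs/snapshots still retained in line 0
for rec in compact([run_a, run_b], live):
    print(rec)
# (100, 2, 0, 0, 4, inf)   -- still live
# (101, 2, 1, 0, 4, 7)     -- a retained snapshot (CP 5 or 6) references it
# the (103, ...) record is purged: no retained version falls in [10, 12)
```
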

6 Evaluation

Our goal is that back reference maintenance not interfere with normal file-system processing. Thus, maintaining the back reference database should have minimal overhead that remains stable over time. In addition, we want to confirm that query time is sufficiently low so that utilities such as volume shrinking can use back references freely. Finally, although space overhead is not of primary concern, we want to ensure that we do not consume excessive disk space.

We evaluated our algorithm first on a synthetically generated workload that submits write requests as rapidly as possible. We then proceeded to evaluate our system using NFS traces; we present results using part of the EECS03 data set [10]. Next, we report performance for an implementation of Backlog ported into btrfs. Finally, we present query performance results.

6.1 Experimental Setup

We ran the first part of our evaluation in fsim. We configured the system to be representative of a common write-anywhere file system, WAFL [12]. Our simulation used 4 KB blocks and took a consistency point after every 32,000 block writes or 10 seconds, whichever came first (a common configuration of WAFL). We configured the deduplication parameters based on measurements from a few file servers at NetApp. We treat 10% of incoming blocks as duplicates, resulting in a file system where approximately 75 – 78% of the blocks have reference counts of 1, 18% have reference counts of 2, 5% have reference counts of 3, etc. Our file system kept four hourly and four nightly snapshots.

We ran our simulations on a server with two dual-core Intel Xeon 3.0 GHz CPUs, 10 GB of RAM, running Linux 2.6.28. We stored the back reference meta-data from fsim on a 15K RPM Fujitsu MAX3073RC SAS drive that provides 60 MB/s of write throughput. For the micro-benchmarks, we used a 32 MB cache in addition to the memory consumed by the write stores and the Bloom filters.

We carried out the second part of our evaluation in a modified version of btrfs, in which we replaced the original implementation of back references by Backlog. As btrfs uses extent-based allocation, we added a length field to both the From and To tables, as described in Section 4.1. All fields in back reference records are 64-bit. The resulting From and To tuples are 40 bytes each, and a Combined tuple is 48 bytes long. All btrfs workloads were executed on an Intel Pentium 4 3.0 GHz, 512 MB RAM, running Linux 2.6.31.

6.2 Overhead

We evaluated the overhead of our algorithm in fsim using both synthetically generated workloads and NFS traces. We used the former to understand how our algorithm behaves under high system load and the latter to study lower, more realistic loads.

6.2.1 Synthetic Workload

We experimented with a number of different configurations and found that all of them produced similar results, so we selected one representative workload and used that throughout the rest of this section. We configured our workload generator to perform at least 32,000 block writes between two consistency points, which corresponds to the periods of high load on real systems. We set the rates of file create, delete, and update operations to mirror the rates observed in the EECS03 trace [10]. 90% of our files are small, reflecting what we observe on file systems containing mostly home directories of developers – which is similar to the file system from which the EECS03 trace was gathered. We also introduced creation and deletion of writable clones at a rate of approximately 7 clones per 100 CP's, although the original NFS trace did not have any analogous behavior. This is substantially more clone activity than we would expect in a home-directory workload such as EECS03, so it gives us a pessimal view of the overhead clones impose.

Figure 5: Fsim Synthetic Workload Overhead during Normal Operation. I/O overhead due to maintaining back references normalized per persistent block operation (adding or removing a reference with effects that survive at least one CP) and the time overhead normalized per block operation.

Figure 5 shows how the overhead of maintaining back references changes over time, ignoring the cost of periodic database maintenance. The average cost of a block operation is 0.010 block writes or 8-9 µs per block operation, regardless of whether the operation is adding or removing a reference. A single copy-on-write operation (involving both adding and removing a block from an inode) adds on average 0.020 disk writes and at most 18 µs. This amounts to at most 628 additional writes and 0.5–0.6 seconds per CP. More than 95% of this overhead is CPU time, most of which is spent updating the write store. Most importantly, the overhead is stable over time, and the I/O cost is constant even as the total data on the file system increases.

Figure 6: Fsim Synthetic Workload Database Size. The size of the back reference meta-data as a percentage of the total physical data size as it evolves over time. The disk usage at the end of the workload is 14.2 GB after deduplication.

Figure 6 illustrates meta-data size evolution as a percentage of the total physical data size for two frequencies of maintenance (every 100 or 200 CPs) and for no maintenance at all. The space overhead after maintenance drops consistently to 2.5%–3.5% of the total data size, and this low point does not increase over time.

The database maintenance tool processes the original database at a rate of 7.7 – 10.4 MB/s. In our experiments, compaction reduced the database size by 30 – 50%. The exact percentage depends on the fraction of records that could be purged, which can be quite high if the file system deletes an entire snapshot line as we did in this benchmark.

6.2.2 NFS Traces

We used the first 16 days of the EECS03 trace [10], which captures research activity in home directories of a university computer science department during February and March of 2003. This is a write-rich workload, with one write for every two read operations. Thus, it places more load on Backlog than workloads with higher read/write ratios. We ran the workload with the default configuration of 10 seconds between two consistency points.

Figure 7: Fsim NFS Trace Overhead during Normal Operation. The I/O and time overheads for maintaining back references normalized per block operation (adding or removing a reference).

Figure 7 shows how the overhead changes over time during normal file system operation, omitting the cost of database maintenance. The time overhead is usually between 8 and 9 µs, which is what we saw for the synthetically generated workload, and as we saw there, the overhead remains stable over time. Unlike the overhead observed with the synthetic workload, this workload exhibits occasional spikes and one period where the overhead dips (between hours 200 and 250).

The spikes align with periods of low system load, where the constant part of the CP overhead is amortized across a smaller number of block operations, making the per-block overhead greater. We do not consider this behavior to pose any problem, since the system is under low load during these spikes and thus can better absorb the temporarily increased overhead.

The period of lower time overhead aligns with periods of high system load with a large proportion of setattr commands, most of which are used for file truncation. During this period, we found that only a small fraction of the block operations survive past a consistency point. Thus, the operations in this interval tend to cancel each other out, resulting in smaller time overheads, because we never materialize these references in the read store. This workload exhibits I/O overhead of approximately 0.010 to 0.015 page writes per block operation with occasional spikes, most (but not all) of which align with the periods of low file system load.

Figure 8: Fsim NFS Traces: Space Overhead. The size of the back reference meta-data as a percentage of the total physical data size as it evolves over time. The disk usage at the end of the workload is 11.0 GB after deduplication.

Figure 8 shows how the space overhead evolves over time for the NFS workload. The general growth pattern follows that of the synthetically generated workload with the exception that database maintenance frees less space. This is expected, since unlike the synthetic workload, the NFS trace does not delete entire snapshot lines. The space overhead after maintenance is between 6.1% and 6.3%, and it does not increase over time. The exact magnitude of the space overhead depends on the actual workload, and it is in fact different from the synthetic workload presented in Section 6.2.1. Each maintenance operation completed in less than 25 seconds, which we consider acceptable, given the elapsed time between invocations (8 or 48 hours).

6.3 Performance in btrfs

We validated our simulation results by porting our implementation of Backlog to btrfs. Since btrfs natively supports back references, we had to remove the native implementation, replacing it with our own. We present results for three btrfs configurations—the Base configuration with no back reference support, the Original configuration with native btrfs back reference support, and the Backlog configuration with our implementation. Comparing Backlog to the Base configuration shows the absolute overhead for our back reference implementation. Comparing Backlog to the Original configuration shows the overhead of using a general purpose back reference implementation rather than a customized implementation that is more tightly coupled to the rest of the file system.

    Benchmark                                      Base          Original      Backlog       Overhead
    Creation of a 4 KB file (2048 ops. per CP)     0.89 ms       0.91 ms       0.96 ms       7.9%
    Creation of a 64 KB file (2048 ops. per CP)    2.10 ms       2.11 ms       2.11 ms       1.9%
    Deletion of a 4 KB file (2048 ops. per CP)     0.57 ms       0.59 ms       0.63 ms       11.2%
    Creation of a 4 KB file (8192 ops. per CP)     0.85 ms       0.87 ms       0.87 ms       2.0%
    Creation of a 64 KB file (8192 ops. per CP)    1.91 ms       1.92 ms       1.92 ms       0.6%
    Deletion of a 4 KB file (8192 ops. per CP)     0.45 ms       0.46 ms       0.48 ms       7.1%
    DBench CIFS workload, 4 users                  19.59 MB/s    19.20 MB/s    19.19 MB/s    2.1%
    FileBench /var/mail, 16 threads                852.04 ops/s  835.80 ops/s  836.70 ops/s  1.8%
    PostMark                                       2050 ops/s    2032 ops/s    2020 ops/s    1.5%

Table 1: Btrfs Benchmarks. The Base column refers to a customized version of btrfs, from which we removed its original implementation of back references. The Original column corresponds to the original btrfs back references, and the Backlog column refers to our implementation. The Overhead column is the overhead of Backlog relative to the Base.

Table 1 summarizes the benchmarks we executed on btrfs and the overheads Backlog imposes, relative to baseline btrfs. We ran microbenchmarks of create, delete, and clone operations and three application benchmarks. The create microbenchmark creates a set of 4 KB or 64 KB files in the file system's root directory. After recording the performance of the create microbenchmark, we sync the files to disk. Then, the delete microbenchmark deletes the files just created. We run these microbenchmarks in two different configurations. In the first, we take CPs every 2048 operations, and in the second, we take a CP after 8192 operations. The choice of 8192 operations per CP is still rather conservative, considering that WAFL batches up to 32,000 operations. We also report the case with 2048 operations per CP, which corresponds to periods of light server load, as a point for comparison (and we can thus tolerate higher overheads). We executed each benchmark five times and report the average execution time (including the time to perform sync) divided by the total number of operations.

The first three lines in the table present microbenchmark results of creating and deleting small 4 KB files, and creating 64 KB files, taking a CP (btrfs transaction) every 2048 operations. The second three lines present results for the same microbenchmarks with an inter-CP interval of 8192 operations. We show results for the three btrfs configurations—Base, Original, and Backlog. In general, the Backlog performance for writes is comparable to that of the native btrfs implementation. For 8192 operations per CP, it is marginally slower on creates than the file system with no back references (Base), but comparable to the original btrfs. Backlog is unfortunately slower on deletes – 7% as compared to Base, but only 4.3% slower than the original btrfs. Most of this overhead comes from updating the write-store.

The choice of 4 KB (one file system page) as our file size targets the worst case scenario, in which only a small number of pages are written in any given operation. The overhead decreases to as little as 0.6% for the creation of a 64 KB file, because btrfs writes all of its data in one extent. This generates only a single back reference, and its cost is amortized over a larger number of block I/O operations.

The final three lines in Table 1 present application benchmark results: dbench [8], a CIFS file server workload; FileBench's /var/mail [11] multi-threaded mail server; and PostMark [14], a small file workload. We executed each benchmark on a clean, freshly formatted volume. The application overheads are generally lower (1.5% – 2.1%) than the worst-case microbenchmark overheads (operating on 4 KB files) and in two cases out of three comparable to the original btrfs.

Our btrfs implementation confirms the low overheads predicted via simulation and also demonstrates that Backlog achieves nearly the same performance as the btrfs native implementation. This is a powerful result, as the btrfs implementation is tightly integrated with the btrfs data structures, while Backlog is a general-purpose solution that can be incorporated into any write-anywhere file system.

6.4 Query Performance

We ran an assortment of queries against the back reference database, varying two key parameters, the sequentiality of the requests (expressed as the length of a run) and the number of block operations applied to the database since the last maintenance run. We implement runs with length n by starting at a randomly selected allocated block, b, and returning back references for b and the next n − 1 allocated blocks. This holds the amount of work in each test case constant; we always return n back references, regardless of whether the area of the file system we select is densely or sparsely allocated. It also gives us conservative results, since it always returns data for n back references. By returning the maximum possible number of back references, we perform the maximum number of I/Os that could occur and thus report the lowest query throughput that would be observed.

We cleared both our internal caches and all file system caches before each set of queries, so the numbers we present illustrate worst-case performance. We found the query performance in both the synthetic and NFS workloads to be similar, so we will present only the former for brevity. Figure 9 summarizes the results.

Figure 9: Query Performance. The query performance as a function of run length and the number of CP's since the last maintenance on a 1000 CP-long workload. The plots show data collected from the execution of 8,192 queries with different run lengths.

We saw the best performance, 36,000 queries per second, when performing highly sequential queries immediately after database maintenance. As the time since database maintenance increases, and as the queries become more random, performance quickly drops. We can process 290 single-back-reference queries per second immediately after maintenance, but this drops to 43 – 197 as the interval since maintenance increases. We expect queries for large sorted runs to be the norm for maintenance operations such as defragmentation, indicating that such utilities will experience the higher throughput. Likewise, it is reasonable practice to run database maintenance prior to starting a query intensive task. For example, a tool that defragments a 100 MB region of a disk would issue a sorted run of at most 100 MB / 4 KB = 25,600 queries, which would execute in less than a second on a database immediately after maintenance. The query runs for smaller-scale applications, such as file defragmentation, would vary considerably – anywhere from a few blocks per run on fragmented files to thousands for the ones with a low degree of fragmentation.

Issuing queries in large sorted runs provides two benefits. It increases the probability that two consecutive queries can be satisfied from the same database page, and it reduces the total seek distance between operations. Queries on a recently maintained database are more efficient for two reasons: First, a compacted database occupies fewer RS files, so a query accesses fewer files. Second, the maintenance process shrinks the database size, producing better cache hit ratios.

Figure 10: Query Performance over Time. The evolution of query performance over time on a database 100 CP's after maintenance (left) and immediately after maintenance (right). The horizontal axis is the global CP number at the time the query workload was executed. Each run of queries starts at a randomly chosen physical block.

Figure 10 shows the result of an experiment in which we evaluated 8192 queries every 100 CP's just before and after the database maintenance operation, also scheduled every 100 CP's. The figure shows the improvement in the query performance due to maintenance, but more importantly, it also shows that once the database size reaches a certain point, query throughput levels off, even as the database grows larger.
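A small sketch of the query-run generator used in this experiment (our illustration; the function and parameter names are hypothetical), together with the sizing arithmetic quoted above.

```python
# Illustrative sketch of the run-based query workload from Section 6.4: a run
# of length n starts at a randomly chosen allocated block and returns the back
# references of that block and the next n - 1 allocated blocks.
import bisect, random

def query_run(allocated_blocks, back_ref_index, n):
    """allocated_blocks: sorted list of allocated physical block numbers.
    back_ref_index: mapping block -> list of back reference records."""
    start = random.choice(allocated_blocks)
    i = bisect.bisect_left(allocated_blocks, start)
    run = allocated_blocks[i:i + n]      # may be shorter near the end of the disk
    return {b: back_ref_index.get(b, []) for b in run}

# Sizing example from the text: defragmenting a 100 MB region of 4 KB blocks
# issues a sorted run of at most 100 MB / 4 KB = 25,600 queries.
print((100 * 1024 * 1024) // (4 * 1024))    # 25600
```
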
7 Related Work

Btrfs [2, 5] is the only file system of which we are aware that currently supports back references. Its implementation is efficient, because it is integrated with the entire file system's meta-data management. Btrfs maintains a single B-tree containing all meta-data objects.

A file extent back reference consists of four fields: the subvolume, the inode, the offset, and the number of times the extent is referenced by the inode. Btrfs encapsulates all meta-data operations in transactions analogous to WAFL consistency points. Therefore a btrfs transaction ID is analogous to a WAFL CP number. Btrfs supports efficient cloning by omitting transaction ID's from back reference records, while Backlog uses ranges of snapshot versions (the from and to fields) and structural inheritance. A naïve copy-on-write of an inode in btrfs would create an exact copy of the inode (with the same inode ID), marked with a more recent transaction ID. If the back reference records contained transaction IDs (as in early btrfs designs), the file system would also have to duplicate the back references of all of the extents referenced by the inode. By omitting the transaction ID, a single back reference points to both the old and new versions of the inode simultaneously. Therefore, btrfs performs inode copy-on-write for free, in exchange for query performance degradation, since the file system has to perform additional I/O to determine transaction ID's. In contrast, Backlog enables free copy-on-write by operating on ranges of global CP numbers and by using structural inheritance, which do not sacrifice query performance.

Btrfs accumulates updates to back references in an in-memory balanced tree analogous to our write store. The system inserts all the entries from the in-memory tree into the on-disk tree during a transaction commit (part of checkpoint processing). Btrfs stores most back references directly inside the B-tree records that describe the allocated extents, but on some occasions, it stores them as separate items close to these extent allocation records. This is different from our approach, in which we store all back references together, separately from block allocation bitmaps or records.

Perhaps the most significant difference between btrfs back references and Backlog is that the btrfs approach is deeply enmeshed in the file system design. The btrfs approach would not be possible without the existence of a global meta-store. In contrast, the only assumption necessary for our approach is the use of a write-anywhere or no-overwrite file system. Thus, our approach is easily portable to a broader class of file systems.

7 Related Work

Btrfs [2, 5] is the only file system of which we are aware that currently supports back references. Its implementation is efficient because it is integrated with the file system's entire meta-data management: btrfs maintains a single B-tree containing all meta-data objects.

A btrfs file extent back reference consists of four fields: the subvolume, the inode, the offset, and the number of times the extent is referenced by the inode. Btrfs encapsulates all meta-data operations in transactions analogous to WAFL consistency points, so a btrfs transaction ID is analogous to a WAFL CP number. Btrfs supports efficient cloning by omitting transaction IDs from back reference records, while Backlog uses ranges of snapshot versions (the from and to fields) and structural inheritance; the sketch at the end of this section contrasts the two record shapes. A naïve copy-on-write of an inode in btrfs would create an exact copy of the inode (with the same inode ID), marked with a more recent transaction ID. If the back reference records contained transaction IDs (as in early btrfs designs), the file system would also have to duplicate the back references of all of the extents referenced by the inode. By omitting the transaction ID, a single back reference points to both the old and new versions of the inode simultaneously. Btrfs therefore performs inode copy-on-write for free, in exchange for degraded query performance, since the file system has to perform additional I/O to determine transaction IDs. In contrast, Backlog enables free copy-on-write by operating on ranges of global CP numbers and by using structural inheritance, which do not sacrifice query performance.

Btrfs accumulates updates to back references in an in-memory balanced tree analogous to our write store. The system inserts all the entries from the in-memory tree into the on-disk tree during a transaction commit (part of checkpoint processing). Btrfs stores most back references directly inside the B-tree records that describe the allocated extents, but on some occasions it stores them as separate items close to these extent allocation records. This differs from our approach, in which we store all back references together, separately from block allocation bitmaps or records.

Perhaps the most significant difference between btrfs back references and Backlog is that the btrfs approach is deeply enmeshed in the file system design; it would not be possible without the existence of a global meta-data store. In contrast, the only assumption necessary for our approach is the use of a write-anywhere or no-overwrite file system. Thus, our approach is easily portable to a broader class of file systems.
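The sketch below contrasts the two record shapes discussed in this section. The field names follow the descriptions above, but the concrete layouts are illustrative only and do not reproduce the on-disk formats of either btrfs or Backlog.

    from dataclasses import dataclass

    @dataclass
    class BtrfsExtentBackref:
        # btrfs: four fields and no transaction ID, so one record covers
        # both the old and the new version of an inode after copy-on-write.
        subvolume: int
        inode: int
        offset: int
        refcount: int      # times the extent is referenced by the inode

    @dataclass
    class BacklogBackref:
        # Backlog: the referencing object plus a range of global CP numbers
        # (the from and to fields), so snapshot and clone creation or
        # deletion require no per-block record updates.
        block: int         # physical block number (the query key)
        inode: int
        offset: int
        from_cp: int       # first CP in which this reference is live
        to_cp: int         # CP in which the block was freed (open if still live)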
8 Future Work

The results presented in Section 6 provide compelling evidence that our LSM-Tree based implementation of back references is an efficient and viable approach. Our next steps are to explore options for further reducing the time overheads, to study the implications and effects of horizontal partitioning as described in Section 5.3, and to experiment with compression. Our tables of back reference records appear to be highly compressible, especially if we compress them by column [1]. Compression will cost additional CPU cycles, which must be carefully balanced against the expected improvements in space overhead.

We also plan to explore further uses of back references, implementing defragmentation and other functionality that uses back reference meta-data to efficiently maintain and improve the on-disk organization of data. Finally, we are currently experimenting with using Backlog in an update-in-place journaling file system.

9 Conclusion

As file systems are called upon to provide more sophisticated maintenance, back references represent an important enabling technology. They facilitate hard-to-implement features that involve block relocation, such as shrinking a partition or fast defragmentation, and they enable file system optimizations that involve reasoning about block ownership, such as defragmentation of files that share one or more blocks (Section 3).

We exploit several key aspects of this problem domain to provide an efficient database-style implementation of back references. By separately tracking when blocks come into use (via the From table) and when they are freed (via the To table), and by exploiting the relationship between writable clones and their parents (via structural inheritance), we avoid the cost of updating per-block meta-data on each snapshot or clone creation or deletion. LSM-Trees provide an efficient mechanism for sequentially writing back-reference data to storage. Finally, periodic background maintenance operations amortize the cost of combining this data and removing stale entries.

In our prototype implementation we showed that we can track back references with a low constant overhead of roughly 8–9 µs and 0.010 I/O writes per block operation, and achieve query performance of up to 36,000 queries per second.

10 Acknowledgments

We thank Hugo Patterson, our shepherd, and the anonymous reviewers for their careful and thoughtful reviews of our paper. We also thank the students of CS 261 (Fall 2009, Harvard University), many of whom reviewed our work and provided thoughtful feedback. We thank Alexei Colin for his insight and the experience of porting Backlog to other file systems. This work was made possible thanks to NetApp and its summer internship program.

References

[1] ABADI, D. J., MADDEN, S. R., AND FERREIRA, M. Integrating compression and execution in column-oriented database systems. In SIGMOD (2006), pp. 671–682.
[2] AURORA, V. A short history of btrfs. LWN.net (2009).
[3] BLOOM, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422–426.
[4] BRODER, A., AND MITZENMACHER, M. Network applications of Bloom filters: A survey. Internet Mathematics (2005).
[5] Btrfs. http://btrfs.wiki.kernel.org.
[6] CHAPMAN, A. P., JAGADISH, H. V., AND RAMANAN, P. Efficient provenance storage. In SIGMOD (2008), pp. 993–1006.
[7] CLEMENTS, A. T., AHMAD, I., VILAYANNUR, M., AND LI, J. Decentralized deduplication in SAN cluster file systems. In USENIX Annual Technical Conference (2009), pp. 101–114.
[8] DBench. http://samba.org/ftp/tridge/dbench/.
[9] EDWARDS, J. K., ELLARD, D., EVERHART, C., FAIR, R., HAMILTON, E., KAHN, A., KANEVSKY, A., LENTINI, J., PRAKASH, A., SMITH, K. A., AND ZAYAS, E. R. FlexVol: Flexible, efficient file volume virtualization in WAFL. In USENIX ATC (2008), pp. 129–142.
[10] ELLARD, D., AND SELTZER, M. New NFS tracing tools and techniques for system analysis. In LISA (Oct. 2003), pp. 73–85.
[11] FileBench. http://www.solarisinternals.com/wiki/index.php/FileBench/.
[12] HITZ, D., LAU, J., AND MALCOLM, M. A. File system design for an NFS file server appliance. In USENIX Winter (1994), pp. 235–246.
[13] JAGADISH, H. V., NARAYAN, P. P. S., SESHADRI, S., SUDARSHAN, S., AND KANNEGANTI, R. Incremental organization for data recording and warehousing. In VLDB (1997), pp. 16–25.
[14] KATCHER, J. PostMark: A new file system benchmark. NetApp Technical Report TR3022 (1997).
[15] OLSON, M. A., BOSTIC, K., AND SELTZER, M. I. Berkeley DB. In USENIX ATC (June 1999).
[16] O'NEIL, P. E., CHENG, E., GAWLICK, D., AND O'NEIL, E. J. The log-structured merge-tree (LSM-Tree). Acta Informatica 33, 4 (1996), 351–385.
[17] QUINLAN, S., AND DORWARD, S. Venti: A new approach to archival storage. In USENIX FAST (2002), pp. 89–101.
[18] RODEH, O. B-trees, shadowing, and clones. ACM Transactions on Storage 3, 4 (2008).
[19] ROSENBLUM, M., AND OUSTERHOUT, J. K. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems 10, 1 (1992), 26–52.
[20] SCHLOSSER, S. W., SCHINDLER, J., PAPADOMANOLAKIS, S., SHAO, M., AILAMAKI, A., FALOUTSOS, C., AND GANGER, G. R. On multidimensional data and modern disks. In USENIX FAST (2005), pp. 225–238.
[21] SELTZER, M. I., BOSTIC, K., MCKUSICK, M. K., AND STAELIN, C. An implementation of a log-structured file system for UNIX. In USENIX Winter (1993), pp. 307–326.
[22] STONEBRAKER, M., ABADI, D. J., BATKIN, A., CHEN, X., CHERNIACK, M., FERREIRA, M., LAU, E., LIN, A., MADDEN, S. R., O'NEIL, E. J., O'NEIL, P. E., RASIN, A., TRAN, N., AND ZDONIK, S. B. C-Store: A column-oriented DBMS. In VLDB (2005), pp. 553–564.
[23] ZFS at OpenSolaris community. http://opensolaris.org/os/community/zfs/.

NetApp, the NetApp logo, Go further, faster, and WAFL are trademarks or registered trademarks of NetApp, Inc. in the U.S. and other countries.