Betrfs: a Right-Optimized Write-Optimized File System
Total Page:16
File Type:pdf, Size:1020Kb
BetrFS: A Right-Optimized Write-Optimized File System William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, Stony Brook University; John Esmet, Tokutek Inc.; Yizheng Jiao, Ankur Mittal, Prashant Pandey, and Phaneendra Reddy, Stony Brook University; Leif Walsh, Tokutek Inc.; Michael Bender, Stony Brook University; Martin Farach-Colton, Rutgers University; Rob Johnson, Stony Brook University; Bradley C. Kuszmaul, Massachusetts Institute of Technology; Donald E. Porter, Stony Brook University https://www.usenix.org/conference/fast15/technical-sessions/presentation/jannen This paper is included in the Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST ’15). February 16–19, 2015 • Santa Clara, CA, USA ISBN 978-1-931971-201 Open access to the Proceedings of the 13th USENIX Conference on File and Storage Technologies is sponsored by USENIX BetrFS: A Right-Optimized Write-Optimized File System William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet∗, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh∗, Michael Bender, Martin Farach-Colton†, Rob Johnson, Bradley C. Kuszmaul‡, and Donald E. Porter Stony Brook University, ∗Tokutek Inc., †Rutgers University, and ‡Massachusetts Institute of Technology Abstract (microwrites). Examples include email delivery, creat- The Bε -tree File System, or BetrFS, (pronounced ing lock files for an editing application, making small “better eff ess”) is the first in-kernel file system to use a updates to a large file, or updating a file’s atime. The un- write-optimized index. Write optimized indexes (WOIs) derlying problem is that many standard data structures in are promising building blocks for storage systems be- the file-system designer’s toolbox optimize for one case cause of their potential to implement both microwrites at the expense of another. and large scans efficiently. Recent advances in write-optimized indexes (WOI) [4, Previous work on WOI-based file systems has shown 8–10, 23, 27, 28] are exciting because they have the po- promise but has also been hampered by several open tential to implement both efficient microwrites and large problems, which this paper addresses. For example, scans. The key strength of the best WOIs is that they can FUSE issues many queries into the file system, su- ingest data up to two orders of magnitude faster than B- perimposing read-intensive workloads on top of write- trees while matching or improving on the B-tree’s point- intensive ones, thereby reducing the effectiveness of and range-query performance [4, 9]. WOIs. Moving to an in-kernel implementation can ad- WOIs have been successful in commercial key-value dress this problem by providing finer control of reads. stores and databases [2,3,12,17,20,33,34], and previous This paper also contributes several implementation tech- research on WOIs in file systems has shown promise [15, niques to leverage kernel infrastructure without throttling 25, 31]. However, progress towards a production-quality write performance. write-optimized file system has been hampered by sev- Our results show that BetrFS provides good perfor- eral open challenges, which we address in this paper: mance for both arbitrary microdata operations, which in- • Code complexity. A production-quality WOI can eas- clude creating small files, updating metadata, and small ily be 50,000 lines of complex code, which is difficult writes into large or small files, and for large sequen- to shoehorn into an OS kernel. Previous WOI file sys- tial I/O. On one microdata benchmark, BetrFS pro- tems have been implemented in user space. vides more than 4 the performance of ext4 or XFS. × • FUSE squanders microwrite performance. BetrFS is an ongoing prototype effort, and requires ad- FUSE [16] issues a query to the underlying file ditional data-structure tuning to match current general- system before almost every update, superimposing purpose file systems on some operations such as deletes, search-intensive workloads on top of write-intensive directory renames, and large sequential writes. Nonethe- workloads. Although WOIs are no worse for point less, many applications realize significant performance queries than any other sensible data structure, writes improvements. For instance, an in-place rsync of the are much faster than reads, and injecting needless Linux kernel source realizes roughly 1.6–22 speedup × point queries can nullify the advantages of write over other commodity file systems. optimization. • Mapping file system abstractions onto a WOI. We 1 Introduction cannot realize the full potential performance benefits of write-optimization by simply dropping in a WOI Today’s applications exhibit widely varying I/O patterns, as a replacement for a B-tree. The schema and use making performance tuning of a general-purpose file sys- of kernel infrastructure must exploit the performance tem a frustrating balancing act. Some software, such advantages of the new data structure. as virus scans and backups, demand large, sequential This paper describes the Bε -tree File System, or scans of data. Other software requires many small writes BetrFS, the first in-kernel file system designed to take USENIX Association 13th USENIX Conference on File and Storage Technologies (FAST ’15) 301 full advantage of write optimization. Specifically, Update-in-place file systems [11, 32] optimize for BetrFS is built using the mature, well-engineered Bε - scans by keeping related items, such as entries in a di- tree implementation from Tokutek’s Fractal Tree index, rectory or consecutive blocks in a file, near each other. called TokuDB [33]. Our design is tailored to the perfor- However, since items are updated in place, update per- mance characteristics of Fractal Tree indexes, but other- formance is often limited by the random-write latency of wise uses them as a black-box key-value store, so many the underlying disk. of our design decisions may be applicable to other write- optimized file systems. B-tree-based file systems store related items logically Experiments show that BetrFS can give up to an order- adjacent in the B-tree, but B-trees do not guarantee that of-magnitude improvement to the performance of file logically-adjacent items will be physically adjacent. As creation, random writes to large files, recursive directory a B-tree ages, leaves become scattered across the disk traversals (such as occur in find, recursive greps, back- due to node splits from insertions and node merges from ups, and virus scans), and meta-data updates (such as up- deletions. In an aged B-tree, there is little correlation be- dating file atime each time a file is read). tween the logical and physical order of the leaves, and the The contributions of this paper are: cost of reading a new leaf involves both the data-transfer • The design and implementation of an in-kernel, write- cost and the seek cost. If leaves are too small to amor- optimized file system. tize the seek costs, then range queries can be slow. The • A schema, which ensures both locality and fast writes, seek costs can be amortized by using larger leaves, but for mapping VFS operations to a write-optimized in- this further throttles update performance. dex. At the other extreme, logging file systems [5,7,26,29, • A design that uses unmodified OS kernel infrastruc- 30] optimize for writes. Logging ensures that files are ture, designed for traditional file systems, yet mini- created and updated rapidly, but the resulting data and mizes the impact on write optimization. For instance, metadata can be spread throughout the log, leading to BetrFS uses the OS kernel cache to accelerate reads poor performance when reading data or metadata from without throttling writes smaller than a disk sector. disk. These performance problems are particularly no- • A thorough evaluation of the performance of the ticeable in large scans (recursive directory traversals and BetrFS prototype. For instance, our results show backups) that cannot be accelerated by caching. almost two orders of magnitude improvement on a small-write microbenchmark and a 1.5 speedup on × The microwrite bottleneck creates problems for a applications such as rsync range of applications. HPC checkpointing systems gen- ε Our results suggest that a well-designed B -tree-based erate so many microwrites that a custom file system, file system can match or outperform traditional file sys- PLFS, was designed to efficiently handle them [5] by tems on almost every operation, some by an order of exploiting the specifics of the checkpointing workload. magnitude. Comparisons with state-of-the-art file sys- Email servers often struggle to manage large sets of tems, such as ext4, XFS, zfs, and btrfs support this small messages and metadata about those messages, such claim. We believe that the few slower operations are not as the read flag. Desktop environments store prefer- fundamental to WOIs, but can be addressed with a com- ences and active state in a key-value store (i.e., a reg- bination of algorithmic advances and engineering effort istry) so that accessing and updating keys will not re- in future work. quire file-system-level microdata operations. Unix and Linux system administrators commonly report 10–20% performance improvements by disabling the atime op- 2 Motivation and Background tion [14]; maintaining the correct atime behavior in- duces a heavy microwrite load, but some applications re- This section summarizes background on the increasing quire accurate atime values. importance of microwrites, explains how WOIs work, and summarizes previous WOI-based file systems. Microwrites cause performance problems even when the storage system uses SSDs. In a B-tree-based file sys- 2.1 The microwrite problem tem, small writes trigger larger writes of entire B-tree nodes, which can further be amplified to an entire erase A microwrite is a write operation where the setup time block on the SSD. In a log-structured file system, mi- (i.e.