Betrfs: a Right-Optimized Write-Optimized File System

BetrFS: A Right-Optimized Write-Optimized File System William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet∗, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh∗, Michael Bender, Martin Farach-Colton†, Rob Johnson, Bradley C. Kuszmaul‡, and Donald E. Porter Stony Brook University, ∗Tokutek Inc., †Rutgers University, and ‡Massachusetts Institute of Technology Abstract (microwrites). Examples include email delivery, creat- The Be -tree File System, or BetrFS, (pronounced ing lock files for an editing application, making small “better eff ess”) is the first in-kernel file system to use a updates to a large file, or updating a file’s atime. The un- write-optimized index. Write optimized indexes (WOIs) derlying problem is that many standard data structures in are promising building blocks for storage systems be- the file-system designer’s toolbox optimize for one case cause of their potential to implement both microwrites at the expense of another. and large scans efficiently. Recent advances in write-optimized indexes (WOI) [4, Previous work on WOI-based file systems has shown 8–10, 23, 27, 28] are exciting because they have the po- promise but has also been hampered by several open tential to implement both efficient microwrites and large problems, which this paper addresses. For example, scans. The key strength of the best WOIs is that they can FUSE issues many queries into the file system, su- ingest data up to two orders of magnitude faster than B- perimposing read-intensive workloads on top of write- trees while matching or improving on the B-tree’s point- intensive ones, thereby reducing the effectiveness of and range-query performance [4, 9]. WOIs. Moving to an in-kernel implementation can ad- WOIs have been successful in commercial key-value dress this problem by providing finer control of reads. stores and databases [2,3,12,17,20,33,34], and previous This paper also contributes several implementation tech- research on WOIs in file systems has shown promise [15, niques to leverage kernel infrastructure without throttling 25, 31]. However, progress towards a production-quality write performance. write-optimized file system has been hampered by sev- Our results show that BetrFS provides good perfor- eral open challenges, which we address in this paper: mance for both arbitrary microdata operations, which in- • Code complexity. A production-quality WOI can eas- clude creating small files, updating metadata, and small ily be 50,000 lines of complex code, which is difficult writes into large or small files, and for large sequen- to shoehorn into an OS kernel. Previous WOI file sys- tial I/O. On one microdata benchmark, BetrFS pro- tems have been implemented in user space. vides more than 4× the performance of ext4 or XFS. • FUSE squanders microwrite performance. BetrFS is an ongoing prototype effort, and requires ad- FUSE [16] issues a query to the underlying file ditional data-structure tuning to match current general- system before almost every update, superimposing purpose file systems on some operations such as deletes, search-intensive workloads on top of write-intensive directory renames, and large sequential writes. Nonethe- workloads. Although WOIs are no worse for point less, many applications realize significant performance queries than any other sensible data structure, writes improvements. For instance, an in-place rsync of the are much faster than reads, and injecting needless Linux kernel source realizes roughly 1.6–22× speedup point queries can nullify the advantages of write over other commodity file systems. optimization. • Mapping file system abstractions onto a WOI. We 1 Introduction cannot realize the full potential performance benefits of write-optimization by simply dropping in a WOI Today’s applications exhibit widely varying I/O patterns, as a replacement for a B-tree. The schema and use making performance tuning of a general-purpose file sys- of kernel infrastructure must exploit the performance tem a frustrating balancing act. Some software, such advantages of the new data structure. as virus scans and backups, demand large, sequential This paper describes the Be -tree File System, or scans of data. Other software requires many small writes BetrFS, the first in-kernel file system designed to take full advantage of write optimization. Specifically, Update-in-place file systems [11, 32] optimize for BetrFS is built using the mature, well-engineered Be - scans by keeping related items, such as entries in a di- tree implementation from Tokutek’s Fractal Tree index, rectory or consecutive blocks in a file, near each other. called TokuDB [33]. Our design is tailored to the perfor- However, since items are updated in place, update per- mance characteristics of Fractal Tree indexes, but other- formance is often limited by the random-write latency of wise uses them as a black-box key-value store, so many the underlying disk. of our design decisions may be applicable to other write- optimized file systems. B-tree-based file systems store related items logically Experiments show that BetrFS can give up to an order- adjacent in the B-tree, but B-trees do not guarantee that of-magnitude improvement to the performance of file logically-adjacent items will be physically adjacent. As creation, random writes to large files, recursive directory a B-tree ages, leaves become scattered across the disk traversals (such as occur in find, recursive greps, back- due to node splits from insertions and node merges from ups, and virus scans), and meta-data updates (such as up- deletions. In an aged B-tree, there is little correlation be- dating file atime each time a file is read). tween the logical and physical order of the leaves, and the The contributions of this paper are: cost of reading a new leaf involves both the data-transfer • The design and implementation of an in-kernel, write- cost and the seek cost. If leaves are too small to amor- optimized file system. tize the seek costs, then range queries can be slow. The • A schema, which ensures both locality and fast writes, seek costs can be amortized by using larger leaves, but for mapping VFS operations to a write-optimized in- this further throttles update performance. dex. At the other extreme, logging file systems [5,7,26,29, • A design that uses unmodified OS kernel infrastruc- 30] optimize for writes. Logging ensures that files are ture, designed for traditional file systems, yet mini- created and updated rapidly, but the resulting data and mizes the impact on write optimization. For instance, metadata can be spread throughout the log, leading to BetrFS uses the OS kernel cache to accelerate reads poor performance when reading data or metadata from without throttling writes smaller than a disk sector. disk. These performance problems are particularly no- • A thorough evaluation of the performance of the ticeable in large scans (recursive directory traversals and BetrFS prototype. For instance, our results show backups) that cannot be accelerated by caching. almost two orders of magnitude improvement on a small-write microbenchmark and a 1:5× speedup on The microwrite bottleneck creates problems for a applications such as rsync range of applications. HPC checkpointing systems gen- e Our results suggest that a well-designed B -tree-based erate so many microwrites that a custom file system, file system can match or outperform traditional file sys- PLFS, was designed to efficiently handle them [5] by tems on almost every operation, some by an order of exploiting the specifics of the checkpointing workload. magnitude. Comparisons with state-of-the-art file sys- Email servers often struggle to manage large sets of tems, such as ext4, XFS, zfs, and btrfs support this small messages and metadata about those messages, such claim. We believe that the few slower operations are not as the read flag. Desktop environments store prefer- fundamental to WOIs, but can be addressed with a com- ences and active state in a key-value store (i.e., a reg- bination of algorithmic advances and engineering effort istry) so that accessing and updating keys will not re- in future work. quire file-system-level microdata operations. Unix and Linux system administrators commonly report 10–20% 2 Motivation and Background performance improvements by disabling the atime op- tion [14]; maintaining the correct atime behavior in- duces a heavy microwrite load, but some applications re- This section summarizes background on the increasing quire accurate atime values. importance of microwrites, explains how WOIs work, and summarizes previous WOI-based file systems. Microwrites cause performance problems even when the storage system uses SSDs. In a B-tree-based file sys- 2.1 The microwrite problem tem, small writes trigger larger writes of entire B-tree nodes, which can further be amplified to an entire erase A microwrite is a write operation where the setup time block on the SSD. In a log-structured file system, mi- (i.e. seek time on a conventional disk) exceeds the data- crowrites can induce heavy cleaning activity, especially transfer time. Conventional file-system data structures when the disk is nearly full. In either case, the extra write force file-system designers to choose between optimizing activity reduces the lifespan of SSDs and can limit perfor efficient microwrites and efficient scans. formance by wasting bandwidth. 2 2.2 Write-Optimized Indexes Large nodes also speed up range queries, since the data is spread over fewer nodes, requiring fewer disk seeks to In this subsection, we review write-optimized indexes read in all the data. and their impact on performance. Specifically, we de- To get a feeling for what this speedup looks like, scribe the Be -tree [9] and why we have selected this data consider the following example. Suppose a key-value structure for BetrFS.

Betrfs: a Right-Optimized Write-Optimized File System

Betrfs: a Right-Optimized Write-Optimized File System

Algorithms and Complexity (AL)

Assignment of Master's Thesis

UNIVERSIDAD SAN FRANCISCO DE QUITO USFQ Optimizing Large

Conception Et Exploitation D'une Base De Modèles: Application Aux Data

Percona Server for Mongodb Documentation Release 3.2.22-3.13

The Algorithmics of Write Optimization

Big Data Solutions and Applications

The Tokufs Streaming File System

Data Structures Dagstuhl Seminar

Workload Analysis of Key-Value Stores on Non-Volatile Media Vishal Verma (Performance Engineer, Intel) Tushar Gohad (Cloud Software Architect, Intel)

Data Management Organization Charter