Verifying a High-Performance Crash-Safe File System Using a Tree Specification

Verifying a high-performance crash-safe file system using a tree specification Haogang Chen,y Tej Chajed, Alex Konradi,z Stephanie Wang,x Atalay İleri, Adam Chlipala, M. Frans Kaashoek, Nickolai Zeldovich MIT CSAIL ABSTRACT 1 INTRODUCTION File systems achieve high I/O performance and crash safety DFSCQ is the first file system that (1) provides a precise by implementing sophisticated optimizations to increase disk fsync fdatasync specification for and , which allow appli- throughput. These optimizations include deferring writing cations to achieve high performance and crash safety, and buffered data to persistent storage, grouping many trans- (2) provides a machine-checked proof that its implementa- actions into a single I/O operation, checksumming journal tion meets this specification. DFSCQ’s specification captures entries, and bypassing the write-ahead log when writing to the behavior of sophisticated optimizations, including log- file data blocks. The widely used Linux ext4 is an example bypass writes, and DFSCQ’s proof rules out some of the of an I/O-efficient file system; the above optimizations allow common bugs in file-system implementations despite the it to batch many writes into a single I/O operation and to complex optimizations. reduce the number of disk-write barriers that flush data to The key challenge in building DFSCQ is to write a speci- disk [33, 56]. Unfortunately, these optimizations complicate fication for the file system and its internal implementation a file system’s implementation. For example, it took 6 years without exposing internal file-system details. DFSCQ in- for ext4 developers to realize that two optimizations (data troduces a metadata-prefix specification that captures the writes that bypass the journal and journal checksumming) properties of fsync and fdatasync, which roughly follows taken together can lead to disclosure of previously deleted the behavior of Linux ext4. This specification uses a no- data after a crash [30]. This bug was fixed in November of tion of tree sequences—logical sequences of file-system tree 2014 by forbidding users from mounting an ext4 file sys- states—for succinct description of the possible states after tem with both journal bypassing for data writes and journal a crash and to describe how data writes can be reordered checksumming. A comprehensive study of several file sys- with respect to metadata updates. This helps application tems in Linux also found a range of other bugs [34: §6]. developers prove the crash safety of their own applications, avoiding application-level bugs such as forgetting to invoke Somewhat surprisingly, there exists no precise specifi- fsync on both the file and the containing directory. cation that would allow proving the correctness of a high- An evaluation shows that DFSCQ achieves 103 MB/s on performance file system, ruling out bugs like the ones de- large file writes to an SSD and durably creates small files at scribed above. For example, the POSIX standard is noto- a rate of 1,618 files per second. This is slower than Linux riously vague on what crash-safety guarantees file-system ext4 (which achieves 295 MB/s for large file writes and 4,977 operations provide [44]. Of particular concern are the guar- files/s for small file creation) but much faster than twore- antees provided by fsync and fdatasync, which give appli- cent verified file systems, Yggdrasil and FSCQ. Evaluation cations precise control over what data the file system flushes results from application-level benchmarks, including TPC-C to persistent storage. Unfortunately, file systems provide on SQLite, mirror these microbenchmarks. imprecise promises on exactly what data is flushed, and, in fact, for the Linux ext4 file system, it depends on the op- tions that an administrator specifies when mounting the file system [40]. Because of this lack of precision, applications Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies such as databases and mail servers, which try hard to make are not made or distributed for profit or commercial advantage and that sequences of file creates, writes, and renames crash-safe by copies bear this notice and the full citation on the first page. Copyrights for issuing fsyncs and fdatasyncs, may still lose data if the file third-party components of this work must be honored. For all other uses, contact the owner/author(s). system crashes at an inopportune time [40]. For example, SQLite and LightningDB, two popular databases, improperly SOSP’17, October 28–31, 2017, Shanghai, China. used fsync and fdatasync, leading to possible data loss [65]. © 2017 Copyright is held by the owner/author(s). ACM ISBN 978-1-4503-5085-3/17/10. https://doi.org/10.1145/3132747.3132776 y Now at Databricks. z Now at Google. x Now at UC Berkeley. 1 A particularly challenging behavior to specify is log-bypass flushed. This paper uses the notion of tree sequences to cap- writes1. A file system typically uses a write-ahead logto ture the metadata-prefix specification. Abstractly, the state guarantee that changes are flushed to disk atomically and in of the file system is a sequence of trees (directories and files), order. The one exception is data writes: to avoid writing data each corresponding to a metadata update performed by the blocks to disk twice (once to the log and once to the file’s application, in the order the updates were issued; the se- blocks), file systems bypass the log when writing file data. quence captures the metadata order. Data writes apply to Since data blocks are not written to the log, they may be every tree in the sequence, reflecting the fact that they can re-ordered with respect to logged metadata changes, and this be reordered with respect to metadata. After a crash, the file reordering must be precisely captured in the specification. A system can end up in any tree in the sequence. Calling fsync further complication arises from the fact that this reordering logically truncates the sequence of trees to just the last one, can lead to corruption in the presence of block reuse. For and fdatasync makes file data writes durable in every tree example, a bypass write to file f can modify block b on in the sequence. disk, but the metadata update allocating b to f has not been (4) To demonstrate that the metadata-prefix specification flushed to disk yet; b may still be in use by another file or is easy to use, we implemented, specified, and proved the cor- directory, which might be corrupted by the bypass write. rectness of a core pattern used by applications to implement The specification must exclude this possibility. crash-safe updates. This code pattern modifies a destina- This paper makes several contributions: tion file atomically, even if the file system crashes during (1) An approach to specifying file-system behavior purely the update. That is, if a crash happens, either the destina- in terms of abstract file-system trees, as opposed to expos- tion file exists with a copy of all the new bytes that should ing the internal details of file-system optimizations, such as have been written to it, or the destination file is not modi- log-bypass writes directly updating disk blocks. A purely fied. This pattern captures the core goal of many crash-safe tree-based specification shields application developers from applications—ensuring that some sequence of file creates, having to understand and reason about low-level file-system writes, renames, fsyncs, and fdatasyncs is crash-safe. If an implementation details. application’s use of fsync and fdatasync does not ensure (2) A precise metadata-prefix specification that describes crash safety, the application developer is unable to prove the the file system’s behavior in the presence of crashes (in addi- application correct. tion to describing non-crash behavior), including the fsync (5) To demonstrate that the metadata-prefix specification and fdatasync system calls and sophisticated optimizations allows proving the correctness of a high-performance file such as log-bypass writes, using the tree-based specification system, we built DFSCQ (Deferred-write FSCQ), based on approach. Intuitively, the specification says that if an ap- FSCQ [10] and its Crash-Hoare Logic infrastructure.3 DFSCQ plication makes a sequence of metadata changes, the file implements standard optimizations such as deferring writes, system will persist these changes to disk in order, meaning group commit, non-journaled writes for file data, check- that the state after a crash will correspond to a prefix of the summed log commit, and so on. We report the results of application’s changes. On the other hand, data updates need several benchmarks on top of DFSCQ, compared to Linux not be ordered; thus, the name metadata-prefix. Our specifi- ext4 and two recent verified file systems. cation is inspired by the observed behavior of the Linux ext4 DFSCQ has several limitations. First, DFSCQ (and the file system,2 as well as other prior work [11, 35]. metadata-prefix specification) does not capture concurrency; (3) A formalization of this metadata-prefix specification. system calls are executed one at a time. Second, DFSCQ A precise formal specification is important for two reasons. runs as a Haskell program, which incurs a large trusted First, it allows developers of a mission-critical application computing base (the Haskell compiler and runtime) as well such as a database to prove that they are using fsync cor- as CPU performance overhead. Addressing these limitations rectly and that the database doesn’t run the risk of losing remains an interesting open problem for future work. data. Second, it allows file-system developers to prove that the optimizations that they implement will still provide the 2 RELATED WORK same guarantees, embodied in the metadata-prefix specifica- DFSCQ is the first file system with a machine-checked proof tion, that applications may rely on.

Verifying a High-Performance Crash-Safe File System Using a Tree Specification

Trusted Docker Containers and Trusted Vms in Openstack

Rootless Containers with Podman and Fuse-Overlayfs

Providing User Security Guarantees in Public Infrastructure Clouds

11.7 the Windows 2000 File System

In Search of the Ideal Storage Configuration for Docker Containers

MINCS - the Container in the Shell (Script)

“Application - File System” Divide with Promises

NOVA: a Log-Structured File System for Hybrid Volatile/Non

High Velocity Kernel File Systems with Bento

Singularityce User Guide Release 3.8

Firecracker: Lightweight Virtualization for Serverless Applications

Container-Based Virtualization for Byte-Addressable NVM Data Storage