Failure-Atomic Updates of Application Data in a Linux File System

Rajat Verma(1), Anton Ajay Mendez(1), Stan Park(2), Sandya Mannarswamy(1), Terence Kelly(2), Charles B. Morrey III(2)
(1) Hewlett-Packard Storage Division   (2) Hewlett-Packard Laboratories

Abstract

We present the design, implementation, and evaluation of a file system mechanism that protects the integrity of application data from failures such as process crashes, kernel panics, and power outages. A simple interface offers applications a guarantee that the application data in a file always reflects the most recent successful fsync or msync operation on the file. Our file system furthermore offers a new syncv mechanism that failure-atomically commits changes to multiple files. Failure-injection tests verify that our file system protects the integrity of application data from crashes, and performance measurements confirm that our implementation is efficient. Our file system runs on conventional hardware and unmodified Linux kernels and will be released commercially. We believe that our mechanism is implementable in any file system that supports per-file writable snapshots.

1 Introduction

Many applications modify data on durable media, and failures during updates—application process crashes, OS kernel panics, and power outages—jeopardize the integrity of the application data. We therefore require solutions to the fundamental problem of consistent modification of application durable data (CMADD), i.e., the problem of evolving durable application data without fear that failure will preclude recovery to a consistent state.

Existing mechanisms provide imperfect support for solving the CMADD problem. Relational databases offer ACID transactions; similarly, many key-value stores allow failure-atomic bundling of updates [2, 13, 14]. Despite the obvious attractions of transactions, both kinds of databases can lead to two difficulties: First, in-memory data structures do not always translate conveniently or efficiently to and from database formats; repeated attempts to smooth over the "impedance mismatch" between data formats have met with limited success [19]. Second, the complexity of modern databases offers fertile ground for implementation bugs that negate the promise of ACID: A recent study has shown that widely used key-value and relational databases exhibit erroneous behavior under power failures; the proprietary commercial databases tested lose data [36].

File systems strive to protect their internal metadata from corruption, but most offer no corresponding protection for application data, providing neither transactions on application data nor any other unified solution to the CMADD problem. Instead, file systems offer primitives for controlling the order in which application data attains durability; applications shoulder the burden of restoring consistency to their data following failures. Added to the inconvenience and expense of implementing correct recovery is the inefficiency of the sequences of primitive operations required for complex updates: Consider, for example, the chore of failure-atomically updating a set of files scattered throughout a POSIX-like file system. Remarkably, the vast majority of file systems do not provide the straightforward operation that CMADD demands: the ability to modify application data in (sets of) files failure-atomically and efficiently.

We present the design, implementation, and evaluation of failure-atomic application data updates in HP's Advanced File System (AdvFS), a modern industrial-strength Linux file system derived from DEC's Tru64 file system [1]. AdvFS provides a simple interface that generalizes failure-atomic variants of writev [8] and msync [20]: If a file is opened with a new O_ATOMIC flag, the state of its application data will always reflect the most recent successful msync, fsync, or fdatasync. AdvFS furthermore includes a new syncv operation that combines updates to multiple files into a failure-atomic bundle, comparable to the multi-file transaction support in Windows Vista TxF [17] and TxOS [22] but much simpler than the former and more capable than the latter. The size of transactional updates in AdvFS is limited only by the free space in the file system. AdvFS requires no special hardware and runs on unmodified Linux kernels.

The remainder of this paper is organized as follows: Section 2 situates our contributions in the context of prior work. Section 3 describes AdvFS and the features that made it possible to implement failure-atomic updates of application data. Section 4 presents experimental evaluations of both the correctness and performance of AdvFS, and Section 5 concludes with a discussion.
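To make the interface concrete, the sketch below shows how an application might perform a single-file failure-atomic update under these semantics. It assumes that the O_ATOMIC flag is supplied by the AdvFS headers (it is not part of stock POSIX), and error handling is abridged.

```c
/* Minimal sketch of a single-file failure-atomic update.  O_ATOMIC is
 * assumed to be defined by the AdvFS headers; it is not a stock POSIX flag. */
#include <fcntl.h>
#include <unistd.h>

int update_record(const char *path, const char *rec, size_t len, off_t off)
{
    int fd = open(path, O_RDWR | O_ATOMIC);   /* capture current state   */
    if (fd < 0)
        return -1;
    if (pwrite(fd, rec, len, off) != (ssize_t)len) {   /* modify in place */
        close(fd);
        return -1;
    }
    /* fsync is the commit point: on success the file reflects this update
     * in its entirety; if we crash first, it reflects the prior commit.   */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The same pattern applies whether the commit is issued with fsync, fdatasync, or msync on a mapping of the file.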

2 Related Work

Most widely deployed mainstream file systems offer only limited and indirect support for consistent modification of application durable data (CMADD).¹ Semantically weak OS interfaces are partly to blame. For example, POSIX permits write to succeed partially, making it difficult to define atomic semantics for this call [30]. Synchronization calls such as fsync and msync constrain the order in which application data reaches durable media, and recent research has proposed decoupling ordering from durability [5]. However, applications remain responsible for building CMADD solutions (e.g., atomicity mechanisms) atop ordering primitives and for reconstructing a consistent state of application data following a crash. Experience has shown that custom recovery code is difficult to write and prone to bugs. Sometimes applications circumvent the need for recovery by using the one failure-atomicity mechanism provided in conventional file systems: file rename [11]. For example, desktop applications can open a temporary file, write the entire modified contents of a file to it, then use rename (or a specialized equivalent [7, 23]) to implement an atomic file update—a reasonable expedient for small files but untenable for large ones, as the sketch below illustrates.

¹Spillane et al. provide an extensive review of the research literature on transactional file systems [25].
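A minimal sketch of this rename idiom, using only standard POSIX calls, shows why its cost grows with file size: the entire new contents must be written out before the atomic rename.

```c
/* Conventional write-temp-then-rename idiom for whole-file atomic
 * replacement on a POSIX file system (no AdvFS features assumed). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int replace_file(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    /* rename is the one failure-atomic primitive: after a crash the path
     * names either the old contents or the complete new contents.
     * (A fully robust version would also fsync the containing directory.) */
    return rename(tmp, path);
}
```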
FusionIO provides an elegant and efficient mechanism for solving the CMADD problem for data on their flash-based storage devices: a failure-atomic writev supported by the flash translation layer [8]. The MySQL database exploits this new mechanism to eliminate application-level double writes and thereby improve performance substantially [29]. Still more impressive gains are available to applications architected from scratch around the new mechanism. For example, a key-value store designed to exploit the new feature achieves both performance and flash endurance benefits [15]. The limitations of failure-atomic writev are that it requires special hardware, applies only to single-file updates, and does not address modifications to memory-mapped files.

Fully general support for failure-atomic bundles of file modifications is surprisingly rare. Windows Vista TxF supports such a capability, but the feature is deprecated because its formidably complex interface has impaired adoption [17]. TxOS includes a simpler interface to the same capability, but with a limitation: Because TxOS implements atomic file updates via the file system journal, transaction size is limited by the size of the journal [22]. Valor implements in the Linux kernel a transactional file update API with seven new system calls that support inter-process isolation even in the presence of non-transactional accesses by legacy applications [25]. The price that transaction-aware applications pay for this sophisticated support includes a substantial burden of logging: Applications must perform a Log Append syscall prior to modifying a page of a file within a transaction, which is awkward at best for the important case of random STOREs to a memory-mapped file.

An attractive approach to the CMADD problem on emerging durable media is a persistent heap supporting atomic updates via a transactional memory (TM) interface. Mnemosyne [31] and Hathi [24] implement such mechanisms for byte-addressable non-volatile memory (NVM) and flash storage, respectively. Persistent heaps obviate the need for separate in-memory and durable data formats: Applications simply manipulate in-memory data structures using LOAD and STORE instructions, which seems especially natural for byte-addressable NVM. One limitation of these systems is that they do not support conventional file operations; another is that they are tailored to specific durable media. Finally, they employ software TM, which carries substantial overheads.

Persistent heaps can be implemented for conventional block storage and need not employ TM. Recent examples include Software Persistent Memory (SoftPM) [9] and Ken [34], whose persistent heaps expose malloc-like interfaces and support atomic checkpointing. Such approaches provide ergonomic benefits and are compatible with conventional hardware, but their atomic-update mechanisms entail substantial complexity and overheads. For example, SoftPM automatically copies volatile data into persistent containers as necessary through a novel hybrid of static and dynamic pointer analysis, making development easier and less error-prone. However, SoftPM tracks data modification in coarse-grained chunks of 512 KB or larger, which can lead to write amplification at the storage layer. Ken's user-space persistent heap tracks modifications at 4 KB memory-page granularity, which may reduce write amplification, but Ken writes each modified page to storage twice (to a REDO log synchronously and in-place asynchronously).

Failure-atomic msync ensures that application data in the backing file always reflects the most recent successful msync call [20]. It is easy to layer increasingly sophisticated higher-level abstractions atop this foundation, e.g., persistent heaps, which in turn can slide beneath general-purpose libraries of data structures and algorithms such as the C++ STL. Although it supports the style of programming natural to non-volatile memory, failure-atomic msync can be implemented on conventional block storage. A kernel-based implementation of failure-atomic msync suffers at least three shortcomings: the need to run a modified kernel impedes adoption, the use of the file system journal limits transaction sizes, and data modifications are written to storage twice (once in the journal and once in-place) [20]. A user-space implementation of a similar mechanism eliminates the first two problems and has been deployed in commercial production systems [3], but it does not support fully general file manipulations and it retains the double write due to logging.

AdvFS solves the CMADD problem directly and combines many of the advantages of prior approaches. The interface is both simple and general: Opening a file with a new O_ATOMIC flag guarantees that the file's application data will reflect the most recent synchronization operation, regardless of whether the file was modified with the write or mmap families of interfaces (or both). Our new syncv operation ensures that updates to a set of files are atomic. Because it includes failure-atomic msync as a special case, AdvFS offers the same advantages as a foundation atop which persistent heaps and other abstractions may be layered. AdvFS does not rely on the file system journal to implement atomic updates; it avoids double writes, and the size of atomic updates is limited only by the amount of free space in the file system. Adopting AdvFS is relatively easy because it runs on standard Linux kernels and requires no special hardware. Its atomic-update interface admits implementation atop both conventional block storage and emerging byte-addressable NVM, and thus it provides a smooth transition path from the former to the latter. Finally, AdvFS for Linux is not a research prototype. It is an extensively modernized production-quality upgrade and Linux port of DEC's Tru64 file system, and it is scheduled for commercial release in March 2015 as part of HP Storage appliances.

As described in Section 3, implementing O_ATOMIC is straightforward in file systems that support per-file writable snapshots [4, 12, 28, 32]. We believe that most could implement O_ATOMIC and syncv without prohibitive cost or complexity, which in turn would make it much easier to write robust applications.

3 Implementation

AdvFS is a modern, update-in-place, journaling Linux file system developed internally for commercial storage appliances, designed to be scalable and performant for multiple use cases and workloads. AdvFS for Linux evolved from DEC's Tru64 file system, which was open sourced in 2008 [1, 10] and has been rewritten extensively for modern storage devices, with enterprise scalability and reliability. It supports a number of advanced capabilities, such as the ability to add and remove storage devices online, support multiple file systems on the same storage pool, and take clones (described below) and snapshots at the file, directory, and file system level.

Like other modern file systems that support storage pools [35], AdvFS decouples the logical file hierarchy from the physical storage. The logical file hierarchy layer implements the naming scheme and POSIX-compliant functions such as creating, opening, reading, and writing files. The physical storage layer implements write-ahead logging, caching, file storage allocation, file migration, and physical disk I/O functions. AdvFS is comparable in performance and feature richness to most modern Linux file systems; due to space constraints we omit a detailed description of AdvFS and comparisons with other modern FSes. We designed and implemented O_ATOMIC on AdvFS and exposed it to applications through the conceptually simple and familiar interface of open followed by write/fsync or mmap/msync.

O_ATOMIC leverages a file clone feature developed to support use cases such as virtual machine cloning. A file clone is a writable snapshot of the file. AdvFS implements file cloning using a variant of copy-on-write (COW) [21], illustrated in Figure 1. When a file is cloned, a copy of the file's inode is made. The inode includes the file's block map, a data structure that maps logical file offsets to block numbers on the underlying block device. Since the original file and its clone have identical copies of the block map, they initially share the same storage. When a shared block is eventually written to, either in the original file or in its clone, a copy of the block is made and remapped to a different location on the block device. Since COW results in new data blocks being assigned to the original file, it has the downside that it can fragment the original file; AdvFS supports online defragmentation, which can mitigate this difficulty. The efficient clone implementation in AdvFS enabled a simple but effective implementation of O_ATOMIC.

When a file is opened with O_ATOMIC, a clone of the file is made (Figure 1(a)-(b)). This clone is not visible in the user-visible namespace but exists in a hidden namespace accessible by AdvFS. When the file is modified, the changed blocks are remapped via COW (Figure 1(c)). The clone still points to the blocks of the file as of the time the file was opened. On a subsequent call to fsync/msync, the existing clone is deleted and a new one is created to track the latest version of the file (Figure 1(d)). On the close of a file opened with O_ATOMIC, the original file is replaced with the clone.
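The same commit point applies to memory-mapped files. The sketch below is illustrative only and again assumes that O_ATOMIC is defined by the AdvFS headers; the file is mapped, modified with ordinary stores, and committed with msync.

```c
/* Sketch: failure-atomic update of a memory-mapped file.  O_ATOMIC is
 * assumed to come from the AdvFS headers; error handling is abridged. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int toggle_byte(const char *path, size_t filesize, size_t offset)
{
    int fd = open(path, O_RDWR | O_ATOMIC);
    if (fd < 0)
        return -1;
    char *p = mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    p[offset] ^= 1;                 /* an arbitrary STORE into the mapping */
    /* msync is the commit point: after it returns, the file reflects the
     * store; if we crash first, the file reflects the previous commit.   */
    int rc = msync(p, filesize, MS_SYNC);
    munmap(p, filesize);
    close(fd);
    return rc;
}
```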

[Figure 1: File clones implement atomic updates. (a) Initial state of file with 3 blocks. (b) open(O_ATOMIC) creates a clone. (c) Modifications remap blocks. (d) fsync/msync replaces the old clone with a new clone.]

If the system crashes, recovery of an O_ATOMIC file is delayed until the file is accessed again. The file system's path name lookup function checks whether the file has a clone in the hidden namespace; this is a very inexpensive check that is performed on every file open. If a clone exists, it is renamed to the user-visible file and a handle to it is returned (we require writable clones rather than read-only snapshots because of this scenario). AdvFS's "lazy," per-file recovery offers several attractions: Consider, for example, a crash that occurs while many processes are atomically updating many files. Upon reboot, the file system will recover quickly because the in-progress updates, interrupted by the crash, trigger no recovery actions when the file system is mounted. The net effect is that applications that do not need recovery from interrupted atomic updates (e.g., applications that are merely reading files) do not share the recovery-time penalty incurred by the crash; only those applications that benefit from application-consistent recovery pay the penalty. Lazy recovery could in principle lead to the accumulation of hidden clones; it would be straightforward to delete clones in the background.

While our support for atomic file durability for O_ATOMIC is built atop the clone feature of AdvFS, alternative implementations are also possible, such as using delayed journal writeback [20]. Using journal writeback to achieve atomic durability has the disadvantage that the size of a single-file atomic update is limited by the size of the journal; another downside is that journaling can lead to double writes of modified data. Our approach does not have these limitations but requires that the file system provide the ability to create per-file clones. In our experience, implementing O_ATOMIC atop a per-file cloning capability is relatively straightforward, and we believe that similar implementations on other file systems that support cloning are possible. As of this writing, several open source and commercial file systems support per-file clones [4, 12, 28, 32].

Applications that use the O_ATOMIC feature of AdvFS must obey a few simple rules. The only new rule is that overlapping or concurrent modifications to a single file via multiple file descriptors void the failure-atomicity guarantee and should be avoided. Due to various subtleties it is not possible to define unambiguous semantics in such cases, so different processes/applications must coordinate their access to files. The remaining rules are not specific to AdvFS or O_ATOMIC: As in other file systems, multi-threaded concurrent accesses to a single file must be "orderly," in the sense that data races, atomicity violations, and other concurrency bugs must be avoided. As in other file systems, programmers must ensure that data being committed via fsync or msync is not being modified by other threads during those calls.

3.1 Multi-File Atomic Updates: syncv

In many situations it is necessary to failure-atomically update several files. For example, the popular SQLite database management system stores separate databases in separate files, and it supports transactions that atomically update multiple databases. To failure-atomically implement the corresponding updates to the underlying files on ordinary file systems, SQLite implements a complex multi-journal mechanism [27]. Such application-level logging can interact pathologically with analogous mechanisms in the underlying file and storage systems [33]. Support for multi-file atomic updates in the file system can simplify and streamline applications that require this capability.

AdvFS supports multi-file atomic durability via its new "syncv" mechanism, which is implemented as an ioctl for compatibility with the stock Linux kernel. Our syncv achieves failure atomicity by leveraging the AdvFS journaling mechanism. AdvFS is a journaling file system that employs write-ahead logging to ensure the integrity of the file system. Modifications to the metadata are completely written to the journal before the actual changes are written to storage. The journal is written to storage at regular intervals. During crash recovery, AdvFS reads the journal to confirm file system transactions. All completed transactions are committed to storage and uncompleted transactions are undone. The number of uncommitted records in the journal, not the amount of data in the file system, determines the speed of recovery.

Our syncv takes as arguments an array of file descriptors opened with O_ATOMIC and the size of the array. Our O_ATOMIC implementation is the building block for implementing syncv. As noted previously, O_ATOMIC deletes the existing clone and creates a new one at the time of fsync/msync. In order to make syncv atomic, the delete operation on all of the files' clones must be atomic. This is achieved using AdvFS's journaling sub-system: the metadata modifications required to delete the clones are logged to the journal, and the journaling sub-system ensures that all of these changes are atomically and durably committed. Apart from this, the recovery for syncv is no different from that of single files opened with O_ATOMIC. Creation of new clones for the files need not be made atomic in syncv because the files and their new clones are mapped to the same storage. Unlike prior work on single-file atomic durability, which uses journaling for capturing data changes [20], our multi-file atomic durability mechanism syncv uses the journal not for application data but only for the metadata changes needed to delete the clones. This metadata change is quite small and hence allows our approach to update a very large number of files simultaneously. With the AdvFS default journal size of 128 MB, a single syncv call can atomically update at least 256 files under worst-case conditions. Configuring a larger journal will proportionately increase the number of files that syncv can accommodate.

The 256-file limit stems from a combination of our current implementation of "clone delete," the worst-case size of a single "clone delete," and the size of the AdvFS journal. AdvFS checkpoints the journal at every quadrant, which limits journal transaction size to 25% of the journal size. In a badly fragmented file system, the delete of a single file (or clone) could occupy 128 KB in the journal. So for a 128 MB journal, in the worst case our current implementation of syncv can atomically update (128 MB × 25%) / 128 KB = 256 files. In principle we could support far more files per syncv by using Delayed Delete Lists (DDL). A DDL is a list maintained on non-volatile storage that is used to asynchronously delete files. If "clone delete" were to use a DDL, then the journal footprint of each clone delete would be roughly 100 bytes, and syncv would be able to handle hundreds of thousands of files.

It is important to understand that clones are taken of individual files only, not of the entire file system nor any subtree thereof. Cloning an individual file involves creating a copy of the file's inode and an associated (hidden) dentry. These steps are atomic for the same reason that ordinary file creation is atomic: they are journaled. Atomic update of an individual file involves first flushing changes to the file, then unlinking its clone, then creating a new clone; these operations are journaled separately and sequentially, so partial or full recovery of these three sequential but disjoint operations always leaves the system in a consistent state. Our syncv mechanism, which operates on multiple files, obtains atomicity by leveraging the file system journal to ensure, in REDO-log fashion, that all of the per-file atomic updates in a specified bundle are (eventually) performed.
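A multi-file commit might then look like the sketch below. The request name ADVFS_SYNCV and the argument structure are hypothetical placeholders: the text above specifies only that syncv is exposed as an ioctl taking an array of file descriptors opened with O_ATOMIC and the size of that array, not the exact header definitions.

```c
/* Hypothetical sketch of a two-file atomic commit via the syncv ioctl.
 * ADVFS_SYNCV and struct advfs_syncv_args are placeholder names; the
 * real request code and argument layout come from the AdvFS headers. */
#include <sys/ioctl.h>

struct advfs_syncv_args {      /* hypothetical argument block          */
    int    *fds;               /* descriptors opened with O_ATOMIC     */
    size_t  nfds;              /* number of descriptors                */
};

int commit_pair(int fd_a, int fd_b)
{
    int fds[2] = { fd_a, fd_b };
    struct advfs_syncv_args args = { fds, 2 };
    /* Either both files advance to their new contents or neither does. */
    return ioctl(fd_a, ADVFS_SYNCV, &args);
}
```

Issuing the ioctl on the first descriptor in the array is an arbitrary choice in this sketch; the shipped interface may differ.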

4 Evaluation

We verify that AdvFS O_ATOMIC does indeed protect the integrity of application data across updates in the presence of crashes (Section 4.1), and we compare the performance of our solution to the CMADD problem with existing alternatives (Section 4.2).

4.1 Correctness

The O_ATOMIC data integrity guarantees rely upon two realistic assumptions about underlying storage systems. First, it must be possible to commit data to durable media synchronously, which means that volatile write caches in storage hardware must include enough standby power to rescue their contents to durable media if power fails. Second, we assume that writes of 512-byte sectors are atomic. Given these preconditions, the O_ATOMIC feature of AdvFS should protect application data integrity as advertised. We verify that it does so by injecting two types of failure: crash points and power interruptions.

Crash points are manually inserted into the AdvFS source code where the developers believe crashes are most likely to cause trouble, e.g., before, during, and after atomic operations. When a particular crash point is externally activated in a running instance of AdvFS, the result is an immediate storage system shutdown followed by a kernel panic, triggered from the specified crash point in the file system source code. We complement crash-point testing with sudden whole-system power interruptions, because the former tests specific developer hypotheses about recovery whereas the latter strive to uncover surprising scenarios. Our power outage tests employ a scriptable device that suddenly cuts power to a computer without warning—the same effect as physically unplugging the machine's power cord.

The goal of both crash-point testing and power interruptions is to cause "impossible" post-crash corruption. Our test suites repeatedly and intensively modify files using write or mmap/STORE and then call fsync or msync, respectively, or syncv. The patterns of data modifications are such that a defective implementation of O_ATOMIC would likely introduce obvious evidence of corruption following a crash. Finally, we trigger crashes while our test suites are running. Our crash-point tests ran against an enterprise-class RAID controller and our whole-system power interruptions ran on an enterprise-class SSD, both of which are described in Section 4.2.

AdvFS successfully survived over 400 power interruptions and dozens of crash-point tests, with no evidence of application data corruption. The expected kind of application data corruption does occur when we inject failures if O_ATOMIC is not used. For example, crashes during append leave partial data appended, and crashes during seek/write sequences leave partial updates.
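One plausible pattern of this kind, shown below, is illustrative rather than the authors' actual test suite: every 4-KB page of the file is stamped with the same generation number in each committed update, so a post-crash checker that finds two different generation numbers has caught a torn update.

```c
/* Illustrative torn-update detector in the spirit of the tests described
 * above; not the authors' actual test suite. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define PAGES 256
#define PAGE  4096

void writer(int fd)                       /* fd opened with O_ATOMIC        */
{
    static char page[PAGE];
    for (uint64_t gen = 1; ; gen++) {     /* runs until the crash is injected */
        for (int i = 0; i < PAGES; i++) {
            memcpy(page, &gen, sizeof gen);
            pwrite(fd, page, PAGE, (off_t)i * PAGE);
        }
        fsync(fd);                        /* commit generation 'gen'        */
    }
}

int check(int fd)                         /* run after crash and remount    */
{
    uint64_t first = 0, g;
    for (int i = 0; i < PAGES; i++) {
        pread(fd, &g, sizeof g, (off_t)i * PAGE);
        if (i == 0)
            first = g;
        else if (g != first)
            return -1;                    /* torn update detected           */
    }
    return 0;                             /* all pages from a single commit */
}
```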

4.2 Performance

[Figure 2: Microbenchmark results. Median fsync() latency in milliseconds versus the number of 4-KB pages appended, with and without O_ATOMIC; top panel: server/RAID, bottom panel: workstation/SSD.]

On file system benchmarks such as IOzone [6], Postmark [18], and MDTest [16], AdvFS performance is competitive with other well-known Linux file systems such as XFS; by most performance measures AdvFS is within ±10% of other file systems. Due to space limitations we omit these comparisons and focus on the performance of atomic file updates.

We evaluate the performance of the new O_ATOMIC feature via microbenchmarks that mimic a common use case of fsync and also via "mesobenchmarks" that compare a simple transactional key-value store built atop failure-atomic msync with well-known alternatives. Prior literature has documented the ease with which failure-atomic writev/msync can be retrofitted onto complex, mature, production software to improve resilience and performance [3, 20, 29].

We ran our tests on two systems: a workstation and an enterprise server. The workstation has two quad-core 2.4 GHz Xeon E5620 processors and 12 GB of 1333 MHz DRAM and ran Linux kernel 2.6.32. We installed AdvFS on the workstation's enterprise-grade 120 GB SATA SSD on a 3 Gbps controller. The SSD is powerfail-safe because its write cache is backed by a supercapacitor. Prior to our experiments we "burned in" the SSD by writing at least 180 GB of data to the device (i.e., 1.5× its rated capacity). Our enterprise server had twelve 1.8 GHz Xeon E5-2450L cores and 92 GB of DRAM; its storage controller had a 1 GB battery-backed cache configured as 90% write cache and a 1 TB 7200 RPM SAS hard drive.

What overhead does O_ATOMIC entail compared to operations on files opened without this flag? Our microbenchmark addresses this question in the context of a common use case: using write to append data to a file followed by fsync to commit the amendments to durable media. Figure 2 presents median fsync latencies for this operation on files opened with and without O_ATOMIC on our two test machines. Our results show that O_ATOMIC carries a constant overhead on the order of 2 ms, which is clearly visible in Figure 2 for appends of up to 2⁷ pages (512 KB). This overhead occurs because the current implementation of O_ATOMIC in AdvFS performs an uncached storage read in connection with creating an inode for the clone when fsync is called. This is a simplification in our current implementation, and clone creation times could be reduced by copying in-core state rather than reading from storage.
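The measurement loop can be reconstructed from this description roughly as follows; this is an illustrative sketch, not the authors' exact harness.

```c
/* Reconstructed sketch of the append+fsync microbenchmark: append n 4-KB
 * pages, then measure the fsync that makes the append durable. */
#include <time.h>
#include <unistd.h>

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

double append_and_sync(int fd, int npages)   /* fd with or without O_ATOMIC */
{
    static char page[4096];
    for (int i = 0; i < npages; i++)
        write(fd, page, sizeof page);        /* buffered appends            */
    double t0 = now_ms();
    fsync(fd);                               /* the commit we time          */
    return now_ms() - t0;                    /* report the median of many runs */
}
```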

                          Server/RAID                 Workstation/SSD
                      insert  replace  delete     insert  replace  delete
STL map/AdvFS          1.996    2.488   2.919      1.655    2.022   2.395
Kyoto Cabinet 1.2.76   4.711    2.990   4.660      4.088    2.590   4.007
SQLite 3.7.14.1        2.394    2.524   2.433      2.374    2.611   2.435
LevelDB 1.6.0          0.629    0.626   0.615      0.641    0.640   0.633

Table 1: Mesobenchmarks: mean per-operation timings (milliseconds).

For large modifications, the roughly constant overhead of O_ATOMIC becomes negligible (approximately 2–3.5%) and we approach the rated write bandwidth of the SSD. In other words, the price we pay for failure-atomicity is modest for large updates.

Our "mesobenchmark" repeats the experiment in Section 5.3 of our original failure-atomic msync paper [20], which compares four transactional key-value stores: SQLite [26], LevelDB [14], Kyoto Cabinet [13], and a fourth contender implemented as a C++ Standard Template Library (STL) map container that stores data in a persistent heap backed by a file opened with O_ATOMIC and updated with msync. Each of these key-value stores performed the following transactional operations on one thousand keys: first, insert all keys paired with random 1 KB values; next, replace the value associated with each key with a different random value; and finally, delete all of the keys, for a total of three thousand transactions. Each of the above three steps visits keys in a different random order, i.e., we randomly permute the universe of keys before each step.

Table 1 presents our results, which are comparable to those in our earlier work. LevelDB wins hands down, with the STL map atop a memory-mapped file updated with AdvFS failure-atomic msync placing second. This is easy to understand: The red-black tree beneath an STL map makes no attempt to minimize the number of memory pages it modifies, which strongly influences the performance of STL map/AdvFS; by contrast, LevelDB implements failure-atomic updates with carefully crafted, compact log file writes. Our simple map-based key-value store shows that a persistent heap based on atomic file update can very easily slide beneath a rich, full-featured library of in-memory data structures and algorithms—which takes roughly a dozen lines of code in the present case. The net result is to transform software with no failure resilience whatsoever into software that can withstand process crashes, OS kernel panics, and power outages. Our AdvFS-fortified map furthermore achieves better performance than two far more complex platforms designed to provide failure resilience with good performance. Our experience convinces us that failure-atomic file update enables dramatically simplified application software whose performance rivals all but expertly streamlined code.

5 Conclusions

We have shown that a mechanism for consistent modification of application durable data (CMADD) can be implemented straightforwardly atop per-file cloning, a feature already available in AdvFS and in several other modern FSes. Our implementation of O_ATOMIC exposes a simple interface to applications, makes file modifications via both write and mmap failure-atomic, avoids double writes, and supports very large transactional updates of application data. Furthermore, our syncv implementation supports failure-atomic updates of application data in multiple files. Our empirical results show that the O_ATOMIC implementation in AdvFS preserves the integrity of application data across updates in the presence of both surgically inserted crashes and sudden power interruptions. Our performance evaluation shows that O_ATOMIC carries tolerable overheads, particularly for large atomic updates.

Implementing a CMADD mechanism in a file system facilitates adoption because it requires neither special hardware nor modified OS kernels. We believe that file systems should implement simple, general, robust CMADD mechanisms, that many applications would exploit such a feature if it were widely available, and that O_ATOMIC and syncv are convenient interfaces.

Acknowledgments

We thank Joe Tucek and Haris Volos for helpful advice and feedback on early drafts of this paper. The anonymous FAST reviewers and our shepherd, Raju Rangaswami, provided valuable suggestions and guidance for the final version. We thank HP Storage manager Subbi Mudigere for the support and resources that enabled us to implement the research concept of failure-atomic file update in AdvFS.

7 USENIX Association 13th USENIX Conference on File and Storage Technologies (FAST ’15) 209 References Apple desktop applications. In SOSP, 2011. http://doi.acm.org/10.1145/ [1] Tru64 AdvFS technology. Retrieved 2043556.2043564. 17 September 2014 from http://advfs.sourceforge.net/. [12] D. Hitz, J. Lau, and M. Malcolm. File system design for an NFS file server appliance. In [2] Berkeley Database (BDB). http://www. USENIX Winter Technical Conference, 1994. oracle.com/us/products/database/ http://dl.acm.org/citation.cfm?id= berkeley-db/overview/index.html. 1267074.1267093.

[3] A. Blattner, R. Dagan, and T. Kelly. Generic [13] Kyoto Cabinet: a straightforward implementation crash-resilient storage for Indigo and beyond. of DBM. Technical Report HPL-2013-75, Hewlett-Packard http://fallabs.com/kyotocabinet/. Laboratories, Nov. 2013. http://www.hpl.hp.com/techreports/ [14] Google’s LevelDB key-value store. 2013/HPL-2013-75.pdf. https://github.com/google/leveldb.

[4] file system, Sept. 2014. [15] L. Marmol, S. Sundararaman, N. Talagala, https://btrfs.wiki.kernel.org/ R. Rangaswami, S. Devendrappa, B. Ramsundar, index.php/Main_Page. and S. Ganesan. NVMKV: A scalable and lightweight flash aware key-value store. In [5] V. Chidambaram, T. S. Pillai, A. C. HotStorage, June 2014. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. https://www.usenix.org/system/ Optimistic crash consistency. In SOSP, 2013. files/conference/hotstorage14/ hotstorage14-paper-marmol.pdf. [6] D. Capps. IOzone filesystem benchmark. http://www.iozone.org/. [16] Mdtest: A synthetic benchmark for file systems metadata operations. [7] Apple exchangedata(2) manual page. sourceforge.net/projects/mdtest/. Retrieved 22 September 2014 from https: //developer.apple.com/library/mac/ [17] Microsoft Developer Network. Alternatives to documentation/Darwin/Reference/ using transactional NTFS. Retrieved ManPages/man2/exchangedata.2.html. 17 September 2014 from http://msdn.microsoft.com/en-us/ [8] FusionIO. NVM primitives library, Feb. 2014. See library/hh802690.aspx. in particular the nvm_batch_atomic_operations at [18] Network Appliance. Postmark: A New Filesystem http://opennvm.github.io/ Benchmark, Technical Report TR3022, Network nvm-primitives-documents/. Appliance, 1997. www.netapp.com/tech_ library/3022.html/. [9] J. Guerra, L. M´armol, D. Campello, C. Crespo, R. Rangaswami, and J. Wei. Software persistent [19] T. Neward. Object-relational mapping: The memory. In USENIX Annual Technical Vietnam of computer science, June 2006. Conference, 2012. https://www.usenix. http://blogs.tedneward.com/2006/ org/system/files/conference/atc12/ 06/26/The+Vietnam+Of+Computer+ atc12-final70.pdf. Science.aspx.

[10] G. Haff. The many lives of AdvFS, June 2008. [20] S. Park, T. Kelly, and K. Shen. Failure-atomic Retrieved 17 September 2014 from msync(): A simple and efficient mechanism for http://www.cnet.com/news/ preserving the integrity of durable data. In the-many-lives-of-/. EuroSys, 2013. http://doi.acm.org/10. 1145/2465351.2465374. [11] T. Harter, C. Dragga, M. Vaughn, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. A file [21] Z. Peterson and R. Burns. : A is not a file: Understanding the I/O behavior of time-shifting file system for regulatory

8

210 13th USENIX Conference on File and Storage Technologies (FAST ’15) USENIX Association compliance. ACM Transactions on Storage, 1(2), [32] Symantec Veritas VxFS file system. Retrieved 23 May 2005. http://doi.acm.org/10. September 2014 from https://sort. 1145/1063786.1063789. symantec.com/public/documents/ sfha/6.0/aix/manualpages/html/man/ [22] D. E. Porter, O. S. Hofmann, C. J. Rossbach, storage_foundation_for_databases_ A. Benn, and E. Witchel. tools/html/man1m/ transactions. In SOSP, 2009. http: vxsfadm-filesnap.1m.html. //www.cs.utexas.edu/users/witchel/ pubs/porter09sosp-txos.pdf. [33] J. Yang, N. Plasson, G. Gillis, N. Talagala, and S. Sundararaman. Dont stack your log on my log. [23] ReplaceFile function. In USENIX Workshop on Interactions of Retrieved 22 September 2014 from NVM/Flash with Operating Systems and http://msdn.microsoft.com/en-us/ Workloads (INFLOW), 2014. library/aa365512.aspx. https://www.usenix.org/system/ files/conference/inflow14/ [24] M. Saxena, M. A. Shah, S. Harizopoulos, M. M. inflow14-yang.pdf Swift, and A. Merchant. Hathi: Durable . transactions for memory using flash. In [34] S. Yoo, C. Killian, T. Kelly, H. K. Cho, and Non-Volatile Memories Workshop, Mar. 2012. S. Plite. Composable reliability for asynchronous http://pages.cs.wisc.edu/˜swift/ systems. In USENIX Annual Technical papers/nvmw12_hathi.pdf. Conference, 2012. https://www.usenix. org/system/files/conference/atc12/ [25] R. P. Spillane, S. Gaikwad, M. Chinni, E. Zadok, atc12-final206-7-20-12.pdf. and C. P. Wright. Enabling transactional file access via lightweight kernel extensions. In FAST, [35] OpenZFS, Sept. 2014. http: 2009. https://www.usenix.org/ //www.open-zfs.org/wiki/Main_Page. legacy/event/fast09/tech/full_ papers/spillane/spillane.pdf. [36] M. Zheng, J. Tucek, D. Huang, F. Qin, M. Lillibridge, E. S. Yang, B. W. Zhao, and [26] SQLite database library. S. Singh. Torturing databases for fun and profit. In http://www.sqlite.org/. OSDI, Oct. 2014. https://www.usenix.org/conference/ [27] SQLite multi-file atomic commit (Section 5). osdi14/technical-sessions/ http://www.sqlite.org/ presentation/zheng_mai. atomiccommit.html.

[28] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck. Scalability in the XFS file system. In USENIX Annual Technical Conference, 1996. http://dl.acm.org/ citation.cfm?id=1268299.1268300.

[29] N. Talagala. Atomic writes accelerate MySQL performance, Oct. 2011. http://www.fusionio.com/blog/ atomic-writes-accelerate-mysql-performance/.

[30] The Open Group. Portable Operating System Interface (POSIX) Base Specifications, Issue 7, IEEE Standard 1003.1. IEEE, 2008.

[31] H. Volos, A. J. Tack, and M. M. Swift. Mnemosyne: Lightweight persistent memory. In ASPLOS, 2011. http://doi.acm.org/10. 1145/1950365.1950379.

9

USENIX Association 13th USENIX Conference on File and Storage Technologies (FAST ’15) 211