Optimistic Crash Consistency

Vijay Chidambaram, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Department of Computer Sciences
University of Wisconsin, Madison
{vijayc, madthanu, dusseau, remzi}@cs.wisc.edu

Abstract

We introduce optimistic crash consistency, a new approach to crash consistency in journaling file systems. Using an array of novel techniques, we demonstrate how to build an optimistic commit protocol that correctly recovers from crashes and delivers high performance. We implement this optimistic approach within a Linux ext4 variant which we call OptFS. We introduce two new file-system primitives, osync() and dsync(), that decouple ordering of writes from their durability. We show through experiments that OptFS improves performance for many workloads, sometimes by an order of magnitude; we confirm its correctness through a series of robustness tests, showing it recovers to a consistent state after crashes. Finally, we show that osync() and dsync() are useful in atomic file system and database update scenarios, both improving performance and meeting application-level consistency demands.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the Owner/Author(s).
SOSP'13, Nov. 3–6, 2013, Farmington, Pennsylvania, USA.
ACM 978-1-4503-2388-8/13/11.
http://dx.doi.org/10.1145/2517349.2522726

1 Introduction

Modern storage devices present a seemingly innocuous interface to clients. To read a block, one simply issues a low-level read command and specifies the address of the block (or set of blocks) to read; when the disk finishes the read, it is transferred into memory and any awaiting clients notified of the completion. A similar process is followed for writes.

Unfortunately, the introduction of write buffering [28] in modern disks greatly complicates this apparently simple process. With write buffering enabled, disk writes may complete out of order, as a smart disk scheduler may reorder requests for performance [13,24,38]; further, the notification received after a write issue implies only that the disk has received the request, not that the data has been written to the disk surface persistently.

Out-of-order write completion, in particular, greatly complicates known techniques for recovering from system crashes. For example, modern journaling file systems such as Linux ext3, XFS, and NTFS all carefully orchestrate a sequence of updates to ensure that writes to main file-system structures and the journal reach disk in a particular order [22]; copy-on-write file systems such as LFS, btrfs, and ZFS also require ordering when updating certain structures. Without ordering, most file systems cannot ensure that state can be recovered after a crash [6].

Write ordering is achieved in modern drives via expensive cache flush operations [30]; such flushes cause all buffered dirty data in the drive to be written to the surface (i.e., persisted) immediately. To ensure A is written before B, a client issues the write to A, and then a cache flush; when the flush returns, the client can safely assume that A reached the disk; the write to B can then be safely issued, knowing it will be persisted after A.

Unfortunately, cache flushing is expensive, sometimes prohibitively so. Flushes make I/O scheduling less efficient, as the disk has fewer requests to choose from. A flush also unnecessarily forces all previous writes to disk, whereas the requirements of the client may be less stringent. In addition, during a large cache flush, disk reads may exhibit extremely long latencies as they wait for pending writes to complete [26]. Finally, flushing conflates ordering and durability; if a client simply wishes to order one write before another, forcing the first write to disk is an expensive manner in which to achieve such an end. In short, the classic approach of flushing is pessimistic; it assumes a crash will occur and goes to great lengths to ensure that the disk is never in an inconsistent state via flush commands. The poor performance that results from pessimism has led some systems to disable flushing, apparently sacrificing correctness for performance; for example, the Linux ext3 default configuration did not flush caches for many years [8].

Disabling flushes does not necessarily lead to file system inconsistency, but rather introduces it as a possibility. We refer to such an approach as probabilistic crash consistency, in which a crash might lead to file system inconsistency, depending on many factors, including workload, system and disk parameters, and the exact timing of the crash or power loss. In this paper, one of our first contributions is the careful study of probabilistic crash consistency, wherein we show which exact factors affect the odds that a crash will leave the file system inconsistent (§3). We show that for some workloads, the probabilistic approach rarely leaves the file system inconsistent.

Unfortunately, a probabilistic approach is insufficient for many applications, where certainty in crash recovery is desired. Indeed, we also show, for some workloads, that the chances of inconsistency are high; to realize higher-level application-level consistency (i.e., something a DBMS might desire), the file system must provide something more than probability and chance.

Thus, in this paper, we introduce optimistic crash consistency, a new approach to building a crash-consistent journaling file system (§4). This optimistic approach takes advantage of the fact that in many cases, ordering can be achieved through other means and that crashes are rare events (similar to optimistic concurrency control [12, 16]). However, realizing consistency in an optimistic fashion is not without challenge; we thus develop a range of novel techniques, including a new extension of the transactional checksum [23] to detect data/metadata inconsistency, delayed reuse of blocks to avoid incorrect dangling pointers, and a selective data journaling technique to handle block overwrite correctly. The combination of these techniques leads to both high performance and deterministic consistency; in the rare event that a crash does occur, optimistic crash consistency either avoids inconsistency by design or ensures that enough information is present on the disk to detect and discard improper updates during recovery.

We demonstrate the power of optimistic crash consistency through the design, implementation, and analysis of the optimistic file system (OptFS). OptFS builds upon the principles of optimistic crash consistency to implement optimistic journaling, which ensures that the file system is kept consistent despite crashes. Optimistic journaling is realized as a set of modifications to the Linux ext4 file system, but also requires a slight change in the disk interface to provide what we refer to as asynchronous durability notification, i.e., a notification when a write is persisted in addition to when the write has simply been received by the disk. We describe the details of our implementation (§5) and study its performance (§6), showing that for a range of workloads, OptFS significantly outperforms classic Linux ext4 with pessimistic journaling, and showing that OptFS performs almost [...]

[...] enables application-level consistency at high performance. OptFS introduces two new file-system primitives: osync(), which ensures ordering between writes but only eventual durability, and dsync(), which ensures immediate durability as well as ordering.

We show how these primitives provide a useful base on which to build higher-level application consistency semantics (§7). Specifically, we show how a document editing application can use osync() to implement the atomic update of a file (via a create and then atomic rename), and how the SQLite database management system can use file-system provided ordering to implement ordered transactions with eventual durability. We show that these primitives are sufficient for realizing useful application-level consistency at high performance.

Of course, the optimistic approach, while useful in many scenarios, is not a panacea. If an application requires immediate, synchronous durability (instead of eventual, asynchronous durability with consistent ordering), an expensive cache flush is still required. In this case, applications can use dsync() to request durability (as well as ordering). However, by decoupling the durability of writes from their ordering, OptFS provides a useful middle ground, thus realizing high performance and meaningful crash consistency for many applications.

2 Pessimistic Crash Consistency

To understand the optimistic approach to journaling, we first describe standard pessimistic crash consistency in journaling file systems. To do so, we describe the necessary disk support (i.e., cache-flushing commands) and details on how such crash consistency operates. We then demonstrate the negative performance impact of cache flushing during pessimistic journaling.

2.1 Disk Interface

For the purposes of this discussion, we assume the presence of a disk-level cache flush command. In the ATA family of drives, this is referred to as the "flush cache" command; in SCSI drives, it is known as "synchronize cache". Both operations have similar semantics, forcing all pending dirty writes in the disk to be written to the surface. Note that a flush can be issued as a separate request, or as part of a write to a given block D; in the latter case, pending writes are flushed before the write to D.

Some finer-grained controls also exist. For example, "force unit access" (FUA) commands read or write around the cache entirely.
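As a rough illustration of the flush-based ordering pattern described in the text (write A, flush, then write B), the following sketch approximates it from user space. It is not taken from the paper. It assumes a POSIX system and uses fsync() as a stand-in for the disk-level flush command; whether fsync() actually triggers a drive cache flush depends on the file system and its mount options. The helper names write_block and ordered_write and the 4096-byte block size are hypothetical, chosen only for this example.

```python
import os
import tempfile


def write_block(fd: int, offset: int, data: bytes) -> None:
    """Write data at the given byte offset, retrying on short writes."""
    view = memoryview(data)
    while view:
        written = os.pwrite(fd, view, offset)
        offset += written
        view = view[written:]


def ordered_write(path: str, a: bytes, b: bytes, block_size: int = 4096) -> None:
    """Pessimistic ordering: make block A durable before issuing block B.

    Hypothetical helper; fsync() stands in for the drive's cache flush.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        write_block(fd, 0, a)           # issue the write to A
        os.fsync(fd)                    # "flush": block until A is reported durable
        write_block(fd, block_size, b)  # only now issue the write to B
        os.fsync(fd)                    # needed only if B must also be durable
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "disk.img")
        ordered_write(target, b"A" * 4096, b"B" * 4096)
        print(os.path.getsize(target))
```

Note that the first fsync() both orders A before B and forces A to be durable immediately, even when the caller only needs the ordering. That coupling is the cost the text attributes to the pessimistic approach.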
