Scaling a file system to many cores using an operation log

Srivatsa S. Bhat,† Rasha Eqbal,‡ Austin T. Clements,§ M. Frans Kaashoek, Nickolai Zeldovich MIT CSAIL

ABSTRACT allow file-system-intensive applications to scale better [4, It is challenging to simultaneously achieve multicore scala- 10, 13, 23, 26, 31]. This paper contributes a clean-slate file bility and high disk throughput in a file system. For exam- system design that allows for good multicore scalability by ple, even for commutative operations like creating different separating the in-memory file system from the on-disk file files in the same directory, current file systems introduce system, and describes a prototype file system, ScaleFS, that cache-line conflicts when updating an in-memory copy of implements this design. the on-disk directory block, which limits scalability. The main goal achieved by ScaleFS is multicore scala- ScaleFS is a novel file system design that decouples the bility. ScaleFS scales well for a number of workloads on an in-memory file system from the on-disk file system using 80-core machine, but even more importantly, the ScaleFS per-core operation logs. This design facilitates the use of implementation is conflict-free for almost all commutative highly concurrent data structures for the in-memory repre- operations [10]. Conflict freedom allows ScaleFS to take sentation, which allows commutative operations to proceed advantage of disjoint-access parallelism [1, 20] and suggests without cache conflicts and hence scale perfectly. ScaleFS that ScaleFS will continue to scale even for workloads or logs operations in a per-core log so that it can delay propa- machines we have not yet measured. gating updates to the disk representation (and the cache-line In addition to scalability, ScaleFS must also satisfy two conflicts involved in doing so) until an fsync. The fsync standard file system constraints: crash safety (meaning that call merges the per-core logs and applies the operations to ScaleFS recovers from a crash at any point and provides clear disk. ScaleFS uses several techniques to perform the merge guarantees for fsync) and good disk throughput (meaning correctly while achieving good performance: timestamped that the amount of data written to the disk is commensurate linearization points to order updates without introducing with the changes that an application made to the file system, cache-line conflicts, absorption of logged operations, and and that data is written efficiently). dependency tracking across operations. These goals are difficult to achieve together. Consider Experiments with a prototype of ScaleFS show that its directory operations in ’s file system [29]. Direc- implementation has no cache conflicts for 99% of test cases tories are represented in memory using a concurrent hash of commutative operations generated by Commuter, scales table, but when Linux updates a directory, it also propagates well on an 80-core machine, and provides on-disk perfor- these changes to the in-memory ext4 physical log, which is mance that is comparable to that of Linux ext4. later flushed to disk. The physical log is essential to ensure crash safety, but can cause two commutative directory oper- 1 INTRODUCTION ations to contend for the same disk block in the in-memory log. For example, consider create(f1) and create(f2) un- Many of today’s file systems do not scale well on multicore der the same parent directory, where f1 and f2 are distinct. machines, and much effort is spent on improving them to According to the Scalable Commutativity Rule [10], because † Now at VMware. ‡ Now at Apple. § Now at Google. 
these two creates commute, a conflict-free and thus scal- able implementation is possible. However, in Linux, they may update the same disk block, causing cache conflicts and Permission to make digital or hard copies of part or all of this work for limiting scalability despite commuting. personal or classroom use is granted without fee provided that copies Multicore scalability is important even for a file system are not made or distributed for profit or commercial advantage and that 1 copies bear this notice and the full citation on the first page. Copyrights for on a relatively slow disk, because many workloads operate third-party components of this work must be honored. For all other uses, in memory without flushing every change to disk. For exam- contact the owner/author(s). ple, an application may process a large amount of data by SOSP’17, October 28–31, 2017, Shanghai, China. creating and deleting temporary files, and flush changes to © 2017 Copyright is held by the owner/author(s). ACM ISBN 978-1-4503-5085-3/17/10. 1In this paper, we use the term “disk” loosely to refer to persistent https://doi.org/10.1145/3132747.3132779 storage, including rotational disks, flash-based SSDs, etc.

1 disk only after producing a final output file. It is important erations that, say, Linux supports (such as sendfile), but that the application not be bottlenecked by the file system ScaleFS does implement many operations required from a when it is not flushing data to disk. Similarly, a file system file system, and supports complex operations suchas rename may be hosting multiple applications, such as a text editor across directories. Furthermore, we believe that extending (which requires strong durability) and a parallel software ScaleFS to support additional features can be done without build (which does not require immediate durability but needs impacting scalability of commutative operations. the file system to scale well). Experiments with ScaleFS on Commuter [10] demon- Our key insight is to decouple the in-memory file system strate that ScaleFS maintains sv6’s high multicore scalabil- from the on-disk file system, and incur the cost of writing ity for commutative operations while providing crash safety. to disk (including the cache-line conflicts to construct an This demonstrates that ScaleFS’s decoupling approach is in-memory copy of the on-disk data structures that will be effective at combining crash safety and scalability. Exper- written to disk) only when requested by the application. To imental results also indicate that ScaleFS achieves better enable decoupling, ScaleFS separates the in-memory file scalability than the Linux ext4 file system for in-memory system from the on-disk file system using an operation log, workloads, while providing similar performance when ac- based on oplog [5]. The operation log consists of per-core cessing disk. ScaleFS is conflict-free in 99.2% of the com- logs of file system operations (e.g., link, unlink, rename). mutative test cases Commuter generates, while Linux is When fsync or are invoked, ScaleFS sorts the opera- conflict-free for only 65% of them. Furthermore, experiments tions in the operation log by timestamp, and applies them to demonstrate that ScaleFS achieves good disk performance. the on-disk file system. For example, ScaleFS implements The main contributions of the paper are as follows. directories in such a way that if two cores update different • A new design approach for multicore file systems that entries in a shared directory, then no interaction is neces- decouples the in-memory file system from the on-disk sary between the two cores. When an application calls fsync file system using an operation log. on the directory, ScaleFS merges the per-core operation • Techniques based on timestamping linearization points logs into an ordered log, prepares the on-disk representation, that ensure crash safety and high disk performance. adds the updated disk blocks to a physical log, and finally flushes the physical log to disk. • An implementation of the above design and techniques Although existing file systems decouple representations in a ScaleFS prototype. for reads, the operation log allows ScaleFS to take this ap- • An evaluation of our ScaleFS prototype that confirms proach to its logical extreme even for updates. For example, that ScaleFS achieves good scalability and performance. Linux has an in-memory cache that represents directory en- The rest of the paper is organized as follows. §2 describes tries differently than on disk. 
However, system calls that related work, §3 describes the semantics that ScaleFS aims modify directories update both the in-memory cache and to provide, §4 provides an overview of ScaleFS, §5 describes an in-memory copy of the on-disk representation. Keeping the design of ScaleFS, §6 summarizes ScaleFS’s implemen- an updated copy of the on-disk representation in memory tation, §7 presents experimental results, and §8 concludes. means that when it is eventually written to disk (e.g., when an application invokes fsync), the on-disk state will cor- 2 RELATED WORK rectly reflect the order in which the application’s system calls executed. The main contribution of ScaleFS is the split design that allows the in-memory file system to be designed for multi- ScaleFS’s log enables two independent file systems: an core scalability and the on-disk file system for durability and in-memory file system tailored to achieve scalability, andan disk performance. The rest of this section relates ScaleFS’s on-disk file system tailored for high disk throughput. The separation to previous designs. in-memory file system can choose data structures that allow for good concurrency, choose numbers that can be allo- File system scalability. ScaleFS adopts its in-memory cated concurrently without coordination, etc. (as in sv6 [10]) file system from sv6 [10]. sv6 uses sophisticated parallel- and be completely unaware of the on-disk data structures. programming techniques to make commutative file system On the other hand, the on-disk file system can choose data operations conflict-free so that they scale well on today’s structures that allow for good disk throughput, and can even multicore processors. Due to these techniques, sv6 scales reuse an existing on-disk format. The operation log connects better than Linux for many in-memory operations. sv6’s, the two file systems, so that changes that took place inthe however, is only an in-memory file system; it does not write in-memory file system can be applied consistently tothe data to durable storage. ScaleFS’s primary contribution over on-disk file system on fsync. sv6’s in-memory file system lies in combining multicore We have implemented ScaleFS by extending sv6 [10], scalability with durability. To do so, ScaleFS extends the which does not provide crash safety, with the on-disk file in-memory file system using oplog [5] to track linearization system from xv6 [12] using the decoupling approach. The points, and adds an on-disk file system using an operation implementation does not support all of the file system op- log sorted by the timestamps of linearization points.

2 NOVA [42] and iJournaling [32] take an approach similar Oplog allows cores to append to per-core logs without any to ScaleFS by maintaining per-inode logs. ScaleFS gener- interactions with other cores. ScaleFS extends oplog’s de- alizes the per-inode log approach by allowing the use of sign to sort operations by timestamps of linearization points different on-disk file systems, whereas NOVA and iJournal- of file system operations to ensure crash safety. ing dictate the on-disk layout by requiring the per-inode log to be directly stored in non-volatile memory or on disk. Applying distributed techniques to multi-core file sys- ScaleFS also allows directory operations to be more scal- tems. Hare is a scalable in-memory file system for multi- able than NOVA or iJournaling, because ScaleFS maintains core machines without cache coherence [17]. It does not per-core logs, which avoid cache-line contention even when provide persistence and poses this as an open research prob- modifying the same directory from different cores. Finally, lem. ScaleFS’s decoupling approach is likely to be a good ScaleFS can absorb multiple operations to the same direc- match for Hare too. tory to reduce the amount of data written to disk. SpanFS [22] extends Hare’s approach by implementing Many existing file systems, including Linux ext4, suffer persistence. This requires each core to participate in a two- from multicore scalability bottlenecks [4, 10, 31], and file sys- phase commit protocol for operations such as rename that tem developers are actively improving scalability in practice. can span multiple cores. Much like Hare, SpanFS shards However, most practical file systems are taking an incremen- files and directories across cores at a coarse granularity, and tal approach to improving scalability, as demonstrated by cannot re-balance the assignment at runtime. While SpanFS NetApp’s Waffinity design13 [ ]. This improves scalability for can alleviate some file system scalability bottlenecks, its particular workloads and hardware configurations, but fails rigid sharding requires the application developer to carefully to achieve ScaleFS’s goal of conflict-free implementations distribute files and directories across cores, and to not access for all commutative operations, which is needed to scale on the same file or directory from multiple cores. as-of-yet unknown workloads or hardware platforms. In contrast to SpanFS and Hare, ScaleFS does not require the application developer to partition files or directories. Separation using logging. File systems use different data structures for in-memory and on-disk operations. For exam- ScaleFS provides scalability for commutative operations, ple, directories are often represented in memory differently even if they happen to modify the same file or directory. As than on disk to allow for higher performance and parallelism. we show in §7, ScaleFS achieves good scalability even when Linux’s dcache [11, 30], which uses a concurrent hash table, delivering messages to shared mailboxes on 80 cores; we do is a good example. Similarly, ext4’s delayed allocation allows not expect SpanFS or Hare would scale under this workload. ext4 to defer accessing disk bitmaps until necessary. How- ever, no file system completely decouples the in-memory file On-disk file systems. ScaleFS’s on-disk file system uses system from the on-disk file system. In particular, in every a simple design based on xv6 [12]. 
It runs file system updates other scalable file system, operations that perform modifi- inside of a transaction as many previous file systems have cations to in-memory directories also manipulate a copy of done [8, 18], has a physical log for crash recovery of metadata the on-disk data structures in memory. File systems update file system operations (but less sophisticated than, say, ext4’s an in-memory copy of the on-disk data structures so that design [29]), and implements file system data structures on when these data structures are eventually written to disk, disk in the same style as the early versions of [37]. they are consistent with the order in which the applications’ Because of the complete decoupling of the in-memory file system calls executed in memory. This lack of decoupling system and on-disk file system, ScaleFS could be modified to between the in-memory and on-disk representations causes use ext4’s disk format, and adopt many of its techniques to unnecessary cache-line movement and limits scalability. support better disk performance. Similarly, ScaleFS’s on- ReconFS [28] pushes the separation further than tradi- disk file system could be changed to use other ordering tional file systems do. For example, it decouples the volatile techniques than transactions; for example, it could use soft and the persistent directory tree maintenance and emulates updates [16], a patch-based approach [15], or backpointer- hierarchical namespace access on the volatile copy. In the based consistency [7]. event of system failures, the persistent tree can be recon- ScaleFS’s on-disk file system provides concurrent I/O us- structed using embedded connectivity and metadata per- ing standard striping techniques. It could be further extended sistence logging. ReconFS, however, is specialized to non- using more sophisticated techniques from file systems such volatile memory, while ScaleFS’s design does not require as LinuxLFS [21], [38], TABLEFS [36], NoFS [7], and non-volatile memory. XFS [40] to increase disk performance. The decoupling is more commonly used in distributed ScaleFS’s on-disk file system can commit multiple trans- systems. For example, the BFT library separates the BFT actions concurrently using multiple physical journals. Prior protocol from NFS operations using a log [6]. However, these work [27, 32, 35] has also shown how concurrent fsync calls designs do not use logs designed for multicore scalability. can run in parallel. Using CCFS [35] or iJournaling [32] in- ScaleFS implements the log that separates the in-memory stead of our prototype on-disk file system may provide better file system from the on-disk file system using an5 oplog[ ]. disk performance; the contribution of this paper lies in the

3 decoupling of the in-memory and on-disk file systems, and flush the new version of d1 (without a), but avoid flushing the multi-core scalability that this allows ScaleFS to achieve. d2 (since the application did not call fsync(d2)). However, ScaleFS’s on-disk file system is agnostic about the if the system were to now crash, the file would be lost, since medium that stores the file system, but in principle should it is neither in d1 nor in d2. This purely local specification be able to benefit from recent work using non-volatile mem- makes it hard for an application to safely use rename across ory, such as UBJ [25], byte-granularity logging [14, 42], or directories, as articulated in our second property, as follows. ReconFS [28] to minimize intensive metadata writeback and The second property for ScaleFS’s fsync semantics is that scattered small updates. the file system can initiate any fsync operations on its own. This is crucial because the file system needs to flush changes 3 DURABILITY SEMANTICS to disk in the background when it runs out of memory. How- ever, this property means that if the user types mv d1/a d2/a, To achieve high performance, a file system must defer writ- a file system implementing purely local semantics can now ing to persistent storage for as long as possible, allowing invoke fsync(d1) to free up memory, and as in the above two crucial optimizations: batching (flushing many changes example, lose this file after a crash. This behavior would be together is more efficient) and absorption (later changes may highly surprising to users and application developers, who supersede earlier ones). However, a file system cannot defer do not expect that rename can cause a file to be lost, if the file forever, and may need to flush to persistent storage for two was already persistently stored on disk before the rename. reasons. First, it may run out of space in memory. Second, The third property aims to resolve this anomaly, by re- an application may explicitly call fsync, sync, etc. Thus, for quiring that rename will not cause a file or directory to be lost, performance, it is critical to write as little as possible in the although it is acceptable if a crash during rename causes a second case (fsync). To do this, it is important to agree on file to appear twice in the file system (both under theold the semantics of fsync. and new path names). As mentioned in the previous para- A complication in agreeing on the semantics of fsync graph, this property is not strictly necessary based on first comes from the application developer, whose job it is to principles, but we believe a file system would be difficult to build crash-safe applications on top of the file system inter- use without this guarantee. face. If the interface is too complex to use (e.g., the semantics The fourth property requires that on-disk data structures of fsync are subtle), some application developers are likely must be crash-safe, meaning the file system will not be cor- to make mistakes [34]. On the other hand, an interface that rupted after it recovers from a crash. An example of what falls provides cleaner crash safety properties may give up perfor- under this property is what ext4 provides in data=ordered mance that sophisticated application programmers want. In mode, which prevents uninitialized data from being exposed this paper, ScaleFS aims to provide high performance, tar- after a crash. 
geting programmers who carefully manage the calls to sync Taking all of ScaleFS’s properties together, the final se- and fsync in their applications, with a view to obtain the mantics of fsync are that it flushes changes to the file or best performance. This approach is in line with other recent directory being fsynced, and, in the case of fsync on a direc- work on high-performance file systems [32, 35]. Additional tory, it also flushes changes to other directories where files interfaces can help application developers write crash-safe may have been renamed to, both to avoid losing files and to applications, such as CCFS streams [35], Featherstitch [15], maintain internal consistency. and failure-atomic msync [33], although the ScaleFS proto- type does not support them. ScaleFS’s semantics for fsync are defined by four prop- 4 OVERVIEW erties. The first property is that fsync’s effects are local: an ScaleFS consists of two file systems: an in-memory file sys- fsync of a file or directory ensures that all changes tothat tem called MemFS, and an on-disk file system called DiskFS. file or directory, at the time fsync was called, are flushed to The in-memory file system uses concurrent data structures disk. For instance, calling fsync on one file will ensure that that allow commutative in-memory operations to scale, while this file’s data and metadata are durably stored on disk, but the on-disk file system deals with constructing disk blocks will not flush the data or metadata of other files, orevenof and writing them to disk. This approach allows most com- the directory containing this file. If the application wants to mutative file system operations to execute conflict-free. The ensure the directory entry is durably stored on disk, it must two file systems are coupled by an operation log[5], which invoke fsync on the parent directory as well [34, 43]. logs directory changes such as link, unlink, rename, etc. Al- The local property allows for high performance, but some though an operation log is designed for update-heavy work- operations—specifically, rename—cannot be entirely local. loads, it is a good fit for ScaleFS’s design because MemFS For instance, suppose there are two directories d1 and d2, and handles all of the reads, and the log entries are collected a file d1/a, and all of them are durably stored on disk (there when flushing changes to DiskFS. are no outstanding changes). What should happen if the ap- To allow MemFS to operate independently of DiskFS, plication calls rename(d1/a, d2/a) followed by fsync(d1)? MemFS uses different inode numbers from DiskFS; to make Naïvely following the local semantics, a file system might this explicit, we use the term mnode to refer to inode-like

4 objects used by MemFS, and we reserve the term inode for At the bottom, DiskFS implements a traditional on-disk DiskFS. This allows MemFS to allocate mnode numbers in a file system. The file system uses a physical journal toapply scalable way (e.g., by incrementing per-core counters), but changes atomically to disk. For scalability, DiskFS uses per- still preserves the ability of DiskFS to map inode numbers core physical journals. DiskFS maintains a buffer cache for to physical disk locations. As in traditional Unix file systems, on-disk metadata, such as allocation bitmaps and , inode numbers are permanent; however, the application- for efficiently performing partial block writes; for example, visible mnode number (which is returned by the stat system updating a single inode in an inode block does not require call) can change after a reboot. To the best of our knowledge, reading the block from disk in the common case. File data the only application that may rely on stable inode numbers blocks, however, are stored only in the MemFS page caches across reboots is qmail [2]. and not in DiskFS’s buffer cache. Coupling MemFS and DiskFS is an oplog layer. The oplog layer consists of per-core per-mnode logs of modifications to that mnode on a particular core. These logs contain logical mnode hash table directory changes, such as creating a new directory entry, or removing one. The operations are not literally the system ... mfile mfile mdir calls, but rather capture the effect of the ona MemFS specific directory; any results needed to complete thesys- page cache page cache directory tem call are produced by MemFS using the in-memory state. hash table For example, the link system call creates a log entry that specifies the operation (i.e., link), and contains the mnode numbers of the parent directory and the file being linked, hash table of per-core mnode-inode table and the name of the link being created. Each log entry also per-mnode logs mnode# ↔ inode# contains a timestamp reflecting the order of the operation; as

Oplog we describe later, we use synchronized CPU timestamp coun- ts, ts, ts, ts, ... ↔ ... ters for this purpose [5]. When MemFS makes any change op op op op ... ↔ ... to an in-memory directory, it also appends a timestamped log entry to the oplog layer. When MemFS decides to flush changes to disk (e.g., asa buffer cache storing address, address, result of an fsync on a directory), oplog applies the logged block block directory, bitmap, changes to DiskFS, which in turn generates physical journal

DiskFS and inode blocks per-core physical journals entries that are written to disk. The oplog layer allows con- current modifications in MemFS to scale, as long as they do not contend in the in-memory hash table. As a result, con- Disk 1 Disk 2 Disk ...

Disks current modifications to different file names in the samedi- rectory can avoid cache-line conflicts, because these changes Figure 1: An overview of ScaleFS. are added to different per-core logs. In contrast, in ext4, di- rectory changes immediately update an in-memory copy Figure 1 presents an overview of ScaleFS’s design. At of the on-disk directory representation, which can incur the top of the figure, MemFS maintains an in-memory cache cache-line conflicts. When ScaleFS finally flushes changes in a hash table indexed by mnode number. The hash table to disk, it incurs cache-line conflicts in merging per-core contains two types of mnodes: mfiles (for files) and mdirs oplogs (according to the timestamps of the log entries), but (for directories). Each mfile has a page cache, storing the this is inevitable, because changes from other cores must be file’s blocks in memory. This cache is implemented usinga processed on the core executing the fsync. Furthermore, the radix array [9], which allows for scalable access to file pages cost of the merge is amortized across all of the operations from many cores. The page cache can be sparsely populated, collected by the fsync. filling in pages on demand from disk. The page cache also The oplog layer is crucial for ensuring integrity of inter- keeps track of dirty bits for file pages. nal file system data structures (directories and link counts) Directory mnodes (mdirs) maintain an in-memory hash when flushing directory changes to disk. In the absence of table mapping file names to their mnode numbers. The hash oplog, MemFS would need to scan the in-memory hash table table index is lock-free, but each hash table entry has a per- of a directory to determine what changes must be made to entry lock. The use of a hash table allows for concurrent the on-disk directory state. To achieve scalability, we would lookups and for modifications of different file names from like for fsync (and thus this scan) to run concurrently with different cores without cache line contention10 [ ], and the other directory modifications. As a result, the scan might per-entry lock serializes updates to individual names in a miss some files altogether that were renamed within a direc- directory. tory, even though the file was never deleted. Flushing these

5 changes might cause the file to be deleted on disk, andbe mfiles and all mnode oplogs, and flushing them to disk byin- lost after a crash. Cross-directory renames further compli- voking fsync. One optimization in this case is that ScaleFS cate matters in the absence of oplog, because MemFS would combines the changes into a single physical transaction in need to obtain a consistent in-memory snapshot of multiple DiskFS (to the extent allowed by the maximum size of the concurrent hash tables. With oplog, MemFS never needs to journal). scan in-memory directory hash tables. readdir. When an application lists the contents of a direc- The above rationale also explains ScaleFS’s choice of tory, MemFS simply enumerates the in-memory hash table. If using oplog for directory changes but not file changes. If an the directory is not present in the mnode hash table, MemFS application concurrently modifies a file while calling fsync looks up the directory’s inode number and reads the on-disk on it, ScaleFS does not specify which of the concurrent representation from DiskFS. MemFS then builds up an mdir changes will be flushed to disk, and might flush concurrent representation of the directory, including the in-memory changes out of order. Most importantly, regardless of which hash table. MemFS also translates the inode numbers of the concurrent file changes are flushed to disk, there is norisk files in that directory, which appear in the on-disk directory of ending up with a corrupted on-disk file system. format, into mnode numbers, which will appear in the mdir To provide better intuition of how the pieces of ScaleFS format. MemFS performs this translation with the help of fit together, the rest of this section gives several example the mnode-inode table. Any inodes that do not have corre- system calls and walks through how those system calls are sponding mnode numbers yet (i.e., have not been accessed implemented in ScaleFS. since boot) get a fresh mnode number allocated to them; this is why mnode numbers can change across a reboot. File creation. When an application creates a file, MemFS allocates a fresh mnode number for that file, allocates an Reading a file. When an application reads a file, MemFS mfile structure for the file, adds the mfile to the mnode hash looks up the corresponding page in the mfile’s page cache, table, and adds an entry to the directory’s hash table. MemFS and returns its contents. If the page is not present, MemFS also logs a logical operation to the directory’s oplog, contain- looks up the corresponding inode number and asks DiskFS ing the new file name and the corresponding mnode number. to read in the data for that inode number from disk. Many cores can perform file creation concurrently without cache-line conflicts, even when creating files in the same Writing to a file. Similarly, when an application writes to directory (as long as the file names differ). This is because a file, MemFS updates the page cache (possibly paging in mnode numbers can be allocated with no cross-core com- the page from DiskFS if the application performs a partial munication; the mnode hash table and the directory hash page write on a missing page). MemFS also marks the page fsync tables can be updated concurrently; and the directory oplog as dirty, so that it will be picked up by a subsequent . consists of per-core logs that avoid cache line sharing. Finally, if the write extended the length of the file, MemFS adjusts the mfile’s length field in memory. fsync. 
When an application calls fsync on a directory, Crash recovery. After a crash, DiskFS recovers the on- MemFS combines the log entries from all per-core logs for disk file system state by replaying the on-disk journals, after the directory’s mnode. MemFS then sends the changes in sorting the journal entries by timestamps (since there are timestamp-sorted order to DiskFS, which makes the changes multiple journals on disk, corresponding to DiskFS’s per- durable using a physical on-disk journal. For any newly cre- core journals). One subtle issue is orphan inodes: these can ated mnodes that do not yet have inode counterparts, MemFS result from an application calling fsync on a file, and the sys- also allocates on-disk inodes and adds the corresponding tu- tem crashing before the corresponding directory change was ple to the mnode-inode table. flushed to disk (or, conversely, an application that unlinks a When an application invokes fsync on a file, MemFS scans file but retains an open to it). Our prototype the file’s page cache for any dirty pages, and writes these DiskFS scans the inodes at boot and frees any orphan inodes changes to DiskFS. MemFS also compares the in-memory (allocated inodes with a zero link count). and on-disk file length, and adjusts the on-disk file length if they differ. If there is no inode number corresponding to 5 DESIGN the mfile being fsynced, MemFS allocates it first (as above), ScaleFS’s design faces two challenges—performance and and then updates the newly allocated on-disk file. Note that correctness—and the rest of this section describes how this on-disk inode would have a zero link count, and would ScaleFS achieves both. We explicitly mark the aspects of the not appear in any directory, until the application calls fsync design related to either performance or correctness with [P] on some directory containing the file (or a background flush and [C] respectively. does the same). 5.1 Making operations orderable [P, C] Background flush. ScaleFS periodically flushes in- To ensure crash consistency, operations must be added to the memory changes to disk by invoking sync. ScaleFS log in the order that MemFS applied the operations. Achiev- implements the sync system call by iterating over all dirty ing this property while maintaining conflict freedom is dif-

6 ficult, because MemFS uses lock-free reads, which in turn T1). Worse yet, this inconsistency persists on disk until all makes it difficult to determine the order in which different file names in question are deleted by the application—it is operations ran. For example, consider a strawman that logs not just an ephemeral inconsistency that arises at specific each operation while holding all the locks that MemFS ac- crash points. The reason behind this discrepancy is that reads quires during that operation, and then releases the locks after are not guarded by locks. However if we were to acquire the operation has been logged. Now consider the scenario read locks, ScaleFS would sacrifice conflict freedom of in- in Figure 2. Directory dir1 contains three file names: a, b memory commutative operations. and c. a is a link to inode 1, b and c are both links to inode 2. We address this problem by observing that modern pro- Thread T1 issues the syscall rename(b, c) while thread T2 cessors provide synchronized timestamp counters across all issues rename(a, c). cores. This observation is at the center of oplog’s design [5: As an aside, although this may seem like a corner case, §6.1], which ScaleFS builds on. On x86 processors, times- our goal is to achieve perfect scalability without having to tamp counters can be accessed using the RDTSCP instruction guess what operations might or might not matter to a given (which ensures the timestamp read is not re-ordered by the workload. We set this goal because sequences of system processor). calls that appear to be a corner case sometimes show up in Building on this observation, ScaleFS orders in-memory real applications and turn out to matter for overall applica- directory operations by requiring that all directory modi- tion performance, and it is difficult to determine in advance fications in the in-memory file system be linearizable [19]. whether a particular seemingly corner-case situation will or Moreover, ScaleFS makes the linearization order explicit will not show up in practice. by reading the timestamp counter at the appropriate lin- earization point, which records the order in which MemFS fsync Directory dir1 applied these operations. This subsequently allows to order the operations by their timestamps, in the same order that they were applied to MemFS, without incurring directory entry a inode 1 any additional space overhead.

In the above directory entry b Timestamping lock-free reads [P, C]. example, the linearization point of rename(b, c) must inode 2 be before the linearization point of rename(a, c) be- directory entry c cause rename(b, c)’s lock-free read observes c before rename(a, c)’s write modifies c. There is no explicit syn- Figure 2: Data structures involved in a concurrent execution chronization between these operations, yet MemFS must of rename(a,c) (by thread T2) and rename(b,c) (by thread correctly order their linearization points. T1). Arrows depict a directory entry referring to an inode. To order lock-free reads with writes, MemFS protects such There are three directory entries in a single directory, two read operations with a seqlock [24: §6], and ensures that of which are hard links to the same inode. such read operations happen before any writes in the same operation. With a seqlock, a writer maintains a sequence Returning to the example, assume T1 goes first. While number associated with the shared data. Writers update this T1 and T2 do not commute, to ensure conflict freedom of sequence number both before and after they modify the operations that would commute with T1, T1 performs a lock- shared data. Readers read the sequence number before and free read of b and c to determine that both are hardlinks after reading the shared data. If the sequence numbers are to the same inode. Thus, all it needs to do is remove the different, or the sequence number is odd (indicating awriter directory entry for b and decrement the inode’s reference is in the process of modifying), then the reader assumes count, which it can do without holding a lock by using a that a writer has changed the data while it was being read. distributed reference counter like Refcache [9]. The only lock In that case a reader simply retries until the reader reads T1 acquires is on b, since it does not modify c at all. In this the same even sequence number before and after. In the case, T1’s rename(b, c) can now complete without having normal case, when a writer does not interfere with a reader, to modify any cache lines (such as a lock) for c. a seqlock does not incur any additional cache-line movement. T2 then acquires locks on a and c and performs its rename. By performing all reads before any writes, MemFS ensures Both of the threads now proceed to log the operations while that it is safe to retry the reads. holding their respective locks. Because the locks they hold To determine the timestamp of the linearization point are disjoint, T2 might end up getting logged before T1. Now for an operation that includes a lock-free read, MemFS uses when the log is applied to the disk on a subsequent fsync, a seqlock around that read operation (such as a lock-free the disk version becomes inconsistent with what the user hash table lookup). Inside of the seqlock-protected region, believes the file system state is. Even though MemFS has c MemFS both performs the lock-free read, and reads the times- pointing to inode 1 (as a result of T1 executing before T2), tamp counter. If the seqlock does not succeed, MemFS retries, DiskFS would have c pointing to inode 2 (T2 executing before which produces a new timestamp. The timestamp of the last

7 retry corresponds to the linearization point. This scheme filesystem would introduce a bottleneck when merging the ensures that ScaleFS correctly orders operations in its log, operations, thus limiting the scalability of concurrent fsyncs since the timestamp falls within a time range when the read even when they are invoked on different files and directories. value was stable, and thus represents the linearization point. We solve this problem by using a set of per-core logs for ev- This scheme also achieves scalability because it allows read- ery mnode (file or directory). Per-mnode logs allow ScaleFS only operations to avoid modifying shared cache lines. to merge the operations for that mnode on an fsync with- out conflicting with concurrent operations on other mnodes. 5.2 Merging operations [C] ScaleFS uses oplog’s lazy allocation of per-core logs to avoid The timestamps of the linearization points allow operations allocating a large number of logs that are never used [5]. executed by different cores to be merged in a linear order, but the merge must be done with care. ScaleFS’s oplog main- 5.3 Flushing an operation log [P, C] tains per-core logs so that cores can add entries without ScaleFS’s per-mnode operation logs allow fsync to effi- communicating with other cores, and merges the per-core ciently locate the set of operations that modified a given file fsync sync logs when an or a is invoked. This ensures that or directory, and then flush these changes to disk. There are commutative operations do not conflict because of append- three interesting complications in this process. First, for per- ing log entries. The challenge in using an oplog in ScaleFS formance, ScaleFS should perform absorption when flushing is that the merge is subtle. Even though timestamps in each multiple related operations to disk [P]. Second, as we dis- core’s log grow monotonically, the same does not hold for cussed in §3, ScaleFS’s fsync specification requires special the merge, as illustrated by the example shown in Figure 3. handling of cross-directory rename to avoid losing files [C]. Finally, ScaleFS must ensure that the on-disk file system Core 1 op1 state is internally consistent, and in particular, that it does LP1 log op1 not contain any orphan directories, directory loops, links to uninitialized files, etc. [C] Core 2 op2 In flushing the operations to disk, ScaleFS first computes LP2 log op2 the set of operations that must be made persistent, and then produces a physical journal representing these operations. Merge per-core logs Since the operations may exceed the size of the on-disk jour- time nal, ScaleFS may need to flush these operations in several Figure 3: Two concurrent operations, op1 and op2, running batches. In this case, ScaleFS orders the operations by their on cores 1 and 2 respectively. The grayed-out box denotes timestamps, and starts writing operations from oldest to the start and end time of an operation. LP denotes the lin- newest. By design, each operation is guaranteed to fit in earization point of each operation. “log” denotes the time the journal (the journal must be larger than the maximum at which each operation is inserted into that core’s log. The operation size, and if the journal is too full, its contents are dotted line indicates the time at which the per-core logs are applied and the journal truncated to make space for the new merged. operation), so ScaleFS can always make progress by writing at least one operation at a time to the journal. 
Furthermore, Figure 3 shows the execution of two operations on two since an operation represents a consistent application-level different cores: op1 on core 1 and op2 on core 2. The lineariza- change to the file system, it is always safe to flush an opera- tion point of op1 is before the linearization point of op2, but tion on its own (as long as it is flushed in the correct order op2 is logged first. If the per-core logs happen to be merged with respect to other operations). before op1 is logged, at the time indicated by the dotted line Absorption [P]. Once fsync computes the set of opera- in the figure, op1 will be missing from the merged log even tions to be written to disk in a single batch, it removes op- though its linearization point was before that of op2. erations that logically cancel each other out. For example, To avoid this problem, the merge must wait for in-progress suppose the application invokes fsync on a directory, and operations to complete. ScaleFS achieves this by tracking, there was a file created and then deleted in that directory, for each core, whether an operation is currently executing, with no intervening link or rename operations on that file and if so, what its starting timestamp is. To merge, ScaleFS name. In this case, these two operations cancel each other, first obtains the timestamp corresponding to the start ofthe and it is safe for fsync to make no changes to the contain- merge, and then waits if any core is running an operation ing directory (in our prototype, we ignore modification and with a timestamp less than that of the merge start. access times), reducing the amount of disk I/O needed for Concurrent fsyncs [P]. Per-core operation logs allow dif- fsync. ferent CPUs to log operations in parallel without incurring One situation where this shows up often is the use of cache conflicts. However, an fsync must merge the logged temporary files that are created and deleted before fsync is operations first before proceeding further. This means that ever called. One subtle issue is that some process can still having a single set of per-core operation logs for the entire hold an open file descriptor for the non-existant file. Our

8 design deals with this by remembering that the in-memory they were involved in a cross-directory rename. However, file has no corresponding on-disk representation, and never flushing just D and A leads to a directory loop on disk, as will. This allows our design to ignore any fsync operations Figure 4 illustrates, because changes to B (namely, moving C on an orphaned file’s file descriptor. from B to the root directory) are not flushed.

Cross-directory rename [C]. Recall from §3 that ScaleFS needs to avoid losing a file if that file was moved out of some directory d and then d was flushed to disk. To achieve this goal, ScaleFS implements dependency tracking. Specifically, when ScaleFS encounters a cross-directory rename where the directory being flushed, d, was either the source or the destination directory, it also flushes the set of operations on the other directory in question (the destination or source of the rename, respectively). This requires ScaleFS to access the other directory’s operation log as well. As an optimiza- tion, ScaleFS does not flush all of the changes to the other directory; it flushes only up to the timestamp of the rename in question. This optimizes for flushing the minimal set of changes for a given system call, but for some workloads, flushing all changes to the other directory may give more opportunities for absorption. ScaleFS avoids loops in de- pendency tracking by relying on timestamps: a dependency points not just to a particular mnode, but to an mnode at a specific timestamp. Since timestamps increase monotoni- Figure 4: A sequence of operations that leads to a directory cally, there can be no loops. cycle (B-C-D-B) on disk with a naïve implementation of fsync. The state of the file system evolves as an application Internal consistency [C]. ScaleFS must guarantee that issues system calls. Large arrows show the application’s its own data structures are intact after a crash. There are system calls. Large rectangles represent the logical state of three cases that ScaleFS must consider to maintain crash the file system, either in-memory (shaded) or on disk (thick safety for its internal data structures. border). The initial state of the file system, on the left, is First, ScaleFS must ensure that directory links point to present both on disk and in memory at the start of this initialized inodes on disk. This invariant could be violated if example. Circles represent directories in a file system tree. a directory containing a link to a newly created file is flushed. ScaleFS must ensure that the file is flushed first, before a link to it is created. When flushing a directory, ScaleFS To avoid directory loops on disk, ScaleFS follows three also flushes the allocation of new files linked into thatdi- rules. First, in-memory renames of a subdirectory between rectory. In the common case, both are flushed together in a two different parent directories are serialized with a global single DiskFS transaction. However, in case ScaleFS builds lock. Although the global lock is undesirable from a scalabil- up a large transaction that does not fit in the journal (e.g., ity point of view, it allows for a simple and efficient algorithm fsyncing a directory containing many new files), ScaleFS that prevents loops in memory, and also helps avoid loops must break up the changes into multiple DiskFS transac- on disk in combination with the next two rules. Furthermore, tions. In this case, it is crucial that ScaleFS performs the file we expect this lock to be rarely contended, since we expect creations before linking the files into the directory. applications to not rename subdirectories between differ- Second, ScaleFS must ensure that there are no orphan ent parent directories in their critical path. Second, when directories on disk. 
This invariant could be violated when a subdirectory is moved into a parent directory d, ScaleFS an application recursively deletes a directory, and then calls records the path from d to the root of the file system. This fsync on the parent. ScaleFS uses dependencies to prevent list of ancestors is well defined, because ScaleFS is holding a this problem, by ensuring that all deletion operations on the global lock that prevents other directory moves. Third, when subdirectory are flushed before it is deleted from its parent. flushing changes to some directory d, if a child subdirectory Finally, ScaleFS must ensure that the on-disk file system was moved into d, ScaleFS first flushes to disk any changes does not contain any loops after a crash. This is a subtle to the ancestors of d as recorded by that rename operation. In- issue, which we explain by example. Consider the sequence tuitively, this ensures that the child being moved into d does of operations shown in Figure 4. With all of the above checks not also appear to be d’s parent. As an optimization, ScaleFS (including rename dependencies), if an application issues flushes ancestor changes only up to the timestamp ofthe the two mv commands shown in Figure 4 and then invokes rename operation. A thesis describing ScaleFS presents an fsync(D), the fsync would flush changes to D and A, because argument for the correctness of this algorithm [3: §A].

9 5.4 Multiple disks and journals [P] Since MemFS is decoupled from DiskFS, it is not restricted On a with multiple disks or multiple I/O queues in the choice of data structures that allow for scalability. All to a single disk (as in NVMe), ScaleFS should be able to that MemFS must do is log operations with a linearization take advantage of the aggregate disk throughput and par- timestamp in the operation log, which is a per-core data allel I/O queues by flushing in parallel to multiple journals. structure. As a result, commutative file system operations This would allow two cores running fsync to execute com- should run conflict-free when manipulating in-memory files pletely in parallel, not contending on the disk controller or and directories and while logging those operations in the bottlenecking on the same disk’s I/O performance. operation log. fsync is also conflict-free for file system opera- To take advantage of multiple disks for file data, ScaleFS tions it commutes with (e.g., creation of a file in an unrelated stripes the DiskFS data across all of the physical disks in the directory). system. ScaleFS also allocates a separate on-disk journal Second, ScaleFS ensures crash safety through depen- for every core, to take advantage of multiple I/O queues, dency tracking and loop avoidance protocols described above. and spreads these journals across disks to take advantage Third, ScaleFS flushes close to the minimal amount of datato of multiple physical disks. The challenges in doing so are disk on every fsync operation. In most cases, fsync flushes constructing and flushing the journal entries in parallel, even just the changes to the file or directory in question. For when there may be dependencies between the blocks being cross-directory renames, ScaleFS flushes operations on other updated by different cores. directories involved in the rename, and also on ancestor di- To construct the journal entries in parallel, ScaleFS uses rectories in case of a subdirectory rename. Although these two-phase locking on the physical disk blocks. This ensures can be unnecessary in some cases, the evaluation shows that that, if two cores are flushing transactions that modify the in practice ScaleFS achieves high throughput for fsync (and same block, they are ordered appropriately. Two-phase lock- ScaleFS flushes less data than the ext4 file system on Linux). ing ensures that the result is serializable. Finally, ScaleFS achieves good disk throughput through Dependencies between transactions also show up when its optimizations (such as absorption and group commit). flushing the resulting journal entries to disk. Suppose two ScaleFS avoids flushing unnecessary changes to disk when transactions, T1 and T2, update the same block. Two-phase applications invoke fsync, which in turn reduces the amount locking will ensure that their journal entries are computed of data written to disk, and also enables better absorption in a serializable fashion. However, even if T1 goes first with and grouping. Finally, ScaleFS takes advantage of multiple two-phase locking, T2 may end up being flushed to disk first. disks to further improve disk throughput. If the computer crashes at this point, T2 will be recovered but T1 will not be. This will result in a corrupted file system. 6 IMPLEMENTATION To address this problem, ScaleFS uses timestamps to de- We implemented ScaleFS in sv6, a research fer flushing dependent journal entries to disk. 
Specifically, whose design is centered around the Scalable Commutativity ScaleFS maintains an in-memory hash table recording, for Rule [10]. MemFS is based on sv6’s in-memory file system, each disk block, what physical disk contains the last journal modifying it to interact with the operation log. ScaleFS entry updating it, and the timestamp of that journal entry. augments sv6’s design with a disk file system DiskFS (that When ScaleFS is about to flush a journal entry to a physi- is based on the xv6 file system12 [ ]), and an operation cal disk, it looks up all of the disk blocks from that journal log. Figure 5 shows the number of lines of code involved entry in the hash table, and, for each one, ensures that its in implementing each component of ScaleFS. ScaleFS is dependencies are met. This means waiting for the physical open-source and available at https://github.com/mit- disk indicated in the hash table to flush journal entries up to pdos/scalefs. the relevant timestamp. When ScaleFS finishes flushing a journal entry to disk (i.e., the disk returns success from a bar- ScaleFS component Lines of C++ code rier command), ScaleFS updates an in-memory timestamp to reflect that this journal’s timestamp has made it todisk. MemFS (§6.1) 2,458 An alternative approach that avoids waiting would be DiskFS (§6.2) 2,331 to explicitly include the dependency information in the on- MemFS–DiskFS interface 4,094 disk journal. On recovery, DiskFS would need to check that all dependencies of a transaction have been satisfied before Figure 5: Lines of C++ code for each component of ScaleFS. applying that transaction. In our example, this would prevent T2 from being applied, because its on-disk dependency list ScaleFS does not support the full set of file system calls in includes T1, and T1 was not found during recovery. POSIX, such as sendfile and splice; ScaleFS does not sup- port triply indirect blocks (limiting file sizes); and ScaleFS 5.5 Discussion does not evict file system caches in response to memory The design described above achieves ScaleFS’s three goals. pressure. However, ScaleFS does support the major op- First, ScaleFS achieves good multicore scalability, because erations needed by applications, specifically creat, open, operations that commute should be mostly conflict-free. openat, mkdir, mkdirat, mknod, dup, dup2, lseek, read, pread,

6.1 MemFS

Decoupling the in-memory and the on-disk file systems allows MemFS to be designed for multicore concurrency. MemFS represents directories using chained hash tables that map file/directory names to mnode numbers. Each bucket in the hash table has its own lock, allowing commutative operations to remain conflict-free with high probability. Files use radix arrays to represent their pages, with each page protected by its own lock; this design follows RadixVM's representation of virtual memory areas [9]. MemFS creates the in-memory directory tree on demand, reading in components from the disk as they are needed.

To allow concurrent creates to scale, mnode numbers in MemFS are independent of inode numbers on the disk and might even change each time the file system is mounted (e.g., after a reboot). MemFS assigns mnode numbers in a scalable manner by maintaining per-core mnode counts. On creation of a new mnode by a core, MemFS computes its mnode number by combining the mnode count on that core with the core's CPU ID, and then increments the core's mnode count. MemFS never reuses mnode numbers. However, if an mnode is evicted to disk and then brought back into memory, ScaleFS preserves the mnode number, using the mnode-inode table.

ScaleFS maintains a hash table from mnode numbers to inode numbers and vice versa for fast lookup. ScaleFS creates this hash table when the system boots up, and adds entries to it each time a new inode is read in from the disk and a corresponding mnode created in memory. Similarly, ScaleFS adds entries to the hash table when an inode is created on disk corresponding to a new mnode in memory. ScaleFS does not update the hash table immediately after the creation of an mnode; it waits for DiskFS to create the corresponding inode on a sync or an fsync call, which is when DiskFS looks up the on-disk free inode list to find a free inode. This means that MemFS does not allocate an on-disk inode number right away for a create operation.

Operations in the log. If MemFS were to log all file system operations, the operation log would incur a large space overhead. For example, writes to a file would need to store the entire byte sequence that was written to the file. So MemFS logs operations selectively.

By omitting certain operations from the log, ScaleFS not only saves space that would otherwise be wasted to log them, but also simplifies merging the per-core logs and the dependency tracking described in §5.3. As a result, the per-core logs can be merged and the operations applied to the disk much faster.

To determine which operations are logged, MemFS divides all metadata into two categories: oplogged and non-oplogged. All directory state is oplogged metadata, and a file's link count, which is affected by directory operations, is also oplogged metadata. However, other file metadata, such as the file's length, its modification times, etc., is not oplogged. For example, MemFS logs the creation of a file, since it modifies directory state as well as a file's link count. On the other hand, MemFS does not log a change in a file's length, or a write to a file, since these affect only non-oplogged file data and metadata.

When flushing a file from MemFS to DiskFS, ScaleFS must combine the oplogged and non-oplogged metadata of the flushed mnode. It updates the oplogged metadata of the mnode by merging that mnode's oplog. The non-oplogged parts of the mnode (such as a file's length and data contents) are directly accessed by reading the current in-memory state of that mnode. These changes are then written to disk using DiskFS, which is responsible for correctly representing these changes on disk, including allocating disk blocks, updating inode block pointers, etc. DiskFS journals all metadata (both metadata that was oplogged and non-oplogged file metadata such as the file's length), but DiskFS writes file data directly to the file's blocks, which is similar to how Linux ext4 writes data in data=ordered mode.

When MemFS flushes an oplogged change to DiskFS that involves multiple mnodes (such as a cross-directory rename), ScaleFS ensures that this change is flushed to disk in a single DiskFS transaction. This in turn guarantees crash safety: after a crash and reboot, either the rename is applied, or none of its changes appear on disk.

6.2 DiskFS

ScaleFS's DiskFS is based on the xv6 [12] file system, which has a physical journal for crash recovery. The file system follows a simple Unix-like format with inodes, indirect, and double-indirect blocks to store files and directories. DiskFS maintains a buffer cache to cache physical disk blocks that are read in from the disk. DiskFS does not cache file data blocks, since they would be duplicated with any cache maintained by MemFS. However, DiskFS does use the buffer cache to store directory, inode, and bitmap blocks, to speed up read-modify-write operations.

DiskFS implements high-performance per-core allocators for inodes and disk blocks by carving out per-core pools of free inodes and free blocks during initialization. This enables DiskFS to satisfy concurrent requests for free inodes and free blocks from fsyncs in a scalable manner, as long as the per-core pools last. When the per-core free inode or free block pool runs out on a given core, DiskFS satisfies the request by either allocating from a global reserve pool or borrowing free inodes and free blocks from other per-core pools.

These scalability optimizations improve the scalability of fsync, and the scalability of reading data from disk into memory. However, when an application operates on a file or directory that is already present in memory, no DiskFS code is invoked. For instance, when an application looks up a non-existent name in a directory, MemFS caches the entire directory and does not need to invoke namei on DiskFS. Similarly, when an application creates or grows a file, no DiskFS state is updated until fsync is called.
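A minimal sketch of the per-core free-inode pools just described (the free-block pools are analogous); the identifiers and the fallback order shown here are assumptions rather than the actual DiskFS code:

    #include <cstdint>
    #include <mutex>
    #include <optional>
    #include <stdexcept>
    #include <vector>

    constexpr int NCPU = 80;

    struct Pool {
      std::mutex lock;                  // per-pool lock; cores rarely touch another core's pool
      std::vector<uint32_t> free_inums; // filled by carving up the free-inode list at initialization
    };

    Pool percore_pool[NCPU];
    Pool reserve_pool;                  // global fallback pool

    static std::optional<uint32_t> take(Pool& p) {
      std::lock_guard<std::mutex> l(p.lock);
      if (p.free_inums.empty()) return std::nullopt;
      uint32_t inum = p.free_inums.back();
      p.free_inums.pop_back();
      return inum;
    }

    // Allocate an on-disk inode number during fsync/sync, preferring the local pool.
    uint32_t alloc_inode(int cpu) {
      if (auto inum = take(percore_pool[cpu])) return *inum;  // common case: no cross-core sharing
      if (auto inum = take(reserve_pool)) return *inum;       // per-core pool empty: global reserve
      for (int other = 0; other < NCPU; other++)              // last resort: borrow from another core
        if (other != cpu)
          if (auto inum = take(percore_pool[other])) return *inum;
      throw std::runtime_error("no free inodes");
    }

As long as the per-core pools last, allocation touches only core-local state, which is what makes concurrent fsyncs scale.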

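For illustration, the mnode-numbering scheme from §6.1 can be sketched as follows; the bit layout (per-core count in the high bits, CPU ID in the low bits) and the field widths are our assumptions:

    #include <cstdint>

    constexpr unsigned CPU_ID_BITS = 8;          // assumed width; enough for an 80-core machine

    struct PerCoreCounter {
      alignas(64) uint64_t count = 0;            // cache-line aligned: no false sharing between cores
    };

    PerCoreCounter mnode_counts[1u << CPU_ID_BITS];

    // Combine the core's private count with its CPU ID so that no two cores ever
    // produce the same mnode number, without touching any shared cache line.
    uint64_t alloc_mnode_number(unsigned cpu) {
      uint64_t c = mnode_counts[cpu].count++;
      return (c << CPU_ID_BITS) | cpu;
    }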

7 EVALUATION

This section evaluates ScaleFS's overall performance and scalability by answering the following questions:

• Does ScaleFS achieve conflict freedom for commutative operations? (§7.2)
• Does durability introduce unnecessary conflicts in ScaleFS? (§7.3)
• Does ScaleFS empirically achieve good scalability and performance on real hardware? (§7.4)
• Does ScaleFS achieve good disk throughput? (§7.5)
• What overheads are introduced by ScaleFS's split of MemFS and DiskFS? (§7.6)

7.1 Methodology

To measure scalability of ScaleFS, we primarily rely on Commuter [10] to determine if commutative file system operations incur cache conflicts, limiting scalability. This allows us to reason about the scalability of a file system without having to commit to a particular workload or hardware configuration. We disable background sync to obtain repeatable measurements.

To confirm the scalability results reported by Commuter, we also experimentally evaluate the scalability of ScaleFS by running it on an 80-core machine with Intel E7-8870 2.4 GHz CPUs and 256 GB of DRAM, running several workloads. We also measure the absolute performance achieved by ScaleFS, as well as measuring ScaleFS's disk performance, comparing the performance with a RAM disk, with a Seagate Constellation.2 ST9500620NS rotational hard drive, and with up to four Samsung 850 PRO 256 GB SSDs.

To provide a baseline for ScaleFS's scalability and performance results, we compare it to the results achieved by the Linux ext4 file system in Linux kernel version 4.9.21. We use Linux ext4 as a comparison because ext4 is widely used in practice, and because its design is reasonably scalable, as a result of many improvements by kernel developers. Linux ext4 is one of the more scalable file systems in Linux [31].

7.2 Does ScaleFS achieve conflict freedom?

To evaluate ScaleFS's scalability, we used Commuter, a tool that checks if shared data structures experience cache conflicts. Commuter takes as input a model of the system and the operations it exposes, which in our case is the kernel with the system calls it supports, and computes all possible pairs of commutative operations. Then it generates test cases that check if these commutative operations are actually conflict-free by tracking references to shared memory addresses. Shared memory addresses indicate sharing of cache lines and hence loss of scalability, according to the Scalable Commutativity Rule [10]. We augmented the model Commuter uses to generate test cases by adding the fsync and sync system calls, and used the resulting test cases to evaluate the scalability of ScaleFS.

[Figure 6: heatmap over pairs of system calls (fsync, sync, open, link, unlink, rename, stat, fstat, lseek, close, pipe, read, write, pread, pwrite, mmap, munmap, mprotect, memread, memwrite); shaded cells mark commutative pairs that conflict.]

Figure 6: Conflict-freedom of commutative operations in ScaleFS. Numbers in boxes indicate the absolute number of test cases that had conflicts. Out of 31,551 total test cases generated, 31,317 (99.2%) were conflict-free.

Figure 6 shows results obtained by running Commuter with ScaleFS. 99% of the test cases are conflict-free. The green regions show that the implementation of MemFS is conflict-free for almost all commutative file system operations not involving sync and fsync, as there is no interaction with DiskFS involved at this point. MemFS simply logs the operations in per-core logs, which is conflict-free. MemFS also uses concurrent data structures that avoid conflicts.

ScaleFS does have some conflicts when fsync or sync calls are involved. Some of the additional conflicts incurred by fsync are due to dependencies between different files or directories being flushed. Specifically, our Commuter model says that fsync of one inode commutes with changes to any other inode. However, this does not capture the dependencies that must be preserved when flushing changes to disk to ensure that the on-disk state is consistent. As a result, fsync of one inode accesses state related to another inode (such as its oplog and its dirty bits).

Some fsync calls conflict with commutative read operations. These conflicts are brought about by the way MemFS implements the radix array of file pages. In order to save space, the radix array element stores certain flags in the last few bits of the page pointer itself, the page dirty bit being one of them. As a result, an fsync that resets the page dirty bit conflicts with a read that accesses the page pointer.
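The following sketch illustrates why such a tagged pointer causes the conflict: the dirty flag and the page address share one word, so clearing the flag during fsync writes the same cache line that a concurrent read loads. The names and the flag layout are hypothetical:

    #include <atomic>
    #include <cstdint>

    constexpr uintptr_t DIRTY_BIT = 0x1;   // page-aligned addresses leave the low bits free for flags
    constexpr uintptr_t FLAG_MASK = 0xF;

    struct RadixElem {
      std::atomic<uintptr_t> tagged;       // page address with flag bits packed into the low bits

      void* page() const {                 // a reader must load the whole tagged word ...
        return reinterpret_cast<void*>(tagged.load(std::memory_order_acquire) & ~FLAG_MASK);
      }
      void mark_dirty()  { tagged.fetch_or(DIRTY_BIT, std::memory_order_release); }
      void clear_dirty() { tagged.fetch_and(~DIRTY_BIT, std::memory_order_release); } // ... so fsync's clear conflicts with that load
    };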

[Figures 7 and 8: conflict-freedom heatmaps (system call versus system call) for Linux ext4 and for sv6, respectively; see the captions below.]

Figure 7: Conflict-freedom of commutative operations in the Linux kernel using an ext4 file system. Numbers in boxes indicate the absolute number of test cases that had conflicts. Out of 31,551 total test cases generated, 20,394 (65%) were conflict-free.

Figure 8: Conflict-freedom between commutative operations in sv6 with only an in-memory file system. Numbers in boxes indicate the absolute number of test cases that had conflicts. Out of 13,664 total test cases generated, 13,528 (99%) were conflict-free.

MemFS could, in principle, avoid these conflicts by keeping track of the page dirty bits outside of the page pointer. But in that case, as long as the dirty flags are stored as bits, there would be conflicts between accesses to dirty bits of distinct pages. In order to provide conflict freedom, MemFS would need to ensure that page dirty flags of different pages do not share a cache line, which would incur a huge space overhead. Our implementation of MemFS makes this trade-off, saving space at the expense of incurring conflicts in some cases.

sync conflicts only with itself, and its only conflict is in acquiring the locks to flush per-core logs. Although we could, in principle, optimize this by using lock-free data structures, we do not believe that applications would benefit from concurrent calls to sync being conflict-free. Note that concurrent calls to sync and every other system call are conflict-free.

The rest of the conflicts are between idempotent operations. Two fsync calls are commutative because they are idempotent, but they both contend on the operation log as well as the file pages. fsync and pwrite also conflict despite being commutative when pwrite performs an idempotent update.

To provide a baseline for ScaleFS's heatmap, we also ran Commuter on the Linux kernel with an ext4 file system.² The heatmap in Figure 7 shows the results obtained. Out of a total of 31,551 commutative test cases, the heatmap shows 11,157 of them (35%) conflicting in the Linux kernel. Many of these can be attributed to the fact that the in-memory representation of the file system is closely coupled with the disk representation. Some commutative operations conflict in the page cache layer. These results are in line with the bottlenecks uncovered in prior work [4, 31]. Manually analyzing the source of some of these bottlenecks indicates that they are the result of maintaining the on-disk data structure while executing in-memory operations. For example, when creating files in a directory, multiple cores contend on the lock protecting that directory, which is necessary to serialize updates to the on-disk directory structure. This is precisely the problem that ScaleFS's split design avoids.

²We ran Commuter on Linux kernel version v3.16 (from 2014) because we have not ported the Linux changes necessary to run Commuter to a more recent version of the kernel. We expect that the results are not significantly different from those that would be obtained on a recent version of Linux, based on recent reports of Linux file system scalability [31].

7.3 Does durability reduce conflict freedom?

To evaluate the scalability impact of ScaleFS's approach for achieving durability, we compare the results of Commuter on ScaleFS to the results of running Commuter on the original sv6 in-memory file system, on which MemFS is based. Figure 8 shows results for sv6; this heatmap does not have a column for fsync or sync because sv6 did not support these system calls (it had no support for durability). Compared to Figure 6, we see that ScaleFS introduces almost no new conflicts between existing syscalls. The exceptions have to do with conflicts in various bitmaps, such as pread conflicting with pwrite on the bits keeping track of the state of a page in a file (including its dirty bit). Although ScaleFS could avoid these cache-line conflicts, its absolute performance and memory cost would be worse.

7.4 Empirical scalability

To confirm that Commuter's results translate into scalability for actual workloads on real hardware, we compared the performance of several workloads on ScaleFS to their performance on Linux ext4; unfortunately, porting applications to sv6 is difficult because sv6 is not a full-featured OS. To avoid the disk bottleneck, we ran these experiments on a RAM disk. Both ScaleFS and ext4 still performed journaling, flushing, etc., as they would on a real disk. §7.5 presents the performance of both file systems with real disks.

[Figure 9: four plots of throughput (relative to a single core) versus number of cores, for mailbench (ScaleFS -p and -s variants), dbench, largefile, and smallfile.]

Figure 9: Throughput of the mailbench, dbench, largefile, and smallfile workloads respectively, for both ScaleFS and Linux ext4 in a RAM disk configuration.

mailbench. One of our workloads is mailbench, a qmail-like mail server benchmark from sv6 [10]. The version of mailbench used by sv6 focused on in-memory file system scalability. To make this workload more realistic, we added fsync calls to the benchmark, both for message files and for the containing directories, to ensure mail messages are queued and delivered in a crash-safe manner (6 fsyncs total to deliver one message). The benchmark measures the number of mail messages delivered per second. We run mailbench with per-core spools, and with either per-core user mailboxes (mailbench-p) or with 1,000 shared user mailboxes (mailbench-s). mailbench-s in particular requires all cores to read and write the same set of shared mailboxes; this workload would be difficult to scale on a file system that partitioned directories across cores (such as Hare or SpanFS).

dbench. The dbench benchmark [41] generates an I/O workload intended to stress file systems and file servers. This workload is more read-heavy than mailbench: the configuration used in our experiments performs a mix of 124,199 reads and 39,502 writes (in a loop), as well as other operations (79,230 creates, 16,000 unlinks, 5,553 fsyncs, and 3,355 renames). We run the benchmark with the --sync-dir flag to call fsync after unlink and rename operations. This configuration stresses both MemFS and DiskFS. The dbench benchmark was used by FxMark [31] to demonstrate scalability bottlenecks in Linux file systems.

largefile. Inspired by the LFS benchmarks [39], largefile creates a 100 MByte file and calls fsync after creating it. Each core runs a separate copy of the benchmark, creating and fsyncing its own 100 MByte file. All of the files are in the same directory. We report the combined throughput achieved by all cores, in MB/sec.

smallfile. The smallfile microbenchmark creates a new file, writes 1 KByte to it, fsyncs the file, and deletes the file, repeated 10,000 times (for different file names). Each core runs a separate copy of the smallfile benchmark, each of which performs 10,000 iterations. The files are spread among 100 directories that are shared across all cores. We report the combined throughput achieved by all cores, in files/sec.
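For concreteness, one core's smallfile loop corresponds roughly to the sketch below; the directory and file naming scheme is our assumption:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <string>
    #include <vector>

    // One core's smallfile loop: create, write 1 KByte, fsync, delete, 10,000 times.
    void smallfile_core(int core, int iters = 10000) {
      std::vector<char> buf(1024, 'x');                       // 1 KByte payload
      for (int i = 0; i < iters; i++) {
        // Files are spread across 100 directories shared by all cores.
        std::string path = "d" + std::to_string(i % 100) + "/f-" +
                           std::to_string(core) + "-" + std::to_string(i);
        int fd = open(path.c_str(), O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return; }
        (void)write(fd, buf.data(), buf.size());              // write 1 KByte
        fsync(fd);                                            // force the file to disk
        close(fd);
        unlink(path.c_str());                                 // delete the file
      }
    }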

Results. Figure 9 shows the results. ScaleFS scales well for all of our workloads, achieving approximately 40× performance at 80 cores. ScaleFS does not achieve perfect 80× scalability because going across sockets is more expensive than accessing cache and DRAM within a single socket (which occurs with 1–10 cores), and because multiple cores contend for the same shared L3 cache. Linux ext4 fails to scale for all workloads, achieving no more than 13× the performance of a single core, and collapsing at 80 cores. We run only the mailbench-p variant on Linux since it is more scalable.

7.5 Disk performance

Single disk, single core. To evaluate whether ScaleFS can achieve good disk throughput, we compare the performance of our workloads running on ScaleFS to their performance on Linux ext4. Figure 10 shows the results, running with a single disk and a single CPU core. ScaleFS achieves performance comparable to or better than Linux ext4 in all cases. For the smallfile microbenchmark, ScaleFS achieves higher throughput because ScaleFS's precise fsync design flushes only the file being fsynced. Linux ext4, on the other hand, maintains a single journal, which means that flushing the fsynced file also flushes all other preceding entries in the journal as well, which includes the modification of the parent directory. ScaleFS achieves higher performance than Linux ext4 on the dbench benchmark for the same reason.

    Disk      Benchmark     ScaleFS          Linux ext4
    RAM disk  mailbench-p   641 msg/sec      675 msg/sec
              dbench        317 MB/sec       216 MB/sec
              largefile     331 MB/sec       378 MB/sec
              smallfile     7151 files/sec   3553 files/sec
    SSD       mailbench-p   61 msg/sec       66 msg/sec
              dbench        47 MB/sec        28 MB/sec
              largefile     180 MB/sec       180 MB/sec
              smallfile     364 files/sec    277 files/sec
    HDD       mailbench-p   9 msg/sec        9 msg/sec
              dbench        7 MB/sec         5 MB/sec
              largefile     83 MB/sec        92 MB/sec
              smallfile     51 files/sec     27 files/sec

Figure 10: Performance of workloads on ScaleFS and Linux ext4 on a single disk and a single core.

Multiple disks, 4 cores. Since the disk is a significant bottleneck for both ScaleFS and ext4, we also investigated whether ScaleFS can achieve higher disk throughput with additional physical disks. In this experiment, we striped several SSDs together, and ran either ScaleFS or Linux ext4 on top. To provide sufficient parallelism to take advantage of multiple disks, we ran the workload using 4 CPU cores. Figure 11 shows the results for the four workloads. For the largefile workload, both ScaleFS and Linux cannot obtain more throughput because the SATA controller is saturated. With four disks, ScaleFS achieves more throughput than Linux for mailbench, dbench, and smallfile. This is because when Linux ext4's fsync issues a barrier to the striped disk, the Linux striping device forwards the barrier to every disk in the striped array. ScaleFS is aware of multiple disks, and its fsync issues a barrier only to the disks that have been written to by that fsync.

[Figure 11: four plots of throughput versus number of striped SSDs (1–4), for mailbench-p (messages/sec), dbench (MB/sec), largefile (MB/sec), and smallfile (files/sec).]

Figure 11: Throughput of the mailbench-p, dbench, largefile, and smallfile workloads respectively, for both ScaleFS and Linux ext4, using 4 CPU cores. The x-axis indicates the number of SSDs striped together.
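A sketch of this per-disk barrier rule; the bookkeeping shown here (a bitmap of disks touched by the current fsync) is our simplification, not ScaleFS's actual implementation:

    #include <bitset>

    constexpr int MAX_DISKS = 4;

    // Placeholder for the driver call that issues a cache-flush/barrier to one disk.
    void disk_flush(int disk) { (void)disk; /* device-specific flush command */ }

    struct FsyncFlush {
      std::bitset<MAX_DISKS> touched;    // set while this fsync writes journal entries and data

      void wrote_to(int disk) { touched.set(disk); }

      void issue_barriers() {
        for (int d = 0; d < MAX_DISKS; d++)
          if (touched.test(d))
            disk_flush(d);               // barrier only to the disks this fsync actually wrote
        // A striped block device, in contrast, forwards one barrier to every disk in the array.
      }
    };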

7.6 Overhead of splitting MemFS and DiskFS

The main worry about ScaleFS's split of a separate in-memory MemFS file system and an on-disk DiskFS file system is that ScaleFS might incur higher memory overhead, because it has to keep the operation log in memory and because metadata (e.g., block allocator state, inodes, and directory contents) may be kept in memory twice: once in MemFS's in-memory representation and once in DiskFS's buffer cache. However, file data blocks are not stored twice.

To evaluate how severe this overhead is, we measured the peak memory usage for each benchmark. Figure 12 shows the results. For largefile, there is little metadata, so the overhead is minimal. For smallfile, the overhead is significant because ScaleFS allocates an oplog for each directory and, in our prototype, for each file as well. At 25 KB per oplog, 10,000 oplogs translates into 250 MBytes of memory overhead. The memory overhead for mailbench is also dominated by the oplogs. The mailbench-s configuration in particular creates 1,000 user mailboxes, with each Maildir-format mailbox consisting of four directories (a parent directory and three subdirectories). In the mailbench-p configuration, there are only 80 user mailboxes (one per core), which accounts for the lower memory usage. For dbench, there are relatively fewer files and directories involved, and ScaleFS does not introduce much memory overhead.

    Benchmark     Memory use in MemFS   Memory use in ScaleFS   ScaleFS overhead
    mailbench-p   12 MB                 21 MB                   75%
    mailbench-s   515 MB                770 MB                  49.5%
    dbench        84 MB                 94 MB                   11.9%
    largefile     106 MB                107 MB                  0.9%
    smallfile     296 MB                561 MB                  89.5%

Figure 12: Peak memory use in MemFS and ScaleFS during the execution of different benchmarks.

8 CONCLUSION

It is a challenge to achieve multicore scalability, durability, and crash consistency in a file system. This paper proposes a new design that addresses this challenge using the insight of completely decoupling the in-memory file system from the on-disk file system. The in-memory file system can be optimized for concurrency and the on-disk file system can be tailored for durability and crash consistency. To achieve this decoupling, this paper introduces an operation log that extends OpLog [5] with a novel scheme to timestamp the logged operations at their linearization points in order to apply them to the disk in the same order a user process observed them in memory. The operation log also minimizes the data that must be written out at an fsync by computing dependencies and absorbing operations that cancel each other out. We implemented this design in a prototype file system, ScaleFS, that was built on the existing sv6 kernel, and we analyzed the implementation using Commuter. The results show that the implementation of ScaleFS achieves good multicore scalability.

ACKNOWLEDGMENTS

We thank the anonymous reviewers, and our shepherd, Remzi Arpaci-Dusseau. This research was supported by NSF award CNS-1301934.

REFERENCES

[1] H. Attiya, E. Hillel, and A. Milani. Inherent limitations on disjoint-access parallel implementations of transactional memory. In Proceedings of the 21st Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 69–78, Calgary, Canada, Aug. 2009.

[2] D. J. Bernstein. qmail internals, 1998. http://www.qmail.org/man/misc/INTERNALS.txt.

[3] S. S. Bhat. Designing multicore scalable filesystems with durability and crash consistency. Master's thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, June 2017.

[4] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich. An analysis of Linux scalability to many cores. In Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI), pages 1–16, Vancouver, Canada, Oct. 2010.

[5] S. Boyd-Wickizer, M. F. Kaashoek, R. Morris, and N. Zeldovich. OpLog: a library for scaling update-heavy data structures. Technical Report MIT-CSAIL-TR-2014-019, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, Sept. 2014.

[6] M. Castro and B. Liskov. Practical Byzantine fault tolerance. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation (OSDI), pages 173–186, New Orleans, LA, Feb. 1999.

[7] V. Chidambaram, T. Sharma, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Consistency without ordering. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST), pages 101–116, San Jose, CA, Feb. 2012.

[8] S. Chutani, O. T. Anderson, M. L. Kazar, B. W. Leverett, W. A. Mason, and R. N. Sidebotham. The Episode file system. In Proceedings of the Winter 1992 USENIX Technical Conference, pages 43–59, Jan. 1992.

[9] A. T. Clements, M. F. Kaashoek, and N. Zeldovich. RadixVM: Scalable address spaces for multithreaded applications. In Proceedings of the 8th ACM EuroSys Conference, pages 211–224, Prague, Czech Republic, Apr. 2013.

[10] A. T. Clements, M. F. Kaashoek, N. Zeldovich, R. T. Morris, and E. Kohler. The scalable commutativity rule: Designing scalable software for multicore processors. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP), pages 1–17, Farmington, PA, Nov. 2013.

[11] J. Corbet. Dcache scalability and RCU-walk, Apr. 2012. http://lwn.net/Articles/419811/.

[12] R. Cox, M. F. Kaashoek, and R. T. Morris. Xv6, a simple Unix-like teaching operating system, 2016. http://pdos.csail.mit.edu/6.828/xv6.

[13] M. Curtis-Maury, V. Devadas, V. Fang, and A. Kulkarni. To Waffinity and beyond: A scalable architecture for incremental parallelization of file system code. In Proceedings of the 12th Symposium on Operating Systems Design and Implementation (OSDI), pages 419–434, Savannah, GA, Nov. 2016.

[14] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson. System software for persistent memory. In Proceedings of the 9th ACM EuroSys Conference, Amsterdam, The Netherlands, Apr. 2014.

[15] C. Frost, M. Mammarella, E. Kohler, A. de los Reyes, S. Hovsepian, A. Matsuoka, and L. Zhang. Generalized file system dependencies. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), pages 307–320, Stevenson, WA, Oct. 2007.

[16] G. R. Ganger and Y. N. Patt. Metadata update performance in file systems. In Proceedings of the 1st Symposium on Operating Systems Design and Implementation (OSDI), pages 49–60, Monterey, CA, Nov. 1994.

[17] C. Gruenwald, III, F. Sironi, M. F. Kaashoek, and N. Zeldovich. Hare: a file system for non-cache-coherent multicores. In Proceedings of the 10th ACM EuroSys Conference, Bordeaux, France, Apr. 2015.

[18] R. Hagmann. Reimplementing the Cedar file system using logging and group commit. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (SOSP), pages 155–162, Austin, TX, Nov. 1987.

[19] M. P. Herlihy and J. M. Wing. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, 1990.

[20] A. Israeli and L. Rappoport. Disjoint-access-parallel implementations of strong shared memory primitives. In Proceedings of the 13th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, Los Angeles, CA, Aug. 1994.

[21] M. Jambor, T. Hruby, J. Taus, K. Krchak, and V. Holub. Implementation of a Linux log-structured file system with a garbage collector. ACM SIGOPS Operating Systems Review, 41(1):24–32, Jan. 2007.

[22] J. Kang, B. Zhang, T. Wo, W. Yu, L. Du, S. Ma, and J. Huai. SpanFS: A scalable file system on fast storage devices. In Proceedings of the 2015 USENIX Annual Technical Conference, Santa Clara, CA, July 2015.

[23] Y. Klonatos, M. Marazakis, and A. Bilas. A scaling analysis of Linux I/O performance. Poster presented at EuroSys, 2011. http://eurosys2011.cs.uni-salzburg.at/pdf/postersubmission/eurosys11-poster-klonatos.pdf.

[24] C. Lameter. Effective synchronization on Linux/NUMA systems. In Gelato Conference, May 2005. http://www.lameter.com/gelato2005.pdf.

[25] E. Lee, H. Bahn, and S. H. Noh. Unioning of the buffer cache and journaling layers with non-volatile memory. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST), pages 73–80, San Jose, CA, Feb. 2013.

[26] L. Lu, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and S. Lu. A study of Linux file system evolution. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST), pages 31–44, San Jose, CA, Feb. 2013.

[27] L. Lu, Y. Zhang, T. Do, S. Al-Kiswany, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Physical disentanglement in a container-based file system. In Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI), pages 81–96, Broomfield, CO, Oct. 2014.

[28] Y. Lu, J. Shu, and W. Wang. ReconFS: A reconstructable file system on flash storage. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST), pages 75–88, Santa Clara, CA, Feb. 2014.

[29] A. Mathur, M. Cao, S. Bhattacharya, A. Dilger, A. Tomas, and L. Vivier. The new ext4 filesystem: current status and future plans. In Proceedings of the Linux Symposium, pages 21–34, Ottawa, Canada, June 2007.

[30] P. E. McKenney, D. Sarma, and M. Soni. Scaling dcache with RCU. Linux Journal, 2004(117), Jan. 2004.

[31] C. Min, S. Kashyap, S. Maass, and T. Kim. Understanding manycore scalability of file systems. In Proceedings of the 2016 USENIX Annual Technical Conference, Denver, CO, June 2016.

[32] D. Park and D. Shin. iJournaling: Fine-grained journaling for improving the latency of fsync system call. In Proceedings of the 2017 USENIX Annual Technical Conference, pages 787–798, Santa Clara, CA, July 2017.

[33] S. Park, T. Kelly, and K. Shen. Failure-atomic msync(): A simple and efficient mechanism for preserving the integrity of durable data. In Proceedings of the 8th ACM EuroSys Conference, pages 225–238, Prague, Czech Republic, Apr. 2013.

[34] T. S. Pillai, V. Chidambaram, R. Alagappan, S. Al-Kiswany, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. All file systems are not created equal: On the complexity of crafting crash-consistent applications. In Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI), pages 433–448, Broomfield, CO, Oct. 2014.

[35] T. S. Pillai, R. Alagappan, L. Lu, V. Chidambaram, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Application crash consistency and performance with CCFS. In Proceedings of the 15th USENIX Conference on File and Storage Technologies (FAST), pages 181–196, Santa Clara, CA, Feb.–Mar. 2017.

[36] K. Ren and G. Gibson. TABLEFS: Enhancing metadata efficiency in the local file system. In Proceedings of the 2013 USENIX Annual Technical Conference, pages 145–156, San Jose, CA, June 2013.

[37] D. M. Ritchie and K. Thompson. The UNIX time-sharing system. Communications of the ACM, 17(7):365–375, July 1974.

[38] O. Rodeh, J. Bacik, and C. Mason. BTRFS: The Linux B-tree filesystem. ACM Transactions on Storage, 9(3):9:1–32, Aug. 2013.

[39] M. Rosenblum and J. Ousterhout. The design and implementation of a log-structured file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP), pages 1–15, Pacific Grove, CA, Oct. 1991.

[40] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck. Scalability in the XFS file system. In Proceedings of the 1996 USENIX Annual Technical Conference, San Diego, CA, Jan. 1996.

[41] A. Tridgell and R. Sahlberg. DBENCH, 2013. https://dbench.samba.org/.

[42] J. Xu and S. Swanson. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In Proceedings of the 14th USENIX Conference on File and Storage Technologies (FAST), pages 323–338, Santa Clara, CA, Feb. 2016.

[43] J. Yang, P. Twohey, D. Engler, and M. Musuvathi. eXplode: A lightweight, general system for finding serious storage system errors. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI), pages 131–146, Seattle, WA, Nov. 2006.
