File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution

Abutalib Aghayev, Carnegie Mellon University; Sage Weil, Red Hat, Inc.; Michael Kuchnik, Carnegie Mellon University; Mark Nelson, Red Hat, Inc.; Gregory R. Ganger, Carnegie Mellon University; George Amvrosiadis, Carnegie Mellon University

Abstract

For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed file systems today because it allows them to benefit from the convenience and maturity of battle-tested code. Ceph's experience, however, shows that this comes at a high price. First, developing a zero-overhead transaction mechanism is challenging. Second, metadata performance at the local level can significantly affect performance at the distributed level. Third, supporting emerging storage hardware is painstakingly slow.

Ceph addressed these issues with BlueStore, a new backend designed to run directly on raw storage devices. In only two years since its inception, BlueStore outperformed previous established backends and is adopted by 70% of users in production. By running in user space and fully controlling the I/O stack, it has enabled space-efficient metadata and data checksums, fast overwrites of erasure-coded data, inline compression, decreased performance variability, and avoided a series of performance pitfalls of local file systems. Finally, it makes the adoption of backwards-incompatible storage hardware possible, an important trait in a changing storage landscape that is learning to embrace hardware diversity.

CCS Concepts • Information systems → Distributed storage; • Software and its engineering → File systems management; Software performance.

Keywords Ceph, object storage, distributed file system, storage backend, file system

SOSP '19, October 27–30, 2019, Huntsville, ON, Canada. https://doi.org/10.1145/3341301.3359656

1 Introduction

Distributed file systems operate on a cluster of machines, each assigned one or more roles such as cluster state monitor, metadata server, and storage server. Storage servers, which form the bulk of the machines in the cluster, receive I/O requests over the network and serve them from locally attached storage devices using storage backend software. Sitting in the I/O path, the storage backend plays a key role in the performance of the overall system.

Traditionally distributed file systems have used local file systems, such as ext4 or XFS, directly or through middleware, as the storage backend [29, 34, 37, 41, 74, 84, 93, 98, 101, 102]. This approach has delivered reasonable performance, precluding questions on the suitability of file systems as a distributed storage backend. Several reasons have contributed to the success of file systems as the storage backend. First, they allow delegating the hard problems of data persistence and allocation to well-tested and highly performant code. Second, they offer a familiar interface (POSIX) and abstractions (files, directories). Third, they enable the use of standard tools (ls, find) to explore disk contents.

Ceph [98] is a widely-used, open-source distributed file system that followed this convention for a decade. Hard lessons that the Ceph team learned using several popular file systems led them to question the fitness of file systems as storage backends. This is not surprising in hindsight. Stonebraker, after building the INGRES database for a decade, noted that "operating systems offer all things to all people at much higher overhead" [90]. Similarly, exokernels demonstrated that customizing abstractions to applications results in significantly better performance [31, 50].
In addition to the performance penalty, adopting increasingly diverse storage hardware is becoming a challenge for local file systems, which were originally designed for a single storage medium.

The first contribution of this experience paper is to outline the main reasons behind Ceph's decision to develop BlueStore, a new storage backend deployed directly on raw storage devices. First, it is hard to implement efficient transactions on top of existing file systems. A significant body of work aims to introduce transactions into file systems [39, 63, 65, 72, 77, 80, 87, 106], but none of these approaches have been adopted due to their high performance overhead, limited functionality, interface complexity, or implementation complexity.

The experience of the Ceph team shows that the alternative options, such as leveraging the limited internal transaction mechanism of file systems, implementing Write-Ahead Logging in user space, or using a transactional key-value store, also deliver subpar performance.

Second, the local file system's metadata performance can significantly affect the performance of the distributed file system as a whole. More specifically, a key challenge that the Ceph team faced was enumerating directories with millions of entries fast, and the lack of ordering in the returned result. Both Btrfs- and XFS-based backends suffered from this problem, and directory splitting operations meant to distribute the metadata load were found to clash with file system policies, crippling overall system performance.

At the same time, the rigidity of mature file systems prevents them from adopting emerging storage hardware that abandons the venerable block interface. The history of production file systems shows that on average they take a decade to mature [30, 56, 103, 104]. Once file systems mature, their maintainers tend to be conservative when it comes to making fundamental changes due to the consequences of mistakes. On the other hand, novel storage hardware aimed at data centers introduces backward-incompatible interfaces that require drastic changes. For example, to increase capacity, hard disk drive (HDD) vendors are moving to Shingled Magnetic Recording (SMR) technology [38, 62, 82] that works best with a backward-incompatible zone interface [46]. Similarly, to eliminate the long I/O tail latency in solid state drives (SSDs) caused by the Flash Translation Layer (FTL) [35, 54, 108], vendors are introducing Zoned Namespace (ZNS) SSDs that eliminate the FTL, again exposing the zone interface [9, 26]. Cloud storage providers [58, 73, 109] and storage server vendors [17, 57] are already adapting their private software stacks to use the zoned devices. Distributed file systems, however, are stalled by delays in the adoption of zoned devices in local file systems.

In 2015, the Ceph project started designing and implementing BlueStore, a user space storage backend that stores data directly on raw storage devices, and metadata in a key-value store. By taking full control of the I/O path, BlueStore has been able to efficiently implement full data checksums, inline compression, and fast overwrites of erasure-coded data, while also improving performance on common customer workloads. In 2017, after just two years of development, BlueStore became the default production storage backend in Ceph. A 2018 survey among Ceph users shows that 70% use BlueStore in production with hundreds of petabytes in deployed capacity [60]. As a second contribution, this paper introduces the design of BlueStore, the challenges its design overcomes, and opportunities for future improvements. Novelties of BlueStore include (1) storing low-level file system metadata, such as extent bitmaps, in a key-value store, thereby avoiding on-disk format changes and reducing implementation complexity; (2) optimizing clone operations and minimizing the overhead of the resulting reference-counting through careful interface design; (3) BlueFS—a user space file system that enables RocksDB to run faster on raw storage devices; and (4) a space allocator with a fixed 35 MiB memory usage per terabyte of disk space.

In addition to the above contributions, we perform several experiments that evaluate the improvement of design changes from Ceph's previous production backend, FileStore, to BlueStore. We experimentally measure the performance effect of issues such as the overhead of journaling file systems, double writes to the journal, inefficient directory splitting, and update-in-place mechanisms (as opposed to copy-on-write).

2 Background

This section aims to highlight the role of distributed storage backends and the features that are essential for building an efficient distributed file system (§ 2.1). We provide a brief overview of Ceph's architecture (§ 2.2) and the evolution of Ceph's storage backend over the last decade (§ 2.3), introducing terms that will be used throughout the paper.

2.1 Essentials of Distributed Storage Backends

Distributed file systems aggregate storage space from multiple physical machines into a single unified data store that offers high-bandwidth and parallel I/O, horizontal scalability, fault tolerance, and strong consistency. While distributed file systems may be designed differently and use unique terms to refer to the machines managing data placement on physical media, the storage backend is usually defined as the software module directly managing the storage device attached to physical machines. For example, Lustre's Object Storage Servers (OSSs) store data on Object Storage Targets (OSTs) [102], GlusterFS's Nodes store data on Bricks [74], and Ceph's Nodes store data on Object Storage Devices (OSDs) [98]. In these and other systems, the storage backend is the software module that manages space on disks (OSTs, Bricks, OSDs) attached to physical machines (OSSs, Nodes).

Widely-used distributed file systems such as Lustre [102], GlusterFS [74], OrangeFS [29], BeeGFS [93], XtreemFS [41], and (until recently) Ceph [98] rely on general local file systems, such as ext4 and XFS, to implement their storage backends. While different systems require different features from a storage backend, two of these features, (1) efficient transactions and (2) fast metadata operations, appear to be common; another emerging requirement is (3) support for novel, backward-incompatible storage hardware.

Transaction support in the storage backend simplifies implementing the strong consistency that many distributed file systems provide [41, 74, 98, 102]. A storage backend can seamlessly provide transactions if the backing file system already supports them [55, 77]. Yet, most file systems implement the POSIX standard, which lacks a transaction concept.

Figure 1. High-level depiction of Ceph's architecture. A single pool with 3× replication is shown. Therefore, each placement group (PG) is replicated on three OSDs.

Figure 2. Timeline of storage backend evolution in Ceph. For each backend, the period of development and the period of being the default production backend is shown.

Therefore, distributed file system developers typically resort to using inefficient or complex mechanisms, such as implementing a Write-Ahead Log (WAL) on top of a file system [74], or leveraging a file system's internal transaction mechanism [102].

Metadata management is another recurring pain point in distributed file systems [69]. The inability to efficiently enumerate large directory contents or handle small files at scale in local file systems can cripple performance for both centralized [101, 102] and distributed [74, 98] metadata management designs. To address this problem, distributed file system developers use metadata caching [74], deep directory hierarchies arranged by data hashes [98], custom databases [89], or patches to local file systems [12, 13, 112].

An emerging requirement for storage backends is support for novel storage hardware that operates using backward-incompatible interfaces. For example, SMR can boost HDD capacity by more than 25%, and hardware vendors claim that by 2023, over half of data center HDDs will use SMR [83]. Another example is ZNS SSDs that eliminate the FTL and do not suffer from uncontrollable garbage collection delays [9], allowing better tail-latency control. Both of these new classes of storage hardware present backward-incompatible interfaces that are challenging for local, block-based file systems to adopt.

2.2 Ceph Distributed Storage System Architecture

Figure 1 shows the high-level architecture of Ceph. At the core of Ceph is the Reliable Autonomic Distributed Object Store (RADOS) service [100]. RADOS scales to thousands of Object Storage Devices (OSDs), providing self-healing, self-managing, replicated object storage with strong consistency. Ceph's librados library provides a transactional interface for manipulating objects and object collections in RADOS. Out of the box, Ceph provides three services implemented using librados: the RADOS Gateway (RGW), an object storage service similar to Amazon S3 [5]; the RADOS Block Device (RBD), a virtual block device similar to Amazon EBS [4]; and CephFS, a distributed file system with POSIX semantics.

Objects in RADOS are stored in logical partitions called pools. Pools can be configured to provide redundancy for the contained objects either through replication or erasure coding. Within a pool, the objects are sharded among aggregation units called placement groups (PGs). Depending on the replication factor, PGs are mapped to multiple OSDs using CRUSH, a pseudo-random data distribution algorithm [99]. Clients also use CRUSH to determine the OSD that should contain a given object, obviating the need for a centralized metadata service. PGs and CRUSH form an indirection layer between clients and OSDs that allows the migration of objects between OSDs to adapt to cluster or workload changes.

In every node of a RADOS cluster, there is a separate Ceph OSD daemon per local storage device. Each OSD processes client I/O requests from librados clients and cooperates with peer OSDs to replicate or erasure code updates, migrate data, or recover from failures. Data is persisted to the local device via the internal ObjectStore interface, which provides abstractions for objects, object collections, a set of primitives to inspect data, and transactions to update data. A transaction combines an arbitrary number of primitives operating on objects and object collections into an atomic operation. In principle, each OSD may make use of a different backend implementation of the ObjectStore interface, although clusters tend to be uniform in practice.
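To make the ObjectStore abstraction concrete, the sketch below shows a minimal transactional backend interface in the same spirit. It is illustrative only: the class and method names (Transaction, queue_transaction, InMemoryBackend) are assumptions for this example rather than Ceph's actual API, and the in-memory commit stands in for a real crash-atomic implementation.

```cpp
// Illustrative sketch of a transactional backend interface in the spirit of
// the ObjectStore abstraction described above; not Ceph's actual API.
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

struct Store {
  // (collection, object) -> data, and "collection/object/attr" -> value
  std::map<std::pair<std::string, std::string>, std::vector<uint8_t>> objects;
  std::map<std::string, std::string> attrs;
};

class Transaction {
public:
  // Each primitive is recorded, not applied; the backend applies the batch as one unit.
  void write(std::string coll, std::string obj, std::vector<uint8_t> data) {
    ops_.push_back([=](Store& s) { s.objects[{coll, obj}] = data; });
  }
  void setattr(std::string coll, std::string obj, std::string key, std::string val) {
    ops_.push_back([=](Store& s) { s.attrs[coll + "/" + obj + "/" + key] = val; });
  }
  void remove(std::string coll, std::string obj) {
    ops_.push_back([=](Store& s) { s.objects.erase({coll, obj}); });
  }
private:
  friend class InMemoryBackend;
  std::vector<std::function<void(Store&)>> ops_;
};

class InMemoryBackend {
public:
  // Applies every primitive in the transaction as one unit. A real backend
  // would make this crash-atomic (e.g., via a WAL or copy-on-write commit).
  void queue_transaction(const Transaction& t) {
    std::lock_guard<std::mutex> lock(mu_);
    for (const auto& op : t.ops_) op(state_);
  }
private:
  std::mutex mu_;
  Store state_;
};

int main() {
  InMemoryBackend store;
  Transaction t;                              // one client write as an atomic unit:
  t.write("pg_12.e4", "obj_a", {1, 2, 3});    // data update
  t.setattr("pg_12.e4", "obj_a", "version", "42");  // plus its metadata
  store.queue_transaction(t);
}
```

Bundling the data update and its metadata into one atomic unit is exactly what a plain POSIX sequence of write and setxattr calls cannot express.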

2.3 Evolution of Ceph's Storage Backend

The first implementation of the ObjectStore interface was in fact a user space file system called Extent and B-Tree-based Object File System (EBOFS). In 2008, Btrfs was emerging with attractive features such as transactions, deduplication, checksums, and transparent compression, which were lacking in EBOFS. Therefore, as shown in Figure 2, EBOFS was replaced by FileStore, an ObjectStore implementation on top of Btrfs.

In FileStore, an object collection is mapped to a directory and object data is stored in a file. Initially, object attributes were stored in POSIX extended file attributes (xattrs), but were later moved to LevelDB when object attributes exceeded the size or count limitations of xattrs.

FileStore on Btrfs was the production backend for several years, throughout which Btrfs remained unstable and suffered from severe data and metadata fragmentation. In the meantime, the ObjectStore interface evolved significantly, making it impractical to switch back to EBOFS. Instead, FileStore was ported to run on top of XFS, ext4, and later ZFS. Of these, FileStore on XFS became the de facto backend because it scaled better and had faster metadata performance [36].

While FileStore on XFS was stable, it still suffered from metadata fragmentation and did not exploit the full potential of the hardware. Lack of native transactions led to a user space WAL implementation that performed full data journaling and capped the speed of read-modify-write workloads, a typical Ceph workload, to the WAL's write speed. In addition, since XFS was not a copy-on-write file system, clone operations used heavily by snapshots were significantly slower.

NewStore was the first attempt at solving the metadata problems of file-system-based backends. Instead of using directories to represent object collections, NewStore stored object metadata in RocksDB, an ordered key-value store, while object data was kept in files. RocksDB was also used to implement the WAL, making read-modify-write workloads efficient due to a combined data and metadata log. Storing object data as files and running RocksDB on top of a journaling file system, however, introduced high consistency overhead. This led to the implementation of BlueStore, which uses raw disks. The following section describes the challenges BlueStore aimed to resolve. A complete description of BlueStore is given in § 4.
3 Building Storage Backends on Local File Systems is Hard

This section describes the challenges faced by the Ceph team while trying to build a distributed storage backend on top of local file systems.

3.1 Challenge 1: Efficient Transactions

Transactions simplify application development by encapsulating a sequence of operations into a single atomic unit of work. Thus, a significant body of work aims to introduce transactions into file systems [39, 63, 65, 72, 77, 80, 87, 106]. None of these works have been adopted by production file systems, however, due to their high performance overhead, limited functionality, interface complexity, or implementation complexity.

Hence, there are three tangible options for providing transactions in a storage backend running on top of a file system: (1) hooking into a file system's internal (but limited) transaction mechanism; (2) implementing a WAL in user space; and (3) using a key-value database with transactions as a WAL. Next, we describe why each of these options results in significant performance or complexity overhead.

3.1.1 Leveraging File System Internal Transactions

Many file systems implement an in-kernel transaction framework that enables performing compound internal operations atomically [18, 23, 86, 94]. Since the purpose of this framework is to ensure internal file system consistency, its functionality is generally limited, and thus, unavailable to users. For example, a rollback mechanism is not available in file system transaction frameworks because it is unnecessary for ensuring the internal consistency of a file system.

Until recently, Btrfs was making its internal transaction mechanism available to users through a pair of system calls that atomically applied the operations between them to the file system [23]. The first version of FileStore that ran on Btrfs relied on these system calls, and suffered from the lack of a rollback mechanism. More specifically, if a Ceph OSD encountered a fatal event in the middle of a transaction, such as a software crash or a KILL signal, Btrfs would commit a partial transaction and leave the storage backend in an inconsistent state.

Solutions attempted by the Ceph and Btrfs teams included introducing a single system call for specifying the entire transaction [96], and implementing rollback through snapshots [95], both of which proved costly. Btrfs authors recently deprecated the transaction system calls [15]. This outcome is similar to Microsoft's attempt to leverage NTFS's in-kernel transaction framework for providing an atomic file transaction API, which was deprecated due to its high barrier to entry [53].

These experiences strongly suggest that it is hard to leverage the internal transaction mechanism of a file system in a storage backend implemented in user space.

3.1.2 Implementing the WAL in User Space

An alternative to utilizing the file system's in-kernel transaction framework was to implement a logical WAL in user space. While this approach worked, it suffered from three major problems.

Slow Read-Modify-Write. Typical Ceph workloads perform many read-modify-write operations on objects, where preparing the next transaction requires reading the effect of the previous transaction. A user space WAL implementation, on the other hand, performs three steps for every transaction. First, the transaction is serialized and written to the log. Second, fsync is called to commit the transaction to disk. Third, the operations specified in the transaction are applied to the file system. The effect of a transaction cannot be read by upcoming transactions until the third step completes, which is dependent on the second step. As a result, every read-modify-write operation incurred the full latency of the WAL commit, preventing efficient pipelining.
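The three-step commit described above can be sketched as follows. This is a minimal illustration of a logical user space WAL, not FileStore's code; the log file name and the Op encoding are invented for the example.

```cpp
// Minimal sketch of a logical user space WAL: serialize, fsync the log, apply.
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <vector>

struct Op { std::string path; std::string data; };   // toy "overwrite file" op

static std::string serialize(const std::vector<Op>& ops) {
  std::string out;
  for (const auto& op : ops) out += op.path + "=" + op.data + "\n";
  return out;
}

// Steps 1 + 2: append the serialized transaction to the log and fsync it.
// Only after this returns is the transaction durable.
bool log_transaction(int log_fd, const std::vector<Op>& ops) {
  std::string rec = serialize(ops);
  if (::write(log_fd, rec.data(), rec.size()) != (ssize_t)rec.size()) return false;
  return ::fsync(log_fd) == 0;
}

// Step 3: apply the logged operations to the backing file system. A later
// read-modify-write transaction cannot see the new state until this finishes,
// so every such transaction pays the full WAL commit latency.
bool apply(const std::vector<Op>& ops) {
  for (const auto& op : ops) {
    int fd = ::open(op.path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    bool ok = ::write(fd, op.data.data(), op.data.size()) == (ssize_t)op.data.size();
    ::close(fd);
    if (!ok) return false;
  }
  return true;
}

int main() {
  int log_fd = ::open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
  std::vector<Op> txn = {{"object_a", "new-content"}};
  if (log_fd >= 0 && log_transaction(log_fd, txn)) apply(txn);
  if (log_fd >= 0) ::close(log_fd);
}
```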

Non-Idempotent Operations. In FileStore, objects are represented by files and collections are mapped to directories. With this data model, replaying a logical WAL after a crash is challenging due to non-idempotent operations. While the WAL is trimmed periodically, there is always a window of time when a committed transaction that is still in the WAL has already been applied to the file system. For example, consider a transaction consisting of three operations: (1) clone a→b; (2) update a; (3) update c. If a crash happens after the second operation, replaying the WAL corrupts object b. As another example, consider a transaction: (1) update b; (2) rename b→c; (3) rename a→b; (4) update d. If a crash happens after the third operation, replaying the WAL corrupts object a, which is now named b, and then fails because object a does not exist anymore.

FileStore on Btrfs solved this problem by periodically taking persistent snapshots of the file system and marking the WAL position at the time of the snapshot. Then on recovery the latest snapshot was restored, and the WAL was replayed from the position marked at the time of the snapshot.

When FileStore abandoned Btrfs in favor of XFS (§ 2.3), the lack of efficient snapshots caused two problems. First, on XFS the sync system call is the only option for synchronizing file system state to storage. However, in typical deployments with multiple drives per node, sync is too expensive because it synchronizes all file systems on all drives. This problem was resolved by adding the syncfs system call [97] to the Linux kernel, which synchronizes only a given file system.

The second problem was that with XFS, there is no option to restore the file system to a specific state after which the WAL can be replayed without worrying about non-idempotent operations. Guards (sequence numbers) were added to avoid replaying non-idempotent operations; however, verifying the correctness of guards for complex operations was hard due to the large problem space. Tooling was written to generate random permutations of complex operation sequences, and it was combined with failure injection to semi-comprehensively verify that all failure cases were correctly handled. However, the FileStore code ended up fragile and hard to maintain.

Double Writes. The final problem with the WAL in FileStore is that data is written twice: first to the WAL and then to the file system, halving the disk bandwidth. This is a known problem that leads most file systems to only log metadata changes, allowing data loss after a crash. It is possible to avoid the penalty of double writes for new data by first writing it to disk and then logging only the respective metadata. However, FileStore's approach of using the state of the file system to infer the namespace of objects and their states makes this method hard to use due to corner cases, such as partially written files. While FileStore's approach turned out to be problematic, it was chosen for a technical reason: the alternative required implementing an in-memory cache for data and metadata of any updates waiting on the WAL, despite the kernel having a page and inode cache of its own.
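A minimal sketch of the guard idea follows, assuming a per-object sequence number recorded with each applied transaction; the real FileStore guards covered many more operation types and corner cases, which is what made them hard to verify.

```cpp
// Illustrative sketch of replay guards (not FileStore's implementation): each
// object records the sequence number of the last transaction applied to it,
// and replay skips any logged operation that is not newer. This is what makes
// replaying non-idempotent operations safe.
#include <cstdint>
#include <map>
#include <string>

struct LoggedOp {
  uint64_t seq;          // transaction sequence number in the WAL
  std::string object;    // object the operation mutates
  // ... operation payload (clone, rename, update) omitted
};

class ReplayGuards {
public:
  // Returns true if the op still needs to be applied during replay.
  bool should_apply(const LoggedOp& op) const {
    auto it = applied_seq_.find(op.object);
    return it == applied_seq_.end() || op.seq > it->second;
  }
  // Called after the op's effects (and the guard update itself) are persisted.
  void mark_applied(const LoggedOp& op) { applied_seq_[op.object] = op.seq; }
private:
  std::map<std::string, uint64_t> applied_seq_;  // object -> last applied seq
};

int main() {
  ReplayGuards guards;
  LoggedOp clone_ab{10, "b"};            // seq 10: clone a -> b (mutates b)
  guards.mark_applied(clone_ab);         // crash happened after this point
  // On replay, seq 10 is skipped for object b, so the clone is not redone on
  // top of the already-updated object; ops with higher seq are still applied.
  return guards.should_apply(clone_ab) ? 1 : 0;
}
```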
3.1.3 Using a Key-Value Store as the WAL

With NewStore, the metadata was stored in RocksDB, an ordered key-value store, while the object data were still represented as files in a file system. Hence, metadata operations could be performed atomically; data overwrites, however, were logged into RocksDB and executed later. We first describe how this design addresses the three problems of a logical WAL, and then show that it introduces high consistency overhead that stems from running atop a journaling file system.

First, slow read-modify-write operations are avoided because the key-value interface allows reading the new state of an object without waiting for the transaction to commit.

Second, the problem of non-idempotent operation replay is avoided because the read side of such operations is resolved at the time when the transaction is prepared. For example, for clone a→b, if object a is small, it is copied and inserted into the transaction; if object a is large, a copy-on-write mechanism is used, which changes both a and b to point to the same data and marks the data read-only.

Finally, the problem of double writes is avoided for new objects because the object namespace is now decoupled from the file system state. Therefore, data for a new object is first written to the file system and then a reference to it is atomically added to the database.

Despite these favorable properties, the combination of RocksDB and a journaling file system introduces high consistency overhead, similar to the journaling-of-journal problem [48, 81]. Creating an object in NewStore entails two steps: (1) writing to a file and calling fsync, and (2) writing the object metadata to RocksDB synchronously [44], which also calls fsync. Ideally, the fsync in each step should issue one expensive FLUSH CACHE command [105] to the disk. With a journaling file system, however, each fsync issues two flush commands: after writing the data, and after committing the corresponding metadata changes to the file system journal. Hence, creating an object in NewStore results in four expensive flush commands to disk.

We demonstrate the overhead of journaling using a benchmark that emulates a storage backend creating many objects. The benchmark has a loop where each iteration first writes 0.5 MiB of data and then inserts a 500-byte metadata entry to RocksDB. We run the benchmark on two setups. The first setup emulates NewStore, issuing four flush operations for every object creation: data is written as a file to XFS, and the metadata is inserted to stock RocksDB running on XFS. The second setup emulates object creation on a raw disk, which issues two flush operations for every object creation: data is written to the raw disk and the metadata is inserted to a modified RocksDB that runs on a raw disk with a preallocated pool of WAL files.

Figure 3 shows that the object creation throughput is 80% higher on raw disk than on XFS when running on an HDD and 70% higher when running on an NVMe SSD.
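The first setup's benchmark loop can be approximated as follows, assuming a RocksDB instance and an object directory on the same XFS mount; the paths and iteration count are placeholders.

```cpp
// Rough reconstruction of the benchmark loop described above: each iteration
// writes 0.5 MiB of object data to a file, fsyncs it, then synchronously
// inserts a small metadata entry into RocksDB. On a journaling file system
// each of the two fsyncs turns into two device cache flushes (four per
// object); on a raw device the same sequence needs only two.
#include <fcntl.h>
#include <unistd.h>
#include <rocksdb/db.h>
#include <string>
#include <vector>

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options opts;
  opts.create_if_missing = true;
  if (!rocksdb::DB::Open(opts, "/mnt/xfs/meta", &db).ok()) return 1;

  rocksdb::WriteOptions wo;
  wo.sync = true;                               // flushes #3 and #4 happen here

  std::vector<char> data(512 * 1024, 'x');      // 0.5 MiB object payload
  std::string meta(500, 'm');                   // ~500-byte metadata entry

  for (int i = 0; i < 10000; i++) {
    std::string path = "/mnt/xfs/objects/obj-" + std::to_string(i);
    int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) break;
    if (::write(fd, data.data(), data.size()) < 0) { ::close(fd); break; }
    ::fsync(fd);                                // flushes #1 and #2 happen here
    ::close(fd);
    db->Put(wo, "object-" + std::to_string(i), meta);
  }
  delete db;
}
```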

Figure 3. The overhead of running an object store workload on a journaling file system. Object creation throughput is 80% higher on a raw HDD (4 TB Seagate ST4000NM0023) and 70% higher on a raw NVMe SSD (400 GB Intel P3600).

Figure 4. The effect of directory splitting on throughput with the FileStore backend. The workload inserts 4 KiB objects using 128 parallel threads at the RADOS layer to a 16-node Ceph cluster (setup explained in § 6). Directory splitting brings down the throughput for 7 minutes on an all-SSD cluster. Once the splitting is complete, the throughput recovers but does not return to its peak, due to a combination of deeper nesting of files, increased size of the underlying file system, and an imperfect implementation of the directory hashing code in FileStore.

3.2 Challenge 2: Fast Metadata Operations

Inefficiency of metadata operations in local file systems is a source of constant struggle for distributed file systems [69, 71, 112]. One of the key metadata challenges in Ceph with the FileStore backend stems from the slow directory enumeration (readdir) operations on large directories, and the lack of ordering in the returned result [85].

Objects in RADOS are mapped to a PG based on a hash of their name, and enumerated by hash order. Enumeration is necessary for operations like scrubbing [78], recovery, or for serving librados calls that list objects. For objects with long names—as is often the case with RGW—FileStore works around the file name length limitation in local file systems using extended attributes, which may require a stat call to determine the object name. FileStore follows a commonly adopted solution to the slow enumeration problem: a directory hierarchy with large fan-out is created, objects are distributed among directories, and then selected directories' contents are sorted after being read.

To sort them quickly and to limit the overhead of potential stat calls, directories are kept small (a few hundred entries) by splitting them when the number of entries in them grows. This is a costly process at scale, for two primary reasons. First, processing millions of inodes at once reduces the effectiveness of the dentry cache, resulting in many small I/Os to disk. And second, XFS places subdirectories in different allocation groups [45] to ensure there is space for future directory entries to be located together [61]; therefore, as the number of objects grows, directory contents spread out, and split operations take longer due to seeks. As a result, when all Ceph OSDs start splitting in unison the performance suffers. This is a well-known problem that has been affecting many Ceph users over the years [16, 27, 88].

To demonstrate this effect, we configure a 16-node Ceph cluster (§ 6) with roughly half the recommended number of PGs to increase load per PG and accelerate splitting, and insert millions of 4 KiB objects with a queue depth of 128 at the RADOS layer (§ 2.2). Figure 4 shows the effect of the splitting on FileStore for an all-SSD cluster. While the first split is not noticeable in the graph, the second split causes a precipitous drop that kills the throughput for 7 minutes on an all-SSD and 120 minutes on an all-HDD cluster (not shown), during which a large and deep directory hierarchy with millions of entries is scanned and an even deeper hierarchy is created. The recovery takes an order of magnitude longer on an all-HDD cluster due to the high cost of seeks.
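The sketch below illustrates the kind of hash-based directory placement described above; the path scheme and hash function are simplified stand-ins, not FileStore's actual layout code.

```cpp
// Illustrative sketch of hash-based directory placement: an object's name
// hash is turned into a nested directory path, one hex nibble per level.
// Splitting a collection one level deeper multiplies the directory count by
// 16 and forces every object to move, which is what makes the splits in
// Figure 4 so disruptive.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>

// Path for an object given the current split depth of its collection.
std::string object_path(const std::string& object_name, int split_depth) {
  uint64_t h = std::hash<std::string>{}(object_name);   // stand-in for Ceph's hash
  char hex[17];
  std::snprintf(hex, sizeof(hex), "%016llx", static_cast<unsigned long long>(h));

  std::string path = "collection";
  for (int level = 0; level < split_depth; level++) {
    path += "/";                      // one directory level per hex nibble
    path += hex[level];
  }
  return path + "/" + object_name;
}

int main() {
  // Before a split (depth 2) and after (depth 3): every object's directory
  // changes, so millions of files are moved across a deeper hierarchy.
  std::printf("%s\n", object_path("rbd_data.1234.0000000000000abc", 2).c_str());
  std::printf("%s\n", object_path("rbd_data.1234.0000000000000abc", 3).c_str());
}
```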
3.3 Challenge 3: Support For New Storage Hardware

The changing storage hardware landscape presents a new challenge for distributed file systems that depend on local file systems. To increase capacity, hard disk drive vendors are shifting to SMR, which works best when using a backward-incompatible interface. While the vendors have produced drive-managed SMR drives that are backward compatible, these drives have unpredictable performance [1]. To leverage the extra capacity and achieve predictable performance at the same time, host-managed SMR drives with a backward-incompatible zone interface should be used [46]. The zone interface manages the disk as a sequence of 256 MiB regions that must be written sequentially, encouraging a log-structured, copy-on-write design [76]. This design is in direct opposition to the in-place overwrite design followed by most mature file systems.
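As a rough illustration of why the zone interface forces a copy-on-write design, the sketch below models a zoned device as append-only 256 MiB regions with a per-zone write pointer; it is an abstraction for explanation, not a real zoned-block-device driver.

```cpp
// Minimal model of the zone interface described above: writes are only
// accepted at a zone's write pointer, and space is reclaimed by resetting a
// whole zone, so an overwrite-in-place file system cannot sit on top of it
// unchanged.
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint64_t kZoneSize = 256ull * 1024 * 1024;   // 256 MiB

struct Zone {
  uint64_t start;          // first byte of the zone on the device
  uint64_t write_pointer;  // next writable byte; only sequential appends allowed
};

class ZonedDevice {
public:
  explicit ZonedDevice(uint64_t capacity) {
    for (uint64_t off = 0; off + kZoneSize <= capacity; off += kZoneSize)
      zones_.push_back({off, off});
  }
  // Append-only: returns the byte address written, or -1 on a violation.
  int64_t append(size_t zone_idx, uint64_t len) {
    Zone& z = zones_[zone_idx];
    if (z.write_pointer + len > z.start + kZoneSize) return -1;  // zone full
    uint64_t addr = z.write_pointer;
    z.write_pointer += len;          // data for [addr, addr+len) lands here
    return static_cast<int64_t>(addr);
  }
  // Reclaiming space means resetting the whole zone, hence copy-on-write and
  // garbage collection at a higher layer.
  void reset(size_t zone_idx) { zones_[zone_idx].write_pointer = zones_[zone_idx].start; }
private:
  std::vector<Zone> zones_;
};

int main() {
  ZonedDevice dev(4 * kZoneSize);
  int64_t a = dev.append(0, 4096);   // first write lands at the zone start
  int64_t b = dev.append(0, 4096);   // the next write must follow immediately
  return (a == 0 && b == 4096) ? 0 : 1;
}
```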

Data center SSDs are going through a similar change. OpenChannel SSDs eliminate the FTL, leaving the management of raw flash to the host. Lacking an official standard, several vendors have introduced different methods of interfacing OpenChannel SSDs, resulting in fragmented implementations [11, 21, 33]. To prevent this, major vendors have joined forces to introduce a new NVMe standard called Zoned Namespaces (ZNS) that defines an interface for managing SSDs without an FTL [10]. Eliminating the FTL results in many advantages, such as reducing the write amplification, improving latency outliers and throughput, reducing overprovisioning by an order of magnitude, and cutting the cost by reducing DRAM—the highest-costing component in an SSD after the NAND flash.

Both of these technologies—host-managed SMR drives and ZNS SSDs—are becoming increasingly important for distributed file systems, yet both have a backward-incompatible zone interface that requires radical changes to local file systems [9, 26]. It is not surprising that attempts to modify production file systems, such as XFS and ext4, to work with the zone interface have so far been unsuccessful [19, 68], primarily because these are overwrite file systems, whereas the zone interface requires a copy-on-write approach to data management.

3.4 Other Challenges

Many public and private clouds rely on distributed storage systems like Ceph for providing storage services [67]. Without complete control of the I/O stack, it is hard for distributed file systems to enforce storage latency SLOs. One cause of high-variance request latencies in file-system-based storage backends is the OS page cache. To improve user experience, most OSs implement the page cache using a write-back policy, in which a write operation completes once the data is buffered in memory and the corresponding pages are marked as dirty. On a system with little I/O activity, the dirty pages are written back to disk at regular intervals, synchronizing the on-disk and in-memory copies of data. On a busy system, on the other hand, the write-back behavior is governed by a complex set of policies that can trigger writes at arbitrary times [8, 24, 107].

Hence, while the write-back policy results in a responsive system for users with lightly loaded systems, it complicates achieving predictable latency on busy storage backends. Even with a periodic use of fsync, FileStore has been unable to bound the amount of deferred inode metadata write-back, leading to inconsistent performance.

Another challenge for file-system-based backends is implementing operations that work better with copy-on-write support, such as snapshots. If the backing file system is copy-on-write, these operations can be implemented efficiently. However, even if copy-on-write is supported, a file system may have other drawbacks, like fragmentation in FileStore on Btrfs (§ 2.3). If the backing file system is not copy-on-write, then these operations require performing expensive full copies of objects, which makes snapshots and overwriting of erasure-coded data prohibitively expensive in FileStore (§ 5.2).

4 BlueStore: A Clean-Slate Approach

BlueStore is a storage backend designed from scratch to solve the challenges (§ 3) faced by backends using local file systems. Some of the main goals of BlueStore were:

1. Fast metadata operations (§ 4.1)
2. No consistency overhead for object writes (§ 4.1)
3. Copy-on-write clone operation (§ 4.2)
4. No journaling double-writes (§ 4.2)
5. Optimized I/O patterns for HDD and SSD (§ 4.2)

BlueStore achieved all of these goals within just two years and became the default storage backend in Ceph. Two factors played a key role in why BlueStore matured so quickly compared to general-purpose POSIX file systems that take a decade to mature [30, 56, 103, 104]. First, BlueStore implements a small, special-purpose interface, and not a complete POSIX I/O specification. Second, BlueStore is implemented in user space, which allows it to leverage well-tested and high-performance third-party libraries. Finally, BlueStore's control of the I/O stack enables additional features whose discussion we defer to § 5.

Figure 5. The high-level architecture of BlueStore. Data is written to the raw storage device using direct I/O. Metadata is written to RocksDB running on top of BlueFS. BlueFS is a user space library file system designed for RocksDB, and it also runs on top of the raw storage device.

The high-level architecture of BlueStore is shown in Figure 5. BlueStore runs directly on raw disks. A space allocator within BlueStore determines the location of new data, which is asynchronously written to disk using direct I/O. Internal metadata and user object metadata are stored in RocksDB, which runs on BlueFS, a minimal user space file system tailored to RocksDB. The BlueStore space allocator and BlueFS share the disk and periodically communicate to balance free space. The remainder of this section describes metadata and data management in BlueStore.

4.1 BlueFS and RocksDB

BlueStore achieves its first goal, fast metadata operations, by storing metadata in RocksDB. BlueStore achieves its second goal of no consistency overhead with two changes. First, it writes data directly to the raw disk, resulting in one cache flush for a data write. Second, it changes RocksDB to reuse WAL files as a circular buffer, resulting in one cache flush for a metadata write—a feature that was upstreamed to the mainline RocksDB.

RocksDB itself runs on BlueFS, a minimal file system designed specifically for RocksDB that runs on a raw storage device. RocksDB abstracts out its requirements from the underlying file system in the Env interface. BlueFS is an implementation of this interface in the form of a user space, extent-based, and journaling file system. It implements basic system calls required by RocksDB, such as open, mkdir, and pwrite. A possible on-disk layout of BlueFS is shown in Figure 6. BlueFS maintains an inode for each file that includes the list of extents allocated to the file. The superblock is stored at a fixed offset and contains an inode for the journal. The journal has the only copy of all file system metadata, which is loaded into memory at mount time. On every metadata operation, such as directory creation, file creation, and extent allocation, the journal and in-memory metadata are updated. The journal is not stored at a fixed location; its extents are interleaved with other file extents. The journal is compacted and written to a new location when it reaches a preconfigured size, and the new location is recorded in the superblock. These design decisions work because large files and periodic compactions limit the volume of metadata at any point in time.

Figure 6. A possible on-disk data layout of BlueFS. The metadata in BlueFS lives only in the journal. The journal does not have a fixed location—its extents are interleaved with file data. The WAL, LOG, and SST files are the write-ahead log, the debug log, and sorted-string table files, respectively, generated by RocksDB.

Metadata Organization. BlueStore keeps multiple namespaces in RocksDB, each storing a different type of metadata. For example, object information is stored in the O namespace (that is, RocksDB keys start with O and their values represent object metadata), block allocation metadata is stored in the B namespace, and collection metadata is stored in the C namespace. Each collection maps to a PG and represents a shard of a pool's namespace. The collection name includes the pool identifier and a prefix shared by the collection's object names. For example, a key-value pair C12.e4-6 identifies a collection in pool 12 with objects that have hash values starting with the 6 significant bits of e4. Hence, the object O12.e532 is a member, whereas the object O12.e832 is not. Such organization of metadata allows a collection of millions of objects to be split into multiple collections merely by changing the number of significant bits. This collection splitting operation is necessary to rebalance data across OSDs when, for example, a new OSD is added to the cluster to increase the aggregate capacity, or an existing OSD is removed from the cluster due to a malfunction. With FileStore, collection splitting, which is different than directory splitting (§ 3.2), was an expensive operation that was done by renaming directories.
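The membership rule implied by this key scheme can be illustrated with a small helper; the key encoding below is simplified for the example and is not BlueStore's actual on-disk format.

```cpp
// Illustrative prefix-based collection membership check for the scheme
// described above: a collection covering pool 12 and hash prefix e4 with 6
// significant bits contains every object whose 32-bit name hash shares those
// top 6 bits.
#include <cstdint>
#include <cstdio>

struct CollectionKey {
  int64_t  pool;         // e.g. 12
  uint32_t hash_prefix;  // e.g. 0xe4000000 (top byte e4, rest zero)
  int      prefix_bits;  // e.g. 6 significant bits
};

bool collection_contains(const CollectionKey& c, int64_t pool, uint32_t obj_hash) {
  if (pool != c.pool) return false;
  uint32_t mask = c.prefix_bits == 0 ? 0 : ~0u << (32 - c.prefix_bits);
  return (obj_hash & mask) == (c.hash_prefix & mask);
}

int main() {
  CollectionKey c{12, 0xe4000000u, 6};               // "C12.e4-6" in the text
  std::printf("%d\n", collection_contains(c, 12, 0xe5320000u));  // 1: member
  std::printf("%d\n", collection_contains(c, 12, 0xe8320000u));  // 0: not a member
  // Splitting the collection amounts to increasing prefix_bits and inserting
  // new collection keys; the object keys themselves do not move.
}
```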
4.2 Data Path and Space Allocation

BlueStore is a copy-on-write backend. For incoming writes larger than a minimum allocation size (64 KiB for HDDs, 16 KiB for SSDs) the data is written to a newly allocated extent. Once the data is persisted, the corresponding metadata is inserted to RocksDB. This allows BlueStore to provide an efficient clone operation. A clone operation simply increments the reference count of dependent extents, and writes are directed to new extents. It also allows BlueStore to avoid journal double-writes for object writes and partial overwrites that are larger than the minimum allocation size.

For writes smaller than the minimum allocation size, both data and metadata are first inserted to RocksDB as promises of future I/O, and then asynchronously written to disk after the transaction commits. This deferred write mechanism has two purposes. First, it batches small writes to increase efficiency, because new data writes require two I/O operations whereas an insert to RocksDB requires one. Second, it optimizes I/O based on the device type: 64 KiB (or smaller) overwrites of a large object on an HDD are performed asynchronously in place to avoid seeks during reads, whereas in-place overwrites only happen for I/O sizes less than 16 KiB on SSDs.
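A simplified sketch of this write-path decision is shown below; the constants come from the text, while the types and function names are invented for illustration.

```cpp
// Simplified sketch of the BlueStore-style write-path decision described
// above; the control flow is illustrative rather than the actual implementation.
#include <cstddef>
#include <cstdint>

constexpr uint64_t kMinAllocHDD = 64 * 1024;   // 64 KiB
constexpr uint64_t kMinAllocSSD = 16 * 1024;   // 16 KiB

enum class Media { HDD, SSD };

enum class WritePath {
  NewExtent,   // allocate fresh space, write with direct I/O, then commit metadata
  Deferred     // log data+metadata in RocksDB now, overwrite in place later
};

WritePath choose_write_path(Media media, size_t io_size) {
  uint64_t min_alloc = (media == Media::HDD) ? kMinAllocHDD : kMinAllocSSD;
  // Large writes go copy-on-write to a new extent: one data write, one
  // RocksDB commit, and no double-write of the payload.
  if (io_size >= min_alloc) return WritePath::NewExtent;
  // Small overwrites are cheaper to batch through RocksDB and apply in place
  // asynchronously after the transaction commits.
  return WritePath::Deferred;
}

int main() {
  // A 4 KiB overwrite on an HDD takes the deferred path.
  return choose_write_path(Media::HDD, 4096) == WritePath::Deferred ? 0 : 1;
}
```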

Space Allocation. BlueStore allocates space using two modules: the FreeList manager and the Allocator. The FreeList manager is responsible for a persistent representation of the parts of the disk currently in use. Like all metadata in BlueStore, this free list is also stored in RocksDB. The first implementation of the FreeList manager represented in-use regions as key-value pairs with offset and length. The disadvantage of this approach was that the transactions had to be serialized: the old key had to be deleted first before inserting a new key to avoid an inconsistent free list. The second implementation is bitmap-based. Allocation and deallocation operations use RocksDB's merge operator to flip bits corresponding to the affected blocks, eliminating the ordering constraint. The merge operator in RocksDB performs a deferred atomic read-modify-write operation that does not change the semantics and avoids the cost of point queries [43].

The Allocator is responsible for allocating space for the new data. It keeps a copy of the free list in memory and informs the FreeList manager as allocations are made. The first implementation of the Allocator was extent-based, dividing the free extents into power-of-two-sized bins. This design was susceptible to fragmentation as disk usage increased. The second implementation uses a hierarchy of indexes layered on top of a single-bit-per-block representation to track whole regions of blocks. Large and small extents can be efficiently found by querying the higher and lower indexes, respectively. This implementation has a fixed memory usage of 35 MiB per terabyte of capacity.

Cache. Since BlueStore is implemented in user space and accesses the disk using direct I/O, it cannot leverage the OS page cache. As a result, BlueStore implements its own write-through cache in user space, using the scan-resistant 2Q algorithm [49]. The cache implementation is sharded for parallelism. It uses an identical sharding scheme to Ceph OSDs, which shard requests to collections across multiple cores. This avoids false sharing, so that the same CPU context processing a given client request touches the corresponding 2Q data structures.

5 Features Enabled by BlueStore

In this section we describe new features implemented in BlueStore. These features were previously lacking because implementing them efficiently requires full control of the I/O stack.

5.1 Space-Efficient Checksums

Ceph scrubs metadata every day and data every week. Even with scrubbing, however, if the data is inconsistent across replicas it is hard to be sure which copy is corrupt. Therefore, checksums are indispensable for distributed storage systems that regularly deal with petabytes of data, where bit flips are almost certain to occur.

Most local file systems do not support checksums. When they do, like Btrfs, the checksum is computed over 4 KiB blocks to make block overwrites possible. For 10 TiB of data, storing 32-bit checksums of 4 KiB blocks results in 10 GiB of checksum metadata, which makes it difficult to cache checksums in memory for fast verification.

On the other hand, most of the data stored in distributed file systems is read-only and can be checksummed at a larger granularity. BlueStore computes a checksum for every write and verifies the checksum on every read. While multiple checksum algorithms are supported, crc32c is used by default because it is well-optimized on both x86 and ARM architectures, and it is sufficient for detecting random bit errors. With full control of the I/O stack, BlueStore can choose the checksum block size based on the I/O hints. For example, if the hints indicate that writes are from the S3-compatible RGW service, then the objects are read-only and the checksum can be computed over 128 KiB blocks, and if the hints indicate that objects are to be compressed, then a checksum can be computed after the compression, significantly reducing the total size of checksum metadata.
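The hint-driven checksum granularity can be sketched as follows; the hint names and policy are assumptions for illustration, and a production implementation would use hardware-accelerated crc32c rather than the bitwise loop shown here.

```cpp
// Sketch of hint-driven checksum granularity (illustrative policy, not
// BlueStore's actual metadata layout). Read-mostly objects, such as those
// written through RGW, use large 128 KiB checksum blocks, shrinking checksum
// metadata 32x compared to 4 KiB blocks.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal bitwise CRC-32C (Castagnoli polynomial, reflected).
uint32_t crc32c(const uint8_t* data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int k = 0; k < 8; k++)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return ~crc;
}

enum class WriteHint { ReadMostly, Compressed, Default };

size_t checksum_block_size(WriteHint hint) {
  switch (hint) {
    case WriteHint::ReadMostly: return 128 * 1024;  // e.g. RGW objects
    case WriteHint::Compressed: return 128 * 1024;  // checksum after compression
    default:                    return 4 * 1024;    // keep small overwrites cheap
  }
}

// One checksum per block, stored alongside the object's extent metadata.
std::vector<uint32_t> checksum_object(const std::vector<uint8_t>& data, WriteHint hint) {
  size_t block = checksum_block_size(hint);
  std::vector<uint32_t> sums;
  for (size_t off = 0; off < data.size(); off += block) {
    size_t n = std::min(block, data.size() - off);
    sums.push_back(crc32c(data.data() + off, n));
  }
  return sums;
}

int main() {
  std::vector<uint8_t> obj(1 << 20, 0xab);                  // 1 MiB object
  auto sums = checksum_object(obj, WriteHint::ReadMostly);  // 8 checksums
  return sums.size() == 8 ? 0 : 1;
}
```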
5.2 Overwrite of Erasure-Coded Data

Ceph has supported erasure-coded (EC) pools (§ 2.2) through the FileStore backend since 2014. However, until BlueStore, EC pools only supported object appends and deletions—overwrites were slow enough to make the system unusable. As a result, the use of EC pools was limited to RGW; for RBD and CephFS only replicated pools were used.

To avoid the "RAID write hole" problem [92], where crashing during a multi-step data update can leave the system in an inconsistent state, Ceph performs overwrites in EC pools using two-phase commit. First, all OSDs that store a chunk of the EC object make a copy of the chunk so that they can roll back in case of failure. After all of the OSDs receive the new content and overwrite their chunks, the old copies are discarded. With FileStore on XFS, the first phase is expensive because each OSD performs a physical copy of its chunk. BlueStore, however, makes overwrites practical because its copy-on-write mechanism avoids full physical copies.

5.3 Transparent Compression

Transparent compression is crucial for scale-out distributed file systems because 3× replication increases storage costs [32, 40]. BlueStore implements transparent compression where written data is automatically compressed before being stored. Getting the full benefit of compression requires compressing over large 128 KiB chunks, and compression works well when objects are written in their entirety. For partial overwrites of a compressed object, BlueStore places the new data in a separate location and updates metadata to point to it. When the compressed object gets too fragmented due to multiple overwrites, BlueStore compacts the object by reading and rewriting it. In practice, however, BlueStore uses hints and simple heuristics to compress only those objects that are unlikely to experience many overwrites.

5.4 Exploring New Interfaces

Despite multiple attempts [19, 68], local file systems are unable to leverage the capacity benefits of SMR drives due to their backward-incompatible interface, and it is unlikely that they will ever do so efficiently [28, 30]. Supporting these denser drives, however, is important for scale-out distributed file systems because it lowers storage costs [59].

Unconstrained by the block-based designs of local file systems, BlueStore has the freedom to explore novel interfaces and data layouts. This has recently enabled porting RocksDB and BlueFS (§ 4.1) to run on host-managed SMR drives, and an effort is underway to store object data on such drives next [3]. In addition, the Ceph community is exploring a new backend that targets a combination of persistent memory and emerging NVMe devices with novel interfaces, such as ZNS SSDs [9, 26] and KV SSDs [51].

Figure 7. Throughput of steady-state object writes to RADOS on a 16-node all-HDD cluster with different object sizes using 128 threads. Compared to FileStore, the throughput is 50-100% greater on BlueStore and has a significantly lower variance.

Figure 8. 95th and above percentile latencies of object writes to RADOS on a 16-node all-HDD cluster with different object sizes using 128 threads. BlueStore (top graph) has an order of magnitude lower tail latency than FileStore (bottom graph).

6 Evaluation

This section compares the performance of a Ceph cluster using FileStore, a backend built on a local file system, and BlueStore, a backend using the storage device directly. We first compare the throughput of object writes to the RADOS distributed object storage (§ 6.1). Then, we compare the end-to-end throughput of random writes, sequential writes, and sequential reads to RBD, the Ceph virtual block device built on RADOS (§ 6.2). Finally, we compare the throughput of random writes to an RBD device allocated on an erasure-coded pool (§ 6.3).

We run all experiments on a 16-node Ceph cluster connected with a Cisco Nexus 3264-Q 64-port QSFP+ 40GbE switch. Each node has a 16-core Intel E5-2698Bv3 Xeon 2GHz CPU, 64GiB RAM, a 400GB Intel P3600 NVMe SSD, a 4TB 7200RPM Seagate ST4000NM0023 HDD, and a Mellanox MCX314A-BCCT 40GbE NIC. All nodes run Linux kernel 4.15 on Ubuntu 18.04, and the Luminous release (v12.2.11) of Ceph. We use the default Ceph configuration parameters.

6.1 Bare RADOS Benchmarks

We start by comparing the performance of object writes to RADOS when using the FileStore and BlueStore backends. We focus on write performance improvements because most BlueStore optimizations affect writes.

Figure 7 shows the throughput for different object sizes written with a queue depth of 128. At the steady state, the throughput on BlueStore is 50-100% greater than FileStore. The throughput improvement on BlueStore stems from avoiding double writes (§ 3.1.2) and consistency overhead (§ 3.1.3).

Figure 8 shows the 95th and above percentile latencies of object writes to RADOS. BlueStore has an order of magnitude lower tail latency than FileStore. In addition, with BlueStore the tail latency increases with the object size, as expected, whereas with FileStore even small-sized object writes may have high tail latency, stemming from the lack of control over writes (§ 3.4).

The read performance on BlueStore (not shown) is similar to or better than on FileStore for I/O sizes larger than 128 KiB; for smaller I/O sizes FileStore is better because of the kernel read-ahead [6]. BlueStore does not implement read-ahead on purpose. It is expected that the applications implemented on top of RADOS will perform their own read-ahead.

BlueStore eliminates the directory splitting effect of FileStore by storing metadata in an ordered key-value store. To demonstrate this, we repeat the experiment that showed the splitting problem in FileStore (§ 3.2) on an identically configured Ceph cluster using a BlueStore backend. Figure 9 shows that the throughput on BlueStore does not suffer the precipitous drop, and in the steady state it is 2× higher than FileStore throughput on SSD (and 3× higher than FileStore throughput on HDD—not shown). Still, the throughput on BlueStore drops significantly before reaching a steady state due to RocksDB compaction, whose cost grows with the object corpus.

Figure 9. Throughput of 4 KiB RADOS object writes with a queue depth of 128 on a 16-node all-SSD cluster. At steady state, BlueStore is 2× faster than FileStore on SSD. BlueStore does not suffer from directory splitting; however, its throughput is gradually brought down by the RocksDB compaction overhead.

6.2 RADOS Block Device Benchmarks

Next, we compare the performance of RADOS Block Device (RBD), a virtual block device service implemented on top of RADOS, when using the BlueStore and FileStore backends. RBD is implemented as a kernel module that exports a block device to the user, which can be formatted and mounted like a regular block device. Data written to the device is striped into 4 MiB RADOS objects and written in parallel to multiple OSDs over the network.

For RBD benchmarks we create a 1 TB virtual block device, format it with XFS, and mount it on the client. We use fio [7] to perform sequential and random I/O with a queue depth of 256 and I/O sizes ranging from 4 KiB to 4 MiB. For each test, we write about 30 GiB of data. Before starting every experiment, we drop the OS page cache for FileStore, and we restart OSDs for BlueStore to eliminate caching effects in read experiments. We first run all the experiments on a Ceph cluster installed with the FileStore backend. We then tear down the cluster, reinstall it with the BlueStore backend, and repeat all the experiments.

Figure 10 shows the results for sequential writes, random writes, and sequential reads. For I/O sizes larger than 512 KiB, sequential and random write throughput is on average 1.7× and 2× higher with BlueStore, respectively, again mainly due to avoiding double writes. BlueStore also displays a significantly lower throughput variance because it can deterministically push data to disk. In FileStore, on the other hand, arbitrarily-triggered writeback (§ 3.4) conflicts with the foreground writes to the WAL and introduces long request latencies.

For medium I/O sizes (128–512 KiB) the throughput difference decreases for sequential writes because XFS masks out part of the cost of double writes in FileStore. With medium I/O sizes the writes to the WAL do not fully utilize the disk. This leaves enough bandwidth for another write stream to go through without a large impact on the foreground writes to the WAL. After writing the data synchronously to the WAL, FileStore then asynchronously writes it to the file system. XFS buffers these asynchronous writes and turns them into one large sequential write before issuing it to disk. XFS cannot do the same for random writes, which is why the high throughput difference continues even for medium-sized random writes.

Finally, for I/O sizes smaller than 64 KiB (not shown) the throughput of BlueStore is 20% higher than that of FileStore. For these I/O sizes BlueStore performs deferred writes by inserting data to RocksDB first, and then asynchronously overwriting the object data to avoid fragmentation (§ 4.2).

The throughput of read operations in BlueStore is similar to or slightly better than that of FileStore for I/O sizes larger than 32 KiB. For smaller I/O sizes, as the rightmost graph in Figure 10 shows, FileStore throughput is better because of the kernel readahead. While RBD does implement a readahead, it is not as well-tuned as the kernel readahead.

6.3 Overwriting Erasure-Coded Data

One of the features enabled by BlueStore is the efficient overwrite of EC data. We have measured the throughput of random overwrites for both BlueStore and FileStore. Our benchmark creates a 1 TB RBD using one client. The client mounts the block device and performs 5 GiB of random 4 KiB writes with a queue depth of 256. Since the RBD is striped into 4 MiB RADOS objects, every write results in an object overwrite. We repeat the experiment on a virtual block device allocated on a replicated pool, on an EC pool with parameters k = 4 and m = 2 (EC4-2), and on an EC pool with k = 5 and m = 1 (EC5-1).

Figure 11 compares the throughput of replicated and EC pools when using the BlueStore and FileStore backends. BlueStore EC pools achieve 6× more IOPS on EC4-2, and 8× more IOPS on EC5-1, than FileStore. This is due to BlueStore avoiding full physical copies during the first phase of the two-phase commit required for overwriting EC objects (§ 5.2). As a result, it is practical to use EC pools with applications that require data overwrite, such as RBD and CephFS, with the BlueStore backend.

7 Challenges of Building Efficient Storage Backends on Raw Storage

This section describes some of the challenges that the Ceph team faced when building a storage backend on raw storage devices from scratch.

[Figure 10: throughput in MB/s versus I/O size in KiB, in three panels (sequential write, random write, sequential read), each comparing BlueStore and FileStore.]

Figure 10. Sequential write, random write, and sequential read throughput with different I/O sizes and a queue depth of 256 on a 1 TB Ceph virtual block device (RBD) allocated on a 16-node all-HDD cluster. Results for an all-SSD cluster were similar and are not shown for brevity.

Figure 11. IOPS observed from a client performing random 4 KiB writes with a queue depth of 256 to a Ceph virtual block device (RBD) allocated on a 16-node all-HDD cluster, for a replicated pool (Rep 3X) and for EC 4-2 and EC 5-1 pools on the BlueStore and FileStore backends.

7 Challenges of Building Efficient Storage Backends on Raw Storage

This section describes some of the challenges that the Ceph team faced when building a storage backend on raw storage devices from scratch.

7.1 Cache Sizing and Writeback

The OS fully utilizes the machine's memory by dynamically growing or shrinking the size of the page cache based on the application's memory usage. It writes dirty pages back to disk in the background, trying not to adversely affect foreground I/O, so that memory can be quickly reused when applications ask for it.

A storage backend based on a local file system automatically inherits the benefits of the OS page cache. A storage backend that bypasses the local file system, however, has to implement a similar mechanism from scratch (§ 4.2). In BlueStore, for example, the cache size is a fixed configuration parameter that requires manual tuning. Building an efficient user space cache with the dynamic resizing functionality of the OS page cache is an open problem shared by other projects, like PostgreSQL [25] and RocksDB [42]. With the arrival of fast NVMe SSDs, such a cache needs to be efficient enough that it does not incur overhead for write-intensive workloads, a deficiency that the current page cache suffers from [20].
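The contrast with the page cache is easier to see in code. The sketch below is a minimal user space cache with a fixed byte budget, in the spirit of what a backend that bypasses the kernel must provide for itself; it is illustrative only, not BlueStore's cache, and the class and parameter names are ours. The capacity is a static constructor argument, which is exactly the manual-tuning burden described above, whereas the kernel's page cache grows and shrinks with overall memory pressure on its own.

```python
from collections import OrderedDict

class FixedBudgetCache:
    """Minimal LRU cache with a fixed byte budget (illustrative only).

    A backend that bypasses the OS page cache must supply something like
    this itself; unlike the page cache, the budget does not adapt to
    memory pressure -- it is a knob the administrator has to tune."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes   # fixed at startup, no dynamic resizing
        self.used = 0
        self._entries = OrderedDict()    # key -> bytes, ordered by recency

    def get(self, key):
        buf = self._entries.get(key)
        if buf is not None:
            self._entries.move_to_end(key)   # mark as most recently used
        return buf

    def put(self, key, buf: bytes):
        if key in self._entries:
            self.used -= len(self._entries.pop(key))
        self._entries[key] = buf
        self.used += len(buf)
        # Evict least-recently-used entries once over budget.
        while self.used > self.capacity and self._entries:
            _, evicted = self._entries.popitem(last=False)
            self.used -= len(evicted)

# Example: a 1 GiB cache whose size must be chosen by hand.
cache = FixedBudgetCache(capacity_bytes=1 << 30)
cache.put("object:0x1a2b", b"\x00" * 4096)
assert cache.get("object:0x1a2b") is not None
```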

7.2 Key-value Store Efficiency

The experience of the Ceph team demonstrates that moving all of the metadata to an ordered key-value store, like RocksDB, significantly improves the efficiency of metadata operations. However, the Ceph team has also found that embedding RocksDB in BlueStore is problematic in multiple ways: (1) RocksDB's compaction and high write amplification have been the primary performance limiters when using NVMe SSDs in OSDs; (2) since RocksDB is treated as a black box, data is serialized and copied in and out of it, consuming CPU time; and (3) RocksDB has its own threading model, which limits the ability to do custom sharding. These and other problems with RocksDB and similar key-value stores keep the Ceph team researching better solutions.

7.3 CPU and Memory Efficiency

Modern compilers align and pad basic datatypes in memory so that the CPU can fetch data efficiently, thereby increasing performance. For applications with complex structs, the default layout can waste a significant amount of memory [22, 64]. Many applications are rightly not concerned with this problem because they allocate short-lived data structures. A storage backend that bypasses the OS page cache, on the other hand, runs continuously and controls almost all of a machine's memory. Therefore, the Ceph team spent a lot of time packing the structures stored in RocksDB to reduce the total metadata size and the compaction overhead. The main tricks used were delta and variable-integer encoding, sketched at the end of this subsection.

Another observation with BlueStore is that on high-end NVMe SSDs the workloads are becoming increasingly CPU-bound. For its next-generation backend, the Ceph community is exploring techniques that reduce CPU consumption, such as minimizing data serialization and deserialization, and using the SeaStar framework [79] with a shared-nothing model that avoids context switches due to locking.
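As a rough illustration of the encodings mentioned above, the sketch below delta-encodes a sorted list of offsets and packs each delta as a variable-length integer (LEB128-style, seven payload bits per byte). This is generic encoding code, not BlueStore's on-disk format, and the example values are made up; the point is only that nearby offsets shrink to one or two bytes each instead of a fixed eight.

```python
def encode_varint(n: int) -> bytes:
    """LEB128-style unsigned varint: 7 payload bits per byte,
    high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf: bytes, pos: int = 0):
    """Return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def pack_offsets(offsets):
    """Delta-encode a sorted list of offsets, then varint-pack each delta."""
    out, prev = bytearray(), 0
    for off in offsets:
        out += encode_varint(off - prev)   # deltas stay small even if offsets are huge
        prev = off
    return bytes(out)

# Example: four offsets near 1 TiB. Fixed 8-byte integers would take
# 32 bytes; delta + varint packing takes 13.
offsets = [1 << 40, (1 << 40) + 4096, (1 << 40) + 8192, (1 << 40) + 65536]
packed = pack_offsets(offsets)
print(len(packed), "bytes instead of", 8 * len(offsets))

# Round-trip check.
pos, prev, decoded = 0, 0, []
while pos < len(packed):
    delta, pos = decode_varint(packed, pos)
    prev += delta
    decoded.append(prev)
assert decoded == offsets
```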

8 Related Work

The primary motivator for BlueStore is the lack of transactions and the unscalable metadata operations in local file systems. In this section we compare BlueStore to previous research that aims to address these problems.

Transaction Support. Previous works have generally followed three approaches when introducing a transactional interface to file system users.

The first approach is to leverage the in-kernel transaction mechanism present in file systems. Examples of this approach are Btrfs' export of transaction system calls to userspace [23], Transactional NTFS [52], Valor [87], and TxFS [39]. The drawbacks of this approach are the complexity and incompleteness of the interface, and a significant implementation complexity. For example, Btrfs and NTFS both recently deprecated their transaction interfaces [15, 53], citing the difficulty of guaranteeing correct or safe usage, which corroborates FileStore's experience (§ 3.1.1). Valor [87], while not tied to a specific file system, also has a nuanced interface that requires the correct use of a combination of seven system calls, and a complex in-kernel implementation. TxFS is a recent work that introduces a simple interface built on ext4's journaling layer; however, its implementation requires a non-trivial amount of change to the Linux kernel. BlueStore, informed by FileStore's experience, avoids using file systems' in-kernel transaction infrastructure.

The second approach builds a user space file system atop a database, utilizing existing transactional semantics. For example, Amino [106] relies on Berkeley DB [66] as the backing store, and Inversion [65] stores files in a POSTGRES database [91]. While these file systems provide seamless transactional operations, they generally suffer from high performance overhead because they accrue the overhead of the layers below. BlueStore similarly leverages a transactional database, but incurs zero overhead because it eliminates the local file system and runs the database on a raw disk.

The third approach provides transactions as a first-class abstraction in the OS and implements all services, including the file system, using transactions. QuickSilver [77] is an example of such a system that uses built-in transactions for implementing a storage backend for a distributed file system. Similarly, TxOS [72] adds transactions to the Linux kernel and converts an existing file system into a transactional one. This approach, however, is too heavyweight for achieving file system transactions, and such a kernel is tricky to maintain [39].

Metadata Optimizations. A large body of work has produced a plethora of approaches to metadata optimizations in local file systems. BetrFS [47] introduces the Bϵ-tree as an indexing structure for efficient large scans. DualFS [70], hFS [110], and ext4-lazy [2] abandon the traditional FFS [61] cylinder group design and aggregate all metadata in one place to achieve significantly faster metadata operations. TableFS [75] and DeltaFS [111] store metadata in LevelDB running atop a file system and achieve orders of magnitude faster metadata operations than local file systems.

While BlueStore also stores metadata in RocksDB, a LevelDB derivative, to achieve a similar speedup, it differs from the above in two important ways: (1) in BlueStore, RocksDB runs atop a raw disk, incurring zero overhead, and (2) BlueStore keeps all metadata, including the internal metadata, in RocksDB as key-value pairs. Storing internal metadata as variable-sized key-value pairs, as opposed to fixed-sized records on disk, scales more easily. For example, the Lustre distributed file system, which uses an ext4 derivative called LDISKFS for its storage backend, has changed its on-disk format twice in a short period to accommodate increasing disk sizes [12, 13].

9 Conclusion

Distributed file system developers conventionally adopt local file systems as their storage backend. They then try to fit the general-purpose file system abstractions to their needs, incurring significant accidental complexity [14]. At the core of this convention lies the belief that developing a storage backend from scratch is an arduous process, akin to developing a new file system that takes a decade to mature.

Our paper, relying on the Ceph team's experience, shows this belief to be inaccurate. Furthermore, we find that developing a special-purpose, user space storage backend from scratch (1) reclaims the significant performance left on the table when building a backend on a general-purpose file system, (2) makes it possible to adopt novel, backward-incompatible storage hardware, and (3) enables new features by gaining complete control of the I/O stack. We hope that this experience paper will initiate discussions among storage practitioners and researchers on fresh approaches to designing distributed file systems and their storage backends.

Acknowledgments

We thank Robert Morris (our shepherd), Matias Bjørling, and the anonymous reviewers for their feedback. We would like to acknowledge the BlueStore authors, who include Igor Fedotov, Xie Xingguo, Ma Jianpeng, Allen Samuels, Varada Kari, and Haomai Wang. We also thank the members and companies of the PDL Consortium: Alibaba Group, Amazon, Datrium, Facebook, Google, Hewlett Packard Enterprise, Hitachi Ltd., Intel Corporation, IBM, Micron, Microsoft Research, NetApp, Inc., , Salesforce, Samsung Semiconductor Inc., , and Two Sigma for their interest, insights, feedback, and support. Abutalib Aghayev is supported by an SOSP 2019 student scholarship from the National Science Foundation. Michael Kuchnik is supported by an SOSP 2019 student scholarship from the ACM Special Interest Group in Operating Systems and by an NDSEG Fellowship.

365 References how-to-avoid-wasting-megabytes-of-memory-a-few--at-a- [1] Abutalib Aghayev and Peter Desnoyers. 2015. Skylight—A Window time/. on Shingled Disk Operation. In 13th USENIX Conference on File and [23] Jonathan Corbet. 2009. Supporting transactions in Btrfs. https://lwn. Storage Technologies (FAST 15). USENIX Association, Santa Clara, CA, net/Articles/361457/. USA, 135–149. https://www.usenix.org/conference/fast15/technical- [24] Jonathan Corbet. 2011. No-I/O dirty throttling. https://lwn.net/ sessions/presentation/aghayev Articles/456904/. [2] Abutalib Aghayev, Theodore Ts’o, Garth Gibson, and Peter Desnoy- [25] Jonathan Corbet. 2018. PostgreSQL’s fsync() surprise. https://lwn. ers. 2017. Evolving Ext4 for Shingled Disks. In 15th USENIX Con- net/Articles/752063/. ference on File and Storage Technologies (FAST 17). USENIX Associa- [26] . 2019. Zoned Storage. http://zonedstorage.io. tion, Santa Clara, CA, 105–120. https://www.usenix.org/conference/ [27] Anton Dmitriev. 2017. [ceph-users] All OSD fails after few requests fast17/technical-sessions/presentation/aghayev to RGW. http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017- [3] Abutalib Aghayev, Sage Weil, Greg Ganger, and George Amvrosiadis. May/017950.html. 2019. Reconciling LSM-Trees with Modern Hard Drives using BlueFS. [28] Jake Edge. 2015. Filesystem support for SMR devices. https://lwn. Technical Report CMU-PDL-19-102. CMU Parallel Data Labora- net/Articles/637035/. tory. http://www.pdl.cmu.edu/PDL-FTP/FS/CMU-PDL-19-102_abs. [29] Jake Edge. 2015. The OrangeFS distributed filesystem. https://lwn. shtml net/Articles/643165/. [4] Amazon.com, Inc. 2019. Amazon Elastic Block Store. https://aws. [30] Jake Edge. 2015. XFS: There and back ... and there again? https: amazon.com/ebs/. //lwn.net/Articles/638546/. [5] Amazon.com, Inc. 2019. Amazon S3. https://aws.amazon.com/s3/. [31] D. R. Engler, M. F. Kaashoek, and J. O’Toole, Jr. 1995. Exokernel: [6] . 2009. Queue files. https://www.kernel.org/doc/ An Architecture for Application-level Resource Documentation/block/queue-sysfs.txt. Management. In Proceedings of the Fifteenth ACM Symposium on [7] Jens Axboe. 2016. Flexible I/O Tester. git://git.kernel.dk/fio.git. Operating Systems Principles (SOSP ’95). ACM, New York, NY, USA, [8] Jens Axboe. 2016. Throttled Background Buffered Writeback. https: 251–266. https://doi.org/10.1145/224056.224076 //lwn.net/Articles/698815/. [32] Andrew Fikes. 2010. Storage Architecture and Challenges. https: [9] Matias Bjørling. 2019. From Open-Channel SSDs to Zoned Names- //cloud.google.com/files/storage_architecture_and_challenges.pdf. paces. In Linux Storage and Filesystems Conference (Vault 19). USENIX [33] Mary Jo Foley. 2018. Microsoft readies new cloud SSD storage spec for Association, Boston, MA. the Open Compute Project. https://www.zdnet.com/article/microsoft- [10] Matias Bjørling. 2019. New NVMe Specification Defines readies-new-cloud-ssd-storage-spec-for-the-open-compute- Zoned Namespaces (ZNS) as Go-To Industry Technology. project/. https://nvmexpress.org/new-nvmetm-specification-defines- [34] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The zoned-namespaces-zns-as-go-to-industry-technology/. . In Proceedings of the Nineteenth ACM Symposium [11] Matias Bjørling, Javier Gonzalez, and Philippe Bonnet. 2017. Light- on Operating Systems Principles (SOSP ’03). ACM, New York, NY, USA, NVM: The Linux Open-Channel SSD Subsystem. In 15th USENIX 29–43. 
https://doi.org/10.1145/945445.945450 Conference on File and Storage Technologies (FAST 17). USENIX Associ- [35] Mingzhe Hao, Gokul Soundararajan, Deepak Kenchammana- ation, Santa Clara, CA, 359–374. https://www.usenix.org/conference/ Hosekote, Andrew A. Chien, and Haryadi S. Gunawi. 2016. The fast17/technical-sessions/presentation/bjorling Tail at Store: A Revelation from Millions of Hours of Disk and [12] Artem Blagodarenko. 2016. Scaling LDISKFS for the future. https: SSD Deployments. In 14th USENIX Conference on File and Storage //www..com/watch?v=ubbZGpxV6zk. Technologies (FAST 16). USENIX Association, Santa Clara, CA, 263– [13] Artem Blagodarenko. 2017. Scaling LDISKFS for the future. Again. 276. https://www.usenix.org/conference/fast16/technical-sessions/ https://www.youtube.com/watch?v=HLfEd0_Dq0U. presentation/hao [14] Frederick P Brooks Jr. 1986. No Silver Bullet—Essence and Accident [36] Christoph Hellwig. 2009. XFS: The Big Storage File System for Linux. in Software Engineering. USENIX ;login issue 34, 5 (2009). [15] Btrfs. 2019. Btrfs Changelog. https://btrfs.wiki.kernel.org/index.php/ [37] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, Changelog. Robert N. Sidebotham, and M. West. 1987. Scale and Performance in [16] David C. 2018. [ceph-users] Luminous | PG split causing slow re- a Distributed File System. In Proceedings of the Eleventh ACM Sym- quests. http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018- posium on Operating Systems Principles (SOSP ’87). ACM, New York, February/024984.html. NY, USA, 1–2. https://doi.org/10.1145/41457.37500 [17] Luoqing Chao and Thunder Zhang. 2015. Implement Object Storage [38] Joel Hruska. 2019. Western Digital to Demo Dual Actu- with SMR based key-value store. https://www.snia.org/sites/default/ ator HDD, Will Use SMR to Hit 18TB Capacity. https: files/SDC15_presentations/smr/QingchaoLuo_Implement_Object_ //www.extremetech.com/computing/287319-western-digital- Storage_SMR_Key-Value_Store.pdf. to-demo-dual-actuator-hdd-will-use-smr-to-hit-18tb-capacity. [18] Dave Chinner. 2010. XFS Delayed Logging Design. [39] Yige Hu, Zhiting Zhu, Ian Neal, Youngjin Kwon, Tianyu Cheng, Vijay https://www.kernel.org/doc/Documentation/filesystems/xfs- Chidambaram, and Emmett Witchel. 2018. TxFS: Leveraging File- delayed-logging-design.txt. System Crash Consistency to Provide ACID Transactions. In 2018 [19] Dave Chinner. 2015. SMR Layout Optimization for XFS. http://xfs. USENIX Annual Technical Conference (USENIX ATC 18). USENIX Asso- org/images/f/f6/Xfs-smr-structure-0.2.pdf. ciation, Boston, MA, 879–891. https://www.usenix.org/conference/ [20] Dave Chinner. 2019. Re: pagecache locking (was: bcachefs status atc18/presentation/hu update) merged). https://lkml.org/lkml/2019/6/13/1794. [40] Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, [21] Alibaba Clouder. 2018. Alibaba Deploys Alibaba Open Parikshit Gopalan, Jin Li, and Sergey Yekhanin. 2012. Erasure Coding Channel SSD for Next Generation Data Centers. https: in Windows Azure Storage. In Presented as part of the 2012 USENIX //www.alibabacloud.com/blog/alibaba-deploys-alibaba-open- Annual Technical Conference (USENIX ATC 12). USENIX, Boston, MA, channel-ssd-for-next-generation-data-centers_593802. 15–26. https://www.usenix.org/conference/atc12/technical-sessions/ [22] William Cohen. 2016. How to avoid wasting megabytes of memory a presentation/huang few bytes at a time. https://developers.redhat.com/blog/2016/06/01/

366 [41] Felix Hupfeld, Toni Cortes, Björn Kolbeck, Jan Stender, Erich Focht, [57] Peter Macko, Xiongzi Ge, John Haskins Jr., James Kelley, David Slik, Matthias Hess, Jesus Malo, Jonathan Marti, and Eugenio Cesario. 2008. Keith A. Smith, and Maxim G. Smith. 2017. SMORE: A Cold Data The XtreemFS Architecture – a Case for Object-based File Systems Object Store for SMR Drives (Extended Version). CoRR abs/1705.09701 in Grids. Concurrency and Computation: Practice and Experience 20, (2017). http://arxiv.org/abs/1705.09701 17 (Dec. 2008), 2049–2060. https://doi.org/10.1002/cpe.v20:17 [58] Magic Pocket & Hardware Engineering Teams. 2018. Extending Magic [42] Facebook Inc. 2019. RocksDB Direct IO. https://github.com/facebook/ Pocket Innovation with the first petabyte scale SMR drive deployment. /wiki/Direct-IO. https://blogs.dropbox.com/tech/2018/06/extending-magic-pocket- [43] Facebook Inc. 2019. RocksDB Merge Operator. https://github.com/ innovation-with-the-first-petabyte-scale-smr-drive-deployment/. facebook/rocksdb/wiki/Merge-Operator. [59] Magic Pocket & Hardware Engineering Teams. 2019. SMR: What we [44] Facebook Inc. 2019. RocksDB Synchronous Writes. learned in our first year. https://blogs.dropbox.com/tech/2019/07/smr- https://github.com/facebook/rocksdb/wiki/Basic-Operations# what-we-learned-in-our-first-year/. synchronous-writes. [60] Lars Marowsky-Brée. 2018. Ceph User Survey 2018 results. https: [45] Inc. 2006. XFS Allocation Groups. http: //ceph.com/ceph-blog/ceph-user-survey-2018-results/. //xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure/tmp/en- [61] Marshall K McKusick, William N Joy, Samuel J Leffler, and Robert S US/html/Allocation_Groups.html. Fabry. 1984. A Fast File System for . ACM Transactions on [46] INCITS T10 Technical Committee. 2014. Information technology - Computer Systems (TOCS) 2, 3 (1984), 181–197. Zoned Block Commands (ZBC). Draft Standard T10/BSR INCITS [62] Chris Mellor. 2019. Toshiba embraces shingling for next-gen 536. American National Standards Institute, Inc. Available from MAMR HDDs. https://blocksandfiles.com/2019/03/11/toshiba-mamr- http://www.t10.org/drafts.htm. statements-have-shingling-absence/. [47] William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Es- [63] Changwoo Min, Woon-Hak Kang, Taesoo Kim, Sang-Won Lee, and met, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Young Ik Eom. 2015. Lightweight Application-Level Crash Consis- Leif Walsh, Michael A. Bender, Martin Farach-Colton, Rob Johnson, tency on Transactional Flash Storage. In 2015 USENIX Annual Techni- Bradley C. Kuszmaul, and Donald E. Porter. 2015. BetrFS: Write- cal Conference (USENIX ATC 15). USENIX Association, Santa Clara, Optimization in a Kernel File System. Trans. Storage 11, 4, Article 18 CA, 221–234. https://www.usenix.org/conference/atc15/technical- (Nov. 2015), 29 pages. https://doi.org/10.1145/2798729 session/presentation/min [48] Sooman Jeong, Kisung Lee, Seongjin Lee, Seoungbum Son, and Youjip [64] Sumedh N. 2013. Coding for Performance: Data alignment and Won. 2013. I/O Stack Optimization for Smartphones. In Presented as structures. https://software.intel.com/en-us/articles/coding-for- part of the 2013 USENIX Annual Technical Conference (USENIX ATC 13). performance-data-alignment-and-structures. USENIX, San Jose, CA, 309–320. https://www.usenix.org/conference/ [65] Michael A. Olson. 1993. The Design and Implementation of the atc13/technical-sessions/presentation/jeong Inversion File System. In USENIX Winter. 
[49] Theodore Johnson and Dennis Shasha. 1994. 2Q: A Low Overhead [66] Michael A. Olson, Keith Bostic, and Margo Seltzer. 1999. Berkeley DB. High Performance Buffer Management Replacement Algorithm. In In Proceedings of the Annual Conference on USENIX Annual Technical Proceedings of the 20th International Conference on Very Large Data Conference (ATEC ’99). USENIX Association, Berkeley, CA, USA, 43– Bases (VLDB ’94). Morgan Kaufmann Publishers Inc., San Francisco, 43. http://dl.acm.org/citation.cfm?id=1268708.1268751 CA, USA, 439–450. http://dl.acm.org/citation.cfm?id=645920.672996 [67] OpenStack Foundation. 2017. 2017 Annual Report. https://www. [50] M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Hector M. openstack.org/assets/reports/OpenStack-AnnualReport2017.pdf. Briceño, Russell Hunt, David Mazières, Thomas Pinckney, Robert [68] Adrian Palmer. 2015. SMRFFS-EXT4—SMR Friendly File System. Grimm, John Jannotti, and Kenneth Mackenzie. 1997. Application https://github.com/Seagate/SMR_FS-EXT4. Performance and Flexibility on Exokernel Systems. In Proceedings [69] Swapnil Patil and Garth Gibson. 2011. Scale and Concurrency of of the Sixteenth ACM Symposium on Operating Systems Principles GIGA+: File System Directories with Millions of Files. In Proceedings of (SOSP ’97). ACM, New York, NY, USA, 52–65. https://doi.org/10.1145/ the 9th USENIX Conference on File and Stroage Technologies (FAST’11). 268998.266644 USENIX Association, Berkeley, CA, USA, 13–13. http://dl.acm.org/ [51] Yangwook Kang, Rekha Pitchumani, Pratik Mishra, Yang-suk Kee, citation.cfm?id=1960475.1960488 Francisco Londono, Sangyoon Oh, Jongyeol Lee, and Daniel D. G. [70] Juan Piernas. 2002. DualFS: A New Journaling File System without Lee. 2019. Towards Building a High-performance, Scale-in Key-value Meta-data Duplication. In In Proceedings of the 16th International Storage System. In Proceedings of the 12th ACM International Confer- Conference on Supercomputing. 137–146. ence on Systems and Storage (SYSTOR ’19). ACM, New York, NY, USA, [71] Poornima G and Rajesh Joseph. 2016. Metadata Performance Bottle- 144–154. https://doi.org/10.1145/3319647.3325831 necks in . https://www.slideshare.net/GlusterCommunity/ [52] John Kennedy and Michael Satran. 2018. About Transactional performance-bottlenecks-for-metadata-workload-in-gluster-with- NTFS. https://docs.microsoft.com/en-us/windows/desktop/fileio/ poornima-gurusiddaiah-rajesh-joseph. about-transactional-. [72] Donald E. Porter, Owen S. Hofmann, Christopher J. Rossbach, Alexan- [53] John Kennedy and Michael Satran. 2018. Alternatives to using Trans- der Benn, and Emmett Witchel. 2009. Operating System Transactions. actional NTFS. https://docs.microsoft.com/en-us/windows/desktop/ In Proceedings of the ACM SIGOPS 22Nd Symposium on Operating fileio/deprecation-of-txf. Systems Principles (SOSP ’09). ACM, New York, NY, USA, 161–176. [54] Jaeho Kim, Donghee Lee, and Sam H. Noh. 2015. Towards SLO Com- https://doi.org/10.1145/1629575.1629591 plying SSDs Through OPS Isolation. In 13th USENIX Conference on File [73] Lee Prewitt. 2019. SMR and ZNS – Two Sides of the Same Coin. and Storage Technologies (FAST 15). USENIX Association, Santa Clara, https://www.youtube.com/watch?v=jBxzO6YyMxU. CA, 183–189. https://www.usenix.org/conference/fast15/technical- [74] Red Hat Inc. 2019. GlusterFS Architecture. https://docs.gluster.org/ sessions/presentation/kim_jaeho en/latest/Quick-Start-Guide/Architecture/. [55] Butler Lampson and Howard E Sturgis. 1979. 
Crash recovery in a [75] Kai Ren and Garth Gibson. 2013. TABLEFS: Enhancing Metadata distributed system. (1979). Efficiency in the Local File System. In Presented as part of the 2013 [56] Adam Leventhal. 2016. APFS in Detail: Overview. http://dtrace.org/ USENIX Annual Technical Conference (USENIX ATC 13). USENIX, San blogs/ahl/2016/06/19/apfs-part1/. Jose, CA, USA, 145–156. https://www.usenix.org/conference/atc13/ technical-sessions/presentation/ren

367 [76] Mendel Rosenblum and John K. Ousterhout. 1991. The Design and NY, USA, 340–355. https://doi.org/10.1145/16894.16888 Implementation of a Log-structured File System. In Proceedings of the [92] ZAR team. 2019. "Write hole" phenomenon. http://www.raid- Thirteenth ACM Symposium on Operating Systems Principles (SOSP recovery-guide.com/raid5-write-hole.aspx. ’91). ACM, New York, NY, USA, 1–15. https://doi.org/10.1145/121132. [93] ThinkParQ. 2018. An introduction to BeeGFS. https://www.beegfs. 121137 io/docs/whitepapers/Introduction_to_BeeGFS_by_ThinkParQ.pdf. [77] Frank Schmuck and Jim Wylie. 1991. Experience with Transactions [94] Stephen C Tweedie. 1998. Journaling the Linux ext2fs Filesystem. In in QuickSilver. In Proceedings of the Thirteenth ACM Symposium on The Fourth Annual Linux Expo. Durham, NC, USA. Operating Systems Principles (SOSP ’91). ACM, New York, NY, USA, [95] Sage Weil. 2009. Re: [RFC] big fat transaction . https://lwn.net/ 239–253. https://doi.org/10.1145/121132.121171 Articles/361472/. [78] Thomas J. E. Schwarz, Qin Xin, Ethan L. Miller, Darrell D. E. Long, [96] Sage Weil. 2009. [RFC] big fat transaction ioctl. https://lwn.net/ Andy Hospodor, and Spencer Ng. 2004. Disk Scrubbing in Large Articles/361439/. Archival Storage Systems. In Proceedings of the The IEEE Computer [97] Sage Weil. 2011. [ v3] introduce sys_syncfs to sync a single Society’s 12th Annual International Symposium on Modeling, Analysis, file system. https://lwn.net/Articles/433384/. and Simulation of Computer and Telecommunications Systems (MAS- [98] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, COTS ’04). IEEE Computer Society, Washington, DC, USA, 409–418. and Carlos Maltzahn. 2006. Ceph: A Scalable, High-performance http://dl.acm.org/citation.cfm?id=1032659.1034226 Distributed File System. In Proceedings of the 7th Symposium on Op- [79] Seastar. 2019. Shared-nothing Design. http://seastar.io/shared- erating Systems Design and Implementation (OSDI ’06). USENIX Asso- nothing/. ciation, Berkeley, CA, USA, 307–320. http://dl.acm.org/citation.cfm? [80] Margo I. Seltzer. 1993. Transaction Support in a Log-Structured File id=1298455.1298485 System. In Proceedings of the Ninth International Conference on Data [99] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, and Carlos Maltzahn. Engineering. IEEE Computer Society, Washington, DC, USA, 503–510. 2006. CRUSH: Controlled, Scalable, Decentralized Placement of http://dl.acm.org/citation.cfm?id=645478.654970 Replicated Data. In Proceedings of the 2006 ACM/IEEE Conference [81] Kai Shen, Stan Park, and Men Zhu. 2014. Journaling of Journal Is on Supercomputing (SC ’06). ACM, New York, NY, USA, Article 122. (Almost) Free. In Proceedings of the 12th USENIX Conference on File https://doi.org/10.1145/1188455.1188582 and Storage Technologies (FAST 14). USENIX, Santa Clara, CA, 287– [100] Sage A. Weil, Andrew W. Leung, Scott A. Brandt, and Carlos Maltzahn. 293. https://www.usenix.org/conference/fast14/technical-sessions/ 2007. RADOS: A Scalable, Reliable Storage Service for Petabyte-scale presentation/shen Storage Clusters. In Proceedings of the 2Nd International Workshop [82] Anton Shilov. 2017. Seagate Ships 35th Millionth SMR on Petascale Data Storage: Held in Conjunction with Supercomputing HDD, Confirms HAMR-Based Drives in Late 2018. https: ’07 (PDSW ’07). ACM, New York, NY, USA, 35–44. https://doi.org/10. //www.anandtech.com/show/11315/seagate-ships-35th-millionth- 1145/1374596.1374606 smr-hdd-confirms-hamrbased-hard-drives-in-late-2018. 
[101] Brent Welch, Marc Unangst, Zainul Abbasi, Garth Gibson, Brian [83] A. Shilov. 2019. Western Digital: Over Half of Data Center HDDs Will Mueller, Jason Small, Jim Zelenka, and Bin Zhou. 2008. Scalable Use SMR by 2023. https://www.anandtech.com/show/14099/western- Performance of the Panasas Parallel File System. In Proceedings of digital-over-half-of-dc-hdds-will-use-smr-by-2023. the 6th USENIX Conference on File and Storage Technologies (FAST’08). [84] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert USENIX Association, Berkeley, CA, USA, Article 2, 17 pages. http: Chansler. 2010. The Hadoop Distributed File System. In Proceed- //dl.acm.org/citation.cfm?id=1364813.1364815 ings of the 2010 IEEE 26th Symposium on Systems and [102] Lustre Wiki. 2017. Introduction to Lustre Architecture. http://wiki. Technologies (MSST) (MSST ’10). IEEE Computer Society, Washington, lustre.org/images/6/64/LustreArchitecture-v4.pdf. DC, USA, 1–10. https://doi.org/10.1109/MSST.2010.5496972 [103] Wikipedia. 2018. Btrfs History. https://en.wikipedia.org/wiki/Btrfs# [85] Chris Siebenmann. 2011. About the order that readdir() returns entries History. in. https://utcc.utoronto.ca/~cks/space/blog/unix/ReaddirOrder. [104] Wikipedia. 2018. XFS History. https://en.wikipedia.org/wiki/XFS# [86] Chris Siebenmann. 2013. ZFS transaction groups and the History. ZFS Intent Log. https://utcc.utoronto.ca/~cks/space/blog/solaris/ [105] Wikipedia. 2019. Cache flushing. https://en.wikipedia.org/wiki/Disk_ ZFSTXGsAndZILs. buffer#Cache_flushing. [87] Richard P. Spillane, Sachin Gaikwad, Manjunath Chinni, Erez Zadok, [106] Charles P. Wright, Richard Spillane, Gopalan Sivathanu, and Erez and Charles P. Wright. 2009. Enabling Transactional File Access Zadok. 2007. Extending ACID Semantics to the File System. Trans. via Lightweight Kernel Extensions. In 7th USENIX Conference on Storage 3, 2, Article 4 (June 2007). https://doi.org/10.1145/1242520. File and Storage Technologies (FAST 09). USENIX Association, San 1242521 Francisco, CA. https://www.usenix.org/conference/fast-09/enabling- [107] Fengguang Wu. 2012. I/O-less Dirty Throttling. https://events. transactional-file-access-lightweight-kernel-extensions linuxfoundation.org/images/stories/pdf/lcjp2012_wu.pdf. [88] Stas Starikevich. 2016. [ceph-users] RadosGW performance degrada- [108] Shiqin Yan, Huaicheng Li, Mingzhe Hao, Michael Hao Tong, Swami- tion on the 18 millions objects stored. http://lists.ceph.com/pipermail/ nathan Sundararaman, Andrew A. Chien, and Haryadi S. Gunawi. ceph-users-ceph.com/2016-September/012983.html. 2017. Tiny-Tail Flash: Near-Perfect Elimination of Garbage Collec- [89] Jan Stender, Björn Kolbeck, Mikael Högqvist, and Felix Hupfeld. 2010. tion Tail Latencies in NAND SSDs. In 15th USENIX Conference on File BabuDB: Fast and Efficient File System Metadata Storage. In Proceed- and Storage Technologies (FAST 17). USENIX Association, Santa Clara, ings of the 2010 International Workshop on Storage Network Architec- CA, 15–28. https://www.usenix.org/conference/fast17/technical- ture and Parallel I/Os (SNAPI ’10). IEEE Computer Society, Washing- sessions/presentation/yan ton, DC, USA, 51–58. https://doi.org/10.1109/SNAPI.2010.14 [109] Lawrence Ying and Theodore Ts’o. 2017. Dynamic Hybrid- [90] Michael Stonebraker. 1981. Operating System Support for Database SMR: an OCP proposal to improve data center disk drives. Management. Communications of the ACM 24, 7 (July 1981), 412–418. 
https://www.blog.google/products/google-cloud/dynamic-hybrid- https://doi.org/10.1145/358699.358703 smr-ocp-proposal-improve-data-center-disk-drives/. [91] Michael Stonebraker and Lawrence A. Rowe. 1986. The Design of [110] Zhihui Zhang and Kanad Ghose. 2007. hFS: A Hybrid File System POSTGRES. In Proceedings of the 1986 ACM SIGMOD International Prototype for Improving Small File and Metadata Performance. In Conference on Management of Data (SIGMOD ’86). ACM, New York, Proceedings of the 2Nd ACM SIGOPS/EuroSys European Conference

368 on Computer Systems 2007 (EuroSys ’07). ACM, New York, NY, USA, Press, Piscataway, NJ, USA, Article 3, 15 pages. http://dl.acm.org/ 175–187. https://doi.org/10.1145/1272996.1273016 citation.cfm?id=3291656.3291660 [111] Qing Zheng, Charles D. Cranor, Danhao Guo, Gregory R. Ganger, [112] Alexey Zhuravlev. 2016. ZFS: Metadata Performance. https: George Amvrosiadis, Garth A. Gibson, Bradley W. Settlemyer, Gary //www.eofs.eu/_media/events/lad16/02_zfs_md_performance_ Grider, and Fan Guo. 2018. Scaling Embedded In-situ Indexing with improvements_zhuravlev.pdf. deltaFS. In Proceedings of the International Conference for High Perfor- mance Computing, Networking, Storage, and Analysis (SC ’18). IEEE
