GassyFS: An In-Memory File System That Embraces Volatility

Noah Watkins, Michael Sevilla, Carlos Maltzahn
University of California, Santa Cruz
{jayhawk, msevilla, [email protected]}

Abstract

For many years storage systems were designed for slow storage devices. However, the average speed of these devices has been growing exponentially, making traditional storage system designs increasingly inadequate. Sufficient time must be dedicated to redesigning future storage systems or they might never adequately support fast storage devices using traditional storage semantics. In this paper, we argue that storage systems should expose a persistence-performance trade-off to applications that are willing to explicitly take control over durability. We describe our prototype system called GassyFS that stores file system data in distributed remote memory and provides support for checkpointing file system state. As a consequence of this design, we explore a spectrum of mechanisms and policies for efficiently sharing data across file system boundaries.

Figure 1: GassyFS is comparable to tmpfs for genomic benchmarks that fit on one node, yet it is designed to scale beyond one node. Its simplicity and well-defined workflows for its applications let us explore new techniques for sharing data.

1 Introduction

A wide range of data-intensive applications that access large working sets managed in a file system can easily become bound by the cost of I/O. As a workaround, practitioners exploit the availability of inexpensive RAM to significantly accelerate application performance using in-memory file systems such as tmpfs [10]. Unfortunately, durability and scalability are major challenges that users must address on their own.

While the use of tmpfs is a quick and effective solution to accelerating application performance, in order to bound the data loss inherent in memory-based storage, practitioners modify application workflows to bracket periods of high-performance I/O with explicit control over durability (e.g. checkpointing), creating a trade-off between persistence and performance. But when single-node scalability limits are reached, applications are faced with a choice: either adopt a distributed algorithm through an expensive development effort, or use a transparent scale-out solution that minimizes modifications to applications. In this paper we consider the latter solution, and observe that existing scale-out storage systems do not provide the persistence-performance trade-off despite applications already using a homegrown strategy.

Figure 1 shows how our system, GassyFS, compares to tmpfs when the working set for our genomics benchmarks fits on a single node. That figure demonstrates the types of workflows we target: the user ingests BAM input files¹ (10GB, 40MB), computes results, and checkpoints the results back into durable storage (24GB) using cp and sync. The slowdowns reported in Figure 1 demonstrate that GassyFS has performance comparable to tmpfs, in our limited tests, with the added benefit that it is designed to scale beyond one node.

¹The workload uses SAMtools, a popular genomics tool kit. The jobs processed are (input/output sizes): sort (10GB/10GB), shuffle (10GB/52MB), convert to VCF (40MB/160MB), and merge (20GB/13.4GB).
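To make the bracketing pattern concrete, here is a minimal sketch of the tmpfs-style workflow described above. The paths, the use of Python's subprocess module, and the specific samtools invocation are illustrative assumptions; only the overall pattern (stage inputs in, compute in memory, then persist results with cp and sync) comes from the text.

import subprocess

# Illustrative paths (assumptions): an in-memory tmpfs mount and a durable target.
TMPFS_DIR = "/mnt/tmpfs/job"       # fast, volatile working area
DURABLE_DIR = "/data/results/job"  # durable storage, e.g. a local disk or network mount

def run(cmd):
    """Run a shell command, raising an error if it fails."""
    subprocess.run(cmd, shell=True, check=True)

# 1. Ingest: stage BAM input files into the in-memory file system.
run(f"cp /data/inputs/sample.bam {TMPFS_DIR}/")

# 2. Compute: run the I/O-heavy stage entirely against RAM
#    (sort is one of the SAMtools jobs in the benchmark above).
run(f"samtools sort -o {TMPFS_DIR}/sample.sorted.bam {TMPFS_DIR}/sample.bam")

# 3. Persist: the application explicitly trades performance for durability
#    by copying results out of the volatile mount and flushing them with sync.
run(f"cp {TMPFS_DIR}/sample.sorted.bam {DURABLE_DIR}/")
run("sync")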
Scaling the workload to volumes that do not fit into RAM without a distributed solution limits the scope of solutions to expensive, emerging byte-addressable media.

One approach is to utilize the caching features found in some distributed storage systems. These systems are often composed from commodity hardware with many cores, lots of memory, high-performance networking, and a variety of storage media with differing capabilities. Applications that wish to take advantage of the performance heterogeneity of media (e.g. HDD vs SSD) may use caching features when available, but such systems tend to use coarse-grained abstractions, impose redundancy costs at all layers, and do not expose abstractions for volatile storage; memory is only used internally. These limitations prevent applications that explicitly manage durability from scaling through the use of existing systems. Furthermore, existing systems originally designed for slow media may not be able to fully exploit the speed of RAM over networks that offer direct memory access. What is needed are scale-out solutions that allow applications to fine-tune their durability requirements to achieve higher performance when possible.

Existing solutions center on large-volume, high-performance storage constructed from off-the-shelf hardware and software components. For instance, iSCSI extensions for RDMA can be combined with remote RAM disks to provide a high-performance volatile block store formatted with a local file system or used as a block cache. While easy to assemble, this approach requires coarse-grained data management, and I/O must be funneled through the host for it to be made persistent. Non-volatile memories with RAM-like performance are emerging, but software and capacity are a limiting factor. While easy to set up and scale, the problem of forcing all I/O related to persistence through a host is exacerbated as the volume of data managed increases. The inability of these solutions to exploit application-driven persistence forces us to look to other architectures.

In the remainder of this paper we present a prototype file system called GassyFS, a distributed in-memory file system with explicit checkpointing for persistence. We view GassyFS as a stop-gap solution to providing applications with access to the very high-performance, large-volume persistent memories that are projected to be available in the coming years, allowing applications to begin exploring the implications of a changing I/O performance profile. As a result of including support for checkpointing in our system, we encountered a number of ways that we envision data to be shared without crossing file system boundaries, by providing data management features for attaching shared checkpoints and external storage systems, allowing data to reside within the context of GassyFS as long as possible.

Figure 2: GassyFS has facilities for explicitly managing persistence to different storage targets. A checkpointing infrastructure gives GassyFS flexible policies for persisting namespaces and federating data.

2 Architecture

The architecture of GassyFS is illustrated in Figure 2. The core of the file system is a user-space library that implements a POSIX file interface. File system metadata is managed locally in memory, and file data is distributed across a pool of network-attached RAM managed by worker nodes and accessible over RDMA or Ethernet. Applications access GassyFS through a standard FUSE mount, or may link directly to the library to avoid any overhead that FUSE may introduce.

By default all data in GassyFS is non-persistent. That is, all metadata and file data is kept in memory, and any node failure will result in data loss. In this mode GassyFS can be thought of as a high-volume tmpfs that can be instantiated and destroyed as needed, or kept mounted and used by applications with multiple stages of execution. The differences between GassyFS and tmpfs become apparent when we consider how users deal with durability concerns.

At the bottom of Figure 2 is a set of storage targets that can be used for managing persistent checkpoints of GassyFS. We will discuss durability in more detail in the next section. Finally, we support a form of file system federation that allows checkpoint content to be accessed remotely to enable efficient data sharing between users over a wide-area network. Federation and data sharing are discussed in Section 6.
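As a rough illustration of the data placement just described, the following self-contained sketch (not GassyFS code) stripes a file's blocks round-robin across a pool of "nodes". The in-process dictionaries stand in for network-attached RAM that a real deployment would reach over RDMA or Ethernet, and the block size and node names are arbitrary assumptions.

BLOCK_SIZE = 4096  # illustrative block size

class RemoteRamPool:
    """Stand-in for a pool of network-attached RAM segments.

    Each 'node' here is just a dict; a real system would issue RDMA or
    Ethernet transfers to worker nodes instead of local dict operations.
    """
    def __init__(self, nodes):
        self.nodes = {name: {} for name in nodes}
        self.order = list(nodes)

    def _locate(self, path, block_no):
        # Round-robin placement: block i of a file lives on node i % N.
        node = self.order[block_no % len(self.order)]
        return self.nodes[node], (path, block_no)

    def write(self, path, offset, data):
        while data:
            block_no, off = divmod(offset, BLOCK_SIZE)
            store, key = self._locate(path, block_no)
            block = bytearray(store.get(key, b"\0" * BLOCK_SIZE))
            n = min(len(data), BLOCK_SIZE - off)
            block[off:off + n] = data[:n]
            store[key] = bytes(block)
            data, offset = data[n:], offset + n

    def read(self, path, offset, length):
        out = bytearray()
        while length > 0:
            block_no, off = divmod(offset, BLOCK_SIZE)
            store, key = self._locate(path, block_no)
            block = store.get(key, b"\0" * BLOCK_SIZE)
            n = min(length, BLOCK_SIZE - off)
            out += block[off:off + n]
            offset, length = offset + n, length - n
        return bytes(out)

# Usage: a file larger than one block is spread across the pool.
pool = RemoteRamPool(["node0", "node1", "node2"])
pool.write("/bam/sample.bam", 0, b"BAM\x01" + b"x" * 5000)
assert pool.read("/bam/sample.bam", 0, 4) == b"BAM\x01"

Metadata (the path-to-block mapping) stays with the client library, mirroring the paper's split between locally managed metadata and remotely stored file data.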
3 Durability

Persistence is achieved in GassyFS using a checkpoint feature that materializes a consistent view of the file system. In contrast to explicitly copying data into and out of tmpfs, users instruct GassyFS to generate a checkpoint of a configurable subset of the file system. Each checkpoint may be stored on a variety of supported backends, such as a local disk or SSD, a distributed storage system such as Ceph, RADOS, or GlusterFS, or a cloud-based service such as Amazon S3.

A checkpoint is saved using a copy-on-write mechanism that generates a new, persistent version of the file system. Users may forego the ability to time travel and only use checkpointing to recover from local failures. However, generating checkpoint versions allows for robust sharing of checkpoints while a file system is actively being used.

GassyFS takes advantage of its distributed design to avoid the single-node I/O bottleneck that is present when persisting data stored within tmpfs. Rather, in GassyFS each worker node performs checkpoint I/O in parallel with all other nodes, storing data to a locally attached disk or to a networked storage system that supports parallel I/O. Next we discuss the modes of operation and how content in a checkpoint is controlled.

Figure 3: Namespace checkpointing modes. Top: full/partial checkpoints. Middle: checkpointing attached file systems (e.g., connected with POSIX). Bottom: checkpointing inode types (e.g., code to access remote data).

4 Extensible Namespaces

In the previous section we discussed how persistence is achieved in GassyFS using a basic versioning checkpoint scheme.

Since the mount is not managed by GassyFS, it has no ability to integrate the external content or maintain references within a checkpoint. And second, all I/O to the external mount must flow through the host, preventing fea-
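To illustrate the copy-on-write checkpoint versioning described in Section 3, here is a toy sketch of a versioned namespace. The class, its methods, and the restore behavior are invented for this example and do not reflect the actual GassyFS implementation, which also supports checkpointing a configurable subset of the namespace and parallel per-node checkpoint I/O.

class VersionedNamespace:
    """Toy copy-on-write namespace mapping path -> file contents (bytes).

    checkpoint() freezes the current mapping as a version by reference;
    the first write after a checkpoint copies the mapping, so the frozen
    version is left untouched (copy-on-write).
    """
    def __init__(self):
        self.live = {}       # mutable in-memory view of the namespace
        self.versions = []   # frozen checkpoint versions
        self._dirty = True   # has live diverged from the last checkpoint?

    def write(self, path, data):
        if not self._dirty:
            # Copy only when a frozen checkpoint would otherwise be disturbed.
            self.live = dict(self.live)
            self._dirty = True
        self.live[path] = data

    def checkpoint(self):
        """Record a consistent version of the namespace; returns its id."""
        self.versions.append(self.live)  # cheap: keep a reference, not a deep copy
        self._dirty = False
        return len(self.versions) - 1

    def restore(self, version_id):
        """Recover the live namespace from an earlier version."""
        self.live = dict(self.versions[version_id])
        self._dirty = True

# Usage: checkpoint, keep writing, and the old version stays intact.
ns = VersionedNamespace()
ns.write("/results/out.vcf", b"variant calls")
v0 = ns.checkpoint()
ns.write("/results/out.vcf", b"updated calls")
assert ns.versions[v0]["/results/out.vcf"] == b"variant calls"
ns.restore(v0)
assert ns.live["/results/out.vcf"] == b"variant calls"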