Understanding and Finding Crash-Consistency Bugs in Parallel File Systems
Jinghan Sun, Chen Wang, Jian Huang, and Marc Snir
University of Illinois at Urbana-Champaign

Abstract

Parallel file systems (PFSes) and parallel I/O libraries have been the backbone of high-performance computing (HPC) infrastructures for decades. However, their crash consistency bugs have not been extensively studied, and the corresponding bug-finding or testing tools are lacking. In this paper, we first conduct a thorough bug study on popular PFSes, such as BeeGFS and OrangeFS, with a cross-stack approach that covers the HPC I/O library, the PFS, and interactions with local file systems. The study results drive our design of a scalable testing framework, named PFSCHECK. PFSCHECK is easy to use and incurs low performance overhead, as it automatically generates test cases for triggering potential crash-consistency bugs and traces essential file operations with little overhead. PFSCHECK is scalable for supporting large-scale HPC clusters, as it exploits parallelism to facilitate the verification of persistent storage states.

1 Introduction

Parallel file systems are the standard I/O systems for large-scale supercomputers. They aim at maximizing storage capacity as well as I/O bandwidth for files shared by parallel programs. They scale to tens of thousands of disks accessed by hundreds of thousands of processes.

To avoid data loss, various fault-tolerance techniques have been proposed for PFSes, including checkpointing [12,15,17] and journaling [3,14]. However, crash-consistency bugs still happen and cause severe damage. For instance, a severe data loss took place at Texas Tech HPCC after two power outages in 2016, resulting in metadata inconsistencies in its Lustre file system [1]; the most recent crash in Lustre on the Stampede supercomputer suspended its service for six days. As supercomputing time is expensive, both long recovery times and loss of data are costly.

To understand crash-consistency bugs, researchers have conducted intensive studies on commodity file systems [2,7,13,16]. Pillai et al. [16] investigated the characteristics of crash vulnerabilities in Linux file systems, such as ext2, ext4, and btrfs. To alleviate these bugs, recent studies have applied verification techniques to develop bug-free file systems [6,21]. Prior researchers also exploited model checking [4,26] and fuzzing [11,25] techniques to pinpoint crash-consistency bugs against defined specifications. However, most of these prior studies focused on regular file systems. Few of them can be directly applied to PFSes, due to the unique architecture of a PFS, its workload characteristics, and the increased complexity of the HPC I/O stack. Specifically, files can be shared by all processes in a parallel job, and I/O is performed through a variety of parallel I/O libraries which may have their own crash-consistency implementation. A proper attribution of application-level bugs to one layer or another in the I/O stack depends on the "contract" between each layer.

To further understand crash-consistency bugs in parallel I/O, we conduct a study with popular PFSes, including BeeGFS and OrangeFS. We manually create seven typical I/O workloads with the parallel I/O library HDF5 [5], and investigate crash-consistency bugs across the I/O stack. Our study results demonstrate that workloads on PFSes suffer from 2.6-3.8× more crash-consistency bugs than on local file systems (see Table 2). We also develop a study methodology to distinguish bugs in the PFS from bugs in the parallel I/O library, using a formal definition of the crash-consistency contract. We find that the number of crash-consistency bugs in parallel I/O libraries is comparable to that in PFSes.

To identify crash-consistency bugs in various PFSes and parallel I/O libraries, and to facilitate their design and implementation, we propose to develop a scalable and generic testing framework, named PFSCHECK, with four major goals: (1) PFSCHECK should be easy to use, with as much automation of testing as possible; (2) PFSCHECK should be lightweight, with minimal performance overhead; (3) PFSCHECK should be scalable with the increasing complexity of PFS configurations; (4) PFSCHECK should be accurate, identifying the exact locations of crash-consistency bugs.

To achieve these goals, we develop PFSCHECK with four major components: (1) a PFS-specific automated test-case generator, which will automatically generate a limited number of effective test cases for crash-consistency testing; the generator follows POSIX APIs and hides the underlying I/O library details from upper-level developers; (2) a lightweight logging mechanism, which will trace the essential file operations on both the client and the storage server side with low overhead; (3) a fast crash-state exploration mechanism, which will efficiently prune the large number of crash states brought by parallel workloads and multiple servers; (4) a bug classifier, which will properly attribute the reported crash-consistency bugs to the PFS or the I/O library under the given crash-consistency model. We wish to develop PFSCHECK into an open-source testing platform that can facilitate the testing of developed and developing parallel file systems and parallel I/O libraries.
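To make the crash-state explosion that component (3) must contend with concrete, here is a minimal sketch (ours, not PFSCHECK's implementation) that enumerates candidate crash states for a toy two-node trace; the node names and operation strings are illustrative only. Each candidate state picks one persisted prefix per node, so the naive state count grows multiplicatively with the number of servers, which is why pruning is essential.

```python
from itertools import product

# Hypothetical per-node operation logs (illustrative names only): each node
# persists its own operations in order, but at the moment of a crash the
# nodes may have persisted prefixes of different lengths.
logs = {
    "meta-node":    ["creat(tmp)", "link(tmp)", "rename(tmp, file)"],
    "storage-node": ["creat(chunk)", "append(chunk)"],
}

def candidate_crash_states(logs):
    """Enumerate naive crash states: one persisted prefix per node."""
    nodes = list(logs)
    prefix_lengths = [range(len(logs[n]) + 1) for n in nodes]
    for lengths in product(*prefix_lengths):
        yield {n: logs[n][:k] for n, k in zip(nodes, lengths)}

# 4 * 3 = 12 naive states for this toy trace; PFSCHECK's exploration must
# prune such combinations, e.g. by discarding states that violate the
# cross-node ordering implied by client/server messages.
print(len(list(candidate_crash_states(logs))))
```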
[Figure 1 traces the ARVR workload across metadata-node #1, storage-node #1, and storage-node #2, showing the client/server messages and the local file operations on each node (creat, setattr, link, rename, append, unlink, fsync), the markers #1-#3 for the three crash-vulnerable orderings, and the recovery step.]
Figure 1: Crash-Vulnerability Examples in BeeGFS.

Table 1: Test suites used in our study.

Test suite | Initializer                 | Workload(s)                                   | Checking Items
ARVR       | file with old content       | file renamed from a tmp file with new content | atomicity of replacement
WAL        | file with old content       | logging before updating file, then remove log | update atomicity; otherwise check log content
HDF5       | two groups and two datasets | create; delete; rename; resize dimension      | atomicity; readability of existing datasets
h5part     | two groups and two datasets | write dataset and groups                      | atomicity; readability of existing datasets

2 Background and Motivation

2.1 Crash Consistency in HPC I/O

PFSes, such as BeeGFS [10], OrangeFS [23], Lustre [19], and GPFS [18], have been the main I/O infrastructure for supercomputers. PFSes were designed for different hardware configurations and different workloads than distributed file systems (DFSes) like GFS [9] and HDFS [20,22]. Specifically, the I/O stack of PFSes has three distinguishing properties. First, while the "one file per process" usage mode is common, PFSes are optimized for shared access to a single file by all processes in a large-scale parallel computing setting. A large file may be striped across all storage servers. This makes it challenging to enforce the ordering of file-system operations across a large number of server machines. Second, metadata and data blocks are managed separately. Their complicated coordination causes crash-consistency bugs, even if each server employs transactional mechanisms to enforce atomic local storage operations. Third, PFSes have a deep software stack, which often includes specific parallel I/O libraries. For example, an application may use HDF5 [5], which invokes MPI-IO [8] to execute POSIX I/O operations on the PFS.

The post-crash behaviors of an I/O system depend on what has been persisted before failures. Typically, PFSes rely on local file systems to hold data on each storage server, and the I/O library relies on the PFS. Each layer performs crash recovery assuming that the layer below satisfied its contract. However, the POSIX APIs used by file systems do not specify this contract [4], nor do PFSes clarify what they guarantee after crash recovery.

In this paper, we assume that the proper contract, which we call prefix crash consistency, is that the state of the file system after recovery is the state obtained by correctly executing a prefix of the sequence of I/O calls issued before the crash, and none of the operations following this prefix. This prefix includes all fsync operations preceding the crash; the last operation in this prefix may be only partially completed, if it is not atomic. A similar definition can be provided for other I/O systems – one needs to define which operations are atomic and which are "persist" operations. If the I/O interface supports concurrent calls, then the order between I/O calls is the partial causality order, and a prefix is a cut of that partial order rather than a cut through a single sequence.
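As an illustration of this contract (our sketch, with made-up operation names rather than a real PFS trace), the following code enumerates the post-crash states that prefix crash consistency allows for a short, totally ordered sequence of I/O calls: every legal state is a prefix that contains all preceding fsync calls, and only a non-atomic final operation may be left partially completed.

```python
# A minimal sketch of prefix crash consistency for a totally ordered
# sequence of I/O calls (no concurrency). Operation names are illustrative.
ops = [
    ("creat",  "tmp"),           # assumed atomic
    ("pwrite", "tmp, new"),      # not atomic: may be partially applied
    ("fsync",  "tmp"),           # persist point
    ("rename", "tmp -> file"),   # assumed atomic
]
ATOMIC = {"creat", "fsync", "rename"}

# Every legal prefix must include all fsync operations issued before the crash.
fsyncs = [i for i, (op, _) in enumerate(ops) if op == "fsync"]
min_len = fsyncs[-1] + 1 if fsyncs else 0

for cut in range(min_len, len(ops) + 1):
    if cut == 0:
        print("legal state: nothing persisted")
        continue
    op, arg = ops[cut - 1]
    tail = "complete" if op in ATOMIC else "complete or partially applied"
    print(f"legal state: first {cut} ops persisted; last op {op}({arg}) {tail}")
```

With concurrent I/O calls, the same enumeration would range over downward-closed cuts of the causality order instead of prefixes of a single sequence.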
2.2 Motivating Examples

We present a crash-vulnerability scenario of a PFS in Figure 1. The application attempts to replace the content of a target file by creating, modifying, and renaming a temporary file in BeeGFS (see the detailed setting in §3.1). The file operations are issued by the BeeGFS client and forwarded to the corresponding server machines. In this motivating scenario, we reveal three crash-inconsistency states: two are caused by cross-node persistence reordering, and one by intra-node persistence reordering.

(1) The append to the file on storage node #1 may be persisted to disk after the rename of the directory entry on the metadata node; if a crash happens right before the append is persisted, BeeGFS will suffer from data loss. This introduces a crash vulnerability to the application, since its assumed atomicity is broken. (2) A similar inconsistency will happen if reordering occurs between the rename on the metadata node and the unlink of the file on the storage node. Neither of these two inconsistencies can be resolved by beegfs-fsck. (3) The third crash vulnerability is caused by the intra-node per-
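For reference, the replace-via-rename pattern that this scenario (and the ARVR test suite in Table 1) exercises is the standard create-temp/write/fsync/rename idiom. The sketch below is our generic Python rendering of that pattern under assumed POSIX semantics on a mounted PFS path; it is not the test code used in the study, and the path in the usage comment is made up.

```python
import os

def atomic_replace(path: str, new_content: bytes) -> None:
    """Replace path's content via a temporary file and rename (ARVR pattern)."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, new_content)   # corresponds to pwrite(tmp, new) in Figure 1
        os.fsync(fd)                # persist the new data before switching names
    finally:
        os.close(fd)
    os.rename(tmp, path)            # corresponds to rename(tmp, file) in Figure 1
    # On local file systems, persisting the rename also requires an fsync on
    # the parent directory; whether (and in what order) a PFS persists the
    # rename on the metadata node and the data on the storage nodes is
    # exactly the crash-consistency question raised by this scenario.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

# Usage (hypothetical mount point): after a crash at any point, either the old
# or the complete new content should survive if replacement is atomic.
# atomic_replace("/mnt/beegfs/data/output.dat", b"new content\n")
```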