NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System

Jian Xu∗ ([email protected]), Lu Zhang∗ ([email protected]), Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, and Steven Swanson ([email protected])
University of California, San Diego

Andy Rudoff ([email protected])
Intel Corporation

∗The first two authors contributed equally to this work.

ABSTRACT
Emerging fast, persistent memories will enable systems that combine conventional DRAM with large amounts of non-volatile main memory (NVMM) and provide huge increases in storage performance. Fully realizing this potential requires fundamental changes in how system software manages, protects, and provides access to data that resides in NVMM. We address these needs by describing an NVMM-optimized file system called NOVA-Fortis that is both fast and resilient in the face of corruption due to media errors and software bugs. We identify and propose solutions for the unique challenges in adding fault tolerance to an NVMM file system, adapt state-of-the-art reliability techniques to an NVMM file system, and quantify the performance and storage overheads of these techniques. We find that NOVA-Fortis’ reliability features consume 14.8% of the storage for redundancy and reduce application-level performance by between 2% and 38% compared to the same file system with the features removed. NOVA-Fortis outperforms DAX-aware file systems without reliability features by 1.5× on average. It outperforms reliable, block-based file systems running on NVMM by 3× on average.

CCS CONCEPTS
• Information systems → Phase change memory; Mirroring; RAID; Point-in-time copies;

KEYWORDS
Persistent Memory, Non-volatile Memory, Direct Access, DAX, File Systems, Reliability

ACM Reference Format:
Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Steven Swanson, and Andy Rudoff. 2017. NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System. In Proceedings of ACM SIGOPS 26th Symposium on Operating Systems Principles (SOSP ’17). ACM, New York, NY, USA, 19 pages. https://doi.org/10.1145/3132747.3132761

1 INTRODUCTION
Fast, persistent memory technologies (e.g., battery-backed NVDIMMs [37] or Intel and Micron’s 3D XPoint [36]) will enable computer systems with expansive memory systems that combine volatile and non-volatile memories. These hybrid memory systems offer the promise of dramatic increases in storage performance.

Integrating NVMMs into computer systems presents a host of interesting challenges. The most pressing of these focus on how we should redesign existing software components (e.g., file systems) to accommodate and exploit the different performance characteristics, interfaces, and semantics that NVMMs provide.

Several groups have proposed new file systems [12, 15, 64] designed specifically for NVMMs, and several Windows and
Linux file systems now include at least rudimentary support for them [10, 20, 63]. These file systems provide significant performance gains for data access and support “direct access” (or DAX-style) mmap() that allows applications to access a

file’s contents directly using load and store instructions, a likely “killer app” for NVMMs.

Despite these NVMM-centric performance improvements, none of these file systems provide the data protection features necessary to detect and correct media errors, protect against corruption due to misbehaving code, or perform consistent backups of the NVMM’s contents. File system stacks in wide use (e.g., ext4 running atop LVM, Btrfs, and ZFS) provide some or all of these capabilities for block-based storage. If users are to trust NVMM file systems with critical data, they will need these features as well.

From a reliability perspective, there are four key differences between conventional block-based file systems and NVMM file systems.

First, the memory controller reports persistent memory media errors as non-maskable interrupts rather than error codes from a block driver. Further, the granularity of errors is smaller (e.g., a cache line) and varies depending on the memory device.

Second, persistent memory file systems must support DAX-style memory mapping that maps persistent memory pages directly into the application’s address space. DAX is the fastest way to access persistent memory since it eliminates all operating system and file system code from the access path. However, it means a file’s contents can change without the file system’s knowledge, something that is not possible in a block-based file system.

Third, the entire file system resides in the kernel’s address space, vastly increasing vulnerability to “scribbles” – errant stores from misbehaving kernel code.

Fourth, NVMMs are vastly faster than block-based storage devices. This means that the trade-offs block-based file systems make between reliability and performance need a thorough re-evaluation.

We explore the impact of these differences on file system reliability mechanisms by building NOVA-Fortis, an NVMM file system that adds fault tolerance to NOVA [64] by incorporating snapshots, replication, checksums, and RAID-4 parity protection.

In applying these techniques to an NVMM file system, we have developed the principle of caveat DAXor (“let the DAXer beware”): applications that use DAX-style mmap() must accept responsibility for protecting their data’s integrity and consistency.

Protecting and guaranteeing consistency for DAX mmap()’d data is complex and challenging. The file system cannot fix that and should not try. Instead, the file system should studiously avoid imposing any performance overhead on DAX-style access, except when absolutely necessary. For data that is not mapped, the file system should retain responsibility for data integrity.

Caveat DAXor has two important consequences for NOVA-Fortis’ design. The first applies to most other NVMM file systems: to maximize performance, applications are responsible for enforcing ordering on stores to mapped data to ensure consistency in the face of system failure.

The second consequence arises because NOVA-Fortis uses parity to protect file data from corruption. Keeping error correction information up-to-date for mapped data would require interposing on every store, imposing a significant performance overhead. Instead, NOVA-Fortis requires applications to take responsibility for protecting data while it is mapped, and it restores parity protection when the memory is unmapped.

We quantify the performance and storage overhead of NOVA-Fortis’ fault-tolerance mechanisms and these design decisions and evaluate their effectiveness at preventing corruption of both file system metadata and file data.

We make the following contributions:
(1) We identify the unique challenges that the caveat DAXor principle presents to building a fault-tolerant NVMM file system.
(2) We describe a fast algorithm called Tick-Tock for NVMM data structures that combines atomic update with error detection and recovery.
(3) We adapt state-of-the-art techniques for data protection to work in NOVA-Fortis and to accommodate DAX-style mmap().
(4) We quantify NOVA-Fortis’ vulnerability to scribbles and develop techniques to reduce this vulnerability.
(5) We quantify the performance and storage overheads of NOVA-Fortis’ data protection mechanisms.

We find that the extra storage NOVA-Fortis needs to provide fault tolerance consumes 14.8% of file system space and reduces application-level performance by between 2% and 38% compared to NOVA. NOVA-Fortis outperforms DAX-aware file systems without reliability features by 1.5× on average. It outperforms reliable, block-based file systems running on NVMM by 3× on average.

To describe NOVA-Fortis, we start by providing a brief primer on NVMM’s implications for system designers, existing NVMM file systems, key issues in file system reliability, and the NOVA file system (Section 2). Then, we describe NOVA-Fortis’ snapshot and (meta)data protection mechanisms (Sections 3 and 4). Section 5 evaluates these mechanisms, and Section 6 presents our conclusions.

2 BACKGROUND
NOVA-Fortis targets memory systems that include emerging non-volatile memory technologies along with DRAM. This section first provides a brief survey of NVMM technologies and the opportunities and challenges they present. Then we


describe recent work on NVMM file systems and discuss key issues in file system reliability. Finally, we provide a brief primer on NOVA.

2.1 Non-volatile Memory Technologies
Modern server platforms support NVMM in the form of NVDIMMs [22, 37], and the Linux kernel includes low-level drivers for identifying physical address regions that are non-volatile, etc. NVDIMMs are commercially available from several vendors in the form of DRAM DIMMs that can store their contents to an on-board flash-memory chip in case of power failure with the help of super-capacitors.

NVDIMMs that dispense with flash and battery backup are expected to appear in systems soon. Phase change memory (PCM) [32, 44], resistive RAM (ReRAM) [18, 56], and 3D XPoint memory technology [36] are denser than DRAM and may enable very large, non-volatile main memories. Their latencies are longer than DRAM’s, however, making it unlikely that they will fully replace DRAM as main memory. Other technologies, such as spin-torque transfer RAM (STT-RAM) [28], are faster but less dense and may find other roles in future systems (e.g., as non-volatile caches [67]). These technologies are all under active development, and knowledge about their reliability and performance is evolving rapidly.

Commercial availability of these technologies appears to be at hand. The 3D XPoint memory technology has already appeared in SSDs [25] and is expected to appear on the processor memory bus shortly [26]. In addition, all major memory manufacturers have candidate technologies that could compete with 3D XPoint. Consequently, we expect hybrid volatile/non-volatile memory hierarchies to become common in large systems.

2.2 NVMM File Systems and DAX
Several groups have designed NVMM file systems [12, 14, 15, 63, 64] that address the unique challenges that NVMMs’ performance and byte-addressable interface present. One of these, NOVA, is the basis for NOVA-Fortis, and we describe it in more detail in Section 2.4.

NVMMs’ low latencies make software efficiency much more important than in block-based storage devices [7, 9, 62, 65].

NVMM-aware CPUs provide a load/store interface with atomic 8-byte¹ operations rather than a block-based interface with block- or sector-based atomicity. NVMM file systems can use these atomic updates to implement features such as complex atomic data and metadata updates, but doing so requires different data structures and algorithms than block-based file systems have employed.

¹NVMM-aware Intel CPUs provide 8-byte atomic operations. Other architectures may provide different atomicity semantics.

Since NVMMs reside on the processor’s memory bus, applications should be able to access them directly via loads and stores. NVMM file systems provide this ability via direct access (or “DAX”). DAX allows read and write system calls to bypass the page cache and access NVMM directly. DAX mmap() maps the NVMM physical pages that hold a file directly into an application’s address space, so the application can access and modify file data with loads and stores and use persist barriers to enforce ordering constraints. File systems for both Windows [20] and Linux [10, 63] support DAX mmap().

DAX mmap() is a likely “killer app” for NVMMs since it gives applications the fastest possible access to stored data and allows them to build complex, persistent, pointer-based data structures. The typical usage model would have the application create a large file in an NVMM file system, use mmap() to map it into its own address space, and then rely on a userspace library [11, 42, 61] to manage it.
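To make this usage model concrete, here is a minimal userspace sketch of our own (not code from the paper): it maps a file on a hypothetical DAX mount at /mnt/pmem and makes a store durable with a clwb-plus-fence persist barrier instead of msync(). The path, file size, and persist() helper are illustrative assumptions, and error handling is omitted for brevity.

    /* Sketch of the DAX mmap() usage model, assuming a DAX-mounted file
     * system at /mnt/pmem (hypothetical path) and an x86 CPU with clwb
     * (compile with -mclwb). Not taken from NOVA-Fortis. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <immintrin.h>

    #define FILE_SIZE (1ULL << 30)   /* one large, 1 GB file */

    /* Persist barrier: flush the affected cache lines toward the
     * persistence domain, then fence so later stores are ordered after. */
    static void persist(const void *addr, size_t len)
    {
        uintptr_t p = (uintptr_t)addr & ~(uintptr_t)63;
        for (; p < (uintptr_t)addr + len; p += 64)
            _mm_clwb((void *)p);
        _mm_sfence();
    }

    int main(void)
    {
        int fd = open("/mnt/pmem/data", O_CREAT | O_RDWR, 0644);
        ftruncate(fd, FILE_SIZE);
        /* On a DAX file system this maps NVMM pages directly: the loads
         * and stores below reach NVMM without going through the page cache. */
        char *pmem = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        strcpy(pmem, "hello, persistent world");
        persist(pmem, 24);            /* durable without calling msync() */
        munmap(pmem, FILE_SIZE);
        close(fd);
        return 0;
    }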
Allowing programmers to build useful data structures with NVMMs requires CPUs to make guarantees about when stores become persistent that programmers can use to guarantee consistency after a system crash [2, 6]. Without these guarantees it is impossible to build data structures in NVMM that are reliable in the face of power failures [11, 60, 61]. NVMM-aware systems provide some form of persist barrier that allows programmers to ensure that earlier stores become persistent before later stores. Researchers have proposed several different kinds of persist barriers [12, 30, 41]. For example, under x86 a persist barrier comprises a clflush or clwb [24] instruction to force cache lines into the system’s “persistence domain” and a conventional memory fence to enforce ordering. Once a store reaches the persistence domain, the system guarantees it will reach NVMM, even in the case of a crash. NOVA-Fortis and other NVMM file systems assume that these or similar instructions are available.

Building persistent data structures that are robust in the face of system and application failures is a difficult programming challenge and the first inspiration for the caveat DAXor principle. Building data structures in NVMM requires the application (or library) to take responsibility for allocating and freeing persistent memory and avoiding a host of new bugs that can arise in memory systems that combine persistent and transient state [11].

Applications are also responsible for enforcing ordering relationships between stores to ensure that the application can recover from an unexpected system failure. Conventional mmap()-based applications use msync() for this purpose. msync() works with DAX mmap(), but it is expensive, non-atomic, and only operates on pages. Persist barriers are much faster than msync() and better-suited to building


complex data structures, but they are more difficult to use correctly.

[Figure 1 diagram: an inode with head and tail pointers into a log of entries (an inode update, two file writes, and an uncommitted entry) that reference file pages marked live or stale.]

Figure 1: NOVA inode log structure – A NOVA inode log records changes to the inode (e.g., the mode change and two file write operations shown above). NOVA stores file data outside the log.

2.3 File System Consistency and Reliability
Apart from their core function of storing and retrieving data, file systems also provide facilities to protect the data they hold from corruption due to system failures, media errors, and software bugs (both in the file system and elsewhere). File systems have devised a variety of different techniques to guarantee system-wide consistency of file system data structures, including journaling [15, 58], copy-on-write [8, 12, 45], and log-structuring [46, 47].

Highly-reliable file systems like ZFS [8] and Btrfs [45] provide two key features to protect data and metadata: the ability to take snapshots of the file system (to facilitate backups) and a set of mechanisms to detect and recover from data corruption due to media errors and other causes. Existing DAX file systems provide neither of these features, limiting their usefulness in mission-critical applications. Below, we discuss the importance of each feature and existing approaches.

2.3.1 Snapshots. Snapshots provide a consistent image of the file system at a moment in time. Their most important application is facilitating consistent backups without unmounting the file system, affording protection against catastrophic system failures and the accidental deletion or modification of files.

Many modern file systems have built-in support for snapshots [8, 29, 45]. In other systems the underlying storage stack (e.g., LVM in Linux) can take snapshots of the underlying block device. Neither existing DAX-enabled NVMM file systems nor current low-level NVMM drivers support snapshots, making consistent online backups impossible.

2.3.2 Data Corruption. File systems are subject to a wide array of data corruption mechanisms, including media errors that cause storage media to return incorrect values and software errors that store incorrect data to the media. Data corruption and software errors in the storage stack have been thoroughly studied for hard disks [4, 49, 50], SSDs [34, 40, 51], and DRAM-based memories [52, 54, 55]. The results of the DRAM-based studies apply to DRAM-based NVDIMMs, but there have been no (publicly-available) studies of error behaviors in emerging NVMM technologies.

Storage devices use error-correcting codes (ECC) to protect against media errors. Errors that ECC detects but cannot correct result in uncorrectable media errors. For block-based storage, these errors appear as read or write failures from the storage driver. Intel NVMM-based systems report these media errors via an unmaskable machine-check exception (MCE) (see Section 4.1).

Software errors can also cause data corruption. If the file system is buggy, it may write data in the wrong place or fail to write at all. Other code in the kernel can corrupt file system data by “scribbling” [31] on file system data structures or data buffers. Scribbles are an especially critical problem for NVMM file systems, since the NVMM is mapped into the kernel’s address space. As a result, all of the file system’s data and metadata are always vulnerable to scribbles.

We discuss other prior work on file system reliability as it relates to NOVA-Fortis in Section 4.8.

2.4 The NOVA File System
NOVA-Fortis is based on the NOVA NVMM file system [64]. NOVA’s initial design focused on two goals: fully exposing the performance that NVMMs offer and providing very strong consistency guarantees – all operations in NOVA are atomic. Below, we describe the features of NOVA that are most relevant to our description of NOVA-Fortis.

Each inode in a NOVA file system has a private log that records changes to the inode. Figure 1 illustrates the relationship between an inode, its log, and file data.
NOVA stores the log as a linked list of 4 KB NVMM pages, so logs are non-contiguous. To perform a file operation that affects a single inode, NOVA appends a log entry to the log and updates the pointer to the log’s tail using an atomic, 64-bit store.

For writes, NOVA uses copy-on-write, allocating new pages for the written data. The log entry for a write holds pointers to the newly written pages, atomically replacing them in the file. NOVA immediately reclaims the resulting stale pages.

For complex file operations that involve multiple inodes (e.g., moving a file between directories), NOVA uses small, fixed-size journals (one per core) to store the new tail pointers for all the inodes and update them atomically.
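The append-then-commit pattern is worth seeing in code. The sketch below is a simplified rendering of the protocol just described, not NOVA’s actual source: the entry layout and the nova_inode, next_log_slot(), and persist_barrier() names are our own stand-ins.

    /* Sketch of NOVA's log append: persist the entry first, then commit
     * it with a single atomic 64-bit store to the log tail. Names and
     * layouts are illustrative, not from the NOVA source. */
    #include <stdint.h>
    #include <string.h>

    struct nova_inode {
        uint64_t log_head;   /* NVMM address of the first log page */
        uint64_t log_tail;   /* NVMM address just past the last committed entry */
    };

    struct file_write_entry {
        uint64_t type;       /* entry kind, e.g. FILE_WRITE */
        uint64_t offset;     /* file offset the write covers */
        uint64_t num_pages;  /* length of the write in 4 KB pages */
        uint64_t page_addr;  /* NVMM address of the copy-on-write pages */
    };

    /* Hypothetical helpers. */
    extern struct file_write_entry *next_log_slot(struct nova_inode *inode);
    extern void persist_barrier(const void *addr, size_t len);

    void log_append(struct nova_inode *inode, const struct file_write_entry *e)
    {
        struct file_write_entry *slot = next_log_slot(inode);

        memcpy(slot, e, sizeof(*slot));
        persist_barrier(slot, sizeof(*slot)); /* entry durable before commit */

        /* An aligned 8-byte store is atomic on x86 (see footnote 1), so a
         * crash leaves the tail pointing at either the old last entry or
         * the new one, never at a torn value. Real code would use a
         * volatile/WRITE_ONCE-style store to keep the compiler honest. */
        inode->log_tail = (uint64_t)(uintptr_t)slot + sizeof(*slot);
        persist_barrier(&inode->log_tail, sizeof(inode->log_tail));
    }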


NOVA periodically performs garbage collection on the inode logs by scanning and compacting them. Since the logs do not contain file data, they are shorter and garbage collection is less critical than in a conventional log-structured file system.

To maximize concurrency, NOVA uses per-CPU structures in DRAM to allocate NVMM pages and inodes. It also caches inode metadata in DRAM to minimize accesses to NVMM (which is projected to be slower than DRAM) and uses one DRAM-based radix tree per file to map file offsets to NVMM pages. NOVA divides the allocatable NVMM into multiple regions, one region per CPU core. A per-core allocator manages each of the regions, minimizing contention during memory allocation.

After a system crash, NOVA must scan all the logs to rebuild the memory allocator state. Since there are many logs, NOVA aggressively parallelizes the scan. Recovering a 50 GB NOVA file system takes just over 1/10th of a second [64].

3 SNAPSHOTS
NOVA-Fortis’ snapshot support lets system administrators take consistent snapshots of the file system while applications are running. The system can mount a snapshot as a read-only file system or roll the file system back to a previous snapshot. NOVA-Fortis supports an unlimited number of snapshots, and snapshots can be deleted in any order. NOVA-Fortis is the first NVMM file system that supports taking consistent snapshots while applications modify file data via DAX mmap().

Below, we describe the three central challenges that NOVA-Fortis’ snapshot mechanisms address: taking efficient snapshots, managing storage for snapshots, and taking usable snapshots of DAX mmap()’d files.

3.1 Taking and Restoring Snapshots
Applications expect efficient snapshot creation as well as high performance from NVMM file systems: taking a snapshot should have low latency, incur minimal writes to NVMM, and not block file system write operations. None of the existing NVMM file systems meet all these requirements.

NOVA-Fortis implements snapshots by maintaining a global snapshot ID for the file system and storing the current snapshot ID in each log entry. Taking a snapshot increments the global snapshot ID and records the old snapshot ID in a list of valid snapshots. Creating a new snapshot in NOVA-Fortis does not block file system write operations, and if no files are mmap()’d it takes constant time regardless of the file system volume size.

To restore the file system to a snapshot, NOVA-Fortis mounts the file system read-only. Then, to read a file, it traverses the log only while the log entries’ snapshot IDs are smaller than or equal to the target snapshot’s ID.

Below we describe how the snapshot mechanism interacts with normal file operations. Then, we describe how NOVA-Fortis takes consistent snapshots while applications are using DAX mmap(). Finally, we evaluate the impact of snapshots on NOVA-Fortis’ resource consumption and on application performance.

3.2 Snapshot Management
To take a snapshot NOVA-Fortis must preserve file contents from previous snapshots while also being able to recover the space a snapshot occupied after its deletion. In this description, we use “snapshot” to denote both the file system’s state when a snapshot is taken and the set of file operations since the previous snapshot.

NOVA-Fortis maintains a snapshot manifest for each snapshot. Entries in the manifest contain a pointer to a log entry, a birth snapshot ID, and a death snapshot ID. We denote a manifest entry with [birth snapshot ID, death snapshot ID). The manifest for a snapshot contains entries for all the log entries that were born (i.e., created) with that snapshot ID and have since died (i.e., become stale because another write replaced them) in a different snapshot. If a snapshot is deleted, the manifest entries for that snapshot become part of the following snapshot. This can result in a snapshot containing multiple manifest entries for the same file data. In this case, it is safe to remove the older entries (i.e., those with earlier birth snapshot IDs).

Figure 2 illustrates how NOVA-Fortis manages snapshot manifests. In Figure 2 (a), an application issues two writes. In (b), NOVA-Fortis takes a snapshot, increments the snapshot ID (Step 1), and performs a write to location 0. The new write log entry has snapshot ID 1 (Step 2), since it was born in snapshot 1.

The write to location 0 in (b) kills the write to that location in (a). NOVA-Fortis checks whether that log entry was born and died in different snapshots. In the example, it was born in snapshot 0 and died in snapshot 1, so NOVA-Fortis adds an entry to the snapshot 0 manifest (Step 3) and preserves the log entry. In Figure 2 (c), the application issues two writes that overwrite entries in snapshots 0 and 1, so NOVA-Fortis adds entries to those manifests.

In Figure 2 (d), NOVA-Fortis deletes snapshot 0 by removing it from the list of live snapshots and adding the manifest entries for snapshot 0 to snapshot 1’s manifest (Step 4). In effect, this moves the start of snapshot 1 to the beginning of snapshot 0.

Then, NOVA-Fortis starts a background thread that compacts the combined manifest by discarding all the manifest entries that cover one part of the file except the most recent (Step 5). As it discards the manifest entries, it marks the corresponding log entries as dead and reclaims the associated file data (Step 6). Finally, (e) shows the file system state after manifest compaction is complete.


[Figure 2 diagram: panels (a)–(e) show a file log and the snapshot manifests evolving through (a) write(0, 4K); write(4K, 4K); (b) take_snapshot(); write(0, 4K); (c) take_snapshot(); write(0, 4K); write(4K, 4K); (d) delete_snapshot(0); and (e) the state after garbage collection. The legend distinguishes snapshot entries, dead snapshot entries, file write entries (with address and snapshot ID), data in a snapshot, reclaimed data, and current data.]
Figure 2: Snapshots in NOVA-Fortis – NOVA-Fortis stores the current snapshot ID in each log entry and takes a snapshot by incrementing the snapshot ID. When a data block or log entry becomes dead in the current snapshot, NOVA-Fortis records its death in the snapshot manifest for the most recent snapshot it survives in. To delete a snapshot, NOVA-Fortis adds the manifest for the deleted snapshot to the manifest for the next snapshot and removes redundant entries.
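The following sketch shows the bookkeeping decision this design implies when a write kills an older log entry. It follows the rules in the text, but every name in it (global_snapshot_id, manifest_add(), reclaim()) is a hypothetical stand-in rather than NOVA-Fortis source.

    /* Sketch of snapshot manifest bookkeeping. A log entry carries the
     * snapshot ID it was born in; when a later write replaces (kills) it,
     * the entry is either reclaimed immediately or preserved in a
     * manifest. All names are illustrative. */
    #include <stdint.h>

    struct manifest_entry {
        uint64_t log_entry;  /* NVMM address of the dead log entry */
        uint64_t birth_id;   /* snapshot ID the entry was created in */
        uint64_t death_id;   /* snapshot ID in which it became stale */
    };

    extern uint64_t global_snapshot_id;   /* hypothetical: current snapshot */
    extern void reclaim(uint64_t log_entry);
    extern void manifest_add(uint64_t manifest_id, uint64_t log_entry,
                             uint64_t birth_id, uint64_t death_id);

    void kill_log_entry(uint64_t log_entry, uint64_t birth_id)
    {
        uint64_t death_id = global_snapshot_id;

        if (birth_id == death_id) {
            /* Born and died in the same snapshot: no snapshot can ever
             * observe this entry, so reclaim its data immediately. */
            reclaim(log_entry);
        } else {
            /* Some snapshot still needs this entry's data. Record a
             * [birth_id, death_id) entry in the manifest of the snapshot
             * the log entry was born in, and preserve the entry. */
            manifest_add(birth_id, log_entry, birth_id, death_id);
        }
    }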

NOVA-Fortis keeps the list of live snapshots in NVMM, but it keeps the contents of the manifests in DRAM. On a clean shutdown, it writes the manifests to NVMM in the recovery inode. After an unclean shutdown, NOVA-Fortis can reconstruct the manifests while it scans the inode logs during recovery.

3.3 Snapshots for DAX mmap()’d Files
Taking consistent snapshots while applications are modifying files using DAX-style mmap() requires NOVA-Fortis to reckon with the order in which stores to NVMM become persistent (i.e., reach physical NVMM so they will survive a system failure). These applications rely on the processor’s “memory persistence model” [41] to make guarantees about when and in what order stores become persistent, allowing them to restore their data to a consistent state when the application restarts after a system failure.

From the application’s perspective, reading a snapshot is equivalent to recovering from a system failure. In both cases, the contents of the memory-mapped file reflect its state at a moment when application operations might be in flight and when the application had no chance to shut down cleanly.

A naive approach to checkpointing mmap()’d files in NOVA-Fortis would simply mark each of the read/write mapped pages as read-only and then do copy-on-write when a store occurs, preserving the old pages as part of the snapshot. However, this approach can leave the snapshot in an inconsistent state: setting a page to read-only captures its contents for the snapshot, and the kernel requires NOVA-Fortis to set the pages as read-only one at a time. If the order in which NOVA-Fortis marks pages as read-only is incompatible with the ordering that the application requires, the snapshot will contain an inconsistent version of the file.

Consider an example with a data value D and a flag V that is True when D contains valid data. To store data in D while enforcing this invariant, a thread could issue a persist barrier between the stores to D and V. Assume D and V reside in different pages, and initially V is False.

If NOVA-Fortis marks D’s page as read-only before the program updates D, and marks V’s page as read-only after the program updates V, the snapshot has V equal to True, but D with its old, incorrect value.
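Written as application code, the example looks like the sketch below, where persist_barrier() stands in for the clwb-plus-fence sequence from Section 2.2 and D and V are assumed to occupy different DAX-mmap()’d NVMM pages. This is our illustration, not code from the paper.

    /* The D/V example from the text as DAX-mmap()'d application code.
     * D and V are assumed to point into different mmap()'d NVMM pages,
     * set up elsewhere; persist_barrier() is the clwb-plus-fence sketch
     * from Section 2.2. */
    #include <stddef.h>

    extern void persist_barrier(const void *addr, size_t len);

    int *D;   /* data value, in one NVMM page          */
    int *V;   /* valid flag, in a different NVMM page  */

    void publish(int value)
    {
        *D = value;
        persist_barrier(D, sizeof(*D)); /* D must be durable before V is set */
        *V = 1;                         /* V == 1 promises "D is valid"      */
        persist_barrier(V, sizeof(*V));
    }

A snapshot that captures V as 1 but D with its old value is exactly the inconsistency the page-fault-blocking scheme below prevents.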


Figure 3 illustrates how we resolve the problem: when NOVA-Fortis starts marking pages as read-only in (a), it blocks page faults on read-only mmap()’d pages from completing until it has marked all the mmap()’d pages read-only. D’s page fault in (b) will not complete (and the store to V will not occur) until all the pages (including V’s) have been marked read-only and the snapshot is complete in (c). Then NOVA-Fortis allows the copy-on-write for D’s page to proceed, allowing the modification of D to complete in (d). Since the assignment to V occurs after the assignment to D in program order, the copy-on-write and modification to V in (e) will occur after creating a consistent snapshot.

[Figure 3 diagram: page states over time (read/write; read-only; read-only with a waiting copy-on-write page fault) alongside the snapshot state of D and V through steps (a)–(e).]

Figure 3: Snapshots of mmap()’d data – NOVA-Fortis marks mmap()’d pages as read-only to capture their contents in the snapshot. To preserve the application’s constraints on when stores become persistent (here, that the assignment to V follow the assignment to D), it prevents page faults to read-only pages from completing until it can mark all the pages read-only.

This solution is also correct for multi-threaded programs. If thread 1 updates D and, later, thread 2 should update V, the threads must use some synchronization primitive to enforce that ordering. For instance, thread 1 may release a lock after storing to D. If D’s page is marked read-only before D changes, the unlock will not occur until the kernel marks all the mmap()’d pages read-only, ensuring that the store to V will not appear in the snapshot.

This approach guarantees that, for each thread, the snapshot will reflect a prefix of the program-order-based sequence of NVMM store operations the thread has performed. This model is equivalent to strict persistence [41], the most restrictive model for how NVMM memory operations should behave (i.e., in what order updates can become persistent) in a multi-processor system. CPUs may implement more aggressive, relaxed models for memory persistence, but strict persistence is strictly more conservative than all proposed models, so NOVA-Fortis’ approach is correct under those models as well.²

²This is directly analogous to sequential memory consistency being a valid implementation of any more relaxed memory consistency model. Strict persistence is a natural extension of sequential consistency to include a notion of persistence.

3.4 Design Decisions and Evaluation
There are several ways we could have addressed the problems of creating snapshots, managing the data they contain, and correctly capturing mmap()’d data. Our approach stores snapshot manifests in DRAM. The size of a snapshot manifest is proportional to the changes between snapshots, and each data block and log entry is referred to by at most one snapshot manifest. In the worst case the snapshot manifest size is proportional to the size of the live data in the file system; this happens when the user takes a snapshot and then modifies the entire file system.

In practice, the size of the manifests is manageable. We ran the fileserver workloads described in Section 5 and the WHISPER [39] workloads while taking snapshots every 10 seconds. WHISPER is a suite of eight applications that use DAX mmap() for data access. Table 1 shows the results. The size of the manifests ranged from 0.027% to 0.05% of the space in use in the file system.

    Application   Workload size   Max manifest size
    Fileserver    13 GB           3.6 MB
    Varmail       4 GB            2 MB
    Redis         4 GB            1.1 MB
    TPCC          8.5 GB          3.1 MB
    ctree         10 GB           3.3 MB

Table 1: Snapshot manifest size – The size of the snapshot manifest is proportional to the size of the changes between snapshots.

Taking snapshots has different effects on the performance of applications that use file-based access and those that use DAX-style mmap(). Figure 4 measures the impact on both groups.

484 SOSP ’17, October 28, 2017, Shanghai, China J. Xu and L. Zhang et al.

Figure 4: Snapshot performance impact – For file-based applications (“Filebench”), taking a snapshot every 10 s reduces performance by just 0.6% on average. WHISPER applications that use DAX mmap() see a larger drop: 3.7% on average.

Figure 5: Snapshot operation latency – The time required to mark pages read-only during snapshot creation is proportional to the number of write-faulted pages (300 in this example) in the DAX mmap()’d regions.

On the right, Figure 4 shows results for the WHISPER applications. On the left are results for four Filebench [57] workloads.

For all the applications, the figure compares performance without snapshots to performance while running for 1 to 5 minutes and taking a snapshot every 10 seconds. For the WHISPER applications that use DAX mmap(), taking periodic snapshots only reduces performance by 3.7% on average. For the file-based Filebench workloads, the impact is negligible – 0.6% on average.

Figure 5 shows the latency of snapshot operations. For file-based applications, snapshot creation takes constant time. For DAX mmap() applications, the time to mark mmap()’d regions as read-only is proportional to the number of write-faulted pages, and each page takes about 10 ns. Snapshot deletion always takes constant time, since it only needs to combine the deleted snapshot’s manifest with the next snapshot’s manifest; it then performs stale data reclamation in the background. The background thread takes about 400 ns to reclaim 4 KB of data. A copy-on-write page fault takes 2 µs: 1.8 µs for copying data to the new page and 200 ns to update the page table.

4 HANDLING DATA CORRUPTION IN NOVA-FORTIS
Like all storage media, NVMM is subject to data and metadata corruption from media failures and software bugs. To prevent, detect, and recover from data corruption, NOVA-Fortis relies on the capabilities of the system hardware and operating system as well as its own error detection and recovery mechanisms.

This section describes the interfaces that NOVA-Fortis expects from the memory system hardware and the OS and how it leverages them to detect and recover from corruption. We also discuss a technique that prevents data corruption in many cases and NOVA-Fortis’ ability to trade reliability for performance. Finally, we discuss NOVA-Fortis’ protection mechanisms in the context of recent work on file system reliability.


[Figure 6 diagram: the NVMM layout, with per-CPU allocation regions growing primary allocations from one end and replica allocations from the other, plus data pages, checksum regions, parity regions, and a dead zone (DZ) between primary and replica metadata.]

Figure 6: NOVA-Fortis space layout – NOVA-Fortis’ per-core allocators satisfy requests for primary and replica storage from different directions. They also store data pages and their checksums and parity pages separately. The “dead zone” (DZ) for metadata allocation guarantees a gap between primary and replica pages.

4.1 Detecting and Correcting Media Errors
NOVA-Fortis detects NVMM media errors with the same mechanisms that processors provide to detect DRAM errors. The details of these mechanisms determine how NOVA-Fortis and other NVMM file systems can protect themselves from media errors.

This section describes the interface that recent Linux kernels (e.g., Linux 4.10) and Intel processors provide via the PMEM low-level NVDIMM driver. Porting NOVA-Fortis to other architectures or operating systems may require NOVA-Fortis to adopt a different approach to error detection.

NOVA-Fortis assumes the memory system provides ECC for NVMM that is similar to (or perhaps stronger than) the single-error-correction/double-error-detection (SEC-DED) scheme that conventional DRAM uses. We assume the controller transparently corrects correctable errors and silently returns invalid data for undetectable errors.

For detectable but uncorrectable errors, Intel’s Machine Check Architecture (MCA) [23] raises a machine check exception (MCE) in response to uncorrectable memory errors. After the exception, MCA registers hold information that allows the OS to identify the memory address and instruction responsible for the exception.

The default response to an MCE in the kernel is a kernel panic. However, recent Linux kernels include a version of memcpy(), called memcpy_mcsafe(), that returns an error to the caller instead of crashing in response to memory-error-induced MCEs and allows the kernel software to recover from the exception. NOVA-Fortis always uses this function when reading from NVMM and checks its return code to detect uncorrectable media errors. Intel processors do not provide a mechanism for detecting store failures, and the memory controller transparently maps around faulty cells. In rare cases (e.g., an MCE occurring during a page fault), MCEs are not recoverable, and a kernel panic is inevitable.

When the processor hardware detects an uncorrectable media error, it “poisons” a contiguous region of physical addresses. The size of this region is the poison radius (PR) of a media error. We assume PRs are a power of two in size and aligned to that size. Loads from poisoned addresses cause an MCE, and all the data in the PR is lost. The poisoned status of a PR persists across system failures, and the PMEM driver collects a list of poisoned PRs at boot. On Intel processors the poison radius is 64 bytes (one cache line), but after boot, Linux reports poisoned regions at 512-byte granularity, so NOVA-Fortis uses a 512-byte PR.

We also assume that NVMM platforms will allow system software to clear a poisoned PR to make the address range usable again. Intel processors provide this capability via the “Clear Uncorrectable Error” command that is part of the Advanced Configuration and Power Interface (ACPI) specification [59].
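The read-side pattern this interface implies is small enough to sketch. Note that the exact signature and return convention of memcpy_mcsafe() have varied across kernel versions (it was later renamed), so the error handling below is illustrative rather than definitive.

    /* Sketch of an MCE-safe NVMM read in kernel code. The memcpy_mcsafe()
     * return convention has varied across kernel versions, so treat the
     * check below as illustrative. */
    #include <linux/errno.h>
    #include <linux/string.h>

    static int nvmm_read(void *dram_dst, const void *nvmm_src, size_t len)
    {
        /* A nonzero return indicates an uncorrectable media error (MCE)
         * was caught during the copy instead of panicking the kernel. */
        if (memcpy_mcsafe(dram_dst, nvmm_src, len))
            return -EIO;   /* caller attempts replica or parity recovery */
        return 0;
    }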
phases of a previous update, and NOVA-Fortis copies the After the exception, MCA registers hold information that al- primary to the replica, effectively completing the interrupted lows the OS to identify the memory address and instruction update. If both copies are corrupt, the metadata is lost, and responsible for the exception. NOVA-Fortis returns an error. The default response to an MCE in the kernel is a kernel panic. However, recent Linux kernels include a version of 4.3 Protecting File Data memcpy(), called memcpy_mcsafe(), that returns an error to NOVA-Fortis adopts RAID-4 parity protection and check- the caller instead of crashing in response to memory-error- sums to protect file data and it includes features to maximize induced MCEs, and allows the kernel software to recover protection for files that applications access data via DAX- from the exception. NOVA-Fortis always uses this function style mmap(). when reading from NVMM and checks its return code to detect uncorrectable media errors. Intel processors do not RAID Parity and Checksums NOVA-Fortis treats each provide a mechanism for detecting store failures, and the 4 KB file page as a stripe, and divides it into PR-sized (or memory controller transparently maps around faulty cells. larger) stripe segments, or strips.


4.3 Protecting File Data
NOVA-Fortis adopts RAID-4 parity protection and checksums to protect file data, and it includes features to maximize protection for files whose data applications access via DAX-style mmap().

RAID Parity and Checksums. NOVA-Fortis treats each 4 KB file page as a stripe and divides it into PR-sized (or larger) stripe segments, or strips. NOVA-Fortis stores a parity strip for each file page in a reserved region of NVMM. It also stores two copies of a CRC32 checksum for each data strip in separate reserved regions. Figure 6 shows the checksum and parity layouts in the NVMM.

When NOVA-Fortis performs a read, it first copies each strip of data into DRAM using memcpy_mcsafe() and calculates its checksum. If this checksum matches either stored copy of the checksum, NOVA-Fortis concludes the file data is correct and updates the mismatched copy of the checksum if needed.

If neither of the checksums matches the read data or a media error occurs, NOVA-Fortis attempts to restore the strip using RAID-4 parity and uses the strip’s checksum to determine if the recovery succeeded. If no other strip in the page is corrupt, recovery will succeed and NOVA-Fortis restores the target strip and its checksums. If more than one strip is corrupt, the file page is lost and the read fails.

Writes and atomic parity updates are simple since NOVA-Fortis uses copy-on-write for data: for each file page write, NOVA-Fortis allocates new pages, populates them with the written data, computes the checksums and parity, and finally commits the write with an atomic log append operation.

Caveat DAXor: Protecting DAX-mmap()’d Data. By design, DAX-style mmap() lets applications modify file data without involving the file system, so it is impossible for NOVA-Fortis to keep the checksums and parity for read/write mapped pages up-to-date. Instead, NOVA-Fortis follows the caveat DAXor principle and provides the following guarantee: the checksums and parity for data pages are up-to-date at all times, except when those pages are mapped read/write into an application’s address space.

We believe this is the strongest guarantee that NOVA-Fortis can provide on current hardware, and it raises several challenges. First, users that use DAX mmap() take on responsibility for detecting and recovering from both media errors (which appear as SIGBUS in userspace) and scribbles. This is an interesting challenge but beyond the scope of this paper. Second, NOVA-Fortis must be able to tell when a data page’s checksums and parity should match the data it contains and when they might not.

To accomplish this, when a portion of a file is mmap()’d, NOVA-Fortis records this fact in the file’s log, signifying that the checksums and parity for the affected pages are no longer valid. NOVA-Fortis only recomputes the checksums and parity for dirty pages on msync() and munmap(). On munmap(), it adds a log entry that restores protection for these pages when the last mapping for the page is removed. If the system crashes while pages are mapped, the recovery process will identify these pages while scanning the logs, recompute checksums and parity, and add a log entry to mark them as valid.

Design Decisions and Alternatives. Using RAID parity and checksums to protect file data is similar to the approach that ZFS [8] and IRON [43] take, but we store parity for each page rather than one parity page per file [43], and we maintain per-strip checksums instead of a single checksum for a whole stripe [8].

NOVA-Fortis chooses RAID-4 over RAID-5 because RAID-5 intermingles parity and data and would make DAX mmap() impossible, since parity bits would end up in the application’s address space.

Alternately, NOVA-Fortis could rely on RAIM [33], Chipkill [21], or other advanced ECC mechanisms to protect file data. These techniques would improve reliability, but they are not universally available and cannot protect against scribbles.
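The per-strip read path described earlier in this section reduces to the sketch below. The strip size, page descriptor layout, and helpers (strip_addr(), parity_addr(), xor_into()) are illustrative assumptions; crc32c() stands in for whichever CRC32 variant the file system actually uses, and nvmm_read() is the MCE-safe copy from Section 4.1.

    /* Sketch of the strip read path: verify against either stored
     * checksum copy, else reconstruct from the RAID-4 parity strip. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define STRIP_SIZE      512                 /* one poison radius */
    #define STRIPS_PER_PAGE (4096 / STRIP_SIZE) /* 4 KB page = 1 stripe */

    struct page_desc {
        uint32_t csum[2][STRIPS_PER_PAGE];  /* two checksum copies/strip */
        /* ... NVMM addresses of the data and parity strips ... */
    };

    extern const void *strip_addr(struct page_desc *pg, int i);
    extern const void *parity_addr(struct page_desc *pg);
    extern void xor_into(uint8_t *dst, const uint8_t *src, size_t len);
    extern uint32_t crc32c(uint32_t seed, const void *buf, size_t len);
    extern int nvmm_read(void *dst, const void *src, size_t len);

    int read_strip(struct page_desc *pg, int i, uint8_t *buf)
    {
        uint8_t tmp[STRIP_SIZE];
        int j;

        if (nvmm_read(buf, strip_addr(pg, i), STRIP_SIZE) == 0) {
            uint32_t c = crc32c(~0u, buf, STRIP_SIZE);
            if (c == pg->csum[0][i] || c == pg->csum[1][i])
                return 0;   /* good data; heal a stale checksum copy */
        }

        /* Media error or checksum mismatch: rebuild strip i by XOR-ing
         * the parity strip with every other data strip in the page. */
        if (nvmm_read(buf, parity_addr(pg), STRIP_SIZE))
            return -5;
        for (j = 0; j < STRIPS_PER_PAGE; j++) {
            if (j == i)
                continue;
            if (nvmm_read(tmp, strip_addr(pg, j), STRIP_SIZE))
                return -5;      /* a second bad strip: the page is lost */
            xor_into(buf, tmp, STRIP_SIZE);
        }
        /* The strip's own checksum tells us whether recovery succeeded. */
        return crc32c(~0u, buf, STRIP_SIZE) == pg->csum[0][i] ? 0 : -5;
    }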
4.4 Minimizing Vulnerability to Scribbles
Scribbles pose a significant risk to NOVA-Fortis’ data and metadata, since a scribble can impact large, contiguous regions of memory. We are not aware of any systematic study of the prevalence of these errors, but scribbles, lost writes, and misdirected writes are well-known culprits for file system corruption [19, 31, 66]. In practice, we expect that smaller scribbles are more likely than larger ones, in part because the bugs that result in larger scribbles would be more severe and more likely to be found and fixed.

To quantify the risk that these errors pose, we define the bytes-at-risk (BAR) for a scribble as the number of bytes it may render irretrievable.

NOVA-Fortis packs log entries into log pages, and it must scan the page to recognize each entry. Without protection, losing a single byte can corrupt a whole page. For replicated log pages, a scribble that spans both copies of a byte will corrupt the page. To measure the BAR for a scribble of size N, we measure the number of pages each possible N-byte scribble would destroy in an aged NOVA-Fortis file system.

Figure 7 shows the maximum and average metadata BAR for a 64 GB NOVA-Fortis file system with four protection schemes for metadata: “no replication” does not replicate metadata; “simple replication” allocates the primary and replica naively and tends to allocate lower addresses before higher addresses, so the primary and replica are often close; “two-way replication” separates the primary and replica by preferring low addresses for the primary and high addresses for the replica; and “dead-zone replication” extends “two-way” by enforcing a 1 MB “dead zone” between the primary and replica. The dead zone can store file data but not metadata. The more separation the allocator provides, the less likely a scribble will corrupt a pair of mirrored metadata pages.


Figure 7: Scribble size and metadata bytes at risk – Replicating metadata pages and taking care to allocate the replicas separately improves resilience to scribbles. The most effective technique enforces a 1 MB “dead zone” between replicas and eliminates the threat of a single scribble smaller than 1 MB. The graph omits zero values due to the vertical log scale.

Figure 6 shows an example of NOVA-Fortis’ two-way allocator with dead-zone separation. For each pair of mirrored pages, the dead zone forbids the primary and replica from becoming too close, but data pages can reside between them.

To stress the allocator’s ability to place the primary and replica far apart, we aged the file system by spawning multiple, multi-threaded Filebench workloads. When each workload finishes, we remove about half of its files and then restart the workload. We continue until the file system is 99% full.

The data show that even for the smallest, 1-byte scribble, the unprotected version will lose up to a whole page (4 KB) of metadata and an average of 0.06 pages. With simple replication, scribbles smaller than 4 KB have zero BAR. Under simple replication, an 8 KB scribble can corrupt up to 4 KB, but it affects only 0.04 pages on average.

Two-way replication tries to allocate the primary and replica farther apart, and it reduces the average bytes at risk with an 8 KB scribble to 2.9 × 10⁻⁵ pages, but the worst case remains the same because the allocator’s options are limited when space is scarce.

Enforcing the dead zone further improves protection: a 1 MB dead zone eliminates metadata corruption for scribbles smaller than 1 MB. The dead zone size is configurable, so NOVA-Fortis can increase the 1 MB threshold for scribble vulnerability if larger scribbles are a concern.

Scribbles also place data pages at risk. Since NOVA-Fortis stores the strips of data pages contiguously, scribbles that are larger than the strip size may cause data loss, but smaller scribbles do not. NOVA-Fortis could tolerate larger scribbles to data pages by interleaving strips from different pages, but this would disallow DAX-style mmap(). Increasing the strip size can also improve scribble tolerance, but at the cost of increased storage overhead for the parity strip.

4.5 Preventing Scribbles
The mechanisms described above let NOVA-Fortis detect and recover from data corruption. NOVA-Fortis can also borrow a technique from WAFL [31] and PMFS [15] to prevent scribbles: it marks all of NVMM as read-only and then clears Intel’s Write Protect Enable (WP) bit to disable all write protection when NOVA-Fortis needs to modify NVMM. Clearing and re-setting the bit takes ∼400 ns on our systems.

The WP approach only protects against scribbles from other kernel code. It cannot prevent NOVA-Fortis from corrupting its own data by performing “misdirected writes,” a common source of data corruption in file systems [5].
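For concreteness, here is an x86-flavored sketch of the write-window technique. It is illustrative only: the helper names are ours, and recent Linux kernels pin CR0.WP precisely to prevent this kind of toggling, so the PMFS-era approach shown here would need explicit kernel support.

    /* Sketch of a CR0.WP write window in the style of WAFL/PMFS. With
     * NVMM mapped read-only, the kernel briefly disables write protection
     * around its own stores; interrupts are disabled so the open window
     * cannot leak to other code running on this CPU. */
    #include <linux/irqflags.h>
    #include <linux/types.h>
    #include <asm/processor-flags.h>
    #include <asm/special_insns.h>

    extern void persist_barrier(const void *addr, size_t len);

    void nvmm_protected_store(u64 *nvmm_addr, u64 val)
    {
        unsigned long flags;

        local_irq_save(flags);                /* confine window to this CPU */
        write_cr0(read_cr0() & ~X86_CR0_WP);  /* open: writes now allowed   */
        *nvmm_addr = val;
        persist_barrier(nvmm_addr, sizeof(*nvmm_addr));
        write_cr0(read_cr0() | X86_CR0_WP);   /* close: NVMM read-only again */
        local_irq_restore(flags);
    }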
4.6 Relaxing Data and Metadata Protection
Many existing file systems can trade off reliability for improved performance (e.g., the data journaling options in ext4). NOVA-Fortis can do the same: it provides a relaxed mode that relaxes atomicity constraints on file data and metadata. In relaxed mode, write operations modify existing data directly rather than using copy-on-write, and metadata operations modify the most recent log entry for an inode directly rather than appending a new entry. Relaxed mode guarantees metadata atomicity by journaling the modified pieces of metadata. These changes improve performance, and we evaluate their impact in Section 5.


4.7 Protecting DRAM Data Structures
Corruption of DRAM data structures can result in file system corruption [13, 66], and NOVA-Fortis protects most of its critical DRAM data structures with checksums. Most DRAM structures that NOVA-Fortis does not protect are short-lived (e.g., the DRAM copies we create of metadata structures) or are not written back to NVMM. However, the snapshot logs and allocator state are exceptions, and they are vulnerable to corruption. The allocator protects the address and length of each free region with checksums, but it does not protect the pointers that make up the red-black tree that holds them, since we use the kernel’s generic red-black tree implementation.

4.8 Related and Future Work
Below, we describe proposed “best practices” for file system design and how NOVA-Fortis addresses them. Then, we describe areas of potential improvement for NOVA-Fortis.

4.8.1 Is NOVA-Fortis Ferrous? InteRnally cONsistent (IRON) file systems [43] provide a set of principles that lead to improved reliability. We designed NOVA-Fortis to embody these principles:

Check error codes. Uncorrectable ECC errors are the only errors that the NVMM memory system delivers to software (i.e., via MCEs). NOVA-Fortis uses memcpy_mcsafe() for all NVMM loads and triggers recovery if it detects an MCE. NOVA-Fortis also interacts with the PMEM driver that provides low-level management of NVDIMMs. For these calls, we check and respond to error codes appropriately.

Report errors and limit the scope of failures. NOVA-Fortis reports all unrecoverable errors as EIO rather than calling panic().

Use redundancy for integrity checks and distribute redundancy information. NOVA-Fortis’ tick-tock replication scheme stores the checksum for each replica with the replica, but it is careful to allocate the primary and replica copies far from one another. Likewise, NOVA-Fortis stores the parity and checksum information for data pages separately from the pages themselves.

Type-aware fault injection. For testing, we built a NOVA-Fortis-specific error injection tool that can corrupt data and metadata structures in specific, targeted ways, allowing us to test NOVA-Fortis’ detection and recovery mechanisms.

4.8.2 Areas for Improvement. There are several additional steps NOVA-Fortis could take to further improve reliability. We do not expect any of them to have a large impact on performance or storage overheads.

Sector or block failures in disks are not randomly distributed [3], and errors in NVMM are also likely to exhibit complex patterns of locality [35, 54, 55]. For instance, an NVMM chip may suffer from a faulty bank, row, or column, leading to a non-uniform error distribution. Or, an entire NVDIMM may fail. NOVA-Fortis’ allocator actively separates the primary and replica copies of metadata structures to eliminate logical locality, but it does not account for how the memory system maps physical addresses onto the physical memory. A layout-aware allocator could, for instance, ensure that replicas reside in different banks or different NVDIMMs.

NOVA-Fortis cannot keep running after an unrecoverable MCE (since they cause a panic()), but it could recover any corrupted data during recovery. The PMEM driver provides a list of poisoned PRs on boot, and NOVA-Fortis can use this information to locate and recover corrupted file data during mount. Without this step, NOVA-Fortis will still detect poisoned metadata, since reading from a poisoned PR results in a recoverable MCE, and NOVA-Fortis reads all metadata during recovery. Poisoned file data, however, could accumulate over multiple unrecoverable MCEs, increasing the chances of data loss.

Finally, NOVA-Fortis does not scrub data or metadata. PMEM detects media errors on reboot, but if a NOVA-Fortis file system ran continuously for a long time, undetected media errors could accumulate. Undetected scribbles to data and metadata can accumulate during normal operation and across reboots.

5 PERFORMANCE TRADE-OFFS
NOVA-Fortis’ reliability features improve its resilience but also incur overhead in terms of performance and storage space. This section quantifies these overheads and explores the trade-offs they allow.

5.1 Experimental Setup
We implemented NOVA-Fortis for Linux 4.10, and the source code is available at https://github.com/NVSL/linux-nova. We use the Intel Persistent Memory Emulation Platform (PMEP) [15] to emulate different types of NVMM and study their effects on NVMM file systems. PMEP supports configurable latencies and bandwidth for the emulated NVMM and emulates the clwb instruction with microcode. In our tests we configure the PMEP with 32 GB of DRAM and 64 GB of NVMM and choose two configurations for PMEP’s memory emulation system: we use the same read latency and bandwidth as DRAM to emulate fast NVDIMM-N [48], and we set the read latency to 300 ns and reduce the write bandwidth to 1/8th of DRAM’s to emulate slower PCM. For both configurations we set the clwb instruction latency to 40 ns.

We compare NOVA-Fortis against five other file systems. Ext4-DAX, xfs-DAX, and PMFS are the three DAX-enabled file systems.


Figure 8: File operation latency – NOVA-Fortis’ basic file operations are faster than competing file systems ex- cept in cases where the other file system provides weaker consistency guarantees and/or data protection (e.g., writes and overwrites vs. Ext4-DAX and xfs-DAX). Sacrificing these protections in NOVA-Fortis (“Relaxed mode”) improves performance.

Figure 9: NOVA-Fortis random read/write bandwidth on NVDIMM-N – Read bandwidth is similar across all the file systems except Btrfs, and NOVA-Fortis’ reliability mechanisms reduce its throughput by between 14% and 19%. For writes the cost of reliability is higher – up to 66% – but NOVA-Fortis still outperforms Btrfs by 33× with 16 threads.

None of them provides strong consistency guarantees (i.e., they do not guarantee that all operations are atomic), while NOVA does provide these guarantees. To compare to file systems with stronger guarantees, we also compare to ext4 in data journaling mode (ext4-dataj) and Btrfs running on the NVMM-based block device. Ext4 and xfs keep checksums for metadata, but they do not provide any recovery mechanisms for NVMM media errors or protection against stray writes.

5.2 Performance Impacts
To understand the impact of NOVA-Fortis’ reliability mechanisms, we begin by measuring the performance of individual mechanisms and basic file operations. Then we measure their impact on application-level performance.

We compare several versions of NOVA-Fortis: we start with our baseline, NOVA, and add metadata protection (“MP”), data protection (“DP”), and write protection (“WP”). “Relaxed mode” weakens consistency guarantees to improve performance and provides no data protection (Section 4.6).


[Figure 10 panels: Metadata Protection, Data Protection, Write Protection.]

Figure 10: NOVA-Fortis latencies for NVDIMM-N – Protecting file data is usually more expensive than protecting metadata because the cost of computing checksums and parity for data scales with access size.

5.3 Microbenchmarks

We evaluate basic file system operations: create, 4 KB append, 4 KB write, 512 B write, and 4 KB read. Figure 8 measures the latency of these operations with the NVDIMM-N configuration. Data for PCM shows similar trends.

Create is a metadata-only operation. NOVA is 1.9× to 5× faster than the existing file systems, and adding metadata protection increases the latency by 47% compared to the baseline. Append updates both metadata and data. Adding metadata and data protection increases the latency by 36% and 100%, respectively, and write protection increases the latency by an additional 22%. NOVA-Fortis with full protection (i.e., "w/ MP+DP+WP") is 59% slower than NOVA.

For overwrite, NOVA-Fortis performs copy-on-write for file data to provide data atomicity guarantees, so the latency is close to that of append. For 512 B overwrite, NOVA-Fortis has longer latency than the other DAX file systems since it requires reading and writing 4 KB. Full protection increases the latency by 2.2×. Relaxed mode is 3.8× faster than NOVA since it performs in-place updates. For read operations, data protection adds 70% overhead because it verifies the data checksum before returning data to the user.

Figure 10 breaks down the latency for NOVA-Fortis and its reliability mechanisms. For create, inode allocation and appending to the log combine to consume 48% of the latency, due to inode/log replication and checksum calculation. For 4 KB append and overwrite, data protection has almost the same latency as the memory copy itself (memcpy_nocache), and it accounts for 31% of the total latency for 512 B overwrite.

Figure 9 shows FIO [1] measurements of the multi-threaded read/write bandwidth of the file systems. For writes, NOVA-Fortis' relaxed mode achieves the highest bandwidth. With sixteen threads, metadata protection reduces NOVA-Fortis' bandwidth by 24% compared to the baseline, data protection reduces throughput by 37%, and enabling all of NOVA-Fortis' protection features reduces bandwidth by 66%. For reads, all the file systems scale well except Btrfs, and NOVA-Fortis' data protection incurs a 14% overhead with 16 threads due to checksum verification.

5.4 Macrobenchmarks

We use nine application-level workloads to evaluate NOVA-Fortis: four Filebench [57] workloads (fileserver, varmail, webproxy, and webserver), two key-value stores (RocksDB [17] and MongoDB [38]), the Exim email server [16], SQLite [53], and TPC-C running on Shore-MT [27]. Fileserver, varmail, webproxy, and Exim are metadata-intensive workloads, while the other workloads are data-intensive. Table 2 summarizes the workloads.

Figure 11 measures their performance on our five comparison file systems and several NOVA-Fortis configurations, normalized to NOVA's throughput on NVDIMM-N. NOVA-Fortis outperforms xfs-DAX and ext4-DAX by between 3% and 4.4×. PMFS shows similar performance to NOVA on data-intensive workloads, but NOVA-Fortis outperforms it by a wide margin (up to 350×) on metadata-intensive workloads.


Figure 11: Application performance on NOVA-Fortis – Reliability overheads and the benefits of relaxed mode have less impact on applications than on microbenchmarks (Figures 8 and 9). The change is especially small for read-intensive workloads (e.g., webproxy), while databases and key-value stores see greater differences. The numbers above the bars give NOVA throughput in operations per second.

Application            Data size   Notes
Filebench-fileserver   64 GB       R/W ratio: 1:2
Filebench-varmail      32 GB       R/W ratio: 1:1
Filebench-webproxy     32 GB       R/W ratio: 5:1
Filebench-webserver    32 GB       R/W ratio: 10:1
RocksDB                8 GB        db_bench's overwrite test
MongoDB                10 GB       YCSB's 50/50-read/write workload
Exim                   4 GB        Mail server
SQLite                 400 MB      Insert operation
TPC-C                  26 GB       The 'Payment' query

Table 2: Application benchmarks

Btrfs provides reliability features similar to NOVA-Fortis', but it is slower: NOVA-Fortis with all its protection features enabled outperforms it by between 26% and 42×. NOVA-Fortis achieves the largest improvements on metadata-intensive workloads, such as varmail and Exim.

Adding metadata protection reduces performance by between 0 and 9%, and using the WP bit costs an additional 0.1% to 13.4%. Enabling all protection features reduces performance by between 2% and 38%, with write-intensive workloads seeing the larger drops. The figure also shows that the performance benefits of giving up atomicity in file operations ("Relaxed mode") are modest – no more than 6.4%.

RocksDB sees the biggest performance loss with all of NOVA-Fortis' protections enabled because it issues many non-page-aligned writes, which result in extra reads, writes, and checksum calculations during copy-on-write. Relaxed mode avoids these overheads, so it improves performance for RocksDB more than for the other workloads.

For the PCM configuration, fileserver, webserver, and RocksDB show the largest performance drops compared to NVDIMM-N. Fileserver and RocksDB are write-intensive and saturate PCM's write bandwidth, while webserver is read-intensive and PCM's read latency limits its performance. Btrfs outperforms the DAX file systems on RocksDB because this workload does not call fsync frequently, allowing Btrfs to leverage the page cache.

Compared to the other file systems, NOVA-Fortis is more sensitive to NVMM performance because it has lower software overhead and so reveals the underlying NVMM performance more directly. Overall, NOVA outperforms the other DAX file systems by 1.75× on average, and adding full protection reduces performance by 12% on average compared to NOVA.
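The copy-on-write penalty behind RocksDB's slowdown (and relaxed mode's advantage) can be made concrete. The sketch below is our simplification with stubbed-out primitives, not the actual NOVA-Fortis write path: even a 512 B overwrite becomes a 4 KB copy plus recomputation of the page's protection information, while relaxed mode simply writes in place:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE 4096

    /* Stubs standing in for the NVMM allocator, checksum/parity
     * maintenance, and the atomic log update; error handling elided. */
    static uint8_t *nvmm_alloc_page(void) { return malloc(PAGE); }
    static void update_csum_and_parity(const uint8_t *page) { (void)page; }
    static void log_commit_atomic(uint64_t pgoff, const uint8_t *page)
    {
        (void)pgoff; (void)page;
    }

    /* Copy-on-write path: a sub-page overwrite reads the whole old page,
     * merges the new bytes, recomputes checksums and parity for 4 KB,
     * and atomically points the inode's log at the new page. */
    static void cow_overwrite(const uint8_t *old_page, uint64_t pgoff,
                              const void *src, size_t off, size_t len)
    {
        uint8_t *new_page = nvmm_alloc_page();

        memcpy(new_page, old_page, PAGE);    /* extra 4 KB read and write */
        memcpy(new_page + off, src, len);    /* the actual small update   */
        update_csum_and_parity(new_page);    /* cost scales with 4 KB     */
        log_commit_atomic(pgoff, new_page);  /* old page stays valid until
                                                this commit point         */
    }

    /* Relaxed mode: in-place update with no atomicity guarantee and no
     * data protection to recompute. */
    static void relaxed_overwrite(uint8_t *page, const void *src,
                                  size_t off, size_t len)
    {
        memcpy(page + off, src, len);
    }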


5.5 NVMM Storage Utilization

Protecting data integrity introduces storage overheads. Figure 12 shows the breakdown of space among (meta)data structures in an aged, 64 GB NOVA-Fortis file system. Overall, NOVA-Fortis devotes 14.8% of storage space to improving reliability. Of this, metadata redundancy accounts for 2.1%, and data protection occupies 12.7%.

Figure 12: NVMM storage utilization – File data: 82.4%; file parity: 11.1%; file checksums: 1.56%; primary logs: 2.0%; primary inodes: 0.1%; replica logs: 2.0%; replica inodes: 0.1%; unused: 0.75%. Extra storage required for reliability is highlighted to the right. Protecting data is more expensive than protecting metadata, consuming 12.7% of storage compared to just 2.1% for metadata.
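The parity that dominates this overhead is a simple RAID-4/5-style XOR code. A minimal sketch (our illustration; we assume eight 512 B strips per 4 KB page) shows why it is cheap to compute and how a single corrupt strip, identified by a checksum mismatch, is rebuilt:

    #include <stddef.h>
    #include <stdint.h>

    #define STRIP   512
    #define NSTRIPS 8   /* assumed: eight 512 B strips per 4 KB page */

    /* Compute the XOR parity strip for one page. */
    static void page_parity(const uint8_t page[NSTRIPS][STRIP],
                            uint8_t parity[STRIP])
    {
        for (size_t b = 0; b < STRIP; b++) {
            uint8_t x = 0;
            for (size_t s = 0; s < NSTRIPS; s++)
                x ^= page[s][b];
            parity[b] = x;
        }
    }

    /* Rebuild one corrupt strip by XORing the parity with the
     * surviving strips. */
    static void recover_strip(const uint8_t page[NSTRIPS][STRIP],
                              const uint8_t parity[STRIP],
                              size_t bad, uint8_t out[STRIP])
    {
        for (size_t b = 0; b < STRIP; b++) {
            uint8_t x = parity[b];
            for (size_t s = 0; s < NSTRIPS; s++)
                if (s != bad)
                    x ^= page[s][b];
            out[b] = x;
        }
    }

One parity strip per page costs 1/8 of the file data footprint, which is roughly the scale of the "File parity" slice in Figure 12.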

6 CONCLUSION

We have used NOVA-Fortis to explore the unique challenges that improving NVMM file system reliability presents. The solutions that NOVA-Fortis implements facilitate backups by taking consistent snapshots of the file system and provide significant protection against media errors and corruption due to software errors.

The extra storage required to implement these changes is modest, but their performance impact is significant for some applications. In particular, checking and maintaining checksums and parity for file data incurs a steep cost for both reads and writes, despite our use of very fast (XOR parity) and hardware-accelerated (CRC) mechanisms. Providing atomicity for unaligned writes is also a performance bottleneck.

These costs suggest that NVMM file systems should provide users with a range of protection options that trade off performance against the level of protection and consistency. For instance, NOVA-Fortis can selectively disable checksum-based file data protection and the write protection mechanism, and relaxed mode disables copy-on-write.

Making these policy decisions rationally is currently difficult due to a lack of two pieces of information. First, the rate of uncorrectable media errors in emerging NVMM technologies is not publicly known. Second, the frequency and size of scribbles have not been studied in detail. Without a better understanding of these areas, it is hard to determine whether the costs of these techniques are worth the benefits they provide.

Despite these uncertainties, NOVA-Fortis demonstrates that NVMM file systems can provide strong reliability guarantees while providing high performance and supporting DAX-style mmap(). It also makes a clear case for developing specialized file systems and reliability mechanisms for NVMM rather than blithely adapting existing schemes: the challenges NVMMs present are different, different solutions are appropriate, and systems built with these differences in mind can be very fast and highly reliable.

ACKNOWLEDGMENTS

This work was supported by STARnet, a Semiconductor Research Corporation program, sponsored by MARCO and DARPA, and by Intel Corporation, Toshiba Corporation, and NSF Award 1629395. We would like to thank our shepherd, Michael Swift, and the anonymous SOSP reviewers for their insightful comments and suggestions. We are also thankful to Subramanya R. Dulloor from Intel for his support and help with accessing PMEP.

REFERENCES

[1] Jens Axboe. 2017. Flexible I/O Tester. https://github.com/axboe/fio.
[2] Katelin Bailey, Luis Ceze, Steven D. Gribble, and Henry M. Levy. 2011. Operating System Implications of Fast, Cheap, Non-volatile Memory. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems (HotOS '13). USENIX Association. http://dl.acm.org/citation.cfm?id=1991596.1991599
[3] Lakshmi N. Bairavasundaram, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Garth R. Goodson, and Bianca Schroeder. 2008. An Analysis of Data Corruption in the Storage Stack. ACM Transactions on Storage 4, 3, Article 8 (Nov. 2008). https://doi.org/10.1145/1416944.1416947
[4] Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy, and Jiri Schindler. 2007. An Analysis of Latent Sector Errors in Disk Drives. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '07). ACM, 289–300. https://doi.org/10.1145/1254882.1254917
[5] L. N. Bairavasundaram, M. Rungta, N. Agrawal, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and M. M. Swift. 2008. Analyzing the Effects of Disk-Pointer Corruption. In 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN). 502–511. https://doi.org/10.1109/DSN.2008.4630121
[6] Kumud Bhandari, Dhruva R. Chakrabarti, and Hans-J. Boehm. 2012. Implications of CPU Caching on Byte-addressable Non-volatile Memory Programming. HP Technical Report HPL-2012-236.
[7] Meenakshi Sundaram Bhaskaran, Jian Xu, and Steven Swanson. 2013. Bankshot: Caching Slow Storage in Fast Non-volatile Memory. In Proceedings of the 1st Workshop on Interactions of NVM/FLASH with Operating Systems and Workloads (INFLOW '13). ACM, Article 1. https://doi.org/10.1145/2527792.2527793


(INFLOW ’13). ACM, New York, NY, USA, Article 1, 9 pages. USENIX Conference on File and Storage Technologies (FAST 17). https://doi.org/10.1145/2527792.2527793 USENIX Association, Santa Clara, CA, 149–166. [8] Jeff Bonwick and Bill Moore. 2007. ZFS: The Last Word inFile [20] Robin Harris. 2016. Windows leaps into the NVM Systems. (2007). revolution. (2016). http://www.zdnet.com/article/ [9] Adrian M. Caulfield, Todor I. Mollov, Louis Alex Eisner, Arup windows-leaps-into-the-nvm-revolution/. De, Joel Coburn, and Steven Swanson. 2012. Providing safe, [21] IBM. 1999. Chipkill Memory. (1999). http://www-05.ibm. user space access to fast, solid state disks. In Proceedings of the com/hu/termekismertetok/xseries/dn/chipkill.pdf. seventeenth international conference on Architectural Support [22] Intel. 2015. NVDIMM Namespace Specification. (2015). http: for Programming Languages and Operating Systems (ASPLOS //pmem.io/documents/NVDIMM_Namespace_Spec.pdf. XVII). ACM, New York, NY, USA, 387–400. https://doi.org/10. [23] Intel. 2016. Intel 64 and IA-32 Architectures Software 1145/2150976.2151017 Developer’s Manual, Volume 3, Chapter 15. (2016). [10] Dave Chinner. 2015. xfs: updates for 4.2-rc1. (2015). http: https://software.intel.com/sites/default/files/managed/a4/60/ //oss.sgi.com/archives/xfs/2015-06/msg00478.html. 325384-sdm-vol-3abcd.pdf, Version December 2016. [11] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. [24] Intel. 2017. Intel Architecture Instruction Set Extensions Pro- Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. gramming Reference. (2017). https://software.intel.com/sites/ 2011. NV-Heaps: Making Persistent Objects Fast and Safe default/files/managed/0d/53/319433-022.pdf. with Next-generation, Non-volatile Memories. In Proceed- [25] Intel. 2017. Intel Optane Memory. (2017). http://www. ings of the Sixteenth International Conference on Architectural intel.com/content/www/us/en/architecture-and-technology/ Support for Programming Languages and Operating Systems optane-memory.html. (ASPLOS ’11). ACM, New York, NY, USA, 105–118. https: [26] Intel. 2017. Intel ships first Optane memory modules for test- //doi.org/10.1145/1950365.1950380 ing. (2017). http://www.pcworld.com/article/3162177/storage/ [12] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, intel-ships-first-optane-memory-modules-for-testing.html. Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. [27] Ryan Johnson, Ippokratis Pandis, Nikos Hardavellas, Anastasia 2009. Better I/O through byte-addressable, persistent memory. Ailamaki, and Babak Falsafi. 2009. Shore-MT: A Scalable Stor- In Proceedings of the ACM SIGOPS 22nd Symposium on Operat- age Manager for the Multicore Era. In Proceedings of the 12th ing Systems Principles (SOSP ’09). ACM, New York, NY, USA, International Conference on Extending Technology: 133–146. https://doi.org/10.1145/1629575.1629589 Advances in Database Technology (EDBT ’09). ACM, New York, [13] Thanh Do, Tyler Harter, Yingchao Liu, Haryadi S. Gunawi, An- NY, USA, 24–35. https://doi.org/10.1145/1516360.1516365 drea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2013. [28] Takayuki Kawahara. 2011. Scalable Spin-Transfer Torque RAM HARDFS: Hardening HDFS with Selective and Lightweight Technology for Normally-Off Computing. Design & Test of Versioning. In Presented as part of the 11th USENIX Confer- Computers, IEEE 28, 1 (Jan 2011), 52–63. https://doi.org/10. ence on File and Storage Technologies (FAST 13). USENIX, San 1109/MDT.2010.97 Jose, CA, 105–118. 
[20] Robin Harris. 2016. Windows Leaps into the NVM Revolution. http://www.zdnet.com/article/windows-leaps-into-the-nvm-revolution/.
[21] IBM. 1999. Chipkill Memory. http://www-05.ibm.com/hu/termekismertetok/xseries/dn/chipkill.pdf.
[22] Intel. 2015. NVDIMM Namespace Specification. http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf.
[23] Intel. 2016. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3, Chapter 15. https://software.intel.com/sites/default/files/managed/a4/60/325384-sdm-vol-3abcd.pdf, December 2016.
[24] Intel. 2017. Intel Architecture Instruction Set Extensions Programming Reference. https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf.
[25] Intel. 2017. Intel Optane Memory. http://www.intel.com/content/www/us/en/architecture-and-technology/optane-memory.html.
[26] Intel. 2017. Intel Ships First Optane Memory Modules for Testing. http://www.pcworld.com/article/3162177/storage/intel-ships-first-optane-memory-modules-for-testing.html.
[27] Ryan Johnson, Ippokratis Pandis, Nikos Hardavellas, Anastasia Ailamaki, and Babak Falsafi. 2009. Shore-MT: A Scalable Storage Manager for the Multicore Era. In Proceedings of the 12th International Conference on Extending Database Technology (EDBT '09). ACM, 24–35. https://doi.org/10.1145/1516360.1516365
[28] Takayuki Kawahara. 2011. Scalable Spin-Transfer Torque RAM Technology for Normally-Off Computing. IEEE Design & Test of Computers 28, 1 (Jan. 2011), 52–63. https://doi.org/10.1109/MDT.2010.97
[29] Ram Kesavan, Rohit Singh, Travis Grusecki, and Yuvraj Patel. 2017. Algorithms and Data Structures for Efficient Free Space Reclamation in WAFL. In 15th USENIX Conference on File and Storage Technologies (FAST 17). USENIX Association.
[30] A. Kolli, J. Rosen, S. Diestelhorst, A. Saidi, S. Pelley, S. Liu, P. M. Chen, and T. F. Wenisch. 2016. Delegated Persist Ordering. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–13. https://doi.org/10.1109/MICRO.2016.7783761
[31] Harendra Kumar, Yuvraj Patel, Ram Kesavan, and Sumith Makam. 2017. High Performance Metadata Integrity Protection in the WAFL Copy-on-Write File System. In 15th USENIX Conference on File and Storage Technologies (FAST 17). USENIX Association, 197–212. https://www.usenix.org/conference/fast17/technical-sessions/presentation/kumar
[32] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. 2009. Architecting Phase Change Memory as a Scalable DRAM Alternative. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09). ACM, 2–13. https://doi.org/10.1145/1555754.1555758


[33] P. J. Meaney, L. A. Lastras-Montaño, V. K. Papazova, E. Stephens, J. S. Johnson, L. C. Alves, J. A. O'Connor, and W. J. Clarke. 2012. IBM zEnterprise Redundant Array of Independent Memory Subsystem. IBM Journal of Research and Development 56, 1 (Jan. 2012), 43–53. https://doi.org/10.1147/JRD.2011.2177106
[34] Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu. 2015. A Large-Scale Study of Flash Memory Failures in the Field. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '15). ACM, 177–190. https://doi.org/10.1145/2745844.2745848
[35] J. Meza, Q. Wu, S. Kumar, and O. Mutlu. 2015. Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field. In 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 415–426. https://doi.org/10.1109/DSN.2015.57
[36] Micron. 2017. 3D XPoint Technology. http://www.micron.com/products/advanced-solutions/3d-xpoint-technology.
[37] Micron. 2017. Hybrid Memory: Bridging the Gap Between DRAM Speed and NAND Nonvolatility. http://www.micron.com/products/dram-modules/nvdimm.
[38] MongoDB, Inc. 2017. MongoDB. https://www.mongodb.com.
[39] Sanketh Nalli, Swapnil Haria, Mark D. Hill, Michael M. Swift, Haris Volos, and Kimberly Keeton. 2017. An Analysis of Persistent Memory Use with WHISPER. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17). ACM, 135–148. https://doi.org/10.1145/3037697.3037730
[40] Iyswarya Narayanan, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, and Kushagra Vaid. 2016. SSD Failures in Datacenters: What? When? And Why?. In Proceedings of the 9th ACM International Systems and Storage Conference (SYSTOR '16). ACM, Article 7. https://doi.org/10.1145/2928275.2928278
[41] Steven Pelley, Peter M. Chen, and Thomas F. Wenisch. 2014. Memory Persistency. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA '14). IEEE Press, 265–276. http://dl.acm.org/citation.cfm?id=2665671.2665712
[42] pmem.io. 2017. NVM Library. http://pmem.io/nvml.
[43] Vijayan Prabhakaran, Lakshmi N. Bairavasundaram, Nitin Agrawal, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2005. IRON File Systems. In The ACM Symposium on Operating Systems Principles (SOSP). ACM.
[44] S. Raoux, G. W. Burr, M. J. Breitwisch, C. T. Rettner, Y.-C. Chen, R. M. Shelby, M. Salinga, D. Krebs, S.-H. Chen, H.-L. Lung, and C. H. Lam. 2008. Phase-change Random Access Memory: A Scalable Technology. IBM Journal of Research and Development 52, 4.5 (July 2008), 465–479. https://doi.org/10.1147/rd.524.0465
[45] Ohad Rodeh, Josef Bacik, and Chris Mason. 2013. BTRFS: The Linux B-Tree Filesystem. ACM Transactions on Storage 9, 3, Article 9 (Aug. 2013). https://doi.org/10.1145/2501620.2501623
[46] Mendel Rosenblum and John K. Ousterhout. 1991. The Design and Implementation of a Log-structured File System. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles (SOSP '91). ACM, 1–15. https://doi.org/10.1145/121132.121137
[47] Stephen M. Rumble, Ankita Kejriwal, and John Ousterhout. 2014. Log-structured Memory for DRAM-based Storage. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST '14). USENIX, 1–16. https://www.usenix.org/conference/fast14/technical-sessions/presentation/rumble
[48] Arthur Sainio. 2016. NVDIMM: Changes are Here So What's Next?. In In-Memory Computing Summit 2016.
[49] Bianca Schroeder, Sotirios Damouras, and Phillipa Gill. 2010. Understanding Latent Sector Errors and How to Protect Against Them. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST '10). USENIX Association. http://dl.acm.org/citation.cfm?id=1855511.1855517
[50] Bianca Schroeder and Garth A. Gibson. 2007. Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?. In USENIX Conference on File and Storage Technologies (FAST).
[51] Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. 2016. Flash Reliability in Production: The Expected and the Unexpected. In 14th USENIX Conference on File and Storage Technologies (FAST 16). USENIX Association, 67–80. http://usenix.org/conference/fast16/technical-sessions/presentation/schroeder
[52] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. 2009. DRAM Errors in the Wild: A Large-scale Field Study. In ACM SIGMETRICS.
[53] SQLite. 2017. SQLite. https://www.sqlite.org.
[54] Vilas Sridharan, Nathan DeBardeleben, Sean Blanchard, Kurt B. Ferreira, Jon Stearley, John Shalf, and Sudhanva Gurumurthi. 2015. Memory Errors in Modern Systems: The Good, The Bad, and The Ugly. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM.
[55] V. Sridharan and D. Liberty. 2012. A Study of DRAM Failures in the Field. In 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 1–11. https://doi.org/10.1109/SC.2012.13
[56] Dmitri B. Strukov, Gregory S. Snider, Duncan R. Stewart, and R. Stanley Williams. 2008. The Missing Memristor Found. Nature 453, 7191 (2008), 80–83.


[57] Vasily Tarasov, Erez Zadok, and Spencer Shepler. 2016. Filebench: A Flexible Framework for File System Benchmarking. USENIX ;login: 41 (2016).
[58] Stephen C. Tweedie. 1998. Journaling the Linux ext2fs Filesystem. In LinuxExpo '98: Proceedings of the 4th Annual Linux Expo.
[59] UEFI Forum. 2017. Advanced Configuration and Power Interface Specification. http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf.
[60] Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, and Roy Campbell. 2011. Consistent and Durable Data Structures for Non-volatile Byte-addressable Memory. In Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST '11). USENIX Association.
[61] Haris Volos, Andres Jaan Tack, and Michael M. Swift. 2011. Mnemosyne: Lightweight Persistent Memory. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '11). ACM.
[62] Dejan Vučinić, Qingbo Wang, Cyril Guyot, Robert Mateescu, Filip Blagojević, Luiz Franca-Neto, Damien Le Moal, Trevor Bunker, Jian Xu, Steven Swanson, and Zvonimir Bandić. 2014. DC Express: Shortest Latency Protocol for Reading Phase Change Memory over PCI Express. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST '14). USENIX, 309–315. https://www.usenix.org/conference/fast14/technical-sessions/presentation/vucinic
[63] Matthew Wilcox. 2014. Add Support for NV-DIMMs to Ext4. https://lwn.net/Articles/613384/.
[64] Jian Xu and Steven Swanson. 2016. NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories. In 14th USENIX Conference on File and Storage Technologies (FAST 16). USENIX Association, 323–338. https://www.usenix.org/conference/fast16/technical-sessions/presentation/xu
[65] Jisoo Yang, Dave B. Minturn, and Frank Hady. 2012. When Poll Is Better than Interrupt. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST '12). USENIX. http://dl.acm.org/citation.cfm?id=2208461.2208464
[66] Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2010. End-to-end Data Integrity for File Systems: A ZFS Case Study. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST '10). USENIX Association. http://dl.acm.org/citation.cfm?id=1855511.1855514
[67] Jishen Zhao, Sheng Li, Doe Hyun Yoon, Yuan Xie, and Norman P. Jouppi. 2013. Kiln: Closing the Performance Gap Between Systems With and Without Persistence Support. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, 421–432. https://doi.org/10.1145/2540708.2540744
