ReconFS: A Reconstructable File System on Flash Storage

Youyou Lu, Jiwu Shu*, Wei Wang
Department of Computer Science and Technology, Tsinghua University
Tsinghua National Laboratory for Information Science and Technology
{luyy09, wangwei11}@mails.tsinghua.edu.cn
*Corresponding author: [email protected]

Abstract

Hierarchical namespaces (directory trees) in file systems are effective in indexing file system data. However, the update patterns of namespace metadata, such as intensive writeback and scattered small updates, exaggerate the writes to flash storage dramatically, which hurts both performance and endurance (i.e., the limited program/erase cycles of flash memory) of the storage system.

In this paper, we propose a reconstructable file system, ReconFS, to reduce namespace metadata writeback size while providing hierarchical namespace access. ReconFS decouples the volatile and persistent directory tree maintenance. Hierarchical namespace access is emulated with the volatile directory tree, and the consistency and persistence of the persistent directory tree are provided using two mechanisms in case of system failures. First, consistency is ensured by embedding an inverted index in each page, eliminating the writes of the pointers (the indexing of the directory tree). Second, persistence is guaranteed by compacting and logging the scattered small updates to the metadata persistence log, so as to reduce write size. The inverted indices and logs are used respectively to reconstruct the structure and the content of the directory tree during reconstruction. Experiments show that ReconFS provides up to 46.3% performance improvement and 27.1% write reduction compared to ext2, a file system with low metadata overhead.

1 Introduction

In recent years, flash memory has been gaining popularity in storage systems for its high performance, low power consumption and small size [11, 12, 13, 19, 23, 28]. However, flash memory has limited program/erase (P/E) cycles, and its reliability weakens as the P/E cycles approach the limit, which is known as the endurance problem [10, 14, 17, 23]. The recent trend of denser flash memory, which increases storage capacity through multi-level cell (MLC) or triple-level cell (TLC) technologies, makes the endurance problem even worse [17].

File system design has evolved slowly over the past few decades, yet it has a marked impact on the I/O behavior of the storage subsystem. Recent studies have proposed to revisit the namespace structure of file systems, e.g., flexible indexing for search-friendly file systems [33] and table-structured metadata management for better metadata access performance [31]. Meanwhile, leveraging the internal storage management of the flash translation layer (FTL) of solid state drives (SSDs) to improve storage management efficiency has also been discussed [19, 23, 25, 37]. But namespace management also impacts flash-based storage performance and endurance, especially under metadata-intensive workloads, and this has not been well researched.

Namespace metadata are intensively written back to persistent storage for system consistency or persistence guarantees [18, 20]. Since the no-overwrite property of flash memory requires updates to be written to free pages, frequent writeback introduces a large dynamic update size (i.e., the total write size of the free pages that are used). Even worse, a single file system operation may scatter updates across different metadata pages (e.g., the create operation writes both the inode and the directory entry), and the average update size to each metadata page is far less than one page (e.g., an inode in ext2 is only 128 bytes), so a whole page has to be written even though only a small part of it is updated. Both the endurance and the performance of flash storage systems are therefore hurt by these frequent and scattered small metadata writes.

To address these problems, we propose a reconstructable file system, ReconFS, which provides a volatile hierarchical namespace and relaxes the writeback requirements. ReconFS decouples the maintenance of the volatile and persistent directory trees.

Metadata pages are written back to their home locations only when they are evicted or checkpointed (i.e., the operation that makes the persistent directory tree the same as the volatile directory tree) from main memory. Consistency and persistence of the persistent directory tree are guaranteed using two new mechanisms. First, we use the embedded connectivity mechanism to embed an inverted index in each page and to track the unindexed pages. Since the namespace is tree-structured, the inverted indices are used for directory tree structure reconstruction. Second, we log the differential updates of each metadata page to the metadata persistence log and compact them into fewer pages; we call this the metadata persistence logging mechanism. These logs are used for directory tree content updates during reconstruction.

Fortunately, flash memory properties can be leveraged to keep the overhead of the two mechanisms low. First, page metadata, the spare space alongside each flash page, is used to store the inverted index. The inverted index is accessed atomically with its page data without extra overhead [10]. Second, unindexed pages are tracked in the unindexed zone by limiting new allocations to a continuous logical space. The address mapping table in the FTL redirects the writes to different physical pages, so performance is not affected even though the logical layout is changed. Third, high random read performance makes compact logging possible, as the reads of the corresponding base pages are fast during recovery. As such, ReconFS gains performance and endurance benefits with rather low overhead.

Our contributions are summarized as follows:
• We propose a reconstructable file system design that avoids the high overhead of maintaining a persistent directory tree and emulates hierarchical namespace access using a volatile directory tree in memory.
• We provide namespace consistency by embedding an inverted index with the indexed data and eliminating the pointer update in the parent node (in the directory tree view), so as to reduce writeback frequency.
• We provide metadata persistence by logging and compacting the dirty parts of multiple metadata pages to the metadata persistence log; the compact form reduces metadata writeback size.
• We implement ReconFS based on ext2 and evaluate it against different file systems, including ext2, ext3, btrfs and F2FS. Results show up to a 46.3% performance increase and a 27.1% endurance improvement compared to ext2, a file system with low metadata overhead.

The rest of this paper is organized as follows. Section 2 gives the background on flash memory and namespace management. Section 3 describes the ReconFS design, including the decoupled volatile and persistent directory tree maintenance, the embedded connectivity and metadata persistence logging mechanisms, and the reconstruction. We present the implementation in Section 4 and evaluate ReconFS in Section 5. Related work is given in Section 6, and the conclusion is made in Section 7.

2 Background

2.1 Flash Memory Basics

Programming in flash memory is performed in one direction: flash memory cells need to be erased before being overwritten. The read/write unit is a flash page (e.g., 4KB), and the erase unit is a flash block (e.g., 64 pages). Each flash page has a spare area for storing per-page metadata, which is called the page metadata or out-of-band (OOB) area [10]. The page metadata is used to store error correction codes (ECC), and exposing the page metadata to software has been proposed in the NVMe standard [6].

Flash translation layers (FTLs) are used in flash-based solid state drives (SSDs) to export a block interface [10]. FTLs translate the logical page numbers used by software to physical page numbers in flash memory. The address mapping hides the no-overwrite property from the system software. FTLs also perform garbage collection to reclaim space and wear leveling to extend the lifetime of the device.

Flash-based SSDs provide higher bandwidth and IOPS than hard disk drives (HDDs) [10]. Multiple chips are connected through multiple channels inside an SSD, providing high aggregate bandwidth through internal parallelism, and the absence of mechanical moving parts gives an SSD high IOPS. Endurance is another element that makes flash-based SSDs different from HDDs [10, 14, 17, 23]. Each flash memory cell has limited program/erase (P/E) cycles, and as the P/E cycles approach the limit, the reliability of each cell drops dramatically. As such, endurance is a critical issue in system designs for flash-based storage.

2.2 Hierarchical Namespaces

Directory trees have been used in file systems for over three decades to manage data in a hierarchical way. But hierarchical namespaces introduce high overhead to provide consistency and persistence for the directory tree. Also, static metadata organization amplifies the metadata write size.

Namespace Consistency and Persistence. Directories and files are indexed in a tree structure, the directory tree. Each page uses pointers to index its children in the directory tree. To keep the directory tree consistent, the page that holds the pointer and the pointed-to page should be updated atomically. Different mechanisms, such as journaling [4, 7, 8, 34, 35] and copy-on-write (COW) [2, 32], are used to provide atomicity, but they introduce a large amount of extra writes. In addition, persistence requires the pointers to be in a durable state even after power failures, which demands in-time writeback of these pages. This increases the writeback frequency, which also has a negative impact on endurance.

In this paper, we focus on the consistency of the directory tree, i.e., metadata consistency. Data consistency can be achieved by incorporating transactional flash techniques [22, 23, 28, 29].

Metadata Organization. Namespace metadata are clustered and stored in the storage media, which we refer to as static compacting. Static compacting is commonly used in file systems. In ext2, the index nodes in each block group are stored contiguously; since each index node is 128 bytes in ext2, a 4KB page can store as many as 32 inodes. Directory entries are organized in a similar way, except that each directory entry is of variable length. Multiple directory entries with the same parent directory may share the same directory entry page. This kind of metadata organization improves metadata performance on hard disk drives, as the metadata can be easily located.

Unfortunately, this kind of metadata organization does not address the endurance problem. For each file system operation, multiple metadata pages may be written but with only small parts updated in each page. E.g., a file create operation creates an inode in the inode page and writes a directory entry to the directory entry page. Since flash-based storage is written in units of pages, the write amount is exaggerated when comparing the sum of all updated pages' sizes (from the view of the storage device) with the updated metadata size (from the view of file system operations).

3 Design

ReconFS is designed to reduce the writes to flash storage while providing hierarchical namespace access. In this section, we first present the overall design of ReconFS, including the decoupled volatile and persistent directory tree maintenance and the four types of metadata writeback. We then describe the two mechanisms, embedded connectivity and metadata persistence logging, which provide consistency and persistence of the persistent directory tree with reduced writes, respectively. Finally, we discuss ReconFS reconstruction.

Figure 1: ReconFS Framework (the volatile directory tree resides in main memory; buffer-eviction/checkpoint-induced and consistency-induced writeback go to the ReconFS storage, which holds the persistent directory tree and persistent data pages, while persistence-induced writeback goes to the metadata persistence log).

3.1 Overview of ReconFS

ReconFS decouples the maintenance of the volatile and persistent directory trees. ReconFS emulates a volatile directory tree in main memory to provide hierarchical namespace access. Metadata pages are updated in the volatile directory tree without being written back to the persistent directory tree. While the reduced writeback benefits both the performance and the endurance of flash storage, consistency and persistence of the persistent directory tree need to be provided in case of unexpected system failures. Instead of writing metadata pages directly back to their home locations, ReconFS either embeds the inverted index with the indexed data for namespace consistency, or compacts and writes back the scattered small updates in a log-structured way.

As shown in Figure 1, ReconFS is composed of three parts: the Volatile Directory Tree, the ReconFS Storage, and the Metadata Persistence Log. The Volatile Directory Tree manages namespace metadata pages in main memory to provide hierarchical namespace access. The ReconFS Storage is the persistent storage of the ReconFS file system; it stores both the data and the metadata, including the persistent directory tree. The Metadata Persistence Log is a continuously allocated space in persistent storage which is mainly used for metadata persistence.

3.1.1 Decoupled Volatile and Persistent Directory Tree Maintenance

Since ReconFS emulates hierarchical namespace access in main memory using a volatile directory tree, three issues arise. First, namespace metadata pages need replacement when memory pressure is high. Second, namespace consistency is not guaranteed if the system crashes before the namespace metadata are written back in time. Third, updates to the namespace metadata may get lost after unexpected system failures.

For the first issue, ReconFS writes the namespace metadata back to their home locations in ReconFS storage when they are evicted from the buffer, which we call write-back on eviction. This guarantees that any metadata in persistent storage that do not have copies in main memory are the latest.

Therefore, there are three kinds of metadata in persistent storage (denoted as M_disk): the up-to-date metadata written back on eviction (denoted as M_up-to-date), the untouched metadata that have not been read into memory (denoted as M_untouched), and the obsolete metadata that have copies in memory (denoted as M_obsolete). Note that M_obsolete includes pages that have either dirty or clean copies in memory. Let M_vdt and M_pdt respectively be the namespace metadata of the volatile and persistent directory trees, and M_memory be the volatile namespace metadata in main memory; we have

    M_vdt = M_memory + M_up-to-date + M_untouched,
    M_pdt = M_disk = M_obsolete + M_up-to-date + M_untouched.

Since M_up-to-date and M_memory are the latest, M_vdt is the latest. In contrast, M_pdt is not up to date, as ReconFS does not write back the metadata that still have copies in main memory. Volatile metadata are written back to their home locations in three cases: (1) file system unmount, (2) unindexed zone switch (Section 3.2), and (3) log truncation (Section 3.3). We call the operation that makes M_pdt = M_vdt the checkpoint operation. When the volatile directory tree is checkpointed on unmount, it can be reconstructed for the next system boot by directly reading the persistent directory tree.

The second and third issues arise from unexpected system crashes, in which case M_vdt ≠ M_pdt. The writeback of namespace metadata not only provides namespace connectivity for updated files or directories, but also keeps the descriptive metadata in metadata pages (e.g., the owner and access control list in an inode) up to date. The second issue is caused by the loss of connectivity; to overcome it, ReconFS embeds an inverted index in each page for connectivity reconstruction (Section 3.2). The third issue comes from the loss of metadata updates; it is addressed by logging the metadata that need persistence (e.g., on fsync) to the metadata persistence log (Section 3.3). In this way, the metadata of the volatile directory tree can be reconstructed even after system crashes, by first reconstructing connectivity and then applying the descriptive metadata updates.

3.1.2 Metadata Writeback

Metadata writeback to persistent storage, including the file system storage and the metadata persistence log, can be classified into four types as follows:
• Buffer eviction induced writeback: Metadata pages that are evicted due to memory pressure are written back to their home locations, so that these pages can be read directly for later accesses without looking up the logs.
• Checkpoint induced writeback: Metadata pages are written back to their home locations by checkpoint operations, in order to reduce the reconstruction overhead.
• Consistency induced writeback: Writeback of pointers (used as the indices) is eliminated by embedding an inverted index with the indexed data on the flash storage, so as to reduce the writeback frequency.
• Persistence induced writeback: Metadata pages written back due to persistence requirements are compacted and logged to the metadata persistence log in a compact form to reduce the metadata writeback size.

3.2 Embedded Connectivity

Namespace consistency is one of the reasons why namespace metadata need frequent writeback to persistent storage. With the normal indexing of a directory tree, as shown in the left half of Figure 2, the pointer and the pointed-to page of each link should be written back atomically for namespace consistency in each metadata operation. This not only requires the two pages to be updated but also demands journaling or ordered updates for consistency. Instead, ReconFS provides namespace consistency using inverted indexing, which embeds the inverted index with the indexed data, as shown in the right half of Figure 2. Since the pointer is embedded with the pointed-to page, consistency is easily achieved, and both the journal writes and the pointer updates are eliminated. In this way, embedded connectivity lowers the frequency of metadata writeback and ensures metadata consistency.

Figure 2: Normal Indexing (left) and Inverted Indexing (right) in a Directory Tree (inodes, directory entries and data pages linked by forward pointers on the left, and by embedded back-pointers on the right).

Embedded Inverted Index: In a directory tree, there are two kinds of links: links from directory entries to inodes (dirent-inode links) and links from inodes to data pages (inode-data links). Since directory entries are stored as data pages of directories in Unix/Linux, links from inodes to directory entries are classified as inode-data links. For an inode-data link, the inverted index is the inode number and the data's location (i.e., the offset and length) in the file or directory. Since this inverted index is only a few bytes, it is stored in the page metadata of each flash page.
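To make the contrast with Figure 2 concrete, the sketch below is our own illustration (not ReconFS source code; the types, array sizes and field widths are assumptions) of why an embedded inverted index removes the extra page update: with forward pointers, linking a new child dirties the parent page as well, while the inverted index travels inside the child page's own out-of-band area and is persisted by the same flash program operation that writes the data.

```c
#include <stdint.h>

#define PAGE_SIZE 4096

/* Forward (normal) indexing: the parent holds pointers to its children.
 * Adding a child therefore dirties TWO pages -- the child itself and the
 * parent page storing the pointer -- and both (plus a journal record)
 * must reach flash atomically. */
struct parent_page {
    uint64_t child_lba[PAGE_SIZE / sizeof(uint64_t)]; /* forward pointers */
};

/* Inverted indexing: the child carries a back-pointer to its owner in the
 * per-page metadata (OOB) area, so only ONE page write is needed and the
 * index is persisted atomically with the data it describes. */
struct page_oob_index {          /* fits in the spare (OOB) area          */
    uint32_t ino;                /* owning inode number                   */
    uint32_t off;                /* offset of this page in the file       */
    uint32_t len;                /* valid data length in this page        */
};

struct child_page {
    char                  data[PAGE_SIZE];
    struct page_oob_index oob;   /* written together with data[]          */
};
```

During reconstruction, the directory tree structure is rebuilt by scanning these back-pointers, as described in Section 3.4.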

For a dirent-inode link, the inverted index is the file or directory name and its inode number. Because the name is of variable length and is difficult to fit into the page metadata, an operation record, which is composed of the inverted index, the inode content and the operation type, is generated and stored in the metadata persistence log. The operation type in the operation record is set to 'creation' for create operations and to 'link' for hard link operations. During reconstruction, the 'link' type does not invalidate previous creation records, while the 'creation' type does.

An inverted index is also associated with a version number for identifying the correct version in case an inode number or directory entry is reused. When an inode number or a directory entry is reused after being deleted, pages that belonged to the deleted file or directory may still reside in persistent storage with their inverted indices, and during reconstruction these pages might wrongly be regarded as valid. To avoid this ambiguity, each directory entry is extended with a version number, and each inode is extended with the version pair <V_born, V_cur>, which indicates the liveness of the inode. V_born is the version number at the time the inode is created or reused. For a delete operation, V_born is set to V_cur plus one; because all pages at that time have version numbers no larger than V_cur, all data pages of the deleted inode become invalid. As with the create and hard link operations, a delete operation generates a deletion record and appends it to the metadata persistence log, which is used to disconnect the inode from the directory tree and invalidate all its children pages.

Unindexed Zone: Pages whose indices have not been written back are not accessible in the directory tree after system failures. These pages are called unindexed pages and need to be tracked for reconstruction. ReconFS divides the logical space into several zones and restricts writes to one zone in each stage. This zone is called the unindexed zone, and it tracks all unindexed pages of one stage. A stage is the time period during which the unindexed zone is used for allocation; when the zone is used up, the unindexed zone is switched to another, and before the switch a checkpoint operation is performed to write the dirty indices back to their home locations. Restricting writes to the unindexed zone incurs little performance penalty, because the FTL inside an SSD remaps logical addresses to physical addresses: the data layout in the logical space view has little impact on system performance, while the data layout in the physical space view is what matters.

In addition to namespace connectivity, bitmap writeback is another source of frequent metadata persistence. Bitmap updates are frequently written back to keep space allocation consistent. ReconFS keeps only the volatile bitmap in main memory, which is used for logical space allocation, and does not keep the persistent bitmap up to date. Once the system crashes, the bitmaps are reconstructed. Since new allocations are performed only in the unindexed zone, the bitmap of the unindexed zone is reconstructed using the valid and invalid statuses of its pages. Bitmaps in other zones are only updated when pages are deleted, and those updates can be reconstructed using the deletion records in the metadata persistence log.

3.3 Metadata Persistence Logging

Metadata persistence causes frequent metadata writeback. The scattered small update pattern of this writeback amplifies the metadata writes, which are written back in units of pages. Instead of using static compacting (as mentioned in Section 2), ReconFS dynamically compacts the metadata updates and writes them to the metadata persistence log. While static compacting requires the metadata updates to be written back to their home locations, dynamic compacting is able to cluster the small updates in a compact form: it writes only the dirty parts rather than whole pages, so as to reduce write size.

In metadata persistence logging, writeback is triggered when persistence is needed, e.g., on an explicit synchronization or when the pdflush daemon wakes up. The metadata persistence logging mechanism keeps track of the dirty parts of each metadata page in main memory and compacts those parts into the logs:
• Memory Dirty Tagging: For each metadata operation, metadata pages are first updated in main memory. ReconFS records the location metadata (i.e., the offset and the length) of the dirty parts of each updated metadata page. The location metadata are attached to the buffer head of the metadata page to track the dirty parts of each page.
• Writeback Compacting: During writeback, ReconFS traverses multiple metadata pages and appends their dirty parts to the log pages. Each dirty part has its location metadata (i.e., the base page address, and the offset and length within the page) attached in the head of each log page.

Log truncation is needed when the metadata persistence log runs short of space. Instead of merging the small updates in the log with the base metadata pages, ReconFS performs a checkpoint operation to write all dirty metadata pages back to their home locations. To mitigate the writeback cost, the checkpoint operation is performed asynchronously by a writeback daemon, which starts when the free log space drops below a pre-defined threshold. As such, the log is truncated without costly merging operations.
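As a concrete illustration of the dirty tagging and compacting steps above, the following sketch (ours, not ReconFS code; the list representation and the merge rule are assumptions consistent with the description) keeps one list of (off, len) extents per buffered metadata page, merges overlapping extents on insertion, and reports the number of bytes that would be appended to the metadata persistence log instead of a full 4KB page.

```c
#include <stdint.h>
#include <stdlib.h>

/* One dirty byte range inside a buffered metadata page. */
struct dirty_ext {
    uint32_t off, len;
    struct dirty_ext *next;
};

/* Record that [off, off+len) of the page was modified, merging with any
 * existing extent it overlaps or touches so the list stays minimal. */
static void tag_dirty(struct dirty_ext **list, uint32_t off, uint32_t len)
{
    uint32_t end = off + len;
    struct dirty_ext **pp = list, *e;

    while ((e = *pp) != NULL) {
        if (off <= e->off + e->len && e->off <= end) {  /* overlap/adjacent */
            if (e->off < off)
                off = e->off;
            if (e->off + e->len > end)
                end = e->off + e->len;
            *pp = e->next;                              /* absorb and retry */
            free(e);
            continue;
        }
        pp = &e->next;
    }
    e = malloc(sizeof(*e));
    if (e == NULL)
        abort();                 /* sketch: no graceful OOM handling */
    e->off = off;
    e->len = end - off;
    e->next = *list;
    *list = e;
}

/* Bytes that writeback compacting would append for this page:
 * only the dirty extents, not the whole 4KB page. */
static uint32_t compact_size(const struct dirty_ext *list)
{
    uint32_t total = 0;
    for (; list != NULL; list = list->next)
        total += list->len;
    return total;
}
```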

Multi-page Update Atomicity. Multi-page update atomicity is needed for an operation record whose size is larger than one page (e.g., a file creation operation with a 4KB file name). To keep the metadata operation consistent, these pages need to be updated atomically. Single-page update atomicity is already guaranteed in flash storage, because the no-overwrite property of flash memory requires the page to be written in a new place, followed by an atomic mapping entry update in the FTL mapping table.

Multi-page update atomicity is achieved simply with a flag bit in each page. Since a metadata operation record is written in continuously allocated log pages, atomicity is achieved by tagging the end of these pages: the last page is tagged with flag '1', and the others are tagged with '0'. The bit is stored in the head of each log page; it is set when the log page is written back and does not require extra writes. During recovery, the flag bit '1' is used to determine atomicity: pages between two '1's belong to complete operations, while pages at the log tail without an ending '1' belong to an incomplete operation. In this way, multi-page update atomicity is achieved.

3.4 ReconFS Reconstruction

During normal shutdowns, the volatile directory tree is checkpointed to the persistent directory tree in persistent storage, which is simply read into main memory to reconstruct the volatile directory tree at the next system start. But once the system crashes, ReconFS needs to reconstruct the volatile directory tree using the metadata recorded by the embedded connectivity and metadata persistence logging mechanisms. Since the persistent directory tree is the checkpoint of the volatile directory tree taken when the unindexed zone is switched or the log is truncated, all page allocations are performed in the unindexed zone and all metadata changes have been logged to the metadata persistence log. Therefore, ReconFS only needs to update the directory tree by scanning the unindexed zone and the metadata persistence log. ReconFS reconstruction includes:
1. File/directory reconstruction: Each page in the unindexed zone is connected to its index node using its inverted index. Then the version number in each page's inverted index is checked against the <V_born, V_cur> pair in its index node. If it matches, the page is indexed into the file or directory; otherwise, the page is discarded because it has been invalidated. After this, all pages, including file data pages and directory entry pages, are indexed to their index nodes.
2. Directory tree connectivity reconstruction: The metadata persistence log is scanned to find dirent-inode links. These links are used to connect the corresponding inodes to the directory tree, so as to update the directory tree structure.
3. Directory tree content update: Log records in the metadata persistence log are used to update the metadata pages in the directory tree, so the content of the directory tree is brought up to date.
4. Bitmap reconstruction: The bitmap of the unindexed zone is rebuilt by checking the valid status of each page, which can be identified using the version numbers. Bitmaps in other zones are unchanged except for deleted pages; those bitmaps are updated using the deletion and truncation log records.
After the reconstruction, the obsolete metadata pages in the persistent directory tree have been updated to the latest versions, and the recently allocated pages have been indexed into the directory tree. The volatile directory tree is thus reconstructed to provide hierarchical namespace access.

4 Implementation

ReconFS is implemented based on the ext2 file system in Linux kernel 3.10.11. ReconFS shares both the on-disk and in-memory data structures of ext2 but modifies the namespace metadata writeback flows.

In the volatile directory tree, ReconFS employs two dirty flags for each metadata buffer page: persistence dirty and checkpoint dirty. Persistence dirty is tagged for writeback to the metadata persistence log; checkpoint dirty is tagged for writeback to the persistent directory tree. Both are set when the buffer page is updated. The persistence dirty flag is cleared only when the metadata page is written to the metadata persistence log for metadata persistence. The checkpoint dirty flag is cleared only when the metadata is written back to its home location. ReconFS uses the two dirty flags to separate metadata persistence (the metadata persistence log) from metadata organization (the persistent directory tree).

In embedded connectivity, the inverted indices for inode-data and dirent-inode links are stored in different ways. The inverted index of an inode-data link is stored in the page metadata of each flash page. It has the form (ino, off, len, ver), in which ino is the inode number, off and len are the offset and the valid data length in the file or directory, respectively, and ver is the version number of the inode.

Figure 3: An Inverted Index for an Inode-Data Link (the page metadata of a data flash page stores (ino, off, len, ver), pointing back to the inode page that holds the <V_born, V_cur> pair).
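A minimal sketch of this per-page index and of the version check used in step 1 of the reconstruction (Section 3.4) is shown below. This is our illustration rather than the actual on-disk format: the field widths, and the assumption that a page is live exactly when its ver falls inside [V_born, V_cur], are inferred from the description above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Inverted index kept in the page metadata (OOB area) of a data page,
 * as described for inode-data links: (ino, off, len, ver). */
struct oob_inverted_index {
    uint32_t ino;   /* inode this page belongs to                      */
    uint32_t off;   /* offset of the page's data in the file/directory */
    uint32_t len;   /* valid data length in the page                   */
    uint32_t ver;   /* inode version number at the time of the write   */
};

/* Liveness information kept in the inode: <V_born, V_cur>.
 * V_born is bumped past V_cur when the inode number is reused, which
 * implicitly invalidates every page written for the previous owner.  */
struct inode_version {
    uint32_t v_born;
    uint32_t v_cur;
};

/* Reconstruction, step 1: connect a scanned page to its inode only if the
 * page's version matches the inode's live range; otherwise discard it.   */
static bool page_is_live(const struct oob_inverted_index *idx,
                         const struct inode_version *iv)
{
    return idx->ver >= iv->v_born && idx->ver <= iv->v_cur;
}
```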

The inverted index of a dirent-inode link is stored as a log record with the record type set to 'creation' in the metadata persistence log. The log record contains both the directory entry and the inode content, and keeps an (off, len, lba, ver) tuple for each of them, where lba is the logical block address of the base metadata page. The log record acts as the inverted index for the inode and is used to reconnect it to the directory tree. The unindexed zone in ReconFS is built by clustering multiple ext2 block groups: ReconFS limits new allocations to these block groups, which thereby form the unindexed zone. The addresses of these block groups are kept in the file system super block and are made persistent on each zone switch.

In metadata persistence logging, ReconFS tags the dirty parts of each metadata page using a linked list, as shown in Figure 4. Each node in the linked list is an (off, len) pair indicating which part is dirty. Before each insertion, the list is checked to merge overlapping dirty parts. The persistent log record also associates the record type, the version number ver and the logical block address lba of each metadata page with the linked-list pairs, followed by the dirty content. In the current implementation, ReconFS writes the metadata persistence log as a file in the root file system. Checkpoint is performed on file system unmount, unindexed zone switch or log truncation. Checkpoint for file system unmount is performed when the unmount command is issued, while checkpoint for the other two is triggered when the free space in the unindexed zone or the metadata persistence log drops below 5%.

Figure 4: Dirty Tagging in Main Memory (each inode or directory entry page keeps a linked list of (off, len) pairs for its dirty parts; the directory entry page's record additionally carries the operation type, e.g., creation).

Reconstruction of ReconFS is performed in three phases:
1. Scan Phase: Page metadata from all flash pages in the unindexed zone and log records from the metadata persistence log are read into memory. After this, the addresses of all metadata pages that appear in either of them are collected, and these metadata pages are then read into memory.
2. Zone Processing Phase: In the unindexed zone, each flash page is connected to its inode using the inverted index in its page metadata. The structures of files and directories are reconstructed, but they may still contain obsolete pages.
3. Log Processing Phase: Each log record is used either to connect a file or directory to the directory tree or to update metadata page content. For a creation or hard link log record, the directory entry is updated for the inode. For a deletion or truncation log record, the corresponding bitmaps are read and updated. The other log records are used to update page content. Finally, the versions in the pages and inodes are checked to discard obsolete pages, files and directories.
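The log records parsed in the Scan and Log Processing phases can be pictured with the layout below. This is an illustrative sketch built from the fields named in Section 4 (type, ver, lba and the (off, len) pairs, plus the end-flag bit from Section 3.3); the concrete numeric values, field widths and record-type names are our assumptions, not the exact on-disk format.

```c
#include <stdint.h>

/* Record types appended to the metadata persistence log (Sections 3.2-3.3). */
enum log_rec_type {
    LOG_CREATION = 1,   /* dirent-inode link created (create)        */
    LOG_LINK     = 2,   /* additional hard link                      */
    LOG_DELETION = 3,   /* unlink/truncate: invalidates child pages  */
    LOG_UPDATE   = 4,   /* compacted dirty parts of a metadata page  */
};

/* Header stored at the head of a log page/record.  The end_flag bit
 * implements multi-page atomicity: the last page of a record carries 1,
 * earlier pages carry 0, so a trailing run without a closing 1 is an
 * incomplete record and is dropped during recovery.                    */
struct log_rec_header {
    uint8_t  type;       /* enum log_rec_type                          */
    uint8_t  end_flag;   /* 1 on the record's last log page            */
    uint16_t n_extents;  /* number of (off, len) pairs that follow     */
    uint32_t ver;        /* inode/directory-entry version number       */
    uint64_t lba;        /* home location of the base metadata page    */
};

struct log_extent {
    uint32_t off;        /* offset of the dirty bytes in the base page */
    uint32_t len;        /* length of the dirty bytes                  */
};
/* ...followed in the log by n_extents dirty byte ranges, in order.     */
```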

5 Evaluation

We evaluate the performance and endurance of ReconFS against previous file systems, including ext2, ext3, btrfs and F2FS, and aim to answer the following four questions:
1. How does ReconFS compare with previous file systems in terms of performance and endurance?
2. What kinds of operations gain more benefit from ReconFS? What are the benefits from embedded connectivity and from metadata persistence logging?
3. What is the impact of changes in memory size?
4. What is the overhead of checkpoint and reconstruction in ReconFS?
In this section, we first describe the experimental setup before answering the above questions.

5.1 Experimental Setup

We implement ReconFS in Linux kernel 3.10.11, and evaluate the performance and endurance of ReconFS against the file systems listed in Table 1.

Table 1: File Systems
  ext2        a traditional file system without journaling
  ext3        a traditional journaling file system (journaled version of ext2)
  btrfs [2]   a recent copy-on-write (COW) file system
  f2fs [12]   a recent log-structured file system optimized for flash

We use four workloads from the filebench benchmark [3]. They emulate different types of servers. The operations and read-write ratio [21] of each workload are as follows:
• fileserver emulates a file server, which performs a sequence of create, delete, append, read, write and attribute operations. The read-write ratio is 1:2.
• webproxy emulates a web proxy server, which performs a mix of create-write-close, open-read-close and delete operations, as well as log appends. The read-write ratio is 5:1.
• varmail emulates a mail server, which performs a set of create-append-sync, read-append-sync, read and delete operations. The read-write ratio is 1:1.
• webserver emulates a web server, which performs open-read-close operations, as well as log appends. The read-write ratio is 10:1.

Experiments are carried out on Fedora 10 using Linux kernel 3.10.11, and the computer is equipped with a 4-core 2.50GHz processor and 12GB memory. We evaluate all file systems on a 128GB SSD, whose specification is shown in Table 2. All file systems are mounted with default options.

Table 2: SSD Specification
  Capacity                128 GB
  Seq. Read Bandwidth     260 MB/s
  Seq. Write Bandwidth    200 MB/s
  Rand. Read IOPS (4KB)   17,000
  Rand. Write IOPS (4KB)  5,000

5.2 System Comparison

5.2.1 Overall Comparison

We evaluate the performance of all file systems by measuring the throughput reported by the benchmark, and the endurance by measuring the write size to storage. The write size to storage is collected from the block-level trace using the blktrace tool [1].

Figure 5: System Comparison on Performance (throughput normalized to ext2).

Figure 6: System Comparison on Endurance (write size normalized to ext2).

Figure 5 shows the throughput normalized to the throughput of ext2 to evaluate the performance. As shown in the figure, ReconFS is among the best of all file systems for all evaluated workloads, and gains a performance improvement of up to 46.3% over ext2 for varmail, the metadata intensive workload. For read intensive workloads, such as webproxy and webserver, the evaluated file systems do not show a big difference. But for write intensive workloads, such as fileserver and varmail, they show different performance. Ext2 shows comparatively higher performance than the other file systems excluding ReconFS. Both ext3 and btrfs provide namespace consistency with different mechanisms, e.g., waiting until the data reach persistent storage before writing back the metadata, but with poorer performance compared to ext2. F2FS, the file system with data layout optimized for flash, shows performance comparable to ext2, but has inferior performance in the varmail workload, which is metadata intensive and has frequent fsyncs. Comparatively, ReconFS achieves the performance of ext2 in all evaluated workloads, nearly the best performance of all the previous file systems, and is even better than ext2 in the varmail workload. Moreover, ReconFS provides namespace consistency with embedded connectivity, while ext2 does not.

Figure 6 shows the write size to storage normalized to that of ext2 to evaluate the endurance. From the figure, we can see that ReconFS effectively reduces the metadata write size and reduces the total write size by up to 27.1% compared to ext2. As with performance, the endurance of ext2 is the best of all file systems excluding ReconFS. Meanwhile, ext3, btrfs and F2FS use journaling or copy-on-write to provide consistency, which introduces extra writes; for instance, btrfs has a write size 9 times as large as that of ext2 in the fileserver workload. ReconFS provides namespace consistency using embedded connectivity without incurring extra writes, and further reduces write size by compacting metadata writeback. As shown in the figure, ReconFS shows a write size reduction of 18.4%, 7.9% and 27.1% even compared with ext2, for the fileserver, webproxy and varmail workloads respectively.

5.2.2 Performance

To understand the performance impact of ReconFS, we evaluate four different operations that have to update the index node page and/or the directory entry page: file creation, deletion, append, and append with fsyncs. They are evaluated using micro-benchmarks.

The file creation and deletion benchmarks create or delete 100K files spread over 100 directories. An fsync is performed following each creation. The append benchmark appends 4KB pages to a file, and it issues an fsync every 1,000 append operations (one fsync per 4MB) or every 10 append operations (one fsync per 40KB) for evaluating append and append with fsyncs, respectively.

Figure 7: Performance Evaluation of Operations (file create, delete, append and append with fsync; throughput in op/s, log scale for create and delete).

Figure 7 shows the throughput of the four operations. ReconFS shows a significant throughput increase for file creation and for append with fsyncs. File creation throughput in ReconFS doubles that of ext2. This is because only one log page is appended to the metadata persistence log, while multiple pages need to be written back in ext2. The other file systems have even worse file creation performance due to their consistency overheads. File deletion in ReconFS also shows better performance than the others. File append throughput in ReconFS almost equals that of ext2 for append operations with one fsync per 1,000 appends. But file append (with fsyncs) throughput in ext2 drops dramatically as the fsync frequency increases from 1/1000 to 1/10, as it does in the other journaling or log-structured file systems. In comparison, file append (with fsyncs) throughput in ReconFS only drops to half of its previous value. When the fsync frequency is 1/10, ReconFS has file append throughput 5 times better than ext2 and orders of magnitude better than the other file systems.

5.2.3 Endurance

To further investigate the endurance benefits of ReconFS, we measure the write size of ext2, of ReconFS without log compacting (denoted as ReconFS-EC) and of ReconFS.

Figure 8: Endurance Evaluation for Embedded Connectivity and Metadata Persistence Logging (write size normalized to ext2).

Figure 8 shows the write sizes of the three file systems. We compare the write sizes of ext2 and ReconFS-EC to evaluate the benefit from embedded connectivity, since ReconFS-EC implements embedded connectivity but without log compacting. From the figure, we observe that the fileserver workload shows a remarkable drop in write size from ext2 to ReconFS-EC. The benefit mainly comes from the intensive file creates and appends in the fileserver workload, which otherwise require index pointers to be updated for namespace connectivity; embedded connectivity in ReconFS eliminates updates to these index pointers. We also compare the write sizes of ReconFS-EC and ReconFS to evaluate the benefit from log compacting in metadata persistence logging. As shown in the figure, ReconFS shows a large write reduction for the varmail workload. This is because frequent fsyncs reduce the effects of buffering; in other words, the updates to metadata pages are small when they are written back. As a result, log compacting gains more improvement than in the other workloads.

Figure 9: Distribution of Buffer Page Writeback Size (dirty bytes per written-back page, bucketed into (0,1024), [1024,2048), [2048,3072), [3072,4096) and [4096,inf)).

Figure 9 also shows the distribution of buffer page writeback size, which is the size of the dirty parts in each page. As shown in the figure, over 99.9% of the dirty data per page in the metadata writeback of the varmail workload is less than 1KB, due to frequent fsyncs, while the other workloads have fractions varying from 7.3% to 34.7% for dirty sizes less than 1KB.

In addition, we calculate the compact ratio by dividing the compact write size by the full page update size, as shown in Table 3 (for varmail, 117,235 KB / 3,060,116 KB ≈ 3.83%). The compact ratio of the varmail workload is as low as 3.83%.

Table 3: Comparison of Full-Write and Compact-Write
  Workloads    Full Write Size (KB)   Comp. Write Size (KB)   Compact Ratio
  fileserver   108,143                48,624                  44.96%
  webproxy     45,133                 21,325                  47.25%
  varmail      3,060,116              117,235                 3.83%
  webserver    374                    143                     38.36%

5.3 Impact of Memory Size

To study the impact of memory size, we set the memory size to 1, 2, 3, 7 and 12 gigabytes (we limit the memory size to 1, 2, 4, 8 and 12 gigabytes in GRUB; the recognized memory sizes, shown in /proc/meminfo, are 997, 2,005, 3,012, 6,980 and 12,044 megabytes, respectively) and measure both the performance and the endurance of all evaluated file systems. We measure performance in operations per second (ops/s), and endurance in bytes per operation (bytes/op), obtained by dividing the total write size by the number of operations. Results of the webproxy and webserver workloads are not shown due to space limitations, as they are read intensive and show little difference between file systems.

Figure 10: Memory Size Impact on Performance and Endurance ((a) performance, fileserver; (b) performance, varmail; (c) endurance, fileserver; (d) endurance, varmail).

Figure 10 (a) shows the throughput of the fileserver workload for all file systems under different memory sizes. As shown in the figure, ReconFS gains more when the memory size becomes larger, in which case data pages are written back less frequently and the writeback of metadata pages has a larger impact. When the memory size is small and memory pressure is high, the impact of data writes dominates, and ReconFS has poorer performance than F2FS, which has an optimized data layout. When the memory size increases, the impact of the metadata writes increases. Little improvement is gained in ext3 and btrfs when the memory size increases from 7GB to 12GB; in contrast, ReconFS and ext2 gain significant improvement thanks to their low metadata overhead and approach the performance of F2FS. Figure 10 (c) shows the endurance, measured in bytes per operation, of fileserver; ReconFS has a comparable or smaller write size than the other file systems.

Figure 10 (b) shows the throughput of the varmail workload. Performance is stable under different memory sizes, and ReconFS achieves the best performance. This is because the varmail workload is metadata intensive and has frequent fsync operations. Figure 10 (d) shows the endurance of the varmail workload; ReconFS achieves the best endurance of all file systems.

5.4 Reconstruction Overhead

We measure the unmount time to evaluate the overhead of checkpoint, which writes back all dirty metadata to make the persistent directory tree equivalent to the volatile directory tree, as well as the reconstruction time.

Unmount Time. We use the time command to measure unmount operations and report the elapsed time it gives.

Figure 11: Unmount Time (Immediate Unmount).

Figure 12: Unmount Time (Unmount after 90s).

Figure 11 shows the unmount time when the unmount is performed immediately after each benchmark completes. The read intensive workloads, webproxy and webserver, have unmount times of less than one second for all file systems. But the write intensive workloads have varying unmount times across file systems: the unmount time of ext2 is 46 seconds, while that of ReconFS is 58 seconds. All unmount times are less than one minute, and they include the time used for both data and metadata writeback. Figure 12 shows the unmount time when the unmount is performed 90 seconds after each benchmark completes. All of them are less than one second, and ReconFS does not show a noticeable difference from the others.

Figure 13: Recovery Time (scan time and processing time for each workload).

Reconstruction Time. Reconstruction time has two main parts: scan time and processing time. The scan time includes the time of the unindexed zone scan and the log scan. The scan is a sequential read, whose performance is bounded by the device bandwidth. The processing time is the time used to read the base metadata pages in the directory tree to be updated, in addition to the recovery logic processing time. As shown in Figure 13, the scan time is 48 seconds for an 8GB zone on the SSD, and the processing time is around one second. The scan time is expected to shrink on PCIe SSDs; e.g., the scan time for a 32GB zone on a PCIe SSD with 3GB/s bandwidth is around ten seconds. Therefore, with high read bandwidth and IOPS, the reconstruction of ReconFS can complete in tens of seconds.

6 Related Work

File System Namespace. File system namespaces have long been studied for efficient and effective namespace metadata management. Relational database or table-based technologies have been used to manage namespace metadata for either consistency or performance. The Inversion file system [26] manages namespace metadata using the POSTGRES database system to provide transaction protection and crash recovery for the metadata. TableFS [31] stores namespace metadata in LevelDB [5] to improve metadata access performance by leveraging the log-structured merge tree (LSM-tree) [27] implemented in LevelDB.

The hierarchical structure of the namespace has also been revisited to provide more flexible, semantic access. The Semantic file system [16] removes the tree-structured namespace and accesses files and directories using attributes. hFAD [33] proposes a similar approach, which prefers a search-friendly file system to a hierarchical file system. Pilot [30] takes an even more aggressive approach and eliminates all indexing in the file system: files are accessed only through a 64-bit universal identifier (UID), and Pilot does not provide tree-structured file access. Comparatively, ReconFS removes only the indexing in persistent storage to lower the metadata cost, and it emulates tree-structured file access using the volatile directory tree.

Backpointers and Inverted Indices. Backpointers have been used in storage systems for different purposes.

BackLog [24] uses backpointers in data blocks to reduce the pointer updates needed when data blocks are moved by advanced file system features, such as snapshots and clones. NoFS [15] uses backpointers for consistency checking on each read. Both of them use backpointers as an assistant to enable new functions, whereas ReconFS uses backpointers (inverted indices) as the only indexing, without forward pointers.

In flash-based SSDs, a backpointer (e.g., the logical page address) is stored in the page metadata of each flash page and is accessed atomically with the page data, in order to recover the FTL mapping table [10]: on each device boot, all pages are scanned, and the FTL mapping table is recovered using the backpointers. OFSS [23] uses backpointers in page metadata in a similar way: OFSS uses an object-based FTL, and the backpointer in each page records the information of the object, which is used to delay the persistence of the object index. ReconFS extends the use of backpointers in flash storage to file system namespace management. Instead of maintaining the indexing (forward pointers), ReconFS embeds only the reverse index (backward pointers) with the indexed data, and the reverse indices are used for reconstruction once the system fails unexpectedly.

File System Logging. File systems have used logging in two different ways. One is journaling, which updates metadata and/or data in the journaling area before updating them in their home locations, and is widely used in modern file systems to provide file system consistency [4, 7, 8, 34, 35]. Log-structured file systems use logging in the other way [32]: they write all data and metadata in a log, making random writes sequential for better performance.

ReconFS employs logging for metadata persistence. Unlike journaling or log-structured file systems, which require tracking of valid and invalid pages for checkpointing and garbage cleaning, the metadata persistence log in ReconFS is simply discarded after the writeback of all volatile metadata. ReconFS also enables compact logging, because the base metadata pages can be read quickly during reconstruction thanks to the high random read performance of flash storage.

File Systems on Flash-based Storage. In addition to embedded flash file systems [9, 36], researchers have proposed new general-purpose file systems for flash storage. DFS [19] directly manages flash memory by leveraging functions (e.g., block allocation, atomic update) provided by FusionIO's ioDrive. Nameless Write [37] also removes the space allocation function from the file system and leverages the FTL space management for space allocation. OFSS [23] proposes to directly manage flash memory using an object-based FTL, in which the object indexing, free space management and data layout can be optimized for flash memory characteristics. F2FS [12] is a promising log-structured file system designed for flash storage; it optimizes the data layout in flash memory, e.g., through hot/cold data grouping. But these file systems have paid little attention to the high overhead of namespace metadata, which are frequently written back in a scattered, small write pattern. ReconFS is the first to address the namespace metadata problem on flash storage.

7 Conclusion

Properties of namespace metadata, such as intensive writeback and scattered small updates, make the overhead of namespace management high on flash storage in terms of both performance and endurance. ReconFS removes the maintenance of the persistent directory tree and emulates hierarchical access using a volatile directory tree. ReconFS is reconstructable after unexpected system failures using both the embedded connectivity and metadata persistence logging mechanisms. Embedded connectivity enables directory tree structure reconstruction by embedding the inverted index with the indexed data; with the elimination of pointer updates to parent pages (in the directory tree), consistency maintenance is simplified and the writeback frequency is reduced. Metadata persistence logging provides persistence for metadata pages, and the logged metadata are used for directory tree content reconstruction; since only the dirty parts of metadata pages are logged and compacted in the logs, the writeback size is reduced. Reconstruction is fast due to the high bandwidth and IOPS of flash storage. Through this new namespace management, ReconFS improves both the performance and the endurance of flash-based storage systems without compromising consistency or persistence.

Acknowledgments

We would like to thank our shepherd Remzi Arpaci-Dusseau and the anonymous reviewers for their comments and suggestions. This work is supported by the National Natural Science Foundation of China (Grant No. 61232003, 60925006), the National High Technology Research and Development Program of China (Grant No. 2013AA013201), Shanghai Key Laboratory of Scalable Computing and Systems, Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology, Huawei Technologies Co. Ltd., and the Tsinghua University Initiative Scientific Research Program.


References

[1] blktrace(8) - Linux man page. http://linux.die.net/man/8/blktrace.

[2] Btrfs. http://btrfs.wiki.kernel.org.

[3] Filebench benchmark. http://sourceforge.net/apps/mediawiki/filebench/index.php?title=Main_Page.

[4] Journaled file system technology for Linux. http://jfs.sourceforge.net/.

[5] LevelDB, a fast and lightweight key/value database library by Google. https://code.google.com/p/leveldb/.

[6] The NVM Express standard. http://www.nvmexpress.org.

[7] ReiserFS. http://reiser4.wiki.kernel.org.

[8] XFS: A high-performance journaling filesystem. http://oss.sgi.com/projects/xfs/.

[9] Yaffs. http://www.yaffs.net.

[10] Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark S. Manasse, and Rina Panigrahy. Design tradeoffs for SSD performance. In Proceedings of the 2008 USENIX Annual Technical Conference (USENIX '08), 2008.

[11] David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. FAWN: A fast array of wimpy nodes. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP '09), 2009.

[12] Neil Brown. An F2FS teardown. http://lwn.net/Articles/518988/.

[13] Adrian M. Caulfield, Laura M. Grupp, and Steven Swanson. Gordon: Using flash memory to build fast, power-efficient clusters for data-intensive applications. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIV), 2009.

[14] Feng Chen, Tian Luo, and Xiaodong Zhang. CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST '11), 2011.

[15] Vijay Chidambaram, Tushar Sharma, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Consistency without ordering. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST '12), 2012.

[16] David K. Gifford, Pierre Jouvelot, Mark A. Sheldon, and James W. O'Toole, Jr. Semantic file systems. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles (SOSP '91), 1991.

[17] Laura M. Grupp, John D. Davis, and Steven Swanson. The bleak future of NAND flash memory. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST '12), 2012.

[18] Tyler Harter, Chris Dragga, Michael Vaughn, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. A file is not a file: Understanding the I/O behavior of Apple desktop applications. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP '11), 2011.

[19] William K. Josephson, Lars A. Bongo, David Flynn, and Kai Li. DFS: A file system for virtualized flash storage. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST '10), 2010.

[20] Hyojun Kim, Nitin Agrawal, and Cristian Ungureanu. Revisiting storage for smartphones. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST '12), 2012.

[21] Eunji Lee, Hyokyung Bahn, and Sam H. Noh. Unioning of the buffer cache and journaling layers with non-volatile memory. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST '13), 2013.

[22] Youyou Lu, Jiwu Shu, Jia Guo, Shuai Li, and Onur Mutlu. LightTx: A lightweight transactional design in flash-based SSDs to support flexible transactions. In Proceedings of the 31st IEEE International Conference on Computer Design (ICCD '13), 2013.

[23] Youyou Lu, Jiwu Shu, and Weimin Zheng. Extending the lifetime of flash-based storage through reducing write amplification from file systems. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST '13), 2013.

[24] Peter Macko, Margo I. Seltzer, and Keith A. Smith. Tracking back references in a write-anywhere file system. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST '10), 2010.

[25] David Nellans, Michael Zappe, Jens Axboe, and David Flynn. ptrim() + exists(): Exposing new FTL primitives to applications. In the 2nd Annual Non-Volatile Memory Workshop, 2011.

[26] Michael A. Olson. The design and implementation of the Inversion file system. In USENIX Winter, 1993.

[27] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351-385, 1996.

[28] Xiangyong Ouyang, David Nellans, Robert Wipfel, David Flynn, and Dhabaleswar K. Panda. Beyond block I/O: Rethinking traditional storage primitives. In Proceedings of the 17th IEEE International Symposium on High Performance Computer Architecture (HPCA '11), 2011.

[29] Vijayan Prabhakaran, Thomas L. Rodeheffer, and Lidong Zhou. Transactional flash. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI '08), 2008.

[30] David D. Redell, Yogen K. Dalal, Thomas R. Horsley, Hugh C. Lauer, William C. Lynch, Paul R. McJones, Hal G. Murray, and Stephen C. Purcell. Pilot: An operating system for a personal computer. Communications of the ACM, 23(2):81-92, 1980.

[31] Kai Ren and Garth Gibson. TableFS: Enhancing metadata efficiency in the local file system. In Proceedings of the 2013 USENIX Annual Technical Conference (USENIX '13), 2013.

[32] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26-52, 1992.

[33] Margo I. Seltzer and Nicholas Murphy. Hierarchical file systems are dead. In Proceedings of the 12th Workshop on Hot Topics in Operating Systems (HotOS XII), 2009.

[34] Stephen Tweedie. Ext3, journaling filesystem. In Ottawa Linux Symposium, 2000.

[35] Stephen C. Tweedie. Journaling the Linux ext2fs filesystem. In The Fourth Annual Linux Expo, 1998.

[36] David Woodhouse. JFFS2: The journalling flash file system, version 2. http://sourceware.org/.

[37] Yiying Zhang, Leo Prasath Arulraj, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. De-indirection for flash-based SSDs with nameless writes. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST '12), 2012.