FStream: Managing Flash Streams in the File System

Eunhee Rho, Kanchan Joshi, Seung-Uk Shin, Nitesh Jagadeesh Shetty, Joo-Young Hwang, Sangyeun Cho, Daniel DG Lee, and Jaeheon Jeong, Samsung Electronics Co., Ltd. https://www.usenix.org/conference/fast18/presentation/rho

This paper is included in the Proceedings of the 16th USENIX Conference on File and Storage Technologies. February 12–15, 2018 • Oakland, CA, USA ISBN 978-1-931971-42-3


Abstract

The performance and lifespan of a solid-state drive (SSD) depend not only on the current input workload but also on its internal media fragmentation formed over time, as stale data are spread over a wide range of physical space in an SSD. The recently proposed streams give the host system a means to control how data are placed on the physical media (abstracted by a stream) and effectively reduce media fragmentation. This work proposes FStream, a file system approach to taking advantage of this facility. FStream extracts streams at the file system level and avoids complex application-level data mapping to streams. Experimental results show that FStream enhances filebench performance by 5%∼35% and reduces WAF (Write Amplification Factor) by 7%∼46%. For a NoSQL database benchmark, performance is improved by up to 38% and WAF is reduced by up to 81%.

1 Introduction

Solid-state drives (SSDs) are rapidly replacing hard disk drives (HDDs) in enterprise data centers. SSDs maintain the traditional logical block device abstraction with help from internal software, commonly known as the flash translation layer (FTL). The FTL allows SSDs to substitute for HDDs without complex modification of the block device interface of an OS.

Prior work has revealed, however, that this compatibility comes at a cost; when the underlying media is fragmented as a device ages, the operational efficiency of the SSD deteriorates dramatically due to garbage collection overheads [11, 13]. More specifically, a user write I/O translates into an amplified amount of actual media writes [9], which shortens device lifetime and hampers performance. The ratio of the actual media writes to the user I/O is called the write amplification factor (WAF).

A large body of prior work has been undertaken to address the write amplification problem and the SSD wear-out issue [3, 7]. To the same end, we focus on how to take advantage of the multi-streamed SSD mechanism [8]. This mechanism opens up a way to dictate data placement on an SSD's underlying physical media, abstracted by streams. In principle, if the host system perfectly maps data having the same lifetime to the same streams, an SSD's write amplification becomes one, completely eliminating the media fragmentation problem.

Prior works have revealed two strategies to leverage streams. The first strategy maps application data to disparate streams based on an understanding of the expected lifetime of those data. For example, files in different levels of a log-structured merge tree could be assigned to separate streams. Case studies show that this strategy works well for NoSQL databases like Cassandra and RocksDB [8, 14]. Unfortunately, this application-level customization strategy requires that a system designer understand her target application's internal working fairly well, remaining a challenge to the designer.

The other strategy aims to "automate" the process of mapping write I/O operations to an SSD stream with no application changes. For example, the recently proposed AutoStream scheme assigns a stream to each write request based on estimated lifetime from past LBA access patterns [15]. However, this scheme has not been proven to work well under complex workload scenarios, particularly when the workload changes dynamically. Moreover, LBA-based pattern detection is not practical when file data are updated in an out-of-place manner, as in copy-on-write and log-structured file systems. The above sketched strategies capture the two extremes in the design space: application-level customization vs. block-level full automation.

In this work, we take another strategy, where we separate streams at the file system layer.

Our approach is motivated by the observation that file system metadata and journal data are short-lived and are good targets for separation from user data. Naturally, the primary component of our scheme, when applied to a journaling file system like ext4 or xfs, is to allocate a separate stream for metadata and journal data, respectively. As a corollary component of our scheme, we also propose to separate a database's redo/undo log file as a distinct stream at the file system layer. We implement our scheme, FStream, in the ext4 and xfs file systems and perform experiments using a variety of workloads and a stream-capable NVMe (NVM Express) SSD. Our experiments show that FStream robustly achieves a near-optimal WAF (close to 1) across the workloads we examined. We make the following contributions in this work.

• We provide automated multi-streaming of different types of file-system-generated data with respect to their lifetime;
• We enhance the existing journaling file systems, ext4 and xfs, to use multi-streamed SSDs with minimally invasive changes; and
• We achieve stream classification for application data using file-system-layer information.

The remainder of this paper is organized as follows. First, we describe the background of our study in Section 2, with respect to the problems of previous multi-stream schemes. Section 3 describes FStream and its implementation details. We show experimental results in Section 4 and conclude in Section 5.

2 Background

2.1 Write-amplification in SSDs

Flash memory has an inherent characteristic of erase-before-program. A write, also called a "program" operation, happens at the granularity of a NAND page. A page cannot be rewritten unless it is "erased" first. In-place update is not possible due to this erase-before-write characteristic. Hence, an overwrite is handled by placing the data in a new page and invalidating the previous one. The erase operation is done in the unit of a NAND block, which is a collection of multiple NAND pages. Before erasing a NAND block, all its valid pages need to be copied out elsewhere; this is done in a process called garbage collection (GC). These valid page movements cause additional writes that consume bandwidth, thereby leading to performance degradation and fluctuation. These additional writes also reduce endurance, as program-erase cycles are limited for NAND blocks. One way to measure GC overheads is the write amplification factor (WAF), defined as the ratio of writes performed on flash memory to writes requested from the host system:

    WAF = (amount of writes committed to flash) / (amount of writes that arrived from the host)

WAF may soar more often than not as an SSD experiences aging [8].

2.2 Multi-streamed SSD

The multi-streamed SSD [8] endeavors to keep WAF in check by focusing on the placement of data. It allows the host to pass a hint (stream ID) along with write requests. The stream ID provides a means to convey hints about the lifetime of the data being written [5]. The multi-streamed SSD groups data having the same stream ID to be stored in the same NAND block. By avoiding mixing data of different lifetimes, fragmentation inside NAND blocks is reduced. Figure 1 shows an example of how data placement using multi-stream could reduce media fragmentation.

Figure 1: Data placement comparison for two write sequences. Here H=Hot, C=Cold, and we assume there are four pages per flash block.

Multi-stream is now a part of the T10 (SCSI) standard and under discussion in the NVMe (NVM Express) working group. The NVMe 1.3 specification introduces support for multi-stream in the form of "directives" [1]. An NVMe write command has the provision to carry a stream ID.

2.3 Leveraging Streams

While the multi-streamed SSD provides the facility of segregating data into streams, its benefit largely depends upon how well the streams are leveraged by the host. Identifying what should and should not go into the same stream is of cardinal importance for maximum benefit. Previous work [8, 14] shows the benefits of application-assigned streams. This approach has the benefit of determining data lifetime accurately, but it involves modifying the source code of the target application, leading to increased deployment effort. Also, when multiple applications try to assign streams, a centralized stream assignment is required to avoid conflicts. Instead of direct assignment of stream IDs, Linux (from the 4.13 kernel) recently supports an fcntl() interface to send data lifetime hints to file systems to exploit multi-streamed SSDs [2, 6].
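As a concrete illustration of that interface (the generic Linux write-hint mechanism mentioned above, not part of FStream itself), the sketch below tags a file as short-lived from userspace. F_SET_RW_HINT and RWH_WRITE_LIFE_SHORT come from <linux/fcntl.h> on kernels 4.13 and later; the file name is arbitrary.

    /* Minimal sketch: attach a short-lifetime write hint to a log-style file
     * using fcntl(F_SET_RW_HINT) (Linux >= 4.13). Older kernels return EINVAL. */
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>

    #ifndef F_SET_RW_HINT
    #define F_SET_RW_HINT (1024 + 12)   /* F_LINUX_SPECIFIC_BASE + 12 */
    #endif
    #ifndef RWH_WRITE_LIFE_SHORT
    #define RWH_WRITE_LIFE_SHORT 2
    #endif

    int main(void)
    {
        int fd = open("commitlog.bin", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }

        uint64_t hint = RWH_WRITE_LIFE_SHORT;   /* data expected to be overwritten soon */
        if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
            perror("fcntl(F_SET_RW_HINT)");

        /* subsequent writes on this inode inherit the lifetime hint */
        write(fd, "log entry\n", 10);
        close(fd);
        return 0;
    }

The kernel and a stream-capable device may then map such hints to streams; FStream, by contrast, derives this information inside the file system without application involvement.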

AutoStream [15] takes stream management to the NVMe device driver layer. It monitors requests from file systems and estimates data lifetime. However, only limited information (e.g., request size, block addresses) is available in the driver layer, and even worse, the address-based algorithm may be ineffective under a copy-on-write file system.

Our approach, FStream, implements stream management intelligence at the file system layer. File systems readily have the information about file-system-generated data such as metadata and the journal. To detect the lifetime of user data, we take a simple yet efficient method that uses a file's name or extension. For the sake of brevity, we do not cover the estimation of user data lifetime at the file system layer in further detail.

3 FStream

We start by highlighting the motivation behind employing multi-stream in file systems. This is followed by an overview of the ext4 and xfs on-disk layouts and journaling methods. Then we delve into the details of Ext4Stream and XFStream, which are stream-aware variants of ext4 and xfs, respectively.

3.1 Motivation

Applications can have better knowledge about the lifetime and update frequency of the data that they write than file systems do. However, applications do not know about the lifetime and update frequency of file system metadata. The file system metadata usually have different update frequencies than applications' data, and are often influenced by the on-disk layout and write policy of a particular file system. Typically, file systems keep data and metadata logically separated, but they may not remain "physically" separated on SSDs. While carrying out file operations, metadata writes may get mixed with data writes in the same NAND block, or one type of metadata may get mixed with another type of metadata. File systems equipped with stream separation capability may reduce the mixing of applications' data and file system metadata, and improve WAF and performance.

3.2 Ext4 metadata and journaling

The ext4 file system divides the disk region into multiple equal-size regions called "block groups," as shown in Figure 2. Each block group contains data and their related metadata together, which helps reduce seeks on HDDs. Ext4 introduces the flex-bg feature, which "clubs" a series of block groups whose metadata are consolidated in the first block group. Each file/directory requires an inode, which is of size 256 bytes by default. These inodes are stored in the inode table. The inode bitmap and block bitmap are used for allocation of inodes and data blocks, respectively. The group descriptor contains the location of the other metadata regions (inode table, block bitmap, inode bitmap) within the block group. Another type of metadata is a directory block.

Figure 2: Ext4 on-disk layout. For simplicity, we have not shown flex-bg.

File-system consistency is achieved through write-ahead logging in the journal. The journal is a special file whose blocks reside in the user data area, pre-allocated at the time of file system format. Ext4 has three journaling modes: data-writeback (metadata journaling), data-ordered (metadata journaling + write data before metadata), and data-journal (data journaling). The default mode is data-ordered.

Figure 3: Ext4 journal in ordered mode. Ext4 writes data and journal in sequence. Metadata blocks are written to their actual home location after they are persisted to the journal.

Ext4 journals at a block granularity, i.e., even if only a few bytes of an inode are changed, the entire block (typically 4KiB) containing many inodes is journaled. For journaling, it takes assistance from another component called the journaling block device (JBD), which has its own kernel thread called jbd2. The journal area is written in a circular fashion. Figure 3 shows the journaling operation in the ordered mode. During a transaction, ext4 updates metadata in in-memory buffers and informs the jbd2 thread to commit the transaction. jbd2 maintains a timer (default 5 seconds), on expiry of which it writes the modified metadata into the journal area, along with transaction-related bookkeeping data. Once the changes have been made durable, the transaction is considered committed. Then, the metadata changes in memory are flushed to their original locations by write-back threads, which is called checkpointing.
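To make the ordered-mode sequence concrete, the following schematic models the three-step flow of Figure 3: data blocks reach their home location first, the metadata updates are then committed to the circular journal by a jbd2-like thread, and only afterwards are the metadata blocks checkpointed to their actual locations. This is an illustrative model only, not ext4 or jbd2 code, and all names are invented for the example.

    /* Schematic model of ext4's data-ordered journaling (cf. Figure 3). */
    #include <stdio.h>

    static void write_data_blocks(int txn)     /* step 1: data to final location */
    {
        printf("txn %d: write data blocks to their home location\n", txn);
    }

    static void commit_to_journal(int txn)     /* step 2: jbd2-style commit */
    {
        printf("txn %d: write metadata + descriptor blocks to journal (sequential)\n", txn);
        printf("txn %d: transaction durable once the commit record is on disk\n", txn);
    }

    static void checkpoint_metadata(int txn)   /* step 3: deferred write-back */
    {
        printf("txn %d: flush metadata blocks to their home locations\n", txn);
    }

    int main(void)
    {
        for (int txn = 1; txn <= 2; txn++) {
            write_data_blocks(txn);
            commit_to_journal(txn);            /* triggered by the jbd2 timer (default 5s) */
        }
        checkpoint_metadata(1);                /* checkpointing can lag behind commits */
        checkpoint_metadata(2);
        return 0;
    }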

3.3 Xfs metadata and journaling

Similar to ext4, xfs also divides the disk region into multiple equal-size regions, called allocation groups, as shown in Figure 4. The primary objective of allocation groups is to increase parallelism rather than disk locality, unlike ext4 block groups. Each allocation group maintains its own superblock and other structures for free space management and inode allocation, thereby allowing parallel metadata operations. Free space management within an allocation group is done by using B+ trees. Inodes are allocated in chunks of 64. These chunks are managed in another B+ tree meant exclusively for inode allocation.

Figure 4: Xfs on-disk layout.

For transaction safety, xfs implements metadata journaling. A separate region called the "log" is created during file system creation (mkfs.xfs). The log is written in a circular fashion as transactions are performed. Xfs maintains many log buffers (default 8) in memory, which can record the changes for multiple transactions. The default commit interval is 30 seconds. During commit, modified log buffers are written to the on-disk log area. Post commit, modified metadata buffers are scheduled for flushing to their actual disk locations.

3.4 Ext4Stream: Multi-stream in ext4

Table 1 lists the streams we introduced in ext4. These streams can be enabled with the corresponding mount option listed in the table.

    Mount-option     Stream
    journal-stream   Separate journal writes
    inode-stream     Separate inode writes
    dir-stream       Separate directory blocks
    misc-stream      Separate inode/block bitmap and group descriptor
    fname-stream     Assign distinct stream to file(s) with specific name
    extn-stream      File-extension based stream

Table 1: Streams introduced in Ext4Stream.

The journal_stream mount option is to separate journal writes. We added a j_streamid field in the journal_s structure. When ext4 is mounted with the journal_stream option, a stream ID is allocated and stored in the j_streamid field. jbd2 passes this stream ID when it writes dirty buffers and descriptor blocks in the journal area using submit_bh and related functions.

Ext4 makes use of buffer-head (bh) structures for various metadata buffers, including inodes, bitmaps, group descriptors, and directory data blocks. We added a new field, streamid, in the buffer head to store a stream ID, and modified submit_bh. While forming an I/O from a buffer head, this field is also set in the bio, carrying the stream ID information down to a lower layer. Ext4Stream maintains stream IDs for different metadata regions in its superblock and, depending on the type of metadata buffer head, sets bh->streamid accordingly.

The inode_stream mount option is to separate inode writes. The default inode size is 256 bytes, so a single 4KiB file-system block can store 16 inodes. Modification of one inode leads to writing of the entire 4KiB block. When Ext4Stream modifies an inode buffer, it also sets bh->streamid with the stream ID meant for the inode stream.

The dir_stream mount option is to keep directory blocks in their own stream. When a new file or subdirectory is created inside a directory, a new directory entry needs to be added. This triggers either an update of an existing data block belonging to the directory or the addition of a new data block. Directory blocks are organized in an htree; leaf nodes contain directory entries and non-leaf nodes contain indexing information. We assign the same stream ID for both types of directory blocks.

The misc_stream is to keep inode/block bitmap blocks and group descriptor blocks in a stream. These regions receive updates during the creation/deletion of files/directories and when data blocks are allocated to a file/directory. We group these regions into a single stream because they are of small size.

The fname_stream helps to put data of certain special files into a distinct stream. The motivation is to use this for separating the undo/redo logs of SQL and NoSQL databases.

The extn_stream is to enable file-extension-based stream recognition. Data blocks of certain files, such as multimedia files, can be considered cold. Ext4Stream can parse the extension of a file during file creation. If it matches some well-known extensions, the file is assigned a different stream ID. This helps prevent hot or cold data blocks from getting mixed with other types of data/metadata blocks.
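To make the bh->streamid plumbing described earlier in this section concrete, here is a minimal, self-contained C sketch of the idea: a stream ID stored alongside a dirty metadata buffer is copied into the block I/O request when the buffer is submitted. The structure layouts, the stream ID values, and the submit function are simplified user-space stand-ins for the kernel's buffer_head/bio machinery, not the actual Ext4Stream patch.

    /* Simplified user-space model of Ext4Stream's bh->streamid plumbing.
     * Real code lives in the kernel (struct buffer_head, struct bio, submit_bh);
     * everything here is an illustrative stand-in. */
    #include <stdio.h>

    enum stream_id { STREAM_DEFAULT = 0, STREAM_JOURNAL, STREAM_INODE,
                     STREAM_DIR, STREAM_MISC, STREAM_FNAME, STREAM_EXTN };

    struct buffer_head {            /* one cached metadata block */
        unsigned long blocknr;
        enum stream_id streamid;    /* new field added by Ext4Stream */
    };

    struct bio {                    /* one block I/O request sent to the device */
        unsigned long sector;
        enum stream_id streamid;    /* carried down toward the NVMe driver */
    };

    /* Modeled after submit_bh(): build a bio from a buffer head and
     * propagate the stream ID so the SSD can place the write accordingly. */
    static void submit_bh_model(struct buffer_head *bh)
    {
        struct bio b = { .sector = bh->blocknr * 8, .streamid = bh->streamid };
        printf("write sector %lu with stream %d\n", b.sector, b.streamid);
    }

    int main(void)
    {
        struct buffer_head inode_block = { .blocknr = 1234, .streamid = STREAM_INODE };
        struct buffer_head dir_block   = { .blocknr = 5678, .streamid = STREAM_DIR };
        submit_bh_model(&inode_block);   /* inode table block -> inode stream */
        submit_bh_model(&dir_block);     /* directory block   -> directory stream */
        return 0;
    }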

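In the same spirit, the fname_stream and extn_stream classification just described can be pictured as a simple name/extension match at file-creation time. The "commitlog" prefix (cf. the Cassandra experiment in Section 4.2), the extension list, and all function names below are hypothetical illustrations; the actual checks in Ext4Stream and XFStream happen inside the kernel when a file is created.

    /* Hypothetical sketch of fname_stream / extn_stream classification. */
    #include <stdio.h>
    #include <string.h>

    #define STREAM_DEFAULT 0
    #define STREAM_FNAME   5   /* e.g., database redo/undo logs */
    #define STREAM_EXTN    6   /* e.g., cold multimedia data */

    static const char *cold_extensions[] = { "mp4", "jpg", "mkv", NULL };

    static int classify_stream(const char *filename)
    {
        /* fname_stream: special files matched by name prefix */
        if (strncmp(filename, "commitlog", 9) == 0)
            return STREAM_FNAME;

        /* extn_stream: well-known "cold" extensions */
        const char *dot = strrchr(filename, '.');
        if (dot) {
            for (int i = 0; cold_extensions[i]; i++)
                if (strcmp(dot + 1, cold_extensions[i]) == 0)
                    return STREAM_EXTN;
        }
        return STREAM_DEFAULT;   /* everything else keeps the default stream */
    }

    int main(void)
    {
        const char *names[] = { "commitlog-1700000000.log", "movie.mp4", "report.txt" };
        for (int i = 0; i < 3; i++)
            printf("%-28s -> stream %d\n", names[i], classify_stream(names[i]));
        return 0;
    }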
3.5 XFStream: Multi-stream in xfs

Table 2 lists the streams we introduced in xfs. These streams can be enabled with the corresponding mount options listed in the table.

    Mount-option   Stream
    log-stream     Separate writes occurring in log
    inode-stream   Separate inode writes
    fname-stream   Assign distinct stream to file(s) with specific name

Table 2: Streams introduced in XFStream.

Xfs implements its own metadata buffering rather than using the page cache. The xfs_buf_t structure is used to represent a buffer. Apart from metadata, in-memory buffers of the log are implemented via xfs_buf_t. When a buffer is flushed, a bio is prepared out of the xfs_buf_t. We added a new field called streamid in xfs_buf_t and used it to set the stream information in the bio.

The log_stream enables XFStream to perform writes in the log area with its own stream ID. Each mounted xfs volume is represented by an xfs_mount_t structure. We added the field log_streamid in it, which is set when xfs is mounted with log_stream. This field is used to convey stream information in the xfs_buf_t representing the log buffer.

The inode_stream mount option enables XFStream to separate inode writes into a stream. A new field, ino_streamid, kept in xfs_mount_t, is set to the stream ID meant for inodes. This field is used to convey stream information in the xfs_buf_t representing the inode buffer.

Finally, the fname_stream enables XFStream to assign a distinct stream to file(s) with specific name(s).

4 Evaluation

Our experimental system configuration is as follows.

• System: Dell PowerEdge R720 server with 32 cores and 32GB memory,
• OS: Linux kernel 4.5 with io-streamid support,
• SSD: Samsung PM963 480GB, with an allocation granularity of 1.1GB (a multi-streamed SSD allocates and expands a stream in units of this allocation granularity),
• Benchmarks: filebench [12], YCSB (Yahoo! Cloud Serving Benchmark) 0.1.4 [4] on Cassandra 1.2.10 [10].

The SSD we used supports up to 9 streams: eight NVMe-standard-compliant streams and one default stream. If a write command does not specify its stream ID, it is written to the default stream.

We conduct experiments in two parts. The first part is to measure the benefit of separating file system metadata and journal. Each test starts with a fresh state involving device format and file system format. To reduce variance between the runs, we disable lazy journal and inode table initialization at the time of ext4 format. As a warming workload for filebench, we write a single file sequentially to fill 80% of the logical device capacity, to ensure that 80% of the logical space stays valid throughout the test. The remaining logical space is used for the actual experiment. The varmail and fileserver workloads included in filebench are used to simulate mail server and file server workloads, respectively. The number of files in both workloads is set to 900,000; default values are used for other parameters. Each filebench workload is run for two hours with 14GB of files, performing deletion, creation, append, sync (only for varmail), and random read. Since the size of the workloads is smaller than that of RAM, the vast majority of the operations that actually reach the device are likely to be write operations. In order to acquire WAF, we retrieve the number of NAND writes and host writes from the FTL and divide the former by the latter.

In the second part, we measure the benefits in data-intensive workloads by applying automatic stream assignment to certain application-specific files. Previous work [8] has reported improvement by modifying the Cassandra source to categorize writes into multiple streams. FStream assigns a distinct stream to the Cassandra commitlog through the fname_stream facility. The load phase involves loading 120,000,000 keys, followed by insertion of 80,000,000 keys during the run phase.

4.1 Filebench results

                 varmail                     fileserver
    Category     ext4    ext4-nj   xfs       ext4    xfs
    Journal      61%     -         60%       26%     16%
    Inode        8%      21%       9%        16%     32%
    Directory    4%      15.8%     -         3%      -
    Other meta   0.2%    0.2%      -         0.2%    -
    Data         26.8%   63%       31%       54.8%   52%

Table 3: Distribution of I/O types during a filebench run.

The benefit of separating metadata or journal into different streams depends on the amount of metadata write traffic and the degree of mixing among various I/O types. As shown in Figure 5, Ext4Stream shows a 35% performance increase and a 25% WAF reduction over baseline ext4 for varmail. XFStream shows a 22% performance increase and a 14% WAF reduction for varmail compared to xfs. Both Ext4Stream and XFStream show more enhancement for varmail than for fileserver, because varmail is more metadata-intensive than fileserver, as shown in Table 3.

To investigate the effect of stream separation for file systems without journaling, we disable the journal in ext4, denoted as ext4-nj. Under the varmail workload, ext4-nj performs better than ext4 by 62%, which is mainly due to the removal of journal writes. Stream separation improves the performance and WAF of ext4-nj by 26% and 46%, respectively.

[Figure 5: Grouped bar charts for the Varmail, Fileserver, and Cassandra workloads, comparing Ext4, Ext4Stream, Ext4-NJ, Ext4Stream-NJ, XFS, XFStream, Ext4-DJ, and Ext4Stream-DJ. (a) Throughput comparison (operations/sec); (b) WAF comparison.]

Figure 5: The graphs above show the performance and WAF improvements obtained with the multi-stream variants of the file systems. Here Ext4-NJ denotes Ext4 without a journal, and Ext4-DJ denotes Ext4 with data journaling.

A key observation in fileserver is that reducing metadata writes is more important for performance than reducing journal writes. Xfs generates 16% more inode writes and 10% fewer journal writes for fileserver than ext4. Though the sum of metadata and journal writes is similar, xfs's performance is less than half of ext4's. The reason is that the metadata writes are random accesses while journal writes are sequential. Sequential writes are better for the FTL's GC efficiency, and hence are good for performance and lifetime.

Another important observation comes from the ext4 fileserver write distribution in Table 3, which shows 16% inode writes but only 0.2% other metadata, which includes the inode bitmap as well. This is because of a jbd2 optimization: if a transaction modifies an already-dirty metadata buffer, jbd2 delays its writeback by resetting its dirty timer, which is why inode/block bitmap buffer writes remain low despite a large number of block/inode allocations.

As shown in Figure 5(b), ext4 WAF remains close to one during the fileserver test. However, when ext4 is operated in data=journal mode (shown as Ext4-DJ), WAF soars above 1.5 due to continuous mixing between the journal and application data. Ext4Stream-DJ eliminates this mixing and brings WAF back down to near one.

4.2 Cassandra results

Cassandra workloads are data-intensive. Database changes are first made to an in-memory structure, called a "memtable," and are written to an on-disk commitlog file. The commitlog implements consistency for Cassandra databases as the file system journal does for ext4 and xfs. It is written far more often than the file system journal. By separating the commitlog from the databases, done by the file system through detecting the file name of commitlog files, we observe a 38% throughput improvement and an 81% WAF reduction. Cassandra commitlog files are named commitlog-* with date and time information. With the fname_stream option, files whose names start with commitlog flow into a single stream. Even if multiple instances of Cassandra run on a single SSD, commitlog files are written only to the stream assigned by fname_stream.

5 Conclusions

In this paper, we advocate an approach of applying multi-stream in the file system layer in order to address the SSD aging problem and GC overheads. Previous proposals for using multi-stream are either application-level customization or block-level full automation. We aimed to take a new step forward from those proposals. We separate streams at the file system level. We focused on the attributes of file-system-generated data, such as the journal and metadata, because they are short-lived and thus suitable for stream separation. We implemented an automatic separation of those file-system-generated data, with no need for user intervention. Not only have we provided fully automated separation of metadata and journal, but we also advise separating the redo/undo logs used by applications into different streams. The physical data separation achieved by our stream separation scheme, FStream, helps the FTL reduce GC overheads, and thereby enhances both the performance and lifetime of SSDs. We applied FStream to ext4 and xfs and obtained encouraging results for various workloads that mimic real-life servers in data centers. Experimental results showed that our scheme enhances filebench performance by 5%∼35% and reduces WAF by 7%∼46%. For a Cassandra workload, performance is improved by up to 38% and WAF is reduced by up to 81%.

Our proposal can bring sizable benefits in terms of SSD performance and lifetime. As future work, we consider applying FStream to log-structured file systems, like f2fs, and copy-on-write file systems, e.g., btrfs. We also plan to evaluate how allocation granularity and optimal write size affect performance and endurance.

References

[1] The NVM Express 1.3 Specification. http://www.nvmexpress.org/.

[2] Axboe, J. Add support for write life time hints, June 2017.

[3] Chiang, M.-L., Lee, P. C., and Chang, R.-C. Managing flash memory in personal communication devices. In Proceedings of the 1997 IEEE International Symposium on Consumer Electronics (ISCE '97) (1997), IEEE, pp. 177–182.

[4] Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (2010), ACM, pp. 143–154.

[5] Edge, J. Stream IDs and I/O hints, May 2016.

[6] Edge, J. Stream ID status update, Mar 2017.

[7] Hsieh, J.-W., Kuo, T.-W., and Chang, L.-P. Efficient identification of hot data for flash memory storage systems. ACM Transactions on Storage (TOS) 2, 1 (2006), 22–40.

[8] Kang, J.-U., Hyun, J., Maeng, H., and Cho, S. The multi-streamed solid-state drive. In 6th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 14) (Philadelphia, PA, 2014), USENIX Association.

[9] Kawaguchi, A., Nishioka, S., and Motoda, H. A flash-memory based file system. In USENIX Winter (1995), pp. 155–164.

[10] Lakshman, A., and Malik, P. Cassandra. http://cassandra.apache.org/, July 2008.

[11] Shimpi, A. L. The SSD anthology: Understanding SSDs and new drives from OCZ. http://db-engines.com/en/ranking/wide+column+store, February 2014.

[12] Tarasov, V., Zadok, E., and Shepler, S. Filebench: A flexible framework for file system benchmarking. USENIX ;login: 41 (2016).

[13] Yan, S., Li, H., Hao, M., Tong, M. H., Sundararaman, S., Chien, A. A., and Gunawi, H. S. Tiny-tail flash: Near-perfect elimination of garbage collection tail latencies in NAND SSDs. In FAST (2017), pp. 15–28.

[14] Yang, F., Dou, K., Chen, S., Hou, M., Kang, J., and Cho, S. Optimizing NoSQL DB on flash: A case study of RocksDB. In 2015 IEEE 15th Intl Conf on Scalable Computing and Communications, Beijing, China, August 10–14, 2015 (2015), pp. 1062–1069.

[15] Yang, J., Pandurangan, R., Choi, C., and Balakrishnan, V. AutoStream: Automatic stream management for multi-streamed SSDs. In Proceedings of the 10th ACM International Systems and Storage Conference, SYSTOR 2017, Haifa, Israel, May 22–24, 2017 (2017), pp. 3:1–3:11.
