SSDFS: Towards LFS Flash-Friendly File System without GC operations

Viacheslav Dubeyko

arXiv:1907.11825v1 [cs.OS] 27 Jul 2019

Abstract

Solid state drives have a number of interesting characteristics. However, there are numerous file system and storage design issues for SSDs that impact performance and device endurance. Many flash-oriented and flash-friendly file systems introduce a significant write amplification issue and GC overhead that result in shorter SSD lifetime and the necessity to use NAND flash overprovisioning. The SSDFS file system introduces several authentic concepts and mechanisms: logical segment, logical extent, segment's PEBs pool, Main/Diff/Journal areas in the PEB's log, Diff-On-Write approach, PEBs migration scheme, hot/warm data self-migration, segment bitmap, hybrid b-tree, shared dictionary b-tree, and shared extents b-tree. The combination of all suggested concepts is able to: (1) manage write amplification in a smart way, (2) decrease GC overhead, (3) prolong SSD lifetime, and (4) provide predictable file system performance.

Index terms: NAND flash, SSD, Log-structured file system (LFS), write amplification issue, GC overhead, flash-friendly file system, SSDFS, delta-encoding, Copy-On-Write (COW), Diff-On-Write (DOW), PEB migration, deduplication.

1 INTRODUCTION

Flash memory characteristics. Flash is available in two types: NOR and NAND. NOR flash is directly addressable, which allows not only reading but also executing instructions directly from the memory. A NAND-based SSD consists of a fixed number of blocks, and each block comprises a fixed set of pages. There are three types of operations in flash memory: read, write, and erase. Read and write operations take place at the page level, whereas data is erased at the block level by the erase operation. Because of the physical features of flash memory, a write operation can only change bits from one to zero. Hence, an erase operation, which sets all bits back to one, must be executed before rewriting. The typical latencies are: (1) read operation - 20 us, (2) write operation - 200 us, (3) erase operation - 2 ms.

Flash Translation Layer (FTL). The FTL emulates the functionality of a block device and enables the operating system to use flash memory without any modification. The FTL mimics the block storage interface and hides the internal complexities of flash memory from the operating system, thus enabling the operating system to read/write flash memory in the same way as reading/writing a hard disk. The basic function of the FTL algorithm is to map page numbers from logical to physical. However, internally the FTL needs to deal with erase-before-write, which makes it critical to the overall performance and lifetime of the SSD.
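To make the logical-to-physical mapping and the erase-before-write constraint concrete, the following is a minimal illustrative sketch of a page-level FTL in C. It is not SSDFS or real firmware code; all names (ftl_dev, ftl_write, the geometry constants) are hypothetical assumptions for illustration only.

/*
 * Hypothetical sketch of a page-level FTL mapping table. Because a flash
 * page cannot be rewritten until its whole block is erased, every write
 * goes out-of-place to a fresh (already erased) page; the stale physical
 * page is merely marked invalid, to be reclaimed by garbage collection.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGES_PER_BLOCK 64
#define NUM_BLOCKS      16
#define TOTAL_PAGES     (PAGES_PER_BLOCK * NUM_BLOCKS)
#define UNMAPPED        UINT32_MAX

enum page_state { PAGE_FREE = 0, PAGE_VALID, PAGE_INVALID };

struct ftl_dev {
    uint32_t l2p[TOTAL_PAGES];   /* logical -> physical page map */
    uint8_t  state[TOTAL_PAGES]; /* per physical page */
    uint32_t next_free;          /* naive bump allocator over erased pages */
};

static void ftl_init(struct ftl_dev *dev)
{
    memset(dev->state, PAGE_FREE, sizeof(dev->state));
    for (uint32_t i = 0; i < TOTAL_PAGES; i++)
        dev->l2p[i] = UNMAPPED;
    dev->next_free = 0;
}

static int ftl_write(struct ftl_dev *dev, uint32_t logical)
{
    if (dev->next_free >= TOTAL_PAGES)
        return -1;               /* out of free pages: GC must run here */

    uint32_t old = dev->l2p[logical];
    if (old != UNMAPPED)
        dev->state[old] = PAGE_INVALID; /* invalidate the stale copy */

    uint32_t fresh = dev->next_free++;
    dev->state[fresh] = PAGE_VALID;
    dev->l2p[logical] = fresh;   /* remap the logical page */
    return 0;
}

int main(void)
{
    struct ftl_dev dev;
    ftl_init(&dev);
    ftl_write(&dev, 10);         /* first write -> physical page 0 */
    ftl_write(&dev, 10);         /* rewrite goes out-of-place -> page 1 */
    printf("logical 10 -> physical %u\n", dev.l2p[10]);
    return 0;
}

The invalid pages that such out-of-place updates leave behind are exactly what the garbage collection described next has to reclaim.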
Garbage Collection and Wear Leveling. The process of collecting and moving valid data and erasing invalid data is called garbage collection. Through the SSD firmware command TRIM, the file system triggers garbage collection for deleted file blocks. Frequently erased blocks wear out quickly: their access times degrade and they finally burn out. Therefore, the erase count of each erase block should be monitored, and a wide range of wear-leveling techniques is used in FTLs.

Building blocks of SSD. An SSD includes a controller that incorporates the electronics bridging the NAND memory components to the host computer. The controller is an embedded processor that executes firmware-level code. The functions performed by the controller include error-correcting codes (ECC), wear leveling, bad block mapping, read scrubbing and read disturb management, read and write caching, garbage collection, etc. A flash-based SSD typically uses a small amount of DRAM as a cache, similar to the cache in hard disk drives. A directory of block placement and wear-leveling data is also kept in the cache while the drive is operating.

Write amplification. For write requests that come in random order, after a period of time, the free page count in flash memory becomes low. The garbage-collection mechanism then identifies a victim block for cleaning. All valid pages in the victim block are relocated into a new block with free pages, and finally the victim block is erased so that its pages become available for rewriting. This mechanism introduces additional read and write operations, the extent of which depends on the specific policy deployed, as well as on the system parameters. These additional writes result in the multiplication of user writes, a phenomenon referred to as write amplification.
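The cleaning cost can be made concrete with a hedged C sketch of the greedy victim selection just described, together with the usual write amplification factor WAF = (host writes + GC relocation writes) / host writes. The structures and counts below are illustrative assumptions, not SSDFS internals.

/*
 * Illustrative greedy GC step: pick the block with the fewest valid
 * pages as the victim, since each valid page costs one extra read and
 * one extra write to relocate. Those relocation writes inflate the WAF.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS      8
#define PAGES_PER_BLOCK 64

struct block_info {
    uint32_t valid_pages; /* live pages that must be relocated */
};

/* Greedy policy: the cheapest victim has the fewest valid pages. */
static int pick_victim(const struct block_info blocks[], int n)
{
    int victim = 0;
    for (int i = 1; i < n; i++)
        if (blocks[i].valid_pages < blocks[victim].valid_pages)
            victim = i;
    return victim;
}

int main(void)
{
    struct block_info blocks[NUM_BLOCKS] = {
        {60}, {12}, {33}, {5}, {64}, {40}, {27}, {50}
    };
    uint64_t host_writes = 100000; /* example host write count */
    uint64_t gc_writes = 0;

    /* Reclaim one victim: its valid pages are rewritten elsewhere. */
    int v = pick_victim(blocks, NUM_BLOCKS);
    gc_writes += blocks[v].valid_pages;

    double waf = (double)(host_writes + gc_writes) / (double)host_writes;
    printf("victim block %d (%u valid pages), WAF = %.5f\n",
           v, blocks[v].valid_pages, waf);
    return 0;
}

The greedy policy rewards mostly-invalid victim blocks, which is one motivation behind hot/cold data separation schemes: keeping short-lived data together makes whole blocks turn invalid and keeps the WAF close to one.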
Read disturbance. A flash data block is composed of multiple NAND units in which the memory cells are connected in series. A memory operation on a specific flash cell influences the charge contents of different cells. This is called disturbance; it can occur on any flash operation, but it is predominantly observed during read operations and leads to errors in undesignated memory cells. To avoid failures on reads, error-correcting codes (ECC) are widely employed. Read disturbance can occur when the same target cell is read multiple times without an erase and program operation. In general, to preserve data consistency, the flash firmware reads all live data pages, erases the block, and writes the live pages back to the erased block. This process, called read block reclaiming, introduces long latencies and degrades performance.

SSD design issues. Solid state drives have a number of interesting characteristics that change the access patterns required to optimize metrics such as disk lifetime and read/write throughput. In particular, SSDs offer approximately two orders of magnitude improvement in read and write latencies, as well as a significant increase in overall bandwidth. However, there are numerous file system and storage array design issues for SSDs that impact performance and device endurance. SSDs suffer well-documented shortcomings: log-on-log, large tail latencies, unpredictable I/O latency, and resource underutilization. These shortcomings are not due to hardware limitations: the non-volatile memory chips at the core of SSDs provide predictable high performance at the cost of constrained operations and limited endurance/reliability. Providing the same block I/O interface as a magnetic disk is one of the important reasons for these drawbacks.

SSDFS features. Many flash-oriented and flash-friendly file systems introduce a significant write amplification issue and GC overhead that result in shorter SSD lifetime and the necessity to use NAND flash overprovisioning. The SSDFS file system introduces several authentic concepts and mechanisms: logical segment, logical extent, segment's PEBs pool, Main/Diff/Journal areas in the PEB's log, Diff-On-Write approach, PEBs migration scheme, hot/warm data self-migration, segment bitmap, hybrid b-tree, shared dictionary b-tree, and shared extents b-tree. The rest of the paper is organized as follows: Section II discusses related works, Section III describes SSDFS architecture and approaches, Section IV includes final discussion, and Section V offers conclusions.

2 RELATED WORKS

2.1 File Systems Statistics and Analysis

File size. Agrawal, et al. [10] discovered that 1-1.5% of files on a file system's volume have a size of zero. The arithmetic mean file size was 108 KB in the year 2000 and 189 KB in 2004; this metric grows roughly 15% per year. The median weighted file size increased from 3 MB to 9 MB. Most of the bytes in large files are in video, database, and blob files, and most of the video, database, and blob bytes are in large files. A large number of small files account for a small fraction of disk usage. Douceur, et al. [12] confirmed that 1.7% of all files have a size of zero. The mean file size ranges from 64 KB to 128 KB across the middle two quartiles of all file systems. The median size is 2 MB, which confirms that most files are small but most bytes are in large files. Ullah, et al. [16] observed that files of 1 KB to 10 KB account for up to 32% of the total occurrences, and 29% of the values are in the range of 10 KB to 100 KB. Gibson, et al. [17] agreed that most files are relatively small: more than half are less than 8 KB, and 80% or more of the files are smaller than 32 KB. On the other hand, while only 25% of files are larger than 8 KB, this 25% contains the majority of the bytes used on the different systems.

File age. Agrawal, et al. [10] stated that the median file age ranges between 80 and 160 days across datasets, with no clear trend over time. Douceur, et al. [12] reported a median file age of 48 days. Studies of short-term trace data have shown that the vast majority of files are deleted within a few minutes of their creation. On 50% of file systems, the median file age ranges by a factor of 8, from 12 to 97 days, and on 90% of file systems, it ranges by a factor of 256, from 1.5 to 388 days. Gibson, et al. [17] showed that while only 15% of the files are modified daily, these modifications account for over 70% of the bytes used daily. Relatively few files are used on any one day, normally less than 5%. Depending on the system, 5-10% of all files created are used on only one day. On the other hand, approximately 0.8% of the files are used more than 129 times, essentially every day; these files account for less than 1% of all files created and approximately 10% of all the files which were accessed or modified. 90% of all files are not used after initial creation, those that are used are normally short-lived, and if a file is not used in some