Flash-aware File System
Flash-aware Computing
Instructor: Prof. Sungjin Lee ([email protected])
1 Today
File System Basics
Traditional Flash File Systems
SSD-Friendly Flash File Systems
Reference
2 What is a File System?
Provides a virtualized logical view of information stored on various storage media, such as disks, tapes, and flash-based SSDs
Two key abstractions have developed over time in the virtualization of storage
. File: A linear array of bytes, each of which you can read or write
  . Its contents are defined by a creator (e.g., text and binary)
  . It is often referred to by its inode number
. Directory: A special file that is a collection of files and other directories
  . Its contents are quite specific – it contains a list of (user-readable name, inode #) pairs (e.g., ("foo", 10))
  . It has a hierarchical organization (e.g., tree, acyclic graph, and graph)
  . It is also identified by an inode number
3 Operations on Files and Directories
POSIX Operations on Files
  creat()     Create a file
  open()      Create/open a file
  write()     Write bytes to a file
  read()      Read bytes from a file
  lseek()     Move byte position inside a file
  unlink()    Remove a file
  truncate()  Resize a file
  close()     Close a file
  …

POSIX Operations on Directories
  opendir()   Open a directory for reading
  closedir()  Close a directory
  readdir()   Read one directory entry
  rewinddir() Rewind a directory so it can be reread
  mkdir()     Create a new directory
  rmdir()     Remove a directory
  …
The POSIX API talks to the VFS interface, rather than to any specific type of file system
open(), close(), read(), write()
File-system specific implementations
5 File System Implementation
UNIX File System Journaling File System Log-structured or Copy-on Write File Systems
6 UNIX File System
A traditional file system first developed for UNIX systems
On-disk layout: Boot sector | Superblock | Inode Bmap | Data Bmap | Inode table | Data (4 KiB blocks)
. Boot sector: Information to be loaded into RAM to boot up the OS
. Superblock: File system's metadata (e.g., file system type, size, …)
. Inode & data Bmaps: Keep the status of blocks belonging to the inode table and data blocks
. Inode table: Keeps each file's metadata (e.g., size, permission, …) and data block pointers
. Data blocks: Keep users' file data
7 Inode & Block Pointers
An inode holds the file's metadata (last modified time, last access time, file size, permission, link count) together with its block pointers: direct blocks, indirect blocks, and double indirect blocks that lead to the data block numbers on disk
8 Consistent Update Problem
What happens if a sudden power loss occurs while writing data to a file?
write(0, "foo", strlen("foo"));
Figure: the on-disk layout after the crash – "foo" reached a data block, but the inode table and bitmaps may not have been updated yet
The file system will be inconsistent!!! – this is the consistent update problem
9 Journaling File System
Journaling file systems address the consistent update problem by adopting the idea of write-ahead logging (or journaling) from database systems
Ext3, Ext4, ReiserFS, XFS, and NTFS are based on journaling
Journal write & commit: for write(0, "foo", strlen("foo")), the transaction (TxB | "foo" | TxE) is first written to the journaling space
Checkpoint: the committed data is then written to its final location in the file system
Double writes could degrade overall write performance!
10 Log-structured File System
Log-structured file systems (LFS) treat a storage space as a huge log, appending all files and directories sequentially
Many state-of-the-art file systems are based on LFS or CoW
. e.g., Sprite LFS, F2FS, NetApp's WAFL, Btrfs, ZFS, …
Write all the files and inodes sequentially
Layout: Boot sector | Superblock | Checkpoint #1 | Checkpoint #2 | log: File A, inode for A, File B, inode for B, inode map, …
11 Log-structured File System (Cont.)
Advantages
. (+) No consistent update problem
. (+) No double writes – an LFS itself is a log!
. (+) Excellent write performance – disks are optimized for sequential I/O operations
. (+) Fewer disk head movements (e.g., between inode updates and file updates)
Disadvantages
. (–) Expensive garbage collection cost
. (–) Slow read performance
12 Disadvantages of LFS
Figure: as files are overwritten, the log accumulates invalid copies of File A's data and inode alongside the newly written live ones
Expensive garbage collection cost: invalid blocks must be reclaimed for future writes; otherwise, free disk space will be exhausted
Slow read performance: future reads involve more head movements (e.g., when reading file A)
13 Write Cost
Write cost with GC is modeled as follows
. Note: a segment (seg) is a unit of space allocation and GC

  write cost = (N read + N·µ live rewrite + N·(1−µ) new) / (N·(1−µ) new data) = 2 / (1−µ)

. N is the number of segments
. µ is the utilization of the segments (0 ≤ µ < 1)
. If segments have no live data (µ = 0), they need not be read, and the write cost becomes 1.0
14 Write Cost Comparison
Figure: write cost versus segment utilization – a measured curve compared against one with delayed writes and sorting
15 Greedy Policy
The cleaner chooses the least-utilized segments and sorts the live data by age before writing it out again
Workloads: 4 KB files with two overwrite patterns
. (1) Uniform: No locality – every file has an equal likelihood of being overwritten
. (2) Hot-and-cold: Locality – 90% of the writes go to 10% of the files
With the greedy policy, the hot-and-cold workload performs worse than a system with no locality
Figure: the variance in segment utilization under each workload
16 Cost-Benefit Policy
Hot segments are frequently selected as victims even though their utilizations would drop further
. It is necessary to delay cleaning and let more of the blocks die
. On the other hand, free space in cold segments is more valuable
Cost-benefit policy: choose the segment that maximizes

  benefit / cost = ( (1 − µ) × age ) / (1 + µ)

where age is the time since the most recent modification of any block in the segment
17 Cost-Benefit Policy (Cont.)
18 LFS Performance
Figure: LFS performance results – inode updates and random reads
19 Today
File System Basics
Traditional Flash File Systems
. JFFS2: Journaling Flash File System
. YAFFS2: Yet Another Flash File System
. UBIFS: Unsorted Block Image File System
SSD-Friendly File Systems
Reference
20 Traditional File Systems for Flash
Originally designed for block devices like HDDs
. e.g., ext2/3/4, FAT32, and NTFS
But NAND flash memory is not a block device
. The FTL provides a block-device view to the host, hiding the unique properties of NAND flash memory
Stack: traditional file system for HDDs (e.g., ext2/3/4, FAT32, and NTFS) → block I/O interface (read/write) → Flash Translation Layer → NAND control (read/write/erase, e.g., ONFI) → NAND flash; together, a flash-based SSD
21 Flash File Systems
Directly manage raw NAND flash memory
. Internally perform address mapping, garbage collection, and wear-leveling by themselves
Representative flash file systems
. JFFS2, YAFFS2, and UBIFS
Stack: flash file system (e.g., JFFS2, YAFFS2, and UBIFS) → read/write/erase → NAND-specific low-level device driver (e.g., MTD and UBI) → NAND control (read/write/erase, e.g., ONFI) → NAND flash
22 Memory Technology Device (MTD)
MTD is the lowest level for accessing flash chips
. Offers the same APIs for different flash types and technologies
. e.g., NAND, OneNAND, and NOR
JFFS2 and YAFFS2 run on top of MTD
Stack: a typical FS sits on FTLs, while JFFS2 and YAFFS2 call mtd_read(), mtd_write(), … → MTD → device-specific commands → NAND / OneNAND / NOR / …
23 Traditional File Systems vs. Flash File Systems
         | File System + FTL                    | Flash File System
Method   | Access a flash device via FTL        | Access a flash device directly
Pros     | - High interoperability              | - High-level optimization with
         | - No difficulties in managing recent |   system-level information
         |   NAND flash with new constraints    | - Flash-aware storage management
Cons     | - Lack of system-level information   | - Low interoperability
         | - Flash-unaware storage management   | - Must be redesigned to handle
         |                                      |   new NAND constraints
Flash file systems have now become obsolete because of the difficulty of adapting them to new types of NAND devices
24 JFFS2: Journaling Flash File System
A log-structured file system (LFS) for use with NAND flash
. Unlike LFS, however, it does not allow any in-place updates!!!
Main features of JFFS2
. File data and metadata stored as nodes in NAND flash memory
. Keeps an inode cache holding the information of nodes in DRAM
. A greedy garbage collection algorithm
  . Selects the cheapest blocks as victims for garbage collection
. A simple wear-leveling algorithm combined with GC
  . Considers the wearing rate of flash blocks when choosing a victim block for GC
. Optional data compression
25 JFFS2: Write Operation
All data are written sequentially to a log which records all changes
Figure: updates on File A append five versioned nodes to the flash log, each carrying (version, offset, len, 32-bit CRC, data): Ver 1 (offset 0, len 200), Ver 2 (offset 200, len 200), Ver 3 (offset 75, len 50), Ver 4 (offset 0, len 75), Ver 5 (offset 125, len 75). The inode cache in DRAM maps each byte range of File A to its latest version: 0–75 → Ver 4, 75–125 → Ver 3, 125–200 → Ver 5, 200–400 → Ver 2
26 JFFS2: Read Operation
The latest data can be read from NAND flash by referring to the inode cache in DRAM
Figure: to read File A at offsets 75–200, the inode cache maps 75–125 → Ver 3 and 125–200 → Ver 5, so only those two nodes are read from flash
27 JFFS2: Mount
Scan the flash memory medium after rebooting
. Check the CRC of written data and mark obsolete data
. Build the inode cache
Figure: after reboot, the five nodes on flash (Ver 1: offset 0, len 200; Ver 2: offset 200, len 200; Ver 3: offset 75, len 50; Ver 4: offset 0, len 75; Ver 5: offset 125, len 75) are scanned, their CRCs checked, and the inode cache is rebuilt: 0–75 → Ver 4, 75–125 → Ver 3, 125–200 → Ver 5, 200–400 → Ver 2
28 JFFS2: Problems
Slow mount time
. All nodes must be scanned at mount time
. Mount time increases in proportion to flash size and file system contents
High memory consumption
. All node information must be maintained in DRAM
. Memory consumption depends linearly on file system contents
Low scalability
. Infeasible for large-scale flash devices
. Mount time and memory consumption increase with flash size
29 YAFFS2: Yet Another Flash File System
Another log-structured file system for flash memory
. Stores data to flash memory as a log with a unique sequence number, like JFFS2
. Reads and writes are performed similarly to JFFS2
Mitigates the problems raised by JFFS2
. Relatively frugal with memory resources
. Checkpoints for a fast file system mount
. Dynamic wear-leveling
Supports multiple platforms
. Linux, WinCE, RTOSs, etc.
30 YAFFS2: Physical Layout
The entries in the log are all one chunk (one page) in size, and each holds one of two types of content
. Data chunk: A chunk holding regular data file contents
. Object header: A descriptor for an object, such as a directory or a regular file; similar to struct stat but includes the dentry
Each chunk carries tags:
- Object ID: Identifies which object the chunk belongs to
- Chunk ID: Identifies where in the file this chunk belongs (0 marks an object header)
- Seq: A sequence number for ordering chunks in the log
Figure: a flash block holding an object header (Object ID 500, Chunk ID 0) followed by data chunks (Object ID 500, Chunk IDs 1–4)
31 YAFFS2: File System Layout
Maintain the information about objects and chunks in DRAM
Figure: NAND flash holds object headers and data chunks for /, dir1, file1, and file2; main memory mirrors them as Object structures (/, dir1, file1, file2), with a Tnode tree per file locating its data chunks
32 YAFFS2: Block Summary
A block summary (including object IDs and chunk IDs) for the chunks in a block is written to the block's last chunk
This allows all the tags for chunks in that block to be read in one hit, avoiding a full media scan
If a block summary is not available for a block, then a full scan is used to get the tags of each chunk
Figure: the same block layout as before, with the final pages holding the block summary
33 YAFFS2: Checkpoints
DRAM structures are saved to flash at unmount
Structures are re-read at mount, avoiding a boot-time scan
Lazy loading also reduces mount time
Figure: the RAM structures are saved to flash at unmount and re-read at remount
34 UBIFS: Unsorted Block Image File System
A new flash file system developed by Nokia . Considered as the next generation of JFFS2
New features of UBIFS
. Scalability
  . Scales well with respect to flash size
  . Memory consumption and mount time do not depend on flash size
. Fast mount
  . Does not have to scan the whole medium when mounting
. Write-back support
  . Dramatically improves file system throughput in many workloads
. …
35 UBI: Unsorted Block Image
UBIFS runs on top of a UBI volume
. UBI supports multiple volumes, bad block management, wear-leveling, and bit-flip error handling
. The upper-level software can be simpler with UBI
Stack: JFFS2 and YAFFS2 call mtd_read(), mtd_write(), … directly; UBIFS calls ubi_read(), ubi_write(), … → UBI → mtd_read(), mtd_write(), … → MTD → device-specific commands → NAND / OneNAND / NOR / …
36 How UBI works
Logical erase blocks (LEBs) are mapped to physical erase blocks (PEBs)
. Any LEB can be mapped to any PEB
. Reading an unmapped LEB returns 0xFF bytes (erased flash); reads and writes of a mapped LEB go to its PEB
Figure: Volume A (LEBs 0–4) and Volume B (LEBs 0–2) are mapped through the UBI layer onto PEBs 0–10 of the MTD device
37 How UBI works (Cont.)
UBI has its own wear-leveling algorithm: static data sitting in a block with a low erase counter is moved to a highly erased block, and the LEB is re-mapped, so the less-worn block becomes available for new writes
Figure: Volume A holds static (read-only) data; UBI moves that data from a PEB with a low erase counter to a PEB with a high erase counter and re-maps the LEB accordingly
38 UBIFS: Indexing with B+ Tree
The UBIFS index is a B+ tree and is stored on NAND flash
. c.f., JFFS2 does not store its index on flash
. The leaf level contains the data
Full scanning is not needed
The B+ tree is cached in RAM
. Shrunk in case of memory pressure
Realizes good scalability and fast mount
Figure: index nodes form the upper levels of the tree; leaf-level nodes hold the data
39 UBIFS: Wandering Tree
How to find the root of the tree?
Out-of-place updates wander up the tree (A → B → C → D):
- Write data node "D"; the old "D" becomes obsolete
- Write indexing node "C"; the old "C" becomes obsolete
- Write indexing node "B"; the old "B" becomes obsolete
- Write indexing node "A" (the root); the old "A" becomes obsolete
The position of the tree in flash is changed on every update
40 UBIFS: Master Area
LEBs 1 and 2 are reserved for a master area pointing to the root index node
. The master area can be quickly found on mount since its location is fixed (LEBs 1 and 2)
. Using the root index node, the B+ tree can be quickly reconstructed
A master area can hold multiple master nodes
. The valid (latest) master node is found by scanning the master area
Figure: the master area (LEBs 1 and 2) holds master nodes M (ver. 1, ver. 2); the latest one points to the root index node of the wandering tree
41 Mount Time Comparison
42 Summary
JFFS2: Journaling Flash File System version 2
. Commonly used for low-volume flash devices
. Compression is supported
. Long mount time & high memory consumption
YAFFS2: Yet Another Flash File System version 2
. Fast mount time with check-pointing
UBIFS: Unsorted Block Image File System
. Fast mount time and low memory consumption by adopting B+ tree indexing
43 Today
File System Basics
Traditional Flash File Systems
SSD-Friendly Flash File Systems
. F2FS: Flash-friendly File System
Reference
44 F2FS: Flash-friendly File System
A log-structured file system for FTL devices
. Unlike other flash file systems, it runs atop FTL-based flash storage and is optimized for it
. Exploits system-level information for better performance and reliability (e.g., better hot-cold separation, background GC, …)
Stack comparison – traditional flash file systems: JFFS2/YAFFS2 → MTD, or UBIFS → UBI → MTD, over raw NAND flash; F2FS: F2FS → block device driver → FTL → NAND flash (a flash device)
45 Design Concept of F2FS
Designed to take advantage of both approaches
. Exploits high-level system information
. Flash-aware storage management
. Handles new NAND flash constraints without a redesign
         | File System + FTL                    | Flash File System
Method   | Access a flash device via FTL        | Access a flash device directly
Pros     | - High interoperability              | - High-level optimization with
         | - No difficulties in managing recent |   system-level information
         |   NAND flash with new constraints    | - Flash-aware storage management
Cons     | - Lack of system-level information   | - Low interoperability
         | - Flash-unaware storage management   | - Must be redesigned to handle
         |                                      |   new constraints
46 Index Structure of LFS
LFS and UBIFS suffer from the wandering tree problem
. An update to a file's data causes several extra index writes
Figure: an update on file data propagates up through direct and indirect index blocks
47 Index Structure of F2FS
Introduces a Node Address Table (NAT) containing the locations of all node blocks, including indirect and direct nodes
The NAT eliminates the wandering tree problem
. Only the NAT entry is updated in place when a node block moves
48 Logical Storage Layout
File-system metadata is located together for locality
. Uses an "in-place update" strategy for metadata
Files are written sequentially for performance
. Uses an "out-of-place update" strategy to exploit the high throughput of multiple NAND chips
. Six active logs for static hot and cold data separation
Layout (metadata: random writes / main area: sequential writes):
Superblock 0/1 | Checkpoint Area | Segment Info. Table (SIT) | Node Address Table (NAT) | Segment Summary Area (SSA) | Main Area (hot/warm/cold node segments, hot/warm/cold data segments)
49 Cleaning Process
Background cleaning process
. A kernel thread does the cleaning job periodically at idle time
Victim selection policies
. Greedy algorithm for the foreground cleaning job
  . Reduces the amount of data moved for cleaning
. Cost-benefit algorithm for the background cleaning job
  . Reclaims obsolete space in the file system
    – Improves the lifetime of the storage device
    – Improves overall I/O performance
50 F2FS Performance
51 Today
File System Basics
Traditional Flash File Systems
SSD-Friendly Flash File Systems
Reference
52 Reference
Rosenblum, Mendel, and John K. Ousterhout. "The Design and Implementation of a Log-Structured File System." ACM Transactions on Computer Systems (TOCS) 10.1 (1992): 26–52.
Woodhouse, David. "JFFS2: The Journalling Flash File System, Version 2." 2001.
YAFFS2 Specification. http://www.yaffs.net/yaffs-2-specification
Lee, Changman, et al. "F2FS: A New File System for Flash Storage." USENIX FAST, 2015.
Arpaci-Dusseau, Remzi H., and Andrea C. Arpaci-Dusseau. Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books, 2014.
Hunter, Adrian. "A Brief Introduction to the Design of UBIFS." Technical report, March 2008.