Flash-aware Computing

Instructor: Prof. Sungjin Lee ([email protected])

1 Today

 File System Basics
 Traditional Flash File Systems
 SSD-Friendly Flash File Systems
 Reference

2 What is a File System?

 Provides a virtualized logical view of information stored on various storage media, such as disks, tapes, and flash-based SSDs

 Two key abstractions have developed over time in the virtualization of storage
 . File: A linear array of bytes, each of which you can read or write
   . Its contents are defined by its creator (e.g., text or binary)
   . It is often referred to by its inode number
 . Directory: A special file that is a collection of files and other directories
   . Its contents are quite specific: a list of (user-readable name, inode #) pairs (e.g., ("foo", 10))
   . It has a hierarchical organization (e.g., tree, acyclic graph, or graph)
   . It is also identified by an inode number

3 Operations on Files and Directories

POSIX Operations on Files
. creat(): Create a file
. open(): Create/open a file
. write(): Write bytes to a file
. read(): Read bytes from a file
. lseek(): Move byte position inside a file
. unlink(): Remove a file
. truncate(): Resize a file
. close(): Close a file
. …

POSIX Operations on Directories
. opendir(): Open a directory for reading
. closedir(): Close a directory
. readdir(): Read one directory entry
. rewinddir(): Rewind a directory so it can be reread
. mkdir(): Create a new directory
. rmdir(): Remove a directory
. …
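Python's os module exposes thin wrappers over most of these POSIX calls, so the table above can be exercised directly; a quick sketch (paths and contents are illustrative):

```python
import os
import tempfile

# Work in a scratch directory so the sketch is self-contained
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "d"))              # mkdir(): create a directory

path = os.path.join(root, "d", "foo.txt")
fd = os.open(path, os.O_CREAT | os.O_RDWR)     # creat()/open()
os.write(fd, b"hello flash")                   # write(): write bytes
os.lseek(fd, 6, os.SEEK_SET)                   # lseek(): move byte position
data = os.read(fd, 5)                          # read(): bytes 6..10 -> b"flash"
os.close(fd)                                   # close()

entries = os.listdir(os.path.join(root, "d"))  # readdir() equivalent
os.unlink(path)                                # unlink(): remove the file
os.rmdir(os.path.join(root, "d"))              # rmdir(): remove the directory
```

Note how the file is only a linear array of bytes: lseek() just moves the position inside that array before the next read.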

4

 The POSIX API targets the VFS (virtual file system) interface, rather than any specific type of file system

[Figure: open(), close(), read(), write() enter through the VFS, which dispatches to file-system-specific implementations]

5 File System Implementation

 Log-structured or Copy-on-Write File Systems

6 UNIX File System

 A traditional file system first developed for UNIX systems

[Layout: Boot Sector | Superblock | Inode Bmap | Data Bmap | Inode Table | Data (4 KiB blocks) 0 1 2 …]

. Boot sector: Information to be loaded into RAM to boot up the OS
. Superblock: File system's metadata (e.g., file system type, size, …)
. Inode & data Bmaps: Track the allocation status of inode-table entries and data blocks
. Inode table: Keeps each file's metadata (e.g., size, permission, …) and data block pointers
. Data blocks: Keep users' file data
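The metadata an inode keeps can be inspected from user space with stat(); a small sketch using Python's os.stat() (the temporary file is illustrative):

```python
import os
import stat
import tempfile

# Create a small file, then read back the metadata its inode keeps
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 4096)       # one 4 KiB block worth of data
os.close(fd)

st = os.stat(path)
meta = {
    "inode": st.st_ino,                   # inode number identifying the file
    "size":  st.st_size,                  # file size in bytes
    "links": st.st_nlink,                 # link count
    "perm":  stat.filemode(st.st_mode),   # permission bits, e.g. '-rw-------'
    "mtime": st.st_mtime,                 # last modified time
    "atime": st.st_atime,                 # last access time
}
os.unlink(path)
```

Everything in meta comes from the file's inode; none of it lives in the data blocks themselves.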

7 Inode & Block Pointers

[Figure: an on-disk inode holds file metadata (last modified time, last access time, file size, permission, link count) plus direct block pointers, indirect block pointers, and double-indirect block pointers; indirect blocks hold further block numbers that locate the data blocks]

8 Consistent Update Problem

 What happens if a sudden power loss occurs while writing data to a file?

write(0, "foo", strlen("foo"));

[Figure: the UFS on-disk layout with "foo" landed in a data block while the related inode and bitmap updates are still in flight]

The file system will be inconsistent!!!  Consistent update problem

9 Journaling File System

 Journaling file systems address the consistent update problem by adopting the idea of write-ahead logging (journaling) from database systems
 ReiserFS, XFS, NTFS, and others are based on journaling

[Figure: journal write & commit — for write(0, "foo", strlen("foo")), the data and metadata updates are first appended to the journaling space as a transaction (TxB … foo … TxE); after the commit, a checkpoint copies them to their home locations in the inode table, bitmaps, and data blocks]

 Double writes could degrade overall write performance!
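The journal-write / commit / checkpoint sequence can be sketched in a few lines. This is a minimal in-memory model of the idea, not any real journal format; the record names (TxB/TxE) follow the slide's figure:

```python
# Minimal write-ahead-logging sketch (in-memory; not any real FS format).
# A transaction is journaled as ("TxB", updates..., "TxE"); only after TxE
# is on the journal may the updates be checkpointed to their home blocks.

journal = []          # sequential journaling space
disk = {}             # home locations: block number -> data

def checkpoint(updates):
    for blk, data in updates:
        disk[blk] = data

def journaled_write(updates):
    journal.append("TxB")
    journal.extend(updates)        # e.g. data block, bitmap, inode updates
    journal.append("TxE")          # commit record
    checkpoint(updates)            # now safe to write home locations

def recover():
    """Replay committed transactions; drop a trailing uncommitted one."""
    pending, committed = [], []
    for rec in journal:
        if rec == "TxB":
            pending = []
        elif rec == "TxE":
            committed.extend(pending)
        else:
            pending.append(rec)
    for blk, data in committed:
        disk[blk] = data

journaled_write([(0, "foo"), (100, "inode-for-foo")])
journal.append("TxB")              # a crash here leaves no TxE...
journal.append((1, "bar"))
recover()                          # ...so block 1 is never replayed
```

The "double write" cost is visible here: every committed update lands once in the journal and once at its home location.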

10 Log-structured File System

 Log-structured file systems (LFS) treat a storage space as a huge log, appending all files and directories sequentially

 The state-of-the-art file systems are based on LFS or CoW
 . e.g., Sprite LFS, F2FS, NetApp's WAFL, ZFS, …

[Figure: all files and inodes are written sequentially — file A, its inode, file B, its inode, and the inode map are appended to the log; the boot sector, superblock, and checkpoint regions #1 and #2 sit at fixed locations]

11 Log-structured File System (Cont.)

 Advantages
 . (+) No consistent update problem
 . (+) No double writes – an LFS itself is a log!
 . (+) Provides excellent write performance – disks are optimized for sequential I/O operations
 . (+) Reduces disk-head movement further (e.g., an inode update and file updates land together)

 Disadvantages
 . (–) Expensive garbage collection cost
 . (–) Slow read performance

12 Disadvantages of LFS

[Figure: after overwrites, old copies of files A and B, their inodes, and an old inode map remain in the log as invalid blocks; the new copies and inode map are appended at the tail]

 Expensive garbage collection cost: invalid blocks must be reclaimed for future writes; otherwise, free disk space will be exhausted
 Slow read performance: more head movements are needed for future reads (e.g., when reading file A)

13 Write Cost

 Write cost with GC is modeled as follows
 . Note: a segment (seg) is a unit of space allocation and GC

write cost = (read segs + write live + write new) / (write new)
           = (N + N·µ + N·(1 − µ)) / (N·(1 − µ)) = 2 / (1 − µ)

. N is the number of segments
. µ is the utilization of the segments (0 ≤ µ < 1)
. If segments have no live data (µ = 0), they need not be read, and the write cost becomes 1.0

14 Write Cost Comparison

[Graph: write cost as a function of disk utilization, comparing LFS against FFS (measured) and an improved FFS (delayed writes, sorting)]

15 Greedy Policy

 The cleaner chooses the least-utilized segments and sorts the live data by age before writing it out again
 Workloads: 4 KB files with two overwrite patterns
 . (1) Uniform: No locality – each file has an equal likelihood of being overwritten
 . (2) Hot-and-cold: Locality – 10% of the files receive 90% of the writes

[Graph: with greedy cleaning, the hot-and-cold workload performs worse than a system with no locality; the variance in segment utilization explains why]

16 Cost-Benefit Policy

 Hot segments are frequently selected as victims even though their utilization would drop further
 . It is necessary to delay cleaning and let more of the blocks die
 . On the other hand, free space in cold segments is valuable

 Cost-benefit policy: choose the segment that maximizes

benefit / cost = (free space gained × age of data) / cost = (1 − µ) · age / (1 + µ)

. Cleaning a segment costs 1 + µ (read the whole segment, write back the live fraction µ) and frees (1 − µ) of its space
. age is the time since the most recent modification within the segment, an estimate of how long the freed space will stay free
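The difference between the two policies shows up with just two candidate segments. A sketch using the Sprite LFS cost-benefit formula; the segment values are made up to mirror the hot/cold discussion above:

```python
# Victim selection sketch: greedy vs. cost-benefit (Sprite LFS style).
# Each segment has a utilization u (live fraction) and an age (time since
# the youngest live block in it was modified).

def greedy_victim(segments):
    """Pick the least-utilized segment."""
    return min(segments, key=lambda s: s["u"])

def cost_benefit_victim(segments):
    """Pick the segment maximizing (1 - u) * age / (1 + u):
    free space gained, weighted by how long it is likely to stay free,
    divided by the cost of reading the segment and rewriting its live data."""
    return max(segments, key=lambda s: (1 - s["u"]) * s["age"] / (1 + s["u"]))

segments = [
    {"name": "hot",  "u": 0.20, "age": 1},    # hot: utilization will drop further
    {"name": "cold", "u": 0.75, "age": 100},  # cold: its free space is valuable
]

g = greedy_victim(segments)["name"]           # greedy always grabs the hot one
c = cost_benefit_victim(segments)["name"]     # cost-benefit waits, cleans cold
```

Greedy cleans the hot segment immediately (lowest u), while cost-benefit prefers the cold segment because its freed space stays free much longer.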

17 Cost-Benefit Policy (Cont.)

18 LFS Performance

[Graphs: LFS performance results — write performance dominated by inode updates, and read performance including random reads]

19 Today

 File System Basics
 Traditional Flash File Systems
 . JFFS2: Journaling Flash File System
 . YAFFS2: Yet Another Flash File System
 . UBIFS: Unsorted Block Image File System
 SSD-Friendly File Systems
 Reference

20 Traditional File Systems for Flash

 Originally designed for block devices like HDDs
 . e.g., ext2/3/4, FAT32, and NTFS
 But NAND is not a block device
 . The FTL provides block-device views outside, hiding the unique properties of NAND flash memory

[Figure: a traditional file system for HDDs (e.g., ext2/3/4, FAT32, and NTFS) issues reads and writes over the block I/O interface; inside the flash-based SSD, the flash translation layer turns them into NAND read/write/erase commands (e.g., ONFI)]

21 Flash File Systems

 Directly manage raw NAND flash memory
 . Internally perform address mapping, garbage collection, and wear-leveling
 Representative flash file systems
 . JFFS2, YAFFS2, and UBIFS

[Figure: a flash file system (e.g., JFFS2, YAFFS2, and UBIFS) issues read/write/erase requests through a NAND-specific low-level device driver (e.g., MTD and UBI), which sends NAND control commands (e.g., ONFI) to the NAND flash]

22 Memory Technology Device (MTD)

 MTD is the lowest level for accessing flash chips
 . Offers the same APIs for different flash types and technologies
 . e.g., NAND, OneNAND, and NOR
 JFFS2 and YAFFS2 run on top of MTD

[Figure: JFFS2, YAFFS2, and the FTLs under typical file systems all call mtd_read(), mtd_write(), …; MTD translates these into device-specific commands for NAND, OneNAND, NOR, …]

23 Traditional File Systems vs. Flash File Systems

Method
. File System + FTL: Access a flash device via FTL
. Flash File System: Access a flash device directly

Pros
. File System + FTL: High interoperability; no difficulties in managing recent NAND flash with new constraints
. Flash File System: High-level optimization with system-level information; flash-aware storage management

Cons
. File System + FTL: Lack of system-level information; flash-unaware storage management
. Flash File System: Low interoperability; must be redesigned to handle new NAND constraints

Flash file systems have largely become obsolete because of the difficulty of adapting them to new types of NAND devices

24 JFFS2: Journaling Flash File System

 A log-structured file system (LFS) for use with NAND flash . Unlike LFS, however, it does not allow any in-place updates!!!

 Main features of JFFS2
 . File data and metadata stored as nodes in NAND flash memory
 . Keeps an inode cache holding the information of nodes in DRAM
 . A greedy garbage collection algorithm
   . Selects the cheapest blocks as victims for garbage collection
 . A simple wear-leveling algorithm combined with GC
   . Considers the wearing rate of flash blocks when choosing a victim block for GC
 . Optional data compression

25 JFFS2: Write Operation

 All data are written sequentially to a log which records all changes

[Figure: updates on File A are appended to NAND flash as five nodes, each carrying a version, offset, length, 32-bit CRC, and data — Ver 1 (offset 0, len 200), Ver 2 (offset 200, len 200), Ver 3 (offset 75, len 50), Ver 4 (offset 0, len 75), Ver 5 (offset 125, len 75). The inode cache in DRAM records the latest version for each logical range of File A: 0–75 → Ver 4, 75–125 → Ver 3, 125–200 → Ver 5, 200–400 → Ver 2]
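The way newer nodes shadow older byte ranges can be checked by replaying the log. This is a sketch of the versioning logic only (not JFFS2's actual data structures), using the node numbers from the File A example:

```python
# Replay a JFFS2-style node log to build the inode cache: for each byte
# of the file, the node with the highest version covering it wins.
# Nodes are (version, offset, length) as in the File A example.
nodes = [
    (1, 0, 200),
    (2, 200, 200),
    (3, 75, 50),
    (4, 0, 75),
    (5, 125, 75),
]

def build_cache(nodes, size=400):
    """Map each byte offset to the version that last wrote it."""
    owner = [None] * size
    for ver, off, length in sorted(nodes):   # replay in version order
        for b in range(off, off + length):
            owner[b] = ver
    return owner

owner = build_cache(nodes)

# Collapse per-byte owners into the ranges shown in the inode cache
ranges = []
start = 0
for b in range(1, 400 + 1):
    if b == 400 or owner[b] != owner[start]:
        ranges.append((start, b, owner[start]))
        start = b
```

Replaying the five nodes reproduces exactly the four cache entries in the figure.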

26 JFFS2: Read Operation

 The latest data can be read from NAND flash by referring to the inode cache in DRAM

[Figure: a read of File A (offset 75–200) consults the inode cache, which directs it to Ver 3 (75–125) and Ver 5 (125–200) in NAND flash]

27 JFFS2: Mount

 Scan the flash memory medium after rebooting
 . Check the CRC of written data and mark obsolete data
 . Build the inode cache

[Figure: at mount, all five nodes of File A are scanned from NAND flash, their CRCs are checked, obsolete versions are marked, and the inode cache (0–75 → Ver 4, 75–125 → Ver 3, 125–200 → Ver 5, 200–400 → Ver 2) is rebuilt in DRAM]

28 JFFS2: Problems

 Slow mount time
 . All nodes must be scanned at mount time
 . Mount time increases in proportion to flash size and file system contents

 High memory consumption
 . All node information must be maintained in DRAM
 . Memory consumption depends linearly on file system contents

 Low scalability
 . Infeasible for a large-scale flash device
 . Mount time and memory consumption increase with flash size

29 YAFFS2: Yet Another Flash File System

 Another log-structured file system for flash memory
 . Stores data to flash memory like a log, with a unique sequence number, like JFFS2
 . Reads and writes are performed similarly to JFFS2

 Mitigates the problems raised by JFFS2
 . Relatively frugal with memory resources
 . Checkpoints for a fast file system mount
 . Dynamic wear-leveling

 Supports multiple platforms
 . WinCE, RTOSs, etc.

30 YAFFS2: Physical Layout

 The entries in the log are all one chunk (one page) in size, and each can be one of two types
 . Data chunk: A chunk holding regular file contents
 . Object header: A descriptor for an object, such as a directory or a regular file; similar to struct stat but also includes the dentry
   - Object ID: Identifies which object the chunk belongs to
   - Chunk ID: Identifies where in the file this chunk belongs
   - Seq: A sequence number of a data chunk

[Figure: a block of chunks for Object ID 500 — an object header (Chunk ID 0) followed by data chunks with Chunk ID 1 / Seq 1, Chunk ID 2 / Seq 1, Chunk ID 3 / Seq 1, Chunk ID 1 / Seq 2, and Chunk ID 4 / Seq 1; the later Chunk ID 1 with the higher sequence number supersedes the earlier one]
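Resolving which chunk is live follows directly from the tags: for each chunk ID, the highest sequence number wins. A sketch of that scan logic (not YAFFS2's actual structures), using the tags from the figure:

```python
# YAFFS2-style scan sketch: chunks are (object_id, chunk_id, seq) tags read
# from the log in write order; chunk_id 0 is the object header. For each
# chunk_id, the tag with the highest sequence number is the live one.
tags = [
    (500, 0, 1),   # object header
    (500, 1, 1),
    (500, 2, 1),
    (500, 3, 1),
    (500, 1, 2),   # rewrite of chunk 1 -> supersedes (500, 1, 1)
    (500, 4, 1),
]

live = {}   # (object_id, chunk_id) -> highest seq seen
for obj, chunk, seq in tags:
    key = (obj, chunk)
    if key not in live or seq > live[key]:
        live[key] = seq
```

After the scan, chunk 1 maps to seq 2 and the original (500, 1, 1) chunk is garbage.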

31 YAFFS2: File System Layout

 Maintain the information about objects and chunks in DRAM

[Figure: NAND flash holds object headers and data chunks for /, dir1, file1, and file2; main memory mirrors them as Object structures for each file and directory, with Tnode trees locating the chunks of file1 and file2]

32 YAFFS2: Block Summary

 A block summary (including object IDs and chunk IDs) for the chunks in a block is written to the last chunk of that block
 This allows all the tags for chunks in that block to be read in one hit, avoiding a full scan
 If a block summary is not available for a block, a full scan is used to get the tags of each chunk

[Figure: the chunks of Object ID 500 fill a block; the final chunk holds the block summary listing each chunk's object ID and chunk ID]

33 YAFFS2: Checkpoints

 DRAM structures are saved to flash at unmount
 The structures are re-read at remount, avoiding a boot-time scan
 Lazy loading also reduces mount time

[Figure: RAM structures are saved as on-flash checkpoint chunks at unmount and re-read at remount]

34 UBIFS: Unsorted Block Image File System

 A new flash file system developed by Nokia
 . Considered the next generation of JFFS2

 New features of UBIFS
 . Scalability
   . Scales well with respect to flash size
   . Memory usage and mount time do not depend on flash size
 . Fast mount
   . Does not have to scan the whole medium when mounting
 . Write-back support
   . Dramatically improves the throughput of the file system in many workloads
 . …

35 UBI: Unsorted Block Image

 UBIFS runs on top of a UBI volume
 . UBI supports multiple volumes, bad block management, wear-leveling, and bit-flip error management
 . The upper-level software can be simpler with UBI

[Figure: UBIFS calls ubi_read(), ubi_write(), … on UBI, while JFFS2 and YAFFS2 call mtd_read(), mtd_write(), … on MTD directly; UBI itself is layered on MTD, which issues device-specific commands to NAND, OneNAND, NOR, …]

36 How UBI works

 Logical erase blocks (LEBs) are mapped to physical erase blocks (PEBs)
 . Any LEB can be mapped to any PEB
 . Reading an unmapped LEB returns 0xFFs

[Figure: Volume A (LEB 0–4) and Volume B (LEB 0–2) are mapped by the UBI layer onto PEBs 0–10 of the MTD device; reads and writes go through the mapping, and erases are handled per PEB]
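The per-volume LEB-to-PEB indirection can be sketched with a dictionary. This is an illustrative model, not UBI's actual API or on-flash format (the class and method names are made up):

```python
# UBI mapping sketch: any LEB of a volume can map to any PEB; reading an
# unmapped LEB returns 0xFF bytes (erased flash). Names are illustrative.
ERASED = b"\xff" * 4   # tiny "erase block" for the sketch

class Ubi:
    def __init__(self, num_pebs):
        self.pebs = [bytearray(ERASED) for _ in range(num_pebs)]
        self.map = {}                  # (volume, leb) -> peb

    def _free_peb(self):
        used = set(self.map.values())
        return next(p for p in range(len(self.pebs)) if p not in used)

    def leb_write(self, vol, leb, data):
        if (vol, leb) not in self.map:          # map LEB on first write
            self.map[(vol, leb)] = self._free_peb()
        peb = self.map[(vol, leb)]
        self.pebs[peb][: len(data)] = data

    def leb_read(self, vol, leb):
        if (vol, leb) not in self.map:
            return bytes(ERASED)                # unmapped LEB: 0xFFs
        return bytes(self.pebs[self.map[(vol, leb)]])

ubi = Ubi(num_pebs=11)
ubi.leb_write("A", 0, b"data")
r1 = ubi.leb_read("A", 0)              # mapped: returns the data
r2 = ubi.leb_read("B", 2)              # unmapped: returns 0xFFs
```

Because the mapping is per-LEB, wear-leveling can later move a LEB's contents to a different PEB and just update the map entry.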

37 How UBI works

 UBI has its own wear-leveling algorithm that migrates data and re-maps LEBs so that erase counts stay even across all physical erase blocks

[Figure: Volume A holds static read-only data; UBI moves the data between a PEB with a low erase counter and a PEB with a high erase counter and re-maps the LEB, spreading wear across the MTD device]

38 UBIFS: Indexing with B+ Tree

 The UBIFS index is a B+ tree and is stored on NAND flash
 . c.f., JFFS2 does not store the index on flash
 The leaf level contains the data
 Full scanning is not needed
 The B+ tree is cached in RAM
 . Shrunk in case of memory pressure
 This realizes good scalability and fast mount

[Figure: index nodes form the on-flash B+ tree; the leaf-level nodes hold the data]

39 UBIFS: Wandering Tree

 How to find the root of the tree?

[Figure: writing data node D makes the old D obsolete; its parent index node C must then be rewritten, making the old C obsolete, and likewise B and then the root A — the position of the tree in flash changes with every update]

40 UBIFS: Master Area

 LEBs 1 and 2 are reserved for a master area pointing to the root index node
 . The master area can be found quickly on mount since its location is fixed (LEBs 1 and 2)
 . Using the root index node, the B+ tree can be quickly reconstructed
 A master area can hold multiple master nodes
 . The valid master node is found by scanning the master area

[Figure: master nodes (ver. 1, ver. 2) in the master area (LEBs 1 and 2) point to the root index node A of the wandering tree]

41 Mount Time Comparison

42 Summary

 JFFS2: Journaling Flash File System version 2
 . Commonly used for low-volume flash devices
 . Compression is supported
 . Long mount time & high memory consumption

 YAFFS2: Yet Another Flash File System version 2
 . Fast mount time with check-pointing

 UBIFS: Unsorted Block Image File System
 . Fast mount time and low memory consumption by adopting B+ tree indexing

43 Today

 File System Basics
 Traditional Flash File Systems
 SSD-Friendly Flash File Systems
 . F2FS: Flash-friendly File System
 Reference

44 F2FS: Flash-friendly File System

 A log-structured file system for FTL devices
 . Unlike other flash file systems, it runs atop FTL-based flash storage and is optimized for it
 . Exploits system-level information for better performance and reliability (e.g., better hot/cold separation, background GC, …)

[Figure: traditional flash file systems (JFFS2/YAFFS2 on MTD, UBIFS on UBI) manage NAND flash memory directly, while F2FS sits on a block device driver over an FTL inside the flash device]

45 Design Concept of F2FS

 Designed to take advantage of both approaches
 . Exploit high-level system information
 . Flash-aware storage management
 . Handle new NAND flash without redesign

Method
. File System + FTL: Access a flash device via FTL
. Flash File System: Access a flash device directly

Pros
. File System + FTL: High interoperability; no difficulties in managing recent NAND flash with new constraints
. Flash File System: High-level optimization with system-level information; flash-aware storage management

Cons
. File System + FTL: Lack of system-level information; flash-unaware storage management
. Flash File System: Low interoperability; must be redesigned to handle new constraints

46 Index Structure of LFS

 LFS and UBIFS suffer from the wandering tree problem
 . An update to a file causes several extra writes

[Figure: an update on file data forces rewrites of the index blocks that point to it, all the way up the tree]

47 Index Structure of F2FS

 Introduces a Node Address Table (NAT) containing the locations of all the node blocks, including indirect and direct nodes
 The NAT allows F2FS to eliminate the wandering tree problem

[Figure: node blocks are located through the NAT, which is updated in place, so a data-block update no longer propagates up the tree]
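The contrast with the wandering tree can be sketched: because parents refer to children by node ID rather than by address, an out-of-place data update rewrites only the data block and its direct node, plus one in-place NAT entry. A simplified model of the idea, not F2FS's on-disk format:

```python
# Sketch of F2FS's NAT idea (simplified, not the on-disk format).
# Without a NAT, a parent stores its child's *location*, so moving a child
# (out-of-place update) forces the parent to move too, up to the root.
# With a NAT, parents store node *IDs*; only the NAT entry is updated.

log = []                 # append-only storage: each write gets a new address
nat = {}                 # node id -> current address in the log

def append(block):
    log.append(block)
    return len(log) - 1  # new out-of-place address

def write_node(node_id, block):
    nat[node_id] = append(block)   # NAT entry updated in place

# Initial file: inode (id 1) -> direct node (id 2) -> data block
write_node(2, {"data_addr": append("data-v1")})
write_node(1, {"child_id": 2})
writes_before = len(log)

# Update the data: rewrite data block + direct node, patch NAT entry for 2.
# The inode (id 1) still says "child_id: 2" and never moves.
write_node(2, {"data_addr": append("data-v2")})
extra_writes = len(log) - writes_before   # 2 blocks, regardless of tree depth
```

Without the NAT, the same update would also have rewritten the inode (and any higher-level index blocks), which is exactly the wandering-tree cascade.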

48 Logical Storage Layout

 File-system metadata is located together for locality
 . Uses an "in-place update" strategy for metadata
 Files are written sequentially for performance
 . Uses an "out-of-place update" strategy to exploit the high throughput of multiple NAND chips
 . Six active logs for static hot and cold data separation

[Figure: F2FS on-disk layout — Superblock 0/1, the Checkpoint area, Segment Info Table (SIT), Node Address Table (NAT), and Segment Summary Area (SSA) take random writes; the Main Area, divided into hot/warm/cold node segments and hot/warm/cold data segments, takes sequential writes]

49 Cleaning Process

 Background cleaning process
 . A kernel thread doing the cleaning job periodically at idle time

 Victim selection policies
 . Greedy algorithm for the foreground cleaning job
   . Reduces the amount of data moved for cleaning
 . Cost-benefit algorithm for the background cleaning job
   . Reclaims obsolete space in the file system
     – Improves the lifetime of the storage device
     – Improves overall I/O performance

50 F2FS Performance

51 Today

 File System Basics
 Traditional Flash File Systems
 SSD-Friendly Flash File Systems
 Reference

52 Reference

 Rosenblum, Mendel, and John K. Ousterhout. "The Design and Implementation of a Log-Structured File System." ACM Transactions on Computer Systems (TOCS) 10.1 (1992): 26–52.
 Woodhouse, David. "JFFS2: The Journalling Flash File System, Version 2." 2001.
 YAFFS2 Specification. http://www.yaffs.net/yaffs-2-specification
 Lee, Changman, et al. "F2FS: A New File System for Flash Storage." USENIX FAST, 2015.
 Arpaci-Dusseau, Remzi H., and Andrea C. Arpaci-Dusseau. Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books, 2014.
 Hunter, Adrian. "A Brief Introduction to the Design of UBIFS." Technical report, March 2008.

53