HMVFS: A Hybrid Memory Versioning File System

Shengan Zheng, Linpeng Huang, Hao Liu, Linzhu Wu, Jin Zha

Department of Computer Science and Engineering, Shanghai Jiao Tong University

Outline

• Introduction
• Design
• Implementation
• Evaluation
• Conclusion

Introduction

• Emerging Non-Volatile Memory (NVM)
  • Persistent like disk
  • Byte-addressable like DRAM
• Current file systems for NVM
  • PMFS, SCMFS, BPFS
  • Non-versioning, unable to recover old data
• Hardware and software errors
  • Large datasets and long execution times
  • A fault tolerance mechanism is needed
• Current versioning file systems
  • BTRFS, NILFS2
  • Not optimized for NVM

Design Goals

• Strong consistency
  • A Stratified File System Tree (SFST) represents a snapshot of the whole file system
  • Atomic snapshotting is ensured
• Fast recovery
  • Almost no redo or undo overhead in recovery
• High performance
  • Utilize the byte-addressability of NVM to update the tree metadata at the granularity of bytes
  • Log-structured updates to files balance the endurance of NVM
  • Avoid write amplification
• User friendly
  • Snapshots are created automatically and transparently

Overview

• HMVFS is an NVM-friendly log-structured versioning file system
  • Space-efficient file system snapshotting
  • HMVFS decouples tree metadata from tree data
  • High performance and consistency guarantee
  • POSIX compliant

Outline

• Introduction
• Design
• Implementation
• Evaluation
• Conclusion

On-Memory Layout

• DRAM: cache and journal

[Figure: on-memory layout. DRAM holds the Node Address Table cache (NAT cache), the checkpoint information (CIT), and the journal. NVM is split into a random-write auxiliary information zone — superblock and its copy, Block Information Table (BIT), Segment Information Table (SIT) — and a sequentially written main area holding the SFST: Node Address Tree (NAT) blocks, checkpoint blocks (CP), node blocks, and data blocks.]

• NVM:
  • Random write zone
    • File system metadata
    • Tree metadata
  • Sequential write zone
    • File metadata and data
    • Tree data
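To make this division concrete, here is a minimal C sketch of the layout; the type and field names are illustrative assumptions, not the real on-NVM format:

/* Sketch: NVM carved into a random-write auxiliary zone for metadata
 * tables and a sequentially written main area for the SFST. */
#include <stdint.h>

struct nvm_layout {
    /* random write zone (auxiliary information): updated in place */
    struct superblock *sb;        /* superblock and its backup copy */
    struct sit_entry  *sit;       /* Segment Information Table */
    struct bit_entry  *bit;       /* Block Information Table */

    /* sequential write zone (main area): the log-structured SFST,
     * holding NAT, checkpoint, node, and data blocks */
    void     *main_area;
    uint64_t  main_segments;      /* number of log segments */
};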

Index Structure in Traditional Log-structured File Systems

• Update propagation problem (a sketch follows the figure below)

[Figure: index structure of a traditional log-structured file system. An inode holds direct pointers, inline data, or pointers to single-, double-, and triple-indirect nodes; indirect nodes point to direct nodes, which point to data blocks. Updating one data block dirties every node block on the path back to the inode.]
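The problem is easiest to see in code. Below is a minimal, hypothetical sketch (not HMVFS source) of a copy-on-write log-structured update: rewriting one data block rewrites every index block on the path up to the inode.

#include <stddef.h>
#include <stdint.h>

typedef uint64_t blkaddr_t;

/* hypothetical index block: a parent link plus the child pointer that
 * must change when the child is rewritten */
struct index_block {
    struct index_block *parent;        /* NULL at the inode */
    blkaddr_t child_addr;
};

/* append a payload to the log and return its new address (stub) */
extern blkaddr_t log_append(const void *payload, size_t len);

/* One data-block update cascades: the direct node is rewritten to hold
 * the new address, then the indirect node, and so on up to the inode --
 * O(tree depth) writes for a single logical write. */
blkaddr_t cow_update(struct index_block *node, const void *data, size_t len)
{
    blkaddr_t addr = log_append(data, len);
    while (node) {
        node->child_addr = addr;               /* point at the new copy */
        addr = log_append(node, sizeof *node); /* which rewrites us too */
        node = node->parent;
    }
    return addr;                /* address of the newly written inode */
}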

Index Structure without Write Amplification

• Node Address Table

[Figure: the same index structure, but node blocks are located through a Node Address Table mapping node ID to block address (e.g., n-1 → 0x38; n → 0x42, updated to 0x73; n+1 → 0x24). Rewriting a node changes only its NAT entry, so no update propagates to ancestor blocks.]
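With a Node Address Table the propagation stops at the table: parents reference children by node ID, and only the NAT entry changes when a node moves. A minimal sketch, with assumed sizes and names:

#include <stdint.h>

typedef uint32_t nid_t;              /* node ID */
typedef uint64_t blkaddr_t;          /* block address in the main area */

#define NAT_ENTRIES 4096             /* assumed table size */
static blkaddr_t nat[NAT_ENTRIES];   /* node ID -> current address */

/* Rewriting node `nid` into the log now costs one byte-granularity
 * NAT update instead of a wandering-tree walk. */
static inline void nat_set(nid_t nid, blkaddr_t new_addr)
{
    nat[nid] = new_addr;
}

static inline blkaddr_t nat_get(nid_t nid)
{
    return nat[nid];
}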

Index Structure for Versioning

• Node Address Table extended with a version dimension

[Figure: Node Address Table with versions. Each version has its own column of node-ID-to-address mappings (e.g., node n: 0x42 in versions 1 and 2, 0x73 in version 3); unchanged entries such as n+1 → 0x24 are identical across versions.]
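Versioning adds a second dimension to the table. A flat sketch with assumed sizes; the per-version copy is only for clarity, since the next slide shares unchanged entries between versions instead of copying them:

#include <stdint.h>

typedef uint32_t nid_t;
typedef uint64_t blkaddr_t;

#define NAT_ENTRIES  4096
#define MAX_VERSIONS 64

/* (version, node ID) -> block address */
static blkaddr_t vnat[MAX_VERSIONS][NAT_ENTRIES];

/* Open a new version: inherit every mapping from the previous one;
 * only nodes rewritten afterwards get fresh addresses. */
static void version_begin(unsigned new_ver)
{
    for (nid_t n = 0; n < NAT_ENTRIES; n++)
        vnat[new_ver][n] = vnat[new_ver - 1][n];
}

static void vnat_set(unsigned ver, nid_t nid, blkaddr_t addr)
{
    vnat[ver][nid] = addr;
}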

How to Store Different Trees Space-efficiently

• Node Address Tree (NAT)
  • A four-level B-tree that stores the multi-version Node Address Table space-efficiently
  • Adopts the idea of the CoW friendly B-tree
  • NAT leaves contain NodeID-address pairs
  • Other tree blocks in NAT contain pointers to lower-level blocks

[Figure: CoW friendly B-tree update, annotated with reference counts. The original and new snapshots share unchanged NAT internal and leaf blocks; only the path from a modified block up to the root is copied. Cloning the tree rooted at P into Q copies F to F' and D to D' and raises the shared blocks E and C to a count of 2; deleting P afterwards drops P, F, and D to 0 while Q, E, F', A, B, C, and D' remain reachable.]
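Here is a minimal, illustrative C sketch of the path copying shown above, following the CoW friendly B-tree idea; names and fan-out are assumptions, not HMVFS source:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FANOUT 4

struct nat_block {
    int refcount;                      /* how many parents share us */
    struct nat_block *child[FANOUT];   /* NULL below leaf level */
};

/* Return a private copy of `b` for the new snapshot; each child is now
 * shared by one more parent, so its refcount goes up. */
static struct nat_block *cow_block(struct nat_block *b)
{
    struct nat_block *copy = malloc(sizeof *copy);
    memcpy(copy, b, sizeof *copy);
    copy->refcount = 1;
    for (int i = 0; i < FANOUT; i++)
        if (copy->child[i])
            copy->child[i]->refcount++;
    return copy;
}

/* Copy only the path root -> ... -> leaf selected by slots[] at each
 * level; everything off the path stays shared between snapshots. */
static struct nat_block *cow_path(struct nat_block *root,
                                  const int *slots, int depth)
{
    struct nat_block *new_root = cow_block(root), *cur = new_root;
    for (int d = 0; d < depth; d++) {
        struct nat_block *old = cur->child[slots[d]];
        struct nat_block *new_child = cow_block(old);
        old->refcount--;             /* new parent no longer shares it */
        cur->child[slots[d]] = new_child;
        cur = new_child;
    }
    return new_root;
}

Tracing this on the figure reproduces the counts exactly: cloning P into Q leaves E at 2 and F at 1, with F' and D' fresh copies at 1.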

Stratified File System Tree (SFST)

• Four different categories of blocks (see the sketch after the figure below):
  • Checkpoint layer
  • Node Address Tree (NAT) layer
  • Node layer
  • Data layer
• All blocks of an SFST are stored in the main area with log-structured writes
  • Balances the endurance of the NVM media
• Each SFST represents a valid snapshot of the file system
• Snapshots share overlapped blocks to achieve space efficiency

[Figure: two SFSTs, the original and the new snapshot, each rooted at a checkpoint (CP) block and descending through NAT root, internal, and leaf blocks to inode, indirect, and direct node blocks and then data blocks; overlapping subtrees are shared.]
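The four layers can be summarized as a block-type enum. The encoding below is an illustrative assumption (the slides do not show the on-NVM type codes); the BIT sketches later reuse these names:

enum sfst_block_type {
    BLK_CHECKPOINT,     /* checkpoint layer: root of one snapshot */
    BLK_NAT_INTERNAL,   /* NAT layer: root/internal blocks of the NAT */
    BLK_NAT_LEAF,       /* NAT layer: leaves with NodeID-address pairs */
    BLK_NODE_INODE,     /* node layer */
    BLK_NODE_INDIRECT,
    BLK_NODE_DIRECT,
    BLK_DATA            /* data layer */
};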

Stratified File System Tree (SFST)

• The metadata of SFST
  • Stored in the auxiliary information zone
  • Updated with random writes
• Segment Information Table (SIT)
  • Contains the status information of every segment
• Block Information Table (BIT)
  • Keeps the information of every block
  • Updated precisely at byte granularity
  • Contains (see the sketch below):
    • Start and end version number
    • Block type
    • Node ID
    • Reference count
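A sketch of one BIT entry with the four fields listed above; the field widths are assumptions, not the real on-NVM layout. With byte-addressable NVM, each field can be updated in place without rewriting the entry or its block:

#include <stdint.h>

struct bit_entry {
    uint32_t start_ver;   /* first version in which the block is valid */
    uint32_t end_ver;     /* last version in which the block is valid */
    uint8_t  type;        /* an enum sfst_block_type value (see above) */
    uint32_t nid;         /* node ID, used to locate the parent block */
    uint32_t refcount;    /* number of parent nodes linked to the block */
};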

Garbage Collection in HMVFS

• Move all the valid blocks in the victim segment to the current segment
• When finished, update the SIT and create a snapshot
• Handle the block sharing problem (a sketch follows the figure below)

[Figure: garbage collection with shared blocks. Versions 1-4 each have a NAT block pointing into victim segment A; node blocks shared across versions carry reference counts (1, 2, 2, 2, 2) and are moved to segment B, so every version that shares a block must be re-pointed to its new location.]
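A minimal GC sketch under the stated policy: copy valid blocks out of the victim segment, fix up each block's parents via the BIT, then update the SIT and take a snapshot. All helper functions are hypothetical stubs:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t blkaddr_t;

extern bool      blk_is_valid(blkaddr_t a);       /* refcount > 0 ? */
extern blkaddr_t segment_copy_out(blkaddr_t a);   /* to current segment */
extern void      bit_move(blkaddr_t from, blkaddr_t to);
extern void      parents_repoint(blkaddr_t from, blkaddr_t to);
extern void      sit_mark_free(int segno);
extern void      snapshot_create(void);

#define BLOCKS_PER_SEG 512                        /* assumed size */
extern blkaddr_t segment_first_block(int segno);

void gc_segment(int victim)
{
    blkaddr_t a = segment_first_block(victim);
    for (int i = 0; i < BLOCKS_PER_SEG; i++, a++) {
        if (!blk_is_valid(a))
            continue;                    /* dead block: just drop it */
        blkaddr_t to = segment_copy_out(a);
        bit_move(a, to);                 /* carry the BIT entry along */
        parents_repoint(a, to);          /* block sharing: every version
                                            sharing this block must see
                                            the new address */
    }
    sit_mark_free(victim);               /* update the SIT */
    snapshot_create();                   /* persist the new state */
}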

Outline

• Introduction
• Design
• Implementation
• Evaluation
• Conclusion

Block Information Table (BIT)

• Block sharing problem
  • The corresponding pointer in the parent block must be updated if a new child block is written in the main area
• Node ID and block type
  • Used to locate the parent node (a lookup sketch follows the table below)

Type of the block          Type of the parent   Node ID
Checkpoint                 N/A                  N/A
NAT internal, NAT leaf     NAT internal         Index code in NAT
Inode, indirect, direct    NAT leaf             Node ID
Data                       Inode or direct      Node ID of parent node
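A sketch of parent lookup from a block's BIT fields, mirroring the table above: given the child's type and node ID, find the parent block whose pointer must be updated when the child moves. Helper names are hypothetical; the enum repeats the earlier SFST sketch for self-containment:

#include <stdint.h>

typedef uint32_t nid_t;
typedef uint64_t blkaddr_t;

enum sfst_block_type {               /* repeated from the SFST sketch */
    BLK_CHECKPOINT,
    BLK_NAT_INTERNAL, BLK_NAT_LEAF,
    BLK_NODE_INODE, BLK_NODE_INDIRECT, BLK_NODE_DIRECT,
    BLK_DATA
};

extern blkaddr_t nat_internal_by_index(uint32_t index_code);
extern blkaddr_t nat_leaf_for(nid_t nid);
extern blkaddr_t node_addr(nid_t nid);    /* resolved via the NAT */

blkaddr_t bit_locate_parent(enum sfst_block_type type, uint32_t nid)
{
    switch (type) {
    case BLK_CHECKPOINT:                /* snapshot root: no parent */
        return 0;
    case BLK_NAT_INTERNAL:
    case BLK_NAT_LEAF:                  /* nid is an index code into
                                           the NAT internal blocks */
        return nat_internal_by_index(nid);
    case BLK_NODE_INODE:
    case BLK_NODE_INDIRECT:
    case BLK_NODE_DIRECT:               /* parent: the NAT leaf holding
                                           this node's entry */
        return nat_leaf_for(nid);
    case BLK_DATA:                      /* nid is the parent node's ID */
        return node_addr(nid);
    }
    return 0;
}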

Block Information Table (BIT)

• Start and end version number
  • The first and last versions in which the block is valid
  • Operations like write and delete set these two fields to the current version number
• Reference count
  • The number of parent nodes which are linked to the block
  • Updated with lazy reference counting (sketched below)
  • File-level operations and snapshot-level operations update the reference count
  • If the count reaches zero, the block becomes garbage
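A sketch of lazy reference counting on BIT entries: ordinary block writes do not touch the counts; only file-level operations (create, delete) and snapshot-level operations (clone, delete) adjust them. The struct repeats the earlier illustrative layout:

#include <stdint.h>

struct bit_entry {                 /* repeated from the earlier sketch */
    uint32_t start_ver, end_ver;
    uint8_t  type;
    uint32_t nid;
    uint32_t refcount;
};

extern struct bit_entry *bit_lookup(uint64_t blkaddr);
extern void block_mark_garbage(uint64_t blkaddr);
extern uint32_t current_version;

/* A snapshot clone adds one parent to every block it shares. */
void bit_ref(uint64_t blkaddr)
{
    bit_lookup(blkaddr)->refcount++;
}

/* Dropping the last reference closes the validity interval and turns
 * the block into garbage for the next GC pass. */
void bit_unref(uint64_t blkaddr)
{
    struct bit_entry *e = bit_lookup(blkaddr);
    if (--e->refcount == 0) {
        e->end_ver = current_version;
        block_mark_garbage(blkaddr);
    }
}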

Snapshot Creation

• Strong consistency is guaranteed
• Flush dirty NAT entries from DRAM to form a new Node Address Tree
  • Follows a bottom-up procedure
  • Status information is stored in the checkpoint block
  • Space-efficient snapshot
• The atomicity of snapshot creation is ensured
  • An atomic update to the pointer in the superblock announces the validity of the new snapshot
  • A crash during snapshot creation can be recovered by undo or redo, depending on the validity
• A sketch of this ordering follows the figure below

[Figure: the original and new snapshots under the superblock; the new checkpoint (CP) block is written over a new NAT root only after the NAT internals and leaves, node blocks, and data blocks are in place, with unchanged blocks shared between the two trees.]
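A sketch of the creation order described above: build the new SFST bottom-up (NAT leaves, internals, root, then the checkpoint), persist it, and only then publish it with a single atomic 8-byte store to the superblock. Helper names and the persist primitive are assumptions:

#include <stdint.h>

typedef uint64_t blkaddr_t;

extern blkaddr_t nat_flush_dirty_bottom_up(void); /* returns new NAT root */
extern blkaddr_t checkpoint_write(blkaddr_t nat_root);
extern void      nvm_persist(const void *p, uint64_t len); /* e.g. cache
                                                   line writeback + fence */

struct superblock { uint64_t cur_checkpoint; /* ... */ };
extern struct superblock *sb;

void snapshot_create(void)
{
    /* 1. Bottom-up: children are durable before parents point at them. */
    blkaddr_t nat_root = nat_flush_dirty_bottom_up();
    blkaddr_t cp = checkpoint_write(nat_root);

    /* 2. Publish: one atomic pointer update makes the snapshot valid.
     * A crash before this store leaves the old snapshot current (the
     * half-built tree is garbage); a crash after it needs no redo. */
    __atomic_store_n(&sb->cur_checkpoint, cp, __ATOMIC_RELEASE);
    nvm_persist(&sb->cur_checkpoint, sizeof cp);
}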

Snapshot Deletion

• Deletion starts from the checkpoint block
• The checkpoint cache is stored in DRAM
• Follows the top-down procedure to decrease reference counts (sketched below)
• Consistency is ensured by journaling
• Call garbage collection afterwards
• Many reference counts have decreased to zero
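A sketch of the top-down walk, matching the reference-count figure below: starting from the checkpoint block, decrement each block's count and recurse only into subtrees this snapshot was the last owner of. Shared subtrees stop the walk, which is what keeps deletion cheap:

#include <stdint.h>

struct sfst_block {
    uint32_t refcount;
    uint32_t nchild;
    struct sfst_block **child;      /* array of nchild children */
};

extern void journal_log_deletion(struct sfst_block *b); /* consistency */
extern void block_mark_garbage(struct sfst_block *b);   /* for later GC */

void snapshot_delete(struct sfst_block *b)
{
    journal_log_deletion(b);
    if (--b->refcount > 0)
        return;                     /* still shared by other snapshots */
    for (uint32_t i = 0; i < b->nchild; i++)
        snapshot_delete(b->child[i]);
    block_mark_garbage(b);          /* count hit zero: reclaimable */
}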

[Figure: reference counts during deletion. Before: P,1 Q,1; E,2 F,1 F',1; A,1 B,1 C,2 D,1 D',1. After deleting snapshot P: P,0 Q,1; E,1 F,0 F',1; A,1 B,1 C,1 D,0 D',1.]

Crash Recovery

• Mount the last completed snapshot as writable (a mount sketch follows the figure below)
  • No additional recovery overhead
• Mount old snapshots as read-only
  • Locate the checkpoint block of the snapshot
  • Retrieve files via its SFST

[Figure: the superblock points to a series of checkpoint blocks, one per snapshot, each referencing its own NAT root.]
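A sketch of mounting after a crash: read the superblock's current checkpoint pointer (always valid thanks to the atomic publish), or walk back to an older checkpoint for a read-only mount. The names and the backward chaining of checkpoints are assumptions:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t blkaddr_t;

struct checkpoint {
    uint64_t  version;
    blkaddr_t nat_root;
    blkaddr_t prev_cp;              /* 0 at the oldest snapshot */
};

extern struct superblock { blkaddr_t cur_checkpoint; } *sb;
extern struct checkpoint *cp_read(blkaddr_t addr);
extern void fs_attach(struct checkpoint *cp, bool writable);

/* version == 0: mount the last completed snapshot writable (no redo or
 * undo needed); otherwise locate that version and mount it read-only. */
int mount_snapshot(uint64_t version)
{
    struct checkpoint *cp = cp_read(sb->cur_checkpoint);
    if (version == 0) {
        fs_attach(cp, true);
        return 0;
    }
    while (cp && cp->version != version)
        cp = cp->prev_cp ? cp_read(cp->prev_cp) : 0;
    if (!cp)
        return -1;                  /* no such snapshot */
    fs_attach(cp, false);
    return 0;
}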

Outline

• Introduction
• Design
• Implementation
• Evaluation
• Conclusion

Evaluation

• Experimental Setup
  • A commodity server with 64 Intel Xeon 2GHz processors and 512GB DRAM
  • Performance comparison with PMFS, EXT4, BTRFS, NILFS2
• Postmark results
  • Different read bias numbers (percentage of reads)

[Figure: Postmark results as the percentage of reads varies from 0% to 100%. Left, transaction performance: HMVFS, BTRFS, NILFS2, EXT4, PMFS. Right, snapshotting efficiency (sec⁻¹): HMVFS, BTRFS, NILFS2 (annotated: 2.7x and 2.3x).]

Evaluation

• Filebench results
  • Fileserver workload
  • Different numbers of files

[Figure: Fileserver results for 2k-16k files. Left, throughput (×1000 ops/sec): HMVFS, BTRFS, NILFS2, EXT4, PMFS. Right, snapshotting efficiency (sec⁻¹): HMVFS, BTRFS, NILFS2 (annotated: 9.7x and 6.6x).]

Evaluation

• Filebench results
  • Varmail workload
  • Different depths of directories

[Figure: Varmail results for directory depths 0.7, 1.2, 1.4, and 2.1. Left, throughput (×1000 ops/sec): HMVFS, BTRFS, NILFS2, EXT4, PMFS. Right, snapshotting efficiency (sec⁻¹): HMVFS, BTRFS, NILFS2 (annotated: 8.7x and 2.5x).]

Outline

• Introduction
• Design
• Implementation
• Evaluation
• Conclusion

Conclusion

• HMVFS is the first file system to solve the consistency problem for NVM-based in-memory file systems using snapshotting
• Metadata of the Stratified File System Tree (SFST) is decoupled from data and is updated at byte granularity
• HMVFS stores snapshots space-efficiently with shared blocks in SFST and handles the write amplification and block sharing problems well
• HMVFS exploits the structural benefit of the CoW friendly B-tree and the byte-addressability of NVM to take frequent snapshots automatically
• HMVFS outperforms traditional versioning file systems in snapshotting and performance, while providing a strong consistency guarantee and having little impact on foreground operations

Q & A

Thank you