NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories
Andiry Xu, Steven Swanson
Non-volatile Systems Laboratory Department of Computer Science and Engineering University of California, San Diego
1 NOVA overview • NOVA extends LFS to leverage non-volatile memories • NOVA proposes per-inode logging • High performance + Strong atomicity – 3.1x to 13.5x to file systems that have equally strong consistency guarantees in write-intensive workloads • POSIX compliant
https://github.com/NVSL/NOVA
2 Hybrid DRAM/NVMM system • Non-volatile main memory (NVMM) – PCM, STT-RAM, ReRAM, 3D XPoint technology • File system for NVMM Host
CPU
NVMM FS
DRAM NVMM
3 Disk-based file systems are inadequate for NVMM
Ext4 Ext4 Ext4 • Ext4, xfs, Btrfs, F2FS, NILFS2 Atomicity Btrfs xfs wb order dataj • 1-Sector Built for hard disks and SSDs ✓ ✓ ✓ ✓ ✓ overwrite – Software overhead is high 1-Sector ✗ ✓ ✓ ✓ ✓ – CPU may reorder writes to NVMM append 1-Block ✗ ✗ ✓ ✓ ✗ – NVMM has different atomicity guarantees overwrite 1-Block ✗ ✓ ✓ ✓ ✓ append • Cannot exploit NVMM performance N-Block ✗ ✗ ✗ ✗ ✗ • write/append Performance optimization compromises N-Block ✗ ✓ ✓ ✓ ✓ consistency on system failure [1] prefix/append
[1] Pillai et al, All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI '14. 4 NVMM file systems are not strongly consistent • BPFS, PMFS, Ext4-DAX, SCMFS, Aerie • None of them provide strong metadata and data consistency
Metadata Data Mmap File system atomicity atomicity Atomicity [1] BPFS Yes Yes [2] No PMFS Yes No No Ext4-DAX Yes No No SCMFS No No No Aerie Yes No No NOVA Yes Yes Yes
[1] Each msync() commits updates atomically. [2] In BPFS, write times are not updated atomically with respect to the write itself. 5 Why LFS? • Log-structuring provides cheaper atomicity than journaling and shadow paging • NVMM supports fast, highly concurrent random accesses – Using multiple logs does not negatively impact performance – Log does not need to be contiguous
• Rethink and redesign log-structuring entirely
6 NOVA design goals • Atomicity – Combine log-structuring, journaling and copy-on- write Per-inode logging • High performance – Split data structure between DRAM and NVMM Inode Head Tail – Highly scalable Inode log • Efficient garbage collection – Fine-grained log cleaning with log as a linked list – Log only contains metadata Committed entry Uncommitted entry • Fast recovery – Lazy rebuild – Parallel scan 7 Atomicity • Log-structuring for single log update – Write, msync, chmod, etc – Strictly commit log entry to NVMM TailTail before updating log tail Directory log • Lightweight journaling for update across logs Tail TailTail – Unlink, rename, etc – Journal log tails instead of metadata or data File log
• Copy-on-write for file data Dir tail – Log only contains metadata Journal – Log is short File tail
8 Atomicity • Log-structuring for single log update – Write, msync, chmod, etc – Strictly commit log entry to NVMM Tail before updating log tail Directory log • Lightweight journaling for update across logs Tail Tail – Unlink, rename, etc – Journal log tails instead of metadata or data File log
• Copy-on-write for file data Data 0 Data 1 Data 1 Data 2 – Log only contains metadata – Log is short
9 Performance • Per-inode logging allows for high concurrency Tail Directory log • Split data structure between Tail DRAM and NVMM – Persistent log is simple and File log efficient Data 0 Data 1 Data 2 – Volatile tree structure has no consistency overhead
10 Performance
• Per-inode logging allows for Radix tree high concurrency 0 1 2 3 DRAM • Split data structure between NVMM Tail DRAM and NVMM – Persistent log is simple and File log efficient Data 0 Data 1 Data 2 – Volatile tree structure has no consistency overhead
11 NOVA layout
• Put allocator in DRAM CPU 0 CPU 1 • High scalability Free list Free list – Per-CPU NVMM free list, DRAM journal and inode table NVMM Journal Journal – Concurrent transactions Super Inode table Inode table and allocation/deallocation block Recovery inode Inode Head Tail
Inode log
12 Fast garbage collection • Log is a linked list • Log only contains Tail metadata Head • Fast GC deletes dead log pages from the linked list • No copying Vaild log entry Invalid log entry
13 Thorough garbage collection • Starts if valid log entries < 50% log length • Format a new log and atomically replace the old one • Only copy metadata Tail Head
Vaild log entry Invalid log entry
14 Recovery • Rebuild DRAM structure – Allocator CPU 0 CPU 1 – Lazy rebuild: postpones inode radix tree rebuild Free list Free list • Accelerates recovery • Reduces DRAM consumption DRAM • Normal shutdown recovery: NVMM – Store allocator in recovery inode Journal Journal – No log scanning Super Inode table Inode table block • Failure recovery: Recovery – Log is short inode Recovery Recovery thread thread – Parallel scan – Failure recovery bandwidth: > 400 GB/s
15 Evaluation: Latency
Operation latency 25 • Intel PM Emulation Platform 20 – Emulates different NVM
15 characteristics – Emulates clwb/PCOMMIT 10 latency
Latency (microsecond) Latency 5 • NOVA provides low latency atomicity 0 Create Append (4KB) Delete Ext4-datajournal Ext4-DAX NOVA
16 Filebench throughput
Filebench throughput 450 400 350 300 3.2x2.5x • NOVA achieves high 250 performance with strong 200 2.2x 150 10x data consistency
Ops per second (x1000) per Ops second 100 50 0 Fileserver Varmail Webproxy Webserver Ext4-datajournal Ext4-DAX NOVA
17 Garbage collection efficiency
Fileserver (95% NVMM utilization) 300 250 • NOVA’s performance stays 200 stable with increasing 150 60% drop12x running time 100 Fail Fail • Fast GC reclaims the Ops per second (x1000) per Ops second 50 majority of stale pages in the 0 1 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 long-term running Running time (second)
NILFS2 F2FS NOVA
18 Conclusion • Existing file systems do not meet the requirements of applications on NVMM file systems • NOVA’s multi-log design achieves high concurrency, efficient garbage collection and fast recovery • NOVA outperforms existing file systems while providing stronger consistency and atomicity guarantees
19 Thank you!
https://github.com/NVSL/NOVA
20 Backup slides
21 Atomicity and enforce write ordering // Strictly commit log entry to NVMM before updating tail new_tail = append_to_log(inode->tail, entry); clwb(inode->tail, entry->length); // writes back the cachelines sfence(); PCOMMIT(); // Commits to NVMM Tail Tail sfence(); inode->tail = new_tail; Inode log
22 Directory operations • mv Alice/book Bob/
• (name, inode number) Tail Tail
Alice log “book”, 10 “book”, 0
Alice tail Tail Tail Bob tail Bob log “book”, 10 book tail Tail Tail
Journal book log inode update
23 Atomic file operations • Copy-on-write for file data •
• Write(8192, 8192) 0 1 2 3 DRAM
Head Tail Tail NVMM
File log <0, 1> <1, 2> <2, 2>
Data 0 Data 1 Data 2 Data 2 Data 3
24 Atomic mmap • Allocate replica pages and mmap(fd, 4096, 4096); mmap to user space msync(addr, 4096); • msync() commits updates Replica 1 atomically User space Head Tail Tail Kernel
File log
Data 0 Data 1 Data 1 Replica 1
25 Evaluation • Intel PM Emulation Platform • 32GB of DRAM, 64GB of NVMM
Read Write clwb PCOMMIT NVMM latency bandwidth latency latency STT-RAM 100 ns Full DRAM 40 ns 200 ns PCM 300 ns 1/8 DRAM 40 ns 500 ns
• Compare to Btrfs, NILFS2, F2FS, Ext4, Ext4-data, Ext4-DAX, PMFS • Linux kernel 4.0 x86-64
26 Garbage collection efficiency
GC pages percentage Fast GC Thorough GC 100% 90% 80% 70% • Fast GC reclaims 94% pages 60% in one-hour test 50% 94.2
40% 85.4 PERCENTAGE 30% 64.3 20% 10% 10.7 0% 0 10s 30s 120s 600s 3600s RUNNING TIME
27