NOVA: A High-Performance, Fault Tolerant File System for Non-Volatile Main Memories Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson
Non-Vola le Systems Laboratory Department of Computer Science and Engineering University of California, San Diego
1 NVMM File System Requirements • Legacy File IO Accelera on – fast and easy – Run exis ng IO-intensive apps on NVDIMMs – “just works” • Strong atomicity • DAX Mmap – Load-store access & complex data structures – More control/responsibility for programs • Data protec on – Don’t trust the media or other so ware – Support backups • DAX High performance – Otherwise, why bother?
✔ ❌ ✔❌DAX ❌ 3
BPFS SCMFS PMFS Aerie EXT4-DAX XFS-DAX M1FS
✔ ❌ ❌ ✔DAX ✔ 4
NOVA
✔✔✔✔DAX ✔ 5 Files, Atomicity, and Performance
6 Log Structured FS For NVMM
Per-inode logging
• A Nova FS is a tree of logs Inode Head Tail – One log per i-node – Logs are not con guous Inode log • I-nodes point to the head and tail Commi ed entry Uncommi ed entry
7 Atomicity: Logging for Simple Metadata Operations • Combines log-structuring, journaling and copy-on-write
• Log-structuring for single log Tail Tail update File log – Write, msync, chmod, etc – Lower overhead than journaling and shadow paging
8 Atomicity: Lightweight Journaling for Complex Metadata Operations • Lightweight journaling for update across logs TailTail – Unlink, rename, etc Directory log – Journal log tails instead of
metadata or data TailTail File log
Dir tail Journal File tail
9 Atomicity: Copy-on-write for file data • Copy-on-write for file data – Log only contains metadata
– Log is short Tail Tail File log
Data 0 Data 1 Data 1 Data 2
10 Atomicity: DAX
• Nova does not make atomicity guarantees for DAX mmap()’d data • msync() works, but it’s slow & non-atomic • The program must ensure consistency – ISA Support: CLWB, clflush, etc. DAX – Careful programming
11 DRAM Indexes for High Performance
• Per-inode logging allows for Radix tree high concurrency 0 1 2 3 DRAM • Split data structure between NVMM Tail DRAM and NVMM
– Persistent log is simple and File log efficient Data 0 Data 1 Data 2 – Vola le tree structure has no consistency overhead
12 Performance and Scalability
• Put allocator in DRAM CPU 0 CPU 1 • High scalability Free list Free list – Per-CPU NVMM free list, DRAM journal and inode table NVMM Journal Journal – Concurrent transac ons Super Inode table Inode table and alloca on/dealloca on block Recovery inode Inode Head Tail
Inode log
13 Fast garbage collection • Log is a linked list • Log only contains Tail metadata Head • Fast GC deletes dead log pages from the linked list • No copying Vaild log entry Invalid log entry
14 Thorough garbage collection • Starts if valid log entries < 50% log length • Format a new log and atomically replace the old one
• Only copy metadata Tail Head
Vaild log entry Invalid log entry
15 Recovery
• Normal shutdown CPU 0 CPU 1 recovery: Free list Free list – Store allocator state in recovery inode DRAM NVMM – Constant me startup Journal Journal • Failure recovery: Super Inode table Inode table block – Parallel scan Recovery inode Recovery Recovery – Failure recovery bandwidth: thread thread > 400 GB/s
16 Operation Latency 30 Ext4-datajournal Ext4-DAX xfs-DAX m1fs 25 NOVA • Intel PM Emula on Pla orm – Emulates different NVM 20 characteris cs 15 – Emulates clwb/clflush latency • NOVA provides low latency 10 Latency (microsecond) atomicity 5
0 Create Append (4KB) Delete 17 Filebench throughput
450 Ext4-datajournal Ext4-DAX 400 xfs-DAX m1fs NOVA 350 300 • NOVA achieves high 250 performance with 200 strong data consistency 150
Ops per second (x1000) 100 50 0 Fileserver Varmail Webproxy Webserver 18 DAX, Backup, and Data Protec on
DAX 19 DAX Challenges • DAX mmap() gives the programmer great power – Fine-grain updates – Pointer-based data structures – Custom data layout • …along with great responsibili es. – Data structure consistency/persistence ordering DAX – Memory protec on/media error management • The file system must not interfere.
20 Snapshots for Normal File Access
Current epoch 012
Snapshot 0 Snapshot 1
File log 0 Page 1 1 Page 1 1 Page 1 2 Page 1
Data Data Data Data
Snapshot entry File write entry Epoch ID Data in snapshot Reclaimed data Current data 21 Memory Ordering With DAX mmap()
D = 42; V = true;
• D and V live in two pages of a mmap()’d region. • Recovery invariant: if V == True, then D is valid
22 Corrupt Snapshots with DAX-mmap() • Recovery invariant: if V == True, then D is valid – Incorrect: Naïvely mark pages read-only one-at-a- me
V = Applica on: D = 1; True; Snapshot
Page hos ng D: ? 1 ?
Page hos ng V: False True T
Time Snapshot
Page Copy on Value R/W RO Fault Write Change 23 Consistent Snapshots with DAX-mmap() • Recovery invariant: if V == True, then D is valid – Correct: Block page faults un l all pages are read-only
V = Applica on: D = 1; True; Snapshot
Page hos ng D: ? 1 ?
Page hos ng V: False True F
Time Snapshot
Page Copy on Value R/W RO RO Fault Write Change Blocking 24 Performance impact of snapshots • Normal execu on vs. taking snapshots every 10s – Negligible performance loss through read()/write() – Average performance loss 6.2% through mmap()
Conven onal workloads NVMM-aware workloads from WHISPER 25 Protec ng Metadata
26 NVMM Failure Modes: Media Failures • Media errors – Detectable & correctable • Transparent to so ware – Detectable & uncorrectable So ware: Consumes good data • Affect a con guous range of data • Raise machine check excep on (MCE) NVMM Ctrl.: Detects & corrects errors – Undetectable Read NVMM data: • May consume corrupted data Media error • So ware scribbles – Kernel bugs or own bugs – Transparent to hardware
27 NVMM Failure Modes : Media Failures • Media errors – Detectable & correctable • Transparent to so ware – Detectable & uncorrectable So ware: Receives MCE • Affect a con guous range of data • Raise machine check excep on (MCE) NVMM Ctrl.: Detects uncorrectable errors
Raises excep on Read – Undetectable NVMM data: • May consume corrupted data Media error & Poison Radius • So ware scribbles (PR) – Kernel bugs or own bugs e.g. 512 bytes – Transparent to hardware
28 NVMM Failure Modes : Media Failures • Media errors – Detectable & correctable • Transparent to so ware – Detectable & uncorrectable So ware: Consumes corrupted data • Affect a con guous range of data • Raise machine check excep on (MCE) NVMM Ctrl.: Sees no error – Undetectable Read NVMM data: • Consume corrupted data Media error • So ware scribbles – Kernel bugs or own bugs – Transparent to hardware
29 NVMM Failure Modes: Scribbles • Media errors – Detectable & correctable • Transparent to so ware – Detectable & uncorrectable So ware: Bug code scribbles NVMM • Affect a con guous range of data Write • Raise machine check excep on (MCE) NVMM Ctrl.: Updates ECC – Undetectable NVMM data: • Consume corrupted data Scribble error • So ware “scribbles” – Kernel bugs or NOVA bugs – NVMM file systems are highly vulnerable
30 NVMM Failure Modes: Scribbles • Media errors – Detectable & correctable • Transparent to so ware – Detectable & uncorrectable So ware: Consumes corrupted data • Affect a con guous range of data • Raise machine check excep on (MCE) NVMM Ctrl.: Sees no error – Undetectable Read NVMM data: • Consume corrupted data Scribble error • So ware “scribbles” – Kernel bugs or NOVA bugs – NVMM file systems are highly vulnerable
31 NOVA Metadata Protection • Replicate everything inode’ Head’ Tail’ csumcsum’’ H1’ T1’ – Inodes inode – Logs Head Tail csum H1 T1 – Superblock
– … ent1 c1 … entN cN • CRC32 Checksums everywhere
ent1’ c1’ … entN’ cN’
Data 1 Data 2
32 Defense Against Scribbles • Tolera ng Larger Scribbles – Allocate replicas far from one another – Can tolerate arbitrarily large scribbles to metadata. • Preven ng scribbles – Mark all NVMM as read-only – Disable CPU write protec on while accessing NVMM
33 Protec ng Data
DAX
34 NOVA Data Protection • Divide 4KB blocks into 512-byte stripes • Compute a RAID 5-style parity stripe • Compute and replicate checksums for each stripe
1 Block 512-Byte stripe segments
S0 S1 S2 S3 S4 S5 S6 S7 P P = ⊕ S0..7
Ci = CRC32C(Si) Replicated 35 Data Protection With DAX mmap()
• File systems cannot efficiently protect mmap’ed data, since stores are invisible • NOVA’s data protec on contract:
NOVA protects pages from media errors DAX and scribbles iff they are not mmap()’d for wri ng.
36 File data protection with DAX-mmap • NOVA logs mmap() opera ons
User-space load/store load/store Applica ons:
Kernel-space read(), mmap() NOVA: write()
NVDIMMs File data: protected unprotected File log: mmap log entry 38 File data protection with DAX-mmap • On unmap and during recovery, NOVA restores protec on
User-space load/store munmap() Applica ons:
Kernel-space read(), mmap() NOVA: write()
NVDIMMs Protec on restored File data:
File log:
39 File data protection with DAX-mmap • On unmap and during recovery, NOVA restores protec on
User-space System Failure + Applica ons: recovery Kernel-space read(), mmap() NOVA: write()
NVDIMMs File data:
File log:
40 Performance
41 Metadata Protec on O(1)
Data Protec on O(N)
42 Storage Utilization
Primary log: 2.00% Redundancy Primary inode: 0.10% 14.76% File data: 82.4%
Replica inode: 0.10% Replica log: 2.00% Unused File checksum: 1.56% 0.75% File parity: 11.1%
43 Performance Cost of Data Integrity
1.2
1
0.8
0.6
0.4
0.2
0 Fileserver Varmail Webproxy Webserver RocksDB MongoDB Exim TPCC average ext4-DAX ext4-dataj For s baseline NOVA w/ MP+WP w/ MP+DP+WP 44 Conclusion • Exis ng file systems do not meet the requirements of applica ons on NVMM file systems • NOVA’s mul -log design achieves high performance and strong consistency • NOVA’s data protec on features ensure data integrity • NOVA outperforms exis ng file systems while providing stronger consistency and data protec on guarantees
45 NOVA is open source. We are preparing it for addi on to Linux.
To help or try it out: h ps://github.com/ NVSL/linux-nova
46 Thanks!
47