NOVA: a High-Performance, Fault Tolerant File System for Non

NOVA: A High-Performance, Fault Tolerant File System for Non-Volatile Main Memories Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson Non-VolaHle Systems Laboratory Department of Computer Science and Engineering University of California, San Diego 1 NVMM File System Requirements • Legacy File IO Acceleraon – fast and easy – Run exis1ng IO-intensive apps on NVDIMMs – “just works” • Strong atomicity • DAX Mmap – Load-store access & complex data structures – More control/responsibility for programs • Data protec1on – Don’t trust the media or other soPware – Support backups • DAX High performance – Otherwise, why bother? 2 XFS EXT4 F2FS BTRFS NILFS ✔ ❌ ✔❌DAX ❌ 3 BPFS SCMFS PMFS Aerie EXT4-DAX XFS-DAX M1FS ✔ ❌ ❌ ✔DAX ✔ 4 NOVA ✔✔✔✔DAX ✔ 5 Files, Atomicity, and Performance 6 Log Structured FS For NVMM Per-inode logging • A Nova FS is a tree of logs Inode Head Tail – One log per i-node – Logs are not con1guous Inode log • I-nodes point to the head and tail Commiêd entry Uncommiêd entry 7 Atomicity: Logging for Simple Metadata Operations • Combines log-structuring, journaling and copy-on-write • Log-structuring for single log Tail Tail update File log – Write, msync, chmod, etc – Lower overhead than journaling and shadow paging 8 Atomicity: Lightweight Journaling for Complex Metadata Operations • Lightweight journaling for update across logs TailTail – Unlink, rename, etc Directory log – Journal log tails instead of metadata or data TailTail File log Dir tail Journal File tail 9 Atomicity: Copy-on-write for file data • Copy-on-write for file data – Log only contains metadata – Log is short Tail Tail File log Data 0 Data 1 Data 1 Data 2 10 Atomicity: DAX • Nova does not make atomicity guarantees for DAX mmap()’d data • msync() works, but it’s slow & non-atomic • The program must ensure consistency – ISA Support: CLWB, clflush, etc. DAX – Careful programming 11 DRAM Indexes for High Performance • Per-inode logging allows for Radix tree high concurrency 0 1 2 3 DRAM • Split data structure between NVMM Tail DRAM and NVMM – Persistent log is simple and File log efficient Data 0 Data 1 Data 2 – Volale tree structure has no consistency overhead 12 Performance and Scalability • Put allocator in DRAM CPU 0 CPU 1 • High scalability Free list Free list – Per-CPU NVMM free list, DRAM journal and inode table NVMM Journal Journal – Concurrent transac1ons Super Inode table Inode table and allocaon/deallocaon block Recovery inode Inode Head Tail Inode log 13 Fast garbage collection • Log is a linked list • Log only contains Tail metadata Head • Fast GC deletes dead log pages from the linked list • No copying Vaild log entry Invalid log entry 14 Thorough garbage collection • Starts if valid log entries < 50% log length • Format a new log and atomically replace the old one • Only copy metadata Tail Head Vaild log entry Invalid log entry 15 Recovery • Normal shutdown CPU 0 CPU 1 recovery: Free list Free list – Store allocator state in recovery inode DRAM NVMM – Constant 1me startup Journal Journal • Failure recovery: Super Inode table Inode table block – Parallel scan Recovery inode Recovery Recovery – Failure recovery bandwidth: thread thread > 400 GB/s 16 Operation Latency 30 Ext4-datajournal Ext4-DAX xfs-DAX m1fs 25 NOVA • Intel PM Emulaon Plaorm – Emulates different NVM 20 characteris1cs 15 – Emulates clwb/clflush latency • NOVA provides low latency 10 Latency (microsecond) atomicity 5 0 Create Append (4KB) Delete 17 Filebench throughput 450 Ext4-datajournal Ext4-DAX 400 xfs-DAX m1fs NOVA 350 300 • NOVA achieves high 250 performance with 200 strong data consistency 150 Ops per second (x1000) 100 50 0 Fileserver Varmail Webproxy Webserver 18 DAX, Backup, and Data Protec1on DAX 19 DAX Challenges • DAX mmap() gives the programmer great power – Fine-grain updates – Pointer-based data structures – Custom data layout • …along with great responsibili1es. – Data structure consistency/persistence ordering DAX – Memory protec1on/media error management • The file system must not interfere. 20 Snapshots for Normal File Access Current epoch 012 Snapshot 0 Snapshot 1 File log 0 Page 1 1 Page 1 1 Page 1 2 Page 1 Data Data Data Data Snapshot entry File write entry Epoch ID Data in snapshot Reclaimed data Current data 21 Memory Ordering With DAX mmap() D = 42; V = true; • D and V live in two pages of a mmap()’d region. • Recovery invariant: if V == True, then D is valid 22 Corrupt Snapshots with DAX-mmap() • Recovery invariant: if V == True, then D is valid – Incorrect: Naïvely mark pages read-only one-at-a-1me V = Applicaon: D = 1; True; Snapshot Page hos1ng D: ? 1 ? Page hos1ng V: False True T Time Snapshot Page Copy on Value R/W RO Fault Write Change 23 Consistent Snapshots with DAX-mmap() • Recovery invariant: if V == True, then D is valid – Correct: Block page faults un1l all pages are read-only V = Applicaon: D = 1; True; Snapshot Page hos1ng D: ? 1 ? Page hos1ng V: False True F Time Snapshot Page Copy on Value R/W RO RO Fault Write Change Blocking 24 Performance impact of snapshots • Normal execu1on vs. taking snapshots every 10s – Negligible performance loss through read()/write() – Average performance loss 6.2% through mmap() Conven1onal workloads NVMM-aware workloads from WHISPER 25 Protec1ng Metadata 26 NVMM Failure Modes: Media Failures • Media errors – Detectable & correctable • Transparent to soPware – Detectable & uncorrectable SoPware: Consumes good data • Affect a con1guous range of data • Raise machine check excep1on (MCE) NVMM Ctrl.: Detects & corrects errors – Undetectable Read NVMM data: • May consume corrupted data Media error • SoPware scribbles – Kernel bugs or own bugs – Transparent to hardware 27 NVMM Failure Modes : Media Failures • Media errors – Detectable & correctable • Transparent to soPware – Detectable & uncorrectable SoPware: Receives MCE • Affect a con1guous range of data • Raise machine check excep1on (MCE) NVMM Ctrl.: Detects uncorrectable errors Raises excep1on Read – Undetectable NVMM data: • May consume corrupted data Media error & Poison Radius • SoPware scribbles (PR) – Kernel bugs or own bugs e.g. 512 bytes – Transparent to hardware 28 NVMM Failure Modes : Media Failures • Media errors – Detectable & correctable • Transparent to soPware – Detectable & uncorrectable SoPware: Consumes corrupted data • Affect a con1guous range of data • Raise machine check excep1on (MCE) NVMM Ctrl.: Sees no error – Undetectable Read NVMM data: • Consume corrupted data Media error • SoPware scribbles – Kernel bugs or own bugs – Transparent to hardware 29 NVMM Failure Modes: Scribbles • Media errors – Detectable & correctable • Transparent to soPware – Detectable & uncorrectable SoPware: Bug code scribbles NVMM • Affect a con1guous range of data Write • Raise machine check excep1on (MCE) NVMM Ctrl.: Updates ECC – Undetectable NVMM data: • Consume corrupted data Scribble error • SoPware “scribbles” – Kernel bugs or NOVA bugs – NVMM file systems are highly vulnerable 30 NVMM Failure Modes: Scribbles • Media errors – Detectable & correctable • Transparent to soPware – Detectable & uncorrectable SoPware: Consumes corrupted data • Affect a con1guous range of data • Raise machine check excep1on (MCE) NVMM Ctrl.: Sees no error – Undetectable Read NVMM data: • Consume corrupted data Scribble error • SoPware “scribbles” – Kernel bugs or NOVA bugs – NVMM file systems are highly vulnerable 31 NOVA Metadata Protection • Replicate everything inode’ Head’ Tail’ csumcsum’’ H1’ T1’ – Inodes inode – Logs Head Tail csum H1 T1 – Superblock – … ent1 c1 … entN cN • CRC32 Checksums everywhere ent1’ c1’ … entN’ cN’ Data 1 Data 2 32 Defense Against Scribbles • Tolerang Larger Scribbles – Allocate replicas far from one another – Can tolerate arbitrarily large scribbles to metadata. • Preven1ng scribbles – Mark all NVMM as read-only – Disable CPU write protec1on while accessing NVMM 33 Protec1ng Data DAX 34 NOVA Data Protection • Divide 4KB blocks into 512-byte stripes • Compute a RAID 5-style parity stripe • Compute and replicate checksums for each stripe 1 Block 512-Byte stripe segments S0 S1 S2 S3 S4 S5 S6 S7 P P = ⊕ S0..7 Ci = CRC32C(Si) Replicated 35 Data Protection With DAX mmap() • File systems cannot efficiently protect mmap’ed data, since stores are invisible • NOVA’s data protec1on contract: NOVA protects pages from media errors DAX and scribbles iff they are not mmap()’d for wring. 36 File data protection with DAX-mmap • NOVA logs mmap() operaons User-space load/store load/store Applicaons: Kernel-space read(), mmap() NOVA: write() NVDIMMs File data: protected unprotected File log: mmap log entry 38 File data protection with DAX-mmap • On unmap and during recovery, NOVA restores protec1on User-space load/store munmap() Applicaons: Kernel-space read(), mmap() NOVA: write() NVDIMMs Protec1on restored File data: File log: 39 File data protection with DAX-mmap • On unmap and during recovery, NOVA restores protec1on User-space System Failure + Applicaons: recovery Kernel-space read(), mmap() NOVA: write() NVDIMMs File data: File log: 40 Performance 41 Metadata Protec1on O(1) Data Protec1on O(N) 42 Storage Utilization Primary log: 2.00% Redundancy Primary inode: 0.10% 14.76% File data: 82.4% Replica inode: 0.10% Replica log: 2.00% Unused File checksum: 1.56% 0.75% File parity: 11.1% 43 Performance Cost of Data Integrity 1.2 1 0.8 0.6 0.4 0.2 0 Fileserver Varmail Webproxy Webserver RocksDB MongoDB Exim TPCC average ext4-DAX ext4-dataj For1s baseline NOVA w/ MP+WP w/ MP+DP+WP 44 Conclusion • Exis1ng file systems do not meet the requirements of applicaons on NVMM file systems • NOVA’s mul1-log design achieves high performance and strong consistency • NOVA’s data protec1on features ensure data integrity • NOVA outperforms exis1ng file systems while providing stronger consistency and data protec1on guarantees 45 NOVA is open source. We are preparing it for addi1on to Linux. To help or try it out: h^ps://github.com/ NVSL/linux-nova 46 Thanks! 47 .

Load more