
NOVA: A High-Performance, Hardened File System for Non-Volatile Main Memories

Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson
Non-Volatile Systems Laboratory
Department of Computer Science and Engineering
University of California, San Diego

NVDIMM Usage Models
• Legacy File IO Acceleration – fast and easy
  – Run existing IO-intensive apps on NVDIMMs
  – “just works”
  – NOVA is 30% - 10x faster than Ext4 for write intensive workloads.
  – Need strong protections on data.
• DAX Mmap -- maximum speed + programming challenges
  – Load-store access
  – You still need a strongly-consistent file system
    • File system corruption can still destroy your data
    • NOVA is strongly consistent
  – Data protection is still critical
[Chart: Legacy IO Throughput; y-axis: Ops per second (x1000), 0 to 450; bars: Ext4-datajournal, NOVA]

[Slide: disk-based file systems: XFS, F2FS, NILFS, EXT4, BTRFS]

Disk-based file systems are inadequate for NVMM
• Disk-based file systems cannot exploit NVMM performance
• Performance optimization compromises consistency on system failure [1]

              Atomicity                                                    Data Protection
              1-Sector   1-Sector  1-Block    1-Block  N-Block    N-Block
File system   overwrite  append    overwrite  append   overwrite  append   Data  Metadata  Snapshots
Ext4 wb       ✓          ✗         ✗          ✗        ✗          ✗        ✗     ✓         ✓
Ext4 Order    ✓          ✓         ✗          ✓        ✗          ✓        ✗     ✓         ✓
Ext4 Dataj    ✓          ✓         ✓          ✓        ✗          ✓        ✗     ✓         ✓
Btrfs         ✓          ✓         ✓          ✓        ✗          ✓        ✓     ✓         ✓
xfs           ✓          ✓         ✗          ✓        ✗          ✓        ✗     ✓         ✓
Reiserfs      ✓          ✓         ✗          ✓        ✗          ✓        ✗     ✓         ✓

[1] Pillai et al, All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI '14.
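The two usage models on the NVDIMM Usage Models slide differ in how data reaches the file: through system calls, or through plain loads and stores on a mapping. The sketch below contrasts them under a loud assumption: an ordinary temporary file stands in for a file on a DAX-mounted NVMM file system, so it shows the programming model only, not NVMM performance or NOVA's behavior.

```python
import mmap
import os
import tempfile

# Assumption: a temporary file stands in for a file on a DAX-mounted
# NVMM file system. Only the access pattern is illustrated here.
fd, path = tempfile.mkstemp()
try:
    # Legacy file IO: every update goes through a write() system call,
    # so the file system sees (and can protect) each one.
    os.write(fd, b"hello, nvmm!".ljust(4096, b"\0"))
    os.fsync(fd)

    # mmap: once the file is mapped, updates are ordinary memory stores.
    # No system call is made per access, so the file system does not
    # observe the individual writes.
    with mmap.mmap(fd, 4096) as region:
        region[0:5] = b"HELLO"        # store into the mapping
        first = bytes(region[0:12])   # load from the mapping
        region.flush()                # msync-style request to write back

    # Read the same bytes back through the system-call path.
    os.lseek(fd, 0, os.SEEK_SET)
    on_disk = os.read(fd, 12)
finally:
    os.close(fd)
    os.remove(path)
```

Because the stores in the mapped region bypass the file system, any atomicity or ordering between them is the application's responsibility, which is the "programming challenges" half of the DAX Mmap bullet.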
[Slide: NVMM file systems: BPFS, SCMFS, PMFS, Aerie, EXT4-DAX, M1FS, XFS-DAX]

NVMM file systems don’t provide strong consistency or data protection
• DAX does not provide data atomicity guarantees
• Programming is more difficult

              Atomicity        Data Protection
File system   Metadata  Data   Data  Metadata  Snapshots
BPFS          ✓         ✓      ✗     ✗         ✗
PMFS          ✓         ✗      ✗     ✗         ✗
Ext4 DAX      ✓         ✗      ✗     ✓         ✗
XFS DAX       ✓         ✗      ✗     ✓         ✗
SCMFS         ✗         ✗      ✗     ✗         ✗
Aerie         ✓         ✗      ✗     ✗         ✗

NOVA provides strong atomicity guarantee

              Atomicity                                                    Data Protection
              1-Sector   1-Sector  1-Block    1-Block  N-Block    N-Block
File system   overwrite  append    overwrite  append   overwrite  append   Data  Metadata  Snapshots
Ext4 wb       ✓          ✗         ✗          ✗        ✗          ✗        ✗     ✓         ✓
Ext4 Order    ✓          ✓         ✗          ✓        ✗          ✓        ✗     ✓         ✓
Ext4 Dataj    ✓          ✓         ✓          ✓        ✗          ✓        ✗     ✓         ✓
Btrfs         ✓          ✓         ✓          ✓        ✗          ✓        ✓     ✓         ✓
xfs           ✓          ✓         ✗          ✓        ✗          ✓        ✗     ✓         ✓
Reiserfs      ✓          ✓         ✗          ✓        ✗          ✓        ✗     ✓         ✓
NOVA          ✓          ✓         ✓          ✓        ✓          ✓        ✓     ✓         ✓

              Atomicity        Data Protection
File system   Metadata  Data   Data  Metadata  Snapshots
BPFS          ✓         ✓      ✗     ✗         ✗
PMFS          ✓         ✗      ✗     ✗         ✗
Ext4 DAX      ✓         ✗      ✗     ✓         ✗
XFS DAX       ✓         ✗      ✗     ✓         ✗
SCMFS         ✗         ✗      ✗     ✗         ✗
Aerie         ✓         ✗      ✗     ✗         ✗
NOVA          ✓         ✓      ✓     ✓         ✓

NOVA’s Key Features
• Features
  – High-performance
  – Strong Consistency
  – Snapshot support
  – Data protection
• Usage Models
  – open()/close(), read()/write()
  – DAX-mmap()

NOVA’s Architecture

Core NOVA Structures
Log Structure + copy-on-write + Journals
• One log per iNode
• Non-contiguous
• Fast, simple atomic updates
• Meta-data only
[Diagram: file logs, each with its own Tail pointer]

Core NOVA Structures
Log Structure + copy-on-write + Journals
• Multi-page atomic update
• Fast allocation
• Instant data GC
[Diagram: file log with Tail pointer; data pages Data 0, Data 1, Data 1, Data 2]

Core NOVA Structures
Log Structure + copy-on-write + Journals
• Small, fixed sized journals
• For complex ops.
[Diagram: directory log and file log, each with a Tail pointer; journal holding Dir tail and File tail]

Supporting Backups with Snapshots

Snapshots for Normal File Access
[Diagram: Current epoch: 0, 1, 2; Snapshot 0, Snapshot 1; file log entries (epoch ID, address): (0, 0x1000), (1, 0x2000), (1, 0x3000), (2, 0x4000), each pointing to a data page. Legend: Snapshot entry, File write entry, Epoch ID, Data in snapshot, Reclaimed data, Current data]

Corrupt Snapshots with DAX-mmap()
• Recovery invariant: if V == True, then D is valid
  – Incorrect: Naïvely mark pages read-only one-at-a-time
Application: D = 1; V = True;
[Timeline: Page hosting D: ? → 1, snapshot copy holds ?; Page hosting V: False → True, snapshot copy holds True. Legend: Snapshot, Page Fault, Copy on Write, Value Change, R/W, RO]

Consistent Snapshots with DAX-mmap()
• Recovery invariant: if V == True, then D is valid
  – Correct: Block page faults until all pages are read-only
Application: D = 1; V = True;
[Timeline: Page hosting D: ? → 1, snapshot copy holds ?; Page hosting V: False → True, snapshot copy holds False. Legend: Snapshot, Page Fault, Copy on Write, Value Change, R/W, RO, RO Blocking]

Performance impact of snapshots
• Normal execution vs. taking snapshots every 10s
  – Negligible performance loss through read()/write()
  – Average performance loss 6.2% through mmap()
[Chart panels: Conventional workloads; NVMM-aware workloads from WHISPER]

Data Protection: Metadata

NVMM Failure Modes: Media Failures
• Media errors
  – Detectable & correctable
    • Transparent to software
  – Detectable & uncorrectable
    • Affect a contiguous range of data
    • Raise machine check exception (MCE)
  – Undetectable
    • May consume corrupted data
• Software scribbles
  – Kernel bugs or own bugs
  – Transparent to hardware
[Diagram (detectable & correctable), read path: NVMM data: Media error → NVMM Ctrl.: Detects & corrects errors → Software: Consumes good data]
[Diagram (detectable & uncorrectable), read path: NVMM data: Media error & Poison Radius (PR), e.g. 512 bytes → NVMM Ctrl.: Detects uncorrectable errors, raises exception → Software: Receives MCE]

Detecting NVMM Media Errors
memcpy_mcsafe()
• Copy data from NVMM
• Catch MCEs and return failure
[Flowchart: MCE → Whose access?
  Kernel → Handler registered? Yes: Process and return. No: Kernel panic.
  User → Recoverable: SIGBUS. Unrecoverable: Kernel panic.]

NVMM Failure Modes: Media Failures
[Diagram (undetectable), read path: NVMM data: Media error → NVMM Ctrl.: Sees no error → Software: Consumes corrupted data]

NVMM Failure Modes: Scribbles
• Software “scribbles”
  – Kernel bugs or NOVA bugs
  – NVMM file systems are highly vulnerable
[Diagram, write path: Software: Buggy code scribbles NVMM → NVMM Ctrl.: Updates ECC → NVMM data: Scribble error]
[Diagram, read path: NVMM data: Scribble error → NVMM Ctrl.: Sees no error → Software: Consumes corrupted data]

NOVA Metadata Protection
• Replicate everything
  – Inodes
  – Logs
  – Superblock
  – …
• CRC32 Checksums everywhere
[Diagram: inode (Head, Tail, csum, H1, T1) and replica inode’ (Head’, Tail’, csum’, H1’, T1’); log entries ent1 c1 … entN cN and replica entries ent1’ c1’ … entN’ cN’; data pages Data 1, Data 2]

Defense Against Scribbles
• Tolerating Larger Scribbles
  – Allocate replicas far from one another
  – Can tolerate arbitrarily large scribbles to metadata.
• Preventing scribbles
  – Mark all NVMM as read-only
  – Disable CPU write protection while accessing NVMM

Data Protection: Data

NOVA Data Protection
• Divide 4KB blocks into 512-byte stripes
• Compute a RAID 5-style parity stripe
• Compute and replicate checksums for each stripe
[Diagram: 1 block = 512-byte stripe segments S0 S1 S2 S3 S4 S5 S6 S7, plus parity stripe P]
$P = S_0 \oplus S_1 \oplus \dots \oplus S_7$
$C_i = \mathrm{CRC32C}(S_i)$ (replicated)

File data protection with DAX-mmap
• With DAX-mmap(), file data changes are invisible to NOVA
• NOVA cannot protect mmap’ed file data
• NOVA logs mmap() and restores protection on munmap() or recovery
[Diagram: user-space applications access the mapping with load/store; kernel-space NOVA serves read(), write() and mmap(); on the NVDIMMs, file data is either protected or unprotected, and the file log holds an mmap log entry]

File data protection with DAX-mmap
• NOVA cannot protect mmap’ed file data
  – User applications directly load/store the mmap’ed region
  – NOVA has to know what file pages are mmap’ed
[Diagram: application calls munmap(); protection restored for the file data]
[Diagram: system failure + recovery]

Performance

Performance Cost of Data Integrity
[Chart: y-axis 0 to 1.2; workloads: Fileserver, Varmail, Webproxy, Webserver, RocksDB, MongoDB, Exim, TPCC, average; series: xfs-DAX, ext4-DAX, ext4-dataj, Fortis baseline, w/ MP+WP, w/ MP+DP+WP]

Conclusion
• Existing file systems do not meet the requirements of applications on NVMM file systems
• NOVA’s multi-log design achieves high performance and strong consistency
• NOVA’s data protection features ensure data integrity
• NOVA outperforms existing file systems while providing stronger consistency and data protection guarantees

Thank you! Try NOVA! https://github.com/NVSL/NOVA
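The stripe scheme on the NOVA Data Protection slide (eight 512-byte stripes per 4KB block, one XOR parity stripe, one CRC32C per stripe) can be sketched in user-space Python. This is an illustration of the idea only, not NOVA's kernel code: every function name is hypothetical, checksum replication is left out, and CRC32C is written out by hand because `zlib.crc32` uses a different polynomial.

```python
# Sketch under stated assumptions; names are hypothetical, not NOVA's API.
BLOCK_SIZE = 4096
STRIPE_SIZE = 512  # 8 stripes per block: S0..S7


def _make_crc32c_table():
    """Lookup table for CRC32C (Castagnoli, reflected polynomial 0x82F63B78)."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Table-driven CRC32C over a byte string."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _CRC32C_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def xor_stripes(stripes) -> bytes:
    """Byte-wise XOR of equally sized stripes."""
    out = bytearray(STRIPE_SIZE)
    for stripe in stripes:
        for i, b in enumerate(stripe):
            out[i] ^= b
    return bytes(out)


def protect_block(block: bytes):
    """Split a block into stripes and compute parity P and checksums C_i."""
    assert len(block) == BLOCK_SIZE
    stripes = [block[i:i + STRIPE_SIZE] for i in range(0, BLOCK_SIZE, STRIPE_SIZE)]
    parity = xor_stripes(stripes)
    checksums = [crc32c(s) for s in stripes]
    return stripes, parity, checksums


def read_block(stripes, parity, checksums) -> bytes:
    """Verify every stripe; rebuild at most one bad stripe from parity."""
    bad = [i for i, s in enumerate(stripes) if crc32c(s) != checksums[i]]
    if len(bad) > 1:
        raise RuntimeError("more than one corrupt stripe; parity cannot recover")
    if bad:
        i = bad[0]
        others = [s for j, s in enumerate(stripes) if j != i]
        rebuilt = xor_stripes(others + [parity])
        if crc32c(rebuilt) != checksums[i]:
            raise RuntimeError("rebuilt stripe fails its checksum")
        stripes = stripes[:i] + [rebuilt] + stripes[i + 1:]
    return b"".join(stripes)


# Usage: corrupt one stripe, then recover the original block.
block = bytes((7 * i + i // STRIPE_SIZE) % 256 for i in range(BLOCK_SIZE))
stripes, parity, checksums = protect_block(block)
damaged = list(stripes)
damaged[3] = bytes(STRIPE_SIZE)  # simulate a scribble over stripe S3
recovered = read_block(damaged, parity, checksums)
```

The checksums locate the damaged stripe and the parity stripe supplies its contents, which is why a single parity stripe is enough as long as only one stripe per block is corrupted.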
