NOVA: A High-Performance, Hardened File System for Non-Volatile Main Memories
Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson
Non-Volatile Systems Laboratory, Department of Computer Science and Engineering, University of California, San Diego
NVDIMM Usage Models
• Legacy File IO Acceleration – fast and easy
  – Run existing IO-intensive apps on NVDIMMs
  – “just works”
  – NOVA is 30% - 10x faster than Ext4 for write intensive workloads.
  – Need strong protections on data.
• DAX Mmap – maximum speed + programming challenges
  – Load-store access
  – You still need a strongly-consistent file system
    • File system corruption can still destroy your data
    • NOVA is strongly consistent
  – Data protection is still critical
[Figure: Legacy IO Throughput; y-axis: Ops per second (x1000), 0 to 450; series: Ext4-datajournal, NOVA]
XFS F2FS NILFS
EXT4 BTRFS
Disk-based file systems are inadequate for NVMM
• Disk-based file systems cannot exploit NVMM performance
• Performance optimization compromises consistency on system failure [1]

(Atomicity: first six columns; Data Protection: last three columns)
File system | 1-Sector overwrite | 1-Sector append | 1-Block overwrite | 1-Block append | N-Block overwrite | N-Block append | Metadata | Snapshots | Data
Ext4 wb     | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓
Ext4 Order  | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓
Ext4 Dataj  | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓
Btrfs       | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓
xfs         | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓
Reiserfs    | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓

[1] Pillai et al., All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI '14.

BPFS SCMFS PMFS Aerie EXT4-DAX M1FS XFS-DAX
NVMM file systems don’t provide strong consistency or data protection
• DAX does not provide data atomicity guarantees
• Programming is more difficult

(Atomicity: first two columns; Data Protection: last three columns)
File system | Metadata | Data | Metadata | Snapshots | Data
BPFS        | ✓ | ✓ | ✗ | ✗ | ✗
PMFS        | ✓ | ✗ | ✗ | ✗ | ✗
Ext4 DAX    | ✓ | ✗ | ✗ | ✓ | ✗
XFS DAX     | ✓ | ✗ | ✗ | ✓ | ✗
SCMFS       | ✗ | ✗ | ✗ | ✗ | ✗
Aerie       | ✓ | ✗ | ✗ | ✗ | ✗
NOVA provides strong atomicity guarantee

Disk-based file systems (Atomicity: first six columns; Data Protection: last three columns)
File system | 1-Sector overwrite | 1-Sector append | 1-Block overwrite | 1-Block append | N-Block overwrite | N-Block append | Metadata | Snapshots | Data
Ext4 wb     | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓
Ext4 Order  | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓
Ext4 Dataj  | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓
Btrfs       | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓
xfs         | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓
Reiserfs    | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓
NOVA        | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓

NVMM file systems (Atomicity: first two columns; Data Protection: last three columns)
File system | Metadata | Data | Metadata | Snapshots | Data
BPFS        | ✓ | ✓ | ✗ | ✗ | ✗
PMFS        | ✓ | ✗ | ✗ | ✗ | ✗
Ext4 DAX    | ✓ | ✗ | ✗ | ✓ | ✗
XFS DAX     | ✓ | ✗ | ✗ | ✓ | ✗
SCMFS       | ✗ | ✗ | ✗ | ✗ | ✗
Aerie       | ✓ | ✗ | ✗ | ✗ | ✗
NOVA        | ✓ | ✓ | ✓ | ✓ | ✓
NOVA’s Key Features
• Features
  – High-performance
  – Strong Consistency
  – Snapshot support
  – Data protection
• Usage Models
  – open()/close(), read()/write()
  – DAX-mmap()

NOVA’s Architecture
Core NOVA Structures: Log Structure + copy-on-write + Journals
• Log structure
  – One log per inode
  – Non-contiguous
  – Fast, simple atomic updates
  – Metadata only
• Copy-on-write
  – Multi-page atomic update
  – Fast allocation
  – Instant data GC
• Journals
  – Small, fixed-size journals
  – For complex ops.
[Figure: a file log and a directory log, each with a tail pointer; the file log points to data pages Data 0, Data 1, Data 1, Data 2; a journal records the Dir tail and the File tail]

Supporting Backups with Snapshots
Snapshots for Normal File Access
[Figure: current epoch advancing 0, 1, 2 with Snapshot 0 and Snapshot 1; file log entries (epoch ID, address): (0, 0x1000), (1, 0x2000), (1, 0x3000), (2, 0x4000), each pointing to a data page. Legend: snapshot entry, file write entry, epoch ID, data in snapshot, reclaimed data, current data]

Corrupt Snapshots with DAX-mmap()
• Recovery invariant: if V == True, then D is valid
  – Incorrect: Naïvely mark pages read-only one-at-a-time
[Figure: timeline. Application: D = 1; V = True. Page hosting D: ?, then 1; snapshot copy holds ?. Page hosting V: False, then True; snapshot copy holds True. Legend: R/W, RO, page fault, copy on write, value change, snapshot]

Consistent Snapshots with DAX-mmap()
• Recovery invariant: if V == True, then D is valid
  – Correct: Block page faults until all pages are read-only
[Figure: timeline. Application: D = 1; V = True. Page hosting D: ?, then 1; snapshot copy holds ?. Page hosting V: False, then True; snapshot copy holds False. Legend: R/W, RO, page fault, copy on write, value change, blocking, snapshot]

Performance impact of snapshots
• Normal execution vs. taking snapshots every 10s
  – Negligible performance loss through read()/write()
  – Average performance loss 6.2% through mmap()
[Figure: two panels: conventional workloads; NVMM-aware workloads from WHISPER]

Data Protection: Metadata
NVMM Failure Modes: Media Failures
• Media errors
  – Detectable & correctable
    • Transparent to software
    • NVMM Ctrl. detects & corrects errors; software consumes good data
  – Detectable & uncorrectable
    • Affect a contiguous range of data: the Poison Radius (PR), e.g. 512 bytes
    • Raise machine check exception (MCE)
    • NVMM Ctrl. detects uncorrectable errors and raises an exception; software receives the MCE
  – Undetectable
    • NVMM Ctrl. sees no error; software may consume corrupted data
• Software scribbles
  – Kernel bugs or own bugs
  – Transparent to hardware

Detecting NVMM Media Errors
• memcpy_mcsafe()
  – Copy data from NVMM
  – Catch MCEs and return failure
[Figure: MCE handling flow. Unrecoverable MCE: kernel panic. Recoverable MCE on a user access: SIGBUS. Recoverable MCE on a kernel access: process and return if a handler is registered, otherwise kernel panic]

NVMM Failure Modes: Scribbles
• Software “scribbles”
  – Kernel bugs or NOVA bugs
  – NVMM file systems are highly vulnerable
  – On the write: buggy code scribbles NVMM; NVMM Ctrl. updates ECC
  – On a later read: NVMM Ctrl. sees no error; software consumes corrupted data
NOVA Metadata Protection
• Replicate everything
  – Inodes
  – Logs
  – Superblock
  – …
• CRC32 Checksums everywhere
[Figure: inode (Head, Tail, csum, H1, T1) and its replica inode’ (Head’, Tail’, csum’, H1’, T1’); log entries ent1 c1 … entN cN and replica log entries ent1’ c1’ … entN’ cN’; data pages Data 1, Data 2]

Defense Against Scribbles
• Tolerating Larger Scribbles
  – Allocate replicas far from one another
  – Can tolerate arbitrarily large scribbles to metadata.
• Preventing scribbles
  – Mark all NVMM as read-only
  – Disable CPU write protection while accessing NVMM

Data Protection: Data
NOVA Data Protection
• Divide 4KB blocks into 512-byte stripes
• Compute a RAID 5-style parity stripe
• Compute and replicate checksums for each stripe
[Figure: 1 block = eight 512-byte stripe segments S0 … S7 plus parity stripe P; checksums are replicated]
$P = S_0 \oplus S_1 \oplus \cdots \oplus S_7$, $C_i = \mathrm{CRC32C}(S_i)$

File data protection with DAX-mmap
• With DAX-mmap(), file data changes are invisible to NOVA
• NOVA cannot protect mmap’ed file data
  – User applications directly load/store the mmap’ed region
  – NOVA has to know what file pages are mmap’ed
• NOVA logs mmap() and restores protection on munmap() or recovery
[Figure: user-space applications issue load/store; kernel-space NOVA handles read(), write(), mmap(); on the NVDIMMs, file data is protected or unprotected and the file log holds an mmap log entry. Protection is restored on munmap() or on recovery after a system failure]

Performance
Performance Cost of Data Integrity
[Figure: bar chart, y-axis 0 to 1.2; workloads: Fileserver, Varmail, Webproxy, Webserver, RocksDB, MongoDB, Exim, TPCC, average; series: xfs-DAX, ext4-DAX, ext4-dataj, Fortis baseline, w/ MP+WP, w/ MP+DP+WP]

Conclusion
• Existing file systems do not meet the requirements of applications on NVMM file systems
• NOVA’s multi-log design achieves high performance and strong consistency
• NOVA’s data protection features ensure data integrity
• NOVA outperforms existing file systems while providing stronger consistency and data protection guarantees
Thank you!
Try NOVA! https://github.com/NVSL/NOVA

Backup Slides
Protecting Against Scribbles
• Metadata allocator separates metadata replicas
  – Allocate primary and replica pages in opposite directions
  – Use allocator ‘dead-zone’ to guarantee minimal distance
  – Protect against scribbles from other kernel bugs and own bugs
• Simple allocation: log1 log1’ log2 log2’ … logN logN’
  – A page-sized scribble can affect most pairs of replicated metadata pages
• Two-way allocation: log1 log2 … logN logN’ … log2’ log1’
  – A page-sized scribble can affect limited pairs of replicated metadata pages
• Dead-zone allocation: log1 log2 … logN [1 MB] logN’ … log2’ log1’
  – A scribble smaller than 1 MB cannot corrupt both copies of any metadata page

Minimize the chance of corruptions – x86 write protection
• Leverage x86 CPU’s write protection
  – CR0.WP enables/disables writes to read-only memory on each x86 core
  – Only enable writing when NOVA writes to NVMM
  – Protect against scribbles from other kernel bugs, not own bugs
Filebench throughput
• NOVA achieves high performance with strong data consistency
[Figure: Filebench throughput; y-axis: Ops per second (x1000), 0 to 450; workloads: Fileserver, Varmail, Webproxy, Webserver; series: Ext4-datajournal, Ext4-DAX, m1fs, NOVA]
Tick-tock inode update
• Update tails of primary inode
• Update csum of primary inode
• Same procedure for inode’
[Figure: primary inode (Head’, Tail’, csum’, H1’, T1’) and secondary inode (Head, Tail, csum, H1, T1); legend: old, updating, new]
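The tick-tock ordering can be checked with a small simulation. This is only a sketch under stated assumptions: the inode is reduced to a single tail field, `zlib.crc32` stands in for the inode checksum, and the names `csum`, `is_valid`, and `tick_tock` are hypothetical, not NOVA functions. It injects a crash after every individual store and records which copies still verify.

```python
import zlib


def csum(tail: int) -> int:
    """Checksum over the toy inode contents (here just the 64-bit tail)."""
    return zlib.crc32(tail.to_bytes(8, "little"))


def is_valid(copy: dict) -> bool:
    return copy["csum"] == csum(copy["tail"])


def tick_tock(primary: dict, replica: dict, new_tail: int):
    """Yield after every individual store so a crash can be injected anywhere."""
    primary["tail"] = new_tail          # update tail of primary inode
    yield
    primary["csum"] = csum(new_tail)    # update csum of primary inode
    yield
    replica["tail"] = new_tail          # same procedure for inode'
    yield
    replica["csum"] = csum(new_tail)
    yield


# Crash after 0, 1, 2, 3 or 4 stores and record which copies still verify.
outcomes = []
for stores_done in range(5):
    primary = {"tail": 7, "csum": csum(7)}
    replica = dict(primary)
    steps = tick_tock(primary, replica, new_tail=9)
    for _ in range(stores_done):
        next(steps)
    outcomes.append((is_valid(primary), is_valid(replica)))

print(outcomes)
```

In this model at most one copy is mid-update at any crash point, so the other copy always verifies and can be used for repair.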
Performance Cost of Data Integrity
[Figure: bar chart, y-axis 0 to 1.2; workloads: Fileserver, Varmail, Webproxy, Webserver, RocksDB, MongoDB, Exim, TPCC, average; series: xfs-DAX, ext4-DAX, ext4-dataj, Fortis baseline, w/ MP, w/ MP+WP, w/ MP+DP, w/ MP+DP+WP]

Conclusions
• NVMM file systems need unique solutions for reliability
  – Error reporting mechanisms differ from those of disks
  – DAX-mmap complicates designs
• Performance and storage penalties vary
  – Storage cost is modest for the presented hardening techniques
  – Performance impact is significant for some applications
• More knowledge is necessary to determine the trade-offs
  – Uncorrectable media errors in emerging NVMM technologies
  – The frequency and size of scribbles in kernel space
• NOVA provides all hardening techniques as mount options
Performance impact of data integrity
• File operation latency

Reliability evaluation – metadata pages at risk
• Scan an aged NOVA file system image
  – Examine distances between the primary and replica pages
  – Count vulnerable page pairs for a given scribble size
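The counting step can be sketched as follows. This is an illustration, not the evaluation tool used for the slides: the page size, the three toy layouts (modelled on the simple, two-way, and dead-zone allocation pictures), and the name `vulnerable_pairs` are assumptions. A pair counts as vulnerable when one contiguous scribble of the given size could touch both the primary and the replica page.

```python
PAGE = 4096
MB = 1 << 20


def vulnerable_pairs(pairs, scribble_bytes):
    """Count (primary, replica) page pairs that one contiguous scribble could hit."""
    count = 0
    for primary, replica in pairs:
        low, high = sorted((primary, replica))
        gap = high - (low + PAGE)       # bytes lying between the two pages
        # The scribble must cover the gap plus at least one byte of each page.
        if scribble_bytes >= gap + 2:
            count += 1
    return count


n = 4  # number of replicated metadata pages in the toy image
# Simple allocation: log1 log1' log2 log2' ...
simple = [(2 * i * PAGE, (2 * i + 1) * PAGE) for i in range(n)]
# Two-way allocation: primaries grow upward, replicas grow downward.
two_way = [(i * PAGE, (2 * n - 1 - i) * PAGE) for i in range(n)]
# Dead-zone allocation: two-way plus a 1 MB hole between the two regions.
dead_zone = [(p, r + MB) for p, r in two_way]

for name, layout in [("simple", simple), ("two-way", two_way), ("dead-zone", dead_zone)]:
    print(name, vulnerable_pairs(layout, PAGE), vulnerable_pairs(layout, MB - 1))
```

With these toy layouts, a page-sized scribble endangers every pair under simple allocation, only the innermost pair under two-way allocation, and none once the dead zone is added.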
[Figure: log-log plot; points with Y == 0 are not shown]

PMFS shortcomings
• No data atomicity support
• High consistency overhead with persistent B-tree
• Not scalable
  – Directory operations (linear search)
  – NVMM allocation (single allocator)
  – Single journal shared by all transactions
• Poor performance on large directories
• Intel has deprecated PMFS

Ext4-DAX and xfs-DAX shortcomings
• No data atomicity support
• Single journal shared by all the transactions (JBD2-based)
• Poor performance
Non-volatile main memory is about to happen
NVMM needs a new file system: PMFS, Ext4-DAX, SCMFS, Aerie, NOVA, …

Why a new file system?
We need to reduce the software overhead.
Source: Memory-Driven Computing, Kimberly Keeton, HP Labs

What Should a File System Provide?
• Performance: current focus (of all known efforts)
• Consistency
  – Atomic metadata operations
  – Atomic file updates
• Data Protection
  – Snapshots
  – Media error protection
  – Software error protection
• Cost optimizations
  – Compression
  – Deduplication
We need to study the impact of adding more file system services in the context of NVMM.
Evaluation: Latency
• Intel PM Emulation Platform
  – Emulates different NVM characteristics
  – Emulates clwb/PCOMMIT latency
• NOVA provides low latency atomicity
[Figure: Operation latency; y-axis: Latency (microsecond), 0 to 25; operations: Create, Append (4KB), Delete; series: Ext4-datajournal, Ext4-DAX, m1fs, NOVA]
NOVA design and in-NVMM data layout
• High performance
  – No page cache
  – Memory semantics
  – Segregated data structures
  – Per-CPU freelist
  – Per-inode logging
• Strong consistency
  – Copy-on-write file data
  – Using 8-byte atomic stores
[Figure: per-CPU free lists (CPU 0, CPU 1, ...) in DRAM; in NVMM: super block, recovery inode, per-CPU journals and per-CPU inode tables; each inode holds Head and Tail pointers to its inode log]
[Figure: per-inode log with a 64-bit tail pointer; the file log points to data pages Data 0, Data 1, Data 1, Data 2 while writing to pages 1 and 2]
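A toy model of the write path described above (copy-on-write data, a metadata-only log entry, then one tail update as the commit point). It is a sketch under assumptions: `ToyInodeLog` and its fields are hypothetical names, Python lists stand in for NVMM, and the integer `tail` stands in for the 64-bit tail pointer that is updated with a single 8-byte atomic store.

```python
class ToyInodeLog:
    def __init__(self):
        self.blocks = []     # data pages in "NVMM", never overwritten in place
        self.entries = []    # log area; only entries[:tail] are committed
        self.tail = 0        # stands in for the 64-bit tail pointer

    def write(self, first_page, pages, crash_before_commit=False):
        # 1. Copy-on-write: new data goes to freshly allocated blocks.
        start = len(self.blocks)
        self.blocks.extend(pages)
        # 2. Append a metadata-only entry after the committed tail.
        del self.entries[self.tail:]     # reuse space left by uncommitted entries
        self.entries.append((first_page, list(range(start, start + len(pages)))))
        if crash_before_commit:
            return
        # 3. Commit: one store of the new tail publishes all pages at once.
        self.tail = len(self.entries)

    def read(self, page):
        data = None
        for first_page, block_ids in self.entries[:self.tail]:
            if first_page <= page < first_page + len(block_ids):
                data = self.blocks[block_ids[page - first_page]]  # newest entry wins
        return data


log = ToyInodeLog()
log.write(0, [b"Data 0", b"Data 1"])
# Writing to pages 1 and 2, but "crashing" before the tail update:
log.write(1, [b"Data 1 (new)", b"Data 2"], crash_before_commit=True)
after_crash = [log.read(p) for p in range(3)]
# The same write, this time committed:
log.write(1, [b"Data 1 (new)", b"Data 2"])
after_commit = [log.read(p) for p in range(3)]
print(after_crash, after_commit)
```

Because readers only follow entries below the tail, the two-page update becomes visible all at once or not at all, which is the multi-page atomicity the slides attribute to the log plus copy-on-write design.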
NVMM file systems should support snapshot
• Snapshot is essential for file system backup
• Available in file systems for block devices
  – ZFS, Btrfs, WAFL
• NOVA is the first NVMM file system providing snapshot
  – Efficient full-filesystem snapshot at minimal performance cost
  – Creating/deleting snapshots does not halt file system
  – Creating consistent snapshots with DAX-mmap enabled

Enable snapshot in NOVA
• Maintain a current ‘epoch_id’ for the file system
  – Stored in the superblock
  – Incremented after every snapshot taken
• Add the ‘epoch_id’ to each log entry
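How the epoch ID decides whether overwritten data is reclaimed or kept for a snapshot can be sketched with a one-page toy file. Assumptions are flagged here and in the comments: `ToyFS` is a hypothetical name, snapshots are never deleted in this model, and an overwritten block is recorded only in the log of the most recent snapshot that can see it, paired with the epoch in which it was overwritten.

```python
class ToyFS:
    """One-page file; each write entry carries the epoch_id it was written in."""

    def __init__(self):
        self.epoch_id = 0        # would live in the superblock
        self.current = None      # live write entry: (epoch_id, block address)
        self.snapshot_log = {}   # snapshot id -> [(block address, delete epoch)]
        self.reclaimed = []      # blocks freed immediately

    def take_snapshot(self):
        snapshot_id = self.epoch_id
        self.snapshot_log[snapshot_id] = []
        self.epoch_id += 1       # incremented after every snapshot taken
        return snapshot_id

    def write(self, block):
        old = self.current
        self.current = (self.epoch_id, block)
        if old is None:
            return
        old_epoch, old_block = old
        if old_epoch == self.epoch_id:
            # Written and overwritten in the same epoch: no snapshot sees it.
            self.reclaimed.append(old_block)
        else:
            # Assumption: keep it in the log of the newest snapshot that sees it.
            self.snapshot_log[self.epoch_id - 1].append((old_block, self.epoch_id))


fs = ToyFS()
fs.write(0x1000)        # epoch 0
fs.take_snapshot()      # Snapshot 0, epoch becomes 1
fs.write(0x2000)        # epoch 1, overwrites 0x1000
fs.write(0x3000)        # epoch 1, overwrites 0x2000
fs.take_snapshot()      # Snapshot 1, epoch becomes 2
fs.write(0x4000)        # epoch 2, overwrites 0x3000
print(fs.reclaimed, fs.snapshot_log, fs.current)
```

The sequence mirrors the numbers in the snapshot figures: 0x2000 is reclaimed, 0x1000 stays with Snapshot 0, 0x3000 stays with Snapshot 1, and 0x4000 is the current data.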
Taking snapshots
[Figure: current epoch advancing 0, 1, 2; file log entries (epoch ID, address): (0, 0x1000), (1, 0x2000), (1, 0x3000), (2, 0x4000), each pointing to a data page; Snapshot 0 log: 0x1000, 1; Snapshot 1 log: 0x3000, 2. Legend: snapshot entry, file write entry, epoch ID, data in snapshot, reclaimed data, current data]

Deleting snapshots
[Figure: same layout as above, with background GC reclaiming the data of the deleted snapshot]

Mounting snapshots
[Figure: Snapshot 1 log: 0x3000, 2; the Snapshot 1 log tail is marked inside the file log]

Snapshots with DAX-mmap
• Goal: Applications take snapshots and keep running
  – Virtual addresses do not change
  – Consistency must be guaranteed
• How: Set each mmap’ed page as read-only
  – Then do copy-on-write for new stores (detected by page fault)
• Caveat: Can only atomically set one page as read-only
  – What if the order of becoming read-only conflicts with consistency?
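The caveat can be made concrete with a scripted interleaving of one application (D = 1; V = True, in that order) and a snapshot that freezes page D first and page V second. This is a hand-ordered model, not real page-table code: `take_snapshot` and `invariant_holds` are hypothetical names, `None` plays the role of the not-yet-valid "?" value of D, and the two branches simply encode whether the page fault on D is served immediately or blocked until every page is read-only.

```python
def take_snapshot(block_faults: bool) -> dict:
    live = {"D": None, "V": False}   # page contents seen by the application
    snapshot = {}                    # page contents captured by the snapshot
    read_only = set()

    def mark_read_only(page):
        snapshot[page] = live[page]  # the snapshot image of this page is frozen
        read_only.add(page)

    def app_stores():
        # Program order of the application: D first, then V.
        # A store to a read-only page faults and is served by copy-on-write,
        # so it changes the live copy but never the frozen snapshot copy.
        live["D"] = 1
        live["V"] = True

    mark_read_only("D")
    if block_faults:
        # The store to D faults and stalls until every page is read-only.
        mark_read_only("V")
        app_stores()
    else:
        # The fault on D is served at once, so V = True lands before V is frozen.
        app_stores()
        mark_read_only("V")
    return snapshot


def invariant_holds(snapshot: dict) -> bool:
    """Recovery invariant: if V == True, then D is valid."""
    return (not snapshot["V"]) or snapshot["D"] is not None


naive = take_snapshot(block_faults=False)
blocking = take_snapshot(block_faults=True)
print(naive, invariant_holds(naive))
print(blocking, invariant_holds(blocking))
```

In the naive ordering the snapshot captures V == True together with the stale D, which violates the invariant; blocking faults until all pages are read-only yields a snapshot in which V is still False.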
NOVA (meta)data integrity features
• Detect (meta)data corruptions
  – Media errors: error code from memcpy_mcsafe(), and checksums
  – Software scribbles: checksums
• Repair (meta)data corruptions
  – Metadata: Fully replicated
  – File data: Stripe and parity-code each block
• Minimize scribbles
  – Leverage x86 CPU’s write protection (CR0.WP)
  – Metadata allocators separate replicas

Metadata error detection and correction
• Use inode access as an example:
  – Read inode (memcpy_mcsafe)
  – Read inode’ (memcpy_mcsafe)
  – Verify inode csum and inode’ csum
  – memcmp(inode, inode’)
  – Use inode
• If any step raises an error:
  – Attempt to repair and retry
  – If recovery fails: return –EIO to user
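A minimal user-space sketch of the verify-and-repair path for a replicated inode. Everything here is illustrative: `seal`, `checksum_ok`, and `read_inode` are hypothetical helpers, `zlib.crc32` stands in for the checksum, the media-error path through memcpy_mcsafe() is not modelled, and the rule "when both copies verify but differ, the primary wins" is an assumption based on the primary being updated first.

```python
import errno
import zlib


def seal(payload: bytes) -> bytearray:
    """Append a CRC-32 of the payload, as a stand-in for the inode csum field."""
    return bytearray(payload + zlib.crc32(payload).to_bytes(4, "little"))


def checksum_ok(record) -> bool:
    payload, stored = bytes(record[:-4]), bytes(record[-4:])
    return zlib.crc32(payload).to_bytes(4, "little") == stored


def read_inode(inode: bytearray, inode_replica: bytearray) -> bytes:
    ok, ok_replica = checksum_ok(inode), checksum_ok(inode_replica)
    if not ok and not ok_replica:
        raise OSError(errno.EIO, "both inode copies failed verification")
    if not ok:
        inode[:] = inode_replica          # repair primary from replica
    elif not ok_replica:
        inode_replica[:] = inode          # repair replica from primary
    elif inode != inode_replica:
        inode_replica[:] = inode          # assumption: primary is the newer copy
    return bytes(inode[:-4])


inode = seal(b"head=0x1000 tail=0x1040")
replica = bytearray(inode)
inode[5] ^= 0xFF                          # a scribble hits the primary copy
repaired = read_inode(inode, replica)

inode[5] ^= 0xFF                          # now scribble both copies
replica[6] ^= 0xFF
try:
    read_inode(inode, replica)
    failed_errno = None
except OSError as exc:
    failed_errno = exc.errno
print(repaired, failed_errno == errno.EIO)
```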
File data error detection and correction
• Read path:
  – Read a strip (memcpy_mcsafe)
  – Verify strip’s checksum
  – Check errors in checksums
  – Copy data to user
• If any step raises an error:
  – Attempt to repair and retry
  – If recovery fails: return –EIO to user
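The stripe, parity, and checksum scheme can be sketched for one 4 KB block. The sketch makes several assumptions that are not NOVA's implementation: `protect` and `read_block` are hypothetical names, `zlib.crc32` (CRC-32) stands in for CRC32C, the checksum replicas and their majority vote are left out, and media errors reported by memcpy_mcsafe() are not modelled.

```python
import errno
import zlib
from functools import reduce

BLOCK = 4096
STRIPE = 512
NSTRIPES = BLOCK // STRIPE


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def stripes_of(block) -> list:
    return [bytes(block[i:i + STRIPE]) for i in range(0, BLOCK, STRIPE)]


def protect(block: bytes):
    """Return (parity stripe, per-stripe checksums) for one 4 KB block."""
    stripes = stripes_of(block)
    return reduce(xor, stripes), [zlib.crc32(s) for s in stripes]


def read_block(block: bytearray, parity: bytes, csums: list) -> bytes:
    stripes = stripes_of(block)
    bad = [i for i in range(NSTRIPES) if zlib.crc32(stripes[i]) != csums[i]]
    if len(bad) > 1:
        raise OSError(errno.EIO, "more than one bad stripe")
    if bad:
        i = bad[0]
        others = [s for j, s in enumerate(stripes) if j != i]
        rebuilt = reduce(xor, others, parity)   # parity XOR the seven good stripes
        if zlib.crc32(rebuilt) != csums[i]:
            raise OSError(errno.EIO, "repair failed verification")
        block[i * STRIPE:(i + 1) * STRIPE] = rebuilt
    return bytes(block)


original = bytes(range(256)) * 16               # one 4 KB block of sample data
parity, csums = protect(original)
damaged = bytearray(original)
damaged[3 * STRIPE + 10] ^= 0x01                # single bit flip inside stripe 3
recovered = read_block(damaged, parity, csums)
print(recovered == original)
```

Per-stripe checksums locate the damaged stripe, and the parity stripe rebuilds it; a single parity stripe can repair at most one bad stripe per block, which is why two bad stripes end in EIO here.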
File data protection with DAX-mmap
• With DAX-mmap(), file data changes are invisible to NOVA
  – Following DAX-mmap() semantics, NOVA doesn’t interfere with mmap’ed file data.
• NOVA cannot protect mmap’ed file data
  – User applications directly load/store the mmap’ed region
  – NOVA has to know what file pages are mmap’ed
  – NOVA skips protection routines for reads and writes to the regions found in the vm area list.
[Figure: user-space applications issue read() and load/store; kernel-space NOVA handles read(), write(), mmap() and keeps a vm area list of regions with no protection; the NVDIMMs hold the file data and the file log]
Performance impact of data integrity
• Latency breakdown on NVDIMM-N
[Figure: latency breakdown into metadata protection, file data protection, x86 write protection]
• Random read/write bandwidth on NVDIMM-N

Storage utilization with data integrity
• Conceptual view of NOVA’s utilization of NVMM
  – Dead-zone (DZ) only virtually exists to separate metadata replicas. File data can still live inside.
• Actual usage of a practical workload: fileserver
Differences from disk FS implementation
• Low-latency storage media
  – Need to choose fast methods for any involved computation
• Fine-grained random access
  – Need fine-grained checksum ranges, not per block (as in Btrfs, ZFS)
• Small atomicity guarantees (64-bit)
  – Need metadata replication to assist consistent updates
• Media errors cause machine check exceptions (MCEs)
  – Need awareness and mitigations
• Demands for DAX-mmap (no copy-on-write, no FS control)
  – Need awareness and lowering the protection level on demand

Recovery for snapshot metadata
• Snapshot metadata list resides in DRAM to reduce consistency overhead
• Clean unmount:
  – Finish background snapshot delete
  – Save snapshot lists to NVMM
• Power failure:
  – Snapshot transaction ID is persistent
  – Rebuild snapshot metadata lists during power failure recovery
NVMM (Meta)data corruptions
• Media errors
  – Detectable & correctable
    • Transparent to software
    • HW ECC corrects the error on read; software receives good data
  – Detectable & uncorrectable
    • Affect a contiguous range of data: the Poison Radius (PR), e.g. 512 bytes
    • Raise machine check exception (MCE)
    • Kernel mode: might panic; user mode: SIGBUS
  – Undetectable
    • HW ECC reports nothing on read; software may consume corrupted data
• Software scribbles
  – Kernel bugs or own bugs
  – Transparent to hardware
Metadata error detection and correction
• Use inode access as example
[Figure: flowchart. Read inode and inode’ with memcpy_mcsafe; an MCE on one copy goes through the poison radius of the inode pair (PR(inode), PR(inode’)) and reads the other copy, and an MCE on that copy too returns -EIO to user. Then verify inode csum and inode’ csum: both OK leads to memcmp(inode, inode’) with outcomes eq (continue) and neq; one fails: good inode and error inode are identified; both fail: -EIO to user]

File data error detection and correction
• Detect and repair both data and checksum errors
[Figure: flowchart. Read a strip with memcpy_mcsafe; on OK, calculate csum and check csum0 == csum1 and whether csum == csum0 or csum == csum1; the good csum and the error csum are judged by the majority of csum, csum0, csum1; on success, copy data to user. On an MCE or a checksum mismatch, read the other strips and the parity; if all are OK, repair the bad strip and verify csums; on failure or any MCE, return -EIO to user]