NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
Jian Andiry Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson
Non-Volatile Systems Laboratory, Department of Computer Science and Engineering, University of California, San Diego

Non-volatile Memory and DAX
• Non-volatile main memory (NVMM)
  – PCM, STT-RAM, ReRAM, 3D XPoint technology
  – Resides on the memory bus with a load/store interface
• Direct Access (DAX)
  – DAX file I/O bypasses the page cache
  – DAX-mmap() maps NVMM pages directly into the application's address space, bypassing the file system entirely
  – The "killer app" for NVMM
[Figure: applications reach DRAM and NVMM directly with loads/stores, while HDD/SSD sits behind the file system]
Application expectations on an NVMM file system

• Applications expect five properties: POSIX I/O, atomicity, fault tolerance, direct access (DAX), and speed
• PMFS, ext4-DAX, and xfs-DAX provide POSIX I/O, direct access, and speed, but not atomicity or fault tolerance
• Strata (SOSP '17) adds atomicity but gives up direct access
• NOVA (FAST '16) provides POSIX I/O, atomicity, direct access, and speed, but not fault tolerance
• NOVA-Fortis provides all five
NOVA: Log-structured FS for NVMM

• Per-inode logging
  – High concurrency
  – Parallel recovery
• High scalability
  – Per-core allocator, journal, and inode table
• Atomicity
  – Logging for single-inode updates
  – Journaling for updates across logs
  – Copy-on-write for file data
[Figure: an inode with head and tail pointers into its per-inode log; log entries point to data pages]

Jian Xu and Steven Swanson, "NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories," FAST '16.

Snapshot
Snapshot support

• Snapshots are essential for file system backup
• Widely used in enterprise file systems: ZFS, Btrfs, WAFL
• Snapshots are not available in current DAX file systems
Snapshot for normal file I/O

• NOVA-Fortis maintains a current snapshot number, and each file write entry records it
• take_snapshot() advances the current snapshot; recover_snapshot(i) rolls the file back to snapshot i
[Figure: four write(0, 4K) calls interleaved with take_snapshot(); the file log tags the write entries with snapshot numbers 0, 1, 1, 2, distinguishing data held in a snapshot, reclaimed data, and current data]
Memory Ordering with DAX-mmap()

    D = 42;
    Fence();
    V = True;

• Recovery invariant: if V == True, then D is valid

    D    V      Valid?
    ?    False  ✓
    42   False  ✓
    42   True   ✓
    ?    True   ✗

• D and V live in two different pages of a DAX-mmap()'d region, and the application's stores are invisible to the file system
DAX Snapshot: Idea

• Set the mmap()'d pages read-only, then copy-on-write on the next store
• Applications need no file system intervention
[Figure: the file system marks DAX-mmap()'d file data read-only (RO)]
DAX Snapshot: Incorrect implementation

• Application invariant: if V is True, then D is valid
• Naïve approach: mark pages read-only one at a time, completing each CoW fault immediately

  Snapshot thread:  snapshot_begin(); set_read_only(page_d);
  Application:      D = 42;  → page fault → copy_on_write(page_d) captures D = ?
  Application:      V = True;  (page_v is still writable, so no fault)
  Snapshot thread:  set_read_only(page_v); snapshot_end();

• The snapshot holds D = ?, V = True: the invariant is violated
DAX Snapshot: Correct implementation

• Delay CoW page fault completion until all pages are read-only

  Snapshot thread:  snapshot_begin(); set_read_only(page_d);
  Application:      D = 42;  → page fault → blocks
  Snapshot thread:  set_read_only(page_v); snapshot_end();
  Application:      copy_on_write(page_d) completes; D = 42; then V = True; copy_on_write(page_v)

• The snapshot holds D = ?, V = False: the invariant holds
Performance impact of snapshots

• Normal execution vs. taking a snapshot every 10 seconds
  – Negligible performance loss through read()/write()
  – Average performance loss of 3.7% through mmap()
[Figure: normalized throughput, with and without snapshots, for Filebench (read/write) and WHISPER (DAX-mmap())]

Protecting Metadata and Data
NVMM Failure Modes

• Detectable errors
  – Media errors the NVMM controller detects
  – On an uncorrectable error, the controller raises an exception and software receives a Machine Check Exception (MCE)
• Undetectable errors
  – Media errors the NVMM controller does not detect: it sees no error, and software consumes corrupted data
  – Software scribbles: buggy code writes NVMM directly, and the controller simply updates the ECC, so the corruption looks like valid data
NOVA-Fortis Metadata Protection

• Detection
  – CRC32 checksums in all metadata structures (inodes, log entries)
  – memcpy_mcsafe() catches MCEs when reading
• Correction
  – Replicate all metadata: inodes, logs, superblock, etc.
  – Tick-tock: persist the primary before updating the replica, so one copy is always consistent
[Figure: an inode and its log, each structure checksummed, with a full replica of both]
NOVA-Fortis Data Protection

• Metadata: CRC32 + replication for all structures, as above
• File data
  – RAID-4 style parity: each 4 KB block is divided into 8 stripes S0..S7, with parity P = S0 ⊕ S1 ⊕ … ⊕ S7
  – Per-stripe checksums Ci = CRC32C(Si), stored in two replicas
File data protection with DAX-mmap()
• Stores to mmap()'d pages are invisible to the file system, so it cannot protect mmap()'d data
• NOVA-Fortis' data protection contract: pages are protected from media errors and scribbles if and only if they are not mmap()'d for writing
• NOVA-Fortis logs mmap() operations, so it knows which file pages are unprotected
[Figure: an mmap log entry in the file log marks the mmap()'d portion of file data as unprotected while applications access it with loads and stores]
• On munmap(), and during recovery after a system failure, NOVA-Fortis restores protection for the previously mmap()'d pages
[Figure: after munmap() or crash recovery, the file data covered by the mmap log entry is protected again]
Performance

Latency breakdown

[Figure: per-operation latency (0–6 µs) for create, append 4 KB, overwrite 4 KB, overwrite 512 B, read 4 KB, and read 16 KB, broken down into VFS, inode allocation, journaling, memcpy_mcsafe, memcpy_nocache, log-entry append, freeing old data, entry checksum calculation and verification, inode and log replication, and data checksum and parity updates; the metadata protection and data protection components are highlighted]

Application performance
[Figure: normalized throughput of Fileserver, Varmail, MongoDB, SQLite, TPCC, and the average on ext4-DAX, Btrfs, NOVA, NOVA-Fortis with metadata protection (w/ MP), and with metadata + data protection (w/ MP+DP)]

Conclusion
• Fault tolerance is critical for file systems, but existing DAX file systems don't provide it
• We identify new challenges that NVMM poses for file system fault tolerance
• NOVA-Fortis provides fault tolerance with high performance
  – 1.5x on average relative to DAX-aware file systems without reliability features
  – 3x on average relative to other reliable file systems
Give it a try: https://github.com/NVSL/linux-nova

Thanks!

Backup slides
Hybrid DRAM/NVMM system

• Non-volatile main memory (NVMM): PCM, STT-RAM, ReRAM, 3D XPoint technology
• A file system for NVMM runs on the host
[Figure: host CPU with both DRAM and NVMM; the NVMM file system manages the NVMM]
Disk-based file systems are inadequate for NVMM

• Ext4, xfs, Btrfs, F2FS, NILFS2: built for hard disks and SSDs
  – Software overhead is high
  – The CPU may reorder writes to NVMM
  – NVMM has different atomicity guarantees
• They cannot exploit NVMM performance
• Performance optimization compromises consistency on system failure [1]

  Atomicity               Ext4-wb  Ext4-order  Ext4-dataj  Btrfs  xfs
  1-sector overwrite      ✓        ✓           ✓           ✓      ✓
  1-sector append         ✗        ✓           ✓           ✓      ✓
  1-block overwrite       ✗        ✗           ✓           ✓      ✗
  1-block append          ✗        ✓           ✓           ✓      ✓
  N-block write/append    ✗        ✗           ✗           ✗      ✗
  N-block prefix/append   ✗        ✓           ✓           ✓      ✓

[1] Pillai et al., "All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications," OSDI '14.

NVMM file systems are not strongly consistent
• BPFS, PMFS, Ext4-DAX, SCMFS, Aerie
• None of them provide strong metadata and data consistency

  File system  Metadata atomicity  Data atomicity  Mmap atomicity [1]
  BPFS         Yes                 Yes [2]         No
  PMFS         Yes                 No              No
  Ext4-DAX     Yes                 No              No
  SCMFS        No                  No              No
  Aerie        Yes                 No              No
  NOVA         Yes                 Yes             Yes

[1] Each msync() commits updates atomically.
[2] In BPFS, write times are not updated atomically with respect to the write itself.

Why LFS?
• Log-structuring provides cheaper atomicity than journaling and shadow paging
• NVMM supports fast, highly concurrent random accesses
  – Using multiple logs does not hurt performance
  – The log does not need to be contiguous
• NOVA rethinks and redesigns log-structuring entirely
Atomicity

• Log-structuring for single-log updates
  – write, msync, chmod, etc.
  – Strictly commit the log entry to NVMM before updating the log tail
• Lightweight journaling for updates across logs
  – unlink, rename, etc.
  – Journal the log tails instead of metadata or data
• Copy-on-write for file data
  – The log contains only metadata, so it stays short
[Figure: a directory log and a file log; the journal records both tails so a cross-log operation commits atomically; file writes go to new data pages]
Performance

• Per-inode logging allows for high concurrency
• Split data structures between DRAM and NVMM
  – The persistent log is simple and efficient
  – The volatile tree structure has no consistency overhead
[Figure: a radix tree in DRAM indexes the file log in NVMM, whose entries point to data pages]
NOVA layout

• The allocator lives in DRAM
• High scalability
  – Per-CPU NVMM free list, journal, and inode table
  – Concurrent transactions and allocation/deallocation
[Figure: per-CPU free lists in DRAM; superblock, recovery inode, and per-CPU journals and inode tables in NVMM; each inode holds head and tail pointers into its log]
Fast garbage collection

• The log is a linked list and contains only metadata
• Fast GC unlinks log pages that hold no valid entries from the list
• No copying
[Figure: a dead log page is removed from the linked list between head and tail]
Thorough garbage collection

• Starts when valid log entries fall below 50% of the log length
• Format a new log and atomically replace the old one
• Only metadata is copied
[Figure: valid entries are copied into a fresh log that atomically replaces the old one]
Recovery

• Rebuild DRAM structures
  – Allocator
  – Lazy rebuild: postpone rebuilding inode radix trees, which accelerates recovery and reduces DRAM consumption
• Normal shutdown recovery
  – Store the allocator state in the recovery inode; no log scanning
• Failure recovery
  – Logs are short
  – Per-CPU recovery threads scan the logs in parallel
  – Failure recovery bandwidth: > 400 GB/s
Snapshot for normal file I/O

• Each file write entry carries an epoch ID; a snapshot covers an epoch range, e.g. snapshot 0 covers [0, 1) and snapshot 1 covers [1, 2)
• Deleting a snapshot reclaims the data pages reachable only from it
[Figure: a file log whose write entries carry epoch IDs 0–2; deleting snapshot 0 reclaims its private data pages, while data still reachable from snapshot 1 or the current state survives]
Corrupt Snapshots with DAX-mmap()

• Recovery invariant: if V == True, then D is valid
• Incorrect: naïvely mark pages read-only one at a time
[Figure: timeline — page D is marked read-only; the application's D = 5 faults and the CoW completes immediately; V = True then lands before page V is read-only, so the snapshot holds D = ? and V = True: corrupt]

Consistent Snapshots with DAX-mmap()
• Recovery invariant: if V == True, then D is valid
• Correct: delay CoW page fault completion until all pages are read-only
[Figure: timeline — the CoW fault on page D waits until both pages are read-only; the snapshot holds D = ? and V = False: consistent]

Snapshot-related latency
[Figure: latency (0–10 µs) of snapshot creation, snapshot deletion, and a 4 KB CoW page fault, broken down into snapshot manifest initialization, manifest combining, radix tree locking, superblock sync, marking pages read-only, mapping changes, and memcpy_nocache]
Defense Against Scribbles

• Tolerating larger scribbles
  – Allocate replicas far from one another
  – NOVA metadata can tolerate scribbles of hundreds of megabytes
• Preventing scribbles
  – Mark all NVMM read-only; disable CPU write protection only while accessing NVMM
  – This exposes all kernel data to bugs in only a very small section of NOVA code
NVMM Failure Modes: Media Failures and Scribbles

• Media errors
  – Detectable & correctable: the NVMM controller detects and corrects the error; software consumes good data
  – Detectable & uncorrectable: the controller raises an exception with a poison radius (PR), e.g. 512 bytes; software receives an MCE
  – Undetectable: the controller sees no error; software consumes corrupted data
• Software "scribbles"
  – Kernel bugs or NOVA's own bugs, transparent to hardware: the write simply updates the ECC
  – NVMM file systems are highly vulnerable to them
File operation latency

[Figure: latency (0–20 µs) of create, append 4 KB, overwrite 4 KB, overwrite 512 B, and read 4 KB on xfs-DAX, PMFS, ext4-DAX, ext4-dataj, Btrfs, NOVA, and the NOVA-Fortis variants (w/ MP, w/ MP+WP, w/ MP+DP, w/ MP+DP+WP, relaxed mode)]
Random R/W bandwidth on NVDIMM-N

[Figure: 4 KB random read (up to ~30 GB/s) and write (up to ~14 GB/s) bandwidth vs. thread count (1–16) on NVDIMM-N for xfs-DAX, PMFS, ext4-DAX, ext4-dataj, Btrfs, NOVA, w/ MP, w/ MP+DP, and relaxed mode]
Scribble size and metadata bytes at risk

[Figure: metadata pages at risk (log scale, ~1.5e-5 to 16K) vs. scribble size (1 B to 256 MB) for no replication, simple replication, two-way replication, and dead-zone replication, each in worst-case and average variants]
Storage overhead

  Primary inodes   0.1%    Primary logs    2.0%
  Replica inodes   0.1%    Replica logs    2.0%
  File data       82.4%    File checksums  1.6%
  File parity     11.1%    Unused          0.8%
Latency breakdown (with write protection)

[Figure: per-operation latency (0–9 µs) for create, append 4 KB, overwrite 4 KB, overwrite 512 B, read 4 KB, and read 16 KB, with the same breakdown as before plus write protection; the metadata protection, data protection, and scribble prevention components are highlighted]

Application performance on NOVA-Fortis
NVDIMM-N
[Figure: normalized throughput of Fileserver, Varmail, Webproxy, Webserver, RocksDB, MongoDB, Exim, SQLite, TPCC, and the average (baselines 27k–692k ops/second) on xfs-DAX, PMFS, ext4-DAX, ext4-dataj, Btrfs, NOVA, w/ MP, w/ MP+DP, and relaxed mode]
PCM
[Figure: the same workloads and file systems on emulated PCM, with throughput normalized to NVDIMM-N]