NOVA-Fortis: a Fault-Tolerant Non-Volatile Main Memory File
Total Page:16
File Type:pdf, Size:1020Kb
NOVA-Fortis: A Fault-Tolerant Non- Volatile Main Memory File System Jian Andiry Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson Non-Volatile Systems Laboratory Department of Computer Science and Engineering University of California, San Diego 1 Non-volatile Memory and DAX • Non-volatile main memory (NVMM) – PCM, STT-RAM, ReRAM, 3D XPoint technology – Reside on memory bus, load/store interface Application load/store load/store DRAM NVMM File system HDD / SSD 2 Non-volatile Memory and DAX • Non-volatile main memory (NVMM) – PCM, STT-RAM, ReRAM, 3D XPoint technology – Reside on memory bus, load/store interface Application • Direct Access (DAX) mmap() DAX-mmap() – DAX file I/O bypasses the page cache – DAX-mmap() maps NVMM pages to application DRAM NVMM address space directly and bypasses file system – “Killer app” copy HDD / SSD 3 Application expectations on NVMM File System DAX POSIX I/O Atomicity Fault Direct Speed Tolerance Access 4 ext4 xfs BtrFS F2FS DAX POSIX I/O Atomicity Fault Direct Speed Tolerance Access ✔ ❌ ✔❌ ❌ 5 PMFS ext4-DAX xfs-DAX DAX ✔ ❌ ❌ ✔✔ 6 Fault Direct Speed Strata SOSP ’17 DAX ✔✔ ❌ ❌ ✔ 7 Fault Direct Speed NOVA FAST ’16 DAX ✔ ✔ ❌ ✔ ✔ 8 Fault Direct Speed NOVA-Fortis DAX ✔ ✔✔✔ ✔ 9 Fault Direct Speed Challenges DAX 10 NOVA: Log-structured FS for NVMM • Per-inode logging – High concurrency Per-inode logging – Parallel recovery • High scalability Inode Head Tail – Per-core allocator, journal and inode table Inode log • Atomicity Data Data Data – Logging for single inode update – Journaling for update across logs – Copy-on-Write for file data Jian Xu and Steven Swanson, NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories, FAST ’16. 11 Snapshot 13 Snapshot support • Snapshot is essential for file system backup • Widely used in enterprise file systems – ZFS, Btrfs, WAFL • Snapshot is not available with DAX file systems 14 Snapshot for normal file I/O Current snapshot 012 write(0, 4K); take_snapshot(); write(0, 4K); File log 0 Page 0 1 Page 0 1 Page 0 2 Page 0 write(0, 4K); take_snapshot(); Data Data Data Data write(0, 4K); recover_snapshot(1); File write entry Data in snapshot Reclaimed data Current data 15 Memory Ordering With DAX-mmap() D V Valid D = 42; ? False ✓ Fence(); 42 False ✓ V = True; 42 True ✓ ? True ✗ • Recovery invariant: if V == True, then D is valid 16 Memory Ordering With DAX-mmap() D = 42; Application D V Fence(); DAX-mmap() V = True; NVMM Page 1 Page 3 • Recovery invariant: if V == True, then D is valid • D and V live in two pages of a mmap()’d region. 17 DAX Snapshot: Idea • Set pages read-only, then copy-on-write Applications: no file system intervention File system: DAX-mmap() File data: RO 18 DAX Snapshot: Incorrect implementation • Application invariant: if V is True, then D is valid Application Application NOVA Snapshot thread values snapshot values D = ?; D V D V V = False; ? F snapshot_begin(); set_read_only(page_d); ? D = 42;page fault 42 F copy_on_write(page_d); V = True; 42 T set_read_only(page_v); ? T snapshot_end(); ? T 19 DAX Snapshot: Correct implementation • Delay CoW page faults completion until all pages are read-only Application Application NOVA Snapshot thread values snapshot values D = ?; D V D V V = False; ? F snapshot_begin(); set_read_only(page_d); ? D = 42;page fault set_read_only(page_v); ? F snapshot_end(); ? F 42 F copy_on_write(page_d); V = True; 42 T copy_on_write(page_v); 20 Performance impact of snapshots • Normal execution vs. taking snapshots every 10s – Negligible performance loss through read()/write() – Average performance loss 3.7% through mmap() W/O snapshot W snapshot 1.2 Filebench (read/write) WHISPER (DAX-mmap()) 1 0.8 0.6 0.4 0.2 0 21 Protecting Metadata and Data 22 NVMM Failure Modes • Detectable errors – Media errors detected by NVMM controller Software: Receives MCE – Raises Machine Check Exception Detects uncorrectable errors (MCE) NVMM Ctrl.: Raises exception • Undetectable errors Read NVMM data: Media error – Media errors not detected by NVMM controller – Software scribbles 23 NVMM Failure Modes • Detectable errors – Media errors detected by NVMM controller Software: Consumes corrupted data – Raises Machine Check Exception (MCE) NVMM Ctrl.: Sees no error • Undetectable errors Read NVMM data: Media error – Media errors not detected by NVMM controller – Software scribbles 24 NVMM Failure Modes • Detectable errors – Media errors detected by NVMM controller Software: Bug code scribbles NVMM – Raises Machine Check Exception Write (MCE) NVMM Ctrl.: Updates ECC • Undetectable errors NVMM data: Scribble error – Media errors not detected by NVMM controller – Software scribbles 25 NOVA-Fortis Metadata Protection • Detection inode’ Head’ Tail’ csumcsum’’ H1’ T1’ – CRC32 checksums in all structures – Use memcpy_mcsafe() to catch inode Head Tail csum H1 T1 MCEs • Correction log ent1 c1 … entN cN – Replicate all metadata: inodes, logs, superblock, etc. log’ ent1’ c1’ … entN’ cN’ – Tick-tock: persist primary before updating replica Data 1 Data 2 26 NOVA-Fortis Data Protection inode’ Head’ Tail’ csumcsum’’ H1’ T1’ inode Head Tail csum H1 T1 • Metadata – CRC32 + replication for all structures log ent1 c1 … entN cN • Data c1 log’ ent1’ … entN’ cN’ ’ – RAID-4 style parity – Replicated checksums Data 1 Data 2 1 Block (8 stripes) S0 S1 S2 S3 S4 S5 S6 S7 P P = ⊕ S0..7 Ci = CRC32C(Si) Replicated 27 File data protection with DAX-mmap • Stores are invisible to the file systems • The file systems cannot protect mmap’ed data • NOVA-Fortis’ data protection contract: DAX NOVA-Fortis protects pages from media errors and scribbles iff they are not mmap()’d for writing. 28 File data protection with DAX-mmap • NOVA-Fortis logs mmap() operations User-space load/store load/store Applications: Kernel-space NOVA-Fortis: read/write mmap() NVDIMMs File data: protected unprotected File log: mmap log entry 29 File data protection with DAX-mmap • On munmap and during recovery, NOVA-Fortis restores protection User-space load/store munmap() Applications: Kernel-space NOVA-Fortis: read/write mmap() NVDIMMs Protection restored File data: File log: 30 File data protection with DAX-mmap • On munmap and during recovery, NOVA-Fortis restores protection User-space System Failure + Applications: recovery Kernel-space NOVA-Fortis: read/write mmap() NVDIMMs File data: File log: 31 Performance 32 Latency breakdown VFS alloc inode journaling memcpy_mcsafe memcpy_nocache append entry free old data calculate entry csum verify entry csum replicate inode replicate log verify data csum update data csum update data parity Create Append 4KB Overwrite 4KB Overwrite 512B Read 4KB Read 16KB 0 1 2 3 4 5 6 Latency (microsecond) 33 Latency breakdown VFS alloc inode journaling memcpy_mcsafe memcpy_nocache append entry free old data calculate entry csum verify entry csum replicate inode replicate log verify data csum update data csum update data parity Create Append 4KB Overwrite 4KB Overwrite 512B Read 4KB Read 16KB 0 1 2 3 4 5 6 Latency (microsecond) Metadata Protection 34 Latency breakdown VFS alloc inode journaling memcpy_mcsafe memcpy_nocache append entry free old data calculate entry csum verify entry csum replicate inode replicate log verify data csum update data csum update data parity Create Append 4KB Overwrite 4KB Overwrite 512B Read 4KB Read 16KB 0 1 2 3 4 5 6 Latency (microsecond) Metadata Protection Data Protection 35 Application performance Normalized throughput 1.2 1 0.8 0.6 0.4 Normalized throughput Normalized 0.2 0 Fileserver Varmail MongoDB SQLite TPCC Average ext4-DAX Btrfs NOVA w/ MP w/ MP+DP 36 Conclusion • Fault tolerance is critical for file system, but existing DAX file systems don’t provide it • We identify new challenges that NVMM file system fault tolerance poses • NOVA-Fortis provides fault tolerance with high performance – 1.5x on average to DAX-aware file systems without reliability features – 3x on average to other reliable file systems 37 Give a try https://github.com/NVSL/linux-nova 38 Thanks! 39 Backup slides 40 Hybrid DRAM/NVMM system • Non-volatile main memory (NVMM) – PCM, STT-RAM, ReRAM, 3D XPoint technology • File system for NVMM Host CPU NVMM FS DRAM NVMM 41 Disk-based file systems are inadequate for NVMM Ext4 Ext4 Ext4 • Ext4, xfs, Btrfs, F2FS, NILFS2 Atomicity Btrfs xfs wb order dataj • 1-Sector Built for hard disks and SSDs ✓ ✓ ✓ ✓ ✓ overwrite – Software overhead is high 1-Sector ✗ ✓ ✓ ✓ ✓ – CPU may reorder writes to NVMM append 1-Block ✗ ✗ ✓ ✓ ✗ – NVMM has different atomicity guarantees overwrite 1-Block ✗ ✓ ✓ ✓ ✓ append • Cannot exploit NVMM performance N-Block ✗ ✗ ✗ ✗ ✗ • write/append Performance optimization compromises N-Block ✗ ✓ ✓ ✓ ✓ consistency on system failure [1] prefix/append [1] Pillai et al, All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI '14. 42 NVMM file systems are not strongly consistent • BPFS, PMFS, Ext4-DAX, SCMFS, Aerie • None of them provide strong metadata and data consistency Metadata Data Mmap File system atomicity atomicity Atomicity [1] BPFS Yes Yes [2] No PMFS Yes No No Ext4-DAX Yes No No SCMFS No No No Aerie Yes No No NOVA Yes Yes Yes [1] Each msync() commits updates atomically. [2] In BPFS, write times are not updated atomically with respect to the write itself. 43 Why LFS? • Log-structuring provides cheaper atomicity than journaling and shadow paging • NVMM supports fast, highly concurrent random accesses – Using multiple logs does not negatively impact performance