<<

NOVA: A High-Performance, Fault Tolerant for Non-Volatile Main Memories Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (), Steven Swanson

Non-Volale Systems Laboratory Department of Computer Science and Engineering University of California, San Diego

1 NVMM File System Requirements • Legacy File IO Acceleraon – fast and easy – Run exisng IO-intensive apps on NVDIMMs – “just works” • Strong atomicity • DAX Mmap – Load-store access & complex data structures – More control/responsibility for programs • Data protecon – Don’t trust the media or other soware – Support • DAX High performance – Otherwise, why bother?

2 XFS F2FS NILFS

✔ ❌ ✔❌DAX ❌ 3

BPFS SCMFS PMFS Aerie EXT4-DAX XFS-DAX M1FS

✔ ❌ ❌ ✔DAX ✔ 4

NOVA

✔✔✔✔DAX ✔ 5 Files, Atomicity, and Performance

6 Log Structured FS For NVMM

Per- logging

• A Nova FS is a of logs Inode Head Tail – One log per i-node – Logs are not conguous Inode log • I-nodes point to the head and tail Commied entry Uncommied entry

7 Atomicity: Logging for Simple Metadata Operations • Combines log-structuring, journaling and copy-on-

• Log-structuring for single log Tail Tail update File log – Write, msync, , etc – Lower overhead than journaling and shadow paging

8 Atomicity: Lightweight Journaling for Complex Metadata Operations • Lightweight journaling for update across logs TailTail – Unlink, rename, etc Directory log – Journal log tails instead of

metadata or data TailTail File log

Dir tail Journal File tail

9 Atomicity: Copy-on-write for file data • Copy-on-write for file data – Log only contains metadata

– Log is short Tail Tail File log

Data 0 Data 1 Data 1 Data 2

10 Atomicity: DAX

• Nova does not make atomicity guarantees for DAX mmap()’d data • msync() works, but it’s slow & non-atomic • The program must ensure consistency – ISA Support: CLWB, clflush, etc. DAX – Careful programming

11 DRAM Indexes for High Performance

• Per-inode logging allows for Radix tree high concurrency 0 1 2 3 DRAM • Split data structure between NVMM Tail DRAM and NVMM

– Persistent log is simple and File log efficient Data 0 Data 1 Data 2 – Volale tree structure has no consistency overhead

12 Performance and Scalability

• Put allocator in DRAM CPU 0 CPU 1 • High scalability Free list Free list – Per-CPU NVMM free list, DRAM journal and inode table NVMM Journal Journal – Concurrent transacons Super Inode table Inode table and allocaon/deallocaon Recovery inode Inode Head Tail

Inode log

13 Fast garbage collection • Log is a linked list • Log only contains Tail metadata Head • Fast GC deletes dead log pages from the linked list • No copying Vaild log entry Invalid log entry

14 Thorough garbage collection • Starts if valid log entries < 50% log length • Format a new log and atomically replace the old one

• Only copy metadata Tail Head

Vaild log entry Invalid log entry

15 Recovery

• Normal shutdown CPU 0 CPU 1 recovery: Free list Free list – Store allocator state in recovery inode DRAM NVMM – Constant me startup Journal Journal • Failure recovery: Super Inode table Inode table block – Parallel scan Recovery inode Recovery Recovery – Failure recovery bandwidth: thread > 400 GB/s

16 Operation Latency 30 Ext4-datajournal Ext4-DAX -DAX m1fs 25 NOVA • Intel PM Emulaon Plaorm – Emulates different NVM 20 characteriscs 15 – Emulates clwb/clflush latency • NOVA provides low latency 10 Latency (microsecond) atomicity 5

0 Create Append (4KB) Delete 17 Filebench throughput

450 Ext4-datajournal Ext4-DAX 400 xfs-DAX m1fs NOVA 350 300 • NOVA achieves high 250 performance with 200 strong data consistency 150

Ops per second (x1000) 100 50 0 Fileserver Varmail Webproxy Webserver 18 DAX, , and Data Protecon

DAX 19 DAX Challenges • DAX mmap() gives the programmer great power – Fine-grain updates – Pointer-based data structures – Custom data layout • …along with great responsibilies. – Data structure consistency/persistence ordering DAX – Memory protecon/media error management • The file system must not interfere.

20 Snapshots for Normal File Access

Current epoch 012

Snapshot 0 Snapshot 1

File log 0 1 1 Page 1 1 Page 1 2 Page 1

Data Data Data Data

Snapshot entry File write entry Epoch ID Data in snapshot Reclaimed data Current data 21 Memory Ordering With DAX mmap()

D = 42; V = true;

• D and V live in two pages of a mmap()’d region. • Recovery invariant: if V == True, then D is valid

22 Corrupt Snapshots with DAX-mmap() • Recovery invariant: if V == True, then D is valid – Incorrect: Naïvely mark pages -only one-at-a-me

V = Applicaon: D = 1; True; Snapshot

Page hosng D: ? 1 ?

Page hosng V: False True T

Time Snapshot

Page Copy on Value R/W RO Fault Write Change 23 Consistent Snapshots with DAX-mmap() • Recovery invariant: if V == True, then D is valid – Correct: Block page faults unl all pages are read-only

V = Applicaon: D = 1; True; Snapshot

Page hosng D: ? 1 ?

Page hosng V: False True F

Time Snapshot

Page Copy on Value R/W RO RO Fault Write Change Blocking 24 Performance impact of snapshots • Normal execuon vs. taking snapshots every 10s – Negligible performance loss through read()/write() – Average performance loss 6.2% through mmap()

Convenonal workloads NVMM-aware workloads from WHISPER 25 Protecng Metadata

26 NVMM Failure Modes: Media Failures • Media errors – Detectable & correctable • Transparent to soware – Detectable & uncorrectable Soware: Consumes good data • Affect a conguous range of data • Raise machine check excepon (MCE) NVMM Ctrl.: Detects & corrects errors – Undetectable Read NVMM data: • May consume corrupted data Media error • Soware scribbles – Kernel bugs or own bugs – Transparent to hardware

27 NVMM Failure Modes : Media Failures • Media errors – Detectable & correctable • Transparent to soware – Detectable & uncorrectable Soware: Receives MCE • Affect a conguous range of data • Raise machine check excepon (MCE) NVMM Ctrl.: Detects uncorrectable errors

Raises excepon Read – Undetectable NVMM data: • May consume corrupted data Media error & Poison Radius • Soware scribbles (PR) – Kernel bugs or own bugs e.g. 512 – Transparent to hardware

28 NVMM Failure Modes : Media Failures • Media errors – Detectable & correctable • Transparent to soware – Detectable & uncorrectable Soware: Consumes corrupted data • Affect a conguous range of data • Raise machine check excepon (MCE) NVMM Ctrl.: Sees no error – Undetectable Read NVMM data: • Consume corrupted data Media error • Soware scribbles – Kernel bugs or own bugs – Transparent to hardware

29 NVMM Failure Modes: Scribbles • Media errors – Detectable & correctable • Transparent to soware – Detectable & uncorrectable Soware: Bug code scribbles NVMM • Affect a conguous range of data Write • Raise machine check excepon (MCE) NVMM Ctrl.: Updates ECC – Undetectable NVMM data: • Consume corrupted data Scribble error • Soware “scribbles” – Kernel bugs or NOVA bugs – NVMM file systems are highly vulnerable

30 NVMM Failure Modes: Scribbles • Media errors – Detectable & correctable • Transparent to soware – Detectable & uncorrectable Soware: Consumes corrupted data • Affect a conguous range of data • Raise machine check excepon (MCE) NVMM Ctrl.: Sees no error – Undetectable Read NVMM data: • Consume corrupted data Scribble error • Soware “scribbles” – Kernel bugs or NOVA bugs – NVMM file systems are highly vulnerable

31 NOVA Metadata Protection • Replicate everything inode’ Head’ Tail’ csumcsum’’ H1’ T1’ – inode – Logs Head Tail csum H1 T1 – Superblock

– … ent1 c1 … entN cN • CRC32 everywhere

ent1’ c1’ … entN’ cN’

Data 1 Data 2

32 Defense Against Scribbles • Tolerang Larger Scribbles – Allocate replicas far from one another – Can tolerate arbitrarily large scribbles to metadata. • Prevenng scribbles – Mark all NVMM as read-only – Disable CPU write protecon while accessing NVMM

33 Protecng Data

DAX

34 NOVA Data Protection • Divide 4KB blocks into 512- stripes • Compute a RAID 5-style parity stripe • Compute and replicate checksums for each stripe

1 Block 512-Byte stripe segments

S0 S1 S2 S3 S4 S5 S6 S7 P P = ⊕ S0..7

Ci = CRC32C(Si) Replicated 35 Data Protection With DAX mmap()

• File systems cannot efficiently protect mmap’ed data, since stores are invisible • NOVA’s data protecon contract:

NOVA protects pages from media errors DAX and scribbles iff they are not mmap()’d for wring.

36 File data protection with DAX-mmap • NOVA logs mmap() operaons

User-space load/store load/store Applicaons:

Kernel-space read(), mmap() NOVA: write()

NVDIMMs File data: protected unprotected File log: mmap log entry 38 File data protection with DAX-mmap • On unmap and during recovery, NOVA restores protecon

User-space load/store munmap() Applicaons:

Kernel-space read(), mmap() NOVA: write()

NVDIMMs Protecon restored File data:

File log:

39 File data protection with DAX-mmap • On unmap and during recovery, NOVA restores protecon

User-space System Failure + Applicaons: recovery Kernel-space read(), mmap() NOVA: write()

NVDIMMs File data:

File log:

40 Performance

41 Metadata Protecon O(1)

Data Protecon O(N)

42 Storage Utilization

Primary log: 2.00% Redundancy Primary inode: 0.10% 14.76% File data: 82.4%

Replica inode: 0.10% Replica log: 2.00% Unused File : 1.56% 0.75% File parity: 11.1%

43 Performance Cost of

1.2

1

0.8

0.6

0.4

0.2

0 Fileserver Varmail Webproxy Webserver RocksDB MongoDB Exim TPCC average ext4-DAX ext4-dataj Fors baseline NOVA w/ MP+WP w/ MP+DP+WP 44 Conclusion • Exisng file systems do not meet the requirements of applicaons on NVMM file systems • NOVA’s mul-log design achieves high performance and strong consistency • NOVA’s data protecon features ensure data integrity • NOVA outperforms exisng file systems while providing stronger consistency and data protecon guarantees

45 NOVA is source. We are preparing it for addion to .

To help or try it out: hps://github.com/ NVSL/linux-nova

46 Thanks!

47