<<

NOVA-Fortis: A Fault-Tolerant Non- Volatile Main Memory

Jian Andiry Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (), Steven Swanson

Non-Volatile Systems Laboratory Department of Computer Science and Engineering University of California, San Diego 1 Non-volatile Memory and DAX

• Non-volatile main memory (NVMM) – PCM, STT-RAM, ReRAM, 3D XPoint technology – Reside on memory bus, load/store interface Application

load/store load/store

DRAM NVMM

File system

HDD / SSD

2 Non-volatile Memory and DAX

• Non-volatile main memory (NVMM) – PCM, STT-RAM, ReRAM, 3D XPoint technology – Reside on memory bus, load/store interface Application • Direct Access (DAX) mmap() DAX-mmap() – DAX file I/O bypasses the cache – DAX-mmap() maps NVMM pages to application DRAM NVMM address space directly and bypasses file system – “Killer app” copy

HDD / SSD

3 Application expectations on NVMM File System

DAX

POSIX I/O Atomicity Fault Direct Speed Tolerance Access

4 F2FS

DAX

POSIX I/O Atomicity Fault Direct Speed Tolerance Access

✔ ❌ ✔❌ ❌ 5 PMFS ext4-DAX xfs-DAX

DAX ✔ ❌ ❌ ✔✔ 6 Fault Direct Speed Strata SOSP ’17

DAX ✔✔ ❌ ❌ ✔ 7 Fault Direct Speed NOVA FAST ’16

DAX ✔ ✔ ❌ ✔ ✔ 8 Fault Direct Speed NOVA-Fortis

DAX ✔ ✔✔✔ ✔ 9 Fault Direct Speed Challenges

DAX

10 NOVA: Log-structured FS for NVMM

• Per- logging – High concurrency Per-inode logging – Parallel recovery • High scalability Inode Head Tail – Per-core allocator, journal and inode table Inode log • Atomicity Data Data Data – Logging for single inode update – Journaling for update across logs – Copy-on- for file data

Jian Xu and Steven Swanson, NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories, FAST ’16. 11 Snapshot

13 Snapshot support

• Snapshot is essential for file system backup • Widely used in enterprise file systems – ZFS, Btrfs, WAFL

• Snapshot is not available with DAX file systems

14 Snapshot for normal file I/O

Current snapshot 012 write(0, 4K); take_snapshot(); write(0, 4K); File log 0 Page 0 1 Page 0 1 Page 0 2 Page 0 write(0, 4K); take_snapshot(); Data Data Data Data write(0, 4K); recover_snapshot(1); File write entry Data in snapshot Reclaimed data Current data

15 Memory Ordering With DAX-mmap()

D V Valid D = 42; ? False ✓ Fence(); 42 False ✓ V = True; 42 True ✓ ? True ✗

• Recovery invariant: if V == True, then D is valid

16 Memory Ordering With DAX-mmap()

D = 42; Application D V Fence(); DAX-mmap() V = True; NVMM Page 1 Page 3

• Recovery invariant: if V == True, then D is valid • D and V live in two pages of a mmap()’d region.

17 DAX Snapshot: Idea

• Set pages -only, then copy-on-write

Applications:

no file system intervention File system: DAX-mmap()

File data: RO

18 DAX Snapshot: Incorrect implementation

• Application invariant: if V is True, then D is valid

Application Application NOVA Snapshot values snapshot values

D = ?; D V D V V = False;

? F snapshot_begin(); set_read_only(page_d); ? D = 42;page fault 42 F copy_on_write(page_d);

V = True; 42 T set_read_only(page_v); ? T snapshot_end(); ? T

19 DAX Snapshot: Correct implementation

• Delay CoW page faults completion until all pages are read-only

Application Application NOVA Snapshot thread values snapshot values

D = ?; D V D V V = False;

? F snapshot_begin(); set_read_only(page_d); ? D = 42;page fault

set_read_only(page_v); ? F snapshot_end(); ? F 42 F copy_on_write(page_d); V = True; 42 T copy_on_write(page_v);

20 Performance impact of snapshots

• Normal execution vs. taking snapshots every 10s – Negligible performance loss through read()/write() – Average performance loss 3.7% through mmap() W/O snapshot W snapshot

1.2 Filebench (read/write) WHISPER (DAX-mmap()) 1 0.8 0.6 0.4 0.2 0

21 Protecting Metadata and Data

22 NVMM Failure Modes

• Detectable errors – Media errors detected by NVMM controller Software: Receives MCE – Raises Machine Check Exception Detects uncorrectable errors (MCE) NVMM Ctrl.: Raises exception • Undetectable errors Read NVMM data: Media error – Media errors not detected by NVMM controller – Software scribbles

23 NVMM Failure Modes

• Detectable errors – Media errors detected by NVMM controller Software: Consumes corrupted data – Raises Machine Check Exception (MCE) NVMM Ctrl.: Sees no error • Undetectable errors Read NVMM data: Media error – Media errors not detected by NVMM controller – Software scribbles

24 NVMM Failure Modes

• Detectable errors – Media errors detected by NVMM controller Software: Bug code scribbles NVMM – Raises Machine Check Exception Write (MCE) NVMM Ctrl.: Updates ECC • Undetectable errors NVMM data: Scribble error – Media errors not detected by NVMM controller – Software scribbles

25 NOVA-Fortis Metadata Protection

• Detection inode’ Head’ Tail’ csumcsum’’ H1’ T1’ – CRC32 in all structures – Use memcpy_mcsafe() to catch inode Head Tail csum H1 T1 MCEs

• Correction log ent1 c1 … entN cN – Replicate all metadata: , logs, superblock, etc. log’ ent1’ c1’ … entN’ cN’ – Tick-tock: persist primary before updating replica

Data 1 Data 2

26 NOVA-Fortis Data Protection

inode’ Head’ Tail’ csumcsum’’ H1’ T1’

inode Head Tail csum H1 T1 • Metadata – CRC32 + for all structures log ent1 c1 … entN cN

• Data c1 log’ ent1’ … entN’ cN’ ’ – RAID-4 style parity

– Replicated checksums Data 1 Data 2

1 (8 stripes)

S0 S1 S2 S3 S4 S5 S6 S7 P P = ⊕ S0..7

Ci = CRC32C(Si) Replicated 27 File data protection with DAX-mmap

• Stores are invisible to the file systems • The file systems cannot protect mmap’ed data • NOVA-Fortis’ data protection contract: DAX

NOVA-Fortis protects pages from media errors and scribbles iff they are not mmap()’d for writing.

28 File data protection with DAX-mmap

• NOVA-Fortis logs mmap() operations

User-space load/store load/store Applications:

Kernel-space NOVA-Fortis: read/write mmap()

NVDIMMs File data: protected unprotected File log: mmap log entry

29 File data protection with DAX-mmap

• On munmap and during recovery, NOVA-Fortis restores protection

User-space load/store munmap() Applications:

Kernel-space NOVA-Fortis: read/write mmap()

NVDIMMs Protection restored File data:

File log:

30 File data protection with DAX-mmap

• On munmap and during recovery, NOVA-Fortis restores protection

User-space System Failure + Applications: recovery Kernel-space NOVA-Fortis: read/write mmap()

NVDIMMs File data:

File log:

31 Performance

32 Latency breakdown

VFS alloc inode journaling memcpy_mcsafe memcpy_nocache append entry free old data calculate entry csum verify entry csum replicate inode replicate log verify data csum update data csum update data parity

Create

Append 4KB

Overwrite 4KB

Overwrite 512B

Read 4KB

Read 16KB

0 1 2 3 4 5 6 Latency (microsecond)

33 Latency breakdown

VFS alloc inode journaling memcpy_mcsafe memcpy_nocache append entry free old data calculate entry csum verify entry csum replicate inode replicate log verify data csum update data csum update data parity

Create

Append 4KB

Overwrite 4KB

Overwrite 512B

Read 4KB

Read 16KB

0 1 2 3 4 5 6 Latency (microsecond) Metadata Protection 34 Latency breakdown

VFS alloc inode journaling memcpy_mcsafe memcpy_nocache append entry free old data calculate entry csum verify entry csum replicate inode replicate log verify data csum update data csum update data parity

Create

Append 4KB

Overwrite 4KB

Overwrite 512B

Read 4KB

Read 16KB

0 1 2 3 4 5 6 Latency (microsecond) Metadata Protection Data Protection 35 Application performance

Normalized throughput 1.2

1

0.8

0.6

0.4 Normalized throughput Normalized 0.2

0 Fileserver Varmail MongoDB SQLite TPCC Average

ext4-DAX Btrfs NOVA w/ MP w/ MP+DP 36 Conclusion

• Fault tolerance is critical for file system, but existing DAX file systems don’t provide it • We identify new challenges that NVMM file system fault tolerance poses • NOVA-Fortis provides fault tolerance with high performance – 1.5x on average to DAX-aware file systems without reliability features – 3x on average to other reliable file systems

37 Give a try

https://github.com/NVSL/linux-nova

38 Thanks!

39 Backup slides

40 Hybrid DRAM/NVMM system

• Non-volatile main memory (NVMM) – PCM, STT-RAM, ReRAM, 3D XPoint technology • File system for NVMM Host

CPU

NVMM FS

DRAM NVMM

41 Disk-based file systems are inadequate for NVMM

Ext4 Ext4 Ext4 • Ext4, xfs, Btrfs, F2FS, NILFS2 Atomicity Btrfs xfs wb order dataj • 1-Sector Built for hard disks and SSDs ✓ ✓ ✓ ✓ ✓ overwrite – Software overhead is high 1-Sector ✗ ✓ ✓ ✓ ✓ – CPU may reorder writes to NVMM append 1-Block ✗ ✗ ✓ ✓ ✗ – NVMM has different atomicity guarantees overwrite 1-Block ✗ ✓ ✓ ✓ ✓ append • Cannot exploit NVMM performance N-Block ✗ ✗ ✗ ✗ ✗ • write/append Performance optimization compromises N-Block ✗ ✓ ✓ ✓ ✓ consistency on system failure [1] prefix/append

[1] Pillai et al, All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI '14. 42 NVMM file systems are not strongly consistent

• BPFS, PMFS, Ext4-DAX, SCMFS, Aerie • None of them provide strong metadata and data consistency

Metadata Data Mmap File system atomicity atomicity Atomicity [1] BPFS Yes Yes [2] No PMFS Yes No No Ext4-DAX Yes No No SCMFS No No No Aerie Yes No No NOVA Yes Yes Yes

[1] Each msync() commits updates atomically. [2] In BPFS, write times are not updated atomically with respect to the write itself. 43 Why LFS?

• Log-structuring provides cheaper atomicity than journaling and shadow paging • NVMM supports fast, highly concurrent random accesses – Using multiple logs does not negatively impact performance – Log does not need to be contiguous

• Rethink and redesign log-structuring entirely

44 Atomicity

• Log-structuring for single log update – Write, msync, , etc – Strictly commit log entry to NVMM TailTail before updating log tail Directory log • Lightweight journaling for update across logs Tail TailTail – Unlink, rename, etc – Journal log tails instead of metadata or data File log

• Copy-on-write for file data Dir tail – Log only contains metadata Journal – Log is short File tail

45 Atomicity

• Log-structuring for single log update – Write, msync, chmod, etc – Strictly commit log entry to NVMM Tail before updating log tail Directory log • Lightweight journaling for update across logs Tail Tail – Unlink, rename, etc – Journal log tails instead of metadata or data File log

• Copy-on-write for file data Data 0 Data 1 Data 1 Data 2 – Log only contains metadata – Log is short

46 Performance

• Per-inode logging allows for high concurrency Tail Directory log • Split data structure between Tail DRAM and NVMM – Persistent log is simple and File log efficient Data 0 Data 1 Data 2 – Volatile structure has no consistency overhead

47 Performance

• Per-inode logging allows for Radix tree high concurrency 0 1 2 3 DRAM • Split data structure between NVMM Tail DRAM and NVMM – Persistent log is simple and File log efficient Data 0 Data 1 Data 2 – Volatile tree structure has no consistency overhead

48 NOVA layout

• Put allocator in DRAM CPU 0 CPU 1 • High scalability Free list Free list – Per-CPU NVMM free list, DRAM journal and inode table NVMM Journal Journal – Concurrent transactions Super Inode table Inode table and allocation/deallocation block Recovery inode Inode Head Tail

Inode log

49 Fast garbage collection

• Log is a linked list • Log only contains Tail metadata Head • Fast GC deletes dead log pages from the linked list • No copying Vaild log entry Invalid log entry

50 Thorough garbage collection

• Starts if valid log entries < 50% log length • Format a new log and atomically replace the old one • Only copy metadata Tail Head

Vaild log entry Invalid log entry

51 Recovery

• Rebuild DRAM structure – Allocator CPU 0 CPU 1 – Lazy rebuild: postpones inode radix tree rebuild Free list Free list • Accelerates recovery • Reduces DRAM consumption DRAM • Normal shutdown recovery: NVMM – Store allocator in recovery inode Journal Journal – No log scanning Super Inode table Inode table block • Failure recovery: Recovery – Log is short inode Recovery Recovery thread thread – Parallel scan – Failure recovery bandwidth: > 400 GB/s

52 Snapshot for normal file I/O

Current snapshot 012 Delete snapshot 0;

Snapshot 0 [0, 1) Snapshot 1 [1, 2)

File log 0 Page 1 1 Page 1 1 Page 1 2 Page 1

Data Data Data Data

Snapshot entry File write entry Epoch ID Data in snapshot Reclaimed data Current data

53 Corrupt Snapshots with DAX-mmap()

• Recovery invariant: if V == True, then D is valid – Incorrect: Naïvely mark pages read-only one-at-a-time

Application: D = 5; V = True; Snapshot

Page hosting D: ? 5 ? Corrupt

Page hosting V: False True T

Timeline: Snapshot

Page Copy on Value R/W RO Fault Write Change 54 Consistent Snapshots with DAX-mmap()

• Recovery invariant: if V == True, then D is valid – Correct: Delay CoW page faults completion until all pages are read- only Application: D = 5; V = True; Snapshot

Page hosting D: ? 5 ? Consistent

Page hosting V: False True F

Timeline: Snapshot

Page Copy on Value R/W RO RO Fault Write Change Waiting 55 Snapshot-related latency

snapshot manifest init combine manifests radix tree locking superblock mark pages read-only change mapping memcpy_nocache

Snapshot creation

Snapshot deletion

CoW page fault (4KB)

0 1 2 3 4 5 6 7 8 9 10 Latency (microsecond)

56 Defense Against Scribbles

• Tolerating Larger Scribbles – Allocate replicas far from one another – NOVA metadata can tolerate scribbles of 100s of MB • Preventing scribbles – Mark all NVMM as read-only – Disable CPU write protection while accessing NVMM – Exposes all kernel data to bugs in a very small section of NOVA code.

57 NVMM Failure Modes: Media Failures

• Media errors – Detectable & correctable Software: Consumes good data – Detectable & uncorrectable – NVMM Ctrl.: Detects & corrects errors

Undetectable Read NVMM data: • Software scribbles Media error – Kernel bugs or own bugs – Transparent to hardware

58 NVMM Failure Modes: Media Failures

• Media errors – Detectable & correctable Software: Receives MCE – Detectable & uncorrectable Detects uncorrectable errors – NVMM Ctrl.:

Undetectable Raises exception Read NVMM data: • Software scribbles Media error & – Kernel bugs or own bugs Poison Radius (PR) e.g. 512 – Transparent to hardware

59 NVMM Failure Modes: Media Failures

• Media errors – Detectable & correctable Software: Consumes corrupted data – Detectable & uncorrectable – NVMM Ctrl.: Sees no error

Undetectable Read NVMM data: • Software scribbles Media error – Kernel bugs or own bugs – Transparent to hardware

60 NVMM Failure Modes: Scribbles

• Media errors – Detectable & correctable

– Detectable & uncorrectable Software: Bug code scribbles NVMM Write

– Undetectable NVMM Ctrl.: Updates ECC

• Software “scribbles” NVMM data: Scribble error – Kernel bugs or NOVA bugs – NVMM file systems are highly vulnerable

61 NVMM Failure Modes: Scribbles

• Media errors – Detectable & correctable – Detectable & uncorrectable Software: Consumes corrupted data

– Undetectable NVMM Ctrl.: Sees no error Read • Software “scribbles” NVMM data: Scribble error – Kernel bugs or NOVA bugs – NVMM file systems are highly vulnerable

62 File operation latency

20.0 xfs-DAX 17.5 PMFS

15.0 ext4-DAX ext4-dataj 12.5 Btrfs 10.0 NOVA w/ MP 7.5

Latency (microsecond) Latency w/ MP+WP 5.0 w/ MP+DP

2.5 w/ MP+DP+WP Relaxed mode 0.0 Create Append (4KB) Overwrite (4KB) Overwrite (512B) Read (4KB)

63 Random R/W bandwidth on NVDIMM-N

NVDIMM-N 4K Read NVDIMM-N 4K Write 30 14 xfs-DAX 25 12 PMFS 10 20 ext4-DAX 8 ext4-dataj 15 Btrfs 6

10 NOVA Bandwidnth (GB/s) Bandwidnth Bandwidnth (GB/s) Bandwidnth 4 w/ MP 5 2 w/ MP+DP Relaxed mode 0 0 1 2 4 8 16 1 2 4 8 16 Threads Threads

64 Scribble size and metadata bytes at risk

16K

2K

256 no replication, worst 32 no replication, average 4 simple replication, worst simple replication, average 0.5 two-way replication, worst 0.06 two-way replication, average Metadata Pages at Risk at Pages Metadata 7.8E-3 dead-zone replication, worst 9.8E-4 dead-zone replication, average

1.2E-4

1.5E-5 1 16 256 4K 64K 1M 16M 256M Scribble Size in Bytes

65 Storage overhead

Primary inode 0.1% Primary log 2.0%

Replica inode 0.1% Replica log 2.0%

File data File 82.4% 1.6%

File parity 11.1%

Unused 0.8%

66 Latency breakdown

VFS alloc inode journaling memcpy_mcsafe memcpy_nocache append entry free old data calculate entry csum verify entry csum replicate inode replicate log verify data csum update data csum update data parity write protection

Create

Append 4KB

Overwrite 4KB

Overwrite 512B

Read 4KB

Read 16KB

0 1 2 3 4 5 6 7 8 9 Latency (microsecond)

67 Latency breakdown

VFS alloc inode journaling memcpy_mcsafe memcpy_nocache append entry free old data calculate entry csum verify entry csum replicate inode replicate log verify data csum update data csum update data parity write protection

Create

Append 4KB

Overwrite 4KB

Overwrite 512B

Read 4KB

Read 16KB

0 1 2 3 4 5 6 7 8 9 Latency (microsecond) Metadata Protection 68 Latency breakdown

VFS alloc inode journaling memcpy_mcsafe memcpy_nocache append entry free old data calculate entry csum verify entry csum replicate inode replicate log verify data csum update data csum update data parity write protection

Create

Append 4KB

Overwrite 4KB

Overwrite 512B

Read 4KB

Read 16KB

0 1 2 3 4 5 6 7 8 9 Latency (microsecond) Metadata Protection Data Protection 69 Latency breakdown

VFS alloc inode journaling memcpy_mcsafe memcpy_nocache append entry free old data calculate entry csum verify entry csum replicate inode replicate log verify data csum update data csum update data parity write protection

Create

Append 4KB

Overwrite 4KB

Overwrite 512B

Read 4KB

Read 16KB

0 1 2 3 4 5 6 7 8 9 Latency (microsecond) Metadata Protection Data Protection Scribble Prevention 70 Application performance on NOVA-Fortis

NVDIMM-N 1.2 495k 610k 553k 692k 27k 73k 30k 126k 45k 1.0

0.8

0.6 Ops/second

0.4 Normalized throughput Normalized

0.2

0.0 Fileserver Varmail Webproxy Webserver RocksDB MongoDB Exim SQLite TPCC Average

xfs-DAX PMFS ext4-DAX ext4-dataj Btrfs NOVA w/ MP w/ MP+DP Relaxed mode

71 Application performance on NOVA-Fortis

PCM 1.2

1.0

0.8

N -

0.6

to NVDIMM to 0.4 Normalized throughput Normalized

0.2

0.0 Fileserver Varmail Webproxy Webserver RocksDB MongoDB Exim SQLite TPCC Average

xfs-DAX PMFS ext4-DAX ext4-dataj Btrfs NOVA w/ MP w/ MP+DP Relaxed mode

72