
NOVA: A High-Performance, Hardened File System for Non-Volatile Main Memories

Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff, Steven Swanson

Non-Volatile Systems Laboratory Department of Computer Science and Engineering University of California, San Diego

NVDIMM Usage Models

• Legacy File IO Acceleration – fast and easy
  – Run existing IO-intensive apps on NVDIMMs
  – “just works”
  – NOVA is 30% to 10x faster for intensive workloads.
  – Need strong protections on data.
• DAX Mmap -- maximum speed + programming challenges
  – Load-store access
  – You still need a strongly-consistent file system
    • File system corruption can still destroy your data
    • NOVA is strongly consistent
  – Data protection is still critical

[Figure: Legacy IO Throughput. Y-axis: Ops per second (x1000), 0 to 450. Series: Ext4-datajournal, NOVA.]

XFS F2FS NILFS

EXT4

Disk-based file systems are inadequate for NVMM

• Disk-based file systems cannot exploit NVMM performance
• Performance optimization compromises consistency on system failure [1]

Columns: Atomicity (1-Sector overwrite, 1-Sector append, 1-Block overwrite, 1-Block append, N-Block overwrite, N-Block append); Data Protection (Metadata, Snapshots, Data)

Ext4 wb     ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✓
Ext4 Order  ✓ ✓ ✗ ✓ ✗ ✓ ✗ ✓ ✓
Ext4 Dataj  ✓ ✓ ✓ ✓ ✗ ✓ ✗ ✓ ✓
Btrfs       ✓ ✓ ✓ ✓ ✗ ✓ ✓ ✓ ✓
xfs         ✓ ✓ ✗ ✓ ✗ ✓ ✗ ✓ ✓
Reiserfs    ✓ ✓ ✗ ✓ ✗ ✓ ✗ ✓ ✓

[1] Pillai et al., All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI '14.

BPFS SCMFS PMFS Aerie

EXT4-DAX M1FS XFS-DAX

NVMM file systems don’t provide strong consistency or data protection

• DAX does not provide data atomicity guarantees
• Programming is more difficult

Columns: Atomicity (Metadata, Data); Data Protection (Metadata, Snapshots, Data)

BPFS      ✓ ✓ ✗ ✗ ✗
PMFS      ✓ ✗ ✗ ✗ ✗
Ext4 DAX  ✓ ✗ ✗ ✓ ✗
XFS DAX   ✓ ✗ ✗ ✓ ✗
SCMFS     ✗ ✗ ✗ ✗ ✗
Aerie     ✓ ✗ ✗ ✗ ✗


NOVA provides strong atomicity guarantee

Disk-based file systems. Columns: Atomicity (1-Sector overwrite, 1-Sector append, 1-Block overwrite, 1-Block append, N-Block overwrite, N-Block append); Data Protection (Metadata, Snapshots, Data)

Ext4 wb     ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✓
Ext4 Order  ✓ ✓ ✗ ✓ ✗ ✓ ✗ ✓ ✓
Ext4 Dataj  ✓ ✓ ✓ ✓ ✗ ✓ ✗ ✓ ✓
Btrfs       ✓ ✓ ✓ ✓ ✗ ✓ ✓ ✓ ✓
xfs         ✓ ✓ ✗ ✓ ✗ ✓ ✗ ✓ ✓
Reiserfs    ✓ ✓ ✗ ✓ ✗ ✓ ✗ ✓ ✓
NOVA        ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

NVMM file systems. Columns: Atomicity (Metadata, Data); Data Protection (Metadata, Snapshots, Data)

BPFS      ✓ ✓ ✗ ✗ ✗
PMFS      ✓ ✗ ✗ ✗ ✗
Ext4 DAX  ✓ ✗ ✗ ✓ ✗
XFS DAX   ✓ ✗ ✗ ✓ ✗
SCMFS     ✗ ✗ ✗ ✗ ✗
Aerie     ✓ ✗ ✗ ✗ ✗
NOVA      ✓ ✓ ✓ ✓ ✓

NOVA’s Key Features

• Features
  – High-performance
  – Strong Consistency
  – Snapshot support
  – Data protection
• Usage Models
  – ()/(), read()/write()
  – DAX-mmap()

NOVA’s Architecture

Core NOVA Structures: Log Structure + copy-on-write + Journals

• Log structure
  – One log per inode
  – Non-contiguous
  – Fast, simple atomic updates
  – Meta-data only
• Copy-on-write
  – Multi- atomic update
  – Fast allocation
  – Instant data GC
• Journals
  – Small, fixed sized journals
  – For complex ops.

[Figure: a directory log and a file log, each with a tail pointer. The file log references data pages Data 0, Data 1, Data 1, Data 2. A journal records the dir tail and the file tail.]

Supporting Backups with Snapshots

Snapshots for Normal File Access

[Figure: a file log of write entries, each tagged with an epoch ID and pointing to a data page: (0, 0x1000), (1, 0x2000), (1, 0x3000), (2, 0x4000). Snapshot entries mark Snapshot 0 and Snapshot 1 as the current epoch advances 0, 1, 2. Legend: snapshot entry, file write entry, epoch ID, data in snapshot, reclaimed data, current data.]

Corrupt Snapshots with DAX-mmap()

• Recovery invariant: if V == True, then D is valid
  – Incorrect: Naïvely mark pages read-only one-at-a-time

[Figure: timeline for the application stores D = 1; V = True. Page hosting D: ?, 1, ?. Page hosting V: False, True, T. Legend: R/W, RO, page fault, copy on write, value change, snapshot.]

Consistent Snapshots with DAX-mmap()

• Recovery invariant: if V == True, then D is valid
  – Correct: Block page faults until all pages are read-only

[Figure: timeline for the application stores D = 1; V = True. Page hosting D: ?, 1, ?. Page hosting V: False, True, F. Legend: R/W, RO, RO blocking, page fault, copy on write, value change, snapshot.]

Performance impact of snapshots

• Normal execution vs. taking snapshots every 10s
  – Negligible performance loss through read()/write()
  – Average performance loss 6.2% through mmap()

[Figure panels: Conventional workloads; NVMM-aware workloads from WHISPER]

Data Protection: Metadata

NVMM Failure Modes: Media Failures

• Media errors
  – Detectable & correctable
    • Transparent to software
  – Detectable & uncorrectable
    • Affect a contiguous range of data
    • Raise machine check exception (MCE)
  – Undetectable
    • May consume corrupted data
• Software scribbles
  – Kernel bugs or own bugs
  – Transparent to hardware

[Figure, detectable & correctable: software reads NVMM data with a media error. NVMM Ctrl.: detects & corrects errors. Software: consumes good data.]

[Figure, detectable & uncorrectable: software reads NVMM data with a media error & Poison Radius (PR), e.g. 512 bytes. NVMM Ctrl.: detects uncorrectable errors and raises an exception. Software: receives MCE.]

Detecting NVMM Media Errors

memcpy_mcsafe()
• Copy data from NVMM
• Catch MCEs and return failure

[Flowchart: MCE, then "Whose access?". Kernel: "Handler registered?" Yes: process and return. No: kernel panic. User: recoverable: SIGBUS. Unrecoverable: kernel panic.]

[Figure, undetectable: software reads NVMM data with a media error. NVMM Ctrl.: sees no error. Software: consumes corrupted data.]

NVMM Failure Modes: Scribbles

• Software “scribbles”
  – Kernel bugs or NOVA bugs
  – NVMM file systems are highly vulnerable

[Figure, write: buggy code scribbles NVMM. NVMM Ctrl.: updates ECC. NVMM data: scribble error.]

[Figure, read: software reads NVMM data with a scribble error. NVMM Ctrl.: sees no error. Software: consumes corrupted data.]

NOVA Metadata Protection

• Replicate everything
  – inode
  – Logs
  – Superblock
  – …
• CRC32 everywhere

[Figure: inode (Head, Tail, csum, H1, T1) and its replica inode’ (Head’, Tail’, csum’, H1’, T1’). Log entries ent1 c1 … entN cN and their replicas ent1’ c1’ … entN’ cN’. Data pages Data 1, Data 2.]

Defense Against Scribbles

• Tolerating Larger Scribbles
  – Allocate replicas far from one another
  – Can tolerate arbitrarily large scribbles to metadata.
• Preventing scribbles
  – Mark all NVMM as read-only
  – Disable CPU write protection only while NOVA writes to NVMM

Data Protection: Data

NOVA Data Protection

• Divide 4KB blocks into 512-byte stripes
• Compute a RAID 5-style parity stripe
• Compute and replicate checksums for each stripe

[Figure: 1 block = 512-byte stripe segments S0 S1 S2 S3 S4 S5 S6 S7, plus parity stripe P]

$P = S_0 \oplus S_1 \oplus \dots \oplus S_7$
$C_i = \mathrm{CRC32C}(S_i)$ (replicated)

File data protection with DAX-mmap

• With DAX-mmap(), file data changes are invisible to NOVA
• NOVA cannot protect mmap’ed file data
  – User applications directly load/store the mmap’ed region
  – NOVA has to know what file pages are mmap’ed
• NOVA logs mmap() and restores protection on munmap() or recovery

[Figure: user-space applications reach NOVA in kernel space through read(), write(), and mmap(), and load/store mmap’ed file data on the NVDIMMs directly. File data outside the mapping is protected and mmap’ed file data is unprotected. The file log holds an mmap log entry. Protection is restored after munmap(), and after a system failure followed by recovery.]

Performance

Performance Cost of Data Integrity

[Figure: bar chart, y-axis 0 to 1.2. Workloads: Fileserver, Varmail, Webproxy, Webserver, RocksDB, MongoDB, Exim, TPCC, average. Series: xfs-DAX, ext4-DAX, ext4-dataj, Fortis baseline, w/ MP+WP, w/ MP+DP+WP.]

Conclusion

• Existing file systems do not meet the requirements of applications on NVMM file systems
• NOVA’s multi-log design achieves high performance and strong consistency
• NOVA’s data protection features ensure data integrity
• NOVA outperforms existing file systems while providing stronger consistency and data protection guarantees

Thank you!

Try NOVA! https://github.com/NVSL/NOVA

Backup Slides

Protecting Against Scribbles

• Metadata allocator separates metadata replicas
  – Allocate primary and replica pages in opposite directions
  – Use allocator ‘dead-zone’ to guarantee a minimum distance
  – Protect against scribbles from other kernel bugs and own bugs

Simple allocation: log1 log1’ log2 log2’ … logN logN’
  A page-sized scribble can affect most pairs of replicated metadata pages

Two-way allocation: log1 log2 … logN logN’ … log2’ log1’
  A page-sized scribble can affect only a limited number of pairs of replicated metadata pages

Dead-zone allocation: log1 log2 … logN [1 MB] logN’ … log2’ log1’
  A scribble smaller than 1 MB cannot corrupt both replicas of any metadata page

Minimize the chance of corruptions – x86 write protection

• Leverage x86 CPU’s write protection
  – CR0.WP disables/enables writing to read-only memory on each x86 core
  – Only enable writing when NOVA writes to NVMM
  – Protect against scribbles from other kernel bugs, not own bugs
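The write-window idea on the x86 write protection slide can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `ProtectedNvmm` and `write_window` are hypothetical names, a Python flag stands in for CR0.WP, and nothing here touches real page tables or control registers.

```python
from contextlib import contextmanager


class ProtectedNvmm:
    """Toy NVMM region that is read-only except inside an explicit write window."""

    def __init__(self, size: int):
        self._mem = bytearray(size)
        self._writable = False  # models write protection being enforced

    @contextmanager
    def write_window(self):
        """Hypothetical helper: lift write protection only for the enclosed stores."""
        self._writable = True
        try:
            yield self
        finally:
            self._writable = False  # re-arm protection even if the writer fails

    def store(self, offset: int, data: bytes) -> None:
        if not self._writable:
            raise PermissionError("store to read-only NVMM outside a write window")
        self._mem[offset:offset + len(data)] = data

    def load(self, offset: int, length: int) -> bytes:
        return bytes(self._mem[offset:offset + length])
```

A stray `store()` from code that never opened a window fails, while a buggy store issued inside the window still succeeds, which mirrors the slide's point that the mechanism guards against other kernel bugs but not against the file system's own bugs.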

Filebench throughput

• NOVA achieves high performance with strong data consistency

[Figure: Filebench throughput. Y-axis: Ops per second (x1000), 0 to 450. Workloads: Fileserver, Varmail, Webproxy, Webserver. Series: Ext4-datajournal, Ext4-DAX, m1fs, NOVA.]

Tick-tock inode update

• Update tails of primary inode
• Update csum of primary inode
• Same procedure for inode’

[Figure: primary and secondary inode copies, each with Head, Tail, csum, H1, T1 fields (primed in one copy). Legend: old, updating, new.]
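The tick-tock procedure can be sketched as follows. This is a toy model, not NOVA's code: `zlib.crc32` stands in for the inode checksum, the `crash_at` parameter is a hypothetical way to inject a failure between steps, and `recover` assumes the primary is preferred whenever its checksum verifies.

```python
import zlib
from dataclasses import dataclass


def csum_of(head: int, tail: int) -> int:
    """Checksum over the log pointers (zlib.crc32 is a stand-in)."""
    return zlib.crc32(head.to_bytes(8, "little") + tail.to_bytes(8, "little"))


@dataclass
class InodeCopy:
    head: int
    tail: int
    csum: int

    def valid(self) -> bool:
        return self.csum == csum_of(self.head, self.tail)


class Crash(Exception):
    """Hypothetical stand-in for a power failure at a chosen step."""


def tick_tock_update(primary, replica, new_tail, crash_at=None):
    """Finish the primary (tail, then csum) before touching the replica."""
    step = 0
    for copy in (primary, replica):
        for field in ("tail", "csum"):
            if crash_at == step:
                raise Crash(step)
            if field == "tail":
                copy.tail = new_tail
            else:
                copy.csum = csum_of(copy.head, copy.tail)
            step += 1


def recover(primary, replica):
    """Return a copy whose checksum verifies, preferring the primary."""
    if primary.valid():
        return primary
    if replica.valid():
        return replica
    raise OSError("both inode copies are corrupt")
```

Because only one copy is mid-update at any time, a crash at any of the four steps leaves at least one copy with a matching checksum, holding either the old or the new tail.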

Performance Cost of Data Integrity

[Figure: bar chart, y-axis 0 to 1.2. Workloads: Fileserver, Varmail, Webproxy, Webserver, RocksDB, MongoDB, Exim, TPCC, average. Series: xfs-DAX, ext4-DAX, ext4-dataj, Fortis baseline, w/ MP, w/ MP+WP, w/ MP+DP, w/ MP+DP+WP.]

Conclusions

• NVMM file systems need unique solutions for reliability
  – Error reporting mechanisms different than disks
  – DAX-mmap complicates designs
• Performance and storage penalties vary
  – Storage cost is modest for the presented hardening techniques
  – Performance impact is significant for some applications
• More knowledge is necessary to determine the trade-offs
  – Uncorrectable media errors in emerging NVMM technologies
  – The frequency and size of scribbles in kernel space
• NOVA provides all hardening techniques as mount options

Performance impact of data integrity

• File operation latency

Reliability evaluation – metadata pages at risk

• Scan an aged NOVA file system image
  – Examine distances between the primary and replica pages
  – Count vulnerable page pairs for a given scribble size
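The counting step can be sketched on synthetic layouts rather than a real file system image. The layout helpers below are hypothetical and simply mirror the simple, two-way, and dead-zone allocation schemes named in this deck. Pages are assumed to be 4 KB, and a pair counts as vulnerable when a single contiguous scribble of the given size could touch both of its pages.

```python
PAGE = 4096


def simple_layout(n):
    """Each replica sits right after its primary: log1 log1' log2 log2' ..."""
    return [(2 * i, 2 * i + 1) for i in range(n)]


def two_way_layout(n, dead_zone_pages=0):
    """Primaries grow up from page 0 and replicas grow down from the top,
    with an optional dead zone between the two groups."""
    top = 2 * n + dead_zone_pages - 1
    return [(i, top - i) for i in range(n)]


def vulnerable_pairs(pairs, scribble_bytes):
    """Count pairs whose two pages can both be hit by one contiguous scribble."""
    count = 0
    for primary, replica in pairs:
        gap_pages = abs(replica - primary) - 1
        # Smallest scribble touching both pages: the last byte of the lower
        # page, the whole gap, and the first byte of the higher page.
        if scribble_bytes >= gap_pages * PAGE + 2:
            count += 1
    return count
```

With 100 metadata pages, a page-sized scribble can reach every pair in the simple layout, only the innermost pair in the two-way layout, and no pair once a 1 MB (256-page) dead zone is added.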

[Figure note: Y == 0 points do not show in log-log plot]

PMFS shortcomings

• No data atomicity support
• High consistency overhead with persistent B-tree
• Not scalable
  – Directory operations (linear search)
  – NVMM allocation (single allocator)
  – Single journal shared by all transactions
• Poor performance on large directories
• Intel has deprecated PMFS

Ext4-DAX and xfs-DAX shortcomings

• No data atomicity support
• Single journal shared by all the transactions (JBD2-based)
• Poor performance

Non-volatile main memory is about to happen

NVMM needs a new file system: PMFS, Ext4-DAX, SCMFS, Aerie, NOVA, …

Why a new file system?

We need to reduce the software overhead.

Source: Memory-Driven Computing, Kimberly Keeton, HP Labs

What Should a File System Provide?

• Performance: current focus (of all known efforts)
• Consistency
  – Atomic metadata operations
  – Atomic file updates
• Data Protection
  – Snapshots
  – Media error protection
  – Software error protection
• Cost optimizations
  – Compression
  – Deduplication

We need to study the impact of adding more file system services in the context of NVMM.

Evaluation: Latency

• Intel PM Emulation Platform
  – Emulates different NVM characteristics
  – Emulates clwb/PCOMMIT latency
• NOVA provides low latency atomicity

[Figure: Operation latency. Y-axis: Latency (microsecond), 0 to 25. Operations: Create, Append (4KB), Delete. Series: Ext4-datajournal, Ext4-DAX, m1fs, NOVA.]

NOVA design and in-NVMM data layout

• High performance
  – No page cache
  – Memory semantics
  – Segregated data structures
  – Per-CPU freelist
  – Per-inode logging
• Strong consistency
  – Copy-on-write file data
  – Using 8-byte atomic stores

[Figure: DRAM holds a free list per CPU (CPU 0, CPU 1, ...). NVMM holds the super block, the recovery inode, and a journal and inode table per CPU. Each inode stores the Head and Tail of its per-inode log, with a 64-bit tail pointer. The file log references data pages Data 0, Data 1, Data 1, Data 2 while writing to page 1 and 2.]
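How copy-on-write data pages combine with a per-inode log can be sketched with a toy model. Everything here is illustrative: `InodeLog` is a hypothetical class, Python lists and dicts stand in for NVMM, and a single integer assignment plays the role of the 8-byte atomic store to the log tail.

```python
class InodeLog:
    """Toy model of one inode: a metadata-only log plus copy-on-write data pages."""

    def __init__(self):
        self.pages = {}          # data pages: address -> bytes
        self.log = []            # log entries: (file page number, data address)
        self.tail = 0            # committed log length (the atomic tail pointer)
        self.next_addr = 0x1000
        self.reclaimed = []      # data page addresses that are free again

    def _alloc(self, data):
        addr = self.next_addr
        self.next_addr += 0x1000
        self.pages[addr] = data
        return addr

    def lookup(self, pgno):
        """Latest committed data address for a file page, or None."""
        for entry_pgno, addr in reversed(self.log[:self.tail]):
            if entry_pgno == pgno:
                return addr
        return None

    def read(self, pgno):
        addr = self.lookup(pgno)
        return None if addr is None else self.pages[addr]

    def write(self, updates, crash_before_commit=False):
        """Write new pages and log entries, then publish them with one tail update."""
        old = [a for a in (self.lookup(p) for p in updates) if a is not None]
        for pgno, data in updates.items():
            self.log.append((pgno, self._alloc(data)))  # invisible until commit
        if crash_before_commit:
            return False
        self.tail = len(self.log)   # single store makes all entries visible
        self.reclaimed.extend(old)  # stale pages can be reused immediately
        return True

    def recover(self):
        """Drop entries past the tail and free the pages they pointed to."""
        self.reclaimed.extend(addr for _, addr in self.log[self.tail:])
        del self.log[self.tail:]
```

In this model a write that covers pages 1 and 2 becomes visible all at once when the tail moves, and a crash before that point leaves the old contents intact.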

NVMM file systems should support snapshot

• Snapshot is essential for file system backup
• Available in file systems for block devices
  – ZFS, Btrfs, WAFL
• NOVA is the first NVMM file system providing snapshot
  – Efficient full-filesystem snapshot at minimal performance cost
  – Creating/deleting snapshots does not halt file system
  – Creating consistent snapshots with DAX-mmap enabled

Enable snapshot in NOVA

• Maintain a current ‘epoch_id’ for the file system
  – Stored in the superblock
  – Incremented after every snapshot taken
• Add the ‘epoch_id’ to each log entry
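The epoch mechanism can be sketched for a single file page. This is an illustrative model with assumptions stated up front: `SnapshotFile` is a hypothetical class, every write entry is assumed to overwrite the same file page, and a snapshot is assumed to see the last entry written at or before its epoch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteEntry:
    epoch_id: int  # file system epoch when the entry was appended
    addr: int      # data page the entry points to


class SnapshotFile:
    """Log of overwrites to one file page, tagged with epoch IDs."""

    def __init__(self):
        self.epoch_id = 0    # current epoch, kept in the superblock in NOVA
        self.snapshots = []  # epoch IDs of live snapshots
        self.log = []

    def write(self, addr):
        self.log.append(WriteEntry(self.epoch_id, addr))

    def take_snapshot(self):
        snap = self.epoch_id
        self.snapshots.append(snap)
        self.epoch_id += 1   # incremented after every snapshot taken
        return snap

    def delete_snapshot(self, snap):
        self.snapshots.remove(snap)

    def view(self, snap):
        """Data address visible in a snapshot: last entry with epoch <= snap."""
        visible = [e for e in self.log if e.epoch_id <= snap]
        return visible[-1].addr if visible else None

    def current(self):
        return self.log[-1].addr if self.log else None

    def reclaimable(self):
        """Addresses needed neither by a live snapshot nor by the current state."""
        live = {self.view(s) for s in self.snapshots} | {self.current()}
        return [e.addr for e in self.log if e.addr not in live]
```

Replaying the log from the snapshot figures in this deck, (0, 0x1000), (1, 0x2000), (1, 0x3000), (2, 0x4000) with snapshots taken after epochs 0 and 1, leaves only 0x2000 reclaimable, since it was overwritten within its own epoch.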

Taking snapshots

[Figure: file log with write entries (0, 0x1000), (1, 0x2000), (1, 0x3000), (2, 0x4000) and snapshot entries for Snapshot 0 and Snapshot 1. Current epoch advances 0, 1, 2. Snapshot 0 log: 0x1000, 1. Snapshot 1 log: 0x3000, 2. Legend: snapshot entry, file write entry, epoch ID, data in snapshot, reclaimed data, current data.]

Deleting snapshots

[Figure: same file log and snapshot logs as above, with background GC reclaiming data that no remaining snapshot references.]

Mounting snapshots

[Figure: mounting Snapshot 1. Snapshot 1 log: 0x3000, 2. The Snapshot 1 log tail marks the end of the file log as seen by the snapshot. File log: (0, 0x1000), (1, 0x2000), (1, 0x3000), (2, 0x4000).]

Snapshots with DAX-mmap

• Goal: Applications take snapshots and keep running
  – Virtual addresses do not change
  – Consistency must be guaranteed
• How: Set each mmap’ed page as read-only
  – Then do copy-on-write for new stores (detected by page fault)
• Caveat: Can only atomically set one page as read-only
  – What if the order of becoming read-only conflicts with consistency?
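The caveat can be made concrete with a small deterministic simulation of the D/V example used in this deck. It is a sketch, not NOVA's implementation: the function names are hypothetical, `None` plays the role of the not-yet-valid D, and the application is scheduled to run as soon as the first page has been marked read-only, which is one adversarial interleaving among many.

```python
def take_snapshot(block_faults: bool):
    """Mark pages D and V read-only one at a time while the app stores D, then V."""
    live = {"D": None, "V": False}        # live page contents
    pending = [("D", 1), ("V", True)]     # application stores, in program order
    read_only = set()
    snapshot = {}

    def run_app():
        """Run stores until the app finishes or blocks in a page fault."""
        while pending:
            page, value = pending[0]
            faulted = page in read_only
            if faulted and block_faults and len(read_only) < len(live):
                return  # fault handler waits until every page is read-only
            # Writable page: store in place. Read-only page: the fault is
            # served by copy-on-write, so the snapshot copy is untouched.
            live[page] = value
            pending.pop(0)

    for page in ("D", "V"):
        snapshot[page] = live[page]  # this version now belongs to the snapshot
        read_only.add(page)
        run_app()                    # the application runs concurrently
    run_app()
    return snapshot, live


def invariant_holds(state):
    """Recovery invariant: if V is True, then D is valid."""
    return (not state["V"]) or state["D"] is not None
```

Without blocking, the snapshot captures V == True alongside an invalid D. With page faults blocked until all pages are read-only, the snapshot keeps V == False and the application still ends up with D == 1 and V == True in its live pages.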

NOVA (meta)data integrity features

• Detect (meta)data corruptions
  – Media errors: error code from memcpy_mcsafe(), and checksums
  – Software scribbles: checksums
• Repair (meta)data corruptions
  – Metadata: Fully replicated
  – File data: Stripe and parity-code each block
• Minimize scribbles
  – Leverage x86 CPU’s write protection (CR0.WP)
  – Metadata allocators separate replicas

Metadata error detection and correction

• Use inode access as an example:
  – Read inode with memcpy_mcsafe
  – Read inode’ with memcpy_mcsafe
  – Verify inode csum
  – Verify inode’ csum
  – memcmp(inode, inode’)
  – Use inode
• If any step raises an error:
  – Attempt to repair and retry
  – If recovery fails: return –EIO to user
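The checksum and compare steps can be sketched with two in-memory copies. This is a minimal model under assumptions: `seal`, `csum_ok`, and `read_inode` are hypothetical helpers, `zlib.crc32` stands in for the checksum, MCE handling through memcpy_mcsafe is left out, and when both copies verify but differ the primary is assumed to win.

```python
import errno
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte checksum to a metadata record."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def csum_ok(record: bytes) -> bool:
    return zlib.crc32(record[:-4]) == int.from_bytes(record[-4:], "little")


def read_inode(store: dict) -> bytes:
    """Verify both copies, repair the bad one from the good one, return the payload."""
    primary, replica = store["primary"], store["replica"]
    ok_primary, ok_replica = csum_ok(primary), csum_ok(replica)
    if ok_primary and ok_replica:
        if primary != replica:            # the memcmp step
            store["replica"] = primary    # assumption: the primary wins
        return primary[:-4]
    if ok_primary:
        store["replica"] = primary        # repair the replica
        return primary[:-4]
    if ok_replica:
        store["primary"] = replica        # repair the primary
        return replica[:-4]
    raise OSError(errno.EIO, "both inode copies are corrupt")
```

A scribble on either copy is repaired transparently on the next read, and only the loss of both copies surfaces to the caller as an I/O error.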

File data error detection and correction

• Read a strip with memcpy_mcsafe
• Check errors in memcpy_mcsafe
• Verify strip’s checksums
• Copy data to user
• If any step raises an error:
  – Attempt to repair and retry
  – If recovery fails: return –EIO to user
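The stripe, parity, and checksum scheme can be sketched end to end. This is an illustration rather than NOVA's implementation: `protect` and `read_block` are hypothetical helpers, `zlib.crc32` (plain CRC-32) stands in for CRC32C, checksum replication and MCE handling are omitted, and at most one bad stripe per block can be rebuilt.

```python
import errno
import zlib
from functools import reduce

STRIPE = 512
BLOCK = 4096


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def protect(block: bytes):
    """Split a 4 KB block into 512-byte stripes, add XOR parity and checksums."""
    stripes = [block[i:i + STRIPE] for i in range(0, BLOCK, STRIPE)]
    parity = reduce(xor, stripes)
    csums = [zlib.crc32(s) for s in stripes]
    return stripes, parity, csums


def read_block(stripes, parity, csums) -> bytes:
    """Verify every stripe, rebuild a single bad one from parity, else fail."""
    bad = [i for i, s in enumerate(stripes) if zlib.crc32(s) != csums[i]]
    if len(bad) > 1:
        raise OSError(errno.EIO, "more than one corrupt stripe")
    if bad:
        i = bad[0]
        others = [s for j, s in enumerate(stripes) if j != i]
        rebuilt = reduce(xor, others, parity)  # parity XOR the seven good stripes
        if zlib.crc32(rebuilt) != csums[i]:
            raise OSError(errno.EIO, "repair failed")
        stripes[i] = rebuilt                   # write the repaired stripe back
    return b"".join(stripes)
```

The per-stripe checksums locate the damage and the parity stripe supplies the missing bytes, so a block with one corrupted stripe still reads back intact.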

File data protection with DAX-mmap

• With DAX-mmap(), file data changes are invisible to NOVA
• NOVA cannot protect mmap’ed file data
  – User applications directly load/store the mmap’ed region
  – NOVA has to know what file pages are mmap’ed
• Following DAX-mmap() semantics, NOVA doesn’t interfere with mmap’ed file data.
• NOVA skips protection routines for reads and writes to the regions found in the vm area list.

[Figure: user-space applications issue read(), write(), and mmap() through NOVA in kernel space, and load/store mmap’ed file data on the NVDIMMs directly. Before mmap(), all file data is protected. After mmap(), the mapped region is recorded in the vm area list and has no protection.]

Performance impact of data integrity

• Latency breakdown on NVDIMM-N

[Figure legend: metadata protection, file data protection, x86 write protection]

Performance impact of data integrity

• Random read/write bandwidth on NVDIMM-N

Storage utilization with data integrity

• Conceptual view of NOVA’s utilization of NVMM

Dead-zone (DZ) only virtually exists to separate metadata replicas. File data can still live inside.

• Actual usage of a practical workload: fileserver

Differences from disk FS implementation

• Low-latency storage media
  – Need to choose fast methods for any involved computation
• Fine-grained random access
  – Need fine-grained checksum ranges, not per block (as in Btrfs, ZFS)
• Small atomicity guarantees (64-bit)
  – Need metadata to assist consistent updates
• Media errors cause machine check exceptions (MCEs)
  – Need awareness and mitigations
• Demands for DAX-mmap (no copy-on-write, no FS control)
  – Need awareness and lowering the protection level on demand

Recovery for snapshot metadata

• Snapshot metadata list resides in DRAM to reduce consistency overhead
• Clean unmount:
  – Finish background snapshot delete
  – Save snapshot lists to NVMM
• Power failure:
  – Snapshot transaction ID is persistent
  – Rebuild snapshot metadata lists during power failure recovery

NVMM (Meta)data corruptions

• Media errors
  – Detectable & correctable
    • Transparent to software
  – Detectable & uncorrectable
    • Affect a contiguous range of data
    • Raise machine check exception (MCE)
  – Undetectable
    • May consume corrupted data
• Software scribbles
  – Kernel bugs or own bugs
  – Transparent to hardware

[Figure sequence: software reads NVMM data through hardware ECC. Correctable media error: software receives good data. Uncorrectable media error with a Poison Radius (PR), e.g. 512 bytes: hardware raises an MCE, which might panic in kernel mode or deliver SIGBUS in user mode. Undetectable media error or software scribble: software receives corrupted data.]

Metadata error detection and correction

• Use inode access as example

[Flowchart: read inode and read inode’ with memcpy_mcsafe. An MCE marks the poison radius, PR(inode) or PR(inode’). MCEs on both reads return -EIO to user. Verify inode csum and verify inode’ csum: both fail returns -EIO to user, one fails uses the good inode to repair the error inode, both OK proceeds to memcmp(inode, inode’) with outcomes eq (continue) and neq.]

File data error detection and correction

• Detect and repair both data and checksum errors

[Flowchart: read a strip with memcpy_mcsafe, calculate csum, and check csum0 == csum1 for the two stored checksum copies. If the copies differ, check csum == csum0 or csum == csum1, judged by the majority of csum, csum0, csum1, to tell a good csum from an error csum. On an MCE or a checksum mismatch, read the other strips and the parity, repair the bad strip, and verify csums. Any MCE or a failure during repair returns -EIO to user. On success, copy data to user.]