<<

NOVA: A Log-structured for Hybrid Volatile/Non-volatile Main Memories

Andiry Xu, Steven Swanson

Non-volatile Systems Laboratory Department of Computer Science and Engineering University of California, San Diego

1 NOVA overview • NOVA extends LFS to leverage non-volatile memories • NOVA proposes per- logging • High performance + Strong atomicity – 3.1x to 13.5x to file systems that have equally strong consistency guarantees in -intensive workloads • POSIX compliant

https://github.com/NVSL/NOVA

2 Hybrid DRAM/NVMM system • Non-volatile main memory (NVMM) – PCM, STT-RAM, ReRAM, 3D XPoint technology • File system for NVMM Host

CPU

NVMM FS

DRAM NVMM

3 Disk-based file systems are inadequate for NVMM

Ext4 Ext4 • Ext4, , , F2FS, NILFS2 Atomicity Btrfs xfs wb order dataj • 1-Sector Built for hard disks and SSDs ✓ ✓ ✓ ✓ ✓ overwrite – Software overhead is high 1-Sector ✗ ✓ ✓ ✓ ✓ – CPU may reorder writes to NVMM append 1- ✗ ✗ ✓ ✓ ✗ – NVMM has different atomicity guarantees overwrite 1-Block ✗ ✓ ✓ ✓ ✓ append • Cannot exploit NVMM performance N-Block ✗ ✗ ✗ ✗ ✗ • write/append Performance optimization compromises N-Block ✗ ✓ ✓ ✓ ✓ consistency on system failure [1] prefix/append

[1] Pillai et al, All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI '14. 4 NVMM file systems are not strongly consistent • BPFS, PMFS, Ext4-DAX, SCMFS, Aerie • None of them provide strong metadata and data consistency

Metadata Data Mmap File system atomicity atomicity Atomicity [1] BPFS Yes Yes [2] No PMFS Yes No No Ext4-DAX Yes No No SCMFS No No No Aerie Yes No No NOVA Yes Yes Yes

[1] Each msync() commits updates atomically. [2] In BPFS, write times are not updated atomically with respect to the write itself. 5 Why LFS? • Log-structuring provides cheaper atomicity than journaling and shadow paging • NVMM supports fast, highly concurrent random accesses – Using multiple logs does not negatively impact performance – Log does not need to be contiguous

• Rethink and redesign log-structuring entirely

6 NOVA design goals • Atomicity – Combine log-structuring, journaling and copy-on- write Per-inode logging • High performance – Split data structure between DRAM and NVMM Inode Head Tail – Highly scalable Inode log • Efficient garbage collection – Fine-grained log cleaning with log as a linked list – Log only contains metadata Committed entry Uncommitted entry • Fast recovery – Lazy rebuild – Parallel scan 7 Atomicity • Log-structuring for single log update – Write, msync, , etc – Strictly commit log entry to NVMM TailTail before updating log tail Directory log • Lightweight journaling for update across logs Tail TailTail – Unlink, rename, etc – Journal log tails instead of metadata or data File log

• Copy-on-write for file data Dir tail – Log only contains metadata Journal – Log is short File tail

8 Atomicity • Log-structuring for single log update – Write, msync, chmod, etc – Strictly commit log entry to NVMM Tail before updating log tail Directory log • Lightweight journaling for update across logs Tail Tail – Unlink, rename, etc – Journal log tails instead of metadata or data File log

• Copy-on-write for file data Data 0 Data 1 Data 1 Data 2 – Log only contains metadata – Log is short

9 Performance • Per-inode logging allows for high concurrency Tail Directory log • Split data structure between Tail DRAM and NVMM – Persistent log is simple and File log efficient Data 0 Data 1 Data 2 – Volatile structure has no consistency overhead

10 Performance

• Per-inode logging allows for Radix tree high concurrency 0 1 2 3 DRAM • Split data structure between NVMM Tail DRAM and NVMM – Persistent log is simple and File log efficient Data 0 Data 1 Data 2 – Volatile tree structure has no consistency overhead

11 NOVA layout

• Put allocator in DRAM CPU 0 CPU 1 • High scalability Free list Free list – Per-CPU NVMM free list, DRAM journal and inode table NVMM Journal Journal – Concurrent transactions Super Inode table Inode table and allocation/deallocation block Recovery inode Inode Head Tail

Inode log

12 Fast garbage collection • Log is a linked list • Log only contains Tail metadata Head • Fast GC deletes dead log pages from the linked list • No copying Vaild log entry Invalid log entry

13 Thorough garbage collection • Starts if valid log entries < 50% log length • Format a new log and atomically replace the old one • Only copy metadata Tail Head

Vaild log entry Invalid log entry

14 Recovery • Rebuild DRAM structure – Allocator CPU 0 CPU 1 – Lazy rebuild: postpones inode radix tree rebuild Free list Free list • Accelerates recovery • Reduces DRAM consumption DRAM • Normal shutdown recovery: NVMM – Store allocator in recovery inode Journal Journal – No log scanning Super Inode table Inode table block • Failure recovery: Recovery – Log is short inode Recovery Recovery thread – Parallel scan – Failure recovery bandwidth: > 400 GB/s

15 Evaluation: Latency

Operation latency 25 • PM Emulation Platform 20 – Emulates different NVM

15 characteristics – Emulates clwb/PCOMMIT 10 latency

Latency (microsecond) Latency 5 • NOVA provides low latency atomicity 0 Create Append (4KB) Delete Ext4-datajournal Ext4-DAX NOVA

16 Filebench throughput

Filebench throughput 450 400 350 300 3.2x2.5x • NOVA achieves high 250 performance with strong 200 2.2x 150 10x data consistency

Ops per second (x1000) per Ops second 100 50 0 Fileserver Varmail Webproxy Webserver Ext4-datajournal Ext4-DAX NOVA

17 Garbage collection efficiency

Fileserver (95% NVMM utilization) 300 250 • NOVA’s performance stays 200 stable with increasing 150 60% drop12x running time 100 Fail Fail • Fast GC reclaims the Ops per second (x1000) per Ops second 50 majority of stale pages in the 0 1 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 long-term running Running time (second)

NILFS2 F2FS NOVA

18 Conclusion • Existing file systems do not meet the requirements of applications on NVMM file systems • NOVA’s multi-log design achieves high concurrency, efficient garbage collection and fast recovery • NOVA outperforms existing file systems while providing stronger consistency and atomicity guarantees

19 Thank you!

https://github.com/NVSL/NOVA

20 Backup slides

21 Atomicity and enforce write ordering // Strictly commit log entry to NVMM before updating tail new_tail = append_to_log(inode->tail, entry); clwb(inode->tail, entry->length); // writes back the cachelines sfence(); PCOMMIT(); // Commits to NVMM Tail Tail sfence(); inode->tail = new_tail; Inode log

22 Directory operations • mv Alice/book Bob/

• (name, inode number) Tail Tail

Alice log “book”, 10 “book”, 0

Alice tail Tail Tail Bob tail Bob log “book”, 10 book tail Tail Tail

Journal book log inode update

23 Atomic file operations • Copy-on-write for file data • root File radix tree

• Write(8192, 8192) 0 1 2 3 DRAM

Head Tail Tail NVMM

File log <0, 1> <1, 2> <2, 2>

Data 0 Data 1 Data 2 Data 2 Data 3

24 Atomic mmap • Allocate replica pages and mmap(fd, 4096, 4096); mmap to msync(addr, 4096); • msync() commits updates Replica 1 atomically User space Head Tail Tail Kernel

File log

Data 0 Data 1 Data 1 Replica 1

25 Evaluation • Intel PM Emulation Platform • 32GB of DRAM, 64GB of NVMM

Read Write clwb PCOMMIT NVMM latency bandwidth latency latency STT-RAM 100 ns Full DRAM 40 ns 200 ns PCM 300 ns 1/8 DRAM 40 ns 500 ns

• Compare to Btrfs, NILFS2, F2FS, Ext4, Ext4-data, Ext4-DAX, PMFS • kernel 4.0 x86-64

26 Garbage collection efficiency

GC pages percentage Fast GC Thorough GC 100% 90% 80% 70% • Fast GC reclaims 94% pages 60% in one-hour test 50% 94.2

40% 85.4 PERCENTAGE 30% 64.3 20% 10% 10.7 0% 0 10s 30s 120s 600s 3600s RUNNING TIME

27