Orion: a Distributed Persistent Memory File System
Research Faculty Summit 2018
Systems | Fueling future disruptions

What Should You Do With Persistent Memory?
Steven Swanson
Director, Non-Volatile Systems Lab
Computer Science and Engineering, UC San Diego

Non-volatile main memory (NVMM)
• Byte-addressable
• Denser than DRAM
• DRAM-comparable latency
• Higher bandwidth than SSD
• Ready for DMA / RDMA
[Figure: CPU with two cores, L1/L2 caches, a last-level cache, and two memory controllers connected to DRAM DIMMs (memory) and NVDIMMs (NVMM); photo label: Intel 3D XPoint NVDIMM]

What Should You Do With NVMM?
1. Use files and a conventional (distributed) file system
2. Use files and a better (distributed) file system
3. Build persistent data structures
4. Use it as slow DRAM
[Figure labels: EXT4, Ceph, Log, NOVA, Orion]

[Figure labels: DAX, Direct Access, Fault Tolerance, Atomicity, File IO, Speed]

NOVA: A File System for NVMM
• A NOVA FS is a tree of logs
• One log per inode
  – Inode points to head and tail
  – Logs are not contiguous
• Many logs -> high concurrency
• Strong consistency guarantees
• Log-structured + journals + copy-on-write
[Figure: Per-inode logging. The inode holds head and tail pointers into its inode log; legend: committed entry, uncommitted entry]

Atomicity: Logging for Simple Metadata Operations
• Combines log-structuring, journaling, and copy-on-write
• Log-structuring for single-log updates
  – Write, msync, chmod, etc.
  – Lower overhead than journaling and shadow paging
[Figure: file log with its tail pointer]

Atomicity: Lightweight Journaling for Complex Metadata Operations
• Lightweight journaling for updates across logs
  – Unlink, rename, etc.
  – Journal log tails instead of metadata or data
[Figure: directory log and file log, each with a tail pointer; the journal holds the dir tail and the file tail]

Atomicity: Copy-on-write for file data
• Copy-on-write for file data
  – Log only contains metadata
  – Log is short
  – Instant data GC
[Figure: file log with its tail pointer; data pages labeled Data 0, Data 1, Data 1, Data 2]

Filebench throughput
[Figure: throughput in KOps per second (axis 0 to 400) for Fileserver, Varmail, Webproxy, and Webserver; series: Ext4-datajournal, Ext4-DAX, xfs-DAX, NOVA]

What Should You Do With NVMM?
1. Use files and a conventional (distributed) file system
2. Use files and a better (distributed) file system
3. Build persistent data structures
4. Use it as slow DRAM
[Figure labels: NOVA, ???]

Existing Distributed File Systems are Slow
[Figure labels: Application, Distributed File System Client, Client FS, Network Stack, Network Stack, Distributed File System Server, Local File System]

Orion: A Distributed Persistent Memory File System
[Figure labels: Application, OrionFS, OrionFS, RDMA]

Orion: Key Features
• Based on NOVA
• Mirrored metadata on client
  – Client keeps a local NVMM copy of the inode's log
  – Leases + simple arbitration for concurrent updates
• Mostly-local operation
  – Local read cache
  – CoW creates a new, local copy
• Pervasive RDMA
  – All addresses/pointers are RDMA-friendly
  – Zero-copy IO for most transfers (NOVA data structures are RDMA targets)
  – Single-ended remote data access

Application performance on Orion
[Figure: application performance chart]

Building NVMM Data Structures Is Hard
• All existing programming errors are still possible
  – Memory leaks
  – Multiple frees
  – Locking errors
• There are new kinds of errors
  – Pointers between NV memory pools
  – Pointers from NVMM to DRAM
• Programmers get this stuff wrong
• Rebooting/restarting won't help!
• Language + compiler support will come, but slowly

Optimizing RocksDB
[Figure: RocksDB optimization chart]

What Should You Do With NVMM?
1. Use files and a conventional (distributed) file system: easy; ~5x gains
2. Use files and a better (distributed) file system: pretty easy; ~10x gains
   The NVMM Programmability Gap
3. Build persistent data structures: really hard; ~30x gains
4. Use it as slow DRAM

Optimizing RocksDB
[Figure: RocksDB optimization chart; added label: File Emulation]

File Emulation
• Normal write-ahead logging
  – open();
  – write(); sync();
• Emulate read/write in user space
  – open(); mmap();
  – memcpy() + clwb + fence
• Almost POSIX semantics
  – Minimal changes to app logic
  – No complex logging, allocation, or locking
• Almost persistent data structure performance
  – Just 10% slower

File Emulation Speedups
[Figure: file emulation speedup (axis 0 to 60) for SQLite, KyotoCabinet, and RocksDB; series: XFS-DAX, Ext4-DAX, NOVA]

What Should You Do With NVMM?
• You should study it!
  – Many interesting, open problems remain
  – Lots of PhDs to come
• You should use it!
  – Use a file system!
  – Want more performance? Use file emulation!
  – Want more performance? Build persistent data structures.

NOVA is open source. We are preparing it for "upstreaming" into Linux.
To help or try it out: https://github.com/NVSL/linux-nova

Thanks!
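As a rough illustration of the File Emulation slide (open(); mmap(); then memcpy() + clwb + fence), the C sketch below emulates read and write in user space over a memory-mapped file. The `emu_*` names and the `emu_file` struct are hypothetical and do not come from the talk; this is a minimal sketch under stated assumptions, not the implementation measured in the slides. It issues `_mm_clwb` and `_mm_sfence` only when the compiler targets a CPU with CLWB (for example with `-mclwb`); otherwise it falls back to `msync`, an assumption made so the sketch also runs on ordinary hardware without NVMM.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#if defined(__CLWB__)
#include <immintrin.h>
#endif

/* Hypothetical handle for a file accessed through a shared mapping. */
typedef struct {
    int fd;
    unsigned char *base; /* start of the mapping */
    size_t len;          /* mapped length in bytes */
} emu_file;

/* Make the byte range [off, off + n) of the mapping durable. */
static int emu_persist(emu_file *f, size_t off, size_t n)
{
#if defined(__CLWB__)
    /* Write back every 64-byte cache line in the range, then fence. */
    uintptr_t p = (uintptr_t)(f->base + off) & ~(uintptr_t)63;
    uintptr_t end = (uintptr_t)(f->base + off + n);
    for (; p < end; p += 64)
        _mm_clwb((void *)p);
    _mm_sfence();
    return 0;
#else
    /* Assumed fallback: msync from the enclosing page boundary. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = off - (off % page);
    return msync(f->base + start, (off - start) + n, MS_SYNC);
#endif
}

/* open(); mmap(): size the file to len bytes and map it read/write.
 * Simplification: an existing file is resized to exactly len bytes. */
static int emu_open(emu_file *f, const char *path, size_t len)
{
    f->fd = open(path, O_RDWR | O_CREAT, 0644);
    if (f->fd < 0)
        return -1;
    if (ftruncate(f->fd, (off_t)len) != 0) {
        close(f->fd);
        return -1;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, f->fd, 0);
    if (p == MAP_FAILED) {
        close(f->fd);
        return -1;
    }
    f->base = p;
    f->len = len;
    return 0;
}

/* Emulated write(): copy into the mapping, then persist the range. */
static ssize_t emu_write(emu_file *f, size_t off, const void *src, size_t n)
{
    assert(f != NULL && f->base != NULL);
    if (off > f->len || n > f->len - off)
        return -1; /* out of bounds; this sketch never grows the file */
    memcpy(f->base + off, src, n);
    if (emu_persist(f, off, n) != 0)
        return -1;
    return (ssize_t)n;
}

/* Emulated read(): a plain copy out of the mapping. */
static ssize_t emu_read(const emu_file *f, size_t off, void *dst, size_t n)
{
    if (off > f->len || n > f->len - off)
        return -1;
    memcpy(dst, f->base + off, n);
    return (ssize_t)n;
}

/* Unmap and close the file. */
static void emu_close(emu_file *f)
{
    munmap(f->base, f->len);
    close(f->fd);
    f->base = NULL;
}
```

An application would call `emu_open` once and then use `emu_write` and `emu_read` where it previously called write() and read(), which is why the slide describes the change to application logic as minimal. Unlike write(), this sketch does not extend the file or provide any atomicity for a multi-line update.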