ORION: A Distributed File System for Non-Volatile Main Memories and RDMA-Capable Networks
Jian Yang, Joseph Izraelevitz, Steven Swanson
Non-Volatile Systems Laboratory
Department of Computer Science & Engineering
University of California, San Diego

Accessing NVMM as Remote Storage
[Figure: applications on several hosts, each with DRAM and NVMM, connected by an RDMA network]
• Non-Volatile Main Memory (NVMM)
  – PCM, STT-RAM, ReRAM, Intel 3D XPoint
  – Performant: DRAM-class latency and bandwidth
  – Byte-addressable
  – Persistent across power failures
• Remote Direct Memory Access (RDMA)
  – DMA from/to remote memory
  – Two-sided verbs (Send/Recv)
  – One-sided verbs (Read/Write) that bypass the remote CPU
  – Byte-addressable

Accessing Local NVMM vs. Remote NVMM
• Latency of write(4 KB) + fsync() (lower is better)
  – Local NVMM FS (NOVA): 3.86 μs
  – Remote NVMM over iSCSI/RDMA: 19x the NOVA latency
  – Remote NVMM over Ceph/RDMA: 241x the NOVA latency
• Throughput of the fileserver workload (higher is better)
  – NVMM FS (NOVA): 6653 MB/s
  – iSCSI/RDMA: 19% of NOVA
  – Ceph/RDMA: 5% of NOVA

Issue #1: Existing Distributed FSs are Slow on NVMM
[Figure: storage client and storage server each run a full file-system stack; a file request crosses the user/kernel boundary, an RPC, and a block layer before reaching NVMM]
• Layered design
  – Indirection overhead
  – Expensive to persist (e.g., fsync())

NVMM is Faster than RDMA
                Hard drive (NVMe)   RDMA network   NVMM
  Latency       70 μs               3 μs           300 ns
  Bandwidth     1.3-3.2 GB/s        5 GB/s         2-16 GB/s
  Access size   4 KB (page)         2 KB (MTU)     64 B (cache line)
  (@ max BW)
• Networking is faster than storage, but NVMM is faster than networking

Issue #2: Lack of Support for Local NVMM
• Use case of converged storage
  – Local NVMM supports Direct Access (DAX)
• Existing systems do not store data in local NVMM
• Running a local FS and a distributed FS side by side
  – Expensive to move data between them
[Figure: application over a storage client with a page cache; a local DAX file system uses local NVMM while the distributed FS path reaches remote NVMM on a storage server]

ORION: A Distributed File System for NVMM and RDMA-Capable Networks
• A clean-slate design for NVMM and RDMA
• A unified layer: kernel file system + networking
• Pooled NVMM storage
• Accesses metadata and data directly over Direct Access (DAX) and RDMA
• Designed for rack-scale scalability

Outline
• Background
• Design overview
• Metadata and data management
• Replication
• RDMA persistence
• Evaluation
• Conclusion

ORION: Cluster Overview
[Figure: an MDS holding metadata in NVMM, clients with a DRAM data cache and local NVMM, and pooled NVMM data stores, connected by an RDMA network]
• Metadata Server (MDS): runs ORIONFS and keeps the authoritative metadata for the whole FS
• Client: runs ORIONFS and keeps active metadata and cached data; accesses local NVMM via DAX
• Data Store (DS): pooled NVMM data
• Metadata access: Clients <=> MDS (two-sided)
• Data access: Clients => DSs (one-sided RDMA read/write); a sketch of both verb types follows
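The split above, two-sided Send/Recv for metadata between clients and the MDS and one-sided Read/Write for data between clients and data stores, maps directly onto RDMA verbs. ORION issues its verbs from kernel code, so the following user-space libibverbs sketch is only an illustration of the two verb types; the function names are ours, and an already-connected reliable-connection queue pair, a registered local buffer, and a remote address/rkey exchanged out of band are all assumed.

```c
/* Illustration only: not ORION code. Assumes a connected RC QP (qp), a
 * registered local buffer (local_mr), and the remote address/rkey obtained
 * out of band. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

/* One-sided data access: read an extent straight from a data store's NVMM
 * without involving the remote CPU. */
static int post_one_sided_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                               void *local_buf, uint64_t remote_addr,
                               uint32_t rkey, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}

/* Two-sided metadata access: send a small message (e.g., a log commit or an
 * RPC) that the MDS receives into its RecvQ and handles on its CPU. */
static int post_two_sided_send(struct ibv_qp *qp, struct ibv_mr *local_mr,
                               const void *msg, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)msg,
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 2,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The same asymmetry explains the design: one-sided reads keep data access off the MDS entirely, while small two-sided sends let the MDS serialize metadata updates.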
The ORION File System
• Inherited from NOVA [Xu, FAST '16]
  – Per-inode metadata (operation) log
  – Builds in-DRAM data structures on recovery
  – Atomic log append
• Metadata
  – DMA (physical) memory region (MR)
  – RDMA-able metadata structures
• Data: globally partitioned across data stores
[Figure: MDS and client each hold VFS state (inodes, dentries) and per-inode logs; the client also has a data cache ($); data pages live in data stores X and Y]

Operations in ORION File System: Open(a)
  – Client: allocates an inode
  – Client: issues an RPC via RDMA_Send (RPC: Open, Path: a, WAddr: &alloc)
  – MDS: RDMA_Writes the inode into the allocated space
  – Client: RDMA_Reads the rest of the inode log

Operations in ORION File System: Write(c)
  – Client: allocates and copies-on-write into client-owned pages
  – Client: appends a log entry (Log: FileWrite, Addr=(X, addr), Size=4096)
  – Client: commits the log entry via RDMA_Send
  – MDS: appends the log entry to its copy of the log
  – Client and MDS: update the log tail pointers atomically
(See the sketch after this list.)
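To make the Write(c) sequence concrete, here is a minimal C sketch of the client-side steps. The entry layout, field names, and the rdma_send_to_mds() callback are hypothetical stand-ins rather than ORION's on-media format (ORION's real log entries are inherited from NOVA and live in kernel code); the point is the ordering: persist the copied page, persist the appended log entry, then commit it to the MDS with a small send.

```c
/* Hypothetical sketch of the client write path; structure layout and helper
 * names are illustrative, not ORION's actual format. */
#include <stdint.h>
#include <string.h>
#include <immintrin.h>   /* _mm_clflush, _mm_sfence */

#define CACHELINE 64
#define PAGE_SIZE 4096

/* A FileWrite log entry of the kind shown on the slide:
 * (data-store id, offset within that store) plus the write size. */
struct filewrite_entry {
    uint8_t  type;        /* FileWrite */
    uint16_t ds_id;       /* which data store holds the page, e.g. X, or $ for local */
    uint64_t ds_addr;     /* offset of the page within that data store */
    uint32_t size;        /* 4096 in the example */
} __attribute__((packed));

/* Flush the buffer's cache lines and fence so the bytes are durable in NVMM
 * before the commit message leaves the client. */
static void persist(const void *buf, size_t len)
{
    const char *p = (const char *)buf;
    for (size_t i = 0; i < len; i += CACHELINE)
        _mm_clflush(p + i);
    _mm_sfence();
}

/* Steps 1-3 of Write(c): copy-on-write into a client-owned NVMM page, append
 * the log entry to the local per-inode log, then commit it to the MDS with a
 * small RDMA_Send. rdma_send_to_mds() stands in for the verb post. */
int orion_client_write(void *client_page, const void *user_buf,
                       struct filewrite_entry *log_tail,
                       uint16_t ds_id, uint64_t ds_addr,
                       int (*rdma_send_to_mds)(const void *, size_t))
{
    /* 1. CoW the new data into a page the client owns. */
    memcpy(client_page, user_buf, PAGE_SIZE);
    persist(client_page, PAGE_SIZE);

    /* 2. Append the log entry to the local per-inode log. */
    log_tail->type    = 1;          /* FileWrite */
    log_tail->ds_id   = ds_id;
    log_tail->ds_addr = ds_addr;
    log_tail->size    = PAGE_SIZE;
    persist(log_tail, sizeof(*log_tail));

    /* 3. Commit: ship the ~128 B entry to the MDS, which appends it to its
     * copy of the log; tail pointers are then bumped on both sides. */
    return rdma_send_to_mds(log_tail, sizeof(*log_tail));
}
```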
Operations in ORION File System: Tailcheck(b)
  – MDS: holds a log commit from another client
  – Client: RDMA_Reads the remote log tail
  – Client: reads the missing entries from the MDS if Len(local) < Len(remote)

Operations in ORION File System: Read(b)
  – Client: tailcheck (asynchronous)
  – Client: RDMA_Reads the data from the data store named by the log entry (Type=FileWrite, Addr=(Y, 3), Size=4096)
  – Client: updates the log entry in place to point at the local copy (Addr=($, 0))
• Data locality
  – Future reads will hit the DRAM cache
  – Future writes will go to local NVMM

Accelerating Metadata Accesses
• Observations
  – RDMA prefers inbound operations [Su, EuroSys '17]: one-sided RDMA read (inbound) reaches 15 Mop/s vs. 1.9 Mop/s for outbound writes
  – RDMA prefers small operations [Kalia, ATC '16]: an 8-byte RDMA send takes 1.3 μs vs. 3.3 μs for 512 bytes
• MDS request handling
  – Tailcheck (8 B RDMA_Read): MDS-bypass
  – Log commit (~128 B RDMA_Send): single-inode operations
  – RPC (varies): all other operations, less common

Optimizing Log Commits
• Speculative log commit
  – Return as soon as the RDMA_Send verb is signaled
  – Tailcheck before sending
  – Rebuild the inode from the log when necessary
  – Fall back to RPCs for complex operations (e.g., O_APPEND)
• Log commit + persist: ~500 CPU cycles (memcpy + flush + fence)
[Figure: two clients committing entries #1-#3 to the same inode log on the MDS; a client that observes a divergent MDS tail rebuilds its in-DRAM inode from the log]

Evaluation
• ORION prototype
  – ORION kernel modules (~15K LOC)
  – Linux kernel 4.10
  – RDMA stack: MLNX_OFED 4.3
  – Bound to one core per client
• Networking
  – 12 nodes connected to an InfiniBand switch (QLogic 12300)
• Hardware
  – 2x Intel Westmere-EP CPUs
  – 16 GB DRAM as DRAM, 32 GB DRAM emulating NVMM
  – RNIC: Mellanox ConnectX-2 VPI (40 Gbps)

Evaluation: File Operations
[Charts: latency of Create, Mkdir, Unlink, Rmdir, and FIO 4 KB read/write for Orion vs. the distributed file systems Ceph and Gluster, and for Orion vs. the local file systems NOVA and Ext4-DAX; lower is better. Annotated gaps reach 162x against the distributed file systems and stay within roughly 1.3-3x of the local ones.]

Evaluation: Applications
[Chart: Filebench workloads (varmail, fileserver, webserver) and MongoDB (YCSB-A) throughput relative to NOVA, for Orion, NOVA, Ext4-DAX, Ceph, and Gluster; higher is better. Annotations mark Orion at 85% and 68% of NOVA on two of the workloads.]

Evaluation: Metadata Accesses
[Charts: throughput of the three metadata operation types (tailcheck, log commit, RPC) as the number of clients scales from 1 to 8, relative to a single client, and absolute throughput (Mop/s) with 8 clients; higher is better.]

Conclusion
• Existing distributed file systems lack NVMM support and carry significant software overhead
• ORION unifies the NVMM file system and the networking layer
• ORION provides fast metadata access
• ORION allows DAX access to local NVMM data
• Performance is comparable to local NVMM file systems