ORION: A Distributed File System for Non-Volatile Main Memories and RDMA-Capable Networks

Jian Yang, Joseph Izraelevitz, Steven Swanson

Non-Volatile Systems Laboratory, Department of Computer Science & Engineering, University of California, San Diego

1 Accessing NVMM as Remote Storage

[Figure: applications on each node access local DRAM and NVMM; nodes are connected by an RDMA network]

Non-Volatile Main Memory (NVMM)
• PCM, STT-RAM, ReRAM, 3D XPoint
• Performant: DRAM-class latency/bandwidth
• Byte-addressable
• Persistent over power failures

Remote Direct Memory Access (RDMA)
• DMA from/to remote memory
• Two-sided verbs (Send/Recv)
• One-sided verbs (Read/Write)
• Bypasses remote CPU

2 Accessing Local NVMM vs. Remote NVMM

[Figure: accessing local NVMM through an NVMM file system vs. accessing remote NVMM through a distributed file system over RDMA]

Latency of write(4KB) + fsync() (lower is better):
• NVMM FS (NOVA): 3.86 μs
• iSCSI/RDMA: 19x NOVA's latency
• Ceph/RDMA: 241x NOVA's latency

Throughput of the fileserver workload (higher is better):
• NVMM FS (NOVA): 6653 MB/s
• iSCSI/RDMA: 19% of NOVA
• Ceph/RDMA: 5% of NOVA

3 Issue #1: Existing Dist. FSs are Slow on NVMM

[Figure: an application on the storage client issues file requests through a file system/FUSE layer, which sends file-access RPCs to the file system on the storage server, which accesses NVMM]

• Layered design
• Indirection overhead
• Expensive to persist (e.g., fsync())

4 NVMM is Faster than RDMA

                          Hard drive (NVMe)        RDMA Network         NVMM
Latency                   70 μs               >>   3 μs                 300 ns
Bandwidth                 1.3-3.2 GB/s        <    5 GB/s               2-16 GB/s
Access Size (@ max BW)    4 KB (page)              2 KB (MTU)           64 byte (cacheline)

Networking is faster than storage

5 NVMM is Faster than RDMA

                          Hard drive (NVMe)        RDMA Network         NVMM
Latency                   70 μs                    3 μs            >>   300 ns
Bandwidth                 1.3-3.2 GB/s             5 GB/s          <    2-16 GB/s
Access Size (@ max BW)    4 KB (page)              2 KB (MTU)           64 byte (cacheline)

NVMM is faster than networking

6 Issue #2: Lack of Support for Local NVMM

[Figure: a storage client with an application, a local NVMM FS with DAX, and a distributed FS with a page cache; local NVMM sits next to remote NVMM reached over RDMA via a storage server]

• Use case of converged storage
  – Local NVMM supports Direct Access (DAX)
• Existing systems do not store data at local NVMM
• Running a local FS and a distributed FS side by side
  – Expensive to move data

7 ORION: A Distributed File System for NVMM and RDMA-Capable Networks

[Figure: applications issue POSIX I/O to ORION, which accesses pooled NVMM through DAX and RDMA]

• A clean-slate design for NVMM and RDMA
• A unified layer: kernel FS + networking
• Pooled NVMM storage
• Accesses metadata/data directly over DAX and RDMA
• Designed for rack-scale scalability

8 Outline

• Background
• Design overview
• Metadata and data management
• RDMA persistence
• Evaluation
• Conclusion

9 ORION: Cluster Overview

[Figure: clients sync metadata with the MDS (two-sided) and read/write data on pooled data stores (one-sided), with DAX to local NVMM and a DRAM data cache]

• Metadata Server (MDS): runs ORIONFS, keeps the authoritative metadata of the whole FS
• Client: runs ORIONFS, keeps active metadata and cached data; accesses local NVMM
• Data Store (DS): pooled NVMM data
• Metadata access: Clients <=> MDS (two-sided)
• Data access: Clients => DSs (one-sided)

10 The ORION File System

[Figure: MDS and client each hold VFS structures (inodes, dentries) and per-inode logs; data pages are partitioned across data stores and the client's data cache]

• Inherited from NOVA [Xu, FAST 16]
  – Per-inode metadata (operation) log
  – Build in-DRAM data structures on recovery
  – Atomic log append
• Metadata:
  – DMA (physical) memory region (MR)
  – RDMA-able metadata structures (see the sketch below)
• Data: globally partitioned across data stores
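To make the layout concrete, here is a minimal C sketch of what an RDMA-able per-inode log might look like. The struct names and fields are illustrative assumptions, not Orion's actual on-NVMM definitions; the point is that every metadata update is a small, self-describing entry appended to a log that lives in a DMA-registered memory region, so it can be read and written directly with RDMA verbs.

/* Illustrative sketch only: names and layout are assumptions, not Orion's code. */
#include <stdint.h>

struct orion_log_entry {
    uint8_t  type;        /* e.g., FileWrite, SetAttr, Link, ...       */
    uint8_t  pad[3];
    uint32_t size;        /* length of the affected range              */
    uint64_t addr;        /* (data store, offset) of the data pages    */
    uint64_t mtime;
    uint32_t csum;        /* lets a reader detect a torn entry         */
    uint32_t committed;   /* written last, after the body is durable   */
};

struct orion_inode_log {
    uint64_t head;        /* oldest valid entry                        */
    uint64_t tail;        /* one past the newest entry; reading this
                             8-byte field remotely is a "tailcheck"    */
    struct orion_log_entry entries[];
};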

11 Operations in ORION File System

[Figure: MDS and client, each holding inodes and per-inode logs; the client posts "RPC: Open, Path: a, WAddr: &alloc" to the MDS RecvQ]

• Open(a)
  – Client: allocate the inode

12 Operations in ORION File System

• Open(a)
  – Client: allocate the inode
  – Client: issue an RPC via RDMA_Send

13 Operations in ORION File System

• Open(a)
  – Client: allocate the inode
  – Client: issue an RPC via RDMA_Send
  – MDS: RDMA_Write to the allocated space

14 Operations in ORION File System

• Open(a)
  – Client: allocate the inode
  – Client: issue an RPC via RDMA_Send
  – MDS: RDMA_Write to the allocated space
  – Client: RDMA_Read the rest of the log (see the sketch below)
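A minimal C sketch of this client-side open path follows. Everything in it, the struct layout, orion_rpc_send(), wait_mds_write(), and orion_rdma_read(), is a hypothetical stand-in for Orion's kernel RDMA plumbing; it only shows the ordering of the send, the MDS write-back, and the follow-up read.

/* Hypothetical sketch of the Open(a) sequence; these helpers and types
 * stand in for Orion's kernel RDMA layer and are not its real API. */
#include <stdint.h>
#include <string.h>

struct orion_inode {
    uint64_t log_raddr;   /* MDS-side address of this inode's log      */
    uint64_t log_len;     /* bytes of log left to fetch                */
    char     log[4096];   /* local copy of the per-inode log           */
};

int orion_rpc_send(int op, const void *msg, size_t len);          /* RDMA_Send */
int wait_mds_write(struct orion_inode *ino);  /* MDS RDMA_Writes into *ino     */
int orion_rdma_read(void *dst, uint64_t raddr, size_t len);       /* one-sided */

struct open_rpc {
    uint64_t waddr;       /* where the MDS should RDMA_Write the inode */
    char     path[128];
};

int orion_open(struct orion_inode *ino, const char *path)
{
    /* 1. The client allocates the inode (and log buffer) locally. */
    struct open_rpc req = { .waddr = (uint64_t)(uintptr_t)ino };
    strncpy(req.path, path, sizeof(req.path) - 1);

    /* 2. The client issues the RPC with a single RDMA_Send. */
    orion_rpc_send(/* OP_OPEN */ 1, &req, sizeof(req));

    /* 3. The MDS RDMA_Writes the inode into the space named by waddr. */
    wait_mds_write(ino);

    /* 4. The client RDMA_Reads the rest of the inode log from the MDS. */
    return orion_rdma_read(ino->log, ino->log_raddr, ino->log_len);
}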

15 Operations in ORION File System

• Write(c)
  – Client: allocate & CoW to client-owned pages

16 Operations in ORION File System

[Figure: the client's inode log gains the entry "Log: FileWrite, Addr=(X,addr), Size=4096"]

• Write(c)
  – Client: allocate & CoW to client-owned pages
  – Client: append the log entry

17 Operations in ORION File System

• Write(c)
  – Client: allocate & CoW to client-owned pages
  – Client: append the log entry
  – Client: commit the log entry via RDMA_Send

18 Operations in ORION File System

• Write(c)
  – Client: allocate & CoW to client-owned pages
  – Client: append the log entry
  – Client: commit the log entry via RDMA_Send
  – MDS: append the log entry

19 Operations in ORION File System

• Write(c)
  – Client: allocate & CoW to client-owned pages
  – Client: append the log entry
  – Client: commit the log entry via RDMA_Send
  – MDS: append the log entry
  – Client and MDS: update tail pointers atomically (see the sketch below)
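A similarly hedged C sketch of the write path: the helpers (cow_alloc_pages(), persist(), append_local_log(), orion_log_commit(), to_global_addr()) are hypothetical and only mark where the copy-on-write, the local log append, and the small RDMA_Send commit happen.

/* Hypothetical sketch of the Write(c) sequence; helpers and the entry
 * layout are stand-ins, not Orion's real code. */
#include <stdint.h>
#include <string.h>

struct orion_inode;                                   /* from the earlier sketch */

struct file_write_entry {                             /* fits a ~128B RDMA_Send  */
    uint32_t type;                                    /* FileWrite               */
    uint32_t size;
    uint64_t addr;                                    /* (data store, offset)    */
    uint64_t file_offset;
};

void    *cow_alloc_pages(size_t len);                 /* client-owned NVMM pages */
uint64_t to_global_addr(const void *p);
void     persist(const void *p, size_t len);          /* flush + fence           */
void     append_local_log(struct orion_inode *ino, const void *e, size_t len);
int      orion_log_commit(const void *e, size_t len); /* small RDMA_Send to MDS  */

int orion_write(struct orion_inode *ino, uint64_t off, const void *buf, size_t len)
{
    /* 1. Allocate client-owned pages and copy-on-write the data into them. */
    void *pages = cow_alloc_pages(len);
    memcpy(pages, buf, len);
    persist(pages, len);

    /* 2. Append the log entry to the local per-inode log and persist it. */
    struct file_write_entry e = {
        .type = 1 /* FileWrite */, .size = (uint32_t)len,
        .addr = to_global_addr(pages), .file_offset = off,
    };
    append_local_log(ino, &e, sizeof(e));

    /* 3. Commit the entry to the MDS with one small RDMA_Send; the MDS
     *    appends it too, and both sides advance the log tail with an
     *    atomic 8-byte pointer update. */
    return orion_log_commit(&e, sizeof(e));
}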

20 Operations in ORION File System

• Tailcheck(b)
  – MDS: a log commit from another client arrives

21 Operations in ORION File System

• Tailcheck(b)
  – MDS: a log commit from another client arrives
  – Client: RDMA_Read the remote log tail

22 Operations in ORION File System

• Tailcheck(b)
  – MDS: a log commit from another client arrives
  – Client: RDMA_Read the remote log tail
  – Client: read from the MDS if Len(local) < Len(remote) (see the sketch below)
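For illustration, this is what an 8-byte tailcheck looks like as a user-space libibverbs call; Orion issues the equivalent verb from its kernel module, and the queue pair, memory registration, remote address, and rkey are assumed to be set up elsewhere.

/* User-space libibverbs illustration of a tailcheck: an 8-byte one-sided
 * RDMA_Read of the remote inode-log tail. Not Orion's kernel code. */
#include <infiniband/verbs.h>
#include <stdint.h>

int tailcheck(struct ibv_qp *qp, struct ibv_mr *mr, uint64_t *local_tail,
              uint64_t remote_tail_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_tail,   /* where the remote tail lands      */
        .length = sizeof(uint64_t),        /* 8 bytes keeps the read cheap     */
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,    /* one-sided: bypasses the MDS CPU  */
        .send_flags = IBV_SEND_SIGNALED,
        .wr         = { .rdma = { .remote_addr = remote_tail_addr,
                                  .rkey = rkey } },
    };
    struct ibv_send_wr *bad = NULL;

    /* Post the read; the caller polls the completion queue, then compares
     * the fetched tail against its local log length. */
    return ibv_post_send(qp, &wr, &bad);
}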

23 Operations in ORION File System

[Figure: the log entry "Type=FileWrite, Addr=(Y,3), Size=4096" points at page 3 on data store Y]

• Read(b)
  – Client: tailcheck (async)
  – Client: RDMA_Read from the data store
• Data locality
  – Future reads will hit the DRAM cache
  – Future writes will go to local NVMM

24 Operations in ORION File System

[Figure: after the in-place update, the entry reads "Type=FileWrite, Addr=($,0), Size=4096" and points at the local data cache]

• Read(b)
  – Client: tailcheck (async)
  – Client: RDMA_Read from the data store
  – Client: in-place update to the log entry (see the sketch below)
• Data locality
  – Future reads will hit the DRAM cache
  – Future writes will go to local NVMM
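A hedged C sketch of this read path, again with hypothetical helpers; the point is the in-place rewrite of the log entry's address so later accesses stay local.

/* Hypothetical sketch of the Read(b) sequence and the locality rewrite;
 * helpers and the simplified entry layout are stand-ins, not Orion code. */
#include <stdint.h>
#include <string.h>

struct orion_inode;                                   /* from the earlier sketch */

struct log_entry { uint32_t type, size; uint64_t addr; };   /* simplified */

void     tailcheck_async(struct orion_inode *ino);    /* 8-byte RDMA_Read        */
int      rdma_read_data(void *dst, uint64_t addr, size_t len);  /* from a DS     */
void    *cache_alloc(size_t len);                     /* local DRAM/NVMM page    */
uint64_t local_global_addr(const void *p);

int orion_read(struct orion_inode *ino, struct log_entry *ent, void *buf)
{
    /* 1. Kick off an asynchronous tailcheck to pick up remote updates. */
    tailcheck_async(ino);

    /* 2. One-sided RDMA_Read of the pages named by the log entry,
     *    straight from the data store into a local cache page. */
    void *local = cache_alloc(ent->size);
    int err = rdma_read_data(local, ent->addr, ent->size);
    if (err)
        return err;
    memcpy(buf, local, ent->size);

    /* 3. Update the log entry in place to point at the local copy, so
     *    future reads hit the cache and future writes go to local NVMM. */
    ent->addr = local_global_addr(local);
    return 0;
}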

25 Accelerating Metadata Accesses

• Observations:
  – RDMA prefers inbound operations [Su, EuroSys 17]
  – RDMA prefers small operations [Kalia, ATC 16]

One-sided RDMA throughput: Read [inbound] 15 Mop/s vs. Write [outbound] 1.9 Mop/s
RDMA Send latency: 8 bytes 1.3 μs vs. 512 bytes 3.3 μs

• MDS request handling:
  – Tailcheck (8B RDMA_Read): MDS-bypass
  – Log commit (~128B RDMA_Send): single-inode operations
  – RPC (varies): other operations, less common

26 Optimizing Log Commits

[Figure: two clients commit entries to their inode logs; the MDS tail and a client's local tail can diverge, and the client rebuilds its inode from the log]

• Speculative log commit:
  – Return when the RDMA_Send verb is signaled
  – Tailcheck before send
  – Rebuild the inode from the log when necessary
  – RPCs for complex operations (e.g., O_APPEND)

• Log commit + persist: ~500 CPU cycles (memcpy + flush + fence); see the sketch below
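The "log commit + persist" cost comes from a memcpy followed by cache-line write-backs and a fence. Below is a minimal user-space sketch of that primitive; Orion does the equivalent inside its kernel module, and _mm_clwb assumes CLWB support (compile with -mclwb).

/* User-space sketch of "log commit + persist": copy the entry into the
 * NVMM-backed log, write the cachelines back, then fence. Not Orion's code. */
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

#define CACHELINE 64

static void persist_range(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHELINE)
        _mm_clwb((void *)p);   /* write dirty cachelines back to NVMM      */
    _mm_sfence();              /* order the write-backs before proceeding  */
}

/* Append one entry at the tail, persist it, then persist the new tail, so
 * recovery never observes a tail that points past unpersisted entries. */
static void log_append(void *log_base, uint64_t *tail,
                       const void *entry, size_t entry_len)
{
    void *dst = (char *)log_base + *tail;
    memcpy(dst, entry, entry_len);       /* the memcpy part of the ~500 cycles */
    persist_range(dst, entry_len);       /* the flush + fence part             */
    *tail += entry_len;
    persist_range(tail, sizeof(*tail));  /* 8-byte tail update is atomic       */
}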

27 Evaluation

ORION Prototype
• ORION kernel modules (~15K LOC)
• Kernel 4.10
• RDMA stack: MLNX_OFED 4.3
• Bind to 1 core for each client

Networking
• 12 nodes connected to a switch
• InfiniBand switch (QLogic 12300)

Hardware
• 2x Intel Westmere-EP CPUs
• 16 GB DRAM as DRAM
• 32 GB DRAM as NVMM
• RNIC: Mellanox ConnectX-2 VPI (40 Gbps)

28 Evaluation: File Operations

[Chart: access latency (μs) of Create, Mkdir, Unlink, Rmdir, and FIO 4K read/write (lower is better); left: Orion vs. a distributed file system (Ceph); right: Orion vs. local file systems (NOVA, Ext4-DAX)]

29 Evaluation: Applications

[Chart: throughput of Filebench workloads (varmail, fileserver, webserver) and MongoDB (YCSB-A), relative to NOVA (higher is better), for Orion, NOVA, Ext4-DAX, Ceph, and Gluster; bars annotated at 85% and 68% of NOVA]

30 Evaluation: Metadata Accesses

[Chart: left, throughput of metadata operations (Tailcheck, Log Commit, RPC) relative to a single client as the client count scales from 1 to 8 (higher is better); right, absolute throughput (Mop/s) of Tailcheck, Log Commit, and RPC with 8 clients]

31 Conclusion

• Existing distributed file systems lack NVMM support and have significant software overhead

• ORION unifies the NVMM file system and the networking layer

• ORION provides fast metadata accesses

• ORION allows DAX access to local NVMM data

• Performance comparable to local NVMM file systems
