ORION: A Distributed File System for Non-Volatile Main Memories and RDMA-Capable Networks

Jian Yang, Joseph Izraelevitz, Steven Swanson

Non-Volatile Systems Laboratory, Department of Computer Science & Engineering, University of California, San Diego

1 Accessing NVMM as Remote Storage

[Figure: applications on each node access local DRAM and NVMM; nodes are connected by an RDMA network]

Non-Volatile Main Memory (NVMM)
• PCM, STT-RAM, ReRAM, 3D XPoint
• Performant: DRAM-class latency/bandwidth
• Byte-addressable
• Persistent over power failures

Remote Direct Memory Access (RDMA)
• DMA from/to remote memory
• Two-sided verbs (Send/Recv)
• One-sided verbs (Read/Write)
• Bypasses remote CPU

2 Accessing Local NVMM vs. Remote NVMM

[Figure: accessing local NVMM through an NVMM file system vs. accessing remote NVMM through a distributed file system over RDMA]

Latency of write(4KB) + fsync() (lower is better):
• NVMM FS (NOVA): 3.86 μs
• iSCSI/RDMA: 19x NOVA's latency
• Ceph/RDMA: 241x NOVA's latency

Throughput of the fileserver workload (higher is better):
• NVMM FS (NOVA): 6653 MB/s
• iSCSI/RDMA: 19% of NOVA
• Ceph/RDMA: 5% of NOVA

3 Issue #1: Existing Dist. FSs are Slow on NVMM

[Figure: an application on the storage client issues file requests through a file system/FUSE layer, which sends file-access RPCs to the file system on the storage server, which accesses NVMM]

• Layered design
• Indirection overhead
• Expensive to persist (e.g., fsync())

4 NVMM is Faster than RDMA

                          Hard drive (NVMe)        RDMA Network         NVMM
Latency                   70 μs               >>   3 μs                 300 ns
Bandwidth                 1.3-3.2 GB/s        <    5 GB/s               2-16 GB/s
Access Size (@ max BW)    4 KB (page)              2 KB (MTU)           64 byte (cacheline)

Networking is faster than storage

5 NVMM is Faster than RDMA

                          Hard drive (NVMe)        RDMA Network         NVMM
Latency                   70 μs                    3 μs            >>   300 ns
Bandwidth                 1.3-3.2 GB/s             5 GB/s          <    2-16 GB/s
Access Size (@ max BW)    4 KB (page)              2 KB (MTU)           64 byte (cacheline)

NVMM is faster than networking

6 Issue #2: Lack of Support for Local NVMM

[Figure: a storage client with an application, a local NVMM FS with DAX, and a distributed FS with a page cache; local NVMM sits next to remote NVMM reached over RDMA via a storage server]

• Use case of converged storage
  – Local NVMM supports Direct Access (DAX)
• Existing systems do not store data at local NVMM
• Running a local FS and a distributed FS side by side
  – Expensive to move data

7 ORION: A Distributed File System for NVMM and RDMA-Capable Networks

[Figure: applications issue POSIX I/O to ORION, which accesses pooled NVMM through DAX and RDMA]

• A clean-slate design for NVMM and RDMA
• A unified layer: kernel FS + networking
• Pooled NVMM storage
• Accesses metadata/data directly over DAX and RDMA
• Designed for rack-scale scalability

8 Outline

• Background
• Design overview
• Metadata and data management
• RDMA persistence
• Evaluation
• Conclusion

9 ORION: Cluster Overview

[Figure: clients sync metadata with the MDS (two-sided) and read/write data on pooled data stores (one-sided), with DAX to local NVMM and a DRAM data cache]

• Metadata Server (MDS): runs ORIONFS, keeps the authoritative metadata of the whole FS
• Client: runs ORIONFS, keeps active metadata and cached data; accesses local NVMM
• Data Store (DS): pooled NVMM data
• Metadata access: Clients <=> MDS (two-sided)
• Data access: Clients => DSs (one-sided)

10 The ORION File System

[Figure: MDS and client each hold VFS structures (inodes, dentries) and per-inode logs; data pages are partitioned across data stores and the client's data cache]

• Inherited from NOVA [Xu, FAST 16]
  – Per-inode metadata (operation) log
  – Build in-DRAM data structures on recovery
  – Atomic log append
• Metadata:
  – DMA (physical) memory region (MR)
  – RDMA-able metadata structures (see the sketch below)
• Data: globally partitioned across data stores
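To make the layout concrete, here is a minimal C sketch of what an RDMA-able per-inode log might look like. The struct names and fields are illustrative assumptions, not Orion's actual on-NVMM definitions; the point is that every metadata update is a small, self-describing entry appended to a log that lives in a DMA-registered memory region, so it can be read and written directly with RDMA verbs.

/* Illustrative sketch only: names and layout are assumptions, not Orion's code. */
#include <stdint.h>

struct orion_log_entry {
    uint8_t  type;        /* e.g., FileWrite, SetAttr, Link, ...       */
    uint8_t  pad[3];
    uint32_t size;        /* length of the affected range              */
    uint64_t addr;        /* (data store, offset) of the data pages    */
    uint64_t mtime;
    uint32_t csum;        /* lets a reader detect a torn entry         */
    uint32_t committed;   /* written last, after the body is durable   */
};

struct orion_inode_log {
    uint64_t head;        /* oldest valid entry                        */
    uint64_t tail;        /* one past the newest entry; reading this
                             8-byte field remotely is a "tailcheck"    */
    struct orion_log_entry entries[];
};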

11 Operations in ORION File System

[Figure: MDS and client, each holding inodes and per-inode logs; the client posts "RPC: Open, Path: a, WAddr: &alloc" to the MDS RecvQ]

• Open(a)
  – Client: allocate the inode

12 Operations in ORION File System

• Open(a)
  – Client: allocate the inode
  – Client: issue an RPC via RDMA_Send

13 Operations in ORION File System

• Open(a)
  – Client: allocate the inode
  – Client: issue an RPC via RDMA_Send
  – MDS: RDMA_Write to the allocated space

14 Operations in ORION File System

• Open(a)
  – Client: allocate the inode
  – Client: issue an RPC via RDMA_Send
  – MDS: RDMA_Write to the allocated space
  – Client: RDMA_Read the rest of the log (see the sketch below)
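A minimal C sketch of this client-side open path follows. Everything in it, the struct layout, orion_rpc_send(), wait_mds_write(), and orion_rdma_read(), is a hypothetical stand-in for Orion's kernel RDMA plumbing; it only shows the ordering of the send, the MDS write-back, and the follow-up read.

/* Hypothetical sketch of the Open(a) sequence; these helpers and types
 * stand in for Orion's kernel RDMA layer and are not its real API. */
#include <stdint.h>
#include <string.h>

struct orion_inode {
    uint64_t log_raddr;   /* MDS-side address of this inode's log      */
    uint64_t log_len;     /* bytes of log left to fetch                */
    char     log[4096];   /* local copy of the per-inode log           */
};

int orion_rpc_send(int op, const void *msg, size_t len);          /* RDMA_Send */
int wait_mds_write(struct orion_inode *ino);  /* MDS RDMA_Writes into *ino     */
int orion_rdma_read(void *dst, uint64_t raddr, size_t len);       /* one-sided */

struct open_rpc {
    uint64_t waddr;       /* where the MDS should RDMA_Write the inode */
    char     path[128];
};

int orion_open(struct orion_inode *ino, const char *path)
{
    /* 1. The client allocates the inode (and log buffer) locally. */
    struct open_rpc req = { .waddr = (uint64_t)(uintptr_t)ino };
    strncpy(req.path, path, sizeof(req.path) - 1);

    /* 2. The client issues the RPC with a single RDMA_Send. */
    orion_rpc_send(/* OP_OPEN */ 1, &req, sizeof(req));

    /* 3. The MDS RDMA_Writes the inode into the space named by waddr. */
    wait_mds_write(ino);

    /* 4. The client RDMA_Reads the rest of the inode log from the MDS. */
    return orion_rdma_read(ino->log, ino->log_raddr, ino->log_len);
}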

15 Operations in ORION File System

• Write(c)
  – Client: allocate & CoW to client-owned pages

16 Operations in ORION File System

[Figure: the client's inode log gains the entry "Log: FileWrite, Addr=(X,addr), Size=4096"]

• Write(c)
  – Client: allocate & CoW to client-owned pages
  – Client: append the log entry

17 Operations in ORION File System

• Write(c)
  – Client: allocate & CoW to client-owned pages
  – Client: append the log entry
  – Client: commit the log entry via RDMA_Send

18 Operations in ORION File System

• Write(c)
  – Client: allocate & CoW to client-owned pages
  – Client: append the log entry
  – Client: commit the log entry via RDMA_Send
  – MDS: append the log entry

19 Operations in ORION File System

• Write(c)
  – Client: allocate & CoW to client-owned pages
  – Client: append the log entry
  – Client: commit the log entry via RDMA_Send
  – MDS: append the log entry
  – Client and MDS: update tail pointers atomically (see the sketch below)
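A similarly hedged C sketch of the write path: the helpers (cow_alloc_pages(), persist(), append_local_log(), orion_log_commit(), to_global_addr()) are hypothetical and only mark where the copy-on-write, the local log append, and the small RDMA_Send commit happen.

/* Hypothetical sketch of the Write(c) sequence; helpers and the entry
 * layout are stand-ins, not Orion's real code. */
#include <stdint.h>
#include <string.h>

struct orion_inode;                                   /* from the earlier sketch */

struct file_write_entry {                             /* fits a ~128B RDMA_Send  */
    uint32_t type;                                    /* FileWrite               */
    uint32_t size;
    uint64_t addr;                                    /* (data store, offset)    */
    uint64_t file_offset;
};

void    *cow_alloc_pages(size_t len);                 /* client-owned NVMM pages */
uint64_t to_global_addr(const void *p);
void     persist(const void *p, size_t len);          /* flush + fence           */
void     append_local_log(struct orion_inode *ino, const void *e, size_t len);
int      orion_log_commit(const void *e, size_t len); /* small RDMA_Send to MDS  */

int orion_write(struct orion_inode *ino, uint64_t off, const void *buf, size_t len)
{
    /* 1. Allocate client-owned pages and copy-on-write the data into them. */
    void *pages = cow_alloc_pages(len);
    memcpy(pages, buf, len);
    persist(pages, len);

    /* 2. Append the log entry to the local per-inode log and persist it. */
    struct file_write_entry e = {
        .type = 1 /* FileWrite */, .size = (uint32_t)len,
        .addr = to_global_addr(pages), .file_offset = off,
    };
    append_local_log(ino, &e, sizeof(e));

    /* 3. Commit the entry to the MDS with one small RDMA_Send; the MDS
     *    appends it too, and both sides advance the log tail with an
     *    atomic 8-byte pointer update. */
    return orion_log_commit(&e, sizeof(e));
}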

20 Operations in ORION File System

• Tailcheck(b)
  – MDS: a log commit from another client arrives

21 Operations in ORION File System

• Tailcheck(b)
  – MDS: a log commit from another client arrives
  – Client: RDMA_Read the remote log tail

22 Operations in ORION File System

• Tailcheck(b)
  – MDS: a log commit from another client arrives
  – Client: RDMA_Read the remote log tail
  – Client: read from the MDS if Len(local) < Len(remote) (see the sketch below)
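For illustration, this is what an 8-byte tailcheck looks like as a user-space libibverbs call; Orion issues the equivalent verb from its kernel module, and the queue pair, memory registration, remote address, and rkey are assumed to be set up elsewhere.

/* User-space libibverbs illustration of a tailcheck: an 8-byte one-sided
 * RDMA_Read of the remote inode-log tail. Not Orion's kernel code. */
#include <infiniband/verbs.h>
#include <stdint.h>

int tailcheck(struct ibv_qp *qp, struct ibv_mr *mr, uint64_t *local_tail,
              uint64_t remote_tail_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_tail,   /* where the remote tail lands      */
        .length = sizeof(uint64_t),        /* 8 bytes keeps the read cheap     */
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,    /* one-sided: bypasses the MDS CPU  */
        .send_flags = IBV_SEND_SIGNALED,
        .wr         = { .rdma = { .remote_addr = remote_tail_addr,
                                  .rkey = rkey } },
    };
    struct ibv_send_wr *bad = NULL;

    /* Post the read; the caller polls the completion queue, then compares
     * the fetched tail against its local log length. */
    return ibv_post_send(qp, &wr, &bad);
}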

23 Operations in ORION File System

[Figure: the log entry "Type=FileWrite, Addr=(Y,3), Size=4096" points at page 3 on data store Y]

• Read(b)
  – Client: tailcheck (async)
  – Client: RDMA_Read from the data store
• Data locality
  – Future reads will hit the DRAM cache
  – Future writes will go to local NVMM

24 Operations in ORION File System

[Figure: after the in-place update, the entry reads "Type=FileWrite, Addr=($,0), Size=4096" and points at the local data cache]

• Read(b)
  – Client: tailcheck (async)
  – Client: RDMA_Read from the data store
  – Client: in-place update to the log entry (see the sketch below)
• Data locality
  – Future reads will hit the DRAM cache
  – Future writes will go to local NVMM
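A hedged C sketch of this read path, again with hypothetical helpers; the point is the in-place rewrite of the log entry's address so later accesses stay local.

/* Hypothetical sketch of the Read(b) sequence and the locality rewrite;
 * helpers and the simplified entry layout are stand-ins, not Orion code. */
#include <stdint.h>
#include <string.h>

struct orion_inode;                                   /* from the earlier sketch */

struct log_entry { uint32_t type, size; uint64_t addr; };   /* simplified */

void     tailcheck_async(struct orion_inode *ino);    /* 8-byte RDMA_Read        */
int      rdma_read_data(void *dst, uint64_t addr, size_t len);  /* from a DS     */
void    *cache_alloc(size_t len);                     /* local DRAM/NVMM page    */
uint64_t local_global_addr(const void *p);

int orion_read(struct orion_inode *ino, struct log_entry *ent, void *buf)
{
    /* 1. Kick off an asynchronous tailcheck to pick up remote updates. */
    tailcheck_async(ino);

    /* 2. One-sided RDMA_Read of the pages named by the log entry,
     *    straight from the data store into a local cache page. */
    void *local = cache_alloc(ent->size);
    int err = rdma_read_data(local, ent->addr, ent->size);
    if (err)
        return err;
    memcpy(buf, local, ent->size);

    /* 3. Update the log entry in place to point at the local copy, so
     *    future reads hit the cache and future writes go to local NVMM. */
    ent->addr = local_global_addr(local);
    return 0;
}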

25 Accelerating Metadata Accesses

• Observations:
  – RDMA prefers inbound operations [Su, EuroSys 17]
  – RDMA prefers small operations [Kalia, ATC 16]

One-sided RDMA throughput: Read [inbound] 15 Mop/s vs. Write [outbound] 1.9 Mop/s
RDMA Send latency: 8 bytes 1.3 μs vs. 512 bytes 3.3 μs

• MDS request handling:
  – Tailcheck (8B RDMA_Read): MDS-bypass
  – Log commit (~128B RDMA_Send): single-inode operations
  – RPC (varies): other operations, less common

26 Optimizing Log Commits

[Figure: two clients commit entries to their inode logs; the MDS tail and a client's local tail can diverge, and the client rebuilds its inode from the log]

• Speculative log commit:
  – Return when the RDMA_Send verb is signaled
  – Tailcheck before send
  – Rebuild the inode from the log when necessary
  – RPCs for complex operations (e.g., O_APPEND)

• Log commit + persist: ~500 CPU cycles (memcpy + flush + fence); see the sketch below
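The "log commit + persist" cost comes from a memcpy followed by cache-line write-backs and a fence. Below is a minimal user-space sketch of that primitive; Orion does the equivalent inside its kernel module, and _mm_clwb assumes CLWB support (compile with -mclwb).

/* User-space sketch of "log commit + persist": copy the entry into the
 * NVMM-backed log, write the cachelines back, then fence. Not Orion's code. */
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

#define CACHELINE 64

static void persist_range(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHELINE)
        _mm_clwb((void *)p);   /* write dirty cachelines back to NVMM      */
    _mm_sfence();              /* order the write-backs before proceeding  */
}

/* Append one entry at the tail, persist it, then persist the new tail, so
 * recovery never observes a tail that points past unpersisted entries. */
static void log_append(void *log_base, uint64_t *tail,
                       const void *entry, size_t entry_len)
{
    void *dst = (char *)log_base + *tail;
    memcpy(dst, entry, entry_len);       /* the memcpy part of the ~500 cycles */
    persist_range(dst, entry_len);       /* the flush + fence part             */
    *tail += entry_len;
    persist_range(tail, sizeof(*tail));  /* 8-byte tail update is atomic       */
}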

27 Evaluation

ORION Prototype
• ORION kernel modules (~15K LOC)
• Kernel 4.10
• RDMA stack: MLNX_OFED 4.3
• Bind to 1 core for each client

Networking
• 12 nodes connected to a switch
• InfiniBand switch (QLogic 12300)

Hardware
• 2x Intel Westmere-EP CPUs
• 16 GB DRAM as DRAM
• 32 GB DRAM as NVMM
• RNIC: Mellanox ConnectX-2 VPI (40 Gbps)

28 Evaluation: File Operations

[Chart: access latency (μs) of Create, Mkdir, Unlink, Rmdir, and FIO 4K read/write (lower is better); left: Orion vs. a distributed file system (Ceph); right: Orion vs. local file systems (NOVA, Ext4-DAX)]

29 Evaluation: Applications

[Chart: throughput of Filebench workloads (varmail, fileserver, webserver) and MongoDB (YCSB-A), relative to NOVA (higher is better), for Orion, NOVA, Ext4-DAX, Ceph, and Gluster; bars annotated at 85% and 68% of NOVA]

30 Evaluation: Metadata Accesses

[Chart: left, throughput of metadata operations (Tailcheck, Log Commit, RPC) relative to a single client as the client count scales from 1 to 8 (higher is better); right, absolute throughput (Mop/s) of Tailcheck, Log Commit, and RPC with 8 clients]

31 Conclusion

• Existing distributed file systems lack NVMM support and have significant software overhead

• ORION unifies the NVMM file system and the networking layer

• ORION provides fast metadata accesses

• ORION allows DAX access to local NVMM data

• Performance comparable to local NVMM file systems
