ReFlex: Remote Flash at the Performance of Local Flash

Ana Klimovic, Heiner Litz, Christos Kozyrakis

Flash Memory Summit 2017, Santa Clara, CA
Stanford MAST

Flash in Datacenters

§ NVMe Flash
• 1,000× higher throughput than disk (1M IOPS)
• 100× lower latency than disk (50–70 µs)

§ But Flash is often underutilized due to imbalanced resource requirements

Example Datacenter Flash Use-Case

[Diagram: Application tier and datastore tier. App clients send get(k)/put(k,val) requests over TCP/IP to key-value store software running on datastore servers with NIC, CPU, RAM, and Flash hardware.]

Imbalanced Resource Utilization

Sample utilization of Facebook servers hosting a Flash-based key-value store over 6 months:

[Figure: CPU and Flash utilization over time.]

Flash capacity and bandwidth are underutilized for long periods of time.

[EuroSys'16] Flash storage disaggregation. Ana Klimovic, Christos Kozyrakis, Eno Thereska, Binu John, Sanjeev Kumar.

Solution: Remote Access to Storage

§ Improve resource utilization by sharing NVMe Flash between remote tenants

§ There are 3 main concerns:
1. Performance overhead for remote access
2. Interference on shared Flash
3. Flexibility of Flash usage

Issue 1: Performance Overhead

[Figure: p95 read latency (µs) vs. IOPS (thousands) for 4 kB random reads. Relative to local Flash, iSCSI (1 core) shows a 4× throughput drop and libaio+libevent (1 core) a 2× latency increase.]

• Traditional network/storage protocols and Linux I/O libraries (e.g., libaio, libevent) have high overhead
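For context, the traditional Linux path measured above looks roughly like the sketch below: each 4 kB read costs at least two system calls (submit and reap), on top of TCP/IP processing. This is a minimal illustration with a placeholder device path, not ReFlex code.

```c
/* Minimal sketch of the traditional Linux async I/O path (libaio).
 * Illustration only: the device path is a placeholder and error
 * handling is abbreviated. Build with: gcc aio_read.c -laio */
#define _GNU_SOURCE            /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    io_context_t ctx = 0;
    if (io_setup(64, &ctx) < 0) { perror("io_setup"); return 1; }

    /* Placeholder device; O_DIRECT requires aligned buffers. */
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) return 1;  /* 4 kB aligned */

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);  /* 4 kB read at offset 0 */
    io_submit(ctx, 1, cbs);                /* syscall #1: submit */

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);    /* syscall #2: reap completion */
    printf("read returned %ld\n", (long)ev.res);

    free(buf);
    close(fd);
    io_destroy(ctx);
    return 0;
}
```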

Issue 2: Performance Interference

[Figure: p95 read latency (µs) vs. total IOPS (thousands) for read/write mixes from 50% to 100% reads. Latency depends on IOPS load, and even a small fraction of writes sharply increases read tail latency.]

To share Flash, we need to enforce performance isolation.

Issue 3: Flexibility

§ Flexibility is key
§ We want to allocate Flash capacity and bandwidth for applications:
• On any machine in the datacenter
• At any scale
• Accessed using any network-storage protocol

How does ReFlex achieve high performance?

Linux vs. ReFlex

[Diagram: In Linux, the remote storage application in user space reaches the network interface and Flash storage through the filesystem, block I/O layer, and device driver, with hardware interrupts (IRQ). In ReFlex, the application runs on a user-level control plane and data plane with direct access to hardware.]

ReFlex achieves high performance by:
§ Removing software bloat by separating the control plane and the data plane
§ Accessing hardware directly (DPDK for the network, SPDK for Flash), with one data plane per CPU core
§ Polling instead of interrupts
§ Running each request to completion
§ Adaptive batching
§ Zero-copy, device-to-device data transfers

These techniques combine into the per-core loop sketched below.

[ASPLOS'17] ReFlex: Remote Flash ≈ Local Flash. Ana Klimovic, Heiner Litz, Christos Kozyrakis.
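A minimal sketch of the per-core data-plane event loop, illustrating polling, run-to-completion, and adaptive batching. The poll_*/handle_* functions are hypothetical stubs standing in for the real DPDK/SPDK calls (e.g., rte_eth_rx_burst, spdk_nvme_qpair_process_completions), and the doubling/halving batch policy is a simplified stand-in for ReFlex's adaptive scheme.

```c
/* Illustrative per-core data-plane loop: polling, run-to-completion,
 * and adaptive batching. The poll_* and handle_* functions below are
 * hypothetical stubs, not the real ReFlex, DPDK, or SPDK APIs. */
#include <stddef.h>

#define MAX_BATCH 64

typedef struct request request_t;

/* Stubs standing in for hardware queue accesses. */
static size_t poll_nic_rx(request_t **r, size_t max) { (void)r; (void)max; return 0; }
static size_t poll_nvme_cq(request_t **r, size_t max) { (void)r; (void)max; return 0; }
static void handle_network_request(request_t *r) { (void)r; }  /* parse; submit NVMe cmd */
static void handle_nvme_completion(request_t *r) { (void)r; }  /* build and send response */

int main(void) {
    request_t *batch[MAX_BATCH];
    size_t batch_size = 1;            /* adaptive: grows under load */

    for (;;) {                        /* poll forever: no interrupts, no context switches */
        /* Poll the NIC RX queue for up to batch_size new requests. */
        size_t n = poll_nic_rx(batch, batch_size);
        for (size_t i = 0; i < n; i++)
            handle_network_request(batch[i]);  /* run each request to completion */

        /* Poll the NVMe completion queue and send responses. */
        size_t c = poll_nvme_cq(batch, batch_size);
        for (size_t i = 0; i < c; i++)
            handle_nvme_completion(batch[i]);

        /* Adaptive batching: batch more when saturated (amortizes
         * per-call costs), less when lightly loaded (cuts latency). */
        if (n == batch_size && batch_size < MAX_BATCH)
            batch_size *= 2;
        else if (n == 0 && batch_size > 1)
            batch_size /= 2;
    }
}
```

In ReFlex itself, this loop also runs the TCP/IP processing and the I/O scheduler summarized later in "Life of a Packet".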

How does ReFlex enable performance isolation?

§ Request cost based scheduling

§ Determine the impact of tenant A on the tail latency and IOPS of tenant B

§ Control plane assigns each tenant a quota

§ Data plane enforces quotas through throttling

Request Cost Modeling

§ Compensate for read-write asymmetry; for this device, one write costs as much as 10 reads
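A minimal sketch of what cost-based accounting and throttling can look like, assuming the 10× write/read cost ratio above. The token-bucket refill policy is an illustrative simplification, not ReFlex's exact scheduler.

```c
/* Illustrative cost-based accounting with token-bucket throttling.
 * Costs assume the 10x write/read asymmetry measured above; the
 * policy is a simplified stand-in for ReFlex's scheduler. */
#include <stdbool.h>
#include <stdio.h>

#define READ_COST  1.0   /* tokens per 4 kB read  */
#define WRITE_COST 10.0  /* tokens per 4 kB write */

typedef struct {
    double tokens;   /* current balance */
    double rate;     /* refill rate: the tenant's quota (tokens/s) */
    double burst;    /* bucket capacity */
} token_bucket_t;

/* Refill by elapsed time, then try to pay for one request. */
static bool tb_admit(token_bucket_t *tb, double cost, double elapsed_s) {
    tb->tokens += tb->rate * elapsed_s;
    if (tb->tokens > tb->burst) tb->tokens = tb->burst;
    if (tb->tokens < cost) return false;   /* throttle: defer the request */
    tb->tokens -= cost;
    return true;
}

int main(void) {
    /* Tenant with a 200K tokens/s quota (e.g., 200K read IOPS). */
    token_bucket_t tb = { .tokens = 0, .rate = 200000, .burst = 64 };
    /* After 1 ms, 200 tokens accrue (capped at the burst size of 64). */
    printf("read admitted? %d\n", tb_admit(&tb, READ_COST, 0.001));
    printf("write admitted? %d\n", tb_admit(&tb, WRITE_COST, 0.0));
    return 0;
}
```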

[Figure: p95 read latency (µs) vs. total IOPS (left) and vs. weighted IOPS in 10³ tokens/s (right), for read/write mixes from 50% to 100% reads. Charging each request its measured cost collapses the curves for all mixes onto a single latency curve.]

Request Cost Based Scheduling

§ Tenant A registers an SLO of 1 ms tail latency and 200K IOPS. At a 1 ms tail-latency SLO the device sustains 510K IOPS, leaving 310K IOPS of slack.
§ Tenant B registers an SLO of 1.5 ms tail latency and 100K IOPS. The strictest latency SLO is still 1 ms, so the device limit stays at 510K IOPS and 210K IOPS of slack remain.
§ Tenant C requests an SLO of 0.5 ms tail latency and 200K IOPS. At a 0.5 ms SLO the device sustains only 375K IOPS; with 300K IOPS already reserved, only 75K IOPS of slack remain, so Tenant C cannot register since the system cannot satisfy its SLO.
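Putting the admission test in code: a hypothetical sketch of the control-plane check, with the latency-to-capacity table reduced to the two points from these slides (a real system would use a full device profile).

```c
/* Illustrative SLO admission control: admit a tenant only if the device
 * can sustain the combined reserved (weighted) IOPS at the strictest
 * registered tail-latency SLO. Capacity points are taken from the
 * slides above; in practice they come from profiling the device. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    double tail_latency_ms;   /* tenant's tail-latency SLO */
    unsigned long iops;       /* tenant's reserved IOPS    */
} slo_t;

/* Max weighted IOPS the device sustains at a given tail latency
 * (illustrative: 375K @ 0.5 ms, 510K @ 1 ms and above). */
static unsigned long device_max_iops(double tail_latency_ms) {
    return (tail_latency_ms < 1.0) ? 375000 : 510000;
}

static bool admit(const slo_t *registered, int n, slo_t candidate) {
    double strictest = candidate.tail_latency_ms;
    unsigned long reserved = candidate.iops;
    for (int i = 0; i < n; i++) {
        if (registered[i].tail_latency_ms < strictest)
            strictest = registered[i].tail_latency_ms;
        reserved += registered[i].iops;
    }
    return reserved <= device_max_iops(strictest);
}

int main(void) {
    slo_t tenants[] = { {1.0, 200000}, {1.5, 100000} };  /* A and B */
    slo_t c = {0.5, 200000};                             /* Tenant C */
    /* Prints "no": 500K reserved > 375K sustainable at 0.5 ms. */
    printf("Tenant C admitted? %s\n", admit(tenants, 2, c) ? "yes" : "no");
    return 0;
}
```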

Summary: Life of a Packet in ReFlex

[Diagram: life of a read request in ReFlex]
1. The request packet arrives on the network RX queue and is processed by the TCP/IP stack.
2. The scheduler dispatches the request.
3. libReFlex raises event conditions to the application, which issues batched network/storage calls.
4. The read command is placed on the NVMe submission queue (SQ).
5. The completion arrives on the NVMe completion queue (CQ).
6. The scheduler processes the completion.
7. libReFlex raises event conditions to the application.
8. The response is transmitted from the network TX queue.

Results: Local ≈ Remote Latency

[Figure: p95 read latency (µs) vs. IOPS (thousands) for local Flash (libaio) and ReFlex over TCP, using 1 or 2 cores on the Flash server. Linux achieves 75K IOPS/core over TCP; ReFlex achieves 850K IOPS/core and, with 2 cores, saturates the Flash device.]

Latency comparison:
• Local Flash: 78 µs
• ReFlex: 99 µs
• Linux: 200 µs

Results: Performance Isolation

[Figure: per-tenant IOPS (left) and p95 read latency (right) with the I/O scheduler disabled vs. enabled, for Tenant A (100% read), B (80% read), C (95% read), and D (25% read), shown against the tenants' IOPS and latency SLOs.]

• Tenants A & B are latency-critical; Tenants C & D are best effort
• Without the scheduler, the latency and bandwidth QoS of A and B are violated
• The scheduler rate-limits the best-effort tenants to enforce SLOs

Results: Scalability

§ ReFlex scales to multiple cores (we tested with up to 12 cores)
§ ReFlex scales to thousands of tenants and connections: a single core can manage 5,000 tenants, each issuing 100 1 kB reads/sec (500K IOPS total)

How can I use ReFlex?

§ ReFlex is open source: www.github.com/stanford-mast/reflex

§ Two versions of ReFlex are available:
1. Kernel implementation:
– Provides a protected network-storage data plane (the user application is not trusted)
– Builds on hardware support for virtualization
– Requires the Dune kernel module for direct access to hardware and memory management in Linux
2. User space implementation:
– The network-storage data plane runs as a user space application (kernel bypass)
– Portable implementation, tested on an Amazon Web Services i3 instance

ReFlex Clients

§ ReFlex exposes a logical block interface

§ ReFlex client implementations are available:
1. Block I/O client → for issuing raw network block I/O requests to ReFlex
2. Remote block device driver → for legacy Linux applications (see the sketch below)
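For example, once the remote block device driver is loaded, a legacy application can read remote Flash with ordinary block I/O. The device name below is hypothetical; the actual node depends on what your driver registers.

```c
/* Sketch: legacy application reading from remote Flash through the
 * ReFlex remote block device driver. "/dev/reflex0" is a hypothetical
 * node name; plain POSIX I/O means unmodified applications work. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/reflex0", O_RDONLY);   /* hypothetical device */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    ssize_t n = pread(fd, buf, sizeof buf, 0); /* 4 kB from offset 0 */
    if (n < 0) perror("pread");
    else printf("read %zd bytes from remote Flash\n", n);

    close(fd);
    return 0;
}
```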

Control Plane for ReFlex

§ ReFlex provides an efficient data plane for remote access

§ Bring your own control plane or distributed storage system:
• A global control plane allocates Flash capacity & bandwidth in the cluster
• The control plane can be a cluster manager or a distributed storage system (e.g., the Crail distributed file system)

§ Example: ReFlex storage tier for the Crail distributed file system
• Spark applications can use ReFlex as a remote Flash storage tier, managed by the Crail distributed data store

Example Use-Case

§ Big data analytics frameworks such as Spark store a variety of data types in memory, Flash, and disk

[Diagram: Spark (SQL, Streaming, MLlib, GraphX) data types: shuffle data, RDDs, broadcast data, and input/output files.]

§ Crail is a distributed storage system that manages and provides high-performance access to data across multiple storage tiers

www.crail.io

Example Use-Case

§ Spark executors often use (local) Flash to store temporary data (e.g. shuffle)

[Diagram: four Spark executors, each with its own compute, DRAM, and local NVMe Flash.]

§ Spark applications often saturate CPU resources before they can saturate NVMe Flash IOPS → local Flash bandwidth is underutilized

Example Use-Case

§ ReFlex provides a shared remote Flash tier for big data analytics
• Improves resource utilization while offering local Flash performance

[Diagram: Spark executors (compute + DRAM only) connect over a high-bandwidth Ethernet network to shared ReFlex servers with NVMe Flash.]

Summary

§ ReFlex enables Flash disaggregation

§ Performance: remote ≈ local

§ Commodity networking, low CPU overhead

§ QoS on shared Flash with I/O scheduler

Thank You!

ReFlex is available at: https://github.com/stanford-mast/reflex