ReFlex: Remote Flash at the Performance of Local Flash

Ana Klimovic, Heiner Litz, Christos Kozyrakis

Flash Memory Summit 2017, Santa Clara, CA
Stanford MAST

Flash in Datacenters

§ NVMe Flash
• 1,000× higher throughput than disk (1M IOPS)
• 100× lower latency than disk (50–70 µs)

§ But Flash is often underutilized due to imbalanced resource requirements

Example Datacenter Flash Use-Case

[Diagram: Application tier and datastore tier. App clients send get(k)/put(k,val) requests over TCP/IP to key-value store software running on datastore servers with NIC, CPU, RAM, and Flash hardware.]

Imbalanced Resource Utilization

Sample utilization of Facebook servers hosting a Flash-based key-value store over 6 months:

[Figure: CPU and Flash utilization over time.]

Flash capacity and bandwidth are underutilized for long periods of time.

[EuroSys'16] Flash storage disaggregation. Ana Klimovic, Christos Kozyrakis, Eno Thereska, Binu John, Sanjeev Kumar.

Solution: Remote Access to Storage

§ Improve resource utilization by sharing NVMe Flash between remote tenants

§ There are 3 main concerns:
1. Performance overhead for remote access
2. Interference on shared Flash
3. Flexibility of Flash usage

Issue 1: Performance Overhead

[Figure: p95 read latency (µs) vs. IOPS (thousands) for 4 kB random reads. Relative to local Flash, iSCSI (1 core) shows a 4× throughput drop and libaio+libevent (1 core) a 2× latency increase.]

• Traditional network/storage protocols and Linux I/O libraries (e.g., libaio, libevent) have high overhead
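For context, the traditional Linux path measured above looks roughly like the sketch below: each 4 kB read costs at least two system calls (submit and reap), on top of TCP/IP processing. This is a minimal illustration with a placeholder device path, not ReFlex code.

```c
/* Minimal sketch of the traditional Linux async I/O path (libaio).
 * Illustration only: the device path is a placeholder and error
 * handling is abbreviated. Build with: gcc aio_read.c -laio */
#define _GNU_SOURCE            /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    io_context_t ctx = 0;
    if (io_setup(64, &ctx) < 0) { perror("io_setup"); return 1; }

    /* Placeholder device; O_DIRECT requires aligned buffers. */
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) return 1;  /* 4 kB aligned */

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);  /* 4 kB read at offset 0 */
    io_submit(ctx, 1, cbs);                /* syscall #1: submit */

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);    /* syscall #2: reap completion */
    printf("read returned %ld\n", (long)ev.res);

    free(buf);
    close(fd);
    io_destroy(ctx);
    return 0;
}
```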

Issue 2: Performance Interference

[Figure: p95 read latency (µs) vs. total IOPS (thousands) for read/write mixes from 50% to 100% reads. Latency depends on IOPS load, and even a small fraction of writes sharply increases read tail latency.]

To share Flash, we need to enforce performance isolation.

Issue 3: Flexibility

§ Flexibility is key
§ We want to allocate Flash capacity and bandwidth for applications:
• On any machine in the datacenter
• At any scale
• Accessed using any network-storage protocol

How does ReFlex achieve high performance?

Linux vs. ReFlex

[Diagram: In Linux, the remote storage application in user space reaches the network interface and Flash storage through the filesystem, block I/O layer, and device driver, with hardware interrupts (IRQ). In ReFlex, the application runs on a user-level control plane and data plane with direct access to hardware.]

ReFlex achieves high performance by:
§ Removing software bloat by separating the control plane and the data plane
§ Accessing hardware directly (DPDK for the network, SPDK for Flash), with one data plane per CPU core
§ Polling instead of interrupts
§ Running each request to completion
§ Adaptive batching
§ Zero-copy, device-to-device data transfers

These techniques combine into the per-core loop sketched below.

[ASPLOS'17] ReFlex: Remote Flash ≈ Local Flash. Ana Klimovic, Heiner Litz, Christos Kozyrakis.
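A minimal sketch of the per-core data-plane event loop, illustrating polling, run-to-completion, and adaptive batching. The poll_*/handle_* functions are hypothetical stubs standing in for the real DPDK/SPDK calls (e.g., rte_eth_rx_burst, spdk_nvme_qpair_process_completions), and the doubling/halving batch policy is a simplified stand-in for ReFlex's adaptive scheme.

```c
/* Illustrative per-core data-plane loop: polling, run-to-completion,
 * and adaptive batching. The poll_* and handle_* functions below are
 * hypothetical stubs, not the real ReFlex, DPDK, or SPDK APIs. */
#include <stddef.h>

#define MAX_BATCH 64

typedef struct request request_t;

/* Stubs standing in for hardware queue accesses. */
static size_t poll_nic_rx(request_t **r, size_t max) { (void)r; (void)max; return 0; }
static size_t poll_nvme_cq(request_t **r, size_t max) { (void)r; (void)max; return 0; }
static void handle_network_request(request_t *r) { (void)r; }  /* parse; submit NVMe cmd */
static void handle_nvme_completion(request_t *r) { (void)r; }  /* build and send response */

int main(void) {
    request_t *batch[MAX_BATCH];
    size_t batch_size = 1;            /* adaptive: grows under load */

    for (;;) {                        /* poll forever: no interrupts, no context switches */
        /* Poll the NIC RX queue for up to batch_size new requests. */
        size_t n = poll_nic_rx(batch, batch_size);
        for (size_t i = 0; i < n; i++)
            handle_network_request(batch[i]);  /* run each request to completion */

        /* Poll the NVMe completion queue and send responses. */
        size_t c = poll_nvme_cq(batch, batch_size);
        for (size_t i = 0; i < c; i++)
            handle_nvme_completion(batch[i]);

        /* Adaptive batching: batch more when saturated (amortizes
         * per-call costs), less when lightly loaded (cuts latency). */
        if (n == batch_size && batch_size < MAX_BATCH)
            batch_size *= 2;
        else if (n == 0 && batch_size > 1)
            batch_size /= 2;
    }
}
```

In ReFlex itself, this loop also runs the TCP/IP processing and the I/O scheduler summarized later in "Life of a Packet".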

How does ReFlex enable performance isolation?

§ Request cost based scheduling

§ Determine the impact of tenant A on the tail latency and IOPS of tenant B

§ Control plane assigns each tenant a quota

§ Data plane enforces quotas through throttling

Request Cost Modeling

§ Compensate for read-write asymmetry; for this device, one write costs as much as 10 reads
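A minimal sketch of what cost-based accounting and throttling can look like, assuming the 10× write/read cost ratio above. The token-bucket refill policy is an illustrative simplification, not ReFlex's exact scheduler.

```c
/* Illustrative cost-based accounting with token-bucket throttling.
 * Costs assume the 10x write/read asymmetry measured above; the
 * policy is a simplified stand-in for ReFlex's scheduler. */
#include <stdbool.h>
#include <stdio.h>

#define READ_COST  1.0   /* tokens per 4 kB read  */
#define WRITE_COST 10.0  /* tokens per 4 kB write */

typedef struct {
    double tokens;   /* current balance */
    double rate;     /* refill rate: the tenant's quota (tokens/s) */
    double burst;    /* bucket capacity */
} token_bucket_t;

/* Refill by elapsed time, then try to pay for one request. */
static bool tb_admit(token_bucket_t *tb, double cost, double elapsed_s) {
    tb->tokens += tb->rate * elapsed_s;
    if (tb->tokens > tb->burst) tb->tokens = tb->burst;
    if (tb->tokens < cost) return false;   /* throttle: defer the request */
    tb->tokens -= cost;
    return true;
}

int main(void) {
    /* Tenant with a 200K tokens/s quota (e.g., 200K read IOPS). */
    token_bucket_t tb = { .tokens = 0, .rate = 200000, .burst = 64 };
    /* After 1 ms, 200 tokens accrue (capped at the burst size of 64). */
    printf("read admitted? %d\n", tb_admit(&tb, READ_COST, 0.001));
    printf("write admitted? %d\n", tb_admit(&tb, WRITE_COST, 0.0));
    return 0;
}
```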

[Figure: p95 read latency (µs) vs. total IOPS (left) and vs. weighted IOPS in 10³ tokens/s (right), for read/write mixes from 50% to 100% reads. Charging each request its measured cost collapses the curves for all mixes onto a single latency curve.]

Request Cost Based Scheduling

§ Tenant A registers an SLO of 1 ms tail latency and 200K IOPS. At a 1 ms tail-latency SLO the device sustains 510K IOPS, leaving 310K IOPS of slack.
§ Tenant B registers an SLO of 1.5 ms tail latency and 100K IOPS. The strictest latency SLO is still 1 ms, so the device limit stays at 510K IOPS and 210K IOPS of slack remain.
§ Tenant C requests an SLO of 0.5 ms tail latency and 200K IOPS. At a 0.5 ms SLO the device sustains only 375K IOPS; with 300K IOPS already reserved, only 75K IOPS of slack remain, so Tenant C cannot register since the system cannot satisfy its SLO.
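Putting the admission test in code: a hypothetical sketch of the control-plane check, with the latency-to-capacity table reduced to the two points from these slides (a real system would use a full device profile).

```c
/* Illustrative SLO admission control: admit a tenant only if the device
 * can sustain the combined reserved (weighted) IOPS at the strictest
 * registered tail-latency SLO. Capacity points are taken from the
 * slides above; in practice they come from profiling the device. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    double tail_latency_ms;   /* tenant's tail-latency SLO */
    unsigned long iops;       /* tenant's reserved IOPS    */
} slo_t;

/* Max weighted IOPS the device sustains at a given tail latency
 * (illustrative: 375K @ 0.5 ms, 510K @ 1 ms and above). */
static unsigned long device_max_iops(double tail_latency_ms) {
    return (tail_latency_ms < 1.0) ? 375000 : 510000;
}

static bool admit(const slo_t *registered, int n, slo_t candidate) {
    double strictest = candidate.tail_latency_ms;
    unsigned long reserved = candidate.iops;
    for (int i = 0; i < n; i++) {
        if (registered[i].tail_latency_ms < strictest)
            strictest = registered[i].tail_latency_ms;
        reserved += registered[i].iops;
    }
    return reserved <= device_max_iops(strictest);
}

int main(void) {
    slo_t tenants[] = { {1.0, 200000}, {1.5, 100000} };  /* A and B */
    slo_t c = {0.5, 200000};                             /* Tenant C */
    /* Prints "no": 500K reserved > 375K sustainable at 0.5 ms. */
    printf("Tenant C admitted? %s\n", admit(tenants, 2, c) ? "yes" : "no");
    return 0;
}
```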

Summary: Life of a Packet in ReFlex

[Diagram: life of a read request in ReFlex]
1. The request packet arrives on the network RX queue and is processed by the TCP/IP stack.
2. The scheduler dispatches the request.
3. libReFlex raises event conditions to the application, which issues batched network/storage calls.
4. The read command is placed on the NVMe submission queue (SQ).
5. The completion arrives on the NVMe completion queue (CQ).
6. The scheduler processes the completion.
7. libReFlex raises event conditions to the application.
8. The response is transmitted from the network TX queue.

Results: Local ≈ Remote Latency

[Figure: p95 read latency (µs) vs. IOPS (thousands) for local Flash (libaio) and ReFlex over TCP, using 1 or 2 cores on the Flash server. Linux achieves 75K IOPS/core over TCP; ReFlex achieves 850K IOPS/core and, with 2 cores, saturates the Flash device.]

Latency comparison:
• Local Flash: 78 µs
• ReFlex: 99 µs
• Linux: 200 µs

Results: Performance Isolation

[Figure: per-tenant IOPS (left) and p95 read latency (right) with the I/O scheduler disabled vs. enabled, for Tenant A (100% read), B (80% read), C (95% read), and D (25% read), shown against the tenants' IOPS and latency SLOs.]

• Tenants A & B are latency-critical; Tenants C & D are best effort
• Without the scheduler, the latency and bandwidth QoS of A and B are violated
• The scheduler rate-limits the best-effort tenants to enforce SLOs

Results: Scalability

§ ReFlex scales to multiple cores (we tested with up to 12 cores)
§ ReFlex scales to thousands of tenants and connections: a single core can manage 5,000 tenants, each issuing 100 1 kB reads/sec (500K IOPS total)

How can I use ReFlex?

§ ReFlex is open source: www.github.com/stanford-mast/reflex

§ Two versions of ReFlex are available:
1. Kernel implementation:
– Provides a protected network-storage data plane (the user application is not trusted)
– Builds on hardware support for virtualization
– Requires the Dune kernel module for direct access to hardware and memory management in Linux
2. User space implementation:
– The network-storage data plane runs as a user space application (kernel bypass)
– Portable implementation, tested on an Amazon Web Services i3 instance

ReFlex Clients

§ ReFlex exposes a logical block interface

§ ReFlex client implementations are available:
1. Block I/O client → for issuing raw network block I/O requests to ReFlex
2. Remote block device driver → for legacy Linux applications (see the sketch below)
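For example, once the remote block device driver is loaded, a legacy application can read remote Flash with ordinary block I/O. The device name below is hypothetical; the actual node depends on what your driver registers.

```c
/* Sketch: legacy application reading from remote Flash through the
 * ReFlex remote block device driver. "/dev/reflex0" is a hypothetical
 * node name; plain POSIX I/O means unmodified applications work. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/reflex0", O_RDONLY);   /* hypothetical device */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    ssize_t n = pread(fd, buf, sizeof buf, 0); /* 4 kB from offset 0 */
    if (n < 0) perror("pread");
    else printf("read %zd bytes from remote Flash\n", n);

    close(fd);
    return 0;
}
```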

Control Plane for ReFlex

§ ReFlex provides an efficient data plane for remote access

§ Bring your own control plane or distributed storage system:
• A global control plane allocates Flash capacity & bandwidth in the cluster
• The control plane can be a cluster manager or a distributed storage system (e.g., the Crail distributed file system)

§ Example: ReFlex storage tier for the Crail distributed file system
• Spark applications can use ReFlex as a remote Flash storage tier, managed by the Crail distributed data store

Example Use-Case

§ Big data analytics frameworks such as Spark store a variety of data types in memory, Flash, and disk

[Diagram: Spark (SQL, Streaming, MLlib, GraphX) data types: shuffle data, RDDs, broadcast data, and input/output files.]

§ Crail is a distributed storage system that manages and provides high-performance access to data across multiple storage tiers

www.crail.io

Example Use-Case

§ Spark executors often use (local) Flash to store temporary data (e.g. shuffle)

[Diagram: four Spark executors, each with its own compute, DRAM, and local NVMe Flash.]

§ Spark applications often saturate CPU resources before they can saturate NVMe Flash IOPS → local Flash bandwidth is underutilized

Example Use-Case

§ ReFlex provides a shared remote Flash tier for big data analytics
• Improves resource utilization while offering local Flash performance

[Diagram: Spark executors (compute + DRAM only) connect over a high-bandwidth Ethernet network to shared ReFlex servers with NVMe Flash.]

Summary

§ ReFlex enables Flash disaggregation

§ Performance: remote ≈ local

§ Commodity networking, low CPU overhead

§ QoS on shared Flash with I/O scheduler

Thank You!

ReFlex is available at: https://github.com/stanford-mast/reflex