The Future of Software-Defined Storage – What Does It Look Like in 3 Years' Time?
Richard McDougall, VMware, Inc.

This Talk: The Future of SDS

• Hardware
• New Workloads
• Futures & Trends
• Emerging Cloud Native Platforms

2 Storage Workload Map
[Chart: workloads plotted from One Thousand IOPS to 1 Million IOPS against 10's of Terabytes to 10's of Petabytes, contrasting "New Workloads" with the "Focus of Most SDS Today"]

3 What Are Cloud-Native Applications?

• Developer access via APIs
• Continuous integration and deployment
• Built for scale
• Availability architected in the app, not the hypervisor
• Microservices, not monolithic stacks
• Decoupled from infrastructure

4 What do Containers Need from Storage?

[Diagram: parent Linux root (/) with /etc and /containers; Container 1 at /containers/cont1/rootfs and Container 2 at /containers/cont2/rootfs, each with its own etc and var]

• Ability to copy & clone root images
• Isolated namespaces between containers
• QoS controls between containers
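To make the QoS point above concrete, here is a minimal, hedged sketch using Docker's block-I/O controls; the device path, limits, container names, and password are illustrative, not from the talk:

# docker run -d --name db1 -e MYSQL_ROOT_PASSWORD=secret --device-read-iops /dev/sda:1000 --device-write-bps /dev/sda:50mb mysql
# docker run -d --name db2 -e MYSQL_ROOT_PASSWORD=secret --blkio-weight 300 mysql

The first container is capped at 1000 read IOPS and 50 MB/s of writes on /dev/sda; the second simply gets a lower relative share of block I/O.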

Option A: copy whole root tree

5 Containers and Fast Clone Using Shared Read-Only Images

[Diagram: parent root (/) with /etc and /containers; the container's root at /containers/cont1/rootfs (etc, var) points at a shared read-only image]

• Actions
  – Quick update of child to use same boot image
• Downsides
  – Container root fs not writable

6 Docker Solution: Clone via "Another Union File System" (aufs)

Except:

7 Docker's Direction: Leverage native copy-on-write filesystem

[Diagram: parent Linux root (/) with /etc and /containers; a golden image at /containers/golden/rootfs is cloned to Container 1 at /containers/cont1/rootfs, each with its own etc and var]

Goal:
– Snap single image into writable clones
– Boot clones as isolated Linux instances

# zfs snapshot cpool/golden@base
# zfs clone cpool/golden@base cpool/cont1
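As a hedged follow-on (not from the slides), the clone could then be surfaced at the container's root path before booting it; the dataset and mountpoint names mirror the illustrative ones above:

# zfs set mountpoint=/containers/cont1/rootfs cpool/cont1
# zfs list -r cpool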

8 Shared Data: Containers can Share Filesystem Within Host

[Diagram: root (/) with /etc and /containers; container1 at /containers/cont1/rootfs and container2 at /containers/cont2/rootfs each have their own etc and var plus a shared directory]

Containers Today: share within the host
Containers Future: share across hosts (clustered file systems)
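A minimal sketch of the "share within the host" case, assuming an illustrative host directory /containers/shared and a stock ubuntu image:

# docker run -d --name container1 -v /containers/shared:/shared ubuntu sleep infinity
# docker run -d --name container2 -v /containers/shared:/shared ubuntu sleep infinity

Both containers see the same files under /shared; the cross-host variant needs a clustered or distributed file system underneath that path.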

9 Docker Storage Abstractions for Containers

• Persistent local data disks – a data volume (e.g. /data) attached to the Docker container
  "VOLUME /var/lib/mysql"
• Non-persistent boot environment – the container's root (/, /etc, /var)
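A hedged sketch of wiring that VOLUME to a host directory at run time; the host path, container name, and password are illustrative:

# docker run -d --name mysql1 -e MYSQL_ROOT_PASSWORD=secret -v /data/mysql1:/var/lib/mysql mysql

Data written to /var/lib/mysql survives container rebuilds, while the rest of the root filesystem remains disposable.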

10 Docker Storage Abstractions for Containers

• Persistent block volumes – data volumes (e.g. /data, "VOLUME /var/lib/mysql") backed by external block storage: VSAN, RBD, ScaleIO, Flocker, etc.
• Non-persistent boot environment – the container's root (/, /etc, /var)
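As a hedged sketch of the volume-driver path, assuming a Flocker-style plugin is installed and registered under the driver name "flocker" (the volume name and password are illustrative):

# docker run -d --volume-driver flocker -e MYSQL_ROOT_PASSWORD=secret -v mysqlvol:/var/lib/mysql mysql

Here the named volume is provisioned by the external block backend rather than by the local host.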

11 Container Storage Use Cases

• Unshared volumes – use case: running a container-based SQL or NoSQL DB
• Shared volumes – use case: sharing a set of tools or content across app instances
• Persist to external storage – use case: object store for retention/archival, DBaaS for config/transactions

[Diagram: containers on hosts consuming a storage platform through its API, or cloud storage directly]

12 Eliminate the Silos: Converged Big Data Platform

Compute layer: Hadoop (batch analysis), NoSQL (Cassandra, Mongo, HBase, etc.), Spark/Shark/Solr (real-time queries), Big SQL (Impala, Platfora, Pivotal HawQ, etc.), other

Distributed Resource Manager or virtualization platform

File system (HDFS, MapR, GPFS), POSIX, block storage

Distributed storage platform running across hosts

13 Storage Platforms

14 Array Technologies
[Chart: arrays plotted from One Thousand IOPS to 10 Million IOPS against 10's of Terabytes to 10's of Petabytes; the trend is to consolidate and increase performance]

15 Storage Media Technology: Capacities & Latencies – 2016

Media                    Capacity   Per Host   Latency   "Trip to Australia" analogy
DRAM                     1TB        4TB        ~100ns    0.3 seconds
NVM                      1TB        4TB        ~500ns    1.8 seconds
NVMe SSD                 4TB        48TB       ~10us     36 seconds
Capacity-oriented SSD    16TB       192TB      ~1ms      1 hour
Magnetic storage         32TB       384TB      ~10ms     12 hours
Object storage           -          -          ~1s       50 days

16 Storage Media Technology

[Chart: flash plotted from One Thousand IOPS to 1 Million IOPS against 10's of Gigabytes to 10's of Terabytes; 2D NAND positioned as "IOPS Flash", 3D NAND as "Capacity Flash"]

17 NVDIMMs

• NVDIMM-N (Type 1) [2]: DDR3/DDR4, standard DRAM with event-triggered transfer to NAND – DRAM + NAND non-volatile memory, load/store access
• NVDIMM-F (Type 3) [1]: DDR3/DDR4, buffered RAM with asynchronous block transfer to NAND – DRAM + NAND non-volatile memory, block mode
• Type 4 [1]: DDR4, synchronous transfer to NVM – next-generation load/store memory

18 Intel 3D XPoint™ Technology

• Non-Volatile
• 1000x higher write endurance than NAND
• 1000x faster than NAND
• 10x denser than DRAM
• Can serve as system memory and storage
• Cross-point structure allows individual memory cells to be addressed
• Cacheline load/store should be possible
• Stackable to boost density
• Memory cell can be read or written w/o requiring a transistor, reducing power & cost

1. "Intel and Micron Produce Breakthrough Memory Technology", Intel

19 Storage Workload Map: SDS Solutions
[Chart: SDS solutions such as VSAN and HDFS plotted from One Thousand IOPS to 1 Million IOPS against 10's of Terabytes to 10's of Petabytes]

22 What's Really Behind a Storage Array?

[Diagram: dual failover controllers – industry-standard x86 servers – each running data services (iSCSI, FC, etc.), fail-over services, and a clustered storage stack/OS providing volumes, de-dup, mirroring, caching, etc., connected over a SAS or FC network to shared disk]

23 Different Types of SDS (1): Fail-Over Software on Commodity Servers
[Diagram: a software-defined virtual storage appliance – e.g. Nexenta, NetApp Data ONTAP, EMC VNX VA, etc. – running its data services and storage stack/OS on your Intel servers from HP, Dell, Super-Micro, etc.]

24 (1) Better: Software Replication Using Servers + Local Disks

[Diagram: two servers, each with data services and a clustered storage stack/OS, kept in sync by software replication across local disks]

• Simpler server configuration
• Lower cost
• Less to go wrong
• Not very scalable

25 (2) Caching Hot Core/Cold Edge

[Diagram: a compute layer of hosts with an all-flash hot tier caching reads/writes over a low-latency network, backed by a scale-out storage service / storage tier accessed via S3 or iSCSI]

26 (3) Scale-Out SDS
[Diagram: a compute tier connected over a storage network (iSCSI or a proprietary protocol) to a scale-out tier of nodes, each running data services and a storage stack/OS]

• Scalable
• More fault tolerance
• Rolling upgrades
• More management (separate compute and storage silos)

27 Easier Management: (4) Hyperconverged SDS (Top-Down Provisioning)

[Diagram: apps running on a cluster of hypervisors, with the data services and storage stack (e.g. VSAN) built into each host]

• Scalable
• More fault tolerance
• Rolling upgrades
• Fixed compute and storage ratio

28 Storage Interconnects

• Protocols (compute tier): iSCSI, FC, FCoE, NVMe, NVMe over Fabrics
• Hardware transports (HCA): Fibre Channel, Ethernet, Infiniband
• Device connectivity: SAS, SATA, NVMe

30 Network

• Lossy
• Dominant network protocol
• Enables a converged storage network:
  – iSCSI
  – iSER
  – FCoE
  – RDMA over Ethernet
  – NVMe Fabrics
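As a minimal, hedged sketch of that converged path, an iSCSI initiator can attach block storage over the ordinary Ethernet network; the portal address and target IQN below are illustrative:

# iscsiadm -m discovery -t sendtargets -p 192.168.1.100:3260
# iscsiadm -m node -T iqn.2016-01.com.example:target1 -p 192.168.1.100:3260 --login

After login the target shows up as a local block device, indistinguishable to upper layers from directly attached storage.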

31 Device Interconnects
*There are now PCIe extensions for storage

• SAS: 6-12G @ 1.2 Gbytes/sec – SATA/SAS HDD/SSDs behind an HCA on the server's PCI-e bus
• NVM over PCIe: 4 lanes gen3 x 4 = 4 Gbytes/sec (electrical adapter)
  – Flash: Intel P3700 @ 3 GB/s, Samsung XS1715 NVMe SSD, future devices

PCIe    Serial speed        Bytes/sec
Gen 1   2 gbits per lane    256 MB/sec
Gen 2   4 gbits per lane    512 MB/sec
Gen 3   8 gbits per lane    1 GB/sec

32 Storage Fabrics

iSCSI (iSCSI over TCP/IP on a standard NIC)
Pros:
• Popular: Windows, Linux, hypervisor friendly
• Runs over many networks: routeable
• Optional RDMA acceleration (iSER)
Cons:
• Limited protocol (SCSI)
• Performance
• Setup, naming, endpoint management

FCoE (FC over Ethernet)
Pros:
• Runs on ethernet
• Can easily gateway between FC and FCoE
Cons:
• Not routable
• Needs special switches (lossless ethernet)
• Not popular
• No RDMA option?

NVMe over Fabrics (the future!) (NVMe over Ethernet)
Pros:
• Runs on ethernet
• Same rich protocol as NVMe
• In-band administration & data
• Leverages RDMA for low latency
Cons:
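Setting the open question of cons aside, here is a minimal sketch of attaching an NVMe over Fabrics target with nvme-cli over RDMA; the transport address, service ID, and NQN are illustrative:

# nvme connect -t rdma -a 192.168.1.100 -s 4420 -n nqn.2016-06.io.example:subsystem1
# nvme list

The connected subsystem's namespaces then appear as regular /dev/nvmeXnY block devices, with the same command set as a local NVMe drive.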

33 PCIe Storage Fabrics (PCIe Rack Level)
[Diagram: hosts connected through a PCIe switch at 4 Gbytes/sec per link, running NVMe over PCIe]

34 PCIe Shared Storage (Shared Drives between Hosts)
[Diagram: a PCIe rack fabric sharing drives between hosts, with RAID, namespace QoS, and autotiering]

34 Integrated PCIe Fabric

35 PCI-e Rack-Scale Compute+Storage

[Diagram: compute nodes and storage boxes attached to a shared PCIe fabric, with Storage-DMA and RDMA over PCIe between them]

• PCIe Fabric: share disk boxes across hosts
• Storage-DMA: host-host RDMA capability, leveraging the same PCI-e fabric

36 NVMe – The New Kid on the Block
• SCSI-like: block R/W transport (Read, Write)
• NFS-like: objects/namespaces (Create, Delete)
• vVol-like: object + metadata + policy

[Diagram: NVMe objects and namespaces layered on the NVMe core, carried over PCIe or over fabrics – RDMA (IB, Ethernet) and FC]
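For a concrete view of the namespace layer, nvme-cli can enumerate and inspect namespaces on a controller; the device path is illustrative and assumes a local NVMe drive:

# nvme list
# nvme list-ns /dev/nvme0
# nvme id-ns /dev/nvme0 -n 1

Each namespace is an independently addressable block space on the same controller, which is also what NVMe over Fabrics exports across the network.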

37 Now/Future

Interconnect   Now           Next      Future (Richard's Opinion)
SATA           6g            -         NVMe
SAS            12g           24g       NVMe
PCI-e          Proprietary   4 x 8g    NVMe
SCSI over FC / Virtual NVMe
FC             8g            32g       NVMe over Fabrics
Ethernet       10g           20/40g    100g

38 SDS Hardware Futures: Conclusions & Challenges

• Magnetic storage is phasing out
  – Better IOPS per $$$ with flash since this year
  – Better space per $$$ after 2016
  – Devices approaching 1M IOPS and extremely low latencies
• Device latencies push for new interconnects and networks
  – Storage (first cache) locality is important again
  – Durability demands low-latency commits across the wire (think RDMA)
• The network is critical
  – Flat 40G networks are the design point for SDS

39 Beyond Block: Software-Defined Data Services Platforms

40 Storage Protocols
SCSI, POSIX, HDFS, NoSQL, K-V, SQL flavors, S3/Swift

41 Too Many Data Silos

[Diagram: a compute cloud where developers are happy, but block, object, K-V, and big-DB silos each have their own API and management stack]

• Lots of management
• Different SW stacks
• Different HCLs
• Walled resources

42 Option 1: Multi-Protocol Stack
[Diagram: a compute cloud where developers are happy; block, object, K-V, and big-DB APIs are all served by one all-in-one software-defined storage stack on a single HCL]

• One stack to manage
• Least capable services
• Shared resources

43 Option 2: Common Platform + Ecosystem of Services
[Diagram: a compute cloud where developers are happy; MongoDB, Hadoop, MySQL, Ceph, and block services each expose their own API but run on one software-defined storage platform with a single HCL]

• One platform to manage
• Richest services
• Shared resources

44 Opportunities for SDS Platform

• Provide for the needs of cloud-native data services
  – E.g. MongoDB, Cassandra, Hadoop HDFS
• Advanced primitives in support of distributed data services
  – Policies for storage services: replication, RAID, error handling
  – Fault-domain-aware placement/control
  – Fine-grained policies for performance: e.g. ultra-low latencies for database logs
  – Placement control for performance: locality, topology
  – Group snapshot, replication semantics
