The Future of Software-Defined Storage – What Does It Look Like in 3 Years' Time?
Richard McDougall, VMware, Inc.

This Talk: The Future of SDS
• New Workloads
• Emerging Cloud-Native Platforms
• Hardware Futures & Trends
Storage Workload Map
[Chart: capacity from tens of terabytes to tens of petabytes vs. performance from one thousand to one million IOPS; the focus of most SDS today occupies one region of the map, with new workloads pushing beyond it.]

What Are Cloud-Native Applications?
• Developer access via APIs
• Continuous integration and deployment
• Built for scale
• Availability architected in the application, not the hypervisor
• Microservices, not monolithic stacks
• Decoupled from infrastructure
What Do Linux Containers Need from Storage?
[Diagram: a parent Linux root (/, with /etc and /containers) hosts Container 1 at /containers/cont1/rootfs and Container 2 at /containers/cont2/rootfs, each with its own etc and var.]
• Ability to copy and clone root images
• Isolated namespaces between containers
• QoS controls between containers

Option A: copy the whole root tree
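A minimal sketch of Option A, assuming a golden root image already exists at /containers/golden/rootfs (the paths are illustrative): copy the entire tree for every new container. Simple, but slow and space-hungry compared with cloning.

# mkdir -p /containers/cont1
# cp -a /containers/golden/rootfs /containers/cont1/rootfs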
Containers and Fast Clone Using Shared Read-Only Images
[Diagram: the parent root (/, with /etc and /containers) shares a single read-only boot image with the container at /containers/cont1/rootfs.]
• Actions:
  – Quick update of a child to use the same boot image
• Downsides:
  – The container root filesystem is not writable
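One way to realize such a shared read-only image on Linux is a read-only bind mount of the golden tree into each container's rootfs path; a minimal sketch using the same illustrative paths:

# mount --bind /containers/golden/rootfs /containers/cont1/rootfs
# mount -o remount,bind,ro /containers/cont1/rootfs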
Docker Solution: Clone via "Another Union File System" (aufs)
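The union-mount idea is to stack a small per-container writable branch on top of the shared read-only image, so clones stay cheap yet writable. A minimal sketch, assuming the aufs module is available and using illustrative branch paths:

# mkdir -p /containers/cont1/rw /containers/cont1/rootfs
# mount -t aufs -o br=/containers/cont1/rw=rw:/containers/golden/rootfs=ro none /containers/cont1/rootfs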
Except: aufs is not part of the mainline Linux kernel.
Docker's Direction: Leverage a Native Copy-on-Write Filesystem
[Diagram: the parent Linux root (/, with /etc and /containers) holds a golden image at /containers/golden/rootfs and a clone at /containers/cont1/rootfs, each with its own etc and var.]
Goal:
  – Snap a single image into writable clones
  – Boot the clones as isolated Linux instances
# zfs snapshot cpool/golden@base
# zfs clone cpool/golden@base cpool/cont1
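To then boot a clone as an isolated Linux instance, one option is to mount the clone at the container's rootfs path and start a lightweight container runtime inside it; a minimal sketch using systemd-nspawn, an illustrative choice rather than what Docker itself does:

# zfs set mountpoint=/containers/cont1/rootfs cpool/cont1
# systemd-nspawn -D /containers/cont1/rootfs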
Shared Data: Containers Can Share a Filesystem Within a Host
[Diagram: container1 and container2 on the same host each have their own etc and var but mount a common shared directory.]
• Containers today: share within the host
• Containers future: share across hosts via clustered file systems
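A minimal sketch of within-host sharing using Docker's host-volume mounts (image name and paths are illustrative): both containers see the same host directory at /shared, while sharing across hosts would need a clustered or distributed filesystem underneath.

# docker run -d --name container1 -v /shared:/shared ubuntu sleep infinity
# docker run -d --name container2 -v /shared:/shared ubuntu sleep infinity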
Docker Storage Abstractions for Containers
• Persistent local data disks, mounted into the Docker container at /data
  – e.g. "VOLUME /var/lib/mysql"
• Non-persistent boot environment: the container's root (/, /etc, /var)
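A minimal sketch of how the two halves fit together, assuming a MySQL-style image and an illustrative host path: the database files land on a persistent local disk, while the rest of the root filesystem stays ephemeral.

# docker run -d -v /data/mysql:/var/lib/mysql mysql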
Docker Storage Abstractions for Containers
• Persistent block volumes from external storage platforms (VSAN, Ceph RBD, ScaleIO, Flocker, etc.), mounted into the container at /data
  – e.g. "VOLUME /var/lib/mysql"
• Non-persistent boot environment: the container's root (/, /etc, /var)
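Externally backed volumes are typically surfaced through a Docker volume plugin; a minimal sketch, where the driver name rbd stands in for whichever plugin (Ceph RBD, ScaleIO, Flocker, etc.) is actually installed:

# docker volume create --driver rbd mysql-data
# docker run -d -v mysql-data:/var/lib/mysql mysql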
Container Storage Use Cases
[Diagram: containers (C) on hosts consume a storage platform through its API; in the third case they persist to external cloud storage.]
• Unshared volumes – use case: running a container-based SQL or NoSQL DB for config/transactions
• Shared volumes – use case: sharing a set of tools or content across app instances
• Persist to external storage – use case: an object store for retention/archival, or DBaaS
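For the third case, a minimal sketch of persisting container data to an object store for retention (the CLI and bucket name are illustrative, assuming an S3-compatible endpoint):

# aws s3 cp /data/backup.tar.gz s3://archive-bucket/backup.tar.gz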
Eliminate the Silos: Converged Big Data Platform
• Compute layer: Hadoop batch analysis and HBase; NoSQL (Cassandra, Mongo, etc.); real-time queries (Spark, Shark, Solr, Impala, Platfora); Big SQL (Pivotal HawQ); and others
• Distributed resource manager or virtualization platform
• File system (HDFS, MapR, GPFS), POSIX, block storage
• Distributed storage platform spanning the hosts
Storage Platforms
Array Technologies
[Chart: capacity from tens of terabytes to tens of petabytes vs. performance from one thousand to ten million IOPS; arrays consolidate and increase performance.]

Storage Media Technology: Capacities & Latencies – 2016
• DRAM: 4TB per host, 1TB device, ~100ns (a trip to Australia in 0.3 seconds)
• NVM: 4TB per host, 1TB device, ~500ns (a trip to Australia in 1.8 seconds)
• NVMe SSD: 48TB per host, 4TB device, ~10us (a trip to Australia in 36 seconds)
• Capacity-oriented SSD: 192TB per host, 16TB device, ~1ms (a trip to Australia in 1 hour)
• Magnetic storage: 384TB per host, 32TB device, ~10ms (a trip to Australia in 12 hours)
• Object storage: ~1s (a trip to Australia in 50 days)

Storage Media Technology
[Chart: 2D NAND "IOPS Flash" and 3D NAND "Capacity Flash" plotted from tens of gigabytes to tens of terabytes and from one thousand to one million IOPS.]

NVDIMMs
• NVDIMM-N (Type 1): DDR3/DDR4; standard DRAM with event-driven transfer to NAND. DRAM + NAND as non-volatile memory, load/store access.
• NVDIMM-F (Type 3) [1]: DDR3/DDR4; DRAM buffer with asynchronous block transfer to NAND. DRAM + NAND as non-volatile memory, block mode.
• Type 4 [1]: DDR4; synchronous transfer to NVM. Next-generation load/store memory.

Intel 3D XPoint™ Technology
• Non-volatile
• 1000x higher write endurance than NAND
• 1000x faster than NAND
• 10x denser than DRAM
• Can serve as system memory and storage
• Cross-point structure allows individual memory cells to be addressed
• Cacheline load/store should be possible
• Stackable to boost density
• Memory cells can be read or written without requiring a transistor, reducing power and cost
1. "Intel and Micron Produce Breakthrough Memory Technology", Intel
Storage Workload Map: SDS Solutions
[Chart: SDS solutions plotted on the workload map, from tens of terabytes to tens of petabytes and from one thousand to one million IOPS; VSAN sits toward the high-IOPS end and HDFS toward the capacity end.]

What's Really Behind a Storage Array?
[Diagram: a typical array is a pair of failover industry-standard x86 servers, each running an OS, a clustered storage stack, data services (de-dup, volumes, mirroring, caching, etc.), and fail-over services; it exposes iSCSI, FC, etc. on the front and reaches shared disk over a SAS or FC network on the back.]

Different Types of SDS (1): Fail-Over Software on Commodity Servers
[Diagram: a software-defined storage virtual appliance (Nexenta, Data ONTAP virtual storage appliance, EMC VNX VA, etc.), i.e. data services, storage stack, and OS, running on your Intel servers from HP, Dell, Super-Micro, etc.]
(1) Better: Software Replication Using Servers + Local Disks
[Diagram: two servers, each with data services, a clustered storage stack, and an OS, kept in sync by software replication across their local disks.]
• Simpler server configuration
• Lower cost
• Less to go wrong
• Not very scalable
(2) Caching: Hot Core / Cold Edge
• Compute layer as an all-flash hot tier: low-latency caching of reads and writes across the hosts
• Behind it, a scale-out storage service or scale-out storage tier reached over S3 or iSCSI
(3) Scale-Out SDS
[Diagram: a compute tier connects over a storage network (iSCSI or a proprietary protocol) to a row of storage nodes, each running data services, a storage stack, and an OS.]
• Scalable
• More fault tolerance
• Rolling upgrades
• More management (separate compute and storage silos)
(4) Easier Management: Hyperconverged SDS (Top-Down Provisioning)
[Diagram: apps run across a cluster of hypervisors, with the data services and storage stack (e.g. VSAN) embedded in the same hosts.]
• Scalable
• More fault tolerance
• Rolling upgrades
• Fixed compute and storage ratio
Storage Interconnects
• Protocols (compute tier): iSCSI, FC, FCoE, NVMe, NVMe over Fabrics
• Hardware transports: Fibre Channel, Ethernet, InfiniBand
• Device connectivity: SAS, SATA, NVMe

Network: Ethernet
• Lossy
• Dominant network protocol
• Enables a converged storage network:
  – iSCSI
  – iSER
  – FCoE
  – RDMA over Ethernet
  – NVMe over Fabrics
Device Interconnects (*there are now PCIe extensions for storage)
• SAS: 6-12G at ~1.2 Gbytes/sec; an HCA on the server's PCIe bus connects SATA/SAS HDDs and SSDs
• NVM over PCIe via an electrical adapter: 4 lanes gen3 x 4 = 4 Gbytes/sec; flash examples include the Intel P3700 at ~3 Gbytes/sec, the Samsung XS1715 NVMe SSD, and future devices
• PCIe serial speeds: Gen 1 at 2 Gbit per lane (256 MB/sec), Gen 2 at 4 Gbit per lane (512 MB/sec), Gen 3 at 8 Gbit per lane (1 GB/sec)

Storage Fabrics
iSCSI (SCSI over TCP/IP on a NIC)
  Pros:
  • Popular: Windows, Linux, and hypervisor friendly
  • Runs over many networks; routable
  • Optional RDMA acceleration (iSER)
  Cons:
  • Limited protocol (SCSI)
  • Performance
  • Setup, naming, endpoint management
FCoE (FC over Ethernet on a NIC)
  Pros:
  • Runs on Ethernet
  • Can easily gateway between FC and FCoE
  Cons:
  • Not routable
  • Needs special switches (lossless Ethernet)
  • Not popular
  • No RDMA option?
NVMe over Fabrics (the future!)
  Pros:
  • Runs on Ethernet
  • Same rich protocol as NVMe
  • In-band administration & data
  • Leverages RDMA for low latency
  Cons:
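From the host side on Linux, attaching a remote NVMe namespace over an RDMA fabric looks roughly like the sketch below (nvme-cli syntax; the address and NQN are illustrative):

# nvme connect -t rdma -a 10.0.0.5 -s 4420 -n nqn.2016-06.io.example:subsys1
# nvme list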
PCIe Storage Fabrics (PCIe Rack Level)
[Diagram: hosts attach to a rack-level PCIe switch, each link at 4 Gbytes/sec, carrying NVMe over PCIe.]

PCIe Shared Storage
[Diagram: an integrated PCIe rack fabric shares drives between hosts, with RAID, namespace QoS, and auto-tiering.]
PCIe Rack-Scale Compute + Storage
[Diagram: compute nodes and storage boxes hang off a shared PCIe fabric, with storage-DMA and RDMA over PCIe between them.]
• PCIe fabric: share disk boxes across hosts
• Storage-DMA: host-to-host RDMA capability, leveraging the same PCIe fabric
NVMe – The New Kid on the Block
• SCSI-like: block read/write transport
• NFS-like: objects/namespaces with read, write, create, delete
• vVol-like: object + metadata + policy
[Diagram: NVMe objects and namespaces layer on the NVMe core, which runs over PCIe or, as NVMe over Fabrics, over RDMA, FC, InfiniBand, and Ethernet.]

Now/Future
Interconnect   Now                Next       Future (Richard's opinion)
SATA           6g                 -          NVMe
SAS            12g                24g        NVMe
PCI-e          Proprietary        4 x 8g     NVMe
SCSI           Over FC/Virtual               NVMe
FC             8g                 32g        NVMe over Fabrics
Ethernet       10g                20/40g     100g
SDS Hardware Futures: Conclusions & Challenges
• Magnetic storage is phasing out
  – Better IOPS per $ with flash since this year
  – Better space per $ after 2016
  – Devices approaching 1M IOPS and extremely low latencies
• Device latencies push for new interconnects and networks
  – Storage (first cache) locality is important again
  – Durability demands low-latency commits across the wire (think RDMA)
• The network is critical
  – Flat 40G networks are the design point for SDS
Beyond Block: Software-Defined Data Services Platforms
Storage Protocols
• SCSI, POSIX, HDFS, SQL flavors, NoSQL, K-V, S3/Swift

Too Many Data Silos
[Diagram: the compute cloud consumes five separate silos (block, object, K-V, DB, big data), each with its own API and its own management.]
• Developers are happy (every silo has an API), but:
  – Lots of management
  – Different SW stacks
  – Different HCLs
  – Walled resources
Option 1: Multi-Protocol Stack
[Diagram: the compute cloud reaches block, object, K-V, DB, and big data through their APIs, all served by an all-in-one software-defined storage stack.]
• Developers are happy
• One stack to manage
• One HCL, shared resources
• Least capable services
Option 2: Common Platform + Ecosystem of Services
[Diagram: the compute cloud reaches MongoDB, Hadoop, MySQL, Ceph, block, etc. through their own APIs, all running on a common software-defined storage platform.]
• Developers are happy
• One platform to manage
• One HCL, shared resources
• Richest services
Opportunities for an SDS Platform
• Provide for the needs of cloud-native data services
  – e.g. MongoDB, Cassandra, Hadoop HDFS
• Advanced primitives in support of distributed data services
  – Policies for storage services: replication, RAID, error handling
  – Fault-domain-aware placement/control
  – Fine-grained policies for performance, e.g. ultra-low latencies for database logs
  – Placement control for performance: locality, topology
  – Group snapshot and replication semantics