Lugano April 2018
“NVMe Takes It All, SCSI Has To Fall” – freely adapted from ABBA
Brave New Storage World
Alexander Ruebensaal
© 2018 ABC SYSTEMS AG. All Rights reserved. 11.4.18
ABC Systems AG – Design, Implementation, Support & Operation of optimized IT Infrastructures – HA & HP – allowing for fail-safe Transport of Applications … since 1981
In the Year 2012 …
NVMe PCIe SSD
64GB/s
Six Years After …
Non-Volatile Memory Express NVMe SSD
NVMe is an innovative Host Controller Interface for using SSDs natively over PCIe. Mainly, it allows for acceleration through parallelism, resulting in reduced I/O overhead and latency.

Form Factors:
• PCIe AIC (Add-in Card)
• M.2
• U.2 [2.5’’]
• NGSFF Next Generation SFF [M.3]
• EDSFF Enterprise & DataCenter SFF [Ruler]
Why NGSFF and EDSFF?
NGSFF Next Generation SFF [M.3] and EDSFF Enterprise & DataCenter SFF [Ruler] vs. U.2 [2.5’’]:
• No costly drive cages with failure points
• No cables to SSDs
• Eliminate the backplane with cooling holes
• Simplified thermal implementation
• Less complicated chassis
• Reduced component cost per SSD
• Simple hot swap with high density capabilities
INTEL OPTANE NVMe
PCIe AIC U.2 2.5’’
NVMe SSD against SAS SSD …
1U: 10x NVMe
2U: 24x NVMe
2U: 48x NVMe
NVMe Storage Protocol is designed to take full Advantage of Flash
NVMe supports 64K commands per queue (SAS 256, SATA 32) and up to 64K queues. These queues are designed such that I/O commands and responses to those commands operate on the same processor core and can take advantage of the parallel processing capabilities of multi-core processors. Each application or thread can have its own independent queue, so no I/O locking is required. NVMe also supports MSI-X and interrupt steering, which prevents bottlenecking at the CPU level and enables massive scalability as systems expand.
NVMe has a streamlined and simple command set that uses less than half the number of CPU instructions to process an I/O request that SAS or SATA does, providing higher IOPS per CPU instruction cycle and lower I/O latency in the host software stack. NVMe also supports enterprise features such as reservations and client features such as power management, extending the improved efficiency beyond just I/O.
Text & Graphics from http://nvmexpress.org
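The queue arithmetic behind the parallelism claim above can be sketched in a few lines (protocol limits as cited on the slide; illustrative arithmetic only, not driver code):

```python
# Outstanding-I/O capacity per protocol, using the limits quoted above.
QUEUE_LIMITS = {
    "SATA": (1, 32),                  # 1 queue x 32 commands
    "SAS": (1, 256),                  # 1 queue x 256 commands
    "NVMe": (64 * 1024, 64 * 1024),   # up to 64K queues x 64K commands each
}

for proto, (queues, depth) in QUEUE_LIMITS.items():
    total = queues * depth
    print(f"{proto}: {queues} queue(s) x {depth} commands = {total:,} outstanding I/Os")
```

With one queue pair pinned per core or thread, each application can submit into its own queue, which is why no I/O locking is required.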
PCIe 3.0 Bus - 64GB/s
A Single NVMe
NVMe uses CPU Lanes directly
CPU – Bus – NVMe Flash
or
CPU – Bus – FC-HBA – Switch(es) – FC-HBA – RAID Ctrl – SAS Enclosure – Disk
8x SAS SSD already saturate the SAS Bus …
-> The Effect is reduced to Electronics vs Mechanics!
1 NVMe uses 4 CPU Lanes
Broadwell: 40 Lanes
Skylake: 48 Lanes
EPYC: 128 Lanes
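A minimal sketch of the lane budget implied above (lane counts from the slide; the drives-per-CPU figure ignores lanes a real design reserves for NICs, slots and the chipset):

```python
# How many x4 NVMe drives each CPU's PCIe lanes could host at most.
# Real designs reserve lanes for NICs, PCIe slots and the chipset.
CPU_LANES = {"Broadwell": 40, "Skylake": 48, "EPYC": 128}
LANES_PER_NVME = 4

for cpu, lanes in CPU_LANES.items():
    print(f"{cpu}: {lanes} lanes -> max. {lanes // LANES_PER_NVME} NVMe drives")
```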
NVMe – new Level of Performance
Are the NVMe too strong, are the CPU too weak …
Lanes: 1 CPU = 48, 1 NVMe = 4

Intel Xeon Scalable Processors -F (with OmniPath):
• Single on-package OmniPath interface
• Incremental to the existing 48 PCIe Lanes
• Single cable connection to QSFP I/O module
• Same socket for Skylake & Skylake-F processors
AMD EPYC CPU
8~32 “Zen” Cores
TDP 120W~180W
8 Memory Channels
Up to 2TB per CPU
Dedicated Security Engine
128 Lanes of High Bandwidth I/O
How to use them?
1x NVMe, 4 Lanes: < 3’500MB/s of 3’938MB/s = 89%
1x SATA SSD, 1 Lane (max. 32 directly supported by CPU): < 540MB/s of 985MB/s = 55%
1x 100GbE, 16 Lanes: < 12’500MB/s of 15’754MB/s = 79%
2x 25GbE, 8 Lanes: < 6’250MB/s of 7’877MB/s = 79%
2x 10GbE, 8 Lanes [standard Interface for comparison]: < 2’500MB/s of 7’877MB/s = 32%
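The percentages above follow from the PCIe 3.0 per-lane rate (8 GT/s with 128b/130b encoding, roughly 985 MB/s usable per lane); a small sketch to reproduce them:

```python
# PCIe 3.0: 8 GT/s per lane, 128b/130b encoding -> usable MB/s per lane.
PER_LANE_MB_S = 8e9 * (128 / 130) / 8 / 1e6   # ~984.6 MB/s per lane

def bus_mb_s(lanes):
    """Aggregate PCIe 3.0 bandwidth for a given lane count, in MB/s."""
    return PER_LANE_MB_S * lanes

# (device throughput in MB/s, lanes) as listed on the slide
DEVICES = {"1x NVMe": (3500, 4), "1x SATA SSD": (540, 1), "1x 100GbE": (12500, 16)}
for name, (mb_s, lanes) in DEVICES.items():
    print(f"{name}: {mb_s} / {bus_mb_s(lanes):.0f} MB/s = {mb_s / bus_mb_s(lanes):.0%}")
```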
Storage Centric Solution Design – Don’t waste Lanes!
Balanced Design of Multi-Socket Server Solutions, regardless of CPU Vendor, is a huge Optimization Challenge!
Per-Component Performance (K IOPS Random / GB/s Sequential, Reads / Writes):
• NVDIMM 32GB: 1100 K IOPS, 17 GB/s
• NVMe SSD, 4 Lanes, 11TB: 800 / 95 K IOPS, 3.35 / 2.4 GB/s
• SATA SSD, 1 Lane, 8TB (max. 32/CPU): 93 / 74 K IOPS, 0.54 / 0.52 GB/s
• 100GbE: 16 Lanes (full bandwidth), 2x 25GbE: 8 Lanes, PCIe x16 Slot: 16 Lanes

Three balanced Designs (Lanes / TB / Mio. IOPS / GB/s / Network Gbps):
• CAPACITY: SATA SSD 32 Lanes (192TB, 2.9 Mio. IOPS, 3 GB/s), 100GbE 32 Lanes (200Gbps), PCIe Slots 48 Lanes
  Total: 112 Lanes, 192TB, 2.9 Mio. IOPS, 3 GB/s, 200Gbps
• IOPS: NVMe SSD 64 Lanes (176TB, 12.8 Mio. IOPS, 53.6 GB/s), 100GbE 32 Lanes (200Gbps), PCIe Slots 16 Lanes
  Total: 112 Lanes, 176TB, 29.8 Mio. IOPS, 53.6 GB/s, 200Gbps
• THROUGHPUT: NVMe SSD 48 Lanes (132TB, 9.6 Mio. IOPS, 40.2 GB/s), 100GbE 48 Lanes (300Gbps), PCIe Slots 16 Lanes
  Total: 112 Lanes, 132TB, 9.6 Mio. IOPS, 40.2 GB/s, 300Gbps
Real-world performance is, of course, application, workload and file system dependent. Assumption: 112 net Lanes available of 128.
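The scenario totals can be reproduced by aggregating per-device specs against the lane budget. A sketch under the slide's assumptions (the `design()` helper is a hypothetical illustration, not a sizing tool):

```python
# Aggregate a storage design from per-device specs (TB, K IOPS, GB/s, lanes).
# Device values follow the slide's table; design() is illustrative only.
DEVICES = {
    "nvme_x4":    dict(tb=11, kiops=800, gbs=3.35, lanes=4),
    "sata_x1":    dict(tb=8,  kiops=93,  gbs=0.54, lanes=1),
    "100gbe_x16": dict(tb=0,  kiops=0,   gbs=0.0,  lanes=16),
}

def design(counts, lane_budget=112):
    """Sum capacity, IOPS, bandwidth and lanes for a mix of devices."""
    totals = {key: sum(n * DEVICES[dev][key] for dev, n in counts.items())
              for key in ("tb", "kiops", "gbs", "lanes")}
    assert totals["lanes"] <= lane_budget, "lane budget exceeded"
    return totals

# THROUGHPUT scenario: 12x NVMe (48 lanes) + 3x 100GbE (48 lanes), 16 lanes left for slots
print(design({"nvme_x4": 12, "100gbe_x16": 3}))
```

Running this reproduces the THROUGHPUT column: 132TB, 9.6 Mio. IOPS and 40.2 GB/s sequential reads from 12 drives.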
Conceptual NVMe-Server Design
Universal “Department Store” Servers:
• might be over- or wrong-sized for SDS

Purpose-built Servers for Efficiency:
• cost-, performance-, power-, space-effective etc.
• allow for lean Architecture
• … HPC, 10’000s in biggest DCs
• e.g. choose from > 100 NVMe Servers with AMD and INTEL CPUs
From Storage-Server to Server-Storage
< 10 million IOPS
1U, 36x NVMe NGSFF: 2x Intel Xeon Scalable CPU, 3x UPI <10.4GT/s, 24x DIMM up to 3TB, 2x PCIe x16 + 1x PCIe x8, 2x 10GBase-T. NVMe < 576TB
1U, 32x NVMe U.2: 2x Intel Xeon Scalable CPU, 3x UPI <10.4GT/s, 24x DIMM up to 3TB, 2x PCIe x16, 2x 10GBase-T. NVMe < 352TB
1U, 32x NVMe EDSFF: 2x Intel Xeon Scalable CPU, 3x UPI <10.4GT/s, 24x DIMM up to 3TB, 2x PCIe x16, 2x 10GBase-T. NVMe < 1’080TB
From Storage-Server to Server-Storage
2U, 48x NVMe U.2 Dual Port, 2x Nodes, each: 2x Intel Xeon E5-2600v4 CPU, 2x QPI <9.6GT/s, 16x DIMM up to 2TB, 1x PCIe x16 + 1x PCIe x8, 2x 10GBase-T, SIOM. NVMe < 528TB
2U, 24x NVMe U.2: 2x Intel Xeon Scalable CPU, 3x UPI <10.4GT/s, 24x DIMM up to 3TB, 2x PCIe x16, 2x 10GBase-T. NVMe < 264TB
2U, 48x NVMe U.2, 4x Nodes, each: 2x Intel Xeon E5-2600v4 CPU, 2x QPI <9.6GT/s, 24x DIMM up to 3TB, 2x PCIe x16 + 1x PCIe x8, SIOM (e.g. 2x 25GbE, 2x 10GbE). NVMe < 528TB
From Storage-Server to Server-Storage
1U All-NVMe & GPU Server (on ABC Booth):
• 4x V100, P100, P40, M10 ...
• 20x NVMe U.2 7mm. NVMe < 80TB
• 2x Intel Xeon Scalable CPU, 3x UPI up to 10.4GT/s
• 24x DIMM up to 3TB
• 2x PCIe x8, 2x 25GbE
• RAID Protection

Storage changes to:
• SERVER-CENTRIC
• SOFTWARE-DEFINED
• HCI Hyper Converged Infrastructure
• Holistic Data Management

NVMe-oF - NVMe over Fabric, JBOF Just a Bunch of Flash
NVMe-oF - NVMe over Fabric
Low Latency Networking: The goal of NVMe over Fabrics is to provide distance connectivity to NVMe devices with no more than 10 micro-seconds (µs) of additional latency over a native NVMe device inside a server.
Use Cases for NVMe-oF:
• A storage system comprised of many NVMe devices, using NVMe over Fabrics with either an RDMA or Fibre Channel interface, making a complete end-to-end NVMe storage solution. This system would provide extremely high performance while maintaining the very low latency available via NVMe.
• Usage of NVMe over Fabrics to achieve the low latency while connected to a storage subsystem that uses more traditional protocols internally to handle I/O to each of the SSDs in that system. This would gain the benefits of the simplified host software stack and lower latency over the wire, while taking advantage of existing storage subsystem technology.
Text & Graphics from http://nvmexpress.org

Storage Accelerations, leveraging hardware offloads:
• ConnectX adapters support NVMe-oF <100Gbps
• BlueField (SoC) Smart NIC combines ConnectX5 with ARM CPU

IBM Spectrum Scale Acceleration, achieved with 24x NVMe:
• The only sub-millisecond overall response time at 0.69ms ORT!
• 2.5x more builds than other Spectrum Scale storage options.
• Higher IOPS and throughput than all other SPEC SFS2014_swbuild results.
• Solution available as Appliance or Software only.

NVMesh Reference Architecture:
• near server-local performance in a linear scale-out remote standard NVMe solution
• NVMesh RA provides the flexibility to create and manage a single, centralized pool of storage, create “right-sized” logical volumes, and even share storage resources with existing compute resources. Also supporting existing applications without changes.

All-NVMe JBOF
1U 32-bay JBOF Just a Bunch Of Flash
• NVMe SSD U.2 hot-swap / NVMe SSD EDSFF hot-swap
• Sync. Mirror
• 4x PCIe-Bus Extension, PCIe 3.0 x16
• 64 GB/s, > 36 Mio. IOPS
• Capacity: 64, 128TB • 256TB • … • 512TB • 1’024TB
• Cluster Interconnect: 2x EDR IB 100Gbps, PCIe 3.0 x16
• Cache Drives: 3D XPoint™ OPTANE Technology, 10x more Performance than NAND via PCIe* NVMe*
• Client Network: 2x EDR IB 100Gbps / 100/40GbE, PCIe 3.0 x16
Supermicro JBOF 32x NVMe SSD U.2
4x Mini-SAS HD x16 Ports, 2x PCI-E 3.0 x16 Slots, 2x IPMI Ports
The Supermicro JBOF supports up to 12 direct attached hosts, making this the go-to storage platform for any high-performance computing application.
Alternatively, the dual PCI-E 3.0 x16 slots can support dual NVMe-oF add-on-cards to enable additional deployment scenarios.
RAID Protection
Flash Technology
• More reliable, fewer Replacements
• Higher Throughput, faster Rebuilds
With a 3.2x lower Annualized Failure Rate (AFR) of SAS/SATA SSDs compared to HDDs, IT departments will spend less time and expense replacing or upgrading storage devices.
RAID Approach
• Hardware-Defined (RAID-Controller)
• Software-Defined (SDS)
• Hybrid: Intel VROC Virtual RAID on CPU
Intel Virtual RAID on CPU – VROC
RAID Function in VMD Volume Management Device
HCI Hyper Converged Infrastructure
8x 2U 4-Node Server:
• Dual CPU, 3x UPI <10.4GT/s
• 24x DIMM
• 6x NVMe U.2
• 1x PCIe Extension x16
• 2x 10GbE

1U JBOF 32x NVMe:
• 4x Mini-SAS HD x16 Ports …
• 2x PCI-E 3.0 x16 Slots
Data Management – not only Data Storing
Conceptual Optimization
LTO LTFS
NVMe
NVMe Takes It All, SCSI Has To Fall
Directions of Movements …

Storage-Technology         up to TB/drive   TB in 1U Server or JBOF   TB in 4U Server or JBOD
RAM (volatile)
Flash (non-volatile):
  EDSFF [Ruler]            32               1’080
  NGSFF [M.3]              16               576
  NVMe U.2 2.5"            11
  M.2                      2
  AIC Add-in Card          8
SSD:
  SAS 2.5"                 8
  SATA 2.5"                11
Disk:
  SAS 15K 2.5"
  SAS 10K 2.5"             1.8
  NL SAS 2.5"              2
  SATA 3.5"                12                                         1’080
Tape:
  LTO                      12 - 30
  IBM TS1150 [Jaguar]      10

Change Horses, add Horses, or use the Gotthard Tunnel …
Image: Gotthardpost, 1873, Johann Rudolf Koller (1828-1905), https://en.wikipedia.org/wiki/Rudolf_Koller
NVMe Takes It All, SCSI Has To Fall
• The PCIe Bus is in the Server
• NVMe is the Protocol for Flash
• 50-100TB NVMe
• PCIe 4.0
“Flat Screens vs Displays”

Data Tiering:
• Hot Data: NVMe
• Somewhat Hot Data: SAS / FC / SATA SSD
• Lukewarm Data: NL SAS / SATA
• Cold Data: LTO [LTFS]
… simplify and win with us and our Partners
Headquarters Zurich: Ruetistrasse 28, CH-8952 Schlieren, Tel +41 43 433 6 433
Branch Office Berne: Giessereiweg 9, CH-3007 Bern, Tel +41 31 3 700 600
Spectrum Scale & Spectrum Protect
http://www.ABCsystems.ch [email protected]
Alexander Ruebensaal [email protected] Other names and brands may be claimed as property of others.