Lugano April 2018

“NVMe Takes It All, SCSI Has To Fall”, freely adapted from ABBA

Brave New Storage World

Alexander Ruebensaal

© 2018 ABC SYSTEMS AG. All Rights reserved. 11.4.18

ABC Systems AG: Design, Implementation, Support & Operation of optimized IT Infrastructures, HA & HP, allowing for fail-safe Transportation of the Applications … since 1981

In the Year 2012 …

NVMe PCIe SSD

64GB/s

Six Years After …

Non-Volatile Memory Express NVMe SSD

NVMe is an innovative host controller interface for attaching SSDs natively over PCIe. Its main benefit is acceleration through parallelism, resulting in reduced I/O overhead and latency.

Form factors:

• M.2
• PCIe (AIC)
• U.2 [2.5”]
• NGSFF Next Generation SFF [M.3]
• EDSFF Enterprise & DataCenter SFF [Ruler]

Why NGSFF and EDSFF?

Compared with U.2 [2.5”]:
• No costly drive cages with failure points
• No cables to SSDs
• No midplane with cooling holes
• Simplified thermal implementation

NGSFF Next Generation SFF [M.3]:
• Less complicated chassis
• Reduced component cost per SSD
• Simple hot swap with high density capabilities

EDSFF Enterprise & DataCenter SFF [ Ruler ]

OPTANE NVMe

PCIe AIC U.2 2.5’’

NVMe SSD against SAS SSD …

10x NVMe in 1U · 24x NVMe in 2U · 48x NVMe in 2U

NVMe Storage Protocol is designed to take full Advantage of Flash

NVMe supports 64K commands per queue (SAS 256, SATA 32) and up to 64K queues. These queues are designed such that I/O commands and responses to those commands operate on the same processor core and can take advantage of the parallel processing capabilities of multi-core processors. Each application or thread can have its own independent queue, so no I/O locking is required. NVMe also supports MSI-X and interrupt steering, which prevents bottlenecking at the CPU level and enables massive scalability as systems expand.
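Those queue limits translate into a striking difference in the ceiling on commands in flight. A back-of-the-envelope Python sketch (protocol maxima from the paragraph above, not shipping defaults; `max_outstanding` is an illustrative helper, not a real API):

```python
# Outstanding-command ceiling per protocol, using the maxima quoted above.
# Pure arithmetic sketch; real deployments use far fewer queues.

PROTOCOLS = {
    # name: (max queues, max commands per queue)
    "NVMe": (64 * 1024, 64 * 1024),
    "SAS":  (1, 256),   # single queue, 256 outstanding commands
    "SATA": (1, 32),    # single queue (NCQ), 32 outstanding commands
}

def max_outstanding(name: str) -> int:
    """Upper bound on commands in flight for a protocol."""
    queues, depth = PROTOCOLS[name]
    return queues * depth

for name in PROTOCOLS:
    print(f"{name}: {max_outstanding(name):,} commands in flight")
```

The per-core, per-thread queue pairs are what let NVMe exploit this ceiling without lock contention.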

NVMe has a streamlined and simple command set that uses less than half the number of CPU instructions to process an I/O request that SAS or SATA does, providing higher IOPS per CPU instruction cycle and lower I/O latency in the host software stack. NVMe also supports enterprise features such as reservations and client features such as power management, extending the improved efficiency beyond just I/O.

Text & Graphics from http://nvmexpress.org

PCIe 3.0 - 64GB/s

A Single NVMe

NVMe uses CPU Lanes directly

CPU – Bus – NVMe Flash

or

CPU – Bus – FC-HBA – Switch(es) – FC-HBA – RAID Ctrl – SAS Enclosure – Disk

8x SAS SSDs already saturate the SAS bus …

→ The effect is reduced to electronics vs. mechanics!

1 NVMe uses 4 CPU lanes:
• Broadwell: 40 lanes
• Skylake: 48 lanes
• EPYC: 128 lanes
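Given four lanes per drive, the per-platform drive ceiling is simple integer division. A quick Python sketch (ignoring lanes reserved for NICs, slots and chipset links; `max_drives` is an illustrative helper):

```python
# How many NVMe drives fit on one CPU's PCIe lanes (x4 per drive),
# before reserving lanes for NICs, add-in slots and chipset links.

LANES_PER_NVME = 4
CPU_LANES = {"Broadwell": 40, "Skylake": 48, "EPYC": 128}

def max_drives(lanes: int) -> int:
    """Upper bound on directly attached x4 NVMe drives."""
    return lanes // LANES_PER_NVME

for cpu, lanes in CPU_LANES.items():
    print(f"{cpu}: {lanes} lanes -> up to {max_drives(lanes)} NVMe drives")
```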

NVMe – new Level of Performance

Are the NVMe too strong, are the CPUs too weak …

Lanes: 1 CPU = 48, 1 NVMe = 4

Intel Xeon Scalable Processors -F (with Omni-Path):
• Single on-package Omni-Path interface
• Incremental to the existing 48 PCIe lanes
• Single cable connection to QSFP I/O module
• Same socket for Skylake & Skylake-F processors

AMD EPYC CPU

8~32 “Zen” Cores

TDP 120W~180W

8 Memory Channels

Up to 2TB per CPU

Dedicated Security Engine

128 Lanes of High Bandwidth I/O

How to use them?

1x NVMe, 4 lanes: < 3’500 MB/s of 3’938 MB/s → 89%
1x SATA SSD, 1 lane (max. 32 directly supported by CPU): < 540 MB/s of 985 MB/s → 55%
1x 100GbE, 16 lanes: < 12’500 MB/s of 15’754 MB/s → 79%
2x 25GbE, 8 lanes: < 6’250 MB/s of 7’877 MB/s → 79%
2x 10GbE, 8 lanes [standard interface for comparison]: < 2’500 MB/s of 7’877 MB/s → 32%
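The utilization percentages above follow from dividing device throughput by raw PCIe 3.0 lane bandwidth, roughly 985 MB/s usable per lane at 8 GT/s with 128b/130b encoding. A small Python sketch; `utilization` is an illustrative helper:

```python
# Link utilization: device throughput vs. raw PCIe 3.0 link bandwidth,
# assuming ~985 MB/s usable per lane (8 GT/s, 128b/130b encoding).

MB_PER_LANE = 985

def utilization(device_mb_s: int, lanes: int) -> int:
    """Percent of the PCIe link a device can actually fill."""
    return round(100 * device_mb_s / (lanes * MB_PER_LANE))

for name, lanes, mb_s in [
    ("1x NVMe",     4,  3_500),
    ("1x SATA SSD", 1,    540),
    ("1x 100GbE",  16, 12_500),
    ("2x 25GbE",    8,  6_250),
    ("2x 10GbE",    8,  2_500),
]:
    print(f"{name}: {utilization(mb_s, lanes)}% of {lanes} lanes")
```

Only NVMe comes close to filling its lanes; a 2x 10GbE NIC on eight lanes wastes two thirds of them.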

Storage Centric Solution Design – Don’t waste Lanes!

A balanced design for multi-socket server solutions, regardless of CPU vendor, is a huge optimization challenge!

Component parameters (K IOPS random, reads/writes; GB/s sequential, reads/writes):
• NVDIMM 32GB: 1’100 K IOPS, 17 GB/s
• NVMe SSD, 4 lanes, 11TB: 800 / 95 K IOPS, 3.35 / 2.4 GB/s
• SATA SSD, 1 lane, 8TB (max. 32/CPU): 93 / 74 K IOPS, 0.54 / 0.52 GB/s
• 100GbE: 16 lanes, full bandwidth
• 25GbE: 4 lanes (8 lanes for 2x 25GbE)
• 2x PCIe x16 slots

Design points on 112 net lanes (TB, IOPS, GB/s, network Gbps):
• CAPACITY (32 SATA lanes): 192 TB, 2.9 Mio IOPS, 3 GB/s, 200 Gbps
• IOPS (64 NVMe lanes, plus NVDIMM): 176 TB, 29.8 Mio IOPS, 53.6 GB/s, 200 Gbps
• THROUGHPUT (48 NVMe lanes): 132 TB, 9.6 Mio IOPS, 40.2 GB/s, 300 Gbps

Real-world performance is, of course, application-, workload- and file-system-dependent. Assumption: 112 net lanes available of 128.
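The NVMe-based design points can be re-derived from the per-drive figures (11TB, 800K read IOPS, 3.35 GB/s sequential read, x4 lanes). A minimal Python sketch; `design_point` is an illustrative helper, not from the slides:

```python
# Re-derive the NVMe rows of the design table: N drives at x4 lanes,
# 11 TB, 800 K read IOPS and 3.35 GB/s sequential read per drive.

DRIVE = {"lanes": 4, "tb": 11, "kiops_read": 800, "gbs_read": 3.35}

def design_point(nvme_lanes: int) -> dict:
    """Aggregate capacity and performance for a given NVMe lane budget."""
    n = nvme_lanes // DRIVE["lanes"]
    return {
        "drives": n,
        "tb": n * DRIVE["tb"],
        "mio_iops": n * DRIVE["kiops_read"] / 1000,
        "gbs": round(n * DRIVE["gbs_read"], 1),
    }

print(design_point(64))  # IOPS-centric lane budget
print(design_point(48))  # throughput-centric lane budget
```

64 lanes yield 16 drives (176 TB, 12.8 Mio IOPS, 53.6 GB/s); 48 lanes yield 12 drives (132 TB, 9.6 Mio IOPS, 40.2 GB/s), matching the table.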


Conceptual NVMe-Server Design

Universal “Department Store” Servers vs. Purpose-built Servers for Efficiency:

• Universal servers might be over- or wrong-sized for SDS.
• Purpose-built servers are cost-, performance-, power- and space-effective and allow for a lean architecture … HPC, 10’000s in the biggest DCs.

e.g. choose from > 100 NVMe servers with AMD and Intel CPUs

From Storage-Server to Server-Storage

< 10 million IOPS, 1U form factor:

• 36x NVMe NGSFF: 2x Intel Xeon Scalable CPU, 3x UPI <10.4GT/s, 24x DIMM up to 3TB, 2x PCIe x16 + 1x PCIe x8, 2x 10GBase-T → NVMe < 576TB

• 32x NVMe U.2: 2x Intel Xeon Scalable CPU, 3x UPI <10.4GT/s, 24x DIMM up to 3TB, 2x PCIe x16, 2x 10GBase-T → NVMe < 352TB

• 32x NVMe EDSFF: 2x Intel Xeon Scalable CPU, 3x UPI <10.4GT/s, 24x DIMM up to 3TB, 2x PCIe x16, 2x 10GBase-T → NVMe < 1’080TB

From Storage-Server to Server-Storage

2U form factor:

• 48x NVMe U.2 Dual Port, 2x nodes: 2x Intel Xeon E5-2600v4 CPU, 2x QPI <9.6GT/s, 16x DIMM up to 2TB, 1x PCIe x16 + 1x PCIe x8, 2x 10GBase-T, SIOM → NVMe < 528TB

• 24x NVMe U.2: 2x Intel Xeon Scalable CPU, 3x UPI <10.4GT/s, 24x DIMM up to 3TB, 2x PCIe x16, 2x 10GBase-T → NVMe < 264TB

• 48x NVMe U.2, 4x nodes: 2x Intel Xeon E5-2600v4 CPU, 2x QPI <9.6GT/s, 24x DIMM up to 3TB, 2x PCIe x16 + 1x PCIe x8, SIOM (e.g. 2x 25GbE, 2x 10GbE) → NVMe < 528TB

From Storage-Server to Server-Storage

1U All-NVMe & GPU Server (on the ABC booth): 4x V100, P100, P40, M10 …; 20x NVMe U.2 7mm; 2x Intel Xeon Scalable CPU; 3x UPI up to 10.4GT/s; 24x DIMM up to 3TB; 2x PCIe x8; 2x 25GbE → NVMe < 80TB

Storage changes to:

• SERVER-CENTRIC
• SOFTWARE-DEFINED
• NVMe-oF - NVMe over Fabric
• JBOF Just a Bunch of Flash
• RAID Protection
• HCI Hyper Converged Infrastructure
• Holistic Data Management

NVMe-oF - NVMe over Fabric

Low Latency Networking: the goal of NVMe over Fabrics is to provide distance connectivity to NVMe devices with no more than 10 microseconds (µs) of additional latency over a native NVMe device inside a server.
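To put the 10 µs goal in perspective, a tiny Python sketch of the implied latency budget; the 80 µs local-read figure is an assumed example, not a number from this deck:

```python
# Latency budget implied by the NVMe-oF design goal: a fabric access may
# add at most 10 us over a local NVMe access. The local read latency
# below is an assumed illustrative figure, not a measurement.

LOCAL_READ_US = 80.0           # assumed local NVMe read latency (example)
MAX_FABRIC_OVERHEAD_US = 10.0  # NVMe-oF design goal

remote_budget_us = LOCAL_READ_US + MAX_FABRIC_OVERHEAD_US
overhead = MAX_FABRIC_OVERHEAD_US / LOCAL_READ_US
print(f"remote read budget: {remote_budget_us:.0f} us ({overhead:.1%} overhead)")
```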

Use Cases for NVMe-oF:

• A storage system comprised of many NVMe devices, using NVMe over Fabrics with either an RDMA or Fibre Channel (FC-HBA) interface, making a complete end-to-end NVMe storage solution. This system would provide extremely high performance while maintaining the very low latency available via NVMe.

• Usage of NVMe over Fabrics to achieve the low latency while connected to a storage subsystem that uses more traditional protocols internally to handle I/O to each of the SSDs in that system. This would gain the benefits of the simplified host software stack and lower latency over the wire, while taking advantage of existing storage subsystem technology.

Storage Accelerations, leveraging hardware offloads:
• ConnectX adapters support NVMe-oF < 100Gbps
• BlueField (SoC) Smart NIC combines ConnectX-5 with ARM CPU, 2x 25GbE

IBM Spectrum Scale Acceleration, achieved with 24x NVMe:
• The only sub-millisecond overall response time, at 0.69ms ORT!
• 2.5x more builds than other Spectrum Scale storage options.
• Higher IOPS and throughput than all other SPEC SFS2014_swbuild results.
• Solution available as appliance or software only.

NVMesh Reference Architecture:
• Near server-local performance in a linear scale-out, remote standard NVMe solution.
• NVMesh RA provides the flexibility to create and manage a single, centralized pool of storage, create “right-sized” logical volumes, and even share storage resources with existing compute resources, while supporting existing applications without changes.

Text & Graphics from http://nvmexpress.org

All-NVMe JBOF

1U 32-bay JBOF Just a Bunch Of Flash

• NVMe SSD U.2 hot-swap / NVMe SSD EDSFF hot-swap, sync. mirror
• 4x PCIe-Bus Extension PCIe 3.0 x16: 64 GB/s, > 36 Mio. IOPS
• Capacity: 64, 128TB • 256TB • … • 512TB • 1’024TB
• Cache Drives: 3D XPoint™ OPTANE Technology, 10x more performance than NAND, via PCIe/NVMe
• Cluster Interconnect: 2x EDR IB 100Gbps, PCIe 3.0 x16
• Client Network: 2x EDR IB 100Gbps / 100/40GbE, PCIe 3.0 x16

Supermicro JBOF 32x NVMe SSD U.2

4x Mini-SAS HD x16 ports, 2x PCI-E 3.0 x16 slots, 2x IPMI ports

The Supermicro JBOF supports up to 12 direct attached hosts, making this the go-to storage platform for any high-performance computing application.

Alternatively, the dual PCI-E 3.0 x16 slots can support dual NVMe-oF add-on-cards to enable additional deployment scenarios.

RAID Protection

Flash Technology

• More reliable, less Replacements

• Higher Throughput, faster Rebuilds

With a 3.2x lower Annualized Failure Rate (AFR) for SAS/SATA SSDs compared to HDDs, IT departments will spend less time and expense replacing or upgrading storage devices.

RAID Approach

• Hardware-Defined (RAID-Controller)

• Software-Defined (SDS)

• Hybrid: Intel VROC Virtual RAID on CPU

Intel Virtual RAID on CPU – VROC

RAID function implemented in the VMD (Volume Management Device)

HCI Hyper Converged Infrastructure

8x 2U 4-Node Server:
• Dual CPU
• 3x UPI <10.4GT/s
• 24x DIMM
• 6x NVMe U.2
• 1x PCIe Extension x16
• 2x 10GbE

1U JBOF 32x NVMe:
• 4x Mini-SAS HD x16 ports
• 2x PCI-E 3.0 x16 Slots
…

Data Management – not only Data Storing

Conceptual Optimization

LTO LTFS

NVMe

NVMe Takes It All, SCSI Has To Fall

Storage technology, up to TB/drive, TB in 1U server or JBOF, TB in 4U server or JBOD (mainstream markers and directions of movement as on the original slide):

RAM: Volatile / Non-Volatile

Flash - NVMe:
• EDSFF [Ruler]: 32 TB/drive, 1’080 TB in 1U
• NGSFF [M.3]: 16 TB/drive, 576 TB in 1U
• U.2 2.5": 11 TB/drive
• M.2: 2 TB/drive
• AIC Add-in Card: 8 TB/drive

SSD:
• SAS 2.5": 8 TB/drive
• SATA 2.5": 11 TB/drive

Disk:
• SAS 15K 2.5"
• 10K 2.5": 1.8 TB/drive
• NL SAS 2.5": 2 TB/drive
• SATA 3.5": 12 TB/drive, 1’080 TB in 4U

Tape:
• LTO: 12 - 30 TB
• IBM TS1150 [Jaguar]: 10 TB

[Image: “Gotthardpost”, 1873, Johann Rudolf Koller (1828-1905), https://en.wikipedia.org/wiki/Rudolf_Koller. Change horses, add horses, or use the Gotthard Tunnel …]

NVMe Takes It All, SCSI Has To Fall

• The PCIe Bus is in the Server
• NVMe is the Protocol for Flash
• 50-100TB NVMe
• PCIe 4.0

“Flat screens vs. Displays”

• Hot Data: NVMe
• Somewhat Hot Data: SAS / FC / SATA SSD
• Lukewarm Data: NL SAS / SATA
• Cold Data: LTO [LTFS]

… simplify and win with us and our Partners

Headquarters Zurich: Ruetistrasse 28, CH-8952 Schlieren, Tel +41 43 433 6 433
Branch Office Berne: Giessereiweg 9, CH-3007 Bern, Tel +41 31 3 700 600

Spectrum Scale · Spectrum Protect

http://www.ABCsystems.ch
[email protected]

Alexander Ruebensaal, [email protected]

Other names and brands may be claimed as the property of others.
