Cisco HyperFlex Architecture Deep Dive: System Design and Performance Analysis
Brian Everitt, Technical Marketing Engineer, @CiscoTMEguy
BRKINI-2016

Agenda

• Introduction

• What’s New

• System Design and Data Paths

• Benchmarking Tools

• Performance Testing

• Best Practices

• Conclusion

About Me

Brian Everitt

• Technical Marketing Engineer with the CSPG BU
  o Focus on HyperFlex performance, benchmarking, quality and SAP solutions
• 5 years as a Cisco Advanced Services Solutions Architect
  o Focus on SAP HANA appliances and TDI, UCS and storage

Cisco Webex Teams

Questions? Use Cisco Webex Teams to chat with the speaker after the session.
How:
1. Find this session in the Cisco Live Mobile App
2. Click "Join the Discussion"
3. Install Webex Teams or go directly to the team space
4. Enter messages/questions in the team space

Webex Teams will be moderated by the speaker until June 16, 2019.
cs.co/ciscolivebot#BRKINI-2016

Let's Celebrate 10 Years of Unified Computing!

• Visit the DC lounge outside the WOS: grab a shirt, get a digital portrait for your social profiles, share your stories!
• Visit the Data Center area inside the WOS to see what's new.

You made it happen. We made it simple. Follow us! #CiscoDC https://bit.ly/2CXR33q

What's New With Cisco HyperFlex 4.0?

HyperFlex Edge
• 2 node (and 4 node) deployments
• Cisco Intersight cloud-based witness service
• Redundant 10GbE option
• 2:1 compute-only node ratio (HX 4.0(2a))

Next Gen Hardware
• All-NVMe nodes
• HyperFlex Acceleration Engine hardware offload card
• 4th Generation Fabric Interconnects and VIC
• Cascade Lake CPUs and Optane DC persistent memory

Management
• One click, full stack deployments and firmware upgrades
• Stretched cluster compute-only nodes and expansions

Scaling Options in HXDP 4.0.1a

• VMware SFF: Converged Nodes 3-32, Compute-Only Nodes 0-32, Expansion Supported, 2:1 max Compute to Converged Node ratio*, Max Cluster Size 64
• VMware LFF (Hybrid only): Converged Nodes 3-16, Compute-Only Nodes 0-32, Expansion Supported, 2:1 max Compute to Converged Node ratio*, Max Cluster Size 48
• VMware All-NVMe (Requires Enterprise License): Converged Nodes 3-16, Compute-Only Nodes 0-16, Expansion Supported, 1:1 max Compute to Converged Node ratio*, Max Cluster Size 32
• Hyper-V SFF: Converged Nodes 3-16, Compute-Only Nodes 0-16, Expansion Supported, 1:1 max Compute to Converged Node ratio*, Max Cluster Size 32
• Hyper-V LFF, excluding All-NVMe (Hybrid only): Converged Nodes 3-16, Compute-Only Nodes 0-16, Expansion Supported, 1:1 max Compute to Converged Node ratio*, Max Cluster Size 32

* 2:1 – Enterprise license (HXDP-P) if # of Compute > # of Converged Nodes. 1:1 – Standard license (HXDP-S) if # of Compute <= # of Converged Nodes.

Scaling Options in HXDP 4.0.1a (cont.)

• Edge: Converged Nodes 2-4, Expansion Not Currently Supported, 1GbE and 10GbE Network Options Available, Single and Dual Switch Support, Deploy and Manage via Cisco Intersight, HX*220 Nodes Only
• Stretched Cluster** SFF: Converged Nodes 2-16/Site (4-32/Cluster), Compute-Only Nodes 0-21/Site (0-42/Cluster) [Limited by Max Cluster Size], Expansion Supported***, 2:1 max Compute to Converged Node ratio*, Max Cluster Size 32/Site, 64/Cluster
• Stretched Cluster** LFF, excluding All-NVMe (Hybrid only): Converged Nodes 2-8/Site (4-16/Cluster), Compute-Only Nodes 0-16/Site (0-32/Cluster), Expansion Supported***, 2:1 max Compute to Converged Node ratio*, Max Cluster Size 24/Site, 48/Cluster

* 2:1 – Enterprise license (HXDP-P) if # of Compute > # of Converged Nodes, 1:1 – Standard license (HXDP-S) if # of Compute <= # of Converged Nodes. ** Stretched cluster requires Enterprise license (HXDP-P). *** Requires uniform expansion of converged nodes across both sites.

Invisible Cloud Witness, Why?

(Diagram: a traditional 2-node ROBO deployment at Site #1, VMs on a hypervisor at each node, connected over the WAN to a 3rd site hosting witness VMs.)

1. Consistency of the file system is essential.
2. To guarantee consistency of a file system, a quorum is needed.
3. In a HyperFlex cluster a quorum requires a majority, or 2 of 3 nodes available and in agreement in a cluster.
4. Traditional 2-node ROBO solutions require a 3rd site for witness VMs.
5. Cisco's innovation of the Intersight Invisible Cloud Witness virtually eliminates the costs of deploying a 3rd site / witness VMs for 2-node HCI clusters ($$ CapEx & OpEx $$).
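The quorum requirement can be shown with a minimal sketch (a hypothetical helper only, not HX internals): with just the two converged nodes there are only two votes, so losing either node leaves no majority; the Invisible Cloud Witness contributes the third, tie-breaking vote.

```python
# Minimal quorum illustration for a 2-node cluster plus a cloud witness.
# Hypothetical example only; not the HyperFlex implementation.

def has_quorum(votes_present: int, total_votes: int) -> bool:
    """A quorum requires a strict majority of the configured votes."""
    return votes_present > total_votes // 2

# Two converged nodes plus the Invisible Cloud Witness = 3 votes total.
print(has_quorum(3, 3))   # True: both nodes up, witness reachable
print(has_quorum(2, 3))   # True: one node down, survivor + witness is still a majority
# Without a witness only 2 votes exist; losing one node leaves 1 of 2,
# which is not a majority, so the file system cannot safely continue.
print(has_quorum(1, 2))   # False
```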

Benefits of Invisible Cloud Witness Architecture

(Diagram: 2-node ROBO Site #1 through Site #2000, each connected over the WAN to the Intersight witness service.)

• No additional license cost; included in an HX Edge subscription
• No requirement for a 3rd site or existing infrastructure
• Cloud-like operations, nothing to manage:
  o No user interface to monitor
  o No user-driven software updates
  o No setup, configuration or backup required
  o No scale limitations
• Security:
  o Realtime updates with the latest security patches
  o All communications encrypted via TLS 1.2
  o Uses standard HTTPS port 443; no firewall configuration required
• Built on an efficient protocol stack:
  o No periodic heartbeats
  o No cluster metadata or user data transferred to Intersight
  o Tolerates high latency and lossy WAN connections

The All New HyperFlex Edge with Full Lifecycle Management

• Multi-Cluster Install
• Invisible Cloud Witness Service
• On-Going Management Including Full Stack Upgrades
• Connected TAC Experience

Scales from 1 node to 2000+ nodes.

Introducing All-NVMe HyperFlex

• All-NVMe UCS C220 M5 servers, optimized for All-NVMe
• Integrated fabric networking
• Ongoing I/O optimizations
• Co-engineered with Intel VMD for hot-plug and surprise removal cases
• Reliability, Availability, Serviceability (RAS) assurance
• NVMe Optane cache, NVMe capacity
• Up to 32TB/node, 40Gb networking
• Fully UCSM managed for FW, LED, etc.
• Higher IOPS, lower latency
• Interop with HW acceleration to deliver the highest performance

What's Special About NVMe Drives: Unlocking the Drive Bottleneck

• NVMe is a new protocol and command set for PCIe-connected storage; a SAS/SATA system speaks ATA or SCSI through a storage controller (6 Gb/s SATA, 12 Gb/s SAS bus), while an NVMe system speaks NVMe directly over PCIe at 8/16/32+ Gb/s to the CPU.
• Drives are still NAND or new 3D XPoint media.
• 65,535 queues vs. 1.
• The promise: higher IOPS, lower latency and fewer CPU cycles per I/O.

All-NVMe: Opportunity & Challenges

Performance Optimized
• NVMe drives have higher perf than SAS/SATA SSDs
• Need PCIe lanes in the platform to leverage drive perf
• Need low latency, high bandwidth N/W

RAS (Reliability, Availability, Serviceability)
• Performance without compromising RAS
• Ability to handle hotplug, surprise removal, etc.
• Higher perf requires a very high endurance write cache

HCI Software
• No compromise on manageability
• A software stack that can leverage the higher perf
• Software optimization required in the N/W and I/O datapath

Why Has All-NVMe HCI Not Happened Already?
Many technical challenges needed to be overcome first…

• All-NVMe Hardware (Cisco HXAF220c-M5N): a specific all-NVMe design with a PCIe M-switch overcomes the limited PCIe lanes for drives and peripherals; Intel Optane provides high endurance caching disks.
• RAS for NVMe Drives (Reliability, Availability, Serviceability): with no storage controller in the path, where is the RAS delivery point? Hot plug, surprise removal, firmware and LED management all need a home.
• Capable Network (to match drive capabilities): the network is central to HCI, all writes traverse the network, and both bandwidth and latency matter; Cisco UCS 10/25/40 GbE is the solution.
• HCI SW Optimization (to leverage the benefits): is the HCI software architected to leverage the benefits? HX optimizes both the network I/O path and the local I/O path.

RAS in SATA/SAS vs NVMe Systems
Hot Plug, Surprise Removal, Firmware Management, LED etc.

• SAS system: the storage controller sits between the CPU's PCIe lanes and the SAS/SATA SSDs and acts as the RAS delivery point.
• NVMe system: NVMe SSDs attach directly to PCIe with no controller; surprise removal of an NVMe drive may crash the system, and a hot plug specification for NVMe is not there yet.

Absence of RAS Means... BSOD? PSOD?

Hyper-V – Blue Screen due to surprise PCIe SSD removal

ESXi – Purple Screen due to surprise PCIe SSD removal

Intel Volume Management Device (VMD) to the Rescue
Provides a means to address hot plug, surprise removal and LEDs.

• Intel® VMD is a new technology to enhance solutions with PCIe storage
• Supported for Windows, Linux, and ESXi
• Multi-SSD vendor support
• Intel® VMD enables:
  o Isolating fault domains for device surprise hot-plug and error handling
  o A consistent framework for managing LEDs
(Diagram: before Intel VMD, PCIe SSDs attach directly to the PCIe root complex of the Intel Xeon E5 v4 processor; after Intel VMD, the OS's Intel VMD driver manages them through the VMD in the Intel Xeon Scalable processor's root complex.)

A Closer Look at Intel® VMD

• Intel VMD is an "integrated end point" in the Intel® Xeon® Scalable Processor PCIe root complex.
• It maps entire sub PCIe trees (child devices) into the VMD address space.
• VMD provides a standard SW interface to set up and manage the domain (enumerate, event/error handling, LEDs), mostly out of the I/O path.
• 3 VMD domains per Intel® Xeon® processor, each managing 16 PCIe lanes.
(Diagram: the OS PCI bus, block I/O and storage interfaces sit above the Intel VMD and Intel NVMe drivers; PCIe configuration, errors and events flow through each x16 VMD domain, which connects through a switch to the NVMe SSDs.)

RAS in an NVMe System With VMD
Hot Plug, Surprise Removal, Firmware Management, LED etc.

With VMD, the RAS delivery point in an NVMe system becomes the VMD hardware together with the VMD and NVMe drivers in the HCI controller VM, playing the role the storage controller plays in a SAS system.

Intel® Optane™ SSD DC P4800X: The Ideal Caching Solution
Lower and more consistent latency + higher endurance = more efficient

• Average read latency under random write workload(1): roughly 40x lower for the Intel® Optane™ SSD DC P4800X than for the Intel® SSD DC P3700.
• Drive Writes Per Day (DWPD)(2): 60 DWPD for the P4800X vs. 3.0 DWPD for the Intel® SSD DC P4600 (3D NAND).
• Cache as a % of storage capacity(3): roughly 8x lower with the P4800X as cache than with the P4600 (3D NAND) as cache.

Lower latency + higher endurance = greater SDS system efficiency

1. Responsiveness defined as average read latency measured at queue depth 1 during 4k random write workload. Measured using FIO 2.15. Common Configuration - Intel 2U Server System, OS CentOS 7.2, kernel 3.10.0-327.el7.x86_64, CPU 2 x Intel® Xeon® E5-2699 v4 @2.20GHz (22 cores), RAM 396GB DDR @ 2133MHz. Configuration – Intel® Optane™ SSD DC P4800X 375GB and Intel® SSD DC P3700 1600GB. Latency – Average read latency measured at QD1 during 4K Random Write operations using fio-2.15. results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system. 2. Source – Intel Data Sheets 3. Source – Intel: General proportions shown for illustrative purposes.

HyperFlex All-NVMe Delivers Higher Performance
Engineered and optimized All-NVMe (all numbers relative to HyperFlex All-Flash):

• General Workloads: 28-57% higher IOPS, 22-40% lower total latency
• Databases: 66-90% higher IOPS, 31-39% lower total latency
• Complex Mixed Workload: 57-71% higher IOPS, 34-37% lower total latency

ESG Report: https://research.esg-global.com/reportaction/ciscohyperflexallnvme/Toc? Note: Results are not to be used as an application or cluster performance sizer. Leverage the HX Sizer

HyperFlex I/O Data Paths
Learn on your own?

• 10 slides of additional content hidden in this deck
• Slides are viewable in the downloaded PDF version

• Covers the HX read and write I/O data pathways and caching functions

• Content was shown in sessions with the same ID at Cisco Live Las Vegas 2017 and Orlando 2018

Input/Output Lifecycles: Writes

(Diagram, repeated through the following steps: three nodes, each running guest VMs, the IOvisor, and the controller VM on ESXi; each node's caching SSD holds an active/secondary local write log pair plus two passive write log pairs, above the capacity disks.)

1. A VM writes data to the HX datastore; the write is received by the IOvisor, hashed to determine the primary node for that write, and sent via the network to that node's controller VM.

2. The controller VM compresses the data, commits the write to the active local write log on the caching SSD, and sends duplicate copies to the other nodes.

3. The additional nodes commit the write to their passive write logs on their caching SSDs.

4. After all 3 copies are committed, the write is acknowledged back to the VM via the IOvisor and the datastore like a normal I/O.

5. Additional write I/O is handled the same way, until the active local write log segment of a node becomes full; then a destage operation is started.
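The write sequence above can be condensed into a short sketch. This is a simplified illustration of the described flow (hash to a primary, compress, commit to write logs, replicate, acknowledge) using assumed names such as pick_primary and Node; it is not the HXDP source code.

```python
# Simplified illustration of the write path described in steps 1-4 above.
# All names are hypothetical; this is not the actual HXDP implementation.
import zlib

REPLICATION_FACTOR = 3

class Node:
    def __init__(self, name):
        self.name = name
        self.active_write_log = []   # active write log segment on the caching SSD

    def commit(self, key, compressed_block):
        self.active_write_log.append((key, compressed_block))

def pick_primary(nodes, key):
    # Step 1: the IOvisor hashes the write to choose the primary node.
    return nodes[hash(key) % len(nodes)]

def write(nodes, key, data):
    primary = pick_primary(nodes, key)
    compressed = zlib.compress(data)                      # step 2: compress first
    others = [n for n in nodes if n is not primary]
    for node in [primary] + others[:REPLICATION_FACTOR - 1]:
        node.commit(key, compressed)                      # steps 2-3: local + passive logs
    return "ack"                                          # step 4: ack after all copies commit

nodes = [Node("node1"), Node("node2"), Node("node3")]
print(write(nodes, "vm1-block-42", b"guest VM data"))     # -> ack
```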

Input/Output Lifecycles: Destage

6. The active and secondary log segments are flipped; the full segments are ready to destage while the empty segments receive new incoming writes.

7. The data is deduplicated, and new blocks are committed to the capacity layer disks. Duplicates generate only a metadata update.

8. Once all data has been destaged, the secondary log segments are emptied, ready to flip again once the active segments become full.
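A similarly simplified sketch of the destage behavior described in steps 6 through 8 (segment flip, dedupe, commit to capacity, metadata-only updates for duplicates); the class and field names are hypothetical, not HXDP internals.

```python
# Simplified illustration of the destage flow described above (steps 6-8).
# Hypothetical names; not the actual HXDP implementation.
import hashlib

class CachingTier:
    def __init__(self):
        self.active = []        # segment receiving new writes
        self.secondary = []     # full segment waiting to destage
        self.capacity = {}      # fingerprint -> block (capacity SSD/HDD layer)
        self.metadata = {}      # key -> fingerprint

    def flip(self):
        # Step 6: swap segments; the empty one keeps taking new writes.
        self.active, self.secondary = self.secondary, self.active

    def destage(self):
        # Step 7: dedupe, then commit only new unique blocks to capacity.
        for key, block in self.secondary:
            fp = hashlib.sha256(block).hexdigest()
            if fp not in self.capacity:
                self.capacity[fp] = block       # new unique block
            self.metadata[key] = fp             # duplicates: metadata update only
        # Step 8: empty the segment, ready for the next flip.
        self.secondary = []

tier = CachingTier()
tier.active = [("k1", b"A"), ("k2", b"B"), ("k3", b"A")]  # pretend the segment is full
tier.flip()
tier.destage()
print(len(tier.capacity), tier.metadata)  # 2 unique blocks, 3 metadata entries
```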

Input/Output Lifecycles: Read Caches

(Diagram: a single HX node showing the L1 DRAM cache in the controller VM and the L2 read cache segment on the caching SSD.)

• All systems have a small amount of RAM in the controller VMs set aside for a read cache (L1).
• Hybrid converged nodes have a large dedicated read cache segment on their caching SSD (L2).
• During destage, a copy of the most recently written data is also copied to the L2 read cache.

Input/Output Lifecycles: Reads

Requests to read data are also received by the IOvisor and serviced in the following order:
1. Active write log of the primary node
2. Passive write logs
3. L1 (DRAM) cache
4. L2 read cache (hybrid only)
5. Capacity SSDs/HDDs

Reads are decompressed and cached in L1/L2 as appropriate, and returned to the requesting VM.
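The read service order above can be sketched as a lookup cascade; again a hypothetical illustration, not the HXDP implementation, using dictionary stand-ins for the write logs, caches and capacity tier.

```python
# Simplified illustration of the read service order described above.
# Hypothetical structure; not the actual HXDP implementation.
import zlib

def read(key, active_wl, passive_wls, l1_cache, l2_cache, capacity, hybrid=False):
    # 1. Active write log of the primary node
    if key in active_wl:
        return zlib.decompress(active_wl[key])
    # 2. Passive write logs
    for wl in passive_wls:
        if key in wl:
            return zlib.decompress(wl[key])
    # 3. L1 (controller VM DRAM) cache
    if key in l1_cache:
        return l1_cache[key]
    # 4. L2 read cache on the caching SSD (hybrid nodes only)
    if hybrid and key in l2_cache:
        return zlib.decompress(l2_cache[key])
    # 5. Capacity SSDs/HDDs; decompress and populate the cache as appropriate
    data = zlib.decompress(capacity[key])
    l1_cache[key] = data
    return data

compressed = zlib.compress(b"guest data")
print(read("k1", {}, [{}], {}, {}, {"k1": compressed}, hybrid=True))
```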

HyperFlex Acceleration Engine

• PCIe FPGA offloading HX filesystem compression
• Higher compression, lower $/GB
• Better VM density, lower $/VM
• Performance boost plus lower read/write latency
• CPU utilization and working capacity improvements
• Other service offloads in the future: EC, crypto, hash, etc.
• Purpose built and co-engineered software and hardware
• Ongoing improvements via new images and driver optimizations
• Improved TCO

HyperFlex Data Path with Accelerator

(Diagram: the controller VM I/O pipeline (cache, dedupe, flush to capacity) shown without vs. with the Accelerator Card; with the card, compression/decompression is offloaded through a driver using PCI passthrough.)

• Cisco custom compression engine FPGA image
• Driver tunings also critical to achieve best performance: queueing, polling, interrupts

HyperFlex Acceleration Engine (PID: HX-PCIE-OFFLOAD-1)

Requires:
• Enterprise License HXDP-P (1-5YR)
• Homogeneous cluster – all nodes must contain Hercules
• Greenfield – activation at cluster install

Supported in specific models of nodes:
• HXAF240C-M5SX, All-Flash
• HX240C-M5SX, Hybrid SFF
• HX240C-M5L, Hybrid LFF
• HXAF220-M5N, All-NVMe

Not supported with HX 4.0(1a):
• Stretched clusters
• Edge clusters
• Hyper-V hypervisor
• SED drives

HX Acceleration Engine – HW Compression

HXDP performs SW compression using the Snappy format – efficient, high speed and a reasonable compression ratio.

Acceleration Engine utilizes GZIP in hardware which has a higher compression ratio, and is more computationally intensive than Snappy.

• Compression results will vary depending upon the type, content and compressibility of the data.
• Previously compressed file formats like JPEG and video are not well suited for additional compression.
• Internal tests show roughly 10% additional compression savings.

Corpus compression results (sizes in bytes):
• Calgary corpus: original total size 3,141,622; compressed size without card 1,814,576 (42.24% savings); compressed size with card 1,759,282 (44.00% savings); 3.00% additional savings with card
• Silesia corpus: original total size 211,938,580; without card 114,586,735 (45.93% savings); with card 107,982,280 (49.05% savings); 5.80% additional savings with card
• Canterbury corpus: original total size 2,810,784; without card 1,393,861 (50.41% savings); with card 1,182,089 (57.94% savings); 15.20% additional savings with card
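The savings columns above are simple ratios of compressed to original size; the short sketch below (values copied from the table) shows the arithmetic.

```python
# How the savings percentages in the table above are derived (values from the slide).
corpora = {
    # name: (original_bytes, compressed_without_card, compressed_with_card)
    "Calgary corpus":    (3_141_622,   1_814_576,   1_759_282),
    "Silesia corpus":    (211_938_580, 114_586_735, 107_982_280),
    "Canterbury corpus": (2_810_784,   1_393_861,   1_182_089),
}

for name, (orig, sw, hw) in corpora.items():
    savings_sw = 1 - sw / orig      # Snappy in software
    savings_hw = 1 - hw / orig      # GZIP on the Acceleration Engine
    additional = 1 - hw / sw        # extra reduction contributed by the card
    print(f"{name}: {savings_sw:.2%} without card, "
          f"{savings_hw:.2%} with card, {additional:.2%} additional")
```

The computed additional savings (about 3%, 5.8% and 15.2%) closely match the slide's figures.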

Benchmarking Tools

Vdbench

• Written by Henk Vandenbergh at Oracle

• Java application that can run on almost all platforms with a JRE

• Configurations to test are written in text based job files

• Highly customizable

• Easy server/client job execution via SSH or RSH

• Easy to read output in HTML files

• Has become the standard tool for the Cisco HX BU TME team

• Issues:
  o Thread and JVM count can make the OIO per disk difficult to calculate

Vdbench Example Job File

dedupunit=8k
dedupratio=1.25
compratio=1.4
hd=default,user=root,shell=ssh,jvms=1
hd=vdbench1,system=192.168.100.11
hd=vdbench2,system=192.168.100.12
hd=vdbench3,system=192.168.100.13
hd=vdbench4,system=192.168.100.14

sd=default,openflags=o_direct,threads=16
sd=sd1,host=vdbench*,lun=/dev/sdb

wd=wd1,sd=sd*,xfersize=8k,rdpct=70,seekpct=100

rd=rd1,wd=wd1,iorate=max,elapsed=1h,interval=5,sleep=10m
rd=rd2,wd=wd1,iorate=max,elapsed=1h,interval=5,sleep=10m
rd=rd3,wd=wd1,iorate=max,elapsed=1h,interval=5

Vdbench Tips

• Calculate the minimum number of total threads required:

(# of load generators) X (# of disks per load generator) X (# of JVMs per load generator)

• Total # of threads defined depends on where in the job file it is specified:
  o If specified in sd=default and only one disk is defined, the value is the total # of threads for the entire test
  o If specified in sd=default and multiple disks are defined, it applies to all disks and the total is the sum of all values
  o If specified per disk, the total # of threads is the sum of all values

• Calculate the OIO per disk (a worked example follows these tips):

(total threads defined) / ((# of load generators) X (# of disks per load generator))

• Each JVM can service ~100K IOPs, so specifying jvms=1 makes thread calculation easier

• Build a load generator template VM with the SSH keys of the master for no password logins

• Multi-homed master systems must resolve the load generator facing IP to its hostname via /etc/hosts

• Redirect output to a working web server folder, and use .tod for time of day stamps in the folder name
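A worked example of the two formulas above, with hypothetical helper names and example values:

```python
# Worked example of the Vdbench thread and OIO-per-disk formulas above.
# Helper names and values are hypothetical, for illustration only.

def minimum_total_threads(load_generators, disks_per_generator, jvms_per_generator):
    # Minimum number of total threads required.
    return load_generators * disks_per_generator * jvms_per_generator

def oio_per_disk(total_threads_defined, load_generators, disks_per_generator):
    # Outstanding I/O per disk for a given total thread definition.
    return total_threads_defined / (load_generators * disks_per_generator)

# Example: 8 load generator VMs, 1 disk each, 1 JVM each (jvms=1).
print(minimum_total_threads(8, 1, 1))   # at least 8 threads
# If the job file ends up defining 128 threads in total across those disks:
print(oio_per_disk(128, 8, 1))          # 16.0 OIO per disk
```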

Vdbench Example Job Execution

Ensure each load generator VM has the vdbench java executable in the same location, and copy SSH keys so SSH logins happen without a password.

Since the vdbench job file specifies the clients as well, just run the test:
vdbench -f 7030_4k.vdb -o /var/www/html/7030_4k.tod

• Easy to loop multiple runs in sequence:
  for i in {1..3} ; do vdbench -f 7030_4k.vdb -o /var/www/html/ && sleep 3600 ; done
• Simple scheduling of runs with Linux at:
  # cd
  # at 1am tomorrow
  at> vdbench -f 7030_4k.vdb -o /var/www/html/output/7030_4k.tod
  at> (Press CTRL+D to end the job creation)
  # atq (view queued at jobs)
  # tail -f /var/spool/at/spool/ (monitor output of the running at job)

Vdbench Output

Cisco HX Bench

• Cisco developed tool to automate many facets of benchmark testing

• Uses Vdbench for the testing

• Download at https://hyperflexsizer.cloudapps.cisco.com (CCO login required)

• Deploys a master test controller VM from an OVA file on your HX cluster

• HTML GUI makes test configuration easy

• Automatically deploys lightweight Ubuntu VMs as the test load generators

• Use built in test scenarios or build your own, queue multiple runs and compare the results.

• Test results and charts are customizable and even easier to view than the default HTML output from vdbench

Cisco HX Bench User Interface

Other Benchmark Tools

• Iometer
  o Defunct open source tool, no active development in 4 years
  o Data is highly dedupable, results cannot be trusted
• FIO
  o Similar to Vdbench, highly customizable, output harder to consume
• HCIBench
  o VMware-written wrapper for Vdbench

Performance Test Results

Base Configurations for Vdbench Tests

Four cluster configurations were tested, all on HX 4.0.1a: HX All-Flash, HX All-Flash + Accelerator, HX All-NVMe, and HX All-NVMe + Accelerator.

Hardware config:
• All-Flash clusters: 8 x HXAF240 M5S, each with a 375GB Optane write log drive, 10 x 960GB data SSDs, VIC 1387, Xeon 6154, 384 GB RAM
• All-NVMe clusters: 8 x HXAF220 M5N, each with a 375GB Optane write log drive, 6 x 1TB data drives, VIC 1387, Xeon 6142, 384 GB RAM

Benchmark setup:
• Cisco UCS Firmware 4.0(2d)
• Vdbench 5.04.06
• 8 Linux VMs per host (64 total)
• Each VM has 1 x 200GB virtual disk on HXAF240, 125GB on HXAF220; HX filesystem ~50% full before dedupe and compression
• Priming: 64K writes, 100% sequential
• Tests run 3 times minimum for 1 hour each, and then averaged

Benchmark 1 Performance Results Using Vdbench Tests
8K 100% Sequential Write Workload (Maximum Sequential Write, 16 OIO per disk)

• ~21% IOPs gain and ~21% latency reduction for All-NVMe compared to All-Flash
• ~10% IOPs gain and ~9% latency reduction when adding the Accelerator Card
(Chart: 4.0 All Flash 166,177 IOPs at 6.16 ms; 4.0 AF+Card 181,941 IOPs at 5.62 ms; 4.0 All NVMe 200,973 IOPs at 5.09 ms; 4.0 AN+Card 221,049 IOPs at 4.63 ms.)

Benchmark 2 Performance Results Using Vdbench Tests
8K 100% Random Read Workload (Maximum Random Read, 32 OIO per disk)

• ~12% IOPs loss for All-Flash with the Accelerator Card; the loss is only seen in the 100% read test, not in mixed R/W, and differences in queues and block interaction are being worked to improve the result
• ~2% IOPs loss for All-NVMe with the Accelerator Card
• Lower All-NVMe results attributable to fewer disks and lower CPU clock speeds
(Chart: 4.0 All Flash 701,277 IOPs at 2.92 ms; 4.0 AF+Card 617,102 IOPs at 3.32 ms; 4.0 All NVMe 669,376 IOPs at 3.06 ms; 4.0 AN+Card 662,464 IOPs at 3.09 ms.)

Benchmark 3 Performance Results Using Vdbench Tests
100% Random 70/30% R/W with Varying Block Sizes (8 OIO per disk)

IOPs and read/write latency (ms) by block size:
• All-Flash: 4K 170,759 (R 2.04 / W 5.25); 8K 176,292 (2.90 / 5.51); 16K 149,271 (2.07 / 6.61); 64K 67,118 (5.19 / 13.41); 256K 21,906 (12.36 / 49.01)
• All-Flash + Accelerator: 4K 185,103 (1.94 / 4.70); 8K 202,958 (2.52 / 4.55); 16K 170,993 (1.90 / 5.54); 64K 76,077 (4.86 / 11.19); 256K 24,699 (11.82 / 41.41)
• All-NVMe: 4K 220,171 (1.82 / 3.50); 8K 253,415 (2.02 / 3.38); 16K 204,523 (1.64 / 4.50); 64K 94,009 (3.56 / 9.83); 256K 33,298 (9.89 / 28.08)
• All-NVMe + Accelerator: 4K 233,351 (1.79 / 3.15); 8K 283,736 (1.80 / 2.90); 16K 219,688 (1.63 / 3.95); 64K 99,438 (3.69 / 8.53); 256K 35,181 (10.12 / 24.79)

• 29-50% IOPs gains with HyperFlex All-NVMe compared to All-Flash
• The All-NVMe cluster without HyperFlex Accelerator Cards already outperforms All-Flash with the cards
• 5-15% IOPs gains with HyperFlex Accelerator Cards
• 10-17% write latency reduction with HyperFlex Accelerator Cards

Benchmark 4 Performance Results Using Vdbench Tests
8K 100% Random 70/30% R/W IOPs Curve

IOPs (IOPs STDV, average latency in ms) at each load point:
• All-Flash: 20% 35,825 (0.3%, 0.59); 40% 71,632 (0.3%, 1.01); 60% 107,422 (0.2%, 1.69); 80% 143,221 (0.3%, 2.12); 100% 178,981 (10.7%, 2.86)
• All-Flash + Accelerator: 20% 42,263 (0.3%, 0.60); 40% 84,531 (0.2%, 0.94); 60% 126,748 (0.2%, 1.57); 80% 169,015 (0.3%, 1.94); 100% 211,220 (7.8%, 2.43)
• All-NVMe: 20% 51,699 (0.3%, 0.71); 40% 103,397 (0.3%, 1.03); 60% 155,011 (0.4%, 1.28); 80% 206,674 (0.7%, 1.63); 100% 258,316 (12.1%, 1.98)
• All-NVMe + Accelerator: 20% 56,829 (0.3%, 0.68); 40% 113,614 (0.3%, 0.95); 60% 170,384 (0.4%, 1.22); 80% 227,176 (0.8%, 1.49); 100% 283,907 (11.0%, 1.80)

• The curve test measures performance at various load levels, starting with a 100% unthrottled test, then running the workload again at different percentages of that maximum result
• We calculate stability by measuring the standard deviation of IOPs as a percentage of the final average result (see the sketch after this list)
• Maximum speed, unthrottled tests will always have more variability due to more aggressive background tasks
• 20% through 80% values are typically very low because the system is not being pushed to the limit
• Every configuration shows higher IOPs at lower latency than the previous result, up to 59% improvement
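A minimal sketch of the stability metric referenced above: the standard deviation of the per-interval IOPs samples expressed as a percentage of the run's average. The sample values below are made up for illustration.

```python
# Sketch of the "IOPs STDV" stability metric used in the curve tables:
# standard deviation of per-interval IOPS samples as a percentage of the
# run's average IOPS. Sample values are invented for illustration.
from statistics import mean, pstdev

def iops_stdv_percent(iops_samples):
    return pstdev(iops_samples) / mean(iops_samples) * 100

steady_run  = [143_100, 143_350, 143_180, 143_260]   # throttled load point
unthrottled = [195_000, 160_000, 185_000, 175_000]   # 100% load, more variance

print(f"{iops_stdv_percent(steady_run):.1f}%")    # well under 1%
print(f"{iops_stdv_percent(unthrottled):.1f}%")   # several percent
```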

(Chart: "8K Blocks: All-Flash IOPS and latency curve" from 20% to 100% load, with and without the Accelerator Card; values as in the table above.)

(Chart: "8K Blocks: All-NVMe IOPS and latency curve" from 20% to 100% load, with and without the Accelerator Card; values as in the table above.)

Benchmark 5 Performance Results Using Vdbench Tests
I/O Blender 100% Random 70/30% R/W IOPs Curve (block size mix: 4K 40%, 8K 40%, 16K 10%, 64K 10%)

IOPs (IOPs STDV, average latency in ms) at each load point:
• All-Flash: 20% 28,464 (0.3%, 0.75); 40% 56,832 (0.2%, 1.29); 60% 85,292 (0.2%, 1.98); 80% 113,655 (0.3%, 2.71); 100% 142,052 (12.5%, 3.61)
• All-Flash + Accelerator: 20% 31,296 (0.3%, 0.73); 40% 62,528 (0.3%, 1.24); 60% 93,728 (0.3%, 1.88); 80% 124,992 (0.4%, 2.52); 100% 156,136 (12.4%, 3.28)
• All-NVMe: 20% 38,961 (0.3%, 0.69); 40% 77,889 (0.3%, 1.07); 60% 116,751 (0.3%, 1.53); 80% 155,683 (0.6%, 2.15); 100% 194,544 (14.6%, 2.63)
• All-NVMe + Accelerator: 20% 41,997 (0.3%, 0.66); 40% 83,960 (0.2%, 0.98); 60% 125,915 (0.4%, 1.40); 80% 167,810 (0.9%, 1.95); 100% 209,760 (13.0%, 2.44)

• The curve test measures performance at various load levels, starting with a 100% unthrottled test, then running the workload again at different percentages of that maximum result
• We calculate stability by measuring the standard deviation of IOPs as a percentage of the final average result
• 20% through 80% values are typically very low because the system is not being pushed to the limit
• Every configuration shows higher IOPs at lower latency than the previous result, up to 48% improvement

(Chart: "Mixed Blocks: All-Flash IOPS and latency curve" from 20% to 100% load, with and without the Accelerator Card; values as in the table above.)

(Chart: "Mixed Blocks: All-NVMe IOPS and latency curve" from 20% to 100% load, with and without the Accelerator Card; values as in the table above.)

Benchmark 6 Performance Results Using Vdbench Tests
8K 100% Random 70/30% R/W IOPs Curve Match

IOPs (read / write latency in ms) at each load point:
• All-Flash: 20% 37,794 (0.53 / 0.94); 40% 75,495 (0.64 / 1.55); 60% 113,261 (0.93 / 2.71); 80% 150,953 (1.18 / 3.66); 100% 188,654 (1.66 / 5.21)
• All-Flash + Accelerator: 20% 37,596 (0.48 / 0.79); 40% 75,196 (0.63 / 1.39); 60% 112,787 (0.83 / 2.14); 80% 150,381 (1.12 / 3.08); 100% 187,525 (1.38 / 3.89)
• All-NVMe: 20% 37,595 (0.51 / 0.71); 40% 75,193 (0.93 / 0.97); 60% 112,785 (0.84 / 1.17); 80% 150,385 (0.93 / 1.65); 100% 187,977 (0.88 / 1.80)
• All-NVMe + Accelerator: 20% 37,598 (0.52 / 0.69); 40% 75,191 (0.94 / 0.95); 60% 112,784 (0.84 / 1.05); 80% 150,388 (0.90 / 1.41); 100% 187,982 (0.85 / 1.55)

• Curve test run as before, but throttled to match the results from the standard HX 4.0 cluster
• Write latencies improve in all tests, by as much as 70%
• Read latencies also improve by up to 49%
• 40% test read latencies are affected by the cleaner getting more aggressive earlier on the smaller All-NVMe cluster

Benchmark 6 Performance Results Using Vdbench Tests
I/O Blender 100% Random 70/30% R/W IOPs Curve Match (block size mix: 4K 40%, 8K 40%, 16K 10%, 64K 10%)

IOPs (read / write latency in ms) at each load point:
• All-Flash: 20% 28,464 (0.60 / 1.09); 40% 56,832 (0.91 / 2.16); 60% 85,292 (1.27 / 3.63); 80% 113,655 (1.68 / 5.12); 100% 142,052 (2.24 / 6.80)
• All-Flash + Accelerator: 20% 28,398 (0.56 / 0.90); 40% 56,799 (0.77 / 1.58); 60% 85,186 (1.26 / 2.79); 80% 113,591 (1.41 / 3.61); 100% 140,373 (2.02 / 5.00)
• All-NVMe: 20% 28,397 (0.51 / 0.74); 40% 56,792 (0.65 / 1.20); 60% 85,191 (0.79 / 1.74); 80% 113,589 (0.98 / 2.34); 100% 141,979 (1.17 / 2.99)
• All-NVMe + Accelerator: 20% 28,397 (0.51 / 0.71); 40% 56,796 (0.61 / 0.99); 60% 85,185 (0.76 / 1.40); 80% 113,587 (0.91 / 1.81); 100% 141,976 (1.07 / 2.25)

• Curve test run as before, but throttled to match the results from the standard HX 4.0 cluster
• Write latencies improve in all tests, by as much as 67%
• Read latencies also improve by up to 52%
• The 40% test read latency issue is not seen when testing with mixed blocks

How Far We Have Come
Compare HX 3.0 All-Flash to HX 4.0 All-NVMe

IOPs and read/write latency (ms) by workload:
• 3.0 All-Flash: 64K 100% W 34,493 (W 29.42); 8K 100% R 571,145 (R 3.58); 8K 70/30 R/W 168,258 (R 2.11 / W 5.20); 64K 70/30 R/W 67,484 (5.76 / 11.98); 256K 70/30 R/W 21,374 (12.06 / 51.61)
• 4.0 All-NVMe + Accelerator: 64K 100% W 65,536 (W 15.05); 8K 100% R 662,464 (R 3.09); 8K 70/30 R/W 283,736 (1.33 / 2.90); 64K 70/30 R/W 99,438 (3.69 / 8.53); 256K 70/30 R/W 35,181 (10.12 / 24.79)
• Improvement: 64K 100% W +90.0% IOPs, -48.8% latency; 8K 100% R +16.0% IOPs, -13.8% latency; 8K 70/30 +68.6% IOPs, -36.9% R / -44.3% W latency; 64K 70/30 +47.4% IOPs, -36.0% R / -28.8% W latency; 256K 70/30 +64.6% IOPs, -16.0% R / -52.0% W latency

• Compares HX 3.0 All-Flash cluster results with the Intel Optane caching drive to an HX 4.0 All-NVMe cluster with HyperFlex Accelerator Cards (see the sketch after this list)
• Results shown are between two HX systems that were released only 1 year apart from each other
• 90% higher throughput and IOPs during 64K block priming tests, with latency cut in half
• Mixed R/W testing IOPs improve by 47-68%, nearly 9GB/sec for 256K blocks
• Mixed R/W write latency improves by up to 52%, cut in half for large block writes
• Mixed R/W read latencies improve in all tests, by up to 37%
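The Improvement line is a straightforward percentage change between the two result rows; the sketch below reproduces the 8K 70/30 R/W column using the values from the table.

```python
# How the Improvement figures above are derived (8K 70/30 R/W column, values from the table).
hx30_allflash  = {"iops": 168_258, "read_ms": 2.11, "write_ms": 5.20}
hx40_nvme_card = {"iops": 283_736, "read_ms": 1.33, "write_ms": 2.90}

def pct_change(new, old):
    return (new - old) / old * 100

print(f"IOPs:          {pct_change(hx40_nvme_card['iops'], hx30_allflash['iops']):+.1f}%")        # +68.6%
print(f"Read latency:  {pct_change(hx40_nvme_card['read_ms'], hx30_allflash['read_ms']):+.1f}%")   # about -37%
print(f"Write latency: {pct_change(hx40_nvme_card['write_ms'], hx30_allflash['write_ms']):+.1f}%")  # about -44%
```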

Benchmarking Best Practices

1. Initialize the Disks
   • More realistic: applications don't read data that hasn't already been written
2. Longer Test Duration (1 hour+)
   • Allows background processes to run; performance will stabilize over time
3. Test Appropriate Working Set Sizes
   • A larger working set size is a key benefit for all flash; don't exceed the read cache size on hybrid nodes
4. Enable Dedupe and Compression on the Competition (On by Default for HX)
   • Most enterprise workloads see significant storage efficiency savings
5. Workload Read / Write Mix
   • Use a reasonable write percentage (at least 20-35%) in the workload

6. Similar hardware configuration
   • Pay careful attention to the BOM: CPU, memory and SSDs (type and count for cache/WL and persistent data SSDs)
7. Similar software configuration
   • Pay careful attention to the resiliency setting; it needs to be identical
   • Comparing Replication Factor = 2 performance with Replication Factor = 3 performance is not comparing apples to apples
8. What to look for in performance results:
   • IOPS / throughput
   • Latency: not all workloads show sub-millisecond latencies (even on all flash!)
   • Consistent latency / standard deviation of latency
   • Variances across the VMs

Per VM Test Results Variation
• IOPS in a 70/30 Workload – 1 Hour Duration

Source: https://research.esg-global.com/reportaction/ciscohyperflexcomplexworkloads/Toc

HyperFlex Design and Performance Summary

• Cisco HyperFlex All-NVMe clusters combine best-of-breed technology and partnerships
• HyperFlex All-NVMe clusters take HCI performance to new heights
• HyperFlex Accelerator Cards improve performance, reduce latency and increase compression savings
• Year-over-year performance gains are transformational and enable HX use for all workloads

Complete your online session evaluation

• Please complete your session survey after each session. Your feedback is very important.

• Complete a minimum of 4 session surveys and the Overall Conference survey (starting on Thursday) to receive your Cisco Live water bottle.

• All surveys can be taken in the Cisco Live Mobile App or by logging in to the Session Catalog on ciscolive.cisco.com/us.

Cisco Live sessions will be available for viewing on demand after the event at ciscolive.cisco.com.

Continue your education

• Demos in the Cisco campus
• Walk-in labs
• Meet the engineer 1:1 meetings
• Related sessions

Thank you
