Virtualizing High Performance Computing Workloads

CernVM Users Workshop June 7, 2016

Josh Simons VMware Office of the CTO

Isolated Islands of Compute

2 Barriers to Centralization

3 Barriers to Centralization

?

4 Centralization via Virtualization

Virtualization Layer

5 Centralization with Heterogeneity

Virtualization Layer

6 Benefits

Virtual Machines offer:
• Heterogeneity
• Multi-tenant data security
• Fault isolation
• Reproducibility
• Fault resiliency
• Dynamic load balancing
• Performance

7 Virtualized Infrastructure

Virtualization Layer

Monitoring, management

8 Virtualized Infrastructure

Self-provisioning, pre-defined templates, …

Virtualization Layer

Monitoring, management

9 Secure Private Cloud for HPC

Research Group 1 … Research Group m (users and IT) consume resources through VMware vRealize Automation: user portals, blueprints, security, and the vRA API, with the option to burst to hybrid/public clouds. Underneath, Research Cluster 1 … Research Cluster n each run VMware vSphere managed by its own VMware vCenter Server, with NSX providing programmatic control and integrations.

“Evolutionary” Container Support

11 vSphere Integrated Containers: Before & After

Before: (1) docker-machine asks vCenter to create a VM; (2) a VM with a Docker Engine is created; (3) docker run calls go to the Docker API of that engine; (4) containers C1 and C2 are created inside the VM.

After: (1) a VI admin creates a Virtual Container Host (a resource pool); (2) docker run / docker stop / docker rm calls hit the Docker API endpoint and are translated into vCenter operations (VM create, start, stop, delete); (3) the request reaches the container engine (DE'); (4) each Docker container is created in its own just-enough VM (jeVM) via Instant Clone. A client-side sketch follows below.

Container jeVMs run PhotonOS on VMware ESX hosts.
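As a hedged sketch of what the "after" flow looks like to a user (the endpoint address and image are placeholders, not from the deck; the Virtual Container Host exposes a standard Docker API):

    # Point a stock Docker client at the Virtual Container Host endpoint (address is illustrative)
    export DOCKER_HOST=tcp://vch.example.org:2376
    docker --tls run -d nginx        # the container comes up as its own jeVM (Instant Clone)
    docker --tls stop <container-id>
    docker --tls rm <container-id>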

Container images and their layers (Image1: Layer1, Layer2, Layer3) live on shared datastores.

12 “Revolutionary” Container Support

13 Photon Platform Architecture

Users drive the platform through three APIs: the Photon API (create a CF cluster, create a Kubernetes cluster), the Cloud Foundry API (cf push, cf scale), and the Kubernetes API (kubectl create, kubectl get pods).

Cloud Foundry and Kubernetes clusters run on top of a Photon Controller cluster (Photon Controller #1, #2, #3); the clustered design delivers massive scale and high availability. A sketch of the developer-facing APIs follows below.
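As a hedged illustration of those developer-facing APIs (the application name and manifest are invented for the example, not from the deck):

    # Kubernetes API against a cluster provisioned through the Photon API
    kubectl create -f my-app.yaml
    kubectl get pods

    # Cloud Foundry API against a CF cluster provisioned through the Photon API
    cf push my-app
    cf scale my-app -i 4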

Photon Machines: a combination of core ESX with PhotonOS.

Storage is a combination of local and/or shared datastores.

14 OCTO HPC Test Cluster

• Hardware
  – Ten two-socket HP DL380 G8 servers (3.3 GHz 8-core E5-2667v2 CPUs; 128 GB)
  – Two two-socket Super Micro 1028GR-TRT servers (2.5 GHz 12-core E5-2680v3 CPUs; 128 GB)
  – Mellanox FDR / 40 Gb RoCE adaptor
  – Intel 10 GbE adaptor
  – Two Mellanox 12-port FDR/40Gb switches

• Software
  – ESXi 5.5u1, 6.0
  – RHEL 6.5 (native and guest)

Performance of Throughput Applications

Virtual (hypervisor on hardware) vs. native (hardware)

16 BioPerf Benchmark Suite

Native to Virtual Ratios (Higher is Better)

[Chart: ratios on ESXi 5.5u1 for CLUSTALW, GLIMMER, GRAPPA, HMMER, PHYLIP, PREDATOR, TCOFFEE, BLAST, and FASTA.]

17 BLAST Native to Virtual Ratios (Higher is Better)

[Chart: BLAST native-to-virtual ratios on ESXi 5.5u1 with OMP_NUM_THREADS = 1, 4, 8, and 16.]

18 Monte Carlo Simulation – Vanilla Option Pricing

Native to Virtual Ratios (Higher is better)

• Dual-socket, 8-core IVB processors
• 128 GB memory
• Single 16-vCPU VM, 120 GB
• 16 single-threaded jobs run in parallel (one per core)
• Windows Server 2012 R2
• Default settings

[Chart: ratios on ESX 5.5u1 for Run 1 through Run 4.]

19 MPI Small-Message InfiniBand Latency

Two VMs, each running the application and MPI over RDMA, one per ESX host, connected over Mellanox FDR InfiniBand.

Testbed: osu_latency benchmark, Open MPI 1.6.5, MLX OFED 2.2, RedHat 6.5, HP DL380p Gen8 (a command sketch follows below).
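The deck names the osu_latency benchmark and Open MPI 1.6.5; as a minimal sketch (hostnames are placeholders and the exact flags are not shown in the deck), such a ping-pong measurement is typically launched with one rank per VM:

    # Two MPI ranks, one on each VM, over the InfiniBand fabric (Open MPI syntax)
    mpirun -np 2 -host vm1,vm2 ./osu_latency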

MPI Small-Message InfiniBand Latency — IMB PingPong

[Chart: half round-trip latency (µs) vs. message size (1–1024 bytes) for Native, ESX 5.5, ESX 6.0, ESX 6.0u1, and Testbuild.]

NAMD – ESX 5.5

NAMD Benchmarks Native to Virtual Ratios (Higher is Better)

[Chart: ratios for apoa1, f1atpase, and stmv at n8np8, n8np16, n8np32, n8np64, and n8np128.]

22 LAMMPS – ESX 5.5

LAMMPS Benchmarks Native to Virtual Ratios (Higher is Better)

[Chart: ratios for Atomic Fluid, Bulk Copper, and Bead-Spring Polymer at n8np8, n8np16, n8np32, n8np64, and n8np128.]

23 LAMMPS (Testbuild)

LAMMPS Benchmarks Native to Virtual Ratios (Higher is Better)

[Chart: ratios for Atomic Fluid, Bulk Copper, and Bead-Spring Polymer at n8np8, n8np16, n8np32, n8np64, and n8np128.]

24 30TB TeraSort, 32-host cluster with vSphere 6

[Chart: native/virtual elapsed time ratio for TeraGen, TeraSort, and TeraValidate with 1, 2, 4, 10, and 20 VMs per host.]

25 Network Storage: Small I/O Case Study

• Rendering applications
  – 1.4X – 3X slowdown seen
• Customer NFS stress test
  – 10K files
  – 1K random reads/file
  – 1–32K bytes
  – 7X slowdown
• Final app performance
  – 1 – 5% slower than native
• Single change
  – Disable LRO (Large Receive Offload) within the guest to avoid coalescing of small messages upon arrival (see the sketch below)
  – See KB 1027511: Poor TCP Performance can occur in Linux virtual machines with LRO enabled

(I/O path: Application → Guest OS → ESXi → NFS Server)
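A hedged sketch of that single change for a Linux guest (the interface name is a placeholder; KB 1027511 gives the exact, driver-specific procedure and how to make the change persistent):

    # Check whether LRO is currently enabled on the guest vNIC
    ethtool -k eth0 | grep large-receive-offload
    # Turn it off so small NFS replies are delivered as they arrive
    ethtool -K eth0 lro off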

Remote Storage Access Path

Applications issue I/O through the OS storage device driver to the PCI device (InfiniBand HCA) in hardware, across an InfiniBand switch, to the storage servers.

27 Passthrough Mode Limitation

With VM DirectPath I/O (passthrough), the PCI device (InfiniBand HCA) is dedicated to a single guest: only that guest's driver can reach the device and the storage servers across the InfiniBand switch, and the other guests on the host get no access to it.

28 Single-Root I/O Virtualization (SR-IOV)

With SR-IOV, each guest OS runs a virtual function (VF) driver while the vmkernel runs the physical function (PF) driver; multiple guests share the same PCI device (InfiniBand HCA) and reach the storage servers across the InfiniBand switch. A host-side enablement sketch follows below.
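Enabling SR-IOV on the host is driver-specific; as a hedged example (Intel ixgbe shown purely for illustration — the Mellanox driver used in this testbed has its own parameter names), virtual functions are typically enabled through a module parameter followed by a host reboot:

    # Ask the pNIC driver to expose 8 virtual functions (module and parameter names vary by driver)
    esxcli system module parameters set -m ixgbe -p "max_vfs=8"
    esxcli system module parameters list -m ixgbe    # verify the setting
    # Reboot the host, then assign VFs to VMs as SR-IOV passthrough adapters in vSphere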

29 IOR Bandwidth Performance

Setup: 3 VMs x 4 cores versus bare-metal Linux 12-core; two-socket (8-core) IVB, 64 GB memory, MLX ConnectX-3 FDR IB, 256 GB IOR dataset, CentOS 6.4, Lustre 2.6.

[Chart: bandwidth (MB/sec) vs. number of processes (1, 2, 3, 6, 12) for VM write, VM read, Bare Metal write, and Bare Metal read.]

Data provided by Sorin Faibish, EMC Office of the CTO

30 Performance Tuning Tips

• For maximum performance, do not over-commit CPU and memory resources
• NUMA
  – Socket or sub-socket binding when you have the choice (throughput workloads)
  – For multi-socket VMs, match vNUMA to the physical topology (cpuid.coresPerSocket)
  – Use lstopo within the VM to verify topology (possibly important for small VMs as well)
• Hyperthreading
  – Enable in BIOS, but only assign real cores to vCPUs
  – If using threads for vCPUs, be careful of NUMA layout issues (numa.preferHT; see the sketch below)
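A hedged .vmx sketch of the NUMA-related settings named above (values are placeholders for a 16-vCPU VM on a two-socket, 8-core-per-socket host; verify parameter names against your vSphere release):

    # .vmx fragment (illustrative values)
    numvcpus = "16"
    cpuid.coresPerSocket = "8"      # 2 virtual sockets x 8 cores, matching the physical topology
    numa.vcpu.preferHT = "TRUE"     # per-VM form of preferHT; use only when hyperthreads back vCPUs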

31 Performance Tuning Tips

• Latency Sensitivity = HIGH
  – Exclusive access to pCPUs
  – vmkernel scheduler bypass
  – VMXNET3 coalescing & LRO disabled
  – (etc.)
• Other binding (if really needed)
  – vCPUs to cores, interrupts to cores, etc.
• Guest-level tuning (see the sketch below)
  – Tickless/nohz
  – SELinux/iptables
  – tuned-adm latency-performance (or throughput-performance)
  – vNIC (coalescing off)
  – Java UseLargePages
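A hedged sketch of a few of these knobs — the .vmx lines are the usual parameters behind the Latency Sensitivity and vNIC coalescing controls, and the guest commands are common RHEL-style equivalents (interface and Java command are placeholders):

    # VM-level (.vmx) settings
    sched.cpu.latencySensitivity = "high"     # the parameter behind Latency Sensitivity = HIGH
    ethernet0.coalescingScheme = "disabled"   # vNIC interrupt coalescing off (vmxnet3)

    # Guest-level (RHEL-style)
    tuned-adm profile latency-performance     # or throughput-performance
    java -XX:+UseLargePages -jar app.jar      # JVM large pages (app.jar is a placeholder)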

32 EDA Example (Over-Committed with vMotion)

• Electronic Design Automation (chip design)
• 64 hosts, 640 VMs
• CPU 5X over-commitment (memory allocation unknown)
• 360,000 vMotion operations in a year
• 5X jobs run in 2X time → 2.5X throughput increase
  – 1.5M more jobs per month (40% increase)

33 Using Accelerators with ESXi

34 nVidia K2, CUDA, VM DirectPath I/O

The Scalable Heterogeneous Computing (SHOC) Benchmark Suite — Virtual versus Native Ratios (higher is better)

[Chart: virtual/native ratios for ESX 6.0u1 and Testbuild across the SHOC kernels, including bus speed (download/readback), device memory bandwidth, FFT, GEMM, MD, MD5 hash, reduction, scan, sort, SpMV, stencil, triad bandwidth, S3D, QTC, and peak single/double-precision FLOPS.]
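With VM DirectPath I/O the K2 GPU is handed whole to one VM. A hedged sketch of the resulting VM configuration (the PCI address is a placeholder and the device is normally added through the vSphere client rather than by hand-editing the .vmx; all VM memory must be reserved for passthrough):

    # .vmx fragment for a passthrough GPU (illustrative values)
    pciPassthru0.present = "TRUE"
    pciPassthru0.id = "0000:06:00.0"     # PCI address of the GPU on this host
    pciPassthru.use64bitMMIO = "TRUE"    # commonly needed for GPUs with large BARs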

35 Intel Xeon Phi, VM DirectPath I/O

36 Resources

• CTO HPC blog:
  – http://cto.vmware.com/tag/hpc
• Latency whitepaper:
  – Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs
    • http://www.vmware.com/resources/techresources/10220
• Big Data / Hadoop technical whitepaper:
  – Virtualized Hadoop Performance with VMware vSphere 6.0 on High-Performance Servers
    • http://www.vmware.com/files/pdf/techpaper/Virtualized-Hadoop-Performance-with-VMware-vSphere6.pdf
• InfiniBand performance:
  – Performance of RDMA and HPC Applications in Virtual Machines using FDR InfiniBand on VMware vSphere
    • https://www.vmware.com/resources/techresources/10530
• Paravirtualized RDMA:
  – Toward a Paravirtual RDMA Device for VMware ESXi Guests
    • http://labs.vmware.com/publications/vrdma-vmtj-winter2012

Thank You
Josh Simons
[email protected]