Virtualizing High Performance Computing Workloads
CernVM Users Workshop June 7, 2016
Josh Simons VMware Office of the CTO
Isolated Islands of Compute
Barriers to Centralization
Centralization via Virtualization
[Diagram: workloads consolidated onto a common virtualization layer (hypervisor)]
Centralization with Heterogeneity
[Diagram: heterogeneous environments sharing the virtualization layer]
Virtual Machine Benefits
[Diagram: application and OS packaged as a VM, running on a hypervisor over hardware]
Virtual machines offer:
• Heterogeneity
• Multi-tenant data security
• Fault isolation
• Reproducibility
• Fault resiliency
• Dynamic load balancing
• Performance
Virtualized Infrastructure
[Diagram: a virtualization layer with monitoring and management, plus self-provisioning from pre-defined templates, …]
Secure Private Cloud for HPC
[Diagram: users in Research Group 1 … Research Group m and IT staff reach Research Cluster 1 … Research Cluster n through VMware vRealize Automation (user portals, blueprints, security, vRA API), which can also burst to hybrid/public clouds; NSX provides programmatic control and integrations; each research cluster runs VMware vSphere managed by its own VMware vCenter Server]
“Evolutionary” Container Support
vSphere Integrated Containers: Before & After
Before: (1) a VI admin uses docker-machine to create a container host; (2) vCenter creates a VM with a Docker Engine in it; (3) docker run / docker stop / docker rm go to that engine's Docker API; (4) containers C1 and C2 are created inside the VM.
After: (1) docker run goes to the Docker API of a Virtual Container Host (a resource pool); (2) vCenter translates it into VM create, start, stop, and delete operations; (3, 4) each Docker container lands in its own just-enough VM (jeVM), created on demand via Instant Clone.
PhotonOS on VMware ESX hosts; container images and layers (Image1, Layer1–Layer3) live on shared datastores.
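The key point is that a Virtual Container Host exposes the standard Docker API, so existing tooling keeps working. A minimal sketch using the Docker SDK for Python, where the endpoint address, TLS setup, and images are illustrative assumptions rather than a VIC-specific recipe:

    import docker

    # Hypothetical VCH endpoint; real Virtual Container Hosts normally require TLS client certs
    client = docker.DockerClient(base_url="tcp://vch.example.com:2376", tls=True)

    # "docker run": with VIC, vCenter creates a jeVM via Instant Clone and runs the container in it
    output = client.containers.run("busybox", "echo hello from a jeVM", remove=True)
    print(output.decode())

    # "docker stop" / "docker rm": translated into VM stop/delete operations
    web = client.containers.run("nginx", detach=True)
    web.stop()
    web.remove()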
“Revolutionary” Container Support
Photon Platform Architecture
[Diagram: the Photon API, Cloud Foundry API, and Kubernetes API sit side by side — create a CF cluster then cf push / cf scale, or create a Kubernetes cluster then kubectl create / kubectl get pods]
A Photon Controller cluster (controllers #1–#3) manages the Cloud Foundry and Kubernetes clusters; the clustered design delivers massive scale and high availability.
Photon Machines combine core ESX with PhotonOS and use a combination of local and/or shared datastores.
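Because the provisioned clusters expose the standard Kubernetes API, ordinary Kubernetes clients work unchanged. A minimal sketch with the official Kubernetes Python client; it assumes a kubeconfig already pointing at a cluster created through the platform:

    from kubernetes import client, config

    # Assumes ~/.kube/config points at the provisioned cluster
    config.load_kube_config()

    v1 = client.CoreV1Api()
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)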
OCTO HPC Test Cluster
• Hardware
– Ten two-socket HP DL380 G8 servers (3.3 GHz 8-core E5-2667v2 CPUs; 128 GB)
– Two two-socket Super Micro 1028GR-TRT servers (2.5 GHz 12-core E5-2680v3 CPUs; 128 GB)
– Mellanox FDR / 40 Gb RoCE adaptor
– Intel 10 GbE adaptor
– Two Mellanox 12-port FDR/40Gb switches
• Software
– ESXi 5.5u1 and 6.0 hypervisors
– RHEL 6.5 (native and guest)
Performance of Throughput Applications
[Diagram: the same application stack run on a hypervisor vs. directly on hardware]
BioPerf Benchmark Suite
[Bar chart: native-to-virtual ratios (higher is better) on ESXi 5.5u1 for CLUSTALW, GLIMMER, GRAPPA, HMMER, PHYLIP, PREDATOR, TCOFFEE, BLAST, and FASTA; y-axis 0–1.2]
BLAST
[Bar chart: native-to-virtual ratios (higher is better) on ESXi 5.5u1 with OMP_NUM_THREADS = 1, 4, 8, 16; y-axis 0–1]
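For reference, a thread-count sweep like the one behind this chart can be scripted by varying OMP_NUM_THREADS per run. A minimal sketch; the command line and inputs are hypothetical placeholders, not the actual BioPerf/BLAST harness:

    import os, subprocess, time

    # Hypothetical benchmark invocation; substitute the real binary and inputs
    CMD = ["./blast_binary", "queries.fasta"]

    for threads in (1, 4, 8, 16):
        env = dict(os.environ, OMP_NUM_THREADS=str(threads))
        start = time.time()
        subprocess.run(CMD, env=env, check=True, stdout=subprocess.DEVNULL)
        print(f"OMP_NUM_THREADS={threads}: {time.time() - start:.1f} s")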
Monte Carlo Simulation – Vanilla Option Pricing
• Dual-socket, 8-core IVB processor
• 128 GB memory
• Single 16-vCPU VM, 120 GB
• 16 single-threaded jobs run in parallel (one per core)
• Windows Server 2012 R2
• Default settings
[Bar chart: native-to-virtual ratios (higher is better) on ESX 5.5u1 for Runs 1–4; y-axis 0–1]
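For context, the workload is the classic embarrassingly parallel Monte Carlo pricing of a vanilla (European) option, which is why it makes a good throughput test. A minimal NumPy sketch of one such single-threaded job; the parameters are illustrative, not those used in the test:

    import numpy as np

    def price_european_call(s0=100.0, k=105.0, r=0.02, sigma=0.25, t=1.0, n_paths=1_000_000):
        """Monte Carlo price of a vanilla European call under geometric Brownian motion."""
        z = np.random.standard_normal(n_paths)
        st = s0 * np.exp((r - 0.5 * sigma**2) * t + sigma * np.sqrt(t) * z)  # terminal prices
        payoff = np.maximum(st - k, 0.0)
        return np.exp(-r * t) * payoff.mean()  # discounted expected payoff

    print(f"estimated call price: {price_european_call():.4f}")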
MPI Small-Message InfiniBand Latency
[Diagram: two VMs, each running Application → MPI → RDMA, on ESX hosts over the hardware]
Test setup: osu_latency benchmark, Open MPI 1.6.5, MLX OFED 2.2, RedHat 6.5, HP DL380p Gen8, Mellanox FDR InfiniBand
[Line chart: IMB PingPong half round-trip latency (µs) vs. message size (1–1024 bytes) for Native, ESX 5.5, ESX 6.0, ESX 6.0u1, and a test build; y-axis 0–3.5 µs]
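Both osu_latency and IMB PingPong measure the same quantity: half the round-trip time of small messages bounced between two MPI ranks. A minimal mpi4py sketch of that measurement (not the actual benchmark code), launched with one rank per VM:

    # Run with: mpirun -np 2 python pingpong.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    msg = np.zeros(1, dtype=np.uint8)   # 1-byte message
    reps = 10000

    comm.Barrier()
    start = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(msg, dest=1)
            comm.Recv(msg, source=1)
        else:
            comm.Recv(msg, source=0)
            comm.Send(msg, dest=0)
    elapsed = MPI.Wtime() - start

    if rank == 0:
        # Half round-trip latency in microseconds, as plotted above
        print(f"latency: {elapsed / (2 * reps) * 1e6:.2f} us")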
NAMD – ESX 5.5
[Bar chart: NAMD native-to-virtual ratios (higher is better) for the apoa1, f1atpase, and stmv benchmarks at configurations n8np8, n8np16, n8np32, n8np64, and n8np128; y-axis 0–1]
LAMMPS – ESX 5.5
[Bar chart: LAMMPS native-to-virtual ratios (higher is better) for Atomic Fluid, Bulk Copper, and Bead-Spring Polymer at configurations n8np8 through n8np128; y-axis 0–1]
LAMMPS (Testbuild)
[Bar chart: the same LAMMPS native-to-virtual ratios measured on the test build]
30TB TeraSort, 32-host cluster with vSphere 6
[Bar chart: native/virtual elapsed-time ratio for TeraGen, TeraSort, and TeraValidate with 1, 2, 4, 10, and 20 VMs per host; y-axis 0–1.2]
Network Storage: Small I/O Case Study
[Diagram: application → guest OS → ESXi → NFS server]
• Rendering applications
– 1.4X – 3X slowdown seen
• Customer NFS stress test
– 10K files
– 1K random reads/file
– 1–32K bytes
– 7X slowdown
• Final app performance
– 1 – 5% slower than native
• Single change (see the sketch below)
– Disable LRO (Large Receive Offload) within the guest to avoid coalescing of small messages upon arrival
– See KB 1027511: Poor TCP Performance can occur in Linux virtual machines with LRO enabled
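The fix, disabling LRO inside the Linux guest, is normally a one-line ethtool or vNIC-driver-module change; KB 1027511 describes the supported procedure. A hedged sketch of how such a check-and-disable step might be scripted — the interface name is an example, and the exact mechanism depends on the vNIC driver and kernel version:

    import subprocess

    IFACE = "eth0"   # example guest interface name

    # Show current offload settings; LRO is listed as "large-receive-offload"
    subprocess.run(["ethtool", "-k", IFACE], check=True)

    # Turn LRO off so small NFS replies are delivered immediately instead of being coalesced
    subprocess.run(["ethtool", "-K", IFACE, "lro", "off"], check=True)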
Remote Storage Access Path
[Diagram: applications → OS → storage device driver → PCI device (HW) → InfiniBand switch → storage servers]
Passthrough Mode Limitation
[Diagram: with passthrough, the PCI device and its driver belong to a single guest OS; the other guests on the host have no path (ₓ) through the InfiniBand switch to the storage servers]
Single-Root I/O Virtualization (SR-IOV)
[Diagram: with SR-IOV, each guest OS runs a VF driver while the vmkernel runs the PF driver; all guests share the one PCI device and reach the storage servers through the InfiniBand switch]
IOR Bandwidth Performance
3 VMs × 4 cores versus bare-metal Linux with 12 cores
Test setup: two-socket (8-core) IVB, 64 GB memory, MLX ConnectX-3 FDR IB, 256 GB IOR dataset, CentOS 6.4, Lustre 2.6
[Line chart: bandwidth (MB/sec, 0–4000) vs. number of processes (1, 2, 3, 6, 12) for VM write, VM read, Bare Metal write, and Bare Metal read]
Data provided by Sorin Faibish, EMC Office of the CTO
Performance Tuning Tips
• For maximum performance, do not over-commit CPU and memory resources
• NUMA
– Socket or sub-socket binding when you have the choice (throughput workloads)
– For multi-socket VMs, match vNUMA to the physical topology (cpuid.coresPerSocket)
– Use lstopo within the VM to verify the topology (possibly important for small VMs as well) — see the sysfs sketch below
• Hyperthreading
– Enable in the BIOS, but only assign real cores to vCPUs
– If using threads for vCPUs, be careful of NUMA layout issues (numa.preferHT)
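If lstopo is not installed in the guest, the same vNUMA check can be done from sysfs on a Linux guest. A minimal sketch that prints each NUMA node and its CPUs so the layout can be compared against the intended cpuid.coresPerSocket setting:

    import glob, os

    # Each /sys/devices/system/node/nodeN directory is one NUMA node as seen by the guest
    for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        with open(os.path.join(node, "cpulist")) as f:
            print(f"{os.path.basename(node)}: CPUs {f.read().strip()}")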
Performance Tuning Tips (continued)
• Latency Sensitivity = HIGH (see the API sketch below)
– Exclusive access to pCPUs
– vmkernel scheduler bypass
– VMXNET3 coalescing & LRO disabled
– (etc.)
• Other binding (if really needed)
– vCPUs to cores, interrupts to cores, etc.
• Guest-level tuning
– Tickless/nohz
– SELinux/iptables
– tuned-adm latency-performance (or throughput-performance)
– vNIC (coalescing off)
– Java UseLargePages
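Latency Sensitivity = High is a per-VM setting applied through vCenter, either in the UI or via the API. A hedged pyVmomi sketch of the API route, assuming vSphere 5.5+ and the latencySensitivity field of the VM config spec; the vCenter address, credentials, and VM name are placeholders:

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    # Placeholder vCenter address, credentials, and VM name
    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="secret", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    # Find the target VM by name
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "hpc-latency-vm")

    # Reconfigure the VM with latencySensitivity = high
    spec = vim.vm.ConfigSpec()
    spec.latencySensitivity = vim.LatencySensitivity(level="high")
    vm.ReconfigVM_Task(spec)

    Disconnect(si)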
EDA Example (Over-Committed with vMotion)
• Electronic Design Automation (chip design)
• 64 hosts, 640 VMs
• CPU 5X over-commitment (memory allocation unknown)
• 360,000 vMotion operations in a year
• 5X the jobs run in 2X the time → a 2.5X throughput increase
– 1.5M more jobs per month (a 40% increase)
Using Accelerators with ESXi
NVIDIA K2, CUDA, VM DirectPath I/O
The Scalable Heterogeneous Computing (SHOC) Benchmark Suite
[Bar chart: virtual-to-native ratios (higher is better, y-axis 0–1.2) for ESX 6.0u1 and a test build across the SHOC benchmarks — bus speed, max FLOPS, device memory bandwidth, FFT, GEMM, MD, MD5 hash, reduction, scan, sort, SpMV, stencil, triad, S3D, and QTC, in single and double precision]
Intel Xeon Phi, VM DirectPath I/O
Resources
• CTO HPC blog
– http://cto.vmware.com/tag/hpc
• Latency whitepaper
– Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs
– http://www.vmware.com/resources/techresources/10220
• Big Data / Hadoop technical whitepaper
– Virtualized Hadoop Performance with VMware vSphere 6.0 on High-Performance Servers
– http://www.vmware.com/files/pdf/techpaper/Virtualized-Hadoop-Performance-with-VMware-vSphere6.pdf
• InfiniBand performance
– Performance of RDMA and HPC Applications in Virtual Machines using FDR InfiniBand on VMware vSphere
– https://www.vmware.com/resources/techresources/10530
• Paravirtualized RDMA
– Toward a Paravirtual RDMA Device for VMware ESXi Guests
– http://labs.vmware.com/publications/vrdma-vmtj-winter2012

Thank You
Josh Simons
[email protected]