
IO Performance
HUANG Zhiteng ([email protected])

Agenda

• IO Virtualization Overview
  • Software solution
  • Hardware solution
  • IO performance trend
• How IO Virtualization Performed in Micro Benchmark
  • Network
  • Disk
• Performance in Enterprise Workloads
  • Web Server: PV, VT-d and native performance
  • Database: VT-d vs. native performance
  • Consolidated workload: SR-IOV benefit
• Direct IO (VT-d) Overhead Analysis

IO Virtualization Overview

IO Virtualization enables VMs to utilize the Input/Output resources of the hardware platform. In this session we cover network and storage.

Software solutions. Two solutions we are familiar with on Xen:
• Emulated devices (QEMU): good compatibility, very poor performance.
• Para-virtualized devices: need driver support in the guest; provide optimized performance compared to QEMU emulation.

Both require the participation of Dom0 (the driver domain) to serve a VM's IO requests.
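For illustration only, here is a rough sketch of how the two software paths are typically selected in a classic Xen xm guest configuration (Python syntax); the guest name, bridge and disk path are assumptions, not the configuration used for the results in this deck.

# Sketch of a classic xm guest configuration (Python syntax); names are assumptions.
name   = "demo-guest"
memory = 4096
vcpus  = 4

# Emulated path (HVM guest): 'type=ioemu' asks QEMU in Dom0 to emulate the NIC.
# Best compatibility, slowest path.
# vif = [ 'type=ioemu, model=e1000, bridge=xenbr0' ]

# Para-virtualized path: the guest's netfront driver talks to netback in Dom0.
# Needs PV drivers in the guest, but avoids device emulation.
vif  = [ 'bridge=xenbr0' ]

# The PV block path works the same way via blkfront/blkback.
disk = [ 'phy:/dev/vg0/demo-guest,xvda,w' ]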

IO Virtualization Overview – Hardware Solution

The platform offers 3 kinds of H/W assists to accelerate IO; a single technology or a combination can be used to address various usages.

• VMDq (Virtual Machine Device Queue): separate Rx & Tx queue pairs of the NIC for each VM; only the network "switch" remains in software. Requires specific OS and VMM support.
• Direct IO (VT-d): improved IO performance through direct assignment of an I/O device to an unmodified or para-virtualized VM; the VM exclusively owns the device.
• SR-IOV (Single Root I/O Virtualization): changes to I/O device silicon to support multiple PCI device IDs, so one device can expose multiple Virtual Functions and serve multiple direct-assigned guests. Requires VT-d.
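As a sketch of the direct-assignment path, an xm configuration fragment (Python syntax) might look like the following; the PCI addresses and the way the VFs are created are assumptions and depend on the NIC driver and the Xen/kernel version.

# Direct IO (VT-d): the guest exclusively owns the physical device at this
# PCI address. The device must first be hidden from Dom0 (e.g. bound to
# the pciback driver) before it can be assigned.
pci = [ '0000:0b:00.0' ]        # assumed BDF of the NIC being passed through

# SR-IOV: the Physical Function driver in Dom0 first creates Virtual
# Functions (for example via a 'max_vfs' module parameter or, on newer
# kernels, the sriov_numvfs sysfs attribute; both are assumptions here).
# Each VF then shows up with its own BDF and is assigned to one guest
# exactly like a whole device:
# pci = [ '0000:0b:10.0' ]      # assumed BDF of one VF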

IO Virtualization Overview – Trend

Much higher throughput, denser IO capacity.

• 40Gb/s and 100Gb/s Ethernet: Draft 3.0 scheduled for release in Nov. 2009, standard approval in 2010**.
• Fibre Channel over Ethernet (FCoE): unified IO consolidates network (IP) and storage (SAN) onto a single connection.
• Solid State Drive (SSD): provides hundreds of MB/s of bandwidth and >10,000 IOPS for a single device*.
• PCIe 2.0: doubles the bit rate from 2.5GT/s to 5.0GT/s.

* http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403
** See http://en.wikipedia.org/wiki/100_Gigabit_Ethernet

How IO Virtualization Performed in Micro Benchmark – Network

[Chart: iperf transmit bandwidth (Gb/s), with CPU utilization on the secondary axis]
  HVM + PV driver: 4.68   PV guest: 8.47 (1.81x)   HVM + VT-d: 9.54 (1.13x)   30 VMs + SR-IOV: 19

[Chart: iperf receive bandwidth (Gb/s), with CPU utilization on the secondary axis]
  HVM + PV driver: 1.46   PV guest: 3.10 (2.12x)   HVM + VT-d: 9.43 (3.04x)

Dual-port 10Gb/s Ethernet NICs were used to benchmark the TCP bandwidth of the different device models(*).

Thanks to VT-d, the VM can easily achieve 10GbE line rate in both directions with relatively much lower resource consumption.

Also, with SR-IOV we were able to reach 19Gb/s transmit bandwidth with 30 VFs assigned to 30 VMs.
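A minimal sketch of how such a TCP bandwidth test can be driven from a guest, assuming the classic iperf client is installed and an iperf server is already listening at 192.168.0.1 (the address, stream count and duration are assumptions, not the exact setup used here):

import re
import subprocess

SERVER = "192.168.0.1"   # assumed address of the iperf server VM/host
STREAMS = 4              # parallel TCP streams
DURATION = 60            # seconds

# Run the iperf client and keep its summary output.
out = subprocess.run(
    ["iperf", "-c", SERVER, "-P", str(STREAMS), "-t", str(DURATION), "-f", "g"],
    capture_output=True, text=True, check=True,
).stdout

# Grab the last "x.xx Gbits/sec" figure, which is the aggregate when -P > 1.
rates = re.findall(r"([\d.]+)\s+Gbits/sec", out)
print(f"aggregate TCP bandwidth: {rates[-1]} Gbit/s" if rates else out)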

* HVM + VT-d uses the 2.6.27 kernel, while the PV guest and HVM + PV driver use 2.6.18.
* We turned off multiqueue support in the NIC driver for HVM + VT-d because the 2.6.18 kernel does not have multi-TX-queue support; so for the iperf test there was only one TX/RX queue in the NIC and all interrupts were sent to a single physical core.
* ITR (Interrupt Throttle Rate) was set to 8000 for all cases.

How IO Virtualization Performed in Micro Benchmark – Network (cont.)

Packet transmit performance is another essential aspect of a high-throughput network.

[Chart: packet transmit performance (million packets per second)]
  PV guest, 1 queue: 0.33   HVM + VT-d, 1 queue: 3.90 (11.7x)   HVM + VT-d, 4 queues: 8.15 (2.1x)

Using the Linux kernel packet generator (pktgen) with small UDP packets (128 bytes), HVM + VT-d can send close to 4 million packets/s with 1 TX queue and more than 8 million packets/s with 4 TX queues.

PV performance was far behind due to its long packet processing path.
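A rough sketch of driving pktgen through its /proc interface, assuming the pktgen module is loaded, the script runs as root, and the NIC under test is eth1 (the interface name, destination IP/MAC and packet count are all assumptions):

import pathlib

PKTGEN = pathlib.Path("/proc/net/pktgen")
DEV = "eth1"                      # assumed name of the NIC under test

def pgset(path, cmd):
    # pktgen is controlled by writing one command per write() to its proc files.
    (PKTGEN / path).write_text(cmd + "\n")

# Bind the device to the kernel thread for CPU 0 (one thread per TX queue/core).
pgset("kpktgend_0", "rem_device_all")
pgset("kpktgend_0", f"add_device {DEV}")

# 128-byte UDP packets, no inter-packet delay, 10 million packets.
pgset(DEV, "pkt_size 128")
pgset(DEV, "delay 0")
pgset(DEV, "count 10000000")
pgset(DEV, "dst 10.0.0.2")                 # assumed destination IP
pgset(DEV, "dst_mac 00:1b:21:00:00:01")    # assumed destination MAC

# Start transmitting (blocks until 'count' packets are sent), then read the pps result.
pgset("pgctrl", "start")
print((PKTGEN / DEV).read_text())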

How IO Virtualization Performed in Micro Benchmark – Disk IO

We measured disk bandwidth with sequential reads and IOPS with random reads to check block device performance.

[Chart: IOmeter disk bandwidth (MB/s)]
  PV guest: 1,711   HVM + VT-d: 4,911 (2.9x)

[Chart: IOmeter disk IOPS]
  PV guest: 7,056   HVM + VT-d: 18,725 (2.7x)

HVM + VT-d outperforms the PV guest by roughly 3x in both tests.
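IOmeter was used for the measurements above; as an illustration only, here is a minimal Python sketch of the same two access patterns (the device path, block sizes and run times are assumptions, and page-cache effects are ignored, unlike a real IOmeter/O_DIRECT run):

import os, random, time

DEV = "/dev/sdb"        # assumed block device under test (opened read-only)
SEQ_BS = 1024 * 1024    # 1 MiB sequential reads
RND_BS = 4096           # 4 KiB random reads
RUNTIME = 10            # seconds per phase

fd = os.open(DEV, os.O_RDONLY)
size = os.lseek(fd, 0, os.SEEK_END)

# Phase 1: sequential read bandwidth.
os.lseek(fd, 0, os.SEEK_SET)
done, start = 0, time.time()
while time.time() - start < RUNTIME:
    buf = os.read(fd, SEQ_BS)
    if not buf:
        os.lseek(fd, 0, os.SEEK_SET)   # wrap around at end of device
        continue
    done += len(buf)
print(f"sequential read: {done / (time.time() - start) / 1e6:.0f} MB/s")

# Phase 2: random read IOPS.
ios, start = 0, time.time()
while time.time() - start < RUNTIME:
    off = random.randrange(0, size - RND_BS, RND_BS)
    os.pread(fd, RND_BS, off)
    ios += 1
print(f"random read: {ios / (time.time() - start):.0f} IOPS")
os.close(fd)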

Performance in Enterprise Workloads – Web Server

The Web Server workload simulates a support website where connected users browse and download files. We measure the maximum number of simultaneous user sessions the web server can support while satisfying the QoS criteria.

[Chart: web server performance (simultaneous sessions), with CPU utilization on the secondary axis]
  HVM + PV driver: 5,000   PV guest: 9,000 (1.8x)   HVM + VT-d: 24,500 (2.7x)

Only HVM + VT-d was able to push the server's CPU utilization to ~100%. The PV solutions hit some bottleneck and failed to pass QoS while utilization was still below 70%.

Performance in Enterprise Workloads – Database

[Chart: Decision Support DB performance (QphH)]
  Native: 11,443   HVM + VT-d: 10,762 (94.06% of native)

[Chart: OLTP DB performance]
  Native: 199.71   HVM + VT-d storage & NIC: 184.08 (92.2%)   HVM + VT-d storage only: 127.56 (63.9%)

The Decision Support DB requires high disk bandwidth, while the OLTP DB is IOPS-bound and requires a certain amount of network bandwidth to connect to clients.

The HVM + VT-d combination achieved >90% of native performance in these two DB workloads.

Performance in Enterprise Workloads – Consolidation with SR-IOV

The workload consolidates multiple tiles of servers to run on the same physical machine. One tile consists of 1 Web Server instance, 1 J2EE AppServer and 1 Mail Server, altogether 6 VMs. It is a complex workload that consumes CPU, memory, disk and network.

[Chart: SR-IOV benefit in the consolidated workload (performance ratio), with system utilization on the secondary axis]
  PV guest: 1.00   HVM + SR-IOV: 1.49 (1.49x)

The PV solution could only support 4 tiles on a two-socket server, and it failed to pass the Web Server QoS criteria before saturating the CPU.

As a pilot, we enabled the SR-IOV NIC for the Web Server. This brought a >49% performance increase and also allowed the system to support two more tiles (12 VMs).

Direct IO (VT-d) Overhead

[Chart: VT-d cases, CPU utilization breakdown (Dom0 / Xen / guest kernel / guest user) for the Disk Bandwidth and Web Server workloads]
[Chart: VT-d, Xen cycles breakdown (interrupt window, INTR, APIC access, IO instruction): roughly 5.94% of cycles spent in Xen for Disk Bandwidth and 11.878% for Web Server]

APIC access and interrupt delivery consumed the most cycles. (Note that some interrupts arrive while the CPU is halted, so they are not counted.) Across various workloads we have seen Xen introduce about 5~12% overhead, mainly spent serving interrupts. The Intel OTC team has developed a patch set to eliminate part of this Xen software overhead; check out Xiaowei's session for details.
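As a rough illustration of where that interrupt load can be observed from inside a guest, here is a small sketch that samples /proc/interrupts for the assigned NIC's IRQ lines (the "eth" name filter and the 10-second window are assumptions):

import time
from collections import defaultdict

def irq_counts(match="eth"):
    # Sum the per-CPU counts of every IRQ line whose name contains `match`.
    counts = defaultdict(int)
    with open("/proc/interrupts") as f:
        cpus = len(f.readline().split())          # header row: CPU0 CPU1 ...
        for line in f:
            fields = line.split()
            if match in line and fields and fields[0].endswith(":"):
                for cpu, val in enumerate(fields[1:1 + cpus]):
                    if val.isdigit():
                        counts[cpu] += int(val)
    return counts

before = irq_counts()
time.sleep(10)
after = irq_counts()
for cpu in sorted(after):
    rate = (after[cpu] - before.get(cpu, 0)) / 10
    print(f"CPU{cpu}: {rate:.0f} NIC interrupts/s")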

CREDIT

Great thanks to DUAN, Ronghui and XIANG, Kai for providing the VT-d network and SR-IOV data.

QUESTIONS?

BACKUP

Configuration

Hardware Configuration:

Intel® Nehalem-EP server system:
• CPU: 2-socket Nehalem 2.66 GHz with 8MB LLC cache, C0 stepping; hardware prefetchers OFF, Turbo mode OFF, EIST OFF.
• NIC devices: Intel 10Gb XF SR NIC (82598EB); 2 single-port NICs installed on the machine and one dual-port NIC installed on the server.
• Storage: RAID bus controller LSI Logic MegaRAID SAS 8888ELP x3; disk array x6 (each with 12 x 70GB SAS HDDs).
• Memory: 64GB (16x 4GB DDR3 1066MHz), 32GB on each node.

Test case VM configuration:
• Network micro benchmark: 4 vCPUs, 64GB memory
• Storage micro benchmark: 2 vCPUs, 12GB memory
• Web Server: 4 vCPUs, 64GB memory
• Database: 4 vCPUs, 12GB memory

Software Configuration: Xen c/s 18771 for the network/disk micro benchmarks, c/s 19591 for the SR-IOV test.
