Exploring and Optimizing Scalability for High Performance Virtual Switching
Zhihong Wang – [email protected]
Ian Stokes – [email protected]

Notices and Disclaimers

© Copyright 2017 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, the Intel Inside logo, and Intel. Experience What's Inside are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Results have been estimated or simulated using internal Intel analysis or architecture simulation or modelling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

A Typical Use Case

Cloud computing multi-tenancy: many guests (VM1 ... VMn) attached to one virtual switch, with a load balancer spreading traffic across them by round-robin on the destination IP.

Test setup:
. Open vSwitch: v2.7.0
. DPDK: 17.02
. GCC/GLIBC: 6.2.1/2.23
. Linux: 4.7.5-200.fc24.x86_64
. CPU: Xeon E5-2699 v3 @ 2.30GHz
. NIC: 82599ES 10-Gigabit
. Traffic: round-robin DST IP

OvS-DPDK Scalability AS IS

Throughput drops significantly as the number of guests increases.
Performance doesn't scale by adding more CPUs.
There is a lack of clear guidelines for customer deployment.

* Test and System Configurations: Estimates are based on internal Intel analysis using Intel® Server Board S2600WT, Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, Intel® 82599ES 10 Gigabit Ethernet Controller. The same configuration applies to all performance results in this deck.

Datapath Abstraction

PVP (North-South) topology.

[Diagram: VM1 ... VMn, each with a Virtio-net device, attached to Vhost-user1 ... Vhost-usern ports on the virtual switch (OvS-DPDK as reference); the switch also drives NIC1 and NIC2, which face an Ixia traffic generator.]

Guest datapath (inside each VM):

    while (true) {
        foreach rxq {
            recv packets;
            process packet;
            foreach txq {
                send packets;
            }
        }
    }

OvS-DPDK PMD:

    while (true) {
        foreach rxq {
            recv packets;
            table lookup;
            foreach txq {
                send packets;
            }
        }
    }

Mapping To Hardware

[Diagram: the guest loop and two OvS-DPDK PMD loops mapped onto CPU1 ... CPUn of a single CPU socket, all sharing memory and the Last Level Cache, with the Intel adapter/NIC attached to the socket.]

The guest loop runs on the VM's vCPUs, and each OvS-DPDK PMD loop runs on a dedicated host core; all of them share the same last-level cache, memory and NIC.

Analysis

. Cache efficiency – work hard
. CPU efficiency – work smart
. CPU utilization – multitasking

A good hardware work ethic is what so-called optimization comes down to.

Cache Efficiency: Maximize Burst Size

Vhost Tx fragmentation: one received burst is fanned out to multiple guests, so each vhost Tx sends only a fraction of a burst.
How does it impact? Poor cache efficiency.

The solution: Tx-Buffering.

Side effect: increased latency. To narrow the gap:
. For throughput peaks only
. Micro benchmarking
. Timeout settings tuning

[Diagram: a DPDK app with a Virtio PMD inside the VM, connected to a Vhost-user port in the switch.]
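The Tx-Buffering above was evaluated with a POC patch against OvS 2.7, with the formal upstream patch still work in progress at the time (see "How Did We Do?" below). As an illustration only: later upstream OvS-DPDK releases expose an equivalent output-batching knob named tx-flush-interval; the sketch below assumes such a release (roughly OvS 2.9 or newer) and is not the POC patch itself.

    # Hedged sketch, assuming an OvS-DPDK release with output batching:
    # let packets wait up to 50 microseconds in the output batch so that
    # vhost Tx bursts stay large; 0 (the default) flushes immediately.
    ovs-vsctl set Open_vSwitch . other_config:tx-flush-interval=50

    # Check the currently configured value.
    ovs-vsctl get Open_vSwitch . other_config:tx-flush-interval

As the latency slides below point out, any such buffering trades latency for throughput, so the interval deserves the same careful evaluation as the POC's timeout settings.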
CPU Efficiency: Minimize Empty Polling

Allocate CPU based on the expected traffic model:
. NIC cycles per 64B packet: ~30
. Vhost cycles per 64B packet: ~70 + memory latency
. NIC Rx + Vhost Tx ~= Vhost Rx + NIC Tx

So a 1:1 core allocation between NIC and vhost ports fits a 1:1 north vs. south traffic split.

Default round-robin rxq mapping:

    [root]# ./ovs-appctl dpif-netdev/pmd-rxq-show
    pmd thread numa_id 0 core_id 13:
        isolated : false
        port: Vhost-user1 queue-id: 0
        port: Vhost-user2 queue-id: 0
        port: Vhost-user4 queue-id: 0
    pmd thread numa_id 0 core_id 12:
        isolated : false
        port: dpdk0 queue-id: 0
        port: Vhost-user3 queue-id: 0

After pinning with pmd-rxq-affinity:

    [root]# ./ovs-appctl dpif-netdev/pmd-rxq-show
    pmd thread numa_id 0 core_id 13:
        isolated : true
        port: Vhost-user1 queue-id: 0
        port: Vhost-user2 queue-id: 0
        port: Vhost-user3 queue-id: 0
        port: Vhost-user4 queue-id: 0
    pmd thread numa_id 0 core_id 12:
        isolated : true
        port: dpdk0 queue-id: 0

The default core mapping doesn't fit all deployments.

How Did We Do?

Not bad: 2.2x the throughput in one test case and 3.3x in the other. The results are based on a POC patch; the formal OvS patch is work in progress.

CPU Utilization: Enable Hyper-Threading

Does Hyper-Threading help? Yes:
. It hides memory access latency
. It squeezes more out of the execution units

With 2 threads for the NIC and 2 for vhost, should they be paired (NN, VV) or (NV, NV)?

But don't spoil it:
. Keep spinlocks inside one physical CPU

Side effect: again, increased latency.

[Diagram: two physical cores, CPU1 and CPU2, each exposing hyper-threads T1 and T2.]

Hyper-Threading Gain

Keeping the Txq spinlocks inside one physical CPU yields about a 30% gain from Hyper-Threading.

Finally, With All Due Optimization

About 3x the throughput overall: 3.5x in one test case and 2.9x in the other.
IMIX here means packet sizes drawn uniformly at random from [64, 1518] bytes.

Latency Is The Price To Pay

Latency is increased by:
. The extra overhead of the buffering mechanism
. Lower effective per-thread speed due to Hyper-Threading

Latency is decided by:
. Guests per PMD core (round-robin assignment)
. Traffic pattern (how fragmented the bursts are)
. Tx-Buffering timeout settings

There is no silver bullet, only tradeoffs and priorities.

Deployment Guideline

Enable vhost Tx-Buffering for throughput:
. Enable it dynamically, for throughput peaks only
. Evaluate the timeout settings carefully

Allocate CPU based on the expected traffic model (see the configuration sketch after this slide).

Utilize Hyper-Threading for the PMDs:
. With 1 CPU: 1 hyper-thread for the NIC, 1 hyper-thread for vhost
. With 2 CPUs: 2 hyper-threads on one CPU for the NIC, 2 on the other CPU for vhost
. With more than 2 CPUs: per-case tuning is required
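As a concrete illustration of the CPU-allocation guideline above, and of the pmd-rxq-affinity pinning shown on the "Minimize Empty Polling" slide, here is a minimal sketch using standard OvS-DPDK configuration options. It assumes the port names from this deck (dpdk0, Vhost-user1..4) and PMD cores 12 and 13 on NUMA node 0; the mask and core IDs are examples for that topology, not a general recommendation.

    # Run PMD threads on cores 12 and 13 (bits 12 and 13 of the mask).
    ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3000

    # Dedicate core 12 to the NIC and core 13 to the vhost ports
    # (1:1 north vs. south split); a PMD with pinned queues is reported
    # as "isolated : true" by pmd-rxq-show.
    ovs-vsctl set Interface dpdk0       other_config:pmd-rxq-affinity="0:12"
    ovs-vsctl set Interface Vhost-user1 other_config:pmd-rxq-affinity="0:13"
    ovs-vsctl set Interface Vhost-user2 other_config:pmd-rxq-affinity="0:13"
    ovs-vsctl set Interface Vhost-user3 other_config:pmd-rxq-affinity="0:13"
    ovs-vsctl set Interface Vhost-user4 other_config:pmd-rxq-affinity="0:13"

    # Inspect the resulting rxq-to-core mapping.
    ovs-appctl dpif-netdev/pmd-rxq-show

To follow the Hyper-Threading guideline on a single physical core, choose core IDs that are hyper-thread siblings for the pmd-cpu-mask and split the NIC and vhost queues between them in the same way.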