WHITE PAPER
Communications Service Providers
Characterizing VNF Performance

NUMA-Aware Hypervisor and Impact on Brocade* 5600 vRouter

Using a Brocade* 5600 vRouter as an example, this paper shows how a VNF can be used on a dual-socket COTS server, highlighting the impact of non-uniform memory access (NUMA) on the VNF's performance.

Author
Xavier Simonart, Intel

Table of Contents
Executive Summary ..... 1
1 Use-Case Details ..... 2
2 Test Results ..... 3
2.1 One Brocade vRouter Instance (Two Sockets) - No Cross Socket Traffic ..... 3
2.2 One Brocade vRouter Instance (Two Sockets) - Cross Socket Traffic ..... 3
3 System Under Test's Configuration ..... 3
3.1 Host Configuration ..... 3
3.1.1 Hardware and Software Details ..... 3
3.1.2 Grub.cfg ..... 4
3.1.3 QEMU ..... 4
3.1.4 Scripts ..... 4
3.2 Brocade vRouter Configuration ..... 9
3.2.1 Login and Password ..... 9
3.2.2 Set Root Password ..... 9
3.2.3 Set Vyatta Management Interface IP Address ..... 9
3.2.4 Enable ssh Access + http ..... 9
3.2.5 Key Manipulation ..... 10
3.2.6 Set Dataplane IP Address ..... 10
3.2.7 Create Routes ..... 10
3.2.8 Example Config File ..... 10
3.2.9 Brocade vRouter Configuration per Use Case ..... 13
4 Test Generators' Configuration ..... 15
4.1 Hardware and Software Details ..... 15
4.1.1 Grub.cfg ..... 15
4.1.2 Scripts to Prevent Interrupts on DPDK Fast Path ..... 15
4.2 Test Setup Details ..... 15
4.3 Test Parameters ..... 16
4.3.1 Traffic Profiles ..... 16
4.4 Characterization Scripts ..... 16
5 Running the Characterization ..... 17
6 BIOS Settings ..... 18
7 References ..... 21

Executive Summary

Many papers characterizing virtual network function (VNF) performance use only one socket of a dual-socket commercial off-the-shelf (COTS) server. In some cases, both sockets are used independently by two VNFs. In the case of a vRouter, for instance, this would mean that two independent routers run on the dual-socket system, and all interfaces could not be connected in a full mesh.

This document shows how a Brocade* 5600 vRouter can be used on a dual-socket COTS server in cross-socket, full-mesh traffic configurations. It highlights the impact of non-uniform memory access (NUMA) on the performance of VNF applications, using a Brocade 5600 vRouter as the example. It shows the importance of a NUMA-aware QEMU and the influence of the QPI link.

Figure 1 shows the performance of a Brocade 5600 vRouter and the performance impact when the traffic uses the QPI link. In this setup, traffic from one interface is always routed to one and exactly one (other) interface, either on the same CPU socket or on the other CPU socket. Other traffic profiles (e.g., traffic going from one interface to the other three interfaces on the same socket) might show different performance. In the rest of this paper, vRouter and Brocade 5600 refer to the Brocade 5600 vRouter.

[Figure 1: Brocade 5600 vRouter Throughput, 8x 10 GbE interfaces, 1024 routes, 16 next hops / interface, Impact of Traffic Profiles. Relative throughput vs. packet size (64 to 1518 bytes) for two traffic profiles: "1 dest, 0% QPI" and "1 dest, 100% QPI".]
Figure 1. Impact of traffic profile on vRouter's throughput¹

Even with traffic sent over the QPI link, the vRouter shows no drop in performance for packet sizes above 256 bytes.

¹ Intel internal analysis. See Section 3 for the system under test's configuration details, and Section 4 for the test generators' configuration details.
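Whether a given flow stays on one socket or crosses the QPI link is determined by which CPU socket each 10 GbE interface is physically attached to. As a minimal illustration (not part of the test scripts described later in this paper), the host's NUMA topology and the NUMA node of a NIC can be checked as follows; the PCI address 0000:03:00.0 is a placeholder:

    # List the NUMA nodes, their CPUs and memory (requires the numactl package).
    numactl --hardware

    # Report the NUMA node (CPU socket) a given NIC is attached to.
    # 0000:03:00.0 is a placeholder PCI address; -1 means the platform did not report a node.
    cat /sys/bus/pci/devices/0000:03:00.0/numa_node

    # Cross-check with lspci to identify which PCI address belongs to which 10 GbE port.
    lspci -D | grep -i ethernet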
1 Use-Case Details

Many papers characterizing VNF performance use only one socket of a dual-socket system. In some cases, both sockets are used independently by two VNFs.

In the case of a vRouter, for instance, this would mean that two independent routers run on the dual-socket system. If such a dual-socket system is able to handle eight 10 GbE interfaces, the traffic could not go from any interface to any interface (Figure 2): for instance, traffic from interface 1 cannot be forwarded to interface 5.

Figure 2. Two vRouter instances

In some cases, it might be required to support a full-mesh, eight-port 10 GbE vRouter. Hence, it is interesting to assess the performance of such a vRouter configuration, first (Figure 3) using the same traffic as in Figure 2 (i.e., traffic not crossing the inter-socket link), then using traffic crossing the inter-socket link (Figure 4).

Figure 3. One vRouter instance, no inter-socket traffic

Figure 4. One vRouter instance, with inter-socket traffic

In all three cases, the traffic is set up in such a way that all traffic from one interface is always routed to one (and exactly one) other interface.

• For Figure 2 and Figure 3 this means, for instance, that all traffic from interface 1 is sent to interface 2, and from interface 2 to interface 1, etc.
• For Figure 4 it means that all traffic from interface 1 is sent to interface 5, from interface 5 to interface 1, etc.

Different traffic profiles (where, for instance, traffic from interface 1 might be routed to interfaces 2 to 4, or even 1 to 4) will highlight different performance results.

The Brocade 5600 vRouter runs in a virtual machine, using QEMU as the hypervisor and CentOS as the host operating system (see Section 3.1.1 for hardware and software details). PCI pass-through is used,² i.e., control of the full physical device is given to the virtual machine; there is no virtual switch involved in the fast path (see Figure 5, using two instances of the vRouter, and Figure 6, using one instance spanning both CPU sockets). A sketch of the host-side binding needed for pass-through is shown at the end of this section.

Figure 5. Two vRouter instances

Figure 6. One vRouter instance

The vRouter is characterized under network load created by test generators: the test generators generate IP traffic towards four or eight 10 Gbps interfaces, and they measure the traffic coming back from those interfaces. These test generators can be Ixia* (or Spirent*) or COTS servers running DPDK-based applications (pktgen or prox).

For automation purposes, prox (https://01.org/intel-data-plane-performance-demonstrators/prox-overview) has been used to generate the traffic and to measure the throughput and latency from the Brocade 5600 vRouter.³ Ixia has been used as well to confirm some key results.

² PCI pass-through was chosen to stay focused on CPU characteristics and not be distracted by vNIC/NIC capabilities; it does not reflect on what the Brocade 5600 vRouter supports.
³ The choice of test generator is simply based on the engineer's preference and has no known impact on the performance numbers.
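The following is a minimal, illustrative sketch of how a NIC can be detached from its host driver and handed to vfio-pci before being passed through to the VM; it is not the exact procedure from Section 3.1.4. The PCI address 0000:03:00.0 and the vendor/device ID 8086:10fb (Intel 82599) are placeholders to adapt to the actual NICs.

    #!/bin/bash
    # Hedged sketch, not the procedure from Section 3.1.4: bind one 10 GbE port to vfio-pci
    # so QEMU can pass the whole physical device through to the vRouter VM.
    DEV=0000:03:00.0

    modprobe vfio-pci

    # Detach the port from its current kernel driver (e.g., ixgbe), if any.
    if [ -e /sys/bus/pci/devices/$DEV/driver ]; then
        echo $DEV > /sys/bus/pci/devices/$DEV/driver/unbind
    fi

    # Let vfio-pci claim devices with this vendor/device ID (placeholder: 8086 10fb).
    echo 8086 10fb > /sys/bus/pci/drivers/vfio-pci/new_id

    # The device can then be given to the VM with, for example:
    #   qemu-system-x86_64 ... -device vfio-pci,host=$DEV ...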
2 Test Results

2.1 One Brocade 5600 vRouter Instance (Two Sockets) – No Cross Socket Traffic

The goal of this test is to see which performance penalty is paid when one VNF uses both CPU sockets (see Figure 6) instead of two VNFs, each running on its own CPU socket (Figure 5).

Figure 7 shows the performance obtained when one Brocade 5600 vRouter instance uses interfaces from both sockets and cores from both sockets. It is compared with two instances, each running on its own socket.

Two different versions of QEMU are also compared: QEMU 1.5.3 and QEMU 2.4.1.

QEMU 1.5.3 is the default QEMU version included in CentOS 7.1. With this QEMU version, PCI devices passed through to the VM cannot be associated with a NUMA node.

QEMU 2.4.1 is the latest QEMU version (at the time of writing) available from open source. QEMU 2.4.1 has better support for NUMA, as the VM can be configured with knowledge about the NUMA nodes (a hedged command-line sketch follows Section 2.2):

• VCPUs on NUMA node
• Huge pages on NUMA nodes
• PCI devices on NUMA node

We see in Figure 7 that the performance gain using QEMU 2.4.1 versus QEMU 1.5.3 is very important.⁴ Even in the best-case scenario where the traffic does not cross the QPI link, QEMU 1.5.3, not fully NUMA aware, is severely impacted by running on both CPU sockets. Even though the traffic does not cross CPU sockets, packet handling on socket 0 results in many cases of memory being used on socket 1, generating intensive QPI traffic.

One vRouter instance spanning both sockets thus shows no performance penalty compared to two vRouter instances as long as NUMA-aware QEMU is used and as long as the QPI link is not actually utilized.

2.2 One Brocade 5600 vRouter Instance (Two Sockets) – Cross Socket Traffic

In the previous test result, the traffic was configured in such a way that it does not cross the CPU sockets (Figure 3), so that we were able to compare two instances of the Brocade 5600 vRouter (where it is not possible for the traffic to cross the QPI link) and one instance of the Brocade 5600 vRouter.

In this chapter, we check the influence of having traffic crossing the inter-socket link, taking full benefit of using only one instance of the vRouter (Figure 4).

[Figure 8: Brocade 5600 vRouter Throughput, 8x 10 GbE interfaces, 1024 routes, 16 next hops / interface, Impact of Traffic Profiles. Relative throughput vs. packet size (64 to 1518 bytes) for two traffic profiles: "1 dest, 0% QPI" and "1 dest, 100% QPI".]
Figure 8. Impact of traffic profile on vRouter throughput⁴

We see that the performance is lower when the traffic crosses the CPU sockets. Still, we can see that with any packet size bigger than 256 bytes, traffic line rate is achieved.
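To make the NUMA-awareness points from Section 2.1 concrete, the following is a minimal, illustrative QEMU 2.4.1 command line, not the exact invocation documented in Section 3.1.3. It covers the first two bullets (vCPUs and huge pages placed per guest NUMA node, backed by huge pages bound to the matching host socket) and passes one NIC per socket through to the VM; how the passed-through PCI devices are associated with a guest NUMA node is not shown here. Memory sizes, core lists, PCI addresses, and the disk image name are placeholders, and the vCPU threads still have to be pinned to the matching host cores (e.g., with taskset), as the scripts in Section 3.1.4 presumably arrange.

    #!/bin/bash
    # Hedged sketch, not the command line from Section 3.1.3: start a guest whose two
    # NUMA nodes mirror the two host sockets, with per-node huge-page memory.
    qemu-system-x86_64 -enable-kvm -cpu host -smp 8 -m 16384 \
        -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/hugepages,prealloc=yes,policy=bind,host-nodes=0 \
        -object memory-backend-file,id=mem1,size=8G,mem-path=/dev/hugepages,prealloc=yes,policy=bind,host-nodes=1 \
        -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
        -numa node,nodeid=1,cpus=4-7,memdev=mem1 \
        -device vfio-pci,host=0000:03:00.0 \
        -device vfio-pci,host=0000:81:00.0 \
        -drive file=vrouter.img,if=virtio \
        -nographic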