Case Study: Modeling the impact of CPU properties to optimize and predict packet-processing performance

Intel and AT&T have collaborated in a proof of concept (POC) to model and analyze the performance of packet-processing workloads on various CPUs. This POC has established a highly accurate model that can be used to simulate edge workloads on systems. Developers and architects can use this methodology to more accurately predict performance improvements for network workloads across future CPU product generations and hardware accelerators. This approach can help developers gain better insight into the impact of new x86 hardware components on packet-processing throughput.


Challenges in packet processing

AUTHORS
AT&T authors: Kartik Pandit, Vishwa M. Prasad
Intel authors: Bianny Bian, Atul Kwatra, Patrick Lu, Mike Riess, Wayne Willey, Huawei Xie, Gen Xu

Today’s packet-processing devices face an enormous performance challenge. Not only is more and more data being transmitted, but packet-processing tasks — which need to be executed at line speed — have become increasingly complex. Along with the usual forwarding of data, packet-processing systems are also responsible for other functions. These functions include traffic management (shaping, timing, scheduling), security processing, and quality of service (QoS).

Making the issue even more of a challenge is the proliferation of devices and sensors. Data is now often produced faster than it can be transmitted, stored, or processed. Most of this data is transmitted as packet streams over packet-switched networks. Along the way, various network devices — such as switches, routers, and adapters — are used to forward and process the data streams.


At the same time, technology and customer requirements are also changing rapidly. These changes are forcing vendors to seek data-streaming software solutions that deliver shorter development cycles. Unfortunately, shorter development cycles often mean that developers must make more revisions to update initial design shortcomings, and/or allow less time between required updates. Overall, this is a complex set of conditions that creates the challenge of optimizing and predicting packet-processing performance in future architectures.

Proposed solution: Network function virtualization (NFV)

One proposed solution to the challenge of data streaming is network function virtualization (NFV).

To explore the effectiveness of this potential NFV solution, Intel and AT&T have collaborated in a proof of concept (POC) project. In this POC, we focused on modeling and analyzing the impact of different CPU properties on packet-processing workloads. We performed an extensive analysis of packet-processing workloads via detailed simulations on various architectures. We then compared the simulation results to measurements made on physical hardware systems. This allowed us to validate the accuracy of both our model and the POC simulations.

POC results

The results of our POC show that performance scales linearly with the number of cores, and scales almost linearly with core frequency. Packet size did not have a significant performance impact in terms of packets per second (PPS) on our workload, nor did the size or performance of the LLC.¹

Our study and detailed results tell us that, when selecting a hardware component for an edge router workload, developers should consider prioritizing core number and frequency.

One of the key components in our POC was our use of Intel® CoFluent™ Technology (Intel® CoFluent™), a modeling and simulation tool for both hardware and software. We found our Intel CoFluent model to be highly accurate. The average difference (delta) in results between the simulation predictions and comparative measurements made on actual physical architectures was under 4%.¹ Because of this, we believe similar models could help developers make faster, better choices of components for new designs and optimizations. In turn, this could help developers reduce risk, minimize time to market for both new and updated products, and reduce overall development costs.

Network virtualization (sidebar)

Network function virtualization (NFV) refers to a network infrastructure concept. NFV virtualizes network node functions to make it easier to deploy and manage network services. With NFV, service providers can simplify and speed up scaling new network functions and applications, and better use network resources.

A virtualized network function (VNF) refers to a virtualized function used in an NFV architecture. These functions used to be carried out by dedicated hardware. With VNF, they are virtualized, performed by software, and run in one or more virtual machines. Common VNFs include load balancing, caching, intrusion detection devices, and firewalls.


Table of Contents

Challenges in packet processing ...... 1
Proposed solution: Network function virtualization (NFV) ...... 2
POC results ...... 2
Proof of concept for network function virtualization (NFV) ...... 4
  Goals of the POC ...... 4
  Predictive model to meet POC goals ...... 4
  Key components used in this POC ...... 4
    Network processing equipment ...... 4
    Packet processing workload ...... 4
    DPDK framework for packet processing ...... 5
  Hardware and software simulation tools ...... 5
    Traditional hardware and software simulation tools ...... 5
    Better solutions model both hardware and software ...... 5
  Intel® CoFluent™ Technology ...... 6
    Simulation at a functional level ...... 6
    Layered and configurable architecture ...... 6
Upstream and downstream workloads ...... 6
  Upstream pipeline ...... 6
    Execution flow of the upstream pipeline ...... 7
    6 stages in a typical upstream pipeline ...... 8
  Downstream pipeline ...... 8
Physical system setup ...... 8
  Packetgen in the physical DUT ...... 9
  Generating a packet-traffic profile ...... 10
  Performance and sensitivities of the traffic profile ...... 10
    Hyperthreading was disabled to expose the impact of other elements ...... 10
    Lookup table size affected performance ...... 10
Developing the Intel CoFluent simulation model ...... 11
  Simulating the packetgen ...... 11
  Simulating the network ...... 12
  Modeling the upstream pipeline ...... 12
  Implementing lookup algorithms ...... 12
  Developing a model of the cost of performance ...... 12
    Hardware performance considerations ...... 13
    Impact of cache on pipeline performance ...... 13
  Characterizing the cost model ...... 13
  Establishing the execution cost of the model ...... 14
  Simulation constraints and conditions ...... 14
    Support for the upstream pipeline ...... 14
    Cache analysis supported for LLC ...... 14
    Dropping or dumping packets was not supported ...... 14
    Critical data paths simulated ...... 14
Hardware analysis and data collection ...... 14
  Hardware analysis tools ...... 14
    Event Monitor (EMON) tool ...... 14
    Sampling Enabling Product (SEP) tool ...... 14
    EMON Data Processor (EDP) tool ...... 14
  Collecting performance-based data ...... 15
  Workload performance measurements and analysis for model inputs ...... 15
Results and model validation ...... 15
  Establishing a baseline for simulation accuracy ...... 15
  Measuring performance at different core frequencies ...... 15
  Analyzing performance for different cache sizes ...... 16
  Measuring performance for different numbers of cores ...... 17
  Simulating hardware acceleration components ...... 17
  Performance sensitivity from generation to generation ...... 18
    Core cycles per instruction (CPI) ...... 18
    Maximum capacity at the LLC level ...... 18
    Ideal world versus reality ...... 19
  Performance sensitivities based on traffic profile ...... 19
    Performance scaled linearly with the number of cores ...... 19
    Execution flow was steady ...... 19
  Next steps ...... 19
Summary ...... 20
  Key findings ...... 20
    CPU frequency ...... 20
    LLC cache ...... 20
    Performance ...... 20
    Packet size ...... 20
  Conclusion ...... 20
Appendix A. Performance in the downstream pipeline ...... 21
Appendix B. Acronyms and terminology ...... 22
Appendix C. Authors ...... 23

List of tables

Table 1. Test configuration based on the pre-production Intel® Xeon® processor, 1.8 GHz (Skylake) ...... 9
Table 2. Test configuration based on the Intel® Xeon® processor E5-2630, 2.2 GHz (Broadwell) ...... 9
Table 3. Test configuration based on the Intel® Xeon® processor E5-2680, 2.5 GHz (Haswell) ...... 10
Table A-1. Test configuration based on the Intel® Xeon® processor Gold 6152, 2.1 GHz ...... 21

List of figures

Figure 1. Edge router with upstream and downstream pipelines ...... 5
Figure 2. Edge router’s upstream software pipeline ...... 6
Figure 3. Edge router’s downstream software pipeline ...... 6
Figure 4. Model development ...... 11
Figure 5. Model of upstream software pipeline ...... 12
Figure 6. Simple models for estimating cycles per instruction ...... 13
Figure 7. Baseline performance measurements ...... 15
Figure 8. Performance measured at different frequencies ...... 16
Figure 9. Performance measured for different LLC cache sizes ...... 16
Figure 10. Measuring performance on multi-core CPUs ...... 17
Figure 11. Performance comparison of baseline configuration versus a simulation that includes an FPGA-accelerated ACL lookup ...... 17
Figure 12. Comparison of performance from CPU generation to generation ...... 18
Figure A-1. Throughput scaling, as tested on the Intel® Xeon® processor E5-2680 ...... 21
Figure A-2. Throughput scaling at 2000 MHz on different architectures ...... 21


Proof of concept for network function virtualization (NFV)

The goal of this joint Intel-AT&T POC was to generate information that could help developers choose designs that could be best optimized for network traffic — and make such choices faster and more accurately. We also wanted to generate information that would help developers predict packet-processing performance for future architectures more accurately, for both hardware and software developments.

To do this, we first developed a predictive model based on performance data from an existing x86 hardware platform. We then compared network performance on that physical architecture to the performance projected by our simulation. These comparative measurements would help us determine the accuracy of our simulation model. A high degree of accuracy would help build confidence in using Intel CoFluent to effectively characterize network function virtualization workloads.

Upstream and downstream pipeline traffic (sidebar)

Upstream traffic is traffic that moves from end users toward the network’s core. Downstream traffic is traffic that moves toward the end user.

Goals of the POC

Our joint Intel-AT&T team had two main goals:

• Quantify and measure the performance of the upstream traffic pipeline of the Intel® Data Plane Development Kit (DPDK) virtual provider edge router (vPE).
• Identify CPU characteristics and identify components that have a significant impact on network performance for workloads running on an x86 system.

In this paper, we quantify the vPE upstream traffic pipeline using the Intel CoFluent modeling and simulation solution. We validated our model by comparing the simulation results to performance measurements on physical hardware.

For this POC, our team focused on modeling upstream pipelines. Upstream traffic moves from end users toward the network’s core (see Figure 1, next page). In the future, we hope to develop a similar predictive model to analyze the performance of downstream pipelines, where traffic moves toward the end user.

Predictive model to meet POC goals

To achieve our goals, our team needed to develop a highly accurate Intel CoFluent simulation model of the Intel DPDK vPE router workload. Such a model would help us characterize the network traffic pipelines. It would also allow us to project more accurate optimizations for future CPU product generations.

Longer term, our goal is to use these predictive models and simulations to identify performance bottlenecks in various designs of the CPU architecture, memory, and software. That future work would model both upstream and downstream pipelines. We hope to use that knowledge to recommend changes that will significantly improve the performance of future x86 architectures for packet-processing workloads. This information should make it easier for developers to choose components that will best optimize NFV workloads for specific business needs.

Key components used in this POC

For this NFV POC, we needed to identify the critical hardware characteristics that had the most impact on the network processing equipment and the packet-processing workload. To do this, we modeled and simulated a hardware system as typically used for NFV.

Network processing equipment

Network processing equipment can usually be divided into three categories:

• Easily programmable, general CPUs
• High performance (but hardwired) application-specific integrated circuits (ASICs)
• Middle-ground network-processing units (NPUs), such as field-programmable gate arrays (FPGAs)

Of those three categories, we focused this POC on the impact of general CPU characteristics on packet-processing throughput.

Packet processing workload

Packet-processing throughput is dependent on several hardware characteristics. These include:

• CPU speed
• Number of programmable cores in the CPU
• Cache size, bandwidth, and hit/miss latencies for level 1 cache (L1), level 2 cache (L2), and level 3 cache (L3; also called last level cache, or LLC)
• Memory bandwidth and read/write latencies
• Network interface card (NIC) throughput

In our POC, the packet-processing workload is the DPDK vPE virtual router.
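The scaling behaviors reported in this POC (linear with cores, near-linear with frequency) can be captured in a simple back-of-the-envelope throughput sketch. The cycles-per-packet figure below is a hypothetical placeholder, not a value measured in the study:

```python
def estimated_pps(cores, freq_hz, cycles_per_packet):
    """Estimate packets per second for a run-to-completion pipeline.

    Assumes throughput scales linearly with core count and (nearly)
    linearly with core frequency, as the POC results indicate.
    cycles_per_packet is a hypothetical per-packet processing cost.
    """
    return cores * freq_hz / cycles_per_packet

# Doubling the core count doubles estimated throughput;
# doubling the frequency does the same.
baseline = estimated_pps(cores=4, freq_hz=1.8e9, cycles_per_packet=600.0)
```

A model like this is only a first-order guide; the POC's value is in quantifying where (and by how much) reality deviates from this ideal.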

Figure 1. Edge router with upstream and downstream pipelines. Upstream traffic moves from end users toward the network’s core. Downstream traffic moves toward the end user(s).

DPDK framework for packet processing

The Intel DPDK is a set of libraries and drivers for fast packet processing. The DPDK packet framework gives developers a standard methodology for building complex packet-processing pipelines. The DPDK provides pipeline configuration files and functional blocks for building different kinds of applications. For example, for our POC, we used the functions to build our internet protocol (IP) pipeline application.

One of the benefits of DPDK functions is that they help with the rapid development of packet-processing applications that run on multicore processors. For example, in our POC, the edge router pipeline is built on the IP pipeline application (based on the DPDK functions), to run on our four physical hardware DUTs. Our IP pipeline models a provider edge router between the access network and the core network (see Figure 1).

Hardware and software simulation tools

Optimizing a design for network traffic is typically done using traditional simulation tools and a lot of manual effort. We were looking for a better approach that would make it easier and faster for developers to choose the best components for their needs.

Traditional hardware and software simulation tools

For system analysis, traditional simulation-based modeling tools range from solely software-oriented approaches to solely hardware-oriented approaches. Unfortunately, these traditional tools have not been able to meet the complex performance challenges driven by today’s packet-processing devices.

At one end of the traditional analysis spectrum are the software-oriented simulations. In these simulations, software behavior and interactions are defined against a specific execution time. However, solutions based solely on software analyses do not take hardware efficiency into consideration. Hardware efficiency has a significant impact on system performance.

At the other end of the spectrum are hardware-oriented simulators. These simulators model system timing on a cycle-by-cycle basis. These models are highly accurate, but suffer from very slow simulation speeds. Because of this, they are not usually used to analyze complete, end-to-end systems. Instead, they are used mainly for decision-making at the microarchitecture level.

Better solutions model both hardware and software

Solutions that model only software performance or which model only hardware performance are not effective for modeling the performance of a complete system. The best solution for modeling a complete system would be:

• Highly configurable
• Able to simulate both software and hardware aspects of an environment
• Easy to execute without the overhead of setting up actual packet-processing applications
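The DPDK packet framework's core idea, described earlier, is composing an application from chained functional blocks. A minimal, language-agnostic sketch of that chaining (in Python rather than DPDK's C API; the stage names are hypothetical stand-ins, not DPDK functions):

```python
from typing import Callable, List, Optional

Packet = dict
Stage = Callable[[Packet], Optional[Packet]]  # None means "packet dropped"

def run_pipeline(stages: List[Stage], packets: List[Packet]) -> List[Packet]:
    """Push each packet through the chained stages, stopping on a drop."""
    forwarded = []
    for pkt in packets:
        for stage in stages:
            pkt = stage(pkt)
            if pkt is None:
                break
        else:
            forwarded.append(pkt)
    return forwarded

# Hypothetical stages loosely mirroring the upstream pipeline of Figure 2.
def acl_stage(pkt):
    # Drop packets that fail the access-control check.
    return pkt if pkt.get("allowed", True) else None

def routing_stage(pkt):
    # Attach a next hop chosen by a (stub) routing-table lookup.
    pkt["next_hop"] = "core-network"
    return pkt

out = run_pipeline([acl_stage, routing_stage],
                   [{"allowed": True}, {"allowed": False}])
```

In the real framework, each block is a configurable table-driven stage and the chaining is declared in the pipeline configuration files mentioned above.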


Figure 2. Edge router’s upstream software pipeline.

Figure 3. Edge router’s downstream software pipeline.

Intel® CoFluent™ Technology

For our joint Intel-AT&T POC, we needed a more effective tool than a software-only or hardware-only analysis tool. To reach our goals, we chose the Intel CoFluent modeling and simulation solution. Intel CoFluent is an application that helps developers characterize and optimize both hardware and software environments. As shown by the results of this POC, the Intel CoFluent model proved to be highly accurate when compared to measurements taken on a physical system.¹

Simulation at a functional level

With Intel CoFluent, the computing and communication behavior of the software stack is abstracted and simulated at a functional level. Software functions are then dynamically mapped onto hardware components. The timing of the hardware components — CPU, memory, network, and storage — is modeled according to payload and activities, as perceived by software.

Layered and configurable architecture

For our POC, Intel CoFluent was ideal because the simulator can estimate complete system designs. Even more, Intel CoFluent can do so without the need for embedded application code, firmware, or even a precise platform description. In our POC, this meant we did not have to create and set up actual packet-processing applications for our model, but could simulate them instead.

Another key advantage of using Intel CoFluent for our POC is the tool’s layered and configurable architecture. The layered and configurable capabilities help developers optimize early architecture and designs, and predict system performance. Intel CoFluent also includes a low-overhead, discrete-event simulation engine. This engine enables fast simulation speed and good scalability.

Upstream and downstream workloads

For this project, our team examined primarily the upstream traffic pipeline. Figure 2 shows the edge router’s upstream software pipeline. (Figure 3 shows the edge router’s downstream software pipeline.)

Upstream pipeline

In the edge router’s upstream traffic pipeline there are several actively running components. Our POC used a physical model to validate the results of our simulation experiments. This physical test setup consisted of three actively running components:

• Ixia* packet generator (packetgen), or the software packetgen
• Intel® Ethernet controller (the NIC)
• Upstream software pipeline stages running on one or more cores
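The discrete-event approach mentioned above is what makes this kind of simulator fast: events carry timestamps and the simulator jumps to the earliest pending event instead of ticking cycle by cycle. A minimal illustrative sketch (not CoFluent's actual API):

```python
import heapq

class DiscreteEventSim:
    """Minimal discrete-event simulator: a time-ordered event heap."""

    def __init__(self):
        self.now = 0.0
        self._events = []  # (time, seq, action) tuples
        self._seq = 0      # tie-breaker so heapq never compares actions

    def schedule(self, delay, action):
        """Schedule `action(sim)` to fire `delay` time units from now."""
        heapq.heappush(self._events, (self.now + delay, self._seq, action))
        self._seq += 1

    def run(self):
        """Pop events in time order, advancing the clock to each one."""
        while self._events:
            self.now, _, action = heapq.heappop(self._events)
            action(self)

# Model two hardware latencies as scheduled callbacks.
log = []
sim = DiscreteEventSim()
sim.schedule(5.0, lambda s: log.append(("nic_rx", s.now)))
sim.schedule(2.0, lambda s: log.append(("cpu_wakeup", s.now)))
sim.run()
# Events fire in time order: cpu_wakeup at t=2.0, then nic_rx at t=5.0
```

Because the clock skips directly between events, simulated time between events costs nothing, which is why such engines scale to end-to-end systems where cycle-accurate simulators cannot.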

In our physical model, the Ixia packetgen injects packets into the Ethernet controller. This simulates the activity of packets arriving from the access network. The Ethernet controller receives packets from the access network, and places them in its internal buffer.

Execution flow of the upstream pipeline

The upstream software pipeline can run on a single core, or the workload can be distributed amongst several cores. Each core iterates each pipeline assigned to it, and runs the pipeline’s standard flow.

Here is the general execution flow of the typical packet-processing pipeline:

• Receive packets from input ports.
• Perform port-specific action handlers and table look-ups.
• Execute entry actions on a lookup hit, or execute the default actions on a lookup miss. (The table entry action usually sends packets to the output port, or dumps or drops the packets.)

Downstream pipeline stages

Although we did not simulate the downstream pipeline for this project, we did collect some data on this pipeline (see Appendix A).

As shown in Figure 3 (previous page), the second stage of the downstream pipeline is the routing stage. This stage demonstrates the use of the hash and LPM (longest prefix match) libraries in the data plane development kit (DPDK). The hash and LPM libraries are used to implement packet forwarding. In this pipeline stage, the lookup method is either hash-based or LPM-based, and is selected at runtime.

Hash lookup method

When the lookup method is hash-based, a hash object is used to emulate the downstream pipeline’s flow classification stage. The hash object is correlated with a flow table, in order to map each input packet to its flow at runtime. The hash lookup key is represented by a unique DiffServ 5-tuple.

The DiffServ 5-tuple is composed of several fields that are read from the input packet. These fields are the source IP address, destination IP address, transport protocol, source port number, and destination port number.

The ID of the output interface for the input packet is read from the identified flow table entry. The set of flows used by the application is statically configured, and is loaded into the hash upon initialization.

LPM lookup method

When the lookup method is LPM-based, an LPM object is used to emulate the pipeline’s forwarding stage for internet protocol version 4 (IPv4) packets. The LPM object is used as the routing table, in order to identify the next hop for each input packet at runtime.

Configuration of downstream pipeline

Below is the configuration code we used for the first stage in the downstream pipeline. Additional information and analysis of the downstream pipeline will be a future POC project.

  [PIPELINE1]
  type = ROUTING
  core = 1
  pktq_in = RXQ0.0 RXQ1.0
  pktq_out = SWQ0 SWQ1 SINK0
  encap = ethernet_qinq
  ip_hdr_offset = 270

Traffic Manager Pipeline: This is a pass-through stage with the following configuration:

  [PIPELINE2]
  type = PASS-THROUGH
  core = 1
  pktq_in = SWQ0 SWQ1 TM0 TM1
  pktq_out = TM0 TM1 SWQ2 SWQ3

Transmit Pipeline: Also a pass-through stage:

  [PIPELINE3]
  type = PASS-THROUGH
  core = 1
  pktq_in = SWQ2 SWQ3
  pktq_out = TXQ0.0 TXQ1.0
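The hash-based flow classification described earlier can be sketched as a dictionary keyed by the DiffServ 5-tuple. This is an illustrative stand-in for DPDK's hash library, and the flow-table entries are hypothetical:

```python
from typing import NamedTuple

class FiveTuple(NamedTuple):
    """DiffServ 5-tuple read from the input packet."""
    src_ip: str
    dst_ip: str
    proto: str
    src_port: int
    dst_port: int

# Statically configured flow table, loaded "at initialization":
# each entry maps a 5-tuple to the ID of the output interface.
flow_table = {
    FiveTuple("10.0.0.1", "192.168.1.9", "tcp", 1234, 80): 0,
    FiveTuple("10.0.0.2", "192.168.1.9", "udp", 5353, 53): 1,
}

def classify(pkt: FiveTuple, default_port: int = -1) -> int:
    """Hash lookup: map the packet's 5-tuple to its output interface.

    A miss returns default_port, standing in for the pipeline's
    default action (e.g. drop or dump)."""
    return flow_table.get(pkt, default_port)
```

In the real pipeline the same idea applies, but the table is a purpose-built hash object sized for millions of flows and tuned for cache behavior.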

6 stages in a typical upstream pipeline

In the specific case of the upstream traffic pipeline of the DPDK vPE, there are usually 6 stages. Figure 2 (earlier in this paper) shows an overview of the 6 typical stages. The first pipeline stage drains packets from the Ethernet controller. The last stage in the chain queues up the packets and sends them to the core network through the Ethernet controller.

In our POC, we modeled and simulated all key stages of the upstream pipeline, and verified those results against known hardware configurations.

Downstream pipeline

In the edge router’s downstream traffic pipeline there are 3 actively running components and 4 typical pipeline stages. Figure 3 (earlier in this paper) shows an overview of the four typical stages. The three components are:

• DPDK packetgen
• Intel® Ethernet controller (the NIC)
• Downstream software pipeline stages, running on one or more cores

In our POC, for the downstream pipeline, the packetgen injects packets into the Ethernet controller. This simulates packets entering the access network from the core. The first stage of the edge router’s downstream pipeline pops packets from the internal buffer of the Ethernet controller. The last stage sends packets to the access network via the Ethernet controller.

Note that the scope of this project did not allow a complete analysis of the downstream pipeline. The downstream pipeline uses different software and has different functionality and pipeline stages, as compared to the upstream pipeline. We do provide some of the data and insights for the downstream pipeline that we observed while conducting our POC (see Appendix A). However, full analysis and verification of those initial results will have to be a future project.

Project names for devices under test (DUTs) (sidebar)

Intel internal project code names are often used to refer to various processors during development, proof of concepts (POCs), and other research projects. In the joint Intel and AT&T POC, we used three main test configurations, one of which was a pre-production processor (Skylake). Some of the devices under test (DUTs) were used to confirm simulation results and establish the accuracy of the simulations. Some were used to confirm simulation results for projecting optimal configurations for future generations of processors. A fourth, production version of the Skylake microarchitecture was used to characterize some aspects of the downstream pipeline (see Appendix A).

The three main DUTs for our POC were based on these processors, with these project code names:

• Skylake-based DUT: Pre-production Intel® Xeon® processor, 1.8 GHz
• Broadwell-based DUT: Intel® Xeon® processor E5-2630, 2.2 GHz
• Haswell-based DUT: Intel® Xeon® processor E5-2680, 2.5 GHz

Physical system setup

When our team began setting up this POC, we started with a description of a typical physical architecture. We then set up a hardware DUT that would match that architecture as closely as possible. We set up additional DUTs to provide configurations for comparisons and verifications.

Tables 1 and 2 (next page) describe the two DUTs we built for the first phase of our NFV POC. We used these DUTs to take performance measurements on the upstream pipeline. We compared those measurements to the corresponding elements of our Intel CoFluent simulations. The physical DUTs helped us determine the accuracy of the virtual Intel CoFluent model that we used for our simulations and projections.

In our POC, the devices under test (DUTs) used IxNetwork* client software to connect to an Ixia traffic generator. Ixia generates simulated edge traffic into the DUT, and reports measurements of the maximum forwarding performance of the pipeline. In our model, we did not include considerations of packet loss.

The DUT described in Table 1 is based on a pre-production Intel® Xeon® processor, 1.8 GHz, with 32 cores. The Intel-internal project code name for this pre-production processor is “Skylake.”

The DUT described in Table 2 is based on an Intel® Xeon® processor E5-2630, 2.2 GHz. The Intel-internal project code name for this processor is “Broadwell.”

In order to explore the performance sensitivity of one processor generation versus another, we set up an additional DUT, as described in Table 3. This DUT is based on an Intel® Xeon® processor E5-2680, 2.5 GHz. The Intel-internal project code name for this processor is “Haswell.”

Packetgen in the physical DUT

As mentioned earlier in the description of the upstream pipeline, our physical systems included the Ixia packetgen. In the upstream pipeline, the job of this hardware-based packetgen is to generate packets and work with the packet receive (RX) and transmit (TX) functions. Basically, the packetgen sends packets into the receive unit or out of the transmit unit. This is just one of the key hardware functions that was simulated in our Intel CoFluent model.

Table 1. Test configuration based on the pre-production Intel® Xeon® processor, 1.8 GHz (Skylake)

Component | Description         | Details
Processor | Product             | Pre-production Intel® Xeon® processor, 1.8 GHz
Processor | Speed (MHz)         | 1800
Processor | Number of CPUs      | 32 cores / 64 threads
Processor | LLC cache           | 22528 KB
Memory    | Capacity            | 64 GB
Memory    | Type                | DDR4
Memory    | Rank                | 2
Memory    | Speed (MHz)         | 2666
Memory    | Channel/socket      | 6
Memory    | Per DIMM size       | 16 GB
NIC       | Ethernet controller | X710-DA4 (4x10G)
NIC       | Driver              | igb_uio
OS        | Distribution        | Ubuntu 16.04.2 LTS
OS        | Kernel              | 4.4.0-64-lowlatency
BIOS      | Hyper-threading     | Off

Table 2. Test configuration based on the Intel® Xeon® processor E5-2630, 2.2 GHz (Broadwell)

Component | Description         | Details
Processor | Product             | Intel® Xeon® processor E5-2630, 2.2 GHz
Processor | Speed (MHz)         | 2200
Processor | Number of CPUs      | 10 cores / 20 threads
Processor | LLC cache           | 25600 KB
Memory    | Capacity            | 64 GB
Memory    | Type                | DDR4
Memory    | Rank                | 2
Memory    | Speed (MHz)         | 2133
Memory    | Channel/socket      | 4
Memory    | Per DIMM size       | 16 GB
NIC       | Ethernet controller | X710-DA4 (4x10G)
NIC       | Driver              | igb_uio
OS        | Distribution        | Ubuntu 16.04.2 LTS
OS        | Kernel              | 4.4.0-64-lowlatency
BIOS      | Hyper-threading     | Off


Table 3. Test configuration based on the Intel® Xeon® processor E5-2680, 2.5 GHz (Haswell)

Component | Description         | Details
Processor | Product             | Intel® Xeon® processor E5-2680 v3, 2.5 GHz
Processor | Speed (MHz)         | 2500
Processor | Number of CPUs      | 24 cores / 24 threads
Processor | LLC cache           | 30720 KB
Memory    | Capacity            | 256 GB
Memory    | Type                | DDR4
Memory    | Rank                | 2
Memory    | Speed (MHz)         | 2666
Memory    | Channel/socket      | 6
Memory    | Per DIMM size       | 16 GB
NIC       | Ethernet controller | (4x10G)
NIC       | Driver              | igb_uio
OS        | Distribution        | Ubuntu 16.04.2 LTS
OS        | Kernel              | 4.4.0-64-lowlatency
BIOS      | Hyper-threading     | Off

Generating a packet-traffic profile

Once we set up our physical DUTs, we needed to estimate the performance effect of a cache miss in the routing table lookup on these architectures. To do this, for each packet, we increased the destination IP by a fixed stride of 0.4.0.0. This caused each destination IP lookup to hit at a different memory location in the routing table.

For our POC, we chose the following IP range settings to traverse the LPM (longest prefix match) table. For lpm24, the memory range is 64 MB, which exceeds the LLC size, and can trigger a miss in the LLC cache.

  range 0 dst ip start 0.0.0.0
  range 0 dst ip min 0.0.0.0
  range 0 dst ip max 255.255.255.255
  range 0 dst ip inc 0.4.0.0

For the source IP setting, any IP stride should be appropriate, as long as the stride succeeds on the access control list (ACL) table lookup. (The exact relationship of cache misses and traffic characteristics is not described in this POC, and will be investigated in a future study.)

In our physical test model, we used default settings for other parameters, such as the media access control (MAC) address, source (SRC) transmission control protocol (TCP) port, and destination (DST) TCP port.

Performance and sensitivities of the traffic profile

In order to get the most accurate results, we needed to characterize the traffic profile in detail for both the hardware DUTs and our Intel CoFluent models and simulations.

Hyperthreading was disabled to expose the impact of other elements

There are a number of system and application parameters that can impact performance, including hyperthreading. For example, when we ran the workload with hyperthreading enabled, we gained about 25% more performance per core.¹

However, hyperthreading shares some hardware resources between cores, and this can mask core performance issues. Also, the performance delivered by hyperthreading can make it hard to identify the impact of other, more subtle architectural elements. Since we were looking for the impact of those other packet-handling elements, we disabled hyperthreading for this POC.

Lookup table size affected performance

While setting up the experiments, we observed a performance difference (delta) that depended on the size of the application’s lookup table. Because of this, for our POC, we decided to use the traffic profile described under “Generating a packet-traffic profile.” This ensured that we had some LLC misses in our model.
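The effect of the 0.4.0.0 destination-IP stride can be checked with a short sketch: successive destination addresses land in different /24 prefixes, so each lpm24 lookup touches a different routing-table entry (and, with a table spanning 64 MB, can miss in the LLC):

```python
def ip_to_int(ip: str) -> int:
    """Convert dotted-quad IPv4 to a 32-bit integer."""
    a, b, c, d = (int(x) for x in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def int_to_ip(v: int) -> str:
    """Convert a 32-bit integer back to dotted-quad form."""
    return ".".join(str((v >> s) & 0xFF) for s in (24, 16, 8, 0))

STRIDE = ip_to_int("0.4.0.0")  # 0x00040000: flips bit 18, inside the /24 prefix

def dst_sequence(start: str, count: int):
    """Destination IPs visited by the traffic profile (wrapping at 2^32)."""
    v = ip_to_int(start)
    for _ in range(count):
        yield int_to_ip(v)
        v = (v + STRIDE) % (1 << 32)

ips = list(dst_sequence("0.0.0.0", 4))
# → ['0.0.0.0', '0.4.0.0', '0.8.0.0', '0.12.0.0']
```

Because the stride changes bits within the top 24 bits of the address, every generated destination belongs to a distinct /24 prefix, which is exactly what forces each lookup to a new lpm24 entry.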


Figure 4. Model development. This figure shows how we modeled the flow for the simulation of the entire virtual provider edge (vPE) router pipeline.

Developing the Intel CoFluent simulation model

To develop the simulation model for this project, we performed an analysis of the source code, and developed a behavior model. We then developed a model of the performance cost in order to create a virtualized network function (VNF) development model (see Figure 4).

To create our VNF simulation, we built an Intel CoFluent model of actively running components and pipeline stages that corresponded to those in the physical DUTs. We mapped these pipeline stages to a CPU core as the workload. We then simulated the behavior of each pipeline stage.

In any performance cost model, the underlying hardware can have an impact on the pipeline flow. For this reason, we modeled all key hardware components except storage. It was not necessary to model storage because our workload did not perform any actual storage I/O.

For our project, the actively running components were the CPU, Ethernet controller, and packet generator (packetgen). One of the benefits of using the Intel CoFluent framework for these simulations is that Intel CoFluent can schedule these components at a user-specified granularity of nanoseconds or even smaller.

Simulating the packetgen

In a simulation, there is no physical hardware to generate packets, so we needed to add that functionality to our model. For this POC, we did not actually simulate the packetgen. Instead, we used queues to represent it. To do this, we first simulated a queue of packets that were sent to the Ethernet controller at a defined rate. In other words, we created receive (RX) and transmit (TX) queues for our model. We took a packet off the RX queue every so many milliseconds for the RX stage in the pipeline. This simulated a packet arriving at a specific rate. We did a similar simulation for the TX stage in the pipeline.

The rate at which packets entered and exited the queues was determined by the way the physical DUT behaved, so that the simulation would model the physical DUT as closely as possible.
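The queue-based packetgen described above can be sketched as a simple loop that drives RX and TX queues at a fixed rate. This is an illustrative reconstruction, not the actual Intel CoFluent model; the class name, rate, and duration are hypothetical.

```python
from collections import deque

class QueuePacketgen:
    """Toy stand-in for a packet generator: packets enter an RX queue at a
    defined rate and drain through a TX queue, mirroring the queue-based
    approach described in the text (all names and values are hypothetical)."""

    def __init__(self, rate_pps):
        self.interval_ns = 1e9 / rate_pps   # time between packet arrivals
        self.rx = deque()
        self.tx = deque()

    def run(self, duration_ns, pipeline):
        t, generated, sent = 0.0, 0, 0
        while t < duration_ns:
            self.rx.append(("pkt", t))       # packet arrives at the defined rate
            generated += 1
            pkt = self.rx.popleft()          # RX stage takes a packet off the queue
            self.tx.append(pipeline(pkt))    # pipeline stages process it
            self.tx.popleft()                # TX stage drains at the same rate
            sent += 1
            t += self.interval_ns
        return generated, sent

gen = QueuePacketgen(rate_pps=1_000_000)               # 1 MPPS arrival rate
generated, sent = gen.run(1e6, pipeline=lambda p: p)   # simulate 1 ms
print(generated, sent)                                 # 1000 1000
```

In the real model, the per-stage processing time would come from the cost model rather than a pass-through lambda.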


Figure 5. Model of upstream software pipeline. The ACL pipeline stage supports auditing of incoming traffic. Note that, in our proof-of-concept (POC), the queueing stage has two locations, and performs packet receiving (RX) or packet transmitting (TX), depending on its location in the pipeline.

Simulating the network

In this POC, the Ethernet controller was simulated based on a very simple throughput model, which receives or sends packets at a user-specified rate.

For our POC, since we wanted to characterize the performance impact of CPU parameters on the software packet processing pipeline, we did not implement the internal physical layer (PHY), MAC, switching, or first-in-first-out (FIFO) logic. We specifically defined our test to make sure we would not see system bottlenecks from memory bandwidth or from the bandwidth of the Peripheral Component Interconnect Express (PCI-e). Because of that, we did not need to model the effect of that network transaction traffic versus system bandwidth.

Modeling the upstream pipeline

In our POC, we simulated all key stages of the upstream packet processing pipeline. The ACL filters, flow classifier, metering and policing, and routing stages were modeled individually. The packet RX stage and the packet TX stage are also usually separate pipeline stages. In our POC, we modeled the packet RX and packet TX stages as a single packet queueing stage that was located at both the beginning and the end of the pipeline (see Figure 5).

Implementing lookup algorithms

One of the things we needed to do for our model was to implement lookup algorithms. To do this, we first had to consider the pipelines. As shown earlier in Figure 2, an upstream pipeline usually consists of 3 actively running components and 6 typical pipeline stages.

Note that the ACL pipeline stage is a multiple-bit trie implementation (a tree-like structure). The routing pipeline stage uses an LPM lookup algorithm, which is based on a full implementation of the binary tree. For our POC, we implemented the ACL lookup algorithm and the LPM lookup algorithm to support auditing of the incoming traffic. We also implemented these two algorithms to support routing of traffic to different destinations.

In addition, the flow classification used a hash table lookup algorithm, while flow action used an array table lookup algorithm. We implemented both of these algorithms in our model.

Developing a model of the cost of performance

It's important to understand that Intel CoFluent is a high-level framework for simulating behavior. This means that the framework doesn't actually execute CPU instructions, access cache or memory cells, or perform network I/O. Instead, Intel CoFluent uses algorithms to simulate these processes with great accuracy (as shown in this POC).¹

In order to develop a model of the execution cost of performance, we tested the accuracy of the simulations in all phases of our POC by comparing the simulation results to measurements of actual physical architectures of various DUTs. The delta between the estimated execution times from the simulation, and measurements made on the physical DUTs, ranged from 0.4% to 3.3%.¹ This gave us a high degree of confidence in our cost model. (Specifics on the accuracy of our model and simulations, as compared to the DUTs, are discussed later in this paper under "Results and model validation.")

With the very small delta seen between performance on the simulations versus the physical DUTs, we expect to be able to use other Intel CoFluent models in the future, to effectively identify the best components for other packet-traffic workloads under development.
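The LPM routing lookup mentioned above can be illustrated with a minimal longest-prefix-match table. This is a generic sketch for clarity, not the DPDK implementation (which uses a multi-level table for speed), and the route entries are invented.

```python
import ipaddress

# Minimal longest-prefix-match (LPM) lookup, illustrating the routing stage
# described in the text. A linear scan is used only for clarity; the routes
# below are invented examples.
routes = [
    (ipaddress.ip_network("10.0.0.0/8"), "port0"),
    (ipaddress.ip_network("10.1.0.0/16"), "port1"),
    (ipaddress.ip_network("0.0.0.0/0"), "default"),
]

def lpm_lookup(dst):
    addr = ipaddress.ip_address(dst)
    # Among all matching prefixes, pick the longest one
    best = max(
        (net for net, _ in routes if addr in net),
        key=lambda net: net.prefixlen,
    )
    return next(port for net, port in routes if net == best)

print(lpm_lookup("10.1.2.3"))   # "port1": the /16 wins over the /8
print(lpm_lookup("192.0.2.1"))  # "default"
```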


Hardware performance considerations To build an NFV model of the true cost of performance, we had to consider hardware, not just software. For example, elements that affect a typical hardware configuration include core frequency, cache size, memory frequency, NIC throughput, and so on. Different cache sizes can also trigger different cache miss rates in the LPM table or the ACL table lookup. Any of these hardware-configuration factors could have a significant effect on our cost model.
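A cost model that accounts for these factors can be driven by a simple parameter sweep, feeding each combination to the simulator. The parameter names and values here are hypothetical placeholders for the knobs listed above.

```python
from itertools import product

# Hypothetical hardware-configuration sweep for a what-if cost model.
# Each combination would be handed to the simulator; the values below are
# examples only, not the POC's actual sweep.
sweep = {
    "core_freq_ghz": [1.8, 2.1, 3.0],
    "llc_size_mb": [2, 8, 22],
    "cores": [1, 2, 6],
}

# Cartesian product of all parameter values -> list of config dicts
configs = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]
print(len(configs))     # 27 candidate configurations
print(configs[0])
```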

Impact of cache on pipeline performance

Another key consideration in our POC was how much the packet processing performance could be affected by different hardware components. For example, consider cache. Specifically, look at the impact on pipeline performance of cache misses at different cache levels. Besides the packet RX and packet TX pipelines, all edge router pipelines must perform a table lookup, then a table entry handler on a hit — or a default table entry handler on a miss. These operations are mostly memory operations.

Cache misses could have very different access latencies for the memory operations at different cache levels. For example, latencies could be as few as 4 cycles for an L1 cache hit; 12 cycles for an L2 cache hit; or hundreds of cycles for a DRAM memory access. The longer the latency of the memory access, the greater the impact on the CPU microarchitecture pipeline.

The impact of latency on the pipeline is called the blocking probability or blocking factor (as compared to a zero cache miss). The longer the memory access latency, the more the CPU execution pipeline is blocked. The blocking factor is the ratio of that latency as compared to zero cache misses.

You might expect the blocking factor to be 1 when memory access cycles aren't hidden by the CPU's execution pipeline, but that is not actually the case. A miss does not necessarily result in the processor being blocked. In reality, the CPU can execute other instructions even while some instructions are blocked at the memory access point. Because of this, some instructions are executed as if overlapped. The result is that, regardless of the DUT configuration, the blocking factor is not usually 1.

The challenge for developers is that the cache miss rate has a critical impact on performance. To address this challenge, we needed to quantify the impact of this miss rate, and integrate its consequences into our cost model. To do this, we configured the cache size via the Intel® Platform Quality of Service Technology (PQOS) utility. We also increased the number of destination IPs to traverse the LPM table. This allowed us to introduce different LLC cache miss rates into our model.

Note: L1 and L2 also have a significant impact on performance. However, the impact of L1 and L2 can't actually be quantified, since we cannot change the sizes of the L1 and L2 cache.

Figure 6. Simple models for estimating cycles per instruction (CPI) when cache misses occur; and for estimating path length for the pipeline.

• Linear performance. Core frequency refers to clock cycles per second (CPS), which is used as an indicator of the processor's speed. The higher the core frequency, the faster the core can execute an instruction. Again, we used a regression model to estimate the packet processing throughput for the upstream software pipeline, by using the core frequency as a predictor variable.

• Real-world performance versus ideal cache miss rate. To estimate the impact of cache miss rate on performance, we regressed the equation for core cycles per instruction (CPI). In regressing the equation for CPI, we used LLC misses per instruction (MPI) and LLC miss latency (ML) as predictor variables (see Figure 6). In other words, we regressed the blocking factor and the core CPI metric. This gave us a way to estimate the extra performance cost imposed by different hardware cache configurations.

Characterizing the cost model

In our POC, the cost model is a model of the execution cost of specific functions in the execution flow of the upstream software pipeline. In other words, the cost model is the execution latency.

The cost model for each software pipeline consists of characterizing the pipeline in terms of CPI and path length. We determined CPI using the simple model shown in Figure 6 (above).
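The CPI regression described above can be sketched as an ordinary least-squares fit of CPI against the product of misses per instruction and miss latency. The sample points and the fitted coefficients below are invented for illustration; they are not the POC's actual data.

```python
# Hedged sketch of the CPI regression described in the text:
#   CPI = base_CPI + blocking_factor * (MPI * ML)
# where MPI is LLC misses per instruction and ML is miss latency in cycles.
# Sample points are invented; real fits would use measured counter data.
def fit_cpi_model(samples):
    """Ordinary least squares for CPI = base + bf * (MPI * ML)."""
    xs = [mpi * ml for mpi, ml, _ in samples]
    ys = [cpi for _, _, cpi in samples]
    n = len(samples)
    mx, my = sum(xs) / n, sum(ys) / n
    bf = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    base = my - bf * mx
    return base, bf

# (MPI, miss latency in cycles, measured CPI) -- invented example points
samples = [(0.001, 200, 0.60), (0.005, 200, 1.00), (0.010, 200, 1.50)]
base, bf = fit_cpi_model(samples)
print(round(base, 2), round(bf, 2))   # 0.5 0.5
```

With the blocking factor fitted, the model can predict the extra CPI cost of any candidate cache configuration from its expected MPI and miss latency.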


In order to get the execution latency for a specific length of the pipeline, we also had to estimate the path length for that section of the pipeline. The path length is the number of x86 instructions retired per 1 Mb of data sent. Again, see Figure 6 (previous page).

In our model, multiplying the two variables — CPI and path length — gives the execution time of the software pipeline in terms of CPU cycles. With that information, we were able to simulate CPI and path length, using the Intel CoFluent tool, in order to compute the end-to-end packet throughput.

Establishing the execution cost of the model

We began building our Intel CoFluent cost model based on the DPDK code. With the previous considerations taken into account, we used the DPDK code to measure the instructions and cycles spent in the different pipeline stages. These cycles were assumed to be the basic execution cost of the model. Figure 4, earlier in this paper, shows an overview of the cost model.

Simulation constraints and conditions

For the upstream pipeline, we modeled hardware parameters (such as CPU frequency and LLC cache size), packet size, pipeline configurations, and flow configurations. We focused on the areas we expected would have the biggest impact on performance. We then verified the results of our simulations against performance measurements made on the physical hardware DUTs.

In order to create an effective model for a complete system, we accepted some conditions for this project.

Support for the upstream pipeline

As mentioned earlier, this POC was focused on the upstream pipeline. The scope of our model did not support simulating the downstream pipeline. However, we hope to conduct future POCs to explore packet performance in that pipeline. The information we did collect on the downstream pipeline is presented in Appendix A, at the end of this paper.

Cache analysis supported for LLC

In this study, our test setup did not allow us to change the L1 and L2 size to evaluate the impact of L1 and L2 on performance. Because of this, our model supported only the LLC cache size sensitivity analysis, and not an analysis of L1 or L2.

Dropping or dumping packets was not supported

Dropping packets is an error-handling method, and dumping packets is a debugging or profiling tool. Dropping and dumping packets doesn't always occur in the upstream pipeline. If it does, it can occur at a low rate during the running lifetime of that pipeline. Our test model did not support dropping or dumping packets. If we had included dropping packets and/or the debugging tools in our POC model, they could have introduced more overhead to the simulator. This could have slowed the simulation speed and skewed our results.

We suspect that dropping and dumping packets might not be critical to performance in most scenarios, but we would need to create another model to explore those impacts. That would be an additional project for the future.

Critical data paths simulated

With those three constraints in place, we modeled and simulated the most critical data paths of the upstream pipeline. This allowed us to examine the most important performance considerations of that pipeline.

Hardware analysis and data collection

Verifying the accuracy of any simulation is an important phase of any study. In order to test the accuracy of our Intel CoFluent simulations against the hardware DUT configurations, we needed to look at how VNF performance data would be collected and analyzed.

Hardware analysis tools

We used three hardware analysis tools to help with our VNF verifications: the Event Monitor (EMON) tool, the Sampling Enabling Product (SEP) tool, and the EMON Data Processor (EDP) tool. These tools were developed by Intel, and are available for download from the Intel Developer Zone.

Event Monitor (EMON) tool

EMON is a low-level command-line tool for processors and chipsets. The tool logs event counters against a timebase. For our POC, we used EMON to collect and log hardware performance counters.

You can download EMON as part of the Intel® VTune Amplifier suite. Intel VTune Amplifier is a performance analysis tool that helps users develop serial and multithreaded applications.

Sampling Enabling Product (SEP) tool

SEP is a command-line performance data collector. It performs event-based sampling (EBS) by leveraging the overflow feature of the test hardware's performance monitoring unit (PMU). The tool captures the processor's execution state each time a performance counter overflow raises an interrupt.

Using SEP allowed us to directly collect the performance data — including cache misses — of the target hardware system. SEP is part of the Intel VTune Amplifier.

EMON Data Processor (EDP) tool

EDP is an Intel-internal analysis tool that processes EMON performance samples for analysis. EDP analyzes key hardware events such as CPU utilization, core CPI, cache misses and miss latency, and so on.
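The CPI-times-path-length relationship described above reduces to straightforward arithmetic. The 900-instruction path length below is an illustrative per-packet assumption, not a measured value from the POC.

```python
# Throughput from CPI and path length, per the cost model described in the
# text. Path length is expressed here as instructions retired per packet
# for simplicity; all numbers are illustrative assumptions.
def throughput_mpps(cpi, instrs_per_packet, freq_ghz):
    cycles_per_packet = cpi * instrs_per_packet       # execution time in cycles
    packets_per_second = freq_ghz * 1e9 / cycles_per_packet
    return packets_per_second / 1e6                   # millions of packets/sec

# e.g. CPI 0.5 and 900 instructions per packet on a 1.8 GHz core
print(round(throughput_mpps(0.5, 900, 1.8), 2))       # 4.0 MPPS per core
```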

Workload performance measurements and analysis for model inputs

We used various metrics to collect data from the hardware platform. Using these metrics let us input the data more easily into our model and/or calibrate our modeling output. Such metrics included:

• Instructions per cycle (IPC)
• Core frequency
• L1, L2, and LLC cache sizes
• Cache associativity
• Memory channels
• Number of processor cores

We also collected application-level performance metrics in order to calibrate the model's projected results.

Collecting performance-based data

In general, there are two ways to collect performance-based data for hardware counters:

• Counting mode, implemented in EMON, which is part of the Intel VTune suite
• Sampling mode, implemented in SEP, which is also part of the Intel VTune suite

Counting mode reports the number of occurrences of a counter in a specific period of time. This mode lets us calculate precise bandwidth and latency information for a given configuration.

Counting mode is best for observing the precise use of system resources. However, counting mode does not report software hotspots. For example, it does not report where the code takes up the most cycles, or where it generates the most cache misses. Sampling mode is better for collecting that kind of information.

EMON outputs the raw counter information in the form of comma-separated values (CSV data). We used the EMON Data Processor (EDP) tool to import those raw counter CSV files, and convert them into higher level metrics. For our POC, we converted our data into Microsoft Excel* format, so we could interpret the data more easily.

Results and model validation

Our NFV project provided significant performance data for various hardware configurations and their correspondingly modeled simulations. It also allowed us to compare the performance of different simulation models, from real-world configurations, to worst-case configurations, to ideal configurations.

Our results show that it is possible to use a VNF model to estimate both best-case and worst-case packet performance for any given production environment.

The next several discussions explain how we established the accuracy of our simulation, and describe our key results.

Establishing a baseline for simulation accuracy

To establish a baseline of the accuracy of our VNF simulation model, we first measured packet performance on a physical Skylake-based DUT, versus our Intel CoFluent simulation model.

Figure 7 shows performance when measured under the default CPU frequency, with the default LLC cache size of 22 MB. As you can see in Figure 7, the simulation projections (our results) are very close — within 3.3% — to those made on real-world architectures.¹

Figure 7. Baseline performance measurements, with default CPU frequency and default LLC cache size of 22 MB.¹ The DUT for these measurements was based on a pre-production Intel® Xeon® processor, 1.8 GHz (Skylake).

Measuring performance at different core frequencies

Figure 8 (next page) shows the results of measuring performance at different core frequencies on both DUTs and simulations. These measurements include using Intel® Turbo Boost at maximum frequency, and adjusting the core frequency using a Linux* P-state driver.

In Figure 8, the yellow line represents the difference between measurements of the simulation as compared to measurements taken on the physical DUTs. As you can see, the Intel CoFluent simulation provides estimated measurements that are within 0.7% to 3.3% of the measurements made on the actual physical hardware.¹
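The counter post-processing step can be sketched as a small CSV reduction into derived metrics such as CPI and MPI. The column names, file layout, and sample values below are invented stand-ins; real EMON/EDP output differs.

```python
import csv, io

# Toy post-processing of raw counter samples into higher-level metrics, in
# the spirit of the EDP step described above. Column names and values are
# invented; real EMON output has a different format.
raw = io.StringIO(
    "instructions,cycles,llc_misses\n"
    "9000,4500,9\n"
    "9000,4600,10\n"
)

rows = list(csv.DictReader(raw))
instrs = sum(int(r["instructions"]) for r in rows)
cycles = sum(int(r["cycles"]) for r in rows)
misses = sum(int(r["llc_misses"]) for r in rows)

cpi = cycles / instrs    # core cycles per instruction
mpi = misses / instrs    # LLC misses per instruction
print(round(cpi, 3), round(mpi, 6))
```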

Figure 8. Performance measured at different frequencies for our simulation versus on the physical device under test (DUT).¹ The DUT for these measurements was based on a pre-production Intel® Xeon® processor, 1.8 GHz (Skylake). The yellow line shows the delta between measurements made on the physical system and the simulation.

Analyzing performance for different cache sizes

Figure 9 shows packet performance as measured for different LLC cache sizes: 2 MB, 4 MB, 8 MB, and 22 MB. In our simulation, cache size was adjusted using the Intel PQOS utility.

Our POC showed that a 2 MB LLC cache size causes a dramatically larger miss rate (32% miss rate) than a 22 MB LLC cache size (1.6% miss rate).¹ However, almost 90% of memory access hits are at L1 cache.¹ Because of this, adjusting the LLC cache size decreases performance by a maximum of only 10%.¹

In Figure 9, the yellow line again shows the difference between measurements made on the physical DUT, and measurements of the simulation. The delta remains very small, between 0.5% and 3.3%.¹

Figure 9. Performance measured for different LLC cache sizes in our simulations versus the physical DUT.¹ The DUT for these measurements was based on a pre-production Intel® Xeon® processor, 1.8 GHz (Skylake).
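The observation that a much larger LLC miss rate costs relatively little can be reproduced with a simple weighted-latency estimate. The hit rates and miss rates below follow the figures quoted in the text; the per-level latencies are illustrative assumptions, and the result is average memory latency, not throughput (the actual throughput impact is smaller still, because the blocking factor is below 1).

```python
# Average memory-access latency under two LLC miss scenarios, showing why a
# 20x change in LLC miss rate has a limited overall effect when ~90% of
# accesses hit in L1. Latencies (cycles) are illustrative assumptions.
def avg_latency(l1_hit, l2_hit, llc_miss, l1=4, l2=12, llc=40, dram=200):
    llc_reach = 1.0 - l1_hit - l2_hit          # fraction of accesses reaching LLC
    return (l1_hit * l1 + l2_hit * l2 +
            llc_reach * ((1 - llc_miss) * llc + llc_miss * dram))

small_llc = avg_latency(0.90, 0.06, 0.32)      # 2 MB LLC: 32% miss rate
large_llc = avg_latency(0.90, 0.06, 0.016)     # 22 MB LLC: 1.6% miss rate
print(round(small_llc / large_llc, 2))         # latency ratio, not perf ratio
```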

Measuring performance for different numbers of cores

Figure 10 shows the throughput results for measurements taken on DUTs with various numbers of cores. These measurements were made on the pre-production Skylake-based DUT, and compared with the results projected by our simulation. Note that in this POC, the upstream pipeline ran on a single core, even when run on processors with multiple cores.

As you can see in Figure 10, the throughput scales linearly as more cores are added. In this test, the packets were spread evenly across the cores by way of a manual test configuration. In our test, all pipeline stages were bound to one fixed core, and the impact of core-to-core movement was very small.

Again, the yellow line represents the difference between measurements of the physical system, and measurements of the simulation. The delta is still very small, between 0.4% and 3.0% for simulation predictions as compared to measurements made on the DUT.¹

Figure 10. Measuring performance on multi-core CPUs.¹ The DUT for these measurements was based on a pre-production Intel® Xeon® processor, 1.8 GHz (Skylake).

Simulating hardware acceleration components

Previously, we showed how we broke down the distribution of CPU cycles and instructions amongst different stages of pipelines. Just as we did in that analysis, we can do a similar what-if analysis to identify the best hardware accelerators for our model.

For example, in one what-if analysis, we replaced the ACL lookup with an FPGA accelerator that is 10 times as efficient as the standard software implementation in the DPDK. We found that swapping this component sped up the performance of the overall upstream traffic pipeline by over 15% (see Figure 11).¹

It's important to note that this performance result represents only one functionality of the pipeline that was simulated for FPGA. The 15.5% result we saw here does not represent the results of the full capability of the FPGA used for this workload. Still, this kind of what-if analysis can help developers more accurately estimate the cost and efficiency of adopting FPGA accelerators, or of using some other method to offload hardware functionalities.

Figure 11. Performance comparison of baseline configuration versus a simulation that includes a field-programmable gate array (FPGA)-accelerated access control list (ACL) lookup.¹ The DUT for these measurements was based on a pre-production Intel® Xeon® processor, 1.8 GHz (Skylake).
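A pipeline-level gain of roughly 15% from a 10x-faster ACL stage is consistent with Amdahl's law if the ACL stage consumes about 15% of the pipeline cycles. The stage fraction below is a back-calculated assumption for illustration, not a measured value from the POC.

```python
# Amdahl's-law what-if for offloading one pipeline stage, mirroring the
# FPGA ACL experiment described in the text. The 0.15 cycle fraction for
# the ACL stage is an assumed value chosen for illustration.
def whatif_speedup(stage_fraction, stage_speedup):
    # Overall speedup when one stage gets stage_speedup x faster
    return 1.0 / ((1.0 - stage_fraction) + stage_fraction / stage_speedup)

speedup = whatif_speedup(stage_fraction=0.15, stage_speedup=10.0)
print(f"{(speedup - 1) * 100:.1f}%")   # overall pipeline gain: 15.6%
```

The same formula lets developers bound the payoff of any proposed offload before committing to hardware.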

Best case and worst case traffic profiles

At the time of this joint NFV project, a production traffic profile was not available for analyzing bottlenecks in a production deployment. However, we did analyze both the worst-case profile and the best-case profile. In the worst-case profile, every packet is a new flow. In the best-case profile, there is only one flow. We did not set out to study these scenarios specifically, but the traffic profile we used provided information on both best- and worst-case profiles.

Our results showed that the difference in performance between worst-case and best-case profiles was only 7%.¹ That could offer developers a rough estimation of what performance could be like between best-case and worst-case packet performance for any given production environment.

It's important to understand that our results show only a rough estimation of that difference for our pipeline model and our particular type of application. The packet performance gap between your own actual best- and worst-case traffic profiles could be significantly different.

Performance sensitivity from generation to generation

Figure 12 shows the performance of the upstream pipeline on a Broadwell-based microarchitecture, as compared to a Skylake-based microarchitecture. The Intel CoFluent simulation gives us an estimated delta of less than 4% for measurements of packet throughput on simulated generations of microarchitecture, as compared to measurements on the physical DUTs.¹

Figure 12. Comparison of performance from CPU generation to generation.¹ The DUTs were an Intel® Xeon® processor E5-2630L v4, 1.8 GHz (Broadwell), and a pre-production Intel® Xeon® processor, 1.8 GHz (Skylake), each compared against its Intel® CoFluent™ Technology simulation.

Core cycles per instruction (CPI)

Our POC results tell us that several specific factors affect CPI and performance. For example, the edge routers on both Broadwell and Skylake microarchitectures have the same program path length. However, Skylake has a much lower core CPI than Broadwell (lower CPI is better). The core CPI on the Broadwell-based DUT is 0.87, while the core CPI on the Skylake DUT is only 0.50.¹

Broadwell also has only 256 KB of L2 cache, while Skylake has 2 MB of L2 cache (more cache is better). Also, when there is a cache miss in L2, the L2 on the Skylake-based DUT delivers 6x the throughput of the L2 on Broadwell.¹

Our POC measurements tell us that all of these factors contribute to the higher core CPI seen for Broadwell microarchitectures, versus the greater performance delivered by Skylake.

Maximum capacity at the LLC level

One of the ways we used our simulations was to understand performance when assuming maximum capacity at the LLC level. This analysis assumed an infinite-sized LLC, with no LLC misses.
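With equal program path lengths, the CPI figures above translate directly into core cycles per packet for each generation. The CPI values are those quoted in the text; the 900-instruction path length is an illustrative assumption.

```python
# Core cycles per packet for two CPU generations, given the same program
# path length. CPI values (0.87 Broadwell, 0.50 Skylake) are from the POC;
# the 900-instruction path length is an illustrative assumption.
PATH_LENGTH = 900   # x86 instructions retired per packet (assumed)

cycles_per_packet = {
    "Broadwell": round(0.87 * PATH_LENGTH),   # 783 cycles
    "Skylake": round(0.50 * PATH_LENGTH),     # 450 cycles
}
print(cycles_per_packet)
```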


Our analysis shows that packet throughput can achieve a theoretical maximum of 3.98 MPPS (million packets per second) per core on Skylake-based microarchitectures.¹

Fused µOPs

Along with traditional micro-op fusion, Haswell supports macro fusion. In macro fusion, specific types of x86 instructions are combined in the pre-decode phase, and then sent through a single decoder. They are then translated into a single micro-op.

Ideal world versus reality

In an ideal world, we would eliminate all pipeline stalls in the core pipeline, eliminate all branch mispredictions, eliminate all translation lookaside buffer (TLB) misses, and assume that all memory accesses hit at the L1 data cache. This would allow us to achieve the optimal core CPI.

For example, a Haswell architecture can commit up to 4 fused µOPs each cycle per core. Therefore, the optimal CPI for the 4-wide microarchitecture pipeline is theoretically 0.25 CPI. For Skylake, the processor's additional microarchitecture features can lower the ideal CPI even further.

If we managed to meet optimal conditions for Haswell, when CPI reaches 0.25, we could double the packet performance seen today, which would then be about 8 MPPS (7.96 MPPS) per core. (Actual CPI is based on the application, of course, and on how well the application is optimized for its workload.)

Performance sensitivities based on traffic profile

Our POC shows that the packet processing performance for the upstream pipeline on the edge router will change depending on the traffic profile you use.

Performance scaled linearly with the number of cores

Using the traffic profile we chose, we found that we could sustain performance at about 4 MPPS per core on a Skylake architecture.¹ We tested this performance on systems with 1, 2, and 6 cores, and found that performance scaled linearly with the number of cores (see Figure 10, earlier in this paper).¹

Note that, when adding more cores, each core can use less LLC cache, which may cause a higher LLC cache miss rate. Also, as mentioned earlier in this paper, under the heading "Impact of cache on pipeline performance," adjusting the LLC cache size could impact performance. So adding more cores could actually increase fetch latency and cause core performance to drop.

Our results include the performance analysis for different cache sizes, from 2 MB to 22 MB. We did not obtain results for cache sizes smaller than 2 MB.

Execution flow was steady

We also discovered that our test vPE application delivered a steady execution flow. That meant we had a predictable number of instructions per packet. We can take that further to mean that the higher the core clock frequency, the more throughput we could achieve.

For our POC conditions (traffic profile, 1.8 GHz to 2.1 GHz processors, workload type), we found that performance of the vPE scales linearly as frequency increases.¹

Next steps

As we continue to model packet performance, it's unavoidable that we will have to deal with hardware concurrency and the interactions typically seen with various core and non-core components. The complexity of developing and verifying such systems will require significant resources. However, we believe we could gain significant insights from additional POCs.

We suggest that next steps include:

• Using our Intel CoFluent model to identify component characteristics that have the greatest impact on the performance of networking workloads. This could help developers choose the best components for cluster designs that are focused on particular types of workloads.
• Modeling and improving packet-traffic profiles to support a multi-core paradigm. This paradigm would allow scheduling of different parts of the workload pipeline onto different cores.
• Modeling and improving traffic profiles to study the impact of the number of new flows per second, load balancing across cores, and other performance metrics.
• Modeling and simulating the downstream pipeline.

Summary

Most of today's data is transmitted over packet-switched networks, and the amount of data being transmitted is growing dramatically. This growth, along with a complex set of conditions, creates an enormous performance challenge for developers who work with packet traffic.

To identify ways to resolve this challenge, Intel and AT&T collaborated to perform a detailed POC on several packet-processing configurations. For this project, our joint team used a simulation tool (Intel CoFluent) on a packet-processing workload based on the DPDK library, run on x86 architectures. The simulation tool demonstrated results (projections) with an accuracy of 96% to 97% when compared to the measurements made on physical hardware configurations.1

Our results provide insight into the kinds of changes that can have an impact on packet traffic throughput. We were also able to identify the details of some of those impacts, including how significant the changes were, based on different hardware characteristics. Finally, our POC analyzed component and processor changes that could provide significant performance gains.

Key findings

Here are some of our key findings:

CPU frequency

CPU frequency has a high impact on packet performance; the scaling is nearly linear.1 Even if the underlying architecture has different core frequencies, the execution efficiency (reflected in core CPI) for these cores is almost the same.

Performance

Performance scales linearly with the number of cores.1 Our results show this is due to the small impact of LLC, since there is such a low miss rate in L1 and L2. Basically, when a memory access misses in L1 or L2, the system will search the LLC. Since L1 and L2 are dedicated to each core, and since the LLC is shared by all cores, the more cores there are, the higher the potential rate of LLC misses.

Packet size

Packet size does not have a significant performance impact in terms of packets per second (PPS).1 For example, consider the edge router, which is a packet-forwarding application. With an edge router, only the packet header (the MAC/IP/TCP header) is processed for classifying flow, for determining quality of service (QoS), or for making routing decisions. The edge router doesn't touch the packet payload, so increasing the packet size (payload size) will not consume extra CPU cycles. Contrast this with BPS (bytes per second), which scales with both PPS and packet size. This BPS-to-PPS scaling will continue until the bandwidth limit is reached, either at the Ethernet controller or at the system interconnect.

LLC cache

The size and performance of the LLC cache had little influence on our DPDK packet-processing workload. This is because, in our POC, most memory accesses hit in L1 and L2, not in the LLC; and the miss rate in L1 and L2 is low. Note that the size of the LLC cache can cause higher miss rates on other types of VNF workloads. However, on the VNF workload we simulated, the effect was small because of the high L1 and L2 hit rates. We did not include studies in our POC of the effects caused by other VNF workloads running on different cores on the same socket and affecting the LLC in different ways. That was not in the scope of our POC, and would be a future project.

Conclusion

Our detailed POC tells us that, when selecting a hardware component for an edge router workload, developers should consider prioritizing core number and frequency.

In terms of scaling for future products, our model was able to project potential performance gains very effectively. For example, our simulation model showed a detailed distribution of CPU cycles and instructions for each workload stage. Developers can use this type of information to better estimate the performance gains when FPGA hardware accelerators or other ASIC off-loading methods are applied.

Our POC also demonstrated that our Intel CoFluent model is highly accurate in simulating vPE workloads on x86 systems. The average correlation between our simulations and the known-good physical architectures is within 96%.1 This correlation holds true even across the scaling of different hardware configurations.

The accuracy of these Intel CoFluent simulations can help developers prove the value of modeling and simulating their designs. With faster, more accurate simulations, developers can improve their choices of components used in new designs, reduce risk, and speed up development cycles. This can then help them reduce time to market for both new and updated products, and help reduce overall development costs.
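The scaling relationships in these key findings can be captured in a first-order model. The sketch below is only illustrative: the cycles-per-packet cost and NIC bandwidth limit are hypothetical placeholders, not measured POC values. It shows why PPS depends on core count and frequency but not packet size, while BPS grows with packet size until the Ethernet controller's bandwidth limit.

```python
# Hypothetical first-order throughput model; constants are illustrative
# placeholders, not measured values from this POC.

NIC_BANDWIDTH_GBPS = 40.0    # assumed limit, e.g., a 4x10G Ethernet controller
CYCLES_PER_PACKET = 700.0    # assumed per-core cost to process one packet header

def pps_millions(cores: int, freq_ghz: float) -> float:
    """PPS scales linearly with core count and (almost) with frequency.
    Packet size does not appear: only headers are touched."""
    return cores * freq_ghz * 1e9 / CYCLES_PER_PACKET / 1e6

def bps_gbps(cores: int, freq_ghz: float, packet_bytes: int) -> float:
    """BPS scales with PPS and packet size, until the NIC bandwidth limit."""
    raw = pps_millions(cores, freq_ghz) * 1e6 * packet_bytes * 8 / 1e9
    return min(raw, NIC_BANDWIDTH_GBPS)

# Doubling cores doubles PPS; growing the packet raises BPS, not PPS.
assert pps_millions(8, 2.0) == 2 * pps_millions(4, 2.0)
```

Under this model, a large configuration saturates the NIC long before the cores do, which matches the observation that the limit is reached at the Ethernet controller or system interconnect rather than in packet-header processing.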

Appendix A. Performance in the downstream pipeline

Our joint team characterized the downstream pipeline on two of our devices under test (DUTs). A full study of the downstream pipeline was not in the scope of our proof of concept (POC) study. However, this appendix provides some preliminary results that we observed while studying the upstream pipeline.

Note that the downstream pipeline uses different software, with different functionality and different stages than the upstream pipeline. Full verification of results seen from the downstream pipeline will have to be a future project.

Table 3, earlier in this paper, describes the hardware DUT configuration for the Haswell-based microarchitecture used to determine throughput scaling of the downstream pipeline. Table A-1 describes the hardware DUT configuration of a fourth, production version of the Skylake-based microarchitecture on which we also obtained downstream pipeline results. This fourth production-version processor was the Intel® Xeon® processor Gold 6152, 2.1 GHz.

Table A-1. Test configuration based on the Intel® Xeon® processor Gold 6152, 2.1 GHz

Component   Description          Details
Processor   Product              Intel® Xeon® Gold processor 6152, 2.1 GHz
            Speed (MHz)          2095
            Number of CPUs       44 cores / 88 threads
            LLC cache            30976 KB
Memory      Capacity             256 GB
            Type                 DDR4
            Rank                 2
            Speed (MHz)          2666
            Channel/socket       6
            Per DIMM size        16 GB
NIC         Ethernet controller  4x10G
            Driver               igb_uio
OS          Distribution         Ubuntu 16.04.2 LTS
BIOS        Hyper-threading      Off

In this POC, we measured throughput on a single core for all stages of the downstream pipeline. Figure A-1 shows throughput scaling as a function of frequency for the downstream pipeline. These measurements were made on the Intel® Xeon® processor E5-2680, 2.5 GHz DUT (Haswell-based architecture). Figure A-2 shows throughput measured at 2000 MHz on two architectures: the Haswell-based architecture, and the production version of the Skylake-based architecture.

Figure A-1. Throughput scaling, as tested on the Intel® Xeon® processor E5-2680, 2.5 GHz (Haswell) device under test. Throughput scaling is measured in million packets per second (MPPS) as a function of CPU speed.

Figure A-2. Throughput scaling at 2000 MHz on different architectures.1 Throughput scaling is measured in million packets per second (MPPS) as a function of CPU speed.

As mentioned earlier, our POC focused on the upstream pipeline. A more detailed analysis of the downstream pipeline will be the subject of future work.
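Because core CPI stayed almost constant across frequencies for this workload, the frequency scaling in Figure A-1 can be approximated by a simple linear projection from one measured point. The sketch below is a hypothetical illustration of that reasoning; the input numbers are invented, not figures from the POC.

```python
# Illustrative single-core frequency-scaling projection. Assumes cycles per
# packet (core CPI x instructions per packet) is constant across frequencies,
# as observed for this workload. All input numbers are hypothetical.

def project_mpps(measured_mpps: float, measured_mhz: float,
                 target_mhz: float) -> float:
    """Linear throughput-vs-frequency projection (constant cycles/packet)."""
    return measured_mpps * target_mhz / measured_mhz

# E.g., a core measured at 10.0 MPPS at 2000 MHz projects to 12.5 MPPS
# at 2500 MHz under the constant-CPI assumption.
print(project_mpps(10.0, 2000.0, 2500.0))  # -> 12.5
```

A projection like this holds only while the workload stays core-bound; once memory latency or I/O bandwidth dominates, throughput stops tracking frequency linearly.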


Appendix B. Acronyms and terminology

This appendix defines and/or explains terms and acronyms used in this paper.

ACL: Access control list.
ASICs: Application-specific integrated circuits.
BF: Blocking probability, or "blocking factor."
BPS: Bytes per second.
CPI: Cycles per instruction.
CPIcore: CPI assuming infinite LLC (no off-chip accesses).
CPS: Clock cycles per second.
CSV: Comma-separated values.
DPDK: Data plane development kit.
DRAM: Dynamic random-access memory.
DST: Destination.
DUT: Device under test.
EBS: Event-based sampling.
EDP: EMON Data Processor tool. EDP is an Intel-developed tool used for hardware analysis.
EMON: Event monitor. EMON is an Intel-developed, low-level command-line tool for analyzing processors and chipsets.
FIFO: First in, first out.
FPGA: Field-programmable gate array.
I/O: Input / output.
IP: Internet protocol.
IPC: Instructions per cycle.
IPv4: Internet protocol version 4.
L1: Level 1 cache.
L2: Level 2 cache.
L3: Level 3 cache. Also called last-level cache.
LLC: Last-level cache. Also called level 3 cache.
LPM: Longest prefix match.
MAC: Media access control.
ML: Miss latency or memory latency, as measured in core clock cycles.
MPI: Message-passing interface. Also misses per instruction (with regards to LLC).
MPPS: Million packets per second.
NFV: Network function virtualization.
NIC: Network interface card.
NPU: Network-processing unit.
packetgen: Packet generator.
PCI-e: Peripheral Component Interconnect Express.
PHY: Physical layer.
POC: Proof of concept.
PQOS: Intel® Platform Quality of Service Technology utility.
PPS: Packets per second.
QoS: Quality of service.
RX: Receive, receiving.
SEP: Sampling Enabling Product, an Intel-developed tool used for hardware analysis.
SRC: Source.
TCP: Transmission control protocol.
TLB: Translation lookaside buffer.
TX: Transmit, transmitting.
VNF: Virtualized network function.
vPE: Virtual provider edge (router).


Appendix C. Authors

AT&T authors

Kartik Pandit, AT&T, [email protected]
Vishwa M. Prasad, AT&T, [email protected]

Intel authors

Bianny Bian, Intel
Atul Kwatra, Intel, [email protected]
Patrick Lu, Intel
Mike Riess, Intel, [email protected]
Wayne Willey, Intel, [email protected]
Huawei Xie, Intel, [email protected]
Gen Xu, Intel

For information about Intel CoFluent technology, visit intel.cofluent.com. To download some of the test tools used in this POC, visit the Intel Developer Zone.


1 Results are based on Intel benchmarking and are provided for information purposes only.

Tests document performance of components on a particular test, in specific systems. Results have been estimated or simulated using internal Intel analyses or architecture simulation or modeling, and are provided for informational purposes only. Any differences in system hardware, software, or configuration may affect actual performance.

Performance results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Information in this document is provided as-is. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel assumes no liability whatsoever, and Intel disclaims all express or implied warranty relating to this information, including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, the Intel logo, Xeon, and CoFluent are trademarks of Intel Corporation in the U.S. and/or other countries.

AT&T and the AT&T logo are trademarks of AT&T Inc. in the U.S. and/or other countries.

Copyright © 2018 Intel Corporation. All rights reserved.

Copyright © 2018 AT&T Intellectual Property. All rights reserved.

*Other names and brands may be claimed as the property of others.

Printed in USA