Solution Brief
Packet Processing on Intel® Architecture

Impressive Packet Processing Performance Enables Greater Workload Consolidation
Single Intel® Xeon® processor platform achieves over 80 Mpps throughput1

With Intel® processors, it’s possible to transition from using discrete architectures per major workload (application, control, packet and signal processing) to a single architecture that consolidates the workloads into a more scalable and simplified solution. This software-based approach, depicted in Figure 1, yields further benefits with Intel’s 4:1 workload consolidation strategy. The hardware platform – based on general-purpose server technology – has been optimized using the best practices of the communications industry to ensure system robustness, with core, memory and I/O scalability and performance improvements that meet operators’ low- to high-end system requirements.

This framework is made possible by the high performance level of Intel processors plus the Intel® Data Plane Development Kit (Intel® DPDK), which greatly boosts packet processing performance and throughput, allowing more time for data plane applications. As a result, telecom and network equipment manufacturers (TEMs/NEMs) can take advantage of lower development costs, fewer tools and support teams, and faster time-to-market with a solution benefiting from the economies of scale of the server industry. Such a notable level of consolidation is achievable through industry-leading performance, I/O throughput and performance per watt density – all on a standards-based platform. Solution providers listed in this paper are ready to help equipment architects jumpstart their next designs.

Figure 1. Intel’s 4:1 Workload Consolidation Strategy – application, control, packet and signal processing workloads, previously handled by discrete devices (NPUs/ASICs for packet processing, DSPs for signal processing), consolidate onto a single Intel® architecture platform across 2011–2013 (Intel® Xeon® processor C3500/C5500 series, Intel® Xeon® processor E5-2600 series, next generation processor), supported by the Intel® QuickAssist Technology Software Library, the Intel® Data Plane Development Kit and the Intel® Signal Processing Development Kit.

This approach also reinforces the synergies resulting from the convergence of telecom and data networks by blending the best of communications and computing competencies. What do both industries bring to the party in terms of technology and capability?

The communications industry excels in hardware density, high speed pipes, low latency transmission, robustness, exceptional quality of service (QoS), determinism and real-time I/O switching. On the computing side, there are standards-based technologies that enable dynamic resource sharing, workload migration, security, open APIs, developer communities, virtualization and power management.

What’s Changed?

The Intel® Xeon® processor E5-2600 series, with an integrated DDR3 memory controller and an integrated PCI Express* controller, was a game-changer for delivering the small packet throughput required for next generation networks. This resulted in lower memory latency, decreasing from around 120 nanoseconds (ns) to about 70 ns, on par with the arrival rate of 64 byte packets. Still, there’s far more to do than just store packets to memory, and that’s where the Intel DPDK makes a tremendous difference.

The development kit removes a significant amount of the overhead incurred when using an out-of-the-box, standard Linux* stack. Significant time is saved by using core affinity, disabling interrupts generated by packet I/O, using cache alignment, implementing huge pages to reduce translation lookaside buffer (TLB) misses, prefetching, new instructions and many other concepts. The Intel DPDK also runs in user space, thus removing the high overhead associated with kernel operations and with copying data between kernel and user memory space.

Taking advantage of the breakthroughs enabled by the Intel DPDK, today’s Intel® architecture-based platforms based on a single Intel Xeon processor E5-2600 series are achieving over 80 million packets per second (Mpps) of L3 forwarding throughput for 64 byte packets (Figure 2).1
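The “on par with the arrival rate of 64 byte packets” comparison above follows from standard 10 Gigabit Ethernet line-rate arithmetic: a minimum-size frame occupies 64 bytes of data plus 20 bytes of preamble and inter-frame gap on the wire, so

$$ t_{\text{arrival}} = \frac{(64 + 20)\,\text{bytes} \times 8\,\text{bits/byte}}{10\ \text{Gb/s}} = 67.2\ \text{ns}, \qquad \frac{1}{67.2\ \text{ns}} \approx 14.88\ \text{Mpps per 10 GbE port}, $$

which is why cutting memory latency to roughly 70 ns matters so much for small-packet workloads.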

Figure 2. Breakthrough Data Plane Performance with Intel® Data Plane Development Kit (Intel® DPDK) – 64 byte L3 forwarding throughput:
• Native Linux* stack, Intel® Xeon® processor E5645, 2 sockets (6 x 2.4 GHz cores): 12.2 Mpps†
• Intel® DPDK in Linux* user space, Intel® Xeon® processor E5645, 1 socket (6 x 2.4 GHz cores): 35.2 Mpps†
• Intel® DPDK in Linux* user space, next generation Intel® processor, 1 socket (8 x 2.0 GHz cores): 80 Mpps†
† Measured performance. I/O performance may vary with platform configuration.
†† Estimates are based on internal analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

Higher Packet Processing Performance

Intel architecture has a proven track record of delivering high performance and innovation over its long history of meeting the requirements of various embedded communication markets. By integrating a memory controller and increasing memory bandwidth, the first generation Intel® Core™ i7 processor family achieved breakthrough performance when executing control and data plane workloads concurrently. This was possible, in large part, due to the Intel DPDK, which greatly improved packet processing on Intel architecture.

The Intel DPDK, consisting of a set of libraries designed for high speed data packet networking, is based on simple embedded system concepts and allows users to build outstanding small packet (64 byte) high performance applications. It offers a simple software programming model that scales from Intel® Atom™ processors to the latest Intel® Xeon® processors, providing flexible system configurations to meet any customer requirements for performance and scalable I/O. While the Intel DPDK may be used in various environments, the Linux SMP user space environment is the most common among engineers due to the ease of development and testing of customer applications, along with leading edge data packet performance that easily outperforms native Linux, as shown in Figure 2. L3 forwarding for the 2.40 GHz Intel® Xeon® processor E5645 on a native Linux stack is about 1 Mpps per core, compared to 6 Mpps (64 byte packets) per core using the Intel DPDK.

After testing how many 10 gigabit pipes could be fully utilized using the Intel® Data Plane Development Kit (Intel® DPDK), John Basco, Distinguished Engineer at Verisign, couldn’t believe his first performance measurements. So, he went to some of his trusted colleagues and said, “You’re going to think I’m crazy, but I’m getting these performance results. There must be a decimal point in the wrong place.”

“They did their own evaluations, and it turned out the numbers were right. The Intel DPDK is really what I would consider a disruptive technology.”
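To make the user space programming model above concrete, here is a minimal sketch, assuming current DPDK API names (which have changed across releases), of how an Intel DPDK application typically starts: the Environment Abstraction Layer is initialized from the command-line options, and a hypothetical lcore_main worker is pinned to and launched on each logical core.

```c
#include <stdio.h>

#include <rte_eal.h>
#include <rte_launch.h>
#include <rte_lcore.h>

/* Hypothetical per-core worker: in a real application this loop would
 * poll the NIC queues assigned to this lcore and process packets. */
static int
lcore_main(void *arg)
{
    (void)arg;
    printf("worker pinned to lcore %u\n", rte_lcore_id());
    return 0;
}

int
main(int argc, char **argv)
{
    /* rte_eal_init() consumes the EAL options (core list, number of
     * memory channels, huge-page settings, ...), pins one thread per
     * logical core and maps huge-page backed memory - the core-affinity
     * and TLB optimizations described in the text. */
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    /* Launch the worker on every logical core, including the main core.
     * (Older kit releases spell this enumerator CALL_MASTER.) */
    rte_eal_mp_remote_launch(lcore_main, NULL, CALL_MAIN);
    rte_eal_mp_wait_lcore();

    return 0;
}
```

Run with an EAL core list such as ./app -l 0-3 -n 4 and one worker thread is pinned to each of cores 0 through 3, with packet memory taken from huge pages.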

The chart plots IPv4 L3 forwarding throughput (0 to 180 Mpps) for successive platforms from 2006 through 2012: Low Voltage Intel® Xeon® processors (2 sockets, 2 x 2.00 GHz cores); Intel® Xeon® processor 5300 series (2 sockets, 4 x 2.33 GHz cores); Intel® Xeon® processor 5400 series (2 sockets, 4 x 2.33 GHz cores); Intel® Xeon® processor 5540 (2 sockets, 4 x 2.53 GHz cores); Intel® Xeon® processor E5645 (2 sockets, 6 x 2.40 GHz cores); next generation Intel® processor (1 socket, 8 x 2.00 GHz cores); next generation Intel® processor (2 sockets, 8 x 2.00 GHz cores). Annotations mark the introduction of the integrated memory controller and Intel® Data Plane Development Kit (Intel® DPDK), and the introduction of the integrated PCI Express* controller.

Figure 3. IPv4 Layer 3 Forwarding Performance for Various Generations of Intel Architecture Processor-based Platforms

With the addition of integrated PCI Express controllers, more cores and architecture enhancements, the 2nd Generation Intel® Core™ Processor Family provides even greater scalability and flexibility to embedded communication market segments. Figures 2 and 3 demonstrate the impressive small packet performance achievable with the Intel DPDK on the latest Intel architecture. See the Appendix for platform information.

It should be noted that in the case of IPv4 Layer 3 forwarding, the system throughput is I/O limited, constrained by the current PCI Express Generation 2 network interface cards (NICs); in other words, the processor’s computing capacity was not fully utilized. Figure 4 shows the packet performance for various packet sizes and port usage cases for a single eight core 2.0 GHz Intel Xeon processor E5-2600 series with 20MB L3 cache and with Intel® Hyper-Threading Technology (Intel® HT Technology) disabled.

Figure 4. Eight Core Intel® Xeon® Processor E5-2600, L3 Forwarding Performance – L3 forwarding throughput in millions of packets per second (0 to 90) versus packet size (64, 128, 256, 512, 768, 1024, 1280 and 1518 bytes) for 2 Ports (1C/2T), 4 Ports (2C/4T), 8 Ports (4C/8T) and 8 Ports (8C/8T) configurations.

This impressive performance makes Intel architecture well suited to support a wide range of software-based networking components that perform firewall, VPN and IPS functions, just to name a few services. Such a platform is highly adaptable and capable of running nearly any workload, allowing TEMs/NEMs to easily incorporate new features, services and applications. Furthermore, developers can differentiate their product offerings by leveraging complementary technologies from the server industry, like virtualization, power management and security. It is now possible to consolidate packet processing and network services onto a single processor architecture, which saves hardware and operating cost, eases application development and improves agility.

“Telecom equipment vendors are trying to do more with less R&D spending, and it’s important for them to move towards a common platform that runs multiple applications. The dramatic packet processing boost from the Intel DPDK enables our customers to converge on a single processor architecture, which helps them speed up their development cycles across all our platforms,” said Paul Stevens, Telecom Sector Marketing Director at Advantech.

Development Kit Overview

The Intel DPDK provides Intel architecture-optimized, non-GPL source code libraries that allow developers to focus on their application, supporting exceptional data plane performance while easing software development and minimizing development time. Developers can make additions and modifications to the Intel DPDK as required to meet their individual system needs.

Designed to accelerate packet processing performance, the Intel DPDK contains a growing number of libraries (Figure 5), whose source code is available for developers to use and/or modify in a production network element, with examples showing usage cases and performance. Developers can build applications with the libraries using “run-to-completion” and/or “pipeline” models that enable the equipment manufacturer’s application to maintain complete control.

Figure 5. Major Development Kit Components – customer applications run in Linux* user space on top of the Intel® Data Plane Development Kit (Intel® DPDK) libraries (Buffer Management, Queue/Ring Functions, Flow Classification Library, NIC Poll Mode Library and the Environment Abstraction Layer), which sit above the Linux* kernel and the platform hardware.

The following provides a brief description of key software components:

• The Environment Abstraction Layer (EAL) provides access to low-level resources (hardware, memory space, logical cores, etc.) through a generic interface that hides the environment specifics from the applications and libraries.

• The Memory Pool Manager allocates NUMA-aware pools of objects in memory. The pools are created in huge-page memory space to increase performance by reducing translation lookaside buffer (TLB) misses, and a ring is used to store free objects. It also provides an alignment helper to ensure objects are distributed evenly across all DRAM channels, thus balancing memory bandwidth utilization across the channels.

• The Buffer Manager significantly reduces the amount of time the system spends allocating and de-allocating buffers. The Intel DPDK pre-allocates fixed size buffers, which are stored in memory pools for fast, efficient, cache-aligned allocation and de-allocation from NUMA-aware memory pools. Each core is provided a dedicated buffer cache to the memory pools, which is replenished as required. This provides a fast and efficient method for quick access and release of buffers without locks.

• The Queue Manager implements safe lockless queues, instead of using spinlocks, that allow different software components to process packets while avoiding unnecessary wait times.

• The Ring Manager provides a lockless implementation for single or multi producer/consumer enqueue/dequeue operations, supporting bulk operations to reduce overhead for efficient passing of events, data and packet buffers (see the sketch following this list).

• Flow Classification provides an efficient mechanism for generating a hash (based on tuple information) used to combine packets into flows, which enables faster processing and greater throughput.

• Poll Mode Drivers for 1 GbE and 10 GbE Ethernet controllers greatly speed up the packet pipeline by receiving and transmitting packets without the use of asynchronous, interrupt-based signaling mechanisms, which have a lot of overhead.

The Intel DPDK also includes examples to demonstrate use cases and contains many other libraries, such as the multi-process library, timers, logs, atomic functions, debugging and cryptography. Intel plans to continually add more functions and make enhancements to further increase performance. One noteworthy technique, called software pre-fetching, increases performance by bringing data from memory into the cache before it’s needed, thereby significantly reducing memory latency. It’s used in many places, including the classification library for n-tuple exact match lookups and the poll mode drivers, enabling them to speed up classification by preloading the packet header, typically the first 64 bytes, into cache.

“High-touch packet processing is a critical piece for any telco equipment manufacturer trying to win business in today’s highly competitive market. There’s a lot of complexity in this software layer, and traditionally our customers did all that work themselves. Having the Intel® Data Plane Development Kit (Intel® DPDK) means customers don’t have to reinvent the wheel, and they can focus on their differentiated value add,” said Robert Pettigrew, Director of Marketing – Embedded Computing Business at Emerson Network Power.

Intel DPDK Fundamentals
• Implements a run to completion model
• Accesses all devices by polling, without a scheduler
• Runs in 32-bit and 64-bit mode, with or without NUMA
• Scales from Intel Atom processors to Intel Xeon processors
• Supports an unlimited number of processors and processor cores
• Optimizes packet allocation across DRAM channels
• Allocates memory from the local node where possible
• Ensures all data structures and objects are cache-aligned for greater performance
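To make the memory pool, buffer and ring managers above more concrete, the following is a minimal sketch, not taken from the kit itself: it creates a NUMA-aware packet buffer pool in huge-page memory and a lockless single-producer/single-consumer ring, then passes one buffer through without taking a lock. The function name, pool and ring names and the sizes are arbitrary, and the calls shown (e.g. rte_pktmbuf_pool_create) follow current DPDK conventions, which may be spelled differently in the release this paper describes.

```c
#include <rte_eal.h>
#include <rte_errno.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_ring.h>

#define NUM_MBUFS  8191  /* pool size; (2^k - 1) is the usual recommendation */
#define MBUF_CACHE 250   /* per-core buffer cache, refilled from the pool as needed */

/* Call after rte_eal_init(). Creates the pool and ring, then moves one
 * buffer through the ring to show the lock-free fast path. */
static int
setup_pools_and_ring(void)
{
    /* NUMA-aware pool of fixed-size packet buffers backed by huge pages;
     * the per-core cache gives lock-free allocation on the fast path. */
    struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create(
        "MBUF_POOL", NUM_MBUFS, MBUF_CACHE, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (mbuf_pool == NULL)
        return -rte_errno;

    /* Lockless single-producer/single-consumer ring, e.g. to hand packets
     * from an RX core to a worker core in a pipeline model. */
    struct rte_ring *pipe = rte_ring_create(
        "RX_TO_WORKER", 1024, rte_socket_id(),
        RING_F_SP_ENQ | RING_F_SC_DEQ);
    if (pipe == NULL)
        return -rte_errno;

    /* Allocate a buffer, pass it through the ring, and release it back to
     * the pool - no locks are taken anywhere on this path. */
    struct rte_mbuf *m = rte_pktmbuf_alloc(mbuf_pool);
    if (m != NULL) {
        rte_ring_enqueue(pipe, m);

        void *obj = NULL;
        if (rte_ring_dequeue(pipe, &obj) == 0)
            rte_pktmbuf_free(obj);
    }
    return 0;
}
```

In a run-to-completion design the ring is unnecessary; it becomes useful when a pipeline model hands packets from an I/O core to worker cores, as described later in this paper.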

The Intel DPDK fast path application is created as a single image which, when loaded, allocates resources, initializes setup, and assigns core affinity and the starting points of the execution code to each logical core allocated (handled by pthread affinity in the Linux model), as depicted in Figure 6. The Intel DPDK image can access multiple physical cores, as well as multiple logical cores when Intel HT Technology is enabled. Each core in the image contains a dispatch loop that processes packets as they arrive (a sketch of such a loop follows the filter list below). There are a number of models that may be used, based on the user’s requirements. In many cases, a run-to-completion model is used; however, depending on the number of cycles needed to process a packet, a pipeline model may also be used by passing packets via a ring to another core for additional processing.

Figure 6. Linux* Model – applications, including the Intel® DPDK-based data plane application, run in Linux* user space; cores C1–C4 on Processor 0 poll the NICs directly through the Intel® DPDK, alongside the Linux* kernel.

At least one run-to-completion core will be used for packet I/O, and there are various ways to distribute packets amongst the cores for processing ingress traffic. The simplest of these is for a core to handle as much ingress traffic as possible and load balance the remaining packets to other cores to perform the work required. Many NICs also have mechanisms to direct traffic to cores based on filters or hashes. One such mechanism is receive side scaling (RSS), which assigns packets to different descriptor queues in order to efficiently distribute packet processing between several processor cores, thus increasing performance on a multi-processor system. RSS assigns an RSS index, resulting from a hash of the IP data, to each received packet. Packets are then routed to one of a set of Rx queues based on their RSS index; all packets from the same flow are sent to the same core per the hash. Via the poll mode drivers, the core reads packets from the NIC’s queue and performs the required processing on the packets. For example, packet flows from a 10GbE interface are distributed and processed by multiple cores, each performing the same run-to-completion actions.

Besides RSS, some NICs have flow filters, which may discard packets or forward packets to specific receive queues handled by cores for processing. In the case of the Intel® 82599EB 10 Gigabit Ethernet Controller, the filters in the flow director identify flows or sets of flows and route them to specific queues, which are processed by cores. The flow director supports two types of filtering modes:

• Perfect match filters – the hardware checks for a match between the masked fields of the received packets and the programmed filters. The Intel® 82599 10 Gigabit Ethernet Controller supports up to 8 K perfect match filters.

• Signature filters – the hardware checks for a match against a hash-based signature of the masked fields of the received packet. Up to 32 K signature filters are supported.
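As a rough illustration of that dispatch loop, the sketch below shows a single run-to-completion core polling its receive queue with the poll mode driver burst API, prefetching packet headers and transmitting the processed burst. Port, queue and mempool setup are omitted; forward_burst is a hypothetical placeholder for the application’s per-packet work, and the API names follow current DPDK conventions, which may differ slightly from the kit release described in this paper.

```c
#include <stdint.h>

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define BURST_SIZE 32

/* Hypothetical application hook: a real L3 forwarder would do the route
 * lookup and header rewrite here; left as a no-op for the sketch. */
static void
forward_burst(struct rte_mbuf **pkts, uint16_t n)
{
    (void)pkts;
    (void)n;
}

/* Run-to-completion dispatch loop for one lcore: poll an RX queue,
 * process the burst, transmit on the paired TX queue. No interrupts,
 * no scheduler - the core spins on the queue it owns. */
static void
dispatch_loop(uint16_t port, uint16_t queue)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll-mode receive: returns immediately with 0..BURST_SIZE packets. */
        uint16_t nb_rx = rte_eth_rx_burst(port, queue, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        /* Software prefetch of each packet's first cache line (the headers)
         * hides memory latency before the lookup touches the data. */
        for (uint16_t i = 0; i < nb_rx; i++)
            rte_prefetch0(rte_pktmbuf_mtod(bufs[i], void *));

        forward_burst(bufs, nb_rx);

        /* Transmit the burst; free anything the TX ring could not accept. */
        uint16_t nb_tx = rte_eth_tx_burst(port, queue, bufs, nb_rx);
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}
```

When more cycles per packet are needed, the same loop can enqueue the burst onto a ring for another core instead of transmitting directly, which is the pipeline model described above.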

In a typical networking scenario, packets are received and classified, and subsequent action may follow. In the case of security applications, packet flows are recognized, actions are taken based on the flow, and further processing may ensue based on the policies enforced by the application. Therefore, it is essential to process the traffic as quickly as possible to minimize the overall time needed to deliver value-added services. Visit www.intel.com/go/DPDK to find out more.

Migrating to a Single Architecture

Delivering the flexibility, scalability and capacity needed for next generation telecom and data networks, Intel architecture-based platforms offer a highly adaptable alternative to fixed-function hardware. They also provide greater workload consolidation that can lower design challenges around cost, software development, power consumption, time to market and team communication. Platforms will become even more powerful, following through on Moore’s Law, further accommodating workload consolidation onto one platform, which lowers architectural complexity, reduces BOM cost and speeds up product development.

Appendix: Using the L3 forwarding example in the Intel® Data Plane Development Kit (Intel® DPDK) for 64 byte packets, a server with
• dual Intel® Xeon® processors E5540 (2.53 GHz, 4 core) processed 42 Mpps.
• dual Intel® Xeon® processors E5645 (2.4 GHz, 6 core) processed 55 Mpps.
• a single Intel® Xeon® processor E5-2600 (2.0 GHz, 8 core) processed 80 Mpps (with Intel® Hyper-Threading Technology (Intel® HT Technology) disabled).
• dual Intel® Xeon® processors E5-2600 (2.0 GHz, 8 core) processed 160 Mpps (with Intel HT Technology disabled) and 4x 10GbE dual port PCI Express* Gen. 2 NICs on each processor.

1 Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations

Copyright © 2012 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Core, Intel Atom and Xeon are trademarks of Intel Corporation in the United States and/or other countries.

*Other names and brands may be claimed as the property of others.

Printed in USA MS/VC/0412 Order No. 327263-001US
