Modern Packet Switch Design

White Paper Northforge Innovations Inc.

September 2016

Introduction

This white paper surveys the architectural evolution of Ethernet packet switches, featuring an overview of early packet switch design and the forces that have driven modern design solutions. These forces, which include increased interface speed, increased number of ports, Software-Defined Networking (SDN), Deep Packet Inspection (DPI), and Network Functions Virtualization (NFV), have had a major impact on the architecture and design of modern packet switches.

All Packet Switches Are Divided Into Three Parts

The networking industry has been building LAN-based packet switches (including, under this rubric, LAN Bridges, MPLS Switches, and IP Routers) since the early-to-mid 1980s. The three functional components of a packet switch have been stable since the beginning.

• Data plane – The primary job of a packet switch is to move packets from an input interface to an output interface. Moving the data from input to output is the job of the data plane (sometimes called the forwarding plane). The sketch following this list illustrates the division of labor among the planes.

• Control plane – The data plane decides which output port to select for each arriving packet based on a set of tables that are built by the control plane. For an Ethernet switch, the control plane includes the process that learns MAC addresses, the spanning tree protocol, etc. For an IP router, the control plane includes the various IP routing protocols (e.g., OSPF and IS-IS).

• Management plane – All packet switches require some configuration. In addition, they include mechanisms for fault detection and reporting, statistics collection, and troubleshooting. This is all done by the management plane. The management plane includes the CLI, structured management information (e.g., MIBs), and protocols for accessing management information (e.g., SNMP).
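To make the division of labor concrete, the following sketch models the three planes of a toy Ethernet switch in Python. The class and method names are invented for illustration; in a real switch the data plane runs in hardware at line rate.

```python
# Illustrative only: a toy Ethernet switch showing the three planes.
# Names (ToySwitch, FLOOD, etc.) are invented for this sketch.

FLOOD = -1  # pseudo-port meaning "send out every port except the ingress"

class ToySwitch:
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.mac_table = {}   # built by the control plane, read by the data plane
        self.rx_packets = 0   # maintained for the management plane

    # Control plane: learn which port a source MAC address lives on.
    def learn(self, src_mac, in_port):
        self.mac_table[src_mac] = in_port

    # Data plane: choose an output port for each arriving frame.
    def forward(self, src_mac, dst_mac, in_port):
        self.rx_packets += 1
        self.learn(src_mac, in_port)
        return self.mac_table.get(dst_mac, FLOOD)

    # Management plane: expose state for CLI/SNMP/NETCONF-style queries.
    def stats(self):
        return {"rx_packets": self.rx_packets,
                "mac_table_size": len(self.mac_table)}

sw = ToySwitch(num_ports=8)
print(sw.forward("aa:01", "bb:02", in_port=1))   # -1: unknown destination, flood
print(sw.forward("bb:02", "aa:01", in_port=2))   # 1: learned from the first frame
print(sw.stats())
```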

The Early Years

For the first 20 years or so, although LANs got faster, there were only two major changes in packet switch design and architecture. The first change occurred when LANs moved from shared coaxial cable to twisted pair (10BASE-T). When LANs were based on shared media, the packet switches had a relatively small number of interfaces (2 to 8) because each connected LAN could have dozens of systems attached. The packet switch connected networks together. When LANs moved to twisted pair cable and each switch interface connected to a single system (which might be another switch), the packet switch was the network. At that point we started seeing 8, 16, 32, and 64 port switches. A related architectural change was the move from software-based implementations (based on standard CPUs, frequently 68000-series and occasionally other processor families), often with some hardware assist, to ASIC-based implementations. Fortunately, both of these changes happened around the same time (from about 1990 to 1995) since, as the number of ports increased, software-based implementations became impractical.

1 Packet performance is usually based on minimum-sized packets (64 bytes). Although some aspects of forwarding performance may be limited by the actual data rate (which is higher with larger packets), per-packet functions such as address lookups tend to be the limiting factor, and minimum-sized packets have the highest packet rate.

[Figures: Early switch with a software data plane (CPU implements the data, control, and management planes, with optional hardware forwarding assist) and early switch with a switching ASIC (CPU runs the control and management planes; the ASIC implements the data plane).]

The move to ASIC-based switching was critical to the success of packet switching. A 10Mbps Ethernet carries ~15,000 packets per second, so a packet switch with two 10Mbps Ethernet interfaces processes as many as 30,000 pps. This was feasible on a single CPU in 1990. A packet switch with two 100Mbps Ethernet interfaces processes 300,000 pps. This was feasible on a single CPU in the late 1990s. However, by then, it was common to deploy packet switches with 16 ports or more. In 1998 gigabit Ethernet was standardized, increasing the packet rate on each interface by another factor of ten. Clearly, packet switching now depends on an ASIC-based forwarding plane. Today, switches with 64 or more 10Gbps ports are common. A 10Gbps Ethernet can source 15,000,000 pps, so these switches process a billion packets per second or more!
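The packet rates quoted above follow directly from Ethernet framing overhead. The short calculation below reproduces them, assuming the standard 64-byte minimum frame, 8-byte preamble, and 12-byte inter-frame gap.

```python
# Maximum packet rate for minimum-sized Ethernet frames.
# On the wire each 64-byte frame also carries an 8-byte preamble
# and a 12-byte inter-frame gap, i.e. 84 bytes = 672 bits per packet.
BITS_PER_MIN_PACKET = (64 + 8 + 12) * 8

for name, bps in [("10M", 10e6), ("100M", 100e6), ("1G", 1e9), ("10G", 10e9)]:
    pps = bps / BITS_PER_MIN_PACKET
    print(f"{name:>4} Ethernet: {pps:13,.0f} packets/sec")

# 10M ≈ 14,881 pps (~15,000); 10G ≈ 14.9 Mpps, so 64 x 10G ports ≈ 1 Gpps.
```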

During this period, very little happened in the control and management planes. The control plane migrated to newer protocols — at layer 2, Spanning Tree moved to Rapid and Multiple Spanning Tree, and at layer 3, distance-vector protocols such as RIP and IGRP moved to link state protocols such as OSPF and IS-IS. MPLS was introduced during this period, adding a new set of control protocols, a new set of data plane tables, and a new data plane forwarding algorithm. None of these improvements required substantial architectural change.

The Later Years

Ethernet keeps getting faster and the number of connected stations keeps increasing, so the data plane of packet switches continues to get bigger and faster. In the last 10 years there have been major changes to the functionality and implementation in all of the planes, some of which have resulted in changes to the architecture and design of packet switches.

The Modern Management Plane

In the management plane there has been a migration towards a new way of expressing management information, moving away from ASN.1 and MIBs to a new data modeling language called YANG. YANG information is communicated using a new protocol called NETCONF (rather than SNMP). This is a major software change but does not result in any architectural changes. Another change in the management plane is the inclusion of Fault Management (FM) based on the IEEE 802.1Q(ag) and ITU-T Y.1731 standards and Performance Monitoring (PM) based on ITU-T Y.1731. Although found primarily in carrier-based packet switches, they can be used in enterprise networks as well. These protocols require substantial software support and the regular exchange of protocol messages. The number of messages that need to be sent and received can be quite large (especially for PM), and therefore calls for enhanced CPU capabilities or external hardware assist (e.g., an FPGA). Increasingly these capabilities are built right into the switch ASIC.
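To illustrate the YANG/NETCONF migration described above, the sketch below retrieves the interface configuration from a NETCONF-capable switch using the open-source ncclient Python library. The host address, credentials, and use of the standard ietf-interfaces YANG model are placeholders chosen for illustration.

```python
# Sketch: retrieving configuration from a NETCONF-capable switch with ncclient.
# Host, credentials, and the interface filter are placeholders for illustration.
from ncclient import manager

with manager.connect(host="192.0.2.1", port=830,
                     username="admin", password="admin",
                     hostkey_verify=False) as m:
    # Ask only for the interfaces subtree of the running configuration
    # (ietf-interfaces is a standard YANG model that many switches support).
    interface_filter = """
    <filter>
      <interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces"/>
    </filter>
    """
    reply = m.get_config(source="running", filter=interface_filter)
    print(reply.xml)
```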

The Modern Control Plane

As noted above, the IP interior gateway protocols have migrated over the years from distance vector protocols to link state protocols. IP routers also frequently run BGP. At layer 2, the Spanning Tree Protocol migrated to Rapid (and Multiple) Spanning Tree Protocol. Newer layer 2 protection protocols such as ITU-T G.8032 Ethernet Ring Protection and G.8031 Ethernet Linear Protection became popular primarily because of their faster switching time. Also, MPLS became a staple on packet switches (especially those that supported IP), which brought along additional control protocols, primarily the Label Distribution Protocol (LDP) and RSVP-TE. None of these changes required substantial architectural changes in the switch. Some new control tables were needed and, depending on the protocols supported and the design center with respect to network size, there was an effect on CPU power and the amount of memory needed for the control tables.

A divergence has also developed over the past several years between the control protocols used in carrier networks (and enterprise networks) and those used in data centers. Data centers pose a unique challenge in that they have a very large number (hundreds or more) of co-located systems that must be interconnected with high bandwidth. This has resulted in the development of new techniques and protocols. Two competing solutions are used in these environments: either the IEEE 802.1Q(aq) protocol, Shortest Path Bridging (SPB), or the IETF protocol, Transparent Interconnection of Lots of Links (TRILL, several RFCs based on RFC 6325).

Over the past several years, however, a major new approach to packet switch control planes has garnered increasing interest – Software Defined Networking or SDN. Traditionally, the control plane is a body of software (firmware) that builds tables to drive the activity of the data plane. Control plane protocols are used to exchange information with the other systems in the network in order to derive the information necessary to build the tables. This is a distributed control plane, since each node is performing its own computations to populate its forwarding tables. Software Defined Networking is a technique for centralizing the control plane. With SDN, a central server communicates with the network elements to collect system and interface state information. The server then computes forwarding table information for each network element (NE) and distributes the new/updated information to each NE. With SDN, the control plane in each NE consists of a protocol that can receive the forwarding table updates and apply them to the data plane. The most common protocol for encoding and encapsulating this information is OpenFlow.
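To make the controller-to-NE interaction concrete, the following sketch uses the open-source Ryu OpenFlow controller framework (Python). The flow entries and MAC address are invented for illustration; a production SDN controller would compute such entries from its global network view.

```python
# Sketch of a centralized controller pushing forwarding entries to a switch
# over OpenFlow 1.3, using the open-source Ryu framework. The class name and
# the specific flow entries are illustrative only.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class CentralController(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def on_switch_connect(self, ev):
        dp = ev.msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser

        # Forwarding decision computed centrally: send traffic for one
        # destination MAC out port 2 of this switch.
        match = parser.OFPMatch(eth_dst="00:00:00:00:00:02")
        actions = [parser.OFPActionOutput(2)]
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=10,
                                      match=match, instructions=inst))

        # Table-miss entry: anything the controller has not programmed yet
        # is punted to the controller for a decision.
        miss_actions = [parser.OFPActionOutput(ofp.OFPP_CONTROLLER,
                                               ofp.OFPCML_NO_BUFFER)]
        miss_inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS,
                                                  miss_actions)]
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=0,
                                      match=parser.OFPMatch(),
                                      instructions=miss_inst))
```

Run with ryu-manager against an OpenFlow 1.3 switch; each connecting switch receives the centrally computed entries plus a table-miss rule.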

There are several perceived2 benefits of SDN over the traditional distributed control plane:

• Reduced reliance on proprietary protocol implementations – software embedded in network elements is developed, owned, and maintained by the equipment vendor. Improvements, enhancements, bug fixes, overall product direction, and, of course, cost are all based on the product strategy and direction of the vendor. With SDN, control software runs on an external server and therefore the network operator can purchase (or develop) software as needed. This can result in lower cost as well as faster adoption of new capabilities and features.

• Lower equipment and network cost – with a distributed control plane each network element has “X” computes (a generic unit of computation power) to apply to running the control plane protocols. In a network with 50 network elements there are 50X computes deployed for control plane protocols. If the processing is centralized, each network element requires substantially less horsepower for the control plane, so each network element can cost less. The network server that is running the control plane clearly doesn’t require 50X computes. First of all, not all of the processing for all of the network elements must be done at exactly the same time – there is a level of statistical multiplexing. Secondly, distributed protocol processing results in a lot of redundant computation. Consider an OSPF Link State Advertisement flooded by an NE. That LSA will be received and processed by 49 other NEs in a distributed control environment, but in a centralized environment only one node has to process status information from the NEs. It is not unreasonable to assume that the total computational requirements of the control plane could drop by 50% or more; a back-of-envelope sketch of this estimate follows the list.

2 This isn’t to imply that they aren’t real, just that SDN deployment is not yet broad enough to have solid evidence.

• Network-based forwarding rather than node-based forwarding – possibly the most exciting benefit of SDN over a distributed control plane is the ability to implement network-based forwarding regimes rather than node-based forwarding. With distributed control, each network element has a topological view of the network, but is unable to make service-aware forwarding decisions since the nodes have no insight into the service assignments to network paths for any services other than the ones that terminate in the NE. With SDN, the controller can have a global view of the network topology and a global view of the service assignments and can assign services to paths (and even reassign services to new paths) just by distributing updated forwarding information to the involved NEs. This holds out the hope of substantially improved network resource utilization.
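The back-of-envelope model below (Python) quantifies the redundancy argument from the cost bullet above for the 50-node example. The assumption that LSA processing is the only redundant work, and that the rest of the control plane is unchanged, is purely illustrative.

```python
# Back-of-envelope comparison of LSA processing work, distributed vs
# centralized, for the 50-node example above (illustrative only).
N = 50

# Distributed: each node's LSA is received and processed by the other N-1 nodes.
distributed_ops = N * (N - 1)

# Centralized: the controller processes one state update per node.
centralized_ops = N

print(f"distributed: {distributed_ops} LSA processings per full refresh")
print(f"centralized: {centralized_ops}")
print(f"reduction:   {100 * (1 - centralized_ops / distributed_ops):.0f}%")
# For this one function the reduction is ~98%; across the whole control
# plane (much of which is not redundant) an overall drop of 50% or more
# is plausible.
```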

The Modern Data Plane

Here is where we see the greatest architectural change. Not only has the data plane had to get larger and faster, but there has been a tendency to push functionality into the data plane – and, of course, at the speeds that the data plane has to run, even trivial amounts of functionality result in huge computational requirements.

The requirements for enhanced data plane functionality started back in the early days with access control lists (ACLs). ACLs were relatively simple filter specifications that allowed packets to be identified for special processing. They were a common technique for implementing rudimentary security (restricting or allowing particular addresses) or for providing enhanced quality of service for certain packet flows. The initial filters were fairly simple, based on the L2 or L3 header, usually a source or destination address (MAC or IP) and an action to take on a match. These filters migrated from software into the switching ASICs.

Over time, the filters got more complicated. It wasn’t just a source or destination address. It could be a source AND destination, or it could be an address prefix, or an address prefix with a particular class of service marking. These additional requirements resulted in more complex filters in the switching ASICs and frequently the inclusion of a Content Addressable Memory (CAM), either internal to the ASIC or external. But all of these rules were still based on the L2 or L3 packet headers. In recent years, however, there have been requirements for filtering farther into the packet, at layers 4 and up, which is complicated by the fact that the number of packet headers and the lengths of the various packet headers are both variable. This Deep Packet Inspection (DPI) requires substantially greater “intelligence”. DPI is discussed later in this white paper.
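The sketch below (Python) illustrates priority-ordered ACL classification of the kind described here. The rule fields and example addresses are invented; in a switch ASIC this lookup is performed in parallel by a (T)CAM rather than by iterating in software.

```python
# Sketch of priority-ordered ACL matching on L2/L3 header fields.
# Rules and the example packets are invented for illustration; in a
# switch ASIC this lookup is done in parallel by a (T)CAM, not in a loop.
from ipaddress import ip_address, ip_network

# Each rule: (priority, match criteria, action). None means wildcard.
ACL = [
    (100, {"src_net": ip_network("10.0.0.0/8"), "dscp": None}, "drop"),
    (50,  {"src_net": None, "dscp": 46},                       "queue_ef"),
    (0,   {"src_net": None, "dscp": None},                     "permit"),
]

def classify(src_ip, dscp):
    """Return the action of the highest-priority rule that matches."""
    for priority, match, action in sorted(ACL, key=lambda r: -r[0]):
        if match["src_net"] is not None and ip_address(src_ip) not in match["src_net"]:
            continue
        if match["dscp"] is not None and dscp != match["dscp"]:
            continue
        return action
    return "permit"

print(classify("10.1.2.3", dscp=0))    # drop      (matches the 10.0.0.0/8 rule)
print(classify("192.0.2.7", dscp=46))  # queue_ef  (matches the DSCP rule)
print(classify("192.0.2.7", dscp=0))   # permit    (default rule)
```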

Architecting the Modern Switch

In addition to supporting data forwarding requirements, the architecture of a packet switch has to reflect both the computational requirements of the switch and the data paths necessary to support those requirements. There is a range of solutions from low-end to high-end. The solutions described below focus on the design of a single switching module, which could be put into a box as a stand-alone switch or into a chassis to interconnect with other modules to create larger switches. Although we avoid a broad discussion of chassis architecture and switch interconnection, the High-End Switch Solutions section describes how module interconnection in a chassis can be used to enhance packet processing capability.

Low-end Architectures (SoC Solutions)

At the low end (≤250Gbps) there are many vendors that provide switching ASICs that include a CPU on-chip. These devices frequently include several of the other important components necessary to build a full switching system and are called a System on a Chip or SoC. The most common CPU architecture today for this solution is ARM, although some solutions use MIPS or PowerPC (PPC). The CPU is responsible for both the management and the control plane. On-chip connectivity between the CPU and data plane makes it efficient to move packets to and from the CPU. The interconnection between the data plane and the local CPU, whether the CPU is on-chip or external, is usually a parallel interface (e.g., 16 or 32 bits wide), which allows for substantial bandwidth between the two. This interface is usually implemented with the same features as the LAN interfaces so that it can support many of the same features such as multiple class of service queues and various types of rate limiting/scheduling capabilities.

[Figure: Low-end SoC switch – LAN interfaces, packet buffer, and switching logic on one chip together with the CPU, flash, and RAM.]

These low-end SoCs are developed by several commercial chip vendors. Broadcom has several product lines including the Northstar series, Microsemi (Vitesse) has several low-end products such as the Serval and LynX product lines, and Marvell has several models of their Prestera switch in this category, just to name a few.

Mid-range Switch Solutions

In the mid-range (250–500Gbps) we can start with the traditional split CPU/switch architecture. This is similar to the architecture described in the low-end section above, but the CPU is moved off-chip so that more of the chip real estate can be dedicated to LAN interfaces and switching logic. This model allows more flexibility in the choice of CPU horsepower. For enterprise switches where the CPU is providing standard management plane support (CLI, SNMP, NETCONF) and standard control plane support (L2 or L3), a midrange CPU can be used. ARM, MIPS, PPC, and x86 (Atom) CPUs are commonly used. The flexibility is valuable if additional packet processing capability is needed. If some packets will be shunted to the CPU for further inspection for security purposes, encryption, policy enforcement, etc., a larger CPU can be used. One approach is to use multi-core CPUs. Systems can be designed to cover multiple product spaces by designing the switching module to support multiple CPUs in the same family. For example, a module can be designed to support 2-core, 4-core, and 8-core processors. For enterprise switching, 2 cores may be sufficient, but for solutions that require enhanced packet processing, 8 cores may be appropriate.

[Figure: Mid-range switch module – switching ASIC with LAN interfaces and packet buffer, connected over a parallel interface to an external CPU with flash and RAM.]

Another approach to building mid-range switches is to use a Network Processor (NP). A couple of the leaders in this space have been Broadcom and Cavium. An NP is a specialized CPU that includes instructions, data paths, and on-board co-processors designed to do high performance packet switching. A commodity switching ASIC is a fixed-function device (with a variety of configuration options), but a network processor is fully programmable. Since the NP architecture is focused on packet switching functionality, an NP can, for a given amount of power and cost, switch many times the rate of a traditional CPU (e.g., x86). Network Processors are particularly desirable if the application requires any kind of out of the ordinary packet processing since it, in effect, provides a programmable data plane.

[Figure: Mid-range switch module built around a Network Processor, with a CPU, RAM, optional TCAM, and fabric or additional interfaces.]

Standalone NPs top out around 400-500Gbps, but there is a new generation of combined Network Processor and multicore CPU pushing up the top end. Cavium’s approach has been to build devices with more and more cores (at the high end the number is currently 64). These cores have been MIPS-based with specialized architectural enhancements, with a migration towards ARM in the most recent products. Broadcom NPs are based on proprietary network architectures developed primarily through acquisitions such as NetLogic Microsystems. Both of these solutions are pushing up both the performance and the programmability of mid-range packet switches.

High-end Switch Solutions

At the high end (500Gbps and up) dedicated switching ASICs are required. Moving around half a terabit per second of data or more requires a focused solution. The Broadcom Trident series tops out at 1.28Tbps and the newer Tomahawk line (BCM56960) can switch 3.2Tbps on a mix of 10G, 25G, 40G, 50G, and 100G interfaces. Although these Broadcom XGS series devices are focused primarily on enterprise and data center switching, they also support carrier switching capabilities. The Broadcom 88000 series of switches (originally called the Dune series and now called the DNX series) is more directly focused on carrier-based switching. This series now tops out at 720Gbps per device with the BCM88670 switch.

Although the switching techniques supported by both series are pretty much the same — they both support all of the standard bridging modes, as well as MPLS switching and IP routing — there are several important distinctions. First and foremost is packet buffering. The XGS series has the packet buffers on chip and these buffers have been relatively small. The Tomahawk, with 10MB, has the largest on-chip buffer of the series. The DNX series has a small on-chip buffer, but most of the packet buffer is off-chip, up to gigabytes. The benefit of a large packet buffer is that it enables substantially more sophisticated traffic management and QoS capabilities (multi-level policing/shaping/queuing). The DNX also has much larger forwarding tables for the various forwarding regimes, and more sophisticated performance management tools. The DNX series is designed to be integrated into larger switching configurations. Each device has a high performance fabric interface and the series includes a crossbar fabric device that supports 1-stage and 3-stage Clos networks. In this configuration, huge switches can be built supporting hundreds of terabits per second. The diagram below shows a single module that could go into a box or a chassis and depicts how it can be connected to other modules through a fabric.

[Figure: High-end switch module – switching ASIC with LAN interfaces, off-chip packet buffer, CPU, flash, and RAM, connected to other modules through fabric interfaces.]

Deep Packet Inspection

As networks have gotten smarter, there has been an increasing need to make decisions based on more than just the L2 or L3 headers. “The Network” has to look deeper and deeper into the payload of the packet. There are a variety of applications for “Deep Packet Inspection” (DPI) including intrusion detection, detection of denial of service attacks, lawful intercept, more intelligent quality of service, looking for copyright violations, advertising, intelligent load balancing, etc.

There have been two primary approaches to implementing DPI. One has been to forward the traffic through a separate DPI box. Companies such as Arbor Networks and Sandvine have developed these solutions. The other approach has been to build the capability right into the switch. The design approach for this depends on the size of the switch and the amount of traffic subject to DPI. This last part is important. DPI is memory bandwidth intensive and, more importantly, it is compute intensive. Even if the DPI software (much of it is done in software) has hardware assist, the amount of horsepower needed to inspect the full bandwidth of a 1Tbps switch is prohibitive. For example, modern high-end Cavium multi-core CPUs, which include substantial hardware assist for DPI, are in the 100Gbps range.
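The toy classifier below (Python) illustrates what distinguishes DPI from header-based ACLs: decisions are made on payload bytes. The signatures shown are simplified illustrations; production DPI engines use large signature sets, flow reassembly, and hardware-assisted pattern matching.

```python
# Toy illustration of DPI: classify traffic by inspecting payload bytes,
# not just L2-L4 headers. Signatures are simplified for illustration.
import re

SIGNATURES = [
    ("http",       re.compile(rb"^(GET|POST|HEAD) [^ ]+ HTTP/1\.[01]")),
    ("tls",        re.compile(rb"^\x16\x03[\x00-\x04]")),  # TLS handshake record
    ("bittorrent", re.compile(rb"^\x13BitTorrent protocol")),
]

def classify_payload(payload: bytes) -> str:
    for name, pattern in SIGNATURES:
        if pattern.search(payload):
            return name
    return "unknown"

print(classify_payload(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n"))  # http
print(classify_payload(b"\x16\x03\x01\x00\xa5\x01\x00"))                       # tls
```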

For low-end switches, DPI can often be done in the control processor by selecting out the appropriate traffic and forwarding it to the control processor as with control plane traffic, although often using different queues. In this approach the system can be designed with a heftier control processor than it otherwise would be.

For mid-range switches, DPI can also be done in the control processor. If the switch is designed using a Network Processor (e.g., EZchip), it might be efficient to do some of the inspection in the network processor and then shunt some or all of the traffic off to an external processor for additional processing. Mid-range switches are in the vicinity of 250 – 500Gbps and therefore if a significant amount of traffic is subject to DPI, an external processor is definitely needed.

For high-end switches, DPI always needs dedicated processing. For board/box solutions (sixteen 40G ports is a high-end switch by our characterization), the DPI processor will usually be on the board and connect to the switch over one or more Ethernet interfaces (the latest Cavium processors have on-chip 10/40/100Gbps Ethernet interfaces).

[Figures: Board-level DPI – switching ASIC with an attached DPI CPU (e.g., Cavium OCTEON), plus CPU, flash, and RAM. Fabric-based DPI – switch modules and a dedicated application-processing module with multiple DPI CPUs, interconnected through the fabric.]

For fabric-based switches, the DPI processing can be on a separate application-processing module (shown above). This allows each module to contain more processing capability (e.g. multiple CPUs) and it allows the system to contain multiple modules. It also allows each of the application-processing modules to access all of the switch modules in the chassis. This approach results in a solution that can scale to very large DPI capabilities with very efficient interfaces to the switches.

Network Functions Virtualization

An Ethernet packet switch is really a smart computing system with a (relatively) dumb but fast packet forwarding capability (data plane). The computing system implements a variety of functions including the control plane and management plane and more recent innovations such as performance monitoring (part of the management plane) and DPI. In addition there are a variety of other functions that live on packet switches (especially IP routers) such as firewalls, encryption, IP address distribution (DHCP), address translation (NAT), etc.

As described in the SDN discussion above, economies of scale can be achieved by centralizing the control plane for packet switches. Although it shares some of the characteristics of virtualization, SDN is not really virtualization. If SDN were virtualization then the SDN server would be running virtual instances of each of the control planes for each of the packet switches. But with SDN, the server is running a single network-wide control plane and populating the forwarding tables of each of the NEs.

It does, however, make sense to consider moving various network functions such as the ones listed in the previous paragraphs into centralized servers. Consider DHCP, for example. A network could have 50 edge routers running DHCP, but there is no reason why DHCP has to run in each edge router. The traffic load associated with DHCP is small so it makes sense to run each of the 50 DHCP instances as a Virtual Machine (VM) or possibly as a thread of a DHCP VM on a centralized server. There is the efficiency benefit in statistically multiplexing the computational load (all 50 instances are never running at the same time) and several management benefits such as only having to update a single system to fix bugs and add features and only having to configure a single local system. Additionally, modern cloud computing technology provides the ability to migrate VMs for redundancy and to cloud burst for unexpected load peaks.

An important benefit of NFV is that it drives an architectural approach that results in encapsulated functions that can be independently distributed. In traditional packet switches the implementations of many of the functions are intertwined with each other so that it is difficult to move them around, but if the architecture is focused on encapsulating these functions, it then becomes possible to decide where each function should run. Some functions, like DHCP, can be run on a large central server, other functions like NAT can be virtualized on servers closer to the edge, and some functions such as encryption are best done at the subscriber site.

Capabilities can be spread around the network to run in the most efficient place and “service chains” can be implemented that direct packets from service to service through the network. As implementations and loads change, the services can be migrated as needed and the chains can be updated. The result is that all of the packet switches in the network become part of a large distributed computer system that can be tuned and optimized as needed. Network operators have the ability to scale their networks more easily, reduce hardware and software costs, improve quality of service, and deploy new functionality.
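The sketch below (Python) illustrates the service chain idea: each virtualized function is an independent step, and the chain is simply an ordered list that can be re-ordered or re-homed as functions migrate. All function names and packet fields are invented for illustration.

```python
# Toy sketch of an NFV service chain: each virtualized function is an
# independent step and the chain is an ordered list that the operator can
# re-order or re-home as functions migrate. All names are invented.

def firewall(pkt):
    return None if pkt.get("dst_port") == 23 else pkt    # drop telnet

def nat(pkt):
    return dict(pkt, src_ip="203.0.113.10")              # rewrite source address

def monitor(pkt):
    print("observed:", pkt["src_ip"], "->", pkt["dst_ip"])
    return pkt

SERVICE_CHAIN = [firewall, nat, monitor]   # order can be changed at run time

def apply_chain(pkt):
    for vnf in SERVICE_CHAIN:
        pkt = vnf(pkt)
        if pkt is None:        # a function in the chain dropped the packet
            return None
    return pkt

apply_chain({"src_ip": "10.0.0.5", "dst_ip": "198.51.100.9", "dst_port": 443})
```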

NFV exploits the virtual machine infrastructure of modern cloud computing. As a result, many of the physical (i.e., hardware) components of a traditional packet switch are implemented as software abstractions. Multiple virtual machines are connected together by virtual switches using software such as Open vSwitch (OVS). Software optimizations such as Intel’s Data Plane Development Kit (DPDK) and poll-mode drivers (which are more efficient than interrupt-driven drivers under heavy load) allow efficient interconnection of virtual components in a data center environment.

A final benefit of NFV, as we discussed above for SDN, is that it creates a software-driven approach to network design that substantially reduces the dependency on proprietary software.

NFV is a work in progress but holds great promise for packet switches and networks over the coming decade. Like any new technology approach, NFV needs to prove itself to IT management as an approach that is worth implementing. But once implemented, enterprises will see how they can add new services more quickly and at lower cost, and the return on their investment will be multi-fold.

Ethernet packet switches have evolved over the past 20 years, but with new advancements in SDN, NFV, and DPI technologies the network industry may see even greater impact on the architecture and design of modern packet switches – perhaps even in the next 20 months.

The Role of Northforge Innovations

As packet switches have evolved over time, Northforge Innovations has been working with network communications companies to solve the challenges of implementing packet switches and to provide solutions quickly to help them get to market faster. Northforge can help with Deep Packet Inspection, including intrusion detection, detection of denial of service attacks, and quality of service. Northforge has expertise in high-end switch solutions, including Broadcom switches and Cavium processors. For the latest in NFV and SDN technologies, Northforge has the software development experience to help companies develop their new NFV and SDN-based products or to convert existing networking functions.

About Northforge Innovations Inc.

Northforge Innovations is an expert software consulting and development company focused on advancing network communications. We target network security, network infrastructure, and media services, with the mission and passion to meet the industry’s demands in the evolving cloud infrastructure, virtualization and software-defined networking.

With an average of 15 years of experience, our consultants comprise a worldwide resource pool that’s based in North America. Northforge employs top technical and project management talent to give customers the “intellectual capital” they need for their network communications software development. Our developers have extensive technical and domain expertise across a breadth of technologies. With expertise extending beyond software development services, our team tackles our customers’ most demanding challenges and delivers innovative solutions. Our culture stresses innovation at every step, from our ability to understand and address our customers’ needs and our constant exchange of innovative ideas to the continuous value that we create for our customers.

For more information about Northforge Innovations Inc., please visit www.gonorthforge.com.

NORTHFORGE INNOVATIONS INC.

USA OFFICE (Sales Office)
One Boston Place, Suite 2600
Boston, MA 02108

GATINEAU DEVELOPMENT CENTER (Development Center)
72 Laval Street, 3rd Level
Gatineau (QC) J8X 3H3

MONTREAL DEVELOPMENT CENTER
40 Saint-Nicolas Street, Suite 026
Montreal, QC H2Y 2P5

General Inquiries 819.776.6066
Consulting Inquiries 781.897.1727
[email protected]
www.gonorthforge.com

Copyright 2016 - Northforge Innovations Inc.