Packet Processing at Wire Speed Using Network Processors

Chethan Kumar and Hao Che
University of Texas at Arlington
{ckumar, hche}@cse.uta.edu

Abstract

Recent developments in fiber optics and new bandwidth-hungry applications have put more stress on the active components (switches, routers, etc.) of a network. Optical fiber bandwidth is no longer a constraint for increasing the network bandwidth. However, the processing power of the network has not scaled up with the increase in fiber bandwidth. The communication industry is looking for more innovative ways of designing router architecture, and research is being conducted to develop a scalable, flexible and cost-effective architecture for routers. A successful outcome of this effort is a specialized processor called the Network processor. A Network processor provides performance at hardware speeds while attaining the flexibility of software. Network processors from different vendors employ different architectures, and the choice of a particular type of network processor can affect the architecture of the router and the performance of the whole system. Selecting the optimal design for a router architecture with a particular type of network processor can be very difficult. A systematic modeling framework has to be developed to analyze the impact of various design choices on the system performance. This framework should be simple, efficient and easy to comprehend. In this paper, we provide a survey of the ongoing research work in the network processing field: the problems faced in processing packets at wire speed, some of the solutions developed to address these problems, router and network processor architectures, and simulation and analysis tools for routers and network processors.

(The term "router" here also covers multi-service switches that include multiple Asynchronous Transfer Mode (ATM) and frame-relay interfaces.)

1 Introduction

The modern day internet has seen an explosive growth of applications being used on it. As more and more applications are developed, there is an increase in the amount of load put on the internet. At the same time, fiber optic bandwidth has increased dramatically to meet the traffic demand, but present day routers have limited processing power to handle this profound increase in demand. Hence the networking and telecommunications industry is compelled to look for new solutions for improving the performance and the processing power of routers.

One of the industry's solutions to the challenges posed by the increased demand for processing power is programmable functional units grouped into a processor called an Application Specific Instruction Processor (ASIP) or Network processor (NP), also referred to as a Network Processing Unit (NPU). NPs offer ease of programming with high scalability. They offer dedicated processing power to the routers for performing standard RFC-compliant packet processing tasks while allowing the slower, higher level control and management tasks to be performed in the general purpose CPU. It is this separation of tasks which allows the router to harness the full power of the NP. To be able to optimize the performance of an NP there should be clear guidelines for dividing the tasks (also called function partitioning) between the NP and the general purpose processor. Further, the tasks that can be executed in the NP can be split between the slow CPU and the fast processing elements within the NP. (Different vendors use different names for processing elements: Intel uses the term "Micro Engine", while AMCC calls it an "nPCore".)

Different vendors use different architectures to design Network processors. An NP's architecture can affect its performance and thereby the performance of the router as a whole. Benchmarking the performance of NPs based on different architectures will help system integrators to choose the right kind of NP for their routers. However, just choosing the right kind of NP will not be sufficient to boost the performance of a router; there have to be several other components working in tandem with the NP. The individual abilities of the hardware components, along with their interactions, can affect the performance of the whole system, which can be much different from what is anticipated. Good modeling frameworks to analyze and quantify the effect of individual components and their interactions will make the choice of design for a system a lot easier.

This survey paper is an effort to give an overview of ongoing research efforts in the network processing field. The paper is organized as follows: in Sec. 2 we mention some of the common packet processing tasks and explain how these tasks can be partitioned. This gives us a big picture of the role of the NP and the division of tasks among different processors in a router. In Sec. 3 we describe how the packet processing tasks are mapped to physical components and study some of the architectural solutions for designing a router. In Sec. 4 we discuss some of the techniques to evaluate the performance of routers based on their designs. Here we study different modeling frameworks from a system-level perspective. These frameworks are helpful not only to analyze the capabilities of the individual components, but also to quantify the effect of the interactions of the components on the system architecture. In Sec. 5 we talk about NPs, different types of NPs and the programming models for NPs. We describe some of the issues in processing packets at line speed and mention some of the techniques to address these issues. In Sec. 6 we explain two of the important characteristics of NPs – multithreading and pipelining. In this section we also describe different types of pipelining techniques and their advantages and disadvantages. A good understanding of multithreading and pipelining is very important to analyze NP architecture. In Sec. 7 we mention some of the tools to model and analyze the performance of NPs. Analysis of different architectures and comparison of their effects on the performance of an NP can be very helpful to designers and system integrators in evaluating different NP architectures and choosing the right one for their system. In Sec. 8 we try to address one of the most important issues which can cause undeterministic behavior of the NP – memory access latency. Memory access latencies can prevent NPs from achieving wire speed (also called line speed) processing. We look at some of the solutions to the memory access latency problem. Finally, we conclude our discussion in Sec. 9.

2 Function Partitioning

Fig 1: Packet processing tasks – control plane: policy applications, network management, signaling, topology management; forwarding plane: queuing/scheduling, data transformation, classification, data parsing, media access control.

The key to an efficient design for a router is understanding the nature of the packet processing tasks and dividing the tasks into functional components. Packet processing tasks can be broadly categorized into:

• Forwarding plane (data path) tasks – the group of tasks on the forwarding path of a router. These include receiving, processing and transmitting the packets.

• Control plane (control path) tasks – the group of tasks which involve the control and management operations. These comprise maintaining the routing table, ICMP packet processing, building up the routing tree, network monitoring and management tasks.

A detailed description of these packet processing tasks can be found in [6] [7] & [14].

As trivial as it sounds, partitioning the tasks into control plane and data plane has been one of the biggest challenges in network processing. Researchers are trying to come up with a framework for partitioning the packet processing tasks. One such effort is being conducted by the ForCES (Forwarding and Control Element Separation) [3] working group of the IETF. The ForCES working group is trying to come up with an architectural framework for the data plane and control plane separation, identifying the associated entities in each of these planes and the interactions among them.

In this framework, the network entity (such as a router) is subdivided into two logical sub-elements known as the Control Element (CE) and the Forwarding Element (FE). Forwarding elements can be hardware based (ASIC), programmable (Network processor) or software based (implemented with a general purpose CPU). The Forwarding Element handles all the data plane tasks shown in fig 1. Control Elements are based on general purpose CPUs implementing the control plane tasks. Separating the tasks into data plane and control plane functions and having them implemented in separate hardware has some advantages. Standards-based specifications can evolve for these components and vendors can specialize in developing the components. Standards-based components can interoperate with one another and allow systems integrators to choose the best components for their products. Scalability could be easily achieved by just adding additional components to the system to improve the performance.

The physical separation of CE and FE can be achieved at the blade level [3] or at the box level [3]. Blade level architectures are used in chassis based solutions where the interconnectivity between the CE and FE can be provided by a switch fabric. In the case of a box solution, CEs and FEs can be implemented using separate boxes and interconnected using high speed LAN technologies like Gigabit Ethernet. ForCES is an effort to develop a standard protocol for intercommunication between FEs and CEs.

One of the hidden outcomes of the separation of control plane and data plane is the ability to forward packets even in the case of CE failure and/or restart. Since the CE and FE association is dynamic, availability can be increased through mechanisms such as graceful restart [3].

A similar framework [4] has been developed by Intel for their IXP series of network processors. However, this framework is limited to Intel network processors.

Data path tasks can be further partitioned into slow and fast data path functions [12] [14]. Invoking slow data path functions for a packet results in slow data path forwarding of that packet. The purpose of slow data path forwarding is to process packets which need special treatment and more resources. The slow data path functions may include, for example, packet fragmentation and options field processing. Fast data path functions include IP validation, IP header lookup, firewall/policy filtering, and MPLS label swapping.
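To make the split concrete, the sketch below shows one way a line card's dispatch logic might steer packets between the two paths. It is a minimal illustration in C under assumed helpers (ip_validate, policy_filter, slow_path_enqueue and the pkt fields are hypothetical, not part of any vendor API); real NPs implement this selection in microcode or hardware classifiers.

/* Minimal sketch of fast/slow data path dispatch on an NP; all
 * helper functions and fields are hypothetical placeholders.       */
#include <stdbool.h>
#include <stdint.h>

struct pkt {
    bool     has_options;   /* IP options present                   */
    bool     is_fragment;   /* needs fragmentation/reassembly       */
    uint32_t dst_addr;
};

/* Fast path primitives: validation, lookup, filtering, label swap. */
extern bool ip_validate(const struct pkt *p);
extern int  ip_header_lookup(uint32_t dst);
extern bool policy_filter(const struct pkt *p);
extern void mpls_label_swap(struct pkt *p, int next_hop);

/* Slow path: exceptional packets are handed to the local CPU.      */
extern void slow_path_enqueue(struct pkt *p);

void dispatch(struct pkt *p)
{
    /* Packets needing special treatment leave the fast path early. */
    if (p->has_options || p->is_fragment) {
        slow_path_enqueue(p);
        return;
    }
    if (!ip_validate(p) || !policy_filter(p))
        return;                       /* drop                       */
    mpls_label_swap(p, ip_header_lookup(p->dst_addr));
}

The point the sketch makes is that the fast path performs only fixed, bounded work per packet, while anything irregular is diverted to the slow path so that it cannot exhaust the fast path's per-packet budget.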
3 Mapping packet processing tasks to physical components

There are three major approaches to mapping the packet processing tasks to the physical components in a router, depending on the platform used for performing the tasks. The main tradeoff in these approaches is speed against flexibility. The pure hardware approach uses configurable ASICs, the pure software approach uses programmable general purpose processors, and the hybrid approach uses programmable components.

The pure hardware approach [5] uses ASICs optimized for the data path functions. This approach gives the highest achievable performance for the data path functions. However, it takes relatively longer to develop the ASICs, which means a long time-to-market. The cost of development and enhancement is very significant: fixing a bug or adding functionality takes long cycles and is very resource intensive.

In the pure software approach, the data path functions are performed entirely in the software domain [5] using general purpose CPUs. This approach is optimized for maximum programmability. It has a comparatively shorter time-to-market and maximum reusability. However, there are a lot of scalability issues with this approach.

The hybrid approach [5] uses the best features of both the hardware and software domains mentioned above. The performance is comparable to that of ASICs while achieving a high level of flexibility. Data path functions that do not require flexibility are implemented using dedicated hardware components (also called co-processors). Modifications to existing application protocols can be easily accommodated by reprogramming the programmable components (also called processing elements). The forwarding plane functions shown in fig. 1 are performed in hardware and software in a Network processor (a dedicated programmable processor). The control plane functions are implemented in software, either in an embedded microprocessor or in a dedicated general purpose CPU.

Fast data path functions are handled by either the NP and its coprocessors (special purpose processors handling specific tasks like routing table lookup, next-hop lookup, policy filtering (TCAMs), encryption/decryption (IPSec coprocessors) and deep content inspection), ASICs or a software program, depending on the type of approach, while the slow data path functions are performed by either the local CPU, the control card, or even an embedded CPU available in some NPs. The purpose of slow data path forwarding is to offload from the NP the processing and resource load of those packets which need special treatment.

Packet processing tasks can also be centralized or distributed. In a centralized architecture, all the packet processing functions are performed by a central processing element. In a distributed architecture, the intelligence is distributed to the line cards. A line card typically contains a transceiver, a framer device, a network processor (for data plane tasks), a traffic scheduler and a CPU (for control plane tasks). A detailed description of these components can be found in [10] & [11].
4 Tools and methodologies for router performance analysis

The performance of a router is affected not just by its individual components, but also by the interactions between them. Modern day routers contain network processors which employ parallelism to increase the overall latency budget available to process packets (latency here is the amount of time available to process a packet before the next packet arrives). Changes in the workload behavior can greatly affect the interactions among the components. For example, under the worst-case traffic condition where minimum sized packets arrive back-to-back, the contention for external memory access by multiple processing elements in an NPU can lead to undeterministic latency behavior by the line card.

Another example is the thread scheduling algorithms used in the NPs. To understand this, let us explain what may happen when an NP is overloaded. Since, on the inbound side of the NP, there is virtually no buffer available to hold backlogged packets, any inbound packet which fails to grab an NP thread is dropped. On the other hand, an outbound packet that fails to grab an NP thread can be back-pressured to the switch fabric memory or to the queue management module in the line card without being dropped.

The difficulty lies in the fact that most of today's NP solutions adopt either a system-agnostic thread scheduling algorithm (e.g., the Intel IXP1200 uses a round-robin thread scheduling algorithm) or a simple strict priority based algorithm (e.g., the AMCC nP7120 assigns strict priority to threads assigned inbound packets over threads assigned outbound packets). These algorithms generally fail to allow graceful service degradation when the NP is overloaded. On one hand, system-agnostic algorithms tend to cause significant inbound packet dropping when the NP is overloaded. On the other hand, strict priority based algorithms may indiscriminately back-pressure best-effort traffic as well as real-time traffic, causing excessive delay or loss to the real-time traffic. In this context, it becomes necessary to analyze the performance of the network processor in conjunction with the several other hardware components constituting a router.

To achieve an effective system design, a good system-level modeling technique is required. This modeling technique should be able to capture the application and workload characteristics and also quantify the impact of the component interactions on the system performance. There are several techniques to measure the performance of the individual components. However, only a few modeling frameworks have been developed to analyze the performance of a system as a whole. Some of these techniques are explained below.

System-level modeling frameworks can be broadly categorized into three groups [45]:

• Measurement-based – this approach uses the actual system and software implementation to study the performance. This approach has several limitations [45]: it works only for existing systems, the designer is forced to work with pre-configured hardware settings, and it does not depict the component interactions. This approach may have limited capabilities for a thorough analysis of new system designs.

• Analytical modeling – this approach uses analytical methods to model the system. However, it is too complex to model the application behavior using this approach.

• Simulation – this approach is better suited for system level modeling and can integrate application and workload behavior as well. Each system component is modeled individually and placed in a discrete event simulator. Work flows through each component model as events representing the interactions among all the components modeled in the system (see the sketch below). With this kind of model it is easy to identify the bottlenecks in the system and rectify the problem.
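The following toy discrete event-driven simulator shows the structure just described: component models exchange timestamped events through a scheduler. All components, latencies and names are illustrative assumptions, not part of a real framework such as [45] or [46].

/* Toy discrete event-driven simulator: each component is a handler
 * and work flows between components as timestamped events.         */
#include <stdio.h>

#define MAX_EVENTS 1024

typedef void (*component_fn)(double now);

struct event { double time; component_fn handler; };

static struct event queue[MAX_EVENTS];
static int n_events = 0;

static void schedule(double time, component_fn handler)
{
    if (n_events >= MAX_EVENTS) return;  /* sketch-level guard      */
    queue[n_events].time = time;         /* unsorted; popped by scan */
    queue[n_events].handler = handler;
    n_events++;
}

static void memory_model(double now)
{
    printf("%8.1f ns: memory access completes\n", now);
}

static void pe_model(double now)         /* processing-element model */
{
    printf("%8.1f ns: PE issues memory read\n", now);
    schedule(now + 55.0, memory_model);  /* assumed 55 ns DRAM latency */
}

int main(void)
{
    schedule(0.0, pe_model);             /* packet arrival starts work */
    while (n_events > 0) {               /* pop earliest event         */
        int min = 0;
        for (int i = 1; i < n_events; i++)
            if (queue[i].time < queue[min].time) min = i;
        struct event ev = queue[min];
        queue[min] = queue[--n_events];
        ev.handler(ev.time);
    }
    return 0;
}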
The main difference between Measurement-based – this approach uses the actual [45] and [46] is that [45] uses the actual application system and software implementation to study the code to model the various hardware components performance. This approach has several limitations along the pipeline whereas [46] uses models for both [45]. It works only for existing systems. A designer is hardware components and the program code forced to work with pre-configured hardware settings, representing the application. and it does not depict the component interactions. This approach may have limited capabilities for a thorough Another tool that can be used for performance analysis of new system designs. analysis of system level design for routers is parallel object-oriented specification language (POOSL) [27] Analytical modeling – this approach uses analytical [29]. [27] uses performance modeling for developing methods to model the system. However, it is too an executable model of a system under study. complex to model the application behavior using this POOSL is a modeling language that can be used for approach. analyzing the properties of real-time distributed hardware/software systems. It provides primitives for Simulation approach - this approach is more suited for describing both behavior and architecture of a system level modeling and can integrate application system. POOSL is equipped with a formal (i.e., and work load behavior as well. Each system mathematically defined) semantics, which can be component is modeled individually and placed in a used to apply analytical techniques to analyze the discrete event simulator. Work flows through each performance. component model as events representing the interactions among all the components modeled in the A similar tool for analyzing router architecture is system. By having this kind of model it is easy to “Click-modular router” [31] [32]. Click-modular identify the bottlenecks in the system and rectify the router uses an object – based description of system problem. hardware and can determine maximum packet flow through the system and the overall resource Discrete event-driven simulators [45] [46] can be used utilization for a given traffic model. It is designed to to develop a modeling framework that can give a primarily run on Linux systems. More details about system level performance measure. Countach [45] is a click can be found in [31] & [32]. performance modeling framework that tightly [50] uses an analytical framework for modeling and interconnected by a configurable network. A manager analyzing the impact of technology on the cost- is used to configure the coprocessors and to handle performance tradeoffs in distributed-router the interconnection between them. The manager also architectures. controls the memory access, and the instruction set used by the coprocessors. Coprocessors perform a predefined set of functions (e.g., longest prefix 5 Network processors matching or packet classification). The manager co- ordinates the functioning of the coprocessors and sets Network processors use the hybrid approach for the path for the packet flow through the coprocessors. implementing fast data path functions. With their flexibility, scalability and shorter time-to-market, This type of NP is optimized for a limited set of data network processors are finding their place in virtually path functions. The advantage of this type of NP is all communications equipment today. 
[1], [2], [5], [6], that the coprocessors can be designed for high [11] & [26] gives a detailed description of various performance. The disadvantage is that this approach network processors available from different vendors in limits the adaptation of NPs for new applications and the market today. protocols and reduces the NPs’ time-in-market. NPs with VLIW architecture are considered to be There are several architectures [10] [14] used for configurable NPs [10]. designing a network processor. Based on the type of processors used for performing data path functions, • Programmable NP [10] - This type of NP consists NPs can be broadly classified into two categories: of a main controller and task units that are interconnected by a central switch fabric. A task unit • Multiple Instruction Multiple Data (MIMD) [10] can be a cluster of (one or more) RISC processors or – several processors perform the fast data path a special-purpose coprocessor. The controller tasks in parallel. Multiple Reduced Instruction Set controls the interactions among the RISC processors, Computing (RISC) processors are used in the core the coprocessors, memory, and the switch fabric. It of the network processor. They are connected to also loads the instruction set for each RISC shared buffer memory and I/O devices through processor. common BUS. Inter-process communication and access to external memory can become a bottleneck This approach offers great flexibility due to its for processing packets at wire speed for higher line adaptation to new applications and protocols. The rates. RISC processors can be programmed to perform different functions and the processing order can be • Very Long Instruction Word (VLIW) [10] – This programmed. The disadvantage is that the design of is similar to MIMD architecture except that it uses the interconnection between fabric, RISC processors, multiple special purpose processors called co- and co-processors cannot be optimized for all processors to perform different fast data path tasks functions. As a result, the latency budget for some like IP table lookup and packet classification. functions may exceed the worst-case requirement7 These co-processors are designed for specific tasks and achieving the wire-speed processing may be and can give better performance for the particular difficult. MIMD processing architectures can be task. However since they are function-specific they included in this category. have limited flexibility and portability.

NPs can also be classified based on their architecture and the type of embedded processors [10].

• Configurable NP [10] – this type of NP consists of multiple special-purpose coprocessors which are interconnected by a configurable network. A manager is used to configure the coprocessors and to handle the interconnection between them. The manager also controls the memory access and the instruction set used by the coprocessors. The coprocessors perform a predefined set of functions (e.g., longest prefix matching or packet classification). The manager coordinates the functioning of the coprocessors and sets the path for the packet flow through the coprocessors. This type of NP is optimized for a limited set of data path functions. The advantage of this type of NP is that the coprocessors can be designed for high performance. The disadvantage is that this approach limits the adaptation of NPs to new applications and protocols and reduces the NP's time-in-market. NPs with VLIW architecture are considered to be configurable NPs [10].

• Programmable NP [10] – this type of NP consists of a main controller and task units that are interconnected by a central switch fabric. A task unit can be a cluster of (one or more) RISC processors or a special-purpose coprocessor. The controller controls the interactions among the RISC processors, the coprocessors, memory, and the switch fabric. It also loads the instruction set for each RISC processor. This approach offers great flexibility due to its adaptability to new applications and protocols: the RISC processors can be programmed to perform different functions, and the processing order can be programmed. The disadvantage is that the design of the interconnection between the fabric, RISC processors and co-processors cannot be optimized for all functions. As a result, the latency budget for some functions may exceed the worst-case requirement (the time available to transmit a minimum sized packet) and achieving wire-speed processing may be difficult. MIMD processing architectures can be included in this category.

5.1 Programming model for Network Processors

While it is true that the design of the hardware influences the ease of programming and the performance level, it is really the software that determines the flexibility, simplicity and scalability of the network processors. The level of programmability determines the extent to which the power of the hardware components can be utilized. There are several choices for programming [7] the network processors:

• Microcode programming – all the forwarding plane functions are implemented in application specific hardware using microcode programming. It is implemented in multiple instances of the processing elements running in parallel, with each processing element using multi-threading. Fixed schedulers distribute the incoming packets to the next available processing element. However, this requires the programmer to be knowledgeable about everything, including the processing latency of each processing element, the memory access latency and the synchronization among the threads. The advantages of microcode programming are the efficiency and the compactness of the code. The drawbacks include the difficulty of programming in low level machine code, the understanding of the underlying architecture required of the programmer, and the lack of portability, since the machine code can be hardware and vendor dependent.

• 4GL programming – this type of programming uses proprietary search and pattern-matching algorithms for the parsing and classification part of the packet processing tasks. Many of these algorithms are implemented using "Fourth Generation Languages". These languages provide an optimized method of programming for classification functions, but they can be used to implement only a part of the data path processing tasks. The processors using 4GL languages typically trade memory size for search speed [7]. These processors provide mostly parsing and classification capabilities; additional external hardware (and associated software) is required to implement the other processing tasks. The programming domain is disjoint, as part of the packet processing tasks are implemented using 4GL languages and the rest may use the other techniques described here.

• Standard language programming – standard higher level languages (such as C/C++) combined with special purpose hardware called co-processors are used to implement the various packet processing tasks. Multiple RISC cores are used to execute the C/C++ programs implementing the packet processing functions. Many "RISC-based" NPs implement vendor-specific instruction sets which are based on the hardware design of the NP. This restricts programmers to writing all or significant portions of their code in processor dependent RISC assembly language. Higher level programming models [47][48] provide an API abstraction layer that hides the lower level chip implementation details from the higher level code without sacrificing performance (which is in general not possible) and support writing effective programs in higher level, processor independent standard languages. Writing programs in higher level languages can extend the life of the software, as the programs can be re-used on any NP. However, the drawbacks of this type of model are that it requires some kind of mapping between the processor independent, higher level programming language and the lower-level, processor dependent microcode, and that the RISC core has to possess enough processing power to perform this mapping.
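The sketch below illustrates the API-abstraction idea in C: the application is written against a processor-independent interface, and a per-chip backend maps it onto vendor primitives. The structure and all names (np_ops, stub_chip, etc.) are invented for illustration and do not correspond to the actual APIs of [47] or [48].

/* Sketch of a processor-independent programming interface.         */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Interface the portable packet processing code is written against. */
struct np_ops {
    int  (*lookup_route)(uint32_t dst_ip);
    void (*enqueue)(int port, const void *pkt, size_t len);
};

/* Application code: portable because it only sees np_ops.           */
static void forward(const struct np_ops *np, uint32_t dst,
                    const void *pkt, size_t len)
{
    np->enqueue(np->lookup_route(dst), pkt, len);
}

/* One backend per chip maps the interface onto vendor primitives;
 * this stub stands in for, e.g., generated microcode calls.         */
static int  stub_lookup(uint32_t dst)  { return (int)(dst & 0x3); }
static void stub_enqueue(int port, const void *pkt, size_t len)
{
    (void)pkt;
    printf("enqueue %zu bytes on port %d\n", len, port);
}
static const struct np_ops stub_chip = { stub_lookup, stub_enqueue };

int main(void)
{
    uint8_t pkt[40] = {0};
    forward(&stub_chip, 0x0A000001, pkt, sizeof pkt);
    return 0;
}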
5.2 Challenges for packet processing at wire speed

Network processors are required to process packets at wire speed so that no packets are dropped. This imposes a limitation on the number of instructions that can be executed on a packet and on the latency available for processing the packet in the network processor. For example, at a 10Gbps line rate a minimum sized packet of 40 bytes (POS) arrives every 32ns. At 40Gbps the packet arrival time drops to 8ns. The network processor must complete all the data path functions on an incoming packet, including accessing external memories, within one packet arrival time.
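These inter-arrival figures follow directly from the packet size and line rate; the short C program below reproduces them (framing overhead, which accounts for the slightly larger 35 ns POS figure quoted later, is deliberately ignored here).

/* Back-of-the-envelope check of worst-case inter-arrival times.    */
#include <stdio.h>

int main(void)
{
    const double pkt_bits = 40 * 8;              /* minimum packet   */
    const double rates_gbps[] = { 2.5, 10.0, 40.0 };

    for (int i = 0; i < 3; i++) {
        double t_ns = pkt_bits / rates_gbps[i];  /* bits/Gbps -> ns  */
        printf("%5.1f Gbps: one 40B packet every %5.1f ns\n",
               rates_gbps[i], t_ns);
    }
    return 0;   /* prints 128.0, 32.0 and 8.0 ns respectively        */
}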
There are several problems in processing packets at higher line speeds:

• Upper bound on packet processing latency [14] – the time available to process a packet in a network processor is the time required to transmit a minimum sized packet (the worst case) at wire speed. Processing elements in the NP check the header information in the incoming packets and, based on the next-hop information, place them in queues in external memory corresponding to the output ports (for simplicity, we will call them output queues). The output scheduler in the NP checks each queue for packets to be transmitted to the output ports. The number of clock cycles required to process a packet at wire speed varies depending on the number and size of the output queues and on the position of the packets in the output queues in the external memory. Schedulers, which have to parse the output queues to transmit the packets, therefore see variable memory access latencies depending on the size of the queues and the searching algorithm used. Meeting the wire speed consistently becomes difficult in such cases. It is very important to use proper data structures so that there is a deterministic upper bound on the memory access latency involved in searching for the packets in the queues.

• Dependency [13] – certain computations on cells/packets are dependent upon the results of computations on the preceding cells/packets. In such cases, processing the packets in sequential order is very important. This increases the latency of packet processing.

• Access to shared data structures [13] – access to shared data structures also increases latency. Sometimes the access to a shared data structure has to be exclusive, which can result in serialization of multiple parallel accesses.

• Storing the shared data [13] – since the arrival of packets belonging to the same flow is undeterministic, the packet contexts may have to be stored for access at a later time. Depending on the number of contexts that the application must support and the size of the on-chip storage in the network processor, the shared data may need to be stored in external memories. Accessing the external memories requires additional processor cycles, thus increasing the overall latency.

• Hiding memory access latency [10] [11] [14] – even with the fastest memory available today, the memory access latency is still very high compared to the inter-packet arrival time at higher speeds. The time available to execute instructions for packet processing is very small compared to the memory access latency. The switch fabric interface of the line card has a bandwidth that is usually twice [10] that of the line speed. This can compensate for some of the performance degradation arising from improper scheduling mechanisms for the output ports and from the overhead used to carry routing, flow control, and Quality-of-Service (QoS) information in the packet/cell header. At 10Gbps (OC-192) and 40Gbps (OC-768) line rates, the aggregated I/O bandwidth of the memory at the switch port can be 120Gbps [10]. For minimum sized packets of 40 bytes, the memory access latency at each port is then required to be less than 2.66 ns. Also, the large size makes it highly difficult to integrate the memories into the network processor (the available on-chip memories are insufficient to hold all the information required for packet processing, so separate memory chips have to be used). The high pin count for the memory puts a limitation on the number of external memories that can be attached to the network processor. This external memory access increases the overall latency.

• Consistency of shared data [14] – multiple packets can be processed in parallel using multithreading and pipelining. However, it is very important to maintain the coherency of the shared data, since several threads may be accessing and updating the shared data at the same time. There are several techniques to achieve consistency in such cases; using a locking mechanism or ensuring strict thread ordering are some of them. But these techniques increase the latency.

• Packet ordering [14] – some applications require ordered packet processing, for example voice packets and ATM cells. The packet ordering problem can be solved using the techniques mentioned below:
i. The sequence numbers of the packets can be used to order the packets [14].
ii. Thread ordering can be used: packets are assigned to threads in the incoming order, and the threads process the packets in the same order and transmit them to the output port [14]. However, in such cases a thread can hog the resources, and processing of the succeeding packets will be delayed.

• Quality of Service [10] – QoS requirements can vary for different traffic flows. Traffic policing/metering/shaping requires proper packet scheduling and queue management (packet discarding policies) algorithms. For a network processor using multithreading, the QoS requirements impose additional challenges on the thread scheduling algorithm. Policies similar to those applied for packet scheduling and buffer management should be applied to the multithread scheduling.

Hence, to increase the processing (instruction) budget and hide the memory access latency, several techniques like multiple parallel processing elements, pipelining and multithreading are used. Programming with multithreading is a challenge, as the programmer has to deal with issues like thread synchronization.

To give an example of the latency budget available to a network processor at higher line speeds, consider a minimum sized packet (40 byte) flow at 10Gbps; a new packet arrives every 35ns. Assume that the packets are buffered in an SDRAM. A 100MHz SDRAM with a 64 bit bus has a total memory bandwidth of 6.4Gbps, so storing a 40 byte packet in the SDRAM takes 50ns. An additional 50ns is required to transmit the packet from the packet buffer to the output port. Hence, without even considering the latency for processing the packet header, just storing the packet in the external packet buffer and transmitting it to the output port takes 100ns.

To overcome the latency problem, network processors use special hardware designs. One of the solutions to increase the latency available to process a packet in the network processor is to use multiple processing elements. The fast data path is subdivided into packet processing tasks, which are implemented by programming the processing elements to perform the individual tasks. The elements can be connected to form a "pipeline", and the packet processing tasks can then run in parallel on multiple packets.

The processing elements would otherwise remain idle most of the time waiting for the completion of I/O operations. In order to make the network processor "work conserving" (in other words, active as much as possible), multithreading can be used. Multithreading is explained in detail in the next section.

6 Multithreading

Fig. 2: Parallel processing of packets using multi-threading – four threads (0 through 3) each handle one of the packets n through n+3, so with an inter-packet arrival time of Ta the execution time available per packet is 4 × Ta.

Multithreading allows multiple packets to be processed in parallel. Each incoming packet is assigned to a thread, and the thread executes a particular packet processing task on that packet. When a thread has to wait for the completion of an I/O operation, which can be a memory access, the status of the thread can be changed to "wait", and any other thread performing the same task that is "ready" for execution can be switched to "active", thus conserving processor cycles that would otherwise go to waste waiting for the completion of the I/O operation. This increases the overall throughput and also the latency budget available for processing a packet, as multiple packets are being processed in parallel at any given time.
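A software caricature of this switching discipline is sketched below: the active thread is parked on a memory reference and the next ready thread is resumed round-robin, in the style the IXP1200 is reported to use. Actual microengines do this in hardware in a few cycles; the C here only illustrates the state transitions.

/* Work-conserving thread switching, illustrated in software.       */
enum tstate { READY, ACTIVE, WAIT };

struct thread { enum tstate state; /* ... packet context ... */ };

#define NTHREADS 4
static struct thread threads[NTHREADS];

/* Called when thread 'cur' starts a memory access: park it and
 * pick the next READY thread round-robin.                          */
int on_memory_access(int cur)
{
    threads[cur].state = WAIT;
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (cur + i) % NTHREADS;
        if (threads[t].state == READY) {
            threads[t].state = ACTIVE;
            return t;               /* resume this thread            */
        }
    }
    return -1;                      /* all threads waiting: PE stalls */
}

/* The memory controller calls this when the access completes.      */
void on_memory_done(int t) { threads[t].state = READY; }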

Partitioning the fast data path processing requires some consideration of the worst case performance requirement of the tasks: the processing elements should be able to meet the worst case inter-arrival time of the packets, corresponding to the maximum rate at which they arrive. To illustrate this, let the worst case inter-packet arrival time be Ta, let I be the number of instructions required to execute a particular task, and let F be the clock frequency of the processing element implementing that task. The processing element can meet the worst case performance requirement only if I/F < Ta (the time required to execute all the instructions implementing the task is less than the worst case inter-packet arrival time).

If the execution time of a particular packet processing task on a single processing element is greater than the inter-packet arrival time, the latency budget available to complete the task can be increased by using multiple processing elements executing the task in parallel. The tasks can be assigned to the processing elements in three different ways:

• Pipeline (context pipelining [11] & [14]) – each processing element performs a particular task, in serial fashion.

• Parallel (function pipelining [11] [14] & [15]) – multiple processing elements perform all the tasks in parallel.

• Mixed pipelining [11] – uses a mixture of context and function pipelining. Elasticity buffers [11] are used to move the data from one pipeline to the other.

6.1 Context pipelining

Fig. 3: Pipeline of processing elements – n elements (#1 through #n) connected in series, each with m threads and an execution time budget of m*Ta, where Ta is the inter-packet arrival time.

This approach is also called the "pipelined architecture". In this approach, each packet processing task is allocated to a separate processing element, and the processing elements are connected sequentially to form a "pipeline of tasks". The context (state) of a packet moves across the pipeline stages as the individual packet processing tasks are performed sequentially. If each processing element supports m threads, then m packets can be processed simultaneously by a processing element. This increases the latency budget for a particular stage to m*Ta. The total latency budget for a pipeline with n stages would be m*n*Ta. In other words, the total processing time for a packet processing task assigned to a processing element in an n-stage pipeline is m*n*Ta (each processing element having m threads).

Advantages of context pipelining [14]:

• The state for a given packet processing task is persistent across all the packets in a pipeline stage and can be stored local to the pipeline stage.

• It eliminates the complexity associated with sharing the state information among multiple processing elements.

• The processing element's program memory space can be dedicated to a single packet processing task.

Disadvantages of context pipelining [14]:

• Some packet state information must be communicated from each processing element in the pipeline to the next (e.g., updated packet headers). Sharing packet state information involves additional overhead if the packet is large.

• Each packet processing task must meet the worst-case performance requirement. Partitioning the fast data path into packet processing tasks becomes very critical to achieving the desired performance level.

6.2 Functional pipelining

Fig. 4: Multiprocessing elements – n processing elements (#1 through #n) in parallel, each a pipeline of m stages with a per-element execution time of m*Ta.

This configuration is also called "multiprocessing". Each processing element executes all the packet processing tasks of the fast data path on a cell/packet context, so a packet is handled by only one processing element, and the processing elements are used in parallel. A processing element performs m packet processing tasks on a single packet and, at any given instance, performs only one of the m packet processing tasks [13]. Each processing element can be regarded as a pipeline of m stages. To ensure that each incoming packet is processed without dropping any packet, n processing elements work in parallel. If each processing element has m stages and there are n processing elements in parallel, the total execution time available for completing all the packet processing tasks on each packet is given by m*n*Ta.

Advantages of functional pipelining [14]:

• The packet state information can be held local to a processing element.

• This design eliminates the latency involved in communicating the packet state between processing elements.

• Each processing task in a processing element should meet the worst case performance requirement, but the worst case performance limit of a processing element is the sum of the processing latencies of all the packet processing tasks executed by that element. This makes it possible to distribute the execution time among the different stages of the pipeline in a processing element unevenly, resulting in a better utilization of the processing element's execution time.

Disadvantages of functional pipelining [14]:

• The processing element's program memory is shared between multiple functions and can become a bottleneck.

• The state information shared across the packets is kept in external memory. Maintaining that state coherently and accessing it can be costly.
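Plugging illustrative numbers into the budget expressions above shows that both organizations arrive at the same total m*n*Ta budget, differing only in how it is sliced; the m, n and Ta values below are arbitrary examples, not measurements.

/* Latency budgets for the two pipelining organizations.            */
#include <stdio.h>

int main(void)
{
    const double ta_ns = 32.0;  /* worst-case inter-arrival, 40B @ 10G */
    const int m = 8;            /* threads (or stages) per element     */
    const int n = 4;            /* pipeline stages (or parallel PEs)   */

    /* Context pipeline: each of n stages gets m*Ta; total m*n*Ta.     */
    printf("context:    per stage %.0f ns, total %.0f ns\n",
           m * ta_ns, (double)m * n * ta_ns);

    /* Functional pipeline: each PE runs all m tasks on one packet,
     * with n PEs in parallel, giving the same m*n*Ta total budget.    */
    printf("functional: total per packet %.0f ns\n",
           (double)m * n * ta_ns);
    return 0;
}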

7 Performance analysis of Network Processor

Network processors are used for a wide variety of applications, and their performance depends on the underlying hardware and software. A key factor in obtaining good performance out of a network processor is analyzing the performance requirements of the network processor for the different applications. Standard benchmarks along the lines of CPU and system-level benchmarks are still evolving for network processors.

Two important parameters used to analyze the performance of a network processor are the instruction budget and the memory access latency budget. The instruction budget, measured in terms of compute cycles [24], is given by the total number of instructions required to execute all the fast data path functions [6] & [14] on a packet. The memory access latency budget, measured in terms of I/O cycles [24], includes all the accesses to external memories during the processing stage.

To estimate the total available compute cycle and I/O cycle budgets for a given application, the following rules of thumb can be used:

IN = F * Ta
Total instruction budget for the given application = IN * N

where IN is the instruction budget per stage in the pipeline, F is the processor clock frequency, Ta is the inter-packet arrival time for worst case traffic (minimum sized packets of 40 bytes at line speed), and N is the number of stages in the pipeline.

Memory latency budget = Ta * FSR

where FSR is the memory clock speed.
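As a worked example of these rules of thumb, evaluated for an assumed 600 MHz processing element, 200 MHz memory channel and 4-stage pipeline (none of these numbers describe a specific NP):

/* Rule-of-thumb compute and I/O budgets from the formulas above.   */
#include <stdio.h>

int main(void)
{
    const double f_mhz   = 600.0;   /* PE clock F                    */
    const double ta_ns   = 32.0;    /* worst-case inter-arrival Ta   */
    const int    nstages = 4;       /* pipeline depth N              */
    const double fsr_mhz = 200.0;   /* memory clock FSR              */

    double in_per_stage = f_mhz * ta_ns / 1000.0;  /* IN = F * Ta    */
    printf("instructions per stage   = %.1f\n", in_per_stage);
    printf("total instruction budget = %.1f\n", in_per_stage * nstages);
    printf("memory latency budget    = %.1f I/O cycles\n",
           ta_ns * fsr_mhz / 1000.0);              /* Ta * FSR       */
    return 0;   /* 19.2 and 76.8 compute cycles, 6.4 I/O cycles      */
}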

Systems engineers and architects are faced with the challenge of designing complex real-time hardware/software systems. The design of hardware systems becomes complex as it involves partitioning the functions and mapping them onto hardware components. Different design choices affect the system performance and the cost of the system. Well structured design methods and tools aid system engineers and architects in better analysis of a particular design for a system. Using actual hardware/software components for the performance analysis can be expensive, complex and time consuming for system-level design. Standard system design tools which are inexpensive, easy to comprehend and easy to apply have yet to evolve for analyzing the performance of the various network processor architectures, though a few tools are already available.

7.1 Tools and methodologies to analyze the performance of Network Processor

[24] describes a methodology that can be used to analyze the performance of a network processor. It uses 46-byte POS packets for an IPv4 forwarding plus DiffServ application running on an Intel IXP2400 [25] network processor for the performance analysis. This methodology uses a data movement model [24] which describes the various packet processing tasks performed by the target network processor. The model can be used to estimate the total number of compute clock cycles and I/O cycles required to process a packet for a given application. The estimate can be validated by implementing microcode and tuning the code on the simulator.

System level modeling tools like POOSL [28] and the Click modular router [30] [51] can also be used for modeling NPs.

[33] describes a task and resource model for network processors, along with an analytical model for data traffic, which is used to analyze the problems of packet processing. It is also used in design space exploration [34] of network processors. An analytical model [33] [34] can also be used to study the performance of network processors; arrival curves and deadlines [34] can be used for specifying the traffic load. [20] uses a C++ cycle accurate model for comparing the AES key scheduler against known results.

[36] compares the results from an analytical model to those of a simulation on an Intel IXP1200 network processor, and the results are shown to be within 15% accuracy.

7.2 Performance Analysis
[30] presents several examples of modeling uniprocessor and multiprocessor systems executing IPv4 routing and IPSec VPN encryption/decryption applications. The performance results of the architecture in the Click modular [30] model are compared to the actual results measured on the real systems being modeled; the results are found to be accurate within 10% [30].

While multithreading certainly increases the performance of the network processor, designing a proper architecture with multithreading is very complicated. Thread scheduling is one of the major problems when it comes to multithreading, and there have been very few efforts to study the behavior of multithreading in network processors.

One such effort is explained in [17], where the performance of two types of architectures – a single processor with simultaneous multithreading (SMT) [17] and chip-multiprocessors (CMP) [17] – has been analyzed. The simulation uses a cycle accurate simulator [20] [23] with a multi-programmed workload [17]. The workload [17] comprises three different tasks: IP forwarding, a web-switch monitoring HTTP requests and connections, and a VPN node that performs encryption/decryption and authentication. The main contribution of [17] is the comparison of the performance of SMT and CMP processors.

A similar work is explained in [19] & [20], though it is limited to cryptographic applications: they analyze the performance of different cryptographic algorithms on network processors. According to [19], cryptographic applications require different architectural characteristics than normal packet processing applications. This work uses an execution driven simulator called "SimpleScalar". SimpleScalar [19] is a tool which can simulate the behavior of a general purpose processor based on the SimpleScalar architecture [19]. The architectural characteristics studied in [19] would be more applicable to MIPS-based processor architectures rather than the RISC-based architectures used by most modern day processors. Moreover, the general purpose processor architecture is different from the network processor architecture, the latter being optimized for packet processing functions, so new simulation tools which can simulate the behavior of network processors could yield better results. This study is also limited to instruction set characteristics, instruction level parallelism, branch prediction, and cache behavior; further research can be done on the latency requirements of cryptographic applications.

[34] & [35] estimate the performance of a network processor for different applications. They provide a new scheme to estimate end-to-end packet delays and packet queuing, and to explore the design space. This can be used to quickly develop new architectures which can later be analyzed in detail using other design tools.

[21] & [22] describe a systematic approach to benchmarking network processors.

8 Memory access latency

One of the major bottlenecks in processing packets at wire speed is the latency in accessing the external memories. Memories are used for storing the incoming packets, maintaining various queues based on next hop information, and packet classification. Memory access is required for routing table lookups, policy filtering, metering and policing, and for enqueuing/dequeuing the packets.

The performance of the memory depends on the following factors: the memory chip's I/O signaling interface, its core architecture, and its address and command protocols. No matter which type of pipelining (context vs functional) is used, the threads access the memories at least once every Ta (inter-packet arrival time) units. Packet enqueue and dequeue operations [11] can cause undeterministic memory access behavior in network processors; the access latency depends upon various factors [11] such as the number of queues, the length of each queue, and the position of a packet in a particular queue. In order to process the packets at wire speed, the network processor needs to complete the memory operations well within the inter-packet arrival time.
The upper limits on the memory access latency and the required memory bandwidth are given by

Memory access latency per thread per stage = Ta / N

where Ta is the inter-packet arrival time for worst case traffic (minimum sized packets of 40 bytes at line speed) and N is the number of memory references (read/write) per thread, and

Memory bandwidth = memory clock speed * bus width.

To process packets at line rate, the memory clock rate should be at least Nmax/Ta, where Nmax represents the maximum number of memory references required by any stage.

Another important factor that can affect the performance of the memory is the random nature of the data traffic and the variable packet size (40 to 1500 bytes). The packet size also determines the number of memory references required to store/read the packet in/from the buffer.
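Evaluated for the worst case used throughout this paper (40-byte packets at 10 Gbps, Ta = 32 ns) and an assumed four memory references per thread, the bounds above work out as follows:

/* Per-access latency bound and minimum memory clock from Ta and N. */
#include <stdio.h>

int main(void)
{
    const double ta_ns = 32.0;      /* 40B packets at 10 Gbps        */
    const int    nrefs = 4;         /* assumed references per thread */

    printf("latency per access <= %.1f ns\n", ta_ns / nrefs);
    printf("min memory clock   >= %.1f MHz\n",
           nrefs / ta_ns * 1000.0); /* Nmax/Ta, in MHz               */
    return 0;   /* 8.0 ns and 125.0 MHz with these assumptions       */
}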
In this context, the choice of memory type plays a very critical role in achieving line rate processing. The new types of memories available today which enable an NP to process OC-192 (10Gbps) data rates include fast cycle RAMs (FCRAM) [40], reduced-latency DRAMs (RLDRAM) [40], double data rate DRAMs (DDRDRAM) [40], and Rambus DRAMs (RDRAM) [39] [40]. These memories are characterized by different types of signaling and different frequencies. Some of the most common signaling levels are series-stub-terminated logic (SSTL) [40], high-speed transceiver logic (HSTL) [40] and Rambus signaling levels (RSL) [40]. The following table [40] gives the memory types, signaling types and clock frequencies:

Memory     I/O frequency (MHz)   Signaling   Number of banks   Row access latency (ns)
DDRDRAM    400                   SSTL-2      4                 55
FCRAM      400                   SSTL-2      8                 25
RLDRAM     600                   HSTL        8                 26.7
RDRAM      1200                  RSL         32                53.3

A 40 byte packet at 20Gbps line speed requires the memory access latency to be less than or equal to 32ns. This can be achieved by using memories like FCRAM and RLDRAM, which have a memory access time of less than 32ns. Another option is to interleave or pipeline the accesses to memory across different memory banks (preferably in the same chip) [40], yielding one memory access per clock cycle. RDRAM uses a protocol which can access multiple packets simultaneously, allowing up to five pipelined operations [40].

At OC-768 line rates, none of these memories would be able to accommodate the memory access requests at line speed. A new research memory code-named "Yellowstone" [40] is expected to allow memories to run at 3.2GHz and subsequently at 6.4GHz, which will be able to accommodate higher line rates.

Content Addressable Memories (CAMs) [41] allow simultaneous search operations on every entry in the memory. Each entry in a CAM has compare logic which allows the entries to be compared to a search key in a single access. This reduces the latency associated with a lookup in conventional memories. IP forwarding applications can use ternary CAMs [41] for next hop lookups. An on-chip TCAM [11] can be used to reduce the latency in accessing the queue descriptors [26] in the external memory. This solution tries to take advantage of various forms of locality, both spatial and temporal. However, in core routers, where backbone links converge, millions of individual packet streams flow through the network processor; caching techniques would not be of much help there, and special techniques have to be designed to improve the memory access latency.
high throughput it proposes a new design for the memory using non-uniform wide word parallelism [43]. Memory bandwidth can be increased by increasing the 9 Conclusions width of the word that can be accessed within one clock cycle. Using pipelined memory architecture to enable Network processors are fast emerging as the solution multiple memory (word) access at any given instance, for processing packet at wire speeds. Network the throughput can be increased. Concurrent accesses to processors provide the performance of ASICs with memory can be more efficient if smaller memory tiles the flexibility of the software programs. They are (banks) are used in the memory chip. Even though this scalable, flexible, portable (atleast among the design increases latency per transaction, it increases the products from the same vendors) and they have shorter time-to-market. Efforts are on to evolve standards [49] for the processor interfaces and policy-based networking are posing challenges to the functionalities. router design and at the same time offer research opportunities, especially the data plane design, in However network processors come with their own terms of speed and function extensibility. share of problems. They use multithreading extensively Communication industry is moving towards to increase the latency budget available to process achieving wire-speed packet processing, though it has packets, which may pose problems with respect to to overcome a lot of challenges. The ongoing thread scheduling. Portions of code can interact with research effort is a good step in that direction each other quickly leading to complex behavior. The although it is not sufficient to achieve the ultimate interaction of NPs with other hardware components in goal for flexibility, scalability, code portability, the router also may lead to undeterministic performance shorter time-to-market and longer time-in-market. A characteristics especially when it comes to accessing lot more needs to be done to accomplish this goal. external memories. This fact cannot be overlooked and analyzing the performance of network processor in isolation is not sufficient to achieve the ultimate goal of References: wire speed packet processing. As a matter of fact, it underlines the importance of analyzing the performance [1] Niraj Shah, Kurt Keutzer “Network Processors: of network processor from a system-level perspective. Origin of Species”, University of California, Berkeley. Our main motivation in this survey paper is to expose the research community to the various challenges faced [2] Niraj Shah, “Understanding Network Processors”, in achieving wire speed packet processing in University of California, Berkeley, 4th September communication networks. First, we discussed the 2001. function partitioning which is necessary to define the role of various processors (NPs, co-processors, [3] L. Yang , T. Anderson , R. Gopal “Forwarding embedded CPUs, general purpose CPUs) in a router. and Control Element Separation (ForCES) We explained how to map the packet processing tasks Framework”, Internet Draft, Working Group: to various components and analyze the effect of this ForCES, June 2003. mapping on the performance of the router using various simulation tools and analysis techniques. We described [4] Uday Naik, Alex Shoykhet, Larry Huston, Donald various types of NPs and programming models for NPs. 
9 Conclusions

Network processors are fast emerging as the solution for processing packets at wire speed. They provide the performance of ASICs with the flexibility of software. They are scalable, flexible, portable (at least among products from the same vendor), and they offer a shorter time-to-market. Efforts are under way to evolve standards [49] for processor interfaces and functionalities.

However, network processors come with their own share of problems. They use multithreading extensively to increase the latency budget available for processing packets, which may pose problems with respect to thread scheduling. Portions of code can interact with each other, quickly leading to complex behavior. The interaction of NPs with other hardware components in the router may also lead to non-deterministic performance characteristics, especially when accessing external memories. This fact cannot be overlooked: analyzing the performance of a network processor in isolation is not sufficient to achieve the ultimate goal of wire-speed packet processing. Rather, it underlines the importance of analyzing the performance of the network processor from a system-level perspective.

Our main motivation in this survey paper is to expose the research community to the various challenges faced in achieving wire-speed packet processing in communication networks. First, we discussed function partitioning, which is necessary to define the roles of the various processors (NPs, co-processors, embedded CPUs, general-purpose CPUs) in a router. We explained how to map the packet processing tasks to the various components and how to analyze the effect of this mapping on router performance using simulation tools and analysis techniques. We described the various types of NPs and the programming models for NPs. We highlighted some of the challenges in achieving wire-speed packet processing and the techniques used to overcome some of them. We also discussed NP architecture in general and pipelining and multithreading in particular, mentioning the advantages and disadvantages of the different types of pipelining. We briefly explained simulation tools and analysis methodologies for studying the performance of NPs based on different architectures, along with some of the ongoing research in this area. Finally, we discussed some solutions to overcome memory access latencies.

In summary, new applications and protocol suites, including multiprotocol label switching (MPLS), Differentiated Services (DiffServ), layer 2/3 virtual private networks (VPNs), static/dynamic network address translation (NAT), constraint-based routing and policy-based networking, are posing challenges to router design while at the same time offering research opportunities, especially in data plane design, in terms of speed and function extensibility. The communication industry is moving towards wire-speed packet processing, though it has many challenges to overcome. The ongoing research effort is a good step in that direction, although it is not yet sufficient to achieve the ultimate goals of flexibility, scalability, code portability, shorter time-to-market and longer time-in-market. A lot more remains to be done to accomplish these goals.

References:

[1] Niraj Shah, Kurt Keutzer, "Network Processors: Origin of Species", University of California, Berkeley.

[2] Niraj Shah, "Understanding Network Processors", University of California, Berkeley, September 4, 2001.

[3] L. Yang, T. Anderson, R. Gopal, "Forwarding and Control Element Separation (ForCES) Framework", Internet Draft, Working Group: ForCES, June 2003.

[4] Uday Naik, Alex Shoykhet, Larry Huston, Donald Hooper, Raj Yavatkar, Duke Tallam, Travis Schluessler, Prashant Chandra, Adrian Georgescu, "IXA Portability Framework: Preserving Software Investment in Network Processor Applications", Intel® Technology Journal, Volume 6, Issue 3, 2002.

[5] Vik Chandra, "Selecting a network processor", IBM Microelectronics.

[6] David Husak, "Network processors: A Definition and Comparison", White paper, C-Port (a Motorola company).

[7] David Husak, "Network processor programming models: The key to achieving faster time-to-market and extending product life", White paper, C-Port (a Motorola company).

[8] V. P. Kumar, T. V. Lakshman, D. Stiliadis, "Beyond Best Effort: Router Architecture for the Differentiated Services of Tomorrow's Internet", IEEE Communications Magazine, pp. 152-164, May 1998.

[9] "Network processor designs for next generation equipments", White paper, EZchip Technologies.

[10] H. Jonathan Chao, "Next Generation Routers", Proceedings of the IEEE, vol. 90, no. 9, September 2002.

[11] Matthew Adiletta, Mark Rosenbluth, Debra Bernstein, Gilbert Wolrich, Hugh Wilkinson, "The Next Generation of Intel IXP Network Processors", Intel® Technology Journal, Volume 6, Issue 3, 2002.

[12] James Aweya, "IP Router Architectures: An Overview", Nortel Networks, Ottawa, Canada.

[13] "Next generation network processor technologies", Intel white paper, October 2001.

[14] Muthu Venkatachalam, Prashant Chandra, Raj Yavatkar, "A highly flexible, distributed multiprocessor architecture for network processing", Computer Networks 41 (2003), pp. 563-586.

[15] Keith Morris, "Challenges in Making Highly Integrated Network Processors", Applied Micro Circuits Corporation.

[16] Matthew Adiletta, Donald Hooper, Myles Wilde, "Packet over SONET: Achieving 10 Gigabit/sec Packet Processing with an IXP2800", Intel® Technology Journal, Volume 6, Issue 3, 2002.

[17] Patrick Crowley, Marc E. Fiuczynski, Jean-Loup Baer, "On the Performance of Multithreaded Architectures for Network Processors", Technical Report 2000-10-01, Department of Computer Science & Engineering, University of Washington, Seattle, WA.

[18] Patrick Crowley, Marc E. Fiuczynski, Jean-Loup Baer, Brian N. Bershad, "Characterizing Processor Architectures for Programmable Network Interfaces", Proceedings of the 2000 International Conference on Supercomputing, Santa Fe, N.M., May 2000.

[19] Haiyong Xie, Li Zhou, Laxmi Bhuyan, "Architectural Analysis of Cryptographic Applications for Network Processors", Department of Computer Science & Engineering, University of California, Riverside.

[20] Wajdi Feghali, Brad Burres, Gilbert Wolrich, Douglas Carrigan, "Security: Adding Protection to the Network via the Network Processor", Intel® Technology Journal, Volume 6, Issue 3, 2002.

[21] Mel Tsai, Chidamber Kulkarni, Christian Sauer, Niraj Shah, Kurt Keutzer, "A Benchmarking Methodology for Network Processors", 1st Network Processor Workshop, held with the 8th International Symposium on High Performance Computer Architecture (HPCA), Boston, February 3, 2002.

[22] Prashant R. Chandra, Frank Hady, Raj Yavatkar, Tony Bock, Mason Cabot, Philip Mathew, "Benchmarking Network Processors", Intel Corporation.

[23] Ram Bhamidipati, Ahmad Zaidi, Siva Makineni, Kah K. Low, Robert Chen, Kin-Yip Liu, Jack Dahlgren, "Challenges and Methodologies for Implementing High-Performance Network Processors", Intel® Technology Journal, Volume 6, Issue 3, 2002.

[24] Sridhar Lakshmanamurthy, Kin-Yip Liu, Yim Pun, Larry Huston, Uday Naik, "Network Processor Performance Analysis Methodology", Intel® Technology Journal, Volume 6, Issue 3, 2002.

[25] "Intel® IXP2400 Network Processor: Flexible, High-Performance Solution for Access and Edge Applications", Intel white paper.

[26] Intel IXP1200 Network Processor, Programmer's Manual.

[27] B.D. Theelen, J.P.M. Voeten, L.J. van Bokhoven, "Assessment of POOSL Modelling: Performance analysis for system-level design", Eindhoven University of Technology, June 30, 2000.

[28] B.D. Theelen, J.P.M. Voeten, R.D.J. Kramer, "Performance modeling of a network processor using POOSL", Information and Communication Systems Group, Faculty of Electrical Engineering, Eindhoven University of Technology.

[29] Zhangqin Huang, J.P.M. Voeten, B.D. Theelen, "Modeling and Simulation of a Packet Switch System using POOSL", Proceedings of the 3rd PROGRESS Workshop on Embedded Systems, October 24, 2002.

[30] Patrick Crowley, Jean-Loup Baer, "A Modeling Framework for Network Processor Systems", Department of Computer Science & Engineering, University of Washington.

[31] E. Kohler, "The Click modular router", PhD thesis, Massachusetts Institute of Technology, November 2000.

[32] E. Kohler, R. Morris, B. Chen, J. Jannotti, M. F. Kaashoek, "The Click modular router", ACM Transactions on Computer Systems, 18(3):263-297, August 2000.

[33] Lothar Thiele, Samarjit Chakraborty, Matthias Gries, Alexander Maxiaguine, Jonas Greutert, "Embedded Software in Network Processors - Models and Algorithms", Computer Engineering and Networks Laboratory, Swiss Federal Institute of Technology (ETH) Zürich, Switzerland.

[34] Lothar Thiele, Samarjit Chakraborty, Matthias Gries, Simon Künzli, "Design Space Exploration of Network Processor Architectures", Computer Engineering and Networks Laboratory, Swiss Federal Institute of Technology (ETH) Zürich, Switzerland.

[35] Lothar Thiele, Samarjit Chakraborty, Matthias Gries, Simon Künzli, "A Framework for Evaluating Design Tradeoffs in Packet Processing Architectures", Computer Engineering and Networks Laboratory, Swiss Federal Institute of Technology (ETH) Zürich, Switzerland.

[36] Matthias Gries, Chidamber Kulkarni, Christian Sauer, Kurt Keutzer, "Comparing Analytical Modeling with Simulation for Network Processors: A Case Study".

[37] Timothy Sherwood, George Varghese, Brad Calder, "A Pipelined Memory Architecture for High Throughput Network Processors", Department of Computer Science and Engineering, University of California, San Diego.

[38] Pankaj Gupta, Steven Lin, Nick McKeown, "Routing Lookups in Hardware at Memory Access Speeds", Computer Systems Laboratory, Stanford University, Stanford, CA.

[39] Matthias Gries, "The Impact of Recent DRAM Architectures on Embedded Systems Performance", Euromicro 2000, Symposium on Digital Systems Design, Maastricht, Netherlands, September 2000, vol. 1, pp. 282-289.

[40] Michael Ching, "Packet buffer memory bandwidth causes NPU performance bottlenecks", Rambus Inc., Los Altos, Calif., EE Times, May 9, 2003. http://www.commdesign.com/story/OEG20030509S0037

[41] Romain Saha, Tomasz Wojcicki, "TCAMs Emerge as Viable Replacement to Trie Lookups", CommsDesign.com, June 19, 2003. http://www.commsdesign.com/design_center/netprocessing/design_corner/OEG20030619S0013

[42] Gianfranco Bilardi, Kattamuri Ekanadham, Pratap Pattnaik, "Optimal Organizations for Pipelined Hierarchical Memories", SPAA '02, August 10-13, 2002.

[43] Timothy Sherwood, George Varghese, Brad Calder, "A Pipelined Memory Architecture for High Throughput Network Processors", Proceedings of the 30th International Symposium on Computer Architecture (ISCA), June 2003.

[44] M. Rosenblum et al., "Using the SimOS Machine Simulator to Study Complex Computer Systems", ACM Transactions on Modeling and Computer Simulation, Vol. 7, No. 1, pp. 78-103, 1997.

[45] Prashant Pradhan, Wen Xu, Indira Nair, Sambit Sahu, "Efficient and Faithful Performance Modeling for Network-Processor Based System Designs".

[46] Wen Xu, Larry Peterson, "Support for Software Performance Tuning on Network Processors", IEEE Network, July/August 2003.

[47] N. Shah, W. Plishker, K. Keutzer, "NP-Click: A Programming Model for the Intel IXP1200", IEEE Second Workshop on Network Processors, held with HPCA-9, Boston, February 2003.

[48] G. Memik, W. H. Mangione-Smith, "NEPAL: A Framework for Efficiently Structuring Applications for Network Processors", IEEE Second Workshop on Network Processors, held with HPCA-9, Boston, February 2003.

[49] The Network Processing Forum, http://www.npforum.org

[50] Henry C. B. Chan, Hussein M. Alnuweiri, Victor C. M. Leung, "A Framework for Optimizing the Cost and Performance of Next-Generation IP Routers", IEEE Journal on Selected Areas in Communications, vol. 17, no. 6, June 1999.

[51] Niraj Shah, William Plishker, Kurt Keutzer, "NP-Click: A Programming Model for the Intel IXP1200", University of California, Berkeley.