Queueing Behavior and Packet Delays in Network Processor Systems

Jing Fu, Olof Hagsand, Gunnar Karlsson
KTH, Royal Institute of Technology
SE-100 44, Stockholm, Sweden
[email protected], [email protected], [email protected]

ABSTRACT
Network processor systems provide the performance of ASICs combined with the programmability of general-purpose processors. One of the main challenges in designing these systems is the memory subsystem used when forwarding and queueing packets. In this work, we study the queueing behavior and packet delays in a network processor system which works as a router. The study is based on a system model that we have introduced and a simulation tool that is constructed according to the model. Using the simulation tool, both best-effort and diffserv IPv4 forwarding were modeled and tested using real-world and synthetically generated packet traces. The results on queueing behavior have been used to dimension various queues, which can be used as guidelines for designing memory subsystems and queueing disciplines. In addition, the results on packet delays show that our diffserv setup provides good service differentiation for best-effort and priority packets. The study also reveals that the choice of traces has a large impact on the results when evaluating router and switch architectures.

Keywords
network processor, router, queueing behavior

1. INTRODUCTION
During recent years, both Internet traffic and packet transmission rates have grown rapidly. At the same time, new Internet services such as VPNs, QoS, and IPTV are emerging. These trends have implications for the architecture of routers. Ideally, a router should process packets at high-speed line rates, and at the same time be sufficiently programmable and flexible to support current and future Internet services.

To meet these requirements, network processor systems have emerged to provide a flexible router forwarding plane. The goals are to provide the performance of traditional ASICs and the programmability of general-purpose processors. To achieve this, programmable processing elements, special-purpose hardware and general-purpose CPUs are used to perform packet processing tasks. In this work, we model a router using a network processor system, that is, a system constituted by line cards built with network processors.

Packets may arrive to a network processor system in bursts and be forwarded to the same limited resource (i.e., an outgoing interface), causing congestion. Therefore, packets need to be queued at several stages. Thus, designing the memory subsystem and queueing disciplines inside the system becomes an important task.

We extend our earlier work on network processor systems [1] by introducing a revised model. The model is capable of modeling a system with multiple line cards, supporting a variety of parallel processing approaches, queueing disciplines and forwarding services. Afterwards, we study the queueing behavior and packet delays using both real-world and synthetically generated packet traces. Our study covers queueing behavior and packet delays of a best-effort IPv4 forwarding service and an IPv4 forwarding service supporting diffserv [2].

The rest of the paper is organized as follows. Section 2 overviews related work. Section 3 presents a model for network processor systems. Section 4 presents and characterizes the packet traces used in the simulations. Section 5 shows the experimental setup. Section 6 presents and analyzes the results. Finally, Section 7 concludes the paper.

2. RELATED WORK
There are a variety of studies investigating the performance of network processor systems in various aspects. These studies are based on analytical models, simulations, or real experiments.

An example of an analytical model is described in [3], where the design space of a network processor is explored. However, the model is based on a high level of abstraction, where the goal is to quickly identify interesting architectures, which may then be subject to a more detailed evaluation using simulation. Their final output is three candidate architectures, representing cost versus performance tradeoffs. The IETF ForCES (Forwarding and Control Element Separation) group has defined the ForCES forwarding element model [4]. The model provides a general management model for diverse forwarding elements, including network processors. The observation that current network processors are difficult to program has influenced the work on NetVM, a network virtual machine [5]. NetVM models the main components of a network processor, and aims at influencing the design of next-generation network processor architectures by giving a more unified programming interface.

Many studies on the Intel IXP 1200 network processor have been performed. Spalink et al. demonstrate how to build a software-based router using the IXP 1200 [6]. Their analysis partly focuses on queueing disciplines, including queue contention and port mapping. Lin et al. present an implementation and evaluation of diffserv over the IXP 1200 [7]. They show in detail the design and implementation of diffserv. The throughput of the flows is measured and the performance bottlenecks of the network processor are identified. For example, they found SRAM to be one of the major performance bottlenecks. Papaefstathiou et al. present how to manage queues in network processors [8]. The study is performed both on the IXP 1200 and on a reference prototype architecture. To summarize, the queueing studies performed on the IXP 1200 are focused on technical details, including where and how to queue packets. Still, there are no studies on dimensioning the queues in a network processor system based on real-world and synthetically generated traces.

Finally, in-router queueing behavior and packet delays are studied in a gateway router of the Sprint IP backbone network [9]. The statistics are used to derive a model of router delay performance that accurately predicts packet delays inside a router.

3. A MODEL FOR NETWORK PROCESSOR SYSTEMS
In this section, a model for a network processor system is presented. The major building blocks of the model are line cards using network processors, a switch fabric and a route processor. In other words, we have modeled a router using network processor based line cards. The model is based on a simpler model presented in an earlier work [1].

The basic building blocks of a network processor line card are processing blocks, engines, channels and queues. Such a line card can be logically separated into an ingress and an egress line card. Fig. 1 shows a network processor system that represents a router with four ingress and four egress line cards. We assume that there is only one port inside each line card, and packets arriving at this port are first processed at the ingress line card and are then transmitted through the switch fabric to the egress line cards. Based on the queueing discipline, the packets can be queued either at the ingress or at the egress line cards. Moreover, there is a route processor whose main task is to handle routing and management protocols.

In this model, there is no slow-path or terminating traffic. In other words, packets are only sent between line cards, and there is no traffic to or from the route processor.

Figure 1: Network processor system overview.

3.1 Network Processor Line Cards
Processing blocks
Processing blocks are abstractions of processing elements (PEs) in a line card. In a block, a program runs on the local processing unit and processes the received packets. A block may need to wait for external access to memory or an engine in order to complete, thus reducing the utilization. Using several threads increases the utilization by processing several packets simultaneously: while one thread is blocked, another may take over the execution.

Engines
Engines are special-purpose hardware available to a network processor that performs specific tasks. They are usually triggered by PEs. Examples are TCAM engines and checksum engines.

Channels
Processing blocks and engines are inter-connected by channels that represent potential paths for packet transfer.

Queues
There are several places in the system where queues are necessary. First, packets may arrive to an ingress line card at a higher rate than the service rate of the line card. Second, several ingress line cards may simultaneously transmit a large number of packets to the same egress line card. Third, the introduction of service differentiation may cause best-effort packets to be queued when higher priority traffic is present.

In order to make the processing of packets more efficient, a special meta-packet is created. This meta-packet includes the packet header, information about the packet and a pointer to the actual packet. While the actual packet resides in slower SDRAM, the meta-packet is stored in faster SRAM for faster access. This means that an SRAM operation needs to be performed when transferring a packet between processing blocks, while an SDRAM operation is needed to transmit the entire packet over the backplane.

All queues are formed from meta-packets and are FIFOs implementing either tail drop or random early discard (RED) policies.
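
The meta-packet and queue abstractions can be made concrete with a short sketch. The Java fragment below is our own illustration, not code from the authors' simulator; class and field names (MetaPacket, TailDropQueue, sdramAddress) are hypothetical. It shows a descriptor that would live in SRAM, referencing the full packet in SDRAM, and a FIFO that discards arriving meta-packets when full, i.e., tail drop.

```java
import java.util.ArrayDeque;

// Hypothetical meta-packet descriptor: header fields and bookkeeping are kept
// in fast SRAM, while the full packet body stays in SDRAM and is only referenced.
final class MetaPacket {
    final long sdramAddress;   // pointer to the complete packet in SDRAM
    final int lengthBytes;     // packet length
    final int egressCard;      // outgoing interface decided by the forwarder
    final boolean priority;    // diffserv class: priority or best effort
    final long arrivalTimeNs;  // can later be used to compute cross-router delay

    MetaPacket(long sdramAddress, int lengthBytes, int egressCard,
               boolean priority, long arrivalTimeNs) {
        this.sdramAddress = sdramAddress;
        this.lengthBytes = lengthBytes;
        this.egressCard = egressCard;
        this.priority = priority;
        this.arrivalTimeNs = arrivalTimeNs;
    }
}

// FIFO of meta-packets with a tail-drop policy: an arriving packet is
// discarded when the queue already holds 'capacity' packets.
final class TailDropQueue {
    private final ArrayDeque<MetaPacket> fifo = new ArrayDeque<>();
    private final int capacity;
    private long dropped = 0;

    TailDropQueue(int capacity) { this.capacity = capacity; }

    boolean offer(MetaPacket p) {
        if (fifo.size() >= capacity) { dropped++; return false; }  // tail drop
        fifo.addLast(p);
        return true;
    }

    MetaPacket poll()     { return fifo.pollFirst(); }
    int length()          { return fifo.size(); }
    long droppedPackets() { return dropped; }
}
```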

An example
Fig. 2 shows an example of an egress line card. In the example, the processing blocks are connected in a pipelined topology where each block performs one stage of packet processing. The scheduler selects packets from several ingress line cards. The source Ethernet modifier updates the source MAC address. A TCAM is accessed in the destination Ethernet modifier to look up the destination MAC address. Finally, the sink block sends out the packets through the link. Other processing block topologies have been studied in [1], including a pooled topology where all processing blocks perform the same functionalities to achieve parallelism. In this study, only the pipelined architecture is used.

Figure 2: Egress line card supporting IPv4.

3.2 Switch Fabric
A switch fabric is used to transmit packets from the ingress to the egress line cards. There are many switch fabric designs [10] [11] [12]. Since this is not the main focus of this work, we assume that the fabric has an adequate speedup and is non-blocking, i.e., whenever an egress line card is ready to process packets, packets can be fetched from the ingress line cards immediately.

3.3 Queueing Discipline
First of all, there is a queue in front of each ingress line card, which we name the input queue. The input queue is needed since packets may arrive in bursts that the card is not able to process immediately.

The queueing disciplines normally used between ingress and egress line cards include input queueing, output queueing and virtual output queueing [13].

In input queueing, the queues are placed between the ingress line cards and the switch fabric. Input queueing is not efficient due to head-of-line blocking: if the packet at the head of a queue is blocked waiting for access to a particular egress line card, packets in the queue destined to other egress line cards are blocked as well, even if the other egress line cards are ready to receive them.

In output queueing, the queues are placed between the switch fabric and the egress line cards. It allows packets from several ingress line cards to be transmitted to a single egress line card simultaneously. The main challenge in output queueing centers on the capacity of the switch fabric: it has to allow multiple packets to be transmitted to an egress line card simultaneously, instead of one at a time as in the input queueing case.

As an alternative, a system can use virtual output queueing, with the queues placed between the ingress line cards and the switch fabric. Unlike input queueing, there is a separate queue for each egress line card. This solves the head-of-line blocking problem and does not require high speedup in the switch fabric. In addition to using a virtual output queue (VOQ) for each egress line card, a VOQ for each traffic class can be used to provide service differentiation.

In the modeling of the network processor system, virtual output queueing is the only discipline used.
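
A minimal sketch of such a VOQ bank is given below; it is illustrative only, with hypothetical names, and simply keeps one FIFO per (egress line card, traffic class) pair on an ingress card, which is the structure assumed in the rest of the paper.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical VOQ bank for one ingress line card: one FIFO per
// (egress line card, traffic class) pair.
final class VirtualOutputQueues<P> {
    private final int classes;                 // e.g. 0 = priority, 1 = best effort
    private final List<ArrayDeque<P>> queues;  // index = egressCard * classes + class

    VirtualOutputQueues(int egressCards, int classes) {
        this.classes = classes;
        this.queues = new ArrayList<>(egressCards * classes);
        for (int i = 0; i < egressCards * classes; i++) {
            queues.add(new ArrayDeque<>());
        }
    }

    // After ingress processing, a packet is placed in the VOQ selected by the
    // forwarding decision (egress card) and its traffic class.
    void enqueue(P packet, int egressCard, int trafficClass) {
        queues.get(egressCard * classes + trafficClass).addLast(packet);
    }

    // The switch fabric fetches from the VOQ of a given egress card and class;
    // returns null when that VOQ is empty.
    P dequeue(int egressCard, int trafficClass) {
        return queues.get(egressCard * classes + trafficClass).pollFirst();
    }

    int length(int egressCard, int trafficClass) {
        return queues.get(egressCard * classes + trafficClass).size();
    }
}
```
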
4. PACKET TRACES
Packet traces are provided as input to the simulated router. The traces contain a sequence of packets, each packet containing an interarrival time, an outgoing interface and a traffic class. In our experiments, packet arrival processes and outgoing-interface selection methods are based both on a real-world trace from the Finnish University Network (FUNET) and on synthetically generated traces. The traffic class information is not available from the traces, thus best-effort and priority packets are selected randomly with certain percentages.

4.1 Arrival Processes
The first arrival process is based on the FUNET trace, which contains more than 20 million packets. It was captured on one of the interfaces of FUNET's Helsinki core router. The router is located at the Helsinki University of Technology and carries the FUNET international traffic. The interface where the trace was recorded is connected to the FICIX2 exchange point, which is a part of the Finnish Communication and Internet Exchange (FICIX), which peers with other operators in Finland. In addition to universities, some student dormitories are connected through FUNET.

The second arrival process is a Poisson process. The third arrival process assumes that packets arrive back-to-back on a fully-utilized link. The interarrival times are therefore based on the packet size distribution measured at NLANR [14]. The interarrival times are independent since the packet sizes are uncorrelated, as shown in [15].

Fig. 3 presents the interarrival-time distributions for the three traces. The average interarrival times are all set to 746 ns, which is the average measured in the FUNET trace. As can be seen in the figure, the FUNET trace and the fully-utilized link trace have more packets with both large and small interarrival times than the trace based on a Poisson process, resulting in higher coefficients of variation in these two traces. The coefficients of variation are 1.21, 1.35 and 1 respectively for the FUNET trace, the fully-utilized link trace, and the trace based on a Poisson arrival process.

Figure 3: Packet inter-arrival time.

The autocorrelation of the FUNET interarrival times is not particularly large, as shown in Table 1. Interarrival times in the other two traces are uncorrelated.

Table 1: Autocorrelation of the FUNET interarrival times.
Lag              1     2     3     5     10    100
Autocorrelation  0.12  0.06  0.04  0.03  0.02  0.01
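
The lag-k autocorrelations reported in Table 1 (and later, for the outgoing interfaces, in Fig. 6) can be estimated with the usual sample autocorrelation. The sketch below is our own illustration, not the authors' analysis code, and uses the standard estimator r_k = sum_t (x_t - mean)(x_{t+k} - mean) / sum_t (x_t - mean)^2.

```java
// Sample autocorrelation at lag k of a series x (e.g. packet interarrival
// times, or 0/1 indicators of "destined to interface i" as used for Fig. 6).
final class Autocorrelation {
    static double atLag(double[] x, int k) {
        int n = x.length;
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= n;

        double num = 0.0, den = 0.0;
        for (int t = 0; t < n; t++) {
            den += (x[t] - mean) * (x[t] - mean);
            if (t + k < n) num += (x[t] - mean) * (x[t + k] - mean);
        }
        return num / den;
    }

    public static void main(String[] args) {
        double[] interarrivalNs = {500, 700, 650, 900, 746, 800, 620, 710};  // dummy data
        for (int lag : new int[] {1, 2, 3}) {
            System.out.printf("lag %d: %.3f%n", lag, atLag(interarrivalNs, lag));
        }
    }
}
```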

4.2 Modeling the Arrival Processes
We have modeled the arrival processes of the FUNET trace and the fully-utilized link trace. With these models, mathematical analysis of the queueing behavior could be made. Furthermore, router and switch performance studies become possible without the presence of real-world traces.

Fig. 4 shows how the average packet interarrival time varies over time for the FUNET trace. The averages are taken over every 100 packets. The average interarrival time of 100 packets varies between 400 ns and 1400 ns. The arrival process appears to be periodic, with periods of high and low arrival intensities.

Figure 4: Average of 100 interarrival times over time.

The arrival process can be modeled using a Markov Modulated Poisson Process (MMPP), as shown in Fig. 5. In the model, there are two states, S1 and S2. While in state Sk, the arrivals occur according to a Poisson process with an average rate λk. λ1 and λ2 are set to 1.1λ and 0.9λ, respectively, where λ is the total average arrival rate. The state transitions are based on a Poisson process, with average transition rates α12 and α21 set to 446 s−1. The rates are set based on the periodicity observed in Fig. 4.

Figure 5: The two-state MMPP process.

As shown in Fig. 3, the interarrival-time distribution of the fully-utilized link trace has three peaks, which correspond to the peak packet sizes of approximately 48, 600, and 1500 bytes. The arrival process can be modeled using an H3 hyper-exponential arrival process with parameters P1 = 0.61, P2 = 0.29, P3 = 0.1, λ1 = 1.6λ, λ2 = 0.8λ and λ3 = 0.4λ, where λ is the total average arrival rate. This arrival process is thus composed of three independent Poisson processes, each with an average arrival rate of λx and a certain probability Px of being chosen.
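
The two arrival models can be realized with a few lines of code; the sketch below is our own illustration under the parameters stated above (λ1 = 1.1λ, λ2 = 0.9λ, α12 = α21 = 446 s−1 for the MMPP, and the three-branch hyper-exponential for the fully-utilized link), not the authors' simulator code. Exponential variates are drawn by inversion, and the MMPP uses the memorylessness of the exponential to redraw the interarrival time when a state switch occurs first.

```java
import java.util.Random;

// Sketch of the two arrival models of Section 4.2 (illustrative only).
final class ArrivalModels {
    private static final Random RNG = new Random(1);

    // Exponential variate with the given rate (events per second), by inversion.
    static double exp(double rate) {
        return -Math.log(1.0 - RNG.nextDouble()) / rate;
    }

    // Two-state MMPP: while in state k, interarrivals are exponential with
    // rate[k]; the state holding time is exponential with switchRate[k].
    static double[] mmppInterarrivals(int n, double lambda, double alpha12, double alpha21) {
        double[] rate = {1.1 * lambda, 0.9 * lambda};
        double[] switchRate = {alpha12, alpha21};
        double[] out = new double[n];
        int state = 0;
        double t = 0.0, lastArrival = 0.0;
        double nextSwitch = exp(switchRate[state]);
        for (int i = 0; i < n; i++) {
            while (true) {
                double candidate = t + exp(rate[state]);
                if (candidate <= nextSwitch) { t = candidate; break; }
                // No arrival before the state change: jump to the switch point,
                // change state and redraw (exact, since exponentials are memoryless).
                t = nextSwitch;
                state = 1 - state;
                nextSwitch = t + exp(switchRate[state]);
            }
            out[i] = t - lastArrival;
            lastArrival = t;
        }
        return out;
    }

    // H3 hyper-exponential: pick one of three exponential branches with
    // probabilities {0.61, 0.29, 0.10} and rates {1.6, 0.8, 0.4} * lambda.
    static double h3Interarrival(double lambda) {
        double[] p = {0.61, 0.29, 0.10};
        double[] r = {1.6 * lambda, 0.8 * lambda, 0.4 * lambda};
        double u = RNG.nextDouble();
        int branch = (u < p[0]) ? 0 : (u < p[0] + p[1]) ? 1 : 2;
        return exp(r[branch]);
    }

    public static void main(String[] args) {
        double lambda = 1.0 / 746e-9;   // packets per second (746 ns mean interarrival)
        double[] gaps = mmppInterarrivals(100_000, lambda, 446.0, 446.0);
        double sum = 0.0;
        for (double g : gaps) sum += g;
        // Should come out close to the 746 ns average of the FUNET trace.
        System.out.printf("MMPP mean interarrival: %.0f ns%n", 1e9 * sum / gaps.length);
    }
}
```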

4.3 Outgoing-Interface Selection Methods
The selection of outgoing interfaces is based either on a uniform distribution or on the FUNET trace. The FUNET trace does not contain outgoing-interface information explicitly; it only contains destination IP addresses. We therefore extended the trace with interface information from a routing table of the Helsinki core router. This routing table is available at CSC's FUNET looking glass page [16]. A lookup in the routing table was made for each destination IP address in the trace.

Table 2 shows the percentage of packets transmitted to the four most frequently used outgoing interfaces. The Helsinki core router has 11 interfaces in total; however, a very small amount of traffic (< 1%) is transmitted to some of the interfaces. Thus only the four most frequently used outgoing interfaces are shown. The FUNET trace shows asymmetry with respect to outgoing interfaces. About 57% of the traffic is transmitted on the first interface, while only 1.0% of the traffic is transmitted on the fourth.

Table 2: Packet outgoing interfaces.
                      Percentage of packets to interface
                      1      2      3      4
Uniform distribution  25.0%  25.0%  25.0%  25.0%
FUNET trace           57.3%  6.0%   35.2%  1.0%

Since the selection of outgoing interfaces can be correlated, we have studied the autocorrelation function of packets destined to different interfaces. Fig. 6 shows the autocorrelation function. As can be seen in the figure, correlations are significant and decay slowly. This indicates that the selection of outgoing interfaces is highly correlated and shows long-range dependence.

Figure 6: Autocorrelation function of the packet outgoing interfaces.

5. EXPERIMENTAL SETUP
Based on the model for the network processor system, we constructed a simulation environment using the Java programming language. Thereafter, a best-effort IPv4 forwarding service and an IPv4 forwarding service supporting diffserv were implemented. Additionally, a series of experiments to study the queueing behavior and packet delays was performed.

5.1 Parameters and Configuration
Table 3 shows the settings of the simulation parameters. These parameters do not exactly match the parameters of a specific network processor system, but we claim that they are realistic enough to be considered a valid system.

Table 3: Network Processor Parameters and Configuration.
Name                           Value
PE Clock Frequency             250 MHz
PE Instruction Execution Time  4 ns / instruction
Number of PE Instructions      25 instructions
SRAM Access Latency            10 ns
SDRAM Access Latency           60 ns
Engine Access Latency          100 ns
Number of PE Threads           4 threads

The PE clock frequency is 250 MHz. A PE instruction will therefore take 4 ns to execute. 25 instructions are executed for each packet inside a PE. This number is loosely based on the IXP 1200 programming experience. Thus, for each packet, 100 ns of instruction execution time is required. The SRAM, SDRAM and engine access latencies are 10, 60 and 100 ns respectively. Finally, there are four threads inside a PE to hide engine access latencies, which can keep the PE fully utilized as shown in our earlier work [1]. Since the PE is fully utilized and the instruction execution time of individual packets is constant, the service rate of a PE is nearly constant even though the processing delay may vary. A pipeline of PEs is used in a line card, thus the service rate of a line card is almost constant as well.
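
For reference, Table 3 can be restated as code together with the per-packet arithmetic used above (25 instructions × 4 ns = 100 ns of execution). The class name and the helper method are our own illustration; the effective per-packet service time of roughly 105 ns used in the example is not stated in the paper but is implied by the 111 ns interarrival time at 95% load quoted in Section 5.2.

```java
// Table 3 restated as constants; the helper method and the 105 ns effective
// service time below are our own illustrative assumptions.
final class NpuConfig {
    static final double PE_CLOCK_HZ          = 250e6;   // 250 MHz
    static final double NS_PER_INSTRUCTION   = 4.0;     // 1 / 250 MHz
    static final int    INSTRUCTIONS_PER_PKT = 25;
    static final double SRAM_LATENCY_NS      = 10.0;
    static final double SDRAM_LATENCY_NS     = 60.0;
    static final double ENGINE_LATENCY_NS    = 100.0;
    static final int    PE_THREADS           = 4;

    // Pure instruction execution per packet: 25 * 4 ns = 100 ns.
    static double instructionTimeNs() {
        return INSTRUCTIONS_PER_PKT * NS_PER_INSTRUCTION;
    }

    // Mean interarrival time needed to load the card at a target utilization,
    // given an effective per-packet service time.
    static double interarrivalForUtilizationNs(double serviceTimeNs, double utilization) {
        return serviceTimeNs / utilization;
    }

    public static void main(String[] args) {
        System.out.println("Instruction time per packet: " + instructionTimeNs() + " ns");
        // With an effective service time of about 105 ns, a 95% load corresponds
        // to a mean interarrival time of roughly 111 ns, matching Section 5.2.
        System.out.printf("95%% load interarrival: %.0f ns%n",
                interarrivalForUtilizationNs(105.0, 0.95));
    }
}
```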
5.2 Best-Effort Single Ingress Line Card
We first performed experiments on a single ingress line card with best-effort IPv4 forwarding. The logical layout of the line card is shown in Fig. 7. In the line card, the processing blocks are arranged in a pipeline. Packets are first queued at an input queue and are later processed by the processing blocks. The ingress line card performs layer two classification, layer three protocol classification, IPv4 header and checksum verification, IPv4 forwarding and IPv4 header modification.

Figure 7: Ingress line card supporting best-effort IPv4 forwarding.

In our earlier work, we studied the saturation throughput, processing block utilization and processing delay of such a setup [1]. We will now study the queueing behavior and packet delays with the real-world and synthetically generated traces.

The input queue on the ingress card was configured with an unlimited queue size, and packets were then generated according to the three arrival processes. The average arrival rates of the traces were modified to correspond to 95%, 90%, 80% and 50% of the line card's processing capacity respectively. This resulted in modified average interarrival times of 111, 116, 131 and 211 ns respectively.

Using the modified traces, the average and the 99th percentile queue lengths were measured. These measurements were used to set the input queue sizes to realistic and appropriate values. Using these settings of the input queue sizes, the average queue length and the percentage of dropped packets were studied. The input queues were configured as FIFOs implementing a tail-drop policy.

5.3 Single Ingress Line Card Supporting Diffserv
Fig. 8 shows an ingress line card supporting IPv4 and diffserv. Input traffic is first sorted by the diffserv classifier into two classes: priority and best-effort traffic. We assumed the classifier is fast enough to process packets at line rate, so no queue is needed in front of the classifier.

Figure 8: Ingress line card supporting diffserv.

The two traffic classes are then queued separately in the input queues and a scheduler performs priority scheduling. Packets are then processed in the IPv4 ingress pipeline. After ingress processing, the packets are queued in a VOQ depending on outgoing interface and class. For each ingress line card, there is a queue for each egress line card per traffic class. Packets in the VOQs are later transmitted to the egress line cards through the switch fabric. In the experiments, the arrival rate and the percentage of priority traffic were varied in order to study the resulting queueing behavior of the input queues. Since we are only modeling a single ingress line card, the queueing behavior of the VOQs cannot be studied in this setup.
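
The priority scheduling between the two input queues can be sketched as follows. This is our own illustration with hypothetical names, not the authors' implementation: the best-effort queue is served only when the priority queue is empty, which is the strict-priority behaviour assumed throughout the diffserv experiments.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Strict-priority selection between the two diffserv input queues of an
// ingress line card: best effort is served only when no priority packet waits.
final class StrictPriorityScheduler<P> {
    private final Queue<P> priority = new ArrayDeque<>();
    private final Queue<P> bestEffort = new ArrayDeque<>();

    void enqueue(P packet, boolean isPriority) {
        (isPriority ? priority : bestEffort).add(packet);
    }

    // Called by the ingress pipeline whenever a processing element becomes free.
    P next() {
        P p = priority.poll();
        return (p != null) ? p : bestEffort.poll();
    }

    int priorityBacklog()   { return priority.size(); }
    int bestEffortBacklog() { return bestEffort.size(); }
}
```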

5.4 A Network Processor System
After studying the queueing behavior of a single ingress line card, a complete network processor system supporting IPv4 and diffserv was set up. The system consisted of four ingress and four egress line cards, as shown in Fig. 1. The layouts of the ingress and egress line cards are shown in Figs. 8 and 2, respectively. Disjoint parts of the FUNET trace are used as input data for the four ingress line cards.

The queueing discipline between ingress and egress line cards is strict priority scheduling. This means that as long as there are non-empty priority queues, priority packets are always selected first. Selection of packets at the same priority level from the ingress line cards is done in a round robin fashion. The input queue sizes are set to 300 and 100 packets for the best-effort and priority input queues respectively. Following the same method as described in Section 5.2, appropriate VOQ sizes are determined.
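
A sketch of this discipline, strict priority between classes combined with round robin among the ingress line cards within a class, is given below. It is illustrative only, uses hypothetical names, and assumes the non-blocking fabric lets an egress card pull the head of the VOQ destined to it on each ingress card.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Egress-side selection for one egress line card: priority VOQs are always
// served before best-effort VOQs, and within a class the ingress cards are
// visited in round-robin order.
final class EgressSelector<P> {
    private final Deque<P>[] priorityVoq;    // one VOQ per ingress line card
    private final Deque<P>[] bestEffortVoq;
    private int rrPriority = 0;              // round-robin pointers
    private int rrBestEffort = 0;

    @SuppressWarnings("unchecked")
    EgressSelector(int ingressCards) {
        priorityVoq = new Deque[ingressCards];
        bestEffortVoq = new Deque[ingressCards];
        for (int i = 0; i < ingressCards; i++) {
            priorityVoq[i] = new ArrayDeque<>();
            bestEffortVoq[i] = new ArrayDeque<>();
        }
    }

    void enqueue(P packet, int ingressCard, boolean isPriority) {
        (isPriority ? priorityVoq : bestEffortVoq)[ingressCard].addLast(packet);
    }

    // Strict priority between classes; round robin among ingress cards.
    P next() {
        P p = pollRoundRobin(priorityVoq, true);
        return (p != null) ? p : pollRoundRobin(bestEffortVoq, false);
    }

    private P pollRoundRobin(Deque<P>[] voqs, boolean priorityClass) {
        int n = voqs.length;
        int start = priorityClass ? rrPriority : rrBestEffort;
        for (int k = 0; k < n; k++) {
            int i = (start + k) % n;
            P p = voqs[i].pollFirst();
            if (p != null) {
                if (priorityClass) rrPriority = (i + 1) % n; else rrBestEffort = (i + 1) % n;
                return p;
            }
        }
        return null;
    }
}
```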

Finally, we performed experiments to study the average queue length and percentage of dropped packets in the VOQs.

5.5 Packet Delays
In addition to the queueing behavior experiments, the cross-router delay experienced by packets in the network processor system was measured.

The cross-router delay is interesting to study since it contains both processing delay and queueing delay. The processing delay is the delay a packet experiences in processing blocks. This includes the time for instruction execution and the time for engine and memory access. The queueing delay corresponds to the sum of the time a packet spends in the input queue and in the VOQ. In the input queue, the delay is nearly proportional to the current queue length. However, if the packet is a best-effort packet, it also depends on the queue length and future packet arrivals at the priority input queue. The waiting time for a packet in the VOQ is even more complex to calculate. First, it depends on the length of the current VOQ. Second, it depends on the queue length and future packet arrivals at all other priority VOQs. Third, if it is a best-effort VOQ, it depends on the queue length and future packet arrivals at the priority VOQ in the same line card and at the best-effort VOQs in other line cards.

6. RESULTS AND ANALYSIS
In this section, we present and discuss the results obtained from the experiments.

6.1 Best-Effort Single Ingress Line Card
In Fig. 9, the average and the 99th percentile input queue lengths are shown. The FUNET trace has much longer queue lengths than the other two synthetically generated traces. The average and the 99th percentile queue lengths are 190 and 650 packets, respectively, when the average arrival rate is 95% of the line card's processing capacity. For the synthetically generated traces, the average and the 99th percentile queue lengths are 20 and 80 packets respectively for the fully-utilized link arrival, and 9 and 41 packets respectively for the Poisson arrival process.

Figure 9: Queue length measurement of a single ingress line card.

Next, we set the input queue sizes to 100, 200, 300, and 400 packets respectively. Fig. 10 shows the average queue lengths for the FUNET trace with 95% and 90% arrival rates. Fig. 11 shows the percentages of dropped packets for the FUNET trace. For the other traces, the queue lengths hardly reach 100 packets, and there are no packet drops. As can be seen from the figures, the percentage of dropped packets decreases as the queue size increases.

Figure 10: Average queue length for the FUNET trace with limited queue size.
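
The averages and 99th percentiles reported here can be computed from per-packet samples of the queue length, for instance the queue length seen by each arriving packet. The sketch below is our own illustration of such a measurement; it is not a statement about how the authors' simulator collects its statistics.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Collects queue-length samples and reports the average and a percentile.
final class QueueLengthStats {
    private final List<Integer> samples = new ArrayList<>();

    void record(int queueLength) { samples.add(queueLength); }

    double average() {
        double sum = 0;
        for (int s : samples) sum += s;
        return samples.isEmpty() ? 0 : sum / samples.size();
    }

    int percentile(double p) {                 // p in (0, 1], e.g. 0.99
        List<Integer> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int index = (int) Math.ceil(p * sorted.size()) - 1;
        return sorted.get(Math.max(0, index));
    }

    public static void main(String[] args) {
        QueueLengthStats stats = new QueueLengthStats();
        for (int i = 0; i < 1000; i++) stats.record(i % 200);   // dummy samples
        System.out.println("avg = " + stats.average()
                + ", 99th = " + stats.percentile(0.99));
    }
}
```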

Due to the burstiness of the FUNET trace, its queueing behaviour stands out from the other traces. This suggests that queue sizes should be dimensioned based on studies of real-world traces.

An advantage of large queue sizes is that they reduce packet drops. However, larger queue sizes may result in longer packet delays. Furthermore, having a large queue requires more memory and has an impact on cost and complexity. Based on these observations, a queue size of 300 packets seems appropriate for the FUNET trace. The drop percentage is below 1% at the high arrival rate of 95%. When the arrival rate decreases to 90%, the drop percentage decreases below 0.2%. At the rate of 80%, packet drops do not occur.

Figure 11: Percentages of dropped packets for the FUNET trace with limited queue size.

The modeling of arrival processes and the analysis of the line card service rate suggest that the system can be modeled as an MMPP/D/1, M/D/1 or H3/D/1 queue, depending on whether the arrival process is based on the FUNET, Poisson or fully-utilized link trace. Our simulations show that the MMPP/D/1 and H3/D/1 queueing models provide similar queueing behaviors as the FUNET trace and the fully-utilized link trace. Analytical results for the MMPP/G/1 queue are also available [17]. Thus, these two are proper models for packet arrivals. MMPP models the arrival process well on a link that is not fully utilized. When the link is fully utilized, the arrival process is more similar to a hyper-exponential model with three states.
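
As a reference point for the Poisson case, the M/D/1 queue has a closed form; this standard result is not derived in the paper but helps to interpret the simulated numbers. With utilization ρ = λ/µ < 1,

```latex
\bar{W}_q = \frac{\rho}{2\mu(1-\rho)}, \qquad
\bar{L}_q = \frac{\rho^2}{2(1-\rho)} .
```

At ρ = 0.95 this gives an expected queue length of about 0.95²/(2 · 0.05) ≈ 9 packets, which is consistent with the average queue length of 9 packets reported above for the Poisson arrival process.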

We observed from the simulation output that the departure process from an ingress line card is nearly periodic when there are packets in the input queue. The reason is that the line card then always has packets to process and the service rate is nearly constant. However, when there are no packets in the input queue, the departure process is similar to the arrival process.

6.2 Single Ingress Line Card Supporting Diffserv
In the diffserv experiments, the queue sizes are set to 300 packets for the best-effort queue and 100 packets for the priority queue. The arrival process is based on the FUNET trace and the average arrival rates are set to 95% and 90% of the line card's processing capacity. The percentages of high priority traffic are set to 20%, 50% and 80% respectively.

Fig. 12 shows the average queue lengths and Fig. 13 shows the percentages of dropped packets. As can be seen in the figures, average queue lengths for priority traffic are much shorter than average queue lengths for best-effort traffic. Moreover, there is no priority packet drop in any setup, even though the percentage of priority traffic is set to as high as 80%.

Figure 12: Average queue length for diffserv traffic, FUNET trace.

For the best-effort queue, the average queue length does not depend much on the percentage of priority traffic; however, a lower arrival rate reduces the average queue length significantly. Moreover, the percentage of dropped best-effort packets increases as the percentage of priority traffic increases. We consider the best-effort queue size of 300 packets a good choice, since 1.3% of the packets are dropped at a high arrival rate of 95% with 80% priority packets, which corresponds to 1% packet drop when considering all packets. With a 90% arrival rate, the drop percentages in the best-effort queues are at acceptable levels below 1%.

Figure 13: Percentages of dropped best-effort packets, FUNET trace.
Table 4: Network processor system setup.
Setup  Arrival process  Interface selection method  Arrival rate of line card
                                                     1    2    3    4
1      FUNET            FUNET                        95%  30%  10%  3%
2      F.U. link        FUNET                        95%  30%  10%  3%
3      Poisson          FUNET                        95%  30%  10%  3%
4      FUNET            Uniform distribution         95%  95%  95%  95%
5      F.U. link        Uniform distribution         95%  95%  95%  95%
6      Poisson          Uniform distribution         95%  95%  95%  95%

Table 5: Length of virtual output queues, 0% priority traffic.
       Best-effort queue
Setup  VOQxy  Average queue length  99th percentile queue length
1      VOQ11  139                   611
2      VOQ11  126                   615
3      VOQ11  102                   536
4      VOQ11  6.7                   31
5      VOQ11  6.1                   21
6      VOQ11  5.1                   19

6.3 A Network Processor System
There are four ingress line cards and four egress line cards in the network processor system. Thus, with two priority levels, there are eight VOQs in each ingress line card and 32 VOQs in total. The exact setups for the experiments are shown in Table 4. In each setup, one arrival process and one interface selection method is used. The percentage of high priority traffic is set to 0% and 20% respectively. With 0% priority traffic, the system is similar to the one supporting the best-effort IPv4 forwarding service.

The arrival process to an egress line card can be considered as a multiplexing of the arrival processes to the VOQs in the four ingress line cards. These, in turn, depend on the departure process from the ingress line cards and the outgoing-interface selection method.

As it is impractical to show the queueing behavior for all 32 VOQs, Table 5 shows the queueing behavior for one interesting VOQ. In the table, VOQxy refers to the VOQ between ingress line card x and egress line card y. Queue lengths for setups 1, 2 and 3 are significantly longer than queue lengths for setups 4, 5 and 6. The large difference in queue lengths is caused by the difference in interface selection methods used in the setups.

The FUNET method results in large queue lengths, while the uniform distribution results in fairly short lengths. It can also be observed that the queue lengths vary only slightly depending on the arrival process. The queue length is highest for the FUNET arrival process and shortest for the Poisson arrival. We conclude that the choice of interface selection method has a large impact on the VOQ length, while the impact of the arrival process is smaller.

Table 6: Length of virtual output queues, 20% priority traffic.
               Priority queue               Best-effort queue
Setup  VOQxy   Average  99th percentile     Average  99th percentile
1      VOQ11   0.07     2                   139      610
1      VOQ21   0.02     1                   0.5      9
1      VOQ14   0.01     1                   0.08     2
2      VOQ11   0.08     2                   126      614
2      VOQ21   0.02     1                   0.7      11
2      VOQ14   0.001    0                   0.2      2
3      VOQ11   0.07     2                   94       495
3      VOQ21   0.02     1                   0.4      5
3      VOQ14   0.001    0                   0.01     1

Table 6 shows the queueing behavior for several selected VOQs with the FUNET selection method. The lengths of the priority VOQs are generally short. For the best-effort VOQs, the length varies between VOQs. For example, VOQ11 normally has a large queue length. The main causes are: (1) ingress line card 1 has a high arrival rate of 95%; and (2) 57.3% of the packets are transmitted to egress line card 1. For VOQ21, the queue lengths are shorter even though 57.3% of the packets are transmitted to egress line card 1. This is caused by the low arrival rate of 30% at ingress line card 2 and the round robin scheduling in the switch fabric. For VOQ14, the queue lengths are very short, since only 1% of the traffic is transmitted to egress line card 4.

Based on the initial results from Table 6, we set the VOQ size based on two approaches. The first approach is to set a size for each VOQ. The second approach is to use a common buffer for all VOQs in a line card. In this way, the size of the common buffer is set and the four VOQs share this common buffer.

Table 7 shows the average queue lengths and percentages of dropped packets for VOQ11 with varying queue and buffer sizes. It can be seen that a common buffer of 400 packets provides better queueing performance than a separate VOQ size of 300 packets, even though four VOQs of 300 packets require a buffer space of 1200 packets. Furthermore, packets are never dropped with a common buffer size of 800 packets.

Table 7: Average queue length and percentage of dropped packets of best-effort VOQ11, 20% priority traffic.
                               Queue size                  Buffer size
                               100    200    300    500    400    800
Average queue length           25     46     69     108    89     139
Percentage of dropped packets  5.5%   4.6%   2.9%   1.3%   2.7%   0%

We observe that the percentages of dropped packets are quite high even though the average queue lengths are short compared to the queue size. In most of the cases, the average queue length is approximately 25% of the queue size. The high drop percentage is most likely caused by the high correlation and long-range dependence of the traffic: during a short time interval, all packets may belong to a single flow and are transmitted to a single egress line card. This behavior requires a large queue size to prevent high packet loss. However, as the table shows, using a common buffer for all VOQs alleviates this problem and significantly reduces the packet drop rates. It also makes it possible to use less buffer space. In particular, with a common buffer of 800 packets, no packet drop occurs.

1 1 2.0 µs 30.2 µs 30% 1 2 1.9 µs 16.1 µs 1 3 2.0 µs 9.7 µs 25% 1 4 2.0 µs 17.3 µs 6 1 2.0 µs 5.0 µs 20%

15% Percentage of packets packet drop rates. It also makes it possible to use less buffer 10% space. In particular, with a common buffer of 800 packets, no packet drop occurs. 5%

0 0 1 2 3 4 5 6 7 8 6.4 Packet Delays packet delay in µs Table 8 shows the average delays experienced by the packets with 20% priority traffic, where the common buffer size is set to 800 packets. Each row of the table shows the average Figure 15: Delay distribution for priority packets. delay for priority and best-effort packets for a given egress line card in a setup. formance variation, we consider this acceptable since the As can be seen in the table, the delays experienced by the setup is at a high arrival rate of 95% and priority packets priority packets are much shorter than the delays for best- receive nearly guaranteed performance. effort packets. In the delay experienced by a packet, there is a processing delay of approximately 1.6 µs for each packet, We have compared the cross-router delay from our simu- and the remaining delay is caused by queueing. lations with the delay achieved through measurement on a gateway router of the Sprint IP backbone network [9]. Fig. 14 and 15 show the delay distribution for best effort The average delays according to Table 8 are between 9.7 µs and priority packets arriving at egress line card 1 in setup and 30.2 µs for best-effort packets using the FUNET trace. 1. On the x-axis, the packet delay is shown in µs. On the y- While reported in [9], the minimum and average delays are axis, the percentage of packets belonging to a certain delay around 20 µs and 100 µs respectively, and some packets ex- interval of 0.1 µs is shown. Table 9 shows the packet delay perience more than 10 ms of delay. The authors claim that at several percentile levels. Even though the average delay these long delays are caused by the slow-path packet pro- is 30.2 µs for best effort packets, 50% of packets have delays cessing, which has a large impact on the average delay. In less than 12 µs and 10% of packets have delays larger than our study, we do not consider slow-path packets. Besides, 98 µs. The average delay for priority packets is 2.0 µs as shown in Table 8, and most of the delays are around 2.0 µs. Although some delays are up to 8 µs, only 10% of packets have delay larger than 2.5 µs. Table 9: Packet delays in µs at percentile levels. 10th 25th 50th 75th 90th By comparing the delay distribution, we can see that the Best- 1.8 µs 2.2 µs 12 µs 50 µs 98 µs system provides low delays with small delay jitters for pri- effort ority packets. For best effort packets, delays are larger and Packets the delay jitters are larger as well. As a result, the delay Priority 1.7 µs 1.8 µs 2.0 µs 2.1 µs 2.5 µs performance varies from packet to packet. Despite the per- packets the network processor system simulated by us and the router High-Performance Computer Architecture, Cambridge, of the Sprint IP backbone network have completely different MA, Feb. 2002. architectures, in particular the queue sizes, which probably have a large impact on the results. [4] L. Yang et al., ”ForCES Forwarding Element Model”, Internet Draft, March 2006.

7. CONCLUSIONS
In this work, we have developed a model for network processor systems. Based on the model, a simulation environment of a router was constructed using the Java programming language. The simulation environment was then used to implement two forwarding services: best-effort and diffserv. The queueing behavior and packet delays of these two forwarding services were studied using real-world and synthetically generated traces. The measurement results include queueing behaviour, packet drops and cross-router packet delays.

Our initial results regarding queueing behavior were used to dimension the lengths of the queues that were later used in the simulation. Based on the requirements on cross-router delay, percentage of packet loss and system complexity, various queue sizes were dimensioned accordingly.

By studying the traces, the packet arrivals were modeled using an MMPP and an H3 hyper-exponential process. Our simulation results show that these models provide similar queueing behaviors as the traces. Therefore, they are proper models for packet arrivals.

The study also shows that it is possible to support diffserv in a network processor system. In particular, the diffserv setup provides a good service differentiation between best-effort and priority traffic. While best-effort packets experience high loss probability, delay and delay jitter, priority packets are transmitted with a very low delay and small delay jitter.

Moreover, the study reveals that real-world traffic is not only bursty, but packets are also clumped together when transmitted to outgoing interfaces. As the results from the real-world and synthetically generated traces are compared, large differences in queueing behavior and packet delays can be observed. Thus, studies on router and switch architectures should not assume that packets are transmitted uniformly to all outgoing interfaces. Even hot-spot modelling of traffic, where a large percentage of packets is transmitted to a single outgoing interface, does not capture the temporal behavior of real-world traces. We conclude that the choice of traces has a large influence on the results when evaluating router and switch architectures.

8. REFERENCES
[1] J. Fu and O. Hagsand, "Designing and Evaluating Network Processor Applications", in Proc. of 2005 IEEE Workshop on High Performance Switching and Routing (HPSR), pp. 142-146, Hong Kong, May 2005.
[2] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, "An Architecture for Differentiated Services", RFC 2475, IETF, Dec. 1998.
[3] L. Thiele, S. Chakraborty, and M. Gries, "Design Space Exploration of Network Processor Architectures", in Proc. of 1st Workshop on Network Processors, held in conjunction with the 8th International Symposium on High-Performance Computer Architecture, Cambridge, MA, Feb. 2002.
[4] L. Yang et al., "ForCES Forwarding Element Model", Internet Draft, March 2006.
[5] L. Degioanni, M. Baldi, D. Buffa, F. Risso, F. Stirano, and G. Varenni, "Network Virtual Machine (NetVM): A New Architecture for Efficient and Portable Packet Processing Applications", in Proc. of 8th International Conference on Telecommunications (ConTEL 2005), pp. 153-168, Zagreb, Croatia, June 2005.
[6] T. Spalink, S. Karlin, L. Peterson, and Y. Gottlieb, "Building a Robust Software-based Router using Network Processors", in Proc. of the 18th ACM Symposium on Operating Systems Principles (SOSP), pp. 216-229, Banff, Alberta, Canada, Oct. 2001.
[7] Ying-Dar Lin, Yi-Neng Lin, S. Yang, and Yu-Sheng Lin, "Diffserv over Network Processors: Implementation and Evaluation", in Proc. of 10th Symposium on High Performance Interconnects, pp. 121-126, Stanford, California, USA, Aug. 2002.
[8] I. Papaefstathiou, T. Orphanoudakis, G. Kornaros, C. Kachris, I. Mavroidis, and A. Nikologiannis, "Queue Management in Network Processors", in Proc. of Design, Automation and Test in Europe (DATE'05), Volume 3, pp. 112-117, Munich, Germany, Mar. 2005.
[9] N. Hohn, D. Veitch, K. Papagiannaki, and C. Diot, "Bridging Router Performance and Queuing Theory", in Proc. of ACM Sigmetrics Conference on the Measurement and Modeling of Computer Systems, pp. 355-366, New York, USA, June 2004.
[10] N. McKeown, "The iSLIP Scheduling Algorithm for Input-Queued Switches", IEEE/ACM Transactions on Networking, Vol. 7, No. 2, April 1999.
[11] N. McKeown et al., "The Tiny Tera: A Packet Switch Core", Hot Interconnects V, Stanford University, August 1996.
[12] C.-S. Chang et al., "Load Balanced Birkhoff-von Neumann Switches, Part I: One-stage Buffering", Computer Communications, Vol. 25, 2001.
[13] T. Anderson, S. Owicki, J. Saxe, and C. Thacker, "High speed switch scheduling for local area networks", ACM Trans. Comput. Syst., pp. 319-352, Nov. 1993.
[14] WAN packet size distribution by NLANR, information available at: http://www.nlanr.net/NA/Learn/packetsizes.html
[15] T. Karagiannis, M. Molle, M. Faloutsos, and A. Broido, "A nonstationary Poisson view of Internet traffic", in Proc. of IEEE Infocom 2004, Vol. 3, pp. 1558-1569, Hong Kong, March 2004.
[16] CSC, Finnish IT Center for Science, FUNET Looking Glass. Available: http://www.csc.fi/sumoi/funet/noc/looking-glass/lg.cgi
[17] H. Heffes and D. Lucantoni, "A Markov Modulated Characterization of Packetized Voice and Data Traffic and Related Statistical Multiplexer Performance", IEEE J. Selected Areas in Communications, Vol. 4, Issue 6, pp. 856-868, Sep. 1986.