arXiv:cs/0411070v1 [cs.NI] 18 Nov 2004 ewr evcdb h Ssadcnettosnso comput of thousands connect and ISPs backbone the by serviced network ttedfeetsfwr n adaetcnqe employed techniques hardware router. and a l software computing different the it, the to to mu added at addition have routers In guarantees access delay protocols. and of The bandwidth variety services. a value-added and of media physical set numbe rich large require a backb routers for enterprise The The different. links. are few a routers over these for requirements The Route (ISPs). Providers Service Internet the to businesses r elt o.Sice n otr r eoigtebott In the diverse. becoming extremely are become routers also have and routers Switches OC now. of increased. reality also a have Internet are the across capacities link h atdcd a inse nrcdne rwhi h In the in growth unprecedented witnessed has decade last The aaPt rcsigi atProgrammable Fast in Processing Path Data rcsigcl o e prahsi einn routers. provid designing of in goals approaches conflicting new The for router. call the th processing at on and cycles second CPU every packets more th more becoming deliver up are links faster routers coming hand and are Switches applications routers. New intermediate technologies. link faster of fN edr ntemre.Tedsuso nteN benchmar NP the on discussion The market. commerc NPs. the the these will of evaluate in some We vendors on ASICs. NP focus of and of Proce Processors speed Network Network of the of promise with design The together changes. processors the with purpose up keep solutio cons to are hardware requirements time application i and The the standards Service network But Routers. as of (ASICs). Queued Circuits Quality Integrated Input Specific provide Application in to performance is QoS the routers guarantee reduce the to for aim requirement that perform algorithms scheduling the link discusses output and scheduling switch ment, hssre ae oka h oefntoaiis iepa like functionalities, core the at look a takes survey This nenti rwn tafs ae h iksed r surgin are speeds link The pace. fast a at growing is Internet otr ttecr fteItre onc Ssuigln di long using ISPs connect Internet the of core the at routers optrSineDepartment Science Computer tn ro,N 11794-4400 NY Brook, Stony tn ro University Brook Stony [email protected] .INTRODUCTION I. Routers rdpaDe Pradipta Abstract access sin rs hrn neiiiyo h ehdi demerit a is method the of inflexibility nherent atyeovn,wihse atrturnaround faster a seek which evolving, tantly 4c(. bs n C12 1 bs links Gbps) (10 OC-192c and Gbps) (2.5 -48c fprsa o otprpr,btne support need but port, per cost low at ports of r oteek nfs omncto.O one On communication. fast in bottlenecks e e nentbsdapiain ihstringent with applications based Internet new db otri t aapt rcsigand processing path data its in router a by ed te aditlietpoesn consumes processing intelligent hand other e n otr upr otn tvr ihrate high very at routing support routers one ewrs otr onc oe n small and homes connect routers networks, ooecm h efrac oteek in bottlenecks performance the overcome to hc eur nelgn rcsiga the at processing intelligent require which ktcasfiain ufrmmr manage- memory buffer classification, cket on o hs prtos nimportant An operations. these for bound a narue.I hssre,w illook will we survey, this In router. a on oad tspotmn ot ihheterogeneous with ports many support st urnes epooea loih to algorithm an propose We guarantees. enterprise n atrbtcmuainlyexpensive computationally but faster ing eeki atcmuiain h role The communication. fast in leneck td h rhtcua hie o the for choices architectural the study r rmcmue n nepie.The enterprises. and campuses from ers al vial P.Teei plethora a is There NPs. available ially oad4 bswt h emergence the with Gbps 40 toward g sr N)i h eiiiyo general- of flexibility the is (NP) ssors osedu otroeainwas operation router up speed to n enttafi [ traffic ternet sst h omlzn ltomto platform normalizing the sets ks tnetuk flrebandwidth. large of trunks stance ewrsfr h de fthe of edges the form networks ? .A h aetm the time same the At ]. Routers used to be implemented in software. A single CPU performed all the data and control related tasks. In a connectionless network like IP, basic data path activity involves looking up the outgoing interface for an incoming packet. The router extracts the destination field from the header of the packet and uses it to perform a Longest Prefix Match against a forwarding database of entries of the form < network address/mask, egress port >. The address lookup involves classification based on a single field in the header. For providing different levels of service to individual flows, it is necessary to perform this classification on multiple fields in the header of a packet to distinguish among the flows. Once the egress port is resolved the packets are routed to the output port over a shared bus, or a non-blocking interconnect. Multiple packets might be competing for the same output requiring a scheduling algorithm for arbitration. In order to give service guarantees, the outgoing link must be scheduled among the flows such that none of their guarantees are violated. This calls for a link scheduling discipline at each output. Each of these functions are complex operations and have to be performed for every packet passing through the router. A software based router has the advantage that it can be easily modified to adapt to changing specifications, but the heavy computations associated with these operations can kill the performance. One way is to find better algorithms to make the operations faster. An alternative way found its way into the router market with the advancement of IC technology. Application Specific Integrated Circuits (ASICs) can implement many of these operations in hardware and achieve orders of magnitude improvement in speed. So instead of pushing the limits of processor speed, router designers adopted the use of specialized ASICs. The downside of the approach is the inherent inflexibility of the ASIC based design. Once a logic is burnt into silicon it is not possible to modify them. This leads to longer time-to-market of new designs. The flexibility of software based approach coupled with the speed of ASIC is the holy grail of router design. Network Processor (NP) is the answer to this search. Network Processor is a programmable device that is designed and built with optimized instruction set for networking functions. Software programmability gives flexibility of modification and the optimized instruction set helps to reach the speed of ASIC. The term ‘Network Processor’ is often loosely used to refer to a collection of specialized co-processors for functions like address lookup, flow classification that are often integrated on the same chip with the programmable core. In this paper, we survey different approaches that aim to boost router performance. On the software side, we will study the algorithms which have improved the runtime complexity of router operations. The Network Processor presents the hardware angle to router performance improvement. We also propose a switch scheduling algorithm which can ensure fair sharing of the bandwidth among the flows destined to an output in an input queued switch/router. The organization of the paper is, in Section II we will discuss the algorithmic solutions to the key router functions. Section III introduces our proposed algorithm for providing QoS guarantees in an Input Queued switch. In Section IV we point out the architectural choices for Network Processors and evaluate designs from some commercial vendors. Section V looks at the need for and the means to distinguish among the slew of NPs currently available. We will conclude the survey in Section VI with our opinion. II. KEY TASKS IN PACKET PROCESSING Next-generation routers are expected to support a wide range of applications and functionalities while maintaining the high processing speed. Initially, much of the research had gone into speeding up the forwarding table lookup and switch fabric scheduling for a router. With Quality of Service in the network becoming a sought after requirement, research efforts were directed toward finding efficient solutions for packet classification based on multiple header fields, fast switch fabric scheduling techniques, fair scheduling of the output link and innovative memory architectures to prevent packet buffering and Rule F1 F2 R1 00* 00* R2 0* 01* R3 1* 0* R4 00* 0* R5 0* 1* R6 * 1* TABLE I. Rule Set for Classification queueing from becoming a bottleneck. In this section, we will take a look at each of these issues and existing solutions. A. Packet Classification Problem Statement: Given a database of filters or rules, packet classification is the task of determining the filter that ‘best’ matches the packet. Suppose there are K fields in the header that must be matched for classification. Then, each filter F is a K-tuple (F[1], F[2], ..., F[k]), where each F [i] is either a variable length prefix bit string or a range. For example, the filter F = (130.245.*, *, TCP, 23, *), specifies a rule for traffic flow addressed to subnet 130.245 with TCP destination port 23, which is used for incoming Telnet connections. If the firewall database includes this rule it will be able to isolate all incoming Telnet connections, and allow or discard it. Each filter rule in a filter database comprises an array of K distinct fields, where F [i] is a specification for the ith field, and an associated cost. The best match for a packet is decided by, (a) the order of occurrence of the rule in the filter database, (b) assignment of priorities to different filters, (c) assignment of priorities to fields to break the ties in case of multiple matches. For example, in firewall databases, option (a) is adopted. A packet can match a filter rule F in three ways, • Exact match means matching with the specific value of the field in a filter rule. • Prefix match requires that the value specified in the filter is a prefix of the value in that particular field in the packet header. • Range match is the case when the filter value is specified as a range, for example when specifying a set of port numbers. Any range specification can, however be converted to a prefix-based speci- fication. It has been shown by Feldman and Muthukrishnan that a W -bit range can be represented by at most (2W − 2) prefixes [?]. Of the three types, Prefix Matching is the most commonly used. The two metrics on which the algorithms in this section will be evaluated are, • Space Complexity: The space complexity gives an estimate of the memory footprint required to store the data structures. The bigger the data structures more difficult it is to use faster and more expensive memory technologies to store them, thereby increasing the memory access time. • Time Complexity: The time complexity gives a bound on the maximum number of steps or cycles that are required to perform the classification. This is directly related to the search speed of the classification algorithm. To enable fast processing on the data path, the most useful algorithms are the ones with the minimum memory usage and the fastest lookup time. The most naive approach in solving the classification problem is to use linear search on all the rules in the filter set. The worst-case storage and lookup time complexity for such a scenario will be O(N). Linear increase with the number of rules presents poor scaling property. 1 0 1 0 F1 TRIE
0 0
1 0 1 0 1 1 1 1 0 0 1 F2 TRIE 0 1 0 R3 R6 R4 R4 R3 R6 1 R5 R6 0 1 R5 R5 R5 0 1 0 R2 Switch Pointers R1 R1 R2 R2 (a) (b)
Fig. 1. Examples of Trie Data Structure. Figure (a) shows a Hierarchical Trie data structure which represents fields F1 and F2 from the Table I. Extension to Grid-of-Tries with the use of Switch Pointers is also shown. Figure (b) gives an example of a Set Pruning Trie.
Another hardware option is to use a Ternary Content Addressable Memory (TCAM). In this case, the lookup time is constant, O(1), but space complexity remains O(N). TCAMs are expensive memories with limited capacities. Hence in routers where thousands of rules are stored, it is not a viable solution. The theoretical bounds for the classification problem is obtained from the domain of computational geometry. It can be framed as a point location problem in multi-dimensional space [?]. The best bounds for point location in N rectangular regions and d> 3 dimensions are O(log N) time with O(N d) space, or O((log N)d−1) time with O(N) space. This poly-logarithmic time bounds are not practical for use in high-speed routers. For example, with 1000 rules and 5 fields, either the number of memory accesses is 10,000 per packet, or the space requirement is 1000 GB. As stated earlier, any range specification can be converted to prefix-based specification, therefore for the rest of this section, without loss of generality we will refer to the rule database in Table I throughout the discussion. In Table I, Rj refers to a rule and Rj1 and Rj2 refers to the two fields for each rule. The most popular data structure for performing fast lookup is Patricia Trie [?], which is a binary branching tree with each branch labeled 0 or 1. Extension of a 1-dimensional trie to multiple dimensions is done by building the trie for dimension-1 (F1) and each valid prefix in F1 points to a trie containing the prefixes of dimension-2 (F2). This is shown in the Figure 1(a). The storage cost for this trie is O(NdW ), where N is the number of rules, d is the number of dimensions and W is the maximum bit length of a prefix. The lookup is done in a depth-first traversal which gives a lookup complexity of O(W d). Instead of a DFS approach, breadth-first search (BFS) is proposed by Varghese et al. [?], which can reduce the runtime memory requirement for the algorithm. However, the linear complexity for space and polynomial complexity for lookup time is not ideal for high speed data forwarding. A way to speed up the search procedure in Hierarchical Tries was proposed in [?] by preventing backtracking with the use of Switch Pointers. Switch Pointers are edges labeled 0 or 1, and directs the search process to the next trie without backtracking. This scheme can only work in 2-dimensional classification where the lookup time is reduced from O(W 2) to O(W ). Yet another way of speeding up the lookup can be by replicating rules to avoid backtracking. This is called a Set Pruning Trie and is shown in Figure 1(b). The storage complexity blows up to O(N ddW ). In summary, use of tries leads to polynomial complexity solutions either in terms of space or query processing. Another solution to the multi-dimensional classification problem was proposed by Srinivasan et al. [?], which is called cross-producting. The basic idea is to do a 1-dimensional lookup for each chosen field in the packet, and then compose the results to get the entry from a precomputed table of lowest cost filters. The table computation involves generating all possible combinations of filters by cross-producting individual prefixes in each dimension. The query processing Rule Specification Tuple Tuple Hash Table Entries R1 (00*, 00*) (2, 2) (0, 1) {R6} R2 (0**, 01*) (1, 2) (1, 1) {R3, R5} R3 (1**, 0**) (1, 1) (1, 2) {R2} R4 (00*, 0**) (2, 1) (2, 1) {R4} R5 (0**, 1**) (1, 1) (2, 2) {R1} R6 (***, 1**) (0, 1) TABLE III. Hash Table Entries based on Table IV. TABLE II. Mapping from RulesSet to Tuples
Database Size Tuples Pruned Tuples Fwal-1 278 41 11 Fwal-2 158 28 6 Fwal-3 183 24 7 Fwal-4 68 15 5 TABLE IV. The worst-case number of hash probes produced by the Pruned Tuple Search Method. time is reduced to O(N * 1-D lookuptime), but the cross-product table can have a whooping N d entries. Attempts to solve the most general cases of the multi-dimensional classification problem lead to expensive worst-case solutions. But worst-case behavior is not a frequent occurrence in real world classifiers. There is considerable amount of structure and redundancy which can be exploited to come up with intelligent heuristics as we will find out. We study three heuristic based solutions here. A simple observation that in most real rule sets the number of distinct field lengths used is always small is the key to the algorithm, called Tuple Space Search proposed by Srinivasan et al. [?]. The entire rule set is broken up into sets, called tuple space based on the length of the individual prefixes in each rule. Each d-dimensional rule maps to one of the d-tuples, where the ith field of the tuple is the length of the ith dimension of the rule. An example of generating such tuple space is shown in Table II and Table III. Each rule is now stored in a hash table corresponding to the set it maps to. The storage complexity in this case is still O(N), but the advantage comes from a faster query processing time in the average case, where it has to do M hashed memory accesses, M being the number of tuples and is definitely less than N. The authors suggest an improvement over this approach, which they call Tuple Pruning. This heuristic is based on an observation that in real filter databases there are very few prefixes of a given address. This observation is utilized during a query by first doing individual longest prefix matching in each dimension, and then searching only those tuples that are compatible with the individual matches. The authors have reported the effect of Tuple Space Pruning on some real firewall databases, as shown in Table IV. Srinivasan has suggested a different approach to tuple pruning, which is called Entry-Pruned Tuple Space Search [?]. In this case, with each entry E corresponding to a tuple, some information is maintained as a bitmap which tells which other tuples need to be searched instead of doing a linear search. Only those tuples that have at least one filter that does not contradict with a current match E in any of the specified bits need to be looked at. The bitmap associated with each entry is T bits long, where T is the number of tuple entries. The bitmap has to be precomputed by checking all filters that does not contradict E. Though the naive approach for this precomputation takes O(N 2) time, the author has proposed ways to do it using only O(T ∗ N) time and O(T ∗ N) memory. The key observation for the Recursive Flow Classification [?] is that the number of overlapping ieso into, dimension hnafie ubro ue,wihcnb ierysace du 3. searched Figure linearly in be shown can is I which Table rules, the partitio of on the number based Finally, fixed them. a in than placed accordingly are rules ntehuitc ehv ocos n dimension, one choose to nod have root we the building, heuristics o tree the the based on of that is beginning stage the preprocessing partitioning At Hence solution. The classifier. the rules. of of structure sp number search fixed complete the comprising partitioning essentially is McKeown where ainAlgorithm. cation ueisedo apn tt n fthe of one worst to the it than mapping smaller of considerably instead is rule dataset a for regions 2. Fig. e.Tepolmteeoebisdw ooeo mapping of one to down boils therefore problem The set. pt 70rls F per rcia o 0Gp iert i rate line Gbps F 10 in for shown practical appears is RFC works rules, algorithm 1700 the to how up of depiction pictorial A orsodn oec fthe of each to corresponding aerdcinrcrieyoe xdnme fpae.The phases. soluti of The number storage. fixed a of over amount recursively unreasonable reduction of same expense the at h dabhn the behind idea The • • • h atsaegvsol n eutwihi h dnie th identifier the is which result one forme only is gives th memory stage narrower each last be stage. The into to previous index has the the lookup from the lookups phases, class of subsequent result the the the on In that based ensure information to 2-bit is a to information 16-bit xlsv sets exclusive ewe 0-65536 between ntadtepr pcfiain sdi h ue are rules 16-b the the in using used are specifications we port pa assume the the us and of m let unit Each into example, parallel. split For in are identifier. memories header an packet multiple the into of index d-fields that phase first the In T itra eito fteRcrieFo Classifi- Flow Recursive the of depiction Pictorial log = Packet hs hs hs Phase3 Phase2 Phase1 Phase 0 N np(C) and { 20,21 index } ec ahitra oerpeet usto agso ea of ranges of subset a represents node internal each Hence . T hnw a noeti nomto sn ny2bt oden to 2-bits only using information this encode can we Then . irrhclItlietCut Intelligent Hierarchical ≪ } , S { 2 n mrcia a o hsi opeopt h au fth of value the precompute to is this for way impractical One . 80 S Preprocessed Tables } ifrn akthaes hslast ouini n mem one in solution a to leads This headers. packet different , { 1024-65536 N Result d osbecssoenest erhtruhamc reduced much a through search to needs one cases possible } , { -9 27,81-1023 22-79, 0-19, Cut AcrossY i.3. Fig. dim HCt [ (HiCut) S igsoswe h efndshv omore no have nodes leaf the when stops ning ulstesac rei h e etr nthe in feature key the is tree search the builds { n h ubro attost iiethat divide to partitions of number the and , cs au,wihcnbe can which value, -case otisalterls neeypae based phase, every In rules. the all contains e 80 c ftecasfirrlsit mle spaces smaller into rules classifier the of ace isi h akthae oa to header packet the in bits iu refrteCasfir ie nteTbeI. Table the in given Classifiers the for tree HiCut A } R2 R1 , npooe yteatosi od the do to is authors the by proposed on adaead25Gp nsoftware. in Gbps 2.5 and hardware n ansesi h loih are, algorithm the in steps main { gr .Wt ellf Dcasfir of classifiers 4D real-life With 2. igure ersista sdtrie ythe by determined is that heuristics n Cut AcrossX firrls h e oli ahphase each in goal key The rules. ifier ? alllousyeda uptwihis which output an yield lookups rallel 20,21 tmp oapriua rule. particular a to maps at igaqey neapeHCttree HiCut example An query. a ring lil hnso xdnme fbits of number fixed of chunks ultiple loih,pooe yGpaand Gupta by proposed algorithm, ] nteidxo h eoyaccess. memory the of index the an tpr ubrfil sachunk-size a as field number port it ycmiigterslso the of results the combining by d } , R5 (2*8, Y,2) } { hsw a euefo a from reduce can we Thus . > 1023 } , { R5 R2 O hdmnin and dimension, ch eann integers remaining (8*8, X,4) ( t h mutually the ote N T d ) i identifier, bit hsfra for Thus . identifier e r access ory R6 R3 R6 R3 F2