
Multiway Range Trees: Scalable IP Lookup with Fast Updates
Subhash Suri, George Varghese, Priyank Ramesh Warkhede
Department of Computer Science, Washington University, St. Louis, MO 63130

Abstract—In this paper, we introduce a new IP lookup scheme with worst-case search and update time of O(log n), where n is the number of prefixes in the forwarding table. Our scheme is based on a new data structure, a multiway range tree. While existing lookup schemes are good for IPv4, they do not scale well in both lookup speed and update costs when addresses grow longer, as in the IPv6 proposal. Thus our lookup scheme is the first lookup scheme to offer fast lookups and updates for IPv6 while remaining competitive for IPv4.

The possibility of a global Internet with addressable appliances has motivated a proposal to move from the older Internet IPv4 routing protocol with 32 bit addresses to a new protocol, IPv6, with 128 bit addresses. While there is uncertainty about IPv6, any future Internet protocol will require at least 64 bit, and possibly even 128 bit, addresses as in IPv6. Thus for this paper we simply use IPv6 and 128 bits to refer to the next generation IP protocol and its address length.

Our paper deals with address lookup, a key bottleneck for Internet routers. Our paper describes the first IP lookup scheme that scales well as address lengths increase as in IPv6, while providing fast search times and fast update times.

There are four requirements for a good lookup scheme: search time, memory, scalability, and update time. Search time is important for lookup to not be a bottleneck. Schemes that are memory-efficient lead to faster search because compact data structures fit in fast but expensive static RAM. Scalability in both the number of prefixes and the address length is important because prefix databases are growing, and a switch to IPv6 will increase address prefix lengths. Finally, schemes with fast update time are desirable because i) instabilities in backbone protocols [4] can cause rapid insertion and deletion of prefixes; ii) multicast forwarding support requires route entries to be added as part of the forwarding process and not by a separate routing protocol; and iii) any lookup scheme that does not support fast incremental updates will require two copies of the lookup structure in fast memory, one for updates and one for lookup.

The main drawback of existing lookup schemes is their lack of both scalability and fast update. Current IP lookup schemes come in two categories: those that use precomputation, and those that do not. Schemes like "binary search on prefix lengths" [9], "Lulea compressed tries" [1], or "binary search on intervals" [5] perform precomputation to speed up search; however, adding a prefix can cause the data structure to be rebuilt. Thus, precomputation-based schemes have a worst-case update time of O(n). On the other hand, trie-based schemes (e.g., [8], [7]) do not use precomputation; however, their search time grows linearly with the prefix length.

A. Our Contribution

We introduce a new IP lookup scheme whose worst-case search and update time is O(log n), where n is the number of prefixes in the forwarding table. Our scheme is based on a new data structure, a multiway range tree, which achieves the optimal lookup time of binary search, but can also be updated fast when a prefix is added or deleted. (By contrast, ordinary binary search [5] is a precomputation-based scheme whose worst-case update time is O(n).)

Our main contribution is to modify binary search for prefixes to provide fast update times. In standard binary search [5], each prefix maps to an address range. A set of n prefixes partitions the address line into at most 2n + 1 intervals. We build a tree whose leaves correspond to interval endpoints. All packet headers mapped to an interval have the same longest prefix match. Search time is O(log n).

The problem is update: a prefix range can be split into as many as n intervals, all of which need to be updated when the prefix is added or deleted. Thus our first idea is associating an address span with every tree node v; the span of v is the maximal range of addresses into which the final search can fall after reaching v. Thus the root span is the entire address range. We then associate with every node v the set of prefixes that contain the span of v. However, this results in storing redundant prefixes.

To remove this redundancy, we store a prefix with a node v if and only if the same prefix is not stored at the parent w of the node. Since the span of v is a subrange of the span of w, any prefix that contains the span of w will contain the span of v. Using this compression rule we can prove that only a small number of prefixes need be precomputed and stored; thus updates change information only at O(log n) nodes. By increasing the arity of the tree, we can reduce the height of the tree to O(log_d n), for any integer d. We pick d so that the entire node can fit in one cache line. For instance, a 256 bit cache line and arity 14 yield a tree height of about 5 or 6 for the largest IPv4 databases.

Our third contribution is to extend multiway range trees to IPv6. On a 32-bit machine, a single 128-bit address will require 4 memory accesses. Thus, an IP lookup scheme designed for IPv4 could slow down by a factor of 4 when used for IPv6. In practice, since we are assuming wide memories of at least 128 bits, we can handle such wide prefixes in one memory access. Unfortunately, the use of 128 bit addresses results in a small branching factor and hence large tree heights. To avoid this problem, we generalize our scheme to the case of k-word prefixes, giving worst-case search and update times of O(k + log_d n), where n is the total number of prefixes present in the database. Notice that the factor of k is additive.

The rest of this paper is organized as follows. Section I describes basic binary search. Section II describes our main data structure, the multiway range tree. Section III briefly describes how to handle updates efficiently. Section IV describes our extension to long addresses. Section V describes our experiments and measurements. Section VI states our conclusions.

I. IP LOOKUP BY BINARY OR MULTIWAY SEARCH

We first review binary search [5] prefix matching. Suppose the database consists of prefixes r_1, r_2, ..., r_m. Each prefix matches a contiguous interval of addresses: the interval of a prefix r begins at the address obtained by padding r with zeroes and ends at the address obtained by padding r with ones. View the IP address space as a line in which each prefix maps to a contiguous interval, and each destination address is a point. Two prefix intervals are either disjoint, or one completely contains the other. The longest matching prefix problem reduces to determining the shortest interval containing a query point.

Let p_1 < p_2 < ... < p_n, where n <= 2m, denote the distinct endpoints of all the prefix intervals, sorted in ascending order. With each key p_i, we store two prefixes, match(=p_i) and match(>p_i). The first, match(=p_i), is the longest prefix that begins or ends at p_i. The second, match(>p_i), is the longest prefix whose interval includes the range (p_{i-1}, p_i), where p_{i-1} is the predecessor key of p_i.

Given a destination address q, we perform binary search in O(log n) time to determine succ(q), the successor of q, which is the smallest key greater than or equal to q. If q = succ(q), we return match(=succ(q)) as the best matching prefix of q; otherwise, we return match(>succ(q)). This search can be implemented as a binary tree. To reduce tree height and memory accesses (assuming processing cost in registers is small compared to memory access cost), the binary search tree can be generalized to a multiway search tree; we use a B-tree.

In a B-tree of arity t, each node other than the root has at least ceil(t/2) - 1 and at most t - 1 keys; the root has at least one key. A node with k keys has exactly k + 1 subtrees. Specifically, suppose a node v of the tree has keys x_1 < x_2 < ... < x_k. Then there are k + 1 subtrees T_1, T_2, ..., T_{k+1}, where T_1 is the B-tree on keys less than or equal to x_1, T_2 is the B-tree on keys in the range (x_1, x_2], and so on. To search for a query q at a node v, we find the smallest key x_i that is greater than or equal to q and continue the search in the subtree T_i (if no key is at least q, the search continues in T_{k+1}). For a tree of arity (degree) t, worst-case search takes O(log_t n) (wide) memory accesses.
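
The following is a minimal sketch (ours, not the authors' code) of the static binary search of this section. The names key[], match_eq[] and match_gt[] are hypothetical: key[] holds the n interval endpoints in ascending order, match_eq[i] is the longest prefix beginning or ending at key[i], match_gt[i] is the longest prefix covering (key[i-1], key[i]), and a prefix is represented simply by an integer id with -1 meaning "none".

    /* Static binary search on interval endpoints (illustrative sketch). */
    int lookup(const unsigned int *key, const int *match_eq, const int *match_gt,
               int n, unsigned int q)
    {
        int lo = 0, hi = n - 1, succ = -1;

        while (lo <= hi) {                 /* find succ(q): smallest key >= q */
            int mid = (lo + hi) / 2;
            if (key[mid] >= q) {
                succ = mid;
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        if (succ < 0)
            return -1;                     /* q is beyond every endpoint */
        return (key[succ] == q) ? match_eq[succ] : match_gt[succ];
    }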

II. MULTIWAY RANGE TREES

The main difficulty with the binary search scheme [5] is that a short prefix can affect the match() values of a large number of keys that it covers. For example, consider a database with a lone 8-bit prefix together with 32K additional, non-overlapping 32-bit prefixes that all lie inside the range of the 8-bit prefix. Every address that falls inside the 8-bit prefix's range but between two consecutive 32-bit prefixes has the 8-bit prefix as its best matching prefix, so the binary search method [5] records this by setting match(>p) to the 8-bit prefix for all 32K prefixes p of length 32. Now consider adding a new prefix, longer than 8 bits, that also contains all the 32-bit prefixes. Since this new prefix is now a better match for the addresses between the 32-bit prefixes, we have to update match(>p) for 32K values of p. Thus update can take linear time. We avoid this problem using the following ideas.

A. Address Spans

Our key idea is to associate address spans with nodes and keys of the tree; intuitively, the address span of a node is the range of addresses that can be reached through the node. For the multiway tree in Figure 1, first consider a leaf node, such as the leaf z containing the keys a, b, and c. For each key in it, we define the span of the key to be the range from its predecessor key to the key itself. (When the predecessor key does not exist, we use an artificial guard key, -∞.) To ensure that spans are disjoint, each span includes the right endpoint of its range, but not the left endpoint. Thus, the spans of a, b and c are span(a) = (-∞, a], span(b) = (a, b], and span(c) = (b, c].

The span of a leaf node is defined to be the union of the spans of all the keys in it. Thus, we have span(z) = (-∞, c]. The span of a non-leaf node is defined as the union of the spans of all the descendants of the node. Thus, for instance, span(v) = (-∞, i], span(w) = (i, o], and span(u) = (-∞, o]. (In fact, it is easy to check that the nodes at the same level in the tree form a partition of the total address span defined by all the input prefixes.) We now describe how to associate prefixes with nodes of the tree.

[Fig. 1. Address spans. The original figure shows a three-level multiway tree: a root u with key i, its children v (keys c, f) and w (keys k, m), and leaf nodes, among them z = {a, b, c}, holding the keys a through o.]
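
A sketch of how address spans might be represented (our illustration; the struct and field names are assumptions, not the authors' code). A span is the interval (lo, hi], open on the left, and the span of an internal node is recomputed from its children as defined above; the children's spans are contiguous, so only the two extremes matter.

    #define MAX_KEYS 13                     /* e.g. arity 14: at most 13 keys */

    struct span { unsigned int lo, hi; };   /* the address range (lo, hi]     */

    struct rt_node {
        int nkeys;
        unsigned int key[MAX_KEYS];
        struct rt_node *child[MAX_KEYS + 1];   /* all NULL in a leaf node     */
        struct span span;                      /* union of the children's (or */
                                               /* keys') spans                */
    };

    /* Recompute the span of an internal node from its nkeys + 1 children. */
    static void recompute_span(struct rt_node *v)
    {
        v->span.lo = v->child[0]->span.lo;
        v->span.hi = v->child[v->nkeys]->span.hi;
    }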

B. Associating Prefixes with Nodes

Suppose we have a prefix r with range [b_r, e_r]. Clearly, if for some node v of the tree, span(v) is contained in the range [b_r, e_r], then r is a matching prefix for any address in span(v). We could store r at every node v whose span is contained in the range [b_r, e_r], but this leads to a potential memory blowup: we can have n prefixes, each stored at about n nodes. Observe, however, that the span of a parent is the union of the spans of each of its children. Thus if a prefix contains the span of a parent node, then it must contain the span of the child. Thus we do not need to explicitly store the prefix at the child if it is already stored at the parent. Thus we use the following compression rule:

[Prefix Storing Rule:] Store a prefix r at a node (or leaf key) v if and only if the span of v is contained in [b_r, e_r] but the span of the parent of v is not contained in [b_r, e_r]. (The parent of a key in a leaf node L is considered to be L.) In addition, r is stored with the start key of its range, namely b_r.

The first rule should be intuitive; the second rule (storing prefixes with the start key) is slightly more technical and is necessary to handle the special case when a search encounters the start point of a range. We can show that the update memory, for all practical purposes, is essentially linear except for a logarithmic factor of t log_t W (proof omitted for lack of space); with the values of t and W that arise in practice, this is a modest additive overhead.
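
A sketch of the Prefix Storing Rule (our illustration; the names are ours). Both the prefix's address range and the node spans are encoded as (lo, hi] intervals, as in Section II-A, so containment is a simple comparison of endpoints.

    struct span { unsigned int lo, hi; };    /* the address range (lo, hi]    */

    static int contains(struct span outer, struct span inner)
    {
        return outer.lo <= inner.lo && inner.hi <= outer.hi;
    }

    /* Store the prefix at a node with span node_span and parent span
     * parent_span if and only if the prefix covers the node's span but
     * not the parent's span (has_parent is 0 at the root). */
    static int should_store_here(struct span prefix, struct span node_span,
                                 struct span parent_span, int has_parent)
    {
        if (!contains(prefix, node_span))
            return 0;
        return !has_parent || !contains(prefix, parent_span);
    }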

C. Search Algorithm

We summarize the search algorithm. The arity t of the range tree is a tunable parameter; each node has at least ceil(t/2) - 1 and at most t - 1 keys. Let M_v be the (narrowest) prefix stored at each internal node v in the search structure, where M_v is the root of the corresponding heap at node v in the update structure. Let E_k be the prefix stored with each leaf key k in the search structure, corresponding to the root of the equal list heap in the update structure; let G_k be the prefix stored with each leaf key k in the search structure, corresponding to the root of the span list heap.

We initialize the best matching prefix to be null. Suppose the search key is q, and the root of the range tree is u. If M_u is non-empty and is longer than the current best matching prefix, then we update the best matching prefix to M_u. Let x_1, ..., x_d be the keys stored at u, and let T_1, T_2, ..., T_d denote the subtrees associated with these keys. We determine the successor x_i of q as the smallest key in x_1, ..., x_d that is greater than or equal to q. We recursively continue the search in the subtree T_i, repeating the same steps at every node visited.

As in the original static search, we initialize the successor to be +∞, and when we descend into the subtree T_i, we update the successor to be x_i. Suppose k is the successor key of q when the search terminates. We check to see if k = q. If the equality holds, we return the longer of the current best matching prefix and E_k; otherwise, we return the longer of the current best matching prefix and G_k.
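
The following sketch puts the pieces of the search together (our illustration; the field names m_prefix, eq_prefix and gt_prefix are assumptions standing in for M_v, E_k and G_k, prefixes are integer ids with -1 meaning "none", and longer_of() is an assumed helper returning the longer of two prefixes).

    #define FANOUT 14                         /* arity: at most FANOUT children */

    struct rt_node {
        int nkeys;
        unsigned int key[FANOUT - 1];
        struct rt_node *child[FANOUT];        /* all NULL in a leaf node        */
        int m_prefix;                         /* M_v: narrowest prefix stored   */
        int eq_prefix[FANOUT - 1];            /* E_k for each leaf key          */
        int gt_prefix[FANOUT - 1];            /* G_k for each leaf key          */
    };

    extern int longer_of(int a, int b);       /* assumed helper                 */

    int rt_lookup(const struct rt_node *v, unsigned int q)
    {
        int best = -1;

        while (v != NULL) {
            int i;
            best = longer_of(best, v->m_prefix);
            /* smallest key >= q within this node (linear scan per node) */
            for (i = 0; i < v->nkeys && v->key[i] < q; i++)
                ;
            if (v->child[0] == NULL) {        /* leaf: successor key found here */
                if (i == v->nkeys)
                    return best;              /* q is beyond the last key       */
                return longer_of(best, v->key[i] == q ? v->eq_prefix[i]
                                                       : v->gt_prefix[i]);
            }
            v = v->child[i];                  /* descend (i may equal nkeys)    */
        }
        return best;
    }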

III. UPDATES TO THE RANGE TREE

We show how to handle prefix insertions and deletions. We describe at a high level the changes that occur in the range tree when a prefix r = [b_r, e_r] is inserted or deleted. Conceptually, an update can be divided into three phases (which can be combined), each of which takes at most O(t log_t n) time.

1. [Updating the Keys and Handling Splits and Merges.] The keys representing the endpoints of the prefix range, namely b_r and e_r, may need to be inserted or deleted. A key may be the endpoint of multiple prefix ranges, so we maintain a count of the number of prefixes terminating at each key. When a key is newly inserted, its count is initialized to one. Whenever a prefix is inserted or deleted, the counts of its endpoint keys are updated accordingly. When the count of a key decrements to zero, it is deleted from the range tree. Insertion and deletion can result in nodes being split and merged; it is easy to update the spans of the affected nodes using the spans of the parent nodes.

2. [Updating the Address Spans and Prefix Heaps.] When a new key is inserted, it may change the address span of some nodes in the range tree. Specifically, suppose we add a new key q. Then, any node whose rightmost key is q has its span enlarged: previously, the span extended to the predecessor of q; now it extends to q. In addition, if s is the successor of q, then the span of every node that is an ancestor of s but not an ancestor of q shrinks: previously, the span began just after the predecessor of q; now it begins just after q. Thus, altogether there can be O(log_t n) nodes whose address spans need updating. Similarly, when a key is deleted, the spans of a logarithmic number of nodes are affected. Modification of address spans can cause changes in the prefix heaps stored at these nodes: when the address span increases, some of the prefixes stored at the node may need to be removed, and when the address span of a node shrinks, some of the prefixes stored at a node's children may move up to the node itself.

3. [Inserting or Deleting the New Prefix.] Finally, the prefix r is stored in the heaps of some nodes, as dictated by the Prefix Storing Rule. This phase of the update process adds or removes r from those heaps.

Lemma 1: Given an IP address, we can find its longest matching prefix in worst-case time O(t log_t n) using the range tree data structure. The number of memory accesses is O(log_t n) if each node of the tree fits in one cache line. The data structure requires O(n t log_t W) memory, and a prefix can be inserted or deleted in worst-case time O(t log_t n).
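
A sketch of the endpoint reference counting in phase 1 (our illustration; tree_insert(), tree_delete(), tree_contains() and tree_count() are assumed wrappers around the usual B-tree operations, including splits and merges).

    extern int  tree_contains(unsigned int key);   /* assumed lookup             */
    extern void tree_insert(unsigned int key);     /* assumed B-tree insert      */
    extern void tree_delete(unsigned int key);     /* assumed B-tree delete      */
    extern int *tree_count(unsigned int key);      /* assumed: pointer to count  */

    /* Called for both endpoints b_r and e_r when a prefix r is inserted. */
    static void add_endpoint(unsigned int key)
    {
        if (!tree_contains(key)) {
            tree_insert(key);            /* may split nodes on the way down    */
            *tree_count(key) = 1;
        } else {
            (*tree_count(key))++;        /* endpoint shared with another range */
        }
    }

    /* Called for both endpoints when a prefix r is deleted. */
    static void remove_endpoint(unsigned int key)
    {
        if (--(*tree_count(key)) == 0)
            tree_delete(key);            /* last prefix ending here: remove key */
    }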

IV. MULTIWORD ADDRESSES: IPV6 OR MULTICAST

In this section, we describe an extension of our scheme to multiword prefixes. On a 32-bit machine, accessing a single 128-bit address will require 4 memory accesses. Thus, an IP lookup scheme designed for IPv4 addresses could suffer a slowdown by a factor of 4 when used for IPv6.

We consider the general problem of handling k-word address prefixes, where each word is some fixed W bits long. We show that our multiway range tree scheme generalizes to the case of k-word prefixes, giving worst-case search and update times of O(k + log n), where n is the total number of prefixes. We describe the general construction. The top level tree T is the range tree on the first words of the input set of prefixes, namely S. Consider a key x in the tree T, and let S_x denote the set of prefixes whose first word equals x. We build a second range tree on the set S_x, and have the branch of x in T point to it; we will call this second tree T_x. We also store with the key x the longest (best) prefix in S matching x.

In general, the tree T_{x_1 x_2 ... x_{i-1}} is a range tree on the ith words of the prefixes whose first i - 1 words are exactly x_1 x_2 ... x_{i-1}. If x_i is a key in this subtree, then its left subtree contains keys smaller than x_i, the right subtree contains keys bigger than x_i, and the equal pointer points to a range tree on the (i+1)st words of the set S_{x_1 x_2 ... x_i}. (This is the set of prefixes whose first i words are exactly x_1 x_2 ... x_i.) The key x_i also stores the best prefix in the set S that matches x_1 x_2 ... x_i.

For instance, if several input prefixes share the same first word, that word appears as a key once for each such prefix in the top level tree. This illustrates that the keys in the search tree form a multiset, since many different prefixes may have a common first word. Thus, each of our range trees will allow duplicate keys. In addition to allowing duplicate keys, the multiword multiway search tree has one other significant difference from the single word tree. Consider a key x at some node in the range tree; the key partitions the key space into three parts: keys less than x, keys equal to x, and keys greater than x. Perhaps surprisingly, we can prove that the obvious trick of removing duplicate keys in each range tree results in a correct algorithm, but with poor performance.

Given a packet destination address p = p_1 p_2 ... p_k, the search algorithm is as follows. Observe that each p_i is fully specified, so it maps to a point. We begin with the first word p_1, and find its successor key s in the top level tree T. During the search for s, we maintain the best matching prefix found along the path. As soon as we find a key q such that q = p_1, we stop searching the tree T. If the best matching prefix stored at q is longer than the one found so far, we update the answer. We now extract the second word of the packet header, namely p_2, and start the search in the range tree pointed to by q. Search terminates either when the associated B-tree is empty, or when we do not get an equal match with the search key. In the former case, we simply output the best matching prefix found so far. In the latter case, suppose the search terminates at a successor key s such that s differs from the current word p_i; in this case, we compare the current best matching prefix with the prefix stored at the root of the span list heap of s, and choose the longer prefix.
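
The multiword search can be summarized by the following loop (our illustration; the types and helpers are assumptions). Each level is a range tree on one W-bit word; find_successor() is assumed to return the successor key of the given word, the best prefix seen on the search path (including the prefix stored with an exactly matching key), whether the match was exact, the prefix at the successor's span list heap, and the equal subtree for the next word.

    struct mw_tree;                            /* one per-word range tree       */
    struct mw_result {
        int best_on_path;      /* longest prefix seen on the search path        */
        int exact;             /* 1 if the successor key equals the word        */
        int span_prefix;       /* prefix at the span list heap of the successor */
        struct mw_tree *next;  /* equal subtree for the next word (may be NULL) */
    };

    extern struct mw_result find_successor(const struct mw_tree *t, unsigned int w);
    extern int longer_of(int a, int b);        /* assumed helper                */

    int mw_lookup(const struct mw_tree *top, const unsigned int word[], int k)
    {
        int best = -1, i;
        const struct mw_tree *t = top;

        for (i = 0; i < k && t != NULL; i++) {
            struct mw_result r = find_successor(t, word[i]);
            best = longer_of(best, r.best_on_path);
            if (!r.exact)                      /* no equal match: stop and take */
                return longer_of(best, r.span_prefix);   /* span-heap prefix   */
            t = r.next;                        /* descend to the next word      */
        }
        return best;
    }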

V. EXPERIMENTAL RESULTS

This section describes the experimental setup and measurements for our scheme. We consider software platforms using modern processors such as the Pentium and the Alpha; for the Pentium processor, the size of a cache line is 32 bytes (256 bits). For hardware platforms, we assume a chip model that can read data from internal or external SRAM. We assume that the chip can make wide accesses to the SRAM (using wide buses), retrieving up to several hundred consecutive bits in a single access. We count memory accesses in terms of distinct cache-line accesses, as opposed to the number of memory words retrieved. Since lookup memory needs to be in fast SRAM while update memory need not, we separately state our lookup-relevant memory and update memory. We implemented our range tree scheme in C on a UNIX machine.

A. Results for IPv4 and IPv6

Experiments were conducted using (somewhat dated) routing table snapshots from IPMA [6]; only a subset of the results is described here. Experiments involved inserting prefixes in the same order as in the input database. Measurements of lookup time were obtained for randomly generated IP addresses.

Given an input set of n prefixes, the height h of the range tree of maximum arity t is roughly ceil(log_t n), and slightly more when nodes are not fully packed. The number of memory accesses per lookup is h in the worst case. For insertion of a new prefix, the worst-case number of memory accesses is proportional to h*t, since a node may have to be split at each of the h levels; the worst case for deletion is similar, with merges in place of splits.
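
As a rough sanity check on the heights in Tables I and II below (a back-of-the-envelope calculation of ours, assuming well-packed nodes), take the Mae-East key count of 62,273 and arity 14:

    14^4 = 38,416 < 62,273 <= 14^5 = 537,824,  so  ceil(log_14 62,273) = 5.

A perfectly packed tree would thus have height 5; the observed height of 6 for arity 14 in Table I reflects nodes that are only partially full.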

sume that the chip can make wide accesses to the SRAM nodes have only one key, which leads to an increased number (using wide buses), retrieving upto and even con- of allocated nodes and large total memory. Increasing arity of secutive bits. We count memory accesses in terms of distinct the tree from very small values achieves significant reduction cache-line accesses, as opposed to the number of memory in the number of nodes. words retrieved. Since lookup memory needs to be in fast The worst-case update time for our scheme is calculated SRAM while update memory need not, we separately state assuming node split/merge at each level of the tree. To eval- our lookup-relevant memory and update memory. We imple- uate the average update performance, we first built the prefix mented our range tree scheme in C on a UNIX machine. database corresponding to the Mae-East database. We then

Table I summarizes the observed tree height, the memory requirement for the search structure, and the overall memory requirement (i.e., including memory relevant only for updates) for the largest available dataset, Mae-East.

      Arity   Height   Search Memory   Total Memory
        4       15        0.82 MB         1.76 MB
        8        8        0.51 MB         1.15 MB
       14        6        0.44 MB         0.99 MB
       18        5        0.42 MB         0.95 MB

TABLE I. Mae-East database: 41,456 prefixes, 62,273 distinct keys.

Similar observations hold for the smaller PacBell dataset (Table II), for which the number of distinct keys is about a third of Mae-East's (though it has half the number of prefixes of Mae-East). This is reflected in the memory requirement and height of the range tree, which are a function of the number of keys, not prefixes.

      Arity   Height   Search Memory   Total Memory
        4       12        0.26 MB         0.53 MB
        8        6        0.17 MB         0.36 MB
       14        5        0.15 MB         0.32 MB
       18        4        0.15 MB         0.30 MB

TABLE II. PacBell database: 24,740 prefixes, 24,380 distinct keys.

An important purpose of the experiments was to understand the variation of performance (speed and memory requirement) with the arity of the range tree. The results show that when the arity is small, say 4, the total memory requirement is quite high. This occurs because the range tree structure guarantees a minimum utilization of only ceil(t/2) - 1 keys per node for maximum arity t. Thus, when the maximum arity is 4, many nodes have only one key, which leads to an increased number of allocated nodes and large total memory. Increasing the arity of the tree from very small values achieves a significant reduction in the number of nodes.

The worst-case update time for our scheme is calculated assuming a node split/merge at each level of the tree. To evaluate the average update performance, we first built the prefix database corresponding to the Mae-East database. We then generated 500 random prefixes, inserted them into the data structure, and then deleted them in random order. Table III below summarizes the results of this experiment.

      Arity   Height   Average   Worst Case
        4       12       107        337
        8        6        58        321
       14        5        45        455
       18        4        38        479

TABLE III. Average update performance of multiway range trees (memory accesses per update).

For small arity values, say 4, the average update time is fairly close to the worst case. But with increasing arity, the average update time becomes much smaller than the worst-case update time. This is expected, because the probability of a node merge/split is very high for low arity values, and decreases linearly with increasing arity. For arity 14, which is used later on for the comparison with other schemes, the average update time is less than a tenth of the worst case (45 versus 455 memory accesses in Table III).

IPv6 Performance: Since IPv6 prefix databases are unavailable, we only provide analytical bounds. Using 32-bit words, each 128-bit address is l = 4 words long. Worst-case lookup time is l * ceil(log_t n) memory accesses for trees of arity t: if h denotes the height of any single tree, then h is at most about ceil(log_t n), and since we already know the worst-case update time for a single tree of height h, the total update time for the data structure is l times that amount in the worst case. With a wider cache line, a correspondingly higher arity can be chosen; for a dataset of the same size as Mae-East, the projected worst-case update time is four times the IPv4 update time. The total memory requirement also changes for IPv6. We measured the increased overhead (due to the tree pointers associated with each key) by prepending a fixed string of words to each prefix; the constructed data structure required almost double the amount of memory required for the corresponding IPv4 data structure.
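
For concreteness, a sketch of how a 128-bit IPv6 address would be fed to the multiword structure as k = 4 words of W = 32 bits (our illustration; mw_lookup() stands for the hypothetical multiword search of Section IV).

    #include <stdint.h>

    extern int mw_lookup(const void *top, const unsigned int word[], int k);

    int lookup_ipv6(const void *top, const uint8_t addr[16])
    {
        unsigned int word[4];
        int i;

        for (i = 0; i < 4; i++)                 /* big-endian 32-bit words */
            word[i] = ((unsigned int)addr[4*i]     << 24) |
                      ((unsigned int)addr[4*i + 1] << 16) |
                      ((unsigned int)addr[4*i + 2] << 8)  |
                       (unsigned int)addr[4*i + 3];

        return mw_lookup(top, word, 4);
    }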

B. Comparison with other schemes

Several schemes have been proposed in the literature for performing IP address lookups. We evaluate these schemes on four important metrics: lookup time, update time, memory requirement, and scalability to longer prefixes. Table IV compares the complexity of the search time, update time and memory requirement of the proposed scheme against some of the prominent schemes proposed in the literature.

      Scheme                        Lookup          Update           Memory
      Patricia trie                 O(W)            O(W)             O(nW)
      Multibit tries                O(W/k)          O(W/k + 2^k)     O(2^k nW/k)
      Binary search                 O(log n)        O(n)             O(n)
      Multiway search               O(log n)        O(n)             O(n)
      Prefix length binary search   O(log W)        O(n)             O(n log W)
      Multiway range trees          O(k + log n)    O(k + log n)     O(kn log n)

TABLE IV. Comparison of search, update and space complexities.

As seen from the table, all existing schemes, except trie-based schemes, require O(n) update time in the worst case. Trie-based schemes are slow for large address sizes; large strides can increase worst-case memory exponentially (most papers report memory only on typical databases).

VI. CONCLUSION

We have described a new IP lookup scheme that scales well to IPv6 and multicast addresses, allowing fast search and update. It is the first scheme we know of that possesses all these properties. For example, prior schemes such as [9], [1], [5] all have O(n) update times, where n is the number of prefixes, while multibit trie schemes (e.g., [8], [7], [3]) take O(W) search times, where W is the address length. With the size of prefix tables approaching 100,000 and expected to perhaps reach 500,000 in a few years, O(n) update times can be problematic. On the other hand, multibit trie schemes require W/c memory accesses, where c is at most 8, in order to bound memory; this can be expensive for 128-bit IPv6 addresses. Thus our scheme, which takes worst-case search and update times of O(log n), may be interesting for large IPv6 tables, especially by choosing a sufficiently high arity to make the log n term very small.

Even for IPv4, unlike other schemes such as [9], [1], [8], our scheme is not patented and can be used freely in publicly available software. Our scheme is competitive for IPv4: for example, on a Mae-East database using a 512 bit cache line, our scheme takes 5 memory accesses to do a worst-case lookup, requires 0.44 MB of storage, and needs 455 memory accesses for an update in the worst case. Even if its performance is not as good as multibit tries, its availability and simplicity could make it useful for a software solution used by low-end routers that sit on the edge of the network.

REFERENCES

[1] A. Brodnik, S. Carlsson, M. Degermark, and S. Pink. Small Forwarding Tables for Fast Routing Lookups. Computer Communication Review, October 1997.
[2] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, 1990.
[3] P. Gupta, S. Lin, and N. McKeown. Routing Lookups in Hardware at Memory Access Speeds. IEEE INFOCOM '98, April 1998.
[4] C. Labovitz, G. Malan, and F. Jahanian. Internet Routing Instability. ACM SIGCOMM '97.
[5] B. Lampson, V. Srinivasan, and G. Varghese. IP Lookups using Multiway and Multicolumn Search. IEEE INFOCOM '98.
[6] Merit Inc. IPMA Statistics. http://nic.merit.edu/ipma.
[7] S. Nilsson and G. Karlsson. Fast Address Look-Up for Internet Routers. Proceedings of IEEE Broadband Communications '98, April 1998.
[8] V. Srinivasan and G. Varghese. Faster IP Lookups using Controlled Prefix Expansion. ACM SIGMETRICS '98, June 1998.
[9] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner. Scalable High Speed IP Routing Lookups. ACM SIGCOMM '97.