Consiglio Nazionale Delle Ricerche


Adaptive Stratified Search Trees for IP Table Lookup

M. Pellegrini, G. Fusco, G. Vecchiocattivi

IIT TR-22/2002

Technical report

November 2002


Istituto di Informatica e Telematica

Adaptive Stratified Search Trees for IP Table Lookup

M. Pellegrini, G. Fusco, and G. Vecchiocattivi

Institute for Informatics and Telematics of CNR

September

Abstract

The IP Table Lookup mechanism is a key component of a router in a packet network such as the Internet. A router in the network holds a table (typically, for backbone routers, the number of entries is in the range [...]) where each entry specifies a prefix (at most [...] bits long) and a next hop exit line. When a packet comes to the router, the destination address in the header of the packet is read, the longest prefix in the table matching the destination is sought, and the packet is sent to the corresponding next hop exit line. In this paper we propose a data structure called Adaptive Stratified Tree (AST) to solve the IP Table Lookup problem. For a table of $n$ prefixes of length at most $w$, the AST is built in $O(n \log n \log w)$ time, uses storage $O(n)$, and allows searching in time $O(\log n)$. The algorithm has been implemented in C and compared to a state-of-the-art software solution, notably the Level-Compressed Trie of S. Nilsson and G. Karlsson. For several large benchmark tables and for different traffic profiles we significantly reduce the storage of the search structure (up to a factor [...]) and significantly reduce the search time (up to a factor [...]). Interestingly, even for the largest test tables (about [...] entries) we could build small search trees having depth three.

Introduction

Motivation. The Internet is surely one of the great scientific, technological and social successes of the last decade, and an ever growing range of services rely on the efficiency of the underlying switching infrastructure. Thus improvements in the throughput of Internet routers are likely to have a large impact. The IP Address Lookup mechanism is a critical component of an Internet Packet Switch (see [...] for an overview). Briefly, a router within the network holds a table where each entry specifies a prefix (at most 32 bits long in the IPv4 protocol, 128 bits in the IPv6 protocol) and a next hop exit line. When a packet comes to the router, the destination address in the header of the packet is read, the longest prefix in the table matching the destination is sought, and the packet is sent to the corresponding next hop exit line. How to solve this problem so as to be able to handle millions of packets per second has been the topic of a large number of research papers in recent years (see Section [...]).

Our contribution. The starting point of our research, which is novel as far as we know, is to revisit the Stratified Trees of van Emde Boas et al. [...] (but see also a more modern description in [...]) with the goal of assessing their worth as a basis for solving the IP Table Lookup problem. Easy transformations (see Section [...]) map the longest matching prefix problem into the predecessor problem for a set of $n$ keys in the range $[0, U-1]$ with $U = 2^w$, where $w = 32$ in the IPv4 protocol. The predecessor problem, in turn, is handled by Stratified Trees.

Area della Ricerca, Via Moruzzi [...], Pisa, ITALY. {marco.pellegrini, giordano.fusco, giorgio.vecchiocattivi}@iit.cnr.it

Stratified Trees are known to have small height, $O(\log\log U)$ for a data set of $n$ keys from a universe of size $U$, but to be quite memory-intensive, using close to $O(U)$ storage. Mehlhorn et al. [...] are able to reduce the storage to $O(n)$ by representing large sparse interior nodes of the tree with hash tables. Thus good worst case bounds both in storage and in height of the tree are attained, at the cost of performing, for each visited node, one or more hash function evaluations (each involving multiplications and possibly integer divisions) instead of a simpler indexing operation.

Our approach is that, while storage has to be controlled tightly, a small depth can be achieved on real data sets by adapting the construction of the tree to the actual distribution of the keys. Using this approach we can always guarantee a worst case depth of no more than $\log n$; however, we could achieve a maximal depth far below the worst case in all of the benchmarks considered in Section [...].

The best way to explain the gist of the algorithm is to visualize it geometrically. The keys are points on the real line. We want to split this line into a grid of equal size buckets, and then proceed recursively in each bucket separately. The grid is completely specified by giving an anchor point $a$ and the step $s$ of the grid. Finding the bucket containing the query point $p$ is done in time $O(1)$, since the expression $\lfloor (p - a)/s \rfloor$ gives the offset from the special bucket containing the anchor. We take care to choose for $s$ a power of two, so as to reduce integer division to a right shift. If we choose the step $s$ too short we might end up with too many empty buckets, which implies a waste of storage. We thus choose $s$ as follows: choose the smallest $s$ for which the ratio of empty to occupied buckets is no more than a user-defined constant threshold. On the other hand, shifting the grid (i.e. moving its anchor) can have dramatic effects on the number of empty buckets, the number of occupied buckets, and the maximum number of keys in a bucket. So the search for the optimal step size includes an inner optimization loop on the choice of the anchor. The search for the locally optimal step and corresponding anchor can be done efficiently in time close to linear, up to polylog terms (see Section [...]). This definition of locally optimal grids has to be contrasted with several techniques in the literature where a global optimality criterion is sought (see e.g. [...]), usually via much more expensive dynamic programming techniques.
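The bucket computation described above can be illustrated with a minimal Python sketch. The names `bucket_index` and `grid_stats` are hypothetical and not taken from the report's C implementation; the sketch assumes integer keys and a step given as a power of two, $s = 2^{\text{log2\_step}}$.

```python
from collections import Counter

def bucket_index(p, anchor, log2_step):
    """Offset of the bucket holding p relative to the bucket holding the anchor.

    Equals floor((p - anchor) / step) for step = 2**log2_step; Python's
    right shift on integers floors, also for negative differences.
    """
    return (p - anchor) >> log2_step

def grid_stats(points, anchor, log2_step):
    """Return (occupied, empty, max_occupancy) for the grid over span(points)."""
    counts = Counter(bucket_index(p, anchor, log2_step) for p in points)
    total = max(counts) - min(counts) + 1   # buckets touching the span
    occupied = len(counts)
    return occupied, total - occupied, max(counts.values())
```

Calling `grid_stats` on the same keys with two different anchors shows how moving the anchor alone can change the number of occupied buckets, which is why the choice of the step is coupled with an inner search over anchors.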

Main Result. For a table of $n$ prefixes of length at most $w$, the AST is built in $O(n \log n \log w)$ time, uses storage $O(n)$, and allows searching in time $O(\log n)$. The constants in the storage bound are very small, and this is in accord with the empirical results. The logarithmic upper bound on the query time is clearly due to the simplicity of the analysis, which does not account for the full power of adaptivity, as is evident from the small number of levels required by AST on large benchmark data sets.

Dynamization. Efficient preprocessing is important in the perspective of augmenting the AST with dynamic operations (insertion/deletion). In general, updates involve partial reconstruction of the data structure; thus, in effect, they apply the preprocessing scheme to a portion of the input. From a complexity analysis point of view there is a well developed theory (see [...]) relating preprocessing time with update time, supporting the importance of fast preprocessing time as a prerequisite for fast updates.

Discussion of experimental results. In Section [...] we compare the measured performance of an implementation of our algorithm for AST (with two settings of parameters) and the implementation of the LC-Trie algorithm supplied by the authors, run on the same platform. Figures [...] and [...] exemplify the results. Five benchmark tables, ranging from medium ([...]K entries) to large ([...]K entries) in size, and three different models of randomly generated traffic have been used. AST performs consistently better than LC-trie both in throughput (measured in lookups per second) and in storage (measured in bytes). The variant with parameters chosen so as to attain the smallest query time ($AST_{minT}$) roughly gives an increased throughput by [...] to [...] with respect to LC-trie, while using less than half of the auxiliary storage. The variant with parameters chosen so as to attain the smallest storage ($AST_{minS}$) roughly sees an increased throughput by [...] to [...] with respect to LC-trie, while using about one tenth of the auxiliary storage. The experiments point towards the conclusion that AST is a valid alternative to LC-tries, as well as to several other schemes (see Section [...]). Implementation and testing of dynamic AST, as well as tuning for cache efficiency, are planned as future research topics.

Related Work

Speeding up IP Table lookup has received considerable attention in recent years. There are three main directions: improved data structures and software searching techniques; design of specialized hardware aiming at exploiting machine-level parallelism; avoiding the lookup process altogether by exploiting additional header information.

Software Searching. The classical data structure to solve the longest matching prefix problem is the binary trie [...]. Patricia Tries [...] take advantage of common subsequences to compress certain paths in a binary trie, thus saving both in search time and storage. The first IP Lookup algorithm in [...] is based on the Patricia Trie idea. Nilsson and Karlsson [...] add the concept of level compression by increasing the out-degree of each node of the trie, as long as at least a large fraction (about [...]) of the subtrees are non-empty. Waldvogel et al. [...] organize the prefixes in groups of prefixes of the same length, and each group is stored in a hash table for fast membership testing. The search is essentially a binary search by prefix length over the hash tables, quite similar to the technique in [...]. The number of hash tables can be reduced by padding some of the prefixes [...]. Good hashing strategies are studied in [...].

In [...] and [...] each prefix is mapped onto a segment on the real line and each address into a point. Thus finding the longest matching prefix is transformed into the geometric problem of finding the shortest segment covering a query point. In [...] an elaboration on the B-tree is used.

Degermark et al. [...] use data compression techniques to store compactly parts of the prefix tree representing the set of prefixes. At present this technique achieves in practice the lowest use of storage. Crescenzi et al. [...] instead start from a full table representation of the lookup function, then apply a data compression technique that reduces the storage to acceptable levels in practice, while requiring no more than [...] memory accesses.

Ergun et al. [...] use the fast reconfiguration capabilities of skip lists [...] to adapt on-line the search data structure to the modifications of traffic patterns.

Hardware Lookup. Gupta et al. [...], Huang et al. [...] and Sikka et al. [...] exploit pipelining to overlap the execution of several lookup operations. The use of Content Addressable Memories (CAM) is proposed in [...] and [...]. The influence of caching technology is studied in [...] and [...].

Reducing/Avoiding lookup. Standard IP Table Lookup is done independently on each router, using only the destination address in the packet header. However, several schemes have been proposed in which additional labels, tags and clues are added to the header during routing, so as to help either forwarding the single packet at the next router or routing a stream of packets reaching the same destination (see e.g. [...]).

Reduction to the predecessor problem

Reduction of the longest prefix match to the shortest segment stabbing, and of this to the predecessor problem, has been used in [...]. Here we give details of the reduction for completeness. Each entry in a lookup table downloaded from IPMA has the form:

netbase/netmask nexthop

where netbase is a 32-bit IP address, netmask denotes the number of bits used for the prefix, and nexthop is the destination for the packets whose destination address matches this entry. The set of matching destinations is a contiguous block of addresses: an interval closed on the left endpoint and open on the right endpoint. The begin and the end of the interval are given by:

$$begin = netbase \wedge mask$$

$$end = (netbase \vee \neg mask) + 1$$

where $\wedge$, $\vee$ and $\neg$ are the bitwise Boolean operators and $mask$ is the 32-bit word whose leading netmask bits are set. Moreover, we associate to the begin and end points a label denoting the next hop for destinations to its right, which is valid up to the next endpoint on the line (see an example in Figure [...]):

$$label[begin] = nexthop$$

$$label[end] = \text{nexthop of previous prefix}$$

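Figure [...]: Intervals and next hops. Two nested prefixes, prefix1 = A..D / hop1 and prefix2 = B..C / hop2, split the line at the points A, B, C, D; the resulting labels, from left to right, are hopX, hop1, hop2, hop1, hopX.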

Since the next hop label is constant between two endpoints, given a query point $q$ we just have to find the predecessor of $q$ among the endpoints and retrieve the associated label.
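As an illustration of the reduction, here is a small Python sketch under stated assumptions: 32-bit IPv4 addresses held as integers, and "nexthop of previous prefix" read as the next hop of the closest enclosing prefix (or a default hop when there is none). The names `prefix_interval`, `build_endpoints` and `lookup` are hypothetical, and a plain sorted array with `bisect` stands in for the AST as the predecessor structure.

```python
from bisect import bisect_right

W = 32  # address length in bits (IPv4)

def prefix_interval(netbase, netmask):
    """Half-open interval [begin, end) of the addresses matching netbase/netmask."""
    full = (1 << W) - 1
    mask = (full << (W - netmask)) & full      # leading `netmask` bits set
    begin = netbase & mask
    end = (netbase | (~mask & full)) + 1
    return begin, end

def build_endpoints(table, default_hop=None):
    """table: iterable of (netbase, netmask, nexthop).

    Returns (points, hops): sorted endpoints and, for each endpoint, the
    next hop valid from that endpoint up to the next one.
    """
    intervals = sorted((prefix_interval(b, m) + (hop,) for b, m, hop in table),
                       key=lambda t: (t[0], -t[1]))   # outer prefixes first
    labels = {}
    stack = []   # currently open (enclosing) prefixes as (end, hop)

    def close_until(x):
        # Close every open prefix ending at or before x; its end point is
        # labelled with the hop of the prefix that encloses it.
        while stack and stack[-1][0] <= x:
            end, _ = stack.pop()
            labels[end] = stack[-1][1] if stack else default_hop

    for begin, end, hop in intervals:
        close_until(begin)
        stack.append((end, hop))
        labels[begin] = hop
    close_until(float("inf"))
    points = sorted(labels)
    return points, [labels[p] for p in points]

def lookup(points, hops, q, default_hop=None):
    """Longest-prefix match for address q as a predecessor query."""
    i = bisect_right(points, q)
    return hops[i - 1] if i else default_hop
```

With two nested prefixes, as in the figure, a query between B and C returns hop2, a query in A..B or C..D returns hop1, and a query outside A..D returns the default hop.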

Description of the AST Data Structure

Here we describe the AST data structure by iteratively refining a general framework with specific choices. Consider initially the set $U$ of all possible addresses, the set $P$ of prefixes, and the set $S$ of endpoints of the segments obtained by mapping each prefix $p \in P$ to the set of addresses of $U$ matching $p$, as described above. We will often think of $U$ as embedded in the real line $R$.

The main idea is to build recursively a tree by levels: each node $x$ of the tree has an associated connected subset of the universe $U_x \subseteq U$, which is the set of all queries that visit node $x$, and a point data set $S_x \subseteq S$, which is $S \cap U_x$. For the root $r$ we have $U_r = U$ and $S_r = S$.

Let $x$ be a node and let $k$ be the number of children of $x$. By $y_x[1], \dots, y_x[k]$ we denote the children of $x$. A particular rule, to be described below, will tell us how to decide $k$ and, for each $i \in [1, k]$, how to compute $U_{y_x[i]}$. Afterwards, simply $S_{y_x[i]} = U_{y_x[i]} \cap S$.

For simplicity we explain the process at the root. Let $S$ be a set of points on the real line and $span(S)$ the smallest interval containing $S$. Consider an infinite grid $G(a, s)$ of anchor $a$ and step $s$: $G(a, s) = \{a + ks \mid k \in N\}$. By shifting the anchor by $s$ to the right the grid remains unchanged: $G(a, s) = G(a + s, s)$.

Consider the continuous movement of a grid $f_G(\alpha) = G(a + \alpha s, s)$ for $\alpha \in [0, 1]$. Consider as an event a value $\alpha_i$ for which $S \cap f_G(\alpha_i) \neq \emptyset$.

Lemma. The number of events is at most $n$.

Proof. Take a single fixed interval $I$ of $G(a, s)$: we have an event when the moving left extreme meets a point in $I$, and this can happen only once for each point in $I$. Therefore overall there are at most $n$ events.

Since we study bucket occupancy, extending the shift range beyond a full step $s$ is useless, since every distribution of points in the buckets has already been considered, given the periodic nature of the grid.

Consider a point $p \in S$ and the bucket $I_p$ containing $p$. Point $p$ produces an event when the shift is $\alpha_p = (p - left(I_p))/s$, that is, when the shift is equal to the distance from the left extreme of the interval containing $p$. Thus we can generate the order of events by constructing a min-priority queue $Q(S, \alpha_p)$ on the set $S$, using as priority the value of $\alpha_p$.

We can extract iteratively the minimum from the queue and update the counters for the shifting interval $I$. Note that, for our counters, an event consists in decreasing the count for a bucket and increasing it for the neighbour bucket.

Moreover, we keep the current maximum of the counters. To do so we keep a second max-priority queue $Q(I, c_I)$. When a counter increases we apply the operation increase-key; when it decreases we apply the operation decrease-key. Finally, we record changes in the root of the priority queue, recording the minimum value found during the lifetime of the algorithm. This value is

$$g(s) = \min_{\alpha \in [0,1]} \; \max_{I \in G(a + \alpha s, s)} |S \cap I|,$$

where $\alpha$ is the shift of the anchor in units of $s$; that is, we find the shift that, for a given step $s$, minimizes the maximal occupancy. The whole algorithm takes time $O(n \log n)$.
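The sweep over shifts can be sketched as follows. This is a simplified illustration, not the report's implementation: it assumes integer keys and integer shifts in $[0, s)$, sorts the events once instead of using a min-priority queue, and tracks the running maximum with a histogram of occupancies instead of a second priority queue. The name `min_max_occupancy` is hypothetical.

```python
from collections import Counter

def min_max_occupancy(points, step, anchor=0):
    """Return (shift, g): the integer shift in [0, step) that minimizes the
    maximum bucket occupancy, and that occupancy. `points` must be non-empty."""
    counts = Counter()   # bucket index -> number of points at shift 0
    events = []          # (offset of the point inside its bucket, bucket index)
    for p in points:
        k, off = divmod(p - anchor, step)
        counts[k] += 1
        events.append((off, k))
    events.sort()

    freq = Counter(counts.values())   # occupancy -> number of buckets with it
    cur_max = max(counts.values())
    best_shift, best = 0, cur_max

    def move(src, dst):
        # One event: a point leaves bucket src and enters its neighbour dst.
        nonlocal cur_max
        c = counts[src]
        freq[c] -= 1
        counts[src] = c - 1
        freq[c - 1] += 1
        if freq[cur_max] == 0:        # src was the only bucket at the maximum
            cur_max -= 1
        d = counts[dst]
        freq[d] -= 1
        counts[dst] = d + 1
        freq[d + 1] += 1
        cur_max = max(cur_max, d + 1)

    i, n = 0, len(events)
    while i < n and events[i][0] + 1 < step:
        off = events[i][0]
        # Every point at this offset changes bucket once the shift exceeds off.
        while i < n and events[i][0] == off:
            move(events[i][1], events[i][1] - 1)
            i += 1
        if cur_max < best:
            best_shift, best = off + 1, cur_max
    return best_shift, best
```

For the keys 3, 4, 5, 6 and step 4, the unshifted grid puts three keys in one bucket, while shifting the anchor by 1 yields two buckets of two keys each.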

In order to use a binary search scheme we need some monotonicity property, which we are going to prove next.

Lemma. Take two step values $s$ and $t$ with $t = 2s$; then we have $g(t) \ge g(s)$.

Proof. Consider the grid $G_{min}(t)$ that attains the min-max occupancy $K = g(t)$. So every bucket in $G_{min}(t)$ has at most $K$ elements. Now we consider the grid $G(s)$ that splits exactly in two every bucket in $G_{min}(t)$. In this grid $G(s)$ the maximum occupancy is at most $K$, so the value $g(s)$ that minimizes the maximum occupancy over the translates of $G(s)$ cannot attain a larger value than $K$, i.e. $g(s) \le K = g(t)$.

Now, if we use only powers of two as possible values for a grid step, the above monotonicity lemma implies that, for a given threshold $x$, we can use binary search to find the largest step $s = 2^k$ with $g(s) \le x$ and the smallest step $u = 2^h$ with $g(u) \ge x$.

Call $R = E/F$ the ratio of the number of empty buckets to the number of full buckets.

Lemma. Take two step values $s$ and $t$ with $t = 2s$; then we have $R_{min}(t) \le R_{min}(s)$.

Proof. Take the grid $G_{min}(s)$, the grid of step $s$ minimizing the ratio, so that $R_{min}(s) = E_s/F_s$. Now make the grid $G(t)$ by pairing adjacent buckets; we have the relations $N_t = N_s/2$, $F_s/2 \le F_t \le F_s$, $E_s + F_s = N_s$ and $E_t + F_t = N_t$. Now we express $R_t = E_t/F_t$ as a function of $N_t$ and $F_t$:

$$R_t = E_t/F_t = (N_t - F_t)/F_t = N_t/F_t - 1.$$

This is an arc of hyperbola in the variable $F_t$, having maximum value for abscissa $F_t = F_s/2$. The value of the maximum is $(N_s/2 - F_s/2)/(F_s/2) = E_s/F_s = R_{min}(s)$. Thus we have shown that $R(t) \le R_{min}(s)$. Naturally also $R_{min}(t) \le R(t)$, so we have proved $R_{min}(t) \le R_{min}(s)$.

Thus the minimum ratio is monotonically increasing as grids get finer and finer.

By the above lemmas we have two functions, $g$ and $R_{min}$, that are both monotone (one increasing, one decreasing) in the step size.

In order to control memory consumption we adopt the following criterion: find the smallest $s$ with $R_{min}(s) \le C$, for a predefined constant threshold $C$; then take the grid giving the occupancy $g(s)$, the min-max occupancy (this shifting does not change the number of nodes at the level). When the number of points in a bucket drops below a small threshold $D$ we switch to standard binary search.
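The step selection can be sketched in Python as a binary search over the exponent of the step. This is a simplification under stated assumptions: the anchor is kept fixed at 0, so the ratio of that single aligned grid is used in place of $R_{min}$ (no minimization over shifts), keys are non-negative integers below $2^w$, and the names `empty_to_full_ratio` and `smallest_step_exponent` are hypothetical. With aligned power-of-two grids the ratio does not increase when the step doubles, which is what makes the binary search valid here.

```python
def empty_to_full_ratio(points, k):
    """Ratio E/F of empty to full buckets for the grid of step 2**k anchored
    at 0, counting only buckets that intersect span(points)."""
    occupied = {p >> k for p in points}
    total = (max(points) >> k) - (min(points) >> k) + 1
    return (total - len(occupied)) / len(occupied)

def smallest_step_exponent(points, threshold, w=32):
    """Smallest k such that the grid of step 2**k has E/F <= threshold.

    At k = w every key falls in one bucket, so the ratio is 0 and the
    search always has a valid upper end (threshold assumed >= 0).
    """
    lo, hi = 0, w
    while lo < hi:
        mid = (lo + hi) // 2
        if empty_to_full_ratio(points, mid) <= threshold:
            hi = mid          # mid is feasible, try finer grids
        else:
            lo = mid + 1      # too many empty buckets, coarsen
    return lo
```

For two clusters of keys near 0 and near 1024 and a threshold of 1, the search settles on step $2^9$: one empty bucket between two full ones.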

Search algorithm

Each node $x$ is characterized by an anchor and a step. The index of the child searched for is found by the expression

$$\left\lfloor \frac{query - anchor}{step} \right\rfloor$$

We require that step is a power of two, so as to reduce the integer division to a shift operation. The sign of $query - anchor$ can be positive or negative, and the result is to be interpreted as an offset with respect to the position of the child of $x$ containing the anchor.
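To make the descent concrete, here is a minimal Python sketch of a tree searched with this expression. It is not the AST construction of the report: the anchor of each node is fixed at the left end of its range, the fan-out is picked by a simple heuristic (roughly one bucket per key) instead of the ratio and occupancy criteria, and `LEAF_SIZE`, `build` and `predecessor` are hypothetical names. Each leaf also stores the largest key to its left, so that a query landing in an empty or partially filled bucket still finds its predecessor.

```python
from bisect import bisect_left, bisect_right

LEAF_SIZE = 4   # hypothetical stand-in for the small threshold D

def build(keys, base, bits, pred=None):
    """Build a node covering [base, base + 2**bits) from sorted distinct keys.

    pred is the largest key to the left of this range, or None.
    """
    if len(keys) <= LEAF_SIZE or bits == 0:
        return ("leaf", keys, pred)
    fan_bits = min(bits, max(1, len(keys).bit_length() - 1))
    shift = bits - fan_bits                   # log2 of the step
    children, start = [], 0
    for i in range(1 << fan_bits):
        child_base = base + (i << shift)
        end = bisect_left(keys, child_base + (1 << shift), start)
        children.append(build(keys[start:end], child_base, shift, pred))
        if end > start:
            pred = keys[end - 1]
        start = end
    return ("node", base, shift, children)

def predecessor(node, q):
    """Largest key <= q, or None; q must lie in the range of the root."""
    while node[0] == "node":
        _, anchor, shift, children = node
        node = children[(q - anchor) >> shift]   # floor((q - anchor) / step)
    _, keys, pred = node
    i = bisect_right(keys, q)
    return keys[i - 1] if i else pred
```

A root for 16-bit keys would be built with `build(sorted_keys, 0, 16)`; each level of the descent costs one subtraction, one shift and one array access, and the final leaf is resolved by binary search.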

Worst case bounds

Theorem. An AST on $n$ prefixes of length at most $w$ uses $O(n)$ storage, is built in time $O(n \log n \log w)$, and a query is answered in time $O(\log n)$.

Proof. Consider a node $x$ and the set of points $S_x$ under the influence of $x$. If we choose as anchor in the refinement of $x$ the median point of $S_x$, we get buckets containing at most $|S_x|/2$ points. Since the actual chosen anchor will minimize the maximal occupancy, this bound still holds. Thus the depth of the tree is at most $\log n$.

Split the input set $S$ into $m = n/\log n$ groups of contiguous points, each of size $\log n$. Picking the leftmost point in each group we have a set $S'$ of representative points. At level $i$ of the AST for $S'$ the full nodes are a partition of $S'$, so there are no more than $m$ full nodes. The number of empty nodes is bounded by $cm$. Thus a level has at most $O(m)$ nodes. Summing over all levels, we need $O(m \log n) = O(n)$ storage for $AST(S')$. The points in $S \setminus S'$ are stored group by group as simple sorted arrays in $O(n)$ additional storage. The query time is $O(\log m + \log n) = O(\log n)$.

During preprocessing, for $|S_i|$ points, optimizing the anchor for a fixed step takes time $O(|S_i| \log |S_i|)$. The step $s$ takes as values powers of two from 1 up to at most $U$, thus at most $\log U$ values. We exploit monotonicity to perform a binary search on these values, thus requiring $O(\log\log U)$ attempts. Thus building a level requires $O(m \log\log U \log n)$ time. After the initial sorting, which takes $O(n \log n)$, we spend time $O(m \log\log U \log n \cdot \log n) = O(n \log\log U \log n)$, which is $O(n \log n \log w)$.
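One way to read the chain of bounds above, assuming that the nodes of a level partition the $m$ representative points, that there are at most $\log n$ levels, and that $U = 2^w$ (so $\log\log U = \log w$), is the following worked derivation:

```latex
\begin{align*}
T_{\text{level}} &= O\Big(\log\log U \cdot \sum_i |S_i| \log |S_i|\Big)
                  = O(m \log n \log\log U), \qquad \sum_i |S_i| \le m, \\
T_{\text{total}} &= O(n \log n) + \log n \cdot O(m \log n \log\log U) \\
                 &= O(n \log n) + O\Big(\frac{n}{\log n} \log^2 n \, \log\log U\Big)
                  = O(n \log n \log\log U) = O(n \log n \log w).
\end{align*}
```

The factor $\log n$ lost by working on $m = n/\log n$ representatives is exactly what absorbs the extra $\log n$ coming from the number of levels.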

Indirect Experimental Comparisons

In a recent survey, Ruiz-Sanchez et al. [...] compare the execution of several IP lookup algorithms (codes supplied by the respective authors or publicly available) on the same platform, thus providing an indication of relative performance in terms of speed. The algorithms tested are the following: BSD Tries [...], Table compression [...], LC-Tries [...], binary search by prefix length [...], and Multibit tries [...]. Taking as criterion the time needed to answer [...] and [...] of the lookups, LC-Tries proved to be faster than any other method except the table compression.

Experiments

Test equipment. Experiments have been carried out on an AMD Athlon XP [...] processor ([...] MHz clock, [...] KB cache size, total RAM memory [...] Gigabyte) running Linux [...].

Software. Source code in C for the LC-Trie method is freely available on the web at http://www.nada.kth.se/~snilsson (written by the authors of [...]) and is run with the native parameters. Code for AST has been developed in C. Both codes have been compiled using gcc [...] with optimization level O[...].

Benchmark tables. Real lookup tables have been downloaded from the Internet Performance Measurement and Analysis (IPMA) project site (http://nic.merit.edu/ipma) as of June [...]. Each prefix has been mapped into an interval, and the endpoints of the intervals form the data set.

Generation of Traffic

(a) Same Probability Traffic. This traffic is based on the assumption that every prefix has the same probability of being accessed. In other words, the traffic per prefix is supposed to be the same for all prefixes. Thus the entries of the routing table are extended to 32 bits by adding zeroes and are permuted to reduce the effects of cache locality. The total number of lookups is [...] × (number of prefixes in the table).

(b) Uniform Random Traffic. The IP addresses for this traffic are uniform random in all the space of addresses. The total number of lookups is [...].

(c) Mixed Traffic. The IP addresses of this traffic are taken partly ([...]) uniform random in the subset of the space of addresses obtained by concatenating all the segments associated to prefixes, and partly ([...]) uniform random in all the space of addresses. The total number of lookups is [...].
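The three traffic models can be sketched in Python as follows. All function names are hypothetical, the number of lookups and the mixing proportion are left as parameters because their values are not recoverable from the text, and "concatenating all the segments" is read literally: segments are laid end to end and a position is drawn uniformly in the concatenation.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def same_probability(prefix_addresses, rng):
    """(a) One lookup per prefix (each prefix zero-padded to a full address),
    in random order to reduce cache-locality effects."""
    addrs = list(prefix_addresses)
    rng.shuffle(addrs)
    return addrs

def uniform_random(count, rng, w=32):
    """(b) Addresses uniform over the whole w-bit address space."""
    return [rng.randrange(1 << w) for _ in range(count)]

def in_segments(segments, count, rng):
    """Addresses uniform over the concatenation of half-open segments [b, e)."""
    cum = list(accumulate(e - b for b, e in segments))
    out = []
    for _ in range(count):
        r = rng.randrange(cum[-1])           # position in the concatenation
        i = bisect_right(cum, r)             # segment holding that position
        out.append(segments[i][0] + r - (cum[i - 1] if i else 0))
    return out

def mixed(segments, count, fraction_in_segments, rng, w=32):
    """(c) A blend of in-segment and whole-space uniform addresses;
    fraction_in_segments is an assumed parameter."""
    k = round(count * fraction_in_segments)
    addrs = in_segments(segments, k, rng) + uniform_random(count - k, rng, w)
    rng.shuffle(addrs)
    return addrs
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps a generated trace reproducible across the runs being compared.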

Performance

Dimension of Data Sets

                   Paix    MaeEast  AADS    PB      MaeWest
No. of prefixes    [...]   [...]    [...]   [...]   [...]
No. of points      [...]   [...]    [...]   [...]   [...]

Maximum depth (primary search structure)

Maximum depth      Paix    MaeEast  AADS    PB      MaeWest
LC-Trie            [...]   [...]    [...]   [...]   [...]
AST (min Time)     [...]   [...]    [...]   [...]   [...]
AST (min Size)     [...]   [...]    [...]   [...]   [...]

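Throughput in Mega-Lookups/second

Same Probability   Paix    MaeEast  AADS    PB      MaeWest
LC-Trie            [...]   [...]    [...]   [...]   [...]
AST (min Time)     [...]   [...]    [...]   [...]   [...]
AST (min Size)     [...]   [...]    [...]   [...]   [...]

Uniform Random     Paix    MaeEast  AADS    PB      MaeWest
LC-Trie            [...]   [...]    [...]   [...]   [...]
AST (min Time)     [...]   [...]    [...]   [...]   [...]
AST (min Size)     [...]   [...]    [...]   [...]   [...]

Mixed              Paix    MaeEast  AADS    PB      MaeWest
LC-Trie            [...]   [...]    [...]   [...]   [...]
AST (min Time)     [...]   [...]    [...]   [...]   [...]
AST (min Size)     [...]   [...]    [...]   [...]   [...]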

Note. The parameters for the $AST_{minT}$ and the $AST_{minS}$ variants have been chosen for the mixed traffic and kept fixed for the other experiments. As a consequence, occasionally $AST_{minS}$ can be faster than $AST_{minT}$.

Note. Each entry of the throughput table is the minimum of eight runs with the same data, so as to eliminate the influence of concurrent processes. The queries are made in random order so as to avoid bias due to temporal caching.

Size of Data Structures in Bytes

We distinguish the storage used essentially to hold the input and the storage used to support searching. The search structures for LC-Trie are the Trie and the Prefix vector (see [...]). The total storage size is the sum of the size of the search structures and the size of the input-holding data structures.

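Search Structures        Paix    MaeEast  AADS    PB      MaeWest
LC-Trie                  [...]   [...]    [...]   [...]   [...]
AST (min Time)           [...]   [...]    [...]   [...]   [...]
AST (min Memory)         [...]   [...]    [...]   [...]   [...]

Base + Nexthop vectors   Paix    MaeEast  AADS    PB      MaeWest
LC-Trie                  [...]   [...]    [...]   [...]   [...]
AST                      [...]   [...]    [...]   [...]   [...]

Total Memory             Paix    MaeEast  AADS    PB      MaeWest
LC-Trie                  [...]   [...]    [...]   [...]   [...]
AST (min Time)           [...]   [...]    [...]   [...]   [...]
AST (min Memory)         [...]   [...]    [...]   [...]   [...]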

Figure [...]: Throughput with Uniform Random Traffic (throughput in Mega-Lookups/second versus number of prefixes, for LC-Trie, AST_minT and AST_minS).

Figure [...]: Size of Search Structures, excluding input descriptors (size of search structures in KBytes versus number of prefixes, for LC-Trie, AST_minT and AST_minS).

References

Anat Bremler-Barr, Yehuda Afek, and Sariel Har-Peled. Routing with a clue. In SIGCOMM, pages [...].

A. Broder and M. Mitzenmacher. Using multiple hash functions to improve IP lookups.

Gene Cheung and Steven McCanne. Optimal routing table design for IP address lookups under memory constraints. In INFOCOM, pages [...].

Tzi-cker Chiueh and Prashant Pradhan. High performance IP routing table lookup using CPU caching. In INFOCOM, pages [...].

Pierluigi Crescenzi, Leandro Dardini, and Roberto Grossi. IP address lookup made fast and simple. In European Symposium on Algorithms, pages [...].

Mikael Degermark, Andrej Brodnik, Svante Carlsson, and Stephen Pink. Small forwarding tables for fast routing lookups. In SIGCOMM, pages [...].

Ergun, Sahinalp, Sharp, and Sinha. Biased skip lists for highly skewed access patterns. In ALENEX: International Workshop on Algorithm Engineering and Experimentation, LNCS, pages [...].

P. Gupta, S. Lin, and N. McKeown. Routing lookups in hardware at memory access speeds.

Pankaj Gupta, Balaji Prabhakar, and Stephen P. Boyd. Near optimal routing lookups with bounded worst case performance. In INFOCOM, pages [...].

Nen-Fu Huang, Shi-Ming Zhao, Jen-Yi Pan, and Chi-An Su. A fast IP routing lookup scheme for gigabit switching routers. In INFOCOM, pages [...].

D. E. Knuth. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, Reading, MA.

Butler W. Lampson, Venkatachary Srinivasan, and George Varghese. IP lookups using multiway and multicolumn search. IEEE/ACM Transactions on Networking.

Anthony J. McAuley and Paul Francis. Fast routing table lookup using CAMs. In INFOCOM, pages [...].

Nick McKeown. Hot interconnects tutorial slides. Stanford University, available at http://klamath.stanford.edu/talks.

K. Mehlhorn and S. Näher. Bounded ordered dictionaries in O(log log n) time and O(N) space. Inform. Process. Lett.

K. Mehlhorn. Data Structures and Algorithms, volume 3: Multi-Dimensional Searching and Computational Geometry. EATCS Monographs on Theoretical Computer Science. Springer-Verlag. (W. Brauer, G. Rozenberg, and A. Salomaa, eds.)

Tim Moors and Antonio Cantoni. Cascading content-addressable memories. IEEE Micro, June [...].

D. R. Morrison. PATRICIA: practical algorithm to retrieve information coded in alphanumeric. Journal of ACM, October [...].

Girija Narlikar and Francis Zane. Performance modeling for fast IP lookups. In Proc. ACM SIGMETRICS, June [...].

P. Newman, G. Minshall, T. Lyon, and L. Huston. IP switching and gigabit routers. IEEE Communications Magazine, pages [...], January [...].

S. Nilsson and G. Karlsson. IP-Address Lookup Using LC-Tries. IEEE Journal on Selected Areas in Communications.

W. Pugh. Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, June [...].

M. A. Ruiz-Sanchez and W. Dabbous. Un mécanisme optimisé de recherche de route IP. In Proceedings of CFIP, Colloque Francophone sur l'Ingénierie des Protocoles, pages [...], October [...].

Miguel A. Ruiz-Sanchez, Ernst W. Biersack, and Walid Dabbous. Survey and taxonomy of IP address lookup algorithms. IEEE Network, March/April [...].

Jonathan Sharp, Funda Ergun, Suvo Mittra, Cenk Sahinalp, and Rakesh Sinha. A dynamic lookup scheme for bursty access patterns. In Proceedings of the Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), pages [...], Los Alamitos, CA, April [...]. IEEE Computer Society.

Sandeep Sikka and George Varghese. Memory-efficient state lookups with fast updates. In Proceedings of SIGCOMM, pages [...].

Keith Sklower. A tree-based packet routing table for Berkeley Unix. In USENIX Winter, pages [...].

Venkatachary Srinivasan and George Varghese. Faster IP lookups using controlled prefix expansion. In Measurement and Modeling of Computer Systems, pages [...].

S. Suri, G. Varghese, and P. Warkhede. Multiway range trees: Scalable IP lookup with fast updates. Technical Report [...], Washington University in St. Louis, Dept. of Computer Science.

Peter van Emde Boas, R. Kaas, and E. Zijlstra. Design and implementation of an efficient priority queue. Mathematical Systems Theory.

Marcel Waldvogel, George Varghese, Jon Turner, and Bernhard Plattner. Scalable high speed IP routing lookups. In SIGCOMM, pages [...].

Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Information Processing Letters, August [...].