Dissertation zur Erlangung des Doktorgrades der Fakultät für Angewandte Wissenschaften der Albert-Ludwigs-Universität Freiburg

Algorithms and Data Structures for IP Lookup, Packet Classification and Conflict Detection

Christine Maindorfer

Dekan der Fakultät für Angewandte Wissenschaften: Prof. Dr. Hans Zappe

Betreuer: Prof. Dr. Thomas Ottmann

Zweitgutachterin: Prof. Dr. Susanne Albers

Tag der Disputation: 2.3.2009

Zusammenfassung

Die Hauptaufgabe eines Internet-Routers besteht in der Weiterleitung von Paketen. Um den nächsten Router auf dem Weg zum Ziel zu bestimmen, wird der Header eines jeden Datenpaketes, welcher u.a. die Zieladresse enthält, inspiziert und gegen eine Routertabelle abgeglichen. Im Falle, dass mehrere Präfixe in der Routertabelle mit der Zieladresse übereinstimmen, wird in der Regel eine Strategie gewählt, die als “Longest Prefix Matching” bekannt ist. Hierbei wird von allen möglichen Aktionen diejenige ausgewählt, die durch das längste mit der Adresse übereinstimmende Präfix festgelegt ist. Zur Lösung dieses sogenannten IP-Lookup-Problems sind zahlreiche Algorithmen und Datenstrukturen vorgeschlagen worden. Änderungen in der Netzwerktopologie aufgrund von physikalischen Verbindungsausfällen, der Hinzunahme von neuen Routern oder Verbindungen führen zu Aktualisierungen in den Routertabellen. Da die Performanz der IP-Lookup-Einheit einen entscheidenden Einfluss auf die Gesamtperformanz des Internets hat, ist es entscheidend, dass IP-Lookup sowie Aktualisierungen so schnell wie möglich durchgeführt werden. Um diese Operationen zu beschleunigen, sollten Routertabellen so implementiert werden, dass Lookup und Aktualisierungen gleichzeitig ausgeführt werden können. Um zu sichern, dass auf Suchbäumen basierte dynamische Routertabellen nicht durch Updates degenerieren, unterlegt man diese mit einer balancierten Suchbaumklasse. Relaxierte Balancierung ist ein gebräuchliches Konzept im Design von nebenläufig implementierten Suchbäumen. Hierbei werden die Balanceoperationen ggf. auf Zeitpunkte verschoben, in denen keine Suchprozesse im gleichen Teil des Baumes durchgeführt werden. Der erste Teil dieser Dissertation untersucht die Hypothese, dass ein relaxiert balanciertes Schema für dynamische Routertabellen besser geeignet ist als ein Schema, welches strikte Balancierung verwendet.
Dazu schlagen wir den relaxiert balancierten min-augmentierten Bereichssuchbaum vor und vergleichen diesen mit der strikt balancierten Variante im Rahmen eines Benchmarks, wobei echte IPv4-Routerdaten verwendet werden. Um eine Plausibilitätsbetrachtung anstellen zu können, welche die Korrektheit der verschiedenen Lockingstrategien untermauert, wird darüber hinaus eine interaktive Visualisierung des relaxiert balancierten min-augmentierten Bereichssuchbaums präsentiert.

Des Weiteren stellen IP-Router “Policy-basierte” Routing-Mechanismen (PBR) zur Verfügung, welche das bestehende, auf Zieladressen basierende Routing ergänzen. PBR bietet unter anderem die Möglichkeit, “Quality of Service” (QoS) sowie Netzwerksicherheitsbestimmungen, sogenannte Firewalls, zu unterstützen. Um PBR zur Verfügung stellen zu können, müssen Router mehrere Paketfelder wie zum Beispiel die Quell- und Zieladresse, Port und Protokoll inspizieren, um Pakete in sogenannte “Flows” zu klassifizieren. Dies erfordert, eine gegebene Menge von vordefinierten d-dimensionalen Filtern zu durchsuchen, wobei die Anzahl der zu inspizierenden Paketfelder der Dimension d entspricht. Geometrisch gesprochen werden Filter durch d-dimensionale Hyperrechtecke und Pakete durch d-dimensionale Punkte repräsentiert. Paketklassifikation bedeutet nun, für ein weiterzuleitendes Paket das am besten passende Hyperrechteck zu finden, welches den Punkt enthält.

Der R-Baum ist eine mehrdimensionale Indexstruktur zur dynamischen Verwaltung von räumlichen Daten, welche Punkt- und Enthaltenseinsanfragen unterstützt. Der R-Baum und dessen Varianten wurden bis dato noch nicht auf ihre Eignung für das Paketklassifizierungsproblem hin untersucht. Im zweiten Teil werden wir eruieren, ob der weitverbreitete R*-Baum zur Lösung dieses Problems geeignet ist. Dazu wird dieser mit zwei repräsentativen Klassifizierungsalgorithmen im Rahmen eines Benchmark-Tests verglichen.
Die Simulationsumgebung ist statisch, d.h. es finden keine Filteraktualisierungen statt. Die Mehrheit der vorgeschlagenen Klassifizierungsalgorithmen unterstützt inkrementelle Aktualisierungen nicht auf eine effiziente Weise. Erweist sich der R*-Baum als geeignet, ist dieser Benchmark ein Sprungbrett für eine Untersuchung im dynamischen Fall.

Falls mehrere Filter auf ein weiterzuleitendes Paket anwendbar sind, wird ein sogenannter Tiebreaker verwendet, um den am besten passenden Filter zu bestimmen. Übliche Tiebreaker sind: (i) wähle den “ersten passenden” Filter, (ii) den Filter mit höchster Priorität und (iii) den “spezifischsten” Filter. Es wurde festgestellt, dass nicht jede Policy durch die Vergabe von Prioritäten durchgesetzt werden kann, und vorgeschlagen, in jenen Fällen den “spezifischsten Filter”-Tiebreaker anzuwenden. Jedoch ist dieser Tiebreaker nur realisierbar, wenn für jedes Paket der spezifischste Filter wohldefiniert ist. Ist dies nicht der Fall, sagt man, die Filtermenge sei widersprüchlich.

Im letzten Teil dieser Dissertation schlagen wir einen Algorithmus zur Konfliktaufdeckung und -beseitigung für den statischen eindimensionalen Fall vor, wobei jeder Filter durch ein beliebiges Intervall spezifiziert ist. Weiterhin zeigen wir, dass, wenn zur Lösung dieses Problems eine partiell persistente Datenstruktur verwendet wird, diese Struktur auch IP-Lookup unterstützt.

Abstract

The major task of an Internet router is to forward packets towards their final destination. When a router receives a packet from an input link interface, it uses the packet's destination address to look up a routing table. The result of the lookup provides the next hop address to which the packet is forwarded. Routers only need to determine the next best hop toward a destination, not the complete path to the destination. Changes in network topologies due to physical link failures, link repairs or the addition of new routers and links lead to updates in the routing database. Since the performance of the lookup device plays a crucial role in the overall performance of the Internet, it is important that lookup and route update operations are performed as fast as possible. To accelerate lookup and update operations, routing tables must be implemented in a way that they can be queried and modified concurrently by several processes. Relaxed balancing has become a commonly used concept in the design of concurrent search algorithms.

The first part investigates the hypothesis that a relaxed balancing scheme is better suited for search-tree based dynamic IP router tables than a scheme that utilizes strict balancing. To this end we propose the relaxed balanced min-augmented range tree and benchmark it against the strictly balanced variant using real IPv4 routing data. Further, in order to carry out a plausibility consideration, which corroborates the correctness of the proposed locking schemes, we present an interactive visualization of the relaxed balanced min-augmented range tree.

Enhanced IP routers further provide policy-based routing (PBR) mechanisms, complementing the existing destination-based routing scheme. PBR provides a mechanism for implementing Quality of Service (QoS), i.e., certain kinds of traffic receive differentiated, preferential service. For example, time-sensitive traffic such as voice should receive higher QoS guarantees than less time-sensitive traffic such as file transfers or e-mail. Besides QoS, PBR further provides a mechanism to enforce network security policies. PBR requires network routers to examine multiple fields of the packet header in order to classify packets into “flows”. Flow identification entails searching a table of predefined filters to identify the appropriate flow based on criteria including source and destination IP address, ports, and protocol type. Geometrically speaking, classifying an arriving packet is equivalent to finding the best matching hyperrectangle among all hyperrectangles that contain the point representing the packet.

The R-tree and its variants have not been experimentally evaluated and benchmarked for their eligibility for the packet classification problem. In the second part we investigate whether the popular R*-tree is suited for packet classification. For this purpose we will benchmark the R*-tree against two representative classification algorithms in a static environment. Most of the proposed classification algorithms do not support fast incremental updates. If the R*-tree proves to be suitable in a static classification scenario, then the benchmark is a stepping stone for benchmarking R*-trees in a dynamic classification environment, i.e., where classification is intermixed with filter updates.

If a packet matches multiple filters, a tiebreaker is used in order to determine the best matching filter among all matching filters. Common tiebreakers are: (i) first matching filter, (ii) highest priority filter (HPF) and (iii) most specific filter (MSTB). However, not every policy can be enforced by assigning priorities. In these cases, MSTB should be used instead. Yet, the most specific tiebreaker is only feasible if for each packet p the most specific filter that applies to p is well-defined. If this is not the case, the filter set is said to be conflicting.

In the last part of the thesis we propose a conflict detection and resolution algorithm for static one-dimensional range tables, i.e., where each filter is specified by an arbitrary range. Further, we show that by making use of partial persistence, the structure can also be employed for IP lookup.

Acknowledgments

The work presented in this thesis was carried out during my time as a research assistant at the Institute of Computer Science at the University of Freiburg.

Inexpressible thanks go to my research advisor, Prof. Dr. Thomas Ottmann. I sincerely appreciate his tremendous patience, especially early in my studies when I was a rather naïve researcher, his time for countless discussions and diligent mentorship. His keen sense of untraveled paths in the world of research is truly inspiring. It has also been an honor to have Prof. Dr. Susanne Albers as my second referee. I highly appreciate her time and effort to appraise my thesis. I also would like to thank Prof. Dr. Christian Schindelhauer and Prof. Dr. Wolfram Burgard for serving on my committee.

I gratefully acknowledge the support of my work through a grant from the German Research Foundation (DFG) within the program “Algorithmik großer und komplexer Netzwerke”.

I would like to thank my current and former collaborators in our research group: Frank Dal-Ri, Dr. Tobias Lauer, Khaireel Mohamed, Robin Pomplun, Christoph Hermann, Martina Welte, Dr. Wolfgang Hürst, Dr. Peter Leven and Elisabeth Patschke, for all their advice and for the enjoyable time.

I have also had the opportunity to advise or co-advise several bachelor and diploma theses. Specifically, I would like to thank Thorsten Seddig and Waldemar Wittmann, whose support significantly enhanced my research. Further, I am particularly grateful to my friend and post-graduate assistant Bettina Bär, whose effort and dedication greatly advanced a substantial part of this work.

I would like to thank my friend Anita Willmann for her friendship since childhood and for accompanying me to faraway places for special events (“I booked a seat right next to yours”). I also thank her and Dr. Tobias Lauer for proofreading parts of this thesis. I would like to offer my most heartfelt thanks to my husband and best friend, Ingo Daniel Maindorfer, for his love, companionship and undying support. Life would not be what it is without him. I would like to thank my parents for their love and for their consummate support and encouragement throughout my life. It is impossible to list all that they have done and still do for me. I thank my sister Arlette Patricia for the joy we share since she entered this planet. I love you all and I will never stop thanking God for you.

Contents

1 Introduction
1.1 Geometric interpretation of IP lookup and packet classification
1.2 Objectives of this dissertation
1.3 Organization

I IP Address Lookup

2 Introduction
2.1 Organization of part I
2.2 Another geometric interpretation of IP lookup
2.3 Related work

3 Min-augmented Range Trees
3.1 Longest matching prefix
3.2 Update operations
3.3 Comparison with priority search trees and priority search pennants

4 Relaxed Balancing
4.1 Red-black trees
4.1.1 Insertions
4.1.2 Deletions
4.2 Relaxed balanced red-black trees
4.2.1 Interleaving updates
4.2.2 Concurrent handling of rebalancing transformations

5 Relaxed Min-Augmented Range Trees
5.1 Longest matching prefix
5.2 Interleaving updates

6 Concurrency Control
6.1 The deadlock problem
6.2 Strictly balanced trees
6.2.1 Concurrent MART

6.3 Relaxed balanced trees
6.3.1 Concurrent RMART

7 Interactive Visualization of the RMART
7.1 Application framework
7.2 Architecture
7.3 The graphical user interface

8 Experimental Results
8.1 The MRT format
8.2 Flow characteristics in internetwork traffic
8.2.1 Locality in internetwork traffic
8.2.2 Statistical properties of flows
8.3 Generation of sequences of operations
8.4 Test setup
8.5 Comparison of the RMART and the MART
8.5.1 Solely lookups
8.5.2 Solely updates
8.5.3 Various update frequencies
8.5.4 Résumé of experimental results
8.6 Benchmark on Sun Fire X4600
8.7 Implementing the RMART in hardware

9 Conclusions and Future Directions

II Packet Classification

10 Introduction
10.1 Goal of this part
10.2 Organization of part II
10.3 Related work

11 R-trees
11.1 The original R-tree
11.1.1 Query processing
11.1.2 Query optimization criteria
11.1.3 Updates
11.2 R-tree variants
11.2.1 The R+-tree
11.2.2 The R*-tree
11.2.3 Compact R-trees
11.2.4 cR-trees
11.2.5 Static versions of R-trees

12 Packet Classification using R-trees
12.1 Performance evaluation
12.1.1 Filter sets
12.1.2 Simulation results of R*-tree
12.1.3 Benchmark of R*-tree and HyperCuts
12.1.4 Benchmark of R*-tree and RFC
12.2 Conclusions and future directions

III Conflict Detection and Resolution

13 Introduction
13.1 Organization of this part
13.2 Preliminaries
13.3 Related work
13.3.1 Online conflict detection and resolution
13.3.2 Offline conflict detection and resolution

14 Detecting and Resolving Conflicts
14.1 The output-sensitive solution to the one-dimensional offline problem
14.1.1 Status structures
14.1.2 Handling event points
14.1.3 The sweepline environment
14.1.4 Running Slab-Detect
14.2 Experimental results
14.3 Adapting Slab-Detect under the HPF rule
14.3.1 Status structures
14.3.2 Handling event points
14.4 Setting up IP lookup with Slab-Detect
14.5 Contributions and concluding remarks

IV Summary of Contributions

Bibliography

Chapter 1

Introduction

The Internet is a global web of autonomous networks, a “network of networks”, interconnected with routers. Each network, or Autonomous System (AS), is managed by its own authority and contains its own internal network of routers and subnetworks. Network “reachability” information is exchanged via routing protocols. A dynamic routing protocol adjusts to changing network topologies, which are indicated in update messages that are exchanged between routers. If a link goes down or becomes congested, the routing protocol makes sure that other routers know about the change. From these updates a router constructs a forwarding table which contains a set of network addresses and a reference to the interface that leads to that network.

Routers in different autonomous systems use the Border Gateway Protocol (BGP) to exchange network reachability information. Routing among autonomous systems is called exterior routing or “interdomain routing”. After applying local policies, a BGP router selects a single best route and advertises it to other routers within the same AS. Interior routing is referred to as “intradomain routing”. The primary interior routing protocol in use today is Open Shortest Path First (OSPF).

Information travels in packets across a network that consists of multiple paths to a destination. A packet is conceptually divided into two pieces: the header and the payload. The header contains addressing and control fields, while the payload carries the actual data to be sent over the internetwork. When a packet arrives at a router, the router consults its forwarding table to determine the best way to forward that packet, i.e., the next hop address. However, routers only need to determine the next best hop toward a destination, not the complete path to the destination.

The TCP/IP protocol suite provides the internetwork addressing scheme and transport scheme for router-connected networks [1].
In order to uniquely identify Internet hosts, each host is assigned an IP address. An IP address is a unique number that contains two parts: a network address and a host address. The network address is used when forwarding packets across interconnected networks. It defines the destination network, and routers along the way know how to forward the

packet based on the network address. When the packet arrives at the destination network, the host portion of the IP address identifies the destination host. Currently, the vast majority of Internet traffic utilizes Internet Protocol version 4 (IPv4). IPv4 assigns 32-bit addresses to Internet hosts, which limits the address space to 2^32 possible unique addresses. With the rapid growth of the Internet through the 1990s, there was a rapid reduction in the number of free IP addresses available under IPv4 [2]. The IETF settled on IPv6, recommended in January 1995 in RFC 1752, sometimes also referred to as the “Next Generation Internet Protocol”, or IPng [2]. IPv6 assigns 128-bit addresses to Internet hosts. The predicted date when the Regional Internet Registry IPv4 unallocated address pool will be exhausted is November 2011 [3]. A related prediction is the exhaustion of the Internet Assigned Numbers Authority IPv4 unallocated address pool by the end of 2010 [3]. Currently, we are in a transition phase, i.e., IPv4 and IPv6 coexist on the same machines (technically often referred to as “dual stack”) and are transmitted over the same network links.

While computers work with IP addresses as 32-bit (or 128-bit) binary values, humans normally use the dotted-decimal notation. A binary IPv4 address and its dotted-decimal equivalent are, e.g., 11000000.10101000.00001010.00000110 = 192.168.10.6. Note that the 32-bit address is divided into four eight-bit fields called octets. Each octet in an IP address ranges in value from a minimum of 0 to a maximum of 255. Therefore, the full range of IP addresses is from 0.0.0.0 through 255.255.255.255. Historically, the IP address space was divided into three main classes, where each class had a fixed size network address: Class A (16777214 hosts), Class B (65534 hosts), and Class C (254 hosts) [1]. The class was determined by the most significant bits of an IP address.
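The conversion between the two notations can be checked with a short sketch (an illustrative snippet; the function names are invented for this example):

```python
def to_binary(dotted: str) -> str:
    """Render a dotted-decimal IPv4 address as four 8-bit binary octets."""
    return ".".join(format(int(octet), "08b") for octet in dotted.split("."))

def to_dotted(binary: str) -> str:
    """Inverse: parse four 8-bit binary octets back to dotted-decimal."""
    return ".".join(str(int(octet, 2)) for octet in binary.split("."))

print(to_binary("192.168.10.6"))
# 11000000.10101000.00001010.00000110
print(to_dotted("11000000.10101000.00001010.00000110"))
# 192.168.10.6
```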
Most organizations which required a larger address space than Class C were allocated a block of Class B addresses, even though their network consumed only a fraction of the addresses. During the 1980s, the need for more flexible addressing schemes became increasingly apparent. This led to the gradual development of subnetting and Classless Inter-Domain Routing (CIDR). CIDR was introduced in 1993 and is the latest refinement to the way IP addresses are interpreted [4]. CIDR allows routing protocols to aggregate network addresses into single routing table entries, which reduces the amount of packet forwarding information stored by each router. These aggregations, commonly called CIDR blocks, share an initial sequence of bits in the binary representation of their IP addresses. IPv4 CIDR blocks are identified using a syntax similar to that of IPv4 addresses: a four-part dotted-decimal address, followed by a slash, then a number from 0 to 32: A.B.C.D/k. The dotted-decimal portion is interpreted, like an IPv4 address, as a 32-bit binary number that has been broken into four octets. The number following the slash is the prefix length, the number of shared initial bits, counting from the most significant bit. For example, in the CIDR block 206.13.01.48/25, the “/25” indicates that the first 25 bits are used to identify the unique network, leaving the remaining bits, which are commonly represented by a wildcard

Figure 1.1: Example of Longest Prefix Matching for a 7-bit destination address; 11011* is the longest matching prefix; the corresponding next hop is seven.

’*’, to identify the specific host. An IP address is part of a CIDR block and is said to match the CIDR prefix if the initial k bits of the address and the CIDR prefix are the same. The task of resolving the next hop for an incoming packet is referred to as IP lookup. A route lookup requires finding the longest matching prefix among all matching prefixes for the given destination address. An example of Longest Prefix Matching (LPM) for a 7-bit search key is provided in Figure 1.1.
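The matching rule can be made concrete with a small sketch (a toy illustration; the prefix table and function names are invented for this example, not taken from the thesis):

```python
def matches(address_bits: str, prefix: str) -> bool:
    """An address matches a prefix if it begins with the prefix's fixed bits."""
    return address_bits.startswith(prefix.rstrip("*"))

def longest_prefix_match(address_bits: str, table: dict):
    """Return the next hop stored with the longest matching prefix, or None."""
    best = None
    for prefix in table:
        if matches(address_bits, prefix):
            if best is None or len(prefix.rstrip("*")) > len(best.rstrip("*")):
                best = prefix
    return table[best] if best is not None else None

# A toy 7-bit routing table in the spirit of Figure 1.1.
table = {"1*": 1, "110*": 3, "11011*": 7}
print(longest_prefix_match("1101100", table))
# 7: all three prefixes match, but 11011* is the longest
```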

The Transmission Control Protocol (TCP) provides a reliable transmission service for IP packets [1]. While TCP provides these reliable services, it depends on IP to deliver packets. Reliable data delivery services are critical for applications such as file transfers, database services, transaction processing, and other mission-critical applications in which every packet must be delivered, guaranteed. TCP uses sequence numbers so that the destination can reorder packets and determine if a packet is missing. It further uses a cumulative acknowledgment scheme, where the receiver sends an acknowledgment signifying that it has received all data preceding the acknowledged sequence number. Sequence numbers and acknowledgments make it possible for TCP to provide in-order delivery of packets at the destination host, discard duplicate packets, and retransmit lost packets. TCP identifies applications using 16-bit port numbers carried in the transport header, which is appended to the IP header. The type of transport protocol carried in the IP header determines the format of the transport protocol header following the IP header in the packet.

Best-effort delivery describes a network service in which the network does not provide any guarantees that data is delivered or that a user is given a guaranteed quality of service level or a certain priority [1]. In a best-effort network all users obtain best-effort service, meaning that they obtain unspecified variable bit rate and delivery time, depending on the current traffic load. Note that TCP does not reserve any resources in advance, and does not provide any guarantees regarding quality of service, for example bit rate. In that sense, it can be considered as best-effort communication. Conventional IP routers only provide best-effort service.

Enhanced IP routers further provide policy-based routing (PBR) mechanisms, complementing the existing destination-based routing scheme [5]. PBR provides a mechanism for expressing and implementing routing of data packets based on the policies defined by the network administrators. For example, mission-critical and time-sensitive traffic such as voice should receive higher quality of service (QoS) guarantees than less time-sensitive traffic such as file transfers or e-mail. Besides QoS, PBR further provides a mechanism to enforce network security policies.

PBR requires network routers to examine multiple fields of the packet header in order to categorize packets into “flows”. A flow may be thought of as the communication traffic generated by a specific application traveling between a specific set of hosts or subnetworks. Hence, flows are considered to be sequences of packets with an n-tuple of common values such as source and destination addresses. The process of categorizing packets into flows in an Internet router is called packet classification. The function of the packet classification system is to check packet headers against a set of predefined filters. The relevant packet header fields include source and destination IP addresses, source and destination port numbers, protocol and others. Formally, a filter set consists of a finite set of n filters, f1, f2, ..., fn.
Each filter is a combination of d header field specifications, h1, h2, ..., hd. Each header field specifies one of four kinds of matches: exact match, prefix match, range match, or masked-bitmap match. A packet p is said to match a filter fi if and only if the header fields h1, h2, ..., hd match the corresponding fields in fi in the specified way. Each filter fi has an associated action that determines how a packet p is handled if p matches fi. A collection of filters is called a classifier. An example classifier is shown in Table 1.1. The header of an arriving packet may satisfy the conditions of more than one filter. In this case the filter with the highest priority among all the matching filters is commonly used. Using the example classifier in Table 1.1, an incoming packet p with header (10 ..., 0011 ..., TCP, 1) matches f2 and f3. Assuming that f2 has higher priority than f3, f2 will be returned.

Filter  SA       DA       Prot  DP      P
f1      11*      *        TCP   [3:15]  1
f2      100111*  *        TCP   [1:1]   2
f3      1011*    0011*    *     [1:15]  3
f4      10*      011*     UDP   [3:3]   4
f5      0*       11*      TCP   [0:1]   5
f6      0*       100111*  UDP   [0:15]  6
f7      *        *        TCP   [3:5]   7

Table 1.1: Example classifier of seven filters classifying on four fields (source and destination address, protocol and destination port). Each filter has an associated priority tag P; wildcard fields are denoted with *.
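The priority tiebreak can be sketched over a fragment of Table 1.1 (a simplified illustration: addresses and protocols are modeled as strings with prefix/wildcard match, ports as inclusive ranges; the function and field names are invented for this example):

```python
def field_matches(value, spec) -> bool:
    """Wildcard, inclusive port range, or prefix/exact string match."""
    if spec == "*":
        return True
    if isinstance(spec, tuple):                    # port range [lo:hi]
        return spec[0] <= value <= spec[1]
    return value.startswith(spec.rstrip("*"))      # address prefix / protocol

def classify(packet, filters):
    """Return the matching filter with the highest priority (lowest tag)."""
    matching = [f for f in filters
                if all(field_matches(v, s) for v, s in zip(packet, f["spec"]))]
    return min(matching, key=lambda f: f["prio"], default=None)

# f1 and f7 from Table 1.1: (SA, DA, Prot, DP) with priority tags 1 and 7.
filters = [
    {"name": "f1", "spec": ("11*", "*", "TCP", (3, 15)), "prio": 1},
    {"name": "f7", "spec": ("*", "*", "TCP", (3, 5)), "prio": 7},
]
pkt = ("1100000000", "0011000000", "TCP", 3)       # matches both f1 and f7
print(classify(pkt, filters)["name"])
# f1: both filters match, and f1 carries the higher priority
```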

Figure 1.2: The longest matching prefix corresponds to the most specific interval of all intervals that contain the query point.

1.1 Geometric interpretation of IP lookup and packet classification

In geometric terms, a prefix b1 ... bk* can be mapped to an interval of the form [b1 ... bk 0 ... 0, b1 ... bk 1 ... 1]. For example, if the prefix length is limited by 5, 0010* is represented by [4, 5]. An incoming packet with destination address b1, ..., bw can be mapped to a point p ∈ U, where U = [0, 2^w − 1] and w = 32 for IPv4 and w = 128 for IPv6. The longest matching prefix corresponds to the most specific interval of all intervals that contain the query point. An interval f1 is more specific than an interval f2 iff f1 ⊂ f2. If two intervals partially overlap, neither is more specific than the other. Figure 1.2 shows an example. A set of intervals specified by prefixes has the property that any two intervals are either disjoint or one is completely contained in the other. Hence, for each query point p, there is a uniquely defined most specific interval that contains p, provided that the default filter spanning the entire universe U is included in the set.
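The prefix-to-interval mapping can be written down directly (a sketch for a small universe; w denotes the address width, and the helper names are invented for this example):

```python
def prefix_to_interval(prefix: str, w: int) -> tuple:
    """Map b1...bk* to the integer interval [b1...bk 0...0, b1...bk 1...1]."""
    bits = prefix.rstrip("*")
    k = len(bits)
    low = (int(bits, 2) << (w - k)) if bits else 0
    high = low + (1 << (w - k)) - 1
    return (low, high)

def most_specific(point: int, prefixes, w: int):
    """Smallest containing interval; well-defined for sets of prefixes."""
    containing = [p for p in prefixes
                  if prefix_to_interval(p, w)[0] <= point <= prefix_to_interval(p, w)[1]]
    return min(containing,
               key=lambda p: prefix_to_interval(p, w)[1] - prefix_to_interval(p, w)[0],
               default=None)

print(prefix_to_interval("0010*", 5))
# (4, 5), as in the example above
print(most_specific(4, ["*", "0*", "0010*"], 5))
# 0010*: all three intervals contain 4, but [4, 5] is the smallest
```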

We have seen that a prefix represents a contiguous interval on the number line. Similarly, a two-dimensional filter is represented by an axis-parallel rectangle in the two-dimensional Euclidean space. A filter f = (prs*, prd*), where prs is an i-bit prefix and prd is a j-bit prefix, is represented by a 2^(w−i) × 2^(w−j) rectangle, where w is the maximum prefix length. Generalizing, a filter in d dimensions represents a d-dimensional hyperrectangle in d-dimensional space. A classifier is therefore a collection of rectangles, each of which is labeled with a priority. An incoming packet header represents a point with coordinates equal to the values of the header fields corresponding to the dimensions. For example, Figure 1.3 shows the geometric representation of the classifier in Table 1.1 for the source and destination address fields and w = 10. Filter f7 covers the entire space 2^10 × 2^10. Given this geometric representation, classifying an arriving packet is equivalent to finding the highest priority rectangle among all rectangles that contain the point representing the packet. For example, the point p in Figure 1.3 is contained in the filters with priorities five and seven. If lower values represent higher priorities, then filter f5 will be returned.
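The geometric view can be sketched for the source/destination rectangles of f5 and f7 from Table 1.1 with w = 10 (an illustrative snippet; the function names and the concrete query point are invented for this example):

```python
def prefix_to_interval(prefix: str, w: int) -> tuple:
    """Map a prefix to its integer interval [low, high] in a w-bit universe."""
    bits = prefix.rstrip("*")
    low = (int(bits, 2) << (w - len(bits))) if bits else 0
    return (low, low + (1 << (w - len(bits))) - 1)

def classify_point(point, rects, w=10):
    """Highest-priority (lowest value) rectangle containing the 2-D point."""
    hits = []
    for name, (sa, da, prio) in rects.items():
        sa_lo, sa_hi = prefix_to_interval(sa, w)
        da_lo, da_hi = prefix_to_interval(da, w)
        if sa_lo <= point[0] <= sa_hi and da_lo <= point[1] <= da_hi:
            hits.append((prio, name))
    return min(hits, default=None)

# f5 = (0*, 11*) with priority 5 and f7 = (*, *) with priority 7.
rects = {"f5": ("0*", "11*", 5), "f7": ("*", "*", 7)}
print(classify_point((100, 800), rects))
# (5, 'f5'): both rectangles contain the point, f5 has the higher priority
```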

Figure 1.3: The geometric representation of the 10-bit source and destination address fields of the classifier in Table 1.1. Point p represents a packet to be classified.

1.2 Objectives of this dissertation

With a rapid increase in the data transmission link rates and an immense continuous growth in the Internet traffic, efficient lookup and classification techniques are essential for meeting performance demands. The speed and scalability of the IP lookup or packet classification scheme employed largely determines the performance of the router, and hence the Internet as a whole. Therefore, both problems have received much attention in the research community.

Due to the transient nature of network links, routing protocols allow the routers to continually exchange information about the state of the network. There are two strategies to handle table updates. The first employs two copies of the table. Lookups are done on the working table, updates are performed on a shadow table. Periodically, the shadow table replaces the working table. In this mode of operation, packets may be forwarded wrongly. The number of misdirected packets depends on the periodicity with which the working table is replaced by an updated shadow. Further, additional memory is required for the shadow table. The second strategy performs updates directly on the working table. Here, no packet is improperly forwarded. However, IP lookup may be delayed while a preceding update completes.

To accelerate lookup and update processes operating on a single forwarding table, these tables must be implemented in a way that they can be queried and modified concurrently by several processes. If implemented in a concurrent environment, there must be a way to prevent simultaneous reading and writing of the same parts of the structure. A common strategy is to lock the critical parts. In order to allow a high degree of concurrency, only a small part of the structure should be locked at a time. Relaxed balancing has become a commonly used concept in the design of concurrent algorithms. In relaxed balanced data structures, rebalancing is uncoupled from updates and may be arbitrarily delayed. This contrasts with strict balancing, where rebalancing is performed immediately after an update. Hanke [6] presents an experimental comparison of the strictly balanced red-black tree and three relaxed balancing algorithms for red-black trees, using the simulation of a multiprocessor machine. The results indicate that the relaxed schemes have significantly better performance than the strictly balanced version.
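The shadow-table strategy can be sketched as a double-buffer swap (a deliberately simplified, single-lock sketch with an invented class name; a real router would use finer-grained synchronization and a more compact table representation):

```python
import threading

class ShadowTable:
    """Lookups read the working copy; updates go to a shadow table that is
    periodically swapped in. Between swaps, lookups may see stale routes."""

    def __init__(self, routes):
        self._working = dict(routes)
        self._shadow = dict(routes)
        self._lock = threading.Lock()

    def lookup(self, prefix):
        return self._working.get(prefix)        # reads hit the working copy

    def update(self, prefix, next_hop):
        with self._lock:
            self._shadow[prefix] = next_hop     # writers touch only the shadow

    def commit(self):
        with self._lock:
            self._working = dict(self._shadow)  # swap in the updated copy

t = ShadowTable({"11011*": 7})
t.update("11011*", 9)
print(t.lookup("11011*"))  # still 7: the shadow has not been committed yet
t.commit()
print(t.lookup("11011*"))  # now 9
```

The swap replaces the table reference in one step, so lookups never observe a half-applied update; the price is the window of stale answers and the extra memory for the second copy, exactly the trade-off described above.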
Motivated by Hanke’s results, the first part investigates the hypothesis that a relaxed balancing scheme is better suited for search-tree based dynamic IP router tables than a scheme that utilizes strict balancing. To this end, we propose the relaxed balanced min-augmented range tree and benchmark it against the strictly balanced version of the tree using real IPv4 routing data. In order to carry out a plausibility consideration, which corroborates the correctness of the proposed locking schemes, we will present an interactive visualization of the relaxed balanced min-augmented range tree.

The R-tree, one of the most influential multidimensional access methods, was proposed by Guttman in 1984. R-tree applications cover a wide spectrum, from geographical information systems and computer-aided design to computer vision and robotics. R-trees are hierarchical data structures that are used for the dynamic organization of a set of d-dimensional geometric objects. The challenge for R-trees is the following: dynamically maintain the structure in a way that retrieval operations are supported efficiently. Common retrieval operations are range queries, i.e., find all objects that a query region intersects, and point queries, i.e., find all objects that contain a query point. The R-tree and its variants have not been experimentally evaluated and benchmarked for their eligibility for the packet classification problem. In the second part we will investigate whether the popular R*-tree is suited for packet classification in a static environment. To this end we will benchmark the R*-tree against two representative classification algorithms using the ClassBench tool suite. Most of the proposed classification algorithms do not support fast incremental updates. If the R*-tree proves to be suitable in a static classification scenario, then the benchmark is a stepping stone for benchmarking R*-trees in a dynamic classification environment, i.e., one where classification is intermixed with filter updates.

We have seen that filters can lead to ambiguities in the packet classification process. This is due to the fact that packets might match multiple filters, each with a different associated action. Hari et al. [7] noticed that not every policy can be enforced by assigning priorities and applying the filter with the highest priority. The authors suggest a scheme that utilizes the most specific tiebreaker (MSTB), analogous to the most specific tiebreaker in one-dimensional IP lookup. If the most specific tiebreaker is to be applied, it must be ensured that for each packet p there is a well defined most specific filter that applies to p. In one-dimensional prefix tables, any two filters are either disjoint or one is completely contained in the other. Therefore, for an incoming packet p the most specific filter that matches p is well defined. In higher dimensions, filters may partially overlap. Hence, for points falling in the overlap region, the most specific filter may not be defined. Hari et al.’s seminal technique adds a so-called “resolve filter” for each pair of partially overlapping filters, which guarantees that the most specific tiebreaker can be applied. The third part of this dissertation proposes a conflict detection and resolution algorithm for static one-dimensional range tables containing arbitrary ranges. We are motivated to study the one-dimensional case for the following reason. Multidimensional classifiers typically have one or more fields that are arbitrary ranges. Since a solution for multidimensional conflict detection often builds on data structures for the one-dimensional case, it is beneficial to develop efficient solutions for one-dimensional range router tables.

1.3 Organization

The remainder of this dissertation is organized as follows. Each of the three objectives is presented in a separate part. These parts can be read independently of each other. Each part has an introduction, surveys related work, presents the contributions and closes with a summary and future directions. Finally, the dissertation concludes with an overall summary of contributions.

Part I

IP Address Lookup

Chapter 2

Introduction

The Internet is a system of immense scale. Changes in network topologies due to physical link failures, link repairs or the addition of new routers and links happen quite frequently, as indicated by high volumes of routing updates [8]. This information must be flooded to all routers in the network as soon as possible after the event. Routers must then update their routing tables accordingly. Min-augmented range trees (MART) were introduced by Datta and Ottmann as a conceptually simple tree structure for maintaining dynamic IP router tables [9]. When the forwarding table is maintained in a min-augmented range tree, the complexity of IP lookup is in O(h), where h is the height of the tree. Hence it is desirable to keep the underlying search tree balanced. In order to accelerate the lookup and update operations, min-augmented range trees must be implemented in a way that they can be queried and modified concurrently by several processes. Trees with relaxed balance are designed to facilitate fast updating in a concurrent database environment, since the rebalancing tasks can be performed gradually after urgent updates. However, weaker constraints than the usual ones are maintained such that the tree can still be balanced efficiently. Uncoupling was first discussed in connection with red-black trees [10], and later in connection with AVL trees [11]. Since then, several relaxed balancing schemes have been proposed [12] [13] [14] [15]. Relaxed data structures with group updates have been proposed in [16] [17] [18].

In this part we propose the relaxed balanced min-augmented range tree and investigate the hypothesis that the relaxed balanced min-augmented range tree is better suited for the representation of dynamic IP router tables than the strictly balanced version of the tree. To this end, we benchmark these two structures using real IPv4 routing data. To our knowledge, there are no other approaches which examine relaxed balancing in the context of forwarding or packet classification in general. The research and implementation in this part was carried out in collaboration with

Thorsten Seddig, Bettina Bär, Tobias Lauer and Thomas Ottmann. Thorsten Seddig has implemented the RMART within the scope of his diploma thesis [19]. The MART has been implemented by my colleague Tobias Lauer. Bettina Bär has given support in the implementation and realization of the benchmark of the concurrent MART and the RMART. A visualization of the RMART was developed in collaboration with Waldemar Wittmann in line with his bachelor thesis [20].

2.1 Organization of part I

The remainder of this part is organized as follows. Section 2.2 describes another geometric interpretation of IP lookup, while section 2.3 reviews related work. The min-augmented range tree for the representation of dynamic forwarding tables is presented in chapter 3. Relaxed balanced red-black trees, which form the basis of the relaxed balanced min-augmented range tree, are presented in chapter 4. The locking strategies for both the strictly balanced and the relaxed balanced min-augmented range trees are described in chapter 6. An interactive animation of the RMART is presented in chapter 7. Finally, benchmark results are discussed.

2.2 Another geometric interpretation of IP lookup

As we have seen in section 1.1, the longest prefix match problem can be mapped to the geometric problem of finding the shortest interval on a line containing a query point. Intervals on the line can be mapped to points in the plane and vice versa, because both entities are defined by two values. If we map an interval [l, r] with start point l and finish point r to the point (r, l) in the plane, a set of intervals on the line is mapped to a set of points below the main diagonal in the plane. Let us denote this mapping by map1, following the notation of Lu and Sahni [21]. A point p is said to stab an interval [l, r] if p ∈ [l, r]. A stabbing query reports all intervals that are stabbed by a given query point. It has been observed that stabbing queries for sets of intervals on the line can be translated to range queries for so-called south-grounded, semi-infinite ranges of points in the plane [22]. More precisely: p ∈ [l, r] iff map1(l, r) lies to the right of and below the point (p, p). For two intervals [l, r] and [l', r'], the point map1(l, r) lies to the left of and above the point map1(l', r') iff [l, r] is contained in [l', r']. Hence, finding the most specific interval containing a given point p corresponds to finding, for the point p = (p, p) on the main diagonal, the topmost and leftmost point (r, l) that is to the right of and below p, cf. Figure 2.1. Note that if the most specific interval exists, there is always a unique topmost-leftmost point. Thus, solving the dynamic version of the IP lookup problem for prefix filters means maintaining a set of points in the plane for which we can carry out insertions and deletions of points and answer topmost-leftmost queries efficiently.

Figure 2.1: A set of intervals mapped to a set of points. The longest matching prefix is the topmost-leftmost point below the query point p = (p, p).

Topmost-leftmost queries can be reduced to leftmost queries by ensuring that no two points have the same x-coordinate. This can be accomplished by mapping each point (x, y) to the point (2^w x − y + 2^w − 1, y) [21]. Note that it is not possible to have a leftmost point which is not also the topmost point in the semi-infinite range to the right of and below a query point on the main diagonal. This is due to the fact that intervals specified by prefixes have the property that any two intervals are either disjoint or one is completely contained in the other.
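The mapping and query just described can be illustrated with a small brute-force sketch (illustrative code, not the thesis implementation; the function names are our own). It maps the nested intervals a = [0, 7], b = [0, 5], c = [2, 3] and d = [6, 7] to points via map1 and answers a topmost-leftmost query by scanning all points:

```python
def map1(l, r):
    """Map the interval [l, r] to the point (r, l) in the plane."""
    return (r, l)

def topmost_leftmost(points, p):
    """Among points (x, y) to the right of and below (p, p), i.e. with
    x >= p and y <= p, return the topmost one (ties broken leftmost).
    For prefix-derived intervals this is the most specific match."""
    candidates = [(x, y) for (x, y) in points if x >= p and y <= p]
    if not candidates:
        return None
    return max(candidates, key=lambda q: (q[1], -q[0]))

intervals = [(0, 7), (0, 5), (2, 3), (6, 7)]
points = [map1(l, r) for (l, r) in intervals]
assert topmost_leftmost(points, 3) == (3, 2)   # interval [2, 3]
assert topmost_leftmost(points, 6) == (7, 6)   # interval [6, 7]
```

The linear scan stands in for the tree-based queries developed later; it only serves to make the geometric correspondence concrete.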

2.3 Related work

Longest Prefix Matching has received significant attention due to the fundamental role it plays in the performance of Internet routers. If the set of prefixes is small, a linear search through a list of the prefixes sorted in order of decreasing length may be sufficient. The sorting step guarantees that the first matching prefix in the list is the longest matching prefix for the given search key. Linear search is commonly touted as the most memory efficient of all LPM techniques in that the memory requirement is O(n), where n is the number of prefixes in the table. Note that the search time is also O(n). Several more sophisticated techniques have been developed to improve the speed of address lookup. Each technique’s performance can be measured in terms of the time required for lookup, the storage space required and the complexity of updating the filter set when a filter is added, deleted or changed. Many solutions are based on the fundamental trie structure [23]. A trie is a binary tree with labeled branches. Each node v represents a bit-string formed by concatenating the labels of all branches on the path from the root node to v. All the descendants of any one node have a common prefix of the string associated with that node, and the root is associated with the empty string. An example of a

Figure 2.2: Example of Longest Prefix Matching using a binary trie. The values in the nodes denote the associated output link information.

binary trie constructed from the set of prefixes in Figure 1.1 is shown in Figure 2.2. If a node is associated with a prefix, it stores the corresponding output link for packets destined for the respective network. Tries allow finding, in a straightforward way, the longest prefix that matches a given destination address. IP lookup is conducted by traversing the trie using the bits of the destination address of a packet p, starting with the most significant bit. While traversing the trie, every time we visit a node that is associated with a prefix we remember that prefix as the longest match found so far. The last prefix encountered on the path is the longest prefix that matches p [24]. As in the previous examples, the best matching prefix for destination address 1101100 is 11011* and the corresponding output link is seven. Note that the worst-case search time is now O(w), where w is the length of the address and the maximum prefix length in bits. Update operations are also straightforward to implement in binary tries [24]. Inserting a prefix begins by doing a search. When arriving at a node with no branch to take, we insert the necessary nodes. Deleting a prefix starts again with a search, unmarking the node as a prefix and, if necessary, deleting unused nodes. Several schemes have been proposed to improve the lookup performance of binary tries, e.g., multibit tries [25] and shape shifting tries [26]. These strategies collapse several levels of each subtree of a binary trie into a single node that can be searched with a number of memory accesses that is less than the number of levels collapsed. Lu and Sahni [27] propose a method to partition a static IP router table such that each partition is represented using a base structure such as a multibit trie [25] or a hybrid shape shifting trie [28]. The partition results in an overall reduction in the number of memory accesses needed for a lookup and a reduction in the total memory required.
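The lookup and insertion procedures just described can be sketched as follows. This is illustrative code, not the thesis implementation; the prefix 11011* with output link 7 is taken from the running example, while the prefixes 0* and 11* and their links are assumed for illustration (Figure 1.1 is not reproduced here).

```python
class TrieNode:
    def __init__(self):
        self.child = {"0": None, "1": None}
        self.next_hop = None          # set iff a prefix ends at this node

def insert(root, prefix, next_hop):
    """Insert a bit-string prefix, creating nodes along the path."""
    node = root
    for bit in prefix:
        if node.child[bit] is None:
            node.child[bit] = TrieNode()
        node = node.child[bit]
    node.next_hop = next_hop

def lookup(root, address):
    """Return the output link of the longest prefix matching `address`:
    walk the trie bit by bit, remembering the last prefix seen."""
    node, best = root, root.next_hop
    for bit in address:
        node = node.child[bit]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop      # longer match found
    return best

root = TrieNode()
for pfx, hop in [("0", 1), ("11", 2), ("11011", 7)]:
    insert(root, pfx, hop)
assert lookup(root, "1101100") == 7   # best match is 11011*
assert lookup(root, "1100000") == 2   # falls back to 11*
```

Both operations visit at most w nodes, matching the O(w) bounds stated above.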
The fundamental issue with trie-based techniques is that performance and scalability are fundamentally tied to address length. With the future transition to IPv6, it is not clear whether trie-based solutions will be capable of meeting performance demands.

Several solutions utilize the geometric view of the filter set. Lee et al. [29] propose an algorithm which is based on the segment tree. The segment tree is a well-known data structure in computational geometry for handling intervals. The skeleton of the segment tree is static. After the skeleton has been built over the given set of intervals, these intervals can be stored in a dynamic fashion, that is, supporting insertions and deletions. First, the so-called elementary intervals are computed, which will be stored in the leaves. Each node or leaf v stores the interval IntR(v) that it represents and a set I(v) of intervals. A parent node represents the union of the intervals of its children. The set I(v) contains the intervals [x, x'] such that IntR(v) is included in [x, x'] and IntR(parent(v)) is not included in [x, x']. An interval [x, x'] is stored at a number of nodes that together cover the interval, and these nodes are chosen as close to the root as possible. Every interval is stored at at most two nodes per level. An interval i is inserted as follows: starting from the root node, check whether i contains the interval represented by that node. If yes, allocate i there. Otherwise, do the same check recursively for the children whose intervals overlap i. Figure 2.3 shows the segment tree storing the intervals a = [0, 7], b = [0, 5], c = [2, 3] and d = [6, 7]. The elementary intervals are: [0, 1], [2, 3], [4, 5], [6, 7].

There always exists a unique shortest segment among the segments stored at each node. For each node of the segment tree, a pointer is maintained that points to that shortest segment. To find the most specific range (msr) stabbed by a query point p, the segment tree is used as a search tree for p, i.e., we search for the elementary interval that contains p. The last segment encountered is the shortest segment over p. Suppose we search for the msr of the query point 3 among the ranges illustrated in Figure 2.3. The last segment encountered is interval c. Given n IP prefixes, their algorithm performs IP address lookup in O(h) time, where h is the height of the tree. Their approach can also handle insertions of IP prefixes that do not fit in the skeleton, but then the segment tree has to be rebuilt from time to time in order to maintain lookup performance. The algorithm performs insertion in O(log n) time, and deletion in O(log n) time on average.
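The allocation rule and the msr search can be sketched on the example of Figure 2.3. This is a simplified illustration with assumed restrictions (integer endpoints, a prebuilt skeleton over the elementary intervals, one shortest-segment pointer per node), not Lee et al.'s implementation:

```python
class SegNode:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi          # IntR(v) = [lo, hi]
        self.left = self.right = None
        self.shortest = None               # shortest segment stored at v

def build(lo, hi, leaves):
    """Build the static skeleton over the elementary intervals."""
    node = SegNode(lo, hi)
    if len(leaves) > 1:
        mid = len(leaves) // 2
        node.left = build(leaves[0][0], leaves[mid - 1][1], leaves[:mid])
        node.right = build(leaves[mid][0], hi, leaves[mid:])
    return node

def insert(node, iv):
    l, r = iv
    if l <= node.lo and node.hi <= r:      # iv covers IntR(v): store here
        if node.shortest is None or r - l < node.shortest[1] - node.shortest[0]:
            node.shortest = iv
    else:                                  # recurse into overlapping children
        for c in (node.left, node.right):
            if c is not None and l <= c.hi and c.lo <= r:
                insert(c, iv)

def msr(node, p):
    """Search for p; the last shortest-segment pointer seen on the
    root-to-leaf path is the most specific range stabbed by p."""
    best = None
    while node is not None:
        if node.shortest is not None:
            best = node.shortest
        node = node.left if node.left and p <= node.left.hi else node.right
    return best

root = build(0, 7, [(0, 1), (2, 3), (4, 5), (6, 7)])
for iv in [(0, 7), (0, 5), (2, 3), (6, 7)]:
    insert(root, iv)
assert msr(root, 3) == (2, 3)              # interval c, as in the text
```

That deeper allocation nodes carry shorter segments holds for prefix-derived intervals, since any two are disjoint or nested.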

Figure 2.3: A segment tree storing the intervals a = [0, 7], b = [0, 5], c = [2, 3] and d = [6, 7]. The interval stored in each node v represents IntR(v).

Figure 2.4: A set of intervals R = {a = [0, 30], b = [0, 10], c = [1, 9], d = [2, 8], e = [12, 20], f = [14, 18], g = [22, 30], h = [23, 25] and i = [27, 29]}.

In the interval tree of [30], each node v stores a non-empty subset intervals(v) of a set of intervals R. Let the median of an ordered sample (x1, x2, . . . , xn) be defined as

median = x_{(n+1)/2} if n is odd, and median = x_{n/2} if n is even.

Let xmed be the median of the interval endpoints. The root stores all intervals that contain xmed. The right subtree stores all intervals that lie completely to the right of xmed, and the left subtree stores all intervals completely to the left of xmed. These subtrees are constructed recursively in the same way. In [30], two separate lists are maintained to store intervals(v). One list keeps the intervals sorted according to increasing left endpoints, the other list maintains the intervals sorted according to decreasing right endpoints. If intervals are nested, both lists are identical. Consider the intervals in Figure 2.4. The interval tree storing the set R of intervals is shown in Figure 2.5. The left endpoint of f is the median of all the endpoints, and hence becomes the root of the tree. Intervals a, e and f contain this endpoint and get attached to the root. Intervals b, c and d lie completely to the left of 14 and get placed in the left subtree; g, h and i get placed in the right subtree, and so forth. The longest matching prefix can be found in O(log n + k) time, where k is the

Figure 2.5: An interval tree storing the set R.

number of prefixes that match the given destination address. Suppose we search for the longest matching prefix for the destination address 19 in the interval tree in Figure 2.5. Intervals a and e both contain 19, and e is the most specific match. Prefix insertion and deletion are expensive. Lu and Sahni [31] propose an enhancement of the interval tree of [30] for the representation of dynamic router tables. The enhanced structure supports efficient insertion and deletion of ranges. The longest matching prefix can be found in O(log n + k) time as in the original structure. They further propose several refinements of the enhanced interval tree for dynamic router tables. For example, LMPBOB (longest matching prefix on binary tree), which permits lookup in O(w) time, where w is the length of the longest prefix, and filter insertion and deletion in O(log n) time each, where n is the number of prefixes in the forwarding table. Another scheme proposed by Lu and Sahni [21] shows that each of the three operations insert, delete and IP lookup may be performed in O(log n) time in the worst case using a priority search tree. The multi-way and multi-column search techniques presented by Lampson, Srinivasan, and Varghese map the longest matching prefix problem to a binary search over the fixed-length endpoints of the intervals defined by the prefixes [32]. The authors exploit the fact that any two prefixes are either disjoint or nested. For a database of n prefixes with address length w, naive binary search would take O(w · log n). They show how to reduce this to O(w + log n) using multiple-column binary search. Warkhede, Suri and Varghese [33] introduce an IP lookup scheme based on a multi-way range tree with worst case search and update time of O(log n), where n is the number of prefixes in the forwarding table.
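The construction and stabbing query of the basic interval tree can be sketched on the set R of Figure 2.4 (an illustrative sketch using plain lists rather than the two sorted lists of [30]):

```python
class IntervalNode:
    def __init__(self, key, ivs):
        self.key = key        # the median endpoint x_med
        self.ivs = ivs        # intervals containing x_med
        self.left = self.right = None

def median(values):
    """Median as defined above: x_{(n+1)/2} (n odd), x_{n/2} (n even)."""
    s, n = sorted(values), len(values)
    return s[(n + 1) // 2 - 1] if n % 2 == 1 else s[n // 2 - 1]

def build(intervals):
    if not intervals:
        return None
    x_med = median([e for iv in intervals for e in iv])
    node = IntervalNode(x_med,
                        [iv for iv in intervals if iv[0] <= x_med <= iv[1]])
    node.left = build([iv for iv in intervals if iv[1] < x_med])
    node.right = build([iv for iv in intervals if iv[0] > x_med])
    return node

def stabbing(node, p):
    """All intervals containing p; the shortest is the most specific."""
    out = []
    while node is not None:
        out += [iv for iv in node.ivs if iv[0] <= p <= iv[1]]
        node = node.left if p < node.key else node.right
    return out

R = [(0, 30), (0, 10), (1, 9), (2, 8), (12, 20), (14, 18),
     (22, 30), (23, 25), (27, 29)]
tree = build(R)
assert tree.key == 14                  # left endpoint of f, as in the text
matches = stabbing(tree, 19)           # a and e contain 19
assert min(matches, key=lambda iv: iv[1] - iv[0]) == (12, 20)  # e
```

As in the text, the query for 19 reports a and e, and e = [12, 20] is the most specific match.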

With the advances in optical networking technology, link rates reach over 40 Gigabits per second (OC768). Given the smallest packet size of 40 bytes, in order to achieve 40 Gbps wire speed, the router needs to look up packets at a rate of 125 million packets per second. This, together with other needs in processing, amounts to less than eight nanoseconds per packet lookup. Such high rates demand that IP lookup be performed in hardware. Originally, commercial routers

used Content Addressable Memory (CAM) for IP address lookups in order to keep pace with optical link speeds [34]. CAMs locate an entry by comparing the input key against all memory words in parallel. Hence, a lookup effectively requires one clock cycle. While binary CAMs performed well for exact match operations and could be used for route lookups in strictly hierarchical addressing schemes, the introduction of CIDR required storing and searching entries with arbitrary prefix lengths [34]. In response, Ternary Content Addressable Memories (TCAMs) were developed with the ability to store an additional “Don’t Care” state, thereby enabling them to retain single clock cycle lookups for arbitrary prefix lengths [34]. The use of TCAMs for routing table lookups was first proposed by McAuley and Francis [35]. They also described the problem of updating TCAM-based routing tables that are sorted with respect to prefix lengths. A more recent scheme for filter updates in TCAMs was proposed in [36]. For example, the Cisco Catalyst 6500 Series Switch maintains its Forwarding Information Base (FIB) in TCAM, which is accessed by the hardware forwarding engine ASIC (application-specific integrated circuit) [37]. TCAMs have several deficiencies [38]: (1) high cost per bit relative to other memory technologies, (2) storage inefficiency, (3) high power consumption, and (4) limited scalability to long input keys. The storage inefficiency comes from two sources. First, arbitrary ranges must be converted into prefixes. For example, if w = 4, the range [2, 10] is represented by 001*, 01*, 100* and 1010, which exactly cover that range. In the worst case, a range covering w-bit port numbers may require 2(w − 1) prefixes [38]. The second source of storage inefficiency stems from the additional hardware required to implement the third “Don’t Care” state. The massive parallelism inherent in TCAM architecture is the source of high power consumption.
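The range-to-prefix conversion behind this storage blow-up can be sketched as follows (an illustrative implementation of the standard decomposition, not tied to any particular TCAM device):

```python
def range_to_prefixes(lo, hi, w):
    """Decompose [lo, hi] over w-bit values into covering prefixes by
    greedily peeling off the largest aligned block starting at lo."""
    out = []
    while lo <= hi:
        size = lo & -lo if lo else 1 << w   # largest alignment of lo
        while size > hi - lo + 1:           # shrink until block fits
            size //= 2
        bits = w - (size.bit_length() - 1)  # prefix length
        pfx = format(lo >> (w - bits), "0{}b".format(bits)) if bits else ""
        out.append(pfx + ("*" if bits < w else ""))
        lo += size
    return out

# The example from the text: [2, 10] with w = 4.
assert range_to_prefixes(2, 10, 4) == ["001*", "01*", "100*", "1010"]
# Worst case 2(w - 1): e.g. [1, 2^w - 2].
assert len(range_to_prefixes(1, 14, 4)) == 2 * (4 - 1)
```

Each such prefix occupies one TCAM entry, which is exactly the first source of storage inefficiency described above.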
A further deficiency stems from the lack of flexibility and programmability [39]. CoolCAMs, proposed in [40], greatly reduce power dissipation. Power consumption is approximately proportional to the number of blocks searched. The authors of [40] provide two different power efficient TCAM-based architectures for IP lookup. Both architectures utilize a two stage lookup process. The basic idea in both cases is to divide the TCAM device into multiple partitions. When a route lookup is performed, the results of the first stage lookup are used to selectively search only one of these partitions during the second stage lookup. The two architectures differ in the mechanism for performing the first stage lookup. Zane, Narlikar and Basu further investigate the performance of both architectures in the face of routing table updates [40]. Adding prefixes may cause a bucket in the data TCAM to overflow, requiring a repartitioning of the prefixes into buckets and rewriting the entire table in the data TCAM. The authors describe several heuristics in order to minimize the number of repartitions. Spitznagel, Taylor, and Turner extend the basic idea of [40] and introduce Extended TCAM (E-TCAM) [41]. They propose an indexing mechanism that can support multidimensional packet classification. A further extension of E-TCAM is that the range-matching inefficiency is resolved by incorporating range-matching logic directly into hardware at the cost of a small increase in hardware resources. Perhaps the biggest piece missing from the Extended TCAM solution is an efficient update procedure.

The other architectural approach to IP lookup uses more conventional memory architectures like Static RAM (SRAM) and Reduced Latency Dynamic RAM (RLDRAM) and sophisticated data structures. Trie-based structures are widely used in these solutions; e.g., in the Juniper M-series, MX-series and T-series, ASIC-driven lookup is based on a (radix) trie [39]. In Cisco’s CRS-1 (Carrier Routing System) high-end router, lookup is based on Tree Bitmap [42], a multibit trie algorithm proposed by Eatherton, Varghese and Dittia [43].

Due to the serial nature of decision tree approaches, multiple clock cycles are needed to perform IP lookup. In response, several researchers have explored pipelining to improve throughput [44] [45]. A pipeline is a collection of concurrent “entities” in which the output of each entity is used as the input to another.

We have seen that new solutions employ a combined algorithmic and architectural approach to the problem. In the following we will propose the relaxed min-augmented range tree (RMART), an efficient representation for dynamic IP forwarding tables, and outline a technique which can be used to describe the RMART in a hardware description language.

Chapter 3

Min-augmented Range Trees

A min-augmented range tree (MART) [9] [46] maintaining a set of points with pairwise different x-coordinates stores the points at the leaves such that it is a leaf-search tree for the x-coordinates of the points. The internal nodes have two fields, a router field guiding the search to the leaves and a min field. In the router field we store the maximum x-coordinate of the left subtree (we call this the x-value property), and in the min field we store the minimum y-coordinate of any point stored in the leaves of the subtrees of the node. The next section will provide an example. In the following we show how to answer a leftmost, or minXinRectangle(xleft, ∞, ytop), query.

3.1 Longest matching prefix

In order to find the longest matching prefix, we have to find the point p with minimal x-coordinate in the semi-infinite range x ≥ xleft and with y-coordinate below the threshold value ytop. Therefore, we first carry out a search for the boundary value xleft. It ends at a leaf storing a point with minimal x-coordinate larger than or equal to xleft. If this point has a y-coordinate below the threshold value ytop, we are done. Otherwise we retrace the search path for xleft bottom-up and inspect the roots of subtrees falling completely into the semi-infinite x-range. These roots appear as right children of nodes on the search path. Among them we determine the first one from below (which is also the leftmost one) which has a min field value below the threshold ytop. This subtree must contain the answer to the minXinRectangle(xleft, ∞, ytop) query stored at one of its leaves. In order to find it, we recursively proceed to the left child of the current node if its min field shows that the subtree contains a legal point, i.e., if its min field is (still) below the threshold, and we proceed to the right child only if we cannot go to the left child (because the min field of the left child is above the threshold ytop) [46]. Note that in an actual implementation it is more efficient to truncate the initial search for xleft and begin retracing the path as soon as the min field of the currently

Figure 3.1: The search path of the query minXinRectangle(35, 80, 34) in a MART. Visited nodes are highlighted; the pink node is the result returned by the query. In internal nodes, the bottom value represents the router field, the top one the min field. From [47].

inspected node is above ytop [47]. A min-augmented range tree storing a set of 16 points at the leaves is visualized in Figure 3.1. In internal nodes, the bottom value represents the router field, the top one the min field. The search path of the query minXinRectangle(35, 80, 34) is highlighted. We can find the desired point in time proportional to the height of the underlying leaf-search tree. Hence it is desirable to keep the underlying tree balanced. All we have to show is that the augmented information stored in the min fields of nodes can be efficiently maintained when we carry out an update operation and rebalance the underlying search tree.
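A compact, static MART sketch makes the query concrete. This is illustrative code with invented points, not the thesis implementation, and it phrases the search top-down recursively, which visits the same subtrees as the bottom-up retracing described above:

```python
class Node:
    def __init__(self, router=None, min_y=None, left=None, right=None,
                 point=None):
        self.router, self.min_y = router, min_y        # internal fields
        self.left, self.right, self.point = left, right, point

def build(points):
    """Build a leaf-search tree over points sorted by x; each internal
    node stores the max x of its left subtree (router) and the min y
    of its whole subtree (min field)."""
    if len(points) == 1:
        return Node(min_y=points[0][1], point=points[0])
    mid = len(points) // 2
    l, r = build(points[:mid]), build(points[mid:])
    return Node(router=points[mid - 1][0], min_y=min(l.min_y, r.min_y),
                left=l, right=r)

def min_x_in_rectangle(node, x_left, y_top):
    """Leftmost point with x >= x_left and y <= y_top, or None."""
    if node.min_y > y_top:                 # prune: no legal point here
        return None
    if node.point is not None:
        return node.point if node.point[0] >= x_left else None
    if x_left <= node.router:              # answer may lie to the left
        found = min_x_in_rectangle(node.left, x_left, y_top)
        if found is not None:
            return found
    return min_x_in_rectangle(node.right, x_left, y_top)

pts = sorted([(10, 50), (20, 5), (35, 90), (40, 30), (60, 12), (80, 70)])
root = build(pts)
assert min_x_in_rectangle(root, 35, 34) == (40, 30)
assert min_x_in_rectangle(root, 35, 10) is None
```

The min-field pruning ensures that each query descends into at most the subtrees a balanced MART would inspect, giving the O(h) bound stated above.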

3.2 Update operations

To show that the augmented information can be efficiently maintained during update operations, it is appropriate to think of an update operation for the underlying balanced leaf-search tree as consisting of two successive phases [46]. In the first phase, we insert or delete a point as in a normal (unbalanced) binary leaf-search tree, and in the second phase we retrace the search path and carry out rebalancing operations, if necessary. In order to update the information stored in the min fields of internal nodes, the first phase has to be extended as follows. We retrace the search path and carry out a tournament starting from the leaf affected by the update operation: we recursively consider the min fields of the current node and its sibling and store the minimum of both in the min field of their common parent. In this way we correctly update the information stored in the min fields after the first phase. Instead of retracing the search path in order to update the min fields, the insertion process could also modify the min fields top-down.
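The bottom-up tournament can be sketched with parent pointers (field and class names are illustrative, not from the thesis code):

```python
class N:
    def __init__(self, min_y, parent=None):
        self.min_y, self.parent = min_y, parent
        self.left = self.right = None

def update_min_fields(leaf):
    """Retrace the path from an updated leaf towards the root,
    recomputing each min field from the two children; stop early once
    a node's min field is unchanged, since ancestors then stay valid."""
    node = leaf.parent
    while node is not None:
        new_min = min(node.left.min_y, node.right.min_y)
        if new_min == node.min_y:
            break
        node.min_y = new_min
        node = node.parent

# Tiny example: root over leaf a (y = 5) and internal b over c, d.
root = N(5)
a, b = N(5, root), N(12, root)
root.left, root.right = a, b
c, d = N(12, b), N(30, b)
b.left, b.right = c, d

c.min_y = 20                      # the point in this leaf was replaced
update_min_fields(c)
assert b.min_y == 20 and root.min_y == 5   # tournament stops at the root
```

The early exit reflects the locality of the tournament: the retrace touches only nodes whose min field actually changes.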

In order to show that this information can also be maintained during the second phase, i.e., during rebalancing, let us consider a right rotation. Here we assume that a, b, c, d, e are the routers in increasing x-order stored in the router fields of the internal nodes. The values of the min fields are u, v, w, x, y before the rotation and u, v', w, x', y after the rotation. Note that u, w, y need not be changed, because their subtrees are not affected by the rotation. We have only to update the min fields of the nodes A and B. Note, however, that the min value stored at node A is (still) the overall min value x of all subtrees 1, 2, and 3; hence, we define x' = x. Choosing v' = min(w, y) finally restores the min fields correctly, cf. Figure 3.2.

Figure 3.2: MART right rotation. The min field values are shown on top of the router values. From [9].
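The rotation rule just derived can be sketched in a few lines (illustrative code with assumed field names; router-field bookkeeping is omitted to isolate the min-field repair):

```python
class T:
    def __init__(self, min_y, left=None, right=None):
        self.min_y, self.left, self.right = min_y, left, right

def rotate_right(b):
    """Rotate right around b and repair the min fields: the new subtree
    root a keeps the old overall minimum (x' = x), and b recomputes its
    min from its two new children (v' = min(w, y)). Returns a."""
    a = b.left
    b.left = a.right                   # subtree 2 moves under b
    a.right = b
    a.min_y, b.min_y = b.min_y, min(b.left.min_y, b.right.min_y)
    return a

# Subtrees 1, 2, 3 with minima u = 4, w = 7, y = 2.
s1, s2, s3 = T(4), T(7), T(2)
a = T(min(4, 7), s1, s2)               # node A before the rotation
b = T(min(4, 7, 2), a, s3)             # node B, overall min x = 2
top = rotate_right(b)
assert top is a and top.min_y == 2     # x' = x
assert top.right is b and b.min_y == min(7, 2)   # v' = min(w, y)
```

Only the two rotated nodes are touched, which is exactly the locality property exploited in the next paragraph.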

Rotations and the process of maintaining the augmented min-information are strictly local; hence we can freely choose an underlying balancing scheme for min-augmented range trees. Furthermore, the locality property enables us to decouple the update and rebalancing operations, as will be shown in chapter 5.

3.3 Comparison with priority search trees and priority search pennants

Lauer [47] has benchmarked the MART against (1) the priority search tree as used in the approach by Lu and Sahni [21], hereinafter referred to as PST, (2) the priority search tree as suggested by McCreight [22], hereinafter referred to as PST McC, both of which are balanced with red-black trees, and (3) the priority search pennant by Hinze [48], hereinafter referred to as PSP. Lauer shows how balanced PSPs answer longest matching queries and proves that the complexity is bounded by O(log n). The MART and PSP were implemented and benchmarked with two balancing schemes each: internal path reduction and red-black trees. These balancing schemes are strict balancing schemes, i.e., the balance condition is restored immediately after each update. The MART is the simplest structure in terms of node complexity, i.e., the number of values stored per node that are inspected

during a search operation, followed by the PST, PSP, and finally PST McC. For minXinRectangle queries, the length of the search path, i.e., the number of inspected nodes during the search, was measured. The results have shown that the average search path length of the MART was longer compared to the other structures. Yet, concerning runtime performance, the results have shown that the search path length is less crucial than the number and type of comparisons inside each node along the path. The simple node structure of a MART node compensates for the longer search paths. For minXinRectangle queries the MART needed 45% less time than the PST. The choice of the balancing scheme for the MART and PSP turned out to be of rather low importance in terms of search time. In terms of updates, the MART and PSP require fewer node manipulations than PSTs. When the same balancing scheme is applied, the MART requires about 30% fewer node manipulations on average during an insertion than PSTs. In case of deletions, the reduction is 27%. However, in terms of runtime, the performance gain of the MART is only about 10% in case of insertions and 15% - 20% in case of deletions compared to PSTs.

Chapter 4

Relaxed Balancing

In order to accelerate lookup and update operations, routing tables must be implemented in a way that they can be queried and modified concurrently by several processes. In a concurrent environment there must be a way to prevent simultaneous reading and writing of the same parts of the data structure. A common strategy is to lock the critical parts. In order to allow a high degree of concurrency, only a small part of the tree should be locked at a time. Relaxed balancing has become a commonly used concept in the design of concurrent search tree algorithms [6]. Instead of requiring that the balance condition be restored immediately after each update, the balance conditions are relaxed such that the rebalancing operations can be delayed and interleaved with search and update operations. In the following we present the relaxed red-black tree as proposed by Hanke, Ottmann and Soisalon-Soininen [14], since we utilize this scheme to relax the min-augmented range tree. The scheme can be applied to any class of balanced trees; the main idea is to use the same rebalancing operations as for the standard (strictly balanced) version of the tree. We first recapitulate red-black trees in order to build the basis for Section 4.2.

4.1 Red-black trees

A red-black tree is a binary search tree with the following red-black properties [49]:

• Every node is either red or black.

• Every leaf is black.

• If a node is red, then both its children are black.

• Every path from the root to a leaf contains the same number of black nodes.

• The root node is black.

Figure 4.1: Call of the rebalancing procedure up-in (denoted by ↑). Filled nodes denote black nodes. From [14].

These constraints enforce a critical property of red-black trees: the longest possible path from the root to a leaf is no more than twice as long as the shortest possible path. A red-black tree with n internal nodes has height at most 2 lg(n + 1). The immediate result of an insertion or removal may violate the properties of a red-black tree. Restoring the red-black properties requires a small number (O(log n), amortized O(1)) of color changes and no more than three tree rotations (at most two for an insertion).
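As an illustration, the red-black properties above can be checked mechanically. The following Java sketch is our own (class and method names are illustrative, not part of the implementation discussed later in this thesis); it validates a leaf-oriented tree, where internal nodes always have two children, by computing its black height:

```java
// Illustrative sketch: validating the red-black properties on a
// leaf-oriented binary tree (leaves are external black nodes).
final class RBNode {
    boolean red;              // false = black
    RBNode left, right;       // both null for a leaf

    RBNode(boolean red, RBNode left, RBNode right) {
        this.red = red; this.left = left; this.right = right;
    }

    static RBNode leaf() { return new RBNode(false, null, null); }

    boolean isLeaf() { return left == null && right == null; }

    // Returns the black height of the subtree if it satisfies the
    // red-black properties, or -1 if some property is violated.
    static int blackHeight(RBNode n) {
        if (n.isLeaf()) return n.red ? -1 : 1;        // every leaf is black
        if (n.red && (n.left.red || n.right.red)) return -1; // red rule
        int l = blackHeight(n.left), r = blackHeight(n.right);
        if (l < 0 || r < 0 || l != r) return -1;      // equal black counts
        return l + (n.red ? 0 : 1);
    }

    static boolean isRedBlack(RBNode root) {
        return root != null && !root.red && blackHeight(root) >= 0;
    }
}
```

A tree violating the red rule (a red node with a red child) is rejected, while a correctly colored tree passes.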

4.1.1 Insertions

In order to insert a new key we first locate its position among the leaves and replace the leaf by an internal red node v with two black leaves. If the parent of v is red, we must restore the balance condition and call the rebalancing procedure up-in for v, cf. Figure 4.1.

4.1.2 Deletions

In order to delete a key we first locate its position among the leaves and then remove the leaf together with its parent. If the removed parent was red, we are done. If the removed parent was black, the balance condition is violated. If the sibling of the removed leaf is red, we just change its color to black. Otherwise, the removal leads to a call of the rebalancing procedure up-out for the remaining leaf, cf. Figure 4.2.

Figure 4.2: Deletion of an item (denoted by x) and call of the rebalancing procedure up-out (denoted by ↓). From [14].

Figure 4.3: Call of the rebalancing procedure up-out (denoted by ↓). Half filled nodes denote nodes that are either black or red. From [14].

The task of the procedure up-out attached to some node v is to increase the black height of the subtree rooted at v by one. It either performs a structural change and settles the request or it moves up in the tree, cf. Figure 4.3.

4.2 Relaxed balanced red-black trees

In order to uncouple the rebalancing tasks from an update, we only deposit an up-in or an up-out request instead of calling these procedures immediately after an update. The relaxed balance conditions require that [14]

1. on each path from the root to the leaf, the sum of the number of black nodes plus the number of up-out requests is the same

2. each red node has either a black parent or an up-in request

3. all leaves are black

Figure 4.4: Deletion of an item (denoted by x).

The rebalancing requests can be carried out concurrently as long as they do not interfere at the same nodes. This can be achieved by settling the rebalancing requests in a top-down manner [6]. In order to facilitate locating the rebalancing requests in the tree, we utilize a problem queue as proposed in [50]. For each type of request, we maintain a separate queue. If all problem queues are empty, the tree is (strictly) balanced. Each node maintains an additional bit for each problem queue, which is set as soon as a link to that node is inserted in the queue. When the request in the queue is deleted, the rebalancing process resets this bit to zero. To avoid side effects, every node carries at most one request of each type. Since the same rebalancing operations as for the standard version of the tree are used, the bounds on the number of rebalancing operations from the strict balancing scheme carry over to relaxed balancing. If a deletion falls into a leaf which has a red parent with an up-in request, the leaf is immediately deleted and the up-in request abandoned, cf. Figure 4.4. Thus, a sequence of insertions and subsequent deletions of the same nodes will not cause any rebalancing requests, and hence no rebalancing operations are required. Otherwise, we just deposit a removal request at the appropriate leaf. The deallocation, cf. Figure 4.2, is thus part of a rebalancing process.
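The problem-queue bookkeeping described above can be sketched as follows (illustrative Java; the enum, class and method names are our own, not taken from the thesis implementation):

```java
import java.util.ArrayDeque;
import java.util.EnumMap;
import java.util.Map;
import java.util.Queue;

// One queue per request type, plus one flag bit per queue in each node,
// so that a node carries at most one request of each type.
enum Request { UP_IN, UP_OUT, REMOVAL }

class TreeNode {
    final Map<Request, Boolean> queued = new EnumMap<>(Request.class);
}

class ProblemQueues {
    private final Map<Request, Queue<TreeNode>> queues = new EnumMap<>(Request.class);

    ProblemQueues() {
        for (Request r : Request.values()) queues.put(r, new ArrayDeque<>());
    }

    // Deposit a request; the node's bit prevents duplicates of one type.
    boolean deposit(TreeNode n, Request r) {
        if (Boolean.TRUE.equals(n.queued.get(r))) return false;
        n.queued.put(r, true);
        queues.get(r).add(n);
        return true;
    }

    // A rebalancer takes the next request and resets the node's bit.
    TreeNode take(Request r) {
        TreeNode n = queues.get(r).poll();
        if (n != null) n.queued.put(r, false);
        return n;
    }

    // If all problem queues are empty, the tree is strictly balanced.
    boolean allEmpty() {
        return queues.values().stream().allMatch(Queue::isEmpty);
    }
}
```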

4.2.1 Interleaving updates

If an insertion falls into a leaf which has a removal request, the removal request is abandoned and the key reinserted at that leaf. If the leaf has an up-out request, the up-out request is removed and the leaf is replaced by an internal black node with two leaves. If a deletion falls into a leaf v with a red parent, and the leaf's sibling has an up-out, up-in, or removal request, the sibling does not interfere with the deletion and remains attached; once the removal request attached at v is being handled, the leaf together with its parent is removed, as in Figure 4.2 (a). If the removal request falls into a leaf with an up-out request, or whose parent has an up-out request, these requests have to be settled first.

Figure 4.5: Concurrent handling of the procedure up-out (denoted by ↓). From [14].

4.2.2 Concurrent handling of rebalancing transformations

Two up-out requests that occur at sibling nodes are in conflict. However, this conflict can be solved by applying the transformations in Figure 4.5 (a) or (b). If two up-out requests occur in the same area as in Figure 4.5 (c), the conflict can be settled by one rotation and recoloring.

In the following chapter we will present a strategy to update min-augmented range trees in such a way that the rebalancing tasks can be left for separate processes that perform several local modifications in the tree.

Chapter 5

Relaxed Min-Augmented Range Trees

We will now examine how the min-augmented range tree and the relaxed balanced red-black tree can be combined into the relaxed min-augmented range tree (RMART). The relaxed balance conditions stay unmodified:

1. on each path from the root to the leaf, the sum of the number of black nodes plus the number of up-out requests is the same

2. each red node has either a black parent or an up-in request

3. all leaves are black

All features of the min-augmented range tree remain valid, except for the x-value property. After a deletion, a node may have an x-value that is larger than or equal to the highest x-value in its left subtree and smaller than the smallest x-value in its right subtree. Hence, we do not have to update the x-values after a deletion. The insertion process modifies the min fields top-down. In order to update the min fields after a deletion we introduce the update request up-y. Hence, in addition to up-in, up-out and removal requests, red and black nodes can also have up-y requests. Furthermore, we need one additional problem queue for the up-y requests. If the deletion falls into a leaf whose red parent has an up-in request, the leaf together with its parent is removed, cf. Figure 4.4. Additionally, we may have to attach an up-y request to the leaf's sibling, cf. Figure 5.1. Otherwise, as in the case of relaxed balanced red-black trees, we only deposit a removal request. When we settle a removal request which resides in a leaf with a red parent, the leaf together with its parent is removed, cf. Figure 4.2 (a). Additionally, an up-y request is attached to the leaf's sibling. When we settle a removal request which

Figure 5.1: Deletion of an item and attachment of the rebalancing procedure up-y (denoted by △).

Figure 5.2: Handling of an up-y request (denoted by △). The values stored in the nodes denote the min fields, router values are omitted.

resides in a leaf with a black parent and red sibling, we remove the leaf together with its parent, and color the sibling black (as in Figure 4.2 (b)). Additionally, we attach an up-y request to the leaf's sibling in order to restore the min fields. When we settle a removal request which resides in a leaf with a black parent and black sibling, we remove the leaf together with its parent, and attach an up-out request to the leaf's sibling (Figure 4.2 (c)). Additionally, we attach an up-y request to the leaf's sibling. If a node v has an up-y request, we must update the predecessors' min fields where necessary. To settle an up-y request we compare the min field values of v and its sibling. Let m be the minimum of these values. If m equals the min field value of parent(v), we delete the up-y request at v. Otherwise, we delete the up-y request at v, update the min field of parent(v) to the new value m, and shift the up-y request to parent(v), if it does not have an up-y request yet, see Figure 5.2.
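The settling of an up-y request can be sketched as follows (illustrative Java under an assumed node layout with parent pointers and an integer min field; class and field names are ours):

```java
// Sketch of settling an up-y request as described above.
class MinNode {
    int min;                 // min field: smallest y-value in the subtree
    MinNode parent, left, right;
    boolean upY;             // pending up-y request

    MinNode sibling() {
        return this == parent.left ? parent.right : parent.left;
    }

    // Delete the request at this node; if the parent's min field is
    // stale, update it and shift the request one level up (unless an
    // up-y request is already pending there).
    void settleUpY() {
        upY = false;
        if (parent == null) return;               // at the root: done
        int m = Math.min(min, sibling().min);
        if (m != parent.min) {
            parent.min = m;
            if (!parent.upY) parent.upY = true;   // shift the request up
        }
    }
}
```

Repeatedly settling the shifted requests propagates the corrected min values along the y-min path toward the root.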

5.1 Longest matching prefix

If the search for the longest matching prefix ends at a leaf with a y-value above the threshold or with a removal request, the search has to track back. If the y-fields along the path have not yet been (completely) updated, the search might again end at a leaf with an invalid y-value or a removal request. Hence, the search has to be modified in such a way that all leaves with a valid x-value are visited one after the other (in ascending x-order) until a leaf with a valid y-value has been found. In the following we show that only one node on average (averaged over the number of y-min paths, see below) has to be updated in order to settle an up-y request. Clearly, this benefits IP lookup, since repeated backtracking is avoided.

Definition 1. Let v be a leaf with min field y. The y-min path of v contains v and all ancestor nodes of v with min field y.

A node k belongs to the y-min path of a leaf v if v stores the minimum of all y-values of leaves stored in the subtree rooted at k. Consider, for example, the MART in Figure 3.1. The y-min path of the leaf storing the point (55, 40) contains the leaf itself, its parent and its grandparent. The y-min path of the leaf storing the point (40, 61) contains only the leaf itself. The length of a y-min path is the number of internal nodes it contains. The maximum length of a y-min path equals the height of the tree.

Theorem 1. The average y-min path length is N/(N+1), i.e., bounded by 1.

Proof. Let N be the number of internal nodes. The sum of the y-min path lengths is N, as each internal node belongs to exactly one y-min path. There are N + 1 leaves and hence N + 1 y-min paths, so the average length is N/(N + 1) < 1.
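Theorem 1 can also be checked with a small simulation: build a complete tree with distinct random y-values at the leaves, propagate the min fields, and sum the lengths of all y-min paths. The array-based layout below is our own simplification (a heap-ordered complete tree rather than a real MART), intended only to illustrate the counting argument:

```java
import java.util.Random;

// With distinct y-values, each internal node lies on exactly one y-min
// path, so the path lengths sum to N and the average is N/(N+1).
class YMinPaths {
    static long sumOfYMinPathLengths(int height, long seed) {
        int leaves = 1 << height;
        int size = 2 * leaves - 1;           // heap-ordered complete tree
        int[] min = new int[size];
        Random rnd = new Random(seed);
        // Distinct y-values at the leaves (indices leaves-1 .. size-1).
        int[] perm = new int[leaves];
        for (int i = 0; i < leaves; i++) perm[i] = i;
        for (int i = leaves - 1; i > 0; i--) {   // Fisher-Yates shuffle
            int j = rnd.nextInt(i + 1);
            int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (int i = 0; i < leaves; i++) min[leaves - 1 + i] = perm[i];
        // Min fields of the internal nodes, computed bottom-up.
        for (int i = leaves - 2; i >= 0; i--)
            min[i] = Math.min(min[2 * i + 1], min[2 * i + 2]);
        // For each leaf, count the ancestors whose min field equals its y.
        long sum = 0;
        for (int v = leaves - 1; v < size; v++) {
            int y = min[v], i = v;
            while (i > 0 && min[(i - 1) / 2] == y) { i = (i - 1) / 2; sum++; }
        }
        return sum;
    }
}
```

For a tree of height 5 there are 32 leaves and N = 31 internal nodes; the summed path lengths come out to 31 regardless of the random permutation, matching the proof.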

5.2 Interleaving updates

If an insertion falls into a leaf with an up-y request, the procedure stays the same as for the relaxed balanced red-black tree and the up-y request is shifted to the new internal node. If the leaf has a removal and an up-y request, then we insert the new point and detach the removal request. If an up-out request and an up-y request meet at a node v, then the up-out transformations that do not involve rotations do not interfere with the up-y requests, cf. Figure 4.3 (b) and (e). If rotations are involved, the up-y requests at v or at nodes that are involved in the rotation are shifted out of the sphere of influence of the rotation prior to the up-out transformations. As in the case of rotations in standard min-augmented range trees, the min fields of the involved nodes are maintained during a rotation.

Chapter 6

Concurrency Control

Concurrent computing is related to parallel computing, but focuses more on the interactions between processes. In order to cooperate, concurrently executing processes must communicate and synchronize. Interprocess communication is based on the use of shared variables (variables that can be referenced by more than one process) or on message passing [51]. If several processes operate concurrently on shared data, then unintended results might occur. This happens when two operations, running in different threads but acting on the same data, interleave. This means that the two operations consist of multiple steps, and the sequences of steps overlap [52]. A sequence of statements that must appear to be executed as an indivisible operation is called a critical section. The term “mutual exclusion” refers to mutually exclusive execution of critical sections [51].

A common communication strategy in tree structures is that processes use various kinds of locks while traversing the tree. For example, Ellis proposed a locking protocol for strictly balanced search trees [53]. Nurmi and Soisalon-Soininen presented a modification of Ellis' scheme for relaxed balanced search trees [54]. Both protocols use r-, w- and x-locks. Search operations place a shared lock (or r-lock) on nodes; update operations place write locks (or w-locks) and exclusive locks (or x-locks) on nodes. Several processes can hold an r-lock at a node at the same time. Yet, only one process can hold a w- or x-lock at a node. Furthermore, a node can be both w-locked by one process and r-locked by several other processes. An x-locked node cannot be r-, w- or x-locked simultaneously by another process. An update process uses w-locks if it wants to exclude other update processes, but does not want to exclude search processes. If an update process changes the search structure, then it uses x-locks. A w-lock can be changed into an x-lock if the node is not r-locked by another process. An x-lock can always be changed into a w-lock.

In both the strictly and the relaxed balanced MART, a node maintains the information whether it is w- or x-locked by a process in the boolean variables nodeWriteLock and nodeXLock. It further maintains the number of processes that hold an r-lock in the integer variable nodeReadLock; a boolean variable is not sufficient, since the node can be r-locked by several processes. The procedures to set the various locks are shown in Algorithms 1, 2 and 3. As can be seen, an x-lock can only be obtained via a w-lock, i.e., if a thread wants to x-lock a node, it first must obtain a w-lock. Algorithm 4 shows how an x-lock is changed into a w-lock. All these locking procedures must be synchronized in order to prevent several processes from calling the methods simultaneously.

A monitor consists of a collection of permanent variables used to store the resource's state, and some procedures, which implement operations on the resource. The permanent variables may be accessed only from within the monitor. Execution of the procedures in a given monitor is guaranteed to be mutually exclusive. This ensures that the permanent variables are never accessed concurrently [51]. The concept of a monitor was introduced by Brinch Hansen [55]. A monitor is formed by encapsulating both a resource definition and operations that manipulate it [51]. This textual grouping of critical sections together with the data that they manipulate is superior to critical sections scattered throughout the user program as described by Dijkstra and Hoare in the early days of concurrent programming [56].

The MART as well as the RMART are implemented in the Java programming language. Java provides a basic synchronization idiom: the synchronized keyword. Making methods synchronized has the effect that it is not possible for several invocations of synchronized methods on the same object to interleave. When one thread is executing a synchronized method for an object, all other threads that invoke synchronized methods for the same object block (suspend execution) until the first thread is done with the object. All methods to set, release or change a lock are synchronized.

Algorithm 1 Set r-lock

procedure setNodeReadLock
    if !nodeXLock then
        nodeReadLock ← nodeReadLock + 1
        return true
    else
        return false
    end if
end procedure

Algorithm 2 Set w-lock

procedure setNodeWriteLock
    if !nodeWriteLock && !nodeXLock then
        nodeWriteLock ← true
        return true
    else
        return false
    end if
end procedure

Algorithm 3 Set x-lock

procedure setNodeXLock
    if nodeReadLock == 0 && nodeWriteLock then
        nodeXLock ← true
        nodeWriteLock ← false
        return true
    else
        return false
    end if
end procedure

Algorithm 4 Change x-lock to w-lock

procedure changeNodeXLockToNodeWriteLock
    if nodeXLock then
        nodeWriteLock ← true
        nodeXLock ← false
        return true
    else
        return false
    end if
end procedure
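Since the implementation is written in Java with synchronized methods, Algorithms 1-4 could be transcribed roughly as follows. This is a sketch, not the original implementation: the class name and the added release method for r-locks are ours, while the field and method names follow the text.

```java
// Sketch of the per-node lock state and the synchronized lock procedures.
class LockableNode {
    private boolean nodeWriteLock;
    private boolean nodeXLock;
    private int nodeReadLock;    // reader count; a boolean is not enough

    synchronized boolean setNodeReadLock() {          // Algorithm 1
        if (!nodeXLock) { nodeReadLock++; return true; }
        return false;
    }

    // Release method added for this example (not shown in the text).
    synchronized void releaseNodeReadLock() { nodeReadLock--; }

    synchronized boolean setNodeWriteLock() {         // Algorithm 2
        if (!nodeWriteLock && !nodeXLock) { nodeWriteLock = true; return true; }
        return false;
    }

    // An x-lock can only be obtained via a w-lock, and only if no
    // reader currently holds the node (Algorithm 3).
    synchronized boolean setNodeXLock() {
        if (nodeReadLock == 0 && nodeWriteLock) {
            nodeXLock = true; nodeWriteLock = false; return true;
        }
        return false;
    }

    synchronized boolean changeNodeXLockToNodeWriteLock() {  // Algorithm 4
        if (nodeXLock) { nodeWriteLock = true; nodeXLock = false; return true; }
        return false;
    }
}
```

The synchronized keyword makes each procedure a critical section on the node's monitor, so two threads can never interleave inside these methods.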

Figure 6.1: Potential for deadlock.

6.1 The deadlock problem

Requests by separate tasks for “resources” may possibly be granted in such a sequence that a group of two or more tasks is unable to proceed: each task monopolizes resources and waits for the release of resources currently held by others in that group [57]. For example, consider two tasks P and Q, each requiring the exclusive use of two different resources A and B. Clearly, if P obtains A at the same time Q obtains B, a deadlock occurs, since neither of them can proceed to obtain the other resource it needs, see Figure 6.1. Four conditions are required for deadlock [57]:

• Mutual exclusion. Only one process may use the shared resource at a time.

• “Wait for”. Processes may hold allocated resources while awaiting assignment of others.

• “No preemption”. Once a resource is held by a process, it cannot be forcibly removed from the process.

• “Circular wait”. There exists a circular chain of processes, such that each process holds one or more resources that are being requested by the next task in the chain.

Deadlocks can be expressed more precisely in terms of graphs; for details see [57, 58].
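One standard way to break the “circular wait” condition, sketched below for the P/Q example, is to impose a global order in which resources are acquired. The sketch is illustrative Java of our own (it is not part of the protocols discussed in this thesis) and uses ReentrantLock as a stand-in for exclusive resources:

```java
import java.util.concurrent.locks.ReentrantLock;

// If every task acquires A before B, no cycle A -> B -> A can form,
// so the circular-wait condition for deadlock is ruled out.
class OrderedLocking {
    static final ReentrantLock A = new ReentrantLock();
    static final ReentrantLock B = new ReentrantLock();

    // Both P and Q enter their critical section through this method.
    static void critical(Runnable body) {
        A.lock();
        try {
            B.lock();
            try { body.run(); } finally { B.unlock(); }
        } finally { A.unlock(); }
    }
}
```

With the inverse acquisition order in one of the tasks, the P/Q scenario of Figure 6.1 could occur; with a fixed order, both tasks always terminate.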

In the following two sections we present the locking protocol by Ellis [53] for strictly balanced trees and the protocol by Nurmi and Soisalon-Soininen [54] for relaxed balanced trees and show how these can be adapted to the (R)MART.

Figure 6.2: Insertion of a key.

6.2 Strictly balanced trees

In a conventional bottom-up rebalancing scheme rebalancing transformations are carried out when an updater returns from the inserted or deleted node to the root. If a bottom-up method is used in a concurrent environment, the path from the root to a leaf needs to be locked for the time a writer operates; otherwise the process can lose the path to the root. During the time the root is locked by an updater, no other update process can access the tree. Thus, at most one updater can be active at a time.

Search processes only traverse the tree top-down and use r-lock coupling, i.e., a search process r-locks the node to be visited next before it releases the lock on the currently visited node.
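R-lock coupling during a top-down descent can be sketched as follows. The node type below is a minimal illustration of our own (the real protocols use the richer r-/w-/x-lock set described above):

```java
// Sketch of r-lock coupling: lock the child before unlocking the
// parent, so the search never holds zero locks on its way down.
class CoupledNode {
    CoupledNode left, right;   // both null for a leaf
    int key;
    private int readers;
    private boolean xLocked;

    synchronized boolean rLock() {
        if (xLocked) return false;
        readers++; return true;
    }
    synchronized void rUnlock() { readers--; }

    // Descend to the leaf responsible for `key`, coupling r-locks.
    static CoupledNode descendTo(CoupledNode root, int key) {
        if (root == null || !root.rLock()) return null;
        CoupledNode cur = root;
        while (true) {
            CoupledNode next = key <= cur.key ? cur.left : cur.right;
            if (next == null) { cur.rUnlock(); return cur; }  // leaf reached
            while (!next.rLock()) Thread.yield();  // wait for x-lock to clear
            cur.rUnlock();                         // release only after coupling
            cur = next;
        }
    }
}
```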

An insertion process w-locks the entire path from the root to the leaf, where the key is inserted, in order to be able to rebalance immediately after the insertion. If the key is already in the tree, all locks along the path are released and the insertion process terminates. Otherwise, the leaf’s w-lock is changed into an x-lock. The leaf is changed into an inner node with two leaves as children, cf. Figure 6.2. The key stored in it is copied into one new leaf; the key to be inserted is stored in the other new leaf. Then, if necessary, the x-value in the new internal node is adjusted. By using this technique, the parent of the leaf, where the search terminated, does not have to be x-locked.

Afterwards, the tree is rebalanced. To check which rebalancing transformation applies, all relevant nodes are w-locked top-down. If only the node colors change, w-locks are sufficient. If rotations are involved, then the w-locks are changed into x-locks in top-down direction. X-locks are necessary because the tree structure changes and a search is not excluded by a w-lock; hence x-locks guarantee that searches are not misled. After the rotation, the locks of the nodes that are locally in balance are released top-down. When the tree is completely in balance, all remaining w-locks are released bottom-up and the insertion process terminates.

Figure 6.3: Locking protocol for backtracking searches. (a) Backtracking search. (b) Backtrack further. (c) Use r-lock coupling to descend the tree.

The locking strategy of a deletion process is similar. The difference is that not only the leaf, but also the grandparent, the parent and the sibling have to be x-locked. This is done in top-down direction after they have been w-locked. The leaf together with its parent is deleted and the grandparent now points to the sibling. Afterwards, all remaining locks are released. The locking strategy for rebalancing the tree remains the same as in the insertion process. Clearly, this locking scheme does not evoke deadlock situations among update processes: the root remains w-locked during the entire update, and hence no other update process is allowed to enter the tree. An update process x-locks consecutive nodes only in top-down direction; hence, this strategy does not evoke deadlock situations between an update process and search processes. The scheme is therefore deadlock-free.

In the following we adapt this locking scheme to the MART.

6.2.1 Concurrent MART

The protocol for a search operation uses r-lock coupling. In case the search has to track back, it has to determine the first subtree (from below) that has a min field value below the threshold. Therefore, while backtracking, not only the parent, but also the sibling is r-locked, cf. Figure 6.3(a). If the search has to track back further, the lowermost r-locks are released, and the grandparent and uncle are r-locked, cf. Figure 6.3(b). As soon as the correct subtree has been found, the leftmost r-locks are released and r-lock coupling is used to descend the tree, cf. Figure 6.3(c). When a leaf is reached, the key is returned.

An insertion process uses x-locks in addition to w-locks because it updates the min fields top-down while it searches for the insertion position. After a w-lock has been acquired, it is changed into an x-lock to update the min field. Then it is retransformed into a w-lock and the appropriate child is w-locked. Algorithm 5 illustrates the search phase of the insertion process. The acquisition of a w- or x-lock has to be carried out in a loop. When the process is not granted the lock, it calls yield(). This causes the currently executing thread to temporarily pause and allow other threads to execute. Then, it tries again to acquire the lock. Algorithm 5 returns with an x-lock on the located leaf. Deadlocks cannot occur in this phase. After the correct insertion position has been located, the new key is inserted and afterwards the x-lock is transformed into a w-lock. In the second phase, the insertion process calls the rebalancing procedure for the new internal node v. If only node colors change, the uncle is w-locked (nodes that are not on the insertion path are not locked yet) and the colors are adjusted. Then, the uncle, the parent and v are unlocked. Finally, all remaining w-locks are released bottom-up. If rotations are involved, the locking strategy is as follows: all nodes beneath the node v for which the rotation is called (the node which is rotated) are unlocked; all nodes above and including v remain w-locked (they obtained their lock during the search phase of the insertion process). Then, to execute the rotation, all relevant nodes are w-locked and then x-locked top-down. Then, the pointers are adjusted and the min fields are restored. After the rotation, all nodes beneath the node v that is returned by the rotation are unlocked top-down. The nodes above and including v remain w-locked. Finally, after the tree has been rebalanced, all remaining w-locks are released bottom-up.
This strategy can create deadlock situations with search processes that track back and enter an area that contains the nodes that are to be rotated. When a rotation is carried out, the insertion process x-locks the nodes top-down. Search processes use r-lock coupling also when they track back. If a search process wants to r-lock a parent which is already x-locked, it fails; and the insertion process fails to x-lock the child due to the r-lock. Since a rebalancing operation is carried out subsequent to the insertion, the insertion process insists on carrying out the rotation. Note that a search process failing to r-lock an x-locked parent does not necessarily indicate a deadlock: the insertion process temporarily x-locks a node in order to update the min field while searching for the insertion position. Hence, the search process temporarily pauses and then tries again to r-lock the parent. If it is still not granted the lock, it releases its current lock and restarts the search.

The deletion process first locates the appropriate leaf and thereby w-locks the entire path from the root to the leaf. In this phase, no deadlock situations arise, since w- and r-locks do not exclude one another. After it has located the leaf, it w-locks the sibling and then x-locks the appropriate nodes top-down: the grandparent, the

Algorithm 5 Find Leaf Insert

procedure findLeafInsert(searchkey, y_value)
    if MARTNode.rootNode is null then
        return null
    end if
    while !MARTNode.rootNode.setNodeWriteLock() do
        yield()                                      ▷ temporarily pause
    end while
    while !MARTNode.rootNode.setNodeXLock() do
        yield()
    end while
    MARTNode l ← MARTNode.rootNode
    while !l.isLeaf() do
        if l.getY().compareTo(y_value) > 0 then
            l.setY(y_value)                          ▷ update y-value
        end if
        if l.getX().compareTo(searchkey) ≥ 0 then    ▷ branch left
            l.changeNodeXLockToNodeWriteLock()
            while !l.getLeft().setNodeWriteLock() do
                yield()
            end while
            l ← l.getLeft()
            while !l.setNodeXLock() do
                yield()
            end while
        else                                         ▷ branch right
            l.changeNodeXLockToNodeWriteLock()
            while !l.getRight().setNodeWriteLock() do
                yield()
            end while
            l ← l.getRight()
            while !l.setNodeXLock() do
                yield()
            end while
        end if
    end while
    return l
end procedure

Figure 6.4: Deadlock between deletion process and backtracking search process.

Figure 6.5: Deletion of a key. Capital X denotes removed nodes.

parent and finally the sibling and the leaf itself. A deadlock situation can arise if the deletion process encounters a backtracking search process while x-locking the appropriate nodes. Suppose the grandparent has been successfully x-locked and the deletion process now intends to x-lock the leaf's parent. This fails, since the parent is r-locked; and the search fails to backtrack due to the x-lock, cf. Figure 6.4. If a deadlock situation arises, the search process temporarily pauses and then tries again to r-lock the parent. If it is still not granted the lock, it releases its current lock and restarts the search. After the appropriate nodes have been successfully x-locked, the leaf and its parent are deleted and the sibling's x-lock is retransformed into a w-lock, cf. Figure 6.5. Now, the deletion process tracks back, as far as necessary, in order to update the predecessors' min fields. To update a min field the deletion process uses x-locks. After a min field has been updated, the deletion process retransforms the x-lock into a w-lock and then x-locks the parent. After the min fields have been updated, all nodes from the root to the sibling are w-locked, cf. Figure 6.5. The strategy to adjust the min fields does not cause deadlock situations with search processes. Then, after the item has been located and deleted and the min fields have been updated, the process rebalances the tree. The locking strategy for rebalancing the tree remains the same as in the insertion process. If, during the rebalancing, a deadlock situation

arises with a backtracking search process, the search process temporarily pauses and then tries again to r-lock the parent. If it is still not granted the lock, it releases its current lock and restarts the search.

In summary, deadlock situations cannot arise between update processes. Further, when x-locking consecutive nodes, x-locks are taken top-down. Hence, deadlock situations can only occur with backtracking searches. In case a search process is not granted the lock on the parent after a second try, it releases its current lock and restarts the search. Hence, the locking protocol is deadlock-free.

6.3 Relaxed balanced trees

In relaxed balanced data structures, the update processes perform no rebalancing but leave certain information for separate rebalancing processes, which will later restore the balance. A rebalancing process can be activated when there are only a few other active processes. Several rebalancers can work concurrently. Since the update processes do no rebalancing and the separate rebalance operation is divided into several small steps, the nodes can be unlocked rapidly. Yet, the tree may temporarily be out of balance, i.e., its height is not necessarily bounded by O(log n).

Search processes use r-lock coupling, i.e., a search process r-locks the node to be visited next before it releases the lock on the currently visited node.

In relaxed balanced structures, it is sufficient to use w-lock coupling during the search phase of an update operation, since rebalancing operations are separated from the update operations. Once arrived at a leaf, the w-lock is changed into an x-lock. In case of an insertion, the leaf is changed into an internal node with two leaves.

Generally, the deletion process just attaches a removal request and the actual deletion is part of the rebalancing process. Analogous to the insertion process, the leaf is x-locked and the removal request attached before the x-lock is released. If the deletion process deletes the leaf, cf. Figure 4.2 (a) and (b), the leaf together with its parent is removed. Additionally, an up-y request is attached to the leaf's sibling. To achieve this, a process that will perform a delete operation uses w-lock coupling during the search phase. When the leaf to be deleted has been found, its parent and grandparent are still kept w-locked, and the process w-locks the sibling of the leaf. Then it x-locks the grandparent and parent of the leaf, the leaf itself, and its sibling. Now the leaf and its parent are deleted, the grandparent points to the sibling and an up-y request is attached to the sibling. Then, the remaining locks are released, cf. Figure 6.6.

Figure 6.6: Deletion of a key. Capital X denotes the nodes that are deleted.

Figure 6.7: Deletion of a key by exchanging contents of nodes. Capital X denotes the leaf that is to be deleted.

If structural changes are implemented by exchanging the contents of nodes, the process keeps the parent P of the leaf that is to be deleted w-locked, and then w-locks the sibling of the leaf as well as its nephews, if any. Then it x-locks the nodes top-down. If the sibling node is a leaf, the parent must be turned into a leaf. Then, the content of the sibling is copied into the parent node and the parent pointers of the nephews, if any, are switched to point to P. Furthermore, an up-y request may have to be attached to P, cf. Figure 6.7. Finally, the leaf and its sibling are deleted and the remaining locks are released.

A rebalancing process w-locks the nodes while checking whether a transformation applies. If a transformation applies, it changes all w-locks to x-locks. Nurmi and Soisalon-Soininen suggest traversing the tree nondeterministically in order to locate the rebalancing requests. This implies that the top-most node can be locked first among all nodes that have to be considered. Hence all locks are taken top-down and the locking scheme is deadlock-free. If a problem queue is used to locate the requests, the situation is different. In order to apply a transformation, the parent or grandparent of a node with a rebalancing request has to be considered. These are w-locked bottom-up. Hence, the scheme

must be extended in an appropriate way, cf. Hanke [6]: when a rebalancing process tries to w-lock a parent that is w- or x-locked by another process, the rebalancer immediately releases all locks it holds. The rebalancing processes are the only processes that also w-lock bottom-up. Since they immediately release all locks if a requested lock cannot be granted, deadlock situations are eliminated. If the topmost node that is relevant for the transformation is successfully locked, then all other relevant nodes are w-locked top-down. Along the way, we check whether a transformation can be applied. If no further node can be locked and no transformation can be applied, then all locks are released. If a transformation can be applied, then the w-locks are changed to x-locks top-down. Then the transformation is applied, a newly generated rebalancing request is appended to the respective queue if necessary, and all locks are released.
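The back-off rule, under which a rebalancer that cannot obtain a lock immediately releases everything it holds, can be sketched with standard Java locks. This is an illustrative sketch, not the dissertation's implementation; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative sketch of the deadlock-avoidance rule: a rebalancer that
 *  acquires locks against the usual (top-down) order must release all
 *  locks it holds as soon as one acquisition fails. */
class Rebalancer {
    /** Tries to lock all nodes in the given order. On the first conflict,
     *  releases every lock already held and reports failure, so the
     *  attempt can be retried later. */
    static boolean lockAllOrBackOff(List<ReentrantLock> locks) {
        for (int i = 0; i < locks.size(); i++) {
            if (!locks.get(i).tryLock()) {
                for (int j = i - 1; j >= 0; j--) {
                    locks.get(j).unlock();   // back off: release all held locks
                }
                return false;
            }
        }
        return true;  // all locks held; caller performs the transformation
    }
}
```

Because a failed attempt leaves the rebalancer holding no locks at all, no cyclic wait can arise between rebalancers and the top-down locking processes.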

In the following we extend the locking scheme by Hanke such that it can be applied to the RMART.

6.3.1 Concurrent RMART

Search processes use r-lock coupling, also when they track back. The protocol is the same as in the MART. The difference is that the MART only has to track back once, whereas the RMART may have to track back several times until a leaf with a valid y-coordinate and without a removal request has been found.

Since the insertion process modifies the y-fields top-down, it uses x-locks in addition to w-lock coupling. It repeatedly requests a lock until it is granted. It first w-locks and then x-locks a node v. After the y-field has been updated, the x-lock is changed back into a w-lock. Then it tries to w-lock the appropriate son. If this is successful, it releases v's lock and x-locks the son. Once it arrives at a leaf, it inserts the new item and then releases the x-lock. Deadlocks among search and insert processes cannot occur.

The protocol for delete operations is similar to the one described for relaxed balanced trees. The difference is that if the deletion process fails to x-lock the appropriate nodes top-down, i.e., in the case where it does not just attach a removal request but deletes the leaf, then it transforms all x-locks into w-locks, pauses temporarily, and then tries again to x-lock the nodes. This avoids deadlock situations with backtracking search processes.

The protocol for the rebalancing processes has to be extended such that they release all locks if they fail to x-lock top-down. Again, this avoids deadlock situations with backtracking search processes, and hence the entire locking scheme is deadlock-free.

As a plausibility check that corroborates the correctness of the proposed locking schemes, we will present an interactive animation of the relaxed balanced min-augmented range tree. Further, the animation will foster a deeper understanding of the various locking strategies.

Chapter 7

Interactive Visualization of the RMART

According to Price, Small and Baecker, “program visualization is the use of graphics to enhance the understanding of a program” [59]. Stasko and Domingue define the term program visualization as “the visualization of actual program code or data structures in either static or dynamic form” [60]. An early example of a static visual representation of a program is the flowchart, introduced by Goldstein and von Neumann [61]. The first dynamic representation was probably Knowlton’s animation of dynamically changing data structures in Bell Labs’ low-level list processing language [62] [63].

In this chapter we present an interactive animation of the RMART. We visualize how the RMART evolves over time, as well as the acquisition and release of locks while the search, insert, delete and rebalancing processes operate on the tree. The animation shows how the RMART is traversed concurrently by the various processes and the types of locks they use. In order to better understand the interactions of the processes, the user can interactively, i.e., at runtime of the animation, control the number and type of processes currently operating on the tree. Since the animation focuses on the presentation of the main concepts, it does not show in detail how the red-black properties are restored. When a rotation occurs, the visualization shows the resulting data structure after the rotation. Red-black tree animations that visualize the operations step-by-step can be found in [64] [65].

Execution in concurrent programs may be non-deterministic; that is, multiple executions of the same program may result in varying program behaviors. This makes it difficult to evaluate the correctness of concurrent programs, as deadlock situations may occur only once in hundreds or thousands of executions. The visualization can thus be seen as a further indication that the proposed locking scheme is deadlock-free. Of course, it is not a proof that the locking scheme actually is deadlock-free.

This animation was developed in collaboration with Waldemar Wittmann as part of his bachelor thesis [20]. In the following we describe the application framework, the underlying architecture, as well as the graphical user interface.

7.1 Application framework

The visualization has been developed using Qt. Qt is a cross-platform application framework which includes an intuitive class library, integrated development tools, support for C++ and Java development (Qt Jambi), as well as support for desktop and embedded development. As a prominent example, the K Desktop Environment (KDE), a contemporary desktop environment for UNIX systems, has been developed with Qt. In the following we describe the architecture of the visualization.

7.2 Architecture

The visualization uses a model-view-controller (MVC) architecture. The MVC paradigm is an architectural pattern that is often used when building user interfaces. The model represents the data and functionality of the application. The view attaches to the model and renders its contents. In addition, when the model changes, the view redraws the affected part to reflect those changes. The model-view separation makes it possible to display the same data in several different views, and to implement new types of views, without changing the underlying data structures. The controller processes and responds to events, typically user actions such as keystrokes, and may invoke changes on the model. In Design Patterns, Gamma et al. [66] write:

MVC consists of three kinds of objects. The Model is the application object, the View is its screen presentation, and the Controller defines the way the user interface reacts to user input. Before MVC, user interface designs tended to lump these objects together. MVC decouples them to increase flexibility and reuse.

This clearly expresses the advantage of the separation of the three components.

Qt Jambi provides the abstract class QTreeModel to create custom models that represent tree structures. It further provides a ready-to-use implementation of a tree view: QTreeView. QTreeModel defines an interface that is used by QTreeView to access data. The model and the view communicate via the powerful “signals and slots” mechanism: a signal is emitted when a particular event occurs; a slot is a function that is called in response to a particular signal. When the data changes, e.g., when a lock is acquired, when a node is inserted or deleted, when the color changes or when a rotation occurs, the model emits a signal to inform the view about the change. In response, the view redraws the affected part. Further, the controller processes signals emitted from the user interface, e.g., when the user increases the number of insertion processes.

A great advantage of using the MVC framework to implement a visualization of the RMART is that only few modifications to the existing source code had to be made. This in turn minimized potential errors in the implementation of the visualization. The visualization itself in turn proved to be a useful graphical debugging tool. In the following we describe the graphical user interface.
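Before turning to the interface, the signal/slot idea itself can be illustrated in a few lines of plain Java. The Signal class below is a hypothetical sketch for illustration only; Qt Jambi's actual signal classes differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Minimal sketch of the signal/slot idea: a signal keeps a list of
 *  connected slots and invokes each of them when it is emitted. */
class Signal<T> {
    private final List<Consumer<T>> slots = new ArrayList<>();

    void connect(Consumer<T> slot) { slots.add(slot); }        // register a slot

    void emit(T payload) {                                     // notify all slots
        for (Consumer<T> slot : slots) slot.accept(payload);
    }
}
```

In the visualization's terms, the model would emit such a signal when, say, a node is inserted, and the view's redraw routine would be connected as a slot.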

7.3 The graphical user interface

Figure 7.1 shows a screenshot of the visualization. The main window displays the RMART data structure in an explorer-like style. The top-level item (numeral string) represents the root. Nodes with the same indentation and connected by a vertical line are siblings: the top node represents the right child of the parent, the bottom node represents the left child.

When traversing the tree, each process changes the locks at each visited node. A numeral string represents the locks currently set at a node: (r w x). For example, (2 0 0) at a node denotes that the node is r-locked by two processes, and is neither w- nor x-locked. A node can be r-locked by several processes, but can be w- and x-locked by only one process at a time. A node can be x-locked only if the node is not r-locked by another process. An x-lock can only be obtained via a w-lock, i.e., before a process can x-lock a node, it must have obtained a w-lock first. After the x-lock has been granted, the process can release the w-lock.

In order to better perceive the nodes that are being locked, the view highlights the corresponding lock. Each lock has a different color. When a lock is set, the appropriate color flashes up; when the lock is released, the highlight is extinguished. This makes it easy to track the various processes while they traverse the tree. For example, the root in Figure 7.1 is r-locked by two processes, highlighted in yellow. One of these processes has already r-locked the left son. (Search processes use r-lock coupling: the lock of the current node is released only when the lock of the child has been acquired.) Further, the root and its left son are w-locked by an insert process, highlighted in cyan. The delete process performs an actual deletion of a leaf, indicated by three x-locked nodes, highlighted in magenta: the leaf that is to be deleted, its sibling and its parent. Initially, the RMART is empty, i.e., consists of one leaf.
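The lock-compatibility rules just described can be made concrete in a small sketch. The NodeLock class below is a hypothetical illustration, not the actual implementation; it also produces the (r w x) string shown at each node.

```java
/** Illustrative sketch of the per-node lock rules described above:
 *  many r-locks, at most one w-lock, and an x-lock that is reachable
 *  only by upgrading a held w-lock while no reader is present. */
class NodeLock {
    private int readers = 0;
    private boolean wHeld = false, xHeld = false;

    synchronized boolean tryRLock() {
        if (xHeld) return false;            // readers are excluded only by an x-lock
        readers++; return true;
    }
    synchronized void rUnlock() { readers--; }

    synchronized boolean tryWLock() {
        if (wHeld || xHeld) return false;   // at most one w- or x-lock at a time
        wHeld = true; return true;
    }
    synchronized void wUnlock() { wHeld = false; }

    /** Upgrades the (already held) w-lock to an x-lock. */
    synchronized boolean tryUpgradeToX() {
        if (!wHeld || readers > 0) return false;  // x-lock excludes all readers
        wHeld = false; xHeld = true; return true;
    }
    synchronized void xUnlock() { xHeld = false; }

    /** The (r w x) string rendered at each node in the visualization. */
    synchronized String state() {
        return "(" + readers + " " + (wHeld ? 1 : 0) + " " + (xHeld ? 1 : 0) + ")";
    }
}
```

Note that, as in the text, a w-lock coexists with r-locks; only the upgrade to an x-lock has to wait until all readers have left the node.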
In the Settings window, the user can interactively vary the number of search, update and rebalancing processes that operate on the tree. When the user decreases a number, the process(es) cannot be killed straight away, but first have to finish their task. Hence, the number

Figure 7.1: RMART visualization. Two search, one insert and one delete process.

next to the interactive field displays the actual number of active processes. As soon as the appropriate number of processes has been stopped, this number is identical to the one in the interactive field.

Using the slider in Zoom, the user can zoom in and out of the tree. By zooming out, it is possible to get a better overview of the activities in the tree when the tree is quite large. Via the Insert button, the user can insert a specified number of nodes on the spot, i.e., without animation. For example, this feature can be used to generate an initial tree of a specific user-defined size. The preset value is 100. The Queue Size window displays the current size of the various queues, e.g., the number of remaining queries in the search queue, or the number of up-in requests. Checking the Show more node contents box, the view renders additional node information, namely the types of attached rebalancing requests, if the node has any. The up-in request is highlighted in red, up-out in green, removal in black, and up-y in blue. The view further displays the color of each node as a textual string. Figure 7.2 shows the tree after a number of insertions, and the attached up-in requests.

Figure 7.2: RMART visualization. A number of up-in requests due to insertions of nodes.

After all up-in requests have been settled, the red-black tree constraints are satisfied, i.e., if a node is red, then both its children are black, and every path from the root to a leaf contains the same number of black nodes, cf. Figure 7.3. Note that this tree is more balanced.

Figure 7.3: All up-in requests shown in Figure 7.2 are settled.

When delete operations are additionally performed, removal, up-out and up-y requests may also be generated. Figure 7.4 captures the handling of a removal as well as an up-out request. All relevant nodes are x-locked (highlighted in magenta). There are three remaining removal requests as well as one up-out request in the queues. The figure shows one more of each type attached to nodes; this is because one request of each type has already been taken out of the corresponding queue and is currently being handled. Figure 7.5 shows one removal and one up-out rebalancer operating on the tree. The leaf that has the up-out request has only three black nodes (the leaf included) on the path to the root; all other leaves have four black nodes. The removal rebalancing process attaches an up-out request to the parent of the node that is to be deleted, because the parent and the sibling are also black. Further, an up-y request is attached and the leaf and its sibling are deleted, cf. Figure 7.6. Here, it can be seen that the new leaf indeed needs an up-out request:

Figure 7.4: One search process as well as one removal and one up-out rebalancing process operate on the tree.

there are three black nodes on the path, instead of four. Furthermore, the up-out rebalancer settled the up-out request highlighted in Figure 7.5. To this end, one double rotation (right-left) and one recoloring were performed, cf. Figure 4.3(d). Further, the up-y requests that were attached at those nodes which were involved in the double rotation have been pulled to the root of the subtree, cf. Figure 7.6.

Using the slider in Speed, the user can control the speed and also freeze the execution using the pause button. When the pause button is pressed again, the execution continues.

This interactive animation was presented to illustrate the functioning of the search, update and rebalancing processes and to corroborate the correctness of the proposed locking schemes.

Figure 7.5: One removal and one up-out rebalancer operate on the RMART. The removal rebalancer attached an up-out request to the parent of the node that is to be deleted. The up-out rebalancer x-locks the appropriate nodes top-down.

Clearly, this is not a proof that the locking strategies are deadlock-free, yet it is an indication. A further indication of correctness was furnished by the Java PathFinder (JPF), an explicit-state software model checker [67].

Both the strictly and the relaxed balanced min-augmented range tree can be queried and modified concurrently by several processes. The difference is that updates must be performed serially in the strictly balanced MART, since rebalancing is performed immediately after an update; still, an update can be performed concurrently with search operations. In the relaxed balanced MART, updates can be performed concurrently with each other, as well as with search and restructuring operations. Hence, the advantage of the relaxed

Figure 7.6: The removal and up-out requests in Figure 7.5 are settled. Further, the removal and up-out rebalancers retrieved the next request from the appropriate queues.

balanced min-augmented range tree over the strictly balanced version is expected to become noticeable in the presence of update bursts. Although route updates typically occur on the order of tens or hundreds of times per second, transient bursts may occur at rates that are orders of magnitude higher [68] [69]. The update rate is critical because a router that cannot keep up with all of the updates may trigger a condition known as route flap, in which a router processing a backlog of route updates is incorrectly marked as unreachable by other routers. This state change creates a domino effect of route updates that can cripple a network [68]. Such worst cases have already been observed in practice [70]. Further, the relaxed balancing scheme is expected to reduce lookup latency if lookups and updates “meet in the same area”.

The above deliberations suggest that a relaxed scheme is better suited, yet it is not clear whether these expected benefits compensate for the higher coordination effort caused by the uncoupling of updates and rebalancing tasks. In the following we investigate experimentally whether the relaxed balanced min-augmented range tree has an advantage over the standard min-augmented range tree in a dynamic concurrent environment.

Chapter 8

Experimental Results

Routers in different ASs use the Border Gateway Protocol (BGP) to exchange network reachability information. Each BGP speaker maintains a set of tables (Routing Information Bases, or RIBs): one for each BGP neighbour (Adjacency-RIB-Ins) and one for its own internal use for forwarding. It selects the “best” of these routes to use for its local forwarding decisions (Local-RIB), and sends a copy of this best route to all its peers (Adjacency-RIB-Outs) [71].

Each incoming update message causes a change in the corresponding Adjacency-RIB-In. If the information is a prefix withdrawal, then a comparison needs to be made with the Local-RIB. If there is a match, then all other Adjacency-RIB-Ins need to be scanned and a new best route installed into the Local-RIB, as well as loading new announcement messages into the Adjacency-RIB-Outs to reflect this local change of best path. If there are no other candidate routes in the other Adjacency-RIB-Ins, then the route is withdrawn from the Local-RIB and a withdrawal message is passed to the BGP speaker's peers [71].

If the incoming update message is an announcement, then the BGP engine has to update the Adjacency-RIB-In and then compare this route to the current best path in the Local-RIB. If this new route represents a better path, then the Local-RIB is updated and announcement messages are queued in all the Adjacency-RIB-Outs [71].

To conduct our experiments, we use real and up-to-date routing data that is supplied by the Advanced Network Technology Center at the University of Oregon within the framework of the Route Views Project [72]. The Route Views routers archive their BGP routing table snapshots and the BGP updates received from their peers. RIB table dumps are collected every two hours.

8.1 The MRT format

Researchers and engineers often wish to analyze network behavior by studying routing protocol transactions and routing information base snapshots. To this end, the MRT format was developed to encapsulate, export, and archive this information in a standardized data representation [73]. The format was developed in concert with the Multi-threaded Routing Toolkit (MRT) at Merit in the mid-1990s. It is employed by the RIPE RIS [74] and Route Views BGP routing data collectors. In our benchmark, we use routing data collected from the route-views2.oregon-ix.net router.1 The following example MRT record is taken from the April 13, 2007 routing information base snapshot.

TIME: 04/13/07 08:40:29
TYPE: TABLE_DUMP/INET
VIEW: 0
SEQUENCE: 2
PREFIX: 3.0.0.0/8
FROM: 134.55.200.31 AS293
ORIGINATED: 04/12/07 03:29:26
ORIGIN: IGP
ASPATH: 293 701 703 80
NEXT_HOP: 134.55.200.31
COMMUNITY: 293:14293:46
STATUS: 0x1

The PREFIX entry contains the IP prefix of a particular routing table dump entry. ORIGINATED contains the time at which this prefix was heard. The NEXT_HOP entry determines the next hop address. For a full description of the MRT format, see [73]. Multiple networks may have the same path and path attributes. In that case, specifying multiple network prefixes in the same update message is more efficient than generating a new message for each network. Hence, an update message can advertise at most one set of path attributes, but multiple destinations, provided that the destinations share these attributes. Furthermore, an update message can list multiple routes that are to be withdrawn from service.

In the following message, the fields are the attributes from a single BGP update message which announces a destination.

TIME: 04/13/07 08:42:20

1 The peers of the route-views2.oregon-ix.net router can be found at http://www.routeviews.org/peers/route-views2.oregon-ix.net.txt

TYPE: BGP4MP/MESSAGE/Update
FROM: 66.185.128.1 AS1668
TO: 128.223.51.102 AS6447
ORIGIN: IGP
ASPATH: 1668 1299 9121
NEXT_HOP: 66.185.128.1
MULTI_EXIT_DISC: 1026
ANNOUNCE
81.213.47.0/24

The following message lists multiple routes that are to be withdrawn from service.

TIME: 04/13/07 08:42:21
TYPE: BGP4MP/MESSAGE/Update
FROM: 206.24.210.99 AS3561
TO: 128.223.51.102 AS6447
WITHDRAW
202.136.176.0/24
58.65.1.0/24
202.136.182.0/24

Update files contain the update messages and are rotated every 15 minutes, i.e., each update file contains the prefix withdrawals and announcements that occur within a 15-minute time interval.

We expect the relaxed version to show better behavior in the presence of update bursts. In our experiments, various scenarios with varying update frequencies are simulated. A high update frequency corresponds to a short interarrival time between routing updates. The update file of April 13, 2007 at 10:27 contains a 9-second interval with (nearly) peak update rates in each second (measured over one month). The interval starts at 10:36:24 and ends at 10:36:32, with 85961 prefix withdrawals and announcements in total, cf. Figure 8.1.

8.2 Flow characteristics in internetwork traffic

Flows are considered to be sequences of packets with an n-tuple of common values, such as source and destination address prefixes, protocol and port numbers, ending after a fixed timeout interval.

Number of updates per second from 10:36:24 to 10:36:32:

Time:    :24  :25   :26   :27   :28  :29  :30  :31  :32
Updates: 9096 12325 15792 11190 7409 5879 7959 7343 8968

Figure 8.1: Excerpt from the update file of April 13, 2007 at 10:27.

8.2.1 Locality in internetwork traffic

Several studies have identified the presence of locality in internetwork traffic [75] [76] [77] [78]. Temporal locality in network address traces refers to the phenomenon that if an address is referenced, it is likely to be referenced again in the near future. The reason for this temporal locality lies in the fact that packets with the same destination tend to be transmitted closely in time, usually as the result of the transmission of data that is segmented into a sequence of packets [75]. A trace of references to IP addresses has high temporal locality if a large portion of the repeated references have a short interarrival time, i.e., there is a large probability of re-referencing the same IP address within a short period of time. MacGregor and Chvets [77] analyzed traces that were captured at the University of Auckland, the University of Alberta and at the San Diego Supercomputer Center commodity connection. The analysis showed that approximately 80% of all references have interarrival times of less than 0.2 seconds.

8.2.2 Statistical properties of flows

Chabchoub et al. [79] study the size of flows in a limited time window of duration ∆. The reason for considering short time windows is that in short time intervals, the volumes of flows exhibit only one major statistical mode.

The key observation when characterizing a traffic trace is the fact that if the duration ∆ of the successive time intervals used for computing traffic parameters is appropriately chosen, then the distribution of the size of the main contributing flows in the time interval can be represented by a Pareto distribution and therefore exhibits a unimodal behavior. More precisely, there exist ∆, Bmin, Bmax and a > 0 such that if S is the number of packets transmitted by a flow during ∆, then

P(S ≥ x | S ≥ Bmin) = (Bmin/x)^a, for Bmin ≤ x ≤ Bmax.

The parameter Bmin is usually referred to as the location parameter and a as the shape parameter. In other words, if the time interval is sufficiently small, then the distribution of the number of packets transmitted by a long flow has one dominant Pareto mode and therefore can be characterized in a robust way. The quantity Bmin defines the elephants of the corresponding trace: an elephant is a flow whose number of packets during a time interval of length ∆ is greater than or equal to Bmin. By the definition of Bmax, flows whose size is greater than Bmax represent a small fraction of the elephants. A mouse is a flow with fewer than Bmin packets. It should be noted that the parameters computed in a time window of length ∆ do not give a complete description of the distribution of the total number of packets of a flow, since the statistics are computed over a limited horizon. To obtain information on the total number of packets, it is necessary to glue together the statistics from successive time windows of length ∆. Chabchoub et al. [79] leave this as an open problem.
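A variate with the Pareto tail above can be drawn by inverse transform sampling, which is also how the traces for our experiments are generated. A minimal Java sketch (the class name is hypothetical; the parameters Bmin = 20 and a = 1.85 are those reported for commercial traffic in [79]):

```java
import java.util.Random;

/** Sketch of inverse transform sampling for the Pareto tail
 *  P(S >= x | S >= Bmin) = (Bmin/x)^a: draw U uniform on (0,1)
 *  and return T = Bmin / U^(1/a). */
class ParetoSampler {
    static final double B_MIN = 20.0, SHAPE_A = 1.85;

    static double sample(Random rng) {
        double u = rng.nextDouble();
        while (u == 0.0) u = rng.nextDouble();   // avoid division by zero
        return B_MIN / Math.pow(u, 1.0 / SHAPE_A);
    }
}
```

Every sample is at least Bmin, and the fraction of samples exceeding a threshold x decays like (Bmin/x)^a, matching the tail formula above.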

8.3 Generation of sequences of operations

Random samples from a Pareto distribution can be generated using inverse transform sampling. Given a random variate U drawn from the uniform distribution on the unit interval (0, 1), the variate T = Bmin / U^(1/a) is Pareto-distributed [80]. It turns out that for commercial traffic, the value of Bmin is close to 20, Bmax close to 94, and a close to 1.85 [79].2

Due to the unavailability of public IP traces associated with their corresponding routing tables, we synthetically generate a trace file using the 10:36:23 routing information base snapshot and the parameters above. The latest RIB snapshot provided prior to 10:36:24 is the 08:42 snapshot. We generate the 10:36:23 snapshot starting with the available 08:40 RIB and performing the updates that took place between 08:42 and 10:36:23. (BGP routers typically receive multiple paths to the same destination. We arbitrarily select one to be installed as the best route.) The resulting RIB contains 235620 entries.

To generate the traces, we randomly select a point within the range of each route table entry; this ensures that a query is covered by at least one filter in the set,

2 Parameters for traces from the France Telecom (FT) network

Figure 8.2: Excerpt from the generated trace file.

hence default filter matches are avoided. Each query is replicated T times. Yet, packets of the same flow are not back-to-back but mixed with packets of other flows. In order to interleave the various flows, we randomize the entries within a fixed-size window which moves over the trace file. We choose the window length such that packet interarrival times tend to be small. Figure 8.2 shows an excerpt of the generated trace file with the temporal locality characteristics. Let the arrival times of the packets be consecutive numbers. Then, the interarrival time of two consecutive identical references is one time unit. There are 14 unique IP addresses, of which three addresses, e.g., 5218394330100989951, have an interarrival time of one.
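The interleaving step can be sketched as follows. This is a hypothetical helper, shown in a simplified variant that shuffles non-overlapping windows; a window sliding one entry at a time is a straightforward variation.

```java
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch of the interleaving step: shuffle the trace within fixed-size
 *  windows so that packets of different flows mix locally while repeated
 *  references to the same address stay close together (temporal locality). */
class TraceInterleaver {
    static <E> void shuffleInWindows(List<E> trace, int window, Random rng) {
        for (int start = 0; start < trace.size(); start += window) {
            int end = Math.min(start + window, trace.size());
            // subList is a view: shuffling it permutes the backing list in place
            Collections.shuffle(trace.subList(start, end), rng);
        }
    }
}
```

Because entries never leave their window, the interarrival time of two references to the same address grows by at most the window length, which keeps the locality characteristics intact.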

In router tables, IP lookup is intermixed with updates. To generate sequences consisting of searches, insertions and deletions, the searches were randomly dispersed in the sequence of updates (from the update file of April 13, 2007, starting at 10:36:24) such that the original update sequence and search sequence were maintained.

8.4 Test setup

The experiments are performed on a Sun Fire T2000 server with a 1.0 GHz UltraSPARC T1 processor with six cores. It contains 8 GB of DDR2 main memory and 3 MB of Level-2 cache. It has a Unified Memory Architecture (UMA), i.e., memory is shared among all six cores. It contains multiple physical instruction execution pipelines (one for each core), with several active thread contexts per pipeline or core. Each core is designed to switch between up to four threads on each clock cycle. Threads that are stalled, e.g., those waiting for a memory access, are skipped. As a result, the processor's execution pipeline remains active doing real useful work, even as memory operations for stalled threads continue in parallel [81].

We concurrently perform a sequence of dictionary operations on a balanced standard red-black tree which is built by inserting the 10:36:23 RIB snapshot into an empty tree. This snapshot represents the Local-RIB. The test environment consists of three classes:

• The main program, which starts the test thread.

• The test thread, which builds the RIB snapshot and launches a variable number of tree threads.

• A tree thread, which performs the insert, delete and search operations on the snapshot.3

In the RMART as well as in the MART, the test thread remains active during the execution of the test (the main program terminates after it has started the test thread). In the case of strict balancing, 2, 3, 4, 5 or 6 threads are concurrently active: the tree threads, which perform the dictionary operations, plus the test thread. In the case of relaxed balancing, the four rebalancing processes, which get their work from the appropriate problem queues, are additionally started. The generated sequence consisting of searches and updates is stored in a linked queue which manages concurrent access. Each tree thread takes an entry from the sequence, performs the operation and then retrieves the next entry from the queue. The queue guarantees that each element is taken out exactly once.

To benchmark the MART and the RMART, we measure the total time needed to perform the sequence of search and update operations as well as the average time of an insert, delete and search operation. The time is measured with System.nanoTime, which is provided in the java.lang package. To measure the total time of execution, we take the minimum start time over all tree processes and the maximum end time, and compute the difference, cf. Figure 8.3 for an example.
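The total-time computation (minimum start, maximum end) can be sketched as follows; TotalTimer is a hypothetical helper name.

```java
import java.util.List;

/** Sketch of the timing scheme: each tree thread records its start and
 *  end time via System.nanoTime; the total execution time is
 *  max(end) - min(start) over all threads. */
class TotalTimer {
    /** Each entry is a {startNanos, endNanos} pair recorded by one thread. */
    static long totalNanos(List<long[]> startEndPairs) {
        long minStart = Long.MAX_VALUE, maxEnd = Long.MIN_VALUE;
        for (long[] p : startEndPairs) {
            minStart = Math.min(minStart, p[0]);
            maxEnd = Math.max(maxEnd, p[1]);
        }
        return maxEnd - minStart;
    }
}
```

For the pairs {10, 50}, {20, 80} and {5, 60}, the total time is 80 − 5 = 75 time units, regardless of how the individual intervals overlap.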

3 We solely perform the operations on the Local-RIB.

Figure 8.3: The total execution time to perform a sequence of operations. In this example, the operations are performed by four threads.

The benchmark considers the average of the total time as well as the average time per insert, delete and search operation over three diverse sequences. In Java, it is not possible to bind a thread to a processor. We assume that the Java Virtual Machine spreads the threads equally over the six available processors. A router should perform the rebalancing tasks during less busy times. In order to simulate this, the rebalancing processes sleep a random amount of time between 0 and 5999 milliseconds.

Before we benchmark the RMART and the MART in a concurrent environment, we examine how the total time behaves when the sequence of operations is performed sequentially. To this end, we perform a sequence of one million dictionary operations on the MART. Since the underlying architecture switches between threads, and the next operation may only be performed when the current operation is complete, time elapses without actual work being done, and the total time increases with the number of parallel threads, cf. Figure 8.4.

8.5 Comparison of the RMART and the MART

We simulate various scenarios with varying update frequencies based on the 10:36:23 snapshot. The snapshot is represented by a balanced standard red-black tree which is built by inserting the 10:36:23 snapshot into an empty tree. We repeat each test three times and compute the average total time as well as the average time per operation, respectively. Both trees yield correct answers to lookup queries in a given sequence of operations, since in both trees the appropriate nodes are locked when performing a structural change. The focus of this benchmark lies on measuring the advantage of relaxed balancing over instantaneous rebalancing. We do this by comparing the respective performance results of both trees for the same number of running tree processes. The performance gain constitutes (100 − 100 × totalTimeRMART / totalTimeMART)%. For example, suppose the total time to perform a sequence of operations is 9 seconds in the RMART

Figure 8.4: Total execution time to perform a sequence of 1000k insert, delete and search operations sequentially.

and 10 seconds in the MART in the case of four processes. Then the performance gain is 10%.
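The performance-gain formula, gain[%] = 100 − 100 × totalTimeRMART / totalTimeMART, can be checked against this example with a few lines of Java:

```java
// Performance gain as defined in the text:
// gain[%] = 100 - 100 * totalTimeRMART / totalTimeMART
public class PerformanceGain {
    static double gain(double totalTimeRmart, double totalTimeMart) {
        return 100.0 - 100.0 * totalTimeRmart / totalTimeMart;
    }

    public static void main(String[] args) {
        // Example from the text: 9 s (RMART) vs. 10 s (MART) -> 10% gain.
        System.out.println(gain(9.0, 10.0));  // prints 10.0
    }
}
```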

Both trees are expected to be on par in the absence of routing updates. The following simulation will confirm this hypothesis.

8.5.1 Solely lookups

To conduct this test, we use the generated trace file and execute one million lookups on the 10:36:23 snapshot. In this setting, both trees have an equal number of active processes since the rebalancing processes are not started. The average time per search operation as well as the total time are on par in both trees. The total time to perform the lookups decreases in both trees, see Figure 8.5(a). This is because nodes can be r-locked by an arbitrary number of processes. In both trees, there is a 44% advantage in the case of five processes compared to the case of two executing processes. The average time to perform a search operation increases in both trees, see Figure 8.5(b). The higher the number of parallel processes, the higher the chance that processes must wait to enter the synchronized method in order to r-lock the same node. This phenomenon is also known as "lock contention". Hence, the average time for a single search operation increases.

Figure 8.5: Execution time. Solely search operations. (a) Total time to perform 1000k search operations. (b) Average time per search operation.

The results show that the total time decreases until eight threads (Figure 8.5(a) only shows up to five threads), even though 24 active execution threads are supported on this machine (six cores, where each core is designed to switch between up to four threads on each clock cycle). When performing the same scenario without r-locking (in this scenario it is not necessary for processes to use r-locks since no updates are performed), the total time decreases until 17 processes. These results suggest that lock contention poses a significant scalability impediment. Applications in Java are compiled to target the Java Virtual Machine. However, such applications, compiled to this virtual machine's instruction set (called Java byte codes), usually run on a processor either through an interpreter or through just-in-time (JIT) compilation [82]. One problem with conventional techniques for executing synchronized Java methods is that several time-consuming operations have to be performed in order to execute the synchronized "statement". Moreover, these operations are performed once when the monitor is acquired and then have to be repeated in order to release the monitor [83]. Hence, thread synchronization adds significantly to the execution time of many programs [84]. Performance can be enhanced by tailoring a microprocessor to the Java computing environment, e.g., by providing hardware support for garbage collection and thread synchronization [82]. But even then, synchronization depends on the support of the operating system [82]. Synchronization overhead can be further reduced when the Java byte codes are synthesized directly to hardware. Section 8.7 outlines a technique for converting byte codes to a hardware description language.
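The r-lock/w-lock behavior described above can be sketched with the standard java.util.concurrent ReentrantReadWriteLock as a stand-in; the thesis implementation uses its own locking classes, so this is illustrative only:

```java
// Sketch of r-locking: any number of readers may hold the read lock at once,
// while a writer (an update w-locking a node) excludes all others.
// ReentrantReadWriteLock is used here as a stand-in for the thesis's own locks.
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RLockedNode {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private int key;

    public RLockedNode(int key) { this.key = key; }

    public int readKey() {                 // r-lock: shared among concurrent searches
        lock.readLock().lock();
        try { return key; } finally { lock.readLock().unlock(); }
    }

    public void writeKey(int newKey) {     // w-lock: exclusive, used by updates
        lock.writeLock().lock();
        try { this.key = newKey; } finally { lock.writeLock().unlock(); }
    }
}
```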

In the following scenario, we solely perform update operations.

Figure 8.6: Time to perform 85961 update operations.
(a) Total time to perform 85961 update operations [ns].
(b) Average time per insert operation [µs]:

    Number of processes    2        3        4
    RMART                  51.668   65.54    91.4
    MART                   69.482   145.077  233.98

(c) Average time per delete operation [µs]:

    Number of processes    2        3        4
    RMART                  118.106  114.175  150.083
    MART                   198.904  313.17   363.719

8.5.2 Solely updates

The updates are taken from the update file of April 13, 2007 at 10:27; they start at 10:36:24 and end at 10:36:32, with 85961 updates all in all. In the RMART, the total time to perform the sequence of update operations decreases until three parallel tree threads, see Figure 8.6(a). In the MART, the total time increases. The performance gain is 23% in the case of two, 54% in the case of three, and 59% in the case of four processes. The average time of an insert and delete operation rises with the number of parallel threads in both trees, cf. Figures 8.6(b) and 8.6(c). In the MART, only one process may update the tree at a time; all other processes cannot enter the tree (the root is kept w-locked) and must wait until the update operation is complete. Hence, the average time rises with the number of parallel threads. In the RMART, many processes may update the tree concurrently at different locations in the tree. The greater the number of parallel threads, the higher the chance that update processes meet at the same location in the tree. Hence, the average time rises with the number of parallel threads. The performance gain when performing an update operation in the RMART compared to the MART is considerable and increases with the number of parallel processes. In the case of insertions, it varies from 26% for two processes to 61% for four processes. In the case of deletions, it varies from 41% for two processes to 59% for four processes. The faster a table update is settled, the faster a correct view of the current network topology is established.

In the following simulations we perform IP lookup and table update operations concurrently and examine the impact of varying update frequencies on the performance.

8.5.3 Various update frequencies

We simulate the update frequencies via the number of interspersed lookups in the original update sequence (the original orders of updates and searches are maintained). The first scenario performs 100k, the second 500k and the last scenario 1000k operations all in all.

First scenario

All in all, 100k operations are performed and hence 14039 lookups are interspersed. In this scenario, the 85961 updates constitute 85.9%. Only in the case of one process does the MART terminate earlier than the RMART. In the MART, the lookups are performed concurrently, but due to the high update rate, the total time increases, cf. Figure 8.7(a). In the case of two parallel processes the RMART needs 35% less time than the MART. This increases to 63% in the case of four processes.

The performance gain when performing a search operation in the RMART compared to the MART increases with the number of processes. It varies from 22% in the case of two processes to 27% in the case of four processes, cf. Figure 8.7(b). This supports the hypothesis that in the MART, a search operation is delayed due to instantaneous rebalancing. In relaxed balancing, the rebalancing operations are postponed to less busy times and hence searches are not delayed as much. The performance gain when performing an insert operation varies from 38% in the case of two processes to 66% in the case of four processes, cf. Figure 8.7(c). When performing a delete operation it constitutes 67% in the case of four processes, cf. Figure 8.7(d).

Figure 8.7: Execution time to perform a sequence of 100k insert, delete and search operations. The update rate is 85.9%.

(a) Total time [s]:

    Number of processes    2        3        4
    RMART                  2.60586  2.31174  2.30124
    MART                   4.03869  5.70736  6.22572

(b) Average time per search operation [µs]:

    Number of processes    2       3       4
    RMART                  25.286  31.361  39.325
    MART                   32.267  42.828  53.677

(c) Average time per insert operation [µs]:

    Number of processes    2       3        4
    RMART                  51.501  68.308   93.707
    MART                   83.683  184.64   273.164

(d) Average time per delete operation [µs]:

    Number of processes    2        3        4
    RMART                  103.131  127.735  155.968
    MART                   197.478  318.782  468.29

Second scenario

In this scenario, the updates constitute 17%. All in all, 500k operations are performed and hence 414039 lookups are interspersed. In the MART, the lookups are performed concurrently. Due to the moderate update rate, the total time levels off, cf. Figure 8.8(a). The performance gain adds up to 31% in the case of four processes. The average time per search operation is visualized in Figure 8.8(b). The performance gain adds up to 14% in the case of four processes. When performing an insert operation it adds up to 51% in the case of four processes, cf. Figure 8.8(c), and up to 59% in the case of a delete operation, cf. Figure 8.8(d).

Third scenario

In the last scenario, 1000k operations are performed and hence 914039 lookups are interspersed. The updates constitute 8.6%.

Figure 8.8: Execution time to perform a sequence of 500k insert, delete and search operations. The update rate is 17%.

(a) Total time [s]:

    Number of processes    2        3        4
    RMART                  7.81801  6.05761  5.55424
    MART                   8.1968   7.4579   8.07405

(b) Average time per search operation [µs]:

    Number of processes    2       3       4
    RMART                  23.931  27.264  32.236
    MART                   24.136  29.154  37.543

(c) Average time per insert operation [µs]:

    Number of processes    2       3        4
    RMART                  54.15   64.855   87.555
    MART                   61.085  105.119  177.382

(d) Average time per delete operation [µs]:

    Number of processes    2        3        4
    RMART                  106.881  121.662  130.528
    MART                   114.067  211.601  314.951

Due to the relatively low update rate, the total time also decreases in the MART, cf. Figure 8.9(a). The performance gain only adds up to 11% in the case of four processes. The average time per search operation is on par in both trees, cf. Figure 8.9(b). The performance gain only adds up to 5% in the case of four processes. When performing an insert operation it constitutes 31% in the case of four processes, cf. Figure 8.9(c). The performance gain when performing a delete operation adds up to 36% in the case of three, and 41% in the case of four processes, cf. Figure 8.9(d).

Figure 8.9: Execution time to perform a sequence of 1000k insert, delete and search operations. The update rate is 8.6%.

(a) Total time [s]:

    Number of processes    2         3           4
    RMART                  14.73932  11.0807266  9.4920708
    MART                   14.5188   11.4548     10.658857

(b) Average time per search operation [µs]:

    Number of processes    2       3       4
    RMART                  25.013  27.841  31.492
    MART                   24.46   27.988  33.026

(c) Average time per insert operation [µs]:

    Number of processes    2       3       4
    RMART                  55.897  66.047  81.672
    MART                   57.727  79.737  118.746

(d) Average time per delete operation [µs]:

    Number of processes    2        3        4
    RMART                  113.656  115.838  140.117
    MART                   114.487  181.395  235.643

8.5.4 Résumé of experimental results

In the experiments, scenarios with varying update frequencies were simulated. If solely search operations are performed, the total time of the MART is on a par with the total time of the RMART. If insert and delete operations are performed as well, then the higher the update/lookup ratio, the more clearly the relaxed MART outperforms the standard MART. Table 8.1 summarizes the results for the various tested scenarios when performing a sequence of search, insert and delete operations. Further, the results have shown that the performance gain per search operation grows with the update/lookup ratio. If only search operations are performed, the average time per search operation is on a par in both trees. The higher the ratio, and the higher the number of processes, the clearer the difference in the average performance per search operation. This confirms the hypothesis that in the relaxed balanced min-augmented range tree, lookup queries are not delayed as much as in the standard version, since the rebalancing operations are postponed. The smaller the periods of lookup latency, the smaller the chance that packets are dropped due to buffer overload. Thus another advantage is that the rate of packet loss might be reduced.

The simulations were performed on a six-core unified memory architecture. Here, the total time of the RMART decreased until a maximum of four threads.

Table 8.1: Performance gain in terms of total execution time when performing a sequence of search, insert and delete operations.

    % updates    2 processes    3 processes    4 processes
    8.6          -              3              11
    17           5              19             31
    86           35             59             63
    100          23             54             59

In the following, we evaluate the RMART's performance when executed on a Non-Uniform Memory Architecture (NUMA). NUMA is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. The next section summarizes our benchmark results performed on a Sun Fire X4600.

8.6 Benchmark on Sun Fire X4600

The Sun Fire X4600 M2 server is a NUMA system that supports up to eight internal CPU/memory modules. Each module holds a single dual-core AMD Opteron processor, where each core has a dedicated 1 MB Level-2 cache. Further, each module supports up to 8 DDR2 memory DIMM slots (1, 2, or 4 GB DIMMs). Processors are directly connected to memory, I/O, and each other via HyperTransport links. HyperTransport technology is a high-speed, low-latency, point-to-point link designed to increase the communication speed between integrated circuits in computers, servers, embedded systems, and networking and telecommunications equipment [85]. Figure 8.10 illustrates the HyperTransport topology for an eight-processor configuration. With NUMA, maintaining cache coherence across shared memory has a significant overhead. By using inter-processor communication between cache controllers, the memory image is kept consistent when more than one cache stores the same memory location. For this reason, cache-coherent NUMA performs poorly when multiple processors attempt to access the same memory area in rapid succession.

Figure 8.10: HyperTransport topology in Sun Fire X4600 M2 servers for an eight-processor configuration. From [85].

In the following we summarize our benchmark results obtained on a Sun Fire X4600 M2 server with an eight-processor configuration and compare them with our results from the Sun Fire T2000 server. When solely lookups (one million) are performed, there is a 48% advantage in the case of eight parallel processes compared to the case of two processes in terms of total time on the T2000. When maintaining a queue containing one million lookups on the X4600, both the RMART and the MART do not scale with the number of processes, i.e., there is almost no advantage when four processes perform the sequence of lookups compared to when only two processes perform these operations. When the queue size is reduced and only 100k lookups are performed, the total time of both trees scales well with the number of processes. The total time of both trees decreases until six processes with an advantage of 39%. When solely updates are performed on the T2000, the total time decreases until three threads in the RMART. There is a 17% advantage in the case of three processes compared to the case of two executing RMART processes. When performed on the X4600 server, the total time decreases until five concurrent processes with an advantage of 33%. When updates and lookups are interleaved (all in all 100k operations), the total time decreases barely until four processes with an advantage of 12% when performed on the T2000. When performed on the X4600, the total time decreases until five threads with an advantage of 32%. Here, the performance gain when performing a search operation in the RMART compared to the MART constitutes 27% in the case of four (five) processes when performed on the T2000 (X4600). Hence, also on this architecture lookups are less delayed in the RMART.

Even though the (R)MART is NUMA-unfriendly (the processes work on common data), the RMART has been shown to scale quite well up to a certain number of parallel processes on the X4600 server (provided that the sequence of operations, on which all processes work concurrently, was not too large). Yet, not only the underlying architecture plays a role in the scalability, but also the implementation of the Java Virtual Machine, particularly in our case the implementation of thread synchronization, targeted to the processor and operating system combination.

Whether performed on a uniform or a non-uniform memory architecture, the performance of software running on a microprocessor is unfavorably affected by the instruction cycle. There are basically four stages of an instruction cycle that a microprocessor carries out:

1. Fetch the next instruction into the Current Instruction Register (CIR)
2. Decode the instruction
3. Execute the instruction
4. Store results back

The term "instruction cycle" refers both to this series of four steps and to the amount of time that it takes to carry them out. Most microprocessors are divided into two main components: a datapath and a control unit [86]. The first two steps of the instruction cycle are performed by the control unit. The datapath mainly consists of an arithmetic logic unit (ALU), which is a digital circuit that performs arithmetic and logical operations. A digital circuit is often constructed from small electronic circuits called logic gates. Each logic gate represents a function of boolean logic, e.g., AND and OR [87]. The inputs to the ALU are the data to be operated on and a code from the control unit indicating which operation to perform. Implementing the software algorithm directly in hardware, i.e., as a dedicated datapath and control unit that executes only a particular algorithm, can alleviate the performance penalty of the fetch-decode steps. A Field-Programmable Gate Array (FPGA) is an integrated circuit that contains programmable logic components called "logic blocks", and programmable interconnects. Logic blocks can be programmed to perform the function of basic logic gates such as AND and NOT [88] [89]. Complex designs are created by combining these basic blocks to create the desired circuit. To configure an FPGA, the user specifies the FPGA's function with a logic circuit diagram or a hardware description language (HDL).
This specification is fed to a software suite from the FPGA vendor that produces a file which is then transferred to the FPGA [88]. Packet forwarding in high-speed IP routers must be done directly in hardware. In the next section we will describe how the RMART could be described by a hardware description language.

8.7 Implementing the RMART in hardware

Field-Programmable Gate Arrays offer the flexibility of software executed on a microprocessor along with increased performance in terms of throughput. However, this flexibility usually requires expertise in hardware design and a hardware description language such as VHDL or Verilog. The High-Performance FPGA Laboratory (HPFL) at Oakland University has developed a compiler that is able to convert programs written in Java to datapaths and corresponding control units, collectively called flowpaths [90]. Flowpaths can then be synthesized to hardware using an HDL. The compiler takes as input Java byte codes generated from a Sun Microsystems compliant Java compiler. Java is one of several software-programming languages that compile to an intermediate representation (IR) that is stack-based. To execute instructions, variables are loaded onto the stack, the instruction is executed, and the answer is stored back. This load-execute-store feature of Java byte codes maintains the normal microprocessor paradigm. Indeed, microprocessors have been developed that execute Java byte codes directly [91]. However, the performance of these microprocessors still suffers from the excessive use of local variables in the original software program [92]. Experimental results show that flowpaths can perform within a factor of two of a minimal hand-crafted direct hardware implementation and orders of magnitude better than compiling the program to a microprocessor [93]. Duchene and Hanna further describe a technique to extend the flowpath architecture to generate flowpaths directly from Java byte codes representing multithreaded Java programs [94]. Java supports thread synchronization through the use of monitors. When a thread holds the monitor for some object, other threads are locked out and cannot inspect or modify the object. Java uses the synchronized keyword, see section 6, to mark sections that operate on common data.
In the Java language, a unique monitor is associated with every object that has a synchronized method. The synchronized keyword is reflected by monitorenter/monitorexit in the corresponding Java byte codes. Given a multithreaded algorithm, each thread is converted to a flowpath, where the logic for requesting and releasing locks to shared memory is added to each flowpath. Further, each flowpath is connected to an access controller which controls access to shared memory, cf. Figure 8.11. Space for the instance variables of an object is allocated when the application instantiates an object instance from a class (with new) [92]. A traditional producer/consumer example of creating flowpaths from a multithreaded Java application can be found in [94]. Performance increases occur, in general, since flowpaths created from multithreaded Java programs do not suffer from traditional processor bottlenecks such as context switching, stack manipulation and the traditional instruction cycle [92].
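A minimal illustration of the monitor semantics discussed here (note that in the byte codes, a synchronized block compiles to explicit monitorenter/monitorexit instructions, while a synchronized method is instead flagged so that the JVM acquires the monitor implicitly):

```java
// Illustration of Java monitors: while one thread holds the monitor of 'this',
// no other thread can execute either synchronized region on the same object.
public class SharedCounter {
    private int value;

    public synchronized void increment() {   // monitor of 'this' held for the call
        value++;
    }

    public int get() {
        synchronized (this) {                 // compiles to monitorenter/monitorexit
            return value;
        }
    }
}
```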

Figure 8.11: N tasks with shared data. From [94].

This extended scheme could be employed in order to implement the RMART directly in hardware. The tasks are the threads that operate on the tree and are contained in a single task frame. Further research is required to investigate how the RMART performs when implemented directly in hardware. A current Xilinx FPGA operates at over 550 MHz, contains a total of over 10 megabits of embedded memory, and provides high-bandwidth interfaces to several megabytes of off-chip memory.

Chapter 9

Conclusions and Future Directions

In order to efficiently support update bursts and to reduce IP lookup latency, we proposed an elegant representation of dynamic routing tables, namely relaxed balanced min-augmented range trees. The interactive animation of the RMART enhanced the understanding of its concepts and furnished an additional indication that the proposed locking scheme is deadlock-free. We have benchmarked the relaxed min-augmented range tree against the strictly balanced version using real IPv4 routing data. The experimental results confirmed the hypothesis that the relaxed balanced min-augmented range tree is better suited than its strictly balanced counterpart when confronted with update bursts. Of course, the simulation environment is not a practical solution for high-speed IP routers. Rather, packet forwarding must be done in hardware. It would be interesting to evaluate the RMART's performance when implemented directly in hardware. This could be achieved by utilizing the extended flowpath architecture as outlined in section 8.7. The change of paradigm in IP networks towards QoS, multimedia and real-time applications calls for fast update rates in higher-dimensional classification tables. Typically, high update rates in packet classification designs have been sacrificed to achieve better search times or to reduce storage capacity requirements. In light of our experimental results, it would be interesting to study relaxed data structures which can be used for the representation of higher-dimensional packet classifiers.

It is generally accepted that routers will take longer to forward IPv6 packets, and that the routing tables under IPv6 will get bigger [95] [96]. Further, it is reasonable to expect that, as the number of hosts connected to the Internet grows, worst case burst update rates will increase. Further research is needed to scrutinize the adequacy of the RMART under IPv6.

Part II

Packet Classification

Chapter 10

Introduction

In order to enforce network security policies, guarantee service agreements, perform monitoring, etc., network routers are required to examine multiple fields of the packet header. Geometrically speaking, classifying an arriving packet is equivalent to finding the highest-priority rectangle among all rectangles that contain the point representing the packet. The R-tree, one of the most popular access methods for multidimensional data, was introduced by Guttman in 1984 [97]. The R-tree was proposed as an index mechanism that stores d-dimensional geometric objects and supports spatial retrievals efficiently. The challenge for R-trees is the following: dynamically maintain the structure in a way that retrieval operations are supported efficiently. Common retrieval operations are range queries, i.e., find all objects that intersect a query region, or point queries, i.e., find all objects that contain a query point. The R-tree can be easily implemented, which considerably contributes to its popularity. Several modifications of the original R-tree have been proposed to either improve its performance or adapt the structure to a different application domain.
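The point-query semantics can be illustrated with a brute-force sketch (an R-tree answers the same query by descending only into bounding boxes that contain the point; the rectangles below are made up for illustration):

```java
// Semantics of a point query: report all rectangles that contain the query
// point. Brute force over a list; an R-tree prunes by bounding boxes instead.
import java.util.ArrayList;
import java.util.List;

public class PointQuery {
    record Rect(int id, double xLo, double xHi, double yLo, double yHi) {
        boolean contains(double x, double y) {
            return xLo <= x && x <= xHi && yLo <= y && y <= yHi;
        }
    }

    static List<Integer> pointQuery(List<Rect> rects, double x, double y) {
        List<Integer> hits = new ArrayList<>();
        for (Rect r : rects) if (r.contains(x, y)) hits.add(r.id);
        return hits;
    }

    public static void main(String[] args) {
        List<Rect> rects = List.of(
            new Rect(1, 0, 10, 0, 10),
            new Rect(2, 5, 15, 5, 15));
        System.out.println(pointQuery(rects, 7, 7));  // both rectangles contain (7,7)
    }
}
```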

10.1 Goal of this part

The R-tree and its variants, being amongst the most popular access methods for points and rectangles, have not been experimentally evaluated and benchmarked for their eligibility for the packet classification problem. In this chapter we investigate how well the popular R*-tree is suited for five-dimensional packet classification in a static environment. To this end we will benchmark the R*-tree against two representative classification algorithms using the ClassBench tool suite [98]. If the R*-tree proves suitable in a static classification scenario, then it can further be investigated in a dynamic scenario, i.e., where classification is intermixed with filter updates. Since spatial applications often involve massive datasets, R-trees and their variants are often implemented disk-based. Packet classification has to be done as fast as possible; hence, in our benchmark, we will use a main-memory-based implementation.

10.2 Organization of part II

The remainder of this part is organized as follows. Section 10.3 surveys packet classification techniques. Chapter 11 describes R-trees and several of their variants. After presenting the classification algorithm based on R-trees in chapter 12, we discuss our benchmark results for the R*-tree and two representative classification algorithms.

10.3 Related work

The RFC (Recursive Flow Classification) scheme is a decomposition-based algorithm which provides very high classification throughput at the cost of low memory efficiency [99]. RFC performs independent, parallel searches on chunks of the packet header. Thus, the parallelism offered by hardware can be leveraged. The result of each chunk lookup is an equivalence class identifier eqID that represents the set of potentially matching filters for the packet. An example of assigning eqIDs is shown in Figure 10.1. In this example, the rectangles are defined by the filters in our running example filter set in Figure 1.3. The end points of each rectangle are projected onto the axes. Any two adjacent projection points on an axis define an elementary interval which is fully covered by a set of filters. Two neighboring elementary intervals cannot represent the same set of filters, whereas two non-adjacent elementary intervals may represent the same set of filters. Each elementary interval is assigned an eqID. Elementary intervals representing the same set of filters are labeled with the same eqID. Note that in our example the fields create six equivalence classes in the source address field and five equivalence classes in the destination address field. The results of the chunk searches are combined in multiple phases. RFC lookups in chunk and aggregation tables utilize indexing. The index tables used for aggregation require significant precomputation in order to assign the proper eqIDs for the combinations of the eqIDs of the previous phases. Such extensive precomputation precludes dynamic updates at high rates. Several proposed techniques employ a trie-based approach. A hierarchical trie (H-trie) [100] is a multidimensional prefix-based matching scheme, i.e., range-based fields must first be transformed into prefixes. An H-trie is recursively constructed as follows: First, a one-dimensional trie is constructed for the first dimension.
Figure 10.1: Example of Recursive Flow Classification using the filter set in Figure 1.3.

Then, for each prefix pr, a (d − 1)-dimensional trie is constructed on those filters that specify pr in the first dimension. Each node in the first trie that is associated with a prefix is connected to the second trie, and so forth. Classification of an incoming packet starts in the top trie. At each trie node encountered, the algorithm follows the "next-trie" pointer (if present) and traverses the (d − 1)-dimensional trie. Assuming that the maximum prefix length is w and the number of dimensions is d, the H-trie requires O(w^d) search time and O(ndw) memory, where n is the number of filters. Incremental updates can be carried out in O(d^2 w) time since each component of the updated rule is stored in exactly one location at maximum depth O(dw). Set-pruning tries improve the search time to O(dw) by replicating rules to eliminate the need for multiple traversals in each of the tries [101] [100]. The query for an incoming packet with fields (h1, h2, . . . , hd) locates the node associated with the longest matching prefix for the first field h1, then follows the "next-trie" pointer to locate the longest matching prefix for h2, and so forth for all dimensions. The rules are replicated to ensure that every matching rule will be encountered on the path. The query time is reduced to O(dw), yet it requires O(n^d dw) memory, the price of the improved search time. Update complexity is O(n^d), and hence this data structure is only suited for relatively static classifiers. The Grid-of-Tries data structure [102] was proposed for two-dimensional packet classification and eliminates filter replication by storing filters at a single node and using switch pointers to direct searches to potentially matching filters. Grid-of-Tries bounds memory usage to O(nw) while achieving a search time of O(w). The authors propose a technique using multiple instances of the Grid-of-Tries structure for packet classification on the standard 5-tuple, albeit with some loss of efficiency.
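As an aside, the elementary-interval and eqID computation described for RFC above can be sketched on a single axis; the filter ranges below are hypothetical, not the ones from Figure 1.3:

```java
// Sketch of eqID assignment on one axis: project filter endpoints, form
// elementary intervals between adjacent projection points, and give intervals
// covered by the same set of filters the same eqID.
import java.util.*;

public class EqIdSketch {
    /** filters: [lo, hi] ranges on one axis. Returns one eqID per elementary interval. */
    static List<Integer> assignEqIds(int[][] filters, List<int[]> intervalsOut) {
        TreeSet<Integer> points = new TreeSet<>();
        for (int[] f : filters) { points.add(f[0]); points.add(f[1] + 1); }
        List<Integer> sorted = new ArrayList<>(points);
        Map<Set<Integer>, Integer> classes = new LinkedHashMap<>();
        List<Integer> eqIds = new ArrayList<>();
        for (int i = 0; i + 1 < sorted.size(); i++) {
            int lo = sorted.get(i), hi = sorted.get(i + 1) - 1;
            Set<Integer> covering = new TreeSet<>();
            for (int j = 0; j < filters.length; j++)
                if (filters[j][0] <= lo && hi <= filters[j][1]) covering.add(j);
            // intervals covered by the same filter set share an eqID
            eqIds.add(classes.computeIfAbsent(covering, k -> classes.size()));
            intervalsOut.add(new int[]{lo, hi});
        }
        return eqIds;
    }

    public static void main(String[] args) {
        int[][] filters = { {0, 7}, {4, 11} };   // hypothetical 1-D filter ranges
        List<int[]> intervals = new ArrayList<>();
        // Elementary intervals [0,3], [4,7], [8,11] get three distinct eqIDs.
        System.out.println(assignEqIds(filters, intervals));
    }
}
```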

Baboescu, Singh, and Varghese proposed the Extended Grid-of-Tries (EGT), which supports multiple-field searches without the need for many instances of the Grid-of-Tries structure [103]. In the worst case, EGT requires O(w^2) memory accesses per classification.

In the following, we will survey several approaches that utilize the geometric view of the filter set.

Gupta and McKeown introduced a seminal technique called Hierarchical Intelligent Cuttings (HiCuts) [104]. The concept of cutting comes from viewing the packet classification problem geometrically. Selecting a decision criterion is analogous to choosing a partitioning, or cutting, of the space. The decision-tree construction algorithm recursively cuts the space into smaller sub-regions, one dimension per step. The cuttings are made by axis-parallel hyperplanes. In order to keep the decisions at each node simple, each node is cut into equally sized partitions along a single dimension. The leaves contain a small number of filters bounded by a threshold. A larger threshold can help reduce the size and depth of the decision tree, but can entail a longer linear search time; a smaller threshold has the opposite effects. Packet header fields are used to traverse the decision tree until a leaf is reached. The filters stored in that leaf are then searched linearly for a match. If a packet matches multiple filters, the one with the highest priority is returned.

Figure 10.2 illustrates an example of the decision-tree construction for our example filter set in Figure 1.3. First, we cut along the x-axis to generate four sub-regions. If we decide it is affordable to do a linear search on at most three filters, we can stop cutting sub-regions with three or fewer filters. Three of the four sub-regions contain three or fewer filters, hence we can stop cutting these regions further. In the following step, we cut the remaining sub-region along the y-axis to generate two sub-regions. This results in one sub-region containing only two filters (f2, f7), and another containing four filters (f2, f3, f4, f7). In the last step, we cut the latter along the y-axis to generate two sub-regions, each containing three rules.
Now, every sub-region contains at most three rules and the construction terminates. The resulting data structure is shown in Figure 10.3. Each tree node covers a portion of the d-dimensional space and the root node covers the entire space. In this example, we have set the thresholds such that a leaf contains at most three filters and a node may contain at most four children.

It is very difficult to find the globally optimal decision tree under given constraints, so in practice the algorithm uses various heuristics to select decision criteria at each node that minimize the depth of the tree while controlling the amount of memory used. Intuitively, the more cuts are made at each step, the fatter and lower the resulting decision tree will be. However, a large number of cuts may lead to excessive duplication of filters. Apart from the number of cuts, the choice of

Figure 10.2: A partitioning created by HiCuts for the example filter set in Figure 1.3.

Figure 10.3: HiCuts data structure for the example filter set in Table 1.1. The maximum size of the set of filters at each leaf is set to three.

the cutting at each intermediate decision-tree node is also critical for the performance of the algorithm. The preprocessing time is high, caused mainly by the complexity of the heuristics. Incremental update time depends on the filter to be inserted or deleted.
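The construction and lookup just described can be condensed into a small sketch. This is a minimal HiCuts-style tree over two-dimensional rectangular filters using equal-width cuts and a fixed alternating cut dimension; the published algorithm chooses dimensions and cut counts heuristically, and all names and thresholds here are illustrative.

```python
# Minimal HiCuts-style decision tree: equal-width cuts along one dimension
# per node, leaves hold at most BUCKET filters for a final linear search.

BUCKET = 3   # linear-search threshold at the leaves
CUTS = 4     # number of equal-width partitions per internal node

def build(filters, region, depth=0):
    # filters: list of (id, ((xlo, xhi), (ylo, yhi))) half-open boxes
    if len(filters) <= BUCKET or depth > 8:   # depth cap guards replication
        return ("leaf", filters)
    dim = depth % 2                 # alternate cut dimension (simple heuristic)
    lo, hi = region[dim]
    width = (hi - lo) / CUTS
    children = []
    for i in range(CUTS):
        sub = list(region)
        sub[dim] = (lo + i * width, lo + (i + 1) * width)
        inside = [f for f in filters
                  if f[1][dim][0] < sub[dim][1] and f[1][dim][1] > sub[dim][0]]
        children.append(build(inside, tuple(sub), depth + 1))
    return ("node", dim, lo, width, children)

def classify(tree, point):
    while tree[0] == "node":
        _, dim, lo, width, children = tree
        i = min(int((point[dim] - lo) / width), len(children) - 1)
        tree = children[i]
    # linear search over the leaf bucket
    return [fid for fid, (xr, yr) in tree[1]
            if xr[0] <= point[0] < xr[1] and yr[0] <= point[1] < yr[1]]

filters = [("f1", ((0, 8), (0, 16))), ("f2", ((8, 16), (0, 8))),
           ("f3", ((0, 16), (12, 16))), ("f4", ((4, 12), (4, 12)))]
tree = build(filters, ((0, 16), (0, 16)))
print(classify(tree, (2, 14)))   # ['f1', 'f3']
```

Note that filters straddling a cut boundary are replicated into several children, which is exactly the duplication effect discussed above.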

The HyperCuts algorithm introduced in [105] eliminates this limitation of HiCuts by cutting the space along the most representative dimensions, as opposed to only a single dimension. This simulates several cuts of HiCuts in one cut and reduces the height of the decision tree. For each of the chosen dimensions, the number of cuts is computed based on a metric dependent on the amount of space that is available for the search structure.

Another optimization is the idea of pulling filters up in the decision tree. The authors observed that a heavily wildcarded filter often ends up in many leaves, increasing storage consumption. Their approach pulls all common filters in a subtree up into a linear list at the root of the subtree.

In order to determine which of the pointers in each node to follow during a search, array indexing is used, which costs one memory access regardless of the number of children at a node. In this indexing scheme, cut widths have to be fixed in each dimension.

Qi and Li [106] propose ExCuts, an extension of HyperCuts that mainly improves the memory consumption.

Another geometry-based solution for multidimensional classification, referred to as G-filter, was proposed by Geraci et al. [107]. The space that represents all possible values of the packets' attributes is called the universe. The input for the algorithm that constructs the search data structure is a region r of the search space and a list F(r) of filters potentially intersecting the region r. Initially, the algorithm starts with the entire filter set F, and r equals the universe. The algorithm partitions the filters into the following sets, with each filter f belonging to exactly one set:

1. if f does not intersect r, it is discarded (a query point in region r will never match the rule);

2. otherwise, if f covers the entire region r, it becomes part of the set cover (r) of cover rules;

3. otherwise, if the projection Pj(f) of f on axis j entirely covers the projection Pj(r) of the region r on the same axis, f becomes part of the set FBj(r) of fallback rules on axis j (if f satisfies this property for more than one axis, we arbitrarily pick one);

4. otherwise, filter f becomes part of the set cross(r) of cross rules, which intersect r but do not fall into any of the other categories.
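The four cases above can be expressed directly in code. The following is a two-dimensional sketch with made-up boxes and our own helper name; it only illustrates the case analysis, not the full recursive G-filter construction.

```python
# Partition filters relative to a region r into cover / fallback / cross sets,
# following the four G-filter cases (boxes and r are ((xlo, xhi), (ylo, yhi))).

def partition(filters, r):
    cover, fallback, cross = [], {0: [], 1: []}, []
    for fid, box in filters:
        # 1. discard filters that do not intersect r
        if any(box[j][1] <= r[j][0] or box[j][0] >= r[j][1] for j in range(2)):
            continue
        # 2. the filter covers the entire region r
        if all(box[j][0] <= r[j][0] and box[j][1] >= r[j][1] for j in range(2)):
            cover.append(fid)
            continue
        # 3. the projection on some axis j covers r's projection on that axis
        for j in range(2):
            if box[j][0] <= r[j][0] and box[j][1] >= r[j][1]:
                fallback[j].append(fid)   # arbitrarily pick the first such axis
                break
        else:
            # 4. intersects r but none of the above: a cross rule
            cross.append(fid)
    return cover, fallback, cross

r = ((0, 10), (0, 10))
filters = [("a", ((-1, 11), (-1, 11))), ("b", ((-1, 11), (2, 5))),
           ("c", ((2, 5), (-1, 11))), ("d", ((2, 5), (2, 5))),
           ("e", ((20, 30), (0, 5)))]
print(partition(filters, r))  # (['a'], {0: ['b'], 1: ['c']}, ['d'])
```

Only the cross rules ("d" here) would be pushed down into the m subregions of r; the cover and fallback sets are resolved at this node.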

Figure 10.4: For the universe u, f7 ∈ cover(u), f1, f2 ∈ FB2(u), f3, f4, f5, f6 ∈ cross(u). For the subregion y2, f5, f6 ∈ FB1(y2). For the subregion y4, f3, f4 ∈ cross(y4).

Figure 10.4 shows a two-dimensional example of the relation between rules and regions. Any packet p contained in a region r matches all rules in cover(r). The only information we need to remember from this set is the filter fh(r) with the highest priority in cover(r), as this will be a potential result of the classification. For fallback rules, we know that if p ∈ r, then the j-th coordinate of p is within the range Pj(f) of all the rules in the set FBj(r). So p will match a rule f ∈ FBj(r) if and only if its remaining (d − 1) coordinates are contained in the remaining (d − 1) ranges of the rule. The problem thus reduces to a classification problem in a (d − 1)-dimensional region. Cross rules have to be partitioned further. This is done by recursively partitioning region r into m regions y1 . . . ym of uniform size and shape and assigning the remaining cross rules to these subregions.

Figure 10.5 shows the G-filter for the example of Figure 10.4 for the first two levels and m = 4. The only cover rule at root level is filter f7, hence it is remembered as fh(r). The cross rules f3, f4, f5 and f6 are located in the subregions y2 and y4; hence, subregions y1 and y3 point to null. Rules f5 and f6 are assigned to FB1(y2). Filters f3 and f4 are assigned to cross(y4), and hence subregion y4 is partitioned further.

Classification can be performed as a recursive process on the data structure. At each node (initially the root), we perform d recursive queries on the (d − 1)-dimensional fallback structures and one recursive query on the region yi with p ∈ yi, and return the highest-priority rule among fh(r) and the rules returned by the (d + 1) recursive queries. As an example, consider the packet p with coordinates (900, 900). The highest-priority filter fh(r) is f7. The highest-priority rule returned by the fallback structures is f1. The subregion y1 is not further partitioned. Hence the query algorithm yields f1 as the highest-priority rule.
Let F be a set of n hyperrectangles in a d-dimensional universe u and k a param-

Figure 10.5: The G-filter for the example of Figure 10.4 for the first two levels and m = 4.

eter, 1 ≤ k ≤ n. The data structure uses O(n · k^f(d) · (log_k |u|)^d) space and performs packet classification in time O((log_k |u|)^d). The function f(d) grows roughly as d^2/2. The structure does not support incremental updates, i.e., the structure has to be reconstructed from scratch each time the classifier changes.

The Area-based Quadtree (AQT) was proposed by Buddhikot et al. [108] for two-dimensional classification on the source and the destination prefix fields. The search space is recursively partitioned into four equally sized spaces. Each rectangular search space is mapped to a node in a quadtree. In other words, the entire space is mapped to the root node of the quadtree, four equally sized quadrants are mapped to the four children of the root node, and so forth.

Rules are allocated to the nodes as follows. A filter is said to cross a quadrant if it completely spans at least one dimension of the quadrant. The authors call the set of all filters that cross a given region r its Crossing Filter Set (CFS). The CFS of a region r can be split into two sets CX(r) and CY(r). The former is the set of filters that cross r perpendicular to the x-axis; the set CY(r) is the set of filters that cross r perpendicular to the y-axis. In Figure 1.3, f1 and f2 belong to CX(2^10 × 2^10). For both sets, only the range specified in the other dimension has to be stored. Each filter f is stored exactly once, at the highest node for which f is a crossing filter.

At each node, we search the CFS structure for the highest-priority filter at that node (two one-dimensional lookups). If this filter has higher priority than the filter recorded so far, we replace the recorded filter and continue. The AQT can take advantage of a well-known technique called fractional cascading to reduce the O(h log w) worst-case search complexity to O(h + log w), where the worst-case height h is w (w is the maximum prefix length). The memory requirement is O(n), because each filter is stored exactly once. AQT supports incremental updates, allowing the complexity to be traded off against the query time by a tunable parameter.

Lim, Kang and Yim propose a priority-based quadtree (PQT) for two-dimensional packet classification [109].
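The crossing test and the split of the CFS into CX(r) and CY(r) can be sketched as follows. This is our own illustrative helper for the two-dimensional case; a filter that happens to span both dimensions is placed in CX here, an arbitrary choice.

```python
# Determine the crossing-filter sets CX(r) and CY(r) of a quadrant r:
# a filter crosses r if it completely spans at least one of r's dimensions.
# For each crossing filter, only the range in the other dimension is stored.

def crossing_sets(filters, r):
    cx, cy = [], []
    for fid, ((xlo, xhi), (ylo, yhi)) in filters:
        spans_x = xlo <= r[0][0] and xhi >= r[0][1]
        spans_y = ylo <= r[1][0] and yhi >= r[1][1]
        if spans_x:
            cx.append((fid, (ylo, yhi)))   # keep only the y-range
        elif spans_y:
            cy.append((fid, (xlo, xhi)))   # keep only the x-range
    return cx, cy

r = ((0, 8), (0, 8))
filters = [("f1", ((0, 8), (2, 4))),   # spans the full x-extent of r
           ("f2", ((3, 5), (0, 8))),   # spans the full y-extent of r
           ("f3", ((1, 2), (1, 2)))]   # crosses neither dimension
print(crossing_sets(filters, r))  # ([('f1', (2, 4))], [('f2', (3, 5))])
```

The highest-priority lookup at a node then reduces to two one-dimensional range searches over these stored ranges, as described above.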
By additionally utilizing the priority of rules, the number of tree levels can be reduced. A survey of packet classification techniques can be found in [38] and [100].

We have seen that a number of solutions perform a linear search over a bounded subset of filters as the final step, e.g., [104], [110], [105]. If the lists are kept short, this typically results in large savings in terms of data structure space, at only a small cost in terms of classification performance. We will propose a classification scheme based on R-trees that also follows this approach.

Currently, most vendors maintain their packet filters in Ternary Content Addressable Memories (TCAMs) [39]. For example, the Cisco Catalyst 6500 Series Switch and the Cisco 7600 Router maintain QoS and security policies in TCAMs which are accessed by application-specific integrated circuits (ASICs) [111] [112] [113]. However, the usefulness of TCAMs is limited by their high power consumption, inefficient representation of range match fields, and lack of flexibility and programmability. Juniper Networks uses an ASIC-driven memory approach to packet classification, with more conventional memory architectures such as Static RAM (SRAM) and Reduced Latency Dynamic RAM (RLDRAM) and sophisticated data structures [39].1

Although many algorithms and architectures have been proposed, the design of efficient packet classification systems remains a challenging problem. In the following we will describe the R-tree and how it supports packet classification.

1According to Juniper Networks, the 50 Gbps T-series Packet Forwarding Engine (PFE) is currently the newest, most flexible and highest performing PFE on the market [39].

Chapter 11

R-trees

11.1 The original R-tree

R-trees are hierarchical data structures based on B+-trees [114] and were introduced as an index for multidimensional information. The R-tree abstracts an object o by its minimum bounding d-dimensional rectangle (MBR). The leaves of the tree contain the MBRs of the objects as well as pointers to these objects. Each non-leaf node of the R-tree contains entries which store a pointer to a child node and the MBR that bounds all rectangles in that child node. An example of eight objects (o7 − o14) and a possible organization of these objects using six MBRs (R1 − R6) is shown in Figure 11.1. A corresponding R-tree is visualized in Figure 11.2. The space is split by hierarchically nested, possibly overlapping minimum bounding rectangles. Hence, an object may be contained in several MBRs, but it is associated with only one R-tree node. For example, object o8 is contained in R3 and R4, but is only stored in the leaf pointed to by R3.

Let m be the minimum and M the maximum allowed number of entries that each node can store and 2 ≤ m ≤ M/2. The R-tree of order (m, M) has the following characteristics:

• Each leaf node entry is of the form (mbr; oid), such that mbr is the MBR that spatially contains the object and oid is the object’s identifier.

• Each entry in an internal node is of the form (mbr; p), where p is a pointer to a child of the node and mbr is the MBR that spatially contains the rectangles in this child.

• The minimum allowed number of entries in the root node is two, unless it is a leaf. In this case, it may contain zero or a single entry.

• All leaves of the R-tree are at the same level.
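The entry forms listed above translate into a very small node layout. The following is an illustrative Python sketch with names of our choosing, not tied to any particular R-tree library.

```python
# Minimal node layout for an R-tree of order (m, M), mirroring the entry
# forms (mbr; oid) for leaves and (mbr; p) for internal nodes.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[Tuple[float, float], ...]   # one (lo, hi) interval per dimension

@dataclass
class Entry:
    mbr: Rect
    child: Optional["Node"] = None   # internal entry: pointer to a child node
    oid: Optional[int] = None        # leaf entry: object identifier

@dataclass
class Node:
    leaf: bool
    entries: List[Entry] = field(default_factory=list)

def mbr_union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle enclosing both a and b (used to maintain MBRs)."""
    return tuple((min(x[0], y[0]), max(x[1], y[1])) for x, y in zip(a, b))

print(mbr_union(((0, 1), (0, 1)), ((2, 3), (-1, 0))))  # ((0, 3), (-1, 1))
```

The `mbr_union` helper is the operation needed when covering rectangles on the path from a leaf to the root are adjusted after an insertion.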

Figure 11.1: An example of eight objects (o7 − o14) and a possible organization using six MBRs (R1 − R6).

Figure 11.2: A corresponding R-tree.

Let n be the number of objects. The maximum value for the height h is [115]:

h_max = ⌈log_m n⌉ − 1.

The maximum number of nodes can be derived by summing the maximum possible number of nodes per level. This maximum is attained when all nodes contain the minimum allowed number of entries, i.e., m. Therefore, the maximum number of nodes in an R-tree is equal to:

∑_{i=0}^{h} m^i
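As a sanity check, both bounds can be evaluated for small parameters; the helper names below are ours, and integer arithmetic is used to avoid floating-point rounding in the logarithm.

```python
# Spot-check the bounds above for m = 2 and n = 32:
# h_max = ceil(log_m n) - 1, and the node count is sum_{i=0}^{h} m^i.

def h_max(n, m):
    # smallest h with m**(h+1) >= n, i.e. ceil(log_m n) - 1, computed exactly
    h = 0
    while m ** (h + 1) < n:
        h += 1
    return h

def max_nodes(n, m):
    h = h_max(n, m)
    return sum(m ** i for i in range(h + 1))

print(h_max(32, 2))      # 4
print(max_nodes(32, 2))  # 1 + 2 + 4 + 8 + 16 = 31
```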

11.1.1 Query processing

The processing of a range (point) query commences at the root node of the tree. For each entry whose MBR intersects (contains) the query region (point), the process descends into the corresponding subtree. At the leaf level, for each bounding rectangle that intersects (contains) the query region, the corresponding object is examined. The algorithm that processes point queries in an R-tree is given in Algorithm 6. For a node entry e, e.mbr denotes the corresponding MBR and e.p the corresponding pointer to the next level. If the node is a leaf, then e.p denotes the corresponding object identifier (oid).

Algorithm 6 Point Query
1: procedure PointQuery(TypeNode N, TypePoint Q)
2:    if N is not a leaf node then
3:        examine each entry e of N to find those e.mbr that contain Q
4:        for each such entry e call PointQuery(e.p, Q)
5:    else
6:        examine all entries e and find those for which e.mbr contains Q
7:        add these entries to the answer set
8:    end if
9: end procedure

The MBRs may overlap each other, thus it cannot be guaranteed that only one search path is traversed during a point query. As we will see in subsection 12.1.2, the test whether an MBR contains a query point can be implemented efficiently. This is of practical interest to our benchmark, since this test determines the complexity of a point query and hence of packet classification.

11.1.2 Query optimization criteria

In the following, some of the parameters which are essential for the retrieval performance are considered [116].

• The area covered by an MBR should be minimized, i.e., the area covered by the bounding rectangle but not covered by the enclosed rectangles, the dead space, should be minimized. This will improve performance since decisions about which paths have to be traversed can be taken at higher levels.

• The overlap between MBRs should be minimized. This also decreases the number of paths to be traversed.

• The perimeter of an MBR should be minimized. Assuming a fixed area, the shape with the smallest perimeter is the square. Thus, by minimizing the perimeter instead of the area, the MBRs will be shaped more like squares. Especially queries with large quadratic query rectangles will profit from this optimization.

• Storage utilization should be optimized. Higher storage utilization will generally reduce the query cost, as the height of the tree will be kept low.

11.1.3 Updates

The R-tree is a dynamic structure, thus all approaches to optimizing the retrieval performance have to be applied during the insertion or deletion of an object. The insertion algorithm calls two further algorithms in which the crucial decisions for good retrieval performance are made. The first is the algorithm ChooseSubtree. Beginning at the root and descending to a leaf, it finds on every level the most suitable subtree to accommodate the new entry. The second is the algorithm Split. It is called if ChooseSubtree ends in a node filled with the maximum number of entries M. Split should distribute (M + 1) rectangles into two nodes in a way that makes it as unlikely as possible that both new nodes will need to be examined on subsequent searches.

11.2 R-tree variants

In all R-tree variants that have appeared in the literature, tree traversals for any kind of operation are executed in exactly the same way as in the original R-tree. Basically, the variants of the R-tree differ in how they choose the appropriate subtree and how they perform splits during insertion, by considering different minimization criteria [115].

Insertions of new objects are directed to leaf nodes. At each level, the most suitable subtree to accommodate the new entry has to be chosen. In the original R-tree as proposed by Guttman, this is the node that needs the least area enlargement to include the new object. Ties are resolved by choosing the entry with the rectangle of smallest area. Finally, the object is inserted into an existing leaf if there is adequate space; otherwise a split takes place. Since the decision whether to visit a node depends on whether its covering rectangle overlaps the search area, the total area of the two covering rectangles after a split should be minimized. Guttman discusses split algorithms with exponential, quadratic and linear cost with respect to the number of entries in a node. All of them are designed to minimize the area covered by the two rectangles resulting from the split. Then, covering rectangles on the path from the leaf to the root need to be adjusted, and node splits propagated as necessary. If node split propagation causes the root to split, a new root is created whose children are the two resulting nodes.

If a deletion of an entry causes a leaf to underflow, the leaf is eliminated and the remaining entries are reinserted. Node elimination must be propagated upwards, and all covering rectangles on the path to the root need to be adjusted, making them smaller if possible.

In the following we will have a look at four R-tree variants; for a comprehensive survey refer to [115] [117].
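Guttman's subtree choice (least area enlargement, ties broken by smallest area) can be sketched in a few lines; rectangles here are ((xlo, xhi), (ylo, yhi)) pairs and the function names are ours.

```python
# ChooseSubtree criterion of the original R-tree: pick the entry whose MBR
# needs the least area enlargement to include the new object; break ties by
# choosing the entry with the rectangle of smallest area.

def area(r):
    return (r[0][1] - r[0][0]) * (r[1][1] - r[1][0])

def enlarged(r, obj):
    # smallest rectangle enclosing both r and the new object's MBR
    return tuple((min(a[0], b[0]), max(a[1], b[1])) for a, b in zip(r, obj))

def choose_subtree(entries, obj):
    """entries: list of (mbr, child); returns the chosen (mbr, child) pair."""
    return min(entries,
               key=lambda e: (area(enlarged(e[0], obj)) - area(e[0]),  # enlargement
                              area(e[0])))                             # tie-break

entries = [(((0, 4), (0, 4)), "A"), (((3, 9), (3, 9)), "B")]
print(choose_subtree(entries, ((4, 5), (4, 5)))[1])  # B (no enlargement needed)
```

Applied per level while descending from the root, this directs each insertion to a single leaf, after which overflow handling (split, or reinsertion in the R*-tree) takes over.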

11.2.1 The R+-tree

The R+-tree was proposed as a structure that avoids visiting multiple paths during point queries [118]. To achieve this, R+-trees do not allow overlapping of MBRs at the same tree level. Therefore, inserted objects may have to be divided into two or more MBRs and stored in several nodes, which results in an increase in space consumption. Further, when a node with (M + 1) rectangles, where each rectangle encloses a smaller one, has to be split, the split procedure will fail.

11.2.2 The R*-tree

The new concepts incorporated in the R*-tree [116] are based on the minimization of the overlap between MBRs at the same level, the minimization of the perimeter of the produced MBRs, as well as the maximization of storage utilization. The R*-tree follows a sophisticated node split technique and uses the concept of forced reinsertion: dynamic updates of the structure may have introduced MBRs which are not suitable to guarantee good retrieval performance in the current situation. Therefore, the R*-tree forces entries to be reinserted during the insertion routine. If a node overflows, it is not split right away. Rather, p entries are removed from the node and reinserted into the tree. Hence, the first overflow treatment on each level will be a reinsertion of p entries. If it is not the first overflow on that level, the split procedure is invoked. Experiments have shown that p = 30% yields the best performance [116]. In summary, the R*-tree differs from the R-tree mainly in the insertion algorithm; deletion and searching remain essentially unchanged.

11.2.3 Compact R-trees

Huang, Lin and Lin proposed compact R-trees, a dynamic R-tree version which can achieve almost 100% storage utilization [119]. Among the (M + 1) entries of an overflowing node during insertion, a set of M entries is selected to remain in this node such that the resulting MBR is the minimum possible. The remaining entry is then inserted into a sibling that (i) has available space, and (ii) whose MBR is enlarged least. Thus the frequency of node splitting is reduced significantly. The range query performance is similar to that of the original R-tree.

11.2.4 cR-trees

Brakatsoulas, Pfoser and Theodoridis have relaxed the assumption that an overflowing node has to be split into exactly two nodes [120]. In particular, they rely on the k-means clustering algorithm and allow an overflowing node to be split into up to k nodes (k ≥ 2). Their benchmarks showed that the resulting index quality, the

retrieval performance and the insertion time are significantly better than those of R-trees (assuming quadratic split) and similar to those of R*-trees.

11.2.5 Static versions of R-trees

There are common applications that use static data. For instance, insertions and deletions in census, cartographic and environmental databases are rare. Here, the data is known in advance, and this fact is utilized in order to build a structure that supports queries as efficiently as possible. This method is well known in the literature as "packing" or "bulk loading". The Packed R-tree [121], proposed by Roussopoulos and Leifker in 1985, was the first packing algorithm, appearing soon after the proposal of the original R-tree. This first effort basically suggests ordering the objects according to some spatial criterion, e.g., according to ascending x-coordinates.

Arge et al. [122] propose the Priority R-tree, or PR-tree, which is the first R-tree variant that always answers a window query using O((n/B)^{1−1/d} + T/B) I/Os, where n is the number of d-dimensional (hyper-)rectangles stored in the R-tree, B is the disk block size, and T is the output size. This is provably asymptotically optimal and significantly better than other R-tree variants, where a query may visit all n/B leaves in the tree even when T = 0.

Chapter 12

Packet Classification using R-trees

Classifying an arriving packet is equivalent to finding the highest-priority rectangle among all rectangles that contain the point representing the packet. All filters have a priority attached and are located in the R-tree leaf nodes. Based on the values of the packet header, the algorithm follows the appropriate pointers to locate the target MBR(s), i.e., leaf node(s) in the decision tree, as described in Algorithm 6. Note that we do not need the refinement step, in which the object itself (not its MBR) is examined for containment, since in our case the objects are rectangles and hence identical to their MBRs. During the traversal we keep track of the highest-priority filter seen so far. After all potentially matching filters have been visited, the highest-priority filter is reported.

This search scheme has some parallels with the classification schemes of [104] [105] [41], cf. section 10.3. In these methods, a classification is performed by traversing a sophisticated data structure which yields not just one matching filter, but a short list over which a linear search is performed.

A requirement of recent network security applications, e.g., network intrusion detection systems, transparent monitoring and usage-based accounting, is that all matching filters are reported, not just the highest-priority filter. R-trees support this at no additional cost.
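The point-query traversal with priority tracking can be sketched as follows. This is an illustrative two-dimensional sketch with a hypothetical nested-dict node layout and the convention that a smaller number means higher priority; it is not the API of any particular R-tree implementation.

```python
# Packet classification by R-tree point query: visit every subtree whose MBR
# contains the packet, collect matching filters, and report the best one.

def contains(mbr, pkt):
    # one (lo, hi) interval per header dimension
    return all(lo <= x <= hi for (lo, hi), x in zip(mbr, pkt))

def classify(node, pkt, matches):
    for entry in node["entries"]:
        if contains(entry["mbr"], pkt):
            if node["leaf"]:
                matches.append((entry["prio"], entry["fid"]))
            else:
                classify(entry["child"], pkt, matches)
    return matches

leaf = {"leaf": True, "entries": [
    {"mbr": ((0, 9), (0, 9)), "prio": 2, "fid": "f1"},
    {"mbr": ((5, 9), (5, 9)), "prio": 1, "fid": "f2"},
]}
root = {"leaf": False, "entries": [{"mbr": ((0, 9), (0, 9)), "child": leaf}]}

hits = classify(root, (7, 7), [])
print(min(hits))   # the highest-priority match: (1, 'f2')
```

Since `hits` retains every matching filter, the all-matches requirement of monitoring and intrusion-detection applications is indeed satisfied without extra work.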

12.1 Performance evaluation

We measure the performance of a classification operation by the number of bytes inspected; see below for more details. Even though only a single path from the root to a leaf is traversed during a search, the problems inherent in R+-trees, cf. subsection 11.2.1, disqualify this structure from further investigation. The R*-tree is widely accepted in the literature as a high-performance structure among R-trees and their variants and is often used for performance comparisons [115]. In

our classification performance evaluation we will use the R*-tree and benchmark it with representative packet classification algorithms. The benchmark is performed on a Pentium 4 dual core 2.8 GHz machine. In our simulations we will use a main memory-based implementation of R*-trees.

12.1.1 Filter sets

Different filter sets with different structures and sizes tend to give very different results. The performance of an algorithm on "real" filter sets is the decisive factor in any realistic evaluation. For security and confidentiality reasons, real filter sets are hardly available. In response to this problem, Taylor and Turner developed ClassBench, a suite of tools for benchmarking packet classification algorithms [123] [98]. The ClassBench tools include a Filter Set Generator which produces synthetic filter sets that accurately model the characteristics of real filter sets. The tools suite also includes a Trace Generator that produces a sequence of packet headers to exercise the synthetic filter set. To implement these tools, Taylor and Turner analysed 12 real filter sets provided by Internet Service Providers (ISPs), a network equipment vendor, and other researchers working in the field. The filter sets range in size from 68 to 4557 entries and utilize one of the following formats [124]:

• Access Control List (ACL) - standard format for security, VPN, and NAT filters for firewalls and routers (enterprise, edge, and backbone)

• IP Chain (IPC) - decision tree format for security, VPN, and NAT filters for software-based systems

• Firewall (FW) - proprietary format for specifying security filters for firewalls

Their analysis provides invaluable insight into the structure of real filter sets. A repository [125] has been established to provide synthetic filter sets and trace files generated with ClassBench, as well as source code of representative classification algorithms. These synthetic sets are generated with the ClassBench tools suite using seed filter sets that are extracted from three real filter sets utilizing the three formats mentioned above. These real filter sets have the following characteristics [124]:

• acl1: In this filter set, fully specified source and destination addresses dominate the distribution. The destination port specification can be either an exact value (in most cases), a wildcard or an arbitrary range. All source ports are specified by a wildcard.

• fw1: The most common prefix pair is a fully specified destination address and a wildcard for the source address. The ports are specified either by a wildcard, an exact value, an arbitrary range or a HI range ([1023 : 65535]).

• ipc1: Fully specified source and destination addresses dominate the distribution, yet not as much as in acl1. Port specifications are as in the fw1 set, yet with different distributions.

The protocol is specified by a unique value or the wildcard; see [124] for detailed characteristics. The repository provides each type of synthetic filter set in sizes of 100, 1K, 5K and 10K filters, and a corresponding trace file for each of the filter sets. The size of a trace is about ten times that of the corresponding filter set.

12.1.2 Simulation results of the R*-tree

In this simulation we use the filter and trace files provided by [125]. All filters are five-dimensional with 32-bit source and destination IP addresses, 16-bit source and destination port numbers and an eight-bit protocol. For each filter set and corresponding trace file, we evaluate the R*-tree's performance. In each simulation, we iteratively insert the respective filters into an initially empty R*-tree. After all filters have been inserted, we use this tree for packet classification. Hence, once the R*-tree has been built, we use it in a static fashion. If the R*-tree proves to be suitable in a static classification scenario, it can further be investigated in a dynamic scenario, i.e., where classification is intermixed with filter updates. The following benchmark can thus be considered a stepping stone towards benchmarking R*-trees in a dynamic classification environment. The algorithms that it will be benchmarked against are both optimized for static scenarios.

In our simulations we measure the total memory requirement as well as the worst-case and the average number of bytes inspected per classification. To measure the total memory requirement, we sum the memory consumption of the contents of all nodes. A node mainly maintains its level information, its capacity M, the number of its children, its MBR and identifier, as well as its children's MBRs and IDs. To measure the number of bytes per classification, we sum the number of bytes that are read at each node visit. For efficient packet classification, that number must be kept to a minimum. At each node we have to examine each entry's MBR to determine which path(s) to descend. When examining an entry's MBR, we check each of the MBR's dimensions one after the other to see if it contains the packet to be classified, using the packet's coordinate in the respective dimension. This process is aborted as soon as the packet falls out of range in one of the five dimensions.
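The dimension-by-dimension containment test with early abort can be sketched as follows. The byte widths follow the 5-tuple field sizes given above (4+4 address bytes, 2+2 port bytes, 1 protocol byte); the exact accounting of two bounds per field is our illustrative assumption, not the thesis implementation.

```python
# Containment test that counts bytes read and aborts as soon as the packet
# falls outside one of the five ranges.

FIELD_BYTES = (4, 4, 2, 2, 1)   # src IP, dst IP, src port, dst port, protocol

def contains_counting(mbr, pkt):
    """Return (contained?, bytes_read); mbr holds one (lo, hi) pair per field."""
    bytes_read = 0
    for (lo, hi), x, width in zip(mbr, pkt, FIELD_BYTES):
        bytes_read += 2 * width        # read the field's lower and upper bound
        if not lo <= x <= hi:
            return False, bytes_read   # early abort: remaining fields unread
    return True, bytes_read

mbr = ((0, 10), (0, 10), (0, 100), (0, 100), (6, 6))
print(contains_counting(mbr, (5, 5, 80, 80, 6)))   # (True, 26)
print(contains_counting(mbr, (5, 20, 80, 80, 6)))  # (False, 16)
```

The early abort is what keeps the average bytes-per-classification figure well below the worst case reported in Figure 12.1.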
Our implementations are based on the R/R*-tree implementations by Hadjieleftheriou [126]. Figure 12.1 shows the R*-tree's simulation results for the ACL, FW and IPC classification types. The results show that the R*-tree is very space efficient: even in the case of 10K filters, the total memory consumption remains below 600 KB. In terms of classification performance, the results show

Total memory [KB]:

  # filters    100      1000     5000      10000
  ACL1         5.551    56.987   257.937   592.013
  IPC1         7.237    60.701   294.117   509.915
  FW1          5.411    46.235   258.391   541.003

Bytes per filter:

  # filters    100    1000   5000   10000
  ACL1         56.6   62.2   58.4   61.6
  IPC1         73.1   64.7   65.9   56.4
  FW1          58.8   58.5   55.5   58.1

Bytes per classification (worst case):

  # filters    100    1000   5000    10000
  ACL1         1348   7254   18717   31207
  IPC1         905    4453   14343   24478
  FW1          1418   4260   11980   16786

Bytes per classification (average case):

  # filters    100     1000     5000     10000
  ACL1         773.1   3802.2   9430     14886.1
  IPC1         493.9   2233     6297.2   10752.9
  FW1          888.9   2376.4   5080.4   6397.7

Figure 12.1: Performance evaluation of the R*-tree.

that the R*-tree scales best (in the number of filters) for the FW type.

We further tested the R-tree, utilizing quadratic split, and benchmarked it against the R*-tree. The R*-tree consistently demonstrated better performance for all three classification types. As already discussed, the R-tree is based solely on the area minimization of each MBR. The R*-tree, on the other hand, goes beyond this criterion and incorporates the minimization of the overlap between MBRs at the same level, as well as the minimization of the perimeter of the produced MBRs, which improves query processing performance. Hence, in the following, we only consider the R*-tree.

In order to investigate how well the R*-tree is suited for packet classification, we benchmark it against HyperCuts [105] and RFC [99]. RFC appears to be the fastest classification algorithm for static filter sets in the current literature. HiCuts [104] and its improved version HyperCuts are seminal techniques providing excellent tradeoffs. According to [127], HyperCuts is one of the most promising algorithmic solutions. For this benchmark we used the source code of HyperCuts and RFC provided by [125]. Along with the code, [125] further provides an evaluation of these packet classification algorithms, measuring the amount of bytes consumed per filter as well as the worst-case and average number of bytes read per classification.

HyperCuts involves some tradeoffs, heuristics, and optimizations. These tunable parameters have tremendous effects on HyperCuts' performance. It is important to isolate them and evaluate their behavior carefully in order to clarify their impact on the algorithm. [125] provides an evaluation only for a selected set of filter sets, namely the FW filter set of size 100, the IPC filter set of size 1K and the ACL set of size 10K. Yet, to obtain fair benchmark results, it is necessary to determine HyperCuts' optimal parameters for all of the given filter sets.
Therefore, we conduct a thorough performance evaluation of HyperCuts, which is presented in the following subsection.

12.1.3 Benchmark of R*-tree and HyperCuts

As we will see, the performance of HyperCuts is highly sensitive to its configurable parameters: the space factor, the bucket size and the filter push level. The space factor is used to bound the number of cuts on each chosen dimension. The bucket size determines the maximum number of filters allowed in a leaf node and is used to decide when to terminate the decision tree construction. A larger bucket size can help to reduce the size and depth of the decision tree, but induces a longer linear search time; a smaller bucket size has the opposite effect. HyperCuts pulls all common filters in a subtree up to a linear list at the root of the subtree. The filter push level restricts the number of tree levels that common filters are pulled upwards.

After a sequence of cuttings has been performed, the portion of a hypercube in a subregion might be fully covered by a hypercube with a higher priority. The corresponding filter at this decision tree node is redundant and can thus be removed to save storage. This is referred to as the filter overlap optimization. Our simulations have shown that the filter overlap optimization can reduce the storage to some extent but has no significant effect on the classification throughput. Hence, in our simulations, we enable the filter overlap optimization. In the following simulations, we measure the number of bytes consumed per filter as well as the worst-case number of bytes read per classification for the various filter sets.
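The bucket-size tradeoff described above can be illustrated with a toy decision-tree builder. This is a hypothetical sketch under simplified assumptions (one-dimensional filters, a single binary cut per node), not the HyperCuts reference code; the names `build` and `depth` and the cutting rule are illustrative only. Construction terminates once a node holds at most `bucket_size` filters, so a larger bucket yields a shallower tree at the cost of a longer linear scan per leaf.

```python
# Toy sketch of bucket-size-terminated decision-tree construction.
# Filters are hypothetical 1-D inclusive ranges (lo, hi).

def build(filters, lo, hi, bucket_size):
    # A node whose filter list fits into the bucket becomes a leaf,
    # searched linearly at classification time.
    if len(filters) <= bucket_size or lo == hi:
        return ("leaf", filters)
    mid = (lo + hi) // 2                       # one binary "cut"
    left = [f for f in filters if f[0] <= mid]   # filters intersecting left half
    right = [f for f in filters if f[1] > mid]   # filters intersecting right half
    return ("node", mid,
            build(left, lo, mid, bucket_size),
            build(right, mid + 1, hi, bucket_size))

def depth(t):
    return 1 if t[0] == "leaf" else 1 + max(depth(t[2]), depth(t[3]))

filters = [(i, i + 2) for i in range(0, 32, 2)]   # 16 toy filters
print(depth(build(filters, 0, 33, bucket_size=2)) >
      depth(build(filters, 0, 33, bucket_size=8)))  # True: smaller bucket, deeper tree
```

Note that spanning filters are replicated into both children, which is also why pulling common filters upwards (the filter push level) can save storage in the real algorithm.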

Filter set size 100

Figure 12.2 shows the HyperCuts performance evaluation results in terms of its sensitivity to (i) the space factor, (ii) the bucket size and (iii) the filter push level. The storage decreases when the bucket size increases. Generally, a larger bucket size means a worse lookup throughput, but this is not always the case. When the filter pushing optimization is disabled, i.e., the filter push level is set to zero, the storage use is very inefficient for the FW filter set, yet the throughput is the best. When the filter push level is increased, the storage efficiency is significantly improved; however, the throughput becomes much worse. For the IPC set, increasing the filter push level worsens classification performance,

[Figure 12.2 charts: bytes per filter and worst-case bytes per classification for the ACL_100, IPC_100 and FW_100 filter sets, plotted against (i) the space factor, (ii) the bucket size and (iii) the filter push level.]

Figure 12.2: HyperCuts performance evaluation for filter set size 100.

yet without affecting storage. In the case of ACL, the algorithm is insensitive to the number of filter push levels. For the FW filter set, 46 bytes per filter and 944 bytes per classification are a good tradeoff. The average number of bytes read per classification in this setting is 605. In comparison, R* needs 54 bytes per filter, 1400 bytes per classification in the worst case and 889 in the average case. For the IPC filter set, 37 / 220 (bytes consumed per filter / bytes inspected in the worst case) give a good tradeoff. R* consumes 72 bytes per filter and inspects 905 bytes per classification in the worst case.

In simulation (i), the bucket size was set to 10; the push level was set to 1 in case of FW, and zero for the ACL and IPC filter sets. In simulation (ii), the space factor was set to four; the push level was set to 1 in case of FW, and zero for the ACL and IPC filter sets. In simulation (iii), the bucket size was set to 10 in case of FW and IPC, and 16 in case of ACL; further, the space factor was set to two (FW) and four (IPC, ACL) respectively.

Filter set size 1K

Figure 12.3 shows the HyperCuts performance evaluation results for filter set sizes of 1K. For the FW filter set, 350 / 1912 give a good tradeoff. For best memory efficiency, i.e., 52 bytes per filter, classification takes 7796 bytes in the worst case and 4077 in the average case. In comparison, R* needs 58 bytes per filter and 4260 bytes per classification in the worst case and 2376 in the average case. Remark: for the FW type, the algorithm does not work for all parameter settings, cf. Figure 12.3, simulations (ii) and (iii).

For the IPC set, an increasing space factor yields better performance and worse memory consumption; the sensitivity to the bucket size is the reverse. The storage decreases monotonically when the bucket size increases. The algorithm is highly sensitive to the filter push level: when the filter push level is increased, the classification cost drastically increases for the IPC set. For the IPC set, 61 / 660 give a good tradeoff. In comparison, R* needs 64 bytes per filter and 4453 bytes per classification in the worst case.

In simulation (i), the bucket size was set to 24 (FW) and 16 (IPC, ACL) respectively. The push level was set to 1 in case of FW, and zero for the ACL and IPC filter sets. In simulations (ii) and (iii), the space factor was set to one (FW), two (IPC) and four (ACL). The push level was set to 1 in case of FW, and zero for the ACL and IPC filter sets. In simulation (iii), the bucket size was set to 24 in case of FW and IPC, and 16 in case of ACL.

Filter set size 5K

Figure 12.4 shows the HyperCuts performance evaluation results for filter set sizes of 5K. For the FW filter set, HyperCuts needs 7 MB in total for a worst-case performance of 3038 bytes per classification. In the case of good memory efficiency (51 bytes per filter), the algorithm needs twice as many bytes per classification in the worst case and 2563 on average. In comparison, R* needs 55 bytes per filter and inspects approximately 12000 bytes per classification in the worst case and 5080 bytes in the average case. For the IPC filter set, a classification performance of 668 bytes requires above 10 MB in total. For the IPC set, 187 / 1088 give a good tradeoff. R* consumes only 65 bytes per filter, but shows much worse classification performance. For the ACL set, 59 / 772 give a good tradeoff. R* is on par in terms of memory consumption, but has poor classification performance.

[Figure 12.3 charts: bytes per filter and worst-case bytes per classification for the ACL_1K, IPC_1K and FW_1K filter sets, plotted against (i) the space factor, (ii) the bucket size and (iii) the filter push level.]

Figure 12.3: HyperCuts performance evaluation for filter set size 1K.

If in simulation (iii) the space factor is set to four, the algorithm needs 65.3 bytes per filter and 14876 bytes per classification in the worst case for the IPC set. As can be seen, the parameters have to be chosen very carefully to achieve good performance results.

In simulation (i), the bucket size was set to 16 (FW) and 24 (IPC, ACL) respectively. The push level was set to 1 in case of FW, and zero for the ACL and IPC filter sets. In simulations (ii) and (iii), the space factor was set to two (FW, ACL) and one (IPC). The push level was set to 1 in case of FW, and zero for the ACL and IPC filter sets. In simulation (iii), the bucket size was set to 24 in case of FW and ACL, and 32 in case of IPC.

[Figure 12.4 charts: bytes per filter and worst-case bytes per classification for the ACL_5K, IPC_5K and FW_5K filter sets, plotted against (i) the space factor, (ii) the bucket size and (iii) the filter push level.]

Figure 12.4: HyperCuts performance evaluation for filter set size 5K.

Filter set size 10K

Figure 12.5 shows the HyperCuts performance evaluation results for filter set sizes of 10K. For the FW filter set, a space factor of one or two means large storage and good performance; a space factor of eight means low storage and worse performance. By increasing the bucket size, we can reduce storage while maintaining the performance. But still, 788 bytes per filter are needed for a worst-case classification

[Figure 12.5 charts: bytes per filter and worst-case bytes per classification for the ACL_10K, IPC_10K and FW_10K filter sets, plotted against (i) the space factor, (ii) the bucket size and (iii) the filter push level.]

Figure 12.5: HyperCuts performance evaluation for filter set size 10K.

performance of 3426 bytes. For high memory efficiency (48 bytes per filter), the algorithm inspects 10594 bytes per classification in the worst case and 5298 on average. In comparison, R* needs 54 bytes per filter, 16786 bytes per classification in the worst case and 6398 on average. For the FW set, the algorithm proves to be relatively insensitive to the number of push levels.

As can be seen, for the IPC filter sets, the algorithm is highly sensitive to the space factor. Choosing a bad space factor, the classification performance degrades by a factor of 15. Setting the push level to zero, a space factor of four results in high classification performance (640 bytes per classification), yet with high storage consumption. Increasing the bucket size greatly reduces storage with only a slight deterioration of performance. But still, above 1K bytes per filter are necessary. When the filter pushing optimization is disabled, i.e., the filter push level is set to zero, the storage use is very inefficient for the IPC filter sets, yet the throughput is the best. When the filter push level is increased, the storage efficiency is significantly improved; however, the throughput becomes much worse. Choosing a space factor of two, a push level of one and a bucket size of 16 seems to give the best tradeoff (214 bytes per filter, 1460 bytes per classification). With these parameters, HyperCuts needs almost four times more memory than R*, but has a factor of 17 better classification time. With high memory efficiency, HyperCuts needs 22876 bytes per classification and consumes 34 bytes per filter. In comparison, R* needs 56 bytes per filter and reads 24478 bytes per classification operation.

Using the ACL filter set, HyperCuts greatly outperforms the R*-tree in terms of worst-case bytes per classification. In terms of storage, the algorithms are on par.

In simulation (i), the bucket size was set to 24 (FW, ACL) and 16 (IPC) respectively.
The push level was set to 1 (FW, IPC) and zero for the ACL filter set. In simulations (ii) and (iii), the space factor was set to two (FW, ACL) and four (IPC). The push level was set to 1 in case of FW, and zero for the ACL and IPC filter sets. In simulation (iii), the bucket size was set to 40 (FW), 24 (IPC) and 16 (ACL).

Summary of results

The R*-tree has been shown to scale well to large filter sets in terms of memory consumption for all three classification types.

Choosing high memory efficiency for HyperCuts, R* and HyperCuts are on par in terms of memory consumption for the FW filter sets. In the case of 1K, R* even outperforms HyperCuts in terms of classification performance by a factor of approximately 2. For the remaining FW filter set sizes, HyperCuts is only about a factor of 1.5 to 2 better than R*, cf. Figure 12.6.

Choosing high memory efficiency for HyperCuts, both algorithms are on par in terms of memory consumption as well as classification performance for the IPC 10K set. Choosing good tradeoffs, HyperCuts needs up to four times more storage, but has better classification performance. HyperCuts greatly outperforms the R*-tree on the ACL filter sets in terms of worst-case bytes per classification. In terms of storage, the algorithms are on par.

12.1.4 Benchmark of R*-tree and RFC

Figure 12.7 shows the RFC performance evaluation results in terms of bytes per filter as well as the number of bytes per classification for the ACL, FW and IPC classification types. These results can also be found in the evaluation of [125]. The

[Figure 12.6 charts: bytes per filter and worst-case bytes per classification of R* and HyperCuts for the FW filter sets, plotted against the number of filters (100, 1000, 5000, 10000).]
Figure 12.6: Benchmark results for the FW filter sets. HyperCuts’ parameters tuned for (i) above: high classification performance, (ii) below: high memory efficiency.

RFC implementation is provided in a 3- and a 4-phase configuration. According to reported experimental results, there is a slight improvement in lookup throughput with a decreasing number of phases, but the storage can become worse by orders of magnitude. In our simulation, we use the 4-phase configuration. The lookup throughput gets slightly worse when the filter sets become larger, cf. Figure 12.7. The memory consumption does not necessarily get worse when the filter sets become larger, as in the case of the ACL filter sets. Yet, RFC shows severe scalability problems for the FW and IPC filter sets in terms of storage consumption. RFC uses approximately 4.8 MB for FW filter set sizes of 100 and 11 MB for sizes of 1K. For the IPC 5K set, RFC consumes 175 MB in total.

[Figure 12.7 charts: bytes per filter (logarithmic scale) and bytes per classification for the ACL, IPC and FW filter sets, plotted against the number of filters (100, 1000, 5000, 10000).]

Figure 12.7: RFC performance evaluation.

12.2 Conclusions and future directions

According to our benchmark results, R* is competitive with HyperCuts for static packet classification on FW filter sets.

Network state changes (e.g., link failures) along policy-based routes or dynamic topologies (ad hoc networks) are scenarios in which policies need to be updated. Most existing packet classification solutions do not support (fast) incremental updates. RFC's extensive precomputation precludes dynamic updates at high rates. HyperCuts' support for incremental updates is not specifically addressed. While it is conceivable that the data structure can support a moderate rate of randomized updates, it appears that an adversarial stream of updates can either create an arbitrarily deep decision tree or force a significant restructuring of the tree [38]. The preprocessing time can be taken as an indication of the time needed for reconstructing the structure. Precomputation can be defined as the process of transforming the representation of a filter database (i.e., the way in which the filters are expressed and stored) to represent the same data in a way more suitable to the classification procedure. Taking the FW 10K filter set as an example, our simulations measured a preprocessing time of 7.4 seconds on a 3 GHz Pentium-IV. When an insertion or deletion of a filter is triggered, a delay of several seconds is unacceptable.

A strength of the R*-tree is its support for incremental updates. Investigating the performance of R*-trees in a dynamic classification environment, i.e., where classification is intermixed with filter updates, would be an interesting field of future research. To our knowledge, there is no prior work that presents simulation results for a dynamic packet classification scenario. Gupta and McKeown [104] present experimental results of the average update time over 10000 random incremental updates, but not how these updates affect classification performance.

In static environments, the fact that the data is known in advance is used in order to build a structure that supports queries as efficiently as possible. This method is

well known in the literature as “packing” or “bulk loading”. Packed R-trees, e.g., the Priority R-tree as proposed by Arge et al. [122], might qualify as a structure to be applied in static packet classification environments.

We have seen that during classification, each entry of a node has to be checked as to whether it contains the packet to be classified. Further, more than one subtree under a node may have to be visited to locate the highest priority filter matching a packet. To speed up the search, all entries in a node could be queried in parallel. Furthermore, several branches of the tree can be searched in parallel. The total number of bytes that are inspected remains the same, but classification speed is thus increased. Therefore, the parallelism offered by hardware can be leveraged. A current generation Xilinx FPGA operates at over 550 MHz and contains over 10 Mb (1.25 MB) of embedded memory. The R*-tree has shown itself to be very space efficient; even in the case of 10K filters it requires less than 0.6 MB to store the filters. Additional research is required to evaluate the R*-tree's performance when implemented directly in hardware.

Part III

Conflict Detection and Resolution

Chapter 13

Introduction

Policy-based routing requires network routers to examine multiple fields of the packet header in order to categorize packets into "flows". Flow identification entails searching a table of predefined filters to identify the appropriate flow based on criteria including IP address, port, and protocol type. If the header of an arriving packet matches more than one filter, a tiebreaker determines the filter which is to be applied. A common tiebreaker is to select the highest priority filter among all matching filters. Hari et al. noticed that not every policy can be enforced by assigning priorities [7]. The authors suggest employing the most specific tiebreaker (MSTB). In one-dimensional prefix tables, the most specific filter is equivalent to the longest matching prefix. The most specific criterion is a special case of the highest priority criterion: a filter f1 is assigned a higher priority than a filter f2 if f1 is more specific than f2. A filter f1 is more specific than a filter f2 iff f1 ⊂ f2. The most specific tiebreaker is only feasible if for each packet p the most specific filter that applies to p is well defined. Otherwise, the filter set is said to be conflicting.

The conflict detection problem occurs in two variants: the offline and the online mode. Algorithms for the offline version are given a set of filters and report (and resolve) all conflicts. The online version, on the other hand, is a gradual build-up of a conflict-free set R, such that for every insertion and deletion of a range r, a check is made on the current status of R; if conflicts occur, the algorithms offer solutions to resolve them and maintain the conflict-free property of R.

In this part we propose a conflict detection and resolution algorithm for static one-dimensional range tables, i.e., where each filter is specified by an arbitrary range. We are motivated to study the one-dimensional case for the following reason.
Multi-dimensional classifiers typically have one or more fields that are arbitrary ranges. Since a solution for multi-dimensional conflict detection often builds on data structures for the one-dimensional case, it is beneficial to develop efficient solutions for one-dimensional range router tables. This line of research has been conducted in collaboration with Khaireel Mohamed, Thomas Ottmann and Amitava Datta. The "Slab-Detect" and naïve algorithms presented in this part were implemented by my colleague Khaireel Mohamed.

13.1 Organization of this part

In the following two sections we introduce the terminology we use and describe related work [128]. After presenting our conflict detection and resolution algorithm for one-dimensional range tables in section 14.1, we provide our experimental results of benchmarking the new solution against a naïve algorithm. Section 14.3 describes how the algorithm can be adapted to work under the highest-priority tiebreaking rule. Motivated by our main application, we consider a related problem in section 14.4: we show that by making use of partial persistence, the data structure can also support IP lookup.

13.2 Preliminaries

A one-dimensional filter applies to a packet if the range that represents the filter contains the point representing the packet. A point p is said to stab a range [u, v] if p ∈ [u, v]. A stabbing query reports all ranges that are stabbed by a given query point. Let R be a set of one-dimensional arbitrary ranges. The MSTB rule is only feasible if the set R is conflict-free.
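The stabbing query defined above can be sketched as a linear scan over inclusive integer ranges; this is an illustrative toy, not one of the efficient structures discussed in this thesis (which answer such queries in logarithmic time).

```python
# Stabbing query: report all ranges stabbed by query point p.
# Ranges are inclusive pairs (lo, hi); the names are illustrative.

def stab(R, p):
    return [r for r in R if r[0] <= p <= r[1]]

R = [(0, 10), (5, 15), (12, 20)]
print(stab(R, 7))    # [(0, 10), (5, 15)]
print(stab(R, 21))   # []
```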

Definition 2. The set R is conflict-free iff for each point p there is a unique range r ∈ R such that p ∈ r and, for all other ranges s ∈ R stabbed by p (i.e., p ∈ s), r ⊆ s.

If the set R of ranges is conflict-free then for each point p there is a well defined most specific filter in R which contains p.
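On a conflict-free set, the well-defined most specific filter can be retrieved naively by picking the shortest stabbed range (its minimality by inclusion implies minimality by length). A brute-force sketch over hypothetical inclusive integer ranges:

```python
# Most specific filter lookup on a conflict-free set (Definition 2):
# among all ranges stabbed by p, the unique minimal one is contained in
# all others, hence it is also the shortest.

def most_specific(R, p):
    stabbed = [r for r in R if r[0] <= p <= r[1]]
    return min(stabbed, key=lambda r: r[1] - r[0], default=None)

R = [(0, 20), (5, 15), (8, 12)]   # nested, hence conflict-free
print(most_specific(R, 10))       # (8, 12)
print(most_specific(R, 25))       # None: no filter applies
```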

Definition 3. Two filters r and s partially overlap if r ∩ s ≠ ∅ and r ∩ s ≠ r and r ∩ s ≠ s.

Definition 4. A set R of filters is nested if for any pair r, s ∈ R, either r ⊂ s or s ⊂ r.

Definition 5. A set R of ranges is called nonintersecting if for any two ranges r, s ∈ R either r ∩ s = ∅ or r ⊂ s or s ⊂ r.

In other words, R is nonintersecting if any two ranges are either disjoint or one is completely contained in the other. It is obvious that a set of nonintersecting ranges is always conflict-free, i.e., for each packet (query point p) the most specific range in R containing p is well defined. There may be, however, conflict-free sets of ranges which are not nonintersecting. Consider, e.g., a set R of three ranges {r, s, t}, where r and s partially overlap and t = r ∩ s.
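The example of a conflict-free but not nonintersecting set can be checked in code. This brute-force sketch tests Definition 2 and Definition 5 over hypothetical inclusive integer ranges by enumerating points in a small universe; it is illustrative only, not an efficient algorithm.

```python
from itertools import combinations

def conflict_free(R, universe):
    # Definition 2: every stabbed point must have a unique most specific
    # range, i.e., some stabbed range contained in all other stabbed ranges.
    for p in universe:
        stabbed = [r for r in R if r[0] <= p <= r[1]]
        if stabbed and not any(
                all(s[0] <= r[0] and r[1] <= s[1] for s in stabbed)
                for r in stabbed):
            return False
    return True

def nonintersecting(R):
    # Definition 5: any two ranges are disjoint or one contains the other.
    for a, b in combinations(R, 2):
        disjoint = max(a[0], b[0]) > min(a[1], b[1])
        a_in_b = b[0] <= a[0] and a[1] <= b[1]
        b_in_a = a[0] <= b[0] and b[1] <= a[1]
        if not (disjoint or a_in_b or b_in_a):
            return False
    return True

r, s = (0, 10), (5, 15)
t = (5, 10)                                   # t = r ∩ s
print(conflict_free([r, s, t], range(0, 16)))  # True: t resolves the overlap
print(nonintersecting([r, s, t]))              # False: r and s partially overlap
```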

Definition 6. Two ranges r, s ∈ R are in conflict with respect to R if r and s partially overlap and there is a point p such that p ∈ r ∩ s but there is no range t ∈ R such that p ∈ t and t ⊂ r and t ⊂ s.
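Definition 6 can be tested directly on small examples by enumerating the integer points of the overlap. The following brute-force sketch (illustrative only; the thesis develops far more efficient detection algorithms) uses hypothetical inclusive integer ranges:

```python
# Naive conflict test per Definition 6: r and s conflict w.r.t. R if they
# partially overlap and some point of r ∩ s is not covered by a range
# t ∈ R with t ⊂ r and t ⊂ s.

def proper_subset(t, r):
    return r[0] <= t[0] and t[1] <= r[1] and t != r

def partially_overlap(r, s):
    lo, hi = max(r[0], s[0]), min(r[1], s[1])
    return lo <= hi and (lo, hi) != r and (lo, hi) != s

def in_conflict(r, s, R):
    if not partially_overlap(r, s):
        return False
    lo, hi = max(r[0], s[0]), min(r[1], s[1])
    for p in range(lo, hi + 1):
        if not any(t[0] <= p <= t[1] and proper_subset(t, r)
                   and proper_subset(t, s) for t in R):
            return True    # p cannot be delegated to a more specific range
    return False

print(in_conflict((0, 10), (5, 15), [(0, 10), (5, 15)]))           # True
print(in_conflict((0, 10), (5, 15), [(0, 10), (5, 15), (5, 10)]))  # False
```

The second call shows that adding t = r ∩ s removes the conflict, matching the discussion of Definition 5.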

Definition 7. Let R = {r1, . . . , rn} be a set of n ranges. Then ∏(R) = r1 ∪ . . . ∪ rn (see [21]).

Lemma 1. If two ranges r, s ∈ R are in conflict then there is no subset S ⊆ R such that ∏(S) = r ∩ s.

Proof. Let r, s be two conflicting ranges. Consider the overlapping range r ∩ s and assume that there is a subset S ⊆ R such that ∏(S) = r ∩ s. Take t ∈ S and p ∈ t arbitrarily. Then r, s and t contain p, and t ⊂ r and t ⊂ s, contradicting the assumption that r and s are in conflict.

The reverse of the above lemma is also true. We show:

Lemma 2. If r, s ∈ R are conflict-free then either r ∩ s = ∅ or r ⊆ s or s ⊆ r or there is a subset S ⊆ R such that ∏(S) = r ∩ s.

Proof. It is sufficient to assume that r ∩ s ≠ ∅ and neither r ⊆ s nor s ⊆ r. Because r and s are conflict-free, there exists for each p ∈ r ∩ s a range tp ∈ R such that p ∈ tp and tp ⊂ r and tp ⊂ s. Choose S = {tp | p ∈ r ∩ s}. Then ∏(S) = r ∩ s.

With the MSTB rule in mind, we know for each point p ∈ r ∩ s, where r, s are two nonconflicting but partially overlapping ranges, that neither r nor s is the filter determining the routing of p. Intuitively speaking, we consider a pair of filters r and s as conflicting in R only if there are points in r ∩ s for which lookup cannot be transferred to more specific ranges.

Hari et al.'s [7] definition of a resolve filter for two-dimensional prefix filters naturally translates to one dimension.

Definition 8. Let r, s ∈ R be two conflicting ranges. Then we call the overlapping range r ∩ s the resolve filter for r and s with respect to R. We denote by resolve(R) the set obtained from R by adding a resolve filter for every pair of conflicting filters in R.
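Definition 8 suggests a direct, brute-force construction of resolve(R): intersect every conflicting pair. The sketch below (illustrative only, over hypothetical inclusive integer ranges; the thesis develops efficient algorithms for this) reuses the naive conflict test of Definition 6.

```python
# Naive construction of resolve(R): the set of intersections r ∩ s over
# all pairs r, s ∈ R that are in conflict with respect to R.
from itertools import combinations

def inter(r, s):
    lo, hi = max(r[0], s[0]), min(r[1], s[1])
    return (lo, hi) if lo <= hi else None

def in_conflict(r, s, R):
    i = inter(r, s)
    if i is None or i == r or i == s:      # disjoint or nested: no conflict
        return False
    def resolved(p):
        # some t ∈ R with p ∈ t, t ⊂ r and t ⊂ s?
        return any(r[0] <= t[0] and t[1] <= r[1] and t != r and
                   s[0] <= t[0] and t[1] <= s[1] and t != s and
                   t[0] <= p <= t[1] for t in R)
    return any(not resolved(p) for p in range(i[0], i[1] + 1))

def resolve(R):
    return {inter(r, s) for r, s in combinations(R, 2) if in_conflict(r, s, R)}

R = [(0, 6), (4, 10), (8, 14)]   # a chain of partial overlaps
print(sorted(resolve(R)))        # [(4, 6), (8, 10)]
```

By Theorem 2 below (stated in the source as Theorem 2), R ∪ resolve(R) is then conflict-free.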

Figure 13.1: Intervals and their relative position.

Theorem 2. Let R be a set of one-dimensional ranges. Then R ∪ resolve(R) is a conflict-free set of filters.

Proof. We prove this by induction [129].

Basis: The basis of the induction is a set of two ranges and the resolve filter induced by them. There are three cases: (i) the two ranges are disjoint, (ii) one of them encloses the other, and (iii) they partially overlap. There is no need to introduce a resolve filter in the first two cases, and hence the theorem holds. It is easy to see that the resolve filter introduced in the last case does not introduce any conflict.

Induction: Suppose we have a set of i − 1 ranges with their resolve filters already introduced, and this set is conflict-free. We have to prove that the resulting set is still conflict-free when we introduce a new range into this set, as well as the resolve filters for this new range. We refer to Fig. 13.1 for the proof. All the ranges existing before the introduction of the new range can be classified into five categories: (i) the ranges that left overlap the new range, (ii) the ranges that right overlap the new range, (iii) the ranges that enclose the new range, (iv) the ranges that are enclosed by the new range, and (v) the ranges that are disjoint from the new range. Note that the ranges in category (v) and their resolve filters cannot cause any conflict after the introduction of the new range, because none of these resolve filters overlap the newly introduced range. Similarly, the ranges in category (iv) and their resolve filters do not introduce any conflict, as these ranges and their resolve filters are nested within the newly introduced range. Hence, we have to concentrate on the ranges in the first three categories. However, the proof is almost identical for the ranges in categories (i) and (ii), and we will only consider the proof for ranges in category (i), i.e., the ranges that left overlap the new range, and category (iii), i.e., the ranges that enclose the new range. According to our definition of conflict, a left or right overlap does not necessarily cause a conflict. We only insert resolve filters when there is a conflict; otherwise, we only insert the new range. For each of these categories, we have to prove that (a) the old resolve filters do not conflict with the resolve filters introduced for the new range, and (b) the old resolve filters do not conflict with the new range.

Figure 13.2: Any two ranges in RF_i and RF_{i−1} are conflict-free.

If the new range overlaps with existing ranges but there is no conflict, then (a) is straightforward to prove, since in this case there are no newly inserted resolve filters. Hence, in the following, we will only consider ranges that overlap and conflict. We denote all the ranges in category (i) by L. Suppose we are adding the i-th range r_i, and its low and high end points are l_i and h_i respectively. Consider a range r_k ∈ L that left overlaps and conflicts with r_i. The low and high end points of r_k are l_k and h_k. We can sort all the ranges that left overlap r_i according to their right end points. After the introduction of r_i, we introduce a resolve filter for a range r_k ∈ L (as r_k conflicts with r_i), and this resolve filter is a new range [l_i, h_k]. Clearly, all these new resolve filters (we call this set RF_i) due to r_i are not in conflict, as they are nested.

Consider now the resolve filters (we call this set RF_{i−1}) that existed due to the ranges in L before the introduction of r_i. We have to check whether the resolve filters in RF_{i−1} are in conflict with the resolve filters in RF_i. We prove by contradiction that there is no such conflict. Assume that there is a resolve filter rf_jk ∈ RF_{i−1} such that rf_jk has a conflict with some resolve filter in RF_i. rf_jk was introduced for resolving the conflict of two ranges r_j and r_k before the introduction of r_i. Suppose rf_jk conflicts with rf_im ∈ RF_i. Clearly, the left end point of rf_jk is to the left of l_i, as both r_j and r_k left overlap r_i. The left end point of rf_im is l_i, as this resolve filter was introduced due to r_i. Consider now rf_im and the part of rf_jk to the right of l_i (the part of rf_jk to the left of l_i remains conflict-free after the introduction of r_i). See Fig. 13.2.
There are two cases, as shown in Fig. 13.2, depending on the relative positions of the end points of r_j, r_k and r_m. First, we consider Fig. 13.2(a). In this case, the right end point of rf_im is to the right of the right end point of rf_jk. Note that the right end point of rf_jk is either due to the right end point of r_j or due to the right end point of r_k (as rf_jk is a resolve filter for r_j and r_k). Suppose, without loss of generality, that the right end point of rf_jk is due to the right end point of r_j. Then we have already introduced a resolve filter that starts at l_i and ends at the high end point of r_j, since we added a resolve filter to resolve the conflict between r_j and r_i after we added r_i. This resolve filter for r_i and r_j is shown by the dashed line in Fig. 13.2(a). This resolve filter for r_i and r_j resolves the conflict between

Figure 13.3: A range that encloses ri cannot have any conflict due to the intro- duction of ri.

rfim and rfjk. Next, consider Fig. 13.2(b), i.e., where the right end point of rfim is to the left of the right end point of rfjk. Clearly, there is no conflict, as rfim is nested in rfjk. We can prove in a similar way that the resolve filters in RFi−1 do not conflict with the new range ri. Finally, we have to consider the ranges in category (iii), i.e., the ranges that enclose ri. Suppose rp is such a range. Clearly, ri is not in conflict with rp. Hence, (a) is straightforward to prove. However, we have to check whether any resolve filter due to rp is in conflict with ri. Suppose there is a range rl which has a conflict with rp and we have introduced a resolve filter earlier to resolve this conflict. Clearly, if rl ∩ ri = ∅ then their resolve filter does not conflict with ri. There are three possibilities left, as shown in Fig. 13.3. In the first case, the resolve filter due to rp and rl has a conflict with ri (Fig. 13.3(a)). However, in this case rl left overlaps ri and hence we have already introduced a resolve filter (shown by the dashed line in Fig. 13.3(a)) to resolve the conflict between rl and ri, which also resolves the conflict between ri and the resolve filter due to rp and rl. In the second case, rl right overlaps ri and the proof is similar. In the last case rl encloses ri and hence the resolve filter for rp and rl also encloses ri. This concludes the proof.

Figure 13.4 shows an example for a set R of one-dimensional filters: r and s are not conflicting, because for each point p ∈ r ∩ s there is a range t ∈ R such that p ∈ t and t ⊂ r and t ⊂ s. However, the pairs (a,b), (b,c), (c,d) are all conflicting pairs of filters. Hence, R is not conflict-free. Adding a ∩ b, b ∩ c, c ∩ d to R results in a conflict-free set of filters, i.e., R ∪ resolve(R) = R ∪ {a ∩ b, b ∩ c, c ∩ d} is conflict-free.

Figure 13.4: Filters r and s are conflict-free. However, the set as a whole is not conflict-free under MSTB.

Figure 13.5: Any pair ri, rj ∈ R, i ≠ j is conflicting with respect to R.

This definition of conflict implies that there may be sets of n one-dimensional ranges having O(n²) pairs of conflicting ranges. Figure 13.5 shows an example of a set R of n ranges. Here, any pair ri, rj ∈ R, i ≠ j is conflicting with respect to R. Resolve(R) = {ri ∩ rj | 1 ≤ i, j ≤ n, i ≠ j} contains n(n−1)/2 elements. Though R ∪ resolve(R) is conflict-free according to Theorem 2, it is not necessary to add a resolve filter for each pair of conflicting filters in R in order to make it conflict-free. We can show:
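The quadratic blow-up can be checked directly. The sketch below uses a hypothetical "staircase" family of ranges in the spirit of Figure 13.5 (the concrete endpoints are ours) and counts the partially overlapping pairs:

```python
# Hypothetical staircase construction: n ranges r_i = [i, n+i], chosen so
# that every pair partially overlaps (neither range contains the other).
n = 8
ranges = [(i, n + i) for i in range(n)]

def partially_overlap(r, s):
    """True iff r and s intersect but neither contains the other."""
    (a, b), (c, d) = r, s
    intersect = max(a, c) <= min(b, d)
    r_contains_s = a <= c and d <= b
    s_contains_r = c <= a and b <= d
    return intersect and not r_contains_s and not s_contains_r

pairs = [(r, s) for i, r in enumerate(ranges)
         for s in ranges[i + 1:] if partially_overlap(r, s)]
assert len(pairs) == n * (n - 1) // 2  # every pair conflicts: n(n-1)/2 pairs
```

Since every pair conflicts, adding one resolve filter per conflicting pair would add n(n−1)/2 filters, which motivates the linear-size construction of Lemma 3 below.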

Lemma 3. For every set R of n one-dimensional ranges, there is a set S of O(n) one-dimensional ranges s.t. R ∪ S is conflict-free.

Proof. Let R = {r1, . . . , rn}. These ranges partition the universe U into at most 2n−1 consecutive slabs defined by the endpoints of the ranges. Let ep0, ep1, . . . , epk be the boundaries of these slabs. Let σ = {[epi, epi+1], 0 ≤ i < k}. Then R ∪ σ is obviously conflict-free. Hence, the trivial solution to make a set conflict-free is to add a “slab-resolve” filter for each of the slabs. Yet, this solution possibly adds unnecessary filters since not every slab may require a resolve filter, e.g., when the set is already conflict-free. Therefore, our goal is to add only those slab-resolve filters that are needed to make the set conflict-free. After presenting related work, we will propose our output-sensitive offline conflict detection and resolution algorithm for one-dimensional range tables.
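The trivial construction from the proof of Lemma 3 can be sketched as follows; the function name and example endpoints are ours, not from the thesis:

```python
def slab_resolves(ranges):
    """Emit one slab filter per slab induced by the range endpoints.
    Adding all of them trivially makes any set conflict-free, at the
    price of possibly unnecessary filters (cf. the proof of Lemma 3)."""
    eps = sorted({e for r in ranges for e in r})  # slab boundaries
    return [(eps[i], eps[i + 1]) for i in range(len(eps) - 1)]

R = [(0, 10), (5, 15), (8, 20)]
S = slab_resolves(R)
# n ranges induce at most 2n - 1 slabs, hence at most 2n - 1 slab filters
assert len(S) <= 2 * len(R) - 1
```

The output-sensitive algorithm of Chapter 14 improves on this by emitting a slab filter only for the slabs that actually need one.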

13.3 Related work

13.3.1 Online conflict detection and resolution

For a given set R of n one-dimensional nonintersecting ranges under MSTB, the deletion of intervals maintains the conflict-free property of R. Hence, only the

(a) (b)

Figure 13.6: Examples of partially overlapping ranges.

insertion of a new interval may become critical for the online variant of the problem. In order to determine whether a new range r = [u, v] partially overlaps with any of the ranges in R, two conditions have to be checked:

1. ∃s = [x, y] ∈ R : x < u ≤ y < v (s left-overlaps r, cf. Figure 13.6(a))

2. ∃s = [x, y] ∈ R : u < x ≤ v < y (s right-overlaps r, cf. Figure 13.6(b))

Lu and Sahni [21] maintain two priority search trees (PSTs) to detect all conflicts. However, the second PST is maintained exclusively for the detection of right-overlaps (IP lookup is performed on the first PST). The information contained in the second PST is redundant. The overall insertion time is in O(log n). However, the actual time for insertion (and deletion), as well as the space requirements of the overall approach, are increased by roughly a factor of two. Lauer et al. [46] show that such an additional structure is not required and that the verification of condition (2) can also be achieved in time O(log n) by a single query on the original structure (plus one comparison operation).
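The two conditions translate directly into predicates. A minimal sketch with hypothetical endpoints (the helper names are ours):

```python
def left_overlaps(s, r):
    """s = [x, y] left-overlaps r = [u, v] iff x < u <= y < v (condition 1)."""
    (x, y), (u, v) = s, r
    return x < u <= y < v

def right_overlaps(s, r):
    """s = [x, y] right-overlaps r = [u, v] iff u < x <= v < y (condition 2)."""
    (x, y), (u, v) = s, r
    return u < x <= v < y

# Hypothetical endpoints mirroring Figure 13.6:
assert left_overlaps((0, 5), (3, 9))      # s ends inside r
assert right_overlaps((6, 12), (3, 9))    # s starts inside r
assert not left_overlaps((0, 2), (3, 9))  # disjoint ranges do not overlap
```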

If R is made up of arbitrary one-dimensional ranges under MSTB, then both the insertion and the deletion of an interval may lead to conflicts in the resulting set. In the case of an insertion, the new interval may left- or right-overlap one or more intervals in R, such that the overlapping range has no resolving subset. Similarly, a conflict may arise if we remove t = r ∩ s from R, because t ∈ R is an interval that resolves a conflict between r, s ∈ R, see Figure 13.7.

Figure 13.7: Removing t will lead to a conflict between r and s.

This online version of the problem is solved (in a rather complex way) by Lu and Sahni [21].

Hari et al. [7] introduced the notion of filter conflict under the most specific tiebreaking rule. Hari et al.'s motivation for applying the MSTB is that the scheme for resolving ambiguities in classification that is based on prioritizing the filters (and then choosing the filter with highest priority) is not able to enforce every policy. Let each filter f be a 2-tuple (f[1], f[2]), where each field f[i] is a prefix bit string. Assuming that the most specific tiebreaker is applied, two filters f1 and f2 have a conflict iff [7]

1. f2[1] is a prefix of f1[1] and f1[2] is a prefix of f2[2] or

2. f1[1] is a prefix of f2[1] and f2[2] is a prefix of f1[2]

Figure 13.8 provides an example of two conflicting filters. For packets falling in the overlap region, the most specific filter is not defined.

Figure 13.8: Filters f1 and f2 are in conflict.
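The two conditions above can be tested directly on prefix bit strings. In the sketch below we use proper (strict) prefixes — our reading of the definition — so that identical or nested filters are not flagged; the filters and helper names are illustrative:

```python
def proper_prefix(p, q):
    """True iff bit string p is a proper prefix of bit string q."""
    return q.startswith(p) and p != q

def conflict(f1, f2):
    """Hari et al.'s crosswise conditions for two 2-tuple prefix filters
    (0-indexed fields here): each filter refines the other in one field."""
    return ((proper_prefix(f2[0], f1[0]) and proper_prefix(f1[1], f2[1])) or
            (proper_prefix(f1[0], f2[0]) and proper_prefix(f2[1], f1[1])))

# Crosswise refinement -> conflict (cf. Figure 13.8):
assert conflict(("00", "1"), ("0", "10"))
# Nested filters -> no conflict:
assert not conflict(("00", "11"), ("0", "1"))
```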

Hari et al. propose a new scheme for conflict resolution, based on the idea of adding a resolve filter f1 ∩ f2 for each pair f1, f2 of conflicting filters [7]. This guarantees that the most specific tiebreaker can be employed. For packets falling in the overlap region, the resolve filter determines the action that is to be applied. Note that this definition of conflict disregards the fact that the overlap region o may already be exactly covered by another filter or by a set of filters whose union equals o; hence this approach may introduce unnecessary resolve filters. In order to detect all conflicts between a new filter and a given set of filters, they utilize two complementary data structures, one for each of the two cases listed above. When a new filter is to be inserted, we search the first structure and report all filters that satisfy condition (1). All filters satisfying condition (2) can be found by searching the second structure. The algorithm adds resolve filters for each pair of conflicting filters. Conflict detection takes time O(w²), where w is the width of each field. It is possible to reduce this to O(w) using switch pointers. The drawback is that the involved precomputation raises the filter update time to O(n), where n is the number of current filters. The authors further extend their algorithm to three dimensions, where the protocol field is restricted to be either TCP, UDP or wildcard. In this case, the time for conflict detection as well as the algorithm itself remains unchanged. The only overhead is the threefold increase in memory for filters with wildcard protocol field. They further extend their algorithm

to five dimensions, in which case the source and destination ports are restricted to be either fully specified or wildcard.

Al-Shaer and Hamed developed the Policy Anomaly Detector, a set of tools to manage filtering policies [130]. It discovers conflicting filters and automatically determines the proper order for any inserted or modified filter. It further provides a natural-language translation of low-level filters.

13.3.2 Offline conflict detection and resolution

Lu and Sahni [131] consider the two-dimensional offline problem under MSTB for sets of prefix filters. Two filters f1, f2 ∈ F conflict iff an edge of f1 perfectly crosses an edge of f2; two edges perfectly cross iff they cross and their crossing point is not an endpoint. Note that this definition of conflict is based on the definition by Hari et al. [7]. In other words, a proper edge intersection between two ranges in F is a direct cause for conflict. This implies that all conflicts in F can be detected by computing all properly intersecting pairs of filters in F. This can be done by a slight modification of the classical sweepline algorithm for reporting all intersections in a set of n iso-oriented line segments. Lu and Sahni [131] further discuss the problem of resolving conflicts by adding a set of resolve filters to F. For each conflicting prefix pair f1, f2 ∈ F, a new resolve filter h = f1 ∩ f2 is added to F, cf. Figure 13.9. If a resolve filter h is already in the original set of filters F, or if the original

Figure 13.9: Conflict-free set of prefix filters, colored regions represent the set of resolve filters.

set contains a set of filters whose union equals h, then we can avoid adding h to resolve(F). Therefore, the authors introduce the notion of an essential resolve filter. A filter h ∈ resolve(F) is an essential resolve filter iff F ∪ resolve(F) − {h} has no subset whose union equals h. It takes O(n log n + s) time to determine resolve(F), where n is the number of filters in F and s the number of resolve filters, and an additional O((n + s)w) time, where w is the length of the longest prefix, to identify the set of essential resolve

filters.

A set of prioritized rectangles has a conflict if there exists a query point p such that there is no unique maximum-priority rectangle containing p. The only known result for conflict detection for prioritized ranges (not only prefix ranges) in two dimensions is the algorithm proposed by Eppstein and Muthukrishnan [132]. Their algorithm uses a technique related to an algorithm by Overmars and Yap [133] devised for solving Klee's measure problem: Given a collection of n d-dimensional rectangles, compute the measure of their union. Overmars and Yap proposed an O(n^{d/2} log n) time and O(n√n) space sweepline algorithm for d ≥ 3. In three dimensions, their algorithm uses a generalization of a 2-d-tree, a two-dimensional orthogonal partition tree, which defines a subdivision of the plane into rectangular cells. In two dimensions, the partition created has the properties that there are O(n) cells, that no rectangle is contained in more than O(√n) cells, and that no more than O(√n) rectangles need to be examined within each cell. The cells are stored in the leaves of the partition tree. Each cell has the form of a trellis, not containing vertices in its interior, cf. Figure 13.10. The measure of each trellis

Figure 13.10: A trellis.

can be easily computed. The overall measure is computed using the information that is maintained in the partition tree under insertions and deletions of rectangles during the sweep. Eppstein and Muthukrishnan construct a 2-d-tree of the rectangle vertices to divide the plane into rectangular cells not containing any rectangle vertex. Then, they perform a depth-first traversal of the tree. However, this scheme only yields a yes or no answer to the offline version of the conflict detection problem and does not report and resolve the conflicts. Their algorithm runs in time O(n^{3/2}) and uses linear space.

Chapter 14

Detecting and Resolving Conflicts

In the following we present our offline conflict detection and resolution algorithm for one-dimensional range tables [129, 134]. The algorithm is based on the sweepline technique, achieves a worst-case time complexity of O(n log n), and uses O(n) space, where n is the number of filters in the set. The algorithm is output-sensitive in the sense that it reports only essential resolve filters.

14.1 The output-sensitive solution to the one-dimensional offline problem

Let R be a set of n one-dimensional arbitrary range filters under MSTB. The left and right endpoints of the filters in R partition the discrete universe into at most 2n − 1 slabs. Each distinct endpoint makes up the boundary between two slabs and is placed in a linearly ordered set to form the event points of the sweepline paradigm.

Definition 9. A slab σi is a single partition of the set R containing a range of points from the discrete universe between two event points epi and epi+1, where σi = [epi, epi+1). All points p ∈ σi stab the same subset Si ⊆ R and are represented collectively by epi.

Definition 10. Slab σi is conflict-free under MSTB iff there is a shortest filter r ∈ Si that contains epi, such that r is contained in all other filters in Si ⊆ R stabbed by epi.

If the shortest (most specific) filter r contains epi, then it contains all points p in slab σi. Thus, we detect a conflict situation at epi if the shortest filter r is not contained in at least one s = [s.lo, s.hi] ∈ Si.

Definition 11. Slab σi is non-conflict-free under MSTB iff the shortest filter r is not contained in at least one s ∈ Si.

Lemma 4. Let σi be a non-conflict-free slab. Then σi requires only a single “slab-resolve” filter hσi = [epi, epi+1) to make it conflict-free.

Proof. If σi is not conflict-free, then the shortest filter r stabbed by epi is not contained in all other filters in Si. Let hσi be a resolve filter that spans σi. Then, for the same epi, ||hσi|| < ||r||, so that hσi is now the shortest filter that contains epi and is contained in all filters in Si (including r).

Definition 12. A filter r = [r.lo, r.hi] left-overlaps filter s = [s.lo, s.hi] iff r.lo < s.lo ≤ r.hi < s.hi.

Definition 13. A filter r right-overlaps s iff s.lo < r.lo ≤ s.hi < r.hi.

Corollary 1. Slab σi requires hσi if ∃s ∈ Si that left-overlaps r and s.hi > epi.

Corollary 2. Slab σi requires hσi if ∃s ∈ Si that right-overlaps r and s.lo ≤ epi.

Theorem 3. Let SlabResolve(R) be the set obtained from R by adding a slab-resolve hσi for every non-conflict-free slab σi. Then R ∪ SlabResolve(R) is conflict-free.

Proof. After adding a slab-resolve hσi for every non-conflict-free slab σi, there exists a most specific filter in every single slab. Hence, by Definition 2, the set R ∪ SlabResolve(R) is conflict-free.

Therefore, it is sufficient to (i) determine the smallest filter r among all filters in Si stabbed by epi, and then (ii) check that r is contained in all filters in Si to deduce that slab σi is conflict-free.
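Steps (i) and (ii) amount to the following per-slab check; this is a sketch with our own naming, treating filters as closed ranges:

```python
def slab_conflict_free(stabbed):
    """stabbed: the filters S_i containing event point ep_i.
    The slab is conflict-free iff the shortest (most specific) filter
    is contained in all other stabbed filters (Definition 10)."""
    if not stabbed:
        return True
    r = min(stabbed, key=lambda f: f[1] - f[0])  # smallest filter
    return all(s[0] <= r[0] and r[1] <= s[1] for s in stabbed)

# Nested filters: conflict-free.
assert slab_conflict_free([(0, 20), (5, 15), (8, 12)])
# Partial overlap: the shortest filter (8, 25) is not inside (0, 20).
assert not slab_conflict_free([(0, 20), (8, 25)])
```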

14.1.1 Status structures

A filter s ∈ R belongs to the status structure T at epi iff epi ∈ [s.lo, s.hi), and all such filters are stored in a collective structure ordLen(T), which orders the filters according to ascending lengths. T can be split into two distinct subsets, so that T = new(T) ∪ current(T), where s ∈ new(T) if s.lo = epi; otherwise, s ∈ current(T).

The filters in new(T) are ordered in a single structure by ascending lengths. The filters in the generic set current(T), on the other hand, are maintained in two separate red-black trees: (i) curR(T), which orders all s by ascending hi-endpoints, and (ii) curL(T), which orders all s by descending lo-endpoints. Furthermore, if current(T) ≠ ∅, then let lhp and hlp be pointers to the lowest of all hi-endpoints and the highest of all lo-endpoints in current(T). Respectively, lhp and hlp point to the top of the ordered lists in curR(T) and curL(T).

14.1.2 Handling event points

At each event point epi, we take the shortest filter r ∈ T from the top of the list in ordLen(T) and check whether r ∈ new(T) or r ∈ current(T). There are two cases to consider:

Case I: The shortest filter r ∈ new(T).

Case II: The shortest filter r ∈ current(T).

Theorem 4 (Case I). If r ∈ new(T ) and current(T ) = ∅, then slab σi is conflict- free.

Proof. At epi all s ∈ new(T ) have the same lo-endpoints, and since r ∈ new(T ), r is completely contained in all s ∈ new(T ).

Theorem 5 (Case I). If r ∈ new(T) and lhp ≥ r.hi, then slab σi is conflict-free.

Proof. From Theorem 4, r cannot conflict with any s ∈ new(T). Also, by Definition 9, slab σi cannot span any longer than ||r||¹, which makes it unnecessary to consider the case lhp ≥ r.hi in slab σi.

Theorem 6 (Case I). If r ∈ new(T) and lhp < r.hi, then there exists at least one filter s ∈ current(T) that left-overlaps r, so that slab σi is no longer conflict-free.

Proof. Straightforward from Corollary 1.

Theorem 7 (Case II). If r ∈ current(T) and new(T) ≠ ∅, then every s ∈ new(T) conflicts with r, so that slab σi is no longer conflict-free.

Proof. Straightforward from Corollary 2; all s ∈ new(T) right-overlap r.

Theorem 8 (Case II). If r ∈ current(T) and new(T) = ∅, then slab σi is conflict-free iff hlp ≤ r.lo and r.hi ≤ lhp.

Proof. It follows from Definition 10 that r is contained in all s ∈ current(T) = T.

¹ However, ||σi|| can be shorter than ||r|| if ∃epi+1 that comes before r.hi.

14.1.3 The sweepline environment

We conceptually extend all intervals by half the distance between points in the discrete raster U, such that all stabbing queries become queries for points falling into the interior of the intervals. Figure 14.1 shows this transformation on filters r, s, and t: for any range filter x there is a mapping f : [x.lo, x.hi] ↦ [x.lo − 0.5, x.hi + 0.5]. This is important for the Slab-Detect algorithm in order to detect all conflicts. For example, the point epm = 19 in Figure 14.1(a) stabs both filters r and s. If the filters are not mapped as above, then only s exists in the status structure at epm, and the conflict between r and s is not detected. Also, we would miss t completely at epn = 21. Figure 14.1(b) shows the solution to these problems; as a consequence, all event points epi lie on the 0.5 mark.

Figure 14.1: Extending each filter in (a) by half the size of the distance of points so as not to miss a crucial event point during the sweep in Slab-Detect.
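The half-step extension can be sketched as follows; the endpoints of r, s and t below are hypothetical, chosen only to mirror the situation in Figure 14.1:

```python
# Hypothetical endpoints mirroring Figure 14.1: r ends at 19, s begins
# at 19, and t is the degenerate single-point range [21, 21].
r, s, t = (10, 19), (19, 20), (21, 21)

def extend(f):
    """Map [lo, hi] to [lo - 0.5, hi + 0.5] so that every stabbing query
    falls into the interior of the extended interval."""
    lo, hi = f
    return (lo - 0.5, hi + 0.5)

def stabbed(filters, ep):
    """Filters whose extended interval contains event point ep."""
    return [f for f in filters if extend(f)[0] < ep < extend(f)[1]]

# Without the extension, a half-open status structure [lo, hi) would drop
# r exactly at 19 and would never hold t at all; with it, both are seen.
assert stabbed([r, s, t], 19) == [r, s]
assert stabbed([r, s, t], 21) == [t]
```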

14.1.4 Running Slab-Detect

Figure 14.2 illustrates the Slab-Detect algorithm on a given set R = {r, s, t, u, v, w}, where R partitions the discrete universe into 11 slabs, separated by ten distinct event points. Sweeping from left to right, Slab-Detect maintains the status structure T = new(T) ∪ current(T) at every event point epi, determines the shortest filter r∗ and notes which subset of T it comes from, and then reports whether or not slab σi requires a resolve filter. In the given example, Slab-Detect finds that the non-conflict-free slabs σ3, σ6, σ7, and σ8 require resolve filters.
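A much-simplified, quadratic sketch of the sweep conveys the reported output. The thesis version maintains the balanced status structures of Section 14.1.1 to reach O(n log n); here, slab membership is determined naively via an interior sample point, which sidesteps the half-step extension:

```python
def slab_detect(ranges):
    """Simplified Slab-Detect sketch: enumerate the slabs induced by the
    sorted endpoints and emit one slab-resolve filter per non-conflict-free
    slab (a slab is non-conflict-free iff its shortest stabbed filter is
    not contained in all other stabbed filters)."""
    eps = sorted({e for r in ranges for e in r})
    resolves = []
    for lo, hi in zip(eps, eps[1:]):
        mid = (lo + hi) / 2.0  # any interior point represents the slab
        stabbed = [r for r in ranges if r[0] <= mid <= r[1]]
        if not stabbed:
            continue
        shortest = min(stabbed, key=lambda f: f[1] - f[0])
        if not all(s[0] <= shortest[0] and shortest[1] <= s[1] for s in stabbed):
            resolves.append((lo, hi))
    return resolves

assert slab_detect([(0, 20), (5, 15)]) == []         # nested: conflict-free
assert slab_detect([(0, 10), (5, 15)]) == [(5, 10)]  # overlap needs a resolve
```

Only the slabs that actually fail the containment check produce output, which is the output-sensitivity property discussed below.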

Corollary 3. Let g and h be two adjacent resolve filters reported by the Slab-Detect algorithm that span two consecutive slabs σi and σi+1. Then g and h cannot be merged.

Proof. All adjacent event points on the sweepline contain a unique set of filters in new(T) and current(T); that is, no two consecutive event points epi and epi+1 have the same elements in the status structure T. Thus it follows that the conditions

Figure 14.2: Running from left to right, the Slab-Detect algorithm reports four resolves required for the non-conflict-free slabs σ3, σ6, σ7, and σ8.

that make the filters in slab σi conflict are different from the conflict conditions in σi+1. Therefore, g and h cannot be merged.

Slab-Detect does not report duplicate resolve filters. Figure 14.3 shows an example.

Figure 14.3: Slab-Detect: duplicate resolve filters are not reported.

Further, as mentioned in section 13.3, Lu and Sahni [131] introduced the notion of an essential resolve filter. Their solution reports all resolve filters at first, and then, in a second step, identifies the essential resolve filters. Slab-Detect reports the essential resolve filters ab initio.

14.2 Experimental results

The input data, i.e., the ranges, are randomly generated depending on two main parameters: the total number n of filters to generate, and the bit-length w of a filter field. An arbitrary range filter r = [r.lo, r.hi] is formed using the function random(i, j), which generates a random integer that is normally distributed between i and j, as follows: r.lo = random(0, 2^w − 1), r.hi = random(r.lo, 2^w − 1). We add variation into the generated set R by introducing an additional parameter percentPrefix, which determines the percentage of one-dimensional prefix filters in R. That is, by setting percentPrefix = 1.0, we

generate n non-conflicting prefix filters in R. By varying this parameter between 0.0 and 1.0, we can indirectly control the number of conflicting pairs of filters generated by the base set R, and in doing so, test the Slab-Detect algorithm on its output-sensitivity. A prefix filter is described by f = [b/prefixLen], where b is a random bit pattern and prefixLen is the number of prefix bits to retain in b. We generate it as follows: b = random(0, 2^w − 1), prefixLen = random(0, w − 1).

All of our simulations are performed on a Pentium IV 3 GHz machine with 2 GBytes RAM running Java 5.0. We generate arbitrary range filters for w = 128 with various percentages of prefix filters in steps of 10% increments, for sample sizes |R| = 5K, 10K, 20K, 30K, 40K, and 50K. At each epoch of one sample size |R| and one preset value of percentPrefix, we note the total runtime, the resolve filters reported, and the memory consumption. We repeat each epoch 20 times, each time generating a new set of random samples for R, and then calculate the averages of the noted values. We benchmark Slab-Detect against a naïve algorithm which reports resolve filters for all pairwise conflicting filters following Definition 8; we term this set “resolve(R)”.

Figure 14.4 shows our simulation results for |R| = 5K and w = 128. We see in Figure 14.4(a) that the number of reported slab-resolves, which we collectively term essential(R), is never more than the number of slabs generated by the filters in R. Also, the total runtime for Slab-Detect decreases as the number of essential(R) decreases, which is a consequence of increasing the percentage of prefix filters in the sample R. This shows that Slab-Detect is indeed output-sensitive, as the amount of time taken to report all necessary resolve filters in the set R is proportional to the number of non-conflict-free slabs in R.
In contrast, the total runtime of the naïve algorithm is unaffected by the number of conflicting pairs within the sample R, as shown in Figure 14.4(b). From both these figures, we see that Slab-Detect reports far fewer resolve filters than its naïve opponent. Further, the runtime of Slab-Detect is up to one order of magnitude faster than that of the naïve algorithm. Note that the time reported in Figure 14.4(b) is the time taken by the naïve algorithm to report the raw set resolve(R); it takes additional time to remove all unnecessary filters in resolve(R). Slab-Detect, however, does not report duplicate resolve filters ab initio. In Figure 14.4(c), we report the average and maximum memory requirements of Slab-Detect for |R| = 5K and w = 128. The figure also shows the average amount of time taken to handle and process a single event point within Slab-Detect, for the various percentage mixes of prefix filters in the set R. The complete round of simulation outputs for the Slab-Detect algorithm on |R| = 10K, 20K, 30K, 40K, and 50K is summarized in the graphs shown in Figure 14.5.

Figure 14.4: Simulation results for Slab-Detect and Naïve. |R| = 5K, w = 128.
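For concreteness, the filter-generation scheme described above might be sketched as follows; we use uniform draws as a stand-in for the thesis' random() helper (described there as "normally distributed"), and the function names are ours:

```python
import random

def gen_range_filter(w, rng=random):
    """One arbitrary range filter [lo, hi] over a w-bit universe:
    lo = random(0, 2^w - 1), hi = random(lo, 2^w - 1)."""
    lo = rng.randint(0, 2 ** w - 1)
    hi = rng.randint(lo, 2 ** w - 1)
    return (lo, hi)

def gen_prefix_filter(w, rng=random):
    """One prefix filter [b/prefixLen], expressed as the range it covers:
    b = random(0, 2^w - 1), prefixLen = random(0, w - 1)."""
    b = rng.randint(0, 2 ** w - 1)
    plen = rng.randint(0, w - 1)
    mask = (1 << (w - plen)) - 1          # the (w - plen) suffix bits
    return (b & ~mask & (2 ** w - 1), b | mask)

# A mixed sample with percentPrefix = 0.5:
random.seed(1)
R = [gen_range_filter(32) if random.random() > 0.5 else gen_prefix_filter(32)
     for _ in range(1000)]
assert all(0 <= lo <= hi < 2 ** 32 for lo, hi in R)
```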

14.3 Adapting Slab-Detect under the HPF rule

In this section, we show that the original concepts of the Slab-Detect algorithm under MSTB can be adapted for use under the highest-priority tiebreaking rule (HPF). The adapted algorithm achieves the same runtime performance and space complexity as discussed in Section 14.2. Under HPF, each filter r ∈ R is assigned a priority value prio(r).

Definition 14. A filter r has a higher priority than filter s iff prio(r) < prio(s).

Definition 15. A set of filters R is conflict-free under HPF iff for each point p there is a unique filter r of the highest priority that contains p, such that prio(r) < prio(s), ∀s ∈ S ⊆ R stabbed by p.

(a) Slab-Detect runtime performance (b) Reporting slabResolve(R)

(c) Slab-Detect memory requirements

Figure 14.5: Slab-Detect: Simulation results for |R| = 10K, 20K, 30K, 40K, 50K, and w = 128.

Definition 16. Slab σi is conflict-free under HPF iff there is a highest priority filter r ∈ Si that contains epi, such that prio(r) < prio(s), ∀s ∈ Si stabbed by epi.

14.3.1 Status structures

A filter s ∈ R belongs to the status structure T at epi iff epi ∈ [s.lo, s.hi). All such filters are stored in a single red-black tree, ordPrio(T), which orders the filters according to ascending priorities.

14.3.2 Handling event points

At each event point epi, we query the top two filters r0 and r1 in ordPrio(T).

Corollary 4. If prio(r0) ≠ prio(r1), then slab σi is conflict-free.

Proof. Filter r0, which contains epi, has the highest priority among all filters in ordPrio(T) stabbed by epi. Otherwise, if prio(r0) = prio(r1), then slab σi is not conflict-free and requires a resolve filter hσi with prio(hσi) < prio(r0).

Corollary 5. If two adjacent slabs σi and σi+1 require resolve filters g and h respectively, then under HPF g and h can be merged. Assign prio(g ∪ h) = prio(g) if prio(g) < prio(h); otherwise assign prio(g ∪ h) = prio(h).

Proof. In both σi and σi+1, g ∪ h is the highest priority resolve filter stabbed by epi and epi+1.
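Under HPF, the per-slab test thus reduces to inspecting the two highest priorities among the stabbed filters; a sketch (smaller value = higher priority, as in Definition 14; the function name is ours):

```python
def hpf_slab_needs_resolve(stabbed_prios):
    """Slab check under HPF: query the two highest priorities among the
    filters stabbing ep_i. The slab is conflict-free iff the top priority
    is unique (Corollary 4); a tie calls for a resolve filter with a
    strictly higher priority than the tied filters."""
    top_two = sorted(stabbed_prios)[:2]
    return len(top_two) == 2 and top_two[0] == top_two[1]

assert not hpf_slab_needs_resolve([1, 3, 7])  # unique highest priority
assert hpf_slab_needs_resolve([2, 2, 5])      # tie at the top -> resolve
assert not hpf_slab_needs_resolve([4])        # a single filter never conflicts
```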

14.4 Setting up IP lookup with Slab-Detect

In the following we discuss IP lookup under MSTB, where we utilise the structure ordLen(T) introduced in section 14.1. Alternatively, we can substitute ordLen(T) with ordPrio(T), introduced in section 14.3, to allow IP lookup under HPF. Think of the x-axis as a timeline. Note that the sets of line segments intersecting contiguous slabs are similar: as the boundary from one slab to the next is crossed, certain segments are deleted from the set and other segments are inserted. Over the entire time range, there are 2n insertions and deletions, one insertion and one deletion per segment.

Ordinary data structures are ephemeral in the sense that an update on the structure destroys the old version, leaving only the new version available for use. A data structure is called partially persistent if all intermediate versions can be accessed but only the newest version can be modified, and fully persistent if every version can be both accessed and modified. The obvious way to provide persistence is to

make a copy of the data structure each time it is changed. Refer to Driscoll et al. [135] for a systematic study of persistence. The idea is to maintain a data structure during Slab-Detect’s sweep that stores for each slab the segments that cover the slab and also the resolve filter if the slab was found to be non-conflict-free, cf. Figure 14.6.

Figure 14.6: Slab-Detect and the partially persistent version of ordLen(T ).

After Slab-Detect completes its full run, let T^p refer to the partially persistent version of ordLen(T). Let R be a set of n one-dimensional range filters and p be an incoming packet to be classified. Version i of T^p consists of the intervals in R that intersect the line x = epi. For a stabbing point p, we search for the highest version less than p. Therefore, we need an auxiliary data structure to store the access pointers to the various versions. When the pointers are stored in a balanced binary search tree, initiating access into any version takes O(log n) time. We know that the intervals in each version of T^p are ordered with respect to their ascending lengths. Hence, the shortest (most specific) filter in each version can be reported in O(1) time. IP lookup can thus be performed in O(log n) time.

Mohamed, Langner and Ottmann [136] propose path-merging as a refinement of techniques used to make linked data structures partially persistent. Path-merging supports bursts of operations between any two adjacent versions, in contrast to only one operation in the original variants. Utilizing the path-merging technique, we are able to solve the conflict detection problem in O(n log n) time while building the partially persistent structure, and then utilize it to answer lookup queries in O(log n) time.
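The lookup scheme can be sketched with a plain sorted array of version boundaries in place of the balanced search tree of access pointers; all names below are ours, and each per-slab "version" stores only the most specific filter, which is all that lookup needs:

```python
import bisect

def build_versions(ranges):
    """Precompute, per slab, the most specific filter covering it — a
    stand-in for the partially persistent ordLen(T) versions."""
    eps = sorted({e for r in ranges for e in r})
    versions = []
    for lo, hi in zip(eps, eps[1:]):
        mid = (lo + hi) / 2.0
        stabbed = [r for r in ranges if r[0] <= mid <= r[1]]
        best = min(stabbed, key=lambda f: f[1] - f[0]) if stabbed else None
        versions.append(best)
    return eps, versions

def lookup(eps, versions, p):
    """O(log n) search for the version whose slab contains point p,
    then O(1) access to its most specific filter."""
    i = bisect.bisect_right(eps, p) - 1
    if 0 <= i < len(versions):
        return versions[i]
    return None  # p lies outside every filter

eps, vs = build_versions([(0, 20), (5, 15), (8, 12)])
assert lookup(eps, vs, 10) == (8, 12)  # most specific filter at p = 10
assert lookup(eps, vs, 3) == (0, 20)
```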

14.5 Contributions and concluding remarks

We presented an output-sensitive sweepline algorithm to make a given set of one-dimensional arbitrary range filters conflict-free. The scheme achieves a worst-case time complexity of O(n log n) and uses O(n) space, where n is the number of filters in the set. The number of reported resolve filters is not always minimal, yet the measure of their union is minimal with respect to making the set conflict-free. For example, consider the filter set in Figure 14.7. Slab-Detect reports the two resolve filters d and e to make the original set a, b, c conflict-free, even though the single resolve filter a ∩ c would be sufficient. Yet, the measure of the union of d and e is smaller than that of a ∩ c.

Figure 14.7: Slab-Detect reports the two resolve filters d and e in order to make the original set a, b, c conflict-free.

In summary, Slab-Detect:

• does not report duplicate resolve filters

• only reports essential resolve filters (at most O(n))

• supports IP lookup (with an augmentation)

• is adaptable for use under the highest-priority filter rule (HPF), with the same runtime performance and space complexity as under MSTB

This work is unique in the sense that there are no similar previous works against which to benchmark our results. All the literature we came across either deals exclusively with arbitrary range filters in the online variant or, where the offline variant is treated, considers data sets of prefix filters in higher dimensions.

Part IV

Summary of Contributions

Summary of Contributions and Future Directions

However impenetrable it seems, if you don't try it, then you can never do it. — Andrew Wiles

A dynamic routing protocol adjusts to changing network topologies, which are indicated in update messages that are exchanged between routers. If a link attached to a router goes down or becomes congested, the routing protocol makes sure that other routers know about the change. From these updates a router constructs a forwarding table which contains a set of network addresses and a reference to the interface that leads to each network. In a turbulent period, one or a few major routing events cause several routes to be updated simultaneously. Since the performance of the lookup device plays a crucial role in the overall performance of the Internet, it is important that lookup and update operations are performed as fast as possible. In order to accelerate these operations, routing tables must be implemented in a way that allows them to be queried and modified concurrently by several processes. Relaxed balancing has become a commonly used concept in the design of concurrent search tree algorithms. In relaxed balanced data structures, rebalancing is uncoupled from updates and may be arbitrarily delayed.

The first part of this dissertation proposed the relaxed balanced min-augmented range tree and presented an experimental comparison with the strictly balanced min-augmented range tree in a concurrent environment. The benchmark results confirmed the hypothesis that the relaxed balanced min-augmented range tree is better suited for the representation of dynamic IP router tables than the strictly balanced version of the tree. The higher the update frequency, and, up to a certain point, the higher the number of processes, the more clearly the relaxed version outperformed the standard version. These results were presented at the 13th IEEE Symposium on Computers and Communications (ISCC 2008) [137].
Further, we presented an interactive visualization of the relaxed balanced MART which corroborated the correctness of the proposed locking schemes. The continuous growth of network link rates poses a grand challenge for high speed IP lookup engines. Given such high data rates, IP lookup must be implemented in hardware. Duchene and Hanna [94] proposed a technique to generate flowpaths directly from Java bytecode representing multithreaded Java programs. It would be interesting to investigate the RMART's performance when implemented directly in hardware.

Modern IP routers further provide policy-based routing (PBR) mechanisms. PBR provides a technique for expressing routing criteria based on the policies defined by network administrators, which complements the existing destination-based routing scheme. PBR requires network routers to examine multiple fields of a packet header in order to categorize it into the appropriate “flow”. Flow identification entails searching a table of predefined rules to identify the appropriate flow based on criteria including IP address, port, and protocol type. Packet classification enables network routers to provide advanced network services including network security, quality of service (QoS) routing, and monitoring. The second part investigated whether the popular R*-tree is suited for packet classification. To this end, it was benchmarked against two representative classification algorithms using the ClassBench tools suite. According to our benchmark results, R*-trees can be considered an alternative solution to the static packet classification problem for Firewall (FW) filter sets. The contributions presented in the second part of this thesis were presented at the Seventh IEEE International Symposium on Network Computing and Applications (NCA 2008) [138]. Scenarios where policies need to be updated include network state changes (e.g., link failures) along the policy-based routes or dynamic topologies (ad hoc networks). Most existing packet classification solutions do not support (fast) incremental updates. A further strength of the R*-tree is its support for dynamic incremental updates. Hence it would be interesting to investigate the performance of R*-trees in a dynamic classification environment. To this end, the PALAC simulator could be employed. PALAC is a packet lookup and classification simulator that was designed by Gupta and Balkman and is freely available for public use [139].
The simulator provides facilities for traffic generation as well as for the generation of classifier updates. Updates are interleaved with packet lookups/classifications during a simulation. PALAC outputs a variety of statistics including algorithm storage, worst case as well as average classification time, and the number of dropped packets. The simulator further provides a repository of algorithms which currently contains Linear Search, Trie Search and -on-Trie Search. Each algorithm is subclassed from a generic class. If a user wants to evaluate a new algorithm, it must be subclassed from the generic class and integrated into the PALAC architecture. The simulator is implemented in C++. Further, additional research is required to evaluate the R*-tree's performance when implemented directly in hardware.
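The geometric view that makes the R*-tree applicable can be sketched as follows (illustrative code, not the thesis's implementation): a two-dimensional filter is a rectangle in (source, destination) address space, and classifying a packet amounts to a point query that returns the highest-priority containing rectangle. A linear scan stands in here for the R*-tree query; an R*-tree would additionally prune subtrees whose bounding rectangles miss the query point.

```java
import java.util.List;

// Packet classification as a geometric point query over filter rectangles.
class TwoDClassifier {
    // A filter covers a source range and a destination range and carries
    // a priority plus an action (names here are illustrative).
    record Filter(long srcLo, long srcHi, long dstLo, long dstHi,
                  int priority, String action) {
        boolean matches(long src, long dst) {
            return srcLo <= src && src <= srcHi
                && dstLo <= dst && dst <= dstHi;
        }
    }

    private final List<Filter> filters;
    TwoDClassifier(List<Filter> filters) { this.filters = filters; }

    // Find all rectangles containing the header point (src, dst) and
    // apply the highest-priority tiebreaker.
    String classify(long src, long dst) {
        Filter best = null;
        for (Filter f : filters) {
            if (f.matches(src, dst)
                    && (best == null || f.priority() > best.priority())) {
                best = f;
            }
        }
        return best == null ? "default" : best.action();
    }
}
```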

The header of an arriving packet may match more than one filter, in which case the filter with the highest priority among all the matching filters is commonly chosen as the best matching filter. Applying the highest-priority tiebreaker resolves this ambiguity in the classification process. Yet not every policy can be implemented by prioritizing the filters. If the most specific tiebreaker, analogous to the most specific tiebreaker (MSTB) in one-dimensional IP lookup, is to be deployed, it must be ensured that for each packet p there is a well-defined most specific filter that applies to p. A seminal technique adds a so-called resolve filter for each pair of conflicting filters, which guarantees that the most specific tiebreaker can be applied. The third part of this dissertation presented a conflict detection and resolution scheme for static one-dimensional range tables. The proposed algorithm achieves a worst case time complexity of O(n log n) and reports only O(n) resolve filters to make a given set of n one-dimensional arbitrary range filters in a router table conflict-free, under both MSTB and HPF tiebreakers. Further, we have shown that by making use of partial persistence, the structure also supports IP lookup. These contributions were presented at the 26th Annual IEEE Conference on Computer Communications (INFOCOM 2007) [129].
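The conflict notion underlying this scheme can be made concrete with a small sketch (hypothetical code, not the thesis's O(n log n) algorithm): two range filters conflict when their ranges overlap partially, i.e. neither contains the other, so for addresses in the overlap no most specific matching filter exists. Adding their intersection as a resolve filter restores a well-defined MSTB answer for that pair.

```java
// Conflict detection and resolution for a pair of 1D range filters.
final class RangeConflicts {
    record Range(long lo, long hi) {
        boolean overlaps(Range o) { return lo <= o.hi && o.lo <= hi; }
        boolean contains(Range o) { return lo <= o.lo && o.hi <= hi; }
    }

    // Partial overlap with no containment either way: a conflict.
    static boolean inConflict(Range a, Range b) {
        return a.overlaps(b) && !a.contains(b) && !b.contains(a);
    }

    // The resolve filter covers exactly the ambiguous addresses: it is
    // contained in both filters and is the most specific match there.
    static Range resolveFilter(Range a, Range b) {
        if (!inConflict(a, b))
            throw new IllegalArgumentException("filters are not in conflict");
        return new Range(Math.max(a.lo(), b.lo()), Math.min(a.hi(), b.hi()));
    }
}
```

A pairwise check like this takes O(n^2) time over a filter set; the contribution of the third part is precisely to detect all conflicts and emit only O(n) resolve filters in O(n log n) time.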

An overview of the contributions presented in part I and part III will be published in [140].

Bibliography

[1] T. Sheldon, McGraw-Hill’s Encyclopedia of Networking and Telecommunications. McGraw-Hill Professional, 2001.

[2] (2008) The IPv6 Portal. [Online]. Available: http://www.ipv6tf.org/index.php?page=meet/history

[3] (2008) IPv4 Address Report. [Online]. Available: http://www.potaroo.net/tools/ipv4/

[4] (2008) Classless Inter-Domain Routing. [Online]. Available: http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing

[5] (2008) Policy-based routing. [Online]. Available: http://www.cisco.com/warp/public/732/Tech/plicy_wp.htm

[6] S. Hanke, “The performance of concurrent red-black tree algorithms,” Lecture Notes in Computer Science, vol. 1668, pp. 286–300, 1999.

[7] A. Hari, S. Suri, and G. Parulkar, “Detecting and resolving packet filter conflicts,” in INFOCOM 2000: Proceedings of the Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies. IEEE Press, 2000, pp. 1203–1212.

[8] (2008) BGP Update Reports. [Online]. Available: http://bgp.potaroo.net/index-upd.html

[9] A. Datta and T. Ottmann, “A note on the IP table lookup problem,” Dec. 2004, unpublished.

[10] L. Guibas and R. Sedgewick, “A dichromatic framework for balanced trees,” in Proceedings of the 19th Annual Symposium on Foundations of Computer Science, 1978, pp. 8–21.

[11] J. L. W. Kessels, “On-the-fly optimization of data structures,” Commun. ACM, vol. 26, no. 11, pp. 895–901, 1983.

[12] T. Ottmann and E. Soisalon-Soininen, “Relaxed balancing made simple,” Institut für Informatik, Albert-Ludwigs-Universität Freiburg, Tech. Rep. 71, Jan. 1995. [Online]. Available: ftp://ftp.informatik.uni-freiburg.de/documents/reports/report71/

[13] K. S. Larsen and R. Fagerberg, “B-trees with relaxed balance,” in IPPS: 9th International Parallel Processing Symposium. IEEE Computer Society Press, 1995. [Online]. Available: citeseer.ist.psu.edu/larsen95btrees.html

[14] S. Hanke, T. Ottmann, and E. Soisalon-Soininen, “Relaxed balanced red-black trees,” in CIAC ’97: Proceedings of the Third Italian Conference on Algorithms and Complexity. London, UK: Springer Verlag, 1997, pp. 193–204.

[15] K. S. Larsen, T. Ottmann, and E. Soisalon-Soininen, “Relaxed balance for search trees with local rebalancing,” Acta Informatica, vol. 37, no. 10, pp. 743–763, 2001. [Online]. Available: citeseer.ist.psu.edu/larsen97relaxed.html

[16] K. S. Larsen, “Relaxed red-black trees with group updates,” Acta Informatica, vol. 38, no. 8, pp. 565–586, 2002.

[17] L. Malmi and E. Soisalon-Soininen, “Group updates for relaxed height-balanced trees,” in PODS ’99: Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. New York, NY, USA: ACM Press, 1999, pp. 358–367.

[18] K. S. Larsen, “Relaxed multi-way trees with group updates,” in PODS ’01: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. New York, NY, USA: ACM Press, 2001, pp. 93–101.

[19] T. Seddig, “Balancierung von Datenstrukturen zur Lösung des Paketklassifizierung-Problems,” Diplomarbeit, Institut für Informatik, Albert-Ludwigs-Universität Freiburg, Apr. 2006.

[20] W. Wittmann, “Nebenläufige RBRSMAB Prozesse und ihre Visualisierung,” Bachelorarbeit, Institut für Informatik, Albert-Ludwigs-Universität Freiburg, July 2008.

[21] H. Lu and S. Sahni, “O(log n) dynamic router-tables for prefixes and ranges,” IEEE Transactions on Computers, vol. 53, no. 10, pp. 1217–1230, 2004.

[22] E. M. McCreight, “Priority search trees,” SIAM J. Comput., vol. 14, no. 2, pp. 257–276, 1985.

[23] D. E. Knuth, The Art of Computer Programming. Volume 3: Sorting and Searching. Addison-Wesley, 1998.

[24] M. Ruiz-Sanchez, E. Biersack, and W. Dabbous, “Survey and taxonomy of IP address lookup algorithms,” Network, IEEE, vol. 15, no. 2, pp. 8–23, 2001.

[25] V. Srinivasan and G. Varghese, “Faster IP lookups using controlled prefix expansion,” in SIGMETRICS ’98/PERFORMANCE ’98: Proceedings of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems. New York, NY, USA: ACM Press, 1998, pp. 1–10.

[26] H. Song, J. Turner, and J. Lockwood, “Shape shifting tries for faster IP route lookup,” in ICNP ’05: Proceedings of the 13th IEEE International Conference on Network Protocols (ICNP’05). Washington, DC, USA: IEEE Computer Society, 2005, pp. 358–367.

[27] W. Lu and S. Sahni, “Recursively partitioned static IP router-tables,” in 12th IEEE Symposium on Computers and Communications, 2007, pp. 437–442.

[28] W. Lu and S. Sahni, “Succinct representation of static packet classifiers,” in 12th IEEE Symposium on Computers and Communications, 2007, pp. 1119–1124.

[29] I. Lee, K. Park, Y. Choi, and S. K. Chung, “A simple and scalable algorithm for the IP address lookup problem,” Fundamenta Informaticae, vol. 56, no. 1,2, pp. 181–190, 2003.

[30] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computa- tional Geometry: Algorithms and Applications. Springer Verlag, 2000.

[31] H. Lu and S. Sahni, “Enhanced interval trees for dynamic IP router-tables,” IEEE Transactions on Computers, vol. 53, no. 12, pp. 1615–1628, 2004.

[32] B. Lampson, V. Srinivasan, and G. Varghese, “IP lookups using multiway and multicolumn search,” IEEE/ACM Trans. Netw., vol. 7, no. 3, pp. 324– 334, 1999.

[33] P. Warkhede, S. Suri, and G. Varghese, “Multiway range trees: scalable IP lookup with fast updates,” Computer Networks, vol. 44, no. 3, pp. 289–303, 2004.

[34] S. Dharmapurikar, P. Krishnamurthy, and D. E. Taylor, “Longest prefix matching using bloom filters,” in SIGCOMM ’03: Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications. New York, NY, USA: ACM, 2003, pp. 201–212.

[35] A. J. McAuley and P. Francis, “Fast routing table lookup using CAMs,” in INFOCOM (3), 1993, pp. 1382–1391. [Online]. Available: citeseer.ist.psu.edu/mcauley93fast.html

[36] H. Song and J. S. Turner, “Fast filter updates for packet classification using TCAM,” in GLOBECOM. IEEE, 2006.

[37] S. de Silva, “6500 FIB Forwarding Capacities,” Presentation, Cisco Systems, Inc., 2007. [Online]. Available: http://www.nanog.org/mtg-0702/presentations/fib-desilva.pdf

[38] D. E. Taylor, “Survey and taxonomy of packet classification techniques,” ACM Comput. Surv., vol. 37, no. 3, pp. 238–275, 2005.

[39] “Efficient Scaling for Multiservice Networks,” White Paper, Juniper Networks, July 2008. [Online]. Available: http://www.juniper.net/solutions/literature/white_papers/200207.pdf

[40] F. Zane, G. Narlikar, and A. Basu, “CoolCAMs: Power-Efficient TCAMs for Forwarding Engines,” in Proceeding of IEEE INFOCOM ’03, 2003.

[41] E. Spitznagel, D. Taylor, and J. Turner, “Packet classification using Extended TCAMs,” in Proceedings of IEEE International Conference on Network Protocols (ICNP), 2003.

[42] W. Eatherton, G. Varghese, and Z. Dittia, “Tree bitmap: hardware/software IP lookups with incremental updates,” SIGCOMM Comput. Commun. Rev., vol. 34, no. 2, pp. 97–122, 2004.

[43] R. Zemach, “CRS-1 overview,” Presentation, Cisco Systems, Inc. [Online]. Available: www.cs.ucsd.edu/∼varghese/crs1.ppt

[44] K. S. Kim and S. Sahni, “Efficient construction of pipelined multibit-trie router-tables,” IEEE Transactions on Computers, vol. 56, no. 1, pp. 32–43, 2007.

[45] W. Jiang, Q. Wang, and V. Prasanna, “Beyond TCAMs: An SRAM-Based Parallel Multi-Pipeline Architecture for Terabit IP Lookup,” INFOCOM 2008. The 27th Conference on Computer Communications. IEEE, pp. 1786–1794, April 2008.

[46] T. Lauer, T. Ottmann, and A. Datta, “Update-efficient data structures for dynamic IP router tables,” International Journal of Foundations of Computer Science, vol. 18, no. 1, pp. 139–161, 2007.

[47] T. Lauer, “Potentials and limitations of visual methods for the exploration of complex data structures,” Ph.D. dissertation, Albert-Ludwigs-Universit¨at Freiburg, 2007.

[48] R. Hinze, “A simple implementation technique for priority search queues,” in International Conference on Functional Programming, 2001, pp. 110–121. [Online]. Available: citeseer.ist.psu.edu/hinze01simple.html

[49] N. Sarnak and R. E. Tarjan, “Planar point location using persistent search trees,” Commun. ACM, vol. 29, no. 7, pp. 669–679, 1986.

[50] J. Boyar and K. S. Larsen, “Efficient rebalancing of chromatic search trees,” in Proceedings of the 30th IEEE symposium on Foundations of computer science. Orlando, FL, USA: Academic Press, Inc., 1994, pp. 667–682.

[51] G. R. Andrews and F. B. Schneider, “Concepts and notations for concurrent programming,” ACM Comput. Surv., vol. 15, no. 1, pp. 3–43, 1983.

[52] (2008) The Java Tutorials. Lesson: Concurrency. [Online]. Available: http://java.sun.com/docs/books/tutorial/essential/concurrency/index.html

[53] C. S. Ellis, “Concurrent search and insertion in AVL trees.” IEEE Trans. Computers, vol. 29, no. 9, pp. 811–817, 1980.

[54] O. Nurmi and E. Soisalon-Soininen, “Chromatic binary search trees: a structure for concurrent rebalancing,” Acta Inf., vol. 33, no. 6, pp. 547–557, 1996.

[55] P. Brinch Hansen, Operating System Principles. Prentice-Hall, Inc., 1973.

[56] C. A. R. Hoare, “Monitors: an operating system structuring concept,” Com- mun. ACM, vol. 17, no. 10, pp. 549–557, 1974.

[57] E. G. Coffman, M. Elphick, and A. Shoshani, “System deadlocks,” ACM Comput. Surv., vol. 3, no. 2, pp. 67–78, 1971.

[58] R. C. Holt, “Some deadlock properties of computer systems,” ACM Comput. Surv., vol. 4, no. 3, pp. 179–196, 1972.

[59] B. Price, I. Small, and R. Baecker, “A taxonomy of software visualization,” Proceedings of the Twenty-Fifth Hawaii International Conference on System Sciences, vol. 2, pp. 597–606, Jan 1992.

[60] J. T. Stasko, J. B. Domingue, M. H. Brown, and B. A. Price, Eds., Software Visualization: Programming as a Multimedia Experience. The MIT Press, 1998.

[61] H. H. Goldstine and J. von Neumann, “Planning and coding problems for an electronic computing instrument,” von Neumann Collected Works, vol. 5, pp. 80–151, 1947.

[62] R. Fleischer and L. Kucera, “Algorithm animation for teaching,” in Revised Lectures on Software Visualization, International Seminar, ser. Lecture Notes in Computer Science, S. Diehl, Ed., vol. 2269. Springer-Verlag, 2002, pp. 113–128.

[63] K. Knowlton, “Bell telephone laboratories low-level language,” 1966, 16-minute black and white film.

[64] (2008, July) Red Black Tree Simulation. [Online]. Available: http://reptar.uta.edu/NOTES5311/REDBLACK/RedBlack.html

[65] (2008, July) Red-Black Tree Demonstration. [Online]. Available: http://www.ece.uc.edu/∼franco/C321/html/RedBlack/rb.orig.html

[66] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1994.

[67] “Java PathFinder,” 2009, http://javapathfinder.sourceforge.net/.

[68] M. Ichiriu, “High Performance Layer 3 Forwarding. The Need for Dedicated Hardware Solutions,” White Paper, NetLogic Microsystems, 2000. [Online]. Available: http://www.netlogicmicro.com/pdf/cidr_white_paper.pdf

[69] BGP Update Reports. [Online]. Available: http://bgp.potaroo.net/index-upd.html

[70] C. Villamizar, R. Chandra, and R. Govindan. (1998, Nov.) BGP Route Flap Damping. Request for Comments: 2439. [Online]. Available: http://www.ietf.org/rfc/rfc2439.txt

[71] G. Huston. (2006, June) The BGP Report for 2005. The ISP Column. [Online]. Available: http://ispcolumn.isoc.org/2006-06/bgpupds.html

[72] (2008) University of Oregon Route Views Project. [Online]. Available: http://archive.routeviews.org/bgpdata/

[73] L. Blunk, M. Karir, and C. Labovitz. MRT Format. [Online]. Available: http://tools.ietf.org/html/draft-ietf-grow-mrt-00

[74] Routing information service (RIS). [Online]. Available: http://www.ripe.net/ris/

[75] R. Jain and S. Routhier, “Packet trains: measurements and a new model for computer network traffic,” IEEE Journal on Selected Areas in Communications, vol. 4, no. 6, pp. 986–995, 1986.

[76] K. C. Claffy, “Internet traffic characterization,” Ph.D. dissertation, Univer- sity of California, San Diego, 1994.

[77] M. H. MacGregor and I. L. Chvets, “Locality in internetwork traffic,” Uni- versity of Alberta, Tech. Rep., Mar. 2002.

[78] D. J. Lee and N. Brownlee, “Passive measurement of one-way and two-way flow lifetimes,” ACM SIGCOMM, vol. 37, no. 3, pp. 19–27, 2007.

[79] Y. Chabchoub, C. Fricker, F. Guillemin, and P. Robert, “A study of flow statistics of IP traffic with application to sampling,” July 2007, unpublished. [Online]. Available: http://www-rocq.inria.fr/Philippe.Robert/src/papers/2007-4.pdf

[80] (2008) Pareto distribution. [Online]. Available: http://en.wikipedia.org/wiki/Pareto_distribution

[81] “SUN FIRE T1000 and T2000 SERVER ARCHITECTURE,” White Paper, Sun Microsystems, Dec. 2005. [Online]. Available: http://www.sun.com/servers/coolthreads/coolthreads_architecture_wp.pdf

[82] J. M. O’Connor and M. Tremblay, “picoJava-I: The Java Virtual Machine in Hardware,” IEEE Micro, vol. 17, no. 2, pp. 45–53, 1997.

[83] (2008) Execution of synchronized Java methods in Java computing environments. United States Patent 6918109. [Online]. Available: http://www.patentstorm.us/patents/6918109/fulltext.html

[84] (2008) Fast synchronization for programs written in the JAVA programming language. United States Patent 6349322. [Online]. Available: http://www.freepatentsonline.com/6349322.html

[85] “SUN FIRE X4600 M2 SERVER ARCHITECTURE,” White Paper, Sun Microsystems, June 2008. [Online]. Available: http://www.sun.com/servers/x64/x4600/arch-wp.pdf

[86] J. Nehmer and P. Sturm, Systemsoftware. Grundlagen moderner Betriebssysteme. dpunkt, 1998.

[87] K. Beuth and W. Schmusch, Grundschaltungen. Vogel, 1992.

[88] (2008) Field-programmable gate array. [Online]. Available: http://en.wikipedia.org/wiki/Field-programmable_gate_array

[89] (2008) FPGA Basics. [Online]. Available: http://www.andraka.com/whatisan.htm

[90] D. Hanna, A. Spagnuolo, and M. DuChene, “Speedup using flowpaths for a finite difference solution of a 3D parabolic PDE,” Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pp. 1–6, March 2007.

[91] ajile. (2008) Embedded Low-Power Direct Execution Java Processors. [Online]. Available: http://www.ajile.com/

[92] D. Hanna, M. DuChene, G. Tewolde, and J. Sattler, “Java flowpaths: Efficiently generating circuits for embedded systems from Java,” in International Conference on Embedded Systems and Applications, Nov. 2006, pp. 23–30.

[93] D. M. Hanna and R. E. Haskell, “Implementing Software Programs in FPGAs using Flowpaths,” in International Conference on Embedded Systems and Applications, 2004, pp. 76–82.

[94] M. Duchene and D. Hanna, “Implementing parallel algorithms on an FPGA directly from multithreaded Java using flowpaths,” Circuits and Systems, 2005. 48th Midwest Symposium on, pp. 980–983 Vol. 2, Aug. 2005.

[95] C. A. Shue and M. Gupta, “Projecting IPv6 Forwarding Characteristics under Internet-wide Deployment,” in ACM SIGCOMM 2007 IPv6 Workshop, Aug. 2007.

[96] P. Owezarski, “Does IPv6 Improve the Scalability of the Internet?” in IDMS/PROMS 2002: Proceedings of the Joint International Workshops on Interactive Distributed Multimedia Systems and Protocols for Multimedia Systems. London, UK: Springer-Verlag, 2002, pp. 130–140.

[97] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in SIGMOD Conference, B. Yormark, Ed. ACM Press, 1984, pp. 47–57.

[98] (2008) Classbench: A packet classification benchmark. [Online]. Available: http://www.arl.wustl.edu/∼det3/ClassBench/index.htm

[99] P. Gupta and N. McKeown, “Packet classification on multiple fields,” in SIGCOMM ’99: Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication. ACM, 1999, pp. 147–160.

[100] P. Gupta and N. McKeown, “Algorithms for packet classification,” IEEE Network, vol. 15, no. 2, pp. 24–32, 2001.

[101] P. F. Tsuchiya, “A search algorithm for table entries with non-contiguous wildcarding,” 1991, unpublished. [Online]. Available: citeseer.ist.psu.edu/tsuchiya91search.html

[102] V. Srinivasan, G. Varghese, S. Suri, and M. Waldvogel, “Fast and scalable layer four switching,” in Proceedings of SIGCOMM ’98, 1998, pp. 191–202. [Online]. Available: citeseer.ist.psu.edu/article/srinivasan98fast.html

[103] F. Baboescu, S. Singh, and G. Varghese, “Packet classification for core routers: is there an alternative to CAMs?” in INFOCOM 2003. Twenty-Second Annual Joint Conference of the IEEE Computer and Communications Societies. IEEE, March-April 2003, pp. 53–63.

[104] P. Gupta and N. McKeown, “Packet classification using hierarchical intelligent cuttings,” in Proceedings of Hot Interconnects VII, 1999.

[105] S. Singh, F. Baboescu, G. Varghese, and J. Wang, “Packet classification using multidimensional cutting,” in SIGCOMM ’03: Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications. ACM, 2003, pp. 213–224.

[106] Y. Qi and J. Li, “Towards effective packet classification,” in IASTED Com- munication, Network, and Information Security, 2006.

[107] F. Geraci, M. Pellegrini, P. Pisati, and L. Rizzo, “Packet classification via improved space decomposition techniques.” in INFOCOM. IEEE, 2005, pp. 304–312.

[108] M. M. Buddhikot, S. Suri, and M. Waldvogel, “Space decomposition techniques for fast Layer-4 switching,” in Protocols for High Speed Networks IV (Proceedings of PfHSN ’99), J. D. Touch and J. P. G. Sterbenz, Eds. Salem, MA, USA: Kluwer Academic Publishers, Aug. 1999, pp. 25–41.

[109] H. Lim, M. Y. Kang, and C. Yim, “Two-dimensional packet classification algorithm using a quad-tree,” Comput. Commun., vol. 30, no. 6, pp. 1396–1405, 2007.

[110] T. Y. C. Woo, “A modular approach to packet classification: Algorithms and results,” in INFOCOM (3), 2000, pp. 1213–1222.

[111] “Understanding ACL on Catalyst 6500 Series Switches,” White Paper, Cisco Systems, Inc. [Online]. Available: http://www.cisco.com/en/US/products/hw/switches/ps708/products_white_paper09186a00800c9470.shtml

[112] C. Solder, “Understanding Quality of Service on the Catalyst 6500 and Cisco 7600 Router,” White Paper, Cisco Systems, Inc., June 2006. [Online]. Available: http://www.cisco.com/application/pdf/en/us/guest/products/ps708/c1225/ccmigration_09186a00806eca1e.pdf

[113] “Cisco Catalyst 6500 and 6500-E Series Switch Data Sheet,” White Paper, Cisco Systems, Inc. [Online]. Available: http://www.cisco.com/en/US/prod/collateral/modules/ps2797/ps5138/product_data_sheet09186a00800ff916_ps708_Products_Data_Sheet.html

[114] R. Bayer and E. M. McCreight, “Organization and Maintenance of Large Ordered Indices,” Acta Inf., vol. 1, pp. 173–189, 1972.

[115] Y. Manolopoulos, A. Nanopoulos, A. N. Papadopoulos, and Y. Theodoridis, R-trees: Theory and Applications. Springer Verlag, 2005.

[116] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, “The R*-tree: An efficient and robust access method for points and rectangles,” in SIGMOD Conference. ACM Press, 1990, pp. 322–331.

[117] V. Gaede and O. Günther, “Multidimensional access methods,” ACM Comput. Surv., vol. 30, no. 2, pp. 170–231, 1998.

[118] T. K. Sellis, N. Roussopoulos, and C. Faloutsos, “The R-tree: A dynamic index for multi-dimensional objects,” in The VLDB Journal, 1987, pp. 507–518. [Online]. Available: citeseer.ist.psu.edu/sellis87rtree.html

[119] P. W. Huang, P. L. Lin, and H. Y. Lin, “Optimizing storage utilization in R-tree dynamic index structure for spatial databases,” J. Syst. Softw., vol. 55, no. 3, pp. 291–299, 2001.

[120] S. Brakatsoulas, D. Pfoser, and Y. Theodoridis, “Revisiting R-tree construction principles,” in ADBIS ’02: Proceedings of the 6th East European Conference on Advances in Databases and Information Systems. London, UK: Springer-Verlag, 2002, pp. 149–162.

[121] N. Roussopoulos and D. Leifker, “Direct spatial search on pictorial databases using packed R-trees,” SIGMOD Rec., vol. 14, no. 4, pp. 17–31, 1985.

[122] L. Arge, M. de Berg, H. J. Haverkort, and K. Yi, “The Priority R-Tree: A practically efficient and worst-case optimal R-tree,” in SIGMOD Conference, G. Weikum, A. C. König, and S. Deßloch, Eds. ACM, 2004, pp. 347–358.

[123] D. E. Taylor and J. S. Turner, “Classbench: A Packet Classification Benchmark,” in INFOCOM 2005. 24th Annual Joint Conference of the IEEE Computer and Communications Societies, 2005, pp. 2068–2079.

[124] D. Taylor and J. Turner, “Classbench: A Packet Classification Benchmark,” Washington University in Saint Louis, Tech. Rep., May 2004.

[125] H. Song. (2008) Packet classification evaluation. [Online]. Available: http://www.arl.wustl.edu/∼hs1/PClassEval.html

[126] (2008) The R-tree Portal. C++ and Java implementations. [Online]. Available: http://www.rtreeportal.org/index.php?option=com_content&task=view&id=17&Itemid=32

[127] D. E. Taylor, “Models, algorithms, and architectures for scalable packet classification,” Ph.D. dissertation, Washington University, 2004.

[128] C. Kupich and K. A. Mohamed, “Conflict Detection in Internet Router Tables,” Institut für Informatik, Albert-Ludwigs-Universität Freiburg, Tech. Rep., Aug. 2006.

[129] C. Maindorfer, K. A. Mohamed, T. Ottmann, and A. Datta, “A new output-sensitive algorithm to detect and resolve conflicts in Internet router tables,” in INFOCOM 2007. 26th IEEE Conference on Computer Communications, May 2007, pp. 2431–2435.

[130] E. Al-Shaer and H. Hamed, “Management and translation of filtering security policies,” ICC ’03: IEEE International Conference on Communications, vol. 1, pp. 256–260, May 2003.

[131] H. Lu and S. Sahni, “Conflict detection and resolution in two-dimensional prefix router tables,” IEEE/ACM Transactions on Networking, vol. 13, no. 6, pp. 1353–1363, 2005.

[132] D. Eppstein and S. Muthukrishnan, “Internet packet filter management and rectangle geometry,” in SODA ’01: Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2001, pp. 827–835.

[133] M. H. Overmars and C.-K. Yap, “New upper bounds in Klee’s measure problem,” SIAM J. Comput., vol. 20, no. 6, pp. 1034–1045, 1991.

[134] K. A. Mohamed and C. Maindorfer, “An O(n log n) Output-Sensitive Algorithm to Detect and Resolve Conflicts for 1D Range Filters in Router Tables,” Institut für Informatik, Albert-Ludwigs-Universität Freiburg, Tech. Rep., Oct. 2006.

[135] J. R. Driscoll, N. Sarnak, D. D. Sleator, and R. E. Tarjan, “Making data structures persistent,” in STOC ’86: Proceedings of the eighteenth annual ACM symposium on Theory of computing. New York, NY, USA: ACM Press, 1986, pp. 109–121.

[136] K. A. Mohamed, T. Langner, and T. Ottmann, “Versioning tree structures by path-merging,” in Frontiers in Algorithmics, ser. Lecture Notes in Computer Science, F. P. Preparata, X. Wu, and J. Yin, Eds., vol. 5059. Springer, 2008, pp. 101–112.

[137] C. Maindorfer, B. Bär, and T. Ottmann, “Relaxed min-augmented range trees for the representation of dynamic IP router tables,” in 13th IEEE Symposium on Computers and Communications, July 2008, pp. 920–927.

[138] C. Maindorfer and T. Ottmann, “Is the Popular R*-tree Suited for Packet Classification?” in Seventh IEEE International Symposium on Network Computing and Applications, July 2008, pp. 168–176.

[139] P. Gupta and J. Balkman. (2008) Packet Lookup and Classification Simulator (PALAC). [Online]. Available: http://klamath.stanford.edu/tools/PALAC/SRC/

[140] C. Maindorfer, T. Lauer, and T. Ottmann, “New data structures for IP lookup and conflict detection,” in Algorithmics of Large and Complex Networks, ser. Lecture Notes in Computer Science. Springer, 2009, to appear.