BGP, Not As Easy As 1-2-3

Ashley Flavel

Thesis submitted for the degree of Doctor of Philosophy in Applied Mathematics at The University of Adelaide

Faculty of Engineering, Computer and Mathematical Sciences

October 6, 2009

Signed Statement

This work contains no material which has been accepted for the award of any other degree or diploma in any university or other tertiary institution and, to the best of my knowledge and belief, contains no material previously published or written by another person, except where due reference has been made in the text.

I consent to this copy of my thesis, when deposited in the University Library, being available for loan and photocopying, subject to the provisions of the Copyright Act 1968.

SIGNED: ...... DATE: ......

Acknowledgements

I would firstly like to acknowledge the support of my primary supervisor, Associate Professor Matthew Roughan. It was he who first inspired me to pursue a research career. In addition, his financial and social assistance helped gain me an internship at AT&T Research that reinvigorated my desire to work on real-world problems with real-world solutions. I would also like to acknowledge the support of my co-supervisor, Professor Nigel Bean. His ability to interpret my sometimes incoherent description of a problem into one that was simple and easy to understand was a major factor in the success of this thesis. Dr. Olaf Maennel, as fellow co-supervisor, provided an alternative perspective on inter-domain routing to that of my primary supervisor, A/Prof Roughan. His pragmatic approach to inter-domain routing and in-depth knowledge of its quirks and corner cases was an excellent source to verify the sanity of my ideas. All three of my supervisors provided contrasting perspectives which, although at times frustrating, allowed the development of ideas that were not only theoretically sound but pragmatic. I would like to thank them all for their friendship and support over the last few years.

Dr. Aman Shaikh of AT&T Research Labs was a major contributor to the ideas in this thesis. The four months I spent at AT&T Research and the subsequent year-long collaboration resulted in the best work of my thesis. Aman also assisted with my transport to and from the laboratories, during which time many of the most important research breakthroughs came (as well as many discussions on cricket). AT&T also deserve acknowledgement for providing me access to their commercially sensitive network data.

This thesis was reviewed by three of my most admired networking researchers: Dr. Timothy Griffin of The University of Cambridge, Professor Jennifer Rexford of Princeton University and Dr. Steve Uhlig of Technische Universität Berlin. I am sincerely privileged and greatly appreciate their time and useful comments.

For financial assistance I am very grateful to the Australian Research Council’s Communications Research Network (ACoRN). ACoRN provided me with substantial financial assistance for my research visit to AT&T. I specifically would like to thank ACoRN’s Adelaide University representative Belinda Chiera. In addition, I would like to acknowledge the financial support of the Australian Research Council through grant DP0557066.

Dr. Flo Rice, although not related to my technical research, was a pivotal figure in its development. She provided me with accommodation for the time I spent at AT&T and became a good friend.

The staff and students at the Teletraffic Research Centre at the University of Adelaide provided me with an excellent research atmosphere. I would specifically like to thank Dr. Jeremy McMahon for frequent useful discussions.

I would like to thank Maxine and Mark Wong See for proofreading several sections of this thesis. A special acknowledgement is reserved for Carrie Kelly, who devoted a significant amount of time and energy to proofreading the entire thesis.

Lastly, I would like to thank my family for their support throughout my time as a student. Without them my chance to reach this point would not be possible. They have enabled me to reach a level of education that will give me incredible future opportunities.

Dedication

I dedicate this thesis to my family. My desire to succeed comes from them, and without their love and support this thesis would not have been possible.

Contents

Abstract

1 Introduction
    1.1 Preliminary Background
    1.2 Thesis Roadmap
    1.3 Statement of Research Contributions
        1.3.1 Publications Arising From This Thesis

2 Background
    2.1 General Routing Protocols
        2.1.1 Link-State Protocols
        2.1.2 Distance-Vector Protocols
    2.2 Routing in the Internet
    2.3 Border Gateway Protocol
        2.3.1 BGP Decision Process
        2.3.2 BGP Operation
        2.3.3 Internal vs External BGP
    2.4 BGP, Not as Easy as 1-2-3?
        2.4.1 The Stable Paths Problem
        2.4.2 MED Oscillation
        2.4.3 iBGP Oscillation
        2.4.4 BGP in the Wild
        2.4.5 BGP Data

3 Where’s Waldo? Practical Searches for Stability in iBGP
    3.1 Introduction
    3.2 Related Work
    3.3 Background
        3.3.1 iBGP Recap
        3.3.2 Best Path Selection
        3.3.3 Interior Gateway Protocol
        3.3.4 Physical Graph
        3.3.5 Signaling Graph
        3.3.6 Egress Instance
    3.4 Stability
        3.4.1 Complexity of Determining Signaling Correctness
    3.5 Router Reliance Graph
        3.5.1 Reliance Rules for a Route Reflector Topology
        3.5.2 Co-reliance Groups
    3.6 Where Can An Oscillation Occur?
    3.7 Algebraic Description of Co-reliance Groups
        3.7.1 Reducing the Size of Co-reliance Groups
        3.7.2 Oscillation Detection
        3.7.3 Oscillation Classes
        3.7.4 Reliances between Co-reliance Groups
    3.8 Oldest-Route Tie-breaker
    3.9 Prioritizing Egress Instances
        3.9.1 Proving the Stability of an Egress Instance
        3.9.2 Proving the Stability of a Configuration
        3.9.3 Checking the Stability of the Current Network
        3.9.4 Checking the Stability of the Current Network with Limited Measurement Infrastructure
        3.9.5 Practical Implementation
        3.9.6 Online Tool
    3.10 Preventing BGP Oscillation
    3.11 Three-Or-More-Level Route-Reflector Hierarchies
        3.11.1 Greater than Three-Level Hierarchies
    3.12 Discussion

4 Humpty Dumpty: Putting iBGP Back Together Again
    4.1 Introduction
    4.2 Related Work
    4.3 Two-Level Route-Reflector Reliance Graph
    4.4 General Route-Reflector Reliance Graph
        4.4.1 Notation Recap
        4.4.2 Reliance Rules for Route Reflection
    4.5 Finding the Actual Solution
        4.5.1 Ordering of Routers Within a Co-reliance Group
        4.5.2 Breaking Ties
        4.5.3 Dynamic IGP
    4.6 Evaluation
    4.7 Generalized Topologies
        4.7.1 Route-Reflection with MED
        4.7.2 Full Mesh
        4.7.3 Confederations
    4.8 Discussion

5 Peer Dragnet: Analysis of BGP Peering Policies
    5.1 Introduction
    5.2 Background
    5.3 Related Work
    5.4 Data Collection
        5.4.1 BGP Routes
        5.4.2 IGP Distance Information
        5.4.3 iBGP Topology Information
        5.4.4 Aggregate Traffic Data
    5.5 Analysis of Peering Policies
        5.5.1 Policy Implementation Techniques
        5.5.2 How Peering Links are Used
    5.6 Impact on Routing and Traffic
        5.6.1 Dealing with Routing Dynamics
        5.6.2 Routing Impact
        5.6.3 Traffic Impact
    5.7 Dynamics of Peering Policies
        5.7.1 Policy Change Detection Algorithm
    5.8 Operational Peer Dragnet
    5.9 Non-Canonical Policy Mitigation
        5.9.1 AS-wide BGP Route Controller
        5.9.2 Import Policies
        5.9.3 Distributed Knowledge
    5.10 Discussion

6 CleanBGP: Verifying the Consistency of BGP Data
    6.1 Introduction
    6.2 Data Consistency
    6.3 Measurement Artifacts
        6.3.1 Session Failures and Resets
        6.3.2 Incomplete Tables
        6.3.3 Missing Updates
        6.3.4 Update Ordering
        6.3.5 Non-atomic Table Dumps
        6.3.6 Other Artifacts
    6.4 Characterization of Artifacts
        6.4.1 Table Comparison
        6.4.2 Oldest Prefix
        6.4.3 State Information
        6.4.4 Downtime
        6.4.5 Session Re-establishment
        6.4.6 Detecting Measurement Artifacts
    6.5 Extended Measurement Artifacts
    6.6 Cleaning Data
        6.6.1 Session Failures/Re-establishments
        6.6.2 Incomplete Tables
        6.6.3 Missing Updates
        6.6.4 Update Ordering
    6.7 Default Parameter Selection
        6.7.1 Sliding Window Length
        6.7.2 Re-establishment Phase Thresholds
        6.7.3 Downtime Threshold and Bin Length
        6.7.4 Suspicious Bin Thresholds
    6.8 Automated Parameter Selection
        6.8.1 Sliding Window Thresholds
        6.8.2 Suspicious Bin Thresholds
        6.8.3 Discussion
    6.9 Results
    6.10 Discussion

7 Conclusion

Acronyms

Bibliography

List of Figures

2.2.1 Routing protocol domains
2.3.1 Internal router structure
2.3.2 BGP decision process
2.3.3 Example route-reflector topology
2.3.4 Route-reflection obscures route availability
2.4.1 Griffin et al.’s Good Gadget
2.4.2 Non-stable gadgets defined by Griffin et al.
2.4.3 MED Oscillation
2.4.4 iBGP persistent oscillation

3.3.1 Edge types in the iBGP signaling graph
3.5.1 An example egress instance
3.7.1 Stable solutions for a two node co-reliance group
3.7.2 Co-reliance group reduction
3.7.3 Oscillation classes Venn diagram
3.7.4 Example co-reliance groups for each oscillation class.
3.7.5 The state machine for the four-node ‘Good’ state machine in Figure 3.7.4(a)
3.7.6 The state machine for the three-node ‘Bad’ co-reliance group in Figure 3.7.4(b)
3.7.7 The state machine for the five-node ‘Naughty’ co-reliance group in Figure 3.7.4(c)
3.7.8 The state machine for the four-node ‘Asymptotically Good’ co-reliance group in Figure 3.7.4(d)
3.7.9 The state machine of the four-node ‘Asymptotically Good’ co-reliance group in Figure 3.7.4(d) with inbound i at node 3
3.8.1 The state machine of the single cycle three-node co-reliance group with all ‘weak’ reliances
3.9.1 Prioritization of the egress instances currently used in the AS.
3.9.2 Prioritization of egress instances consistent with available measurement data.
3.9.3 Prioritization of egress instances for an online tool.
3.9.4 An example of the prioritization of egress instances
3.9.5 Equivalent example to Figure 3.9.4 with a shorter sliding window
3.11.1 Three-level route-reflector hierarchy
3.11.2 Oscillation in a three-level route-reflector hierarchy
3.11.3 Oscillation in three-level route-reflector hierarchy (bottom level full-mesh).
3.11.4 Oscillation between levels of route-reflector topology.
3.11.5 An example three-level topology with three child preference paths
3.11.6 An example from a Tier-2 AS of a route-reflector preferring a downstream egress learned from a non-downstream router
3.11.7 Four-level route-reflector hierarchy
3.11.8 Modified four-level route-reflector hierarchy

4.2.1 Stable egress instances violating Griffin and Wilfong’s condition
4.3.1 Reliances and co-reliance groups for examples in Figure 4.2.1
4.4.1 An example three-level route-reflector topology
4.5.1 Router comparison subroutine for a non-singleton co-reliance group
4.5.2 Function and variable definitions used in the compare routers and the network solver algorithm.
4.5.3 Network solver algorithm
4.5.4 Subroutine igp change for determining the reliance graphs requiring recalculation when an IGP distance changes.
4.7.1 An example topology where the MED attribute is respected
4.7.2 Reliance graph for a full-mesh topology
4.7.3 Full-mesh topology with the MED attribute respected.
4.7.4 An example confederation of sub-ASes and the corresponding reliance graph.

5.2.1 The impact of non-canonical peering policy
5.5.1 Plot of the proportion of peers implementing a non-canonical peering policy
5.5.2 The techniques used by a subset of peers to de-preference routes
5.5.3 Example of peers displaying different behavior modes
5.6.1 The impact of non-canonical peering policy
5.6.2 A CDF for the impact of two peers’ non-canonical policy on all routers
5.6.3 A CCDF showing the proportion of decisions affected by non-canonical policies of peers
5.6.4 The possible impact of a peers policy in the absence of routes from other ASes
5.6.5 The impact of the non-canonical peering policy of the two peers’ from Figure 5.6.2 when routes from other ASes are unavailable
5.6.6 Example of “when good routes go bad” phenomenon
5.6.7 Traffic Impact: Finding the ingress router of a flow
5.6.8 Traffic Impact: Finding the egress link under a canonical policy
5.6.9 The shift of traffic that would occur for various ingress PoPs if a peer were to use a canonical peering policy.
5.7.1 Policy changes for one peer during interval September 1, 2007 - January 14, 2008
5.8.1 Summary table of peers
5.8.2 The canonical peering policy of Kangaroo Corp.
5.8.3 Legend for a peer’s de-preferencing techniques.
5.8.4 The non-canonical peering policy of Emu Inc.
5.8.5 The non-canonical peering policy of Platypus Tech
5.8.6 The non-canonical peering policy of Dingo Net.
5.9.1 Decentralized mitigation scheme

6.4.1 Consistency-check example
6.4.2 Oldest prefix characteristic
6.5.1 Finding the interval of extended measurement artifacts
6.6.1 Detected failures in inter-table interval and the time we infer the missing withdrawal occurred.
6.8.1 A cartoon illustration of a monitoring BGP session to determine sliding window thresholds
6.8.2 A cartoon illustration of the multi-variate threshold obtained using LDA on the data points from Figure 6.8.1.
6.8.3 Cartoon illustration of independent thresholds obtained using LDA on the data points from Figure 6.8.1.
6.8.4 Example of sliding window parameter selection
6.8.5 Anomalous data-points
6.8.6 Cartoon illustration of monitoring BGP session to determine bin parameters
6.8.7 Cartoon illustration of LDA producing an undesirable class separation.
6.8.8 Cartoon illustration showing a desirable class separation.
6.8.9 Cartoon illustration using LDA to tune thresholds independently.
6.8.10 Example of bin parameter selection

List of Tables

2.0.1 Example Forwarding Table
2.4.1 Step-by-step route selections for Bad Gadget
2.4.2 Step-by-step route selections for Naughty Gadget
2.4.3 Best route selection at routers 0 and 1 from Figure 2.4.3
2.4.4 Step-by-step route selection for Figure 2.4.4

3.7.1 Properties of oscillation classes.
3.8.1 Table showing the result of ⊕ for weak and strong reliances.
3.11.1 The egress selected by routers 0, 1 and 2 in Figure 3.11.2
3.11.2 The egress selected by routers 0, 1 and 2 in Figure 3.11.3.
3.11.3 The egress selected by routers 0−5 in Figure 3.11.4

4.4.1 Downstream egress sets for routers in example topology of Figure 4.4.1.
4.4.2 Reliances for example topology of Figure 4.4.1

5.5.1 Summary of peer behavior modes
5.7.1 Number of snapshots before the policy change detection algorithm identifies a policy change

6.3.1 Data characteristics of main measurement artifacts
6.7.1 Default parameter settings.
6.9.1 Summary of consistency-check failures
6.9.2 Session failure characteristics

Abstract

The Internet is literally an “Inter-Network”, that is, a network of networks. Networks can be entities including Internet Service Providers (ISPs), universities and commercial enterprises. Every network or Autonomous System (AS) has individual requirements, restrictions and capabilities to transit data traffic. No central controlling body determines how ASes connect; instead, contractual agreements are established between AS pairs to govern their relationship. It is not feasible for all ASes to be physically connected to all others. Consequently, some ASes provide transit between other ASes. Such a service usually results in remuneration from one or both ASes.

Unlike centrally administered networks where all nodes in the network make generic, predictable decisions, each AS has the ability to select its best route based on its own proprietary commercial agreements. Such agreements are converted to a technical policy implemented in the Border Gateway Protocol (BGP). The ability to implement policies ensures the commercial viability of the Internet, but also makes the prediction of routes difficult. Even more worrisome, conflicting policies can cause undesirable BGP states where no single AS has sufficient knowledge to understand what is happening [43].

Designing new clean-slate routing protocols is one approach to improving the predictability and reliability of the Internet. However, due to the Internet’s distributed political and administrative control, significant collaboration is required to implement a new routing protocol, especially when no new protocol currently proposed has sufficiently superior flexibility, scalability or robustness. The difficulty in implementing new and improved protocols is evident in the deployment of IPv6 [23]. Although the IPv6 standard has been defined for over a decade and offers a larger address space, better security and embedded quality of service in comparison to traditional IPv4, its deployment is limited to 1,200 of over 30,000 ASes in the Internet [66]. Hence, it is crucial that practical solutions to current problems are evolved in addition to developing clean-slate techniques. Consequently, our approach is pragmatic: designing tangible solutions to practical problems that can be implemented immediately.

In this thesis we examine and combine eBGP, iBGP, OSPF, NetFlow and router configuration data to discover important aspects of routing. It is this investigation that instigated the development of a model of iBGP. iBGP is the version of BGP implemented within ASes to propagate routes between internal routers. It exists on a logical topology, but it interacts with the physical topology. It is this interaction which can cause persistent oscillation [49], a system state where routers alter their decision ad infinitum. Detecting configurations which can cause this oscillation is NP-hard [49]. However, the model of iBGP introduced in this thesis exploits the ‘designed’ structure of the iBGP topology to restrict the search space dramatically to one that is computationally feasible.

iBGP data, which is collected to analyze router decisions, is often only collected on a subset of routers due to its massive storage requirements. In addition, there is a substantial amount of correlation between router decisions. Our model of iBGP discovers the dependencies between router decisions and can consequently predict the decisions of those routers for which no measurements are available. It does not rely on any assumption of operator configuration, and consequently can be applied in any network scenario, not just the one originally configured. It is this feature, together with the model’s ability to use any available measurement data, that makes our technique ideal for network measurement and management applications. We found our model to be efficient and accurate on the network of a large Tier-2 AS, where all but seven of over 12.7 million decisions were consistent with observed data. Further, we were able to predict the decision of 85% of routers where observed data was unavailable. During our analysis, we also identified several minor configuration errors on operational routers when we predicted the “correct” outcome.

The internal state of a network can be influenced by neighboring ASes. Peering agreements are closely guarded due to their commercial sensitivity. They are implemented in BGP in the form of policies and are difficult to infer with publicly available data sources. We examined the peering policies of over 100 ASes from the perspective of a large Tier-2 AS, finding that 22% differ from the canonical peering policy outlined in many peering agreements. When a policy differs from the canonical peering policy, it may result in sub-optimal routing within the Tier-2 AS. We used our model of iBGP firstly to predict the decisions of all routers under the current peering policy, and then to determine the changes that would have occurred under a canonical peering policy. This analysis not only provided a metric for the routing impact of a peer’s non-canonical policy but, used in combination with traffic data, also allowed us to determine the influence of the peer on traffic flows. The techniques we describe allow an AS to fully quantify the impact of a non-canonical peering policy and adapt business arrangements appropriately.

Throughout our analysis of BGP data, we noticed several inconsistencies in the data. Although the results in the above work were insensitive to such inconsistencies, other applications requiring accurate, fine time-scale analysis of the routing state are much more sensitive. Consequently, we undertake a self-consistency check on the BGP data and examine the possible causes of such inconsistencies. We also present a mechanism to ‘clean’ the data to minimize the effects of any inconsistency.

Chapter 1

Introduction

The Internet is designed to allow any two end-users to communicate. However, there is rarely a direct connection between these end-users. Consequently, they must know of a path traversing multiple links to reach each other. As the Internet is a dynamic structure, the best path between any two end-users can change over time. Hence, a routing protocol is responsible for discovering the best path between end-users at any time. Any fault in a routing protocol can have severe impacts such as partitioning the Internet (so that groups of end-users are unable to communicate). In fact, such an event occurred on October 30, 2008. A dispute between Sprint and Cogent resulted in Sprint severing all connections with Cogent [126]. This act resulted in many customers of Sprint and Cogent being unable to communicate despite there being a physical connection via other ISPs! Consequently, Sprint’s actions impacted not only Cogent’s customers, but also its own. This gamesmanship ended three days later when Sprint re-established the connection. With more and more governments, businesses and individuals relying on the Internet as their communication infrastructure, such an outage could have serious consequences and stunt the growth of next-generation applications requiring a high level of reliability. However, if either or both ISPs had had the ability to predict the impact of their behavior prior to its implementation in the live network, then the consequences could have been minimized or even avoided.

The example scenario has been rare in the past and was a significant action with significant consequences. However, an action does not need to be significant to have significant consequences. For instance, on February 25, 2008, a Pakistani government ban on YouTube and a small change to routing policy in a single ISP resulted in YouTube being blocked to a significant fraction of the Internet outside of Pakistan for over two hours [12]. The ability of an ISP to predict the impact of its changes and the ability of other ISPs to minimize the impact of malicious or accidental changes is becoming a necessity.

The above examples demonstrate several pitfalls of the current de-facto standard inter-domain Internet routing protocol, the Border Gateway Protocol (BGP). The Internet is a network of networks comprising over 30,000 [54] distinct administrative bodies, termed Autonomous Systems (ASes), such as ISPs, universities and commercial enterprises. Many of these ASes are competing profit-making entities. Consequently, ASes generally do not wish to reveal information about their network that could be used by other ASes for commercial advantage. Thus, BGP only propagates changes to the selected path, not the associated reason for the change. Hence, the decision process within BGP is not transparent. This is the primary reason why BGP is not as easy as 1-2-3. Predicting the impact of a change is especially difficult when any AS on the path (and any router within an AS) can rank any route over any other. This level of flexibility allows an AS to policy-route traffic, where the best route is chosen based on business or commercial optimizations. The route chosen may differ from the one that would be chosen from a purely technical perspective (e.g., using shortest path routing). Policy-routing can even lead to undesirable routing states such as “BGP Wedgies” [43], where no single AS has sufficient knowledge to understand what is happening. As no AS has enough information to understand the cause, appropriate corrective action is hard to determine.
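To make the contrast between policy routing and shortest-path routing concrete, the following minimal Python sketch ranks the same candidate routes in two ways. It is an illustration under stated assumptions rather than a model used in this thesis: the `Route` class, the two selection functions and the AS numbers (taken from the range reserved for documentation) are hypothetical, and only two steps of a typical BGP decision process are modeled (highest local preference, then shortest AS path).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """A candidate route to one destination prefix (illustrative only)."""
    learned_from: str   # business relationship of the neighbor, e.g. "customer"
    as_path: tuple      # sequence of AS numbers towards the destination
    local_pref: int     # preference assigned by the local AS's policy


def shortest_path_choice(routes):
    """Purely technical choice: the route with the fewest AS hops wins."""
    return min(routes, key=lambda r: len(r.as_path))


def policy_choice(routes):
    """Policy choice: highest local preference first, AS-path length second."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))


# Hypothetical candidates: the provider route is shorter, but the local
# policy assigns the customer route a higher local preference.
candidates = [
    Route("provider", (64500, 64510), local_pref=80),
    Route("customer", (64501, 64502, 64510), local_pref=120),
]

print(shortest_path_choice(candidates).learned_from)
print(policy_choice(candidates).learned_from)
```

In this example the two functions return different routes: the shortest AS path is learned from the provider, while the policy prefers the longer customer route. An outside observer who sees only the selected route cannot tell which rule produced it, which is the lack of transparency described above.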

The Internet is becoming (if it is not already) a critical communication infrastructure. Consequently, ensuring reliable connectivity between end-users is vital. However, as demonstrated by the recent real-world examples and the lack of transparency in BGP, the reliability of the Internet is not guaranteed. As a result, the Internet is unable to completely replace existing communication infrastructure. Further, the creation of next-generation applications is stymied unless new ways of increasing the reliability of the Internet are developed. One approach to increasing the reliability of Internet routing is to re-design BGP. However, as demonstrated by the slow up-take of IPv6 [66], which has a number of advantages over the incumbent IPv4, it is difficult to implement new protocols in an administratively distributed system like the Internet. Hence, understanding the current operation of BGP, identifying its faults and proposing tangible solutions is not only critical for the foreseeable future of the Internet, but also essential to determine the features required for next-generation routing protocols. This is the approach we take in this thesis.

Network management is the key to creating a robust network. Although network administrators may aim to engineer a network for properties such as reliability and an even utilization of links, a network is rarely static and consequently the management of the network requires constant monitoring of its properties. Further, even if the network topological characteristics are known, BGP is not an easily predictable protocol. That is, for a given destination, there are often multiple feasible routes to which BGP could converge. It is even possible that BGP is unable to converge to a single route, causing undesirable routing oscillation. In fact, it has been shown that routing oscillation can occur between ASes as a result of conflicting policies [45, 116], within a single AS as a result of other AS policies [6, 48, 75] and even inside a single AS due to internal routing protocol interactions with BGP [49]. Consequently, although vital to the successful management of a robust network, it is difficult to predict the impact of either planned or unplanned changes or even to determine the behavior of the current network.

The ‘management’ of the Internet can be thought of at multiple scales. Two logical scales to consider given the structure of the Internet are Internet-wide and AS-wide. Ideally, the entire Internet would be manageable and robust to changes. However, the management of the Internet is de-centralized. No single body has control, and therefore management is inherently difficult. However, individual ASes are managed centrally. So certainly we should be able to improve each AS. This might allow them to offer premium services (such as Virtual Private Networks) with a higher level of reliability. Optimistically, if each AS is more reliable then the Internet, which consists of a set of ASes, might also become more reliable.

Although an AS by definition is a single administrative domain, the difficulty in configuring routers, the lack of determinism in BGP, the dynamic nature of the network topology, and the constant change of external routes make predicting the behavior of the routing control-plane difficult. Although on the surface it appears much simpler to manage a single AS than to manage the Internet as a whole, it is still a challenging (but not an impossible) task. A major benefit of considering reliability inside an AS rather than Internet-wide is the ability to precisely know the topology of the network and the configured policies on routers. Although this does not remove the possibility of multiple feasible convergent states for BGP, we can examine the propagation of routes and compare routes to determine their relative attractiveness. This is in contrast to the management of the general Internet, where the topology of another AS and its preference of routes is unknown, making any routing prediction very difficult.

Improving the reliability of an AS’s network centers around the ability to understand the behavior of that network. Determining and predicting this behavior, however, is not easy. Measurement infrastructure is needed throughout the network. Nevertheless, collecting all the required network data is difficult due to the storage and setup costs, and correlating the different sources of often non-trivial and non-synchronized data is also challenging. Our task in this thesis is to provide solid network management techniques to improve the reliability of networks, using currently available data with all its limitations.

For example, one cause of network performance degradation is routing oscillation. Routing oscillation is a control-plane state where routers persistently alter their chosen route in response to updated information from one another. Routing oscillation causes packet loss, constantly consumes the scarce resource of router CPU, and makes the behavior of the network impossible to predict, which in turn makes the manual debugging of network issues impossible. Hence, identifying whether such a scenario is occurring within an AS is a vital component of network management. Some of the reasons why this task is not as easy as simply looking for repeated patterns include:

• Measurement infrastructure generally provides only a partial view of the network.

• Oscillatory modes may be aperiodic, making detecting repeating states difficult.

• Separating routing dynamics caused by external effects from those caused by internal network structures is difficult.

• Locating the cause of oscillation is difficult when many routers simply alter their decision in response to others but are not the cause of the oscillation.

• Current network stability is not a guarantee of future network stability, even under a static network configuration. That is, it is feasible for a configuration causing routing oscillation to be dormant: the ordering of messages received results in the current state, which may be stable, while a slightly different ordering causes oscillation.

In Chapter 3 we first provide an abstract model of the propagation of routes within an AS. From this model, we determine when an AS’s network configuration can enter oscillatory modes. Our abstract model removes a significant amount of the detail involved with the routing protocols used within an AS; however, it does not lose any information required to detect or locate oscillation. This makes our approach easy to implement and very pragmatic. In fact, we tested our approach on a topology derived from a Tier-2 AS, proving its stability. We also provide an adaption to the BGP decision process that would prevent oscillation.

Oscillation is only one of a number of possible causes of network performance degradation. The sheer number of variables (such as link weights, iBGP sessions, prefix filters, dynamic availability of external routes and internal link failures) involved in configuring a network may result in other undiagnosed problems. The ability of network operators to use their skills and knowledge may be limited by the lack of complete information from measurement infrastructure. Further, the complex nature of, and interaction between, all network variables can make it difficult to determine the cause of a problem. For network operators, manually correlating all available data sources is time-consuming and error-prone. We believe their time is better spent interpreting the data at a high level in order to diagnose problems.

In Chapter 4, we extend our initial model of route propagation within an AS described in Chapter 3 to incorporate more general topologies. However, instead of using this general model to determine a network’s oscillatory characteristics, we predict the actual routes selected by all routers within an AS. This information can be used to determine the current path for data in the network. It can also be used to predict the path data would take under altered network conditions. This is especially beneficial for network management as it significantly improves on the “tweak-and-pray” approach¹.

¹ A change is undertaken on the network under the belief (but with no guarantee) that it will not break.

A further benefit of predicting the path data traverses is sanity checking a network configuration. Router configuration files in large networks are often complex text files which are difficult to write from scratch. Diagnosing errors is a skilled task. By predicting the routes chosen by routers and comparing them to the data available from measurement infrastructure, we may find inconsistencies. These inconsistencies can be investigated in detail and may help discover configuration errors. In fact, we discovered several minor configuration errors in a large Tier-2 AS which were unlikely to have been discovered without our analysis.

An equally important component of network management is the relationships with neighboring ASes. Creating commercial relationships that govern the routes an AS is able to learn at each inter-connection point can help restrict the possible routing outcomes and hence make a network more predictable. A common commercial relationship between ASes is the peering inter-connection. A peering relationship is generally an arrangement between ASes to transit data between each other’s customers. The arrangement is one of mutual benefit and generally no remuneration is exchanged by either party. Such a relationship is often undertaken by ASes of similar size (as larger ASes will generally want smaller ASes as customers) and consequently they are often also market-place competitors. Hence, although both ASes benefit from peering agreements, an unequal benefit may be of concern to the AS benefiting the least.

Terms of the peering relationship are set out in contractual peering agreements between ASes. A general requirement is that if multiple peering (inter-connection) locations are created, then equivalent routes per prefix, in terms of BGP attributes², should be announced at each peering location. This process allows the receiving AS to choose the egress point which satisfies its own local objectives. The equivalent process occurs in the reverse direction. This simple agreement is aimed at creating a fair arrangement. In Chapter 5 we investigate the peering policies of all peers of a Tier-2 AS, finding that 22% of peers did not follow this simple agreement. We also use our network management tool described in Chapter 4 to predict the impact on the Tier-2 AS of its peers’ policies. This analysis discovered several peers that were not adequately following their contractual obligations.

² The actual BGP attributes associated with each prefix can be different. However, the routes are equal through all AS-wide steps of the BGP decision process.

Network management tools rely heavily on network measurements. Without an inherent understanding of the limitations and inaccuracies in these measurements, the benefit of such tools significantly decreases. The primary source of measurement data we consider in this thesis is BGP data collected by BGP route-monitors. In Chapter 6 we examine the consistency of data recorded at BGP monitors, finding that 5% of BGP tables were inconsistent when compared to the stored sequence of BGP updates. We investigate the underlying measurement artifacts causing the inconsistencies and provide a mechanism to minimize their impact on further analyses.

In summary, the theme of this thesis is the pragmatic use of network measurements to enable operators to identify causes of problems and develop solutions to address them.

1.1 Preliminary Background

In principle, the Internet allows communication between any two users. Due to the massive scale of the Internet, users generally do not have a direct physical connection with each other. Hence, to allow communication between any two users, intermediate network devices receive data and propagate it to others closer to the destination. This process is known as forwarding.

Data traverses the Internet in packets. Packets are small amounts of payload data with associated source and destination addresses included in them. Routers, the physical devices entrusted with the forwarding of packets, are specifically designed to quickly switch packets from one link to another. They determine the appropriate out-bound link by matching the destination address associated with a packet with an entry in the forwarding table.

How does a router construct a forwarding table? This process is known as routing. Routing can be undertaken in two possible ways. The first is static routing: administrators manually configure the forwarding table based on their knowledge of the Internet. However, link failures and topology changes then require time-consuming re-configuration of the network, and the large scale of the Internet makes manual reconfiguration after every change infeasible. Consequently, the second way is to use a routing protocol to adapt automatically to network changes. The current de-facto standard routing protocol used between networks (or ASes) is the Border Gateway Protocol (BGP) (version 4).

BGP is used by routers to learn of available routes to destinations and select which route is ‘best’. However, unlike a network controlled by a single administrative body where a coherent routing strategy can be defined, each AS has its own individual strategy governed by economic relationships with other ASes. Consequently, the optimal next-hop is not simply the ‘closest’. For example, a link to a next-hop for which the AS pays for connectivity may be less ‘optimal’ than a link with which no cost is associated.

Although BGP is used so that ASes can route traffic between each other, an AS generally consists of more than a single router. A separate routing protocol known as an Interior Gateway Protocol (IGP) determines the route between destinations within a single AS. However, despite the apparent separation between internal and external routing, routers within an AS must inform each other of the availability of external routes. Internal BGP (iBGP) is used for this purpose. iBGP propagates BGP routes between routers within an AS; however, the decision process used by routers to determine which route is selected is often based on the IGP. It is this interaction between inter-domain and intra-domain routing that is the primary focus of this thesis.

1.2 Thesis Roadmap

This thesis is split into six additional chapters. In each chapter, we provide the background required for the specific issue being addressed. In Chapter 2 we provide an overall background to acquaint the reader with the details of routing and specifically BGP. This material is common background to all chapters. Next, in Chapter 3 we introduce our model of iBGP. Using this model we find locations in an iBGP configuration where oscillation can occur. We determine whether an instance of oscillation is persistent, transient or dormant, and provide a minor adaption to the BGP decision process that would prevent oscillation. In Chapter 4 we use measured network data in tandem with an extended iBGP model from Chapter 3 to predict the decisions made by routers under current and altered network conditions. Using this technique we determine the current decisions of routers with limited measurement infrastructure and predict the changes to the control-plane under altered network scenarios. The peering policies of all peers of a large Tier-2 AS are analyzed in Chapter 5. Where the policies differ from the generic policy outlined in many contractual peering agreements, we use the techniques outlined in Chapter 4 to predict the routing changes resulting from such policies. We also correlate this information with traffic data to gain a complete picture of the impact of peering policies on the Tier-2 AS. Throughout our careful analysis of BGP, we found cases of data inconsistency. In Chapter 6 we methodically investigate any inconsistencies between the two sources of available BGP data (tables and updates), determine the likely cause of any inconsistency and provide a mechanism to minimize their effects. We conclude the thesis in Chapter 7, including future research directions.

1.3 Statement of Research Contributions

The work outlined in this thesis is focused on developing techniques, tools and models to assist network operators, protocol designers and researchers in creating stable, scalable, cost-effective and reliable networks. To this end,

• we model the propagation of routes in a route-reflector iBGP topology as a directed graph for the purposes of oscillation detection.

– we prove BGP route oscillations can only occur between a subset of route-reflectors in an iBGP topology, significantly restricting the search for oscillation.

– we present an efficient algorithm to detect potential BGP route oscillations inside a network based on iBGP and IGP configurations.

– we pinpoint the exact routers responsible for any possible route oscillation.

– we characterize whether any oscillation is persistent, transient or dormant.

– we analyze a topology derived from a large Tier-2 AS for oscillatory properties.

– we demonstrate that it is more difficult than previously suspected to ensure stability by configuration in a greater-than-two-level route-reflector configuration.

• we recommend an adaption to the BGP decision process to prevent route oscillation.

• we extend our route propagation model to a greater-than-two-level route-reflector hierarchy to predict the actual decisions made by routers (the network solution).

– for the 15% of all routers with route-monitors, our technique predicts consistent decisions for 99.9999% of (prefix, router) pairs.

– for the 85% of routers without route-monitors, we predict their decisions so that they are consistent with the 15% of routers that have route-monitors.

• we provide a technique for an operator to predict the decisions of all routers under an altered configuration prior to use in the “live” network (what-if analysis).

• we identify configuration errors when the predicted decisions are inconsistent with the measurement infrastructure.

• for the over 100 peers of a large Tier-2 AS, we compare the routing announcements at all peering locations.

– we find 22% of peers inconsistently announced over 10% of prefixes at one or more peering locations.

– we use our what-if analysis to predict routers that would alter their decision if consistent routes were announced.

– we use traffic data to predict the quantity of traffic which would shift egress location under consistent route announcement.

• we methodically investigate the BGP data used in many studies to validate its consistency.

– we find inconsistencies between BGP updates and BGP tables for 5% of instances.

– we detail the checks undertaken on the data and provide intuitive reasoning behind any inconsistencies.

– we identify the data affected by any inconsistency.

– we provide methodology to minimize the impact of any inconsistency on further analyses.

1.3.1 Publications Arising From This Thesis

Components of this thesis have previously been published:

Stable and Flexible iBGP, Ashley Flavel and Matthew Roughan. In Proceedings of ACM SIGCOMM, Barcelona, Spain, August 18-20, 2009.

Humpty Dumpty: Putting iBGP Back Together Again, Ashley Flavel, Jeremy McMahon, Aman Shaikh, Matthew Roughan and Nigel Bean. In Proceedings of IFIP Networking, Aachen, Germany, May 11-15, 2009.

Where’s Waldo? Metarouting and Practical Searches for Stability in iBGP, Ashley Flavel, Matthew Roughan, Nigel Bean and Aman Shaikh. In Proceedings of International Conference on Network Protocols, Orlando, Florida, 2008.

CleanBGP: Verifying the Consistency of BGP Data, Ashley Flavel, Olaf Maennel, Belinda Chiera, Matthew Roughan and Nigel Bean. In Proceedings of Internet Network Management Workshop, Orlando, Florida, 2008.

Modeling BGP Table Fluctuations, Ashley Flavel, Matthew Roughan, Nigel Bean and Olaf Maennel. In Proceedings of 20th International Teletraffic Conference, Ottawa, Canada, June 17-21, 2007.

Chapter 2

Background

The true value of the Internet is its connectivity. Anybody at any location can exchange traffic with anyone at any other location, provided both are connected to the Internet. Consequently, organizations forming the Internet, Autonomous Systems (ASes), must ensure they have paths to all others. These paths may traverse multiple ASes. Hence, the Internet is literally a network of networks.

ASes with wide geographic coverage and substantial backbone infrastructure are considered large (for instance AT&T, Verio, Sprint). A natural hierarchy exists where smaller ASes are customers of larger ASes. There may be several layers of on-selling of connectivity until an end-user is reached.

Each node (which may represent a computer, router or other network device) in the Internet has an Internet Protocol (IP) address¹. The IP address is used to identify each node. Currently there are two types of IP address: 32-bit IP version 4 (IPv4) and 128-bit IP version 6 (IPv6). For this thesis we concentrate on the most common deployment: IPv4. The 32-bit address is written as 4 octets, separated by dots. For instance, an IP address owned by the University of Adelaide is 129.127.5.1. IP addresses are delegated by the Internet Assigned Numbers Authority (IANA) to the regional Internet registries (Africa: AfriNIC, Asia-Pacific: APNIC, North America: ARIN, Latin America: LACNIC and Europe: RIPE NCC). In turn, the addresses are subdivided and assigned to ASes and may be further subdivided and assigned to customer ASes.

¹ It is possible for multiple nodes to be referenced by the same IP address under a Network Address Translation (NAT) system [26]. Also, some nodes have multiple IP addresses.

Prefix            Link
129.127.5.0/24    A
129.127.6.0/24    B
129.127.0.0/16    C
0.0.0.0/0         D

Table 2.0.1: Example Forwarding Table. A router matches a packet’s destination IP address to the covering prefix with the longest mask length and forwards the packet along the associated link.

Addresses are assigned by the Internet registries in a structured hierarchical manner, so outside of an AS we do not need to know about the individual IP addresses. IP addresses are grouped into Classless Inter-Domain Routing (CIDR) prefixes [35]. CIDR notation is similar to IP notation, with the addition of a mask length. For instance, 129.127.5.0/24 refers to all IP addresses whose first 24 bits (or 3 octets) are identical to those of the IP address 129.127.5.0 (e.g., 129.127.5.1). Routers store these prefixes in their forwarding tables, along with the appropriate link on which to forward data.

Data traverses the Internet in packets. Packets are small amounts of payload data with associated source and destination addresses included in them. When a data packet enters a router, the destination IP address is mapped to the most specific covering prefix in the table and the packet is forwarded along the appropriate link. For instance, consider the example forwarding table in Table 2.0.1. If a packet with the destination IP address 129.127.5.1 enters the router, it will match the prefixes 129.127.5.0/24 and 129.127.0.0/16 (as well as the default route 0.0.0.0/0). The first of these is the most specific prefix. Hence, the packet will be forwarded on link A. If a packet with the destination address 129.127.3.123 enters the router, it will be forwarded on link C. Some routers include a default route 0.0.0.0/0 which matches all addresses. As this is the least specific prefix possible, it will only be used when no other covering prefix is available. This option is often used by small ASes where there are few choices for out-bound links or greater hardware constraints.

In the previous example, the forwarding table is small. However, in the Internet there are currently (as of December 2008) over 260,000 prefixes in default-free forwarding tables². Hence, manually constructing a forwarding table by selecting optimal links for each prefix for each router is not a scalable solution, especially when the Internet is a dynamic structure and the best link may change frequently. Consequently, an automated routing protocol is responsible for dynamically constructing the forwarding table.
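The longest-prefix match against Table 2.0.1 can be illustrated with a minimal Python sketch. The table contents come from the text; the function name `lookup` and the linear scan are illustrative assumptions (a production router would use a trie or dedicated hardware rather than iterating over every prefix).

```python
import ipaddress

# Forwarding table from Table 2.0.1: prefix -> outbound link.
FORWARDING_TABLE = {
    "129.127.5.0/24": "A",
    "129.127.6.0/24": "B",
    "129.127.0.0/16": "C",
    "0.0.0.0/0": "D",
}


def lookup(destination, table=FORWARDING_TABLE):
    """Return the link of the most specific prefix covering `destination`.

    Hypothetical helper: scans every prefix and keeps the covering one
    with the longest mask length. Returns None if nothing covers it.
    """
    address = ipaddress.ip_address(destination)
    best_link, best_length = None, -1
    for prefix, link in table.items():
        network = ipaddress.ip_network(prefix)
        if address in network and network.prefixlen > best_length:
            best_link, best_length = link, network.prefixlen
    return best_link
```

With this table, `lookup("129.127.5.1")` selects link A and `lookup("129.127.3.123")` selects link C, matching the two cases worked through in the text; an address outside 129.127.0.0/16 falls through to the default route on link D.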

2.1 General Routing Protocols

A routing protocol distributes routing information between nodes in the Internet. Although nodes can be any device connected to the Internet, a routing protocol is unnecessary on devices with only one connection to the remainder of the Internet. Thus, generally the devices running a routing protocol are those with multiple inbound/outbound links. Throughout this thesis, without loss of generality, we will refer to all these multiple link devices as routers. Two main types of routing protocols are distance-vector and link-state.

2.1.1 Link-State Protocols

A link-state protocol requires a router to inform all routers in the network of the state of its directly connected links. The information regarding its links is propagated to all direct neighbors, who then propagate this information to their neighbors, and so on until all routers in the network receive the information. This process is also known as flooding. The shortest path at each router is generally calculated by some variant of Dijkstra’s algorithm [24].

² Default-free forwarding tables are those that do not need a route to 0.0.0.0/0 because their table is otherwise complete.

Link-state protocols provide each router with a complete picture of the network topology. Consequently, they can take network characteristics such as bandwidth, delay, reliability and load into consideration. Examples of link-state protocols include Open Shortest Path First (OSPF) [76] and Intermediate System-Intermediate System (IS-IS) [22].
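Once flooding has given a router the complete topology, the shortest-path computation can be sketched as below. This is a textbook form of Dijkstra's algorithm, not the implementation used by OSPF or IS-IS; the dictionary representation of the topology and the router names are assumptions made for illustration.

```python
import heapq


def shortest_paths(topology, source):
    """Dijkstra's algorithm over a flooded link-state database.

    `topology` maps each router to {neighbor: link_cost} (assumed format).
    Returns {router: cost of the shortest path from `source`}.
    """
    distance = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, router = heapq.heappop(queue)
        if cost > distance.get(router, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for neighbor, link_cost in topology[router].items():
            new_cost = cost + link_cost
            if new_cost < distance.get(neighbor, float("inf")):
                distance[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return distance


# Hypothetical four-router topology with symmetric link costs.
TOPOLOGY = {
    "r1": {"r2": 1, "r3": 4},
    "r2": {"r1": 1, "r3": 2},
    "r3": {"r1": 4, "r2": 2, "r4": 1},
    "r4": {"r3": 1},
}
```

Every router runs the same computation on the same database, which is why link-state protocols can base the link costs on characteristics such as bandwidth or delay.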

2.1.2 Distance-Vector Protocols

A distance-vector protocol requires a router to inform its neighbors of its own routing table either periodically or when changes are detected. Each neighboring router then selects the route (from the set of routes learned from all its neighbors) to each destination with the lowest distance. The distance may be a simple hop-count or a configured link-cost metric. The Bellman-Ford [101] (or similar) algorithm is executed and the outbound link corresponding to the shortest path to the destination is chosen at each router.
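The per-neighbor update at the heart of a distance-vector protocol can be sketched as a single Bellman-Ford relaxation step. The function name and table layout are assumptions for illustration, and the sketch deliberately omits what real protocols such as RIP also need (handling of withdrawn or worsening routes, timers, and counting-to-infinity protections).

```python
def distance_vector_update(own_table, neighbor, link_cost, neighbor_table):
    """Merge one neighbor's advertised distances into `own_table`.

    own_table:      {destination: (distance, next_hop)}  (assumed layout)
    neighbor_table: {destination: distance as advertised by `neighbor`}
    Returns True if any entry changed, i.e. our own table should be
    re-advertised to our neighbors.
    """
    changed = False
    for destination, advertised in neighbor_table.items():
        candidate = link_cost + advertised
        current = own_table.get(destination)
        if current is None or candidate < current[0]:
            own_table[destination] = (candidate, neighbor)
            changed = True
    return changed
```

Repeating this step as advertisements arrive converges on the lowest-distance route to each destination without any router ever seeing the whole topology.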

In contrast to link-state protocols, each router does not know the entire network topology in a distance-vector protocol. Examples of distance-vector protocols include Routing Information Protocol (RIP) [71] and the Cisco proprietary Interior Gateway Routing Protocol (IGRP) [51].

A path-vector protocol is a particular class of distance-vector protocol. Instead of simply associating a distance with a route, each node along the path appends its identifier to the current path. Hence some topological information is recorded, and a router can avoid loops simply by looking for its own identifier in the current path.
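The loop-avoidance rule of a path-vector protocol is small enough to sketch directly. The function names and the list representation of a path are hypothetical; the sketch follows the description above, in which each node appends its identifier before passing the path on.

```python
def receive_path(own_id, advertised_path):
    """Return the path to store, or None if accepting it would form a loop."""
    if own_id in advertised_path:
        return None  # our identifier is already on the path: reject it
    return list(advertised_path)


def advertise_path(own_id, stored_path):
    """Append our identifier before passing the path on to neighbors."""
    return list(stored_path) + [own_id]
```

For example, a path that has travelled A, then B, then C is rejected if it is offered back to A, because A finds its own identifier already recorded in it.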

2.2 Routing in the Internet

Routing in the Internet is hierarchical: routing within ASes and routing between ASes. Several reasons for this include:

1. The Internet is large and still growing. The hierarchy is used to improve its scalability.

2. The topology of individual ASes is often proprietary and network operators are hesitant to inform the greater Internet of this information.

3. Full autonomy of an AS should be preserved. The technology and routing choices inside an AS should be the choice of the individual AS’s management.

Figure 2.2.1: Routing protocol domains (AS A, AS B and AS C). An IGP is used within an AS, while an EGP is used between ASes.

Routing within an AS is undertaken by an Interior Gateway Protocol (IGP). A distance-vector or link-state protocol as described in Section 2.1 is a common choice for an IGP. Each AS can choose its own IGP without informing any other AS of this choice. For example, in Figure 2.2.1, AS A may use OSPF, AS B may use RIP and AS C may use IS-IS or any other such combination.

The purpose of an Exterior Gateway Protocol (EGP) is to exchange information such that traffic can cross an AS boundary. As this boundary straddles two administrative domains, they must agree on one EGP. Although the actual EGP used between pairs of ASes can be different, the de-facto standard EGP is the Border Gateway Protocol (BGP) version 4 [91].

Inter-AS routing requires the ability to implement policies. Routing policies allow any route to be preferred over any other. Hence, unlike protocols chosen for an IGP which commonly select the optimal route based on shortest paths, BGP often bases its selection on commercial objectives. For example, intelligence organizations may not wish their data traffic to pass through certain countries, or financial constraints may alter the route traffic takes. In Figure 2.2.1, the link between AS A and AS B may be a premium link. Hence, the EGP may select the route from AS A to AS B via AS C, despite the path being longer.

2.3 Border Gateway Protocol

The goal of BGP is to propagate routes across the Internet while respecting the policies of individual ASes. Before the exchange of routes, two routers configured to speak BGP with each other must establish a connection. This connection is known as a BGP session and the end-points are BGP neighbors. To ensure no information is lost, the session must be reliable. BGP uses the Transmission Control Protocol (TCP) for this purpose. However, BGP cannot rely on TCP to ensure its BGP neighbor is alive. For this purpose BGP uses keep-alive messages, periodically transmitted during periods of low routing activity. If no routing update or keep-alive has been received within a pre-determined hold-time, the session is assumed to have failed, and all routes learned from the BGP neighbor are removed from selection consideration.

A conceptual view of how a router running BGP operates is shown in Figure 2.3.1. BGP routing information learned from neighboring routers is used to build a RIB-in³. A RIB-in can be thought of as the forwarding table of the neighboring router after the neighboring router has applied its export policy. Each RIB-in is altered based on an import policy. The import policy can either filter routes completely or alter their attributes (we will consider the attributes in detail later). The resulting post-policy RIB (RIB-pp) then constitutes the set of available routes. These routes, together with IGP information, are input into the BGP decision process. The BGP decision process selects a single best route from the set of available routes for each prefix, based on the attributes associated with each route. The best routes form the Loc-RIB, which corresponds to the forwarding table of the router⁴. An export policy, which can either filter or modify the attributes of routes (on a per BGP neighbor basis), is applied to the Loc-RIB (creating a RIB-out) before routes are propagated to BGP neighbors. Policies are implemented by modifying route attributes to alter the relative attractiveness of routes according to the BGP decision process.

Figure 2.3.1: Internal router structure. Incoming routes are stored in RIBs. The BGP decision process uses the available routes to determine the locally selected route. The local decision is propagated to connected routers.

We now examine the BGP decision process and the attributes that can be modified. In Section 2.4 we consider how BGP is currently used in the Internet and several problems encountered due to the use of policies and the distributed nature of BGP.

³ Routing Information Base (RIB).

⁴ For the purposes of this description, we can consider the Loc-RIB as the forwarding table. However, in reality, routes from other sources (e.g. static routes) are combined into the forwarding table.

2.3.1 BGP Decision Process

Each BGP-speaking router undertakes the following step-by-step decision process for each prefix until a single best route is selected. However, some decision steps may be configured to be ignored due to specific policies implemented by a network administrator. We now detail the steps involved in selecting a single route using the BGP decision process and the reasoning behind them. A summary of the BGP decision process is shown in Figure 2.3.2. The process always continues until exactly one route remains in the set of candidate routes.

1. Remove from consideration all routes with the next-hop unreachable.

2. Remove from consideration all routes without the highest Local Preference.

3. Remove from consideration the routes that don’t have the minimal number of ASes in their AS Path attribute.

4. Remove from consideration all routes that are not tied for having the lowest Origin number in their Origin attribute. IGP is lower than EGP and EGP is lower than INCOMPLETE.

5. Remove from consideration routes with less-preferred MED values. The router configuration determines whether the MED value is compared across all ASes or on a per-AS basis.

6. If at least one of the candidate routes was received via eBGP, remove from consideration all routes that were received via iBGP.

7. Remove from consideration all candidate routes with non-minimal IGP costs.

8. A single route is selected by an arbitrary tie-breaking mechanism. Two options for tie-breaking used by current-generation routers are the lowest-router-id of the BGP neighbor and the oldest route.

Figure 2.3.2: BGP decision process. Candidate routes are excluded from selection at each step of the BGP decision process. The process continues until exactly one route is remaining in the set of candidate routes.

The next-hop IP address attached to a route indicates the link on which to send data to the prefix. If the next-hop IP address is not reachable via a direct link or other routing protocol (such as an IGP), then the route is discarded by the router (as the path to the destination is unknown).

The Local Preference (or local-pref) of a route is an attribute set by an AS. The local-pref attribute is generally set by the import policy of the router that learned the route from an external source (such as another AS). The local-pref attribute allows an AS to override all other attributes learned from a neighboring AS. This ensures an AS can fully control the routes chosen in its network by ignoring any attributes sent by neighboring ASes.

When a router propagates a route across an AS boundary, the router prepends its own AS number⁵ to the AS Path attribute. The third step of the BGP decision process is to prefer routes with the fewest ASes in the AS Path. This is only an approximation to shortest-path routing, as the actual number of links the route traverses depends on the actual ASes on the path. This attribute is also BGP’s loop detection mechanism. If an AS receives a route with its own AS number in the AS Path, the route is discarded.

⁵ AS numbers are unique identifiers assigned to the AS by regional registries in a similar way to IP addresses.

The Origin attribute indicates how the route came to enter BGP. The IGP parameter indicates the route was originated via a network configuration command. The EGP parameter indicates the route was learned via an EGP other than BGP. The INCOMPLETE parameter indicates the route was originated via some other mechanism such as a static route. The BGP decision process prefers routes with the Origin attribute set to IGP over EGP over INCOMPLETE.

ASes often connect at multiple locations. The Multi-Exit-Discriminator (MED) attribute is a metric allowing the announcing AS to indicate a degree of preference among its multiple links. As this parameter is set by an announcing AS, its actual value has limited meaning across different announcing ASes. Consequently, it is often only compared on a per-AS basis.

eBGP is the version of BGP just described, running between routers in two ASes. iBGP is slightly different and runs between routers within a single AS. We will detail these differences in Section 2.3.3. Steps 6 and 7 are sometimes referred to as the ‘hot-potato’ or ‘closest-egress’ stage of the BGP decision process. Given that all other attributes are equal, the route by which traffic exits the AS’s own network at the lowest cost is chosen. If a single route is still not chosen, an arbitrary tie-break option is used to select a single route. Two options for tie-breaking used by current-generation routers are the lowest-router-id of the BGP neighbor and the oldest route.
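The eight steps of Figure 2.3.2 can be sketched as a chain of filters over the candidate set. This is a simplified illustration, not a vendor implementation: the `Route` fields, the function names, the choice of lowest-router-id for step 8, and the group-wise treatment of per-AS MED comparison are all assumptions made for the sketch. It does rely on two facts about BGP beyond the figure: a lower MED is preferred, and the neighboring AS is the first entry of the AS Path.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}


@dataclass(frozen=True)
class Route:
    # Hypothetical record of the attributes used by the decision process.
    next_hop_reachable: bool
    local_pref: int
    as_path: tuple           # AS numbers, neighboring AS first
    origin: str              # "IGP", "EGP" or "INCOMPLETE"
    med: int
    via_ebgp: bool
    igp_cost: int            # IGP cost to reach the next-hop
    neighbor_router_id: int


def keep_best(routes, key):
    """Keep only the routes that minimise `key`."""
    best = min(key(r) for r in routes)
    return [r for r in routes if key(r) == best]


def filter_med(routes, compare_across_ases):
    """Step 5: drop routes with a less-preferred (higher) MED."""
    def group(route):
        if compare_across_ases or not route.as_path:
            return None
        return route.as_path[0]  # compare only within one neighboring AS

    lowest = {}
    for r in routes:
        g = group(r)
        lowest[g] = min(r.med, lowest.get(g, r.med))
    return [r for r in routes if r.med == lowest[group(r)]]


def select_best(routes, compare_med_across_ases=False):
    """Return the single best route, or None if no route is usable."""
    candidates = [r for r in routes if r.next_hop_reachable]              # step 1
    if not candidates:
        return None
    candidates = keep_best(candidates, lambda r: -r.local_pref)           # step 2
    candidates = keep_best(candidates, lambda r: len(r.as_path))          # step 3
    candidates = keep_best(candidates, lambda r: ORIGIN_RANK[r.origin])   # step 4
    candidates = filter_med(candidates, compare_med_across_ases)          # step 5
    if any(r.via_ebgp for r in candidates):                               # step 6
        candidates = [r for r in candidates if r.via_ebgp]
    candidates = keep_best(candidates, lambda r: r.igp_cost)              # step 7
    return min(candidates, key=lambda r: r.neighbor_router_id)            # step 8
```

Because each step only removes candidates, a route with a high Local Preference survives step 2 regardless of its AS Path length or IGP cost, which is how the local-pref attribute lets an AS override the attributes set by its neighbors.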

Other BGP Attributes

Several other BGP attributes are associated with BGP routes. They are specified in RFC 4271 [91] and comprehensively explained by Stewart [102]. One attribute that is important to outline is the community attribute. A community can be thought of as a tag associated with a route. The tags can be used within an AS or across AS boundaries. Specific rules on routers can match community values and alter the other BGP parameters (such as local-pref). For example, a neighboring AS can set a community to indicate the route should be treated as a backup route. The receiving AS can then match the predefined community value in its import policy and set the local-pref attribute to a low value.
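The backup-route example can be sketched as an import policy that matches a community and sets local-pref. The community value, the two local-pref values and the dictionary representation of a route are all hypothetical, chosen only to illustrate the matching rule.

```python
# All three values below are illustrative assumptions, not standard values.
BACKUP_COMMUNITY = "65000:70"   # tag agreed with the neighboring AS
BACKUP_LOCAL_PREF = 50          # low value applied to backup routes
DEFAULT_LOCAL_PREF = 100        # value applied to every other route


def import_policy(route):
    """Return a copy of `route` (a dict) with local-pref set by community match."""
    updated = dict(route)
    if BACKUP_COMMUNITY in route.get("communities", ()):
        updated["local_pref"] = BACKUP_LOCAL_PREF
    else:
        updated["local_pref"] = DEFAULT_LOCAL_PREF
    return updated
```

Since local-pref is compared early in the decision process, a route tagged this way is only selected when no untagged route to the same prefix is available.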

2.3.2 BGP Operation

Several mechanisms have been introduced to reduce the amount of BGP information propagated through the entire Internet. Although these mechanisms are not directly related to the BGP decision process, they can impact its operation. We now consider two of the major additions to BGP.

The first mechanism we consider is the use of a Minimum Route-Advertisement Interval (MRAI) [91] to rate-limit outgoing BGP routes. By waiting for a brief period before propagating a route to BGP neighbors, transient BGP dynamics can be suppressed, avoiding triggering further transient dynamics at other routers. However, due to the MRAI’s unsophisticated nature and the impact it can have on the overall convergence of routes, its deprecation has been recommended [68].

The second mechanism we consider is route-flap-damping [83, 117]. Route-flap-damping is essentially a mechanism to penalize unstable routes. When the penalty for a prefix reaches a threshold, it is removed from consideration for selection. Mao et al. [73] examined the effectiveness of route-flap-damping. They found that, due to the exploration of alternative paths [62], a single link failure can cause the default route-flap-damping thresholds on commercial routers to be reached, suppressing the routes for more than one hour. Because router CPUs have increased in computational power and path exploration risks damping stable prefixes, it is now recommended to disable route-flap-damping as the “cure has become worse than the disease” [98]. The extent to which this recommendation has been implemented is unknown.
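The interaction between path exploration and damping can be illustrated with a simple exponentially decaying penalty. This is a sketch under stated assumptions: the half-life, per-update penalty and thresholds below are illustrative values, every update is charged the same penalty, and no maximum suppression time is modeled. Real implementations are configurable and differ in these details.

```python
import math

# Assumed, illustrative parameters (not taken from the thesis).
HALF_LIFE_S = 15 * 60
PENALTY_PER_UPDATE = 1000
SUPPRESS_THRESHOLD = 2000
REUSE_THRESHOLD = 750


def decayed(penalty, elapsed_s):
    """Exponential decay: the penalty halves every HALF_LIFE_S seconds."""
    return penalty * 0.5 ** (elapsed_s / HALF_LIFE_S)


def penalty_after(update_times_s):
    """Penalty immediately after the last update in a sorted list of times."""
    penalty, last = 0.0, None
    for t in update_times_s:
        if last is not None:
            penalty = decayed(penalty, t - last)
        penalty += PENALTY_PER_UPDATE
        last = t
    return penalty


def seconds_until_reuse(penalty):
    """Time for the penalty to decay to the reuse threshold."""
    if penalty <= REUSE_THRESHOLD:
        return 0.0
    return HALF_LIFE_S * math.log2(penalty / REUSE_THRESHOLD)


# One failure followed by path exploration: four updates, 30 seconds apart.
burst = penalty_after([0, 30, 60, 90])
```

Under these assumptions a single update stays below the suppression threshold, but the burst of four updates caused by one underlying failure exceeds it and keeps the route suppressed for over half an hour. Longer exploration sequences extend the suppression further.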

2.3.3 Internal vs External BGP

External-BGP (eBGP) refers to the operation of BGP between routers in diﬀerent ASes. Internal-BGP (iBGP) refers to the operation of BGP between routers within a single Autonomous System (AS). Note that iBGP is not the same as the Interior Gateway Protocol (IGP) in use by an AS. In fact these two protocols interact in step 7 of the BGP decision process (see Figure 2.3.2).

iBGP and eBGP are the same protocol in the sense that they use the same message types and route attributes. However, they differ in how routes are propagated. Routes learned via eBGP have no a priori restriction on which neighbors they can be propagated to. In contrast, routes learned via iBGP are restricted to being announced to eBGP neighbors. This restriction is used to prevent looping route announcements. Recall the loop-detection mechanism used by BGP is the AS Path attribute (see Section 2.3.1). Within a single AS, this attribute can no longer be used so looping is prevented by simply stopping messages at 1 hop. The consequence of this restriction is that all routers within an AS must form a clique (or fully connected network) of iBGP sessions to ensure all routing information is available⁶. Note that routers do not need to form a clique of physical links, purely a clique of iBGP sessions, possibly (and most likely) over multiple physical links.

The clique of iBGP sessions inside an AS can suffer from scalability issues (CPU, memory and bandwidth usage) as the number of sessions is $N(N-1)/2$ for $N$ routers. Consequently, several techniques are used to allow routers to propagate routes learned from iBGP neighbors for more than one hop and hence reduce the number of iBGP sessions required within an AS.
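As a quick worked example of the quadratic growth, the sketch below counts sessions in a full mesh and, for comparison, in a two-level design of the kind introduced next. The second function assumes each non-reflector router has exactly one session, to a single route-reflector, which is a simplification made only for this comparison.

```python
def full_mesh_sessions(n):
    """iBGP sessions in a clique of n routers: n(n-1)/2."""
    return n * (n - 1) // 2


def two_level_sessions(n, reflectors):
    """Sessions when reflectors form a clique and every other router
    has one session to one reflector (assumed simplification)."""
    clients = n - reflectors
    return full_mesh_sessions(reflectors) + clients
```

For 100 routers the clique needs 4950 sessions, while the assumed two-level design with 3 reflectors needs 100.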

⁶ A clique was the original specification for iBGP. We will consider route-reflection and confederations later in this section. These do not require a clique.

[Figure: iBGP topology with routers a to g. Legend: route-reflector, client, eBGP learned route.]

Figure 2.3.3: Example route-reflector topology. Route-reflectors (black nodes) ‘reflect’ client-learned (white nodes) routes.

Route Reﬂection

Route reflection [7] replaces the clique of iBGP sessions with a hierarchical⁷ topology. Consequently, the restriction on propagating iBGP learned routes to other iBGP neighbors is selectively removed. For this purpose, two router types are defined: route-reflectors and route-reflector clients. A route-reflector client is a router which relies on a route-reflector to propagate its best routes to the remainder of the AS and inform the client of the available routes from other routers. Hence, route-reflectors reflect routes based on where they are learned using the following rules:

Source          Reflect to
a client        all iBGP neighbors
a non-client    all clients

Route-reflection rules prevent the need for a clique of iBGP sessions. However, a clique is still required between all route-reflectors. Consider the example in Figure 2.3.3. Black routers a, b and c are route-reflectors and form a clique. White routers e and d are client routers of router a and routers g and f are clients of c. A route is learned at router d from an external source (such as another AS). A client router propagates any externally learned route to all its iBGP neighbors. In this example, router d propagates this route to its single iBGP neighbor a. Route-reflector a reflects this client-learned route to all iBGP neighbors. Hence b, c and e learn of the route. Now, as b learned the route from another route-reflector (a), it does not propagate this route to c (and vice-versa). However, route-reflector c reflects the route to its clients g and f.

The route-reflection RFC [7] does not explicitly state whether a route-reflector runs the BGP decision process or simply reflects all learned routes. In practice, route-reflectors only reflect the route they select as their best via the BGP decision process. Consequently, not all AS-wide available routes may be available for selection at any particular router in the network. Consider the example in Figure 2.3.4. This example is equivalent to Figure 2.3.3 except an additional route to the destination is available via an external source at router e⁸. We see that as router a only propagates one route, routers b, c, g and f only have one available route to select from. Hence their choice is predetermined by a.

It is important to note that the iBGP route-reflector topology is not the same as the IGP topology. Hence, although a selects the route learned from e over the route learned from d, there is no guarantee f would make the same selection given the same choice of routes. The route-reflector topology can subsequently cause some routers to select routes that are not optimal from an AS-wide perspective (due to the route-reflector hierarchy limiting an individual router’s diversity of learned routes [114]). This sub-optimality can lead to oscillation, examined in detail in Chapter 3.

⁷ A strict hierarchy is not necessary. For this thesis we primarily show examples of a hierarchical topology although our analyses are applicable to more general route-reflector topologies.
The iBGP topology is a logical topology whose main purpose is to inform all routers in the network of available egress links for all destinations. The IGP is responsible for actually routing the traffic within the AS to the egress location. Consequently, the next-hop IP address is not updated along each iBGP hop in the route-reflector topology (it remains the IP address of the router that learned the route from an external source).
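The reflection rules and the walk-through of Figure 2.3.3 can be sketched as a small propagation routine. The topology is the one described in the text (reflectors a, b and c in a clique, d and e clients of a, f and g clients of c). The function and variable names are illustrative, and the sketch tracks a single route only, so it deliberately ignores the decision process and the route hiding of Figure 2.3.4.

```python
from collections import deque

REFLECTORS = {"a", "b", "c"}
CLIENTS_OF = {"a": {"d", "e"}, "b": set(), "c": {"f", "g"}}


def ibgp_neighbors(router):
    """Reflectors peer with each other and their clients; clients with their reflector."""
    if router in REFLECTORS:
        return (REFLECTORS - {router}) | CLIENTS_OF[router]
    return {rr for rr, clients in CLIENTS_OF.items() if router in clients}


def propagate(egress):
    """Return {router: sender} for every router that learns the route injected at egress."""
    learned_from = {egress: "eBGP"}
    queue = deque([egress])
    while queue:
        router = queue.popleft()
        sender = learned_from[router]
        if sender == "eBGP":
            # Externally learned routes go to all iBGP neighbors.
            targets = ibgp_neighbors(router)
        elif router in REFLECTORS and sender in CLIENTS_OF[router]:
            # Learned from a client: reflect to all other iBGP neighbors.
            targets = ibgp_neighbors(router) - {sender}
        elif router in REFLECTORS:
            # Learned from a non-client: reflect to clients only.
            targets = CLIENTS_OF[router]
        else:
            # Plain clients do not re-advertise iBGP-learned routes.
            targets = set()
        for target in sorted(targets):
            if target not in learned_from:
                learned_from[target] = router
                queue.append(target)
    return learned_from
```

Calling `propagate("d")` reproduces the walk-through: a learns the route from d, routers b, c and e learn it from a, and f and g learn it from c, while b passes it to nobody because it has no clients.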

⁸ We are assuming this route is as attractive as the existing route up to the IGP distance step of the BGP decision process.

[Figure: iBGP topology with routers a to g. Legend: route-reflector, client, eBGP learned route.]

Figure 2.3.4: Route-reﬂection obscures route availability. Route-reﬂector a learns of two routes, however it only propagates one. Hence routers b, c, g and f learn only one route to the destination.

A route-reflector hierarchy may be constructed to include more than two levels. The rules governing route propagation are identical; however, a single router may simultaneously be a route-reflector and a client. We consider multi-level route-reflector topologies in detail in Chapter 3 and Chapter 4.

Relaxing the restriction on announcing a route to an iBGP neighbor learned from another iBGP neighbor introduces the possibility for routing announcements within an AS to loop. To prevent this, two new attributes were added to routes: originator-id and cluster-list. We refer the reader to Stewart [102] for full details.

AS Confederations

Confederations are an alternative to route-reflection to increase the scalability of iBGP. In contrast to route-reflection’s hierarchical approach to improving scalability, AS confederations use a divide-and-conquer approach. The basic concept involves breaking an AS into smaller sub-ASes. iBGP sessions crossing sub-AS boundaries are similar to eBGP sessions; however, the local-pref and next-hop attributes are not reset upon entering a sub-AS. Sub-AS numbers are included in the route attribute AS-CONFED-SEQUENCE, which is similar to the AS Path for loop detection. This attribute is stripped before propagating to other ASes, hiding the internal iBGP topology. As with route-reflection, the next-hop IP address is not reset at each iBGP hop.

The real-world networks we examine in this thesis use the route-reﬂector iBGP topology. Consequently we concentrate our focus on such topologies.

2.4 BGP, Not as Easy as 1-2-3?

So far we have described the protocol deﬁnition of BGP and only hinted at the opportunities for ASes to implement policies. In reality, policies stemming from economic, political or performance considerations govern many router decisions [39, 52, 100, 107]. The relationship between a pair of ASes often falls into one of the following two broad categories [37, 52, 119]:

1. Customer-Provider: One AS (the customer) ﬁnancially compensates the other AS (the provider) for connectivity to the remainder of the Internet.

2. Peer-Peer: A mutually beneﬁcial relationship between two ASes to provide connectivity to each others’ customers. No remuneration is required for traﬃc exchanged between the two peer ASes.

There is no obligation for a policy to fall neatly into one of the above categories. For instance, a North American based AS may have substantial infrastructure in its own geographic region and little capacity in Asia. Consequently, the North American based AS may negotiate a policy with an Asian based AS so that each transits the other’s traffic in the region where it has the greatest capacity. Another example may be a policy where an AS uses another purely as a backup [38]. A policy such as this may have been the solution to the Internet partitioning issue between Sprint and Cogent, described in Chapter 1. Policies are negotiated between pairs of ASes (and considered proprietary), so the extent to which customized policies are prevalent throughout the Internet is unknown. However, recent work by Mühlbauer et al. [77] demonstrated the two categories above were insufficient for predicting chosen routes in the Internet, indicating customized policies may be more common than previously believed.

(a) Good Gadget (b) A stable assignment

Figure 2.4.1: Griﬃn et al.’s Good Gadget [45]. A node’s preference to reach node 0 is labeled next to each node.

(a) Bad Gadget (b) Naughty Gadget

Figure 2.4.2: Non-stable gadgets deﬁned by Griﬃn et al. [45].

The ﬂexibility to set policies allowing any route to be preferred over any other has consequences. One such consequence is routing oscillation, best illustrated by the Griﬃn et al. stable paths problem [45].

2.4.1 The Stable Paths Problem

The goal of BGP is to propagate reachability information and changes to the Internet’s topology based on link failures. However, even when a physical topology is static, BGP can exhibit unstable behavior. This instability is best described by the Griffin et al. abstract view of the stable paths problem [45].

Griffin et al. [45] use a simple five node system to examine policy conflicts. This example is shown in Figure 2.4.1. Each node deterministically ranks routes to reach the central node, 0, according to the list shown by the node. For example, in Figure 2.4.1(a) node 1 prefers the indirect route via 3 over the direct route to 0. A stable assignment of routes occurs when each node is selecting its best available route. The assignment must be consistent. For example, node 1 can only select the route 1 3 0 if node 3 selects 3 0. In Figure 2.4.1(b) we see a stable assignment for the Good Gadget. Arrow directions represent the route selected to 0. Note node 2’s most preferred route via 1 is not selected as it is not consistent with 1’s selection. That is, 2 cannot use the route 2 1 0 as 1 does not announce 1 0 to 2. Any starting assignment will converge to this unique assignment.

Now consider the node preferences outlined in Bad Gadget (Figure 2.4.2(a)). Note that this example is only slightly different to the Good Gadget example in Figure 2.4.1(a) and is best explained using the step-by-step approach shown in Table 2.4.1. A stable configuration should be stable from any distribution of starting routes, so we can start from an arbitrary point. We start with the configuration in step 1 of the table. Node 1, which currently selects the direct route, learns a route via 3. Hence, it alters its selection to reflect this new information. This in turn affects the selection of 2 and its effect cascades to the decisions of nodes 4 and 3. We see in step 9 the route selections are equivalent to those in step 1. Consequently, the process continues ad infinitum. Note that in this example, given any message ordering and starting assignment, routing oscillation will occur. Also, oscillation is not necessarily periodic as in this example. It can involve the information at multiple routers being altered after a step, and the subsequent steps may depend on the order in which nodes’ decisions are evaluated.

Naughty Gadget is shown in Figure 2.4.2(b). This example has a stable solution equivalent to the Good Gadget. However, route oscillation may occur depending on the starting routes selected (see Table 2.4.2). Hence this example demonstrates that current stability of a network may not indicate future stability.
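The claim that Bad Gadget has no stable assignment can be checked by brute force. The figures are not reproduced in this text, so the ranked path lists below are an assumption inferred from the route changes in Table 2.4.1, and the second configuration is a hypothetical variant (node 3 permits only its direct route) included purely to show what a stable assignment looks like. It is not claimed to be the Good Gadget.

```python
from itertools import product

# Ranked permitted paths per node, most preferred first (inferred from Table 2.4.1).
BAD_GADGET = {
    1: [(1, 3, 0), (1, 0)],
    2: [(2, 1, 0), (2, 0)],
    3: [(3, 4, 2, 0), (3, 0)],
    4: [(4, 2, 0), (4, 3, 0)],
}

# Hypothetical variant: node 3 only permits the direct route.
VARIANT = {**BAD_GADGET, 3: [(3, 0)]}


def best_available(node, prefs, assignment):
    """Most preferred path that is consistent with the next hop's own selection."""
    for path in prefs[node]:
        next_hop = path[1]
        if next_hop == 0 or assignment.get(next_hop) == path[1:]:
            return path
    return None


def stable_assignments(prefs):
    """Enumerate every assignment in which each node holds its best available path."""
    nodes = sorted(prefs)
    options = [prefs[n] + [None] for n in nodes]
    stable = []
    for choice in product(*options):
        assignment = dict(zip(nodes, choice))
        if all(best_available(n, prefs, assignment) == assignment[n] for n in nodes):
            stable.append(assignment)
    return stable
```

Under these assumed preferences `stable_assignments(BAD_GADGET)` is empty, matching the statement that no message timing yields a stable selection, while the variant has exactly one stable assignment. Exhaustive enumeration is only practical for toy examples of this size.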

Step  Node 1  Node 2  Node 3   Node 4
1     1 0     2 1 0   3 0      4 3 0
2     1 3 0   2 1 0   3 0      4 3 0
3     1 3 0   2 0     3 0      4 3 0
4     1 3 0   2 0     3 0      4 2 0
5     1 3 0   2 0     3 4 2 0  4 2 0
6     1 0     2 0     3 4 2 0  4 2 0
7     1 0     2 1 0   3 4 2 0  4 2 0
8     1 0     2 1 0   3 0      4 2 0
9     1 0     2 1 0   3 0      4 3 0

Table 2.4.1: Step-by-step route selections for Bad Gadget. Note that step 9 is equivalent to step 1 and the process will continue ad infinitum. No timing of messages will result in a stable route selection.

Step  Node 1  Node 2  Node 3   Node 4
1     1 0     2 0     3 4 2 0  4 2 0
2     1 0     2 1 0   3 4 2 0  4 2 0
3     1 0     2 1 0   3 4 2 0  -
4     1 0     2 1 0   3 0      -
5     1 3 0   2 1 0   3 0      -
6     1 3 0   2 0     3 0      -
7     1 3 0   2 0     3 0      4 2 0
8     1 3 0   2 0     3 4 2 0  4 2 0
9     1 0     2 0     3 4 2 0  4 2 0

Table 2.4.2: Step-by-step route selections for Naughty Gadget. Although a stable solution exists (equivalent to Good Gadget), the system may enter a cycle where the stable solution is unreachable.

An approach to avoid the above undesirable states is to redesign the inter-domain routing protocol to prevent the problems completely. However, despite the pitfalls of BGP, there are no new protocols proposed that solve all problems while retaining the flexibility now expected by operators to implement their policies. One such proposal is that of Subramanian et al. [104] who propose a new inter-domain protocol that sacrifices some visibility for scalability and localization of BGP changes. In contrast, Xu and Rexford [128] increase route visibility to offer more route options. Ideally a routing protocol should be designed such that the workspace for operators is entirely safe. Work by Griffin and Sobrinho [42] moves toward this end. They describe BGP as an algebra [99] and use algebraic theory to prove properties of the protocol.

Although changes to the inter-domain routing protocol used in the Internet may be a good solution to the problems encountered by BGP, the uptake of new protocols is stymied by the need for extensive collaboration between ASes. Despite the benefit to the entire Internet, AS operators are often unwilling to alter their network without significant local incentives. Consequently, there is a need for pragmatic approaches that solve the problems encountered by BGP with locally justifiable improvements in an AS’s own network. One possible approach is to restrict the policies ASes can employ [36, 38]. However, there are also arguments for adding even more flexibility in routing policies [122].

Due to the proprietary nature of BGP policies, the extent to which any of the currently observed BGP instability is caused by policy conflicts is unknown. MED oscillation, which occurs within a single AS, is more easily detected. In fact, there are documented examples of MED oscillation occurring throughout the Internet [125].

2.4.2 MED Oscillation

The Multi-Exit-Discriminator (MED) attribute is used by routers to discriminate among multiple exit or entry points to the same neighboring AS. However, the MED attribute associated with routes is comparable only between routes learned from the same neighboring AS (see Section 2.3.1). Consequently, the total ordering of routes depends on the set of routes available. This, in combination with the limited visibility of external routes due to route-reflection, has the potential to cause oscillation [6,48,75]. Consider the adaptation (see Figure 2.4.3) of the example in [48,75]. MED values are indicated on links crossing AS boundaries (IGP weights are labeled on internal links). The IGP distance between non-directly connected routers is the sum of the IGP distances of the links.

[Figure: routers 0 to 4 inside AS A, with neighboring ASes B, C and D. MED values (1) and (0) are marked on inter-AS links.]

Figure 2.4.3: MED Oscillation. A route-reflector configuration (black nodes are route-reflectors and white nodes are clients) with IGP weights labeled on internal links. MED values are indicated in brackets on links crossing AS boundaries. This configuration oscillates persistently.

Consider the route selections at nodes 0 and 1. Their decisions for each set of available routes are shown in Table 2.4.3. Recall from Section 2.4.1 that a solution must be consistent. Now, let us consider the possible route availabilities and the best route selected by nodes 0 and 1. First, if node 0 learns of the egress via 4, it implies that node 1 must have selected the egress via 4 (to be consistent). Now, node 0 will have learned of all three routes, and will discard the route from node 3 during the MED step of the BGP decision process, leaving the routes from nodes 4 and 2. The route from node 2 will be selected by node 0 as the IGP distance from 0 to 2 is shorter than from 0 to 4. However, this is a contradiction as node 1 would now learn and select the egress via 2 (due to the shorter IGP distance). Now suppose node 0 does not learn of the egress via 4. It will choose to exit AS A via 3 as it has a shorter IGP distance than the egress via 2. Node 0 will propagate this information to node 1. Node 1 now has the choice of egressing via 3 or 4. As both egresses are to AS C, the MED attribute is compared and node 1 selects the egress via 4, contradicting our initial assumption. There is no solution to this configuration. In practice 0 and 1 would endlessly exchange update messages without arriving at a stable configuration.

Node  Available Egresses  Egress Selected  Reason for selection
0     2,3                 3                IGP
0     3,4                 4                MED
0     2,4                 2                IGP
0     2,3,4               2                MED, IGP
1     2,4                 2                IGP
1     3,4                 4                MED

Table 2.4.3: Best route selection at routers 0 and 1 from Figure 2.4.3 given all possible sets of available egress routes. Note that the combination of available egress routers (2, 3, 4) is not feasible at router 1 (as router 1 can only learn of a single route from router 0).

In addition to cases such as the one described above, Griﬃn and Wilfong examine other counter-intuitive anomalies resulting from the use of the MED attribute in [48].
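The contradiction argument for Figure 2.4.3 can be checked mechanically. In the sketch below the neighbor ASes, MED values and IGP distances are hypothetical numbers chosen only to be consistent with the text and Table 2.4.3 (the figure itself is not reproduced here), and the session layout, with egresses 2 and 3 as clients of reflector 0 and egress 4 as the client of reflector 1, is inferred from the discussion.

```python
from itertools import product

# Hypothetical values consistent with Table 2.4.3.
NEIGHBOR_AS = {2: "B", 3: "C", 4: "C"}
MED = {2: 0, 3: 1, 4: 0}
IGP_DIST = {0: {2: 2, 3: 1, 4: 3}, 1: {2: 1, 3: 3, 4: 2}}


def select(node, egresses):
    """MED step (per neighboring AS) followed by lowest IGP distance."""
    survivors = [e for e in egresses
                 if not any(NEIGHBOR_AS[o] == NEIGHBOR_AS[e] and MED[o] < MED[e]
                            for o in egresses)]
    return min(survivors, key=lambda e: IGP_DIST[node][e])


def available(node, other_choice):
    """Egresses visible at a reflector, given the other reflector's selection."""
    if node == 0:
        # Own clients 2 and 3, plus 4 only while reflector 1 selects it.
        return {2, 3} | ({4} if other_choice == 4 else set())
    # Reflector 1: own client 4, plus reflector 0's best if it is a client route of 0.
    return {4} | ({other_choice} if other_choice in (2, 3) else set())


def fixed_points():
    """All consistent pairs of selections (node 0's egress, node 1's egress)."""
    return [(c0, c1) for c0, c1 in product((2, 3, 4), repeat=2)
            if select(0, available(0, c1)) == c0
            and select(1, available(1, c0)) == c1]
```

With these assumed values `select` reproduces every row of Table 2.4.3, and `fixed_points()` returns an empty list, which is the statement that no consistent pair of selections exists.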

To prevent MED oscillation, ASes often filter the MED attribute. Alternatively, an AS can enable always-compare-med in its router configurations. This setting compares the MED attribute across all neighboring ASes (in contrast to comparing MEDs only from the same neighbor AS). However, the MED attribute has limited meaning compared across multiple ASes. For example, one AS may announce two routes with MED values 100 and 110 indicating a preference for the first route. However, another AS may indicate their preference with MED values 1000 and 1100 (preferring the first route). The magnitude of the MED attribute consequently has limited meaning unless comparing routes from the same AS. Preventing MED oscillation using one of the above techniques may not prevent all forms of oscillation within an AS.

[Figure: route-reflectors 0, 1 and 2 with clients 3, 4 and 5. IGP distances are labeled on the links.]

Figure 2.4.4: iBGP persistent oscillation. Solid lines denote iBGP sessions. Dashed lines from route-reflectors to non-client routers show relevant IGP distances.

2.4.3 iBGP Oscillation

iBGP configurations can oscillate even without the influence of the MED attribute [49]. Consider the example shown in Figure 2.4.4. Solid nodes are route-reflectors with BGP sessions to their respective clients (white nodes), indicated by a solid line. Important IGP distances are labeled next to the line (either solid or dashed) joining nodes. Externally learned routes that are equivalent up to and including the MED step of the BGP decision process are depicted by large arrows.

In Table 2.4.4 we show step-by-step how oscillation will persistently occur in this configuration. Nodes 3, 4 and 5 will all select their direct egress and inform their respective route-reflectors of their selection. The route-reflectors will either select this direct route learned from their client or a route learned from another route-reflector. The first round of route selections shown in the table involves nodes 0 and 1 selecting their direct route and node 2 selecting a route learned from 0 because of the lower IGP distance. Now, node 0 learns of the egress via 4 from 1 and alters its selection. This in turn alters the decision made by 2, cascading to the decision of 1. A persistent oscillation ensues. Any starting configuration results in the same oscillation.

Step  Node 0  Node 1  Node 2
1     3       -       -
2     3       4       -
3     3       4       3
4     4       4       3
5     4       4       5
6     4       5       5
7     3       5       5
8     3       5       3
9     3       4       3

Table 2.4.4: Step-by-step route selections for Figure 2.4.4. Note that step 9 is equivalent to step 3 and the process will continue ad infinitum.

Griffin and Wilfong demonstrate it is NP-hard to determine if a configuration will not oscillate [49]. However, they do provide a sufficient condition to prevent it: all route-reflectors should choose a route learned from their own client over any other. Previous efforts have recommended guidelines for configuring iBGP networks [14, 27, 90, 118], proposed alterations to iBGP [6, 10, 59, 67, 79, 80, 86, 109], or centralized router decisions [18, 29, 41]. However, in practice, networks may not fulfill certain guidelines or configurations and the current version of BGP is unable to support the demands of many of the proposals. In contrast, our primary motivation is to determine the current operation of BGP. For example, in Chapter 3 we develop a practical method to determine if a network configuration has the potential to oscillate, while in Chapter 4 we determine the current network routing state, and in Chapter 5 we analyze the actual policies employed by ASes in the Internet.
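Returning to the oscillation of Table 2.4.4, the cycle can be replayed with a few lines of code. The IGP distances below are hypothetical values chosen so that each route-reflector prefers another reflector's client egress over its own, as the table implies. The figure's actual weights are not reproduced here. A reflector is assumed to advertise a route to the other reflectors only while it selects its own client's route.

```python
from itertools import product

CLIENT = {0: 3, 1: 4, 2: 5}   # route-reflector -> its client (an egress)

# Hypothetical IGP distances consistent with Table 2.4.4.
IGP_DIST = {
    0: {3: 2, 4: 1, 5: 3},
    1: {4: 2, 5: 1, 3: 3},
    2: {5: 2, 3: 1, 4: 3},
}


def available(rr, state):
    """Own client's route plus client routes currently selected by other reflectors."""
    routes = {CLIENT[rr]}
    for other, choice in state.items():
        if other != rr and choice == CLIENT[other]:
            routes.add(choice)
    return routes


def best(rr, state):
    return min(available(rr, state), key=lambda e: IGP_DIST[rr][e])


def replay(state, order):
    """Activate reflectors one at a time and record each resulting state."""
    state = dict(state)
    trace = []
    for rr in order:
        state[rr] = best(rr, state)
        trace.append((state[0], state[1], state[2]))
    return trace


def stable_states():
    """All selections in which every reflector already holds its best route."""
    found = []
    for choice in product((3, 4, 5), repeat=3):
        state = dict(zip((0, 1, 2), choice))
        if all(best(rr, state) == state[rr] for rr in state):
            found.append(choice)
    return found
```

Starting from step 3 of the table, `replay({0: 3, 1: 4, 2: 3}, [0, 2, 1, 0, 2, 1])` visits the selections of steps 4 to 9 and ends where it began, and `stable_states()` is empty under these assumed distances.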

2.4.4 BGP in the Wild

We have seen that BGP has problems, but there is a much wider range of issues in BGP than we can discuss in detail here. For instance, instability in the global BGP system is significant [64,65]. Such instability can cause delayed convergence [63, 72]. On a positive note, Rexford et al. [92] found that destinations with high routing instability had low traffic volumes. It is concerning, however, that the low traffic volume may itself be a consequence of high routing instability.

When a problem occurs, locating its cause is not as simple as examining a single BGP update with all relevant details included. BGP hides a significant amount of information and often multiple updates can be attributed to a single underlying event. Also, events may have different effects at different locations in the Internet. Some observed behavior may simply be caused by mis-configured or out-of-date router configurations [15, 28, 123]. Consequently, a significant line of recent research has concentrated on root-cause analysis of BGP updates [19, 20, 33, 127] or simply identifying anomalous behavior [112, 130]. All these techniques are attempting to reverse engineer the Internet topology and its dynamic properties with BGP data that only represents a fraction (what fraction is still an open question) of the entire Internet. In contrast, others propose techniques to prevent malicious behavior using cryptography [40, 58, 124], BGP enhancements [105, 133] or simply identifying unusual BGP updates [57, 61].

Predicting the propagation of routes is an important step to identifying the cause of failures. However, as we have seen, the Internet is commonly governed by economic relationships between ASes [52] that can make the inference of paths difficult. Gao [37] provides a technique for inferring the Customer-Provider and Peer-Peer relationships based on observed BGP data. Following from this, BGP data was used to characterize ASes and their relationships in the Internet [8,106]. Mühlbauer et al. [78] remove the assumption of atomic AS peering policy. This work is extended in [77], showing that not only are the generic customer-provider and peer-peer policies insufficient to predict selected paths, but even ‘per-neighbor’ policies are inadequate.

All of the above studies use BGP data collected throughout the Internet. This data will also be used in this thesis. Hence, we must understand what this data represents and its inherent limitations.

2.4.5 BGP Data

One method to collect BGP data is using a software router such as [56] that establishes a BGP session with an operational router. We term such a device a route-monitor. The operational router does not distinguish between a route-monitor and a physical router. Hence, the route-monitor receives all BGP updates from the operational router and records them. Consequently, if no export policy is applied to the monitoring BGP session, the route-monitor gains a dynamic view of the Loc-RIB of the operational router (see Figure 2.3.1). The route-monitor records the data in a compressed binary format. In order to analyze the data, a binary-to-ascii tool is generally used (for instance [5]) to convert the data into a format easily read by other applications.

The monitoring procedure is non-invasive and passive. Consequently, many ASes allow public route-monitors to establish BGP sessions with one or more of their routers. Terabytes of compressed BGP data are publicly available from route-monitors such as RIPE NCC [93] and RouteViews [115]. In addition, ASes collect BGP data internally to troubleshoot local routing issues.

Recall that the BGP data collected is a view of a router’s Loc-RIB. A router can only select one best route to store in the local Routing Information Base (Loc-RIB). Hence, monitoring a router via this mechanism does not include all available routes learned by the router. Furthermore, a single router’s view of the Internet is obscured by the decisions and policies of other routers and ASes. Previous work such as [33] has attempted to correlate multiple view points. However, their heuristics are still unable to unambiguously identify the cause of all route changes. Although the amount of BGP data available is massive, it is incomplete and biased by the positioning of the route-monitors [131].

An alternative to this form of measurement of BGP data is more invasive and requires substantial assistance from network operators. The operational router is accessed via a password-protected telnet session, and the show ip bgp command is executed. This method has access to both RIB-ins and RIB-pps. Through the same telnet session a router’s configuration can be altered. Consequently, operators are hesitant to collect data using this mechanism. Further, this technique is unable to capture route dynamics; it is just a snapshot⁹. There is a proposal [96] to add the ability to record all routes in the RIB-in of a router, but it has not been implemented to the best of our knowledge. Unless explicitly stated, we use BGP data collected by the route-monitor technique for this thesis.

The collection mechanism for BGP data is not perfect. Consequently, in each chapter of this thesis we account for its imperfections based on the analysis it is required for. In Chapter 6 we rigorously analyze the consistency of recorded BGP data to quantify some of its limitations.

⁹ It is possible to capture routing updates, although it is highly CPU intensive and not adequate for operational routers.

Chapter 3

Where’s Waldo? Practical Searches for Stability in iBGP

We have seen in Chapter 2 that iBGP can persistently oscillate within an AS. However, checking if a configuration is oscillatory is NP-hard [49]. Networks may be designed to fulfill sufficient conditions [49] in order to avoid such oscillation. However, in practice, networks are dynamic structures adapting to events such as link failures, additions and configuration changes. Consequently, such conditions to avoid oscillation can be easily violated. In this chapter, we model the propagation of routes throughout an iBGP topology and localize the routers responsible for any oscillatory characteristics. We also propose a minor adaption to the BGP decision process to avoid oscillation. In addition, this chapter provides the basis for our iBGP model used for ‘what-if’ analyses in later chapters.

3.1 Introduction

BGP routing oscillation degrades network performance but is surprisingly difficult to diagnose. Routing oscillation is on a per-prefix basis. Each router makes a local decision as to what is its best available route to a destination prefix and informs neighboring routers of this choice. This may affect the neighboring router’s set of available routes and their choice of the ‘best’ route. If this process results in a cycle of decisions from which there is no exit, the system is said to be oscillatory.

With the appropriate measurement infrastructure, and given suﬃcient time, we can detect oscillations that have been occurring, but this is unsatisfactory. For a start, detection does not solve the problem. More importantly, performance degradation will have already occurred by the time the problem is detected (if it ever is, given the infrastructure and analysis requirements). Inside an AS, where an operator has complete control over BGP routing, oscillation should never occur. Oscillation should be prevented, not ﬁxed after the fact.

Until now, the only viable approach to prevention was to follow a set of guidelines proposed by Griffin and Wilfong [49]. These guidelines specify sufficient, but not necessary conditions for iBGP stability. Therefore, they unnecessarily restrict configuration flexibility and in practice are often violated. Even when they are not intentionally violated, configuration changes or failures can lead to violations, resulting in oscillations and instability.

When the guidelines for preventing oscillations are violated, further analysis is required to detect the possibility of oscillation. However, the search for potential oscillations is NP-hard [49] which makes it extremely diﬃcult to analyze large service provider networks due to the scale and dynamism involved.

In this chapter, we present an algorithm to detect potential BGP route oscillations inside a network for an iBGP and IGP configuration. The algorithm creates a directed graph of routers based on the notion of “reliance”. A router is said to be reliant on another when the latter’s BGP route selection can impact the former’s selection. When more than one router in a reliance graph form a strongly connected component [103], the routers’ decisions in this component are dependent on one another and consequently there is the possibility for route oscillation. In large networks, where route-reflection [7] is often used, the reliance graph allows us to prove that such strongly connected components can only be present in a subset of route-reflectors. We then use an algebraic approach [99] to derive the oscillatory properties of each strongly connected component.
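The first stage of this idea, finding groups of mutually reliant routers, amounts to computing strongly connected components of a directed graph. The sketch below uses Tarjan's algorithm on a hypothetical reliance graph. The router names and edges are invented for illustration, and how reliance edges are actually derived from an iBGP and IGP configuration is the subject of later sections, not of this sketch.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph maps each router to the routers it relies on."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop its members off the stack.
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components


def potentially_oscillatory(graph):
    """Components with more than one router, where decisions depend on one another."""
    return [c for c in strongly_connected_components(graph) if len(c) > 1]


# Hypothetical reliance graph: three reflectors rely on each other in a cycle,
# and each client relies on its own reflector.
RELIANCE = {
    "c3": ["rr0"], "c4": ["rr1"], "c5": ["rr2"],
    "rr0": ["rr1"], "rr1": ["rr2"], "rr2": ["rr0"],
}
```

For this hypothetical graph only the three reflectors form a component of size greater than one, so only they would need the more expensive second-stage analysis. The recursive formulation is adequate for small examples; a very deep reliance graph would call for an iterative variant.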

This approach leads to a significant reduction in the number of routers that need to be considered in the second step of our analysis since the number of route-reflectors should be much less than the number of routers in a network (otherwise, the benefit of route-reflectors is limited). This in turn makes the algorithm scalable, allowing an operator to not only detect potential oscillations in a network design or proposed network design, but also to analyze a large number of failure scenarios. We demonstrate the efficacy of our algorithm by employing it on a topology derived from a large Tier-2 provider. When an oscillation is actually detected, our algorithm also pinpoints the exact set of routers that cause the problem, allowing an operator to more easily fix it. Finally, the algorithm leads us to recommend a simple change to the BGP route selection process that can eliminate the potential for oscillation altogether.

Searching for oscillation in massive amounts of routing data is a difficult task, because locally caused oscillation must be separated from valid routing dynamics that may be caused by another AS’s oscillation. Localizing the routers responsible for oscillation can also be difficult when many other routers will experience routing changes as a result of the few routers causing the oscillation. Further, as we show in Section 3.7.3, analyzing the current network state for oscillatory properties may not be adequate for ensuring the future stability of a network, even if the network properties remain identical! Moreover, using the properties of the network only requires analysis on a per egress instance basis rather than the per-prefix per-router basis required for pure data analysis.

The remainder of the chapter is organized as follows. We provide related work in Section 3.2 and additional background information in Section 3.3. In Section 3.4, we formalize the notion of stability. In Sections 3.5 and 3.6, we present the reliance graph theory for detection of potential oscillations and the subset of routers where they can occur. Section 3.7 then uses the algebraic approach to prove whether an oscillation will actually occur. Through most of the chapter we assume the final tie-breaking step in the BGP route selection process is based on router-ID. This step is directly incorporated into our reliance graph theory. However, BGP also allows the use of the “oldest route” as the tie-break. Unfortunately, in this case the ordering of route preference is based on the time that the route was received, leading to more complicated stability problems. We present a modified algebra in Section 3.8 to show that despite this complication, the oldest route tie-break is more appealing because it reduces the likelihood of oscillations. In Section 3.9, we demonstrate the efficacy of our algorithm in the Tier-2 AS and in Section 3.10 suggest a modification to the BGP decision process which would prevent iBGP oscillation. We extend the iBGP model used in this chapter in Chapter 4 to determine the actual router decisions rather than a configuration’s oscillatory properties.

3.2 Related Work

iBGP has been shown to oscillate with [6, 48, 75] and without [49] the MED attribute affecting the BGP decision process. The oscillation resulting from the MED attribute was first described by McPherson et al. [75]. This prompted substantial investigation into its causes and conditions to avoid it [27,36,38,44–47,118]. However, even with the MED attribute filtered, or compared AS-wide, it is possible for an iBGP configuration to oscillate [49]. This led to additional recommended guidelines for designing iBGP configurations [14,27,90,118], proposing alterations to iBGP to disseminate more information AS-wide [6, 10, 59, 67, 79, 80, 86] or for centralizing router decisions [18,29,41].

Our approach is complementary to these in one vital aspect. We do not aim to re-design iBGP, nor do we attempt to provide over-arching guidelines for network configuration (although in Section 3.10 we propose an additional BGP decision step to prevent oscillation). Our aim is a very pragmatic one: to understand the operation of a network, whether it satisfies certain guidelines or not.

Varadhan et al. [116] investigated the abstract preferences of routes, finding that certain combinations of preferences across ASes result in oscillation. They developed a concept of return graphs with similar motivation to our reliance graph. However, their work was focused on the abstract problem of stability in path-vector protocols, while our work is focused on the practical issue of determining iBGP stability.

Griffin and Sobrinho [42] outlined an algebraic representation of BGP between ASes. This work, together with earlier work by Sobrinho [99] that described an algebraic representation of iBGP, proved general properties of BGP. We do not attempt to design an algebraic representation of iBGP as a whole. Instead, as we are interested in oscillation, we use a much simpler algebraic representation with routers only able to select one of two ‘types’ of routes. We find this is all that is required to determine the oscillatory properties of a configuration without the complexities of protocol idiosyncrasies.

In this chapter, we are purely focused on control plane (routing) oscillation. The crux of routing oscillation is the difference in knowledge of available egresses at each router resulting in some routers choosing sub-optimal or non-shortest paths. This has a further consequence. Routers on the IGP path chosen to forward traffic may disagree as to the ‘best’ egress (as their set of available routes may be different). Hence, there is the possibility that traffic can be forwarded back to whence it came, resulting in forwarding loops or deflections [14, 49]. However, forwarding loops can be avoided by tunneling traffic from ingress to egress router within an AS. MPLS [94] is a protocol used by many ASes for such a purpose.

3.3 Background

3.3.1 iBGP Recap

After a router learns a route from a neighboring AS, Internal BGP (iBGP) is used to propagate it to other routers within the AS. The router that learns the route directly from a neighboring AS is the egress router (for traffic). It was originally conceived that iBGP would connect all routers in a full-mesh. However, scalability concerns resulted in the introduction of a hierarchical configuration known as route-reflection [7]. Although route-reflection can have multiple hierarchical levels, in this chapter we consider the commonly used two-level hierarchy (though our ideas can be extended to the more complex general case; see Chapter 4). All routers are either route-reflectors or clients of route-reflectors. Clients propagate external routes (learned directly from neighboring ASes) to their parent route-reflector(s). Route-reflectors select the best route and ‘reflect’ routes differently depending on the type of router they are learned from. A route-reflector’s best route is reflected as follows:

Source          Reflect to
a client        all iBGP neighbors
a non-client    all clients
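The reflection rule in the table can be written as a small function. This is a minimal sketch that follows the table literally; the function name `reflect_targets` and the representation of neighbors as Python sets are assumptions made for illustration only.

```python
def reflect_targets(learned_from, clients, non_clients):
    """Return the iBGP neighbors to which a route-reflector reflects its best route.

    learned_from: the neighbor the best route was learned from (hypothetical id)
    clients:      set of the route-reflector's client neighbors
    non_clients:  set of its non-client iBGP neighbors
    """
    if learned_from in clients:
        # Learned from a client: reflect to all iBGP neighbors (as in the table).
        return set(clients) | set(non_clients)
    if learned_from in non_clients:
        # Learned from a non-client: reflect to clients only.
        return set(clients)
    raise ValueError("route was not learned from a known iBGP neighbor")
```

For example, `reflect_targets("rr2", {"c1", "c2"}, {"rr2"})` yields only the two clients, which is the source of the restricted route visibility discussed next.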

Scalability is achieved because the number of iBGP sessions is reduced. This comes at a price: routers now learn only a subset of potentially available routes. It has been shown that this restriction has more serious consequences than suboptimal routing. It can also lead to persistent oscillation as a result of the MED attribute [75] or purely as a result of the internal topology [49]. The MED attribute is set by neighboring ASes, so an AS has no control over its values, but an operator can configure routers such that MEDs have no effect, and hence avoid MED oscillation. Although our techniques are extensible, in this chapter we assume the AS has filtered the MED attribute (or compared across all neighboring ASes) unless otherwise specified and focus on the oscillation resulting only from an AS’s topology.

3.3.2 Best Path Selection

Routes learned externally that are discounted before the closest egress router step (steps 6 and 7 in Section 2.3.1) of the BGP decision process are never chosen as the final best route by any router in the network [31]¹. Therefore, we only need to consider the routes that survive as equally good routes up to the closest egress router steps of the BGP decision process. However, a router may not learn all of the globally available routes. A change in the locally available routes can result in a router changing its decision, and hence advertising different routes to neighbors, changing their locally available routes and so on. This is the crux of the oscillation issue we are examining.

¹ While the protocol is converging, other routes may be selected for a short time. However, for the purposes of analyzing persistent oscillation, these transient selections are irrelevant.

At each router, two local decision steps determine which of the available routes are selected. Firstly, the route with the lowest IGP distance to the egress router is selected. If multiple routes have equal IGP distances, the tie is broken by selecting the route with the lowest router-id (we consider the second tie-break option of “oldest route” later). Such decision steps are topology based and are not timing dependent. Thus, given a set of available routes, A, there is a strict preference of routes. We deﬁne a ranking function λu for a router u, such that if ak is preferred over aj at router u, then λu(ak) > λu(aj).
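The two local decision steps give a strict order over the available routes, from which a ranking function λu can be built. The following is a minimal sketch; the dictionary inputs and the name `ranking_function` are assumptions, and the numeric rank values are arbitrary as long as a preferred route receives a larger value.

```python
def ranking_function(igp_distance, router_id):
    """Build a ranking (lambda_u) over the egress routes available at one router.

    igp_distance: dict egress -> IGP distance from this router to that egress
    router_id:    dict egress -> router-id used for the final tie-break
    Returns a dict egress -> rank, where a larger rank means more preferred.
    """
    # Lowest IGP distance first, then lowest router-id.
    order = sorted(igp_distance, key=lambda e: (igp_distance[e], router_id[e]))
    n = len(order)
    return {egress: n - position for position, egress in enumerate(order)}


lam_u = ranking_function({"a": 10, "b": 5, "c": 5}, {"a": 1, "b": 3, "c": 2})
# "c" and "b" tie on distance; "c" wins on router-id, and "a" is furthest away.
```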

3.3.3 Interior Gateway Protocol

Steps 6 and 7 of the BGP decision process combine such that the closest border router is selected, where shortest-path “distances” (these need not be geographic distances) are calculated by the IGP used for routing inside an AS. The key issue is that IGP distances are often unrelated to the iBGP topology. BGP sessions are routed and so may extend over multiple physical hops. A route-reflector’s client may not be “close” to the route-reflector. There are even good reasons (e.g., redundancy) why another route-reflector’s client may be closer.

The complicated interaction between iBGP and IGP requires us to make clear distinctions between the underlying IGP network topology (which we term the physical topology) and the logical iBGP signaling topology. In this chapter we will use Griffin and Wilfong’s terminology [49] in which an iBGP configuration C is a pair C = (GP, GS) where GP is the physical graph and GS is the signaling graph.

3.3.4 Physical Graph

The physical graph represents the physical topology of the network. It is defined by the quartet GP = (V, B, EP, d). Each node u ∈ V represents a router in the network. B ⊆ V is the set of border (or egress) routers with physical connectivity to external networks. The set of uni-directional edges between routers is EP, and d(e) is the IGP distance administratively assigned to edge e = (u, v) ∈ EP. A path P is a sequence of edges P = e1e2...en. The length of P is the sum of the distances d(e) for all edges e of P, and the IGP is used to compute the shortest paths.
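As an illustration of how IGP distances follow from the physical graph, the sketch below computes shortest-path lengths from one router over the directed edges EP with weights d(e). Dijkstra's algorithm is used here as a standard choice; the edge-dictionary representation and the name `igp_distances` are assumptions.

```python
import heapq


def igp_distances(edges, source):
    """Shortest-path lengths from `source` over directed, weighted edges.

    edges: dict mapping (u, v) -> d(e), with non-negative distances
    Returns a dict router -> length of the shortest path from `source`.
    """
    adjacency = {}
    for (u, v), d in edges.items():
        adjacency.setdefault(u, []).append((v, d))

    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, d in adjacency.get(u, []):
            nd = du + d
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Routers that do not appear in the result are unreachable from `source` in the physical graph.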

3.3.5 Signaling Graph

The directed signaling graph GS = (V, AS) represents the propagation of BGP routes between routers within V. An arc in GS represents an iBGP session between two routers and is overlaid on some path in GP.

The set of arcs AS is partitioned into three sets over, up, and down. We show examples of each type of arc in Figure 3.3.1. An arc (u, v) ∈ over represents a vanilla iBGP session from router u to v. If (u, v) ∈ over, then (v, u) ∈ over. An arc (u, v) ∈ down represents an arc from a route-reflector u to one of its clients v. Inversely, an arc (u, v) ∈ up represents an arc from a client u to its route-reflector v. An arc (u, v) ∈ down if and only if (v, u) ∈ up. Arcs in up are acyclic, consistent with a hierarchy rather than an arbitrary network design.

A valid signaling path S satisfies the following properties for a two-level route-reflector hierarchy. The path S can be split into sub paths S = PQR where P is either empty or consists of a single arc p ∈ up, R is either empty or consists of a single arc r ∈ down and Q is either empty or consists of a single arc q ∈ over. We later redefine a signaling path for a three-or-more-level hierarchy.
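The decomposition S = PQR amounts to requiring that the arc types along a path read as an optional up arc, then an optional over arc, then an optional down arc. A minimal sketch of that check, assuming a path is given as a list of arc-type strings (the function name is hypothetical):

```python
def is_valid_signaling_path(arc_types):
    """Check the two-level rule S = PQR: at most one 'up', then at most one
    'over', then at most one 'down', in that order."""
    stages = ["up", "over", "down"]
    position = 0
    for arc_type in arc_types:
        # Skip stages that this path leaves empty.
        while position < len(stages) and stages[position] != arc_type:
            position += 1
        if position == len(stages):
            return False  # out of order, repeated, or unknown arc type
        position += 1  # each stage may be used at most once
    return True
```

Under this check a client-to-client path through two route-reflectors, `["up", "over", "down"]`, is valid, while `["down", "up"]` is not.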

3.3.6 Egress Instance

An egress instance [49] I = (C, X) can be defined as a pair of configuration C and a set of border routers X ⊆ B. The routers in X represent border routers each of which learns an external BGP route to a particular prefix, where these routes are equally good up to the closest egress step of the BGP decision process. All other routes will be eliminated by earlier steps in the decision process and hence can not cause oscillation. Note that although a border router may learn multiple routes (to a prefix) it will only advertise its best route to neighbors. It is irrelevant which route is advertised as routers will make their decision purely on the distance to this border router. Hence, there is a one-to-one mapping from border routers X to available routes. We will refer to a border router and its available route interchangeably.

Figure 3.3.1: Edge types (over, up, down) in the iBGP signaling graph. Black nodes are route-reflectors and white nodes are clients.

3.4 Stability

Griffin and Wilfong [49] define an egress instance to be signaling correct if it is guaranteed to deterministically arrive at a unique (predictable) routing. However, we need additional terminology to describe all of the possible behaviors of egress instances, and we do so by drawing on the dynamic systems literature. We say a system is in equilibrium if it remains in a single state, or if it cycles through a subset of states such that the cycle persists indefinitely in the absence of external influences. We call a single-state equilibrium stable, and a cycle oscillatory, by analogy to previous works (although in dynamic systems, stability would be otherwise defined). An egress instance may have more than one possible equilibrium cycle/state, and we characterize an egress instance as signaling unstable if there is at least one oscillatory equilibrium, or as signaling stable if only stable equilibria exist. A signaling correct egress instance must be signaling stable, but if there is more than one possible equilibrium, then the equilibrium we reach for a particular egress instance is non-deterministic and so a signaling stable instance is not necessarily signaling correct.

Any configuration may have 2^|B| − 1 possible egress instances (though in practice not all of these will occur). If all possible egress instances are signaling correct/stable, then the configuration C is signaling correct/stable.
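The count 2^|B| − 1 is the number of non-empty subsets X of the border routers B. A short sketch that enumerates them, with the generator name `egress_sets` chosen only for illustration:

```python
from itertools import combinations


def egress_sets(border_routers):
    """Yield every non-empty subset X of the border routers B."""
    routers = sorted(border_routers)
    for size in range(1, len(routers) + 1):
        for subset in combinations(routers, size):
            yield frozenset(subset)


# Three border routers give 2**3 - 1 = 7 candidate egress sets.
count = sum(1 for _ in egress_sets({"b1", "b2", "b3"}))
```

Exhaustive enumeration of this kind is only feasible for small B, which is one reason the per-instance analysis needs to be cheap.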

3.4.1 Complexity of Determining Signaling Correctness

Griffin and Wilfong [49] construct a generalized configuration G and demonstrate that determining if it is signaling correct is NP-hard in the number of routers. However, they outline a sufficient condition to ensure signaling correctness: a route-reflector should select a route learned from a client router in preference to any route learned from a non-client router. In a two-level route-reflector topology, this condition can be satisfied by ensuring all route-reflectors are closer to their own client routers (IGP distance wise) than non-client routers. We demonstrate in Section 3.11 that for a three-or-more-level route-reflector topology this condition is harder to guarantee. Also, Griffin and Wilfong’s condition is a sufficient condition, not a necessary condition. Networks violating this condition may be signaling stable or even signaling correct. Such networks arise naturally as a result of redundancy requirements and due to link failures.

3.5 Router Reliance Graph

Unless otherwise specified, our analysis is undertaken on a network snapshot with properties such as available egresses, IGP distances and iBGP sessions remaining constant. We call a route always available in this scenario if a router is guaranteed to learn of the route under any equilibrium scenario, that is, the route will not be obscured by another router’s decision (after a finite time). In addition, although our analysis is undertaken on a network snapshot, we will show in Section 3.9 that the structure of our analysis is highly amenable to analyzing a dynamic network.

A router selects its best route from a set of routes A that it learns from other routers. In a route-reflector topology, the set A is dynamic and relies on other routers’ decisions. However, there are many possible routes that the router would never choose in equilibrium. For instance, a router that learns a route directly from a neighboring AS will have this route in A, and so will never select any route that is worse (i.e. routes learned from iBGP). This simple fact reduces the complexity of our problem dramatically.

The use of a router reliance graph captures only those reliances (or dependencies) that can influence the decision of a router. The reliance graph is calculated per egress instance I. The vertices of the graph are routers, and if a router’s decision is dependent on another router, then we say it is reliant and create a directed edge in the reliance graph. In other words, if ui is reliant on uj, we write ui f uj. The reliance graph contains only a subset of arcs from the signaling graph AS. Note that the arrow direction in figures and the notation used for reliance parallels the information flow in the signaling graph. So ui f uj means that the decision of ui is reliant on the decision made by uj and a directed edge exists in the figures from uj to ui.

3.5.1 Reliance Rules for a Route Reﬂector Topology

In a two-level route-reﬂector hierarchy the intuitive construction of the reliance graph for an egress instance I = (C, X) is based on the following.

1. A router in X (with a direct egress) will always choose this egress and so is not reliant on any other routers’ decisions.

2. A client router without a direct egress is reliant on the decisions made by its parent route-reﬂector(s).

3. A route-reflector u is reliant on

(a) its “best” client router, and

(b) any other route-reﬂector v whose best client router is better than u’s best client, from u’s perspective.

We now formally define the rules governing reliances based on the above ideas and the rules of route-reflection. First, we define the best client egress router for route-reflector u ∉ X as Λ(u) ∈ X, where best is the closest egress router. If a route-reflector u has no client in X, then for convenience we define λv(Λ(u)) = −∞ ∀ v ∈ V (recall V is the set of all routers). Now, there are three classes of directed edges in the signaling graph and they all lead to potential directed edges in the router reliance graph. Hence, our router reliance rules are based on the three cases for an arc (u, v):

1. (u, v) ∈ up: a client u is reliant on its route-reflector v iff u ∉ X.

2. (u, v) ∈ down: a route-reflector u is reliant only on its best client egress router Λ(u) ∈ X, and on no other client.

3. (u, v) ∈ over:

(a) A route-reflector u is reliant on another route-reflector v iff λu(Λ(v)) > λu(Λ(u)).

(b) A client u with an over connection to another client v is reliant on v iff u ∉ X and v ∈ X.

Griffin and Wilfong’s condition is that rule 3(a) never applies between route-reflectors with client egress routers. That is, for all route-reflectors u with a client in X we need

λu(Λ(u)) > λu(Λ(v)) ∀ route-reflectors v ≠ u.
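The three reliance rules can be turned into a short construction of the reliance graph for one egress instance. The sketch below is an illustration under stated assumptions rather than the implementation used in this thesis: the input layout (`clients_of`, `over`, `lam`) and all function names are hypothetical, border routers in X are assumed to be clients, and a missing best client egress is given rank −∞ as in the definition of Λ.

```python
NEG_INF = float("-inf")


def best_client_egress(u, clients_of, X, lam):
    """Lambda(u): the client of u in X that u ranks highest, or None."""
    candidates = [c for c in clients_of.get(u, ()) if c in X]
    if not candidates:
        return None
    return max(candidates, key=lambda c: lam[u][c])


def rank(u, egress, lam):
    """lambda_u(egress), with -infinity when there is no such egress."""
    return NEG_INF if egress is None else lam[u][egress]


def reliance_graph(reflectors, clients_of, over, X, lam):
    """Return the set of pairs (u, v) meaning 'u is reliant on v'.

    reflectors: set of route-reflectors
    clients_of: dict route-reflector -> set of its clients (the down arcs)
    over:       iterable of (u, v) over sessions, each listed once
    X:          set of border routers holding the equally good routes
    lam:        dict router -> dict egress -> rank (larger is preferred)
    """
    edges = set()
    best = {u: best_client_egress(u, clients_of, X, lam) for u in reflectors}
    for rr in reflectors:
        for client in clients_of.get(rr, ()):
            if client not in X:
                edges.add((client, rr))          # rule 1 (up arcs)
        if rr not in X and best[rr] is not None:
            edges.add((rr, best[rr]))            # rule 2 (down arcs)
    for u, v in over:
        for a, b in ((u, v), (v, u)):            # over sessions are symmetric
            if a in reflectors and b in reflectors:
                if a not in X and rank(a, best[b], lam) > rank(a, best[a], lam):
                    edges.add((a, b))            # rule 3(a)
            elif a not in reflectors and b not in reflectors:
                if a not in X and b in X:
                    edges.add((a, b))            # rule 3(b)
    return edges
```

With route-reflectors 3 and 4, clients 1 and 2 in X, and each route-reflector ranking the other's client higher, the result contains both (3, 4) and (4, 3), so the two route-reflectors rely on each other.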

3.5.2 Co-reliance Groups

Oscillation in a network occurs when two routers ri and rj alter their decision in response to each other’s change. Consequently, by the design of the reliance graph, for oscillation to occur there must be a path in the reliance graph from ri to rj and from rj to ri. Formally, ri and rj must be strongly connected². We define a co-reliance group Dk to be a strongly connected component of the reliance graph, and we denote D(I) as the set of all co-reliance groups of an egress instance I. According to graph theory, the co-reliance groups form a partition of the routers [103]; that is, each router is in exactly one co-reliance group (though many such groups may have only one node).

A router’s decision cannot oscillate independently. Consequently, for oscillation to occur, a co-reliance group must contain at least 2 nodes.
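Co-reliance groups can be found with any standard strongly connected components algorithm. The sketch below uses Tarjan's algorithm as one such choice; the function name `co_reliance_groups` and the edge-set input are assumptions, and the recursive form is adequate only for the small graphs considered here.

```python
def co_reliance_groups(nodes, edges):
    """Partition `nodes` into strongly connected components.

    edges: iterable of (u, v) pairs meaning 'u is reliant on v'
    Returns a list of frozensets, one per co-reliance group.
    """
    successors = {n: [] for n in nodes}
    for u, v in edges:
        successors[u].append(v)

    index, low = {}, {}
    stack, on_stack, groups = [], set(), []

    def visit(u):
        index[u] = low[u] = len(index)
        stack.append(u)
        on_stack.add(u)
        for v in successors[u]:
            if v not in index:
                visit(v)
                low[u] = min(low[u], low[v])
            elif v in on_stack:
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:
            # u is the root of a component: pop it off the stack.
            group = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                group.add(w)
                if w == u:
                    break
            groups.append(frozenset(group))

    for n in nodes:
        if n not in index:
            visit(n)
    return groups
```

Only groups with at least two members need further analysis, so a caller would typically keep `[g for g in groups if len(g) > 1]`.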

Now let us consider an example route-reflector topology, one satisfying the sufficient condition of Griffin and Wilfong. Figure 3.5.1(a) shows such a route- reflector hierarchy, black nodes denote route-reflectors, and white nodes denote client routers.

Suppose two egress routers (router 1 and router 2) have equally good routes up to the closest egress step of the BGP decision process, and hence routers 1, 2 ∈ X. These routers will always egress via the direct egress. Hence they do not rely on any other router decisions. Route-reflectors 3, 4 ∉ X, and hence rely on the decisions of router 1 and router 2, respectively. This reliance is illustrated by an arrow in Figure 3.5.1(a). Route-reflectors 5 and 6 rely only on the decisions made by route-reflectors 3 and 4, and client routers 7 and 8 rely on route-reflectors 5 and 6 respectively. In this simple example, all of the strongly connected components contain only one node, so each router is part of its own co-reliance group as indicated by the dotted ellipses. Thus there is a unique solution for router decisions. Hence the system is signaling correct.

In Figure 3.5.1(a) we assumed that IGP distances are such that λ3(1) > λ3(2), i.e., that the route-reflector 3’s client router 1 is preferred over router 2. Likewise we assumed λ4(2) > λ4(1). Now suppose that λ3(1) < λ3(2) (violating the sufficient condition of Griffin and Wilfong). In this case, the decision at route-reflector 3 is dependent on the decision made by route-reflector 4. If route-reflector 3 learns of router 2, via route-reflector 4, then it will prefer this egress point. Otherwise it will prefer its client. Hence, there is an additional reliance of route-reflector 3 on route-reflector 4, as shown in Figure 3.5.1(b). However, each co-reliance group still contains exactly one router and there is a unique solution, so the system is again signaling correct.

If we further change the network (see Figure 3.5.1(c)) such that route-reflectors 3 and 4 both prefer each other’s client router, that is, λ3(1) < λ3(2) and λ4(2) < λ4(1), then this introduces a further reliance between route-reflector 4 and route-reflector 3, and these two then form a single co-reliance group D3. In this case, the equilibrium choice of routes will depend on the timing of messages inside the co-reliance group D3. As multiple solutions are possible in this third scenario, it is not signaling correct. However, we will show in Section 3.7 that this system will not oscillate and hence is signaling stable. An interesting consequence is that although this configuration will not oscillate, its behavior is not derived purely by the configuration. Consequently, additional information is required to determine the actual decisions of routers. We consider this further in Chapter 4.

² For any two vertices u and v in a strongly connected component of a directed graph there exists a path from u to v, a path from v to u, and the component is the maximal such set containing these vertices.

Figure 3.5.1: An example egress instance. Panels: (a) λ3(1) > λ3(2), λ4(2) > λ4(1); (b) λ3(1) < λ3(2), λ4(2) > λ4(1). The direct egress set X = {1, 2} is indicated by large arrows. Black nodes are route-reflectors, white nodes are client routers, dashed lines represent iBGP sessions with no corresponding reliance, and solid lines indicate a reliance. Dash-dot lines indicate preferred clients, where these are not direct clients. Dotted ellipses indicate co-reliance groups.

3.6 Where Can An Oscillation Occur?

Routing oscillation can only occur when the configuration instance C is not signaling stable and only within a co-reliance group. This reduces the search for an oscillation to a search only inside co-reliance groups. We can reduce this search further still by eliminating singleton co-reliance groups. Note the sufficient condition of Griffin and Wilfong ensures no routers are strongly connected and hence all co-reliance groups are singleton. Let us now examine where in a general route-reflector topology a non-singleton co-reliance group can occur. In the following we define the egress ancestor set E ⊆ V as the union of X and the parent route-reflectors of X (thus X ⊆ E), and we use Ē to denote its complement.

Theorem 3.6.1 For all u ∈ E and v ∈ Ē, u f/ v.

Proof: Assume there exists a router u ∈ E and a router v ∈ Ē, such that u f v. Consider the formal reliance rules for route-reflection in Section 3.5.1:

1. If (u, v) ∈ over, then we must consider rule 3. Since v has no downstream egresses and is not an egress itself, u f/ v.

2. If (u, v) ∈ down then as v ∉ X, rule 2 implies u f/ v.

3. And (u, v) ∉ up, since u ∈ E and v ∈ Ē.

Thus our assumption is false.

Corollary 3.6.2 A co-reliance group cannot have routers in both E and Ē.

Theorem 3.6.3 A non-singleton co-reliance group D does not exist in Ē.

Proof: Assume a co-reliance group D has routers u1, ..., un ∈ Ē. Then there must exist ui, uj, uk ∈ D with ui ≠ uj and uj ≠ uk (ui, uk need not be distinct) such that ui f uj and uj f uk. Once again we must consider when reliances between these routers can exist.

1. If (ui, uj) ∈ over, since ui, uj ∈ Ē, neither rule 3(a) nor 3(b) applies. Hence, ui, uj have no reliance.

2. If (ui, uj) ∈ down, then as uj ∉ X, rule 2 implies ui f/ uj.

3. If (ui, uj) ∈ up, then rule 1 implies that ui f uj since ui ∉ X.

(a) If (uj, uk) ∈ over, then by 1), uj and uk have no reliance.

(b) If (uj, uk) ∈ down, then by 2) uj f/ uk.

Thus our assumption is false.

Corollary 3.6.4 A non-singleton co-reliance group D must be a subset of E.

These theorems show that non-singleton co-reliance groups can only occur in the egress ancestor set E. The direct egress set X will typically have only a few routers in it. Even a large network might only peer at a few dozen locations, creating on the order of a few dozen routers in X. Each such border router might have two route-reﬂectors (for redundancy), but rarely would they have substantially more. So E is likely to be much smaller than the complete network. Hence we need to search only a small portion of a network for potential oscillation. We can restrict our search even further due to the following result.

Theorem 3.6.5 A non-singleton co-reliance group D contains only route-reflectors in E. So D ⊆ E \ X.

Proof: By Corollary 3.6.4 all non-singleton co-reliance groups are in E. All border routers in E are also in X and select their direct external route. Hence they do not rely on any other router.
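Theorem 3.6.5 means that the search for non-singleton co-reliance groups can start from the set E \ X alone. A minimal sketch of computing that candidate set, assuming a hypothetical mapping `parents_of` from each client to its parent route-reflectors:

```python
def candidate_reflectors(X, parents_of):
    """Return E \\ X: the parent route-reflectors of the direct egress set X.

    X:          set of border routers with equally good external routes
    parents_of: dict client -> set of its parent route-reflectors
    """
    egress_ancestors = set(X)                 # E starts as X ...
    for x in X:
        egress_ancestors |= set(parents_of.get(x, ()))  # ... plus X's parents
    return egress_ancestors - set(X)
```

In the example of Figure 3.5.1, with X = {1, 2} and route-reflectors 3 and 4 as the respective parents, only routers 3 and 4 remain as candidates.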

In any network the number of route-reflectors must be an order of magnitude smaller than the total number of routers (otherwise there is little point to having a route-reflector hierarchy). In addition, the number of route-reflectors in the egress ancestor set is generally a fraction of the total number of route-reflectors (as all must have clients with equally good routes up to the IGP distance step). Thus the search space for co-reliance groups can be dramatically reduced. To locate strongly connected components there are standard graph algorithms, and given the size of the problems (typically less than a few tens of nodes once we have reduced the problem size), there are no performance problems on reasonably designed networks. The actual size of these groups in practice is very small; it is quite hard to construct practical network designs where the group size is larger than three.

A non-singleton co-reliance group is necessary for oscillation, but not sufficient. Just because such a group exists does not automatically imply oscillation, as we demonstrated in Figure 3.5.1. Hence, we need to perform further analysis to classify the behavior of these groups, which will be examined in the following sections.

3.7 Algebraic Description of Co-reliance Groups

We have shown that an oscillation can only occur within a co-reliance group, and non-singleton co-reliance groups will only ever exist between the parent route-reflectors of direct egress routers X. Consequently, every arc in the co-reliance group is an over edge, and every node in the co-reliance group will learn a route from a client. Reliance on another route-reflector implies that the route learned from that route-reflector is better than the client route. Thus, if available, the route learned indirectly from another route-reflector is selected. By the rules of iBGP, if a route-reflector learns a route from another route-reflector, it will not inform any other route-reflector about this route. We can use an algebraic abstraction of routing along the lines of [42, 99] to characterize this set of rules and analyze the behavior of co-reliance groups. We create a set of labels for edges and nodes in the graph, though we describe them with reference to the nodes as the description is simpler:

• direct (d): A router (node) selects its direct downstream client-learned route.

• indirect (i): A router (node) selects a route learned from another route-reflector.

• null route (φ): No route is selected.

The null route, φ, is used for completeness. However, as every node in the co-reliance group will have a downstream egress, no equilibrium solution will ever include φ. We use these labels in a routing algebra in the same vein as Sobrinho [99]. We describe the labeling set of possible route selections as defined above:

Σ = {d, i, φ},

with the preference relation:

i ≻ d ≻ φ,

that is, any route is always preferred over no route, and the indirect route is preferred over a direct route because of the definition of reliance between route-reflectors. A node’s route decision is made by applying this preference to the labels of its incoming reliance arcs.

The other key element of the algebra is a mapping function ⊕ which is applied when exporting a router’s best route to neighboring routers. In iBGP, routes are exported to iBGP neighbors, but the whole signaling graph need not be considered. We only need to consider the information flow along the arcs of the reliance graph, as these are the only information flows that can affect a router’s decision. We label outgoing arcs on the reliance graph by applying the operator:

⊕: d → i, i → φ, φ → φ.

That is, a router will not propagate an indirect route (so it uses the null label φ), and a direct route becomes indirect after propagation. A stable labeling is one in which no node has a better available route than the current chosen route. However, multiple stable labelings are possible.

As an example, consider the two node co-reliance group (D3) shown in Figure 3.5.1(c). We represent the two solutions of the co-reliance group in Figure 3.7.1. Both nodes have direct routes available via clients; however, if they ever learn of the other route-reflector’s route, they will select the indirect route. Message timing determines which solution is realized, but the system is guaranteed to settle on a solution and hence will not persistently oscillate³. Notice this example covers all two-node co-reliance groups and demonstrates that such groups cannot oscillate. Hence, for oscillation, we need at least three nodes in the group.

3We assume there is enough jitter in the system such that the probability of nodes simultaneously changing decisions is small. 62 CHAPTER 3. WHERE’S WALDO?


Figure 3.7.1: Stable solutions for a two node co-reliance group, showing the algebraic labels on edges and nodes. In addition to the co-reliance group, we also explicitly show the arcs from the direct egress set X, though in subsequent examples we will omit these because every node in the co-reliance group implicitly has such an edge available.

We can use an algebraic representation to characterize the behavior of a co-reliance group, for instance by simply enumerating states. Most co-reliance groups will be small, and so this is computationally tractable, but it is sometimes useful, for larger groups, to be able to reduce a large co-reliance group to a smaller group with identical oscillatory properties and hence reduce the computational complexity.

3.7.1 Reducing the Size of Co-reliance Groups

We will discover that the complexity of oscillation detection in an n-node co-reliance group is $2^n$. Hence a reduction in n can significantly affect the computation time. We now present one such reduction.

Theorem 3.7.1 An acyclic component can be reduced to a single component with a multiple input/multiple output (MIMO) function.

Proof: The decision of each node in an acyclic component is reliant only on its parents and so is a deterministic function of their decisions. Applying this argument repeatedly back up to the inputs, the decision of every node is a deterministic function of the inputs. Hence, the output of the acyclic component will be a MIMO function of the inputs.

Figure 3.7.2: Co-reliance group reduction. A three-node path can be replaced by a function (Output = Input) or equivalently a single node.

More importantly, there is a simple algorithm (a breadth-first traversal) for computing the MIMO function for a given input. This algorithm is linear in the number of nodes n, so the complexity for computing the full behavior of this component will be $2^m$ when there are m input edges, rather than $2^n$.

Consider the example shown in Figure 3.7.2. Here the original acyclic component consists of a single input/single output three-node configuration. We can replace the original component by a function that flips the input edge label. Note that we can represent the function by a node equivalent component (a single node in this example).

As we prove the stability of co-reliance groups in turn, we can imagine storing their reduced form in a library. Any new co-reliance group that can be reduced to a form already stored in the library does not require further enumeration. For example, any odd-length path can be reduced to a single node path as shown in Figure 3.7.2.
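The path reduction can be checked with a short traversal. The sketch below is an illustration rather than the thesis implementation: `path_output` is a hypothetical helper, the inbound edge is assumed to carry either i or φ (written `"phi"`), and each node is assumed to select i when offered i and its direct route d otherwise.

```python
EXPORT = {"d": "i", "i": "phi", "phi": "phi"}  # mapping applied on export


def path_output(input_label, length):
    """Label on the outbound edge of a simple path of `length` nodes.

    `input_label` is the label on the single inbound edge ("i" or "phi").
    The traversal visits each node once, so it is linear in the path length.
    """
    edge = input_label
    for _ in range(length):
        selection = "i" if edge == "i" else "d"
        edge = EXPORT[selection]
    return edge


for label in ("i", "phi"):
    # A three-node path and a single node give the same output label.
    assert path_output(label, 3) == path_output(label, 1)
    print(label, "->", path_output(label, 3))
```

Under these assumptions, every odd-length path flips the input edge label in the same way as a single node, which is why one library entry can stand in for all of them.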

3.7.2 Oscillation Detection

For a co-reliance group that matches no example in the library, with or without reduction, we use a state machine to determine its oscillatory properties. Each state is a labeling of nodes in the reliance graph. For example, idi represents one possible labeling of three nodes (Node 0: i, Node 1: d, Node 2: i). Each transition in the state machine represents a node altering its decision as it learns a better available route than currently selected. Recall we assume there is enough jitter in the system such that the probability of nodes simultaneously changing decisions is small. The actual transition depends on the timing of messages, and any transition from a state has a positive probability of occurring. As we are now dealing with two graph structures, we refer to states and transitions when considering the state machine, and nodes and edges when referring to the reliance graph.

There are two possible equilibria for the state machine. First, a stable state is one in which all nodes have selected their best currently available route. Such a state is a sink in the state machine graph. Second, an oscillatory mode is a subset of communicating states with no sink and no transitions out of the subset. If we enter such a mode, then the state machine oscillates persistently. Note that oscillation does not necessarily mean that the sequence of states a process enters is periodic.
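A minimal sketch of such a state machine follows. It rests on a simplified reading of the algebra that is an assumption here: a node's best route is i whenever some node it relies on currently selects d, and d otherwise, and exactly one node changes per transition. The function names are hypothetical, and the transitions drawn in the figures of this chapter may follow rules that this sketch does not capture.

```python
from itertools import product


def best_label(node, state, relies_on):
    """Best route for `node`: i if a node it relies on selects d, else d."""
    return "i" if any(state[p] == "d" for p in relies_on[node]) else "d"


def successors(state, relies_on):
    """States reachable by exactly one node changing its decision."""
    result = []
    for node, label in enumerate(state):
        best = best_label(node, state, relies_on)
        if best != label:
            result.append(state[:node] + best + state[node + 1:])
    return result


def state_machine(relies_on):
    """Map every labeling of the nodes to its list of successor states."""
    n = len(relies_on)
    return {"".join(s): successors("".join(s), relies_on)
            for s in product("di", repeat=n)}


def sinks(machine):
    """Stable states: states with no outbound transition."""
    return sorted(s for s, out in machine.items() if not out)


def can_reach_sink(machine):
    """States from which at least one sink is reachable."""
    good = set(sinks(machine))
    changed = True
    while changed:
        changed = False
        for s, out in machine.items():
            if s not in good and any(t in good for t in out):
                good.add(s)
                changed = True
    return good


ring3 = {0: [1], 1: [2], 2: [0]}  # node n relies on node (n + 1) mod 3
machine = state_machine(ring3)
print(sinks(machine))
print(machine["idi"])
```

A state that is not in `can_reach_sink(machine)` can never settle, so a non-empty complement signals an oscillatory mode in this model.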

3.7.3 Oscillation Classes

We can partition co-reliance groups into four disjoint categories describing their oscillation characteristics: Good, Asymptotically Good, Naughty and Bad. Each class is summarized in Table 3.7.1, and their relationship to signaling correct, stable, and unstable is shown in Figure 3.7.3. An egress instance/configuration is classified as the most oscillatory class across all its co-reliance groups.

A Good co-reliance group is one where every state can be visited at most once. Thus, there must exist at least one sink state which by definition is stable. By default, all singleton co-reliance groups are classified as ‘Good’. A more interesting example is a co-reliance group with four nodes {0, 1, 2, 3} shown in Figure 3.7.4(a). We arrange them such that node n is reliant on node (n + 1) mod 4 (each node has exactly one inbound and outbound reliance). In practice, this four-node single cycle co-reliance group would be reduced to its equivalent two-node representation; however, for demonstration purposes we examine it unreduced. In Figure 3.7.5 we show the state machine. It has 16 states. The state machine is acyclic with two sinks (didi and idid). This configuration has similar properties to the Good Gadget [45], although it is not signaling correct as multiple sinks exist, and we cannot determine which one of them the state machine will reach.

Group Name            State Machine Graph Properties
Good                  • No cycles
Asymptotically Good   • No oscillatory modes
                      • At least one cycle
                      • At least one sink
Naughty               • At least one oscillatory mode
                      • At least one sink
Bad                   • No sinks

Table 3.7.1: Properties of oscillation classes.

[Figure 3.7.3 region labels: Signaling Correct, Signaling Stable, Signaling Unstable; classes from Stable (left) to Oscillatory (right): Good, Asymptotically Good, Naughty, Bad.]

Figure 3.7.3: Oscillation classes Venn diagram. More stable classes are to the left and more oscillatory classes are to the right.

(a) ‘Good’ co-reliance group (b) ‘Bad’ co-reliance group

Figure 3.7.4: Example co-reliance groups for each oscillation class.

Figure 3.7.5: The state machine for the four-node ‘Good’ co-reliance group in Figure 3.7.4(a). Each state is labeled with the route selection for all nodes in Figure 3.7.4(a). Black states represent stable sinks. White states are transient⁴.

No sink states exist in the state machine of a Bad co-reliance group. An infinite number of transitions will occur given any message ordering. An example of a ‘Bad’ co-reliance group is the three-node single cycle shown in Figure 3.7.4(b). Figure 3.7.6 shows all possible states of the associated state machine and the transitions between states. As there are outbound transitions from every state, no state is stable, and hence the configuration is oscillatory with a cycle consisting of (idi, idd, iid, did, dii, ddi). This configuration is similar to the Bad Gadget [45]. Note that a cycle may not always be deterministic. That is, there may be multiple cycles in the state machine of a co-reliance group, and the actual transitions taken may be aperiodic.

A Naughty co-reliance group has at least one sink. However, it also has an oscillatory mode. The five-node configuration shown in Figure 3.7.4(c) demonstrates a possible scenario. For this example, the state machine representation is shown in Figure 3.7.7. There exists a stable state (ididi); however, there also exists a cycle (diiid, ddiid, idiid, iddid, iidid, didid) from which exit is impossible⁵. Consequently, depending on the starting state and the ordering of messages, the configuration may be oscillatory. This configuration has similar properties to the Naughty Gadget [45]. This simple example demonstrates that even if a network is currently stable, it may not remain stable in the future! The potential for latent instability is one of our key findings and motivates the need for analysis techniques such as ours.

⁵ Note that states dddid and iiiid are not part of the cycle; however, every exit from these states leads to a state in the cycle.

Figure 3.7.6: The state machine for the three-node ‘Bad’ co-reliance group in Figure 3.7.4(b). There are no stable sink states. Shaded states show the cycle from which there is no exit.

Figure 3.7.7: The state machine for the five-node ‘Naughty’ co-reliance group in Figure 3.7.4(c). There exists both a locked cycle and a stable state. The single black state is a sink state while the shaded states are those that once entered guarantee persistent oscillation. The actual state transitions determine whether the system is stable or oscillatory.

An Asymptotically Good co-reliance group will settle on a stable labeling after a finite time. Such a co-reliance group has a state machine with a cycle in it. However, every cycle is unlocked in that it has an ‘exit’ such that a sink is reachable. For example, the co-reliance group in Figure 3.7.4(d) has a cycle (diii, ddii, idii, iddi, iidi, didi) as shown in Figure 3.7.8. However, there are transitions ((diii, diid) and (ddii, ddid)) that result in eventual escape from the cycle and reaching the sink (idid).

3.7.4 Reliances between Co-reliance Groups

So far we have investigated co-reliance groups in isolation. However, co-reliance groups can be reliant on other co-reliance groups containing route-reflectors. A reliance of a node $n_i$ in co-reliance group $D_i$ on another node $n_j$ in co-reliance group $D_j$ can alter the oscillatory properties of the co-reliance group $D_i$. Moreover, by the definition of a co-reliance group, as $n_i$ is reliant on $n_j$, and $n_i$ and $n_j$ are in distinct co-reliance groups, $n_j$ cannot be reliant on $n_i$. Consequently, if we can evaluate the properties of co-reliance group $D_j$ prior to $D_i$, then we will know the label (or set of feasible labels) for $n_j$, and consequently the label of edge $(n_j, n_i)$ is fixed if $D_j$ does not oscillate. If $D_j$ does oscillate, then the oscillatory mode of $D_i$ is irrelevant as the egress instance is already oscillatory. This is true for all inbound edges to a co-reliance group.

An inbound edge from another route-reflector can only be labeled φ or i by the definition of the algebra. When an inbound edge is labeled φ, no additional information is available in the co-reliance group, and thus this situation is equivalent to the group being considered in isolation. However, when it is labeled i, the reliant node in the co-reliance group is fixed to be i. This makes a number of states in the state machine inaccessible⁶. Let us now look at the impact on the oscillation classes of co-reliance groups.

⁶ Although fixed, the label of the inbound edge may be dependent on the state chosen by a previously analyzed co-reliance group (which may have multiple solutions). Consequently, we analyze the properties of the co-reliance group under all feasible inbound edge combinations (labeled either i or φ).

Figure 3.7.8: The state machine for the four-node ‘Asymptotically Good’ co-reliance group in Figure 3.7.4(d). Lightly shaded states show the ‘unlocked’ cycle. The black state is a sink.

Theorem 3.7.2 If an inbound edge to a co-reliance group is labeled i, then its state machine is a sub-graph of the state machine of the co-reliance group when considered in isolation.

Proof: If an inbound reliance edge is connected to node $u_j$ and is labeled i, node $u_j$ is fixed to select i (as no route can be better than i). Consequently, only states in the isolated co-reliance group state machine with $u_j$ labeled i are feasible. Also, no new transitions between states are possible. Hence, the state machine of the co-reliance group is a subgraph of the isolated co-reliance group state machine.

Corollary 3.7.3 If a co-reliance group is classified ‘Good’ in isolation, then it will be classified as ‘Good’ with any inbound edges fixed by other co-reliance groups.

Proof: Suppose a node in a ‘Good’ co-reliance group (classified in isolation) has a reliance on a route-reflector outside the co-reliance group. If the inbound edge is φ, no additional information enters the co-reliance group, and the situation is equivalent to the isolated co-reliance group. If the inbound edge is i, by Theorem 3.7.2 the state machine of the new co-reliance group is a subgraph of the state machine of the isolated co-reliance group. Since no cycles exist in the original state machine, no cycles can exist in the subgraph. Hence, it is also ‘Good’.
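The subgraph property of Theorem 3.7.2 can be illustrated on a small example. The sketch assumes the same simplified transition rule as before (a node's best route is i exactly when some node it relies on selects d, and one node changes per transition); `transitions` and `fixed_to_i` are hypothetical names.

```python
from itertools import product


def transitions(relies_on, fixed_to_i=frozenset()):
    """Set of (state, next_state) pairs of the state machine.

    Nodes in `fixed_to_i` receive an inbound edge labeled i from outside
    the group, so they always select i and never change.
    """
    n = len(relies_on)
    edges = set()
    for labels in product("di", repeat=n):
        state = "".join(labels)
        if any(state[u] != "i" for u in fixed_to_i):
            continue  # infeasible: a fixed node must be labeled i
        for node in range(n):
            if node in fixed_to_i:
                continue
            hears_i = any(state[p] == "d" for p in relies_on[node])
            best = "i" if hears_i else "d"
            if best != state[node]:
                edges.add((state, state[:node] + best + state[node + 1:]))
    return edges


ring3 = {0: [1], 1: [2], 2: [0]}  # node n relies on node (n + 1) mod 3
isolated = transitions(ring3)
restricted = transitions(ring3, fixed_to_i={2})
print(restricted <= isolated)  # every restricted transition also exists in isolation
```

Fixing a node removes states and transitions but never adds any, which is the content of the theorem in this model.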

The same cannot be said for other oscillation classes. A subgraph of the state machine can result in an unlocked cycle becoming an oscillatory mode or becoming acyclic, both of which will alter the oscillation class. For example, an inbound edge labeled i to node 3 in Figure 3.7.4(d) will fix node 3 to be labeled i. The unlocked cycle in Figure 3.7.8 now becomes locked, and the previous stable state is no longer valid, as shown in Figure 3.7.9. The co-reliance group would now be classified as ‘Bad’.

We have now demonstrated how to determine the oscillatory properties of a co-reliance group. We have also shown that co-reliance groups can only exist in a subset of route-reflectors for a particular egress instance. If an egress instance contains all singleton co-reliance groups, or all non-singleton co-reliance groups are signaling stable, then the egress instance is signaling stable. Further, if all feasible egress instances in an iBGP configuration are signaling stable, then the configuration is signaling stable. Consequently, our analysis is able to determine the oscillatory properties of a configuration. In addition, when a configuration is oscillatory, we can precisely determine the type of oscillation and the routers responsible.

So far our analysis has focused on the case when the lowest-router-id is used as the tie-break option in step 8 of the BGP decision process. We now consider the currently implemented alternative.

Figure 3.7.9: The state machine of the four-node ‘Asymptotically Good’ co-reliance group in Figure 3.7.4(d) with inbound i at node 3. The cycle now becomes locked, and the stable state no longer exists.

3.8 Oldest-Route Tie-breaker

The benefit often associated with the lowest-router-id tie-breaker is the determinism associated with it. However, as we have seen, even with the lowest-router-id, an egress instance can be non-deterministic. The solution the system settles on depends on message timing. For instance, for a two-node co-reliance group, that is, two route-reflectors that each prefer the other’s client, the stable states are di and id (see Figure 3.7.1), and the solution that is eventually chosen typically depends on which route-reflector learns of the other’s client first.

In contrast, the oldest-route tie-breaker was designed to reduce the number of routing changes. We show here that it is also likely to reduce the possibility of persistent oscillation.

All of the reliances we have considered so far have been strong reliances. That is, if a route-reflector learns a route from one of its reliances on another route-reflector, it will select that route. This is reflected in the algebra, i.e., $i \succ d$. However, when the oldest-route tie-breaker step is used, this is not always the case. Strong reliances still exist when the IGP distance “breaks the tie”. However, if the IGP distance is equal for multiple routes and the oldest route is the tie-breaker, then the reliance is weak. That is, if a route is learned from another node, the weak reliance implies the node may select the route learned from this node.

                  Reliance
Node Label     Weak     Strong
d              i_w      i_s
i_s            φ        φ
i_w            φ        φ
φ              φ        φ

Table 3.8.1: Table showing the result of ⊕ for weak and strong reliances.

Such a configuration is not as simple to describe, as the BGP decision process is now dependent on message timings. However, when a reliance is weak and a node selects its direct route, it will never change its selection (as it is always available and is the oldest). Figure 3.8.1 shows the state machine for the three-node oscillatory configuration from Figure 3.7.4(b) inclusive of the new tie-break rule such that all reliances are now weak. The state machine is dramatically simplified, as compared to the state machine shown in Figure 3.7.6. Many of the state transitions have been removed because they will never occur under the new tie-break rule. The result is a state machine that now has four stable sink states.

The new tie-break rule still allows for strong reliances where there is no tie in the IGP distances. Our approach in modeling this new case is to introduce an extended algebra description with strong ($i_s$) and weak ($i_w$) indirect routes. That is, $\Sigma = \{d, i_s, i_w, \phi\}$ with the preference relation:

$$i_s \succ i_w \simeq d \succ \phi,$$

where under the $\simeq$ operator, the oldest route is chosen. It would perhaps be more elegant to include timing into the algebra directly to resolve the $\simeq$ preference, but this complicates it substantially, and is unnecessary for the proof to follow. The arcs in the reliance graph are now labeled weak or strong, and the mapping function ⊕ now depends on this labeling as shown in Table 3.8.1.


Figure 3.8.1: The state machine of the single cycle three-node co-reliance group with only ‘weak’ reliances. Four states are now stable.
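The all-weak case can be reproduced with a small enumeration. This is a sketch under assumptions: `weak_successors` is a hypothetical helper, a node on d is taken never to move (its direct route is always available and is the oldest), and a node on i is taken to fall back to d only when no node it relies on still selects d.

```python
from itertools import product


def weak_successors(state, relies_on):
    """Transitions when every reliance is weak (oldest route wins ties)."""
    result = []
    for node, label in enumerate(state):
        offered_i = any(state[p] == "d" for p in relies_on[node])
        if label == "i" and not offered_i:
            # The indirect route has been withdrawn: fall back to d.
            result.append(state[:node] + "d" + state[node + 1:])
    return result


ring3 = {0: [1], 1: [2], 2: [0]}  # node n relies on node (n + 1) mod 3
states = ["".join(s) for s in product("di", repeat=3)]
stable = sorted(s for s in states if not weak_successors(s, ring3))
print(stable)
```

Every transition in this model turns one i into a d, so no state can be revisited, and the enumeration finds four states with no outbound transition.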

Theorem 3.8.1 A single cycle consisting of at least one weak reliance will never oscillate.

Proof: Consider a single cycle of reliances in which $r_1$ is reliant on $r_2$, $r_2$ on $r_3$, and so on, with $r_n$ reliant on $r_1$, such that at least one of these reliances is weak, that is, $r_j$ is weakly reliant on $r_{j+1}$ for some j. At some point in time the information available to $r_j$ will lead to a decision about its state, and that state $x \in \{d, i_w\}$ because of the weak reliance. If the state x = d, then by construction the direct route is always available, so $r_j$ will continue to use x = d (as it will from now on be the oldest available route). As soon as one node is fixed, it breaks the cyclic dependence, and removes the possibility of oscillation.

If $x = i_w$, then $r_j$ will transmit φ to $r_{j-1}$, which will therefore choose its direct route d, and subsequently $r_{j-2}$ will receive $i_s$ or $i_w$ and make the appropriate decision, again transmitting this to its upstream neighbor. This will continue around the cycle until returning to $r_j$. When the cycle returns to $r_j$, there are only two possibilities. Either $r_j$ receives $i_w$, in which case the current state is stable, or $r_j$ receives φ, in which case it changes its choice to d, and the situation reverts to the case discussed above.

Theorem 3.8.2 If every cycle within a co-reliance group contains at least one weak reliance, the co-reliance group will never oscillate.

Proof: Initially, we classify nodes in a co-reliance group. A primary node is one in which there is exactly one inbound edge and exactly one outbound edge. A fork is a node with one inbound edge and multiple outbound edges. A join is a node with multiple inbound edges and one outbound edge. A compound node is a node with multiple inbound and outbound edges.

Consider the co-reliance group D and choose a fork or compound node $r_j$. If no such node exists, then D must be a single cycle. Now follow an outbound edge $(r_j, r_{j+1})$ from $r_j$. If $r_{j+1}$ is a primary node, follow the unique outbound edge $(r_{j+1}, r_{j+2})$ to $r_{j+2}$. Continue this process until we reach a fork, join or compound node $r_m$. If $r_m$ is a fork, restart this process from $r_m$. If $r_m$ is a join or compound node, define the component $C_n$ to be the path from $r_j$ to $r_m$ via $r_{j+1}$, and remove $C_n$ from D. Now reclassify all nodes.

We can remove a sequence of components $C_n, C_{n-1}, C_{n-2}, \ldots, C_2$ from D until there are no more forks or compound nodes, and we are left with a single cycle $C_1$ consisting of primary nodes. By Theorem 3.8.1, $C_1$ will not oscillate.

For convenience, let us write $D_i$ to represent the co-reliance group consisting of the original single cycle $C_1$ with components $C_2, \ldots, C_i$ added in order (and $D_1 = C_1$).

Let us now consider $D_2$. Take any stable labeling of $D_1$ and consider the effect of connecting $C_2$ to $D_1$ at nodes $r_j$ ($C_2$ starts with an outbound edge from $r_j$) and $r_m$ ($C_2$ finishes with an inbound edge to $r_m$). Note $r_j$ and $r_m$ need not be distinct.

We can deterministically label all nodes in $C_2$ based on the current label of $r_j$. Let us consider all possibilities for $r_m$.

1. $r_m$ is currently labeled $i_s$. No inbound edge label from $C_2$ will alter the decision of $r_m$. Consequently, no node will alter its decision, and $D_2$ will not oscillate.

2. $r_m$ is currently labeled $i_w$.

   (a) If the inbound edge from $C_2$ is $i_s$, it will cause $r_m$ to choose $i_s$. However, this change causes no change to the outbound edges of $r_m$. Consequently, $D_2$ will not oscillate.

   (b) If the inbound edge from $C_2$ is $i_w$ or φ, $r_m$ will not alter its decision. Consequently, $D_2$ will not oscillate.

3. $r_m$ is currently labeled d.

   (a) If the inbound edge is labeled $i_w$ or φ, $r_m$ will not alter its decision. Consequently, $D_2$ will not oscillate.

   (b) If the inbound edge is labeled $i_s$, $r_m$ will alter its decision to select $i_s$. Propagate this decision from $r_m$ to $r_j$ using $D_1$. If $r_j$ does not alter its decision, then $D_2$ will not oscillate. If $r_j$ alters its decision, there are two possible causes.

      i. The path from $r_m$ to $r_j$ contains at least one weak reliance (in which all nodes with weak reliance were labeled $i_w$). All these nodes will now be labeled d and hence now are locked to this selection. As this is the only path from $r_m$ to $r_j$, there are no cycles that will oscillate in $D_2$, and hence $D_2$ will not oscillate.

      ii. The path from $r_m$ to $r_j$ contains only strong reliances. Due to the hypothesis of this theorem, there must be a weak reliance on the original path in $D_1$ from $r_j$ to $r_m$, and the new component $C_2$ also must have at least one weak reliance. Now propagate the routing decisions from $r_j$ to $r_m$ via both $D_1$ and $C_2$. If $r_m$ does not alter its decision, then $D_2$ will not oscillate. If $r_m$ does alter its decision, then at least one of the inbound edges to $r_m$ must have altered its label. However, for this to occur, all the nodes with weak reliances on these paths (those paths with altered inbound edges to $r_m$) must previously have been labeled $i_w$ and now be labeled d. Hence they are locked and will not change in the future. All other paths (in this case there is a maximum of 1) with inbound edges must not have altered their label, as at least one node with a weak reliance in these paths must have already been labeled d and thus are locked. Hence, all paths from $r_j$ to $r_m$ are locked, and $D_2$ will not oscillate.

We have therefore shown that if we connect $C_2$ to $D_1$, the resultant co-reliance group $D_2$ will not oscillate.

Now let us assume the co-reliance group $D_k$ will not oscillate, and consider the co-reliance group $D_{k+1}$. That is, let us add component $C_{k+1}$ to $D_k$. We undertake an analysis similar to the case for $C_2$.

The only difference to the analysis for $D_2$ is that there now may exist multiple paths from $r_j$ to $r_m$ in $D_k$ and from $r_m$ to $r_j$ in $D_k$. However, the argument proceeds identically up to case 3(b), where the statement now must change slightly as follows.

If an inbound edge from $C_{k+1}$ to $r_m$ (which was labeled d in $D_k$) is labeled $i_s$, then $r_m$ alters its label to $i_s$. Propagate this decision from $r_m$ to $r_j$ using all paths in $D_k$. If $r_j$ does not alter its decision, then $D_{k+1}$ will not oscillate. If $r_j$ alters its decision, there are two possible causes.

i. All paths from $r_m$ to $r_j$ contain at least one weak reliance (in which all nodes with weak reliance were labeled $i_w$). All these nodes will now be labeled d and hence now are locked to this selection. As we have considered all paths from $r_m$ to $r_j$, there are no cycles which will oscillate in $D_{k+1}$, and hence $D_{k+1}$ will not oscillate.

ii. There exists a path from $r_m$ to $r_j$ that contains only strong reliances. Due to the hypothesis of this theorem, there then must be a weak reliance on all original paths in $D_k$ from $r_j$ to $r_m$, and the new component $C_{k+1}$ also must have at least one weak reliance. Now propagate the routing decisions from $r_j$ to $r_m$ via all paths in $D_k$ and the path $C_{k+1}$. If $r_m$ does not alter its decision, then $D_{k+1}$ will not oscillate. If $r_m$ does alter its decision, then at least one of the inbound edges to $r_m$ must have altered its label. However, for this to occur, all the nodes with weak reliances on these paths must previously have been labeled $i_w$ and now be labeled d. Hence they are locked and will not change in the future. All other paths with inbound edges must not have altered their label, as at least one node with a weak reliance in these paths must have already been labeled d and thus are locked. Hence, all paths from $r_j$ to $r_m$ are locked, and $D_{k+1}$ will not oscillate.

We have now shown that if the co-reliance group $D_k$ is stable, then the co-reliance group $D_{k+1}$ is also stable. Therefore, by the Principle of Mathematical Induction, if a co-reliance group contains only cycles with at least one weak reliance, the co-reliance group will not oscillate.

It follows that a cycle of strong reliances is necessary for a co-reliance group to oscillate. Consequently, we can essentially discard all weak reliances in a co-reliance group to analyze oscillatory properties! However, such weak reliances may provide inbound edge labels with the same properties as in Section 3.7.4. Thus, if a co-reliance group contains no strong cycles, it is signaling stable. Hence, in general the oldest-route tie-breaker, although possibly less deterministic (as there are more stable states and the choice depends entirely on historic message timing), is likely to be less oscillatory than the lowest-router-id tie-breaker.
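Checking for a cycle of strong reliances amounts to cycle detection on the reliance graph with the weak arcs removed. The sketch below uses a standard depth-first search; the function name and the input format (each node mapped to a list of `(other_node, kind)` pairs) are assumptions made for illustration.

```python
def has_strong_cycle(reliances):
    """Detect a cycle made only of strong reliances.

    `reliances` maps each node to a list of (other_node, kind) pairs,
    where kind is "strong" or "weak". Weak arcs are discarded first.
    """
    strong = {u: [v for v, kind in arcs if kind == "strong"]
              for u, arcs in reliances.items()}
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {u: WHITE for u in strong}

    def visit(u):
        colour[u] = GREY
        for v in strong[u]:
            if colour.get(v, BLACK) == GREY:
                return True  # back edge: a strong cycle exists
            if colour.get(v) == WHITE and visit(v):
                return True
        colour[u] = BLACK
        return False

    return any(colour[u] == WHITE and visit(u) for u in strong)


all_strong = {0: [(1, "strong")], 1: [(2, "strong")], 2: [(0, "strong")]}
one_weak = {0: [(1, "strong")], 1: [(2, "weak")], 2: [(0, "strong")]}
print(has_strong_cycle(all_strong))
print(has_strong_cycle(one_weak))
```

A group for which this returns False has no strong cycle and, by the argument above, cannot oscillate. A True result only means that the algebraic analysis of Section 3.7 is still required.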

3.9 Prioritizing Egress Instances

We must remember that we are trying to solve an NP-hard problem. In the worst case, the problem is still exponential in the number of routers. However, in practice the actual set of egress instances in use is substantially smaller. Consequently, we can prioritize the egress instances which are currently in use over those that may be in use in the future over those that can never be used. Such prioritization allows a tool which analyzes the oscillatory properties of a network to quickly detect the most important oscillatory modes: those that are currently occurring.

We first illustrate the computation required to analyze a single egress instance in Section 3.9.1. We argue the complexity of such an analysis is polynomial in the number of route-reflectors in the egress ancestor set on almost all occasions.

In Section 3.9.2 we undertake our initial prioritization of egress instances. Due to import policies on border routers, many egresses will never be used in combination. By prioritizing the analysis of the feasible egress instances under the current policy of the AS, we are able to prove the stability of the current network given any external route availability prior to analyzing other scenarios.

Although our prioritization in Section 3.9.2 is able to remove a significant number of egress instances from analysis, the problem may still be intractable, especially on a dynamic network whose properties change regularly. With additional data from route-monitors on all border routers, we can determine the actual egress instances used at any time. Consequently, we prioritize these egresses over all others in Section 3.9.3.

In practice, route-monitors are unlikely to be connected to all border routers. This influences our observation of egress instances. A snapshot of router decisions may not discover all the currently available border routers, especially if route-oscillation is occurring. In Section 3.9.4 we take advantage of the timing of data recorded at monitors to prioritize those egress instances that could possibly be in current operation.

We implement our analysis on the prioritized set of egress instances in Section 3.9.5, finding our analysis takes under 15 minutes to complete on a topology derived from a large Tier 2 AS. However, such an analysis may still be intractable given a larger number of border routers. Consequently, in Section 3.9.6, we outline an online tool that is able to further prioritize those egress instances requiring analysis, quickly detect oscillatory modes and deal efficiently with network dynamics.

3.9.1 Proving the Stability of an Egress Instance

Given an egress instance I, we can determine its oscillatory properties. First, we must find the reliance graph. However, we do not need to discover the entire reliance graph, only the egress ancestor set. The computation required for this is of order $N^2$, where $N = |E \setminus X|$. This result comes from our need to check if each route-reflector is closer to each other route-reflector’s best client egress. Note that in practice, N is generally an order of magnitude smaller than the total number of routers in the network (this is the purpose of a route-reflector hierarchy).

Second, if a co-reliance group is non-singleton, we must analyze its properties using our algebraic approach. If no reductions can be applied and the co-reliance group contains the maximum number of nodes (in the worst case), the number of states requiring enumeration in the state machine is $2^N$. Logical network design, where route-reflectors are typically close to their clients, significantly reduces the size of co-reliance groups. In most cases co-reliance groups have one or two nodes, requiring no algebraic analysis. Consequently, the dominating factor for proving the stability of a single egress instance is usually finding the reliance graph, and we therefore argue the dominant computation is $O(N^2)$ for most networks⁷.
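The pairwise check behind the $N^2$ cost can be sketched as a double loop. The reliance test used here (a route-reflector r is taken to rely on s when r is strictly closer, in IGP distance, to the best client egress of s than to its own) is an assumed simplification for illustration only. The function name, the data layout and the distances are hypothetical.

```python
def reliance_arcs(reflectors, best_client_egress, igp_distance):
    """Arcs (r, s): r is closer to s's best client egress than to its own.

    Two nested loops over the route-reflectors give N * (N - 1) comparisons.
    """
    arcs = []
    for r in reflectors:
        own = igp_distance[r][best_client_egress[r]]
        for s in reflectors:
            if s != r and igp_distance[r][best_client_egress[s]] < own:
                arcs.append((r, s))
    return arcs


# Hypothetical two-reflector example: each is closer to the other's client.
reflectors = ["RR1", "RR2"]
best_client_egress = {"RR1": "A", "RR2": "B"}
igp_distance = {"RR1": {"A": 10, "B": 5}, "RR2": {"A": 3, "B": 8}}
print(reliance_arcs(reflectors, best_client_egress, igp_distance))
```

In this example each reflector relies on the other, which gives the two-node co-reliance group discussed in Section 3.7.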

3.9.2 Proving the Stability of a Configuration

In this section, we examine the number of egress instances requiring analysis to prove the stability of a configuration. A large network has many potential egress instances, $|I| = 2^{|B|} - 1$, where $|B|$ is the number of border routers, and in principle we would need to analyze all of them to prove network stability. However, we can restrict this enumeration substantially by considering an extra component to a network configuration: import policies.

Import policies are placed on border routers to determine preferences for routes. A common policy is to prefer customer-learned over peer-learned over provider-learned routes by setting the local-preference attribute. Consequently, if a route is learned from a customer, the routes available at the IGP distance stage of the BGP decision process will not include any peer-learned or provider-learned routes at any router. Hence, the sets of customer-access, peering and provider-access routers can be analyzed disjointly. That is, the number of egress instances requiring analysis can be reduced to

$$|I| = 2^{|\text{customer access}|} + 2^{|\text{peering}|} + 2^{|\text{provider access}|} - 3.$$

We are interested in determining the oscillatory properties of large networks. Such networks will generally fall in the top few tiers of the AS level topology. Top-tier ASes are likely to have a much larger number of customers than peers or providers (consistent with the tiered structure of the AS level topology [37]). Consequently, the number of customer-access routers ($|\text{customer access}|$) will dominate the number of egress instances requiring analysis. However, once again import policies allow us to instantly discount a range of combinations. Good network practice (which is likely to be employed in an AS that is interested in determining if their network is oscillating) is to place filters on customer routes so that customers cannot announce prefixes that they do not own. Consequently, these filters prevent many customer-access routers ever being used in combination for any prefixes. Hence, we can prioritize the set of egress instances requiring enumeration to those customer-access routers that can feasibly be used in combination.

⁷ In the special case where a topology is shown to be fm-optimal [13] no further checking is required.

In summary, although in the worst case the number of egress instances requiring enumeration is still exponential in the number of border routers, practical network design restricts the actual number of egress instances requiring enumeration.
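A quick calculation shows the size of the reduction obtained by analyzing the three router classes separately. The router counts below are hypothetical and chosen only to illustrate the two formulas of this section. The function names are likewise illustrative.

```python
def all_instances(num_border_routers):
    """Non-empty subsets of the border routers: 2^|B| - 1."""
    return 2 ** num_border_routers - 1


def partitioned_instances(customer, peering, provider):
    """Non-empty subsets within each class, counted class by class."""
    return 2 ** customer + 2 ** peering + 2 ** provider - 3


# Hypothetical AS: 10 customer-access, 4 peering, 2 provider-access routers.
print(all_instances(10 + 4 + 2))        # 65535
print(partitioned_instances(10, 4, 2))  # 1041
```

The customer-access term accounts for 1024 of the 1044 subsets before the three empty sets are subtracted, in line with the observation that this class dominates.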

3.9.3 Checking the Stability of the Current Network

Although we have reduced the feasible number of egress instances requiring enumeration, there may still be too many egress instances to analyze in reasonable time. Consequently, we take the pragmatic view of prime importance to a network operator: is the current network state oscillatory? Within many networks, iBGP route-monitors collect information regarding the current selections of routers. From the egress selections of all routers for all prefixes, we can determine all the current egress instances. By prioritizing these egress instances, we are able to evaluate the oscillatory nature of the current network before checking other egresses not currently in use.⁸ We see in Figure 3.9.1 how the set of all egress instances is partitioned. We will continue in the next few sections to prioritize this set further.

⁸ Note that we can further prioritize egress instances based on the number of changes required for them to be realized in practice.

[Figure: nested sets of egress instances, highest priority first.
  1: The single egress instance with all egresses currently observed
  5: All egress instances with compatible import policies
  6: All egress instances]

Figure 3.9.1: Prioritization of the egress instances currently used in the AS.

3.9.4 Checking the Stability of the Current Network with Limited Measurement Infrastructure

The above analysis requires us to have route-monitors on every border router (which could be every router). In practice, due to operational constraints and the storage requirements for this amount of data, it is unlikely that all routers will be connected to a route-monitor. Instead, a more practical set of routers for monitors to be connected to is the route-reflectors. This practical constraint prevents us from knowing the precise set of egresses available at any time.⁹ For instance, if oscillation is occurring, the egresses causing the oscillation may never be selected at the same time by route-reflectors, and consequently it is no longer guaranteed that we will see all egresses available at any particular time. If oscillation is occurring, we will eventually see all egresses in use on route-reflectors after a period of time. Consequently, instead of simply using a snapshot of egresses currently being used, we find all egresses used within a time window.

This now gives us the egresses used for each prefix (illustrated by the set ‘2’ in Figure 3.9.2). Note that egress instance sets ‘1’ and ‘2’ may not be distinct. A configuration may have a ‘Naughty’ oscillatory mode which may not be identified using this technique (as we may not learn of all routes available at a particular time). However, we are not restricting our search, simply prioritizing the egresses we analyze so that the most critical oscillatory modes are detected first.

Because we examine a dynamic system over time, external BGP dynamics could occur during this window and affect our analysis. Consequently, it is no longer valid to assume all egresses in the window are currently available. Hence, we must also analyze all subsets of the egresses in the window for each prefix (shown by set ‘4’ in Figure 3.9.2). If the maximum number of egresses for any prefix is small, this analysis is fast (as shown in Section 3.9.5). However, if the number of egresses is large then the analysis may take a significant period of time or even become infeasible. Hence, we outline our final prioritization in Section 3.9.6.

⁹ Route-monitors only record the RIB-out of the monitored router (see Section 2.4.5). If we were able to record the RIB-in of all route-reflectors, our ability to detect oscillation would be greatly increased.

3.9.5 Practical Implementation

In this section, we determine the current stability of a network by analyzing the current set of feasible egress instances (that is, set ‘4’ in Figure 3.9.2). We use data collected at BGP and IGP monitors throughout a large (about 500 routers) Tier-2 AS to find the current egresses used for all prefixes and the distances between routers.

As the Tier-2 AS we examined had a three-level hierarchy and employed the oldest-route tie-break option, we adapted the topology to fulfill our assumptions. We began by compressing the topology to two levels, assuming all routers not in the central full-mesh had direct iBGP sessions with all parents and grandparents¹⁰ in the full-mesh. Next, we assumed the lowest-router-id was used as the tie-break option.

We recorded all egresses used in combination over a 2 hour interval. Using this technique, we found 954 unique egress instances. The maximum number of border routers in an egress instance was 17.¹¹ All combinations of current egresses were analyzed. That is, if 3 border routers were in an egress instance, then all 7 non-empty subsets were analyzed. This raised the number of egress instances requiring analysis to 204,621. We found the reliance graph (of the egress ancestor set) for all egress instances and found that 60,304 egress instances violated the sufficient condition of Griffin and Wilfong [49]. That is, there were reliances between route-reflectors in the egress ancestor set; all such reliances were a result of equal IGP distances and the lowest-router-id tie-break. However, none of these reliances resulted in a non-singleton co-reliance group. Hence the current set of egress instances would not oscillate (even though the sufficient condition of Griffin and Wilfong was violated). This analysis took under 15 minutes to carry out. It shows that determining current stability in a network is feasible.

¹⁰ The parents of all parent routers.
¹¹ The BGP data used for this analysis may not have included all available egresses. This is a limitation of the BGP data collected in the network analyzed.

[Figure: nested sets of egress instances, highest priority first.
  1: The single egress instance with all egresses currently observed
  2: The single egress instance with all egresses currently observed and all egresses recently observed
  4: All egress instances with any combination of recent and currently observed egresses
  5: All egress instances with compatible import policies
  6: All egress instances]

Figure 3.9.2: Prioritization of egress instances consistent with available measurement data.
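The subset expansion used in Section 3.9.5 (every observed egress instance contributes all of its non-empty subsets, with duplicates across instances counted once) can be sketched as follows. This is an illustrative reconstruction under our own naming, not the code used for the analysis; `nonempty_subsets` and `instances_to_analyze` are hypothetical helpers.

```python
from itertools import combinations


def nonempty_subsets(egresses):
    """Yield every non-empty subset of a collection of egress routers."""
    items = sorted(egresses)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def instances_to_analyze(observed_instances):
    """Union of the non-empty subsets of every observed egress instance.

    Subsets shared by several observed instances appear only once.
    """
    result = set()
    for instance in observed_instances:
        result.update(nonempty_subsets(instance))
    return result


# Two hypothetical observed egress instances that share router "b".
expanded = instances_to_analyze([{"a", "b"}, {"b", "c"}])
```

A single instance with 17 border routers alone contributes $2^{17} - 1 = 131{,}071$ subsets, which is why the number of egresses per prefix, rather than the number of observed instances, drives the cost of this step.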

3.9.6 Online Tool

Our goal is to provide an operator with the capability to monitor their network as it evolves. This capability could be captured in an online tool, a continuously running system which alerts the operator if an oscillatory mode is identified. For our analysis to be practical and identify potential oscillatory modes as quickly as possible, we must be able to discover which egress instances are most vital to analyze. With limited measurement infrastructure, the set of all egress instances can be too large to analyze in a reasonable time for an online tool. We can do better than simply evaluating the oscillatory properties of sets ‘1’ and ‘2’ in Figure 3.9.2 prior to the remaining egress instances in set ‘4’. We do so by adapting our windowed approach above to a sliding window. In addition, we consider which egresses could feasibly be in use at the current time.

Consider Figure 3.9.4; each line represents the availability (over time) of an egress for a specific prefix. A sliding window is used to analyze this time series. Note that the current time is the point on the right-most side of the figure. Recall our motivation: to determine which egresses are feasibly available at the current time. If the number of total egress instances is small enough (say < 10), all subsets of egress instances available during the sliding window can be analyzed. However, if the number of egress instances in the sliding window interval is large, then, as any egresses observed at the current time must be available, we only consider subsets that contain the egresses present at the current time. This can be thought of as every egress instance containing the intersection of the egresses in the sets ‘1’ and ‘2’ in this figure. We display this as set ‘3’ in Figure 3.9.3.

For example, in Figure 3.9.4 we show a sample prefix’s variation of available egresses over time. Next to each egress’s availability timeline is the egress’s presence (or absence) in the prioritized sets of egress instances. The number/color of the egress instance defines the set as shown in Figure 3.9.3. A solid square indicates that this egress is present in all egress instances of this priority level. A clear square indicates this egress is not present in any egress instance of this priority level. A half-solid square indicates the egress can either be present or not present in an egress instance of this priority level. It is these half-square egresses which contribute to the exponential number of egress instances requiring analysis. In this case, $2^{10}$ egress instances require analysis in set ‘3’ while $2^{14}$ egress instances require analysis in set ‘4’.

The process of only considering subsets containing the current egresses may remain intractable if the number of non-current egresses in the sliding window interval is large. That is, the set ‘3’ in Figure 3.9.3 may actually be equivalent to set ‘4’. This is possible if external BGP dynamics affect a large number of egresses. If set ‘3’ is too large to analyze, we shorten the sliding window, which may reduce the size of the yellow set (see Figure 3.9.5). We continue this reduction until set ‘3’ is of feasible size. At each reduction of the sliding window, we also analyze set ‘2’ as, in the absence of external dynamics, this is likely to be the actual egress instance used. Note that set ‘1’ does not change when the sliding window length is altered.

If a large number of egresses disappear simultaneously, it is feasible that the sliding window reduces to a snapshot. This has the effect of not examining the presence of all combinations of the disappearing egresses. As they disappear simultaneously, considering them as a block of egresses which are either present or not (sets ‘1’ and ‘2’) is a logical choice.

The complete prioritization of egress instances is shown in Figure 3.9.3. By prioritizing egress instances, our analysis is applicable to a tool that is able to operate in real-time, providing important feedback to network operators as to the current state of their network.
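The window-shrinking procedure of this section can be sketched for a single prefix as below. The data layout (a map from each egress to the time it was last observed), the halving rule for shortening the window, and all function names are assumptions made for illustration; the text only states that the window is shortened until set ‘3’ is of feasible size.

```python
from itertools import combinations


def observed_sets(last_seen, now, window):
    """Return (current, recent): egresses seen now / seen within the window."""
    current = {e for e, t in last_seen.items() if t == now}
    recent = {e for e, t in last_seen.items() if now - window <= t <= now}
    return current, recent


def choose_window(last_seen, now, window, max_instances):
    """Shorten the window (by halving, an assumed rule) until set '3' is small.

    Set '3' holds every egress instance made of all current egresses plus
    any combination of the recent, non-current ones, so its size is
    2 ** len(recent - current).
    """
    while True:
        current, recent = observed_sets(last_seen, now, window)
        optional = recent - current
        if 2 ** len(optional) <= max_instances or window == 0:
            return window, current, recent
        window //= 2


def set3_instances(current, recent):
    """Enumerate set '3' for the chosen window."""
    optional = sorted(recent - current)
    for size in range(len(optional) + 1):
        for extra in combinations(optional, size):
            yield frozenset(current) | frozenset(extra)


# Hypothetical observations for one prefix: egress -> time last observed.
last_seen = {"A": 100, "B": 100, "C": 95, "D": 90, "E": 70, "F": 40}
window, current, recent = choose_window(last_seen, now=100, window=64,
                                        max_instances=4)
```

In this sketch set ‘1’ is `current` and set ‘2’ is `recent` for the chosen window; both would be analyzed before the instances produced by `set3_instances`. With a window of zero the procedure degenerates to the snapshot described above.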

Network Dynamics

We are now dealing with a dynamic system where not only external route availability changes over time, but also the internal network topology. These dynamics generally only affect a small portion of the network. Re-analyzing the entire network every time the sliding window is moved is not a justifiable use of resources.

[Figure: nested sets of egress instances, ordered from high priority (1) to low priority (6).
  1: The single egress instance with all egresses currently observed
  2: The single egress instance with all egresses currently observed and all egresses recently observed
  3: All egress instances with all currently observed egresses and any combination of recent egresses
  4: All egress instances with any combination of recent and currently observed egresses
  5: All egress instances with compatible import policies
  6: All egress instances]

Figure 3.9.3: Prioritization of egress instances for an online tool.

[Figure: for each egress A to V, a timeline of when the egress was observed (with the sliding window marked) and, next to it, six squares giving its presence in egress instances of types 1 to 6.
  Solid square: present in all egress instances of this type
  Clear square: not present in any egress instance of this type
  Half square: either present or not present in egress instances of this type]

Figure 3.9.4: An example of the prioritization of egress instances. The time a prefix is seen in the measurement infrastructure is shown by a solid line. The presence of the egress in a particular set is shown next to each individual egress. Note that we have implicitly assumed egresses U and V cannot be used for this particular prefix (hence their presence in set ‘6’ and not set ‘5’). See Figure 3.9.3 for a description of the types of egress instance.

[Figure: the same layout as Figure 3.9.4 (egresses A to V, observation timelines, and squares for egress instance types 1 to 6), drawn with a shorter sliding window.
  Solid square: present in all egress instances of this type
  Clear square: not present in any egress instance of this type
  Half square: either present or not present in egress instances of this type]

Figure 3.9.5: Equivalent example to Figure 3.9.4 with a shorter sliding window. The number of egress instances in set ‘3’ is reduced.

The approach we present is highly amenable to an incremental implementation that analyzes only those portions of the network that have changed. First, let us consider each prefix as part of an egress instance set, based on the border routers available for egress selection. Now, if a change to the availability of a prefix occurs, there are three possible cases.

1. The prefix becomes part of an existing or previous egress instance set; or

2. The prefix forms a new egress instance set whose reliance graph (of the egress ancestor set) is structurally equivalent to one currently or previously examined; or

3. The prefix forms a new egress instance set with a new reliance graph.

In the first case, no evaluation of the prefix’s oscillatory properties is required as it has previously been examined. In the second case, the reliance graph of the egress ancestor set must be calculated; however, as the abstract reliance graph has previously been examined, no oscillatory properties need be calculated. Recall that the reliance graph does not record the actual egresses, nor does it record IGP distances. Hence, many egress instances will have equivalent reliance graphs. In the third case, the reliance graph has not been seen before and its oscillatory properties must be examined. Whenever a reliance graph has been analyzed for its oscillatory properties, it becomes part of a library. Consequently, we can use incremental analysis to substantially reduce the ongoing workload of our stability analysis.

A second form of network dynamics is caused by the IGP. There are once again two possibilities:

1. The IGP change does not cause any reliance graph to be altered.

2. The IGP change modifies one or more current reliance graphs.

In the first case, no recalculation is required as the reliance graph is not altered. In the second case, all the modified reliance graphs must be re-evaluated. However, as a number of the resulting reliance graphs are likely to be in the library already, their oscillatory properties need not be evaluated again. Consequently, our analysis can also be applied incrementally to IGP changes.
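The library of previously analyzed reliance graphs described above behaves like a cache keyed on the abstract structure of the graph. The sketch below is one possible realization and is not the thesis implementation: `canonical_form` uses a brute-force relabelling that is only practical for small graphs, and `has_cycle` is merely a placeholder for the chapter's actual oscillation analysis. All names are hypothetical.

```python
from itertools import permutations


def canonical_form(nodes, edges):
    """Label-independent key for a small directed graph (brute force)."""
    nodes = list(nodes)
    best = None
    for perm in permutations(range(len(nodes))):
        relabel = dict(zip(nodes, perm))
        key = tuple(sorted((relabel[u], relabel[v]) for u, v in edges))
        if best is None or key < best:
            best = key
    return (len(nodes), best)


def has_cycle(nodes, edges):
    """Placeholder analysis: does the directed graph contain a cycle?"""
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on stack, 2 = done

    def visit(n):
        state[n] = 1
        for m in succ[n]:
            if state[m] == 1 or (state[m] == 0 and visit(m)):
                return True
        state[n] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in nodes)


class RelianceGraphLibrary:
    """Caches analysis results by abstract graph structure."""

    def __init__(self, analyze):
        self._analyze = analyze
        self._results = {}
        self.analyses_run = 0

    def check(self, nodes, edges):
        key = canonical_form(nodes, edges)
        if key not in self._results:  # case 3: a genuinely new graph
            self._results[key] = self._analyze(nodes, edges)
            self.analyses_run += 1
        return self._results[key]     # case 2: structure already in library


library = RelianceGraphLibrary(has_cycle)
first = library.check(["a", "b"], [("a", "b"), ("b", "a")])
second = library.check(["x", "y"], [("y", "x"), ("x", "y")])  # same structure
```

Here the second call is answered from the library without re-running the analysis, mirroring the second case in both lists above. A production tool would need a scalable canonical labelling in place of the factorial search.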

3.10 Preventing BGP Oscillation

Our approach to detecting oscillation allows network operators great flexibility in network design while ensuring stability. A simple alternative highlighted through the algebraic approach above would be to extend the BGP protocol by introducing, immediately prior to the closest egress step, the rule:

“a route-reflector prefers client-learned routes”.

As a result, there would be no edges in the reliance graph between route-reflectors in the egress ancestor set and Griffin and Wilfong’s condition would always be satisfied. Consequently, no oscillation would occur. This would shift the burden of ensuring stability onto the BGP decision process and away from network design and configuration. It may result in sub-optimal routing on some occasions, but so does the route-reflector hierarchy.
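The effect of the proposed rule can be illustrated with a toy comparison function. This is a sketch only: it assumes all earlier steps of the BGP decision process (local-preference, AS-path length and so on) are already tied, and the `Route` fields and the `prefer_client_learned` flag are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    egress: str
    igp_distance: int
    client_learned: bool
    router_id: int


def best_route(routes, prefer_client_learned=False):
    """Pick the best route among candidates tied on all earlier BGP steps.

    With prefer_client_learned=True, a client-learned route beats any other
    route before IGP distance (closest egress) is considered.
    """
    def key(route):
        client_rank = 0 if (prefer_client_learned and route.client_learned) else 1
        return (client_rank, route.igp_distance, route.router_id)

    return min(routes, key=key)


candidates = [
    Route(egress="e1", igp_distance=10, client_learned=True, router_id=1),
    Route(egress="e2", igp_distance=5, client_learned=False, router_id=2),
]
today = best_route(candidates)
proposed = best_route(candidates, prefer_client_learned=True)
```

With the current decision process the route-reflector picks the closer egress `e2`, learned from a non-client; with the extra step it picks the client-learned `e1` despite the larger IGP distance. This is the occasional sub-optimal routing referred to above.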

3.11 Three-Or-More-Level Route-Reflector Hierarchies

Throughout this analysis, we have considered a two-level route-reflection topology. That is, there is a fully meshed layer of route-reflectors with clients directly connected to one or more route-reflectors. However, more than two hierarchical levels can be used in a route-reflector iBGP topology, as is the case in the large Tier-2 AS we focus on throughout this thesis. Consider the example three-level topology in Figure 3.11.1. Black nodes are the highest level of the hierarchy and form a full-mesh. Grey nodes are clients of the black nodes while simultaneously being route-reflectors of the white nodes. The same reflection rules as the two-level case apply (see Section 3.3). That is, a router will reflect a client-learned route to all iBGP neighbors, but will only propagate a route learned otherwise to its own clients.

We must now also redefine our description of a valid signaling path for the general case. The path $S$ can be split into sub-paths $S = PQR$ where $P$ contains zero or more edges $p_i \in \text{up}$, $R$ contains zero or more edges $r_i \in \text{down}$ and $Q$ is either empty or consists of a single arc $q \in \text{over}$. For example, in Figure 3.11.1 a valid signaling path includes (j, i, 3, 1, d, b). However, the signaling path (j, i, 3, 1, d, c, a) is invalid.
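The decomposition $S = PQR$ amounts to requiring that the edge types along a path match the pattern "zero or more up, at most one over, zero or more down". A minimal checker is sketched below. The edge map and router names are a small hypothetical topology and do not reproduce Figure 3.11.1, whose edges are not given in the text.

```python
def is_valid_signaling_path(path, edge_type):
    """Check that a path's edges follow up* over? down*.

    edge_type maps a directed iBGP session (u, v) to "up", "over" or "down".
    """
    phase = 0  # 0: still climbing, 1: the single over edge used, 2: descending
    for u, v in zip(path, path[1:]):
        kind = edge_type.get((u, v))
        if kind == "up" and phase == 0:
            continue
        if kind == "over" and phase == 0:
            phase = 1
        elif kind == "down":
            phase = 2
        else:
            return False  # unknown session, or an edge type out of order
    return True


# Hypothetical three-level topology: top level rr1, rr2 (full-mesh),
# middle level m1 (client of rr1) and m2 (client of rr2),
# bottom level c1 (client of m1) and c2 (client of m2).
edge_type = {
    ("c1", "m1"): "up", ("m1", "rr1"): "up",
    ("c2", "m2"): "up", ("m2", "rr2"): "up",
    ("rr1", "rr2"): "over", ("rr2", "rr1"): "over",
    ("rr1", "m1"): "down", ("m1", "c1"): "down",
    ("rr2", "m2"): "down", ("m2", "c2"): "down",
}
ok = is_valid_signaling_path(["c1", "m1", "rr1", "rr2", "m2", "c2"], edge_type)
bad = is_valid_signaling_path(["rr1", "m1", "rr1"], edge_type)  # down then up
```

The three phases correspond to the sub-paths $P$, $Q$ and $R$; a path that climbs again after descending, or uses two over edges, is rejected.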

When considering a general route-reflector hierarchy such as this, the routes any particular router learns are less predictable. Consequently the possibility for oscillation is not nearly as simple as the two-level case. We later (in Chapter 4) redefine a set of reliance rules for multi-level hierarchies. However, for now, let us consider an example demonstrating the difficulties encountered in a three-level hierarchy.

We refer the reader to Figure 3.11.2. In this three-level route-reflector topology, three egresses are available (6, 7 and 8). Next to each node is the preference for each egress based on the IGP distance to the egress. Note that all routers prefer downstream egresses over non-downstream egresses. Consequently, one might expect this configuration to fulfill the sufficient condition of Griffin and Wilfong to prevent oscillation: a route-reflector prefers a client-learned route over any other (see Section 3.4.1). However, let us consider whether this is the case in this example. Routers 6, 7 and 8 will select their direct egress and inform their parent route-reflectors of their decision. Hence, 3 will select 6, 4 will select 8 and 5 will select 7. Now, router 0 will initially select the egress 6 as this is the only route it learns (from 3) and propagate this route to 1. Router 1 will select this route learned from 0 as it prefers 6 over 8. Router 2 will select the egress via 7 as it does not learn of the route via 8. However, router 2 will now inform router 0 of the availability of the egress via 7. Hence, router 0 will change its decision and select the egress via 7 over its currently selected route. This in turn affects the decision of 1 as it no longer learns of the egress via 6. It will now select the egress via 8. Again, this affects the decision of router 2 as it now prefers the egress via 8. This process continues ad infinitum as shown in Table 3.11.1.

Step   Router 0   Router 1   Router 2
1      6          6          7
2      7          6          7
3      7          8          7
4      7          8          8
5      6          8          8
6      6          6          8
7      6          6          7

Table 3.11.1: The egress selected by routers 0, 1 and 2 in Figure 3.11.2. Persistent oscillation occurs due to the middle level of the route-reflector hierarchy.

We have presented this example to show that stability is not guaranteed even if all downstream egresses are ‘closer’ than all non-downstream egresses. However, this is not a contradiction of the sufficient condition of Griffin and Wilfong [49]. Their condition requires that all client-learned routes are preferred over all others. This is not the case in this example. However, this example does show that the condition is harder to ensure than we may initially have thought. Configuring all routers to be closer to downstream egresses is not a guarantee of an oscillation-free network. By co-ordinating router preferences, we can guarantee the configuration will not oscillate, but this co-ordination requires the type of analysis presented in this thesis.

Interestingly, even if the bottom two levels of routers are fully inter-connected (which might be the case in a PoP in an operational network), oscillation can still occur. Consider the example shown in Figure 3.11.3. As the bottom two levels of routers are fully inter-connected, routers 3, 4 and 5 select their most preferred route (all egresses are available). Now consider the decisions of routers 0, 1 and 2 shown in Table 3.11.2. We see that step 7 is equivalent to step 1, and hence these routers will never settle on a stable state.

[Figure: three-level route-reflector topology with top-level nodes 1 to 4 and lower-level nodes a to p.]

Figure 3.11.1: Three-level route-reflector hierarchy. Black nodes are the highest level of the hierarchy and form a full-mesh. Grey nodes are clients of the black nodes while simultaneously being route-reflectors of the white nodes.

[Figure: three-level topology with egresses 6, 7 and 8 at the bottom level. Egress preference orders: router 0: 7,6,8; router 1: 6,8,7; router 2: 8,7,6; router 3: 6,7,8; router 4: 8,6,7; router 5: 7,8,6.]

Figure 3.11.2: Oscillation in three-level route-reflector hierarchy. Route-reflectors have their egress preference specified. For example, route-reflector 3 prefers egress 6 over 7 over 8. Notice that all downstream egresses are preferred over non-downstream egresses.

[Figure: three-level topology with egresses 6, 7 and 8 at the bottom level. Egress preference orders: router 0: 8,6,7; router 1: 7,8,6; router 2: 6,7,8; router 3: 6,7,8; router 4: 7,8,6; router 5: 8,6,7.]

Figure 3.11.3: Oscillation in three-level route-reflector hierarchy (bottom level full-mesh).

Step   Router 0   Router 1   Router 2
1      6          8          6
2      8          8          6
3      8          8          7
4      8          7          7
5      6          7          7
6      6          7          6
7      6          8          6

Table 3.11.2: The egress selected by routers 0, 1 and 2 in Figure 3.11.3. Persistent oscillation occurs even when the bottom level of routers is fully inter-connected.

Oscillation can also occur between levels in a three-level route-reflector hierarchy. In the example shown in Figure 3.11.4, oscillation occurs between routers in the top and middle levels. This form of oscillation is caused by middle-level route-reflectors preferring routes learned via a top-level route-reflector and consequently affecting the availability of their own client-learned routes. In our example, route-reflector 3 originally selects client-learned egress 6 (as it only has a choice of egress 6 or 7 at this stage) and consequently router 0 is able to select its most preferred egress via 6. However, router 1 learns of the egress route via 8 and informs 3 of this available egress. Router 3 now alters its decision and chooses the egress via 8. However, this now alters the available egresses at router 0 as egress 6 is no longer available. We show one possible cycle in Table 3.11.3 where step 19 is equal to step 5. Note that at some steps multiple routers could alter their decision (and we show just one). This configuration has no solution and hence will persistently oscillate.

[Figure: three-level topology with egresses 6, 7 and 8 at the bottom level. Egress preference orders: router 0: 6,7,8; router 1: 8,6,7; router 2: 7,8,6; router 3: 8,6,7; router 4: 6,7,8; router 5: 7,8,6.]

Figure 3.11.4: Oscillation between levels of route-reflector topology.

Step   Router 0   Router 1   Router 2   Router 3   Router 4   Router 5
1      6          8          7          6          7          8
2      6          8          7          8          7          8
3      7          8          7          8          7          8
4      7          8          7          8          7          7
5      7          7          7          8          7          7
6      7          7          7          6          7          7
7      6          7          7          6          7          7
8      6          7          7          6          6          7
9      6          7          6          6          6          7
10     6          6          6          6          6          7
11     6          6          6          6          6          8
12     6          8          6          6          6          8
13     6          8          6          8          6          8
14     8          8          6          8          6          8
15     8          8          6          8          7          8
16     8          8          7          8          7          8
17     8          8          7          8          7          7
18     8          7          7          8          7          7
19     7          7          7          8          7          7

Table 3.11.3: The egress selected by routers 0 to 5 in Figure 3.11.4. Persistent oscillation occurs between levels of the hierarchy. Note that at some steps multiple routers could alter their decision (and we show just one). This configuration has no solution and hence will persistently oscillate.

We are now able to develop conditions that guarantee a configuration will not oscillate. First, we must redefine the egress ancestor set $E$ recursively to be the set of border routers $X$ and all parents of routers in $E$. Next, we define the set $\text{child}(u)$ for a router $u$ as all routers $v$ such that $(u, v) \in \text{down}$. Next, we define $\alpha(u)$ to be the best egress for a router $u$ from all AS-wide egresses. Finally, we define a child preference path from $u_1$ to $u_n$ as a valid signaling path $(u_1, u_2, \dots, u_n)$ consisting of only down edges such that $\alpha(u_1) = \alpha(u_2) = \dots = \alpha(u_{n-1}) = u_n$ and $u_{i+1}$ propagates its chosen route to $u_i$. An illustration of child preference paths is shown in Figure 3.11.5. For example, node 1 prefers the egress 7 over all others. Node 5, which is a client of 1, also prefers egress 7. As node 7 is a client of 5, a child preference path (1, 5, 7) exists in this example.

[Figure: three-level topology with egresses 6, 7 and 8 at the bottom level. Egress preference orders: router 0: 8,6,7; router 1: 7,8,6; router 2: 6,7,8; router 3: 8,7,6; router 4: 6,8,7; router 5: 7,6,8.]

Figure 3.11.5: An example three-level topology with three child preference paths. For example node 0 prefers the egress 8 over any other. Its child router 3 also has the same preference.

Theorem 3.11.1 If for all $u \in E \setminus X$, $\alpha(u) = \alpha(v)$ for some $v \in (\text{child}(u) \cap E)$, then the configuration will not oscillate.

Proof: Suppose there exists a child preference path $CPP = (u_i, u_{i+1}, \dots, u_n)$ from all $u_i \in E \setminus X$. Then as $u_n = \alpha(u_1)$, $u_n$ is an egress and selects its direct egress and propagates this route to its parents, including $u_{n-1}$. $\alpha(u_{n-1}) = u_n$, by definition of the CPP. Hence, $u_{n-1}$ will select to egress via $u_n$ (which it learned from its child $u_n$) and propagate this route to its parents including $u_{n-2}$. Continuing along the CPP, we find all $u_i$ in the CPP will select a route learned via $u_{i+1}$, which by definition is a child router.

We checked whether Theorem 3.11.1 holds on the Tier-2 AS using the configuration from June 1, 2008. We discovered only a single instance where Theorem 3.11.1 did not hold. This is shown in Figure 3.11.6. Only the egress ancestor set is shown. Notice that node 0 does not have a child preference path as its child router 2 prefers the egress 4 over 5.

[Figure: egress ancestor set with egresses 4 and 5 at the bottom level. Egress preference orders: router 0: 5,4; router 1: 5,4; router 2: 4,5; router 3: 5,4.]

Figure 3.11.6: An example from a Tier-2 AS of a route-reflector preferring a downstream egress learned from a non-downstream router. Node 0 does not have a child preference path as its child router 2 prefers the egress 4 over 5. Node 0 will select the route learned at node 5 which is a downstream router. However, it will learn of this route from node 1.

Now, let us see the effect of this. Routers 4 and 5 will select their direct egress. Router 2 will select the egress via router 4 while router 3 will select to egress via router 5 (due to the shortest IGP distances). Router 1 will learn of the egress via router 5 from router 3, select it and inform router 0 of this selection. Now router 0 learns of both egress 4 (from router 2) and egress 5 (from router 1). It will select to egress via router 5 as it has a shorter IGP distance than egress 4. Consequently, although router 0 is configured with its downstream egresses closer than any other, it will choose a route it learns from another route-reflector because 5 is a downstream egress of both route-reflectors 0 and 1. Note that this example will not oscillate. However, it does show that, without co-ordination between router preferences, ensuring a router selects a client-learned route is more difficult than might be expected in the current BGP decision process. We believe this is more evidence that the additional decision step described in Section 3.10 would be beneficial in preventing oscillation.
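The condition of Theorem 3.11.1 can be checked mechanically once the child relation and each router's best egress $\alpha(u)$ are known. The sketch below is not the checker used on the Tier-2 AS. The function names are hypothetical, and the child relation used for the Figure 3.11.6 example is an assumption inferred from the surrounding description (the figure's edges are not given in the text).

```python
def egress_ancestor_set(border_routers, children):
    """E: the border routers X plus, recursively, all parents of routers in E."""
    E = set(border_routers)
    changed = True
    while changed:
        changed = False
        for parent, kids in children.items():
            if parent not in E and E.intersection(kids):
                E.add(parent)
                changed = True
    return E


def theorem_3_11_1_violations(border_routers, children, alpha):
    """Routers u in E \\ X with no child v in E such that alpha(u) == alpha(v)."""
    E = egress_ancestor_set(border_routers, children)
    X = set(border_routers)
    violations = []
    for u in sorted(E - X):
        kids_in_E = E.intersection(children.get(u, ()))
        if not any(alpha[v] == alpha[u] for v in kids_in_E):
            violations.append(u)
    return violations


# Figure 3.11.6 as described in the text. ASSUMED child relation:
# 0 -> {2}, 1 -> {3}, 2 -> {4, 5}, 3 -> {5}; egresses X = {4, 5}.
children = {0: [2], 1: [3], 2: [4, 5], 3: [5]}
alpha = {0: 5, 1: 5, 2: 4, 3: 5, 4: 4, 5: 5}  # best egress per router
violations = theorem_3_11_1_violations([4, 5], children, alpha)
```

Under these assumptions the only router reported is node 0, whose single child (router 2) prefers egress 4 while node 0 itself prefers egress 5. An empty result means the sufficient condition holds; a non-empty result does not by itself imply oscillation, as the example in the text shows.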

There is another alternative when designing a route-reflector configuration to ensure oscillation will not occur. If all levels of the route-reflector hierarchy can be split into fully inter-connected partitions, that is, all parents in a partition are parents of all children in that partition and vice versa, then oscillation will not occur.

Theorem 3.11.2 If each level of the route-reflector hierarchy (and all routers are in a single level) can be split into fully inter-connected partitions, and all routers are closer to downstream egresses than non-downstream egresses, then the configuration will not oscillate.

Proof: A router u cannot learn of a better downstream egress e via an over iBGP session with v, as by definition v must only have down iBGP sessions with the same clients as u. Hence u would have learned the route directly from a client router. Also, u cannot learn of a better downstream egress e via an up iBGP session with v, as v could only learn of this route via an over iBGP session, which by the above is not feasible, or from a down session with a router w. However, w must have the same children as u and be part of the same inter-connected partition. Thus, as u cannot learn of any downstream egress (which are by definition better than all others) by any means other than a down iBGP session, the sufficient condition of Griffin and Wilfong [49] is satisfied, and the configuration will not oscillate.

Configuring a network to obey Theorem 3.11.2 may require the operator to alter their route-reflector configuration. For instance, an operator of a three-level route-reflector hierarchy may configure it in such a way that in each PoP there are two route-reflectors (from a middle level) and a number of client routers. They may inter-connect these two levels as in Figure 3.11.3. For additional route availability, they may then configure their network such that the middle-level route-reflectors in the PoP have different route-reflectors from the top-level full-mesh of route-reflectors. We have shown that this practice could cause undesirable oscillation in iBGP and recommend that, when route-reflector hierarchies larger than two levels are in use, Theorem 3.11.2 or Theorem 3.11.1 be satisfied to guarantee oscillation will not occur.
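The structural part of Theorem 3.11.2 can be tested between two adjacent levels. Under our reading of "fully inter-connected partitions" (every parent in a partition is a parent of every child in that partition), the property is equivalent to requiring that any two parents have either identical or disjoint sets of children. The following sketch encodes that reading; the function name and router names are hypothetical, and the distance condition of the theorem is not checked here.

```python
def is_fully_interconnected(children):
    """True if the parents and children of one pair of adjacent levels split
    into fully inter-connected partitions.

    children maps each parent router to the collection of its clients.
    Two parents must share all of their children or none of them.
    """
    child_sets = [frozenset(kids) for kids in children.values() if kids]
    for i, first in enumerate(child_sets):
        for second in child_sets[i + 1:]:
            if first != second and first & second:
                return False
    return True


# Hypothetical PoP layouts.
partitioned = {"rr1": ["c1", "c2"], "rr2": ["c1", "c2"], "rr3": ["c3"]}
overlapping = {"rr1": ["c1", "c2"], "rr2": ["c2", "c3"]}
good = is_fully_interconnected(partitioned)
bad = is_fully_interconnected(overlapping)
```

The check would be applied to each pair of adjacent levels in turn. In the `overlapping` layout, client `c2` has parents whose other clients differ, which is the kind of arrangement the recommendation above advises against.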

3.11.1 Greater than Three-Level Hierarchies

We must start this section by noting that we are unaware of any route-reflector topologies larger than three levels used in practice. However, given the scalability concerns of iBGP, we present this observation for any AS considering such a scheme.

Consider the example in Figure 3.11.7. Here router 3 learns of the dashed route via both client router 4 and non-client router 2. Note that the route is equivalent in all aspects except where it is learned from. Does this route get propagated to router 0? In this example, the answer is yes. The reason lies in the tie-breaking details of the BGP decision process, where the route with the shortest cluster-list is chosen. Router 3 receives the route from router 4 with only 4 in the cluster-list, but receives it from router 2 with 2, 5 in the cluster-list. Hence, router 3 selects the route learned from router 4 and, because it was learned from its own client, propagates the route to router 0.

In contrast, consider the example in Figure 3.11.8. Again, router 3 learns of the dashed route from 2 and 4. However, the cluster-list lengths are now equal. The tie is broken by the lowest router-id of the iBGP neighbor (note this is different from the router-id of the next-hop router). In this case, the route from router 2 is selected. Because this route was learned from a non-client, router 3 does not propagate it to its parent. Hence router 0 does not learn of the downstream egress via router 7, even though its direct child selects it!

Figure 3.11.7: Four-level route-reflector hierarchy. Node 3 learns of the dashed route from 2 and from 4. Does it propagate this route to its parent node 0? In this example it does, as it selects the route learned from 4 due to a shorter cluster-list length.

Figure 3.11.8: Modified four-level route-reflector hierarchy. Node 3 learns of the dashed route from 2 and from 4. Does it propagate this route to its parent node 0? In this example it does not, as it selects the route learned from 2 due to the lowest neighbor router-id.

This observation can cause the conditions for Theorem 3.11.1 to not be satisfied. By the definition of a child preference path, a client must propagate a route from its best child to its parent. However, in hierarchies of more than three levels, it is no longer guaranteed that a child will propagate its route to its parent. We can place a constraint on the topology such that Theorem 3.11.1 holds. This constraint guarantees a router will select (of two equivalent routes) a client-learned route over any other.

Constraint: Any router must be in a single level of the route-reflector hierarchy. That is, the number of down edges to any downstream router is the same from every top-level router.

This ensures the cluster-list is longer for an 'over' learned route than for a 'down' learned route, and hence the downstream route is preferred. We see that as the hierarchy gets larger, the possible sources of problems due to incomplete knowledge increase. Consequently, our recommendation is to use the flattest hierarchy possible to prevent such issues: ideally a full mesh or, if necessary, a two-level route-reflector hierarchy.
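The tie-break discussed in this section can be illustrated with a minimal sketch. It models only the last two steps of the decision process, and it assumes, purely for illustration, that each router-id equals the node label and that the route reaching router 3 via router 2 in Figure 3.11.8 carries a single-entry cluster-list. The names Candidate and tie_break are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A route seen at router 3 that is equal in every earlier decision step."""
    neighbor_router_id: int  # router-id of the iBGP neighbor (assumed = node label)
    cluster_list: tuple      # cluster-ids accumulated during reflection


def tie_break(candidates):
    """Prefer the shortest cluster-list, then the lowest neighbor router-id."""
    return min(candidates, key=lambda c: (len(c.cluster_list), c.neighbor_router_id))


# Figure 3.11.7: cluster-list (4) via router 4 and (2, 5) via router 2.
fig_7 = tie_break([Candidate(4, (4,)), Candidate(2, (2, 5))])

# Figure 3.11.8: equal lengths (assumed (4) and (2)), so the router-id decides.
fig_8 = tie_break([Candidate(4, (4,)), Candidate(2, (2,))])

print(fig_7.neighbor_router_id, fig_8.neighbor_router_id)
```

Under these assumptions the client-learned route wins in the first case and the non-client route wins in the second, which is the behavior that prevents router 0 from learning of the downstream egress.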

3.12 Discussion

The interaction between IGP and iBGP is complex. In this chapter we have abstracted away the complex details, analyzed the properties of the resulting reliance graph and discovered locations where the unwanted network property, oscillation, can occur. The approach uses careful algebraic modeling of the problem to reduce the computational complexity dramatically.

Our work strengthens the previous recommendation of Griffin and Wilfong [49] that a client-learned route should be preferred over a route learned from any other iBGP neighbor to prevent oscillation. However, it is difficult to guarantee this condition is met at all times within dynamic networks. Further, in a route-reflector hierarchy with three-or-more levels, Griffin and Wilfong's condition is impossible to guarantee without coordination between the router preferences. Consequently, our recommendation of an additional BGP decision step, prior to the closest IGP distance step, that explicitly prefers a client-learned route over any other is especially critical in large networks where such hierarchies may be employed. We believe that although such a decision step may occasionally result in sub-optimal routing (that is, a route with a greater IGP distance than would have been selected otherwise), the benefits of predictable, stable routing are highly appealing. In addition, as we see in Chapters 4 and 5, the current route-reflector hierarchy also often results in sub-optimal routing.

For the purposes of this chapter, we have analyzed the oscillatory properties of an iBGP configuration. Further, our model of iBGP can be used for applications such as determining the decisions of routers when Griffin and Wilfong's sufficient condition for stability does not hold (Chapter 4), identifying the influence of route announcements from neighboring ASes (Chapter 5), and other what-if analyses within an AS. Similar concepts might also be extended to inter-AS relationships to predict the propagation of routes.

Chapter 4

Humpty Dumpty: Putting iBGP Back Together Again

In the previous chapter we considered how an AS might test potential network configurations with a view to implementation. In this chapter, we are presented with an existing network whose operational state must be measured. As noted earlier, the configuration of a network is not enough to determine the state in which it is operating. The non-deterministic choice of state, even when the state is stable, means we must supplement our data with measurements. However, measurement infrastructure is often limited due to high storage requirements and operational setup costs. Consequently, configuration data and measurement infrastructure each provide only partial information as to the network state. In this chapter, we combine the network configuration and collected BGP and IGP data to systematically determine the current network state. In addition, our approach can predict the impact of network changes prior to implementation in the live network. We use our technique in Chapter 5 in a 'what-if' scenario to predict the likely impact on a Tier-2 AS when neighboring ASes modify their policy.

4.1 Introduction

Measurement plays a crucial role in the management of IP networks since it allows operators to determine the current network state. Measurement data can be used for tasks such as oscillation detection as examined in Chapter 3, together with deriving traffic demands in operational networks [32] and finding traffic matrices [132] together with their dynamics [108, 110, 111]. A majority of such tasks require some knowledge of the path traffic takes through a network — hence the need for routing measurements. However, due to high storage requirements and operational setup costs, route-monitors collecting BGP routing information are often only connected to a subset of routers. In this chapter we provide a methodology to make use of the high level of dependency between router decisions to systematically “fill in the gaps” left by partial measurements.

We saw in Chapter 3 that an AS’s BGP routing decisions are not atomic. When multiple routes are available to a destination, individual routers within the AS can make different decisions as to their selected route based on their own perspective of the ’best’ route. The network solution, that is, the decision of all routers in the network for a particular destination, is dependent on the subset of AS-wide routes learned at each individual router. Hence, it is invalid to assume all AS-wide routes are learned at each router for selection [13]. The iBGP configuration employed, such as full mesh or route-reflection [7], determines whether all routes or a subset of routes are available at every router. In this chapter we primarily focus on the route-reflector iBGP configuration, since it is used widely in large enterprise and service provider networks. However, in Section 4.7 we also demonstrate how our approach is applicable to other iBGP configurations.

In Chapter 3, we introduced a model to analyze the oscillatory properties of a two-level route-reflector iBGP topology. We now extend this model to determine the network solution of a general route-reflector iBGP topology. The model illustrates the reliance of a router on other routers for choosing its best route. This model underpins a methodology for determining routes selected by all routers based on the knowledge of routes selected by a subset of routers and the iBGP configuration. Another benefit of our methodology is that it can also be used for 'what-if' analysis. Compared to the methodology proposed by Feamster and Rexford [31], which provides similar functionality, our methodology is applicable to any route-reflector iBGP configuration, not just configurations satisfying the recommendations of Griffin and Wilfong [49]. Further, our approach is topology independent, making it extensible to topologies other than a route-reflector iBGP configuration. We illustrate several other topologies in Section 4.7.

We applied our methodology to the topology of a large Tier-2 AS, and using measurements collected from 15% of routers (mostly route-reflectors), we could determine routes selected by all the routers in the network. Of over 12.7 million routing decisions, we predicted a decision consistent with the observed data for all but seven routers. In the process, we also detected several configuration and data collection issues on routers when routes predicted by our methodology were inconsistent with the measurement data, highlighting an additional benefit of our analysis.

We begin by examining the related work in detail in Section 4.2, specifically showing where our model and algorithm surpass previous work. In Section 4.3 we start by recapping the notion of a reliance graph introduced in Chapter 3 and extend this concept to a multi-level route-reflection hierarchy. The design of our model of iBGP is highly amenable to supplementation with network measurements to find in which of multiple network solution states the network is currently operating. We examine this supplementation in Section 4.5 and evaluate our techniques on a large Tier-2 AS in Section 4.6.

4.2 Related Work

Several approaches to discovering the network solution have been proposed. The first approach is to undertake a simulation of BGP. However, simulating the time-consuming packet exchanges in software packages such as SSFNet [87], J-sim [113] and ns [3] is not necessary to find the network solution. The intermediate steps in the BGP convergence process are irrelevant for our goal. Consequently, we abstract away unnecessary details, resulting in a model which is far more efficient and amenable to finding the network's chosen solution.

An alternative approach is to simply propagate routes in an arbitrary order between routers (ignoring any timing delays) until no router alters its decision (for example, C-BGP [88]). However using this approach, a significant number of intermediate router states are evaluated prior to converging to an arbitrary final solution, and as we demonstrated in Chapter 3, there may be multiple feasible solutions. Consequently, determining if a network is in the convergence process or if it is persistently oscillating is difficult. In contrast, we avoid many intermediate states to quickly find a valid solution and more importantly, converge to a solution consistent with observed data. This enables our approach to predict the route traffic will actually take in the network. In addition, if the configuration has an oscillatory state, we can quickly identify it and pinpoint the responsible routers.

The most closely-related work to that presented in this chapter is by Feamster and Rexford [31]. Their motivation was to predict the network solution as designed. That is, they assume recommended guidelines for network configuration are always satisfied resulting in a unique network solution. We, however, allow the network to be analyzed in its currently operating state — whether it satisfies guidelines or not. In addition, they assume complete visibility of input routes. In contrast, our technique works with limited knowledge of input routes from the network. Further, our technique is designed to use observed data to influence which of multiple network solutions is actually chosen by the network. Finally, it can efficiently analyze the impact of small changes to the network without a significant re-analysis.

The primary assumption of Feamster and Rexford requires all route-reflectors to prefer a client-learned route over any other. This constraint is a sufficient condition to prevent persistent oscillation and guarantees a unique solution [49]. However, the condition is not necessary. In Chapter 3, we demonstrated that this condition can be overly restrictive and difficult to satisfy using the current BGP decision process. Removing this assumption removes the benefit of guaranteeing convergence to a unique solution: the timing of BGP updates can determine which of multiple solutions is settled upon. In addition, the tie-breaking option employed in operational networks, such as the one we examined, can be non-deterministic, resulting in an even greater number of feasible network solutions. Our technique always converges to a valid solution and, in almost all instances, converges to a solution consistent with the observed data.

Figure 4.2.1: Stable egress instances violating Griffin and Wilfong's condition: (a) unique solution; (b) two solutions. Black nodes are route-reflectors, and white nodes are client routers. Solid lines represent iBGP sessions, and dashed lines indicate IGP distances to non-client routers. IGP distances are shown next to lines connecting nodes.

Let us now consider several examples where Griffin and Wilfong's constraint is not satisfied. First, we refer the reader to Figure 4.2.1(a). The solution to this example is {4, 6, 6, 4, 5, 6}, where the ith element of the solution vector represents the selected next-hop of router i. Notice that although router 1 is closer to the egress via router 5 than the egress via router 4, it never learns of this route as 2 does not select this egress. Router 2 selects an egress via 6 which it learns from 3, which is not a client router. Hence Griffin and Wilfong's constraint is violated, but a unique solution still exists.

Although a unique solution guarantees the configuration will not oscillate ad infinitum [49], we saw in Chapter 3 that non-uniqueness does not imply a configuration will oscillate. In Figure 4.2.1(b) there are two possible solutions depending on whether router 1 makes its decision before or after router 2 (the solutions are {3, 3, 3, 4} and {4, 4, 3, 4}).

In the above examples, as Griffin and Wilfong's constraint is violated, the technique in [31] may not find the solution selected by routers or may even find an invalid solution. For example in Figure 4.2.1(a), the technique of Feamster and Rexford [31] may return {5, 6, 6, 4, 5, 6}, which is invalid (as 1 is selecting a route of which it has no knowledge). The pitfall of the technique in [31] is that it relies on the assumption that networks are designed to satisfy Griffin and Wilfong's constraint. However, we demonstrated in Chapter 3 that ensuring this constraint is satisfied is difficult under all failure scenarios and with the current BGP decision process (especially in multi-level hierarchies). In contrast, we make no assumptions on the configuration of the topology, making our technique applicable to any network scenario. Further, our technique is highly amenable to the inclusion of measurement data to influence which of multiple network solutions is actually chosen by the network. We present results in Section 4.6 of a case study where we always predict a valid solution. We found this solution was consistent with observed data in 99.9999% of cases.

4.3 Two-Level Route-Reflector Reliance Graph

We first introduced in Chapter 3 the concept of reliance between router decisions to determine if a network configuration was oscillatory. In this chapter, we use reliances to efficiently and accurately determine the actual routes selected by any router. We say a router u is reliant on another router v if it can learn of its best route for a particular prefix (after convergence) from v. We denote this reliance as u f v. Reliances are represented by a directed edge in the direction of information flow in the reliance graph. Routing information can only flow over iBGP sessions between routers. Consequently, the reliance graph is a sub-graph of the signaling graph. The rules governing route-propagation in an iBGP topology determine which links are pruned from the signaling graph to form the reliance graph.

Co-reliance groups are strongly connected components of the reliance graph. Collapsing each co-reliance group to a single node yields an acyclic structure representing the reliances between co-reliance groups. We visit each co-reliance group in a topological order, evaluating the decision of all routers in a co-reliance group before moving to the next. When only one router exists in a co-reliance group, the BGP decision process is evaluated exactly once. However, in a non-singleton co-reliance group, the decisions of some routers may be dependent on the decisions of other routers within the co-reliance group. Consequently, we may need to re-evaluate the decisions of routers within a non-singleton co-reliance group to ensure updated information does not alter a router's decision. Importantly, a co-reliance group is never re-visited. We explicitly describe the ordering of co-reliance groups and the evaluation of router decisions within a co-reliance group in Section 4.5.
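As an illustration of this decomposition, the following minimal sketch uses Tarjan's strongly-connected-components algorithm, which emits each component only after every component it points to. The mapping relies_on and the function name co_reliance_groups are hypothetical, and the small graph is our assumed reading of the two-solution example of Figure 4.2.1(b) (routers 1 and 2 rely on each other and on their respective clients 3 and 4); it is not the implementation used in this thesis.

```python
def co_reliance_groups(relies_on):
    """Return the strongly connected components of the reliance graph.

    relies_on[u] lists the routers from which u may learn its best route.
    Groups are returned so that each group appears after every group it
    relies on, i.e. in an order suitable for evaluation.
    """
    index, low = {}, {}
    stack, on_stack = [], set()
    groups = []

    def visit(u):
        index[u] = low[u] = len(index)
        stack.append(u)
        on_stack.add(u)
        for v in relies_on.get(u, ()):
            if v not in index:
                visit(v)
                low[u] = min(low[u], low[v])
            elif v in on_stack:
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:          # u is the root of a component
            group = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                group.add(w)
                if w == u:
                    break
            groups.append(frozenset(group))

    for u in sorted(relies_on):
        if u not in index:
            visit(u)
    return groups


# Assumed reliances for the example of Figure 4.2.1(b).
relies_on = {1: [2, 3], 2: [1, 4], 3: [], 4: []}
groups = co_reliance_groups(relies_on)
print(groups)
```

In this sketch the egress routers 3 and 4 form singleton groups that are evaluated first, and routers 1 and 2 form the only non-singleton group, which is evaluated last.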

The reliance graphs that are based on the rules outlined in Chapter 3 and the corresponding co-reliance groups for the examples in Figure 4.2.1 are shown in Figure 4.3.1. In Figure 4.3.1(a), router 1 is reliant on 2 because if it learns of egress 5 it will select it. However, router 3 is not reliant on 1 as its closest egress is 6, which it always learns of, and hence it will never select any other route. In Figure 4.3.1(b), router 1 will egress via 4 if it learns of it. Thus it is reliant on 2. Further, if router 2 learns of the egress via 3, it will select it. Hence 2 is reliant on 1 and so routers 1 and 2 form a co-reliance group.

Figure 4.3.1: Reliances and co-reliance groups for the examples in Figure 4.2.1, panels (a) and (b). Reliances are indicated by arrows and co-reliance groups by dotted circles.

Evaluating router decisions in any topological ordering (the numerical orderings D1, D2, ... in both Figure 4.3.1(a) and Figure 4.3.1(b) are examples of possible topological orderings) will result in a valid solution. In Figure 4.3.1(b) the order in which we evaluate the decisions of routers 1 and 2 in co-reliance group D3 determines which of two solutions is realized. For example, if we evaluate the decision of router 1 first, it will only have knowledge of the egress via router 3. Hence, it will select to egress via 3 and propagate this choice to router 2. Router 2 will also learn of the egress via router 4 and consequently will choose to egress via 3 (as it has a shorter IGP distance). Router 1 will not alter its decision as it will not learn of the egress via router 4. Hence, a network solution is {3, 3, 3, 4}. Conversely, if we evaluate the decision of router 2 prior to router 1, an alternative network solution is {4, 4, 3, 4}. In Section 4.5 we use network measurements to determine which of these alternatives is actually selected by the network.

The reliance rules in a multi-level hierarchy are somewhat more complicated than in the two-level case examined in Chapter 3. We first present a brief recap of the notation used in Chapter 3 before outlining the new reliance rules and examining an example of their application. Recall that all reliance rules are based on where a router can learn of its best route after convergence. Similar techniques can be applied to any iBGP topology. We examine several other topologies in Section 4.7.
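The order dependence in the example of Figure 4.3.1(b) can be reproduced with a minimal sketch. It hard-codes what the text states about that example: route-reflector 1 always learns of the egress via its client 3 but prefers 4, route-reflector 2 always learns of the egress via its client 4 but prefers 3, and a route-reflector only passes a route to its non-client peer if that route was learned from its own client. The function name solve and the dictionaries are hypothetical.

```python
def solve(order):
    """Evaluate routers 1 and 2 in the given order until no decision changes."""
    own_client = {1: 3, 2: 4}   # egress always available to each route-reflector
    preferred = {1: 4, 2: 3}    # egress each route-reflector selects if it learns of it
    peer = {1: 2, 2: 1}
    choice = {3: 3, 4: 4}       # egress routers select their own external route

    changed = True
    while changed:
        changed = False
        for u in order:
            available = {own_client[u]}
            p = peer[u]
            # The peer only reflects a route it learned from its own client.
            if choice.get(p) == own_client[p]:
                available.add(own_client[p])
            new = preferred[u] if preferred[u] in available else own_client[u]
            if choice.get(u) != new:
                choice[u] = new
                changed = True
    return [choice[r] for r in (1, 2, 3, 4)]


first_1 = solve([1, 2])
first_2 = solve([2, 1])
print(first_1, first_2)
```

Evaluating router 1 first yields the solution {3, 3, 3, 4}, while evaluating router 2 first yields {4, 4, 3, 4}; both are stable, since a second pass over the group changes nothing.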

4.4 General Route-Reflector Reliance Graph

The rules we examined in Chapter 3 were only applicable to the two-level route-reflector hierarchy. We now generalize these rules to a multi-level hierarchy. The rules governing reliance in the multi-level hierarchy are significantly more complex than the two-level case, as route-reflectors often hide information propagated to their parents (see Chapter 3 for details). Consequently, we use the example topology shown in Figure 4.4.1(a) as we describe the rules, to assist the reader in following the concepts. In this figure, routers 1, 2, 3 and 4 form the central top-level mesh of route-reflectors, while c, d, g, h, i, l and n form the middle-level route-reflectors. The solid lines represent iBGP sessions, and the dashed lines represent a router's preference for a non-downstream egress. Where no dashed line exists, a downstream egress is preferred. We explicitly define several vital preferences in the caption of the figure. Routes learned directly from neighboring ASes are denoted by large arrows, i.e. routers b, e, f, i and m are the egress routers.

4.4.1 Notation Recap

We now present a brief recap of the important notation used for the remainder of this chapter. For a thorough description, we refer the reader to Chapter 3. An iBGP configuration C is a pair C = (G_P, G_S), where G_P is the physical graph on which the IGP is run to determine the shortest path between two routers. The logical signaling graph G_S = (V, A_S) is overlaid on top of the physical graph, with routers V connected by directed arcs in A_S.

Three types of arcs exist in A_S. An arc (u, v) ∈ down represents an arc from a route-reflector u to one of its clients v. An arc (u, v) ∈ up if and only if (v, u) ∈ down. Arcs in up are acyclic, consistent with a hierarchy rather than an arbitrary network design. An arc (u, v) ∈ over represents a vanilla iBGP session from router u to v. If (u, v) ∈ over then (v, u) ∈ over.

A valid signaling path S satisfies the following property. The path S can be split into sub-paths S = PQR where P = p_1 p_2 ... p_a for some a ≥ 0 such that each p_i ∈ up, R = r_1 r_2 ... r_b for some b ≥ 0 such that each r_i ∈ down, and Q consists of at most a single arc q ∈ over. Note that P, Q or R may be empty.

An egress instance [49] I = (C, X) corresponds to a pair of a configuration C and a set of egress routers X. The set X consists of all egress routers that learn an external BGP route to a particular prefix which are not eliminated by the BGP decision process (up to the IGP distance step) when compared with all AS-wide routes [31]. In our example of Figure 4.4.1(a), b, e, f, i and m form X. An egress ancestor set E can be recursively defined as the set of egress routers X and all parents of routers in E. In our example, E = {b, d, e, f, g, h, i, m, l, 1, 2, 3}. Note that although an egress router may learn multiple routes (to a prefix) it will only advertise its best route to neighbors. Hence, there is a one-to-one mapping from egress routers to available routes. We will refer to an egress router and its available route interchangeably.

Figure 4.4.1: An example 3-level route-reflector topology. Panel (a): iBGP sessions denoted by solid lines and a preference for a non-downstream router by dashed lines. Panel (b): reliance graph constructed from reliance rules. Black nodes represent route-reflectors at the top level. Gray nodes are middle-level routers. Arrowed lines represent a reliance. Large arrows are AS-wide available routes. We explicitly define the following preferences: λg(b) > λg(e) > λg(f), λ2(b) > λ2(e) > λ2(f), λ3(i) > λ3(m), λ4(m) > λ4(i) > λ4(b) > λ4(e) > λ4(f).
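The recursive definition of the egress ancestor set E can be sketched as a simple upward traversal. The parent relationships below are our reading of the up and down arcs listed for the example topology in Table 4.4.2; the function name egress_ancestor_set and the dictionary PARENTS are hypothetical.

```python
def egress_ancestor_set(egresses, parents):
    """Return the egress routers together with all of their ancestors."""
    ancestors = set(egresses)
    frontier = list(egresses)
    while frontier:
        router = frontier.pop()
        for parent in parents.get(router, ()):
            if parent not in ancestors:
                ancestors.add(parent)
                frontier.append(parent)
    return ancestors


# Child -> parents (route-reflectors), as read from Table 4.4.2 (assumed).
PARENTS = {
    "a": ["c"], "b": ["d"], "c": ["1"], "d": ["1"],
    "e": ["g"], "f": ["g", "h"], "g": ["2"], "h": ["2"],
    "i": ["3"], "j": ["i"], "k": ["i"], "l": ["3"],
    "m": ["l"], "n": ["4"], "o": ["n"], "p": ["n"],
}
X = {"b", "e", "f", "i", "m"}

E = egress_ancestor_set(X, PARENTS)
print(sorted(E))
```

With these inputs the sketch reproduces the set E = {b, d, e, f, g, h, i, m, l, 1, 2, 3} stated for the example.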

The BGP decision process is denoted by a ranking function λu for a router u such that if a route ak is preferred over a route aj at router u, then λu(ak) > λu(aj). If two routes ak and aj are equivalent up to the tie-break option and the actual route chosen is dependent on message timing, then λu(ak) = λu(aj). For convenience, we denote the preference of the null route φ as λu(φ) = −∞.

4.4.2 Reliance Rules for Route Reflection

In this section we generalize the reliance rules we previously defined in Chapter 3 for the two-level route-reflector hierarchy to an arbitrary hierarchy. Although there is a strict set of reliances which are a subset of arcs (of type up, down and over) in the signaling graph AS, defining where a router can learn of its best route in an n-level hierarchy is more difficult than in the two-level case. An important consideration is that failing to define reliances can result in incorrect decisions, while defining additional reliances simply increases the computational complexity of predicting selected routes (as it may create larger co-reliance groups than really exist). Consequently, we start with a relatively conservative definition of reliances before pruning those which cannot exist (but perhaps allowing some false reliances). We assume the MED attribute is filtered or compared AS-wide in this section. This is the policy of the AS we examine.

Downstream Egress Set

Let us generalize the best downstream egress function defined in Chapter 3 to return a set of downstream egresses Λ(u) for a router u. If u has no downstream egresses, Λ(u) = φ. Unlike the two-level hierarchy, in an arbitrary hierarchy it is no longer guaranteed that a router will learn of all downstream egresses, since the set of available routes is restricted by the selection of intermediate routers.

Router u   1    2      3    d    g    h    l    all other routers
Λ1(u)      φ    φ      i    b    e    f    m    φ
Λn(u)      b    e, f   φ    φ    φ    φ    φ    φ

Table 4.4.1: Downstream egress sets for routers in the example topology of Figure 4.4.1.

We first define Λ1(u) as the set of best downstream egresses which are one downstream iBGP hop away from u. Router u is guaranteed to learn of these routes due to the direct iBGP session and so these routes will always be available. Formally, for u ∉ X,

\[ \Lambda_1(u) = \Big\{ v \in X : (u, v) \in \text{down} \text{ and } \lambda_u(v) = \max_{w \in X : (u, w) \in \text{down}} \lambda_u(w) \Big\}. \]

We show the sets Λ1(u) for all routers of our example in Figure 4.4.1 in Table 4.4.1. Notice that as λg(e) > λg(f), Λ1(g) = {e}. Now, let us consider other egresses u could learn (and select as best) from clients which are not direct egresses, i.e., those more than one hop away. We denote this set by Λn(u) and define it for u ∉ X as

\[ \Lambda_n(u) = \bigcup_{w \in E \setminus X : (u, w) \in \text{down}} \Big\{ v \in \Lambda(w) \setminus \Lambda_1(u) : \lambda_u(v) \ge \max_{r \in \Lambda_1(u)} \lambda_u(r) \Big\}. \]

Note that egresses not preferred over an "always available egress" are not in Λn(u). We also show in Table 4.4.1 the sets Λn(u) for all routers of our example in Figure 4.4.1. As router 2 can learn of both e and f from its children (g and h), both egresses are in Λn(2). Also notice that as i is an always available downstream egress of 3 and i is preferred over m, m is not in Λn(3). Finally, we define

\[ \Lambda(u) = \Lambda_1(u) \cup \Lambda_n(u). \]

Note that Λ(u) is well defined as we define Λ(u) recursively up the hierarchy.
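The recursive definition of the downstream egress sets can be sketched directly. The child lists are our reading of the example topology from Table 4.4.2, the numeric preferences are arbitrary stand-ins that respect only the orderings given in the caption of Figure 4.4.1 (unspecified preferences default to 0), and an empty Λ1(u) is treated as the null route with preference −∞. All function and variable names are hypothetical.

```python
import math

X = {"b", "e", "f", "i", "m"}                  # egress routers
E = X | {"d", "g", "h", "l", "1", "2", "3"}    # egress ancestor set
CHILDREN = {
    "1": ["c", "d"], "2": ["g", "h"], "3": ["i", "l"], "4": ["n"],
    "c": ["a"], "d": ["b"], "g": ["e", "f"], "h": ["f"],
    "i": ["j", "k"], "l": ["m"], "n": ["o", "p"],
}
# Stand-ins for the ranking functions; only the relative order is meaningful.
PREF = {
    "g": {"b": 3, "e": 2, "f": 1},
    "2": {"b": 3, "e": 2, "f": 1},
    "3": {"i": 2, "m": 1},
}


def lam(u, route):
    return PREF.get(u, {}).get(route, 0)


def lambda_1(u):
    """Best egresses exactly one downstream iBGP hop from u."""
    direct = [v for v in CHILDREN.get(u, []) if v in X]
    if u in X or not direct:
        return set()
    best = max(lam(u, v) for v in direct)
    return {v for v in direct if lam(u, v) == best}


def lambda_n(u):
    """Egresses u may learn from non-egress children in E."""
    if u in X:
        return set()
    always = lambda_1(u)
    floor = max((lam(u, r) for r in always), default=-math.inf)
    learned = set()
    for w in CHILDREN.get(u, []):
        if w in E and w not in X:
            learned |= {v for v in big_lambda(w) - always if lam(u, v) >= floor}
    return learned


def big_lambda(u):
    return lambda_1(u) | lambda_n(u)


print({u: (lambda_1(u), lambda_n(u)) for u in ["1", "2", "3", "g"]})
```

Under these assumptions the sketch reproduces the entries of Table 4.4.1, for instance Λ1(g) = {e}, Λn(2) = {e, f} and an empty Λn(3).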

Rules for Reliance

Reliance rules are adapted from the route propagation rules [7] and indicate where a router can learn of its best route. Arcs in the reliance graph are a subset of the arcs A_S in the signaling graph. There are three types of arcs in A_S which may be part of the reliance graph. Consider the arc (u, v) ∈ A_S:

1. (u, v) ∈ down: a route-reflector u is reliant on its child v iff u ∉ X and v ∈ E.

2. (u, v) ∈ up: a client u is reliant on its parent v iff u ∉ X.

3. (u, v) ∈ over: a router u ∉ X is reliant

(a) on another router v ∈ E \ X iff

\[ \min_{r \in \Lambda(u)} \lambda_u(r) \le \max_{s \in \Lambda(v) \setminus \Lambda_1(u)} \lambda_u(s) \]

(b) on another router v ∈ X iff

\[ \min_{r \in \Lambda(u)} \lambda_u(r) \le \lambda_u(v) \]

We demonstrate the above rules for our example in Figure 4.4.1(b). A summary of all the reliances for this topology is included in the first two columns of Table 4.4.2. Applying rule 1, we see any router in the egress ancestor set E which does not have a direct egress is reliant on its children; for example, 2 is reliant on h. Applying rule 2, all client routers are reliant on their parents (unless they are a direct egress); for example, l is reliant on 3. Rule 3 applies when a router can learn of a better route via an over edge than any client-learned route. For instance, rule 3(a) applies to the reliance of 2 on 1, as 2 will select the route from b if it ever learns of it, whereas rule 3(b) applies for the reliance of a on b, as a can learn the egress directly from b via an over edge.
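The three reliance rules above can be sketched as a single predicate. The sets Λ1 and Λ are taken from Table 4.4.1, the numeric preferences are arbitrary stand-ins respecting the ordering λ2(b) > λ2(e) > λ2(f), and two conventions are our own assumptions: a router with an empty Λ(u) is treated as holding the null route (preference −∞), and rule 3(a) is taken not to apply when Λ(v) \ Λ1(u) is empty. The names is_reliant, LAM1, LAM and PREF are hypothetical.

```python
import math

X = {"b", "e", "f", "i", "m"}
E = X | {"d", "g", "h", "l", "1", "2", "3"}
LAM1 = {"3": {"i"}, "d": {"b"}, "g": {"e"}, "h": {"f"}, "l": {"m"}}
LAM = {"1": {"b"}, "2": {"e", "f"}, "3": {"i"},
       "d": {"b"}, "g": {"e"}, "h": {"f"}, "l": {"m"}}
PREF = {"2": {"b": 3, "e": 2, "f": 1}}   # stand-in for the ranking at router 2


def lam(u, route):
    return PREF.get(u, {}).get(route, 0)


def is_reliant(u, v, arc):
    """Apply reliance rules 1 to 3 to the arc (u, v) of the given type."""
    if u in X:
        return False                     # egress routers are never reliant
    if arc == "down":
        return v in E                    # rule 1
    if arc == "up":
        return True                      # rule 2
    if arc == "over":
        worst_own = min((lam(u, r) for r in LAM.get(u, set())),
                        default=-math.inf)
        if v in X:
            return worst_own <= lam(u, v)            # rule 3(b)
        if v in E:
            others = LAM.get(v, set()) - LAM1.get(u, set())
            if not others:
                return False             # assumption: nothing new to learn
            return worst_own <= max(lam(u, s) for s in others)   # rule 3(a)
        return False
    raise ValueError(f"unknown arc type: {arc}")


print(is_reliant("2", "1", "over"), is_reliant("a", "b", "over"))
```

For the example, the predicate reports that 2 is reliant on 1 (rule 3(a)) and a is reliant on b (rule 3(b)), while 3 is not reliant on 4 because 4 is outside the egress ancestor set.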

Pruning Reliances

Our technique for determining router decisions would work on the reliance graph defined by the rules above. However, ideally we would like to have the smallest possible co-reliance groups in the reliance graph. Reliance rules are essentially a pruning of the signaling graph. We continue in this vein by pruning superfluous reliances:

Reliance   Rule      Prunable?   Reason
1 f d      down      No
2 f 1      over(a)   No
2 f g      down      No
2 f h      down      No
3 f i      down      No
3 f l      down      Yes (2)     (3, l) ∈ down and l ∈ E \ X and i ∈ Λ1(3), Λ(l) = {m} and λ3(m) < λ3(i)
4 f 1      over(a)   No
4 f 2      over(a)   No
4 f 3      over(a)   No
a f c      up        No
a f b      over(b)   No
c f d      over(a)   No
c f 1      up        No
d f b      down      No
d f 1      up        Yes (3)     λd(b) > λd(∗)
g f e      down      No
g f f      down      Yes (1)     (g, f) ∈ down and f ∈ X and f ∉ Λ1(g)
g f 2      up        No          λg(e) < λg(b) and λ2(e) < λ2(b)
h f f      down      No
h f 2      up        Yes (3)     λh(f) > λh(∗)
j f i      up        No
k f i      up        No
l f m      down      No
l f 3      up        Yes (3)     λl(m) > λl(∗)
n f 4      up        No
o f n      up        No
p f n      up        No

Table 4.4.2: Reliances for the example topology of Figure 4.4.1 and the rule which identifies them. Also included is whether they are prunable, together with non-trivial reasons why or why not.

1) u f/ v if (u, v) ∈ down and v ∈ X \ Λ1(u).

That is, a route-reflector is reliant only on its best client with a direct egress. In our example, as g prefers e over f and as e is always available, f will never be chosen and is pruned in Figure 4.4.1(c).

2) u f/ v if (u, v) ∈ down and v ∈ E \ X and

\[ \min_{r \in \Lambda_1(u)} \lambda_u(r) > \max_{s \in \Lambda(v)} \lambda_u(s) \]

That is, if a route-reflector u has a client r with a direct egress and no possible egress which it can learn from another client v is better than r, then u cannot be reliant on v. In our example, as 3 prefers i over m, 3 is not reliant on l, and this edge is also pruned in Figure 4.4.1(c).

Our next rule is for up edges. Before specifying the rule for an arc (u, v) ∈ up, we define L(u, v) as the egresses that can be learned by a router u from the parent router v which are not available from a direct client. For a general hierarchy, the exact form of L(u, v) can be quite complicated. In practice, route-reflector hierarchies do not tend to be larger than three levels, and for three levels, we can formally define:

\[ L(u, v) = \bigcup_{w \in E : v \mathrel{f} w} \Lambda(w) \setminus \Lambda_1(u). \]

3) u f/ v if (u, v) ∈ up and

\[ \min_{r \in \Lambda_1(u)} \lambda_u(r) > \max_{s \in L(u, v)} \lambda_u(s) \]

That is, a route-reflector in the second level of the hierarchy is only reliant on its parent if the parent can learn a better route than any of the always available egresses at the route-reflector. In our example, l prefers m over any egress it can learn from 3 (as λl(m) > λl(i)). Hence the reliance of l on 3 is pruned. Other reliances that are pruned using this technique are h f 2 and d f 1. Notice g f 2 is not prunable as g may select an egress learned from 2. We include in the third column of Table 4.4.2 whether reliances are able to be pruned and, in column four, we provide non-trivial reasons why a reliance can (or cannot) be pruned.
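The two pruning rules for down edges can be sketched as follows; rule 3 is omitted because it depends on L(u, v) and hence on the reliance graph itself. The sets are taken from Table 4.4.1, the numeric preferences are arbitrary stand-ins respecting the orderings in the caption of Figure 4.4.1, and an empty Λ1(u) is treated as the null route with preference −∞ (our assumption, which keeps reliances such as 2 on g from being pruned). The name prunable_down is hypothetical.

```python
import math

X = {"b", "e", "f", "i", "m"}
E = X | {"d", "g", "h", "l", "1", "2", "3"}
LAM1 = {"3": {"i"}, "d": {"b"}, "g": {"e"}, "h": {"f"}, "l": {"m"}}
LAM = {"1": {"b"}, "2": {"e", "f"}, "3": {"i"},
       "d": {"b"}, "g": {"e"}, "h": {"f"}, "l": {"m"}}
PREF = {"g": {"b": 3, "e": 2, "f": 1}, "3": {"i": 2, "m": 1}}


def lam(u, route):
    return PREF.get(u, {}).get(route, 0)


def prunable_down(u, v):
    """True if the reliance of u on its child v is pruned by rule 1 or 2."""
    always = LAM1.get(u, set())
    if v in X:
        return v not in always                       # pruning rule 1
    if v in E:
        worst_direct = min((lam(u, r) for r in always), default=-math.inf)
        best_via_v = max((lam(u, s) for s in LAM.get(v, set())),
                         default=-math.inf)
        return worst_direct > best_via_v             # pruning rule 2
    return False


print(prunable_down("g", "f"), prunable_down("3", "l"), prunable_down("2", "g"))
```

For the example, the reliances of g on f and of 3 on l are reported as prunable, in agreement with Table 4.4.2, while the reliance of 2 on g is retained.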

Finding a Valid Solution

As the reliance graph precisely identifies which routers' decisions a particular router is dependent upon, if there are no cycles in the reliance graph structure, we can topologically sort the routers and evaluate them in order (visiting them exactly once). However, as shown in our example, it is likely there are cycles in the reliance graph. Hence, we partition the reliance graph into co-reliance groups before undertaking a topological sort on co-reliance groups. We show these co-reliance groups in Figure 4.4.1(d). One topological ordering (any topological order will result in the same network solution) is the numerical ordering of D1 to D19 in Figure 4.4.1(d). If multiple routers are present in a co-reliance group (such as D9), the decisions of the routers may be dependent on message timing, and we evaluate their decisions until a valid solution is found. The ordering in which we evaluate the BGP decision process on routers within a co-reliance group determines upon which of possibly multiple valid solutions we converge. Our desire is to converge to the actual solution selected by the network. We describe how this is achieved in the next section.

4.5 Finding the Actual Solution

A walk of the reliance graph can have multiple valid solutions when either:

1. co-reliance groups have multiple routers; or

2. the non-deterministic oldest-route tie-breaker is used.

When such conditions exist, we want to find the actual solution chosen by routers. If decisions of some routers in the network are known (through measurement), we can use them as constraints while determining the solution¹. Two feasible approaches to using these constraints are: (i) find all possible solutions and select one which satisfies the constraints; or (ii) gravitate towards a solution satisfying all constraints by ensuring that when we visit each co-reliance group, we select a solution consistent with the constraints. We take the latter approach as it reduces unnecessary computation of infeasible solutions. However, it can result in discrepancies if we reach a co-reliance group with no solutions that satisfy the constraints². We could backtrack along the reliance graph to resolve discrepancies. However, in our examined network, we found only seven in over 12.7 million decisions were affected by this, and so we did not implement a backtracking algorithm. As we saw in the example of Figure 4.3.1(b), the order in which we evaluate the decisions of routers within a non-singleton co-reliance group determines to which of multiple valid solutions we converge. We now outline our heuristic to converge to a solution satisfying all measurement constraints.

4.5.1 Ordering of Routers Within a Co-reliance Group

We currently have an ordering of co-reliance groups to visit. However, a co-reliance group can contain multiple routers. The order in which we evaluate the BGP decision process on routers within a co-reliance group determines upon which of possibly multiple network solutions we converge. Network measurement infrastructure in the form of route-monitors (or any other available source) can provide us with additional information as to the current network state. For co-reliance groups with multiple routers, we can order the routers to increase the probability of converging to a solution consistent with route-monitors such that we do not need to re-visit a co-reliance group.

¹There may be multiple feasible solutions matching the known route selections.

²A lack of constraints at a non-singleton co-reliance group may cause a random ordering of router evaluation which subsequently results in a route being unavailable for selection at a later co-reliance group. Also, a random tie-break decision (when the oldest-route tie-break is used) may cause a route to be unavailable for selection within a later co-reliance group.

We use the comparison subroutine compare_routers shown in Figure 4.5.1 (function and variable descriptions shown in Figure 4.5.2) and any sorting algorithm (we used Perl's inbuilt sort) to order routers within a co-reliance group. If compare_routers(a,b) returns 1, we evaluate the BGP decision process at router a before router b. If compare_routers(a,b) returns -1, we evaluate the BGP decision process at router a after router b. If compare_routers(a,b) returns 0, we have no information indicating whether the BGP decision process at a or b should be evaluated first. We describe the reasoning for the router comparison below.

Firstly, to minimize visits to routers within a non-singleton co-reliance group, downstream routers (routers with fewest iBGP hops to the egress) are evaluated ﬁrst to ensure maximum information is available as early as possible.

For a monitored router to choose the route it is known to select, the route must be in its set of available routes. We can increase the likelihood of this by ensuring if a monitored router prefers its downstream egress, we evaluate this router prior to other routers. In addition, if a monitored router prefers a non-downstream egress, then we would like this route to be in the set of available routes when we visit the monitored router. Hence, we evaluate the monitored router after other routers.

There are still issues to ensure consistency with the BGP monitor. These may occur when we have, for example, one monitored router which prefers another route-reﬂector’s egress for which we do not have a BGP monitor. We attempt to ensure this egress is available when we evaluate this router’s decision by evaluating the decisions of parents of such egresses prior to other routers.
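As an illustration of how such a comparison can drive a generic sort, the following Python sketch mirrors the three ordering rules described above. It is a sketch under stated assumptions rather than the subroutine of Figure 4.5.1 itself: the dictionaries hops, monitor and parent_reliance are hypothetical inputs standing in for the hop count, the monitor indicator and the parent-reliance flag, and the flag is assumed to be precomputed. Python's sort places an element first when the comparator is negative, which is the opposite of the convention used here, so the result is negated.

```python
from functools import cmp_to_key


def compare_routers(a, b, hops, monitor, parent_reliance):
    """Return 1 if a should be evaluated before b, -1 if after, 0 if unknown.

    hops[r]            -- number of down iBGP edges from r to a downstream egress
    monitor[r]         -- 1, 0 or -1 (downstream preferred, no monitor,
                          non-downstream preferred)
    parent_reliance[r] -- 0 or 1, assumed to be precomputed for this sketch
    """
    # Evaluate routers closest to the egress first.
    if hops[a] != hops[b]:
        return 1 if hops[a] < hops[b] else -1
    # Evaluate monitored routers preferring a downstream egress first.
    if monitor[a] != monitor[b]:
        return 1 if monitor[a] > monitor[b] else -1
    # Order by the parent-reliance flag, smaller flag first.
    if parent_reliance[a] != parent_reliance[b]:
        return 1 if parent_reliance[a] < parent_reliance[b] else -1
    # No information to indicate an ordering.
    return 0


def order_group(routers, hops, monitor, parent_reliance):
    """Sort the routers of one co-reliance group into evaluation order."""
    key = cmp_to_key(
        lambda x, y: -compare_routers(x, y, hops, monitor, parent_reliance)
    )
    return sorted(routers, key=key)
```

Because Python's sort is stable, routers for which the comparison returns 0 keep their input order in this sketch.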

Once we have an ordering for co-reliance groups and an ordering for routers within a co-reliance group, we can calculate the decisions of all routers by simply walking the reliance graph as shown in Figure 4.5.3. We visit each co-reliance

sub compare_routers(a,b)
    // Evaluate routers closest to egress first
    if H(a) < H(b)
        return 1
    elsif H(a) > H(b)
        return -1
    end-if
    // Evaluate monitored routers preferring downstream first
    if M(a) > M(b)
        return 1
    elsif M(a) < M(b)
        return -1
    end-if
    // Evaluate routers whose reliant routers need downstream route
    parent_reliance_a = 0
    if M(a) == 1
        foreach a_i ∈ R(a)
            if M(a_i) ∈ Λ(a)
                parent_reliance_a = 1
            end-if
        end
    end-if
    parent_reliance_b = 0
    if M(b) == 1
        foreach b_i ∈ R(b)
            if M(b_i) ∈ Λ(b)
                parent_reliance_b = 1
            end-if
        end
    end-if
    if parent_reliance_a < parent_reliance_b
        return 1
    elsif parent_reliance_a > parent_reliance_b
        return -1
    end-if
    // No information to indicate ordering of routers
    return 0
end-sub

Figure 4.5.1: Router comparison subroutine for a non-singleton co-reliance group. By ordering routers within a co-reliance group, we can increase the likelihood of converging to a solution consistent with constraints learned from the measurement infrastructure. If compare_routers(a,b) returns 1, we evaluate the BGP decision process of router a prior to router b. If compare_routers(a,b) returns -1, we evaluate the BGP decision process of router a after router b. This subroutine can be used in conjunction with a generic sorting algorithm (we use the inbuilt Perl sort) to determine a sorted list of routers.

T_r      Available routes learned from neighboring routers according to topology rules
α(T_r)   Returns preferred route from T_r at router r
b_r      Selected route at router r
β_Dk     Routers with decisions evaluated in co-reliance group D_k
β_D      Co-reliance groups already visited
M(r)     1 if a monitor is available for router r and a downstream egress is preferred; 0 if no monitor is available on router r; -1 if a monitor is available and a non-downstream egress is preferred
R(r)     Set of routers reliant on r
H(r)     Number of down iBGP edges a router r is from a downstream egress
min      First element in a sorted set
Γ        Routers which modify their decision on backtracking

Figure 4.5.2: Function and variable definitions used in compare_routers and the network solver algorithm.

group in order and visit each router within the co-reliance group in order. If the co-reliance group is non-singleton, we continue evaluating the router decisions until no routers alter their decision. Our network solver algorithm does not rely on the underlying topology, only its description in terms of a reliance graph. Consequently, any topology describable by a reliance graph can be analyzed using this algorithm.

We saw in Chapter 3 that it is possible that a co-reliance group never converges to a solution. However, if this is the case, then the actual network also has oscillatory properties. The non-convergent co-reliance group isolates the routers responsible for oscillatory modes so administrators can then investigate possible corrective actions.

β_D ← φ
while ∃ D_k ∉ β_D
    D_k = min{D_j | D_j ∉ β_D}
    β_Dk ← φ    // Reset visited routers
    // Initializing router decisions
    while ∃ r ∉ β_Dk
        r = min{r_i | r_i ∉ β_Dk}
        b_r ← α_r(T_r)
        β_Dk ← β_Dk ∪ r
    end
    if |D_k| > 1
        // Check all routers decisions stable
        do
            β_Dk ← φ
            Γ ← φ    // Reset modified decisions
            while ∃ r ∉ β_Dk and r = min{r_i | r_i ∉ β_Dk}
                if b_r ≠ α_r(T_r)
                    Γ ← Γ ∪ r
                end-if
                b_r ← α(T_r)
                β_Dk ← β_Dk ∪ r
            end
        while Γ ≠ φ
    end-if
    β_D ← β_D ∪ D_k
end

Figure 4.5.3: Network solver algorithm. The covering while loop enumerates all co-reliance groups. The first inner loop enumerates all routers within a co-reliance group. We have separated, for clarity, the process of checking if all routers' decisions are stable in a non-singleton co-reliance group. However, in practice, this loop can be integrated into the first inner loop.
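The walk of Figure 4.5.3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: the names solve, preferred_route and max_rounds are hypothetical, preferred_route stands in for the BGP decision process, and the max_rounds guard is an addition of this sketch so that a non-convergent (oscillating) co-reliance group is reported rather than looping forever.

```python
def solve(groups, preferred_route, max_rounds=100):
    """Walk co-reliance groups in order and return (decisions, non_convergent).

    groups          -- list of co-reliance groups in topological order, each a
                       list of routers already sorted into evaluation order
    preferred_route -- callable (router, decisions) -> selected route; a
                       hypothetical stand-in for the BGP decision process
    max_rounds      -- assumption of this sketch: bound on re-evaluations
    """
    decisions = {}
    non_convergent = []
    for group in groups:
        # First pass: initialise the decision of every router in the group.
        for router in group:
            decisions[router] = preferred_route(router, decisions)
        if len(group) > 1:
            # Re-evaluate until no router in the group changes its decision.
            for _ in range(max_rounds):
                changed = set()
                for router in group:
                    route = preferred_route(router, decisions)
                    if route != decisions[router]:
                        changed.add(router)
                    decisions[router] = route
                if not changed:
                    break
            else:
                non_convergent.append(list(group))
    return decisions, non_convergent


# Toy example (hypothetical): "e" has a direct egress; "a" and "b" learn
# routes from each other, and "a" can also learn from "e".
LEARNS_FROM = {"e": [], "a": ["b", "e"], "b": ["a"]}


def toy_preference(router, decisions):
    if not LEARNS_FROM[router]:
        return router  # an egress router selects its own direct egress
    for neighbor in LEARNS_FROM[router]:
        if decisions.get(neighbor) is not None:
            return decisions[neighbor]
    return None


decisions, non_convergent = solve([["e"], ["a", "b"]], toy_preference)
```

In the toy example every router ends up selecting the egress at "e", and no group is reported as non-convergent.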

4.5.2 Breaking Ties

When the oldest-route tie-breaker is used and there are multiple routes available at a router with equal IGP distances to an egress, then any route with this equal-best IGP distance may be chosen by the router. This can increase the number of feasible network solutions. As in Section 4.5.1, we would like to converge to a solution satisfying our measurements. If we have a route-monitor at a router and the route reported by the monitor is in the set of equal-best IGP distance routes, then we select this route. However, if no route-monitor is available for a router, we select the route that satisfies the greatest number of monitors connected to reliant routers. Once a monitor on a reliant router has been satisfied, it is marked so that other tie-breaking decisions will satisfy other monitors. If the tie is still not broken, we select a route at random. This random choice is the reason seven incorrect decisions were found out of over 12.7 million decisions.
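The tie-breaking heuristic can be sketched as below. All names (break_tie, reliant_monitors, satisfied) are hypothetical and chosen for illustration only. Where several routes satisfy the same number of monitors, the sketch picks at random among those routes; this is one reading of the rule above and is an assumption of the sketch.

```python
import random


def break_tie(tied_routes, monitored_route, reliant_monitors, satisfied, rng=random):
    """Choose one route among those with equal-best IGP distance.

    tied_routes      -- non-empty list of candidate routes
    monitored_route  -- route seen by a monitor at this router, or None
    reliant_monitors -- dict: reliant router -> route its monitor reports
    satisfied        -- set of reliant routers whose monitor is already
                        satisfied; updated in place
    """
    # A monitor at this router decides the tie directly.
    if monitored_route in tied_routes:
        return monitored_route
    # Otherwise count the not-yet-satisfied reliant monitors each route serves.
    votes = {
        route: [r for r, seen in reliant_monitors.items()
                if seen == route and r not in satisfied]
        for route in tied_routes
    }
    best = max(len(v) for v in votes.values())
    winners = [route for route in tied_routes if len(votes[route]) == best]
    # Still tied (or no monitor information at all): choose at random.
    choice = winners[0] if len(winners) == 1 else rng.choice(winners)
    satisfied.update(votes[choice])
    return choice
```

Passing the same satisfied set to successive calls lets later tie-breaks favour monitors that have not yet been satisfied.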

4.5.3 Dynamic IGP

IGP distances are dynamic and can change when the underlying network changes. Consequently, our approach must be amenable to IGP changes as well as BGP changes. An OSPF monitor [97] is able to record the dynamics of the IGP, and this information can be fed into our model. The design of the reliance graph (which abstracts the actual IGP distances) allows us to re-evaluate router decisions only when the changes in distance affect router decisions. This process is shown in Figure 4.5.4. The ranking of egresses at each router can be thought of as the index in a sorted list (based on IGP distance). If the lowest-router-id is used as the tie-break option, then the list can be completely sorted. If the updated distance between two routers does not affect the ranking of the egress, then no router's decision is altered. However, if the ranking of the egress is altered, then the reliance graph may be changed and consequently the decisions of some routers may also be changed. We only need to examine the egress instances in which the reliance graph can

sub igp_change(a,b,newdist)
    // Check if this egresses rank has changed
    if rank(a, dist(a,b)) == rank(a, newdist)
        // no changing in rank so no change in decision
        return
    end-if
    // Find all routers whose rank has now changed
    if rank(a, dist(a, b)) < rank(a, newdist)
        E ← sorted_egresses[rank(a, dist(a, b))...rank(a, newdist)]
    else
        E ← sorted_egresses[rank(a, newdist)...rank(a, dist(a, b))]
    end-if
    // Check which reliance graphs are affected
    R ← φ    // Reliance graphs to re-calculate
    foreach e ∈ E
        foreach X
            if b ∪ e ∈ X
                recalculate_reliance_graph(X)
            end-if
        end
    end
end-sub

Figure 4.5.4: Subroutine igp_change for determining the reliance graphs requiring recalculation when an IGP distance changes.

be altered. That is, those egress instances in which both the egress aﬀected by the new distance and another egress which alters its ranking are part of the egress instance. Even if the best egress is not aﬀected by a change in IGP distance, we re-calculate the reliance graph to ensure all reliance graphs are synchronized with the current state of the network.
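The rank test at the heart of this procedure can be sketched in Python. The function name and the dictionary of distances are assumptions for illustration; the egress name is used as a stand-in for the lowest-router-id tie-break so that the list is completely sorted.

```python
def egresses_with_changed_rank(distances, egress, new_distance):
    """Return the egresses whose rank changes at one router when the IGP
    distance to `egress` becomes `new_distance`.

    distances -- dict: egress -> current IGP distance from the router
    An empty result means the ranking, and hence every decision derived
    from it, is unchanged.
    """
    def ranking(d):
        # Sort by distance, then by name (stand-in for lowest-router-id).
        return sorted(d, key=lambda e: (d[e], e))

    before = ranking(distances)
    after = ranking({**distances, egress: new_distance})
    return [e for e in before if before.index(e) != after.index(e)]
```

Only the egress instances containing both the changed egress and one of the returned egresses would then need their reliance graphs recalculated.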

4.6 Evaluation

We have implemented our techniques in Perl v5.8.8 on a shared Sun Solaris server and evaluated them using data collected from a large Tier-2 AS. This data included router-configuration files, IGP distances from an OSPF monitor and BGP routes from a route-monitor connected to approximately 15% of routers, a majority of which were route-reflectors. We use the known routes from these routers as the set of input routes to the network. Each such route contains a “next-hop” attribute which corresponds to the egress router for the route. Where no IGP distance is available (which may be due to OSPF monitor limitations in certain areas of the network), we assume the egress is unreachable³. The AS has a three-level RR hierarchy and uses the oldest-route tie-break option. The MED attribute is reset by the AS.

Our algorithm discovers the decisions made by routers once the network has converged to a solution. Consequently, we only examine stable prefixes: those prefixes with no updates witnessed from any router under observation in the 6 hours prior to and 6 hours following the point in time we examined. Our evaluation is based on data collected on 26th May 2008, although we found similar results for several other examined intervals. During the analysis process, our model discovered several minor configuration errors. In these cases, our model predicted the “correct” outcome, although the network selected an “incorrect” outcome due to a configuration error on several egress routers. We confirmed this configuration error with operators, and it was subsequently corrected. We have excluded the prefixes affected by these configuration errors from our analysis.

We group all prefixes with the same egress instance into molecules. A single reliance graph exists for each molecule. We were able to cluster the 224,870 stable prefixes into 827 reliance graphs, a significant reduction in required computation. However, as there are multiple feasible solutions for a single reliance graph (due to the ordering of routers within a co-reliance group and the oldest-route tie-break), we split molecules into atoms. Atoms are clusters of prefixes that share the same egress instance and for which all route-monitors indicate the same egress is selected for each router. Each atom requires a ‘walk’ of the reliance graph. For the 224,870 stable prefixes, we discovered the egress router selected by all routers (including the 85% of routers without route-monitors) with 1,154 walks of the reliance graph.

As our technique is based on the rules of route propagation, it will always find a valid solution given any configuration. With the addition of monitor information (or any other constraints available), we can converge to a solution satisfying such constraints. In practice, we found our technique always found valid solutions, with only seven inconsistencies with route-monitors in over 12.7 million known (prefix, router) pairs. These discrepancies resulted from a tie-break decision at a router without a route-monitor. The predicted egresses were in the same PoPs as the actual selected egress. Backtracking to alter the random tie-break decision would correct this. We found 99.99% of co-reliance groups were singleton, and the maximum size of a co-reliance group was five routers, which occurred only four times, ensuring our technique very rarely required the re-evaluation of router decisions.

³If we have a route-monitor indicating a route is better than others but we have no IGP distance, we assume the IGP distance is less than all other routes.
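The grouping of prefixes into molecules and atoms can be illustrated with a short Python sketch. The function name and input layout are hypothetical; in particular, the sketch assumes an egress instance can be represented as the set of egress routers for a prefix, and that monitor observations are available as a per-prefix mapping from monitored router to selected egress.

```python
from collections import defaultdict


def group_prefixes(egress_instance, monitor_view):
    """Cluster prefixes into molecules and atoms.

    egress_instance -- dict: prefix -> iterable of egress routers
                       (assumed representation of an egress instance)
    monitor_view    -- dict: prefix -> {monitored router: selected egress}
    Returns (molecules, atoms), each a dict mapping a key to its prefixes.
    """
    molecules = defaultdict(list)
    atoms = defaultdict(list)
    for prefix, egresses in egress_instance.items():
        instance = frozenset(egresses)
        # Molecule: all prefixes sharing one egress instance (one reliance graph).
        molecules[instance].append(prefix)
        # Atom: additionally, all monitors agree on the selected egresses
        # (one walk of the reliance graph).
        view = frozenset(monitor_view.get(prefix, {}).items())
        atoms[(instance, view)].append(prefix)
    return dict(molecules), dict(atoms)
```

The number of molecules corresponds to the number of reliance graphs to build, and the number of atoms to the number of walks required.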

4.7 Generalized Topologies

In this chapter we have focused on the route-reflector topology. Although our approach has fewer restrictions than any previous work of which we are aware, we still have some restrictions, such as the MED attribute being filtered. However, even this restriction can be relaxed.

4.7.1 Route-Reﬂection with MED

Now let us consider a route-reflector topology where an AS respects MEDs (which are set by neighboring ASes). That is, MED values are only compared if they are learned from the same AS. Consequently, it is no longer valid to assume all routers select routes that are equally attractive through the MED step of the BGP decision process. However, it is valid to assume that all routers select a route equally attractive up to the MED step [31]. It is also no longer valid to assume routers prefer their direct egress. However, all direct egress routers with the best AS-wide MED value (for each neighboring AS) will still always select their direct egress.

A direct egress router with a non-optimal MED value will select a better route if it learns of it. Thus, all valid paths in the signaling graph along which the non-optimal egress can learn of this route must be in the reliance graph. The increase in the number of edges in the reliance graph can increase the size of the co-reliance groups. However, the maximum co-reliance group size is still bounded by the size of the egress ancestor set (routers that either have a direct egress or can learn of an egress from a client), which is commonly an order of magnitude smaller than the total number of routers. Feamster and Rexford [31] recommend a simulator as the most efficient method of solving this case, but our approach involves significantly reduced complexity.

Consider the example in Figure 4.7.1(a). In this example, we assume all route-reflectors are closer to their clients than any other router (satisfying Constraint A). Two ASes (black and white) send routes with MED attributes depicted by the number inside the large arrows. Recall that reliances only indicate where a router may learn of its best route. There are more possibilities in this case as border routers with a direct egress can choose to exit the network via an indirect route (if they learn of a route from the same AS with a lower MED value).

Firstly, as shown in Figure 4.7.1(a) we add reliances as before based on IGP distances. As we have assumed for this example Constraint A is satisﬁed, no router with a client egress is reliant on another router in the same level of the hierarchy.

Secondly, we add reliances resulting from the MED announcements from white. Router 2 has a direct egress and hence it may select this route. However, 1 has an egress with a better MED value than the direct egress via 2. If 2 ever learns of the better route via 1, it will select it. Thus we must add reliances on every feasible signaling path from 1 to 2. In this two-level hierarchy, there is only one feasible signaling path (1, 3, 4, 2). Hence, we add the reliances between routers 3 and 4 and between 4 and 2. Router 7 also learns of a route from white. If 7 ever learns of the egress via 1 or 2, it will select it. Hence, we add all feasible paths from 1 to 7 and 2 to 7. Reliances between routers 3 and 5, 4 and 5, and 5 and 7 are inserted into the reliance graph. All the above reliances are shown in Figure 4.7.1.

(a) IGP distance reliances. In this example, all client egress routers are closer than all non-client egress routers.

(b) Addition of reliances from white. Router 1 is not reliant on any other router, as no white MED value is better than its own direct egress MED value. However, if router 2 learns of the route from 1, then it will select it over its own direct egress. Hence, on all possible signaling paths from 1 to 2, we create reliances (in this example, there is only one valid signaling path (1, 3, 4, 2)).

(c) Black reliances and co-reliance groups. Router 9 may alter its decision if it learns the black route from router 7. We create reliances on the signaling path between 7 and 9. Co-reliance groups are the strongly connected components of the reliance graph.

Figure 4.7.1: An example topology where the MED attribute is respected. The shading of the large arrows represents the neighboring AS originating the route, and the MED value is depicted within the arrow.

Thirdly, reliances can be created from the announcements of black. Router 9 will modify its selected route if it learns of the route via 7. Thus, as the only feasible signaling path from 7 to 9 is (7, 5, 4, 9), we add the reliances between 5 and 4 and between 4 and 9.

Now that we have the reliance graph for this topology, we find the strongly connected components which form the co-reliance groups (Figure 4.7.1(c)) and apply our general algorithm for determining router decisions (Figure 4.5.3). Feamster and Rexford [31] stated simulation is the best way to solve the network solution when MEDs are compared per AS. However, we have shown that this is not the case. In the best case, our techniques are linear in the number of routers, and in the worst case router decisions only need to be evaluated multiple times in the egress ancestor set, which is much smaller than the total number of routers.
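Adding reliances along a feasible signaling path can be sketched as follows. The helper name add_path_reliances and the dictionary layout are assumptions made for illustration; the sketch reads "reliances on a signaling path" as each router on the path being reliant on its predecessor, and uses only the two signaling paths stated explicitly in the text.

```python
def add_path_reliances(relies_on, path):
    """Make every router on a signaling path reliant on its predecessor.

    relies_on -- dict: router -> set of routers it relies on (updated in place)
    path      -- sequence of routers from the better egress to the router
                 that would switch to it, e.g. (1, 3, 4, 2)
    """
    for upstream, downstream in zip(path, path[1:]):
        relies_on.setdefault(downstream, set()).add(upstream)
    return relies_on


# The two signaling paths named in the text for white and black.
med_reliances = {}
add_path_reliances(med_reliances, (1, 3, 4, 2))   # white: 2 may learn 1's route
add_path_reliances(med_reliances, (7, 5, 4, 9))   # black: 9 may learn 7's route
```

These MED-derived reliances would be merged with the IGP distance reliances before the strongly connected components are computed.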

4.7.2 Full Mesh

A full mesh is the simplest of iBGP topologies but it can still be difficult to analyze in the presence of MEDs. When the MED attribute is either ignored or compared AS-wide, all routers with a direct egress (AS-wide best route up to the IGP distance step) will select it as best and hence are not reliant on any other router. All other routers have an iBGP session with all routers. Thus the only reliance rule required for the full mesh is of routers without direct egress routes on those routers with direct egress routes. Consider the example in Figure 4.7.2. Here we have two routes equally attractive up to the IGP distance step arriving at routers 2 and 8. Given the simple reliance graph, the analysis is almost trivial. When MED is introduced, we are still able to determine reliances in a similar way to the route reflection example in Section 4.7.1. We simply place reliances on all routers with larger MED values (per neighboring AS).

[Figure: reliance graph over routers a to h, with co-reliance groups D1 to D8.]

Figure 4.7.2: Reliance graph for a full-mesh topology. Routers without a direct egress are reliant on routers with a direct egress.

For example, in Figure 4.7.3(a) we see router a has a direct egress; however, it is reliant on b as it will select the route learned at b from black (as it has a lower MED value). However, it will not select the route learned at h as the white and black MED values are not comparable. Again the reliance graph is simple and the analysis is trivial. An interesting example in Figure 4.7.3(b) is derived from Griffin and Wilfong’s ‘Mashed Potato’ configuration [48]. Here black and white each announce routes at h and b with black preferring the egress at h and white the opposite. This results in a non-singleton co-reliance group D1. Multiple solutions are feasible in this co-reliance group, i.e. (b, h) = (black20, white20), (white30, black30). Notice the second solution results in both backup routes being selected, breaking the semantics of the MED attribute [48].
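The full-mesh MED rule can be sketched in Python. The function name and tuple layout are hypothetical, and the MED values below are an illustrative reading of the ‘Mashed Potato’ example (black announcing the lower MED at h, white the lower MED at b), not data taken from the figure.

```python
def med_reliances_full_mesh(routes):
    """Return router -> set of routers it relies on because they hold a
    route from the same neighboring AS with a lower MED value.

    routes -- list of (router, neighbor_as, med) for direct egress routes
    MED values from different neighboring ASes are never compared.
    """
    relies_on = {}
    for router, neighbor_as, med in routes:
        better = {
            other
            for other, other_as, other_med in routes
            if other != router and other_as == neighbor_as and other_med < med
        }
        relies_on.setdefault(router, set()).update(better)
    return relies_on


# Illustrative values only: each neighbor prefers a different egress.
mashed_potato = med_reliances_full_mesh([
    ("h", "black", 20), ("b", "black", 30),
    ("b", "white", 20), ("h", "white", 30),
])
```

In this sketch b relies on h (for black) and h relies on b (for white), so the two routers form a non-singleton co-reliance group.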

4.7.3 Confederations

Confederations of sub-ASes are used as an alternative to route-reflection in large networks where a full iBGP mesh is infeasible. The large AS is split into a confederation of sub-ASes. Within a sub-AS, the reliances can be calculated as with a full-mesh topology. Additional reliances are required between routers in separate confederations with iBGP sessions as they are able to propagate any

(a) Basic example. (b) Full-mesh ‘Mashed Potato’ topology.

Figure 4.7.3: Full-mesh topology with the MED attribute respected.

route learned. Complicated topologies, such as a combination of route reflection and confederations⁴, can be solved if reliance rules can be found. In the worst case every router could be part of a single co-reliance group, and the algorithm is effectively a network simulation. Consider the example in Figure 4.7.4(a). The AS is split into four sub-ASes. Each sub-AS has a full-mesh topology. Links between sub-ASes exist between routers (d, j), (d, g), (h, n), (m, j). Equally good routes enter the AS at routers a and f. As shown in Figure 4.7.4(b), within each sub-AS, reliances are found as with the full-mesh topology. If we assume the intra sub-AS distances are shorter than inter sub-AS distances then all co-reliance groups in sub-ASes with direct egresses are singleton. However, D10 is non-singleton as j is closer to f than a and m is closer to a than f. The order in which messages are passed determines which solution is found. When multiple routers are present in a co-reliance group, we use a similar ordering of routers within the co-reliance groups to that used for the route-reflection topology. We evaluate routers with monitors using an egress with the fewest sub-AS hops

⁴We are unaware of this topology being used in any network.

(a) iBGP topology. Links between sub-ASes are iBGP sessions able to propagate any learned route. Red dashed lines indicate the important preferred egresses (from another sub-AS).

(b) Reliance graph of iBGP topology.

Figure 4.7.4: An example confederation of sub-ASes and the corresponding reliance graph.

first. For example, in Figure 4.7.4(b), if j chooses to egress via a we evaluate a’s decision prior to m’s (as m would choose a as well). A possible extension of the description of confederations with reliances is the description of inter-AS relationships and the prediction of Internet-wide routes. A topology inferred by a technique such as that of Mühlbauer et al. [78] could form a starting point to predict the (Internet-wide) solution for a particular prefix and may help to answer Internet-wide ‘what-if’ questions. In addition, the ordering of router decisions outlined in this chapter could also be used to improve the convergence times of existing BGP simulators such as C-BGP [88].

4.8 Discussion

Route-monitors are used to determine the route selected by routers in the network, i.e. the selected network solution. Our reliance graph analysis identifies where routes can be learned. Consequently, a new direction of research could work to identify the optimal placement of BGP monitors to minimize the number of random tie-break decisions while maximizing information as to the available egresses.

In this chapter, we presented a reliance graph model to capture the dependence among routers for route selection. The input to the model is the iBGP topology and IGP distances. The model allows one to efficiently calculate the network solution (set of routes selected by all routers) with no assumptions on the iBGP configuration. The model also works when only partial information about routes is available. We demonstrated the efficacy of the model by applying it to a Tier-2 AS with over 220,000 prefixes. Our methodology was able to find a valid solution that was also consistent with observed routes for all but seven (prefix, router) pairs, even when routes from only about 15% of routers were known.

One significant benefit of using a reliance graph model is that dynamics of iBGP topology or IGP distance that do not affect the reliance graph also do not have any effect on the actual routing choices. Furthermore, BGP route dynamics only require the re-evaluation of routers in the portion of the network so affected. We believe that these two features should allow our methodology to work in real-time for filling gaps left by BGP monitors as well as for ‘what-if’ analyses. In fact, in Chapter 5 we apply the methodology successfully to determine the current router decisions and predict changes under modified route availability.

In Chapter 3 we proposed an alteration to the BGP decision process to prevent persistent oscillation. This alteration involved an additional step before the IGP distance step to prefer client-learned routes over any other. If this step were introduced to the BGP decision process, the reliance rules would become significantly simpler and co-reliance group sizes would be reduced, resulting in the network solution becoming more deterministic.

Chapter 5

Peer Dragnet: Analysis of BGP Peering Policies

So far in this thesis, we have considered how a set of routes of equal attractiveness up to the IGP distance step influence the selections of routers within an AS. However, this set of available routes is influenced by the policies of neighboring ASes. Consequently, of equal importance to network management are the relationships with neighboring ASes. In this chapter, we investigate the policies of neighboring ASes to the Tier-2 AS under examination, and using techniques from Chapter 4, we determine the impact they have on router decisions and traffic flow within the Tier-2 AS.

5.1 Introduction

Peering agreements between ASes within the Internet are based on trust. Terms of peering are expressly outlined in the peering agreement prior to a peering link being established, and it is assumed that the agreement is implemented in a network operator’s policies. Is this really the case? This chapter examines the extent to which peering policies vary from convention. ASes are not points; in reality they are complicated, geographically distributed systems in their own right. The tendency to represent the Internet as an AS-graph (e.g. see [82] and the references therein) where ASes are simple nodes, connected by single links, is not always accurate. ASes are often interconnected at multiple geographic locations and can indicate a preference for some locations over others. However, unless otherwise negotiated, the conventional practice is for peers to announce the availability of equally good (or canonical) BGP routes at every peering location [1,2]. This allows the local AS to use its own optimization criteria to determine the egress link for traffic destined to the peer. In this chapter, we present the first long-term and detailed study of (1) the peering policies of all neighbors of a large Tier-2 AS network, (2) how these external policies impact the local network, and (3) the evolution of these policies over time.

Tier-2 ASes benefit greatly from peering relationships as they reduce upstream costs. Hence they typically have a large number of peers (the AS in question has more than 100 peers). The motivation to peer with other ASes is starkly different for a Tier-1 AS [81] — such as the one examined in [30] — whose neighbors are either peers or customers. Thus, Tier-1 ASes generally have an order of magnitude fewer peers and are likely to see a narrower range of behavior. In the Tier-2 network we examine, 42% of peers have non-canonical policies — a policy of announcing unequal routes at different peering locations for a prefix — affecting at least 10% of their prefixes, and over 20% of peers’ policies cannot be filtered through basic import policies (such as ignoring the MED attribute in BGP routes). Although we do not know how representative these results are for the Internet as a whole, we believe that most of the peers engaging in non-canonical policies with the AS are likely to be employing similar policies with their other peers as well. Extrapolating this, it is highly likely that non-canonical policies are fairly widespread in the Internet, at least in lower-tier ASes.

We observe peering policies by comparing the relative attractiveness of all BGP routes announced by each peer over all peering locations on a per prefix basis. If a significant proportion of prefixes announced by a peer at each of their peering locations are less attractive (or not announced at all) at some or all links, then we say it is part of a non-canonical peering policy. In Section 5.5 we explain how peers can implement non-canonical policies, and we examine which methods are used in practice. We actually see at least one case of every known method, including a technique that we have not observed being mentioned in the research literature (using the Origin attribute in BGP routes). The prevalence of each technique is interesting in its own right, because it clearly shows that the AS-graph is not representative of the real complexity in inter-domain routing.
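The per-prefix comparison can be sketched in Python. This is a hedged illustration and not the Peer Dragnet implementation: the function names, the representation of a route's attractiveness as a comparable tuple, and the default threshold of 10% for a "significant proportion" are all assumptions made for the sketch.

```python
def non_canonical_prefixes(announcements, locations):
    """Return the prefixes that a peer announces unequally across locations.

    announcements -- dict: prefix -> {location: attractiveness}, where
                     attractiveness is any comparable value summarising the
                     route attributes (hypothetical representation)
    locations     -- all peering locations shared with this peer
    A prefix is flagged if it is missing at some location or if its
    attractiveness differs between locations.
    """
    flagged = set()
    for prefix, by_location in announcements.items():
        seen = [by_location.get(loc) for loc in locations]
        if any(route is None for route in seen) or len(set(seen)) > 1:
            flagged.add(prefix)
    return flagged


def is_non_canonical(announcements, locations, threshold=0.1):
    """Classify a peer as non-canonical when the flagged share of its
    prefixes reaches `threshold` (an assumed cut-off for this sketch)."""
    if not announcements:
        return False
    flagged = non_canonical_prefixes(announcements, locations)
    return len(flagged) / len(announcements) >= threshold
```

A tuple such as (AS-path length, origin, MED) could serve as the attractiveness value, since equal tuples at every location correspond to equally good routes.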

Each non-canonical policy has an impact on the local AS. The magnitude of impact can vary based on the current routing state, internal network topology and the quantity of traffic destined to the peer. Consequently, in Section 5.6, we use our model described in Chapter 4 to analyze the internal routing and subsequent traffic flow changes as a result of all non-canonical peering policies for the Tier-2 AS. We show that for about 27% of peers with non-canonical policies, more than 10% of all (router, prefix) pairs are modified for the peer's prefixes. We follow this with an example where 85% of traffic on a particular peering link is affected.

Modification of a peering policy generally requires time-consuming changes to contractual peering agreements and manual modifications to router configurations. Consequently, peering policies are expected to remain stable for extended periods of time. In Section 5.7 we examine the dynamics of peering policies. We present an algorithm that can learn an AS’s current peering policy and alert an operator when a significant policy change occurs. We find that 9% of peers modify their policies at least once during a 20-week interval. Given that peering agreements are generally negotiated prior to establishing a peering link, policy changes are of substantial interest to network operators.

We have implemented our methodology for detecting non-canonical policies and quantifying their impact in a tool which we call Peer Dragnet. This tool and a preliminary study based on its deployment were presented at NANOG [84]. This work extends the study with a more detailed, thorough and rigorous analysis. The Peer Dragnet tool is now deployed, allowing operators of the Tier-2 AS in question to monitor their peers on a continuous basis. The tool and the analysis presented here have resulted in identification of peers that are violating their peering agreement by using non-canonical policies.

[Figure 5.2.1 diagram: local AS A (destination side) with peers AS B and AS C and customers AS D (source) and AS E. Legend: local route-reflector, local client router, remote router, eBGP, iBGP, canonical, non-canonical.]

Figure 5.2.1: The impact of non-canonical peering policy. AS A has three peering links to AS B. Under a canonical peering policy, the traffic from customer AS D ‘hot-potato’ exits via peering router 3. However, under a non-canonical peering policy, AS B can cause AS A to transit the traffic further to exit AS A’s network at peering router 1.

Terminating peering with peers violating their agreements is an obvious solution. However, in reality, an AS might not want to exercise this option for business reasons. Hence, the AS might be interested in mitigating the effect of non-canonical policies. In Section 5.9, we show that by employing the monitoring capabilities of Peer Dragnet and appropriate use of import policies, a network operator can mitigate the influence of a non-canonical policy to a large extent.

To round out the chapter, we describe background information in Section 5.2, related work in Section 5.3, and an overview of the data used for our analysis in Section 5.4.

5.2 Background

Recall from Chapter 2 that the relationship between a pair of ASes generally falls into one of the following two broad categories [37, 52]:

1. Customer-Provider: One AS (customer) financially compensates the other AS (provider) for connectivity to the remainder of the Internet. For instance in Figure 5.2.1, AS A is a provider, and AS D is its customer.

2. Peer-Peer: A mutually beneficial relationship between two ASes to provide connectivity to each other’s customers. No remuneration is required for traffic exchanged between the two peer ASes. For example, the relationship between A and B in Figure 5.2.1 is a peer-peer relationship. An AS generally does not provide transit for traffic between two of its peers. For example, AS A in the figure would not pass traffic between peers B and C.

ASes often connect with each other at multiple locations (whether the relationship is customer-provider or peer-peer). When they do so, an AS is often required to send equally attractive routes (up to the IGP distance step of the BGP decision process) over all the connecting links. We call such a policy a canonical policy. Following a canonical policy allows the receiving AS to minimize its resource usage by choosing the closest egress link (based on the IGP distance) to send traffic to the sending peer. This strategy is often referred to as “hot-potato-routing”. Fairness in peering is attempted by both ASes following a canonical policy. In such a scenario, one AS carries the traffic further in one direction, while the other AS carries the traffic further in the reverse direction.

However, an AS can send unequal routes (i.e., follow a non-canonical policy) over its peering links. This restricts the ability of the receiving AS to choose the best local egress link. For example, in Figure 5.2.1, if AS B has a canonical peering policy, all routes to the destination are announced by B with equally good attributes (AS Path length, Origin type and MED value) on peering links (B1–A1, B2–A2, B3–A3). Thus, client router 4 in AS A chooses the closest egress location for the destination prefix, thereby allowing AS A to pass traffic to its peer via the closest egress (on link B3–A3).

On the other hand, if AS B uses a non-canonical peering policy where it announces a more attractive route on link (B1–A1) than on the other links (B2–A2 and B3–A3), or does not announce routes on these links at all, the BGP decision process is then left with the route via the link closest to the destination (B1–A1) as the sole best route after step 4. Consequently, AS B forces AS A into “cold-potato-routing” as the link (B1–A1) is chosen by AS A to send traffic to AS B. In other words, not only does AS A lose its ability to select a peering link for outgoing traffic, it ends up carrying the traffic through its backbone in both directions. The goal of a (no cost) peering relationship is mutual benefit, which relies on the symmetry of canonical policies. If AS A effectively transits the traffic for the other AS, then A can legitimately ask why B is a peer, rather than a customer. There are specific instances where non-canonical policies may be considered legitimate by both A and B, but these need to be agreed to by both parties when forming the peering agreement. For example, the traffic exchanged between ASes may be asymmetric. In such a case, both ASes employing a canonical peering policy may not be as ‘fair’ as when traffic is symmetric.
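The contrast between the two cases can be sketched in a few lines of Python. This is an illustration only, not the tooling used in this work: the `Route` fields, the numeric ranking of Origin values, the IGP distances and the amount of prepending are all assumptions chosen for the example, and only the attribute steps discussed here (AS Path length, Origin, MED) followed by the IGP distance step are modelled.

```python
from dataclasses import dataclass

# Lower rank = more preferred Origin (assumed encoding for this sketch).
ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}

@dataclass(frozen=True)
class Route:
    link: str          # peering link the route was learned on, e.g. "B1-A1"
    as_path_len: int
    origin: str
    med: int

def attractiveness(route):
    """Sort key for the attribute steps before IGP distance (lower is better)."""
    return (route.as_path_len, ORIGIN_RANK[route.origin], route.med)

def select_egress(routes, igp_distance):
    """Keep the most attractive routes, then pick the closest egress link."""
    best = min(attractiveness(r) for r in routes)
    candidates = [r for r in routes if attractiveness(r) == best]
    return min(candidates, key=lambda r: igp_distance[r.link]).link

# Hypothetical IGP distances from client router 4 to each egress link.
igp = {"B1-A1": 9, "B2-A2": 5, "B3-A3": 2}

# Canonical: identical attributes on every link.
canonical = [Route(link, 2, "IGP", 0) for link in igp]
# Non-canonical: AS B prepends on every link except B1-A1.
non_canonical = [Route("B1-A1", 2, "IGP", 0),
                 Route("B2-A2", 4, "IGP", 0),
                 Route("B3-A3", 4, "IGP", 0)]

hot = select_egress(canonical, igp)        # closest egress wins
cold = select_egress(non_canonical, igp)   # only B1-A1 survives the attribute steps
```

Under the canonical announcements the IGP distance decides the egress; under the non-canonical ones the decision is already made before the IGP distance is consulted.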

A non-canonical peering policy can be a negotiated policy between ASes, a deliberate breach of a peering agreement to fulfill local objectives, an unintentional configuration error or the result of a peer’s downstream customer setting the no-export attribute (see [30] for details). Where it is a violation of a peering agreement, an AS needs to detect divergences and analyze their impact on resource usage and costs in order to judge the best course of action. Where such policies are an error, the AS may improve overall performance in both networks by helping the peer correct its configuration. Even where a non-canonical policy is a deliberate decision, expressed in the peering agreement, the AS may wish to quantify the impact of this agreement in order to make informed decisions about agreements in the future.

5.3 Related Work

In this section, we provide an overview of closely related work. First of all, the Peer Dragnet tool was introduced by Patrick et al. [84] along with some preliminary analysis of observed non-canonical policies. This chapter extends the work through a more detailed and thorough analysis of such policies, their dynamics and impact on routing and traffic within the provider.

Spring et al. [100] used traceroutes to infer a wide variety of peering policies between Tier-1 ASes. However, they were unable to determine if the cause of a “late-exit” was due to the local policy or the peer’s policy. Along related lines, Mühlbauer et al. [78] relied on eBGP data with limited visibility to infer simple policies of all ASes in the Internet. In contrast, our work analyzes policies from the vantage point of a Tier-2 AS, where we can ascertain in detail the nature of the policies involved.

The Borderguard tool, proposed by Feamster et al. [30], is also closely related to our work. The Borderguard paper proposed a methodology to determine non-canonical policies when an AS has access to data from a route-monitor. Recall from Chapter 2 that a route-monitor is only able to provide access to the Loc-RIB of a border router. The Loc-RIB only provides access to one of all the routes (held in the RIB-ins) learned at the border router. Consequently, not all routes announced by a peer may be visible in a route-monitor. The more peers connected to peering routers, the more hidden the announced routes become. At Internet Exchange Points (IXPs) it is common practice for many peers to connect to a single peering router. Thus, to fully quantify a peer’s policy, we need the greater visibility provided by RIB-ins. If we were to rely solely on Loc-RIB BGP data to detect policies, as the Borderguard paper does, we would be able to determine the presence or absence of canonical policies for only 37% of (peer, prefix, location) triplets over the January 1 - 14, 2008 interval; for the remaining triplets, we would not get a definitive answer. In contrast, our method can comprehensively analyze the behavior of peering ASes by analyzing the routes announced by peers in RIB-in snapshots.

Borderguard also analyzes Loc-RIB data available from the peering routers of a particular peer. However, if the peer is deliberately acting outside its peering agreement, it can easily hide this fact on the “monitored” session through appropriate export policies. In addition, the Borderguard study focused on a large Tier-1 AS which only had a few large peers. Tier-1 providers typically try to minimize their peering links, preferring to have other ASes as paying customers rather than peers. In contrast, Tier-2 providers can reduce their upstream costs by increasing the number of peering relationships. Consequently, the motivation to peer is different between Tier-1 and Tier-2 ASes. Tier-2 providers are also typically smaller in size and are often geographically localized ASes or content providers. Such ASes might have more resource constraints and hence more motivation for customized non-canonical peering policies. This might explain some of the qualitative difference in the results; whereas the Borderguard paper only found a small number of non-canonical policy instances, we have found a significant number of such policy instances. Finally, unlike this work, the Borderguard paper did not analyze the impact of non-canonical policies on routing and traffic flow; neither did it analyze the dynamics of the policies. We believe understanding the impact of non-canonical policies and their dynamics is vital in determining the future course of action against a peer violating its agreement.

5.4 Data Collection

For our analysis of peering policies of all the peers (over 100) of the Tier-2 AS under examination, we use a variety of different data collected from the AS’s network. First, we collect BGP routes sent by peers at all peering locations and compare them to determine policies employed by peers. Next, we use IGP distance information and iBGP topology with the model described in Chapter 4 to determine the impact of non-canonical policies on routes selected at individual routers. Finally, we use aggregate traffic data to understand how non-canonical policies affect the flow of traffic to peers. For this analysis, routing data was collected from September 1, 2007 through January 14, 2008, and the traffic data was collected during the January 1 - 14, 2008 interval.

5.4.1 BGP Routes

In order to determine routes sent by peers, we ideally need to dynamically record routes as they are received at peering routers (and stored in RIB-ins). However, commercial routers do not provide any feature to enable this¹, and the only option is to wire-tap the peering links. For a variety of reasons this is not feasible, so we settle for periodic snapshots of RIB-in and RIB-pp (post-policy RIB) contents from all peering routers. This is achieved through an automated script that logs into peering routers one by one and executes the “show ip bgp” command to record all routes in the RIB-ins and RIB-pps. Since the script handles routers sequentially, snapshots from different routers are recorded at slightly different times. Accordingly, we must be careful not to classify inconsistent routes resulting from transient BGP dynamics as a non-canonical policy of a peer. When examining the implementation of policies (in Sections 5.5 and 5.7), we consider the average proportion of inconsistent routes over a large number of consecutive snapshots (at least four evenly spaced snapshots per day for 14 days). When we examine the impact of policies (in Section 5.6), we take an even more conservative approach, only considering stable prefixes, that is, only those prefixes for which we witness no changes (see Section 5.6.1).
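To illustrate why averaging over many snapshots dampens transient effects, the following minimal sketch computes the proportion of inconsistently announced prefixes per snapshot and then averages it. The data layout (a mapping from prefix to per-location route attributes, with `None` for a missing announcement) and the example values are assumptions made for this sketch, not the format used by our collection scripts.

```python
def inconsistent_fraction(snapshot):
    """Fraction of prefixes whose routes differ across peering locations.

    `snapshot` maps prefix -> {location: route attributes, or None if absent}.
    """
    if not snapshot:
        return 0.0
    inconsistent = sum(
        1 for per_location in snapshot.values()
        if len(set(per_location.values())) > 1
    )
    return inconsistent / len(snapshot)

def average_inconsistency(snapshots):
    """Mean inconsistent fraction over a sequence of consecutive snapshots."""
    return sum(inconsistent_fraction(s) for s in snapshots) / len(snapshots)

# Hypothetical attributes: (AS Path length, Origin, MED).
same, longer = (2, "IGP", 0), (4, "IGP", 0)
steady = {
    "192.0.2.0/24": {"loc1": same, "loc2": same},       # consistent
    "198.51.100.0/24": {"loc1": same, "loc2": longer},  # persistently unequal
}
transient = {
    "192.0.2.0/24": {"loc1": same, "loc2": None},       # briefly missing at loc2
    "198.51.100.0/24": {"loc1": same, "loc2": longer},
}
snapshots = [steady, transient, steady, steady]
average = average_inconsistency(snapshots)
```

In this toy input the persistently unequal prefix contributes to every snapshot, whereas the transient one contributes to only a single snapshot and is therefore diluted in the average.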

5.4.2 IGP Distance Information

The IGP distance to the egress router is used in step 7 to break the tie between routes that are equally good through the first six steps of the BGP decision process. As a result, we need the IGP distance from the router to all egress routers at any given time to determine the impact of peering policies in terms of route selection and traffic flow at any router. The AS we examine uses OSPF [76] as its IGP. We use OSPF Link State Advertisements (LSAs), collected using an OSPF monitor [97], to determine the distance between routers.

¹ There is a proposal [96] to add this feature to routers, but it is not implemented currently to the best of our knowledge.
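As a rough illustration of how IGP distances follow from link weights such as those carried in OSPF LSAs, the sketch below runs Dijkstra's shortest-path algorithm over a small weighted graph. The router names and weights are hypothetical, and the input format is an assumption for this example rather than the output of the OSPF monitor.

```python
import heapq

def igp_distances(links, source):
    """Shortest-path distance from `source` to every reachable router.

    `links` maps router -> {neighbor: link weight} (directed weights).
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, weight in links.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

# Hypothetical four-router topology with symmetric link weights.
links = {
    "r4": {"r3": 2, "r2": 5},
    "r3": {"r4": 2, "r2": 1, "r1": 7},
    "r2": {"r4": 5, "r3": 1, "r1": 3},
    "r1": {"r3": 7, "r2": 3},
}
distances = igp_distances(links, "r4")
```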

5.4.3 iBGP Topology Information

The Tier-2 AS under examination extracts and stores its router configuration files once a day. We use these files to determine the iBGP topology, i.e., the set of route reflectors, their clients, and interconnections between them. The iBGP topology allows us to determine how routes learned from peers propagate through the network (using the techniques described in Chapter 4).

5.4.4 Aggregate Traffic Data

To study the impact of peering policies on traffic, we use flow records collected by Netflow [21] on the peering routers of the AS. Each flow record consists of start and end times, source and destination IP addresses and port numbers, and number of bytes transferred. Due to high traffic volume, flow records are threshold sampled [25].

5.5 Analysis of Peering Policies

In this section we look at policies employed by peers. Our first aim is to see how many peers employ non-canonical policies. For peers that do employ non-canonical policies, we focus on two aspects of such policies: the techniques used for implementing the non-canonical policy and which peering links are preferred over others. For the analysis in this section, we focus on two weeks (January 1 - 14, 2008) of RIB-in data. Figure 5.5.1 shows the overall behavior of the peers during these two weeks.

[Figure 5.5.1 plot: x-axis: Day (1 to 14); y-axis: Proportion of peers (0 to 0.5); one line per threshold: 5%, 10%, 25%, 50%, 90%.]

Figure 5.5.1: Plot of the proportion of peers implementing a non-canonical peering policy for more than 5, 10, 25, 50 and 90 percent of their prefixes.

Each line in the figure shows what fraction of peers employ non-canonical policies for a given percentage of their prefixes. We can see that roughly 42% of peers employ non-canonical policies for 10% or more of their prefixes, and 23% of peers employ such policies for more than 90% of their prefixes. Thus, use of non-canonical policies is quite widespread amongst peers of this Tier-2 AS. As we have mentioned earlier, it is likely that these peers also employ similar policies for their other peers. MED filtering on peering routers reduces the number of peers with non-canonical policies to 22% and 6%, down from 42% and 23% respectively. We have seen similar behavior during other time intervals.

5.5.1 Policy Implementation Techniques

To implement a non-canonical policy, an AS announces unequal routes over its peering links to another AS. The routes can be made unequal in two primary ways: (i) by manipulating prefix announcements and (ii) by using different route attributes that affect the decision process (e.g. AS Path length, Origin, MED), on a subset of peering links. In Figure 5.5.2, we show the percentage of routes de-preferenced (scaled by the total number of possible routes de-preferenced) using each technique for some peers of the Tier-2 AS. Not only do we see the use of all the techniques we know of across peers, we also observe that several individual peers employ multiple techniques on different (prefix, peering link) pairs.

Figure 5.5.2: The techniques used by a subset of peers to de-preference routes between January 1 - 14, 2008.

Manipulation of Prefix Announcements

An AS can manipulate prefix announcements using two methods. The first is to not announce a prefix on some links. The local AS is then forced to choose from the subset of links where the announcement is made. The second method is similar to the first: the same address space may be announced on all peering links, but as different prefixes on each peering link. For example, consider Figure 5.2.1. AS A has three peering links to AS C. Let us assume that AS C would like to announce prefix 200.0.0.0/16 to AS A. In this case, AS C might announce the /16 prefix on peering links C1–A9 and C3–A7, and the two more specific prefixes 200.0.0.0/17 and 200.0.128.0/17 on link C2–A8. IP packets are always forwarded to the most specific prefix (i.e., the prefix with the longest mask) matching the destination address. Hence, when AS A wants to send traffic to any destination in the /16 prefix via AS C, it will be forced into using link C2–A8. Both the above scenarios can be detected by comparing the prefixes announced at each peering location. In Figure 5.5.2, we see that ASes 2, 4, 8 and 9 do not announce significant portions of their prefixes on some links. On the whole, we find that 13% of ASes failed to announce at least 5% of routes from January 1 through 14, 2008.
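The /16 versus /17 example above can be reproduced with Python's standard `ipaddress` module. The sketch below is illustrative only: the table layout, the helper names and the two test destinations are assumptions for this example. The first helper performs a longest-prefix match; the second compares the sets of prefixes announced on each link, which is the kind of comparison that exposes both scenarios.

```python
import ipaddress

def longest_prefix_match(destination, table):
    """Return the (prefix, link) entry with the longest mask covering `destination`."""
    address = ipaddress.ip_address(destination)
    matches = [(net, link) for net, link in table if address in net]
    return max(matches, key=lambda entry: entry[0].prefixlen)

def prefixes_missing_per_link(table):
    """For each link, list the prefixes announced elsewhere but not on that link."""
    all_prefixes = {net for net, _ in table}
    links = {link for _, link in table}
    return {
        link: all_prefixes - {net for net, l in table if l == link}
        for link in links
    }

# Announcements from the example in the text (link names as in Figure 5.2.1).
table = [
    (ipaddress.ip_network("200.0.0.0/16"), "C1-A9"),
    (ipaddress.ip_network("200.0.0.0/16"), "C3-A7"),
    (ipaddress.ip_network("200.0.0.0/17"), "C2-A8"),
    (ipaddress.ip_network("200.0.128.0/17"), "C2-A8"),
]

low = longest_prefix_match("200.0.10.1", table)    # lower half of the /16
high = longest_prefix_match("200.0.200.1", table)  # upper half of the /16
missing = prefixes_missing_per_link(table)
```

Both destinations resolve to link C2-A8, because one of the /17 prefixes always covers the destination with a longer mask than the /16.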

Attribute Manipulation

The second way of sending unequal routes for a prefix is to use different route attributes across peering links. In this case, some routes are dismissed as unattractive prior to the IGP distance step of the BGP decision process. Let us look at the attributes considered in steps 1-5 of the BGP decision process to see if, and how, they can be used for non-canonical policies².

² For clarity, we fold step 6 of the BGP decision process (prefer eBGP-learned over iBGP-learned routes) into the IGP distance step. These two steps combined can be thought of as ‘prefer the closest exit-point’.

First, consider the Next-Hop attribute. If a peer AS sends a route with a next-hop known to be unreachable by the local AS router, then the route will be discarded by the local AS. Hence, a peer AS could ‘appear’ to send a route whilst ensuring the route would never be selected. Data limitations prevent us from determining the reachability of next-hop IP addresses. However, on almost all occasions, the next-hop IP address was the IP address of the peer router announcing the route. Consequently, we assume all received routes have a reachable next-hop IP address for this study.

Second, consider the Local Preference attribute. This attribute is set locally by an AS. Consequently, even if a peer sends routes with different Local Preference values, it does not affect how the local AS selects its routes. As a result, this attribute is not useful for implementing non-canonical policies. The Community route attribute may be used by peering ASes to influence the Local Preference; however, this requires an import policy set by the local AS and would be part of a peering agreement. We do not consider this as part of a non-canonical policy for this study.

Next, consider the AS Path attribute, which represents the sequence of AS hops on the path to the destination prefix. By changing the length of the AS Path, an AS can affect the peering policy. Prepending one’s own AS number multiple times to the AS Path on a subset of links is a well-known technique to de-preference certain routes [16]. Doing so results in routes with longer AS Paths, which are discarded during step 3 of the decision process. In Figure 5.5.2, ASes 2, 4, 5, and 6 all de-preference a significant fraction of routes using longer AS Paths on some links. Overall, 4% of peers use this technique on at least 5% of their routes during the 14-day interval.

The Origin attribute is set by the originating AS to one of three values, in decreasing order of preference: IGP, EGP or INCOMPLETE [91]. However, it can be modified by an intermediate AS to make a route look more or less attractive to neighbor ASes. In other words, an AS can set the Origin to different values on a subset of peering links, thereby forcing its peer to discard some routes in step 4 of the decision process. We have not noted this technique in the research literature, but we found that 3% of peers modify the Origin of routes in our data. For example, in Figure 5.5.2, ASes 1, 3 and 9 modify Origin to de-preference some of their routes.

Finally, the MED parameter allows an AS to signal its preference of a peering link for traffic coming from its peer. The very purpose of MED is to realize non-canonical policies. However, the receiving AS can decide to reset the MED values as part of its import policies, and that is precisely what our Tier-2 AS does: it resets the MED to the same value across all its peering links. Thus, in our case, MED does not have its desired effect. However, it does allow us to know if a peer is trying to implement a non-canonical policy (albeit without success)³. This might also be indicative of a peer’s policy with their other neighboring ASes. We found that 29% of peers announce different MED values on their links even though it has no effect. In Figure 5.5.2, ASes 1, 3, 7 and 10 de-preference routes on some links using MED.
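One simple way to attribute a de-preferenced route to a technique is to find the first decision step at which it loses to the peer's best route for the same prefix. The sketch below is a hypothetical illustration of that idea, not the classification code used for Figure 5.5.2; the tuple layout and the Origin ranking are assumptions, and it presumes that `best` really is the most attractive route announced by the peer.

```python
# Lower rank = more preferred Origin (assumed encoding for this sketch).
ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}

def depreference_technique(best, other):
    """Name the first decision step at which `other` loses to `best`.

    Routes are (as_path_len, origin, med) tuples taken from RIB-in data;
    `other` is None when the prefix is not announced on that link.
    Returns None when the two routes are equally attractive.
    """
    if other is None:
        return "not announced"
    if other[0] > best[0]:
        return "AS Path"
    if ORIGIN_RANK[other[1]] > ORIGIN_RANK[best[1]]:
        return "Origin"
    if other[2] > best[2]:
        return "MED"
    return None

best = (2, "IGP", 0)
labels = [
    depreference_technique(best, None),
    depreference_technique(best, (4, "IGP", 0)),
    depreference_technique(best, (2, "INCOMPLETE", 0)),
    depreference_technique(best, (2, "IGP", 50)),
    depreference_technique(best, best),
]
```

Because MED is compared before any import policy resets it, a check of this kind has to be run on RIB-in attributes rather than post-policy ones.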

5.5.2 How Peering Links are Used

In the previous section we showed what techniques peers use to implement non-canonical policies. In this section, we look at what kind of preferential treatment peers accord to their peering links when undertaking a non-canonical policy. The preference, or lack of it, for different peering links leads to important insights into what goals a peer might be trying to achieve. We can identify five broad behavior modes for peers based on the preferential treatment accorded to some links versus others in terms of number of prefixes. Note that these behavior modes are not designed to precisely identify the motivation of peers (as we do not know ground-truth in this aspect). However, they do provide a high-level idea as to how peers use their peering links.

³ This is why it is necessary to look at RIB-ins instead of RIB-pps; by the time a route reaches RIB-pp, we would have lost its MED value.

[Figure 5.5.3 panels: (a) Canonical Mode; (b) Backup Link Mode; (c) Partially Fragmented Mode; (d) Island Mode; (e) Other Mode.]

Figure 5.5.3: Example of peers displaying different behavior modes. Each figure shows how prefixes are announced on different peering links of a peer. Black: best routes on the link. Gray: de-preferenced routes. The remaining portion of prefixes is not announced on the link.

We again look at our representative period (January 1 - 14, 2008) to demonstrate the number of peers exhibiting each mode. Since peering policies can change over time (as we will show later), we assign a peer a particular behavior mode if it exhibits that mode for more than an arbitrary 20% of snapshots in the interval (recall that behavior modes are simply indicative, not conclusive). Note that these modes are not mutually exclusive, and a peer can exhibit more than one mode (for different prefixes) even for a single snapshot. A summary of the behavior modes is shown in Table 5.5.1.

Our first behavior mode for peers is the Canonical Mode. In this mode, a peer announces equally attractive routes on each peering link. Transient route changes between snapshots can cause routes to appear less attractive on some links (see Section 5.4.1). Hence, instead of requiring 100% of routes to be equally attractive on all links, we use a 95% threshold. In our representative interval, we found that 79% of peers exhibit this mode. An example of canonical behavior is shown in Figure 5.5.3 (a).

The second behavior mode is the Backup-Link Mode. In this mode, a peer prefers a subset of links for most of their prefixes (95% or more). We denote such links as Primary Links. Furthermore, such a peer also has at least one other link where most prefixes are announced, but de-preferenced. We found that 8% of peers exhibit a backup mode. Figure 5.5.3 (b) shows a typical peer with this behavior mode. As the figure shows, this peer uses links 2 and 3 as primary, and links 1 and 4 as backup for a substantial fraction of prefixes.

In the third behavior mode, the Partially Fragmented Mode, a peer has at least one primary link. However, unlike the backup mode, some routes are unavailable on other links in this mode. Thus, this mode captures peers that have a portion of their prefixes unreachable on one or more links. We found that 10% of peers of the Tier-2 AS exhibit a partially fragmented peering mode. Figure 5.5.3 (c) shows an example of such a peer; this particular peer does not announce approximately 20% of prefixes on link 5.

The fourth mode is the Island Mode. A peer exhibiting this mode announces its prefixes uniquely across peering links. In other words, the peer’s set of prefixes can be divided into disjoint subsets (or islands) based on the peering link on which they can be reached. In this case, the local AS is likely to be used for transit between “islands”. We found that 3% of peers exhibit an island peering behavior. Figure 5.5.3 (d) shows an example of a peer with an Island behavior mode.

The final mode is a “catch-all” behavior mode: it covers any peer whose behavior cannot be identified. We call it the Other Mode. We found that 3% of peers exhibit this mode. All such peers have large fractions of their routes announced as most attractive on their links. However, no link can be described as primary. Figure 5.5.3 (e) shows one such peer exhibiting this mode.

Mode                        Peers
Canonical Mode              79%
Backup-Link Mode            8%
Partially Fragmented Mode   10%
Island Mode                 3%
Other Mode                  3%

Table 5.5.1: Summary of peer behavior modes. If a peer exhibited a mode for more than 20% of the interval January 1 - 14, 2008, we include the peer in the statistics for the mode. Note that peers can exhibit multiple modes during the analysis interval (hence the percentages add to greater than 100%).

5.6 Impact on Routing and Traffic

In the previous section we saw that several peers of the Tier-2 AS employ non-canonical policies. The magnitude of such policies in terms of the number of prefixes and routes, how a peer is using its peering links, and what techniques it is using to realize such policies are important information for a network operator. Equally important, though, is how such policies impact the routes selected by local routers and, ultimately, the traffic flow within the network. The impact on route selection depends not only on the policies employed by the peer, but also on the topology of the network and, in many cases, routes via other neighboring ASes. In regard to traffic, the impact also depends on the actual traffic being sent to the peer. In this section, we look at the impact of non-canonical policies on routing and traffic within the Tier-2 AS. We do this by predicting routing and traffic changes that would occur if a peer employed a canonical policy while all other peers’ behavior remains constant.

5.6.1 Dealing with Routing Dynamics

Before we go into the impact analysis, we need to deal with BGP and IGP changes that might happen between the RIB-in snapshots. These changes can affect route selection and traffic flow. Ideally, we would like to adjust routes and traffic flow in response to such changes. However, doing so complicates the analysis. Furthermore, our traffic data is highly sampled (40Mb is the finest granularity of traffic) and aggregated across both space and time. Hence, we need to examine an extended interval of time to obtain an indication of the impact of a peer’s policy. As we collect RIB snapshots at approximately 6-hour intervals, we choose a 12-hour interval including three snapshots as our ‘snapshot interval’. Within the snapshot interval, we only consider stable prefixes. We define a prefix to be stable if:

1. no BGP updates are observed for this prefix from any router at the BGP monitor for the interval encompassing the recorded timestamps of all RIB snapshots in the 12-hour interval.

2. at each peering location all three RIB-in snapshots have the exact same route present (or absent).
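The two conditions above can be expressed as a short predicate. The following is a minimal sketch under assumed data structures (snapshots as nested dictionaries, update timestamps as plain numbers); the names `is_stable`, `update_times` and `window` are hypothetical and do not correspond to identifiers in Peer Dragnet.

```python
def is_stable(prefix, snapshots, update_times, window):
    """Check both stability conditions for one prefix.

    snapshots: RIB-in snapshots in the interval, each mapping
               location -> {prefix: route}; a missing prefix means no route.
    update_times: prefix -> timestamps of BGP updates seen at the monitor.
    window: (start, end) spanning the timestamps of all snapshots.
    """
    start, end = window
    # Condition 1: no BGP updates for the prefix inside the window.
    if any(start <= t <= end for t in update_times.get(prefix, [])):
        return False
    # Condition 2: the same route (or absence) at each location in all snapshots.
    locations = set().union(*(s.keys() for s in snapshots))
    for location in locations:
        seen = {s.get(location, {}).get(prefix) for s in snapshots}
        if len(seen) > 1:
            return False
    return True

# Hypothetical routes as (AS Path length, Origin, MED) tuples.
r, r2 = (2, "IGP", 0), (3, "IGP", 0)
snap = {"loc1": {"p": r, "q": r}, "loc2": {"p": r}}
changed = {"loc1": {"p": r, "q": r2}, "loc2": {"p": r}}
snapshots = [snap, snap, changed]
updates = {"z": [5]}

stable_p = is_stable("p", snapshots, updates, (0, 10))  # unchanged everywhere
stable_q = is_stable("q", snapshots, updates, (0, 10))  # route changed at loc1
stable_z = is_stable("z", snapshots, updates, (0, 10))  # update seen in window
```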

Removing unstable prefixes from the impact analysis results in a daily median of 22% of all peers’ prefixes being ignored over the interval 1 - 14 January 2008. In addition to removing unstable prefixes, we also ignore prefixes which are filtered by import policy on the Tier-2 AS routers. This constitutes a further 1% of prefixes. Determining the impact on route selection and traffic also requires knowing IGP distances between a router and egress routers for BGP routes, since IGP distance is used in step 7 of the decision process. Just like BGP routes, the IGP distances can also undergo changes in between snapshots due to events internal to the network. However, we are aiming to study the impact of peering policy, not the impact of IGP changes, which has already been considered [111]. We analyzed OSPF (the IGP used by the Tier-2 AS) and found that, in every inter-snapshot interval, less than 1% of router pairs changed distance. Furthermore, we found that all of these changes were ephemeral, with all distances modified for a maximum of 11 minutes in any 12-hour interval before returning to the original IGP distance. As a result, we assume that the distances remain constant over every 12-hour interval we analyzed.

5.6.2 Routing Impact

We define the routing impact as the number of (prefix, router) pairs which change their egress selection due to the non-canonical peering policy. To ascertain the current routing impact, we must first determine the decisions routers would have made under a canonical peering policy and then compare them to the decisions made under the actual policy.
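Given the two sets of decisions, the definition above reduces to a simple count. The sketch below assumes each scenario is available as a mapping from (prefix, router) to the selected egress; the mapping layout and the example identifiers are hypothetical.

```python
def routing_impact(actual, canonical):
    """Count (prefix, router) pairs whose egress differs between two scenarios.

    Both arguments map (prefix, router) -> selected egress; only pairs
    present in both scenarios are compared.
    """
    shared = actual.keys() & canonical.keys()
    return sum(1 for pair in shared if actual[pair] != canonical[pair])

# Hypothetical decisions for two prefixes and two routers.
actual = {("p1", "r1"): "A1", ("p1", "r2"): "A1", ("p2", "r1"): "A3"}
canonical = {("p1", "r1"): "A3", ("p1", "r2"): "A1", ("p2", "r1"): "A3"}
impact = routing_impact(actual, canonical)
```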

Determining Routes under a Canonical Policy

Recall that a canonical peering policy is simply announcing equally attractive routes on each peering link. For each peer and prefix pair, we find the best (post-policy) route announced across all links. We then perform a “what-if” analysis, the goal of which is to determine what decisions routers would make if these routes were actually sent across all links.

Consider the example shown in Figure 5.6.1. Three available routes are injected. The shading of each router represents the route chosen by that router. In Figure 5.6.1(a) we see the available information from route-monitors in the AS (a subset of all routers). We then use the model presented in Chapter 4 to estimate the decision of the routers where no monitor is available (Figure 5.6.1(b)). Next, we take the most attractive route announced by the peer (whose policy is being investigated) across all peering locations (from post-policy RIB snapshots) and assume the same route attributes were announced at every peering location for the prefix from the peer. For this example, route D, which would be available under a canonical peering policy, is shown in Figure 5.6.1(c). In Figure 5.6.1(c) we also see the routers that would alter their decisions as a result of route D’s availability. Notice that one router changes its chosen route from route A to route C as a result of the newly available route D. We explain this phenomenon later in the chapter.

During the “what-if” analysis, the network solver often ends up with multiple equally good routes at a router after the IGP distance step of the BGP decision process (as shown with the blue/red router in Figure 5.6.1(d)). When faced with such a scenario, we select the route used under the non-canonical policy if this route is one of the candidate routes. Otherwise, if that route is not among the candidates, we select a random route (for instance the router with route C and D). As the AS we examined used the oldest-route tie-break, this strategy makes our estimate of routing impact conservative in the sense that we under-estimate the actual routing impact.

Once the model provides us with its output (the selection of a router under a canonical policy), we compare it with the route selected by the router under the actual (non-canonical) policy and note any differences. Using the network solver, we can determine the extent of the impact each peer’s policy has on every router in the network. The analysis above is undertaken for all stable prefixes announced inconsistently by any peer. We believe having this information allows an operator to perform a very detailed analysis of how peers’ non-canonical policies impact routing decisions. Such information can be used for quantifying peers’ policies within a peering agreement for future negotiation of a new agreement or for determining the cost of a peer acting outside of its current agreement.
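Two ingredients of the “what-if” analysis described above lend themselves to a short sketch: replicating the peer's most attractive route on every link, and the conservative tie-break. The code below is a simplified illustration under assumed data structures (attribute tuples where a lower tuple compares as more attractive); it does not reproduce the network solver of Chapter 4, and the function names are hypothetical.

```python
import random

def canonical_announcements(routes_by_link):
    """Replicate the peer's most attractive route on every peering link.

    routes_by_link maps link -> attribute tuple (lower compares as better),
    or None where the prefix is not announced on that link.
    """
    best = min(r for r in routes_by_link.values() if r is not None)
    return {link: best for link in routes_by_link}

def break_tie(candidates, current, rng=random):
    """Prefer the currently selected route; otherwise pick one at random."""
    if current in candidates:
        return current
    return rng.choice(sorted(candidates))

# Hypothetical attributes: (AS Path length, Origin rank, MED).
observed = {"link1": (2, 0, 0), "link2": (4, 0, 0), "link3": None}
what_if = canonical_announcements(observed)

kept = break_tie({"A", "C"}, current="A")    # current route still a candidate
picked = break_tie({"C", "D"}, current="A")  # current route no longer tied
```

Keeping the current route whenever it remains a candidate means a tie never counts as a routing change, which is why the resulting impact estimate errs on the low side.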

Routing Impact Results

We now present aggregated results to provide important insights. First, we show how a particular peer’s non-canonical policy affects various routers of the AS for a single snapshot, i.e., a 12-hour period. The solid line in Figure 5.6.2 shows the percentage of route changes as a Cumulative Distribution Function (CDF). We observed similar results for other snapshots, which is not surprising since the peer’s policy did not change substantially over the two-week period. The figure shows that the peer’s policy has a different impact on different routers. Roughly 19% of routers are completely unaffected by the peer’s policy; for the remaining 81% of routers, the proportion of affected routes falls in a narrow range of 34% to 38%. The dash-dot line in Figure 5.6.2 shows a different peer with an Island behavior. For this peer, all routers would change routes for about 66% of their prefixes with a canonical policy.

Figure 5.6.1: The impact of non-canonical peering policy. Arrows represent available routes. Circles represent routers in an AS. The shading of a circle indicates the route selected by the router (legend: Route A, Route B, Route C, Route D, Route Unknown). (a) Monitor selections and routes available under current policy. (b) Router decisions predicted by technique described in Chapter 4. (c) Router decisions predicted under canonical policy. (d) Several routers may have a tie-break decision under canonical policy. We choose the route selected under the current policy when possible.

Figure 5.6.2: A cumulative distribution function representing the impact of two peers’ non-canonical policy on all routers of the Tier-2 AS on January 2, 2008. Series: Peer 1, Peer 2. x-axis: X% of prefixes. y-axis: proportion of routers with < X% of prefixes changing routes.

Next, we calculate the percentage of total route changes across all stable prefixes and all routers for all peers with a non-canonical policy over all 12-hour snapshots in the interval January 1 - 14, 2008. Figure 5.6.3 presents the results as a Complementary Cumulative Distribution Function (CCDF). The figure shows that the impact of non-canonical policies varies widely across peers. About 27% of peers with non-canonical policy affect more than 10% of route selections, and 7% of peers affect more than half of all router decisions in the AS.
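The per-router statistic behind the CDF can be sketched as follows. The dictionaries mapping (router, prefix) pairs to selected routes are a hypothetical data layout, and the function names are illustrative only. Note that this sketch uses the usual "less than or equal" convention for the empirical CDF, whereas the figure's axis is labelled with a strict inequality.

```python
from collections import defaultdict

def percent_changed_per_router(actual, canonical):
    """actual/canonical: dict mapping (router, prefix) -> selected route."""
    total = defaultdict(int)
    changed = defaultdict(int)
    for (router, prefix), route in actual.items():
        total[router] += 1
        # A missing what-if prediction is treated as "no change".
        if canonical.get((router, prefix), route) != route:
            changed[router] += 1
    return {r: 100.0 * changed[r] / total[r] for r in total}

def empirical_cdf(values):
    """Return sorted (x, F(x)) points with F(x) = fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)
    cdf = {}
    for i, x in enumerate(ordered):
        cdf[x] = (i + 1) / n  # later duplicates overwrite earlier ones
    return sorted(cdf.items())

actual = {("r1", "p1"): "A", ("r1", "p2"): "A",
          ("r2", "p1"): "B", ("r2", "p2"): "B"}
canonical = {("r1", "p1"): "D", ("r1", "p2"): "A",
             ("r2", "p1"): "B", ("r2", "p2"): "B"}
pct = percent_changed_per_router(actual, canonical)
print(empirical_cdf(pct.values()))  # [(0.0, 0.5), (50.0, 1.0)]
```

The CCDF used for the per-peer results is simply one minus these CDF values.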

Possible Routing Impact

The impact of a peer’s behavior can be dampened due to the presence of routes via other ASes (peers or customers). Consequently, another important viewpoint that would be of use to an operator is the impact a peer could possibly have given other routes are unavailable (which may occur in the future or under a failure scenario). To determine this “possible routing impact” metric for a peer, we remove routes from all other ASes. For example, in Figure 5.6.4(a) we remove route B from the example in Figure 5.6.1(a), which we assume came from a different peer to the one under investigation. The model from Chapter 4 is used to predict the routes selected under this scenario (non-canonical policy with the red route removed). We continue by once again assuming a canonical policy is employed, and route D becomes available. The iBGP model is used to predict route selections under this scenario. Differences in router decisions are once again analyzed. We term this impact the ‘possible routing impact’.

Figure 5.6.3: A CCDF showing the proportion of decisions affected by non-canonical policies of peers on January 1-14, 2008. Only peers with non-canonical peering policy are shown.

Figure 5.6.5 shows the possible routing impact for the two example peers examined in Section 5.6.2. The prefixes announced by Peer 2 are not announced by any other AS. Consequently, the possible routing impact is identical to the current routing impact. However, the prefixes announced by Peer 1 are announced by other ASes. Consequently, the routing impact is significantly increased when the routes from other ASes are removed. We see in Figure 5.6.5 that the proportion of affected prefixes for 81% of routers goes from 35-38% (in Figure 5.6.2) to almost 100% once routes from other ASes are removed. This figure amply demonstrates that peers’ non-canonical policies can have a much bigger impact when other ASes are not present to hide some of the routes. Furthermore, as Figure 5.6.3 demonstrates, this is not a one-off scenario; in fact, most peers benefit from the presence of other neighbors. The number of peers whose policies affect more than half of their (router, prefix) pairs almost doubles to 13% in the absence of other peers.

Figure 5.6.4: The possible impact of a peer’s policy in the absence of routes from other ASes. In this example, route B from Figure 5.6.1 is no longer available (as we assume it is from a different AS). Initially, we predict the selection of routers under the non-canonical policy in (a), before introducing route D in (b) and determining which routers would modify their decision. (a) Router selections during non-canonical policy. (b) Router selections during canonical policy. Legend: Route A, Route B, Route C, Route D, Route Unknown.

When Good Routes Go Bad

We close this section with an interesting observation about the quality of routes chosen, in terms of IGP distance, when we perform the what-if analysis with canonical policies. As we are purely adding additional routes into the network, the expectation is that every router would end up with the same or better routes. Surprisingly, this does not always hold true. Although counter-intuitive, this phenomenon is caused by the interaction between the iBGP hierarchy and the IGP. We have discovered this circumstance in our study, although it is very rare: less than 0.01% of the decisions analyzed experienced this phenomenon. In this section, we illustrate the phenomenon with an example to provide more insight into the circumstances that lead to its occurrence. Note that our aim is not to work out the theoretical underpinnings for the exact circumstances; we leave that to future work.

Figure 5.6.5: The impact of the non-canonical peering policy of the two peers from Figure 5.6.2 when routes from other ASes are unavailable. The impact of Peer 2’s policy is unaffected by the availability of routes from other ASes. The impact of Peer 1’s policy is significantly greater when routes from other ASes are unavailable. Series: Peer 1, Peer 2. x-axis: X% of prefixes. y-axis: proportion of routers with < X% of prefixes changing routes.

Before we delve into the example, recall that a route-reflector hierarchy is set up to reduce the number of iBGP sessions required within an AS (see Chapter 2). However, such a hierarchy results in a lack of visibility into all available routes, since route-reflectors only send the routes they select for themselves to clients, not the inferior ones. When route-reflectors select their best routes, they do so from their own point of view. Unfortunately, the selected and propagated route may not be the optimal one from their clients’ points of view. This is the crux of the problem that results in the selection of worse routes.

Figure 5.6.6: Example of “when good routes go bad” phenomenon. The addition of a better route at router 5 causes router 6 to select a worse route than before. (a) Original configuration. (b) Addition of new route. Legend: IGP, iBGP, Route A, Route B, Route C.

Consider the example in Figure 5.6.6. Suppose two routes (route A and route B) are the only choices for routers within the AS to select. All routers in the network will select their best egress from their candidate routes based on the shortest IGP distance.

Let us consider the decision made at router 6. A maximum of two unique egress locations will be learned via its route-reflectors (router 1 and router 2). Let us assume that router 2 selects route A, and router 1 selects route B. Hence, router 6 has two candidate routes and selects the egress link with the smallest IGP distance: route B in this case.

Now let us assume a canonical policy introduces route C at router 5. In this case, router 6 can still only learn of two possible routes via its route-reflectors: router 1 now chooses route C as its best route, whereas router 2 sticks to its original choice of route A. Router 6 will now select route A, as it has the smallest IGP distance of the two available routes. In comparison to the route selected during the non-canonical peering policy, this route is worse as it has a larger IGP distance.
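The example can be reproduced with a small simulation. The IGP distances below are hypothetical values chosen only so that the selections match the narrative (the distances in Figure 5.6.6 are not reproduced here), and the helper names are illustrative.

```python
def best_by_igp(router, routes, igp):
    """Return the candidate route with the smallest IGP distance."""
    return min(routes, key=lambda r: igp[router][r])

def client_choice(client, reflectors, available, igp):
    """A client only sees the best route of each of its route-reflectors."""
    learned = {best_by_igp(rr, available, igp) for rr in reflectors}
    return best_by_igp(client, learned, igp)

# Hypothetical IGP distances from each router to the egress of each route.
IGP = {
    1: {"A": 9, "B": 4, "C": 1},
    2: {"A": 2, "B": 6, "C": 5},
    6: {"A": 6, "B": 3, "C": 8},
}

before = client_choice(6, [1, 2], {"A", "B"}, IGP)
after = client_choice(6, [1, 2], {"A", "B", "C"}, IGP)
print(before, IGP[6][before])  # B 3
print(after, IGP[6][after])    # A 6
```

Route B is still the closest egress for router 6 after route C appears, but router 1 no longer advertises it, so router 6 never gets the chance to select it.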

5.6.3 Traffic Impact

In this section we quantify the impact of the non-canonical peering policy on the traffic flows inside the Tier-2 AS. The non-canonical policies impact traffic coming from customer edge routers and leaving over peering links. To perform the traffic impact analysis, we would ideally like to know flows coming in at all customer edge routers. However, the Tier-2 AS does not collect flow data from customer edge routers. Rather it collects them (using Netflow) only at peering locations of the network as mentioned in Section 5.4.4. In particular, we focus on the flow data for the out-bound traffic on the peering routers. Given an out-bound flow record, we need to determine two facts:

1. the ingress router of the flow, i.e., which (customer edge) router this flow used for entering the network; and

2. what peering router the flow would use for leaving the network under a canonical policy.

Next, we describe how we determine these two facts and follow it up with the results showing the traffic impact.

Finding the Ingress Router of a Flow

As mentioned earlier, Netflow records provide us with the quantity of traffic leaving the network on a particular peering link, for instance, the link A1—B1 in Figure 5.6.7 (a). However, the record does not contain any information about the router at which the traffic entered the network. Consequently, we infer the ingress router based on various other fields in the flow record. We use the techniques described by Feldmann et al. [32] to infer candidate ingress routers for flows. Many customers connect at a single location and have their own non-overlapping address space. Hence, given a valid source IP address, we are able to determine the ingress router for these flows without ambiguity⁴. In the example shown in Figure 5.6.7(b), the flow record collected on the link A1—B1 indicates that the traffic came from AS E and entered the network at router A3.

Finding the Egress Link under the Canonical Policy

Once we know the ingress router of a flow, we need to determine its egress router under the canonical policy of the peer. In order to do this, we perform the what-if analysis described in Section 5.6.1 for three inputs: the ingress router, the longest matching prefix in the ingress router’s Loc-RIB for the destination IP address of the flow, and the peer in question. Unfortunately, our OSPF data is incomplete (due to multiple OSPF areas in the AS). Hence, for some ingress routers we were unable to determine which route was selected by the BGP decision process. Specifically, OSPF distance data for 34% of all routers in the AS was unavailable during the January 1 - 14 period, a majority of these routers being customer edge routers. Hence, we perform our analysis at the Point-of-Presence (PoP) level instead of the individual router level⁵. Furthermore, we were fortunate to have OSPF distance information for at least one router in most PoPs. These two things allowed us to simulate the decision process for at least one router in most PoPs and use the results as representative of the entire PoP. When a representative router is unavailable, we are unable to predict any impact.

⁴ In the Tier-2 AS under examination, ambiguity rarely occurred. However, if it were to occur, Feldmann et al. [32] randomly select one of the set of candidate ingress routers. We could further limit the set of available ingress routers by ensuring all candidate ingress routers selected the egress router where the flow was monitored (a route-monitor on the ingress routers or our technique in Chapter 4 could provide this information).

⁵ We checked this analysis was a suitable approximation by comparing the decisions of routers in each PoP for which we had IGP distances. We found 99.1% of routers within a PoP select the same route, justifying our PoP level analysis.

Figure 5.6.7: Traffic Impact: Finding the ingress router of a flow. Blue solid arrow represents known traffic flow before and after ingress router is inferred. Thickness of links between ASes represents relative attractiveness. (a) Outbound traffic recorded on link A1—B1. (b) Source of traffic inferred from source IP of flow.

Figure 5.6.8: Traffic Impact: Finding the egress link under a canonical policy. Dashed orange arrow shows predicted traffic flow under a canonical peering policy. The thickness of links between ASes represents relative attractiveness. (a) Ingress PoP candidate routers for analysis. (b) Traffic shift under canonical peering policy.

Although the above analysis demonstrated that almost all routers in a PoP selected the same egress router for all measurements, we are careful when choosing a representative router from a PoP. For instance, in Figure 5.6.8(a) we know router A3 egresses from router A1 as this is the location of the Netflow measurement. The remaining three routers in the ingress PoP are candidates which could be chosen as representative of the PoP as we have enough IGP data to infer their decisions. Routers A4 and A5 choose to egress via A1 while router A6 would choose to egress via A7 for the traffic flow. Hence, we exclude A6 from the set of candidate routers as it made a different decision to A3. A4 is chosen as representative of this PoP as it made the same initial decision as A3 (A5 could also have been selected). Now, with the information gained from the routing analysis in Section 5.6.2, we find that A4 modifies its decision to egress via A2 if AS B employs a canonical policy. Thus we can determine the traffic shift that occurs due to a non-canonical peering policy (Figure 5.6.8 (b)).
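The selection of a representative router for an ingress PoP can be sketched as below. The router and egress names (`R1` to `R4`, `E1`, `E2`) and the function `pick_representative` are hypothetical; the sketch only assumes that an inferred current egress is available for those routers with sufficient IGP data.

```python
def pick_representative(pop_routers, observed_egress, current_egress):
    """Choose a router to stand in for an ingress PoP.

    pop_routers: routers in the ingress PoP, in preference order.
    observed_egress: egress router where the flow was measured.
    current_egress: dict router -> egress inferred under the current
        policy; routers without IGP data are simply absent.
    Returns a router name, or None when no representative exists.
    """
    for router in pop_routers:
        # Only routers that agree with the measured egress are eligible.
        if current_egress.get(router) == observed_egress:
            return router
    return None

# Hypothetical PoP: R1 is the ingress router but has no IGP data.
egress_now = {"R2": "E1", "R3": "E1", "R4": "E2"}
print(pick_representative(["R1", "R2", "R3", "R4"], "E1", egress_now))  # R2
```

Returning `None` corresponds to the case where no impact can be predicted for the PoP.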

Traffic Impact Results

Once we can determine the traffic from ingress PoP to egress links for every peer and prefix under both canonical and non-canonical policies, we can answer these three questions:

1. What path through the network does traffic currently transit?

2. What path would be used if a canonical peering policy was used?

3. What quantity of traffic is affected?

Answering these questions allows an operator to determine the impact of the non-canonical policy in terms of traffic. This information can then be used to develop various traffic-related metrics representing the cost paid by the AS for every peer’s non-canonical policy. One such simple metric is “byte-route-miles”, which indicates the quantity of traffic being carried extra distance. In Figure 5.6.9, we present the traffic impact results for the Tier-2 AS under examination. As an example, we show the proportion of traffic that would shift egress from a particular egress link to alternate links given a canonical peering policy. During a representative 12-hour interval, we see that all but two PoPs (PoP numbers 29 and 52) would modify their egress, and a total of 85% of traffic leaving the AS on this peering link would modify its route given a canonical peering policy. Due to IGP data limitations, we are unable to determine the decision made by any router in PoP 16. Consequently, we are unable to determine if traffic entering the network at this ingress PoP would change egress. This traffic represents only 2% of traffic egressing via the peering link under examination.

We do not present summary statistics for the total traffic affected by non-canonical peering policies as the Tier-2 AS did not record such data during the analysis interval. However, the success of this analysis has since resulted in the AS recording traffic data on all peering links.

Figure 5.6.9: For a representative peer, the shift of traffic that would occur for various ingress PoPs if the peer were to use a canonical peering policy. Legend: Original Link, Alternate Link 1, Alternate Link 2, Unknown Link. x-axis: PoP ID. y-axis: Percentage of Traffic.

5.7 Dynamics of Peering Policies

In previous sections, we showed how several peers engage in non-canonical policies. How often do these peers change their policies? That is the question we examine here. In particular, we are interested in how they change the relative preferences of their links. We do not restrict our attention to looking for changes from canonical to non-canonical policies, because non-canonical policies are not a priori “bad”, but a deviation from an agreed upon policy might be. Figure 5.7.1 shows a particular peer that changes its policy. The figure shows peering link preferences of the peer over a five month period. For the first eleven weeks, the peer employs a canonical policy except for a few periods lasting some days where prefixes are spread almost disjointly over the links. This may have been an attempt to temporarily route traffic around a part of its network. However, after the eleventh week, the peer starts employing a non-canonical policy on a regular basis. The policy undergoes changes at least three times over the remaining nine weeks; these change points are marked by stars in the figure.

Figure 5.7.1: Policy changes for one peer during interval September 1, 2007 - January 14, 2008. Black: Best route available on this link. Gray: No route available on this link. White: Missing Data. Stars represent times when the peer altered their policy.

5.7.1 Policy Change Detection Algorithm

Detecting policy changes manually is cumbersome, error-prone and does not scale well, especially when an AS has a substantial number of peers. In this section, we present an automated approach for systematically detecting policy changes. We ignore MEDs because our Tier-2 AS ignores them.

Suppose a peer P’s set of peering links is denoted by $K$. Let $x_i^k \in [0, 1]$ be the proportion of prefixes whose best route is advertised through link $k \in K$ during the $i$th measurement interval. Note that $x_i^k = 1$ for all $k$ for a canonical policy. We first apply a simple median filter to the data, i.e., $\hat{x}_i^k$ is the median of the $N$ measurements preceding the $i$th interval. Median filters are ideal for removing the gaps caused by missing data, and removing the effect of the short transient changes that often occur during reconfiguration or as a result of BGP dynamics. Given a window of width $N$, we can handle $\lfloor (N-1)/2 \rfloor$ missing snapshots (out of $N$).

The next component of the algorithm is aimed at detecting changes or level shifts in the data. There are various possible approaches to this type of problem, but we note that the policy changes of interest are big. We are most interested in changes affecting greater than 10% of (prefix, location) pairs. Also, we know that most changes are implemented by people, and so may take place over the course of hours, up to days. Because of this, we need to be able to detect a level shift over the course of several snapshots. The simplest approach to this problem is to compare the current values of $\hat{x}_i^k$ to the historical values over some previous time period. We can perform this test by taking a simple exponentially weighted moving average:

$$ y_i^k = \alpha \hat{x}_i^k + (1 - \alpha)\, y_{i-1}^k, \quad i \geq N; $$

and performing the comparison:

$$ \sum_{k \in K} \left| y_i^k - \hat{x}_i^k \right| > T? $$

The algorithm declares that the peer has changed its policy if the difference exceeds the threshold $T$. The choice of $\alpha$ is dictated by the measurement interval (six hours in our case) and the interval over which we believe a change may occur. The parameter $T$ is important here, as it determines the sensitivity of the algorithm. The situation here is, however, different from a typical anomaly detection problem. There are no false alarms, regardless of the value of $T$. All the detected changes are real changes in policy. $T$ is simply used to filter the more important changes and so the value is chosen based on operator criteria.
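The median filter, moving average and threshold test can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function name, the dictionary layout and the initialization of the moving average at the first filtered value are assumptions, and missing snapshots are assumed to appear as 0 entries in the input.

```python
from statistics import median

def detect_policy_changes(series, n=5, alpha=0.25, threshold=0.1):
    """Flag snapshots where a peer's link preferences shift.

    series: dict link -> list of x values (proportion of prefixes whose
        best route is on that link), one per snapshot.
    Returns the snapshot indices i at which the test exceeds threshold.
    """
    length = len(next(iter(series.values())))
    ewma = {}
    alarms = []
    for i in range(n, length):
        total = 0.0
        for link, xs in series.items():
            # Median of the n measurements preceding snapshot i.
            x_hat = median(xs[i - n:i])
            # Assumption: the EWMA starts at the first filtered value.
            prev = ewma.get(link, x_hat)
            ewma[link] = alpha * x_hat + (1 - alpha) * prev
            total += abs(ewma[link] - x_hat)
        if total > threshold:
            alarms.append(i)
    return alarms

shift = {"a": [1.0] * 10 + [0.0] * 10, "b": [1.0] * 20}
spike = {"a": [1.0] * 10 + [0.0] + [1.0] * 9, "b": [1.0] * 20}
print(detect_policy_changes(shift)[0])  # 13
print(detect_policy_changes(spike))     # []
```

In the first series the level shift begins at snapshot 10 and is first flagged at snapshot 13, once the median window contains a majority of post-change values; the isolated spike in the second series is removed by the median filter and never raises an alarm.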

Parameter Selection

The policy change detection algorithm requires three parameters: N, T and α. N is used to smooth the data to limit the effects of dynamics during the snapshot collection and to deal with missing data. T represents the magnitude of policy change of interest, and α determines the speed at which a policy change can be detected. Peering policies are generally manually configured by network administrators. Hence, they change infrequently (detecting policy changes made by automated tools is left for future work). Further, policy changes are unlikely to be gradual; instead, they are large shifts in available routes from one peering location to another⁶. Consequently, the detection of a level shift is quite insensitive to the actual parameters selected. Firstly, N is chosen to minimize the effects of ‘spikes’ in the data. Such spikes occur due to missing data (appearing as 0 table entries) or routing dynamics during the table recording process, neither of which indicates a change to policy. We choose N = 5 for this purpose, allowing us to minimize the effect of up to two anomalous data points per peering location. If a ‘real’ level shift occurs, the median filter will delay its detection by two snapshots, which is within operator requirements. Secondly, T is chosen based on operator requirements. A value of 0.1 is selected as it represents a 10% change in (prefix, location) pairs. Thirdly, α determines the speed at which a policy change can be detected. We analyze several values of α. The results of this analysis are shown in Table 5.7.1. We see that lower values of α cause longer policy detection times than higher values. Also, large policy changes are detected quickly for any value of α. Policy changes affecting under 10% of routes are not detected.

⁶ Note that gradual changes can be detected by the infrequent analysis of all peering policies.

Table 5.7.1: Number of snapshots before the policy change detection algorithm identifies a policy change (T = 0.1).

                       Actual Policy Change Magnitude
α      0.0  0.1  0.101  0.11  0.12  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
0.01   ∞    ∞    460    239   179   69   41   29   23   19   16   14   12   11
0.1    ∞    ∞    44     23    18    7    4    3    3    2    2    2    2    2
0.25   ∞    ∞    17     9     7     3    2    2    1    1    1    1    1    1
0.5    ∞    ∞    7      4     3     2    1    1    1    1    1    1    1    1
0.75   ∞    ∞    4      2     2     1    1    1    1    1    1    1    1    1
0.9    ∞    ∞    3      2     1     1    1    1    1    1    1    1    1    1
0.99   ∞    ∞    2      1     1     1    1    1    1    1    1    1    1    1

Note that the median filter causes the detection of policy changes to be delayed by two snapshots. Operator requirements are such that policy changes of interest are detected within 24 hours. Using an α value of 0.25 allows policy changes affecting more than 20% of (prefix, location) pairs to be detected within this time frame (five snapshots).
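The finite entries of Table 5.7.1 are consistent with one simple reading, sketched below: count the snapshots until an exponentially weighted moving average that starts at 0 has moved by more than T after the input steps from 0 to the given magnitude. This reading is an assumption on our part (the text does not state how the table was generated), and the function name is illustrative. The two-snapshot delay of the median filter is not included.

```python
import math

def snapshots_to_detect(magnitude, alpha, threshold=0.1, limit=10_000):
    """Snapshots until an EWMA starting at 0 exceeds `threshold` after the
    input steps from 0 to `magnitude`; math.inf if that never happens."""
    if magnitude <= threshold:
        return math.inf  # the EWMA can never move further than the step
    y = 0.0
    for n in range(1, limit + 1):
        y = alpha * magnitude + (1 - alpha) * y
        if y > threshold:
            return n
    return math.inf

print(snapshots_to_detect(0.2, 0.25))    # 3
print(snapshots_to_detect(0.101, 0.01))  # 460
```

With α = 0.25 and a change of magnitude 0.2, the three snapshots computed here plus the two-snapshot median delay give the five snapshots quoted above.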

Policy Change Detection Results

We applied the detection algorithm over the interval from September 1, 2007 through January 14, 2008 across all peers. We found that 34% of peers made significant changes to policies during this period, but the majority of these were simply the addition or removal of one or more peering links. Only 9% of all peers made a significant change unrelated to changes to the underlying links. The average number of changes for these peers is two, corresponding to a rate of roughly one change every 68 days. Although this suggests a very slow rate of significant change in peering policies, there exists a small minority of peers that are more active, such as the peer shown in Figure 5.7.1. Peering agreements are generally static, and consequently, these peers may be of particular interest to a network operator. Common changes in policy are often associated with active traffic engineering and so this may indicate potential load issues on the edge of a peer’s network.

We have now investigated the policies of all peers of a Tier-2 AS, classified them based on inferred motivation, estimated the impacts of non-canonical policies on the AS, and detected changes over time of such policies, all of which are useful to gain an overall picture of peering policies. However, operators of ASes also require a clear representation of which peers are acting outside their agreements, which prefixes are affected and how they are implementing their policies. The next section deals with the implementation of the prototype tool available to network operators to easily investigate their peering relationships.

5.8 Operational Peer Dragnet

The main motivation of the Peer Dragnet project is to create a tool easily used by operators to monitor peers’ policies. Unlike researchers who may be interested in the overall behavior of all peers, operators are interested in the specific behavior of individual peers connected to their network. This is reflected in the tool. For ease of use, we developed a web front-end that allows an operator to view summaries of all peers’ announcements with unusual policies highlighted (see Figure 5.8.1). Also included in Figure 5.8.1 is information regarding peers’ prefixes which fall into categories that might be undesirable, such as private address space and prefixes with mask lengths longer than /24 or shorter than /8. Cells are highlighted when more than 5% of prefixes or comparisons ((prefix, snapshot, location) triplets) are de-preferenced. Examination of the high level summary of peers’ announcements provides a good overview of peers of interest. However, it does not provide a detailed view of the techniques and preferences of a peer. Consequently, Peer Dragnet provides the means to delve into the detail of an individual peer. For each peering location, we display the preference for each prefix announced by a peer. Consider Figure 5.8.2 and its legend in Figure 5.8.3. A subset of the prefixes announced by Kangaroo Corp⁷ is shown. Each horizontal bar represents a prefix and its relative attractiveness compared to other routes at other peering locations in snapshots over an analyzed interval. Prefixes are split into blocks based on their prefix mask length. We see Kangaroo Corp employs a canonical behavior mode, as all peering links have equally attractive routes. In contrast to Kangaroo Corp, Emu Inc is employing a non-canonical policy. This is clear from Figure 5.8.4. Here we see the Whyalla peering location is preferred for all prefixes. This peer is employing a ‘Backup Link’ behavior mode. A majority of prefixes on the Wagga Wagga link are announced with a longer AS path than on the Whyalla link. An example of an ‘Island Peer’ is shown in Figure 5.8.5. Here, Platypus Tech announces five prefixes disjointly over its five peering locations. In Figure 5.8.6 a peer is shown with a policy not easily characterized into one of our behavior modes (it falls into the ‘Other’ category). However, this image is able to show the peer’s intent. It appears that the Port Augusta peering location is preferred for most prefixes. For approximately half the prefixes announced by the peer, the Port Pirie peering location is also preferred. However, the Port Lincoln peering location is used purely as a backup to the prefixes announced at both the Port Pirie and Port Augusta peering locations. No routes are announced for the remaining 50% of prefixes at either Port Lincoln or Port Pirie. A peer such as this may be of great interest to an AS as it may be optimizing its traffic at the expense of the local AS. Although automated techniques as described in Section 5.5 are useful for a rough idea of a peer’s behavior, the examples shown in this section are better suited to operational use. An operator can be alerted that a peer’s policy has changed using the technique outlined in Section 5.7 and investigate those changes in detail.
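The undesirable-prefix categories and the 5% highlighting rule mentioned for the summary table can be sketched with Python's standard `ipaddress` module. The function names and the restriction to IPv4 are assumptions made for illustration; they do not describe the actual Peer Dragnet front-end.

```python
import ipaddress

def undesirable_reason(prefix):
    """Return why an IPv4 prefix looks undesirable, or None if it is fine."""
    net = ipaddress.ip_network(prefix, strict=False)
    if net.is_private:
        return "private address space"
    if net.prefixlen > 24:
        return "mask longer than /24"
    if net.prefixlen < 8:
        return "mask shorter than /8"
    return None

def should_highlight(depreferenced, total, limit=0.05):
    """Highlight a summary cell when more than 5% are de-preferenced."""
    return total > 0 and depreferenced / total > limit

print(undesirable_reason("10.1.0.0/16"))  # private address space
print(undesirable_reason("8.8.8.0/25"))   # mask longer than /24
print(should_highlight(6, 100))           # True
```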

⁷ AS names and peering locations are obscured for commercial reasons. We replace the names with arbitrary Australian names.

Figure 5.8.1: Summary table of peers. Commercially sensitive information is hidden.

Figure 5.8.2: The canonical peering policy of Kangaroo Corp. The y-axis lists prefixes in mask length order. The x-axis is the time of snapshot where the peer’s announcements are compared (this example covers a one-week period). (a) Adelaide (b) Melbourne (c) Sydney (d) Perth

Figure 5.8.3: Legend for a peer’s de-preferencing techniques. Entries: Best Path at this Location; Inferior Path at this Location due to MED; Inferior Path at this Location due to Origin; Inferior Path at this Location due to AS Path; Missing Data; Path Not Available at this Location; Path Not Available at ALL Locations.

Figure 5.8.4: The non-canonical peering policy of Emu Inc. At the Wagga Wagga peering location, the peer consistently either does not announce a route to a prefix or de-preferences the route using a longer AS Path length. (a) Wagga Wagga (b) Whyalla

Figure 5.8.5: The non-canonical peering policy of Platypus Tech. Five prefixes are announced disjointly over five peering locations. (a) Mackay (b) Darwin (c) Broome (d) Alice Springs (e) Kalgoorlie

Figure 5.8.6: The non-canonical peering policy of Dingo Net. The peer consistently does not advertise a large fraction of routes on the Port Pirie peering link. The prefixes the peer does announce on the Port Pirie link are also announced on the Port Lincoln peering link with a longer AS path. (a) Port Augusta (b) Port Lincoln (c) Port Pirie

5.9 Non-Canonical Policy Mitigation

In the previous sections we saw that a significant fraction of peers were engaging in non-canonical policies and the substantial impact these policies have on routing and traffic within the AS. The natural question that arises is whether the AS can do something to mitigate non-canonical policies, at least when they violate peering agreements (canonical or otherwise). Having access to such mitigation techniques protects the AS from intentional or unintentional violations of the agreed policies. Furthermore, judicious use of mitigation along with understanding of (plausible) reasons for peers’ non-conforming policies provides leverage that an AS can use during negotiations. In this section we describe some possible ways to mitigate the influence of non-canonical policies. We first consider how to achieve this when BGP routing is governed by an AS-wide BGP route control system such as a Route Control Platform (RCP) [29]. We next show that even when such a system is not deployed in an AS, a peer’s policies can be mitigated with a “Peer Dragnet” system and appropriate use of import policies on peering routers. Finally, we examine the possibility of a completely decentralized mitigation scheme, highlighting the difficulties with such a scheme.

5.9.1 AS-wide BGP Route Controller

Let us first see what an AS can do if it is using an AS-wide BGP route controller. Two important requirements for such a route controller to detect and mitigate non-canonical policies are: (i) complete visibility into all eBGP routes announced by all peers, irrespective of whether these routes are chosen as best routes or not⁸; and (ii) the ability to select routes for every router in the network from this full set of routes. Note that the latter requirement is embodied in the functioning of an AS-wide BGP route controller. The first requirement, on the other hand, depends on how such a controller is deployed. Without going into details of deployment modes, we point the reader to a “phase-2 deployment” of an RCP as an example [29].

⁸ Having full visibility into customer routes will be useful too but is not necessary here.

With AS-wide knowledge of routes from a peer at the controller, we can mitigate a peer’s policy by making all routes equal up to the IGP distance step before running the BGP decision process. In other words, the controller will have to set the MED, Origin and AS Path length attributes to the same values for all the routes received from a peer before running the decision process. To make the length of AS Paths the same, we propose prepending the peer’s AS number such that all paths become as long as the longest AS Path, instead of removing some ASes from longer AS Paths to make them shorter. This conservative strategy prevents the possibility of ‘tricks’ such as inserting other ASes in the AS Path prior to the peer⁹. In a similar vein, we reset MED and Origin attributes to their lowest values observed across all routes.
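The equalization step described above can be sketched as follows. This is a minimal illustration, not the controller's actual implementation: the `Route` record, the function name `equalize_peer_routes`, and the assumption that the input holds the routes one peer announces for a single prefix are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

# BGP Origin codes; a lower value is preferred by the decision process.
ORIGIN_IGP, ORIGIN_EGP, ORIGIN_INCOMPLETE = 0, 1, 2


@dataclass(frozen=True)
class Route:
    """Hypothetical record for one route learned on one peering link."""
    link: str
    as_path: Tuple[int, ...]  # leftmost entry is the peer's AS number
    med: int
    origin: int


def equalize_peer_routes(routes: List[Route], peer_as: int) -> List[Route]:
    """Make all routes from one peer (for one prefix) tie up to the IGP step.

    AS Paths are only ever lengthened, by prepending the peer's own AS
    number, never shortened. MED and Origin are reset to the lowest values
    observed across the routes.
    """
    if not routes:
        return []
    longest = max(len(r.as_path) for r in routes)
    lowest_med = min(r.med for r in routes)
    lowest_origin = min(r.origin for r in routes)
    equalized = []
    for r in routes:
        padding = (peer_as,) * (longest - len(r.as_path))
        equalized.append(
            replace(r, as_path=padding + r.as_path,
                    med=lowest_med, origin=lowest_origin)
        )
    return equalized
```

After this step every route from the peer has the same AS Path length, MED and Origin, so the decision process falls through to the IGP distance comparison.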

The mitigation scheme described above guards against BGP route attribute manipulations. What about manipulating prefix announcements (see Section 5.5.1)? If a peer decides not to announce a prefix on some links, there is not much the controller can do except to use routes from other peers (or customers), if such routes are available. If the peer is announcing prefixes with different mask lengths, the controller may be able to aggregate or de-aggregate prefixes on various links to the extent where all the peering links (of the peer at hand) end up with exactly the same prefixes. We leave the fine details of such an algorithm to mitigate prefix manipulation in the most effective manner as an avenue for future work.

⁹ For instance, a neighboring AS A using current common prepending techniques may announce the route AAABCD. However, AS A could announce the route AEFBCD with the same effect. A side-effect of this technique would be to cause AS E and F to detect this route as a loop and discard it. AS A could prepend AS numbers from a range known to be unallocated to prevent this side-effect.

5.9.2 Import Policies

A first step to mitigate a non-canonical peering policy may be to use import policies to ignore attributes set by a peer. Most current generation routers can reset the Origin and MED attributes. However, the AS Path attribute cannot be ignored (by current generation routers) or altered without the knowledge of routes available at other routers. The route controller in Section 5.9.1 has this knowledge. If an AS-wide BGP route controller is not deployed in an AS, it is still possible for the AS to mitigate a peer’s influence with a “Peer Dragnet” tool that uses RIB-in snapshots (or a continuous stream) for detecting peering policy violations and modifying route attributes through import policies on peering routers. Once the tool detects what attributes are being used for implementing the policy, it can apply the same techniques described in Section 5.9.1 to de-preference routes using import policies. Most router vendors allow changing route attributes in their import policies. Similarly, if a peer is not announcing routes on some links, the tool can block routes to the same prefix on other links of the peer (provided it can find routes from other peers or customers) through import policies. Finally, in the case of prefix mask length manipulation, the tool can perform some aggregation of prefixes; however, feasibility and effectiveness depend on the features provided by the policy language of the router vendor.
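As an illustration of the blocking rule for selectively announced prefixes, the sketch below computes, per peering link, which prefixes an import policy could deny. The input shapes and the name `prefixes_to_block` are assumptions made for this example; translating the result into a vendor's policy language is not shown.

```python
from typing import Dict, Set


def prefixes_to_block(rib_in: Dict[str, Set[str]],
                      alternatives: Set[str]) -> Dict[str, Set[str]]:
    """Return, for each peering link, the prefixes to deny on import.

    rib_in maps a peering link name to the prefixes the peer announces
    there. alternatives holds the prefixes for which a route is known from
    another peer or a customer. A prefix is blocked only if the peer
    announces it on some but not all links and an alternative route exists.
    """
    if not rib_in:
        return {}
    announced_somewhere = set().union(*rib_in.values())
    announced_everywhere = set.intersection(*rib_in.values())
    partially_announced = announced_somewhere - announced_everywhere
    blockable = partially_announced & alternatives
    return {link: prefixes & blockable for link, prefixes in rib_in.items()}
```

Requiring an alternative route before blocking reflects the proviso in the text: without one, denying the peer's remaining routes would leave the prefix unreachable.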

5.9.3 Distributed Knowledge

Mitigation using import policies requires some centralized knowledge of policies. In this section we look at the possibility of mitigating a peer’s behavior in a completely de-centralized system with each router making its decision independently of the others. Difficulty arises in such a system because routers have limited global knowledge. In Figure 5.9.1 we investigate a distributed mitigation scheme. In this example, large arrows represent routes learned at routers 1 and 4. A thicker arrow represents a more attractive route than a thinner arrow. The shading of each router indicates the router’s selection. The information propagated is shown by arrowed lines next to the iBGP links. Solid arrowed lines are unchanged routes (as with the current version of BGP) while dashed arrowed lines are inferred routes to mitigate a peer’s influence. Routers 2 and 3 are route-reflectors, while 1 and 4 are their respective clients.

Figure 5.9.1: Decentralized mitigation scheme. Panels (a) to (g) each show routers 1 to 4. Routers 2 and 3 are route-reflectors, and routers 1 and 4 are client routers. Large arrows represent routes learned from a neighboring AS. The thickness of the arrows indicates the route’s relative attractiveness. Solid arrowed lines are unchanged routes (as with the current version of BGP) while dashed arrowed lines are inferred routes to mitigate a peer’s influence.

Consider the case in Figure 5.9.1(a). The routes are both learned from the same peer, and no other routes are available from any other location in the AS. The peer announces the lightly shaded route as more attractive than the dark route. Consequently, the light route is preferred at all routers. Router 4 is the only router which learns of the dark route and is consequently the only router with enough network-wide knowledge to mitigate a peer de-preferencing its routes. Router 4 can ignore the peer’s de-preferencing of the dark route and assume it has a route as attractive as the light route (shown in Figure 5.9.1(b)). Router 4 then propagates the new dark route to other routers within the network, allowing them to make local decisions as normal. This is indicated by the dashed dark arrowed line in Figure 5.9.1(b).

Problems arise with this scheme when there is a withdrawal or replacement of a route with a less attractive route. Consider Figure 5.9.1(c). In this example, the light route is replaced by a less attractive route (indicated by a thin large arrow). The route-reflectors would choose the alternate dark route. The issue is encountered as router 4 never learns that the light route had been replaced with a less attractive route as router 3 does not inform it of this change. Now it appears to router 1 that it is being subject to a non-canonical policy and router 1 undertakes the mitigation process (Figure 5.9.1(d)). Router 1 propagates its local route back to 2 (see Figure 5.9.1(e)). Notice all information is now inferred. We now have two routers with consistent routes, but no router has heard the route from the peer! Both routers are assuming the other router has heard this route although neither of them has. This situation is caused by no router having enough network-wide information.

We are able to ensure the required amount of knowledge by creating iBGP sessions between peering routers. The full iBGP mesh of all peering routers would ensure enough knowledge of the network is distributed at the required locations. In the case of Figure 5.9.1(f), we see the replacement of the route at router 1 would be propagated to router 4. If router 4 had recorded the reliance of its route on the presence of a route at router 1, it would be able to replace this route with its previously known route from Figure 5.9.1(a) when the original route was withdrawn. The propagation of this route will result in all routers selecting correct (available) routes (see Figure 5.9.1(g)). However, establishing iBGP sessions with all peering routers circumvents the route-reflector hierarchy, effectively defeating the purpose of route-reflection.

This solution can mitigate a peer de-preferencing a peering link. However, if no route is available at a location, it would be very dangerous to assume a peer has a route to the prefix. Checking if the prefix was ‘covered’ by a less specific prefix may be an option; however, a peer’s network may be physically disconnected or have a transient failure. In this case traffic would not reach its destination, affecting the local AS’s own customers.

The BGP decision process only propagates its best route. Consequently, the only way to mitigate a peer’s behavior in this case is to make every route equal to the most attractive route. Modification of BGP routes to appear more attractive than they actually are could have undesirable effects such as routing loops. It has been shown previously that there are convergence issues in iBGP [49] and eBGP [46,116], and our distributed methodology increases the complexity of an already complicated system. This solution also cannot be implemented using current generation routers. We have presented this option to show the difficulties and issues needing to be considered to overcome a somewhat simple problem, all of which are caused by the lack of network-wide knowledge at individual routers in the network. The added complexity may cause more unforeseen convergence issues.

Every mitigation scheme described here has issues, whether in its complexity, in its ability to cope with routing dynamics, or in its potential to adversely affect routing convergence times. Hence, the best solution to a peer acting outside of its peering agreement is only partly a technical solution. First, the Peer Dragnet tool can be used to identify peers acting outside of their peering agreement and quantify the cost of such behavior. Second, as peering agreements are contractual business arrangements, peers may be forced to change their behavior through legal processes.

5.10 Discussion

This chapter presents a comprehensive study of peering policies for an ISP. Our analysis used BGP routes received from the peers of a Tier 2 ISP. We found that 23% of the peers employed non-canonical peering policies for at least 90% of their prefixes, and about 33% used such policies for at least 50% of their prefixes. We also found that peers used all possible techniques for de-preferencing routes on various peering links. Furthermore, based on how some peering links were favored over others, we identified five behavior modes for peers. We also analyzed the impact of non-canonical policies using our iBGP model described in earlier chapters. We found that 27% of such policies affected more than 10% of router decisions inside the examined AS and 85% of traffic egressing one particular peering link. To understand the dynamics of peering policies, we developed a technique to detect changes in policies in an automated and systematic manner. Upon applying the technique, we found that 9% of peers modified their policies during a five-month period. Finally, we provided mechanisms to counter such policies when they violate peering agreements.

The results of our analysis demonstrate the need to monitor peering policies on a timely and ongoing basis. We have not presented comparisons between peering agreements due to their commercial sensitivity, although some peers were identified as violating their agreements through Peer Dragnet. The tool is now deployed in the Tier 2 AS and is an important component of their network management system.

We believe our study has implications for modeling the inter-domain topology and routing dynamics of the Internet. This work further substantiates that an AS is not atomic and cannot always be abstracted into a single node in the Internet graph. We also believe that insights gained from this study should further the accurate modeling of inter-domain policies and route prediction [77].

This work can be extended by better understanding the motivation of peers using non-canonical policies. Another avenue for future work is to see if peers are engaging in fine time-scale changes to their policies, such as traffic engineering governed by time of the day. To gain a better picture of how representative the AS examined is, this analysis should be undertaken on other ASes. This would lead to a better understanding of the exact impact of non-canonical policies on AS-level topologies, observed paths in the Internet, and their implication for a clean-slate inter-domain routing protocol (for instance incorporating a negotiation-based framework [70]).

Chapter 6

CleanBGP: Verifying the Consistency of BGP Data

Throughout this thesis we have used BGP data to determine the routing state within an AS. However, data collected by route-monitors may contain artifacts introduced by the measurement infrastructure itself. Such artifacts had limited impact on our previous studies due to our examination of stable prefixes. However, if we wish to understand finer scale dynamics, we need to first understand the limitations of the data used for our analysis. In this chapter, we systematically characterize measurement artifacts and provide mechanisms to limit their effect on further analyses.

6.1 Introduction

Measurement artifacts in data can affect the accuracy of analyses and result in flawed conclusions. BGP data is no exception. In addition to our use in this thesis, BGP data has been used for many network management tasks such as debugging routing problems [33], anomaly detection [95], policy inference [77,78], and router table analysis [34, 53]. Although often glossed over in the research literature, failing to remove measurement artifacts from data prior to its use can result in meaningless statistics and conclusions. For example, recent work [4], by their own admission, did not remove the BGP updates caused by the establishment of BGP sessions between a route-monitor and operational router prior to their churn rate analysis¹. A monitoring BGP session can fail and re-establish frequently, and during each re-establishment over 260,000 updates may be seen at the route-monitor. Including these updates in a churn analysis can greatly bias results [121]. Consequently, in this chapter we identify these (and other) measurement artifacts and provide techniques to limit their effect on further analyses.

¹ Note that this statistic is not the main point of the study in [4].

It might be easy to argue for improvement of the measurement apparatus. Obviously, we would like monitors to be as accurate as possible and improvements to monitors are complementary to our work. However, regardless of such improvements, it is vital, particularly in operational systems, to calibrate the accuracy of all measurement devices [85]. The initial hypothesis of all measurement apparatus should be that it is flawed, and data taken from the apparatus can be considered accurate only when this hypothesis has been proven false. We have limited resources for such calibration; however, we do have the capability to perform consistency-checks on the data. It is this methodology that has allowed us to detect some artifacts that otherwise would never have been found through a “hunt and peck” approach.

The goal of collecting BGP data is to record the state of a single BGP router in the Internet. In order to analyze this data, it must be recorded on disk so that it can be processed offline. This might be done on the routers themselves, but this option is typically not chosen as it requires significant resources and can impact on the operational stability of the router. Hence, collecting BGP data is undertaken using a more passive approach. Recall from Chapter 2, a software route-monitor establishes a BGP session with the operational router, possibly over multiple physical links. The operational router sends all its best path updates to the monitor as if it were part of the Internet’s routing system. The monitor records this data to disk.

Many components of this measurement infrastructure can fail. For instance, there can be bugs in the monitor implementation causing updates to not be recorded [60] or recorded non-chronologically, which is undesirable due to the BGP state not being periodically resent. Further, the BGP session between the monitor and the router can fail causing missed updates, and during the re-establishment of the session a BGP update storm occurs as all routes are re-advertised. Despite it being shown that including these updates in further analyses can result in starkly different conclusions [121], the difficulty in removing them results in their effect simply being included in many statistics [4].

Zhang et al. [129] partially address the issue of session failures and subsequent re-establishments in BGP data. They identify the re-establishment phase of a BGP session when all table entries are re-announced. However, their technique cannot identify the entire interval affected when a session persistently fails, does not identify the time the session is down, often under-estimates the re-establishment phase, and can fail to identify session failures if a significant routing change occurred during the downtime. In contrast, we present a structured approach to identify the entire interval affected by both session failures and re-establishment updates, together with discovering previously unknown measurement artifacts by checking the BGP data for consistency.

Our methodology has several stages. First, we examine how the data is collected and the consistency-checking technique which highlights the presence of measurement artifacts (Section 6.2). Second, if a measurement artifact described in Section 6.3 is detected from characteristics in the data (Section 6.4), we use techniques presented in Section 6.5 to estimate the interval affected and determine its source. Finally, we either exclude the data from further analysis or estimate the actual routing behavior using techniques based on the classification of the measurement artifact (Section 6.6).

Measurement artifacts are binary in nature. They are either present or not. However, no binary indicator is available to inform us of their presence. Hence, we use multiple characteristics of the data for detection. In Section 6.9 we investigate the frequency of detected artifacts and their effect on the measured data.

We find our consistency-checking of the data detects problems in 5% of consistency-checks, with 81% of these caused by a number of updates recorded in a non-chronological order. A further 9.5% were caused by session failures; however, we found session failures occurred more frequently when the data was consistent. Consequently, we also use properties of the update stream to identify session failures.

Analysis of BGP data may or may not be substantially affected by some artifacts. However, knowledge of their existence is vital so a judgment on their effect can be made. This is especially critical if a software router is being used as the replacement for an operational router, as data may be forwarded incorrectly. Further, our techniques may form a pre-processing step to a monitoring system such as that proposed by Matthews et al. [74], or for other network management tasks.

6.2 Data Consistency

BGP data is collected to represent the routing state of a router at a given time. This state is unique. Consequently, any data representing this state should be consistent. This consistency forms the basis of the check outlined in this section.

The current best-practice approach to determine a router’s current view of the Internet is for a software router (for instance Quagga [56] or OpenBGPD [11]) to be used as a route-monitor. The monitor establishes a BGP session with an operational router, which treats the monitor as any other router. Subsequently, all changes to the operational router’s best route are announced along the session to the monitor, which records these changes to disk and periodically dumps an entire table. This BGP update collection procedure is undertaken by public route-monitors [93, 115] as well as internally by ASes (see Chapters 4 and 5).

Two types of BGP data are collected, tables and updates, which are views of the same dynamic system. Consequently, they should be consistent. BGP is a hard-state routing protocol, where updates are only sent once. Thus, if we construct a table at time t2 by combining the last known table, at some time t1, with all updates received in the interval [t1, t2], the result should be consistent with the table recorded at time t2. However, this consistency-check does not always succeed when examining recorded data. We are able to characterize the causes of failed consistency-checks into several categories of measurement artifacts.
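A minimal sketch of this consistency-check is given below. The data shapes are assumptions made for illustration: a table is a mapping from prefix to an opaque route description, and an update is a `(timestamp, prefix, route)` tuple in which `route` is `None` for a withdrawal. Parsing of recorded data is not shown.

```python
from typing import Dict, Iterable, Optional, Tuple

Table = Dict[str, str]                      # prefix -> route description
Update = Tuple[float, str, Optional[str]]   # (timestamp, prefix, route or None)


def construct_table(table_t1: Table, updates: Iterable[Update],
                    t1: float, t2: float) -> Table:
    """Apply the updates recorded in [t1, t2] to the table recorded at t1."""
    table = dict(table_t1)
    # sorted() is stable, so updates with equal timestamps keep recorded order.
    for timestamp, prefix, route in sorted(updates, key=lambda u: u[0]):
        if not (t1 <= timestamp <= t2):
            continue
        if route is None:
            table.pop(prefix, None)   # withdrawal
        else:
            table[prefix] = route     # announcement or replacement
    return table


def is_consistent(table_t1: Table, updates: Iterable[Update],
                  table_t2: Table, t1: float, t2: float) -> bool:
    """True when the constructed table equals the table recorded at t2."""
    return construct_table(table_t1, updates, t1, t2) == table_t2
```

A failed check only signals that some artifact is present; identifying which one requires the characteristics discussed in the following sections.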

6.3 Measurement Artifacts

Measurement artifacts occur when the data stored by the monitor does not reflect the real state of the router under observation. In this section we consider the artifacts unearthed using our simple consistency-check and their characteristics, a summary of which is included in Table 6.3.1.

6.3.1 Session Failures and Resets

A significant portion of BGP updates in the Internet are caused by the re-establishment of operational BGP sessions [120]. Re-establishment occurs after a planned take-down of the session to alter the local policy (policy-changed) or after an unplanned session failure as a result of the hold-time expiring (policy-unchanged). The collection of BGP data relies on such a session and consequently can also be affected by session failures. Monitor sessions are particularly vulnerable to failures as they often use multi-hop sessions (that cross multiple physical links) and are more vulnerable to timeouts as a result of link congestion.

If the monitoring session fails (or is taken down), any updates occurring on the operational router will not be recorded on the route-monitor. Consequently, the consistency-check may fail. When the session is re-established, the entire table is re-announced, resulting in a large influx of routing updates. These updates are not representative of changes in the observed network. Such updates can even cause a “BGP update storm” [121] that is not real.

Meta-data indicating when a monitoring session is taken down or established is sometimes recorded together with BGP data. We term such meta-data state information. State information can be used to confirm a monitoring session has failed; however, it is not perfect. A monitoring session can fail due to a problem with the route-monitor. In such cases, state information may not be recorded. Also, some data sources, such as RouteViews [115], do not record state information. For this analysis, we primarily use state information for validation of our other characteristics.

6.3.2 Incomplete Tables

It is possible that a routing table has not been fully written to disk. This can occur due to a failed process on the monitor or a hardware failure such as a hard disk reaching its capacity. Consequently, the incomplete table is not representative of the operational router’s view of the Internet.

6.3.3 Missing Updates

Kong [60] discovered some updates were not decoded by the monitor and consequently would be missing from both the update stream and recorded tables. This bug has since been corrected. However, if an entry is in a recorded table (indicating it has been decoded), but not in the update stream, then update files may be corrupted, empty or missing. This occurred for almost an hour on January 21, 2007 at RIPE route-monitor RRC01.

6.3.4 Update Ordering

We discovered that when updates occur close to each other (in the order of seconds), the route monitor occasionally outputs routes in the wrong chronological order. This may be a bug in the software. We have consulted with the software developers to identify the cause of this issue. Note that we are assuming that it is the updates which are recorded to disk in the incorrect order. Of greater concern would be if the opposite were true. That is, if the updates were recorded in the correct order, but applied to the table in the incorrect order. Due to the hard-state nature of BGP, applying updates in the incorrect order causes the software router to have an incorrect state. If the software router were to be used as the replacement for a hardware router, this would be of great concern.

6.3.5 Non-atomic Table Dumps

Recorded table dumps are not atomic. They take a period of time to write to disk, from several seconds to minutes depending on the size and number of monitored router tables. Consequently, updates occurring during the time of writing to disk may or may not be reflected in the written table. When we use the table consistency-check to detect other artifacts, we do not treat a difference caused by an update arriving during this time as a failure.
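One way to apply this exclusion is sketched below, under the assumption that the start and end times of the table write are known and that updates are `(timestamp, prefix, route)` tuples. The function names are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

Update = Tuple[float, str, Optional[str]]   # (timestamp, prefix, route or None)


def prefixes_updated_during_dump(updates: Iterable[Update],
                                 dump_start: float,
                                 dump_end: float) -> Set[str]:
    """Prefixes with at least one update while the table was being written."""
    return {prefix for timestamp, prefix, _ in updates
            if dump_start <= timestamp <= dump_end}


def remaining_inconsistencies(inconsistent: Iterable[str],
                              updates: Iterable[Update],
                              dump_start: float,
                              dump_end: float) -> Set[str]:
    """Drop inconsistent prefixes that a non-atomic table dump can explain."""
    excusable = prefixes_updated_during_dump(updates, dump_start, dump_end)
    return set(inconsistent) - excusable
```

Only the prefixes left after this filtering are passed on as evidence of other artifacts.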

6.3.6 Other Artifacts

Several other artifacts were discovered while analyzing the consistency of data. On some occasions, such as in the table recorded at RRC00 on May 1, 2008 for the prefix 84.205.80.0/24, two routes learned from the neighbor 202.12.29.64 are recorded although BGP does not allow this. One route appears to be valid; however, the second has a null AS number and no AS Path. Further, no update for the second route is recorded in the update stream. Interestingly, in a later table, this entry is replaced, without being explicitly withdrawn, by a route learned from neighbor 193.136.5.1 with a null AS and the same originating time. The new route is replaced later again by a route from 12.0.1.63 (with AS7018) and a new originating time, although the update is not present in the update stream and the consistency-check fails.

Another artifact we discovered was when updates were not applied to the table. This occurred in several tables recorded at RRC00 on April 30, 2008. Here, we discovered several hundred updates occurring hours before the recorded table time and not applied to the table. This is confirmed by the originating time of the routes in the recorded table. This could be caused by the re-ordering of updates; however, the timestamps of the consecutive updates are much further apart than in any other observed update re-ordering (minutes in contrast to seconds). Consequently, we believe some updates may not have been applied to the table. A table recorded on this day also took over 20 minutes to write to disk, indicating a possible monitor failure. Other tables from the same monitor generally took less than 20 seconds to write to disk.

A third artifact we discovered was the alteration of the originating AS. For example, at RRC01 on the 25th January 2007, the originating AS of the prefix 203.10.62.0/24 recorded in a table is AS23456. However, updates indicate it is AS2.2 (which is not a valid AS number). This may be a binary-to-ASCII issue or a feature of the software router of which we are unaware.

We do not specifically search for the above artifacts because they do not occur regularly. However, they are important to detect as they can affect further analyses. We were able to detect these artifacts by investigating by hand the data responsible for a failed consistency-check. This highlights the need to continuously check the consistency of data prior to its use.

Table 6.3.1: Data characteristics of main measurement artifacts. Legend: X = characteristic necessary for artifact presence; + = strongly indicative (but not necessary) for artifact presence.
Artifacts (columns): Session Reset (Policy Unchanged); Session Reset (Policy Changed); Incomplete Table; Missing Updates; Update Ordering.
Characteristics (rows): Additional prefixes in constructed table; Different prefixes in constructed table; Missing prefixes in constructed table; Almost simultaneous updates for inconsistent prefixes; Oldest prefix during inter-table interval; No routing activity for extended period; Burst of unique prefixes; Burst of duplicate updates; State Information.

6.4 Characterization of Artifacts

The consistency-check outlined in Section 6.2 detected many inconsistencies. However, consistent data does not necessarily indicate an interval is free from measurement artifacts. In this section, we use the characteristics of artifacts found in Section 6.3 to detect and classify measurement artifacts.

6.4.1 Table Comparison

The consistency-check outlined in Section 6.2 is also a process we can use to detect specific artifacts. If a session failure occurs between two successive table dumps, additional prefixes may be in the constructed table when compared to the second recorded table. Consider the example in Figure 6.4.1. A table is recorded at times t1 and t2 (shown by a shaded box). The time of the announcement of prefix 1 is denoted by an arrow annotated with A1 and its withdrawal by W1. We construct the table at time t2 (dashed box) by applying the updates in chronological order to the table at t1 (shaded box). However, during a period of downtime several updates on the operational router (W1, W5, A2) are not recorded on the route-monitor. All prefixes in the table after the downtime (P2, P3, P4) are sent to the monitor. Hence, the withdrawals (W1, W5) are not recorded on the route-monitor and thus, if the prefixes are not re-advertised before a table is recorded, will result in differences between the constructed and recorded table (shaded box) at time t2. Note that missing announcements (such as A2) will be delayed as they are announced during re-establishment. If announcement of a prefix withdrawn during the downtime occurs after the downtime (such as A5), the missing withdrawal W5 will not cause data inconsistency. Hence, we will not always be able to detect such session failures using a consistency-check.

Session failures are not the only cause of a table comparison failure. Any prefix without an equivalent route in both the constructed and recorded tables may be affected by non-chronological recording of updates. If an update occurs ‘almost simultaneously’ to the last update received, then we say it is responsible for the inconsistency.

Figure 6.4.1: Consistency-check example. The recorded table at time t2 (shaded) is compared to a table constructed from the table at t1 and all updates recorded in the interval [t1, t2] (dashed). Prefix 1 is denoted by P1 in the table. The announcement of prefix 1 is annotated by A1. The withdrawal of prefix 1 is shown by an arrow annotated by W1 and so forth. The updates during the downtime are missing from the recorded update stream, resulting in the extra prefix (P1) in the constructed table.

If there are missing or different routes in the constructed table, and the non-chronological ordering of routes cannot account for any discrepancies, it is likely that updates have not been recorded. If additional prefixes are in the constructed table, but other characteristics discount the possibility of a session failure being responsible, then it is likely that the recorded table is incomplete.

There are three possible inconsistencies detected by the table comparison characteristic — additional prefixes, missing prefixes and different prefixes — in reference to the constructed table compared to the recorded table. As described above and summarized in Table 6.3.1, additional prefixes may indicate the presence of any described artifacts, while different or missing prefixes are only indicative of missing updates or updates being re-ordered.
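The three kinds of inconsistency can be separated with simple set operations. The sketch below assumes each table is a mapping from prefix to a comparable route description; the function name and the returned keys are illustrative only.

```python
from typing import Dict, Set


def compare_tables(constructed: Dict[str, str],
                   recorded: Dict[str, str]) -> Dict[str, Set[str]]:
    """Classify prefixes, in reference to the constructed table.

    additional: in the constructed table but not in the recorded table
    missing:    in the recorded table but not in the constructed table
    different:  in both tables, but with routes that are not equivalent
    """
    shared = constructed.keys() & recorded.keys()
    return {
        "additional": set(constructed.keys() - recorded.keys()),
        "missing": set(recorded.keys() - constructed.keys()),
        "different": {p for p in shared if constructed[p] != recorded[p]},
    }
```

An empty result in all three sets corresponds to a successful consistency-check.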

Note that if no table is recorded at a nominal recorded time², a session failure can be detected.

² It is common practice for the monitor to dump the table of all peers in the same file. Hence, no entries for a particular peer indicates either the session is down or no prefixes are available on this session.

6.4.2 Oldest Prefix

Tables recorded at a route-monitor also store the time at which each update was received. However, ASCII conversion tools such as RIPE’s libbgpdump [5] (used to convert the binary data stored by route-monitors into text easily parsed by humans or text pattern matching) often do not output this information by default. Consequently, to the best of our knowledge, this information has not previously been used in the literature.

The failure and subsequent re-establishment of a monitoring session causes the entire table of the operational router to be re-announced. As a result, if a session failure occurs at a time t1, no prefix in the table can have an originating time earlier than t1, so the oldest prefix is a useful indicator. However, normal routing operation will also result in all prefixes being re-announced at some point, so we cannot say a session re-establishment definitely started at the timestamp of the oldest-prefix. Nevertheless, a majority of the prefixes in the table are stable [34, 92]. Therefore, with regular snapshots (RIPE records them at 8 hour intervals), if the timestamp of the oldest prefix of the current table lies between the timestamps of the previous and current table snapshots, it is a good (but not definitive) indicator that a session failure has occurred during this interval.

The use of this characteristic is best shown by example. Consider Figure 6.4.2. The timestamps of tables are shown in the shaded box, with their oldest prefix shown above. The first inter-table interval between 00:00 and 06:00 is a possible failure interval as the timestamp of the oldest-prefix in the table at 06:00 (timestamp of 05:55) lies in this interval. The next interval (06:00 - 12:00) is failure-free as the timestamp of the oldest prefix does not lie within this interval. Notice that the oldest-prefix can change across tables (for instance at the time 18:00) while the interval remains failure-free. As shown in Table 6.3.1, the oldest prefix characteristic is only used as an indicator of session failures. It is not indicative of other types of artifacts.

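[Figure 6.4.2 (timeline), values as labelled in the figure]

Interval (table snapshots)   Oldest Prefix at end of interval   Classification
00:00 - 06:00                05:55                              Possible Failure
06:00 - 12:00                05:55                              Failure-Free
12:00 - 18:00                09:27                              Failure-Free
18:00 - 24:00                09:27                              Failure-Free
24:00 - 30:00                28:32                              Possible Failure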

Figure 6.4.2: Oldest prefix characteristic. The timestamps of tables are shown in the shaded box, with their oldest prefix shown above. The first inter-table interval between 00:00 and 06:00 is a possible failure interval as the timestamp of the oldest-prefix in the table at 06:00 (timestamp of 05:55) lies in this interval. The next interval (06:00 - 12:00) is failure-free as the timestamp of the oldest prefix does not lie within this interval.
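The rule illustrated in Figure 6.4.2 can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in this work. The function names, the representation of timestamps as minutes, and the choice to treat the interval as open at the previous snapshot and closed at the current one are all assumptions.

```python
from typing import List, Tuple


def hhmm(text: str) -> int:
    """Convert 'HH:MM' (hours may exceed 23, as in the figure) to minutes."""
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)


def classify_intervals(snapshots: List[int], oldest: List[int]) -> List[Tuple[int, int, str]]:
    """Label each inter-table interval using the oldest-prefix characteristic.

    snapshots: snapshot times in minutes, ascending.
    oldest:    oldest-prefix timestamp of every snapshot after the first,
               so len(oldest) == len(snapshots) - 1 (assumed layout).
    """
    labels = []
    for prev, curr, old in zip(snapshots, snapshots[1:], oldest):
        # If the oldest prefix is newer than the previous snapshot, the whole
        # table was (re-)announced inside this interval: a failure is possible.
        label = "possible failure" if prev < old <= curr else "failure-free"
        labels.append((prev, curr, label))
    return labels


# Values read from Figure 6.4.2.
snapshots = [hhmm(t) for t in ["00:00", "06:00", "12:00", "18:00", "24:00", "30:00"]]
oldest = [hhmm(t) for t in ["05:55", "05:55", "09:27", "09:27", "28:32"]]
labels = [label for _, _, label in classify_intervals(snapshots, oldest)]
```

With the figure's values, only the first and last intervals are labelled as possible failures, in agreement with the classification row of the figure.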

6.4.3 State Information

Session state information is recorded by some monitors to indicate when a session has failed or has been re-established. However, some monitors such as RouteViews [115] do not store this data. Even when state information is recorded, it can be missing, and it does not identify measurement artifacts other than session failures. In addition, no state information is recorded to indicate the end of the re-establishment phase. Even if monitors such as RouteViews started recording state information in the future, the historical data would still lack it. Consequently, for the purposes of this analysis, we use state information purely as validation of session failures. However, in practice, if this information were available, it would also contribute to determining the accuracy of BGP data.

6.4.4 Downtime

BGP undergoes constant changes, so a long period where no updates are received can be indicative of a session failure or a hardware failure. However, receiving no announcements for a period of time may be part of normal routing behavior, especially in BGP tables with a relatively small number of prefixes. The recording of keep-alives (messages specifically sent to keep a BGP session alive during low routing activity) allows a low downtime threshold to be set, detecting measurement artifacts as early as possible. This characteristic can be used to identify session failures or when updates are otherwise missing (see Table 6.3.1).
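As a minimal sketch of the downtime characteristic, the following Python function reports every gap between consecutive messages that exceeds a threshold. It assumes arrival times of updates and keep-alives are available as a sorted list of seconds. The function name is hypothetical, and the default of 180 seconds is the RIPE hold time listed in Table 6.7.1.

```python
from typing import List, Tuple


def find_downtime(message_times: List[float], threshold: float = 180.0) -> List[Tuple[float, float]]:
    """Return (start, end) gaps in which no update or keep-alive was recorded.

    message_times: arrival times in seconds, sorted ascending (assumed input).
    threshold:     longest silence tolerated before a gap is reported.
    """
    gaps = []
    for earlier, later in zip(message_times, message_times[1:]):
        if later - earlier > threshold:
            gaps.append((earlier, later))
    return gaps


# Keep-alives every 60 seconds, then a silence of 380 seconds.
gaps = find_downtime([0, 60, 120, 500, 560])
```

When keep-alives are not recorded, the same function can be used with a larger, session-specific threshold.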

6.4.5 Session Re-establishment

The session re-establishment phase after a failure has two main characteristics: a large number of unique prefixes announced in a short interval (as the table is re-announced) and a large number of non-table-altering or duplicate updates.

Unique Prefixes

When a BGP session is established (or re-established after a failure), all routes in the table are re-announced as quickly as possible. Normal routing activity generally affects only a small subset of prefixes [92] in the table. Thus, when a large number of prefixes are affected by BGP updates in a short interval it is indicative that a session re-establishment is in progress.

Duplicate Announcements

As mentioned above, routing activity generally only affects a subset of highly active prefixes. Accordingly, when a session re-establishes after a period of downtime, a substantial fraction of announcements are likely to be equivalent to the route prior to the downtime and hence appear as duplicates. Duplicate announcements should otherwise be rare. However, duplicate announcements can also be caused by internal routing changes [65, 121]. In addition, the Minimum Route Advertisement Interval (MRAI) timer suppresses updates for a period of time after sending a burst. Hence, if a prefix has its route attributes modified before reverting back to its original selection within the MRAI, then a duplicate announcement will be sent to the monitor³.

³ We are not aware of router vendors keeping historical information to prevent this. There is little benefit in doing so for operational routers.

As duplicate announcements are rare, except when a session re-establishment is occurring, they can be used as a characteristic for identifying session failures. In a similar vein to the unique prefixes, we use a threshold to determine if a session re-establishment is occurring. However, a session re-establishment may not always result in a large number of duplicate announcements. The operator of the monitored router may alter their local policy during the downtime, resulting in a number of the routes being altered and consequently lowering the number of duplicate announcements. Consequently, we use this characteristic not only to detect if a session re-establishment is occurring but also to determine if the monitoring session’s policy has changed or remains the same.

A re-establishment after a session failure may not complete due to a persistent failure, resulting in multiple partial re-establishments. However, in this case, we are likely to see a substantial number of duplicate announcements or a long downtime. The technique of Zhang et al. [129] does not account for such cases.
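The two re-establishment characteristics can be counted with a single pass over the updates in an interval. The sketch below is illustrative only. It assumes a table is a mapping from prefix to route attributes, that a withdrawal is represented by attributes of None, and that a duplicate is an announcement whose attributes equal the route currently held for that prefix. These representations and the function name are not taken from the thesis tooling.

```python
from typing import Dict, Hashable, Iterable, Optional, Tuple


def window_stats(
    table: Dict[str, Hashable],
    updates: Iterable[Tuple[str, Optional[Hashable]]],
) -> Tuple[int, int]:
    """Count unique announced prefixes and duplicate announcements.

    table:   prefix -> route attributes known before the interval.
    updates: (prefix, attributes) in arrival order; attributes is None
             for a withdrawal (assumed encoding).
    """
    current = dict(table)  # work on a copy so the caller's table is unchanged
    announced = set()
    duplicates = 0
    for prefix, attrs in updates:
        if attrs is None:
            current.pop(prefix, None)  # withdrawal removes the route
            continue
        announced.add(prefix)
        if current.get(prefix) == attrs:
            duplicates += 1  # announcement does not alter the table
        else:
            current[prefix] = attrs
    return len(announced), duplicates


table = {"10.0.0.0/8": ("65001",), "192.0.2.0/24": ("65002",)}
updates = [
    ("10.0.0.0/8", ("65001",)),    # same route as the table: duplicate
    ("192.0.2.0/24", ("65003",)),  # route change
    ("192.0.2.0/24", ("65003",)),  # repeats the new route: duplicate
    ("198.51.100.0/24", None),     # withdrawal of an unknown prefix
]
stats = window_stats(table, updates)
```

Dividing both counts by the size of the last known table gives the proportions that are compared against the thresholds.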

6.4.6 Detecting Measurement Artifacts

We use all of the above characteristics to detect measurement artifacts in BGP data sources. We use a sliding window on the update stream to initially detect a measurement artifact. An extended downtime or a burst of unique prefixes/duplicate announcements is indicative that a measurement artifact may be occurring. Unlike other characteristics, thresholds are required to determine the size of the burst of unique prefixes/duplicate announcements that represents a re-establishment phase. These thresholds must be high enough to classify normal routing operation correctly but low enough to ensure we do not miss many artifacts. We outline how to obtain these thresholds in Sections 6.7 and 6.8.

When a table is available for comparison (it may not be available if our analysis is real-time), we compare the constructed table with the recorded table and examine the oldest prefix in the table for further evidence as to whether a measurement artifact (and what type) occurred during the interval between the previous and current table. When a measurement artifact spanning an extended interval is detected, we enter the next phase of our analysis.

[Figure 6.5.1 (timeline): a run of bins marked ‘S’ around the detection time, enclosed by the detected interval.]

Figure 6.5.1: Finding the interval of extended measurement artifacts. ‘S’ indicates a suspicious interval. After an artifact is detected, we search forward and backwards for two successive non-suspicious bins. The start time of the artifact is one bin prior to the first suspicious bin. The finish time of the artifact is one bin after the last suspicious bin.
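The sliding window detection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: events are time-ordered announcements already tagged as duplicate or not, the window is a trailing window ending at each event, and the default thresholds are the values from Table 6.7.1. The function and parameter names are hypothetical.

```python
from collections import Counter, deque
from typing import Iterable, List, Tuple


def detect_times(
    events: Iterable[Tuple[float, str, bool]],
    table_size: int,
    window: float = 3600.0,
    unique_frac: float = 0.5,
    dup_frac: float = 0.25,
) -> List[float]:
    """Return event times at which the trailing window exceeds a threshold.

    events: time-ordered (timestamp, prefix, is_duplicate) announcements.
    """
    in_window = deque()
    prefix_counts = Counter()  # how often each prefix occurs in the window
    duplicates = 0
    detections = []
    for ts, prefix, is_dup in events:
        in_window.append((ts, prefix, is_dup))
        prefix_counts[prefix] += 1
        duplicates += is_dup
        # Drop events that have fallen out of the trailing window.
        while in_window[0][0] <= ts - window:
            _, old_prefix, old_dup = in_window.popleft()
            prefix_counts[old_prefix] -= 1
            if prefix_counts[old_prefix] == 0:
                del prefix_counts[old_prefix]
            duplicates -= old_dup
        too_many_unique = len(prefix_counts) > unique_frac * table_size
        too_many_dups = duplicates > dup_frac * table_size
        if too_many_unique or too_many_dups:
            detections.append(ts)
    return detections


events = [(0, "a", False), (10, "b", False), (20, "c", False), (5000, "a", False)]
detections = detect_times(events, table_size=4)
```

In the toy input, three of four prefixes are announced within twenty seconds, so the third event triggers a detection. The announcement at 5000 seconds does not, because the earlier events have left the window.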

6.5 Extended Measurement Artifacts

The above characteristics can be used to indicate the presence of measurement artifacts at a particular time. The impact of some artifacts is inferred by their detection. For example, non-chronological updates obviously affect the updates of the prefixes that are recorded in the non-chronological order. However, some measurement artifacts, such as session failures/re-establishments and missing updates, span an interval which cannot be identified by our detection techniques alone, which simply indicate a single timestamp of a measurement artifact.

Our desire is to precisely identify all data that is not representative of the operational router’s behavior. Consequently, we would like to identify the exact start time and exact end time of a measurement artifact. However, this is difficult when there are no clear markers in the data. State information recorded in some data sources can provide a starting point to detect session failures. However, state information can be delayed or missing and does not indicate the conclusion of the re-establishment phase⁴. Further, state information cannot indicate the other artifacts we have found, such as missing updates.

⁴ The Internet draft [89] defines an End-of-RIB marker which would be helpful in this context, but is not implemented to the best of our knowledge.

If the sliding window used above detects a possible measurement artifact, we localize it by considering small disjoint bins surrounding the detection time. To locate the start and end of a measurement artifact, we search backward and forward from the detection time until we find non-suspicious routing behavior. If bins contain no updates/keep-alives (the session is down or the interval may be missing updates) or there are a large number of unique prefixes/duplicate announcements (during the re-establishment phase), we declare the bin suspicious. We ensure an artifact is completely captured by searching backward and forward in time for two successive non-suspicious bins. We see in Figure 6.5.1 the detected interval is one bin either side of the series of suspicious bins. This is to ensure the entire measurement artifact is captured. For example, a measurement artifact may commence towards the end of a bin with the number of unique prefixes or duplicate announcements in the bin not being large enough for the bin to be declared suspicious. Note that suspicious bins differ from our sliding window in several ways. First, our bins are disjoint while each window is not. Second, bins provide a finer granularity than the sliding window. Third, we can be more aggressive (to ensure the full measurement artifact is captured) when setting bin thresholds as we only check if it is suspicious when we have detected a measurement artifact from our sliding window.
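The search described above can be sketched as follows, assuming the bins around the detection time have already been labelled suspicious or not and that the bin containing the detection time is itself treated as part of the artifact. The function name and the list-of-booleans representation are illustrative assumptions.

```python
from typing import List, Tuple


def localize(suspicious: List[bool], detected_bin: int) -> Tuple[int, int]:
    """Return (first_bin, last_bin), inclusive, of the detected interval.

    suspicious:   one flag per disjoint bin, in time order.
    detected_bin: index of the bin containing the detection time.
    """

    def scan(step: int) -> int:
        """Walk in one direction until two successive clean bins are seen."""
        i = detected_bin
        last_suspicious = detected_bin
        clean_run = 0
        while 0 <= i + step < len(suspicious) and clean_run < 2:
            i += step
            if suspicious[i]:
                last_suspicious = i
                clean_run = 0
            else:
                clean_run += 1
        return last_suspicious

    first = scan(-1)
    last = scan(+1)
    # Extend by one bin on either side, staying inside the available data.
    return max(first - 1, 0), min(last + 1, len(suspicious) - 1)


F, T = False, True
bins = [F, F, F, T, T, F, T, T, F, F, F, T]
interval = localize(bins, detected_bin=4)
```

In this example the single clean bin at index 5 does not end the search, so both suspicious runs are captured. The suspicious bin at index 11 is excluded because two successive clean bins precede it.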

Attempting to be overly precise when identifying the affected interval, as in the technique of Zhang et al. [129], results in under-estimating the duration of an artifact. This is especially critical when multiple artifacts occur close in time. Although the technique described in [129] is designed to identify session re-establishments, it is inadequate when a session fails frequently (which is often the case when an underlying issue causes a session failure). In contrast, our technique does not attempt to capture a single session re-establishment. Instead, we identify the entire interval affected by the measurement artifact, which may contain multiple session failures and re-establishments or other artifacts. Thus we are able to provide a higher level of confidence in the data (unlike [129], whose motivation is simply to identify full session re-establishments).

The above technique to identify the extended measurement artifacts can be used for real-time analysis, which is especially useful in monitoring systems. However, some measurement artifacts can only be detected when a table is available for consistency-checking. Further, a table also allows additional characteristics, such as the oldest prefix, to differentiate between missing updates and session failures.

The most common measurement artifact encountered was session failure (validated by state information). Consequently, if the recorded table contains characteristics consistent with session failure (oldest prefix and additional prefixes) and the sliding window and suspicious bins localize the interval affected, we classify the artifact as a session failure. Further, if the detected interval contains few duplicate announcements, then we characterize the session failure as a policy-change. Other artifacts are classified based on the detection mechanism (e.g. non-chronological updates). We now outline how we clean the data based on the type of measurement artifact detected.

6.6 Cleaning Data

We must be very cautious when cleaning information to avoid unnecessarily altering data. To this end, we ‘mark’ updates and table entries that have been altered and clearly identify the interval ‘cleaned’. Consequently, applications using the BGP data can determine what data to include or exclude.

The most obvious form of data cleaning is exclusion: removing any interval affected by measurement artifacts from further analysis. This would be ideal for applications sensitive to large numbers of updates or long periods of downtime and is the approach we recommend if continuous routing updates are not required. However, as many applications require an unbroken stream of data, this is not always an attractive solution. Thus, we now introduce a new technique for the estimation of routing behavior during measurement artifacts.

6.6.1 Session Failures/Re-establishments

Previous work has removed all duplicate announcements to minimize the effect of session failures [17,20,69,92]. Duplicates reflect no change in the routing state; however, they can be caused by internal AS routing changes [65,121]. We mark all duplicate announcements as part of a measurement artifact only during a detected session failure interval. Thus, we alter a minimal amount of data and ensure all updates that reflect routing changes during the downtime are still present in the data, although they may be delayed.

During the downtime of a session, withdrawal of prefixes can occur on the operational router. These ‘ghost’ withdrawals are only noticeable when comparing a constructed table to a recorded table. The recorded table will not include the prefixes withdrawn during the downtime. Hence, we can assume these prefixes were withdrawn during a session failure, and consequently we are able to estimate the time the withdrawal occurred (we place it at the conclusion of the detected interval). If multiple session failures occur during a single inter-table interval, the withdrawal is placed where it is consistent with multiple session re-establishments.

For example, consider Figure 6.6.1. Three session failures have been detected using the sliding window detection technique. A prefix is in the constructed table at t2, but missing from the recorded table, and four cases for the observed updates in the interval [t1, t2] are shown. In the first case, no updates are recorded between t1 and t2. Hence, we place the inferred withdrawal at the end of the first detected failure. In case two we observe an announcement after the first detected failure. Hence we place the inferred withdrawal at the conclusion of the second detected failure. Notice in case three we observe two announcements with the last announcement between the second and third detected failure. Hence, we place a single inferred withdrawal at the conclusion of the third detected failure.
We see in case four an observed announcement after the last detected failure. No withdrawal can be placed during any detected interval to account for the missing prefix in the table at t2. Consequently, our detection and localization schemes must have failed or the interval is affected by missing updates. In this case we declare the entire inter-table interval a measurement artifact.

Note we are possibly not replacing all missing updates. We only introduce the minimal set of updates needed to ensure data cohesion. Consequently, we recommend the exclusion method over the estimation method for cleaning the data where possible.

[Figure 6.6.1 (timeline): three detected resets (1, 2, 3) on a time axis, with rows for Case 1 to Case 4 showing each observed announcement and inferred withdrawal.]

Figure 6.6.1: Detected failures in inter-table interval and the time we infer the missing withdrawal occurred.
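The four cases of Figure 6.6.1 can be reproduced with a short placement rule. The sketch below is an interpretation, not the thesis implementation. It assumes the withdrawal belongs to the first detected failure that starts after the last observed announcement of the prefix, and it returns None when no such failure exists (case four). The function name and the numeric times are hypothetical.

```python
from typing import List, Optional, Tuple


def place_withdrawal(
    failures: List[Tuple[float, float]],
    announcements: List[float],
) -> Optional[float]:
    """Return the inferred withdrawal time for a prefix missing from the table.

    failures:      time-ordered (start, end) detected failure intervals
                   within one inter-table interval.
    announcements: times at which the prefix was announced in that interval.
    """
    last_seen = max(announcements, default=float("-inf"))
    for start, end in failures:
        if start > last_seen:
            return end  # withdrawal placed at the conclusion of this failure
    return None  # inconsistent: treat the whole interval as an artifact


failures = [(10, 20), (40, 50), (70, 80)]
case_one = place_withdrawal(failures, [])
case_two = place_withdrawal(failures, [30])
case_three = place_withdrawal(failures, [30, 60])
case_four = place_withdrawal(failures, [90])
```

A return value of None corresponds to declaring the entire inter-table interval a measurement artifact.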

6.6.2 Incomplete Tables

If a data source has an incomplete table, a table at any time can be constructed from updates and a previous table.
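Constructing a table amounts to replaying updates on top of an earlier table. The following sketch assumes a table maps each prefix to a pair of (attributes, originating time), that updates are time-ordered, and that attributes of None denote a withdrawal. These representations and the function name are illustrative assumptions.

```python
from typing import Dict, Iterable, Optional, Tuple

Table = Dict[str, Tuple[str, float]]


def construct_table(
    previous: Table,
    updates: Iterable[Tuple[float, str, Optional[str]]],
    at_time: float,
) -> Table:
    """Replay time-ordered updates on a copy of `previous` up to `at_time`.

    updates: (timestamp, prefix, attributes); attributes is None for a
             withdrawal (assumed encoding).
    """
    table = dict(previous)
    for ts, prefix, attrs in updates:
        if ts > at_time:
            break  # updates are time-ordered, so nothing later applies
        if attrs is None:
            table.pop(prefix, None)
        else:
            table[prefix] = (attrs, ts)  # remember when the route originated
    return table


previous = {"10.0.0.0/8": ("65001", 0.0)}
updates = [
    (10.0, "192.0.2.0/24", "65002 65010"),
    (20.0, "10.0.0.0/8", None),
    (30.0, "198.51.100.0/24", "65003"),
]
constructed = construct_table(previous, updates, at_time=25.0)
```

Storing the originating time alongside each route also makes the oldest-prefix characteristic available for constructed tables.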

6.6.3 Missing Updates

If missing updates are detected, it is possible to estimate the actual routing behavior during that time by adding announcements of routes at the originating time recorded in the table. In addition, any prefixes not in the recorded table can be withdrawn during the detected interval. If updates cannot be added during this time to ensure consistency, in a similar vein to case four in Figure 6.6.1, the entire interval between consecutive tables is declared an interval affected by a measurement artifact.

6.6.4 Update Ordering

If a non-chronological ordering of updates is detected, the order of nearly simultaneous updates can be permuted such that the constructed table is consistent with the recorded table.
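One simple way to realize this, shown here purely as a sketch, is to try each ordering of a small group of nearly simultaneous updates until replaying them yields the recorded table. Exhaustive search is an assumption made for clarity. It is only practical for small groups, and the thesis does not state which search strategy is used.

```python
from itertools import permutations
from typing import Dict, List, Optional, Sequence, Tuple

Update = Tuple[str, Optional[str]]  # (prefix, attributes or None for withdrawal)


def reorder_group(
    table: Dict[str, str],
    group: Sequence[Update],
    recorded: Dict[str, str],
) -> Optional[List[Update]]:
    """Return an ordering of `group` consistent with `recorded`, if one exists."""
    for order in permutations(group):
        candidate = dict(table)
        for prefix, attrs in order:
            if attrs is None:
                candidate.pop(prefix, None)
            else:
                candidate[prefix] = attrs
        if candidate == recorded:
            return list(order)
    return None


# Recorded as withdraw-then-announce, yet the prefix is absent from the table.
group = [("192.0.2.0/24", None), ("192.0.2.0/24", "65002")]
fixed = reorder_group({}, group, recorded={})
```

Here the only consistent ordering places the announcement before the withdrawal.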

6.7 Default Parameter Selection

In this section we outline our default parameter selections. These parameters are summarized in Table 6.7.1.

6.7.1 Sliding Window Length

The sliding window is used to detect session failures/re-establishments. It must be long enough to identify a session re-establishment from the unique prefixes and duplicate announcements, while being short enough such that normal routing behavior can be differentiated from session re-establishment behavior. For a full-feed operational router, we found a session generally re-established in less than ten minutes. Routers with partial-feeds re-establish more quickly as they have fewer prefixes. We use a sliding window of one hour to ensure multiple session failures/re-establishments can be captured.

6.7.2 Re-establishment Phase Thresholds

The unique prefixes threshold must be large enough to classify normal routing behavior correctly, but small enough to detect re-establishment phases. In addition, it is important to detect a session failure as quickly as possible (for real-time applications). By default we choose 50% of the table size as the threshold. It is rare to see greater than 50% of the table announced in one hour under normal conditions, as BGP updates have been shown to affect only a small fraction of prefixes [92]. We use a threshold based on the table size as some BGP feeds may not include all Internet-wide prefixes⁵.

The duplicate announcements threshold can be lower than the unique prefixes threshold as duplicate announcements are less common. We also lowered it to 25% of the previous recorded table size to identify faults causing a session to persistently fail during re-establishment. These default parameters work well for feeds of most sizes. However, they may not work as well when the monitored table has very few prefixes, as the difference between normal routing activity and session re-establishment will be lower. Hence, we have developed an automated technique to tune these parameters, described in Section 6.8.

Description                      Default Value
Sliding Window Length            1 hour
Downtime Threshold/Bin Length    Hold time (180 seconds for RIPE)
Unique Prefix Threshold          50% of table size
Duplicate Prefix Threshold       25% of table size
Suspicious Bin Unique Prefixes   10% of table size or 300 updates
Suspicious Bin Duplicates        10% of table size or 300 updates

Table 6.7.1: Default parameter settings.

6.7.3 Downtime Threshold and Bin Length

By the definition of the hold-time [91], an update or keep-alive message must be received within the hold-time interval for a BGP session to remain alive. Consequently, a bin that has no activity is an indicator that the session is down. If no keep-alives are recorded, as with RouteViews, the downtime threshold would require configuration based on the previous non-suspicious routing activity. When an operational router has a full feed, it is likely the inter-arrival time of updates will be low, and hence a low downtime threshold can be set. We use the hold-time as the default bin length. If no keep-alive or update is received during this time, the BGP specification states the session is down.

⁵ This is common practice for small ASes where default routes are used, removing the need to store all Internet-wide prefixes.

6.7.4 Suspicious Bin Thresholds

An interval is declared suspicious if it is part of a possible session failure/re-establishment or missing update interval. Suspicious bin thresholds are only used to determine the impact of extended artifacts when an artifact has been detected by the sliding window. In this instance, the unique prefixes and duplicate announcement characteristics are considered over a shorter interval. Consequently, they must be set more aggressively than for the detection phase: 10% of the table size for both duplicate announcements and unique prefixes.

There must also be an absolute threshold on the number of updates received. We found this is necessary when several monitoring sessions fail simultaneously (likely due to the failure of a shared physical link or of the monitor) and all re-establish in tandem. The monitor is physically unable to write all updates to disk instantaneously: full-feed BGP neighbors currently have approximately 250,000 prefixes [54] and, for instance, the route-monitor RRC00 at RIPE has 13 of these neighbors. Accordingly, the burst of updates appears spread out in comparison to a single session failure. We use a low absolute threshold together with a proportion of the table to mark an interval as suspicious. We found 300 updates per bin was a good default parameter, although it can be tuned based on limitations of the route-monitor⁶. Consequently, a bin is declared suspicious if it has more than 300 updates, if more than 10% of the prefixes are announced, or if the number of duplicate announcements exceeds 10% of the total number of prefixes. In addition, if no updates or keep-alives are received in the interval, we assert the session is down, and the bin is also suspicious.
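The default rule for declaring a bin suspicious can be written compactly. The sketch below uses the default values from Table 6.7.1. The function and parameter names are hypothetical, and the per-bin counts are assumed to have been computed already.

```python
def bin_is_suspicious(
    updates: int,
    keepalives: int,
    unique_prefixes: int,
    duplicates: int,
    table_size: int,
    fraction: float = 0.10,
    absolute: int = 300,
) -> bool:
    """Apply the default suspicious-bin rule to the counts for one bin."""
    if updates == 0 and keepalives == 0:
        # Nothing was heard for a full hold time: the session is presumed down.
        return True
    return (
        updates > absolute
        or unique_prefixes > fraction * table_size
        or duplicates > fraction * table_size
    )


silent_bin = bin_is_suspicious(0, 0, 0, 0, table_size=250_000)
quiet_bin = bin_is_suspicious(0, 3, 0, 0, table_size=250_000)
burst_bin = bin_is_suspicious(301, 0, 10, 0, table_size=250_000)
small_feed_bin = bin_is_suspicious(150, 0, 120, 0, table_size=1_000)
```

The last call shows why a proportional threshold is kept alongside the absolute one: on a small feed, 120 unique prefixes exceed 10% of the table even though fewer than 300 updates arrived.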

⁶ Note that an absolute threshold is not needed for the sliding window as its duration is much longer to account for long re-establishment phases.

6.8 Automated Parameter Selection

In the previous section we presented default values for the various parameters utilized. Some of these parameters can be set at values directly obtained from the BGP specification; others need to be configured. We found the default parameters worked well in practice, as demonstrated in Section 6.9. However, we can tune these parameters on a per-session basis to improve the accuracy and artifact detection speed of our techniques. This is paramount for any automated system, where monitoring sessions may have different characteristics. In this section we present a machine-learning technique to select the thresholds for the sliding window detection and bin localization of extended measurement artifacts. If keep-alives are not recorded for a session, similar techniques can be used to determine an appropriate downtime threshold.

It is often not practical to individually analyze the behavior of all BGP sessions by hand and select parameters based on the observed behavior of each session. Instead, an automated technique is required. We use Linear Discriminant Analysis (LDA) [50] for this purpose. LDA is a machine learning technique where training data (data-points with known classifications) is used to determine linear boundaries between classes. It minimizes the ‘in-class’ variance while maximizing the ‘between-class’ variance. Any number of classes and variables can be used with LDA. For our purposes, we initially considered using the unique and duplicate variables in combination but then found thresholds set individually were more robust and applicable to our analysis.

6.8.1 Sliding Window Thresholds

The sliding window is primarily used to identify the session failure measurement artifact. Consequently, we would like to separate two modes of operation: normal operation and failure-interval. However, there is no clear boundary between the two. Consequently, we use LDA as a machine learning technique to determine the boundary between the two modes for all peering sessions.

LDA requires a set of training data points to separate the two modes. The major issue with this technique is that we do not have concrete confirmation as to which intervals form part of a session failure and which do not. We could use state information; however, the sliding window thresholds are most beneficial when state information is not available. Thus we do not use state information for this purpose.

Recall our motivation is to conservatively identify the intervals affected by measurement artifacts. Therefore, we would prefer to indicate an interval is affected by a session failure (when it is not) rather than failing to identify an actual session failure. The oldest-prefix characteristic can separate intervals which are free of session failures from those which may (and are likely to) contain session failures. We then must find the (unique, duplicate) data-points which relate to the classified intervals. We find one data-point per interval. This data point is the maximum proportion of unique prefixes and the maximum proportion of duplicate updates in the sliding window over this interval.

This process is best described by an example. Consider Figure 6.8.1. In the shaded box at the top of the figure are the times at which a table snapshot is recorded. Above each table is the timestamp of the oldest-prefix in the respective tables. A one hour sliding window is used across the data with data points recorded and plotted for both the unique prefixes and the duplicate announcements in the sliding window. The two sliding window data sets are also plotted in Figure 6.8.1. Each inter-table interval is classified as either a possible failure interval (red) or a failure-free interval (green) based on the oldest-prefix characteristic. Notice that the second inter-table interval is classified as a possible failure interval even though the oldest-prefix is prior to the table recorded at 06:00. We slightly adapt this characteristic from the one used in Section 6.4.2 due to the sliding window. If the oldest-prefix is within the sliding window length of the previous table, we classify the interval as a possible failure interval as the effect of the failure may fall in this inter-table interval. Otherwise we classify the interval as a failure-free interval.

Next, we take the maximum value of both the duplicate and unique prefixes characteristic in the inter-table interval. This pair forms a single data point and is shown inside the colored boxes indicating classification in Figure 6.8.1. Notice that the maximum values do not have to be simultaneous across unique and duplicate characteristics.

Now, we use the data-points obtained in this process as input to the LDA technique. One approach is to use LDA in the two dimensional space as in Figure 6.8.2. However, recall two types of session failures are possible: policy-changed and policy-unchanged. Consequently, we decided to use LDA independently on each variable as in Figure 6.8.3.

In practice, the number of failure-free data points will be much larger than the number of possible failure data points. Further, the success of our default parameter selections indicates the separation between classes is large. Hence, although the aggressiveness of LDA can be configured using prior probabilities of each class, the large separation between classes resulted in the technique being insensitive to these probabilities. For this analysis, we use the proportion of training data-points in each class as the prior probabilities.

Although we have statistically obtained our thresholds, it is still possible to make an incorrect decision. Other characteristics assist in minimizing the likelihood of this. For example, the oldest-prefix attribute can be used to discount an interval as a session failure, and missing prefixes together with the oldest-prefix can be used to detect a session failure not detected by the sliding window. This is the benefit of using multiple characteristics in contrast to other techniques such as that described by Zhang et al. [129].

[Figure 6.8.1 (timeline plots): rows for Oldest Prefix, Table Snapshot, Sliding Window Unique Prefix Proportion, Sliding Window Duplicate Update Proportion, and Classification with its data-point, against Time.]

Figure 6.8.1: A cartoon illustration of a monitoring BGP session to determine sliding window thresholds. Table snapshot times shown in the shaded box, with the oldest prefix in the table for this session shown above. A 1 hour window is slid across the data and the proportion of unique prefixes and duplicate updates (relative to the last known table size) are plotted. Each inter-table interval is classified as a possible failure interval (red) or failure-free (green). The maximum of each characteristic is used as a single data point for use with LDA.
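The construction of training data-points for the sliding window thresholds can be sketched as follows. This is an illustration under assumptions: times are in seconds, the sliding window readings are supplied as (time, unique proportion, duplicate proportion) triples, and the adapted oldest-prefix rule is expressed as "oldest prefix later than the previous snapshot minus the window length". The function name is hypothetical.

```python
from typing import List, Tuple


def training_points(
    snapshots: List[float],
    oldest: List[float],
    samples: List[Tuple[float, float, float]],
    window: float = 3600.0,
) -> List[Tuple[float, float, str]]:
    """Build one (max_unique, max_duplicate, label) point per inter-table interval.

    snapshots: ascending table snapshot times.
    oldest:    oldest-prefix timestamp of every snapshot after the first.
    samples:   sliding window readings (time, unique_prop, duplicate_prop).
    """
    points = []
    for prev, curr, old in zip(snapshots, snapshots[1:], oldest):
        inside = [(u, d) for t, u, d in samples if prev < t <= curr]
        if not inside:
            continue  # no readings in this interval, so no data-point
        # Adapted rule: a failure shortly before the previous snapshot can
        # still affect windows that fall inside this interval.
        label = "possible failure" if old > prev - window else "failure-free"
        max_unique = max(u for u, _ in inside)
        max_duplicate = max(d for _, d in inside)  # need not be simultaneous
        points.append((max_unique, max_duplicate, label))
    return points


snapshots = [0.0, 21600.0, 43200.0, 64800.0]  # 00:00, 06:00, 12:00, 18:00
oldest = [21300.0, 21300.0, 21300.0]          # 05:55 in every table
samples = [
    (3600.0, 0.5, 0.4),
    (7200.0, 0.2, 0.45),
    (25000.0, 0.4, 0.2),
    (50000.0, 0.1, 0.05),
]
points = training_points(snapshots, oldest, samples)
```

As in Figure 6.8.1, the second interval is labelled a possible failure because the oldest prefix (05:55) lies within one window length of the 06:00 snapshot.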

[Figure 6.8.2 (scatter plot): x-axis Sliding Window Unique Prefix Proportion, y-axis Sliding Window Duplicate Update Proportion, both marked at 0.5 and 1.0; regions labelled Possible Failure and Failure-Free.]

Figure 6.8.2: A cartoon illustration of the multi-variate threshold obtained using LDA on the data points from Figure 6.8.1.
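When LDA is applied to a single variable with two classes, the boundary reduces to a single threshold. The sketch below computes it with the textbook formula for two Gaussian classes sharing a pooled variance, with priors taken as the class proportions in the training data. It is a generic illustration of one-dimensional LDA, not the implementation used in this work, and the sample values are invented.

```python
import math
from statistics import mean
from typing import Sequence


def lda_threshold(normal: Sequence[float], failure: Sequence[float]) -> float:
    """One-dimensional, two-class LDA boundary with pooled variance.

    Assumes the class means differ and at least three points in total.
    When mean(failure) > mean(normal), values above the returned threshold
    are assigned to the failure class.
    """
    n0, n1 = len(normal), len(failure)
    mu0, mu1 = mean(normal), mean(failure)
    # Pooled within-class variance: LDA assumes both classes share it.
    squares = sum((x - mu0) ** 2 for x in normal) + sum((x - mu1) ** 2 for x in failure)
    variance = squares / (n0 + n1 - 2)
    prior0 = n0 / (n0 + n1)
    prior1 = n1 / (n0 + n1)
    # Midpoint of the means, shifted away from the more probable class.
    return (mu0 + mu1) / 2 + variance * math.log(prior0 / prior1) / (mu1 - mu0)


balanced = lda_threshold([0.0, 0.1, 0.2], [0.8, 0.9, 1.0])
skewed = lda_threshold([0.0, 0.1, 0.2, 0.1], [0.8, 1.0])
```

With equal class sizes the threshold sits midway between the class means. With more failure-free points than possible-failure points, the threshold moves slightly towards the failure class, and the shift is small when the classes are well separated.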

Real Data Example

We have claimed that LDA is able to successfully separate failure intervals from normal behavior. We now demonstrate an example obtained from real data. Consider Figure 6.8.4, where we show the data obtained from Onyx Internet’s monitoring session with RRC01 in May 2008. LDA is used independently on unique and duplicate variables to separate the two classes. The unique threshold obtained is 0.51, and the duplicate threshold is 0.59. In this example, the high separation of classes is expected due to the previously shown stability of a majority of prefixes [92].

6.8.2 Suspicious Bin Thresholds

The thresholds used to declare a bin suspicious are only used when an artifact is detected. They are used to localize the interval affected by an extended measurement artifact. We use a similar technique as in the previous section to automate the selection of these parameters. However, there are slight differences. First, we now have discrete data, not continuous. Thus we use all data points in an affected inter-table interval (shown in Figure 6.8.6), not only the maximum. Also note that the definition of the oldest-prefix characteristic reverts to the one described in Section 6.4.2 as we are not using a sliding window to obtain these thresholds.

The data is ‘noisy’ as a large number of points which are in fact normal are classified as suspicious. Recall that suspicious intervals are those with ‘large’ numbers of uniques or duplicates. Training the machine learning technique with such noisy data can result in class separations which are not in line with this definition of suspicious. That is, if we find parameters in tandem, the boundary separating classes could have a positive instead of a negative slope. Hence, larger proportions of uniques or duplicates would result in the data-point being classified as normal (a non-intuitive result). For instance, in Figure 6.8.7 we see a positive boundary slope. In this case, if we get a unique proportion of 1.0 and less than 0.35 duplicates, we are in the suspicious class. However, if we have a greater number of duplicates, then we are in normal operation. This is counter-intuitive and not desirable for our analysis. We would like our boundary to have a negative slope as in the example of Figure 6.8.8⁷. However, this is difficult to ensure using this bi-variate method. As a result, we obtain thresholds independently as in the previous section. We separate the classes as in Figure 6.8.9.

[Figure 6.8.3 (scatter plot): x-axis Sliding Window Unique Prefix Proportion, y-axis Sliding Window Duplicate Update Proportion, both marked at 0.5 and 1.0; regions labelled Possible Failure and Failure-Free.]

Figure 6.8.3: Cartoon illustration of independent thresholds obtained using LDA on the data points from Figure 6.8.1.

Figure 6.8.4: Example of sliding window parameter selection. Data for the Onyx Internet (195.66.224.35) session with RRC01 in May 2008 and the thresholds obtained via LDA. Green crosses are failure-free interval data points and red crosses are possible failure interval data points. Data-points in the green shaded region are declared part of a failure-free interval while the data-points in the red shaded region indicate a possible-failure interval. The unique threshold obtained is 0.51, and the duplicate threshold is 0.59.

Figure 6.8.5: Anomalous data-points. In this example of the Hurricane Electric (195.66.224.21) session with RRC01 in April 2008, a possible-failure data point is amongst many failure-free points, and a failure-free data point is amongst the possible-failure data points. On almost all occasions the separation of classes is obvious; however, this example shows how LDA can deal with the intermixing of data-points from different classes.
Recall that bins are only examined when the artifact has been detected by the sliding window thresholds. Hence, our aggressive parameter selections are warranted. Again we use prior probabilities based on the number of data points in each class of our training data. Although we have found the thresholds for each variable independently, it is still possible for LDA to separate the classes in an undesirable manner, that is, to predict that the normal class is more active than the suspicious class. In this case, we assert that the variable cannot provide any reliable information and remove it from our definition of suspicious. If both parameters are unable to provide any reliable information, we use the default parameters.
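The independent selection and its fallback rules can be sketched as follows. This is an illustration under stated assumptions rather than the thesis code: it uses the standard one-dimensional LDA boundary with a pooled variance and priors equal to the class proportions in the training data, and the names `lda_threshold_1d` and `select_bin_thresholds` as well as the `defaults` argument are hypothetical.

```python
import math
from statistics import mean


def lda_threshold_1d(normal, suspicious):
    """Return the 1-D LDA boundary between the two classes, or None when
    the suspicious class is not more active than the normal class."""
    n0, n1 = len(normal), len(suspicious)
    mu0, mu1 = mean(normal), mean(suspicious)
    if mu1 <= mu0:
        # Undesirable separation: this variable carries no reliable information.
        return None

    # Pooled variance shared by both classes.
    ss = sum((x - mu0) ** 2 for x in normal)
    ss += sum((x - mu1) ** 2 for x in suspicious)
    var = ss / (n0 + n1 - 2)

    # Priors proportional to class sizes, so prior1 / prior0 = n1 / n0.
    log_prior_ratio = math.log(n1 / n0)
    return (mu0 + mu1) / 2 - var * log_prior_ratio / (mu1 - mu0)


def select_bin_thresholds(unique, duplicate, defaults):
    """unique and duplicate are (normal_values, suspicious_values) pairs.

    A threshold of None means the variable is dropped from the definition
    of suspicious. If both variables are unusable, the defaults are returned.
    """
    thresholds = {
        "unique": lda_threshold_1d(*unique),
        "duplicate": lda_threshold_1d(*duplicate),
    }
    if all(value is None for value in thresholds.values()):
        return dict(defaults)
    return thresholds
```

With equal class sizes the boundary is the midpoint of the two class means. When normal points outnumber suspicious ones, the prior term moves the boundary toward the suspicious mean, so fewer bins are flagged.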

7 Note that LDA will only produce one of these two outputs. We are simply showing a desirable and an undesirable output as an example.

[Figure 6.8.6 panels: Table Snapshot, Oldest Prefix, Bin Unique Prefix Proportion, Bin Duplicate Update Proportion, Classification; horizontal axis: Time, divided into bins.]

Figure 6.8.6: Cartoon illustration of monitoring a BGP session to determine bin parameters. Table snapshot times are shown in the shaded box, with the oldest prefix in the table for this session shown above. In each disjoint bin, the proportion of unique prefixes and duplicate updates (relative to the last known table size) is plotted. The value of each characteristic in each bin is used as a data point for use with LDA. Each inter-table interval is classified as a possible failure interval (red) or failure-free (green).

[Plot: Bin Duplicate Update Proportion against Bin Unique Prefix Proportion, both axes marked at 0.5 and 1.0, with regions labeled Suspicious and Normal.]

Figure 6.8.7: Cartoon illustration of LDA producing an undesirable class separation.

Real Data Example

We again return to our example from Onyx Internet. Using our technique we find the thresholds for the bins can be set to 0.1 for the unique prefixes and 0.00017 for the duplicate announcements (see Figure 6.8.10). Note that the rarity of duplicate announcements on this session allows a very low threshold setting. Onyx Internet has a full-feed session with more than 250,000 prefixes. For smaller peers, we found the thresholds must be set higher, as there is less difference between normal and suspicious routing behavior.
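To make these numbers concrete, take 250,000 as an illustrative table size: a unique threshold of 0.1 corresponds to 25,000 unique prefixes in a bin, and a duplicate threshold of 0.00017 to 42.5 duplicate announcements. The sketch below applies the two bin thresholds. It assumes a bin is suspicious when either proportion exceeds its threshold, following the definition of suspicious as 'large' numbers of uniques or duplicates. The function name and argument names are hypothetical.

```python
def bin_is_suspicious(unique_prefixes, duplicate_updates, table_size,
                      unique_threshold, duplicate_threshold):
    """Classify one bin, with both counts taken relative to the last
    known table size (assumed to be greater than zero)."""
    unique_proportion = unique_prefixes / table_size
    duplicate_proportion = duplicate_updates / table_size
    return (unique_proportion > unique_threshold
            or duplicate_proportion > duplicate_threshold)
```

For example, with the thresholds above and a table of 250,000 prefixes, a bin with 43 duplicate announcements is flagged while a bin with 42 is not.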

[Plot: Bin Duplicate Update Proportion against Bin Unique Prefix Proportion, both axes marked at 0.5 and 1.0, with regions labeled Suspicious and Normal.]

Figure 6.8.8: Cartoon illustration showing a desirable class separation.

[Plot: Bin Duplicate Update Proportion against Bin Unique Prefix Proportion, both axes marked at 0.5 and 1.0, with regions labeled Suspicious and Normal.]

Figure 6.8.9: Cartoon illustration using LDA to tune thresholds independently.

Figure 6.8.10: Data for the Onyx Internet (195.66.224.35) session with RRC01 in May 2008 and the thresholds obtained via LDA. Green crosses are failure-free interval data points, and red crosses are possible-failure interval data points. Data points in the green shaded region are declared part of a failure-free interval, while data points in the red shaded region indicate a possible-failure interval. The unique threshold obtained is 0.1, and the duplicate threshold is 0.00017.

6.8.3 Discussion

The sliding window and suspicious bin thresholds are used to detect and localize extended measurement artifacts (primarily session failures). Unfortunately, although it may seem a logical approach, they cannot be used in their current form to indicate the type of session failure (either policy-changed or policy-unchanged). When we use the training data to discover the detection thresholds, we are using the maximum values in an interval. However, when used in practice, the sliding window is a continuous function. When a threshold is reached, we enter the localization phase and reset the counters. Consequently, it is possible that the maximum value needed to classify the session failure is never reached. However, we can use the duplicate threshold in another manner. If the total number of duplicates in the entire failure interval (as determined by the localization technique) exceeds the duplicate threshold, then the session failure is classified as policy-unchanged. Otherwise it is classified as policy-changed. Note that if an actual policy-changed failure and an actual policy-unchanged failure are localized into a single interval, the interval will be classified as policy-unchanged.
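The classification rule for the type of session failure can be written as a short sketch. It assumes the duplicate threshold is a proportion of the last known table size, as the thresholds in the figures are. The function name and parameters are hypothetical, and the value 0.59 in the usage note is simply the sliding window duplicate threshold quoted in Figure 6.8.4, used here for illustration.

```python
def classify_session_failure(duplicates_in_interval, table_size,
                             duplicate_threshold):
    """Label a localized failure interval by comparing its total number
    of duplicate announcements with the duplicate threshold."""
    if duplicates_in_interval / table_size > duplicate_threshold:
        return "policy-unchanged"
    return "policy-changed"
```

With a table of 250,000 prefixes and a threshold of 0.59, an interval containing 200,000 duplicates would be labeled policy-unchanged, and one containing 1,000 duplicates policy-changed. An interval that merges both kinds of failure accumulates the duplicates of the policy-unchanged failure, which is why it ends up labeled policy-unchanged.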

6.9 Results

In this section we use the default parameters obtained in Section 6.7 for ease of replication and consistency of results. We analyzed 260 monitoring BGP sessions of RIPE [93], finding that 5% of consistency-checks did not pass. A summary of the results is shown in Table 6.9.1. Note that in the data we examined, we discovered no partially recorded tables.

The first column of Table 6.9.1 shows the RIPE monitor examined and the second column the month examined. Each monitor has a peering session with numerous BGP neighbors; this number is shown in the third column. The total number of consistency-checks undertaken is in the fourth column. Each peering session had tables recorded nominally every eight hours over the entire month. However, if the session was down at the nominal recording time, no table was recorded for the BGP neighbor. The number of times (and the percentage) the consistency-check failed is shown in the fifth column. On average, the consistency-check failed in 5% of checks.

In Table 6.9.1 we also include statistics for the cause of failed consistency-checks. Only RRC01 in January 2007 experienced missing updates, with 14 consistency-checks failing due to this measurement artifact. This was caused by the monitor failing to record any updates for any BGP neighbors of RRC01 between 17:33:17 and 18:25:02 UTC on January 21, 2007.

The column titled ‘Re-ordered updates’ shows the number of consistency-checks which failed solely as a result of updates being recorded in a non-chronological order. Of the 5% of consistency-checks that failed, 81% can be attributed to non-chronological recording of updates. Although generally fewer than ten prefixes were affected, we found cases of up to 712 prefixes affected by this artifact. We found several instances where updates were applied in a permuted order with timestamps up to 16 seconds apart. However, 76% of these prefixes had updates with the same timestamp that were written to file in the incorrect order. All instances of non-chronological updates we discovered were caused by a withdrawal being written to disk prior to an announcement, but applied to the table after it.
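A scan for this artifact might look like the sketch below. It is illustrative only: the `Update` record, its field names and the two helper functions are assumptions rather than part of any real BGP parsing library, and the second function only reports candidates, since a withdrawal followed by an announcement within the same second can also be genuine routing behavior.

```python
from typing import NamedTuple


class Update(NamedTuple):
    timestamp: int  # seconds, as recorded in the update file
    prefix: str
    kind: str       # "A" for announcement, "W" for withdrawal


def out_of_order_positions(updates):
    """Indices (in file order) of updates whose timestamp is earlier
    than a timestamp already written to the file."""
    latest = None
    positions = []
    for index, update in enumerate(updates):
        if latest is not None and update.timestamp < latest:
            positions.append(index)
        else:
            latest = update.timestamp
    return positions


def same_second_withdraw_then_announce(updates):
    """Prefixes for which a withdrawal is written directly before an
    announcement carrying the same timestamp."""
    last_seen = {}
    suspects = set()
    for update in updates:
        previous = last_seen.get(update.prefix)
        if (previous is not None and previous.kind == "W"
                and update.kind == "A"
                and previous.timestamp == update.timestamp):
            suspects.add(update.prefix)
        last_seen[update.prefix] = update
    return suspects
```

For a flagged prefix, the constructed table can be rebuilt with the two updates swapped. If the consistency-check then passes, the failure can be attributed to the write order rather than to a session failure.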

The ‘Unknown’ category represents the consistency-check failures for which the oldest-prefix characteristic discounted the possibility of a session failure and no period of downtime indicating missing updates was detected. We investigated these cases individually, finding that many were caused by updates occurring several seconds prior to a table dump but not being recorded in the table dump; that is, the non-atomic nature of the table dump extended outside the timestamps of the recorded interval of the table. Other artifacts described in Section 6.3.6 were also found in this category.

The session failure column shows the consistency-check failures attributed to the failure of the monitoring BGP session. Session failures contribute only 9.5% of the consistency-check failures. However, in Table 6.9.2 we see that a significant number of session failures cannot be detected using the consistency-check. We detect 2243 intervals affected by session failures using our sliding window, and 92% of these session failure intervals are classified as policy-unchanged.

State information is included in the data we analyzed and indicates when a monitoring session failed or was established. If state information is inside a detected failure interval, we rate our technique as successful. If state information is outside all detected intervals, our technique has failed. Note that state information is used as validation for this analysis. In the operational version of our techniques, state information forms part of the detection mechanism.

A number of failures were detected by our tool while no state information was included inside the interval. There are a number of possible causes. First, state information may not be recorded, which is feasible as a session failure may be caused by a monitor failure. Second, normal routing activity may cause the entire table to be re-advertised (especially on sessions with small neighbors). Third, an operational router may fail and come back online within the downtime detection threshold. When the router comes back online, the BGP session with the monitor may be established before others. Consequently, there may be a period of time where no (or few) routes are announced on the monitored session. Hence, the session may be ‘alive’ although routing activity is low. State information may consequently indicate a session failure prior to the detected interval. We also detected intervals which had all the characteristics of session failures but no state information.
This may indicate state information is missing, outside of a detected interval (i.e. localization of the failure was inadequate) or parameters used were overly-aggressive on occasion. For the intervals we examined, most failure intervals contained state information. In April, a single highly active BGP neighbor of RRC01 caused 737 of the 827 detected session failure intervals without state information.

Monitor  Month   BGP        Consistency  Consistency     Missing  Re-ordered  Unknown  Session
                 Neighbors  Checks       Check Failures  Updates  Updates     Cause    Failure
RRC01    Jan-07  95         8274         150 (2%)        14       18          67       51
RRC01    Apr-08  91         8102         722 (9%)        0        668         15       39
RRC02    Apr-08  37         3282         112 (3%)        0        107         0        5
RRC02    May-08  37         3441         107 (3%)        0        98          0        9

Table 6.9.1: Summary of consistency-check failures. The total number of table comparisons undertaken, together with the causes for failed consistency-checks.

Monitor  Month   Detected  Policy     Policy   State Information  Detected Failure
                 Session   Unchanged  Changed  Inside Interval    Without State
                 Failures                                         Information
RRC01    Jan-07  725       694        31       852                45
RRC01    Apr-08  1223      1121       102      1158               827
RRC02    Apr-08  154       133        21       252                26
RRC02    May-08  141       121        20       235                22

State Information Outside Interval: 9, 27, 17, 17 (the assignment of these four values to rows is not recoverable from the source layout).

Table 6.9.2: Session failure characteristics. A session failure may not cause a consistency-check failure. We detected significantly more session failures than those in Table 6.9.1. If the duplicate announcement threshold was reached in our detected interval, we classify the session failure as policy-unchanged. We also provide the statistics for when state information (session up / down) is contained within a detected interval. We provide the statistics for the detected intervals which contain no state information.

6.10 Discussion

We have seen throughout this chapter the benefit of state information and keep-alive messages for determining measurement artifacts. In addition, more frequent table dumps would ensure even clearer identification of measurement artifacts, as they would provide a greater capability to cross-validate. If this technique were undertaken automatically during the collection process, a recorded table consistent with a constructed table could be discarded as providing no additional information. This process would increase the accuracy of the data while not increasing data storage requirements. When a measurement artifact is discovered within the measurement infrastructure of a single AS, the correlation between router decisions described in Chapter 4 may be used to predict routing behavior during the measurement artifact.

Cross-checking the data with itself is not a valid method for discovering all possible inaccuracies in the data. One example of this was discovered during an unrelated analysis of RouteViews [115] data. The prefix 173.0.0.0/20 (over which we had administrative control for a separate project) was withdrawn in June 2008. However, this prefix remained in the table of RealConnect (AS16559, 206.223.115.26) at route-views.eqix well over a week later. RealConnect was contacted and asked about this prefix; it was not in their operational routing table. Consequently, there must have been a withdrawal that either was not recorded in the update stream or was not applied to the route-monitor table. To identify this type of artifact, an independent validation data source would be required.

We discovered ad-hoc inaccuracies in the data during our analyses in Chapters 4 and 5. These inaccuracies had a limited effect on our work due to the use of stable prefixes. However, they prompted this chapter’s analysis to fully understand the features in the data. This analysis highlights that it is a necessary step to check that data is indeed representative of the system under observation.
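The automated cross-validation suggested earlier in this section, in which a recorded table that agrees with a constructed table can be discarded, might be sketched as follows. The data layout is an assumption made for illustration: a table is a dictionary from prefix to route attributes, and an update is a (prefix, route) pair with `None` standing for a withdrawal. The function names are hypothetical.

```python
def apply_updates(table, updates):
    """Return a copy of `table` after replaying `updates` in order."""
    constructed = dict(table)
    for prefix, route in updates:
        if route is None:
            constructed.pop(prefix, None)  # withdrawal
        else:
            constructed[prefix] = route    # announcement (or replacement)
    return constructed


def consistency_check(previous_table, updates, recorded_table):
    """Return the set of prefixes on which the constructed table and the
    recorded table disagree. An empty set means the check passes."""
    constructed = apply_updates(previous_table, updates)
    all_prefixes = constructed.keys() | recorded_table.keys()
    return {prefix for prefix in all_prefixes
            if constructed.get(prefix) != recorded_table.get(prefix)}
```

When the returned set is empty, the recorded table adds no information beyond the previous table and the update stream. As the RealConnect example shows, a passing check does not rule out artifacts that affect the table and the update stream in the same way.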
Chapter 7

Conclusion

The Internet is becoming (if it is not already) a critical communication medium. Consequently, ensuring the reliability of the Internet is pivotal to its future success. A key ingredient in ensuring reliability is the capacity of network administrators to manage their own networks. However, the behavior of BGP, the protocol responsible for determining the route that data traverses, can be difficult to predict, and consequently network management is often more art than science. In this thesis, we provide techniques to improve network management by allowing operators to predict the impact of changes to their network prior to implementation.

In Chapter 3 we modeled the interaction between iBGP and the IGP. Our model allows precise identification of where route oscillation could occur within an AS. We also presented a minor modification to the BGP decision process that would prevent oscillation. Similar concepts may be extended to inter-AS relationships to predict the propagation of routes throughout the Internet.

Route oscillation is not the only concern for operators. Network management is inherently the ability to control (or at least determine) the route that traffic traverses within the network. Hence, the network’s reliability can be improved by predicting the impact of network changes prior to implementation in the live network. In Chapter 4, we extended our iBGP model from Chapter 3 to predict the route selected by all routers in an AS. We tested our techniques on a large Tier-2 AS, finding that the method always predicted a valid network solution, and that 99.9999% of router decisions were consistent with available data from route-monitors. A future research direction could use our abstraction of the iBGP topology to find the optimal placement of route-monitors so that the information deducible from the data collected is maximized.

The policies of neighbors can influence the routes selected by routers within an AS. Consequently, when a connection is established between two ASes, legitimate policies are often set out in contractual agreements. Peering relationships between ASes frequently require a canonical policy to be employed by both ASes. However, in Chapter 5 we found that 22% of the peers of the Tier-2 AS under examination were employing non-canonical policies to some extent. We examined the impact on the Tier-2 AS using our techniques from Chapter 4 to compare the router selections under the current non-canonical peering policy to those under a canonical peering policy. Future work may include investigating time-of-day changes in peering policies to identify active traffic engineering by peers during busy periods.

Network management often relies on measurement data obtained from the network itself. However, in Chapter 6 we detected artifacts in BGP data introduced by the measurement infrastructure that are not representative of the network. We classified several types of measurement artifacts and provided techniques to minimize their impact on future analyses.

Our approach throughout this thesis was pragmatic. Our goal was to improve the techniques used by operators to manage their networks without the need for major protocol changes. This goal was accomplished in the thesis by in-depth analyses, reasoning and testing of techniques to predict the behavior of BGP within an AS. We believe the implementation of our techniques by an AS would be a significant improvement on the ‘tweak-and-pray’ network management strategy often used.

In summary, we believe this thesis is not only of academic merit, but our approaches are designed such that they can immediately improve the inter-domain routing process in the Internet.

Acronyms

AfriNIC African Network Information Center

APNIC Asia-Pacific Network Information Centre

ARIN American Registry for Internet Numbers

AS Autonomous System

BGP Border Gateway Protocol

CCDF Complementary Cumulative Distribution Function

CDF Cumulative Distribution Function

CIDR Classless Inter-Domain Routing

eBGP External-BGP

EGP Exterior Gateway Protocol

FIB Forwarding Information Base

IANA Internet Assigned Numbers Authority

iBGP Internal-BGP

IGP Interior Gateway Protocol

IGRP Interior Gateway Routing Protocol

IP Internet Protocol

ISIS Intermediate System-Intermediate System

ISP Internet Service Provider

IXP Internet Exchange Point

LACNIC Latin America and Caribbean Network Information Center

LDA Linear Discriminant Analysis

Loc-RIB local Routing Information Base

MED Multi-Exit-Discriminator

MIMO Multiple Input/Multiple Output

MPLS Multiprotocol Label Switching

MRAI Minimum Route-Advertisement Interval

NANOG North American Network Operators’ Group

NAT Network Address Translation

OSPF Open Shortest Path First

PoP Point-of-Presence

RIB-in pre-policy Routing Information Base

RIB-pp post-policy Routing Information Base

RIB Routing Information Base

RIP Routing Information Protocol

RIPE Réseaux IP Européens

SLA Service Level Agreement

TCP Transmission Control Protocol

Bibliography

[1] https://www.atdn.net/settlement free int.shtml.

[2] http://www.corp.att.com/peering/.

[3] “The Network Simulator ns-2,” http://www.isi.edu/nsnam/ns.

[4] D. Anderson, H. Balakrishnan, N. Feamster, T. Koponen, D. Moon, and S. Shenker, “Accountable Internet Protocol (AIP),” in ACM SIGCOMM, 2008.

[5] D. Ardelean, “libbgpdump,” http://www.ris.ripe.net/source/.

[6] A. Basu, C.-H. L. Ong, A. Rasala, F. B. Shepherd, and G. Wilfong, “Route Oscillations in I-BGP with Route Reflection,” in ACM SIGCOMM, 2002.

[7] T. Bates, R. Chandra, and E. Chen, “BGP Route Reflection - An Alternative to Full Mesh IBGP,” RFC 2796, 2000.

[8] G. Battista, M. Patrignani, and M. Pizzonia, “Computing the Types of the Relationships Between Autonomous Systems,” in IEEE INFOCOM, 2003.

[9] A. Bilgin, J. Ellson, E. Gansner, Y. Hu, Y. Koren, and S. North, “Graphviz,” www.graphviz.com.

[10] O. Bonaventure, S. Uhlig, and B. Quoitin, “The Case for More Versatile BGP Route Reflectors,” 2004, work in progress, draft-bonaventure-bgp-route-reflectors-00.txt.

[11] H. Brauer and C. Jeker, “OpenBGPD,” www.openbgpd.org.

[12] M. Brown, “Pakistan hijacks YouTube,” Renesys Blog, February 24 2008, http://www.renesys.com/blog/2008/02/pakistan-hijacks-youtube-1.shtml.

[13] M. Buob, M. Meulle, and S. Uhlig, “Checking for Optimal Egress Points in iBGP Routing,” in International Workshop on the Design of Reliable Communications Networks, 2007.

[14] M. Buob, S. Uhlig, and M. Meulle, “Designing Optimal iBGP Route-Reflection Topologies,” in IFIP Networking, 2008.

[15] R. Bush, J. Hiebert, O. Maennel, M. Roughan, and S. Uhlig, “Testing the Reachability of (new) Address Space,” in Internet Network Management Workshop, 2007.

[16] M. Caesar and J. Rexford, “BGP Routing Policies in ISP Networks,” IEEE Network Magazine, 2005.

[17] M. Caesar, L. Subramanian, and R. Katz, “Towards Localizing Root Causes of BGP Dynamics,” UCB/CSD-03-1292, Tech. Rep., 2003.

[18] M. Caesar, D. Caldwell, N. Feamster, J. Rexford, A. Shaikh, and J. van der Merwe, “Design and Implementation of a Routing Control Platform,” in Symposium on Networked Systems Design and Implementation, 2005.

[19] M. Caesar, L. Subramanian, and R. H. Katz, “Root cause analysis of Internet routing dynamics,” UCB/CSD-04-1302, Tech. Rep., 2003.

[20] D.-F. Chang, R. Govindan, and J. Heidemann, “The Temporal and Topological Characteristics of BGP Path Changes,” in IEEE International Conference on Network Protocols, 2003.

[21] Cisco Netflow, http://www.cisco.com/warp/public/732/netflow/index.html.

[22] D. Oran, “OSI IS-IS Intra-domain Routing Protocol,” RFC 1142, 1990.

[23] S. Deering and R. Hinden, “Internet Protocol, Version 6 (IPv6) Specification,” RFC 2460, 1998.

[24] E. W. Dijkstra, “A Note on Two Problems in connexion with graphs,” Numerische Mathematik, no. 1, pp. 269–271, 1959.

[25] N. G. Duffield, C. Lund, and M. Thorup, “Learn more, sample less: control of volume and variance in network measurement,” IEEE Transactions on Information Theory, vol. 51, no. 5, pp. 1756–1775, 2005.

[26] K. Egevang and P. Francis, “The IP Network Address Translator (NAT),” RFC 1631, 1994.

[27] N. Feamster and H. Balakrishnan, “Correctness Properties for Internet Routing,” in Forty-third Allerton Conference on Communication, Control, and Computing, 2005.

[28] N. Feamster and H. Balakrishnan, “Detecting BGP Configuration Faults with Static Analysis,” in Symposium on Networked Systems Design and Implementation, 2005.

[29] N. Feamster, H. Balakrishnan, J. Rexford, A. Shaikh, and J. van der Merwe, “The Case for Separating Routing From Routers,” in ACM SIGCOMM Workshop on Future Directions in Network Architecture, 2004.

[30] N. Feamster, Z. M. Mao, and J. Rexford, “BorderGuard: Detecting Cold Potatoes from Peers,” in ACM Internet Measurement Conference, 2004.

[31] N. Feamster and J. Rexford, “Network-Wide Prediction of BGP Routes,” IEEE/ACM Transactions on Networking, vol. 15, no. 2, pp. 253–266, 2007.

[32] A. Feldmann, A. Greenberg, C. Lund, N. Reingold, J. Rexford, and F. True, “Deriving Traffic Demands for Operational IP Networks: Methodology and Experience,” IEEE/ACM Transactions on Networking, June 2001.

[33] A. Feldmann, O. Maennel, M. Mao, A. Berger, and B. Maggs, “Locating Internet Routing Instabilities,” in ACM SIGCOMM, 2004.

[34] A. Flavel, M. Roughan, N. Bean, and O. Maennel, “Modeling BGP Table Fluctuations,” in 20th International Teletraffic Congress, 2007.

[35] V. Fuller, T. Li, J. Yu, and K. Varadhan, “Classless Inter-Domain Routing (CIDR): an Address Assignment and Aggregation Strategy,” RFC 1519, 1993.

[36] L. Gao and J. Rexford, “Stable Internet Routing Without Global Coordination,” IEEE/ACM Transactions on Networking, pp. 681–692, 2001.

[37] L. Gao, “On Inferring Autonomous System Relationships in the Internet,” IEEE/ACM Transactions on Networking, vol. 9, no. 6, pp. 733–745, 2001.

[38] L. Gao, T. Griffin, and J. Rexford, “Inherently Safe Backup Routing with BGP,” in IEEE INFOCOM, 2001.

[39] L. Gao and F. Wang, “The Extent of AS Path Inflation by Routing Policies,” in Global Internet, 2002.

[40] G. Goodell, W. Aiello, T. Griffin, J. Ioannidis, P. McDaniel, and A. Rubin, “Working around BGP: An Incremental Approach to Improving Security and Accuracy in Interdomain Routing,” in ISOC Symposium on Network and Distributed Systems Security, 2003.

[41] R. Govindan, C. Alaettinoğlu, K. Varadhan, and D. Estrin, “Route Servers for Inter-domain Routing,” Computer Networks and ISDN Systems, vol. 30, no. 12, pp. 1157–1174, 1998.

[42] T. Griffin and J. Sobrinho, “Metarouting,” in ACM SIGCOMM, 2005.

[43] T. Griffin and G. Huston, “BGP Wedgies,” RFC 4264, 2005.

[44] T. Griffin, F. B. Shepherd, and G. Wilfong, “Policy Disputes in Path Vector Protocols,” in IEEE International Conference on Network Protocols, 1999.

[45] T. Griffin, F. B. Shepherd, and G. Wilfong, “The Stable Paths Problem and Interdomain Routing,” IEEE/ACM Transactions on Networking, vol. 10, no. 2, pp. 232–243, 2002.

[46] T. Griffin and G. Wilfong, “An Analysis of BGP Convergence Properties,” in ACM SIGCOMM, 1999.

[47] T. Griffin and G. Wilfong, “A Safe Path Vector Protocol,” in Proc. IEEE INFOCOM, 2000.

[48] T. Griffin and G. Wilfong, “Analysis of the MED Oscillation Problem in BGP,” in IEEE International Conference on Network Protocols, 2002.

[49] T. Griffin and G. Wilfong, “On the Correctness of IBGP Configuration,” in ACM SIGCOMM, 2002.

[50] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning — Data Mining, Inference and Prediction. Springer, 2001, ch. 4, pp. 84–95.

[51] C. L. Hendrick, “An Introduction to IGRP,” Cisco Whitepaper, August 1991.

[52] G. Huston, “Interconnection, Peering, and Settlements,” The Internet Protocol Journal, 1999.

[53] G. Huston, “Analyzing the Internet BGP Routing Table,” The Internet Protocol Journal, vol. 4, no. 1, March 2001.

[54] G. Huston, “Potaroo,” http://bgp.potaroo.net.

[55] G. Huston, “The Changing Foundation of the Internet: Confronting IPv4 Address Exhaustion,” The ISP Column, September 2008.

[56] K. Ishiguro, “Quagga Routing Suite,” www.quagga.net.

[57] J. Karlin, S. Forrest, and J. Rexford, “Pretty Good BGP: Improving BGP by Cautiously Adopting Routes,” in IEEE International Conference on Network Protocols, 2006.

[58] S. Kent, C. Lynn, J. Mikkelson, and K. Seo, “Secure Border Gateway Protocol (S-BGP) – Real World Performance and Deployment Issues,” in Network and Distributed System Security Symposium, 2000.

[59] T. Klockar and L. Carr-Motyckova, “Preventing oscillations in route reflector-based I-BGP,” in International Conference on Computer Communications and Networks, 2004.

[60] H. Kong, “The Consistency Verification of Zebra BGP Data Collection,” Agilent Labs, Tech. Rep.

[61] C. Kruegel, D. Mutz, W. Robertson, and F. Valeur, “Topology-Based Detection of Anomalous BGP Messages,” in Symposium on Recent Advances in Intrusion Detection, 2003.

[62] C. Labovitz, A. Ahuja, and A. Bose, “Delayed Internet Routing Convergence,” in ACM SIGCOMM, August 2000, pp. 175–177.

[63] C. Labovitz, R. Wattenhofer, S. Venkatachary, and A. Ahuja, “The Impact of Internet Policy and Topology on Delayed Routing Convergence,” in IEEE INFOCOM, 2001.

[64] C. Labovitz, R. Malan, and F. Jahanian, “Internet Routing Instability,” IEEE/ACM Transactions on Networking, 1998.

[65] C. Labovitz, R. Malan, and F. Jahanian, “Origins of Internet Routing Instability,” in IEEE INFOCOM, 1999.

[66] M. Leber, “Global IPv6 Deployment Progress Report,” http://bgp.he.net/ipv6-progress-report.cgi.

[67] T. Levy, O. Marce, and J.-L. Lafragette, “An Embedded Solution to IBGP Oscillations,” in Workshop on High Performance Switching and Routing, 2005.

[68] T. Li and G. Huston, “BGP Stability Improvements,” Internet Draft, June 2007.

[69] O. Maennel and A. Feldmann, “Realistic BGP Traffic for Test Labs,” in ACM SIGCOMM, 2002.

[70] R. Mahajan, D. Wetherall, and T. Anderson, “Negotiation-Based Routing between Neighboring ISPs,” in Symposium on Networked Systems Design and Implementation, 2005.

[71] G. Malkin, “RIP Version 2,” RFC 2453, 1998.

[72] Z. M. Mao, R. Bush, T. Griffin, and M. Roughan, “BGP Beacons,” in ACM Internet Measurement Conference, 2003.

[73] Z. M. Mao, R. Govindan, G. Varghese, and R. Katz, “Route Flap Damping Exacerbates Internet Routing Convergence,” in ACM SIGCOMM, 2002.

[74] D. Matthews, Y. Chen, H. Yan, and D. Massey, “BGP Monitoring System,” NANOG 40, 2006.

[75] D. McPherson, V. Gill, D. Walton, and A. Retana, “Border Gateway Protocol (BGP) Persistent Route Oscillation Condition,” RFC 3345, 2002.

[76] J. Moy, “OSPF Version 2,” RFC 2328, 1998.

[77] W. Mühlbauer, S. Uhlig, B. Fu, M. Meulle, and O. Maennel, “In Search for an Appropriate Granularity to Model Routing Policies,” in ACM SIGCOMM, 2007.

[78] W. Mühlbauer, O. Maennel, S. Uhlig, A. Feldmann, and M. Roughan, “Building an AS-Topology Model that Captures Route Diversity,” in ACM SIGCOMM, 2006.

[79] R. Musunuri and J. Cobb, “Scalable iBGP through Selective Path Dissemination,” in IASTED International Conference on Parallel and Distributed Computing and Systems, 2003.

[80] R. Musunuri and J. Cobb, “A Complete Solution to Stable iBGP,” in IEEE International Conference on Communication, 2004.

[81] W. Norton, “The Art of Peering: The Peering Playbook,” http://www.blogg.ch/uploads/peering-playbook.pdf.

[82] R. Oliveira, B. Zhang, and L. Zhang, “Observing the Evolution of Internet AS Topology,” in ACM SIGCOMM, 2007.

[83] C. Panigl, J. Schmitz, P. Smith, and C. Vistoli, “Recommendations for Coordinated Route-flap Damping Parameters,” RIPE-229, October 2001.

[84] N. Patrick, T. Scholl, A. Shaikh, and R. Steenbergen, “Peering Dragnet: Examining BGP Routes Received from Peers,” North American Network Operators’ Group (NANOG) presentation, October 2006.

[85] V. Paxson, “Strategies for Sound Internet Measurement,” in ACM Internet Measurement Conference, 2004.

[86] K. Poduri, C. Alaettinoglu, and V. Jacobson, “BST - BGP Scalable Transport,” in NANOG 27, 2003.

[87] B. Premore, “SSF Implementations of BGP-4,” 2002, http://www.ssfnet.org/bgp/.

[88] B. Quoitin and S. Uhlig, “Modeling the Routing of an Autonomous System with CBGP,” IEEE Network Magazine, Special Issue on Interdomain Routing, 2005.

[89] S. Ramachandra, Y. Rekhter, R. Fernando, J. Scudder, and E. Chen, “Graceful Restart Mechanism for BGP,” 2007, Internet Draft.

[90] A. Rawat and M. A. Shayman, “Preventing Persistent Oscillations and Loops in IBGP Configuration with Route Reflection,” Computer Networks, pp. 3642–3665, December 2006.

[91] Y. Rekhter, T. Li, and S. Hares, “A Border Gateway Protocol 4,” RFC 4271, 2006.

[92] J. Rexford, J. Wang, Z. Xiao, and Y. Zhang, “BGP Routing Stability of Popular Destinations,” in ACM Internet Measurement Workshop, 2002.

[93] RIPE NCC, www.ripe.net.

[94] E. Rosen, A. Viswanathan, and R. Callon, “Multiprotocol Label Switching Architecture,” RFC 3031, 2001.

[95] M. Roughan, T. Griffin, Z. M. Mao, A. Greenberg, and B. Freeman, “IP Forwarding Anomalies and Improving their Detection Using Multiple Data Sources,” in ACM SIGCOMM Workshop on Network Troubleshooting, 2004.

[96] J. Scudder, “Internet Draft: BGP Monitoring Protocol,” 2005.

[97] A. Shaikh and A. Greenberg, “OSPF Monitoring: Architecture, Design and Deployment Experience,” in Symposium on Networked Systems Design and Implementation, 2004.

[98] P. Smith and C. Panigl, “RIPE-378: Recommendations on Route-flap Damping,” RIPE Routing Working Group, May 2006.

[99] J. Sobrinho, “An Algebraic Theory of Dynamic Network Routing,” IEEE/ACM Transactions on Networking, vol. 13, no. 5, October 2005.

[100] N. Spring, R. Mahajan, and T. Anderson, “Quantifying the Causes of Path Inflation,” in ACM SIGCOMM, 2003.

[101] W. Stallings, Data and Computer Communications. McMillan Publishing Company, 1994.

[102] J. W. Stewart, BGP4. Inter-Domain Routing in the Internet. Addison Wesley, 1999.

[103] J. A. Storer, An Introduction to Data Structures and Algorithms. Springer, 2002.

[104] L. Subramanian, M. Caesar, C. T. Ee, M. Handley, M. Mao, S. Shenker, and I. Stoica, “HLP: A Next-generation Interdomain Routing Protocol,” in ACM SIGCOMM, 2005.

[105] L. Subramanian, V. Roth, I. Stoica, S. Shenker, and R. Katz, “Listen and Whisper: Security Mechanisms for BGP,” in Networked Systems Design and Implementation, March 2004.

[106] L. Subramanian, S. Agarwal, J. Rexford, and R. H. Katz, “Characterizing the Internet hierarchy from multiple vantage points,” in IEEE INFOCOM, 2002.

[107] H. Tangmunarunkit, R. Govindan, S. Shenker, and D. Estrin, “The Impact of Internet Policy on Internet Paths,” in IEEE INFOCOM, 2001.

[108] R. Teixeira, N. G. Duﬃeld, J. Rexford, and M. Roughan, “Traﬃc Matrix Reloaded: Impact of Routing Changes,” in Passive and Active Measurement Conference, 2005.

[109] R. Teixeira, T. Griffin, M. Resende, and J. Rexford, “TIE breaking: Tunable Interdomain Egress Selection,” IEEE/ACM Transactions on Networking, vol. 15, no. 4, pp. 761–774, 2007.

[110] R. Teixeira, A. Shaikh, T. G. Griffin, and J. Rexford, “Dynamics of Hot-Potato Routing in IP Networks,” in ACM SIGMETRICS, 2004.

[111] R. Teixeira, A. Shaikh, T. G. Griffin, and G. M. Voelker, “Network Sensitivity to Hot-Potato Disruptions,” in ACM SIGCOMM, 2004.

[112] S.-T. Teoh, K.-L. Ma, S. F. Wu, D. Massey, X.-L. Zhao, D. Pei, L. Wang, L. Zhang, and R. Bush, “Visual-based Anomaly Detection for BGP Origin AS Change (OASC) Events,” Self-Managing Distributed Systems, 2003.

[113] H. Tyan, “Design, Realization and Evaluation of a Component-Based Compositional Software Architecture for Network Simulation,” Ph.D. dissertation, Ohio State University, 2002.

[114] S. Uhlig and S. Tandel, “Quantifying the BGP Routes Diversity Inside a Tier-1 Network,” in IFIP Networking, 2006.

[115] University of Oregon RouteViews project, www.routeviews.org.

[116] K. Varadhan, R. Govindan, and D. Estrin, “Persistent Route Oscillations in Inter-Domain Routing,” Computer Networks, 2000.

[117] C. Villamizar, R. Chandra, and R. Govindan, “BGP Route Flap Damping,” RFC 2439, 1998.

[118] M. Vutukuru, P. Valiant, S. Kopparty, and H. Balakrishnan, “How to Construct a Correct and Scalable iBGP Configuration,” in IEEE INFOCOM, Barcelona, Spain, April 2006.

[119] F. Wang and L. Gao, “Inferring and Characterizing Internet Routing Policies,” in ACM Internet Measurement Workshop, 2003.

[120] L. Wang, M. Saranu, J. Gottlieb, and D. Pei, “Understanding BGP Session Failures in a Large ISP,” in IEEE INFOCOM, 2007.

[121] L. Wang, X. Zhao, D. Pei, R. Bush, D. Massey, A. Mankin, S. F. Wu, and L. Zhang, “Observation and Analysis of BGP Behavior under Stress,” in ACM Internet Measurement Workshop, 2002.

[122] Y. Wang, I. Avramopoulos, and J. Rexford, “Design for Configurability: Rethinking Interdomain Routing Policies from the Ground Up,” to appear in IEEE Journal on Selected Areas in Communications, 2009.

[123] D. Wetherall, R. Mahajan, and T. Anderson, “Understanding BGP misconfigurations,” in ACM SIGCOMM, 2002.

[124] R. White, “Securing BGP through Secure Origin BGP,” The Internet Protocol Journal, vol. 6, no. 3, September 2003.

[125] G. Wilfong, “Interdomain Routing,” Lucent Technologies Presentation, February 2006.

[126] S. Woolley, “The Day The Web Went Dead,” Forbes, December 2, 2008.

[127] J. Wu, Z. M. Mao, J. Rexford, and J. Wang, “Finding a needle in a haystack: Pinpointing significant BGP routing changes in an IP network,” in Symposium on Networked Systems Design and Implementation, 2005.

[128] W. Xu and J. Rexford, “MIRO: Multi-path Interdomain ROuting,” in ACM SIGCOMM, 2006.

[129] B. Zhang, V. Kambhampati, M. Lad, D. Massey, and L. Zhang, “Identifying BGP Routing Table Transfers,” in ACM SIGCOMM Workshop on Mining Network Data, 2005.

[130] J. Zhang, J. Rexford, and J. Feigenbaum, “Learning-Based Anomaly Detection in BGP Updates,” in ACM SIGCOMM Workshop on Mining Network Data, 2005.

[131] Y. Zhang, Z. Zhang, Z. M. Mao, C. Hu, and B. Maggs, “On the Impact of Route Monitor Selection,” in ACM Internet Measurement Conference, 2007.

[132] Y. Zhang, M. Roughan, C. Lund, and D. Donoho, “Estimating Point-to-Point and Point-to-Multipoint Traffic Matrices: An Information-Theoretic Approach,” IEEE/ACM Transactions on Networking, vol. 13, no. 5, pp. 947–960, October 2005.

[133] X. Zhao, D. Pei, L. Wang, D. Massey, A. Mankin, S. F. Wu, and L. Zhang, “Detection of Invalid Routing Announcement in the Internet,” in Dependable Systems and Networks, 2002.