Systems for Improving Availability and Performance

Ethan B. Katz-Bassett

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

University of Washington

2012

Program Authorized to Offer Degree: Computer Science and Engineering

University of Washington Graduate School

This is to certify that I have examined this copy of a doctoral dissertation by

Ethan B. Katz-Bassett

and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.

Co-Chairs of the Supervisory Committee:

Thomas E. Anderson

Arvind Krishnamurthy

Reading Committee:

Thomas E. Anderson

Arvind Krishnamurthy

David Wetherall

Date:

In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Proquest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted “the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform.”

Signature

Date

University of Washington

Abstract

Systems for Improving Internet Availability and Performance

Ethan B. Katz-Bassett

Co-Chairs of the Supervisory Committee: Professor Thomas E. Anderson Department of Computer Science and Engineering Associate Professor Arvind Krishnamurthy Department of Computer Science and Engineering

The Internet’s role in our lives continues to grow, but it often fails to provide the availability and performance demanded by our increasing reliance on it. The problems largely stem from the fact that network operators at an Internet service provider (ISP) have little visibility into and even less control over the routing of other ISPs on which they depend to deliver global connectivity. In this dissertation, I design, build, and evaluate practical distributed systems that ISPs can use today to understand availability and performance problems. I develop reverse traceroute,a system to measure reverse paths back to the local host from arbitrary destinations. While tools have long existed to measure the forward direction, the reverse path has been largely opaque, hindering troubleshooting efforts. I show how content providers such as Google could use reverse traceroute to troubleshoot their clients’ performance problems. The rest of the dissertation focuses on long-lasting routing outages. My measurements show that they occur frequently and contribute significantly to overall unavailability. To address these long-term problems, I develop a system, LIFEGUARD, for automatic failure localization and remediation. First, the system builds on reverse traceroute to locate faults, even in the presence of asymmetric paths and failures. Second, I develop a technique that enables edge ISPs to steer traffic to them around failures, without requiring the involvement of the network causing the failure. Deploying LIFEGUARD on the live Internet, I find that it can effectively route traffic around particular ISPs without causing widespread disruption.

TABLE OF CONTENTS

Page

List of Figures ...... iv

List of Tables ...... vi

Chapter 1: Introduction ...... 1 1.1 Approach and Goals ...... 3 1.1.1 Measuring Reverse Paths ...... 4 1.1.2 Characterizing Long-Lasting Outages on the Internet ...... 5 1.1.3 Accurately Locating Failures from the Edge of the Internet ...... 7 1.1.4 Automatic and Effective Route Repair ...... 8 1.1.5 Summary ...... 9 1.2 Thesis and Contributions ...... 9 1.3 Organization ...... 11

Chapter 2: Background ...... 13 2.1 Internet Topology and Terminology ...... 13 2.2 Internet Routing ...... 17 2.2.1 Mechanisms for BGP Route Control ...... 20 2.3 Tools to Understand Routing ...... 21 2.3.1 Tools Other than Traceroute ...... 22 2.3.2 Traceroute ...... 23 2.3.3 Common Uses and Limitations of Traceroute ...... 24 2.4 Performance Problems ...... 27 2.5 Availability Problems ...... 30

Chapter 3: Characterizing Internet Outages ...... 36 3.1 Hubble Design and Architecture ...... 38 3.1.1 Deriving Hubble’s Methodology ...... 38 3.1.2 Overview of Measurement Components ...... 42

i 3.1.3 Identifying Potential Outages ...... 44 3.1.4 Real-time Reachability Analysis ...... 46 3.2 Evaluation ...... 50 3.3 Characteristics of Reachability Problems on the Internet ...... 54 3.3.1 Case Studies ...... 55 3.3.2 Prevalence and Duration ...... 56 3.3.3 Direction Isolation ...... 57 3.4 Quantifying Unreachability from Amazon EC2 ...... 59 3.5 Assessing Policy-Compliant Alternate Paths ...... 61 3.6 Summary ...... 63

Chapter 4: Reverse Traceroute ...... 65 4.1 Basic Reverse Traceroute Techniques ...... 68 4.1.1 Identify Reverse Hops with IP Options ...... 70 4.1.2 Record Route with Spoofed Source Addresses ...... 72 4.1.3 Example of Incrementally Building a Reverse Traceroute ...... 73 4.2 Advanced Reverse Traceroute Techniques ...... 76 4.2.1 Correcting for Variations in IP Options Support ...... 76 4.2.2 Avoiding Probe Filters to Improve Coverage ...... 80 4.2.3 Filtering of Record Route ...... 81 4.2.4 Selective Use of Vantage Points for Scalability ...... 83 4.3 System Implementation ...... 85 4.3.1 Current Deployment ...... 86 4.4 Evaluation ...... 87 4.4.1 Accuracy ...... 88 4.4.2 Coverage ...... 94 4.4.3 Overhead ...... 98 4.5 Debugging Poor Performance with Reverse Traceroute ...... 101 4.6 Other Applications of Reverse Traceroute ...... 102 4.6.1 Topology Discovery ...... 103 4.6.2 Measuring One-Way Link Latency ...... 105 4.7 Summary ...... 107

Chapter 5: Improving Internet Availability with LIFEGUARD ...... 108 5.1 Current Approaches to Locating and Remediating Long-Lasting Failures ...... 110 5.2 Enabling Failure Avoidance ...... 113

ii 5.2.1 LIFEGUARD’s Failure Remediation ...... 115 5.3 Applying Failure Avoidance ...... 119 5.3.1 Locating a Failure ...... 119 5.3.2 Deciding to Start and Stop Poisoning ...... 128 5.4 LIFEGUARD Evaluation ...... 130 5.4.1 Efficacy ...... 131 5.4.2 Disruptiveness ...... 133 5.4.3 Accuracy ...... 137 5.4.4 Scalability ...... 140 5.5 LIFEGUARD Case Study ...... 143 5.6 Discussion ...... 144 5.6.1 Poisoning Anomalies ...... 144 5.6.2 Address Use by the Sentinel Prefix ...... 145 5.7 Summary ...... 146

Chapter 6: Related Work ...... 147 6.1 Internet Measurement ...... 147 6.1.1 Measurement Tools and Techniques ...... 147 6.1.2 Systems that Would Benefit from Reverse Path Information ...... 152 6.2 Availability Problems ...... 154 6.2.1 Characterizing and Locating Failures ...... 155 6.2.2 Avoiding Failures ...... 156

Chapter 7: Conclusions and Future Work ...... 158 7.1 Thesis and Contributions ...... 158 7.2 Key Approaches ...... 159 7.3 Future Work ...... 161 7.3.1 Improving Understanding of Failures ...... 162 7.3.2 Improving Reverse Traceroute ...... 163 7.3.3 Improving LIFEGUARD ...... 166 7.3.4 Improving Support for Measurements ...... 167 7.4 Summary ...... 170

Bibliography ...... 171

iii LIST OF FIGURES

Figure Number Page 2.1 Internet’s autonomous system-level topology...... 14 2.2 Structure of an autonomous system...... 15 2.3 Illustrative example showing routers in a single AS selecting different paths. . . . . 19 3.1 Diagram of the Hubble architecture...... 43 3.2 Instance in which Hubble can use spoofed probes to isolate the direction of failure. 49 3.3 Depiction of Hubble isolating a failure on paths to ISI...... 50 3.4 Fraction of outages discovered by Hubble’s target selection...... 53 3.5 Time from start of outage until Hubble identifies it...... 54 3.6 Duration of outages discovered by Hubble...... 57 3.7 Duration of outages observed from EC2...... 60 4.1 High-level overview of reverse traceroute technique...... 69 4.2 Techniques to measure a reverse hop...... 71 4.3 Incremental construction of a reverse path using diverse information sources. . . . 74 4.4 Using timestamps to measure reverse hop from destination that does not timestamp. 77 4.5 Reverse traceroute uses source spoofing to avoid filters...... 80 4.6 Results on efficacy of spoofing to avoid IP option filters...... 82 4.7 Fraction of PlanetLab sites receiving responses to record route pings, per destination. 83 4.8 Number of vantage points required to be within record route range of prefixes. . . . 85 4.9 Accuracy of reverse traceroute on PlanetLab dataset...... 90 4.10 Accuracy of reverse traceroute on traceroute server dataset...... 91 4.11 Fraction of reverse hops measured by reverse traceroute, rather than assumed. . . . 95 4.12 Hops to nearest reverse traceroute vantage point from routers across the Internet. . 96 4.13 # of vantage points reverse traceroute sends RR pings from to find reverse hop. . . 100 4.14 Example of reverse traceroute techniques aiding in topology discovery...... 103 4.15 Error in estimating latencies for Sprint inter-PoP links using reverse traceroute. . . 106

5.1 Example routes before and after LIFEGUARD uses BGP poisoning to avoid problem. 116 5.2 Illustration of case in which LIFEGUARD can use selective poisoning...... 119

iv 5.3 Path information and isolation measurements for an example outage...... 121 5.4 LIFEGUARD’s measurements to isolate a forward path failure...... 125 5.5 LIFEGUARD’s measurements to isolate a reverse path failure...... 127 5.6 Residual duration of outages in EC2 dataset...... 128 5.7 BGP convergence time and number of updates after poisoned announcements. . . . 134

v LIST OF TABLES

Table Number Page 2.1 Example of traceroute output...... 25 2.2 Traceroutes sent to Outages mailing list in response to an outage...... 33 3.1 Results of Hubble’s source spoofing technique to isolate direction of failure. . . . . 59 4.1 Chance that will timestamp, after not timestamping previously...... 79 4.2 Responsiveness of routers across the Internet to IP timestamp probes...... 98 4.3 Traceroute from University of Central Florida to 66.79.146.129...... 101 4.4 Reverse traceroute from 66.79.146.129 back to the University of Central Florida. . 102

5.1 Key results of LIFEGUARD evaluation...... 130 5.2 Number of additional daily path changes induced by LIFEGUARD at scale...... 142

vi ACKNOWLEDGMENTS

The first person I use in this dissertation does not capture how grateful I am for the role that others played in this work. I feel very fortunate to have had multiple great advisors at the University of Washington. Tom Anderson sharpened this work by combining an ability to see the big picture with incredible atten- tion to the smallest details. He helped me learn which problems to work on and which to ignore. Tom’s model and advice will serve me well as a teacher, advisor, and researcher. Arvind Krishna- murthy’s endless enthusiasm and ideas kept me inspired and led to a number of my favorite pieces of the dissertation. Arvind was always available and willing to talk about or work on anything. Before I started working with Tom and Arvind, David Wetherall gave me confidence and direction by asking me to work with him during my first year of graduate school and advising me through my qualification exam. David’s ideas and feedback on my first Internet measurement project helped put me on the path that ended in this dissertation. I greatly appreciate the flexibility and understanding Tom, Arvind, and David displayed in helping me deal with the ups and downs of graduate school. In addition to my advisors, I benefited from conversations with my other committee members, Eve Riskin and Radha Poovendran. Eve and Richard Ladner also advised me on projects in data compression and accessibility. They and other faculty at UW were always generous with their time and advice. Researchers from other institutions, including Nick Feamster, Yatin Chawathe, Randy Bush, kc claffy, Young Hyun, Daniel Karrenberg, Robert Kisteleki, and Matt Welsh, provided valuable advice and support on pieces of this work. It is hard to imagine ending up here without the influence of a number of great teachers and professors, including Fran O’Donnell, Duane Bailey, and Tom Murtagh. Lee Osterweil advised me while I worked at UMass – my time there led to me applying to graduate school. I really enjoyed working with collaborators on my graduate research. The work is much better for their participation. Harsha Madhyastha, Italo´ Cunha, Dave Choffnes, Vytautas “Valas” Valan-

vii cius, and John P. John contributed greatly to parts of this dissertation, and I learned from all of them. I look forward to working with them more in the future. I received very useful feedback from oth- ers in the networks group, including Colin Dixon, Tomas Isdal, Michael Piatek, Ivan Beschastnikh, Dan Halperin, and Lisa Glendenning. Other students at the University of Washington helped my research by making sure I was not consumed by it. In particular, Dan Leventhal, Peter Hornyack, Ben Ylvisaker, and Brian Ferris joined me on many trips in search of delicious meals. One of my favorite parts of graduate school was working with some of the smartest undergrad- uates around, including Justine Sherry, Colin Scott, Mary Pimenova, and Ashoat Tevosyan. They inspired me, challenged me, amused me, and impressed me. I am extremely grateful for the staff in my department at the University of Washington. Lindsay Michimoto made sure I was on track and knew the answers to any questions I had along the way. Melody Kadenko and Julie Svendsen shielded me from administrative details. The CS Lab Support staff kept the computers I used running and were understanding and helpful when I ran disruptive network experiments. With the program, logistics, and computer systems all running smoothly, I can only blame myself for any less productive periods I experienced along the way. My research relied on Internet testbeds. It would not have been possible without the researchers, support staff, and participating sites who made PlanetLab, BGP-Mux, RON, RIPE’s Routing Infor- mation Service, and RouteViews available. I am especially indebted to the PlanetLab staff, including Larry Peterson, Sapan Bhatia, Andy Bavier, Marc Fiuczynski, Vivek Pai, and KyoungSoo Park. I completed a dissertation about the Internet without ever having a proper Internet connection in my apartment. I appreciate the friendly atmospheres and decent connections at the cafes and bars that hosted me while I worked: the Star Life on the Oasis Cafe, Sureshot, Curio Confections, Blue Dog, Remedy Tea, Big Time Brewery, Smith, Glasslands Gallery, and Manhattan Inn. Amy Ferris Wheeless, Becky Kerman, Chelsea Kurnick, and my parents edited sections of this document. I snuck in any remaining typos on my own, afterwards. I greatly appreciate the funding from the National Science Foundation, Cisco, and Google that made this work possible. I cannot imagine having completed my Ph.D. without the support of friends and family. Thanks.

viii 1

Chapter 1

INTRODUCTION

The role of the Internet in our lives has changed substantially in recent years. With smart phones in our pockets and so much of our lives – emails, videos, and photos – stored in the cloud, we are always online and expect to access the Internet continuously, from anywhere. We expect fast performance from interactive web applications. We demand high bandwidth to provide streaming movies. Mobile devices also enable more of us to come online. In 2011, 2 billion people – more than a quarter of the world’s population – were online [57]. The continuous streams of updates on sites like Twitter and Facebook mean that we go online more often. Each minute, humankind collectively spends 30 years on Facebook alone [36]. Important services such as modern telephony and finance depend on the Internet. We would like to use the Internet to run critical services, such as power grid management and outpatient medical monitoring. Because of these current, emerging, and future uses, content and service providers place a priority on fast, reliable paths between their clients and their services – the Internet needs to provide good performance and high availability.

Is the current Internet up to the task? No. Operational experience has uncovered numerous problems that would make anyone pause before trusting the Internet with the timely delivery of traffic that truly mattered. In fact, we have all experienced the frustration of a site so slow to load that we browse away from it. Google found that 40% of their clients’ connections experience a network latency of at least 400ms [66]. This 400ms is equivalent to the traffic circling the equator twice on its path between the client and Google, despite 75% of networks having a Google server within 1000 miles of them. Such slow routes cost providers clients and money – Amazon found that every additional 100ms of delay in loading a page costs them 1% of their sales [73]. Similarly, a 2

Yahoo! study found that, when a page took an additional 400ms to load, the latency caused 5-9% of users to browse away from web pages, rather than letting them load to completion [123]. Taken together, these results suggest that content and service providers have a strong incentive to have low latency paths to their clients.

Even worse than slow performance, traffic may disappear into black holes and fail to reach a destination. Networks continue to try to send traffic along the failing path, even though alternate paths exist. These problems can occur, for example, when a router fails to detect that its memory has been corrupted. A study found that 10% of routes were available less than 95% of the time, more than half were available less than 99.9% of the time, and more than 65% of routes were available less than 99.99% of the time [70]. Further, many of the route failures lasted for extended periods, with 40% taking at least 30 minutes to repair. Surveying a range of datasets collected between 1994 and 2000, researchers found that average daily unavailability was likely between 0.5% and 2.4%, with 5% of faults lasting more than 2.75 hours and some lasting over one day [32]. My own results, presented in Sections 3.4 and 3.3.2, confirm that outages lasting hours or even days still occur quite frequently.

Problem Statement: Google and other service providers want to react quickly and effec- tively when performance and availability problems arise, but they are stuck with outdated tools and protocols stretched to their limits. The paths between their clients and their services traverse other ISPs. The configurations and policies of multiple devices, ISPs, and protocols interact in complicated ways and are susceptible to human error. Existing tools provide little visibility into these other ISPs, leaving operators with limited understanding of the routes on which they depend. For example, communication on the Internet is generally two-way, and paths are frequently asym- metric [97,51], with the path from a service to its client differing from the path from the client back to the service. Yet, available tools do not give a service provider a way to measure the path back from the client, and so operators have to attempt to troubleshoot problems while seeing only half of the round-trip path. Even when network operators can locate the cause of a problem, the Internet’s protocols give them limited ability to influence other ISPs or even to convey information across ISP boundaries. For example, the protocols do not give a way to explicitly instruct other ISPs to avoid an ISP that fails to forward traffic; instead, operators often resort to contacting people at other ISPs via 3

email or other channels, slowing fixes. These and similar issues force operators at major providers to spend much of their time in triage of routing problems, and the problems cause high operational costs, damaged reputations, and frustrated customers.

One possible approach to improving Internet availability and performance would be to redesign the Internet protocols. Such an approach has appealing features. Internet protocol limitations (such as the lack of visibility) and interactions between protocols contribute to unavailability and sub- optimal performance. Further, the Internet has evolved substantially from the setting in which it was designed, with many more and different devices and new types of traffic. So, a fresh design tailored to current and emerging demands would likely better address many needs. Recent research provides promising approaches to improve Internet availability by modifying Internet protocols [68, 133, 71, 60]. Such work is valuable and provides fundamental insight into underlying protocol limitations, as well as good long-term solutions. However, this type of work faces a difficult path to improve Internet availability, because of the challenges in getting the modifications adopted. Our dependence on the Internet argues for improvements in its performance and availability, yet makes it difficult to move on from the existing protocols; all existing ISPs, devices, and services use and depend on them, and the transition costs can be significant.

1.1 Approach and Goals

In this dissertation, I take a different, complementary approach towards the goal of improving In- ternet performance and availability. I present work that I have done to design, build, and evaluate protocol-compliant techniques that work on today’s Internet to provide richer visibility, to locate problems, and to automatically repair broken Internet routes, using only deployed devices and avail- able hosts without modification. Because of this property, individual providers seeking to improve their availability and performance can immediately and unilaterally deploy and use my techniques. While my approaches do not apply to every Internet availability or performance problem, the tech- niques in this thesis represent a large step forward. I now describe the challenges I solve and the types of problems to which they apply. These challenges arise because of the Internet’s current de- sign and implementation and because of practical considerations in building a real system that can be widely used. 4

1.1.1 Measuring Reverse Paths

Internet traffic often experiences poor performance in the form of delays much higher than what is possible given the Internet’s topology, with the inflated latency often stemming from geographically circuitous routes [4, 118, 41, 106, 66]. Research into the causes of this circuitousness found that the routes taken across an individual ISP tend to be relatively direct, but that inter-ISP paths are often inflated. Further, indirectness is not caused by intentional policy decisions, but rather by the lack of good tools to allow ISPs to find better routes [118].

More recent work from Google also found that the limited view of routing afforded by available measurement tools was the central limitation when trying to understand poorly performing paths between Google and its clients [66]. The paper presents an approach that Google uses to identify and prioritize poorly performing routes. The approach allows Google engineers to explain the un- derlying causes of some inflated latencies and propose possible fixes, such as changing how Google announces its address space or contacting other ISPs to update out-of-date router configurations. In essence, the methodology involves measuring the path from a Google data center to a client network and comparing it with possible alternate paths available given the Internet’s topology. This method- ology allowed Google, within a four month period in 2008, to nearly halve the number of South American client networks experiencing extremely inflated latencies. However, the authors acknowl- edged that their system was unable to explain the poor performance in many remaining cases. In particular, they cited their inability to measure the reverse path back from clients to Google as the principal limitation. Since the client incurs the combined latency of the path from Google to the client and the (likely asymmetric) path from the client back to Google, this limitation meant they were attempting to troubleshoot while only being able to measure half the route.

Because the current lack of tools that provide reverse path information hinders our ability to troubleshoot performance problems, a central goal of my dissertation is the design of a technique capable of measuring the path back from an arbitrary destination to a controlled host. To fit with my objective of approaches that providers can unilaterally deploy today, my goal is to develop tech- niques that work without participation of the destination, using only methods supported by current 5

routers. Such a capability would allow providers such as Google to troubleshoot and improve geo- graphically circuitous routing problems [66]. The ability to measure reverse paths is also essential for the parts of this thesis that deal with availability problems and could likely provide benefit in troubleshooting other performance problems, such as lossy or low capacity paths.

1.1.2 Characterizing Long-Lasting Outages on the Internet

Similar to how routes are often more circuitous than required by the Internet’s topology, previous research found that, when a destination is unavailable from a particular Internet host, it is often reachable from other locations, suggesting the presence of paths around the problem [4, 47]. I focus on disruptions to connectivity due to routing problems where a working policy-compliant path exists, but ISPs instead route along a different path that fails to deliver packets. In theory this should never happen – if working paths exist, the Internet protocols are designed to find them, even in the face of failures. In practice, such outages are common. In Section 3.4, I show that even well-provisioned cloud data centers experience frequent routing problems.

It is well-known that short-term disruptions occur during failover, recovery, and routing protocol convergence [69,53,130]; I focus instead on outages that persist over longer timescales that are less likely to be convergence-related, as these events are less understood. Existing research provides promising approaches to dealing with the transient unavailability that occurs during protocol con- vergence [68, 71, 60, 69]. Long-lasting problems instead currently resolve over human timescales – that is, they likely currently require human intervention. I present data showing that they contribute significantly to end-to-end unavailability, suggesting that we need to address these problems in order to substantially improve availability.

The Internet community would benefit from a characterization of these long lasting outages, in order to guide the design of approaches to addressing the problems. To fully capture the nature of the problems, this characterization should study outages across the Internet. To provide guidance useful for improving availability, it would help to understand various properties of the problems, 6

including: how common outages are; how many paths experience outages; whether outages are partial or complete; and whether alternate routes exist that could resolve the problems.

Previous systems addressed some aspects of this goal in different ways. A number of systems monitored reachability status in real-time, but within contexts that are narrower than the whole Internet, such as a testbed [97,4,37,47], an autonomous system [132,109], or a particular distributed system’s clients [136]. Other systems, such as iPlane [78] and Ark [5], have broad and continuous Internet coverage but, being designed for other purposes, monitor at too infrequent a rate to provide real-time fault diagnosis. Some techniques detects certain reachability issues in real-time at the Internet scale by passively monitoring Internet route updates via (BGP) feeds [38,132,22,70]. However, relying on BGP feeds alone is insufficient because the existence of a route does not imply reachability [37]; BGP acts as a control plane to establish routes for the data plane on which Internet traffic flows, and connectivity problems that do not present themselves as events on the monitored control plane will evade such systems.

The goal, then, is a system that monitors data plane outages across the entire Internet. A practi- cal consideration makes this goal tough. I would like the system to monitor outages affecting ISPs across the Internet, even without control over end-hosts in all of these ISPs. The Internet includes hundreds of thousands of networks, but readily available vantage points exist in only a few hun- dred. While I could pursue recruiting end-hosts in networks across the Internet, such an approach generally requires either becoming an integrated component of widely-used operating systems or bundling with popular software. Both these avenues could take years to come to fruition, if they were adopted at all. Such a delay in developing the ability to characterize failures would slow my efforts to build systems to address failures, compromising my goal of developing techniques that work today. Therefore, my goal is to monitor and characterize failures across the Internet, using only available vantage points. This restriction both limits where I can make measurements from and caps the rate at which I can introduce traffic into the Internet without the measurement traffic overburdening the vantage points or causing problems of its own. 7

1.1.3 Accurately Locating Failures from the Edge of the Internet

Motivated by the characteristics of outages I report in Chapter 3, I propose a two part approach to im- proving Internet availability. First, I develop techniques capable of isolating the ISP or router caus- ing a long-lasting partial outage. My results show that outages with these characteristics contribute substantially to unavailability, and so addressing them could have a significant impact. Because I target long-lasting problems, my techniques need not work instantaneously to have an impact, and I can take time to issue measurements. Because I focus on partial outages, I can take advantage of vantage points with working paths to issue measurements to better understand the failed paths. Second, I develop an approach that, given the location of a failure, will automatically reroute traffic around that ISP to restore connectivity. My results show that alternate policy-compliant working paths frequently exist during long-lasting partial outages, revealing the promise of this approach. Because I propose to automatically reroute traffic around the failure, my techniques to localize the failure must be accurate. My rerouting approach does not require the failure to be repaired to restore connectivity and, in fact, does not require the cooperation of operators at the ISP causing the failure, and so service providers can use it unilaterally and can deploy it automatically, speeding recovery. I now address these goals in more detail.

A number of issues complicate my goal of accurately identifying the entity (ISP, router) causing an outage. First, given my previous goal for my systems to work given only the limited number of available vantage points, I will need to locate failures in ISPs within which I do not have direct access to any computers. Techniques exist to pinpoint failures within a single ISP [109, 64, 63], but they require access to privileged measurements from within the ISP. Most Internet routes traverse multiple ISPs, leaving providers of content and services dependent on transit ISPs to maintain avail- ability. The Internet protocols do not directly expose much information about an ISP to other ISPs. Further, partly due to business concerns, ISPs do not generally make detailed real-time operational data available to other ISPs. Network operators often attempt to overcome these hurdles by posting to mailing lists dedicated to outages [94], asking others to assess reachability from their locations. This approach is not very fast or scalable. Since I want my systems to be deployable today, I need to develop techniques that accurately locate problems, despite the limited information exposed by existing protocols and ISPs. 8

Second, any measurement traffic sent by a system to try to locate a failure can be affected by that failure, potentially complicating accurate location. As a simple example, to locate a routing failure between a source and destination, it would be helpful to know the intended route. However, standard tools to measure that route require sending traffic over the route, which will not work if the route is broken. I have to develop approaches that work even in the face of routing failures.

Third, most communication on the Internet is two-way, and most routes are asymmetric, with the path from the source to the destination differing from the return path. My results show that this asymmetry means that the two directions often fail independently. However, existing tools do not provide visibility into the reverse path. This lack of visibility hinders network operators when troubleshooting [122]. Existing systems either assume symmetric paths [140, 25] or only monitor paths between nodes in a testbed [4, 33], allowing for unidirectional measurements both ways but limiting their coverage. Together, these facts imply that I need to develop techniques to accurately locate outages along either the forward or reverse path to arbitrary destinations, even given asymmetric routes and failures.

1.1.4 Automatic and Effective Route Repair

By addressing the above goal of accurately locating failures, I could provide an operator with valu- able information to help address outages affecting the operator’s ISP. With this data, operators could likely repair problems faster than they can today. However, remediation efforts would still happen over human timescales, slowing recovery and suggesting the benefit of an automated response. Even more problematic, however, is the fact that the failure could be in some ISP outside the operator’s control. Under the existing Internet protocols, the operator has very limited ability to control how other ISPs route. Therefore, the operator would have to engage the cooperation of some person at the other ISP, which requires that the operator knows who to contact and that the person is “clueful.” Such an approach is unlikely to produce a quick resolution.

A technique that used accurate failure location information to automatically repair failing routes could drastically improve availability. To develop a solution that works today, I will need to find 9

protocol-compliant ways to repair the route without requiring the intervention of anyone at the ISP containing the problem. Other research proposed automated repair, but in ways that require protocol modifications that prevent easy adoption [68, 133, 71]. Because I propose rerouting traffic in response to a failure, I need to be sure to do so in a way that does not disrupt working routes. Finally, since traffic will avoid the location of the failure after the reroute, I will need a separate means to test when the failure is resolved.

1.1.5 Summary

In summary, the above set of challenges translates into my goals for addressing Internet performance and availability problems: a technique to measure reverse paths back from arbitrary destinations; an Internet-scale characterization of long-lasting outages; an accurate technique for locating failures, given available vantage points and path asymmetry; and an effective way to automatically repair failed routes in a protocol-compliant fashion, without control of the ISP causing the failure. Achiev- ing the first goal solves a fundamental problem hindering current efforts to improve performance, and the final three goals complement each other to provide a strategy for addressing unavailability.

1.2 Thesis and Contributions

I address these goals in support of the following thesis: It is possible to build effective systems that can help service providers improve Internet availability and performance and that are deployable on today’s Internet.

The work presented in this dissertation makes the following contributions:

The design and implementation of reverse traceroute. I develop reverse traceroute, a distributed system capable of measuring the reverse paths back from arbitrary destinations. Traceroute is the most widely used Internet diagnostic tool today. However, it has a fundamental limitation that restricts its usefulness: it does not provide reverse path information. I address this longstanding 10

limitation by building a reverse traceroute system that works on today’s Internet using only available vantage points and existing protocols. My system uses distributed vantage points and a variety of novel measurement techniques to incrementally piece together the path from the destination back to the source.

I demonstrate that the information provided by reverse traceroute can help debug circuitous paths. I also show that it can measure backbone link latencies and can discover hard to measure portions of the Internet’s topology. For these applications, I demonstrate that results improve sub- stantially when I supplement the traditional use of traceroute by using reverse traceroute to measure reverse paths.

Measurement studies characterizing long-lasting Internet outages. I conduct three studies and demonstrate that: (1) both partial and complete outages are common, with some problems lasting for hours or even days; (2) even well-provisioned cloud data centers experience frequent routing problems; and (3) working policy-compliant routes frequently exist around these types of failures. Understanding these properties helps motivate the need for and the design of my techniques for troubleshooting and repairing routing failures.

First, I develop Hubble, a distributed system capable of monitoring long-lasting routing fail- ures in real-time, on an Internet scale. Hubble operates continuously to find Internet reachability problems in which routes exist to a destination but packets are unable to reach the destination. Con- ducting a three week measurement study using Hubble, I discover over 30,000 outages affecting over 10,000 destination networks. I find that most failures are unidirectional, with the route fail- ing in one direction between two hosts, even though the other direction between them continues to work. I also find that most outages are partial, with some vantage points able to reach a destination that is unreachable from others.

Second, I conducted a measurement study using Amazon EC2, a major cloud provider [3]. EC2 presumably has the resources, business incentive, and best practices available for combating Internet outages. I show that even EC2 data centers experience many Internet connectivity problems and that 11

long-lasting routing problems contribute much of the measured unavailability. To the extent of my knowledge, this study is the first public one of its kind. That Internet outages affect even well- connected cloud services argues for a fresh approach to addressing these problems.

Third, I also conducted a complementary study, this time monitoring paths between hosts at academic institutions across the Internet [101]. I show that, during many outages, we can observe what appear to be working, valid alternate paths between the hosts experiencing the failure. This result suggests the potential for systems to locate a failure and encourage traffic to reroute around it.

A system for: (1) isolating the location of routing failures and (2) automatically rerouting paths around the failures. I develop a system LIFEGUARD that has two major components. First, LIFE- GUARD uses a set of active measurement techniques that build on Hubble and reverse traceroute in order to locate failures much more accurately than was previously possible. It isolates wide-area Internet faults at a router-level granularity, yet only requires control over one of the end-points of the failing path. The system first determines whether the outage is on the forward and/or the reverse path. Next, LIFEGUARD uses a combination of historical and on-demand measurements to locate the failure. The historical information allows reasoning about what changed to cause the failure and helps work around the fact that traditional measurement tools do not work during failures.

Second, as part of LIFEGUARD, I develop a protocol-compliant technique to allow data centers and well-provisioned edge ISPs to repair persistent routing problems on routes to them, even if they do not control the ISP causing the outage. My approach repurposes a feature of BGP in order to essentially hint to the culpable ISP that it should not make its route towards the data center available to neighboring ISPs. With the path through the culpable ISP unavailable, other ISPs will have to explore other routes that avoid it. I propose a number of techniques to make this approach practical. I also suggest how more explicit support for these hints could be added into BGP.

1.3 Organization

In Chapter 2, I provide background on the Internet, on tools operators currently use to troubleshoot performance and availability problems, and on why these tools do not suffice to address these prob- 12

lems. In Chapter 3, I present results of measurement studies characterizing failures across the In- ternet. Chapter 4 presents my reverse traceroute system and shows how it can be used to diagnose a performance problem. Chapter 5 describes LIFEGUARD, my approach to locating and avoiding failures. In Chapter 6, I provide an overview of related work. Finally, in Chapter 7, I summarize my work and discuss future directions. 13

Chapter 2

BACKGROUND

In this chapter, I discuss Internet routing problems that cause poor performance (2.4) and un- availability (2.5). Those sections describe what is known about these problems, their frequency, their impact, their causes, and the approaches ISPs can use to address them. The sections also de- scribe how limitations of current approaches stifle ISPs’ attempts to deliver good performance and high availability. Essentially, the performance and availability experienced by Internet traffic depend on the route the traffic follows. The problems of interest in this dissertation are routing problems: Internet performance suffers when traffic follows a much longer (and hence slower) route than is necessary, and Internet availability suffers when traffic is sent along a path that fails to deliver it to the destination. Operators struggle to maintain good performance and availability because they have limited visibility into and even less control over the routing of other ISPs on which they depend. My dissertation is devoted to performing studies to fill gaps in our knowledge about these problems; de- veloping tools to give network operators visibility they lack; and building systems to automatically address some of the problems. In order to understand the problems and my solutions, it is necessary to understand Internet routing and tools to understand and improve it. Therefore, before presenting details on the problems and my solutions, I provide background on the Internet’s topology (2.1), its routing (2.2), and tools to understand routing (2.3), focusing on the widely used traceroute tool and its limitations.

2.1 Internet Topology and Terminology

The Internet consists of end-hosts and routers (collectively, hosts or nodes), interconnected by links. The essential purpose of the Internet is to enable communication between end-hosts. Barring when 14

Edge4

Inter-AS links

Core2

Edge1 Core1

Edge3 Edge2

Access links

Figure 2.1: The Internet consists of a set of inter-connected autonomous systems (ASes).

policy filters certain communication to a particular end-host, the Internet should provide connectiviy between any pair of hosts.

Organizations known as autonomous systems (ASes) provide this global connectivity. An AS is a collection of routers and networks that presents a common routing view to the rest of the Internet. Only large organizations are ASes. Smaller organizations contract with one or more ASes to provide their Internet service, without becoming ASes themselves. Each AS generally has independent and complete control of its own domain, but each AS has only a limited view inside neighboring ASes. An implication is that we can often treat each AS as a single entity, as in Figure 2.1’s high-level depiction of the Internet’s topology. End-hosts connect to ASes at the Internet’s edge, in order to have access to the rest of the Internet. The ASes at the edge are known as edge or stub ASes. The University of Washington and Comcast are example edge ASes. ASes that connect ASes to each other are called core or transit ASes. Level 3 and Sprint are examples of large transit ASes. Overall, the Internet comprises tens of thousands of ASes. For my purposes, an AS is equivalent to an Internet service provider (ISP), and I will also often use the term “network” interchangeably. I refer to ASes such as Google that offer services that clients access via the Internet as service providers or content providers. ASes inter-connect as part of business agreements. ASes without global reach generally pay provider ASes to connect them to the rest of the Internet (a customer- 15

Points of Presence (PoPs)

Routers

Figure 2.2: An AS consists of routers connected by links. An AS typically has routers in mulitple locations, each referred to as a Point of Presence (PoP) of the AS.

provider relationship). Many pairs of ASes also have (or peer-to-peer) relationships, in which they agree to freely exchange traffic.1 The term “peering” is overloaded, as it refers both to a particular business relationship and to the connection points between adjacent ASes, regardless of relationship. The use should be clear from context. An AS with multiple providers is said to be multi-homed, whereas one with a single provider is single-homed.

The Internet Protocol (IP) is a data packet format for addressing hosts on the Internet. It facil- itates communication across network boundaries. Each publicly available host on the Internet has one or more IP addresses, which allow other hosts to address packets to it. Unless otherwise spec- ified, in this dissertation I mean IPv4 when I talk about IP. IPv4 is currently the dominant version of IP, although IPv6 is increasing in importance. An IPv4 address is a 32-bit identifier. The Do- main Name System (DNS) provides a mapping from host names to IP addresses, although some IP addresses do not have host names. End-hosts use DNS to discover the IP addresses of services and websites; this process is known as DNS resolution. Major content providers like Google replicate

1Paid peering agreements also exist, as do so-called “sibling” relationships. For the purposes of this dissertation, the former are equivalent to unpaid peering connections, and sibling ASes can be thought of as a single entity. 16

their services and content at data centers distributed around the world. They then use DNS to assign clients to data centers, by resolving the hostname of the service to an IP address at the desired data center. Hosts can use IP to send traffic of various other protocols. The Internet Control Message Protocol (ICMP) is one such protocol. It defines a number of control messages, many of which relay error or diagnostic messages. The Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) are protocols commonly used to deliver application traffic.

Figure 2.2 presents the structure of a typical AS. The AS is made up of routers, with links connecting the routers. An AS typically has routers at different geographic locations, and a Point of Presence (PoP) is a location in which an AS has routers. Each router has multiple interfaces, and each link connects a pair of router interfaces on different routers. A router’s role is to forward a packet it receives on one of its interfaces onto the link connected to another of its interfaces, in order to send the packet to the router on the other end of the link.

We refer to the multiple interfaces of a particular router as aliases of each other or of the router. Existing work provides techniques and datasets that identify IP aliases of routers, some of which I use in this dissertation to identify cases in which two measurements revealed different addresses belonging to the same router. The process of mapping from IP addresses to routers is known as alias resolution. Most techniques for alias resolution fall into one of two categories. Some send probes to a single destination in ways that reveals multiple IP addresses for the target router [45,115]. Others send probes to sets of IP addresses to “fingerprint” each address, then classify two addresses as aliases if they have the same fingerprint [114, 11, 62]. Some systems combine multiple techniques to improve coverage and accuracy [78].

The Internet’s rich connectivity means that any two end-hosts almost always have multiple paths between them. There are often many AS-level paths between them. Two neighboring ASes can con- nect with each other at multiple PoPs, creating multiple PoP-level paths that represent the same AS- level path. Within an individual AS, any two PoPs typically have several paths between them [125]. 17

2.2 Internet Routing

Routing determines which path traffic will follow from one host to another. With a few caveats I discuss below, a router generally determines the link on which to forward traffic based only on the destination, sending the traffic on to the next hop in the established route. Internet routers run two types of routing protocols to establish these next hops for each destination, inter-AS and intra-AS. Inter-AS routing is also referred to as wide-area routing, in contrast to the local routing within a single AS. The communication exchanged between routers as part of routing protocols is referred to as the control plane, in constrast to the data plane, which forwards end-host traffic via the routes established on the control plane

The Border Gateway Protocol (BGP) is the Internet’s inter-domain routing protocol, used by ASes to establish routes across multiple ASes. In BGP, routes are at the AS level, for two reasons. First, since ASes are autonomous domains, often with competing business interests, the protocol allows them to avoid revealing much about their internal structure or routing. Second, this AS- level granularity improves scaling by allowing ASes to make routing decisions in isolation from the internal routing decisions of other ASes.

At a high level, BGP works by having each AS announce the route it is using for each destination to its neighbors, and every AS is free to choose amongst all of the routes it learns for a particular destination. BGP employs policy-based routing, meaning that an AS is free to apply whatever policy it prefers to select from the routes it learns from its neighbors. The selected route establishes which neighbor serves as the next hop AS towards that particular destination. These policy-based decisions are in contrast to, for example, decisions based exclusively on performance metrics, such as shortest path. ASes generally base their policies on business concerns. In particular, they tend to prefer paths through their customers (for which they can charge), then paths through their peers (which are free), and lastly paths through their providers (for which they pay) [42]. ASes often choose between equally preferred paths by using AS path length as a tiebreaker. Once an AS selects a path to a destination, it prepends itself to the AS path it learned from the chosen neighbor, then announces this modified AS path to some, none, or all of its neighbors. It selects which neighbors based on its 18

own export policy. Typically, an AS also bases these policies on business concerns. In particular, the “valley-free” policy says that an AS will announce routes from its customers to all neighbors, but will only announce routes from peers or providers to its customers [42]. This standard policy ensures that an AS will carry traffic unless it is to/from it or one of its customers (that pay it for the traffic). Note that an AS only ever exports its most preferred route for a destination, the one it is using. It does not announce less preferred routes to its neighbors. Routes propagate in the opposite direction from the traffic they carry. The route that will carry traffic from S to D starts with D announcing itself to its neighbors.

Because an individual AS has redundant routes across it and adjacent ASes often connect at multiple peering point, BGP’s AS paths do not completely determine how routers will forward packets from a source to a destination. The route must also be specified at the router level. To achieve this, each AS runs an Interior Gateway Protocol (IGP) to determine the route from each of its router to the next hop AS for every destination. These routing protocols set up each router’s forwarding tables which contain, for every destination, the address of the next hop router. IGPs often compute these forwarding tables to minimize the latency incurred by packets traversing the AS and/or to balance the expected traffic across links in the AS’s network. OSPF and IS-IS are two common IGPs (their details are not important for this dissertation).

In my description of routing thus far, I used three simplifications that I now explain. First, in the simplified description above, each AS chooses a single path to a particular destination, but, in reality, an AS can choose different AS paths from different parts of its network. Often, an AS connects to different sets of ASes at different edge routers. In some ASes, each router in the AS chooses a path to the closest edge router which learned a path to the destination from a neighboring AS. Figure 2.3 provides an example. Under a policy known as hot-potato or early-exit routing, each router in an AS that peers with the next-hop AS in multiple locations will choose to route to the nearest location.

Second, I presented routes as being per-destination. In reality, Internet routing groups IP ad- dresses into address blocks called IP prefixes. A prefix comprises all IP addresses with p as their first n bits, typically written as p/n. Routing tables contain per-prefix routes. When a router receives 19

Src1

AS2 B

A C

AS1

Dst

D E

AS3 Src2

Figure 2.3: Example routes selected from Src1 and Src2 to Dst, when routers route to the nearest egress. Solid lines show links between routers, and dashed lines indicate the links followed by packets destined to Dst. Ingress router A is closer to egress router C than to egress router E, and so traffic from Src1 to Dst follows the AS path AS1-AS2. Ingress router D is closer to E, cuasing traffic from Src2 to follow the AS path AS1-AS3.

a packet destined to a particular destination, it determines the most-specific prefix that this destina- tion belongs to, i.e., among the p/n’s that the destination address matches, the one with maximum value of n. This prefix is also referred to as the longest-matching prefix. The router then forwards the packet along the route associated with this prefix. In this dissertation, I generally mean this longest-matching prefix when I refer to the prefix of an address. It should be evident from context if I mean a different prefix. IP prefixes are allocated to ASes, which then announce single-AS paths to those prefixes. The first AS on an AS path to the prefix is known as the origin AS for the prefix.

Third, while generally each router forwards a packet based on the destination prefix, some ASes do not use this destination-based routing internally. Instead, they employ tunnels internally, in which the path from the start to the end of the tunnel is predetermined, with routers in the middle forwarding along the tunnel rather than based on the destination. Multiprotocol Label Switching (MPLS) is a popular tunneling protocol. Similarly, some ASes load-balance traffic across multiple paths [6]. 20

Routes on the Internet are frequently asymmetric, with the path between two hosts differing in the two directions [51]. They can be asymmetric at the router level and even at the AS level, as the asymmetry between forward and reverse paths stems from multiple causes. An AS is free to use arbitrary, opaque policy to choose its next hop among the alternatives, whether or not that leads to a symmetric route. Two adjacent ASes may use different peering points in the two directions due to policies such as early-exit/hot-potato routing. Even within an individual AS, traffic engineering objectives may lead to different paths.

2.2.1 Mechanisms for BGP Route Control

The origin only controls how it announces its prefixes. Once it decides how to announce the prefixes to its providers, it has no direct influence over the paths other networks select. The origin can attempt to influence the paths by modifying the parameters of the announcements, perhaps setting them differently in its announcements to different providers. I now briefly describe some specific BGP mechanisms that the origin can use to attempt to influence the paths along which traffic reaches it.

MEDs: An AS that connects to a neighboring network at multiple points can use Multi-Exit Dis- criminators (MEDs) to express to the neighbor on which peering point it prefers to receive traffic, giving the AS some control of incoming traffic. However, the MEDs only have meaning within the context of that single neighbor, so they generally only shift where that immediate neighbor delivers traffic.

Selective advertising: The origin can announce the prefix through only some providers, shifting traffic away from other providers it wants to avoid.

Advertising a more-specific prefix: By announcing a more-specific (longer) sub-prefix to one provider P1 and only the less-specific super-prefix to a second provider P2, the origin can cause all traffic to the sub-prefix to arrive via P1. This technique is a form of selective advertising.

Prepending: Since many ASes use path length as a tiebreaker when making routing decisions, networks sometimes prepend routes they announce with multiple copies of their AS, in order to make that path longer and hence less preferred than shorter ones. For example, an AS can prepend when announcing a route to P2 and not prepend to P1. This prepending may cause some networks to shift to using P1. Prepending can sometimes help with inbound route control. However, ASes are free to apply their own local preferences and to ignore path length as a selection criterion.
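
As a toy illustration of why prepending can shift inbound traffic, the following sketch models a remote router that breaks ties on AS-path length when local preferences are equal; the AS numbers are hypothetical, and a real router applies many other selection criteria first:

```python
def choose_route(candidate_as_paths):
    """Pick the shortest AS path; a common BGP tiebreaker when local preference is equal."""
    return min(candidate_as_paths, key=len)

# Origin AS 65001 announces normally via provider P1 (AS 65002) and with
# two extra copies of itself prepended via provider P2 (AS 65003).
via_p1 = (65002, 65001)
via_p2 = (65003, 65001, 65001, 65001)

print(choose_route([via_p1, via_p2]))  # -> (65002, 65001): traffic shifts toward P1
```

As noted above, an AS that prefers P2 for its own reasons can set local preference and ignore the longer path entirely, so prepending gives only coarse, best-effort control over inbound routes.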

BGP communities: An AS can define a BGP community, which other ASes can attach to routes they announce to it. Generally, an AS intends for its communities to provide a communication channel from its immediate customers and peers to it. The community instructs the AS on how to handle the routes. For example, Cogent defines a community that others can use to instruct Cogent to not forward the route to its peers. Because ASes are free to define communities as they wish, communities potentially provide an extremely powerful and flexible means to communicate information across network boundaries and to influence how other networks route. However, this flexibility also leads to some of the problems with communities: they are not standardized, meaning that it can be tricky to obtain consistent behavior across multiple providers; some ASes may not give enough, or even any, control over how they propagate routes; and communities are often not transitive, with many ASes not propagating community values they receive on to their neighbors [103].

2.3 Tools to Understand Routing

Researchers and network operators use a variety of tools to measure, assess, and understand routing. Given privileged access to a network, one can inspect routing tables or even information about traffic flows. However, in this dissertation I focus on wide-area routing problems, when such access is not generally available. In this section, I describe the most common techniques one can use to measure arbitrary Internet routes in various ways.

2.3.1 Tools Other than Traceroute

Route collectors and Looking Glass servers: Since BGP is the Internet’s inter-domain routing protocol, a natural way to understand routing is to inspect BGP paths, routing tables, and updates. While a router’s routing information is normally only accessible to administrators of the router’s network, route collectors and Looking Glass servers provide two ways to access such information. Route collectors are computers that peer with BGP routers from a number of networks and publish the routes and updates they receive, for use by researchers and operators. RouteViews [87] and RIPE’s Routing Information Service (RIS) [105] each make a number of route collectors available. Looking Glass servers serve a similar function; essentially, if an AS makes a Looking Glass server available, it allows a view into the routing tables of that AS’s routers. These sources of data can be extremely useful. The main limitations are that they publish BGP routes, and so contain AS-level paths without router-level details, and that only a small number of ASes participate, so the coverage is limited.

Ping: A tool called ping tests whether a destination is reachable and responsive [90]. The tool sends an ICMP Echo Request packet to a specified destination. If the packet reaches the destination and the destination is configured to respond to Echo Requests, the destination replies back to the source with an ICMP Echo Reply packet. The tool reports whether it received a response and also how much time elapsed from when it sent the request to when it received the response. If it receives a response, this result indicates that paths to and from the destination worked, and the elapsed time gives the round-trip time (RTT). The lack of a response does not definitively indicate a lack of working routes, as the destination could be configured not to respond to ping or to rate limit its responses. As is common, I will use the word “ping” in multiple ways: to refer to the tool (“use ping to check if . . . ”), the packet sent by the tool (“a ping shows that . . . ”) and the use of the tool (“the source pings the destination”); the meaning should be clear in context.
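
The following is a minimal sketch of the tool’s core operation, assuming the third-party scapy packet library and raw-socket (root) privileges; it omits the retries, identifiers, and rate limiting a production prober would need:

```python
import time
from scapy.all import IP, ICMP, sr1  # requires scapy and root privileges

def ping(destination, timeout_s=2.0):
    """Send one ICMP Echo Request; return (responded, rtt_ms)."""
    start = time.time()
    reply = sr1(IP(dst=destination) / ICMP(), timeout=timeout_s, verbose=0)
    if reply is None:
        # No reply: the host may be down, unreachable, or configured to ignore pings.
        return False, None
    return True, (time.time() - start) * 1000.0

print(ping("192.0.2.1"))  # hypothetical destination
```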

2.3.2 Traceroute

A widely used tool called traceroute measures the path from the source to a specified destination [59]. All IP packets have a Time-To-Live (TTL) field. The source initializes the TTL to an integer value, and every router is supposed to decrement the TTL by one before it forwards the packet. When a router receives a packet with a TTL value of one, it discards the packet and sends an ICMP Time Exceeded message to the source. The protocol designers intended the TTL field to allow routers to discard packets that are stuck in persistent loops. Repurposing the feature, traceroute first sends a packet to the destination with TTL = 1. The first router discards the packet and sends an error message to the source. The error contains the IP address of the router, allowing the source to discover the start of the path. The source repeats this process, increasing the starting TTL value by one each time in order to discover the next router on the path. The traceroute stops either when it receives a response from the destination or when it receives an indication that the destination is unreachable (either an error message or no responses to a configurable number of probes in a row – typically three or five). In this manner, traceroute measures the router-level path to a destination. Like ping, traceroute times the elapsed time before receiving each response, to measure the round-trip latency for each hop. Different implementations of traceroute send different types of packets. In this dissertation, I use ICMP Echo Request (ping) packets, as they tend to elicit the most responses [74]. As with “ping,” I use “traceroute” in multiple ways, but the use should be obvious from the context.
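
The TTL-walking mechanism can be sketched in a few lines, again assuming scapy and root privileges; like my measurements, this sketch uses ICMP Echo Request probes and stops at an Echo Reply, another terminal ICMP message, or several consecutive unresponsive hops:

```python
from scapy.all import IP, ICMP, sr1  # requires scapy and root privileges

def traceroute(destination, max_ttl=30, max_silent_hops=3):
    """Discover the forward path by sending ICMP Echo Requests with increasing TTLs."""
    hops, silent = [], 0
    for ttl in range(1, max_ttl + 1):
        reply = sr1(IP(dst=destination, ttl=ttl) / ICMP(), timeout=2, verbose=0)
        if reply is None:
            hops.append("*")            # unresponsive or rate-limited router
            silent += 1
            if silent >= max_silent_hops:
                break                   # give up after several silent hops in a row
            continue
        silent = 0
        hops.append(reply.src)          # the router (or destination) that answered
        if reply[ICMP].type != 11:      # not Time Exceeded: Echo Reply or an error
            break
    return hops

print(traceroute("192.0.2.1"))  # hypothetical destination
```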

The path returned by traceroute is a feasible, but possibly inaccurate, route. First, each hop comes from a response to a different probe packet, and the different probes may take different paths for reasons including contemporaneous routing changes or load balancing. Paris traceroute customizes probe packets to provide consistent results across flow-based load balancers, as well as to systematically explore the load-balancing options [6]. For all my traceroutes, I use the Paris option that measures a single consistent path. Second, some routers on the path may not respond. For example, some routers may be configured to rate-limit responses or to not respond at all. Third, probe traffic may be treated differently than data traffic.

Despite these caveats, traceroute has proved to be extremely useful. Essential to traceroute’s utility is its universality, in that it does not require anything of the destination other than an ability to respond to probe packets.

Some networks make available public traceroute servers within their network. These servers allow anyone to issue traceroutes from the machines (generally via a Web interface) and view the results. In many cases, Looking Glass servers also function as public traceroute servers. Only a few hundred ASes (of tens of thousands total) currently offer public traceroute servers.

PlanetLab is a testbed that allows researchers access to computers at hundreds of sites around the world [101]. While researchers use PlanetLab for a range of purposes, in my research, I typically use these hosts to issue ping, traceroute, and other similar measurements from sites distributed around the world.

2.3.3 Common Uses and Limitations of Traceroute

Table 2.1 shows an example traceroute. I highlight some of the information that traceroute provides, example uses of this information, and limitations with it:

Path: Traceroute provides hops along the path from the source to the destination. This information is used to discover and understand Internet routing [122,78,83,131,61,35]. However, traceroute has a fundamental limitation. Most Internet communication is two-way, and many paths are asymmetric, but (without control of both endpoints) traceroute can measure only the forward path. The TTL trick cannot expose the reverse path: first, the destination resets the TTL to a normal value when it sends its responses; second, even if the TTL of a response expired in transit, the resulting ICMP Time Exceeded messages would go back to the destination, not to the original source.

In Section 2.4, I will explain how this limitation restricts the ability of content providers to address performance problems. Faced with this shortcoming of the traceroute tool, operators and researchers turn to various limited workarounds. Network operators often resort to posting problems on operator mailing lists, asking others to issue traceroutes to help with diagnosis [94, 91].

Table 2.1: Traceroute from University of Washington towards 18.0.0.1, which is unreachable/unresponsive.

Traceroute from zooter.cs.washington.edu to 18.0.0.1

Hop #  DNS Name / IP Address                                    RTT
  1    acar-atg-02-vlan77.cac.washington.edu                    0ms
  2    vl3855.uwcr-atg-01.infra.washington.edu                  0ms
  3    uwcr-atg-01-vlan1889.cac.washington.edu                  1ms
  4    vl1850.uwbr-chb-01.infra.washington.edu                  1ms
  5    ge-2-0-0–4013.iccr-sttlwa01-03.infra.pnw-gigapop.net     1ms
  6    iccr-sttlwa01-02-ge-0-2-0–0.infra.pnw-gigapop.net        1ms
  7    nlr-packetnet.trans.pnw-gigapop.net                      2ms
  8    losa-seat-49.layer3.nlr.net                              28ms
  9    hous-losa-87.layer3.nlr.net                              59ms
 10    atla-hous-70.layer3.nlr.net                              84ms
 11    wash-atla-64.layer3.nlr.net                              97ms
 12    newy-wash-98.layer3.nlr.net                              103ms
 13    216.24.184.102                                           103ms
 14    *                                                        *
 15    *                                                        *
 16    *                                                        *

It also forces many systems to assume symmetric paths [78, 131, 46, 83, 140], even though most paths are asymmetric [51] due to policy routing and traffic engineering. Public web-accessible traceroute servers hosted at various locations around the world provide some help, but their numbers are limited. Without a server in every network, one cannot know whether any of those available have a path similar to the one of interest. Further, they are not intended for the heavy load incurred by regular monitoring. A few modern systems attempt to deploy traceroute clients on end-user systems around the world [111,26], but none of them are close to allowing an arbitrary user to trigger an on-demand traceroute towards the user from anywhere in the world. Another approach is to use loose source routing, in which the sender specifies waypoints through which the packet should be routed, to measure paths from a host without controlling it [95, 45, 23]. However, in 2000, the Mercator study found that only 8% of routers supported source routing [45], and a 2009 study found that most or all source-routed packets are blocked [100, 8]. Further, sending source-routed packets generates many complaints [45, 23].

Latency: Traceroute measures the round-trip time (RTT) to each hop. While some systems need RTT values [106, 30], one-way link latencies (for both forward and reverse path links) would provide strictly more information. Techniques for geolocation [131, 61], latency estimation [78], and ISP comparisons [83], among others, depend on link latency measurements obtained by subtracting the RTTs measured to the link’s two endpoints and halving the difference (possibly with a filter for obviously wrong values [78,83]). This technique should yield fairly accurate values if the link is traversed symmetrically, but previous work found that 88-98% of paths are asymmetric [51] and that delay asymmetry often accompanies path asymmetry [96], resulting in errors in link latency estimates [122].
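
The RTT-subtraction technique reduces to the following sketch; it assumes the link is traversed symmetrically, which, as noted above, is often not the case:

```python
def estimate_link_latencies(hop_rtts_ms):
    """Estimate per-link one-way latencies from a traceroute's per-hop RTTs.

    hop_rtts_ms[i] is the RTT to hop i. Each link estimate is half the RTT
    difference between adjacent hops; negative differences (measurement noise,
    asymmetric paths) are filtered out as obviously wrong.
    """
    estimates = []
    for near_rtt, far_rtt in zip(hop_rtts_ms, hop_rtts_ms[1:]):
        delta = (far_rtt - near_rtt) / 2.0
        estimates.append(delta if delta >= 0 else None)
    return estimates

# Using the RTTs to the NLR hops in Table 2.1 (28, 59, 84, 97, 103 ms):
print(estimate_link_latencies([28, 59, 84, 97, 103]))  # [15.5, 12.5, 6.5, 3.0]
```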

A few other alternatives exist for estimating link latencies, but none are satisfactory. IPMP is a protocol that allows better delay measurements [75], but it requires changing routers to support it. Rocketfuel annotates links with the link weights used in routing decisions [81], which may or may not reflect latencies. The geographic locations of routers provide an estimate of link latency, but geolocation information may be missing, wrong, or outdated, and latency does not always correspond closely to geographic distance [61]. Tomographic approaches base delay estimates on shared portions of paths [28, 127, 102], but lack the reverse path measurements necessary to accurately measure to arbitrary destinations. Other techniques either assume symmetric paths [112] or require control of both endpoints [49] and clock synchronization [110, 20]. These assumptions keep the techniques from being broadly applicable.

Topology: Researchers use traceroutes to measure the Internet topology at the IP [5, 74], router [115, 119], Point of Presence (PoP) [78], and AS level [5, 52, 113]. From the traceroute in Table 2.1, for example, one can infer that NLR’s Los Angeles PoP connects to its Houston PoP, and that NLR connects to PNW Gigapop. In addition to the issues that affect traceroute path measurements, problems arise using it to map the Internet. Traceroutes issued from a small number of vantage points (relative to the number of networks in the Internet) miss any links not used on forward paths from those vantage points and, in particular, cannot see most peering links [93,52]. BGP route collectors also have this problem, since they too have vantage points in only a small fraction of the ASes lower in the AS hierarchy. Testbeds that seek to recruit participants in many networks

potentially avoid this limitation [111, 26]; however, no deployments seem likely to allow arbitrary researchers to issue on-demand measurements soon.

Failures: Traceroute also determines if the destination is responsive and reachable. In Table 2.1, the trailing ‘*’ hops indicate that the destination is unresponsive, that the source does not have a working path to the destination, or that the destination does not have a working path to the source. In Section 2.5, I discuss the use and limitations of traceroute in diagnosing availability problems.


2.4 Performance Problems

In this section, I discuss what is known about performance problems on the Internet. Many Internet routes are slow, following circuitous paths even though more direct routes exist. Network operators generally try to optimize performance, but available tools do not provide a full picture of wide-area routing. In particular, while traceroute gives good visibility into the forward path, it cannot measure the reverse path. In order to improve performance, operators need a way to measure the reverse paths taken to reach their ASes.

Many clients suffer from slow performance when accessing major websites. As described in Chapter 1, studies from major providers including Yahoo! [123], Google [18], and Amazon [73] establish that, when clients experience slow web performance, they use the services less. In fact, as clients continue to encounter degraded performance, they continue to use a service less and less, and this reduced usage persists even after performance returns to previous levels [18]. Coupled with the enormous revenues these companies generate through advertisements and sales to clients, providers have a financial inducement to provide fast service. Various components contribute to a client’s latency, including the speed of the client’s access link, queueing delays along the path,

propagation delays of the links on the path, transmission delays at routers, processing delay at the server, and the fact that a single transaction may take multiple round-trips between the client and server. Researchers found that the bulk of the latency comes from the round-trip network latency between client and server [50], and that is what I focus on in this dissertation.

Despite it being in their financial interest to serve clients over low latency paths, a Google study found that many clients suffer high latencies [66]. Measuring round-trip network latencies on connections between Google servers and clients, the study found that 40% of connections had a network latency of at least 400ms, sufficient for packets to circle the equator twice.2 The connections incur this delay despite Google’s efforts to provide fast service. To improve performance to clients spread around the world, providers like Google replicate their services at distributed data centers, then direct each client to the one with the lowest network latency to the client. In fact, Google’s data centers are well-distributed enough that 75% of client prefixes have a server within 1000 miles (which would frequently yield round-trip latencies of less than 50ms [46]) and half of prefixes have a server within 500 miles. Further, the study found that the vast majority of clients were being served by geographically nearby data centers, indicating that the latency was due to indirect routes and/or queueing.

Faster routes exist, but networks need better tools to find them. The Google study and other research establish that, in many cases, indirect routes lead to long end-to-end delays, even though more direct paths exist. One study found that 10% of paths incurred at least 15ms of additional latency, as compared to a theoretical direct path between the endpoints [118]. Most other available studies found even worse inflation. Comparing to the minimum latency from a Google server to a client’s region, more than 20% of client prefixes had an inflation of over 50ms [66]. The fact that other prefixes in the same region had much lower latencies to Google’s servers suggests the presence of shorter routes through the network. Research explicitly found such routes [106, 41, 4]. A typical technique for identifying these routes is to measure pair-wise delays between a set of nodes and identify violations of the triangle inequality.

[Footnote 2: The speed of light in fiber is approximately 200km per ms, and the circumference of the equator is just over 40,000km.]

In such violations, the delay measured between a pair of nodes is more than the delay of the indirect path between them composed by detouring through a third node. Investigating five different data sets, one study found that, for 10% of paths, they could improve the latency by 50% using such a detour [106]. Other studies report similar results: one found that between 10% and 25% of paths had a detour that was half the latency or less [41], and another found that more than 10% of paths saw latency improvements of 40ms or more on detours [4]. However, while they represent working physical paths, these detour paths generally violate valley-free routing policy, so cannot be selected by BGP or used to carry the bulk of Internet traffic. The commercial offerings that exist [15] must maintain costly detour nodes and pass on the cost to clients.
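
A sketch of this detour-detection technique follows: given measured pairwise RTTs, it reports pairs for which routing via some intermediate node would cut the delay by at least a chosen factor. The node names, delays, and threshold below are illustrative, not taken from the cited studies:

```python
def find_detours(rtt, improvement=0.5):
    """Find triangle-inequality violations in a dict of pairwise RTTs.

    rtt[(a, b)] is the measured delay between nodes a and b. A detour via c is
    reported when rtt[(a, c)] + rtt[(c, b)] is at most `improvement` * rtt[(a, b)].
    """
    nodes = {n for pair in rtt for n in pair}
    detours = []
    for (a, b), direct in rtt.items():
        for c in nodes - {a, b}:
            via = rtt.get((a, c), float("inf")) + rtt.get((c, b), float("inf"))
            if via <= improvement * direct:
                detours.append((a, b, c, direct, via))
    return detours

# Illustrative delays (ms): the direct a-b path is inflated; detouring via c halves it.
delays = {("a", "b"): 120, ("b", "a"): 120,
          ("a", "c"): 25, ("c", "a"): 25,
          ("c", "b"): 30, ("b", "c"): 30}
print(find_detours(delays))
```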

Evidence suggests that better performing policy-compliant alternate paths also exist but that ASes often lack the tools and visibility necessary to select them. The length of an Internet path depends on the AS path selected, the peering link chosen between each adjacent pair of ASes in the path, and the route used to traverse each individual AS. Two possible causes could inflate any of these three components (as compared to the great circle distance between the endpoints): (1) a lack of a more direct physical path and/or (2) routing decisions that select a less direct route over a more direct one. To determine the contribution of each of these factors, researchers measured the Internet’s topology and compared selected paths to available paths that were not selected [118]. They concluded that inter-domain and peering decisions contribute most of the inflation. However, the inflation did not stem from commercially-driven routing policies, such as valley-free or prefer-customer. Instead, most of the inflation resulted from using shortest AS path as the tiebreaker. This metric need not correlate with performance, but routers making BGP decisions generally do not have access to actual performance measures. Further, the inflation was not caused by intended routing goals, but rather by a lack of tools to enable the selection of better paths. They based this conclusion on two facts. First, within individual networks, in which routing engines have access to the complete network topology, 80% of routes were not inflated, suggesting that networks tend to optimize for path length. Second, even though many routes used suboptimal peering links or inter-domain paths, the study also found widespread evidence of cooperation among ASes to arrive at better paths.

To improve poorly performing paths, providers need a way to measure paths from their clients back to their servers. Recent work from Google demonstrates that, given access to information about paths, a provider can improve the performance of its clients [66]. After identifying inflated paths, as described above, the researchers attempted to assign explanations, such as inadequate peering or a misconfiguration. They then gave these explanations to Google network operators. In many cases, once directed to a problem, the operators were able to resolve it, even if it required contacting operators at the network at fault. For example, operators were able to reduce the latency to a set of clients in Japan from over 100ms to only a few milliseconds by changing routes. On a larger scale, in just three months, operators were able to nearly halve the number of prefixes that incurred over 50ms of inflation when served by a South American data center.

However, improving inflated paths generally requires pinpointing the cause of the inflation, and current tools do not give adequate visibility into the reverse path. A client of a service incurs the latency of both the path to the server and the path back from the server, and these paths generally differ. Operators at the provider can use traceroute to measure the path to the client, and the results can highlight when an AS selects a longer path than is necessary. Without access to the client, however, the operators do not have equivalent visibility into the other direction of the path. The Google researchers inferred inflated reverse paths in cases where the forward path did not explain the round-trip latency. In fact, even before their interventions improved some forward paths, they concluded that 30% more reverse paths were circuitous than forward paths. However, lacking tools capable of measuring the reverse path, they were unable to point operators to the source of the problems. They cited the lack of information about the path from clients back to Google as the predominant limitation faced in trying to optimize clients’ network latencies.

2.5 Availability Problems

In this section, I discuss what is known about availability problems on the Internet. The Internet suffers from many outages. While most are short, a substantial number are long-lasting, and these long-lasting problems contribute substantially to overall unavailability. Prolonged outages are not well understood. Network operators lack tools to locate them, as standard tools like traceroute reveal only limited

information during failures. The decentralized administration of the Internet means that an outage affecting an operator’s AS may stem from a problem in another AS. The operator has little visibility into the routing of other ASes and even less control over it. This current state argues for the need for approaches to locating and repairing outages.

The Internet suffers from many long-lasting availability problems. The Internet’s availability lags far behind that of the telephone network [67]. Various studies used active monitoring, such as pings, between pairs of nodes in testbeds and found extended data plane outages [97, 4, 37]. In seminal early work, Paxson identified routing loops that persisted for more than 10 hours, as well as other types of unavailability [97]. Researchers using the RON testbed found that 10% of outages lasted at least 15 minutes, with failures appearing at a range of locations (near the edge and in the core, at AS boundaries and within a single AS) [37]. Other studies looked at paths from testbed nodes to destinations across the Internet and found similar results [47, 136]. A study monitoring paths to popular (and presumably well-connected) web servers found over 1000 outages in a week, with the average failure lasting for over 10 minutes [47]. Monitoring paths between a distributed system’s servers and its clients, PlanetSeer found many prolonged outages, including over a thousand loops per month that lasted over seven hours [136]. My own results, in Chapter 3, show that even paths between data centers of a major cloud provider and core Internet routers suffer from many long-lasting outages. The results show that, to substantially improve availability, we must address these protracted failures.

In Section 3.5, I present evidence that suggests that misconfigurations may cause some long-lasting outages. The Internet has sufficient redundancy to maintain connectivity after a range of failures, and routing protocols are designed to recover and find new routes after link and router failures. Under proper operation, protocols such as BGP should restore connectivity after a brief period of routing convergence [69]. However, many long-lasting outages occur with few or no accompanying routing updates [37]. With routing protocols failing to react, networks continue to send traffic along a path that fails to deliver packets. Such problems can occur, for example, when a router fails to detect an internal fault (e.g., corrupted memory on a line card causing traffic to be black-holed [135]) or when cross-layer interactions cause an MPLS tunnel to fail to deliver packets

even though the underlying IP network is operational [63]. Further, routing relies on the proper configuration of multiple devices in multiple networks, running multiple protocols. These devices can be tricky to configure, and currently operators lack tools to let them express network-wide objectives, instead relying on tuning individual devices and protocols [89]. Operators make many errors in configurations [82], some of which result in outages, and such misconfigurations seem especially likely to occur on the rarely exercised backup paths that only come into use following failures.

Current approaches are not able to sufficiently explain or locate the source of these lengthy availability problems. The Internet protocols do not automatically recover from these lengthy problems in a timely manner, suggesting they currently require human intervention. Rich techniques exist to locate failures within a single network, but they require access to proprietary information only available within that particular network [33, 64, 63, 109]. Existing studies found that failures appear both near the edge and in the core [37], and so a failure between, for example, a server and its client could appear anywhere along the path. Unfortunately, the techniques designed to work within a single network do not apply to these sorts of wide-area problems. Operators in the server’s and client’s networks may be able to detect the failure, as it affects their traffic, but will not have access to the proprietary information of any transit network in the middle. Operators at the transit network may have access to information to localize the failure, but, lacking access to the server and client, they have no way to know what traffic should be flowing through the network.

Since more sophisticated techniques do not apply, operators routinely use traceroute to understand wide-area problems [122], but it provides only limited information. Table 2.2 provides an actual example illustrating traceroute’s limitations, taken from emails to the Outages.org mailing list [94] on December 16, 2010. Operator A reported that “traffic attempting to pass through Level 3’s network in the Washington, D.C., area is getting lost in the abyss” and provided a traceroute issued from his home. As seen in operator A’s traceroute in Table 2.2, the route passed through Verizon’s Baltimore PoP before eventually reaching Level 3’s Washington, D.C. PoP, then trailed off. The trailing “*”s indicate that traceroute did not receive any further responses, and so the operator suggested that the problem might be with Level 3’s network in Washington, D.C.

Table 2.2: Traceroutes taken from emails sent in response to an outage reported on the Outages.org mailing list on December 16, 2010. While traceroute provides some information about the outage, it is confusing and provides an incomplete view. Operator A’s traceroute suggests a problem forwarding from Level 3 in Washington, D.C., but operator B’s traceroute successfully goes through Washington, D.C., before terminating in Denver. Further, the failure could be on the reverse path, into which traceroute gives no visibility.

Operator A’s traceroute
Hop #  DNS Name / IP Address
 1     wireless broadband router.home
 2     l100.BLTMMD-vfttp-40.verizon-gni.net
 3     g10-0-1-440.BLTMMD-lcr-04.verizon-gni.net
 4     so-2-0-0-0.PHIL-bb-rtr2.verizon-gni.net
 5     0.ae2.br2.IAD8.alter.net
 6     ae7.edge1.WASHINGTONDC4.level3.net
 7     vlan80.csw3.WASHINGTON1.level3.net
 8     ae-92-92.ebr2.WASHINGTON1.level3.net
 9     *
10     *
11     *

Operator B’s traceroute
Hop #  Hostname / IP Address
 1     192.168.1.1
 2     l100.WASHDC-vfttp-47.verizon-gni.net
 3     g4-0-1-747.WASHDC-lcr-07.verizon-gni.net
 4     0.ae1.br1.IAD8.alter.net
 5     ae6.edge1.WASHINGTONDC.level3.net
 6     vlan90.csw4.WASHINGTON1.level3.net
 7     ae-71-71.ebr1.WASHINGTON1.level3.net
 8     ae-6-6.ebr2.CHICAGO2.level3.net
10     ae-1-100.ebr1.CHICAGO2.level3.net
11     ae-3-3.ebr2.DENVER1.level3.net
12     ge-9-1.hsa1.DENVER1.level3.net
13     *
14     *
15     *

However, because paths are frequently asymmetric and traceroute gives no visibility into the reverse path, traceroute cannot provide enough information to definitively support this conclusion. As my results in Section 3.3.3 will show, path failures are often unidirectional, with one direction between a pair of nodes working while the other is broken. So, A’s traceroute could mean that Level 3 was not successfully forwarding traffic to the destination (a broken forward path). Alternatively, the forward path could be working, but the destination and other hops past Washington, D.C., could lack working reverse paths back to A. Because it requires bidirectional connectivity from any hops it reports, traceroute cannot differentiate these cases. Even if the forward direction is working, traceroute cannot measure it without a working reverse path, since the replies will not reach the source. Also, from a single location, traceroute gives little sense of how widespread the problem is.

With traceroute yielding such limited information, operators often resort to asking others for help, but their traceroutes may not illuminate the problem. The second traceroute in Table 2.2 comes from an operator B who replied to A’s post to the mailing list. B’s traceroute goes through

Level 3’s Washington, D.C., PoP before trailing off after Denver. This measurement does not clarify the situation. Has the initial problem shifted to Denver? Are the failures on the reverse path? Do the reverse paths even pass through Level 3? For outages like this one, traceroute, “the number one go-to tool” for operators troubleshooting problems [122], does not suffice.

Researchers have developed various approaches that attempt to locate the source of events by analyzing BGP updates [38, 132, 22, 70, 82], but the control plane may not accurately reflect data plane failures. One study showed that data plane outages often do not manifest on the control plane [37]. Because this disconnect can exist between the control plane and the data plane, and, ultimately, the Internet exists to deliver data, an approach that relies solely on BGP updates cannot provide a complete solution to the problem of locating failures across the Internet.

Existing approaches to restoring connectivity do not suffice. While evidence exists of working routes around failures, existing techniques to use those routes are incompatible with either current routing protocols or routing policies. Similar to how researchers showed the existence of lower latency paths by looking for violations of the triangle inequality, existing research established that a pair of hosts can often route traffic around a failure between them by “detouring” the traffic through a third node. RON was able to route around all failures that lasted more than 30 minutes [4]. In 56% of cases in which a host could not directly fetch documents from popular web servers, a similar approach could successfully fetch the documents by relaying traffic through an intermediary [47]. These approaches provide a great alternative when no other solutions exist, and commercial offerings exist [15]. However, it is expensive to maintain the nodes necessary for detouring, the detour paths generally violate standard routing policy (so using such paths would likely be expensive, if feasible at all), and the paths are unlikely to be capable of carrying much traffic at core Internet data rates. Approaches which overcome these limitations generally require modifications to the core Internet protocols, as with an approach that allows ASes to advertise multiple paths to, among other goals, lend the ability to route around failures [133]. This approach provides a promising direction, but modifications to core protocols face a long path to adoption, meaning that such a solution is unlikely to improve Internet availability in the near future.

Given the constraints of current policies and protocols, operators often have limited ability to address problems. Suppose a failure exists between a network hosting a server and a client trying to access the server. Operators at the server’s network would like to restore connectivity. If the problem is within the operators’ network, they should have a means to resolve it. However, if the problem is in some transit AS on the path, the operators have very little control. If the problem is on the forward path towards the client, a well-connected network can try reaching the client through a different provider or via a different data center. If the problem is on the reverse path, all the network can do is change how it announces its prefixes, using techniques explained in Section 2.2.1. These techniques provide only minimal, very coarse control. I discuss this limited control in detail in Section 5.1. Essentially, the techniques provide mechanisms to affect the provider and PoP through which traffic enters the server’s network, but, for the most part, other ASes will still be free to select whichever path they want, often including the one that is being advertised but is not working. So, in general, with current practices, operators at the AS containing the failure need to address it. Often, operators from the server’s network have to contact the transit AS, opening incident tickets or asking on mailing lists for contact information [94, 91]. The prevalence of long-lasting outages suggests that it can take a while for transit networks to be alerted to and address the problem.

Chapter 3

CHARACTERIZING INTERNET OUTAGES

In this dissertation, my goal is to develop techniques to improve availability and performance on today’s Internet. In order to build tools and systems to identify and remediate routing failures and performance problems, it is necessary to understand the problems: how frequently they occur, what their causes are, and other properties.

Fortunately, as I presented in Section 2.4, a rich literature examines performance problems. For example, researchers established that faster, more direct routes frequently exist but that networks lack tools to find them [118]. Research from Google characterized the performance problems that the company faces and highlighted types of problems they can and cannot resolve [66]. These and other studies suggest what networks would need from tools in order to improve performance, and so in Chapter 4 I develop one such tool.

While existing studies characterized performance problems on the Internet and how providers might address them, much less is known about availability problems. Studies mentioned in Section 2.5 assess various types of outages in different settings, but do not provide the complete picture. My goal in this chapter is to supplement these studies by characterizing failures on an Internet scale. To provide direction in how to improve availability, these studies should help answer: whether it is possible to monitor availability at this scale; how common outages are; how many paths experience outages; whether outages generally render a destination completely unreachable, or whether outages are partial; whether, when two hosts cannot communicate, the problem affects the routes in both directions between them; and whether valid routes exist that could restore connectivity to an unreachable destination. In this chapter, I present three measurement studies to characterize Internet outages. The studies establish that many outages occur, suggesting the need for new approaches to improving availability. The characteristics of the outages motivate my approach to improving availability and inform the design of my systems to address them, which I will present in later chapters.

Monitoring outages at Internet-scale with Hubble: To motivate the design of my failure location techniques in Chapter 5, I design and build Hubble, a system that identifies availability problems over the global Internet in real-time. The system monitors for Internet black holes, in which traffic silently disappears without reaching the destination. Conducting a study at scale helps insulate the conclusions against any biases introduced by monitoring only a small number of destinations.

Using Hubble, I found availability problems to be more common, widespread, and longer lasting than expected. In a given week, thousands of networks are reachable from some locations and not from others, and many of the problems last for hours or even days. This result both suggests that we need new approaches to improve availability and that these approaches can issue measurements from vantage points with working paths when trying to locate a problem. Using a novel technique to isolate the direction of failures, I find that most failures are unidirectional, with the route in one direction between a source and destination failing even though the other direction works. The existence of unidirectional failures motivates the need to measure reverse paths in order to locate failures.

In addition to presenting these measurement results, in this chapter I describe the design and evaluation of Hubble. The major challenge in building Hubble is that of scale: how can the system provide spatial and temporal coverage that scales to the global Internet, monitoring the data plane from all vantages to all destinations, without requiring a prohibitive number of measurement probes? To identify potential problems, Hubble monitors BGP feeds and continuously pings prefixes across the Internet from distributed PlanetLab sites. It then uses traceroutes and other probes from the sites to collect details about identified problems. I show that it monitors 89% of the Internet’s edge prefix address space with a 15-minute time granularity and discovers 85% of the availability issues that would be identified by a heavyweight approach (in which all vantage points traceroute those same prefixes every 15 minutes), while issuing only 5.5% as many probes. I believe Hubble to be the first real-time tool to identify availability problems on this scale across the Internet. The next largest system of which I am aware, PlanetSeer, covered half as many ASes in a 3-month study and monitored paths only for the small fraction of time that they were in use by clients [136]. Hubble ran continuously for two years, starting in September, 2007. During that time, it identified over 1.7 million black holes and availability problems.

Supplemental studies: I supplement the results from Hubble with two additional studies. One study, in Section 3.4, establishes that even well-provisioned cloud data centers experience frequent routing problems. In the study, I assess the availability of key Internet routers from Amazon’s EC2 cloud infrastructure [3]. Hubble monitors from PlanetLab sites, which are mainly located at academic institutions. To complement the Hubble results, my EC2 study finds similar results monitoring from commercial data centers. The data from these studies shows that, even though most outages are short, the small number of long-lasting ones contribute much of the end-to-end unavailability. Techniques exist to address many short outages [60, 68, 71]. However, my Hubble and EC2 results suggest that, to improve availability substantially, we will also need to address long-term outages. Few approaches exist to tackle these problems. Just as with Hubble, I find that most of the outages observed from EC2 are partial, with one EC2 data center able to reach a destination unreachable from another data center. The fact that the destinations are available from some locations suggests the possibility of routing around the failures.

The final study, in Section 3.5, demonstrates that, in fact, working policy-compliant alternate routes often exist between pairs of PlanetLab hosts that are unable to communicate directly. In fact, I show that, in half the cases, an alternate path demonstrably exists that avoids the AS that seems to contain the failure. Since BGP uses AS-level policy routing, this result suggests that BGP may be able to find working alternate paths, if we can discern which AS contains the failure and trigger BGP path exploration that avoids that AS. This observation forms the basis for my technique for route repair, which I present in Chapter 5.

The rest of this chapter is organized as follows. In Section 3.1, I describe the design of Hubble. I present an evaluation of Hubble in Section 3.2 and use it to study Internet availability in Section 3.3. In Sections 3.4 and 3.5 I present the two supplemental studies. I provide a summary in Section 3.6.

3.1 Hubble Design and Architecture

3.1.1 Deriving Hubble’s Methodology

In this section, I describe why I want Hubble to characterize wide-area data plane failures across the Internet, and I discuss the design and methodology implications of this goal. At the end of this

dissertation, in Section 7.3.1, I discuss how future work might extend this work by overcoming some of the limitations I describe here.

Data plane focus: I desire a system that detects data reachability problems, and so I use active measurements to decide whether a destination is reachable. The Internet is intended to deliver data, and a BGP path on top of a broken data plane is useless. If BGP perfectly reflected the data plane – that is, data could reach a destination if and only if the source had a BGP path to it – then it would suffice to detect problems by monitoring BGP feeds from services such as RouteViews [87] and RIPE RIS [105]. However, other studies found that many data plane failures are not visible in BGP [37] and that default routes allow some sources to reach destinations for which they lack BGP routes [21]. My own results in Section 3.2 also demonstrate that many data plane outages do not register as BGP updates in route feeds. The only way to absolutely discern whether the data plane is functioning and packets can reach a destination is to monitor traffic towards that destination. A system could achieve this monitoring either by passively monitoring existing traffic or by introducing probe traffic. Other studies achieve interesting results by passively monitoring existing application traffic [27, 136]. One drawback with this approach is that it ties the studies to the particular traffic patterns of the monitored applications, and few applications have enough traffic to enough destinations around the world to provide continuous global coverage. To supplement these studies, I base my results on active probing using ping and traceroute.

Wide-area routing focus: My dissertation focuses on deployable systems that edge ASes can use to repair problems affecting them and their users, and so I focus on wide-area routing problems that the ASes may be able to resolve. This focus has three components.

First, I concern myself only with problems affecting addresses that one would expect to be reachable. For an IP address to be reachable, it needs to be part of a prefix announced by some origin AS. So, I can monitor BGP advertisements and ignore any addresses that are not being announced. Also, even within an announced prefix, a particular address may not be in use, or it may not offer the particular service I am monitoring (in my case, ping). So, I only monitor addresses that I previously observed as reachable via ping and which become reachable again eventually.

Second, I focus on problems in which the destination’s AS seems not to be reachable, rather than simply checking whether the particular destination is unreachable. I am primarily interested in problems in which the Internet’s routing infrastructure fails to provide connectivity along advertised AS paths, rather than problems in which the traffic traverses the AS path, but the specific host or service happens to be down. As such, I focus on those destinations with reachability problems to the origin AS. Even so, I find many problems, as shown in Section 3.3. I also focus on AS reachability because my approach to problem remediation preserves AS autonomy, and so it would not work if the destination AS is reachable but does not deliver traffic. Other than being down, a destination could instead be unreachable within a reachable origin AS if the AS failed to deliver the traffic. Whereas the source may be able to avoid problems in a transit AS by routing via a path that avoids that AS, in general every path to the destination will have to traverse the destination AS. So, if the destination AS is not delivering traffic to the destination, the source is unlikely to be able to resolve the problem unilaterally. If the destination does not reply to a direct probe, I use traceroutes to reveal the path towards the destination, to discern whether the destination’s origin AS is reachable.

Third, I focus on problems that affect more than a few routes, to eliminate source-specific problems. I am concerned with the reachability of destinations on an Internet-scale, rather than problems caused only by issues near one of my sources. In a study of four months of daily traceroutes from 30 PlanetLab vantage points to destinations in 110,000 BGP prefixes, I found that most of the time, all but a few of the 30 traceroutes reached the destination’s origin AS. If fewer than 90% reached, though, it was likely the problems were more widespread; in half those cases, at least half of the traceroutes failed to reach. I use this value as my conservative threshold: if at least 90% of probes reach the origin AS of a destination, then I assume that any probes that did not reach may have experienced congestion or problems near the source, and I ignore the issue. By avoiding further investigation of spurious problems caused by flaky PlanetLab hosts, I can limit the measurement load required by my system. I use 90% reachability to define if a destination is experiencing a reachability problem. I define a reachability event as the period starting when a destination begins to experience a reachability problem and concluding when its reachability increases to 90% or higher.
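
The threshold and the event definition can be restated as simple bookkeeping; the sketch below uses hypothetical class and field names and is not Hubble’s actual code:

```python
REACHABILITY_THRESHOLD = 0.9  # fraction of traceroutes that must reach the origin AS

def has_reachability_problem(traceroute_results, origin_as):
    """A destination has a problem if fewer than 90% of traceroutes reach its origin AS.

    traceroute_results is a list of AS-level paths (lists of AS numbers), one per
    vantage point that probed the destination.
    """
    if not traceroute_results:
        return False  # no measurements, no determination
    reached = sum(1 for as_path in traceroute_results if origin_as in as_path)
    return reached / len(traceroute_results) < REACHABILITY_THRESHOLD

class ReachabilityEvent:
    """Tracks one event: it starts when the problem appears and ends when
    reachability returns to at least 90%."""
    def __init__(self, destination, start_time):
        self.destination = destination
        self.start_time = start_time
        self.end_time = None  # set once >= 90% of probes reach the origin AS again
```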

My methodology may be better at detecting forward path failures from its vantage points to remote targets than it is at detecting reverse failures on the paths back from the targets.

My measurement study in Section 3.3.3 finds more forward failures, but the following aspect of my experiment design could partly explain the result. Because my approach excludes problems that affect fewer than 10% of vantage points and does not join across targets, it would miss a problem that disrupted connectivity between multiple targets and only a single source. I use traceroutes, which do not differentiate the direction of failure, to determine if a target’s AS is reachable. Assume a failure is on the forward path towards a target. Because Internet routing is generally destination-based, all vantage points route to the target via the same prefix, with the paths eventually converging on their way to the target. If one vantage point’s route experiences a failure on a particular link, presumably other vantage points that route via that link will also experience it. If enough share that link, the failure would cross the 90% threshold. However, the vantage points each belong to different institutions (and hence different prefixes), and so the target’s reverse paths to them will route via different destination prefixes, even over shared links. Reverse path failures might affect converging paths from multiple targets back to a vantage point, without affecting other vantage points, but a single vantage point experiencing multiple failures will not exceed the threshold on its own. In this and later chapters, I will show how we can partly account for asymmetry and unidirectional failures.

Characterization at a global-scale: In order to enrich understanding of wide-area routing problems impacting the data plane, I want to characterize these problems on a global scale. For each of these problems, I want to determine whether it affects all vantage points or only some, whether it is on the forward path to the destination or the reverse path back, and how long it lasts. I also want to determine the location of the problem and whether alternate working paths exist. I mainly leave these last two determinations to Chapter 5 and discuss the rest in this chapter.

In order to monitor the problems of interest at this scale, I need to use distributed vantage points and limit the aggregate probing rate. If all failures were complete, then identifying potential problems would be easy – a probe from any vantage point could identify it. However, as I will show in this chapter, many outages are partial, and so it might be necessary to probe from many vantage points before identifying a particular problem. As described above, I make determinations of reachability based on responses from routers along the paths to a destination. If not rate limited, the probe packets could overwhelm routers along the paths. This restraint is especially important because I

intend to monitor areas of the Internet experiencing problems, and, in doing so, I do not want to exacerbate the problems. So, it is infeasible to monitor all paths to and from all ASes at a high rate.

I use three approaches to limit the probing rate. First, because I am interested in wide-area routing problems, it suffices to monitor at the granularity at which wide-area routing works. BGP is prefix-based, and so I need only monitor a single destination within each BGP-routable prefix. Second, I use a lightweight technique to identify potential failures, trigger more extensive measurements to exclude uninteresting cases (such as those that affect only a single source), and finally use further measurements to characterize the interesting ones. This approach means that the measurements and analysis have to be real-time, so that, when the system detects a potentially interesting problem, it can focus further measurements on it. Third, I focus on detecting persistent issues. I show in Section 3.4 that long problems contribute much of the unavailability, justifying their choice as a priority. Because I am not concerned with transient problems, I can probe at a slower rate. I consider only problems that persist through two rounds of quarter-hourly probes.

3.1.2 Overview of Measurement Components

As depicted in Figure 3.1, Hubble combines multiple types of measurements into four main components to identify and characterize problems: pingable address discovery to decide what to monitor (not shown in figure); active ping monitoring and passive BGP monitoring to identify potential problem prefixes as targets for reachability assessment; triggered traceroutes to assess reachability and monitor reachability problems; and spoofed probes, combined with the same triggered traceroutes, to classify the direction of failures. Unless otherwise noted, Hubble uses PlanetLab sites to perform the active measurements. I now present an overview of each of these measurements, and then elaborate on how the system uses these measurements to monitor and classify reachability problems.

Figure 3.1: Diagram of the Hubble architecture. The pipeline proceeds from target identification (active pings and passive BGP monitoring over the set of pingable addresses), to reachability analysis (triggered traceroutes that identify events), to problem classification (direction isolation producing Internet-wide reports), with candidates fed back to be re-probed.

Pingable address discovery: Pingable address discovery supplies the set of destinations for mon- itoring to the active ping monitoring system. It discovers these destinations by probing the .1 in every /24 prefix present in a BGP snapshot obtained from RouteViews [87] and retaining those that respond. In the wide-area Internet, /24 prefixes tend to be the most specific uniquely routable blocks, because announcements for /25 and more specific prefixes are widely filtered. Therefore, probing one target per /24 allows Hubble to monitor the unique routes available from its vantage points.
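
A sketch of this target-selection step, assuming a list of announced prefixes parsed from a BGP snapshot; the retained targets would then be those that respond to a ping like the one sketched in Section 2.3.1:

```python
import ipaddress

def candidate_targets(announced_prefixes):
    """Yield the .1 address of every /24 covered by the announced prefixes."""
    seen = set()
    for prefix in announced_prefixes:
        network = ipaddress.ip_network(prefix)
        if network.prefixlen > 24:
            continue  # /25 and more specific announcements are widely filtered; skip them
        for slash24 in network.subnets(new_prefix=24):
            target = str(slash24.network_address + 1)  # the .1 of the /24
            if target not in seen:
                seen.add(target)
                yield target

print(list(candidate_targets(["198.51.100.0/24", "192.0.2.0/23"])))
# ['198.51.100.1', '192.0.2.1', '192.0.3.1']
```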

Active ping monitors (details in §3.1.3): Hubble issues pings from vantage points around the world to the pingable addresses. As my results in this chapter will show, many outages are partial, and so it is necessary to send pings to a destination from multiple locations to attempt to discover problems that affect only some vantage points. The system in aggregate probes each address every two minutes. When a vantage point fails to get a response from a previously responsive address, it reports the prefix as a candidate potentially experiencing more widespread reachability problems, resulting in triggered traceroutes to the prefix (§3.1.4). The system needs to send traceroutes because I want to be able to determine whether or not the destination AS is reachable. However, given the limitations of available vantage points, it is not feasible to frequently send traceroutes to an Internet-scale set of targets. iPlane, for example, only refreshes its traceroutes daily [78]. Therefore, Hubble uses lightweight pings to identify candidates for further investigation with more heavyweight traceroutes.

Passive BGP monitor (details in §3.1.3): The system observes BGP updates from multiple locations in quarter-hourly batches to maintain current AS-level routing information. This approach allows continuous monitoring in near real-time of the entire Internet from diverse viewpoints, while providing scalability by gathering information without issuing active probes. Supplementing its ping monitoring, Hubble analyzes the BGP updates and identifies as targets for triggered traceroutes those prefixes undergoing BGP changes, as they may experience related reachability problems. BGP feeds also allow Hubble to monitor prefixes in which no pingable addresses were found, which cannot be monitored by active ping monitors.

Triggered traceroute probes (details in §3.1.4): Every 15 minutes, Hubble issues traceroute probes from distributed vantage points to targets selected as potentially undergoing problems. The system selects three classes of targets: (1) previously reachable addresses that become unreachable, as identified by the active ping monitors; (2) prefixes undergoing recent BGP changes, as identified by the passive BGP monitor; and (3) prefixes found to be experiencing ongoing reachability problems in the previous set of triggered probes.

Spoofed probes (details in §3.1.4): Internet routes are often asymmetric, differing in the forward and reverse direction [51]. A failed traceroute signals that at least one direction is not functioning, but leaves it difficult or impossible to infer which direction is the problem. Hubble employs spoofed probes, in which one monitor sets the source of packets to the IP of another monitor while probing a problem prefix. This technique isolates many problems to either the forward or reverse path.

3.1.3 Identifying Potential Outages

Selective targeting allows the system to monitor the entire Internet by identifying as possible outages only prefixes suspected to be experiencing problems. Hubble uses a hybrid approach, combining active ping monitoring with passive BGP monitoring. If Hubble used only passive BGP monitoring, it would miss any reachability event that did not correlate with BGP updates. As I present later in Section 3.2, BGP is not a good predictor of most problems, but allows Hubble to identify more problems than ping monitoring alone. I now present more details on how the two monitoring subsystems work.

Active Ping Monitoring

To meet the goal of a system with global scale, Hubble employs active monitoring of the reachability of prefixes. Hubble uses traceroute probes to perform its classification of reachability problems. However, it is not feasible to constantly traceroute every prefix in order to detect all problems. On heavily loaded PlanetLab machines, it would take any given vantage point hours to issue a single traceroute to every prefix, and so problems that were only visible from a few vantage points might not be detected in a timely manner or at all.

Hubble’s active ping monitoring subsystem achieves the coverage and data plane focus of active probing, while substantially reducing the measurement overhead versus a heavy-weight approach using pervasive traceroutes. If a monitor finds that a destination has become unresponsive, it reports the destination as a target for triggered traceroutes.

I design the ping monitors to discover as many reachability problems as possible, while reducing the number of spurious traceroutes sent to prefixes that are in fact reachable or are experiencing only transient problems. When a particular vantage point finds an address to be unresponsive, it reprobes two minutes later. If the address does not respond six times in a row, the vantage point identifies it as a target for reachability analysis, triggering distributed traceroutes to the prefix. I found that delaying the reprobes for a few minutes eliminates most transient problems, and I conducted a simple measurement study that found that the chance of a response on a seventh probe after none on the first six is less than 0.2%. Hubble sends traceroutes from 30 vantage points to perform its reachability analysis. The number of vantage points is fairly arbitrary but is intended to allow for probes from most geographic regions represented on PlanetLab, without sending excessive packets towards a destination experiencing problems. Hubble issues traceroutes in 15 minute rounds, and so the aggregate traceroute rate to a destination is comparable to the aggregate ping rate to a destination. Because 30 traceroutes to a destination entail around 500 total probe packets, a 0.2% chance of a response to further pings means that (in expectation) it requires fewer packets to trigger traceroutes immediately, rather than sending more pings and only sending traceroutes if the pings fail.
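The re-probe discipline described above amounts to a small per-target loop at each monitor. The sketch below is illustrative: the ping and report_candidate helpers are placeholders for Hubble's probing and triggering machinery, not its actual interfaces.

import time

REPROBE_INTERVAL = 120   # seconds between re-probes of an unresponsive target
FAILURES_TO_TRIGGER = 6  # consecutive failures before triggering traceroutes

def monitor_target(dst, ping, report_candidate):
    for attempt in range(FAILURES_TO_TRIGGER):
        if ping(dst):
            return  # the target responded; nothing to report this round
        if attempt < FAILURES_TO_TRIGGER - 1:
            time.sleep(REPROBE_INTERVAL)  # wait two minutes, then re-probe
    # Unresponsive six times in a row: hand the prefix to reachability analysis.
    report_candidate(dst)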

A central controller periodically coordinates the ping monitors, such that, on average, within any two minute period, one vantage point should send an initial ping to each destination, and up to five

vantage points may send reprobes to the destination. Once a day, the system examines performance logs, replacing monitors that frequently fall behind or report improbably many or few unresponsive destinations. Hubble thus regularly monitors every destination.

Passive BGP Monitoring

Hubble uses BGP information published by RouteViews [87] to continually monitor BGP routing updates from more than 40 sources. Hubble maintains a BGP snapshot at every point in time by incorporating new updates to its current view. Furthermore, it maintains historical BGP information for use in problem detection and analysis.

Hubble uses BGP updates for a prefix as an indicator of potential reachability problems for that prefix. In some cases, reachability problems trigger BGP updates, such as when BGP explores other paths after a prefix becomes unreachable. In other cases, a misconfigured router can advertise an incorrect BGP path, resulting in BGP updates that precede a reachability problem. Hubble therefore uses BGP updates to generate further targets for traceroutes, complementing its ping tests. Specifically, it selects those prefixes for which the BGP AS path has changed or been withdrawn at multiple vantage points. Hubble uses BGP monitoring to try to discover outages that ping monitoring misses. In particular, because of limited probing capacity at PlanetLab sites and because of a need to limit how often a destination receives probe packets, each PlanetLab vantage point takes multiple hours to progress through the entire target list. Therefore, a short outage that affects only a small number of vantage points may not be discovered by ping monitoring alone.
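The selection rule, prefixes whose AS path changed or was withdrawn at multiple vantage points within a quarter-hourly batch, can be sketched as follows. The update and state representations are illustrative, not the format of the RouteViews feeds Hubble consumes.

from collections import defaultdict

MIN_VANTAGE_POINTS = 2  # require the change to be visible from multiple feeds

def bgp_targets(batch, previous_paths):
    # `batch` holds (vantage_point, prefix, as_path) tuples for one 15-minute
    # window, with as_path=None for a withdrawal; `previous_paths` maps
    # (vantage_point, prefix) to the AS path seen in the prior batch.
    changed_at = defaultdict(set)
    for vp, prefix, as_path in batch:
        if as_path != previous_paths.get((vp, prefix)):
            changed_at[prefix].add(vp)  # path changed or was withdrawn here
        previous_paths[(vp, prefix)] = as_path
    return [p for p, vps in changed_at.items() if len(vps) >= MIN_VANTAGE_POINTS]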

3.1.4 Real-time Reachability Analysis

Given a list of targets identified by the ping and BGP monitors, Hubble triggers traceroutes and integrates information from up-to-date BGP tables to assess the reachability of the target prefixes.

Triggered Traceroutes

Constantly performing traceroutes to every prefix is both inefficient and impractical. However, we do not want to sacrifice the level of detail exposed by traceroutes regarding actual routing behavior in the Internet, especially since such detail can then be used to localize the problem. Hubble strikes a balance by using triggered traceroutes to target prefixes identified by either the passive BGP monitor or the active ping monitors, plus prefixes known to be experiencing ongoing reachability problems. So, as long as a routing problem visible from the PlanetLab vantage points persists, Hubble will continually reprobe the destination prefix to monitor its reachability status.

Every 15 minutes, Hubble triggers traceroutes to the destinations on the target list from 30 PlanetLab nodes distributed across the globe. I limit these measurements to only a subset of PlanetLab nodes. Traceroutes from over 200 PlanetLab hosts within a short time span might be interpreted by the target end-hosts as denial of service (DoS) attacks. Each of the 30 hosts sends traceroutes to the targets in an independent random order over the course of the 15 minute period, and Hubble then collects the results. In an early version of Hubble, with target selection based only on RouteViews updates, which are published in quarter-hourly batches, I discovered a surprising number of long-lasting problems and decided that this coarseness was still interesting for my study of unavailability. I then based my design on techniques that suffice for this granularity, with probing in quarter-hourly rounds.

Analyzing Traceroutes to Identify Problems

In this section, I describe how Hubble identifies that a prefix is experiencing reachability problems. The analysis uses the triggered traceroutes, combined with Hubble’s passive routing view as obtained from RouteViews.

Since Hubble chooses as probe targets a single .1 in each of the suspect prefixes, we cannot be assured of a traceroute reaching all the way to the end-host, even in the absence of reachability problems. In some cases, a host with the chosen .1 address may no longer exist, may be offline,

or may stop responding to ICMP probes. Hence, I take a conservative stance on when to flag a prefix as being unreachable from a specific source. Hubble flags a path as working if the last hop on the traceroute is in the origin AS for the prefix. It does this by mapping the last hop seen in the traceroute to its prefix and then to the AS originating that prefix in the BGP snapshot. A prefix may have multiple origins [141], in which case Hubble classifies the path as working if it includes a hop from any of the origin ASes. Since border routers often have IP addresses from the adjoining ASes, Hubble also classifies a path as working if the last IP address on the traceroute has an alias (i.e., another IP address that belongs to the same router) that belongs to the destination’s origin AS.

Note that, because I define reachability based on the origin AS for a prefix in routing tables, Hubble ignores prefixes that are not present in any of the BGP table feeds it monitors. These prefixes are easily classified as unreachable just by observing BGP messages. Further, note that the system’s reachability check matches against any of the BGP paths for the prefix, rather than the particular path visible from the source of the traceroute. This is because Hubble issues traceroutes from PlanetLab nodes, while it gets its BGP data from RouteViews’ vantage points, and the two sets are disjoint.

Traceroutes may occasionally not reach the destination for reasons that have little to do with the overall reachability of the target prefix, such as short-lived congestion on a single path or problems near the source. As described in Section 3.1.1, Hubble flags a prefix as experiencing reachability problems worth monitoring only if less than 90% of the triggered probes to it reach the origin AS.
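The reachability check and the 90% threshold can be summarized in a short sketch. The origin_ases and aliases helpers are assumptions standing in for Hubble's prefix-to-origin mapping (from the RouteViews snapshot) and its alias information; they are not part of the system's actual interface.

def path_reaches_origin(traceroute_hops, dst_ip, origin_ases, aliases):
    # A path "works" if its last responsive hop, or an alias of that hop, lies
    # in an AS that originates the destination's prefix (there may be several).
    if not traceroute_hops:
        return False
    last = traceroute_hops[-1]
    dst_origins = origin_ases(dst_ip)
    return any(origin_ases(ip) & dst_origins for ip in {last} | aliases(last))

def has_reachability_problem(traceroutes, dst_ip, origin_ases, aliases):
    # Flag the prefix if fewer than 90% of the triggered probes reach the origin AS.
    reached = sum(path_reaches_origin(tr, dst_ip, origin_ases, aliases)
                  for tr in traceroutes)
    return reached < 0.9 * len(traceroutes)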

Isolating Failure Direction using Spoofed Probes

Consider Figure 3.2, with a traceroute from monitor V1 reaching through provider A, and a traceroute from monitor V2 terminating with its last hop in B. One might assume that the problem is between B and the origin, but it could also be a problem on the reverse path to V2. With just the forward path information supplied by traceroute, these cases are indistinguishable. Hubble employs source-spoofed probes to differentiate the cases and provide much more specific information about the failure. A source-spoofed probe (henceforth referred to as a spoofed probe) is one in which the

[Figure omitted: V1 reaches the origin AS D through Provider A, while V2's traceroute stops in Provider B.]

Figure 3.2: An instance in which Hubble can use spoofed probes to isolate the direction of the failure. The arrows depict traceroutes from Hubble’s vantage points. While the measurements initially suggest a problem in B, the problem could actually be on the reverse path back to V2 (which may not even traverse B). By having V1 ping D spoofing as V2, Hubble uses V1’s working path to D to test D’s reverse path to V2 in isolation. Similarly, by having V2 spoof as V1, the system can test V2’s forward path in isolation.

prober sets the source address in the packet to one other than its own. Note that Hubble only ever spoofs packets using the source address of one of the other vantage points.

To determine why V2 cannot reach D, monitor V1 pings D with the source set as V2. These probes reach D along V1’s forward path. If the responses to these probes reach V2, then the reverse path from D to V2 must work, and Hubble determines that the failure is on V2’s forward path. Otherwise, V2 pings D with the source set as V1. If V2’s forward path works, then the responses to these probes should reach V1, and Hubble determines that the failure is on the reverse path back to V2. A central controller coordinates the spoofing, assigning, for instance, V1 to probe D spoofing as V2, then fetching the results from V2 for analysis. Because Hubble only draws conclusions from the receipt of spoofed probes, not their absence, it does not draw false conclusions if V2 does not receive the probe for other reasons or if the controller is unable to fetch results.
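The controller's decision procedure is simple because it only trusts received replies. The sketch below is illustrative; spoofed_ping and received are placeholder helpers for issuing a spoofed probe from a vantage point and for fetching whether a vantage point collected a reply.

def isolate_direction(v_working, v_failed, dst, spoofed_ping, received):
    # Use v_working's functioning forward path to test D's reverse path to
    # v_failed in isolation.
    spoofed_ping(prober=v_working, dst=dst, spoof_as=v_failed)
    if received(v_failed, dst):
        return "forward"  # the reverse path works, so v_failed's forward path failed
    # Use replies addressed to v_working to test v_failed's forward path in isolation.
    spoofed_ping(prober=v_failed, dst=dst, spoof_as=v_working)
    if received(v_working, dst):
        return "reverse"  # the forward path works, so the reverse path to v_failed failed
    return "unknown"      # no reply received: draw no conclusion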

At the time of the Hubble study, the PlanetLab kernel did not allow spoofed probes to be sent, and so spoofing was not fully integrated into Hubble. In this chapter, I provide results using a parallel deployment on RON [4]. Since the time of that study, I received technical support and permission

[Figure panels omitted: two diagrams showing vantage points X, Y, and Z; the Cal. State Univ. Network; USC-ISI, with D = 128.9.112.1 in 128.9.0.0/16; and Cox Communications.]

(a) Traceroutes from vantage points with paths through Cal. State Univ. Network reached D, but traceroutes from other vantage points terminated with their last hops in Cox Communication. It was premature to blame Cox, because the problem could have been on asymmetric reverse paths.

(b) X pinged D, spoofing as Z, to cause D to reply to Z. This probe isolated the reverse path (D to Z) from the forward path (Z to D). Z received D's response, meaning the reverse path worked. The forward path through Cox must not have been working.

Figure 3.3: These figures depict a failure on paths to ISI that Hubble found on October 8, 2007. Because X had a working path to D, Hubble used a spoofed ping sent from X to probe the reverse path from D to Z independent from the forward path from Z to D. Z’s receipt of this probe established that the reverse path worked, meaning the problem was on the forward path through Cox.

from PlanetLab to spoof from PlanetLab sites, and so measurements in later chapters use spoofed probes from PlanetLab.

3.2 Evaluation

Hubble ran continuously as an automated system for two years, starting in September 2007. I start by giving an example of one of the problems Hubble found. Figure 3.3 illustrates a simplified version of the probes sent by the system. On October 8, 2007, at 5:09 a.m. PDT, one of Hubble’s ping monitors found that 128.9.112.1 was no longer responsive. At 5:13, Hubble triggered traceroutes from around the world to that destination, part of 128.9.0.0/16, originated by AS4, the Information Sciences Institute (ISI) at the University of Southern California. Four vantage points were unable to reach the origin AS, whereas the others reached the destination. All of the failed probes stopped at one of two routers in Cox Communications, one of ISI’s providers, whereas the successful probes traversed other providers, as shown in Figure 3.3(a). In parallel, six of 13 RON vantage points were

unable to reach the destination, with traceroutes ending in Cox, while the other seven RON nodes successfully pinged the destination. As shown in Figure 3.3(b), Hubble launched pings from some of those seven nodes, spoofed to appear to be coming from the other six, and all six nodes received responses from 128.9.112.1. This result revealed that the problems were all on forward paths to the destination, and Hubble determined that Cox was not successfully forwarding packets to the destination. It continued to track the problem until all probes launched at 7:13 successfully reached the destination, resolving the problem after two hours.

In this section, I assess Hubble’s efficacy and coverage. In Section 3.3 I present results of a measurement study conducted using Hubble.

How much of the Internet does Hubble monitor? Hubble selects targets from prefixes seen in BGP announcements collected by RouteViews. Its active ping monitoring includes more than 110,000 prefixes discovered to have pingable addresses, distributed over 92% of the edge ASes, i.e., ASes that do not provide routing transit in any AS paths seen in RouteViews BGP tables. These target prefixes include 85% of the edge prefixes in the Internet and account for 89% of the edge prefix address space. I classify a prefix as non-edge if an address from it appears in any of our traceroutes to another prefix. Hubble pings each of its target prefixes every two minutes.

I next gauge whether Hubble is likely to discover problems Internet users confront. To help me do so, my collaborators collected a sample of BitTorrent users by crawling popular sites that aggregate BitTorrent metadata and selecting 18,370 target swarms. For a month starting December 20, 2007, they repeatedly requested membership information from the swarms. They observed 14,380,622 distinct IPs, representing more than 200 of the nearly 250 DNS country codes. I am interested in whether the routing infrastructure provides connectivity, and so Hubble monitors routers rather than end-hosts. End-hosts are more likely to go offline (and do not represent wider reachability issues when they do), and they are often in prefixes that do not respond to pings. Further, a router generally uses the same path to all prefixes originated by a given AS [72]. Therefore, I assess whether these representative end-users gathered from BitTorrent share origin ASes with routers monitored by Hubble. I find that 99% of these IP addresses belong to ASes containing prefixes monitored by Hubble.

How effective are Hubble’s target selection strategies? To reduce measurement traffic overhead while still finding reachability events, Hubble combines passive BGP monitoring and active pings to select targets for further investigation. As a comparison, suppose a system instead simply used 30 vantage points to send traceroutes to all monitored prefixes every 15 minutes. This scheme would provide as frequent traceroute information during failures as Hubble provides. I refer to this strawman proposal as pervasive traceroute. The total probe traffic sent by Hubble, including pings and traceroutes, is 5.5% of that required by pervasive traceroute. Note that Hubble probes nearly as fast as is possible from PlanetLab sites, and so pervasive traceroute would not be able to monitor nearly as many destinations as Hubble in practice.

Given its much lighter measurement load, I next assess how effectively Hubble’s target selection strategies discover events compared to the alternative proposed above. For this evaluation, I issued traceroutes every 15 minutes for 10 days beginning August 25, 2007, from 30 PlanetLab vantage points to 1500 prefixes, and I compare the reachability problems discovered in these traceroutes with those discovered to the same set of prefixes by Hubble’s BGP- and ping-based target selection. I use the quarter-hourly traceroutes as “ground truth” reachability information. I only consider events that both begin and end within the experiment and only consider events that persist for at least one additional round of probing after they start. There were 1100 such reachability events, covering 333 of the prefixes, with the longest lasting almost 4 days. 236 of the events involved complete unreachability, and 874 were partial. Here and in later sections, I classify a reachability event as being complete if, at any point during the event, none of the traceroute vantage points is able to reach it. Otherwise, the event is partial.
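As a concrete illustration of this classification, the sketch below assumes each event is summarized, per 15-minute round, by the number of vantage points whose traceroutes reached the destination's origin AS; that summary is an assumption of the sketch, not a description of the stored data.

def classify_event(reached_counts_per_round):
    if len(reached_counts_per_round) < 2:
        return None  # transient: did not persist through an additional round of probing
    if any(count == 0 for count in reached_counts_per_round):
        return "complete"  # at some point, no vantage point could reach the destination
    return "partial"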

Figure 3.4 shows the fraction of the events also discovered by Hubble’s target selection strategies, both individually and combined. Individually, active ping monitoring discovered 881 of the problems (79%), and passive BGP monitoring discovered 420 (38%); combined, they discovered 939 (85%). For events lasting over an hour, the combined coverage increases to 95%. The average length of an event discovered by ping monitoring is 2.9 hours, whereas the average length of an event discovered by BGP monitoring and not by ping monitoring is only 0.8 hours.

This experiment yields a number of conclusions. First, BGP monitoring is not sufficient. Related work supports this result [37]. I was surprised at how low BGP-based coverage was; in fact, I had

Figure 3.4: For reachability events discovered in 10 days of quarter-hourly traceroutes, the fraction of events also discovered by Hubble’s target selection (combined, ping monitors alone, and BGP monitor alone), as a function of the minimum duration of the event in hours. While BGP alone catches only a fraction of the total events, Hubble’s techniques combine to be nearly as effective as pervasive traceroute, a heavyweight approach with the same time granularity.

originally intended to only do BGP monitoring, until I found that it discovered too few events. Second, BGP monitoring provides a supplement to ping monitoring, particularly with short events. Because I strive to limit the rate at which I probe destinations, an inherent tradeoff exists between the number of monitors (more yielding a broader viewpoint) and the rate at which a single monitor can progress through the list of ping targets. Short reachability problems visible from only a few vantage points may not be discovered by ping monitors. BGP monitoring often helps in these cases. Third, Hubble discovers many of the outages that pervasive traceroute would discover, while issuing many fewer probes.

How quickly after a problem starts does Hubble discover it? Besides discovering a high percentage of all reachability events, a goal for Hubble is to identify the events in a timely fashion. For the same reachability events as in Figure 3.4, Figure 3.5 shows the delay between when the event is detected by pervasive traceroute’s quarter-hourly probes and when the prefix is identified as a target by Hubble’s target selection. It is possible that Hubble’s monitoring identifies problems before the quarter-hourly traceroutes. In these cases, for ease of readability, I give the delay as 0. I additionally plot events lasting longer than an hour separately to avoid the concern that the large number of events shorter than that might distort Hubble’s performance. The ideal plot in the graph would

Figure 3.5: For reachability events in 10 days of quarter-hourly traceroutes, time from when the traceroutes detect the event until Hubble identifies the prefix as a target (the x-axis gives the time since the start of the event, in hours). Events lasting over an hour are also given separately. Fractions are out of only the events eventually identified, 85% overall and 95% of those longer than an hour. Hubble identifies 73% of events at least as quickly as pervasive traceroute.

be a vertical line at 0; Hubble achieves that for 73% of the events it identifies, discovering them at least as early as quarter-hourly probes. Of the events lasting over an hour, Hubble discovers 96% of them within an hour of the event’s start. So Hubble’s lightweight probing approach still allows it to discover events in a timely fashion, and we can generally trust the duration it gives for the length of an event.

3.3 Characteristics of Reachability Problems on the Internet

After assessing the effectiveness of Hubble in achieving its goals, I now present the results of a measurement study using Hubble to detect and measure reachability problems on the Internet for 3 weeks starting September 17, 2007. Hubble issued traceroutes from 35 PlanetLab sites across 15 countries (though only 30 sites at a time) and deployed ping monitors at 104 sites across 24 countries. In Section 3.1.1, I defined a reachability problem to be when a prefix is reachable from less than 90% of probing sites, and a reachability event is the period starting when I first identify that a prefix is experiencing reachability problems and concluding when its reachability increases to

90% or higher. I consider only events that began and ended during the study and persisted through at least one additional round of probing after being detected.

3.3.1 Case Studies

I present a few case studies as examples of problems Hubble detects. These examples are not meant to be exhaustive in any way.

Example of complete unreachability: For a prefix originated by an AS in Zimbabwe, probes to routers along previously successful paths to the prefix showed that the link to its primary provider seemed to have disappeared, and traffic was being routed through a backup provider. However, all probes terminated in this backup provider, either due to a misconfiguration in the secondary provider or due to the origin AS being down. In subsequent rounds of probing, packets started getting through to the destination only after the link to the primary provider came up again. This type of problem cannot be detected without active measurements, as the backup exported a valid AS path.

Example of partial reachability due to a problem in one provider: Hubble found that all probes to a particular prefix in Hong Kong that went through FLAG Telecom were dropped, whereas those that used other transit ASes reached the destination AS, Hutchinson. Of the 30 traceroutes to this destination, 11 went through FLAG and failed to reach the destination. This observation strongly suggests problems with the FLAG-Hutchinson connection.

Example of partial reachability due to a problem with some paths through a transit AS: When monitoring an AS in Vietnam, probes from 15 of Hubble’s vantage points passed through the Level 3 network, with some of the probes being dropped in Level 3 while others reached the destination. Comparing the failed probes with earlier ones in which all 15 probes through Level 3 were successful, the internal route within Level 3 had changed. In the earlier successful traceroutes, packets reached a router 4.68.120.143 in the Level 3 network and were forwarded to another router 213.244.165.238 (also in Level 3), and then to the destination AS. However, in the failed probes

flagged as a reachability problem, packets reached router 4.68.120.143, which then sent them to another Level 3 router 4.68.111.198, where the traceroutes terminated. This path change could have been due to load balancing or changes in IGP weights, because all the routers on the old path, including router 213.244.165.238, still responded to Hubble’s probes. These responses suggest that either router 4.68.111.198 was misconfigured or that the routing information was not consistent throughout the AS.

3.3.2 Prevalence and Duration

During the three week study, Hubble identified 31,692 reachability events, involving 10,224 distinct prefixes. Of these, 21,488 were cases of partial reachability, involving 6,202 prefixes, and 4,785 prefixes experienced periods of complete unreachability. Hubble detected an additional 19,150 events that either were transient or were still ongoing at the end of the study, involving an additional 6,851 prefixes. Of the prefixes that had problems, 58% experienced only a single reachability event, but 25% experienced more than two and 193 experienced at least 20.

The cases of partial reachability are more interesting. Especially for small ASes originating a single prefix, cases of complete unreachability may indicate that the prefix has simply been taken offline. When a destination is partially reachable, a working physical path exists from at least some portions of the Internet. However, some hosts are unable to find the working routes to the destination, either due to a physical partition, due to a routing policy that restricts the export of the working path, or due to a router that is announcing a route that fails to deliver traffic.

In my experiments, however, we can eliminate physical partitions as an explanation. All Hubble vantage points maintained connectivity with the Hubble controller at the University of Washington throughout the study. So, we can piece together a working physical path from the destination to some vantage point, from that vantage point to the controller, then from the controller to the vantage point that could not reach the destination. Since the problems are not due to physical partitions, either routing policies are eliminating all working paths, or routers are advertising paths that do not work. Policy-induced unreachability and misconfigurations might help explain why BGP proved to be a

Figure 3.6: Duration of reachability events discovered by Hubble during a three week study starting September 17, 2007. The x-axis (duration of the reachability event, in hours) is on a log scale. Many of the events lasted for hours, and some lasted for days.

poor predictor of reachability problems, as seen in Section 3.2. By detouring through the University of Washington, the physical paths I demonstrated around the failures would generally violate routing policy. In Section 3.5, I demonstrate that we can often piece together policy-compliant paths around failures in a similar way. In Chapter 5, I present a technique to switch traffic onto alternate working policy-compliant paths, when they exist.

Figure 3.6 shows the duration of reachability events. More than 60% lasted over 2 hours. From Section 3.2, we know the system has excellent coverage of events this long, but may miss some shorter ones. Still, 19,000 events lasted longer than 2 hours, and 2,940 of the events lasted at least a day. Cases of partial reachability tend to resolve faster, with a median duration of 2.75 hours, 3/4 of an hour shorter than for cases of complete unreachability. Even so, in 1,675 instances a prefix experienced partial reachability for over a day, an astounding violation of the Internet’s goal of global reachability.

3.3.3 Direction Isolation

I conducted a study on the RON testbed [4] to determine how often Hubble’s spoofed probes establish that a problem is unidirectional, with either the forward or reverse path failing while the other

direction continues working. My study used 13 RON nodes, six of which permitted spoofing of source addresses.

I issued pings every half hour for a day from the RON nodes to destinations in all the prefixes known by Hubble to be experiencing reachability problems at that time. I then discarded destinations that were either pingable from all 13 RON nodes or unresponsive from all, as spoofed probes between RON nodes cannot isolate the direction of failures in those cases. For every partially reachable destination D and for each RON node S which did not receive a ping response from D, I chose a RON node V that received a response from D and could send out spoofed probes. I had V send a probe to D with the source address set to S. If S received D’s response, it indicated a working reverse path back from D to S. Thus, there was a problem on the forward path from S to D. Similarly, in cases when a node S did not receive a response from D and was able to send spoofed probes, I had S send out probes to D with the source address set to that of a node V that had received a response from D. If V received D’s response, it demonstrated a working forward path from S to D. Therefore, the problem was on the reverse path from D back to S. I issued redundant probes to account for congestion losses.

I evaluated 25,286 instances in which one RON node did not receive a response from a destination, but another node did receive a response from the destination. In 53% of these cases, spoofing allowed Hubble to determine that the failure was on the forward path, and in 9% it determined the failure to be on the reverse path. The fact that I could only verify a working forward path from the six nodes capable of spoofing limits these results. Looking only at the 11,355 failed paths from sources capable of spoofing, I found the problem to be on the forward path in 47% of cases and on the reverse path in 21%. The remaining 32% may have had failures in both directions, or the probes may have been lost to transient congestion.

I next evaluated the same RON data to determine how often the reachability issues from multiple RON nodes to a particular destination could either be blamed entirely on forward paths to the destination or entirely on reverse paths back from the destination. In each half hour, I considered those targets to which at least one RON node had connectivity and at least three did not. I then

Table 3.1: Out of sets of pings in which at least three vantage points failed to reach the destination, the percentage of sets in which Hubble’s technique using spoofed packets determined that all problems were on the forward path, all on the reverse path, or a mix of both. The table also gives the percentage of sets for which the system could not make a determination.

Class             Forward   Reverse   Mix   Unknown   Total
All nodes         49%       0%        1%    50%       3605
Spoofing nodes    42%       16%       3%    39%       2172

determined, for each target, whether forward paths were responsible for all problems; whether reverse paths were; or whether each failed path could be pinned down to one direction or the other, but it varied across sources. I then repeated the experiment, but considered only sources capable of spoofing and only destinations unreachable from at least three of these sources. The top half of Table 3.1 presents the results. I determined the failing direction in half of the sets of pings, with nearly all of them isolated to the forward direction (note that the 1% difference accounts for sets in which some of the spoofing nodes had reverse path failures while other nodes had forward path ones). When considering just the spoofing nodes, I was able to explain all failures in 61% of sets. In 95% of those sets, the problems were isolated to either reverse or forward paths only, meaning that all nodes had paths to the destination or that the destination had paths to all nodes, respectively.

3.4 Quantifying Unreachability from Amazon EC2

It is possible that the Hubble data is skewed by only monitoring from PlanetLab’s mainly academic sites. To test the prevalence of outages experienced by commercial data centers, I conducted a measurement study using Amazon EC2, a major cloud provider [3]. EC2 presumably has the resources, business incentive, and best practices available for combating Internet outages, whereas previous studies, including Hubble’s, focused on academic testbeds that may experience biased conditions [37, 97, 47]. I show that even EC2 data centers experience many long-term network connectivity problems.

I rented EC2 instances in each of the four available EC2 regions from July 20, 2010 until August 29, 2010. Each vantage point issued a pair of pings every 30 seconds to 250 targets, consisting of

Figure 3.7: For partial outages observed from EC2, the fraction of outages of at most a given duration (solid) and their corresponding fraction of total unreachability (dotted). Note that the x-axis is on a log-scale. I found that more than 90% of outages lasted 10 minutes or less, but that 84% of the total unavailability was due to outages longer than 10 minutes.

five routers each from the 50 highest-degree ASes [128]. I selected the routers randomly from the iPlane topology [58], making sure that each router was from a distinct BGP prefix. I focus on paths to routers in major networks, as these should be even more reliable than paths to end-hosts. I define an outage as four or more consecutive dropped pairs of pings from a single vantage point to a destination. This methodology means that the minimum outage duration I consider is 90 seconds. Because EC2 does not allow spoofing, I could not isolate the direction of failures.
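The outage definition used in this study translates directly into a scan over each vantage point's loss record. The sketch below is illustrative; it assumes the raw data has been reduced to a per-round boolean series indicating whether both pings of that round's pair were dropped.

PROBE_INTERVAL = 30         # seconds between ping pairs
MIN_CONSECUTIVE_LOSSES = 4  # four or more dropped pairs in a row count as an outage

def extract_outages(lost_pairs):
    # Return (start_round, duration_in_seconds) for each outage in the series;
    # four consecutive losses span three intervals, i.e., a 90-second minimum.
    outages, run_start = [], None
    for i, lost in enumerate(lost_pairs + [False]):  # sentinel closes a trailing run
        if lost and run_start is None:
            run_start = i
        elif not lost and run_start is not None:
            length = i - run_start
            if length >= MIN_CONSECUTIVE_LOSSES:
                outages.append((run_start, (length - 1) * PROBE_INTERVAL))
            run_start = None
    return outages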

In 79% of the outages in the EC2 study, some vantage points had connectivity with the destination (one of the routers), while others did not. During these partial outages, the destination is up, and some paths to it work, suggesting that it may be possible to route around the problem. Figure 3.7 shows the duration of the 10,308 partial outages. By comparison, an earlier study found that 90% of outages lasted less than 15 minutes [37]. I also find that most outages are relatively short; more than 90% lasted less than 10 minutes (solid line). However, these short outages account for only 16% of the total unavailability (dotted line). The relatively small number of long-lasting problems accounts for much of the overall unavailability. Delayed protocol convergence does not explain long outages [69], nor does transient loss or congestion.

The Hubble and EC2 studies differ too much for the results to be directly comparable. In particular, Hubble used more vantage points and used traceroutes to test reachability to ASes, whereas EC2 probed much more frequently to a much smaller destination set and used ping to test reachability to particular routers. However, both studies found that most outages were partial and that long-lasting outages contributed substantially to unavailability.

3.5 Assessing Policy-Compliant Alternate Paths

Both the Hubble study and the EC2 study found many long-lasting partial outages, during which a destination was reachable from some vantage points and not others. During both studies, the vantage points maintained connectivity with a controller at the University of Washington. As explained in Section 3.3.2, this connectivity provides potential detours around the outages, establishing that the outages were not due to physical partition. Since the problems are not due to physical partitions, either routing policies are eliminating all working paths, or routers are advertising paths that do not work. Other work has restored routing between a pair of hosts by routing through another end-host [4, 47, 15]. By detouring through the University of Washington or another end-host, the paths around these failures violate the valley-free routing policy [42] – in not making those paths available, routers are properly enacting BGP export policy. If routing policy eliminates all working paths, then the problem will persist until a repair. However, if working policy-compliant paths also exist, it might be possible to switch traffic onto them.

The techniques I present in Chapter 5 rely on the underlying network being physically connected and on the existence of policy-compliant routes around the failure, and so I need to establish that there are long-lasting outages that are not just physical failures or the result of routing export policies. I now consider how often there is a policy-compliant path between a pair of hosts experiencing an outage. I cannot directly answer this question, since BGP policy is opaque. However, I show that alternate paths appear to exist during many failures, and, generally, the longer a problem lasted, the more likely it was that an alternative route existed.

Previous work found that many long-lasting failures occur outside of the end-hosts’ networks [37], and I focus on these types of problems. Every 10 minutes for a week starting September

5, 2011, I issued traceroutes in both directions between all pairs of PlanetLab sites. This setting allows me to issue traceroutes from both ends of every path, and the probes from those end-hosts to other PlanetLab sites give a rich view of other paths that might combine to form alternate routes. I considered as outages all instances in which a pair of hosts were up and had previously been able to send traceroutes between each other, but all traceroutes in both directions failed for at least three consecutive rounds, before working again. As with Hubble, I then discarded any outages in which any of the traceroutes terminated in the destination AS. Once traffic reaches the destination AS, it is a local routing (and likely end-host) issue, and my techniques do not address those issues. I was left with nearly 15,000 outages.

I checked if the traceroutes included working policy-compliant routes around the failures. After the fact, for each outage, I inspected all rounds of failed probes except the first and last one, during which the problem may have been developing or resolving. For each such round, I considered paths from the source to other PlanetLab sites, and paths from other PlanetLab sources to the destination. From these sets of paths, I attempted to find a path from the source that intersected (at the IP level) a path to the destination, such that the spliced path did not traverse the AS in which the failed traceroute terminated. I only considered a spliced path as valid if it would be available under observable BGP export policies. To check export policies, when splicing together a potential path, I only accepted it if the AS subpath of length three centered at the splice point appeared in at least one traceroute during the week [79]. This check suffices to encode the common valley-free export policy [42], thought to be the basic policy in place at most ASes.
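The splicing check can be sketched as follows. The representation is an assumption made for illustration: each traceroute is reduced to a list of (ip, asn) hops, observed_triples holds every length-three AS subpath seen in any traceroute during the week, and failed_as is the AS in which the failed traceroute terminated.

def spliced_alternate_exists(src_paths, dst_paths, observed_triples, failed_as):
    # Index, for each IP on a path toward the destination, the suffix of that
    # path starting at the IP; these are candidate second halves of a splice.
    suffix_at = {}
    for path in dst_paths:
        for i, (ip, _) in enumerate(path):
            suffix_at.setdefault(ip, path[i:])
    for path in src_paths:
        for i, (ip, asn) in enumerate(path):
            if ip not in suffix_at:
                continue
            head, tail = path[:i], suffix_at[ip]
            if failed_as in [a for _, a in head + tail]:
                continue  # the spliced path must avoid the AS blamed for the failure
            # ASes immediately before and after the splice AS (skipping repeats).
            before = next((a for _, a in reversed(head) if a != asn), None)
            after = next((a for _, a in tail if a != asn), None)
            if before is None or after is None:
                continue
            # Valley-free heuristic: accept the splice only if this AS triple was
            # observed in at least one traceroute during the week.
            if (before, asn, after) in observed_triples:
                return True
    return False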

My methodology may fail to identify some valid paths that exist. I only considered routes exposed by PlanetLab, a testbed that has somewhat homogenous routing. I also required that spliced paths intersect at a shared IP address. Two traceroutes might intersect at a router or a PoP without sharing an IP address, and I would not consider this intersection when trying to splice paths. So, it is possible that alternate policy paths existed in cases when my probes did not discover them.

Overall, I found that, for 49% of outages, alternate paths existed for the duration of the failure. Considering just long outages that lasted at least an hour, I found alternate routes in 83% of failures. If an alternate path existed during the first round of a failure, it was likely to persist: in less than 2% of cases was there an alternate path in some rounds and not others.

3.6 Summary

In this chapter, I presented three studies of Internet outages. For the largest study, I developed Hubble, a system to perform continuous and fine-grained probing of the Internet in order to identify reachability problems in real-time on a global scale. Hubble establishes that it is feasible to monitor outages on this scale with a 15 minute granularity. At the core of Hubble’s approach is a hybrid monitoring scheme, combining passive BGP monitoring with active probing of the Internet’s edge prefix space. I estimate that this approach allowed Hubble to discover and monitor 85% of reachability problems, while issuing only 5.5% of the measurement traffic required by an alternative brute force approach that probes all destinations with the same 15 minute granularity that Hubble uses once it discovers a problem. To broaden and support my conclusions, I supplemented the Hubble results with two other studies, one from EC2 and one from PlanetLab.

Results from the studies motivate the goals, requirements, and design of a new approach to addressing outages on the Internet:

The Internet suffers from many outages: In a three week study, Hubble identified persistent reachability problems affecting more than 10,000 distinct prefixes.

Long-lasting outages contribute substantially to unavailability: One in five of the outages discovered by Hubble lasted over 10 hours. Similarly, the EC2 study showed that even commercial data centers experience such problems. Long outages contribute significantly to unavailability, and so we need to address them to substantially improve availability.

Many of the outages are not discoverable from available public BGP collectors: Only 38% of the problems discovered by Hubble showed up in BGP updates from RouteViews. This result implies that it is necessary to monitor the data plane to discover the problems, and it suggests that connectivity is failing in cases when BGP routes are available.

Many of the outages are partial: Two-thirds of the problems found by Hubble were cases of partial reachability. The EC2 study also found many cases of partial outages. These results suggest the possibility of routing around the failure. Also, as my use of spoofing to isolate the direction of failures demonstrates, it is possible to use measurements along the working paths to learn more about the failures.

Many failures are unidirectional: In most cases, when a partially reachable destination was not pingable from a particular Hubble vantage point, Hubble was able to isolate the problem to either a failed forward path to the destination or a failed reverse path back to the source, while the other direction continued to work. Since a working path in one direction will generally be policy-compliant in the other direction, this finding is further evidence that valid alternatives may exist. It also means that techniques to locate failures need to address asymmetry and unidirectional failures.

Working policy-compliant alternate paths often seem to exist: My study of failures between PlanetLab nodes found that hosts experiencing a failure may actually have a valid alternate path around the failure. Because these other paths avoid the AS where the failing paths terminate, the possibility exists to restore reachability via a BGP reroute. To do so, we need a way to determine which AS is causing the failure and make other ASes choose paths that do not use it.

Chapter 4

REVERSE TRACEROUTE

Traceroute is a simple and widely used Internet diagnostic tool. Operators use it to investigate routing failures and performance problems [122]. Researchers use it as the basis for Internet maps [5, 78, 111], path prediction [78], geolocation [131, 61], AS performance analysis [83], and anomaly detection [140, 138, 136]. My system Hubble uses it.

However, traceroute has a fundamental limitation – it provides no information about the reverse path, despite the fact that policy routing and traffic engineering mean that paths are generally asymmetric [51]. As Richard Steenbergen, CTO for nLayer Communications, presented at a recent tutorial for network operators on troubleshooting, “the number one go-to tool is traceroute,” but “asymmetric paths [are] the number one plague of traceroute” because “the reverse path itself is completely invisible” [122].

This invisibility hinders operators. As described in Section 2.4, it limited Google’s ability to troubleshoot slow paths to clients. As in the example traceroute in Section 2.5, it limits operators’ ability to troubleshoot outages. I presented data in Section 3.3.3 showing that most failures are unidirectional, making it even more evident that we need the ability to measure both directions of a path to understand failures.

Similarly, the lack of reverse path information restricts researchers, as explained in Section 2.3.3. Traceroute’s inability to measure reverse paths forces unrealistic assumptions of symmetry on systems with goals including path prediction [78], geolocation [131, 61], AS performance analysis [83], and prefix hijack detection [140]. Recent work shows that measured topologies miss many of the Internet’s peer-to-peer links [93, 52] because mapping projects [5, 78, 111] lack the ability to measure paths from arbitrary destinations.

A goal of my thesis is to address this basic restriction of traceroute by building a tool to provide the same basic information as traceroute – IP-address-level hops along the path, plus round-trip delay to each – but along the reverse path from the destination back to the source. I have implemented my reverse traceroute system and made it available at http://revtr.cs.washington.edu. Google is reportedly also implementing its own version of the tool for its own use. While traceroute runs as a stand-alone program issuing probes on its own behalf, reverse traceroute is a distributed system comprised of a few tens to hundreds of vantage points, owing to the difficulty in measuring reverse paths. As with traceroute, reverse traceroute does not require control of the destination and hence can be used with arbitrary targets. All it requires of the destination is an ability to respond to probes, the same requirement as standard traceroute. It does not require new functionality from routers or other network components.

My system builds a reverse path incrementally, using a variety of methods to measure reverse hops and stitching them together into a path. It combines the view of multiple vantage points to gather information unavailable from any single one. It starts by measuring traceroutes from the vantage points to the source. This limited atlas of a few hundred or thousand routes to the source serves as a set of baseline paths, allowing reverse traceroute to measure the path from an arbitrary destination by building back the path from the destination until it intersects the atlas.

Reverse traceroute uses three main measurement techniques to build backwards. First, Internet routing is generally destination-based, allowing reverse traceroute to piece together the path one hop at a time. Second, reverse traceroute employs the IP timestamp and record route options to identify hops along the reverse path. Third, reverse traceroute uses limited source spoofing – spoofing from one vantage point as another – to use the vantage point in the best position to make the measurement. This controlled spoofing is similar to Hubble’s and allows reverse traceroute to overcome many of the limitations inherent in using IP options [116, 115, 40], while remaining safe, as the spoofed source address is one of reverse traceroute’s hosts. Just as many projects use traceroute, others have used record route and spoofing for other purposes. Researchers used record route to identify aliases and generate accurate topologies [115], and others used spoofing to measure router processing delays [44]. With reverse traceroute, I am the first to show that the combination of these techniques can be used to measure arbitrary reverse paths.

Experienced users realize that, while traceroute is useful, it has numerous limitations and caveats and can be potentially misleading [122]. Similarly, my tool has limitations and caveats. Section 4.4.1 includes a thorough discussion of how the output of my tool might differ from a direct traceroute from the destination to the source, as well as how both might differ from the actual path traversed by traffic. Just as traceroute provides little visibility when routers do not send TTL-expired messages, my technique relies on routers honoring IP options. When my measurement techniques fail to discern a hop along the path, reverse traceroute falls back on assuming the hop is traversed symmetrically. My evaluation results show that, in the median (mean) case for paths between PlanetLab sites, reverse traceroute measures 95% (87%) of hops without assuming symmetry. The need to assume symmetry in cases of an unresponsive hop points to a limitation of my tool compared to traceroute. Whereas traceroute can often measure past an unresponsive hop or towards an unreachable destination, my tool must sometimes guess that it is measuring the proper path. In Chapter 5, I describe techniques that can sometimes overcome these issues to provide useful information even in the presence of failures.

I rely on routers to be “friendly” to my techniques, yet some of my techniques have the potential for abuse and can be tricky for novices to use without causing disturbances. As I ultimately want my tool widely used operationally, I have attempted to pursue my approach in a way that encourages continued router support. My system performs measurements in a way that emphasizes network friendliness, controlling the probe rate to individual routers, in aggregate across all measurements. I presented the work at NANOG [92] and RIPE [104] conferences, and so far operators have responded positively to supporting my methods (including source spoofing). I believe the goal of wide use is best served by a single, coordinated system that services requests from all users. In Section 7.3.2, I discuss the future work necessary to open my reverse traceroute system up to public use.

I evaluate the effectiveness of my system as deployed today, although it should improve as I add vantage points. In the median (mean) case for paths between PlanetLab sites, as a conservative estimate, my technique reveals at least 87% (83%) of the routers and 100% (94%) of the PoPs, compared to a traceroute issued from the destination. Paths between public traceroute servers and

PlanetLab show similar results. Because my technique requires software at the source, my evaluation is limited to paths back to PlanetLab nodes I control.

I believe my reverse traceroute system can be useful in a range of contexts, to help address performance and availability problems, as well as to help us understand the Internet. In this chapter, I provide three illustrative examples. I present a case study of how a content provider could use my tool to troubleshoot poor reverse path performance. I also uncover thousands of links at core Internet exchange points that are invisible to current topology mapping efforts. I use my reverse traceroute tool to measure link latencies in the Sprint backbone network with less than a millisecond of error, on average. In Chapter 5, I describe how reverse traceroute can be used as a key building block in a system to locate failures and restore connectivity.

4.1 Basic Reverse Traceroute Techniques

My goal is a reverse path tool equivalent to traceroute. Like traceroute, reverse traceroute should work universally without requiring control of a destination, and it should work using only the features of Internet routers and protocols as they exist today. The reverse traceroute tool should return IP addresses of routers along the reverse path from a destination back to the source, as well as the round-trip delay from those routers to the source. In this section, I describe the basic techniques I use to measure a reverse path. Then, in Section 4.2, I describe other techniques I use to increase the accuracy, coverage, and efficiency of my reverse traceroute system.

I assume the availability of a set of vantage points distributed around the Internet. I design the system to work with tens or hundreds of vantage points, a small number relative to the tens of thousands of ASes and hundreds of thousands of BGP prefixes. I use these vantage points in several ways: to precompute an atlas of Internet routes from the vantage points to destinations around the Internet, to measure a set of paths from the vantage points to the reverse traceroute source, and to issue directed probes to uncover pieces of the reverse path.

At a high level, the source requests a path from my system, which coordinates probes from the source and from a set of distributed vantage points to discover the path. First, distributed vantage

[Figure panels omitted: three diagrams showing vantage points V1, V2, and V3, source S, destination D, and intermediate routers R1 through R4.]

(a) Vantage points traceroute to S, creating an atlas of known paths.

(b) Vantage points measure path from D until it intersects a path in the atlas.

(c) Combine to yield complete path.

Figure 4.1: High-level overview of the reverse traceroute technique. Sections 4.1.1 and 4.1.2 explain how to measure from D back to the atlas.

points issue traceroutes to the source, yielding an atlas of paths towards it (Fig. 4.1(a)). This atlas provides a rich, but limited in scope, view of how parts of the Internet route towards the source. Reverse traceroute uses this limited view to bootstrap measurement of the desired path. Because Internet routing is generally destination-based, the system assumes that the path to the source from any hop in the atlas is fixed (over short time periods) regardless of how any particular packet reaches that hop. Once the path from the destination to the source reaches a hop in the atlas, it uses the atlas to derive the remainder of the path.

Second, using techniques explained in Sections 4.1.1 and 4.1.2, reverse traceroute measures the path back from the destination incrementally until it intersects this atlas (Fig. 4.1(b)). The assumption that routing is destination-based also enables this hop-by-hop path stitching; once reverse traceroute knows that the path from D goes through R, it needs only determine the route at R towards S when attempting to discover the next hop. This assumption produces a valid path even in cases of packet-, flow-, and destination-based load balancing, as long as R does not use source-specific routing. Essentially, each intermediate router R reverse traceroute finds on the path can become the new destination for a reverse traceroute back to the source.
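The stitching process reduces to a simple loop once the measurement techniques are abstracted away. In this sketch, atlas_path_to_src and next_reverse_hop are illustrative placeholders: the first returns the already-measured tail from a hop back to the source when that hop appears in the atlas, and the second applies the techniques of Sections 4.1.1 and 4.1.2 to find one more reverse hop.

def reverse_traceroute(dst, src, atlas_path_to_src, next_reverse_hop):
    path = [dst]
    current = dst
    while True:
        tail = atlas_path_to_src(current)
        if tail is not None:
            return path + tail[1:]  # the atlas supplies the remainder of the path to src
        nxt = next_reverse_hop(current, src)
        if nxt is None:
            return path  # could not measure further; the system may assume symmetry here
        path.append(nxt)
        # Destination-based routing: treat the newly found hop as the next target.
        current = nxt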

Finally, as shown in an example in Section 4.1.3, reverse traceroute merges the two components of the path – the destination-specific part from the destination until it intersects the atlas and the atlas-derived path from this intersection back to the source – to yield a complete path (Fig. 4.1(c)).

4.1.1 Identify Reverse Hops with IP Options

Reverse traceroute uses two basic measurement primitives, the Record Route and Timestamp IP options, to measure segment by segment until intersecting the atlas. IP options are standard options defined as part of the Internet Protocol. They can be enabled on any IP packet and instruct routers to process the packet in particular ways. Whereas TTL values are reset by the destination, restricting traceroute to measuring only on the forward path, a destination replying to a ping generally reflects any IP options in its reply, so routers along both the forward and reverse path process the options. I briefly explain how the two options I use work:

IP Record route option (RR): With this option set, a probe records the router interfaces it encounters. The IP standard limits the number of recorded interfaces to nine; once those fill, no more interfaces are recorded. If RR slots remain when the destination sends its response, then routers on the reverse path will record some of that route.

Reverse traceroute uses RR to gather reverse hops directly if the destination is within eight hops. Figure 4.2(a) depicts this probe, which I refer to as RR-Ping(S→D). The source S issues a ping to D with the RR option enabled. This ping allows a limited measurement of the reverse path, as long as the destination is fewer than 9 hops from the source. While it might seem as though this probe would be enough to build reverse traceroute, the average path on the Internet is approximately 15 hops [58]. So, RR-Ping(S→D) is just one component of the system.
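To make the mechanics concrete, the sketch below shows how an RR-Ping(S→D) could be issued from the source. It uses the Scapy packet library, which is an assumption of this illustration rather than part of the deployed system; it requires root privileges, and the destination shown is a documentation-range placeholder, not a real target.

```python
#!/usr/bin/env python3
"""Minimal sketch of RR-Ping(S -> D) using Scapy (an assumption of this
illustration, not the deployed prober). Requires root privileges; probe only
hosts you are permitted to measure."""
from scapy.all import IP, ICMP, IPOption_RR, sr1

def rr_ping(dst, timeout=2):
    # ICMP echo request carrying a Record Route option with nine empty slots.
    probe = IP(dst=dst, options=[IPOption_RR(routers=["0.0.0.0"] * 9)]) / ICMP()
    reply = sr1(probe, timeout=timeout, verbose=False)
    if reply is None:
        return None
    # The destination echoes the option; slots left unused on the forward path
    # are filled by routers on the reverse path (Fig. 4.2(a)).
    for opt in reply[IP].options:
        if getattr(opt, "option", None) == 7:     # option type 7 = Record Route
            return [r for r in opt.routers if r != "0.0.0.0"]
    return []

if __name__ == "__main__":
    print(rr_ping("192.0.2.1"))   # placeholder address; substitute a permitted target
```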

IP timestamp option (TS): IP allows probes to query a set of specific routers for timestamps. Each probe can specify up to four IP addresses, in order; if the probe traverses the router matching the next IP address that has yet to be stamped, the router records a timestamp. The addresses are ordered, so a router will not timestamp if its IP address is in the list but is not the next one.

The timestamp option allows reverse traceroute to confirm whether a particular router is on the reverse path, using what I refer to as TS-Query-Ping(S→D | D, R). As shown in Figure 4.2(b), the source S issues an ICMP ping probe to D with the timestamp query enabled for the ordered pair of IP addresses D and R. R will record its timestamp only if it is encountered by the probe after D

(a) S sends a record route ping. The header includes slots for 9 IP addresses to be recorded (1). If the packet reaches D with slots remaining, D adds itself (2), and routers on the reverse path fill the remaining slots (3). (b) S sends a timestamp ping, asking first for D to provide a stamp if encountered, then for R to provide one (1). If D supports the timestamp option, it fills out a timestamp (2). Because the timestamp requests are ordered, R only fills out a timestamp if encountered after D, necessarily on the reverse path (3). (c) Vantage point V sends a record route ping to D, spoofing as S (1). D replies to S (2), allowing S to discover that R is on the reverse path (3). Reverse traceroute uses this technique when S is out of record route range of D, but V is close enough.

Figure 4.2: Three measurement techniques that allow us to establish that R is on the reverse path from D back to S. In §4.2, I give two techniques, akin to (c), that use spoofing to overcome limitations in timestamp support. Because reverse traceroute builds a reverse path incrementally, D can either be the original target destination or some intermediate router already discovered to be along the path.

has stamped the packet. In other words, if S receives a timestamp for R, then it knows R appears on the reverse path. For reverse traceroute’s purposes, the value of the timestamp is meaningless; it just cares whether or not a particular router processes the packet. Thus, if reverse traceroute guesses a router on the return path, the TS option can confirm the hypothesis.
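For concreteness, the snippet below constructs the raw IP timestamp option (RFC 791, option type 68, flag 3 for prespecified addresses) that a TS-Query-Ping(S→D | D, R) would carry. The helper name and the example addresses are illustrative assumptions, not the system's actual code.

```python
import socket
import struct

def ts_query_option(*prespecified):
    """Build a raw IP Timestamp option (RFC 791, type 68, flag 3: prespecified
    addresses). For TS-Query-Ping(S -> D | D, R), pass (D, R): because the
    entries are ordered, a stamp recorded for R proves R was traversed after D,
    i.e., on the reverse path. Hypothetical helper for illustration only."""
    assert 1 <= len(prespecified) <= 4, "at most four prespecified addresses fit"
    length = 4 + 8 * len(prespecified)      # 4-byte header + 8 bytes per entry
    header = struct.pack("!BBBB",
                         68,                # option type 68 = Internet Timestamp
                         length,
                         5,                 # pointer: first entry begins at octet 5
                         3)                 # oflw = 0, flg = 3 (prespecified)
    entries = b"".join(socket.inet_aton(a) + struct.pack("!I", 0)
                       for a in prespecified)
    return header + entries

# Example: ask D (203.0.113.9) and then R (198.51.100.7) to stamp; both are
# documentation-range placeholder addresses.
opt = ts_query_option("203.0.113.9", "198.51.100.7")
```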

Reverse traceroute uses existing network topology information – specifically IP-level connectivity of routers from Internet topology maps [58] – to determine candidate routers for the reverse path. Routers adjacent to D in the topology are potential next hops. Reverse traceroute uses timestamp query probes to check whether any of these potential next hops is on the path from D to S.

There are some caveats to using record route and timestamp. For example, from any particular source, only a fraction of targets will be reachable within record route's limit of nine hops. In Section 4.1.2, I explain how reverse traceroute uses multiple vantage points to overcome this limitation.

Section 4.1.3 provides an example of using IP options and multiple vantage points to measure a reverse path. Other caveats exist. Some ASes filter and drop probes with IP options set. Further, some routers do not process the IP options in the prescribed manner. Section 4.2 explains how reverse traceroute can overcome these limitations in many cases by carefully orchestrating multiple vantage points and/or multiple probes.

4.1.2 Record Route with Spoofed Source Addresses

As with Hubble, reverse traceroute uses a limited form of spoofing, where it replaces the source address of a probe with the actual source of the reverse path it is measuring. This form of spoofing is an extremely powerful measurement tool. When V probes D spoofing as S, D’s response will go to S; I refer to V as the spoofer and S as the receiver. This method allows the probe to traverse the path from D to S without having to traverse the path from S to D and without having a vantage point in D’s prefix. I could hypothetically achieve a similar probe trajectory using loose source routing (from V to S, but source routed through D) [95]. However, a source-routed packet can be identified and filtered anywhere along the path, and such packets are widely filtered [8], too often to be useful in my application.1 On the other hand, if a spoofed packet is not ingress filtered near the spoofer, it thereafter appears as a normal packet. I can use a source capable of spoofing to probe along any path.

This arrangement allows reverse traceroute to use the most advantageous vantage point with respect to the particular measurement it wants to perform. Hubble uses limited spoofing to check one-way reachability; reverse traceroute uses it to overcome limitations of IP options. Without spoofing, RR’s 9 IP address limit restricts it to being useful only when S is near the target. However, as shown in Figure 4.2(c), if some vantage point V is close enough to reach the target within eight RR slots, then reverse traceroute can probe from V spoofing as S to measure addresses on the path back to S. Similarly, spoofing can bypass problematic ASes and machines, such as those that filter timestamp-enabled packets or those that do not correctly implement the option.
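As an illustration of the spoofed variant, the sketch below shows a vantage point issuing an RR ping to D with the source address rewritten to S, so that D's reply (and the reverse hops recorded in it) arrives at S, as in Fig. 4.2(c). Scapy is again an assumed stand-in for the system's own prober, and spoofed packets must only be sent from networks that explicitly permit it.

```python
"""Sketch of a spoofed RR-Ping: vantage point V probes D while spoofing S's
address, so D's reply arrives at S. Illustrative only; send spoofed packets
only where the hosting network allows it."""
from scapy.all import IP, ICMP, IPOption_RR, send

def spoofed_rr_ping(spoof_as, dst, ttl=11):
    # TTL 11 mirrors the limit discussed in Section 4.2.4: probes from distant
    # vantage points expire before reaching the target.
    probe = IP(src=spoof_as, dst=dst, ttl=ttl,
               options=[IPOption_RR(routers=["0.0.0.0"] * 9)]) / ICMP()
    send(probe, verbose=False)    # the response is collected at spoof_as, not here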

1 Existing work [40, 115] and my own study in Section 4.4.2 show comparatively wide support for timestamp and record route options.

Although spoofing is often associated with malicious intent, I use it in a very controlled, safe fashion. Nefarious uses generally spoof as an address outside the control of the spoofer. Reverse traceroute only spoofs as a source under its control, avoiding any issues of concealment or unwanted traffic. A source requests a reverse path measurement, then receives responses to probes sent by vantage points spoofing as the source. No harm can come from the source causing itself to receive measurement packets. Since some ASes filter all spoofed packets sent by end-hosts within them [12], reverse traceroute tests from each host and only sends further spoofed probes where allowed. My systems have been issuing spoofed probes for over three years without complaint.

Roughly one-third of PlanetLab sites currently allow spoofing, with the number slowly increasing from around 20% over the course of this dissertation. This result is fairly representative; the Spoofer project tested 12,000 clients and found that 31% could send spoofed packets [12]. Even if filtering increases, I believe, based on positive feedback from operators, that the value of my service will encourage an allowance (supported by router ACLs) for a small number of strategically-placed measurement nodes to issue spoofed probes using a restricted set of ports. Measurement Lab (M-Lab) [85] could provide part of such a deployment. M-Lab is a distributed platform for measurement tools to enhance Internet transparency. Of the 16 current M-Lab sites, 15 can spoof. It is likely that a higher percentage of M-Lab sites can spoof, as compared to PlanetLab or the Spoofer project results, because the M-Lab nodes are hosted in commercial data centers. Another possible approach is to have routers rate limit spoofed options packets (just as with UDP probes) and filter spoofed probes sent to broadcast addresses, thereby reducing the security concerns without diminishing their utility for network measurements.

4.1.3 Example of Incrementally Building a Reverse Traceroute

IP option-enabled probes, coupled with spoofing as S from another vantage point, give reverse traceroute the ability to measure a reverse hop from D on the path back to S. It uses these probes to stitch together the path incrementally until it reaches a router R that is on a path from some vantage point V to S. Reverse traceroute then infers the rest of D's reverse path from R onward as being the same as V's.

(a) Vantage points traceroute to S, creating an atlas of known paths. (b) S sends an RR-Ping to D (1), but all the RR slots fill along the forward path (2), so S does not learn any reverse hops in the reply. (c) V3 is closer to D, so it sends an RR-Ping spoofing as S (1). D records its address (2), and the remaining slot fills on the reverse path, revealing R1 (3). (d) A vantage point V2 close to R1 sends an RR-Ping spoofing as S (1), discovering R2 and R3 on the reverse path (2). (e) Reverse traceroute uses an Internet map to find routers adjacent to R3, then sends each a TS-Query-Ping to verify which is on the reverse path. (f) When the path intersects a path in reverse traceroute's traceroute atlas, reverse traceroute assumes the rest of the path from D follows that route.

Figure 4.3: Incremental construction of a reverse path using diverse information sources.

Figure 4.3 illustrates how reverse traceroute composes the above set of techniques to determine the reverse path from D to S, when it has control over S and a set of other vantage points

(V1,V2,V3). As described at the beginning of Section 4.1, the system also uses a partial map of router-level connectivity, from iPlane’s offline mapping [58].

Reverse traceroute begins by having the vantage points issue traceroute probes to S (Figures 4.1(a) and 4.3(a)). These probes serve as a baseline set of observed routes towards S that can be used to complete a partially inferred reverse path. Reverse traceroute then issues RR-Ping(S→D) to determine if the source S is within eight RR hops of the destination, i.e., whether a ping probe from S can reach D without filling up its entire quota of 9 RR hops (Figure 4.3(b)).2

2 This is not quite as simple as sending a TTL=8 limited probe, because of issues with record route implementations [115]. Some routers on the forward path might not record their addresses, thereby freeing up more slots for the reverse path, while some other routers might record multiple addresses or might record their address but not decrement or respond to TTL-limited probes.

If the source is within eight RR hops of D, this probe would determine at least the first hop on the reverse path, with further hops recovered in an iterative manner.

If the source is not within eight hops, reverse traceroute determines whether some vantage point is within eight RR hops of D (Section 4.2.4 describes how it does this). Let V3 be one such vantage point. Reverse traceroute then issues a spoofed RR ping probe from V3 to D with the source address set to S (Figure 4.3(c)). This probe traverses the path V3 → D → S and records the IP addresses encountered. In many cases, most of these addresses will be on the forward path from V3 to D, but, since V3 is within eight hops, the probe will discover some hops along the reverse path back to S. In the example, the probe reveals R1 to be on D's path to S. Reverse traceroute then iterates over this process, with the newly found reverse hop as the target of its probes. For instance, it next determines a vantage point that is within eight RR hops of R1, which could be a different vantage point V2. Reverse traceroute uses this new vantage point to issue a spoofed RR ping probe to determine the next hop on the reverse path (Figure 4.3(d)). In some cases, a single RR ping probe may determine multiple hops, as in the illustration with R2 and R3.

Now, consider the case where neither S nor any of the vantage points is within eight hops of R3.

In that case, reverse traceroute considers the potential next hops to be routers adjacent to R3 in the known topology. Suppose R4 and R5 are candidates, determined from a precomputed atlas of Internet topology. Reverse traceroute issues timestamp probes from S to verify whether R4 or R5 responds to the timestamp queries TS-Query-Ping(S→R3 | R3, R4) and TS-Query-Ping(S→R3 | R3, R5) (as shown in Figure 4.3(e)). In the example, R4 responds, so it is adjacent to R3 in the network topology and is on the reverse path from R3. Thus, reverse traceroute assumes R4 is the next hop on the reverse path. Reverse traceroute continues to perform incremental reverse hop discovery until it intersects with a known path from a vantage point to S,3 at which point it considers that to be the rest of the reverse path (Figures 4.1(c) and 4.3(f)).

3 Measurement techniques may discover different addresses on a router [116], so reverse traceroute determines intersections using alias data from topology mapping projects [62, 78, 115] and a state-of-the-art technique [11].

Once the procedure has determined the hops in the reverse path, reverse traceroute issues pings from the source to each hop in order to determine the round-trip latencies.

Sometimes, reverse traceroute may be unable to measure a reverse hop using any of its techniques, but I still want it to provide the user with useful information. When reverse traceroute is unable to calculate the next hop on a path, the source issues a standard traceroute to the last known hop on the path. Reverse traceroute then assumes the last link is traversed symmetrically, and it tries to calculate the rest of the reverse path from there. In Section 4.4.1 and Section 4.4.2, I present results showing that reverse traceroute usually does not have to assume many symmetric hops and that, even with this approximation, it still achieves highly accurate paths.
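The overall procedure of Sections 4.1.1–4.1.3 can be summarized as a simple stitching loop. The sketch below is a minimal illustration, with the atlas and the measurement primitives passed in as hypothetical callables; it omits the orchestration, rate limiting, and vantage point selection of the real system.

```python
def reverse_traceroute(src, dst, atlas, measure_reverse_hops, assume_symmetric_hop):
    """Minimal sketch of the incremental stitching loop (not the deployed code).

    atlas                : dict mapping a router to the remainder of a known
                           (traceroute-measured) path from that router to src
    measure_reverse_hops : callable(hop, src) -> list of reverse hops found by
                           RR / spoofed-RR / timestamp probes (may be empty)
    assume_symmetric_hop : callable(hop) -> previous traceroute hop, used as the
                           fallback assumption of a symmetric link
    All three arguments are hypothetical stand-ins for the machinery above.
    """
    path = [dst]
    current = dst
    while current != src:
        if current in atlas:                      # reached a path in the atlas
            path.extend(atlas[current])           # reuse the known remainder
            break
        hops = measure_reverse_hops(current, src)
        if not hops:                              # no technique worked:
            hops = [assume_symmetric_hop(current)]    # assume one symmetric hop
        path.extend(hops)
        # Destination-based routing: the newest hop becomes the new target.
        current = path[-1]
    return path
```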

4.2 Advanced Reverse Traceroute Techniques

For a workable reverse traceroute for the Internet, I needed to add techniques to address limitations in available vantage points and variations in router implementations. The following goals drive these changes:

• Accuracy: The system should be robust to variations in how routers handle IP options.
• Coverage: It should work for arbitrary destinations, irrespective of AS-specific configurations.
• Efficiency: It needs to be selective with its use of vantage points, to introduce as little measurement traffic as possible.

4.2.1 Correcting for Variations in IP Options Support

I next explain how reverse traceroute compensates for two variations in support for the timestamp option. In the first variation, as I will present in Section 4.4.2, 15.5% of addresses respond to timestamp-enabled pings but do not record stamps in their responses. When checking whether R is on the reverse path from D, reverse traceroute normally asks for both D's and R's timestamps, to force R to only stamp on the reverse path; but this approach will not work for destinations (or routers) that do not record stamps.

(a) S sends a timestamp ping to D (1) and receives a reply, but D has not filled out a timestamp (2). (b) S pings D asking for two timestamps from R (3). If it receives two stamps (4), R must be on both the forward and reverse paths. (c) Suppose instead that R only timestamps once (4). It could be on the forward or reverse path. Reverse traceroute uses an extra vantage point to determine which. (d) Reverse traceroute finds a V such that, when V pings D asking for R's timestamp (5), it does not receive a stamp (6). This response suggests that R is not on V's path to D. (e) V spoofs as S, pinging D (7), and S receives a timestamp for R (8). Because the system established that R is not on V's path to D, R must be on the reverse path from D to S.

Figure 4.4: Two examples of how reverse traceroute discovers a reverse hop with timestamp even though D does not stamp, as long as it replies. Because reverse traceroute builds a reverse path incrementally, D can either be the original target destination or some intermediate router already discovered to be along the path. In (b), S received two stamps from R, indicating that R is on both the forward and reverse paths. In (c), S instead received only one stamp, which could be from either the forward or reverse path. (d) and (e) depict how reverse traceroute can isolate the stamp to the reverse path.

Figure 4.4 illustrates how reverse traceroute addresses this behavior. After identifying that D responds without stamping (Fig. 4.4(a)), S pings D asking for two timestamps from R. If S receives no stamps, reverse traceroute concludes that R is not on the path. If it receives two stamps, it concludes that R is on both the forward and reverse paths between S and D (Fig. 4.4(b)). If S receives only a single stamp, R could be on the forward or reverse path (Fig. 4.4(c)). To determine which, reverse traceroute enlists an additional vantage point. Essentially, it finds a vantage point V

which it can establish does not have R on its path to D (Fig. 4.4(d)). Then, V pings D spoofing as S, asking for R’s timestamp (but not D’s). If S receives a stamp for R, it proves R is on the reverse path from D (Fig. 4.4(e)). This technique relies on being able to find a vantage point that does not have R on its path. To establish that reverse traceroute vantage points have sufficiently diverse paths to many destinations, I examined iPlane traceroutes to destinations in 140,000 prefixes and found at least two adjacent hops for 55% of destinations.

Unfortunately, the absence of a timestamp response in V's non-spoofed timestamp ping to D (Fig. 4.4(d)) does not definitively establish that R is not on V's path. R could be on the path but not timestamp a particular packet, because it was rate-limiting its stamps, because the packet was lost due to congestion, or because the packet was load-balanced onto a path that did not include R (but the next probe might be load-balanced through R). In all these cases, R could record a timestamp on subsequent probes. Normally, reverse traceroute uses timestamp to establish that a router is on a path. If a router actually on the path fails to timestamp, the system misses inferring a reverse hop – a false negative. In the technique described in the last paragraph, reverse traceroute instead uses timestamp to establish that R is not on V's path. If R was on V's forward path but failed to record a timestamp, the subsequent probe from V spoofing as S could cause reverse traceroute to falsely conclude that R was on the reverse path to S – a false positive.4 To guard against this possibility, reverse traceroute uses multiple probes to reduce the likelihood of false positives. To test this approach, for each iPlane destination D, each PlanetLab source V, and each router R on V's traceroute to D, I sent a dozen timestamp pings from V to D asking for R's timestamp. Table 4.1 gives the likelihood of receiving a timestamp in the last (12 − n) probes after not receiving one in the first n. In practice, reverse traceroute sends 3 probes to test whether R is on V's path. If none are stamped, the system considers that sufficient evidence that R is likely not on the path.

Table 4.1 suggests that this approach will lead to some false positives, but this is acceptable. Reverse traceroute selected R as a candidate because it was adjacent to D in the network topology. The timestamp from R established that R was on either S's forward or reverse path to D (Fig. 4.4(c)).

4 It would also be difficult to use traceroute to establish that R was not on the path, as an address along the traceroute could be an alias of R, and existing techniques to identify aliases do not provide complete solutions.

Table 4.1: For routers on traceroutes from PlanetLab to iPlane destinations, the chance that the router will timestamp a ping sent to the destination as a function of the number of previous pings it has not timestamped. The more probes unstamped, the less chance a router will stamp future probes.

Consecutive Probes without Timestamp    Chance of Timestamp with More Probes
1                                       10.1%
2                                        6.3%
3                                        4.3%
4                                        3.1%
5                                        2.4%
10                                       0.9%

A false positive will mean that reverse traceroute assumed R was the first hop on the reverse path, when in fact it was likely the last hop on the forward path. In the case when it is unable to determine a hop using timestamp, reverse traceroute assumes that the last hop on the forward path is symmetric, so false positives with timestamp are equivalent to its fallback assumption.

In the second variation, some hosts record stamps in two slots, even if the request only asks for their timestamp once, and even if the second address asked for is provably not on the path [114]. All Linux machines I tested display this bug. As I will present in Section 4.4.2, 5.8% of addresses in the iPlane topology display this pathology. If reverse traceroute fails to recognize these cases, then any address it guesses as being adjacent will appear to have stamped and be declared as being on the reverse path. Reverse traceroute accounts for this extra stamping by testing for R on the reverse path with TS-Query-Ping(S→D | D, R, R). Since I have never observed any hosts stamping two extra slots, if the third slot is stamped then R must have done it. However, many routers will only stamp once, even if asked twice, so, if reverse traceroute receives only two stamps, it needs to decide whether R made the second stamp or D recorded an extra stamp. I initially thought to evaluate these cases by checking whether the values of the first and second stamps were the same. However, I found that 5.4% of extra stamps are actually incremented from the initial stamp, a false positive under the strawman test. Further, the test would lead to false negatives for adjacent nearby routers that stamp the same time as the previous hop. By probing for timestamps to adjacent hops seen in iPlane traceroutes, I found that adjacent routers gave identical stamps in 6.3% of cases.


Figure 4.5: If a timestamp probe from S encounters a filter (1), reverse traceroute can often bypass it by spoofing as S from a different vantage point (2), as long as the filter is just on the forward path.

Reverse traceroute employs an additional test to exclude these false positives and false negatives. If the second timestamp is equal to or one more than the first, reverse traceroute sends a followup probe to D, TS-Query-Ping(S→D | D, X), where X is the address of a machine known not to be on the path (specifically, a desktop in our lab at the University of Washington). If D does not record an extra stamp for X, then reverse traceroute declares R as being on the reverse path.
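The decision logic for this second variation can be sketched as follows; both arguments are hypothetical stand-ins for the probe results described above, not the system's actual interfaces.

```python
def r_on_reverse_path(stamps, d_extra_stamps_off_path):
    """Decision logic for an extra-stamping destination D (sketch).

    stamps: list of timestamp values returned for the ordered request
            TS-Query-Ping(S -> D | D, R, R).
    d_extra_stamps_off_path: callable issuing the follow-up probe
            TS-Query-Ping(S -> D | D, X), with X known to be off-path, and
            returning True if D records an extra stamp for X.
    """
    if len(stamps) >= 3:
        return True            # no host stamps two extra slots, so R stamped
    if len(stamps) == 2:
        first, second = stamps[0], stamps[1]
        if second not in (first, first + 1):
            return True        # second stamp differs too much to be D's extra stamp
        # Ambiguous: decide by testing D against a provably off-path address X.
        return not d_extra_stamps_off_path()
    return False               # only D (or nobody) stamped; R not confirmed
```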

Variations also exist in how routers process the record route option [115]. However, known implementations only vary in whether the router records an IP when TTL-expiring a packet and in which IP address (router alias) the router records. Reverse traceroute does not send TTL-limited packets, and so variations in TTL-expiration behavior do not affect the system. Variations in which alias the router records do not affect how reverse traceroute probes. The system does use alias datasets to decide if paths / measurements intersect, however.

4.2.2 Avoiding Probe Filters to Improve Coverage

I next discuss techniques to improve the coverage of reverse traceroute's measurements. Some networks may filter ICMP packets, and others filter packets with options enabled. Further, some vantage points have all their outgoing packets filtered, while others are filtered along some routes and not others. In the course of measuring a reverse path, if a source attempts a TS or RR measurement and does not receive a response, reverse traceroute retries the measurement with a vantage point (VP) spoofing as the source. As seen in Figure 4.5, if filtering occurs only along the source's

forward path, and the VP’s forward path does not have a filter, the original source should receive the response. Previous results suggest that most filtering occurs in edge ASes [40], and so filters along one forward path often will not occur along another.

I demonstrate the effectiveness of this approach on a small sample of 1000 IP addresses selected at random out of those in the iPlane topology known to respond to timestamp probes. The 1000 destinations include addresses in 662 ASes. I chose 10 spoofing PlanetLab vantage points I found to receive (non-spoofed) timestamp responses from the highest number of destinations. To simulate reverse traceroute sources, I chose one host at each of the 209 working non-spoofing PlanetLab sites. First, each (non-spoofing) PlanetLab source sent a series of six timestamp pings to each of the 1000 destinations. I used redundant probes to account for packet loss due to congestion, rate limiting, or other factors unrelated to permanent filters. Of the 209 sources, 103 received responses from at least 700 destinations; I dropped them from the experiment, as they do not experience significant filtering. Then, each spoofing vantage point sent timestamp pings to each destination, spoofing as each of the remaining PlanetLab hosts in turn. Of these, 63 sources failed to receive any responses to either spoofed or non-spoofed probes; they are completely stuck behind filters or were not working. For the remaining 43 hosts, Figure 4.6 shows how many destinations each host receives responses from, both without and with spoofing. The results show that some sites benefit significantly. In reverse traceroute's timestamp measurements, whenever the source does not receive a response, it retries with 5 spoofers. Since some vantage points have filter-free paths to most destinations, the system uses the 5 best overall, rather than choosing per destination. Spoofing enables many of the nodes that experience widespread filtering to still use timestamps as part of reverse traceroute.

4.2.3 Filtering of Record Route

Similarly, I would like to differentiate between cases when the destination does not respond to record route probes and cases when the probes are being filtered (but might work over a different path to the destination). As explained in the previous section, if a timestamp probe from a source fails to receive a response, reverse traceroute can have a number of vantage points probe the destination at


Figure 4.6: For 43 PlanetLab nodes, the number of destinations (out of 1000) from which the node receives timestamp responses. The graph shows the total number of unique destinations when sending the ping directly and then when also using 10 spoofers. The nodes are ordered by the total number of responding destinations. Other PlanetLab sites were tested but are not included in the graph: 103 did not experience significant filtering and 63 did not receive responses even with spoofing.

once, spoofing as the source. We need not worry about rate limiting – any probe that reaches the destination and receives a response is equivalent.

However, reverse traceroute cannot send record route probes to a target simultaneously from a large number of vantage points. With spoofed record route, only nearby spoofers can find reverse hops, since each packet includes only nine slots. Because many routers rate limit after only a few probes, reverse traceroute cannot send from many vantage points at once, in the hopes that one will prove close enough – the router might drop the probes from the VPs within range. Since reverse traceroute can send only a few probes at a time, I would like to limit the number of sets it has to send before deciding that a destination is simply not responsive.

To set an appropriate threshold, I probed every IP address in the DisCarte topology from every PlanetLab site, and I considered every site that received responses from at least 10,000 destinations (i.e., the site ran properly and did not have a filter near the source). Figure 4.7 shows the fraction of those sites that received responses from the destinations. The graph shows that, if a destination responds to a few vantage points, it most likely responds to many.

Figure 4.7: For each address in the DisCarte topology, the fraction of PlanetLab sites that receive responses when sending record route probes to the address. The graph depicts the distribution of this fraction as a CCDF over the DisCarte addresses. Most addresses replied to either very few sites or to at least half the sites. This result suggests that there is little utility in continuing to probe a target that has not responded to a set of vantage points. In practice, reverse traceroute gives up after trying 10 sites.

In measuring a reverse path, reverse traceroute first attempts a non-spoofed ping, then, if it fails, reverse traceroute attempts spoofed probes from three vantage points at a time. It gives up after three such sets if none yields a response. Making the simplifying assumption that each vantage point is equally likely to receive a response, we would expect to get a response with probability at least 0.89 = 1 − (1 − 0.2)^10 for the 59% of destinations that reply to at least 20% of sites, and probability at least 0.99 for the 52% of destinations that reply to at least 40% of sites. I consider these fractions to be adequate and do not worry about the destinations that reply to fewer vantage points, since exhaustive search would take too long to justify the limited benefit.
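Under that simplifying assumption, the probability of at least one response after the ten probes (one direct attempt plus three sets of three spoofed attempts) is 1 − (1 − p)^10; a two-line check of the figures quoted above:

```python
# Probability of at least one record route response after ten attempts,
# assuming each vantage point independently elicits a reply with probability p.
def p_any_response(p, attempts=10):
    return 1 - (1 - p) ** attempts

print(round(p_any_response(0.2), 2))   # 0.89 for targets answering >= 20% of sites
print(round(p_any_response(0.4), 2))   # 0.99 for targets answering >= 40% of sites
```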

4.2.4 Selective Use of Vantage Points for Scalability

When sending spoofed record route probes, reverse traceroute determines which VPs are likely to be near a router before probing the router, then uses only the ones likely to be nearby. Because Internet routing is generally based on the longest-matching prefix for a destination, a VP close to one address in a prefix is likely close to other addresses in the prefix. Reverse traceroute precomputes the set

of vantage points to use for each prefix. Each week, reverse traceroute harvests the set of router IP addresses seen in the Internet atlas gathered by iPlane and supplements the set with a recent list of pingable addresses [56]. Every VP issues a record route ping to every address in the set. For each address, the system determines the set of VPs that were near enough to discover reverse hops.

Reverse traceroute uses this information in two ways. First, if it encounters one of the probed addresses, it knows the nearest VP to use. Second, if it encounters a new address, the offline probes provide a hint: the group of VPs within range of some addresses in the same prefix as the target might be within range of the target. Given a new address in some prefix, reverse traceroute first tries the vantage point that was within record route range of the most atlas addresses from that prefix. In the case of multiple vantage points within range of the maximal number, reverse traceroute favors the one that measured the most reverse hops overall from addresses in the prefix (i.e., was closest in aggregate to the prefix). However, since some prefixes span many addresses and include many routers, a different vantage point may ingress into the prefix in a different location and be close to a different set of addresses than the first vantage point is close to. So, reverse traceroute next tries the vantage point that is within record route range of the most addresses in the prefix that were not close to the first vantage point. The system continues selecting vantage points in this manner until it has selected at least one near to every iPlane address in the prefix. In this way, reverse traceroute generates the ordered list of vantage points to try when probing a new address in the prefix. Note that reverse traceroute's selection process is an instance of the standard greedy algorithm for the well-known set cover optimization problem – for each prefix, each vantage point defines a set consisting of the addresses near it.
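A minimal sketch of this per-prefix greedy selection is shown below, assuming a precomputed mapping from each vantage point to the atlas addresses it can reach within record route range; it omits the aggregate-closeness tie-breaking described above.

```python
def order_vps_for_prefix(coverage):
    """Greedy set-cover ordering of vantage points for one prefix (sketch).

    coverage: dict mapping each vantage point to the set of atlas addresses
              in the prefix that it can reach within record route range.
    Returns the order in which to try vantage points for a new address in the
    prefix: each pick covers as many still-uncovered addresses as possible.
    """
    uncovered = set().union(*coverage.values()) if coverage else set()
    order = []
    remaining = dict(coverage)
    while uncovered and remaining:
        best = max(remaining, key=lambda vp: len(remaining[vp] & uncovered))
        if not remaining[best] & uncovered:
            break                      # no vantage point covers anything new
        order.append(best)
        uncovered -= remaining.pop(best)
    return order

# Example with hypothetical vantage points and addresses:
cov = {"vp1": {"a", "b", "c"}, "vp2": {"c", "d"}, "vp3": {"d"}}
print(order_vps_for_prefix(cov))       # ['vp1', 'vp2']
```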

For a representative day, Figure 4.8 shows the coverage achieved by given numbers of VPs per prefix. The system determines the covering VPs for all prefixes, but the graph only includes prefixes for which it probed at least 15 addresses, as it is trivial to cover small prefixes. The graph shows that, for most prefixes, reverse traceroute only needs a small number of VPs. For example, in the median case, a single VP suffices for over 95% of addresses in the prefix, and the system rarely needs more than 4 VPs to cover the entire prefix.

Because only measurements from nearby vantage points yield reverse hops, reverse traceroute sets the initial TTL to 11 for all RR and spoofed RR probes. Probes from distant vantage points will

Figure 4.8: For prefixes in which iPlane observed at least 15 addresses within RR range of at least one vantage point, the fraction of the addresses for which reverse traceroute can find reverse hops using RR probes from a given number of vantage points per prefix. Reverse traceroute chooses the vantage points greedily on a per-prefix basis, so each additional one is within range of as many new addresses in the prefix as possible. Prefixes with few addresses are trivial to cover using a small number of vantage points, so the graph excludes them to clearly show that reverse traceroute still only needs a small number for most prefixes.

expire before reaching the destination, reducing the effects of rate limiting. I use a value of 11 to account for the fact that record route and TTL hop counts may differ from each other [115].

4.3 System Implementation

The reverse traceroute system consists of vantage points (VPs), which issue measurements, a controller, which coordinates the VPs to measure reverse paths, and sources, which request measurements of the paths back to them. I use a local machine at the University of Washington as the controller. When a VP starts up, it registers with the controller, which can then send it probe requests. A source runs my software to issue standard traceroutes, RR-Pings, and TS-Query-Pings, and to receive responses to probes from VPs spoofing as the source. However, it need not spoof packets itself. Currently, the reverse traceroute source software only runs on Linux and (like the ICMP traceroute option traceroute -I) requires superuser permission. The controller receives requests from sources and combines the measurements from VPs to report reverse path information. This centralized controller carefully

controls the rate at which reverse traceroute probes any particular router. It also reuses measurements when a particular source requests multiple destinations.

Reverse traceroute uses topology maps from iPlane [58] to identify adjacent routers to test with TS-Query-Pings (Fig. 4.3(e)). iPlane issues daily forward traceroutes from PlanetLab sites and traceroute servers to around 140,000 prefixes. To increase the set of possible next-hop routers, I consider the topology to be the union of maps from the previous 2 weeks. Since reverse traceroute verifies the reverse hops using option-enabled pings, stale topology information makes my system less efficient but does not introduce error.

4.3.1 Current Deployment

The reverse traceroute deployment in this dissertation uses one host at each of the more than 200 active PlanetLab sites as VPs to build an atlas of traceroutes to a source (Fig. 4.3(a)). Over the course of the study below, more than 60 PlanetLab sites allowed spoofed probes at least some of the time.5 My implementation of reverse traceroute employs one host at each of these sites as spoofing nodes. Routers upstream from the other PlanetLab sites filter spoofed probes, so I did not spoof from these vantage points. Reverse traceroute also uses one host at each of 14 Measurement Lab sites [85] that allow spoofing. Various organizations provide public web-accessible traceroute servers, and reverse traceroute employs 1200 of them [8]. These nodes issue only traceroutes and cannot set IP options, spoof, or receive spoofed probes. Like PlanetLab nodes that do not spoof, reverse traceroute uses them to map paths to sources requesting reverse path measurements.

The system maintains an atlas of these traceroute paths from all vantage points to all recent sources, with the measurements refreshing at a slow rate in the background over time. If a requested reverse traceroute intersects an overly stale atlas traceroute to the requesting source, the system refreshes that traceroute to make sure it still traverses the intersection point. Some areas of the Internet are stable for longer periods of time, and applications like real-time anomaly detection

5 As of the writing of this dissertation, nearly twice as many allow spoofing, and so the coverage and accuracy of the system have likely improved.

require fresher paths than applications like topology discovery. Calculating appropriate adaptive staleness intervals is an area for future research. For now, the system uses a single interval. For the studies in this chapter, I issued all traceroutes in a batch at the beginning of the study (including the traceroutes to which I compare reverse traceroute measurements) and consider them up-to-date for the duration of the studies. Most paths on the Internet are stable for long periods of time [97, 137], and so I do not expect this simplification to have a large effect on my results.

I have currently tested my client software only on Linux machines. Packaging the code for widespread deployment is future work. Because I have dozens of operators asking to use the system, I am being patient to avoid a launch that does not perform up to expectations. In Section 7.3.2, I discuss these future plans.

4.4 Evaluation

To test how well the reverse traceroute system can determine reverse paths, I consider evaluation settings that allow me to compare a reverse traceroute from D to S to a direct traceroute from D to S. A complete evaluation of the accuracy of reverse traceroute would require ground truth information about the path back from the destination. Obviously, I lack ground truth for the Internet, but I use two datasets, one PlanetLab-based and one using public traceroute servers, that allow me to compare to a traceroute from D. For the reverse traceroute, I assume the system does not control D and must measure the path using the techniques described in this chapter. For the direct traceroute comparison, I simply issue a standard traceroute from D to S.

For the PlanetLab dataset, I employed as sources a host at each of 11 PlanetLab sites chosen at random from the spoofing nodes. As destinations, I used one host at each of the 200 non-spoofing PlanetLab sites that were working. Although this small number of destinations cannot be representative of the entire Internet, the destinations include hosts in 35 countries. The measured reverse paths traversed 13 of the 14 transit-free commercial ASes. Previous work observed route load balancing in many of these types of networks [6], providing a good test for the reverse traceroute techniques.

For the traceroute server dataset, I employed as sources a host at 10 of the same PlanetLab sites (one had gone down in the meantime). The 1200 traceroute servers I utilized belong to 186 different networks (many of which offer multiple traceroute servers at different locations). For each source, I chose a traceroute server at random from each of the 186 networks. I then issued a traceroute from the server to the PlanetLab source. Because in many cases I did not know the IP address of the traceroute server, I used the first hop along its path as the destination in the reverse traceroute measurements. When measuring a reverse traceroute from this destination back to the source, I excluded from reverse traceroute all traceroute servers in the same network, to avoid providing it with an unfair advantage. There are tens of thousands of ASes and hundreds of thousands of routable prefixes, so even 1200 traceroute servers do not cover the entire routing space.

One challenge when comparing reverse traceroute and traceroute is that the two techniques may find different IP addresses on the same router. As explained in Section 2.1, routers have multiple IP addresses, known as aliases. Traceroute generally finds the ingress interface, whereas record route often returns the egress or loopback address [116]. To account for this, I consider a traceroute hop and a reverse traceroute hop to be the same if they are aliases for the same router. I use alias data from topology mapping projects [62, 78, 115] and aliases I discovered using a state-of-the-art technique [11]. However, alias resolution is an active and challenging research area, and false aliases in the data could lead to falsely labeling two different routers as equivalent. Conversely, missing aliases could cause me to label as different two interfaces on the same router. Because reverse traceroute uses alias data based on measurements from PlanetLab, the alias data for the PlanetLab dataset is likely more complete than the alias data for the traceroute server dataset. By relying on existing alias resolution techniques, I intend the comparison to be conservative – I only count two IP addresses as matching if alias datasets identify them as belonging to the same router.
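The comparison metric can be sketched as follows, assuming a hypothetical alias table that maps each IP address to a canonical router identifier; addresses with no known aliases count as their own router, which keeps the comparison conservative, as described above.

```python
def fraction_of_hops_matched(traceroute_hops, revtr_hops, alias_of):
    """Fraction of traceroute hops that reverse traceroute also found (sketch).

    alias_of: dict mapping an IP address to a canonical router identifier;
              addresses absent from the dict are treated as their own router,
              so missing alias data can only lower the reported match fraction.
    """
    canon = lambda ip: alias_of.get(ip, ip)
    revtr_routers = {canon(ip) for ip in revtr_hops}
    if not traceroute_hops:
        return 1.0
    matched = sum(1 for ip in traceroute_hops if canon(ip) in revtr_routers)
    return matched / len(traceroute_hops)
```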

4.4.1 Accuracy

How similar are the hops on a reverse traceroute to a direct traceroute from the destination back to the source? For the PlanetLab dataset, the RevTR line in Figure 4.9 depicts the fraction of hops seen on the direct traceroute that are also seen by reverse traceroute. Figure 4.10 shows the

same for the traceroute server dataset. Note that, outside of this experimental setting, I would not normally have access to the direct traceroute from the destination.

Using the available alias data, I find that the paths measured by my technique are quite similar to those seen by traceroute. In the median (mean) case, reverse traceroute measured 87% (83%) of the hops in the traceroute for the PlanetLab dataset. For the traceroute server dataset, reverse traceroute measured 75% (74%) of the hops in the direct traceroute, but 28% (29%) of the hops discovered by reverse traceroute do not appear in the corresponding traceroute.

The figures also compare reverse traceroute to other potential ways of estimating the reverse path. All techniques used the same alias resolution. Researchers often (sometimes implicitly) assume symmetry, and operators likewise rely on forward path measurements when they need reverse ones. The Guess Fwd lines depict how many of the hops seen in a traceroute from D to S are also seen in a traceroute from S to D. In the median (mean) case, the forward path shared 38% (39%) of the reverse path's hops for the PlanetLab dataset and 40% (43%) for the traceroute server dataset.

Another approach would be to measure traceroutes from a set of vantage points to the source. Using iPlane's PlanetLab and traceroute server measurements, the Intersect TR line in Figure 4.9 shows how well this approach works, by assuming the reverse path is the same as the forward path until it intersects one of the traceroutes.6 No system today performs this type of path intersection on-demand for users. When performing the intersection, I exclude any traceroutes from the AS hosting the destination, because in general most networks do not host traceroute servers. For the PlanetLab dataset, in the median (mean) case, this traceroute intersection shared 69% (67%) of the actual traceroute's hops. This result suggests that simply having a few hundred or thousand traceroute vantage points is not enough to reliably infer reverse paths. My system uses novel measurement techniques to build off these traceroutes and achieve much better results.

What are the causes of differences between a reverse traceroute and a directly measured traceroute?

6 I omit the line from Figure 4.10 to avoid clutter.

Figure 4.9: For the PlanetLab dataset, the fraction of hops on a direct traceroute from the destination to the source that reverse traceroute and other techniques also discover. Key labels are in the same top-to-bottom order as the lines.

Although it is common to think of the path given by traceroute as the true path, in reality it is also subject to measurement error. In this section, I discuss reasons traceroute and reverse traceroute may differ from each other and/or from the true path taken. As a starting point, I present results of a study demonstrating that it can be difficult to compare measurements made using different techniques. From each of the destinations in the PlanetLab dataset, I issued both a traceroute and an RR ping to each of the 11 PlanetLab sites used as sources, so the RR ping had the same source and destination as the traceroute (unlike with reverse traceroute's RR probes to intermediate routers). Since the RR slots could fill up before the probe reaches the destination, I only check whether the traceroute matches the portion of the path that appeared in the RR. After alias resolution, the median fraction of RR hops seen in the corresponding traceroute is 0.67, with the other factors described in this section accounting for the differences. This fraction is 0.2 lower than that for reverse traceroute, even though the traceroute and RR probes are sent from the same source to the same destination, showing the difficulty in matching RR hops to traceroute hops.

Incomplete alias information: This lower fraction suggests that many of the differences between the paths found by reverse traceroute and those found by traceroute are due to missing alias information. Most alias identification relies on sending probes to the two IP addresses and comparing the IP-IDs of the responses, to see if they likely came from a shared counter [11].


Figure 4.10: For the traceroute server dataset, the fraction of hops on a direct traceroute from the destination to the source that reverse traceroute and other techniques also discover. Key labels are in the same top-to-bottom order as the lines.

For the PlanetLab dataset, of all the missing addresses seen in a traceroute that are not aliases of any hop in the corresponding reverse traceroute, 88% do not allow for this type of alias resolution [11]. Similarly, of all the extra addresses seen in some reverse traceroute that are not aliases of any hop in the corresponding traceroute, 82% do not allow for alias resolution. For the traceroute server dataset, 75% of the missing addresses and 74% of the extra ones do not allow it. Even for addresses that do respond to alias techniques, reverse traceroute's alias sets are likely incomplete.

In these cases, it is possible or even likely that the two measurement techniques observed IP addresses that appear different but are in fact aliases of the same router. To partially examine how this lack of alias information limits the comparison, I use iPlane's Point-of-Presence (PoP) clustering, which maps IP addresses to PoPs defined by (AS, city) pairs [78]. For many applications, such as the fault location system described in Chapter 5 and the diagnosis of inflated latencies [66], PoP-level granularity suffices. iPlane has PoP mappings for 71% of the missing addresses in the PlanetLab dataset and 77% of the extra ones. For the traceroute server dataset, for which I have less alias information, it has mappings for 79% of the missing addresses and 86% of the extra ones. Figures 4.9 and 4.10 include PoP-Level lines showing the fraction of traceroute hops seen by reverse traceroute, if I consider PoP- rather than router-alias-level comparison. For the PoP-level comparisons, I use the iPlane data to map the IP addresses on the paths to PoPs. If multiple adjacent IP addresses on a path map to the same PoP, I merge them. If an IP address is not in iPlane's PoP mappings, I default to

using alias comparisons for that particular hop. After mapping the addresses, I calculate the fraction of PoPs (and unmapped aliases) on the direct traceroute that are also on the corresponding reverse traceroute. In the median case, the reverse traceroute included all the traceroute PoPs in both graphs (mean = 94%, 84%), and, overall, most of the remaining unmatched hops are ones for which iPlane lacks PoP mappings. If reverse traceroute were measuring a different path than traceroute, then one would expect PoP-level comparisons to differ about as much as alias-level ones. The implication of the measured PoP-level similarity is that, when traceroute and reverse traceroute differ, they usually differ only in which router or interface in a PoP the path traverses. In these cases, reverse traceroute is measuring the correct path, but the alias-level accuracy metric undercounts. As a point of comparison, Figure 4.10 includes a PoP-level version of the Guess Fwd line. In the median (mean) case, it includes only 60% (61%) of the PoPs. The paths are quite asymmetric even at the PoP granularity.
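A sketch of the PoP-level mapping just described, assuming hypothetical dictionaries standing in for the iPlane PoP mapping and the alias datasets, is:

```python
def pop_level_path(hops, pop_of, alias_of):
    """Map a path to PoP granularity (sketch), following the procedure above:
    use the (AS, city) PoP when iPlane maps the address, fall back to the alias
    identifier otherwise, and merge adjacent hops that map to the same label."""
    labels = [pop_of.get(ip, alias_of.get(ip, ip)) for ip in hops]
    merged = []
    for label in labels:
        if not merged or label != merged[-1]:   # collapse same-PoP neighbors
            merged.append(label)
    return merged
```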

Assumptions of symmetry: When reverse traceroute is unable to identify the next reverse hop, it resorts to assuming that hop is symmetric. Reverse traceroute will be unable to measure a reverse hop if the system’s traceroute, record route, and timestamp techniques all fail. Each could fail either because of routers that are configured not to respond or because the system lacked sufficient coverage (an intersecting traceroute, a spoofing vantage point near the target, or knowledge of the correct next hop to test with timestamp). The assumption of symmetry may lead to inaccuracies. For the PlanetLab dataset, if I consider only cases when reverse traceroute measured a complete path without assuming symmetry, in the median (mean) case reverse traceroute matched 93% (90%) of the traceroute alias-level hops. Similarly, for the traceroute server dataset, in the median (mean) case reverse traceroute found 83% (81%) of the traceroute hops. I discuss how often reverse traceroute has to assume symmetry in Section 4.4.2.

Load-balancing and contemporaneous path changes: Another measurement artifact is that traceroute and reverse traceroute may uncover different, but equally valid, paths, either due to following different load-balanced options or due to route changes during measurement. To partly capture these effects, the Next Day lines in Figures 4.9 and 4.10 compare how many of the traceroutes' hops are also on traceroutes issued the following day. For the PlanetLab dataset, 26% of the paths exhibit some router-level variation from day to day. For the traceroute server dataset, 49% of paths changed at the router level and (not shown in the graph) 15% changed at the PoP level. Even the same measurement issued

at a different time may yield a different path, and reverse traceroute and traceroute measurements were issued hours apart.

Hidden or anonymous routers: Previous work comparing traceroute to record route paths found that 16% of IP addresses appear with RR but do not appear in traceroutes [116, 115]. Hidden routers, such as those inside some MPLS tunnels, do not decrement TTL. Anonymous routers decrement TTL but do not send ICMP replies, appearing as ‘*’ in traceroute.

Exceptional handling of options packets: Packets with IP options are generally diverted to a router’s route processor and may be processed differently than on the line card. For example, previous work suggests that packets with options are load-balanced per-packet, rather than per-flow [115].

An additional source of discrepancies between the two techniques is that traceroute and reverse traceroute make different assumptions about routing. Reverse traceroute's techniques assume destination-based routing – if the path from D to S passes through R, from that point on it is the same as R's path to S. An options packet reports only links it actually traversed. With traceroute, on the other hand, a different packet uncovers each hop, and it assumes that if R1 is at hop k and R2 is at hop k+1, then there is a link R1–R2. However, it does not make the same assumption about destination routing, as each probe uses (S, D) as the source and destination. These differing assumptions lead to two more causes of discrepancies between a traceroute and a reverse traceroute:

Traceroute inferring false links: Although I use the Paris traceroute technique for accurate traversal of flow-based load balancers, it can still infer false links in the case of packet-based load balancing [6]. These spurious links appear as discrepancies between traceroute and reverse traceroute, but in reality show a limitation of traceroute.

Exceptions to destination-based routing: With many tunnels, an option-enabled probe will see the entire tunnel as a single hop. With certain tunnels, however, reverse traceroute's assumption of destination-based routing may not hold. When probed directly, an intermediate router inside the tunnel may use a path to the destination other than the one that continues through the tunnel. I conducted a study to show that, at least on the small dataset in question, few paths show violations of

destination-based routing. To do so, I adapted the methodology of an earlier technique to measure how long it takes routers to generate packets [44]. From a host at each of 46 PlanetLab and Measurement Lab sites that allowed spoofing, I sent an RR ping to a host at the University of Washington (UW). Each spoofing host Vi yielded a set of hops H(i,j) for 1 ≤ j ≤ 9 (assuming UW was at least 9 hops from Vi). This probe is the ground truth path from Vi to UW. For each such hop H(i,j), Vi then sent an RR ping to H(i,j) spoofing as UW. This probe is similar to the direct probes to routers that reverse traceroute uses to measure reverse hops. If routers use destination-based routing, then, from Vi onward, the direct RR ping from Vi to UW should follow the same path as this probe. This study includes 163 unique hop IP addresses and 230 unique (Vi, H(i,j)) pairs.

The assumption of destination-based routing usually holds. I compared the next hop after H(i,j) seen by the two measurements. If Vi uses destination-based routing, the spoofed probe should also return H(i,j+1) (or one of its aliases). Overall, 223 (97%) of the 230 (Vi, H(i,j)) pairs had the same next hop. Of the seven that differed, five were within Cogent's network, and the discrepancies were due to either missing alias data or to the paths going through parallel routers in the same PoP. Further, of the seven, only one was different for more than the first hop.

I also calculated the edit distance between each pair of probes. Overall, 179 (77%) of the paths had an edit distance of 0 (were identical), and 18% had an edit distance of 1. Only nine cases had an edit distance of two, and one case had an edit distance of four. Of the ten instances with edit distance greater than one, eight involved either Cogent's JFK PoP or differences within the UW campus. By examination, missing alias data and campus load balancing explained the UW cases. The Cogent cases and the final two cases either were due to missing alias data or may have differed by a single router.
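For reference, the edit distance here is the standard Levenshtein distance over hop sequences; a straightforward implementation is:

```python
def edit_distance(path_a, path_b):
    """Levenshtein distance between two hop sequences (sketch of the metric
    used to compare the direct and spoofed record route measurements)."""
    m, n = len(path_a), len(path_b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if path_a[i - 1] == path_b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[m][n]
```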

At least over this dataset, reverse traceroute’s assumption of destination-based routing seems generally valid.

4.4.2 Coverage

What fraction of hops does reverse traceroute measure without assuming symmetry? In Section 4.4.1, I noted that reverse traceroute's paths are more accurate when the techniques succeed in

Figure 4.11: For the PlanetLab dataset, the fraction of reverse path hops measured, rather than assumed symmetric. The graph includes results with subsets of the reverse traceroute techniques.

measuring the entire path without having to fall back to assuming a link is symmetric. As seen in Figure 4.9, if forced to assume the entire path is symmetric, in the median case we would discover only 39% of the hops on a traceroute. In this section, I investigate how often reverse traceroute’s techniques are able to measure reverse hops, keeping it from reverting to assumptions of symmetry. Using the PlanetLab dataset, Figure 4.11 presents the results for my complete technique, as well as for various combinations of the components of the reverse traceroute technique. The graph shows the fraction of hops in the reverse traceroute that were measured, rather than assumed symmetric.

Reverse traceroute finds most hops without assuming symmetry. In the median path in the PlanetLab dataset, reverse traceroute measured 95% of hops (mean=87%), and in 80% of cases it was able to measure at least 78% of the path without assuming any hops were symmetric. By contrast, the traceroute intersection estimation technique from Figure 4.9 assumes in the median that the last 25% of the path is symmetric (mean=32%). Although not shown in the graph, the results are similar for the traceroute server dataset – in the median case, reverse traceroute measured 95% of hops (mean=92%) without assuming symmetry.

The graph also depicts the performance of reverse traceroute if it does not use spoofed record route pings or does not use timestamping. In both cases, the performance dropped off somewhat.

Figure 4.12: For IP addresses seen in iPlane’s daily Internet map, CDF of number of hops to nearest PlanetLab site. Note that this graph does not include the Measurement Lab nodes used by reverse traceroute, as they were unavailable at the time.

What fraction of routers are within record route range of reverse traceroute vantage points? For reverse traceroute’s spoofed record route technique to be effective, its spoofing vantage points must be near its targets, so that probes will reach their destinations before the RR slots fill. I now show that our set of vantage points is distributed enough to be useful for many paths.

I refer the reader to related work for a detailed evaluation of record route and the challenges in using it [116, 115]. I focus on how likely reverse traceroute is to have vantage points within eight hops of nodes it wants to probe. I consider all addresses observed in one of iPlane's daily Internet maps [78], which include paths from all PlanetLab sites and 500 traceroute servers to destinations in 140,000 prefixes. I issued traceroutes from all PlanetLab sites to all the iPlane addresses, including those observed as intermediate hops on iPlane's traceroutes to its destinations. With reverse traceroute's spoofed RR technique, the key is whether a spoofing vantage point is near the destination or intermediate router, not whether the source is, so these results help characterize the coverage of my system when asked for paths to arbitrary sources. Figure 4.12 shows the percentage of iPlane IP addresses within a given number of hops of a reverse traceroute vantage point. To discover reverse hops, either a spoofing node or the source must be within eight record route hops of the destination, and the closer the node, the more reverse hops discovered with a single probe. To estimate the vantage points' coverage, I analyzed the traceroutes to iPlane addresses and found that 37% of

addresses are within eight hops of at least one spoofing node.7 The graph also presents the traceroute distance to the nearest PlanetLab site, regardless of whether it can spoof. If all PlanetLab sites were to allow spoofing, reverse traceroute would be within eight hops of 58% of all addresses in the dataset, which could lead to a corresponding increase in accuracy. Since some routers respond to traceroute but not record route, and vice versa [115], this measure is not perfect, but it gives a reasonable estimate of PlanetLab coverage.

What fraction of routers support the IP timestamp option? Whereas previous studies assessed support of the record route option [115], little was known about the timestamp option. Although timestamps are a standard option in the IP specification, many routers do not honor them and some ASes filter them. To assess how many routers honor such requests, I performed a measurement study of all IP addresses discovered by iPlane on May 10, 2010 [78]. To generate the data set, I gathered all 351,214 IP addresses observed in iPlane traceroutes from that day, then removed all private addresses and addresses in prefixes whose operators opted out of our experiments. I then issued (options-free) pings to each remaining address and retained the 267,736 addresses that responded. Note that some of these addresses are aliases of each other. I did not attempt to correct the data for differing numbers of interfaces on each router.

For these 267,736 addresses, I sent TS-Query-Ping(S→D | D,X) to test whether D timestamps and whether it records extra stamps. To attempt to avoid filters, I resent the probes from multiple PlanetLab vantage points. Further, to account for packet loss, I sent each measurement five times redundantly from each vantage point. As seen in Table 4.2, I did not receive any responses from 31% of addresses. An additional 15.5% of addresses responded to the probes without recording a timestamp value. As mentioned in Section 4.2.1, 5.8% of addresses exhibited a common faulty implementation, in which the router recorded two stamps on encountering its address, even though the second requested address did not belong to it. The table lists this as Extra Stamp behavior. The remaining 47.7% of addresses correctly responded to these probes with a single timestamp.
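The following sketch illustrates how such responses might be classified into the categories of Table 4.2; it assumes, hypothetically, that each returned probe has already been reduced to the number of timestamp slots the destination filled, which is not how the actual prober represents its results.

def classify_timestamp_behavior(stamp_counts):
    """stamp_counts: one entry per probe that returned (possibly an empty list)."""
    if not stamp_counts:
        return "Unresponsive"
    best = max(stamp_counts)           # most stamps seen across the redundant probes
    if best == 0:
        return "Responsive, Zero Stamps"
    if best == 1:
        return "Stamps Properly"       # stamped only for D itself
    return "Extra Stamp"               # also stamped for X, the known faulty behavior

Tallying the per-address classifications (for example, with collections.Counter) would yield a breakdown in the form of Table 4.2.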

7The results in this study should be considered a lower bound on reverse traceroute's RR coverage. It does not include Measurement Lab nodes; including them would improve the coverage, especially since they tend to be close to the core of the Internet. The number of spoofing PlanetLab sites has also increased significantly since I conducted this study.

Table 4.2: Responsiveness to timestamp probes for a set of 267,736 public addresses discovered by iPlane on May 10, 2010. PlanetLab sites sent each address D probes requesting TS-Query-Ping(S→D | D,X), where X is an IP address known not to be on any of the paths. The table classifies how many addresses D did not respond, how many responded without stamping, how many stamped just for D, and how many stamped for both D and (incorrectly) for X.

Classification              IP Addresses       %
Stamps Properly                  127,706    47.7%
Unresponsive                      83,002    31.0%
Responsive, Zero Stamps           41,422    15.5%
Extra Stamp                       15,606     5.8%
Total                            267,736     100%

4.4.3 Overhead

I assess the overhead of reverse traceroute using the traceroute server dataset from Section 4.4.1, comparing the time and number of probes required by the system to those required by traceroute.

How long does it take to measure a reverse traceroute? The median (mean) time for one of the 10 PlanetLab sites to issue a traceroute to one of the 186 traceroute servers was 5 seconds (9.4 seconds). At the time of these experiments, using the system as described in Section 4.3, the median (mean) time to measure a reverse traceroute was 41 seconds (116.0 seconds), including the time to send an initial forward traceroute (to determine if the destination is reachable and to present a round-trip path at the end).

How many probes does it take to measure a reverse traceroute? For each reverse traceroute measurement, the system sends the initial forward traceroute and a number of options-enabled ping packets, some of which may be spoofed. In cases when it is unable to determine the next reverse hop, it sends a forward traceroute and assumes the last hop is symmetric. In addition, reverse traceroute requires traceroutes to build an atlas of paths to the source, and it uses ongoing background mapping to identify adjacencies and to determine which vantage points are within RR-range of which prefixes.

If I ignore the probing overhead of the traceroute atlas and the mapping, in the median case, the only traceroute required was the initial one (mean=1.2 traceroutes). In the median (mean) case, a reverse traceroute required 2 (2.6) record route packets, plus an additional 9 (21.2) spoofed RR packets. The median (mean) number of non-spoofed timestamp packets was 0 (5.1), and the median (mean) number of spoofed timestamp packets was also 0 (6.5). The median (mean) total number of options packets sent was 13 (35.4). As a point of comparison, traceroute uses around 45 probe packets on average, 3 for each of around 15 hops. At the end of a reverse traceroute, reverse traceroute also sends 3 pings to each hop to measure latency. So, ignoring the creation of the various atlases, reverse traceroute generally requires roughly two to three times as many packets as traceroute.

The atlases represent the majority of the reverse traceroute probe overhead. Reverse traceroute borrows the adjacency information needed for its timestamp probes from an existing mapping service [78]. To determine which spoofing vantage points are likely within record route range of a destination, reverse traceroute regularly issues probes from every spoofer to a set of addresses in each prefix. In the future, I plan to investigate whether I can reduce this overhead by probing only a single address within each prefix.

In many circumstances these atlases can be reused and/or optimized for performance. For example, if the source requests reverse paths for multiple destinations within a short period of time [137], reverse traceroute can reuse the atlas. As an optimization, it may need to only issue those traceroutes that are likely to intersect [78], and it can use known techniques to reduce the number of probes to generate the atlas [34]. While the results in this section focus on the overhead of measuring a single reverse traceroute, Section 5.4.4 presents basic scalability results from a reverse traceroute deployment designed to regularly refresh the reverse paths back from a set of destinations to a set of sources. By reusing probe results across paths and by focusing on refreshing a stale path instead of starting from scratch, that deployment greatly reduces the number of probes required to maintain the traceroute atlas and measure paths. In Section 7.3.2, I propose ideas to further reduce the time and number of probes required by reverse traceroute in the future.


Figure 4.13: For iPlane destinations that are not in DisCarte training data, number of vantage points reverse traceroute needs to send RR pings from before discovering a reverse hop.

Does reverse traceroute efficiently find which vantage points to use? In Section 4.2.4, I described how reverse traceroute uses background training measurements to determine the order of vantage points to spoof from for a particular prefix, and I showed that it took few vantage points to cover a set of training addresses. Now, I demonstrate that the sets learned on a training set allow reverse traceroute to determine quickly which vantage points to use to probe new addresses. I first generated the per-prefix orders based on probes I issued from reverse traceroute's vantage points to addresses in the DisCarte topology. I then found the set of 7823 iPlane destinations that were not in the DisCarte topology, but were in prefixes that appeared in the DisCarte topology. For each destination, I mapped it to its prefix, then used the greedy set cover orderings for that prefix to issue one record route probe at a time until I discovered a reverse hop. As seen in Figure 4.13, for 80% of destinations I discovered a reverse hop after only a single probe, for more than 90% I needed at most 2 vantage points, and there was very little benefit from using more than 5 VPs. For any destination for which the orderings did not produce a reverse hop, I tried the vantage points which were not part of that prefix's set cover solution. I found that, for 97% of the iPlane targets within range of some vantage point, reverse traceroute's set cover orderings suggested a vantage point capable of finding a reverse hop. Reverse traceroute's technique for selecting vantage points allows it to quickly discover these hops without sending excessive probes.
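A minimal sketch of this selection loop follows; the helpers send_spoofed_rr and ip_to_prefix, as well as the data structures, are hypothetical stand-ins for the system's measurement and mapping components rather than its actual interfaces.

def find_reverse_hop(destination, prefix_orderings, all_vps, ip_to_prefix, send_spoofed_rr):
    """Probe with one vantage point at a time, in the prefix's set-cover order,
    falling back to the remaining vantage points if the ordering fails."""
    ordered = prefix_orderings.get(ip_to_prefix(destination), [])
    fallback = [vp for vp in all_vps if vp not in ordered]
    for probes_used, vp in enumerate(ordered + fallback, start=1):
        reverse_hops = send_spoofed_rr(vp, destination)  # RR ping spoofed as the source
        if reverse_hops:
            return reverse_hops, probes_used
    return None, len(ordered) + len(fallback)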

Table 4.3: Traceroute giving forward path from University of Central Florida to 66.79.146.129. The path starts in Florida, goes up to Washington, DC, then comes back down to Florida before going across the country. However, this routing does not explain the jump in latency between the last two hops, suggesting an indirect reverse path.

Hop #   IP address       AS name       Location            RTT
1       132.170.3.1      UCF           Orlando, FL         0ms
2       198.32.155.89    FloridaNet    Orlando, FL         0ms
3       198.32.132.64    FloridaNet    Jacksonville, FL    3ms
4       198.32.132.19    Cox Comm.     Atlanta, GA         9ms
5       68.1.0.221       Cox Comm.     Ashburn, VA         116ms
6       216.52.127.8     Internap      Washington, DC      35ms
7       66.79.151.129    Internap      Washington, DC      26ms
8       66.79.146.202    Internap      Washington, DC      24ms
9       66.79.146.241    Internap      Miami, FL           53ms
10      66.79.146.129    Internap      Seattle, WA         149ms

4.5 Debugging Poor Performance with Reverse Traceroute

As explained in Section 2.4, providers like Google want to optimize client performance by serving clients along low latency paths, but the lack of information about reverse paths back to Google servers from clients hinders these efforts [66].

To illustrate how reverse traceroute can solve such problems, I next describe one example of how I used reverse traceroute to diagnose an instance of path inflation. The round-trip time (RTT) on the path from the PlanetLab node at the University of Central Florida to the IP address 66.79.146.129, which is in Seattle, was 149ms. Table 4.3 shows the forward path returned by traceroute, annotated with the locations of intermediate hops inferred from their DNS names. The path has some circuitousness, going from Orlando to Washington via Ashburn and then returning to Florida. But that does not explain the steep rise in RTT from 53ms to 149ms on the last segment of the path, because a hop from Miami to Seattle is expected to add only around 70ms to the RTT.8

To investigate reverse path inflation back from the destination, Table 4.4 gives the reverse path, as measured by reverse traceroute. The reverse path is noticeably circuitous. Starting from Seattle,

8Interestingly, the latency to Ashburn seems to also be inflated on the reverse path.

Table 4.4: Reverse traceroute giving reverse path from 66.79.146.129 back to the University of Central Florida. The circuitous reverse path back from Seattle explains the huge RTT between the last two hops on the forward path (Miami and Seattle). The third hop, 137.164.130.66 (internap-peer.lsanca01.transitrail.net), is a peering point between Internap and TransitRail in L.A.

Hop #   IP address       AS name       Location            RTT
1       66.79.146.129    Internap      Seattle, WA         148ms
2       66.79.146.225    Internap      Seattle, WA         141ms
3       137.164.130.66   TransitRail   Los Angeles, CA     118ms
4       137.164.129.15   TransitRail   Los Angeles, CA     118ms
5       137.164.129.34   TransitRail   Palo Alto, CA       109ms
6       137.164.129.2    TransitRail   Seattle, WA         92ms
7       137.164.129.11   TransitRail   Chicago, IL         41ms
8       137.164.131.165  TransitRail   Ashburn, VA         23ms
9       132.170.3.1      UCF           Orlando, FL         0ms
10      132.170.3.33     UCF           Orlando, FL         0ms

the path goes through Los Angeles and Palo Alto, and then returns to Seattle before reaching the destination via Chicago and Ashburn. I verified with a traceroute from a PlanetLab machine at the University of Washington that TransitRail and Internap connect in Seattle, suggesting that the inflation was due to a routing misconfiguration. Private communication with an operator at one of the networks confirmed that the detour through Los Angeles was unintentional. Without the insight into the reverse path provided by reverse traceroute, such investigations would not be possible for the organizations most affected by inflated routes.

4.6 Other Applications of Reverse Traceroute

I believe many other opportunities exist for improving systems and studies using reverse traceroute. As discussed in Section 2.3.3, traceroute has a range of uses, but its lack of reverse path visibility limits its applicability. In this section, I present two further examples of how reverse traceroute can be used in practice. I intend these sections to illustrate a few ways in which one can apply the tool to help understand the Internet; they are not complete studies of the problems.


Figure 4.14: Example of reverse traceroute techniques aiding in topology discovery. With trace- routes alone, V1 and V2 can measure only the forward (solid) paths. If V2 is within 8 hops of D1, a record route ping allows it to measure the link AS3-AS2, and a record route ping spoofed as V1 allows it to measure AS3-AS5.

4.6.1 Topology Discovery

Studies of Internet topology rely on the set of available vantage points and data collection points. With a limited number available, routing policies bias what researchers measure. As an example, with traceroute alone, topology discovery is limited to measuring forward paths from a few hundred vantage points to each other and to other destinations. Reverse traceroute allows us to expose many peer-to-peer links invisible to traceroute.

Figure 4.14 illustrates one way in which my techniques can uncover links. Assume that AS3 has a peer-to-peer business relationship with the other ASes. Because an AS does not want to provide free transit, most routes will traverse at most one peer-to-peer link; this policy is an aspect of valley-free routing. In this example, traffic will traverse one of AS3's peer links only if it is sourced from or destined to AS3. V1's path to AS3 goes through AS4, and V2's path to AS3 goes through AS1. Topology-gathering systems that rely on traceroute alone [78, 5, 111] will observe the links AS1-AS3, AS4-AS3, and AS2-AS5. But, they will never traverse AS3-AS5, or AS3-AS2, no matter what destinations they probe (even ones not depicted). V2 can never traverse AS1-AS3-AS5 in a forward path (assuming standard export policies), because that would traverse two peer-to-peer

links. However, if V2 is within 8 hops of D1, then it can issue a record route ping that will reveal AS3-AS2, and a spoofed record route (spoofed as V1) to reveal AS3-AS5.9

Furthermore, even services like RouteViews [87] and RIS [105], with BGP feeds from many ASes, offer limited views and likely miss these links. Typical export policies mean that only routers in an AS or its customers see the AS’s peer-to-peer links. Since RouteViews has vantage points in only a small percentage of the ASes lower in the AS hierarchy, it does not see most peer links [93, 52].

To demonstrate how reverse traceroute can aid in topology mapping, I apply it to a recent study on mapping Internet exchange points (IXPs) [8]. That study used existing measurements, novel techniques, and thousands of traceroute servers to provide IXP peering matrices that were as complete as possible. As part of the study, the researchers published the list of ASes they found to be peering at IXPs, the IXPs at which they peered, and the IP addresses they used in those peerings.

I measured the reverse paths back from those IP addresses to all PlanetLab sites. I discovered 9096 IXP peerings (triples of the two ASes and the IXP at which they peer) that are not in the published dataset, adding an additional 16% to the 58,534 peerings in their study. As one example, reverse traceroute increased the number of peerings found at the large London LINX exchange by 19%. If we consider just the ASes observed peering and not which IXP they were seen at, reverse traceroute found an additional 5057 AS links not in the 51,832 known IXP AS links, an increase of 10%. Of these AS links, 1910 do not appear in either traceroute [78] or BGP [128] topologies – besides not being known as IXP links, reverse traceroute is discovering links not seen in some of the most complete topologies available. Further, of the links in both my data and UCLA's BGP topology, UCLA classifies 1596 as Customer-to-Provider links, whereas the fact that reverse traceroute observed them at IXPs strongly suggests they are in fact Peer-to-Peer links. Although the recent IXP study was by far the most exhaustive yet, reverse traceroute provides a way to observe even more of the topology.

9Note that reverse traceroute's timestamp pings are not useful for topology discovery, because the system only uses them to query for hops already known to be adjacent.

4.6.2 Measuring One-Way Link Latency

In addition to measuring a path, traceroute measures a round-trip latency for each hop. As explained in Section 2.3.3, many research efforts use traceroute to estimate link latencies, even though it cannot accurately measure these delays in the presence of asymmetric routing. Operators also want accurate ways to measure link latencies to aid in troubleshooting [122]. More generally, researchers have proposed combining topology measurements with end-to-end measurements of loss, latency, bandwidth, and other properties, in order to isolate the metrics to individual links [28, 19]. However, this approach, known as network tomography, works best only when the links are traversed symmetrically or when one knows both the forward and reverse paths. Of course, asymmetric routing means that, traditionally, this data has only been available in very limited settings.

In this section, I revisit the problem of estimating link latencies since I now have a tool that provides reverse path information to complement traceroute's forward path information. Given path asymmetry, the reverse paths from intermediate routers likely differ from the end-to-end traceroutes in both directions. Without reverse path information from the intermediate hops back to the hosts, we cannot know which links a round-trip latency includes. Measurements to endpoints and intermediate hops yield a large set of paths, which I simplify using IP address clustering [78]. I then generate a set of linear constraints: for any intermediate hop R observed from a source S, the sum of the link latencies on the path from S to R plus the sum of the link latencies on the path back from R must equal the round-trip latency measured between S and R. I then solve this set of constraints using least-squares minimization, and I also identify the bound and free variables in the solution. Bound variables are those sufficiently constrained to solve for the link latencies, and free variables are those that remain underconstrained.
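The sketch below illustrates the constraint formulation and the least-squares step, assuming each measurement has already been reduced to a (forward_links, reverse_links, rtt) triple over clustered links; it is an illustration of the approach rather than the analysis code used for the Sprint study, and a full treatment of free variables would examine the null space of the constraint matrix.

import numpy as np

def estimate_link_latencies(measurements):
    """measurements: list of (forward_links, reverse_links, rtt) triples."""
    links = sorted({l for fwd, rev, _ in measurements for l in fwd + rev})
    index = {link: i for i, link in enumerate(links)}
    A = np.zeros((len(measurements), len(links)))
    b = np.zeros(len(measurements))
    for row, (fwd, rev, rtt) in enumerate(measurements):
        for link in fwd + rev:
            A[row, index[link]] += 1.0   # each traversal adds the link's latency once
        b[row] = rtt                     # forward plus reverse latencies must sum to the RTT
    solution, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)
    return dict(zip(links, solution)), rank  # rank < len(links) signals free variables

# Example with two synthetic constraints that share the link "a-b":
# estimate_link_latencies([(["a-b"], ["b-a"], 10.0),
#                          (["a-b", "b-c"], ["c-b", "b-a"], 18.0)])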

I evaluate this approach on the Sprint backbone network by comparing against inter-PoP latencies Sprint measures and publishes [121]. I consider only the directly connected PoPs and halve the published round-trip times to yield ground-truth link latencies, for 89 links between 42 PoPs. Forward traceroutes observed 61 of the 89 links, and reverse traceroutes found 79. I use these measurements to formulate constraints on the inter-PoP links, based on round-trip latencies measured

Figure 4.15: Error in estimating latencies for Sprint inter-PoP links. For each technique, I only include links for which it provided an estimate: 61 of 89 links using traceroute, and 74 of 89 using reverse traceroute. Sprint reports its ground truth latencies only to 0.5ms granularity.

from PlanetLab nodes to the PoPs using ping. This set of constraints allows me to solve for the latencies of 74 links, leaving 5 free and 10 unobserved. As a comparison point, I use a traditional method for estimating link latency from traceroutes [78]. For each link in a forward traceroute, I estimated the link latency to be half the difference between the round-trip delay to either end. Then, for each Sprint link, I used the median across all traceroute-derived estimates for that link.
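For contrast, a minimal sketch of this traditional estimate follows, assuming a hypothetical input in which each observation of a link records the round-trip delays to its near and far ends from one forward traceroute.

from statistics import median

def traditional_link_estimates(observations):
    """observations: {link: [(rtt_to_near_end, rtt_to_far_end), ...]}."""
    return {link: median((far - near) / 2.0 for near, far in samples)  # half the RTT difference
            for link, samples in observations.items()}                 # median across traceroutes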

Figure 4.15 shows the error in the latency estimates of the two techniques, compared to the published ground truth. My reverse traceroute-based approach infers link latencies with errors from 0ms to 2.2ms for the links, with a median of 0.4ms and a mean of 0.6ms. Sprint reports round-trip delays with millisecond granularity, so the median “error” is within the precision of the data. The estimation errors using the traditional traceroute method range from 0ms to 22.2ms, with a median of 4.1ms and a mean of 6.2ms – 10x reverse traceroute's worst-case, median, and mean errors. For the dataset studied, using reverse traceroute to generate and solve constraints yields values very close to the actual latencies, whereas the traditional approach does not.

4.7 Summary

Although widely-used and popular, traceroute is fundamentally limited in that it cannot measure reverse paths. This limitation leaves network operators and researchers unable to answer important questions about Internet topology and performance. To solve this problem, I developed a reverse traceroute system to measure reverse paths from arbitrary destinations back to the user. The system uses a variety of methods to incrementally build a path back from the destination hop-by-hop, until it reaches a known baseline path. I believe that my system makes a strong argument for both the IP timestamp option and source spoofing as important measurement tools, and I hope that PlanetLab and ASes will consider them valuable components of future measurement testbeds.

The reverse traceroute system is both effective – in the median case finding all of the PoPs seen by a direct traceroute along the same path – and useful. I show that the tool enables investigations impossible with existing tools, such as tracking down performance problems caused by path inflation along a reverse route. The system's probing methods have also proved useful for topology mapping. In illustrative examples, I demonstrated how the system can discover more than a thousand peer-to-peer links invisible to both BGP route collectors and to traceroute-based mapping efforts, as well as how it can be used to accurately measure the latency of backbone links.

Chapter 5

IMPROVING INTERNET AVAILABILITY WITH LIFEGUARD

In Chapter 3, I presented studies showing that the Internet suffers from many long-lasting outages. Monitoring paths from Amazon's EC2 cloud service, we found that, for outages lasting at least 90 seconds, 84% of the unavailability came from those that lasted over 10 minutes. Section 3.5 showed that policy-compliant reroutes often seem to exist but are not being found. Faults should not persist if alternate routes exist, but router configurations are keeping ASes from discovering paths that can successfully forward data.

When an outage keeps a network from sending or receiving some of its traffic, the network would like to restore connectivity. However, the failure may be caused by a problem outside the network, and available protocols and tools give operators little visibility into or control over routing outside their local networks. Operators struggle to obtain the topology and routing information necessary to locate the source of an outage, since measurement tools like traceroute and reverse traceroute require connectivity to complete their measurements.

Even knowing the location of a failure, operators have limited means to repair or avoid the problem. Traditional techniques for route control give the operators' network direct influence only over routes between it and its immediate neighbors, which may not be enough to avoid a problem in a transit network farther away. Even if one of the techniques does enable rerouting around a particular failure, it may also force networks with working paths to reroute, as the techniques cannot be precisely targeted. Having multiple providers still may not suffice, as the operators have little control over the routes other ASes select to reach it.

To substantially improve Internet availability, we need a way to combat long-lived failures. I believe that Internet availability would improve if data centers and other well-provisioned edge

networks were given the ability to repair persistent routing problems, regardless of which network along the path is responsible for the outage. The edge network is in a position to observe routing disruptions, when it can no longer reach parts of the Internet. If some alternate working policy- compliant path can deliver traffic, the data center or edge network should be able to cause the Internet to use it.

I propose achieving this goal by enabling an edge network to disable routes that traverse a misbehaving network, triggering route exploration to find new paths. While accomplishing this goal might seem to require a redesign of the Internet's protocols, my objective is a system that works today with existing protocols, even if it cannot address all outages. In this chapter, I present the design and implementation of a system that enables rerouting around many long-lasting failures while being deployable on today's Internet. I call my system LIFEGUARD, for Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically. LIFEGUARD aims to automatically repair partial outages in minutes, replacing the manual process that can take hours. Existing approaches often enable an edge AS to avoid problems on its forward paths to destinations but provide little control over the paths back to the AS. LIFEGUARD provides reverse path control by having the edge AS O insert the problem network A into path advertisements for O's addresses, so that it appears that A has already been visited. When the announcements reach A, BGP's loop-prevention mechanisms will drop the announcement. Networks that would have routed through A will only learn of other paths, and will avoid A. Using the BGP-Mux testbed [14] to announce paths to the Internet, I show LIFEGUARD's rerouting technique finds alternate paths 76% of the time.

While this BGP poisoning provides a means to trigger rerouting, I must address a number of challenges to provide a practical solution. LIFEGUARD combines this basic poisoning mechanism with a number of techniques. LIFEGUARD has a subsystem to locate failures, even in the presence of asymmetric routing and unidirectional failures. To accurately locate problems, LIFEGUARD combines active measurement probes from tools including reverse traceroute with historical path and router responsiveness information. My evaluation shows that, even though LIFEGUARD controls only one endpoint of a failing path, it can often locate problems as accurately as if it could assess the path from both ends. I validate my failure isolation approach and present experiments suggesting that, lacking LIFEGUARD's novel measurements, the commonly used traceroute technique

for failure location gave incorrect information 40% of the time. I address how to decide whether to poison: will routing protocols automatically resolve the problem, or do I need to trigger route exploration? I show empirically that triggering route exploration could eliminate up to 80% of the observed unavailability. When it reroutes failing paths to remediate partial outages, LIFEGUARD carefully crafts BGP announcements to speed route convergence and minimize the disruption to working routes. My experimental results show that 94% of working routes reconverge instantly and experience minimal (≤ 2%) packet loss. After rerouting, LIFEGUARD maintains a sentinel prefix on the original path to detect when the failure has resolved, even though live traffic will be routing over an alternate path. When LIFEGUARD's test traffic reaches the sentinel, LIFEGUARD removes the poisoned announcement.

LIFEGUARD can locate and reroute around many failures, but it is not capable of repairing all outages and addressing all the associated issues. I intend the system to serve both as an approach that networks can experiment with today and as a prototype that the community might build upon in the future. By acting as a model, it can highlight both the potential benefits of such a system and the challenges that are better addressed through protocol modifications. In this chapter and in Section 7.3.4, I propose some slight modifications that could enable more systematic approaches to avoiding long-lasting outages.

In Section 5.1, I provide an overview of current approaches and why they do not suffice. Section 5.2 describes both the high-level goal of avoiding outages and my specific approach to restoring connectivity. Section 5.3 addresses practical questions, including how LIFEGUARD locates problems, how LIFEGUARD determines whether to poison a particular problem, and when to stop poisoning it. Section 5.4 presents my evaluation of LIFEGUARD. In Section 5.5, I discuss a case study in which LIFEGUARD locates a failure and then routes around it. Section 5.6 presents two discussion points. I summarize the chapter in Section 5.7.

5.1 Current Approaches to Locating and Remediating Long-Lasting Failures

Currently, operators rely on insufficient techniques to try to locate and resolve long-lasting outages, especially if the failure is in a network outside the operators' control. The lack of better options to

address outages contributes to the duration of these problems. Asymmetric paths leave operators with a limited view even when paths work. Tools like traceroute require bidirectional connectivity to function properly, and so failures restrict their view further. As shown in Figure 2.2, traceroute output can be confusing during failures. Public traceroute servers and route collectors [87, 105] extend the view somewhat, but only a small percentage of networks make them available. As seen in Section 3.2, many outages do not manifest in available feeds. In fact, these challenges mean that operators frequently resort to asking others to issue traceroutes on their behalf to help confirm and isolate a problem [94].

If operators successfully identify a failure outside their own networks, they have little ability to effect repair:

Forward path failures: The source network's operators can select an alternative egress in an attempt to avoid the problem. When choosing, they can see the full BGP paths provided by their neighbors. Each of the source's providers announces its preferred path, and the source is free to choose among them. If the network's providers offer sufficiently diverse paths, the failure may be avoided. For example, I inspected BGP routes from five universities (University of Washington, University of Wisconsin, Georgia Tech, Princeton, and Clemson) [14] to prefixes in 114 ASes. If these universities were my providers, the routes are sufficiently diverse that, if the last AS link before the destination on one of the routes failed silently, I could route around it to reach the destination in 90% of cases by routing via a different provider. In Section 5.4.2, I present an equivalent experiment demonstrating that the techniques I present in this chapter would allow me to avoid 73% of these links on reverse paths back from the 114 ASes, without disturbing routes that did not use that link.1

Reverse path failures: Using traditional techniques, however, having multiple providers may not offer much reverse path diversity. Under BGP, the operators can only change how they announce a prefix to neighbors, perhaps announcing it differently to different neighbors. They have no other direct influence over the paths other networks select. In Section 2.2.1, I explained how operators can change announcements to attempt to influence routes towards the prefix. A major limitation of all

1The 114 ASes were all those that both announce prefixes visible at the universities, needed for the forward path study, and peer with a route collector [105, 87, 99, 1], needed for the reverse study.

the techniques is that they generally act on the next hop AS, rather than allowing a network to target whichever AS is causing a problem. More specifically, MEDs can only be effective if the problem is in the immediate upstream neighbor. Similarly, a destination can use selective advertising to direct traffic away from one of its providers. If the problem is not in the immediate provider, selective advertising may be deficient in two ways. First, all working routes that had previously gone through that provider will change. Second, even if all sources with failing paths had selected routes through a particular provider before selective advertising, the paths may continue to experience the failure even after selective advertising. Advertising a more-specific prefix is a form of selective advertising and suffers from similar limitations. Prepending is a very blunt mechanism, for reasons akin to those mentioned for selective advertising, with the additional caveat that ASes are free to apply their own local preferences and ignore path length as a selection criterion.

BGP communities are a promising direction for future experiments in failure avoidance but do not currently provide a complete solution. Some ASes offer communities that provide behavior that could be used to avoid routing problems. For example, SAVVIS offers communities to specify that a route should not be exported to a particular peer or that the route should be prepended when exported to the peer. Cogent, meanwhile, does not seem to allow targeting of a particular peer, but it does have communities to restrict export to peers on a given continent. However, communities are not standardized across ASes, and some ASes give limited control over how they disseminate routes.

Further, many ASes do not propagate community values they receive [103], and so communities are not a feasible way to notify arbitrary ASes of routing problems. If most ASes forwarded the communities on announcements they received, then an edge network could use communities like the ones offered by SAVVIS to control some routes, without needing to be directly connected to SAVVIS. To demonstrate that communities do not propagate widely through the Internet, I announced five prefixes with a meaningless (but syntactically valid) community set on them, one from each of five U.S. universities [14], and then inspected RouteViews to see how many ASes had the community still attached to the announcements. Of 35 RouteViews peer ASes, between 0 and 4 had the community, depending on the prefix. In general, any AS that used a Tier-1 to reach a prefix did

not have the community on my announcement. Similarly, I looked up my prefixes in Level 3 and AT&T Looking Glass servers and found that the community was no longer part of the announcement.

Since other ASes are free to use arbitrary policy to select paths to the prefix, changes to the announcements and to local configuration may be unable to repair outages. In such cases, operators often must resort to phone calls or e-mails asking operators at other networks for support. These slow interactions contribute to the duration of outages. Eventually, an operator may update configuration to address the problem.

However, without systematic monitoring to detect when the underlying problem resolves or automatic measures to undo it at that point, the change may persist long after it is needed. For example, researchers at Google found that a Japanese AS routed traffic to Google in California long after Google had a Japanese PoP, presumably due to legacy configuration dating from when Google was only present in California [66]. The example in Section 4.5, in which traffic went from Seattle to Los Angeles before returning to Seattle, likely reflects a similar out-of-date configuration.

I now describe my approach to enable automatic repair of failing reverse paths.

5.2 Enabling Failure Avoidance

Suppose an AS O wants to communicate with another AS Q but cannot because of some problem on the path between them. If the problem is within either O or Q, operators at that network have complete visibility into and control over their local networks, and so they can take appropriate steps to remedy the problem. Instead, consider a case in which the problem occurs somewhere outside of the edge ASes, either on the forward path to Q or on the reverse path back to O. Further suppose that O is able to locate the failure and to determine that an alternate route likely exists.2

O would like to restore connectivity regardless of where the problem is, but its ability to do so currently depends largely on where the problem is located. If the problem is on the forward path

2I discuss how LIFEGUARD does this in Section 5.3.

and O’s providers offer suitable path diversity, O can choose a path that avoids the problem. By carefully selecting where to locate its PoPs and which providers to contract with, O should be able to achieve decent resiliency to forward path failures. However, having a diversity of providers may not help for reverse path failures, as O has little control over the routes other ASes select to reach it. As explained in Section 5.1, route control mechanisms like MEDs and selective advertising only let O control the PoP or provider through which traffic enters O. However, these BGP mechanisms give O essentially no control over how other ASes reach the provider it selects.

O needs a way to notify ASes using the path that the path is not successfully forwarding traffic, thereby encouraging them to choose alternate routes that restore connectivity. As a hint as to which paths they should avoid, O would like to inform them of the failure location. AS-level failure locations are the proper granularity for these hypothetical notifications, because BGP uses AS-level topology abstractions. In particular, when one of the notified ASes chooses an alternate path, it will be selecting from AS paths announced to it by its neighbors. Therefore, O needs to inform other ASes of which AS or AS link to avoid, depending on whether the failure is within a single AS or at an AS boundary.

Ideally, I would like a mechanism to let the origin AS O of a prefix P specify this information explicitly with a signed announcement I will call AVOID PROBLEM(X,P). Depending on the nature of the problem, X could either be a single AS (AVOID PROBLEM(A,P)) or an AS link A-B (AVOID PROBLEM(A-B,P)). Note that AS O is only able to directly observe the problem with prefix P; it cannot determine if the issue is more widespread. Announcing this hypothetical primitive would have three effects:

• Avoidance Property: Any AS that knew of a route to P that avoided X would select such a route.

• Backup Property: Any AS that only knew of a route through X would be free to attempt to use it. Similarly, A would be able to attempt to route to O via its preferred path (through B in the case when X is the link A-B).

• Notification Property: A (and B, for link problems) would be notified of the problem, alerting its operators to fix it.

5.2.1 LIFEGUARD’s Failure Remediation

Deploying AVOID PROBLEM(X,P) might seem to require changes to every BGP router in the Internet. Instead, I use mechanisms already available in BGP to perform the notifications, in order to arrive at a solution that is usable today, even if the solution is not complete. A usable approach can improve availability today while simultaneously helping the community understand how we might improve availability further with future BGP changes. I call my approach LIFEGUARD, for Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically.

To approximate AVOID PROBLEM(A,P) on today’s Internet, LIFEGUARD uses BGP’s built-in loop prevention to “poison” a problem AS that is announcing routes but not forwarding packets. To poison an AS A, the origin announces the prefix with A as part of the path, causing A to reject the path (to avoid a loop) and withdraw its path from its neighbors [29, 21]. This withdrawal causes ASes that previously routed via A to explore alternatives. Importantly, the poison affects only traffic to O’s prefix experiencing the problem. By allowing an AS to poison only prefixes it originates, my approach is consistent with the goals of ongoing work that seeks to authenticate the origin of BGP announcements [88]. Proposals to verify the entire path [9] are consistent with the future goal for my approach, in which AVOID PROBLEM(X,P) would be a validated hint from the origin AS to the rest of the network that a particular AS is not correctly routing its traffic. By the time such proposals are deployed, it may be feasible to develop new routing primitives or standardized BGP communities to accomplish what I currently do with poisoning.

Although BGP loop prevention was not intended to give O control over routes in other ASes, it lets me experiment with failure avoidance. In effect, poisoning A implements the Avoidance

Property of AVOID PROBLEM(A,P), giving O the means to control routes to it. A’s border routers will receive the poisoned announcement and detect the poison, a form of the Notification Property.

On its own, poisoning is a blunt, disruptive instrument, a limitation that LIFEGUARD must overcome. Poisoning inserts A into all routes, so even ASes that were not routing through A may undergo route exploration before reconverging to their original route, leading to unnecessary packet loss [69].


Figure 5.1: Routes and routing tables (a) before and (b) after O poisons A to avoid a problem. Each table shows only paths to the production prefix, with the in-use, most-preferred route at the top. Poisoning A for the production prefix causes it to withdraw its route from E and F, forcing E to use its less-preferred route through D and leaving F with only the sentinel. Routes to the sentinel prefix do not change, allowing O to check when the problem has resolved.

Instead of providing the Backup Property, poisoning cuts off ASes that lack a route around A. Poisoning disables all paths through A, even if some work.

In the following sections, I show how LIFEGUARD overcomes what initially seem like limitations of poisoning in order to better approximate AVOID PROBLEM(X,P).

Minimizing Disruption of Working Routes

Inserting an AS to poison an announcement increases AS path length. Suppose that an origin AS O decides to poison A for O's prefix P. The poisoned path cannot be A-O, because O's neighbors need to route to O as their next hop, not to A. So, the path must start with O. It cannot be O-A, because routing registries list O as the origin for P, and so a path that shows A as the origin looks suspicious. Therefore, O announces O-A-O. Experiments found that BGP normally takes multiple minutes to converge when switching to longer paths, with accompanying packet loss to the prefix during this period [69]. This loss would even affect networks with working paths to the prefix.

To poison in a way that shortens and smooths this convergence period, LIFEGUARD crafts steady-state unpoisoned announcements in a way that “prepares” all ASes for the possibility that some AS may later be poisoned. Figure 5.1 provides an example of an origin AS O with a production prefix P which carries real traffic. Figure 5.1(a) depicts the state before the problem, and Figure 5.1(b) depicts the state following a failure, after O has reacted to repair routing.

LIFEGUARD speeds up convergence and reduces path exploration by prepending to the production prefix P's announcement, announcing O-O-O as the baseline path. If O detects that some networks (ASes E and F in Figure 5.1) cannot reach P due to a problem in A, O updates the announcement to O-A-O.3 These two announcements are the same length and have the same next hop, and so, under default BGP, they are equally preferred. If an AS is using a route that starts with O-O-O and does not go through A, then receives an update that changes that route to start with O-A-O instead, it will likely switch to using the new route without exploring other options, converging instantly. I will show in a measurement study in Section 5.4.2 that this prepending smooths convergence, helping ease concerns that an automated response to outages might introduce needless routing instability. This approach is orthogonal to efforts to reduce convergence effects [71, 68, 60], which would benefit LIFEGUARD.
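As a concrete illustration of these two announcements, the sketch below builds the baseline and poisoned AS paths; the ASNs shown are hypothetical examples, and an actual deployment would hand such paths to its BGP speaker (for instance, through the BGP-Mux testbed used in this dissertation) rather than call helper functions like these.

def baseline_path(origin_asn, prepend=3):
    """Steady-state announcement O-O-O, preparing neighbors for a possible later poison."""
    return [origin_asn] * prepend

def poisoned_path(origin_asn, poisoned_asn):
    """Poisoned announcement O-A-O: same length and next hop as the baseline,
    but AS A drops it because of BGP loop prevention."""
    return [origin_asn, poisoned_asn, origin_asn]

# baseline_path(65001)        -> [65001, 65001, 65001]
# poisoned_path(65001, 64512) -> [65001, 64512, 65001]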

Partially Poisoning ASes

LIFEGUARD tries to avoid cutting off an entire AS A and all ASes that lack routes that avoid A. I have three goals: (1) ASes cut off by poisoning should be able to use routes through A to reach O as soon as they work again; (2) if some paths through A work while others have failed, ASes using the working routes should be able to continue using them if they lack alternatives; and (3) when possible, the system should steer traffic from failed to working paths within A.

Advertising a less-specific sentinel prefix. While O is poisoning A, ASes like F that are “captive” behind A will lack a route [17]. To ensure that F and A have a route that covers P, LIFEGUARD

3I only intend O to poison one AS at a time. If future work reveals a need to poison multiple ASes, then O could prepend its baseline path more times.

announces a less-specific sentinel prefix that contains P (and can also contain other production prefixes). When P experiences problems, the system continues to advertise the sentinel with the baseline (unpoisoned) path. As seen in Figure 5.1(b), ASes that do not learn of the poisoned path, because they are “captive” behind A, will receive the less specific prefix and can continue to try routing to the production prefix on it, through A, instead of being cut off. This effect is the Backup

Property desired from AVOID PROBLEM(A,P) and helps achieve goals (1) and (2).

Selectively poisoning to avoid AS links. Although most failures in a previous study were confined to a single AS, 38% occurred on an inter-AS link [37]. I use a technique I call selective poisoning to allow LIFEGUARD, under certain circumstances, to implement AVOID PROBLEM(A-B,P). Poisoning does not provide a general solution to AS link avoidance, but, given certain topologies, selective poisoning can shift traffic within A onto working routes.

Specifically, under certain circumstances, O may be able to steer traffic away from a particular AS link without forcing it completely away from the ASes that form the link. Suppose O has multiple providers that connect to A via disjoint AS paths. Then O can poison A in advertisements to one provider, but announce an unpoisoned path through the other provider. Because the paths are disjoint, A will receive the poisoned path from one of its neighbors and the unpoisoned path from another, and it will only accept the unpoisoned path. So, A will route all traffic to O’s prefix to egress via the neighbor with the unpoisoned path. This selective poisoning shifts routes away from A’s link to the other neighbor, as well as possibly affecting which links and PoPs are used inside A.

Figure 5.2 illustrates the idea. Assume O discovers a problem on the link between A and B2. This failure affects C3, but C2 still has a working route through A, and C4 still has a working route through B2. O would like to shift traffic away from the failing link, without forcing any networks except A to change which neighbor they select to route through. In other words, O would like to announce AVOID PROBLEM(A-B2,P). If O only uses selective advertising without poisoning, announcing its prefix via D1 and not D2, C4's route will have to change. If O poisons A via both D1 and D2, C2 and C3 will have to find routes that avoid A, and A will lack a route entirely (except via a less-specific prefix). However, by selectively poisoning A via D2 and not via D1, O can shift A and C3's routes away from the failing link, while allowing C3 to still route along working paths in A and without disturbing any other routes. Selective poisoning functions like targeted prepending – prepending requires that A use path length to make routing decisions and potentially causes other ASes to shift from using routes through D2, whereas selective poisoning forces only A to change. In Section 5.4.2 I find that selective poisoning lets LIFEGUARD avoid 73% of the links I test.

Figure 5.2: A case in which LIFEGUARD can use selective poisoning. By selectively poisoning A on announcements to D2 and not on announcements to D1, O can influence which routes are selected within A without disturbing routes outside A. In particular, O can cause traffic to avoid the link from A to B2, without disrupting how C3 routes to A or how C1, C2, and C4 route to O.
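To make selective poisoning concrete, the following self-contained sketch (hypothetical ASNs, with provider labels as in Figure 5.2) produces a per-provider announcement map in which only the provider chosen to carry the poison sees the poisoned path.

def selective_poison_announcements(origin_asn, poisoned_asn, poison_via, providers):
    """Return {provider: as_path}, applying the poison only on announcements via poison_via."""
    baseline = [origin_asn] * 3                        # O-O-O
    poisoned = [origin_asn, poisoned_asn, origin_asn]  # O-A-O, same length as the baseline
    return {p: (poisoned if p == poison_via else baseline) for p in providers}

# selective_poison_announcements(65001, 64512, "D2", ["D1", "D2"])
# -> {"D1": [65001, 65001, 65001], "D2": [65001, 64512, 65001]}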

5.3 Applying Failure Avoidance

In the previous section, I described how LIFEGUARD uses BGP poisoning to approximate AVOID PROBLEM(X,P). Poisoning allows me to experiment with failure avoidance today, and it gives ASes a way to deploy failure avoidance unilaterally. In this section, I describe how LIFEGUARD decides when to poison and which AS to poison, as well as how it decides when to stop poisoning an AS.

5.3.1 Locating a Failure

An important step towards fixing a reachability problem is to identify the network or router responsible for it. To be widely applicable and effective, I require my fault isolation technique to: (1) rely

only on deployed protocols and available vantage points; and (2) be accurate even in the face of asymmetric paths and unidirectional failures.

This setting means that some routers may not respond to probes and that I can only assume control of one direction of any path I am monitoring. Further, the probing overhead cannot be more than the available vantage points can support, and the probing rate must not overwhelm routers that rate limit their responses. To overcome these challenges, the technique will have to integrate information from multiple measurement nodes, each with only partial visibility into routing behavior.

I assume that a routing failure between a pair of endpoints can be explained by a single problem. While addressing multiple failures is an interesting direction for future work, this dissertation focuses on single failures.

Overview of Failure Isolation

The failure isolation component of LIFEGUARD is a distributed system, using geographically distributed PlanetLab hosts to make data plane measurements to a set of monitored destinations. Drawing on Hubble's measurements, which showed that many outages were partial, LIFEGUARD uses vantage points with working routes to send and receive probes on behalf of those with failing paths. Because many failures are unidirectional, it adapts techniques from reverse traceroute to provide reverse path visibility. In the current deployment, vantage points send pings to monitor destinations, and a vantage point triggers failure isolation when it experiences repeated failed pings to a destination. This failure detection works like Hubble's ping monitors.

LIFEGUARD uses historical measurements to identify candidates that could potentially be causing a failure, then systematically prunes the candidate set with additional measurements. I outline these steps first before describing them in greater detail; a schematic sketch of the pipeline follows the list.

1. Maintain background atlas: LIFEGUARD maintains an atlas of the round-trip paths between its sources and the monitored targets to discern changes during failures and generate candidates for failure locations.


Figure 5.3: Isolation measurements conducted for an actual outage. With traceroute alone, the problem appears to be between TransTelecom and ZSTTK. Using spoofed traceroute, reverse traceroute, and historical path information, LIFEGUARD determines that the forward path is fine, but that Rostelecom no longer has a working path back to GMU.

2. Isolate direction of failure and measure working direction: After detecting a failure, LIFEGUARD uses Hubble’s spoofed ping technique to isolate the direction of the failure, identifying which measurements to use to locate it (a sketch of this classification follows the list). Further, if the failure is unidirectional, it measures the path in the working direction using one of two measurement techniques.

3. Test atlas paths in failing direction: LIFEGUARD tests which subpaths still work in the fail- ing direction by probing routers on historical atlas paths between the source and destination. It then remeasures the paths for responsive hops, thereby identifying other working routers.

4. Prune candidate failure locations: Routers with working paths in the previous step are eliminated. Then, LIFEGUARD blames routers that border the “horizon of reachability.” This horizon divides routers that have connectivity to the source from those that lack that connectivity.
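The direction test in step 2 reduces to two boolean observations from spoofed pings (Hubble’s technique from Section 3.1.4). The sketch below restates that logic; the function and argument names are illustrative, not the deployed code.

```python
def failing_direction(forward_works, reverse_works):
    """Classify a failure from spoofed-ping results (sketch of step 2).

    forward_works: a ping that S sent to D, spoofed as another vantage point V,
      drew a reply at V, so the forward path S -> D is intact.
    reverse_works: a ping that V sent to D, spoofed as S, drew a reply at S,
      so the reverse path D -> S is intact.
    """
    if forward_works and not reverse_works:
        return "reverse failure"
    if reverse_works and not forward_works:
        return "forward failure"
    if not forward_works and not reverse_works:
        return "bidirectional failure"
    return "no failure observed"

# Figure 5.3 example: probes from GMU reached other vantage points, but none
# reached GMU, so the failure is on the reverse path.
print(failing_direction(forward_works=True, reverse_works=False))  # reverse failure
```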

Description of Fault Isolation

I illustrate the steps that LIFEGUARD uses to isolate failures using an actual example of a failure diagnosed on February 24, 2011. At the left of Figure 5.3 is one of LIFEGUARD’s vantage points, a PlanetLab host at George Mason University (labeled GMU). The destination, belonging to Smartkom in Russia, at the right, became unreachable from the source. For simplicity, the figure depicts hops at the AS granularity.

Maintain background atlas: In the steady state, LIFEGUARD uses traceroute and reverse traceroute to regularly map the forward and reverse paths between its vantage points and the destinations it is monitoring. During failures, this path atlas yields both a view of what recent paths looked like before the failure and a historical view of path changes over time. These paths provide likely candidates for failure locations and serve as the basis for some of the isolation measurements I discuss below. Because some routers are configured to ignore ICMP pings, LIFEGUARD also maintains a database of historical ping responsiveness, allowing it to later distinguish between connectivity problems and routers configured to not respond to ICMP probes.

The figure depicts historical forward and reverse traceroutes with the dotted black and red lines, respectively. The thick, solid black line in Figure 5.3 depicts the traceroute from GMU during the failure. Traceroute can provide misleading information in the presence of failures. In this case, the last hop is a TransTelecom router, suggesting that the failure may be adjacent to this hop, between TransTelecom and ZSTTK. However, without further information, operators cannot be sure, since the probes may have been dropped on the reverse paths back from hops beyond TransTelecom.

Isolate direction of failure and measure working direction: LIFEGUARD tries to isolate the direction of the failure using spoofed pings, applying Hubble’s technique described in Section 3.1.4. In the example, spoofed probes sent from GMU to Smartkom reached other vantage points, but no probes reached GMU, implying a reverse path failure. When the failure is unidirectional, LIFEGUARD measures the complete path in the working direction. Extending the spoofed ping technique, LIFEGUARD sends spoofed probes to identify hops in the working direction while avoiding the failing direction. For a reverse failure, LIFEGUARD finds a vantage point with a working path back from D, then has S send a spoofed traceroute to D, spoofing as the working vantage point. In the example, GMU issued a spoofed traceroute, and a vantage point received responses from ZSTTK and the destination. The blue dashed edges in Figure 5.3 show the spoofed traceroute.
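Concretely, the spoofed traceroute is a series of TTL-limited probes whose source address is the vantage point with a working reverse path. The sketch below uses Scapy to illustrate the packet construction; it is not LIFEGUARD’s prober, it requires raw-socket privileges, and it assumes the sender’s upstream does not filter spoofed packets. The addresses in the comment are placeholders.

```python
from scapy.all import IP, ICMP, send

def send_spoofed_traceroute_probes(dst, spoof_src, max_ttl=30):
    """Send TTL-limited ICMP probes toward dst with a spoofed source address.

    Time Exceeded and Echo Reply responses return to spoof_src (the vantage
    point that can still hear from dst's side of the failure), which reports
    the responding hops back to the controller; the real sender never needs
    a working reverse path from dst.
    """
    for ttl in range(1, max_ttl + 1):
        probe = IP(src=spoof_src, dst=dst, ttl=ttl) / ICMP(id=ttl, seq=ttl)
        send(probe, verbose=False)

# e.g., GMU probing toward the destination while spoofing as a vantage point
# with a working reverse path (placeholder addresses):
# send_spoofed_traceroute_probes("198.51.100.7", spoof_src="203.0.113.9")
```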

If the failure had been on the forward path, the system instead would have measured the working reverse path with a spoofed reverse traceroute from D back to S. Reverse traceroute, as described in Chapter 4, requires two-way connectivity between the source and destination. LIFEGUARD avoids that limitation by sending all probes from its vantage points, spoofing as the source experiencing the problem.

It is useful to measure the working direction of the path for two reasons. First, since the path is likely a valid policy-compliant path in the failing direction, it may provide a working alternative for avoiding the failure. Second, knowledge of the path in the working direction can guide isolation of the problem in the failing direction, as I discuss below.

Test atlas paths in failing direction: Once it has measured the working path, LIFEGUARD measures the responsive portion of the failing direction, as much as the failure allows. For forward and bidirectional failures, the source can simply issue a traceroute towards the destination in order to measure the path up to the failure.

For reverse failures, LIFEGUARD cannot measure a reverse traceroute from D, as such a measurement requires a response from D to determine the initial hops, even with the help of a spoofing node V. Instead, LIFEGUARD has its vantage points, including S, ping: (1) all hops on the forward path from S to D and (2) all hops on historical forward and reverse paths between S and D in its path atlas. These probes test which locations can reach S, which cannot reach S but respond to other vantage points, and which are completely unreachable. LIFEGUARD uses its atlas to exclude hops configured never to respond. For all hops still pingable from S, LIFEGUARD measures a reverse traceroute to S.

In the example, LIFEGUARD found that NTT still used the same path towards GMU that it had before the failure and that Rostelecom no longer had a working path. I omit the reverse paths from most forward hops to simplify the figure. In this case, LIFEGUARD found that all hops before Rostelecom were reachable (denoted with blue clouds with solid boundaries), while all in Rostelecom or beyond were not (denoted with light-gray clouds with dashed boundaries), although they had responded to pings in the past.

Prune candidate failure locations: Finally, LIFEGUARD removes any reachable hops from the suspect set and applies heuristics to identify the responsible hop within the remaining suspects. For forward outages, the failure is likely between the last responsive hop in a traceroute and the next hop along the path towards the destination. LIFEGUARD’s historical atlas often contains paths through the last hop, providing hints about where it is trying to route. LIFEGUARD checks the reachability of these hops with pings to settle on candidate locations.

For a reverse failure, LIFEGUARD considers reverse paths from D back to S that are in its atlas prior to the failure. For the most recent path, it determines the farthest hop H along that path that can still reach S, as well as the first hop H’ past H that cannot. Given that H’ no longer has a working path to S, contacting the AS containing H’ or rerouting around it may resolve the problem.

If the failure is not in H’, one explanation is that, because H’ lacked a route, D switched to another path which also did not work. In these cases, LIFEGUARD performs similar analysis on older historical paths from D, expanding the initial suspect set and repeating the pruning. Since Internet paths are generally stable [137], the failing path will often have been used historically, and there will often be few historical paths between the source and destination.
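The reverse-path pruning boils down to a walk along one historical reverse path from D to S, classifying each hop by whether it can still reach S. A minimal sketch, with the measurement results supplied as callables and the hop names purely illustrative:

```python
def reachability_horizon(reverse_path, reaches_src, normally_responds):
    """Find H and H' on one historical reverse path, listed from D toward S.

    reaches_src(h): a ping (possibly spoofed as S) shows h still reaches S.
    normally_responds(h): the atlas says h is not configured to ignore pings.
    Returns (H, H_prime): the farthest hop that still reaches S and the first
    hop past it that does not.
    """
    H, H_prime = None, None
    # Walk from S's side toward D; hops are listed D -> S, so iterate reversed.
    for hop in reversed(reverse_path):
        if not normally_responds(hop):
            continue  # a silent-by-design hop tells us nothing
        if reaches_src(hop):
            H = hop
        else:
            H_prime = hop
            break
    return H, H_prime

# Toy path from D back to S; only the last two hops still reach S.
path = ["D", "far_hop", "suspect_hop", "near_hop", "s_edge"]
print(reachability_horizon(path,
                           reaches_src=lambda h: h in {"near_hop", "s_edge"},
                           normally_responds=lambda h: True))
# ('near_hop', 'suspect_hop')
```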

Because both historical reverse paths from unresponsive forward hops traversed Rostelecom, it seems highly likely that Rostelecom is the point of failure. This conclusion is further supported by the fact that all unreachable hops except Rostelecom responded to pings from other vantage points, indicating that their other outgoing paths still worked. I provide details on this and other examples at http://lifeguard.cs.washington.edu.

Case Studies

These techniques enable failure isolation at the granularity of an AS or router, within minutes of detecting an outage. I now present two case studies that illustrate how LIFEGUARD uses different measurements to isolate failures depending on the direction of the failure.

Figure 5.4: Isolation measurements for an actual forward path failure. The forward traceroute terminates in AdNet Telecom, but LIFEGUARD measures a working reverse path through NTT.

Isolating a forward path failure: LIFEGUARD identified a failure keeping one of its sources, located in China, from reaching a router in Romania. Figure 5.4 depicts some of LIFEGUARD’s measurements in a figure generated automatically by the system. The historical atlas revealed that the forward path to Romania went through AdNet Telecom, one of the destination’s providers, whereas the reverse path left the destination AS through NTT, a different provider. Using spoofed pings, the system isolated the problem to the forward path. LIFEGUARD then successfully measured the entire (spoofed) reverse traceroute back from the destination, revealing that it still went via NTT. The figure depicts the identical historical and spoofed reverse paths as a single line. The Chinese source’s current forward traceroute followed a similar path to the historical one to AdNet Telecom but terminated at a router in that AS. This result indicated a failure reaching the destination from AdNet. Furthermore, the only vantage points that could reach the destination during the outage used NTT as an ingress, supporting the hypothesis. Other vantage points could reach the destination via NTT, and the source could ping hops in NTT that appeared on the reverse path to China. If networks behind the failure could switch to paths through NTT, they could avoid the problem with AdNet.

Isolating a reverse path failure: Routes between LIFEGUARD’s source in Korea and a target router in Chicago had historically gone through Level 3 in both directions. When the source detected a problem, its traceroute to the target router did not receive any responses after Level 3’s San Jose PoP. However, LIFEGUARD isolated the failure to the reverse direction. Figure 5.5 depicts some of LIFEGUARD’s measurements in a figure generated automatically by the system. When the source sent a forward traceroute spoofing as a vantage point that could reach the target, it reached the destination via Level 3’s Los Angeles, Dallas, and Chicago PoPs, yielding hops past where the (non-spoofed) traceroute terminated. LIFEGUARD’s historical reverse traceroutes showed that the reverse path from the destination passed through Level 3 in Chicago, Level 3 in Denver, and then Level 3’s San Jose PoP. The source could still successfully ping the San Jose, Los Angeles, Dallas, and Denver PoPs, but the Chicago PoP was pingable only from other vantage points. Other vantage points could also ping the destination. These results placed the “reachability horizon” in Level 3, with San Jose, Los Angeles, Denver, and Dallas on the reachable side (near the source) and Chicago on the unreachable side (near the destination). These measurements suggested a failure affecting some paths through Level 3 towards the Korean vantage point.

Figure 5.5: Isolation measurements for a reverse path failure. The forward traceroute terminates in Level 3, but LIFEGUARD reveals a working forward path and a reverse failure in Level 3.

Figure 5.6: For the EC2 dataset, residual duration after outages have persisted for X minutes. The top graph shows that, once a problem has persisted for a few minutes, it will most likely persist for at least a few more minutes unless there is corrective action. The bottom graph gives the fraction of total unavailability in the study, including outages that resolved before time X. It shows that, even if LIFEGUARD does not trigger rerouting for a few minutes and the rerouting takes a few more minutes to converge, we can still avoid most of the total unavailability.

5.3.2 Deciding to Start and Stop Poisoning

Deciding whether to poison: As seen in Section 3.4, most outages resolve quickly. For the system to work effectively, it would be helpful to differentiate between outages that will clear up quickly and those that will persist. If routing protocols will quickly resolve a problem, then it would be better to wait in order to avoid causing further routing changes. If the protocols will not restore connectivity quickly on their own, then poisoning may be a better approach.

The following analysis of the EC2 outage data from Section 3.4 shows that it is possible to differentiate between these cases with high likelihood. The top graph in Figure 5.6 shows the residual duration of these outages, given that they have already lasted for X minutes. The median duration of an outage in the study was only 90 seconds (the minimum possible given the methodology). However, of the 12% of problems that persisted for at least 5 minutes, 51% lasted at least another 5 minutes. Further, of the problems that lasted 10 minutes, 68% persisted for at least 5 minutes past that. LIFEGUARD triggers isolation after multiple rounds of failed pings, and it takes an average of 140 seconds to isolate a reverse path outage (see Section 5.4.4). If a problem persists through both those stages, then the results suggest that the problem is likely to persist long enough to justify using poisoning to fix it. I will show in Section 5.4.2 that poisoned routes converge within a few minutes in almost all cases, with little loss during the convergence period. So, if there are alternative paths that avoid the problem LIFEGUARD locates, the system should quickly restore connectivity.
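The statistic behind this decision is a simple conditional over the observed outage durations. A sketch of the computation follows; the duration list itself is the EC2 dataset and is not reproduced here.

```python
def residual_persistence(durations_min, elapsed, additional):
    """Fraction of outages that, having already lasted `elapsed` minutes,
    go on to last at least `additional` more minutes."""
    still_active = [d for d in durations_min if d >= elapsed]
    if not still_active:
        return 0.0
    return sum(d >= elapsed + additional for d in still_active) / len(still_active)

# With the EC2 data, residual_persistence(durations, 5, 5) comes out near 0.51
# and residual_persistence(durations, 10, 5) near 0.68.
```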

Long-lasting problems account for much of the total unavailability. As a result, even if LIFEGUARD takes five minutes to identify and locate a failure before poisoning, and it then takes two minutes for routes to converge, the system can still potentially avoid 80% of the total unavailability in the EC2 study (bottom graph in Figure 5.6). In Section 5.4.1, I will show that it is possible to determine (with high likelihood) whether alternate policy-compliant paths will exist before deciding to poison an AS. If no paths exist, LIFEGUARD does not attempt to poison the AS.

Deciding when to unpoison: Once LIFEGUARD accurately identifies the AS A responsible for a problem, BGP poisoning can target it and cause other ASes to route around A. However, A will eventually resolve the underlying issue, at which point we would like to be able to revert to the unpoisoned path, allowing ASes to use paths through A, if preferred. When the poisoned announcement is in place, however, A will not have a path to the prefix in question.

LIFEGUARD uses a sentinel prefix to test reachability. Concerns such as aggregation and address availability influence the choice of sentinel. In the current deployment, the sentinel is a less-specific prefix containing both the production prefix and a prefix that is not otherwise used. Responses to pings from the unused portion of the sentinel will route via the sentinel prefix, regardless of whether the hops also have the poisoned more-specific prefix. By sending active ping measurements from this prefix to destinations that had been unable to reach the production prefix prior to poisoning (e.g., E in Figure 5.1), the system can detect when to unpoison the production prefix. If the sentinel is a less-specific prefix without an unused sub-prefix, LIFEGUARD can instead ping destinations within the poisoned AS (e.g., A) or within captives of the poisoned AS (e.g., F).

Table 5.1: Key results of my evaluation of LIFEGUARD, demonstrating its viability for addressing long-duration outages.

Effectiveness (§5.4.1): Most edge networks have routes that avoid poisoned ASes (77% of poisons from BGP-Mux; 90% of poisons in large-scale simulation), and it is possible to calculate which do a priori.

Disruptiveness (§5.4.2): Routes that already avoid the problem AS reconverge quickly after poisoning (95% of paths converge instantly); minimal loss occurs during convergence (less than 2% packet loss in 98% of cases).

Accuracy (§5.4.3): Locates failures as if it had traceroutes from both ends (consistent results for 93% of inter-PlanetLab failures); isolates problems that traceroute alone misdiagnoses (the identified location differs from traceroute’s in 40% of cases).

Scalability (§5.4.4): Quickly isolates problems with reasonable overhead (140 seconds for poisoning candidates; 280 probes per failure); induces reasonably small additional BGP update load on routers (less than 1% if 1% of ISPs use LIFEGUARD; less than 10% if 50% of ISPs use LIFEGUARD).
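Returning to the unpoisoning check described before Table 5.1: it amounts to polling, from the sentinel’s unused address space, the destinations that were cut off before the poison was applied. A minimal sketch of that decision follows; the reachability threshold is an illustrative knob rather than part of the deployed system, and the helper names are mine.

```python
def should_unpoison(probe_targets, ping_from_sentinel, required_fraction=1.0):
    """Decide whether the failure behind a poisoned prefix appears repaired.

    probe_targets: destinations that could not reach the production prefix
      before it was poisoned (or hosts in the poisoned AS / its captives when
      the sentinel has no unused sub-prefix).
    ping_from_sentinel(t): a ping sourced from the sentinel's unused space
      (so replies follow the unpoisoned, less-specific route) reached t.
    """
    if not probe_targets:
        return False
    reachable = sum(1 for t in probe_targets if ping_from_sentinel(t))
    return reachable / len(probe_targets) >= required_fraction
```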

5.4 LIFEGUARD Evaluation

To preview the results of my evaluation, I find that LIFEGUARD’s poisoning finds routes around the vast majority of potential problems, its approach is minimally disruptive to paths that are not experiencing problems, and its failure isolation can correctly identify the AS needing to be poisoned. Table 5.1 summarizes my key results. The following sections provide more details.

I deployed LIFEGUARD’s path poisoning using the BGP-Mux testbed [14]. The BGP-Mux testbed has its own AS number and prefixes, which LIFEGUARD uses. LIFEGUARD connected to a BGP-Mux instance at Georgia Tech, which served as the Internet provider for the BGP-Mux AS (and hence for LIFEGUARD). For the poisoning experiments in this section, LIFEGUARD announced prefixes via Georgia Tech into the commercial Internet. To speed my experiments, I announced multiple prefixes at once, poisoning a different AS on each one. Traffic to these prefixes routes through Georgia Tech to the machine hosting LIFEGUARD’s controller.

I assessed the effects of poisoning in the absence of failures. To obtain ASes to poison, I announced a prefix and harvested all ASes on BGP paths towards the prefix from route collectors [87, 105, 99, 1]. I excluded all Tier-1 networks, as well as Cogent, as it is Georgia Tech’s main provider. In Section 5.4.2 and Section 5.6.1, I evaluate techniques to poison even these large ASes. For each of the remaining harvested ASes,4 I first went from a baseline of O to a poisoned announcement O-A-O, then repeated the experiment starting from a baseline of O-O-O. I kept each announcement in place for 90 minutes to allow convergence and to avoid flap dampening effects. For the duration of the experiments, I also announced an unpoisoned prefix to use for comparisons.

5.4.1 Efficacy

Do ASes find new routes that avoid a poisoned AS? I monitored BGP updates from public BGP route collectors to determine how many ASes found an alternate path after I poisoned an AS on their preferred path. There were 132 cases in which an AS peering with a route collector was using a path through one of the ASes I poisoned. In 102 of the 132 cases (77%), the route collector peer found a new path that avoided the poisoned AS. Two-thirds of the cases in which the peer could not find a path were instances in which I had poisoned the only provider of a stub AS.

Do alternate policy-compliant routes exist in general? I analyzed a large AS topology to show that, in the common case, alternate paths exist around poisoned ASes. The topology, along with AS relationships, is from a dataset combining public BGP feeds with more than five million AS paths between BitTorrent (BT) peers [24]. To simulate poisoning an AS A on a path from a source S to an origin O, I remove all of A’s links from the topology. I then check if S can restore connectivity while avoiding A (i.e., a path exists between S and O that obeys export policies).

4 I announced my experiments on the NANOG mailing list and allowed operators to opt out their ASes. None did. A handful opted out of an earlier Georgia Tech study, and I honored their list.

I check policies using the three-tuple test [79], as in Section 3.5. This approach may miss alternate paths if the valley-free assumption is too strict,5 and rarely used backup paths may not be in my topology. Conversely, it may identify valley-free paths not used in practice.
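One way to approximate this check without enumerating paths is the standard uphill/downhill reachability test over the AS relationship graph: remove the poisoned AS, climb from the source through providers, descend toward the origin through the origin’s provider chain, and allow at most one peering link to join the two sets. The sketch below, with a toy topology, is an approximation of the policy check used in the study, not the exact code.

```python
from collections import deque

def valley_free_path_exists(src, origin, avoid, providers, peers):
    """Check whether src can reach origin over a valley-free path avoiding `avoid`.

    providers[x]: set of x's providers; peers[x]: set of x's peers.
    A valley-free path climbs customer-to-provider links, crosses at most one
    peering link, then descends provider-to-customer links toward the origin.
    """
    def closure(start, step):
        # BFS over one edge type, never entering the avoided AS.
        seen, frontier = {start}, deque([start])
        while frontier:
            x = frontier.popleft()
            for y in step(x):
                if y != avoid and y not in seen:
                    seen.add(y)
                    frontier.append(y)
        return seen

    if src == avoid or origin == avoid:
        return False
    uphill = closure(src, lambda x: providers.get(x, set()))       # src and its provider ancestors
    downhill = closure(origin, lambda x: providers.get(x, set()))  # ASes that can descend to origin
    if uphill & downhill:
        return True
    # Otherwise, a single peer link may join the uphill and downhill sets.
    return any(peers.get(u, set()) & downhill for u in uphill)

# Toy topology: S -> P1 -> T (providers), T peers with Q, and Q is O's provider.
providers = {"S": {"P1"}, "P1": {"T"}, "O": {"Q"}}
peers = {"T": {"Q"}, "Q": {"T"}}
print(valley_free_path_exists("S", "O", avoid="X", providers=providers, peers=peers))  # True
print(valley_free_path_exists("S", "O", avoid="Q", providers=providers, peers=peers))  # False
```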

To establish the validity of this methodology, I simulated the Georgia Tech poisonings. In 92.5% of cases, the simulation found an alternate path if and only if the AS found such a path following my actual poisoning. In the remaining 7.5% of cases, the simulation found an alternate path, but in practice the AS failed to find one. In these cases, the source was an academic network connected to both a commercial provider and an academic provider (which only provides routes to academic destinations). These connections made the AS multi-homed in my simulation. In practice, however, the AS seems to reject routes from its commercial provider if the route contained an academic AS, and I had poisoned one.

Having established that my simulation closely predicts the results of actual poisonings, I simulated poisoning ASes on the BT and BGP feed paths. For each AS path with length greater than three (i.e., traversing at least one transit AS in addition to the destination’s provider), I iterated over all the transit ASes in the path except the destination’s immediate provider and simulated poisoning them one at a time.6 An alternate path existed in 90% of the 10 million cases. I then simulated poisoning the failures isolated by LIFEGUARD in June 2011. Alternate paths existed in 94% of them.

LIFEGUARD uses a sentinel prefix to decide when a failure is repaired, making it safe to revert to an unpoisoned announcement. The effectiveness of this approach relies on assumptions that I test here.

5 It is well known that not all paths are valley-free in practice, and I observed violations in the BitTorrent traceroutes.

6 A multi-homed destination can use selective advertising to avoid a particular provider. A single-homed destination can never avoid having paths traverse its provider.

Are poisoned more specifics aggregated into unpoisoned less specifics? If an AS aggregates a poisoned more-specific route into the unpoisoned less-specific sentinel, it would likely remove the poison. To verify that this aggregation does not generally happen, at least from LIFEGUARD’s current BGP deployment, I used the Georgia Tech BGP-Mux to announce two prefixes: (1) 184.164.224.0/24 with poison and (2) 184.164.224.0/23 without poison. I poisoned an edge AS, MIT, to guarantee that the prefix would propagate normally. I then checked how many RouteViews peer ASes exported both prefixes and how many exported only one. Out of 47 total ASes, 46 (98%) received both prefixes with identical paths, except for the poisoned insertion of MIT. Surprisingly, AS28571, a university in Brazil, received only the poisoned more-specific /24 prefix, but not the unpoisoned less-specific /23 prefix. Note that it does not affect the use of sentinel prefixes if ASes aggregate the more-specific prefix when it is not poisoned.

Do ASes choose the same path for the sentinel and the production prefix? For a sentinel prefix to detect when a problem has resolved, the sentinel needs to experience routing similar to what the production prefix would experience if unpoisoned. I now show that networks rarely use different paths for unpoisoned more-specific and less-specific prefixes from the same origin. Networks could choose different paths to facilitate load balancing. Via the Georgia Tech BGP-Mux, I announced three unpoisoned prefixes: (1) less-specific 184.164.244.0/23, (2) more-specific 184.164.244.0/24, and (3) more-specific 184.164.245.0/24. I then analyzed RIPE RIS, PCH, and RouteViews route table snapshots between September 22 and September 29, 2011, to see how many route collector peer ASes chose differing paths to the prefixes. During that period, six (out of 202) ASes showed different paths to the prefixes at least once. Five of these six ASes had different paths to the prefixes in fewer than 30% of the snapshots I analyzed. One AS (AS29449) chose different paths to the more-specific /24 and less-specific /23 prefixes throughout the period. The other 97% of ASes chose the same paths to all three prefixes during all snapshots.

5.4.2 Disruptiveness

How quickly do paths converge after poisoning? I used updates from the BGP collectors to measure convergence delay after poisoning. I will show that, in most cases, if an AS was not routing through the AS I poisoned, it reconverges essentially instantly to its original path, requiring only a single update to pass on the poison. Global convergence usually takes at most a few minutes.

Figure 5.7: Convergence time (top) and number of updates issued (bottom) for route collector peer ASes after poisoned announcements. A data point captures the convergence for one peer AS after one poisoned announcement. Change vs. no change indicates whether the peer had to change its path because it had been routing through the poisoned AS. For Prepend, the baseline announcement before poisoning was O-O-O, whereas for No prepend it was O. In both cases, the poisoned announcement was O-A-O. Prepending reduces path exploration by keeping path length consistent, resulting in reduced convergence delay.

First, I assess how quickly ASes that were using the poisoned AS settle on a new route that avoids it. I also assess how quickly routes converge for ASes that were not using the poisoned AS. As explained above, I poisoned each harvested AS twice, using a different pre-poisoning baseline path each time. After each announcement, for each AS that peers with a route collector, I measured the delay from when the AS first announced an update to the route collector to when it announced its stable post-poisoning route. I leave out (route collector peer, poisoned AS) pairs if the peer AS did not have a path to the LIFEGUARD prefix following the poisoning.

As seen in the top graph in Figure 5.7, a baseline announcement of O-O-O greatly speeds convergence. More than 95% of the time, ASes that were not using the poisoned AS converged essentially instantly upon receiving the poisoned update, as they did not need to change their path, and 99% of them converged within 50 seconds (Prepend, No Change line). In comparison, if I simply announce O as the baseline before poisoning, less than 70% of the unaffected ASes converged instantly, and 94% converged within 50 seconds (No Prepend, No Change line). Similarly, using O-O-O as the baseline helps ASes that had been routing via A settle on a new route faster: 96% converged within 50 seconds, compared to only 86% if I use O as the baseline. Prepending keeps the announcements a consistent length, which reduces path exploration for ASes not routing through A. In fact, as seen in the bottom graph of Figure 5.7, with prepending, 97% of unaffected ASes made only a single update, informing neighbors only of the insertion of the poisoned AS A into the route. Without prepending, only 64% of these ASes made only one update. The other 36% explored alternatives before reverting back to their original path.

Figure 5.7 captures per-AS convergence. Because announcements need to propagate across the Internet and are subject to protocol timers as they do so, different ASes receive a poisoned announcement at different times.

I also assessed how long global convergence took, from the time the first route collector received an update for my prefix until all route collector peer ASes had converged to stable routes. With prepending, global convergence took at most 91 seconds in the median case, at most 120 seconds for 75% of poisonings, and at most 200 seconds in 90% of poisonings. In contrast, without prepending, the 50th, 75th, and 90th percentiles are 133 seconds, 189 seconds, and 226 seconds. Compared to the delay following a poisoning, it generally takes slightly less time for routes to converge globally when I remove the poison and revert to the baseline announcement. Because most ASes that were not using the poisoned AS reconverge to their original path without exploring other options, I would not expect them to experience transient loss during the global convergence period.

How much loss accompanies convergence? My results indicate that packet loss during convergence is minimal. I calculated the loss rate during convergence following my poisonings that used a baseline of O-O-O. For each poisoning, I considered the period starting when the poisoned announcement first reached a route collector and ending when all route collector peers had converged. Every 10 seconds for the duration of the experiment, I issued pings from the poisoned and unpoisoned LIFEGUARD prefixes to all 308 working PlanetLab sites. I set the source address for each ping to an address in the LIFEGUARD prefix it was sent from, and so a particular response from a PlanetLab site would follow routes to that prefix. In general, many of the sites were not routing via any particular poisoned AS, and so this experiment lets us evaluate how much I disrupt working paths. I filtered out cases in which loss was clearly due to problems unrelated to poisoning, and I excluded a PlanetLab site if it was completely cut off by a particular poisoning.

Following 60% of poisonings, the overall loss rate during the convergence period was less than 1%, and 98% of poisonings had loss rates under 2%. Some convergence periods experienced brief spikes in loss, but only 2% of poisonings had any 10 second round with a loss rate above 10%.

Can poisoning shift routes off an AS link without completely disabling either AS? I demonstrate selective poisoning using two BGP-Mux sites, University of Washington (UWash) and University of Wisconsin (UWisc). Paths from most PlanetLab nodes to UWash pass through the Internet2 (I2) Seattle PoP, then to Pacific Northwest Gigapop, and to UWash. Most PlanetLab paths to UWisc pass through I2’s Chicago PoP, then through WiscNet, before reaching UWisc. So, the BGP-Mux AS has UWash and UWisc as providers, and they connect via disjoint paths to different I2 PoPs.

I tested if I could shift traffic away from the I2 Chicago PoP, supposing the link to WiscNet experienced a silent failure. I advertised the same two prefixes from UWash and UWisc. I announced the first prefix unpoisoned from both. I poisoned I2 in UWisc’s announcements of the second prefix, but had UWash announce it unpoisoned.

First, I show that paths that were not using I2 would not be disrupted. I looked at paths to the two prefixes from 36 RIPE RIS BGP peers. For 33 of them, the paths to the two prefixes were identical and did not use I2. The other three ASes routed to the unpoisoned prefix via I2 and WiscNet. For the selectively poisoned prefix, they instead routed via I2 and PNW Gigapop, as expected.

I then compare traceroutes from PlanetLab hosts to the two prefixes, to show that paths through I2 avoid the “problem.” For the unpoisoned prefix, over 100 PlanetLab sites routed through I2 to WiscNet. Focusing on just these PlanetLab sites, I assessed how they routed towards the other prefix, which I selectively poisoned for I2 from UWisc. For that prefix, all the sites avoided the link from I2 to WiscNet. All but three of the sites routed via PNW Gigapop and UWash, as I intended. The remaining three (two sites in Poland and one in Brazil) used Hurricane Electric to reach UWisc. Excepting these three sites, selective poisoning allowed us to shift paths within a targeted network without changing how networks other than the target routed.

Are Internet paths diverse enough for selective poisoning to be effective? This technique may provide a means to partially poison large networks, and services with distributed data centers may have the type of connectivity required to enable it. In my second selective poisoning experiment, I approximated this type of deployment and assessed how many ASes I could selectively poison. To make this assessment, I need to identify whether ASes select from disjoint paths, which would allow us to selectively poison them. It is hard to perform this assessment in general, because I need to know both which path an AS is using and which paths it might use if that path became unavailable.

Route collectors and BGP-Mux let us experiment in a setting in which I have access to these paths. I announced a prefix simultaneously via BGP-Muxes at UWash, UWisc, Georgia Tech, Princeton, and Clemson. This setup functions as an AS with five PoPs and one provider per PoP. I iterated through 114 ASes that provide BGP feeds. For each pair of AS A and BGP-Mux M in turn, I poisoned A from all BGP-Muxes except M and observed how A’s route to my prefix varied with M. I found that selective poisoning allowed me to avoid 73% of the first hop AS links used by these peers, while still leaving the peers with a route to my prefix. In Section 5.1, I found that these five university providers allowed me to avoid 90% of these links on forward paths to these same ASes.

5.4.3 Accuracy

Having demonstrated that LIFEGUARD can often use BGP poisoning to route traffic around an AS or AS link without causing widespread disruption, I assess LIFEGUARD’s accuracy in locating failures. I show that LIFEGUARD seems to correctly identify the failed AS in most cases, including many that would not be correctly found using only traceroutes. A tech report provides further analysis of LIFEGUARD’s failure isolation [108].

LIFEGUARD’s failure isolation system is a prototype. I intend this chapter to describe the current system and to establish that it should be possible to locate many failures. The live deployment of LIFEGUARD’s failure isolation system has run continuously since January 2011. It currently uses a host at each of 12 PlanetLab sites as sources. It monitors paths between those sources and the following distinct sets of destinations, selected to provide an interesting set of paths on which to evaluate the system:

• one router in each of 83 highly-connected PoPs located in the core of the Internet, based on iPlane’s traceroute atlas [78].

• 185 targets located on the edge of the Internet, selected because each target had a path from at least one of the 12 sources that traversed at least one of the 83 centralized PoPs.

• 76 PlanetLab nodes. These targets provide a mechanism to validate results.

Because of the smaller number of sources and destinations relative to Hubble, each source pings every destination in every two-minute round, in order to detect outages. I expect that it will be possible in the future to refine the techniques and implementations used to isolate failures, and part of this refinement may involve studying how to select targets that can serve as “canaries” signaling wider problems.

Are LIFEGUARD’s results consistent with what it would find with control of both ends of a path? In general, obtaining ground truth for wide-area network faults is difficult: emulations of failures are not realistic, and few networks post outage reports publicly. Due to these challenges, I evaluate the accuracy of LIFEGUARD’s failure isolation on paths between a set of PlanetLab hosts used as LIFEGUARD vantage points and a disjoint set of PlanetLab hosts used as targets.7 Every five minutes, I issued traceroutes from all vantage points to all targets and vice versa. In isolating the location of a failure between a vantage point and a target, I only gave LIFEGUARD access to measurements from its vantage points. I checked if LIFEGUARD’s conclusion was consistent with traceroutes from the target, “behind” the failure. LIFEGUARD’s location was consistent if and only if (1) the traceroute in the failing direction terminated in the AS A blamed by LIFEGUARD, and (2) the traceroute in the opposite direction did not contradict LIFEGUARD’s conclusion. The traceroute contradicted the conclusion if it included responses from A but terminated in a different AS. The responses from A indicate that some paths back from A worked. I lack sufficient information to explain these cases, and so I consider LIFEGUARD’s result to be inconsistent.

7 I cannot issue spoofed packets or make BGP announcements from EC2, and so I cannot use it to evaluate my system.
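The consistency criterion above is mechanical, so a compact restatement may help; the input shapes (AS-level hop lists) are illustrative rather than the evaluation harness’s actual format.

```python
def consistent_with_target_view(blamed_as, failing_tr_ases, opposite_tr_ases):
    """Check LIFEGUARD's blamed AS against traceroutes issued from the target.

    failing_tr_ases: AS-level hops of the traceroute in the failing direction
      (its last element is where the traceroute died).
    opposite_tr_ases: AS-level hops of the traceroute in the other direction.
    """
    # Condition (1): the failing-direction traceroute terminated in the blamed AS.
    terminates_in_blamed = bool(failing_tr_ases) and failing_tr_ases[-1] == blamed_as
    # Condition (2): the opposite-direction traceroute contradicts the blame if
    # it drew responses from the blamed AS yet terminated in a different AS.
    contradicts = blamed_as in opposite_tr_ases and opposite_tr_ases[-1] != blamed_as
    return terminates_in_blamed and not contradicts
```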

I examined 182 unidirectional isolated failures from August and September 2011. For 169 of the failures, LIFEGUARD’s results were consistent with traceroutes from the targets. The remaining 13 cases were all forward failures. In each of these cases, LIFEGUARD blamed the last network on the forward traceroute (just as an operator with traceroute alone might). However, the destination’s traceroute went through that network, demonstrating a working path from the network back to the destination.

Does LIFEGUARD locate failures of interest to operators? I searched the Outages.org mailing list [94] for outages that intersected LIFEGUARD’s monitored paths and found two interesting examples. On May 25, 2011, three of LIFEGUARD’s vantage points detected a forward path outage to three distinct locations. The system isolated all of these outages to a router in Level 3’s Chicago PoP. Later that night, a network operator posted the following message to the mailing list: “Saw issues routing through Chicago via Level 3 starting around 11:05 pm, cleared up about 11:43 pm.” Several network operators corroborated this report.

In the second example, a vantage point in Albany and one in Delaware observed simultaneous outages to three destinations: a router at an edge AS in Ohio and routers in XO’s Texas and Chicago PoPs. LIFEGUARD identified all Albany outages and some of the Delaware ones as reverse path failures, with the remaining Delaware one flagged as bidirectional. All reverse path failures were isolated to an XO router in Dallas, and the bidirectional failure was isolated to an XO router in Virginia. Several operators subsequently posted to the Outages.org mailing list reporting problems with XO in multiple locations, which is likely what LIFEGUARD observed.

Does LIFEGUARD provide benefit beyond traceroute? I now quantify how often LIFEGUARD finds that failures would be incorrectly isolated using only traceroute, thus motivating the need for the system’s more advanced techniques. In the example shown in Figure 5.3, traceroutes from GMU seem to implicate a problem forwarding from TransTelecom, whereas my system located the failure as being along the reverse path, in Rostelecom. For the purposes of this study, I consider outages that meet criteria that make them candidates for rerouting and repair: (1) multiple sources must be unable to reach the destination, and these sources must be able to reach at least 10% of all destinations at the time, reducing the chance that it is a source-specific issue; (2) the failing traceroutes must not reach the destination AS, and the outage must be partial, together suggesting that alternate AS paths exist; and (3) the problem must not resolve during the isolation process, thereby excluding transient problems. During June 2011, LIFEGUARD identified 320 outages that met these criteria [108]. In 40% of cases, the system identified a different suspected failure location than what one would assume using traceroute alone. Further, even in the other 60% of cases, an operator would not currently know whether or not the traceroute was identifying the proper location.

5.4.4 Scalability

How efficiently does LIFEGUARD refresh its path atlas? LIFEGUARD regularly refreshes the forward and reverse paths it monitors. Existing approaches efficiently maintain forward path atlases based on the observations that paths converge as they approach the source/destination [34] and that most paths are stable [31]. Based on these observations, LIFEGUARD’s reverse path atlas caches probes for short periods, reuses measurements across converging paths, and usually refreshes a stale path using fewer probes than would be required to measure it from scratch. In combination, these optimizations enable LIFEGUARD to refresh paths at an average (peak) rate of 225 (502) reverse paths per minute. It uses an amortized average per path of 10 IP option probes (compared to 35 measured using the basic reverse traceroute technique in Section 4.4.3) and slightly more than 2 forward traceroutes (including those to measure reverse traceroute’s traceroute atlas). It may be possible to improve scalability in the future by focusing on measuring paths between “core” PoPs whose routing health and route changes likely influence many paths,8 and by scheduling measurements to minimize the impact of router-specific rate limits.

What is the probing load for locating problems? Fault isolation requires approximately 280 probe packets per outage. LIFEGUARD isolates failures, on average, much faster than long-lasting outages take to be repaired. For bidirectional and reverse path outages, potential candidates for poisoning, LIFEGUARD completed isolation measurements within 140 seconds on average.

What load will poisoning induce at scale? An important question is whether LIFEGUARD would induce excessive load on routers if deployed at scale. The study above showed that, by prepending, LIFEGUARD reduces the number of updates made by each router after a path poisoning. In this section, I estimate the Internet-wide load my approach would generate if a large number of ISPs used it. While my results serve as a rough estimate, I find that the number of path changes made at each router is low.

The number of daily path changes per router that my system would produce at scale is I × T × P(d) × U, where I is the fraction of ISPs using my approach, T is the fraction of ASes each ISP is monitoring with LIFEGUARD, P(d) is the aggregate number of outages per day that have lasted at least d minutes and are candidates for poisoning, and U is the average number of path changes per router generated by each poison. Based on my experiments, U = 2.03 for routers that had been routing via the poisoned AS, and U = 1.07 for routers that had not been routing via the poisoned AS. For routers using the poisoned AS, BGP should have detected the outage and generated at least one path update in response. For routers not routing through it, the updates are pure overhead. Thus, poisoning causes affected routers to issue an additional 1.03 updates and unaffected routers an additional 1.07 updates. For simplicity, I set U = 1 in this analysis.

I base P(d) on the Hubble dataset of outages on paths between PlanetLab sites and 92% of the Internet’s edge ASNs (Section 3.3.2). I filter this data to exclude complete outages, where poisoning is not effective. I assume that the Hubble targets were actually monitoring the PlanetLab sites (instead of vice versa) and define P(d) = H(d) / (I_h × T_h), where H(d) is the total number of poisonable Hubble outages per day lasting at least d minutes, I_h = 0.92 is the fraction of all edge ISPs that Hubble monitored, and T_h = 0.01 is my estimate for the fraction of total poisonable (transit) ASes on paths from Hubble VPs to their targets. Because the smallest d that Hubble provides is 15 minutes, I extrapolate the Hubble outage distribution based on the EC2 data (Section 3.4) to estimate the poisoning load for d = 5.

8 For example, 63% of iPlane traceroutes traverse at least one of the most common 500 PoPs (0.3% of PoPs) [58].

Table 5.2: Number of additional daily path changes due to poisoning, for varying fraction of ISPs using LIFEGUARD (I), fraction of networks monitored for reachability (T), and duration of outage before poisoning (d). For comparison, Tier-1 routers currently make between 255,000 and 315,000 updates per day.

                 d = 5 min.        d = 15 min.       d = 60 min.
          T =    0.5      1.0      0.5      1.0      0.5      1.0
I = 0.01         393      783      137      275       58      115
I = 0.1         3931     7866     1370     2748      576     1154
I = 0.5        19625    39200     6874    13714     2889     5771

Scaling the parameters estimates the load from poisoning under different scenarios. In Table 5.2, I vary the fraction of participating ISPs (I), the fraction of poisonable ASes being monitored (T), and the minimum outage duration before poisoning is used (d). I scale the number of outages linearly with I and T. I scale with d based on the failure duration distribution from my EC2 study.
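As a worked check of the formula, one can back out P(d) from a single published cell of Table 5.2 and recompute other cells; the small mismatches below come only from rounding. The function name is mine, not the dissertation’s.

```python
def daily_path_changes(I, T, P_d, U=1.0):
    """Additional daily path changes per router: I * T * P(d) * U."""
    return I * T * P_d * U

# Back out P(15) from the cell I=0.01, T=0.5, d=15 min (137 changes/day) ...
P_15 = 137 / (0.01 * 0.5)                         # roughly 27,400 poisonable outages/day
# ... then recompute two other d=15 cells.
print(round(daily_path_changes(0.1, 1.0, P_15)))  # ~2740 (table: 2748)
print(round(daily_path_changes(0.5, 0.5, P_15)))  # ~6850 (table: 6874)
```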

LIFEGUARD could cause a router to generate from tens to tens of thousands of additional updates per day. For reference, a single-homed edge router peering with AS131072 sees an average of approximately 110,000 updates per day [13], and the Tier-1 routers I checked made 255,000 to 315,000 path updates per day [87]. For cases where only a small fraction of ASes use poisoning (I ≤ 0.1), the additional load is less than 1%. For large deployments (I = 0.5, T = 1) and short delays before poisoning (d = 5), the overhead can become significant (35% for the edge router, 12-15% for Tier-1 routers). I note that reducing the number of monitored paths or waiting longer to poison can easily reduce the overhead to less than 10%.

5.5 LIFEGUARD Case Study

To demonstrate LIFEGUARD’s ability to repair a data plane outage by locating a failure and then poisoning to restore connectivity, I describe a case study of a failure affecting routing between the University of Wisconsin and a PlanetLab node at National Tsing Hua University in Taiwan. LIFEGUARD announced a production prefix and a sentinel prefix via the University of Wisconsin BGP-Mux. It also hosted an isolation vantage point at the University of Wisconsin. LIFEGUARD had monitored the PlanetLab node for a month, gathering historical atlas data. On October 3-4, 2011, I used nodes in the two prefixes to exchange test traffic with the PlanetLab node, and I sent traceroutes every 10 minutes from the PlanetLab node towards the test nodes to keep track of its view of the paths. As in the evaluation in Section 5.4.3, I use the measurements from the Taiwanese PlanetLab node to evaluate LIFEGUARD, but I did not give the system access to these measurements during failures.

After experiencing only transient problems for most of the day, around 8:15pm on October 3, the test traffic began to experience a persistent outage. When LIFEGUARD isolated the direction of the outage, spoofed pings from LIFEGUARD’s vantage point reached Taiwan along the forward path from Wisconsin, but spoofed pings towards Wisconsin failed, indicating a problem on the reverse path. A spoofed forward traceroute from Wisconsin revealed the forward path. LIFEGUARD’s atlas revealed two historical reverse paths from Taiwan back to Wisconsin. Prior to the outage, the PlanetLab node had been successfully routing to Wisconsin via academic networks. Older measurements in the historical atlas showed that this academic route had been in use since 3pm, but, prior to that, the path had instead routed through UUNET (a commercial network) for an extended period. The UUNET and academic paths diverged one AS before UUNET (one AS closer to the Taiwanese site). LIFEGUARD issued measurements to see which of the hops on these reverse paths could still reach Wisconsin. These measurements established a reachability horizon, placing UUNET and the ASes before it (closer to the Taiwanese site) behind the failure. During the failure, all routers along the academic path from the divergence point to Wisconsin were still responsive to pings, meaning they had a route to the University of Wisconsin. However, hops in UUNET no longer responded to probes. In fact, hand inspection of traceroutes from the PlanetLab node to Wisconsin showed that, at 8:15pm, the path had switched to go via UUNET, and the traceroutes had been terminating in UUNET.

Because the hops on the academic route had paths to Wisconsin, that route was a likely viable alternative to the broken default path, and I used LIFEGUARD to poison UUNET. For a brief period after LIFEGUARD announced the poison, test traffic was caught in a convergence loop. After convergence, both the test traffic and a traceroute from Taiwan successfully reached the production prefix via the academic networks. Traceroutes to the sentinel prefix continued to fail in UUNET until just after 4am on October 4, when the path through UUNET began to work again. In summary, LIFEGUARD isolated an outage and used poisoning to re-establish connectivity until the outage was resolved. Once the underlying failure was repaired, the poison was no longer necessary, and LIFEGUARD reverted to the baseline unpoisoned announcement.

5.6 Discussion

5.6.1 Poisoning Anomalies

Certain poisonings cause anomalous behavior. Some networks disable loop detection, accepting paths even if they contain their AS. Other networks do not accept an update from a customer if the path contains a peer of the network. I discuss these two issues separately.

Some networks with multiple remote sites communicate between sites across the public Internet, using the same AS number across sites. One approach is to disable BGP’s loop prevention to exchange prefixes from the remote sites, even though they share an origin AS.9 The paths for these prefixes allow the remote sites to route to each other using BGP. Fortunately, best practices mandate that, instead of disabling loop detection altogether, networks should set the maximum number of occurrences of their AS number in the path. For instance, AS286 accepts updates if it is already in the AS path. Inserting AS286 twice into the AS path, however, causes it to drop the update, thus enabling the use of poisoning. In general, I would expect ASes that need to use the public Internet to communicate between remote sites to be stubs (meaning they have no customers). LIFEGUARD uses poisoning to enable paths to bypass faulty transit networks, so it has no need to poison stubs.

9 Another option is to establish a virtual backbone using some form of tunneling.

Some networks do not accept an update from a customer if the path contains one of the network’s peers. For example, Cogent will not accept an update that includes one of its Tier-1 peers in the path, meaning that announcements poisoning one of these ASes via Georgia Tech did not propagate widely. However, I could poison them via BGP-Mux instances at universities that were not Cogent customers, and 76% of route collector peers were able to find a path that avoided a poisoned AS through which they had previously been routing. While the filtering reduced the coverage of my poisoning studies and will prevent poisoning from rerouting around some failures, it is likely not a big limitation. First, most failures occur outside of large networks (such as Cogent and Tier-1s) [108, 136]. Second, I used poisoning to experiment with AS avoidance, but most of the techniques and principles still apply to other implementations, such as a modification of BGP to include a new signed AVOID PROBLEM(X,P) notification.

5.6.2 Address Use by the Sentinel Prefix

Above, I proposed using a less-specific prefix with an unused sub-prefix as a sentinel. I discuss the trade-offs when using three alternative approaches. First, absent any sentinel, it is not clear how to check for failure recovery or to provide a backup unpoisoned route. Second, if an AS has an unused prefix that is not adjacent to the production prefix, it can use the unused prefix as a sentinel, even though it may lack a super-prefix that covers both prefixes. This approach allows the AS to check when the failed route is repaired. However, this scheme does not provide a “backup” route to the production prefix for networks captive behind the poisoned AS. Third, if an AS lacks unused address space to serve as a sentinel, a less-specific prefix will ensure that the poisoned AS and ASes captive behind it still have a route, even though the less-specific prefix does not include an unused sub-prefix. Pings to the poisoned AS and its captives will return via routes to the unpoisoned less-specific prefix, and so they can be used to check for repairs.

Alternatively, a provider with multiple prefixes hosting the same service can use DNS redirection to test when a problem has resolved, without using additional addresses. Such providers often use DNS to direct a client to a nearby data prefix. This scheme relies on clients using the same route to reach all prefixes in the absence of poison. To establish that this property holds, at least for Google, I resolved a Google hostname at 20 PlanetLab sites around the world, yielding a set of Google IP addresses from various data centers. I then issued traceroutes from the PlanetLab sites to the set of IP addresses. Each PlanetLab site used a consistent path to reach Google’s network for all IP addresses. Google then routed the traffic to a particular data center. With this consistent routing, if a provider discovers a routing problem affecting a set of clients C, it could poison the prefix P1 serving those clients. It need not poison its prefix P2 that serves other clients (possibly from a different data center). Periodically, the provider’s DNS resolvers could give a client from C an address from P2 and an address from P1, with P1 serving as a failover. By checking server logs, the provider could discern if the client was able to reach P2. When clients can reach P2, the provider can remove the poison on P1.
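A sketch of the resolver side of this scheme follows; the prefixes, client set, and the policy of handing affected clients the P2 address first are illustrative assumptions, not a description of any deployed resolver.

```python
def addresses_for_client(client, affected_clients, p1_addr, p2_addr):
    """Return the address list the resolver hands out while P1 is poisoned.

    Clients in C (those behind the failure) are periodically given an address
    from the unpoisoned prefix P2 first, with their usual P1 address as the
    failover; other clients keep being served from P1. Server logs showing
    hits on P2 from clients in C indicate the failure has been repaired, so
    the poison on P1 can be removed.
    """
    if client in affected_clients:
        return [p2_addr, p1_addr]
    return [p1_addr]
```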

5.7 Summary

Increasingly, Internet and cloud-based services expect the Internet to deliver high availability. Nevertheless, partial outages that last for hours occur frequently. To improve their availability, edge providers would benefit from having a way to repair outages affecting them, even when the problem occurs in a transit AS outside their control. In this chapter, I presented LIFEGUARD, a system that locates the AS at fault and routes around it. I designed LIFEGUARD to focus on long-lasting partial outages, which account for a substantial portion of unavailability; to locate failures along either the forward or reverse path, since failures are often unidirectional; and to trigger BGP reroutes, because alternate policy-compliant paths frequently exist. Using multiple vantage points and spoofed probes, LIFEGUARD can identify the failing AS in the wild. I show that a provider can use BGP poisoning to cause routes to avoid an AS without disrupting working routes. LIFEGUARD’s mechanism approximates a BGP modification to notify ASes of problems. Because it uses existing protocols and devices, an ISP can deploy my approach unilaterally today. LIFEGUARD enables experiments that I hope will motivate the need for and design of new route control mechanisms.

Chapter 6

RELATED WORK

In this chapter, I present the existing work on which this dissertation builds.

6.1 Internet Measurement

6.1.1 Measurement Tools and Techniques

The traceroute tool [59], originally by Van Jacobson, influenced much of this work. I used it extensively in the systems and studies in this dissertation. Its usefulness to me and others motivated me to pursue a reverse path equivalent, and Jacobson’s approach of repurposing a packet’s Time-To-Live (TTL) to reveal otherwise hidden information inspired me to similarly repurpose other features of the Internet protocols. Section 2.3.2 discusses traceroute at length.

IP options: Researchers used the timestamp and record route IP options to measure various aspects of the Internet, inspiring my use of them in reverse traceroute. Fonseca et al. found that routers on half of the paths between pairs of PlanetLab sites dropped packets with the timestamp or record route IP options enabled [40]. Although the title of the report dramatically proclaimed that “IP options are not an option,” the authors concluded that, based on the location of drops, options could form the basis for a wide-area system. In particular, they found that 90% of drops occurred at the edges of the Internet and that a small number of ASes accounted for most of the drops. In other words, the core of the Internet rarely dropped options packets, and reconfigurations in a small number of ASes could greatly improve overall support. Section 4.2.2 described how reverse traceroute uses source spoofing to avoid some filters.

The Passenger [116] and DisCarte [115] projects showed that the record route option, when set on traceroute packets, can reduce false links, uncover more routers, and provide more complete alias information. The approach has two benefits: record route and traceroute can discover different IP addresses for the same router, enabling alias discovery, and some routers may respond to one technique but not the other, with the combination supplying more complete data. The central challenge addressed by the techniques is how to align the two measurements in order to properly resolve the aliases. Reverse traceroute uses alias results from DisCarte to intersect traceroutes with record routes.

iPlane used the record route and timestamp options to determine if a path traversed a link symmetrically [76] (I helped work on iPlane's IP options techniques, an experience that inspired reverse traceroute). Reverse traceroute generalized and expanded upon iPlane's techniques in order to measure reverse paths in general, not just symmetric ones. Additionally, my colleagues and I studied timestamp support across the Internet and used the timestamp option to resolve aliases [114]. The results from that study led to refinements in how this dissertation uses the timestamp option.

IP source spoofing: While the work in this dissertation is the first I am aware of to use IP source spoofing to measure forward and reverse paths separately, others have investigated IP source spoofing in other contexts. Govindan and Paxson used source spoofing to conclude that most routers generated ICMP Time Exceeded (traceroute) responses within a few milliseconds [44], demonstrating that traceroute-measured latencies accurately reflect the round-trip latency experienced by other packets. To arrive at this conclusion, they sent a packet from one vantage point to another. Then, they sent a TTL-limited packet from the first vantage point, spoofing as the other. Assuming the routing is destination-based, both the spoofed and non-spoofed probes follow the same one-way path between the vantage points. The researchers measured the one-way latency of the two probes by synchronizing the clocks on the two machines, then compared these two values, assuming the difference between them stemmed from ICMP generation. In Section 4.4.1, I adapted their approach to assess how often routers violate destination-based routing.
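
To make the probe mechanics concrete, the following is a minimal sketch written with the Scapy library; it is neither Govindan and Paxson's implementation nor the measurement harness used in this dissertation. The addresses and TTL are hypothetical placeholders, sending spoofed packets requires elevated privileges, and the upstream network must not filter spoofed traffic.

    from scapy.all import IP, ICMP, send

    VP_A = "198.51.100.10"   # vantage point issuing both probes (hypothetical)
    VP_B = "203.0.113.20"    # the other vantage point, whose address is spoofed (hypothetical)
    TTL = 5                  # expire at the fifth router on the path toward VP_B

    # Probe 1: an ordinary packet from VP_A to VP_B; VP_B records its arrival time.
    send(IP(src=VP_A, dst=VP_B) / ICMP())

    # Probe 2: a TTL-limited packet from VP_A spoofed as VP_B. Under
    # destination-based routing it follows the same forward path, but the
    # ICMP Time Exceeded it triggers is delivered to VP_B rather than VP_A,
    # letting the two synchronized machines compare one-way timings.
    send(IP(src=VP_B, dst=VP_B, ttl=TTL) / ICMP())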

Spoofing can aid in nefarious acts, potentially enabling anonymity, indirection, and amplification exploits [12]. For this reason, some people have looked at restricting it. Strategies exist to enable routers to filter spoofed packets [39, 10], though it can be difficult to determine which packets have legitimate source addresses. Techniques that rely on lists of permissible addresses [39] are difficult to configure and maintain, especially given realities such as multi-homing. Strict uRPF filters a packet if it arrives over an interface other than the one that would be used to route towards the packet's source [10]. Given path asymmetry, however, this strategy is impractical in many networks, as it could filter non-spoofed traffic. Loose uRPF only allows packets with addresses present in the router's routing table, which is too lenient to prevent much spoofing [107]. Given these limitations, these approaches to filtering spoofed packets are practical only at the edge; in the core, strict uRPF is too restrictive, and loose uRPF is not restrictive enough. Because filtering is only feasible near the edge, reverse traceroute identifies which vantage points can and cannot spoof, then uses those that can to spoof as any source in the system.

The MIT Spoofer project offers software for users to download to test whether their provider allows spoofing [117]. According to the most recent published study, 31% of clients could spoof arbitrary routable IP addresses [12]. I found the fraction of PlanetLab sites with providers that do not filter spoofed packets to be similar.

BGP poisoning: Previous research used poisoning as a measurement tool to uncover hidden network topology [29] and to assess the prevalence of default routes [21]. In the former, researchers poisoned ASes on BGP paths to their prefix, in order to expose backup routes. They observed the new routes using BGP route collectors [105]. One of the motivations was to discover which routes would be used following failures. In the latter, researchers tested for the presence of a default route by poisoning an AS, then testing if the AS could still reach the poisoned prefix. A router uses a default route if it does not have a specific route for a prefix. By poisoning the AS for their prefix, the researchers guaranteed that the AS did not have a specific route for the prefix (assuming that the AS used BGP loop prevention). They used ping to test if the AS still had a path to the prefix. Their results showed that most small ASes used default routes, and most large ASes did not.

Sentinel prefix: Entact used overlapping BGP prefixes to simultaneously measure alternative paths [139]. Bush et al. used an anchor prefix to determine whether a test prefix should be reachable [21]. Similarly, LIFEGUARD uses a sentinel prefix to test for repairs along the failing path while rerouting traffic to a path avoiding the outage.

Addressing limitations of traceroute: While this dissertation addressed the lack of reverse path visibility, the principal limitation of traceroute [122], traceroute has other limitations that are addressed by existing research.

Some routers along the path may not appear in the traceroute. Hidden routers, such as those inside some MPLS tunnels, do not decrement TTL and so are completely invisible to traceroute. Others, referred to as anonymous routers, turn up as ‘*’s in traceroute because they decrement TTL but do not send ICMP time-exceeded messages. Previous work identified the problem of anonymous routers and examined heuristics to overcome their effect on inferred topologies [134]. Because some routers may respond to record route and not traceroute, DisCarte aligns traceroute results with record route entries to discover many hidden and anonymous routers, yielding more accurate paths [115].

Rather than using a single path for all traffic to a particular destination, some networks load balance the traffic across multiple paths. In these cases, traceroute may fail to expose the multiple paths or, more problematically, may infer false links when probes to adjacent hops traverse different paths. Routers often balance packets across the multiple paths by hashing on some of the header fields in the packets. By systematically varying the values of these fields, Paris traceroute addresses these concerns, providing consistent paths and a way of exposing the multiple options given certain types of load balancing [6, 7].

Techniques to provide some reverse path visibility: While reverse traceroute is the first system to measure reverse paths from arbitrary hosts, other researchers recognized the problems introduced by path asymmetry and designed systems to provide some visibility into reverse paths or to identify cases of asymmetry. Section 2.3.3 mentioned that traceroute servers and loose source routing offer the ability to measure some paths, but both have very incomplete coverage.

Recognizing that these approaches did not provide a complete solution, Burch used reverse TTL values to estimate reverse routing [20]. The approach generated a view of the Internet's topology using forward traceroutes from a measurement vantage point and considered paths through this topology as candidates for the reverse path. Then, it examined the TTL value of the traceroute responses, known as the return TTL values. Since routers generally initialize the TTL field to one of a small set of common values and each router along the path generally decrements the value, the remaining TTL value when it reaches the vantage point suggests the number of routers on the reverse path. The technique used these reverse path lengths to eliminate candidate reverse paths. Using known paths from traceroute servers as ground-truth reverse paths (as I did in Section 4.4.1), Burch revealed the limitations of his novel approach. First, the forward topology included less than half of the links on the reverse paths, meaning that the technique could not possibly find the correct path if it contained one of these links. Second, the return TTL values often did not eliminate enough candidates, with multiple potential reverse paths remaining in most cases.
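
As an illustration of the hop-count estimate at the heart of this technique, the following is a minimal Python sketch, under the assumption that routers initialize TTLs to one of a few common values; the candidate-path representation and the slack parameter are hypothetical choices, not Burch's.

    COMMON_INITIAL_TTLS = (32, 64, 128, 255)

    def reverse_hop_count(return_ttl):
        """Estimate how many routers the response crossed on the reverse path."""
        # Assume the responder used the smallest common initial TTL that is at
        # least the observed value; each router on the way back decremented it.
        initial = min(t for t in COMMON_INITIAL_TTLS if t >= return_ttl)
        return initial - return_ttl

    def filter_candidates(candidate_paths, return_ttl, slack=1):
        """Keep only candidate reverse paths whose length matches the estimate."""
        estimate = reverse_hop_count(return_ttl)
        return [p for p in candidate_paths if abs(len(p) - estimate) <= slack]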

Other projects inferred information about reverse paths, without attempting to measure what routes were taken. For example, PlanetSeer included techniques to infer if a failure was on the forward path [136]. PlanetSeer passively monitored clients of the CoDeeN content distribution network and launched active probes when it observed anomalies. PlanetSeer triggered measurements from multiple vantage points when it detected a problem. PlanetSeer inferred a forward failure in the following cases: (1) a traceroute to the destination returns an ICMP Destination Unreachable message; (2) TCP packets from the client suggest ACK losses on the forward path; or (3) a forward path route change accompanied the outage. In cases when these techniques did not apply, it could not determine the direction of failure. These techniques were attempting to solve the same problem as Hubble's source spoofing technique in Section 3.1.4. However, PlanetSeer's techniques could only infer a forward failure in 28% of cases. Further, in more than half of these cases, the system inferred the failure of an old forward path from observing a path change, but made no determination as to why the new path had failed. In Section 3.3.3, I presented results showing that, for sources able to spoof, Hubble isolated the direction of failure in 68% of cases.

Tulip identified packet reordering and loss along either the forward or reverse path [80]. When routers source new packets (as opposed to simply forwarding existing packets), they include an IP-ID value in the packet, and many routers use an incrementing counter for these values [62]. Tulip sent multiple ICMP requests to a single router. If the responses came back in a different order than the requests were sent, it determined that the reordering occurred on the reverse path if the response to the first packet sent had a lower IP-ID value (indicating it reached the destination first), and the reordering occurred on the forward path if the response to the first packet had a higher IP-ID value. Tulip included a similar IP-ID-based technique to isolate packet loss to the forward or reverse path.
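
The direction test reduces to a small comparison; the sketch below encodes it in Python under the assumption that the probed router assigns IP-IDs from a single incrementing counter (ignoring wraparound). The function and its inputs are hypothetical observations, not Tulip's interfaces.

    def reordering_direction(ipid_of_reply_to_first, ipid_of_reply_to_second):
        """Classify reordering, given that the two replies arrived in the
        opposite order from how the requests were sent."""
        if ipid_of_reply_to_first < ipid_of_reply_to_second:
            # The router answered the first request first, so the requests
            # arrived in order: the replies were reordered on the reverse path.
            return "reverse path"
        # Otherwise the first request was answered later, so the requests
        # themselves arrived out of order: the forward path reordered them.
        return "forward path"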

Another line of research seeks to recruit traceroute vantage points in diverse locations to provide a rich view of Internet routing, similar to the role of public traceroute servers. Some of these deployments consist mainly of hosts at research institutions [101, 5, 4, 98, 120], available for Internet measurements or, in some cases, other experiments. Other deployments recruit users from outside academia to diversify the viewpoints [111, 26]. Similar in spirit is Teixeira and Rexford's proposal, in which they argue for each AS to host servers, for distributed monitoring and querying of current forwarding path state [126]. This work complements reverse traceroute, and, in the future, it would be great if these efforts bundled reverse traceroute vantage point code to improve the system's coverage.

6.1.2 Systems that Would Benefit from Reverse Path Information

Many systems seem well-designed to make use of reverse path information, but, lacking it, make various substitutions or compromises. Reverse traceroute is the first technique to measure complete reverse paths from arbitrary destinations, and I now describe some recent systems that could benefit from this ability.

Some geolocation systems used delay and path information to estimate the geographic position of targets [61, 131], but the lack of reverse path information hurt the estimates. Geographic locations of hosts can be useful for targeted advertising and other applications. These approaches issue traceroutes from distributed vantage points to a target with an unknown location (and possibly to other hosts). Because packets must obey the speed of light, measured latencies limit how far apart hops along a path can be. The techniques attempt to constrain the location of the target enough to accurately estimate its geographic location. However, lacking reverse path data and accurate link latencies, the locations tended to be underconstrained, often leading to inaccurate estimates with low confidence. For example, both techniques had errors of over 100 miles for more than one in ten targets.

iPlane estimates the latency between a pair of arbitrary Internet hosts [78], but the estimates are more accurate when the system can measure paths out from the hosts [77]. These latency predictions can enable performance optimization in systems in which a host can choose between multiple possible hosts to connect to. The system works by: (1) measuring traceroutes from PlanetLab vantage points to prefixes around the Internet to build an atlas of Internet routing; (2) composing segments of those traceroutes to predict the paths in both directions between the pair of hosts; and (3) estimating the latency of the composed paths using latency measurements from the PlanetLab vantage points [78]. iPlane has a better chance of predicting a path from a host correctly if it has access to even a few paths from the host, but it lacks access to vantage points in most networks [77]. Reverse traceroute could provide such paths from arbitrary hosts. Also, iPlane's latency estimates are more accurate in cases when it can accurately measure one-way link latencies [76], but it could only do so for links that were traversed symmetrically. My link latency measurement technique in Section 4.6.2 could provide iPlane with better latency values. My work benefitted greatly from iPlane, as I borrowed its atlas-based approach, as well as its tools and measurements.

iSpy attempted to detect attacks known as prefix hijacks using traceroutes [140], but, in order to do so, assumed paths and failures were symmetric. In a prefix hijack, an attacking AS announces itself as the origin for another AS's prefix, in order to lure traffic to it and away from the legitimate owner. The researchers behind iSpy had the novel observation that the pattern of path outages caused by a hijack likely differs from the pattern of outages caused by a failure. In particular, in considering the tree of paths to a prefix, a failure in the tree only cuts off the subtree beyond that failure, whereas a hijack lures paths to the attacking AS, and so may cut off diverse parts of the tree. iSpy uses traceroutes from a prefix to distributed destinations in order to build this tree before any problems develop. Later, when the prefix experiences problems reaching some destinations, the system looks at which paths are disrupted to determine if a hijack or a failure likely explains the set of outages. This aspect of iSpy is similar to LIFEGUARD's reachability horizon. However, a hijack disrupts only reverse paths back to the prefix, but iSpy builds the tree using forward traceroutes from the prefix. Most paths are asymmetric [51], and, in Section 3.3.3, I presented results showing that many failures are unidirectional. Therefore, the use of forward paths to detect a signature of cuts in reverse paths could be misleading. By incorporating reverse traceroute, iSpy could instead build its tree of paths using reverse paths. LIFEGUARD uses reverse traceroute and spoofed probes to more accurately determine the reachability horizon.
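
To make the failure-versus-hijack intuition concrete, here is a minimal Python sketch under stated assumptions: paths are given as node lists toward the prefix, and a single failure is modeled as the removal of one node. This illustrates the signature only; it is not iSpy's algorithm.

    def single_cut_explains_outage(paths, unreachable):
        """paths: dict mapping destination -> list of nodes on its path toward
        the prefix. unreachable: set of destinations that lost connectivity.
        Returns True if removing one node would cut off exactly `unreachable`,
        the pattern expected from an ordinary failure."""
        candidates = {node for dst in unreachable for node in paths[dst]}
        for node in candidates:
            cut = {dst for dst, path in paths.items() if node in path}
            if cut == unreachable:
                return True
        # No single node explains the cuts: the scattered pattern is more
        # consistent with a hijack luring paths away across the tree.
        return False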

Likewise, intriguing recent work on inferring topology through passive observation of traffic based its technique on an implicit assumption that the hop counts of forward and reverse paths are likely to be the same [35]. The use of passive measurements could reduce the overhead required to measure the Internet's topology. This approach supplemented a limited number of traceroutes from vantage points with passive observations of IP traffic to the vantage points. It used the return TTL values in the passively captured packets to cluster them and to align them with the traceroutes. However, the traceroutes measure forward paths from the vantage points, whereas the TTL values come from packets to them. Asymmetric paths can have different hop counts. Using reverse traceroute as part of the approach would allow it to make both active and passive measurements along the same direction.

Netdiff sought to enable performance comparisons of paths across various ISPs [83], but it is unclear how accurate the comparisons are in cases of asymmetric routing. These comparisons could help applications or customers decide which ISP to use. The system uses traceroutes from end-hosts to measure the latencies from ingress points to egress points in transit networks along the paths. As shown in Section 4.6.2, traceroute can yield very inaccurate latency estimates when paths are asymmetric. Netdiff partially accounted for this by throwing out paths that are more than three hops longer in one direction than the other. However, the example in Section 4.5 shows that even paths with the same lengths can follow vastly different routes. Reverse traceroute could allow Netdiff to discard asymmetric paths, and the reverse traceroute-based technique for measuring link latencies in Section 4.6.2 could allow Netdiff to provide accurate comparisons even in cases of asymmetry.

6.2 Availability Problems

In this section, I survey relevant work on Internet availability problems. The background in Section 2.4 presented the most pertinent information on performance problems.

6.2.1 Characterizing and Locating Failures

A rich body of existing work explores various aspects of Internet outages, characterizing certain types of problems and locating them in various ways. In this section, I group these studies based on the approach used to detect the problems.

Passive BGP Monitoring: Numerous studies have modeled and analyzed BGP behavior to understand failures. For instance, Labovitz et al. [69] found that Internet routers may take tens of minutes to converge after a failure, and that end-to-end disconnectivity accompanies this delayed convergence. Mahajan et al. [82] showed that router misconfigurations could be detected with BGP feeds. Caesar et al. [22] proposed techniques to analyze routing changes and infer why they happen. Feldmann et al. were able to correlate updates across time, across vantage points, and across prefixes by assuming that the failure lies either on the old best path or the new best path; they could pinpoint the likely cause of a BGP update to one or two ASes [38]. Wang [130] examined how the interactions between routing policies, iBGP, and BGP timers lead to degraded end-to-end performance. BGP beacons [84] benefited this work and other studies. Together, these studies developed techniques to reverse-engineer BGP behavior, visible through feeds, to identify network anomalies.

Intradomain Troubleshooting: Shaikh and Greenberg proposed to monitor intra-domain routing announcements within an AS to identify routing problems [109]. Kompella et al. also developed techniques to localize faults with AS-level monitoring [64] and used active probing within a tier-1 AS to detect black holes [63]. Wu et al. used novel data mining techniques to correlate performance problems within an AS to routing updates [132]. Huang et al. [54] correlated BGP data from an AS with known disruptions; many were detectable only by examining multiple BGP streams.

Active Probing: Other studies used active probes to discover reachability problems. Paxson was the first to demonstrate the frequent occurrence of reachability issues [97]. Measuring Internet failures between vantage points using periodic pings and triggered traceroutes, Feamster et al. correlated end-to-end performance problems with routing updates [37]. These and other studies [4, 129, 33, 43] probed between pairs of nodes in small deployments, allowing detailed analysis but limited coverage. Pervasive probing systems, such as iPlane [78] and DIMES [111], exist, but have been designed to understand performance and topology rather than to detect and diagnose faults.

6.2.2 Avoiding Failures

Detouring: A number of systems "detour" traffic around failures or slow paths [106, 4, 47, 15]. These approaches redirect traffic between two hosts through a third detour node. Within that general approach, the work ranges from studies assessing only pairwise paths between nodes in a testbed [106, 4], to academic systems capable of detouring Web traffic [47], to commercial services [15]. Some solutions find detour paths by using all-pairs path monitoring [4], whereas others use random selection of alternative paths at the time of failure [47]. However, all of this work found that better paths often exist but are not found by existing protocols. In Sections 2.4 and 2.5, I summarized some of the results. The main drawbacks of detour-based solutions are the cost of maintaining the detour nodes and the fact that a composed detour path through an intermediary often is not policy-compliant as an end-to-end path. However, when economical, these approaches provide a great way to alleviate many problems.

Protocol modifications: Since transient loss frequently occurs during routing protocol convergence, a number of proposals suggest protocol modifications. These approaches include carrying information about failures in packets themselves, to signal that the packets should not be routed back towards the failures [71]; modifying BGP to include backup paths for use during convergence [68]; or modifying BGP to guarantee that ASes have consistent views of available routes [60], addressing the underlying cause of convergence loss.

MIRO proposes to enable AS avoidance and other functionality by allowing ASes to advertise and use multiple inter-domain paths [133]. While the proposal retains much of the quintessence of BGP and current routing policies, it does require modifications to protocols and to routers. The deployable solutions and results on failure avoidance in my dissertation could help make the case for the adoption of MIRO-like modifications.

Chapter 7

CONCLUSIONS AND FUTURE WORK

In this chapter, I summarize the work presented in this dissertation, review the thesis it supports and the contributions made, and outline some lines of future work that could build upon my work.

7.1 Thesis and Contributions

In this dissertation, I supported the following thesis: It is possible to build effective systems that can help service providers improve Internet availability and performance and that are deployable on today’s Internet.

The work presented in this dissertation made the following contributions:

Measurement studies characterizing long-lasting Internet outages. My system Hubble demonstrated that it is feasible to monitor outages on an Internet scale, with 15-minute granularity. I conducted a measurement study with Hubble, along with two related studies, to help understand long routing failures. The studies show that these prolonged issues contribute much of the unavailability observed by vantage points monitoring destinations around the Internet. Many of the problems do not manifest in the observable control plane, with available route collectors showing BGP routes to a destination even as data plane traffic fails to reach the destination. During these long-lasting failures, working alternate routes often seem to exist. Most of the problems during the studies were partial, with some vantage points able to reach destinations that were unreachable from other vantage points. Many of the failures were unidirectional, and measurements often reveal what seem to be functioning policy-compliant paths around the problems. These characteristics discovered by my studies guide my approach to addressing the problems.

The design and implementation of reverse traceroute. Because existing tools cannot measure reverse paths, I developed a distributed system that can measure the reverse path back from arbitrary hosts. The system works in many of the same cases as traceroute, a widely-used troubleshooting tool, and it supplements the forward path measurements traceroute provides. Reverse traceroute relies on the fact that Internet routing is generally destination-based, allowing the system to measure the path incrementally back from the destination. It uses IP options to measure each hop; limited source spoofing overcomes many of the shortcomings of relying on IP options. I demonstrate the benefit of reverse path information in understanding inflated latency and in measuring link latencies and peering links–Internet measurement goals that are challenging with existing tools. Reverse traceroute is also a key component of my technique to locate failures.

An approach to remediating persistent Internet outages by first locating the AS containing the failure, then redirecting routes around it. I designed and implemented LIFEGUARD, a two-component system to resolve long-lasting partial outages. Because unidirectional failures are common, LIFEGUARD builds on reverse traceroute to locate failures along either direction of a path. To have a solution that works today, I developed an approach that empowers a destination AS to use BGP poisoning in a controlled fashion to route others around a failure. I also described how this poisoning approximates the control the AS could achieve given a new BGP mechanism designed for this purpose.

7.2 Key Approaches

Some common approaches emerge from my work:

(Mis)use of Internet protocols. The Internet protocols have been hugely successful at supporting Internet growth, but this success is partly at the expense of providing rich functionality. Part of the Internet's success stems from the protocols' abilities to scale gracefully and to permit the Internet's expansion in terms of users, applications, and devices. The protocol designers facilitated this growth by stressing inter-operability and AS autonomy over other concerns like performance or transparency. These priorities complicate ASes' abilities to optimize performance and availability. For example, ASes make autonomous decisions with little visibility into or control over other networks, instead of having one centralized authority with a global view calculate Internet-wide routing. The relatively simple design eases adoption, allowing the Internet to grow in ways that the designers may not have explicitly planned for. For instance, paths in the early Internet were generally symmetric, leaving little need for visibility into reverse paths. Now, most paths are asymmetric, causing a need for bidirectional visibility.

Because the Internet protocols do not explicitly provide some of the needed functionality, I repurpose aspects of the protocol in pursuit of better performance and availability. For example, because the Internet does not enforce source authentication, I use source spoofing to isolate the two directions of a path when determining the direction and location of a failure. Similarly, I use the IP timestamp option not because I need the timestamp values from routers, but rather to determine the order in which a packet traverses a pair of routers. Because BGP does not provide explicit problem notifications or source authentication, I repurpose BGP’s loop prevention mechanism to let an origin AS inform other ASes of a routing problem.

Separation of forward and reverse paths. When possible, I separate forward and reverse path considerations to make more precise measurements and decisions. I use source spoofing to probe the two directions of a path individually. Spoofing aids failure isolation and permits the use of vantage points in the best position to make reverse traceroute measurements. The principle also helps in failure avoidance. Many deployable approaches to failure avoidance reroute both the forward and reverse directions [4, 47], regardless of which failed. I use a unidirectional approach to repair unidirectional outages, allowing the working direction to continue to use its existing route.

Dynamic coordination of measurements from multiple vantage points. To obtain a view unavailable from any single vantage point, my measurement techniques rely on dynamic coordination of probes between multiple vantage points to provide richer data: the result of one vantage point's measurement may trigger a probe from another, or one might send a probe to be received by another. Taking advantage of topological relationships that vantage points have to each other and to destinations, I issue probes that approximate the effects of vantage points I lack. Recent measurement architectures focus on providing platforms for dynamically coordinated measurements [55, 2, 65], and my techniques seem like good fits.

All my probing systems relied on dynamic coordination. For example, my systems to characterize and isolate failures rely on one vantage point to identify a problem and notify others to issue further probes to understand the problem. Reverse traceroute uses measurements from various vantage points to stitch together a path that no one vantage point could measure on its own. In various places, I have one vantage point spoof the source address of another to decouple forward path measurements from reverse path measurements.

Historical views of Internet routing. I maintain historical atlases of Internet routes and topology, and I use these historical views to improve live measurements and to understand and contextualize current routes. The use of atlases in iPlane [78, 79] heavily influenced my work; in some instances, I use iPlane’s measurements as the basis for my atlas. I use an atlas of the Internet’s topology and business relationships to reason about alternate paths during failures. A historical view of routing before failures allows a comparison of paths during a failure to the routing before the failure. This information is particularly useful for isolating reverse path failures. Reverse traceroute uses atlas measurements to decide which vantage point is nearest to a destination and to generate candidate next hops for timestamp measurements.

7.3 Future Work

The work presented in this dissertation can be extended in several ways. Sections 7.3.1, 7.3.2, and 7.3.3 discuss how Hubble, reverse traceroute, and LIFEGUARD could be improved and better assessed. Section 7.3.4 suggests ways in which Internet protocols and routers might be modified to better support the goals of this dissertation.

7.3.1 Improving Understanding of Failures

Although the studies in Chapter 3 enrich our understanding of outages, we could learn more along a number of dimensions. LIFEGUARD's failure isolation techniques could also be used to characterize outages in more detail than I did in Chapter 3.

It would be worthwhile to design a system to identify and understand fine-grained failures. To avoid introducing too much probe traffic, Hubble made a number of compromises that meant that it only identified fairly lengthy problems that affected multiple vantage points. If future systems probe more efficiently, they may be able to improve upon Hubble in this regard. Such systems could provide outage detection to feed into LIFEGUARD’s failure location and avoidance. Here, I describe the approaches Hubble took and how future systems might address them.

To find problems faster (and hence detect shorter-lived outages), systems could probe targets more frequently. Before failures, Hubble in aggregate only probed each destination every two minutes, which meant that it took a while for individual vantage points to cycle through the targets. A number of strategies could reduce the cycle time. For one, a system could spread its monitoring of a single "target" across multiple destinations within the same prefix, such that the system probed the prefix more often without probing an individual host more often. Also, it might be possible to identify hosts in some prefixes that the system could probe at faster rates without inducing rate limiting or raising alarms. Finally, it may be possible to reduce the total number of targets by merging multiple prefixes into "atoms" that share routes [16], if it is the case that prefixes in an atom experience the same outages.
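
As an illustration of the first strategy, the following is a minimal Python sketch, under the assumption that each monitored prefix has several known responsive hosts; the probe function, the prefix-to-host mapping, and the probing period are hypothetical placeholders rather than Hubble's design.

    import itertools
    import time

    def probe(host):
        print("probing", host)   # stand-in for a real ping or traceroute

    def monitor(prefix_hosts, period_seconds=30.0, rounds=10):
        """prefix_hosts: dict mapping a prefix to representative hosts inside it.
        Each round probes one host per prefix, rotating through the hosts so
        the prefix is checked often while any single host is probed rarely."""
        rotations = {p: itertools.cycle(hosts) for p, hosts in prefix_hosts.items()}
        for _ in range(rounds):
            for prefix, rotation in rotations.items():
                probe(next(rotation))
            time.sleep(period_seconds)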

In addition to cycling through targets faster, a system might be able to cycle through them more intelligently, by predicting which failures might correlate. For example, an outage affecting paths to a core router might also affect paths through that router, suggesting that the system should recheck those paths. During a failure, each Hubble vantage point only issues one traceroute to the destination per 15 minutes, and so the system could miss changes (in routes or reachability) that happen at a finer granularity. Recent techniques may aid in designing adaptive probing rates to try to catch these changes [31], and lightweight signals (such as probing only the TTL just past the last hop in the previous traceroute) could be used to determine when to issue a full traceroute.
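
One way such a lightweight signal might look is sketched below with Scapy, assuming the previous traceroute toward a destination stopped responding at a known TTL; the destination address and timeout are hypothetical, and a real system would add retries and rate limiting.

    from scapy.all import IP, ICMP, sr1

    def worth_full_traceroute(dst, last_responsive_ttl):
        """Probe one hop past where the previous traceroute stopped. Any reply
        suggests the path now extends further, so a full traceroute is worthwhile."""
        probe = IP(dst=dst, ttl=last_responsive_ttl + 1) / ICMP()
        reply = sr1(probe, timeout=2, verbose=0)
        return reply is not None

    # Hypothetical usage:
    # if worth_full_traceroute("192.0.2.1", 7):
    #     launch_full_traceroute("192.0.2.1")   # placeholder for the heavier measurement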

Whereas Hubble used the same vantage points to monitor all targets, a system that chose them per target could optimize for path coverage. Hubble only considered failures affecting at least a few of its vantage points. It would be interesting to also consider more localized failures, or instances when a given source experienced simultaneous problems reaching multiple destinations (perhaps due to a shared reverse path).

7.3.2 Improving Reverse Traceroute

Reverse traceroute for all. While reverse traceroute can measure individual paths, the system as presented in this dissertation cannot support a huge number of measurements or sources. The current system only uses a small number of PlanetLab sites as sources. In this section, I discuss ongoing and future plans to improve the scalability of the system.

First I discuss why it would be helpful to be able to issue more measurements. The needs of operators and ASes partly motivated the need for a technique to measure reverse paths. As argued in this dissertation, reverse traceroute provides information valuable for improving performance and availability. If operators across the Internet could use it in their daily operations, just as they use traceroute today, it would improve their ability to provide high availability and good performance. To realize that benefit, the system needs to be available for use, which requires opening it and being able to support the measurement load. ASes could also use such an open system for other purposes, such as providing intelligence to guide peering decisions.

Reverse traceroute can also help researchers understand the Internet. As the system grows to support more measurements, it will be able to support more studies and richer studies with better coverage. A similar transformation occurred as researchers went from using traceroutes between a small number of testbed nodes to Internet-scale measurements, and as services like RouteViews [87] and RIPE RIS [105] made more BGP feeds available. A more scalable reverse traceroute could enable a number of research directions. Section 4.6.2 demonstrated that the system can be used to accurately measure backbone link latencies. A link latency map of the Internet would be useful for applications including comparing ISP performance [83], optimizing wide-area routes, and predicting end-to-end latencies [79]. Reverse traceroute could enable a novel approach to group IP addresses into routers or PoPs. Reverse traceroute enables direct measurement of how arbitrary hosts route towards reverse traceroute sources, essentially giving a partial view into the hosts' routing tables that could be used for clustering. Similarly, techniques to infer routing policy work by comparing observed paths to those that could exist, but are not observed - that is, by comparing taken paths to paths not taken [42, 124, 118]. By measuring paths back to reverse traceroute sources from locations around the Internet, a scalable reverse traceroute could provide uniquely detailed maps of the route and topology options available to ASes' policy engines. This ability to build a detailed routing tree of paths back to a given source could provide the basis for new observations about routing. Reasoning about changes over time to these reverse path maps could lead to techniques for inferring the underlying causes of routing changes. My own system LIFEGUARD would be able to monitor many more paths if it could issue reverse traceroute measurements at a much faster rate.

While one approach to enabling many more measurements would be for different ASes or research groups to run their own systems, I believe the community is best served by a single unified system. First, if groups interested in using reverse traceroute contribute to such a system, instead of running their own, their vantage points contribute to the common good, improving coverage and accuracy for all, and improving measurement capacity. Public traceroute servers provide a model of ASes providing measurement vantage points for others to use. Second, as I discuss further below, because of the overlap of paths, reverse traceroute allows many opportunities for measurement reuse, and so a unified system may be more efficient. Third, such a system allows for consolidated probe rate limiting. Many routers have strict limits on the rate at which they reply to probes, especially those with IP options, and so careful accounting can increase the chances of replies. Finally, because reverse traceroute uses techniques rarely used elsewhere, it is beneficial to provide a central point of contact for router vendors and ASes. Although all reverse traceroute probes are protocol compliant, the ease of communication may come in handy if the unusual probes raise any concerns or questions. The trust this encourages may be useful in negotiating updates to standards or router configurations in support of the system.

Most scalability improvements seem likely to come from shifting from focusing on measuring individual paths to measuring sets of paths. As sketched in Section 5.4.4, it is possible to refresh a stale path faster than measuring a new path from scratch, as the system can optimize the probes it sends to quickly determine whether or not parts of the path had changed. That evaluation used a very primitive version of this approach, reusing all measurements within the same fixed period. Instead, by accounting for how quickly routes change in different parts of the Internet, a future system could cache reverse paths and reverse path segments for appropriate periods of time, reusing them when possible. The system could also more quickly identify changes by, for example, using all four IP timestamp slots to confirm multiple hops on the path at once. To enable Internet-scale studies and systems, it may make sense to have the system continually map reverse hops from PoPs across the core of the Internet towards a set of sources, then use this atlas to piece together paths as needed. In building such an atlas, it would likely be useful to carefully stage the measurements, to maximize reuse and minimize the effects of rate limiting. Towards that end, it may be helpful to model or measure the rate limiting in place at routers; it may be possible to learn and account for the different configurations in use by different router models and different ASes.
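
The caching idea could be as simple as the following Python sketch, which stores reverse path segments with a per-segment lifetime chosen by the caller; the segment keys, lifetimes, and clock source are illustrative assumptions, not a design from this dissertation.

    import time

    class SegmentCache:
        """Cache reverse path segments, each reusable until its lifetime expires."""

        def __init__(self):
            self._entries = {}   # (from_hop, to_source) -> (segment, expiry_time)

        def put(self, key, segment, lifetime_seconds):
            self._entries[key] = (segment, time.time() + lifetime_seconds)

        def get(self, key):
            entry = self._entries.get(key)
            if entry is None:
                return None
            segment, expires = entry
            if time.time() > expires:
                del self._entries[key]   # stale: force a fresh measurement
                return None
            return segment

    # Hypothetical usage: stable core segments might get long lifetimes,
    # edge segments short ones.
    # cache = SegmentCache()
    # cache.put(("4.68.0.1", "source-site"), ["4.68.0.1", "4.68.1.1"], lifetime_seconds=3600)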

Some other improvements to the system seem possible at the level of individual path measurements. For example, better alias information and better information about load balancers may allow a reverse traceroute measurement to intersect a known traceroute to the source sooner. A quicker intersection reduces the number of required measurements. Similarly, better understanding of issues related to the system's accuracy and coverage, as proposed below, could lead to fewer wasted probes.

Understanding limitations on reverse traceroute’s accuracy. Section 4.4.1 demonstrated that reverse traceroute accurately measures most reverse paths. In that section, I also described the reasons why a reverse traceroute might yield a different path than a traceroute over the same path.

I can extend this analysis in a number of ways. First, while the current analysis captures how often the technique needs to assume symmetry, it would be useful to understand where the technique assumes symmetry and where the assumptions are correct. Preliminary analysis suggests that assuming symmetry within an individual AS rarely introduces path inaccuracies, at least at the AS level, whereas assuming symmetry across AS boundaries often results in incorrect paths. If these results hold, it may be possible to use the number of inter-AS assumptions of symmetry as a metric of confidence for reverse traceroute measurements.
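
Such a confidence metric could be computed as in the Python sketch below, assuming each measured hop is annotated with whether it was assumed symmetric and that an IP-to-AS mapping is available; the data representation is a hypothetical choice, not part of the deployed system.

    def inter_as_symmetry_assumptions(hops, ip_to_as):
        """hops: ordered list of (ip, assumed_symmetric) pairs along a reverse path.
        ip_to_as: dict mapping an IP address to its origin AS number.
        Returns how many assumed-symmetric hops cross an AS boundary; a higher
        count would indicate lower confidence in the measured path."""
        count = 0
        for (ip, assumed), (next_ip, _next_assumed) in zip(hops, hops[1:]):
            if assumed and ip_to_as.get(ip) != ip_to_as.get(next_ip):
                count += 1
        return count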

Second, since reverse traceroute assumes that routing is destination-based, it would be helpful to have a better understanding of where this assumption does or does not hold. I performed some analysis of how often paths violate reverse traceroute’s assumption that routing is destination-based. I plan to extend it to cover more paths, to classify which ASes violate the assumption, to understand the causes of the violations, and to distinguish which ASes use MPLS.

Third, reverse traceroute requires router support for IP options or vantage points with forward traceroutes that traverse the network in question, and so it would be helpful to understand which parts of the Internet are blind spots for the technique. This analysis could drive the recruitment of vantage points in advantageous positions.

7.3.3 Improving LIFEGUARD

Both LIFEGUARD’s failure location subsystem and the evaluation of that subsystem could improve substantially, with two main goals. First, I would like LIFEGUARD to be used operationally, and so the system needs to be improved to support that use. Second, with feedback from operators, I would like to use experiences with LIFEGUARD to motivate modifications to the Internet’s protocols to better support high availability, good performance, and measurement.

Correlating failures across sources and destinations may enrich understanding of each individual failure. While I was able to compare LIFEGUARD's isolation results to ground truth in a limited PlanetLab setting, the accuracy evaluation outside that setting was opportunistic and incomplete, relying on occasional overlaps with operator mailing lists. In the future, it would be great to verify more of the results by emailing operators at the ASes the system identifies as experiencing problems. Feedback from operators could establish whether the system identifies problems affecting real traffic and whether it pinpoints the failure accurately enough to speed recovery. Operators have access to ground truth information within their networks and frequently perform postmortem investigations of outages, and so they could compare the data provided during the failure by LIFEGUARD to the results of their investigations that use privileged information. Similar comments from operators provided useful evaluation of existing systems [119, 82].

A number of steps can improve LIFEGUARD's poisoning-based failure avoidance and the evaluation of the system. First, to avoid poisoning in cases that will resolve quickly without that intervention, the system needs to be better able to predict whether an outage will persist. Integrating BGP collectors might help; if routes converge without resolving the problem, it may be likely to persist. A study could also determine if topological characteristics of an outage are helpful in predicting if it will persist. For example, historical measurements could reveal the presence or absence of backup routes. Second, in work in progress, I am checking whether sentinel prefixes experience the same failures as production prefixes; the system depends on this assumption to discern when to revert a poisoning. Third, the evaluation of end-to-end failure location and avoidance relied on a qualitative case study rather than quantitative evaluation. An evaluation of failure avoidance could use LIFEGUARD's failure location to trigger its failure avoidance for many outages over an extended time period, using measurements before and after poisoning from both the sentinel and production prefix to test whether poisoning restored connectivity.

7.3.4 Improving Support for Measurements

Better protocol and router support could substantially aid the goals of this thesis. One important aspect of this support is convincing the operator community that best practices should encourage routers to reply to measurement techniques like the record route probes used by reverse traceroute, just as the community encourages operators to configure routers to reply to traceroute. The other aspect is adding new features to routers and to Internet protocols. I hope that the results in this dissertation and related work [76, 80, 114, 115] help inspire some of these changes. A partial list of potential modifications that would be useful, ordered loosely by how useful they might be, includes:

• A modified record route option that allows the sender to specify that routers should start recording n hops into the path, rather than from the first hop. Figure 4.12 showed that many destinations are out of record route range of all reverse traceroute vantage points. This modified option would allow the system to measure reverse hops from all responsive destinations, and it would obviate the need to spoof in order to be within record route range. Note that my systems would still need to spoof to avoid filters and to isolate unidirectional failures.

• A way to request from a router its next hop towards a specified destination. This information would be very useful for troubleshooting outages, as it would indicate what the last router on a broken route was trying to do. ASes are unlikely to be willing to reveal that information in general. However, a version of the ICMP Time Exceeded message that encoded the next hop might be more palatable, as then sources could only learn about paths they were routing along.

• The AVOID PROBLEM(X,P) announcement described in Chapter 5. Although BGP poisoning provided a way to evaluate failure avoidance, it would be preferable if BGP provided more explicit support. Some operators likely would prefer to use an explicitly designed community rather than to poison. By inserting another AS into the path, an origin AS poisoning one of its prefixes potentially raises concerns that it is misrepresenting the path, that it does not have authority to include the other AS, or that it is disrupting other networks. While I think these concerns are minor, given that the poison only affects traffic to the origin's prefix, an AVOID PROBLEM(X,P) announcement designed for the purpose would assuage many of these concerns. BGP communities likely provide the easiest path to adoption, though ASes would have to agree not to strip the communities on transit and, ideally, to standardize the format. The community would instruct the problem AS not to export the route (or not to export it if learned from a particular neighboring AS). Ideally, the origin would be able to sign the community to establish that it had not been added in transit; this authentication is in keeping with ongoing efforts to secure BGP [88, 9].

• A query to have a router return all of its IP addresses (aliases). Although many techniques exist to resolve aliases [11, 62, 48, 114, 115, 86], even in combination they do not provide complete coverage or accuracy. Combining the results of multiple techniques is problematic, as false positives compound. Further, many are heavyweight and only lend themselves to occasional offline analysis. Better alias information could help the work in this dissertation in multiple ways: reducing the overhead (and possibly increasing the coverage) of reverse traceroute by enabling faster intersection with atlas traceroutes; improving the analysis of reverse traceroute's accuracy by making it easier to compare the results of reverse traceroute with traceroute; and allowing more accurate joining of multiple outages or path measurements at the router level.

• A ping variant that indicates to the destination that it should route its reply via the router that forwarded it the request (the last hop on the forward path). Currently, measuring link latencies is very difficult. The constraint-based technique using reverse traceroute from Section 4.6.2 was accurate on the Sprint topology, but it requires many measurements and will not work on underconstrained topologies. However, measuring the latency of a link that is traversed symmetrically is easy, as one needs only subtract the round-trip latencies to each end of the link (see the sketch after this list). By "bouncing" a ping back symmetrically over the last link, this proposed probe would let us use this subtraction technique for any link. It would also be useful to test connectivity during unidirectional failures; if H appears on the forward traceroute but the next hop H' only appears on a spoofed forward traceroute, this ping would test whether H' could route via H to use the working path back from H. The bounce would not substantially alter the routing of the ping, as the router it bounced back to would route it normally.

• "Fast path" support for measurements, and a higher default rate limit configuration. Many core Internet routers have strict rate limits for how often they reply to IP options packets. Operators configure them with these limits because routers process options in software, rather than on the "fast path" that handles forwarding most option-free packets. The software processing occurs on a typically underpowered processor, which may also be responsible for control plane processing [122]. Rate limiting can keep IP options packets from consuming resources needed for other purposes. This dissertation demonstrates how useful options can be, which could help convince router vendors to improve how efficiently they are processed. If routers process options more efficiently, operators can configure them to reply to more. Higher rate limits would enable my systems to probe faster, allowing them to diagnose more problems faster. My systems' productive use of options could also help persuade more operators to support options in their networks, just as best practices today call for routers to support traceroute.

• An error message a router can send to a source to indicate that the router has exceeded its rate-limit for responding to probe packets. Currently, a lack of response could indicate filtering, congestion loss, a routing failure, or rate-limiting at any router along the path. Knowing that a particular router was rate-limiting would enable measurement systems to be network-friendly by probing at a high, but acceptable rate. Of course, this message itself would be strictly rate-limited.
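
The subtraction technique referenced in the bullet on the bounced ping is simple arithmetic; the sketch below is a minimal Python illustration with hypothetical round-trip times, not a measurement from this dissertation.

    def symmetric_link_latency_ms(rtt_to_near_end_ms, rtt_to_far_end_ms):
        """One-way latency of a link traversed symmetrically: the extra round-trip
        time to the far end, compared to the near end, is two crossings of the link."""
        return (rtt_to_far_end_ms - rtt_to_near_end_ms) / 2.0

    # Hypothetical example: RTT of 23 ms to the near end and 31 ms to the far end
    # gives an estimated one-way link latency of 4 ms.
    print(symmetric_link_latency_ms(23.0, 31.0))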

7.4 Summary

As we depend on the Internet more and more, it is increasingly important that the Internet deliver high availability and good performance. In this dissertation, I presented systems to aid ASes in meeting these goals. Whereas currently operators try to troubleshoot performance problems with a view of only half the path, my reverse traceroute system gives round-trip visibility. My studies characterizing failures demonstrated that long-lasting partial outages contribute much of the unavailability, and my system LIFEGUARD provides an approach to addressing them, by locating the failure and rerouting around it. While my current techniques supply useful tools to improve Internet reliability, multiple avenues exist to enhance their applicability and scalability. I hope that in the future operators and researchers can depend on reverse traceroute just as they use traceroute today, and that prolonged outages decline as our understanding of them and techniques to address them improve.

BIBLIOGRAPHY

[1] Abilene Internet2 network. http://www.internet2.edu/network/.

[2] Mark Allman and Vern Paxson. A reactive measurement framework. In PAM, 2008.

[3] Amazon EC2. http://aws.amazon.com/ec2/.

[4] David Andersen, Hari Balakrishnan, Frans Kaashoek, and Robert Morris. Resilient overlay networks. In SOSP, 2001.

[5] Archipelago measurement infrastructure. http://www.caida.org/projects/ark/.

[6] Brice Augustin, Xavier Cuvellier, Benjamin Orgogozo, Fabien Viger, Timur Friedman, Matthieu Latapy, Clemence Magnien, and Renata Teixeira. Avoiding traceroute anomalies with Paris traceroute. In IMC, 2006.

[7] Brice Augustin, Timur Friedman, and Renata Teixeira. Measuring load-balanced paths in the Internet. In IMC, 2007.

[8] Brice Augustin, Balachander Krishnamurthy, and Walter Willinger. IXPs: Mapped? In IMC, 2009.

[9] Rob Austein, Steven Bellovin, Randy Bush, Russ Housley, Matt Lepinski, Stephen Kent, Warren Kumari, Doug Montgomery, Kotikalapudi Sriram, and Samuel Weiler. BGPSEC protocol. http://tools.ietf.org/html/draft-ietf-sidr-bgpsec-protocol.

[10] Fred Baker and Pekka Savola. Ingress filtering for multihomed networks. RFC 3704, 2004.

[11] Adam Bender, Rob Sherwood, and Neil Spring. Fixing Ally’s growing pains with velocity modeling. In IMC, 2008.

[12] Robert Beverly, Arthur Berger, Young Hyun, and kc claffy. Understanding the efficacy of deployed Internet source address validation filtering. In IMC, 2009.

[13] The BGP Instability Report. http://bgpupdates.potaroo.net/instability/bgpupd.html.

[14] BGPMux Transit Portal. http://tp.gtnoise.net/.

[15] Claudson Bornstein, Tim Canfield, and Gary Miller. Akarouting: A better way to go. In MIT OpenCourseWare 18.996, 2002.

[16] Andre Broido and kc claffy. Analysis of RouteViews BGP data: policy atoms. In Proceedings of the ACM SIGMOD/PODS Workshop on Network-Related Data Management, Santa Barbara, CA, May 2001.

[17] Martin A. Brown, Clint Hepner, and Alin C. Popescu. Internet captivity and the de-peering menace. In NANOG, 2009.

[18] Jake Brutlag. Speed matters for Google web search. Technical report, Google Inc., 2009.

[19] Tian Bu, Nick Duffield, Francesco Lo Presti, and Don Towsley. Network tomography on general topologies. In SIGMETRICS, 2002.

[20] Hal Burch. Measuring an IP network in situ. PhD thesis, Carnegie Mellon University, 2005.

[21] Randy Bush, Olaf Maennel, Matthew Roughan, and Steve Uhlig. Internet optometry: assessing the broken glasses in Internet reachability. In IMC, 2009.

[22] Matthew Caesar, Lakshminarayanan Subramanian, and Randy H. Katz. Towards localizing root causes of BGP dynamics. Technical report, University of California, Berkeley, 2003.

[23] Hyunseok Chang, Sugih Jamin, and Walter Willinger. Inferring AS-level Internet topology from router-level path traces. In SPIE ITCom, 2001.

[24] Kai Chen, David R. Choffnes, Rahul Potharaju, Yan Chen, Fabián E. Bustamante, Dan Pei, and Yao Zhao. Where the sidewalk ends: Extending the Internet AS graph using traceroutes from P2P users. In CoNEXT, 2009.

[25] Yan Chen, David Bindel, Hanhee Song, and Randy H. Katz. An algebraic approach to practical and scalable overlay network monitoring. In SIGCOMM, 2004.

[26] David Choffnes and Fabian E. Bustamante. Taming the torrent: A practical approach to reducing cross-ISP traffic in peer-to-peer systems. In SIGCOMM, 2008.

[27] David R. Choffnes, Fabián E. Bustamante, and Zihui Ge. Crowdsourcing service-level network event monitoring. In SIGCOMM, 2010.

[28] Mark Coates, Alfred Hero, Robert Nowak, and Bin Yu. Internet tomography. IEEE Signal Processing Magazine, 2002.

[29] Lorenzo Colitti. Internet Topology Discovery Using Active Probing. PhD thesis, Roma Tre University, 2006.

[30] Russ Cox, Frank Dabek, Frans Kaashoek, Jinyang Li, and Robert Morris. Practical, distributed network coordinates. In HotNets, 2003.

[31] Italo Cunha, Renata Teixeira, and Christophe Diot. Predicting and tracking Internet path changes. In SIGCOMM, 2011.

[32] Mike Dahlin, Bharat Chandra, Lei Gao, and Amol Nayate. End-to-end WAN service availability. IEEE/ACM TON, 2003.

[33] Amogh Dhamdhere, Renata Teixeira, Constantine Dovrolis, and Christophe Diot. NetDiagnoser: Troubleshooting network unreachabilities using end-to-end probes and routing data. In CoNEXT, 2007.

[34] Benoit Donnet, Philippe Raoult, Timur Friedman, and Mark Crovella. Efficient algorithms for large-scale topology discovery. In SIGMETRICS, 2005.

[35] Brian Eriksson, Paul Barford, and Robert Nowak. Network discovery from passive measurements. In SIGCOMM, 2008.

[36] Facebook statistics. https://www.facebook.com/press/info.php?statistics, viewed March 3, 2011.

[37] Nick Feamster, David G. Andersen, Hari Balakrishnan, and M. Frans Kaashoek. Measuring the effects of internet path faults on reactive routing. In SIGMETRICS, 2003.

[38] Anja Feldmann, Olaf Maennel, Z. Morley Mao, Arthur Berger, and Bruce Maggs. Locating Internet routing instabilities. In SIGCOMM, 2004.

[39] Paul Ferguson and Daniel Senie. Network ingress filtering: Defeating denial of service attacks which employ IP source address spoofing. RFC 2827, 2000.

[40] Rodrigo Fonseca, George Manning Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. IP options are not an option. Technical report, EECS Department, University of California, Berkeley, 2005.

[41] Paul Francis, Sugih Jamin, Cheng Jin, Yixin Jin, Danny Raz, Yuval Shavitt, and Lixia Zhang. IDMaps: A global Internet host distance estimation service. IEEE/ACM TON, 2001.

[42] Lixin Gao. On inferring autonomous system relationships in the Internet. IEEE/ACM TON, 2001.

[43] Fotis Georgatos, Florian Gruber, Daniel Karrenberg, Mark Santcroos, Henk Uijterwaal, and René Wilhelm. Providing active measurements as a regular service for ISPs. In PAM, 2001.

[44] Ramesh Govindan and Vern Paxson. Estimating router ICMP generation delays. In PAM, 2002.

[45] Ramesh Govindan and Hongsuda Tangmunarunkit. Heuristics for Internet map discovery. In INFOCOM, 2000.

[46] Bamba Gueye, Artur Ziviani, Mark Crovella, and Serge Fdida. Constraint-based geolocation of Internet hosts. IEEE/ACM TON, 2006.

[47] Krishna P. Gummadi, Harsha V. Madhyastha, Steven D. Gribble, Henry M. Levy, and David Wetherall. Improving the reliability of Internet paths with one-hop source routing. In OSDI, 2004.

[48] Mehmet H. Gunes and Kamil Sarac. Resolving IP aliases in building traceroute-based Internet maps. IEEE/ACM TON, 2008.

[49] Omer Gurewitz, Israel Cidon, and Moshe Sidi. One-way delay estimation using network-wide measurements. IEEE/ACM TON, 2006.

[50] Md. Ahsan Habib and Marc Abrams. Analysis of sources of latency in downloading web pages. In WebNet, 2000.

[51] Yihua He, Michalis Faloutsos, Srikanth Krishnamurthy, and Bradley Huffaker. On routing asymmetry in the Internet. In Autonomic Networks Symposium in Globecom, 2005.

[52] Yihua He, Georgos Siganos, Michalis Faloutsos, and Srikanth Krishnamurthy. A systematic framework for unearthing the missing links. In NSDI, 2007.

[53] Urs Hengartner, Sue Moon, Richard Mortier, and Christophe Diot. Detection and analysis of routing loops in packet traces. In IMW, 2002.

[54] Yiyi Huang, Nick Feamster, Anukool Lakhina, and Jim (Jun) Xu. Diagnosing network disruptions with network-wide analysis. In SIGMETRICS, 2007.

[55] Young Hyun. Archipelago update and analyses. In ISMA AIMS, 2009.

[56] Internet address hitlist dataset, PREDICT ID USC LANDER internet address hitlist it28w-beta20090914. http://www.isi.edu/ant/lander.

[57] Internet World Stats. http://www.internetworldstats.com/stats.htm, viewed January 3, 2012.

[58] iPlane. http://iplane.cs.washington.edu.

[59] Van Jacobson. Traceroute. ftp://ftp.ee.lbl.gov/traceroute.tar.gz.

[60] John P. John, Ethan Katz-Bassett, Arvind Krishnamurthy, Thomas Anderson, and Arun Venkataramani. Consensus routing: The Internet as a distributed system. In NSDI, 2008.

[61] Ethan Katz-Bassett, John P. John, Arvind Krishnamurthy, David Wetherall, Thomas Anderson, and Yatin Chawathe. Towards IP geolocation using delay and topology measurements. In IMC, 2006.

[62] Ken Keys, Young Hyun, Matthew Luckie, and kc claffy. Internet-scale IPv4 alias resolution with MIDAR: System architecture. Technical report, Cooperative Association for Internet Data Analysis (CAIDA), 2011. Under preparation.

[63] Ramana Kompella, Jennifer Yates, Albert Greenberg, and Alex Snoeren. Detection and localization of network black holes. In INFOCOM, 2007.

[64] Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, and Alex C. Snoeren. IP fault localization via risk modeling. In NSDI, 2005.

[65] Balachander Krishnamurthy, Harsha V. Madhyastha, and Oliver Spatscheck. ATMEN: A triggered network measurement infrastructure. In WWW, 2005.

[66] Rupa Krishnan, Harsha V. Madhyastha, Sridhar Srinivasan, Sushant Jain, Arvind Krishnamurthy, Thomas Anderson, and Jie Gao. Moving beyond end-to-end path information to optimize CDN performance. In IMC, 2009.

[67] D. Richard Kuhn. Sources of failure in the public switched telephone network. IEEE Computer, 1997.

[68] Nate Kushman, Srikanth Kandula, and Dina Katabi. R-BGP: Staying connected in a connected world. In NSDI, 2007.

[69] Craig Labovitz, Abha Ahuja, Abhijit Bose, and Farnam Jahanian. Delayed Internet routing convergence. In SIGCOMM, 2000.

[70] Craig Labovitz, Abha Ahuja, and Farnam Jahanian. Experimental study of Internet stability and wide-area network failures. In FTCS, 1999.

[71] Karthik Kalambur Lakshminarayanan, Matthew Chapman Caesar, Murali Rangan, Thomas Anderson, Scott Shenker, and Ion Stoica. Achieving convergence-free routing using failure-carrying packets. In SIGCOMM, 2007.

[72] Anthony Lambert, Mickael Meulle, and Jean-Luc Lutton. Revisiting interdomain root cause analysis from multiple vantage points. In NANOG, 2007.

[73] Greg Linden. Make data useful. http://sites.google.com/site/glinden/Home/StanfordDataMining.2006-11-28.ppt, 2006.

[74] Matthew Luckie, Young Hyun, and Bradley Huffaker. Traceroute probe method and forward IP path inference. In IMC, 2008.

[75] Matthew Luckie and Tony McGregor. IPMP: IP measurement protocol. In PAM, 2002.

[76] Harsha V. Madhyastha. An Information Plane for Internet Applications. PhD thesis, University of Washington, 2008.

[77] Harsha V. Madhyastha, Thomas Anderson, Arvind Krishnamurthy, Neil Spring, and Arun Venkataramani. A structural approach to latency prediction. In IMC, 2006.

[78] Harsha V. Madhyastha, Tomas Isdal, Michael Piatek, Colin Dixon, Thomas Anderson, Arvind Krishnamurthy, and Arun Venkataramani. iPlane: An information plane for distributed services. In OSDI, 2006.

[79] Harsha V. Madhyastha, Ethan Katz-Bassett, Thomas Anderson, Arvind Krishnamurthy, and Arun Venkataramani. iPlane Nano: Path prediction for peer-to-peer applications. In NSDI, 2009.

[80] Ratul Mahajan, Neil Spring, David Wetherall, and Thomas Anderson. User-level Internet path diagnosis. In SOSP, 2003.

[81] Ratul Mahajan, Neil Spring, David Wetherall, and Tom Anderson. Inferring link weights using end-to-end measurements. In IMW, 2002.

[82] Ratul Mahajan, David Wetherall, and Thomas Anderson. Understanding BGP misconfiguration. In SIGCOMM, 2002.

[83] Ratul Mahajan, Ming Zhang, Lindsey Poole, and Vivek Pai. Uncovering performance differences among backbone ISPs with Netdiff. In NSDI, 2008.

[84] Z. Morley Mao, Randy Bush, Timothy G. Griffin, and Matthew Roughan. BGP beacons. In IMC, 2003.

[85] Measurement Lab website. http://www.measurementlab.net/.

[86] Pascal Merindol, Virginie Van den Schrieck, Benoit Donnet, Olivier Bonaventure, and Jean-Jacques Pansiot. Quantifying ASes multiconnectivity using multicast information. In IMC, 2009.

[87] David Meyer. RouteViews. http://www.routeviews.org.

[88] Pradosh Mohapatra, John Scudder, David Ward, Randy Bush, and Rob Austein. BGP prefix origin validation. http://tools.ietf.org/html/draft-ietf-sidr-pfx-validate.

[89] Christopher Monsanto, Nate Foster, Rob Harrison, and David Walker. A compiler and runtime system for network programming languages. In POPL, 2012.

[90] Mike Muuss. Ping. http://ftp.arl.army.mil/pub/ping.shar.

[91] North American Network Operators Group mailing list. http://www.merit.edu/mail.archives/nanog/.

[92] NANOG 45, 2009.

[93] Ricardo Oliveira, Dan Pei, Walter Willinger, Beichuan Zhang, and Lixia Zhang. In search of the elusive ground truth: The Internet’s AS-level connectivity structure. In SIGMETRICS, 2008.

[94] Outages mailing list. http://isotf.org/mailman/listinfo/outages.

[95] Jean-Jacques Pansiot and Dominique Grad. On routes and multicast trees in the Internet. SIGCOMM CCR, 1998.

[96] Abhinav Pathak, Himabindu Pucha, Ying Zhang, Z. Morley Mao, and Y. Charlie Hu. A measurement study of Internet delay asymmetry. In PAM, 2008.

[97] Vern Paxson. End-to-end routing behavior in the Internet. IEEE/ACM TON, 1997.

[98] Vern Paxson, Jamshid Mahdavi, Andrew Adams, and Matt Mathis. An architecture for large-scale Internet measurement. IEEE Communications, 36, 1998.

[99] Packet Clearing House. http://www.pch.net/home/index.php.

[100] Personal communication with Brice Augustin of Laboratoire d’informatique de Paris 6.

[101] PlanetLab website. http://www.planet-lab.org.

[102] Francesco Lo Presti, N. G. Duffield, Joe Horowitz, and Don Towsley. Multicast-based inference of network-internal delay distributions. IEEE/ACM TON, 2002.

[103] Bruno Quoitin and Olivier Bonaventure. A survey of the utilization of the BGP community attribute. Internet draft, draft-quoitin-bgp-comm-survey-00, 2002.

[104] RIPE 58, 2009.

[105] RIPE Routing Information Service (RIS). http://www.ripe.net/ris/.

[106] Stefan Savage, Andy Collins, Eric Hoffman, John Snell, and Thomas Anderson. The end-to-end effects of Internet path selection. In SIGCOMM, 1999.

[107] Pekka Savola. Experiences from using unicast RPF. IETF Internet draft, 2008.

[108] Colin Scott. LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically. Technical report, Univ. of Washington, 2012.

[109] Aman Shaikh and Albert Greenberg. OSPF monitoring: Architecture, design and deployment experience. In NSDI, 2004.

[110] Stanislav Shalunov, Benjamin Teitelbaum, Anatoly Karp, Jeff W. Boote, and Matthew J. Zekauskas. A one-way active measurement protocol (OWAMP). RFC 4656 (Proposed Standard), 2006.

[111] Yuval Shavitt and Eran Shir. DIMES: Let the Internet measure itself. SIGCOMM CCR, 2005.

[112] Yuval Shavitt, Xiaodong Sun, Avishai Wool, and Bulent Yener. Computing the unmeasured: An algebraic approach to Internet mapping. IEEE Journal on Selected Areas in Communications, 22(1), 2004.

[113] Yuval Shavitt and Udi Weinsberg. Quantifying the importance of vantage points distribution in Internet topology measurements. In INFOCOM, 2009.

[114] Justine Sherry, Ethan Katz-Bassett, Mary Pimenova, Harsha V. Madhyastha, Thomas Anderson, and Arvind Krishnamurthy. Resolving IP aliases with prespecified timestamps. In IMC, 2010.

[115] Rob Sherwood, Adam Bender, and Neil Spring. DisCarte: A disjunctive Internet cartographer. In SIGCOMM, 2008.

[116] Rob Sherwood and Neil Spring. Touring the Internet in a TCP sidecar. In IMC, 2006.

[117] The Spoofer project. http://spoofer.csail.mit.edu/.

[118] Neil Spring, Ratul Mahajan, and Thomas Anderson. Quantifying the causes of path inflation. In SIGCOMM, 2003.

[119] Neil Spring, Ratul Mahajan, and David Wetherall. Measuring ISP topologies with Rocketfuel. In SIGCOMM, 2002.

[120] Neil Spring, David Wetherall, and Thomas Anderson. Scriptroute: A public Internet measurement facility. In USITS, 2003.

[121] Sprint IP network performance. https://www.sprint.net/performance/.

[122] Richard A. Steenbergen. A practical guide to (correctly) troubleshooting with traceroute. In NANOG 45, 2009. http://www.nanog.org/meetings/nanog45/presentations/Sunday/RAS_traceroute_N45.pdf.

[123] Stoyan Stefanov. YSlow 2.0. In CSDN SD2C, 2008.

[124] Lakshminarayanan Subramanian, Sharad Agarwal, Jennifer Rexford, and Randy H. Katz. Characterizing the Internet hierarchy from multiple vantage points. In INFOCOM, 2002.

[125] Renata Teixeira, Keith Marzullo, Stefan Savage, and Geoffrey M. Voelker. In search of path diversity in ISP networks. In IMC, 2003.

[126] Renata Teixeira and Jennifer Rexford. A measurement framework for pinpointing routing changes. In NeT '04: Proceedings of the ACM SIGCOMM workshop on Network troubleshooting, 2004.

[127] Yolanda Tsang, Mehmet Yildiz, Paul Barford, and Robert Nowak. Network radar: Tomography from round trip time measurements. In IMC, 2004.

[128] UCLA Internet topology collection. http://irl.cs.ucla.edu/topology/.

[129] Feng Wang, Nick Feamster, and Lixin Gao. Quantifying the effects of routing dynamics on end-to-end Internet path failures. Technical report, Univ. of Massachusetts, Amherst, 2005.

[130] Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. A measurement study on the impact of routing events on end-to-end Internet path performance. In SIGCOMM, 2006.

[131] Bernard Wong, Ivan Stoyanov, and Emin Gün Sirer. Octant: A comprehensive framework for the geolocalization of Internet hosts. In NSDI, 2007.

[132] Jian Wu, Zhuoqing Morley Mao, Jennifer Rexford, and Jia Wang. Finding a needle in a haystack: Pinpointing significant BGP routing changes in an IP network. In NSDI, 2005.

[133] Wen Xu and Jennifer Rexford. MIRO: Multi-path Interdomain ROuting. In SIGCOMM, 2006.

[134] Bin Yao, Ramesh Viswanathan, Fangzhe Chang, and Daniel Waddington. Topology inference in the presence of anonymous routers. In INFOCOM, 2003.

[135] Jennifer Yates and Zihui Ge. Network Management: Fault Management, Performance Management and Planned Maintenance. Technical report, AT&T Labs, 2009.

[136] Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, and Randy Wang. PlanetSeer: Internet path failure monitoring and characterization in wide-area services. In OSDI, 2004.

[137] Yin Zhang, Vern Paxson, and Scott Shenker. The stationarity of Internet path properties: Routing, loss, and throughput. ACIRI Technical Report, 2000.

[138] Ying Zhang, Z. Morley Mao, and Ming Zhang. Effective diagnosis of routing disruptions from end systems. In NSDI, 2008.

[139] Zheng Zhang, Ming Zhang, Albert Greenberg, Y. Charlie Hu, Ratul Mahajan, and Blaine Christian. Optimizing cost and performance in online service provider networks. In NSDI, 2010.

[140] Zheng Zhang, Ying Zhang, Y. Charlie Hu, Z. Morley Mao, and Randy Bush. iSpy: detecting IP prefix hijacking on my own. In SIGCOMM, 2008.

[141] Xiaoliang Zhao, Dan Pei, Lan Wang, Dan Massey, Allison Mankin, S. Felix Wu, and Lixia Zhang. An analysis of BGP multiple origin AS (MOAS) conflicts. In IMW, 2001.

VITA

Ethan Katz-Bassett was born in Northampton, Massachusetts. He received his Bachelor of Arts degree in Computer Science and Mathematics from Williams College in 2001 and his Master of Science degree in Computer Science and Engineering from the University of Washington in 2006. In 2012, he graduated with a Doctor of Philosophy in Computer Science and Engineering from the University of Washington. His research interests broadly include building networked systems and understanding the behavior and properties of such systems, especially the Internet.