Identifying Important Internet Outages
Total Page:16
File Type:pdf, Size:1020Kb
1 Identifying Important Internet Outages Ryan Bogutz1, Yuri Pradkin2, and John Heidemann2 1The College of New Jersey; Ewing, NJ 2USC/Information Sciences Institute; Marina del Rey, CA Abstract—Today, outage detection systems can track outages is easy to miss events lasting a short time, and it is time- across the whole IPv4 Internet—millions of networks. However, consuming to play out days of data. Even with direct access it becomes difficult to find meaningful, interesting events in this to the data in a database, queries are hard to formulate and it huge dataset, since three months of data can easily include 41 billion observations and millions of outage events. We propose is not clear what to look for. an outage reporting system that sifts through this data to find the Our contribution is twofold: we explore metrics that most interesting events. We explore multiple metrics to evaluate identify important events in this voluminous data, finding that “interesting”, reflecting the size and severity of outages. We show the product of event size and severity does a good job of that defining interest as the product of size by severity works identifying “interesting” events. Second, we use this metric to well, avoiding degenerate cases like complete outages affecting a few people, and apparently large outages that affect only a small provide a daily report of the ten “most interesting” events. fraction of people in an area. We have integrated outage reporting We use this tool to explore a week of outage data. Our tool into our existing public website (https://outage.ant.isi.edu) with is available on the public Internet and is integrated into the the goal of making near-real-time outage information accessible USC/ISI outage tool at https://outage.ant.isi.edu as a sidebar. to the general public. Such data can help answer questions like Our goal is to make outage data accessible to the general “what are the most significant outages today?”, “did Florida have major problems in an ongoing hurricane?”, and “are there power public. outages in Venezuela?”. II. BACKGROUND AND RELATED WORK I. INTRODUCTION Prior outage studies have focused on methods to detect outages, but relatively little work has considered analysis and Network outages, while rare, have tangible consequences visualization. Although in our work we used data specifically in our increasingly connected world. With the advent of the from Trinocular outage detection system and integrated our smart home, Internet-of-things (IoT) devices, mobile phones reports with Trinocular outage maps, we believe our method- and tablets, as well as voice-over-IP (VoIP) systems, citizens ology and metrics can be adapted by other outage detection are increasingly reliant on the Internet. As a result, the loss of systems that aggregate concurrent outages. Next, we describe Internet connectivity can disrupt our daily lives. For instance, some outage detection systems and prior work on visualizing a recent routing problem between a customer of Verizon and outages. Cloudflare interfered with millions of users’ access to popular sites like Google, Discord, and Amazon [16]. Outages also A. Trinocular Outage Detection occur as the result of political interference, such as in the Trinocular is an outage detection system that uses Egyptian revolution of 2011 [6]. Finally, we have shown that Bayesian inference to minimize probing traffic while reliably network outages reflect the impact of hurricanes and natural detecting outages across millions of edge networks [12]. disasters [9], [12]—outages in the Internet can be used to infer Trinocular monitors /24 IPv4 network prefixes (each a set of the extent of problems in the physical world. 256 adjacent IP addresses) or address blocks, observing each Several different systems today track Internet outages, network once every 11 minutes (a “Trinocular round”). Early both globally with active probing [12], globally with passive Trinocular processing was batched, with results generated once observations [8]. Some are specifically targeted at weather a quarter. In 2019 we deployed Near-Real-Time Trinocular events with active probing [14]. This data can be of use to providing preliminary results within two hours of an outage. the general public, to understand natural disasters or problems We continue to update the data quarterly with more complete, with their network; to scientists and policymakers, to improve batch-processed analysis. our Internet; and to network operators, to diagnose problems Trinocular is generally accurate; it has been found to in networks and plan network improvement. USC makes all detect 100% of outages lasting longer than one round [12]. their outage data available for research use [1] and operates a Recent work has used analysis of CDN data to confirm that Google-maps-style website with a global view and time travel most blocks are correct, but shown that some blocks report (see https://outage.ant.isi.edu[2]). many false outages [13]. We have traced this problem to sparse However, it is hard to make sense of what outages are blocks and shown improvements to address this problem [3]. important, with gigabytes of data collected over multiple years, We have visualized Trinocular data in an interactive web- and an entire world to look at with hundreds of dots indicating site since 2017 [2]. For this visualization, each address block potential outages each minute. When browsing the website, it is geolocated (with the MaxMind GeoLite City database), 2 then mapped to a grid cell defined by a 0.5-degree lati- granularity: country, region, and autonomous system (AS), and tude/longitude region, or 1 or 2-degree cells when zooming it also allows time travel and some types of queries. Their out. The visualization shows the number of networks that are dashboard highlights events based on “Alert Area”, a factor out in each cell by the area of a circle, while the fraction that considers the size of the change by the outage duration of networks that are out is encoded in the color, ranging from and now many detection methods see it. In comparison, our blue to white to red. This website provides the usual interactive website complements this work by providing a more targeted map web features (zoom, pan). It supports time travel to any geolocation (grid cells of latitude/longitude rather than country date for which we have data, and can play back the data with or region), and uses different metrics to highlight what is an animation to show evolution of outages over time. This considered important. We do not use area alerts because our visualization thus provides a calibrated, interactive global map outages consist of many blocks which often have different start of Internet outages, with updates being posted in near-real- and end times. Future work may evaluate our metrics against time. their groupings, and compare their alert area over our data. This outage detection system and website encompasses III. METHOD an enormous amount of data. Trinocular scans millions of A. Problem Statement and System Overview networks (from 3.5M in 2014 to more than 4.2M in 2019), and each quarter generates more than 41B observations. Un- Our goal is to find important outages. To reach this goal, fortunately, thousands of small outages happen all the time, we need to understand what makes an outage an important and with observation noise, it is easy to lose large events in a one. We seek to define a metric to reflect the importance of background of persistent, small outages. each outage, allowing us to prioritize more important ones Our goal is to address this gap, providing reports that over less important. identify key outage events in this mass of data, and to We consider outages important when they affect many integrate these reports into the existing website. The result is people in a noticeable way. We consider three factors as part to empower users to focus their attention on important events of interest: size, severity, and change. The size of the outage is rather than searching for them. how many people are affected. The more people are affected, the bigger is the problem. The severity is defined as the B. Related Outage Detection Systems fraction of networks in the area that have problems. When A number of other systems also report outage data. everyone in an area loses Internet access, this may indicate Schulman and Spring’s Thunderping [14] was developed that something significant is going on, such as a complete concurrently with our system, and focuses on outages of power outage or devastation by a hurricane, whereas if only residential networks that occur due to weather events. They some networks are affected, it could mean only one ISP may have visualized their results with videos, but they do not (to be experiencing problems. Finally, rate of change is important our knowledge) support interactive visualization. to consider. Outages that last for hours or days (perhaps due CAIDA has detected outages in passive observations to the time required to physically restore downed utility lines) through their network telescope [7]. The algorithms behind are important to highlight when they occur, but less important their approach have recently been formalized as Chocola- a day later. Measurement error or shifts in ISP use sometimes tine [8]. They visualize the results of this system in IODA, also result in outages that persist for days or weeks. We described below. consider such events of lesser interest. The website Downdetector uses crowd-sourced data to Our system generates outage reports (xIII-B) that high- provide some information about network and service out- light important outages for a given day based on a specific ages [10]. Unfortunately their exact methodology is propri- metric to rank all outages on that day (xIII-C).