Scouts: Improving the Diagnosis Process Through Domain-Customized Incident Routing
Total Page:16
File Type:pdf, Size:1020Kb
Scouts: Improving the Diagnosis Process Through Domain-customized Incident Routing Jiaqi Gao¢, Nofel Yaseen⋄, Robert MacDavid∗, Felipe Vieira Frujeri◦, Vincent Liu⋄, Ricardo Bianchini◦ Ramaswamy Adityax, Xiaohang Wangx, Henry Leex, David Maltzx, Minlan Yu¢, Behnaz Arzani◦ ¢Harvard University ⋄University of Pennsylvania ∗Princeton University ◦Microsoft Research xMicrosoft ABSTRACT service-level objectives. When incidents are mis-routed (sent to the Incident routing is critical for maintaining service level objectives wrong team), their time-to-diagnosis can increase by 10× [21]. in the cloud: the time-to-diagnosis can increase by 10× due to mis- A handful of well-known teams that underpin other services routings. tend to bear the brunt of this effect. The physical networking team Properly routing incidents is challenging because of the complex- in our large cloud, for instance, is a recipient in 1 in every 10 mis- ity of today’s data center (DC) applications and their dependencies. routed incidents (see §3). In comparison, the hundreds of other For instance, an application running on a VM might rely on a possible teams typically receive 1 in 100 to 1 in 1000. These findings functioning host-server, remote-storage service, and virtual and are common across the industry (see Appendix A). physical network components. It is hard for any one team, rule- Incident routing remains challenging because modern DC ap- based system, or even machine learning solution to fully learn the plications are large, complex, and distributed systems that rely complexity and solve the incident routing problem. We propose a on many sub-systems and components. Applications’ connections different approach using per-team Scouts. Each teams’ Scout actsas to users, for example, might cross the DC network and multiple its gate-keeper — it routes relevant incidents to the team and routes- ISPs, traversing firewalls and load balancers along the way. Anyof away unrelated ones. We solve the problem through a collection these components may be responsible for connectivity issues. The of these Scouts. Our PhyNet Scout alone — currently deployed in internal architectures and the relationships between these compo- production — reduces the time-to-mitigation of 65% of mis-routed nents may change over time. In the end, we find that the traditional incidents in our dataset. method of relying on humans and human-created rules to route incidents is inefficient, time-consuming, and error-prone. CCS CONCEPTS Instead, we seek a tool that can automatically analyze these complex relationships and route incidents to the team that is most • Computing methodologies → Machine learning; • Networks likely responsible; we note that machine learning (ML) is a potential → Data center networks; match for this classification task. In principle, a single, well-trained KEYWORDS ML model could process the massive amount of data available from operators’ monitoring systems—too vast and diverse for humans— Data center networks; Machine learning; Diagnosis to arrive at an informed prediction. Similar techniques have found ACM Reference Format: success in more limited contexts (e.g., specific problems and/or Jiaqi Gao, Nofel Yaseen, Robert MacDavid, Felipe Vieira Frujeri, Vincent applications) [11, 15, 22, 25, 73]. Unfortunately, we quickly found Liu, RicardoBianchini, Ramaswamy Aditya, Xiaohang Wang, Henry Lee, operationalizing this monolithic ML model comes with fundamental David Maltz, Minlan Yu,Behnaz Arzani. 2020. Scouts: Improving the Diag- technical and practical challenges: nosis Process Through Domain-customized Incident Routing. In Annual A constantly changing set of incidents, components, and monitoring conference of the ACM Special Interest Group on Data Communication on the data: As the root causes of incidents are addressed and components applications, technologies, architectures, and protocols for computer communi- cation (SIGCOMM ’20), August 10–14, 2020, Virtual Event, NY, USA. ACM, evolve over time, both the inputs and the outputs of the model are New York, NY, USA, 17 pages. https://doi.org/10.1145/3387514.3405867 constantly in flux. When incidents change, we are often left without enough training data and when components change, we potentially 1 INTRODUCTION need to retrain across the entire fleet. For cloud providers, incident routing — taking an issue that is too Curse of dimensionality: A monolithic incident router needs to in- complex for automated techniques and assigning it to a team of clude monitoring data from all teams. This large resulting feature engineers — is a critical bottleneck to maintaining availability and vector leads to “the curse of dimensionality” [4]. The typical solu- tion of increasing the number of training examples in proportion to Permission to make digital or hard copies of all or part of this work for personal or the number of features is not possible in a domain where examples classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation (incidents) are already relatively rare events. on the first page. Copyrights for components of this work owned by others than ACM Uneven instrumentation: A subset of teams will always have gaps must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a in monitoring, either because the team has introduced new compo- fee. Request permissions from [email protected]. nents and analytics have not caught up, or because measuring is just SIGCOMM ’20 hard, e.g., in active measurements where accuracy and overhead © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7955-7/20/08...$15.00 are in direct contention [34]. https://doi.org/10.1145/3387514.3405867 SIGCOMM ’20, August 10–14, 2020, Virtual Event, NY, USA Gao et al. Limited visibility into every team: For the same reasons that it is 3) The design of a Scout for Microsoft Azure’s physical networking difficult for teams to have expertise in all surrounding components, team accompanied by a framework to enable its evolution as the it is difficult for us to understand the appropriate feature setsfrom team’s monitoring systems, incidents, and responsibilities change. each and every team. 4) A thorough evaluation of the deployed PhyNet Scout and analysis Rather than building a single, monolithic predictor, we argue a of incidents in our cloud from the past year and a discussion of the piecewise solution based on a collection of (strategically assigned) challenges the Scout encountered in practice. per-team predictors, a.k.a. Scouts, is more useful. Scouts are low- This paper is the first to propose a decomposed solution tothe overhead, low-latency, and high-accuracy tools that predict, for a incident routing problem. We take the first step in demonstrating given team, whether the team should be involved. They are built by such a solution can be effective by building a Scout for the PhyNet the team to which they apply and are amenable to partial deploy- team of Microsoft Azure. This team was one of the teams most ment. Scouts address the above challenges: they only need to adapt heavily impacted by the incident routing problem. As such, it was a to changes to their team and its components (instead of all changes), good first candidate to demonstrate the benefits Scouts can provide; they operate over a more limited feature set (no longer suffer the we leave the detailed design of other teams’ Scouts for future work. curse of dimensionality), they limit the need for understanding the internals of every team (they only need to encode information about the team they are designed for and its local dependencies), 2 BACKGROUND: INCIDENT ROUTING and only require local instrumentation. Scouts can utilize a hybrid Incidents constitute unintended behavior that can potentially im- of supervised and unsupervised models to account for changes to pact service availability and performance. Incidents are reported incidents (see §5) and can provide explanations as to why they de- by customers, automated watchdogs, or discovered and reported cided the team is (not) responsible. Operators can be strategic about manually by operators. which Scouts they need: they can build Scouts for teams (such as Incident routing is the process through which operators decide our physical networking team) that are inordinately affected by which team should investigate an incident. In this context, we mis-routings. Given a set of Scouts, operators can incrementally use team to broadly refer to both internal teams in the cloud and compose them, either through a global routing system or through external organizations such as ISPs. Today, operators use run-books, the existing manual process. past-experience, and a natural language processing (NLP)-based We designed, implemented, and deployed a Scout for the physical recommendation system (see §7), to route incidents. Specifically, networking team of a large cloud.1 We focus on this team as, from incidents are created and routed using a few methods: our study of our cloud and other operators, we find the network — 1) By automated watchdogs that run inside the DC and monitor and specifically the physical network — suffers inordinately from the health of its different components. When a watchdog uncovers mis-routing (see §3). This team exhibits all of the challenges of a problem it follows a built-in set of rules to determine where it Scout construction: diverse, dirty datasets; complex dependencies should send the incident. inside and outside the provider; many reliant services; and frequent 2) As Customer Reported Incidents (CRIs) which go directly to a changes. As the team evolves, the framework we developed adapts 24 × 7 support team that uses past experience and a number of automatically and without expert intervention through the use of specialized tools to determine where to send the incident. If the meta-learning techniques [46]. cause is an external problem, the team contacts the organization These same techniques can be used to develop new “starter” responsible.