Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices

Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou
Cornell University
[email protected]

Abstract

Performance unpredictability is a major roadblock towards cloud adoption, and has performance, cost, and revenue ramifications. Predictable performance is even more critical as cloud services transition from monolithic designs to microservices. Detecting QoS violations after they occur in systems with microservices results in long recovery times, as hotspots propagate and amplify across dependent services. We present Seer, an online cloud performance debugging system that leverages deep learning and the massive amount of tracing data cloud systems collect to learn spatial and temporal patterns that translate to QoS violations. Seer combines lightweight distributed RPC-level tracing with detailed low-level hardware monitoring to signal an upcoming QoS violation, and diagnose the source of unpredictable performance. Once an imminent QoS violation is detected, Seer notifies the cluster manager to take action to avoid performance degradation altogether. We evaluate Seer both in local clusters, and in large-scale deployments of end-to-end applications built with microservices with hundreds of users. We show that Seer correctly anticipates QoS violations 91% of the time, and avoids the QoS violation to begin with in 84% of cases. Finally, we show that Seer can identify application-level design bugs, and provide insights on how to better architect microservices to achieve predictable performance.

CCS Concepts: • Computer systems organization → Cloud computing; Availability; • Computing methodologies → Neural networks; • Software and its engineering → Scheduling.

Keywords: cloud computing, datacenter, performance debugging, QoS, deep learning, data mining, tracing, monitoring, microservices, resource management

ACM Reference Format:
Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou. 2019. Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices. In 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS '19), April 13–17, 2019, Providence, RI, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3297858.3304004

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ASPLOS '19, April 13–17, 2019, Providence, RI, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6240-5/19/04...$15.00
https://doi.org/10.1145/3297858.3304004

1 Introduction

Cloud computing services are governed by strict quality of service (QoS) constraints in terms of throughput, and more critically tail latency [14, 28, 31, 34]. Violating these requirements worsens the end user experience, leads to loss of availability and reliability, and has severe revenue implications [13, 14, 28, 32, 35, 36]. In an effort to meet these performance constraints and facilitate frequent application updates, cloud services have recently undergone a major shift from complex monolithic designs, which encompass the entire functionality in a single binary, to graphs of hundreds of loosely-coupled, single-concerned microservices [9, 45]. Microservices are appealing for several reasons, including accelerating development and deployment, simplifying correctness debugging, as errors can be isolated in specific tiers, and enabling a rich software ecosystem, as each microservice is written in the language or programming framework that best suits its needs.

At the same time microservices signal a fundamental departure from the way traditional cloud applications were designed, and bring with them several system challenges. Specifically, even though the quality-of-service (QoS)

requirements of the end-to-end application are similar for microservices and monoliths, the tail latency required for each individual microservice is much stricter than for traditional cloud applications [34, 43–45, 61, 62, 64, 67, 70, 77]. This puts increased pressure on delivering predictable performance, as dependencies between microservices mean that a single misbehaving microservice can cause cascading QoS violations across the system.

Fig. 1 shows three instances of real large-scale production deployments of microservices [2, 9, 11]. The perimeter of the circle (or sphere surface) shows the different microservices, and edges show dependencies between them. We also show these dependencies for Social Network, one of the large-scale services used in the evaluation of this work (see Sec. 3).

Figure 1. Microservices graphs in three large cloud providers (Netflix, Twitter, Amazon) [2, 9, 11], and our Social Network service.

Unfortunately, the complexity of modern cloud services means that manually determining the impact of each pair-wise dependency on end-to-end QoS, or relying on the user to provide this information, is impractical.

Apart from software heterogeneity, datacenter hardware is also becoming increasingly heterogeneous as special-purpose architectures [20–22, 39, 55] and FPGAs are used to accelerate critical operations [19, 25, 40, 75]. This adds to the existing server heterogeneity in the cloud, where servers are progressively replaced and upgraded over the datacenter's provisioned lifetime [31, 33, 65, 68, 95], and further complicates the effort to guarantee predictable performance.

The need for performance predictability has prompted a long line of work on performance tracing, monitoring, and debugging systems [24, 42, 46, 78, 85, 93, 97]. Systems like Dapper and GWP, for example, rely on distributed tracing (often at RPC level) and low-level hardware event monitoring respectively to detect performance abnormalities, while the Mystery Machine [24] leverages the large amount of logged data to extract the causal relationships between requests. Even though such tracing systems help cloud providers detect QoS violations and apply corrective actions to restore performance, until those actions take effect, performance suffers. For monolithic services this primarily affects the service experiencing the QoS violation itself, and potentially services it is sharing physical resources with. With microservices, however, a posteriori QoS violation detection is more impactful, as hotspots propagate and amplify across dependent services, forcing the system to operate in a degraded state for longer, until all oversubscribed tiers have been relieved, and all accumulated queues have drained.

Figure 2. The performance impact of a posteriori performance diagnostics for a monolith and for microservices.

Fig. 2a shows the impact of reacting to a QoS violation after it occurs for the Social Network application with several hundred users running on 20 two-socket, high-end servers. Even though the scheduler scales out all oversubscribed tiers once the violation occurs, it takes several seconds for the service to return to nominal operation. There are two reasons for this: first, by the time one tier has been upsized, its neighboring tiers have built up request backlogs, which cause them to saturate in turn. Second, utilization is not always a good proxy for tail latency and/or QoS violations [14, 28, 61, 62, 73]. Fig. 2b shows the utilization of all microservices, ordered from the back-end to the front-end, over time, and Fig. 2c shows their corresponding 99th percentile latencies normalized to nominal operation. Although there are cases where high utilization and high latency match, the effect of hotspots propagating through the service is much more pronounced when looking at latencies, with the back-end tiers progressively saturating the service's logic and front-end microservices. In contrast, there are highly-utilized microservices that do not experience increases in their tail latency. A common way to address such QoS violations is rate limiting [86], which constrains the incoming load until hotspots dissipate. This restores performance, but degrades the end user's experience, as a fraction of input requests is dropped.

We present Seer, a proactive cloud performance debugging system that leverages practical deep learning techniques to diagnose upcoming QoS violations in a scalable and online manner. First, Seer is proactive, to avoid the long recovery periods of a posteriori QoS violation detection. Second, it uses the massive amount of tracing data cloud systems collect over time to learn spatial and temporal patterns that lead to QoS violations early enough to avoid them altogether. Seer includes a lightweight, distributed RPC-level tracing system, based on Apache Thrift's timing interface [1], to collect end-to-end traces of request execution, and track per-microservice outstanding requests. Seer uses these traces to train a deep neural network to recognize imminent QoS violations, and identify the microservice(s) that initiated the performance degradation. Once Seer identifies the culprit of a QoS violation that will occur over the next few 100s of milliseconds, it uses detailed per-node hardware monitoring to determine the reason behind the degraded performance, and provide the cluster scheduler with recommendations on actions required to avoid it.

We evaluate Seer both in our local cluster of 20 two-socket servers, and on large-scale clusters on Google Compute Engine (GCE) with a set of end-to-end interactive applications built with microservices, including the Social Network above. In our local cluster, Seer correctly identifies upcoming QoS violations in 93% of cases, and correctly pinpoints the microservice initiating the violation 89% of the time. To combat long inference times as clusters scale, we offload the DNN training and inference to Google's Tensor Processing Units (TPUs) when running on GCE [55]. We additionally experiment with using FPGAs in Seer via Project Brainwave [25] when running on Windows Azure, and show that both types of acceleration speed up Seer by 200-235x, with the TPU helping the most during training, and vice versa for inference. Accuracy is consistent with the small cluster results.

Finally, we deploy Seer in a large-scale installation of the Social Network service with several hundred users, and show that it not only correctly identifies 90.6% of upcoming QoS violations and avoids 84% of them, but that detecting patterns that create hotspots helps the application's developers improve the service design, resulting in a decreasing number of QoS violations over time. As cloud application and hardware complexity continues to grow, data-driven systems like Seer can offer practical solutions for systems whose scale makes empirical approaches intractable.

2 Related Work

Performance unpredictability is a well-studied problem in public clouds that stems from platform heterogeneity, resource interference, software bugs, and load variation [23, 32, 34, 35, 38, 53, 58, 61–63, 72, 76, 81]. We now review related work on reducing performance unpredictability in cloud systems, including through scheduling and cluster management, or through online tracing systems.

Cloud management: The prevalence of cloud computing has motivated several cluster management designs. Systems like Mesos [51], Torque [87], Tarcil [38], and Omega [82] all target the problem of resource allocation in large, multi-tenant clusters. Mesos is a two-level scheduler. It has a central coordinator that makes resource offers to application frameworks, and each framework has an individual scheduler that handles its assigned resources. Omega, on the other hand, follows a shared-state approach, where multiple concurrent schedulers can view the whole cluster state, with conflicts being resolved through a transactional mechanism [82]. Tarcil leverages information on the type of resources applications need to employ a sampling-based distributed scheduler that returns high quality resources within a few milliseconds [38]. Dejavu identifies a few workload classes and reuses previous allocations for each class, to minimize reallocation overheads [90]. CloudScale [83], PRESS [48], AGILE [70] and the work by Gmach et al. [47] predict future resource needs online, often without a priori knowledge. Finally, auto-scaling systems, such as Rightscale [79], automatically scale the number of physical or virtual instances used by webserving workloads, to accommodate changes in user load.

A second line of work tries to identify resources that will allow a new, potentially-unknown application to meet its performance (throughput or tail latency) requirements [29, 31, 32, 34, 66, 68, 95]. Paragon uses classification to determine the impact of platform heterogeneity and workload interference on an unknown, incoming workload [30, 31]. It then uses this information to achieve predictable performance, and high cluster utilization. Paragon assumes that the cluster manager has full control over all resources, which is often not the case in public clouds. Quasar extends the use of data mining in cluster management by additionally determining the appropriate amount of resources for a new application. Nathuji et al. developed a feedback-based scheme that tunes resource assignments to mitigate memory interference [69]. Yang et al. developed an online scheme that detects memory pressure and finds colocations that avoid interference on latency-sensitive workloads [95]. Similarly, DeepDive detects and manages interference between co-scheduled workloads in a VM environment [71]. Finally, CPI2 [98] throttles low-priority workloads that introduce destructive interference to important, latency-critical services, using low-level metrics of performance collected through Google-Wide Profiling (GWP). In terms of managing platform heterogeneity, Nathuji et al. [68] and Mars et al. [64] quantified its impact on conventional benchmarks and Google services, and designed schemes to predict the most appropriate servers for a workload.

Cloud tracing & diagnostics: There is extensive related work on monitoring systems that has shown that execution traces can help diagnose performance, efficiency, and even security problems in large-scale systems [12, 24, 26, 42, 54, 77, 85, 93, 97]. For example, X-Trace is a tracing framework that provides a comprehensive view of the behavior of services running on large-scale, potentially shared clusters. X-Trace supports several protocols and software systems, and has been deployed in several real-world scenarios, including DNS resolution, and a photo-hosting site [42]. The Mystery Machine, on the other hand, leverages the massive amount of monitoring data cloud systems collect to determine the causal relationship between different requests [24]. Cloudseer serves a similar purpose, building an automaton for the workflow of each task based on normal execution, and then compares against this automaton at runtime to determine if the workflow has diverged from its expected behavior [97]. Finally, there are several production systems,
Service                | Communication Protocol | Unique Microservices | Per-language LoC breakdown (end-to-end service)
Social Network         | RPC                    | 36                   | 34% C, 23% C++, 18% –, 7% node, 6% Python, 5% Scala, 3% PHP, 2% JS, 2% Go
Media Service          | RPC                    | 38                   | 30% C, 21% C++, 20% Java, 10% PHP, 8% Scala, 5% node, 3% Python, 3% JS
E-commerce Site        | REST                   | 41                   | 21% Java, 16% C++, 15% C, 14% Go, 10% JS, 7% node, 5% Scala, 4% HTML, 3% Ruby
Banking System         | RPC                    | 28                   | 29% C, 25% Javascript, 16% Java, 16% node.js, 11% C++, 3% Python
Hotel Reservations [3] | RPC                    | 15                   | 89% Go, 7% HTML, 4% Python

Table 1. Characteristics and code composition of each end-to-end microservices-based application.


including Dapper [85], GWP [78], and Zipkin [8], which provide the tracing infrastructure for large-scale production services at Google and Twitter, respectively. Dapper and Zipkin trace distributed user requests at RPC granularity, while GWP focuses on low-level hardware monitoring diagnostics.

Root cause analysis of performance abnormalities in the cloud has also gained increased attention over the past few years, as the number of interactive, latency-critical services hosted in cloud systems has increased. Jayathilaka et al. [54], for example, developed Roots, a system that automatically identifies the root cause of performance anomalies in web applications deployed in Platform-as-a-Service (PaaS) clouds. Roots tracks events within the PaaS cloud using a combination of metadata injection and platform-level instrumentation. Weng et al. [91] similarly explore the cloud provider's ability to diagnose the root cause of performance abnormalities in multi-tier applications. Finally, Ouyang et al. [74] focus on the root cause analysis of straggler tasks in distributed programming frameworks, like MapReduce and Spark.

Even though this work does not specifically target interactive, latency-critical microservices, or applications of similar granularity, such examples provide promising evidence that data-driven performance diagnostics can improve a large-scale system's ability to identify performance anomalies, and address them to meet its performance guarantees.

Figure 3. Dependency graph between the microservices of the end-to-end Social Network application.

Figure 4. Architecture of the hotel reservation site using Go-microservices [3].

3 End-to-End Applications with Microservices

We motivate and evaluate Seer with a set of new end-to-end, interactive services built with microservices. Even though there are open-source microservices that can serve as components of a larger application, such as nginx [6], memcached [41], MongoDB [5], Xapian [57], and RabbitMQ [4], there are currently no publicly-available end-to-end microservices applications, with the exception of a few simple architectures, like Go-microservices [3] and Sockshop [7]. We design four end-to-end services implementing a Social Network, a Media Service, an E-commerce Site, and a Banking System. Starting from the Go-microservices architecture [3], we also develop an end-to-end Hotel Reservation system. Services are designed to be representative of frameworks used in production systems, modular, and easily reconfigurable. The end-to-end applications and tracing infrastructure are described in more detail and open-sourced in [45].

Table 1 briefly shows the characteristics of each end-to-end application, including its communication protocol, the number of unique microservices it includes, and its breakdown by programming language and framework. Unless otherwise noted, all microservices are deployed in Docker containers. Below, we briefly describe the scope and functionality of each service.

3.1 Social Network

Scope: The end-to-end service implements a broadcast-style social network with uni-directional follow relationships.

Functionality: Fig. 3 shows the architecture of the end-to-end service. Users (client) send requests over http, which first reach a load balancer, implemented with nginx, which selects a specific webserver, also in nginx. Users can create posts embedded with text, media, links, and tags to other users, which are then broadcast to all their followers. Users can also read, favorite, and repost posts, as well as reply publicly, or send a direct message to another user. The application also includes machine learning plugins, such as ads and user recommender engines [15, 16, 59, 92], a search service using Xapian [57], and microservices that allow users to follow, unfollow, or block other accounts. Inter-microservice messages use Apache Thrift RPCs [1]. The service's backend uses memcached for caching, and MongoDB for persistently storing posts, user profiles, media, and user recommendations. This service is broadly deployed at Cornell and elsewhere, and currently has several hundred users. We use this installation to test the effectiveness and scalability of Seer in Section 6.

3.2 Media Service

Scope: The application implements an end-to-end service for browsing movie information, as well as reviewing, rating, renting, and streaming movies [9, 11].

Functionality: As with the social network, a client request hits the load balancer, which distributes requests among multiple nginx webservers. The front-end is similar to Social Network's, and users can search and browse information about movies, including the plot, photos, videos, and review information, as well as insert a review for a specific movie by logging in to their account. Users can also select to rent a movie, which involves a payment authentication module to verify the user has enough funds, and a video streaming module using nginx-hls, a production nginx module for HTTP live streaming. Movie files are stored in NFS, to avoid the latency and complexity of accessing chunked records from non-relational databases, while reviews are held in memcached and MongoDB instances. Movie information is maintained in a sharded and replicated MySQL DB. We are similarly deploying Media Service as a hosting site for project demos at Cornell, which students can browse and review.

Figure 5. The architecture of the end-to-end E-commerce application implementing an online clothing store.

3.3 E-Commerce Service

Scope: The service implements an e-commerce site for clothing. The design draws inspiration from, and uses several components of, the open-source Sockshop application [7].

Functionality: The application front-end here is a node.js service. Clients can use the service to browse the inventory using catalogue, a Go microservice that mines the back-end memcached and MongoDB instances holding information about products. Users can also place orders (Go) by adding items to their cart (Java). After they log in (Go) to their account, they can select shipping options (Java), process their payment (Go), and obtain an invoice (Java) for their order. Orders are serialized and committed using QueueMaster (Go). Finally, the service includes a recommender engine (C++), and microservices for creating wishlists (Java).

3.4 Banking System

Scope: The service implements a secure banking system, supporting payments, loans, and credit card management.

Functionality: Users interface with a node.js front-end, similar to E-commerce, to login to their account, search information about the bank, or contact a representative. Once logged in, a user can process a payment from their account, pay their credit card or request a new one, request a loan, and obtain information about wealth management options. Most microservices are written in Java and Javascript. The back-end databases are memcached and MongoDB instances. The service also has a relational database (BankInfoDB) that includes information about the bank, its services, and representatives.

3.5 Hotel Reservation Site

Scope: The service is an online hotel reservation site for browsing information about hotels and making reservations.

Functionality: The service is based on the Go-microservices open-source project [3], augmented with backend databases and machine learning widgets for advertisement and hotel recommendations. A client request is first directed to one of the front-end webservers in node.js by a load balancer. The front-end then interfaces with the search engine, which allows users to explore hotel availability in a given region, and place a reservation. The service back-end consists of memcached and MongoDB instances.

Figure 6. Overview of Seer's operation.

4 Seer Design

4.1 Overview

Fig. 6 shows the high-level architecture of the system. Seer is an online performance debugging system for cloud systems hosting interactive, latency-critical services. Even though we are focusing our analysis on microservices, where the impact of QoS violations is more severe, Seer is also applicable to general cloud services, and traditional multi-tier or Service-Oriented Architecture (SOA) workloads.

Seer uses two levels of tracing, shown in Fig. 7. First, it uses a lightweight, distributed RPC-level tracing system, described in Sec. 4.2, which collects end-to-end execution traces for each user request, including per-tier latency and outstanding requests, associates RPCs belonging to the same end-to-end request, and aggregates them to a centralized Cassandra database (TraceDB). From there traces are used to train Seer to recognize patterns in space (between microservices) and time that lead to QoS violations. At runtime, Seer consumes real-time streaming traces to infer whether there is an imminent QoS violation. When a QoS violation is expected to occur and a culprit microservice has been located, Seer uses its lower tracing level, which consists of detailed per-node, low-level hardware monitoring primitives, such as performance counters, to identify the reason behind the QoS violation. It also uses this information to provide the cluster manager with recommendations on how to avoid the performance degradation altogether. When Seer runs on a public cloud where performance counters are disabled, it uses a set of tunable microbenchmarks to determine the source of unpredictable performance (see Sec. 4.4). Using two specialized tracing levels, instead of collecting detailed low-level traces for all active microservices, ensures that the distributed tracing is lightweight enough to track all active requests and services in the system, and that detailed low-level hardware tracing is only used on-demand, for microservices likely to cause performance disruptions. In the following sections we describe the design of the tracing system, the learning techniques in Seer and its low-level diagnostic framework, and the system insights we can draw from Seer's decisions to improve cloud application design.

Figure 7. The two levels of tracing used in Seer: RPC-level tracing (signal the QoS violation, identify the culprit microservice), and node-level diagnostics using performance counters and microbenchmarks (identify the culprit resource, notify the cluster manager).

4.2 Distributed Tracing

A major challenge with microservices is that one cannot simply rely on the client to report performance, as with traditional client-server applications. We have developed a distributed tracing system for Seer, similar in design to Dapper [85] and Zipkin [8], that records per-microservice latencies using the Thrift timing interface, as shown in Fig. 8. We additionally track the number of requests queued in each microservice (outstanding requests), since queue lengths are highly correlated with performance and QoS violations [46, 49, 56, 96]. In all cases, the overhead from tracing without request sampling is negligible: less than 0.1% on end-to-end latency, and less than 0.15% on throughput (QPS), which is tolerable for such systems [24, 78, 85]. Traces from all microservices are aggregated in a centralized database [18].

Figure 8. Distributed tracing and instrumentation in Seer.

Instrumentation: The tracing system requires two types of application instrumentation. First, to distinguish between the time spent processing network requests and the time that goes towards application computation, we instrument the application to report the time it sees a new request (post-RPC processing). We similarly instrument the transmit side of the application. Second, systems have multiple sources of queueing in both hardware and software. To obtain accurate measurements of queue lengths per microservice, we need to account for these different queues.
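As a concrete illustration of the records this tracing layer produces, the sketch below shows one possible shape for an RPC-level trace entry (receive/transmit timestamps from the timing instrumentation, plus outstanding-request counts) and for the per-microservice queue accounting across hardware and software queues. All names, fields, and numbers are hypothetical illustrations, not Seer's actual TraceDB schema:

```python
from dataclasses import dataclass

# Hypothetical shape of one RPC-level trace record; Seer's actual
# Cassandra (TraceDB) schema is not specified at this level of detail.
@dataclass
class RpcSpan:
    trace_id: str        # shared by all RPCs of one end-to-end request
    microservice: str
    rx_time_us: float    # time the request is first seen (post-RPC processing)
    tx_time_us: float    # time the response reaches the transmit side
    outstanding: int     # requests queued at this microservice

def per_tier_latency(spans):
    """Aggregate the spans of one end-to-end request into per-tier latencies (us)."""
    latency = {}
    for s in spans:
        latency[s.microservice] = (latency.get(s.microservice, 0.0)
                                   + (s.tx_time_us - s.rx_time_us))
    return latency

def total_queue_depth(sw_queues, nic_queue, nic_shared):
    """Sum a microservice's software queue depths with its NIC queue depth.
    If the hardware queue is shared across microservices, count the
    aggregate depth, since the aggregate determines how quickly this
    service's packets get processed."""
    depth = sum(sw_queues.values())
    return depth + (nic_queue["aggregate"] if nic_shared else nic_queue["own"])

spans = [RpcSpan("req-1", "nginx", 0.0, 900.0, 3),
         RpcSpan("req-1", "memcached", 50.0, 250.0, 12)]
print(per_tier_latency(spans))          # {'nginx': 900.0, 'memcached': 200.0}

sw = {"epoll": 4, "socket_read": 2, "proc": 7}   # illustrative stage names
nic = {"own": 3, "aggregate": 11}
print(total_queue_depth(sw, nic, nic_shared=True))   # 24
```

A real deployment would emit such records from the RPC layer itself and batch-write them to the trace database, rather than construct them by hand as above.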
4.4). Using two specialized tracing levels, instead of collecting detailed low-level traces for all active microservices, ensures that the distributed tracing remains lightweight.

Requests are queued in both hardware and software. To obtain accurate measurements of queue lengths per microservice, we need to account for these different queues. Fig. 8 shows an example of Seer's instrumentation for memcached. Memcached includes five main stages [60]: TCP/IP receive, epoll/libevent, reading the request from the socket, processing the request, and responding over TCP/IP, either with the requested key-value pair for a read, or with an ack for a write. Each of these stages includes a hardware (NIC) or software (epoll, socket read, memcached proc) queue. For the NIC queues, Seer filters packets based on the destination microservice, but accounts for the aggregate queue length if hardware queues are shared, since that will impact how fast a microservice's packets get processed. For the software queues, Seer inserts probes in the application to read the number of queued requests in each case.

Limited instrumentation: As seen above, accounting for all sources of queueing in a complex system requires non-trivial instrumentation. This can become cumbersome if users leverage third-party applications in their services, or in the case of public cloud providers, which do not have access to the source code of externally-submitted applications for instrumentation. In these cases Seer relies exclusively on the requests queued in the NIC to signal upcoming QoS violations. In Section 5 we compare the accuracy of the full versus the limited instrumentation, and see that using network queue depths alone is enough to signal a large fraction of QoS violations, although smaller than when the full instrumentation is available. Exclusively polling NIC queues identifies hotspots caused by routing, incast, failures, and resource saturation, but misses QoS violations caused by performance and efficiency bugs in the application implementation, such as blocking behavior between microservices. Signaling such bugs helps developers better understand the microservices model, and results in better application design.

Inferring queue lengths: Additionally, there has been recent work on using deep learning to reverse engineer the number of queued requests in switches across a large network topology [46], when tracing information is incomplete. Such techniques are also applicable and beneficial for Seer when the default level of instrumentation is not available.

4.3 Deep Learning in Performance Debugging
A popular way to model performance in cloud systems, especially when there are dependencies between tasks, are queueing networks [49]. Although queueing networks are a valuable tool to model how bottlenecks propagate through the system, they require in-depth knowledge of application semantics and structure, and can become overly complex as applications and systems scale. They additionally cannot easily capture all sources of contention, such as the OS and the network stack.

Instead, in Seer we take a data-driven, application-agnostic approach that assumes no information about the structure and bottlenecks of a service, making it robust to unknown and changing applications, and relying instead on practical learning techniques to infer patterns that lead to QoS violations. This includes both spatial patterns, such as dependencies between microservices, and temporal patterns, such as input load and resource contention. The key idea in Seer is that conditions that led to QoS violations in the past can be used to anticipate unpredictable performance in the near future. Seer uses execution traces annotated with QoS violations and collected over time to train a deep neural network to signal upcoming QoS violations. Below we describe the structure of the neural network, why deep learning is well-suited for this problem, and how Seer adapts to changes in application structure online.

Using deep learning: Although deep learning is not the only approach that can be used for proactive QoS violation detection, there are several reasons why it is preferable in this case. First, the problem Seer must solve is a pattern matching problem of recognizing conditions that result in QoS violations, where the patterns are not always known in advance or easy to annotate. This is a more complicated task than simply signaling a microservice with many enqueued requests, for which simpler classification, regression, or sequence labeling techniques would suffice [15, 16, 92]. Second, the DNN in Seer assumes no a priori knowledge about dependencies between individual microservices, making it applicable to frequently-updated services, where describing changes is cumbersome, or even difficult for the user to know. Third, deep learning has been shown to be especially effective in pattern recognition problems with massive datasets, e.g., in image or text recognition [10]. Finally, as we show in the validation section (Sec. 5), deep learning allows Seer to recognize QoS violations with high accuracy in practice, and within the opportunity window the cluster manager has to apply corrective actions.

Table 2. The different neural network configurations we explored for Seer.

                     Layers                 Nonlinear                    Batch
  Name   LoC   FC  Conv  Vect  Total        Function          Weights    Size
  CNN    1456   -    8     -     8    ReLU                      30K        4
  LSTM    944  12    -     6    18    sigmoid, tanh             52K       32
  Seer   2882  10    7     5    22    ReLU, sigmoid, tanh       80K       32

Configuring the DNN: The input used in the network is essential for its accuracy. We have experimented with resource utilization, latency, and queue depths as input metrics. Consistent with prior work, utilization is not a good proxy for performance [31, 35, 56, 61]. Latency similarly leads to many false positives, or to incorrectly pinpointing computationally-intensive microservices as QoS violation culprits. Again consistent with queueing theory [49] and prior work [34, 37, 38, 46, 56], per-microservice queue depths accurately capture performance bottlenecks and pinpoint the microservices causing them.
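The per-microservice queue accounting described above can be sketched as follows. This is a minimal illustration rather than Seer's actual implementation; the stage names and field layout are assumptions.

```python
# Sketch: aggregating hardware and software queue depths per microservice.
# Stage names and the data layout are illustrative, not Seer's real API.

from dataclasses import dataclass, field

@dataclass
class MicroserviceQueues:
    name: str
    # Software queues, e.g. for memcached: epoll, socket read, request processing.
    software: dict = field(default_factory=dict)   # stage -> queued requests
    nic_private: int = 0    # packets in a dedicated hardware queue
    nic_shared: int = 0     # aggregate occupancy of a hardware queue shared
                            # with other services (the aggregate matters, since
                            # it determines how fast this service's packets drain)

    def depth(self) -> int:
        # Total outstanding work visible for this microservice.
        return sum(self.software.values()) + self.nic_private + self.nic_shared

mc = MicroserviceQueues(
    name="memcached",
    software={"epoll": 3, "socket_read": 2, "memcached_proc": 7},
    nic_shared=5,
)
print(mc.depth())  # -> 17
```

A vector of such depths, one entry per active microservice, is exactly the kind of input the network described next consumes.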

We compare against utilization-based approaches in Section 5. The number of input and output neurons is equal to the number of active microservices in the cluster, with the input value corresponding to queue depths, and the output value to the probability for a given microservice to initiate a QoS violation. Input neurons are ordered according to the topological application structure, with dependent microservices corresponding to consecutive neurons, to capture hotspot patterns in space. Input traces are annotated with the microservices that caused QoS violations in the past.

The choice of DNN architecture is also instrumental to its accuracy. There are three main DNN designs that are popular today: fully connected networks (FC), convolutional neural networks (CNN), and recurrent neural networks (RNN), especially their Long Short-Term Memory (LSTM) class. For Seer to be effective in improving performance predictability, inference needs to occur with enough slack for the cluster manager's action to take effect. Hence, we focus on the more computationally-efficient CNN and LSTM networks. CNNs are especially effective at reducing the dimensionality of large datasets and finding patterns in space, e.g., in image recognition. LSTMs, on the other hand, are particularly effective at finding patterns in time, e.g., predicting tomorrow's weather based on today's measurements. Signaling QoS violations in a large cluster requires both spatial recognition, namely identifying problematic clusters of microservices whose dependencies cause QoS violations while discarding noisy but non-critical microservices, and temporal recognition, namely using past QoS violations to anticipate future ones. We compare three network designs: a CNN, an LSTM, and a hybrid network that combines the two, using the CNN first to reduce the dimensionality and filter out microservices that do not affect end-to-end performance, and then an LSTM with a SoftMax final layer to infer the probability for each microservice to initiate a QoS violation. The architecture of the hybrid network is shown in Fig. 9. Each network is configured using hyperparameter tuning to avoid overfitting, and the final parameters are shown in Table 2.

Figure 9. The deep neural network design in Seer, consisting of a set of convolution layers followed by a set of long short-term memory layers. Each input and output neuron corresponds to a microservice, ordered topologically, from back-end microservices at the top to front-end microservices at the bottom.

We train each network on a week's worth of trace data collected on a 20-server cluster running all end-to-end services (for methodology details see Sec. 5), and test it on traces collected during a different week, after the servers had been patched and the OS had been upgraded. The quantitative comparison of the three networks is shown in Fig. 10. The CNN is by far the fastest, but also the worst performing, since it is not designed to recognize patterns in time that lead to QoS violations. The LSTM, on the other hand, is especially effective at capturing load patterns over time, but is less effective at reducing the dimensionality of the original dataset, which makes it prone to false positives due to microservices with many outstanding requests that are off the critical path. It also incurs higher overheads for inference than the CNN. Finally, Seer correctly anticipates 93.45% of violations, outperforming both networks, for a small increase in inference time compared to the LSTM. Given that most resource partitioning decisions take effect after a few hundred milliseconds, the inference time for Seer is within the window of opportunity the cluster manager has to take action. More importantly, it attributes the QoS violation to the correct microservice, simplifying the cluster manager's task. QoS violations missed by Seer included four random load spikes, and a network switch failure which caused high packet drops.

Figure 10. Comparison of DNN architectures: QoS violation detection accuracy is 56.1% for the CNN, 78.79% for the LSTM, and 93.45% for Seer (CNN+LSTM), plotted against inference time (ms).

Out of the five end-to-end services, the one most prone to QoS violations initially was Social Network: first, because it has stricter QoS constraints than, e.g., E-commerce, and second, due to a synchronous and cyclic communication pattern between three neighboring services that caused them to enter a positive feedback loop until saturation. We reimplemented the communication protocol between them post-detection. On the other hand, the service for which QoS violations were hardest to detect was Media Service, because of a faulty memory bandwidth partitioning mechanism in one of our servers, which resulted in widely inconsistent memory bandwidth allocations during movie streaming. Since the QoS violation only occurred when the specific streaming microservice was scheduled on the faulty node, it was hard for Seer to collect enough datapoints to signal the violation.
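As a concrete illustration of the input encoding described earlier in this section (queue depths per microservice, ordered topologically, with a sliding time window and the culprit microservice as the label), the sketch below shows how training examples could be assembled. The trace format and helper name are assumptions, not Seer's actual code.

```python
# Sketch: building (window, label) training examples from queue-depth traces.
# Rows of `trace` are snapshots taken every tracing interval; columns are
# microservices in topological order (back-end first). Illustrative only.

def make_examples(trace, culprits, window=4):
    """trace: list of per-interval queue-depth vectors.
    culprits: per interval, the index of the microservice that caused an
    upcoming QoS violation, or None when there is no violation.
    Returns (window of snapshots, culprit index) pairs."""
    examples = []
    for t, culprit in enumerate(culprits):
        if culprit is None or t < window:
            continue  # no violation, or not enough history yet
        examples.append((trace[t - window:t], culprit))
    return examples

# Toy trace: 3 microservices, 6 intervals; a queue builds up at service 1.
trace = [[0, 1, 0], [0, 2, 0], [1, 4, 0], [0, 7, 1], [0, 12, 1], [1, 20, 2]]
culprits = [None, None, None, None, None, 1]  # violation at t=5, service 1
examples = make_examples(trace, culprits, window=4)
print(len(examples), examples[0][1])  # -> 1 1
```

Tensors of this shape (time window x microservices) feed the convolution layers, while the label drives the SoftMax output.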
Retraining Seer: By default, training happens once and can be time consuming, taking several hours up to a day for week-long traces collected on our 20-server cluster (Sec. 5 includes a detailed sensitivity study for training time). However, one of the main advantages of microservices is that they simplify frequent application updates, with old microservices often swapped out and replaced by newer modules, or large services progressively broken down into microservices. If the application (or the underlying hardware) changes significantly, Seer's detection accuracy can be impacted. To adjust to changes in the execution environment, Seer retrains incrementally in the background, using the transfer learning-based approach in [80]. Weights from previous training rounds are stored on disk, allowing the model to continue training from where it last left off when new data arrive, reducing the training time by 2-3 orders of magnitude. Even though this approach allows Seer to handle application changes almost in real time, it is not a long-term solution, since the new weights are still polluted by the previous application architecture. When the application changes in a major way, e.g., when microservices on the critical path change, Seer also retrains from scratch in the background. While the new network trains, QoS violation detection happens with the incrementally-trained interim model. In Section 5, we evaluate Seer's ability to adjust its estimations to application changes.

4.4 Hardware Monitoring
Once a QoS violation is signaled and a culprit microservice is pinpointed, Seer uses low-level monitoring to identify the reason behind the QoS violation. The exact process depends on whether Seer has access to performance counters.

Private cluster: When Seer has access to hardware events, such as performance counters, it uses them to determine the utilization of different shared resources. Note that even though utilization is a problematic metric for anticipating QoS violations in a large-scale service, once a culprit microservice has been identified, examining the utilization of different resources can provide useful hints to the cluster manager on suitable decisions to avoid degraded performance. Seer specifically examines CPU, memory capacity and bandwidth, network bandwidth, cache contention, and storage I/O bandwidth when prioritizing a resource to adjust. Once the saturated resource is identified, Seer notifies the cluster manager to take action.

Public cluster: When Seer does not have access to performance counters, it instead uses a set of 10 tunable contentious microbenchmarks, each targeting a different shared resource [30], to determine resource saturation. For example, if Seer injects the memory bandwidth microbenchmark into the system and tunes up its intensity without an impact on the co-scheduled microservice's performance, memory bandwidth is most likely not the resource that needs to be adjusted. Seer starts from microbenchmarks corresponding to core resources, and progressively moves to resources further away from the core, until it sees a substantial change in performance when running the microbenchmark. Each microbenchmark takes approximately 10ms to complete, avoiding prolonged degraded performance. Upon identifying the problematic resource(s), Seer notifies the cluster manager, which takes one of several resource allocation actions: resizing the Docker container, or using mechanisms like Intel's Cache Allocation Technology (CAT) for last-level cache (LLC) partitioning, and the traffic control's hierarchical token bucket (HTB) queueing discipline in qdisc [17, 62] for network bandwidth partitioning.

4.5 System Insights from Seer
Using learning-based, data-driven approaches in systems is most useful when these techniques are used to gain insight into system problems, instead of treating them as black boxes. Section 5 includes an analysis of the causes behind QoS violations signaled by Seer, including application bugs, poor resource provisioning decisions, and hardware failures. Furthermore, we have deployed Seer in a large installation of the Social Network service over the past few months, and its output has been instrumental not only in guaranteeing QoS, but in understanding sources of unpredictable performance, and improving the application design. This has resulted both in progressively fewer QoS violations over time, and in a better understanding of the design challenges of microservices.

4.6 Implementation
Seer is implemented in 12KLOC of C, C++, and Python. It runs on Linux and OSX, and supports applications in various languages, including all the frameworks the end-to-end services are designed in. Furthermore, we provide automated patches for the instrumentation probes for many popular microservices, including NGINX, memcached, MongoDB, Xapian, and all Sockshop and Go-microservices applications, to minimize the development effort from the user's perspective.

Seer is a centralized system; we use master-slave mirroring to improve fault tolerance, with two hot stand-by masters that can take over if the primary system fails. Similarly, the trace database is also replicated in the background.

Security concerns: Trace data is stored and processed unencrypted in Cassandra. Previous work has shown that the sensitivity applications have to different resources can leak information about their nature and characteristics, making them vulnerable to malicious security attacks [27, 36, 50, 52, 84, 88, 89, 94, 99]. Similar attacks are possible using the data and techniques in Seer, and are deferred to future work.

5 Seer Analysis and Validation

5.1 Methodology
Server clusters: First, we use a dedicated local cluster with 20 2-socket, 40-core servers with 128GB of RAM each. Each server is connected to a 40Gbps ToR switch over 10GbE NICs. Second, we deploy the Social Network service to Google Compute Engine (GCE) and Windows Azure clusters with hundreds of servers to study the scalability of Seer.
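The public-cluster diagnosis of Section 4.4 amounts to a search over contentious microbenchmarks, ramping each one's intensity and watching the victim's tail latency. The sketch below illustrates the idea with an injected measurement function; the resource list, ordering, and 20% threshold are assumptions, and a real deployment would run iBench-style contention kernels rather than a stub.

```python
# Sketch: identify a saturated shared resource by injecting tunable
# contentious microbenchmarks, starting from core resources and moving
# outward. Names, ordering, and the threshold are illustrative assumptions.

CORE_TO_UNCORE = ["cpu", "l1_cache", "llc", "memory_bw", "network_bw", "disk_io"]

def find_saturated_resource(baseline_latency, measure_with_pressure,
                            threshold=1.2):
    """measure_with_pressure(resource, intensity) -> victim tail latency
    while an antagonist stresses `resource` at `intensity` in [0, 1]."""
    for resource in CORE_TO_UNCORE:
        for intensity in (0.25, 0.5, 0.75, 1.0):   # tune up the antagonist
            latency = measure_with_pressure(resource, intensity)
            if latency > threshold * baseline_latency:
                # Latency reacts to the added pressure: resource is contended.
                return resource
    return None  # no shared-resource saturation detected

# Fake environment: the victim microservice is memory-bandwidth-bound.
def fake_measure(resource, intensity):
    return 10.0 * (1 + 2 * intensity) if resource == "memory_bw" else 10.0

print(find_saturated_resource(10.0, fake_measure))  # -> memory_bw
```

If no microbenchmark perturbs the victim, the bottleneck is likely in the application itself rather than in a shared hardware resource.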

Figure 11. Seer's sensitivity to (a) the size of the training dataset, and (b) the tracing interval.

Figure 12. (a) The false negatives and false positives in Seer as we vary the inference window. (b) Breakdown of the causes of QoS violations, and comparison with utilization-based detection and systems with limited instrumentation.

Applications: We use all five end-to-end services of Table 1. Services for now are driven by open-loop workload generators, and the input load varies from constant, to diurnal, to load with spikes in user demand. In Section 6 we study Seer in a real large-scale deployment of the Social Network; in that case the input load is driven by real user traffic.

5.2 Evaluation
Sensitivity to training data: Fig. 11a shows the detection accuracy and training time for Seer as we increase the size of the training dataset. The size of the dots is a function of the dataset size. Training data is collected from the 20-server cluster described above, across different load levels, placement strategies, time intervals, and request types. The smallest training set (100MB) is collected over ten minutes of the cluster operating at high utilization, while the largest dataset (1TB) is collected over almost two months of continuous deployment. As datasets grow, Seer's accuracy increases, leveling off at 100-200GB. Beyond that point accuracy does not further increase, while the time needed for training grows significantly. Unless otherwise specified, we use the 100GB training dataset.

Sensitivity to tracing frequency: By default the distributed tracing system instantaneously collects the latency of every single user request. Collecting queue depth statistics, on the other hand, is a per-microservice iterative process. Fig. 11b shows how Seer's accuracy changes as we vary the frequency with which we collect queue depth statistics. Waiting for a long time before sampling queues, e.g., > 1s, can result in undetected QoS violations before Seer gets a chance to process the incoming traces. In contrast, sampling queues very frequently results in unnecessarily many inferences, and runs the risk of increasing the tracing overhead. For the remainder of this work, we use 100ms as the interval for measuring queue depths across microservices.

False negatives & false positives: Fig. 12a shows the percentage of false negatives and false positives as we vary the prediction window. When Seer tries to anticipate QoS violations that will occur in the next 10-100ms, both false positives and false negatives are low, since Seer uses a very recent snapshot of the cluster state to anticipate performance unpredictability. If inference was instantaneous, very short prediction windows would always be better. However, given that inference takes several milliseconds and, more importantly, applying corrective actions to avoid QoS violations takes 10-100s of milliseconds to take effect, such short windows defy the point of proactive QoS violation detection. At the other end, predicting far into the future results in significant false negatives, and especially false positives. This is because many QoS violations are caused by very short, bursty events that do not have an impact on queue lengths until a few milliseconds before the violation occurs. Therefore, requiring Seer to predict one or more seconds into the future means that normal queue depths are annotated as QoS violations, resulting in many false positives. Unless otherwise specified, we use a 100ms prediction window.

Comparison of debugging systems: Fig. 12b compares Seer with a utilization-based performance debugging system that uses resource saturation as the trigger to signal a QoS violation, and two systems that only use a fraction of Seer's instrumentation. App-only exclusively uses queues measured via application instrumentation (not network queues), while Network-only uses queues in the NIC, and ignores application-level queueing. We also show the ground truth for the total number of upcoming QoS violations (96 over a two-week period), and break it down by the reason that led to unpredictable performance.

A large fraction of QoS violations are due to application-level inefficiencies, including correctness bugs, unnecessary synchronization and/or blocking behavior between microservices (including two cases of deadlocks), and misconfigured iptables rules, which caused packet drops. An equally large fraction of QoS violations are due to compute contention, followed by contention in the network, cache, and memory, and finally disk.
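Stepping back to the false-positive/false-negative methodology above, the bookkeeping can be made concrete: a predicted violation counts as a true positive only if a real violation occurs within the prediction window after it. The sketch below illustrates that accounting; the matching policy is our assumption, not the paper's evaluation harness.

```python
# Sketch: scoring predicted QoS violations against ground truth for a given
# prediction window (times in ms). The matching policy is an assumption.

def score(predicted, actual, window_ms):
    """predicted/actual: sorted lists of violation timestamps (ms).
    A prediction is a true positive if some actual violation falls within
    (t, t + window_ms]; each actual violation can satisfy one prediction."""
    unmatched = list(actual)
    false_pos = 0
    for t in predicted:
        hit = next((a for a in unmatched if t < a <= t + window_ms), None)
        if hit is None:
            false_pos += 1          # prediction with no matching violation
        else:
            unmatched.remove(hit)   # violation was anticipated in time
    false_neg = len(unmatched)      # real violations nobody predicted
    return false_pos, false_neg

# Short window: the early prediction at t=0 misses the violation at t=400.
print(score([0, 950], [400, 1000], window_ms=100))   # -> (1, 1)
# Longer window: both violations are anticipated in time.
print(score([0, 950], [400, 1000], window_ms=500))   # -> (0, 0)
```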

Since the only persistent microservices are the back-end databases, it is reasonable that disk accounts for a small fraction of overall QoS violations. Seer accurately follows this breakdown for the most part, only missing a few QoS violations due to random load spikes, including one caused by a switch failure. The App-only system correctly identifies application-level sources of unpredictable performance, but misses the majority of system-related issues, especially in uncore resources. On the other hand, Network-only correctly identifies the vast majority of network-related issues, as well as most of the core- and uncore-driven QoS violations, but misses several application-level issues. The difference between Network-only and Seer is small, suggesting that one could omit the application-level instrumentation in favor of a simpler design. While this system is still effective in capturing QoS violations, it is less useful in providing feedback to application developers on how to improve their design to avoid QoS violations in the future. Finally, the utilization-based system behaves the worst, missing most violations not caused by CPU saturation.

Out of the 89 QoS violations Seer detects, it notifies the cluster manager early enough to avoid 84 of them. The QoS violations that were not avoided correspond to application-level bugs, which cannot be easily corrected online. Since this is a private cluster, Seer uses utilization metrics and performance counters to identify problematic resources.

Retraining: Fig. 13 shows the detection accuracy for Seer, and the tail latency for each end-to-end service, over a period of time during which Social Network is getting frequently and substantially updated. This includes new microservices being added to the service, such as the ability to place an order from an ad using the orders microservice of E-commerce, or the back-end of the service changing from MongoDB to Cassandra, and the front-end switching from nginx to the node.js front-end of E-commerce. These are changes that fundamentally affect the application's behavior, throughput, latency, and bottlenecks. The other services remain unchanged during this period (Banking was not active during this time, and is omitted from the graph). Blue dots denote correctly-signaled upcoming QoS violations, and red x's denote QoS violations that were not detected by Seer. All unidentified QoS violations coincide with the application being updated. Shortly after each update, Seer incrementally retrains in the background, and starts recovering its accuracy until another major update occurs. Some of the updates have no impact on either performance or Seer's accuracy, either because they involve microservices off the critical path, or because they are insensitive to resource contention.

Figure 13. Seer retraining incrementally after each time the Social Network service is updated.

The bottom figure shows that unidentified QoS violations indeed result in performance degradation for Social Network, and in some cases for the other end-to-end services, if they are sharing physical resources with Social Network on an oversubscribed server. Once retraining completes, the performance of the service(s) recovers. The longer Seer trains on an evolving application, the more likely it is to correctly anticipate its future QoS violations.

6 Large-Scale Cloud Study

6.1 Seer Scalability
We now deploy our Social Network service on a 100-instance dedicated cluster on Google Compute Engine (GCE), and use it to service real user traffic. The application has 582 registered users, with 165 daily active users, and has been deployed for a two-month period. The cluster on average hosts 386 single-concerned containers (one microservice per container), subject to some resource scaling actions by the cluster manager, based on Seer's feedback.

Accuracy remains high for Seer, consistent with the small-scale experiments. Inference time, however, increases substantially, from 11.4ms for the 20-server cluster to 54ms for the 100-server GCE setting. Even though this is still sufficient for many resource allocation decisions, as the application scales further, Seer's ability to anticipate a QoS violation within the cluster manager's window of opportunity diminishes.

Over the past year multiple public cloud providers have exposed hardware acceleration offerings for DNN training and inference, either using a special-purpose design like the Tensor Processing Unit (TPU) from Google [55], or using reconfigurable FPGAs, like Project Brainwave from Microsoft [25]. We offload Seer's DNN logic to both systems, and quantify the impact on training and inference time, and detection accuracy¹. Fig. 14 shows this comparison for a 200GB training dataset. Both the TPU and Project Brainwave dramatically outperform our local implementation, by up to two orders of magnitude. Between the two accelerators, the TPU is more effective in training, consistent with its design objective [55], while Project Brainwave achieves faster inference. For the remainder of the paper, we run Seer on TPUs, and host the Social Network service on GCE.

Figure 14. Seer training and inference with hardware acceleration (log-scale time for Native, TPU, and Brainwave).

Figure 15. QoS violations each microservice in Social Network is responsible for.

Figure 16. QoS violations in Social Network per day over the two-month deployment.
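The incremental retraining described in Section 4.3 boils down to checkpointing model state and warm-starting from it when new traces arrive, rather than training from scratch. Below is a minimal sketch of that pattern, with a toy per-microservice model standing in for Seer's DNN; the file name and update rule are illustrative assumptions.

```python
# Sketch: checkpoint-and-resume retraining. A toy per-microservice violation
# score stands in for Seer's DNN weights; all names are illustrative.

import json, os, tempfile

CKPT = os.path.join(tempfile.gettempdir(), "seer_toy_ckpt.json")

def train(weights, batch, lr=0.5):
    # Toy update: move each weight toward the observed violation frequency.
    for svc, observed in batch.items():
        prev = weights.get(svc, 0.0)
        weights[svc] = prev + lr * (observed - prev)
    return weights

def save(weights):
    with open(CKPT, "w") as f:
        json.dump(weights, f)

def load():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)   # warm start: resume from previous rounds
    return {}                     # cold start: train from scratch

save(train({}, {"memcached": 1.0, "nginx": 0.0}))   # initial training round
weights = train(load(), {"memcached": 1.0})          # incremental round later
print(round(weights["memcached"], 3))                # -> 0.75 (refined, not reset)
```

Resuming from the checkpoint preserves what earlier rounds learned, which is what makes incremental retraining orders of magnitude cheaper than full retraining, at the cost of carrying over stale structure when the application changes drastically.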
6.2 Source of QoS Violations
We now examine which microservice is the most common culprit for a QoS violation. Fig. 15 shows the number of QoS violations caused by each service over the two-month period. The most frequent culprits by far are the in-memory caching tiers in memcached, and Thrift services with high request fanout, such as composePost, readPost, and login. memcached is a justified source of QoS violations, since it is on the critical path for almost all query types, and it is additionally very sensitive to resource contention in compute and, to a lesser degree, cache and memory. Microservices with high fanout are also expected to initiate QoS violations, as they have to synchronize multiple inbound requests before proceeding. If processing for any incoming request is delayed, end-to-end performance is likely to suffer. Among these QoS violations, most of memcached's violations were caused by resource contention, while violations in Thrift services were caused by long synchronization times.

6.3 Seer's Long-Term Impact on Application Design
Seer has now been deployed in the Social Network cluster for over two months, and in this time it has detected 536 upcoming QoS violations (90.6% accuracy) and avoided 495 (84%) of them. Furthermore, by detecting recurring patterns that lead to QoS violations, Seer has helped the application developers better understand bugs and design decisions that lead to hotspots, such as microservices with a lot of back-and-forth communication between them, microservices forming cyclic dependencies, or microservices using blocking primitives. This has led to a decreasing number of QoS violations over the two-month period (seen in Fig. 16), as the application progressively improves. In days 22 and 23 there was a cluster outage, which is why the reported violations are zero. Systems like Seer can be used not only to improve performance predictability in complex cloud systems, but also to help users better understand the design challenges of microservices, as more services transition to this application model.

7 Conclusions
Cloud services increasingly move away from complex monolithic designs, and adopt the model of specialized, loosely-coupled microservices. We presented Seer, a data-driven cloud performance debugging system that leverages practical learning techniques, and the massive amount of tracing data cloud systems collect, to proactively detect and avoid QoS violations. We have validated Seer's accuracy in controlled environments, and evaluated its scalability on large-scale clusters on public clouds. We have also deployed the system in a cluster hosting a social network with hundreds of users. In all scenarios, Seer accurately detects upcoming QoS violations, improving responsiveness and performance predictability. As more services transition to the microservices model, systems like Seer provide practical solutions that can navigate the increasing complexity of the cloud.

Acknowledgements
We sincerely thank Christos Kozyrakis, Balaji Prabhakar, Mendel Rosenblum, Daniel Sanchez, Ravi Soundararajan, the academic and industrial users of the benchmark suite, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was supported by NSF CAREER Award CCF-1846046, NSF NeTS CSR-1704742, a Facebook Faculty Research Award, a VMware Research Award, a John and Norma Balen Sesquicentennial Faculty Fellowship, and donations from GCE, Azure, and AWS.

¹Before running on TPUs, we reimplemented our DNN in TensorFlow. We similarly adjusted the DNN to the currently-supported designs in Brainwave.

References
[1] [n. d.]. Apache Thrift. https://thrift.apache.org.
[2] [n. d.]. Decomposing Twitter: Adventures in Service-Oriented Architecture. https://www.slideshare.net/InfoQ/decomposing-twitter-adventures-in-serviceoriented-architecture.
[3] [n. d.]. Golang Microservices Example. https://github.com/harlow/go-micro-services.
[4] [n. d.]. Messaging that just works. https://www.rabbitmq.com/.
[5] [n. d.]. MongoDB. https://www.mongodb.com.
[6] [n. d.]. NGINX. https://nginx.org/en.
[7] [n. d.]. SockShop: A Microservices Demo Application. https://www.weave.works/blog/sock-shop-microservices-demo-application.
[8] [n. d.]. Zipkin. http://zipkin.io.
[9] 2016. The Evolution of Microservices. https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference.
[10] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. [n. d.]. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. In Proceedings of OSDI, 2016.
[11] Adrian Cockroft. [n. d.]. Microservices Workshop: Why, what, and how to get there. http://www.slideshare.net/adriancockcroft/microservices-workshop-craft-conference.
[12] Hyunwook Baek, Abhinav Srivastava, and Jacobus Van der Merwe. [n. d.]. CloudSight: A tenant-oriented transparency framework for cross-layer cloud troubleshooting. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2017.
[13] Luiz Barroso. [n. d.]. Warehouse-Scale Computing: Entering the Teenage Decade. ISCA Keynote, San Jose, CA, June 2011.
[14] Luiz Barroso and Urs Hoelzle. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. MC Publishers.
[15] Robert Bell, Yehuda Koren, and Chris Volinsky. 2007. The BellKor 2008 Solution to the Netflix Prize. Technical Report.
[16] Leon Bottou. [n. d.]. Large-Scale Machine Learning with Stochastic Gradient Descent. In Proceedings of the International Conference on Computational Statistics (COMPSTAT). Paris, France, 2010.
[17] Martin A. Brown. [n. d.]. Traffic Control HOWTO. http://linux-ip.net/articles/Traffic-Control-HOWTO/.
[18] [n. d.]. Apache Cassandra. http://cassandra.apache.org/.
[19] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A Cloud-scale Acceleration Architecture. In MICRO. IEEE Press, Piscataway, NJ, USA, Article 7, 13 pages.
[20] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. of the 19th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems.
[21] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. DianNao Family: Energy-efficient Hardware Accelerators for Machine Learning. Commun. ACM 59, 11 (Oct. 2016), 105–112. https://doi.org/10.1145/2996864
[22] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 609–622.
[23] Ludmila Cherkasova, Diwaker Gupta, and Amin Vahdat. 2007. Comparison of the Three CPU Schedulers in Xen. SIGMETRICS Perform. Eval. Rev. 35, 2 (Sept. 2007), 42–51.
[24] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. 2014. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI '14). USENIX Association, Berkeley, CA, USA, 217–231.
[25] Eric S. Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian M. Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Maleen Abeydeera, Logan Adams, Hari Angepat, Christian Boehn, Derek Chiou, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Ahmad El Husseini, Tamás Juhász, Kara Kagi, Ratna Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Brandon Perez, Amanda Rapsang, Steven K. Reinhardt, Bita Rouhani, Adam Sapek, Raja Seera, Sangeetha Shekar, Balaji Sridharan, Gabriel Weisz, Lisa Woods, Phillip Yi Xiao, Dan Zhang, Ritchie Zhao, and Doug Burger. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8–20.
[26] Guilherme Da Cunha Rodrigues, Rodrigo N. Calheiros, Vinicius Tavares Guimaraes, Glederson Lessa dos Santos, Márcio Barbosa de Carvalho, Lisandro Zambenedetti Granville, Liane Margarida Rockenbach Tarouco, and Rajkumar Buyya. 2016. Monitoring of Cloud Computing Environments: Concepts, Solutions, Trends, and Future Directions. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16). ACM, New York, NY, USA, 378–383.
[27] Marwan Darwish, Abdelkader Ouda, and Luiz Fernando Capretz. [n. d.]. Cloud-based DDoS attacks and defenses. In Proc. of i-Society. Toronto, ON, 2013.
[28] Jeffrey Dean and Luiz Andre Barroso. [n. d.]. The Tail at Scale. In CACM, Vol. 56, No. 2.
[29] Christina Delimitrou, Nick Bambos, and Christos Kozyrakis. [n. d.]. QoS-Aware Admission Control in Heterogeneous Datacenters. In Proceedings of the International Conference on Autonomic Computing (ICAC). San Jose, CA, USA, 2013.
[30] Christina Delimitrou and Christos Kozyrakis. [n. d.]. iBench: Quantifying Interference for Datacenter Workloads. In Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC). Portland, OR, September 2013.
[31] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Houston, TX, USA, 2013.
[32] Christina Delimitrou and Christos Kozyrakis. [n. d.]. QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon. In ACM Transactions on Computer Systems (TOCS), Vol. 31, Issue 4. December 2013.
[33] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon. In IEEE Micro Special Issue on Top Picks from the Computer Architecture Conferences. May/June 2014.
[34] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proc. of ASPLOS. Salt Lake City, 2014.
[35] Christina Delimitrou and Christos Kozyrakis. 2016. HCloud: Resource-Efficient Provisioning in Shared Cloud Systems. In Proceedings of the Twenty First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[36] Christina Delimitrou and Christos Kozyrakis. 2017. Bolt: I Know What You Did Last Summer... In The Cloud. In Proc. of the Twenty Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[37] Christina Delimitrou and Christos Kozyrakis. 2018. Amdahl's Law for Tail Latency. In Communications of the ACM (CACM).
[38] Christina Delimitrou, Daniel Sanchez, and Christos Kozyrakis. 2015. Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SOCC).
[39] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proceed-
[51] Ben Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. [n. d.]. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of NSDI. Boston, MA, 2011.
[52] Jingwei Huang, David M. Nicol, and Roy H. Campbell. [n. d.]. Denial-of-Service Threat to Hadoop/YARN Clusters with Multi-Tenancy. In Proc. of the IEEE International Congress on Big Data. Washington, DC, 2014.
[53] Alexandru Iosup, Nezih Yigitbasi, and Dick Epema. [n.
d.]. On the ings of the 42Nd Annual International Symposium on Computer Ar- Performance Variability of Production Cloud Services. In Proceedings chitecture (ISCA ’15). ACM, New York, NY, USA, 92–104. https: of CCGRID. Newport Beach, CA, 2011. //doi.org/10.1145/2749469.2750389 [54] Hiranya Jayathilaka, Chandra Krintz, and Rich Wolski. 2017. Perfor- [40] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, mance Monitoring and Root Cause Analysis for Cloud-hosted Web Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Applications. In Proceedings of WWW. International World Wide Web Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Conferences Steering Committee, Republic and Canton of Geneva, Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Switzerland, 469–478. Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, [55] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gau- Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth rav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Bo- Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug den, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. 2018. Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Azure Accelerated Networking: SmartNICs in the Public Cloud. In 15th Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, USENIX Symposium on Networked Systems Design and Implementation Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert (NSDI 18). USENIX Association, Renton, WA, 51–66. Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexan- [41] Brad Fitzpatrick. [n. d.]. Distributed caching with memcached. In der Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Linux Journal, Volume 2004, Issue 124, 2004. 
Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris [42] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adri- Ion Stoica. 2007. X-trace: A Pervasive Network Tracing Framework. In ana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Proceedings of the 4th USENIX Conference on Networked Systems Design Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omer- & Implementation (NSDI’07). USENIX Association, Berkeley, CA, USA, nick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, 20–20. Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew [43] Yu Gan and Christina Delimitrou. 2018. The Architectural Implications Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gre- of Cloud Microservices. In Computer Architecture Letters (CAL), vol.17, gory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, iss. 2. Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. [44] Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He, and In-Datacenter Performance Analysis of a Tensor Processing Unit. In Christina Delimitrou. 2018. Seer: Leveraging Big Data to Navigate the Proceedings of the 44th Annual International Symposium on Computer Complexity of Cloud Debugging. In Proceedings of the Tenth USENIX Architecture (ISCA ’17). ACM, New York, NY, USA, 1–12. Workshop on Hot Topics in Cloud Computing (HotCloud). [56] Harshad Kasture and Daniel Sanchez. 2014. Ubik: Efficient Cache [45] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Sharing with Strict QoS for Latency-Critical Workloads. In Proceed- Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon ings of the 19th international conference on Architectural Support for Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Programming Languages and Operating Systems (ASPLOS-XIX). 
Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, [57] Harshad Kasture and Daniel Sanchez. 2016. TailBench: A Benchmark Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Suite and Evaluation Methodology for Latency-Critical Applications. Delimitrou. 2019. An Open-Source Benchmark Suite for Microser- In Proc. of IISWC. vices and Their Hardware-Software Implications for Cloud and Edge [58] Yaakoub El Khamra, Hyunjoo Kim, Shantenu Jha, and Manish Parashar. Systems. In Proceedings of the Twenty Fourth International Conference [n. d.]. Exploring the performance fluctuations of hpc workloads on on Architectural Support for Programming Languages and Operating clouds. In Proceedings of CloudCom. Indianapolis, IN, 2010. Systems (ASPLOS). [59] Krzysztof C. Kiwiel. [n. d.]. Convergence and efficiency of subgradient [46] Yilong Geng, Shiyu Liu, Zi Yin, Ashish Naik, Balaji Prabhakar, Mendel methods for quasiconvex minimization. In Mathematical Programming Rosenblum, and Amin Vahdat. [n. d.]. Exploiting a Natural Network (Series A) (Berlin, Heidelberg: Springer) 90 (1): pp. 1-25, 2001. Effect for Scalable, Fine-grained Clock Synchronization. In Proc. of [60] Jacob Leverich and Christos Kozyrakis. [n. d.]. Reconciling High Server NSDI. 2018. Utilization and Sub-millisecond Quality-of-Service. In Proc. of EuroSys. [47] Daniel Gmach, Jerry Rolia, Ludmila Cherkasova, and Alfons Kemper. 2014. [n. d.]. Workload Analysis and Demand Prediction of Enterprise Data [61] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Center Applications. In Proceedings of IISWC. Boston, MA, 2007, 10. Christos Kozyrakis. [n. d.]. Towards Energy Proportionality for Large- https://doi.org/10.1109/IISWC.2007.4362193 scale Latency-critical Workloads. In Proceedings of the 41st Annual [48] Zhenhuan Gong, Xiaohui Gu, and John Wilkes. [n. d.]. PRESS: PRe- International Symposium on Computer Architecuture (ISCA). 
Minneapo- dictive Elastic ReSource Scaling for cloud systems. In Proceedings of lis, MN, 2014. CNSM. Niagara Falls, ON, 2010. [62] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ran- [49] Donald Gross, John F. Shortle, James M. Thompson, and Carl M. Harris. ganathan, and Christos Kozyrakis. [n. d.]. Heracles: Improving Re- [n. d.]. Fundamentals of Queueing Theory. In Wiley Series in Probability source Efficiency at Scale. In Proc. of the 42Nd Annual International and Statistics, Book 627. 2011. Symposium on Computer Architecture (ISCA). Portland, OR, 2015. [50] Sanchika Gupta and Padam Kumar. [n. d.]. VM Profile Based Optimized [63] Dave Mangot. [n. d.]. EC2 variability: The numbers re- Network Attack Pattern Detection Scheme for DDOS Attacks in Cloud. vealed. http://tech.mangot.com/roller/dave/entry/ec2_variability_the_ In Proc. of SSCC. Mysore, India, 2013. numbers_re%vealed. [64] Jason Mars and Lingjia Tang. [n. d.]. Whare-map: heterogeneity in [82] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and "homogeneous" warehouse-scale computers. In Proceedings of ISCA. John Wilkes. [n. d.]. Omega: flexible, scalable schedulers for large Tel-Aviv, Israel, 2013. compute clusters. In Proceedings of EuroSys. Prague, 2013. [65] Jason Mars, Lingjia Tang, and Robert Hundt. 2011. Heterogeneity in [83] Zhiming Shen, Sethuraman Subbiah, Xiaohui Gu, and John Wilkes. [n. “Homogeneous”; Warehouse-Scale Computers: A Performance Oppor- d.]. CloudScale: elastic resource scaling for multi-tenant cloud systems. tunity. IEEE Comput. Archit. Lett. 10, 2 (July 2011), 4. In Proceedings of SOCC. Cascais, Portugal, 2011. [66] Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, and Mary Lou [84] David Shue, Michael J. Freedman, and Anees Shaikh. [n. d.]. Perfor- Soffa. [n. d.]. Bubble-Up: increasing utilization in modern warehouse mance Isolation and Fairness for Multi-tenant Cloud Storage. In Proc. scale computers via sensible co-locations. In Proceedings of MICRO. 
of OSDI. Hollywood, CA, 2012. Porto Alegre, Brazil, 2011. [85] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat [67] David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf- Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chan- Dietrich Weber, and Thomas F. Wenisch. 2011. Power management dan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing of online data-intensive services. In Proceedings of the 38th annual Infrastructure. Technical Report. Google, Inc. https://research.google. international symposium on Computer architecture. 319–330. com/archive/papers/dapper-2010-1.pdf [68] Ripal Nathuji, Canturk Isci, and Eugene Gorbatov. [n. d.]. Exploiting [86] Lalith Suresh, Peter Bodik, Ishai Menache, Marco Canini, and Florin platform heterogeneity for power efficient data centers. In Proceedings Ciucu. [n. d.]. Distributed Resource Management Across Process of ICAC. Jacksonville, FL, 2007. Boundaries. In Proceedings of the ACM Symposium on Cloud Computing [69] Ripal Nathuji, Aman Kansal, and Alireza Ghaffarkhah. [n. d.]. Q- (SOCC). Santa Clara, CA, 2017. Clouds: Managing Performance Interference Effects for QoS-Aware [87] torque [n. d.]. Torque Resource Manager. http://www. Clouds. In Proceedings of EuroSys. Paris,France, 2010. adaptivecomputing.com/products/open-source/torque/. [70] Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Sethuraman Subbiah, and [88] Venkatanathan Varadarajan, Thomas Ristenpart, and Michael Swift. John Wilkes. [n. d.]. AGILE: Elastic Distributed Resource Scaling for [n. d.]. Scheduler-based Defenses against Cross-VM Side-channels. In Infrastructure-as-a-Service. In Proceedings of ICAC. San Jose, CA, 2013. Proc. of the 23rd Usenix Security Symposium. San Diego, CA, 2014. [71] Dejan Novakovic, Nedeljko Vasic, Stanko Novakovic, Dejan Kostic, [89] Venkatanathan Varadarajan, Yinqian Zhang, Thomas Ristenpart, and and Ricardo Bianchini. [n. d.]. DeepDive: Transparently Identifying Michael Swift. [n. d.]. 
A Placement Vulnerability Study in Multi-Tenant and Managing Performance Interference in Virtualized Environments. Public Clouds. In Proc. of the 24th USENIX Security Symposium (USENIX In Proceedings of ATC. San Jose, CA, 2013. Security). Washington, DC, 2015. [72] Simon Ostermann, Alexandru Iosup, Nezih Yigitbasi, Radu Prodan, [90] Nedeljko Vasić, Dejan Novaković, Svetozar Miučin, Dejan Kostić, and Thomas Fahringer, and Dick Epema. [n. d.]. A Performance Analysis Ricardo Bianchini. [n. d.]. DejaVu: accelerating resource allocation of EC2 Cloud Computing Services for Scientific Computing. In Lecture in virtualized environments. In Proceedings of the Seventeenth Interna- Notes on Cloud Computing. Volume 34, p.115-131, 2010. tional Conference on Architectural Support for Programming Languages [73] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. [n. and Operating Systems (ASPLOS). London, UK, 2012. d.]. Sparrow: Distributed, Low Latency Scheduling. In Proceedings of [91] Jianping Weng, Jessie Hui Wang, Jiahai Yang, and Yang Yang. [n. d.]. SOSP. Farminton, PA, 2013. Root cause analysis of anomalies of multitier services in public clouds. [74] Xue Ouyang, Peter Garraghan, Renyu Yang, Paul Townend, and Jie Xu. In Proceedings of the IEEE/ACM 25th International Symposium on Qual- [n. d.]. Reducing Late-Timing Failure at Scale: Straggler Root-Cause ity of Service (IWQoS). 2017. Analysis in Cloud Datacenters. In Proceedings of 46th Annual IEEE/IFIP [92] Ian H. Witten, Eibe Frank, and Geoffrey Holmes. [n. d.]. Data Mining: International Conference on Dependable Systems and Networks. 2016. Practical Machine Learning Tools and Techniques. 3rd Edition. [75] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, [93] Tianyin Xu, Xinxin Jin, Peng Huang, Yuanyuan Zhou, Shan Lu, Long Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fow- Jin, and Shankar Pasupathy. 2016. 
Early Detection of Configuration ers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Errors to Reduce Failure Damage. In Proceedings of the 12th USENIX Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Conference on Operating Systems Design and Implementation (OSDI’16). Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi USENIX Association, Berkeley, CA, USA, 619–634. Xiao, and Doug Burger. 2014. A Reconfigurable Fabric for Accelerat- [94] Yunjing Xu, Michael Bailey, Farnam Jahanian, Kaustubh Joshi, Matti ing Large-Scale Datacenter Services. In Proc. of the 41st Intl. Symp. on Hiltunen, and Richard Schlichting. [n. d.]. An Exploration of L2 Cache Computer Architecture. Covert Channels in Virtualized Environments. In Proc. of the 3rd ACM [76] Suhail Rehman and Majd Sakr. [n. d.]. Initial Findings for provisioning Workshop on Cloud Computing Security Workshop (CCSW). Chicago, variation in cloud computing. In Proceedings of CloudCom. Indianapolis, IL, 2011. IN, 2010. [95] Hailong Yang, Alex Breslow, Jason Mars, and Lingjia Tang. [n. d.]. [77] Charles Reiss, Alexey Tumanov, Gregory Ganger, Randy Katz, and Bubble-flux: precise online QoS management for increased utilization Michael Kozych. [n. d.]. Heterogeneity and Dynamicity of Clouds at in warehouse scale computers. In Proceedings of ISCA. 2013. Scale: Google Trace Analysis. In Proceedings of SOCC. 2012. [96] Minlan Yu, Albert Greenberg, Dave Maltz, Jennifer Rexford, Lihua [78] Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Yuan, Srikanth Kandula, and Changhoon Kim. [n. d.]. Profiling Net- Robert Hundt. 2010. Google-Wide Profiling: A Continuous Pro- work Performance for Multi-tier Data Center Applications. In Proceed- filing Infrastructure for Data Centers. IEEE Micro (2010), 65–79. ings of NSDI. Boston, MA, 2011, 14. 
http://www.computer.org/portal/web/csdl/doi/10.1109/MM.2010.68 [97] Xiao Yu, Pallavi Joshi, Jianwu Xu, Guoliang Jin, Hui Zhang, and Guofei [79] rightscale [n. d.]. Rightscale. https://aws.amazon.com/ Jiang. 2016. CloudSeer: Workflow Monitoring of Cloud Infrastructures solution-providers/isv/rightscale. via Interleaved Logs. In Proceedings of APSLOS. ACM, New York, NY, [80] S. Sarwar, A. Ankit, and K. Roy. [n. d.]. Incremental Learning in Deep USA, 489–502. Convolutional Neural Networks Using Partial Network Sharing. In [98] Yinqian Zhang and Michael K. Reiter. [n. d.]. Duppel: retrofitting arXiv preprint arXiv:1712.02719. commodity operating systems to mitigate cache side channels in the [81] Jörg Schad, Jens Dittrich, and Jorge-Arnulfo Quiané-Ruiz. 2010. Run- cloud. In Proc. of CCS. Berlin, Germany, 2013. time Measurements in the Cloud: Observing, Analyzing, and Reducing [99] Mark Zhao and G. Edward Suh. [n. d.]. FPGA-Based Remote Power Variance. Proceedings VLDB Endow. 3, 1-2 (Sept. 2010), 460–471. Side-Channel Attacks. In Proceedings of the IEEE Symposium on Security and Privacy. May 2018.