Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services
Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and Yee Jiun Song, Facebook Inc.

In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), November 2–4, 2016, Savannah, GA, USA.
https://www.usenix.org/conference/osdi16/technical-sessions/presentation/veeraraghavan

Abstract

Modern web services such as Facebook are made up of hundreds of systems running in geographically-distributed data centers. Each system needs to be allocated capacity, configured, and tuned to use data center resources efficiently. Keeping a model of capacity allocation current is challenging given that user behavior and software components evolve constantly.

Three insights motivate our work: (1) the live user traffic accessing a web service provides the most current target workload possible, (2) we can empirically test the system to identify its scalability limits, and (3) the user impact and operational overhead of empirical testing can be largely eliminated by building automation which adjusts live traffic based on feedback.

We build on these insights in Kraken, a new system that runs load tests by continually shifting live user traffic to one or more data centers. Kraken enables empirical testing by monitoring user experience (e.g., latency) and system health (e.g., error rate) in a feedback loop between traffic shifts. We analyze the behavior of individual systems and groups of systems to identify resource utilization bottlenecks such as capacity, load balancing, software regressions, performance tuning, and so on, which can be iteratively fixed and verified in subsequent load tests. Kraken, which manages the traffic generated by 1.7 billion users, has been in production at Facebook for three years and has allowed us to improve our hardware utilization by over 20%.

1 Introduction

Modern web services comprise software systems running in multiple data centers that cater to a global user base. At this scale, it is important to use all available data center resources as efficiently as possible. Effective resource utilization is challenging because:

• Evolving workload: The workload of a web service is constantly changing as its user base grows and new products are launched. Further, individual software systems might be updated several times a day [35] or even continually [27]. While modeling tools [20, 24, 46, 51] can estimate the initial capacity needs of a system, an evolving workload can quickly render models obsolete.

• Infrastructure heterogeneity: Constructing data centers at different points in time leads to a variety of networking topologies, different generations of hardware, and other physical constraints in each location that affect how systems scale.

• Changing bottlenecks: Each data center runs hundreds of software systems with complex interactions that exhibit resource utilization bottlenecks, including performance regressions, load imbalance, and resource exhaustion, at a variety of scales from single servers to entire data centers. The sheer size of the system makes it impossible to understand all the components. In addition, these systems change over time, leading to different bottlenecks presenting themselves. Thus, we need a way to continually identify and fix bottlenecks to ensure that the systems scale efficiently.

Our key insight is that the live user traffic accessing a web service provides the most current workload possible, with natural phenomena like non-uniform request arrival rates. As a baseline, the web service must be capable of executing its workload within a preset latency and error threshold. Ideally, the web service should also be capable of handling peak load (e.g., during the Super Bowl), when unexpected bottlenecks arise, and still achieve good performance.

We propose Kraken, a new system that allows us to run live traffic load tests to accurately assess the capacity of a complex system. In building Kraken, we found that reliably tracking system health is the most important requirement for live traffic testing. But how do we select among thousands of candidate metrics? We aim to provide a good experience to all users by ensuring that a user served out of a data center undergoing a Kraken test has a comparable experience to users served out of any other data center. We accomplish this goal with a lightweight and configurable monitoring component seeded with two topline metrics, the web server's 99th percentile response time and HTTP fatal error rate, as reliable proxies for user experience.
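To make the feedback loop concrete, the following sketch shifts traffic toward a test target in small increments and checks the two topline metrics between shifts, backing off as soon as either threshold is crossed. It is a minimal illustration under stated assumptions, not Kraken's implementation: the threshold values are invented, and the traffic-shifting and metric-reading hooks are hypothetical callables supplied by the caller.

```python
# A minimal sketch of a feedback-driven load test; thresholds and hooks
# are illustrative assumptions, not Facebook's actual settings or APIs.
import time
from typing import Callable

P99_LATENCY_LIMIT_MS = 500.0    # assumed p99 response-time threshold
FATAL_ERROR_RATE_LIMIT = 0.001  # assumed HTTP fatal error rate threshold

def load_test(set_traffic_fraction: Callable[[float], None],
              read_p99_latency_ms: Callable[[], float],
              read_fatal_error_rate: Callable[[], float],
              goal: float, step: float = 0.05, settle_secs: int = 300) -> bool:
    """Shift live traffic toward the target in small steps, checking the
    topline metrics between shifts and backing off on any violation."""
    fraction = 0.0
    while fraction < goal:
        proposed = min(goal, fraction + step)
        set_traffic_fraction(proposed)      # reweight traffic toward the target
        time.sleep(settle_secs)             # let the shift take effect
        if (read_p99_latency_ms() > P99_LATENCY_LIMIT_MS or
                read_fatal_error_rate() > FATAL_ERROR_RATE_LIMIT):
            set_traffic_fraction(fraction)  # health degraded: revert the step
            return False                    # analyze test data for bottlenecks
        fraction = proposed
    return True                             # hit the goal within both thresholds
```

Because the loop reacts to measured health rather than a fixed schedule, the same logic can drive tests against targets of very different sizes.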
We leverage Kraken as part of an iterative methodology to improve capacity utilization. We begin by running a load test that directs user traffic at a target cluster or region. A successful test concludes by hitting the utilization targets without crossing the latency or error rate thresholds. Tests can fail in two ways:

• We fail to hit the target utilization or we exceed a preset threshold. This is the usual outcome, after which we drill into the test data to identify bottlenecks.

• Rarely, the load test results in an HTTP error spike or some similar unexpected failure. When this occurs, we analyze the test data to understand what monitoring or system understanding we were missing. In practice, sudden jumps in HTTP errors or other topline metrics almost never happen; the auxiliary metrics we add to our monitoring are few in number and result from small incidents rather than catastrophic failures.

Each test provides a probe into an initially uncharacterized system, allowing us to learn new things. An unsuccessful test provides data that allows us to either make the next test safer to run or increase capacity by removing a bottleneck. As tests are non-disruptive to users, we can run them regularly to determine a data center's maximal load (e.g., requests per second, which we term the system's capacity) and to continually identify and fix bottlenecks to improve system utilization.

Our first year of operating Kraken in production exposed several challenges. The first was that systems often exhibited a non-linear response, where a small traffic shift directed at a data center could trigger an error spike. Our follow-up was to check the health of all the major systems involved in serving user traffic to determine which ones were affected most during the test. This dovetailed into our second challenge: systems have complex dependencies, so it was often unclear which system failed first and then triggered errors in seemingly unrelated downstream subsystems.

We addressed both of these challenges by encouraging subsystem developers to identify system-specific counters for performance (e.g., response quality), error rate, and latency that could be monitored during a test. We focused on these three metrics because they represent the contracts that clients of a service rely on: we have found that nearly every production system wishes to maintain or decrease its latency and error rate while maintaining or improving performance.
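To illustrate what such per-system contracts might look like, the sketch below records latency, error rate, and performance (response quality) thresholds for each system and reports which contracts a set of observed metrics violates. The system names, metric shapes, and numbers are hypothetical, not Facebook's actual counters.

```python
# A sketch of per-system health contracts; systems, fields, and values
# are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class HealthContract:
    system: str
    latency_p99_ms: float        # maximum acceptable 99th percentile latency
    error_rate: float            # maximum acceptable error rate
    min_response_quality: float  # minimum acceptable performance proxy

CONTRACTS = [
    HealthContract("web", latency_p99_ms=500.0, error_rate=0.001,
                   min_response_quality=0.95),
    HealthContract("cache", latency_p99_ms=5.0, error_rate=0.0005,
                   min_response_quality=0.99),
]

def violations(metrics: dict) -> list:
    """Return the contracts violated by observed metrics, where `metrics`
    maps a system name to (p99 latency ms, error rate, response quality)."""
    bad = []
    for c in CONTRACTS:
        latency, errors, quality = metrics[c.system]
        if (latency > c.latency_p99_ms or errors > c.error_rate
                or quality < c.min_response_quality):
            bad.append(c)
    return bad

# Example: the web tier's latency exceeds its contract.
observed = {"web": (620.0, 0.0004, 0.97), "cache": (4.2, 0.0001, 0.995)}
print([c.system for c in violations(observed)])  # -> ['web']
```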
We communicated test schedules widely to minimize surprise. We encouraged engineers to share tips on how to better monitor their systems, handle different failure scenarios, and build out levers to mitigate production issues. Further, we used a shared IRC channel to track tests and debug issues live. Often, we would use tests as an opportunity for developers to test or validate improvements, which made them an active part of the testing process. Engaging engineers was critical for success; engaged engineers improve their systems quickly and support a frequent test cadence, allowing us to iterate quickly.

Our initial remediations to resolve bottlenecks were mostly capacity allocations to the failing system. As our understanding of Facebook's systems has improved, our mitigation toolset has expanded to include configuration and performance tuning, load balancing improvements, profiling-guided software changes, and occasionally a system redesign. Our monitoring has also improved: as of October 2016, we have identified metrics from 23 critical systems that are bellwethers of non-linear behavior in our infrastructure. We use these metrics as inputs to control Kraken's behavior.

Kraken has been in use at Facebook for over three years and has run thousands of load tests on production systems using live user traffic. Our contributions:

• Kraken is, to our knowledge, the first live traffic load testing framework to test systems ranging in size from individual servers to entire data centers.

• Kraken has allowed us to iteratively increase the capacity of Facebook's infrastructure with an empirical approach that improves the utilization of systems at every level of the stack.

• Kraken has allowed us to identify and remediate regressions, load imbalance, and resource exhaustion across Facebook's fleet. Our initial tests stopped at about 70% of theoretical capacity, but tests now routinely exceed 90%, providing a 20% increase in request serving capacity.

The Kraken methodology is not applicable to all services. The following assumptions and caveats underpin our work:

• Assumption: stateless servers. We assume that stateless servers handle requests without using sticky sessions for server affinity. Stateless web servers provide high availability despite system or network failures by routing requests to any available machine, as the sketch after this list illustrates.
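As a toy illustration of why statelessness matters, the sketch below steers requests purely by per-destination routing weights; with no session affinity, increasing one weight is all that is needed to shift more live traffic toward a cluster under test. The cluster names and weights are hypothetical.

```python
# A toy sketch of weight-based request steering under the stateless-server
# assumption; cluster names and weights are hypothetical.
import random
from typing import Dict

def pick_destination(weights: Dict[str, float]) -> str:
    """Choose a destination cluster in proportion to its routing weight."""
    dests = list(weights)
    return random.choices(dests, weights=[weights[d] for d in dests], k=1)[0]

# Example: temporarily weighting the cluster under test more heavily.
routing_weights = {"cluster-under-test": 3.0, "cluster-a": 1.0, "cluster-b": 1.0}
print(pick_destination(routing_weights))
```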