Gremlin: Systematic Resilience Testing of Microservices
Victor Heorhiadi (UNC Chapel Hill), Shriram Rajagopalan (IBM T. J. Watson Research), Hani Jamjoom (IBM T. J. Watson Research), Michael K. Reiter (UNC Chapel Hill), Vyas Sekar (Carnegie Mellon University)

Abstract—Modern Internet applications are being disaggregated into a microservice-based architecture, with services being updated and deployed hundreds of times a day. The accelerated software life cycle and heterogeneity of language runtimes in a single application necessitate a new approach for testing the resiliency of these applications in production infrastructures. We present Gremlin, a framework for systematically testing the failure-handling capabilities of microservices. Gremlin is based on the observation that microservices are loosely coupled and thus rely on standard message-exchange patterns over the network. Gremlin allows the operator to easily design tests and executes them by manipulating inter-service messages at the network layer. We show how to use Gremlin to express common failure scenarios and how developers of an enterprise application were able to discover previously unknown bugs in their failure-handling code without modifying the application.

Company | Downtime | Postmortem findings
Parse.ly, 2015 [25] | 13 hours | Cascading failure due to message bus overload
CircleCI, 2015 [19] | 17 hours | Cascading failure due to database overload
BBC, 2014 [18] | 48 hours | Cascading failure due to database overload
Spotify, 2013 [26] | Several hours | Cascading failure due to degradation of a core internal service
Twilio, 2013 [28] | 10 hours | Database failure caused billing service to repeatedly bill customers

TABLE 1: A subset of recent outages experienced by popular, highly available Internet services. Postmortem reports revealed missing or faulty failure-handling logic.

1. Introduction

Many modern Internet applications are moving toward a microservice architecture [15], to support rapid change and respond to user feedback in a matter of hours. In this architecture, the application is a collection of web services, each serving a single purpose, i.e., a microservice. Each microservice is developed, deployed and managed independently; new features and updates are delivered continuously [10], hundreds of times a day [20]–[22], making the applications extremely dynamic. Microservice applications are typically polyglot: developers write individual microservices in the programming language of their choice, and the communication between services happens using remote API calls.

As cloud-native applications, microservices are designed to withstand infrastructure failures and outages, yet struggle to remain available when deployed. Table 1 summarizes some recent failures experienced by microservice-based applications. The postmortem reports point to missing or faulty failure-recovery logic, indicating that unit and integration tests are insufficient to catch such bugs. The application deployment needs to be subjected to resiliency testing—testing the application's ability to recover from failures commonly encountered in the cloud. Specifically, we argue that resiliency testing needs to be systematic and feedback-driven: allow operators (developers or testers) to orchestrate a specific failure scenario and obtain quick feedback about how and why the application failed to recover as expected. This feedback makes systematic testing more valuable than randomized fault injection by better enabling developers to quickly locate and fix faulty failure-handling logic, redeploy, and test again.

Microservices and their development model, however, pose new challenges for systematic resiliency testing:
• Microservices' polyglot nature requires the testing tool to be agnostic to each service's language and runtime.
• Rapidly evolving code requires tests that are fast and focus on the failure-recovery logic, not business logic.

Existing works on resilience testing of general distributed systems and service-oriented architectures (SOAs) are unsuitable for microservice applications, since they do not address the challenges mentioned above (see Section 8 for details). In contrast, our work is designed to provide systematic, application-agnostic testing on live services. To achieve this goal, we make the following observations about microservice applications:
• Interactions between microservices happen solely over the network; and
• Microservices use standard application protocols (e.g., HTTP) and communication patterns (e.g., request-response, publish-subscribe).

Our key insight is that failures can be staged by manipulating the network interactions between microservices; the application's ability (or lack thereof) to recover from failures can be evaluated by observing the network interactions between microservices during the failure.
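To make this insight concrete, the following is a minimal sketch (in Python, the language Gremlin recipes use) of how a proxy interposed on the network path between two services could stage a failure by manipulating an intercepted request. It is not Gremlin's implementation; the rule format, field names, and service names are illustrative assumptions.

```python
import random
import time

# Illustrative fault rule: make the "ratings" service appear overloaded to
# its caller "reviews", without touching either service's code.
RULE = {"source": "reviews", "destination": "ratings",
        "abort_probability": 0.5, "abort_status": 503, "delay_s": 2.0}

def stage_failure(message, rule):
    """Manipulate an intercepted request according to a fault rule.

    Returns a synthetic error response to emulate an outage, or None to
    indicate the message should be forwarded (possibly after added delay).
    """
    if (message["source"], message["destination"]) != (rule["source"], rule["destination"]):
        return None                      # rule does not apply to this edge
    if random.random() < rule["abort_probability"]:
        return {"status": rule["abort_status"], "body": "fault injected"}
    time.sleep(rule["delay_s"])          # emulate a slow, overloaded dependency
    return None

# Example: one intercepted request on the reviews -> ratings edge.
print(stage_failure({"source": "reviews", "destination": "ratings"}, RULE))
```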
Based on this insight, we present Gremlin, a systematic resiliency testing framework for microservices. Gremlin's design is inspired by software-defined networks (SDN): the operator interacts with a centralized control plane, which in turn configures the data plane. The operator provides Gremlin with a recipe—Python-based code describing a high-level outage scenario, along with a set of assertions on how microservices should react during such an outage. The control plane translates the recipe into a fixed set of fault-injection rules to be applied to the network messages exchanged between microservices. Gremlin's data plane consists of network proxies that intercept, log, and manipulate messages exchanged between microservices. The control plane configures the network proxies for fault injection based on the rules generated for a recipe. After emulating the failure, the control plane analyzes the observation logs from the network proxies to validate the assertions specified in the recipe. Gremlin recipes can be executed and checked in a matter of seconds, thereby providing quick feedback to the operator. This low-latency feedback enables the operator to create correlated failure scenarios by conditionally chaining different types of failures and assertion checks.
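This section does not show a concrete recipe, so the following is a hypothetical sketch of the two halves of one: a high-level outage scenario with assertions, and the control-plane step of validating those assertions against the proxies' observation logs. All field names, check types, and the log format are assumptions for illustration, not Gremlin's actual interface.

```python
# Hypothetical recipe: emulate an overloaded "ratings" service and assert
# that its caller, "reviews", degrades gracefully during the outage.
recipe = {
    "scenario": {"type": "overload", "target": "ratings",
                 "symptoms": {"delay_ms": 8000, "abort_status": 503}},
    "assertions": [
        # The caller must give up on the slow dependency within its own budget...
        {"service": "reviews", "check": "bounded_response_time", "max_ms": 100},
        # ...and must still answer its clients, e.g., with a cached value.
        {"service": "reviews", "check": "http_status", "expected": 200},
    ],
}

def check_assertions(assertions, log):
    """Validate assertions against observation logs collected by the proxies.

    `log` is a list of per-request records such as
    {"service": "reviews", "latency_ms": 35, "status": 200}.
    Returns the assertions that failed.
    """
    failures = []
    for a in assertions:
        records = [r for r in log if r["service"] == a["service"]]
        if a["check"] == "bounded_response_time":
            ok = all(r["latency_ms"] <= a["max_ms"] for r in records)
        elif a["check"] == "http_status":
            ok = all(r["status"] == a["expected"] for r in records)
        else:
            ok = False  # unknown check type
        if not ok:
            failures.append(a)
    return failures

# Toy log gathered while the fault was active: the second request shows the
# caller hanging on the dependency and returning an error, failing both checks.
log = [{"service": "reviews", "latency_ms": 35, "status": 200},
       {"service": "reviews", "latency_ms": 8020, "status": 500}]
print(check_assertions(recipe["assertions"], log))
```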
Our case study shows that Gremlin requires a minimal learning curve: developers at IBM found unhandled corner-case failure scenarios in a production enterprise application, without modifying the application code. Furthermore, Gremlin correctly identified the lack of failure handling in an actively used library designed specifically for abstracting away failure-handling code. Controlled experiments indicate that Gremlin is fast, introduces low overhead, and is suitable for resiliency testing of operational microservice applications.

Fault-injection ideas from Gremlin and their software-defined architecture have been integrated into IBM's cloud offerings for microservice applications. Integration of other parts, such as validation of assertions, is planned for the future. The source code for Gremlin is publicly available on GitHub at https://github.com/ResilienceTesting.

In summary, our contributions are as follows:
• A systematic resiliency-testing framework for creating and executing recipes that capture a rich variety of high-level failure scenarios and assertions in microservice-based applications.
• A framework that can be integrated easily into production or production-like environments (e.g., shadow deployments) without modifications to application code.
• Recipes capable of tolerating the rapid evolution of microservice code by taking advantage of the standardized interaction patterns between services.

Figure 1: Typical architecture of a microservice-based application. The application leverages services provided by the hosting cloud platform (e.g., managed databases, message queues, data analytics), and integrates with Internet services, such as social networking, mobile backends, geolocation, etc.

2. Background and Motivation

Large-scale Internet applications such as Netflix, Facebook, Amazon store, etc., have demonstrated that in order to achieve scalability, robustness and agility, it is beneficial to [...] of other microservices as long as the APIs they expose are backward compatible. To achieve loose coupling, microservices use standard application protocols such as HTTP to facilitate easy integration with other microservices. Organizationally, each microservice is owned and operated by an independent team of developers. The ability to immediately integrate updates into the production deployment [5] has led to a continuous software delivery model [10], where development teams incrementally deliver features while incorporating user feedback.

2.1. Designing for Failure: Current Best Practices

To remain available in the face of infrastructure outages in the cloud, a microservice must guard itself from failures of its dependencies. Best design practices advocate incorporating resiliency design patterns such as timeouts, bounded retries, circuit breakers, and bulkheads [9], [16].
• Timeouts ensure that an API call to a microservice completes in bounded time, to maintain responsiveness and release resources associated with the API call in a timely fashion.
• Bounded retries handle transient failures in the system, by retrying the API calls with the expectation that the fault is temporary. The API calls are retried a bounded number of times.
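As a concrete illustration of the first two patterns (this is not code from the paper), here is a minimal Python sketch of a client that wraps a dependency call with a per-attempt timeout and bounded retries, using the requests library; the endpoint URL and parameter values are illustrative assumptions.

```python
import time
import requests

RATINGS_URL = "http://ratings.local/api/v1/ratings/42"  # illustrative endpoint

def get_ratings(url, timeout_s=0.1, max_retries=3, backoff_s=0.05):
    """Call a dependency with a per-attempt timeout and a bounded number of retries."""
    for attempt in range(1, max_retries + 1):
        try:
            # Timeout: bound how long this API call may take, so the caller
            # stays responsive and releases resources in a timely fashion.
            resp = requests.get(url, timeout=timeout_s)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            # Bounded retries: assume the failure may be transient, but give up
            # after max_retries attempts instead of retrying forever.
            if attempt == max_retries:
                return None  # fall back, e.g., to a cached or default value
            time.sleep(backoff_s * attempt)

print(get_ratings(RATINGS_URL))
```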