Institute of Technology Reliable Software Systems

University of Stuttgart Universitätsstraße 38 D–70569 Stuttgart

Bachelorarbeit

Assessing Resilience of Software Systems by Application of Chaos Engineering – A Case Study

Dominik Kesim

Course of Study: Softwaretechnik

Examiner: Dr.-Ing. André van Hoorn

Supervisor: Matthias Häussler (Novatec), Christian Schwörer (Novatec), Dr.-Ing. André van Hoorn.

Commenced: March 1, 2019

Completed: September 1, 2019

CR-Classification: I.7.2

Abstract

Modern distributed systems more and more frequently adopt the microservice architectural style to design cloud-based systems. However, the prevalence of microservice architectures and container orchestration technologies such as Kubernetes increases the complexity of assessing the resilience of such systems. Resilience benchmarking is a means to assess the resilience of software systems by leveraging random fault injections on the application and network level. However, current approaches to resilience benchmarking have become inefficient. Chaos engineering is a new, still evolving discipline that forces a change in the perspective of how systems are developed with respect to their resilience. The key idea is to apply empirical experimentation in order to learn how a system behaves under turbulent conditions by intentionally injecting failures. Fixing the errors found by this approach, in addition to repeating the same experiments, allows the system to build up an immunity against failures before they occur in production. In the scope of an industrial case study, this work provides means to identify risks and hazards by applying three hazard analysis methods known from engineering safety-critical systems to the domain of chaos engineering, namely i) Fault Tree Analysis as a top-down approach to identify root causes, ii) Failure Mode and Effects Analysis as a component-based inspection of different failure modes, iii) and Computational Hazard and Operations as a means to analyze the system's communication paths. A dedicated number of the identified hazards are then implemented as chaos engineering experiments in Chaostoolkit in order to be injected on the application-platform level, i.e., Kubernetes. In total, four experiments have been derived from the findings of the hazard analysis, of which three experiments have been executed and analyzed by applying non-parametric statistical tests to the observations. This work provides a generic approach to assessing the resilience of a distributed system in the context of chaos engineering, illustrated by an industrial case study.

Keywords – Resilience, Chaos Engineering, Chaostoolkit, Hazard Analysis, Industrial Case Study, Kubernetes


Kurzfassung

Modern distributed applications increasingly adopt the microservice architectural style to design cloud-based applications. However, the wide adoption of microservice architectures and container orchestration technologies, such as Kubernetes, increases the complexity of assessing and improving the resilience of distributed applications. Resilience benchmarking is a method to assess the resilience of applications by means of random fault injections on the application and network level. However, current resilience benchmarking approaches are inefficient. Chaos engineering is a new, evolving discipline that forces a rethinking of how applications are developed with respect to their resilience. The idea is to learn more about the behavior of a system by running empirical experiments with targeted and controlled fault injections. Problems found and fixed through this approach, together with repetitions of the experiments, allow the system to build up an immunity against the discovered failures. In the scope of an industrial case study, this work provides techniques to identify risks and hazards in a chaos engineering context. Three hazard analysis techniques are presented, namely i) Fault Tree Analysis as a top-down approach to identify root causes, ii) Failure Mode and Effects Analysis for a component-based analysis of different failure modes, iii) and Computational Hazard and Operations as a means to analyze the system's communication paths. The identified hazards are subsequently defined as chaos engineering experiments with the help of Chaostoolkit and executed on the platform level (Kubernetes). In total, four experiments were derived and designed from the hazard analysis, of which three experiments were executed and analyzed by means of statistical tests. This work provides a generic approach to improving the resilience of distributed applications in the context of chaos engineering, illustrated by an industrial case study.

Keywords – Resilience, Chaos Engineering, Chaostoolkit, Hazard Analysis, Industrial Case Study, Kubernetes


Contents

1. Introduction

2. Foundations
   2.1. Architecture and System Design
   2.2. Hazard Analysis
   2.3. Research Methods
   2.4. Technologies
   2.5. Software Resilience and Dependability
   2.6. Chaos Engineering
   2.7. State of the Art Tools and Methods

3. Related Work
   3.1. Search Strategy
   3.2. Chaos Engineering by Basiri et al.
   3.3. Professional Conference Contributions related to Chaos Engineering
   3.4. Case Studies
   3.5. Summary of Findings

4. Case Study Design
   4.1. Objective
   4.2. The Case
   4.3. Preparation of Data Collection
   4.4. Data Collection Methodology
   4.5. Data Evaluation Methodology
   4.6. Interview Structure and Question Design

5. Cloud Service for Vehicle Configuration Files
   5.1. Findings from Interviews
   5.2. Domain Context and System Scope
   5.3. Architecture of the Legacy Application
   5.4. Prototype of a Microservice based Architecture

6. Hazard Analysis
   6.1. Failure Modes and Effects Analysis
   6.2. Fault Tree Analysis
   6.3. Computational Hazard Analysis and Operability
   6.4. Summary of Hazard Analysis

7. Experiment Environment and Tool Configuration
   7.1. Cloud Environment
   7.2. Azure Kubernetes Service – Deployment
   7.3. Simulating and Monitoring Environmental Conditions

8. Chaos Experiments Design, Execution, and Analysis
   8.1. Steady State Description
   8.2. Experiment Description
   8.3. Hypothesis Backlog and Experiment Execution
   8.4. Discussion
   8.5. Threats to Validity

9. Conclusion
   9.1. Summary
   9.2. Discussion
   9.3. Future Work

Appendices

A. Interview Resources
   A.1. Question Catalogue
   A.2. Consent Form

B. Incident Analysis Results
   B.1. Failure Modes and Effects Analysis
   B.2. Fault Tree Analysis
   B.3. Computational Hazard and Operations Study

Bibliography

Online References

List of Figures

2.1. Intermediate fault event
2.2. Basic fault event
2.3. Input and output events
2.4. Logic gates
2.5. Organizational structure of a Kubernetes cluster [Fou]
2.6. Components of dependability [ALR+01]

5.1. Investigated system embedded in its domain context
5.2. Use Case diagram of the prototype and legacy system
5.3. Monolithic implementation of the case study application
5.4. Component diagram of the prototype system
5.5. Sequence diagram of the getAll request
5.6. Sequence diagram of the create request
5.7. Sequence diagram of the delete request

6.1. Fault tree with top-level hazard: "User does not receive configuration identifiers"
6.2. Left sub tree of Figure 6.1
6.3. Right sub tree of Figure 6.1

7.1. Experiment environment
7.2. JMeter test plan
7.3. JMeter user thread group HTTP request configuration

8.1. Response times for warmup, steady-state, injection and recovery phase with respect to the GET_CONFIG_ID operation
8.2. Steady-state phase of the first experiment
8.3. Response times of GET_CONFIG_ID operation after first experiment
8.4. Response times of GET_CONFIG_DATA operation after first experiment
8.5. Empirical Cumulative Distribution Function (ECDF) of GET_CONFIG_ID operation after experiment no. 1
8.6. Empirical Cumulative Distribution Function (ECDF) of GET_CONFIG_DATA operation after experiment no. 1
8.7. Response times of GET_CONFIG_ID after second experiment
8.8. Response times of GET_CONFIG_DATA after second experiment
8.9. Empirical Cumulative Distribution Function (ECDF) of GET_CONFIG_ID operation after experiment no. 2
8.10. Empirical Cumulative Distribution Function (ECDF) of GET_CONFIG_DATA operation after experiment no. 2
8.11. Response times of GET_CONFIG_ID after third experiment
8.12. Response times of GET_CONFIG_DATA after third experiment
8.13. Empirical Cumulative Distribution Function (ECDF) of GET_CONFIG_ID operation after experiment no. 3
8.14. Empirical Cumulative Distribution Function (ECDF) of GET_CONFIG_DATA operation after experiment no. 3

A.1. Question Catalogue page 1/3
A.2. Question Catalogue page 2/3
A.3. Question Catalogue page 3/3
A.4. Consent Form page 1/3
A.5. Consent Form page 2/3
A.6. Consent Form page 3/3

B.1. Failure Modes and Effects Analysis of the Ingress component 1/3
B.2. Failure Modes and Effects Analysis Results of the Ingress component 2/3
B.3. Failure Modes and Effects Analysis Results of the Ingress component 3/3
B.4. Failure Modes and Effects Analysis Results of the Registry component 1/2
B.5. Failure Modes and Effects Analysis Results of the Registry component 2/2
B.6. Failure Modes and Effects Analysis Results of the data management component
B.7. Failure Modes and Effects Analysis Results of the Admin service 1/2
B.8. Failure Modes and Effects Analysis Results of the Admin service 2/2
B.9. Failure Modes and Effects Analysis Results of the User service
B.10. Fault Tree Analysis Results: User does not receive configuration identifiers 1/3
B.11. Fault Tree Analysis Results: User does not receive configuration identifiers 2/3
B.12. Fault Tree Analysis Results: User does not receive configuration identifiers 3/3
B.13. Fault Tree Analysis Results: Administrator cannot delete configuration data objects
B.14. Fault Tree Analysis Results: Administrator cannot create configuration data objects
B.15. Fault Tree Analysis Results: Administrator cannot load/unload configuration data objects
B.16. Computational Hazard and Operations Results: Ingress-Registry
B.17. Computational Hazard and Operations Results: Registry-Master 1/2
B.18. Computational Hazard and Operations Results: Registry-Master 2/2
B.19. Computational Hazard and Operations Results: Master-Data management 1/2
B.20. Computational Hazard and Operations Results: Master-Data management 2/2
B.21. Computational Hazard and Operations Results: Master-Config MS
B.22. Computational Hazard and Operations Results: Ingress-Adminservice
B.23. Computational Hazard and Operations Results: Ingress-Userservice


List of Tables

2.1. Example FMEA for wind turbines [AOT10]
2.2. Example of the use of guidewords in case of a train telephone system [WJG01]
2.3. Examples of parametric and non-parametric hypothesis tests
2.4. Chaos experiment objects declared by ChaosToolkit
2.5. Chaos Tools used by Netflix, also known as the Simian Army [IT11]

4.1. Data sets directly related to the investigated system
4.2. Data sets not directly related to the investigated system

5.1. HTTP requests of the legacy and prototype system

6.1. Result of a FMEA analysis for the Kubernetes Ingress component
6.2. Description of basic events
6.3. Mapping of basic events to their occurrence in the minimal cut set
6.4. Guidewords related to CHAZOP study
6.5. Result of a CHAZOP study of the interconnection between the Ingress and Registry

7.1. User thread group related property functions

8.1. Chaos experiment template
8.2. Experiment details of experiment no. 1
8.3. Standard statistics with respect to the observed phases of GET_CONFIG_ID operation, illustrated in Figure 8.3
8.4. Standard statistics with respect to the observed phases of GET_CONFIG_DATA operation, illustrated in Figure 8.4
8.5. Experiment details of experiment no. 2
8.6. Standard statistics with respect to the observed phases of GET_CONFIG_ID operation, illustrated in Figure 8.7
8.7. Standard statistics with respect to the observed phases of GET_CONFIG_DATA operation, illustrated in Figure 8.8
8.8. Experiment details of experiment no. 3
8.9. Standard statistics with respect to the observed phases of GET_CONFIG_ID operation, illustrated in Figure 8.11
8.10. Standard statistics with respect to the observed phases of GET_CONFIG_DATA operation, illustrated in Figure 8.12
8.11. Experiment details of experiment no. 4

List of Acronyms

AKS Azure Kubernetes Service
AWS Amazon Web Services
CHAZOP Computational Hazard Analysis and Operability
CLI Command Line Interface
ECDF Empirical Cumulative Distribution Function
FMEA Failure Modes and Effects Analysis
FTA Fault Tree Analysis
GUI Graphical User Interface
HAZOP Hazard Analysis and Operability
MTTF Mean Time to Failure
OS Operating System
URI Uniform Resource Identifier
URL Uniform Resource Locator
VM Virtual Machine


List of Listings

2.1. JSON based chaos experiment

6.1. HTTP POST request issued by a non-administrative user
6.2. Mutation of HTTP POST of Listing 6.1

7.1. Provisioning deployments for AKS cluster

8.1. Steady-state-hypothesis declaration according to Table 8.2
8.2. Method and rollback declaration according to Table 8.2
8.3. Kolmogorov-Smirnov test results related to experiment no. 1 with respect to steady-state and injection phase of Figure 8.3
8.4. Kolmogorov-Smirnov test results related to experiment no. 1 with respect to steady-state and injection phase of Figure 8.4
8.5. Kolmogorov-Smirnov test results related to experiment no. 1 with respect to recovery phase of Figure 8.3
8.6. Kolmogorov-Smirnov test results related to experiment no. 1 with respect to recovery phase of Figure 8.4
8.7. Kolmogorov-Smirnov test results related to experiment no. 2 with respect to steady-state and injection phase of Figure 8.7
8.8. Kolmogorov-Smirnov test results related to experiment no. 2 with respect to steady-state and injection phase of Figure 8.8
8.9. Kolmogorov-Smirnov test results related to experiment no. 2 with respect to the recovery phase of Figure 8.7
8.10. Kolmogorov-Smirnov test results related to experiment no. 2 with respect to the recovery phase of Figure 8.8
8.11. Kolmogorov-Smirnov test results related to experiment no. 3 with respect to the steady-state and injection phase of Figure 8.11
8.12. Kolmogorov-Smirnov test results related to experiment no. 3 with respect to the steady-state and injection phase of Figure 8.12
8.13. Kolmogorov-Smirnov test results related to experiment no. 3 with respect to the recovery phase of Figure 8.11
8.14. Kolmogorov-Smirnov test results related to experiment no. 3 with respect to the recovery phase of Figure 8.12
8.15. Potential method declaration of experiment no. 4

Chapter 1

Introduction

Improving the resilience of applications has become increasingly difficult due to the prevalence of microservice architectures. Resilience engineering, or chaos engineering, has been practiced by Netflix for the last eight to nine years based on their change of infrastructure [BBR+16]. In the last couple of years, chaos engineering has significantly increased in popularity, which becomes evident by the increase of conference contributions and the development of new platforms that leverage the principles of chaos [Greb]. Chaos engineering is expected to change the mindset of how systems should be developed with respect to their resilience. Resilience describes a system's capability to recover from any kind of alteration, particularly failures [VWAM12]. Although the term resilience seems to imply new techniques, it relates to already known engineering principles. Former approaches related to resilience are known as fault tolerance, fault avoidance, and fault detection and are means to increase the dependability of a system. Dependability describes the interaction of multiple quality attributes such as availability, reliability, and safety. Thus, a system is called dependable if it is available whenever it is required, and it operates safely and reliably without any side effects [ORe17]. However, these dependability-increasing means fail to adapt to continuous changes of a system. Software systems behave differently based on many criteria such as environment, technologies, developer skill, and so on. Dependability-increasing means are driven by a developer's capability of predicting different types of failures that may or may not happen. These methods yield high value in static environments, which does not hold for any application running in a production environment. Software applications continuously experience changes, especially in an agile environment, which makes it difficult to predict failures. This leads to the assumption that the formerly stated techniques are insufficient to assess and increase the resilience of modern software applications.


Chaos engineering provides the possibility to address this issue by applying controlled and thoughtful experiments. Chaos engineering experiments are driven by the idea to intentionally inject failures into a system in a controlled manner in order to get a better understanding of the system's behavior [BBR+16]. The intentional injection of failures is reflected in the idea of performing fire drills. Firefighters continuously exercise their skills by exposing themselves to real-life situations within a controlled environment in order to be prepared for a real fire. Stated differently, chaos engineering can be seen as a vaccine shot for a software system. By injecting harmful failures into a system, the engineers have the ability to perform problem-solving tasks before failures become critical to the user, thus building up a resistance against certain types of failures. However, a major challenge is to decide which failures to inject into a system. Senior practitioners such as Netflix benefit from a rich pool of knowledge and a built-up sensitivity for potential failures occurring in production. Beginners will struggle to benefit from an equal pool of knowledge, yet existing solutions cannot be adopted easily since they are tailored to specific needs and systems. In the context of an industrial case study, we will evaluate the suitability of hazard analysis techniques, known from engineering safety-critical systems, as a means to identify potential risks on the architecture level, based on an industrial case study application. After the risks and hazards have been identified, the findings are implemented as chaos engineering experiments in Chaostoolkit [Chaa]. The chaos engineering experiments are then run against the case study application with the intention to surface vulnerabilities. At the end, the results are statistically analyzed by applying non-parametric hypothesis tests in order to observe potential deviations from the system's steady state. In short, the objective of this case study can be stated as follows.

Analyze the planning and design of a Chaos Engineering experiment for the purpose of providing a systematic approach to Chaos Engineering with respect to an industrial case study application from the point of view of a Chaos Engineer in the context of a bachelor's thesis.

Summary of Method and Findings

In the following we will summarize the research method and findings of this work.

Method

• A case study plan has been elaborated following the principles by Wohlin et al. [WRH+12]. The case study plan comprises an objective, a description of the investigated case, a data preparation and collection plan, and a description of the data analysis. • The information gathered in the scope of this case study has been collected by applying semi-structured interviews, as elaborated in the case study plan. The interview material is provided as supplementary material. In total, three interviews have been conducted, where each interview took approximately one and a half hours. The interviews were designed following the principles by Wohlin et al. [WRH+12] and Runeson et al. [RH09]. • The interview findings were used to provide a clear interpretation and description of the investigated system's domain context. Further descriptions of the system include a detailed architecture description comprising component and sequence diagrams. The application's requirements were rephrased as user stories, although they were stated neutrally by the interviewees. • Following the decision matrix for hypothesis tests provided by Field et al. [FH02], the appropriate hypothesis test would have been the Wilcoxon test, which is described in Section 7.3 of [FH02]. However, as suggested by Field et al., we decided to use the Kolmogorov-Smirnov test instead, as described in Chapter 6 of [FH02].

Findings

• Based on the architecture description, the system has been analyzed by applying three distinct hazard analysis methods, namely Failure Modes and Effects Analysis, Computational Hazard and Operations Study, and Fault Tree Analysis. Due to the lack of quantitative information, the probabilistic estimation of events has been left out. The analysis revealed the existence of multiple single points of failure, implying that the system's architecture is rather fragile. • A dedicated selection of the hazard analysis results was chosen, based on technical feasibility, to be implemented as chaos engineering experiments in Chaostoolkit. In total, four experiments have been designed, yet only three have been executed. The fourth experiment turned out to be technically infeasible but has been designed nonetheless. Therefore, the fourth experiment comprises only an experiment


description and possible future approaches to conduct the experiment. The steady-state description as well as the experiment results are a set of response times measured by JMeter. During the experiments, the response times were expected to change or not to change, depending on the steady-state hypothesis. The response times were measured for two interdependent HTTP requests that represent the business logic of the investigated system. • The response times were statistically analyzed by applying the non-parametric Kolmogorov-Smirnov test. The results are then discussed in the last chapter.

Supplementary materials are provided under the following link: https://doi.org/10.5281/zenodo.3374629

Thesis Structure

This thesis is structured as follows.

Chapter 2 – Foundations: Covers the fundamentals required to understand the further content of this work.

Chapter 3 – Related Work: Comprises related works in the area of chaos engineering.

Chapter 4 – Case Study Design: Provides the case study plan that has been used to achieve the goal of this thesis.

Chapter 5 – Cloud Service for Vehicle Configuration Files: Covers an explanation of the domain context and a detailed architecture description comprising structural and behavioral diagrams with respect to the investigated system.

Chapter 6 – Hazard Analysis: Covers a thorough analysis of the system by applying hazard analysis techniques, i.e., Failure Modes and Effects Analysis, Computational Hazard and Operations Study, and Fault Tree Analysis.

Chapter 7 – Experiment Environment and Tool Configuration: Describes the experiment environment in which the chaos engineering experiments will be executed.

Chapter 8 – Chaos Experiments Design, Execution, and Analysis: Comprises the implementation of identified risks and hazards as chaos engineering experiments as well as the execution and analysis of the experiments.

Chapter 9 – Conclusion: Summarizes this work by discussing the findings and providing an outlook on future topics.

Chapter 2

Foundations

The content of this work combines methods of different software engineering domains, such as hazard analysis, and methodologies of case study research. In order to comprehend the methodologies, techniques, and taxonomy of this thesis, this chapter provides an introduction to the topics that are necessary to understand the further content of this work. This chapter is structured as follows. Section 2.1 explains basic concepts of architecture and system design. Section 2.2 describes fundamental hazard analysis techniques. Section 2.3 summarizes research methods used throughout this work based on common case study research. Section 2.4 discusses the tools used to perform data collection, data processing, and execution of chaos experiments, as well as the technologies that implement the investigated case study system. Section 2.5 explains the term resilience. Section 2.6 explains the chaos engineering methodology and its applications in software engineering. Section 2.7 provides an overview of state-of-the-art tools in the domain of chaos engineering.

2.1. Architecture and System Design

In the past years, authors such as Bass et al. [BCK03] and Perry et al. [PW92] have argued that software architecture and design are not the same, whereas Robert C. Martin points out that design and architecture should not be differentiated since both contribute to a system as a whole [Mar17]. Architecture can be explained as a system's description that encompasses its structure and behavior. An architecture comprises different views to comprehend different aspects of the architecture and to guide communication between stakeholders. These views are meant to abstract away implementation details such as component attributes and functions


[BCK03]. For example, a component diagram such as Figure 5.4 explains the structure of a system and its components as well as their relationships. This structural view is helpful for any developer involved in the development of the system but not beneficial for stakeholders who are not involved in the development process. On the contrary, design comprises decisions about more fine-grained details such as component modelling, attribute and method declarations, and so on [BCK03]. An example of design can be seen in Figure 5.7, illustrating message communication between different components of a particular system. A more detailed definition of software architecture is given by the IEEE Standard 1471-2000 [MEH01]. According to this definition, an architecture encompasses the fundamental organization of a system embodied in its components, their relationships to each other and to the environment [MEH01]. Hence, a software architecture commonly comprises (1) a description of the system's architectural elements, (2) a set of different architectural views, and (3) a description of the system design comprising fine-grained details such as component communication. The architecture of this case study's system given in Chapter 5 provides different architectural views and diagrams to explain the domain context and overall system structure, as well as design-related views to emphasize different aspects of the system.

2.1.1. Architectural Styles and Patterns

Fielding et al. [FT00] describe an architectural style as a coordinated set of architectural constraints that narrows the possibilities of how to align architectural elements and their allowed relationships within a fitting architecture. Architectural styles encapsulate decisions and restrictions on architectural elements and the relationships among the elements [FT00]. Examples of architectural styles are pipes and filters, client-server, microservices, and monolithic architectures. The latter two are referenced in Chapter 5, distinguishing the same system implemented with two different architectural styles. A monolithic architecture describes the structuring of components in such a way that the system is written as a single executable or single deployment file [Ric17]. This kind of architecture implies a tight coupling of single components. In contrast, Newman [New15] describes microservices as small pieces of code that autonomously work together. Newman's definition closely follows Robert C. Martin's single responsibility principle [Mar02]. The single responsibility principle states that multiple pieces of code that change for the same reason should be grouped together. In contrast, pieces of code that do not change for the same reason should be separated [Mar02]. For example, consider a web shop application implemented

with a microservice-oriented architectural style. Within the web shop application, there exist components responsible for user and product management, pricing, and order management. According to Martin's single responsibility principle, the user management component should not have to change if a new product is added to the web shop. Patterns provide generic solutions, which have proven reliable, to common problems that emerge from developing systems in practice [FLR+14]. Architectural styles as well as patterns can be categorized according to the problems they intend to solve. For example, the architectural styles client-server and microservices can be labeled as styles for distributed systems. Categories for patterns are typically distinguished on a higher level, such as architectural patterns, design patterns, cloud computing patterns, structural patterns, and behavioural patterns [FLR+14]. The design patterns by Gamma et al. [PG95] have become a well-known standard in the domain of Object-Oriented Programming (OOP). The patterns given by Gamma et al. provide solutions for common problems that emerge in the development of systems that use OOP languages such as Java [Oraa].

2.2. Hazard Analysis

With the expansion of systems and technologies into various aspects of our environment, it becomes more important to consider the safety of systems and technologies [Eri99]. Sooner or later, systems will experience failures that potentially lead to hazards. The MIL-STD 882 [Sys] standard defines a hazard as the following:

A real or potential condition that could lead to an unplanned event or series of events (i.e. mishap) resulting in death, injury, occupational illness, damage to or loss of equipment or property, or damage to the environment.

Reasonably, it is important to identify hazards of systems as soon as possible. A hazard analysis is given as a step within a process to determine a system's safety by the MIL-STD 882. This section introduces three techniques to identify hazards in a system, namely Failure Modes and Effects Analysis, Fault Tree Analysis, and Computational Hazard Analysis and Operability.

2.2.1. Failure Modes and Effects Analysis

The following explanation is based on the work by Teoh et al. [TC04]. Failure Modes and Effects Analysis (FMEA) is used to identify failure modes of systems and subsystems as well as the potential system effects of failures. A thorough FMEA comprises different sets of attributes that are captured during the analysis, whereas some approaches extend the attributes by a quantification of the criticality of failures. Commonly, an FMEA is conducted by a team of experts from various departments who analyze single components of a system with respect to the previously mentioned attributes: failure modes, potential subsystem and system effects, and the criticality of these effects. In order to capture the findings, an FMEA is commonly documented in a table where each column corresponds to one attribute.

For example, consider Table 2.1, which shows an FMEA conducted for wind turbines [AOT10].

Part | Failure mode | Local effect | Next effects | End effect | Severity

Table 2.1.: Example FMEA for wind turbines [AOT10]

The first column selects the part that is going to be analyzed. A part can be any component that the system, in this case a wind turbine, comprises. The second column defines a potential failure mode. A failure mode is a description of a potential failure that may occur in the system. Any subsequent column refers to the failure mode determined in the second column. In the given example, one failure mode of a wind turbine is that the system is not able to cool down a specific amount of oil [AOT10]. The following three columns, local effects, next effects, and end effects, describe the effects the failure mode may have on the system on different levels. The last column, severity, defines a metric that determines the criticality of the failure mode based on its effects. In conclusion, the FMEA is a team-based qualitative approach to identify failure modes and their potential effects on systems. Documentation of an FMEA is commonly done as shown in Table 2.1. However, it has to be considered that the labels of the columns may deviate based on the needs, the use cases, and the system. With a thorough FMEA it is possible to identify failure events that can later be used for a fault tree analysis.
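To make the tabular structure of an FMEA record more concrete, the following Python sketch models a single row of a table such as Table 2.1. It is an illustrative data structure only, not part of the tooling used in this thesis; the field names mirror the column headers above, and the example values for a Kubernetes Ingress component are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FmeaEntry:
    """One row of an FMEA table; fields mirror the columns of Table 2.1."""
    part: str          # component under analysis
    failure_mode: str  # description of a potential failure
    local_effect: str  # effect on the component itself
    next_effect: str   # effect on the surrounding subsystem
    end_effect: str    # effect on the overall system
    severity: int      # criticality of the failure mode, e.g., on a 1-10 scale

# Hypothetical example entry for a Kubernetes Ingress component
ingress_entry = FmeaEntry(
    part="Ingress",
    failure_mode="Ingress controller pod terminates",
    local_effect="No routing rules are applied",
    next_effect="Requests are not forwarded to backend services",
    end_effect="Users cannot reach the application",
    severity=9,
)
print(ingress_entry)
```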

2.2.2. Fault Tree Analysis

The following explanation is based on the work by Leveson et al. [LH83]. Fault Tree Analysis (FTA) is considered to be a technique to analyze a software system's safety. The result and goal of a Fault Tree Analysis is to identify environmental relationships and failures between events that may cause a safety failure of the software system. During the analysis, a series of hazardous events connected by Boolean expressions is constructed in a top-down manner. The root of the tree describes the top-level event to be analyzed, which is usually explained to be the loss event or the undesirable event. Any subsequent level of the tree contains nodes which describe events that are causes for their parent node, whereas each node is expanded by identifying possible causes. Each node is connected by an AND or an OR connection describing the relationship of two or more nodes. For example, if two nodes are connected by an AND connector, each node's event must occur at the same time in order for the parent event to happen. At the bottom of the tree are the leaf nodes, which describe events of calculable probability or events that cannot be further decomposed for some reason.

In order to construct fault trees, the event representation by Lee et al. [LGTL85] is used, which is illustrated in Figure 2.1, Figure 2.2, Figure 2.3, and Figure 2.4. It has to be considered that there exist more representations to construct a fault tree, but the objects previously mentioned are the only ones used throughout this work.

Figure 2.1.: Intermediate fault event
Figure 2.2.: Basic fault event
Figure 2.3.: Input and output events
Figure 2.4.: Logic gates

Figure 2.1 illustrates an intermediate fault event. Such events occur due to the combination of other fault events connected through a logic gate [LGTL85]. Figure 2.2 shows a basic event. Usually, basic events occur at the bottom of any FTA, describing events that are not required to be further analyzed for some reason or describing events of calculable probability [LH83]. Figure 2.3 shows two input events. The left event is called an input event that takes some kind of input that may later be used in the analysis, e.g., a subtree. The right event is called an output event, denoting that some input may be used at this particular place, e.g., a subtree [LGTL85]. Lastly, Figure 2.4 illustrates the logical gates that connect events. Events connected by an AND gate occur if both child node events occur at the same time. On the other side, events connected by an OR gate occur if either of the child node events occurs.


Next to the graphical modelling approach, any fault tree can also be analyzed qualitatively by determining minimal cut sets or minimal paths that lead to the top-level hazard [LGTL85]. Hence, a minimal cut set is the minimum number of events that are allowed to happen in order for the top-level hazard to occur. There exist various approaches to determine minimal cut sets, such as Monte Carlo simulation procedures and other deterministic methods [LGTL85]. Other than that, due to the logical gate connection of each event, minimal cut sets can also be determined by the application of basic Boolean algebra. The goal is to apply these principles until an expression is reached that looks similar to this:

$(A_1 \ast A_2 \ast \ldots \ast A_i) + (B_1 \ast B_2 \ast \ldots \ast B_j) + \ldots$

where $A_i$ and $B_j$ with $i, j \geq 1$ describe basic events in a fault tree. Following the convention of Boolean algebra, multiple events connected by an AND gate are expressed as:

$(A_1 \ast A_2 \ast A_3 \ast \ldots \ast A_i), \quad i \geq 1$

Multiple events connected by an OR gate are then expressed as:

$(B_1 + B_2 + B_3 + \ldots + B_j), \quad j \geq 1$

The minimal cut sets in this work are entirely determined by the application of Boolean algebra. A quantitative evaluation of any fault tree labels the basic events with probabilities, therefore leading to a probabilistic estimate of their occurrence. Such an evaluation requires knowledge about the statistical expectation or probability of each basic event and is not further used in this work due to the lack of the necessary data.
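As a minimal sketch of the Boolean reduction described above, the following Python snippet (assuming the sympy library is available) encodes a small, hypothetical fault tree and rewrites it into a simplified disjunctive normal form; each conjunction of the result corresponds to one minimal cut set. The events A, B, and C are placeholders and do not stem from the fault trees analyzed in Chapter 6.

```python
from sympy import symbols
from sympy.logic.boolalg import to_dnf

# Hypothetical basic events of a small fault tree
A, B, C = symbols("A B C")

# Top-level hazard: (A OR B) AND (A OR C), i.e., two OR gates feeding an AND gate
top_event = (A | B) & (A | C)

# Rewriting into a simplified disjunctive normal form yields the minimal cut sets:
# every conjunction in the result is one minimal cut set.
cut_sets = to_dnf(top_event, simplify=True)
print(cut_sets)  # A | (B & C)  ->  minimal cut sets {A} and {B, C}
```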

2.2.3. Computational Hazard and Operability

Computational Hazard Analysis and Operability (CHAZOP) is an extension of the Hazard Analysis and Operability (HAZOP) study known from the chemical process industry [FH94]. HAZOP refers to a systematic study which identifies deviations from a given design specification of a system and whether these deviations may result in hazards [WJG01]. Conceptually, the HAZOP study is similar to the FMEA approach based on the fact that it is a qualitative team review activity. It is recommended that the team conducting the study should have sufficient experience with assessing a system's safety as well as a deep knowledge of the studied system [WJG01], [DFVA10]. The knowledge and experience of the team therefore decides the quality of the study.

During a HAZOP study, an interconnection or flow between two units is selected that is to be analyzed. This interconnection is then further analyzed by applying a combination of attributes and guidewords [DFVA10]. Common guidewords are (1) no, (2) more, (3) less, (4) as well as, (5) part of, (6) reverse, (7) other than, (8) early, (9) late, (10) before, and (11) after [WJG01], [DFVA10]. Attributes and guidewords vary based on the use case and the people who conduct the study. For example, consider Table 2.2.

Interconnection: Train radio station to train radio exchange server | Attribute: alteration | Guideword: other than | Cause: train ID is altered by manipulation of communication links | Consequences: Call information is wrong

Table 2.2.: Example of the use of guidewords in case of a train telephone system [WJG01]

Table 2.2 illustrates a small example of a HAZOP study using guideword (7) in combination with the attribute alteration. In this case, the interconnection is selected to be between two components which illustrate the communication between a train radio station and a train radio exchange server. The attribute alteration describes briefly what occurs during the communication process between both parts. The guideword other than describes that the delivered action is other than the design intent, which is further described in the cause column. The last column indicates possible consequences. HAZOP provides meaningful insight into interconnections and data flows between small units of a given system to identify hazards and design deviations. In case of assessing the safety of software systems, the list of guidewords may be extended or changed to some degree. Additionally, some attributes are considered to be more meaningful, e.g., using the attribute data flow in combination with guideword (1) to identify deviations in the data flow between two units. A thorough application of HAZOP is given in Chapter 6. The list of guidewords used in Chapter 6 is equal to the guidewords explained in this section.
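The systematic nature of a (C)HAZOP study, walking one interconnection through every attribute and guideword combination, can be illustrated with a short Python sketch. The interconnection and attributes below are hypothetical placeholders; in an actual study, the cause and consequence fields are filled in by the review team rather than generated automatically.

```python
from itertools import product

# Guidewords (1)-(11) as listed above
guidewords = ["no", "more", "less", "as well as", "part of", "reverse",
              "other than", "early", "late", "before", "after"]

# Hypothetical attributes and interconnection of the studied system
attributes = ["data flow", "alteration", "response time"]
interconnection = "Ingress -> Registry"

# Enumerate every attribute/guideword combination as an empty CHAZOP row;
# cause and consequences are discussed and filled in during the team review.
rows = [
    {"interconnection": interconnection, "attribute": attribute,
     "guideword": guideword, "cause": None, "consequences": None}
    for attribute, guideword in product(attributes, guidewords)
]
print(len(rows), "candidate deviations to discuss")  # 3 * 11 = 33
```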

2.3. Research Methods

The definition of case study research varies among authors such as Yin, Wohlin, and Runeson et al. [Yin98], [WRH+12], [RH09]. However, Robson [RM16], Yin [Yin98]


and Benbasat et al. [BGM87] agree that a case study can be described as an empirical method that aims at investigating a contemporary phenomenon in its natural context [WRH+12]. A case study is referred to as an empirical research method, which is one of the four essential research methods identified by Wohlin et al. [WRH+12], namely scientific, engineering, empirical, and analytical. Therefore, a case study leverages empirical research strategies to achieve its study objective.

2.3.1. Exploratory and Explanatory Research

Empirical research strategies further divide into two paradigms, i.e., exploratory research and explanatory research [WRH+12]. Exploratory research studies objects and phenomena in their natural context combined with a flexible research design in order to adapt to the changes that occur in a natural setting [WRH+12]. Moreover, qualitative research and inductive research are strategies which are associated with exploratory research. Explanatory research is concerned with identifying cause-effect relationships between two observed objects or groups of people [WRH+12], [RH09]. In contrast to exploratory research, a fixed research design is applied during explanatory research. Commonly, controlled experiments are conducted during such research approaches. Quantitative research and deductive research are associated with explanatory research. According to these explanations, a case study is associated with exploratory research, utilizing qualitative and quantitative approaches to gather information about a specific case.

2.3.2. Case Study Planning

Before a case study is executed, it is suggested to provide a case study plan. A case study plan comprises (1) an objective, (2) a description of the investigated case, (3) an explanation of the data that needs to be collected, (4) which data collection strategies are used, and (5) how the data is going to be evaluated [RH09], [WRH+12]. The objective is formulated to point out the focus of the case study. A set of research questions indicates what goals have to be fulfilled in order to achieve the study objective. The objective can either be exploratory, explanatory, descriptive, or improving [WRH+12]. As stated by Runeson et al. [RH09], a case is an object or a phenomenon to be studied for an identified reason in its natural context. A description of the case is necessary in order to understand the phenomenon that is investigated and its context. The data collection strategy must fit the data that has to be collected. For example, if a system is analyzed as part of the study and the only source of information are the stakeholders, then it is the researcher's obligation to use meaningful qualitative strategies

such as interviews or focus groups to obtain the data. Moreover, as part of the data evaluation, it has to be considered that a case study does not provide a conclusion with statistical significance [WRH+12]. However, a clear chain of evidence is provided by collecting evidence from multiple sources of information. This information may either be collected qualitatively or quantitatively with respect to the data. The triangulation of multiple information sources leads to a less biased and stronger conclusion than drawing a conclusion from a single information source [WRH+12]. A thorough case study plan is created during the course of this thesis. The case study plan can be found in Chapter 4, comprising the previously mentioned contents of a case study plan.

2.3.3. Qualitative Research Strategies

Qualitative research refers to any type of research that is concerned with findings that are not derived by statistical procedures or any other means of quantification [SC98]. Typically, such findings include research about people's lives, experiences from projects, tracking emotions and feelings, or organizational movements and social changes [SC98]. This work applies qualitative research strategies in order to gather data and information about the investigated software system introduced in Chapter 5. In the following, the major strategies are explained and evaluated.

Interviews

During an interview, the researcher asks a series of questions to subjects or interviewees. The goal of an interview is to collect information about the areas that are of interest for the case study [WRH+12]. Interviews can either be performed with one subject or multiple subjects, the latter being a group interview. Furthermore, interviews can have different structures with respect to the sequence of questions asked and the structure of the questions. In general, interview questions are derived from the research questions but are not stated as precisely [WRH+12]. Interview questions can be open or closed. Open interview questions allow a broader range of answers by not limiting the subject to predefined answer options [Poh10]. Closed interview questions provide a predefined set of equal or different answers for each question. Hence, the subject can select one of the available options and is thus limited in his or her response [Poh10]. In general, there are three types of interviews, namely fully-structured, semi-structured, or unstructured interviews [WRH+12]. Fully-structured and semi-structured interviews have their questions prepared in advance [WRH+12]. Though during a fully-structured interview, questions have a pre-defined sequence where each participant is asked


the same questions in the same sequence. In contrast, semi-structured interviews do not require the interviewer to ask questions in the same sequence. Moreover, the interviewer has the possibility to deviate from his or her interview questions. This allows a more fluent and broad conversation, where the interviewer also has the possibility to direct the conversation towards a more specific topic that may not be covered by a prepared set of interview questions [WRH+12]. Unstructured interviews have interview questions stated as major interests of the interviewer or general topics of concern, rather than strictly specified questions. Based on this, the conversation develops towards the interests of the researcher and does not follow any question catalogue.

Focus Groups

Focus groups are a special composition of subjects in terms of size, purpose, composition, and procedures [KC14]. The goal of a focus group is to raise information about how the subjects feel or think about a specific item, service, idea, or product [KC14], [Poh10]. Krueger et al. [KC14] identify five essential characteristics of focus groups. A focus group comprises a small number of people (1), who share certain characteristics in common (2). The people provide qualitative data (3), within a focused discussion session (4), to help understand the topic of interest (5). The characteristics identified by Krueger et al. [KC14] imply that the most important source of data during a focus group session is the discussion among participants. However, focus groups and group interviews share the same weakness. Both methods are affected by the groupthink effect. Janis defines the groupthink effect as a mode of thinking that overshadows people's capability of realistically evaluating one course of action taken over another by their strive for unanimity [Jan08]. This especially occurs if people are deeply involved in cohesive groups. Hence, the groupthink effect lowers the quality of ongoing or expected discussions within group interviews and especially in focus groups. If chances are high that the groupthink effect might occur, it is recommended to separately interview less dominant participants from more dominant participants, or to choose another elicitation method [Poh10].

Questionnaires

Questionnaires have the advantage that researcher and participants are not required to meet in person. The questionnaire can be delivered and evaluated via an appropriate online platform or provided by e-mail. As in interviews, the questions can either be closed or open [Poh10]. Additionally, there is no strict time constraint for questionnaires. Within a given time frame in which the questionnaire has to be finished, which can be several days or even weeks, the participants are free to choose the location and time when they want to answer the questions. This gives the participants a higher leverage of control over when to answer the questions and more time to reconsider answers to more complex questions. However, in contrast to interviews or focus groups, the researcher has little to no control over the quantity of information. For example, during a semi-structured interview or focus group, the researcher has the possibility to direct the discussion or conversation into a more specific direction with respect to his or her interests. This is not given within a questionnaire since the researcher and participants are decoupled from each other.

Summary

The methods discussed in this section have different goals next to collecting information from people. Focus groups heavily rely on the aspect of continuing discussions among the participants in order to let new insights emerge. Questionnaires restrict the participants in such a way that a limited set of answers is provided, leaving the participants no chance to further elaborate their thoughts. Interviews enable a broader, flexible, and exploratory approach to raise information during a one-to-one session with participants.

2.3.4. Quantitative Research

The following explanations are based on the work of Wohlin et al. [WRH+12] and Runeson et al. [RH09]. Quantitative research is concerned with the quantification of data, typically by the application of descriptive statistics. This includes correlation analysis, hypothesis testing, and the calculation of mean values and standard deviations. Commonly, as stated by Wohlin et al. and Runeson et al., a quantitative research approach assumes a fixed research design, since the data becomes impossible to interpret if changes to the data happen, as is common in a flexible research design. However, case studies make use of a combination of qualitative and quantitative research strategies, for example by qualitatively gathering data about an investigated case and then conducting a quantitative analysis that comprises hypothesis testing. Hypothesis testing aims to answer the question whether a null hypothesis will be rejected or not. A null hypothesis $H_0$ is commonly stated negatively in contrast to the alternative


Test name – Description
t-test – Most often used parametric test to compare two sample means
Mann-Whitney – Non-parametric test alternative to the t-test
Analysis of Variance (ANOVA) – A set of parametric tests for designs with more than two levels of a factor

Table 2.3.: Examples of parametric and non-parametric hypothesis tests

hypothesis $H_1$, whereas $H_1$ describes the initial hypothesis that is going to be investigated. Furthermore, $H_0$ describes properties of some statistical distribution from which a sample is drawn, where the researcher investigates whether to reject that these properties are true with respect to a given significance level α. In general, hypothesis tests are classified into parametric and non-parametric tests. Parametric tests often require the model to be given as a specific distribution. For example, the parametric t-test requires the model to be normally distributed. Non-parametric tests are more general than parametric tests and do not require the data to be given as a specific model. They can be used whenever a parametric test is not applicable or suitable to the investigated purpose. For example, the chi-squared test is a non-parametric test that can be used if the data is provided as frequencies. In this case a parametric test is not applicable, since a parametric test requires the data to be measurable on an interval scale in addition to being normally distributed. Table 2.3 shows some hypothesis tests commonly applied in practice. The significance level α of a test, often stated as the p-value, describes a probability where, if any measured value occurs within the given p-value, $H_0$ will be rejected in favor of $H_1$. In most cases the p-value describes an area below a given distribution. Common values are α = 0.05 or α = 0.01, dependent on the investigated case and the trustworthiness of the measured data. In this work, a non-parametric test, the Kolmogorov-Smirnov test, is applied in order to analyze two distributions with respect to measured response times of the investigated software system. The test was selected by following the decision matrix by Field et al. [FH02].
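The two-sample Kolmogorov-Smirnov test applied in Chapter 8 compares the empirical distributions of two samples, for example response times from the steady-state phase and the injection phase. The following Python sketch demonstrates the test on synthetic data using scipy; the sample values and the significance level of 0.05 are illustrative assumptions, not measurements from the case study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic response times in milliseconds (illustrative only):
# steady-state phase vs. injection phase with a shifted distribution
steady_state = rng.normal(loc=120, scale=15, size=300)
injection = rng.normal(loc=160, scale=25, size=300)

# Two-sample Kolmogorov-Smirnov test: H0 = both samples stem from the same distribution
statistic, p_value = stats.ks_2samp(steady_state, injection)

alpha = 0.05
print(f"D = {statistic:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the response time distributions differ.")
else:
    print("Fail to reject H0: no significant difference detected.")
```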

2.4. Technologies

This section gives an introduction to the technologies that have been used for the implementation of the case study application, which is described in Chapter 5. Furthermore,

16 2.4. Technologies two measurement instroments will be introduced, namely Apache JMeter and Chaos- toolkit. These instruments are going to be applied in Chapter 7. The application is build upon three core technologies namely Spring Boot, Docker for the usage of container virtualization, and Kubernetes as a deployment platform and container orchestration.

2.4.1. Spring Boot

Spring Boot provides the possibility to create stand-alone, production-grade applications without worrying about underlying configuration details [Piva]. Spring Boot has become the de-facto standard for implementing JVM-based microservices. Additionally, it takes control of the generation of typical boilerplate code so that developers can solely focus on integrating business process features [Piva]. Some of its features are:

• Provide production-ready features such as metrics, health checks and externalized configuration • Create stand-alone Spring applications • Embedded Tomcat, Jetty or Undertow

Spring has grown to become an entire platform for building enterprise applications. Spring Boot as well as Spring Cloud and Spring Cloud Data Flow are projects built on top of the Spring framework. Driven by the 'convention-over-configuration' paradigm, Spring makes the development process easier by leaving configuration work to Spring and thus reducing the amount of boilerplate code. This enables developers to entirely concentrate their effort on the development process [Piva]. Spring builds on top of three core features [Piva]:

• Dependency Injection • Aspect-Oriented-Programming • POJO-based programming model

The Spring framework contains many more projects which focus on different application aspects such as Message Queuing, Data Persistence, Cloud Deployment, and Cross-Cutting Concerns.


2.4.2. Docker

Container virtualization has existed for a long time but was not easily accessible for everyone. In 2014, Docker provided an engine to easily isolate applications in containers and make the roll-out process quicker. Since then, the concept of container isolation has become more popular and Docker has become the de-facto standard for container virtualization [Doc]. Containers work extraordinarily well within microservice environments since each service can be developed and isolated in its own container, supporting the mind-set of the single-responsibility principle of microservices. Each service runs in its own environment, which makes the development and deployment process easier since each container is isolated from other containers.

2.4.3. Kubernetes

Kubernetes is a container-orchestration tool to efficiently manage containerized applications [Fou]. Moreover, Kubernetes enables engineers to scale and deploy their applications more efficiently. However, Kubernetes requires applications to be containerized, i.e., running inside a container. This can be achieved by tools such as Docker and rkt (pronounced "rocket") [Fou]. Kubernetes provides a cluster which is maintained and run by Kubernetes itself. A Kubernetes cluster is a combination of multiple virtual machines or physical devices that act as a single unit of computation. A Kubernetes cluster is visualized in Figure 2.5.

Figure 2.5.: Organizational structure of a Kubernetes cluster [Fou] (an API master coordinating several worker nodes)

The Kubernetes Cluster Services are responsible for running the cluster and maintaining its state. To do so, the Cluster Services permanently verify that the current cluster state

matches the desired cluster state as specified within a deployment file. This allows Kubernetes to provide a self-healing mechanism: the Cluster Services detect if a worker node goes down and automatically initiate actions to get back to a working cluster state. A deployment contains information for the Cluster Services such as the number of nodes and pods inside the cluster. A pod is the smallest scalable unit in Kubernetes and contains either a single or multiple containers [Fou], whereas a node is either a virtual machine (VM) or a physical device that is responsible for running an application inside a pod [Fou]. Kubernetes was developed in order to automate deployment, scaling, and management of containerized applications. Additionally, Kubernetes provides self-healing mechanisms in such a way that it continuously verifies that the current cluster state is equal to the desired state. Therefore, if critical nodes, pods, or containers terminate, Kubernetes restarts the terminated objects to restore the desired cluster state.
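As a small illustration of the relationship between the desired and the current cluster state, the following sketch queries the current state through the official Kubernetes Python client. It assumes a reachable cluster and a local kubeconfig; the "default" namespace is illustrative and not part of the case study setup.

# Minimal sketch: inspecting the current cluster state with the official
# Kubernetes Python client. Assumes a reachable cluster and a local kubeconfig.
from kubernetes import client, config

config.load_kube_config()          # inside a pod, load_incluster_config() would be used instead
core_v1 = client.CoreV1Api()

# List worker nodes and their readiness condition.
for node in core_v1.list_node().items:
    ready = next((c.status for c in node.status.conditions if c.type == "Ready"), "Unknown")
    print(f"node {node.metadata.name}: Ready={ready}")

# List pods in the (illustrative) default namespace and their phase (Running, Pending, ...).
for pod in core_v1.list_namespaced_pod(namespace="default").items:
    print(f"pod {pod.metadata.name}: {pod.status.phase}")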

2.4.4. Apache JMeter

The following explanations are based on the official Apache JMeter documentation [The19b] and homepage [The19a]. Apache JMeter is a Java-based application that allows load testing of functional behavior and simulating user workload on applications. As stated by Vieira et al. [VWAM12], a workload represents an event the application is exposed to, such as database queries. JMeter's feature set allows performance testing of applications, servers, and protocols of different kinds, such as REST web services. As a means to load and performance test an application, JMeter provides a graphical user interface to generate test plans and a command line mode to execute them. A test plan comprises elements which allow the customization of the workload. A minimal test comprises a test plan, a thread group, and one or more samplers. The test plan describes the test scenario. During the execution of the test plan, a thread group and multiple samplers are executed in a defined order to generate the specified workload. For example, a thread group may contain 5 concurrent users that execute multiple HTTP requests with the assistance of samplers. Number of Threads (users) is the total amount of users that will be created during the test period and execute commands that are configured by different test plan objects, e.g., samplers. The Ramp-Up Period (in seconds) declares a time interval that JMeter uses to start all previously defined threads. For example, if the Ramp-Up Period (in seconds) is set to 100 seconds and the Number of Threads (users) is set to 10, then JMeter will start a new thread every 100/10 = 10 seconds, so that all threads are running after 100 seconds.


A thread group defines the number of concurrent users that execute different tasks according to the test goal. For example, if a database is to be load tested, a thread group may set up a total of twenty users where each user executes a specific database query. A thread group and any other element must be assigned to a test plan object. Moreover, a thread group assigns one or multiple controllers to execute different tasks. Samplers are a special kind of controller. JMeter provides an additional logical controller element to enable a combination of multiple samplers for more complex scenarios. Samplers provide the means to send requests to a server, e.g., sending an HTTP request to a remote server. If multiple samplers are defined within a thread group, JMeter executes them in the order they appear in the test plan. Each sampler can be modified with respect to its number of repetitions by a controller. For example, a controller can be used to define that a sampler is repeated 5 times. A sampler can be combined with an assertion element. An assertion can be used to verify responses from a server, e.g., that the status code sent inside a response header equals HTTP status code 200. In this work, a test plan is created to establish environmental conditions similar to the production environment of the prototype system explained in Chapter 5. These conditions simulate the expected user behavior of the application based on the interview findings discussed in Section 5.1.
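For the later analysis, the load-test observations have to be condensed into steady-state metrics such as mean and percentile response times. The following sketch shows one possible way to summarize a JMeter results file; the file name and the assumption that the default CSV columns ("elapsed" in milliseconds and "success") are written are illustrative and depend on the JMeter configuration in use.

# Minimal sketch: summarizing response times from a JMeter results file (JTL/CSV).
# Assumes the default CSV output with "elapsed" and "success" columns; file name is illustrative.
import csv
import statistics

def summarize_results(path="results.jtl"):
    response_times, failures = [], 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            response_times.append(int(row["elapsed"]))
            if row["success"].lower() != "true":
                failures += 1
    response_times.sort()
    p95 = response_times[int(0.95 * (len(response_times) - 1))]
    print(f"samples={len(response_times)}, mean={statistics.mean(response_times):.1f} ms, "
          f"median={statistics.median(response_times):.1f} ms, p95={p95} ms, failures={failures}")

summarize_results()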

2.4.5. ChaosToolkit

The following descriptions and explanations are based on the Chaostoolkit documentation [Chab] and homepage [Chaa]. Chaostoolkit is a Python-based, open-source platform provided by ChaosIQ [Cha19] which is extensible through its open API. Chaostoolkit aims to make chaos engineering easy to adopt by providing full integration of chaos experiments. In addition to its rich feature set for executing chaos experiments, Chaostoolkit provides different drivers to inject chaos at different abstraction levels, e.g., a driver for Kubernetes to create chaos within a Kubernetes cluster. In Chaostoolkit, experiments are defined by creating YAML- or JSON-based files which contain the different objects that make up the experiment. A distinction is made between must-have and optional objects. As the name implies, must-have objects must occur in each experiment declaration, e.g., a steady-state hypothesis. Optional objects need not occur inside an experiment declaration, e.g., rollbacks. The different objects are shown in Table 2.4. Chaostoolkit further describes objects that may be included in an experiment; however, these objects are not relevant for this work and are hence not included in Table 2.4.


MUST
  Object name              Description
  version                  Version of the current experiment file
  title                    Experiment title
  description              Description of the experiment
  method                   Comprises actions and probes to change the system state and verify the hypothesis

SHOULD
  Object name              Description
  steady-state-hypothesis  Defines the expected working behavior of the application under test
  rollbacks                Actions to reverse the actions of the method block

Table 2.4.: Chaos experiment objects declared by ChaosToolkit

Note that the order of the objects given in Table 2.4 is the same order in which the objects must appear in the experiment document. Changes in the system are created by providing actions inside the method block. An action is a JSON object which performs specific operations against the system, e.g., terminating a Kubernetes pod that hosts an application. An action must declare the properties (1) type, (2) name, (3) provider, and (4) controls. In order to verify the steady-state hypothesis, a chaos experiment may include probes. A probe is another JSON object that collects information from the system during an experiment. Probes do not change the system like actions; they merely collect information based on their provider. Like an action, a probe must declare the properties (1) type, (2) name, and (3) provider. Probes can be included in the steady-state-hypothesis block in order to define the working behavior, e.g., requesting a service by issuing an HTTP request to a given URL. The method block may also include probes in order to collect information about the system state. Listing 2.1 illustrates a sample experiment declared as a JSON file. If executed, the chaos experiment will first verify the steady-state hypothesis by applying the probes declared in the steady-state-hypothesis block. The probe inside the block makes an HTTP GET request to localhost and expects the application to respond with status code 200. In this case this serves as a verification that the application is still up and running.


Listing 2.1 JSON based chaos experiment
{
  "version": "1.0.0",
  "title": "Example experiment",
  "description": "Application is still available despite termination of a pod",
  "tags": ["thesis", "kubernetes"],
  "steady-state-hypothesis": {
    "title": "The app is up and running",
    "probes": [
      {
        "name": "app-responds",
        "type": "probe",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://localhost:8080/hello"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "terminate-pods",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=spring-app",
          "ns": "default"
        }
      }
    },
    {
      "ref": "app-responds"
    }
  ],
  "rollbacks": []
}


Next, the method block is executed. Actions that change the system state are applied inside this block. The action defined in Listing 2.1 uses a Python module provided by the Chaostoolkit Kubernetes plugin. The action will look for a specific pod inside the cluster and terminate it. After the action has been applied, the steady-state-hypothesis block is executed again in order to verify that the hypothesis still holds despite the applied changes in the system state. Since there are no rollbacks defined, the experiment will stop and print the result to the console. Based on the result, Chaostoolkit will report the experiment as either successful or not. In case of a successful experiment, the system worked as expected. The experiments created in this work follow the structure explained in this section. All of them are declared as JSON files and explained in Chapter 7. In addition to the core functionality of Chaostoolkit, the Kubernetes driver will be used in order to inject chaos into a Kubernetes cluster. The respective functionalities are explained in Chapter 7.
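As a small illustration of the must-have objects from Table 2.4, the following sketch, which is not part of Chaostoolkit itself, loads an experiment file and checks that the required objects are present before the file is handed to the toolkit. The file name is illustrative.

# Minimal sketch (not part of Chaostoolkit): checking the must-have objects of an
# experiment file before execution. The file name is illustrative.
import json

REQUIRED_OBJECTS = {"version", "title", "description", "method"}

def load_experiment(path="experiment.json"):
    with open(path) as f:
        experiment = json.load(f)
    missing = REQUIRED_OBJECTS - experiment.keys()
    if missing:
        raise ValueError(f"Experiment is missing required objects: {sorted(missing)}")
    return experiment

experiment = load_experiment()
print(f"Loaded experiment '{experiment['title']}' with {len(experiment['method'])} method entries")

In practice, the validated file would then be executed with the Chaostoolkit command-line interface, e.g., by running chaos run experiment.json.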

2.5. Software Resilience and Dependability

Resilience has grown to become a widely adopted term in various research areas. The following section shows that the meaning of the term resilience varies across different domains such as psychology and software engineering. Thus, it is necessary to understand its usage as well as its relation to other engineering concepts such as dependability, in order to know which measures appropriately characterize resilience.

2.5.1. Resilience and Dependability

Resilience is considered a quality attribute, like availability, maintainability, and durability, which describe quality aspects of a software system. In order to measure or assess the resilience of a system, it is required to clearly understand what resilience is about and what it actually means [VWAM12]. In a non-technical context, resilience is described as the power or ability to return to the original position after being bent, compressed, or stretched [Ran]. Another definition is given by Vieira et al. [VWAM12], where resilience is described as an object's ability to recover or return to a normal state after being altered or deformed by some kind of pressure. Vieira et al. [VWAM12] argue that the increasing usage of the term resilience creates a doubt whether it is just a new term to highlight already known concepts such as fault tolerance, reliability, availability, maintainability, and security or whether it actually identifies


Figure 2.6.: Components of dependability [ALR+01] (threats: faults, errors, failures; attributes: availability, reliability, maintainability, safety, confidentiality, integrity; means: fault prevention, fault tolerance, fault removal, fault forecasting)

new concepts. These concepts relate to dependability. A system is called dependable if it is available whenever it is required, and it operates safely and reliably without any malicious side effect [ORe17]. Figure 2.6 shows a systematic representation of the dependability concept. It illustrates that dependability has three main components: the threats to dependability, the attributes of dependability, and the means by which dependability can be achieved [ALR+01]. Availability and reliability are quality attributes users unknowingly experience every time they use an application, and thus they affect the user experience. This structure implies that changing the attributes affects the dependability and vice versa, whereas threats negatively affect the dependability but can be dealt with by the described means. Availability and reliability are also the quality attributes engineers look to improve next to performance and resilience. Therefore, both are of major importance with respect to improving an application's dependability.


2.5.2. Fault Avoidance and Fault Tolerance

With respect to the threats shown in Figure 2.6, fault tolerance and fault avoidance have been mentioned by [VWAM12], where it was claimed that resilience is related to these concepts. Fault tolerance and fault prevention or fault avoidance are techniques, each of which comprises different strategies capable of increasing the Mean Time to Failure (MTTF) of a system. The MTTF describes the expected time from a working state of a system until a failure occurs [SI03]. Fault avoidance aims to reduce the number of possible defects in software programs by various approaches such as precise specification practices, information hiding, and encapsulation [Sah11], whereas fault tolerance means making the system able to tolerate the effects of faults occurring in the system [VWAM12]. Dependability-increasing techniques such as fault avoidance or fault prevention are applied during the software development process in order to reduce the number of faults introduced during construction [Pul01]. Avoidance or prevention techniques address specific issues with respect to specification, software design methods, system requirements, and reusability [Pul01]. Strigini [Str12] explains that these techniques and strategies may become inadequate as patterns such as loose coupling and microservices make computing systems grow and change as time progresses. These changes are prone to introducing new faults and errors which, most of the time, cannot be foreseen by the designers responsible for applying fault avoidance strategies.
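For reference, and not as a formula taken from [SI03], the standard reliability-theory definition expresses the MTTF as the expected time to failure. With R(t) denoting the reliability function, i.e., the probability that no failure has occurred up to time t, it reads:

\mathrm{MTTF} = \mathbb{E}[T] = \int_{0}^{\infty} R(t)\,\mathrm{d}t, \qquad \mathrm{MTTF} = \frac{1}{\lambda} \;\text{ for a constant failure rate } \lambda .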

2.5.3. Fault Injection

Fault injection is a common and well-known technique for researching the dependability of an application. Fault injection can either be applied at the software level or the hardware level [HTI97]. Both approaches differ in their type and level of injection. Fault injection is used to introduce failures into the system in order to investigate the system's dependability. As mentioned in previous sections, dependability comprises different attributes, means, and threats. Fault injection is one means to research dependability, as illustrated in Figure 2.6. Moreover, fault injection has been used as an experimental approach by various engineers. Chaos engineering applies fault injection during production time; therefore, fault injection is an important technique used by chaos engineering. However, compared to the application of fault injection within a chaos experiment, common fault injection approaches are only applied during the conception, design, or testing stages [HTI97], whereas chaos experiments explicitly apply fault injection during the production stage.


Nonetheless, fault injection is a valuable technique used within a chaos experiment. Whereas fault injection can be applied at the hardware or software level, chaos engineering solely injects failures at the software level to simulate real-world scenarios in order to get a better understanding of the application's behavior. Chaos engineering tools use the concept of fault injection to inject failures into a system in a controllable way, e.g., injecting delayed service calls to a component. The main purpose of fault injection is preserved, but its application area has shifted from an isolated environment to the more lively production environment. Resilience is related to dependability and must therefore be treated as a dependability-affecting concept. Furthermore, Date et al. [Dat08] identify a meaningful extension of fault tolerance by describing resilience as the ability of a system to deliver, maintain, and improve service when facing threats and future changes. Consequently, improving a system's resilience makes the system more fault tolerant when facing threats and future changes. The affiliation with dependability leads to the assumption that other dependability-affecting measures may be of interest when assessing resilience. The following section explains the principles of chaos engineering, which leverages fault injection to inject faults and failures into a system.

2.6. Chaos Engineering

This section explains the main concern of this thesis. Chaos engineering is a powerful method to improve the resilience of and the confidence in large-scale distributed systems. Chaos engineering, formerly known as resilience testing, has been established and pushed forward by Netflix, growing into a large culture of efficiently assessing the resilience of large-scale distributed systems. The previous sections explained that benchmarking, and resilience benchmarking as well, is executed in an experiment-like manner. Chaos engineering in particular adopts this experimental nature by executing chaos experiments during a system's production stage. Chaos engineering can be viewed as an approach to resilience benchmarking and is therefore a methodology to effectively assess the resilience of a system. Chaos engineering as well as the term "chaos" are often misinterpreted as randomly injecting chaos into a system in order to break it. This is not necessarily true. Rather, chaos engineering is viewed as an empirical approach to systematically surface weaknesses by exposing the system to turbulent situations [BBR+16]. The exposure to these situations takes place within controllable experiments that are conducted most efficiently in a production environment [BBR+16]. More precisely, chaos engineering can be defined as a principle which describes the execution of thoughtful, planned experiments designed to reveal weaknesses in a system [Med18].


Chaos engineering grew in popularity since common benchmarks are mostly ad hoc and lack the possibility to efficiently assess a distributed system's resilience [HADP18] [VWAM12]. Although they can be used to assess a software system's resilience, benchmarks struggle with one-of-a-kind software systems due to the lack of domain-specific benchmarks, as discussed by Gray [Gra93]. Thus, chaos engineering fills the gap by providing a meaningful possibility to give information about a system's resilience as well as surfacing weaknesses that put the system's dependability attributes at risk.

2.6.1. Chaos Experiments

Essentially, chaos engineering is about executing chaos experiments in production. During a chaos experiment, chaos variables are created and purposely injected into the system under test. These variables represent real events such as server crashes, service malfunctions, delayed response times between service calls, and so on. The number of variables and their values are unlimited. The goal is to find possible weaknesses within a system before they cause serious harm in production and possibly affect the user, thereby reducing the number of preventable outages and improving confidence in the system. Basiri et al. [BBR+16] state that each chaos experiment is driven by the advanced principles for chaos engineering experimental design. The following steps have to be taken:

• Hypothesize about steady state
• Vary real-world events
• Run experiments in production
• Automate experiments to run continuously
• Minimize blast radius

A steady state refers to a state of the system in which the system is considered to be working as expected. Therefore, the system provides its implemented functionality to its full extent. The steady state varies from system to system. A system's steady state can be expressed as a set of metrics or attributes of the system, such as the number of incoming HTTP requests, the number of running Docker containers, and so on. In order to execute a meaningful experiment, a hypothesis has to be stated about the steady state of the system. Usually, such hypotheses are stated in a form like "The events we are injecting into the system will not cause the system's behavior to change from steady-state" [BBR+16]. The core of each experiment is the injection of chaos variables or real-world failure scenarios. The events injected are driven by the motivation of the chaos experiment,


meaning that they depend on the weakness or the part of the system that is under observation. Failure scenarios could be realized as server crashes, service malfunctions, delayed service calls, or injected latency. Each failure scenario should represent an event that is likely to occur in production; thus there exists a large number of scenarios that can be injected during the experiment. However, it is not necessary to enumerate all possible events but merely to inject those events that are frequent and harmful to the system. Just running a failure scenario without performing any kind of monitoring or measurement is meaningless. Each experiment aims to measure some kind of effect that either supports the stated hypothesis or disproves it. In order to know whether the hypothesis about the steady state holds, it has to be measured whether the steady state deviates from its expected values. A deviation of the metrics considered for the steady state provides good evidence that the injected failure scenario has some kind of impact on the system. In the end, this impact on the system is the effect that engineers aim to measure. If the injected failure scenario has a serious impact on the system, then there should be a large deviation from the steady state, which means that there might be a potential weakness in the system. One core aspect of chaos engineering is running experiments in production. This might sound contradictory since during a chaos experiment the system will intentionally be exposed to turbulent situations. Therefore, it is important to minimize the blast radius of an experiment. The blast radius is the area of effect which a chaos experiment causes on the system, e.g., injecting latency into a service call may cause multiple components to go down and thus shut down the entire system [BBR+16]. Minimizing the blast radius refers to the technique of identifying weaknesses of the system without accidentally breaking the entire system [BBR+16]. This should always be kept in mind when applying chaos engineering. The last step in applying chaos engineering is to automate experiments in the long run. Automation plays a huge role in software engineering and makes uniform, repeatable tasks easier by executing them autonomously. At first it is suggested to manually run small chaos experiments, but as the size of the experiments and the confidence of the engineering team increase, chaos experiments should be automated.
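One way to make the blast radius tangible is an automated abort condition that watches a steady-state metric while the experiment runs. The following sketch is one possible realization under stated assumptions: the metrics endpoint, the JSON shape of its response, and the tolerance value are hypothetical and not part of the case study tooling.

# Minimal sketch: watching a steady-state metric during an experiment and aborting
# when the blast radius grows too large. Endpoint, response shape, and tolerance are hypothetical.
import time
import requests

METRICS_URL = "http://localhost:8080/metrics/error-rate"  # hypothetical monitoring endpoint
ERROR_RATE_TOLERANCE = 0.05                                # abort above 5% failed requests
CHECK_INTERVAL_SECONDS = 10

def current_error_rate():
    # How the error rate is obtained depends entirely on the monitoring stack in use.
    response = requests.get(METRICS_URL, timeout=5)
    return float(response.json()["value"])

def watch_blast_radius(duration_seconds=120):
    """Return True if the steady state held for the whole window, False if aborted."""
    deadline = time.time() + duration_seconds
    while time.time() < deadline:
        if current_error_rate() > ERROR_RATE_TOLERANCE:
            print("Steady state violated beyond tolerance, abort the experiment and roll back.")
            return False
        time.sleep(CHECK_INTERVAL_SECONDS)
    print("Steady state held for the whole experiment window.")
    return True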

2.6.2. Chaos Engineering at Netflix

With their migration to the cloud in 2008, Netflix realized that they had to design a cloud architecture in which individual components can fail without affecting different dependability attributes of the system [BBR+16]. As a solution, Netflix adopted resilience testing in order to keep improving the resilience of their cloud infrastructure.


Over the years, Netflix's approach to resilience testing has come to be known as chaos engineering. With a large microservice landscape that resides in the cloud, Netflix, and any other company that adopts the microservice architecture pattern, gains velocity and flexibility, but at the cost of human understandability. This lack of understandability creates the opportunity for chaos engineering [BBR+16]. Later, Netflix invented and introduced the Chaos Monkey [IT11] tool which, at first, randomly selected Amazon Web Services (AWS) instances and shut them down. Over the years, Netflix extended their Simian Army with further tools, each of which is at its core an extension of the Chaos Monkey. Next to Chaos Monkey, Netflix also invented the Failure Injection Testing (FIT) platform. FIT is a platform to inject failures in a more controllable way than the Chaos Monkey. The workflow of the FIT platform is as follows [Net14]: The FIT service pushes failure configuration data to Netflix's API gateway Zuul. Selected requests are then decorated with the failure configuration metadata. Afterwards, the requests are proxied to their destination services. Every service that is touched by the infected request checks the request's failure context. If the failure context matches the corresponding component, the service simulates a failure as defined in the failure configuration data [Net14]. By using these tools in combination with the steps described by Basiri et al. [BBR+16], Netflix is able to mitigate possible outages before they have a serious impact on customer services. To summarize, chaos engineering aims to execute experiments on a distributed system in order to expose weaknesses and possible failure sources. Early failure detection enables engineering teams to mitigate the impact on the production system by initiating appropriate actions. This increases the confidence in the system's resilience. Large distributed systems are complex and chaotic in themselves. This is reinforced by the fact that such systems are dynamic and constantly changing. Hence, it is unavoidable that during the development process failures are unintentionally introduced into the system or emerge due to the interconnection of many services. Instead of waiting until an outage happens, possible outages should be tackled in advance by executing controlled experiments and by observing the behavior of the system, thus gaining new knowledge about the system. Tools such as Chaos Monkey and Chaostoolkit make the execution of chaos experiments easier. On the basis of newly gained knowledge about the system, engineers have the possibility to take actions to remove possible weaknesses that otherwise would have a harmful impact on the system. This work focuses on the execution of a chaos experiment applied to a case study application. The case study application is described in detail in Section 1.3. Chaos experiments have single components, services, or even entire subsystems as targets. Attack areas can be separated into the application level, container level, and infrastructure level. For example, making an HTTP call fail for a specific service or slowing down a database call for a specific table is considered to be an application-level attack [Greb].


Application-level attacks solely target the application and its dependencies. Container attacks are targeted towards the containers that host an application. For example, shutting down a running Docker container is considered to be a container-level attack. Chaos Monkey is a good example of an infrastructure-level attack: it shuts down the AWS instances that host the application. Within the scope of this work, the chaos experiment will be executed on three different levels, thus targeting the application, container, and infrastructure level. The technologies used within each level are explained in Section 1.1.7. Running a chaos experiment in production is required in order to provide meaningful insight into the application's behavior; however, this is out of the scope of this thesis. Since the case study application is a proof-of-concept implementation, the experiment is conducted within a test environment. The data to define the steady state are obtained by running load tests against the case study application in order to simulate production data. Each step discussed in this section, other than the automation of an experiment, will be performed as a means to make the experiment as efficient as possible.

2.7. State of the Art Tools and Methods

The following subsections analyze different tools and methods that are currently state of the art for executing chaos engineering.

2.7.1. Netflix’s Simian Army

Netflix maintains a highly complex distributed system based on thousands of microservices that are constantly changed and improved by many engineers. Netflix established a unique way to conquer the chaos within their distributed system in order to expose possible weaknesses and improve their system's resilience. At Netflix, engineers push new code into production on a daily basis. More than other services, Netflix heavily relies on its service availability: if a customer is not able to watch a video because of a service outage, he or she might not be a customer for long [BBR+16]. As discussed earlier, it is not enough to just implement fault tolerance mechanisms; the system must also be able to withstand disruptions as effectively as possible. Moreover, a changing production environment implies the presence of harmful failures and outages that could negatively affect the system. These outages are not foreseeable by a human, since the complexity of the system implies unknown behavior. To cope with the chaos inherent in these systems, Netflix invented the Chaos Monkey tool [IT11].


Netflix advances the view that in order to handle hazardous situations, one has to be confronted with them frequently. As an example, the same holds for changing a tire: in order to know how to do it, one has to do it frequently [BBR+16]. That is where the Chaos Monkey comes into play. Chaos Monkey is an internal service which randomly selects virtual machine (VM) instances and terminates them. As illustrated by the previous example, the purpose of this exercise is to train engineers to design resilient components which can handle outages instead of trying to prevent them. This is achieved by frequently confronting engineers with situations where outages happen. In order to prevent seriously harming the system by these random outages, Chaos Monkey is only active during common working hours so that engineers can respond quickly to any failures [IT11]. Over the years, Netflix has recruited more simians to their service in order to improve the resilience and availability of their distributed system. The so-called Simian Army comprises the services shown in Table 2.5.

Tool name          Description
Chaos Monkey       Randomly terminates virtual machine instances and containers
Chaos Gorilla      Simulates outages in Amazon availability zones
Latency Monkey     Induces artificial delays in RESTful client-server communications
Doctor Monkey      Provides health checks and monitors external signs of health
Janitor Monkey     Removes unused resources inside a cloud environment
Conformity Monkey  Shuts down instances that don't comply with best practices
Security Monkey    Extension of Conformity Monkey that finds security violations
10-18 Monkey       Detects configuration and run-time problems

Table 2.5.: Chaos tools used by Netflix, also known as the Simian Army [IT11]

To summarize, the Simian Army is a set of tools that Netflix uses in order to improve their system infrastructure's resilience as well as its availability. Recently, other platforms and companies have adopted the concept of Netflix's Chaos Monkey to improve the resilience and other dependability attributes of their systems. Conspicuous is the mechanism of randomness that Netflix uses to select VMs or AWS regions. This mirrors the nature of failure occurrences in real-world scenarios. The failure scenarios that Netflix and other engineers try to tackle during a chaos experiment may also appear in a production environment, where failures could have a serious impact on the users. Moreover, it is not foreseeable when and where such events happen. By adding random selection, as the Chaos Monkey tool does, Netflix is able to almost perfectly simulate those events by not knowing when and which


components are going to be shut down. As illustrated in the section on chaos engineering, the precise simulation of real-world events is required in order to get meaningful results for improving resilience. However, the act of randomly selecting components or VM instances seems counter-intuitive to the definition of chaos engineering given by Medina [Med18], according to which chaos engineering is driven by executing thoughtful and planned experiments. Nonetheless, Netflix's Chaos Monkey approach to chaos engineering is indeed effective, but one does not have to adopt random failure injection in order to do effective chaos engineering.

2.7.2. Failure-as-a-Service: Gremlin

Gremlin is a tool built to make chaos engineering easier to apply to systems. It provides two fault injection levels, namely the infrastructure level and the application level. The first level of faults focuses on reproducing issues with the machines the application runs on, such as CPU being consumed, disks filling up, and network hosts becoming unreachable or slow. The second level concentrates on simulating issues with the application and its dependencies, such as making an HTTP call fail for a specific user or making a database call slow for a specific table. Moreover, Gremlin provides a complete application programming interface (API) to interact with, making it unnecessary to use the Web UI which is also provided by Gremlin [Greb]. Gremlin can be run on different platforms such as Azure, Amazon Web Services (AWS), or Kubernetes. Gremlin divides its attacks into different types of "gremlins" such as network gremlins, state gremlins, and resource gremlins, each attacking different parts of the system. Maintainability is preserved by allowing an attack schedule to be defined, where attacks can be set to specific times of a week or a day. The possibility to integrate Gremlin as a Slack bot helps to keep a better overview of when attacks are executed and what the actual results are [Greb]. In contrast to Netflix's Simian Army, Gremlin takes on the roles of multiple "simians". Whereas Netflix provides different "simians" for different attack points, Gremlin merges multiple functionalities into one tool, thus reducing setup and integration effort. Gremlin is easy to use due to a Web UI that enables complete configuration of single gremlin attacks. Unlike Netflix's services, Gremlin gives a better intuition of how chaos engineering should be done: it provides control over attack points, thus supporting the aspect of executing thoughtful and controlled chaos experiments rather than relying on complete randomness as Netflix's Chaos Monkey does. This shows that current chaos engineering tools, such as Netflix's Chaos Monkey and Gremlin, come with different levels of granularity. Different

aspects of a system may call for a different level of granularity, which makes some tools more useful than others. This means that applying more than one chaos engineering tool may be beneficial, depending on the system under test.

2.7.3. Chaos testing for Docker with Pumba

Pumba is a chaos testing and network emulation tool for Docker created by Alexei Ledenev and available on GitHub. Its name refers to a character from Disney's animated film The Lion King and has its origin in the Swahili language. Pumba is able to disrupt running Docker environments by injecting different failures. Among other things, Pumba's skill set contains commands for killing, stopping, removing, or pausing Docker containers. Additionally, it is able to perform network emulation with the possibility to simulate different network failures such as delay, packet loss, and bandwidth rate limits [Led17]. Interestingly, one of Pumba's features is the possibility to either pass a list of containers that should be affected or tell Pumba to randomly select a number of containers. If neither is provided, the tool will try to disrupt all running containers. To get better control over the chaos, it is possible to schedule time intervals and duration parameters. As for portability, Pumba is available as a single binary file for Linux, macOS, and Windows, or it can be run inside a Docker container [Led17]. Pumba adds another level of granularity, namely the container level. Whereas the Chaos Monkey merely focuses on shutting down AWS instances, Pumba solely focuses on chaos testing of the Docker environment. This, in combination with Gremlin for example, adds a high level of flexibility to a chaos experiment by making it possible to disrupt each level of abstraction of a system, namely the application, infrastructure, and container level. In addition, the ability of both tools to control time intervals and attack points allows attacks to be focused on selected parts of a system, such as combining container- as well as application-level disruption. Gremlin also provides the ability to run chaos experiments on Docker containers. However, using two different tools for two different concerns may improve the overall maintainability of the tool chain. Therefore, it is possible to use Gremlin or Chaos Monkey on the application or infrastructure level and Pumba as a chaos testing tool for the Docker environment.


2.7.4. Docker Chaos Monkey

Docker Chaos Monkey is an adaptation of the Netflix Chaos Monkey provided as a shell script. Its purpose is to kill or shut down Docker Swarm services, but only those that are specifically labeled as disposable. In order to identify services for an attack, the tool provides a labeling mechanism where services can explicitly be labeled as disposable. Therefore, the Chaos Monkey for Docker does not randomly kill Docker Swarm services but merely those that are labeled as candidates for an attack [Pat]. Docker Chaos Monkey is a simple alternative but provides much less functionality than, e.g., Pumba or Gremlin. However, it is easily configurable via the Docker CLI, where Docker Swarm services merely have to be flagged accordingly.

2.7.5. ChaosToolkit

Chaostoolkit is a popular, open-source, and easy-to-use chaos experimentation tool. It has been developed to provide a free-to-use possibility to easily run chaos experiments on different levels within a system. Experiments are stated within JSON or YAML configuration files, whose steps are executed consecutively. As monitoring is a very important aspect of chaos engineering, Chaostoolkit provides meaningful information after each experiment in order to allow a better post-experiment analysis and to get the maximum value out of each experiment run. Furthermore, Chaostoolkit comes with built-in extensions for different levels of application such as platform/infrastructure-level injections, application-level injections, and monitoring. Extensions for Kubernetes, Azure, Google Cloud, Spring, Gremlin, and integration with Slack are also available. Lastly, Chaostoolkit has integrations for Netflix's Chaos Monkey as well as Chaos Monkey for Spring Boot ready to use, which can also be disabled if necessary [Chaa]. If the provided functionality is not enough to run chaos experiments for a system, Chaostoolkit provides an open API to extend its functionality in order to make it suitable for any application [Chaa]. Chaostoolkit can therefore be considered a powerful and easy-to-use tool which already comes with a lot of functionality built in. In contrast to the previously mentioned tools, Chaostoolkit merges functionality from different tools and thus provides it all in one place.


2.7.6. Chaos Monkey for Spring Boot

Chaos Monkey for Spring Boot is, as the name implies, yet another adaptation of Netflix's Chaos Monkey. Netflix has been pushing chaos engineering in the past years, at first with the release of the Chaos Monkey tool. Chaos Monkey for Spring Boot is a small library that aims to attack running Spring Boot applications. The library can easily be integrated via a POM file. The Monkey does not start attacking the application upon initialization; since the latest release, it is possible to enable or disable the Monkey dynamically at runtime [Wil18]. The Chaos Monkey will scan the application and look for specific annotations, namely:
• @Controller
• @RestController
• @Service
• @Component
• @Repository
Each of these is connected with a watcher that looks for the respective annotation. Currently, the Chaos Monkey provides three attacks: Latency Assault, Exception Assault, and KillApp Assault [Wil18]. The level of granularity (which annotation to scan for and which assault to launch) can be entirely configured within the Chaos Monkey properties. Much like Pumba and Kube Monkey each focus on a single level, Chaos Monkey for Spring Boot focuses merely on assaulting the application-level parts of a system. In combination with Pumba and Kube Monkey, this set of tools provides functionality to cover all levels of a system for performing chaos engineering. This results in a broader toolchain but keeps the concerns separated per tool. Nonetheless, as Chaostoolkit integrates Chaos Monkey for Spring Boot, it might be worth considering Chaostoolkit instead, since it keeps the toolchain smaller and the functionality in one tool, which makes maintenance and monitoring easier.

2.7.7. Kube Monkey

Kube Monkey is an adaptation of Netflix's Chaos Monkey especially for Kubernetes clusters. Kube Monkey randomly shuts down a pod within a cluster, much like Chaos Monkey shuts down random AWS instances. If there are pods which should not be considered by Kube Monkey, because the blast radius might be too high, a blacklist can be created which describes the pods inside a cluster that should not be attacked by Kube Monkey. Kube Monkey, like Pumba, can be scheduled and is by default configured to be


running early in the morning each day. Additionally, Kube Monkey builds a schedule of deployments that will be affected during the week but does not provide information about when those deployments will be affected. Moreover, the number of pods to be killed can be configured in various ways, such as telling Kube Monkey to kill only one pod (the default) or to kill a certain percentage of the running pods inside a cluster. Highly valuable is the possibility to set a value for the Mean Time Between Failures (MTBF) inside Kube Monkey. For example, if set to "3", Kube Monkey will shut down a pod every three days. This gives better, fine-granular control over the timing of a chaos experiment and provides a good solution for automating chaos experiments on a Kubernetes cluster [Sob].

2.7.8. Chaos Engineering Game Days

Chaos Engineering Game Days are a common strategy for introducing and applying chaos engineering in a company. A game day typically consists of a preparation or planning phase and an execution phase. The preparation phase is needed in order to prepare the setting and the information needed for the execution phase. Common information includes [Grec]:

• Target system, subsystem, or component
• Time and place
• Goal for the game day

A game day is a workshop that takes approximately two to four hours. The goal is to gather ideas for possible chaos experiments and to run them live in order to observe the outcome. Thoughtful preparation is needed because it is a hands-on meeting where a lot of people get together. Defining the target in the planning phase implicitly gives information about the people that should attend the game day. These people include engineers, managers, or any other stakeholders that have a good understanding of the target system. The goal of a game day may differ from time to time. A possible goal could be to state relevant test cases for a chaos experiment or to investigate whether a new subsystem has all the right monitoring tools integrated that are necessary for the deployment to production. During the game day, it is considered a best practice to model the current state of the system under test. This way, everyone attending the meet-up is up to date with the current state of the system, and it also makes possible failure points visible [Grec]. After the identification of possible failure points, it is time to think of possible failure scenarios. It is important to consider the blast radius during the game day. This implies that it is necessary to have an abort plan in case a failure has an unexpectedly large blast radius.


Additionally, to run live experiments efficiently during a game day, the right tools have to be integrated, such as monitoring, metrics, and so on. Establishing metrics for future game days and chaos experiments can also be stated as a goal for a game day. The metrics are necessary in order to know what to look out for during the execution phase. Otherwise it is not possible to observe whether a specific test case has a meaningful impact or causes the expected effect on the system [Grec]. To summarize, a game day is a popular method to start out with chaos engineering. However, planning a game day takes time and has to be done diligently. The core aspect of a game day is the people attending it. If none of the people attending has deeper knowledge about the system under test, it becomes dangerous and difficult to draft the current architecture of the system and create test cases. This might result in unknown outcomes that have a large blast radius. Thus, it is important to ensure that at least some of the people from the team responsible for the system under test are present.


Chapter 3

Related Work

The term chaos engineering has been characterized by Netflix's way of improving the resilience of their cloud infrastructure [BBR+16]. Intentionally inducing chaos into the production environment has now become a part of Netflix's development culture. Due to the novelty of chaos engineering and its unusual nature in tackling resilience, there is an absence of scientific resources. Moreover, currently available resources are very generic in terms of how chaos engineering should be applied to any kind of distributed system or executed within a project's life cycle. At the time of writing, scientific research resources in terms of case studies, scientific papers, and the like are limited. Therefore, this chapter also includes non-scientific resources, which are referred to as grey literature. Grey literature comprises material that is produced by academics, business, and industry, either in a printed or in an electronic format, but that is not controlled by commercial publishing interests [FF98]. This chapter is structured as follows. Section 3.1 explains the search strategy that has been used to sight literature and other resources. Section 3.2 introduces the literature most relevant to chaos engineering. Section 3.3 discusses different professional conference contributions. Section 3.4 provides information on other related case studies. Section 3.5 summarizes the findings of this chapter.

3.1. Search Strategy

Starting the research with resources provided by Netflix, a couple of keywords stood out that seemed promising for further investigation. Among others, the following keywords have been used during further research:


• chaos engineering
• resilience engineering
• chaos engineering netflix
• incident management netflix
• site reliability engineering
• chaos engineering case study/field study
• conference papers chaos engineering

The above itemized keywords were the most dominant terms mentioned in grey literature regarding chaos engineering. On top of using different keywords within different search engines, different mentions by contributors in the field of chaos engineering have been followed, e.g., case studies mentioned by companies, specific conference contributions, and so on. Most contributions in the field of chaos engineering converge to professional conference talks given by well-known people in the field. Netflix, as a main contributor to the chaos engineering culture, yielded high value for providing a high-level picture of chaos engineering, but provided little to no value for a deeper understanding of chaos engineering.

3.2. Chaos Engineering by Basiri et al.

Probably the most referenced and iconic work related to this thesis is Chaos Engineering by Basiri et al. [BBR+16]. In their book, the authors focus on explaining the principles of chaos engineering as well as its benefits and drawbacks, and how to apply chaos engineering to a distributed system. In their research, Basiri et al. explain different terms related to chaos engineering such as steady state, hypothesis, failure scenarios, and blast radius. Additionally, the work gives detailed information about examples of incidents that have happened at Netflix and other companies, which probably could have been avoided if chaos engineering had been applied. Furthermore, Basiri et al. identified five key steps to successful chaos engineering. These steps include [BBR+16]:
• Form steady state hypothesis
• Real-world event variation
• Run experiments in production environment
• Experiment automation


• Mitigation of blast radius
The different steps and how they contribute to the motivation behind chaos engineering are thoroughly explained by the authors. Nonetheless, the content of the work by Basiri et al. is as generic as its title. It provides meaningful insight into chaos engineering processes, their motivation and benefits, as well as useful insights into Netflix's experiences with chaos engineering. Therefore, the work by Basiri et al. is a great summary of chaos engineering, what it comprises, when to apply it, and what its benefits are. However, it is clearly missing a practical reference, e.g., forming an example hypothesis or describing a steady state with real-world metrics. Despite the thorough explanations, it is not clear how to practically apply or execute a chaos engineering experiment on a distributed system.

3.3. Professional Conference Contributions related to Chaos Engineering

In the past few years, the number of conference contributions in terms of keynotes has increased in the chaos engineering community. The Chaos Conference [Grea] is an event solely dedicated to contributing new insights, techniques, and methods to improve chaos engineering as a practice. The discussed topics range from explaining the principles of chaos engineering to changing the mindset of engineers to facilitate these principles. Probably the most popular speakers are Nora Jones and Kolton Andrus. At the time of writing, Nora Jones is the Head of Chaos Engineering and Human Factors at Slack, having formerly been a Senior Software Engineer in the Resilience (Chaos) team at Netflix. Nora Jones' contributions primarily consist of explaining chaos engineering as a practice, its benefits, how to design microservices for a chaos engineering approach, and how chaos engineering differs from a traditional testing approach. She was also involved in the writing and publishing of Chaos Engineering by Basiri et al. [BBR+16]. Most of her contributions provide a distinction between traditional testing approaches and chaos engineering. This is followed by an explanation of the principles of chaos as well as an introduction to different types of chaos failures within Netflix. Nora Jones' contributions are a great possibility to get a better overview of chaos engineering and its applications as well as its benefits. However, her contributions lack the possibility to get a better understanding of how to design a chaos experiment for a specific application. Another well-known person, increasing in popularity, is Kolton Andrus. Kolton Andrus, formerly a Senior Software Engineer in Netflix's Resilience (Chaos) team and now CEO of the Gremlin platform [Greb], is known for his contributions towards a


more controllable approach to chaos engineering. Chaos engineering is known for its randomized approach to causing chaos. Andrus' approach with Gremlin is more controllable by providing precise definitions of attacks for specific levels of an application. In general, Gremlin distinguishes between application-level attacks and infrastructure-level attacks [Greb]. Nonetheless, there is still randomization within the process in order to avoid bias. But by providing a detailed guide to certain chaos attacks, Andrus allows a more in-depth view into the chaos experiment process. Another approach to non-randomized chaos engineering is given by van Hoorn et al. [HADP18]. The ORCAS project aims to reduce the overall number of chaos engineering experiments that are necessary to find vulnerabilities. The key idea is to leverage architectural knowledge of the system, which is given to a decision engine that then runs a dedicated number of appropriate chaos engineering experiments. The chaos engineering experiments are selected based on the given architectural information, e.g., the number of microservices and the existing resilience patterns and their configuration. Consequently, the ORCAS project is yet another contribution which does not leverage randomization as part of chaos engineering experiments, unlike the Simian Army [IT11]. There are many more contributors who are continuously pushing the field of chaos engineering. Lorin Hochstein, who also participated in the writing of Chaos Engineering by Basiri et al., provides a cumulated list of important contributors to the field of resilience engineering [Hoc]. Altogether, the resources in terms of grey literature are, strictly speaking, either quite generic or very specific. For example, most talks given by several known chaos or resilience engineers are about the necessity of chaos engineering, its principles, and its benefits. Other talks explain very specific topics such as designing microservices for failure, changing the mindset of engineers for chaos engineering, pitfalls to avoid, and so on. A great example of this is a new approach to optimize the process of finding failures during chaos experiment execution, established by Kolton Andrus and Peter Alvaro [ARH15]. Lineage Driven Fault Injection (LDFI) describes a backwards reasoning mechanism which tries to follow different paths down to the root of an outcome in order to find failures that could have prevented the outcome from occurring [ARH15]. The novelty of this approach is not debatable, but it has little to no purpose for an organization or group of people who want to adopt chaos engineering in the first place. Nonetheless, this approach seems promising for advanced chaos engineering applications. Summarized, the contributions in the field of chaos engineering referred to as grey literature are a good place to start but lack a detailed guide to designing chaos experiments for specific system types.


3.4. Case Studies

Profound case studies in the field of chaos engineering are scarce or non-existent. At the beginning of chaos engineering, most companies shied away from applying it due to the lack of knowledge of the art and missing guidelines. This was also mentioned during a talk by Nora Jones given at AWS re:Invent in 2017 [Jon]. Yet, a lot of companies have started to adopt chaos engineering in the past years, such as Facebook, Amazon, Microsoft, Dropbox, and several others from different industry branches. A mind map of companies that adopted chaos engineering, as well as the tools they used, can be found in the appendix. Despite the increasing number of companies adopting the principles of chaos, there is still a lack of profound case studies. Nonetheless, a few have been conducted and are even available to the public.

3.4.1. Case Study at IBM: Systematic Resilience Testing of Microservices

The first case study was conducted at IBM and published in 2016. Heorhiadi et al. [HRJ+16] stated that most modern internet applications are implemented as loosely coupled microservice architectures. Usually, these loosely coupled services are updated and deployed multiple times a day. Therefore, a new approach to testing the resilience of such applications is required. Heorhiadi et al. continued by summarizing recent incidents of popular and highly available internet services such as Spotify. They indicated that these incidents can be traced back to missing or insufficient failure-recovery logic [HRJ+16]. Moreover, they state that unit and integration tests are not able to prevent those outages. Unlike Netflix, who state that a randomized approach is more valuable due to the unbiased selection of failing components [BBR+16], Heorhiadi et al. argue that a systematic testing approach is more valuable. It is explained that a systematic approach to chaos engineering yields a better opportunity for developers to re-run specific failure cases and receive immediate feedback [HRJ+16]. It is argued that this feedback loop is more valuable than a randomized approach, since developers are able to quickly locate and fix certain failure cases [HRJ+16]. The idea of a systematic approach to chaos engineering has also been proposed by Kolton Andrus, who created Gremlin. This case study shows that chaos engineering now moves on a scale between two endpoints: one being a randomized approach, the other being a more systematic approach. As outlined in Section 2.7,


more tools that allow a more fine-grained selection of which components to attack emerge on the market, such as Pumba for Docker [Led17]. However, the case study focuses on explaining how to use Gremlin for fault injection attacks on a microservice-based system. Considering the different steps involved in chaos engineering, it is reasonable to say that this case study merely covers the step of executing a chaos experiment. The steps before and after are neglected. Hence, it is not stated how the steady state is described nor how a hypothesis is formed. Heorhiadi et al. provide a very focused and specific view on executing a chaos experiment by showcasing how Gremlin can be used to systematically test the resilience of a microservice-based architecture.

3.4.2. Case Study at Oracle: Lessons Learnt of approaching Chaos and Resilience in a Mid-Sized Organization

The paper by Nazareno [Fei18] is not necessarily considered a profound case study but is rather close to an experience report on applying chaos engineering. The key points of the paper outline a couple of pitfalls regarding the organizational adoption of chaos engineering and an approach to evaluating different chaos engineering tools. It is important to note that Nazareno explains in his paper that leveraging the architecture of an application yields a lot of potential when applying chaos engineering, thereby neglecting the core statement of chaos engineering that experiments must be run in production [Fei18]. Nazareno took another approach by slowly moving from an architectural proof-of-concept level to the testing stage, leaving chaos experiments in production as a final destination. Nazareno states in his paper that a lot of resources related to chaos engineering are available but that none of them take an approach that leverages the architecture level to design the experiments. However, the paper does not provide detailed insight into how chaos experiments are designed and applied, but rather into how tool evaluation is taken care of at Oracle. Additionally, organizational changes are considered and explained in detail, but the analysis of these changes is out of the scope of this thesis.

3.4.3. Case Study at HomeAway

Another case study has been conducted at the company HomeAway [Mil18]. HomeAway is an online booking platform for people who travel around the world and allows them to quickly find places to stay in their favourite countries. The following information was available as an interview posted on Medium by Russ Miles of ChaosIQ.


During the interview, the contributors Eduardo Solis and David Sumera explain how applying chaos engineering to their business and extending the ChaosToolKit for their needs has evolved their perception of testing resilience [Mil18]. To summarize, the contributors integrated the ChaosToolKit, and therefore an approach to chaos engineering, into their Continuous Integration (CI) and Continuous Delivery (CD) pipeline. Additionally, they extended the tool with functionalities such as monitoring and dynamic injection of failures into their system [Mil18]. Unfortunately, the contributors did not provide information about how chaos engineering is applied to their processes, how they design chaos experiments, or how they select appropriate failures to inject with the assistance of the ChaosToolKit.

3.5. Summary of Findings

To summarize, resources that provide a detailed overview of every step in the chaos engineering process are scarce or non-existent. The case studies described above are in general not profound case studies but yield insights as experience reports on chaos engineering. The grey literature comprises a lot of information about chaos engineering, its benefits, what tools to use and how to use them, designing microservices for failure, and what to focus on during chaos engineering. However, most contributions are quite generic in terms of their field of application, such as designing components for failure or experience reports on chaos engineering. On the other hand, contributions become too specific by providing approaches to further optimize the finding of potential outages, such as the one by Alvaro et al. [ARH15]. To fill the gap, this thesis shall provide a better understanding of the entire chaos engineering process by thoughtful planning and execution of a chaos engineering experiment on a large-scale distributed system.


Chapter 4

Case Study Design

The usage of case study research has been increasing in multiple research areas over the past years [HKD14]. Case studies in particular have proven to be valuable for investigating phenomena of software systems in their natural context. Conducting a case study comprises multiple steps and techniques that have been explained in Section 2.3.2. This case study design is based on the work of Wohlin et al. [WRH+12] and Runeson et al. [RH09]. This chapter is structured as follows. Section 4.1 describes the case study objective. Section 4.2 explains the software application under observation, referred to as the investigated system. Section 4.3 provides an overview of the data that will be collected throughout this work. Section 4.4 provides the methods and techniques used for data collection. Lastly, Section 4.5 describes how the data defined in Section 4.3 will be analyzed.

4.1. Objective

In this case study the objective is descriptive and can be stated as follows. The objective is to analyze the planning and design of chaos engineering experiments, for the purpose of providing a systematic approach to chaos engineering, with respect to an industrial case study application, from the point of view of a chaos engineer, in the context of a bachelor's thesis. The principles of chaos and various elements of chaos experiments have been elaborated in Section 2.6. The expected result of this case study is a profound understanding of how to conduct chaos engineering experiments on a distributed system. The process is supported by the application of hazard analysis methods in order to create meaningful experiments.


4.2. The Case

The investigated system implements a service which is responsible for the efficient handling of automotive configuration data. In its domain context, different systems obtain configuration data for further analysis and processing, e.g., Manufacturing Execution Systems (MES), configuration tools provided as web interfaces, and the like. Given the fact that the goal of these systems is to sell cars to customers, it is reasonable to imply that the provisioning of automotive configuration data has to be efficient in terms of performance, availability, and resilience. The investigated system is provided by a third-party company that is currently responsible for the system's development and maintenance. Currently, the system is stated to be a prototype of an already existing application which does not meet the requirements of the stakeholders. A set of technologies has been selected by the stakeholders, in addition to a deployment platform the application has to be deployed to in the future. At the time of writing this thesis, the application had not been deployed to production or a production-like environment. Thus, there is no field data available with respect to the system's performance, availability, or resilience. However, rough statements given by the stakeholders are available with respect to expected response times and user behavior, which will be used as a threshold for this work.

4.3. Preparation of Data Collection

In the following, multiple data sets will be identified which are necessary to further analyze the investigated system. The data sets are divided into data directly related to the investigated system and data not directly related to the investigated system. Data directly related to the investigated system comprises information such as the architecture description and the structural as well as behavioral modelling of the system. Data not directly related to the investigated system comprises information that is collected from the stakeholders, e.g., name and profession of participants. The different data sets are summarized in Table 4.1 and Table 4.2. The data sets outlined in Table 4.1 are further divided into the groups architecture and design, steady state, and hazard analysis. Architecture and design is separated into the two subgroups technical data and non-technical data. Technical data comprises information that describes the application by leveraging technical details such as architecture descriptions and views, sequence diagrams, used technologies, and so on. The architecture description will result in a high-level architecture diagram that explains the structure and composition of the investigated system. Behavior and structure serves the purpose of getting a better understanding of the interactions between components and


how users interact with the system. Explaining the technologies is crucial for the chaos experiment with respect to selecting the appropriate plugins and injection levels. The motivation to migrate to a different architecture explains the system's domain context and requirements. Non-technical data references information that may lead to the creation of technical data, such as functional and quality requirements, design decisions with respect to selecting appropriate technologies, and component design. Further information about design decisions and microservice decomposition leads to a better understanding of the system's design. Steady state data describes measured response times of the application as a technical metric that is used for observing the system's behavior during a chaos experiment. Hazard analysis data comprises the results collected from applying hazard analysis methods. Chapter 6 applies three hazard analysis methods to identify potential hazards and failures.

Architecture and Design
  Technical data:
    • Architecture description
    • Application behavior and structure
    • Technology stack
    • Motivation to migrate architecture
  Non-technical data:
    • Non-functional requirements
    • Functional requirements
    • Design decisions
    • Decomposition into services
Steady State
  • Response Times
Hazard Analysis
  • Fault Tree Analysis
  • Computational Hazards and Operability
  • Failure Modes and Effects Analysis

Table 4.1.: Data sets directly related to the investigated system


Demographical data
  Participant-related data:
    • Name
    • Age
    • Profession
    • Work experience
    • Period of employment
  Project-related data:
    • Current responsibilities in project
    • Retrospective changes
    • Benefits of current implementation
    • Drawbacks of current implementation

Table 4.2.: Data sets not directly related to the investigated system

Data not directly related to the investigated system is summarized in Table 4.2 and comprises demographical data. The demographical data is further divided into two groups, namely participant-related data and project-related data. Due to the anonymization of subjects, the participant-related data will not be provided in the Appendix. The demographical data is a means to ease the data collection process according to the funnel model [RH09]. During the data collection process it is possible that more data is raised than described in Table 4.1 and Table 4.2.

4.4. Data Collection Methodology

In order to know from which sources the data sets of Section 4.3 can be collected, a process to identify the information sources needs to be explained. In this thesis, the investigated case is a given. The people who develop the case are the primary information source. Nonetheless, there might exist other potential information sources.

4.4.1. Identification of Information Sources

From a requirements engineering perspective, the procedure to identify different information sources is similar to the identification of requirements sources. Pohl [Poh10] suggests an iterative approach for requirements elicitation, which can also be used to raise additional potential information sources for this case study. The people implementing the investigated system will serve as the primary information sources. The iterative process to raise further information sources is described as a three-step process [Poh10]:

1. Ask already identified stakeholders for additional information sources.
2. Investigate identified archival data for further potential information sources.


3. Analyze existing systems for additional potential information sources by examining interactions with other systems or users.

However, with respect to the investigated system, the elicitation of information is limited due to the non-disclosure of customer-related information. Therefore, the information sources stated in step two might not be accessible. The third step includes information sources that are not available, since the external services that interact with the investigated case are not maintained by the system's developers. This leads to the conclusion that the stakeholders developing the investigated case and the existing archival data are the only available information sources to raise the data according to Table 4.1 and Table 4.2.

4.4.2. Qualitative Data Collection Methods

Due to the limited number of previously identified information sources, only a few qualitative methods to collect the necessary data are worth considering. In the scope of qualitative information elicitation, given that merely stakeholders serve as a meaningful information source, it is possible to consider interviews, focus groups, questionnaires, and surveys as a means to collect the required information. The benefits and drawbacks of each method have been described in detail in Section 2.3.2. In conclusion, conducting interviews seems to be promising for gathering information from people about a case where no other information is available. This is supported by the fact that the required information is not available anywhere else than in the minds of the people who are interested in the investigated case [RH09]. Hence, a semi-structured interview approach is used to raise the information from the stakeholders. In order to identify and create failure models that can be injected into the investigated case, three hazard analysis methods known from the domain of safety assessment will be applied:

• Failure Modes and Effects Analysis
• Fault Tree Analysis
• Computational Hazard and Operability

These methods are described in Section 2.2, whereas a thorough application of each method is presented in Chapter 6.


4.4.3. Quantitative Data Collection Methods

In the scope of this case study, quantitative analysis will be restricted to evaluating the results of the chaos engineering experiments. This is based on the fact that there exists no previous information that justifies the application of descriptive statistics. As mentioned in Section 4.3, there is no field data available that could serve as a potential description of a meaningful steady state. Therefore, Apache JMeter is going to be used to provide threshold measurements of expected user behavior. In order to simulate user behavior, a load test is run against the investigated system with the assistance of Apache JMeter. The expected values for the user behavior are based on the findings of Section 2.1 and will be used as a threshold to define a steady state and create appropriate environmental conditions for the chaos experiments.
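As an illustration, such a load test could be driven and summarized with a short script. This is only a sketch under assumptions that are not part of the thesis artifacts: JMeter is assumed to be on the PATH, loadtest.jmx is a placeholder name for a prepared test plan, and the results file is assumed to be written in the default CSV format with an elapsed column holding response times in milliseconds.

    # Sketch: run a JMeter load test in non-GUI mode and summarize response times.
    import csv
    import statistics
    import subprocess

    subprocess.run(
        ["jmeter", "-n", "-t", "loadtest.jmx", "-l", "results.jtl"],
        check=True,
    )

    with open("results.jtl", newline="") as f:
        elapsed = [int(row["elapsed"]) for row in csv.DictReader(f)]

    print("samples:", len(elapsed))
    print("mean response time [ms]:", statistics.mean(elapsed))
    print("95th percentile [ms]:", statistics.quantiles(elapsed, n=20)[18])

The resulting mean and percentile values could then serve as the threshold against which the steady-state hypothesis of a chaos experiment is formulated.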

4.5. Data Evaluation Methodology

The interviews are going to be transcribed and compared to observe any repetition of information. Moreover, the findings are going to be edited, e.g., diagrams drawn by participants are illustrated with software. The results of each chaos experiment are evaluated by applying hypothesis tests in order to verify whether the defined steady-state hypotheses still hold. Other calculations in the context of descriptive statistics are mostly performed by JMeter, e.g., calculating the average response time.
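To make the evaluation step concrete, a non-parametric test such as the Mann-Whitney U test could be used to compare response times observed during the steady state with those observed during an experiment. The sketch below uses SciPy; the sample values and the significance level of 0.05 are illustrative assumptions, not measurements from this case study.

    # Sketch: non-parametric comparison of baseline and experiment response times.
    from scipy.stats import mannwhitneyu

    baseline_ms = [102, 98, 110, 95, 101, 99, 104, 97]        # steady-state observations (placeholder values)
    experiment_ms = [150, 160, 149, 155, 171, 158, 162, 166]  # observations under injected failure (placeholder values)

    statistic, p_value = mannwhitneyu(baseline_ms, experiment_ms, alternative="two-sided")

    alpha = 0.05
    print(f"U = {statistic}, p = {p_value:.4f}")
    if p_value < alpha:
        print("Reject the steady-state hypothesis: response times differ significantly.")
    else:
        print("No significant difference detected: the steady-state hypothesis holds.")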

4.6. Interview Structure and Question Design

This section provides an explanation of the interview structure and the participant selection process. The interview structure includes the length of the interviews, environment selection, tooling, as well as question design. As a means to raise the information according to Table 4.1 and Table 4.2, the interviews are going to be semi-structured. The participants will be invited to one-on-one interview sessions, where each interview takes place in a specific meeting room provided by Novatec Consulting GmbH in Echterdingen. Occasionally, an interview will be performed remotely as a consequence of the participant's location. The duration of each interview was estimated to be approximately one hour and thirty minutes.


Two out of three interviews will be recorded with the app EasyVoiceRecorder on an iPhone 8. The third interview will be recorded with the built-in recording tool of Skype. The consent form includes a permission to take notes and a permission to record the interview.

4.6.1. Selection of Participants

As defined in Section 4.4.1, only a few people are available as potential interview participants. In total there are seven potential participants, each of whom is currently involved or has been involved in the development of the investigated system. According to Wohlin et al. [WRH+12] and Chapter 2, participants are going to be interviewed until saturation occurs. Saturation serves as good evidence that enough participants have been interviewed.

4.6.2. Planning the Interview Questions

As explained in Section 2.3.2, semi-structured interviews do not strictly follow a prepared question catalogue. Nonetheless, a question catalogue will be prepared in order to provide a structure that sufficiently covers all necessary information about the investigated system. The question catalogue might change during an interview, depending on the direction of the conversation or if a specific topic has to be developed further. This means that some questions might be removed or added during an interview. The following interview structure and sequence of questions is proposed, where the questions are arranged within topics:

1. Demographical data
2. Project related data
3. Requirements
4. Architecture
   a) Structural View
   b) Behavioural View


The interview question catalogue can be found in Appendix A. The sequence of the topic questions refers to the funnel model explained by Runeson et al. [RH09]. Upon arrival, the participants are kindly asked to read and sign a consent form, which can be found in Appendix A. Afterwards, the participants are given an introduction comprising detailed information about the goal and motivation of the interview. The questions of each topic will be asked consecutively, whereas the sequence of the topics is the same as in the list above. After the introduction, the interview begins with questions on the demographical data, comprising profession, work experience, age, and name of the participants. Afterwards, the focus of the interview narrows down to project-related questions comprising motivation, planning, and task distribution of the project. Next, the participants are asked questions about the application's requirements and how the requirements were established. The last topic focuses on the architecture of the application. The architectural questions have been further separated into two topics, (1) structural view and (2) behavioural view. As the topic labels imply, the structural view comprises questions about the structure of the application such as the number of components, architectural style, component dependencies, and connections. The behavioural view comprises questions about the behaviour of the application and its components, such as the communication type and the general workflow. At the end, the interviewees receive information about the next steps, which comprise (1) generating a transcript of the interviews and (2) analyzing the qualitative information by creating a detailed description of the investigated system. The interviewees will receive a copy of the transcript in order to review their statements and perform corrections so as to assure a higher quality of the collected information.

Chapter 5

Cloud Service for Vehicle Configuration Files

This chapter provides a detailed description of the system's business domain, a description of the legacy application, and a detailed explanation of the new prototype system. The explanation of the prototype system includes a description of the system's architecture and its components, a description of the implemented patterns, and a modelling of the system's behavior. Hence, this chapter is structured as follows: Section 5.1 outlines the sources of information as a foundation for this chapter's content. Section 5.2 describes the reason for the development of the investigated system, including the system's requirements. Section 5.3 provides an explanation of the legacy application's architecture. Section 5.4 explains a new architecture for the prototype system, including a description of its components, implemented patterns, and a modelling of the system's behavior.

5.1. Findings from Interviews

This section summarizes the findings from the interviews. These findings are the main source of information relevant for this chapter's content. All information referring to the analyzed legacy and prototype system has been raised during the interviews. As outlined in Section 4.6, a series of semi-structured interviews has been conducted. The objective of the interviews was to get more information about the investigated system with respect to the data listed in Table 4.1 and Table 4.2. In total, three interviews have been conducted. On average, each interview took approximately 1 hour and 17 minutes, whereas 1 hour and thirty minutes had been initially estimated for each interview. In


total, the interviews took 3 hours and fifty-one minutes. Before the beginning of each interview, the participants were kindly asked to read and sign a consent form to ensure data integrity and anonymity. After the successful conduct of each interview, a transcript was created and sent to the participants so they could review their statements and increase data quality. To summarize, the interviews provided a detailed view of the domain context of the project as well as its architecture. During the third interview, almost all information started to repeat. Hence, the interview process was considered saturated and no further interviews had to be conducted [WRH+12]. Diagrams, models, and the like that were drawn during the interviews have been edited. The requirements provided have been rephrased as user stories.

5.2. Domain Context and System Scope

In this section the context of the investigated system is clarified by explaining the motivation for developing such a system. The motivation is emphasized by giving an explanation of the domain context as well as describing the system's scope by providing a set of requirements. There exists a large number of different car types that are currently available for purchase. To itemize a few, a car can be a passenger car, a truck, a bus, or an electric vehicle. The manufacturing process differs for every car type since each car type consists of different parts with respect to their size, functionality, and individual preferences such as the inclusion of interior gadgets. The composition of different car parts results in a set of car configurations. Each set of configuration data is used by different systems, where each system describes a different business domain. Common systems that may obtain configuration data are Enterprise Resource Planning (ERP) systems, Manufacturing Execution Systems (MES), or configuration tools which enable users to configure a car to their needs. Figure 5.1 shows a high-level view of the described environment. The KBV component illustrates the investigated system, which is connected to a database that stores configuration data. Third-party systems comprise ERP, MES, or other systems that may need to obtain configuration data for further processing. The end user component describes administrators or other stakeholders who need to request configuration data from the system for further analysis. Given the above described context, it is reasonable to state that the configuration data is crucial for the entire manufacturing and sales process.


Figure 5.1.: Investigated system embedded in its domain context

The strict entanglement in different processes requires configuration data to be highly available as well as handled efficiently. Additionally, systems that rely on this data must have quick access to it. To achieve high availability and efficient handling of configuration data, a set of requirements must be defined. The stakeholders have defined a set of core requirements the prototype has to fulfill. The requirements can be summarized as follows:

Functional requirements

• As a third-party system, I want to retrieve configuration data by accessing a RESTful API that is similar to the API implemented by the legacy application, so that I do not have to change my implementation.
• As an administrator, I want to create new configuration data by accessing a RESTful API, so that I can add new configurations to the database.
• As an administrator, I want to delete unused configuration data by accessing a RESTful API, so that I can clean up the database from unused data.

Technology requirements

• As a stakeholder, I want the prototype to be deployed on Kubernetes, to enable easier and faster deployments and better container orchestration.
• As a stakeholder, I want the application-level logic implemented using the Spring Framework with Spring Boot, so that I can leverage all features of the Spring Framework.


• As a stakeholder, I want the application to be containerized with Docker, so that I do not have to learn a different technology for managing my containers.

Quality requirements

With respect to the quality requirements, the developers were not given detailed information but mere statements, e.g., the system's availability has to be at least as high as the legacy application's availability or higher. This also applies to resilience, scalability, and performance. Therefore, phrasing a set of dedicated user stories to define those requirements is not possible. The functional requirements can be summarized as providing a RESTful API to access the configuration data that is equal to the existing one. However, the already existing legacy application is insufficient to meet the quality and technological requirements given above. Hence, the legacy application should be improved with respect to its availability, scalability, performance, and resilience. The technological requirements were given as a means to achieve the quality requirements. Therefore, the goal is to create a prototype which handles car configuration data efficiently and provides the same APIs as the legacy application. On top of that, the prototype's availability should be at least as high as the legacy application's availability, and the prototype should be able to scale efficiently in a cloud environment. Additionally, the performance should be at least as good as the legacy application's performance, and the system should be as resilient as possible.

5.3. Architecture of the Legacy Application

The legacy system in its current state is insufficient to meet the quality and functional requirements described in Section 5.2. To understand the system's structure and domain context, Figure 5.2 illustrates all possible use cases. Note that each use case is also implemented by the prototype system. Hence, the use cases illustrated in Figure 5.2 also reference the use cases for the prototype system. Figure 5.2 shows three actors who participate in the use cases. The first actor is the end user, who can only participate in the use case to request configuration data. The second actor is a third-party system, which can participate in the same use case as the end user. The third actor is the administrator. An administrator can participate in two use cases. The first use case initiates the creation of configuration data, whereas the second one deletes configuration data. In total, these are the only use cases currently implemented in the system for both the legacy and the prototype system.


Figure 5.2.: Use Case diagram of the prototype and legacy system

The legacy application itself is a Java Enterprise Edition [Orab] application, deployed on two WebSphere Application Server [IBM] instances running inside a private data center. Each WebSphere instance hosts the same legacy application. Beyond that, the developers have not been provided with much information about the legacy application. However, due to the cutting of the services, the monolithic approach allows an assumption of how the system might be structured. Figure 5.3 shows a high-level architecture of the legacy application. The application, being a pure backend application, does not contain any frontend components and therefore no graphical user interface (GUI). Communication within the application is done by accessing application programming interfaces (APIs). The APIs are implemented based on the architectural style Representational State Transfer (REST). APIs exposed for third-party systems are also based on REST. In total there are two external APIs which can be accessed, namely the end user and the administrator API. Each API allows exactly two Hypertext Transfer Protocol (HTTP) requests, whereas each request can directly be mapped onto the use cases illustrated in Figure 5.2.


Figure 5.3.: Monolithic implementation of the case study application

HTTP Request    Request Method    Role
getAll()        GET               end user / third-party system
getById()       GET               end user / third-party system
create()        PUT/POST          administrator
delete()        DELETE            administrator

Table 5.1.: HTTP requests of the legacy and prototype system

The configuration data is stored in a database of unknown type. It is assumed that the data is stored in a document-based database such as Cassandra [Thea], MongoDB [Mon], or CouchDB [Theb]. The configuration objects are stored as binary large objects (BLOBs). In Java, a Blob object is the representation of a serialized SQL BLOB value that stores a BLOB as a column value in a row of a database table [Orac]. The registry is responsible for accepting requests from the administrator and the end user API. Additionally, the registry keeps track of the number of configuration data objects that are currently persisted in-memory. Table 5.1 shows the HTTP requests that are currently permitted by the given APIs. The end user API accepts two requests of type GET, namely getAll() and getById(). The administrator API accepts two HTTP requests, one of type PUT or POST and the second of type DELETE, namely create() and delete(). All requests are accepted by the registry.
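To make the four permitted requests of Table 5.1 concrete, they could be issued as plain HTTP calls, for instance with Python's requests library. The base URL and resource paths below are hypothetical placeholders chosen for illustration; the thesis does not disclose the system's real endpoints.

    # Sketch: exercising the four permitted requests of Table 5.1.
    import requests

    BASE = "http://localhost:8080/api"   # placeholder base URL

    # End user / third-party system (GET only)
    all_ids = requests.get(f"{BASE}/configurations")            # getAll()
    single = requests.get(f"{BASE}/configurations/123")         # getById()

    # Administrator (PUT/POST and DELETE)
    created = requests.put(f"{BASE}/configurations", json={"id": 123, "payload": "..."})  # create()
    deleted = requests.delete(f"{BASE}/configurations/123")     # delete()

    for name, resp in [("getAll", all_ids), ("getById", single), ("create", created), ("delete", deleted)]:
        print(name, resp.status_code)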


Figure 5.4.: Component diagram of the prototype system

5.4. Prototype of a Microservice based Architecture

This section describes a microservice-based architecture which reflects the prototype system. The architecture discussed in this section will serve as the system under test for the chaos experiments. Figure 5.4 illustrates a high-level view of the prototype system implemented as a microservice-based architecture. The functional requirements from Section 5.2 have not changed, but the architecture does now sufficiently satisfy the specified technological requirements. In the following, the structure of the architecture is explained in more detail, as well as the patterns that have been used to implement the prototype.

5.4.1. Structure

Based on Figure 5.3, the components (1) registry, (2) database, (3) end user API, and (4) admin API have been adopted. However, the registry was formerly implemented manually, whereas the registry in Figure 5.4 is provided by Kubernetes. Additionally, the database component has been determined to be a PostgreSQL database. PostgreSQL is an open source object-relational database system [Thec]. The usage of PostgreSQL was requested by the stakeholders. The components (5) Kubernetes ingress, (6) Kubernetes master, (7) config microservices, and (8) data abstraction layer were added to the system's architecture. Like the registry component, components (5) and (6) are provided by Kubernetes. The Kubernetes cluster is not referred to as a component in the architecture but illustrates that the application is deployed on a Kubernetes cluster.


The REST requests permitted by (3) and (4) have not been changed, nor have any requests been added to the APIs. Therefore, Table 5.1 still holds for the prototype system. Component (1) refers to the Kubernetes registry. The registry is a service provided by Kubernetes which is responsible for keeping track of the currently persisted configuration data. The registry communicates with component (6) in case of querying additional configuration data or executing database queries [Fou]. Additionally, the registry knows the service name of each microservice in the cluster. If a user queries for a set of configuration data and provides a list of IDs, the registry is able to look up the corresponding services. Component (5) refers to the Kubernetes ingress object. As pods and services inside the Kubernetes cluster are not accessible from outside the cluster, the ingress object manages the external access to cluster objects [Fou]. The ingress provides a couple of functionalities; among others, it can serve as a load balancer [Fou]. Component (6) refers to the Kubernetes master object. The master is a dedicated process that is part of Kubernetes' control plane [Fou]. The master itself runs a sequence of processes in order to verify that the current cluster state matches the desired state [Fou]. In this case, the master knows the number of pods deployed on the cluster, whereas each pod holds an instance of component (7). Component (7) refers to a microservice implementing the core functionality of the system. Each microservice references a dedicated identifier which corresponds to a row in the database. Thus, each microservice is responsible for persisting exactly one configuration file. A microservice instance is deployed inside a Kubernetes pod, which is referred to as the smallest unit of deployment within Kubernetes [Fou]. There exist at least two instances of each microservice, as required by the stakeholders. In general, Kubernetes enforces a self-healing mechanism by restarting components in its cluster that have been terminated. The master component (6) implements Kubernetes' self-healing, e.g., by initiating actions to recover the cluster state if any changes occur inside the cluster [Fou]. Kubernetes' self-healing mechanism and the microservice redundancy given in Figure 5.4 lead to an increased resilience to outages. Component (8) refers to a data abstraction layer implemented with the Spring Data Java Persistence API (JPA). The abstraction layer is responsible for decoupling the database from the application. Hence, it is possible to change the database based on the requirements without changing the behavior or API of the application.


Figure 5.5.: Sequence diagram of the getAll request

Figure 5.6.: Sequence diagram of the create request

5.4.2. Workflow and Behavior

Illustrating the behavior of the system supports the explanation of the application’s structure outlined in Section 5.4.1. The following use case [Obj] and sequence diagrams [Obj] describe the behavior of the system illustrated in Figure 5.4. Figure 5.5 consists of (1) the user, (2) the Kubernetes ingress object, (3) the Kubernetes registry, (4) the Kubernetes master object, and (5) the data management. This diagram illustrates the behavior and exchanged messages of the getAll() request defined in


Figure 5.7.: Sequence diagram of the delete request

Table 5.1. The sequence diagrams illustrating the getAll() and getById() requests are similar and merely differ in the objects they return. Hence, Figure 5.5 serves as a reference for both the getAll() and the getById() request. For simplification, the data management actor in each subsequent diagram is a compound object of the data abstraction layer and the database component illustrated in Figure 5.4.

HTTP GET – getAll()

A GET request to the corresponding end user API calls the getAll() method. The (2) Kubernetes ingress object receives the request with the appropriate parameters and forwards the request to the (3) registry object. Firstly, the registry queries the (4) master object. The master object then looks up the amount and type of currently persisted configuration data. In the case of a getAll() request, the response holds a list of configuration data IDs that are currently persisted in-memory. Next, in addition to the list of IDs retrieved from the master, the registry queries the data management. The data management takes a list of configuration data IDs as a parameter and looks up the corresponding objects in the database. The data management compares the retrieved list of available IDs with the IDs in the database. In the end, the lists are merged into a list of all available configuration data IDs and returned to the user.


HTTP PUT – create()

Figure 5.6 illustrates the permitted admin create() request of Table 5.1. Additionally, Figure 5.6 and Figure 5.7 omit objects (3) and (4), since the administrator's requests are not required to communicate with them. The admin defines the data content to be saved in the database and sends it along with the PUT request to the API. As before, the ingress object receives the request and forwards it directly to the data management. In this case, there is no communication with the Kubernetes registry object. The data management checks whether a data object with the received content already exists in the database. If so, the admin will receive an appropriate exception stating that the object he wishes to persist already exists. Otherwise, the object is created and the admin receives an appropriate response code.

HTTP DELETE – delete()

Figure 5.7 shows the admin delete() request. Note that this request is almost identical to the create() request illustrated in Figure 5.6, except that the queried object will be deleted if it exists. In the case that the object does not exist, the admin will receive an appropriate exception as well. To summarize, the described behavior of the prototype system is modelled in Figure 5.5, Figure 5.6, and Figure 5.7. Figure 5.5 describes the behavior of the system if a user requests a list of configuration data IDs or a single configuration data ID. Figure 5.6 and Figure 5.7 explain the HTTP requests that are exclusively permitted to the administrator of the system, namely creating and deleting an object. Note that every request communicates with the Kubernetes ingress object instead of communicating directly with the registry, enforcing loose coupling [FLR+14] in the architecture.

5.4.3. Architectural Styles

Representational State Transfer – REST

Unlike most monolithic applications, distributed systems communicate over a network; thus, it is required to determine how messages are exchanged over the network. The prototype system in Figure 5.4 implements REST [FT00] as the architectural style used to design the system's APIs. REST imposes a set of architectural constraints on its elements, e.g., providing a uniform interface between components to decouple services from each other [FT00]. The prototype's APIs are designed with REST without exception.


Client Server Model

Most distributed systems logically adopt a common set of styles such as the client server model [Ben13]. In the case of the prototype system, a client server model has been adopted as well. A client server architectural style completely separates the concerns of the client and server implementation [FT00]. In the context of the prototype system, the client implementation is not important for this thesis. However, the prototype system modelled in Figure 5.4 can be referred to as the server implementation within a client server model.

Microservice based Architecture

Figure 5.3 describes a monolithic architecture, where the entire application logic, database management, and graphical user interface are combined into one deployment. Figure 5.4 defines a microservice-based architecture [New15] where each service is only responsible for one particular set of tasks. In the case of the prototype, the services have been identified by the stakeholders by clearly defining a business boundary for each service, e.g., a config microservice is only responsible for providing and persisting configuration data but not for querying data from the database. Microservices describe isolated units of work that can be developed and deployed independently [New15].

Kubernetes as a Platform

Kubernetes requires an application or microservice to be containerized. This means the application should run inside a container and therefore provide some kind of container image, independent of the container virtualization technology used. Additionally, these images are distributed to pods which are deployed on a Kubernetes cluster. Another mechanism provided by Kubernetes is its ability to know the internal state of a cluster. For example, if a pod in a cluster is shut down for some reason, Kubernetes recognizes the change in its cluster state and performs actions to recover the desired state. The Kubernetes master object [Fou] is responsible for verifying that the current cluster state matches the dedicated deployment. Kubernetes as a platform choice is commonly not referred to as an architectural style. However, by the definition of Fielding et al. [FT00] given in Section 2.1.1, Kubernetes does impose a restriction on the relationship between architectural elements and on how to encapsulate these elements. Hence, it is reasonable to state that Kubernetes as a platform choice can be labeled as an architectural style.
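The self-healing behavior described above can be observed directly from a script. The following sketch uses the official Kubernetes Python client; the namespace, the label selector, and the waiting time are illustrative assumptions and not taken from the prototype's actual deployment.

    # Sketch: observing Kubernetes' self-healing by deleting one pod and listing the replacements.
    import time
    from kubernetes import client, config

    config.load_kube_config()                 # uses the local kubeconfig
    v1 = client.CoreV1Api()

    namespace = "default"                     # assumed namespace
    selector = "app=config-microservice"      # hypothetical label of the prototype's pods

    pods = v1.list_namespaced_pod(namespace, label_selector=selector).items
    print("pods before:", [p.metadata.name for p in pods])

    if pods:
        v1.delete_namespaced_pod(pods[0].metadata.name, namespace)

    time.sleep(30)                            # give the controller time to restore the desired state
    pods_after = v1.list_namespaced_pod(namespace, label_selector=selector).items
    print("pods after:", [p.metadata.name for p in pods_after])

If the deployment requests at least two replicas, the second listing should again show the desired number of pods, with the deleted pod replaced by a freshly created one.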


5.4.4. Design Patterns

As opposed to the architectural decisions, the system illustrated in Figure 5.4 does not integrate design patterns that have specifically been chosen by the developers. Instead, most design patterns are imposed by the chosen technologies such as Spring Boot. Spring Boot integrates a lot of design patterns; therefore, they will not all be explained in detail. However, the most important ones that significantly change the implementation style will be briefly explained, as well as a few other design patterns.

Aspect Oriented Programming – AOP

Aspect Oriented Programming (AOP) is viewed as the complement of Object Oriented Programming (OOP). Within OOP, a class is the key unit of modularity, whereas within an AOP approach, the key unit of modularity is an aspect [Pivb]. Aspects allow the modularization of crosscutting concerns such as transaction management and provide a different view of structuring classes and modules [Pivb]. AOP is integrated into the entire Spring ecosystem and not restricted to Spring Boot.

Inversion of Control

Inversion of Control (IoC) or Dependency Injection (DI) describes the reversal of the dependencies of objects. Instead of leaving it to the dependent object's responsibility to find its dependencies, the dependent object is provided with the necessary dependencies by injecting them, e.g., through constructor arguments [Pivb]. DI drastically increases the testability and readability of components.
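The idea is independent of Spring; a minimal constructor-injection sketch, with hypothetical class names chosen purely for illustration, could look as follows.

    # Sketch: constructor-based dependency injection. The repository is handed to the
    # service from the outside, which makes it trivial to substitute a test double.
    class ConfigurationRepository:
        def find_all_ids(self):
            return [123, 456]                  # stand-in for a real database query

    class ConfigurationService:
        def __init__(self, repository):        # dependency injected via the constructor
            self._repository = repository

        def get_all_ids(self):
            return self._repository.find_all_ids()

    # The wiring happens outside the dependent class; in Spring, the IoC container does this.
    service = ConfigurationService(ConfigurationRepository())
    print(service.get_all_ids())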

External Configuration

Spring Boot enables the support of external configuration [Pivb]. This means that configuration properties of an application, e.g., provided via YAML files or command-line arguments, can be decoupled from the application code. The configuration of an application is then extracted from the environment rather than from hard-coded properties [Pivb].
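A minimal, language-neutral sketch of this idea is reading configuration values from the process environment instead of hard-coding them; the variable names and defaults below are illustrative assumptions, not the prototype's actual configuration keys.

    # Sketch: externalized configuration read from the environment.
    import os

    DB_URL = os.environ.get("DATABASE_URL", "postgresql://localhost:5432/configdb")
    HTTP_PORT = int(os.environ.get("SERVER_PORT", "8080"))

    print(f"Connecting to {DB_URL}, serving on port {HTTP_PORT}")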


Twelve Factor App Manifesto

The last design pattern complies with one factor of the twelve-factor app manifesto by Wiggins et al. [Wig11]. Wiggins et al. describe the twelve factors as a methodology to build and deliver modern software-as-a-service. Commonly known factors are [Wig11]:

• Dev/prod parity – the development, test, and delivery environments should be as similar as possible
• Codebase – an application should only have one codebase tracked by a version control system, but many deploys
• Build, release, and run – the build, release, and run stages should be strictly separated from each other

Each of the twelve factors has been strictly followed for the prototype system. The architects of the prototype emphasized the three factors external configuration, statelessness, and dev/prod parity.

Chapter 6

Hazard Analysis

Running a chaos engineering experiment requires knowing the type of experiment to be executed, i.e., it has to be decided which failures or turbulent conditions to inject into the system. Creating experiments for software systems requires a great amount of effort and should be planned accordingly in order to be certain about actually executing the experiments. Therefore, a sensitivity to failures has to be established by addressing chaos engineering in terms of identifying incidents in a system before it is pushed to production. A major obstacle is to identify such incidents in a meaningful way. Conducting a hazard analysis on a given system is a means to systematically identify potential hazards. According to MIL-STD 882 [Sys], a hazard analysis is part of assessing a system's safety and applies methods to identify incidents, such as fault tree analysis, failure modes and effects analysis, or computational hazard analysis [Sys]. These methods belong to a larger group of methods and techniques which are associated with the safety assessment of software systems. Hence, the use of hazard analysis methods allows making predictions and assertions about the prototype system with respect to its architecture. The fundamental principles of the methods used in this chapter have been explained in Section 2.2. This chapter is structured as follows. Section 6.1 conducts a Failure Mode and Effects Analysis. Section 6.2 applies a Fault Tree Analysis. Section 6.3 applies a Computational Hazards and Operations study. All results of the hazard analysis are attached in Appendix B.


6.1. Failure Modes and Effects Analysis

6.1.1. Summary of FMEA Findings in Appendix B.1 and Procedure

In this chapter, we will discuss only one result out of all FMEA results. The results of the FMEA can be found in Appendix B.1. The failure modes of the FMEA study have been elaborated qualitatively by thinking of different events that might lead to an undesirable outcome. Usually, these failure modes and subsequent system effects are discussed in reviews with the stakeholders and developers of the system. This ensures that the elaborated findings have a higher quality and are complete. In this case, thorough reviews have not been conducted due to time constraints. However, the results were reviewed by the lead architect and one developer of the prototype system during a one-hour session. The review did not comprise thorough discussions of the results but merely a discussion of the procedure and some of the results. Therefore, it is possible that the outcome of the FMEA is incomplete.

It becomes obvious that all components that have been investigated share a common set of failure modes. As the prototype system is deployed on a Kubernetes cluster, all components are threatened by the loss of different Kubernetes objects, e.g., the termination of pods or replicasets. Furthermore, since the prototype is very likely to be deployed on a managed Kubernetes cluster, most of these failure modes are left out of consideration by the developers, as stated by a developer of the system. However, as Kubernetes is configured to scale out pods as necessary, the termination of pods and the unavailability of services may cause a whole new level of failures. As the severity levels of these failures imply, a loss of different Kubernetes objects leads to highly severe effects on the system and should be addressed accordingly. This issue could be addressed, e.g., by increasing the number of replicasets and pods available in the cluster. However, this requires more memory since more objects have to be stored inside the cluster. Moreover, it might also have a negative impact on the performance of the cluster. Other failure modes include the modification of requests, e.g., changing the data content of requests, modifying request headers, changing the number of request parameters, and so on. Most of these issues should be addressed by the respective service's Spring Boot implementation, e.g., a request with a number of parameters that is not permitted should be answered by a correctly implemented Spring Boot controller with a corresponding error message to the user. Despite being undesirable outcomes, such events are claimed not to damage the system in such a way that manual operation of the cluster would be required. However, the severity of these events is still considered very high, since users will have a bad user experience.
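The mitigation mentioned above, raising the number of replicas, could for instance be applied with the Kubernetes Python client as sketched below; the deployment name, namespace, and replica count are hypothetical values chosen for illustration.

    # Sketch: raising the replica count of a deployment as one mitigation for the
    # pod-termination failure modes identified by the FMEA.
    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    apps.patch_namespaced_deployment_scale(
        name="config-microservice",           # hypothetical deployment name
        namespace="default",                  # assumed namespace
        body={"spec": {"replicas": 4}},       # desired number of pod replicas
    )

As noted in the text, this trades additional memory consumption and potential cluster overhead for a smaller chance that a single pod termination becomes visible to the user.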


6.1.2. Definition of Severity Levels and Documentation Style

Table 6.1 illustrates one result of the FMEA specific to the prototype system. The results were documented in a table which comprises five columns. Note that the number of columns and the labels of the columns may change, depending on the use case and the people responsible for conducting such an analysis. The architectural elements, i.e., the investigated components, referenced in this section are illustrated in Figure 5.4. The columns reference (1) the participating component, (2) a potential failure that affects the participating component, (3) sub system effects, (4) system effects, and (5) a severity level. The severity levels are set specifically for the prototype system and outlined as follows:

• low – the potential failure will not have any harmful effects on the user or the system
• middle – the potential failure slightly affects the system in an imperceptible way
• high – the potential failure will very likely have a harmful effect on the user or the system
• very high – the potential failure will definitely have a harmful effect on the user and might damage the system

In Section 2.6 it has been discussed that chaos engineering focuses on surfacing potential weaknesses of a system, prioritizing the identification of conditions that might have a critical impact on the user of a system. Therefore, Table 6.1 and the other results of Appendix B.1 focus on events that might have harmful impacts on the user experience. Thus, the severity levels are defined from a user experience perspective, where each level is described as an intensification of the impact the event might have on the user experience. For example, a failure with a low severity level has almost no impact on the user experience, whereas a failure with a severity level of very high will result in error messages or a total outage of the system. The administrator is considered to be a user of the system; thus, the severity levels' semantics remain the same for administrator-related cases. As FMEA is a qualitative approach, the assigned severity levels may deviate from the actual severity of a failure mode. Thus, the impact a potential failure has on the system might turn out to be higher than initially estimated. For example, a chaos engineering experiment that focuses on a failure mode with an assigned severity level of low might as well turn out to reveal a severity level of very high.


No. | Component | Failure mode | Sub system effects | System effects | Severity level
1 | Ingress | Pod that hosts the Ingress is terminated | Registry does not receive any requests by the Ingress service | User cannot request the registry to handle his or her requests | very high
2 | Ingress | Replicaset that hosts the Ingress is terminated | Registry does not receive any requests by the Ingress service | User cannot request the registry to handle his or her requests | very high
3 | Ingress | Service that exposes the Ingress is removed | Ingress is still accessible inside the cluster but not from the outside | Users cannot access the system since the Ingress is not exposed | very high
4 | Ingress | Deployment that hosts the Ingress is deleted | Kubernetes Master cannot check the cluster state for the Ingress component | Kubernetes Master cannot heal the cluster state for the Ingress component if deviations of the cluster state occur | high
5 | Ingress | Clusternode that hosts the Ingress is terminated | Registry does not receive any requests | User cannot request the registry to handle his/her requests | very high
6 | Ingress | Ingress receives an HTTP request paired with a not allowed HTTP method | Ingress responds with status code 405 (method not allowed) | User receives an error message in addition to no data | very high

Table 6.1.: Result of a FMEA analysis for the Kubernetes Ingress component

6.1.3. FMEA of the Kubernetes Ingress Component

Table 6.1 shows one result of the FMEA with respect to the Ingress component. The Ingress service, as seen in Figure 5.4, accepts user and admin requests and forwards them to the appropriate services. Table 6.1 illustrates six failure modes labeled with severity levels of high and very high. The failure modes with an assigned severity level of high and very high are considered to have a harmful impact on the user and the system. For example, row number 1 in Table 6.1 describes a failure mode that terminates the Kubernetes pod that is responsible for running the Ingress service. Since the internal routing of requests is not visible to the user, the sub system effects are not of concern to the user. However, the system effects will be visible to the user in the form of an error message. The event results in a system state where the user will not receive any data. Rows 2 to 5 describe similar events, where different Kubernetes objects which run the Ingress service are terminated or shut down. Row 6 describes the mutation of the HTTP method of incoming requests. The Ingress service receives requests on a dedicated Uniform Resource Identifier (URI) with the intention to route them to internal services. Before the request is routed to another service, the HTTP method of the request will be changed. Administrator-related endpoints, i.e., creating and deleting configuration objects, only allow HTTP POST, GET, and DELETE requests. If a request is sent with an HTTP method that is not defined on the URI, the service will return a status code 405, which states that the HTTP method is not allowed. Thus, the user will not receive any data. Consider the HTTP request in Listing 6.1.


Listing 6.1 HTTP POST request issued by a non-administrative user

POST /api/registry/getConfigurationIdentifiers HTTP/1.1
Host: localhost:8080
User-Agent: curl/7.54.0
Accept: */*
Content-type: curl/7.54.0
Content-Length: 8

body: {}

Listing 6.2 Mutation of HTTP POST of Listing 6.1

DELETE /api/registry/getConfigurationIdentifiers HTTP/1.1
Host: localhost:8080
User-Agent: curl/7.54.0
Accept: */*
Content-type: application/json
Content-Length: 2

Listing 6.1 shows the HTTP header information of a POST request issued by a non-administrative user. The endpoint's permitted methods and access rights are shown in Table 5.1. Usually, as illustrated in Figure 5.5, the request is received and accepted by the Ingress service and routed to the registry service, where the data is processed further. Listing 6.2 illustrates the mutation of Listing 6.1 by changing the HTTP method from POST to DELETE. This event references row 6 of Table 6.1, where the HTTP method is changed to a method that is not permitted on the given URI.

Appendix B includes all results of the FMEA conducted on the prototype system. Since FMEA is a component-based approach, the participating components in Appendix B correspond to the components in Figure 5.4, and every component has been considered in the analysis. Each row of the tables provided in Appendix B contributes to a specific event, e.g., a user not receiving data from the system, whereas multiple rows may contribute to the same event. These events are further analyzed by conducting a fault tree analysis. It has to be considered that such events are not explicitly documented during a FMEA but emerge as a by-product of such an analysis. However, it is possible to extend the documentation by an additional column describing potential events.
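To make the expected outcome of this failure mode concrete, the mutated request of Listing 6.2 can be replayed and checked for the 405 response. The following sketch is illustrative and not part of the prototype; host, path, and expected status code are taken from Listing 6.1 and row 6 of Table 6.1, and the use of the requests library is an assumption.

# Illustrative sketch: replay the mutated request of Listing 6.2 and check that the
# endpoint rejects it. Host and path are the example values of Listing 6.1; the
# expected status code 405 follows row 6 of Table 6.1.
import requests

URL = "http://localhost:8080/api/registry/getConfigurationIdentifiers"

response = requests.delete(URL, json={}, headers={"Content-Type": "application/json"})
print(response.status_code)  # expected: 405 (method not allowed), i.e., the user receives no data
assert response.status_code == 405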

6.1.4. Summary of Findings

In total, the FMEA provided a thorough analysis of the individual components of the studied system's architecture. Every component has been considered except the Kubernetes master and the configuration microservices. The Kubernetes master, as well as the Ingress and Registry, are managed services provided by the Kubernetes platform; however, the Kubernetes master is a service largely beyond the control of the developers. The configuration microservices were excluded from the analysis because little information was available about them. Hence, analyzing the Kubernetes master and the configuration microservices with respect to a FMEA was not considered meaningful.

The findings of the FMEA revealed critical components that can be considered single points of failure. This becomes more evident when comparing, for a single component, the number of failure modes labeled very high with the number labeled low or middle. Naturally, the more failure modes result in a harmful event, the more fragile the component is for the system. This ratio is arguable, considering that in most cases the goal of a FMEA is to surface harmful rather than non-harmful events; still, it is worthwhile to keep an eye on the ratio between the different severity levels. For the Ingress component, the analysis yielded 22 failure modes, which correspond to 22 rows within the FMEA table. Of these 22 failure modes, four were labeled with a severity level of middle, four with high, and 14 with very high. There is a narrow margin between a severity level of high and very high, so these two labels may be viewed together when counting harmful events. This results in 18 out of 22 failure modes, or approximately 82%, that may have a harmful effect on the user and the system. In comparison, the Registry shows a ratio of approximately 94%, which is almost equal to the ratios of the data management, administrator, and user components.

As a consequence of almost every component being a single point of failure, it becomes quite difficult to minimize the blast radius of a chaos engineering experiment. Additionally, it has to be considered that FMEA, like FTA and CHAZOP, is a qualitative approach. Although the findings have been reviewed by the developers once, they may still be incomplete. This inevitably leads to the assumption that, even if the blast radius is kept as small as possible with respect to the findings, unconsidered failure modes might occur, so the actual blast radius may be larger than initially expected. Nevertheless, the findings provide examples where the blast radius can be considered small, e.g., failure modes with a severity level of low. Likewise, failure modes that have a severity level of very high but merely result in a decrease of the user experience may be considered suitable for a chaos experiment. Such experiments may not lead to a critical outage of the system but are still considered harmful due to the decrease in user experience.


6.2. Fault Tree Analysis

6.2.1. Summary of FTA Findings of Appendix B.2 and Procedure

Section 2.2 explains that FTA is a deductive technique that starts from a top-level hazard and works down to its root causes, thereby emphasizing cause-effect relationships between faults within a system. Any fault tree begins with a top-level hazard that results from a preliminary hazard analysis such as a FMEA. The top-level hazard is then expanded by identifying events that are a direct cause of their parent event. This continues until an event is reached that cannot be analyzed any further and is therefore illustrated as a basic event. As a result, the leaf nodes are basic events, which are connected by logical gates. The basic events are then used to describe the fault tree as a Boolean expression in order to calculate the minimal cut sets, i.e., the minimal sets of basic events that need to occur in order for the top-level hazard to be triggered. The trees are constructed with the logical gates and events given in the figures of Section 2.2.2.

The results of the FTA, found in Appendix B.2, look similar to each other. This becomes more evident by referencing the findings of Section 6.1, where it has been explained that the investigated components of the prototype system share a common set of failure modes. These failure modes appear at the bottom of the fault trees, i.e., as the leaf nodes, which are basic events. Fault trees do not merely consider single components as in the FMEA, but rather comprise events that result from interactions between multiple components. Further, multiple fault trees with distinct top-level hazards may comprise the same components or component interactions in terms of events. Thus, fault trees with distinct top-level hazards may look similar to each other with respect to their leaf nodes and minimal cut sets. In this case, the fault trees of the prototype system look almost identical to each other.

Appendix B.2 shows two more fault trees considering top-level hazards with respect to the administrator. In case of administrative tasks, different top-level hazards have been identified which correspond to the administrator use cases identified in Figure 5.2. However, among the administrator fault trees, the results are equal to each other except for the top-level hazard. This results from the implementation of the use cases: as the sequence diagrams in Figure 5.5 and Figure 5.6 show, the workflows of the individual administrator-related requests are equal to each other, implying that the underlying events resulting from these interconnections must be equal as well. In the end, this also leads to equal minimal cut sets. The cut sets revealed that the system comprises multiple single points of failure. This becomes evident by the multiple occurrences of single entity cut sets, such as in Equation (6.3).


Figure 6.1.: Fault tree with top-level hazard "User does not receive configuration identifiers". The top-level hazard is connected via an OR gate to two intermediate events, "Communication fails" and "Data cannot be obtained from backing services", whose subtrees are attached via the input symbols A and B.

This has been discovered to be a major issue in the current prototype system and will be further investigated during the chaos engineering experiments.

6.2.2. Top-Level Hazard: User does not receive configuration identifiers

In the following, a dedicated example from the FTA results of Appendix B.2 is discussed. Figure 6.1 shows a fault tree whose top-level hazard results from the failure modes in Table 6.1. The top-level hazard describes a scenario where the user does not receive any data from the system. In Figure 6.1, two child nodes are connected by an OR gate to the top-level hazard; thus, if either event occurs, the top-level hazard is triggered. The left child node represents the loss of communication between any of the components responsible for implementing the use case of a non-administrative user, shown in Figure 5.2. This affects the Ingress component, the Registry component, the Kubernetes master node, the configuration services that hold the configuration objects of the requested identifiers, and the database. The right child node represents an event where the Registry is not able to retrieve the requested data. Note that each child node represents an intermediate event as explained in Section 2.2, i.e., an event that has to be analyzed further. The input symbols A and B are optional operators that allow subtrees to be taken as input in order to increase the readability of fault trees. In this case, A receives the fault tree of Figure 6.2 as input and B receives the fault tree of Figure 6.3 as input.


Figure 6.2.: Left subtree of Figure 6.1 (input A), "Communication fails". An OR gate connects four intermediate events (Ingress, Kubernetes Master, user service, and Registry not available), each caused by the basic event that the pod or node hosting the corresponding service terminates.

6.2.3. Analysis of Subtree A: Communication fails

Figure 6.2 shows the subtree of the left child node of Figure 6.1. The child nodes of this subtree's top event are connected by an OR gate, and every child node represents an intermediate event. Further analysis of the intermediate events leads to the conclusion that each of them is caused by a basic event describing the termination or unavailability of the Kubernetes object that is responsible for running the corresponding service. For example, the basic event of the left child node states that, in order for the Ingress to become unavailable, a pod or node of the Ingress component must terminate. The leaf nodes in Figure 6.2 are thus the root causes of the top event of this subtree. Since they are connected by OR gates, the occurrence of any of these events leads to a loss of communication.


Figure 6.3.: Right subtree of Figure 6.1 (input B), "Data cannot be obtained from backing services". An OR gate connects two intermediate events: "Registry cannot obtain configuration identifiers", caused by the database being unavailable in combination with the Registry having no local cache of identifiers (AND gate), and "Services that hold configuration data are not available", caused by the database being unavailable in combination with the pods of both instances of the requested configuration microservice terminating (AND gates).

6.2.4. Analysis of Subtree B: Data cannot be obtained from backing services

Figure 6.3 illustrates the right subtree of Figure 6.1, which is passed as input to B. The structure of Figure 6.3 is more complex than that of Figure 6.2, based on the number and interconnection of its events. The top event has two child nodes connected by an OR gate. The left child node describes an event where the Registry is not able to retrieve the identifiers of the requested configuration data. This occurs only if both of its child events occur at the same time, since they are connected by an AND gate. The right child of this AND gate is a basic event describing that the Registry does not provide any locally stored data that can be accessed; this inevitably leads to the Registry querying the database for the necessary identifiers.


Description | Label
Pod/Node of Ingress terminates | A
Pod/Node of Kub. Master terminates | B
Pod/Node of user service terminates | C
Pod/Node of Registry terminates | D
Pod/Node of Database terminates | E
Registry has no local cache of identifiers | F
Pod/Node of 1st microservice terminates | G
Pod/Node of 2nd microservice terminates | H

Table 6.2.: Description of basic events

The left child event is an intermediate event stating the unavailability of the database, which occurs if the corresponding pod or node that runs the database terminates. Proceeding with the right child node of the top event, it states that the services that hold the requested configuration objects are unavailable, i.e., the database and the underlying configuration services that provide the configuration data. These two events are illustrated as intermediate events. The basic event that causes the unavailability of the database is equal to the basic event on the left side of the subtree. However, in order for the configuration services to be unavailable, both instances of the same configuration service must become unavailable at the same time; thus, both basic events are connected by an AND gate. This is a result of the built-in redundancy of the configuration services, i.e., two instances of the same configuration service must always be running.

6.2.5. Minimal Cut Sets

The minimal cut sets of Figure 6.1 are calculated by applying basic principles of Boolean algebra, as explained in Section 2.2. Every fault tree whose events are connected by AND and OR gates can be expressed as a Boolean expression by following the conventions of Boolean algebra. It has to be emphasized that only basic events are considered; intermediate events and other types of events are ignored.


Label | CS 1 | CS 2 | CS 3 | CS 4 | CS 5 | CS 6 | CS 7
A | x | | | | | |
B | | x | | | | |
C | | | x | | | |
D | | | | x | | |
E | | | | | x | x | x
F | | | | | x | |
G | | | | | | x |
H | | | | | | | x

Table 6.3.: Mapping of basic events to their occurrence in the minimal cut sets

Table 6.3 shows a mapping of the basic events to their occurrence in the minimal cut sets. A description of the labels is given in Table 6.2. Given the FTA of Figure 6.1 and the mapping shown in Table 6.3, the following Boolean expression

FTA_UserDoesNotReceiveAnyData = A + B + C + D + (E ∗ F) + (E ∗ G ∗ H)        (6.1)
                              = A + B + C + D + E ∗ (F + G + H)
                              = A + B + C + D + (E ∗ F) + (E ∗ G) + (E ∗ H)

references the given FTA. In conclusion the minimal cut set can be described as:

Minimal cut set m = {{A}, {B}, {C}, {D}, {E,F}, {E,G}, {E,H}}        (6.2)
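The minimal cut sets can also be derived mechanically from the tree structure, which is useful when fault trees grow larger than the one discussed here. The following sketch is not part of the thesis tooling; the encoding of the tree is an assumption chosen such that it reproduces the cut sets of Equation (6.2), and the absorption step removes non-minimal cut sets.

# Illustrative sketch: deriving minimal cut sets by expanding OR/AND gates into sets
# of basic events and removing non-minimal sets (absorption law). The tree encoding
# is an assumption chosen to reproduce the cut sets of Equation (6.2).
from itertools import product

def cut_sets(node):
    """Return the cut sets of a fault tree node as a list of frozensets of basic events."""
    if isinstance(node, str):                        # basic event (leaf)
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":                                 # any child's cut set triggers the event
        return [cs for sets in child_sets for cs in sets]
    if gate == "AND":                                # one cut set from every child is needed
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unknown gate: {gate}")

def minimal(sets):
    """Drop every cut set that is a proper superset of another cut set."""
    return {s for s in sets if not any(other < s for other in sets)}

# Top-level hazard "User does not receive configuration identifiers" (Figure 6.1):
# subtree A contributes the single events A-D, subtree B the combinations with E.
tree = ("OR", ["A", "B", "C", "D",
               ("AND", ["E", "F"]),
               ("AND", ["E", ("OR", ["G", "H"])])])

print(sorted(sorted(s) for s in minimal(cut_sets(tree))))
# [['A'], ['B'], ['C'], ['D'], ['E', 'F'], ['E', 'G'], ['E', 'H']]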

6.2.6. Summary of Findings

The fault trees referenced in Appendix B.2 have been constructed by systematic deduction from their top-level hazards. In order to evaluate the trees with respect to their minimal cut sets, the basic events were assigned labels by mapping each event to an identifier, as in Table 6.3. The mapping allows each fault tree to be described as a Boolean expression based on the logical connectors AND and OR. Note that the mapping is not restricted to the same labels as in Table 6.3.

Equation (6.2) shows the minimal cut set of the fault tree whose top-level hazard references the user-focused use case. Every set in the equation references a basic event of the fault tree that is connected by a logical gate. Furthermore, it shows that each subset of m is a potential failure whose occurrence results in the top-level hazard. This strengthens the conclusion that multiple single points of failure exist. In general, it is reasonable to assume that a system's vulnerability increases if a minimal cut set m contains more single entity subsets. A single entity subset is a subset that contains only one basic event, such as the subset {A} in Equation (6.2). In contrast, it can be argued that if a minimal cut set contains more subsets with multiple basic events, such as {E,F}, the system's vulnerability decreases, since all basic events of such a subset are connected by an AND gate. These observations are arguable, though, considering that no probabilities for such events exist. For example, assume every single entity subset of the previously described minimal cut set has a probability of 0; then the cut set is not considered fragile, as the probability that these single points of failure get into a failure state is equal to 0.

Considering the results of Section 6.1, it becomes evident that multiple failure modes result in the top-level hazard described in Figure 6.1, where the user does not receive configuration identifiers. Therefore, the fault tree discussed in this section is representative for a large range of failure modes. Even if the failure modes change, the fault trees of the resulting events do not necessarily change, since the way the events interact with each other remains the same. With respect to user-focused events, there is no wide variety of distinct events that would be meaningful to analyze other than the ones discussed in this section. This is different for the administrator-focused events. In fact, the hazard analysis revealed multiple fault trees with respect to the administrator. Interestingly, based on the way the components of the system communicate, as shown in Figure 5.5 and Figure 5.6, these fault trees are equal in their structure and semantics. However, the minimal cut set of the user-focused top-level hazard suggests a lower vulnerability than the minimal cut sets of the administrator-focused hazards. The minimal cut set of the administrator, as shown in Equation (6.3), consists of four single entity subsets, resulting in four single points of failure.

Minimal cut set of FTA_AdministratorCannotCreateData = {{A}, {B}, {C}, {D}}        (6.3)

The findings of this analysis were reviewed once by the developers with respect to a common understanding of how the system is expected to work. Conspicuous is the behavior of the managed services provided by Kubernetes: as already discussed in the previous section and outlined by the interviewees, it is not clear how the system responds to alterations of these provided services. The fault trees are one possibility to extract chaos experiments with a small blast radius. For example, a possible experiment might state a hypothesis about the state of the application after one of the two microservice instances described in Section 6.2.4 shuts down.


Here again, it must be considered that the blast radius might be larger than expected due to the unknown behavior of the Kubernetes services.

6.3. Computational Hazard Analysis and Operability

6.3.1. Summary of CHAZOP Findings of Appendix B.3 and Procedure

CHAZOP is derived from HAZOP, a technique commonly used in the chemical industry to analyze deviations in chemical fluid flows and in system design [FH94]. A CHAZOP study in software engineering aims to analyze interconnections between components and services according to a combination of guidewords and attributes. In this particular case, the analyzed interconnections reference the sequence diagrams in Chapter 5 and comprise components of the prototype system. It has to be considered that the documentation style and the semantics of the guidewords of a CHAZOP study may deviate from other CHAZOP studies, depending on the studied system and the researcher's intention.

The interconnections of this CHAZOP study focus on the data flow between two components or services. The data flow attribute has been combined with the different guidewords defined in Table 6.4. Based on this combination of attribute and guidewords, the causes were elaborated qualitatively, as in the previous approaches. As CHAZOP is also a qualitative group activity, the results may be incomplete due to the lack of thorough reviews. Moreover, the documented results leave out additional information, such as prevention techniques that explain how different causes and consequences could be avoided. As it is not the intention of this work to solve individual weaknesses of the investigated system, but rather to surface them, the prevention techniques have been omitted.

Many of the identified consequences are similar to the system effects of the FMEA. Furthermore, investigating the findings of the CHAZOP study, it becomes obvious that most findings concern the abortion or delay of requests. As the system reflects a business-critical system that aims to create revenue from its users, the response times of requests are extremely important. The criticality of causes based on timeouts and delays has yet to be determined: the interviews and the architecture description did not provide any information about possible timeout or abortion rules inherent in the system. Therefore, it is almost impossible to derive the consequences that may result from a delayed or aborted request. The CHAZOP study of every interconnection implies that delays and abortions of requests should be investigated further, e.g., by executing chaos engineering experiments that inject service call delays into the system.

It has also been revealed that the Kubernetes Master is a critically important component that must be alive in order for the entire system to work. However, further analysis of the master component with approaches such as FTA or FMEA is difficult, if not impossible, because of the lack of information about how the Kubernetes Master reacts to certain events, e.g., shutting down due to a failure. Applying chaos engineering experiments to the Kubernetes Master must therefore be done carefully. Discussions with the lead architect of the system in the context of the CHAZOP findings revealed that it is not clear what would happen if some of the findings of the CHAZOP study were applied in a chaos engineering experiment. This is different from other findings, e.g., from the FMEA, where some expectation about possible failures exists. As the Kubernetes Master is the component that actually manages the cluster, applying chaos engineering experiments to it may result in a total outage of the cluster. However, a major finding of the CHAZOP study is the necessity of applying chaos engineering experiments that address request abortions and request delays, as these have not been sufficiently explored by the developers of the system. Incidents that result from these causes are, in their severity, equivalent to a total outage of the system, as they lead to higher response times, which may lead to a loss in customer revenue.

6.3.2. Definition of Guidewords and Documentation Style

In this case study, the results of the CHAZOP study are documented in a table similar to Table 6.1. The columns describe (1) the number of the investigated scenario, (2) the interconnection and flow, (3) the attribute respective to the interconnection, (4) a guideword, (5) possible causes of the guideword, and (6) consequences and implications. The guidewords used in this CHAZOP study are based on the guidewords described in Section 2.2. Table 6.4 shows the semantics of the guidewords that have been used throughout this CHAZOP study.

6.3.3. Analysis of the Interconnection: Ingress to Registry

Table 6.5 shows a sample of the CHAZOP study results for the prototype system. The interconnections are selected with respect to the interconnections illustrated in the sequence diagram in Figure 5.5. In total, the CHAZOP study covers all communication paths illustrated in the sequence diagrams of Section 5.4.2. The attribute data flow describes that data is transmitted between the participating components. In this case, the type of data is not specified; however, there may exist CHAZOP studies that document the transmitted data type, e.g., XML or JSON.


Guideword | Semantic interpretation
no | No data is sent, received or exchanged as expected
more | More data is sent, received or exchanged than expected
less | Less data is sent, received or exchanged than expected
reverse | The communication, i.e., sender and receiver, is reversed
early | Data is sent or received earlier than expected
late | Data is sent at a later point in time than expected
before | Data is sent or received before a necessary event occurs
after | Data is sent or received after a necessary event occurs
part of | Data that is sent or received is part of a different interaction
other than | Data sent or received is different from the expected data
as well as | Two conflicting events happen at the same time during one interaction of two components

Table 6.4.: Guidewords related to CHAZOP study

The semantics of the guidewords have been explained in Table 6.4. The arrows next to the names of the participating components describe the direction of the data flow. For example, the arrow in the second column of row one is directed to the right, indicating that data is sent from the Ingress to the Registry component; an arrow directed to the left would indicate that data flows from the Registry to the Ingress. The causes in the fifth column describe the root cause for the selected guideword. For example, in the case of the guideword none, the root cause for no transmitted data may be the abortion of a sent request. The consequences in the last column are the result of the combination of attribute and guideword.

The sample in Table 6.5 shows two participating components, namely the Ingress and the Registry. The attribute indicates the investigated property of the interconnection, combined with the guideword none. The first row thus describes a scenario where no data is sent or received between the two components Ingress and Registry. This may happen due to a request being aborted by the system. The abortion of the request leads to several consequences. Summarizing the consequences of the first row leads to the assumption that this combination of attribute and guideword for the interconnection between Ingress and Registry results in either long response times or the user receiving an error from the system. The subsequent rows of Table 6.5 emphasize the necessity to analyze one attribute with respect to multiple guidewords in order to get a better coverage of potential design deviations.


No. | Interconnection and Flow | Attribute | Guideword | Causes | Consequences
1 | Ingress-Registry (→) | data flow | none | Request was aborted or timed out | Registry does not receive requests; Registry is going to be overloaded; Ingress may start a retry; Ingress does not receive any data; user waits long for a response
2 | Ingress-Registry (→) | data flow | other than | Data content of response from Registry is removed | Ingress receives an empty response; user does not receive any data
3 | Ingress-Registry (→) | data flow | other than | Data content of response from Registry is changed to a random list of configuration objects | User receives data that he has not asked for; user experience is reduced; user has to request data again
4 | Ingress-Registry (→) | data flow | other than | HTTP method is not permitted, header information is wrong, or there is no controller present for the given URI | Registry cannot handle request; Registry returns an error; user receives no data
5 | Ingress-Registry (→) | data flow | less than | Remove identifiers in case of retrieving single objects by their respective identifier | Registry returns an error since required identifiers are missing; user receives an error; user receives no data
6 | Ingress-Registry (→) | data flow | more than | Body of request contains multiple identifiers instead of a single identifier | Registry does not provide a controller to handle multiple identifiers in a request; Registry returns an error; user receives an error
7 | Ingress-Registry (→) | data flow | late | Request is delayed, or request is aborted and the client starts a retry | Increased response time; user may abort the process

Table 6.5.: Result of a CHAZOP study of the interconnection between the Ingress and Registry

Additionally, the sample illustrates that some interconnections do not require the application of all guidewords; this occurs when the application of specific guidewords does not make sense for the analyzed interconnection. Inspecting rows 2 to 4, the guideword other than states that the transmitted data is different from what is expected. For example, in row two the root cause is stated to be the removal of the data content of a request forwarded by the Ingress. In this case, the Registry expects a request that comprises either a list of parameters identifying a subset of configuration data or an empty list of parameters. The list of parameters may, for example, be changed so that the response contains no data. As in the scenario before, this results in the user not receiving any data.


Row 5 describes a partial loss of data, indicated by the guideword less than. For example, in the case of a request that retrieves a single object by its identifier, the identifier passed as an argument is removed. This may lead to the Registry returning an error to the user. The sixth row, describing the guideword more than, is intended to be the opposite of less than: given the previous request, where an object is retrieved by its identifier, a list of identifiers is passed as arguments instead of a single identifier. Since the Registry may not provide a controller for such a request, the Registry will eventually return an error to the user. The last row describes a common scenario with respect to the guideword late. Consider a request being aborted due to a latency issue and assume that the Ingress starts a retry in case of abortion. The response times will eventually increase due to the latency issue. An increase in response times leads to a decrease in user experience and may eventually result in the user aborting the process.

6.3.4. Findings of CHAZOP study

The CHAZOP study has been conducted on each interaction considered in the sequence diagrams of Section 5.4.2, as these are considered to be the interconnections that contribute to the system's use cases. The results were discussed with the same people who had been interviewed in order to gather information about the studied system. In general, the interconnections between Ingress, Registry, and the database are considered to be the most critical. This becomes obvious by investigating the results provided in Appendix B. Considering the consequences of the more common cases, such as the loss and delay of information exchange indicated by the guidewords none and late, the interconnections between these components are extremely fragile. However, it has to be mentioned that during the discussion of the CHAZOP results, the developers were neither aware of nor certain about the consequences of the investigated interconnections with respect to the applied guidewords. For example, it is unclear what exactly will happen if a request sent to the Registry and issued by the Ingress is aborted. These edge cases that emerged during the CHAZOP study may be potential candidates to be evaluated in a chaos experiment but are also difficult to handle. This becomes more evident considering that almost every edge case leads to a system failure, which is considered unsuitable for evaluation with a chaos experiment. Due to time constraints, the results of the CHAZOP study have only been reviewed by the developers once, which leads to the assumption that the results of the study may be incomplete. Therefore, a thorough review of the CHAZOP results is considered meaningful in order to increase the quality of the study results. As illustrated in Chapter 5, the architecture of the system comprises several managed services provided by the Kubernetes platform which have not been implemented manually.

However, the services provided by Kubernetes were still considered during the CHAZOP study, as shown in Appendix B, since they are critical pieces of the system's architecture. It is yet unclear how the managed services, such as the Ingress, will behave during a chaos experiment with respect to the findings of the CHAZOP study. Hence, chaos experiments which involve these interconnections should be run in an environment where failures do no harm, unlike production. Additionally, it is advised to elaborate the study results further in order to build trust in the findings and enable a better formulation of experiments.

6.4. Summary of Hazard Analysis

The findings of the hazard analysis revealed vulnerabilities of the system which might otherwise have been discovered only during a chaos experiment. It is conspicuous that the developers are not fully aware of most findings revealed by the FMEA and the CHAZOP study, as seen in Table 6.1 and Table 6.5. This is especially true for the components and interactions that include managed services such as the Ingress and Registry. Those services are, unlike the Kubernetes master, to some degree configurable and manageable by the developers. However, the edge cases have a high potential to be very difficult to evaluate during a chaos experiment, based on the unforeseeable behavior of the Kubernetes platform. The developers agree that these services are commonly not forced to terminate, and they further agree that the system is very likely to go down entirely if one of these components shuts down, independent of the root cause.

The findings further showed that the system contains multiple single points of failure, which include almost all managed services except the database and the configuration microservices; consider the minimal cut sets in Equation (6.2) and Equation (6.3) of the corresponding user-focused and administrator-focused fault trees. From a chaos engineering perspective, the strong interconnection between the managed services and the rest of the system increases the complexity of creating chaos experiments with a small blast radius. Independent of the stage where chaos experiments are executed, executing an experiment that almost certainly crashes the system yields little to no experimental value. Arguably, the missing reviews of the findings and the lack of longer investigation cycles might have led to missing findings that could have provided a higher value for creating chaos experiments. Currently, the most interesting experiments include edge cases where the developers are completely unaware of the consequences, though it is difficult to minimize the blast radius. Nevertheless, the blast radius can be reduced by focusing on a single component.


It must be evaluated further whether the blast radius is small enough with respect to an experiment that includes a design deviation revealed by the CHAZOP study. At first, it is considered more reliable to focus on a single managed service, e.g., the Registry, instead of the communication between two managed services, e.g., as shown in Table 6.5. In the end, four experiments are extracted from the findings. The experiments were selected by prioritizing a small blast radius, followed by the estimated impact on the system as illustrated by the severity levels of Table 6.1. The first two experiments are extracted from the minimal cut sets shown in Equation (6.2); the subsets {E,G} and {E,H} are altered in order to illustrate an increasing blast radius. The first experiment focuses on the basic event {H}, i.e., terminating a pod of a configuration microservice; the second experiment gradually increases the blast radius by targeting the two basic events {E,G}, i.e., terminating the pod of the database and the pod of a configuration microservice. The third experiment increases the blast radius further by adding additional configuration data objects to the database; the experiment configuration is equal to the second experiment, but more configuration microservices are available. The last experiment covers the usage of the Chaos Monkey for Spring Boot [Wil18]. In this experiment, the chaos monkey is configured to inject latency into two service calls. This experiment references multiple findings from the CHAZOP study.

Chapter 7

Experiment Environment and Tool Configuration

The general process of conducting research in computer science and other research domains is commonly understood to include some kind of measuring, interpreting, and reporting of results, where reproducibility is a key concern of the scientific method [PVB+19]. Based on the findings of Chapter 3, it becomes obvious that currently no real standard exists in the field of chaos engineering for reporting chaos experiments. Missing standards for metric selection, measurement, and monitoring, as well as the heterogeneity among chaos engineering tools, increase the complexity of reporting reproducible chaos engineering experiments. In order to be reproducible, a chaos experiment report should at least include a description of the investigated system, if possible provide the system as an executable, a description of the steady-state and how it was measured, a set of hypotheses related to the steady-state, and the declared chaos engineering experiments with the observed measurements. In addition to the chaos-experiment-related information, a report should also include information about the environment where the steady-state was measured and where the experiments have been executed.

This chapter is structured as follows. Section 7.1 describes the cloud environment, i.e., the experiment environment. Section 7.2 explains the deployment of the prototype system. Section 7.3 gives a description of the JMeter configuration. Due to customer anonymization it is not possible to provide the investigated system in the final report.


Figure 7.1.: Experiment environment. The Azure cloud hosts the Azure Kubernetes Service, which receives user requests generated by Apache JMeter running on a virtual machine; chaos experiments are configured in JSON format and run from a local physical machine.

7.1. Cloud Environment

The deployment platform Microsoft Azure [Mica] is used as the cloud provider. This work covers neither a detailed explanation of Microsoft Azure nor an evaluation of different cloud providers. Also, since the prototype system is deployed to a Kubernetes cluster, explanations of Kubernetes-related configuration steps, e.g., kubectl, are omitted as well. Therefore, the platform and corresponding configurations are taken as a given in the scope of this study. Figure 7.1 illustrates the environment and the interaction of the deployed applications. As shown, two applications are running in the Azure cloud. The left application is a managed service by Azure, i.e., Azure Kubernetes Service (AKS) [Micb]. AKS provides a Kubernetes cluster with the possibility to deploy containerized applications. The Kubernetes API is the same as for any locally deployed cluster, except that AKS is able to provision more than one cluster node and takes care of several infrastructure-related concerns that are hidden from the user. AKS runs on version 1.12.8 and is hosted in the region North Europe. Furthermore, a horizontal pod autoscaler is enabled with a target value of 80%, i.e., the autoscaler compares the current resource metrics against this target and scales the replica sets or pods accordingly (a short sketch of this scaling rule follows at the end of this section).

The right side of Figure 7.1 shows a Virtual Machine (VM) that runs the tool Apache JMeter. The VM is deployed to the region North Europe and runs Ubuntu as its Operating System (OS) with eight gigabytes of memory. Both applications are deployed to the same region in order to minimize the distance between them and thus reduce latency. Apache JMeter runs on version 5.1.1, which requires at least Java 8; the VM runs Java 11.0.3. The properties of the application running in AKS will be changed during JMeter's test period by running Chaostoolkit from a distinct physical machine. The physical machine runs macOS as its host OS. Chaostoolkit runs in its latest version 1.2.0 with the Kubernetes driver installed and enabled. As the Chaostoolkit Command Line Interface (CLI) requires a working Python installation [Chab], the physical machine running Chaostoolkit has Python 3.7.3 installed. The chaos engineering experiments are declared as JSON files, though it is possible to declare experiments in YAML as well. JSON is selected out of personal preference; the file type has no impact on the quality of the experiments. The structure of the experiments follows the explanations in Section 2.4.5.
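The scaling rule applied by the horizontal pod autoscaler can be sketched as follows. This is a simplified illustration of the general Kubernetes scaling formula, not code from the prototype; only the 80% target stems from the configuration above, and the utilization values are made up.

# Simplified sketch of the horizontal pod autoscaler's scaling rule
# (desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)).
# Only the 80% target comes from the AKS configuration; the utilization values are examples.
import math

def desired_replicas(current_replicas: int, current_utilization: float, target: float = 80.0) -> int:
    return math.ceil(current_replicas * current_utilization / target)

print(desired_replicas(2, 120.0))  # 3 -> scale out when utilization exceeds the target
print(desired_replicas(3, 40.0))   # 2 -> scale in when utilization drops below the target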


Listing 7.1 Provisioning deployments for the AKS cluster

kubectl apply -f registry.yaml
kubectl apply -f admin.yaml
kubectl apply -f config-database.yaml
kubectl apply -f ingress.yaml

7.2. Azure Kubernetes Service – Deployment

In the following, it is assumed that kubectl is connected to the remote cluster on Azure. Furthermore, the following deployment and service names have been altered in order to ensure customer privacy. As the prototype system is given in this work, a set of Kubernetes deployment files is already in place. In order to deploy the prototype to Azure, the AKS cluster receives a set of deployments which are passed to the cluster by using the Kubernetes API. Listing 7.1 shows the corresponding series of commands that sends the deployments to the AKS cluster. After receiving the deployments, AKS will immediately begin to spin up containers and expose services where declared. After all pods are running, the application is ready to receive requests. In the current deployment, the cluster runs one pod for each dedicated service. It is expected that the application dynamically creates configuration microservices if there exists no running container with a configuration microservice for a dedicated configuration object in the database.


For example, if the database holds 100 different configuration objects, Kubernetes will, upon initial start, create 100 configuration microservices, one for each configuration object stored in the database. In addition, the stakeholders insisted that for each configuration microservice a minimum of two pods should exist for redundancy purposes, which is also specified in Chapter 5.
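Whether this redundancy requirement holds in the running cluster can be checked programmatically. The sketch below uses the official Kubernetes Python client; the assumption that configuration microservices carry names starting with vcc-service- is derived from the label selector used later in Listing 8.2 and is not otherwise documented.

# Illustrative sketch: verifying that every configuration microservice deployment runs
# at least two ready pods. Uses the official Kubernetes Python client; the naming
# convention "vcc-service-" is an assumption based on the label selector in Listing 8.2.
from kubernetes import client, config

config.load_kube_config()                      # uses the same context as kubectl
apps = client.AppsV1Api()

for deployment in apps.list_namespaced_deployment(namespace="default").items:
    name = deployment.metadata.name
    if not name.startswith("vcc-service-"):    # assumed prefix of configuration microservices
        continue
    ready = deployment.status.ready_replicas or 0
    print(f"{name}: {ready} ready replicas")
    assert ready >= 2, f"redundancy requirement violated for {name}"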

7.3. Simulating and Monitoring Environmental Conditions

As discussed in Chapter 5, the prototype system has never been exposed to real user behavior, e.g., in a production environment. Thus, in order to run meaningful chaos experiments, it is necessary to simulate the expected user behavior. The findings of Chapter 5 revealed that the current legacy application is said to experience approximately 250 requests per second. As no other data or metric exists, this workload serves as a baseline to which the prototype is exposed over a certain period of time. As stated in Chapter 5, two types of users use the system, i.e., business users and administrators. Since administrators access the system only about every two to three months, the administrator use case is considered negligible and is not simulated. Due to the time constraints of this thesis, the JMeter test runs for 25 minutes in order to allow the system to normalize and align to the workload.

Properties and elements of a JMeter test plan have been explained in Chapter 2. In the following, a parameterized JMeter configuration is given. In order to create a parameterized JMeter test, JMeter's built-in property function is used. The property function allows injecting values in CLI mode before the test executes by using the -J flag followed by the name of the declared parameter passed into the function. For example, given a property function and a variable name identifying the number of threads, the value can be passed in CLI mode by declaring -JNUMBER_OF_USERS=10. If a test is not parameterized, any future change requires starting JMeter in its Graphical User Interface (GUI) mode. Using the property function allows injecting the required values in CLI mode and thus eliminates the necessity to start JMeter in GUI mode.

The test plan is the root of the test. Figure 7.2 shows the test plan tree with one child element, i.e., the user thread group. The test plan is given a precise name with respect to its purpose in this work, i.e., CHAOS_ENGINEERING_LOADTEST. The thread group references business users that use the business logic of the application, i.e., retrieving configuration identifiers in order to retrieve configuration objects from the database. In order to behave like real users, the thread group is further configured by thread properties, a scheduler configuration, HTTP samplers, and throughput timers.


Figure 7.2.: JMeter test plan

The thread properties and other values relevant to the test plan can be injected in CLI mode by declaring a property function inside the corresponding property fields. JMeter allows two declaration types for using property functions which can be used interchangeably. In this case, the first declaration type is going to be used throughout the test plan to keep it consistent.

1. ${__P(NUMBER_OF_USERS)}
2. ${__property(NUMBER_OF_USERS)}

Next, an HTTP sampler is added to the thread group. Figure 7.3 shows the configuration of the HTTP sampler of the user thread group. The HTTP requests receive the path to the requested service and the data body as parameters in CLI mode; the HTTP methods are declared in GUI mode. In addition to a basic HTTP sampler, it is possible to enable an HTTP Request Defaults element. The defaults element allows declaring properties that hold for every HTTP request in the selected thread group. Hence, within an HTTP Request Defaults element a base Uniform Resource Locator (URL) can be defined that is used by every HTTP request sent in the corresponding thread group. For example, if the base URL is www.simple-store.de/, then any subsequent HTTP request within the same thread group uses www.simple-store.de/ as its base URL. Note that if an HTTP Request Defaults element with a base URL is declared, then subsequent HTTP Request samplers do not need to declare a server name or IP, though it is still possible to provide additional paths to resources. Finally, a Constant Throughput Timer is added to the user thread group. As stated in Section 2.4.4, the Constant Throughput Timer is responsible for maintaining a constant throughput, specified per minute. For example, if a thread group is required to generate 50 requests per second, the target throughput is calculated as RequiredThroughput ∗ 60 = 50 ∗ 60 = 3000. In this case, in order to maintain a throughput of 250 requests per second, the throughput timer is provided with a value of 250 ∗ 60 = 15000.


Figure 7.3.: JMeter user thread group HTTP request configuration

No. | Property name | Value
1 | ${__P(NUMBER_OF_USERS)} | 300
2 | ${__P(RAMP_UP_TIME_USERS)} | 120
3 | ${__P(THROUGHPUT_USERS)} | 15000
4 | ${__P(CONTENT_TYPE)} | Content-type:
5 | ${__P(CONTENT_VALUE)} | application/json
6 | ${__P(HOST_IP_INGRESS)} | public IP of AKS
7 | ${__P(SERVICE_URI_CONFIGID)} | /registry/ID
8 | ${__P(SERVICE_URI_CONFIGDATA)} | /entity/getData/ID
9 | ${__P(DURATION)} | 1500

Table 7.1.: User thread group related property functions

Table 7.1 shows all declared property functions with their corresponding values as used within the test plan. The first row of Table 7.1 shows the property function that defines the number of threads of the user thread group. Row two declares the ramp-up time. Row three specifies the throughput for the user thread group. Rows four and five declare the properties that specify the content type header and its value for each HTTP request, e.g., Content-type: application/json. ${__P(HOST_IP_INGRESS)} represents the public IP of the Kubernetes cluster, i.e., the Ingress controller that routes external requests to internal services of the AKS cluster. Rows seven and eight represent the URIs of the corresponding services that receive the user requests. The last row represents the duration of the entire JMeter test in seconds.
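Putting the pieces together, the parameterized test plan can be launched in non-GUI mode with the property values of Table 7.1. The sketch below is illustrative; the test plan and result file names are assumptions, and the public IP is the example value that also appears in Listing 8.1.

# Illustrative sketch: launching the parameterized test plan in JMeter's non-GUI mode.
# File names are assumptions; property names and values follow Table 7.1.
import subprocess

properties = {
    "NUMBER_OF_USERS": "300",
    "RAMP_UP_TIME_USERS": "120",
    "THROUGHPUT_USERS": "15000",
    "CONTENT_TYPE": "Content-type:",
    "CONTENT_VALUE": "application/json",
    "HOST_IP_INGRESS": "104.46.14.61",            # public IP of the AKS cluster (example value)
    "SERVICE_URI_CONFIGID": "/registry/ID",
    "SERVICE_URI_CONFIGDATA": "/entity/getData/ID",
    "DURATION": "1500",
}

command = ["jmeter", "-n",                         # non-GUI mode
           "-t", "chaos_engineering_loadtest.jmx", # hypothetical test plan file
           "-l", "results.jtl"]                    # result file analyzed in Chapter 8
command += [f"-J{name}={value}" for name, value in properties.items()]

subprocess.run(command, check=True)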

Chapter 8

Chaos Experiments Design, Execution, and Analysis

Chaos engineering experiments require thoughtful and careful planning with respect to selecting appropriate targets and minimizing the blast radius, as stated in Section 2.6. The targets have already been selected in Chapter 6 by identifying potential risks in the architecture of the prototype discussed in Chapter 5. Chapter 7 provided a detailed description of the experiment environment. This chapter completes the experiment report by providing a description of the steady-state and a detailed explanation of the experiments that have been derived from the findings of Chapter 6. Additionally, the experiments are executed and the findings are discussed.

This chapter is structured as follows. Section 8.1 describes the steady-state and how it was measured. Section 8.2 describes the experiment layout used throughout this chapter. Section 8.3 contains a set of hypotheses in addition to the corresponding experiments. Section 8.4 discusses the findings of the experiments. Section 8.5 elaborates possible threats to the validity of this work.

8.1. Steady State Description

Since the prototype system does not provide the necessary business logic to establish any business metrics, the response times measured by JMeter were used as a technical metric to reflect the health of the system. For the purposes of this work, JMeter was configured to store the response time, timestamp, and label of each HTTP request executed during each run, i.e., GET_CONFIG_ID and GET_CONFIG_DATA as seen in Figure 7.3.


Figure 8.1.: Response times for the warmup, steady-state, injection, and recovery phases with respect to the GET_CONFIG_ID operation (response times in milliseconds plotted over the execution time, displayed as Hours:Minutes).

The steady-state ideally reflects the health of the system in that the response times even out at a certain level at a given point in time, i.e., the system eventually settles under the static workload. In order to know when the system is in a steady-state, the duration of a JMeter test run has been varied with the values 300, 600, 900, 1200, and 1500 seconds. The response times were observed to even out after four to five minutes, given a ramp-up time of two minutes. In order to gather more observations, the final duration of the JMeter test was set to 1500 seconds, i.e., 25 minutes. With respect to the ramp-up time, the first two to four minutes were not considered for describing the steady-state. Because JMeter gradually decreases the throughput at the end of the test, the last minute was also not considered during the chaos experiments.

Figure 8.1 shows a boxplot diagram of the first experiment run. The y-axis shows the measured response times in milliseconds. The x-axis shows the local wallclock time of execution displayed as Hours:Minutes. The diagram is divided into four different phases highlighted in white, green, red, and blue from left to right to illustrate the distributions that will be observed and compared. The green area highlights the time interval where the system is believed to be in a steady-state. The red area shows the time interval in which the chaos experiment is started.


Figure 8.2.: Steady-state phase of the first experiment executed in Chaostoolkit (response times in milliseconds between 12:11 and 12:22, displayed over the execution time as Hours:Minutes).

Usually, a chaos experiment takes eight to 20 seconds to execute. Therefore, in order to observe possible changes in the system, a time interval of five to six minutes after the experiment has been started is observed; this is called the injection phase. The blue area illustrates the phase in which the system is expected to recover from the injection. A more precise illustration of the steady-state phase, in relation to Figure 8.1, is given in Figure 8.2. After the experiments have been executed, the distributions highlighted in the coloured areas are compared by applying a non-parametric statistical test. Since JMeter runs for 1500 seconds, the chaos experiment is started after 900 seconds. This allows the system to reach its steady-state before the chaos experiment is started.


Target Service |
Experiment Type |
Hypothesis |
Blast Radius |
Status Before |
Status After |
Start Time |
End Time |
Findings |

Table 8.1.: Chaos experiment template

8.2. Experiment Description

Describing chaos experiments properly is essential in order to provide detailed documentation and to allow a thorough analysis of the findings. Table 8.1 shows the layout that is used throughout this work for describing chaos engineering experiments. Note that this is not a fixed layout; chaos engineering experiments can be described and documented in any way the researchers and engineers consider sufficiently good and precise. The layout focuses on capturing the most important properties of a chaos engineering experiment.

• Target Service – Which service is going to be attacked/observed?
• Experiment Type – What type of chaos/failure is going to be injected into the target?
• Hypothesis – Hypothesis about the steady-state that is assumed to hold
• Blast Radius – Potential system-wide affected services of the experiment
• Status Before – State of the system before the experiment is executed
• Status After – State of the system after the experiment has been executed
• Start Time – Start time of the experiment
• End Time – End time of the experiment
• Findings – Potential weaknesses that may have been uncovered


At the beginning, the Target Service or services are identified; in this case, this has been done in Chapter 6. Next, the Experiment Type is declared. The experiment type denotes the type of failure that is going to be injected into the target services. Then, the Hypothesis is stated in relation to the system's steady-state. The Blast Radius reflects potential system-wide effects of the experiment and should therefore be considered carefully in order to avoid unexpected cascading failures. The Status Before illustrates a concrete state of the system before the experiment is run. The status is directly related to the steady-state and can be stated as an expected HTTP status code or an expected response time interval. Lastly, the Start Time of the experiment should be documented as precisely as possible to allow a higher quality of analysis. The fields Status After, End Time, and Findings are documented after the experiment has been executed. For the first experiment, the JSON structure of the experiment in Chaostoolkit is shown; for further experiments, the detailed JSON structure is omitted. All experiments are provided as supplementary material.

8.3. Hypothesis Backlog and Experiment Execution

In the following, the chaos experiments identified in Chapter 6 are documented according to Table 8.1 and executed in the environment discussed in Section 7.1. After the experiments have been executed, the results are tested by applying a non-parametric test; it has been decided to use the Kolmogorov-Smirnov test. The decision has been made by following the decision matrix of Field et al. [FH02, p. 274,278]. The decision matrix leads to the conclusion that the Wilcoxon test should be used. However, since the Wilcoxon test does not compare the shape of the distributions, the Kolmogorov-Smirnov test is used as a suitable alternative, as suggested by Field et al. The Kolmogorov-Smirnov test receives two distributions as input and calculates two statistics, i.e., a D-value and a p-value. The D-value lies between 0 and 1 and indicates the difference in the shapes of the distributions. If the D-value is close to 0, the difference in the shapes is considered not statistically significant, i.e., the shapes of the distributions are considered equal. However, if the D-value is close to 1, the difference in the shapes is considered statistically significant. In this context, a larger difference in the shapes of the distributions indicates a considerable impact of the executed chaos engineering experiment on the system's state.

The diagrams and calculations have been produced using the R programming language [RF]. The corresponding scripts will be provided as supplementary material. The data is displayed as boxplots. Boxplots show a summary of five statistics that can be read directly from the diagram. The whiskers at both ends are commonly set to a length of 1.5 times or 3 times the interquartile range, though the whiskers end at the last value that still lies within this range, which results in different whisker lengths.

99 8. Chaos Experiments Design, Execution, and Analysis

Target Service Configuration Microservice with ID =1 Experiment Type Terminate Kubernetes pod Hypothesis The application is able to respond provid- ing the requested configuration data but the response times will increase Blast Radius Configuration Microservice with ID =1 Status Before List_of_ID = [ID = 1] Status After List_of_ID = [ID = 1] Start Time 2019-08-04 12:24:16 End Time 2019-08-04 12:24:29 Findings Increase in response times but the requested data was received as expected

Table 8.2.: Experiment details of experiment no. 1

at the last value that still lies within this range, which results in different lengths of the whiskers. If no value lies outside of the interquartile length, the whiskers’ length depends on the minimum and maximum value of the data. Furthermore, a boxplots shows the 25% quantile, the median and the 75% quantile. The median is shown by the horizontal line inside the boxes. The 25% quantile is the bottom line of each box whereas the the 75% quantile is the top line of each box. The box itself illustrates the interquartile range whereas the black dots are considered to be outliers. The Kolmogorov-Smirnov test will receive the steady-state distribution and the injection phase distribution at first. This comparison may show a deviation in the steady-state after a chaos experiment has been started. Secondly, the test will receive the steady-state distribution and the recovery phase distribution as input. Thus, revealing if the system was able to recover from the chaos experiment. In regards of the hypothesis tests, a significance level α = 0.05 is assumed. The null-hypothesis is the negation of the steady-state hypothesis stated in the corresponding chaos experiment description. If the calculated p − value is lower than the significance level α, the null-hypothesis will be rejected in favor of the steady-state hypothesis.
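The following Python sketch illustrates the test procedure described at the beginning of this section; the thesis itself used R for the calculations, and the CSV file names and the JMeter column name "elapsed" are assumptions:

    import pandas as pd
    from scipy.stats import ks_2samp

    ALPHA = 0.05  # significance level assumed in this chapter

    # One response-time sample per phase, e.g. exported from the JMeter results
    steady = pd.read_csv("steady_state.csv")["elapsed"]
    injection = pd.read_csv("injection_phase.csv")["elapsed"]
    recovery = pd.read_csv("recovery_phase.csv")["elapsed"]

    for name, phase in [("injection", injection), ("recovery", recovery)]:
        d_value, p_value = ks_2samp(steady, phase)
        print(f"steady-state vs. {name} phase: D = {d_value:.5f}, p = {p_value:.3g}")
        if name == "injection" and p_value < ALPHA:
            # Null-hypothesis (negated steady-state hypothesis) is rejected
            print("  -> reject the null-hypothesis in favor of the steady-state hypothesis")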


Listing 8.1 Steady-state-hypothesis declaration according to Table 8.2

{
  "steady-state-hypothesis": {
    "title": "The services are ready, healthy and at least the registry responds",
    "probes": [
      {
        "name": "application-responds",
        "type": "probe",
        "tolerance": {
          "type": "jsonpath",
          "path": "$.dataPools.*[?(@.dataVersion)].dataVersion",
          "expect": [1],
          "target": "body"
        },
        "provider": {
          "type": "http",
          "url": "https://104.46.14.61/vccrest/registry/getDataPools",
          "method": "POST",
          "verify_tls": false,
          "headers": {
            "Content-Type": "application/json"
          },
          "arguments": {
            "body": {}
          }
        }
      },
      {
        "name": "configuration-service-responds",
        "type": "probe",
        "tolerance": {
          "type": "jsonpath",
          "path": "$.configurationContext.dataPool.dataVersion",
          "expect": [1],
          "target": "body"
        },
        "provider": {
          "type": "http",
          "url": "https://104.46.14.61/vccrest/entity/getConfigurationContext/1",
          "method": "POST",
          "verify_tls": false,
          "headers": {
            "Content-Type": "application/json"
          },
          "arguments": {
            "body": {}
          }
        }
      }
    ]
  },


Listing 8.2 Method and rollback declaration according to Table 8.2

  "method": [
    {
      "type": "action",
      "name": "terminate-vcc-service-n-pod",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=vcc-service-1",
          "ns": "default",
          "mode": "fixed",
          "qty": 1
        }
      },
      "pauses": {
        "after": 10
      }
    },
    {
      "ref": "application-responds"
    }
  ],
  "rollbacks": []
}

8.3.1. Experiment No. 1: Terminating one configuration microservice by terminating its Kubernetes pod

Design

This experiment has been extracted from Figure 6.2. The Kubernetes pod that hosts a configuration microservice is going to be terminated by using Chaostoolkit’s Kubernetes library. The pod is identified by its label, which can be retrieved by using the Kubernetes API, i.e., kubectl describe pod service-1. The experiment information can be seen in Table 8.2. To reduce the blast radius of the experiments, it has been decided to not create more than one configuration microservice of a corresponding identifier if only one type of configuration microservice is targeted, e.g., as in Table 8.2. This ensures that the blast radius will not unintentionally be bigger than expected since it is not possible to affect multiple configuration microservices. However, it is still possible that other services might be affected. In subsequent experiments the blast radius will slowly be increased by adding additional configuration microservices, which is done manually by creating additional configuration objects in the database through the administrator API. The configuration objects are created by sending an HTTP request to the administrator service, which implements a controller for randomly generating configuration data.


The JSON declaration of Table 8.2 is illustrated in Listing 8.1 and Listing 8.2. Listing 8.1 shows the steady-state hypothesis. The hypothesis is stated to be true if the registry service returns a status code 200 followed by a JSON body containing a list of identifiers. Since only one configuration object is persisted in the database, it is expected that the list will only contain one identifier. Listing 8.2 shows the remainder of the experiment, containing the method and rollback declaration. The method block contains the Kubernetes library command to look for a single pod or multiple pods with a specific label, i.e., declared by the label_selector attribute. Since each configuration microservice with respect to an identifier is deployed twice, two pods will be found with the same label. The mode and quantity attributes ensure that only one pod is selected. Lastly, no rollbacks are declared since the state is expected to be restored by Kubernetes automatically. Note that if no rollbacks are declared, the attribute can be removed completely since it is stated as being optional.
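For comparison, the effect of the terminate_pods action in Listing 8.2 can also be sketched directly with the official Kubernetes Python client; the label selector is taken from Listing 8.2, while kubeconfig access and the default namespace are assumptions:

    import random
    from kubernetes import client, config

    config.load_kube_config()  # assumes kubectl access to the AKS cluster
    v1 = client.CoreV1Api()

    # Both replicas of the configuration microservice share this label
    pods = v1.list_namespaced_pod("default", label_selector="app=vcc-service-1").items

    # Select and terminate exactly one of them, analogous to mode="fixed", qty=1
    victim = random.choice(pods)
    v1.delete_namespaced_pod(victim.metadata.name, "default")
    print(f"terminated pod {victim.metadata.name}")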

Results

After the experiment has been executed, the boxplot diagrams in Figure 8.3 and Figure 8.4 were created based on the data collected by JMeter. Figure 8.3 shows the response times of retrieving the list of configuration identifiers from the registry, whereas Figure 8.4 shows the response times of retrieving the configuration data objects from the configuration microservices. The corresponding operations are labeled as GET_CONFIG_ID and GET_CONFIG_DATA in the JMeter configuration seen in Figure 7.2. Visual analysis of the diagrams shows that at the time the experiment has been executed, i.e., 15 minutes after JMeter has been started, an increase in response times occurs. The increase is more noticeable in Figure 8.4 than in Figure 8.3. A more precise analysis is given by applying the Kolmogorov-Smirnov test. The test results for Figure 8.3, i.e., the GET_CONFIG_ID operation, are shown in Listing 8.3, whereas the test results for Figure 8.4, i.e., the GET_CONFIG_DATA operation, are shown in Listing 8.4. Furthermore, Table 8.3 and Table 8.4 show a summary of standard statistics of the corresponding operations with respect to the observed phases. The summary comprises the mean, median, standard deviation, 25% quantile, 75% quantile, and 95% quantile. The values of the standard statistics are also reflected in the boxplots. Figure 8.5 and Figure 8.6 show the respective Empirical Cumulative Distribution Function (ECDF) of both operations. The difference in the range of response times of both operations is evident. Approximately 80% of the response times of the first operation, i.e., GET_CONFIG_ID, are lower than 400 milliseconds, while between 80% and 100% of the response times of the second operation, i.e., GET_CONFIG_DATA, are lower than 1000 milliseconds.


Figure 8.3.: Response times of GET_CONFIG_ID operation after first experiment (boxplots of response times in milliseconds per minute, execution time 12:08–12:33, annotated with steady-state, injection, and recovery phases)

Phase           Mean    Median   Standard Deviation   25% Quantile   75% Quantile   95% Quantile
Steady-State    221.6   85.0     228.6701             46.0           401.0          686
Injection       164.3   77.0     186.4005             55.0           174.0          588
Recovery        87.76   71.00    56.5072              57.00          97.00          189

Table 8.3.: Standard statistics with respect to the observed phases of the GET_CONFIG_ID operation, illustrated in Figure 8.3


Figure 8.4.: Response times of GET_CONFIG_DATA operation after first experiment (boxplots of response times in milliseconds per minute, execution time 12:08–12:33, annotated with steady-state, injection, and recovery phases)

Phase           Mean    Median   Standard Deviation   25% Quantile   75% Quantile   95% Quantile
Steady-State    7.375   3.000    21.93762             2.000          6.000          25
Injection       64.97   8.00     306.8682             3.00           27.00          146
Recovery        17.31   8.00     28.73427             4.00           15.00          90

Table 8.4.: Standard statistics with respect to the observed phases of the GET_CONFIG_DATA operation, illustrated in Figure 8.4

Figure 8.5.: Empirical Cumulative Distribution Function (ECDF) of GET_CONFIG_ID operation after experiment no. 1 (Fn(Response Times) plotted over response times of 0–1200 ms)

Figure 8.6.: Empirical Cumulative Distribution Function (ECDF) of GET_CONFIG_DATA operation after experiment no. 1 (Fn(Response Times) plotted over response times of 0–5000 ms)


Listing 8.3 Kolmogorov-Smirnov test results related to experiment no. 1 with respect to steady-state and injection phase of Figure 8.3

Two-sample Kolmogorov-Smirnov test
data: steady_state_distribution_get_id_experiment_one and injection_distribution_get_id_experiment_one
D = 0.15062, p-value < 2.2e-16
alternative hypothesis: two-sided

Listing 8.4 Kolmogorov-Smirnov test results related to experiment no. 1 with respect to steady-state and injection phase of Figure 8.4

Two-sample Kolmogorov-Smirnov test
data: steady_state_distribution_get_data_experiment_one and injection_distribution_get_data_experiment_one
D = 0.33061, p-value < 2.2e-16
alternative hypothesis: two-sided

A better comparison is given by applying the Kolmogorov-Smirnov test. The results in Listing 8.3 show a calculated D-value of 0.15062 and a p-value below 2.2e-16 for the distributions of Figure 8.3, i.e., the GET_CONFIG_ID operation. The calculated D-value of 0.15062 indicates that the difference between the shapes of the steady-state and injection phase distributions is not significant since the D-value is close to 0. The D-value of the GET_CONFIG_DATA operation, shown in Listing 8.4, is more than twice as high, with a value of 0.33061, compared to the GET_CONFIG_ID operation. Though the D-value is more than twice as high, it is still closer to 0 than to 1, thus implying that the shapes of the distributions are not significantly different. The p-values of both operations are lower than the significance level of α = 0.05, implying that the null-hypothesis should be rejected in favor of the alternative hypothesis stated in the chaos experiment description of Table 8.2. This further indicates that the steady-state hypothesis of the first experiment is met, but with an insignificant increase in response times as indicated by the D-values. The comparison of the recovery distributions of both operations, shown in Listing 8.5 and Listing 8.6, indicates that the shapes are slightly different based on higher D-values. The D-value corresponding to the GET_CONFIG_ID operation is calculated to be 0.33924, which is slightly lower than the D-value of the GET_CONFIG_DATA operation, which is 0.40061. Considering possible deviations in response times because of latency, the system can be assumed to have returned into a steady-state after injection.


Listing 8.5 Kolmogorov-Smirnov test results related to experiment no. 1 with respect to the recovery phase of Figure 8.3

Two-sample Kolmogorov-Smirnov test
data: steady_state_distribution_get_id_experiment_one and recovery_distribution_get_id_experiment_one
D = 0.33924, p-value < 2.2e-16
alternative hypothesis: two-sided

Listing 8.6 Kolmogorov-Smirnov test results related to experiment no. 1 with respect to the recovery phase of Figure 8.4

Two-sample Kolmogorov-Smirnov test
data: steady_state_distribution_get_data_experiment_one and recovery_distribution_get_data_experiment_one
D = 0.40061, p-value < 2.2e-16
alternative hypothesis: two-sided

In the case of the recovery phases, the p-values are not discussed since the steady-state hypothesis only concerns the comparison between the steady-state and injection phase.
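The ECDF plots in Figure 8.5 and Figure 8.6 can be reproduced from the raw response times with a few lines of Python; the thesis used R’s ecdf() function, and the file name and column index below are assumptions:

    import numpy as np
    import matplotlib.pyplot as plt

    def ecdf(samples):
        # Returns the sorted response times x and the empirical Fn(x)
        x = np.sort(samples)
        fn = np.arange(1, len(x) + 1) / len(x)
        return x, fn

    # Assumed: a headerless CSV with the response time in the second column
    response_times = np.loadtxt("experiment_one_get_id.csv", delimiter=",", usecols=1)
    x, fn = ecdf(response_times)

    plt.step(x, fn, where="post")
    plt.xlabel("Response Times")
    plt.ylabel("Fn(Response Times)")
    plt.title("ECDF of GET_CONFIG_ID operation after experiment no. 1")
    plt.show()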

8.3.2. Experiment No. 2: Terminating two configuration microservices with the same identifier by terminating their Kubernetes pods

Design

This experiment has been extracted from Figure 6.2. In the previous experiment only one pod was terminated. This time the blast radius is increased by targeting both pods of the same configuration microservice. The steady-state hypothesis is stated as in the first experiment. However, it is expected that the response times increase significantly more than in the first experiment since no configuration microservice will be available. Hence, before the experiment is executed, the response of the registry comprises a list with a single identifier. After the experiment has been started, it is expected that the data will not be received, followed by an increase in response times. The experiment information can be seen in Table 8.5.
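Compared to Listing 8.2, only the quantity argument of the terminate_pods action changes so that both pods matching the label are terminated. A minimal sketch of the corresponding call, reusing the argument names from Listing 8.2 (calling the Chaostoolkit action directly from Python is an assumption for illustration):

    from chaosk8s.pod.actions import terminate_pods

    # Terminate both replicas of the configuration microservice with ID = 1
    terminate_pods(label_selector="app=vcc-service-1", ns="default", mode="fixed", qty=2)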


Target Service     Two Configuration Microservices with ID = 1
Experiment Type    Terminate Kubernetes pods
Hypothesis         The application should be able to respond with the requested data and the response times should increase significantly because of missing services
Blast Radius       Two Configuration Microservices with ID = 1
Status Before      List_of_ID = [ID = 1]
Status After       List_of_DATA = [] and List_of_ID = [ID = 1]
Start Time         2019-08-04 12:54:33
End Time           2019-08-04 12:54:39
Findings           The experiment failed because the steady-state hypothesis has not been met after injection. The response times increased significantly.

Table 8.5.: Experiment details of experiment no. 2

Results

The measured response times from experiment no. 2 can be seen in Figure 8.7 and Figure 8.8, again referencing the two operations GET_CONFIG_ID and GET_CONFIG_DATA. Note that the y-axis has been limited to allow a better visual analysis. The limitation of the y-axis results in missing outliers that are not critical to the analysis. Inspecting both diagrams shows that there is a significant increase in response times after the experiment has been started. Since both configuration microservices are not available but users keep requesting the services, the response times increase to a large extent. However, it is also shown that after a given amount of time the response times decrease and the system tries to recover from the injection by slowly getting back into a steady-state. Furthermore, it is noticeable that the medians in the recovery phase of both diagrams are slightly shifted compared to the medians of the steady-state phase. Comparing the steady-state distribution and the injection phase distribution of the GET_CONFIG_ID operation with the Kolmogorov-Smirnov test provides the results in Listing 8.7. The p-value is equal to the one in Listing 8.3, whereas the D-value is higher, with a value of 0.38137. The difference in shapes is considerably higher than in the previous test, which can also be verified by inspecting the corresponding intervals in both diagrams, Figure 8.3 and Figure 8.7.


Figure 8.7.: Response times of GET_CONFIG_ID after second experiment (boxplots of response times in milliseconds per minute, execution time 12:39–13:04, annotated with steady-state, injection, and recovery phases)

Phase           Mean    Median   Standard Deviation   25% Quantile   75% Quantile   95% Quantile
Steady-State    221.6   85.0     228.6701             46.0           401.0          686
Injection       164.3   77.0     186.4005             55.0           174.0          588
Recovery        87.76   71.00    56.5072              57.00          97.00          189

Table 8.6.: Standard statistics with respect to the observed phases of the GET_CONFIG_ID operation, illustrated in Figure 8.7


Figure 8.8.: Response times of GET_CONFIG_DATA after second experiment (boxplots of response times in milliseconds per minute, execution time 12:39–13:04, annotated with steady-state, injection, and recovery phases)

Phase           Mean    Median   Standard Deviation   25% Quantile   75% Quantile   95% Quantile
Steady-State    5.047   3.000    8.731445             2.000          5.000          15
Injection       165.6   16.0     735.6111             5.0            74.0           585
Recovery        14.7    6.0      24.03997             3.0            13.0           66

Table 8.7.: Standard statistics with respect to the observed phases of the GET_CONFIG_DATA operation, illustrated in Figure 8.8

Figure 8.9.: Empirical Cumulative Distribution Function (ECDF) of GET_CONFIG_ID operation after experiment no. 2 (Fn(Response Times) plotted over response times of 0–4000 ms)

Figure 8.10.: Empirical Cumulative Distribution Function (ECDF) of GET_CONFIG_DATA operation after experiment no. 2 (Fn(Response Times) plotted over response times of 0–10000 ms)


Listing 8.7 Kolmogorov-Smirnov test results related to experiment no. 2 with respect to steady-state and injection phase of Figure 8.7

Two-sample Kolmogorov-Smirnov test
data: steady_state_distribution_get_id_experiment_two and injection_distribution_get_id_experiment_two
D = 0.38137, p-value < 2.2e-16
alternative hypothesis: two-sided

Listing 8.8 Kolmogorov-Smirnov test results related to experiment no. 2 with respect to steady-state and injection phase of Figure 8.8

Two-sample Kolmogorov-Smirnov test
data: steady_state_distribution_get_data_experiment_two and injection_distribution_get_data_experiment_two
D = 0.50391, p-value < 2.2e-16
alternative hypothesis: two-sided

However, despite the higher D-value, since it is closer to 0 it is still considered not significant. Though, it is obvious that the second experiment must have had a larger impact on the system than the first experiment. Again, Figure 8.9 and Figure 8.10 show the respective ECDFs of both operations. As given in the boxplot diagrams, the response times are considerably higher than in the first experiment, which can also be observed in the ECDF plots. However, both plots show that approximately 95% to 100% of the response times are lower than 100 milliseconds. Compared to the ECDF of the first experiment with respect to the GET_CONFIG_ID operation, a slight increase in response times is noticeable. In contrast, for the second operation, i.e., GET_CONFIG_DATA, the increase in response times can only vaguely be observed in the ECDF plots, but it becomes more evident in the statistical test results. The D-value of the GET_CONFIG_DATA operation, as seen in Listing 8.8, is much higher than the D-value of the first experiment. With a value of 0.50391 the D-value gets closer to 1, reflecting the expected increase in response times due to the lack of configuration microservices. The p-values imply rejecting the null-hypothesis in favor of the alternative hypothesis since the p-values are all below 2.2e-16 and thus lower than the significance level α. Inspecting the recovery phase of the system shows that the medians of Figure 8.8 are again slightly shifted. However, the medians of Figure 8.7 are not as far shifted as in the first experiment.


Listing 8.9 Kolmogorov-Smirnov test results related to experiment no. 2 with respect to the recovery phase of Figure 8.7

Two-sample Kolmogorov-Smirnov test
data: steady_state_distribution_get_id_experiment_two and recovery_distribution_get_id_experiment_two
D = 0.16619, p-value < 2.2e-16
alternative hypothesis: two-sided

Listing 8.10 Kolmogorov-Smirnov test results related to experiment no. 2 with respect to the recovery phase of Figure 8.8

Two-sample Kolmogorov-Smirnov test
data: steady_state_distribution_get_data_experiment_two and recovery_distribution_get_data_experiment_two
D = 0.33378, p-value < 2.2e-16
alternative hypothesis: two-sided

Comparing the distributions with the Kolmogorov-Smirnov test yields the results in Listing 8.9 and Listing 8.10. As expected, the D-value of Figure 8.7, at 0.16619, is about half the D-value of Figure 8.8, which is 0.33378, indicating that the system was not able to maintain a steady-state with respect to the GET_CONFIG_DATA operation. This may result from the time Kubernetes needs to create new pods with the corresponding microservices. A longer measurement might have shown that the response times in Figure 8.8 decrease further, without a shift in the median.

8.3.3. Experiment No. 3: Terminating one configuration microservice with ID = 1 while requesting configuration microservice with ID = 2

Design

This experiment is done to illustrate a further increase in the blast radius by adding an additional configuration object to the database. Thus, two distinct configuration microservices with two distinct identifiers will be initialized. This results in four pods being started by Kubernetes. Table 8.8 shows the information of this experiment. The new configuration object is created in the same way as the first one, by using the administrator API.


Target Service     One Configuration Microservice with ID = 1
Experiment Type    Terminate Kubernetes pods
Hypothesis         The application should be able to respond with all identifiers of all microservices and the response times should increase because of missing services
Blast Radius       One Configuration Microservice with ID = 1
Status Before      List_of_ID = [ID = 1, ID = 2]
Status After       List_of_DATA = [ID = 1, ID = 2] and List_of_ID = [ID = 1, ID = 2]
Start Time         2019-08-04 15:13:40
End Time           2019-08-04 15:13:46
Findings           The experiment succeeded. However, the response times of the registry increased and did not stabilize. Additionally, the database pod went into an evicted state, which will inevitably lead to the inability to create, delete, or request data from the database.

Table 8.8.: Experiment details of experiment no. 3

The identifier is set to two, whereas the content is created randomly. After the pods are running and ready, the experiment can be performed. Before the experiment is executed, the system correctly responds with a list containing two distinct identifiers, i.e., 1 and 2. In addition, the configuration microservices for the corresponding identifiers return the appropriate configuration data bodies.
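The Status Before of Table 8.8 can be checked with a short probe against the registry endpoint already used in Listing 8.1; TLS verification is disabled as in that listing, and the exact response layout (dataPools as a JSON array) is an assumption:

    import requests

    resp = requests.post(
        "https://104.46.14.61/vccrest/registry/getDataPools",
        json={},
        headers={"Content-Type": "application/json"},
        verify=False,  # corresponds to "verify_tls": false in Listing 8.1
    )
    resp.raise_for_status()

    # Analogous to the jsonpath $.dataPools.*[?(@.dataVersion)].dataVersion
    ids = sorted(pool["dataVersion"] for pool in resp.json()["dataPools"])
    assert ids == [1, 2], f"unexpected identifiers before the experiment: {ids}"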

8.3.4. Results

The response times of the JMeter run with respect to the third experiment are illustrated in Figure 8.11 and Figure 8.12. Inspecting the response times of both operations shows that the response times of the registry spiked significantly after the experiment has been started, whereas the response times of the configuration microservices did increase slightly but not to a significant extent. This is unexpected behavior since it was expected that the response times of the configuration microservices would increase to a larger extent. However, the diagrams do not reflect the behavior of the database service.


Figure 8.11.: Response times of GET_CONFIG_ID after third experiment (boxplots of response times in milliseconds per minute, execution time 14:58–15:23, annotated with steady-state, injection, and recovery phases)

Phase           Mean    Median   Standard Deviation   25% Quantile   75% Quantile   95% Quantile
Steady-State    96.13   62.00    91.26923             40.00          113.00         288
Injection       287.1   106.0    344.678              49.0           452.0          974
Recovery        289.2   207.0    283.2689             61.0           422.0          899

Table 8.9.: Standard statistics with respect to the observed phases of the GET_CONFIG_ID operation, illustrated in Figure 8.11


Figure 8.12.: Response times of GET_CONFIG_DATA after third experiment (boxplots of response times in milliseconds per minute, execution time 14:58–15:23, annotated with steady-state, injection, and recovery phases)

Phase           Mean    Median   Standard Deviation   25% Quantile   75% Quantile   95% Quantile
Steady-State    9.85    3.00     19.28709             2.00           7.00           52
Injection       14.96   3.00     81.43952             2.00           8.00           52
Recovery        12.97   4.00     24.77287             2.00           11.00          61

Table 8.10.: Standard statistics with respect to the observed phases of the GET_CONFIG_DATA operation, illustrated in Figure 8.12

Figure 8.13.: Empirical Cumulative Distribution Function (ECDF) of GET_CONFIG_ID operation after experiment no. 3 (Fn(Response Times) plotted over response times of 0–3000 ms)

Figure 8.14.: Empirical Cumulative Distribution Function (ECDF) of GET_CONFIG_DATA operation after experiment no. 3 (Fn(Response Times) plotted over response times of 0–2500 ms)


Before the experiment was executed, the Kubernetes pod that hosts the database service went into an evicted state, which means that the Kubernetes node ran out of memory and terminated the pod that exceeded its memory threshold. The effects are at first not visible to the user and might not become visible at all. Nonetheless, if an administrator had tried to create, delete, or request any data from the database, the request would have returned an error since the database is not available anymore. Furthermore, evicted pods are not removed from the cluster automatically. Evicted pods have to be removed manually by using the Kubernetes API, in addition to manually creating a new pod hosting the database. The ECDFs of the third experiment are quite similar to the ECDFs of the second experiment. Therefore, a further analysis does not provide much information. Though, as observable in the boxplots, the ECDF of the first operation, i.e., Figure 8.13, shows a slightly shifted curve after the 1000 millisecond mark on the x-axis, indicating the loss of the database. In contrast, the ECDF in Figure 8.14 does not show any interesting changes related to the loss of the database. Despite the unexpected behavior of the database, Listing 8.11 shows the calculations of the Kolmogorov-Smirnov test with respect to the GET_CONFIG_ID operation. The D-value is calculated to be 0.28815, which is close to 0 and thus, as in previous tests, not considered to be significant. On the other side, the D-value of the corresponding GET_CONFIG_DATA operation, seen in Listing 8.12, is calculated to be 0.041004, which can be considered to be practically 0 and hence not significant. Moreover, it is evident that the system is not able to recover properly, as seen in Figure 8.11 during the last minutes. Comparing the steady-state phase with the recovery phase makes the difference in the distributions more evident, as seen in Listing 8.13 and Listing 8.14. The D-value of 0.094853 shows that the shapes of the distributions of the GET_CONFIG_DATA operation with respect to the steady-state and recovery phase are almost equal. In contrast, with a D-value of 0.39223, the shapes of the distributions of the GET_CONFIG_ID operation are not equal, as seen in the diagram. This means that the chaos experiment had almost no effect on the configuration microservices but rather on the registry service, which led to the system not being able to recover to its steady-state. The p-values have not changed compared to previous experiments. Thus, still being lower than the significance level α, the null-hypothesis is again rejected in favor of the alternative hypothesis. However, the experiment revealed an additional weakness, where the registry was affected by the experiment despite not being targeted directly. Additionally, it was revealed that the database pod was terminated due to excessive memory usage.
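Since Kubernetes does not garbage-collect evicted pods, the manual cleanup described above can be scripted with the Kubernetes Python client; the namespace and kubeconfig access are assumptions:

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Evicted pods remain in phase "Failed" with reason "Evicted" until deleted manually
    for pod in v1.list_namespaced_pod("default").items:
        if pod.status.phase == "Failed" and pod.status.reason == "Evicted":
            print(f"removing evicted pod {pod.metadata.name}")
            v1.delete_namespaced_pod(pod.metadata.name, "default")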


Listing 8.11 Kolmogorov-Smirnov test results related to experiment no. 3 with respect to the steady-state and injection phase of Figure 8.11

Two-sample Kolmogorov-Smirnov test
data: steady_state_distribution_get_id_experiment_three and injection_distribution_get_id_experiment_three
D = 0.28815, p-value < 2.2e-16
alternative hypothesis: two-sided

Listing 8.12 Kolmogorov-Smirnov test results related to experiment no. 3 with respect to the steady-state and injection phase of Figure 8.12

Two-sample Kolmogorov-Smirnov test
data: steady_state_distribution_get_data_experiment_three and injection_distribution_get_data_experiment_three
D = 0.041004, p-value < 2.2e-16
alternative hypothesis: two-sided

Listing 8.13 Kolmogorov-Smirnov test results related to experiment no. 3 with respect to the recovery phase of Figure 8.11

Two-sample Kolmogorov-Smirnov test
data: steady_state_distribution_get_id_experiment_three and recovery_distribution_get_id_experiment_three
D = 0.39223, p-value < 2.2e-16
alternative hypothesis: two-sided

Listing 8.14 Kolmogorov-Smirnov test results related to experiment no. 3 with respect to the recovery phase of Figure 8.12

Two-sample Kolmogorov-Smirnov test
data: steady_state_distribution_get_data_experiment_three and recovery_distribution_get_data_experiment_three
D = 0.094853, p-value < 2.2e-16
alternative hypothesis: two-sided


Target Service     Registry and Ingress
Experiment Type    Inject Latency
Hypothesis         The application should be able to respond with the requested data, but not within the expected response time interval
Blast Radius       One Configuration Microservice with ID = 1
Status Before      N/A
Status After       N/A
Start Time         N/A
End Time           N/A
Findings           N/A

Table 8.11.: Experiment details of experiment no. 4

8.3.5. Experiment No. 4: Inject Latency into the Registry Component

Design

This experiment has been derived from Table 6.5, row seven. Due to technical limitations, this experiment has not been executed but has still been designed. The idea was to use Chaostoolkit’s Chaos Monkey for Spring Boot integration [Wil18] to inject latency between two service calls. Table 8.11 shows the details of this experiment. Since the experiment has not been executed, some information is missing. Ideally, Status Before contains a response time interval that is still valid after the experiment has been executed. Based on the previous experiments, the Status Before could be set to [0, 200] milliseconds for retrieving configuration identifiers and [0, 100] milliseconds for retrieving the configuration data. The intervals are then also implemented in the steady-state hypothesis of the Chaostoolkit experiment. Chaos Monkey for Spring Boot requires that Spring’s Actuator endpoint is enabled, as is the case in the prototype system. Unfortunately, the Chaostoolkit integration of the Chaos Monkey for Spring Boot is not able to handle communication over HTTPS, which prevents an actual execution in this work. In order for the Chaos Monkey to work, it would have been required to significantly extend the Chaostoolkit API, which is not possible in the scope of this work. Listing 8.15 shows the method declaration of the experiment implemented in Chaostoolkit. As seen in the method declaration, the Actuator endpoint is used to enable the Chaos Monkey. Furthermore, the assault configuration is passed as arguments, where it can be specified what type of assault is going to be used and how it is configured.


Listing 8.15 Potential method declaration of experiment no. 4

"method": [
  {
    "type": "action",
    "name": "enable_chaosmonkey",
    "provider": {
      "verify_tls": false,
      "arguments": {
        "base_url": "https://52.169.78.204/actuator"
      },
      "func": "enable_chaosmonkey",
      "module": "chaosspring.actions",
      "type": "python"
    }
  },
  {
    "name": "configure_assaults",
    "provider": {
      "arguments": {
        "base_url": "https://52.169.78.204/vccrest/registry/getDataPools",
        "assaults_configuration": {
          "level": 5,
          "latencyRangeStart": 2000,
          "latencyRangeEnd": 5000,
          "latencyActive": true,
          "exceptionsActive": true,
          "killApplicationActive": false
        }
      },
      "verify_tls": false,
      "func": "change_assaults_configuration",
      "module": "chaosspring.actions",
      "type": "python"
    },
    "type": "action"
  }
]

Nonetheless, assuming the experiment had been executed on the prototype system, the further procedure would have been the same. The response times would have been analyzed with respect to the different phases, i.e., the steady-state, injection, and recovery phase. The distributions could then be further analyzed by the Kolmogorov-Smirnov test to determine whether the experiment had a significant impact on the steady-state of the system.

8.4. Discussion

In total, four experiments were designed, of which only three could be executed due to technical feasibility. The experiments were derived from the results of Chapter 6. To verify a potential impact on the steady-state or the system’s ability to recover, the results of the experiments were analyzed with the Kolmogorov-Smirnov test. The hypothesis tests produced a D-value and a p-value as output.

The D-values describe the difference in the shape between the two distributions that have been compared, i.e., the steady-state distribution and the injection phase distribution as well as the steady-state distribution and the recovery phase distribution. The distributions that have been compared are highlighted in Figure 8.1. For the Kolmogorov-Smirnov test a significance level of α = 0.05 was assumed.

8.4.1. Summary of Experiment No. 1

The first experiment was successfully executed. The null-hypothesis stated that the response times will not increase despite the termination of a Kubernetes pod. However, the diagrams showed a slight increase after the pod had been terminated. Furthermore, the system was able to quickly recover from the pod termination. The p- and D-values calculated by the Kolmogorov-Smirnov test allowed a more precise analysis of the data. The first comparison of distributions, i.e., the steady-state and the injection phase, led to a p-value below 2.2e-16 and a D-value of 0.15062 with respect to the registry service. Moreover, the comparison of the steady-state and injection phase distributions with respect to the configuration microservices led to a p-value below 2.2e-16 and a D-value of 0.33061. The D-values showed that the first experiment did not lead to a significant change in the shape of the distributions concerning the registry component. However, the distribution resulting from requests towards the configuration microservices, shown in Figure 8.4, changed more strongly, with a D-value twice as high as the D-value concerning the registry’s distribution. This has been expected since one microservice was unavailable for a specific period of time due to the chaos experiment injection. Considering the recovery phase of the system with a D-value of 0.40061, it is evident that Kubernetes was not able to completely recover from the injection within the last minutes of the experiment. The p-values were lower than the defined significance level α, which indicates that the null-hypothesis should be rejected in favor of the alternative hypothesis. In the scope of the chaos experiment the alternative hypothesis, i.e., the steady-state hypothesis, is met, whereas the p-value strengthens the confidence in the decision to reject the null-hypothesis.

8.4.2. Summary of Experiment No. 2

In the second experiment the blast radius has been slightly increased by aiming to terminate all available configuration microservices that are currently deployed as pods. The diagrams in Figure 8.7 and Figure 8.8 showed a much higher steady-state deviation than in the first experiment. This is an expected outcome of the experiment since all configuration microservices are unavailable until Kubernetes creates new pods.


As observed, Kubernetes needs approximately two to three minutes until the pods are created and running. This effect can be observed in the diagrams in the last four to five minutes, where the system tries to get back into a steady-state. The steady-state deviations concerning the registry and the configuration microservices become more evident when analyzing the D-values. The D-value concerning the registry is 0.38137, indicating a higher impact on the steady-state than in the previous experiment since the shapes of the distributions differ more strongly. The D-value concerning the configuration microservices is 0.50391, which indicates a significant impact on the steady-state and is so far the highest calculated difference in the distributions’ shapes. The D-values with respect to the recovery of the system show that Kubernetes was not able to fully recover the steady-state. However, it is evident that the D-value of the registry with respect to the recovery phase is comparatively close to zero, which indicates a small difference in the shape of the distributions. The D-value with respect to the configuration microservices is still high considering the D-value during the experiment injection. The p-values of the second experiment are equal to those of the first experiment, leading to the decision that the null-hypothesis should be rejected in favor of the alternative hypothesis. Therefore, this experiment also resulted in the steady-state hypothesis being met.

8.4.3. Summary of Experiment No. 3

The third experiment increased the blast radius significantly more than the previous experiments. An additional random configuration object was created by requesting the administrator API. Thus, two additional pods containing one configuration microservice each were created by Kubernetes. In total, four Kubernetes pods hosting configuration microservices were running inside the Kubernetes cluster. The experiment was expected to terminate two configuration microservices with the same label but still return at least the configuration data of the additional configuration object, despite an increase in response times. The distributions concerning the registry showed an interesting deviation in the steady-state as well as in the recovery phase, as seen in Figure 8.11. However, interestingly, the deviations of the distributions concerning the configuration microservices, as seen in Figure 8.12, were not as significant as the deviations in Figure 8.11. The findings are suspicious based on the expectation that the response times concerning the configuration microservices should have increased to a greater extent than the response times concerning the registry. Inspecting the statistics calculated by the Kolmogorov-Smirnov test emphasizes the findings visible in the diagrams. The D-value with respect to the registry has been calculated to be 0.28815, which seems a little too small considering the large shift of the medians. The D-value of 0.041004 concerning the configuration microservices reflects the observations made by inspecting Figure 8.12.


Furthermore, the system has not been able to get back into a steady-state, considering the D-value of 0.39223 concerning the registry. Additionally, it has been observed that the cluster node terminated the pod running the database service. During the experiment the database pod transitioned into an evicted state, which means that the pod exceeded its assigned memory threshold. This might at first not be visible to the user but becomes crucial to operating the system from an administrator’s point of view. Since the database is unavailable, configuration objects cannot be created, deleted, or updated. Furthermore, Kubernetes does not remove evicted pods automatically, leading to the necessity of manually operating on the Kubernetes cluster state. The p-values of this experiment were equal to those of the previous experiments. Hence, the null-hypothesis is rejected in favor of the alternative hypothesis.

8.4.4. Summary of Experiment No. 4

The fourth experiment could not be executed due to lacking technical feasibility. However, the Chaos Monkey for Spring Boot could have been used to a large extent by implementing experiments based on the results of the CHAZOP study in Section 6.3. It has been shown that the experiment was designed analogously to the previous experiments with respect to the description layout and the possible measurements and hypothesis tests. Listing 8.15 showed a possible implementation in Chaostoolkit where the Chaos Monkey is enabled and configured. Though, it has to be considered that the Chaos Monkey for Spring Boot is restricted to Spring Boot based applications. Furthermore, it requires Spring’s Actuator endpoint to be enabled in the application.

8.4.5. Summary of Findings

To summarize, the experiments were executed as designed in this chapter. Furthermore, the results have been investigated visually by inspecting the resulting boxplot diagrams as well as statistically by applying the Kolmogorov-Smirnov test based on the different distributions highlighted in Figure 8.1. The results were as expected, with the exception of the third experiment revealing major weaknesses of the system. The impact this weakness would have had on the fourth experiment is unknown due to missing information. However, it is expected that the revealed weakness would have had a serious impact on the fourth and possible future experiments. Based on the principles of chaos engineering it is recommended to fix any revealed weakness before further experiments are executed. This mitigates the possibility that potential weaknesses occurring during further experiments result from previously detected weaknesses instead of resulting from the experiments themselves. Since it is not the objective of this work to perform any maintenance on the system, the execution of further experiments would have been prevented by the weakness revealed by the third experiment.


Nonetheless, as stated, the next step in the chaos engineering cycle would require the engineers to fix the issue and run the third experiment again in order to verify that the issue has indeed been fixed. This finally leads to the necessity of automating chaos experiments, which allows continuous verification of eliminated weaknesses.

8.5. Threats to Validity

In the following, the threats to validity are discussed with respect to the conducted case study.

8.5.1. Threats to Interviews

There are several threats to the interview process. All information about the prototype system has been collected from interviews which have been conducted at the beginning of this case study. A potential threat is the selection of a wrong interview type. This has been addressed by a thorough analysis of different types of interviews. A major concern of interviewing participants is the group-think effect, which often occurs during focus groups or group interviews. This issue has been resolved by conducting one-on-one interviews. Another threat is stating interview questions not precisely enough, to such an extent that the participants might understand a question differently from the researcher’s intent. This might result in wrong interpretations of the participants’ responses. To mitigate this issue, the responses of the participants were repeated and summarized by the interviewer in order to give the participants the possibility to correct misunderstandings. Further, the participants were provided a transcript of the corresponding interviews, which has been reviewed and sent back to the interviewer. Though, there is still a small risk that the participants have missed correcting false interpretations made by the interviewer. Furthermore, as the diagrams and other drawings created from the interviews were further edited, it is possible that the artifacts were altered to a state where the information delivered by the artifacts no longer represents the intended information. This issue has been addressed by discussing the edited material with the interviewees to verify that the intended information has been preserved. Another major threat to interviews is lacking the questions needed to gather all necessary information with respect to the research questions. For example, the set of questions asked during the interviews might not have covered all research questions that were intended to be answered by the interview sessions. In order to mitigate this threat, the interviewer provided a list of questions with a mapping to the corresponding research questions.

This allowed a precise overview of the questions that have been asked in order to reduce the chance of leaving a research question unanswered. Interviewing developers about the investigated case increases the chance of leaving out information that is crucial to the system’s properties. An attempt was made to mitigate this threat by giving the participants the opportunity to add comments after every question topic. However, there is still a small risk that participants may have left out crucial information unintentionally. A better approach for future case studies would be to discuss the artifacts resulting from the interviews with the participants, e.g., architecture descriptions, behavior and structural diagrams. Besides the threats related to the interview questions, there is the risk that other resources existed for eliciting the required information. This issue has been addressed by asking each interview participant whether other resources exist and are open for evaluation. As every participant stated, there were no other resources available that could have yielded any important information.

8.5.2. Threats to Hazard Analysis

As most hazard analysis approaches are conducted qualitatively, there are several threats to the results of the hazard analysis. The FMEA and CHAZOP approaches are group activities, as stated in Section 2.2.1 and Section 2.2.3. The scope of this work and the time constraints of the developers did not allow a thorough group review of the results. This increases the chance of working with incomplete or inadequate data. However, this issue has been mitigated by reviewing most of the results of the FMEA, CHAZOP, and FTA with two developers during a one-hour session. Though, there is still a high risk of the results being incomplete or inadequate because too few reviews were conducted. Future applications of these techniques should be done in strong collaboration with the developers, including multiple review sessions, to increase the quality of the findings. Moreover, there are threats to a specific aspect of the FMEA as well. The severity levels of the FMEA were estimated qualitatively. Thus, there is a high chance that the severity levels are not accurate with respect to the real severity of the events. An attempt was made to mitigate this issue by providing a mapping of severity levels to the impact an event would have had on the users of the system. However, the mapping of severity levels and the estimation of individual events is based on the researcher’s perception, which might differ from the developers’ perception, thus leading to individual events having a more severe outcome from an operator’s or developer’s perspective. Future mitigation strategies may include thorough reviews with the developers and stakeholders of the system, as has been suggested.


8.5.3. Threats to the Experiment Environment

Threats to the experiment environment comprise threats to the generated workload and the platform choice. Generating large workloads with JMeter, e.g., a large number of concurrent users, might lead to instability of the JMeter instance and thus to significant performance issues. As JMeter gathers a lot of information that requires a lot of memory, allocating too little memory to the JMeter instance might also lead to significant performance losses. In order to avoid such performance losses, JMeter was configured to store only the most necessary information. This also means that any kind of listener was removed from the test plan in order to avoid additional memory usage. Another threat to the generated workload is the lack of adequate metrics with respect to the expected workload and performance. The interviewees provided rough estimates of the expected workload, which have been used as a threshold for this case study’s chaos experiments. However, since the metric was a rough estimate, there is still a chance of it being inaccurate with respect to the real production environment, thus leading to non-representative response times. This threat could not be mitigated in the scope of this work since it would have required more information from the stakeholders, who were not accessible during this work. The observations measured by JMeter suggest that the duration of the JMeter runs might not have been long enough to capture the steady-state of the application. This has been mitigated by performing multiple JMeter runs with different durations without performing any failure injection. Inspecting the results of these runs increased the confidence in the provided time intervals where the steady-state is expected to occur. Additionally, there exist further threats to the platform choice. As each platform behaves differently, the measured response times in the context of the Azure Cloud might not be significant if the investigated system is later deployed on a different platform. However, it was not part of this work to evaluate different platform providers; therefore, this issue has been neglected. Though, it has to be considered that on a different platform, e.g., Amazon Web Services (AWS), Kubernetes might have been able to recover faster from a chaos experiment than on AKS. Therefore, a variance in measured response times is possible.

8.5.4. Threats to Chaos Engineering Experiments

Several threats to the chaos engineering experiments have to be considered. Lacking further metrics also constitutes a threat to the conclusions drawn from the chaos experiments. Missing proper resilience and performance metrics prevents accurate conclusions about whether the results lie within a given tolerance.

As no tolerances have been provided, it is not possible to state whether any deviation from the steady-state is considered acceptable or not. This issue could only have been reduced by well-defined performance and resilience metrics provided by the stakeholders. Therefore, the results of the hypothesis tests and experiments could merely be considered to have an impact on the system or not. Any further assessment of the results is not possible. As the scope of this work and the inaccessibility of the stakeholders prevented a detailed requirements elicitation, this issue has to be left as it is. With respect to the chaos experiment execution, the results are threatened by inaccurate monitoring. A major concern in conducting chaos engineering experiments is to provide accurate monitoring in order to extrapolate from the injected failure to the experiment findings. Inadequate monitoring might prevent precise conclusions or even lead to wrong conclusions based on the measured effects. This issue was addressed by trying to run every experiment at the exact same time. Observing the measurements of different experiments at the same time interval suggests that the experiments hold true to the expected outcome. However, for future approaches to chaos experiments it is suggested to use precise monitoring tools in combination with performance and resilience metrics to allow a better analysis and interpretation of the results. Moreover, there are threats not directly linked to the chaos engineering experiments but to the investigated system. As the chaos experiments revealed, the Kubernetes cluster has never been tested under the conditions given in the experiments, i.e., experiencing the workload generated by JMeter. The eviction of the database pod occurred after creating more than one configuration data object in the database. Hence, the state of the system prevented the creation of a representative environment with more than two configuration data objects. Unfortunately, it was not possible to solve this issue in the scope of this work, nor was it considered part of this work to resolve the findings from the chaos experiments. However, this issue should be addressed immediately before any future chaos experiments are executed. If this issue is not solved first, it is possible that future experiments will be influenced negatively to an extent where it is not possible to draw conclusions from the experiment.

8.5.5. Threats to Hypothesis Tests

Besides the threats to the chaos engineering experiments, there exist threats to the hypothesis tests as well. A major threat is the risk of applying the wrong hypothesis test. This issue has been mitigated by following the decision matrix provided by Field et al. [FT00, p. 274,275]. The decision matrix resulted in applying the Wilcoxon test. However, since the shape of the distributions was considered as well, it has been decided to use the Kolmogorov-Smirnov test, which has been stated to be a reasonable alternative to the Wilcoxon test.


Chapter 9

Conclusion

In the following, we will summarize the content and findings of this work; in addition, the findings will be discussed. At the end, an outlook is given with respect to extending this work.

9.1. Summary

In this thesis we provided means to conduct reasonable chaos engineering experiments in the context of an industrial case study. The investigated system is a prototype and shall substitute a legacy system that is insufficient in meeting crucial performance and resilience requirements. The information about the investigated system has been gathered by conducting semi-structured interviews with the developers of the system. This information was used to provide a detailed description of the structure and behavior of the system. Existing risks and potential hazards present in the prototype system have been identified by applying hazard analysis techniques, in particular Failure Modes and Effects Analysis, Fault Tree Analysis, and Computational Hazard and Operations Study. To create an environment similar to production, the prototype system was deployed onto the Azure Cloud by using the managed Azure Kubernetes Service. Environmental conditions were simulated by generating specific workloads with Apache JMeter. The results of the hazard analysis were used as a basis to design and implement chaos engineering experiments in Chaostoolkit, which were then executed. Furthermore, the results of the chaos engineering experiments were analyzed by applying the Kolmogorov-Smirnov hypothesis test in order to observe a potential impact on the system’s steady-state.


In addition to the overall approach to conducting chaos engineering, major contributions were made by introducing hazard analysis techniques, known from other software engineering domains, to chaos engineering, as well as a hypothesis-testing-based evaluation of chaos engineering experiment findings. Thorough research (Chapter 3) has shown that current chaos engineering approaches require high sensitivity towards the occurrence of failures in a production environment, hence forcing a change in the mindset of engineers regarding how systems should be built. Fault Tolerance and Fault Avoidance techniques were shown to lack the capacity to cope with changes in a running system, as they require prior knowledge of the types of failures that may occur. Chaos engineering suggests shifting the perspective of engineers from avoiding failures to actively confronting them. However, considering that fault injection techniques have existed long enough to be well established, a major issue in applying chaos engineering lies in identifying potential chaos engineering experiments.

9.2. Discussion

The current state of the art suggests randomly injecting failures into a system that runs in a production environment. However, recent developments in the field of chaos engineering have shown a shift from applying chaos engineering in production environments to pre-production environments, such as at design level and in testing environments. Moreover, the randomization aspect is decreasing and is being substituted by a more systematic approach targeting individual services and components, such as provided by the Gremlin platform. Applying chaos engineering in any pre-production environment requires means to identify potential failures that can be injected during chaos engineering experiments. The proposed hazard analysis techniques are applicable on the architecture level by investigating individual components and their connections. Nonetheless, the proposed means to identify potential hazards require time and effort in order to establish reasonable results. FMEA and CHAZOPs require continuous group reviews as well as a sensitivity for potential risks within the application’s architecture. Further analysis with an FTA or other top-down approaches allows root causes to be identified that were not considered during the design of the system. In this work, such root causes included the termination of managed Kubernetes objects such as the ingress or registry service. As such services are not operated by engineers but rather by Kubernetes itself, these components and services are often neglected during analysis. Considering managed services for chaos engineering experiments opens up a large range of potential chaos engineering experiments that can further increase the resilience of a system. Furthermore, it is possible to specifically assess Kubernetes’ self-healing mechanism by targeting managed Kubernetes services.
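How such an assessment could look in practice is sketched below with the Kubernetes Python client: one pod of a managed component is deleted and the script waits until the controlling ReplicaSet has brought a replacement into the Running phase. The label selector "app=registry", the namespace, and the two-minute deadline are assumptions for illustration and are not taken from the investigated cluster.

# Sketch: probe Kubernetes self-healing by deleting a pod and waiting for its replacement.
# Assumes a reachable cluster via kubeconfig; label selector and namespace are hypothetical.
import time
from kubernetes import client, config

config.load_kube_config()        # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "default"
SELECTOR = "app=registry"        # hypothetical label of the managed component

pods = v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items
victim = pods[0].metadata.name
v1.delete_namespaced_pod(victim, NAMESPACE)
print(f"Deleted pod {victim}, waiting for the ReplicaSet to recreate it ...")

deadline = time.time() + 120     # give self-healing two minutes (illustrative)
while time.time() < deadline:
    running = [
        p.metadata.name
        for p in v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items
        if p.status.phase == "Running" and p.metadata.name != victim
    ]
    if running:
        print(f"Self-healing observed: replacement pod(s) {running} are Running.")
        break
    time.sleep(5)
else:
    print("No replacement became Running within the deadline; manual operation may be required.")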


Thus, operators and engineers are able to predict to what extent Kubernetes is able to maintain its state and at which point a manual operation becomes necessary. In the scope of this thesis, the proposed hazard analysis techniques, with respect to FMEA and CHAZOP, yielded good results while being less time-consuming than they would be for larger systems. In the context of a microservice architecture, the number of microservices determines the effort required to apply the proposed means to identify risks and hazards. Within a monolithic application, the number of services may not justify the effort, but the number of interconnections increases the complexity of the CHAZOP and FTA significantly. Other means such as Event Tree Analysis and Failure Modes, Effects and Criticality Analysis require a probabilistic quantification of events. Therefore, these techniques were not feasible to apply in this work. However, they may become feasible if chaos engineering is taken to the production level, where more information can be gathered, such as probabilistic estimations of potential failures. Though, establishing such estimations requires proper monitoring and documentation over a long period of time. The FTA conducted in Chapter 6 had acceptable complexity with respect to the number of microservices and potential events. An FTA applied to larger distributed systems may become infeasible due to the growing complexity of component interactions and tree size. Restructuring trees is a means to reduce the complexity of a fault tree, but it always comes with a trade-off between complexity and readability. This leads to the question whether an FTA at such a scale is still feasible. Growing complexity and time effort have to be considered for FMEA and CHAZOP approaches as well. An increasing number of microservices leads to a significant increase in interconnections that have to be considered during a CHAZOP study. Furthermore, the FMEA will grow in size as well, since more components have to be investigated.

The results of the hazard analysis were used to implement chaos engineering experiments in Chaostoolkit. As the research has shown, there exists a large variety of distinct chaos engineering tools. However, current approaches have not revealed a state of the art with respect to a dedicated tool. Existing tools mainly differ in their capability of injecting chaos into specific application levels, such as the platform, application, infrastructure, or network level. Therefore, decisions about which tool to use are primarily driven by the experiment design. In this work, Chaostoolkit has been selected based on its capability to include multiple drivers for different platforms, thus increasing the number of potential experiments that can be executed with a single tool. Despite providing an open API, Chaostoolkit’s capabilities of injecting chaos on the Kubernetes platform level are restricted to a few useful ones. Though it has to be considered that the possibilities of Chaostoolkit in the scope of this thesis are restricted by the system environment. For example, Chaostoolkit provides a mechanism to drain a Kubernetes node with respect to its memory. Within a one-node cluster, as given in this work, draining a node may lead to significant damage to the state of the system if no proper fallbacks are declared. Nonetheless, Chaostoolkit provides a rich set of injections that are easy to integrate and easy to use. The focus of the experiments executed in Chapter 8 was narrowed down to terminating specific Kubernetes pods. Hence, the experiments reflected root causes identified by FMEA and FTA. The declaration in JSON or YAML results in transparent and well-structured chaos engineering experiments (a sketch of such a declaration is given below). Other chaos engineering practitioners develop their own tools that are specifically targeted at their infrastructure, e.g., Netflix’s Simian Army. Most of these tools are publicly available but lack the simplicity and transparency to be valuable at a beginner’s level.

As far as research goes, the means to determine whether a chaos engineering experiment had a significant impact on the system’s steady-state are not made transparent. It is known that Netflix uses a well-defined business metric, Stream Starts per Second, to identify impacts on the steady-state. The metrics used by other chaos engineering practitioners are yet unknown. In this work, I have applied basic hypothesis testing as it is known from research activities to statistically evaluate the results of chaos engineering experiments. Despite the difficulty of evaluating the results without well-defined resilience metrics, the selection of the proper hypothesis test is crucial to establish meaningful results. The choice of a proper hypothesis test is entirely dependent on the data set that is to be analyzed. Therefore, the test used in this work might not be applicable to other data sets. Another difficulty is the lack of proper monitoring. Monitoring is a key aspect of being successful with chaos engineering; lacking proper monitoring may influence the experiment results to a large extent. In the scope of this thesis it was not possible to set up a proper monitoring environment, although proper monitoring should already be in place from the beginning, not only from a chaos engineering perspective but also from a performance monitoring perspective.

In total, the goal of this case study has been achieved by successfully applying means to identify potential risks and hazards on the architectural level. The hazard analysis results imply that the prototype system is rather fragile, as shown by the multitude of single points of failure revealed. Further, a dedicated number of risks were implemented as chaos experiments and later injected on the platform level to observe deviations in the system’s steady-state. The observations were analyzed statistically by applying a non-parametric two-sided Kolmogorov-Smirnov test. The chaos engineering experiments revealed major weaknesses, most of which had been expected to occur.
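To illustrate the declarations referred to above, the sketch below expresses a pod-termination experiment in the Chaostoolkit experiment format as a Python dictionary, i.e., the in-memory equivalent of the JSON file passed to the chaos CLI. It is not one of the experiments from Chapter 8: the probe URL, namespace, and label selector are placeholders, and the action assumes that the chaostoolkit-kubernetes driver (module chaosk8s.pod.actions, function terminate_pods) is installed.

# Sketch of a Chaostoolkit experiment declaration (equivalent to its JSON form).
# URL, namespace, and label selector are placeholders; the action assumes the
# chaostoolkit-kubernetes driver (chaosk8s) is available.
import json

experiment = {
    "title": "Registry pod termination",
    "description": "Terminate one registry pod and verify the service still answers.",
    "steady-state-hypothesis": {
        "title": "Registry answers HTTP requests",
        "probes": [
            {
                "type": "probe",
                "name": "registry-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://example-cluster/datapool/all",  # placeholder endpoint
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "terminate-registry-pod",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",
                "func": "terminate_pods",
                "arguments": {"label_selector": "app=registry", "ns": "default", "rand": True},
            },
        }
    ],
    "rollbacks": [],
}

# Writing the declaration to disk makes it runnable with: chaos run experiment.json
with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)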


9.3. Future Work

In the following, I outline possible extensions to this work as well as potential research areas that require further work.

9.3.1. Visualization

Visualization is tightly coupled to the monitoring aspect of chaos engineering experiments. Currently, each chaos engineering experiment is designed as a black box. Steady-state deviations can be analyzed statistically, but it is still difficult to determine what happens within a system after an experiment has been started. This makes the analysis of chaos engineering experiments very difficult. Visualizing the interconnections within an architecture could disentangle the complexity of chaos engineering experiments by precisely visualizing what an experiment does to the system. This can be extended to the platform level, e.g., by leveraging the Kubernetes API to continuously observe the state of the cluster. As of now, no related work has been done in this field.
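A possible starting point for such continuous observation is sketched below using the watch API of the Kubernetes Python client, which streams pod lifecycle events as they occur; the namespace and timeout are illustrative assumptions, and each event could feed a live visualization of the cluster state during an experiment.

# Sketch: continuously observe pod events in a namespace via the Kubernetes watch API.
# Assumes cluster access through a local kubeconfig; the namespace is a placeholder.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

w = watch.Watch()
# Stream ADDED/MODIFIED/DELETED events for one minute and print a compact status line.
for event in w.stream(v1.list_namespaced_pod, namespace="default", timeout_seconds=60):
    pod = event["object"]
    print(f'{event["type"]:<9} {pod.metadata.name:<40} phase={pod.status.phase}')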

9.3.2. Evaluation of different Chaos Engineering Tools

As of now, there exists a large variety of potential chaos engineering tools. A brief summary of tools has been provided in Section 2.7. However, most of these tools are dedicated to injecting chaos into one specific level, e.g., Pumba is specifically designed to target Docker containers. As no case studies or other evaluations exist, it is not known whether other tools have potential equal to Chaostoolkit in being valuable both at the beginner’s and at the professional level. A thorough evaluation of different tools may serve as an extension to this work, by conducting an industrial case study with the proposed approach but with a different chaos engineering tool.

9.3.3. Exploration of Resilience Pattern Configurations

The prototype system investigated in this work did not include any resilience patterns such as Circuit Breaker or Bulkhead, although applying resilience patterns to increase the resilience of a software application is a common procedure. Nonetheless, deciding which pattern to use is mainly driven by the problem that is expected to be solved by the pattern. Once a pattern has been chosen, it has to be decided how the parameters of the resilience pattern are configured. Even for a single pattern this may result in a time-consuming task. Now consider the presence of multiple patterns within a single application, where each pattern has a set of parameters that can take on several complex configurations. The number of potential configurations for a set of resilience patterns quickly grows into millions of possible configurations. The automatic exploration of such resilience pattern configurations is an as yet unsolved problem. Chaos engineering provides the possibility to automatically explore such configurations on a large scale. Related work has already been proposed in different research areas, such as in the context of architecture optimization. As chaos engineering is still evolving, there exists no approach that leverages the techniques proposed in this work to solve this problem.
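To make this combinatorial growth tangible, the following sketch enumerates a purely hypothetical parameter grid for a Circuit Breaker and a Bulkhead and counts the resulting configurations; the parameter names and value ranges are illustrative assumptions and are not taken from the investigated system.

# Sketch: counting (and iterating over) a hypothetical resilience-pattern configuration space.
# All parameter names and value ranges are illustrative assumptions.
from itertools import product

circuit_breaker = {
    "failure_rate_threshold": [10, 25, 50, 75],          # percent
    "wait_duration_open_ms": [500, 1000, 5000, 10000],
    "sliding_window_size": [10, 50, 100],
    "half_open_calls": [1, 3, 5, 10],
}
bulkhead = {
    "max_concurrent_calls": [5, 10, 25, 50, 100],
    "max_wait_ms": [0, 100, 500, 1000],
}

parameters = {**circuit_breaker, **bulkhead}
names = list(parameters)
total = 1
for values in parameters.values():
    total *= len(values)
print(f"{total} configurations from only {len(names)} parameters")  # 4*4*3*4*5*4 = 3840

# Each combination could, in principle, be evaluated by one chaos experiment run.
for combo in product(*parameters.values()):
    configuration = dict(zip(names, combo))
    # run_experiment(configuration)  # placeholder for an automated exploration step
    break  # only illustrate the first configuration here
print(configuration)

Even this small grid yields thousands of configurations; adding further patterns or finer value ranges quickly pushes the count into the millions mentioned above.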

Appendix A

Interview Resources

In the following the supplementary material used in the interviews is provided. This chapter is structured as follows.

Section A.1 – Question Catalogue: Interview Question Catalogue.
Section A.2 – Consent Form: Interview Consent Form (unsigned).


A.1. Question Catalogue

Interview Schedule Researcher: Dominik Kesim 04/09/2019

Interview Schedule

I) Opening

A) (Establish Rapport) [Welcome] Welcome, and thank you for agreeing to take part in this interview. My name is ______ and I am currently writing my bachelor’s thesis under the supervision of Matthias Häussler.

B) (Purpose) I would like to ask some questions about the kbv project. Matthias has told me that you are currently involved in the project. My questions will cover different topics about the application such as its architecture, behavior, and your role within the project. This is especially helpful for my thesis, in order to get a better understanding of the application.

C) (Motivation) As I said, I hope to use this information to get a better understanding of the application’s behavior and architecture and in order to establish a better experiment design for the chaos engineering experiment.

D) (Time Line) This interview should take approximately 1 hour and 30 minutes.

(Transition: Let us start by asking some questions about yourself)

II) Body

A) (Topic) Demographic data

i) What’s your name and how old are you?
ii) What is your job title?
iii) How many years have you been working for this company?
iv) What kind of projects have you been working on so far (in general)?
v) Are you currently working on other projects than kbv?

(Transition: At this point I would like to transition to some project related questions)

B) (Topic) Project related questions

i) How long have you been involved in the project?
ii) What is your role within the project?
iii) What is the purpose/motivation for this project?
iv) What is the current state of the project?
v) How is the project developed (scrum (agile), waterfall …)?
vi) What have you planned as your next steps?
vii) Retrospectively, what would you do differently?

Figure A.1.: Question Catalogue page 1/3


Interview Schedule Researcher: Dominik Kesim 04/09/2019

(Transition: Thanks a lot for those insight! Now, I would like to talk a little bit more about the requirements of the application.)

C) (Topic) Requirements

i) Could you please briefly summarize what functional and qualitative requirements the “kbv” application has to meet?
ii) How did you elicit these requirements?
iii) Have there been any constraints, either given by the customer or which you concluded by yourself?
iv) Up to now, have you been able to implement/meet all requirements?
v) Is there any documented version of those requirements that is open for analysis?

(Transition: Thanks! After we have talked about the requirements you have to implement, let’s talk about the technologies that allows you to implement them.)

D) (Topic) Technology stack

i) What technology stack do you use?
ii) Have there been other potential technologies you considered and did not choose?
iii) How did you decide on which particular technology to use?
iv) Retrospectively, would you have done something differently? (e.g., use a different container virtualization or deployment platform)

(Transition: Let’s get to a more specific topic, namely the architecture of the application. I have separated the topic into two sub topics which consist of the structural and behavioral view on the application’s architecture. We will start with the structural view.)

E) (Topic) Structural View

i) What architectural style did you use?
ii) Why did you choose this particular style?
iii) What kind of data do you work with in your application?
iv) What does the data look like (type, structure, format)?
v) What are the main functional elements of your architecture?
vi) Can you please briefly summarize what each component’s responsibility is?
vii) How did you identify your microservices?
viii) Is there any relationship between components such as a generalization or specialization?
ix) Is any component dependent on other or external software components?
x) How are the components, or more precisely the application itself, deployed?
xi) Have you integrated any architectural patterns? If yes, which ones?
xii) How did you decide on which patterns to use?
xiii) What would you consider to be the benefits and drawbacks of your current architecture?
xiv) Retrospectively, would you do anything differently?

F) (Topic) Behavioral View

Figure A.2.: Question Catalogue page 2/3


Interview Schedule Researcher: Dominik Kesim 04/09/2019

i) Can you give me an explanation of how you expect the application to work? If possible, can you draw a use case diagram or sequence diagram?
ii) How do the components communicate/interact with each other?
iii) How do the components communicate with the outside world (e.g., external systems)?
iv) What type of communication do you have within the application (e.g., asynchronous)?
v) Do you have any metrics regarding the performance of your current implementation?
vi) Do you have any metrics regarding the load profile of your application?
vii) Is there any archival data open for analysis (requirements specification, architecture diagrams …)?

(Transition: Thanks a lot for your time, I appreciate it! I have gathered a lot of important information.)

III) Closing

A) (Feedback) Thanks for your time. Is there anything else you would like to leave as a comment or feedback?

B) (Next steps) I will use the recording and make a transcription of this interview. May I send you the transcribed interview so that you can provide corrections? Lastly, may I contact you if I have further questions?

Figure A.3.: Question Catalogue page 3/3


A.2. Consent Form

Consent for participation in research interview 4/09/19

Consent for participation in research interview

Assessing Resilience of Software Systems by Application of Chaos Engineering – A Case Study

Bachelor thesis title: Assessing Resilience of Software Systems by Application of Chaos Engineering

Name of researcher: Dominik Kesim Name of participants:

The purpose of this interview is to collect important information about the investigated software application, referred to as a caching service for car configuration data, which is part of my bachelor thesis.

The interview will take approximately 120 minutes. The questions are written and asked in German, hence, answers to these questions should be stated in German as well. If you feel uncomfortable or do not agree with the terms below, you have the right to withdraw or drop out of the interview session at any time. If you have any questions about the interview or need more information about the process, you have the possibility to ask questions before the interview begins.

Thank you for agreeing to be interviewed as part of the above bachelor thesis. Ethical procedures in academic research require that participants explicitly agree to being interviewed and to how the information contained in their interview will be used. This consent form is necessary to ensure that you fully understand the purpose of your involvement and that you agree to the conditions of your participation. I therefore ask you to read the information below and sign this form to confirm that you agree to the following:

• I ………………………………………………… voluntarily agree to participate in this research study.

• I understand that even if I agree to participate now, I can withdraw at any time or refuse to answer any question without any consequences of any kind.

• I understand that I will not benefit directly from participating in this research.

• I understand that all information I provide for this study will be treated confidentially.

• I understand that in any report on the results of this research my identity and the identity of any project-related members or customers, will remain anonymous. This will be done by changing the names and disguising any details of my interview which may reveal my identity or the identity of people I speak about.

• I understand that disguised extracts from my interview may be quoted in the bachelor thesis of Dominik Kesim, with the title “Assessing Resilience of Software Systems by Application of Chaos Engineering”. The thesis is conducted in collaboration with Novatec Consulting GmbH and the Institute for Reliable Software Systems of the University of Stuttgart. Submission date is the 1st September 2019.

• I understand that signed consent forms will be retained as physical data, where only the researcher and the responsible supervisors have access to, until the bachelor thesis is evaluated. The original audio recordings will be retained as digital data, where only the researcher and the responsible supervisors have access to, until the bachelor thesis is evaluated.

• I understand that under freedom of information legislation I am entitled to access the information I have provided at any time while it is in storage as specified above.

• I understand that I am free to contact any of the people involved in the research to seek further clarification and information.

Figure A.4.: Consent Form page 1/3


Consent for participation in research interview 4/09/19

• I agree to my interview being audio-recorded.

• I agree that a transcript of the interview will be produced and analysed by the researcher Dominik Kesim.

• I understand that I have the opportunity to receive the transcript, recordings or any other Interview related documents in order to review the content and correct any factual errors.

• Access to the interview transcript and recordings is limited to Dominik Kesim, Dr. André van Hoorn (academic supervisor), Matthias Häussler (Novatec supervisor) and Christian Schwörer (Novatec supervisor).

Quotation Agreement

I understand that my words may be quoted directly. With regards to being quoted, please initial next to any of the statements you agree with.

I wish to review the notes, transcripts, or other data collected during the research pertaining to my participation.
I agree to be quoted directly.
I agree to be quoted directly, if my name is not published and a made-up name (pseudonym) is used.
I agree that the researcher may publish documents that contain quotations by me.

Contact Information of Researcher:

Researcher: Dominik Kesim Course of Study: Software Engineering Affiliation: Thesis student at Novatec Consulting GmbH Bachelor student at University of Stuttgart Work e-mail: [email protected], Alt. e-mail: [email protected]

Contact Information of Supervisors:

Academic Supervisor: Dr.-Ing. André van Hoorn E-Mail: [email protected] Novatec Supervisors: Matthias Häussler ([email protected]), Christian Schwörer ([email protected])

Figure A.5.: Consent Form page 2/3

Consent for participation in research interview 4/09/19

Signature of research participant

With your signature you confirm that you have read the consent form and agree with the information provided above.

------Signature of participant Date

Signature of researcher

I believe the participant is giving informed consent to participate in this study

------Signature of researcher Date

Figure A.6.: Consent Form page 3/3

Appendix B

Incident Analysis Results

This chapter contains the results of the Hazard Analysis conducted in Chapter 6. This chapter is structured as follows.

Section B.1 – Failure Modes and Effects Analysis: Results of the Failure Modes and Effects Analysis.
Section B.2 – Fault Tree Analysis: Results of the Fault Tree Analysis.
Section B.3 – Computational Hazard and Operations Study: Results of the Computational Hazard and Operations Study.


B.1. Failure Modes and Effects Analysis

Degree of No. Components Failure mode Sub system effects System effects severity Registry does not receive User cannot request the registry to 1 Ingress Pod that hosts ingress is terminated very high any requests handle his/her requests Registry does not receive User cannot request the registry to 2 Ingress Replicaset that hosts the ingress is terminated very high any requests handle his/her requests Ingress is still accessible Users cannot access the system since 3 Ingress Service that exposes the ingress is removed inside the cluster but not very high the ingress is not exposed. from the outside Kubernetes Master Kubernetes Master cannot heal the 4 Ingress Deployment that hosts the ingress is deleted cannot check cluster state cluster state for the ingress component high for the ingress component if deviations of cluster state occur Registry does not receive User cannot request the registry to 5 Ingress Clusternode that hosts the ingress is terminated very high any requests handle his/her requests Ingress receives not allowed http method (PUT, Ingress returns an error User receives an error message in 6 Ingress very high POST, DELETE) on user requests message addition to missing data Registry returns an error http method of request is changed to POST, PUT or User receives an error in addition to 7 Ingress because of not specified very high DELETE method missing data http method Registry receives request URI is changed from „.../datapool/all“ to for a single config data Registry returns an error to ingress 8 Ingress very high „.../datapool/{id}“ but an identifier is not provided object but without an because of missing identifier field identifier

URI is changed from „.../datapool/all“ to Registry receives request User receives a random config data 9 Ingress „.../datapool/{id}“ but a random identifier is for a random single config middle object generated data object

URI is changed from „.../datapool/{id}“ to Registry receives request User receives a list of data instead of a 10 Ingress middle „.../datapool/all” for all config data objects single config object URI is changed from „.../datapool/all” to Registry does not provide 11 Ingress User receives an error message very high „…/datapool/create“ a controller for given URI URI is changed from „.../datapool/all“ to Registry does not provide 12 Ingress „.../datapool/delete/{id}“ but an identifier is not User receives an error message very high a controller for given URI provided

Figure B.1.: Failure Modes and Effects Analysis of the Ingress component 1/3


Degree of No. Components Failure mode Sub system effects System effects severity URI is changed from „.../datapool/all“ to very high Registry does not provide a User receives an error 13 Ingress „.../datapool/delete/{id}“ but a random controller for given URI message identifier is not provided URI is changed from „.../datapool/create“ Data management receives a Data management returns an high 14 Ingress to „.../datapool/delete/{id}“ but an request for deleting an object error because of a missing identifier is not provided without an identifier identifier URI is changed from Data management receives Data management returns an high 15 Ingress „.../datapool/delete/{id}”to request without data content error because of missing data „…/datapool/create” to create content User receives an error very high Requests to registry are delayed by x ms/s Requests to ingress are 16 Ingress message that request took too with timeout aborted after x ms/s long Calling threads are blocked by very high Multiple users are not able to Requests to registry are delayed by x ms/s ingress due to synchrony and 17 Ingress perform requests since ingress without timeout blocks further incoming blocks incoming requests requests Response from registry is User receives a late response middle to 18 Ingress Requests to registry are delayed delayed from ingress high User receives an error very high Requests to registry are aborted without Registry does not receive a 19 Ingress message that request took too enforcing a retry request long Registry receives multiple very high Requests to registry are aborted with requests of same kind and 20 Ingress Registry is not able to respond enforcing a retry may shut down due to overload Data content of response by registry is Ingress receives empty data very high 21 Ingress User receives no data deleted content 22 Ingress User receives an error very high Requests to registry are delayed by x ms/s Requests to ingress are message that request took too with timeout aborted after x ms/s long

Figure B.2.: Failure Modes and Effects Analysis Results of the Ingress component 2/3

Degree of No. Components Failure mode Sub system effects System effects severity Calling threads are blocked by Multiple users are not able to Requests to registry are delayed by x ingress due to synchrony and 23 Ingress perform requests since ingress very high ms/s without timeout blocks further incoming blocks incoming requests requests Response from registry is User receives a late response 24 Ingress Requests to registry are delayed high delayed from ingress User receives an error Requests to registry are aborted Registry does not receive a 25 Ingress message that request took too very high without enforcing a retry request long Registry receives multiple Requests to registry are aborted with requests of same kind and 26 Ingress Registry is not able to respond very high enforcing a retry may shut down due to overload Ingress is not able to receive Kubernetes object that hosts ingress Admin request has been responses. Admin will not 27 Ingress terminates after admin request is correctly handled by the receive a response and thus middle handled by database system does not know if request has been handled Kubernetes object that hosts ingress User request has been Ingress is not able to receive 28 Ingress terminates after user request is handled correctly handled by the responses. User will not middle by registry system receive any data

Figure B.3.: Failure Modes and Effects Analysis Results of the Ingress component 3/3


Degree of No. Components Failure mode Sub system effects System effects severity Registry is no longer available for Clusternode that hosts requests by Ingress and User may request the service but 1 Registry very high registry is terminated communication to Kubernetes may not receive a response Master Kubernetes Master is not able to Kubernetes Master is not able to Deployment of registry 2 Registry “heal” if cluster changes occur (e.g. “heal” if cluster changes occur (e.g. high is deleted pod termination) pod termination) Registry is no longer available for Pod that hosts registry requests by Ingress and User may request the service but 3 Registry very high is terminated communication to Kubernetes may not receive a response Master Registry is no longer available for Replicaset that hosts requests by Ingress and User may request the service but 4 Registry very high registry is terminated communication to Kubernetes may not receive a response Master Service of registry is Registry is only available from inside Registry is no longer available from 5 Registry low deleted the cluster outside the cluster Ingress and other components may Cluster IP of registry is Registry is no longer available from 6 Registry not be able to communicate with very high deleted inside the cluster registry Registry receives Registry has no controller to serve requests with User receives an error message and 7 Registry URI. Registry returns an error very high unknown URI (i.e. no data message “datapool/create“) Registry receives Registry has no controller to serve requests with User receives an error message and 8 Registry URI. Registry returns an error very high unknown URI (i.e. no data message “datapool/delete/{id}”) Registry receives Registry has no controller to serve requests with User receives an error message and 9 Registry URI. Registry returns an error very high unknown URI (i.e. no data message “datapool/delete/{id”}

Figure B.4.: Failure Modes and Effects Analysis Results of the Registry component 1/2


Degree of No. Components Failure mode Sub system effects System effects severity Registry receives requests with not Registry returns an error User receives an error 10 Registry authorized http method (i.e. message with status code very high message and no data “datapool/all” with http POST) 405 (method not allowed) Registry receives requests with not Registry returns an error User receives an error 11 Registry authorized http method (i.e. message with status code very high message and no data “datapool/all” with http PUT) 405 (method not allowed) Registry receives requests with not Registry returns an error User receives an error 12 Registry authorized http method (i.e. message with status code very high message and no data “datapool/all” with http DELETE) 405 (method not allowed) User does not receive 13 Registry Response by registry to ingress is aborted - very high any data Kubernetes Master receives delayed 14 Registry Request to Kubernetes Master is aborted Registry starts a retry. request. User may wait a high long time for his request to be handled. Kubernetes Master receives several requests Initial request is aborted and subsequent Registry starts several of same kind. 15 Registry very high retries to Kubernetes Master are delayed retries with delay Kubernetes Master may shut down due to overload Ingress may abort the Response from registry to ingress is Calling thread at ingress is 16 Registry request due to a very high delayed blocked for a longer time. timeout.

Figure B.5.: Failure Modes and Effects Analysis Results of the Registry component 2/2


Degree of No. Components Failure mode Sub system effects System effects severity User may still receive data Communication to data Clusternode that hosts data from cached configuration 1 Data management management is no longer high management is terminated files, but data base possible. confirmation is not possible Scaling becomes more Kubernetes Master cannot re- Deployment of data management is difficult due to missing 2 Data management create cluster state if changes in high deleted deployment. Terminations the cluster happen become more critical User may still receive data Pod that hosts data management is Communication with data from cached configuration 3 Data management high terminated management is not possible. files, but data base confirmation is not possible User may still receive data Communication to data Replicaset that hosts data management from cached configuration 4 Data management management is no longer high is terminated files, but data base possible. confirmation is not possible Data management is not Service that exposes data management is 5 Data management accessible from outside the - low deleted cluster User may still receive data Cluster IP assigned to data management Data management is available from cached configuration 6 Data management high is removed but not accessible files, but data base confirmation is not possible Kubernetes Master never Response from data management to Data management does not receives a response. 7 Data management high Kubernetes Master is aborted know response is aborted Kubernetes Master returns an error. Data management does know Response from data management to 8 Data management response is aborted and sends - high Kubernetes Master is aborted another response

Figure B.6.: Failure Modes and Effects Analysis Results of the data management compo- nent

Degree of No. Components Failure mode Sub system effects System effects severity Admin service is not able 1 Admin service Pod that hosts admin service is terminated Admin cannot issue requests very high to receive requests Admin service is not able 2 Admin service Replicaset that hosts the admin service is terminated Admin cannot issue requests very high to receive requests Admin service is still accessible inside the Admin cannot access the system and 3 Admin service Service that exposes the admin service is removed very high cluster but not from the therefore cannot issue any requests outside Kubernetes Master Kubernetes Master cannot heal the cannot check cluster state cluster state for the admin service 4 Admin service Deployment that hosts the admin service is deleted high for the admin service component if deviations of cluster component state occur Clusternode that hosts the admin service is Admin service is not able 5 Admin service Admin cannot issue requests very high terminated to receive requests Admin receives an error message by Admin service receives not allowed http method Admin service returns an 6 Admin service ingress and hence, request isn’t very high (GET) on user requests error message to ingress handled by the system Admin service does not Admin service returns error to ingress 7 Admin service http method of request is changed to GET method provide a controller for very high which is received by the Admin GET requests

Figure B.7.: Failure Modes and Effects Analysis Results of the Admin service 1/2


Degree of No. Components Failure mode Sub system effects System effects severity

URI is changed from „.../datapool/create“ to „.../datapool/{id}“ but an identifier is not provided

URI is changed from „.../datapool/create“ to Admin service does not „.../datapool/{id}“ but a random identifier is Admin service returns error to ingress 8 Admin service provide a controller for very high generated which is received by the Admin the given URL

URI is changed from „.../datapool/delete/{id}“ to „.../datapool/all”, an identifier is provided

URI is changed from „.../datapool/delete/{id}“ to Admin service receives Admin receives response that object 9 Admin service „.../datapool/create”, an identifier is provided, correct request for already exists in database and low random data content is generated creating an object therefore cannot be created

Figure B.8.: Failure Modes and Effects Analysis Results of the Admin service 2/2

Degree of No. Components Failure mode Sub system effects System effects severity User service is not able to 1 User service Pod that hosts user service is terminated User cannot issue requests very high receive requests User service is not able to 2 User service Replicaset that hosts the user service is terminated User cannot issue requests very high receive requests User service is still accessible inside the User cannot access the system and 3 User service Service that exposes the user service is removed very high cluster but not from the therefore cannot issue any requests outside Kubernetes Master Kubernetes Master cannot heal the cannot check cluster state cluster state for the user service 4 User service Deployment that hosts the user service is deleted high for the user service component if deviations of cluster component state occur User service is not able to 5 User service Clusternode that hosts the user service is terminated User cannot issue requests very high receive requests User receives an error message by User service receives not allowed http method (GET, User service returns an 6 User service ingress and hence, request isn’t very high DELETE) on user requests error message to ingress handled by the system User service does not http method of request is changed to GET, DELETE Admin service returns error to ingress 7 User service provide a controller for very high method which is received by the Admin GET requests

Figure B.9.: Failure Modes and Effects Analysis Results of the User service


B.2. Fault Tree Analysis

User does not receive configuration identifiers

OR

Data cannot be Communication fails obtained from backing services

A B

Figure B.10.: Fault Tree Analysis Results: User does not receive configuration identifiers 1/3


A Communication fails

OR

Ingress is not Kubernetes Master is User service is not Registry is not available not available available available

Pod/Node of Pod/Node of Pod/Node of Pod/Node of ingress Kub. Master user service registry terminates terminates terminates terminates

Figure B.11.: Fault Tree Analysis Results: User does not receive configuration identifiers 2/3


Data cannot be obtained from B backing services

OR

Registry cannot obtain Services that hold configuration identifiers configuration data are not available

AND AND

All configuration Registry has Database is not Database is not microservice of no local cache available available requested identifier of identifiers are not available

AND

Pod/Node of Pod/Node of Pod of 1st Pod of 2nd Database Database microservice microservice terminates terminates fails fails

Figure B.12.: Fault Tree Analysis Results: User does not receive configuration identifiers 3/3


Administrator cannot delete configuration data objects

OR

Communication to Communication to Kubernetes Master is processing services ingress fails not available fails

OR

Node of Node of Master Master terminates terminates

Admin service is not Database is not available available

Pod/Node of Pod/Node of admin database terminates terminates

Figure B.13.: Fault Tree Analysis Results: Administrator cannot delete configuration data objects


Administrator cannot create configuration data objects

OR

Communication to Communication to Kubernetes Master is processing services ingress fails not available fails

OR

Node of Node of Master Master terminates terminates

Admin service is not Database is not available available

Pod/Node of Pod/Node of admin database terminates terminates

Figure B.14.: Fault Tree Analysis Results: Administrator cannot create configuration data objects


Administrator cannot load/unload configuration data objects

OR

Communication to Communication to Kubernetes Master is processing services ingress fails not available fails

OR

Node of Node of Master Master terminates terminates

Admin service is not Database is not available available

Pod/Node of Pod/Node of admin database terminates terminates

Figure B.15.: Fault Tree Analysis Results: Administrator cannot load/unload configura- tion data objects


B.3. Computational Hazard and Operations Study

No. Interconnection and Flow Attribute Guideword Cause Consequences/Implication

1 Ingress-Registry (→) data flow none request was aborted, request timed Consequences: out • Registry does not receive a request • Ingress does not receive any data • Ingress may start a retry • User waits for response • Registry is overloaded with retries → 2 Ingress-Registry ( ) data flow other than Data content of response from registry Consequences: is removed • Ingress receives empty response

• User does not receive any data 3 Ingress-Registry (→) data flow other than Data content of response from registry Consequences: is changed to random config objects • User receives data that he has not asked for • UX is reduced • User has to request data again 4 Ingress-Registry (→) data flow other than http method is not permitted, header Consequences: information is wrong, no controller for • given URI Registry cannot handle request • Registry returns an error • User receives no data 5 Ingress-Registry (→) data flow less than In case of GET requests that ask for Consequences:

single config objects a config object • Registry returns an error since object identifier is left out/deleted identifier is missing • User receives no data 6 Ingress-Registry (→) data flow more than In addition to a single identifier an Consequences: array of identifiers is passed to a GET • request Registry does not provide a controller to handle multiple identifier requests • Registry returns an error • User receives no data 7 Ingress-Registry (→) data flow late request is delayed, request is aborted Consequences: and a retry is started • Increased response time • User may abort the process

Figure B.16.: Computational Hazard and Operations Results: Ingress-Registry


No. Interconnection and Flow Attribute Guideword Cause Consequences/Implication

Consequences:

Registry-Master request was aborted, request timed • Kubernetes Master does not receive a 1 data flow none request → out ( ) • Registry does not receive data from Kubernetes Master • User does not receive any data Consequences:

• Kubernetes Master does not receive a Registry-Master 2 data flow part of Pod/Clusternode of registry terminates request (→) • Registry doesn’t know about previous requests • User will not receive any response • Registry does not initiate a retry Consequences:

• Kubernetes Master receives multiple Registry-Master request was aborted, request timed requests of same session 3 data flow as well as out and a retry is started • Kubernetes Master may shut down (→) due to overload • User does not receive any data • Kubernetes Master is no longer available Consequences: Registry-Master 4 data flow late Request is delayed • User does not receive any data → ( ) • Increased response time • User may abort the process Consequences: Master-Registry Registry receives less objects that are 5 data flow less than loaded than objects that are actually • User receives less data than expected () loaded UX is reduced

Consequences:

Master-Registry Registry receives more objects that are • Further requests of the user may 6 data flow more than loaded than objects that are actually contain id’s that are not present ( ) loaded • UX is reduced • User does not receive the data intended on further requests

Figure B.17.: Computational Hazard and Operations Results: Registry-Master 1/2

Consequences:

Master-Registry Response was aborted/deleted/got • Registry does not receive data from 7 data flow none Kubernetes Master lost ( ) • User does not receive any data • Registry blocks calling thread until response is received Consequences:

• Registry does not receive data from Master-Registry Kubernetes Master 8 data flow none Pod/Clusternode of master terminates • User does not receive any data ( ) • Kubernetes master does not know about previous requests • Registry blocks calling thread until response is received

Figure B.18.: Computational Hazard and Operations Results: Registry-Master 2/2


No. Interconnection and Flow Attribute Guideword Cause Consequences/Implication

Consequences:

• Data management does not receive a request/query Master-Data Management request was aborted, request timed 1 data flow none • Data is not queried from the database (→) out • User may not receive any data besides the cached files • Kubernetes Master may block calling thread until a response is received by Database Consequences: Master-Data Management Config id is not provided with admin 2 data flow less than • Data management returns an error → request to delete a config data object ( ) since no identifier is provided • Admin receives an error Consequences: Master-Data Management An array of config identifiers is 3 data flow more than provided by the Admin that are to be • Data management returns an error → ( ) deleted since no controller is available to handle multiple identifiers Consequences:

The request is changed from a POST • A random config object is Master-Data Management request to a DELETE request in terms unintentionally deleted from the 4 data flow other than of URI, http method and content. A database → ( ) random identifier is generated in • The admin does not know which terms of a DELETE request config data is deleted • The data may not be available to the user anymore Consequences:

• Data management does not receive a request/query • Master-Data Management Data is not queried from the database 5 data flow none Pod/Clusternode of master terminates • User may not receive any data (→) besides the cached files • Kubernetes Master does not initiate a retry • Kubernetes Master does not know anything about a request previous to crash

Figure B.19.: Computational Hazard and Operations Results: Master-Data management 1/2


Consequences:

• Data management does not receive a request/query • Data is not queried from the database Data Management-Master response was aborted, response timed • User may not receive any data 6 data flow none besides the cached files out ( ) • Response times increases • UX is decreased and user may abort the process • Kubernetes Master may block calling thread until a response is received by Database Consequences:

• User receives a late response Data Management-Master Pod/Clusternode of data management 7 data flow none • Response times are increased () terminates • UX decreases • Kubernetes Master may block calling thread until a response is received by Database Consequences:

• User receives a late response Data Management-Master 8 data flow late request is delayed • Response times are increased () • UX decreases • Kubernetes Master may block calling thread until a response is received by Database

Figure B.20.: Computational Hazard and Operations Results: Master-Data management

2/2

No. Interconnection and Flow Attribute Guideword Cause Consequences

Consequences:

Master-Config MS One out of two redundant • One microservice is still able to 1 data flow part of respond → microservices is shut down ( ) • Request may be redirected due to sudden shutdown of requested service Consequences:

• Pods are restarted Master-Config MS All redundant services are shut • Request is delayed by restart and 2 data flow more than retry time → down ( ) • Master may request the generation of additional services and request the data from the data management Consequences: Master-Config MS 3 data flow other than The service is not loaded • The requested id is not loaded (→) • The user can’t load the requested data • The user may receive an error

Figure B.21.: Computational Hazard and Operations Results: Master-Config MS


No. Interconnection and Flow Attribute Guideword Cause Consequences

Consequences: Ingress-Adminservice Pod/Clusternode that hosts Admin 1 data flow none • Admin is not able to perform (→) terminates requests • Configuration data cannot be created/deleted or loaded Consequences: Ingress-Adminservice http request contains a http 2 data flow other than method that is not allowed on the • Admin receives a http status code (→) admin API 405 which signifies that the http method is not allowed Consequences: Ingress-Adminservice In case of delete requests, more 3 data flow more than • Admin service does not provide a (→) than one identifier is provided controller for multiple identifiers • Admin will receive an error message Consequences: Ingress-Adminservice In case of delete requests, an 4 data flow less than • Admin service does not provide a (→) identifier is not provided controller for multiple identifiers • Admin will receive an error message Consequences: Adminservice-Ingress Pod/Clusternode that hosts Ingress 5 data flow none • terminates Admin does not receive a () response • Admin has to re-issue his request

Figure B.22.: Computational Hazard and Operations Results: Ingress-Adminservice

No. Interconnection and Flow Attribute Guideword Cause Consequences

Consequences: Ingress-Userservice Pod/Clusternode that hosts 1 data flow none • Userservice terminates User is not able to perform (→) requests • UX is decreased Consequences: Ingress-Userservice http request contains a http 2 data flow other than method that is not allowed on the • User receives an http status code (→) user API (get, delete) 405 which signifies that the http method is not allowed Consequences: Ingress-Userservice http request provides more 3 data flow more than parameters than allowed • User receives an error that (→) additional parameters are not allowed Consequences: Userservice-Ingress Pod/Clusternode that hosts Ingress 4 data flow none () terminates • User does not receive data • UX is decreased

Figure B.23.: Computational Hazard and Operations Results: Ingress-Userservice


Bibliography

[ALR+01] A. Avizienis, J.-C. Laprie, B. Randell, et al. Fundamental concepts of dependability. University of Newcastle upon Tyne, Computing Science, 2001 (cit. on p. 24).

[AOT10] H. Arabian-Hoseynabadi, H. Oraee, P. Tavner. “Failure modes and effects analysis (FMEA) for wind turbines.” In: International Journal of Electrical Power & Energy Systems 32.7 (2010), pp. 817–824 (cit. on p. 8).

[ARH15] P. Alvaro, J. Rosen, J. M. Hellerstein. “Lineage-driven fault injection.” In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM. 2015, pp. 331–346 (cit. on pp. 42, 45).

[BBR+16] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, C. Rosenthal. “Chaos Engineering.” In: IEEE Software 33.3 (2016), pp. 35–41 (cit. on pp. 1, 2, 26–31, 39–41, 43).

[BCK03] L. Bass, P. Clements, R. Kazman. Software architecture in practice. Addison-Wesley Professional, 2003 (cit. on pp. 5, 6).

[Ben13] G. Bengel. Verteilte Systeme: Client-Server-Computing für Studenten und Praktiker. Vieweg-Lehrbuch. Vieweg+Teubner Verlag, 2013. ISBN: 9783322969156. URL: https://books.google.de/books?id=xOYkBgAAQBAJ (cit. on p. 66).

[BGM87] I. Benbasat, D. K. Goldstein, M. Mead. “The case research strategy in studies of information systems.” In: MIS quarterly (1987), pp. 369–386 (cit. on p. 12).

[Dat08] C. S. Date. “ReSIST: Resilience for Survivability in IST.” In: Interaction (2008), p. 1 (cit. on p. 26).


[DFVA10] J. Dunjó, V. Fthenakis, J. A. Vílchez, J. Arnaldos. “Hazard and operability (HAZOP) analysis. A literature review.” In: Journal of hazardous materials 173.1-3 (2010), pp. 19–32 (cit. on p. 11).

[Eri99] C. A. Ericson. “Fault tree analysis.” In: System Safety Conference, Orlando, Florida. Vol. 1. 1999, pp. 1–9 (cit. on p. 7).

[FF98] D. Farace, J. Frantzen. GL’97 Conference Proceedings: Third International Conference on Grey Literature: Perspectives on the design and transfer of scientific and technical information, Luxembourg DGXIII, November 13-14, 1997. Grey Literature Network Service. Amsterdam: GreyNet/TransAtlantic, March 1998. X, 294 p.; 30 cm. (GL-conference series, Conference proceedings, ISSN 1386-2316; No. 3). Amsterdam: GreyNet/TransAtlantic, 1998 (cit. on p. 39).

[FH02] A. Field, G. Hole. How to design and report experiments. Sage, 2002 (cit. on pp. 3, 16, 99).

[FH94] P. Fenelon, B. Hebbron. “Applying HAZOP to software engineering models.” In: Risk management and critical protective systems: proceedings of SARSS. 1994, pp. 11–116 (cit. on pp. 10, 82).

[FLR+14] C. Fehling, F. Leymann, R. Retter, W. Schupeck, P. Arbitter. Cloud computing patterns: fundamentals to design, build, and manage cloud applications. Springer, 2014 (cit. on pp. 7, 65).

[FT00] R. T. Fielding, R. N. Taylor. Architectural styles and the design of network-based software architectures. Vol. 7. University of California, Irvine Doctoral dissertation, 2000 (cit. on pp. 6, 65, 66, 129).

[Gra93] J. Gray. The Benchmark Handbook for Database and Transaction Systems. 2nd ed. Morgan Kaufmann, 1993 (cit. on p. 27).

[HADP18] A. van Hoorn, A. Aleti, T. F. Düllmann, T. Pitakrat. “ORCAS: Efficient Resilience Benchmarking of Microservice Architectures.” In: 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). IEEE. 2018, pp. 146–147 (cit. on pp. 27, 42).

[HKD14] N. Hyett, A. Kenny, V. Dickson-Swift. “Methodology or method? A critical review of qualitative case study reports.” In: International journal of qualitative studies on health and well-being 9.1 (2014), p. 23606 (cit. on p. 47).

[HRJ+16] V. Heorhiadi, S. Rajagopalan, H. Jamjoom, M. K. Reiter, V. Sekar. “Gremlin: Systematic resilience testing of microservices.” In: 2016 IEEE 36th International Conference on Distributed Computing Systems (ICDCS). IEEE. 2016, pp. 57–66 (cit. on p. 43).


[HTI97] M.-C. Hsueh, T. K. Tsai, R. K. Iyer. “Fault injection techniques and tools.” In: Computer 30.4 (1997), pp. 75–82 (cit. on p. 25).

[Jan08] I. L. Janis. “Groupthink.” In: IEEE Engineering Management Review 36.1 (2008), p. 36 (cit. on p. 14).

[KC14] R. A. Krueger, M. A. Casey. Focus groups: A practical guide for applied research. Sage publications, 2014 (cit. on p. 14).

[LGTL85] W.-S. Lee, D. L. Grosh, F. A. Tillman, C. H. Lie. “Fault Tree Analysis, Methods, and Applications – A Review.” In: IEEE transactions on reliability 34.3 (1985), pp. 194–203 (cit. on pp. 9, 10).

[LH83] N. G. Leveson, P. R. Harvey. “Analyzing software safety.” In: IEEE Transactions on Software Engineering 5 (1983), pp. 569–579 (cit. on pp. 8, 9).

[Mar02] R. C. Martin. “The single responsibility principle.” In: The principles, patterns, and practices of Agile Software Development 149 (2002), p. 154 (cit. on p. 6).

[Mar17] R. C. Martin. Clean architecture: a craftsman’s guide to software structure and design. Prentice Hall Press, 2017 (cit. on p. 5).

[MEH01] M. W. Maier, D. Emery, R. Hilliard. “Software architecture: Introducing IEEE standard 1471.” In: Computer 4 (2001), pp. 107–109 (cit. on p. 6).

[New15] S. Newman. Building Microservices. O’Reilly Media Inc., 2015, pp. 1–2 (cit. on pp. 6, 66).

[ORe17] G. O’Regan. “Software Reliability and Dependability.” In: Concise Guide to Software Engineering. Springer, 2017, pp. 171–184 (cit. on pp. 1, 24).

[PG95] W. Pree, E. Gamma. Design patterns for object-oriented software development. Vol. 183. Addison-Wesley Reading, MA, 1995 (cit. on p. 7).

[Poh10] K. Pohl. Requirements engineering: fundamentals, principles, and techniques. Springer Publishing Company, Incorporated, 2010 (cit. on pp. 13–15, 50).

[Pul01] L. L. Pullum. Software fault tolerance techniques and implementation. Artech House, 2001 (cit. on p. 25).

[PVB+19] A. V. Papadopoulos, L. Versluis, A. Bauer, N. Herbst, J. von Kistowski, A. Ali-Eldin, C. L. Abad, J. N. Amaral, P. Tuma. Methodological Principles for Reproducible Performance Evaluation in Cloud Computing. Tech. rep. SPEC RG Cloud Working Group, 2019 (cit. on p. 89).

[PW92] D. E. Perry, A. L. Wolf. “Foundations for the study of software architecture.” In: ACM SIGSOFT Software engineering notes 17.4 (1992), pp. 40–52 (cit. on p. 5).


[Ran] Random House. Random House Kernerman Webster’s College Dictionary. © 2010 K Dictionaries Ltd. Random House Inc. (cit. on p. 23).
[RH09] P. Runeson, M. Höst. “Guidelines for conducting and reporting case study research in software engineering.” In: Empirical Software Engineering 14.2 (2009), p. 131 (cit. on pp. 3, 11, 12, 15, 47, 50, 51, 54).
[RM16] C. Robson, K. McCartan. Real world research. John Wiley & Sons, 2016 (cit. on p. 11).
[Sah11] G. K. Saha. “Software Fault Avoidance.” In: IEEE Reliability Society Newsletter 57 (2011) (cit. on p. 25).
[SC98] A. Strauss, J. Corbin. Basics of qualitative research techniques. Sage Publications, Thousand Oaks, CA, 1998 (cit. on p. 13).
[SI03] J. R. Seung, K. Ishii. Using cost based FMEA to enhance reliability and serviceability. Elsevier B. V., 2003 (cit. on p. 25).
[Str12] L. Strigini. “Fault tolerance and resilience: meanings, measures and assessment.” In: Resilience Assessment and Evaluation of Computing Systems. Springer, 2012, pp. 3–24 (cit. on p. 25).
[TC04] P. C. Teoh, K. Case. “Failure modes and effects analysis through knowledge modelling.” In: Journal of Materials Processing Technology 153 (2004), pp. 253–260 (cit. on p. 7).
[VWAM12] M. Vieira, K. Wolter, A. Avritzer, A. van Moorsel. Resilience Assessment and Evaluation of Computing Systems. Springer Science & Business Media, 2012 (cit. on pp. 1, 19, 23, 25, 27).
[Wig11] A. Wiggins. “The twelve-factor app.” In: The Twelve-Factor App (2011) (cit. on p. 68).
[WJG01] R. Winther, O.-A. Johnsen, B. A. Gran. “Security assessments of safety critical systems using HAZOPs.” In: SAFECOMP: International Conference on Computer Safety, Reliability, and Security. Springer, 2001, pp. 14–24 (cit. on pp. 10, 11).
[WRH+12] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, A. Wesslén. Experimentation in Software Engineering. Springer Science & Business Media, 2012 (cit. on pp. 3, 11–15, 47, 53, 56).
[Yin98] R. K. Yin. “The abridged version of case study research: Design and method.” In: (1998) (cit. on p. 11).

Appendix B

Online References

[Chaa] ChaosToolkit. URL: https://chaostoolkit.org (cit. on pp. 2, 20, 34).
[Chab] ChaosToolkit. URL: https://docs.chaostoolkit.org/reference/api/experiment/ (cit. on pp. 20, 91).
[Cha19] ChaosIQ. 2019. URL: https://chaosiq.io/ (cit. on p. 20).
[Doc] Docker Inc. URL: https://www.docker.com (cit. on p. 18).
[Fei18] N. V. Feito Matias. On the lessons learnt by approaching chaos and resilience engineering in a midsize organisation. 2018 (cit. on p. 44).
[Fou] The Linux Foundation. URL: https://kubernetes.io/ (cit. on pp. 18, 19, 62, 66).
[Grea] Gremlin Inc. Chaos Conf. URL: https://chaosconf.splashthat.com (cit. on p. 41).
[Greb] Gremlin Inc. 2018. URL: https://help.gremlin.com (cit. on pp. 1, 29, 32, 41, 42).
[Grec] Gremlin Inc. 2018. URL: https://www.gremlin.com/community/tutorials/how-to-run-a-gameday/ (cit. on pp. 36, 37).
[Hoc] L. Hochstein. Resources on Resilience Engineering. URL: https://github.com/lorin/resilience-engineering (cit. on p. 42).
[IBM] IBM. IBM WebSphere Application Server. URL: https://www.ibm.com/de-de/marketplace/java-ee-runtime/details (cit. on p. 59).
[IT11] Y. Izrailevsky, A. Tseitlin. The Netflix Simian Army. 2011. URL: https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116 (cit. on pp. 29–31, 42).
[Jon] N. Jones. Performing Chaos at Netflix Scale. URL: https://www.youtube.com/watch?v=LaKGx0dAUlo (cit. on p. 43).


[Led17] A. Ledenev. Chaos Testing for Docker Container. 2017. URL: https://alexei-led.github.io/post/pumba_containercamp/ (cit. on pp. 33, 44).
[Med18] A. Medina. Getting Started with Chaos Engineering. 2018 (cit. on pp. 26, 32).
[Mica] Microsoft. URL: https://azure.microsoft.com/de-de/ (cit. on p. 90).
[Micb] Microsoft. URL: https://azure.microsoft.com/de-de/services/kubernetes-service/ (cit. on p. 90).
[Mil18] R. Miles. Chaos Engineering Case Study: HomeAway. ChaosIQ, 2018. URL: https://medium.com/chaosiq/chaos-engineering-case-study-homeaway-f3fa26b5ede5 (cit. on pp. 44, 45).
[Mon] MongoDB Inc. URL: https://www.mongodb.com (cit. on p. 60).
[Net14] Netflix Tech Blog. Netflix, 2014. URL: https://medium.com/netflix-techblog/fit-failure-injection-testing-35d8e2a9bb2 (cit. on p. 29).
[Obj] Object Management Group. OMG Unified Modeling Language. URL: https://www.omg.org/spec/UML/2.5/PDF (cit. on p. 63).
[Oraa] Oracle. Java. URL: https://java.com/de/about/whatis_java.jsp (cit. on p. 7).
[Orab] Oracle. Java EE. URL: https://www.oracle.com/technetwork/java/javaee/overview/index.html (cit. on p. 59).
[Orac] Oracle. Java Platform Standard Edition 7 Documentation. URL: https://docs.oracle.com/javase/7/docs/api/java/sql/Blob.html (cit. on p. 60).
[Pat] T. Patric. URL: https://github.com/titpetric/docker-chaos-monkey (cit. on p. 34).
[Piva] Pivotal Software. URL: https://spring.io/ (cit. on p. 17).
[Pivb] Pivotal Software Inc. Spring Boot Documentation. URL: https://docs.spring.io/spring-boot/docs/current/reference/html/_learning_about_spring_boot_features.html (cit. on p. 67).
[R F] R Foundation. URL: https://www.r-project.org/ (cit. on p. 99).
[Ric17] C. Richardson. Microservice patterns. 2017 (cit. on p. 6).
[Sob] A. Sobti. URL: https://github.com/asobti/kube-monkey (cit. on p. 36).

[Sys] Systems Safety. Department of Defense. URL: https://www.system-safety.org/Documents/MIL-STD-882E.pdf (cit. on pp. 7, 69).
[Thea] The Apache Software Foundation. URL: http://cassandra.apache.org (cit. on p. 60).
[Theb] The Apache Software Foundation. URL: http://couchdb.apache.org (cit. on p. 60).
[Thec] The PostgreSQL Global Development Group. PostgreSQL Documentation. URL: https://www.postgresql.org/about/ (cit. on p. 61).
[The19a] The Apache Software Foundation. Apache JMeter. 2019. URL: https://jmeter.apache.org/ (cit. on p. 19).
[The19b] The Apache Software Foundation. Apache JMeter User Manual – Test Plan. 2019. URL: https://jmeter.apache.org/usermanual/test_plan.html (cit. on p. 19).
[Wil18] B. Wilms. Chaos Monkey for Spring Boot Reference Guide. 2018. URL: https://codecentric.github.io/chaos-monkey-spring-boot/2.0.1/ (cit. on pp. 35, 88, 121).

All links were last followed on August 22, 2019.

Declaration

I hereby declare that the work presented in this thesis is entirely my own and that I did not use any other sources and references than the listed ones. I have marked all direct or indirect statements from other sources contained therein as quotations. Neither this work nor significant parts of it were part of another examination procedure. I have not published this work in whole or in part before. The electronic copy is consistent with all submitted copies.

place, date, signature