Evaluating Software Systems Via Runtime Fault-Injection and Reliability, Availability and Serviceability (RAS) Metrics and Models

The 7U Evaluation Method: Evaluating Software Systems via Runtime Fault-Injection and Reliability, Availability and Serviceability (RAS) Metrics and Models Rean Griffith Submitted in partial fulfillment of the Requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences COLUMBIA UNIVERSITY 2008 c 2008 Rean Griffith All Rights Reserved Abstract The 7U Evaluation Method: Evaluating Software Systems via Runtime Fault-Injection and Reliability, Availability and Serviceability (RAS) Metrics and Models Rean Griffith Renewed interest in developing computing systems that meet additional non-functional requirements such as reliability, high availability and ease-of-management/self-management (serviceability) has fueled research into developing systems that exhibit enhanced reliability, availability and serviceability (RAS) capabilities. This research focus on enhancing the RAS capabilities of computing systems impacts not only the legacy/existing systems we have today, but also has implications for the design and development of next generation (self- managing/self-*) systems, which are expected to meet these non-functional requirements with minimal human intervention. To reason about the RAS capabilities of the systems of today or the self-* systems of tomorrow, there are three evaluation-related challenges to address. First, developing (or identifying) practical fault-injection tools that can be used to study the failure behavior of computing systems and exercise any (remediation) mechanisms the system has available for mitigating or resolving problems. Second, identifying techniques that can be used to quantify RAS deficiencies in computing systems and reason about the efficacy of individual or combined RAS-enhancing mechanisms (at design-time or after system deployment). Third, developing an evaluation methodology that can be used to objectively compare systems based on the (expected or actual) benefits of RAS-enhancing mechanisms. This thesis addresses these three challenges by introducing the 7U Evaluation Methodology, a complementary approach to traditional performance-centric evaluations that identifies crite- ria for comparing and analyzing existing (or yet-to-be-added) RAS-enhancing mechanisms, is able to evaluate and reason about combinations of mechanisms, exposes under-performing mechanisms and highlights the lack of mechanisms in a rigorous, objective and quantitative manner. The development of the 7U Evaluation Methodology is based on the following three hypotheses. First, that runtime adaptation provides a platform for implementing efficient and flexible fault-injection tools capable of in-situ and in-vivo interactions with computing systems. Second, that mathematical models such as Markov chains, Markov reward networks and Control theory models can successfully be used to create simple, reusable templates for describing specific failure scenarios and scoring the system’s responses, i.e., studying the failure-behavior of systems, and the various facets of its remediation mechanisms and their impact on system operation. Third, that combining practical fault-injection tools with mathematical modeling techniques based on Markov Chains, Markov Reward Networks and Control Theory can be used to develop a benchmarking methodology for evaluating and comparing the reliability, availability and serviceability (RAS) characteristics of computing systems. This thesis demonstrates how the 7U Evaluation Method can be used to evaluate the RAS capabilities of real-world computing systems and in so doing makes three contributions. First, a suite of runtime fault-injection tools (Kheiron tools) able to work in a variety of execution environments is developed. Second, analytical tools that can be used to construct mathematical models (RAS models) to evaluate and quantify RAS capabilities using appropriate metrics are discussed. Finally, the results and insights gained from conducting fault-injection experiments on real-world systems and modeling the system responses (or lack thereof) using RAS models are presented. In conducting 7U Evaluations of real-world systems, this thesis highlights the similarities and differences between traditional performance-oriented evaluations and RAS-oriented evaluations and outlines a general framework for conducting RAS evaluations. Contents List of Figures vi List of Tables ix 1 Introduction 1 1.1 Definitions ................................... 3 1.2 Problem statement .............................. 4 1.3 Requirements ................................. 5 1.4 Hypotheses .................................. 9 1.5 Thesis outline ................................. 9 2 Motivation 11 2.1 DASADA Overview ............................. 11 2.2 Kinesthetics eXtreme (KX) .......................... 13 2.2.1 Probing Technologies used in KX .................. 14 2.2.2 Effector Technologies used in KX .................. 16 2.3 Short-term Research Objectives after KX .................. 17 2.4 Long-term Research Objectives ........................ 19 2.4.1 Scoping the Self-Management Capabilities to be Evaluated ..... 19 2.4.2 Expanding the Classes of Systems to be Evaluated ......... 21 2.5 Revised Research Agenda .......................... 22 i 2.6 Summary of Contributions .......................... 25 I Runtime Adaptation and Fault-Injection 28 3 Runtime Modification of Systems 29 3.1 Definitions ................................... 31 3.2 Overview ................................... 33 3.3 Motivation ................................... 35 3.4 Background on Execution Environments ................... 36 3.5 Challenges of Runtime Adaptation via the Execution Environment ..... 37 3.6 Hypotheses .................................. 38 3.7 Kheiron/CLR: Runtime Adaptation in the Common Language Runtime .. 40 3.7.1 Common Language Runtime Execution Model ........... 41 3.7.2 The CLR Profiler and Unmanaged Metadata APIs .......... 41 3.7.3 Kheiron/CLR Architecture ...................... 42 3.7.4 Model of Operation .......................... 43 3.7.5 Performing an Adaptation ...................... 47 3.7.6 Forcing Multiple JIT Compilations (re-JITs) ............ 50 3.7.7 Evaluation Part 1: Kheiron/CLR Performance Impact ........ 51 3.7.8 Evaluation Part 2: Kheiron/CLR Dynamic Reconfiguration Case Study ................................. 57 3.8 Kheiron/JVM: Runtime Adaptation in the Java Virtual Machine ...... 66 3.8.1 Java Virtual Machine Execution Model (Java HotspotVM) ..... 67 3.8.2 JVM Profiler and Metadata APIs ................... 67 3.8.3 Kheiron/JVM Architecture ...................... 68 3.8.4 Model of Operation .......................... 71 3.8.5 Evaluation Part 1: Kheiron/JVM Performance Impact ........ 74 ii 3.8.6 Evaluation Part 2: Kheiron/JVM Web-Application Fault-Injection . 76 3.9 Kheiron/C: Runtime Adaptation of Compiled-C Programs ......... 82 3.9.1 Native Execution Model ....................... 82 3.9.2 Kheiron/C Model of Operation .................... 85 3.9.3 Evaluation Part 1: Kheiron/C Performance Impact ......... 87 3.9.4 Evaluation Part 2: Kheiron/C Injecting Selective Emulation .... 88 3.10 Integrity/Consistency-preserving Adaptations ................ 93 3.11 Related Work ................................. 95 3.11.1 Runtime Adaptation ......................... 95 3.11.2 Software Implemented Fault-Injection Tools ............ 98 3.12 Summary ................................... 100 II RAS Evaluations via Runtime Adaptation and RAS Modeling 102 4 Evaluating RAS Capabilities 103 4.1 Hypotheses .................................. 105 4.2 Analytical Tools ................................ 106 4.2.1 Continuous Time Markov Chains (CTMCs) ............. 106 4.2.2 Markov Reward Networks ...................... 112 4.2.3 Feedback Control Models ...................... 113 4.3 Analysis Techniques ............................. 116 4.3.1 Microreboot RAS Model ....................... 117 4.3.2 Model Analysis – RAS Measures and Metrics ............ 122 4.3.3 Reliability Measures ......................... 123 4.3.4 Availability Measures ........................ 127 4.3.5 Serviceability Measures ....................... 131 4.3.6 Analysis Results ........................... 135 iii 4.4 Related Work ................................. 135 4.5 Summary ................................... 137 5 The 7U-Evaluation Benchmark 139 5.1 Introduction .................................. 140 5.2 The 7U RAS Benchmarking Methodology .................. 141 5.3 RAS Benchmarking Challenges ....................... 143 5.3.1 Selecting reasonable or representative faults ............. 143 5.3.2 Representative Workloads ...................... 145 5.3.3 Reproducibility and Portability .................... 146 5.3.4 Metrics and Scoring ......................... 147 5.4 Evaluation Part 1 ............................... 149 5.4.1 7U Process .............................. 149 5.4.2 Deployment 1: Resin, MySQL, Linux 2.4.18 ............ 152 5.4.3 Deployment 2: Resin, MySQL, Linux 2.6.20 ............ 158 5.4.4 Deployment Comparisons ...................... 160 5.5 Evaluation Part 2 ............................... 162 5.5.1 7U Process .............................. 168 5.5.2 VM-Rejuv Evaluation ........................ 170 5.6 Evaluation Part 3 ............................... 175 5.6.1 7U Process .............................. 177 5.6.2 Evaluating Hardened Network Device Drivers on OpenSolaris ... 180 5.7 Related Work ................................. 183 5.8 Summary ................................... 187 6 Contributions, Future Work and Conclusion 188

Evaluating Software Systems Via Runtime Fault-Injection and Reliability, Availability and Serviceability (RAS) Metrics and Models

Using Fault Injection to Increase Software Test Coverage

Improving Software Fault Injection

Formal Fault Injection Vulnerability Detection in Binaries: a Software

Experimental Assessment of Cloud Software Dependability Using Fault Injection

RESEARCH INSIGHTS – Modern Security Vulnerability Discovery

Etfids: Efficient Transient Fault Injection and Detection System

Efficient Testing of Recovery Code Using Fault Injection

Fault Injection for Software Certification

NUREG/CR-7151 Vol 1 "Development of a Fault Injection

What's Wrong with Fault Injection As a Benchmarking Tool?

Dependability Assessment of the Android OS Through Fault Injection Domenico Cotroneo, Antonio Ken Iannillo, Roberto Natella, Stefano Rosiello

Fuzzing Error Handling Code in Device Drivers Based on Software Fault Injection Zu-Ming Jiang, Jia-Ju Bai, Julia Lawall, Shi-Min Hu