ETFIDS: Efficient Transient Fault Injection and Detection System

by Ninghan Tian

Submitted in partial fulfillment of the requirements for the degree of Master of Science

Department of Electrical Engineering and Computer Science
Case Western Reserve University
January, 2019

Case Western Reserve University
School of Graduate Studies

We hereby approve the thesis of Ninghan Tian, candidate for the degree of Master of Science*.

Committee Chair: Dr. Daniel G. Saab
Committee Member: Dr. Christos Papachristou
Committee Member: Dr. Francis Merat
Date of Defense: December 12, 2018

*We also certify that written approval has been obtained for any proprietary material contained therein.

Contents

List of Tables
List of Figures
Abstract
1 Introduction
2 Fault Injection Techniques
2.1 Hardware-implemented fault injection
2.1.1 Injection with contact
2.1.2 Injection without contact
2.1.3 Hardware-implemented fault injection tools
2.2 Software-implemented fault injection (SWIFI)
2.2.1 Compile-time injection
2.2.2 Runtime injection
2.2.3 SWIFI tools
2.3 The significance of SWIFI
3 ETFIDS
3.1 Motivation
3.2 The overall approach
3.3 The lower level operating principle of ETFIDS
3.4 Fault injection control flow
3.5 Fault outcome analysis
4 Fault Injection Experiments using ETFIDS
4.1 Typical fault injection target and experiment configuration
4.2 Fault injection experiment results
4.3 Performance
5 Conclusion

List of Tables

2.1 Characteristics of fault injection methods
3.1 Size of Dump Files for Fault Outcome Analysis
4.1 Fault Injected Outcome
4.2 Average Time Overhead
4.3 Time Overhead Comparison
4.4 Detailed Time Overhead (in seconds) Comparison

List of Figures

1.1 Stored charges versus technology process [7]
2.1 General architecture of Messaline [2]
2.2 Ftape environment [29]
2.3 Xception structure [3]
2.4 The work process when using PROPANE [12]
2.5 Fiesta++ framework [4]
3.1 Illustration of ETFIDS Fault Injection and Analysis
3.2 ETFIDS Overall Framework
3.3 Bit-flip fault model [7]
3.4 Illustration of the Effects of Implemented Fault Models
3.5 Illustration of the Effects of a "Jump" Fault
3.6 ETFIDS Fault Injection Control Flow
3.7 ETFIDS Fault Outcome Analysis Control Flow
3.8 ETFIDS Fault Outcome Analysis Work Flow
3.9 Fault Outcome Analysis Work Flow by Dumping Data
4.1 Average Fault Outcome Probability
4.2 Number of Benign Fault Outcome Observed for Each Target Application
4.3 Number of Time-out Fault Outcome Observed for Each Target Application
4.4 Number of Detected Fault Outcome Observed for Each Target Application
4.5 Number of Crash Fault Outcome Observed for Each Target Application
4.6 Fault Injection Outcome Probability Regards to Each Application
4.7 Relative Time Consumed by One Fault Injection Experiment using Early Termination Feature (Data Obtained using matrix)

Abstract

Computer use in high-dependability applications is rapidly increasing. However, even correctly designed computer systems can suffer from temporary errors due to various factors. To be reliable, modern computer systems should therefore be able to detect, locate, isolate, and recover from errors caused by software faults, hardware faults, or security attacks. Fault injection simulates the effect of unexpected errors on a system and is useful in the evaluation and validation of dependable systems. As part of this thesis, we developed an efficient transient fault injection and detection system (ETFIDS) for evaluating the fault tolerance and fault response of software applications.
Besides being portable and high-performance, ETFIDS introduces a new fault outcome analysis technique that evaluates the faulty and the fault-free system behavior concurrently at runtime, for better analysis and efficiency.

Chapter 1

Introduction

Because of the rapid increase in the use of computer systems in safety-critical applications, system failures can no longer be tolerated. Failures pose tremendous risk to human life, commerce, utilities, transportation, and military operations. Critical applications require dependable systems: systems capable of detecting errors caused by software/hardware faults or by security attacks, diagnosing errors, correcting errors and failures, and maintaining normal or acceptable system operation. Designing reliable fault-tolerant systems relies on evaluating and validating the system's error detection, error location/isolation, and error recovery techniques. This evaluation and validation is critical because it provides confidence in the dependability of the system before deployment.

In computer systems, transient faults are one of the primary sources of abnormalities that alter system operation. From a hardware perspective, they are usually glitches in the circuits during operation. Modern systems are becoming more vulnerable to transient faults because the effects of crosstalk, ground bounce, timing faults, and soft errors are becoming more significant, owing to the following factors [7]:

1. As shown in figure 1.1, with increased density, layout dimensions shrink, and the electrical charge that constitutes stored data in memory cells and logic also decreases.
2. The increased gate count forces designers to lower supply voltages to keep power consumption in chips within manageable bounds. Reducing supply voltages decreases noise margins.
3. To increase density, interconnects on the chip become closer together and thicker (to maximize cross-section), increasing the coupling (crosstalk) between them.
4. Switching signals change at a faster rate, creating ground bounce.
5. For switching devices on a chip that depend on the supply voltage, the voltage level is already low. When switching currents drive that voltage even lower, transient and timing errors can occur.

[Figure 1.1: Stored charges versus technology process [7]. Vertical axis: Qcrit (fC); horizontal axis: process technology (microns).]

The practical result of these factors is a shrinking stored charge for representing data that is increasingly exposed to and more sensitive to outside disturbances. All of these influences are transient: they "only" affect or destroy the data, and leave the devices and semiconductor material intact.

In summary, as transistor sizes scale down, the noise margins of memory elements have decreased, and sensitivity to variations in circuit environment parameters such as voltage and temperature has increased [21]. All these factors have made circuits more vulnerable to transient faults, and most circuits nowadays have to be designed to tolerate such failures [14].

However, physical causes are just one source of transient faults; errors can also be introduced at the software level, for example through computer networks. The reliability of computer networks is a persistent issue in modern computer-based systems, and a complex one: the security of a computer network depends not only on the credibility of its switching nodes, communication medium, and network topology, but also on the various site configurations and the net-stream of the network [22]. An unstable network transaction or a security attack through the network could introduce transient faults into computer-based systems, harming the stability and reliability of the system.
For computer systems, transient faults pose a significant challenge to system dependability, and we need to evaluate their effect on reliability. Microprocessors nowadays contain millions of memory elements susceptible to transient faults. Software running on these processors executes over trillions of clock cycles, and a transient fault can occur during any execution cycle [4]. As a result, the total number of possible transient faults in different application contexts is enormous, which makes it very difficult to evaluate the outcomes of all possible faults, or to enumerate all faults in order to compute the exact probability of any particular outcome [4]. Moreover, observing transient faults in an actual system is not a feasible method for such an evaluation, because transient faults occur unpredictably and are rare relative to normal system runtime [4].

Fault injection tools offer a fast and reliable way to evaluate the behavior of a system under realistic program execution workloads in the presence of faults. They provide a way to measure the effectiveness of the system's error detection, error diagnosis, and correction/recovery techniques.

There are many ways to inject faults into a running system. In general, they fall into two categories: hardware fault injection and software fault injection. Software fault injection tools can be further classified as compile-time or runtime. The fault injection tool introduced in this thesis, ETFIDS, is a runtime software fault injection tool: it injects a fault by changing signal/variable values at runtime, and it provides the ability to observe the effect on the output or behavior of the system. ETFIDS makes several improvements over current fault injection and detection implementations:

1. A more accurate and consistent fault injection mechanism that addresses the high-perturbation issue of software-implemented fault injection tools.
2. Real-time evaluation of the faulty process for more accurate fault outcome analysis.
3. Concurrent, synchronized comparison of the faulty and fault-free processes, which improves performance and reduces fault injection experiment time.

The approaches used to achieve these improvements are explained in the rest of this thesis.
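
The general idea of runtime software fault injection described above (change a variable's value during execution, then compare the result against a fault-free run) can be illustrated with a small Python sketch. This is not the ETFIDS implementation: the toy workload `running_max`, the `Fault` record, the helper names, the 32-bit width, and the outcome label "output mismatch" are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

WIDTH = 32  # assumed register width for this sketch


@dataclass(frozen=True)
class Fault:
    """Hypothetical description of one transient fault (illustration only)."""
    step: int    # loop iteration at which the fault strikes
    target: str  # "best" or "i": which variable is corrupted
    bit: int     # bit position to invert


def flip_bit(value: int, bit: int, width: int = WIDTH) -> int:
    """Single bit-flip: invert one bit and keep the result within `width` bits."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def running_max(data: Sequence[int], fault: Optional[Fault] = None) -> int:
    """Toy target program: maximum of non-negative integers.

    If `fault` is given, one variable is corrupted exactly once, at the
    start of iteration `fault.step`.
    """
    best, i, step = 0, 0, 0
    while i < len(data):
        if fault is not None and step == fault.step:
            if fault.target == "best":
                best = flip_bit(best, fault.bit)
            else:
                i = flip_bit(i, fault.bit)
        if data[i] > best:  # raises IndexError if `i` was corrupted out of range
            best = data[i]
        i += 1
        step += 1
    return best


def run_experiment(data: Sequence[int], fault: Fault) -> str:
    """Compare a faulty run against the fault-free (golden) run."""
    golden = running_max(data)
    try:
        faulty = running_max(data, fault)
    except IndexError:
        return "crash"
    return "benign" if faulty == golden else "output mismatch"


if __name__ == "__main__":
    data = [3, 9, 4, 7]
    for fault in (Fault(1, "best", 0), Fault(2, "best", 3), Fault(2, "i", 4)):
        print(fault, "->", run_experiment(data, fault))
```

Even this toy shows why outcomes must be measured rather than guessed: the same bit-flip model yields a benign run, a wrong result, or a crash depending on when and where the fault strikes. The sketch runs the golden and faulty executions one after the other; it does not attempt the concurrent comparison that ETFIDS performs, and a real tool would also need a time-out for runs that never finish.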
