ETFIDS: EFFICIENT TRANSIENT FAULT

INJECTION AND DETECTION SYSTEM

by

NINGHAN TIAN

Submitted in partial fulfillment of the requirements

for the degree of Master of Science

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

January, 2019

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis of Ninghan Tian

candidate for the degree of Master of Science*.

Committee Chair Dr. Daniel G. Saab

Committee Member Dr. Christos Papachristou

Committee Member Dr. Francis Merat

Date of Defense

December, 12th, 2018

*We also certify that written approval has been obtained

for any proprietary material contained therein.

Contents

List of Tables

List of Figures

Abstract

1 Introduction

2 Fault Injection Techniques
  2.1 Hardware-implemented fault injection
    2.1.1 Injection with contact
    2.1.2 Injection without contact
    2.1.3 Hardware implemented fault injection tools
  2.2 Software-implemented fault injection (SWIFI)
    2.2.1 Compile-time injection
    2.2.2 Runtime injection
    2.2.3 SWIFI tools
  2.3 The significance of SWIFI

3 ETFIDS
  3.1 Motivation
  3.2 The overall approach
  3.3 The lower level operating principle of ETFIDS
  3.4 Fault injection control flow
  3.5 Fault outcome analysis

4 Fault Injection Experiments using ETFIDS
  4.1 Typical fault injection target and experiment configuration
  4.2 Fault injection experiment results
  4.3 Performance

5 Conclusion

List of Tables

2.1 Characteristics of fault injection methods
3.1 Size of Dump Files for Fault Outcome Analysis
4.1 Fault Injected Outcome
4.2 Average Time Overhead
4.3 Time Overhead Comparison
4.4 Detailed Time Overhead (in seconds) Comparison

List of Figures

1.1 Stored charges versus technology process [7]
2.1 General architecture of Messaline [2]
2.2 Ftape environment [29]
2.3 Xception structure [3]
2.4 The work process when using PROPANE [12]
2.5 Fiesta++ framework [4]
3.1 Illustration of ETFIDS Fault Injection and Analysis
3.2 ETFIDS Overall Framework
3.3 Bit-flip fault model [7]
3.4 Illustration of the Effects of Implemented Fault Models
3.5 Illustration of the Effects of a “Jump” Fault
3.6 ETFIDS Fault Injection Control Flow
3.7 ETFIDS Fault Outcome Analysis Control Flow
3.8 ETFIDS Fault Outcome Analysis Work Flow
3.9 Fault Outcome Analysis Work Flow by Dumping Data
4.1 Average Fault Outcome Probability
4.2 Number of Benign Fault Outcome Observed for Each Target Application
4.3 Number of Time-out Fault Outcome Observed for Each Target Application
4.4 Number of Detected Fault Outcome Observed for Each Target Application
4.5 Number of Crash Fault Outcome Observed for Each Target Application
4.6 Fault Injection Outcome Probability Regards to Each Application
4.7 Relative Time Consumed by One Fault Injection Experiment using Early Termination Feature (Data Obtained using matrix)

ETFIDS: Efficient Transient Fault Injection and Detection System

Abstract

by

NINGHAN TIAN

Computer use in high-dependability applications is rapidly increasing. However, even correctly designed computer systems can still suffer from temporary errors caused by various factors. To increase the reliability of modern computer systems, they should be able to detect, locate, isolate, and recover from errors caused by software faults, hardware faults, or security attacks. Fault injection simulates the effect of unexpected errors on a system and is useful in the evaluation and validation of dependable systems. As part of this thesis, we developed an efficient transient fault injection and detection system (ETFIDS) for evaluating the fault tolerance and fault response of software applications. Besides being portable and offering high performance, ETFIDS introduces a new fault outcome analysis technique that concurrently evaluates both the faulty and the fault-free system behavior at runtime for better analysis and efficiency.

Chapter 1

Introduction

Because of the rapid increase in the use of computer systems in safety-critical applications, system failures can no longer be tolerated. Failures pose tremendous risks to human life, commerce, utilities, transportation, and military operations. Critical applications therefore require dependable systems. These systems are capable of detecting errors caused by software/hardware faults or by security attacks, diagnosing and correcting those errors, and maintaining normal or acceptable system operation. Designing reliable fault-tolerant systems relies on evaluating and validating the system’s error detection, error location/isolation, and error recovery techniques. This evaluation and validation is critical because it provides confidence in the dependability of the system before deployment.

In computer systems, transient faults are one of the primary sources of abnormalities that alter system operation. From a hardware perspective, they are usually glitches in the circuits during operation. Modern systems are becoming more vulnerable to transient faults because the effects of crosstalk, ground bounce, timing faults, and soft errors are growing more significant due to the following factors [7]:

1. As shown in Figure 1.1, with increased density, layout dimensions shrink, and the electrical charge that constitutes stored data in memory cells and logic also decreases.

Figure 1.1: Stored charges versus technology process [7]. (Vertical axis: Qcrit in fC, with ticks at 1, 10, and 100; horizontal axis: process technology in microns, with ticks at 0.35, 0.25, 0.18, and 0.09.)

2. The increased gate count forces designers to lower supply voltages to keep power consumption in chips within manageable bounds. Reducing supply voltages decreases noise margins.

3. To increase density, interconnects on the chip become closer together and thicker (to maximize cross-section), increasing the coupling (crosstalk) between them.

4. Switching signals change at a faster rate, creating ground bounce.

5. Switching devices on a chip depend on the supply voltage, and that voltage level is already low. When switching currents drive the voltage even lower, transient and timing errors can occur.

The practical result of these factors is a shrinking stored charge for representing data that is increasingly exposed to and more sensitive to outside disturbances. All of these influences are transient — they “only” affect or destroy the data, and leave the devices and semiconductor material intact.


In summary, as transistor sizes scale down, the noise margins of memory elements have decreased, and the sensitivity to variations in circuit environment parameters such as voltage and temperature has increased [21]. These factors have made circuits more vulnerable to transient faults, and most circuits today must be designed to tolerate such faults [14].

However, physical causes are just one source of transient faults; errors can also be introduced at the software level, for example through computer networks. The reliability of computer networks is a persistent concern in modern computer-based systems. It is also a complex one, because the security of a computer network depends not only on the credibility of its switching nodes, communication medium, and network topology but also on the various site configurations and the net-stream of the network [22]. An unstable network transaction or a security attack through the network can introduce transient faults into a computer-based system and harm its stability and reliability.

For computer systems, transient faults pose a significant challenge to dependability, and their effect on reliability must be evaluated. Microprocessors today contain millions of memory elements susceptible to transient faults. Software running on these processors executes for trillions of clock cycles, and a transient fault can occur during any of them [4]. As a result, the total number of possible transient faults across application contexts is enormous, which makes it very difficult to evaluate the outcomes of all possible faults or to enumerate all faults in order to compute the exact probability of any particular outcome [4]. Moreover, observing transient faults in an actual system is not a feasible evaluation method, because their occurrence is unpredictable and rare relative to normal system runtime [4].
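To make the scale concrete, the following back-of-the-envelope sketch counts the single-bit transient faults of a hypothetical processor and shows how a random sample of injections could estimate an outcome probability instead of enumerating every fault. The element count, the cycle count, and the normal-approximation sample-size formula are illustrative assumptions and are not figures or methods taken from this thesis.

```python
import math


def fault_space_size(memory_elements: int, clock_cycles: int) -> int:
    """Count single-bit transient faults: one per (memory element, cycle) pair."""
    return memory_elements * clock_cycles


def sample_size(margin: float, z: float = 1.96, p: float = 0.5) -> int:
    """Injections needed to estimate an outcome probability within +/- margin.

    Uses the normal approximation n = z^2 * p * (1 - p) / margin^2 with the
    worst-case p = 0.5 and z = 1.96 (about 95% confidence). Both defaults are
    assumptions chosen for illustration.
    """
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))


if __name__ == "__main__":
    # Hypothetical target: 10 million memory elements, 1 trillion cycles.
    total = fault_space_size(10_000_000, 10**12)
    print(f"possible single-bit faults: {total:.1e}")
    print(f"injections for a +/-5% margin: {sample_size(0.05)}")
```

Under these assumptions the fault space is far too large to enumerate, while a sample of a few hundred injections already bounds the estimation error, which is why fault injection campaigns are typically statistical.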
Fault injection tools offer a fast and reliable way to evaluate the behavior of a system under realistic program execution workloads in the presence of faults. They provide a way to measure the effectiveness of the system’s error detection, error diagnosis, and correction/recovery techniques.

There are many ways to inject faults into a running system, but in general they fall into two categories: hardware fault injection and software fault injection. Software fault injection tools can be further classified as compile-time or runtime tools. The fault injection tool introduced in this thesis, ETFIDS, is a runtime software fault injection tool. It injects a fault by changing signal/variable values at runtime, and it also provides the ability to observe the effect on the output or behavior of the system. ETFIDS makes several improvements over current fault injection and detection implementations:

1. A more accurate and consistent fault injection mechanism that addresses the high-perturbation issue of software-implemented fault injection tools.

2. Real-time faulty process evaluation for more accurate fault outcome analysis.

3. Concurrent, synchronized comparison of the faulty and fault-free processes, which improves performance and reduces fault injection experiment time.
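The idea behind the third improvement can be illustrated with a toy model. This is a minimal sketch, assuming the faulty and fault-free runs can be advanced in lockstep and expose comparable state at every observation point. The generator-based `workload`, the `run_lockstep` helper, and the injection parameters are hypothetical names chosen for illustration; they do not describe the actual ETFIDS implementation.

```python
from itertools import zip_longest
from typing import Iterator, Optional


def workload(n: int, flip_at: Optional[int] = None, bit: int = 0) -> Iterator[int]:
    """Toy target: running sum of 0..n-1, yielding its state after every step.

    If flip_at is given, one bit of the accumulator is flipped at that step,
    which models a single transient fault.
    """
    acc = 0
    for step in range(n):
        acc += step
        if step == flip_at:
            acc ^= 1 << bit  # inject the bit-flip
        yield acc


def run_lockstep(golden: Iterator[int], faulty: Iterator[int]) -> Optional[int]:
    """Advance both runs together and return the first step where they differ.

    Returns None when the runs agree at every observation point. Stopping at
    the first mismatch avoids running the faulty process to completion.
    """
    missing = object()  # marks a run that ended before the other one
    pairs = zip_longest(golden, faulty, fillvalue=missing)
    for step, (good_state, faulty_state) in enumerate(pairs):
        if good_state != faulty_state:
            return step
    return None
```

For example, `run_lockstep(workload(1000), workload(1000, flip_at=10, bit=3))` stops as soon as the states diverge, without executing the remaining steps of either run.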

The approaches used to achieve these improvements are explained in the rest of this thesis, which is structured as follows. Chapter 2 reviews previous work on fault injection methodologies and describes the different fault injection techniques along with their advantages and disadvantages. Chapter 3 describes the overall approach of ETFIDS. Chapter 4 presents the results of fault injection experiments conducted using ETFIDS. Chapter 5 concludes the thesis and discusses how future research could carry this work forward.

Chapter 2

Fault Injection Techniques

Fault injection tools can be categorized by the level of abstraction at which they operate. In general, there are two types of fault injectors: hardware-based and software-based. The following sections discuss the available fault injection tools and describe and compare the different techniques.

2.1 Hardware-implemented fault injection

Hardware-implemented fault injection uses additional hardware to introduce faults into the target system’s hardware. Depending on the faults and their locations, hardware-implemented fault injection methods fall into two categories [13]:

1. Hardware fault injection with contact. The injector has direct physical contact with the target system, generating voltage or current changes externally to the target chip.

2. Hardware fault injection without contact. The injector has no direct physical contact with the target system. Instead, an external source produces some physical phenomenon causing the variations of the voltage inside the target chip.


2.1.1 Injection with contact

Pin-level injection is a conventional method of hardware-implemented fault injection. There are two main techniques for altering the electrical currents and voltages at the pins [13]:

1. Active probes. This technique injects current via the probes attached to the pins, changing their electrical currents. The probe method is usually limited to persistent faults such as stuck-at faults and bridging faults. Moreover, since a dangerously high amount of current can damage the target hardware, the user must take care when using active probes to force additional current into the target device.

2. Socket insertion. This technique inserts a socket between the target hardware and its circuit board. The inserted socket injects stuck-at, open, or more complex logic faults into the target hardware by forcing the signals that represent desired logic values onto the pins of the target hardware.

Both of these methods provide good controllability of fault times and locations with little perturbation to the target system. But because faults are modeled at the pin level, they are not identical to traditional stuck-at and bridging fault models that generally occur inside the chip [13].
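The effect of forcing logic values onto pins can be sketched in software by treating the pins as the bits of an integer. This is an illustrative model only: the function names are hypothetical, and the wired-AND behavior used for the bridging fault is one common modeling assumption rather than something specified in [13].

```python
def force_pin(bus_value: int, pin: int, level: int) -> int:
    """Return bus_value with one pin forced to level (stuck-at-0 or stuck-at-1)."""
    if level not in (0, 1):
        raise ValueError("level must be 0 or 1")
    mask = 1 << pin
    return bus_value | mask if level else bus_value & ~mask


def bridge_pins(bus_value: int, pin_a: int, pin_b: int) -> int:
    """Model a bridging fault as a wired-AND: both pins take the AND of their values."""
    a = (bus_value >> pin_a) & 1
    b = (bus_value >> pin_b) & 1
    shorted = a & b
    value = force_pin(bus_value, pin_a, shorted)
    return force_pin(value, pin_b, shorted)
```

Because the forced value overrides whatever the circuit drives, a stuck-at fault only changes behavior when the fault-free pin value differs from the forced level.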

2.1.2 Injection without contact

There are ways to inject faults physically without contacting the circuit. One of them is to inject faults using heavy-ion radiation [13]. An ion passes through the depletion region of the target device and generates current [13]. Placing the target hardware in or near an electromagnetic field also injects faults. Engineers like these methods because they mimic natural physical phenomena [13]. However, it is difficult to precisely control the time and location of a fault injection using this technique because you cannot completely control the exact moment of heavy-ion emission or electromagnetic field creation [13].

Figure 2.1: General architecture of Messaline [2]. (Diagram labels: operator, input files, output files, management of the test sequence, control of the experiment, environment simulation, activation, injection, data collection, target system.)

2.1.3 Hardware implemented fault injection tools

The hardware fault injector Messaline [2] uses both active probes and sockets to conduct pin-level fault injection. It can inject stuck-at, open, bridging, and complex logical faults, and it can control both the duration and the frequency of a fault. Signals collected from the target system can provide feedback to the injector [13]. In addition, a device associated with each injection point senses whether the fault is activated and produces an error. Messaline can inject faults at up to 32 injection points simultaneously. However, because of its operating principle, the faults it can inject are limited to pin-level faults. Figure 2.1 shows the general architecture of Messaline.


FIST [9] employs both contact and contactless methods to inject transient faults inside the target system. It uses heavy-ion radiation to create transient faults at random locations inside a chip that is exposed to the radiation. The radiation source is mounted inside a vacuum chamber together with a small two-processor computer system. The computer is positioned so that one of the processors is directly exposed to the radiation. The other processor serves as a reference for detecting whether the radiation causes any bit-flips. FIST can inject faults directly inside a chip, which pin-level injection cannot do. It produces transient faults at random locations spread evenly across the chip, which leads to considerable variation in the errors seen on the output pins. In addition to radiation, FIST allows the injection of power disturbance faults. However, FIST cannot inject certain transient faults, and injecting a fault at a specific location is difficult.

MARS [19] uses heavy-ion radiation and electromagnetic fields to conduct contactless fault injection. It offers several ways to inject faults into the target system: placing a circuit board between two charged plates, placing a chip near a charged probe, and placing dangling wires on individual chip pins to accentuate the electromagnetic field effect on those pins. However, it shares the limitations of FIST.

The work in [1] proposed an emulation-based technique that reconfigures an FPGA circuit model to enable fault injection. The reconfiguration happens at runtime, which eliminates the need for a complicated control circuit to determine the fault injection location. However, system model synthesis and reconfiguration introduce significant preprocessing overhead.

In [6], Civera et al. proposed a reconfigurable injection technique based on a modified flip-flop design in which each flip-flop corresponds to a fault injection location.
The flip-flops are configured into a scan chain, which is programmed to inject faults into any desired flip-flop in the circuit. This approach provides more flexible control over runtime fault injection. However, it still suffers from the preprocessing overhead inherent to emulation-based techniques.

The approach in [8] uses scan chains to change the state of the flip-flops in the circuit at runtime. It proposes an enhanced On-Chip Debug (OCD) infrastructure for supporting the verification of fault-tolerance mechanisms through fault injection campaigns. This upgraded On-Chip Debug and Fault Injection (OCD-FI) infrastructure provides an efficient fault injection mechanism with improved capabilities and dynamic behavior. However, it requires an adequate amount of memory for data storage, and both the OCD and the target CPU must be available as HDL models.

The work in [24] presents a method for injecting faults into any processor that supports an OCD with a JTAG (Joint Test Action Group) interface. It uses a fault injection environment implemented in hardware (in an FPGA) that controls the debugging infrastructure through JTAG instructions, which accelerates the evaluation process. The technique requires no modification of the target system and can be applied to any processor that includes a debugging infrastructure with a JTAG interface. Furthermore, the evaluation can be very fast because hardware manages the whole process. However, as with other FPGA-based techniques, preprocessing overhead must be taken into account, and the reliance on JTAG means the method can only be used with ICs that have a JTAG interface.
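The scan-chain mechanism used in [6] and [8] can be pictured with a small software model. This is a hedged sketch, assuming a single chain in which scan-out can be fed back to scan-in. The `ScanChain` class and its methods are hypothetical and do not reproduce the circuits proposed in those works.

```python
from typing import List


class ScanChain:
    """Toy scan chain: flip-flops connected as one shift register."""

    def __init__(self, state: List[int]) -> None:
        self.state = list(state)

    def shift(self, scan_in: int) -> int:
        """Shift the chain by one position and return the scan-out bit."""
        scan_out = self.state[-1]
        self.state = [scan_in] + self.state[:-1]
        return scan_out

    def inject_bit_flip(self, target: int) -> None:
        """Flip one flip-flop by circulating the whole chain exactly once.

        Every bit is shifted out and fed back in unchanged, except the bit
        that originated in flip-flop `target`, which is inverted on the way.
        """
        n = len(self.state)
        if not 0 <= target < n:
            raise IndexError("no such flip-flop")
        for i in range(n):
            leaving = self.state[-1]   # bit about to leave the chain
            origin = n - 1 - i         # flip-flop that bit started in
            self.shift(leaving ^ 1 if origin == target else leaving)
```

After the full rotation every other flip-flop holds its original value, which is why a scan chain can target any location without a dedicated control circuit per flip-flop.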

2.2 Software-implemented fault injection (SWIFI)

Traditional techniques such as hardware fault injection, although appropriate for simpler and older processors, are now difficult to apply because of the difficulty of controlling and observing fault effects inside the chips. Other techniques, such as simulation, are also hard to implement because simulation models of these processors are very complex and are often considered critical and confidential information by the manufacturers, which makes them very difficult to obtain [3].

Because of the flexibility of software, researchers have taken more interest in developing software-based fault injection tools [13]. These tools can inject faults into target applications and into operating systems, which is difficult and expensive to do in a hardware fault injection environment [13]. Software-implemented fault injection tools, often referred to as “SWIFI,” need no hardware changes or additional hardware to run experiments. They provide a way to test complete systems, with real hardware and software, for fault tolerance and the effects of faults [4]. SWIFI approaches are fast and more reproducible than hardware approaches, and they can be much more portable, depending on their implementation. Although the software approach is flexible, it has some shortcomings [13]:

1. It cannot inject faults into locations that are inaccessible to software.

2. The software instrumentation may disturb the workload running on the target system and even change the structure of the original software.

3. The poor time-resolution of the approach may cause fidelity problems. For long latency faults, such as memory faults, the low time-resolution may not be a problem. For short latency faults, the approach may fail to capture specific error behavior.

We can categorize software injection methods by when the faults are injected: during compile-time or runtime [13].


2.2.1 Compile-time injection

To inject faults at compile-time, the fault injector must modify the source file of the target application before the program image is loaded and executed. Rather than injecting faults into the hardware of the target system, this method injects errors into the source code or assembly code of the target program to emulate the effect of hardware, software, and transient faults [13]. The modified code alters the target program instructions and produces an erroneous software image; when the system executes this faulty image, the fault is activated. This method requires modifying the program that will evaluate the fault effect, but it requires no additional software during runtime [13]. In addition, it causes no perturbation to the target system during execution [13]. Because the fault effect is hard-coded, engineers can use it to emulate permanent faults. The method is straightforward to implement, but it does not allow faults to be injected while the workload program runs.
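As a rough illustration of compile-time injection, the sketch below mutates a program before its image is built, so the fault is hard-coded and no injector runs afterwards. It uses Python source and the standard `ast` module as a stand-in for the C or assembly targets discussed in [13]. The `SwapAddForSub` mutation and the `TARGET_SOURCE` program are hypothetical examples.

```python
import ast


class SwapAddForSub(ast.NodeTransformer):
    """Mutate one `+` into `-`, emulating a hard-coded fault."""

    def __init__(self) -> None:
        self.done = False

    def visit_BinOp(self, node: ast.BinOp) -> ast.AST:
        self.generic_visit(node)  # visit nested expressions first
        if not self.done and isinstance(node.op, ast.Add):
            node.op = ast.Sub()
            self.done = True
        return node


def build_faulty_image(source: str):
    """Parse the source, apply the mutation, and compile the faulty image."""
    tree = SwapAddForSub().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return compile(tree, filename="<faulty>", mode="exec")


# Hypothetical target program, given as source text.
TARGET_SOURCE = "def total(a, b):\n    return a + b\n"

namespace = {}
exec(build_faulty_image(TARGET_SOURCE), namespace)
faulty_total = namespace["total"]  # every call now computes a - b
```

The fault is present on every execution of the mutated image, which matches the observation above that this style of injection suits permanent faults rather than transient ones.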

2.2.2 Runtime injection

Faults can also be injected during runtime. Runtime fault injection is usually performed by changing memory content while the target application is running. A mechanism is needed to trigger the injection. Commonly used triggering mechanisms include [13]:

1. Time-out. Time can be used as the trigger for runtime fault injection: a timer expires at a predetermined time and triggers the injection. Specifically, the time-out event generates an interrupt that triggers fault injection. The timer can be a hardware or software timer. This method requires no modification of the application or workload program. A hardware timer must be linked to the system’s interrupt handler vector. Because this method injects faults on the basis of time rather than specific events or system state, it produces unpredictable fault effects and program behavior. However, it is suitable for emulating transient faults and intermittent hardware faults [13].

2. Exception/trap. In this case, a hardware exception or a software trap transfers control to the fault injector. Unlike time-out, exception/trap can inject faults whenever certain events or conditions occur [13]. When the trap executes, an interrupt is generated that transfers control to an interrupt handler. A hardware exception invokes injection when a hardware-observed event occurs. Both mechanisms must be linked to the interrupt handler vector.

3. Code insertion. In this technique, instructions are added to the target program that allow fault injection to occur before a particular instruction, much like the code-modification method. Unlike code modification, code insertion performs fault injection during runtime and adds instructions rather than changing the original instructions [13]. Unlike the trap method, the fault injector may exist as part of the target program and run in user mode rather than system mode [13].
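The difference between a time-based trigger and an event-based trigger can be sketched with a toy machine. This is an illustrative model under simplifying assumptions: the "program" is a list of operands added to an accumulator, the loop counter stands in for a timer, and the list index stands in for an instruction address. All names are hypothetical, and no real interrupt or trap mechanism is involved.

```python
from typing import Callable, List

# A trigger decides, before each instruction, whether to inject now.
Trigger = Callable[[int, int], bool]  # (elapsed_steps, program_counter)


def timeout_trigger(deadline: int) -> Trigger:
    """Fire after a fixed number of steps, whatever code is running."""
    return lambda elapsed, pc: elapsed == deadline


def trap_trigger(address: int) -> Trigger:
    """Fire when execution reaches a chosen instruction address."""
    return lambda elapsed, pc: pc == address


def run(program: List[int], passes: int, trigger: Trigger, bit: int) -> int:
    """Execute the program `passes` times, flipping one accumulator bit once.

    Instruction i adds program[i] to the accumulator. The injector is
    consulted before every instruction and acts the first time the trigger
    fires, which models a single transient bit-flip.
    """
    acc = 0
    elapsed = 0
    injected = False
    for _ in range(passes):
        for pc, operand in enumerate(program):
            if not injected and trigger(elapsed, pc):
                acc ^= 1 << bit
                injected = True
            acc += operand
            elapsed += 1
    return acc
```

With a looping workload, `trap_trigger(2)` always fires at the same instruction, whereas `timeout_trigger(4)` fires wherever execution happens to be after four steps, which mirrors the less predictable behavior of time-based injection described above.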

2.2.3 SWIFI tools

There are many software-based fault injection tools currently available.

FERRARI [16] uses software traps to inject CPU, memory, and bus faults. It uses a parallel daemon process running on the host processor to control the application into which the faults are injected. It can inject faults in the address, data, or control lines of the processor. FERRARI uses the UNIX ptrace function to corrupt the process memory image at runtime and to insert software trap instructions at the specific instruction addresses where faults should be activated. This tool allows the injection of transient faults and provided valuable results from experiments conducted on a Sparc workstation [3]. The approach is to modify the control structure of a process, which in turn alters the execution state of the target program. However, this approach incurs a time overhead associated with the injection process.

Figure 2.2: Ftape environment [29]. (Diagram labels: workload generator, workload specification, workload activity, fault injection, fault specifications, measure, CPU, memory, I/O.)

Ftape [29] injects bit-flip faults into user-accessible registers in CPU modules, memory locations, and the disk subsystem. Disk system faults are injected by executing routines added to the operating system, and a synthetic workload generator is used. Ftape was used to inject faults in three prototypes of a commercial fault-tolerant computer [3]. It is part of a benchmark for characterizing the fault tolerance of a system, and it includes a synthetic program that generates CPU, memory, and I/O activity. Ftape injects faults in the CPU, memory, and I/O, and it can select the time and location of a fault randomly or based on workload activity measurements. The latter technique is known as “stress-based injection” and ensures that faults are injected into components undergoing high activity [3]. Although attention has been paid to portability, the benchmark has only been ported to two machines of the same architectural family, so portability to widely different hardware and software platforms still needs to be addressed. In addition, the outcome of each benchmark run is not repeatable, although the average of a series of runs is.

FIAT [25] proposed an approach for injecting single bit-flips into the instruction memory of an application. It enabled the corruption of a task’s memory image. The user selected the fault location at the application level, and the physical position within the memory image was obtained from compiler and loader information. Although this provided valuable results, FIAT was not able to inject transient faults.

DOCTOR [10] is a software-implemented fault injection tool for injecting faults into distributed computing systems. It allows the injection of CPU faults, memory faults, and network communication faults, and it uses three triggering methods to start fault injection: time-out, trap, and code modification. One distinct feature of DOCTOR’s organization is the separation of the host computer’s components from those of the target system. This reduces the runtime interference that fault injection causes in the target system, because each component runs separately and only essential elements are executed on the target system. It also increases the portability of DOCTOR, since the highly system-dependent part is isolated from the rest. DOCTOR can perform parallel fault injection experiments by executing two identical workloads simultaneously on two distinct APs of a HARTS node [26]. However, this feature is not available on other architectures.

The fault injector DEFINE [18] uses built-in hardware exception triggers to perform fault injections. Before DEFINE was proposed, another tool named FINE [17] had been introduced to inject faults and monitor their effect by using a software monitor to trace the control flow. However, FINE needs the source code of the target application and causes a significant time overhead.
In essence, DEFINE is an evolution of FINE that adds distributed capabilities. It modifies the programs’ executable image to emulate memory faults. DEFINE also enhances FINE by introducing a modified hardware clock interrupt handler to inject CPU and bus faults with time triggers. Finally, in addition to hardware faults, DEFINE can inject some kinds of software faults, i.e., software design/implementation faults.

Xception [3] takes a similar approach to DEFINE. In particular, it uses built-in hardware exception triggers to perform fault injections, which requires modification of the interrupt hardware vector. Xception can inject faults with minimum interference with the target application by directly programming the debugging hardware inside the target processor. The available debugging exception mechanisms allow the definition of many fault triggers (events that cause the injection of the fault), including triggers related to the manipulation of data. In addition, by using the performance monitoring hardware inside the processor, Xception can record detailed information on the target processor’s behavior after the injection of a fault. However, like other traditional fault injection tools, it dumps data for fault outcome analysis, which incurs a performance overhead. The structure of Xception is illustrated in Figure 2.3.

Figure 2.3: Xception structure [3]. (Diagram labels: target system, host computer, target application, fault archive, experiment manager module, fault setup (lib), Xception system call, fault injection exception handlers, user space, kernel, output file, log file, results file.)

PROPANE [12] is a runtime fault injection tool that supports the injection of both software faults (by mutating the source code) and data errors (by manipulating variable and memory contents). It supports various error types out of the box and also supports user-defined error types. PROPANE supports observation down to the variable level, i.e., individual variables may be logged during injection experiments. This mechanism enables a detailed examination of error propagation in software and is a valuable help in finding vulnerable software modules or variables. Figure 2.4 shows the basic work process of PROPANE.

Figure 2.4: The work process when using PROPANE [12]. (Diagram labels: setup, injection, analysis, original target software, instrumented target software, description files, log files, fault and error data, readout files, usage profile data, results.)

The PIN-based approach in [15] inserts code dynamically at runtime, allowing a fault to be injected at a specific code location, which makes it possible to recreate faults that result from common coding errors. The authors propose a new fault injection design pattern based on the PIN framework provided by Intel and develop a dynamic software fault injection tool. However, the tool cannot inject faults according to a specific distribution [15], and the set of architectures on which PIN can operate is limited.

LLFI [23] is a compile-time fault injection technique that uses the LLVM compiler infrastructure. LLFI inserts faults by injecting code at the level of the compiler’s intermediate representation. Although the LLVM intermediate code is close to assembly code, it does not correspond one-to-one with assembly language. [23] also uses PIN to build a fault injector at the assembly code level.

Relyzer [11] is another solution that uses compiler-based techniques to determine the equivalence of different fault sites in an application. It analyzes all application fault sites and picks a small subset on which to perform selective fault injections for transient faults. Relyzer employs a set of fault pruning techniques that dramatically reduce the number of faults (application sites) requiring accurate fault simulation, to the point where studying virtually all of them becomes practically viable. It does so in a way that allows it to identify the application sites vulnerable to SDCs (Silent Data Corruptions) across the entire application.

In [20, 4], the fault injectors Fiesta and Fiesta++ are presented. Figure 2.5 shows the framework of Fiesta++. They are based on C/C++ and GDB.
When the user knows the system, the injector operates in “white box mode”; otherwise, it operates in “black box mode.” In “white box mode,” users can specify faults precisely in terms of the target variables’ names and locations and the time at which the faults should be injected. In “black box mode,” the injector lets users choose from a list of memory locations for injecting faults. However, neither of these fault injection tools can track and observe the propagation of faults or set observation points for precise fault outcome evaluation.

Figure 2.5: Fiesta++ framework [4].

The SWIFI techniques discussed above can inject and detect faults at various levels in the target system. However, most of them [18, 10, 16, 25, 17, 15, 12] cannot inject faults without interference from timing and other factors. This causes a problem that most SWIFI techniques share: high perturbation. Moreover, the fault propagation detection mechanisms in SWIFI tools [18, 16, 25, 4, 17, 23, 15, 12, 3] need to be improved, as they use the traditional fault outcome analysis paradigm, which analyzes fault outcomes only after the fault injection experiment, using dumped information from the faulty program. Such a paradigm affects the accuracy of these tools and usually incurs a high performance overhead. ETFIDS, in contrast, addresses these two problems by introducing new techniques for fault injection and fault outcome analysis. It also improves on the portability of SWIFI tools [18, 10, 16, 25, 17, 15, 12, 3] through a few new design choices. ETFIDS also introduces a new mechanism that can further shorten the time required for a fault injection experiment campaign.


Table 2.1: Characteristics of fault injection methods.

                                 Hardware                        Software
                                 With contact  Without contact   Compilation                 Runtime
Cost                             High          High              Low                         Low
Perturbation                     None          None              Low                         High
Monitoring time-resolution       High          High              High                        Low
Accessibility of fault points    Chip pin      Chip internal     Register, memory, software  Register, memory, I/O controller/port
Controllability                  High          Low               High                        High
Trigger                          Yes           No                Yes                         Yes
Repeatability                    High          Low               High                        High

2.3 The significance of SWIFI

Table 2.1 shows a comparison between fault injection techniques. As Table 2.1 shows, SWIFI techniques do have certain limitations. They restrict fault injection to a higher, software-visible abstraction level, such as memory contents and registers [4]. A single bit-flip at the software abstraction level may not accurately represent a single bit-flip at the device level [5]. Nevertheless, SWIFI techniques have recently become more popular for the following reasons.

Software-implemented fault injection approaches possess many advantageous characteristics that hardware-implemented fault injection lacks. Software-implemented fault injection tools are low-cost, fast, portable, scalable, and easy to implement. Nowadays, many applications run on large, complex superscalar processors for which hardware-implemented fault injection is impractical. Most hardware-based fault injection techniques have been applied to small processors with a few hundred or a few thousand flip-flops in their design. Developing hardware-implemented fault injection would be difficult for the large and complex processors used in most modern applications.

Although software-based fault injection tools lack the accuracy of hardware-implemented ones, they are indispensable for estimating the fault outcome and propagation for large supercomputing applications because they allow much more sophisticated fault outcome analysis. High-performance computing (HPC) applications, which run on hundreds of processors for extended periods of time, face a significant risk of instability due to transient faults. Hardware-based fault injection cannot be used to evaluate such complex applications. It is also hard for hardware-implemented approaches to analyze and track the injected faults in fault injection experiments.

Finally, most hardware-implemented fault injection methods cannot recreate precise fault injection scenarios and experiment setups, and they lack the flexibility of software approaches. As software and hardware become more tightly integrated than ever, there is a high possibility that an observed abnormality is caused by the combination of an inappropriate algorithm and a weak hardware design. Hardware-implemented fault injection cannot perform fault injection experiments at this level, whereas software fault injection can be widely used for such tasks.

In conclusion, SWIFI approaches have a promising future; however, the issues of high perturbation and low monitoring resolution need to be addressed for more accurate fault injection and analysis.

Chapter 3

ETFIDS

3.1 Motivation

Due to the flexibility and low cost of software fault injection approaches, developing software-implemented fault injection tools is rapidly gaining popularity among researchers. Since hardware functionality is visible mainly through software, faults at various levels of the system can be emulated [16]. Hence, software fault injection is less expensive in terms of time and effort than hardware-implemented fault injection techniques. However, most software-based fault injection tools provide only an interface for injecting faults at random memory locations, and usually at random times as well. Such an operating principle limits the coverage of the fault model and the flexibility of specifying fault location and timing. Furthermore, it makes it impossible to recreate the scenario in which a particular type of transient fault is introduced into a specific state of the system, even though reproducibility is one of the most important characteristics that a useful debugging setup must possess. Moreover, most software-implemented fault injection tools cannot compare the faulty and fault-free runs in real time, which increases the difficulty of early detection and analysis.


To address the issues mentioned above, ETFIDS was developed. ETFIDS offers a flexible fault model and fault specification. Unlike current fault injection techniques, ETFIDS concurrently evaluates both the faulty and the fault-free system behavior. This concurrency enables efficient evaluation of latency and error propagation, and the ability to detect errors at runtime gives ETFIDS early detection and early termination of the target application. Thanks to the capabilities of GDB [28], the proposed fault injector can perform dependability analysis for multi-threaded applications. As long as the hardware resources to which ETFIDS has access permit, fault injection experiments can be performed without software simulation, which decreases the time overhead of fault injection and outcome analysis.

3.2 The overall approach

ETFIDS can inject faults into the target application and analyze the fault outcome at runtime. Figure 3.1 illustrates how ETFIDS performs a typical fault injection experiment. As Figure 3.1 shows, ETFIDS injects faults during the runtime of the target application based on the user specification. After the fault has been injected, ETFIDS monitors the faulty application and detects error propagation at runtime. Once the error has been detected, ETFIDS saves the fault latency and other useful information and then terminates the target application, since further fault outcome analysis no longer interests the user; this reduces the total time for fault injection experiments.

Figure 3.1: Illustration of ETFIDS Fault Injection and Analysis.

Figure 3.2 shows the overall structure of the fault injection system. It consists of the target application and ETFIDS, with GDB, an open source debugger that uses the ptrace system call to attach to and debug a running process, working as the agent that communicates between the two. Based on the implementation of the target application, the user can obtain a list of fault targets and observation points for generating the input specification file for ETFIDS. The specification file is read by the fault injector, as shown in Figure 3.2. The fault injector invokes GDB after the user starts the fault injection experiment. GDB allows the fault injector to set breakpoints and watchpoints [27] and to execute commands that alter memory contents. Using these facilities, ETFIDS can insert a fault into the target application and monitor the errors produced by the injected fault during runtime. ETFIDS is also capable of concurrently comparing the faulty and fault-free runs for fault outcome analysis. By running the faulty and fault-free processes side by side, ETFIDS can detect propagated errors as early as possible, which allows it to stop the target application as soon as the fault is detected and thus saves fault injection experiment time.

Figure 3.2: ETFIDS Overall Framework.

Figure 3.2 also shows other important details about how ETFIDS works. The user controls how ETFIDS performs fault injection experiments. The fault injector sets the breakpoints and watchpoints based on the fault list provided by the user. Each fault in the fault list contains a program location specification and a description of the execution event that initiates fault injection at the specified location. More precisely, in ETFIDS, a fault is specified in terms of the target variable’s name, the location where the fault should be injected, and the execution event that triggers the fault injection. The execution event precisely defines the fault injection timing. It is specified in terms of the number of times a particular region of code has been executed and the number of times the target variable has been accessed. In this way, faults are injected precisely without being influenced by differences in the test environment setup. ETFIDS assumes all variables in the target system are visible to the user.
Since variables can also be queried using GDB, users with limited knowledge of the target system under investigation can also use ETFIDS for dependability evaluation. It is also assumed that both fault targets and observation points are valid variables with allocated memory that are visible in GDB, which usually means the target application needs to be compiled with the “debug” flag enabled so that the symbol table, which ETFIDS relies on for variable and function name recognition, is generated.

As shown in Figure 3.2, ETFIDS also has a fault outcome analysis capability. It can analyze the propagation of injected faults by using two inferiors within GDB. GDB represents the state of each program execution with an object called an inferior. An inferior typically corresponds to a process, but is more general and also applies to targets that do not have processes [27]. ETFIDS keeps the two inferiors synchronized by setting an additional watchpoint in both inferiors. Faults can be detected by comparing the observation points between the faulty and the fault-free (Golden Model) inferior. To save time and system resources, ETFIDS clones the target application process only just before the fault injection takes place.

ETFIDS uses hardware breakpoints [27] available in GDB to monitor target variables efficiently. Hardware breakpoints allow ETFIDS to set a breakpoint at an instruction without changing the instruction [27]. This utilizes the capability of setting breakpoints at the hardware level. These variable monitoring capabilities are available in most modern microprocessors. Using the hardware-level variable-monitoring functionality reduces CPU time overhead, which makes the fault injection software run more efficiently. A detailed introduction to hardware breakpoints is given in Section 3.3.

To simulate the realistic scenario in which transient faults exist in a running system, we need to understand the impact of a transient fault at the circuit level. [5] shows that most of the time a transient fault results in the bit value stored in a circuit memory element being flipped. For this reason, the default fault model used in ETFIDS is the bit-flip.
Figure 3.3 explains how a bit-flip fault can be triggered while the target application is running. As Figure 3.3 shows, bit-flips can be caused by environmental changes such as a particle strike. When a high-energy particle strikes a gate in the circuit, it creates a pulse at that node; this event is generally referred to as a “single-event transient (SET) fault” [7]. At the moment the particle strikes and creates the pulse, no fault has yet been generated in the target application. However, as the pulse propagates through the circuit and finally reaches a memory element, a bit-flip fault is created, which is typically represented by a bit in a memory element being flipped. If the transient fault propagation reaches a counter instead, the result is a fault that increments or decrements the value stored in a memory element of the target application.

Figure 3.3: Bit-flip fault model [7].

In the current implementation, ETFIDS supports six types of transient fault models. They are listed as follows:

1. Single bit-flip

2. Multiple bit-flips


3. Increment

4. Decrement

5. Force value

6. Jump

The list also contains “force value” faults and “jump” faults. A memory element can be forced to a specific value by a transient fault. For example, if the circuit layout is so compact that the coupling (crosstalk) effect on nearby gates is no longer negligible, a memory element can be forced to a specific value as the error propagates, thus causing a “force value” fault. A “jump” fault, in contrast, is usually caused by an attack through the computer network: when the attacker wants the target application to execute, or branch into, binary code or instructions created by the attacker, a “jump” fault can be triggered.

We implemented the fault models described above by directly modifying the target variable. Specifically, “single bit-flip” and “multiple bit-flips” are implemented by flipping one bit or multiple bits in the target variable. “Increment” and “decrement” are implemented by incrementing or decrementing the target variable by 1. “Force value” is implemented by changing the target variable to a user-specified value. Figure 3.4 illustrates the effects of the implemented fault models, assuming a bit width of 32 bits.

“Jump” faults differ from the other implemented fault models in that they change the control sequence of the target application. To emulate a “jump” fault, we change the program counter $pc when the program reaches the user-specified location and timing. The change of $pc makes the target application jump to an illegal branch and execute it, thus emulating a “jump” fault. Figure 3.5 illustrates the effects of a “jump” fault.
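The data fault models described above can be sketched in a few lines of Python. The sketch is illustrative only and is not the ETFIDS implementation: the function names, the fixed 32-bit width, and the wrap-around behavior of increment and decrement at the 32-bit boundary are assumptions made for this example, and the functions operate on plain integers rather than on variables inside a GDB inferior.

```python
WIDTH = 32                    # assumed bit width of the target variable
MASK = (1 << WIDTH) - 1       # keeps every result inside 32 bits


def bit_flip(value, positions):
    """Flip each bit listed in `positions` (0 = least significant bit).

    One position models a single bit-flip; several model multiple bit-flips.
    """
    for pos in positions:
        if not 0 <= pos < WIDTH:
            raise ValueError(f"bit position {pos} is outside 0..{WIDTH - 1}")
        value ^= 1 << pos
    return value & MASK


def increment(value):
    """Add 1 to the value (wrap-around at 32 bits is an assumption)."""
    return (value + 1) & MASK


def decrement(value):
    """Subtract 1 from the value (wrap-around at 32 bits is an assumption)."""
    return (value - 1) & MASK


def force_value(_value, forced):
    """Ignore the stored value and replace it with a user-specified one."""
    return forced & MASK


if __name__ == "__main__":
    original = 42
    print(bit_flip(original, [16]))             # single bit-flip
    print(bit_flip(original, [28, 21, 11, 3]))  # four bit-flips
    print(increment(original), decrement(original))
    print(force_value(original, 0xEFEFEFEF))
```

With the original value 42, flipping bit 16 gives 65578 and forcing the value to 0xEFEFEFEF gives 4025479151, which matches the values listed in Figure 3.4.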


Fault model                     Value (32-bit binary)                  Decimal
original value                  00000000 00000000 00000000 00101010    42
single bit-flip                 00000000 00000001 00000000 00101010    65578
multiple bit-flips (4 flips)    00010000 00100000 00001000 00100010    270534690
increment                       00000000 00000000 00000000 00101011    43
decrement                       00000000 00000000 00000000 00101001    41
force value (to 0xEFEFEFEF)     11101111 11101111 11101111 11101111    4025479151

Figure 3.4: Illustration of the Effects of Implemented Fault Models.

…
40hex         sub x11, x2, x4
44hex         and x12, x2, x5
48hex         or  x13, x2, x6
4Chex         add x1, x2, x1
50hex         sub x15, x6, x7
54hex         ld  x16, 100(x7)

    jump

1C090000hex   sd  x26, 1000(x10)
1C090004hex   sd  x27, 1008(x10)
…

Figure 3.5: Illustration of the Effects of a “Jump” Fault.


Other fault models can also be easily added because ETFIDS is based on the GDB Python Extension API. The flexibility of Python and its interface to GDB enables users to add new fault models easily and without the need to recompile the fault injection tool.

3.3 The lower level operating principle of ETFIDS

ETFIDS uses GDB as its backend, which means it operates at the same level as GDB does. As mentioned in [27], GDB uses the ptrace system call for debugging. The ptrace system call provides a means by which one process (the “tracer”) may observe and control the execution of another process (the “tracee”), and examine and change the tracee’s memory and registers. It is primarily used to implement breakpoint debugging and system call tracing. ETFIDS relies on GDB’s breakpoint feature to find the injection location and perform fault injection.

Hardware interrupts are implemented directly on the processor itself. A CPU has a single stream of execution, working through instructions one by one. It uses interrupts to handle asynchronous events such as I/O and hardware timers. A hardware interrupt is usually a dedicated electrical signal to which particular “response circuitry” is attached. This circuitry notices an activation of the interrupt and makes the CPU stop its current execution, save its state, and jump to a predefined address where a handler routine for the interrupt is located. When the handler finishes its work, the CPU resumes execution from where it stopped.

Software interrupts are similar in principle but a bit different in practice. CPUs support special instructions that allow software to simulate an interrupt. When such an instruction is executed, the CPU treats it as an interrupt: it stops its normal flow of execution, saves its state, and jumps to a handler routine. Some programming errors are also handled by the CPU as traps and are frequently referred to as “exceptions.” Here the line between hardware and software blurs, since it is hard to tell whether such exceptions are hardware interrupts or software interrupts.

For example, on the x86 architecture, the popular architecture on which we performed our fault injection experiments, breakpoints are implemented on the CPU by a particular trap called “int 3”. To set a breakpoint at some target address in the traced process, the debugger does the following:

1. Remember the data stored at the target address.

2. Replace the first byte at the target address with the “int 3” instruction.

Then, when the debugger asks the OS to run the process, the process runs and eventually hits the “int 3”, where it stops, and the OS sends it a signal. After the debugger receives a signal that its child (the traced process) has stopped, it can then:

1. Replace the “int 3” instruction at the target address with the original instruction.

2. Roll the instruction pointer of the traced process back by one. This operation is needed because the instruction pointer now points after the “int 3”, having already executed it.

3. Allow the user to interact with the process in some way, since the process is still halted at the desired target address.

4. When the user wants to keep running, the debugger takes care of placing the breakpoint back at the target address, unless the user asked to cancel the breakpoint.
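The bookkeeping in the steps above can be illustrated with a small Python model. This is a sketch under stated assumptions, not how GDB is implemented: the class and method names are hypothetical, a bytearray stands in for the code memory of the traced process, and no process is actually traced. The only architecture-specific fact used is that the one-byte x86 encoding of “int 3” is 0xCC.

```python
INT3 = 0xCC  # one-byte x86 opcode of the "int 3" trap


class ToyDebugger:
    """Models software-breakpoint bookkeeping on a bytearray that stands in
    for the traced process's code memory (hypothetical helper class)."""

    def __init__(self, memory):
        self.memory = memory
        self.saved = {}  # breakpoint address -> original byte

    def set_breakpoint(self, addr):
        self.saved[addr] = self.memory[addr]  # remember the original byte
        self.memory[addr] = INT3              # overwrite it with int 3

    def on_trap(self, ip_after_trap):
        """Handle a stop caused by int 3 and return the corrected
        instruction pointer."""
        addr = ip_after_trap - 1              # int 3 is one byte long
        self.memory[addr] = self.saved[addr]  # put the original byte back
        return addr                           # roll the pointer back by one

    def rearm(self, addr):
        """Place the breakpoint back after the original instruction ran."""
        self.memory[addr] = INT3


if __name__ == "__main__":
    # push rbp; mov rbp, rsp; ret
    code = bytearray([0x55, 0x48, 0x89, 0xE5, 0xC3])
    dbg = ToyDebugger(code)
    dbg.set_breakpoint(1)
    ip = dbg.on_trap(2)   # the trap leaves the pointer just past the int 3
    print(hex(code[ip]))  # original byte is visible again
    dbg.rearm(ip)
```

In a real debugger the original instruction is single-stepped between the restore and the re-arm; the model leaves that step out.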


ETFIDS uses the software interrupt mechanism described above, so it injects faults directly at the target location without being influenced by the current state of the operating system on which it is running. This helps ETFIDS guarantee injection accuracy and lowers the time overhead of the fault injection process.

As mentioned in Section 3.2, there is another reason why ETFIDS can achieve low time overhead: the use of hardware breakpoints. GDB provides two types of breakpoints, “hardware breakpoints” and “software breakpoints.” Hardware breakpoints are sometimes available as a built-in debugging feature of some chips. There are typically three possibilities for implementing “hardware breakpoints”:

1. The processor provides a dedicated register into which the breakpoint address may be stored. If the program counter ever matches a value in a breakpoint register, the CPU raises an exception and reports it to GDB.

2. An emulator is used to achieve such functionality. Many emulators include circuitry that watches the address lines coming out of the processor and forces it to stop if the address matches a breakpoint’s address.

3. The target may already be able to do breakpoints. For instance, a ROM monitor may do its own software breakpoints. So although these are not actually “hardware breakpoints,” from GDB’s perspective they work the same [27]. However, such an implementation incurs a substantial time overhead.

The hardware breakpoint implementation that ETFIDS uses is the dedicated register available in the processor. With this implementation, ETFIDS requires only a few extra load and store instructions, which benefits the performance of fault injection.
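To contrast the dedicated-register approach with the “int 3” approach, the following toy model treats the breakpoint registers as a small set of addresses that is compared against the program counter at every step. It is a simplified illustration only: the function name and the list of instruction addresses are hypothetical, and on a real processor the comparison is done by hardware, which raises a debug exception on a match.

```python
def run_until_breakpoint(instruction_addresses, breakpoint_registers):
    """Walk a toy program counter over `instruction_addresses` and return
    the first address that matches a value held in the simulated breakpoint
    registers, or None if execution finishes without a match.

    Unlike a software breakpoint, program memory is never modified.
    """
    armed = set(breakpoint_registers)
    for pc in instruction_addresses:
        if pc in armed:   # stands in for the comparison done in hardware
            return pc     # a real CPU would raise a debug exception here
    return None


if __name__ == "__main__":
    program = range(0x40, 0x58, 4)  # six 4-byte instructions
    print(run_until_breakpoint(program, [0x4C]))
```

Because nothing is written into the code of the target, no restore or re-arm step is needed when the breakpoint is hit.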


3.4 Fault injection control flow

As mentioned in Section 3.2, ETFIDS reads an input file that includes a set of fault target specifications before it starts operating. A fault target specification contains the following elements: (1) the name of the target variable; (2) the code scope in which the target variable is located; (3) the fault injection triggering specification. The code scope in the specification limits the fault injection to a specific program location. It is specified in terms of a function name or line numbers in the source code of the target application. The fault injection triggering specification is defined with two thresholds: (1) a code scope hitting threshold CSth, which is the number of times the code scope has been accessed before the fault injection; (2) a target variable access threshold TVth, which is the number of times the fault target variable has been accessed before the fault injection.

To provide a clear overview of how ETFIDS injects a fault, we created the control flow diagram shown in Figure 3.6. ETFIDS’ fault injection process starts by setting breakpoints at the start and end of the code scope and starting the program execution. If the number of code scope accesses is less than the user-specified code scope threshold CSth, the program continues its execution with no faults injected. Once that number reaches CSth, ETFIDS starts to monitor the target variable and continues the program execution. If the number of variable accesses is smaller than the user-specified threshold TVth, the program continues its execution without fault injection. Otherwise, the fault is injected.

The combination of the code scope access threshold CSth and the target variable access threshold TVth provides precise control over fault injection timing. Together they allow ETFIDS to reach the same point in the execution (in other words, the same stack frame and control step) each time the fault injection takes place. In this way, ETFIDS can perform fault injection experiments in a controlled and reproducible manner.
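The two-threshold trigger can be sketched as a small Python simulation over a recorded sequence of events. Everything named here is hypothetical and chosen for illustration: the FaultSpec fields, the event tuples, and the function name are not the ETFIDS input format. Two details are assumptions because the text does not spell them out: a threshold counts as reached when the count equals it, and if the scope ends before TVth is reached, counting restarts on the next entry into the scope.

```python
from dataclasses import dataclass


@dataclass
class FaultSpec:
    """Hypothetical in-memory form of one fault target specification."""
    variable: str  # name of the target variable
    scope: str     # function name limiting where the fault may be injected
    cs_th: int     # code scope hit threshold (CSth)
    tv_th: int     # target variable access threshold (TVth)


def injection_point(spec, events):
    """Return the index of the event at which the fault would be injected,
    or None if the thresholds are never reached.

    `events` is a made-up execution trace of (kind, name) tuples:
    ("enter", scope) models a hit of the scope-start breakpoint,
    ("exit", scope) models a hit of the scope-end breakpoint, and
    ("access", variable) models a watchpoint hit on a variable.
    """
    scope_hits = 0
    accesses = 0
    watching = False  # watchpoints are set only once CSth is reached
    for index, (kind, name) in enumerate(events):
        if kind == "enter" and name == spec.scope:
            scope_hits += 1
            if scope_hits >= spec.cs_th:
                watching = True
                accesses = 0
        elif kind == "exit" and name == spec.scope:
            watching = False  # assumption: stop counting outside the scope
        elif kind == "access" and name == spec.variable and watching:
            accesses += 1
            if accesses >= spec.tv_th:
                return index  # the fault would be injected here
    return None


if __name__ == "__main__":
    spec = FaultSpec("sum", "accumulate", cs_th=2, tv_th=2)
    trace = [
        ("enter", "accumulate"), ("access", "sum"), ("exit", "accumulate"),
        ("enter", "accumulate"), ("access", "sum"), ("access", "sum"),
        ("exit", "accumulate"),
    ]
    print(injection_point(spec, trace))
```

Because the trigger depends only on counts of program events and not on wall-clock time, replaying the same trace always selects the same injection point, which is the reproducibility property the thresholds are meant to provide.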


[Flowchart nodes: start; set scope start/end breakpoint and start running; continue running; code scope threshold reached?; set watchpoints; continue running; variable access threshold reached?; end scope reached?; inject fault; further fault outcome analysis required?; start fault outcome analysis; delete breakpoints and watchpoints and terminate; finish.]

Figure 3.6: ETFIDS Fault Injection Control Flow.


After the fault injection is complete, the target application is either instructed to continue its execution or analyzed for potential fault outcomes. Whether the fault outcome analysis starts depends on whether the user specifies the “probe variable.” If no fault outcome analysis is required, the faulty program may either continue executing until the end of its routine or crash (which is one of the valid fault outcomes).

3.5 Fault outcome analysis

Traditionally, fault injection experiments are followed by an analysis of the faulty and fault-free results, which requires the fault injection tool to dump memory data after the fault injection has been performed. Depending on when the fault injection occurs, the memory dump can be quite large, and analyzing the fault injection results also takes a large amount of time and computation. To address these issues, ETFIDS takes a different approach. It compares the faulty and fault-free inferiors in real time after the fault has been injected, without the need to dump data. This approach also helps precise error detection: because the two inferiors are always synchronized, the variable of interest for fault analysis can always be compared at the same point in time and within the same stack frame (if that is applicable after the fault injection). The user of ETFIDS can enable or disable the fault outcome analysis feature by choosing whether to specify the “probe variable” in the input specification file.

To realize the fault outcome analysis feature of ETFIDS, we implemented an interleaved control flow. When the fault injection has been triggered, ETFIDS first deletes all previous breakpoints and watchpoints so that they stop interfering with the program execution. ETFIDS then clones the current inferior and injects the fault into the original inferior. After the fault injection has been performed, ETFIDS sets a watchpoint on the “probe variable” in both inferiors, which marks the start of the fault outcome analysis sequence. From this moment, ETFIDS stops the program execution every time the watchpoint is hit and compares the values of both “probe variables” when the two inferiors are synchronized. At every step, the faulty and the fault-free inferiors are synchronized by continuing to the same watchpoints, so any internal data corruption or execution alteration is detected right away as a difference between the faulty and fault-free systems.

Depending on how the user specifies the expected behavior of fault outcome analysis, ETFIDS can stop both inferiors immediately or keep monitoring the “probe variables” for further differences. If the user is no longer interested in the subsequent execution once a difference between the observed variables has been detected, stopping the application immediately can save a large amount of time during fault injection experiments. By default, if the user specifies nothing but the “probe variable” in the “probe variable” section of the input specification file, the execution of both inferiors is terminated immediately after the fault outcome is detected. To change this behavior, the user can specify a threshold Pth in the “probe variable” section, which is the number of times a difference between the two “probe variables” must be detected before the application execution is terminated. The control flow of ETFIDS for fault outcome analysis is illustrated in Figure 3.7.

If ETFIDS detects any data corruption or execution alteration, it also returns a “fault latency” result, defined as the difference between the value of $pc when the “probe variable” difference was detected and the value of $pc when the fault was injected:

\[ T_{\text{fault latency}} = T_{\text{fault detection}} - T_{\text{fault injection}} \]
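The interleaved comparison with the threshold Pth can be sketched in Python as follows. This is a simplified model rather than the ETFIDS code: the function name and the two input sequences are hypothetical, each sequence holds the value of the “probe variable” at successive synchronized watchpoint hits, and the sketch reports the step at which the analysis stops instead of the $pc-based fault latency defined above.

```python
def analyze_outcome(faulty_values, golden_values, p_th=1):
    """Compare the probe variable of two runs step by step.

    `faulty_values` and `golden_values` are made-up sequences holding the
    probe variable's value at each synchronized watchpoint hit. Returns
    (step, faulty, golden) for the p_th-th mismatch, at which point both
    runs would be terminated, or None if either run ends first.
    """
    mismatches = 0
    # zip() stops at the shorter sequence, i.e. when either run ends.
    pairs = zip(faulty_values, golden_values)
    for step, (faulty, golden) in enumerate(pairs, start=1):
        if faulty != golden:
            mismatches += 1
            if mismatches >= p_th:
                return step, faulty, golden  # early termination
    return None


if __name__ == "__main__":
    golden = [0, 1, 3, 6, 10]
    faulty = [0, 1, 3, 7, 11]
    print(analyze_outcome(faulty, golden))          # default: stop at once
    print(analyze_outcome(faulty, golden, p_th=2))  # tolerate one mismatch
```

Passing generators instead of lists makes the early termination visible: values after the reported step are never requested, which mirrors stopping both inferiors as soon as the threshold is reached.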


[Flowchart nodes: start; delete existing breakpoints and watchpoints; probe variable is local?; set breakpoint to probe variable scope and continue to reach this scope; set watchpoint to probe variable in both faulty and fault-free inferiors; continue running; probe variable difference detected?; target application routine ended?; probe variable difference threshold reached?; record current probe variable value and detected fault latency; terminate fault outcome analysis; finish.]

Figure 3.7: ETFIDS Fault Outcome Analysis Control Flow.


By implementing the fault outcome detection mechanism described above, ETFIDS achieves better performance for fault outcome analysis and reduces the amount of storage required for fault injection experiments. Table 3.1 illustrates how the size of the dump files varies with the approach used for fault outcome analysis. As Table 3.1 shows, depending on the number of observation point hits, the choice of strategy makes a real difference. If we use core dumps for further fault outcome analysis, the total size of the dump files increases quite rapidly. If we instead dump the value every time the observation point is hit, the size of the dump files also increases; although each increment is small, the total might become considerable if the number of fault injection experiments is large.

In contrast to the approaches that dump information about the “probe variable,” ETFIDS does not store extra data; thus no additional memory is used. There are nevertheless some benefits to using core dumps for fault injection analysis, namely the ability to restore the fault injection session and rerun the target application from the point where the core dump was created. However, such functionality becomes less useful when the demands of fault outcome analysis are high, for example, when the number of fault injection experiments is so large that fault outcome analysis with core dumps becomes unwieldy. Thus, the core dump approach can be useful if the user is interested in only a few observation points and wishes to investigate further how a fault outcome could be avoided.

Figure 3.8 and Figure 3.9 show a detailed comparison between the different approaches to fault outcome analysis.
As Figure 3.8 shows, ETFIDS compares the faulty inferior and the fault-free inferior side by side by running the two inferiors in an interleaved fashion. The arrows on the sides of the diagram indicate the execution sequence, and the double-headed arrows in the middle indicate the comparison operations. The numbers on the arrows give the order of these operations: ETFIDS first steps the faulty inferior to the next access of the “probe variable,” then does the same for the fault-free inferior, and finally, once the two inferiors are synchronized, compares the observation point for any difference. Once a difference is detected, as shown in “Step 10,” ETFIDS stores the difference and the fault latency and terminates both inferiors.

Table 3.1: Size of Dump Files for Fault Outcome Analysis

Observation Point Hit   Core Dump (KB)   Value Dump (KB)   ETFIDS (KB)
1                       744              4                 4
2                       1488             8                 4
3                       2232             12                4
4                       2976             16                4
5                       3720             20                4
6                       4464             24                4
7                       5208             28                4
8                       5952             32                4
9                       6696             36                4
10                      7440             40                4

[Figure 3.8 diagram: two parallel timelines, “Faulty Inferior” and “Fault-free Inferior,” each running Step 1 through Step 10 along a time axis; numbered arrows 1, 2, 3 mark the stepping order; a comparison is made between corresponding steps; “Early Termination Triggered by Fault Detection” at Step 10.]

Figure 3.8: ETFIDS Fault Outcome Analysis Work Flow.

Figure 3.9 shows the traditional approach to fault outcome analysis. As illustrated in the diagram, the fault analyzer dumps the data every time the observation point is hit, and the fault outcome analysis can only happen after the target application has finished executing. For clarity, Figure 3.9 shows only a serial approach; the two processes could also be run concurrently. However, because the two processes share no information during execution, the fault analyzer has to run both the faulty and the fault-free process to the end to obtain the complete dump files. This is still very likely to take more time than the ETFIDS approach.

[Figure 3.9 diagram: a “Faulty Process” and a “Fault-free Process” each run Step 1 through Step 10 and keep running until the end of process execution, dumping to a “Dump File for Faulty Run” and a “Dump File for Fault-free Run”; the comparison takes place between the two dump files.]

Figure 3.9: Fault Outcome Analysis Work Flow by Dumping Data.
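The interleaved comparison of Figure 3.8 can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not ETFIDS code: each inferior is modeled as an iterator that yields the value of the “probe variable” at successive observation point hits, and the names `compare_interleaved`, `traced`, and `threshold` are hypothetical. In particular, reading the threshold as "number of differing hits tolerated before termination" is an assumption.

```python
def compare_interleaved(faulty_hits, fault_free_hits, threshold=0):
    """Step two probe-value streams in lockstep; stop at the first
    confirmed difference.

    faulty_hits / fault_free_hits: iterables yielding the probe value at
    each observation point hit (stand-ins for the two inferiors).
    threshold: differing hits tolerated before reporting (assumed meaning).

    Returns (hit_index, faulty_value, fault_free_value) for the hit that
    triggered termination, or None if no termination was triggered.
    """
    mismatches = 0
    # zip() advances the faulty stream first, then the fault-free one,
    # which mirrors the stepping order described for Figure 3.8.
    pairs = zip(faulty_hits, fault_free_hits)
    for hit, (faulty, fault_free) in enumerate(pairs, start=1):
        if faulty != fault_free:
            mismatches += 1
            if mismatches > threshold:
                # Early termination: neither stream is advanced further.
                return hit, faulty, fault_free
    return None


def traced(values, log):
    """Yield values one by one, logging every step that actually ran."""
    for step, value in enumerate(values, start=1):
        log.append(step)
        yield value


# Hypothetical traces: the faulty run diverges at the tenth hit.
fault_free_values = list(range(1, 13))
faulty_values = fault_free_values[:9] + [99, 11, 12]
faulty_log, fault_free_log = [], []
result = compare_interleaved(
    traced(faulty_values, faulty_log),
    traced(fault_free_values, fault_free_log),
)
print(result, len(faulty_log), len(fault_free_log))
```

In this toy run both logs stop at the tenth hit, so the remaining steps of either trace are never executed. A dump-based analyzer, by contrast, would have to run both traces to the end before comparing.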

Chapter 4

Fault Injection Experiments using ETFIDS

In this chapter, we discuss the usage of ETFIDS, how the fault injection experiments were conducted, and the results obtained. We then compare the performance of ETFIDS with that of prior fault injection tools and discuss the reasons for the performance differences.

4.1 Typical fault injection target and experiment configuration

Because ETFIDS is a newly proposed fault injection tool, this section demonstrates its usage with an example target application and the corresponding fault injection configuration files. Listing 4.1 shows an example C program, and Listing 4.2 and Listing 4.3 show the corresponding example configuration files. A configuration file starts with the path to the executable, followed by the program arguments (in this case, there are none) and a list of fault specifications. There are two ways to specify the fault target; the only difference lies in how the code scope is specified. The first way is to specify the code scope by line number, as shown in Listing 4.2. In this case, the target variable is located in file ex.c between line 7 and line 10. As specified in the example configuration file, ETFIDS starts watching the target variable when the beginning of the scope is hit for the first time, and the fault is injected after the target variable has been accessed a total of 10 times. The injected fault, as specified in the configuration file, is a random bit-flip. The fault target can also be specified by function, as shown in Listing 4.3; the event that triggers the injection is specified in the same way as before. At the end of the configuration file, a list of observation locations is given. These locations are used to observe error propagation during program execution. The number after the observation point specification is a threshold: the number of times the fault must be detected at the observation point before ETFIDS records the fault latency and terminates the fault injection experiment.

     1  #include <stdio.h>
     2  int foo(int i){ return i + 1; }
     3  int bar(int n){ return n - 1; }
     4  int t_func(int n){
     5      int t_var = 0;       // <<<<<<<<<<< The target variable
     6      int i_var, r_var;
     7      if(n < 10){          // <<<<<<<< The start of the scope
     8          for(int i = 0; i < n; ++i)
     9              t_var = t_var + foo(i);
    10      }                    // <<<<<<<<<<<<<<<<<<< The end of the scope
    11      else{ t_var = bar(n); }
    12      i_var = 1000;
    13      r_var = t_var - 100;
    14      return t_var;
    15  }
    16  int main(){
    17      int n = 100;
    18      printf("%d", t_func(n));
    19      return 0;
    20  }

Listing 4.1: Example Source Code

    # executable (full path)
    executable ./a.out
    # arguments: empty if no argument
    arguments
    # fault specification
    fault t_var ex.c:7 ex.c:10 1 10 BIT_FLIPS
    # probe specification
    probe r_var 0

Listing 4.2: Fault Injector Configuration File with Line Number Specification

    executable ./a.out
    arguments
    # fault specification
    fault t_var t_func 1 10 BIT_FLIPS
    # probe specification
    probe r_var 0

Listing 4.3: Fault Injector Configuration File with Target Function Specification
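To make the configuration format concrete, the following Python sketch parses both forms of the fault specification. It is inferred only from the two example listings and the surrounding description (with the underscores in `t_var`, `r_var`, `t_func`, and `BIT_FLIPS` restored), so it is not the ETFIDS parser. The class and field names (`FaultSpec`, `scope_hit`, `access_count`, and so on) are hypothetical labels for the fields the text describes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FaultSpec:
    variable: str
    scope: Tuple[str, ...]  # (begin, end) locations, or (function,)
    scope_hit: int          # assumed: which hit of the scope start arms the watch
    access_count: int       # assumed: accesses of the variable before injection
    fault_type: str


@dataclass
class Config:
    executable: Optional[str]
    arguments: List[str]
    faults: List[FaultSpec]
    probes: List[Tuple[str, int]]  # (probe variable, threshold)


def parse_config(text: str) -> Config:
    """Parse the line-oriented format shown in Listings 4.2 and 4.3."""
    cfg = Config(executable=None, arguments=[], faults=[], probes=[])
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, *fields = line.split()
        if key == "executable":
            cfg.executable = fields[0]
        elif key == "arguments":
            cfg.arguments = fields
        elif key == "fault":
            # The last three fields are fixed; whatever sits between the
            # variable and them is the scope (two locations or one function).
            variable, *scope = fields[:-3]
            hit, accesses, fault_type = fields[-3:]
            cfg.faults.append(
                FaultSpec(variable, tuple(scope), int(hit), int(accesses), fault_type)
            )
        elif key == "probe":
            cfg.probes.append((fields[0], int(fields[1])))
        else:
            raise ValueError(f"unknown directive: {key}")
    return cfg


EXAMPLE = """\
# executable (full path)
executable ./a.out
# arguments: empty if no argument
arguments
# fault specification
fault t_var ex.c:7 ex.c:10 1 10 BIT_FLIPS
# probe specification
probe r_var 0
"""
cfg = parse_config(EXAMPLE)
print(cfg.faults[0])
```

Splitting the fault line from the right keeps one code path for both scope styles, since only the number of scope fields differs between the two listings.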

4.2 Fault injection experiment results

This section presents the results obtained by using ETFIDS for fault injection experiments. The purpose of these experiments is to demonstrate the capability and usage of ETFIDS. All experiments were performed on a Linux workstation with an Intel Core i7-8700 (Coffee Lake) processor. In each experiment, a single fault was injected using the single bit-flip fault model. The behavior and outcome of the application under test were observed from both an observation point and the output of the target application, and the results were classified into one of the following four categories.

1. Benign: The injected fault did not affect the execution of the application or the result it generates.

2. Time-out: The injected fault caused the application to run for an extended amount of time or never to stop running. In our experiments, a time-out fault outcome is defined as the execution time of the fault-injected application being 10 times longer than that of the fault-free application.

3. Detected: The injected fault caused the result generated by the application to differ from the result produced by the fault-free application, or a difference was detected by comparing an observation point in the faulty and fault-free applications at runtime. A detected fault outcome also means that the target application exited normally after the fault injection.

4. Crash: The injected fault caused the application to terminate prematurely because of an exception, for example, a segmentation fault.
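The four categories above can be expressed as a small decision function. The sketch below is illustrative only: the parameter names (`exit_signal`, `output_matches`, `probe_matches`, `timeout_factor`) are hypothetical, and the order in which the checks are applied is an assumption, since the text does not say which category wins when several conditions hold at once.

```python
from enum import Enum


class Outcome(Enum):
    BENIGN = "benign"
    TIME_OUT = "time-out"
    DETECTED = "detected"
    CRASH = "crash"


def classify(faulty_time, fault_free_time, exit_signal,
             output_matches, probe_matches, timeout_factor=10.0):
    """Map one fault injection run to one of the four outcome categories.

    faulty_time / fault_free_time: execution times in the same unit.
    exit_signal: signal number that killed the process, or None.
    output_matches / probe_matches: whether the program output and the
    observation point agree with the fault-free run.
    """
    # Assumed precedence: time-out, then crash, then detected, then benign.
    if faulty_time > timeout_factor * fault_free_time:
        return Outcome.TIME_OUT
    if exit_signal is not None:
        return Outcome.CRASH
    if not (output_matches and probe_matches):
        return Outcome.DETECTED
    return Outcome.BENIGN


# Hypothetical run: normal exit, same output, but the probe value differs.
print(classify(1.2, 1.0, None, True, False))
```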


The target programs used in our experiments are gzip, bzip2, link, perl, matrix, quicksort, sjeng, and libquantum. gzip, bzip2, link, perl, sjeng, and libquantum were chosen from the SPEC 2006 benchmark application list, and matrix and quicksort were used because [16] used them for fault injection experiments. The target applications were selected to match those used in prior fault injection work so that performance could be compared. The inputs for the SPEC 2006 applications come from the benchmark specification. For each application, 100000 fault injection experiments were performed. The list of faults for each application contains all fault target locations that are available for fault injection. Fault targets were selected at random from a uniform distribution to ensure that the results are unbiased. The total number of fault injection experiments is 800000. Table 4.1 summarizes the experiments. The numbers in the table are the counts of each fault outcome. For gzip, 86765 fault injections had no effect (benign), and none resulted in a time-out, as no fault caused the application to hang or run for an extended period of time. ETFIDS found a total of 13235 abnormalities: 9424 were detected at the “probe variable” observation points, and 3811 fault injections caused crashes. The results for bzip2, link, perl, matrix, quicksort, sjeng, and libquantum can be read in the same way. For visualization, Figures 4.1 through 4.6 were created. Figure 4.1 shows the probability of each fault outcome type for each target application. Figure 4.2, Figure 4.3, Figure 4.4, and Figure 4.5 show the total number of each fault outcome observed for each target application.
Figure 4.6 shows the proportion of each fault outcome for each target application. The observation we can make from the experimental results is that the level of fault tolerance is related to the control complexity of the target application: when there are more branches in the target application, the effect of a single fault injection is more likely to be minimized. Our hypothesis is that, as the complexity of the control logic goes up, a fault injection is less likely to cause any critical change in the target application when it is running a given workload.

Table 4.1: Fault Injected Outcome

Application Name   Benign   Time-out   Detected   Crash
gzip               86765    0          9424       3811
bzip2              80100    0          9368       10532
link               81214    0          8482       10304
perl               86590    4348       2476       6586
matrix             63605    0          35293      1102
quicksort          48825    0          1740       49435
sjeng              80729    694        7292       11285
libquantum         46674    11142      16683      25501
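The outcome probabilities plotted in the figures follow directly from the counts in Table 4.1. As a worked illustration, the Python sketch below divides each count by the number of experiments per application; the data are copied from Table 4.1, while the names `TABLE_4_1`, `CATEGORIES`, and `outcome_probabilities` are introduced here for illustration only.

```python
# Counts from Table 4.1, in the order: benign, time-out, detected, crash.
CATEGORIES = ("benign", "time-out", "detected", "crash")
TABLE_4_1 = {
    "gzip": (86765, 0, 9424, 3811),
    "bzip2": (80100, 0, 9368, 10532),
    "link": (81214, 0, 8482, 10304),
    "perl": (86590, 4348, 2476, 6586),
    "matrix": (63605, 0, 35293, 1102),
    "quicksort": (48825, 0, 1740, 49435),
    "sjeng": (80729, 694, 7292, 11285),
    "libquantum": (46674, 11142, 16683, 25501),
}


def outcome_probabilities(counts):
    """Convert one row of outcome counts into per-category probabilities."""
    total = sum(counts)
    return {name: n / total for name, n in zip(CATEGORIES, counts)}


probabilities = {app: outcome_probabilities(row) for app, row in TABLE_4_1.items()}

# Each row should account for all 100000 experiments of that application.
for app, row in TABLE_4_1.items():
    assert sum(row) == 100000, app

for app, p in probabilities.items():
    print(app, {name: f"{value:.1%}" for name, value in p.items()})
```

The per-row check doubles as a consistency test on the table: every application's four counts add up to its 100000 experiments.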

[Figure 4.1 chart: fault outcome probability (0% to 100%) per target application; legend: Benign, Time-out, Detected, Crash.]

Figure 4.1: Average Fault Outcome Probability.

Figure 4.2: Number of Benign Fault Outcomes Observed for Each Target Application.

Figure 4.3: Number of Time-out Fault Outcomes Observed for Each Target Application.

Figure 4.4: Number of Detected Fault Outcomes Observed for Each Target Application.

4.3 Performance

As mentioned in Section 3.1, ETFIDS was developed not only to address the lack of reproducibility, flexibility, and portability in current fault injection implementations, but also to achieve near-native performance. ETFIDS is lightweight, portable, and fast by design. For a fault injection tool, the most critical goal is to minimize the time overhead of fault injection. By limiting the number of operations needed for each fault injection, ETFIDS adds only a negligible amount of computation every time it injects a fault. In addition, the newly introduced fault analysis mechanism reduces the number of target application executions to 1 (instead of 2) for each fault injection experiment, which lowers the time overhead of a whole fault injection campaign. As defined in [16], to minimize the influence of different processor architectures and different CPU performance, the time overhead $T_{\text{overhead}}$ can be calculated as:

$$T_{\text{overhead}} = \frac{T_{\text{fault injected run}}}{T_{\text{fault free run}}}$$
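As a small worked example of this ratio, the Python sketch below computes the overhead for one pair of runs and averages it over several pairs. The function names and the sample timings are hypothetical, and averaging the per-run ratios (rather than, say, dividing total times) is an assumption about how an average overhead might be formed.

```python
def time_overhead(t_fault_injected_run, t_fault_free_run):
    """Ratio of the fault-injected run time to the fault-free run time."""
    if t_fault_free_run <= 0:
        raise ValueError("fault-free run time must be positive")
    return t_fault_injected_run / t_fault_free_run


def mean_time_overhead(runs):
    """Average the overhead over (fault-injected, fault-free) time pairs."""
    ratios = [time_overhead(injected, free) for injected, free in runs]
    return sum(ratios) / len(ratios)


# Made-up timings in seconds, for illustration only.
sample_runs = [(2.0, 1.0), (3.0, 2.0)]
print(mean_time_overhead(sample_runs))
```

Because the measure is a ratio of two times taken on the same machine, a value of 1.0 means no overhead regardless of how fast that machine is.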

Table 4.2 shows the time overhead of fault injection for ETFIDS. The table lists the average fault injection time overhead for different fault target memory regions. As Table 4.2 shows, the time overhead is smaller for global target variables, as injecting faults into global variables does not require reaching a specific function or subroutine. In contrast, the time overhead is larger when injecting faults into local target variables, such as variables stored in stack memory or in the dynamic memory region.

Table 4.2: Average Time Overhead

Fault Target Memory Region   Time Overhead
Globals                      1.009884
Stack Memory                 1.087052
Dynamic Memory               1.073222

Figure 4.5: Number of Crash Fault Outcomes Observed for Each Target Application.

Table 4.3 shows how ETFIDS performs compared with other software-implemented fault injection (SWIFI) tools. As the table shows, there is no clear winner regarding fault injection efficiency among the modern approaches; both the modern approaches [29, 15] and ETFIDS perform better than the older approach [16]. It would be more informative to compare ETFIDS with additional fault injection tools, but unfortunately most prior software fault injection work does not report performance metrics. Moreover, [15] performs slightly better than ETFIDS because [15] was developed using the framework that Intel offers specifically for the x86 architecture, so it benefits from many architectural optimizations, whereas ETFIDS is built on GDB, which supports multiple CPU architectures,
so its level of architectural optimization is lower than that of Intel’s PIN.

Table 4.3: Time Overhead Comparison

SWIFI Tool      Time Overhead
FERRARI [16]    1.290000
Ftape [29]      1.087275
PIN [15]        1.011341
ETFIDS          1.056720

[4] is the only prior work that measured fault injection time overhead for different fault target memory regions; thus it is worth showing the difference between FIESTA++ [4] and ETFIDS. Table 4.4 shows the time overhead of FIESTA++ and ETFIDS in seconds for one run. As we can observe, the difference is quite large. The main reason is that FIESTA++ measured time overhead in its “black box mode,” a functionality that ETFIDS does not offer. When FIESTA++ works in “black box mode,” most of the time overhead comes from searching for the target variable into which faults are injected. However, such behavior could be emulated by querying the target program and listing all the possible fault injection targets before the actual fault injection experiment. For fault injection campaigns that consist of hundreds or thousands of single fault injection experiments, the time overhead of “black box mode” in FIESTA++ would become unacceptable.

Table 4.4: Detailed Time Overhead (in seconds) Comparison

Fault Target Memory Region   FIESTA++ [4]   ETFIDS
Globals                      0.11           0.0012
Stack Memory                 1.2            0.011
Dynamic Memory               2.9            0.009

[Figure 4.6 charts: one pie chart per application (gzip, bzip2, perl, link, matrix, quicksort, sjeng, libquantum); slices: Benign, Time-out, Detected, Crash.]

Figure 4.6: Fault Injection Outcome Probability with Regard to Each Application.

As mentioned in Section 3.5, the total fault injection experiment time can be reduced using the early termination feature of ETFIDS. To demonstrate this feature, we ran an experiment measuring the total time reduction for fault injection experiments with early termination enabled. Figure 4.7 shows the average relative consumed time for one fault injection experiment with different fault latencies. As the figure shows, the shorter the fault latency, the higher the speed-up, because ETFIDS terminates the target application as soon as it detects the propagated fault at the observation point. We obtained the data for this plot using the matrix application. With other applications, the speed-up may not be as linear, because the amount of computation might not be as consistent as in matrix. The relative consumed time is defined as:

$$T_{\text{relative}} = \frac{T_{\text{early termination}}}{T_{\text{no early termination}}}$$


Figure 4.7: Relative Time Consumed by One Fault Injection Experiment using Early Termination Feature. (Data Obtained using matrix)
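The relationship between fault latency and relative consumed time can be illustrated with a toy model. The Python sketch below assumes that every step of the target application costs the same amount of time and that an experiment with early termination stops as soon as the fault is detected; both are simplifying assumptions made here, not measurements. The function and parameter names (`modeled_relative_time`, `injection_step`, `fault_latency`, `total_steps`) are hypothetical.

```python
def relative_time(t_early_termination, t_no_early_termination):
    """Relative consumed time: early-terminated run over full run."""
    return t_early_termination / t_no_early_termination


def modeled_relative_time(injection_step, fault_latency, total_steps):
    """Toy model with uniform step cost.

    The run stops at the step where the fault is detected
    (injection_step + fault_latency), or at the end of the program if
    the fault is not detected before then.
    """
    stop_step = min(injection_step + fault_latency, total_steps)
    return relative_time(stop_step, total_steps)


# Shorter latencies give smaller relative times, i.e. larger speed-ups.
for latency in (10, 25, 50, 100):
    print(latency, modeled_relative_time(0, latency, 100))
```

Under the uniform-cost assumption the relative time grows linearly with the fault latency. Applications with less regular computation would not fit this model as well.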

Chapter 5

Conclusion

ETFIDS makes three improvements upon current SWIFI implementations: (1) low-perturbation fault injection using a cross-threshold mechanism; (2) accurate fault outcome analysis using real-time faulty process evaluation; and (3) early fault propagation detection and improved performance using concurrent comparison of the faulty and fault-free processes. These improvements are achieved through a collection of newly introduced mechanisms. Low perturbation is accomplished with a double threshold mechanism, which constrains both the stack frame and the memory state for fault injection and thus eliminates the perturbation introduced by the uncertainty of software execution. The runtime faulty process evaluation reduces the number of target application executions from 2 to 1, so the total time required by one fault injection experiment is decreased. The concurrent comparison of the faulty and fault-free inferiors is the most interesting improvement we made. ETFIDS uses an interleaved mechanism to detect fault propagation. This mechanism improves performance by enabling early termination of a fault injection experiment as soon as an error is detected. Concurrent comparison also saves storage space during fault injection campaigns.

The fault injection experiments we conducted also allow some interesting observations.

First, different target applications behave differently after a fault is injected. There is some correlation between the control complexity of the target application and its fault tolerance. A likely reason is that more complex applications tend to have a wider variety of features, not all of which a single workload can activate. When faults are injected into variables that are not involved in the execution of the application, the probability of obtaining a benign fault outcome is substantial. For applications whose variables are mostly part of the control logic, there is a higher chance of obtaining a crash from a fault injection experiment.

Second, the efficiency of a fault injection experiment depends on a variety of factors, including the way the fault injector finds the target variable, the way the fault outcome is analyzed, and the lower-level implementation of the fault injector and the fault propagation monitor. Moreover, the time a fault injection campaign takes can be shortened considerably by the behavior of the fault injection system: if the system can terminate the faulty process as soon as the desired results are obtained, the average time spent on each fault injection experiment is shorter.

In addition, the implementation of the fault injection system has a large impact on the usability of the tool. By using the Python extension API of GDB, ETFIDS achieves good portability: it needs no recompilation for different systems, and it inherits the wide range of target architectures that GDB supports. Because Python programs are easy to extend, ETFIDS can be readily modified or extended for a specific target application or fault injection campaign.
ETFIDS does have certain limitations. It is still a software-implemented fault injection tool, so it cannot emulate or detect a lower-level (for example, gate-level) fault in a target hardware implementation. The use of hardware breakpoints may also be limited by the CPU on which ETFIDS runs. This work could therefore be extended with a hybrid approach to fault injection and analysis: faulty behavior derived using hardware fault injection tools can be used to recreate certain device-level transient faults in ETFIDS. Such an approach combines the versatility of software fault injection with the accuracy of hardware monitoring. In this way, the propagated error generated by a transient fault could be analyzed thoroughly, so that mitigation techniques could be developed more easily.

Bibliography

[1] L. Antoni, R. Leveugle, and M. Feher. Using run-time reconfiguration for fault injection in hardware prototypes. In 17th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, 2002. DFT 2002. Proceedings., pages 245–253, 2002.

[2] J. Arlat, Y. Crouzet, and J. C. Laprie. Fault injection for dependability validation of fault-tolerant computing systems. In [1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers, pages 348–355, June 1989.

[3] J. Carreira, H. Madeira, and J. G. Silva. Xception: a technique for the experimental evaluation of dependability in modern computers. IEEE Transactions on Software Engineering, 24(2):125–136, Feb 1998.

[4] Ameya Suhas Chaudhari. Fiesta++: A software implemented fault injection tool for transient fault injection. https://repositories.lib.utexas.edu/bitstream/handle/2152/28159/CHAUDHARI-THESIS-2014.pdf?sequence=1, 2014. [Online; accessed 01-Mar-2018].

[5] H. Cho, S. Mirkhani, C. Y. Cher, J. A. Abraham, and S. Mitra. Quantitative evaluation of soft error injection techniques for robust system design. In 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–10, May 2013.


[6] P. Civera, L. Macchiarulo, M. Rebaudengo, M. S. Reorda, and M. Violante. Exploiting circuit emulation for fast hardness evaluation. IEEE Transactions on Nuclear Science, 48(6):2210–2216, Dec 2001.

[7] E. Dupont, M. Nicolaidis, and P. Rohr. Embedded robustness ips for transient-error-free ics. IEEE Design & Test of Computers, 19(3):54–68, May 2002.

[8] A. V. Fidalgo, G. R. Alves, and J. M. Ferreira. Real time fault injection using a modified debugging infrastructure. In 12th IEEE International On-Line Testing Symposium (IOLTS’06), pages 6 pp.–, July 2006.

[9] U. Gunneflo, J. Karlsson, and J. Torin. Evaluation of error detection schemes using fault injection by heavy-ion radiation. In [1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers, pages 340–347, June 1989.

[10] Seungjae Han, K. G. Shin, and H. A. Rosenberg. Doctor: an integrated software fault injection environment for distributed real-time systems. In Proceedings of 1995 IEEE International Computer Performance and Dependability Symposium, pages 204–213, Apr 1995.

[11] S. K. Sastry Hari, S. V. Adve, H. Naeimi, and P. Ramachandran. Relyzer: Application resiliency analyzer for transient faults. IEEE Micro, 33(3):58–66, May 2013.

[12] Martin Hiller. Propane: An environment for examining the propagation of errors in software. In Proc. ACM SIGSOFT international symposium on and analysis, pages 81–85. ACM Press, 2002.

[13] Mei-Chen Hsueh, T. K. Tsai, and R. K. Iyer. Fault injection techniques and tools. Computer, 30(4):75–82, Apr 1997.


[14] E. Ibe, H. Taniguchi, Y. Yahagi, K. i. Shimbo, and T. Toba. Impact of scaling on neutron-induced soft error in srams from a 250 nm to a 22 nm design rule. IEEE Transactions on Electron Devices, 57(7):1527–1538, July 2010.

[15] A. Jin, J. Jiang, J. Hu, and J. Lou. A pin-based dynamic software fault injection system. In 2008 The 9th International Conference for Young Computer Scientists, pages 2160–2167, Nov 2008.

[16] G. A. Kanawati, N. A. Kanawati, and J. A. Abraham. Ferrari: a flexible software-based fault and error injection system. IEEE Transactions on Computers, 44(2):248–260, Feb 1995.

[17] W. Kao, R. K. Iyer, and D. Tang. Fine: A fault injection and monitoring environment for tracing the unix system behavior under faults. IEEE Transactions on Software Engineering, 19(11):1105–1118, Nov 1993.

[18] Wei-Lun Kao and R. K. Iyer. Define: a distributed fault injection and monitoring environment. In Proceedings of IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, pages 252–259, Jun 1994.

[19] J. Karlsson, P. Folkesson, J. Arlat, Y. Crouzet, G. Leber, and J. Reisinger. Evaluation of the mars fault tolerance mechanisms using three physical fault injection techniques. In Integrating Error Models with Fault Injection, 1994., Third Int’l Workshop on, pages 21–22, Apr 1994.

[20] N. Krishnamurthy, V. Jhaveri, and J. A. Abraham. A design methodology for software fault injection in embedded systems. In Proc of the 1998 IFIP International Workshop on Dependable Computing and its Applications, pages 12–14, 1998.

[21] K. J. Kuhn, M. D. Giles, D. Becher, P. Kolar, A. Kornfeld, R. Kotlyar, S. T. Ma, A. Maheshwari, and S. Mudanai. Process technology variation. IEEE Transactions on Electron Devices, 58(8):2197–2208, Aug 2011.


[22] Cao Lianmin and Zeng Qingliang. Study on reliability of computer network. In The 2nd International Conference on Information Science and Engineering, pages 2394–2398, Dec 2010.

[23] Q. Lu, M. Farahani, J. Wei, A. Thomas, and K. Pattabiraman. Llfi: An intermediate code-level fault injection tool for hardware faults. In 2015 IEEE International Conference on Software Quality, Reliability and Security, pages 11–16, Aug 2015.

[24] M. Portela-Garcia, C. Lopez-Ongil, M. Garcia-Valderas, and L. Entrena. A rapid fault injection approach for measuring seu sensitivity in complex processors. In 13th IEEE International On-Line Testing Symposium (IOLTS 2007), pages 101–106, July 2007.

[25] Z. Segall, D. Vrsalovic, D. Siewiorek, D. Ysskin, J. Kownacki, J. Barton, R. Dancey, A. Robinson, and T. Lin. Fiat - fault injection based automated testing environment. In Fault-Tolerant Computing, 1995, Highlights from Twenty-Five Years., Twenty-Fifth International Symposium on, pages 394–, June 1995.

[26] K. G. Shin. Harts: a distributed real-time architecture. Computer, 24(5):25–35, May 1991.

[27] the GDB developers. Gdb documentation. https://www.gnu.org/software/gdb/documentation/, 2018. [Online; accessed 19-Feb-2018].

[28] the GDB developers. Gdb: The gnu project debugger. https://www.gnu.org/software/gdb/, 2018. [Online; accessed 19-Feb-2018].

[29] T. K. Tsai, R. K. Iyer, and D. Jewitt. An approach towards benchmarking of fault-tolerant commercial systems. In Proceedings of Annual Symposium on Fault Tolerant Computing, pages 314–323, Jun 1996.
