Fault Injection Test Harness a Tool for Validating Driver Robustness

Fault Injection Test Harness a tool for validating driver robustness Louis Zhuang Stanley Wang Intel Corp. Intel Corp. [email protected], [email protected] [email protected] Kevin Gao Intel Corp. [email protected] Abstract 1 Introduction FITH (Fault Injection Test Harness) is a tool High-availability (HA) systems must respond for validating driver robustness. Without gracefully to fault conditions and remain oper- changing existing code, it can intercept arbi- ational during unexpected software and hard- trary MMIO/PIO access and IRQ handler in ware failures. Each layer of the software stack driver. of a HA system must be fault tolerant, produc- ing acceptable output or results when encoun- Firstly I’ll first list the requirements and design tering system, software or hardware faults, in- for Fault Injection. Next, we discuss a cou- cluding faults that theoretically should not oc- ple of new generally useful implementation in cur. An empirical study [2] shows that 60- FITH 70% of kernel space defects can be attributed to device driver software. Some defect con- 1. KMMIO - the ability to dynamically hook ditions (such as hardware failure, system re- into arbitrary MMIO operations. source shortages, and so forth) seldom hap- pen, however, it is difficult to simulate and reproduce without special assistant hardware, 2. KIRQ - the ability to hook into an arbi- such as an In-Circuit Emulator. In these situa- trary IRQ handler, tions, it is difficult to predict what would hap- pen should such a fault occur at some time in the future. Consequently, device drivers that Then I’ll demonstrate how the FITH can help are highly available or hardened are designed developers to trace and identify tricky issues to minimize the impact of failures to a system’s in their driver. Performance benchmark is also overall functionality. provided to show our efforts in minimizing the impact to system performance. At last, I’ll Developing hardened drivers requires employ- elaborate on current and future efforts and con- ing fault avoidance software development tech- clude. niques early in the development phase. To Linux Symposium 525 eliminate faults during development and con- • minimize the number of instructions in firm a driver’s level of hardening, a developer critical kernel path, such as exception and can test a device driver by injecting or simulat- interrupt part. ing fault events or conditions. The focus of this paper is on the injection or simulation of hardware faults. Injection of software faults will be considered in future version. 3 Architecture FITH simulates hardware-induced software er- rors without modifying the original driver. It offers flexible customization of hardware fault simulation as well as provides command-line FITH consists of four components: intercep- tool for facilitating test development. FITH can tors, faultsets, configuration tools, and a fault also provide the ability to log the route of an in- injection manager. The configuration tools jected fault, thereby enabling driver developers provide the command-line utilities needed to to diagnose the tested driver. customize a faultset for a driver. Three in- terceptors will be implemented to catch IO, MMIO and IRQ access. When a driver tries to 2 Requirements access a hardware resource, the IO interceptor captures this access and asks the fault injection This section describes some requirements for manager if there is a corresponding item in the FITH; we derived as part of the development. faultset. If there is, the fault injection manager determines how to inject the appropriate fault The only behavioral requirement is that FITH according to the associated properties defined should not impact functionality of the tested in the faultset. The fault injection manager then driver. The tested driver should work as if there returns this information to the interceptor, so is no FITH at all. the interceptor injects the actual fault into the There are various functionality requirements hardware. that need to be considered. Most center around IRQ fault injection is somewhat different from the interception of resources access. FITH the other types of fault injection. Hardware needs to have capability to intercept accesses triggers the IRQ, and a kernel IRQ interrupt for MMIO, IO, IRQ and PCI configuration. handler delivers this event to the IRQ inter- The other major requirement is about handling ceptor. The IRQ interceptor then checks the after interception. FITH needs to have capa- faultset to determine whether a specific fault bility to support the complex and customized is available to inject into the event. 1 illus- post-handling, such as tracing hardware status, trates how the interceptor interacts with the emulating fake hardware register, injecting er- other subsystems. ror data and logging. The interceptor sits between the hardware and With respect to performance, there are basi- the device driver and modifies information cally two overriding goals: based on certain conditions in the driver’s namespace. An interceptor for one driver does • minimize the impact to system perfor- not affect the interceptor for others. As a mat- mance when the tested driver doesn’t en- ter of fact, the hardware that the driver observes able FITH. is the hardware that our interceptor wraps. Linux Symposium 526 functions cannot be intercepted. 2. A special FITH header file needs to be added to the driver code, and the driver needs to be recompiled. 3. There are some differences between the “released driver” and the “driver with FITH.” The driver that is vali- dated and verified is the driver with FITH rather than the released driver. • Setting Watch Points. Figure 1: Architecture of FITH IA-32 architecture provides extensive debugging facilities for debugging code and monitoring code execution. These facili- 4 KMMIO—Interceptor of MMIO ties can also be used to intercept memory access. The major advantages are access 1. The driver does not need to be re- One of those requirementsof FITH is the abil- compiled. ity to hook to a specific memory mapped IO re- 2. There have been a good patch to sup- gion before the user of the region gets access. port it.[3] A fault injection test case may need to just note when a given region is being read/written, take On the other hand, there are several disad- some action before the caller returns from the vantages: read or write operation, or change the value that is being read or written. 1. In IA32 architecture, this is a trap type of exception, which means that 4.1 Approaches for capturing MMIO accesses the processor generates the exception after the IO instruction has been There are several hardware/software ap- executed, so this method cannot do proaches for capaturing MMIO accesses. fault injection in write operation. 2. There are only four watch points that • Overriding MMIO functions. can be used. This may be not enough in a complex environment. Memory mapped IO access can be cap- tured by overriding MMIO functions, • Trapping MMIO access by using Page- such as readb() and writeb(). Fault Exceptions. The major advantage is this method is platform-independent because all Linux Like normal memory, MMIO is handled platforms support these MMIO functions. by a page-protection mechanism. There- Disadvantages are fore, MMIO access can be intercepted by capturing page faults. The method clears 1. Any driver that accesses MMIO the PE (PRESENT) bit of the PTE (Page without using the standard MMIO Table Entry) of the MMIO address so that Linux Symposium 527 the processor triggers a page-fault excep- diff −Nru a/arch/i386/mm/fault.c tion when MMIO is accessed. The major b/arch/i386/mm/fault.c advantages are −−− a/arch/i386/mm/fault.c Thu May 15 1. In IA32 architecture, this is a fault 15:52:08 2003 type of exception, which means that +++ b/arch/i386/mm/fault.c Thu May 15 15:52:08 the processor generates an exception 2003 before MMIO access is executed, so @@ −80,6 +81,9 @@ this method can do fault injection in /∗ get the address ∗/ write operation. __asm__("movl %%cr2,%0":"=r" (address)); 2. The driver does not need to be recompiled. There is a disadvan- + if (is_kmmio_active() && kmmio_handler(regs, tage in the method—because the address)) unit of intercepted MMIO is the size + return; of a page (4k in IA-32), an adja- + cent MMIO access may unnecessar- /∗ It’s safe to allow irq’s after cr2 has been saved ∗/ ily trigger an exception, system per- if (regs−>eflags & X86_EFLAGS_IF) formance might be impacted. local_irq_enable(); Figure 2: Patch against fault.c Based on FITH requirements and our analysis above, we implemented a page-fault method to capture MMIO. 4.2 Implementation We also followed the same usage style like Then, KMMIO tries to recover normal exe- what kprobes provides, with a register_ cution. It sets the page as PRESENT. But kmmio_probe() function for adding KMMIO needs to do more than this. Be- the probe, and a unregister_kmmio_ cause the probe should be re-enabled after cur- probe() function for removing the probe. rent instruction which trigger the page-fault The register_kmmio_probe adds the exception, KMMIO enables single-step before probe into internal list and set the page exiting page-fault exception. Similar to the which the probe is on as UNPRESENT. After change to faults.c, KMMIO needs to patch register_kmmio_probe, any access on traps.c to get the control when the single- the page will trigger a page-fault exception step exception is triggered. and fall into KMMIO core. After the instruction, which triggers the page- To get the control in page-fault exception, fault execption, is executed, a single-step ex- we need some tweaks to faults.c.

Fault Injection Test Harness a Tool for Validating Driver Robustness

Filesystem Hierarchy Standard

Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux

Information on Open Source Software When Filling in The

(C) 1995 Microsoft Corporation. All Rights Reserved

Université De Montréal Low-Impact Operating

Linux IPCHAINS-HOWTO

Unreliable Guide to Locking

Virtio PCI Card Specification V0.9.5 DRAFT

Open Source Software License Information

Rusty's Unreliable Guide to Kernel Hacking

Linux Advanced Routing & Traffic Control HOWTO

Buried Alive in Patches: 6 Months of Picking up the Pieces of the Linux 2.5 Kernel