DEPENDABLE ADAPTIVE COMPUTING SYSTEMS STANFORD CRC ROAR PROJECT FINAL REPORT

1. Introduction

Adaptive Computing Systems (ACS), comprising processors, memory and reconfigurable logic (e.g., user-programmable logic elements such as Field Programmable Gate Arrays, or FPGAs), represent a promising technology that is well-suited for applications such as digital signal processing, image processing, control, bit manipulation, compression and encryption. For these applications, ACS provides versatility and parallelism simultaneously due to the presence of reconfigurable logic such as FPGAs. FPGAs provide the capacity to implement applications with parallelism due to the abundance of reprogrammable Configurable Logic Blocks (CLBs) and routing resources. Figure 1.1 shows the architecture of a typical adaptive computing system [Rupp 98].

The Defense Advanced Research Projects Agency (DARPA) has sponsored several ACS efforts over the past years, both in academia and industry. At the Stanford Center for Reliable Computing (CRC), we proposed using the opportunities offered by ACS to design dependable(1) computing systems for applications such as nuclear reactor control, fly-by-wire systems, and remote locations (e.g., space stations and satellites). This proposal was funded by DARPA in 1997 under the name “Dependable Adaptive Computing Systems” (http://crc.stanford.edu). Another related effort on dependable ACS is the project called “On-line Testing and Fault-Tolerance in FPGAs” at Bell Labs [Abramovici 99]. Note that the objective of our project is different from that of the Teramac project at Hewlett-Packard Labs, where the main goal is to achieve defect-tolerance in FPGAs by using configurations that do not use defective resources [Culbertson 97]. The challenges in our project are on-line error detection during system operation, very fast fault location, and quick recovery from temporary and permanent failures.

(1) Dependability comprises reliability, availability, testability, maintainability and fault-tolerance.


Figure 1.1. ACS architecture

Design of dependable adaptive computing systems can create a significant impact for space applications by reducing the cost of very expensive electronics while delivering high performance. The current practice is to use radiation-hardened integrated circuits for space applications. Radiation hardening is a very expensive process requiring very special designs that are not widely available. Moreover, radiation-hardened ICs are several generations behind the most advanced technologies. For example, while commercial processors used today have clock frequencies in the GigaHertz range, the most advanced radiation-hardened processors belong to the generation of the IBM RISC 6000. Thus, radiation-hardened processors are not only expensive, but also slow. For space applications, special boards with commercial processors will require multiple microprocessor chips for error detection and standby spares. Hence, organizations like DARPA, NASA and Los Alamos National Laboratories are interested in mitigating the cost of fault-tolerance for space applications by taking advantage of the ACS technology. Moreover, for unmanned deep space applications, it may be very difficult to change the functionality of computer systems on the fly; the flexibility of the ACS technology comes to the rescue in such circumstances.

For terrestrial applications, adaptive computing systems open new opportunities in designing dependable systems by eliminating the need for standby spares, as described in this report. The conventional approach to system recovery in fault-tolerant systems is to replace faulty parts (through board swapping, for example). With dependable ACS architectures, it is possible to repair a faulty system very fast with minimal human intervention. Not only do we eliminate the need for standby spares, but we also reduce the system repair cost and time significantly. The reduction in the system repair time reduces the system downtime; this, in turn, increases the system availability significantly.

This report describes dependable architectures for adaptive computing systems that have been developed as a part of the DARPA-sponsored ROAR project at the Stanford Center for Reliable Computing (CRC). ROAR is an acronym for Reliability Obtained by Adaptive Reconfiguration. The main idea is to develop: (1) Concurrent Error Detection (CED) techniques to detect errors while the ACS is in operation; (2) fault-location techniques to identify the defective part (e.g., the defective CLB or routing resource); and (3) recovery techniques to reconfigure the system to operate without using the faulty CLB or routing resource. Unlike conventional fault-tolerant systems where the Field Replaceable Unit (FRU) is a chip or a board, the FRU for an ACS is a CLB or a routing resource; thus, there are opportunities to repair the faulty system taking advantage of the fine granularity of the FRU.

The objectives of the ROAR project are: (1) to develop design techniques that allow for the generation of highly dependable adaptive computing systems; (2) to increase the availability of unattended systems by an order of magnitude through self-repair methods based on the reconfiguration capability of the designs; (3) to eliminate the need for standby spares, since adaptive configuration allows the use of all of the resources for performance; and (4) to develop redundancy techniques, such as diversified designs, that will protect systems from common-mode failures. If these objectives are achieved, then it is possible to eliminate expensive radiation hardening techniques and design low-cost dependable reconfigurable systems with considerably higher flexibility and performance.

Section 2 presents an overview of the hardware model of FPGAs. In Sec. 3, we present ACS architectures that have been designed in the ROAR project. Section 4 presents an overview of the concurrent error detection, fault-location and recovery techniques we have developed for applications mapped on the reconfigurable hardware; these techniques enable dependable system design using the architectures described in Sec. 3. In Sec. 5 we describe the architecture of a self-healing soft-microprocessor developed in the ROAR project. We conclude in Sec. 6.

2. Hardware Model of FPGAs

Figure 2.1 shows the model of the reconfigurable hardware used in this report. This model is consistent with contemporary SRAM-based FPGAs [Xilinx 01]. In this model, the programmable logic array of the reconfigurable hardware consists of three basic elements: Configurable Logic Blocks (CLBs), Connection Boxes (CBs), and Switch Boxes (SBs).

Figure 2.1. FPGA architecture: an array of CLBs interconnected through connection boxes (CBs) and switch boxes (SBs); one CLB column is highlighted.

In this architecture, a CLB contains SRAM lookup tables (LUTs) that store the truth tables of user-defined combinational logic functions. In this way, a combinational logic gate is realized by looking up the value in the LUT that is addressed by the corresponding gate inputs. Because LUTs are implemented in SRAM cells, they can also be configured as memory modules in user applications. Here, memory modules refer to all types of memory implemented as RAM modules, such as RAMs and FIFOs. Bistables such as flip-flops and latches, however, are excluded from the memory modules implemented in LUTs because of area efficiency. Instead, they are implemented as separate modules in each CLB to realize sequential logic. Each CLB may also contain other dedicated circuitry in order to enhance functional flexibility and to speed up certain paths with long delay, such as the carry chain in a ripple-carry adder [Xilinx 01].
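As an illustration of the LUT mechanism described above, the following sketch models a 3-input LUT in software. The function name, bit ordering and example function are our own illustrative choices, not taken from any vendor documentation: the truth table of a combinational function is stored as eight bits, and the gate output is obtained by using the gate inputs as the address into that table.

    # Minimal sketch of a 3-input LUT: an 8-entry truth table addressed by the inputs.
    def make_lut(truth_table_bits):
        """truth_table_bits: 8 bits; index (a<<2)|(b<<1)|c holds f(a, b, c)."""
        assert len(truth_table_bits) == 8
        def lut(a, b, c):
            return truth_table_bits[(a << 2) | (b << 1) | c]
        return lut

    # Example: f(a, b, c) = a AND (b XOR c)
    and_xor = make_lut([0, 0, 0, 0, 0, 1, 1, 0])
    assert and_xor(1, 0, 1) == 1
    assert and_xor(0, 1, 1) == 0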

To implement a logic network, CLBs are connected through horizontal and vertical wiring channels located between two neighboring rows or columns. To enhance the connectivity, there are various kinds of wires with different lengths for connecting CLBs that are separated by variable numbers of blocks. For example, in Xilinx Virtex-series FPGAs, single lines connect adjacent CLBs, while hex lines connect CLBs that are three or six blocks apart [Xilinx 01]. There are two types of routing devices, CBs and SBs, to direct the signal flows among CLBs and wiring channels. CBs route the inputs and outputs of a CLB to the adjacent wiring channels. SBs connect horizontal and vertical wiring channels. Both CBs and SBs are matrices of Programmable Interconnect Points (PIPs). The states of the PIPs in these switch matrices are controlled by SRAM cells, which are configured according to the desired functionality.

Logic functions in the reconfigurable hardware are configured through the configuration bit-stream, which contains a mixture of commands, configuration data, SRAM cell addresses for storing configuration data, and control overhead such as parity checks. Configuration data contain the values in LUTs and the interconnect states. The entire configuration data of the reconfigurable hardware are partitioned vertically or horizontally into configuration frames, which are the smallest amount of data that can be accessed with a single configuration command.

Data stored in configuration memory can be accessed by two types of operations: configuration readback and configuration writeback. Configuration readback refers to the operation of reading the on-chip configuration memory contents out of the chip through designated ports. This operation does not affect user applications running in the reconfigurable hardware. Configuration writeback refers to the operation of writing valid configuration data into the on-chip configuration memory cells. Because the functional definition of the reconfigurable hardware is changed by configuration writeback, user operations may be affected. Partitioning the entire configuration data into frames enables concurrent partial reconfiguration of the functional definition of the circuit. Because configuration writeback can be executed frame by frame, part of the configuration can be altered without stopping the operations running in the reconfigurable hardware, provided these operations do not interact with the frames under configuration writeback.
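To make the frame-based access model concrete, the following sketch models configuration memory as a list of frames with readback and writeback operations; the class name, frame count and frame size are purely illustrative and do not correspond to any particular device or vendor interface.

    # Illustrative model of frame-based configuration access (not a vendor API).
    class ConfigMemory:
        def __init__(self, num_frames, frame_size_bits):
            self.frames = [[0] * frame_size_bits for _ in range(num_frames)]

        def readback(self, frame_index):
            # Reading a frame out of the device does not disturb the user logic.
            return list(self.frames[frame_index])

        def writeback(self, frame_index, frame_data):
            # Writing a frame redefines only the logic/routing controlled by that
            # frame; operations that do not use this frame keep running.
            self.frames[frame_index] = list(frame_data)

    cfg = ConfigMemory(num_frames=48, frame_size_bits=384)
    golden = cfg.readback(7)        # snapshot of one frame
    cfg.writeback(7, golden)        # partial reconfiguration of that frame only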

3. Dependable ACS Architectures

In this section, we describe two main system architectures, the Dual-FPGA and the MT-ACS architectures. These architectures have the following characteristics.

• Provide the flexibility of reconfigurable systems.

• Guarantee system data integrity.

• Protect from temporary, permanent, common-mode and multiple failures.

• Allow design of dependable systems with built-in recovery capabilities and minimal performance impact.

• Eliminate the need for standby spares even for remote unmanned applications.

3.1. Dual-FPGA ACS Architecture

The Dual-FPGA architecture is illustrated in Fig. 3.1. The architecture contains two reconfigurable modules implemented using two FPGA chips, and each module can be reconfigured to run certain user applications. In addition, we have a controller (e.g., an 8051 micro-controller) implemented on each module. The Dual-FPGA ACS architecture allows several implementation scenarios for dependable adaptive computing systems. We discuss each of these scenarios in this section.

Figure 3.1. Dual-FPGA ACS architecture: two reconfigurable FPGA modules (each with a controller) and a memory.

The general ACS architecture shown in Fig. 1.1 can be mapped on the Dual-FPGA architecture. In this scenario, a microprocessor can be implemented on one of the FPGA modules. We call this a “soft-microprocessor”. An overview of a soft-microprocessor will be described in Sec. 5 and the details are presented in [Saxena 01]. Applications that are supposed to be executed on the reconfigurable hardware can be mapped on the other FPGA module. The Dual-FPGA ACS architecture can be used to implement two separate applications or different parts of the same application on the two FPGA chips. In a third scenario, the architecture can be used to implement a full duplex system running in a lock-step manner (very much like the systems from Sequoia and Tandem). In this case, the same application is executed on the two FPGAs and the results are compared.

In the context of implementing dependable applications on the Dual-FPGA architecture, the following issues must be addressed:

1. Detection of errors in the FPGAs (described in Sec. 4.1),
2. Recovery from temporary failures that affect the configuration bits of the FPGAs (Sec. 4.2.1),
3. Recovery from temporary failures that affect the internal registers and states of the applications mapped on the FPGAs (Sec. 4.2.2),
4. Recovery from permanent faults (Sec. 4.3),
5. Location of permanent faults (Sec. 4.4),
6. Fault-tolerant communication between the FPGAs, and detection of and recovery from failures in the I/O pins of the FPGAs (Sec. 4.5),
7. Detection of and recovery from failures in the storage (memories) shown in Fig. 3.1 (Sec. 4.5),
8. Failure of the reconfiguration circuitry of the FPGAs (Sec. 4.5),
9. Detection of and recovery from bus failures (Sec. 4.5).

Each application mapped on the FPGAs is designed with built-in concurrent error detection and autonomous recovery techniques. Section 4.1 presents an overview of the concurrent error detection techniques that can be used for user applications mapped on the reconfigurable hardware. These concurrent error detection techniques maintain system data integrity, detect errors while the system is in operation and report errors to the micro-controllers. The controllers exchange error status signals between the two FPGA modules through various protocols such as watchdog timeouts, heart-beat signals, encoded protocols and error signals. The controllers also internally monitor the status of the built-in concurrent error detection methods implemented within each FPGA.

Not all temporary errors within an FPGA need the controller for recovery; temporary errors that affect the system registers and bistables (e.g., flip-flops) can be corrected by the built-in autonomous recovery techniques (such as masking redundancy, roll-back and roll-forward recovery) discussed in Sec. 4.2. Temporary failures affecting the configuration bits of the FPGAs cannot be corrected by the built-in recovery techniques. The controllers are used for this purpose. For example, the controller on FPGA2 detects a non-recoverable error signal from FPGA1 and initiates recovery from temporary faults affecting the configuration bits of FPGA1. This recovery technique is also described in Sec. 4.2. After recovery from temporary faults, the system performs a retry of the computations from the last check-pointed state. If the fault still persists after a pre-determined number of retries, the permanent fault recovery procedure is initiated.
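The cross-monitoring loop described above might look roughly like the following sketch; the signal names, retry limit and procedure names are hypothetical, chosen only to illustrate how a controller could escalate from temporary-fault recovery to permanent-fault recovery.

    # Hypothetical sketch of one controller monitoring its peer FPGA (illustrative
    # names and thresholds; the actual controller protocols are described in the text).
    MAX_RETRIES = 3

    def monitor_peer(read_heartbeat, read_ced_error, scrub_configuration,
                     rollback_and_retry, load_precompiled_configuration):
        retries = 0
        while True:
            if not read_heartbeat() or read_ced_error():
                scrub_configuration()      # rewrite peer configuration from ECC-protected storage
                rollback_and_retry()       # retry from the last check-pointed state
                retries += 1
            else:
                retries = 0                # peer healthy; clear the retry count
            if retries > MAX_RETRIES:      # fault persists: treat it as permanent
                load_precompiled_configuration()
                retries = 0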

For recovery from permanent faults in FPGA1, the micro-controller on FPGA2 initiates the permanent fault recovery process through reconfiguration using pre-compiled configurations, as described in Sec. 4.3. For loading the appropriate pre-compiled configuration, the micro-controller on FPGA2 must execute a quick fault-location procedure described in Sec. 4.4. The pre-compiled configurations are stored in the memory (shown in Fig. 3.1) and protected by error-correcting codes (ECCs). For faults affecting FPGA2, similar procedures are executed.

It is clear from the above discussion that the communication channels between the controllers on FPGA1 and FPGA2 play a vital role in system recovery. The hardwired communication channels can become faulty (due to temporary and permanent faults) and must be made fault-tolerant using techniques described in [Yu 01c] and Sec. 4.5. Other sources of single points of failure in the Dual-FPGA architecture are faults in bus lines, faults in memories (both data and address faults) and faults in the reconfiguration logic of the FPGAs. Techniques for eliminating these single points of failure are also described in Sec. 4.5.

Several strategies can be adopted for recovering from faults affecting the micro-controllers on the two FPGAs. The most straightforward scheme is to have three copies of the controller on each FPGA and perform voting on their outputs using Triple Modular Redundancy (TMR). Thus, fault-masking is used to recover from controller faults. However, TMR of controllers may be expensive. The following approach can be used for recovering from controller faults without designing a TMR of controllers. Concurrent error detection (CED) techniques can be built into the design of the micro-controllers mapped on the FPGAs. The controller on each FPGA can monitor the error signals from the CED techniques implemented in the controller on the other FPGA and initiate recovery procedures when errors are reported. For example, the controller on FPGA2 monitors the error signals from the CED techniques implemented in the controller on FPGA1. When an error is reported, the controller on FPGA2 first initiates a recovery procedure from temporary faults affecting the configuration bits and the flip-flops and registers (Sec. 4.2). Next, the system performs a retry; after a pre-determined number of retries, if the fault still persists in the controller on FPGA1, the controller on FPGA2 initiates a recovery process from permanent failures (Sec. 4.3). Thus, the Dual-FPGA architecture is completely adaptive and can recover itself from temporary and permanent failures without requiring any human intervention or standby spares.

While human intervention may be possible for some applications (e.g., terrestrial applications), it is either impossible or too expensive for space applications. Most importantly, in deep space missions (remote exploration) it may not be possible to download or upload data via communication channels to or from the spacecraft. Hence, the Dual-FPGA architecture is applicable for terrestrial as well as unmanned applications such as remote space exploration.

3.2. Multi-Threaded ACS (MT-ACS) Architecture

The Multi-Threaded ACS (MT-ACS) architecture is an extension of the conventional ACS architecture described in [Rupp 98]. Conventional ACS architectures augment the traditional model of computation by adding a reconfigurable coprocessor. As shown in Fig. 3.2, we have a processor, memory, I/O devices and a reconfigurable coprocessor connected to the bus. FPGAs can be used to implement the reconfigurable coprocessor. The idea is that applications (or parts of applications) exhibiting parallelism can be mapped and executed on the reconfigurable coprocessor (as determined by the compiler). As shown in Fig. 3.2, in the MT-ACS architecture, we have a multi-threaded processor [Thorton 64, Smith 78].

Figure 3.2. MT-ACS architecture: a multithreaded processor, memory, I/O system and reconfigurable coprocessor connected through a bus interface.

A multi-threaded processor is capable of holding multiple thread contexts. While multi-threading is not a new idea, the idea of using multi-threading for fault tolerance in processors and reconfigurable logic is new and was first proposed in [Saxena 98, Saxena 99, Saxena 00]. Use of multi-threading for fault-tolerance has also been explored in [Rotenberg 99, Reinhardt 00]. Fault tolerance is accomplished by using multiple redundant threads of computation. For example, three copies of the same thread can be run in a multithreaded processor and the results voted on by a voter thread.

OS support is required to manage redundant threads and effect recovery. A key benefit of implementing fault tolerance with multithreading is the accomplishment of a level of reliability similar to that of NMR at almost the cost of simplex hardware. Another key benefit is that the implementation of fault tolerance does not require any new design feature and uses all of the architectural features that are already present in multithreaded processors. By implementing fault-tolerant applications using multiple threads, it is possible to recover from temporary faults and from some permanent faults. More details on the use of multi-threading for fault-tolerance, redundancy models and reliability analysis can be obtained from [Saxena 00].

Reliability plots highlighting the reliability improvement with the MT-ACS architecture are shown in Fig. 3.3. The failure rate used for the reliability plots is 10^-7 upsets/bit/day. This is a realistic failure rate for SEUs (Single Event Upsets) in space as reported by SEU experiments in space [Shirvani 00a]. We assume that the processor is running at a 100-MHz clock rate. With this clock rate, the failure rate translates to 10^-12 upsets/bit/cycle. In Fig. 3.3, the reliability of traditional Software TMR (STMR) at N = 300 × 10^9 simplex execution cycles would approximately correspond to 900 × 10^9 execution cycles (three threads) and the voter cycles. Figure 3.3 shows that applications executing on the MT-ACS platform exhibit a factor of 10 improvement in reliability over the traditional ACS simplex system and a significant reliability improvement over the ACS-STMR (ACS with software TMR for the processor) platform. Similar results can be obtained for processor frequencies in the GigaHertz range.

Figure 3.3. Reliability improvement with the MT-ACS architecture
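As a purely illustrative sketch of the redundant-thread idea above (the function names, the use of Python threads and the majority rule are our own simplifications, not the MT-ACS implementation), three copies of a computation can be launched and a voter can compare their results:

    # Hedged sketch of thread-level redundancy with a software voter.
    import threading
    from collections import Counter

    def run_redundant(task, arg, copies=3):
        results = [None] * copies
        def worker(i):
            results[i] = task(arg)
        threads = [threading.Thread(target=worker, args=(i,)) for i in range(copies)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # Voter logic: a majority value masks a single faulty thread;
        # no majority means data integrity cannot be guaranteed.
        value, count = Counter(results).most_common(1)[0]
        if count >= 2:
            return value
        raise RuntimeError("no majority: redundant threads disagree")

    print(run_redundant(lambda x: x * x, 12))   # -> 144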

For the portions of any given application that are mapped on the reconfigurable coprocessor, the concurrent error detection, fault-location and recovery techniques described in Sec. 4 can be used. Fault-tolerance through multi-threading protects the processor from temporary (transient and intermittent) failures because recomputation can be done safely and data integrity is guaranteed. For permanent faults affecting the processor, data integrity (correct outputs, or an error signal when incorrect outputs are produced) is not guaranteed with multi-threading. Moreover, system recovery from permanent faults in the processor requires board swapping or replacement of the microprocessor chip unless special recovery structures using standby spares are available. Thus, the MT-ACS architecture needs human intervention for permanent fault recovery and may not be suitable for unmanned applications (e.g., remote space exploration) unless special designs, e.g., standby spares, are used. Standby spares add to the system cost; moreover, as shown in [Ogus 75], for realistic scenarios where the circuit for switching standby spares into the system can be faulty, we can use a maximum of 2 or 3 standby spares. Thus, system recovery using standby spares has limited applications. The Dual-FPGA architecture, described in Sec. 3.1, addresses this problem.

4. Concurrent Error Detection, Fault-location and Recovery Techniques

In this section, we describe the concurrent error detection, fault-location and recovery techniques that have been developed in the ROAR project. These techniques are used in the two ACS architectures described in Sec. 3 to design dependable adaptive computing systems.

4.1. Concurrent Error Detection Techniques

Concurrent error detection (CED) is a major component of any dependable system. CED techniques detect errors during system operation. The major objective of any CED technique is to guarantee system data integrity. By data integrity we mean that the system either produces correct outputs or produces an error signal when incorrect outputs are produced. In the literature on fault-tolerance, data integrity has also been referred to as the fault-secure property [Pradhan 96]. Error detecting and correcting codes are examples of CED techniques used for semiconductor memories. Depending on the application, CED techniques can be applied at various granularity levels. For example, special CED techniques can be designed to detect errors at the outputs of combinational and sequential logic circuits in any design.

On the other hand, system-level CED techniques can be designed. For example, we can have two processors executing the same application and the outputs of the two processors can be compared. As explained before, for an ACS application, the FRU is a Configurable Logic Block (CLB) or a routing resource of user-programmable logic elements like FPGAs. Hence, it is important to implement CED for the combinational and sequential logic blocks and registers of a reconfigurable system. For the system to recover from errors after error detection, any system with CED must be equipped with sufficient capabilities to locate the faulty part and perform recovery. The fault-location and recovery techniques implemented in the ROAR project in conjunction with the CED techniques are described in Sec. 4.2, 4.3 and 4.4.

The major concurrent error detection techniques studied in the ROAR project are: (1) simple duplication, (2) diverse duplication [Mitra 99a, Mitra 99b, Mitra 00b, Mitra 00c, Mitra 00d, Mitra 01a, Mitra 01b, Mitra 01c], (3) parity prediction [Zeng 99], (4) multi-threading [Yu 00], and (5) inverse comparison [Huang 00a, Huang 00b].

Hardware duplication is a simple way of implementing concurrent error detection (CED). Two implementations of the same logic function are used and their outputs compared; an error is reported when a mismatch occurs. Duplication has been used in many systems including the IBM G5 and G6 processors [Spainhower 99]. Duplication guarantees data integrity in the presence of a single failure. However, data integrity is not guaranteed in the presence of multiple failures or Common-Mode Failures (CMFs). It has been observed that CMFs, resulting from failures that affect both implementations at the same time, generally due to a common cause, constitute a significant source of failures in duplex systems [Lala 94]. These include operational failures that appear during system operation and may be due to external causes (such as EMI, power-supply disturbances and radiation), internal causes, or design mistakes. CMFs are surveyed in [Mitra 00a].

Design diversity was proposed in [Avizienis 84] to protect redundant computing systems from common-mode failures. The idea of design diversity is to use two “different” implementations of the same logic function, generated “independently” during duplication, so that common failure modes create different effects in the two implementations. However, the concept of design diversity as described in [Avizienis 84] was qualitative.

We have developed a metric [Mitra 99a, Mitra 99b] that quantifies diversity. The metric is applicable to both combinational and sequential circuits [Mitra 01a]. Fast techniques for computing this metric are described in [Mitra 01b]. Using this metric, we developed a technique for designing diverse combinational logic modules for duplex systems [Mitra 00b, Mitra 01c] and developed a system-level concurrent error detection technique that uses a combination of diverse duplication and parity prediction [Mitra 00c]. It was also demonstrated in [Mitra 00c] that, although diverse duplex systems sometimes have marginally higher area overhead, they provide significant protection from multiple failures and CMFs compared to other CED techniques like parity prediction. In addition, we developed test point insertion techniques to detect CMFs in duplex systems by scheduling test sessions during idle cycles or at the end of a computation. Our results show that for diverse duplex systems, the number of test points needed for 100% coverage of all modeled CMFs is orders of magnitude less than that required for systems employing simple duplication [Mitra 00d].

Traditionally, design diversity has two major costs associated with it: manufacturing cost and development cost. In an ACS environment, we can create diversity by synthesizing and downloading different implementations into the reconfigurable coprocessor at any time. Thus, there is no need to manufacture multiple diverse ASICs. The development cost is reduced by the use of automated tools for designing diverse duplex systems.

Parity prediction is another popular CED technique that has been used in numerous systems. Techniques for designing logic circuits with parity prediction, developed in the ROAR project, are described in [Zeng 99].

Multi-threading can be used for concurrent error detection in applications mapped on the reconfigurable coprocessor [Yu 00]. For algorithms executed in pipeline stages, some resources are under-utilized (idle) due to data dependencies and memory latency. By scheduling multiple independent threads of computation, idle resources can be reclaimed. To implement CED, two copies of the same algorithm are executed as two threads on the FPGA and the results from the different threads are compared. The replicated instructions of the redundant thread can be inserted into the idle stages to minimize timing overhead. In a fault-free scenario, identical outputs are produced by both threads. When faults are present, the redundant threads provide a means for fault detection and masking.

Multi-threading was used for CED in a robotic control application implemented on FPGAs in the ROAR project [Yu 00]. We developed several reconfigurable logic designs to implement the control algorithm of a robot arm controller with multi-threading capability for error detection. Unlike general-purpose microprocessors, another flexibility we had in reconfigurable logic was to control the precision of the arithmetic units in the implementation. We found that even with 16-bit arithmetic precision, the control system attained response time and stability similar to those obtained using 32-bit arithmetic precision. The performance obtained, measured in terms of the time to compute the force output, was better than that obtained with a C program implementation running on a 300-MHz UltraSPARC II processor. Synthesis results for the Xilinx Virtex FPGAs show that the system can run at 139 MHz.

Figure 4.1. Inverse comparison CED scheme.

Inverse comparison is an application-specific CED technique that can be used for LZ-based data compression algorithms [Huang 00a]. For LZ compression, the uniqueness of its inverse transformation enables us to verify the output integrity. The inverse comparison CED scheme is shown in Fig. 4.1. An additional decoder decompresses the encoded (compressed) codewords, and a checker compares the reconstructed data with the delayed source input. Note that, if the throughput of the decompressing process can match that of the encoder, we can pipeline the error detection computations so that the overall execution time overhead on the total cycle count is small. In principle, inverse comparison can be used for any 1-to-1 mapping application. However, to avoid a large hardware overhead, the complexity of the inverse process should be significantly less than that of the actual computation. The complexity of LZ decompression is only that of a memory access, whereas LZ compression involves string matching in the encoder (compressor). Hence, LZ decompression is much simpler than the LZ compression process.

Emulation results on the Wildforce FPGA board from Annapolis Microsystems (http://www.annapmicro.com) show an area overhead of 9.2% and a delay overhead of 0.5% with inverse comparison for the LZ data compression application [Huang 00a]. Implementation results of LZ compression with inverse comparison on the Wildforce board show the potential of two orders of magnitude improvement in throughput compared to software implementations [Huang 00a].
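The inverse-comparison idea can be sketched in software as follows; zlib stands in here for the hardware LZ encoder/decoder pair, and the function name and error handling are illustrative only (the actual design pipelines the decoder and checker against a delayed copy of the source stream).

    # Hedged sketch of inverse-comparison CED for a lossless (1-to-1) transform.
    import zlib

    def compress_with_inverse_check(source: bytes) -> bytes:
        codeword = zlib.compress(source)           # forward computation (encoder)
        reconstructed = zlib.decompress(codeword)  # inverse computation (decoder)
        if reconstructed != source:                # checker: compare with the source input
            raise RuntimeError("CED error: output failed inverse comparison")
        return codeword

    data = b"abracadabra" * 100
    print(len(compress_with_inverse_check(data)), "bytes after compression")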

4.2. Recovery from Temporary Failures

There are two types of failures that can affect system operation: temporary and permanent failures. Temporary failures can be caused by intermittent failures (possibly due to weakness in the part) or transient failures (e.g., Single Event Upsets or SEUs in space, power-supply disturbances, radiation from packaging material). In an FPGA-based system, temporary failures can affect the configuration bits of the FPGAs or the contents of the system bistables. Temporary failures affecting the configuration bits of the FPGAs effectively change the designs implemented on the FPGAs and have permanent effects unless the FPGAs are reconfigured. For temporary failures in the sequential elements or configuration bits of a reconfigurable system, a reload of the contents from a safe storage (protected by error correcting codes) is sufficient.

4.2.1. Recovery from Temporary Failures in FPGA Configuration Bits

The online partial reconfiguration feature of FPGAs can be used to read back and modify the data in the configuration memory cells of FPGAs without stopping the circuit loaded in the chip. Virtex FPGAs from Xilinx exhibit this capability [Kelem 00]; for these FPGAs, the configuration bits of each column of the FPGA can be read out, modified and written back without stopping the system. This read-back and partial reconfiguration feature can be used to locate failures affecting the FPGA configuration bits and recover from these failures. To locate failures affecting the configuration bits, we can read out the configuration bits of each column and compare the read-out configuration with the configuration data stored in safe memory storage protected by error correcting codes. For comparison purposes, we can calculate a signature (e.g., a Cyclic Redundancy Check or CRC) of the read-out FPGA configuration and compare it with the signature of the correct configuration data; the number of bits in the signature is generally much smaller than the number of bits in the actual configuration data.

If the signatures do not match, we know that the column under consideration contains erroneous configuration bits.

After a faulty FPGA column is located, the system can recover from temporary failures affecting the configuration bits by simply reloading the contents of the FPGA configuration bits with the correct configuration data stored in safe storage (ECC protected). However, the process is not straightforward, because some of the data in the configuration frames of FPGAs (e.g., the Virtex series [Kelem 00]) are memory contents in user applications, and their values vary dynamically while the application is running in the FPGA. The data in the memory contents of the frame may change due to system-level operations between the read-back and write-back of the frame. If the configuration data is corrected by a periodic scrubbing process [Carmichael 99], then the scrubbing process may run concurrently with the normal user processes in the system. In attempting to rewrite the configuration bits, there is a potential of overwriting memory bits that are changed during run-time operations. This memory coherence problem is illustrated in Fig. 4.2.
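The column-by-column fault-location step described above can be sketched as follows; CRC-32 as the signature, the column granularity of the read-back callback and all names are illustrative assumptions rather than the exact parameters of the implementation.

    # Hedged sketch of signature-based location of faulty configuration columns.
    import zlib

    def locate_faulty_columns(read_back_column, golden_signatures, num_columns):
        """Compare a CRC of each read-back column against the stored golden CRC."""
        faulty = []
        for col in range(num_columns):
            if zlib.crc32(read_back_column(col)) != golden_signatures[col]:
                faulty.append(col)      # this column holds erroneous configuration bits
        return faulty

    # Example with in-memory "columns": column 2 has a flipped bit.
    golden = [bytes([c] * 64) for c in range(4)]
    observed = [bytearray(g) for g in golden]
    observed[2][10] ^= 0x01
    signatures = [zlib.crc32(g) for g in golden]
    print(locate_faulty_columns(lambda c: bytes(observed[c]), signatures, 4))   # -> [2]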

Figure 4.2. Illustration of the memory coherence problem: contents of LUT R1C1 and RAM R2C1 before read-back, before write-back, and after write-back.

In the example in Fig. 4.2, suppose that an 8-entry Look-Up Table (LUT) in the CLB at the first row of the first column (R1C1) is configured as a truth table of value 0x04 to represent a 3-input logic function. Also, the LUT in the CLB at the second row of the same column (R2C1) is configured as a RAM module. Suppose that an SEU occurs in the FPGA such that the value stored in LUT R1C1 changes from 0x04 to 0x05 before configuration read-back. Also, the RAM module in R2C1 stores a value 0xAB prior to the read-back process of the frame. However, suppose that, after configuration read-back, system-level operations update the value in RAM R2C1 to 0x84 before the write-back of the frames begins. After configuration read-back, the error due to the SEU can be located as previously explained.

Next, the correct configuration data must be written back to complete the transient error recovery. The values in RAM R2C1 are masked for verification purposes, and the read-back values are used in constructing the frames for write-back. As a result, the values in the first column after write-back become 0x04 for LUT R1C1 and 0xAB for RAM R2C1, regardless of the system-level operations between read-back and write-back. Since the write-back process is ignorant of the update, the old value 0xAB will be incorrectly written back into the RAM location R2C1.

[Kelem 00] explored this memory coherence problem in Xilinx Virtex-series FPGAs and suggested that LUTs configured as RAM modules should lie in different frame slices from LUTs configured as logic functions. This method avoids the memory coherence problem but imposes an extra limitation on the placement and routing of signal lines in FPGAs and may increase path delays due to the separation of logic functions from local RAMs or FIFOs. A system-level solution to the memory coherence problem that does not add constraints on placement and routing is to stall system-level operations once a write to the memory modules is issued to the frame under configuration error detection and correction. This is called a stall-when-write strategy.

Another solution to the memory coherence problem, using “dirty bits”, has been developed in [Huang 00c]. In this technique, there is a dirty-bit flag associated with each column in the FPGA. This flag is set when there is a write operation to a RAM location in a column that has already been read back. If an error is detected in the column under consideration, then the system stalls and the contents of the column are read back once more to record the updated contents of the RAM locations before the correct data is written back to that column. Analytical and simulation results indicate that the dirty-bit technique achieves more than an order of magnitude improvement in application performance over the stall-when-write strategy, as shown in Fig. 4.3.
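A rough sketch of the dirty-bit scrubbing flow for one column is given below; the function names, the merge helper and the stall/resume hooks are hypothetical placeholders used only to make the control flow explicit.

    # Hedged sketch of dirty-bit based scrubbing of a single column.
    def scrub_column(col, read_back, write_back, golden_signature, signature,
                     dirty, stall_system, resume_system, merge_golden_with_live_ram):
        data = read_back(col)
        if signature(data) == golden_signature(col):
            return                         # no configuration error in this column
        stalled = False
        if dirty[col]:                     # RAM in this column was written after read-back
            stall_system()
            stalled = True
            data = read_back(col)          # re-read to capture the updated RAM contents
            dirty[col] = False
        corrected = merge_golden_with_live_ram(col, data)   # golden logic bits + live RAM bits
        write_back(col, corrected)         # frame write-back restores the correct configuration
        if stalled:
            resume_system()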

Figure 4.3. Performance improvement with “dirty bits”: percentage of cycles stalled versus memory writes per cycle for the stall-when-write and dirty-bit strategies.

4.2.2. Recovery from Temporary Failures in System Bistables

Apart from FPGA configuration bits, temporary faults can also affect the internal flip-flops and registers (bistables) in the applications mapped on the FPGA. Since current FPGAs do not support any direct mechanism for reading out and writing back the contents of internal flip-flops and registers, special rollback and roll-forward recovery techniques have been designed into the system.

A major recovery approach, rollback recovery, is to roll the system back to a reliable state and retry the computations. In rollback recovery, system operation is backed up to some point in its processing, which is called a checkpoint. When an error occurs and is detected during operation, the system restores its state back to the previous checkpoint and re-computes. This introduces a sacrifice in the computation throughput. For recovery from temporary faults affecting the bistables in the applications executing on one of the FPGAs in the Dual-FPGA architecture, the micro-controller on the other FPGA can initiate checkpointing and roll-back upon error detection, and vice versa. Alternatively, applications mapped on FPGAs can have built-in hardware to perform checkpointing and roll-back recovery.

As described next, special hardware for rollback recovery has been implemented for the LZ compression application with inverse comparison studied in the ROAR project [Huang 00b]. In LZ compression with inverse comparison CED, a checkpoint can be defined as the completion of codeword generation. A checkpoint is passed if the corresponding codeword passes the verification of inverse comparison. Since previous correct outputs need not be regenerated all over again, it is efficient to roll back only to the last passed checkpoint. This approach generally has a short re-computing latency, but it may fail to recover normal operations if the last passed checkpoint has undetected faulty internal states.

In [Huang 00b], two techniques for rollback recovery in the LZ compression application are described; these techniques are called “reload-retry” and “direct-retry”. The reload-retry technique uses two copies of the encoding dictionary used by the LZ compression algorithm; the FSM controller for the reload-retry scheme is shown in Fig. 4.4. When an error is detected during normal operation, the system enters the reset mode and resets the control registers. Next, it reloads the dictionary used during LZ compression; if the dictionary size is N, it takes N cycles to reload the dictionary. After the dictionary reload, the system performs a retry operation. If no error is detected during the retry operation, the system enters the normal mode. If errors are detected for a pre-specified number of retries, the system assumes that the fault is permanent and initiates recovery from permanent faults.
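The reload-retry control flow can be summarized by the following sketch (the callback names and the retry limit are illustrative; the real design is a hardware FSM, shown in Fig. 4.4):

    # Hedged sketch of the reload-retry rollback recovery flow.
    def run_with_reload_retry(compute_from_checkpoint, error_detected,
                              reset_control_registers, reload_dictionary,
                              max_retries=3):
        """Retry from the last passed checkpoint after resetting control registers and
        reloading the duplicate dictionary; escalate if the error keeps recurring."""
        for attempt in range(max_retries + 1):
            result = compute_from_checkpoint()     # normal mode / retry mode
            if not error_detected():
                return result                      # back to normal operation
            reset_control_registers()              # reset mode
            reload_dictionary()                    # reload mode: N cycles for an N-entry dictionary
        raise RuntimeError("fault persists after retries: initiate permanent-fault recovery")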

Figure 4.4. FSM controller for rollback recovery

The direct-retry technique does not involve duplication of the encoder dictionary. Instead, during rollback, the system flushes the encoding dictionary.

The drawback of the direct-retry scheme is that there is a period after recovery during which the compression ratio can be significantly degraded, because historical data symbols are flushed out of the dictionary during recovery. Table 4.1 shows the characteristics and trade-offs associated with these two rollback recovery schemes.

Table 4.1. Characteristics of reload-retry and direct-retry rollback recovery techniques

                                              Reload-retry       Direct-retry
  Area overhead                               100%               0%
  Recovery time                               N = 512 cycles     64 to 256 cycles
  Compression ratio degradation
    during recovery                           0                  25% (approx.)

As mentioned earlier, the rollback recovery technique incurs performance overhead for the following reasons: (1) the system state (bistable contents) must be backed up at the checkpoints, and (2) the system has to perform recomputation starting from the last checkpoint upon error detection. The performance overhead associated with rollback recovery may be unacceptable for real-time applications with strict deadlines. The roll-forward recovery technique for TMR systems, described in [Yu 01a], overcomes the above problems and is applicable to real-time applications. By exploiting the redundancy in TMR, no system state backup at checkpoints or recomputation is needed for single module failures. This technique is illustrated in Fig. 4.5.

Figure 4.5. Illustration of roll-forward recovery: modules M1, M2 and M3 process inputs I0, I1 and I2; error checking occurs at scheduled points along the time axis (t0 to t3), and the state of a faulty module is restored from the fault-free copies.

As shown in Fig. 4.5, there are three modules M1, M2 and M3 in a TMR system, and the system outputs are obtained by voting on the outputs of the three modules. The voter used in this system has the additional capability of detecting errors when the outputs of all three modules do not match [Mitra 00e]. When a single fault occurs, the TMR system produces correct outputs through fault-masking.

Error checks are performed at pre-scheduled points during the computation (chosen based on the deadlines and on error rates such that the probability of multiple faults is negligible). At these pre-scheduled points it is checked whether the voter detected any errors during the time between the last point and the current point. If no error is detected, the system continues its normal operation. If an error is detected, the faulty module is identified and the bistables in the faulty module are reloaded with correct data obtained from the fault-free modules; this is called the state restoration step. After state restoration, normal operation continues. As shown in Fig. 4.5, for single module failures, the fault-free modules provide sufficient information for recovery. Therefore, no rollback is needed.

Two schemes for state restoration are described in [Yu 01a]: the voting scheme and the direct-load scheme. In the voting scheme [Adams 89], the bistables in the three modules are loaded with correct states obtained by voting among the corresponding bistables in the three modules. As shown in Fig. 4.6a, multiplexers (MUXs) are inserted between the registers and the signals that are connected to the registers during normal operation. During recovery, the voted signals are selected to be loaded into the registers. A new scheme, the direct-load scheme, described in [Yu 01a], has lower hardware overhead than the voting scheme. In the direct-load scheme, the bistables in a faulty module obtain correct states by loading states directly from the corresponding bistables in one of the correct modules. As shown in Fig. 4.6b, MUXs are added at the inputs of the bistables. During normal operation, the normal signals are selected to be loaded into the registers. When recovery is initiated, if the first module is faulty (indicated by Err0), then the contents of the bistables in the second module (which is correct under the single fault assumption) are loaded into the bistables in the first module. If the second module is faulty (indicated by Err1), the contents of the bistables in the third module are loaded into the bistables of the second module. Finally, if the third module is faulty (indicated by Err2), the contents of the bistables in the second module are loaded into the bistables of the third module.
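A software sketch of the error-detecting voter and the direct-load restoration rule is given below; the function names and the representation of module state as dictionaries are illustrative simplifications of the hardware described in [Yu 01a], not its actual logic design.

    # Hedged sketch of a TMR voter with error detection and direct-load restoration.
    def vote_with_error_flags(o1, o2, o3):
        """Return (majority output, per-module error flags); raise if all three disagree."""
        if o1 == o2 or o1 == o3:
            majority = o1
        elif o2 == o3:
            majority = o2
        else:
            raise RuntimeError("voter: all three module outputs disagree")
        return majority, (o1 != majority, o2 != majority, o3 != majority)

    def direct_load_restore(state1, state2, state3, err):
        """Copy bistable state into the faulty module from a fault-free one:
        module 1 loads from module 2, module 2 from module 3, module 3 from module 2."""
        err1, err2, err3 = err
        if err1:
            state1 = dict(state2)
        elif err2:
            state2 = dict(state3)
        elif err3:
            state3 = dict(state2)
        return state1, state2, state3

    out, flags = vote_with_error_flags(7, 7, 9)                     # module 3 disagrees
    s1, s2, s3 = direct_load_restore({"r": 7}, {"r": 7}, {"r": 9}, flags)
    print(out, flags, s3)                                           # 7 (False, False, True) {'r': 7}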

Figure 4.6. State restoration. (a) The voting scheme. (b) The direct-load scheme.

Markov reliability analysis for TMR systems with roll-forward recovery, performed in [Yu 01a], demonstrates that TMR systems with roll-forward recovery are at least an order of magnitude more reliable than conventional TMR systems. Thus, TMR systems with roll-forward recovery against temporary faults can run for mission times that are at least an order of magnitude longer than the useful mission times of conventional TMR systems. It has also been demonstrated in [Yu 01a] that the area overhead imposed by roll-forward recovery is negligible compared to the area overhead of a TMR system.

4.3. Recovery Techniques from Permanent Failures through Reconfiguration

After a permanent fault is detected and the faulty part of an FPGA is located, permanent fault recovery schemes that reconfigure the FPGA to replace the faulty part with originally unused resources are used to repair the system. Several approaches have been proposed in the literature to generate an alternative configuration rapidly at run-time. Emmert and Bhatia proposed a min-max grid matching technique to re-map the functional units that are originally placed in faulty CLBs [Emmert 97]. They also proposed an incremental routing technique to reroute the corresponding signals [Emmert 98]. For cluster-based FPGAs that group multiple LUT and flip-flop pairs into a single cluster to take advantage of design locality, Lakamraju and Tessier proposed a localized swapping technique to effectively tolerate intra-cluster faults [Lakamraju 00].

Also, Dutt et al. proposed node-covering techniques that reserve spare resources in order to facilitate the search for optimal replacement of faulty parts [Dutt 99, Hanchek 98, Mahapatra 99]. To further reduce the system downtime for permanent fault recovery, Lach et al. [Lach 98, Lach 99] proposed a tile-based precompiled configuration approach that moves the generation of alternative configurations to the design phase. The mapped circuitry is partitioned into tiles, and alternative configurations for each tile are pre-generated and stored in the system. The drawback of this precompiled configuration approach is the significant storage overhead in the system for all possible alternative configurations. Although the tile-partitioning approach in [Lach 98] reduces the storage requirement for alternative configurations to some extent, the storage overhead is still several times the original configuration size. Generally, the precompiled configuration approach minimizes the system downtime because alternative configuration versions are pre-generated; thus, re-mapping and rerouting of the user application circuitry is not necessary once the fault location is diagnosed. Another drawback of these previous approaches is the assumption that high-precision fault location techniques are available prior to reconfiguration. Although there are many fault location techniques published in the literature [Stroud 97, Stroud 98, Mitra 98, Das 99], these techniques require high computation complexity to diagnose a fault at the level of a Configurable Logic Block (CLB) or a Programmable Interconnect Point (PIP). Consequently, these techniques are difficult to implement at run-time and increase the overall downtime of the system.

In the ROAR project, we also use the idea of pre-compiled configurations to repair a reconfigurable system from permanent faults, because it ensures fast reconfiguration and a significant increase in the system availability. The basic concept of pre-compiled configurations is to generate different versions of the logic placement and routing information of the same application circuitry on the FPGA during the design phase. In each configuration version, a certain pre-determined portion of the FPGA is intentionally unused so that the corresponding configuration can tolerate permanent faults in the unused portion. This is possible because most designs do not use the whole FPGA [personal communication from Xilinx]. Since the different configuration versions are generated during the design phase, the system can switch between different configurations rapidly once the faulty region of the FPGA is located.

The reconfiguration time is of the order of milliseconds for current-generation FPGAs. In order to minimize the storage overhead of the pre-compiled configurations, data compression techniques are used. To achieve a good compression ratio, there should be close similarities among the bit-streams corresponding to the different pre-compiled configuration versions, so that the differences between two bit-streams can be efficiently compressed using run-length encoding techniques [Gersho 92]. This can be achieved by properly selecting partitioning and re-routing methods while generating the different configuration versions during the design phase.

To achieve this goal, a column-based pre-compiled configuration approach that can tolerate one faulty column in the FPGA was developed in [Huang 01a]. A column in an FPGA includes all CLBs (Configurable Logic Blocks) and the corresponding switch matrices (CBs and SBs) in the same column of the array. The actual circuitry, programmable logic and routing resources, and the configuration architecture of each CLB column are identical; a typical example of the FPGA architecture used in this report is the Xilinx Virtex series [Xilinx 01]. The main idea is to create column-based pre-compiled configurations in such a way that one configuration version can be derived from another by shifting one or more columns of the FPGA to the left or right. This technique achieves a higher compression ratio than the tile-based approach described in [Lach 98, Lach 99] because of the inherent similarity among the different configuration versions. The pre-compiled configurations are stored in memory protected by ECCs and are downloaded when recovery from permanent failures is initiated by the controllers, as described in Sec. 3.1.

24 column-based functional module in the mapped area is stored separately as a basis to construct alternative configurations. In this case, four alternative configurations are required in order to guarantee 1-column fault tolerance in the FPGA. In each alternative configuration, one of the mapped columns (column 1 to 4) in the base configuration is intentionally unused. All functional modules originally mapped in CLB columns with smaller column indices than the intentionally unused column remain in the same places in the alternative configuration. The other functional modules are shifted to the right by one column in the alternative configuration to avoid the intentionally unused column. For example, Figure 4.7b shows one of the alternative configurations, where column 3 is intentionally unused. Functional modules mapped in column 1 and column 2 in the base configuration (function A and B) remain in the same places. Functional modules mapped in column 3 and column 4 in the base configuration (function C and D) are shifted to the right to column 4 and column 5, respectively, to avoid using column 3. In the overlapping precompiled configuration scheme, the above re-mapping procedure in each alternative configuration creates several corresponding column sets. A corresponding column set is defined as a set of CLB columns in which certain functions are mapped in different configuration versions. For example, column 3 in Fig. 4.7a and column 4 in Fig. 4.7b belong to the same corresponding column set in which function C is mapped, and there are four corresponding column sets in this case. Corresponding columns in different configurations are defined as CLB columns in the same corresponding column set that holds a certain column-based functional module. In Fig. 4.7b, the re-mapping of the functional modules forms two mapped regions that are separated by the intentionally unused column. The left mapped region contains column 1 and 2 (function A and B), and the right mapped region contains column 4 and 5 (function C and D). In this way, only inter-region signals that connect functional units in different mapped regions need to be rerouted due to the shifted functional modules. Intra-region signals that connect functional units within the same mapped region, however, can be routed in the same way as their counterparts in the corresponding base configuration column. This is because every CLB column has the same programmable logic and routing resources. Equivalently, the routing information of intra-region signals can be obtained directly by shifting the states of switches in corresponding base configuration columns using the same method described in the re-

25 mapping procedure for column-based functional modules.
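As a small illustration of the column re-mapping rule (the data structures below are our own simplification; one entry stands for the configuration of one CLB column rather than the real bit-stream), an alternative configuration can be derived from the base configuration as follows:

    # Hedged sketch of deriving a column-based alternative configuration.
    def derive_alternative(base_columns, unused_col):
        """base_columns: per-column configuration, with the last column reserved (None).
        The returned configuration leaves physical column `unused_col` unused and shifts
        every module at or to the right of it one column to the right."""
        n = len(base_columns)
        alt = [None] * n
        for col in range(n - 1):            # the last column is the reserved spare
            cfg = base_columns[col]
            if cfg is None:
                continue
            if col < unused_col:
                alt[col] = cfg              # left region: unchanged
            else:
                alt[col + 1] = cfg          # right region: shifted one column right
        return alt

    base = ["A", "B", "C", "D", None]       # functions A-D in columns 1-4, column 5 reserved
    print(derive_alternative(base, unused_col=2))
    # -> ['A', 'B', None, 'C', 'D']         # column 3 (index 2) intentionally unused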

Figure 4.7. Column-based precompiled configuration. (a) Base configuration. (b) Alternative configuration when Column 3 is intentionally unused.

More details of the column-based pre-compiled configuration approach can be obtained from [Huang 01a]. It is clear from the above discussion that each incremental configuration version leaves one of the columns in the mapped circuitry unused. The columns to the right of the intentionally unused column are shifted one column to the right, while the columns to the left remain in the same place. In this way, the FPGA is partitioned into two configuration regions separated by the unused column. If we compute the difference vectors between the corresponding columns of the incremental version and the initial configuration, there will be strings of 0's with scattered 1's representing inter-region reroutes and wires to the I/Os. Therefore, run-length coding based on Golomb codes [Golomb 66] is very effective in reducing the storage of these difference vectors.
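To illustrate why such difference vectors compress well, the following sketch run-length encodes a sparse bit vector with a Golomb code; choosing a power-of-two group size (so the code reduces to a Rice code) and the exact bit layout are illustrative assumptions, not the encoder parameters used in [Huang 01a].

    # Hedged sketch of Golomb run-length coding of a sparse difference vector.
    def golomb_encode_runs(bits, m=128):
        """Encode the run length of 0's before each 1 with a Golomb code of group size m
        (m must be a power of two here: unary quotient + fixed-width remainder)."""
        k = m.bit_length() - 1                 # m = 2**k
        out, run = [], 0
        for b in bits:
            if b == 0:
                run += 1
            else:
                q, r = divmod(run, m)
                out.append("1" * q + "0" + format(r, "0{}b".format(k)))
                run = 0
        return "".join(out)

    diff = [0] * 300 + [1] + [0] * 500 + [1]   # two scattered 1's in long runs of 0's
    code = golomb_encode_runs(diff, m=128)
    print(len(diff), "bits compressed to", len(code), "bits")   # 802 bits -> 21 bits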

Figure 4.8. Storage overhead of precompiled configuration techniques: overall storage overhead versus Golomb code group size for the benchmark circuits c499, duke2, planet1 and sand.

Figure 4.8 shows the storage overhead for both precompiled configuration techniques for the MCNC benchmark circuits c499, duke2, planet1 and sand. We use a Xilinx Virtex-E XCV50E FPGA with a 16x24 CLB array in our experiments; c499 maps to 11 CLB columns, duke2 to 10, planet1 to 6, and sand to 5 CLB columns. We use Golomb codes [Golomb 66] to encode the difference vectors. Also, the storage overhead is calculated based on the amount of data in the configuration bit-streams of the used columns. This avoids over-optimistic results caused by the small size of the benchmark circuits relative to the capacity of the FPGA. From Fig. 4.8 it is clear that the overall storage overhead is in the range of 25-35% when the group size of the Golomb code is 128. Figure 4.9 shows that compressing the difference between two configurations using Golomb codes is more effective than applying standard techniques such as Unix gzip. It has also been demonstrated in [Huang 01a] that, in the worst case, the critical path delay overhead in the alternative configurations ranges from 11% to 18% compared to the original base configuration. Also, for many alternative configurations, the critical path delay does not change at all compared to the base configuration.

Figure 4.9. Comparison of gzip and our compression technique (overall storage overhead of gzip applied to all alternative configurations versus the differential and run-length code, for the benchmark circuits c499, duke2, planet1 and sand1).

The storage overhead can be further reduced to around 5% by using a non-overlapping precompiled configuration approach, described in [Huang 01a], for designs that utilize less than 50% of the total available area of an FPGA. For applications with very high utilization of routable fault-free elements, or with mission times long enough for multiple permanent faults to occur, the repair approach based on precompiled configurations presented in [Yu 01b] can be used.

4.4. Fault Location

The discussion of permanent fault recovery through reconfiguration, presented in Sec. 4.3, assumes that some technique is available to locate the faulty region in the FPGA. Previous work on FPGA fault location can be found in [Jordan 93, Lombardi 96, Stroud 97, Stroud 98, Renovell 98, Mitra 98, Das 99]. However, these approaches need tens or hundreds of configurations to diagnose a faulty CLB, which makes fault location a major component of the system downtime; moreover, some of these techniques are targeted at fault location during FPGA manufacture rather than during system operation.

Abramovici et al. presented a roving STAR approach for fault detection and location during system operation [Abramovici 99, Emmert 00]. They suggested using CED techniques for detection of transient failures. For detection of permanent faults, two rows and two columns of the FPGA are reserved as the self-testing area (STAR). The STAR areas are not used for user applications and are tested while the rest of the system is in operation.

After the testing of a STAR is complete, the STAR is moved to a new location by downloading a new configuration, and the process iterates until the entire FPGA has been tested. Complete testing of a STAR in a Lucent ORCA 2C15A FPGA [Lucent 01] takes about 31 seconds, and the system clock has to be stopped for about three to four seconds to copy state before the STAR moves to a new location [Emmert 00]. One concern with the roving STAR approach is the performance and availability degradation caused by moving the STAR, even when no failures are present. Another concern is the error latency: it has been reported in [Emmert 00] that it takes approximately six minutes for the STAR to rove across the entire ORCA 2C15A FPGA, which has a programmable logic array of 20 columns and 20 rows. The logic capacity of such FPGAs is about 80% of the smallest Xilinx Virtex FPGAs. Hence, unless CED techniques are used, it may take a long time before an error is detected. These problems can be overcome if fine-grained CED techniques are combined with fault location techniques that exploit the checker responses; using the STAR approach without utilizing checker data can be very expensive in terms of fault-location time and hence system downtime.

A fast fault-location technique that integrates well with the column-based precompiled configuration approach of Sec. 4.3 is described in [Huang 01b]. In that approach, alternative configurations for repair are pre-generated and stored in the system before a permanent fault occurs. Thus, one simple way to avoid the fault while repairing the system is to try the possible configurations one after another until a configuration that successfully avoids the fault is loaded. For each configuration, the CED schemes for the application circuitry are used to ensure data integrity and to determine whether the attempt is successful. In a sense, this "blind" reconfiguration scheme replaces high-complexity fault location techniques with CED schemes at run time.

There are several ways to test whether a configuration avoids the faulty region. The simplest approach is to perform a rollback and retry the operation (as described in Sec. 4.2.2). The second method is to use pseudo-exhaustive BIST (PE-BIST) patterns to test each sub-circuit in the target application exhaustively [McCluskey 81]; such patterns provide very high fault coverage without relying on explicit fault models, and the test pattern generator for PE-BIST can be implemented in the controller of the other FPGA in the Dual-FPGA architecture.

The third method is to use a linear feedback shift register (LFSR) to apply pseudo-random test patterns to the target application circuitry [Abramovici 90]. Like PE-BIST, this approach does not require memory to store test patterns, and the patterns can easily be generated by the controller of the other FPGA in the Dual-FPGA architecture. The fourth method is to apply the functional verification patterns of the target application in the reconfigurable hardware; such patterns verify the correct function of the application, but they require memory to store the patterns.

Compared to the roving STAR approach in [Abramovici 99], this blind reconfiguration scheme guarantees data integrity and does not impose performance or availability degradation during fault-free operation. The control of the blind reconfiguration process is also simpler, and thus more suitable for dependable systems with autonomous recovery based on reconfigurable hardware. There are, however, some issues related to the error detection capability of the CED scheme, the precision of the fault location result, and the order of configuration attempts. Details about these issues and their solutions can be found in [Huang 01b].
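The control flow of the blind reconfiguration scheme described above is simple enough to sketch in a few lines. The fragment below is only an illustration of the idea; load_configuration, rollback_and_retry and run_built_in_self_test are hypothetical hooks assumed for the example, not part of any actual FPGA API.

    # Blind reconfiguration: try precompiled alternative configurations until
    # the CED checkers no longer signal an error.
    def blind_reconfigure(alternative_configs, load_configuration,
                          rollback_and_retry, run_built_in_self_test=None):
        for config in alternative_configs:
            load_configuration(config)           # download one precompiled bit stream
            # Re-execute the checkpointed operation under CED (Sec. 4.2.2), and
            # optionally apply BIST patterns; each hook returns True if no error is signaled.
            passed = rollback_and_retry()
            if passed and run_built_in_self_test is not None:
                passed = run_built_in_self_test()
            if passed:
                return config                    # this configuration avoids the fault
        return None                              # no stored configuration avoids the fault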

The number of reconfiguration attempts needed to find a configuration that avoids the faulty region can be minimized by implementing CED with fine granularity. For example, in a pipelined circuit the functional modules and registers in each stage of the pipeline may be duplicated (through identical or diverse duplication), and a local, distributed checker in each stage compares the duplicated outputs and signals an error on a mismatch. The distributed error signals are combined to form the global error signal of the system. Because pipeline registers separate the stages, the error signals from the distributed CED checkers localize faults within part of the system: if the checker for the i-th stage does not signal an error in a given cycle, the computation of that stage in that cycle can be assumed correct. In this way, we can reduce the number of suspect faulty columns in the reconfigurable hardware and hence the number of configuration attempts. The details of this technique are presented in [Huang 01b].

The outputs of the distributed checkers in each sub-circuit can be stored in flip-flops connected in a scan chain; the idea of scanning out checker outputs to locate a faulty chip or board is also used in IBM mainframes [Kraft 81, IBM 01]. When the system signals an error, the contents of the flip-flops storing the checker outputs are scanned out for further processing by the controller in the other FPGA of the Dual-FPGA architecture. The scanned data can be read out through a dedicated scan-out pin or through the boundary scan port generally available in FPGA chips; if the number of checkers is small, dedicated I/O pins can be used to observe the checker outputs directly. The controller on the other FPGA stores the information about which CLB columns are occupied by each sub-circuit in the system. Such information is available directly from CAD tools, such as the Xilinx Alliance software, at design time through the floorplan of each sub-circuit and checker in the FPGA. After the controller scans out the checker data, it can execute a simple routine to find the set of suspect faulty columns [Huang 01b]. Note that, since the faulty column suspects are specified by the distributed checkers, our technique can also be integrated with the roving STAR approach in [Abramovici 99] to find the precise fault location: instead of roving the STAR across the entire FPGA, we only need to rove it across the suspect columns, which reduces the number of reconfigurations required by the roving STAR approach.

Generally, with fine-grained partitioning of sub-circuits the number of configuration attempts is smaller than with coarse-grained sub-circuits, because each sub-circuit occupies fewer columns in the hardware. However, there is a tradeoff with area overhead, because fine-grained partitioning increases the number of distributed checkers in the system. By floorplanning the design, each sub-circuit and its corresponding CED checker can be confined within certain columns of the hardware; in this way, the number of configuration attempts can be minimized by using a modified column-based precompiled configuration scheme. To estimate the impact on area overhead of fine-grained partitioning and distributed checkers, we examined a case study: LZ compression in FPGAs with duplication for CED. Our designs are mapped onto the Xilinx Virtex XCV1000 FPGA [Xilinx 01], which has a CLB array of 64 rows and 96 columns. The number of sub-circuits in the system is chosen between 9 and 17 in order to balance the size of each sub-circuit. The area overhead due to the distributed checkers and fine-grained partitioning is not very significant compared to a duplex scheme without partitioning, and the degradation of the maximum clock rate due to partitioning is within 10% in this case study.

Because there are global signals connecting the sub-circuits, the worst-case suspect set for faulty columns spans the entire used region of the FPGA. However, such global signals represent only about 0.2% of all signals in the system in both schemes. Therefore, with very high probability (for all faults that appear in the other 99.8% of signal lines), the entire fault location and recovery process can be completed in only one or two reconfigurations.
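The controller routine that converts scanned checker data into suspect columns, and then decides which precompiled configurations are worth attempting first, can be sketched as follows. This is a hypothetical fragment: the floorplan table, the checker scan-chain order and the config dictionary format are invented for the example, not taken from [Huang 01b].

    # Map distributed-checker error flags to suspect CLB columns and order the
    # precompiled configurations so that those whose unused column is suspect are tried first.
    FLOORPLAN = {                       # sub-circuit -> occupied CLB columns (from the CAD floorplan)
        "fetch":   [1, 2],
        "decode":  [3],
        "execute": [4, 5],
        "commit":  [6],
    }
    CHECKER_ORDER = ["fetch", "decode", "execute", "commit"]    # order of checkers in the scan chain

    def suspect_columns(scanned_flags):
        """scanned_flags[i] is 1 if the i-th checker in the scan chain signaled an error."""
        suspects = set()
        for flag, name in zip(scanned_flags, CHECKER_ORDER):
            if flag:                                 # stage not proven correct by its checker
                suspects.update(FLOORPLAN[name])
        return suspects

    def configurations_to_try(suspects, alternative_configs):
        """Each config is assumed to record which column it intentionally leaves unused."""
        return sorted(alternative_configs,
                      key=lambda cfg: cfg["unused_column"] not in suspects)

    # Example: only the 'execute' checker flagged an error -> suspect columns {4, 5}.
    cols = suspect_columns([0, 0, 1, 0])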

4.5. Recovery from Failures in Memories, Reconfiguration Circuitry, Bus and I/Os

In the previous sections we discussed error detection and recovery from failures that affect the logic blocks and routing resources inside the FPGAs. In the Dual-FPGA ACS architecture there can also be failures affecting the memory chips, the I/O pins of the FPGAs, the reconfiguration circuitry, the interconnections between the two FPGAs, and the common bus connecting the FPGAs and the memory. For a complete fault-tolerant system, we must be able to detect these failures and recover the system from them.

Errors in the memory data bits can be detected and corrected using techniques such as error-correcting codes and periodic scrubbing [Shirvani 00b]. However, memory chips generally do not have redundant circuitry to detect and correct errors in the address decoding logic, address pins and read/write control pins. To protect the system from failures in these parts of the memory chips, chip-level redundancy can be used. For example, a TMR system can be designed using three memory chips; such a system has extra circuitry for writing back (e.g., through memory scrubbing) correct data to replace erroneous data caused by temporary errors in the memories.
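The behavioral sketch below illustrates the word-level voting and scrub-back just described. It is only an illustration under the assumption of at most one erroneous copy per word; the class interface is invented for the example and does not correspond to an actual memory controller design.

    # Chip-level TMR for memories: read all three copies, majority-vote the word,
    # and scrub (write back) the voted value wherever a copy disagrees.
    class TMRMemory:
        def __init__(self, size_words):
            self.banks = [[0] * size_words for _ in range(3)]   # three memory chips

        def write(self, addr, word):
            for bank in self.banks:
                bank[addr] = word

        def read(self, addr):
            values = [bank[addr] for bank in self.banks]
            voted = max(set(values), key=values.count)  # majority of the three copies
            for bank in self.banks:                     # scrub temporary errors on the fly
                if bank[addr] != voted:
                    bank[addr] = voted
            return voted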

It may seem that the FPGA reconfiguration circuitry is a single point of failure, because we may not be able to configure the FPGAs correctly if the reconfiguration circuitry fails. However, there are generally many different ways of reconfiguring the same FPGA. For example, Xilinx Virtex FPGAs provide several reconfiguration modes (e.g., SelectMAP and Boundary Scan) that use different reserved pins for reconfiguration [Xilinx 01]. This inherent redundancy among the reconfiguration modes can be exploited to recover from failures in the FPGA reconfiguration circuitry.

There are several approaches to tolerating failures in the I/O pins of the FPGAs and in the interconnections (data and control) used for communication between the two FPGAs and the memories. The simplest technique is to use redundant wires, buses and pins through duplication, TMR or error-correcting codes. However, this technique requires extra I/O pins that may not be available for pin-limited designs mapped on FPGAs. In such circumstances, the fault-tolerant communication technique described in [Yu 01c] is applicable; it is illustrated by the following simple example. Consider the communication between the two controllers implemented on the two FPGAs of the Dual-FPGA architecture, and suppose that the communication is unidirectional. For error detection, techniques based on parity or a cyclic redundancy check (CRC) are used, with redundant wires carrying the parity or CRC bits; note that the number of extra wires (and pins) needed for error detection is far smaller than the number needed for error correction. Suppose that the controller on FPGA1 sends a signal to FPGA2 and there is an error on one of the wires or on an I/O pin of FPGA1. The data received by the controller on FPGA2 will be erroneous, and the error will be detected by the parity or CRC check. If the error persists after a predetermined number of retries, permanent fault repair has to be performed. The controller on FPGA2 bounces the message back to the controller on FPGA1 (on another set of wires, because the communication is assumed to be unidirectional). The controller on FPGA1 compares the message it intended to send with the bounced message to find the wire that carried the erroneous bit. The controller on FPGA1 then initiates a reconfiguration of FPGA2 so that the I/O pin of FPGA2 to which the erroneous wire is connected is not used in the new configuration; after that reconfiguration, the controller on FPGA2 initiates a reconfiguration of FPGA1 so that the corresponding I/O pin of FPGA1 is also unused. Thus, permanent faults on the erroneous wire or I/O pins can be repaired. This technique assumes that spare pins are available on both FPGAs; however, the number of extra pins required for repair is far smaller than the number required for on-line error correction. The details of this technique are available in [Yu 01c].
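The fragment below sketches the sender-side logic of this bounce-back scheme. It is a simplified, software-level illustration only: MAX_RETRIES, the even-parity convention and the helper callbacks (send, receive_error_flag, request_bounce, reconfigure_without) are assumptions made for the example, not interfaces defined in [Yu 01c].

    # Bounce-back localization of a faulty wire on a parity-protected, unidirectional link.
    MAX_RETRIES = 3

    def even_parity(bits):
        return sum(bits) % 2                       # parity bit appended by the sender

    def send_with_repair(data_bits, send, receive_error_flag, request_bounce, reconfigure_without):
        word = data_bits + [even_parity(data_bits)]
        for _ in range(MAX_RETRIES):
            send(word)
            if not receive_error_flag():           # FPGA2's parity check passed
                return True
        # Error persists: ask FPGA2 to bounce back the word it received (on spare wires),
        # and compare it with what was sent to find the position of the faulty wire.
        bounced = request_bounce()
        faulty_positions = [i for i, (a, b) in enumerate(zip(word, bounced)) if a != b]
        for pin in faulty_positions:
            # Repair controllers reconfigure both FPGAs so the pins tied to this wire are unused.
            reconfigure_without(pin)
        return False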

5. Self-Healing Soft Microprocessor

As explained in Sec. 3.2, dependability in multithreaded processors is accomplished by using multiple redundant threads of computation.

By using multiple redundant threads, the multithreaded processor in an MT-ACS can recover from transient errors; for permanent failures, however, the repair alternatives are limited: standby processor spares or replacement of the entire chip may be the only options. In remote and unattended systems, such as satellite and space applications, repair options that require chip replacement or standby spares may not be viable.

One way to overcome the limited repair alternatives of hardwired custom processors is to build a soft microprocessor entirely in reconfigurable hardware. The Dual-FPGA architecture proposes an ACS implementation that, with the exception of external memory, is built entirely from reconfigurable hardware. Recent advances in semiconductor technology have made it feasible to implement a microprocessor in a reconfigurable fabric; for example, the soft microprocessor cores NIOS and MicroBlaze are now available from Altera and Xilinx, respectively. We recognize that custom microprocessors can have significantly better performance than soft microprocessors: clock rates of custom microprocessors are in the gigahertz range, while soft-microprocessor implementations achieve clock rates in the hundreds of megahertz. However, in ACS it is well recognized that most of the performance improvement comes from exploiting parallelism in the reconfigurable hardware. For example, an ACS implementation of LZ-77 compression [Huang 00a] running at 15 MHz in reconfigurable hardware achieved more than five times the performance of a software implementation of LZ-77 compression running on a 450 MHz custom processor. We believe that the performance disadvantage due to the lower clock rate of soft microprocessors is more than offset by the performance gains achievable through application-specific parallel implementations in reconfigurable hardware. The flexibility provided by reconfigurable hardware not only allows the implementation of a variety of fault-tolerance features, but also the customization of the instruction set architecture for embedded applications.

Figure 5.1 illustrates our proposed use of a soft microprocessor in an embedded Dual-FPGA environment. The soft microprocessor on one FPGA chip communicates, through an external in-band signaling bus, with the reconfigurable coprocessor implemented on the other FPGA. The micro-controllers provide run-time communication between the two FPGAs, signaling possible detection of errors and effecting recovery and repair through reconfiguration.

[Figure 5.1 appears here: block diagram of the Dual-FPGA architecture, showing the soft microprocessor with instruction/data caches, instruction and data memories, checkpoint logic and a micro-controller on FPGA1; the reconfigurable coprocessor, memory and a micro-controller on FPGA2; non-volatile memory; bus arbiters; and the debug and initialization buses.]

Figure 5.1. Soft Microprocessor in the Dual-FPGA Architecture

The LEON SPARC processor [Gaisler 01] from the European Space Agency is a soft microprocessor implementation of the SPARC architecture. Although the microarchitecture details are not available, the LEON SPARC processor appears to have fault-tolerant features [Gaisler 01] that provide recovery from transient register and memory bit errors. Soft microprocessors implemented in reconfigurable logic are subject to two kinds of failures:
1. Failures in the configuration bits that define the interconnection and logic functionality of the soft microprocessor, and
2. Failures in the registers, memory, and combinational logic of the implemented soft microprocessor.

We have designed a reconfigurable soft microprocessor with structural properties that allow:
1. Recovery from transient errors in the soft microprocessor configuration bits,
2. Recovery from transient errors in the registers, memory, and combinational logic of the soft microprocessor, and
3. Precompiled-configuration-based repair techniques that avoid permanent failures in the reconfigurable logic.

We call this reconfigurable microprocessor a "self-healing soft microprocessor" because it provides built-in concurrent error detection and autonomous recovery capabilities. Figure 5.2 shows a high-level organization of our proposed self-healing soft microprocessor. While the microarchitectural features described here do not preclude an ASIC implementation, the choice of reconfigurable logic is important for repairing permanent failures without chip replacement. A detailed microarchitecture description and implementation data on the Xilinx Virtex-II FPGA are presented in [Saxena 01].

[Figure 5.2 appears here: pipeline of the self-healing soft microprocessor, showing the fetch, decode, execute and load/store units with instruction and data memories, the register file, the backdoor checkpoint queues, and the reconfigurable recovery logic and recovery FSM.]

Figure 5.2. Self-Healing Soft Microprocessor

In the Virtex-II FPGA, all of the functional units (for example, fetch, decode, and execute) are partitioned into groups of CLB and Block RAM columns. This kind of partitioning enables the use of column-based recovery techniques [Huang 00c] for configuration bit errors and creates the opportunity to repair permanent faults through precompiled configurations [Huang 01a]. In traditional microprocessors, checkpointing and state recovery are often accomplished through trap handlers that switch the machine context and use load/store instruction sequences. Unlike traditional microprocessors, we developed special structures, called backdoor checkpoint queues, that transparently checkpoint the register state while instructions are active in the soft microprocessor. During error recovery, special recovery finite state machines restore the register state of the soft microprocessor without transferring control to trap handlers.
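The behavioral model below conveys the intent of the backdoor checkpoint queues. It is a software sketch only; the queue organization, the per-thread register count and the recovery policy are assumptions for illustration and are not the microarchitecture reported in [Saxena 01].

    # Behavioral sketch of a backdoor checkpoint queue: every architectural register
    # write by an in-flight instruction logs the old value; committed entries are
    # discarded, and on a CED error the recovery FSM undoes the uncommitted writes.
    class BackdoorCheckpointQueue:
        def __init__(self, num_regs=16):
            self.regs = [0] * num_regs          # architectural register file (one thread)
            self.queue = []                     # entries: (serial_number, reg_index, old_value)

        def write_reg(self, serial, idx, value):
            self.queue.append((serial, idx, self.regs[idx]))   # transparent checkpoint
            self.regs[idx] = value

        def commit(self, serial):
            # Instruction completed without a detected error: its checkpoint entries age out.
            self.queue = [e for e in self.queue if e[0] != serial]

        def recover(self):
            # Recovery FSM: restore the old values of all uncommitted writes, newest first.
            for _, idx, old in reversed(self.queue):
                self.regs[idx] = old
            self.queue.clear()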

Tables 5.1 and 5.2 present implementation data for a two-threaded, two-way superscalar soft microprocessor. The salient features of this soft microprocessor are:
− Instruction memory of 8 Kbytes (2K 4-byte instructions).
− 32-bit instruction width, with 11-bit instruction addresses (to address 2K instructions).
− Two-way superscalar, two-threaded fetch, decode, issue, and commit units.
− Two execute units handling arithmetic, logical, and control transfer instructions.
− A load/store unit with 64 Kbytes (8K 4-byte data words) of data memory.
− Thirty-two 32-bit general-purpose registers (sixteen per thread).
− An operational clock rate of 125 MHz with Virtex-II speed grade 4 parts.

Table 5.1. Soft Microprocessor Implementation Data on Virtex-II FPGAs

Unit Name | Number of LUTs | Number of Flip-Flops | Block RAM Modules | Description
Fetch | 837 | 721 | 4 | Includes 2-Kbyte instruction memory (uses 4 Block RAMs), finite state machines and next fetch address calculation logic for two threads.
Decode | 624 | 406 | 0 | Determines dependencies between two instructions and reservation station availability; also keeps track of instruction serial numbers.
Register File | 2189 | 1056 | 0 | Thirty-two 32-bit registers and logic for scoreboarding reservation state for issue and commit logic.
Execute (2 Units) | 1666 | 484 | 0 | Two ALUs capable of arithmetic, logic, and control transfer instructions.
Load/Store | 578 | 467 | 32 | Logic to execute load/store instructions and 64-Kbyte data memory (uses 32 Block RAMs).
Backdoor Checkpoint Queue | 102 | 56 | 2 | FIFO control logic and two Block RAMs for the FIFO queues.

Table 5.2. Soft Microprocessor Utilization Data on Virtex-II FPGAs

Virtex-II Family | LUTs Used/Total | Flip-Flops Used/Total | Block RAMs Used/Total
XC2V1000 | 5,996 / 10,240 (58%) | 3,190 / 10,240 (31%) | 38 / 40 (95%)
XC2V1500 | 5,996 / 15,360 (39%) | 3,190 / 15,360 (20%) | 38 / 48 (79%)
XC2V2000 | 5,996 / 21,504 (27%) | 3,190 / 21,504 (14%) | 38 / 56 (67%)
XC2V3000 | 5,996 / 28,672 (20%) | 3,190 / 28,672 (11%) | 38 / 96 (39%)
XC2V6000 | 5,996 / 67,584 (8%) | 3,190 / 67,584 (4%) | 38 / 144 (26%)

One of the main conclusions from the implementation data is that the recovery hardware (backdoor checkpoint queues and recovery finite-state machines) has a very small overhead (about 1.7% to 2%) compared to the rest of the soft-microprocessor hardware. From Table 5.2 it is also clear that, for reasonably sized FPGAs, the soft microprocessor occupies only a fraction of the reconfigurable hardware, leaving room for repair after permanent failures. The Block RAM usage can be reduced by lowering the instruction and data memory size requirements. As part of technology transfer, an instance of this soft microprocessor is being used at Chip Engines, Sunnyvale, CA, in a resilient packet processor for metropolitan area networks.

6. Summary and Conclusions

This report describes an architectural view of reconfigurable systems that can be used for dependable computing. In addition, we presented an overview of the concurrent error detection, fault-location and recovery techniques developed in the ROAR project that enable these architectures to be used for dependable computing. To conclude, our work demonstrates that it is feasible to design low-cost but very effective dependable ACS. While the major thrust of this work is on FPGA-based reconfigurable systems, we believe that our techniques are applicable to other commercial reconfigurable devices as well.

In an announcement of electronically configurable molecular-based logic gates [Collier 99], the authors acknowledged the need for fault tolerance. We believe that the fault-tolerance techniques developed in our project are relevant and applicable in this context as well.

7. Acknowledgments

This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. DABT63-97-C-0024 (ROAR project).

8. References

[Abramovici 90] Abramovici, M., M. Breuer and A. Friedman, Digital Systems Testing and Testable Design, IEEE Press, 1990.
[Abramovici 99] Abramovici, M., C. Stroud, C. Hamilton, C. Wijesuriya and V. Verma, "Using Roving Stars for On-line Testing and Diagnosis of FPGAs," Proc. Intl. Test Conf., pp. 973-982, 1999.
[Adams 89] Adams, S. J., "Hardware Assisted Recovery from Transient Errors in Redundant Processing Systems," Proc. Intl. Symp. Fault-Tolerant Computing, pp. 512-519, 1989.
[Avizienis 84] Avizienis, A. and J. P. J. Kelly, "Fault Tolerance by Design Diversity: Concepts and Experiments," IEEE Computer, pp. 67-80, August 1984.
[Carmichael 99] Carmichael, C., E. Fuller, P. Blain, and M. Caffrey, "SEU Mitigation Techniques for Virtex FPGAs in Space Applications," MAPLD '99, 1999.
[Collier 99] Collier, C. P., et al., "Electronically Configurable Molecular-Based Logic Gates," Science, Vol. 285, No. 5426, pp. 391-394, July 1999.
[Culbertson 97] Culbertson, W., R. Amerson, R. Carter, P. Keukes and G. Snider, "Defect-Tolerance in the Teramac Custom Computer," Proc. Intl. Symp. Field-Programmable Custom Computing Machines (FCCM), pp. 116-124, 1997.
[Das 99] Das, D., and N. A. Touba, "A Low-Cost Approach for Detecting, Locating and Avoiding Interconnect Faults in FPGA-based Reconfigurable Systems," Proc. Intl. Conf. VLSI Design, pp. 266-269, 1999.
[Dutt 99] Dutt, S., V. Shanmugavel, and S. Trimberger, "Efficient Incremental Rerouting for Fault Reconfiguration in Field Programmable Gate Arrays," Proc. Intl. Conf. Computer-Aided Design, pp. 173-176, 1999.
[Emmert 97] Emmert, J. M., and D. Bhatia, "Partial Reconfiguration of FPGA Mapped Designs with Applications to Fault Tolerance and Yield Enhancement," Proc. Intl. Workshop Field-Programmable Logic, pp. 141-150, 1997.
[Emmert 98] Emmert, J., and D. Bhatia, "Incremental Routing in FPGAs," Proc. Intl. ASIC Conf., pp. 217-221, 1998.
[Emmert 00] Emmert, J., C. Stroud, B. Skaggs, and M. Abramovici, "Dynamic Fault Tolerance in FPGAs via Partial Reconfiguration," Proc. Symp. Field-Programmable Custom Computing Machines (FCCM), pp. 165-174, 2000.
[Gaisler 01] Gaisler, J., LEON SPARC Processor, European Space Agency, http://www.estec.esa.nl/wsmwww/leon/ and http://www.gaisler.com, 2001.
[Gersho 92] Gersho, A. and R. M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, 1992.

[Golomb 66] Golomb, S. W., "Run-length Encoding," IEEE Trans. Information Theory, Vol. IT-12, pp. 399-401, 1966.
[Hanchek 98] Hanchek, F., and S. Dutt, "Methods for Tolerating Cell and Interconnect Faults in FPGAs," IEEE Trans. Computers, Vol. 47, No. 1, pp. 15-32, 1998.
[Huang 00a] Huang, W. J., N. Saxena, and E. J. McCluskey, "A Reliable LZ Data Compressor on Reconfigurable Coprocessors," Proc. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM), 2000.
[Huang 00b] Huang, W. J., and E. J. McCluskey, "Analysis of Transient Error Effects in LZ Compression Algorithm and Rollback Error Recovery Schemes," Proc. Pacific Rim Intl. Symp. Dependable Computing, 2000.
[Huang 00c] Huang, W. J., and E. J. McCluskey, "A Memory Coherence Technique for Online Transient Error Recovery of FPGA Configuration," Proc. Intl. Symp. FPGAs, 2001.
[Huang 01a] Huang, W. J., and E. J. McCluskey, "Column-Based Precompiled Configuration Techniques for FPGA Fault Tolerance," Proc. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM), 2001, To appear.
[Huang 01b] Huang, W. J., S. Mitra and E. J. McCluskey, "Fast Run-Time Fault Location for Dependable FPGA-based Applications," Proc. Intl. Symp. Defect and Fault Tolerance of VLSI Systems, To appear.
[IBM 01] International Business Machines, http://www.ibm.com, 2001.
[Jordan 93] Jordan, C., and W. P. Marnane, "Incoming Inspection of FPGAs," European Test Conf., pp. 371-377, 1993.
[Kelem 00] Kelem, S., "Virtex Configuration Architecture: Advanced Users' Guide," Application Note XAPP 151, Ver. 1.1, www.xilinx.com, 2000.
[Kraft 81] Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, 1981.
[Lach 98] Lach, J., W. H. Mangione-Smith, and M. Potkonjak, "Efficiently Supporting Fault-Tolerance in FPGAs," Proc. Intl. Symp. FPGAs, pp. 105-115, 1998.
[Lach 99] Lach, J., W. H. Mangione-Smith, and M. Potkonjak, "Algorithms for Efficient Runtime Fault Recovery on Diverse FPGA Architectures," Proc. Intl. Symp. Defect and Fault Tolerance, pp. 386-394, 1999.
[Lakamraju 00] Lakamraju, V., and R. Tessier, "Tolerating Operational Faults in Cluster-Based FPGAs," Proc. Intl. Symp. Field Programmable Gate Arrays, pp. 187-194, 2000.
[Lala 94] Lala, J. H. and R. E. Harper, "Architectural Principles for Safety-Critical Real-Time Applications," Proc. of the IEEE, Vol. 82, No. 1, pp. 25-40, Jan. 1994.
[Lombardi 96] Lombardi, F., D. Ashen, X. Chen, and W. K. Huang, "Diagnosing Programmable Interconnect Systems for FPGAs," Proc. Intl. Symp. Field Programmable Gate Arrays, pp. 100-106, 1996.
[Lucent 01] Lucent Technologies, http://www.lucent.com, 2001.
[Mahapatra 99] Mahapatra, N. R., and S. Dutt, "Efficient Network-Flow Based Techniques for Dynamic Fault Reconfiguration in FPGAs," Proc. Intl. Symp. Fault-Tolerant Computing, pp. 122-129, 1999.
[McCluskey 81] McCluskey, E. J., and S. Bozorgui-Nesbat, "Design for Autonomous Test," IEEE Trans. Computers, pp. 866-875, Nov. 1981.
[Mitra 98] Mitra, S., P. P. Shirvani, and E. J. McCluskey, "Fault Location in FPGA-Based Reconfigurable Systems," IEEE Intl. High Level Design Validation and Test Workshop, pp. 143-150, 1998.

[Mitra 99a] Mitra, S., N. R. Saxena and E. J. McCluskey, "A Design Diversity Metric and Reliability Analysis for Redundant Systems," Proc. Intl. Test Conf., pp. 662-671, 1999.
[Mitra 99b] Mitra, S., N. R. Saxena and E. J. McCluskey, "A Design Diversity Metric and Analysis of Redundant Systems," IEEE Trans. Computers, To appear. (Also available: CRC-TR-99-4, http://crc.stanford.edu.)
[Mitra 00a] Mitra, S., N. R. Saxena and E. J. McCluskey, "Common-Mode Failures in Redundant VLSI Systems: A Survey," IEEE Trans. Reliability, Special Section on Fault-Tolerant VLSI Systems, Vol. 49, No. 3, pp. 285-295, Sept. 2000.
[Mitra 00b] Mitra, S. and E. J. McCluskey, "Combinational Logic Synthesis for Diversity in Duplex Systems," Proc. Intl. Test Conf., pp. 179-188, 2000.
[Mitra 00c] Mitra, S. and E. J. McCluskey, "Which Concurrent Error Detection Scheme To Choose?," Proc. Intl. Test Conf., pp. 985-994, 2000.
[Mitra 00d] Mitra, S., N. Saxena, and E. J. McCluskey, "Fault Escapes in Duplex Systems," Proc. VLSI Test Symp., pp. 453-458, 2000.
[Mitra 00e] Mitra, S., and E. J. McCluskey, "Word-Voter: A New Voter Design for Triple Modular Redundant Systems," Proc. VLSI Test Symp., pp. 465-470, 2000.
[Mitra 01a] Mitra, S., and E. J. McCluskey, "Design Diversity for Concurrent Error Detection in Sequential Logic Circuits," Proc. VLSI Test Symp., pp. 178-183, 2001.
[Mitra 01b] Mitra, S., N. R. Saxena and E. J. McCluskey, "Techniques for Estimation of Design Diversity for Combinational Logic Circuits," Proc. Intl. Conf. Dependable Systems and Networks, pp. 25-34, 2001.
[Mitra 01c] Mitra, S., and E. J. McCluskey, "Design of Redundant Systems Protected Against Common-Mode Failures," Proc. VLSI Test Symp., pp. 190-195, 2001.
[Ogus 75] Ogus, R. C., "Reliability Analysis of Hybrid Redundant Systems with Nonperfect Switches," Technical Report CSL TR 65, Computer Systems Laboratory, Stanford University, 1975.
[Pradhan 96] Pradhan, D. K., Fault-Tolerant Computer System Design, Prentice Hall, 1996.
[Reinhardt 00] Reinhardt, S., and S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," Proc. Intl. Symp. Computer Architecture, pp. 25-36, 2000.
[Renovell 98] Renovell, M., J. M. Portal, J. Figueras, and Y. Zorian, "Testing the Interconnect of RAM-Based FPGAs," IEEE Design and Test of Computers, pp. 45-50, Jan. 1998.
[Rotenberg 99] Rotenberg, E., "AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors," Proc. Intl. Symp. Fault-Tolerant Computing, 1999.
[Rupp 98] Rupp, C. R., M. Landguth, T. Garverick, E. Gomersall, H. Holt, J. M. Arnold, and M. Gokhale, "The NAPA Adaptive Processing Architecture," Proc. Intl. Symp. Field-Programmable Custom Computing Machines (FCCM), 1998.
[Saxena 98] Saxena, N. R., and E. J. McCluskey, "Dependable Adaptive Computing Systems: the ROAR Project," Intl. Conf. Systems, Man, and Cybernetics, pp. 2172-2177, 1998.
[Saxena 99] Saxena, N. R., and E. J. McCluskey, "Fault-Tolerance with Multi-Threaded Computing: A New Approach," Fast Abstracts FTCS-29, pp. 29-30, June 1999.
[Saxena 00] Saxena, N. R., S. Fernandez-Gomez, W. J. Huang, S. Mitra, S.-Y. Yu and E. J. McCluskey, "Dependable Computing and On-line Testing in Adaptive Computing Systems," IEEE Design and Test of Computers, Vol. 17, No. 1, pp. 29-41, Jan.-Mar. 2000.

[Saxena 01] Saxena, N. R., P. Kulkarni, R. Ganji, and E. J. McCluskey, "Reconfigurable Soft Microprocessors with Autonomous Recovery," Technical Report, Center for Reliable Computing, Stanford University, under preparation, 2001.
[Shirvani 00a] Shirvani, P., N. Oh, and E. J. McCluskey, "Stanford CRC Argos Project: COTS in Space," Fast Abstract, Intl. Conf. Dependable Systems and Networks, 2000.
[Shirvani 00b] Shirvani, P., N. R. Saxena and E. J. McCluskey, "Software Implemented EDAC Protection Against SEUs," IEEE Trans. Reliability, Special Section on Fault-Tolerant VLSI Systems, Vol. 49, No. 3, pp. 273-284, Sept. 2000.
[Smith 78] Smith, B. J., "A Pipeline Shared Resource MIMD Computer," Proc. Intl. Conf. Parallel Processing, pp. 6-8, 1978.
[Spainhower 99] Spainhower, L. and T. A. Gregg, "S/390 Parallel Enterprise Server G5 Fault Tolerance," IBM Journal of Research and Development, Vol. 43, pp. 863-873, Sept./Nov. 1999.
[Stroud 97] Stroud, C., E. Lee and M. Abramovici, "BIST-Based Diagnostics of FPGA Logic Blocks," Proc. Intl. Test Conf., pp. 539-547, 1997.
[Stroud 98] Stroud, C., S. Wijesuriya, C. Hamilton, and M. Abramovici, "Built-In Self-Test of FPGA Interconnect," Proc. Intl. Test Conf., pp. 404-411, 1998.
[Thorton 64] Thornton, J. E., "Parallel Operation in the Control Data 6600," Proc. AFIPS Fall Joint Computer Conf., Vol. 26, pp. 33-40, 1964.
[Xilinx 01] Xilinx Virtex Datasheet, www.xilinx.com/apps/virtexapp.htm#databook, 2001.
[Yu 00] Yu, S.-Y., N. Saxena, and E. J. McCluskey, "ACS Implementation of a Robotic Controller Algorithm with Fault-Tolerant Capabilities," Proc. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM), 2000.
[Yu 01a] Yu, S.-Y., and E. J. McCluskey, "On-line Testing and Recovery in TMR Systems for Real-Time Applications," Proc. Intl. Test Conf., 2001, To appear.
[Yu 01b] Yu, S.-Y., and E. J. McCluskey, "Permanent Fault Repair for FPGAs with Limited Redundant Area," Proc. Intl. Symp. Defect and Fault Tolerance, 2001, To appear.
[Yu 01c] Yu, S.-Y., and E. J. McCluskey, "Fault Tolerant Communication between Repair Controllers in the Dual-FPGA Architecture," Technical Report, Center for Reliable Computing, Stanford University, Under preparation.
[Zeng 99] Zeng, C., N. R. Saxena, and E. J. McCluskey, "Finite State Machine Synthesis with Concurrent Error Detection," Proc. Intl. Test Conf., pp. 672-679, 1999.
