Center for Reliable Computing
DEPENDABLE ADAPTIVE COMPUTING SYSTEMS
STANFORD CRC ROAR PROJECT FINAL REPORT

1. Introduction

Adaptive Computing Systems (ACS) comprising a microprocessor, memory and reconfigurable logic (e.g., user-programmable logic elements like FPGAs) represent a promising technology that is well suited for applications such as digital signal processing, image processing, control, bit manipulation, compression and encryption. For these applications, ACS provides versatility and parallelism simultaneously due to the presence of reconfigurable logic such as Field Programmable Gate Arrays (FPGAs). FPGAs provide the capacity to implement applications with parallelism due to the abundance of reprogrammable Configurable Logic Blocks (CLBs) and routing resources. Figure 1.1 shows the architecture of a typical adaptive computing system [Rupp 98].

The Defense Advanced Research Projects Agency (DARPA) has sponsored several ACS efforts over the past years, both in academia and industry. At the Stanford Center for Reliable Computing (CRC), we proposed using ACS opportunities to design dependable¹ computing systems for applications such as nuclear reactor control, fly-by-wire systems, and remote locations (e.g., space stations and satellites). This proposal was funded by DARPA in 1997 under the name “Dependable Adaptive Computing Systems” (http://crc.stanford.edu). Another related effort on dependable ACS is the project called “On-line Testing and Fault-Tolerance in FPGAs” at Bell Labs [Abramovici 99]. Note that the objective of our project is different from that of the Teramac project at Hewlett-Packard Labs, where the main goal is to achieve defect-tolerance in FPGAs by using configurations that do not use defective resources [Culbertson 97]. The challenges in our project are on-line error detection during system operation, very fast fault location and quick recovery from temporary and permanent failures.
¹ Dependability comprises reliability, availability, testability, maintainability and fault-tolerance.

Figure 1.1. ACS architecture

Design of dependable adaptive computing systems can create a significant impact for space applications by reducing the cost of very expensive electronics while delivering high performance. The current practice is to use radiation-hardened integrated circuits for space applications. Radiation hardening is a very expensive process requiring very special designs that are not widely available. Moreover, radiation-hardened ICs are several generations behind the most advanced technologies. For example, while commercial processors used today have clock frequencies in the gigahertz range, the most advanced radiation-hardened processors belong to the generation of IBM RISC 6000. Thus, radiation-hardened processors are not only expensive, but also slow. For space applications, special boards with commercial microprocessors will require multiple microprocessor chips for error detection and standby spares. Hence, organizations like DARPA, NASA and Los Alamos National Laboratory are interested in mitigating the cost of fault-tolerance for space applications by taking advantage of the ACS technology. Moreover, for unmanned deep space applications, it may be very difficult to change the functionality of computer systems on the fly; the flexibility of the ACS technology comes to the rescue in such circumstances.

For terrestrial applications, adaptive computing systems open new opportunities in designing dependable systems by eliminating the need for standby spares as described in this report. The conventional approach to system recovery in fault-tolerant systems is to replace faulty parts (through board swapping, for example). With dependable ACS architectures, it is possible to repair a faulty system very fast with minimal human intervention.
Not only do we eliminate the need for standby spares, but we also reduce the system repair cost and time significantly. The reduction in system repair time reduces system downtime; this, in turn, increases system availability significantly.

This report describes dependable architectures for adaptive computing systems that have been developed as a part of the DARPA-sponsored ROAR project at the Stanford Center for Reliable Computing (CRC). ROAR is an acronym for Reliability Obtained by Adaptive Reconfiguration. The main idea is to develop: (1) Concurrent Error Detection (CED) techniques to detect errors while the ACS is in operation; (2) fault-location techniques to identify the defective part (e.g., the defective CLB or routing resource); and (3) recovery techniques to reconfigure the system to operate without using the faulty CLB or routing resource. Unlike conventional fault-tolerant systems where the Field Replaceable Unit (FRU) is a chip or a board, the FRU for an ACS is a CLB or a routing resource; thus, there are opportunities to repair the faulty system by taking advantage of the fine granularity of the FRU.

The objectives of the ROAR project are: (1) to develop design techniques that allow for the generation of highly dependable, adaptive computing systems; (2) to increase the availability of unattended systems by an order of magnitude through self-repair methods based on the reconfiguration capability of the designs; (3) to eliminate the need for standby spares, since adaptive configuration allows the use of all of the resources for performance; and (4) to develop redundancy techniques, such as diversified designs, that will protect systems from common-mode failures. If these objectives are achieved, then it is possible to eliminate expensive radiation hardening techniques and design low-cost dependable reconfigurable systems with considerably higher flexibility and performance.

Section 2 presents an overview of the hardware model of FPGAs. In Sec.
3, we present ACS architectures that have been designed in the ROAR project. Section 4 presents an overview of the concurrent error detection, fault-location and recovery techniques we have developed for applications mapped on the reconfigurable hardware; these techniques enable dependable system design using the architectures described in Sec. 3. In Sec. 5 we describe the architecture of a self-healing soft-microprocessor developed in the ROAR project. We conclude in Sec. 6.

2. Hardware Model of FPGAs

Figure 2.1 shows the model for the programmable logic array of reconfigurable hardware used in this report. This model is consistent with contemporary SRAM-based FPGAs [Xilinx 01]. In this model, the programmable logic array of reconfigurable hardware consists of three basic elements: Configurable Logic Blocks (CLBs), Connection Boxes (CBs), and Switch Boxes (SBs).

Figure 2.1. FPGA Architecture. (Figure labels: CLB, CB, SB, 1 CLB Column.)

In this architecture, a CLB contains SRAM lookup tables (LUTs) that store the truth tables of user-defined combinational logic functions. In this way, a combinational logic gate is realized by looking up the value in the LUT that is addressed by the corresponding gate inputs. Because LUTs are implemented in SRAM cells, they can also be configured as memory modules in user applications. Here, memory modules refer to all types of memory implemented as RAM modules, such as RAM and FIFO. Bistables such as flip-flops and latches, however, are excluded from memory modules implemented in LUTs because of area efficiency. Instead, they are implemented as separate modules in each CLB to realize sequential logic. Each CLB may also contain multiplexers and other dedicated circuitry in order to enhance functional flexibility and to speed up certain paths with long delay, such as the carry chain in a ripple adder [Xilinx 01].
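
The LUT behavior described above can be illustrated with a minimal Python sketch. It is a simplified software model, not the circuitry of any particular FPGA family; the class name `LUT`, its methods, and the choice that input i selects address bit i are assumptions made for illustration only.

```python
from typing import Callable, List


class LUT:
    """Hypothetical model of a k-input lookup table backed by 2**k SRAM bits."""

    def __init__(self, k: int, func: Callable[..., int]) -> None:
        self.k = k
        # "Configure" the LUT: store the truth table of func, one bit per address.
        self.bits: List[int] = [
            func(*self._unpack(addr)) & 1 for addr in range(2 ** k)
        ]

    def _unpack(self, addr: int) -> tuple:
        # Assumed convention: input i is bit i of the SRAM address.
        return tuple((addr >> i) & 1 for i in range(self.k))

    def _address(self, inputs: tuple) -> int:
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs, got {len(inputs)}")
        return sum((bit & 1) << i for i, bit in enumerate(inputs))

    def read(self, *inputs: int) -> int:
        # Evaluating the gate is a plain memory read at the address
        # formed by the gate inputs.
        return self.bits[self._address(inputs)]

    def write(self, addr: int, bit: int) -> None:
        # The same SRAM cells can serve as a small RAM in a user design.
        self.bits[addr] = bit & 1


if __name__ == "__main__":
    # A 3-input LUT configured as the sum output of a full adder.
    sum_lut = LUT(3, lambda a, b, cin: a ^ b ^ cin)
    print(sum_lut.bits)
    print(sum_lut.read(1, 1, 0))

    # A 4-input LUT used as a 16x1 RAM: write address 5, then read it back.
    ram = LUT(4, lambda *_: 0)
    ram.write(5, 1)
    print(ram.read(1, 0, 1, 0))
```

In this model the logic function exists only as stored bits, so changing the stored bits changes the function without touching the surrounding structure.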
To implement a logic network, CLBs are connected through horizontal and vertical wiring channels located between two neighboring rows or columns. To enhance the connectivity, there are various kinds of wires with different lengths for connecting various CLBs that are separated by variable numbers of blocks. For example, in Xilinx Virtex-series FPGAs, single lines connect adjacent CLBs, while hex lines connect CLBs that are three or six blocks apart [Xilinx 01].

There are two types of routing devices, CBs and SBs, to direct the signal flows among CLBs and wiring channels. CBs route the inputs and outputs of a CLB to the adjacent wiring channels. SBs connect horizontal and vertical wiring channels. Both CBs and SBs are matrices of Programmable Interconnect Points (PIPs). The state of PIPs in these switch matrices is controlled by SRAM cells, which are configured according to the desired functionality.

Logic functions in the reconfigurable hardware are configured through the configuration bit-stream, which contains a mixture of commands, configuration data, SRAM cell addresses for storing configuration data, and control overhead such as parity checks. Configuration data contain values in LUTs and interconnect states. The entire configuration data of the reconfigurable hardware are partitioned vertically or horizontally into configuration frames, which are the smallest amount of data that can be accessed with a single configuration command.

Data stored in configuration memory can be accessed by two types of operations: configuration readback and configuration writeback. Configuration readback refers to the operation of reading the on-chip configuration memory contents out of the chip through designated ports. This operation does not affect user applications running in the reconfigurable hardware. Configuration writeback refers to the operation of writing valid configuration