Evolving Dependability

ANDY M. TYRRELL and ANDREW J. GREENSTED
The University of York, UK

Evolvable hardware offers much for the future of complex systems design. Evolutionary techniques not only have the potential for larger solution space coverage, but when implemented on hardware, also allow system designs to adapt to changes in the environment, including failures in system components. This article reviews a number of novel techniques, all based in the field of bio-inspired systems, that provide varying degrees of dependability over and above standard designs. In particular, three different techniques are considered: using FPGAs and ideas from developmental biology to create designs that possess emergent fault-tolerant properties, using FPGAs and continuous evolution to circumvent faults as and when they occur, and, finally, we consider a novel ASIC designed and built with bio-inspired systems in mind.

Categories and Subject Descriptors: B.8.1 [Performance and Reliability]: Reliability, Testing and Fault-Tolerance

General Terms: Algorithms, Reliability

Additional Key Words and Phrases: Evolutionary algorithms, fault tolerance, bio-inspired architectures, RISA architecture

ACM Reference Format: Tyrrell, A. M. and Greensted, A. J. 2007. Evolving dependability. ACM J. Emerg. Technol. Comput. Syst. 3, 2, Article 7 (July 2007), 20 pages. DOI = 10.1145/1265949.1265953 http://doi.acm.org/10.1145/1265949.1265953

Parts of this work were funded by the EPSRC and the MOD. A shorter version of this article was presented at the Computing Frontiers Workshop in 2006.

Authors’ address: Department of Electronics, University of York, Heslington, YO10 5DD, UK; email: {amt, ajg112}@ohm.york.ac.uk.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. © 2007 ACM 1550-4832/2007/07-ART7 $5.00. DOI 10.1145/1265949.1265953 http://doi.acm.org/10.1145/1265949.1265953

ACM Journal on Emerging Technologies in Computing Systems, Vol. 3, No. 2, Article 7, Publication date: July 2007.

1. INTRODUCTION

With the increase in system complexity, performing complete fault coverage at the testing phase of the design cycle is very difficult to achieve, if not impossible. In addition, environmental effects such as electromagnetic interference, misuse by users, and the natural ageing of components mean system faults are likely to occur. These faults can cause errors which, if left untreated, could cause system failure. The role of fault tolerance is to deal with the errors caused by faults in order to avoid failure. Fault tolerance along with fault detection and recovery are techniques used in the design, implementation, and operation of dependable computing systems [Lee and Anderson 1990].

Fault tolerance is increasingly a crucial part of system designs. Many systems have part or all of their function classified as critical in one form or another. Since fully testing a system is generally unrealistic, critical functions must be protected online. This is often achieved by using fault tolerance to cope with errors produced during the operation of the system. Traditionally, two approaches are taken, both requiring the replication of the system, or system subsections, to be protected.
Simple static redundancy (such as N-version systems [Lee and Anderson 1990]) involves the concurrent operation of redundant modules, each contributing to a majority decision for a final output. Alternatively, dynamic redundancy operates using a single module, and when a failure is detected or expected, one of the redundant modules is switched into its place. However, these approaches are achieved at the expense of increased equipment needs, due to the required replication of hardware, and increased design time and costs. These redundancy schemes are termed space redundancy, as the replicated sections are physically distributed over space. Another category, time redundancy, benefits from not requiring a replication of hardware; instead, the redundancy is distributed over time. The same operation is repeated, and an output is obtained from a consensus of the individual runs. All these redundancy schemes apply equally to a hardware process, a software process, or a combination of both.

Providing continual fault-free operation in a system implies a continual mapping of a logical system onto a nonfaulty physical system. When faults arise, a mechanism must be provided for reconfiguring the physical system such that the logical system can still be represented by the remaining nonfaulty processing elements. Whether the physical platform is a distributed software processor system or consists purely of hard circuitry, fault tolerance requires redundancy in the system’s basic processing elements. The reconfiguration mechanisms that control utilization of these processing elements can be considered to be based on one of two types of scheme: time-based redundancy reallocation or hardware-based redundancy reallocation. Time-based use of redundancy involves distributing the function of faulty processing elements among neighboring resources.
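As a brief aside, the static (space) and time redundancy schemes described above can be illustrated with a minimal Python sketch. The function names, the strict-majority rule, and the toy TransientFault module are assumptions made for illustration only; they are not taken from any system discussed in this article.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the outputs."""
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among redundant outputs")
    return value


def space_redundant(modules: Sequence[Callable], x):
    """Static (space) redundancy: run every replicated module, then vote."""
    return majority_vote([module(x) for module in modules])


def time_redundant(module: Callable, x, runs: int = 3):
    """Time redundancy: repeat the same operation several times, then vote."""
    return majority_vote([module(x) for _ in range(runs)])


class TransientFault:
    """Hypothetical module whose first call returns a corrupted result."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: int) -> int:
        self.calls += 1
        return -1 if self.calls == 1 else x + 1


if __name__ == "__main__":
    def good(x: int) -> int:
        return x + 1

    def stuck_at_zero(x: int) -> int:
        # Models a permanently faulty replica.
        return 0

    print(space_redundant([good, good, stuck_at_zero], 4))  # prints 5
    print(time_redundant(TransientFault(), 4))              # prints 5
```

In this sketch the space-redundant version masks a permanently faulty replica at the cost of three modules, whereas the time-redundant version masks only a transient fault but needs a single module and three runs.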
When reconfiguration occurs, processing elements dedicate some time to performing their own tasks and some to performing the faulty neighbor’s functions, possibly resulting in some degradation of the system’s performance. In addition, the system operations that are being performed must be sufficiently flexible to ensure their reallocation can be simply performed in real time.

Reallocating processes in a hardware redundancy scheme requires spare processing elements and interconnects in order to replace those that become faulty. For this process, reconfiguration algorithms must optimize the use of spares. In the ideal case, a processing system with N spares is able to tolerate N faulty processing elements. However, in practice, this goal is far from being achieved. Reconfiguration of the functional system may not be possible due to limitations of the interconnection capabilities and available resources of each cell.

The majority of hardware redundancy reconfiguration techniques rely on complex algorithms to reassign physical resources to the elements of the logical array. In most cases, these algorithms are executed by a central controller which also performs diagnostic functions and accomplishes the reconfiguration of the physical system. This approach has been demonstrated to be effective, but its centralized nature makes it prone to collapse if the control unit fails. These mechanisms also rely on the designer making a priori decisions on reconfiguration strategies and data/code movement, which are prone to error and may in practice be less than ideal. Furthermore, the signal timings involved in global control are often prohibitively long and are therefore unsuitable for controlling high-speed systems.
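To make the gap between the ideal case (N spares tolerate N faults) and practice more concrete, the following Python sketch shows a toy centralized reallocation. It assumes a hypothetical one-dimensional array of cells in which a spare can replace a faulty cell only if it lies within `reach` positions, standing in for limited interconnect; the function name `reassign` and the augmenting-path matching are illustrative choices, not the algorithms used in the cited work.

```python
from typing import Dict, List, Optional, Set


def reassign(faulty: List[int], spares: List[int],
             reach: int) -> Optional[Dict[int, int]]:
    """Map each faulty cell position to a distinct spare within `reach`.

    Returns a dict {faulty position: spare position}, or None when the
    interconnect limit makes a complete reassignment impossible.
    """
    owner: Dict[int, int] = {}  # spare position -> faulty cell it replaces

    def place(cell: int, seen: Set[int]) -> bool:
        for spare in spares:
            if abs(spare - cell) > reach or spare in seen:
                continue  # out of interconnect range, or already tried
            seen.add(spare)
            # Take a free spare, or try to move its current user elsewhere.
            if spare not in owner or place(owner[spare], seen):
                owner[spare] = cell
                return True
        return False

    for cell in faulty:
        if not place(cell, set()):
            return None
    return {cell: spare for spare, cell in owner.items()}


if __name__ == "__main__":
    # Two spares, two faults, each fault near its own spare: success.
    print(reassign([3, 8], spares=[4, 9], reach=2))
    # Two spares, two faults, but both faults can only reach spare 4.
    print(reassign([3, 5], spares=[4, 9], reach=2))  # prints None
```

The second call fails even though the number of spares equals the number of faults, which is the practical limitation noted above. The sketch also keeps all state in one controller, so it shares the single point of failure that the text attributes to centralized schemes.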
An alternative approach is to distribute the diagnosis and reconfiguration algorithms among all the processing elements in the system. In this way, no central agent is necessary and, consequently, the reliability and time response of the system should improve. However, this decentralized approach has tended to increase the complexity of the reconfiguration algorithm and the amount of communications within the network. In addition, considerable work is required in producing redundancy [Ortega et al. 2000].

Traditionally, fault tolerance has been added explicitly to system designs by including redundant hardware and/or software which take over when an error has been detected. A novel alternative approach would be to design the system in such a way that the redundancy is incorporated implicitly into the hardware and/or software during the design phase. This should provide a more holistic approach to the design process [Ortega et al. 2000; Hollingworth et al. 2000; Canham and Tyrrell 2003; Tyrrell et al. 2001; Bradley and Tyrrell 2002]. We already know that genetic algorithms and genetic programming can adapt and optimize the behavior and structure of solutions to perform a specific task [Fogel 2006], but the aim here is that they should learn to deal with faults within their operation space. This implicit redundancy would make the system response invariant to the occurrence of faults [Thompson et al. 1999; Layzell and Thompson 2000].

This article illustrates a number of novel techniques, all based in the field of bio-inspired electronics, that provide varying degrees of dependability over and above standard designs. In particular, three different techniques are considered: using FPGAs and ideas from developmental biology to create designs that possess emergent fault-tolerant properties, using FPGAs and continuous evolution to circumvent faults
