Achieving Fault-Tolerant Software with Rejuvenation and Reconfiguration
William Yurcik and David Doss, Illinois State University

IEEE Software, July/August 2001, 0740-7459/01/$10.00 © 2001 IEEE

The authors present two complementary ways of dealing with software aging: reinitializing to a known operating state before a failure occurs or reconfiguring after a failure such that the service the software provides remains operational.

Requirements for constantly functioning software have increased dramatically with commercialization of the Internet. Application service providers and service-level agreements specify contractual software performance in terms of guaranteed availability and error thresholds (failed connection attempts, transaction failures, and fulfillment failures). These requirements are difficult to satisfy, particularly as applications grow in complexity, but the alternative of letting systems unpredictably crash is becoming less of an option. Such crashes are becoming increasingly expensive to business and potentially life threatening to those who depend on essential services built on networked software systems.

As the makeup of systems is increasingly composed of software relative to hardware, system crashes are more likely to be the result of a software fault than a hardware fault. Although enormous efforts go into developing defect-free software, it isn’t always possible to find and eliminate every software bug. Software engineers develop software that works in the best of all possible worlds, but the real world includes environmental disruptions, transient faults, human errors, and malicious attacks [1]. Building constantly functioning software systems in such a highly dynamic and unbounded environment is a challenge.

Even if individual software could be certifiably “assured” as bug-free, this assured software would likely have to execute on systems with “nonassured” software that could potentially introduce new faults into the system [2]. Developing systems through software integration and reuse (rather than customized design) has become a cornerstone of modern software engineering. Thus, when considering software systems as a whole, it is prudent to assume that bugs are inherent and software should be fault tolerant.

Furthermore, when specific software continuously executes, software aging occurs: The software ages due to error conditions that accumulate with time and use [3–5]. Causes include memory leaks, memory fragmentation, memory bloating, missing scheduling deadlines, broken pointers, poor register use, and build-up of numerical round-off errors. This aging manifests itself in terms of system failures due to deteriorating operating system resources, unreleased file locks, and data corruption [6]. (For more information, see the “Software Decay” sidebar.)

Software aging has occurred in both software used on a massive scale (Microsoft Windows 95 and Netscape Navigator) and specialized high-availability safety-critical software [7]. As more PC users leave their computers “always-on” through a cable modem or DSL connections to the Internet, the likelihood of system crashes due to software aging is increasingly relevant.

Sidebar: Software Decay

Software decay is a proposed phenomenon, and it refers to the changing behavior of software [1]. Specifically, software decay refers to software that degrades through time as it becomes increasingly difficult and expensive to maintain. If software remains in the same supporting environment, it is possible for it to constantly function without changing its behavior, but this is unrealistic. Hardware- and software-supporting environments change over time, and new features are added to change or enhance functionality. The original software architects can’t anticipate all possible changes, so unanticipated changes sometimes violate design principles or fail to follow the intent of imprecise requirements. The result is software that has decayed in function and would be more efficient if completely rewritten.

While software decay occurs in incremental steps through change processes that humans initiate, software aging occurs through underlying operating system resource management in response to dynamic events and a varying load over time.

Reference
1. S.G. Eick et al., “Does Software Decay? Assessing the Evidence from Change Management Data,” IEEE Trans. Software Eng., vol. 27, no. 1, Jan. 2001.

Software rejuvenation

Most software theory has focused on static behavior by analyzing software listings. Little work was performed on longitudinal dynamic behavior and performance under varying loads until Yennun Huang and colleagues introduced the concept of software rejuvenation in 1995 [3]. Software rejuvenation is a proactive approach that involves stopping executing software periodically, cleaning internal states, and then restarting the software. Rejuvenation may involve all or some of the following: garbage collection, memory defragmentation, flushing operating system kernel tables, and reinitializing internal data structures [7].

Software rejuvenation does not remove bugs resulting from software aging but rather prevents them from manifesting themselves as unpredictable whole system failures. Periodic rejuvenation limits the state space in the execution domain and transforms a nonstationary random process into a stationary process that can be predicted and avoided.

To all computer users, rejuvenation is as intuitive as occasionally rebooting your computer. Of course, Murphy’s Law holds that this reboot will occur when irreplaceable data will be lost. Examples of using large-scale software rejuvenation include [3, 7]

- the Patriot missile defense system’s requirement to switch the system off and on every eight hours (1992);
- software rejuvenation for AT&T billing applications (1995);
- telecommunications switching software rejuvenation to prevent performance degradation (1997); and
- on-board preventative maintenance for long-life deep space missions (1998).

Rejuvenation is similar to preventive maintenance for hardware systems [7]. While rejuvenation incurs immediate overhead in terms of some services being temporarily unavailable, the idea is to prevent more lengthy unexpected failures from occurring in the future.

Cluster computing also provides similar fault tolerance by using planned outages. When we detect a failure in one computer in a cluster, we can “fail over” the executing process to another computer within the cluster. Similar to rejuvenation, computers can be removed from the cluster as if under failure, serviced, and upgraded, and then restored back to the cluster [5]. This ability to handle unexpected failures and scheduled maintenance makes clusters the only information systems that can attain 100 percent availability.

The critical factor in making scheduled downtime preferable to unscheduled downtime is determining how often a system must be rejuvenated. If unexpected failures are catastrophic, then a more aggressive rejuvenation schedule might be justified in terms of cost and availability. If unexpected failures are equivalent to scheduled downtime in terms of cost and availability, then a reactive approach is more appropriate.
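The trade-off between scheduled and unscheduled downtime can be made concrete with a toy simulation. The Python sketch below is only an illustration under stated assumptions, not the authors' model: `AgingService`, `simulate`, the leak-per-request aging model, and the cost figures (`crash_cost`, `rejuvenation_cost`) are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AgingService:
    """Toy service whose internal state degrades with use (hypothetical model)."""
    capacity: int = 100        # leaked units tolerated before the service crashes
    leak_per_request: int = 1  # assumed aging rate
    leaked: int = 0

    def handle_request(self) -> bool:
        """Serve one request; return False if accumulated aging causes a crash."""
        self.leaked += self.leak_per_request
        return self.leaked <= self.capacity

    def rejuvenate(self) -> None:
        """Stop, clean internal state, restart: modeled as clearing the leak."""
        self.leaked = 0


def simulate(requests: int,
             rejuvenate_every: Optional[int],
             crash_cost: float = 50.0,
             rejuvenation_cost: float = 1.0) -> Tuple[int, int, float]:
    """Return (crashes, rejuvenations, total downtime cost) for one run."""
    service = AgingService()
    crashes = 0
    rejuvenations = 0
    for i in range(1, requests + 1):
        if not service.handle_request():
            crashes += 1
            service.rejuvenate()      # restart after an unplanned failure
        elif rejuvenate_every and i % rejuvenate_every == 0:
            rejuvenations += 1
            service.rejuvenate()      # planned, proactive rejuvenation
    cost = crashes * crash_cost + rejuvenations * rejuvenation_cost
    return crashes, rejuvenations, cost
```

With these assumed numbers, `simulate(1000, None)` yields 9 crashes at a cost of 450.0, while `simulate(1000, 50)` yields no crashes and 20 rejuvenations at a cost of 20.0. If `crash_cost` is set equal to `rejuvenation_cost`, the advantage of the schedule disappears, which mirrors the case where a purely reactive approach is sufficient.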
Cur- rejuvenation to prevent performance rently, the two techniques used to determine degradation (1997); and an optimal rejuvenation schedule are a July/August 2001 IEEE SOFTWARE 49 N-Version Programming NVP, first proposed by Algirdas Avizienis in 1977, refers to multiple (N > 2) functionally equivalent program versions based on the same specifi- cation.1 To provide fault tolerance, each version must employ design diversity the “N-Version Programming” sidebar) (different algorithms and programming languages) to maximize the probabil- and comparing outputs; ity that any error results are distinguishable. This is based on the conjecture I time redundancy expressed as repeti- that the probability of a random, independent fault producing the same error tively executing the same program to results in two or more versions is less when the versions are diverse. check consistent outputs; and A consistent set of inputs is supplied to all N versions and all N versions I information redundancy expressed as are executed in parallel. Similar to majority-voting hardware units, a con- redundancy bits that help detect and sensus software decision mechanism then examines the results from all N correct errors in messages and outputs. versions to determine the accurate result and mask error results. NVP is in- creasingly feasible because asymmetrical multiprocessing now allows differ- Redundancy in these three dimensions ent processors running different operating systems for applications requiring provides flexible and efficient recovery, inde- reliability. Research continues into the feasibility of NVP for different prob- pendent of knowledge about the underlying lems and whether the NVP assumption of independent failures from func- failure (such as fault identification or causal tionally equivalent but independently developed versions holds (or whether events). While robust software can be built failures remain correlated). 
with enough redundancy to handle almost any arbitrary failure, the challenge is to pro- Reference vide fault tolerance by minimizing redun- 1. A. Avizienis, “The Methodology of N-Version Programming,” Software Fault Tolerance, dancy—which reduces cost and complexity.8 M.R. Lyu, ed., John Wiley & Sons, New York, 1995, pp. 23–46. However, reactive techniques don’t have to mean that a system must crash before it can be gracefully recovered.
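The three redundancy dimensions listed above (multiple versions with output comparison, repeated execution, and redundancy bits) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the three squaring routines standing in for independently developed versions, and the seeded bug are hypothetical. The versions run sequentially here rather than in parallel, and a single even-parity bit only detects an odd number of bit flips; correcting errors requires stronger codes such as Hamming codes.

```python
from collections import Counter
from typing import Callable, List, Sequence


def vote(results: Sequence[int]) -> int:
    """Consensus decision: return the value a strict majority of versions agree on."""
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise RuntimeError("no majority among version outputs")
    return value


# N-version execution: three hypothetical, independently written squaring routines.
def square_v1(x: int) -> int:
    return x * x


def square_v2(x: int) -> int:
    return sum(abs(x) for _ in range(abs(x)))  # repeated addition


def square_v3_faulty(x: int) -> int:
    return x * x + (1 if x == 3 else 0)  # deliberately seeded bug for x == 3


def n_version(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Feed the same input to every version and mask a minority of wrong results."""
    return vote([version(x) for version in versions])


# Time redundancy: execute the same program repeatedly and require consistent outputs.
def run_repeated(program: Callable[[int], int], x: int, times: int = 3) -> int:
    outputs = {program(x) for _ in range(times)}
    if len(outputs) != 1:
        raise RuntimeError("inconsistent outputs across repeated executions")
    return outputs.pop()


# Information redundancy: one even-parity bit appended to a message.
def add_parity(bits: List[int]) -> List[int]:
    return bits + [sum(bits) % 2]


def parity_ok(word: List[int]) -> bool:
    return sum(word) % 2 == 0
```

For example, `n_version([square_v1, square_v2, square_v3_faulty], 3)` returns 9 because the two correct versions outvote the faulty one, and flipping any single bit of `add_parity([1, 0, 1, 1])` makes `parity_ok` return `False`. None of these checks needs to know what caused the fault, which is the point made above about recovery being independent of fault identification.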