Focus: fault tolerance

Achieving Fault-Tolerant Software with Rejuvenation and Reconfiguration

William Yurcik and David Doss, Illinois State University

IEEE Software, July/August 2001, p. 48, 0740-7459/01/$10.00 © 2001 IEEE

The authors present two complementary ways of dealing with software aging: reinitializing to a known operating state before a failure occurs or reconfiguring after a failure such that the service the software provides remains operational.

Requirements for constantly functioning software have increased dramatically with commercialization of the Internet. Application service providers and service-level agreements specify contractual software performance in terms of guaranteed availability and error thresholds (failed connection attempts, transaction failures, and fulfillment failures). These requirements are difficult to satisfy, particularly as applications grow in complexity, but the alternative of letting systems unpredictably crash is becoming less of an option. Such crashes are becoming increasingly expensive to business and potentially life threatening to those who depend on essential services built on networked software systems.

As the makeup of systems is increasingly composed of software relative to hardware, system crashes are more likely to be the result of a software fault than a hardware fault. Although enormous efforts go into developing defect-free software, it isn't always possible to find and eliminate every software bug. Software engineers develop software that works in the best of all possible worlds, but the real world includes environmental disruptions, transient faults, human errors, and malicious attacks [1]. Building constantly functioning software systems in such a highly dynamic and unbounded environment is a challenge.

Even if individual software could be certifiably "assured" as bug-free, this assured software would likely have to execute on systems with "nonassured" software that could potentially introduce new faults into the system [2]. Developing systems through software integration and reuse (rather than customized design) has become a cornerstone of modern software engineering. Thus, when considering software systems as a whole, it is prudent to assume that bugs are inherent and software should be fault tolerant.

Furthermore, when specific software continuously executes, software aging occurs: The software ages due to error conditions that accumulate with time and use [3–5]. Causes include memory leaks, memory fragmentation, memory bloating, missing scheduling deadlines, broken pointers, poor register use, and build-up of numerical round-off errors. This aging manifests itself in terms of system failures due to deteriorating resources, unreleased file locks, and data corruption [6]. (For more information, see the "Software Decay" sidebar.)

Software aging has occurred in both software used on a massive scale (Microsoft Windows 95 and Netscape Navigator) and specialized high-availability safety-critical software [7]. As more PC users leave their computers "always-on" through a cable modem or DSL connections to the Internet, the likelihood of system crashes due to software aging is increasingly relevant.

Sidebar: Software Decay

Software decay is a proposed phenomenon, and it refers to the changing behavior of software [1]. Specifically, software decay refers to software that degrades through time as it becomes increasingly difficult and expensive to maintain. If software remains in the same supporting environment, it is possible for it to constantly function without changing its behavior, but this is unrealistic. Hardware- and software-supporting environments change over time, and new features are added to change or enhance functionality. The original software architects can't anticipate all possible changes, so unanticipated changes sometimes violate design principles or fail to follow the intent of imprecise requirements. The result is software that has decayed in function and would be more efficient if completely rewritten.

While software decay occurs in incremental steps through change processes that humans initiate, software aging occurs through underlying operating system resource management in response to dynamic events and a varying load over time.

Reference
1. S.G. Eick et al., "Does Software Decay? Assessing the Evidence from Change Management Data," IEEE Trans. Software Eng., vol. 27, no. 1, Jan. 2001.

Software rejuvenation

Most software theory has focused on static behavior by analyzing software listings. Little work was performed on longitudinal dynamic behavior and performance under varying loads until Yennun Huang and colleagues introduced the concept of software rejuvenation in 1995 [3]. Software rejuvenation is a proactive approach that involves stopping executing software periodically, cleaning internal states, and then restarting the software. Rejuvenation may involve all or some of the following: garbage collection, memory defragmentation, flushing operating system kernel tables, and reinitializing internal data structures [7].

Software rejuvenation does not remove bugs resulting from software aging but rather prevents them from manifesting themselves as unpredictable whole system failures. Periodic rejuvenation limits the state space in the execution domain and transforms a nonstationary random process into a stationary process that can be predicted and avoided.

To all computer users, rejuvenation is as intuitive as occasionally rebooting your computer. Of course, Murphy's Law holds that this reboot will occur when irreplaceable data will be lost. Examples of using large-scale software rejuvenation include [3, 7]

• the Patriot missile defense system's requirement to switch the system off and on every eight hours (1992);
• software rejuvenation for AT&T billing applications (1995);
• telecommunications switching software rejuvenation to prevent performance degradation (1997); and
• on-board preventative maintenance for long-life deep space missions (1998).

Rejuvenation is similar to preventive maintenance for hardware systems [7]. While rejuvenation incurs immediate overhead in terms of some services being temporarily unavailable, the idea is to prevent more lengthy unexpected failures from occurring in the future.

Cluster computing also provides a similar fault tolerance by using planned outages. When we detect a failure in one computer in a cluster, we can "fail over" the executing process to another computer within the cluster. Similar to rejuvenation, computers can be removed from the cluster as if under failure, serviced, and upgraded, and then restored back to the cluster [5]. This ability to handle unexpected failures and scheduled maintenance makes clusters the only information systems that can attain 100 percent availability.

The critical factor in making scheduled downtime preferable to unscheduled downtime is determining how often a system must be rejuvenated. If unexpected failures are catastrophic, then a more aggressive rejuvenation schedule might be justified in terms of cost and availability. If unexpected failures are equivalent to scheduled downtime in terms of cost and availability, then a reactive approach is more appropriate. Currently, the two techniques used to determine an optimal rejuvenation schedule are a measurement-based technique (which estimates rejuvenation timing based on system resource metrics) and a modeling-based technique (which uses mathematical models and simulation to estimate rejuvenation timing based on predicted performance) [7].

IBM is pioneering software rejuvenation technology, in conjunction with Duke University, and products are beginning to appear. Software rejuvenation has been incorporated in IBM's Netfinity Director for […]-based IBM and non-IBM servers, desktops, workstations, and notebook systems, and an extension has been created for Microsoft Cluster Service.

Sidebar: N-Version Programming

NVP, first proposed by Algirdas Avizienis in 1977, refers to multiple (N > 2) functionally equivalent program versions based on the same specification [1]. To provide fault tolerance, each version must employ design diversity (different algorithms and programming languages) to maximize the probability that any error results are distinguishable. This is based on the conjecture that the probability of a random, independent fault producing the same error results in two or more versions is less when the versions are diverse.

A consistent set of inputs is supplied to all N versions and all N versions are executed in parallel. Similar to majority-voting hardware units, a consensus software decision mechanism then examines the results from all N versions to determine the accurate result and mask error results. NVP is increasingly feasible because asymmetrical multiprocessing now allows different processors running different operating systems for applications requiring reliability. Research continues into the feasibility of NVP for different problems and whether the NVP assumption of independent failures from functionally equivalent but independently developed versions holds (or whether failures remain correlated).

Reference
1. A. Avizienis, "The Methodology of N-Version Programming," Software Fault Tolerance, M.R. Lyu, ed., John Wiley & Sons, New York, 1995, pp. 23–46.

Software reconfiguration

In contrast to proactive rejuvenation, fault-tolerance techniques have traditionally been reactive. The reactive approach to achieving fault-tolerant software is to reconfigure the system after detecting a failure, and redundancy is the primary tool used. For hardware, the reconfiguration approach to providing fault tolerance is redundancy in terms of backup processors, power supplies, disk drives, and circuits.

For software, the reconfiguration approach uses redundancy in three different dimensions:

• software redundancy expressed as independently written programs performing the same task executing in parallel (see the "N-Version Programming" sidebar) and comparing outputs;
• time redundancy expressed as repetitively executing the same program to check consistent outputs; and
• information redundancy expressed as redundancy bits that help detect and correct errors in messages and outputs.

Redundancy in these three dimensions provides flexible and efficient recovery, independent of knowledge about the underlying failure (such as fault identification or causal events). While robust software can be built with enough redundancy to handle almost any arbitrary failure, the challenge is to provide fault tolerance by minimizing redundancy, which reduces cost and complexity [8].

However, reactive techniques don't have to mean that a system must crash before it can be gracefully recovered. Software reconfiguration can use redundant resources for real-time recovery while dynamically considering a large number of factors (operating system services, processor load, and memory variables among others); a human-in-the-loop might not be necessary [9].

To PC users, however, reactive reconfiguration means recovery after a system crash. When your PC freezes, reconfiguration can take place by listing executing processes ([…]) and attempting to identify and terminate the process responsible for the problem, often using a trial-and-error approach. If something catastrophic has occurred, a reboot from tape backup or original system disks might be necessary. Realistically, most users do not keep current backups, so a market for automatic software reconfiguration products has taken off (see the "Software Reconfiguration Products" sidebar).

Reconfiguration techniques have been pioneered for fault-tolerant networks and include [8, 10–12]

• preplanned reconfiguration with disjoint working and backup circuits, such that after detecting a failure in a working circuit, the traffic can be automatically rerouted to its dedicated backup circuit;
• dynamic reconfiguration, such that after detecting a failure, signaling messages search the network for spare circuits on which to reroute traffic;
• multilayer reconfiguration, in which recovery from a failure at one layer might take place at higher layers either independently or in coordination with each layer having different characteristics; and
• priority reconfiguration, which might involve re-optimizing an entire network to reconnect disrupted high-priority circuits over currently established lower-priority circuits.

Each of these reconfiguration techniques requires the provisioning of spare resources for redundancy that can be used when a failure occurs. The redundant resources can be dedicated, to guarantee reconfiguration, or shared, in which case recovery might not be possible (spare resources might not be available at the time of a fault). On the other hand, sharing redundant resources is more efficient in environments of low fault probability or when reconfiguration need not be guaranteed.

Reconfiguration can be provided at different layers and implemented with different algorithms at each layer [13]. In fact, if all restoration mechanisms are similar at each layer, there is increased whole system vulnerability [2]. For example, if all layers used preplanned mechanisms, then each layer, and the system as a whole, will not be able to handle unexpected fault events. If all layers used real-time search algorithms, then system behavior would be hard to predict. Instead, it is better to use complementary reconfiguration algorithms at different layers and draw on the benefits of each. For example, a preplanned algorithm at a lower layer (for speed), followed by a real-time search mechanism at a higher layer, can handle unexpected faults that lower layers have been unable to handle.

In general, reconfiguration of successfully executing software for recovery from a failure in another part of a system should only be performed if it can be accomplished transparently such that it is imperceptible to users. However, there are cases when high-priority software fails and requires resources for recovery. In this scenario, lower-priority software should be delayed or terminated and its resources reassigned to aid this recovery [9]. Intentional system degradation to maintain essential processing is the most extreme type of reconfiguration.

Sidebar: Software Reconfiguration Products

Given the maturation of software reconfiguration techniques, products have begun to appear, particularly in the PC operating system market. These products do not protect from hardware failures, such as CPU malfunction or disk failure, but they can be useful tools against buggy software and human error. In general, these products track software changes (system, application, data file, and registry setting), use a hard disk to make redundant copies, and let the user restore (reconfigure) a system to a previous "snapshot." Note that there are trade-offs for providing this reconfiguration capability versus system performance and hard disk space requirements.

For more information, here is a representative sampling of current products:

• ConfigSafe v 4.0, by imagine LAN (www.configsafe.com),
• GoBack v 2.21, by Roxio (www.roxio.com),
• Rewind, by Power On Software (www.poweronsoftware.com), and
• System Restore Utility, included in Microsoft Windows ME (www.microsoft.com).

Figure 1. A model depicting the complementary nature of rejuvenation and reconfiguration [7]. (Figure labels: operational state, failure-probable state, failed state; aging, bug manifestation, rejuvenation, reconfiguration.)

Fault-tolerant software requires a whole system approach [2, 9]. We have attempted to outline the use of contrasting proactive and reactive approaches to achieve fault-tolerant software, but it's not easy to say which approach is better. Both approaches are nonexclusive and complementary, such that they work well together in an integrated system (see Figure 1). The high cost of redundancy required for reactive reconfiguration suggests it is better suited for software in which a rejuvenation schedule appears unrealistic due to imminent faults or where an outage's effect could be catastrophic. Proactive rejuvenation is the preferred solution when faults can be efficiently avoided using a realistic rejuvenation schedule or where the risk an outage presents is low. Because both approaches are relatively new and under study, we direct readers to our references for more details.
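
To make the trade-off between the two approaches concrete, the states and transitions labeled in Figure 1 can be turned into a small availability calculation. The sketch below is only an illustration under stated assumptions: it assumes the arcs run operational to failure-probable (aging), failure-probable to failed (bug manifestation), failure-probable back to operational (rejuvenation), and failed back to operational (reconfiguration); it assumes exponentially distributed waiting times; and every rate, duration, and name (AgingModel, repair_hours, and so on) is hypothetical rather than taken from the cited models.

```python
from dataclasses import dataclass


@dataclass
class AgingModel:
    """One aging cycle: operational -> failure-probable -> failed or rejuvenated.

    Rates are per hour and durations are in hours. All values used below
    are hypothetical and chosen only for illustration.
    """
    aging_rate: float          # operational -> failure-probable
    failure_rate: float        # failure-probable -> failed (bug manifestation)
    rejuvenation_rate: float   # failure-probable -> operational (0 disables it)
    repair_hours: float        # unplanned downtime to reconfigure after a failure
    rejuvenation_hours: float  # planned downtime for one rejuvenation

    def failure_probability(self) -> float:
        """Chance that a cycle ends in a failure rather than a rejuvenation."""
        return self.failure_rate / (self.failure_rate + self.rejuvenation_rate)

    def availability(self) -> float:
        """Long-run fraction of time the service is up."""
        exit_rate = self.failure_rate + self.rejuvenation_rate
        # Mean time spent up per cycle: operational plus failure-probable.
        uptime = 1.0 / self.aging_rate + 1.0 / exit_rate
        p_fail = self.failure_probability()
        # Mean time spent down per cycle, weighted by how the cycle ends.
        downtime = (p_fail * self.repair_hours
                    + (1.0 - p_fail) * self.rejuvenation_hours)
        return uptime / (uptime + downtime)


# Hypothetical scenarios: no rejuvenation, cheap rejuvenation, and
# rejuvenation that costs as much downtime as an unplanned failure.
reactive_only = AgingModel(1 / 240, 1 / 72, 0.0, 4.0, 0.25)
with_rejuvenation = AgingModel(1 / 240, 1 / 72, 1 / 24, 4.0, 0.25)
costly_rejuvenation = AgingModel(1 / 240, 1 / 72, 1 / 24, 4.0, 4.0)

if __name__ == "__main__":
    for name, model in [("reactive only", reactive_only),
                        ("with rejuvenation", with_rejuvenation),
                        ("costly rejuvenation", costly_rejuvenation)]:
        print(f"{name}: availability = {model.availability():.4f}")
```

With these made-up numbers, availability rises from roughly 0.987 to roughly 0.995 when a rejuvenation takes far less time than a repair, and drops to roughly 0.985 when planned and unplanned downtime cost the same. That is consistent with the article's guidance that a reactive approach is more appropriate when unexpected failures are no worse than scheduled downtime.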
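
The layering recommended in the software reconfiguration discussion, a fast preplanned switch first and a real-time search only when that fails, can also be sketched in a few lines. This is a minimal illustration, not an implementation from the cited network survivability work: the five-node network, the link set, and the function names (path_is_up, find_path, reroute) are all hypothetical, and a plain breadth-first search stands in for the signaling messages that search for spare circuits.

```python
from collections import deque


def path_is_up(path, links):
    """True if every hop of the path is a link that is still working."""
    return all((a, b) in links or (b, a) in links
               for a, b in zip(path, path[1:]))


def find_path(links, src, dst):
    """Breadth-first search over working links; returns a node list or None."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    queue = deque([[src]])
    visited = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in adjacency.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


def reroute(working, backup, links):
    """Return (method, path), trying the cheapest recovery first."""
    if path_is_up(working, links):
        return "working", working
    # Lower layer: switch to the dedicated, preplanned backup circuit.
    if backup is not None and path_is_up(backup, links):
        return "preplanned", backup
    # Higher layer: search whatever links remain for a spare route.
    found = find_path(links, working[0], working[-1])
    if found is None:
        return "failed", None
    return "dynamic", found


# Hypothetical network with a working circuit and a disjoint backup circuit.
ALL_LINKS = {("A", "B"), ("B", "D"), ("A", "C"), ("C", "D"),
             ("A", "E"), ("E", "D")}
WORKING = ["A", "B", "D"]
BACKUP = ["A", "C", "D"]

if __name__ == "__main__":
    one_failure = ALL_LINKS - {("B", "D")}
    two_failures = ALL_LINKS - {("B", "D"), ("C", "D")}
    print(reroute(WORKING, BACKUP, one_failure))
    print(reroute(WORKING, BACKUP, two_failures))
```

The preplanned step is a constant-time check, which is why it suits the lower layer; the search step is slower and less predictable but covers the double failure that the dedicated backup circuit cannot.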

An analogy can be made between these fault-tolerant software approaches and CPU communications in a computer system. Reactive reconfiguration is equivalent to event-driven interrupts, and proactive rejuvenation is equivalent to polling resources. Preventing failures before they occur might be the best approach when finding all software bugs is possible, just as polling is preferable when CPU communication can be anticipated. However, when finding all bugs is improbable (or maybe testing is not even attempted), then having the flexibility to react to multi-priority interrupts with robust service-handling routines might be the critical last line of defense against software faults.

Acknowledgments

The authors thank Katerina Goseva-Popstojanova, Duke University, who provided an outstanding introduction to the concept of rejuvenation; David Tipper, University of Pittsburgh, and Deep Medhi, University of Missouri–Kansas City, for making significant contributions to the field of fault-tolerant networking using reconfiguration; and Kishor S. Trivedi, Duke University, for making seminal contributions in the development of software rejuvenation. Lastly, we thank past reviewers for their specific feedback that has significantly improved this article.

References

1. D. Milojicic, "Fred B. Schneider on Distributed Computing," IEEE Distributed Systems Online, vol. 1, no. 1, 2000, http://computer.org/dsonline/archives/ds100/ds1intprint.htm (current 11 June 2001).
2. W. Yurcik, D. Doss, and H. Kruse, "Survivability-Over-Security: Providing Whole System Assurance," IEEE/SEI/CERT Information Survivability Workshop (ISW), IEEE Computer Soc. Press, Los Alamitos, Calif., 2000, pp. 201–204.
3. Y. Huang et al., "Software Rejuvenation: Analysis, Module and Applications," 25th IEEE Intl. Symp. on Fault Tolerant Computing, 1995, pp. 381–390.
4. K.S. Trivedi, K. Vaidyanathan, and K. Goseva-Popstojanova, "Modeling and Analysis of Software Aging and Rejuvenation," 33rd Ann. Simulation Symp., Soc. Computer Simulation Int'l Press, 2000, pp. 270–279.
5. K. Vaidyanathan et al., "Analysis of Software Rejuvenation in Cluster Systems," Fast Abstracts, Pacific Rim Int'l Symp. Dependable Computing (PRDC), IEEE Computer Soc. Press, Los Alamitos, Calif., 2000, pp. 3–4.
6. S. Garg et al., "A Methodology for Detection and Estimation of Software Aging," Ninth Int'l Symp. Software Reliability Eng., 1998, pp. 283–292.
7. T. Dohi, K. Goseva-Popstojanova, and K.S. Trivedi, "Statistical Non-Parametric Algorithms to Estimate the Optimal Rejuvenation Schedule," Pacific Rim Int'l Symp. Dependable Computing (PRDC), IEEE Computer Soc. Press, Los Alamitos, Calif., 2000, pp. 77–84.
8. W. Yurcik and D. Tipper, "Survivable ATM Group Communications: Issues and Techniques," Eighth Intl. Conf. Telecomm. Systems, Vanderbilt Univ./Owen Graduate School of Management, Nashville, Tenn., 2000, pp. 518–537.
9. D. Wells et al., "Software Survivability," Proc. DARPA Information Survivability Conf. and Exposition (DISCEX), IEEE Computer Soc. Press, Los Alamitos, Calif., vol. 2, Jan. 2000, pp. 241–255.
10. D. Medhi, "Network Reliability and Fault Tolerance," Wiley Encyclopedia of Electrical and Electronics Engineering, John Wiley, New York, 1999.
11. D. Medhi and D. Tipper, "Multi-Layered Network Survivability - Models, Analysis, Architecture, Framework and Implementation: An Overview," Proc. DARPA Information Survivability Conference and Exposition (DISCEX), IEEE Computer Soc. Press, Los Alamitos, Calif., vol. 1, Jan. 2000, pp. 173–186.
12. D. Medhi and D. Tipper, "Towards Fault Recovery and Management in Communications Networks," J. Network and Systems Management, vol. 5, no. 2, Jun. 1997, pp. 101–104.
13. D. Johnson, "Survivability Strategies for Broadband Networks," IEEE Globecom, 1996, pp. 452–456.

About the Authors

William Yurcik is an assistant professor in the Department of Applied Computer Science at Illinois State University. Prior to his academic career, he worked for organizations such as the Naval Research Laboratory, MITRE, and NASA. Contact him at [email protected].

David Doss is an associate professor and the graduate program coordinator in the Department of Applied Computer Science at Illinois State University. He is also a retired Lt. Commander US Navy (SSN). Contact him at [email protected].

For further information on this or any other computing topic, please visit our Digital Library at http://computer.org/publications/dlib.
