Rx: Treating Bugs As Allergies— a Safe Method to Survive Software Failures
Feng Qin, Joseph Tucek, Jagadeesan Sundaresan and Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana Champaign
ffengqin, tucek, sundaresan, [email protected]

ABSTRACT

Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: required application restructuring, inability to address deterministic software bugs, unsafe speculation on program execution, and long recovery time.

This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and non-deterministic. Our idea, inspired by allergy treatment in real life, is to roll back the program to a recent checkpoint upon a software failure, and then to re-execute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the “allergen” from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.

We have implemented Rx on Linux. Our experiments with four server applications that contain six bugs of various types show that Rx can survive all six software failures and provide transparent fast recovery within 0.017-0.16 seconds, 21-53 times faster than the whole program restart approach for all but one case (CVS). In contrast, the two tested alternatives, a whole program restart approach and a simple rollback and re-execution without environmental changes, cannot successfully recover the three servers (Squid, Apache, and CVS) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a non-deterministic concurrency bug. Additionally, Rx's checkpointing system is lightweight, imposing small time and space overheads.

Categories and Subject Descriptors
D.4.5 [Operating Systems]: Reliability

General Terms
Design, Experimentation, Reliability

Keywords
Availability, Bug, Reliability, Software Failure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP'05, October 23–26, 2005, Brighton, United Kingdom.
Copyright 2005 ACM 1595930795/05/0010 ...$5.00.
© ACM, 2005. This is a minor revision of the work published in SOSP 2005, http://doi.acm.org/10.1145/1095809.1095833

1. INTRODUCTION

1.1 Motivation

Many applications, especially critical ones such as process control or on-line transaction monitoring, require high availability [27]. For server applications, downtime leads to lost productivity and lost business. According to a report by Gartner Group [48], the average cost of an hour of downtime for a financial company exceeds six million US dollars. With the tremendous growth of e-commerce, almost every kind of organization is becoming dependent upon highly available systems.

Unfortunately, software failures severely reduce system availability. A recent study showed that software defects account for up to 40% of system failures [37]. Memory-related bugs and concurrency bugs are common software defects, causing more than 60% of system vulnerabilities [17]. For this reason, software companies invest enormous effort and resources on software testing prior to releasing software. However, software failures still occur during production runs since some bugs inevitably slip through even the strictest testing. Therefore, to achieve higher system availability, mechanisms must be devised to allow systems to survive the effects of uneliminated software bugs to the largest extent possible.

Previous work on surviving software failures can be classified into four categories. The first category encompasses various flavors of rebooting (restarting) techniques, including whole program restart [27, 54], micro-rebooting of partial system components [13, 14], and software rejuvenation [30, 26, 8]. Since many of these techniques were originally designed to handle hardware failures, most of them are ill-suited for surviving software failures. For example, they cannot deal with deterministic software bugs, a major cause of software failures [18], because these bugs will still occur even after rebooting. Another important limitation of these methods is service unavailability while restarting, which can take up to several seconds [57]. Servers that buffer a significant amount of state in main memory (e.g. data buffer caches) require a long period to warm up to full service capacity [11, 58]. Micro-rebooting [14] addresses this problem to some extent by only rebooting the failed components. However, it requires legacy software to be reconstructed in a loosely-coupled fashion.

The second category includes general checkpointing and recovery. The most straightforward method in this category is to checkpoint, roll back upon failures, and then re-execute either on the same machine [24, 43] or on a different machine designated as the “backup server” [27, 6, 11, 12, 58, 3, 61]. Similar to the whole program restart approach, these techniques were also proposed to deal with hardware failures, and thereby suffer from the same limitations in addressing software failures. Progressive retry [59] is an interesting improvement over these approaches. It reorders messages to increase the degree of non-determinism. While this work proposes a promising direction, it limits the technique to message reordering. As a result, it cannot handle bugs unrelated to message order. For example, if a server receives a malicious request that exploits a buffer overflow bug, simply reordering messages will not solve the problem. The most aggressive approaches in the checkpointing/recovery category include recovery blocks [42] and N-version programming [5, 4, 45], both of which rely on different implementation versions upon failures. These approaches may be able to survive deterministic bugs under the assumption that different versions fail independently. But they are rarely adopted by software companies due to prohibitive costs. An alternative to N-version programming is data diversity, which tries to execute multiple copies of the same program, each with a different form of the input [2]. While proposing an inspiring idea, this work focuses on the theoretical framework instead of the practical aspects. In particular, it does not answer how to apply the idea transparently without modifying the application and without causing major performance degradation during normal execution.

The third category comprises application-specific recovery mechanisms, such as the multi-process model, exception handling, etc. Some multi-processed applications, such as the old version of the Apache HTTP Server and the CVS server, spawn a new process for each client connection and therefore can simply kill a failed process and start a new one to handle a failed request. While capable of surviving certain software failures, this technique has several limitations. First, the new process will most likely fail again for deterministic bugs at the same place given the same request (e.g. a

1.2 Our Contributions

In this paper, we propose a safe (not speculatively “fixing” the bug) technique, called Rx, to quickly recover from many types of software failures caused by common software defects, both deterministic and non-deterministic. It requires few to no changes to applications' source code, and provides diagnostic information for postmortem bug analysis. Our idea is to roll back the program to a recent checkpoint when a bug is detected, dynamically change the execution environment based on the failure symptoms, and then re-execute the buggy code region in the new environment. If the re-execution successfully passes through the problematic period, the new environmental changes are disabled to avoid imposing time and space overheads.

Our idea is inspired by real life. When a person suffers from an allergy, the most common treatment is to remove the allergens from their living environment. For example, if patients are allergic to milk, they should remove dairy products from their diet. If patients are allergic to pollen, they may install air filters to remove pollen from the air. Additionally, when removing a candidate allergen from the environment successfully treats the symptoms, it allows diagnosis of the cause of the symptoms. Obviously, such treatment cannot and also should not start before patients show allergic symptoms, since changing the living environment requires special effort and may also be unhealthy.

In software, many bugs resemble allergies. That is, their manifestation can be avoided by changing the execution environment. According to a previous study by Chandra and Chen [18], around 56% of faults in Apache depend on the execution environment¹. Therefore, by removing the “allergen” from the execution environment, it is possible to avoid such bugs. For example, a memory corruption bug may disappear if the memory allocator delays the recycling of recently freed buffers or allocates