Rx: Treating Bugs As Allergies— A Safe Method to Survive Software Failures Feng Qin, Joseph Tucek, Jagadeesan Sundaresan and Yuanyuan Zhou Department of Computer Science University of Illinois at Urbana Champaign ffengqin, tucek, sundaresan,
[email protected] ABSTRACT Keywords Many applications demand availability. Unfortunately, software Availability, Bug, Reliability, Software Failure failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limita- 1. INTRODUCTION tions: Required application restructuring, inability to address deter- ministic software bugs, unsafe speculation on program execution, 1.1 Motivation and long recovery time. Many applications, especially critical ones such as process con- This paper proposes an innovative safe technique, called Rx, trol or on-line transaction monitoring, require high availability [27]. which can quickly recover programs from many types of software For server applications, downtime leads to lost productivity and lost bugs, both deterministic and non-deterministic. Our idea, inspired business. According to a report by Gartner Group [48] the average from allergy treatment in real life, is to rollback the program to a cost of an hour of downtime for a financial company exceeds six recent checkpoint upon a software failure, and then to re-execute million US dollars. With the tremendous growth of e-commerce, the program in a modified environment. We base this idea on the almost every kind of organization is becoming dependent upon observation that many bugs are correlated with the execution envi- highly available systems. ronment, and therefore can be avoided by removing the “allergen” Unfortunately, software failures severely reduce system avail- from the environment.