Rx: Treating Bugs As Allergies— A Safe Method to Survive Software Failures

Feng Qin, Joseph Tucek, Jagadeesan Sundaresan and Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana-Champaign
{fengqin, tucek, sundaresan, yyzhou}@cs.uiuc.edu

ABSTRACT

Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: required application restructuring, inability to address deterministic software bugs, unsafe speculation on program execution, and long recovery time.

This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and non-deterministic. Our idea, inspired by allergy treatment in real life, is to roll back the program to a recent checkpoint upon a software failure, and then to re-execute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the “allergen” from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.

We have implemented Rx on Linux. Our experiments with four server applications that contain six bugs of various types show that Rx can survive all six software failures and provide transparent fast recovery within 0.017-0.16 seconds, 21-53 times faster than the whole program restart approach for all but one case (CVS). In contrast, the two tested alternatives, a whole program restart approach and a simple rollback and re-execution without environmental changes, cannot successfully recover the three servers (Squid, Apache, and CVS) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a non-deterministic concurrency bug. Additionally, Rx’s checkpointing system is lightweight, imposing small time and space overheads.

Categories and Subject Descriptors

D.4.5 [Operating Systems]: Reliability

General Terms

Design, Experimentation, Reliability

Keywords

Availability, Bug, Reliability, Software Failure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SOSP'05, October 23–26, 2005, Brighton, United Kingdom. Copyright 2005 ACM 1595930795/05/0010 ...$5.00.

© ACM, 2005. This is a minor revision of the work published in SOSP 2005, http://doi.acm.org/10.1145/1095809.1095833

1. INTRODUCTION

1.1 Motivation

Many applications, especially critical ones such as process control or on-line transaction monitoring, require high availability [27]. For server applications, downtime leads to lost productivity and lost business. According to a report by Gartner Group [48], the average cost of an hour of downtime for a financial company exceeds six million US dollars. With the tremendous growth of e-commerce, almost every kind of organization is becoming dependent upon highly available systems.

Unfortunately, software failures severely reduce system availability. A recent study showed that software defects account for up to 40% of system failures [37]. Memory-related bugs and concurrency bugs are common software defects, causing more than 60% of system vulnerabilities [17]. For this reason, software companies invest enormous effort and resources in software testing prior to releasing software. However, software failures still occur during production runs since some bugs inevitably slip through even the strictest testing. Therefore, to achieve higher system availability, mechanisms must be devised to allow systems to survive the effects of uneliminated software bugs to the largest extent possible.

Previous work on surviving software failures can be classified into four categories. The first category encompasses various flavors of rebooting (restarting) techniques, including whole program restart [27, 54], micro-rebooting of partial system components [13, 14], and software rejuvenation [30, 26, 8]. Since many of these techniques were originally designed to handle hardware failures, most of them are ill-suited for surviving software failures. For example, they cannot deal with deterministic software bugs, a major cause of software failures [18], because these bugs will still occur even after rebooting. Another important limitation of these methods is service unavailability while restarting, which can take up to several seconds [57]. Servers that buffer a significant amount of state in main memory (e.g. data buffer caches) require a long period to warm up to full service capacity [11, 58]. Micro-rebooting [14] addresses this problem to some extent by rebooting only the failed components. However, it requires legacy software to be reconstructed in a loosely-coupled fashion.

The second category includes general checkpointing and recovery. The most straightforward method in this category is to checkpoint, roll back upon failures, and then re-execute either on the same machine [24, 43] or on a different machine designated as the “backup server” [27, 6, 11, 12, 58, 3, 61]. Similar to the whole program restart approach, these techniques were also proposed to deal with hardware failures, and thereby suffer from the same limitations in addressing software failures. Progressive retry [59] is an interesting improvement over these approaches. It reorders messages to increase the degree of non-determinism. While this work proposes a promising direction, it limits the technique to message reordering. As a result, it cannot handle bugs unrelated to message order. For example, if a server receives a malicious request that exploits a buffer overflow bug, simply reordering messages will not solve the problem. The most aggressive approaches in the checkpointing/recovery category include recovery blocks [42] and n-version programming [5, 4, 45], both of which rely on different implementation versions upon failures. These approaches may be able to survive deterministic bugs under the assumption that different versions fail independently. But they are rarely adopted by software companies due to prohibitive costs. An alternative to n-version programming is data diversity, which tries to execute multiple copies of the same program, each with a different form of the input [2]. While proposing an inspiring idea, this work focuses on the theoretical framework instead of the practical aspects. In particular, it does not answer how to apply the idea transparently without modifying the application and without causing major performance degradation during normal execution.

The third category comprises application-specific recovery mechanisms, such as the multi-process model, exception handling, etc. Some multi-processed applications, such as the old version of the Apache HTTP Server and the CVS server, spawn a new process for each client connection and therefore can simply kill a failed process and start a new one to handle a failed request. While capable of surviving certain software failures, this technique has several limitations. First, for deterministic bugs the new process will most likely fail again at the same place given the same request (e.g. a malicious request). Second, if a shared data structure is corrupted, simply killing the failed process and starting a new one will not restore the shared data to a consistent state, therefore potentially causing subsequent process failures. Other application-specific recovery mechanisms require software to be failure-aware, which adversely affects programming difficulty and code readability.

The fourth category includes several recent non-conventional proposals such as failure-oblivious computing [44] and the reactive immune system [49]. Failure-oblivious computing deals with buffer overflows by providing artificial values for out-of-bound reads, while the reactive immune system returns a speculative error code for functions that suffer software failures (e.g. crashes). While these approaches are fascinating and may work for certain types of applications or certain types of bugs, they are unsafe to use for correctness-critical applications (e.g. on-line banking systems) because they “speculate” on programmers’ intentions, which can lead to program misbehavior. The problem becomes even more severe and harder to detect if the speculative “fix” introduces a silent error that does not manifest itself immediately. In addition, such problems, if they occur, are very hard to diagnose since the application’s execution has been forcefully perturbed by those speculative “fixes”.

Besides the above individual limitations, existing work provides insufficient feedback to developers for debugging. For example, the information provided to developers may include only a core dump, several recent checkpoints, and an event log for deterministic replay of a few seconds of recent execution. To save programmers’ debugging effort, it is desirable for the run-time system to provide information regarding the bug type, under what conditions the bug is triggered, and how it can be avoided. Such diagnostic information can guide programmers during their debugging process and thereby enhance their efficiency.

1.2 Our Contributions

In this paper, we propose a safe (not speculatively “fixing” the bug) technique, called Rx, to quickly recover from many types of software failures caused by common software defects, both deterministic and non-deterministic. It requires few to no changes to applications’ source code, and provides diagnostic information for postmortem bug analysis. Our idea is to roll back the program to a recent checkpoint when a bug is detected, dynamically change the execution environment based on the failure symptoms, and then re-execute the buggy code region in the new environment. If the re-execution successfully passes through the problematic period, the new environmental changes are disabled to avoid imposing time and space overheads.

Our idea is inspired by real life. When a person suffers from an allergy, the most common treatment is to remove the allergens from their living environment. For example, if patients are allergic to milk, they should remove dairy products from their diet. If patients are allergic to pollen, they may install air filters to remove pollen from the air. Additionally, when removing a candidate allergen from the environment successfully treats the symptoms, it allows diagnosis of the cause of the symptoms. Obviously, such treatment cannot and should not start before patients show allergic symptoms, since changing the living environment requires special effort and may also be unhealthy.

In software, many bugs resemble allergies. That is, their manifestation can be avoided by changing the execution environment. According to a previous study by Chandra and Chen [18], around 56% of faults in Apache depend on the execution environment.¹ Therefore, by removing the “allergen” from the execution environment, it is possible to avoid such bugs. For example, a memory corruption bug may disappear if the memory allocator delays the recycling of recently freed buffers or allocates buffers non-consecutively in isolated locations. A buffer overrun may not manifest itself if the memory allocator pads the ends of every buffer with extra space. Uninitialized reads may be avoided if every newly allocated buffer is filled with zeros. Data races can be avoided by changing timing-related events such as thread scheduling, asynchronous events, etc. Bugs that are exploited by malicious users can be avoided by dropping such requests during program re-execution. Even though dropping requests may make a few users (hopefully the malicious ones) unhappy, it does not introduce incorrect behavior into program execution as the failure-oblivious approaches do. Furthermore, given a spectrum of possible environmental changes, the least intrusive changes can be tried first, reserving the more extreme ones as a last resort for when all other changes have failed. Finally, the specific environmental change which cures the problem gives diagnostic information as to what the bug is.

¹ Note that our definition of execution environment is different from theirs. In our work, standard library calls, such as malloc, and system calls are also part of the execution environment.

As with an allergy, it is difficult and expensive to apply these execution environmental changes from the very beginning of the program execution because we do not know what bugs might occur later. For example, zero-filling newly allocated buffers imposes time overhead. Therefore, we should lazily apply environmental changes only when needed.

We have implemented Rx on Linux and evaluated it with four server applications that contain four real bugs (bugs introduced by the original programmers) and two injected bugs (bugs injected by us) of various types including buffer overflow, double free, stack overflow, data race, uninitialized read and dangling pointer bugs. Compared with previous solutions, Rx has the following unique advantages:

• Comprehensive: Rx can survive many common software defects. Besides non-deterministic bugs, Rx can also survive deterministic bugs. Our experiments show that Rx can successfully survive the six bugs listed above. In contrast, the two tested alternatives, a whole program restart approach and a simple rollback and re-execution without environmental changes, cannot recover the three servers (Squid, Apache, and CVS) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a non-deterministic concurrency bug. Such results indicate that applying environmental changes during re-execution is the key reason for Rx’s successful recovery of all tested cases.

• Safe: Rx does not speculatively “fix” bugs at run time. Instead, it prevents bugs from manifesting themselves by changing only the program’s execution environment. Therefore, it does not introduce uncertainty or misbehavior into a program’s execution, which is usually very difficult for programmers to diagnose.

• Noninvasive: Rx requires few to no modifications to applications’ source code. Therefore, it can be easily applied to legacy software. In our experiments, Rx successfully avoids software defects in the four tested server applications without modifying any of them.

• Efficient: Because Rx requires no rebooting or warm-up, it significantly reduces system down time and provides reasonably good performance during recovery. In our experiments, Rx recovers from software failures within 0.017-0.16 seconds, 21-53 times faster than the whole program restart approach for all but one case (CVS). Such efficiency enables servers to provide non-stop services despite software failures caused by common software defects. Additionally, Rx imposes little overhead on server throughput and average response time, and has small space overhead.

• Informative: Rx does not hide software bugs. Instead, bugs are still exposed. Furthermore, besides the usual bug report package (including core dumps, checkpoints and event logs), Rx provides programmers with additional diagnostic information for postmortem analysis, including what conditions triggered the bug and which environmental changes can or cannot avoid the bug. Based on such information, programmers can more efficiently find the root cause of the bug. For example, if Rx successfully avoids a bug by padding newly allocated buffers, the bug is likely to be a buffer overflow. Similarly, if Rx avoids a bug by delaying the recycling of freed buffers, the bug is likely to be caused by a double free or dangling pointers.

2. MAIN IDEA OF Rx

The main idea of Rx is, upon a software failure, to roll back the program to a recent checkpoint and re-execute it in a new environment that has been modified based on the failure symptoms. If the bug’s “allergen” is removed from the new environment, the bug will not occur during re-execution, and thus the program will survive this software failure. After the re-execution safely passes through the problematic code region, the environmental changes are disabled to reduce the time and space overhead they impose.

Figure 1 shows the process by which Rx survives software failures. Rx periodically takes light-weight checkpoints that are specially designed to survive software failures instead of hardware failures or OS crashes (see Section 3.2). When a bug is detected, either by an exception or by integrated dynamic software bug detection tools called Rx sensors, the program is rolled back to a recent checkpoint. Rx then analyzes the occurring failure based on the failure symptoms and “experiences” accumulated from previous failures, and determines how to apply environmental changes to avoid this failure. Finally, the program re-executes from the checkpoint in the modified environment. This process repeats, re-executing from different checkpoints and applying different environmental changes, until either the failure does not recur or Rx times out and resorts to alternate solutions, such as whole-program rebooting [27, 54]. If the failure does not occur during re-execution, the environmental changes are disabled to avoid the overhead associated with these changes.

[Figure 1: Rx: The main idea. Labels shown: Software Failure, Rollback, Change Environment, Reexecute, Succeed, Fail, Time Out, checkpoint, App, Env, Env', Other Approaches (e.g. whole program restart).]

In our idea, the execution environment can include almost everything that is external to the target application but can affect the execution of the target application. At the lowest level, it includes the hardware such as processor architectures, devices, etc. At the middle level, it includes the OS kernel such as scheduling, virtual memory management, device drivers, file systems, network protocols, etc. At the highest level, it includes standard libraries, third-party libraries, etc. This definition of the execution environment is much broader than the one used in previous work [18].

Obviously, the execution environment cannot be arbitrarily modified for re-execution. A useful re-execution environmental change should satisfy two properties. First, it should be correctness-preserving, i.e., every step (e.g., instruction, library call and system call) of the program is executed according to the APIs. For example, in the malloc() library call, we have the flexibility to decide where buffers should be allocated, but we cannot allocate a smaller buffer than requested. Second, a useful environmental change should be able to potentially avoid some software bugs. For example, padding every allocated buffer can prevent some buffer overflow bugs from manifesting during re-execution.

Category            Environmental Changes                         Potentially-Avoided Bugs              Deterministic?
Memory Management   delayed recycling of freed buffer             double free, dangling pointer         YES
Memory Management   padding allocated memory blocks               dynamic buffer overflow               YES
Memory Management   allocating memory in an alternate location    memory corruption                     YES
Memory Management   zero-filling newly allocated memory buffers   uninitialized read                    YES
Asynchronous        scheduling                                    data race                             NO
Asynchronous        signal delivery                               data race                             NO
Asynchronous        message reordering                            data race                             NO
User-Related        dropping user requests                        bugs related to the dropped request   Depends

Table 1: Possible environmental changes and their potentially-avoided bugs
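
The rollback, change-environment, and re-execute cycle of Section 2 can be sketched as a small loop. This is only an illustrative sketch, not the Rx implementation: the function `recover`, its parameters, and the use of an attempt budget (`max_attempts`) in place of Rx's time-out are assumptions made for the example, and `reexecute` is a hypothetical callback standing in for "roll back to this checkpoint, apply this change, and report whether the failure recurred".

```python
from typing import Callable, Optional, Sequence, Tuple


def recover(
    checkpoints: Sequence[str],            # hypothetical ids, most recent first
    changes: Sequence[str],                # hypothetical names, least intrusive first
    reexecute: Callable[[str, str], bool], # True if re-execution passed
    max_attempts: int,                     # assumed stand-in for Rx's time-out
) -> Optional[Tuple[str, str]]:
    """Return the (checkpoint, change) pair that avoided the failure.

    Returns None when the budget is exhausted, in which case the caller
    would fall back to another approach such as a whole program restart.
    """
    attempts = 0
    for checkpoint in checkpoints:   # roll back further only after all changes fail
        for change in changes:       # try each environmental change in turn
            if attempts >= max_attempts:
                return None
            attempts += 1
            if reexecute(checkpoint, change):
                return checkpoint, change
    return None
```

The returned pair doubles as diagnostic information: the change that worked hints at the bug type, as Table 1 suggests.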

Examples of useful execution environmental changes include, but are not limited to, the following categories:

(1) Memory management based: Many software bugs are memory related, such as buffer overflows, dangling pointers, etc. These bugs may not manifest themselves if memory management is performed slightly differently. For example, each buffer allocated during re-execution can have padding added to both ends to prevent some buffer overflows. Delaying the recycling of freed buffers can reduce the probability that a dangling pointer causes memory corruption. In addition, buffers allocated during re-execution can be placed in isolated locations far away from existing memory buffers to avoid some memory corruption. Furthermore, zero-filling new buffers can avoid some uninitialized read bugs. Since none of the above changes violate memory allocation or deallocation interface specifications, they are safe to apply. Also note that these environmental changes affect only those memory allocations/deallocations made during re-execution.

(2) Timing based: Most non-deterministic software bugs, such as data races, are related to the timing of asynchronous events. These bugs will likely disappear under different timing conditions. Therefore, Rx can forcefully change the timing of these events to avoid these bugs during re-execution. For example, increasing the length of a scheduling time slot can avoid context switches during buggy critical sections. This is very useful for those concurrency bugs that have a high probability of occurrence. For example, the data race bug in our tested MySQL server has a 40% occurrence rate on a uniprocessor machine.

(3) User request based: Since it is infeasible to test every possible user request before releasing software, many bugs occur due to unexpected user requests. For example, malicious users issue malformed requests to exploit buffer overflow bugs during stack smashing attacks [22]. These bugs can be avoided by dropping some users’ requests during re-execution. Of course, since the user may not be malicious, this method should be used as a last resort after all other environmental changes fail.

Table 1 lists some environmental changes and the types of bugs that can potentially be avoided by them. Although there are many such changes, due to space limitations we list only a few examples for demonstration.

If the failure disappears during a re-execution attempt, the failure symptoms and the effects of the environmental changes applied are recorded. This speeds up the process of dealing with future failures that have similar symptoms and code locations. Additionally, Rx provides all such diagnostic information to programmers together with core dumps and other basic postmortem bug analysis information. For example, if Rx reports that buffer padding does not avoid the occurring bug but zero-filling newly allocated buffers does, the programmer knows that the software failure is more likely to be caused by an uninitialized read than by a buffer overflow.

After a re-execution attempt successfully passes the problematic program region for a threshold amount of time, all environmental changes applied during the successful re-execution are disabled to reduce space and time overheads. These changes are no longer necessary since the program has safely passed the “allergic seasons”.

If the failure still occurs during a re-execution attempt, Rx will roll back and re-execute the program again, either with a different environmental change or from an older checkpoint. For example, if one change (e.g. padding buffers) cannot avoid the bug during the re-execution, Rx will roll back the program again and try another change (e.g. zero-filling new buffers) during the next re-execution. If none of the environmental changes work, Rx will roll back further and repeat the same process. If the failure still remains after a threshold number of rollback/re-execute iterations, Rx will resort to previous solutions, such as whole program restart [27, 54], or micro-rebooting [13, 14] if supported by the application.

Upon a failure, Rx follows several rules to determine the order in which environmental changes should be applied during the recovery process. First, if a similar failure has been successfully avoided by Rx before, the environmental change that worked previously will be tried first. If this does not work, or if no information from previous failures exists, environmental changes with small overheads (e.g. padding buffers) are tried before those with large overheads (e.g. zero-filling new buffers). Changes with negative side effects (e.g. dropping requests) are tried last. Changes that do not conflict, such as padding buffers and changing event timing, can be applied simultaneously.

Although the situation never arose during our experiments, there is still the rare possibility that a bug occurs during re-execution but is not detected in time by Rx’s sensors. In this case, Rx will claim a successful recovery when there was none. Addressing this problem requires using more rigorous on-the-fly software defect checkers as sensors. This is currently a hot research area that has attracted much attention. In addition, it is important to note that, unlike in failure-oblivious computing, this problem is caused by the application’s bug instead of Rx’s environmental changes. The environmental changes just make the bug manifest itself in a different way. Furthermore, since Rx logs its every action, including what environmental changes are applied and what the results are, programmers can use this information (e.g. some environmental changes make the software crash much later) to analyze the occurring bug.

3. Rx DESIGN

Rx is composed of a set of user-level and kernel-level components that monitor and control the execution environment. The five primary components are shown in Figure 2: (1) sensors for detecting and identifying software failures or software defects at run time, (2) a Checkpoint-and-Rollback (CR) component for taking checkpoints of the target server application and rolling back the application to a previous checkpoint upon failure, (3) environment wrappers for changing execution environments during re-execution, (4) a proxy for making the server recovery process transparent to clients, and (5) a control unit for maintaining checkpoints during normal execution, and devising a recovery strategy once software failures are reported by sensors.

[Figure 2: Rx architecture. Labels shown: Server Application, Clients, Proxy, Sensors, Environment Wrapper, Checkpoint & Rollback, Control Unit, Rx System, report errors.]

3.1 Sensors

Sensors detect software failures by dynamically monitoring applications’ execution. There are two types of sensors. The first type detects software errors such as assertion failures, access violations, divide-by-zero exceptions, etc. This type of sensor can be implemented by taking over OS-raised exceptions. The second type of sensor detects software bugs such as buffer overflows, accesses to freed memory, etc., before they cause the program to crash. This type of sensor leverages existing low-overhead dynamic bug detection tools, such as CCured [21], StackGuard [22], and our previous work SafeMem [41], just to name a few. In our Rx prototype, we have implemented only the first type of sensor. However, we plan to integrate the second type of sensor into Rx.

Sensors notify the control unit upon software failures with information to help identify the occurring bug for recovery and also for postmortem bug diagnosis. Such information includes the type of exception (Segmentation fault, Floating Point Exception, Bus Error, Abort, etc.), the address of the offending instruction, the stack signature, etc.

3.2 Checkpoint and Rollback

3.2.1 Mechanism

The CR (Checkpoint-and-Rollback) component takes checkpoints of the target server application, and automatically and transparently rolls back the application to a previous checkpoint upon a software failure. At a checkpoint, CR stores a snapshot of the application in main memory. Similar to the fork operation, CR copies application memory in a copy-on-write fashion to minimize overhead. By preserving checkpoint states in memory, the overhead associated with slow disk accesses in most previous checkpointing solutions is avoided. This method is also used in previous work [20, 31, 33, 60, 36, 40, 50]. Performing a rollback operation is straightforward: simply reinstate the program from the snapshot associated with the specified checkpoint.

Besides memory states, the CR also needs to take care of other system states, such as file states, during checkpointing and rollback to ensure correct re-execution. To handle file states, CR applies ideas similar to previous work [36, 50], keeping a copy of each accessed file and its file pointers at the beginning of a checkpoint interval and reinstating them for rollback. To simplify implementation, we can leverage a versioning file system which automatically takes a file version upon modifications. Similarly, copy-on-write is used to reduce space and time overheads. For some log files whose old content users may not want overwritten during re-execution, Rx can easily provide a special interface that allows applications to indicate which files should not be rolled back. Other system states such as messages and signals will be described in the next subsection because they may need to be changed to avoid a software bug recurring during re-execution. More details about our lightweight checkpointing method can be found in our previous work [50], which uses checkpointing and logging to support deterministic replay for programmers’ interactive debugging.

In contrast to previous work on rollback and replay, Rx does not require deterministic replay. On the contrary, Rx purposely introduces nondeterminism into the server’s re-execution to avoid the bug that occurred during the first execution. Therefore, the underlying implementation of Rx can be simplified because it does not need to remember when an asynchronous event was delivered to the application in the first execution, how shared memory accesses from multiple threads were interleaved on a multi-processor machine, etc., as we have done in our previous work [50].

The CR also supports multiple checkpoints and rollback to any of these checkpoints in case Rx needs to roll back further than the most recent checkpoint in order to avoid the occurring software bug. After rolling back to a checkpoint $CP_i$, all checkpoints which were taken after $CP_i$ are deleted. This ensures that we do not roll back to a checkpoint which has been rendered obsolete by the rollback process. During a re-execution attempt, new checkpoints may be taken for future recovery needs in case this re-execution attempt successfully avoids the occurring software bug.

3.2.2 Checkpoint Maintenance

A possible concern is that maintaining multiple checkpoints could impose a significant space overhead. To address this problem, Rx can write old checkpoints to disk in the background when the disk is idle. But rolling back to a checkpoint that is already stored on disk is expensive due to slow disk accesses.

Fortunately, we do not need to keep too many checkpoints because Rx strives to bound its recovery time to be 2-competitive with the baseline solution: whole program restart. In other words, in the worst case, Rx may take twice as much time as the whole program restart solution (in reality, in most cases, as shown in Section 6, Rx recovers much faster than the whole program restart). Therefore, if a whole program restart would take T seconds (this number can be measured by restarting immediately at the first software failure and then used later), Rx can repeat the rollback/re-execute process for at most T seconds. As a result, Rx cannot roll back to a checkpoint which is too far back in the past, which implies that Rx does not need to keep such checkpoints any more.

More formally, suppose Rx takes checkpoints periodically, and let $\tau_1, \tau_2, \cdots, \tau_n$ be the timestamps of the last n checkpoints that have been kept, in reverse chronological order. We can use two schemes to keep those checkpoints: one is to keep only recent checkpoints, and the other is to keep exponential landmark checkpoints (with $\beta$ as the exponential factor) as in the Elephant file system [47]. In other words, the two schemes satisfy the following equations, respectively.

$$\tau_i - \tau_{i+1} = \tau_{i-1} - \tau_i \quad (2 \le i \le n-1)$$

$$\tau_i - \tau_{i+1} = \beta \cdot (\tau_{i-1} - \tau_i) \quad (2 \le i \le n-1)$$

Note that time here refers to application execution time as opposed to elapsed time. The latter can be significantly higher, especially when there are many idle periods.

After each checkpoint, Rx estimates whether it is still useful to keep the oldest checkpoint. If not, the oldest checkpoint, taken at time $\tau_n$, is deleted from the system to save space. The estimation is done by calculating the worst-case recovery time that requires rolling back to this oldest checkpoint. Suppose that after rolling back to a checkpoint, the i-th re-execution ($1 \le i \le m$), each with different environmental changes, incurs the overhead $p_i$. Obviously, some environmental changes such as buffer padding impose little time overhead, whereas other changes such as zero-filling buffers incur large overhead. The $p_i$ values can be measured at run time. Therefore the worst-case recovery time, RTime, that requires rolling back to the oldest checkpoint would be (let $\tau$ be the current timestamp):

$$RTime = \sum_{i=1}^{n} \sum_{j=1}^{m} (\tau - \tau_i)(1 + p_j) = \sum_{i=1}^{n} (\tau - \tau_i) \cdot \sum_{j=1}^{m} (1 + p_j)$$

If RTime is greater than T, the oldest checkpoint, taken at time $\tau_n$, is deleted.

3.3 Environment Wrappers

The environment wrappers perform environmental changes during re-execution to avert failures. Some of the wrappers, such as the memory wrappers, are implemented at user level by intercepting library calls. Others, such as the message wrappers, are implemented in the proxy. Still others, such as the scheduling wrappers, are implemented in the kernel.

Memory Wrapper. The memory wrapper is implemented by intercepting memory-related library calls such as malloc(), realloc(), calloc(), free(), etc. to provide environmental changes. During normal execution, the memory wrapper simply invokes the corresponding standard memory management library calls, which incurs little overhead. During re-execution, the memory wrapper activates the memory-related environmental changes instructed by the control unit. Note that the environmental changes affect only the memory allocations/deallocations made during re-execution.

Specifically, the memory wrapper supports four environmental [...] have been padded at both ends, allocated from an isolated location, or zero-filled.

Message Wrapper. Many concurrency bugs are related to message delivery, such as the message order across different connections, the size and number of network packets which comprise a message, etc. Therefore, changing these aspects of the execution environment during re-execution may be able to avoid an occurring concurrency bug. This is feasible because servers typically should not have any expectation regarding the order of messages from different connections (users) or the size and number of network packets that form a message, especially the latter two, which depend on the TCP/IP settings of both sides.

The message wrapper, which is implemented in the proxy (described in the next subsection), changes the message delivery environment in two ways: (1) It can randomly shuffle the order of the requests among different connections, while keeping the order of the requests within each connection in order to maintain any possible dependency among them. (2) It can deliver messages in random-sized packets. Such environmental changes do not impose overhead. Therefore, this message delivery policy can be used in the normal mode, but doing so does not decrease the probability that a concurrency bug occurs, because there is no way to predict under which conditions a concurrency bug does not occur.

Process Scheduling. Similarly, concurrency bugs are also related to process scheduling and are therefore prone to disappear if a different process schedule is used during re-execution. Rx does this by changing the process’s priority, and thus increasing the scheduling time quantum, so that a process is less likely to be switched out in the middle of some unprotected critical region.

Signal Delivery. Similar to process scheduling, the time when a signal is delivered may also affect the probability that a concurrency bug occurs. Therefore, Rx can record all signals in a kernel-level table before delivering them.
For hardware interrupts, changes: Rx delivers them at randomly selected times, but preserving their (1) Delaying free, which delays recycling of any buffers freed dur- order to maintain any possible ordering semantics. For software ing a re-execution attempt to avoid software bugs such as double timer signals, Rx ignores them because during rollback, the related free bugs and dangling pointer bugs. A freed buffer is reallocated software timer will also be restored. For software exception related only when there is no other free memory available or it has been de- signals such as segmentation faults, Rx’s sensors receive them as layed for a threshold of time (process execution time, not elapsed indications of software failures. time). Freed buffers are recycled in the order of the time when they Dropping User Requests Dropping user requests is a last envi- are freed. This memory allocation policy is not used in the normal ronmental change before switching to the whole program restart so- mode because it can increase paging activities. lution. As described earlier, the rational for doing this is that some (2) Padding buffers, which adds two fixed-size paddings to both software failures are triggered by some malformed requests, either ends of any memory buffers allocated during re-execution to avoid unintentionally or intentionally by malicious users. If Rx drops buffer overflow bugs corrupting useful data. This memory allo- that request, the server will not experience failure. In this case, the cation policy is only used in the recovery mode because it wastes server only denies those dropped requests, but does not affect other memory space. requests. 
The effectiveness of this environmental change is based (3) Allocation isolation, which places all memory buffers allocated on our assumption that the client and server use a request/response during re-execution in an isolated location to avoid corruption use- model, which is generally the case for large varieties of servers in- ful data due to severe buffer overflow or other general memory cor- cluding Web Servers, DNS Servers, database servers, etc. ruption bugs. Similar to padding, it is disabled in the normal mode Rx does not need to look for the exact culprit user request. As because it has space overhead. long as the dropped requests include this request, the server can (4) Zero-filling, which zero-fills any buffers allocated during re- avoid the software bug and continue providing services. Of course, execution to reduce the probability of failures caused by uninitial- the percentage of dropped requests should be small (e.g. 10%) to ized reads. Obviously, this environmental change needs to be dis- avoid malicious users exploiting it to launch denial of service at- abled in the normal mode since it imposes time overhead. tacks. Rx can achieve this by performing a binary search on all Since none of the above changes violate memory allocation or recently received requests. First, it can drop half of them to see deallocation interface specifications, they are safe to apply. At whether the bug still occurs during re-execution. If not, the prob- each memory allocation or free, the memory wrapper returns ex- lem request set becomes one half smaller. If the bug still occurs, it actly what the application may expect. For example, when an ap- rolls back to drop the other half. If it still does not work, Rx resorts plication asks for a memory buffer of size N, the memory wrapper to the whole program restart solution. 
Otherwise, the binary search returns a buffer with at least size N, even though this buffer may continues until the percentage of dropped requests becomes smaller Proxy Proxy ck1 ck1 req1 req1 req2 req2 ck2 ck2 forward req3 forward replay req3 Server req4 Client Server messages req4 messages Client ck3 req5 req5 rollback new requests point req6 req7 response response buffer buffer

than the specified number. If the percentage upper bound is set to 10%, it only takes 5 iterations of rollback and re-execution.

After Rx finds the small set of requests (less than the specified upper bound) that, once dropped, enable the server to survive the bug, Rx can remember each request's signatures such as the IP address, message size, message MD5 hash value, etc. In subsequent times when a similar bug recurs in the normal mode, Rx can record the signatures again. After several rounds, Rx accumulates enough sets of signatures so that it can use statistical methods to identify the characteristics of those bug-exposing requests. Afterward, if the same bug recurs, Rx can drop only those requests that match these characteristics to speed up the recovery process.

3.4 Proxy

The proxy helps a failed server re-execute and makes server-side failure and recovery oblivious to its clients. When a server fails and rolls back to a previous checkpoint, the proxy replays all the messages received from this checkpoint, along with the message-based environmental changes described in Section 3.3. The proxy runs as a stand-alone process in order to avoid being corrupted by the target server's software defects.

As Figure 3 shows, the Rx proxy can be in one of two modes: normal mode for the server's normal execution and recovery mode during the server's re-execution. For simplicity, the proxy forwards and replays client messages at the granularity of user requests. Therefore, the proxy needs to separate different requests within a stream of network messages. The proxy does this by plugging in some simple information about the application's communication protocol (e.g. HTTP) so it can parse the header to separate one request from another. In addition, the proxy also uses the protocol information to match a response to the corresponding request to avoid delivering a response to the user twice during re-execution. In our experiments, we have evaluated four server applications, and the proxy uses only 509 lines of code to handle 3 different protocols: HTTP, MySQL message protocol and CVS message protocol.

(a) Proxy behavior in normal mode (b) Proxy behavior in recovery mode

Figure 3: (a) In normal mode, the proxy forwards request/response messages between the server and the client, buffers requests, and marks the waiting-for-sending request for each checkpoint (e.g., req3 is marked by checkpoint 2). (b) After the server is rolled back from the rollback point, as shown by the dashed line to checkpoint 2, the proxy discards the mark of checkpoint 3, replays the necessary requests (req3, req4 and req5) to the server and buffers the incoming requests (req6 and req7). The “unanswered” responses are buffered in the response buffer.

As shown in Figure 3(a), in the normal mode, the proxy simply bridges between the server and its clients. It keeps track of network connections and buffers the request messages between the server and its clients in order to replay them during the server's re-execution. It forwards client messages at request granularity. In other words, the proxy does not forward a partial request to the server. At a checkpoint, the proxy marks the next wait-for-forwarding request in its request buffer. When the server needs to roll back to this checkpoint, the mark indicates the place from which the proxy should replay the requests to the server.

The proxy does not buffer any response in the normal mode except for partially received responses. This is because after a full response is received, the proxy sends it out to the corresponding client and marks the corresponding request as “answered”. Keeping these committed responses is useless because during re-execution the proxy cannot send out another response for the same request. Similarly, the proxy also strives to forward messages to clients at response granularity to reduce the possibility of sending a self-conflicting response during re-execution, which may occur when the first part of the response is generated by the server's normal execution and the second part of the response is generated by a re-execution that may take a different execution path.

However, if the response is too large to be buffered, a partial response is sent first to the corresponding client, but the MD5 hash for this partial response is calculated and stored with the request. If a software failure is encountered before the proxy receives the entire response from the server, the proxy needs to check the MD5 hash of the same partial response generated during re-execution. If it does not match the stored value, the proxy will drop the connection to the corresponding client to avoid sending a self-conflicting response. To handle the case where a checkpoint is taken in the middle of receiving a response from the server, the proxy also marks the exact position of the partially received response.

As shown in Figure 3(b), in the recovery mode, the proxy performs three functions to help server recovery. First, it replays to the server those requests received since the checkpoint to which the server is rolled back. Second, the proxy introduces message-related environmental changes as described in Section 3.3 to avoid some concurrency bugs. Third, the proxy buffers any incoming requests from clients without forwarding them to the server until the server successfully survives the software failure. Doing so makes the server's failure and recovery transparent to clients, especially since Rx has very fast recovery time as shown in Section 6. The proxy stays in the recovery mode until the server survives the software failure after one or multiple iterations of rollback and re-execution.

To deal with the output commit problem [53] (clients should perceive a consistent behavior of the server despite server failures), Rx first ensures that any previous responses sent to the client are not resent during re-execution. This is achieved by recording for each request whether it has been answered by the server or not. If so, a response made during re-execution is dropped silently. Otherwise, the response generated during re-execution will be temporarily buffered until any of three conditions is met: (1) the server successfully avoids the failure via rollback and re-execution in changed execution environments; (2) the buffer is full; or (3) this re-execution fails again. For the first two cases, the proxy sends the buffered responses to the corresponding clients and the corresponding requests are marked as “answered”. Thus, responses generated in subsequent re-execution attempts will be dropped to ensure that only one response for each request goes to the client. For the last case, the responses are thrown away.

For applications such as on-line shopping that require strict session consistency (i.e. later requests in the same session depend on previous responses), Rx can record the signatures (hash values) of all committed responses for each outstanding session, and perform MD5 hash-based consistency checks during re-execution. If a re-execution attempt generates a response that does not match a committed response for the same request in an outstanding session, this session can be aborted to avoid confusing users.

The proxy also supports multiple checkpoints. When an old checkpoint is discarded, the proxy discards the marks associated with this checkpoint. If this checkpoint is the oldest one, the proxy also discards all the requests received before the second oldest checkpoint since the server can never roll back to the oldest checkpoint any more.

The space overhead incurred by the proxy is small. It mainly consists of two parts: (1) space used to buffer requests received since the undeleted oldest checkpoint, (2) a fixed-size space used to buffer “unanswered” responses generated during re-execution in the recovery mode. The first part is small because requests are usually small, and the proxy can also discard the oldest checkpoint to save space as described in Section 3.2. The second part has a fixed size and can be specified by administrators.

3.5 Control Unit

The control unit coordinates all the other components in Rx. It performs three functions: (1) it directs the CR to checkpoint the server periodically and requests the CR to roll back the server upon failures; (2) it diagnoses an occurring failure based on the failure symptoms and its accumulated experiences, then decides what environmental changes should be applied and where the server should be rolled back to; (3) it provides programmers with useful failure-related information for postmortem bug analysis.

After several failures, the control unit gradually builds up a failure table to capture the recovery experience for future reference. More specifically, during each re-execution attempt, the control unit records the effects (success or failure) and the corresponding environmental changes into the table. The control unit assigns a score vector $\langle s_1, s_2, \cdots, s_m \rangle$ to each failure, where m is the number of possible environmental changes. Each element $s_i$ in the vector is the score for the corresponding environmental change $C_i$ for a certain failure. For a successful re-execution, the control unit adds one point to all the environmental changes that are applied during this re-execution. For a failed re-execution, the control unit subtracts one point from all the applied environmental changes. When a failure happens, the control unit searches the failure table based on failure symptoms, such as type of exceptions, instruction counters, call chains, etc., provided by the Rx sensors. If one table entry matches, it then applies those environmental changes whose scores are larger than a certain threshold $T_s$. Otherwise, it will follow the rules described in Section 2 to determine the order in which environmental changes should be applied during re-execution. This failure table can be provided to programmers for postmortem bug analysis. It is possible to borrow ideas from machine learning (e.g., a Bayesian classifier) or use some statistical methods as a more “advanced” technique to learn what environmental changes are the best cure for a certain type of failure. Such optimization remains as our future work.

4. DESIGN AND IMPLEMENTATION ISSUES

Inter-Server Communication. In many real systems, servers are tiered hierarchically to provide service. For example, a web server is usually linked to an application server, which is then linked to a backend database server. In this case, rolling back one failed server may not be enough to survive a failure because the failure may be caused by its front-end or back-end servers. To address this problem, Rx should be used for all servers in this hierarchy so that it is possible to roll back a subset or all of them in order to survive a failure. We can borrow many ideas, such as coordinated checkpointing, asynchronous recovery, etc., from previous work on supporting fault tolerance in distributed systems [15, 16, 24, 45, 1], and also from recent work such as micro- [14]. More specifically, during normal execution, the Rx instances in the tiered servers take checkpoints in a coordinated fashion. Once a failure is detected, Rx rolls back the failed server and also broadcasts its rollback to other correlated servers, which then roll back correspondingly to recover the whole system. Currently, we have not implemented such support in Rx and it remains a topic for future study.

Multi-threaded Process Checkpointing. Taking a checkpoint on a multi-threaded process is particularly challenging because, when Rx needs to take a checkpoint, some threads may be executing system calls or could be blocked inside the kernel waiting for asynchronous events. Capturing the transient state of such threads could easily lead to state inconsistency upon rollback. For example, there can be some kernel locks which have been acquired during the checkpoint, and rolling back to such a state may cause two processes to hold the same kernel locks. Therefore, it is essential that we force all the threads to stay at the user level before checkpointing. We implement this by sending a signal to all threads, which makes them exit from blocked system calls or waiting events with an EINTR return code. After the checkpoint, the library wrapper in Rx retries the prematurely returned system calls and thus hides the checkpointing process from the target application. This has a bearing on the checkpointing frequency, as a high checkpointing frequency will severely impair the performance of normal I/O system calls, which are likely to be retried multiple times (once at every checkpoint) before long I/Os finish. Therefore, we cannot set the checkpointing interval too small.

Unavoidable Bug/Failure for Rx. Even though Rx should be able to help servers recover from most software failures caused by common software bugs such as memory corruptions and concurrency bugs, there are still some types of bugs that Rx cannot help the server to avoid via re-execution in changed execution environments. Resource leakage bugs, such as memory leaks, which have cumulative effects on the system and may take hours or days to cause the system to crash, cannot be avoided by only rolling the server back to a recent checkpoint. Therefore, for resource leaks, Rx resorts to the whole program restart approach because a restart can refresh the server with plenty of resources. For some of the semantic bugs, Rx may not be effective in avoiding them since they may not be related to execution environments. Finally, Rx is not able to avoid the bugs or failures that sensors cannot detect. Solving this problem would require more rigorous dynamic checkers as sensors.

5. EVALUATION METHODOLOGY

The experiments described in this section were conducted on two machines with a 2.4GHz Pentium processor, 512KB L2 cache, 1GB of memory, and a 100Mbps Ethernet connection between them. We run servers on one machine and clients on the other. The operating system we modified is the Linux kernel 2.6.10. The Rx proxy is currently implemented at user level for easy debugging. In the future, we plan to move it to the kernel level to improve performance.

We evaluate four different real-world server applications as shown in Table 2, including a web server (Apache httpd), a web cache and proxy server (Squid), a database server (MySQL), and a concurrent version control server (CVS). The servers contain various types of bugs, including buffer overflow, data race, double free, dangling pointer, uninitialized read, and stack overflow bugs. Four of them were introduced by the original programmers. We have not yet located server applications which contain uninitialized read or dangling pointer bugs. To evaluate Rx's functionality of handling these two types of bugs, we inject them into Squid separately, renaming the two Squids as Squid-ui (containing an uninitialized read bug) and Squid-dp (containing a dangling pointer bug), respectively.

| App      | Ver     | Bug                | #LOC | App Description          |
|----------|---------|--------------------|------|--------------------------|
| MySQL    | 4.1.1.a | data race          | 588K | a database server        |
| Squid    | 2.3.s5  | buffer overflow    | 93K  | a Web proxy cache server |
| Squid-ui | 2.3.s5  | uninitialized read |      |                          |
| Squid-dp | 2.3.s5  | dangling pointer   |      |                          |
| Apache   | 2.0.47  | stack overflow     | 283K | a Web server             |
| CVS      | 1.11.4  | double free        | 114K | a version control server |

Table 2: Applications and Bugs (App means Application. Ver means Version. LOC means lines of code).

In this paper, we design four sets of experiments to evaluate the key aspects of Rx:

• The first set evaluates the functionality of Rx in surviving software failures caused by common software defects by rollback and re-execution with environmental changes. We compare Rx with whole program restart in terms of client experiences during failure, and in terms of recovery time. In addition, we also compare Rx with simple rollback and re-execution with no environmental changes. This approach is implemented by disabling environmental changes in Rx.

• The second set evaluates the performance overhead of Rx for both server throughput and average response time without bug occurrence. Additionally, we evaluate the space overhead caused by checkpoints and the proxy.

• The third set evaluates how Rx would behave under a certain degree of malicious attacks that continuously send bug-exposing requests triggering buffer overflow or other software defects. We measure the throughput and average response time under different bug arrival rates. In this set of experiments, we also compare Rx with the whole program restart approach in terms of performance.

• The fourth set evaluates the benefits of Rx's mechanism of learning from previous failure experiences, which are stored in the failure table to speed up recovery.

For all the servers, we implement clients in a similar manner as previous work, such as httperf [38] or WebStone [56], sending continuous requests over concurrent connections. For Squid and Apache, the clients spawn 5 threads. Each thread sends out requests to fetch different files whose sizes range over 1KB, 2KB, ..., 512KB with uniform distribution. For CVS, the client exports a 30KB source file. For MySQL, we use two loads. To trigger the data race, the client spawns 5 threads, each of them sending out begin, select, and commit requests on a small table repeatedly. The size of individual requests must be as small as possible to maximize the probability of the race occurring. For the overhead experiments with MySQL, a more realistic load with updates is used. To demonstrate that Rx can avoid server failures, we use another client that sends bug-exposing requests to those servers.

6. EXPERIMENTAL RESULTS

6.1 Overall Results

Table 3 demonstrates the overall effectiveness of Rx in avoiding bugs. For each buggy application, the table shows the type of bug, what symptom was used to detect the bug, and what environmental change was eventually used to avoid the bug. The table also compares Rx to two alternative approaches: the ordinary whole program restart solution and a simple rollback and re-execution without environmental changes. For Rx, the checkpoint intervals in most cases are 200ms except for MySQL and CVS. For MySQL, we use a checkpoint interval of 750ms because too frequent checkpointing causes its data race bug to disappear in the normal mode. The reason for using 50ms as the checkpoint interval for CVS will be explained later when we discuss the recovery time. The average recovery time is the recovery time averaged across multiple bug occurrences in the same execution. Section 6.4 will discuss the difference in Rx recovery time between the first bug occurrence and subsequent bug occurrences.

As shown in Table 3, Rx can successfully avoid various types of common software defects, including 5 deterministic memory bugs and 1 concurrency bug. These bugs are avoided during re-execution because of Rx's environmental changes. For example, by padding buffers allocated during re-execution, Rx can successfully avoid the buffer overflow bug in Squid. Apache survives the stack overflow bug because Rx drops the bug-exposing user request during re-execution. Squid-ui survives the uninitialized read bug because Rx zero-fills all buffers allocated during re-execution. These results indicate that Rx is a viable solution to increase the availability of server applications.

In contrast, the two alternatives, restart and simple rollback/re-execution, cannot successfully recover the three servers (Squid, Apache and CVS) that contain deterministic bugs. For the restart approach, this is because the client notices a disconnection and tries to resend the same bug-exposing request, which causes the server to crash again. For the simple rollback and re-execution approach, once the server rolls back to a previous checkpoint and starts re-execution, the same deterministic bug will occur again, causing the server to crash immediately. These two alternatives have a 40% recovery rate for MySQL, which contains a non-deterministic concurrency bug, because in 60% of cases the same bug-exposing interleaving is used again after restart or rollback. Such results show that these two alternative approaches, even though simple, cannot survive failures caused by many common software defects and thus cannot provide continuous services. The results also indicate that applying environmental changes is the key reason why Rx can survive software failures caused by common software defects, especially deterministic bugs.

Because Rx's proxy hides the server failure and recovery process from its clients, clients do not experience any failures. In contrast, with restart, clients experience failures due to broken network connections. To be fault tolerant, clients need to reconnect to the server and reissue all unreplied requests. With the simple rollback and re-execution, since the server cannot recover from the failure, clients eventually time out and thus experience server failures.
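The bookkeeping that lets the proxy hide recovery from clients (Section 3.4) can be sketched as follows. This is a minimal illustrative model, not the Rx proxy's actual code: the class name `ResponseGate`, its method names, and the choice to count buffer capacity in responses rather than bytes are all assumptions made for the example. It only models the rule that each request receives at most one response, and the three conditions under which responses buffered during re-execution are released or discarded.

```python
class ResponseGate:
    """Illustrative model of the proxy's per-request response bookkeeping."""

    def __init__(self, buffer_capacity):
        # Assumption: capacity is counted in responses, not bytes.
        self.buffer_capacity = buffer_capacity
        self.answered = set()   # ids of requests already answered to the client
        self.pending = []       # (request_id, response) held during re-execution
        self.sent = []          # (request_id, response) delivered to clients

    def _deliver(self, request_id, response):
        # Send at most one response per request, then mark it "answered".
        if request_id not in self.answered:
            self.sent.append((request_id, response))
            self.answered.add(request_id)

    def on_normal_response(self, request_id, response):
        # Normal mode: a full response is forwarded right away.
        self._deliver(request_id, response)

    def on_reexecution_response(self, request_id, response):
        if request_id in self.answered:
            return  # already committed earlier: drop silently
        self.pending.append((request_id, response))
        if len(self.pending) >= self.buffer_capacity:
            self.flush()  # condition (2): the buffer is full

    def on_recovery_succeeded(self):
        self.flush()  # condition (1): the server survived the failure

    def on_reexecution_failed(self):
        self.pending.clear()  # condition (3): discard buffered responses

    def flush(self):
        for request_id, response in self.pending:
            self._deliver(request_id, response)
        self.pending.clear()
```

In this model, a response produced by a failed re-execution attempt never reaches the client, so the client only ever sees the response from the attempt that finally succeeds (or from the point where the buffer filled up).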
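| Apps     | Bugs             | Failure Symptoms | Environmental Changes | Clients Experience Failure? (Alternatives) | Clients Experience Failure? (Rx) | Recoverable? (Alternatives) | Recoverable? (Rx) | Average Recovery Time (s) (Restart) | Average Recovery Time (s) (Rx) |
|----------|------------------|------------------|-----------------------|--------------------------------------------|----------------------------------|-----------------------------|-------------------|-------------------------------------|--------------------------------|
| Squid    | Buffer Overflow  | SEGV             | Padding               | Yes                                        | No                               | No                          | Yes               | 5.113                               | 0.095                          |
| MySQL    | Data Race        | SEGV             | Schedule Change       | Yes                                        | No                               | 40% probability             | Yes*              | 3.500                               | 0.161                          |
| Apache   | Stack Overflow   | Assert           | Drop User Request     | Yes                                        | No                               | No                          | Yes               | 1.115                               | 0.026                          |
| CVS      | Double Free      | SEGV             | Delay Free            | Yes                                        | No                               | No                          | Yes               | 0.010                               | 0.017                          |
| Squid-ui | Uninit Read      | SEGV             | Zero All              | Yes                                        | No                               | No                          | Yes               | 5.000                               | 0.126                          |
| Squid-dp | Dangling Pointer | SEGV             | Delay Free            | Yes                                        | No                               | No                          | Yes               | 5.006                               | 0.113                          |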

Table 3: Overall results: comparison of Rx and alternative approaches (whole program restart, and simple rollback and re-execution without environmental changes). The results are obtained by running each experiment 20 times. The recovery time for the restart approach is measured by having the client not resend the bug-exposing request after reconnection. Otherwise, the server will crash again immediately after restart. *For MySQL, during the 20 runs, the data race bug never occurred during re-execution in Rx after applying various timing-related environmental changes.
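As a quick worked check on the recovery-time columns of Table 3, the ratio of restart time to Rx time can be computed per application. The sketch below is illustrative only: the numbers are copied from Table 3, while the names `RECOVERY_TIMES`, `speedup`, and `SPEEDUPS` are hypothetical helpers introduced for this example.

```python
# Average recovery times in seconds, copied from Table 3: (restart, Rx).
RECOVERY_TIMES = {
    "Squid": (5.113, 0.095),
    "MySQL": (3.500, 0.161),
    "Apache": (1.115, 0.026),
    "CVS": (0.010, 0.017),
    "Squid-ui": (5.000, 0.126),
    "Squid-dp": (5.006, 0.113),
}


def speedup(restart_s, rx_s):
    """Return how many times faster Rx recovers than whole program restart."""
    return restart_s / rx_s


# Ratio above 1 means Rx recovers faster; below 1 means restart is faster.
SPEEDUPS = {app: speedup(*times) for app, times in RECOVERY_TIMES.items()}

if __name__ == "__main__":
    for app, ratio in SPEEDUPS.items():
        print(f"{app:9s} {ratio:6.2f}x")
```

Under these numbers, every application except CVS has a ratio well above 1, with MySQL the smallest and Squid the largest; CVS is the one case where the ratio falls below 1, meaning restart is faster there.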

[Figure 4 plots: (a) Throughput (Mbps) and (b) Average Response Time (sec) versus Elapsed Time (sec), each shown for Squid-Baseline, Squid-Restart, and Squid-Rx.]

Figure 4: Throughput and average response time of Squid with Rx and Restart for one bug occurrence. (In the time period (7, 11.5), there are no measurements for restart because no requests are answered during this period.)

Table 3 also shows that Rx provides a significantly better (21-53 times faster) recovery time than restart except for CVS. This is because rollback is a lightweight and fine-grained action due to the in-memory checkpoints. Also, as we find that most faults are detected promptly (usually by crashing), we rarely need to roll back further than the most recent checkpoint. This minimizes the amount of re-execution necessary. Furthermore, since we are starting from a recent execution state, it is unnecessary to initialize data structures or to warm up buffer caches from disks. In contrast, restart is much slower. This is because restart requires the program to be reloaded and reinitialized from the beginning. Any memory state such as buffer caches and data structures needs to be warmed up or initialized. Squid is a particularly clear example. For Squid, restart requires 5.113 seconds to recover from a crash, whereas Rx takes only 0.095 seconds. Since our experiments use only a small workload, we expect that, with a real-world workload, it will take an even longer time for the whole program restart approach to recover from failures because it requires a long time to warm up caches and other memory data structures. This result indicates that Rx enables servers to provide highly available services despite common software defects. Instead of experiencing a failure, clients experience an increased response time for a very short period. We expect that after Rx's proxy is pushed into the kernel, the Rx results will be even better since such optimization will reduce the number of context switches and the memory copying overhead.

If the bug-exposing request is not resent after failure, restart has a similar recovery time for CVS (otherwise, restart cannot recover from the failure for CVS). Restart takes only 0.010 seconds to recover for CVS, while Rx takes 0.017 seconds. This is because CVS is implemented using the xinetd daemon as its network monitor. Each connection to CVS causes xinetd to fork and exec a new instance of CVS. Therefore, CVS must have a very low startup time in order to provide adequate performance. Additionally, there is no state shared between different CVS processes except for that of the repository, which is persistently stored on disk. As such, CVS has only minimal state to initialize. Given such a simple application, ordinary restart techniques are good enough. For the same reason, even when Rx takes a checkpoint every 50ms, the overhead is still small, less than 11%. But even with such frequent checkpoints, Rx's recovery time is still slightly higher than restart, which indicates that for CVS-like servers, restart is a better alternative in terms of recovery time. But note that restart is not failure transparent to clients, and, if the bug-exposing request is resent by the client after the failure, the same bug (especially a deterministic one) is very likely to happen again.

Rx does not hide software defects. Instead, Rx reacts only after

[Figure 5 plots: Squid throughput (Mbps) and average response time (sec) versus bug arrival rate (bugs/sec), for Restart and Rx. Panels: (a) Throughput, (b) Average Response Time.]

Figure 5: Throughput and average response time with different bug arrival rates

Squid Squid 100 0.1 Baseline 80 0.08 Rx

60 0.06

40 0.04 Time (sec) 20 Baseline 0.02 Throughput (Mbps) Rx Average Response 0 0 0 0.5 1 1.5 2 0 0.5 1 1.5 2 Checkpoint Interval (sec) Checkpoint Interval (sec) (a) Throughput and average response time for Squid

MySQL MySQL 20 0.1

0.08 15 0.06 10 0.04 5 Baseline 0.02 Baseline Rx Rx Throughput (trans/sec) 0 0

0 0.2 0.4 0.6 0.8 1 Average Latency (sec/trans) 0 0.2 0.4 0.6 0.8 1 Checkpoint Interval (sec) Checkpoint Interval (sec) (b) Throughput and average response time for MySQL Figure 6: Rx overhead in terms of throughput and average response time for two representative applications: Squid and MySQL. In these experiments, we do not send the bug-exposing request since we want to compare the pure overhead of Rx with the baseline in normal cases. a defect is exposed. In addition, Rx’s failure and recovery experi- ments for response time. Once Squid has restarted, there is a spike ences provide programmers with extra information to diagnose the in response time because all of the clients get their requests sat- occurring or occurred software defects. For example, for CVS, Rx isfied after a long queuing delay. Because Squid cannot service is able to avoid the bug by delaying the recycling of recently freed requests until it has completed the lengthy startup and initialization buffers during re-execution. Programmers then should investigate process, the whole program restart approach significantly degrades more in the direction of double-free or dangling pointers. the performance upon a failure. Similarly, with a large real-world workload, we expect that the performance with restart will be even 6.2 Recovery Performance worse since the recovery time will become longer and many more We have compared Rx with restart in terms of performance dur- requests will be queued, waiting to be serviced. ing recovery. As shown in Figure 4, Rx maintains throughput levels Figure 5 further illustrates the Rx’s performance in the case of close to that of the baseline case. At the time of bug occurrence (at continuous attacks by malicious users who keep issuing bug-exposing 7 seconds from the very beginning), the server throughput drops by requests. 
The throughput and response time of Rx remains constant 33% and the average response time increases by a factor of two for as the rate of bug occurrences increases, whereas the performance only a very short period of time (17-161 milliseconds). There- of restart degrades rapidly. This is because Rx has very small recov- fore, a bug occurrence imposes only a small overhead, and has a ery time, while restart spends a long time in recovery. Therefore, if minimal impact on overall throughput and response time. Restart, such a bug were triggered by an Internet-wide worm [51] or a ma- on the other hand, has a 5 second period of zero throughput. It licious user, restart cannot cope. However, since Rx can deal with services no requests during this period, so there are no measure- higher bug arrival rates, Rx can tolerate such attacks much better. Apps Rx Space Overhead (kB/checkpoint) 350 300 kernel proxy total 250 Squid 405.35 3.70 409.05 200 150 Mysql 300.00 0.16 300.16 Time (ms) 100 Average Recovery Apache 460.00 3.60 463.60 50 CVS 42.22 2.89 45.11 0 CVS Squid Apache MySQL Squid-ui Table 4: The average space overhead per checkpoint Squid-dp First Subseqent 6.3 Rx Time and Space Overhead Figure 7: Rx recovery time to avoid the first and subsequent Figures 6 shows the overhead of Rx compared to the baseline bug occurrences (without Rx) for various frequencies of checkpointing. The per- to disk [20, 31, 33, 60], remote memory [3, 61, 40], or non-volatile formance of Rx degrades somewhat as the checkpoint interval de- or persistent memory [36]. These checkpoints can be provided with creases, but the amount of degradation is small. For squid, both relatively low overhead. If there are messages and operations in throughput and response time are very close to baseline for all flight, logging is also needed [7, 35, 34, 32]. After failure, many, tested checkpoint rates. 
This is because the network remains the but not all, errors can be avoided by reattempting the failed compu- bottleneck for all cases. For MySQL, the performance degrades tation [27]. To deal with resource exhaustion or operating system slightly at small checkpoint intervals. Since MySQL is more CPU crashes, monitoring, logging and recovery can be done remotely bound, the additional memory-copying imposed by frequent check- via support by special network interface cards [9]. In some cases, points causes some degradation. It is expected that as checkpoints great care is taken to ensure deterministic replay [23, 46, 25]. How- are taken extremely frequently, Rx’s overhead will become domi- ever, unlike deterministic replay used by other techniques, we are nant. However, there is no need for very frequent checkpointing. purposely and systematically perturbing the re-execution environ- As shown earlier, even when Rx checkpoints every 200 millisec- ment to avoid determinism. As such, we have requirement and can onds, it is able to provide very good recovery performance. With use more lightweight checkpoints. Additionally, by changing envi- such a checkpoint interval, the overhead imposed by Rx is quite ronments, we can tolerate faults which simple re-execution cannot. small, almost negligible for Squid and only 5% for MySQL. Failure-Oblivious Computing [44] proposes modifying the be- Table 4 shows the average memory space overhead of Rx per havior of what it detects to be incorrect memory accesses. It dis- checkpoint. The space overhead of Rx for each checkpoint is rel- cards or redirects incorrect writes to a separate hash table and man- atively small (45.11-463.60kB). It mainly comes in two parts: up- ufactures values for incorrect reads. It has shortcomings in that it dates made during the checkpoint interval and the proxy message is restricted to memory related bugs, imposes high overheads (1-8x buffers. 
For the first part, Rx uses copy-on-write to reduce space slowdown [44],) and may introduce unpredictable behavior due to overhead. For the second part, since Rx only records requests in the its potentially unsafe modifications to the memory interface. The normal mode and request sizes are usually small, the proxy does recently proposed reactive immune system [49] has similar limi- not occupy much memory per checkpoint. Therefore, if 2-3MB of tations since it also speculatively “fixes” defects on-the-fly. As a space can be used by Rx, Rx is able to maintain 5-20 checkpoints: result, unlike Rx, these approaches can be unsafe for correctness- enough for our recovery purpose. critical server applications. 6.4 Benefits of the Failure Table Recovery-Oriented Computing (ROC) [39] proposes restructur- ing the entire software platform to focus on and allow recovery. Figure 7 reports the server recovery time with Rx when the server System components are to be isolated and failure aware. How- encounters the bug for the first time and for subsequent times in the ever, this requires not only restructuring individual servers, but all same run. The results show that the failure table can effectively re- of the programs in the entire system. Micro-rebootable [14] soft- duce the recovery time when the same bug/failure occurs again. For ware advocates software whose components are fail-stop, and indi- example, to deal with the first time occurrence of the buffer over- vidually recoverable, thereby making it easier to build fault-tolerant flow bug in Squid, Rx applies message reordering, delaying free systems. Again, this requires reengineering of existing software. + message reordering, padding + message reordering sequentially Rx can make use of dynamic bug detectors as sensors to deter- in three consecutive re-execution trials, and finally avoid the bug mine when a bug has occurred. For memory bugs, dynamic check- at the third re-execution. 
The entire recovery process lasts around ers include Purify [29], and StackGuard [22]. Many of these use 216.7 milliseconds. However, for any subsequent occurrences of instrumentation to monitor memory accesses, and hence impose the same bug, which can be located in the failure table, Rx applies high overhead. Some techniques can perform such checks with the correct environmental changes (padding + message reordering) lower overhead, such as CCured [21] and SafeMem [41]. Beyond in the first re-execution attempt, thus the recovery time is reduced memory bugs, it is also possible to detect deadlock and races. to 94.7 milliseconds. For CVS, the failure table also helps to re- Our proxy is similar to the shadow drivers used in [55], in that duce the recovery time from 25 milliseconds to 16.9 milliseconds. it interposes itself between the user of a service and the actual For MySQL, the data race bug is avoided at the very first try with provider of that service in order to mask failures. However, rather message reordering and therefore there is no difference between the than being between a kernel driver and applications calling on that first bug occurrence and subsequent ones. driver, we are between a server process and the client processes. Furthermore the proxy does not replicate the original server during 7. RELATED WORK failure, but merely acts as a standin until the server recovers. Our work builds on much previous work. Due to space limita- The environmental changes we make are similar to noisemak- tions, this section only briefly describes those works that are not ers [52], except that, instead of trying to spur non-deterministic discussed in previous sections. bugs into occurring, we are attempting to prevent deterministic and The idea of using checkpointing to provide fault tolerance is non-deterministic bugs by finding a legitimate execution path in old [10], and well known [24, 43]. These checkpoints may be done which they simply do not arise. 8. 
CONCLUSIONS AND LIMITATIONS solutions. Fortunately, this is a rare case, as a previous study [28] In summary, Rx is a safe, non-invasive and informative method shows that most errors tend to cause quick crashes. for quickly surviving software failures caused by common software defects such as memory corruptions and concurrency bugs and thus 9. ACKNOWLEDGMENTS providing highly available services. It does so by re-executing the The authors would like to thank the shepherd, Ken Birman, and buggy program region in a modified execution environment. It can the anonymous reviewers for their invaluable feedback. We ap- deal with both deterministic and non-deterministic bugs, and re- preciate useful discussion with the OPERA group members. This quires few to no modifications to applications’ source code. Be- research is supported by IBM Faculty Award, NSF CNS-0347854 cause Rx does not forcefully change programs’ execution by re- (career award), NSF CCR-0305854 and NSF CCR-0325603 grant. turning speculative values, it introduces no uncertainty or misbe- havior into programs’ execution. Moreover, it provides additional 10. REFERENCES feedback to programmers for their bug diagnosis. [1] L. Alvisi and K. Marzullo. Trade-offs in implementing optimal Our experimental studies of four server applications that con- message logging protocols. In Proceedings of the 15th ACM Symposium on the Principles of Distributed Computing, May 1996. tain six bugs of different types show that Rx can successfully avoid [2] P. E. Ammann and J. C. Knight. Data diversity: An approach to software defects during re-execution and thus provide non-stop ser- software fault tolerance. IEEE Transactions on Computers, vices. In contrast, the two tested alternatives, a whole program 37(4):418–425, 1988. restart approach and a simple rollback and re-execution without [3] C. Amza, A. Cox, and W. Zwaenepoel. 
Data replication strategies for environmental changes, cannot recover the three servers (Squid, fault tolerance and availability on commodity clusters. In Apache and CVS) that contain deterministic bugs, and only have Proceedings of the 2000 International Conference on Dependable a 40% recovery rate for the server (MySQL) that contains a non- Systems and Networks, Jun 2000. deterministic concurrency bug. These results indicate that apply- [4] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, SE-11(12), 1985. ing environmental changes is crucial to survive software failures [5] A. Avizienis and L. Chen. On the implementation of N-version caused by common software defects, especially deterministic bugs. programming for software fault tolerance during execution. In In addition, Rx also provides fast recovery within 0.017-0.16 sec- Proceedings of the 1st International Computer Software and onds, 21-53 times faster than the whole program restart approach Applications Conference, Nov 1977. for all but one case (CVS). With Rx, clients do not experience any [6] J. F. Bartlett. A NonStop kernel. In Proceedings of the 8th failures except a small increase in the average response time for a Symposium on Operating Systems Principles, Dec 1981. very short period of time. To provide such fast recovery, Rx im- [7] K. P. Birman. Building Secure and Reliable Network Applications, poses small time and small space overheads. chapter 19. Manning ISBN: 1-884777-29-5, 1996. There are several limitations that we wish to address in our future [8] A. Bobbio and M. Sereno. Fine grained software rejuvenation models. In Proceedings of the 1998 International Computer work. First, we are trying to evaluate Rx with more server applica- Performance and Dependability Symposium, Sep 1998. tions containing real bugs under various workloads. Second, cur- [9] A. Bohra, I. Neamtiu, P. Gallard, F. Sultan, and L. Iftode. 
Remote rently the Rx’s proxy is implemented at the user level. To improve repair of operating system state using backdoors. In Proceedings of performance, we plan to move it into the kernel, thereby avoiding the 2004 International Conference on Autonomic Computing, May context switches and memory copying. Third, we plan to extend 2004. Rx to support multi-tier server hierarchy as described in Section 4. [10] A. Borg, J. Baumbach, and S. Glazer. A message system supporting This is relative easy since Rx already works with a database server fault tolerance. In Proceedings of the 9th Symposium on Operating Systems Principles, Oct 1983. (MySQL), a web server (Apache), and a Web proxy server (Squid). [11] A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle. Fault Fourth, our experiments so far have evaluated only I/O bound ap- tolerance under UNIX. ACM Transactions on Computer Systems, plications such as network servers whose availability is of critical 7(1), 1989. importance. We plan to evaluate the Rx’s overheads on computa- [12] T. C. Bressoud and F. B. Schneider. Hypervisor-based fault tolerance. tion intensive applications, and we expect the overheads are likely ACM Transactions on Computer Systems, 14(1):80–107, Feb 1996. to be higher. Finally, we have only compared with two alterna- [13] G. Candea, J. Cutler, A. Fox, R. Doshi, P. Garg, and R. Gowda. tive approaches: the whole program restart approach and a simple Reducing recovery time in a small recursively restartable system. In rollback and re-execution without environmental changes. This is Proceedings of the 2002 International Conference on Dependable Systems and Networks, Jun 2002. because many other alternate approaches require substantial efforts [14] G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox. to restructure/redesign applications. Microreboot – A technique for cheap recovery. 
In Proceedings of the While Rx can effectively and efficiently recover from many soft- 6th Symposium on Operating System Design and Implementation, ware failures caused by common software defects, Rx is certainly Dec 2004. not a panacea. Like almost all previous solutions, Rx cannot guar- [15] M. Castro and B. Liskov. Practical byzantine fault tolerance. In antee recovery from all software failures. For example, as we dis- Proceedings of the 3rd Symposium on Operating System Design and cuss in Section 4, neither semantic bugs nor resource leaks can be Implementation, Feb 1999. directly addressed by Rx. Also, as described in Section 2, in some [16] M. Castro and B. Liskov. Proactive recovery in a Byzantine-Fault-Tolerant system. In Proceedings of the 4th rare cases, it is possible that a bug still occurs during re-execution Symposium on Operating System Design and Implementation, Oct but its symptoms are not detected in-time by the sensors. In this 2000. case, Rx will claim a false recovery success. While similar rare [17] CERT/CC. Advisories. http://www.cert.org/advisories/. cases can also appear in many previous solutions, it is still worthy [18] S. Chandra and P. M. Chen. Whither generic recovery from addressing by using more rigorous dynamic integrity and correct- application faults? A fault study using open-source software. In ness checkers as Rx’s sensors. This is currently an active research Proceedings of the 2000 International Conference on Dependable area with many recent innovations. Additionally, Rx cannot deal Systems and Networks, Jun 2000. with latent bugs–bugs in which the fault is introduced at a time [19] S. Chandra and P. M. Chen. The impact of recovery mechanisms on the likelihood of saving corrupted state. In Proceedings of the 13th long before any obvious symptoms. As discussed by Chandra and International Symposium on Software Reliability Engineering, Nov Chen [19], this problem is general to all checkpoint-based recovery 2002. [20] Y. 
Chen, J. S. Plank, and K. Li. CLIP: A checkpointing tool for Transactions on Parallel and Distributed Systems, 9(10):972–986, message-passing parallel programs. In Proceedings of the 1997 1998. ACM/IEEE Supercomputing Conference, Nov 1997. [41] F. Qin, S. Lu, and Y. Zhou. Safemem: Exploiting ECC-memory for [21] J. Condit, M. Harren, S. McPeak, G. C. Necula, and W. Weimer. detecting memory leaks and memory corruption during production CCured in the real world. In Proceedings of the ACM SIGPLAN 2003 runs. In Proceedings of the 11th International Symposium on Conference on Programming Language Design and Implementation, High-Performance Computer Architecture, Feb 2005. Jun 2003. [42] B. Randell. System structure for software fault tolerance. IEEE [22] C. Cowan, C. Pu, D. Maier, J. Walpole, P. Bakke, S. Beattie, A. Grier, Transactions on Software Engineering, 1(2):220–232, 1975. P. Wagle, Q. Zhang, and H. Hinton. StackGuard: Automatic adaptive [43] B. Randell, P. A. Lee, and P. C. Treleaven. Reliability issues in detection and prevention of buffer-overflow attacks. In Proceedings computing system design. ACM Computer Surveys, 10(2):123–165, of the 7th USENIX Security Symposium, Jan 1998. 1978. [23] G. W. Dunlap, S. T. King, S. Cinar, M. A. Basrai, and P. M. Chen. [44] M. Rinard, C. Cadar, D. Dumitran, D. M. Roy, T. Leu, and W. S. Revirt: Enabling intrusion analysis through virtual-machine logging Beebee, Jr. Enhancing server availability and security through and replay. In Proceedings of the 5th Symposium on Operating failure-oblivious computing. In Proceedings of the 6th Symposium on System Design and Implementation, Dec 2002. Operating System Design and Implementation, Dec 2004. [24] E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A [45] R. Rodrigues, M. Castro, and B. Liskov. BASE: Using abstraction to survey of rollback-recovery protocols in message-passing systems. improve fault tolerance. 
In Proceedings of the 18th Symposium on ACM Computer Surveys, 34(3):375–408, 2002. Operating Systems Principles, Oct 2001. [25] Y. A. Feldman and H. Schneider. Simulating reactive systems by [46] M. Russinovich and B. Cogswell. Replay for concurrent deduction. ACM Transactions on Software Engineering and non-deterministic shared-memory applications. In Proceedings of the Methodology, 2(2):128–175, 1993. ACM SIGPLAN 1996 Conference on Programming Language Design [26] S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi. On the analysis of and Implementation, May 1996. software rejuvenation policies. In Proceedings of the Annual [47] D. S. Santry, M. J. Feeley, N. C. Hutchinson, A. C. Veitch, R. W. Conference on Computer Assurance, Jun 1997. Carton, and J. Ofir. Deciding when to forget in the Elephant file [27] J. Gray. Why do computers stop and what can be done about it? In system. In Proceedings of the 17th ACM Symposium on Operating Proceedings of the 5th Symposium on Reliable Distributed Systems, System Principles, Dec 1999. Jan 1986. [48] D. Scott. Assessing the costs of application downtime. Gartner [28] W. Gu, Z. Kalbarczyk, R. K. Iyer, and Z.-Y. Yang. Characterization Group, May 1998. of Linux kernel behavior under errors. In Proceedings of the 2003 [49] S. Sidiroglou, M. E. Locasto, S. W. Boyd, and A. D. Keromytis. International Conference on Dependable Systems and Networks, Jun Building a reactive immune system for software services. In 2003. Proceedings of the USENIX 2005 Annual Technical Conference, Apr [29] R. Hasting and B. Joyce. Purify: Fast detection of memory leaks and 2005. access errors. In Proceedings of the USENIX Winter 1992 Technical [50] S. Srinivasan, C. Andrews, S. Kandula, and Y. Zhou. Flashback: A Conference, Dec 1992. light-weight extension for rollback and deterministic replay for [30] Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton. Software software debugging. 
In Proceedings of the USENIX 2004 Annual rejuvenation: Analysis, module and applications. In Proceedings of Technical Conference, Jun 2004. the 25th Annual International Symposium on Fault-Tolerant [51] S. Staniford, V. Paxson, and N. Weaver. How to own the internet in Computing, Jun 1995. your spare time. In Proceedings of the 11th USENIX Security [31] D. Johnson and W. Zwaenepoel. Recovery in distributed systems Symposium, Aug 2002. using optimistic message logging and checkpointing. In Proceedings [52] S. D. Stoller. Testing concurrent Java programs using randomized of the 7th Annual ACM Symposium on Principles of Distributed scheduling. In Proceedings of the 2nd Workshop on Runtime Computing, Aug 1988. Veri®cation, Jul 2002. [32] D. B. Johnson and W. Zwaenepoel. Recovery in distributed systems [53] R. Strom and S. Yemini. Optimistic recovery in distributed systems. using optimistic message logging and check-pointing. Journal of ACM Transactions on Computer Systems, 3(3):204–226, 1985. Algorithms, 11(3):462–491, 1990. [54] M. Sullivan and R. Chillarege. Software defects and their impact on [33] K. Li, J. Naughton, and J. Plank. Concurrent real-time checkpoint for system availability – A study of field failures in operating systems. In parallel programs. In Proceedings of the 2nd ACM SIGPLAN Proceedings of the 21th Annual International Symposium on Symposium on Princiles & Practice of Parallel Programming, Mar Fault-Tolerant Computing, Jun 1991. 1990. [55] M. M. Swift, M. Annamalai, B. N. Bershad, and H. M. Levy. [34] D. E. Lowell, S. Chandra, and P. M. Chen. Exploring failure Recovering device drivers. In Proceedings of the 6th Symposium on transparency and the limits of generic recovery. In Proceedings of the Operating System Design and Implementation, Dec 2004. 4th Symposium on Operating System Design and Implementation, [56] G. Trent and M. Sake. Webstone: The first generation in http server Oct 2000. benchmarking, 1995. [35] D. E. Lowell and P. M. Chen. 
Free transactions with rio vista. In [57] W. Vogels, D. Dumitriu, A. Agrawal, T. Chia, and K. Guo. Proceedings of the 16th Symposium on Operating Systems Scalability of the Microsoft Cluster Service. In Proceedings of the Principles, Oct 1997. 2nd USENIX Windows NT Symposium, Aug 1998. [36] D. E. Lowell and P. M. Chen. Discount checking: Transparent, [58] W. Vogels, D. Dumitriu, K. Birman, R. Gamache, M. Massa, low-overhead recovery for general applications. Technical report, R. Short, J. Vert, J. Barrera, and J. Gray. The design and architecture CSE-TR-410-99, University of Michigan, Jul 1998. of the Microsoft Cluster Service. In Proceedings of the 28th Annual [37] E. Marcus and H. Stern. Blueprints for High Availability. John International Symposium on Fault-Tolerant Computing, Jun 1998. Willey & Sons, 2000. [59] Y.-M. Wang, Y. Huang, and W. K. Fuchs. Progressive retry for [38] D. Mosberger and T. Jin. httperf - a tool for measuring web server software error recovery in distributed systems. In Proceedings of the performance. SIGMETRICS Performance Evaluation Review, 23rd Annual International Symposium on Fault-Tolerant Computing, 26(3):31–37, 1998. Jun 1993. [39] D. Patterson, A. Brown, P. Broadwell, G. Candea, M. Chen, J. Cutler, [60] Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. M. R. Kintala. P. Enriquez, A. Fox, E. Kiciman, M. Merzbacher, D. Oppenheimer, Checkpointing and its applications. In Proceedings of the 25th N. Sastry, W. Tetzlaff, J. Traupman, and N. Treuhaft. Recovery Annual International Symposium on Fault-Tolerant Computing, Jun oriented computing (ROC): Motivation, definition, techniques, and 1995. case studies. Technical report, Technical Report UCB//CSD-02-1175, [61] Y. Zhou, P. M. Chen, and K. Li. Fast cluster failover using virtual U.C.Berkeley, Mar 2002. memory-mapped communication. In Proceedings of the 1999 ACM [40] J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. 
IEEE International Conference on Supercomputing, Jun 1999.
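
As a closing illustration of the failure-table behavior reported in Section 6.4, the following Python sketch shows how remembering the environmental changes that avoided a failure lets a later occurrence of the same failure be handled in a single re-execution instead of three. It is only a sketch under stated assumptions and not Rx's implementation: the class FailureTable, the failure_id key, the reexecute callback and the toy predicate squid_overflow_avoided are hypothetical names introduced here, and the order of trials simply mirrors the Squid example in Section 6.4.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

# A set of environmental changes is modelled as a frozenset of labels.
Change = FrozenSet[str]

# Order in which change sets are tried for a failure not seen before.
# This mirrors the Squid example in Section 6.4; it is otherwise an
# assumption of this sketch.
DEFAULT_TRIALS: List[Change] = [
    frozenset({"message reordering"}),
    frozenset({"delay free", "message reordering"}),
    frozenset({"padding", "message reordering"}),
]


class FailureTable:
    """Remembers which change set avoided a given failure (hypothetical API)."""

    def __init__(self) -> None:
        self._known: Dict[str, Change] = {}

    def recover(
        self, failure_id: str, reexecute: Callable[[Change], bool]
    ) -> Tuple[Optional[Change], int]:
        """Try change sets until one avoids the failure.

        `reexecute(changes)` stands in for rollback plus re-execution under
        `changes` and returns True if the failure did not recur. Returns the
        successful change set (or None) and the number of re-executions.
        """
        known = self._known.get(failure_id)
        # Try the remembered change set first, then the default order.
        trials = ([known] if known is not None else []) + [
            c for c in DEFAULT_TRIALS if c != known
        ]
        for attempt, changes in enumerate(trials, start=1):
            if reexecute(changes):
                self._known[failure_id] = changes
                return changes, attempt
        return None, len(trials)


def squid_overflow_avoided(changes: Change) -> bool:
    # Toy stand-in: in the Section 6.4 example, the change set that
    # includes padding is the one that avoids the Squid buffer overflow.
    return "padding" in changes


table = FailureTable()
first = table.recover("squid-buffer-overflow", squid_overflow_avoided)
second = table.recover("squid-buffer-overflow", squid_overflow_avoided)
print(first[1], second[1])  # re-executions needed: first vs. subsequent
```

In this toy run the first occurrence needs three re-executions and the subsequent one needs a single re-execution, which is the qualitative effect behind the drop from 216.7 to 94.7 milliseconds reported for Squid. How Rx actually identifies "the same bug" is not described in this section, so the string key used here is purely illustrative.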