ROC: Recovery Oriented Computing

Web Page: http://roc.cs.berkeley.edu/

Participating UC Berkeley Faculty: David Patterson (CS – contact – [email protected])

“If a problem has no solution, it may not be a problem, but a fact – not to be solved, but to be coped with over time.” - Shimon Peres

General Description of Recovery Oriented Computing. A 20-year struggle by researchers and developers has improved the performance of computing 10,000-fold: what took a year to compute in 1983 takes less than an hour today. Such success has led us to turn our attention to another great challenge of computer science and engineering: dependability. Taking Peres’ advice above to heart, our project accepts that failures of hardware, software, and people are facts. We advocate adding rapid recovery to redundancy to cope with these inevitable failures. Thus, rather than failing less often, systems should spend less time recovering from failures. Shrinking recovery time from 80 hours to 5 minutes per year, for example, improves availability from two nines (99%) to five nines (99.999%). Since operators are often involved in recovery, making systems recover quickly also makes them easier to operate, and hence ROC should lower the cost of ownership. We are exploring five principles for building “ROC solid” systems:

1. Given that errors occur, design to recover rapidly.
2. Given that errors occur, give operators tools to find errors.
3. Given that humans err, systems must support undo of operator errors.
4. Inject errors to test the real behavior of systems and to train operators.
5. Develop and distribute recovery benchmarks, both to measure progress and to change the priorities of companies that build computer products.

Given the space available, we cover just two examples, though we are working on all five.

Pinpointing Failures. Since errors occur, the operator needs help tracking down an error, and effective aids should shrink recovery time. Traditional dependable systems start with a complete description of all hardware and software components and then provide a careful failure analysis by examining the ways the system can fail.
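The recovery-time arithmetic quoted earlier (80 hours per year versus 5 minutes per year) can be checked in a few lines; this is a minimal sketch using the figures from the paragraph above:

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours in a (non-leap) year

def availability(downtime_hours_per_year):
    """Fraction of the year the system is up."""
    return 1 - downtime_hours_per_year / HOURS_PER_YEAR

print(round(availability(80), 4))      # 0.9909 -> roughly two nines (99%)
print(round(availability(5 / 60), 7))  # 0.9999905 -> five nines (99.999%)
```

The point of the calculation is that a 1000-fold reduction in recovery time buys three extra nines without the system failing any less often.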
Internet services are often assembled from many different software components from different vendors, and these components change rapidly. Failures in the service can arise from unexpected interactions among particular combinations of components, rather than being directly traceable to a bug in one specific component. The ROC tool “PinPoint” correlates user requests that cause an Internet service to fail with the combinations of components that participated in those requests, to isolate troublesome combinations. The idea is to save a chronological list of the software components used to fulfill each user request (called a trace) and store it in a database along with a tag saying whether the request succeeded. Recording these lists slows the system by roughly 10%. We then use standard “data mining” techniques to find the suspect components from the mix of good and bad traces. The virtue is that PinPoint provides high accuracy yet works with whatever combination of software components happens to be on the system at the time, whereas the traditional approach requires elaborate preplanning each time the suite of software changes. PinPoint also helps us analyze what breaks when we inject different errors, allowing us to automatically determine the best order in which to restart components in response to a failure, and even to revise that strategy automatically as the system’s software evolves and is upgraded.

Undo of Operator Error. Perhaps the greatest challenge is providing a margin of safety against unexpected errors by the operator. As an analogy, the first word processors had no undo command, which made them frustrating if not terrifying: an erroneous global substitution could destroy all your work. Undo removed the anxiety from word processing, as users could undo and redo to their heart’s content.
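The trace-mining idea can be sketched as follows. The component names and traces here are invented for illustration, and a simple per-component failure rate stands in for the “data mining” techniques the project describes:

```python
from collections import defaultdict

# Invented traces: (components that served the request, did it succeed?).
traces = [
    ({"web", "auth", "db"}, True),
    ({"web", "cache", "db"}, False),
    ({"web", "auth", "cache"}, False),
    ({"web", "db"}, True),
    ({"web", "cache"}, False),
]

def suspect_scores(traces):
    """Failure rate among the requests that touched each component."""
    seen, failed = defaultdict(int), defaultdict(int)
    for components, ok in traces:
        for c in components:
            seen[c] += 1
            if not ok:
                failed[c] += 1
    return {c: failed[c] / seen[c] for c in seen}

scores = suspect_scores(traces)
print(max(scores, key=scores.get))  # cache: present in every failed request
```

Note that nothing here requires a model of the software built in advance; the suspect list falls out of whatever traces the running system happens to record.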
Anxiety is a fact of life for today’s operators of large systems, for they must manually plan for the possibility of error, since there is no undo. As a demonstration of a better path, we are building an undo system for email. Although many of us would like to retrieve the occasional email just after sending it, this undo system is aimed at the place where email is stored. One example is electronically disinfecting an email storage server after infection by a new email virus. Because the system saves all activity on the email server, including messages that users have discarded, the operator can first reset the system to a point before the virus arrived. The operator would then download software that attacks the new virus and repair the damage, including deleting the messages that the virus sent to users. The last step is to play forward the valid events that occurred after the initial infected email arrived. We call these three steps Rewind, Repair, and Replay.

Hence, like much of the literature of science fiction, we are offering time travel. Unlike those stories, our time-traveling operator is supposed to change the past, rather than being warned against it and accidentally doing it anyway. We too can end up in a time-travel paradox, whereby people notice that the world is different after the past is changed. For example, a user might have read the infected email before the operator could eradicate the virus. Although there is no general-purpose solution to such paradoxes, there is an obvious one at the level of the application: the system simply sends another email explaining that some messages the user read and saved were deleted in an attempt to eradicate a virus. Unlike science fiction novels, we don’t need bizarre compensating actions for time-travel problems.

Desired Support: Funding for grad student participants.
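The Rewind, Repair, Replay cycle described above can be sketched with a toy event log. Everything here (the log format, the helper names, the single-mailbox model) is invented for illustration; the real system operates on the full state of an email storage server:

```python
# Invented event log for an email store: (timestamp, action, message_id).
log = [
    (1, "deliver", "m1"),
    (2, "deliver", "virus"),  # the infected message arrives
    (3, "deliver", "m2"),
    (4, "delete", "m1"),
]

def rewind_repair_replay(log, infection_time, is_infected):
    """Rebuild mailbox state, dropping the infection and keeping valid work."""
    mailbox = set()
    for t, action, mid in log:
        # Rewind: events before the infection are applied unchanged.
        # Repair: events at or after it are skipped if the message is infected.
        if t >= infection_time and is_infected(mid):
            continue
        # Replay: all remaining valid events are played forward.
        if action == "deliver":
            mailbox.add(mid)
        else:  # "delete"
            mailbox.discard(mid)
    return mailbox

print(rewind_repair_replay(log, 2, lambda m: m == "virus"))  # {'m2'}
```

The valid work done after the infection (delivering m2, deleting m1) survives the repair, which is exactly the property that distinguishes Replay from simply restoring a backup.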