Repairing Past Errors with System-Wide Undo
Total Page:16
File Type:pdf, Size:1020Kb
A Recovery-Oriented Approach to Dependable Services: Repairing Past Errors with System-Wide Undo Aaron Brown EECS Computer Science Division University of California, Berkeley Report No. UCB//CSD-04-1304 December 2003 Computer Science Division (EECS) University of California Berkeley, California 94720 A Recovery-Oriented Approach to Dependable Services: Repairing Past Errors with System-wide Undo by Aaron Baeten Brown A.B. (Harvard University) 1997 M.S. (University of California, Berkeley) 2000 A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY OF CALIFORNIA, BERKELEY Committee in charge: Professor David Patterson, Chair Professor John Chuang Professor Armando Fox Professor Katherine Yelick Fall 2003 A Recovery-Oriented Approach to Dependable Services: Repairing Past Errors with System-wide Undo Copyright 2003 by Aaron Baeten Brown Abstract Motivated by the pressing need for increased dependability in corporate and Internet services and by the perspective that effective recovery can improve dependability as much or more than avoid- ing failures, we introduce a novel recovery mechanism that gives human system operators the power of system-wide undo. System-wide undo allows operators to roll back erroneous changes to a service’s state without losing end-user data or updates, to make retroactive repairs in the historical timeline of the service system, and thereby to quickly recover from catastrophic state corruption, operator error, failed upgrades, and external attacks, even when the root cause of the catastrophe is unknown. We explore system-wide undo via a framework based on the novel concept of spheres of undo, bubbles of state and time that provide scope to the state recoverable by undo and serve as a structuring tool for implementing undo on standalone services, hierarchically-composed systems, and distributed interacting services. Crucially, spheres of undo allow us to define the concept of paradoxes, inconsis- tencies that occur when an undo process retroactively alters state that has been exposed outside of its containing sphere of undo. Managing paradoxes is the grand challenge of system-wide undo, and to tackle it we introduce a framework that automatically detects and compensates for paradoxes; our approach exploits the relaxed consistency semantics already present in existing services that interact with human end-users. We describe an implementation of our system-wide undo framework for standalone services with human end-users. We explore its applicability by assembling and evaluating a prototype undoable e- mail store service, by analyzing what would be necessary to construct an undoable online auction ser- vice, and by developing a set of guidelines to help service designers retrofit their services with undo. We find that system-wide undo functionality imposes non-negligible but tolerable overhead in terms of both time and space. Using a novel methodology we develop to benchmark human-assisted recovery processes, we also find that undo-based recovery has a net positive effect on dependability, providing significant improvements in correctness while only slightly degrading availability. 1 Table of Contents List of Figures v List of Tables vii Acknowledgements viii 1Introduction 1 1.1 Service Dependability and its Influencing Factors . 2 1.2 Dealing with Dependability: Avoidance vs. Recovery . 4 1.3 Undo: A Recovery-Oriented Technique for Human Operators. 7 1.4 Alternatives to Undo. 9 1.5 Contributions and Roadmap. 9 2 An Undo Model for Human Operators 11 2.1 A Survey of Existing Models of Undo . 11 2.1.1 Undo models for productivity applications . 12 2.1.2 Undo models in other domains . 14 2.2 Establishing a Broader Design Space for Undo Models . 16 2.2.1 Spheres of Undo . 17 2.2.2 Design axes for undo models within a sphere of undo . 19 2.2.3 Considering external actors: a third design axis . 22 2.2.4 Locating Existing Undo Models in the Design Space. 24 2.3 Ideal Undo Model for System-Wide Operator Undo for Services . 24 2.3.1 From ideal model to practical target. 26 2.3.2 Discussion . 29 3 Managing Undo-Induced Inconsistencies with Re-Execution 31 3.1 The Re-Execution Approach and the Three R’s . 31 3.2 Verbs . 35 3.3 Paradoxes . 36 3.4 Coping with Paradoxes . 37 i 4 An Architecture for Three-R’s Undo for Self-Contained Services 40 4.1 Design Points and Assumptions . 40 4.1.1 Self-contained services . 41 4.1.2 Black-box services . 42 4.1.3 Fault model . 42 4.2 Undo System Architecture . 43 4.3 Verbs . 44 4.4 Generating the Timeline Log . 46 4.4.1 Decoupling verbs from service behavior . 46 4.4.2 Sequencing verbs . 47 4.5 Paradox Management . 51 4.5.1 Detecting paradoxes: per-verb consistency predicates . 52 4.5.2 Handling detected paradoxes: per-verb compensation methods . 54 4.5.3 Handling propagating paradoxes: the squash interface . 54 4.6 Discussion. 55 4.6.1 Recovery guarantees for undo . 55 4.6.2 Potential applicability of Three-R’s undo. 57 4.6.3 Comparison to existing approaches . 59 4.7 Summary . 62 5 A Java-based Implementation of the Three-R’s Undo Architecture 63 5.1 Verbs . 64 5.2 Time and the timeline log. 66 5.3 Rewindable storage . 67 5.3.1 Reversible rewind. 67 5.3.2 Limited number of snapshots. 68 5.3.3 Checkpoint management . 68 5.4 The undo manager . 70 5.4.1 Mediating verb execution: proxy APIs . 70 5.4.2 Mediating verb execution: undo manager operation. 71 5.4.3 Managing the timeline: operation APIs . 72 5.5 Maintaining time-invariant names: the UIDFactory module . 73 5.6 User interfaces. 74 5.7 Summary . 75 6 Undo Case Study: An Undoable E-mail Store Service 77 6.1 Overview: an Undoable E-Mail Service. 78 6.2 Verbs for E-mail . 78 6.2.1 Verb implementation: stripping context. 79 6.2.2 Verb implementation: managing connection context . 81 ii 6.2.3 Sequencing e-mail verbs. 82 6.3 Handling Paradoxes in E-mail. 82 6.3.1 An external consistency policy for e-mail . 83 6.3.2 Paradox detection predicates . 84 6.3.3 Compensating for paradoxes . 84 6.3.4 Squashing propagating paradoxes. 86 6.3.5 Recovery guarantee for e-mail . 86 6.4 E-mail Service Proxy . 87 6.5 Evaluation of Overhead and Performance . 88 6.5.1 Setup . 88 6.5.2 Time overhead: throughput and latency. 89 6.5.3 Space overhead. 95 6.5.4 Performance of the undo cycle . 95 6.6 Discussion. ..