Improving the Reliability of Commodity Operating Systems

Michael M. Swift

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

University of Washington

2005

Program Authorized to Offer Degree: Computer Science and Engineering

University of Washington Graduate School

This is to certify that I have examined this copy of a doctoral dissertation by

Michael M. Swift

and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.

Co-Chairs of Supervisory Committee:

Henry M. Levy

Brian N. Bershad

Reading Committee:

Henry M. Levy

Brian N. Bershad

John Zahorjan

Date:

In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Proquest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, to whom the author has granted “the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform.”

Signature

Date

University of Washington

Abstract

Improving the Reliability of Commodity Operating Systems

Michael M. Swift

Co-Chairs of Supervisory Committee:

Professor Henry M. Levy, Computer Science and Engineering

Associate Professor Brian N. Bershad, Computer Science and Engineering

This dissertation presents an architecture and mechanism for improving the reliability of commodity operating systems by enabling them to tolerate component failures. Improving reliability is one of the greatest challenges for commodity operating systems, such as Windows and Linux. System failures are commonplace in all domains: in the home, in the server room, and in embedded systems, where the existence of the OS itself is invisible. At the low end, failures lead to user frustration and lost sales. At the high end, an hour of downtime from a system failure can result in losses in the millions. Device drivers are a leading cause of failure in commodity operating systems. In Windows XP, for example, device drivers cause 85% of reported failures [160]. In Linux, the frequency of coding errors is up to seven times higher for device drivers than for the rest of the kernel [45]. Despite decades of research in reliability and extensibility technology, little of this technology has made it into commodity systems. One major reason for the lack of adoption is that these research systems require that the operating system, drivers, applications, or all three be rewritten to take advantage of the technology. In this dissertation I present a layered architecture for tolerating the failure of existing drivers within existing operating system kernels. My solution consists of two new techniques for isolating drivers and then recovering from their failure.

First, I present an architecture, called Nooks, for isolating drivers from the kernel in a new protection mechanism called a lightweight kernel protection domain. By executing drivers within a domain, the kernel is protected from their failure and cannot be corrupted. Second, I present a mechanism, called shadow drivers, for recovering from device driver failures. Based on a replica of the driver's state machine, a shadow driver conceals the driver's failure from applications and restores the driver's internal state to a point where it can process requests as if it had never failed. Thus, the entire failure and recovery is transparent to applications. I also show that Nooks functions as a platform for higher-level reliability services. I leverage the Nooks and shadow driver code to remove another major source of downtime: reboots to update driver code. This service, dynamic driver update, replaces drivers online, without restarting the OS or running applications and without changes to driver code. This service requires little additional code, and demonstrates that Nooks and shadow drivers provide useful services for constructing additional reliability mechanisms. I implemented the Nooks architecture, shadow drivers, and dynamic driver update within the Linux operating system and used them to fault-isolate several device drivers and other kernel extensions. My results show that Nooks offers a substantial increase in the reliability of operating systems, catching and quickly recovering from many faults that would otherwise crash the system. Under a wide range and number of fault conditions, I show that Nooks recovers automatically from 99% of the faults that otherwise cause Linux to crash. I also show that the system can upgrade existing drivers without rebooting. Finally, I show that the system imposes little performance overhead for many drivers.

TABLE OF CONTENTS

List of Figures iv

List of Tables vi

Chapter 1: Introduction 1
1.1 Motivation ...... 1
1.2 Thesis and Contributions ...... 2
1.3 Dissertation Organization ...... 5

Chapter 2: Background and Related Work 6
2.1 Definitions ...... 6
2.2 System Models ...... 7
2.3 Failure Models ...... 7
2.4 Properties of OS Failures ...... 11
2.5 Previous Approaches to Building Highly Reliable Systems ...... 12
2.6 Fault Tolerance in Nooks ...... 25
2.7 Summary ...... 26

Chapter 3: Device Driver Overview 28

Chapter 4: The Nooks Reliability Layer 31
4.1 Introduction ...... 31
4.2 Nooks Architecture ...... 33
4.3 Implementation ...... 37
4.4 Summary ...... 55

Chapter 5: The Shadow Driver Recovery Mechanism 56
5.1 Introduction ...... 56
5.2 Shadow Driver Design ...... 57
5.3 Implementation ...... 61
5.4 Summary ...... 71

Chapter 6: Evaluation of Nooks and Shadow Drivers 73
6.1 Introduction ...... 73
6.2 Test Methodology ...... 73
6.3 System Survival ...... 77
6.4 Application Survival ...... 82
6.5 Testing Assumptions ...... 86
6.6 Performance ...... 89
6.7 Summary ...... 96

Chapter 7: The Dynamic Driver Update Mechanism 98
7.1 Introduction ...... 98
7.2 Related Work ...... 100
7.3 Design and Implementation ...... 102
7.4 Evaluation of Dynamic Driver Update ...... 112
7.5 Summary ...... 119

Chapter 8: Lessons 121
8.1 Introduction ...... 121
8.2 Driver Interface ...... 121
8.3 Driver Organization ...... 123
8.4 Memory Management ...... 124
8.5 Source Code Availability ...... 126
8.6 Summary ...... 127

Chapter 9: Future Work 129
9.1 Introduction ...... 129
9.2 Static Analysis of Drivers ...... 129
9.3 Hybrid Static/Run-Time Checking ...... 130
9.4 Driver Design ...... 131
9.5 Driverless Operating Systems ...... 133
9.6 Summary ...... 134

Chapter 10: Conclusions 135

Bibliography 137

LIST OF FIGURES

Figure Number Page

3.1 A sample device driver...... 28

3.2 Examples of driver state machines...... 30

4.1 The architecture of Nooks...... 34
4.2 The Nooks Reliability Layer inside the Linux OS...... 38
4.3 Protection of the kernel address space...... 39
4.4 Trampoline wrappers for invoking multiple implementations of an interface function. 42
4.5 Control flow of driver and kernel wrappers...... 43
4.6 Code sharing among wrappers ...... 46

5.1 A sample shadow driver operating in passive mode...... 60
5.2 A sample shadow driver operating in active mode...... 60
5.3 The Linux operating system with several device drivers and the shadow driver recovery subsystem...... 62
5.4 The state machine transitions for a sound-card shadow driver ...... 63

6.1 The reduction in system crashes observed using Nooks...... 79
6.2 The reduction in non-fatal extension failures observed using Nooks...... 80
6.3 Results of fault-injection experiments on Linux-SD...... 88
6.4 Comparative times spent in kernel mode for the Compile-local (VFAT) benchmark. 94

7.1 A shadow driver capturing the old driver's state...... 105
7.2 Device driver and shadow after an update...... 110
7.3 Application performance on Linux-DDU relative to Linux-Native ...... 118

7.4 Absolute CPU utilization on Linux-DDU and Linux-Native...... 119

LIST OF TABLES

Table Number Page

2.1 Comparison of reliability techniques...... 27

4.1 The number of non-comment lines of source code in Nooks...... 38

5.1 The proxying actions of the shadow sound-card driver...... 69
5.2 The size and quantity of shadows and the drivers they shadow...... 70

6.1 The Linux extensions tested...... 75
6.2 The types of faults injected into extensions...... 76
6.3 The applications used for evaluating shadow drivers...... 83
6.4 The observed behavior of applications following the failure of a device driver. . . . 84
6.5 The relative performance of Nooks compared to native Linux...... 90

7.1 The dynamic driver update process...... 104
7.2 The number of lines of code in dynamic driver update...... 111
7.3 The three classes of shadow drivers and the Linux drivers tested...... 113
7.4 The results of dynamically updating drivers...... 114
7.5 The results of compatibility tests...... 116
7.6 The applications used for evaluating dynamic driver update performance...... 117

ACKNOWLEDGMENTS

Thank you all! I would like to thank my advisors, Hank Levy and Brian Bershad, who encouraged me in this endeavor and taught me how to do research. When I entered graduate school, I thought I knew everything. From Hank and Brian I learned patience, as research is not an overnight process. I learned the value of a good question. I learned the value of a good story. And most importantly, I learned the value of a clear presentation, because ideas are useless if nobody understands them.

I would also like to thank all the students who participated in the Nooks project. Steve Martin provided early testing. Doug Buxton and Steve Martin implemented our wrapper generator. Leo Shum implemented sound-card driver support. Muthu Annamalai implemented IDE support. Christophe Augier performed the fault injection testing for the SOSP paper and greatly enhanced Nooks' recovery ability through a scripting language. Damien Martin-Guillerez improved the wrapper code and implemented dynamic driver update. Brian Milnes performed performance testing and optimization, using super-pages to reduce the number of TLB misses. Micah Brodsky and Eric Kochhar implemented a reliability subsystem in the Windows kernel and provided great feedback on the assumptions of my system. I would also like to thank Rob Short at Microsoft, who provided valuable information about Windows failures and the problems being faced in its development.

Finally, I would like to thank my parents, Alice and Arthur Swift, who allowed me to choose any career I wanted but are secretly happy that I've followed in their footsteps.

DEDICATION

To my family: Suzanne, Jacob and Lily, who encouraged me in my passions, bravely weathered the resulting absences, and followed me to a land of snow and ice.


Chapter 1

INTRODUCTION

This dissertation presents an architecture and mechanism for improving the reliability of commodity operating systems by enabling them to tolerate component failures. Improving reliability is one of the greatest challenges for commodity operating systems. System failures are commonplace and costly across all domains: in the home, in the server room, and in embedded systems, where the existence of the OS itself is invisible. At the low end, failures lead to user frustration and lost sales. At the high end, an hour of downtime from a system failure can lead to losses in the millions [77].

1.1 Motivation

Three factors motivated my research. First, computer system reliability remains a crucial but unsolved problem [86, 164]. This problem has been exacerbated by the adoption of commodity operating systems, designed for best-effort operation, in environments that require high availability [187, 20, 4, 56]. While the cost of high-performance computing continues to drop because of commoditization, the cost of failures (e.g., downtime on a stock exchange or e-commerce server, or the manpower required to service a help-desk request in an office environment) continues to rise as our dependence on computers grows. In addition, the growing sector of “unmanaged” systems, such as digital appliances and consumer devices based on commodity hardware and software [102, 194, 180, 184, 179], amplifies the need for reliability.

Second, OS extensions have become increasingly prevalent in commodity systems such as Linux (where they are called modules [31]) and Windows (where they are called drivers [55]). Extensions are optional components that reside in the kernel address space and typically communicate with the kernel through published interfaces. Common extensions include device drivers, file systems, virus detectors, and network protocols. Drivers now account for over 70% of Linux kernel code [45], and over 35,000 different drivers with over 112,000 versions exist on Windows XP desktops [151, 160]. Unfortunately, most of the programmers writing drivers work for independent hardware vendors and have significantly less experience in kernel organization and programming than the programmers who build the operating system itself.

Third, device drivers are a leading cause of operating system failure. In Windows XP, for example, drivers cause 85% of reported failures [160]. In Linux, the frequency of coding errors is up to seven times higher for device drivers than for the rest of the kernel [45]. While the core operating system kernel can reach high levels of reliability because of longevity and repeated testing, the extended operating system cannot be tested completely. With tens of thousands of drivers, operating system vendors cannot even identify them all, let alone test all possible combinations used in the marketplace. In contemporary systems, any fault in a driver can corrupt vital kernel data, causing the system to crash.

1.2 Thesis and Contributions

My thesis is that operating system reliability should not be limited by extension reliability. An operating system should never crash due to a driver failure. Therefore, improving OS reliability will require systems to become highly tolerant of failures in drivers and other kernel extensions. Furthermore, the hundreds of millions of existing systems executing tens of thousands of drivers demand a reliability solution that is both backward compatible and efficient for common device drivers. Backward compatibility improves the reliability of already deployed systems. Efficiency avoids the classic trade-off between robustness and performance.

To this end, I built a new operating system subsystem, called Nooks, that allows existing device drivers and other extensions to execute safely in commodity kernels. Nooks acts as a layer between extensions and the kernel and provides two key services: isolation and recovery. This thesis describes the architecture, implementation, and performance of Nooks.

Nooks allows the operating system to tolerate driver failures by isolating the OS from device drivers. With Nooks, a bug in a driver cannot corrupt or otherwise harm the operating system. Nooks contains driver failures with a new isolation mechanism, called a lightweight kernel protection domain, that is a privileged kernel-mode environment with restricted write access to kernel memory. Nooks interposes on all kernel-driver communication to track and validate all modifications to kernel data structures performed by the driver, thereby trapping bugs as they occur and facilitating subsequent recovery. Unlike previous safe-extension mechanisms, Nooks supports existing, unmodified drivers and requires few changes to existing operating systems.

When a driver failure occurs, Nooks detects the failure with a combination of hardware and software checks and triggers automatic recovery. Nooks restores driver operation by unloading and then reloading the driver. This ensures that the driver's services are available to applications after recovery. With Nooks' recovery mechanism, the kernel and applications that do not use a failed driver continue to execute properly, because they do not depend on the driver's service. Applications that depend on the failed driver, however, cannot: the driver loses application state when it restarts, causing applications to receive erroneous results. Most applications are unprepared to cope with this. Rather, they reflect the conventional failure model: drivers and the operating system either fail together or not at all.

To remedy this situation, I introduce a new kernel agent, called a shadow driver, that conceals a driver's failure from its clients while recovering from the failure. During normal operation, the shadow tracks the state of the real driver by monitoring all communication between the kernel and the driver. When a failure occurs, the shadow inserts itself temporarily in place of the failed driver, servicing requests on its behalf. While shielding the kernel and applications from the failure, the shadow driver restarts the failed driver and restores it to a state where it can resume processing requests as if it had never failed.
The two services provided by Nooks and shadow drivers, transparent isolation and recovery, provide a platform for constructing additional reliability services. I built a third service on top of these components that enables device drivers to be updated on-line without rebooting the operating system. This new mechanism, called dynamic driver update, relies upon shadow drivers both to conceal the update and to transfer the state of the old driver to the new driver automatically, ensuring that driver requests will continue to execute correctly after the update. Dynamic driver update improves availability by removing the lengthy reboot currently required to apply critical driver patches.

My focus on reliability is not new. The last forty years have produced a substantial amount of research on improving reliability for kernel extensions through new kernel architectures [51, 212, 71, 97], new driver architectures [169, 83], user-level extensions [82, 132, 212], new hardware [74, 209], and type-safe languages [24, 87, 110].

While previous systems have used many of the underlying techniques in Nooks, my system differs from earlier efforts in three key ways. First, I target an existing set of extensions, device drivers, for existing commodity operating systems instead of proposing a new extension architecture. I want today's drivers to execute on today's platforms without change if possible. Second, I use C, a conventional programming language. I do not ask developers to change languages, development environments, or, most importantly, perspective. Third, unlike earlier safe-extension systems that were designed only to isolate extensions, Nooks also recovers by restarting drivers and concealing failures from applications. Nooks does not require modifications to applications to tolerate driver failures. Overall, I focus on a single and very serious problem, reducing the huge number of crashes resulting from drivers and other extensions, instead of a more general-purpose but higher-cost solution.

I implemented a prototype of Nooks in the Linux operating system and experimented with a variety of device drivers and other kernel extensions, including a file system and a kernel-mode Web server. Using synthetic fault injection [108], I show that when I inject bugs into running extensions, Nooks can gracefully recover and restart the extension in 99% of the cases that cause Linux to crash. Recovery occurs quickly, as compared to a full system reboot, and is unnoticed by running applications in 98% of tests. The performance overhead is low to moderate, depending on the frequency of communication between the driver and the kernel. I also show that dynamic driver update allows transparent, on-line upgrades between a wide range of driver versions. Finally, of the 14 drivers I isolated with Nooks, none required code changes. Of the two kernel extensions, one required no changes, and the other required changes to only 13 lines.

Although I built the prototype on the Linux kernel, I expect that the architecture and many implementation features would port readily to other commodity operating systems, such as BSD, Solaris, or Windows.

1.3 Dissertation Organization

In this dissertation, I describe the architecture and implementation of Nooks and evaluate its effectiveness and performance cost. Chapter 2 provides background material on reliability, failure models, and the causes of operating system failure. I also discuss previous work on operating system reliability, including techniques for avoiding bugs in OS code and for tolerating faults at run-time. As this dissertation concerns device drivers, Chapter 3 presents background material on driver architectures and operations. Chapter 4 describes the Nooks Reliability Layer, which prevents drivers from corrupting the OS by isolating drivers from the operating system kernel. After a failure, Nooks recovers by unloading and reloading the failed driver. I first describe the architecture of Nooks at a high level and then discuss the details of the Nooks implementation in Linux. Chapter 5 describes shadow drivers, a replacement recovery mechanism for Nooks that conceals the failure from applications and the OS. This allows applications that depend on a driver's service to execute correctly in the presence of its failure. Again, I provide a high-level view of shadow driver operation and then discuss the implementation details. Following the chapters on Nooks and shadow drivers, Chapter 6 analyzes the effectiveness of Nooks and shadow drivers at (1) preventing drivers from crashing the OS and (2) recovering from driver failures. I also analyze the performance cost of using Nooks with common drivers and applications. Chapter 7 describes dynamic driver update, a mechanism based on shadow drivers to update device drivers on-line. I present an evaluation showing that dynamic update works for most common driver updates and imposes a minimal performance overhead. In Chapter 8 I discuss the lessons learned in implementing Nooks. I evaluate the impact of operating system architecture on the Nooks architecture by comparing the implementation on Linux to a partial implementation on Windows. I discuss opportunities for future research in Chapter 9. Finally, in Chapter 10 I summarize and offer conclusions.

Chapter 2

BACKGROUND AND RELATED WORK

The goal of my work is to improve the reliability of commodity operating systems, such as Linux and Windows. Reliability has been a goal of computer systems designers since the first days, when hardware failures were orders of magnitude more common than they are today [204, 63]. As hardware reliability has improved, software has become the dominant source of system failures [91, 150]. My work therefore focuses on improving operating system reliability in the presence of software failures.

In this chapter, I present background material on reliable operating systems. First, I define the terms used to discuss reliable operation. I then discuss software failures and their manifestations. Finally, I describe previous approaches to building reliable systems and their application to commodity operating systems.

2.1 Definitions

Reliability is defined as the expected time of correct execution before a failure and is measured as the mean time to failure (MTTF) [93]. Availability is the proportion of time that the system is available to do useful work. Availability is calculated from the MTTF and the duration of outages (the mean time to repair, or MTTR) as MTTF/(MTTF+MTTR) [93]. My work seeks to improve reliability by reducing the frequency of operating system failures, and to improve availability by reducing the time to repair, or recover from, a failure.

The events that cause a system to be unreliable can also be precisely defined [93, 10]. A fault is a flaw in the software or hardware. When this fault is executed or activated, it may corrupt the system state, leading to an error. If the error causes the system to behave incorrectly, a failure occurs. An error is termed latent until it causes a failure, at which point it becomes effective.

As an example, a bug in a device driver that corrupts memory is a fault. When it executes and actually corrupts memory, the resulting corruption is an error. Finally, a failure occurs when the kernel reads the corrupt memory and crashes the system.
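The availability formula can be made concrete with a quick calculation. The numbers below are hypothetical, not taken from the dissertation; they illustrate why reducing MTTR (fast recovery instead of a reboot) raises availability even when MTTF is unchanged.

```python
# Availability = MTTF / (MTTF + MTTR), per the definition above.
def availability(mttf_hours, mttr_hours):
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical system: a failure every 1000 hours, 1 hour to reboot
# and repair gives roughly 99.9% availability ("three nines")...
a_reboot = availability(1000, 1.0)

# ...while recovering the failed component in about 4 seconds, with the
# same failure rate, pushes availability past 99.999% ("five nines").
a_recover = availability(1000, 4 / 3600)
```

Note that the same improvement could not be achieved by tweaking MTTF alone without making failures a thousand times rarer.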

2.2 System Models

There exist a wide variety of operating systems that target different environments. For example, the QNX operating system targets real-time services on embedded devices [104]. Unicos targets high-performance applications on supercomputers [29], and the Tandem Guardian operating system targets applications with extreme reliability needs, such as stock exchanges [128]. These special-purpose operating systems are optimized for the unique properties of their operating environment and workload, for example by trading peak performance for predictable scheduling.

In contrast, commodity operating systems, such as Linux [31] and Windows [173], are sold in high volumes and used in a wide variety of environments. For example, in addition to use on desktops and in data centers, Linux has been used on wristwatches [156] and in planetary exploration [4]. As a result, commodity operating systems cannot be over-optimized for a single target environment. Instead, they are designed with a variety of competing goals, such as cost, size, peak performance, features, and time-to-market. Furthermore, the huge installed base of these operating systems forces compatibility with existing code to be a priority. As a result, any attempt to improve the reliability of commodity operating systems must take into account the wide variety of demands placed on the system.

2.3 Failure Models

Failures occur when a fault is activated and causes an effective error. The circumstances of when and how a fault causes a failure, and the failure's behavior, determine what a system must do to tolerate the failure. As a result, designers make many assumptions about how failures occur in order to build a cost-effective system that prioritizes frequent and easy-to-handle failures. For example, most systems are built with the assumption that memory will not become corrupted silently. As a result, these systems avoid the cost of error checking every memory operation. If silent memory corruption occurs, the system may cease functioning. The same system, though, may also assume that a network can corrupt data in transmission and therefore checksum packet contents to detect corruption. Thus, assumptions about which failures occur and their frequency drive system design in terms of what mechanisms are used to detect and tolerate failures.

A system's assumptions about failures are called its failure model. This model specifies any dependency that a system has on the properties of failures. In the context of software reliability, failure models commonly specify two major properties of failures: how the system behaves when it fails and how failures occur. How the system behaves determines what mechanisms are required to detect failures. How failures occur determines whether a failure can be masked from other system components.

A failure model may also incorporate assumptions about the system's response to a component failure. Rarely do entire systems fail at once. Generally, a single component fails, which, if not handled, leads to a system failure. To simplify failure handling, a system designer may assume that failures are contained to a single component and detected before any other component has been corrupted. In contrast to the previous assumptions, this is an assumption about the system rather than the failure, and it must therefore be enforced by the implementation.

2.3.1 How Failures Behave

The failure behavior of a component determines how a system contains and detects failures. For example, a system can detect component crashes with the processor's exception mechanism. To contain a component that fails by producing incorrect outputs, the system must check all outputs for correctness before other components observe the output and become corrupted. In contrast, if a component fails by stopping, then no work may be needed to contain its failure. We can categorize failures by their behavior in order to detect and tolerate different behaviors with different mechanisms. Previous studies have identified five major categories of failure behavior [54]:

1. Timing failures occur when a component violates timing constraints.

2. Output or response failures occur when a component outputs an incorrect value.

3. Omission failures occur when a component fails to produce an expected output.

4. Crash failures occur when the component stops producing any outputs.

5. Byzantine or arbitrary failures occur when any other behavior, including malicious behavior, occurs.

Each failure category requires its own detection and containment mechanisms. For example, omission failures require that the system predict the outputs of a component to detect when one is missing, while timing failures require timers that detect when an output is delayed.

For simplicity, reliable systems commonly assume that only a subset of these behaviors occurs and remove the other failure modes from consideration. For example, Backdoors [27] assumes that only timing and crash failures occur. Fortunately, many failures are of these types, although the more general arbitrary failure has been studied in the context of some Byzantine-tolerant research operating systems [177, 37].

2.3.2 How Failures Occur

How a fault is activated and causes a failure determines what strategies are possible for tolerating the failure. A fault that is activated by a single, regularly occurring event is called a deterministic failure, or a bohrbug [91]. Any time that event occurs, the failure occurs. As a result, a recovery strategy for deterministic failures must ensure that the triggering event does not occur again during recovery. If the triggering event is a specific service request, that request must be canceled, because it can never complete correctly.

In contrast, a fault that is activated by a combination of events that occur irregularly is called a transient failure, or a heisenbug. These failures are triggered by a combination of inputs, such as the exact interleaving of requests and other properties of the system environment, that do not recur often [91]. As a result, the recovery strategy need not prevent the failure from recurring; it is probabilistically unlikely to happen again. Requests being processed at the time of failure can be retried automatically, which conceals the failure from the component's clients. Empirical evidence suggests that many failures are transient [91], and as a result many fault-tolerant systems ignore deterministic failures. For example, the Phoenix project [15], the TARGON/32 fault-tolerant Unix system [28], and Shah et al.'s work on fault-tolerant databases [183] all assume that all failures are transient.
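The recovery distinction above can be sketched as a simple bounded-retry policy. This is a hypothetical illustration, not code from any system named in this chapter: a retry conceals a heisenbug because the triggering conditions are unlikely to recur, while a bohrbug fails on every attempt and the request must ultimately be canceled.

```python
def submit_with_retry(request, max_attempts=3):
    """Retry a failed request; cancel it if every attempt fails."""
    last = None
    for _ in range(max_attempts):
        try:
            return request()               # success conceals the failure
        except RuntimeError as err:
            last = err                     # transient? try again
    raise RuntimeError(f"canceled after {max_attempts} attempts") from last

# A heisenbug: fails once (e.g., an unlucky interleaving), then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient fault")
    return "ok"

result = submit_with_retry(flaky)   # client never sees the first failure

# A bohrbug: the same triggering input always fails, so retrying
# cannot help and submit_with_retry(broken) raises to its caller.
def broken():
    raise RuntimeError("deterministic fault")
```

The design choice embodied here matches the text: systems that assume all failures are transient get recovery almost for free, at the cost of looping uselessly on deterministic faults.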

2.3.3 System Response to Failure

When a component failure occurs, the system may translate the failure from one category to another. For example, Windows XP crashes when it detects that a device driver passes a bad parameter to a kernel function. Thus, the OS translates an output failure of a component into a crash failure of the system. Failure translation allows higher levels of software to process a single category of failure rather than manage the variety of component failures that may occur.

Many fault-tolerant systems take this approach with the fail-stop processor model [175], whereby the system halts a component when it detects a failure. Once halted, the component cannot spread further corruption. A system must meet three conditions to be fail-stop:

1. Halt on failure: the processor halts before performing an erroneous state transformation.

2. Failure status: the failure of a processor can be detected.

3. Stable storage: the processor state can be separated into volatile storage, which is lost after a failure, and stable storage, which is preserved, unaffected by the failure.

The benefit of the fail-stop model is that it greatly simplifies construction of a reliable system: it is simple to distinguish between correct and failed states, and memory is similarly partitioned into correct and corrupted storage. Thus, the system and its components are only concerned with these two states and not with other failure modes. Many fault-tolerant systems, such as Phoenix [15], assume that failures are fail-stop. Others, such as Hive [39], incorporate failure detectors that translate output failures into fail-stop failures.
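The three conditions can be sketched with a toy component wrapper (the class and method names are illustrative, not from any system discussed here):

```python
import copy

class FailStopComponent:
    """Illustrative sketch of the three fail-stop conditions.  Each step runs
    on a scratch copy of the volatile state, so a failing step commits no
    erroneous state change (halt on failure); `failed` makes the failure
    observable (failure status); and `stable` survives the halt while
    `volatile` is lost (stable storage)."""

    def __init__(self):
        self.failed = False
        self.volatile = {}
        self.stable = {}

    def run(self, step):
        if self.failed:
            raise RuntimeError("component has halted")
        scratch = copy.deepcopy(self.volatile)
        try:
            step(scratch)                 # step mutates only the scratch copy
        except Exception:
            self.failed = True            # halt permanently at the first error
            self.volatile.clear()
            raise RuntimeError("component has halted")
        self.volatile = scratch           # commit only after success
        self.stable.update(scratch)

comp = FailStopComponent()
comp.run(lambda state: state.update(x=1))
try:
    comp.run(lambda state: state.update(y=1 // 0))   # an erroneous step
except RuntimeError:
    pass
```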

The drawback of the fail-stop model is that it is difficult to achieve in practice and does not adequately capture all the properties of the system, such as intermittent failures [8]. Furthermore, without strong isolation that contains faults to a single component and good failure detectors, it can be difficult to ensure fail-stop behavior. For example, a study by Chandra and Chen shows that application failures are often detected after corrupt data is written to disk [38], violating the fail-stop conditions. Additional error checking internal to the application may be able to reduce the frequency of the violations.

2.3.4 Summary

How failures occur and how failures behave determine the possible strategies for tolerating failures. If failures are transient, then a recovery strategy may retry operations in progress at the time of failure. In contrast, if failures are deterministic, care must be taken to avoid repeating the triggering event. Similarly, if failures have easily detected behavior, such as crashes, or are translated into fail-stop failures, then the system can avoid managing complex error states.

2.4 Properties of OS Failures

Within the OS, many failures can be traced to operating system extensions. For example, recent data on the Windows NT family of operating systems from Microsoft implicates operating system extensions as the major source of failures, particularly as the core operating system itself becomes more reliable [150]. According to this study, the core operating system in Windows NT 4 was responsible for 43% of system failures, while drivers and other kernel extensions were responsible for 32%. In Windows 2000, the follow-on to NT 4, the core operating system was responsible for only 2% of failures, while drivers grew to 38% (the remaining failures were caused by faulty hardware and system configuration errors). More recently, drivers caused over 85% of Windows XP crashes [160]. Hardware failures remain a significant, but not dominant, source of system crashes: for Windows Server 2003, 10% of failures are due to hardware problems [140]. However, approximately half of the crashes from hardware faults manifest as driver failures. These driver failures could be prevented with a “hardened” driver that is able to mask the hardware fault [9].

While there are no published crash reports for the Linux operating system, an automated study of the Linux 2.4 kernel source code showed similar trends. Device drivers, which comprised 70% of the source code, had between two and seven times the frequency of coding errors (errors per line of code) as other kernel code [45]. Thus, drivers are likely a major source of failures in Linux as well.

Many studies of operating system failures have shown that these failures are often transient, i.e., they do not recur when the same operation is repeated [91, 92]. One reason is that kernel code is highly concurrent and may run simultaneously on multiple threads or processors. As a result, the exact ordering of events changes frequently. Other frequent changes in the kernel environment,

such as the availability of memory and the arrival of interrupts, can also cause transient failures. In addition, the dominant manifestation of faults in operating systems is memory corruption [191, 101]. Thus, changes in memory-allocation patterns may also cause failures to be transient.

2.5 Previous Approaches to Building Highly Reliable Systems

Due to the need for reliable systems, many approaches have been tried in the past. At a high level, past systems have improved reliability with two general techniques: fault avoidance and fault tolerance. Fault avoidance seeks to avoid executing faulty code, whereas fault tolerance seeks to continue executing after a fault is activated.

2.5.1 Fault Avoidance

The goal of fault avoidance is to improve reliability by executing only correct code. The strategy can be applied at all points during the lifecycle of software: during the design and coding process, when it is called fault prevention; after code is written, when it is called fault removal; or at run time, when it is called fault work-arounds.

Fault Prevention

The goal of fault prevention is to ensure that faults never make it into the source code. There are two major approaches to fault prevention: software engineering techniques and programming languages. Software engineering specifies how to write software. For example, modular software structure reduces faults by clarifying the interfaces between components [162, 163]. More recently, aspect-oriented software development (AOSD) systems promise to reduce software faults in operating systems by avoiding common cut-and-paste errors [47]. Software engineering processes, such as RTCA DO178B for aviation software [111] and the Capability Maturity Model’s Level 5 [165], address not only coding but also planning, documentation, and testing practices. These processes seek to turn the craft of programming into a rigorous engineering discipline. Programming languages prevent faults by preventing compilation of faulty code. Type-safe programming languages and run-time systems used for operating systems, such as Modula 2+ [141], Modula 3 [141, 24], Java [87], Cyclone [114], C# [110], Vault [58], and Ivy [34], prevent unsafe

memory accesses and thereby avoid the major cause of failures. In addition to checking memory accesses, languages can provide compile-time checking of other properties. For example, the Vault language provides an extended type system that enforces ordering and data-flow constraints, such as the order of operations on a network socket [58]. To date, however, OS suppliers have been unwilling to implement system code in type-safe, high-level languages. Moreover, the type-safe language approach makes it impossible to leverage the enormous existing code base. Recent approaches that add type safety to C, such as CCured, remove the cost of porting but increase run-time costs substantially because of the need to track the types of pointers [50].

Other language-based approaches seek to improve reliability by automating error-prone manual tasks. Aspect-oriented software development (AOSD) automates the application of changes that cut across a large body of code [118] and has been applied to operating system kernels [47]. Tarantula applies AOSD to automate changes to the kernel-driver interface [127], while c4 uses aspect-oriented techniques to create a new programming model for cross-cutting kernel patches [80]. Compared to other language-based techniques, these approaches have the benefit that they are fully compatible with existing code in that they apply changes to code written in C.

Domain-specific languages can similarly improve reliability by simplifying complex and error-prone code. The Bossa project, for example, provides a language and run-time system for writing Linux schedulers, removing much of the complexity in current schedulers [30]. The Devil project simplifies writing driver code that interfaces with devices [142]. In most operating systems, the device interface is specified in a document describing the state machine of the device and the organization of registers exported by the device for communication.
Based on this description, the driver writer must then implement code in C or another low-level language that interacts with the device. With Devil, a device vendor specifies the device-software interface in a domain-specific language. The Devil compiler then uses that specification to generate an API (i.e., C-language stubs) for the device. Driver writers call these functions to access the device. Devil prevents many of the bugs associated with drivers by abstracting away the complexities of communicating through I/O ports and memory-mapped device registers. This approach requires additional work by device manufacturers to produce the specification, but reduces the effort of driver programmers. However, it requires nearly complete rewriting of existing drivers, and hence is only applicable to new drivers. Devil complements the other language techniques, in that it addresses the hardware interface instead of

the system interface to drivers.

While fault prevention is important to the writing of new code, it is of limited use for the millions of lines of existing code in operating systems. Fault removal is a more appropriate technique for this code.

Fault Removal

In contrast to fault prevention, which seeks to prevent faults from being written, fault removal seeks to detect faults in existing code. There are both process-oriented approaches to fault removal, such as code reviews [76], and technology approaches, such as static analysis. While programming practices are important and can have a huge impact on program quality [165], technology for automatically finding bugs is of more interest given the huge base of existing operating system code. For example, Microsoft Windows Server 2003 has over 50 million lines of code [193].

There are two major approaches to analyzing source code for faults: theorem proving and model checking [208]. Theorem proving works by converting the program code into logic statements that are then used to prove that the program never enters an illegal state. In contrast, model checking enumerates the program states along possible execution paths to search for an illegal state. Theorem proving generally scales better, because it manipulates formulas rather than program states. However, theorem provers often require human input to guide their execution towards a proof [208]. For this reason, model checkers have proven more popular as tools for a general audience [126].

Model checkers evaluate an abstraction of the code along all paths to ensure that it meets certain invariants, e.g., that memory that has been released is never accessed again. The output of a model checker may be: (1) an example path through the code that violates an invariant, (2) termination with the guarantee that the invariants are never violated, or (3) continued execution as the model checker works on ever-longer paths. To prove correctness, a model checker must check every possible state of a program, which is exponential in program size. Hence, model checkers have difficulty scaling because they must exhaustively enumerate the state space of a program. However, model checking can find bugs (but not prove correctness) by checking a much smaller number of states [211]. For example, Microsoft’s PREfix and PREfast tools, which create a simplified abstraction of the source code to speed checking, do not prove that code is correct but are nonetheless useful for finding

common programming errors [126].
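The enumeration strategy can be illustrated with a toy checker (a sketch of explicit-state model checking in general, not of how PREfix or any other tool cited here is implemented): breadth-first search over an abstract state space, checking a use-after-free invariant and returning the shortest counterexample path.

```python
from collections import deque

def model_check(initial_states, successors, invariant):
    """Breadth-first enumeration of reachable states.  Returns the shortest
    path to a state violating the invariant (a counterexample), or None if
    every reachable state satisfies it.  The visited set makes the search
    exhaustive -- and is why the state space explodes with program size."""
    frontier = deque([state] for state in initial_states)
    visited = set(initial_states)
    while frontier:
        path = frontier.popleft()
        state = path[-1]
        if not invariant(state):
            return path                       # outcome (1): counterexample
        for next_state in successors(state):
            if next_state not in visited:
                visited.add(next_state)
                frontier.append(path + [next_state])
    return None                               # outcome (2): invariant holds

# Toy abstraction of a buffer's lifecycle, with a use-after-free transition.
TRANSITIONS = {"alloc": ["use", "free"], "use": ["use", "free"],
               "free": ["use_after_free"], "use_after_free": []}

counterexample = model_check(["alloc"], lambda s: TRANSITIONS[s],
                             lambda s: s != "use_after_free")
```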

Traditionally, model checking could not be applied to program code because of the complexity of evaluating the code directly. Instead, a programmer had to translate the code into a specification language such as Promela [67, 106], which is then checked. Recent language designs, such as ESP [124] and Spec# [18], allow model checking of executable code. The benefit of this approach is that no separate specification of the code is needed; instead the code itself is directly checked. The drawback is that it requires writing code in a new language and hence is not compatible with existing code or programming practices.

Several recent systems automatically compute a model from the existing source code, removing the need for new languages. The SLAM project from Microsoft creates binary programs from device drivers that can be checked [14]. Both SLAM and the similar Blast project [100] improve performance by only refining the models with additional detail when necessary. In addition, recent work has shown that it is possible to directly model check C code by executing it in a virtual machine [153, 152]. The virtual machine creates snapshots of each program state. However, model checking by direct execution is often slow, and it is difficult to isolate the state of the component being tested from the rest of the system [153]. As a result, this technique is not yet widely used.

Recently, model checking and theorem proving tools have been applied to existing operating system code [68, 13, 100]. The output of these tools is a list of sites in or paths through the code that may be faulty. However, finding bugs does not necessarily translate directly into more reliable software. First, even though a small fraction of faults cause the majority of failures [1], tools are unable to determine statically whether a fault is a significant cause of failure [70]. Second, a programmer must still review the faults and manually correct the code; current tools cannot fix the code themselves [70]. Microsoft, for example, runs static analysis tools regularly and forces developers to fix the bugs they find [154]. However, Microsoft cannot force third-party developers to run the tools over extension code. Coverity runs similar tools over the Linux source tree and posts bug reports to the Linux Kernel Mailing List [53]. Again, Coverity only examines the code checked into the Linux kernel source tree. Many kernel extensions are maintained separately and released as patches to the kernel, and hence do not benefit from the analysis [138]. As a result, these tools often help the base operating system become more reliable but have less impact on extension reliability.

Fault Work-Arounds

When faulty code has been deployed and the resulting failures cannot be tolerated, fault work-arounds are the remaining possibility for attaining high reliability. Work-arounds prevent unsafe inputs from reaching a system. For example, a firewall is a fault work-around technique that prevents malicious requests from arriving at a network service. In general, it is impossible to predict in advance the exact inputs that must be filtered to avoid failures [199, 134]. Thus, work-arounds are an ad-hoc technique for preventing failures with known causes when the software cannot be fixed. Given the wide variety of environments and inputs to commodity operating systems, this approach is therefore impractical as a general-purpose strategy for improving OS reliability.

Obtaining Specifications

A common problem for all fault-avoidance tools is the definition of correctness. Most software is written without a formal specification of its behavior. Indeed, writing a formal specification is prohibitively expensive: although the space shuttle control software contains only 425,000 lines of code, the specifications constitute 40,000 pages [79]. Therefore, many static analysis tools verify partial specifications, e.g., proving that invariants are not violated instead of complete correctness. The Static Driver Verifier from Microsoft [14], the Vault programming language for device drivers [58], and the CMC tool for checking network protocols and file systems [153, 152, 211] use this approach.

An existing body of code can be another source of specifications. If there are many implementations of an interface, then the code that deviates in its usage of the interface is likely to be incorrect. For example, if 99% of the implementers of an interface lock a parameter before accessing it, the remaining 1% of implementations that do not are probably faulty. This approach has been applied to the Linux kernel to identify both extension interface rules and violations of those rules [69].

A third approach to obtaining specifications for system software is to write them by hand. Formally verified operating system code [67, 198] and high-assurance systems [26] take this approach. However, only subsystems or small operating systems have been verified to date. Given the size of commodity operating systems, it is impractical to write a complete specification by hand.
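The majority-usage idea can be sketched as a simple frequency count (illustrative only; the driver names are hypothetical and the actual analysis applied to the Linux kernel is far more sophisticated):

```python
from collections import Counter

def infer_deviants(follows_pattern, threshold=0.9):
    """Treat a near-unanimous usage pattern as an implicit specification.

    `follows_pattern` maps each implementation to whether it follows a
    candidate rule (e.g. "lock the parameter before accessing it").  If at
    least `threshold` of implementations agree, the minority are flagged as
    probable bugs; otherwise no rule is inferred.
    """
    counts = Counter(follows_pattern.values())
    if counts[True] / len(follows_pattern) >= threshold:
        return [name for name, ok in follows_pattern.items() if not ok]
    return []

# 99 hypothetical drivers lock before access; one does not.
usage = {f"driver{i}": True for i in range(99)}
usage["driver99"] = False
suspects = infer_deviants(usage)
```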

Fault Avoidance Summary

Fault avoidance improves reliability by not executing faults. The choice between prevention, removal, and work-arounds depends on a project’s stage of development. Fault prevention applies to systems in the design stage, when the system is being architected and implemented. Fault removal applies to existing code bases. Fault work-arounds apply to deployed systems and provide a stopgap solution until the system can be patched.

2.5.2 Fault tolerance

The previous section described methods for avoiding bugs by preventing faulty code from being executed. In contrast, fault tolerance seeks to continue correct execution in the presence of faulty code. The core technique for tolerating failures is redundancy, so that a single failed computation does not terminate execution. The redundancy may come physically, by performing the computation on multiple processors; temporally, by re-executing the computation at a later time; or logically, by executing on a different copy of the data. A system may also use multiple implementations of a component to increase the probability that some computation will complete successfully. The core assumption behind redundancy is failure independence: the different implementations or executions will not all fail simultaneously.

At the program design level, there are two major categories of fault-tolerant designs: multi-version and single-version [195]. In multi-version designs, code components are implemented multiple times, providing design diversity and lessening the probability that all versions fail on the same inputs. When one version does fail, the other versions should (hopefully) continue to execute correctly. Single-version designs, in contrast, execute only one version of the code and instead rely on additional code to detect and recover from failures.

Three major multi-version approaches have been developed to provide both diversity and redundancy: recovery blocks [170], N-version programming [41], and N self-checking [125]. While the details differ, all three techniques require that programmers implement multiple versions of a component and a mechanism, either a checker or a voter, to select the correct output. Even with diversity, though, experimental evidence suggests that the different versions of a component often

share failure modes, because the difficulty of producing the correct output varies widely across the space of inputs. Thus, different implementations are likely to fail on the same difficult inputs [121].

Implementing multiple versions of a program is time-consuming and expensive. The BASE technique reduces this cost by allowing off-the-shelf implementations of standard network services, such as a file server, to be configured into an N-version system [171]. An abstraction layer smoothes small differences between implementations, such as the granularity of timestamps and message ordering. This approach is limited to applications with well-specified behavior and clean, documented interfaces. As a result, it is not applicable to whole operating systems, which meet neither condition.

Due to the cost of multi-version approaches, most fault-tolerant systems opt for single-version redundancy and forgo the benefits of design diversity. Single-version fault tolerance depends on error detection, to notify the system that an error has occurred, and recovery, to transform the incorrect system state into a correct system state [10]. In addition, fault-tolerant systems generally rely on isolation or containment to limit the scope of damage from a fault and to prevent errors from becoming failures [59]. In the next sections I further discuss isolation, error detection, and recovery as they apply to operating systems.

Isolation

Isolation, which confines errors to a single component, has long been recognized as the key to fault tolerance because it allows components to fail independently [59, 91]. The coarsest granularity for isolation is the entire computer. Many commercial fault-tolerant systems, such as Stratus Integrity systems [113] and Tandem NonStop systems [19, 22], execute each program on multiple processors. If one processor fails, the others continue uncorrupted. There have been many research systems that take the same approach [28, 11, 33, 177]. The primary benefit of isolating at the machine level is tolerance of hardware failures: when one machine fails, another can continue to provide service. The drawback, however, is the substantial cost and complexity of purchasing and managing multiple computers.

Virtual machine technologies [39, 43, 190, 207, 17] are an alternative solution to isolating unreliable code. They promise the software reliability benefits of multiple physical computers but without the associated costs. In addition, virtual machines reduce the amount of code that can crash the

whole machine by running the operating system in user mode and only the virtual machine monitor in privileged mode. Virtualization techniques typically run several entire operating systems on top of a virtual machine, so faulty extensions in one operating system cause only that OS instance and its applications to fail. While applications can be partitioned among virtual machines to limit the scope of failure, doing so removes the benefits of sharing within an operating system, such as fast IPC and intelligent scheduling. However, if the extension executes inside the virtual machine monitor, such as device drivers for physical devices, a driver fault causes all virtual machines and their applications to fail. Executing extensions in their own virtual machines solves this problem [83, 130, 72], but raises compatibility problems. All communication between applications and drivers must then transit an IPC channel. If the contents of driver requests are unknown, they cannot be passed automatically between virtual machines.

Within a single operating system, user-mode processes provide isolation for unreliable code [186]. Traditionally, operating systems isolate themselves from applications by executing applications in a separate user-mode address space. Several systems have moved a subset of device drivers into user-mode processes to simplify development and improve reliability [109, 117, 146]. Microkernels [210, 132, 212] and their derivatives [71, 81, 96] extend the process abstraction to apply to other kernel extensions as well. Extensions execute in separate user-mode address spaces that interact with the OS through a kernel communication service, such as remote procedure call [23]. Therefore, the failure of an extension within an address space does not necessarily crash the system.
Despite much research in fast inter-process communication (IPC) [23, 132], the reliance on separate address spaces raises performance concerns that have prevented adoption in commodity systems. An attempt to execute drivers in user mode in the Fluke microkernel doubled the cost of device use [201]. In addition, there is the high price of porting or reimplementing required to move extensions into user mode. Microkernel/monolithic hybrids, such as [97], provide a transition path by executing a monolithic kernel on top of a microkernel.

Protecting individual pages within an address space provides fine-grained isolation and efficient sharing, because data structures need not be marshaled to be passed between protection domains. Page-level protection has been used in single-address-space operating systems [40] and to isolate specific components or data from corruption, e.g., in a database [192] or in the file system

cache [157].

There have been several major hardware approaches to providing sub-page memory isolation. Capability-based architectures [107, 159, 131] and ring and segment architectures [112, 174] both enable fine-grained protection. With this hardware, the OS is extended not by linking code into the kernel but by adding new privileged subsystems that exist in new domains or segments. Several projects also provide mechanisms for new extensions, written to a new programming model, to execute safely with segment-based protection [44, 185] on x86 hardware. In practice, segmented architectures have often been difficult to program and plagued by poor performance [48]. These problems have been addressed recently with the Mondrian protection model that uses segments only for protection and not for addressing [209].

To remove the reliance on new (and unavailable) or existing (and slow) hardware memory protection, software fault isolation (SFI) rewrites binary executables to isolate extension code [205]. The rewriting tool inserts checks before all memory accesses and branches to verify that the target address is valid. The system detects a failure when the extension references memory or branches outside safe regions. While SFI provides isolation without hardware support, it requires contiguous memory regions for efficient address checking, making it inappropriate for existing extensions that access widely dispersed regions of the kernel address space.
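The inserted check amounts to a bounds test on every access. A simulation sketch follows (the real technique rewrites machine code; this hypothetical `Sandbox` class only models the effect of the inserted checks):

```python
class Sandbox:
    """Models software fault isolation: the extension's loads and stores are
    checked against its contiguous region before being performed, so a wild
    pointer becomes a detected failure instead of silent corruption.  The
    contiguity of [base, base + size) is what makes the check cheap."""

    def __init__(self, base, size):
        self.base, self.limit = base, base + size
        self.memory = bytearray(2 ** 16)   # the whole simulated address space

    def store(self, addr, value):
        # The check an SFI rewriter would insert before each memory access.
        if not (self.base <= addr < self.limit):
            raise MemoryError(f"access outside sandbox: {addr:#x}")
        self.memory[addr] = value

sandbox = Sandbox(base=0x1000, size=0x100)
sandbox.store(0x1010, 42)        # inside the region: permitted
try:
    sandbox.store(0x2000, 7)     # outside: detected, extension is failed
    escaped = True
except MemoryError:
    escaped = False
```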

Despite decades of research on isolation, commodity operating systems still rely on user-mode processes as their only isolation mechanism. However, processes have proved too expensive to use in many situations, and reliability suffers as a result. Too often, new isolation mechanisms designed to reduce this cost do not consider recovery or backward compatibility, and hence have found little use. As a result, there is still a need for lightweight isolation mechanisms for containing accidental software failures.

Failure detection

Failure detection is a critical piece of reliable execution because it triggers recovery. Prompt failure detection allows the construction of simpler programming models, such as fail-stop processors, from complex failure modes. The difficulty of detecting failures depends on the fault model. For example, timing failures can be detected with a simple timeout mechanism, while detecting output failures

may require complete knowledge of the expected output of a component.

Commonly, operating system kernels are implemented with the assumption that faults in kernel code have been avoided and need not be tolerated. Failure-detection code instead targets user-mode processes and external devices rather than kernel components. For example, system calls check their parameters for correctness to ensure that a bad parameter does not crash or compromise the kernel [186]. Network protocols rely on timeouts to detect dropped packets [168]. Some kernels, including Linux [31], also do limited exception handling in kernel mode to continue execution in the presence of non-fatal processor exceptions.

Operating systems that target high reliability add additional checks to ensure that the kernel itself is functioning. For example, the kernel may monitor progress of its components via counters and sensors [27, 202]. In a distributed system, a remote machine or independent hardware with access to the kernel’s address space may do the monitoring [27], allowing remote detection of kernel failures at low latency. Failures can also be detected probabilistically based on observed deviations from past behavior [42].

When commodity software is used for applications that need high reliability, wrappers interposed between the commodity component and the high-reliability system can provide error detection. For example, Mafalda places wrappers around a commodity microkernel to both check the correctness of the microkernel’s actions and recover after a failure [73]. The Healers system detects bad parameters passed to standard library routines to convert crash failures (which may occur when the library operates on bad inputs) into fail-stop failures [78]. Other work has focused on determining where to place wrappers to limit the scope of corruption [115]. One difficulty with wrappers is detecting invalid outputs.
Generally, unless the correct output is known (via a redundant computation), only a subset of invalid outputs can be detected. In the context of device drivers, which interact with agents outside the computer, it is impossible to detect whether data was corrupted as it passed from the device through the driver.

Failure detection accuracy fundamentally limits the availability of a system. When failures are not detected promptly, they persist longer and lower availability [42]. However, simplistic failure detectors, such as exception handlers and timers that are simple to implement and inexpensive to execute, may detect sufficiently many failures to improve the reliability of commodity operating systems.
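A sketch of such a simplistic detector (illustrative code; the injected clock simply makes the example deterministic): a component is suspected as failed once it stops reporting progress within a timeout. This catches crash and timing failures, but an output failure whose corrupt result arrives on time passes undetected.

```python
import time

class HeartbeatDetector:
    """Timeout-based failure detection: the monitored component calls
    heartbeat() as it makes progress; the monitor suspects a failure once
    `timeout` seconds pass without one."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_beat = clock()

    def heartbeat(self):
        self.last_beat = self.clock()

    def suspected(self):
        return self.clock() - self.last_beat > self.timeout

# Drive the detector with a fake clock so the outcome is deterministic.
now = [0.0]
detector = HeartbeatDetector(timeout=1.0, clock=lambda: now[0])
detector.heartbeat()
now[0] = 0.5
alive_at_half_second = not detector.suspected()
now[0] = 2.0
suspected_at_two_seconds = detector.suspected()
```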

Recovery

The goal of recovery is to restore the system to correct operation after a failure. The database community has long recognized the importance of recovery and relies on transactions to prevent data corruption and to allow applications to manage failure [90]. Recently, the need for failure recovery has moved from specialized applications to the more general arena of commodity systems and applications [164].

We can classify recovery strategies by the state of the system after recovery and by their transparency to applications and system components. A recovery strategy may move the state of a system backwards, to a correct state that existed before the failure, or forwards, to a new state [10]. Furthermore, recovery strategies can either be concealing, in that they hide failure and recovery from applications, or revealing, in that they notify applications of a failure and depend on them to participate in the recovery process.

A general approach to recovery is to run application or operating system replicas on two machines, a primary and a backup. All inputs to the primary are mirrored to the backup. After a failure of the primary, the backup machine takes over to provide service. The replication can be performed by the hardware [113], at the hardware-software interface [33], at the system call interface [11, 28, 32], or at a message passing or application interface [19]. For example, process pairs execute each application on two machines. Before processing a request, the primary forwards the request to the backup. If the primary fails to complete the request, the backup is then responsible for recovering, possibly by retrying the failed request [19]. A problem with primary-backup systems is that extensive failure detection is still required. When the backup and primary processes produce different results, it is impossible to tell which is correct. Triple-modular redundancy (TMR) systems execute three copies of the code to avoid the need for extra failure detection and recovery code [139]. If an output from one replica differs from the other two, it is assumed to be incorrect and that replica is marked as failed. Thus, no extra checks or recovery code are required, as there are still two correctly functioning processes [22].
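The voting logic is small enough to sketch directly (illustrative Python; the three “replicas” here are deliberately different functions so that one can be faulty, which makes the sketch closer to an N-version voter than to true TMR over identical copies):

```python
from collections import Counter

def tmr_vote(replicas, request):
    """Run one request on three replicas and return the majority output.
    A replica that disagrees with the majority is marked failed; no
    separate failure detector or recovery code is needed, because two
    correctly functioning replicas remain."""
    outputs = [replica(request) for replica in replicas]
    winner, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all replicas disagree")
    failed = [i for i, out in enumerate(outputs) if out != winner]
    return winner, failed

# Three hypothetical implementations of one computation; the third is buggy.
replicas = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x]
result, failed_replicas = tmr_vote(replicas, 5)
```

Note that on the input 2 all three replicas agree (2 * 2 == 2 ** 2 == 2 + 2), a small instance of the shared "difficult inputs" problem mentioned earlier.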

Primary-backup systems generally take a backwards recovery approach. The backup process begins executing at a point before the error occurred, or, if no error occurred on the backup, at the exact point of failure. These systems can either conceal or reveal errors from applications.

The Tandem NonStop Advanced system, for example, can either expose failures and subsequent recovery to applications using transactions [90] or transparently conceal failure and let the backup applications execute without knowledge of the failure [22].

On a single machine, transactions provide a revealing mechanism for backwards recovery. Transactions allow the state of a system to be rolled back to a point before the failure occurred [90]. Upon detecting a failure, applications explicitly abort transactions, which rolls back the system state. While most commonly used in databases and other applications with persistent state, Vino [182] and QuickSilver [176] applied transactions within the OS to improve reliability. In some cases, such as the file system, the approach worked well, while in others it proved awkward and slow [176]. The revealing strategy of transactions, though, makes them difficult to integrate into existing systems, which are unprepared for the independent failure of system components.

Another, simpler, revealing strategy is to restart a failed component and, if necessary, its clients. This strategy, termed “micro-reboots,” has been proposed as a general strategy for building high-availability software because it reuses initialization code to recover from a failure rather than relying on separate recovery code [36]. In these systems, the burden of recovery rests on the clients of the restarted component, which must decide what steps to take to continue executing. In the context of operating system component failures, few existing applications do this, since these failures typically crash the system [36]. Applications that do not participate in recovery may share the fate of the failed component and must be restarted as well.

A common approach to concealing failures is to checkpoint system state periodically and roll back to a checkpoint when a failure occurs [167]. This is a backwards recovery strategy, because the system returns to a previous state.
Logging is often used to record state changes between checkpoints [16, 136, 149, 11, 28, 172]. These systems conceal errors from applications by transparently restarting applications from the checkpoint after a failure and then replaying logged requests. Recent work has shown that this approach is limited when recovering from application faults: applications often become corrupted before they fail; hence, their checkpoints and persistent storage may also be corrupted [38, 135].

The fundamental recovery problem for commodity operating systems is keeping applications running. Revealing strategies are unworkable, because they depend on application participation to recover completely. Similarly, forward recovery is also inappropriate, because the state of the system advances beyond what applications may have expected. As a result, improving the reliability of commodity operating systems demands a concealing backwards recovery mechanism.
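The concealing checkpoint-and-log pattern can be made concrete with a small user-space model. The names below are mine, not from any of the cited systems: state is checkpointed periodically, inputs since the checkpoint are logged, and after a failure the state is rolled back and the log replayed so that clients never observe the failure.

```c
/* Illustrative model of concealing backwards recovery via checkpoint
 * and log replay (invented names; a sketch, not a real system). */

struct app_state { int counter; };

#define REPLAY_LOG_MAX 64

struct recoverable {
    struct app_state state;            /* live state */
    struct app_state checkpoint;       /* last consistent snapshot */
    int log[REPLAY_LOG_MAX];           /* inputs since last checkpoint */
    int nlog;
};

static void apply(struct app_state *s, int input) { s->counter += input; }

static void take_checkpoint(struct recoverable *r)
{
    r->checkpoint = r->state;
    r->nlog = 0;                       /* log restarts at the checkpoint */
}

static void handle_input(struct recoverable *r, int input)
{
    if (r->nlog < REPLAY_LOG_MAX)
        r->log[r->nlog++] = input;     /* record before applying */
    apply(&r->state, input);
}

/* Backwards recovery: roll back to the checkpoint, then replay the
 * logged inputs so no request is lost. */
static void recover(struct recoverable *r)
{
    r->state = r->checkpoint;
    for (int i = 0; i < r->nlog; i++)
        apply(&r->state, r->log[i]);
}
```

Note that this sketch also exhibits the limitation cited above: if an input in the log is itself the source of corruption, replay faithfully reproduces the corruption.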

Fault Tolerance Summary

Tolerating faults requires isolating faulty code, detecting failures, and recovering from a failure. Unlike highly reliable systems, today’s commodity operating systems are built with the assumption that the kernel and its extensions cannot fail, and as a result lack all three elements. The challenge for existing operating systems is to find efficient and inexpensive mechanisms to provide these elements while maintaining compatibility with existing code.

2.5.3 Summary

Systems can be made reliable by either avoiding faults and the subsequent failures or by tolerating failures. The choice of strategy depends largely on the goals and properties of the system. Often, a hybrid approach is the most appropriate, for example preventing most failures with compile-time checking and tolerating the remaining ones with isolation and recovery techniques.

In safety critical applications, the code must produce the correct result. As a result, fault-tolerant strategies relying on retrying or executing multiple versions of a component are not sufficient if none of the versions produce a correct output. Thus, applications demanding extreme reliability, such as flight-control systems, rely on programming languages, verification tools, and strict engineering practices to avoid faults during the development process [111, 165, 26].

An important limitation of fault-avoidance techniques is that they can only verify what can be specified. For operating system extensions, this has been the interface between the kernel and the extension. Thus, static analysis tools for drivers only address the correctness of software interfaces and not the interface to the device hardware. For example, a driver may obey the kernel interface but pass bad commands to the device, causing the device to perform the wrong action. Hence, static analysis tools can replace or augment some fault-tolerance techniques, but cannot obviate the need for fault-tolerant systems as a whole.

2.6 Fault Tolerance in Nooks

My work aims to improve the reliability of commodity operating systems with as little mechanism as possible. I therefore focus on the dominant cause of failures, faulty device drivers. As these drivers are not under the control of any single entity, I take a fault-tolerance approach that allows operating systems and applications to continue executing when drivers fail. The paramount goal is transparency to drivers and applications, neither of which should be modified.

2.6.1 Nooks’ Failure Model

In designing Nooks, I made several assumptions about driver failures. First, I assume that most driver failures are transient. Thus, a reasonable recovery strategy is to reload the failed driver and retry requests in progress at the time of failure, because the failure is unlikely to recur. With this mechanism, it is possible to conceal recovery, because all requests are eventually processed by the driver. Based on anecdotal evidence, I believe this to be a reasonable assumption. However, if driver failures are deterministic, concealing the recovery process from applications may not be possible, because requests that trigger the failure can never be successfully completed. Recovery is still possible, but applications must participate by handling failed requests.
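The transient-failure assumption can be illustrated with a small user-space model (the names are mine, not Nooks code): a failed request is retried after the driver is reloaded, and recovery is concealed exactly when the fault does not recur; a deterministic fault exhausts the retry budget and must be revealed.

```c
#include <stdbool.h>

/* Illustrative model of recovery under the transient-failure
 * assumption (invented names; a sketch, not the Nooks sources). */

struct model_driver {
    bool loaded;
    int  faults_remaining;   /* transient faults left to inject */
};

static bool driver_request(struct model_driver *d)
{
    if (!d->loaded)
        return false;
    if (d->faults_remaining > 0) {   /* injected failure */
        d->faults_remaining--;
        d->loaded = false;           /* fail-stop: driver stops running */
        return false;
    }
    return true;                     /* request completed */
}

static void driver_reload(struct model_driver *d)
{
    d->loaded = true;                /* reuse the initialization path */
}

/* Concealing recovery: retry until the request completes or the retry
 * budget is exhausted (a deterministic fault would recur every time). */
static bool submit_with_recovery(struct model_driver *d, int max_retries)
{
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        if (driver_request(d))
            return true;
        driver_reload(d);
    }
    return false;                    /* must be revealed to the caller */
}
```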

Second, I assume that most driver failures are crash failures, output failures, or timing failures. I further assume that these failures are fail-stop and can be detected and stopped before the kernel or other drivers are corrupted. I assume that drivers are not malicious and do not suffer from arbitrary failures. Thus, Nooks includes services to detect driver failures and prevent a failed driver from corrupting other system components. The ability of Nooks to make driver failures fail-stop depends both on the quality of isolation and on the state maintained by the device. For example, if a driver corrupts persistent device state before the failure is stopped, then the stable-storage property is violated and Nooks cannot recover. As a result, Nooks must detect failures before persistent device state is corrupted. If this is not possible, then additional recovery strategies, beyond Nooks’ current abilities, may be required to repair corrupt state as part of recovery.

2.6.2 Nooks’ Design

Nooks uses lightweight kernel protection domains to isolate drivers. These domains are in the kernel address space, allowing fast sharing of read-only data, and rely on page-level protection to support efficient protected access to non-contiguous data regions. I show that this is sufficient to isolate unmodified device drivers and other kernel extensions.

Nooks detects driver failures with simple mechanisms. The kernel exception handlers notify Nooks when a driver generates an exception, such as a protection violation fault. Nooks places wrappers around drivers to detect certain bad outputs, such as pointers to invalid addresses, and relies on timers to detect timing failures when a driver is not making progress. Finally, Nooks detects an output failure when a driver consumes too many kernel resources.

Shadow drivers, my recovery mechanism for Nooks, provide transparent backwards recovery by concealing failure from applications while resetting the driver to its state before the failure. Shadow drivers are similar to primary-backup systems in that they replicate all communication between the kernel and device driver (the primary), sending copies to the shadow driver (the backup). If the driver fails, the shadow takes over temporarily until the driver recovers. However, shadows differ from typical replication schemes in several ways. First, because my goal is to tolerate only driver failures, not hardware failures, both the shadow and the “real” driver run on the same machine. Second, and more importantly, the shadow is not a replica of the device driver: it implements only the services needed to manage recovery of the failed driver and to shield applications from the recovery. For this reason, the shadow can be much simpler than the driver it shadows.

Shadow drivers log and replay requests similarly to checkpoint/replay recovery systems. However, shadow drivers rely on the shared interface to drivers to log only those inputs that impact driver behavior.
Compared to checkpointing the complete driver state or logging all inputs, the shadow driver approach reduces the amount of state in the log. Hence, the likelihood of storing corrupt state, which is possible with full logging or checkpointing, is also reduced.
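The selective-logging idea can be sketched in user space. All names below are invented for illustration and do not come from the Nooks sources: the shadow taps every kernel-to-driver request, records only state-changing inputs (open, configure), discards data-path requests (send), and replays the log after the driver restarts.

```c
/* Illustrative model of a shadow driver's log (invented names; a
 * sketch of the selective-logging idea, not the Nooks code). */

enum req_kind { REQ_OPEN, REQ_CONFIG, REQ_SEND, REQ_CLOSE };

struct logged_req {
    enum req_kind kind;
    int           arg;
};

#define LOG_MAX 16

struct shadow {
    struct logged_req log[LOG_MAX];
    int               nlogged;
    int               replayed;   /* requests re-issued during recovery */
};

/* Called on every kernel->driver request the shadow observes. */
static void shadow_observe(struct shadow *s, enum req_kind kind, int arg)
{
    if (kind == REQ_SEND)          /* data request: no state change, skip */
        return;
    if (kind == REQ_CLOSE) {       /* close cancels the whole session */
        s->nlogged = 0;
        return;
    }
    if (s->nlogged < LOG_MAX) {    /* record open/configure requests */
        s->log[s->nlogged].kind = kind;
        s->log[s->nlogged].arg  = arg;
        s->nlogged++;
    }
}

/* After the failed driver is restarted, replay the state-changing
 * requests to restore its pre-failure configuration. */
static void shadow_recover(struct shadow *s)
{
    s->replayed = 0;
    for (int i = 0; i < s->nlogged; i++)
        s->replayed++;             /* stands in for re-issuing log[i] */
}
```

Because sends are never logged, the log stays small regardless of traffic volume, which is the source of the reduced exposure to corrupt state noted above.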

2.7 Summary

Table 2.1 lists techniques for improving operating system reliability, the approach they take, and the changes required to use the approach. Nooks is the only fault-tolerance technique that provides

Table 2.1: Comparison of reliability techniques.

Technique                     Provides                                 Approach    Required Changes
Software engineering          N/A                                      avoidance   OS, extensions, programmers
Languages                     Isolation / error detection              avoidance   OS, extensions, programmers
Static analysis               Isolation                                avoidance   none
Redundant hardware            Isolation / error detection / recovery   tolerance   HW, OS
Virtual machines              Isolation                                tolerance   none
OS Structure                  Isolation                                tolerance   OS, extensions
Capabilities / ring-segment   Isolation                                tolerance   HW, OS, extensions
Mondrian Memory               Isolation                                tolerance   HW, OS
Checkpoint/Log                Recovery                                 tolerance   none
Transactions                  Isolation / recovery                     tolerance   OS, extensions, programmers, applications
Micro-reboots                 Recovery                                 tolerance   OS, applications
Wrappers                      Isolation / error detection              tolerance   none
Nooks                         Isolation / error detection / recovery   tolerance   none

isolation, failure detection, and recovery without requiring changes to hardware, extensions, applications, or programming practices. The OS requires only minor changes for integrating the new code. Nooks relies on a conventional processor architecture, a conventional programming language, a conventional operating system architecture, and existing extensions. It is designed to be transparent to the extensions themselves, to support recovery, and to impose only a modest performance penalty.

[Figure 3.1: A sample device driver. The OS kernel invokes the sound-card device driver through the kernel interface and the sound driver class interface; the driver in turn controls the sound card.]

Chapter 3

DEVICE DRIVER OVERVIEW

A device driver is a kernel-mode software component that provides an interface between the OS and a hardware device.1 The driver converts requests from the kernel into requests to the hardware. Drivers rely on two interfaces: the interface that drivers export to the kernel, which provides access to the device, and the kernel interface that drivers import from the operating system. For example, Figure 3.1 shows the kernel calling into a sound-card driver to play a tone; in response, the sound driver converts the request into a sequence of I/O instructions that direct the sound card to emit a sound.

In addition to processing I/O requests, drivers also handle configuration requests. Configuration requests can change both driver and device behavior for future I/O requests. As examples, applications may configure the bandwidth of a network card or the volume of a sound card.

In practice, most device drivers are members of a class, which is defined by its interface. Code that can invoke one driver in the class can invoke any driver in the class. For example, all network

1 This thesis uses the terms “device driver” and “driver” interchangeably.

drivers obey the same kernel-driver interface, and all sound-card drivers obey the same kernel-driver interface, so no new kernel or application code is needed to invoke new drivers in these classes. This class orientation allows the OS and applications to be device-independent, as the details of a specific device are hidden from view in the driver. In Linux, there are approximately 20 common classes of drivers. However, not all drivers fit into classes; a driver may extend the interface for a class with proprietary functions, in effect creating a new sub-class of drivers. Drivers may also define their own semantics for standard interface functions, known only to applications written specifically for the driver. In this case, the driver is in a class by itself. In practice, most common drivers, such as network, sound, and storage drivers, implement only the standard interfaces.

Device drivers are either request-oriented or connection-oriented. Request-oriented drivers, such as network drivers and block storage drivers, maintain a single hardware configuration and process each request independently. In contrast, connection-oriented drivers maintain separate hardware and software configurations for each user of the device. Furthermore, requests on a single connection may depend on past requests that changed the connection configuration.

We can model device drivers as abstract state machines; each input to the driver from the kernel or output from the driver reflects a potential state change in the driver. For example, the left side of Figure 3.2 shows a state machine for a network driver as it sends packets. The driver begins in state S0, before the driver has been loaded. Once the driver is loaded and initialized, the driver enters state S1. When the driver receives a request to send packets, it enters state S2, where there is a packet outstanding. When the driver notifies the kernel that the send is complete, it returns to state S1.
The right side of Figure 3.2 shows a similar state machine for a sound-card driver. This driver may be opened, configured between multiple states, and closed.

The state-machine model aids in designing and understanding a recovery process that seeks to restore the driver state by clarifying the state to which the driver is recovering. For example, a mechanism that unloads a driver after a failure returns the driver to state S0, while one that also reloads the driver returns it to state S1.
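The network-driver machine described above translates directly into code. The sketch below encodes its states and transitions as a C transition function; the enum and function names are mine, chosen for illustration.

```c
/* Illustrative encoding of the network-driver state machine: states
 * S0-S2 and transitions init, send, and complete, as in the text.
 * (Names are invented; this is a model, not driver code.) */

enum drv_state { S0_UNLOADED, S1_READY, S2_SENDING };

enum drv_event { EV_INIT, EV_SEND, EV_COMPLETE, EV_UNLOAD };

/* Returns the next state, or the current state for an invalid event. */
static enum drv_state drv_step(enum drv_state s, enum drv_event e)
{
    switch (s) {
    case S0_UNLOADED:
        return (e == EV_INIT) ? S1_READY : s;
    case S1_READY:
        return (e == EV_SEND)   ? S2_SENDING
             : (e == EV_UNLOAD) ? S0_UNLOADED : s;
    case S2_SENDING:
        return (e == EV_COMPLETE) ? S1_READY : s;
    }
    return s;
}
```

In these terms, a recovery mechanism that merely unloads a failed driver forces the machine back to S0_UNLOADED, while one that also reloads it re-runs the EV_INIT transition and leaves the driver in S1_READY.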

[Figure 3.2: Examples of driver state machines. Left: a network driver with states S0 (unloaded), S1 (initialized), and S2 (send outstanding), and transitions init, send, and complete. Right: a sound-card driver with states S0 through S3 and transitions init, open, close, ioctl-1, ioctl-2, read, and write.]

Chapter 4

THE NOOKS RELIABILITY LAYER

4.1 Introduction

Device drivers are a major cause of crashes in commodity operating systems, such as Windows and Linux, because a fault in a device driver can cause the entire OS to crash. While the core operating system can be tested broadly, drivers are used by a smaller population and are frequently updated as new hardware is introduced. Thus, drivers do not achieve the same level of reliability as core operating system code. Given the vast number of existing device drivers (over 35,000 for Windows XP), it is not practical to remove bugs from driver code. Instead, the driver reliability problem demands a structural solution that severs the dependency of operating systems on driver correctness, so that a driver fault can no longer bring the system down.

My work seeks to improve operating system reliability by allowing the OS to safely execute faulty device drivers. Thus, even when a driver fails, the system as a whole remains functional. To this end, I designed and implemented a system called Nooks that isolates device drivers and then recovers when they fail. Isolation ensures that failed drivers are contained and do not cause other drivers or the OS itself to fail. Recovery ensures that the system is restored to a functioning state after a failure. I designed Nooks to be practical, in that it supports existing drivers and operating systems with minimal performance overhead. Nooks is based on two core design principles:

1. Tolerate with best effort. The system must prevent and recover from most, but not necessarily all, driver failures.

2. Tolerate mistakes, not abuse. Drivers are generally well behaved but may fail due to errors in design or implementation.

From the first principle, I am not seeking a complete solution for all possible driver failures. However, since drivers cause the majority of system failures, tolerating most driver failures will substantially improve system reliability.

From the second principle, I have chosen to occupy the design space somewhere between “unprotected” and “safe.” That is, the driver architecture for conventional operating systems (such as Linux or Windows) is unprotected: nearly any bug within the driver can corrupt or crash the rest of the system. In contrast, safe systems (such as SPIN [24] or [88]) strictly limit extension behavior and thus make no distinction between buggy and malicious code. I trust drivers not to be malicious, but I do not trust them not to be buggy.

The practical impact of these principles is substantial, both positively and negatively. On the positive side, it allows me to define an architecture that directly supports existing driver code with only moderate performance costs. On the negative side, my solution does not detect or recover from 100% of all possible failures and can be circumvented easily by malicious code acting within the kernel. As examples, consider a malfunctioning driver that does not corrupt kernel data, but returns a packet that is one byte short, or a malicious driver that explicitly corrupts the system page table.

Similarly, Nooks cannot safely execute 100% of drivers, because it cannot always distinguish between correct and faulty behavior. For example, a driver that dereferences and writes to a hard-coded address containing a kernel data structure may be correct, or it may be accidentally corrupting the kernel. Nooks cannot tell which is the case. On both dimensions, detecting faulty code and executing correct code, I strive to maximize Nooks’ abilities while minimizing development and run-time cost.

Among failures that can crash the system, a spectrum of possible defensive approaches exists. These range from the Windows approach (i.e., to preemptively crash to avoid data corruption) to the full virtual machine approach (i.e., to virtualize the entire architecture and provide total isolation).
My approach lies in the middle. Like all possible approaches, it reflects trade-offs among performance, compatibility, complexity, and completeness. Section 4.3.7 describes Nooks’ current limitations. Some are a result of the current hardware or software implementation, while others are architectural. These limitations have two effects: the inability to tolerate certain kinds of faulty behavior and the inability to safely execute certain kinds of correct driver code. Despite these limitations, given the tens of thousands of existing drivers, and the millions of failures they cause, a best-effort solution has great practical value.

At a high level, Nooks is a layer between the operating system kernel and drivers. As shown in Figure 4.1, the layer is composed of four core services and a controller. In the remainder of this

chapter, I discuss the architecture and implementation of the Nooks reliability layer.

4.2 Nooks Architecture

The Nooks architecture seeks to achieve three major goals:

1. Isolation. The architecture must isolate the kernel from driver failures. Consequently, it must detect these failures before they infect other parts of the kernel.

2. Recovery. The architecture must enable automatic recovery to permit applications that depend on a failed driver to run after a failure.

3. Backward Compatibility. The architecture must apply to existing systems and existing drivers, with minimal changes to either.

Achieving all three goals in an existing operating system is challenging. In particular, the need for backward compatibility rules out certain otherwise appealing technologies, such as type-safe languages and capability-based hardware. Furthermore, backward compatibility implies that the performance of a system using Nooks should not be significantly worse than a system without it.

I achieve the preceding goals by creating a new operating system reliability layer that sits between drivers and the OS kernel. The reliability layer intercepts all interactions between drivers and the kernel to provide isolation and recovery. A crucial property of this layer is transparency, i.e., to be backward compatible, it must be largely invisible to existing components. Furthermore, while designed for device drivers, it also applies to other kernel extensions, such as network protocols or file systems.

Figure 4.1 shows this new layer, which I call the Nooks Reliability Layer (NRL). Above the NRL is the operating system kernel. While the NRL is pictured as a layer, it may require a small set of modifications to kernel code to be integrated into a particular OS. These modifications need only be made once. Underneath the NRL is the set of isolated drivers. In general, no modifications should be required to drivers, since transparency for existing drivers is my major objective.

To support the goals outlined above, the Nooks Reliability Layer provides five major architectural functions: isolation, interposition, object tracking, recovery, and control. Isolation ensures that faulty drivers do not corrupt the kernel. Interposition provides backward compatibility for existing

[Figure 4.1: The architecture of Nooks. Applications run above the kernel; the Nooks Reliability Layer, comprising isolation, interposition, object tracking, recovery, and control components, sits between the kernel and the isolated drivers, each of which controls its device.]

drivers by injecting code at the existing kernel-driver interface. Object tracking supports recovery by reclaiming those kernel objects in use by a driver after a failure. Recovery ensures that the system keeps functioning after a driver failure, and control manages the entire layer. I now describe these functions in more detail.

4.2.1 Isolation

Nooks’ isolation mechanisms prevent driver failures from damaging the kernel (or other isolated drivers). Every driver in Nooks executes within its own lightweight kernel protection domain. This domain is an execution context with the same processor privilege as the kernel but with write access to only a limited portion of the kernel’s address space. The major task of the isolation mechanism is protection-domain management. This involves the creation, manipulation, and maintenance of lightweight protection domains. The secondary task is inter-domain control transfer. Isolation services support the control flow in both directions between driver domains and the kernel domain. Unlike system calls, which are always initiated by an application, the kernel frequently calls into drivers. These calls may generate callbacks into the kernel, which may then generate a call

into the driver, and so on. To handle this complex communication, I created a new kernel service, called Extension Procedure Call (XPC), that is a control transfer mechanism specifically tailored to isolating extensions within the kernel. This mechanism resembles Lightweight Remote Procedure Call (LRPC) [23] and Protected Procedure Call (PPC) in capability systems [60]. However, LRPC and PPC handle control and data transfer between mutually distrustful peers. XPC occurs between trusted domains but is asymmetric (i.e., the kernel has more rights to the driver’s domain than vice versa).
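The asymmetric XPC transfer can be sketched as a user-space model. The names are hypothetical and the domain switch is simulated with a variable; the real mechanism saves the caller's context, switches stacks and page tables, and runs entirely inside the kernel.

```c
/* Illustrative user-space model of an XPC-style control transfer
 * (invented names; a sketch, not the Nooks implementation). */

#define KERNEL_DOMAIN 0

/* Stands in for the hardware page-table base register: which
 * protection domain's mappings are currently active. */
static int current_domain = KERNEL_DOMAIN;

typedef int (*xpc_fn)(int arg);

/* Kernel -> driver transfer: save the caller's context, enter the
 * target domain, run the function, and restore the caller on return. */
static int nooks_driver_call_model(xpc_fn fn, int arg, int target_domain)
{
    int saved = current_domain;       /* save caller context */
    current_domain = target_domain;   /* stands in for page-table switch */
    int ret = fn(arg);                /* execute in the driver's domain */
    current_domain = saved;           /* reverse operations on return */
    return ret;
}

/* A "driver" entry point that records which domain it ran in. */
static int observed_domain = -1;
static int fake_driver_send(int arg)
{
    observed_domain = current_domain;
    return arg + 1;                   /* pretend to process the request */
}
```

The asymmetry described above is visible even in this toy: the kernel-side routine chooses and enters the driver's domain, while the driver merely executes wherever it is placed.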

4.2.2 Interposition

Nooks’ interposition mechanisms transparently integrate existing drivers into the Nooks environment. Interposition code ensures that: (1) all driver-to-kernel and kernel-to-driver control flow occurs through the XPC mechanism, and (2) all data transfer between the kernel and driver is managed by Nooks’ object-tracking code (described below).

The interface between the driver, the reliability layer, and the kernel is provided by a set of wrapper stubs that are part of the interposition mechanism. Nooks’ stubs provide transparent control and data transfer between the kernel domain and driver domains. Thus, from the driver’s viewpoint, the stubs appear to be the kernel’s extension API. From the kernel’s point of view, the stubs appear to be the driver’s function entry points. Wrappers resemble the stubs in RPC systems [25] that provide transparent control and data transfer across address space boundaries.

4.2.3 Object Tracking

Nooks’ object-tracking functions oversee all kernel resources used by drivers. In particular, object-tracking code: (1) maintains a list of kernel data structures a driver may manipulate, (2) controls all modifications to those structures, and (3) provides object information for cleanup when a driver fails.

Protection domains prevent drivers from directly modifying kernel data structures. Therefore, object-tracking code must copy kernel objects into a driver domain so they can be modified and then copy them back after changes have been applied. When possible, object-tracking code verifies the type and accessibility of each parameter that passes between the driver and kernel. Kernel routines can then avoid scrutinizing parameters, because wrappers execute checks on behalf of unreliable

drivers.
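The copy-in/copy-out discipline can be sketched as follows. This is a user-space model with invented names, not the Nooks implementation: a kernel object is copied into the driver domain before a call, written back afterwards, and simply discarded if the driver fails mid-call.

```c
#include <stddef.h>

/* Illustrative model of object tracking (invented names; a sketch). */

struct kobj { int field; };

struct tracked {
    struct kobj *kernel_obj;   /* authoritative kernel copy */
    struct kobj  driver_copy;  /* writable copy in the driver domain */
    int          in_use;
};

#define TRACK_MAX 8
static struct tracked track_table[TRACK_MAX];

/* Before a call into the driver: copy the object into the driver
 * domain and remember the association for later cleanup. */
static struct kobj *track_copy_in(struct kobj *ko)
{
    for (int i = 0; i < TRACK_MAX; i++) {
        if (!track_table[i].in_use) {
            track_table[i].in_use = 1;
            track_table[i].kernel_obj = ko;
            track_table[i].driver_copy = *ko;
            return &track_table[i].driver_copy;
        }
    }
    return NULL;               /* table full */
}

/* After the call returns: apply the driver's changes to the kernel
 * object and release the tracking slot. */
static void track_copy_out(struct kobj *dcopy)
{
    for (int i = 0; i < TRACK_MAX; i++) {
        if (track_table[i].in_use && &track_table[i].driver_copy == dcopy) {
            *track_table[i].kernel_obj = track_table[i].driver_copy;
            track_table[i].in_use = 0;
            return;
        }
    }
}

/* On driver failure: reclaim all copies without writing them back,
 * so a corrupt driver copy never reaches the kernel object. */
static void track_reclaim_all(void)
{
    for (int i = 0; i < TRACK_MAX; i++)
        track_table[i].in_use = 0;
}
```

The key property is that the driver's writes are invisible to the kernel until copy-out, which is exactly what lets a recovery manager discard them wholesale after a failure.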

4.2.4 Recovery

Nooks’ recovery functions detect and initiate recovery from a variety of driver failures. Nooks detects a software failure when a driver invokes a kernel service improperly (e.g., with invalid arguments) or when a driver consumes too many resources. Nooks detects a hardware failure when the processor raises an exception during driver execution, e.g., when a driver attempts to read unmapped memory or to write memory outside of its protection domain. A user or a program may also detect faulty behavior and trigger recovery explicitly.

Unmodified drivers cannot handle their own failure, so Nooks triggers a recovery manager to restore the system to a functioning state. Recovery managers, which may be specific to a driver, provide an extensible mechanism for restoring system operation following a failure. Recovery managers are assisted by the isolation and object-tracking services. These services provide functions for cleanly releasing resources in use by a driver at the time of failure.

4.2.5 Control

Nooks’ control functions are invoked to change the state of a protection domain. Nooks supports four major control operations: create a protection domain, attach a driver to a protection domain when the driver is loaded, notify a protection domain of a failure, and destroy a protection domain. These functions may be accessed externally, through tools such as the system module loader, or internally, when invoked by Nooks to recover after a failure.

4.2.6 Summary

The Nooks Reliability Layer improves operating system reliability by isolating and recovering from the failure of existing, unmodified device drivers. The layer is composed of a controller and four core services: isolation, interposition, object tracking, and recovery. These services prevent device drivers from corrupting the kernel, detect driver failures, and restore system functionality after a failure.

4.3 Implementation

I implemented Nooks within the Linux 2.4.18 kernel on the Intel x86 architecture. I chose Linux as a platform because of its popularity and its wide support for kernel extensions in the form of loadable modules. Linux proved to be a challenging platform, because the kernel provides over 700 functions callable by extensions and more than 650 extension-entry functions callable by the kernel. Moreover, few data types are abstracted, and drivers directly access fields in many kernel data structures. Despite these challenges, I brought the system from concept to function in about 18 months. The implementation supports both device drivers and other kernel extensions.

The Linux kernel supports standard interfaces for many extension classes. For example, there is a generic interface for block, character, and network device drivers, and another one for file systems. The interfaces are implemented as C language structures containing a set of function pointers. Most interactions between the kernel and drivers take place through function calls, either from the kernel into drivers or from drivers into exported kernel routines. Drivers directly access several global data structures, such as the current task structure and the jiffies time counter. Fortunately, they modify few of these structures, and frequently do so through preprocessor macros and inline functions. As a result, Nooks can interpose on most kernel-driver interactions by intercepting the function calls between the driver and kernel.

Figure 4.2 shows the Nooks layer in Linux. Under the Nooks controller and object tracker are isolated kernel extensions: a single device driver, three cooperating drivers, and a kernel service. These extensions are wrapped by Nooks’ wrapper stubs, as indicated by the shaded boxes surrounding them. Each wrapped box, containing one or more extensions, represents a single Nooks protection domain.
Shown at the top, in user mode, is a recovery manager, which is responsible for restarting extensions after a failure. Figure 4.2 also shows unwrapped drivers and kernel services that continue to interface directly to the Linux kernel.

To facilitate portability, Nooks does not use the Intel x86 protection rings or memory segmentation mechanisms. Instead, drivers execute at the same privilege level as the rest of the kernel (ring 0). A conventional page table architecture provides memory protection, which can be implemented both with hardware- and software-filled TLBs.

Table 4.1 shows the size of the Nooks implementation. As discussed in Chapter 6, this code can

[Figure 4.2: The Nooks Reliability Layer inside the Linux OS. Applications, restart daemons, and a recovery manager run in user mode above the kernel; within the kernel, the Nooks controller and object tracker manage wrapped protection domains containing drivers and a kernel service, while unwrapped drivers and kernel services interface directly with the Linux kernel.]

Table 4.1: The number of non-comment lines of source code in Nooks.

Source Components                # Lines
Domain Management                  2,391
Object Tracking                    1,498
Extension Procedure Call             928
Wrappers                          14,484
Recovery                           1,849
Build tools                        1,762
Linux Kernel Changes                 924
Miscellaneous                      1,629
Total number of lines of code     25,465

tolerate the failure of fourteen device drivers and two other kernel extensions. The Nooks reliability layer comprises less than 26,000 lines of code. In contrast, the kernel itself has 2.4 million lines, and the Linux 2.4 distribution has about 30 million [206]. Other commodity systems are of similar size. For example, the Microsoft Windows Server 2003 operating system contains over 50 million lines of code [193]. Clearly, relative to a base kernel, its drivers, and other extensions, the Nooks reliability layer introduces only a modest amount of additional system complexity.

In the following subsections I discuss the implementation of Nooks’ major components: isolation, interposition, wrappers, object tracking, and recovery. I describe wrappers separately because they comprise the bulk of Nooks’ code and complexity. Finally, I describe limitations of the Nooks implementation.

[Figure 4.3: Protection of the kernel address space. The kernel has read-write access to the entire address space. Each driver domain has read-write access to its own heap, stacks, I/O buffers, and object table, but only read access to the kernel and to other domains.]

4.3.1 Isolation

The isolation component of Nooks consists of two parts: (1) memory management to implement lightweight kernel protection domains with virtual memory protection, and (2) Extension Procedure Call (XPC) to transfer control safely between extensions and the kernel.

Figure 4.3 shows the Linux kernel with two lightweight kernel protection domains, each containing a single driver. All components exist in the kernel’s address space. However, memory access rights differ for each component: e.g., the kernel has read-write access to the entire address space, while each driver is restricted to read-only kernel access and read-write access to its local domain. This is similar to the management of address space in single-address-space operating systems, where all applications share the virtual-to-physical mapping of addresses but differ in their access rights to memory [40].

To provide drivers with read access to the kernel, Nooks’ memory management code maintains a synchronized copy of the kernel page table for each domain. Each lightweight protection domain has private structures, including a domain-local heap, a pool of stacks for use by the driver, memory-mapped physical I/O regions, and kernel memory buffers, such as socket buffers or I/O blocks, that are currently in use by the driver.

Nooks uses the extension procedure call (XPC) mechanism to transfer control between driver and kernel domains. The wrapper mechanism, described in Section 4.3.3, makes the XPC mechanism invisible to both the kernel and drivers, which continue to interact through their original procedural interfaces. Two functions internal to Nooks manage XPC control transfer: (1) nooks_driver_call transfers from the kernel into a driver, and (2) nooks_kernel_call transfers from drivers into the kernel. These functions take a function pointer, an argument list, and a protection domain. They execute the function with its arguments in the specified domain. The transfer routines save the caller’s context on the stack, find a stack for the calling domain (which may be newly allocated or reused when calls are nested), change page tables to the target domain, and then call the function. XPC performs the reverse operations when the call returns.

Changing protection domains requires a change of page tables. The Intel x86 architecture flushes the TLB on such a change; hence, there is a substantial cost to entering a lightweight protection domain, both from the flush and from subsequent TLB misses. This cost could be mitigated in a processor architecture with a tagged TLB, such as the MIPS or Alpha, or with single-address-space protection support [123], such as the IA-64 or PA-RISC. However, because Nooks’ lightweight protection domains execute on kernel threads that share the kernel address space, they reduce the costs of scheduling and data copying on a domain change when compared to normal cross-address-space or kernel-user RPCs.

To reduce the performance cost of XPC, Nooks supports deferred calls, which batch many calls into a single domain crossing. Nooks can defer function calls that have no side effects visible to the caller. Wrappers queue deferred function calls for later execution, either at the entry or exit of a future XPC.
Each domain maintains two queues: a driver-domain queue holds delayed kernel calls, and a kernel-domain queue holds delayed driver calls. As an example, I changed the packet-delivery routine used by the network driver to batch the transfer of message packets from the driver to the kernel. When a packet arrives, the driver calls a wrapper to pass the packet to the kernel. The wrapper queues a deferred XPC to deliver the packet after the driver completes interrupt processing.

In addition to deferring calls for performance reasons, Nooks also uses deferred XPC to synchronize some objects shared by the kernel and drivers. In Linux, the kernel often provides a pointer that the driver may modify with no explicit synchronization. The Linux kernel is non-preemptive and

assumes, safely, that the modification is atomic and that the driver will update it “in time.” In such cases, the wrapper queues a deferred function call to copy the modified object back when the driver returns control to the kernel.
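A deferred-XPC queue can be modeled as a simple list of pending function calls that is drained in one batch at the next domain crossing. This is a sketch with invented names, not Nooks code; the real queues live in each protection domain and are flushed by the wrappers at XPC entry or exit.

```c
/* Toy deferred-call queue: wrappers enqueue side-effect-free calls,
 * and a single domain crossing later executes them all. */
#define QMAX 16

typedef void (*deferred_fn)(void *arg);

struct xpc_queue {
    deferred_fn fn[QMAX];
    void *arg[QMAX];
    int n;
};

static void xpc_defer(struct xpc_queue *q, deferred_fn fn, void *arg)
{
    if (q->n < QMAX) {
        q->fn[q->n] = fn;
        q->arg[q->n] = arg;
        q->n++;
    }
}

/* Drain the queue inside one domain crossing. */
static void xpc_flush(struct xpc_queue *q)
{
    for (int i = 0; i < q->n; i++)
        q->fn[i](q->arg[i]);
    q->n = 0;
}

/* Stand-in for the deferred packet-delivery call described above. */
static void deliver_packet(void *arg) { (*(int *)arg)++; }
```

In the packet-delivery example, each arriving packet adds one entry; the whole batch is handed to the kernel after interrupt processing completes, so the driver pays for a single crossing rather than one per packet.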

I previously noted that Nooks protects against bugs but not against malicious code. Lightweight protection domains reflect this design. For example, Nooks prevents a driver from writing kernel memory, but it does not prevent a malicious driver from replacing the domain-local page table explicitly by reloading the hardware page table base register.

Furthermore, Nooks currently does not protect the kernel from DMA by a device into the kernel address space. Preventing a rogue DMA requires hardware that is not generally present on x86 computers. However, Nooks tracks the set of pages writable by a driver and could use this information to restrict DMA on a machine with suitable hardware support.

I made several one-time changes to the Linux kernel to support isolation. First, to maintain coherency between the kernel and driver page tables, I inserted code wherever the Linux kernel modifies the kernel page table. Second, I modified the kernel exception handlers to detect exceptions that occur within Nooks’ protection domains. This new code swaps in the kernel’s stack pointer and page table pointer for the task. On returning from an exception, the code restores the stack pointer and page table for the driver. Finally, because Linux co-locates the task structure on the kernel stack (which XPC replaces with a driver stack), I had to change the mechanism for locating the current task structure. I currently use a global variable to hold the task pointer, which is sufficient for uniprocessor systems. On a multiprocessor, I would follow the Windows practice of storing the address in an otherwise unused x86 segment register.

4.3.2 Interposition

Interposition allows Nooks to intercept and control communication between drivers and the kernel. Nooks interposes on kernel-driver control transfers with wrapper stubs. Wrappers provide transparency by preserving existing kernel-driver procedure-call interfaces while enabling the protection of all control and data transfers in both directions. Control interposition required two changes to Linux kernel code. First, I modified the standard module loader to bind drivers to wrappers instead of kernel functions when loading a driver. As a side benefit, the module loader need not bind drivers

Figure 4.4: Trampoline wrappers for invoking multiple implementations of an interface function.

against wrappers. Thus, an administrator may choose whether or not to isolate each driver. This feature allows Nooks to isolate failure-prone drivers while leaving reliable ones unaffected.

Second, I modified the kernel’s module initialization code to explicitly invoke Nooks on the initialization call into a driver, enabling the driver to execute within its lightweight protection domain. Following initialization, wrappers replace all function pointers passed from the driver to the kernel with pointers to wrappers instead. This causes the kernel to call wrapper functions instead of driver functions directly.

Nooks cannot interpose on all functions exported by drivers with a single wrapper stub because the kernel selects the driver it is invoking by the address of the function it calls. The left side of Figure 4.4 shows the kernel selecting among multiple instances of a driver function based on the function address. Thus, a unique function pointer is required for each implementation of a function in the kernel-driver interface. As there is only one implementation of kernel functions, a single wrapper for each function suffices. Functions exported by drivers, though, may be implemented multiple times in different drivers or even within the same driver. As a result, a separate function pointer is required for each instance of a function in the driver interface.

To address this problem, Nooks generates “trampoline” code at run-time that pushes a pointer

Figure 4.5: Control flow of driver and kernel wrappers.

to the driver instance on the stack and then invokes a wrapper that is common to all instances of the function. Nooks passes the kernel a pointer to the trampoline instead of the common wrapper function. The right side of Figure 4.4 shows the kernel invoking trampolines, which pass control to a common wrapper function. This mechanism ensures that all instances of a driver function have a unique function pointer and that the wrapper logic is centralized in a single implementation.
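The trampoline idea can be sketched in portable C, with the caveat that the real Nooks emits a few machine instructions at run-time rather than using structs. In this hypothetical model, each driver instance gets its own trampoline record, whose distinct address plays the role of the unique function pointer, while all records funnel into one shared wrapper body.

```c
/* Model of per-instance trampolines sharing one common wrapper. */
struct driver_instance {
    const char *name;
    int (*real_write)(struct driver_instance *self, int len);
};

struct trampoline {
    struct driver_instance *inst;   /* what the real trampoline pushes */
};

/* The single, centralized wrapper body for write(). */
static int common_write_wrapper(struct trampoline *t, int len)
{
    /* parameter checks, object tracking, and XPC would happen here */
    return t->inst->real_write(t->inst, len);
}

/* Two driver implementations of the same interface function. */
static int drv1_write(struct driver_instance *self, int len)
{
    (void)self;
    return len;          /* pretend driver 1 wrote everything */
}

static int drv2_write(struct driver_instance *self, int len)
{
    (void)self;
    return len / 2;      /* pretend driver 2 wrote half */
}
```

The kernel is handed the address of each trampoline record, so every driver instance still presents a distinct function pointer even though the wrapper logic exists exactly once.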

In addition to interposing on control transfers, Nooks must interpose on some data references. Drivers are linked directly to global kernel variables that they read but do not write (e.g., the current time). For global variables that drivers modify, I create a shadow copy of the kernel data structure within the driver’s domain that is synchronized to the kernel’s version. For example, Nooks uses this technique for the softnet data structure, which contains a queue of the packets sent and received by a network driver. The object tracker synchronizes the contents of the kernel and driver version of this structure before and after XPCs into a network driver.

4.3.3 Wrappers

As noted above, Nooks inserts wrapper stubs between kernel and driver functions. There are two types of wrappers: kernel wrappers, which are called by drivers to execute kernel-supplied functions, and driver wrappers, which are called by the kernel to execute driver-supplied functions. In both cases, a wrapper functions as an XPC stub that appears to the caller as if it were the target procedure in the called domain.

Both wrapper types perform the body of their work within the kernel’s protection domain. Therefore, the domain change occurs at a different point depending on the direction of transfer, as shown in Figure 4.5. When a driver calls a kernel wrapper, the wrapper performs an XPC on entry so that the body of the wrapper (i.e., object checking, copying, etc.) can execute in the kernel’s domain. Once the wrapper’s work is done, it calls the target kernel function directly with a regular procedure call. In the opposite direction, when the kernel calls a driver wrapper, the wrapper executes within the kernel’s domain. When it is done, the wrapper performs an XPC to transfer to the target function within the driver.

Wrappers perform three basic tasks. First, they check parameters for validity by verifying with the object tracker and memory manager that pointers are valid. Second, they implement call-by-value-result semantics for XPC by creating a copy of kernel objects on the local heap or stack within the driver’s protection domain. These semantics ensure that updates to kernel objects are transactional, because they are only applied after the driver completes, when the wrappers copy the results back to the kernel. Third, wrappers perform an XPC into the kernel or driver to execute the desired function, as shown in Figure 4.5.

While wrappers must copy data between protection domains, no marshaling or unmarshaling is necessary, because the driver and kernel share the same address space. Instead, wrappers may directly allocate and reference memory in either the kernel or the driver protection domains.
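The three wrapper tasks, and in particular call-by-value-result copying, can be illustrated with a sketch. The struct and function names are invented for the example; a real wrapper would consult the object tracker and invoke XPC rather than a plain call.

```c
#include <string.h>

/* Hypothetical kernel object passed to a driver. */
struct net_stats {
    long tx_packets;
    long rx_packets;
};

/* Driver-side code: updates the (copied) object. */
static int driver_get_stats(struct net_stats *s)
{
    s->tx_packets += 1;
    return 0;
}

/* Wrapper with call-by-value-result semantics: copy in, call,
 * and copy the result back only after the driver completes. */
static int wrapper_get_stats(struct net_stats *kernel_obj)
{
    struct net_stats driver_copy;

    /* (1) validity checks on kernel_obj would go here */
    memcpy(&driver_copy, kernel_obj, sizeof(driver_copy));   /* (2) copy in */
    int ret = driver_get_stats(&driver_copy);                /* (3) "XPC"   */
    if (ret == 0)                                            /* copy back   */
        memcpy(kernel_obj, &driver_copy, sizeof(*kernel_obj));
    return ret;
}
```

Because the copy-back happens only after the driver returns successfully, a driver crash mid-call leaves the kernel's version of the object untouched, which is the transactional property described above.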
The code for synchronizing simple objects is placed directly in the wrappers, while the object tracker provides synchronization routines for complex objects with many pointers. As an optimization, wrappers may pass parameters that are only read but not written by drivers without modification, as any attempt to modify the parameter will cause a memory access fault.

To improve performance, the wrappers rely on several techniques for moving complex objects between protection domains. In some cases, Nooks copies objects into the driver’s protection domain, following embedded pointers as appropriate. From an analysis of driver and extension source code, I determined that it is generally unnecessary to copy the complete transitive closure of an object; while drivers read pointers more than one level removed from a parameter, they generally

do not write to them. In other cases, Nooks avoids copying entirely by changing the protection on the page containing an object. A “page tracker” mechanism within the object tracker remembers the state of these mapped pages and grants and revokes driver access to the pages. Nooks uses this mechanism to avoid copying network packets and disk blocks.

Opaque pointers, which appear as void * in type declarations, pose a challenge for isolation. The contents of these pointers cannot be copied between protection domains because their type is not known. In practice, though, opaque pointers can often be interpreted based on how they are used. There are three classes of opaque pointers:

1. Private to driver. These pointers, commonly used as parameters to callback functions, are never dereferenced by any code other than the driver.

2. Shared with applications. These pointers are dereferenced by user-mode applications, and hence must point into user address space.

3. Shared with other drivers. These pointers are opaque to the kernel but may be dereferenced by other drivers.

Any pointer dereferenced by the kernel is not truly opaque, because Nooks can reuse the kernel’s logic to determine its type.

The first class of opaque pointers, private to driver, may be passed through wrappers and the object tracker without any work because the kernel never dereferences the pointer. Pointers in the second class, shared with applications, reference objects from user space. Hence, drivers cannot directly dereference these pointers and must instead use accessor functions, such as copy_from_user. Nooks interposes on these functions to grant drivers access to user-mode memory. The third class of opaque pointers, shared with other drivers, poses a problem because drivers may attempt to modify the objects to which they point. As a result, drivers using this type of opaque pointer cannot be isolated from each other; they must all execute in the same protection domain.

Wrappers are relatively straightforward to write and integrate into the kernel. I developed a tool that automatically generates trampoline code and the skeleton of wrapper bodies from Linux kernel header files. To create the wrappers for exported kernel functions, the tool takes a list of kernel function names and generates wrappers that implement function interposition and XPC invocations.

Figure 4.6: Code sharing among wrappers for different extensions.

Similarly, for the kernel-to-driver interface, the tool takes a list of interfaces (C structures containing function pointers) and generates wrappers for the kernel to call.

I wrote the body of wrapper functions manually. This is a one-time task required to support the kernel-driver interface for a specific OS. Once written, wrappers are automatically usable by all drivers that use the kernel’s interface. This code checks parameters for simple errors, such as pointing to invalid addresses, and then moves parameters between protection domains.

Writing a wrapper requires knowing how drivers use a parameter: whether it is live across multiple calls to the driver, whether it can be passed to other threads or back to the kernel, and which fields of the parameter can be modified. While I performed this analysis manually, static analysis tools could determine these properties by analyzing an existing set of drivers [69].

Wrapper Code Sharing

Previously, Table 4.1 showed that the Nooks implementation includes 14K lines of wrapper code, over half of the Nooks code base. Thus, it is important to understand the effort required to write these wrappers when isolating additional drivers or kernel extensions. Chapter 6 describes in more detail the sixteen extensions I isolated: six sound-card drivers, six Ethernet drivers, two IDE storage drivers, a file system (VFAT), and an in-kernel Web server (kHTTPd). I only implemented wrappers for functions that were actually used by these extensions. Thus, I wrote wrappers for 126 of the 650 functions imported by the kernel and for 329 of the 700 exported functions, totaling 455 wrappers.

Figure 4.6 shows the total number of wrappers (both kernel and driver wrappers) used by each of these extensions. Each bar gives a breakdown of the number of wrappers unique to that extension and the number of wrappers shared in various ways. Sharing reduces the cost of isolating a given extension. For example, of the 46 wrappers used by the pcnet32 Ethernet driver (35 kernel wrappers and 11 extension wrappers), 22 are shared among the six network drivers. Similarly, 31 wrappers are shared between the six sound-card drivers. Overall, of the 158 wrappers that are not shared, 116 are in the VFAT and kHTTPd extensions. While VFAT would share many of its wrappers with other file systems, kHTTPd is a one-of-a-kind extension and as a result its wrappers are unlikely to be shared.

Thus, the class structure of drivers reduces the effort required to isolate a driver. Once Nooks supports a single driver in a class, little work is needed to support additional drivers.

4.3.4 Object Tracking

The object tracker facilitates the recovery of kernel objects following a driver failure. The Nooks object tracker performs two independent tasks. First, it records the addresses of all objects in use by a driver in a database. As an optimization, objects used only for the duration of a single XPC call are recorded on the kernel stack. Objects with long lifetimes are recorded in a per-protection-domain hash table. Second, for objects that drivers may modify, the object tracker creates and manages a driver version of the object and records an association between the kernel and driver versions. Wrappers rely on this association to map parameters between the driver’s protection domain and the kernel’s protection domain.
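The kernel/driver association table at the heart of the tracker can be modeled with a few lines of C. This is a toy sketch with hypothetical names: the real tracker is a per-domain hash table covering dozens of typed kernel objects, not a fixed linear array.

```c
#include <stddef.h>

#define TRACKMAX 32

/* One association between a kernel object and its driver-domain copy. */
struct tracked {
    void *kernel_obj;
    void *driver_obj;
};

struct object_tracker {
    struct tracked slot[TRACKMAX];
    int n;
};

/* Record an association when an object first crosses the interface. */
static void track(struct object_tracker *t, void *kobj, void *dobj)
{
    if (t->n < TRACKMAX)
        t->slot[t->n++] = (struct tracked){ kobj, dobj };
}

/* Map a kernel object to the driver's version, as wrappers must do
 * when translating parameters between domains. */
static void *driver_version(struct object_tracker *t, void *kobj)
{
    for (int i = 0; i < t->n; i++)
        if (t->slot[i].kernel_obj == kobj)
            return t->slot[i].driver_obj;
    return NULL;
}
```

During recovery, walking this table is what lets a recovery manager enumerate and release everything the failed driver was holding.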

The Nooks implementation currently supports many kernel object types, such as tasklets, PCI devices, inodes, and memory pages. To determine the set of objects to track, I inspected the interfaces between the kernel and the supported drivers and noted every object type that passed through those interfaces. I then wrote object-tracking procedures for each of the 52 object types that I found. For each object type, there is a unique type identifier and code to release instances of that type during recovery. Complex types also have a routine to copy changes between a kernel and driver instance of the type.

When an object “dies” and is no longer usable by a driver, the object tracker must remove the object from its database. Determining when an object will no longer be used requires a careful examination of the kernel-driver interface. This task is possible because the kernel requires the same information to safely reclaim shared objects. For example, some objects are accessible to the driver only during the lifetime of a single XPC call from the kernel. In this case, I add the object to the tracker’s database when the call begins and remove it on return. Other objects are explicitly allocated and deallocated by the driver, in which case I know their lifetimes exactly. In still other cases, I rely on the semantics of the object and its use. For example, drivers allocate the timer data structure to register for a future callback. I add this object to the object tracker when a driver calls add_timer and remove it when the timer fires, at which point I know that it is no longer used.

The object-tracking code is conservative, in that it may underestimate the lifetime of an object and unnecessarily add and remove the same object from the database multiple times. It will not, however, allow a driver to access an object that the kernel has released.

In addition to tracking objects in use by drivers, the tracker must record the status of locks that are shared with the kernel. When a driver fails, Nooks releases all locks acquired by the driver to prevent the system from hanging. As a result, calls to lock kernel data structures require an XPC into the kernel to acquire the lock, synchronize the kernel and driver versions of the data structure, and record that the lock was acquired. Fortunately, the performance impact of this approach is minimal because shared locks are uncommon in the Linux device drivers I examined.

4.3.5 Recovery

The recovery code in Nooks consists of three components. First, the isolation components detect driver failures and notify the controller. Second, the object tracker and protection domains support cleanup operations that release the resources in use by a driver. This functionality is available to the third component, a recovery manager, whose job is to recover after a failure. The recovery manager may be customized to a specific driver or class of drivers, as I discuss more in Chapter 5.

Fault Detection

Nooks triggers recovery when it detects a failure through software checks (e.g., parameter validation or livelock detection), processor exceptions, or notification from an external source. Specifically, the wrappers, protection domains, and object tracker notify the controller of a failure when:

• The driver passes a bad parameter to the kernel, such as accessing a resource it had freed or unlocking a lock not held.

• The driver allocates too much memory, such as an amount exceeding the physical memory in the computer.

• The driver executes too frequently without an intervening clock interrupt (implying livelock).

• The driver generates an invalid processor exception, such as an illegal memory access or an invalid instruction.

In addition, it is possible to implement an external failure detector, such as a user- or kernel-mode agent, that notifies the controller of a failure. In all cases, the controller invokes the driver’s recovery manager.
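One of the software checks above, the livelock heuristic, can be sketched as a counter compared against a threshold, with a detected fault funneled to the controller. The names and threshold are hypothetical; the real checks are distributed across the wrappers, protection domains, and object tracker.

```c
/* Toy model of fault detection feeding the Nooks controller. */
enum fault {
    FAULT_NONE,
    FAULT_BAD_PARAM,
    FAULT_OOM,
    FAULT_LIVELOCK,
    FAULT_EXCEPTION
};

static enum fault last_reported = FAULT_NONE;

/* In real Nooks, this would invoke the driver's recovery manager. */
static void controller_notify(enum fault f)
{
    last_reported = f;
}

/* Flag livelock if the driver has run too many times since the
 * last clock interrupt. */
static int check_livelock(int calls_since_tick, int limit)
{
    if (calls_since_tick > limit) {
        controller_notify(FAULT_LIVELOCK);
        return 1;
    }
    return 0;
}
```

The other checks (bad parameters, over-allocation, processor exceptions) follow the same pattern: a local test in the isolation code followed by a notification that hands control to the recovery path.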

Recovery Managers

The recovery manager is tasked with returning the system to a functioning state. Nooks initially supported two simple recovery managers. The default recovery manager is a kernel service that simply unloads the failed driver, leaving the system running but without the services of the driver. The restart recovery manager is a user-mode agent that similarly unloads the failed driver but then executes a script to reload and restart the driver. In the next chapter, I discuss a third recovery

manager that provides additional services. For each protection domain, the Nooks controller records which manager to notify when a failure occurs.

The XPC, object tracking, and protection domain code all provide interfaces to the recovery managers. The XPC service allows a manager to signal all the threads that are currently executing in the driver or have called through the driver and back into the kernel. The signal causes the threads to unwind out of the driver by returning to the point where they invoked the driver without executing any additional driver code. Threads that are in a non-interruptible state may ignore the signal, and hence complete recovery is impossible if a sleeping thread never wakes. It is not safe to wake such a thread, because the condition it is waiting for may not be true, and the thread could then corrupt data outside the driver’s protection domain. By ignoring hung threads, partial recovery may still be possible. Uninterruptible sleeps are infrequent in the Linux kernel, however, so I do not believe this to be a significant limitation.

The object tracker provides an interface to recovery managers to enumerate the objects in use by a driver at the time of failure and to garbage collect the objects by releasing them to the kernel. The manager may choose both the set of objects it releases and the order in which to release them. Thus, it may preserve objects for use by the driver after recovery, such as memory-mapped I/O buffers that a hardware device continues to access.

Lightweight kernel protection domains provide similar support for recovery. The domains record the memory regions accessible to a driver and provide interfaces for enumerating the regions and for releasing the regions to the kernel.

Sequence of Recovery Actions

The Nooks recovery managers execute in stages to first disable a failed driver, then unload the driver, and optionally restart the driver. These actions are derived from the state machine model of device drivers in Chapter 3. The first two stages return the system to the initial state, S0, where the driver is not present. The final stage, restarting the driver, returns the driver to state S1, where it is initialized and ready to process requests.

Once a failure has been detected, the first step of recovery is to disable the failed driver. As Nooks isolation is imperfect, this step ensures that the driver does not cause any further corruption.

At this stage, the hardware device must also be disabled, to prevent it from corrupting system state as well.

The second stage of recovery is to unload the failed driver and return the system (or rather, the driver-related portions of the system) to its state before the driver was loaded. The recovery manager invokes the object tracker to release objects in use by the driver, the protection domain code to release memory objects, and the XPC code to signal threads that were executing in the driver. The default recovery manager completes execution after this stage, leaving the OS running but without the services of the failed driver.

To further improve reliability, the restart recovery manager continues on to reload and re-initialize the failed driver. While applications that were actively using the old driver may experience failures when the driver is unloaded, applications not using the driver continue to execute properly, and those launched after the failure may use the driver.

Complications

While conceptually simple, the practicalities of dealing with low-level device driver code pose many complications to the recovery process. The recovery manager must ensure that the kernel is returned to a functioning state and that devices managed by a failed driver do not interfere with recovery.

As much as possible, the recovery code attempts to mimic the behavior of a driver that is shutting down cleanly. Hence, the object tracker uses the same routines that drivers use to release objects. For example, the object tracker calls the kfree_skb function to release network packets in flight, which decrements a reference count and potentially frees the packet. As a result, higher levels of software, such as network protocol code, continue to function correctly during and after recovery.

The recovery manager must ensure that the system is not left in a hung state after recovery. Two factors prevent the system from hanging when an extension fails while holding a lock. First, if the lock is shared with the kernel, the object tracker records the lock status so that a recovery manager can release the lock. Second, if the lock is private to the driver, any threads blocked on the lock will be signaled to unwind by the XPC service.

As part of the unloading process, the recovery manager must ensure that open connections to drivers and other extensions are invalidated, because the connection state is lost when the failed

driver is unloaded. However, the driver’s clients may still attempt to use these connections. Thus, the object tracker, when releasing connection objects, replaces the function table in the connection with a table of dummy functions. These functions return errors for all calls other than the release operation, which closes the connection. As a result, clients can close a connection to a failed driver but will receive errors if they try to request any other services on a connection.
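Swapping in a table of dummy functions can be sketched as follows. The connection structure here is invented for illustration, though it mirrors how Linux dispatches through tables of function pointers such as file_operations.

```c
#include <errno.h>

/* Hypothetical per-connection operations table. */
struct conn_ops {
    int (*read)(void);
    int (*write)(void);
    int (*release)(void);
};

static int dummy_fail(void)    { return -EIO; }  /* driver is gone     */
static int dummy_release(void) { return 0;    }  /* closing still works */

/* The table installed for connections to a failed driver: every call
 * but release now returns an error. */
static const struct conn_ops dead_driver_ops = {
    .read    = dummy_fail,
    .write   = dummy_fail,
    .release = dummy_release,
};

/* Invalidate a connection in place; clients keep their pointer but
 * now reach the stubs. */
static void invalidate_connection(const struct conn_ops **ops)
{
    *ops = &dead_driver_ops;
}
```

Because clients never see a dangling pointer into the unloaded driver, they fail gracefully with an error on each operation and can still close the connection cleanly.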

In the period between the detection of a failure and the unloading of the driver, requests may continue to arrive. To ensure that driver code is not invoked after a failure, wrappers check the state of a driver before performing an XPC into a driver. If the driver has failed, the wrapper returns an error code to its caller. After the driver is unloaded, attempts to access the driver will receive an error from the kernel, because the driver no longer exists.

Device hardware may continue to access system resources even after its driver has been halted. For example, a device may continue to generate interrupts or DMA into system memory. Thus, the recovery manager disables interrupt processing for the device controlled by the driver during the first stage of recovery. This prevents the livelock that could occur if device interrupts are not properly dismissed. The recovery manager must also delay releasing resources, such as physical memory, that a device may access until its driver has recovered and is in control of the device. For example, a network device may continue to write to packets in memory buffers; therefore, those buffers cannot be released until the driver has restarted and reinitialized the device. The default recovery manager, because it does not reload drivers, cannot reclaim these resources.

The recovery code must also take care to prevent denial-of-service when multiple drivers share a single interrupt request line. Disabling an interrupt blocks all drivers that share the line. Nooks addresses this problem with a timer that polls shared interrupts during recovery. Under normal conditions, all drivers using the shared interrupt are invoked every time any device raises the interrupt line. Each driver checks whether its device raised the interrupt, and if so, handles it appropriately. Otherwise, the driver ignores the interrupt. Nooks relies on this behavior during recovery by invoking all drivers sharing an interrupt when the timer fires. If a device did generate an interrupt during the timer period, its driver will handle it. Otherwise, it will ignore the call. While performance is degraded during recovery because of the extra latency induced by the timer, this approach prevents livelock due to blocking interrupts.
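The polling behavior relied upon here can be modeled with a small sketch (hypothetical structures, not kernel code): on each timer tick, every handler sharing the line is invoked, and each one services its device only if that device actually raised the interrupt.

```c
#define NSHARED 3

/* Model of one shared interrupt line with NSHARED drivers on it. */
struct shared_irq {
    int device_raised[NSHARED];   /* stands in for device status registers */
    int handled[NSHARED];         /* how many interrupts each driver took  */
};

/* Invoked when the recovery timer fires: call every handler on the
 * line; each checks its own device and dismisses its interrupt. */
static void poll_shared_irq(struct shared_irq *line)
{
    for (int i = 0; i < NSHARED; i++) {
        if (line->device_raised[i]) {
            line->handled[i]++;
            line->device_raised[i] = 0;   /* dismiss the interrupt */
        }
        /* otherwise: not our device, ignore the call */
    }
}
```

Since each handler already tolerates being called for interrupts it did not raise, polling every handler on the line is safe; it merely adds latency until the failed driver is recovered and the line can be re-enabled.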

4.3.6 Achieving Transparency

As mentioned previously, a critical goal for Nooks is transparency, meaning that Nooks must be able to isolate existing drivers written with no knowledge of Nooks. This allows drivers to run unchanged and yet still benefit from the system’s fault isolation and recovery mechanisms.

Two components of Nooks facilitate transparency for existing drivers:

1. Nooks provides wrapper stubs for every function call in the kernel-driver interface.

2. Nooks provides object-tracking code for every object type that passes between the driver and the kernel.

The wrapper stubs interpose on every call between a driver and the kernel and invoke object-tracking code to manage every parameter passed between the caller and callee. The wrappers also invoke the XPC mechanism to transfer control from the caller’s domain to the callee’s domain. Thus, wrapper stubs allow isolation code to be interposed transparently between drivers and the kernel. The object-tracking code allows drivers to access kernel objects as expected without explicit code in the driver to copy data between protection domains. Thus, driver code and the kernel code invoking the driver need not be changed.

Neither the driver nor the kernel is aware of the existence of the Nooks layer. For the fourteen device drivers I isolated with Nooks, none required source code changes. Of the two other kernel extensions, only kHTTPd, the kernel-mode web server, required changes, and then only to 13 lines of code. With these small changes, Nooks can isolate the extension and track all its resources, allowing Nooks to catch errant behavior and to clean up during recovery. Furthermore, of the 924 lines of kernel changes mentioned in Table 4.1, none were to code that invokes drivers; they were primarily changes to memory management and interrupt handling code.

While Nooks does not require changes to driver source code, it does require that drivers be recompiled to convert inline functions and macros into XPC invocations. Because Linux does not support a binary interface to the kernel (extensions must be recompiled for every kernel), this is not a major limitation.

4.3.7 Implementation Limitations

Section 4.1 described the Nooks philosophy of designing for mistakes and for best effort. The Nooks implementation involves many trade-offs. As such, it does not provide complete isolation or fault tolerance for all possible driver failures. As previously discussed, these limitations are either faulty behavior that cannot be contained or correct behavior that cannot be recognized as such.

The faulty behavior that Nooks cannot detect stems from two factors. First, Nooks runs drivers in kernel mode to simplify backward compatibility, so Nooks cannot prevent drivers from deliberately executing privileged instructions that corrupt system state. In addition, Nooks cannot prevent infinite loops within the driver when interrupts are disabled, because it cannot regain control of the processor from the driver. Second, Nooks cannot fully verify the behavior of drivers and other kernel extensions, so Nooks’ parameter checks are incomplete. For example, Nooks cannot determine that a network driver has actually sent the packet when it notifies the kernel that it has done so.

Other limitations stem from driver and extension behavior that Nooks cannot execute safely. Nooks, as an architecture, is limited to extension mechanisms with well-typed parameters. Nooks must be able to determine the type of any data passed out by the driver to the kernel or another driver in order to completely interpose on the driver’s communication. While this is often the case, if it is not, Nooks cannot ensure that an isolated driver does not pass corrupt data to another driver.

Nooks’ restart recovery manager restores system functionality by restarting the failed driver. As a result, it can only recover for drivers that can be killed and restarted safely. This is true for device drivers that are loaded and unloaded by the kernel’s hotplug mechanism, but may not be true for all drivers or kernel extensions.

Finally, Nooks interposes on the kernel-driver interface when drivers are loaded into the kernel. Integrated drivers that are compiled into the kernel, such as real-time clock timers and interrupt controllers, require additional changes to the kernel to be initialized into a protection domain. In addition, these drivers’ global variables must be placed on pages separate from kernel globals so the driver is granted write access only to its globals but not others that lie nearby in memory. While these changes increase the amount of kernel code that must change to support Nooks, they are not fundamental limitations of the Nooks architecture.

These limitations are not insignificant, and crashes may still occur. Nonetheless, the implementation will allow a kernel to resist many crashes caused by drivers. Given the enormous number of such crashes, a best-effort solution can have a large impact on overall reliability.

4.4 Summary

Device drivers are a major source of failure in modern operating systems. Nooks is a new reliability layer intended to significantly reduce driver-related failures. Nooks isolates drivers in lightweight kernel protection domains and relies on hardware and software checks to detect failures. After a failure, Nooks recovers by unloading and then reloading the failed driver. Nooks focuses on achieving backward compatibility, that is, it sacrifices complete isolation and fault tolerance for compatibility and transparency with existing kernels and drivers. As a result, Nooks has the potential to greatly improve the reliability of today’s operating systems by removing their dependence on driver correctness.

This chapter presented the details of the Nooks isolation mechanisms. Nooks is composed of a controller and four core services: isolation, interposition, object tracking, and recovery. These services allow Nooks to transparently isolate existing drivers from the kernel, detect driver failures, and recover after a driver failure. The effectiveness of Nooks depends on its ability to isolate and recover from driver failures, which I evaluate in Chapter 6.

Chapter 5

THE SHADOW DRIVER RECOVERY MECHANISM

5.1 Introduction

The Nooks reliability layer recovers from driver failure by unloading and then reloading the failed driver. As I will show in Section 6.3, reloading failed drivers is effective at preventing system crashes. However, users of a computer are not solely interested in whether the operating system continues to function. Often, users care more about the applications with which they interact. If applications using drivers fail, then I have only partially achieved my goal of improving reliability.

With the restart recovery manager, calls into a driver that fails and subsequently recovers may return error codes because the recovery manager unloads the driver and invalidates open connections to the driver during recovery. As a result, clients of a recovered driver would themselves fail if they depend on the driver during or after recovery. For example, audio players stopped producing sound when a sound-card driver failed and recovered. For the same reason, Nooks cannot restart drivers needed by the kernel, such as disk drivers. Requests to the disk driver fail while the driver is recovering. When the Linux kernel receives multiple errors from a disk driver used for swapping, it assumes that the device is faulty and crashes the system.

In addition, any settings an application or the OS had downloaded into a driver are lost when the driver restarts. Thus, even if the application reconnects to the driver, the driver may not be able to process requests correctly.

These weaknesses highlight a fundamental problem with a recovery strategy that reveals driver failures to their clients: the clients may not be prepared to handle these failures. Rather, they are designed for the more common case that either drivers never fail, or, if they fail, the whole system fails.

To address these problems, I developed a new recovery mechanism, called a shadow driver, or more simply, a shadow. My design for shadows reflects four principles:

1. Device driver failures should be concealed from the driver’s clients. If the operating system and applications using a driver cannot detect that it has failed, they are unlikely to fail themselves.

2. Driver recovery logic should be generic. Given the huge number and variety of device drivers, it is not practical to implement per-driver recovery code. Therefore, the architecture must enable a single shadow driver to handle recovery for a large number of device drivers.

3. Recovery logic should be centralized. The implementation of the recovery system should be simplified by consolidating recovery knowledge within a small number of components.

4. Recovery services should have low overhead when not needed. The recovery system should impose relatively little overhead for the common case (that is, when drivers are operating normally).

Overall, these design principles aim to protect applications and the OS from driver failure, while minimizing the cost required to make and use shadow drivers.

Shadow drivers do not apply to all types of kernel extensions or failures. First, shadow drivers only apply to device drivers that belong to a class and share a common calling interface. Second, shadow drivers recover after a failure by restarting the driver and replaying past requests and hence, can only recover from failures that are both transient and fail-stop. Deterministic failures may recur when the driver recovers, again causing a failure. Recoverable failures must be fail-stop, because shadow drivers must detect a failure in order to conceal it from the OS and applications. Hence, shadow drivers require an isolation subsystem to detect and stop failures before they are visible to applications or the operating system.

In the remainder of this chapter, I describe the architecture and implementation of the shadow driver recovery mechanism.

5.2 Shadow Driver Design

Shadow drivers are a new recovery architecture that takes advantage of the shared properties of a class of drivers for recovery. The architecture consists of three components: shadow drivers, taps, and a shadow recovery manager. I now describe these components in more detail.

5.2.1 Shadow Drivers

A shadow driver is a piece of software that facilitates recovery for an entire class of device drivers. A shadow driver instance is a running shadow driver that recovers for a single, specific driver. The shadow instance compensates for and recovers from a driver that has failed. When a driver fails, its shadow restores the driver to its pre-failure state. This allows, for example, the recovered driver to complete requests made before the failure.

Shadow drivers rely on the state machine model of drivers discussed in Chapter 3. Whereas the previous Nooks recovery managers seek to restore the driver to its unloaded state or initialized state, shadow drivers seek to restore drivers to their state at the time of failure.

Shadow drivers execute in one of two modes: passive or active. Passive mode is used during normal (non-faulting) operation, when the shadow driver monitors all communication between the kernel and the device driver it shadows. This monitoring is achieved via replicated procedure calls: a kernel call to a device driver function causes an automatic, identical call to the corresponding shadow driver function. Similarly, a driver call to a kernel function causes an automatic, identical call to a corresponding shadow driver function. These passive-mode calls are transparent to the device driver and the kernel and occur only to track the state of the driver as necessary for recovery. Based on the calls, the shadow tracks the state transitions of the shadowed device driver.

Active mode is used during recovery from a failure. Here, the shadow performs two functions. First, it “impersonates” the failed driver, intercepting and responding to calls for service. Therefore, the kernel and higher-level applications continue operating as though the driver had not failed. Second, the shadow driver restarts the failed driver and brings it back to its pre-failure state. While the driver restarts, the shadow impersonates the kernel to the driver, responding to its requests for service. Together, these two functions hide recovery from the driver, which is unaware that a shadow driver is restarting it after a failure, and from the kernel and applications, which continue to receive service from the shadow. Once the driver has restarted, the active-mode shadow returns the driver to its pre-failure state. For example, the shadow re-establishes any configuration state and then replays pending requests.

A shadow driver is a “class driver,” aware of the interface to the drivers it shadows but not of their implementations. The class orientation has two key implications. First, a single shadow driver implementation can recover from a failure of any driver in its class, meaning that a handful of different shadow drivers can serve a large number of device drivers. As previously mentioned, Linux, for example, has only 20 driver classes. Second, implementing a shadow driver does not require a detailed understanding of the internals of the drivers it shadows. Rather, it requires only an understanding of those drivers’ interactions with the kernel. Thus, they can be implemented by kernel developers with no knowledge of device specifics and have no dependencies on individual drivers. For example, if a new network interface card and driver are inserted into a PC, the existing network shadow driver can shadow the new driver without change. Similarly, drivers can be patched or updated without requiring changes to their shadows.

5.2.2 Taps

As previously described, a shadow driver monitors communication between a functioning driver and the kernel and impersonates one to the other during failure and recovery. This is made possible by a new mechanism, called a tap. Conceptually, a tap is a T-junction placed between the kernel and its drivers. During a shadow’s passive-mode operation, the tap: (1) invokes the original driver, and then (2) invokes the corresponding shadow with the parameters and results of the call, as shown in Figure 5.1.

When a failure is detected in a driver, its taps switch to active mode, as shown in Figure 5.2. Here, both the kernel and the recovering device driver interact only with the shadow driver. Communication between the driver and kernel is suspended until recovery is complete and the shadow returns to passive mode, at which point the tap returns to its passive-mode state.

Taps depend on the ability to dynamically dispatch all communication between the driver and the kernel. Consequently, all communication into and out of a shadowed driver must be explicit, such as a procedure call or a message. Although most drivers operate this way, some do not. For example, kernel-mode video drivers often communicate with user-mode applications through shared memory regions [119]. Without additional support from the driver or its applications, these drivers cannot be recovered without impacting applications.

[Figure: the OS kernel communicates with the sound-card device driver (and its sound card) through taps at the kernel and sound-driver class interfaces; the shadow sound driver receives copies of every class-interface call.]

Figure 5.1: A sample shadow driver operating in passive mode.

[Figure: with the taps switched to active mode, the OS kernel and the sound-card device driver each communicate only with the shadow sound driver across the class and kernel interfaces.]

Figure 5.2: A sample shadow driver operating in active mode.

5.2.3 The Shadow Recovery Manager

The shadow recovery manager is responsible for coordinating recovery with shadow drivers. The Nooks controller notifies the shadow recovery manager that it has detected a failure in a driver. The shadow recovery manager then transitions the shadow driver to active mode and closes the taps. In this way, requests for the driver’s services are redirected to the corresponding shadow driver. The shadow recovery manager then initiates the shadow driver’s recovery sequence to restore the driver. When recovery completes, the shadow recovery manager returns the shadow driver to passive-mode operation and re-opens its taps so that the driver can resume service.

5.2.4 Summary

The shadow driver design simplifies the development and integration of shadow drivers into existing systems. Each shadow driver is a single module written with knowledge of the behavior (interface) of a class of device drivers, allowing it to conceal a driver failure and restart the driver after a failure. A shadow driver, when passive, monitors communication between the kernel and the driver. It becomes active when a driver fails and then both proxies requests for the driver and restores the driver’s state.

5.3 Implementation

This section describes the implementation of shadow drivers in the Linux operating system. I have implemented shadow drivers for three classes of device drivers: sound-card drivers, network interface drivers, and IDE storage drivers.

5.3.1 General Infrastructure

Shadow drivers rely on a reliability layer to provide three services within the kernel. An isolation service prevents driver errors from corrupting the kernel by stopping a driver upon detection of a failure. An interposition service provides the taps required for transparent shadowing and recovery. Lastly, an object tracking service tracks kernel resources associated with the driver to facilitate reclamation during the driver’s recovery.

The Nooks controller, described in the previous chapter, handles the initial installation of shadow drivers. The controller creates a new shadow driver instance when a driver is loaded into the kernel. Because a single shadow driver services an entire class of device drivers, there may be several instances of a shadow driver executing if there is more than one driver of a class present.

Shadow drivers and Nooks together form a driver recovery subsystem, as shown in Figure 5.3. The Nooks reliability layer contributes lightweight kernel protection domains, wrappers, and the object tracker’s database. The shadow driver recovery mechanism extends Nooks by incorporating taps into the wrapper code, and by adding the shadow recovery manager and a set of shadow drivers.

[Figure: applications run above the Linux kernel; the recovery subsystem combines the shadow recovery manager and shadow drivers with the Nooks reliability layer (protection domains, taps, wrappers, and the object database), which sits between the kernel and the device drivers.]

Figure 5.3: The Linux operating system with several device drivers and the shadow driver recovery subsystem.

5.3.2 Passive-Mode Monitoring

In passive mode, a shadow driver monitors the current state of a device driver by observing its communication with the kernel. In order to return the driver to its state at the time of failure, the shadow records the inputs to the driver in a log. These inputs are then replayed during recovery.

With no knowledge of how drivers operate, the shadow would have to log all inputs to the driver. However, because the shadow is implemented with knowledge of the driver’s interface, and hence its abstract state machine, not all inputs must be logged. Instead, the shadow only records inputs needed to return a driver to its state at the time of failure. The shadow drops requests that do not advance the driver’s state or whose impact has been superseded by later inputs, for example transitions on a loop in the abstract state machine.

Figure 5.4 shows an example of the state machine transitions for a sound-card driver. The transitions, made when the kernel issues requests to the driver, are numbered. The final state of the sequence is S4, but there is a loop through state S3. As a result, the shadow may drop requests 2 through 5 from its log, because they do not affect the final state of the driver.

[Figure: states S1 through S4 connected by numbered kernel requests: 1: open, 2: ioctl-1, 3: write, 4: ioctl-2, 5: ioctl-reset, 6: ioctl-1; requests 2 through 5 form a loop through S3.]

Figure 5.4: The state machine transitions for a sound-card shadow driver. Those recorded for recovery are shown in boldface.

To implement this state machine, the shadow driver maintains a log in which it records several types of information. First, it tracks I/O requests made to the driver, enabling pending requests to be re-submitted after recovery. An entry remains in the log until the corresponding request has been handled. In addition, for connection-oriented drivers, the shadow driver records the state of each active connection, such as offset or positioning information. The shadow driver also records configuration and driver parameters that the kernel passes into the driver. The shadow relies on this information to reconfigure the driver to its pre-failure state during recovery. For example, the shadow sound-card driver logs ioctl calls (command numbers and arguments) that configure the driver.

For stateful devices, such as a hard disk, the shadow does not create a copy of the device state. Instead, a shadow driver depends on the fail-stop assumption to preserve persistent state (e.g., on disk) from corruption. In other cases, the shadow may be able to force the device’s clients to recreate the state after a failure. For example, a windowing system can recreate the contents of a frame buffer by redrawing the desktop.

In many cases, passive-mode calls do no work and the shadow returns immediately to the caller. For example, the kernel maintains a queue of outstanding requests to a disk driver, and hence the shadow driver for an IDE disk does little in passive mode. For the network shadow driver, too, the Nooks object-tracking system performs much of the work to capture driver state by recording outstanding packets.

Opaque parameters can pose problems for recovery as they did for isolation. However, the class-based approach allows shadow drivers to interpret most opaque pointers. The standardized interface to drivers ensures that a client of the interface that has no knowledge of a driver’s implementation can still invoke it. Hence, clients must know the real type of opaque parameters. The shadow implementer can use the same knowledge to interpret them. For example, the Open Sound System interface [197] defines opaque pointer parameters to the ioctl call for sound-card drivers. The shadow sound-card driver relies on this standard to interpret and log ioctl requests.

5.3.3 Active-Mode Recovery

The shadow enters active mode when a failure is detected in a driver. A driver typically fails by generating an illegal memory reference or passing an invalid parameter across a kernel interface. Nooks’ failure detectors notice the failure and notify the controller, which in turn invokes the shadow recovery manager. This manager immediately locates the corresponding shadow driver and directs it to recover the failed driver. The shadow driver’s task is to restore the driver to the state it was in at the time of failure, so it can continue processing requests as if it had never failed. The three steps of recovery are: (1) stopping the failed driver, (2) reinitializing the driver from a clean state, and (3) transferring relevant shadow driver state into the new driver. Unlike Nooks’ restart recovery manager, a shadow driver does not completely unload the failed driver.

Stopping the Failed Driver

The shadow recovery manager begins recovery by informing the responsible shadow driver that a failure has occurred. It also closes the taps, isolating the kernel and driver from one another’s subsequent activity during recovery. After this point, the tap redirects all kernel requests to the shadow until recovery is complete.

Informed of the failure, the shadow driver first invokes the isolation service to preempt threads executing in the failed driver. It also disables the hardware device to prevent it from interfering with the OS while not under driver control. For example, the shadow disables the driver’s interrupt request line. Otherwise, the device may continuously interrupt the kernel and prevent recovery. On hardware platforms with I/O memory mapping, the shadow also removes the device’s I/O mappings to prevent DMAs into kernel memory.

In preparation for recovery, the shadow garbage collects resources held by the driver. To ensure that the kernel does not see the driver “disappear” as it is restarted, the shadow retains objects that the kernel uses to request driver services. For example, the shadow does not release the network device object for network device drivers. The remaining resources, not needed for recovery, are released.

Reinitializing the Driver

The shadow driver next “boots” the driver from a clean state. Normally, booting a driver requires loading the driver from disk. However, the disk driver may not be functional during recovery. Hence, the driver code and data must already be in memory before a failure occurs. For this reason, the shadow caches a copy of the device driver’s initial, clean data section when the driver is first loaded. These data sections tend to be small. The driver’s code is already loaded read-only in the kernel, so it can be reused from memory.

The shadow boots the driver by repeating the sequence of calls that the kernel makes to initialize a driver. For some driver classes, such as sound-card drivers, this consists of a single call into the driver’s initialization routine. Other drivers, such as network interface drivers, require additional calls to connect the driver into the network stack.

As the driver restarts, the shadow reattaches the driver to the kernel resources it was using before the failure. For example, when the driver calls the kernel to register itself as a driver, the taps redirect these calls to the shadow driver, which reconnects the driver to existing kernel data structures. The shadow reuses the existing driver registration, passing it back to the driver. For requests that generate callbacks, such as a request to register the driver with the PCI subsystem, the shadow emulates the kernel and calls the driver back in the kernel’s place. The shadow also provides the driver with its hardware resources, such as interrupt request lines and memory-mapped I/O regions. If the shadow had disabled these resources in the first step of recovery, the shadow re-enables them, e.g., enabling interrupt handling for the device’s interrupt line. In essence, the shadow driver initializes the recovering driver by calling and responding as the kernel would when the driver starts normally.

Unfortunately, not all drivers implement the driver interface specification correctly. For example, some drivers depend on the exact timing of the calls made during initialization. While my shadow network driver was able to restart five of the six network drivers that I tested, the e100 network driver does not fully initialize during the open call as the specification requires. Instead, the driver sets a timer that fires between the open call and the first call to send a packet. When a user-mode program loads the driver normally, there is ample delay between these two calls for the timer to fire. However, when the shadow driver invokes the driver during recovery, these calls come in rapid succession. As a result, the driver has not finished initializing when the first packet arrives. This causes the recovery process to fail because the shadow cannot replay the log of pending requests.

For testing purposes, I implemented a work-around by inserting a delay to allow any pending timers to fire after calling the open function. However, this example raises the larger issue of compatibility. Shadow drivers rely on driver writers to implement to the published interface and not to the specifics of the kernel implementation. The implementer of a shadow driver may choose simplicity, by only supporting those drivers that correctly implement the interface, or coverage, by including extra code to more closely mimic the kernel’s behavior.

Transferring State to the New Driver

The final recovery step restores the driver to the state it was in at the time of the failure, permitting it to respond to requests as if it had never failed. Thus, any configuration that either the kernel or an application had downloaded to the driver must be restored. The shadow driver walks its log and issues requests to the driver to restore its state.

The details of this final state transfer depend on the device driver class. Some drivers are connection oriented. For these, the state consists of the state of the connections before the failure. The shadow re-opens the connections and restores the state of each active connection with configuration calls. Other drivers are request oriented. For these, the shadow restores the state of the driver by replaying logged configuration operations and then resubmits to the driver any requests that were outstanding when the driver crashed.

As an example, to restart a sound-card driver, the shadow driver resets the driver and all its open connections back to their pre-failure state. Specifically, the shadow scans its list of open connections and calls the open function in the driver to reopen each connection. The shadow then walks its log of configuration commands for each connection and replays commands that set driver properties.

When replaying commands, the shadow must ensure that parameters that usually have user-mode addresses are handled correctly. Drivers normally return an error if parameters from user mode have kernel-mode addresses. For these calls, shadow drivers reset the user-mode address limit for the current task to include the kernel address space. This ensures that the copy_from_user and copy_to_user accessor functions accept kernel addresses. Thus, the driver is unaware that the parameter is in the kernel rather than in user space.

For some driver classes, the shadow cannot completely transfer its state into the driver. However, it may be possible to compensate in other, perhaps less elegant, ways. For example, a sound-card driver that is recording sound stores the number of bytes it has recorded since the last reset. After recovery, the sound-card driver initializes this counter to zero. Because the interface has no call that sets the counter value, the shadow driver must insert its “true” value into the return argument list whenever the application reads the counter to maintain the illusion that the driver has not crashed. The shadow can do this because it receives control (on its replicated call) before the kernel returns to user space.

After resetting driver and connection state, the shadow must handle requests that were either outstanding when the driver crashed or arrived while the driver was recovering. If a driver crashes after submitting a request to a device but before notifying the kernel that the request has completed, the shadow cannot know whether the device completed the request. As a result, shadow drivers cannot guarantee exactly-once behavior and must rely on devices and higher levels of software to absorb duplicate requests.
So, the shadow driver has two choices during recovery: restart in-progress requests and risk duplication, or cancel the request and risk lost data. For some device classes, such as disks or networks, duplication is acceptable. However, other classes, such as printers, may not tolerate duplicates. In these cases, the shadow driver cancels outstanding requests and returns an error to the kernel or application in a manner consistent with the driver interface. As I discuss when evaluating shadow drivers in Chapter 6, this is not ideal but may be better than other alternatives, such as crashing the system or the application.

After this final step, the driver has been reinitialized, linked into the kernel, reloaded with its pre-failure state, and is ready to process commands. At this point, the shadow driver notifies the shadow recovery manager, which sets the taps to restore kernel-driver communication and reestablish passive-mode monitoring.

5.3.4 Active-Mode Proxying of Kernel Requests

While a shadow driver is restoring a failed driver, it is also acting as a proxy for the driver to conceal the failure and recovery from applications and the kernel. Thus, the shadow must respond to any request for the driver’s service in a way that satisfies and does not corrupt the driver’s caller. The shadow’s response depends on the driver’s interface and the request semantics. In general, the shadow will take one of five actions:

1. Respond with information that it has recorded in its log.

2. Report that the driver is busy and that the kernel or application should try again later.

3. Suspend the request until the driver recovers.

4. Queue the request for processing after recovery.

5. Silently drop the request.

The choice of strategy depends on the caller’s expectations of the driver. Writing the proxying code requires knowledge of the kernel-driver interface, its interactions, and its requirements. For example, the kernel may require that some driver functions never block, while others always block. Some kernel requests are idempotent (e.g., many ioctl commands), permitting duplicate requests to be dropped, while others return different results on every call (e.g., many read requests). The writer of a shadow for a driver class uses these requirements to select the response strategy.

Device drivers often support the concept of being “busy.” This concept allows a driver to manage the speed difference between software running on the computer and the device. For example, network drivers in Linux may reject requests and turn themselves off if packets are arriving from the kernel too quickly and their queues are full. The kernel then refrains from sending packets until the driver turns itself back on. The notion of being “busy” in a driver interface simplifies active proxying. By reporting that the device is currently busy, shadow drivers instruct the kernel or application to block calls to a driver. The shadow network driver exploits this behavior during recovery by returning a “busy” error on calls to send packets. IDE storage drivers support a similar notion when request queues are full. Sound drivers can report that their buffers are temporarily full.

Table 5.1: The proxying actions of the shadow sound-card driver.

Request                   Action
read / write              suspend caller
interrupt                 drop request
query capability ioctl    answer from log
query buffer ioctl        act busy
reset ioctl               queue for later / drop duplicate

The shadow sound-card driver uses a mix of all five strategies for proxying functions in its service interface. Table 5.1 shows the shadow’s actions for common requests. The shadow suspends kernel read and write requests, which play and record sound samples, until the failed driver recovers. It processes ioctl calls itself, either by responding with information it captured or by logging the request to be processed later. For ioctl commands that are idempotent, the shadow driver silently drops duplicate requests. Finally, when applications query for buffer space, the shadow responds that buffers are full. As a result, many applications block themselves rather than blocking in the shadow driver.

5.3.5 Code Size

The preceding sections described the implementation of shadow drivers. Their class-based approach allows a single shadow driver to recover from the failure of any driver in its class. Thus, the shadow driver architecture leverages a small amount of code, in a shadow driver, to recover from the failure of a much larger body of code, the drivers in the shadow’s class. However, if the shadow itself introduces significant complexity, this advantage is lost. This section examines the complexity of shadow drivers in terms of code size, which can serve as a proxy for complexity. As I previously mentioned, I implemented shadow drivers for three classes of device drivers: sound-card drivers, network interface drivers, and IDE storage drivers. Table 5.2 shows, for each class, the size in lines of code of the shadow driver for the class. For comparison, I also show the size of the drivers that I tested for Chapter 6, and the total number and cumulative size of all Linux device drivers in the 2.4.18 kernel from each class. The total code size is an indication of the leverage gained through the shadow’s class-driver structure. The table demonstrates that a shadow 70

Table 5.2: The size and quantity of shadows and the drivers they shadow.

Driver Class   Shadow Driver Size    Device Driver Shadowed    Class Size
               (Lines of Code)       (Lines of Code)           (# of Drivers / Lines of Code)
Sound          666                   7,381 (audigy)            48 / 118,981
Network        198                   13,577 (e1000)            190 / 264,500
Storage        321                   5,358 (ide-disk)          8 / 29,000

driver is significantly smaller than the device driver it shadows. For example, the sound-card shadow driver is only 9% of the size of the audigy device driver it shadows. The IDE storage shadow is only 6% of the size of the Linux ide-disk device driver.

On top of the 26,000 lines of code in Nooks, I added about 3300 lines of new code to implement shadow drivers. I made no additional changes to the Linux kernel. Beyond the 1200 lines of code in the shadow drivers themselves, the additional code consisted of approximately 600 lines for the shadow recovery manager, 800 lines of common code shared by all shadow drivers, and another 750 lines for general utilities, such as tools to automatically generate the tap code. Of the 177 taps I inserted, only 31 required actual code in a shadow; the remainder were no-ops because the calls did not significantly impact kernel or driver state.

5.3.6 Limitations

As previously described, shadow drivers have limitations. Most significant, shadow drivers rely on a standard interface to all drivers in a class. If a driver provides “enhanced” functionality with new methods, the shadow for its class will not understand the semantics of these methods. As a result, it cannot replicate the driver’s state machine and recover from its failure.

Shadow drivers rely on explicit communication between the device driver and kernel. If kernel-driver communication takes place through an ad-hoc interface, such as shared memory, the shadow driver cannot monitor it. Furthermore, shadow drivers require that all “important” driver state is visible across the kernel-driver interface. The shadow driver cannot restore internal driver state derived from device inputs. If this state impacts request processing, the shadow may be unable to restore the driver to its pre-failure state. For example, a TCP-offload engine that stores connection

state cannot be recovered, because the connection state is not visible on the kernel-driver interface.

While shadow drivers are built on top of Nooks, they can be used in other contexts where Nooks’ services are available. Microkernels provide much of this infrastructure and could be another platform for shadow drivers [97]. However, the capabilities of the underlying isolation system limit the effectiveness of shadow drivers. If the isolation system cannot prevent kernel corruption, shadow drivers cannot facilitate system recovery. Shadow drivers also assume that driver failures do not cause irreversible side effects. If a corrupted driver stores persistent state (e.g., printing a bad check or writing bad data on a disk), the shadow driver will not be able to correct that action.

In addition, shadow drivers rely on the underlying isolation system to detect failures. If it does not detect a failure, then shadow drivers will not recover and applications may fail. Detecting failures is difficult because drivers are complex and may respond to application requests in many ways. It may be impossible to detect a valid but incorrect return value; for example, a sound-card driver may return incorrect sound data when recording. As a result, no failure detector can detect every device driver failure. However, shadows could provide class-based failure detectors that detect violations of a driver’s programming interface and reduce the number of undetected failures.

Shadow drivers can only recover from transient failures, because deterministic failures may recur during recovery. Although shadow drivers cannot recover from such failures, they remain useful: the sequence of shadow driver recovery events creates a detailed reproduction scenario of the failure that aids diagnosis. This record contains the driver’s calls into the kernel, requests to configure the driver, and I/O requests that were pending at the time of failure.
This information enables a software engineer to find and fix the offending bug more efficiently.

Finally, shadow drivers may not be suitable for applications with real-time demands. During recovery, a device may be unavailable for several seconds without notifying the application of a failure. These applications, which should be written to tolerate failures, would be better served by a solution that restores the driver state but does not perform active proxying.

5.4 Summary

Improving the reliability of commodity systems demands that device driver failures be tolerated. While Nooks was designed to keep the OS running in the presence of driver failure, it did little for

applications that depend on a driver. To improve application reliability, I designed and implemented shadow drivers. Shadow drivers mask device driver failures from both the OS and applications.

Shadow drivers provide an elegant mechanism that leverages the properties of device drivers for recovery. Based on an abstract state machine modeling an entire class of drivers, shadow drivers monitor the communication between the kernel and driver to obtain the driver state during normal operation. When a driver fails, the shadow relies on this state to conceal the failure by proxying for the driver. At the same time, the shadow recovers by restarting the driver and then replaying requests to bring the driver back to its pre-failure state.

My experience shows that shadow drivers, while simple to implement and integrate, are highly leveraged: a single shadow driver can enable recovery for an entire class of device drivers. Thus, with the addition of a small amount of shadow code to an operating system, the failure of a much larger body of driver code can be tolerated. In Chapter 6 I show that shadow drivers can keep applications running in the presence of failures, while Nooks’ recovery managers cannot.

Chapter 6

EVALUATION OF NOOKS AND SHADOW DRIVERS

6.1 Introduction

The thesis of my work is that Nooks and shadow drivers can significantly improve system reliability by protecting the kernel and applications from driver failures. This chapter addresses four key questions:

1. Isolation. Can the operating system survive driver failures and crashes?

2. Recovery. Can applications that use a device driver continue to run even after the driver fails?

3. Assumptions. How reasonable is the assumption that driver failures are fail-stop when isolated by Nooks?

4. Performance. What is the performance overhead of Nooks in the absence of failures, i.e., the steady-state cost?

Using synthetic fault injection experiments, I show that Nooks can isolate and automatically recover from faults in drivers. In these tests, the operating system survived 99% of driver faults that would otherwise crash Linux. Furthermore, shadow drivers were able to restart the driver while keeping applications running in 98% of failures. I also show that the majority of driver failures are fail-stop under Nooks, and that Nooks is limited in this regard by its ability to detect failures, not by its ability to isolate drivers. Finally, I show that when running common applications, Nooks imposes a low to moderate performance overhead.

6.2 Test Methodology

6.2.1 Types of Extensions Isolated

In the experiments reported below, I used Nooks to isolate three types of extensions: device drivers, a kernel subsystem (VFAT), and an application-specific kernel extension (kHTTPd). The device

drivers I chose were common network, sound-card, and IDE storage drivers, representative of three major driver interfaces (character, network, and block [52]).

A device driver’s interaction with the kernel is well matched to the Nooks isolation model for many reasons. First, the kernel invokes drivers through narrow, well-defined interfaces; therefore, it is straightforward to design and implement their wrappers. Second, drivers frequently deal with blocks of opaque data, such as network packets or disk blocks, that do not require validation. Third, drivers often batch their processing to amortize interrupt overheads. When run with Nooks, batching also reduces isolation overhead.

While Nooks is designed primarily for device drivers, it can also be used to tolerate the failure of other kernel extensions. As a demonstration, I tested Nooks with two non-driver kernel extensions. First, I isolated a loadable kernel subsystem. The subsystem I chose was the optional VFAT file system, which is compatible with the Windows 95 FAT32 file system [144]. While drivers tend to have a small number of interfaces with relatively few functions, the file system interface is larger and more complex. VFAT has six distinct interfaces that together export over 35 calls; by comparison, the device drivers each have one or two interfaces with between six and thirteen functions. In addition, driver interfaces tend to pass relatively simple data structures, such as network packets and device objects, while the file system interfaces pass complex, heavily linked data structures such as inodes.

Second, I isolated an application-specific kernel extension, the kHTTPd Web server [200]. kHTTPd resides in the kernel so that it can access kernel network and file system data structures directly, avoiding otherwise expensive system calls. My experience with kHTTPd demonstrates that Nooks can isolate even ad-hoc and unanticipated kernel extensions.
Overall, I have isolated fourteen drivers and two other kernel extensions under Nooks, as shown in Table 6.1. I present reliability and performance results for seven of the extensions representing the three driver classes and the other extension types: sb, audigy, pcnet32, e1000, ide-disk, VFAT and kHTTPd. Results for the remaining drivers are consistent with those presented.

6.2.2 Test Platforms

To evaluate Nooks isolation and shadow driver recovery, I produced three OS configurations based on the Linux 2.4.18 kernel:

Table 6.1: The Linux extensions tested. Results are reported for those in boldface.

Class     Driver      Device
Network   e1000       Intel Pro/1000 Gigabit Ethernet
          pcnet32     AMD PCnet32 10/100 Ethernet
          3c59x       3COM 3c509b 10/100 Ethernet
          3c90x       3COM 3c90x series 10/100 Ethernet
          e100        Intel Pro/100 Ethernet
          epic100     SMC EtherPower 10/100 Ethernet
Sound     audigy      SoundBlaster Audigy sound card
          sb          SoundBlaster 16 sound card
          emu10k1     SoundBlaster Live! sound card
          es1371      Ensoniq sound card
          cs4232      Crystal sound card
          i810 audio  Intel 810 sound card
Storage   ide-disk    IDE disk
          ide-cd      IDE CD-ROM
None      VFAT        Win95 compatible file system
          kHTTPd      In-kernel Web server

1. Linux-Native is the unmodified Linux kernel.

2. Linux-Nooks is a version of Linux-Native that includes the Nooks fault isolation subsystem and the restart recovery manager. When a driver fails, this system restarts the driver but does not attempt to conceal its failure.

3. Linux-SD includes Nooks, the shadow driver recovery manager, and the three shadow drivers.

For isolation tests, I use Linux-Nooks, as it can restart both device drivers and other kernel extensions. For the recovery tests, I use Linux-SD.

6.2.3 Fault Injection

Ideally, I would test Nooks with a variety of real drivers long enough for many failures to occur. This is not practical, however, because of the time needed for testing. For example, with a mean time-to-failure of 1 week, it would take almost 10 machine years to observe 500 failures. Instead, I use synthetic fault injection to insert artificial faults into drivers. I adapted a fault injector developed

Table 6.2: The types of faults injected into extensions.

Fault Type         Code Transformation
Source fault       Change the source register
Destination fault  Change the destination register
Pointer fault      Change the address calculation for a memory instruction
Interface fault    Use existing value in register instead of passed parameter
Branch fault       Delete a branch instruction
Loop fault         Invert the termination condition of a loop instruction
Text fault         Flip a bit in an instruction
NOP fault          Elide an instruction

for the Rio File Cache [157] and ported it to Linux. The injector automatically changes single instructions in driver code to emulate a variety of common programming errors, such as uninitialized local variables, bad parameters, and inverted test conditions.

I inject two different classes of faults into drivers. First, I inject faults that emulate specific programming errors that are common in kernel code [191, 46]. Source and destination faults emulate assignment errors by changing the operand or destination of an instruction. Pointer faults emulate incorrect pointer calculations and cause memory corruption. Interface faults emulate bad parameters. I emulate bugs in control flow through branch faults, which remove a branch instruction, and loop faults, which change the termination condition for a loop.

Second, I expand the range of testing by injecting random changes that do not model specific programming errors. In this category are text faults, in which I flip a random bit in a random instruction, and NOP faults, in which I delete a random instruction. Table 6.2 shows the types of faults I inject, and how the injector simulates programming errors (see [157] for a more complete description of the fault injector).

Unfortunately, there is no available information on the distribution of programming errors for real-world driver failures. Therefore, I inject the same number of faults of each type. The output of the fault injection tests is therefore a metric of coverage, not reliability. The tests measure the fraction of faults (failure causes) that can be detected and isolated, not the fraction of existing failures that can be tolerated.

6.3 System Survival

In this section, I evaluate Nooks’ ability to isolate the kernel from extension failure. The goal of these tests is to measure the survival rate of the operating system. I consider application failures in Section 6.4.

6.3.1 Test Environment

The application-level workload consists of four programs that stress the sound-card driver, the network driver, the storage driver and VFAT, and kHTTPd. The first program plays a short MP3 file. The second performs a series of ICMP-ping and TCP streaming tests, while the third untars and compiles a number of files. The fourth program runs a Web load generator against the kernel-mode Web server.

To ensure a common platform for testing with both the device drivers and the other extensions, I use the restart recovery manager to restore extension operation after a failure. There is one exception: the tests for ide-disk use shadow drivers for recovery. The restart recovery manager depends on accessing the disk to reload a failed driver. Hence, when the disk driver fails, the restart manager is unable to reload it. In contrast, the shadow driver recovery manager keeps the driver image in memory and does not need to reload it off disk.

I ran most isolation experiments in the context of the VMware Virtual Machine [190]. The virtual machine allows me to perform thousands of tests remotely while quickly and easily returning the system to a clean state following each one. I spot-checked a number of the VMware trials against a base hardware configuration (i.e., no virtual machine) and discovered no anomalies. The e1000 tests, however, were run directly on raw hardware, because VMware does not emulate the Intel Pro/1000 Gigabit Ethernet card.

To measure isolation, I conducted a series of trials in which I injected faults into extensions running under two different Linux configurations, both running the Linux-Nooks kernel. In the first, called “native,” the Nooks isolation services were present but unused.1 In the second, called “Nooks,” the isolation services were enabled for the extension under test. For each extension, I ran

1To ensure that I injected exactly the same fault in both systems requires using the same binary module. Linux does not support binary compatibility for loadable modules across different kernel versions. Hence, I used the same kernel for both tests.

400 trials (50 of each fault type) on the native configuration. In each trial, I injected five random faults into the extension and exercised the system, observing the results. I then ran those same 400 trials, each with the same five faults, against Nooks. It is important to note that the native and Nooks configurations are identical binaries, allowing the automatic fault injector to introduce identical faults.

6.3.2 System Survival Test Results

As described above, I ran 400 fault-injection trials for each of the six measured extensions on the native and Nooks configurations. Not all fault-injection trials cause faulty behavior, e.g., bugs inserted on a rarely (or never) executed path will rarely (or never) produce an error. However, many trials do cause failures. I now examine different types of failures that occurred.

System Crashes

A system crash is the most extreme and easiest problem to detect, as the operating system panics, becomes unresponsive, or simply reboots. In an ideal world, every system crash caused by a fault-injection trial under native Linux would result in a recovery under Nooks. As previously discussed, in practice Nooks may not detect or recover from certain failures caused by very bad programmers or very bad luck.

Figure 6.1 shows the number of system crashes caused by the fault-injection experiments for each of the extensions running on native Linux and Nooks. Of the 517 crashes observed with native Linux, Nooks eliminated 512, or 99%. In the remaining five crashes, the system deadlocked when the driver went into a tight loop with interrupts disabled. Nooks does not detect this type of failure.

Figure 6.1 also illustrates a substantial difference in the number of system crashes that occur for the VFAT and sb extensions under Linux, compared to e1000, pcnet32, ide-disk and kHTTPd. This difference reflects the way in which Linux responds to kernel failures. The e1000, pcnet32 and ide-disk extensions are interrupt oriented, i.e., kernel-mode extension code is run as the result of an interrupt. The VFAT and sb extensions are process oriented, i.e., kernel-mode extension code is run as the result of a system call from a user process. kHTTPd is process oriented but manipulates (and therefore can corrupt) interrupt-level data structures. Linux treats exceptions in interrupt-oriented

[Figure 6.1 is a bar chart comparing the number of system crashes under native Linux and Nooks for each extension under test: sb, e1000, pcnet32, ide-disk, VFAT, and kHTTPd.]

Figure 6.1: The reduction in system crashes observed using Nooks.

code as fatal and crashes the system, hence the large number of crashes in e1000, pcnet32, ide-disk and kHTTPd. Linux treats exceptions in process-oriented code as non-fatal, continuing to run the kernel but terminating the offending process even though the exception occurred in the kernel. This behavior is unique to Linux. Other operating systems, such as Microsoft Windows XP, deal with kernel processor exceptions more aggressively by always halting the operating system. In such systems, exceptions in VFAT and sb would cause system crashes.

Non-Fatal Extension Failures

While Nooks is designed to protect the OS from misbehaving extensions, it is not designed to detect all erroneous extension behavior. For example, the network could disappear because the device driver corrupts the device registers, or a mounted file system might simply become non-responsive due to a bug. Neither of these failures is fatal to the system in its own right, and Nooks generally does not detect such problems (nor is it intended to). However, when Nooks’ simple failure detectors do detect such problems, its recovery services can safely restart the faulty extensions.

[Figure 6.2 is a bar chart comparing the number of non-fatal extension failures under native Linux and Nooks for each extension under test.]

Figure 6.2: The reduction in non-fatal extension failures observed using Nooks.

The fault-injection trials cause a number of non-fatal extension failures, allowing me to examine Nooks’ effectiveness in dealing with these cases, as well. Figure 6.2 shows the extent to which Nooks reduces non-fatal extension failures that occurred in native Linux. In reality, these results are simply a reflection of the Linux handling of process- and interrupt-oriented extension code, as described earlier. That is, Nooks can trap exceptions in process-oriented extensions and can recover the extensions to bring them to a clean state in many cases.

For the two interrupt-oriented Ethernet drivers (e1000 and pcnet32), Nooks already eliminated all system crashes resulting from extension exceptions. The remaining non-crash failures are those that leave the device in a non-functional state, e.g., unable to send or receive packets. Nooks cannot remove these failures for e1000 and pcnet32, since it cannot detect them. The few extension failures it eliminated occurred when process-oriented code was manipulating the driver.

For the ide-disk storage driver, there were no non-fatal extension failures. The operating system depends on the ide-disk driver to perform paging. The operating system crashes when the driver malfunctions, because it can no longer swap pages to disk. As a result, all failures of ide-disk are system

crashes.

For VFAT and the sb sound-card driver, Nooks reduced the number of non-fatal extension failures. These failures were caused by kernel exceptions in process-oriented code, which caused Linux to terminate the calling process and leave the extension in an ill-defined state. Nooks detected the processor exceptions and performed an extension recovery, thereby allowing the application to continue, but without access to the failed extension. The remaining non-fatal extension failures, which occurred under both native Linux and Nooks, were serious enough to leave the extension in a non-functioning state but not serious enough to generate a processor exception that could be trapped by Nooks.

The kHTTPd extension is similar to the interrupt-oriented drivers because it causes corruption that leads to interrupt-level faults. However, a small number of injected faults caused exceptions within the kHTTPd process-oriented code. Nooks caught these exceptions and avoided an extension failure.

In general, the remaining non-fatal extension failures under Nooks were the result of deadlock or data structure corruption within the extension itself. Fortunately, such failures were localized to the extension and could usually be recovered from once discovered. In Section 6.5 I further analyze the need for better error detection that is attuned to a specific class of extensions.

Manually Injected Faults

In addition to the automatic fault-injection experiments, I manually inserted 10 bugs across the extensions. To generate realistic synthetic driver bugs, I analyzed patches posted to the Linux Kernel Mailing List [133]. I found 31 patches that contained the strings “patch,” “driver,” and “crash” or “oops” (the Linux term for a kernel fault) in their subject lines. Of the 31 patches, I identified 11 that fix transient bugs (i.e., bugs that occur occasionally or only after a long delay from the triggering test). The most common cause of failure (three instances) was a missing check for a null pointer, often with a secondary cause of missing or broken synchronization. I also found missing pointer initialization code (two instances) and bad calculations (two instances) that led to endless loops and buffer overruns. I “broke” extensions by removing checks for NULL pointers, failing to initialize stack and heap

variables properly, dereferencing a user-level pointer, and freeing a resource multiple times. Nooks automatically detected and recovered from all such failures in all extensions.

Latent Bugs

Nooks revealed several latent bugs in existing kernel extensions. For example, it uncovered a bug in the 3COM 3c90x Ethernet driver that occurs during its initialization.2 Nooks also revealed a bug in another extension, kHTTPd [200], where an already freed object was referenced. In general, I found that Nooks was a useful kernel development tool that provides a “fast restart” whenever an extension under development fails.

6.3.3 Summary of System Survival Experiments

In testing its impact on operating system reliability, Nooks eliminated 99% of the system crashes that occurred with native Linux. The remaining failures directly reflect the best-efforts principle and are the cost, in terms of reliability, of an approach that imposes reliability on legacy extension and operating system code. In addition to crashes, Nooks can isolate and recover from many non-fatal extension failures. While Nooks cannot detect many kinds of erroneous behavior, it can often trap extension exceptions and initiate recovery. Overall, Nooks eliminated nearly 60% of non-fatal extension failures in the fault-injection trials. Finally, Nooks detected and recovered from all of the commonly occurring faults that I injected manually.

6.4 Application Survival

The previous section evaluated the ability of the operating system to survive extension failures. In this section, I focus on applications. I answer the question of whether applications that use a device driver continue to run even after the driver fails and recovers. I evaluate shadow driver recovery in the presence of simple failures to show the benefits of shadow drivers compared to a system that reveals failures to applications and the OS.

2If the driver fails to detect the card in the system, it immediately frees a large buffer. Later, when the driver is unloaded, it zeroes this buffer. Nooks caught this bug because it write-protected the memory when it was freed.

Table 6.3: The applications used for evaluating shadow drivers.

Device Driver         Application Activity
Sound                 • mp3 player (zinf) playing 128kb/s audio
(audigy driver)       • audio recorder (audacity) recording from microphone
                      • speech synthesizer (festival) reading a text file
                      • strategy game (Battle of Wesnoth)
Network               • network file transfer (scp) of a 1GB file
(e1000 driver)        • remote window manager (vnc)
                      • network analyzer (ethereal) sniffing packets
Storage               • compiler (make/gcc) compiling 788 C files
(ide-disk driver)     • encoder (LAME) converting a 90 MB .wav file to .mp3
                      • database (mySQL) processing the Wisconsin Benchmark

6.4.1 Application Test Methodology

Unlike the system survival experiments, the application survival experiments use a wider variety of drivers than VMware supports and were therefore run on a bare machine, a 3 GHz Pentium 4 PC with 1 GB of RAM and an 80 GB, 7200 RPM IDE disk drive. I built and tested three Linux shadow drivers for three device-driver classes: network interface controller, sound card, and IDE storage device. To ensure that my generic shadow drivers worked consistently across device driver implementations, I tested them on fourteen different Linux drivers, shown in Table 6.1.3 Although I present detailed results for only one driver in each class (e1000, audigy, and ide-disk), behavior across all drivers was similar.

The crucial question for shadow drivers is whether an application can continue functioning following the failure of a device driver on which it relies. To answer this question, I tested the 10 applications shown in Table 6.3 on each of the three configurations: Linux-Native, Linux-Nooks, and Linux-SD.

I simulated common bugs by injecting a software fault into a device driver while an application using that driver was running. Because both Linux-Nooks and Linux-SD depend on the same isolation and failure-detection services, I differentiate their recovery capabilities by simulating failures that are easily isolated and detected. Based on the analysis of common driver bugs in Section 6.3.2,

3Shadow drivers target device drivers, so I did not test the VFAT and kHTTPd kernel extensions.

Table 6.4: The observed behavior of several applications following the failure of the device drivers on which they depend.

                                                   Application Behavior
Device Driver       Application              Linux-Native   Linux-Nooks   Linux-SD
Sound               mp3 player               CRASH          MALFUNCTION   √
(audigy driver)     audio recorder           CRASH          MALFUNCTION   √
                    speech synthesizer       CRASH          √             √
                    strategy game            CRASH          MALFUNCTION   √
Network             network file transfer    CRASH          √             √
(e1000 driver)      remote window manager    CRASH          √             √
                    network analyzer         CRASH          MALFUNCTION   √
IDE                 compiler                 CRASH          CRASH         √
(ide-disk driver)   encoder                  CRASH          CRASH         √
                    database                 CRASH          CRASH         √

I injected a null-pointer dereference bug into my three drivers.

6.4.2 Application Test Results

Table 6.4 shows the three application behaviors I observed. When a driver failed, each application continued to run normally (√), failed completely (“CRASH”), or continued to run but behaved abnormally (“MALFUNCTION”). In the latter case, manual intervention was typically required to reset or terminate the program.

This table demonstrates that shadow drivers (Linux-SD) enable applications to continue running normally even when device drivers fail. In contrast, all applications on Linux-Native failed when drivers failed. Most programs running on Linux-Nooks failed or behaved abnormally, illustrating that restart recovery protects the kernel, which is constructed to tolerate driver failures, but does not protect applications.

The restart recovery manager lacks two key features of shadow drivers: (1) it does not advance the driver to its pre-fail state, and (2) it has no component to “pinch hit” for the failed driver during recovery. As a result, Linux-Nooks handles driver failures by returning an error to the application, leaving it to recover by itself. Unfortunately, few applications can do this.

Some applications on Linux-Nooks survived the driver failure but in a degraded form. For example, mp3 player, audio recorder and strategy game continued running, but they lost their ability

to input or output sound until the user intervened. Similarly, network analyzer, which interfaces directly with the network device driver, lost its ability to receive packets once the driver was reloaded.

A few applications continued to function properly after driver failure on Linux-Nooks. One application, speech synthesizer, includes code to reestablish its context within an unreliable sound-card driver. Two of the network applications survived on Linux-Nooks because they access the network device driver through kernel services (TCP/IP and sockets) that are themselves resilient to driver failures.

Unlike Linux-Nooks, Linux-SD can recover from disk driver failures.4 Recovery is possible because the IDE storage shadow driver instance maintains the failing driver’s initial state. During recovery the shadow copies back the driver’s initial data and reuses the driver code, which is already stored read-only in the kernel. In contrast, Linux-Nooks illustrates the risk of circular dependencies from rebooting drivers. Following these failures, the restart recovery manager, which had unloaded the ide-disk driver, was then required to reload the driver off the IDE disk. The circularity could only be resolved by a system reboot. While a second (non-IDE) disk would mitigate this problem, few machines are configured this way.

In general, programs that directly depend on driver state but are unprepared to deal with its loss benefit the most from shadow drivers. In contrast, those that do not directly depend on driver state or are able to reconstruct it when necessary benefit the least. My experience suggests that few applications are as fault-tolerant as speech synthesizer. Were future applications to be pushed in this direction, software manufacturers would either need to develop custom recovery solutions on a per-application basis or find a general solution that could protect any application from the failure of a device driver. Cost is a barrier to the first approach.
Shadow drivers are a path to the second.

6.4.3 Application Behavior During Driver Recovery

Although shadow drivers can prevent application failure, they are not “real” device drivers and do not provide complete device services. As a result, I often observed a slight timing disruption while the driver recovered. At best, output was queued in the shadow driver or the kernel. At worst, input was lost by the device. The length of the delay depends on the recovering device driver itself, which, on initialization, must first discover and then configure the hardware.

4For this reason, the ide-disk experiments in Section 6.3.2 use the shadow driver recovery manager.

Few device drivers implement fast reconfiguration, so recovery typically entails a brief delay. For example, the temporary loss of the e1000 network device driver prevented applications from receiving packets for about five seconds.5 Programs using files stored on the disk managed by the ide-disk driver stalled for about four seconds during recovery. In contrast, the normally smooth sounds produced by the audigy sound-card driver were interrupted by a pause of about one-tenth of one second, which sounded like a slight click in the audio stream.

Of course, the significance of these delays depends on the application. Streaming applications may become unacceptably “jittery” during recovery. Those processing input data in real-time might become lossy. Others may simply run a few seconds longer in response to a disk that appears to be operating more sluggishly than usual. In any event, a short delay during recovery is best considered in light of the alternative: application and system failure.

6.4.4 Summary of Application Survival Experiments

This section examined the ability of shadow drivers to conceal failures from applications. The results demonstrate that applications that failed in any form on Linux-Native or Linux-Nooks ran normally with shadow drivers. Applications that depend on state in a driver benefited the most. In all cases, recovery proceeded quickly, compared to a full system reboot.

6.5 Testing Assumptions

The preceding section assumed that failures were fail-stop. However, driver failures experienced in deployed systems may exhibit a wider variety of behaviors. For example, a driver may corrupt the application, kernel, or device without the failure being detected. In this situation, shadow drivers may not be able to recover or mask failures from applications. This section uses fault injection experiments in an attempt to generate faults that may not be fail-stop.

5This driver is particularly slow at recovery. The other network drivers I tested recovered in less than a second.

6.5.1 Testing the Fail-stop Assumption

If Nooks’ isolation services cannot ensure that driver failures are fail-stop, then shadow drivers may not be useful. To evaluate whether device driver failures can be made fail-stop, I performed large-scale fault-injection tests of drivers and applications running on Linux-SD. For each driver and application combination, I ran 400 fault-injection trials using the methodology from Section 6.2.3. However, because VMware does not support the drivers in these tests, they were run on real hardware instead of a virtual machine. In total, I ran 2400 trials across the three drivers and six applications. I ensured the errors were transient by removing them during recovery. After injection, I visually observed the effect on the application and the system to determine whether a failure or recovery had occurred. For each driver, I tested two applications with significantly different usage scenarios. For example, I chose one sound-playing application (mp3 player) and one sound-recording application (audio recorder).

If I observed a failure, I then assessed the trial on two criteria: whether Nooks detected the failure, and whether the shadow driver could mask the failure and subsequent recovery from the application. For undetected failures, I triggered recovery manually. Note that a user may observe a failure that an application does not, for example, by testing the application’s responsiveness.

Figure 6.3 shows the results of my experiments. For each application, I show the percentage of failures that the system detected and recovered automatically, the percentage of failures that the system recovered successfully after manual failure detection, and the percentage of failures in which recovery was not successful. The total number of failures that each application experienced is shown at the bottom of the chart. Only 16% of the injected faults caused a failure.

In my tests, 390 failures occurred across all applications. The system automatically detected 65% of the failures. In every one of these cases, shadow drivers were able to mask the failure and recover. Similar to the non-fatal extension failures when testing isolation, the system failed to detect 35% of the failures. However, in these cases, I manually triggered recovery. Shadow drivers recovered from nearly all of these failures (127 out of 135). Recovery was unsuccessful in the remaining 8 cases because either the system had crashed (5 cases) or the driver had corrupted the application beyond the possibility of recovery (3 cases). It is possible that recovery would have succeeded had a better failure detector discovered these failures earlier.
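The arithmetic behind these figures can be checked directly. The counts below are taken from the text; the script itself is only an illustrative back-of-the-envelope calculation, not part of the experimental apparatus.

```python
# Recovery-rate arithmetic for the 390 observed failures.
total_failures = 390
manual = 135                     # failures Nooks did not detect on its own
auto = total_failures - manual   # 255 automatically detected, all recovered
manual_recovered = 127
failed = manual - manual_recovered            # 8 unsuccessful recoveries
recovery_rate = (auto + manual_recovered) / total_failures

print(f"automatic detection rate: {auto / total_failures:.0%}")  # ~65%
print(f"overall recovery rate: {recovery_rate:.1%}")             # ~97.9%
```

The overall recovery rate computed this way is consistent with the 98% application-survival figure quoted in the chapter summary.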

[Figure 6.3: stacked bar chart. For each application, grouped under the sound, network, and IDE drivers, the bar shows the percentage of outcomes that were automatic recoveries, manual recoveries, and failed recoveries. Total failures per application: mp3 player 79, audio recorder 44, network file transfer 97, network analyzer 76, compiler 38, database 58.]

Figure 6.3: Results of fault-injection experiments on Linux-SD.

Across all applications and drivers, I found three major causes of undetected failure. First, the system did not detect application hangs caused by I/O requests that never completed (omission failures). Second, the system did not detect errors in the interactions between the device and the driver, e.g., incorrectly copying sound data to a sound card (output failures). Third, the system did not detect certain bad parameters, such as incorrect result codes or data values (again, output failures). For example, the ide-disk driver corrupted the kernel’s disk request queue with bad entries. Detecting these three error conditions would require that the system better understand the semantics of each driver class. For example, 68% of the sound driver failures with audio recorder went undetected. This application receives data from the driver in real time and is highly sensitive to driver output. A small error or delay in the results of a driver request may cause the application to stop recording or record the same sample repeatedly.

My results demonstrate a need for better failure detectors to achieve high levels of reliability. Currently, Nooks relies on generic failure detectors that are not specialized to the behavior of a class of drivers. Shadow drivers provide an opportunity to add specialized detectors as part of the passive-mode monitoring code.
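One such class-specific detector, targeting the omission failures described above, could be sketched as follows. This is a hypothetical illustration, not Nooks source code; the class and method names are mine.

```python
class OmissionDetector:
    """Hypothetical class-specific failure detector: flag driver requests
    that never complete within a deadline, catching omission failures that
    generic detectors miss."""

    def __init__(self, deadline_s):
        self.deadline_s = deadline_s
        self.pending = {}            # request id -> submission time (seconds)

    def submitted(self, req_id, now):
        """Would be called from the shadow driver's passive-mode tap."""
        self.pending[req_id] = now

    def completed(self, req_id):
        self.pending.pop(req_id, None)

    def overdue(self, now):
        """Requests that have outlived the deadline; any hit suggests the
        driver has hung and recovery should be triggered."""
        return [r for r, t in self.pending.items() if now - t > self.deadline_s]

det = OmissionDetector(deadline_s=2.0)
det.submitted("read#1", now=0.0)
det.submitted("read#2", now=0.5)
det.completed("read#1")
print(det.overdue(now=5.0))   # read#2 never completed, so it is reported
```

A real detector would run off a kernel timer and use per-class deadlines (e.g., much shorter for sound than for disk), but the bookkeeping is essentially this.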

6.5.2 Recovery Errors

The fault injection tests in Section 6.3.2 provide further data on whether Nooks can ensure that extension failures are fail-stop. The VFAT file system manages persistent state stored on disk, and hence there is some chance that the extension damaged critical on-disk structures before Nooks detected an error condition.

In practice, I found that in 90% of the cases, VFAT recovery with the restart recovery manager resulted in on-disk corruption (i.e., lost or corrupt files or directories). Since fault injection occurs after many files and directories have been created, the abrupt shutdown and restart of the file system leaves them in a corrupted state. As an experiment, I caused Nooks to synchronize the disks with the in-memory disk cache before releasing resources on a VFAT recovery. This reduced the number of corruption cases from 90% to 10%. These statistics are similar to what occurs when the system crashes, and they suggest that a recovery manager for file systems should perform this synchronization during recovery.
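The effect of the extra synchronization step can be illustrated with a toy model of the recovery path. Everything here is hypothetical (a stand-in cache and disk, with names of my choosing), meant only to show why flushing dirty blocks before teardown preserves on-disk state.

```python
class DiskCache:
    """Toy in-memory disk cache: blocks written by the file system stay
    dirty until explicitly synchronized to the disk."""

    def __init__(self):
        self.blocks = {}   # block number -> (data, dirty?)

    def write(self, block_no, data):
        self.blocks[block_no] = (data, True)

    def sync(self, disk):
        """Synchronize the disk with the cache before releasing resources."""
        for block_no, (data, dirty) in self.blocks.items():
            if dirty:
                disk[block_no] = data
                self.blocks[block_no] = (data, False)

def recover_file_system(cache, disk, sync_first):
    if sync_first:
        cache.sync(disk)    # the extra step that cut corruption from 90% to 10%
    cache.blocks.clear()    # abrupt shutdown: unsynchronized writes are lost

disk = {}
cache = DiskCache()
cache.write(7, b"dir entry")
recover_file_system(cache, disk, sync_first=True)
print(disk)   # block 7 reached the disk before the restart
```

Running the same sequence with sync_first=False leaves the disk empty, which is the toy analogue of the corrupted directories observed in the experiment.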

6.5.3 Summary

This section examined the assumption that Nooks can make driver failures fail-stop. My test results show that most failures are fail-stop, but a substantial fraction are not. The problem was not one of isolation, as the kernel and applications continued to execute after manually triggered recovery, but of failure detection. Thus, further improving reliability demands driver class-specific failure detectors to raise the detection rate.

6.6 Performance

This section presents benchmark results that evaluate the performance cost of Nooks and shadow drivers. The experiments use existing benchmarks and tools to compare the performance of a system using Nooks to one that does not. I ran most tests on a Dell 3 GHz Pentium 4 PC running Linux 2.4.18. The machine includes 1 GB of RAM, a SoundBlaster Audigy sound card, an Intel Pro/1000 Gigabit Ethernet adapter, and a single 7200 RPM, 80 GB IDE hard disk drive. My network tests

Table 6.5: The relative performance of Nooks compared to native Linux for six benchmark tests. An ’*’ in the first column indicates a different hardware platform.

Benchmark                Extension          XPC Rate    Nooks Relative  Native CPU  Nooks CPU
                                            (per sec)   Perf. (%)       Util. (%)   Util. (%)
mp3 player               audigy                   462        100             0.1        0.1
audio recorder           audigy                 1,059        100             0.3        0.6
network receive          e1000 (receiver)      13,422        100            24.8       31.2
network send             e1000 (sender)        58,373         99            27.8       56.7
DB queries               ide-disk                 294         97            21.4       26.9
Compile-local            ide-disk                  58         98            99.2       99.6
Compile-local*           VFAT                  26,979         89            88.7       88.1
Serve-simple-web-page*   kHTTPd (server)       61,183         44            96.6       96.8
Serve-complex-web-page*  e1000 (server)         1,960         97            90.5       92.6

used two identically equipped machines.6 A few tests, indicated by a ’*’ in Table 6.5, were run on similar machines with a 1.7 GHz Pentium 4 and 890 MB of RAM. Unlike the reliability tests described previously, I ran all performance tests on a bare machine, i.e., one without the VMware virtualization system.

Table 6.5 summarizes the benchmarks used to evaluate system performance. For each benchmark, I used Nooks to isolate a single extension, indicated in the second column of the table. I ran each benchmark on native Linux without Nooks (Linux-Native) and then again on a version of Linux with Nooks and shadow drivers enabled (Linux-SD). I used the shadow driver recovery manager for device drivers and the restart recovery manager for VFAT and kHTTPd.7 The table shows the relative change in performance for Nooks, either in wall-clock time or throughput, depending on the benchmark. I also show CPU utilization measured during benchmark execution (which is only accurate to a few percent), as well as the rate of XPCs per second incurred during each test. I compute relative performance either by comparing latency (mp3 player, Compile-local, DB queries) or throughput (network send, network receive, Serve-simple-web-page, Serve-complex-web-page). The data reflect the average of three trials with a standard deviation of less than 2%. The table shows

6I do not report performance information for the slower network and sound cards to avoid unfairly biasing the results in favor of Nooks.
7The serve-complex-web-page test uses the restart recovery manager as well.

that Nooks achieves between 44% and 100% of the performance of native Linux for these tests.

To separate the costs of isolation and recovery, I performed performance tests on the device drivers both with the restart recovery manager, which does no logging, and the shadow driver recovery manager, which logs requests to drivers. The performance and CPU utilization with the two recovery managers were within one percent, indicating that shadow drivers and their associated mechanisms do not add to the performance cost of using Nooks. Rather, it is the cost of Nooks’ isolation services that determines performance.

As the isolation services are primarily imposed at the point of the XPC, the rate of XPCs offers a telling performance indicator. Thus, the benchmarks fall into three broad categories characterized by the rate of XPCs: low frequency (a few hundred XPCs per second), moderate frequency (a few thousand XPCs per second), and high frequency (tens of thousands of XPCs per second). I now look at each benchmark in turn.
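Why XPC rate is such a telling indicator can be seen with a toy cost model. The per-XPC cost used below (5 microseconds) is an illustrative assumption of mine, not a measurement from this chapter; only the XPC rates come from Table 6.5.

```python
def xpc_cpu_fraction(xpc_per_sec, xpc_cost_us):
    """Fraction of one CPU consumed purely by domain crossings (toy model)."""
    return xpc_per_sec * xpc_cost_us / 1_000_000

# Assumed per-XPC cost of 5 microseconds (page-table switch plus TLB refill).
for rate in (462, 13_422, 58_373):       # low, moderate, and high XPC rates
    print(f"{rate:>6} XPC/s -> {xpc_cpu_fraction(rate, 5):.1%} of a CPU")
```

Even under this crude model, a few hundred XPCs per second cost well under 1% of a CPU, while tens of thousands per second consume a quarter or more, matching the qualitative pattern in the benchmark results.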

6.6.1 Sound Benchmark

The mp3 player application plays an MP3 file at 128 kilobits per second through the system’s sound card, generating only 470 XPCs per second. The audio recorder application records sound at 44,100 16-bit samples per second and generates a similarly low XPC rate. At this low rate, the additional overhead of Nooks is imperceptible, both in terms of execution time and CPU overhead. For the many low-bandwidth devices in a system, such as keyboards, mice, Bluetooth devices [95], modems, and sound cards, Nooks offers a clear benefit by improving driver reliability at almost no performance cost.

6.6.2 Network Benchmarks

The network receive benchmark is an example of a moderate XPC-frequency test. Network receive performance was measured with the netperf [116] performance tool, where the receiving node used an isolated Ethernet driver to receive a stream of 32KB TCP messages using a 256KB buffer. The Ethernet driver for the Intel Pro/1000 card batches incoming packets to reduce interrupt, and hence XPC, frequency. Nevertheless, the receiver performs XPCs in the interrupt-handling code, which is on the critical path for packet delivery. This results in no change in throughput but an overall CPU

utilization increase of 7 percentage points. In contrast, the network send benchmark (also measured using netperf) is a high XPC-frequency test that isolates the sending node’s Ethernet driver. Unlike the network receive test, which benefits from the batching of received packets, the OS does not batch outgoing packets. Therefore, although the total amount of data transmitted is the same, the network send test executes almost four times more XPCs per second than the network receive test. The overall CPU utilization on the sender thus doubles from about 28% on native Linux to 57% with Nooks. As with the network receive benchmark, throughput is almost unchanged. Despite the higher XPC rate, much of the XPC processing on the sender is overlapped with the actual sending of packets, mitigating some of the overhead from Nooks. Nevertheless, on slower processors or faster networks, it may be worthwhile to batch outgoing packets as is done, for example, in network terminal protocols [85].
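The effect of batching on the XPC rate reduces to simple arithmetic. The sketch below is a toy model of mine (the batch size of four is an assumption chosen to match the roughly fourfold rate difference above, not a measured driver parameter).

```python
import math

def xpcs_per_second(packets_per_sec, batch_size):
    """Crossings needed when the driver hands `batch_size` packets across
    the isolation boundary per XPC (illustrative model)."""
    return math.ceil(packets_per_sec / batch_size)

pkts = 58_373                      # treat the send-side XPC rate as one packet each
print(xpcs_per_second(pkts, 1))    # unbatched send: one XPC per packet
print(xpcs_per_second(pkts, 4))    # batching four packets quarters the XPC rate
```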

6.6.3 Database Queries Benchmark

The database benchmark is a disk-intensive application chosen to stress the disk driver. It runs the mySQL database [122] with the Wisconsin database query benchmark [61]. Similar to the sound benchmarks, this benchmark generates a low XPC rate, approximately 300 XPCs per second. However, the CPU utilization during this test rose from 21% on Linux-Native to 27% on Linux-Nooks. I analyze the cause of this slowdown when discussing the compile benchmark in the next section.

6.6.4 Compile Benchmark

As an indication of application-level file system and disk performance, I measured the time to untar and compile the Linux kernel, both on a local ext3 file system over the ide-disk driver and on a local VFAT file system. Table 6.5 shows that the compilation time was unchanged when ide-disk was isolated but ran about 10% slower when VFAT was isolated. In both cases, the CPU was nearly 100% utilized in the native case, so the additional overhead resulting from Nooks is directly reflected in the end-to-end execution time. I take this benchmark as an opportunity to analyze the causes of Nooks’ slowdown. For ide-disk, the low XPC rate of 58 per second shows why performance was unchanged, so I focus instead

on VFAT. There are two possible reasons that the compile-local benchmark runs more slowly with Nooks: there is more code to run, and the existing code runs more slowly. To understand both sources of slowdown, I profiled the compile-local benchmark on the native Linux kernel and on Linux with Nooks isolation.

I used statistical profiling of the kernel to break the execution time of the benchmark into components. The results are shown in Figure 6.4. User time (not shown) for all configurations was identical, about 477 seconds; however, kernel time is significantly different. For native Linux, the benchmark spends a total of 39 seconds in the kernel. With Nooks, the benchmark spends more than 111 seconds in the kernel. Most of the additional 72 seconds spent in the kernel with Nooks is caused by executing extra code that is not executed without Nooks. The new code in Nooks (shown by the upper bars in Figure 6.4) accounts for 46 of the additional 72 seconds spent in the kernel under Nooks. Of the 46 seconds spent in Nooks, more than half (28 seconds) is due to XPC alone. Object tracking is another large cost (6 seconds), because Nooks consults a hash table to look up function parameters in the file system interface. Other components of Nooks, such as wrappers, synchronizing page tables with the kernel, and data copying, have only minor costs.

These additional execution components reflect Nooks’ support for isolation and recovery. The isolation costs manifest themselves in terms of XPC overhead, page table synchronization, and copying data in and out of protection domains. Nooks’ support for recovery, with either the restart or the shadow driver recovery manager, is reflected in the time spent tracking objects, which occurs on every XPC in which a pointer parameter is passed. Tracking allows the recovery managers to recover kernel resources in the event of an extension failure. These measurements demonstrate that recovery has a substantial cost, or, more generally, that very fast inter-process communication (IPC) has its limits in environments where recovery is as important as isolation.

New code in Nooks is not the only source of slowdown. The lower bars in Figure 6.4 show the execution time for native Linux code in the kernel (labeled other kernel) and in the VFAT file system. For native Linux, 35 seconds were spent in the kernel proper and 4.4 seconds were spent in the VFAT code. In contrast, with Nooks the benchmark spent about 54 seconds in the kernel and 12 seconds in VFAT. The benchmark executes the same VFAT code in both tests, and most of the kernel

[Figure 6.4: stacked bar chart of kernel time in seconds (0 to 120) for three configurations: Native Linux, Nooks no-PT, and Nooks. Bar segments: Other Kernel, VFAT, and the Nooks components XPC, Object Tracking, Data Copying, Nooks Misc., PT Sync., and Wrappers.]

Figure 6.4: Comparative times spent in kernel mode for the Compile-local (VFAT) benchmark.

code executing is the same as well (Nooks makes a few additional calls into the kernel to allocate resources for the object tracker). The slowdown of VFAT and the kernel must therefore be caused by the processor executing the same code more slowly, due to micro-architectural events such as the cache and TLB misses that Nooks causes.

I speculated that both the high cost of XPC and the slowdown of existing kernel and VFAT code are due to the domain change, which is very expensive on the x86 architecture. Changing the page table register during the domain change takes many cycles to execute and flushes the TLB, leading to many subsequent TLB misses. To further understand this effect, I built a special version of Nooks, called Nooks no-PT, that does not use a private page table. That is, all of the mechanism is identical to normal Nooks, but there is no address space switch between the driver and kernel domains. Comparing Nooks no-PT with normal Nooks thus isolates the cost of the page table switch instruction and the subsequent TLB misses caused by the TLB flush from other Nooks-induced slowdowns.

As shown by the center bar in Figure 6.4, the Nooks kernel without a separate page table (Nooks no-PT) spent 51 seconds in the kernel proper and 9 seconds in VFAT. During the run with Nooks, the system performed 10,934,567 XPCs into the kernel and 40,086,586 XPCs into the extension. The cost of Nooks was reduced to just 23 seconds, and the time spent doing XPCs dropped from

28 seconds to 7 seconds. These results demonstrate that changing the page table is an expensive operation, causing XPC to take many cycles, but that the TLB and cache misses caused by the change account for only 32% of the slowdown for VFAT and 17% of the kernel slowdown. Since the code in both configurations is the same, I speculate that the remaining performance difference between native Linux and Nooks is caused by additional TLB and cache misses that occur because Nooks copies data between the file system and the kernel. The Pentium 4 has just an 8KB L1 data cache, so it is particularly sensitive to additional memory accesses.
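The object-tracking cost discussed above comes from a hash-table lookup on every XPC that passes a pointer. Its role can be sketched as follows; this is a hypothetical illustration with names of my own, not Nooks source code.

```python
class ObjectTracker:
    """Sketch of Nooks-style object tracking: record kernel objects passed
    across an XPC so the recovery manager can release everything a failed
    extension held."""

    def __init__(self):
        self.table = {}   # object address -> (object type, owning domain)

    def track(self, addr, obj_type, domain):
        """Hash-table insert performed on each XPC with a pointer parameter."""
        self.table[addr] = (obj_type, domain)

    def lookup(self, addr):
        return self.table.get(addr)

    def release_all(self, domain, free):
        """On recovery, free every object owned by the failed domain."""
        for addr, (obj_type, owner) in list(self.table.items()):
            if owner == domain:
                free(addr, obj_type)
                del self.table[addr]

freed = []
tracker = ObjectTracker()
tracker.track(0xBEEF, "inode", "vfat")
tracker.track(0xF00D, "skb", "e1000")
tracker.release_all("vfat", lambda addr, t: freed.append((addr, t)))
print(freed, tracker.lookup(0xF00D))
```

The per-XPC insert is what shows up as the 6-second object-tracking cost in the profile; the payoff is that release_all lets a recovery manager reclaim kernel resources after a failure.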

I have done little to optimize Nooks, but believe that significant speedup is possible through software techniques that have been demonstrated by other researchers, such as finely tuned data structures [178, 65] and request batching.

6.6.5 Web Server Benchmarks

The final two benchmarks illustrate the impact of Nooks on server performance of transactional workloads. Serve-simple-web-page uses a high XPC-frequency extension (kHTTPd) on the server to deliver static content cached in memory. I used httperf [147] to generate a workload that repeatedly requested a single kilobyte-sized Web page. kHTTPd on native Linux can serve over 15,000 pages per second. With Nooks, it can serve about 6,000, representing a 60% decrease in throughput.

Two elements of the benchmark’s behavior conspire to produce such poor performance. First, the server’s processor is the system bottleneck: when run natively, the server’s CPU utilization is nearly 96%. Consequently, the high XPC rate slows the server substantially. Second, since the workload is transactional and non-buffered, the client’s request rate drops as a function of the server’s slowdown. By comparison, the network send benchmark, which exhibits roughly the same rate of XPCs but without saturating the CPU, degrades by only 10%. In addition, the network send benchmark is not transactional, so network buffering helps to mask the server-side slowdown.

Nevertheless, it is clear that kHTTPd represents a poor application of Nooks: it is CPU-limited and performs many XPCs. This service was cast as an extension so that it could access kernel resources directly, rather than indirectly through the standard system call interface. Since Nooks’ isolation facilities impose a penalty on those accesses, performance suffers. I believe that other

types of extensions, such as virus and intrusion detectors, which are placed in the kernel to access or protect resources otherwise unavailable from user level, would make better candidates, as they do not represent system bottlenecks.

In contrast to kHTTPd, the second Web server test (Serve-complex-web-page) reflects moderate XPC frequency. Here, I ran the SPECweb99 workload [189] against the user-mode Apache 2.0 Web Server [5], with and without Nooks isolation of the e1000 Ethernet driver. This workload includes a mix of static and dynamic Web pages. When running without Nooks, the Web server handled a peak of 114 requests per second.8 With Nooks installed and the Ethernet driver isolated on the server, peak throughput dropped by about 3%, to 110 requests per second. This small slowdown is explained by the moderate XPC rate of almost 2,000 per second.

6.6.6 Performance Summary

This section used a small set of benchmarks to quantify the performance cost of Nooks. For the device drivers, Nooks imposed a performance penalty of less than 3%. For the other kernel extensions, the impact was higher, reaching nearly 60% for kHTTPd, an ad hoc application extension. The rate of XPCs is a key factor in the performance impact, as each XPC imposes a high burden due to the cost of flushing the x86 TLB in the current implementation. The performance cost of Nooks’ isolation services depends as well on the CPU utilization imposed by the workload. If the CPU is saturated, the additional cost can be significant, because there is no spare capacity to absorb the Nooks overhead.

6.7 Summary

Overall, Nooks provides a substantial reliability improvement at low cost for common drivers. The results demonstrate that: (1) the performance overhead of Nooks during normal operation is small for many drivers and moderate for other extensions, and (2) applications and the OS survived driver failures that otherwise would have caused the OS, the application, or both to fail. In total, Nooks and shadow drivers prevented 99% of system crashes and 98% of application failures, with an average performance cost of 1% for drivers. The additional cost of shadow driver recovery as compared to restarting a driver is negligible.

8The test configuration is throughput-limited due to its single IDE disk drive.

These results indicate that Nooks and shadow drivers have the potential to significantly improve the reliability of commodity operating systems at low cost. Given the performance of modern systems, I believe that the benefits of isolation and recovery are well worth the costs for many computing environments.

Chapter 7

THE DYNAMIC DRIVER UPDATE MECHANISM

7.1 Introduction

In addition to tolerating driver failures, Nooks and shadow drivers provide a platform for other reliability services. The Nooks interposition service provides a mechanism to hook communication between drivers and the kernel, as evidenced by shadow drivers’ use of it for taps. The object tracker provides a mechanism to discover the objects in use by a driver and selectively release those objects to the kernel. Shadow drivers provide the ability to actively proxy for drivers, concealing their unavailability, and to transfer state between two driver instances. As a demonstration of how these capabilities can be leveraged to provide new services, I implemented dynamic driver update, a mechanism built on Nooks and shadow drivers that replaces drivers on-line.

As commodity operating systems become more reliable and fault-tolerant, the availability of a system will be determined not by when it crashes, but instead by when it must be shut down and rebooted for software maintenance [91, 150]. While many system components can be upgraded on-line, critical low-level components, such as device drivers and other kernel extensions, cannot be updated without rebooting the entire operating system. Thus, scheduled downtime promises to be the ultimate limiter of system availability, because operating systems cannot replace all running code on-line.

Device drivers represent a large portion of this problem because they are both critical to OS operation and frequently updated. For example, Microsoft’s Windows Update reports 115 updates for Windows XP between 2000 and 2005, but over 480 updates for network drivers and 370 updates for video drivers [143]. The picture is similar in Linux, where IDE storage drivers, needed by the OS for virtual memory swapping, were updated more than 67 times in three years [196]. In addition, studies show that patches and updates frequently cause failures, by introducing new bugs not found in testing [192, 57].

To achieve high availability, operating systems must be able to replace critical device drivers without rebooting the entire system. Furthermore, updating device drivers is inherently risky because a faulty device driver may prevent the system from booting. Thus, driver updates must be isolated to ensure that the new code does not compromise system stability, and it must be possible to “undo” an update that does not function.

Updating drivers on-line presents three major challenges. First, there are thousands of existing drivers, many of which will eventually require an update. To make on-line update practical, the system must be able to update these drivers without any work by the drivers’ authors. Existing systems for updating code on-line require that a skilled programmer write code to transfer data between the old and new versions [75, 66, 103, 161]. Second, because drivers assume exclusive access to devices and require access to the device to initialize, the old driver must be disabled while the new driver initializes. Thus, the device is unavailable during an update. Any portion of the update process that depends on a device being present (e.g., loading the new driver off a disk when updating the disk driver) must be performed before the old driver is disabled. Finally, an updated driver may offer substantially different services or support different interfaces. For example, a new version of a driver may remove functions or capabilities provided by the prior version. The updated device driver may not be compatible with the running version and could cause applications or the OS to crash. Thus, the update system must ensure that an update is safe and does not contain incompatibilities that could compromise system reliability.

In this chapter, I present the design and implementation of dynamic driver update, a service that replaces driver code on-line. It can update common device drivers without restarting applications or the operating system.
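The update sequence this chapter describes can be sketched in outline: the shadow driver proxies for the device while the old driver is swapped for the new one, and rolls back if the new driver fails to initialize. Everything below is a hypothetical stand-in (stub classes and names of my choosing), not the actual implementation.

```python
class DriverInitError(Exception):
    pass

class StubDriver:
    """Stand-in for a driver module; `broken` simulates a faulty update."""
    def __init__(self, name, broken=False):
        self.name, self.broken, self.loaded, self.state = name, broken, False, None
    def load(self):
        if self.broken:
            raise DriverInitError(self.name)
        self.loaded = True

class StubShadow:
    """Stand-in for a shadow driver: proxies during the update and replays
    captured configuration state into whichever driver instance wins."""
    def __init__(self, state):
        self.state, self.proxying, self.active = state, False, None
    def start_active_proxying(self):
        self.proxying = True
    def stop_active_proxying(self, driver):
        self.proxying, self.active = False, driver

def dynamic_driver_update(shadow, old, new):
    """Sketch of the on-line update sequence, with rollback on failure."""
    shadow.start_active_proxying()      # conceal the device's unavailability
    old.loaded = False                  # disable the old driver first
    try:
        new.load()                      # device unavailable during this window
        winner = new
    except DriverInitError:
        old.load()                      # "undo": roll back to the prior version
        winner = old
    winner.state = shadow.state         # transfer the captured driver state
    shadow.stop_active_proxying(winner)
    return winner.name

shadow = StubShadow(state={"irq": 11})
print(dynamic_driver_update(shadow, StubDriver("e1000 v1"), StubDriver("e1000 v2")))
```

Note that, per the second challenge above, anything the update needs from the device (such as reading the new driver image off the disk being updated) must happen before the old driver is disabled; the sketch omits that staging step.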
When combined with Nooks’ isolation services, dynamic driver update also addresses the problem that updates are not tested as completely as original code. With isolation, the OS can tolerate any failures of the updated driver either by restarting the new version of the driver or by rolling back to a previous version. As a result, administrators of high-availability systems can apply important but risky updates, such as security patches, whereas today they might forgo such patches. In the remainder of this chapter, I present background and related work in Section 7.2. In Section 7.3 I present the architecture and implementation of dynamic driver update, and in Section 7.4

I evaluate its usefulness and performance.

7.2 Related work

High-availability systems have long sought to update code on-line [75, 64]. A study of Windows NT 4.0 installations showed that 42% of outages resulted from software maintenance. In addition, a study at IBM showed that 20% of system failures were caused by faults in poorly tested patches [191]. Hence, the reliable upgrade of system components is critical to high availability. Previous approaches to this problem differ in the programmer’s responsibility, the update mechanisms, and the granularity of updates. I discuss each of these in turn.

Previous on-line update systems require a programmer to provide code to facilitate the update. The programmer’s responsibilities range from following design-time rules, such as inheriting from a specific base class, to writing functions that translate data formats when they change. The systems that rely on a compiler to provide the mechanism for updating code require the least programmer effort. OPUS, for example, uses a compiler to generate patches automatically from two versions of the source code [2], but is restricted to patches that do not change data formats. Dymos [129] and Dynamic Software Updating [103] can update arbitrary code but require that programmers write a conversion function if data types change.

Other approaches place more responsibility on the programmer. Systems that rely on object-oriented language features, such as indirect function calls, require that updatable code inherit from an “updatable” virtual base class. Programming-language-specific systems, such as dynamic C++ classes [105] and dynamic update for Java [161], take this approach, as does the operating system. Another task placed on programmers is to ensure that data structures remain consistent during an update by identifying locations in the code where updates can occur safely [98]. At the most extreme, the IROS platform includes a method to update driver code in the driver interface, effectively pushing all the work onto the programmer [7].
In some cases, a system can reduce the programmer’s role by leveraging the existing properties of the updatable code. For example, the Simplex system updates real-time event handlers with no additional code by running old and new handlers in parallel until their outputs converge [84]. Microsoft’s most recent web server, IIS 7.0, relies on the independence of separate web requests to

replace extensions at run-time. As soon as an updated extension is available, the web server routes requests to the new version [148].

There have been four major mechanisms used to replace updated code. First, within a cluster of replicated servers, an administrator can restart an entire machine to load the new code while the remaining machines provide service. The Visa payment processing system, for example, applies 20,000 updates a year with this mechanism yet achieves 99.5% availability [166]. Recent work on virtual machines may enable this technique to be applied on a single computer [137]. The benefit of this approach is that it is very general; anything can be updated by rebooting the whole machine. However, it requires the most work from application programmers, as these systems rely on the application to perform failover and state replication [93].

Second, the system can launch a copy of a running process with the new code on the same hardware [94], or on additional hardware [181], and then copy in the state of the existing process. This approach is limited to user-mode processes, as without the support of a virtual machine monitor, two operating system kernels cannot coexist on a single machine.

Third, the system can patch running code by dynamically dispatching callers to functions that have been updated. In this case, wrapper functions [75, 89, 188], dynamically dispatched functions [105, 161, 66], and compilers that produce only indirect function calls [129, 103] can all redirect callers to the new code. This approach, while requiring more infrastructure, has the benefit of working in many environments and with many program changes.

Finally, the system can directly modify code in memory by replacing the target address of a function call [2] or by replacing any instruction with a branch to new code [120]. This approach is limited to code changes and cannot handle data format changes.
Several different approaches have been used to dynamically update operating system code. The K42 microkernel operating system is built upon object-oriented components that can be hot-swapped and updated dynamically [188, 21]. The K42 update system relies on the unique object model of K42 and hence is difficult to apply to commodity operating systems that are not object-oriented [21]. The IROS real-time operating system includes a method in the driver interface that allows two drivers to coexist simultaneously and to transfer state [7]. The operating system acts as a wrapper and redirects calls to the new driver. Thus, driver writers must provide code to support updating drivers. Kishan and Lam applied instruction editing to the kernel for both extensibility and

optimization [120]. As previously mentioned, this approach can only apply changes that do not modify data formats and hence is not broadly applicable.

Dynamic driver update provides on-line update for device drivers. As with Nooks, my paramount goal is compatibility with existing code. Dynamic driver update relies on shadow drivers to update a device driver in-place without assistance from a programmer. My update system uses the common interfaces of device drivers to capture and transfer their state between versions automatically. Thus, compatibility reduces the deployment cost by removing the programmer support requirement of previous approaches.

7.3 Design and Implementation

In keeping with the Nooks design principles (see Section 4.1), dynamic driver update (DDU) reflects a practical, systems-oriented approach to updating device drivers on-line. Existing approaches to dynamically updating code rely on the compiler, programming language, or programmer to perform the update. In contrast, dynamic driver update is a service that updates drivers on their behalf. The design of DDU shares the principles of shadow drivers:

1. Update logic should be generic.
2. Device driver updates should be concealed from the driver’s clients.
3. Update logic should be centralized.
4. Update services should have low overhead when not needed.

and adds one more:

5. Safety first. The system may disallow an update if it is not compatible with running code.

The update system must work with existing drivers because it is too hard, expensive, and error-prone to add on-line update support to drivers. With a systems-oriented approach, a single implementation can be deployed and used immediately with a large number of existing drivers. Thus, I centralize the update code in a single system service. As a common facility, the update code can be thoroughly examined and tested for faults. In contrast, practical experience has shown us that many drivers do not receive such scrutiny, and hence are more likely to fail during an update [57].

These principles limit the applicability of dynamic driver update. My solution applies only to drivers that can load and unload dynamically. Furthermore, my design only applies to drivers that share a common calling interface. A driver that uses a proprietary or ad-hoc interface cannot be updated automatically without additional code specialized to that driver. Finally, if the new driver supports dramatically different capabilities than the existing driver, the update will fail because the changes cannot be hidden from the OS and applications.

Because of these limitations, I consider dynamic driver update to be an optimization and preserve the possibility of falling back to a whole-system reboot if necessary.

7.3.1 Design Overview

Dynamic driver update replaces a driver by monitoring the state of a driver as it executes, loading the new code into memory, starting the new driver, and transferring the old driver’s state into the new driver. During the update, the system impersonates the driver, ensuring that its unavailability is hidden from applications and the OS. As the new driver starts, the update system connects the new driver to the old driver’s resources and verifies that the new driver is compatible with the old driver. If it is not compatible, the system can roll back the update. In essence, dynamic driver update replaces the driver code and then reboots the driver.

Similar to recovering from driver failures, dynamic driver update seeks to conceal the update from the OS and applications. Unlike a whole-system reboot, where the ephemeral state of a driver is lost, restarting just the driver requires that this state be preserved so that application and OS requests can be processed correctly. Hence, dynamic driver update leverages shadow drivers to restart an updated device driver without impacting applications or the OS. Because it is not concerned with fault tolerance, dynamic driver update uses only the object tracking and interposition services necessary for shadow drivers. The object tracker records kernel objects in use by the driver for later reclamation during an update. The interposition service supports taps, which invoke the shadow on every call between a driver and the kernel. Isolation may optionally be used when the updated code is potentially faulty. Dynamic driver update also leverages shadow drivers to obtain the state of a driver and to transfer state between driver versions.

Table 7.1 lists the steps of updating a driver. The process begins when a driver is initially

Table 7.1: The dynamic driver update process.

Update Steps
1. Capture driver state
2. Load new code
3. Impersonate driver
4. Disable/garbage collect old driver
5. Initialize new driver
6. Transfer in driver state
7. Enable new driver

loaded and the system starts capturing its state, and ends when the new driver begins processing requests. I now describe each of the update steps in more detail.
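The steps in Table 7.1 can be read as a strictly sequential process. The sketch below models that sequence as a tiny state machine; the enum and function names are my own illustrative stand-ins, not the actual Nooks implementation.

```c
#include <assert.h>

/* Illustrative sketch of the update sequence in Table 7.1.
 * All names here are hypothetical, not the dissertation's code. */
enum ddu_step {
    DDU_CAPTURE_STATE,   /* 1. capture driver state */
    DDU_LOAD_NEW_CODE,   /* 2. load new code */
    DDU_IMPERSONATE,     /* 3. impersonate driver */
    DDU_DISABLE_OLD,     /* 4. disable/garbage collect old driver */
    DDU_INIT_NEW,        /* 5. initialize new driver */
    DDU_TRANSFER_STATE,  /* 6. transfer in driver state */
    DDU_ENABLE_NEW,      /* 7. enable new driver */
    DDU_DONE
};

/* The steps run strictly in order; a failure at any step rolls back
 * to the old driver rather than skipping ahead. */
enum ddu_step ddu_next(enum ddu_step s)
{
    return s >= DDU_DONE ? DDU_DONE : (enum ddu_step)(s + 1);
}
```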

7.3.2 Capturing Driver State

From the moment a driver is first loaded, a shadow driver monitors its state. The state of a device driver consists of: (1) the static global variables and constants in the driver code, (2) the state it receives from hardware, and (3) the state it receives from the kernel. The state from code is static, and hence does not need to be preserved across an update. From hardware, a driver receives persistent data and environmental parameters, such as the speed of a network link. For most drivers either this information is stable (i.e., does not change over time) or the driver maintains no history of this information and reacquires it as needed. As a result, this state need not be explicitly preserved when updating a driver. From the kernel, the driver receives configuration and I/O requests. While configuration changes must persist across an update, completed I/O requests generally have no impact on future requests. Hence, they need not be retained after a driver update. Thus, the only driver state that must be preserved across an update is the history of configuration requests and the requests it is currently processing. Dynamic driver update depends on shadow drivers and the object tracker in Nooks to capture the state of a device driver. The object tracker records kernel resources in use by a driver, so that the update system can disable and garbage collect the old driver. The shadow logs configuration requests and in-progress I/O requests. With this information, the shadow can restore the driver’s

[Figure omitted: the OS kernel, a shadow disk driver, disk device driver versions 1 and 2, and the disk drive.]

Figure 7.1: A shadow driver capturing the old driver’s state.

internal state after an update. The class-based nature of device drivers allows a single shadow driver to capture the state of any driver in its class. As a result, no code specific to a single driver is required for capturing driver state. Figure 7.1 shows a shadow driver capturing driver state by monitoring communication between the kernel and the driver. In this figure, a new driver version has been pre-loaded into memory for an upcoming update.
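The capture policy above can be illustrated with a small sketch: only configuration requests are logged for later replay, while completed I/O is dropped. The types and names below are hypothetical, not the shadow driver's actual data structures.

```c
#include <assert.h>

/* Hypothetical sketch of the shadow's capture policy: configuration
 * requests are logged for replay; completed I/O is not retained. */
#define LOG_MAX 64

struct config_req { int opcode; long arg; };

struct shadow_log {
    struct config_req entries[LOG_MAX];
    int count;
};

static struct shadow_log demo_log;   /* demo instance for illustration */

/* Invoked by a tap on each kernel-to-driver call; returns the log size. */
int shadow_observe(struct shadow_log *log, int opcode, long arg, int is_config)
{
    if (!is_config)                  /* completed I/O: nothing to preserve */
        return log->count;
    if (log->count < LOG_MAX)
        log->entries[log->count++] = (struct config_req){ opcode, arg };
    return log->count;
}
```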

7.3.3 Loading Updated Drivers

A system operator begins an update by loading the new driver code. As discussed in Section 7.1, the services of a driver are not available while it is restarting. This is an issue, for example, if the update mechanism relies on disk access after the disk driver has been disabled. To avoid such circular dependencies, an operator must pre-load the new driver version into memory so that the system can update drivers that store their own updates.

Dynamic driver update logically replaces the entire driver at once. While the actual change to a driver may be localized to a single function, replacing the entire driver allows the system to leverage existing OS support for loading and initializing drivers. When updating only a subset of the modules comprising a multi-module driver, the update system optimizes the in-memory footprint by replacing only those modules whose memory images change. In addition to new modules, this includes any module that links to the new module, because function addresses may have changed.

Figure 7.1 shows the new version loaded in memory alongside the existing version. Once the new code is in memory, an operator notifies the update system to replace the driver. The driver update system leaves the old driver code in memory until the update completes, which allows the system to revert to the old driver if the update fails. For example, if the new driver does not initialize or is not compatible with the old driver, the update system restarts the old driver.
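The pre-load and rollback policy described above might be sketched as follows. The structures and function names are hypothetical; the point is only that the old driver may be disabled solely after the new image is resident, and that a failed initialization falls back to the still-resident old driver.

```c
#include <assert.h>

/* Hypothetical sketch of the pre-load/rollback policy. */
struct driver_image { int loaded; int initialized; };

static const struct driver_image demo_old     = { 1, 1 };
static const struct driver_image demo_new_ok  = { 1, 1 };
static const struct driver_image demo_new_bad = { 1, 0 };  /* init failed */

/* The old driver may be disabled only once the new image is in memory,
 * avoiding a circular dependency on, e.g., the disk driver. */
int ddu_may_disable_old(const struct driver_image *new_img)
{
    return new_img->loaded;
}

/* After the init attempt, commit to the new driver only if it
 * initialized; otherwise restart the old one, which is still resident. */
const struct driver_image *
ddu_select(const struct driver_image *old_img, const struct driver_image *new_img)
{
    return new_img->initialized ? new_img : old_img;
}
```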

7.3.4 Impersonating the Driver

Dynamic driver update relies on shadow drivers to impersonate the old driver during the update. I previously described the process in Section 5.3.4. The shadow driver actively proxies for the driver, ensuring that applications and the OS do not observe that it is unavailable. For systems that update user-mode code, the application can be suspended during the update process. However, drivers are frequently called at interrupt level, where calls cannot be suspended, and hence active proxying is necessary.
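The routing decision behind active proxying can be sketched in a few lines. The names are illustrative, and the sketch assumes a tap can distinguish queries the shadow can answer itself from I/O that must be queued until the new driver is running.

```c
#include <assert.h>

/* Illustrative sketch of active proxying: callers are never suspended
 * (drivers may be invoked at interrupt level); instead the shadow
 * answers or queues requests while the driver is unavailable. */
enum proxy_action { FORWARD_TO_DRIVER, QUEUE_FOR_LATER, ANSWER_FROM_SHADOW };

enum proxy_action tap_route(int driver_available, int is_query)
{
    if (driver_available)
        return FORWARD_TO_DRIVER;       /* normal operation: pass through */
    return is_query ? ANSWER_FROM_SHADOW   /* e.g., capability queries */
                    : QUEUE_FOR_LATER;     /* I/O submitted mid-update */
}
```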

7.3.5 Garbage Collecting the Old Driver

The goal of this stage is to release any resources in use by the old driver that are not needed as part of the update process. As with impersonating the driver, shadow drivers perform this function. Those objects that the kernel relies on to invoke the driver, such as driver registration objects, and the hardware resources in use by a driver, such as interrupt-request lines and I/O memory regions, are retained for restarting the driver. The shadow releases everything else, including dynamically allocated memory and other support objects, such as timers. Section 5.3.3 contains the details. After this stage, both the old and new drivers are present in memory in a clean, uninitialized state, and either one can be started. Starting the old driver is equivalent to recovering after a failure, while starting the new one updates the driver code.
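The retain-or-release decision above amounts to a classification over tracked object types. The sketch below encodes it; the type tags are illustrative, not the object tracker's actual taxonomy.

```c
#include <assert.h>

/* Hypothetical classification of tracked objects during garbage
 * collection: retain what is needed to restart the driver, free the rest. */
enum obj_type { OBJ_REGISTRATION, OBJ_IRQ_LINE, OBJ_IO_REGION,
                OBJ_HEAP_MEMORY, OBJ_TIMER };

int ddu_retain_for_restart(enum obj_type t)
{
    switch (t) {
    case OBJ_REGISTRATION:   /* the kernel invokes the driver through these */
    case OBJ_IRQ_LINE:       /* hardware resources needed for restart */
    case OBJ_IO_REGION:
        return 1;
    default:                 /* dynamically allocated support objects */
        return 0;
    }
}
```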

7.3.6 Initializing the New Driver

Once the old driver is garbage collected, the shadow driver initializes the new driver with the same sequence of calls that the kernel makes when initializing a driver. Thus, the new driver starts exactly as if the kernel initialized it. For example, when starting a network device driver, the shadow calls the

driver’s init_module function, followed by the open function to initialize the network, followed by set_multicast_list. In addition, the shadow calls internal kernel routines to ensure the network driver is properly bound into the kernel, such as dev_activate and netif_schedule, to notify the kernel that the driver is prepared to start sending packets. The sequence of calls is hardwired into the shadow driver for a particular class of drivers. Hence, if the kernel evolves and makes additional calls during driver initialization, the shadow must similarly be updated. During this initialization stage, taps send all calls made by the updated device driver to the shadow instead of the kernel, allowing the shadow to connect it to the resources of the old driver.

The benefit of replicating the kernel’s initialization sequence is that no new code is needed in the driver for an update. Instead, dynamic driver update repurposes the driver’s existing initialization code by controlling the driver’s environment during restart, spoofing it into believing that the kernel is starting it normally. The update system faces two major challenges in starting the new driver: identifying existing kernel objects that the new driver requests, and ensuring that the new driver is compatible with the old driver. I now discuss each of these challenges in detail.
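The replayed initialization sequence for the network class might be sketched as below. The ops structure and stub driver are simplified stand-ins for the Linux 2.4 entry points named above; they are not the real kernel interfaces.

```c
#include <assert.h>

/* Simplified stand-ins for the network-class entry points; the real
 * shadow invokes the driver's actual init_module, open, and
 * set_multicast_list functions, then kernel activation routines. */
struct net_driver_ops {
    int  (*init_module)(void);
    int  (*open)(void);
    void (*set_multicast_list)(void);
};

static int net_activated;   /* stands in for dev_activate/netif_schedule */

int shadow_restart_net(const struct net_driver_ops *ops)
{
    if (ops->init_module() != 0)
        return -1;          /* init failure triggers rollback to old driver */
    if (ops->open() != 0)
        return -1;
    ops->set_multicast_list();
    net_activated = 1;      /* tell the kernel the driver can send packets */
    return 0;
}

/* demo driver stubs for illustration */
static int  demo_init(void)  { return 0; }
static int  demo_open(void)  { return 0; }
static void demo_mcast(void) { }
static const struct net_driver_ops demo_ops = { demo_init, demo_open, demo_mcast };
```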

Updating References

When the new driver requests a resource from the kernel or provides a resource to the kernel, the shadow must locate the corresponding resource belonging to the old driver. The shadow calls into the update system to perform this mapping. Once located, the update system updates the resource to reference the new driver. The primary challenge in this process is identifying which resource of the old driver corresponds to the resource used by the new driver. Drivers usually request kernel resources with a unique identifier, which allows the update system to locate the old driver’s corresponding resource. For example, drivers provide a unique name, device number, or MAC address when registering with the kernel, and unique memory addresses when requesting I/O regions. The update system uses the unique identifier and resource type to locate the old driver’s resource in the object tracker. In some cases, the driver passes a resource to the kernel without unique identification. The update system then uses the context of the call (i.e., the stage of the restart) to make the association.

For example, when the kernel opens a connection to a sound-card driver, the driver returns a table of function pointers with no identifying device name or number. To transfer these function tables to the new driver, the shadow calls into the new driver to re-open every connection. The new driver returns its function table. Because the shadow driver knows that it is re-opening a connection, it notifies the update system to replace the function table in the existing connection object with the new functions provided by the driver. This contextual approach to matching resources can be applied elsewhere, for example, when updating a network driver that lacks a unique MAC address.

After locating the appropriate kernel resource, the update system modifies it to reflect differences between the old and new driver. For example, the new driver may provide additional functions or capabilities. In some cases, the update system need not update the kernel object directly. Instead, it relies on the layer of indirection provided by wrappers and the object tracker to update the mapping between kernel and driver versions. For example, when updating interface function pointers, the update system replaces the mapping maintained by the object tracker rather than changing function pointers in the kernel. If this is not possible, the update system locks and updates the kernel object directly.
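The identifier-based lookup described above can be sketched as a search over the object tracker's records. The structures and the demo table are hypothetical; the MAC address and I/O base are fictitious examples.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical object-tracker lookup: resources are matched by
 * (type, unique identifier), e.g., a device name or MAC address. */
struct tracked_obj { int type; const char *id; int kernel_handle; };

enum { TRK_NET_DEVICE = 1, TRK_IO_REGION = 2 };

static const struct tracked_obj demo_objs[] = {
    { TRK_NET_DEVICE, "00:02:b3:aa:01:02", 9 },   /* fictitious MAC */
    { TRK_IO_REGION,  "0xd000",            7 },   /* fictitious base address */
};

/* Returns the old driver's kernel handle, or -1 when no match exists
 * (as happened when e1000 requested a region through a different API). */
int tracker_find(const struct tracked_obj *objs, int n, int type, const char *id)
{
    for (int i = 0; i < n; i++)
        if (objs[i].type == type && strcmp(objs[i].id, id) == 0)
            return objs[i].kernel_handle;
    return -1;
}
```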

Detecting Incompatibilities

A second issue that arises when updating code without programmer support is ensuring that the new code is compatible with the running code. If an update is not compatible, it may cause the OS and applications to fail. Incompatibilities arise when a new driver implements different functions (e.g., higher-performance methods), different interfaces (e.g., a new management interface), or different options for an existing function. Complicating matters, applications may expose incompatibilities long after a new driver starts. For example, an application may use capabilities it queried from the old driver that are not present in the new driver. Reflecting the “safety first” principle, the update system must detect whether the update could cause callers to function incorrectly in order to prevent applications from seeing unexpected behavior.

Drivers express their capabilities in two ways. First, drivers may provide only a subset of the functions in an interface. For example, a driver writer may choose not to implement certain high-performance methods. Second, drivers can express their capabilities through feature fields. For

example, network drivers pass a bit-mask of features, such as the ability to offload TCP checksumming and do scatter/gather I/O, to the kernel. The update system considers an update compatible only if a new driver provides at least the same capabilities as the old driver.

As part of transferring references to the new driver, the update system performs three compatibility checks. First, it verifies that the new driver provides at least the same interfaces to the kernel as the old driver. If a driver has not registered all of the old driver’s interfaces during initialization, it is incompatible. Second, the update system verifies that, within an interface, the new driver implements at least the functions that the old driver provided. If so, the system updates the object tracker to reference the new driver. Third, for drivers with features that can be queried by applications, the update system explicitly queries the new driver’s feature set during an update and compares it against the old driver’s feature set. Thus, the update system detects incompatibilities during the update process, ensuring that there will not be latent problems discovered long after the process completes.

When it detects an incompatible update, the update system has three options. It can take a conservative approach and revert to the existing driver. Or, it can let the update go forward but fail any future request that would reveal the incompatibility. The final option is to forgo compatibility checks and allow applications to observe the incompatibilities. These options allow an operator to prioritize an important update (for example, one that prevents unauthorized use of the system) over compatibility. Lower-priority incompatible updates can still be applied with a whole-system reboot. Rolling back an update presents compatibility problems as well.
If the new driver provides a superset of the old driver’s capabilities, it may not be possible to roll back an update without notifying applications that observed the new driver’s capabilities. To preserve the option to roll back, the update system can optionally conceal the new capabilities of a driver by masking out new capabilities or functions to equal those in the old driver. Once the update has been tested, an administrator can apply it again without concealing its capabilities. This ensures that a potentially faulty update can be tested without risking system stability. For some driver updates, these rules are overly conservative. For example, if a driver provides a function or capability that is never used, a new version need not provide that function. However, the update system cannot know whether an application will call the function later, and hence uses conservative checks.
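The "at least the same capabilities" rule can be sketched for the two capability encodings described above: a feature bit-mask and an interface function table. The representation below is illustrative, not the kernel's actual feature encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative encoding of the compatibility rule: the new driver must
 * offer at least the old driver's capabilities. */
typedef unsigned int feature_mask;

/* every feature bit the old driver advertised must survive the update */
int features_compatible(feature_mask old_f, feature_mask new_f)
{
    return (old_f & ~new_f) == 0;
}

/* within an interface, the new function table must implement at least
 * the entries the old one did (NULL marks an unimplemented function) */
int functions_compatible(void *const old_fns[], void *const new_fns[], int n)
{
    for (int i = 0; i < n; i++)
        if (old_fns[i] != NULL && new_fns[i] == NULL)
            return 0;
    return 1;
}

/* demo tables: addresses stand in for driver function pointers */
static int fn_a, fn_b;
static void *const demo_old_fns[2] = { &fn_a, &fn_b };
static void *const demo_new_fns[2] = { &fn_a, NULL };  /* dropped a function */
```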

[Figure omitted: the OS kernel calling into disk device driver version 2, with version 1 retained in memory alongside the shadow disk driver and the disk drive.]

Figure 7.2: Device driver and shadow after an update.

7.3.7 Transferring Driver State

While previous on-line update systems require applications to transfer their own state between versions, dynamic driver update transfers driver state automatically, without the involvement of driver writers. As discussed in Section 5.3.3, the shadow driver captures the state of a driver by logging its communication with the kernel. It then transfers the old driver’s state into the new driver by replaying the log, using the same calls into the driver that it previously monitored. Shadow drivers’ ability to automatically capture and restore the state of unmodified drivers is critical to providing dynamic update without programmer involvement.
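State transfer by log replay might be sketched as follows, with hypothetical types and a stub driver entry point standing in for the real interface calls.

```c
#include <assert.h>

/* Hypothetical sketch of state transfer by log replay: the shadow
 * re-issues recorded configuration calls against the new driver. */
struct config_req { int opcode; long arg; };

static int demo_calls;               /* counts requests the demo driver saw */

static int demo_driver(int opcode, long arg)
{
    (void)opcode; (void)arg;
    demo_calls++;
    return 0;
}

static const struct config_req demo_replay_log[2] = {
    { 1, 44100 },                    /* e.g., a recorded sample-rate setting */
    { 2, 2 },                        /* e.g., a recorded channel setting */
};

/* replay_fn stands in for the driver entry point the shadow invokes */
int shadow_replay(const struct config_req *log, int n,
                  int (*replay_fn)(int opcode, long arg))
{
    for (int i = 0; i < n; i++)
        if (replay_fn(log[i].opcode, log[i].arg) != 0)
            return -1;               /* a failed replay aborts the update */
    return 0;
}
```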

7.3.8 Enabling the New Driver

The final step of an update is to enable the new driver. First, the shadow submits any requests that it queued during the update, and then it notifies the taps that requests should be sent to the real driver instead of the shadow driver. After the new driver is running, operators may either free the memory occupied by the old driver code or leave it in memory if they desire the opportunity to roll back the update later. The system state after an update is shown in Figure 7.2. In this figure, the kernel calls into the new driver while the old driver remains in memory in case the update must be rolled back.

Table 7.2: The number of lines of code in dynamic driver update.

Source Components                # Lines
Linux kernel                         181
Nooks                                 17
Shadow drivers                        51
Dynamic driver update                402
Total number of lines of code        651

7.3.9 Implementation Size

I constructed dynamic driver update as a service on top of Nooks and shadow drivers. As a result, little additional code was necessary to implement this service. Table 7.2 shows the breakdown of code between changes to the kernel, changes to Nooks isolation code, changes to shadow driver recovery code, and new code components. The kernel had to be changed to support loading multiple versions of a driver simultaneously. The changes to Nooks allow waiting for the driver to become quiescent instead of preempting threads currently executing in the driver. The most substantial changes were to the shadow driver code, in which I added new call-outs for mapping between old and new versions of driver resources and for compatibility checks. The small size of the entire implementation demonstrates the usefulness of Nooks and shadow drivers as a platform; without this code, the entire service would have been many times larger.

7.3.10 Summary

Dynamic driver update takes a systems-oriented approach to on-line update by providing a centralized service to update unmodified drivers. Before an update, it loads driver code into memory to avoid circular dependencies. During an update, a shadow driver hides the driver’s unavailability from applications and the OS by handling requests on the driver’s behalf. After an update, the system presents the option of rolling back to the old version automatically if the new version fails. Dynamic driver update relies on existing driver interfaces to update the driver and transfer the state of the current driver. As the new driver initializes, the update system connects it to the resources of the old driver. The update system reassigns kernel structures referencing the old driver to the new driver. Once the new driver has initialized, the shadow driver transfers in the state of the old driver

by replaying its log of requests and by re-opening connections to the driver. Finally, the update system dynamically detects whether the old and new drivers are compatible by comparing their capabilities. If the new driver is not compatible, either the update or the action that would reveal the incompatibility can be canceled. When used with Nooks isolation services, the system allows potentially faulty updates to be tested without compromising availability. Dynamic driver update is built on top of Nooks and shadow drivers. Little new code was needed to implement this service, as Nooks and shadow drivers provide object tracking, interposition, active proxying, and restarting services. Thus, dynamic driver update demonstrates that Nooks and shadow drivers are a useful platform for implementing additional reliability and availability services.

7.4 Evaluation of Dynamic Driver Update

In this section I answer three questions about dynamic driver update:

1. Transparency. Can dynamic driver update replace existing drivers?
2. Compatibility. Can it detect and roll back incompatible updates?
3. Performance. What is the performance overhead of DDU during normal operation (i.e., in the absence of updates), and are on-line updates significantly faster than rebooting the whole system?

Based on a set of experiments using existing, unmodified device drivers and applications, my results show that dynamic driver update (1) can apply most released driver updates, (2) can detect when an update is incompatible with the running code and prevent application failures by rolling back the update, and (3) imposes only a minimal performance overhead.

7.4.1 Methodology and Platforms

I implemented dynamic driver update in the Linux 2.4.18 operating system kernel. I produced two OS configurations based on this kernel:

1. Linux-Native is the unmodified Linux kernel.
2. Linux-DDU is a version of Linux that includes the Nooks wrappers, object tracker, and shadow drivers, plus the dynamic driver update code. It does not include the Nooks isolation services.

Table 7.3: The three classes of shadow drivers and the Linux drivers tested.

Class     Driver     Device
Sound     emu10k1    SoundBlaster Live! sound card
          audigy     SoundBlaster Audigy sound card
          es1371     Ensoniq sound card
Network   e1000      Intel Pro/1000 Gigabit Ethernet
          3c59x      3COM 3c509b 10/100 Ethernet
          pcnet32    AMD PCnet32 10/100 Ethernet
Storage   ide-disk   IDE disk

I ran the experiments on a 3 GHz Pentium 4 PC with 1 GB of RAM and a single 80 GB, 7200 RPM IDE disk drive. Networking tests used two identically configured machines.

I tested dynamic driver update with the same three shadow drivers used previously in Chapter 6: sound card, network interface controller, and IDE storage device. To ensure that dynamic driver update worked consistently across device driver implementations, I tested it on seven different Linux drivers, shown in Table 7.3. Although I present detailed results for only one driver in each class (shown in boldface, emu10k1, e1000, and ide-disk), behavior across all drivers was similar.

Of these three classes of drivers, only IDE storage drivers require a full system reboot in Linux to be updated. However, updating members of the other two classes, sound and network, disrupts applications. Dynamic driver update improves the availability of applications using these drivers, because they need not be restarted when a driver is updated.

7.4.2 Effectiveness

To be effective, the dynamic driver update system must hide updates from the applications and the OS, apply to most driver updates, and detect and handle incompatible updates. Dynamic driver update depends on shadow drivers to conceal updates. As a result, it shares their concealment abilities. Section 6.4 demonstrated that shadow drivers could conceal recovery from all tested applications in a variety of scenarios. As a result, I only examine the latter two requirements: whether dynamic driver update can apply existing driver updates and whether it can detect and roll back incompatible updates.

Table 7.4: Tested drivers, the number of updates, and the results.

Driver     Module          # of updates   # Failed
emu10k1    emu10k1         10             0
           ac97_codec      14             0
           soundcore       6              0
e1000      e1000           10             1
ide-disk   ide-disk        17             0
           ide-mod         17             0
           ide-probe-mod   17             0

Applicability to Existing Drivers

I tested whether dynamic driver update can update existing drivers by creating a set of different versions of the drivers. I compiled past versions of each driver, which I obtained either from the Linux kernel source code or from vendor web sites. For drivers that consist of multiple modules, I created separate updates for each module and replaced the modules independently. Because the driver interface in the Linux kernel changes frequently, I selected only those drivers that compile and run correctly on the 2.4.18 Linux kernel. Table 7.4 shows the modules comprising each of my tested drivers, the number of updates for each module, and the number of updates that failed due to compatibility checks.

Using the Linux-DDU kernel, I applied each driver update from oldest to newest while an application used the driver. For the sound-card driver, I played an mp3 file. For the network drivers, I performed a network file copy, and for the IDE storage driver I compiled a large program. For each update, I recorded whether DDU successfully updated the driver and whether the application and OS continued to run correctly. If the update process, the application, or the OS failed, I considered the update a failure.

The results, shown in Table 7.4, demonstrate that dynamic driver update successfully applied 90 out of 91 updates. In one case, when updating the e1000 driver from version 4.1.7 to 4.2.17, the shadow could not locate the driver’s I/O memory region when the new driver initialized. The new driver used a different kernel API to request the region, which prevented the update system from finding the region in the object tracker. As a result, the new driver failed to initialize and caused the update system to roll back. The file-copy application, however, continued to run correctly using the

old driver.

This single failure of dynamic driver update demonstrates the complexity of transferring kernel resources between driver versions. Shadow drivers require a unique identifier to locate the resource of an old driver that corresponds to a new driver’s request. Some of the unique identifiers I chose depend on the API used to request the resource. When a driver changes APIs, the shadow cannot locate the old driver’s resources during an update. Ideally, the identifier should be independent of the API used to request the resource. In practice, this is difficult to achieve within the existing Linux kernel API, where there are many resources that can be accessed through several different APIs.

Compatibility Detection

While I found no compatibility problems when applying existing driver updates, compatibility detection is nonetheless critical for ensuring system stability. To test whether dynamic driver update can detect an incompatible update, I created a new version of each of the tested drivers that lacks a capability of the original driver. I created emu10k1-test, a version of the emu10k1 driver that supports fewer audio formats. The update system detects the sound-card driver’s supported formats by querying the driver’s capabilities with an ioctl call after it initializes. From the e1000 driver, I created e1000-test, which removes the ability to do TCP checksumming in hardware. The driver passes a bit-mask of features to the kernel when it registers, so this change should be detected during the update. Finally, I created ide-disk-test, which removes the ioctl function from ide-disk. This change should be detected when the driver registers and provides the kernel with a table of function pointers.

For each test driver, I first evaluated whether compatibility detection is necessary by updating to the test driver with the compatibility checks disabled. As in the previous tests, I ran an application that used the driver at the time of the update. I then applied the same update with compatibility checking enabled. For each test, I recorded whether the system applied the update and whether the application continued to run correctly.

Table 7.5 shows the outcome of this experiment. With compatibility checking disabled, all three drivers updated successfully. The sound application failed, though, because it used a capability of the old driver that the new driver did not support. The disk and network applications continued to

Table 7.5: The results of compatibility tests.

Driver          No Compatibility Check    Compatibility Check
emu10k1-test    Application fails         Update rolled back
e1000-test      Works                     Update rolled back
ide-disk-test   Works                     Update rolled back

execute correctly because the kernel checks these drivers' capabilities on every call, and therefore handles the change in drivers gracefully. With compatibility checking enabled, dynamic driver update detected all three driver updates as incompatible and rolled back the updates. The applications in these tests continued to run correctly. These tests demonstrate that compatibility checking is important for drivers whose capabilities are exposed to applications. The sound application checks the driver's capabilities once, when starting, and then assumes the capabilities do not change. In contrast, only the kernel is aware of the network and disk drivers' capabilities. Unlike the sound application, the kernel checks the driver's capabilities before every invocation, and hence the compatibility check during update is not necessary for these drivers.

7.4.3 Performance

Relative to normal operation, driver updates are a rare event. As a result, an on-line update system must not degrade performance during the intervals between updates. In addition, updates must occur quickly to achieve high availability. Dynamic driver update introduces overhead from taps, which invoke the shadow on every function call between the kernel and drivers, and from the shadow driver itself, which must track and log driver information.

I selected a variety of common applications that use the three device driver classes. The application names and workloads are shown in Table 7.6. I measured their performance on the Linux-Native and Linux-DDU kernels. This test measures the cost of the update system alone, without isolation. If the updated driver is potentially faulty, then Nooks isolation should be used as well. Because dynamic driver update adds no additional overhead beyond the logging performed by shadow drivers, the performance with isolation is the same as reported in Section 6.6.

Table 7.6: The applications used for evaluating dynamic driver update performance.

Device (Driver)      Application                  Activity
Sound (emu10k1)      mp3 player (zinf)            playing 128 kb/s audio
                     audio recorder (audacity)    recording from microphone
Network (e1000)      network send (netperf)       over TCP/IP
                     network receive (netperf)    over TCP/IP
Storage              compiler (make/gcc)          compiling 788 C files
                     database (mySQL)             processing the Wisconsin Benchmark

Different applications have different performance metrics of interest. For the sound-card driver, I measured the CPU time available while the programs ran. For the network driver, throughput is a more useful metric; therefore, I ran the throughput-oriented network send and network receive benchmarks. For the IDE storage driver, I measured the elapsed time to compile a program and to perform a standard database query benchmark. I repeated all measurements several times; they showed a variation of less than one percent.

Figure 7.3 shows the performance of Linux-DDU relative to Linux-Native, and Figure 7.4 shows the CPU utilization of the tests on both platforms. These figures make clear that dynamic driver update imposes only a small performance penalty compared to running on Linux-Native. Across all six applications, performance on Linux-DDU averaged 99% of the performance on Linux-Native. CPU utilization varied more, however, rising by nine percentage points for the network send benchmark. For comparison, the average CPU utilization increase was 5% with Nooks isolation, but just 2% with dynamic driver update. Overall, these results show that dynamic driver update imposes little run-time overhead for many drivers.

As with the cost of Nooks, this overhead can be explained in terms of the frequency of communication between the driver and the kernel. On each kernel-driver call, the shadow driver may have to log a request or track an object. For example, the kernel calls the driver approximately 1000 times per second when running the audio recorder, each of which causes the shadow to update its log. For the most disk-intensive of the IDE storage applications, the database benchmark, the kernel and driver interact only 290 times per second. The network send benchmark, in contrast, transmits 45,000


Figure 7.3: Comparative application performance on Linux-DDU relative to Linux-Native. The X-axis crosses at 80%.

packets per second, causing 45,000 packets to be tracked. While throughput decreases only slightly, the additional work increases the CPU overhead per packet by 34%, from 3µs on Linux-Native to 4µs on Linux-DDU.
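The per-call cost discussed above comes from a tap interposed on each function in the kernel-driver interface. A minimal user-space model of such a tap, with hypothetical names standing in for the generated Nooks wrappers, might look like:

```c
/* Minimal model of a "tap": a wrapper interposed on a kernel-to-driver
 * call that lets the shadow driver log the request before forwarding it.
 * All names are illustrative; real Nooks wrappers are generated per
 * interface function. */
#include <stddef.h>

struct request { int opcode; size_t len; };

static unsigned long log_entries;      /* size of the shadow driver's log */

/* Shadow-driver bookkeeping: record enough state to replay the call
 * during recovery or update. */
static void shadow_log(const struct request *req)
{
    (void)req;                         /* real code would copy the request */
    log_entries++;
}

/* The driver's real entry point (supplied by the driver). */
static int driver_xmit(struct request *req) { (void)req; return 0; }

/* The tap the kernel actually calls in place of driver_xmit. */
static int tap_xmit(struct request *req)
{
    shadow_log(req);                   /* the per-call overhead measured above */
    return driver_xmit(req);
}
```

At 45,000 packets per second, this per-call logging step is what raises the per-packet CPU cost from 3µs to 4µs in the network send benchmark.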

Another important aspect of performance is the length of time the driver is unavailable during an update. For each of the three drivers, I measured the delay from when the shadow starts impersonating the device driver until it enables the new driver. The delay for the e1000 and ide-disk drivers was approximately 5 seconds, almost all due to the time the driver spends probing hardware1. The emu10k1 driver updates much faster and is unavailable for only one-tenth of a second. In contrast, rebooting the entire system can take minutes when including the time to restart applications.

1The e1000 driver is particularly slow at recovery. The other network drivers I tested restarted in less than a second.


Figure 7.4: Absolute CPU utilization on Linux-DDU and Linux-Native.

7.5 Summary

While operating systems themselves are becoming more reliable, their availability will ultimately be limited by the scheduled maintenance required to update system software. This chapter presented a system for updating device drivers on-line, allowing critical system drivers, such as storage and network drivers, to be replaced without impacting the operating system, applications, or, most importantly, availability.

Dynamic driver update demonstrates that shadow drivers and Nooks can be a platform for additional reliability services. The update system builds on this platform to update driver code in place with no changes to the driver itself. The system loads the new driver code, reboots the driver, and transfers kernel references from the old driver to the new driver. To ensure that applying an update will not reduce reliability due to changes in the driver, dynamic driver update checks the new driver for compatibility and can roll back incompatible updates. In testing, I found that dynamic driver update could apply 99% of the existing driver updates and had minimal impact on performance.

Thus, this system reduces the downtime caused by driver maintenance.

Chapter 8

LESSONS

8.1 Introduction

In this chapter, I present lessons from the process of designing and implementing Nooks. While I built Nooks within Linux, I also advised several undergraduate students on an effort to implement the Nooks architecture within Windows 2000. The Windows implementation was not completed, but the effort highlighted differences between the Linux and Windows kernels and provided valuable insight into how an operating system's structure affects its reliability.

To execute device drivers reliably, Nooks is involved in every event that affects drivers. These events may be explicit, such as function calls between a driver and the kernel, or implicit, such as changes to the kernel address space. The ease of implementing a reliability layer depends largely on the ease of identifying these events and injecting code when they occur. Thus, OS structure has an impact on the ease of tolerating driver failures. At a high level, when relevant events are more frequent, the performance cost of fault tolerance rises. Similarly, when the difficulty of detecting or hooking events rises, the complexity of providing fault tolerance rises.

The Linux and Windows operating systems differ greatly in their internal organization and structure. Many of these differences have a large impact on the ease of supporting fault-tolerant execution of drivers. The differences between the two OSs thus provide a lens for examining these structural impacts. In this chapter, I discuss the lessons learned about the relationship between OS structure and reliability from the two Nooks implementation efforts.

8.2 Driver Interface

The interface between the kernel and drivers affects the ease of isolating drivers and recovering from their failure. The impact comes from the size, complexity and organization of the kernel-driver calling interface. Smaller interfaces, with fewer functions, less-complex parameters, and explicit

communication reduce the coding effort to isolate drivers. At the most basic level, every function in the driver interface requires a wrapper and object tracking code. Thus, the amount of code is proportional to the size and complexity of the driver interface. For example, the NDIS interface for Windows network drivers contains 147 functions that drivers may implement; it therefore requires more coding effort than the network interface in Linux, which defines just 21 functions. However, other factors beyond interface size determine the total complexity of implementing this code.

Nooks must interpose on all communication between drivers and the kernel to contain driver failures. Interposition is straightforward when every act of communication is explicit, through function calls and their parameters. The code to synchronize objects across protection domains then can be confined to parameters passed through the kernel-driver interface. In contrast, communication through shared memory requires additional code to detect communication channels and to detect when communication occurs. For example, Linux provides a function for accessing the memory of a user-mode process. Nooks can interpose on the function and provide safe access to the user-mode data by XPCing into the kernel, which maps the calling process' address space. In contrast, Windows drivers directly access user-mode pointers. Nooks must detect this access as a communication channel and not as a bad-pointer access. Thus, this access requires that Nooks map the user-mode address space of client processes into driver protection domains, which is a more complex and expensive approach than the XPC-based solution on Linux.

Isolation is also simplified when the kernel-driver interface relies on abstract data types (ADTs), because all access to an ADT goes through accessor functions. These functions provide two benefits.
First, they are an explicit signal that the kernel and driver are communicating, providing an opportunity to interpose on that communication. Second, accessor functions allow the choice of performing the operation locally or remotely in a different protection domain. For infrequently modified objects, it is often faster to modify shared objects in the context of the kernel instead of creating a local copy of an object, while for frequently modified objects a copy may be faster.

As previously discussed, opaque pointers in the kernel-driver interface complicate fault tolerance because they may point to a complex data structure not understood by wrappers and shadow drivers. While isolation may be possible without knowledge of the semantics of opaque parameters, transparent recovery is not. For example, a shadow driver cannot know whether a request is idempotent.

The frequency of opaque data types varies widely between driver classes and operating systems. In Linux, most drivers that provide a common service, such as block storage or network access, implement a common interface with no opaque or driver-specific parameters. Instead, drivers export their additional capabilities through the /proc file system interface, which does not use opaque pointers. In contrast, Windows drivers provide extended capabilities through the ioctl and fsctl interfaces, which use opaque pointers.

Finally, the complexity of the driver interface itself is a major source of bugs. If driver writers must concern themselves with low-level details of synchronization and memory management, then bugs are more likely. If, instead, the interface provides high-level abstractions that hide this complexity, then driver code is simpler and less error-prone. For example, the Windows Driver Model requires complex and error-prone synchronization logic for canceling a pending I/O request [145]. The driver must remove the I/O request from a queue while ensuring that it is not completed by another thread. Unfortunately, this complex interface, provided to support high-performance I/O, often leads to driver bugs that reduce reliability [13]. In contrast, the Linux driver interface has a much simpler design with many fewer functions. Furthermore, the kernel performs much of the locking before invoking device drivers, so that drivers need not synchronize access to shared data structures. Rather than requiring driver writers to implement complex cancellation logic, Linux relies on the Unix signal mechanism to cancel an I/O request. Thus, simpler interfaces, while potentially reducing performance for high-bandwidth and low-latency devices, may substantially improve the reliability of the system.
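The accessor-function benefit described earlier in this section can be illustrated with a small sketch. The types and names are hypothetical; the point is only that a wrapper interposed on an accessor can choose between operating on the kernel's copy of an object or on a driver-local copy:

```c
/* Sketch of accessor-function interposition on an abstract data type.
 * All names are illustrative, not the Nooks or Linux API. */
#include <stdbool.h>

struct kobj { int flags; };

/* The kernel's real accessor. */
static void kobj_set_flag(struct kobj *o, int f) { o->flags |= f; }

/* Policy decision per object type: for infrequently modified objects,
 * forward the operation to the kernel's copy; for hot objects, update a
 * driver-local copy and defer synchronization to the domain crossing. */
static bool use_local_copy;

static void wrap_kobj_set_flag(struct kobj *kernel_obj,
                               struct kobj *local_obj, int f)
{
    if (use_local_copy)
        local_obj->flags |= f;         /* synced back at the next crossing */
    else
        kobj_set_flag(kernel_obj, f);  /* performed in kernel context */
}
```

Because every access goes through the accessor, the wrapper sees the communication explicitly and can make this local-versus-remote choice per object type.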

8.3 Driver Organization

The organization of drivers in the kernel and their communication patterns also affect the ease of tolerating their failure. When drivers communicate only with the kernel, the complete semantics of the driver interface must be understandable to the kernel. Hence, these semantics can be used to produce wrapper, object tracking, and recovery code. In contrast, if drivers communicate with other drivers, additional code is needed for every driver-to-driver communication channel, which reduces the leverage provided by the class structure of drivers.

Linux and Windows differ in their organization of drivers, and this impacts the ease of isolating

individual drivers. Linux uses a library model for drivers, in which drivers link with a support library containing common routines and then export a single interface to the kernel. There is no support for extending driver capabilities with additional code, so there is no need to isolate cooperating drivers from each other. In the Linux model, the driver and the library are isolated together in a single protection domain.

In contrast, Windows employs a stacked driver model, in which an I/O request may pass through a sequence of drivers before delivery to a device. Each driver in the stack may provide an additional service, such as RAID for storage systems, or common abstractions, such as Ethernet service for network devices. In addition, the logical driver for a single device may be split into two separate drivers within the kernel: a high-level driver for interfacing with applications and a low-level driver for interfacing with hardware. The benefit of the stacked model is that it simplifies extending the capabilities of a device by inserting a filter driver between two layers in the stack. The difficulty of stacked drivers, though, is that the model encourages direct driver-to-driver communication. Drivers may pass opaque pointers, including function pointers, amongst themselves. Without knowledge of the types of these pointers, an isolation system cannot interpose on all communication or provide call-by-value-result semantics. Hence, it cannot isolate the drivers from each other. In this model, either the entire stack of drivers must be isolated in a single domain, or additional code is required for each layer of the stack.

8.4 Memory Management

Much of the Nooks isolation code pertains to memory management. For example, the kernel invokes Nooks whenever the kernel address space changes, in order to synchronize the page tables used by lightweight kernel protection domains with the kernel page table. The object tracking code, to provide call-by-value-result semantics, must be aware of how objects shared by drivers and the kernel are allocated. As a result, how an operating system manages memory has a large impact on the ease of isolating drivers.

Nooks provides isolation with lightweight kernel protection domains that map the kernel address space with reduced access privileges. In a kernel that is not paged, the kernel address space changes only when virtual mappings are needed to create large, contiguous memory blocks. Hence, when the

kernel page table changes, the shadow tables can be synchronized without affecting performance. In a pageable kernel, such as Windows, changes to the address space are frequent, which leads to higher costs to manage shadow page tables. In Linux, which is not paged, I chose to keep the shadow and real page tables tightly synchronized. In a pageable kernel it may be more efficient to update driver page tables on demand, because drivers access just a small subset of the entire kernel address space.

The organization of the kernel memory management code itself also impacts the difficulty of implementing lightweight kernel protection domains. To maintain tight synchronization of page tables with the kernel, Nooks must be notified whenever the kernel changes its page table. In Linux, a small set of pre-processor macros manipulate the processor page tables. As a result, it is straightforward to identify locations in the source code where the kernel updates page tables and to insert synchronization code at these points. In contrast, Windows directly modifies hardware page tables through union variables. As others have experienced [17], this design complicates the task of identifying all code locations that manipulate page table contents. In our Windows implementation, we chose to synchronize page tables when the TLB was flushed instead of tediously identifying these locations.

The allocation patterns for data shared between the kernel and drivers also affect the ease of isolation. The simplest pattern to isolate is when the kernel allocates a data object, passes it to the driver, and the driver notifies the kernel when it is done accessing the object. In this pattern, the life cycle of the object is clear: it has a distinct birth (when it is allocated), use, and death (when it is released). In contrast, it is more difficult to accurately track objects that are allocated on the extension's stack.
For example, Linux drivers pass stack-allocated objects to the kernel when waiting for an event to occur. The object has no explicit birth, because it is created on the stack, and no explicit death, because it dies when the stack is popped. Hence, it is difficult to track these objects accurately. Nooks must instead track the lifetimes of these objects based on the kernel's use of the object; it creates the object when it is first passed to the kernel and deletes it when the kernel releases it. This may be overly conservative, as the driver may reuse the same object multiple times. An analogous problem occurs in the Windows kernel. While Windows provides routines for allocating and destroying I/O request packets (IRPs), drivers are also free to use their own allocation routines. It is therefore difficult to detect whether an IRP is valid and when an IRP is dead, because a driver may

free it with many different routines. The object tracker conservatively frees the IRP whenever it is unshared, which leads to unnecessary adding and removing of the same request packets from the tracker.

The Nooks reliability system isolates drivers through memory protection. As a result, the complexity of Nooks is strongly related to the complexity of its hosting operating system's memory management code. When the events of interest to Nooks, such as address space changes and object creation and deletion, are explicit, then the isolation code is straightforward and efficient. When these events are difficult to detect, then the isolation code is more complex and more costly. Furthermore, the performance of Nooks depends on the frequency of such events. If address space changes are frequent, as in paged kernels, then maintaining synchronized page tables may become prohibitively expensive and alternative isolation mechanisms, such as software fault isolation [205], become more attractive.
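The Linux-side page-table synchronization described in this section can be modeled in a few lines. In the real kernel, set_pte is one of the pre-processor macros mentioned earlier; everything else here (the table size, the permission bit, the shadow-table layout) is an illustrative simplification, not the Nooks implementation:

```c
/* Toy model of hooking the kernel's page-table update path so shadow
 * page tables stay synchronized. Layout and bit definitions are
 * hypothetical simplifications. */
#include <stddef.h>

typedef unsigned long pte_t;

#define NPTES 8
static pte_t kernel_pt[NPTES];   /* the kernel's page table (model) */
static pte_t shadow_pt[NPTES];   /* read-only view for driver domains */

#define PTE_RW 0x2               /* hypothetical write-permission bit */

/* Stand-in for Linux's set_pte macro: every kernel page-table update
 * also refreshes the shadow entry, with write permission stripped so
 * driver domains map the kernel address space read-only. */
static void set_pte(int idx, pte_t val)
{
    kernel_pt[idx] = val;
    shadow_pt[idx] = val & ~(pte_t)PTE_RW;   /* the Nooks-style hook */
}
```

Because all page-table writes funnel through one macro in Linux, inserting this single hook keeps the tables tightly synchronized; in Windows, where page tables are modified directly through union variables, there is no single point to hook.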

8.5 Source Code Availability

The availability of driver source code proved important for two reasons. First, driver source code provided details about the kernel-driver interface that are not documented, such as how drivers manipulate kernel data structures. Second, the open-source development model encouraged Linux to evolve towards a cleaner kernel-driver interface, which facilitates interposition, isolation, and recovery.

In implementing Nooks, I analyzed the source code from many device drivers. While device driver code can potentially call any kernel function and manipulate any kernel data structure, this analysis showed that, in practice, driver code is much more restrained. For example, of the 76 fields in the net_device structure for network device drivers, the drivers write to only a single field after initialization. As a result of this analysis, the wrappers and object tracker avoid synchronizing the other 75 fields on every entry into and exit from a network driver. While this analysis may be possible without extension source code, it would be far more difficult and error-prone.

The source-code analysis was a snapshot of the behavior of a small number of drivers, and may not remain valid as the kernel and drivers evolve. Once Nooks is incorporated into the kernel, though, its existence pressures driver writers to follow Nooks' assumptions: drivers that violate

the rules cannot be isolated or recovered automatically. Thus, in order to ensure that developers conform to Nooks' assumptions, it is crucial that Nooks achieve critical mass early as a mandatory component of the kernel and not an optional patch.

The open-source development model, in which driver code is available to be read and modified by kernel developers, also had an impact on Nooks. From reading driver code across multiple kernel versions, I observed a common pattern concerning the evolution of the kernel-driver interface. In early versions of the kernel, many drivers directly modified kernel data structures, for example changing flag fields in the current task structure. In later kernels, these accesses were codified into pre-processor macros and inline functions. A simple change to the Linux header files therefore allowed me to interpose on these functions and isolate the drivers with no changes to their source code.

I believe that this process of abstracting common programming patterns is a positive feature of open-source development. First, the driver code that accessed the kernel directly was visible to kernel programmers and led them to extend the kernel interface. Thus, drivers use fewer so-called “hacks” to access kernel resources, because kernel developers can formalize these hacks and add them to the kernel-driver interface. Second, when modifying the kernel, developers could change both the interface and the affected driver code. Thus, the open-source model allows a short feedback loop between kernel and driver developers and leads to a cleaner interface, subject to fewer work-arounds. In a closed-source environment, in contrast, the kernel developers may have little knowledge of what driver writers are doing and have few means to encourage driver writers to take advantage of new interfaces [155].
The availability of driver source code to kernel developers improves reliability by allowing kernel developers to understand, at many levels, how drivers work and what drivers do. This leads to cleaner kernel-driver interfaces that are subject to fewer work-arounds, which in turn lead to drivers that are simpler to isolate and recover.

8.6 Summary

In summary, adding a reliability layer to an existing operating system requires injecting code on all code paths that affect drivers. When these code paths are clearly identified, for example by

explicit function invocations, and easy to modify, then implementing a reliability subsystem can be straightforward. If these events are widely distributed and difficult to find or detect, then tolerating driver failures is harder. Reliability also requires understanding the properties of the code that can fail, in order to tolerate the failure and recover. When driver code is available, a reliability layer can make accurate assumptions about the behavior of driver code. In addition, available source code enables kernel developers to evolve the kernel-driver interface in directions that simplify fault tolerance.

Chapter 9

FUTURE WORK

9.1 Introduction

While Nooks, shadow drivers, and dynamic driver update proved effective at improving system reliability and availability, there are many opportunities for future work. These opportunities fall into two broad categories. First, the existing limitations of Nooks can be lifted or lessened and its performance improved. For example, better analysis of device driver and extension code may remove many of the compatibility problems mentioned in previous chapters. Second, new device driver interfaces and architectures can lessen the substantial problems caused by device drivers. Drivers must be rewritten for each operating system kernel and modified for each revision or update of device hardware. As a result, research operating systems are often limited by the availability of drivers. New driver architectures can lessen these problems by substantially reducing the dependency of driver code on the hardware and the hosting OS.

9.2 Static Analysis of Drivers

The Nooks architecture makes strong assumptions about driver behavior. A driver that violates these assumptions, while potentially bug-free, cannot run with Nooks. For example, wrappers allow drivers to modify only a subset of the parameters they receive. A driver may fail at run-time if it attempts to write to a read-only kernel object or if wrappers do not propagate changes it made to a kernel object. In addition, shadow drivers require that device drivers implement the class interface strictly and not store extra internal state based on device or kernel inputs. If a driver violates these assumptions, then it may fail at run-time or may not recover correctly.

These compatibility problems raise two opportunities to reduce the reliability impact of correct code that Nooks cannot currently execute safely. First, tools and techniques that verify that a driver is compatible with Nooks' assumptions can ensure that all failures detected at run-time are real and

not caused by driver incompatibilities. Static analysis tools that operate on source code [68] or binaries [12] could model-check drivers against Nooks' assumptions, such as which parameters are read-only and which are read-write. By applying this check at install time, the system could avoid isolating incompatible drivers.

The second opportunity for future work is to increase the range of drivers that are compatible with Nooks. The existing Nooks implementation is based on a hand-analysis of a small number of drivers. While this approach was sufficient for the drivers tested, automated analysis of a much larger body of drivers would provide a more accurate model of driver behavior, and therefore allow Nooks to support a wider variety of drivers. In addition, with automated analysis, it may be possible to create multiple models of drivers and find the best fit, i.e., one that minimizes the dynamic overhead of copying data between protection domains. Again, static analysis can provide much of the necessary information about how drivers use kernel objects. This analysis could also lead to automated generation of wrapper and object-tracking code.

A final use of static analysis techniques is to derive the interface and abstract state machine of a driver automatically from its source code. Recent work on deriving interfaces from source code [3] and on deriving interface properties from their implementations [69] could be applied to drivers. Once found, this information can be used to model-check new drivers and to generate shadow driver code automatically.

9.3 Hybrid Static/Run-Time Checking

Nooks takes a purely fault-tolerance approach to reliability. Static analysis and safe languages provide many, but not all, of the same benefits as Nooks. For example, the Vault language [58] can verify statically that locking protocols are implemented correctly. Type-safe languages, such as Java [88] and C# [99], ensure that code does not access invalid memory addresses. To date, though, these techniques do not ensure that the driver provides the correct service, for example that it correctly interfaces with hardware. Thus, even with static analysis of driver code, Nooks' run-time error detection and recovery services are still required for high availability. Nonetheless, Nooks' complexity and run-time overhead can be lessened by integrating these fault-avoidance techniques to reduce the amount of run-time checking. For example, drivers verified

statically do not require the memory isolation provided by lightweight kernel protection domains. In addition, lock misuse need not be checked for drivers that have been verified to correctly implement locking protocols. Thus, the run-time cost and complexity of Nooks can be reduced by checking properties statically instead of at run-time.
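One way to express this hybrid policy is a per-driver record of statically verified properties that gates each run-time check. The flag names below are illustrative, not from any real system:

```c
/* Sketch of gating run-time checks on statically verified properties.
 * Flag and type names are hypothetical illustrations. */

#define VERIFIED_MEMORY_SAFE 0x1u   /* e.g., written in a type-safe language */
#define VERIFIED_LOCKING     0x2u   /* e.g., locking verified Vault-style */

struct driver_domain {
    unsigned verified;              /* properties proven before loading */
};

/* Memory isolation (lightweight protection domains) is needed only
 * when memory safety was not established statically. */
static int needs_memory_isolation(const struct driver_domain *d)
{
    return !(d->verified & VERIFIED_MEMORY_SAFE);
}

/* Likewise, run-time lock-misuse checks can be skipped for drivers
 * whose locking protocol was verified statically. */
static int needs_lock_checking(const struct driver_domain *d)
{
    return !(d->verified & VERIFIED_LOCKING);
}
```

Each property proven at verification time removes one class of run-time check, directly reducing Nooks' overhead for that driver.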

9.4 Driver Design

Existing driver and extension architectures optimize for high performance, flexibility, and ease of programming. For example, the Windows Driver Model supports three different memory management models, allowing driver writers to balance performance and complexity [158]. The MacOS I/O Kit provides an object-oriented framework for developing drivers, based on driver-class-specific base classes, that reduces the amount of code needed for a specific driver [6]. However, these driver and extension architectures are not designed for reliability. First, they ignore the cost of invoking an extension, a fundamental cost in any isolation or recovery system. Second, they share kernel data at very fine granularity, complicating isolation. Third, they allow opaque data to be passed between extensions, which prohibits marshaling or verifying the data contents. Finally, existing driver architectures require that all driver code run in the kernel, even if it performs no privileged operations. For example, much of the code in network device drivers concerns initialization and cleanup, error handling, and statistics gathering. None of these functions are performance-critical, and many of them require no privileged access to kernel data structures. Hence, they need not execute with kernel privilege.

There are several opportunities to rectify these limitations of the kernel-extension interface. First, all invocations of extensions should be treated as if they were remote invocations, so that the parameters are fully typed and could be marshaled if necessary. Second, the interface to the driver should explicitly recognize the cost of invoking drivers and allow batching and asynchronous execution. Third, driver code should be factored into code that must run in the kernel, for either performance or privilege reasons, and code that need not. For example, driver initialization, cleanup, error handling, and management code could run in a user-mode process instead of in the kernel.
Thus, code in the kernel that can potentially destabilize the system is kept to a minimum.

Finally, there are opportunities for reducing the complexity of drivers that have not yet been fully exploited by existing operating systems. While most operating systems implement abstractions for devices that provide a common service, such as Ethernet drivers, or share a common command protocol, such as SCSI disks, they do not take advantage of common hardware command interfaces. For example, most network cards present a ring-buffer abstraction to their drivers. When sending a packet, the driver places a pointer to the packet in the send ring and then signals the device that it has updated the ring. Currently, every network driver re-implements the code to receive a packet from the kernel, place it in the ring, and notify the device. The major differences between drivers are the exact data format of the ring, such as the structure of the entries, and the commands for notifying the device of new data. As a result, it may be possible to simplify drivers by providing abstractions of hardware data structures, such as ring buffers, in the driver interface. Rather than writing the entire send and receive path, the driver writer would instead provide a description of the ring data structure and the command sequence. The OS can provide the rest of the code. The I/O system for Mach 3.0 attempted a similar decomposition to abstract device-independent code from drivers [82]. The benefit of this approach is that it greatly simplifies the code for performing I/O, reducing both the cost of writing drivers and the risk of driver bugs.
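The ring-buffer idea can be sketched concretely. In the fragment below, the descriptor layout, structure names, and flag values are all invented for illustration; the point is that a driver supplies only a declarative description of its device's ring format, and a generic OS routine implements the send path from it.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical descriptor format for one imagined NIC ("xnic"). */
struct xnic_tx_desc {
    uint64_t buf_addr;   /* physical address of the packet buffer */
    uint16_t buf_len;    /* packet length in bytes */
    uint16_t flags;      /* ownership flag telling the device the slot is ready */
};

/* Declarative description of the ring format, supplied by the driver. */
struct ring_desc_layout {
    size_t   entry_size;   /* size of one descriptor */
    size_t   addr_offset;  /* where the buffer address lives in a descriptor */
    size_t   len_offset;   /* where the length lives */
    size_t   flags_offset; /* where the flags live */
    uint16_t ready_flag;   /* value marking an entry as owned by the device */
};

static const struct ring_desc_layout xnic_tx_layout = {
    .entry_size   = sizeof(struct xnic_tx_desc),
    .addr_offset  = offsetof(struct xnic_tx_desc, buf_addr),
    .len_offset   = offsetof(struct xnic_tx_desc, buf_len),
    .flags_offset = offsetof(struct xnic_tx_desc, flags),
    .ready_flag   = 0x8000,
};

/* Generic send path the OS could provide once, for every such driver:
 * fill in the slot according to the layout table and mark it ready. */
static void ring_enqueue(void *ring_base, unsigned slot,
                         const struct ring_desc_layout *l,
                         uint64_t pkt_addr, uint16_t pkt_len)
{
    char *e = (char *)ring_base + slot * l->entry_size;
    *(uint64_t *)(e + l->addr_offset)  = pkt_addr;
    *(uint16_t *)(e + l->len_offset)   = pkt_len;
    *(uint16_t *)(e + l->flags_offset) = l->ready_flag;
}
```

With such a table, the per-driver code shrinks to the description itself; the enqueue loop, index management, and doorbell write are shared OS code.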

An important aspect of evolving the driver interface is automatic migration from existing drivers to the new framework. Recent work, such as the Ivy project for replacing C as a systems programming language [34], and Privtrans for automatically partitioning code for privilege separation [35], points towards possible solutions to this problem.

There have been two recent projects that reinvent the driver interface. Project UDI (Uniform Driver Interface) seeks to create a common driver interface across all Unix platforms [169]. The design calls for fully asynchronous code with no blocking and no direct access to hardware. Instead, blocking calls take callback functions, and drivers access hardware through mini-programs written in a domain-specific language. The Windows Driver Foundation seeks to simplify Windows device drivers by providing a stripped-down calling interface and an interface definition language for driver parameters [146]. Both efforts are a step in the right direction, but so far their adoption has been hampered by lack of compatibility with existing driver code.
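The fully typed, marshalable calling convention that both efforts point toward can be sketched as follows. The structure and function names here are hypothetical; the essential property is that every parameter carries an explicit type and length, so an IDL-generated stub could copy the call across an isolation boundary when needed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical fully-typed driver call: no opaque pointers, every
 * buffer accompanied by an explicit length.  Invented for illustration. */
struct net_send_args {
    uint32_t       ifindex;  /* which network interface */
    uint32_t       pkt_len;  /* explicit payload length */
    const uint8_t *pkt;      /* payload, copyable because its length is known */
};

/* Marshal the call into a flat buffer, as an RPC stub would when the
 * driver runs in a separate protection domain.  Returns the number of
 * bytes written, or 0 if the buffer is too small. */
static size_t marshal_net_send(const struct net_send_args *a,
                               uint8_t *buf, size_t buflen)
{
    size_t need = sizeof(a->ifindex) + sizeof(a->pkt_len) + a->pkt_len;
    if (buflen < need)
        return 0;
    memcpy(buf,     &a->ifindex, sizeof(a->ifindex));
    memcpy(buf + 4, &a->pkt_len, sizeof(a->pkt_len));
    memcpy(buf + 8, a->pkt,      a->pkt_len);
    return need;
}
```

When driver and kernel share an address space, the same typed call can be invoked directly and the marshaling step skipped, so the typing discipline costs nothing in the common case.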

9.5 Driverless Operating Systems

Beyond redesigning drivers, it is possible to imagine operating systems that do away with device drivers altogether. Even with an improved interface, drivers would continue to pose problems: they must be updated for every device revision, they must be ported to every operating system, and they are critical to performance and reliability. As a result, drivers will continue to hamper system development unless a new solution is found.

Drivers today serve as a combination of a translation layer, connecting the operating system interface to the device interface, an adaptation layer, masking the impedance mismatch between processor and device speeds, and a service layer, providing functionality beyond the device’s abilities. However, these three roles may be filled by different software that need not suffer from driver problems. Existing safe-extension technology can provide the service layer functionality, but new communication abstractions are needed for the other two functions.

The job of interfacing to device hardware is necessary because higher-level communication abstractions do not exist for drivers. Elsewhere in operating system design, these abstractions have simplified application design. For example, remote procedure call systems remove much of the complexity in implementing client-server communication. In contrast, many drivers still access devices by reading and writing memory-mapped device registers. While new device connection technologies, such as FireWire [62] and USB [49], allow drivers and devices to communicate with messages, the richness of distributed programming techniques is not yet available to drivers.

To address this issue, new communications mechanisms that take into account the needs of devices, such as low latency, high bandwidth, and direct access to memory, are needed. These mechanisms must incorporate the costs of communication and the relative speeds of the device and the host processor to serve the impedance-matching role of today’s drivers. While most existing approaches to programming distributed systems do not apply to device drivers directly, the abstractions, lessons, and approaches can greatly simplify driver design. For example, Devil’s use of interface definition languages replaces much of the handcrafted code in drivers that communicates with devices [142]. High-performance messaging abstractions, such as Active Messages [203], may also apply.
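A message-based device interface of this kind might look like the sketch below. The API is invented for illustration, and the "device" is modeled as an in-memory queue so the fragment is self-contained; a real channel implementation would batch messages and notify the hardware asynchronously.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical message channel to a device, replacing direct reads and
 * writes of memory-mapped registers.  All names are invented. */
struct dev_msg {
    uint32_t opcode;       /* e.g., 1 = SEND_PACKET, 2 = READ_STATS */
    uint32_t length;       /* payload length in bytes */
    uint8_t  payload[64];
};

#define CHAN_SLOTS 8

struct dev_channel {
    struct dev_msg slots[CHAN_SLOTS];
    unsigned head, tail;   /* producer / consumer indices */
};

/* Enqueue a message for the device; returns 0 on success, -1 if full.
 * A real channel would coalesce messages here to match the relative
 * speeds of host and device. */
static int dev_channel_send(struct dev_channel *ch, const struct dev_msg *m)
{
    if (ch->head - ch->tail == CHAN_SLOTS)
        return -1;
    ch->slots[ch->head % CHAN_SLOTS] = *m;
    ch->head++;
    return 0;
}

/* Dequeue the next message, as the device (or its emulation) would. */
static int dev_channel_recv(struct dev_channel *ch, struct dev_msg *out)
{
    if (ch->head == ch->tail)
        return -1;
    *out = ch->slots[ch->tail % CHAN_SLOTS];
    ch->tail++;
    return 0;
}
```

Because the driver's interaction with the device is reduced to sending and receiving typed messages, the same code could run in the kernel, in a user-level service, or even on the device itself.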

An important benefit of high-level communication protocols is that they push driver functionality out of the operating system, where it impacts system reliability and must be ported or re-implemented for each additional OS, onto the hardware itself. Thus, much of the code in drivers need be implemented just once, on the device hardware, rather than separately for each operating system.

With these techniques, operating systems will not need drivers as we conceive of them today. Instead, the function of drivers can be shifted to user-level services, application libraries, and hardware devices. Systems will invoke driver services over a common communication substrate, simplifying the OS and improving its reliability. The code that currently constitutes drivers will be easier to update as new devices are released, because it relies on high-level mechanisms instead of low-level device access. Finally, this code can be ported between operating systems more easily because it can execute outside of the kernel, as an application library or user-level service.

9.6 Summary

There are two major directions for future research. First, the existing architecture of Nooks can be extended to cover a greater fraction of kernel extensions and to incorporate static checking as a substitute for run-time checking. With these technologies, Nooks isolation and recovery services could be applied to other extensible systems that depend on extension correctness for reliability. Second, the lessons from implementing Nooks point toward new considerations for device driver design. A new interface to drivers, tailored to reliability concerns, could greatly reduce driver-induced failures. Longer term, new hardware designs and new communication mechanisms could lead to operating systems without device drivers as we know them.

Chapter 10

CONCLUSIONS

Reliability has become a critical challenge for commodity operating systems as we depend on computers for more of our daily lives. The competitive pressure on these systems and their huge installed base, though, prevents the adoption of traditional fault-tolerance techniques. This thesis presents a new approach to improving the reliability of operating systems that is at once efficient and backwards compatible. Rather than tolerate all possible failures, my approach is to target the most common failures and thereby improve reliability at very low cost.

In today’s commodity operating systems, device driver failures are the dominant cause of system failure. In this thesis, I described Nooks, a new reliability layer intended to prevent drivers from forcing either the OS or applications to restart. Nooks uses hardware and software techniques to isolate device drivers, trapping many common faults and permitting extension recovery. Shadow drivers ensure that the OS and applications continue to function during and after recovery. Dynamic driver update ensures that applications and the OS continue to run when applying driver updates.

The Nooks system accomplishes these tasks by focusing on backward compatibility. That is, Nooks sacrifices complete isolation and fault tolerance for compatibility and transparency with existing kernels and drivers. Nevertheless, Nooks demonstrates that it is possible to realize an extremely high level of operating system reliability with low performance loss for common device drivers.

My fault-injection experiments show that Nooks recovered from 99% of the faults that caused native Linux to crash. Shadow drivers kept applications running for 98% of the faults that would otherwise have caused a failure. In testing, dynamic driver update could apply 99% of driver updates without rebooting the system or restarting applications. Furthermore, the performance cost of Nooks averaged only 1% across common applications and drivers.
My experience shows that: (1) implementation of a Nooks layer is achievable with only modest engineering effort, even on a monolithic operating system like Linux, (2) device drivers can be isolated without change to driver code, and (3) isolation and recovery can dramatically improve the system’s ability to survive extension faults.

Looking forward, Nooks demonstrates that reliability should be a first-class citizen when designing extension interfaces. In addition to performance and programmability, interfaces should be designed to enable efficient isolation and recovery by exposing key events and establishing clear boundaries between extensions and the core system. Only when this lesson has been learned will extensible systems, such as operating systems, browsers, and web servers, achieve high reliability.

BIBLIOGRAPHY

[1] Edward N. Adams. Optimizing preventive maintenance of software products. IBM Journal of Research and Development, 28(1):2–14, January 1984.

[2] Gautam Altekar, Ilya Bagrak, Paul Burstein, and Andrew Schultz. OPUS: Online patches and updates for security. In Proceedings of the 14th USENIX Security Symposium, pages 287–302, Baltimore, Maryland, August 2005.

[3] Rajeev Alur, Pavol Cerny, P. Madhusudan, and Wonhong Nam. Synthesis of interface specifications for Java classes. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 98–109, Long Beach, California, January 2005.

[4] Scott Maxwell and Frank Hartman. Linux and JPL’s Mars Exploration Rover project: Earth-based planning, simulation, and really remote scheduling. USENIX Annual Technical Conference Invited Talk, April 2005.

[5] Apache Project. Apache HTTP server version 2.0, 2000. Available at http://httpd.apache.org.

[6] Apple Computer. I/O kit fundamentals, 2005. Available at http://developer.apple.com/documentation/DeviceDrivers/Conceptual/IOKitFundamentals/IOKitFundamentals.pdf.

[7] Hitoshi Araki, Shin Futagami, and Kaoru Nitoh. A non-stop updating technique for device driver programs on the IROS platform. In IEEE International Conference on Communications, pages 18–22, Seattle, WA, June 1995.

[8] Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau. Fail-stutter fault tolerance. In Proceedings of the Eighth IEEE Workshop on Hot Topics in Operating Systems, pages 33–40, Schloss Elmau, Germany, May 2001.

[9] Sandy Arthur. Fault resilient drivers for Longhorn server. Microsoft WinHec 2004 Presentation DW04012, May 2004.

[10] Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. Basic concepts of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, January 2004.

[11] Özalp Babaoğlu. Fault-tolerant computing based on Mach. In Proceedings of the USENIX Mach Symposium, pages 185–199, Burlington, VT, October 1990.

[12] Gogul Balakrishnan, Radu Gruian, Thomas R. Reps, and Tim Teitelbaum. CodeSurfer/x86 – a platform for analyzing x86 executables. In Proceedings of the International Conference on Compiler Construction, pages 250–254, Edinburgh, Scotland, April 2005.

[13] Thomas Ball and Sriram K. Rajamani. Automatically validating temporal safety properties of interfaces. In SPIN 2001, Workshop on Model Checking of Software, volume 2057 of Lecture Notes in Computer Science, pages 103–122, May 2001.

[14] Thomas Ball and Sriram K. Rajamani. The SLAM project: Debugging system software via static analysis. In Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 1–3, Portland, OR, January 2002.

[15] Roger Barga, David Lomet, German Shegalov, and Gerhard Weikum. Recovery guarantees for internet applications. ACM Transactions on Internet Technology, 2004.

[16] Roger Barga, David Lomet, and Gerhard Weikum. Recovery guarantees for general multi-tier applications. In International Conference on Data Engineering, pages 543–554, San Jose, California, 2002. IEEE.

[17] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, pages 164–177, Bolton Landing, New York, October 2003.

[18] Mike Barnett, K. Rustan M. Leino, and Wolfram Schulte. The Spec# programming system: An overview. In CASSIS International Workshop, volume 3362 of Lecture Notes in Computer Science, pages 49–69, Marseille, France, March 2004. Springer-Verlag.

[19] Joel F. Bartlett. A NonStop kernel. In Proceedings of the 8th ACM Symposium on Operating Systems Principles, pages 22–29, Pacific Grove, California, December 1981.

[20] Elisa Batista. Windows to power ATMs in 2005. Wired News, September 2003. http://www.wired.com/news/technology/0,1282,60497,00.html.

[21] Andrew Baumann et al. Providing dynamic update in an operating system. In Proceedings of the 2005 USENIX Annual Technical Conference, pages 279–291, Anaheim, California, April 2005.

[22] David Bernick, Bill Bruckert, Paul Del Vigna, David Garcia, Robert Jardine, Jim Klecka, and Jim Smullen. NonStop advanced architecture. In Proceedings of the 2005 IEEE International Conference on Dependable Systems and Networks, pages 12–21, Yokohama, Japan, June 2005.

[23] Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. Lightweight remote procedure call. ACM Transactions on Computer Systems, 8(1):37–55, February 1990.

[24] Brian N. Bershad, Stefan Savage, Przemysław Pardyak, Emin Gün Sirer, Marc E. Fiuczynski, David Becker, Craig Chambers, and Susan Eggers. Extensibility, safety and performance in the SPIN operating system. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, pages 267–284, Copper Mountain, Colorado, December 1995.

[25] Andrew D. Birrell and Bruce J. Nelson. Implementing remote procedure calls. ACM Transactions on Computer Systems, 2(1):39–59, February 1984.

[26] Common Criteria Implementation Board. Common criteria for information technology security evaluation, part 3: Security assurance requirements, version 2.2. Technical Report CCIMB-2004-01-003, National Institute of Standards and Technology, January 2004.

[27] Aniruddha Bohra, Iulian Neamtiu, Pascal Gallard, Florin Sultan, and Liviu Iftode. Remote repair of operating system state using Backdoors. In Proceedings of the International Conference on Autonomic Computing, pages 256–263, New York, New York, May 2004.

[28] Anita Borg, Wolfgang Blau, Wolfgang Graetsch, Ferdinand Herrmann, and Wolfgang Oberle. Fault tolerance under UNIX. ACM Transactions on Computer Systems, 7(1):1–24, February 1989.

[29] David A. Borman. Implementing TCP/IP on a Cray computer. ACM SIGCOMM Computer Communication Review, 19(2):11–15, April 1989.

[30] Luciano Porto Barreto and Gilles Muller. Bossa: A language-based approach for the design of real time schedulers. In Proceedings of the 14th Euromicro Conference on Real Time Systems, Paris, France, March 2002.

[31] Daniel P. Bovet and Marco Cesati. Understanding the Linux Kernel, 2nd Edition. O’Reilly Associates, December 2002.

[32] Thomas C. Bressoud. TFT: A software system for application-transparent fault tolerance. In Proceedings of the 28th Symposium on Fault-Tolerant Computing, pages 128–137, Munich, Germany, June 1998. IEEE.

[33] Thomas C. Bressoud and Fred B. Schneider. Hypervisor-based fault tolerance. ACM Transactions on Computer Systems, 14(1):80–107, February 1996.

[34] Eric Brewer, Jeremy Condit, Bill McCloskey, and Feng Zou. Thirty years is long enough: Getting beyond C. In Proceedings of the Tenth IEEE Workshop on Hot Topics in Operating Systems, Santa Fe, New Mexico, June 2005.

[35] David Brumley and Dawn Song. Privtrans: Automatically partitioning programs for privilege separation. In Proceedings of the 13th USENIX Security Symposium, pages 57–72, San Diego, California, August 2004.

[36] George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the Eighth IEEE Workshop on Hot Topics in Operating Systems, pages 125–132, Schloss Elmau, Germany, May 2001.

[37] Miguel Castro and Barbara Liskov. Practical Byzantine fault tolerance. In Proceedings of the 3rd USENIX Symposium on Operating Systems Design and Implementation, pages 173–186, New Orleans, Louisiana, February 1999.

[38] Subhachandra Chandra and Peter M. Chen. How fail-stop are faulty programs? In Proceedings of the 28th Symposium on Fault-Tolerant Computing, pages 240–249, Munich, Germany, June 1998. IEEE.

[39] John Chapin, Mendel Rosenblum, Scott Devine, Tirthankar Lahiri, Dan Teodosiu, and Anoop Gupta. Hive: Fault containment for shared-memory multiprocessors. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, pages 12–25, Copper Mountain Resort, Colorado, December 1995.

[40] Jeffrey S. Chase, Henry M. Levy, Michael J. Feeley, and Edward D. Lazowska. Sharing and protection in a single-address-space operating system. ACM Transactions on Computer Systems, 12(4):271–307, November 1994.

[41] L. Chen and Algirdas Avizienis. N-version programming: A fault-tolerance approach to reliability of software operation. In Proceedings of the 8th Symposium on Fault-Tolerant Computing, pages 3–9, Toulouse, France, June 1978.

[42] Mike Y. Chen, Anthony Accardi, Emre Kiciman, Dave Patterson, Jim Lloyd, Armando Fox, and Eric Brewer. Path-based failure and evolution management. In Proceedings of the 2004 USENIX Symposium on Network Systems Design and Implementation, pages 309–322, San Francisco, California, March 2004.

[43] Peter M. Chen and Brian Noble. When virtual is better than real. In Proceedings of the Eighth IEEE Workshop on Hot Topics in Operating Systems, pages 133–138, Schloss Elmau, Germany, May 2001.

[44] Tzi-cker Chiueh, Ganesh Venkitachalam, and Prashant Pradhan. Integrating segmentation and paging protection for safe, efficient and transparent software extensions. In Proceedings of the 17th ACM Symposium on Operating Systems Principles, pages 140–153, Kiawah Island Resort, South Carolina, December 1999.

[45] Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson Engler. An empirical study of operating system errors. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, pages 73–88, Lake Louise, Alberta, October 2001.

[46] Jörgen Christmansson and Ram Chillarege. Generation of an error set that emulates software faults based on field data. In Proceedings of the 26th Symposium on Fault-Tolerant Computing, pages 304–313, Sendai, Japan, June 1996. IEEE.

[47] Yvonne Coady and Gregor Kiczales. Back to the future: A retroactive study of aspect evolution in operating systems code. In Proceedings of the International Conference on Aspect-Oriented Software Development, pages 50–59, Boston, Massachusetts, March 2003.

[48] Robert P. Colwell, Edward F. Gehringer, and E. Douglas Jensen. Performance effects of architectural complexity in the Intel 432. ACM Transactions on Computer Systems, 6(3):296–339, August 1988.

[49] Compaq, Hewlett-Packard, Intel, Lucent, Microsoft, NEC, and Philips. Universal serial bus (USB) 2.0 specification, April 2000. Available at http://www.usb.org/developer/docs.

[50] Jeremy Condit, Matthew Harren, Scott McPeak, George C. Necula, and Westley Weimer. CCured in the real world. In Proceedings of the ACM SIGPLAN ’03 Conference on Programming Language Design and Implementation, pages 232–244, San Diego, California, USA, June 2003.

[51] Fernando J. Corbató and V. A. Vyssotsky. Introduction and overview of the Multics system. In Proceedings of the AFIPS Fall Joint Computer Conference, volume 27, pages 185–196, Las Vegas, Nevada, 1965. Spartan Books.

[52] Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. Linux Device Drivers, 3rd Edition. O’Reilly Associates, February 2005.

[53] Coverity. Analysis of the Linux kernel, 2004. Available at http://www.coverity.com.

[54] Flaviu Cristian. Understanding fault-tolerant distributed systems. Communications of the ACM, 34(2):56–78, February 1991.

[55] Helen Custer. Inside Windows NT. Microsoft Press, Redmond, WA, 1993.

[56] D. H. Brown Associates. Stratus Introduces Four-Way, Fault-Tolerant SMP Server with Hyper-Threading for Windows 2000 Advanced Server, September 2002. Available at http://www.stratus.com/resources/pdf/brown092002.pdf.

[57] Joseph Dadzie. Understanding software patching. ACM Queue, 3(2), March 2005.

[58] Robert DeLine and Manuel Fähndrich. Enforcing high-level protocols in low-level software. In Proceedings of the ACM SIGPLAN ’01 Conference on Programming Language Design and Implementation, pages 59–69, Snowbird, Utah, June 2001.

[59] Peter J. Denning. Fault tolerant operating systems. ACM Computing Surveys, 8(4):359–389, December 1976.

[60] Jack B. Dennis and Earl Van Horn. Programming semantics for multiprogramming systems. Communications of the ACM, 9(3), March 1966.

[61] David J. DeWitt. The Wisconsin benchmark: Past, present, and future. In Jim Gray, editor, The Benchmark Handbook for Database and Transaction Systems, Second Edition, pages 269–315. Morgan Kaufmann, 1993.

[62] Don Anderson. FireWire System Architecture: IEEE 1394A. MindShare, Inc., Addison-Wesley Professional, 1999.

[63] R. H. Doyle, R. A. Meyer, and R. P. Pedowitz. Automatic failure recovery in a digital data processing system. IBM Journal of Research and Development, 3(1):2–12, January 1959.

[64] Richard P. Draves. The case for run-time replaceable kernel modules. In Proceedings of the Fourth Workshop on Workstation Operating Systems, pages 160–164, Napa, California, October 1993. IEEE Computer Society Press.

[65] Richard P. Draves, Brian N. Bershad, Richard F. Rashid, and Randall W. Dean. Using continuations to implement thread management and communications in operating systems. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 122–136, Pacific Grove, California, October 1991.

[66] Dominic Duggan. Type-based hot swapping of running modules. Technical Report SIT CS Report 2001-7, Stevens Institute of Technology, Castle Point on the Hudson, Hoboken, New Jersey, October 2001.

[67] Gregory Duval and Jacques Julliand. Modeling and verification of the RUBIS µ-kernel with SPIN. In Proceedings of the International SPIN Workshop, Montreal, Canada, October 1995.

[68] Dawson Engler, Benjamin Chelf, Andy Chou, and Seth Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In Proceedings of the 4th USENIX Symposium on Operating Systems Design and Implementation, pages 1–16, San Diego, California, October 2000.

[69] Dawson Engler, David Yu Chen, and Andy Chou. Bugs as inconsistent behavior: A general approach to inferring errors in systems code. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, pages 57–72, Lake Louise, Alberta, October 2001.

[70] Dawson R. Engler. personal communication, March 2005.

[71] Dawson R. Engler, M. Frans Kaashoek, and James O’Toole Jr. Exokernel: an operating system architecture for application-level resource management. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, pages 251–266, Copper Mountain Resort, Colorado, December 1995.

[72] Úlfar Erlingsson, Tom Roeder, and Ted Wobber. Virtual environments for unreliable extensions. Technical Report MSR-TR-05-82, Microsoft Research, June 2005.

[73] Jean-Charles Fabre, Manuel Rodríguez, Jean Arlat, Frédéric Salles, and Jean-Michel Sizun. Building dependable COTS microkernel-based systems using MAFALDA. In Proceedings of the 2000 Pacific Rim International Symposium on Dependable Computing, pages 85–94, Los Angeles, California, December 2000.

[74] Robert S. Fabry. Capability-based addressing. Communications of the ACM, 17(7):403–412, July 1974.

[75] Robert S. Fabry. How to design a system in which modules can be changed on the fly. In Proceedings of the 2nd International Conference on Software Engineering, pages 470–476, October 1976.

[76] Michael E. Fagan. Design and code inspections to reduce errors in program development. IBM Systems Journal, 15(3):182–211, 1976.

[77] Wu-Chun Feng. Making a case for efficient supercomputing. ACM Queue, 1(7), October 2003.

[78] Christof Fetzer and Zhen Xiao. HEALERS: A toolkit for enhancing the robustness and security of existing applications. In Proceedings of the 2003 IEEE International Conference on Dependable Systems and Networks, pages 317–322, San Francisco, California, June 2003. IEEE.

[79] Charles Fishman. They write the right stuff. Fast Company, (6):95–103, December 1996.

[80] Marc E. Fiuczynski, Robert Grimm, Yvonne Coady, and David Walker. patch(1) considered harmful. In Proceedings of the Tenth IEEE Workshop on Hot Topics in Operating Systems, Santa Fe, New Mexico, June 2005.

[81] Bryan Ford, Godmar Back, Greg Benson, Jay Lepreau, Albert Lin, and Olin Shivers. The Flux OSKit: A substrate for kernel and language research. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 38–51, October 1997.

[82] Alessandro Forin, David Golub, and Brian Bershad. An I/O system for Mach. In Proceedings of the Usenix Mach Symposium, pages 163–176, November 1991.

[83] Keir Fraser, Steven Hand, Rolf Neugebauer, Ian Pratt, Andrew Warfield, and Mark Williamson. Safe hardware access with the Xen virtual machine monitor. In Proceedings of the Workshop on Operating System and Architectural Support for the On-Demand IT Infrastructure, October 2004.

[84] Michael Gagliardi, Ragunathan Rajkumar, and Lui Sha. Designing for evolvability: Building blocks for evolvable real-time systems. In Proceedings of the 2nd IEEE Real-Time Technology and Applications Symposium, pages 100–109, Boston, Massachusetts, June 1996.

[85] James Gettys, Philip L. Karlton, and Scott McGregor. The X window system version 11. Technical Report CRL-90-08, Digital Equipment Corporation, December 1990.

[86] Al Gillen, Dan Kusnetzky, and Scott McLaron. The role of Linux in reducing the cost of enterprise computing, January 2002. IDC white paper.

[87] Michael Golm, Meik Felser, Christian Wawersich, and Jürgen Kleinöder. The JX operating system. In Proceedings of the 2002 USENIX Annual Technical Conference, pages 45–58, Monterey, California, June 2002.

[88] James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. Addison-Wesley, 1996.

[89] Hannes Goullon, Rainer Isle, and Klaus-Peter Löhr. Dynamic restructuring in an experimental operating system. In Proceedings of the 3rd International Conference on Software Engineering, pages 295–304, May 1978.

[90] Jim Gray. The transaction concept: Virtues and limitations. In Proceedings of the 7th International Conference on Very Large Data Bases, pages 144–154, Los Angeles, California, September 1981.

[91] Jim Gray. Why do computers stop and what can be done about it? In Proceedings of the Fifth Symposium on Reliability in Distributed Software and Database Systems, pages 3–12, Los Angeles, California, January 1986. IEEE.

[92] Jim Gray. Heisenbugs: A probabilistic approach to availability. Talk given at ISAT, available at http://research.microsoft.com/~Gray/talks/ISAT_Gray_FT_Avialiability_talk.ppt, April 1999.

[93] Jim Gray and Daniel P. Siewiorek. High-availability computer systems. IEEE Computer, 24(9):39–48, September 1991.

[94] Deepak Gupta and Pankaj Jalote. On-line software version change using state transfer between processes. Software–Practice and Experience, 23(9):949–964, September 1993.

[95] Jaap C. Haarsten. The Bluetooth radio system. IEEE Personal Communications Magazine, 7(1):28–36, February 2000.

[96] Steven M. Hand. Self-paging in the Nemesis operating system. In Proceedings of the 3rd USENIX Symposium on Operating Systems Design and Implementation, pages 73–86, New Orleans, Louisiana, February 1999.

[97] Hermann Härtig, Michael Hohmuth, Jochen Liedtke, Sebastian Schönberg, and Jean Wolter. The performance of µ-kernel-based systems. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 66–77, Saint-Malo, France, October 1997.

[98] Steffen Hauptmann and Josef Wasel. On-line maintenance with on-the-fly software replacement. In Proceedings of the 3rd International Conference on Configurable Distributed Systems, pages 70–80, May 1996.

[99] Anders Hejlsberg, Scott Wiltamuth, and Peter Golde. The C# Programming Language. Addison-Wesley, Boston, Massachusetts, October 2003.

[100] Thomas A. Henzinger, Ranjit Jhala, Rupak Majumdar, George C. Necula, Grégoire Sutre, and Westley Weimer. Temporal-safety proofs for systems code. In Proceedings of the 14th International Conference on Computer-Aided Verification (CAV), volume 2404 of Lecture Notes in Computer Science, pages 526–538. Springer-Verlag, July 2002.

[101] Scott Herboldt. How to improve driver quality with Winqual/WHQL. Microsoft WinHec 2005 Presentation TWDE05006, April 2005.

[102] Hewlett Packard. Hewlett Packard Digital Entertainment Center, October 2001. http://www.hp.com/hpinfo/newsroom/press/31oct01a.htm.

[103] Michael Hicks, Jonathan T. Moore, and Scott Nettles. Dynamic software updating. In Proceedings of the ACM SIGPLAN ’01 Conference on Programming Language Design and Implementation, pages 13–23, Snowbird, Utah, June 2001.

[104] Dan Hildebrand. An architectural overview of QNX. In Proceedings of the USENIX Workshop on Microkernels and Other Kernel Architectures, pages 113–126, Seattle, Washington, 1992.

[105] Gísli Hjálmtýsson and Robert Gray. Dynamic C++ classes: A lightweight mechanism to update code in a running program. In Proceedings of the 1998 USENIX Annual Technical Conference, pages 65–76, New Orleans, Louisiana, June 1998.

[106] Gerald J. Holzmann. The model checker SPIN. IEEE Transactions on Software Engineering, 23(5), May 1997.

[107] Merle E. Houdek, Frank G. Soltis, and Roy L. Hoffman. IBM System/38 support for capability-based addressing. In Proceedings of the 8th ACM International Symposium on Computer Architecture, pages 341–348, Minneapolis, Minnesota, May 1981.

[108] Mei-Chen Hsueh, Timothy K. Tsai, and Ravishankar K. Iyer. Fault injection techniques and tools. IEEE Computer, 30(4):75–82, April 1997.

[109] Galen Hunt. Creating user-mode device drivers with a proxy. In Proceedings of the 1997 USENIX Windows NT Workshop, Seattle, WA, August 1997.

[110] Galen C. Hunt and James R. Larus. Singularity design motivation (Singularity Technical Report 1). Technical Report MSR-TR-2004-105, Microsoft Research, December 2004.

[111] RTCA Inc. Software considerations in airborne systems and equipment certification, December 1992. RTCA/DO-178B.

[112] Intel Corporation. The IA-32 Architecture Software Developer’s Manual, Volume 1: Basic Architecture. Intel Corporation, January 2002. Available at http://www.intel.com/design/pentium4/manuals/24547010.pdf.

[113] Doug Jewett. Integrity S2: A fault-tolerant Unix platform. In Proceedings of the 21st Symposium on Fault-Tolerant Computing, pages 512–519, Quebec, Canada, June 1991. IEEE.

[114] Trevor Jim, Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, and Yanling Wang. Cyclone: A safe dialect of C. In Proceedings of the 2002 USENIX Annual Technical Conference, pages 275–288, Monterey, California, June 2002.

[115] Andréas Johansson and Neeraj Suri. Error propagation profiling of operating systems. In Proceedings of the 2005 IEEE International Conference on Dependable Systems and Networks, pages 86–95, Yokohama, Japan, June 2005.

[116] Rick Jones. Netperf: A network performance benchmark, version 2.1, 1995. Available at http://www.netperf.org.

[117] Jungo. WinDriver cross-platform device driver development environment. Technical report, Jungo Corporation, February 2002. http://www.jungo.com/windriver.html.

[118] Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Videira Lopes, Jean-Marc Loingtier, and John Irwin. Aspect-oriented programming. In Proceedings of the 11th European Conference on Object-Oriented Programming, volume 1241 of Lecture Notes in Computer Science, pages 220–242, Jyväskylä, Finland, June 1997. Springer-Verlag.

[119] Mark J. Kilgard, David Blythe, and Deanna Hohn. System support for OpenGL direct rendering. In Proceedings of Graphics Interface, pages 116–127, Toronto, Ontario, May 1995. Canadian Human-Computer Communications Society.

[120] Arun Kishan and Monica Lam. Dynamic kernel modification and extensibility, May 2002. http://suif.stanford.edu/papers/kishan.pdf.

[121] John C. Knight and Nancy G. Leveson. An experimental evaluation of the assumption of independence in multiversion programming. IEEE Transactions on Software Engineering, 12(1):96–109, January 1986.

[122] Michael Kofler. The Definitive Guide to MySQL, Second Edition. Apress, Berkeley, California, October 2003.

[123] Eric J. Koldinger, Jeffrey S. Chase, and Susan J. Eggers. Architectural support for single address space operating systems. In Proceedings of the Fifth ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 175–186, Boston, Massachusetts, October 1992.

[124] Sanjeev Kumar and Kai Li. Using model checking to debug device firmware. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation, pages 61–74, Boston, Massachusetts, December 2002.

[125] Jean-Claude Laprie, Jean Arlat, Christian Béounes, Karama Kanoun, and Catherine Hourtolle. Hardware and software fault tolerance: Definition and analysis of architectural solutions. In Proceedings of the 17th Symposium on Fault-Tolerant Computing, pages 116–121, Pittsburgh, Pennsylvania, July 1987. IEEE.

[126] James R. Larus, Thomas Ball, Manuvir Das, Robert DeLine, Manuel Fähndrich, Jon Pincus, Sriram K. Rajamani, and Ramanathan Venkatapathy. Righting software. IEEE Transactions on Software Engineering, 30(5):92–100, May 2004.

[127] Julia L. Lawall, Gilles Muller, and Richard Urunuela. Tarantula: Killing driver bugs before they hatch. In Proceedings of the Fourth AOSD Workshop on Aspects, Components, and Patterns for Infrastructure Software, pages 13–18, Chicago, Illinois, March 2005.

[128] Inhwan Lee and Ravishankar K. Iyer. Software dependability in the Tandem GUARDIAN system. IEEE Transactions on Software Engineering, 21(5), May 1995.

[129] Insup Lee. DYMOS: A Dynamic Modification System. PhD thesis, Department of Computer Science, University of Wisconsin, Madison, Wisconsin, April 1983.

[130] Joshua LeVasseur, Volkmar Uhlig, Jan Stoess, and Stefan Götz. Unmodified device driver reuse and improved system dependability via virtual machines. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation, pages 17–30, San Francisco, California, December 2004.

[131] Henry M. Levy. Capability-Based Computer Systems. Digital Press, 1984. Available at http://www.cs.washington.edu/homes/levy/capabook.

[132] Jochen Liedtke. On µ-kernel construction. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, pages 237–250, Copper Mountain Resort, Colorado, December 1995.

[133] Linux Kernel Mailing List. Available at http://www.uwsg.indiana.edu/hypermail/linux/kernel.

[134] Bev Littlewood and Lorenzo Strigini. Software reliability and dependability: a roadmap. In Anthony Finkelstein, editor, Proceedings of the International Conference on Software Engineering - Future of Software Engineering Track, pages 175–188, Limerick, Ireland, June 2000.

[135] David E. Lowell, Subhachandra Chandra, and Peter M. Chen. Exploring failure transparency and the limits of generic recovery. In Proceedings of the 4th USENIX Symposium on Operating Systems Design and Implementation, pages 289–304, San Diego, California, October 2000.

[136] David E. Lowell and Peter M. Chen. Discount checking: Transparent, low-overhead recovery for general applications. Technical Report CSE-TR-410-99, University of Michigan, November 1998.

[137] David E. Lowell, Yasushi Saito, and Eileen J. Samberg. Devirtualizable virtual machines enabling general, single-node, online maintenance. In Proceedings of the Eleventh ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 211–223, October 2004.

[138] LWN.net. Kernel patch page. Available at http://lwn.net/Kernel/Patches.

[139] R. E. Lyons and W. Vanderkulk. The use of triple modular redundancy to improve computer reliability. IBM Journal of Research and Development, 7(2):200–209, April 1962.

[140] Sean McGrane. Windows Server platform: Overview and roadmap. Microsoft WinHec 2005 Presentation TWSE05002, April 2005.

[141] Paul R. McJones and Garret F. Swart. Evolving the Unix system interface to support multithreaded programs. Technical Report SRC Research Report 21, Digital Systems Research Center, September 1987. Available at http://gatekeeper.research.compaq.com/pub/DEC/SRC/research-reports/SRC-021-html/evolve.html.

[142] Fabrice Mérillon, Laurent Réveillère, Charles Consel, Renaud Marlet, and Gilles Muller. Devil: An IDL for hardware programming. In Proceedings of the 4th USENIX Symposium on Operating Systems Design and Implementation, pages 17–30, San Diego, California, October 2000.

[143] Microsoft Corporation. Windows Update. Available at http://www.windowsupdate.com.

[144] Microsoft Corporation. FAT: General overview of on-disk format, version 1.03, December 2000.

[145] Microsoft Corporation. Cancel logic in Windows drivers, May 2003. Available at http://www.microsoft.com/whdc/driver/kernel/cancel_logic.mspx.

[146] Microsoft Corporation. Architecture of the Windows Driver Foundation, July 2005. Available at http://www.microsoft.com/whdc/driver/wdf/wdf-arch.mspx.

[147] David Mosberger and Tai Jin. httperf: A tool for measuring web server performance. In First Workshop on Internet Server Performance, pages 59–67, Madison, Wisconsin, June 1998. ACM.

[148] Bob Muglia. Microsoft Windows Server platform: The next three years. Microsoft Professional Developers Conference Keynote Address, September 2005. Available at http://www.microsoft.com/presspass/exec/bobmuglia/09-15PDC2005.mspx.

[149] Gilles Muller, Michel Banâtre, Nadine Peyrouze, and Bruno Rochat. Lessons from FTM: An experiment in design and implementation of a low-cost fault-tolerant system. IEEE Transactions on Software Engineering, 45(2):332–339, June 1996.

[150] Brendan Murphy. Fault tolerance in this high availability world. Talk given at Stanford University and University of California at Berkeley. Available at http://research.microsoft.com/users/bmurphy/FaultTolerance.htm, October 2000.

[151] Brendan Murphy and Mario R. Garzia. Software reliability engineering for mass market products. The DoD Software Tech News, 8(1), December 2004. Available at http://www.softwaretechnews.com/stn8-1/sremm.html.

[152] Madanlal Musuvathi and Dawson R. Engler. Model checking large network protocol implementations. In Proceedings of the 2004 USENIX Symposium on Network Systems Design and Implementation, pages 155–168, San Francisco, California, March 2004.

[153] Madanlal Musuvathi, David Y. W. Park, Andy Chou, Dawson R. Engler, and David L. Dill. CMC: A pragmatic approach to model checking real code. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation, pages 75–88, Boston, Massachusetts, December 2002.

[154] Nachiappan Nagappan and Thomas Ball. Static analysis tools as early indicators of pre-release defect density. In Proceedings of the 27th International Conference on Software Engineering, pages 580–586, St. Louis, Missouri, May 2005.

[155] Nar Ganapathy, Architect, Windows Device Experience. Personal communication, April 2005.

[156] Chandra Narayanaswami, Noboru Kamijoh, Mandayam Raghunath, Tadanobu Inoue, Thomas Cipolla, Jim Sanford, Eugene Schlig, Sreekrishnan Venkiteswaran, and Dinakar Guniguntala. IBM's Linux watch: The challenge of miniaturization. IEEE Computer, 35(1), January 2002.

[157] Wee Teck Ng and Peter M. Chen. The systematic improvement of fault tolerance in the Rio file cache. In Proceedings of the 29th Symposium on Fault-Tolerant Computing, pages 76–83, Madison, Wisconsin, June 1999. IEEE.

[158] Walter Oney. Programming the Microsoft Windows Driver Model, Second Edition. Microsoft Press, Redmond, Washington, December 2002.

[159] Elliott I. Organick. A Programmer’s View of the Intel 432 System. McGraw Hill, 1983.

[160] Vince Orgovan and Mike Tricker. An introduction to driver quality. Microsoft WinHec 2004 Presentation DDT301, August 2003.

[161] Alessandro Orso, Anup Rao, and Mary Jean Harrold. A technique for dynamic updating for Java software. In Proceedings of the IEEE International Conference on Software Maintenance 2002, pages 649–658, Montréal, Canada, October 2002.

[162] David L. Parnas. On the criteria to be used in decomposing systems into modules. Commu- nications of the ACM, 15(12):1053–1058, December 1972.

[163] David L. Parnas. The influence of software structure on system reliability. In Proceedings of the International Conference on Software Engineering, pages 358–362, April 1975.

[164] David Patterson, Aaron Brown, Pete Broadwell, George Candea, Mike Chen, James Cutler, Patricia Enriquez, Armando Fox, Emre Kıcıman, Matthew Merzbacher, David Oppenheimer, Naveen Sastry, William Tetzlaff, Jonathan Traupman, and Noah Treuhaft. Recovery-Oriented Computing (ROC): Motivation, definition, techniques, and case studies. Technical Report CSD-02-1175, UC Berkeley Computer Science, March 2002.

[165] Mark C. Paulk, Bill Curtis, Mary Beth Chrissis, and Charles V. Weber. The capability maturity model, version 1.1. IEEE Software, 10(4):18–27, July 1993.

[166] David Pescovitz. Monsters in a box. Wired, 8(12):341–347, December 2000.

[167] James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent checkpointing under Unix. In Proceedings of the 1995 Winter USENIX Conference, pages 213–224, New Orleans, Louisiana, January 1995.

[168] Jon Postel. Transmission control protocol. RFC 793, Internet Engineering Task Force, September 1981.

[169] Project-UDI. Introduction to UDI version 1.0. Technical report, Project UDI, August 1999.

[170] Brian Randell. System structure for software fault tolerance. IEEE Transactions on Software Engineering, 1:220–232, June 1975.

[171] Rodrigo Rodrigues, Miguel Castro, and Barbara Liskov. BASE: Using abstraction to improve fault tolerance. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, pages 15–28, Lake Louise, Alberta, October 2001.

[172] Mark Russinovich, Zary Segall, and Dan Siewiorek. Application transparent fault manage- ment in Fault Tolerant Mach. In Proceedings of the 23rd Symposium on Fault-Tolerant Com- puting, pages 10–19, Toulouse, France, June 1993. IEEE.

[173] Mark E. Russinovich and David A. Solomon. Inside Windows 2000. Microsoft Press, Redmond, Washington, 2000.

[174] Jerome H. Saltzer. Protection and the control of information sharing in Multics. Communi- cations of the ACM, 17(7):388–402, July 1974.

[175] Richard D. Schlichting and Fred B. Schneider. Fail-stop processors: An approach to designing fault-tolerant computing systems. ACM Transactions on Computer Systems, 1(3):222–238, August 1983.

[176] Frank Schmuck and Jim Wyllie. Experience with transactions in QuickSilver. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 239–253, Pacific Grove, California, October 1991.

[177] Fred B. Schneider. Byzantine generals in action: Implementing fail-stop processors. ACM Transactions on Computer Systems, 2(2):145–154, May 1984.

[178] Michael D. Schroeder and Michael Burrows. Performance of Firefly RPC. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 83–90, Litchfield Park, AZ, December 1989.

[179] Ephraim Schwartz. Check your oil and your e-mail. PC World, September 1999.

[180] Ephraim Schwartz. Linux emerges as cell phone dark horse. InfoWorld, October 2003. Available at http://www.infoworld.com/article/03/10/03/39NNphoneos_1.html.

[181] Mark E. Segal and Ophir Frieder. On-the-fly program modification: Systems for dynamic updating. IEEE Software, 10(2):53–65, March 1993.

[182] Margo I. Seltzer, Yasuhiro Endo, Christopher Small, and Keith A. Smith. Dealing with disaster: Surviving misbehaved kernel extensions. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation, pages 213–227, Seattle, Washington, October 1996.

[183] Mehul A. Shah, Joseph M. Hellerstein, and Eric Brewer. Highly available, fault-tolerant, parallel dataflows. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 827–838, Paris, France, June 2004.

[184] Sharp. Zaurus handheld SL-6000, 2005. http://www.sharpusa.com/products/TypeLanding/0,1056,112,00.html.

[185] Takahiro Shinagawa, Kenji Kono, and Takashi Masuda. Exploiting segmentation mechanism for protecting against malicious mobile code. Technical Report 00-02, Dept. of Information Science, University of Tokyo, May 2000.

[186] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. Operating System Concepts, Seventh Edition. John Wiley & Sons, Inc., March 2004.

[187] Slashdot. Windows upgrade, FAA error cause LAX shutdown. http://it.slashdot.org/it/04/09/21/2120203.shtml?tid=128&tid=103&tid=201.

[188] Craig A. N. Soules, Jonathan Appavoo, Kevin Hui, Robert W. Wisniewski, Dilma Da Silva, Gregory R. Ganger, Orran Krieger, Michael Stumm, Marc Auslander, Michal Ostrowski, Bryan Rosenburg, and Jimi Xenidis. System support for online reconfiguration. In Proceedings of the 2003 USENIX Annual Technical Conference, pages 141–154, Austin, Texas, June 2003.

[189] Standard Performance Evaluation Corporation. The SPECweb99 benchmark, 1999.

[190] Jeremy Sugerman, Ganesh Venkitachalam, and Beng-Hong Lim. Virtualizing I/O devices on VMware workstation's hosted virtual machine monitor. In Proceedings of the 2001 USENIX Annual Technical Conference, pages 1–14, Boston, Massachusetts, June 2001.

[191] Mark Sullivan and Ram Chillarege. Software defects and their impact on system availability - a study of field failures in operating systems. In Proceedings of the 21st Symposium on Fault-Tolerant Computing, pages 2–9, Quebec, Canada, June 1991. IEEE.

[192] Mark Sullivan and Michael Stonebraker. Using write protected data structures to improve software fault tolerance in highly available database management systems. In Proceedings of the 17th International Conference on Very Large Data Bases, pages 171–180, Barcelona, Spain, September 1991. Morgan Kaufman Publishing.

[193] Paul Thurrott. Windows 2000 server: The road to gold, part two: Developing Windows. Paul Thurrott’s SuperSite for Windows, January 2003.

[194] TiVo Corporation. TiVo digital video recorder, 2001. www.tivo.com.

[195] Wilfredo Torres-Pomales. Software fault tolerance: A tutorial. Technical Report TM-2000-210616, NASA Langley Research Center, October 2000.

[196] Linus Torvalds. Linux kernel source tree. Available at http://www.kernel.org.

[197] Jeff Tranter. Open sound system programmer's guide, January 2000. Available at http://www.opensound.com/pguide/oss.pdf.

[198] Harvey Tuch, Gerwin Klein, and Gernot Heiser. OS verification – now! In Proceedings of the Tenth IEEE Workshop on Hot Topics in Operating Systems, Santa Fe, New Mexico, June 2005.

[199] Alan Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 42(2):230–265, November 1936.

[200] Arjan van de Ven. kHTTPd: Linux HTTP accelerator. Available at http://www.fenrus.demon.nl/.

[201] Kevin Thomas Van Maren. The Fluke device driver framework. Master’s thesis, Department of Computer Science, University of Utah, December 1999.

[202] Werner Vogels, Dan Dumitriu, Ken Birman, Rod Gamache, Mike Massa, Rob Short, John Vert, Joe Barrera, and Jim Gray. The design and architecture of the Microsoft Cluster Service. In Proceedings of the 28th Symposium on Fault-Tolerant Computing, pages 422–431, Munich, Germany, June 1998. IEEE.

[203] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active messages: a mechanism for integrated communication and computation. In Proceedings of the 19th ACM International Symposium on Computer Architecture, pages 256–266, Queensland, Australia, May 1992.

[204] John von Neumann. Probabilistic logics and synthesis of reliable organisms from unreliable components. In Claude Shannon and John McCarthy, editors, Automata Studies, pages 43– 98. Princeton University Press, Princeton, New Jersey, 1956.

[205] Robert S. Wahbe, Steven E. Lucco, Thomas E. Anderson, and Susan L. Graham. Efficient software-based fault isolation. In Proceedings of the 14th ACM Symposium on Operating Systems Principles, pages 203–216, Asheville, North Carolina, December 1993.

[206] David A. Wheeler. More than a gigabuck: Estimating GNU/Linux's size, July 2002. Available at http://www.dwheeler.com/sloc/redhat71-v1/redhat71sloc.html.

[207] Andrew Whitaker, Marianne Shaw, and Steven D. Gribble. Denali: Lightweight virtual machines for distributed and networked applications. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation, pages 195–209, Boston, Massachusetts, December 2002.

[208] Toh Ne Win and Michael Ernst. Verifying distributed algorithms via dynamic analysis and theorem proving. Technical Report MIT-LCS-TR-841, MIT Lab for Computer Science, May 2002.

[209] Emmett Witchel, Josh Cates, and Krste Asanović. Mondrian memory protection. In Proceedings of the Tenth ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 304–316, October 2002.

[210] William A. Wulf. Reliable hardware-software architecture. IEEE Transactions on Software Engineering, 1(2):233–240, June 1975.

[211] Junfeng Yang, Paul Twohey, Dawson Engler, and Madanlal Musuvathi. Using model checking to find serious file systems errors. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation, pages 273–288, San Francisco, California, December 2004.

[212] Michael Young, Mike Accetta, Robert Baron, William Bolosky, David Golub, and Avadis Tevanian. Mach: A new kernel foundation for UNIX development. In Proceedings of the 1986 Summer USENIX Conference, pages 93–113, Atlanta, Georgia, June 1986.

VITA

Michael Swift was born and raised in Amherst, Massachusetts. After graduating from Cornell University in 1992, he moved to the Northwest and started working for Microsoft. There he implemented security features in Windows NT 4.0 and Windows 2000, primarily access control and authentication code. However, his lifelong dream was to become a computer science professor. The expiration of his G.R.E. scores in 1996 provided the final kick in the pants that got him out of Microsoft. Michael applied to graduate school that year and started at the University of Washington two years later. He earned a Doctor of Philosophy in Computer Science from the University of Washington in 2005.