Building a Reliable Operating System

© 2008 Francis Manoj David

BUILDING A RELIABLE OPERATING SYSTEM

BY
FRANCIS MANOJ DAVID
B.Tech., Indian Institute of Technology Madras, 2001

DISSERTATION
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2008

Urbana, Illinois

Doctoral Committee:
Professor Roy H. Campbell, Chair
Professor Ravishankar K. Iyer
Associate Professor Ralph E. Johnson
Assistant Professor Samuel T. King
Ray Essick, Motorola

Abstract

Despite many decades of research, the management of errors in a live operating system remains a challenging problem. This thesis presents CuriOS, an operating system that incorporates several new error management techniques that significantly improve reliability. Errors detected by both hardware and software are signaled using language exception handling mechanisms. Unhandled exceptions do not crash the operating system and are dispatched to recovery routines.

The architecture of CuriOS is influenced by microkernel design principles. Individual operating system services are assigned separate protection domains. This componentization provided by traditional microkernel designs helps confine errors. However, an error that occurs in a microkernel operating system service can potentially result in state corruption and service failure. A simple restart of the failed service is not always the best solution for reliability. Blindly restarting a service which maintains client-related state such as session information results in the loss of this state and affects all clients that were using the service. CuriOS adopts a novel design that uses lightweight distribution, isolation and persistence of client-related state information maintained by operating system services. This helps mitigate the problem of state loss during a restart. This design also achieves inter-client isolation by curtailing error propagation within services.
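The state-handling idea summarized above can be illustrated with a minimal C++ sketch. It is not CuriOS code: the names ClientState, CounterServer and Dispatcher are hypothetical, a plain std::map stands in for isolated per-client state regions, an ordinary C++ exception stands in for a detected error, and retrying the request once after the restart is an assumption of this sketch. It only shows that when client-related state is kept outside the server, replacing a failed server instance need not lose that state.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical per-client state, stored outside the server so that it
// survives a server restart.
struct ClientState {
    int requestsServed = 0;
};

// Hypothetical service that keeps no client-related state of its own.
class CounterServer {
public:
    // Serves one request using only the state passed in by the caller.
    // injectFault simulates an error detected inside the server.
    int handle(ClientState& state, bool injectFault) {
        if (injectFault) {
            throw std::runtime_error("simulated error inside the server");
        }
        return ++state.requestsServed;
    }
};

// Hypothetical stand-in for the layer that owns client state and
// restarts a failed server.
class Dispatcher {
public:
    int call(const std::string& client, bool injectFault = false) {
        ClientState& state = regions_[client];  // one state region per client
        try {
            return server_->handle(state, injectFault);
        } catch (const std::exception&) {
            // Recovery routine: discard the failed server, create a fresh
            // one, and retry once (an assumption made for this sketch).
            ++restarts_;
            server_ = std::make_unique<CounterServer>();
            return server_->handle(state, false);
        }
    }

    int restarts() const { return restarts_; }

private:
    std::map<std::string, ClientState> regions_;
    std::unique_ptr<CounterServer> server_ = std::make_unique<CounterServer>();
    int restarts_ = 0;
};
```

In this sketch, a caller that issues `call("a")` twice and then `call("a", true)` still sees the count for client "a" continue from its previous value after the restart, and the state of any other client is untouched by the failure.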
Fault injection experiments show that it is possible to recover from 87% or more manifested errors in operating system services such as the file system, timer, scheduler and network while maintaining low performance overheads.

To Hina

Acknowledgments

I thank my advisor, Prof. Roy H. Campbell, for his enduring optimism and encouragement during my years at graduate school. It was a wonderful experience learning and working with him. Credit is also due to him for reading and helping revise several other publications in addition to this thesis. The other members of my PhD committee: Prof. Ravishankar Iyer, Prof. Ralph Johnson, Prof. Sam King and Dr. Ray Essick were also extremely helpful with their feedback and guidance.

I owe Jeffrey Carlyle and Ellick Chan a lot of gratitude for working with me through the years in an effective research team. We spent many late nights in the Siebel Center fixing our code, running experiments and writing papers. They are awesome people to work with and are very good friends. Thanks are also due to everyone in the Systems Software Research Group for their support, feedback and encouragement during my research.

My research team was supported by many partners in industry. I am very grateful to DoCoMo Labs USA, Motorola and Texas Instruments for providing us with equipment and funding for our research.

I thank the many staff members in the Department of Computer Science who ensured my smooth progress through the PhD program. Anda Ohlsson and Barb Cicone helped me solve many non-technical issues over the years. Mary Beth Kelly and Lynette Lubben helped sort through all the paperwork required to submit this thesis on time.

My numerous friends in Champaign-Urbana made day-to-day life colorful and fun. And finally, I will always be indebted to my family for their continuing support and love.

Table of Contents

List of Tables
List of Figures
List of Abbreviations
Chapter 1 Introduction
Chapter 2 Related Operating Systems
2.1 Minix 3
2.2 L4/Iguana
2.3 Chorus
2.4 EROS
2.5 Others
2.6 Summary
Chapter 3 CuriOS Architecture
3.1 Organization
3.2 Protected Objects
3.3 Thread Model
3.4 Error Management
Chapter 4 Error Signaling
4.1 Creating C++ Exceptions from Processor Exceptions
4.2 Cross Domain Exceptions
4.3 Undispatchable Exceptions
4.4 Size and Performance Impact
Chapter 5 Error Detection
5.1 Invalid Memory Access Errors
5.2 Memory Corruption Errors
5.3 Lockup Errors
Chapter 6 Restart-Based Error Recovery
6.1 Component Restarts
6.2 Server State Management
6.3 Operating System Service Construction
6.4 Intra-Component Error Propagation
6.5 Recoverable Errors
Chapter 7 Restartable Components
7.1 Stateless Components
7.2 Stateful Components
Chapter 8 Evaluation
8.1 Error Recovery
8.2 Performance
8.3 Memory Overheads
8.4 Refactoring Effort
Chapter 9 Fault-Tolerance Patterns
9.1 Architectural Patterns
9.2 Error Detection Patterns
9.3 Error Recovery Patterns
Chapter 10 Additional Dependability Benefits
10.1 Security
10.2 Maintainability
Chapter 11 Related Work
11.1 Fault-Tolerance
11.2 Hardware Protected Objects
11.3 Protection Domains
Chapter 12 Future Work
12.1 Improved Error Detection
12.2 Improved Error Recovery
12.3 Parallel Computing
12.4 Other Hardware Architectures
12.5 Other Operating Systems
Chapter 13 Conclusions
Appendix A The Choices Operating System
Appendix B The ARM Processor Architecture
References
Author’s Biography

List of Tables

2.1 Recoverability of microkernel operating systems from memory access errors
4.1 Comparing SJLJ and table-driven implementations of exceptions
4.2 Section sizes for different exception handling implementations
5.1 Effectiveness of lockup detectors
8.1 Protected method call performance
B.1 ARM processor modes
B.2 ARM registers
B.3 ARM interrupt vectors and handling modes

List of Figures

1.1 State distribution
3.1 CuriOS organization
3.2 Protected method call
3.3 Pseudo code for the protected object implementation
3.4 CuriOS threads
4.1 Terminology
4.2 Interrupt management interfaces in CuriOS
4.3 Code for the function that throws the exception
4.4 Processor exception control flow
4.5 Processor exception classes
4.6 Handling of undispatchable exceptions
4.7 Size comparison of CuriOS using different exception handling mechanisms
5.1 Creating an exception from a watchdog bite
5.2 CuriOS hard lockup recovery comparison
6.1 Pseudo code for the protected object restart implementation
6.2 Request processing
6.3 Pseudo code for the SSR mapping and unmapping implementation
6.4 Error propagation between SSRs
8.1 Error recovery after fault injection

List of Abbreviations

DMA Direct Memory Access
ECC Error Correcting Code
FIFO First In, First Out
HTTP Hyper Text Transfer Protocol
I/O Input/Output
IP Internet Protocol
KB Kilobyte
MAC Media Access Control
MB Megabyte
MMU Memory Management Unit
NFS Network File System
NMI Non Maskable Interrupt
OS Operating System
QoS Quality of Service
SSR Server State Region
TCP Transmission Control Protocol
TLB Translation Lookaside Buffer
UDP User Datagram Protocol

Chapter 1 Introduction

Operating system reliability remains a challenging problem today [1] despite several decades of research [2, 3, 4, 5]. Errors caused by hardware and software faults are a major factor affecting operating system reliability. Hardware faults can arise due to various factors, some of which are aging, temperature, firmware faults, and radiation-induced bit-flips in memory and registers (Single Event Upsets [6]). Software faults (bugs) due to incorrect code are also very common in large and complex operating systems [7]. Errors in a monolithic operating system can easily propagate and corrupt other parts of the system [8, 9], making recovery extremely difficult. Microkernel designs componentize the operating system into servers managed by a minimal kernel. These
