Operating System Support for Redundant Multithreading

Dissertation for the attainment of the academic degree Doktoringenieur (Dr.-Ing.)

Submitted to Technische Universität Dresden, Faculty of Computer Science

Submitted by Dipl.-Inf. Björn Döbel, born on 17 December 1980 in Lauchhammer

Supervising professor: Prof. Dr. Hermann Härtig, Technische Universität Dresden

Reviewer: Prof. Frank Mueller, Ph.D., North Carolina State University

Internal reviewer: Prof. Dr. Christof Fetzer, Technische Universität Dresden

Status presentation: 29 February 2012 / Submitted: 21 August 2014 / Defended: 25 November 2014

FOR JAKOB, *† 15 February 2013

Contents

1 Introduction
  1.1 Hardware meets Soft Errors
  1.2 An Operating System for Tolerating Soft Errors
  1.3 Whom can you Rely on?

2 Why Do Transistors Fail And What Can Be Done About It?
  2.1 Hardware Faults at the Transistor Level
  2.2 Faults, Errors, and Failures – A Taxonomy
  2.3 Manifestation of Hardware Faults
  2.4 Existing Approaches to Tolerating Faults
  2.5 Thesis Goals and Design Decisions

3 Redundant Multithreading as an Operating System Service
  3.1 Architectural Overview
  3.2 Process Replication
  3.3 Tracking Externalization Events
  3.4 Handling Replica System Calls
  3.5 Managing Replica Memory
  3.6 Managing Memory Shared with External Applications
  3.7 Hardware-Induced Non-Determinism
  3.8 Error Detection and Recovery

4 Can We Put the Concurrency Back Into Redundant Multithreading?
  4.1 What is the Problem with Multithreaded Replication?
  4.2 Can we make Multithreading Deterministic?
  4.3 Replication Using Lock-Based Determinism
  4.4 Reliability Implications of Multithreaded Replication

5 Evaluation
  5.1 Methodology
  5.2 Error Coverage and Detection Latency
  5.3 Runtime and Resource Overhead
  5.4 Implementation Complexity
  5.5 Comparison with Related Work

6 Who Watches the Watchmen?
  6.1 The Reliable Computing Base
  6.2 Case Study #1: How Vulnerable is the Operating System?
  6.3 Case Study #2: Mixed-Reliability Hardware Platforms
  6.4 Case Study #3: Compiler-Assisted RCB Protection

7 Conclusions and Future Work
  7.1 OS-Assisted Replication
  7.2 Directions for Future Research

8 Bibliography

1 Introduction

Computer systems fail every day, destroying personal data, causing economic loss, and even threatening users' lives. The reasons for such failures are manifold, ranging from environmental hazards (e.g., a fire breaking out in a data center) to programming errors (e.g., invalid pointer dereferences in unsafe programming languages), as well as defective hardware components (e.g., a hard disk returning erroneous data when reading). A large fraction of these failures results from programming mistakes.[1] This observation triggered a large body of research related to improving software quality, ranging from static code analysis[2] to formally verified operating system kernels.[3] However, even if the combined solutions at some point lead to a situation where software is fault-free, their assumptions only hold if we can trust the hardware to function correctly.

From a hardware development perspective, Moore's Law[4] indicates a doubling of transistor counts in modern microprocessors roughly every two years. While the law started off as a prediction, it is today used by hardware vendors as a means to establish product release cycles, research, and development goals. This has turned Moore's Law into a self-fulfilling prophecy. Increasing the number of transistors makes it possible to integrate more and more components into a single chip. This permits the addition of more processors, larger caches, as well as specialized functional units, such as on-chip graphics processors.

While transistor counts increase, the available chip area does not. Emerging technologies, such as three-dimensional stacking of transistors, try to address this problem.[5] However, the standard way of solving this problem in practical systems is to decrease the size of individual transistors by applying more fine-grained production processes.

Unfortunately, hardware components are exposed to energetic stress caused by radiation, thermal effects, energy fluctuations, and mechanical force. As I will outline in Chapter 2, hardware vendors spend significant effort on keeping the failure probability of their components below certain thresholds. However, this effort is becoming increasingly difficult as hardware structure sizes shrink, because smaller transistors are more vulnerable to the effects previously mentioned.[6]

This increased vulnerability accounts for an increased amount of intermittent (or soft) hardware faults.[7] In contrast to permanent faults, which constitute a constant malfunction of a component, intermittent faults occur and vanish seemingly randomly during execution time. This randomness stems from the physical effects mentioned above.[8] Soft errors are often transient, which means they are only visible for a limited period of time before the affected component returns to a correct state. As an example, consider a bit in memory that flips due to a cosmic ray strike: reading this bit will deliver an erroneous value until the containing memory word is later overwritten with a new datum.

Fault tolerance mechanisms to deal with hardware errors have been devised at both the hardware and software levels. While lower-level hardware components are suitable for detecting and correcting a large number of these errors, they do not possess the necessary system knowledge to do so efficiently. For instance, IBM's S/390 processors provide redundancy in the form of lockstepping.[9] However, this effort may not be necessary for all applications, because some software can protect itself, for instance using resilient programming techniques.[10] In such cases, the system could better use the additional hardware for computing purposes.

Existing software-level solutions often make assumptions about how developers write software. For instance, they may require the use of specific programming models, such as process pairs.[11] Other solutions apply specific compiler techniques for fault tolerance.[12] These software techniques trade general applicability for fault tolerance.

The operating system bridges the gap between hardware and software. In this thesis I therefore evaluate whether we can strike a balance between flexibility and generality by implementing fault tolerance as an operating system service.

[1] Steve McConnell. Code Complete: A Practical Handbook of Software Construction. Microsoft Press, Redmond, WA, 2nd edition, 2004.
[2] Dawson Engler, David Yu Chen, et al. Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code. In Symposium on Operating Systems Principles, SOSP'01, pages 57–72, Banff, Alberta, Canada, 2001. ACM.
[3] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. seL4: Formal Verification of an OS Kernel. In Symposium on Operating Systems Principles, SOSP'09, pages 207–220, Big Sky, MT, USA, October 2009. ACM.
[4] Gordon E. Moore. Cramming More Components Onto Integrated Circuits. Electronics, 38(8), 1965.
[5] Dae Hyun Kim et al. 3D-MAPS: 3D Massively Parallel Processor With Stacked Memory. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, pages 188–190, 2012.
[6] Shekhar Borkar. Designing Reliable Systems From Unreliable Components: The Challenges of Transistor Variability and Degradation. IEEE Micro, 25(6):10–16, 2005.
[7] Jörg Henkel, Lars Bauer, Nikil Dutt, Puneet Gupta, Sani Nassif, Muhammad Shafique, Mehdi Tahoori, and Norbert Wehn. Reliable On-chip Systems in the Nano-Era: Lessons Learnt and Future Trends. In Annual Design Automation Conference, DAC '13, pages 99:1–99:10, Austin, Texas, 2013. ACM.
[8] James F. Ziegler and William A. Lanford. Effect of Cosmic Rays on Computer Memories. Science, 206(4420):776–788, 1979.

1.1 Hardware meets Soft Errors

Sun acknowledged soft errors to be the cause of server crashes in 2000,[13] reportedly costing the company millions of dollars to replace the affected components. Anecdotal evidence of soft errors affecting today's systems can be found across the web.[14] In Section 2.1 I will give a more thorough overview of scientific studies investigating causes and consequences of such hardware errors.

Not all consequences of a soft error may become immediately visible as a fault. As I will describe in Chapter 2, these errors can also lead to erratic failures: the affected system continues to provide a service and simply generates wrong results. Such behavior also opens up new windows of security vulnerability. Dinaburg presented such a vulnerability as an experiment at BlackHat 2011: the author registered 30 domain names that were one bit off from popular domains, but that were unlikely to be plain typing errors made by users. Examples of such domains were mic2osoft.com (as opposed to microsoft.com) and ikamai.net (as opposed to akamai.net). Within a time frame of roughly six months, the author observed more than 50,000 accesses from about 12,000 unique clients to these domains.[15] This experiment shows that a) soft errors today are a real issue for consumer electronic devices, and b) new attack vectors may arise if these errors are not taken care of.

[13] Daniel Lyons. Sun Screen. Forbes Magazine, November 2000, accessed on April 22nd 2013, mirror: http://tudos.org/~doebel/phd/forbes2000sun
[14] Nelson Elhage. Attack of the Cosmic Rays! KSplice Blog, 2010, https://blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_rays1, accessed on April 22nd 2013.
[15] Artem Dinaburg. Bitsquatting: DNS Hijacking Without Exploitation. BlackHat Conference, 2011.
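To see that these domain names really are a single bit flip away from the originals, one can compare the ASCII codes of the differing characters. The following small C++ fragment, written for this text and not part of the thesis, verifies both examples at compile time:

    // A single flipped bit turns 'r' (0x72) into '2' (0x32) and 'a' (0x61)
    // into 'i' (0x69), i.e. "microsoft.com" into "mic2osoft.com" and
    // "akamai.net" into "ikamai.net".
    constexpr bool one_bit_apart(char x, char y)
    {
        const unsigned d = static_cast<unsigned char>(x) ^ static_cast<unsigned char>(y);
        return d != 0 && (d & (d - 1)) == 0;  // exactly one differing bit
    }

    static_assert(one_bit_apart('r', '2'), "'r' and '2' differ in exactly one bit");
    static_assert(one_bit_apart('a', 'i'), "'a' and 'i' differ in exactly one bit");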

So how do system and hardware vendors deal with soft errors? I will show in Section 2.4 that approaches to detect and correct soft errors exist at both the hardware and software levels. Hardware-level solutions are usually expensive in terms of production cost, chip area, resource consumption, or runtime overhead. Hence, radiation-hardened processors are only used in highly specific environments: NASA's Mars rover "Curiosity", for instance, employs a radiation-hardened processor that is rumored to cost about 200,000 US$ apiece.[16]

In contrast, the majority of computing systems, from servers to personal computers to mobile phones, are built from commercial off-the-shelf (COTS) hardware components. These components are designed to provide good overall performance at the minimum possible cost. Customized hardware is expensive unless there is a large market, and therefore COTS systems are the last to see widespread deployment of hardware defenses against soft errors. This in turn calls for the use of software-level fault tolerance methods.

Software-implemented hardware fault tolerance can be categorized into two classes: compiler-level techniques aim to generate resilient and self-validating machine code,[17] whereas replication-based solutions run multiple instances of a program and compare their outputs.[18]

CLAIM: Observing the need for software-level fault tolerance, I develop ASTEROID, an operating system (OS) design that protects applications against both permanent and transient hardware faults on COTS hardware platforms. To achieve fault tolerance I implement replication as an operating system service. The system uses modern multi-core CPUs to parallelize replicated execution and thereby reduce runtime overheads. The architecture detects errors by comparing replicas' states at synchronization points. Once detected, the system corrects errors by performing majority voting or by incorporating alternative recovery strategies, such as application-level checkpointing.

[16] John Rhea. BAE Systems Moves Into Third Generation RAD-hard Processors. Military & Aerospace Electronics, 2002, accessed on April 22nd 2013, mirror: http://tudos.org/~doebel/phd/bae2002/
[17] George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, and David I. August. SWIFT: Software Implemented Fault Tolerance. In International Symposium on Code Generation and Optimization, CGO '05, pages 243–254, 2005.
[18] A. Shye, J. Blomstedt, T. Moseley, V.J. Reddi, and D.A. Connors. PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures. IEEE Transactions on Dependable and Secure Computing, 6(2):135–148, 2009.
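To illustrate the error correction step mentioned in the claim above, the following C++ sketch compares the states of three replicas at a synchronization point and selects a majority value. It is only a schematic illustration written for this text: the ReplicaState type, the use of a checksum, and the function name are assumptions, not ROMAIN's actual interfaces.

    #include <array>
    #include <cstdint>
    #include <optional>

    // Hypothetical digest of a replica's externally visible state at a
    // synchronization point (e.g., registers and system call arguments).
    struct ReplicaState {
        std::uint64_t checksum;
        bool operator==(const ReplicaState& other) const {
            return checksum == other.checksum;
        }
    };

    // Triple-modular-redundancy style vote: return the state shared by at
    // least two of three replicas, or nothing if all three disagree, in
    // which case another recovery strategy (e.g., checkpoint rollback)
    // would be required.
    std::optional<ReplicaState> majority_vote(const std::array<ReplicaState, 3>& r)
    {
        if (r[0] == r[1] || r[0] == r[2]) return r[0];
        if (r[1] == r[2])                 return r[1];
        return std::nullopt;  // no majority: uncorrectable divergence
    }

With only two replicas the same comparison can detect a divergence but not decide which replica is correct; a third replica is what makes majority-based correction possible.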

1.2 An Operating System for Tolerating Soft Errors

Practical systems today consist of a complex web of interacting software components, which a fault-tolerant operating system needs to accommodate. Some of these components are small enough to be rewritten for fault tolerance. However, the majority of existing software is too large and too complex to be rewritten from scratch. Hence, my solution needs to provide fault tolerance to real-world applications that are not optimized for dealing with hardware faults.

If those applications are available as open-source software, compiler-level transformations can improve fault tolerance without requiring expensive hardware extensions. Fault-tolerant compilers can generate machine code that performs operations multiple times using different hardware resources.[17] Other tools extend the data domain a program operates on and transform operands using arithmetic encoding. This mechanism makes it possible to detect whether a value can be a valid result of a preceding arithmetic operation.[19]

[19] Ute Schiffel, André Schmitt, Martin Süßkraut, and Christof Fetzer. ANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software. In International Conference on Computer Safety, Reliability and Security, Safecomp'10, Vienna, Austria, 2010.
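To make the idea of arithmetic encoding more concrete, the following C++ sketch shows a plain AN code, a simpler relative of the ANB/ANBDmem encodings cited above: every value is multiplied by a constant A, computations are carried out on encoded values, and a result that is not divisible by A exposes a corrupted operand or operation. The constant and the function names are chosen for this illustration only.

    #include <cstdint>
    #include <stdexcept>

    // Illustrative AN code: encode(v) = A * v. Sums of encoded values remain
    // multiples of A, so a bit flip in an operand or in the addition is
    // likely to leave a non-zero residue that the check detects.
    constexpr std::uint64_t A = 58659;  // check constant, arbitrary for this sketch

    constexpr std::uint64_t encode(std::uint32_t v)
    {
        return A * static_cast<std::uint64_t>(v);
    }

    constexpr std::uint64_t add_encoded(std::uint64_t x, std::uint64_t y)
    {
        return x + y;  // addition works directly on encoded values
    }

    std::uint32_t decode_checked(std::uint64_t x)
    {
        if (x % A != 0)  // residue check: a valid encoded value is a multiple of A
            throw std::runtime_error("AN-code check failed: corrupted value");
        return static_cast<std::uint32_t>(x / A);
    }

Note that the check is probabilistic: a corruption that happens to produce another multiple of A goes undetected, which is why the cited encodings add further signatures on top of this basic scheme.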

Unfortunately, there is a substantial amount of software that we cannot protect using the mechanisms described above. These programs are provided in binary form only. As this form of distribution is the business model of major companies such as Microsoft, Apple, and Oracle, we can safely assume it to be the major way of supplying proprietary software to customers. The rising sales numbers of software distributors for mobile platforms, such as Apple's AppStore or Google Play, indicate that this is likely to remain the case.

If this problem only affected end users, we might avoid it by trusting software developers to use the proper tools to protect their applications. However, the end user cannot validate that such tools were used when downloading software through the internet. Furthermore, commercial libraries (for instance Intel's Pin instrumentation library[20]) are often shipped as binaries only, in which case a developer cannot apply compiler-based protection anymore. Lastly, existing software tends to be used for a long time and it might not be possible to replace such legacy software with newer versions immediately. For these three reasons I aim to find mechanisms that provide fault tolerance to binary-only applications.

[20] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building Customized Program Analysis Tools With Dynamic Instrumentation. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '05, pages 190–200, New York, NY, USA, 2005. ACM.

CLAIM: The operating system is responsible for merging the different requirements of all applications and providing them with a fault-tolerant execution environment. The ASTEROID OS architecture introduced in this thesis and depicted in Figure 1.1 does not enforce a specific model of protection. Instead, different types of applications are supported:

• Binary applications built without specific fault tolerance mechanisms are transparently replicated in order to protect them against soft errors.

• Applications that are protected using fault-tolerant compiler techniques or algorithms can be run unmodified without incurring replication-related runtime or resource overheads.

• Applications integrate with each other regardless of the fault tolerance model that is chosen to protect them. Unreplicated applications can interact with replicated applications without having to be aware of this fact, as depicted in Figure 1.1.

[Figure 1.1: ASTEROID Resilient OS Architecture; unreplicated and replicated applications running side by side on top of the replication service.]

ASTEROID is based on the FIASCO.OC microkernel. Microkernels provide strong isolation between small sets of software components. This design principle restricts the effects of hardware errors to single software components[21] and allows for fast recovery once an error is detected.[22]

In the field of distributed systems, state-machine replication has long been used to achieve fault tolerance.[23] Potentially faulty server nodes are considered black boxes that implement state machines delivering identical outputs when presented with the same sequence of inputs. This property makes it possible to instantiate multiple such server nodes, let them process inputs, and detect a failing node either because it delivers an output different from the majority of nodes or because it delivers no output at all.

Redundant Multithreading (RMT) is a technique similar to state-machine replication. In this case the black box is a thread, either at the hardware[24] or software level.[25] Multiple copies (replicas) of a thread are instantiated and execute identical code. The RMT system may run those threads independently on different hardware resources as long as they only process internal state.

[21] Jorrit N. Herder. Building a Dependable Operating System: Fault Tolerance in MINIX3. Dissertation, Vrije Universiteit Amsterdam, 2010.
[22] George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox. Microreboot: A Technique For Cheap Recovery. In Symposium on Operating Systems Design & Implementation, OSDI'04, Berkeley, CA, USA, 2004. USENIX Association.
[23] Fred B. Schneider. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys, 22(4):299–319, December 1990.
[24] Steven K. Reinhardt and Shubhendu S. Mukherjee. Transient Fault Detection via Simultaneous Multithreading. SIGARCH Computer Architecture News, 28:25–36, May 2000.
[25] Cheng Wang, Ho-seop Kim, Youfeng Wu, and Victor Ying. Compiler-managed Software-based Redundant Multithreading for Transient Fault Detection. In International Symposium on Code Generation and Optimization, CGO '07, pages 244–258, 2007.

Once a thread's state is made externally visible (e.g., written to memory or used as a parameter to a system call), the replicas' states are compared to detect potential errors.

Replication-based fault tolerance allows us to treat applications as black boxes, not requiring any knowledge about program internals. By distributing replicas across the available compute cores and minimizing the required state comparisons, replication can achieve low runtime overheads. These two properties make replication an attractive fault tolerance technique.

The main disadvantage of replication-based fault tolerance is the increase in required resources (e.g., N replicas will roughly consume N times the amount of CPU time and N times the amount of memory compared to a single application instance). However, modern servers and even laptop computers provide users with an abundance of processing elements and memory, which are underutilized in the common case.[26] Therefore, replication becomes a feasible alternative if the user is willing to trade resources for low runtime overhead and fast error recovery times.

[26] James Glanz. Power, Pollution and the Internet. The New York Times, September 2012, accessed on July 1st 2013, mirror: http://os.inf.tu-dresden.de/~doebel/phd/nyt2012util/article.html

[Figure 1.2: Replicated application; three replicas running on CPUs 0–2, coordinated by the ROMAIN master with its memory manager and system call proxy.]

CLAIM: The main contribution of my thesis is ROMAIN, an operating system service that uses redundant multithreading to protect unmodified binary applications from hardware errors. The service's structure is depicted in Figure 1.2 and solves the following problems:

1. Instead of implementing expensive binary recompilation techniques, ROMAIN reuses existing features provided by the FIASCO.OC microkernel to implement redundant multithreading. A master process manages replication for a single program. The master maps replicas to OS threads and runs them in isolated address spaces to prevent undetected fault propagation. To obtain low runtime overheads, replicas are distributed across the available physical CPU cores. ROMAIN's general architecture is introduced in Section 3.2.

2. ROMAIN transparently manages replicas’ resources, such as memory and kernel objects. Applications do not need to be aware of the replication framework and can be implemented using any programming language and development model. Sections 3.4 and 3.5 provide details on replica resource management.

3. Applications do not run in isolation, but interact with the rest of the system through system calls and shared-memory communication channels. ROMAIN allows replicated applications to use both mechanisms. Shared memory requires special handling, because such channels may constitute potential input sources influencing replicated program execution. Therefore, shared-memory access must not happen without involving the replication service. I will describe system call handling in Section 3.3 and discuss problems related to shared memory in Section 3.6.

4. Multithreaded applications cannot easily be replicated, because scheduling-induced non-determinism may lead to differing behavior between replicas. The ROMAIN master would detect this behavioral divergence. In the best case this merely costs unnecessary time for correcting such a false-positive error. In the worst case, however, all replicas of a program may have diverged in a way that no longer allows for error correction at all. ROMAIN implements two ways of enforcing deterministic behavior across multithreaded replicas, which I explain in Chapter 4; a simplified sketch of the underlying idea follows below.
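The following C++ sketch illustrates one way to think about lock-based determinism: a follower replica grants a lock to its threads in exactly the order that was recorded for a leading replica, so all replicas observe the same interleaving. This is a schematic example written for this text; the class and its interface are made up and do not reflect ROMAIN's actual implementation, which Chapter 4 describes.

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <vector>

    // Replay of a recorded lock-acquisition order: `recorded_order` holds the
    // sequence of logical thread IDs that acquired this lock in the leading
    // replica. A follower replica hands out the lock in exactly that order,
    // removing scheduling-induced non-determinism for this lock.
    class ReplayLock {
    public:
        explicit ReplayLock(std::vector<int> recorded_order)
            : order_(std::move(recorded_order)) {}

        void lock(int thread_id) {
            std::unique_lock<std::mutex> guard(m_);
            cv_.wait(guard, [&] {
                return next_ < order_.size() && order_[next_] == thread_id;
            });
            // It is now this thread's turn; the lock is held until unlock().
        }

        void unlock() {
            {
                std::lock_guard<std::mutex> guard(m_);
                ++next_;            // advance to the next recorded owner
            }
            cv_.notify_all();
        }

    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::vector<int> order_;    // recorded sequence of lock owners
        std::size_t next_ = 0;      // current position in that sequence
    };

In a real system the acquisition order has to be produced somewhere, for example by the first replica to run or by a deterministic scheduling policy, and distributed to the other replicas; those design choices and their cost are exactly what Chapter 4 is about.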

1.3 Whom can you Rely on?

The ASTEROID operating system architecture allows user-level applications to detect and recover from soft errors. However, ASTEROID relies on a subset of hardware and software components to always function correctly. This set comprises the ROMAIN replication service and the underlying OS kernel. Other software-based fault tolerance methods share this problem, although the concrete set of required components varies: some methods rely on a fully functioning Linux kernel.[27] Others additionally rely on the correct operation of system libraries, such as the thread library.[28] I refer to such sets of required components as the Reliable Computing Base (RCB).

As ROMAIN does not protect the RCB, ASTEROID needs to employ alternative mechanisms to make the RCB reliable. These additional mechanisms come at an additional cost, and the type of cost depends on which mechanism is applied to protect the RCB: using fault-tolerant algorithms to implement the RCB will require additional development effort. Protecting the RCB using compiler-based fault tolerance may increase its runtime overhead. Integrating specially hardened, non-COTS hardware components into our system will increase hardware cost.

[27] A. Shye, J. Blomstedt, T. Moseley, V.J. Reddi, and D.A. Connors. PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures. IEEE Transactions on Dependable and Secure Computing, 6(2):135–148, 2009.
[28] Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August. DAFT: Decoupled Acyclic Fault Tolerance. In International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pages 87–98, Vienna, Austria, 2010. ACM.

CLAIM: I introduce the concept of the Reliable Computing Base (RCB) in Chapter 6 and identify the hardware and software components that are part of ASTEROID's RCB. I show that other software-implemented fault tolerance methods also possess an RCB and that the OS kernel is part of this RCB in most cases. Based on this analysis I present three studies that analyze how RCB components can be protected against the effects of hardware errors:

1. As the OS kernel constitutes a major part of any RCB, I use fault injection experiments to analyze the FIASCO.OC kernel's vulnerability to hardware faults. Based on these findings I discuss potential paths towards protecting FIASCO.OC in future work.

2. Current COTS hardware is becoming more and more heterogeneous by incorporating different types of compute nodes that vary with respect to their processing capabilities and energy requirements. Other researchers have suggested that this may lead to the advent of manycore processors with mixed reliability properties.[29]

Assuming that such hardware will become COTS at some point in time, we can map RCB software to those components with low vulnerability – such as hardware-protected CPU cores – while running ROMAIN's replicas on fast, cheap, but more vulnerable processing elements. Such an architecture will require fewer resilient than non-resilient CPUs. As non-resilient CPUs occupy less chip area, this design allows for the integration of more processing elements on a single chip while still allowing fault-tolerant execution. I show that the ASTEROID architecture can be implemented efficiently on top of such hardware.

3. ROMAIN's goal is to protect unmodified binary-only applications. However, we still have full control over the source code of all RCB components. Hence, applying compiler-based fault tolerance may be a feasible way to protect them. While this will increase ASTEROID's error coverage, it will also lead to increased runtime overhead. I approximate this overhead using simulation experiments and show that a hybrid approach, which protects RCB components using compiler methods while replicating user programs using ROMAIN, is a promising path towards a fully protected software stack.

[29] L. Leem, Hyungmin Cho, J. Bau, Q.A. Jacobson, and S. Mitra. ERSA: Error Resilient System Architecture for Probabilistic Applications. In Design, Automation & Test in Europe Conference & Exhibition, DATE'10, pages 1560–1565, 2010.

2 Why Do Transistors Fail And What Can Be Done About It?

In this chapter I present the fault model I address in my thesis. I first give an overview of how cosmic radiation, aging, and thermal effects can lead to the failure of hardware components. Thereafter I introduce a taxonomy of fault tolerance that I am going to use throughout my dissertation. In the final part of this chapter I review existing fault tolerance techniques to motivate the assumptions and design goals driving my operating system design.

2.1 Hardware Faults at the Transistor Level

The integrated circuits that form today's hardware components are built from metal-oxide-semiconductor field-effect transistors (MOSFETs).[1] These transistors can suffer from a range of hardware faults that may cause them to fail. In this section I give an overview of what makes transistors error-prone. Unless stated otherwise, I base this summary on Mukherjee's book on fault-tolerant hardware design.[2]

Figure 2.1 shows a MOSFET model. Two semiconductors, source and drain, are separated by a bulk substrate. During production, source and drain are doped to create an n-type semiconductor. In contrast, the bulk substrate is deprived of electrons and thereby forms a p-type semiconductor. The boundaries between these regions act as diodes and prevent current from flowing from source to drain.

[Figure 2.1: Model of a metal-oxide-semiconductor field-effect transistor (MOSFET): gate electrode and oxide layer on top of source, drain, and bulk substrate.]

[1] Dawon Kahng. Electric Field Controlled Semiconductor Device. US Patent No. 3,102,230, http://www.freepatentsonline.com/3102230.html, 1963.
[2] Shubhendu Mukherjee. Architecture Design for Soft Errors. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.

A gate electrode sits on top of the transistor. A non-conducting oxide layer (usually silicon dioxide, SiO2) isolates this electrode from the bulk. If we apply a voltage to the gate electrode, an electric field is created. As shown in Figure 2.2, positive electron holes in the p-type substrate are repelled from the gate, whereas negative electrons are pulled towards the gate and accumulate below the oxide layer. If the gate voltage exceeds a certain threshold voltage U_thr, the number of electrons below the oxide layer is sufficient to create a conducting channel between source and drain. The accompanying charge at the gate electrode is termed the critical charge Q_crit.

[Figure 2.2: MOSFET when switched to the conducting state.]

Hardware vendors aim to decrease MOSFET sizes as far as physically possible. Smaller transistors consume less power and allow a larger amount of memory and processing elements to be integrated into the same chip area.

However, there are three groups of effects that make smaller MOSFETs fail more often: (1) variability introduced during the manufacturing process may alter the behavior of identically designed transistors, (2) aging and thermal effects may cause a properly functioning transistor to fail, and (3) radiation may induce errors into processing and storage elements. I will now survey research into these effects in more detail.

2.1.1 Manufacturing Variability

As transistors scale down, the gate oxide layer as well as the transistor's channel size shrink, resulting in a smaller number of atoms within a single transistor. As a result, the critical charge and the threshold voltage required to switch the transistor's state decrease. This eventually leads to reduced energy consumption.[3] While saving energy is an advantage, manufacturing smaller transistors becomes harder.

The process of doping semiconductor material with excess electrons is far from precise. Instead, the dopant atoms are randomly distributed. For smaller transistors, the total number of atoms is small and therefore even tiny random variations between transistors can have a high impact on their electrical properties. As a consequence, transistors may exhibit large variations in terms of threshold voltage and leakage current.[4]

Reid and colleagues simulated the effects of random atom placement on a large number of 35 nm and 13 nm transistors.[5] They found that smaller structure sizes lead to a larger distribution of threshold voltages across these devices. While the majority of transistors still exhibits correct behavior, they showed that at a channel length of 13 nm a significant amount of MOSFETs falls into ranges where the threshold voltage is either very high or close to zero. Both effects prevent the transistor from switching states at all.

Even when turned off, there is a small static current flowing through a transistor.[6] Studies have shown that, depending on manufacturing issues, this leakage current varies greatly across chips, even if these originate from the same wafer.[7] In extreme cases this means that processors exceed their planned power budget. This is not an immediately visible malfunction, but it will render mobile devices unusable after a short amount of time if the CPU drains all battery power.

As the effects described above occur in the manufacturing phase, hardware vendors can detect them during stress testing, which they perform before shipping their products. However, effects similar to manufacturing errors can also arise much later due to chip aging.

[3] Yuan Taur. The Incredible Shrinking Transistor. IEEE Spectrum, 36(7):25–29, 1999.
[4] Miguel Miranda. When Every Atom Counts. IEEE Spectrum, 49(7):32–32, 2012.
[5] Dave Reid, Campbell Millar, Gareth Roy, Scott Roy, and Asen Asenov. Analysis of Threshold Voltage Distribution Due to Random Dopants: A 100,000-Sample 3-D Simulation Study. IEEE Transactions on Electron Devices, 56(10):2255–2263, 2009.
[6] Kaushik Roy, Saibal Mukhopadhyay, and Hamid Mahmoodi-Meimand. Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits. Proceedings of the IEEE, 91(2):305–327, 2003.
[7] Lucas Wanner, Charwak Apte, Rahul Balani, Puneet Gupta, and Mani Srivastava. Hardware Variability-Aware Duty Cycling for Embedded Sensors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21(6):1000–1012, 2013.

2.1.2 Aging and Thermal Effects

At runtime, the transistors forming a semiconductor circuit switch frequently. This switching causes voltage and temperature stress, which leads to a degradation of transistor operation over time. The three main contributors to this degradation are (1) hot-carrier injection, (2) negative-bias temperature instability (NBTI), and (3) electromigration.[8]

Electrons traveling from a MOSFET's source to drain differ with respect to the energy they carry. Highly energetic (hot) carriers sometimes do not pass from source to drain, but instead hit the oxide layer that isolates the gate electrode. Thereby they either show up as leakage current or become trapped inside the gate oxide. The latter process is called hot-carrier injection and results in a shift in the transistor's threshold voltage U_thr.[9]

Switching a transistor frequently increases the temperature under which the device operates. At high temperatures, electrons passing through the interface between bulk and gate oxide may destroy the chemical bonds within the oxide and create positively charged Si+ ions. Thereby they increase the number of p-dopants in the transistor, which in turn means that a larger threshold voltage is required to switch the transistor. This process is called negative-bias temperature instability (NBTI).[10]

A third effect does not impact the transistor itself but rather the metal interconnects between transistors. Over time, high-energy electrons may cause metal atoms to move out of their position. This process, called electromigration,[11] leads to voids within the wire that prevent current from flowing and therefore break the interconnect. The missing material may additionally move to other locations in the interconnect and create short circuits there.

Circuits suffering from hot-carrier injection or electromigration malfunction permanently. Vendors therefore carefully analyze their circuits' aging characteristics so that these aging effects only set in after a given age threshold. In contrast, the effects of NBTI partially revert if the gate voltage is turned off and the temperature decreases.[12] These faults are therefore transient.

[8] John Keane and Chris H. Kim. An Odometer for CPUs. IEEE Spectrum, 48(5):28–33, 2011.
[9] Waisum Wong, Ali Icel, and J.J. Liou. A Model for MOS Failure Prediction due to Hot-Carriers Injection. In Electron Devices Meeting, 1996, IEEE Hong Kong, pages 72–76, 1996.
[10] Muhammad Ashraful Alam, Haldun Kufluoglu, D. Varghese, and S. Mahapatra. A Comprehensive Model for PMOS NBTI Degradation: Recent Progress. Microelectronics Reliability, 47(6):853–862, 2007.
[11] James R. Black. Electromigration – A Brief Survey and Some Recent Results. IEEE Transactions on Electron Devices, 16(4):338–347, 1969.
[12] Literature distinguishes reversible short-term NBTI and irreversible long-term NBTI.

2.1.3 Radiation-Induced Effects

35 years ago, Ziegler and Lanford showed that radiation may also trigger malfunctions in semiconductors.[13] There are two main sources of these radiation effects: first, cosmic rays may penetrate the earth's atmosphere and influence transistors. Second, the materials that are used for packaging circuits carry a certain amount of terrestrial radiation. The radioactive decay of these materials may therefore also influence circuit behavior. Given a fixed hardware structure size, the probability of suffering from a packaging-induced radiation effect is fixed. Cosmic radiation, in contrast, becomes stronger at higher altitudes.[14]

When a single MOSFET is struck by radiation, particles may create electron-hole pairs, which thereafter recombine and thereby create a flowing current. This process may increase the transistor's charge. If the charge then exceeds the critical charge Q_crit, the transistor's state may switch. With smaller transistor sizes, Q_crit decreases and transistors become more vulnerable to environmental radiation. However, at the same time the area of a transistor that may be hit by a ray also decreases, and thereby the probability of an individual transistor being struck by radiation decreases. Initially, overall transistor vulnerability dropped when moving below 135 nm technologies. However, a study by Oracle Labs indicates that transistor vulnerabilities begin to rise again as transistor sizes shrink below 40 nm.[15]

As a single radiation event can force a transistor's state to change, these events are termed single-event upsets (SEUs). In contrast to the previously discussed error types, SEUs are non-permanent. Any future reset of the transistor, e.g., by applying the threshold voltage to the gate electrode, will return the transistor to a valid state.

[13] James F. Ziegler and William A. Lanford. Effect of Cosmic Rays on Computer Memories. Science, 206(4420):776–788, 1979.
[14] James F. Ziegler, Huntington W. Curtis, et al. IBM Experiments in Soft Fails in Computer Electronics (1978–1994). IBM Journal of Research and Development, 40(1):3–18, 1996.
[15] A. Dixit and Alan Wood. The Impact of New Technology on Soft Error Rates. In IEEE International Reliability Physics Symposium, IRPS'11, pages 5B.4.1–5B.4.7, 2011.

Correcting SEUs is therefore easier, because overwriting a faulty memory location will fix a radiation-induced error. For this reason, SEUs are also called soft errors.

SUMMARY: Computer architects understand that semiconductors suffer from manufacturing and aging faults as well as from the influence of radiation. These faults can be permanent or transient. A reliable system needs to cope with both kinds of faults.

2.2 Faults, Errors, and Failures – A Taxonomy

Before I review how hardware- and software-level solutions provide fault tolerance against the types of hardware errors described above, it is necessary to introduce a terminology that we can use to talk about the cause and effect of these errors. In this thesis I am going to use the terminology introduced by Avizienis and colleagues,[16] which I summarize below.

[16] Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, 2004.

The cause, effect, and consequence of component misbehavior form a chain as depicted in Figure 2.3. We call the ultimate cause of misbehavior a fault. In the exemplary case of radiation-induced soft errors, a cosmic ray strike hitting a transistor constitutes a fault.

A malfunction of the affected device only occurs if a fault becomes active and modifies the component's internal state. This is an error. A radiation fault is for instance activated if the particle's charge exceeds the affected transistor's critical charge Q_crit and the transistor is currently not in its conducting state.

[Figure 2.3: The chain of errors: fault, error, failure.]

If a component's state is modified due to an error, the component may provide a service that deviates from the expected service. For instance, a transistor struck by radiation may be part of a memory cell and the radiation fault may cause the memory cell's content to change. If this erroneous content is then read, it may impact further computations. This externally visible deviation is called a failure.

Not every error will eventually lead to a visible component failure. For instance, even if a memory cell's value is modified, this data does not impact system behavior as long as it is not used. If a fault or error does not escalate into a failure, we call it a masked fault or a masked error, respectively.

The terminology so far only considers a single component. Computer systems are complex networks of interacting components, and therefore the distinction between faults, errors, and failures fades. One component's failure may be input to a second component. From the second component's perspective this will be a fault, which again may activate and trigger an error and potentially escalate into a failure. This domino effect is called fault propagation.

We can furthermore distinguish faults by their lifetime: permanent faults enter the system at a certain point in time and the affected component thereafter constantly behaves incorrectly. For example, production errors in processors may lead to situations in which certain bits of a register always deliver the same value or where a broken interconnect never transmits any current. In contrast, transient (or soft) faults vanish after a certain period of time. This is the case for the radiation-induced faults I introduced in Section 2.1. These faults may cause single bits of a register to change their value.

However, overwriting the register with a new value will correct these errors, because writing will recharge the affected transistors.

Literature sometimes uses the term intermittent fault to denote a fault that occasionally activates and then vanishes again. These faults are believed to stem from complex interactions between multiple faulty hardware components.[17]

2.2.1 Fault-Tolerant Computing

Fault-tolerance mechanisms strive to prevent faults from escalating into failures by either masking faults or detecting and correcting errors before they lead to failures. Decades of research and product development have produced guidelines on how to design computer components so that they tolerate faults efficiently.

If a device stops providing a service when encountering a failure, it is considered a fail-stop component.[18] Such a component cannot produce erroneous output. Therefore, a fault-tolerance mechanism can easily detect if this unit becomes unresponsive. In contrast, detecting erroneous output is a much harder task. Some publications also refer to components that stop providing output after a failure as fail-silent components.[19]

If the termination of service happens immediately after the service provider failed, the respective component is considered fail-fast.[20] Failing fast limits the propagation of an error into other components of the system. Consequently, repairing the system becomes easier.

I will examine existing fault-tolerance mechanisms and their properties more closely in Section 2.4. However, to do so we need metrics that allow us to evaluate the suitability of a certain mechanism in a given scenario.

[17] Layali Rashid, Karthik Pattabiraman, and Sathish Gopalakrishnan. Towards Understanding the Effects of Intermittent Hardware Faults on Programs. In Workshops on Dependable Systems and Networks, pages 101–106, June 2010.
[18] Richard D. Schlichting and Fred B. Schneider. Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems. ACM Transactions on Computer Systems, 1:222–238, 1983.
[19] Francisco V. Brasileiro, Paul D. Ezhilchelvan, Santosh K. Shrivastava, Neil A. Speirs, and S. Tao. Implementing Fail-Silent Nodes for Distributed Systems. IEEE Transactions on Computers, 45(11):1226–1238, 1996.
[20] Jim Gray. Why Do Computers Stop and What Can Be Done About It? In Symposium on Reliability in Distributed Software and Database Systems, pages 3–12, 1986.

2.2.2 Reliability Metrics

When assessing fault-tolerant systems, Gray distinguishes between reliability and availability:[20]

• "Reliability is not doing the wrong thing." By this definition, a fail-stop system is considered reliable, because it never provides a wrong result.

• "Availability is doing the right thing within the specified response time."

This definition adds the requirement of a timely and correct response. To be available, a fail-stop system needs to be augmented with a fast and suitable recovery mechanism.

Researchers use temporal metrics as shown in Figure 2.4 to quantify these terms. The expected time between failure events in a component is termed the component's Mean Time Between Failures (MTBF). MTBF is often measured in years. Computer architects use the rate of Failures In Time (FIT), which is closely related to the MTBF and defined as FIT := 1/MTBF. The FIT rate is usually given in failures per one billion hours of operation.

The lifetime and evolution of a fault can also be measured: the time between the occurrence of a fault and its activation is called the Fault Latency. The subsequent time elapsing between the manifestation of an error and its detection by a fault tolerance mechanism is termed the Error Detection Latency.

[Figure 2.4: Temporal reliability metrics along a fault's timeline: (1) Time Between Failures, (2) Time To Repair, (3) Error Detection Latency, (4) Fault Latency.]

Applying fault tolerance to a system adds development, execution-time, and resource overheads. For instance, periodically taking application checkpoints adds both additional execution time and memory consumption. These overheads are of concern when evaluating fault-tolerant systems, because they are often paid regardless of whether the system is hit by a fault. Additionally, once an error has occurred, the system needs to repair this error. The time to do so is called the Mean Time To Repair (MTTR). Gray uses MTBF and MTTR to define an availability metric:[20]

    Availability := MTBF / (MTBF + MTTR)

Lastly, the fraction of errors covered by a fault tolerance mechanism, in contrast to the overall amount of errors a system may encounter, is called the error coverage.

An ideal fault-tolerant system achieves high error coverage while keeping overheads and repair times minimal. Unfortunately, in practice there is no free lunch. Real solutions therefore need to choose between increased execution time overheads and increased error coverage and find a sweet spot for their purposes. Therefore, I review existing fault tolerance solutions with respect to their cost and applicability in the next section.
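As a purely illustrative calculation (the numbers below are invented for this example and do not appear in the thesis), a component with an MTBF of 1000 hours and an MTTR of one hour achieves

    \[
      \text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}
                          = \frac{1000\,\mathrm{h}}{1000\,\mathrm{h} + 1\,\mathrm{h}}
                          \approx 0.999,
    \]

i.e., roughly 99.9% availability. Halving the MTTR to 30 minutes raises this to about 99.95% without changing the failure rate, which is why fast recovery matters as much as a long MTBF.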

SUMMARY: Component malfunctions follow a chain of events: faults affect the component and escalate to errors in the component's state. An externally visible error is considered a failure. Fault-tolerant systems are designed to either mask errors or detect and correct them in time. Reliability metrics, such as MTBF, MTTR, and error coverage, make it possible to evaluate the impact and usefulness of fault tolerance mechanisms.

2.3 Manifestation of Hardware Faults

The errors discussed in Section 2.1 originate from the transistor level. My thesis focuses on tolerating the software-visible effects of these errors. Hence, it is useful to understand how hardware errors manifest from an application’s point of view. Therefore, I will now explore studies that analyze the rate at which errors occur in today’s hardware and the impact they have on program execution.

2.3.1 Do Hardware Errors Happen in the Real World?

Hardware faults are rare events in actual systems. This makes studying their effects a tedious task, because we need to obtain data from a large set of computers over a long time. The resources required to do so are often unavailable to researchers and even to most industrial hardware and software vendors. As an example, Li and colleagues monitored a set of more than 300 computers in production use over a span of three to seven months and were only able to attribute two observed errors as being the likely effects of a soft error in memory.[21]

[21] Xin Li, Kai Shen, Michael C. Huang, and Lingkun Chu. A Memory Soft Error Measurement on Production Systems. In USENIX Annual Technical Conference, ATC'07, pages 275–280, June 2007.

Researchers often study hardware error effects by looking at memory hardware. This is appropriate for two reasons: First, memory makes up a large fraction of the hardware circuits in a modern computer. Halfhill reports that 69% of a modern Intel CPU's chip area is used for the L3 cache.[22] Additional area is consumed off-chip by the several gigabytes of RAM in modern machines. Memory is therefore most likely to encounter a hardware fault. Second, detecting memory errors is straightforward, either by monitoring complete memory ranges in software or by using error information provided by the memory controller.[21]

Instead of monitoring computers for a long time, we can also expose hardware to an increased rate of radiation in a special experiment setup. In these investigations hardware is ionized at a rate much higher than cosmic radiation. This approach amplifies the number of soft errors that affect the investigated devices. Autran and colleagues used such an experiment to measure SEU effects on 65 nm SRAM cells. They extrapolate from their measurements that at sea level we are likely to see a FIT rate of 759 bit errors per megabit of memory within one billion hours of operation.[23]

SRAM cells are used in modern CPUs for implementing on-chip caches. If we assume an 8 MB L3 cache size, as is common in Intel's Core i7 series of processors, Autran's FIT/Mbit rate leads us to a cache FIT rate of 759 FIT/Mbit * 64 Mbit = 48,576 FIT. This number corresponds to an MTBF of about 2.3 years. This fault rate may appear negligible from the perspective of a single user's home entertainment system. However, modern large-scale computing installations, such as high-performance computers and data centers powering the cloud, contain thousands of processors. Consequently, failures in these systems happen at intervals of minutes instead of weeks or months.

Focusing on such large-scale installations, Schroeder and colleagues carried out a field study of DRAM errors by monitoring the majority of Google Inc.'s server fleet for nearly two years.[24] They found that about a third of all computers experienced at least one memory error within a year. The error rates strongly correlated with hardware utilization. Furthermore, they observed that DRAMs that had already encountered previous errors were more likely to suffer from errors in the future. From these observations they concluded that memory faults are dominated by permanent faults, because otherwise errors would be distributed more evenly across the machines under observation. They then confirmed this assumption with a second study incorporating an even larger range of machines.[25]

AMD engineers performed an independent 12-month study using the Jaguar cluster at Oak Ridge National Laboratory, containing 18,688 compute nodes and about 2.69 million DRAM devices.[26] They report that while only 0.1% of all DRAMs encountered any kind of fault during their study, the large number of devices translates this into an MTBF of roughly 6 hours. In contrast to Schroeder's studies, the AMD study aims to attribute errors to their underlying faults. The authors argue that counting errors instead of faults leads to a bias towards permanent errors. Because of their very nature, permanent faults will produce reportable errors until the defective memory bank is replaced. In contrast, radiation-induced soft errors are only reported once and then corrected. By only considering actual faults, the study reports a transient fault rate of 28%.

[22] Tom R. Halfhill. Processor Watch: DRAM+CPU Hybrid Breaks Barriers. Linley Group, 2011, accessed on July 26th 2013, mirror: http://tudos.org/~doebel/phd/linley11core/
[23] J.-L. Autran, P. Roche, S. Sauze, G. Gasiot, D. Munteanu, P. Loaiza, M. Zampaolo, and J. Borel. Altitude and Underground Real-Time SER Characterization of CMOS 65 nm SRAM. In European Conference on Radiation and Its Effects on Components and Systems, RADECS'08, pages 519–524, 2008.
[24] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. DRAM Errors in the Wild: A Large-Scale Field Study. In International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS'09, 2009.
[25] Andy A. Hwang, Ioan A. Stefanovici, and Bianca Schroeder. Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design. In International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 111–122, London, England, UK, 2012. ACM.
[26] Vilas Sridharan and Dean Liberty. A Study of DRAM Failures in the Field. In International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 76:1–76:11, Salt Lake City, Utah, 2012. IEEE Computer Society Press.
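To make the arithmetic behind the cache example above explicit (an 8 MB cache is 64 Mbit, and FIT counts failures per one billion hours of operation):

    \[
      759\,\tfrac{\mathrm{FIT}}{\mathrm{Mbit}} \times 64\,\mathrm{Mbit} = 48{,}576\,\mathrm{FIT}
      \qquad\Longrightarrow\qquad
      \mathrm{MTBF} = \frac{10^{9}\,\mathrm{h}}{48{,}576} \approx 20{,}600\,\mathrm{h} \approx 2.3\ \text{years}.
    \]

Scaling the same per-chip rate to an installation with thousands of such processors divides the expected time between cache upsets accordingly, which is why the large systems discussed above encounter such events far more frequently.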

As described in Section 2.1, hardware circuitry is constantly shrinking and therefore becoming more vulnerable to both radiation and aging effects. This trend has been pointed out by studies from both industry[27] and academia.[28] As a result, the error rates described above will grow with future hardware generations, making the quest for efficient fault tolerance mechanisms more urgent.

SUMMARY: Hardware faults and the resulting failures are a practical problem. Today, the probability of encountering a hardware error in an end user's computer is fairly low. High-performance computers and data centers, however, already amass a sufficient amount of hardware to encounter such faults on a daily basis. They therefore require protection by hardware or software mechanisms. Future generations of computer hardware will bring these problems to the consumer market as well.

[27] Shekhar Borkar. Designing Reliable Systems From Unreliable Components: The Challenges of Transistor Variability and Degradation. IEEE Micro, 25(6):10–16, 2005.
[28] Robert Baumann. Soft Errors in Advanced Computer Systems. IEEE Design & Test of Computers, 22(3):258–266, 2005.

2.3.2 How do Errors Manifest in Software?

Before designing a fault tolerance mechanism, developers specify how they expect the behavior of a component to change if this component suffers from an error. This fault model is used as the basis for evaluating the efficiency of newly implemented mechanisms.

Software-level fault tolerance often assumes that hardware errors manifest as deviations in the state of single bits in memory or other CPU components, such as registers or caches.[29] Based on this assumption, permanent errors are often modeled as stuck-at errors, where reading a bit constantly returns the same value. In contrast, a commonly used fault model for transient errors is a bit flip, where a memory bit is inverted at a random point in time. The bit then returns erroneous data until the next write to this resource resets the bit to a correct state.

Ideally, to evaluate a fault tolerance mechanism, we would fully enumerate all potential errors according to the assumed fault model and check whether these errors are detected and corrected by the mechanism. Unfortunately, this is infeasible because the total set of errors is huge: for instance, when assuming register bit flips, we would have to test the mechanism for a bit flip in every bit of every register for every dynamic instruction executed by a given workload. This leads to millions or billions of potential experiments.

Real-world studies often work around this problem by sampling the error space. Fault injection tools[30] aid in performing coverage analysis. These tools furthermore help determine which samples to select in order to obtain representative results.[31]

The total set of fault injection experiments carried out is called a fault injection campaign. In every experiment a single fault is injected. The system's output is then monitored and compared to that of an unmodified execution, the so-called golden run. The experiment results are then classified.

[29] V. B. Kleeberger, C. Gimmler-Dumont, C. Weis, A. Herkersdorf, D. Mueller-Gritschneder, S. R. Nassif, U. Schlichtmann, and N. Wehn. A Cross-Layer Technology-Based Study of How Memory Errors Impact System Resilience. IEEE Micro, 33(4):46–55, 2013.
[30] Horst Schirmeier, Martin Hoffmann, Rüdiger Kapitza, Daniel Lohmann, and Olaf Spinczyk. FAIL*: Towards a Versatile Fault-Injection Experiment Framework. In Gero Mühl, Jan Richling, and Andreas Herkersdorf, editors, International Conference on Architecture of Computing Systems, volume 200 of ARCS'12, pages 201–210. German Society of Informatics, March 2012.
[31] Siva Kumar Sastry Hari, Sarita V. Adve, Helia Naeimi, and Pradeep Ramachandran. Relyzer: Exploiting Application-Level Fault Equivalence to Analyze Application Resiliency to Transient Faults. In International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 123–134, New York, NY, USA, 2012. ACM.

The names of these result classes vary throughout the literature. Nevertheless, studies often distinguish between four general result classes, and I will use the following distinction when evaluating ROMAIN's error coverage capabilities in Chapter 5:

• If the application continues execution without visible deviation and successfully generates the correct output, the error is considered to have no effect. This is also called a benign fault.
• An error that leads to a visible application malfunction, for instance because it drives the application to access an invalid memory address, is called a detected error or a crash.
• If the application terminates successfully in the presence of an error, but produces wrong output, the error is considered a silent data corruption (SDC) error.
• Sometimes the application does not terminate within a specific time frame at all, for instance because the error affects a loop condition and the program gets stuck in an infinite loop. These cases are classified as incomplete execution.

In the following paragraphs I survey six studies that investigated different aspects of how hardware errors affect software.

Error Propagation to Software   Saggese and colleagues studied fault effects using gate-level simulators of both a DLX and an Alpha processor.32 Using these simulators they were able to inject faults into the different functional units of the processor in operation. For logic gates they found that 85% of the injected errors were masked at the hardware level. Consequently, software-level fault tolerance should focus on detecting errors that affect stored data, such as registers or memory.

The authors furthermore analyzed whether failure distributions vary across the different functional units of a processor. I render their results in Figure 2.5. Most notably, they found that execution units (such as the instruction decoder) contribute largely to crashes. In contrast, memory accesses make up a large fraction of silent data corruption failures.

[Figure 2.5: Study by Saggese32 on how soft errors in different parts of the processor (memory access, speculation, control units, execution units) manifest as software failures (crash, SDC, incomplete execution); contribution to errors in %.]

32 Giacinto P. Saggese, Nicholas J. Wang, Zbigniew T. Kalbarczyk, Sanjay J. Patel, and Ravishankar K. Iyer. An Experimental Study of Soft Errors in Microprocessors. IEEE Micro, 25:30–39, November 2005.
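To make the campaign procedure concrete, the following sketch shows the skeleton of one fault injection experiment under the register bit-flip model and its classification into the four result classes. The Simulator interface is hypothetical and invented for illustration; real frameworks such as FAIL* provide comparable but different hooks.

#include <cstdint>
#include <random>
#include <string>

// Hypothetical simulator interface -- not the API of any real tool.
struct Simulator {
    enum class Outcome { FINISHED, EXCEPTION, TIMEOUT };
    virtual void    reset() = 0;                           // reload workload image
    virtual void    run_until(uint64_t dynamic_insn) = 0;  // execute fault-free prefix
    virtual void    flip_register_bit(int reg, int bit) = 0;
    virtual Outcome run_to_end(std::string *output) = 0;   // run to completion or timeout
    virtual ~Simulator() = default;
};

enum class ResultClass { NO_EFFECT, CRASH, SDC, INCOMPLETE };

// One experiment: flip a single register bit at a randomly sampled dynamic
// instruction and compare the output against the golden run.
ResultClass run_experiment(Simulator &sim, const std::string &golden_output,
                           uint64_t total_instructions, std::mt19937_64 &rng)
{
    std::uniform_int_distribution<uint64_t> pick_insn(0, total_instructions - 1);
    std::uniform_int_distribution<int>      pick_reg(0, 15);   // e.g., 16 GPRs
    std::uniform_int_distribution<int>      pick_bit(0, 31);

    sim.reset();
    sim.run_until(pick_insn(rng));                 // fault-free prefix
    sim.flip_register_bit(pick_reg(rng), pick_bit(rng));

    std::string output;
    switch (sim.run_to_end(&output)) {
    case Simulator::Outcome::EXCEPTION: return ResultClass::CRASH;
    case Simulator::Outcome::TIMEOUT:   return ResultClass::INCOMPLETE;
    case Simulator::Outcome::FINISHED:
        return (output == golden_output) ? ResultClass::NO_EFFECT
                                         : ResultClass::SDC;
    }
    return ResultClass::INCOMPLETE;                // not reached
}

A campaign then repeats this experiment for a representative sample of (instruction, register, bit) triples and aggregates the class counts.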

Error Manifestation at the OS Level   Two studies by Arlat and Madeira inspected how transient memory errors in commercial-off-the-shelf (COTS) hardware impact an operating system. Arlat focused on Chorus,33 whereas Madeira's work considered LynxOS.34

Both studies found that – after filtering faults masked by hardware – between 30% and 50% of all errors lead to no difference in application behavior. A fault tolerance mechanism that successfully ignores benign faults while properly handling all others will therefore cause less execution time overhead than a mechanism that tries to correct all errors. Hence, the way a mechanism deals with benign faults constitutes a major opportunity for optimizing performance.

Arlat's study focused on exercising certain kernel subsystems. A large fraction of visible errors triggered CPU exceptions or error handling inside the kernel. These errors can easily be detected by an OS-level fault tolerance mechanism.

33 Jean Arlat, Jean-Charles Fabre, Manuel Rodríguez, and Frédéric Salles. Dependability of COTS Microkernel-Based Systems. IEEE Transactions on Computing, 51(2):138–163, February 2002.
34 Henrique Madeira, Raphael R. Some, Francisco Moreira, Diamantino Costa, and David Rennels. Experimental Evaluation of a COTS System for Space Applications. In International Conference on Dependable Systems and Networks, DSN 2002, pages 325–330, 2002.

In contrast to Arlat's study, Madeira also considered applications that spent a substantial amount of time executing in user space. In these experiments, up to 50% of all errors led to silent data corruption or incomplete execution of the user program. These errors cannot easily be detected by the OS, because from the kernel's perspective such a failing application does not behave differently from a program executing normally.

Arlat and Madeira also investigated whether errors propagate through the OS kernel into other applications. They report this to happen rarely and attribute this to the fact that hardware-assisted process isolation is a suitable measure to prevent error propagation across applications. Both kernels make use of this feature. Yoshimura later confirmed that this isolation property also exists for Linux processes.35

35 Takeshi Yoshimura, Hiroshi Yamada, and Kenji Kono. Is Linux Kernel Oops Useful or Not? In Workshop on Hot Topics in System Dependability, HotDep'12, pages 2–2, Hollywood, CA, 2012. USENIX Association.

Manifestation of Permanent Errors   While the previous studies focused on transient errors, Li and colleagues studied the manifestation of permanent errors on a simulated CPU running the Solaris operating system and the SPEC CPU benchmarks.36 They found that permanent errors are only rarely masked by hardware and seldom manifest as silent data corruption. Instead, these errors often lead to hardware exceptions (such as page faults), which are detected by the OS kernel. The authors also measured that in most cases an error leads to a crash within less than 1,000 CPU cycles. However, in 65% of the cases, the kernel's internal data structures are corrupted before error detection mechanisms are triggered. These important data structures therefore require additional protection even in systems that already employ general fault tolerance mechanisms.

36 Man-Lap Li, Pradeep Ramachandran, Swarup Kumar Sahoo, Sarita V. Adve, Vikram S. Adve, and Yuanyuan Zhou. Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design. In International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIII, pages 265–276, Seattle, WA, USA, 2008. ACM.

Errors Meet Programming Language Constructs   In addition to the previous studies, software developers investigated whether certain programming languages or development models influence the reliability of a program. One such investigation was performed by Wang and colleagues, who explored branch errors in the SPEC CPU 2000 benchmarks.37 They forced the programs to take wrong branches and found that in up to 40% of the dynamic branches the program converged to correct execution without generating wrong data. The authors attribute these observations to the use of certain programming constructs. For example, they point out that programs sometimes contain alternative implementations of the same feature. In such cases selecting one or the other (by taking a different branch) does not make a difference with respect to program outcome.

A different study by Borchert and colleagues shows that programming language features can also have an impact on the vulnerability of programs.38 In the C++ programming language, dynamic dispatch across inheritance hierarchies is implemented using a function pointer lookup table, the vtable. Borchert et al. evaluated C++ code using fault injection experiments and determined that these vtable pointers are especially vulnerable to memory bit flips. They then devised a mechanism to replicate these vtable pointers at runtime and thereby increase the program's reliability.

37 Nicholas Wang, Michael Fertig, and Sanjay Patel. Y-Branches: When You Come to a Fork in the Road, Take it. In International Conference on Parallel Architectures and Compilation Techniques, PACT '03, pages 56–, Washington, DC, USA, 2003. IEEE Computer Society.
38 Christoph Borchert, Horst Schirmeier, and Olaf Spinczyk. Protecting the Dynamic Dispatch in C++ by Dependability Aspects. In GI Workshop on Software-Based Methods for Robust Embedded Systems (SOBRES '12), Lecture Notes in Informatics, pages 521–535. German Society of Informatics, September 2012.

Limitations of the Bit Flip Model   The studies discussed in this section analyzed the vulnerability of large application scenarios. Fault injection campaigns for such scenarios are computationally infeasible to carry out by simulating hardware at the transistor level. Instead, the studies used simulation software that abstracts away details of the underlying hardware platform in order to gain performance and allow conducting such a vast number of fault injection experiments.

While these studies allow analyzing the effectiveness of reliability mechanisms, a recent study by Cho points out that the absolute numbers produced by high-level simulation need to be taken with a grain of salt:39 their comparison of transistor-level experiments with memory fault injections based on the bit-flip model indicates that bit-flip experiments tend to over-estimate SDC errors, whereas real hardware shows a higher rate of crash errors. This means that high-level fault injection is not suitable for performing an absolute vulnerability analysis of a given system. However, these campaigns are still suitable for investigating whether a fault tolerance mechanism increases reliability with respect to a given fault model. This latter scenario only requires an apples-to-apples comparison between unprotected and protected execution within the same simulator.

39 Hyungmin Cho, Shahrzad Mirkhani, Chen-Yong Cher, Jacob A. Abraham, and Subhasish Mitra. Quantitative Evaluation of Soft Error Injection Techniques for Robust System Design. In Design Automation Conference (DAC), 2013 50th ACM/EDAC/IEEE, pages 1–10, 2013.

SUMMARY: Both permanent and transient hardware faults can propagate to the software level. They mostly manifest in storage cells, such as registers and main memory. Despite limitations, the bit-flip model is commonly used to analyze the impact these errors have on software and whether fault tolerance mechanisms detect and correct them efficiently. Depending on the workload, up to 50% of all errors are benign and do not modify program behavior. Fault tolerance mechanisms can leverage this fact to optimize performance if they find a way to ignore benign faults and only pay execution time overhead for actual failures.

2.4 Existing Approaches to Tolerating Faults

In the previous sections I explained why hardware suffers from faults and how these faults manifest at the software level. In this section I review previous work in the field of fault-tolerant computing. I first give an overview of hardware-level solutions that try to prevent hardware faults from ever becoming visible to software. Such hardware extensions increase the required chip area of a processor, making the chip more expensive to produce, and consume additional energy during operation. As an alternative, software-level solutions try to address fault tolerance without relying on dedicated hardware. I explore such mechanisms in the second part of this section. Operating systems have long tried to increase the reliability of kernel code. While many of these solutions focus on dealing with programming errors, some of their design principles can also aid in dealing with hardware faults. Hence, I review these works in the third part of this section.

2.4.1 Hardware Extensions Providing Fault Tolerance

Computer architects address the problem of failing hardware components by adding additional hardware to detect and correct these failures. These components may be simple copies of existing circuits. More lightweight approaches achieve fault tolerance using dedicated checker components to dynamically validate hardware properties. Lastly, memory cells and buses can be augmented with signature-based checkers to protect stored data.

Replicating Hardware Components   One of the simplest ways to achieve fault tolerance in hardware is to replicate existing components. This concept is called N-modular redundancy.40 The protected components are instantiated N times and carry out the same operations with identical inputs. Outputs are sent to a voter, which in turn selects the output produced by a majority of all components.

Figure 2.6 shows such a system with N = 3. This triple-modular redundant (TMR) setup is able to correct single-component failures using majority voting. The major drawback of simple replication is that the multiplication of components increases production cost (due to larger chip area used) as well as runtime cost (due to increased energy consumption) of the system. Hence, TMR setups are mainly used in high-end fault-tolerant systems, such as avionics41 and spacecraft.42

[Figure 2.6: Triple modular redundancy (TMR) – three component instances C, C', C'' receive the same inputs; a voter selects the majority output.]

40 W. G. Brown, J. Tierney, and R. Wasserman. Improvement of Electronic-Computer Reliability Through the Use of Redundancy. IRE Transactions on Electronic Computers, EC-10(3):407–416, 1961.
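As a minimal illustration of the voting step, the following function performs the majority vote that a TMR voter implements in hardware; it is a software sketch for illustration, not a description of any particular voter circuit.

#include <array>
#include <cstdint>
#include <optional>

// Majority vote over three redundant results. Returns the value produced by
// at least two components, or no value if all three disagree (a detected but
// uncorrectable error).
std::optional<uint64_t> tmr_vote(const std::array<uint64_t, 3> &outputs)
{
    if (outputs[0] == outputs[1] || outputs[0] == outputs[2])
        return outputs[0];        // component 0 agrees with a majority
    if (outputs[1] == outputs[2])
        return outputs[1];        // component 0 is the outlier
    return std::nullopt;          // no majority: detection only
}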

Redundant Multithreading   Modern microprocessors improve instruction-level parallelism by pipelining43 as well as by executing multiple hardware threads in parallel. The latter technique is known as simultaneous multithreading (SMT) or hyper-threading.44 While SMT normally increases utilization of otherwise unused functional units, researchers have also investigated whether this parallelism can be used to increase fault tolerance.

With a technique called Redundant Multithreading (RMT), Reinhardt and Mukherjee extended an SMT-capable microprocessor to run identical code redundantly in different hardware threads. Whenever these threads perform a memory access, the RMT hardware extension compares the data involved in this access and resets the processor in case of a mismatch.45

RMT's authors use the term sphere of replication (SoR) to express which part of the system is protected by a fault tolerance mechanism. They argue that in order to reduce execution time overhead, replicas should not be compared while they only modify state internal to their SoR. Only once this state becomes externally visible (for instance by being written to memory) does RMT perform these potentially expensive comparisons. Using this deferred validation, RMT reduces execution time overhead while maintaining strict fault isolation and error coverage.

This last property motivated several software-level solutions that apply ideas similar to RMT, but use software threads. I will discuss these solutions in the following section on software-level fault tolerance mechanisms.

43 John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2003.
44 Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In International Symposium on Computer Architecture, pages 392–403, 1995.
45 Steven K. Reinhardt and Shubhendu S. Mukherjee. Transient Fault Detection via Simultaneous Multithreading. SIGARCH Comput. Archit. News, 28:25–36, May 2000.
41 Ying-Chin Yeh. Triple-Triple Redundant 777 Primary Flight Computer. In Aerospace Applications Conference, volume 1, pages 293–307, 1996.
42 David Ratter. FPGAs on Mars. Xilinx Xcell Journal, 2004.
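The sphere-of-replication idea translates naturally to software. The following sketch, written for illustration only, runs the same computation in two threads and restricts comparison to the values that would leave the sphere; unlike hardware RMT, which checks each store as it happens, this sketch collects the externalized values and compares them after the run.

#include <cstdint>
#include <thread>
#include <vector>

// Software analogy of RMT's sphere of replication: internal state is never
// compared; only values that would become externally visible are.
struct Replica {
    std::vector<uint64_t> pending_stores;     // externalization events, in order
    void run(const std::vector<uint64_t> &input) {
        uint64_t acc = 0;
        for (uint64_t v : input) {
            acc = acc * 31 + v;               // internal state inside the sphere
            pending_stores.push_back(acc);    // value leaving the sphere
        }
    }
};

bool replicated_run(const std::vector<uint64_t> &input)
{
    Replica r1, r2;
    std::thread t1([&] { r1.run(input); });
    std::thread t2([&] { r2.run(input); });
    t1.join();
    t2.join();
    // Deferred validation: compare only the externalized values.
    return r1.pending_stores == r2.pending_stores;
}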

Mainframe Servers   Practical applications of RMT's ideas can be found in highly available mainframe servers. IBM's PowerPC 750GX series allows two processors to operate in lockstep mode. The lockstep CPUs execute the same code using identical inputs. Whenever they write data to memory, an additional validator component compares the outputs and raises an error upon mismatch.46

HP's NonStop Advanced Architecture47 aims to decrease production cost by working with COTS components wherever possible. These components are cheaper because they are produced for the mass market instead of being specifically designed for highly customized use cases. NonStop servers include COTS processors, which run replicated. To improve fault isolation, NonStop pairs processors from different physical processor slices, each using individual power supplies and memory banks. A dedicated (non-COTS) interconnect intercepts and compares the replicated processors' memory operations, flags differences, and implements recovery of failed processors. The comparator component itself can also be replicated so that it does not become a single point of failure for this system.48

46 IBM. PowerPC 750GX Lockstep Facility. IBM Application Note, 2008.
47 HP NonStop originates from the products of Tandem Computer Inc.
48 David Bernick, Bill Bruckert, Paul del Vigna, David Garcia, Robert Jardine, Jim Klecka, and Jim Smullen. NonStop: Advanced Architecture. In International Conference on Dependable Systems and Networks, pages 12–21, June 2005.

Dedicated Checker Circuits   Replicating whole hardware components requires a large amount of additional chip area. Alternative approaches add small, light-weight checker circuits that monitor the execution of a single component in order to detect invalid behavior. For instance, Austin's DIVA architecture, which is shown in Figure 2.7, extends a standard superscalar processor pipeline with an additional validation stage.49

The traditional, unmodified pipeline stages (shown white in the figure) fetch and decode instructions before submitting them to an out-of-order execution stage. By default, instructions remain in this stage until they, along with all their dependencies, have been successfully executed. Thereafter, the instructions are committed, which makes their results externally visible.

DIVA extends this pipeline by inspecting all instructions, their operands, and their results before they reach the commit stage. Additional components recompute the result (CHKComp) and revalidate data load operations (CHKComm). As the validators see all instructions in commit order, they can be less complex than a complete out-of-order stage, because there are no unresolved dependencies and no speculative execution is required. Furthermore, Austin suggests building the validators from components with larger structure sizes, which in turn are less susceptible to hardware faults.

As a result, the DIVA pipeline can be built from standard, error-prone components and only the validators need to be carefully constructed in order to protect the remainder of the system. Performance overhead is introduced only by the additional pipeline stage. The checker cores only slightly impact performance, because they only validate instructions that are actually committed. This means there is no overhead from checking speculatively executed instructions, and the checkers never have to stall for data from memory, because this data is already available from the previous phases.

[Figure 2.7: DIVA: The Dynamic Implementation Verification Architecture extends a state-of-the-art microprocessor pipeline (Fetch, Decode, Rename, Out-of-Order Execution, Reorder Buffer, Commit) with an additional execution validation stage (CHKComp, CHKComm).]

49 Todd M. Austin. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. In International Symposium on Microarchitecture, MICRO'32, pages 196–207, Haifa, Israel, 1999. IEEE Computer Society.

While DIVA verifies correct execution using recomputation, other architectures address the problem that the physical properties of hardware change due to faults. As explained in Section 2.1, aging and temperature-related hardware faults may modify a transistor's critical switching voltage. With this modification the time needed for state changes increases. This effect becomes visible once switch timing exceeds the processor's clock period, because then the results of a computation may not reach other circuits fast enough to carry on computation. Hardware vendors typically increase timing tolerance by running CPUs with a higher supply voltage than actually needed. Thereby, aging-related changes in switching times are hidden by initially switching transistors faster than required for a certain clock frequency. To cope with additional variability arising from the manufacturing process, these supply voltage margins need to be chosen conservatively. Due to these higher margins a processor may consume more energy than would otherwise be necessary.

The Razor architecture50 tries to reduce these voltage margins by introducing additional timing delay checkers into a processor. These checkers test whether the timing of the critical path through a CPU is still within the specified range. Only if the checkers detect a timing violation is the supply voltage increased to neutralize this effect.

50 Dan Ernst, Nam Sung Kim, et al. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation. In International Symposium on Microarchitecture, MICRO'36, pages 7–18, 2003.
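Razor's feedback loop can be summarized in a few lines of control code. The thresholds, step sizes, and interface below are invented for illustration; the real mechanism operates on shadow latches inside the pipeline rather than on a software controller.

#include <algorithm>

// Software analogy of Razor's control loop: the supply voltage is lowered
// step by step until the timing-error checkers start to fire, and raised
// again as soon as violations are observed.
struct RazorController {
    double voltage_mv = 1000.0;   // current supply voltage (illustrative values)
    double v_min_mv   = 800.0;    // hard lower limit
    double v_max_mv   = 1100.0;
    double step_mv    = 5.0;

    void on_control_epoch(unsigned timing_violations) {
        if (timing_violations > 0)
            voltage_mv = std::min(voltage_mv + step_mv, v_max_mv);  // back off
        else
            voltage_mv = std::max(voltage_mv - step_mv, v_min_mv);  // save energy
    }
};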

Error Detection Using Data Signatures   The mechanisms discussed so far protect the actual execution of a hardware component by validating results and timing. However, as I explained in Section 2.3, the hardware components most susceptible to hardware faults are those units storing and transmitting data. Replicating all data storage in a system would significantly increase resource requirements, because memory and caches constitute a large fraction of today's chip area.

An alternative to keeping copies of all data for fault tolerance is to apply checksumming and only store checksums along with the data. This approach is used in networking, where network protocols add checksums to transmitted data in order to detect transmission failures.51 Mainframe processors, such as the IBM Power7 series, furthermore use Cyclic Redundancy Checks (CRC) to protect data buses.52

Hardware Memory Protection   Modern memory controllers can apply Error-Correcting Codes (ECC) to protect data from the effects of hardware faults. ECC adds additional parity or checking bits to data words. These additional bits are updated with every write operation and can be used during a read operation to determine if the values are identical to the previously stored ones.

Memory ECC is usually based on Hamming codes,53 as they are fast and easy to implement in hardware. Current implementations mostly use a (72,64) code, which means that 64 bits of actual data are augmented with 8 bits of parity. Such codes can correct single-bit errors and detect (but not necessarily correct) double-bit errors (SECDED – single error correct, double error detect).54

ECC defers detection of an erroneous memory cell until the next access. Unfortunately, data is sometimes stored without being accessed for a long period of time. In such a time frame multiple independent memory errors might affect the same cell, rendering ECC useless because SECDED codes can only recover from single-bit errors. Advanced memory controllers address this problem by periodically accessing all memory cells. This technique is called memory scrubbing.55 The scrubbing period places an upper bound on each memory cell's vulnerability to multi-bit errors.

51 J. Postel. Transmission Control Protocol. RFC 793 (Standard), September 1981. Updated by RFCs 1122, 3168, 6093.
52 Daniel Henderson and Jim Mitchell. POWER7 System RAS – Key Aspects of Power Systems Reliability, Availability, and Serviceability. IBM Whitepaper, 2012.
53 Richard W. Hamming. Error Detecting And Error Correcting Codes. Bell System Technical Journal, 29:147–160, 1950.
54 Shubhendu Mukherjee. Architecture Design for Soft Errors. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.
55 Shubhendu S. Mukherjee, Joel Emer, Tryggve Fossum, and Steven K. Reinhardt. Cache Scrubbing in Microprocessors: Myth or Necessity? In Pacific Rim International Symposium on Dependable Computing, PRDC '04, pages 37–42, Washington, DC, USA, 2004. IEEE Computer Society.
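The SECDED principle is easiest to see on a small code. The sketch below implements an extended Hamming (8,4) code for a 4-bit data word; commodity ECC DRAM uses the (72,64) variant mentioned above, but the encode/decode mechanics are the same.

#include <cstdint>
#include <optional>

// Extended Hamming (8,4) SECDED code. Codeword layout (byte bit i holds
// Hamming position i for i = 1..7, bit 0 holds the overall parity):
//   positions 1,2,4: parity bits; positions 3,5,6,7: data bits d1..d4.
static inline unsigned bit(uint8_t v, unsigned i) { return (v >> i) & 1u; }

uint8_t secded_encode(uint8_t data /* 4 bits */)
{
    unsigned d1 = bit(data, 0), d2 = bit(data, 1),
             d3 = bit(data, 2), d4 = bit(data, 3);
    unsigned p1 = d1 ^ d2 ^ d4;          // covers positions 3,5,7
    unsigned p2 = d1 ^ d3 ^ d4;          // covers positions 3,6,7
    unsigned p4 = d2 ^ d3 ^ d4;          // covers positions 5,6,7

    uint8_t cw = 0;
    cw |= p1 << 1; cw |= p2 << 2; cw |= d1 << 3;
    cw |= p4 << 4; cw |= d2 << 5; cw |= d3 << 6; cw |= d4 << 7;
    unsigned p0 = 0;                      // overall parity over positions 1..7
    for (unsigned i = 1; i <= 7; ++i) p0 ^= bit(cw, i);
    return cw | p0;
}

// Returns the corrected 4-bit data word, or nothing if an uncorrectable
// (double-bit) error was detected.
std::optional<uint8_t> secded_decode(uint8_t cw)
{
    unsigned s1 = bit(cw,1) ^ bit(cw,3) ^ bit(cw,5) ^ bit(cw,7);
    unsigned s2 = bit(cw,2) ^ bit(cw,3) ^ bit(cw,6) ^ bit(cw,7);
    unsigned s4 = bit(cw,4) ^ bit(cw,5) ^ bit(cw,6) ^ bit(cw,7);
    unsigned syndrome = s1 | (s2 << 1) | (s4 << 2);  // position of a flipped bit
    unsigned overall  = 0;
    for (unsigned i = 0; i <= 7; ++i) overall ^= bit(cw, i);

    if (syndrome != 0 && overall == 0)
        return std::nullopt;                         // double-bit error: detect only
    if (syndrome != 0)                               // single-bit error: correct it
        cw ^= 1u << syndrome;
    // syndrome == 0 with overall == 1 means the overall parity bit itself flipped.
    return static_cast<uint8_t>(bit(cw,3) | (bit(cw,5) << 1) |
                                (bit(cw,6) << 2) | (bit(cw,7) << 3));
}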

Hardware research as mentioned in Section 2.1 pointed out that radiation influences are likely to affect more than a single memory cell. Depending on angle and charge of the striking particle, multiple adjacent bits may suffer from a transient error.56 This makes memory prone to multi-bit errors, which standard ECC cannot correct. Advanced ECC schemes, such as IBM's Chipkill, address this issue by not computing parity over physically adjacent bits. Instead, bits from different memory words and even separate DIMMs are incorporated into a single ECC checksum.57

ECC is an effective and well-understood technique for protecting memory cells in hardware. It has become a commodity in recent years and is often considered a COTS hardware component. For this reason many of the software-level mechanisms I will introduce in the next section assume ECC-protected main memory and focus on guarding against errors in the remaining components of a CPU. However, there are two reasons why ECC is not applied in all hardware: First, ECC, even in combination with scrubbing and Chipkill technologies, can still miss multi-bit errors, and studies have shown that this actually happens in large-scale systems.58 Second, ECC increases the number of transistors in hardware and consequently impacts energy demand. For these reasons, developers of embedded systems and low-cost consumer devices refrain from adding ECC protection to their hardware.

56 Jose Maiz, Scott Hareland, Kevin Zhang, and Patrick Armstrong. Characterization of Multi-Bit Soft Error Events in Advanced SRAMs. In IEEE International Electron Devices Meeting, pages 21.4.1–21.4.4, 2003.
57 Timothy J. Dell. A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory. IBM Whitepaper, 1997.
58 Andy A. Hwang, Ioan A. Stefanovici, and Bianca Schroeder. Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design. In International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 111–122, London, England, UK, 2012. ACM.

SUMMARY: Hardware-level solutions exist that reduce the failure probability for the hardware error scenarios described in Section 2.1. These solutions include replication of computing components, introduction of specialized validation circuits, as well as using signatures to detect errors in data storage and buses. The major drawback of hardware-level fault tolerance is cost. New circuitry increases chip size, which directly correlates with production cost. Furthermore, additional circuits require power at runtime and thereby increase a chip's power consumption.

2.4.2 Software-Implemented Fault Tolerance

While hardware-implemented fault tolerance is effective, platform vendors abstain from using these methods, especially in COTS systems. Customers are interested in the cheapest possible gadgets and may be willing to live with occasional failures due to hardware errors.

COTS components are still vulnerable to increasing hardware fault rates, though. For this reason it is becoming necessary to add special protection to mass-market computers as well. Software-implemented fault-tolerance mechanisms try to achieve this protection without relying on any non-COTS hardware components. These methods include compiler techniques to generate more reliable code. Alternative approaches use replication at the software level and utilize operating system processes or virtual machines to enforce fault isolation.

Manually Implemented Fault Tolerance   Early research into fault tolerance investigated whether programs can be manually extended with mechanisms that detect and correct errors. The field of algorithm-based fault tolerance (ABFT) focuses on finding algorithms that are able to validate the correctness of their results.59 For ABFT, a developer inspects the data domain of a specific algorithm and augments data with checksums and other signatures, similar to encoding-based hardware techniques. Thereafter, the algorithm is redesigned to incorporate signature updates and validation. This approach is labor-intensive, but results in low-overhead solutions with high error coverage. Nevertheless, recent research in high-performance computing found that other fault-tolerant techniques are less suited for future exascale compute clusters. This has led to renewed interest in ABFT within the HPC community.60

Software engineering aims to formalize the development of fault-tolerant software. Executable assertions allow the developer to specify assumptions about the state of data structures in program code.61 These assumptions are then checked at runtime. As assertions manifest programmer knowledge in code, they also lend themselves to detecting state that is corrupted due to a hardware fault.

If a program's state validation mechanisms detect an error, the developer needs to specify how to deal with this situation. The program might choose to retry the computation, use an alternative implementation of the algorithm, or simply terminate. These respective actions can be implemented using the concept of recovery blocks.62 Since their inception, both assertions and recovery blocks have found their way into every software developer's toolbox.

Development Tools for Fault Tolerance   Manually implementing fault-tolerant software is a labor-intensive and error-prone task. Development tools relieve programmers of this burden. Therefore, compiler writers aim to automate the process of generating fault-tolerant code. Oh proposed a compiler extension that extends a program's control flow with compiler-generated signatures.63 These signatures are updated on every jump operation. Runtime checks can then verify that the current instruction was reached by taking a valid path. This approach detects invalid jumps that were caused by errors affecting the jump target or during instruction decoding.

Another compiler extension by Oh generates code that doubly performs all computations using different CPU resources and validates their results.64 Such augmented code detects transient hardware faults that impact computational components. Intuitively, doubling the amount of computation would at least double the execution time of the generated code. Oh's work shows that this overhead can be lowered by relying on a state-of-the-art pipelined processor architecture.

59 Kuang-Hua Huang and Jacob A. Abraham. Algorithm-Based Fault Tolerance for Matrix Operations. IEEE Transactions on Computers, C-33(6):518–528, 1984.
60 Dong Li, Zizhong Chen, Panruo Wu, and Jeffrey S. Vetter. Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC'13, pages 44:1–44:12, Denver, Colorado, 2013. ACM.
61 Aamer Mahmood, Dorothy M. Andrews, and Edward J. McClusky. Executable Assertions and Flight Software. Center for Reliable Computing, Computer Systems Laboratory, Dept. of Electrical Engineering, 1984.
62 Andrew M. Tyrrell. Recovery Blocks and Algorithm-Based Fault Tolerance. In EUROMICRO 96. Beyond 2000: Hardware and Software Design Strategies, pages 292–299, 1996.
63 Namsuk Oh, Philip P. Shirvani, and Edward J. McCluskey. Control-Flow Checking by Software Signatures. IEEE Transactions on Reliability, 51(1):111–122, March 2002.
64 Namsuk Oh, Philip P. Shirvani, and Edward J. McCluskey. Error Detection by Duplicated Instructions in Super-Scalar Processors. IEEE Transactions on Reliability, 51(1):63–75, 2002.
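The effect of such instruction duplication can be illustrated at the source level. The function below is a hand-written caricature of what a duplicating compiler emits; real tools work at the instruction level and keep shadow values in separate registers, and the exact placement of loads and checks differs between schemes.

#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Source-level sketch of compiler-generated instruction duplication: the
// computation is carried out twice in independent "shadow" variables and both
// copies are compared before the result becomes visible to callers.
uint32_t protected_sum(const uint32_t *data, std::size_t n)
{
    uint32_t sum  = 0;   // primary computation
    uint32_t sum2 = 0;   // shadow computation
    for (std::size_t i = 0; i < n; ++i) {
        uint32_t v = data[i];   // loads performed once here, as in schemes that
        sum  += v;              // assume ECC-protected memory
        sum2 += v;
    }
    if (sum != sum2)            // validation before the value is externalized
        std::abort();           // transient fault detected in the computation
    return sum;
}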

Reis and colleagues built on Oh's idea of duplicating instructions, but optimized their compiler to further reduce execution time overhead. Their main observation was that memory is often already protected by hardware ECC and therefore duplication of memory-related instructions is no longer necessary. With this optimization their SWIFT compiler achieved the same error coverage as Oh's compiler, but at less than 50% execution time overhead.65

SWIFT leaves some parts of the protected code vulnerable to errors. By excluding memory operations from duplication, stores to memory may be lost. These lost updates then remain undetected.

65 George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, and David I. August. SWIFT: Software Implemented Fault Tolerance. In International Symposium on Code Generation and Optimization, CGO '05, pages 243–254, 2005.

Furthermore, SWIFT's internal validation code runs itself unprotected and is therefore susceptible to errors. Schiffel's AN-encoding compiler addresses these shortcomings by applying arithmetic encoding to all operands and operations.66 In a closer analysis, Schiffel and colleagues showed that their improvements come with an increased execution time overhead, but at the same time eliminate most of the vulnerabilities introduced by SWIFT.67

The previously discussed compiler extensions assume all instructions to be affected by hardware errors with the same probability. Rehman and colleagues observed that different classes of instructions remain in the processor for different numbers of cycles. While a simple arithmetic instruction involving registers may finish within a single cycle, memory operations that include a fetch may remain in the pipeline for several cycles. They therefore analyzed the instruction set and implementation of a specific SPARC v8 processor and attributed each instruction with an instruction vulnerability index (IVI).68 The IVI represents how likely an instruction is to suffer from a hardware error relative to other instructions. Based on the IVI the authors then designed a compiler that prioritizes the protection of instructions with higher IVIs over those with lower IVIs. The compiler can then be configured to keep the overhead for generated code below a certain threshold. Developers may thereby explicitly trade error coverage for performance depending on their system's needs.

66 Ute Schiffel, André Schmitt, Martin Süßkraut, and Christof Fetzer. ANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software. In International Conference on Computer Safety, Reliability and Security, Safecomp'10, Vienna, Austria, 2010.
67 Ute Schiffel, André Schmitt, Martin Süßkraut, and Christof Fetzer. Software-Implemented Hardware Error Detection: Costs and Gains. In Third International Conference on Dependability, DEPEND'10, pages 51–57, 2010.
68 Semeen Rehman, Muhammad Shafique, Florian Kriebel, and Jörg Henkel. Reliable Software for Unreliable Hardware: Embedded Code Generation Aiming at Reliability. In International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS '11, pages 237–246, Taipei, Taiwan, 2011. ACM.
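The core idea behind AN encoding fits in a few lines: a value x is stored as the codeword A·x, additions stay within the code, and a corrupted word is detected because it is, with high probability, no longer divisible by A. The constant and interfaces below are illustrative and not taken from Schiffel's compiler.

#include <cstdint>
#include <optional>

// AN code sketch: a value x is represented by the codeword A * x. A bit flip
// in a codeword is detected when the codeword is no longer a multiple of A.
// The constant is one value sometimes quoted in the AN-coding literature; any
// suitable constant works for this illustration.
constexpr uint64_t A = 58659;

struct Encoded { uint64_t cw; };

Encoded an_encode(uint32_t x)        { return { A * static_cast<uint64_t>(x) }; }
Encoded an_add(Encoded a, Encoded b) { return { a.cw + b.cw }; }   // A*x + A*y = A*(x+y)

std::optional<uint32_t> an_decode(Encoded e)
{
    if (e.cw % A != 0)               // corrupted: not a valid codeword
        return std::nullopt;
    return static_cast<uint32_t>(e.cw / A);
}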

Borchert used an aspect-oriented compiler to protect data structures of an operating system kernel implemented in C++ from memory errors.69 His aspects leverage additional knowledge about important data structures provided by the kernel developer. These data structures are automatically extended with checksumming and validation upon every access.

Compiler-level fault tolerance requires all applications to be recompiled using a new compiler version. Binary-only third-party libraries as well as programs downloaded without their source code cannot be protected. Most compilers, however, allow interaction between protected and unprotected code. For this purpose they generate code that translates between encoded and non-encoded values as well as between duplicated and single-instance variables and function parameters.

69 Christoph Borchert, Horst Schirmeier, and Olaf Spinczyk. Generative Software-Based Memory Error Detection and Correction for Operating System Data Structures. In International Conference on Dependable Systems and Networks, DSN'13. IEEE Computer Society Press, June 2013.
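In effect, the woven code behaves like the following hand-written wrapper, which recomputes a checksum on every write and validates it on every read. The class and checksum choice are illustrative; Borchert's aspects generate equivalent logic for existing kernel classes without manual changes.

#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hand-written equivalent of a checksumming aspect for a critical, trivially
// copyable data structure: the checksum is updated on every write and
// validated on every read.
template <typename T>
class Checksummed {
    T        value_{};
    uint32_t sum_ = 0;

    static uint32_t fletcher32(const void *p, std::size_t len) {
        const uint8_t *b = static_cast<const uint8_t *>(p);
        uint32_t s1 = 0, s2 = 0;
        for (std::size_t i = 0; i < len; ++i) {
            s1 = (s1 + b[i]) % 65535;
            s2 = (s2 + s1) % 65535;
        }
        return (s2 << 16) | s1;
    }

public:
    void set(const T &v) {
        value_ = v;
        sum_   = fletcher32(&value_, sizeof value_);
    }
    const T &get() const {
        if (fletcher32(&value_, sizeof value_) != sum_)
            std::abort();          // corrupted by a memory error: fail fast
        return value_;
    }
};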

Software-Implemented Replication   Replication-based software fault tolerance works for binary-only software and thereby addresses the shortcomings of the compiler-level solutions discussed above. Similar to replication at the hardware level, software-level replication creates multiple instances of software components and compares their outputs to detect erroneous behavior.

Software-implemented approaches differ in their spheres of replication. At a large scale, replication is used to achieve fault tolerance in distributed systems. Here, whole compute nodes are replicated with dedicated hardware resources, operating system, and software stack. Kapritsos and colleagues showed with the EVE system that this kind of replication can achieve fault tolerance at low overhead, because replication can incorporate application-level knowledge and batch operations for more efficient processing.70

70 M. Kapritsos, Y. Wang, V. Quema, A. Clement, L. Alvisi, and M. Dahlin. EVE: Execute-Verify Replication for Multi-Core Servers. In Symposium on Operating Systems Design & Implementation, OSDI'12, Oct 2012.

My thesis explicitly focuses on hardware faults in single compute nodes. I will therefore not consider fault tolerance for distributed systems any further.

On a single compute node, Bressoud and Schneider showed that virtual machines (VMs) can be used to replicate whole operating system instances.71 These instances are completely isolated using virtualized hardware. The underlying hypervisor makes sure that replicas receive the same inputs in terms of interrupts and I/O operations. VM states are compared after a predefined interval of VM instructions to detect errors. Unfortunately, implementing virtualization-based replication is fairly complex. Modern hypervisors, such as KVM,72 require tens of thousands of lines of code, not yet including virtualized device models. Furthermore, Bressoud's work reports a significant execution time overhead of at least a factor of two.

71 Thomas C. Bressoud and Fred B. Schneider. Hypervisor-Based Fault Tolerance. ACM Transactions on Computing Systems, 14:80–107, February 1996.
72 Avi Kivity. KVM: The Linux Virtual Machine Monitor. In The Ottawa Linux Symposium, pages 225–230, July 2007.

Shye's process-level redundancy (PLR) moves replication to the operating system level and replicates Linux processes.73 When launching a program, PLR instantiates multiple replicas as well as a replica manager. PLR uses a binary recompiler to rewrite the program in a way that all system calls are redirected to the manager process for error detection. Using this approach, PLR does not require source code availability. Furthermore, by distributing replicas across concurrent CPU cores, PLR achieves low execution time overheads of less than 50% on average for the SPEC CPU 2000 benchmarks in triple-modular redundant mode.

PLR's architecture motivates the ROMAIN replication service I present in this thesis. As an improvement over PLR, I implement ROMAIN as an operating system service that reuses existing OS infrastructure instead of relying on a complex binary recompiler. This approach significantly reduces the code complexity of the replication mechanism, because binary recompilers such as Valgrind74 comprise more than 100,000 lines of C code. Furthermore, ROMAIN supports replication of multithreaded applications.

A recent work by Mushtaq shares these improvements.75 In contrast to ROMAIN, their work assumes ECC-protected memory and differs in recovery overhead. I will discuss these differences more thoroughly when I discuss multithreaded replication in Chapter 4.

Zhang's DAFT compiler combines compiler-assisted fault tolerance with redundant multithreading.76 Instead of encoding data differently, DAFT uses multiple software threads for redundant execution. The compiler then only inserts additional code to exchange and compare compute results between replicas. DAFT furthermore executes code speculatively within its sphere of replication and thereby achieves an execution time overhead of less than 40%.

73 A. Shye, J. Blomstedt, T. Moseley, V. J. Reddi, and D. A. Connors. PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures. IEEE Transactions on Dependable and Secure Computing, 6(2):135–148, 2009.
74 Nicholas Nethercote and Julian Seward. Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '07, pages 89–100, New York, NY, USA, 2007. ACM.
75 Hamid Mushtaq, Zaid Al-Ars, and Koen L. M. Bertels. Efficient Software Based Fault Tolerance Approach on Multicore Platforms. In Design, Automation & Test in Europe Conference, Grenoble, France, March 2013.
76 Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August. DAFT: Decoupled Acyclic Fault Tolerance. In International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pages 87–98, Vienna, Austria, 2010. ACM.
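A drastically simplified sketch of the process-level replication idea: a manager forks three replicas of a deterministic workload, collects one result word from each through a pipe, and majority-votes over the results. Real systems such as PLR or ROMAIN intercept all system calls and compare complete externalized state rather than a single value; the workload and protocol here are invented for illustration.

#include <array>
#include <cstdint>
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

// Deterministic workload standing in for a real, replicated program.
static uint64_t workload() {
    uint64_t acc = 0;
    for (uint64_t i = 1; i <= 1000000; ++i) acc += i * i;
    return acc;
}

int main() {
    constexpr int N = 3;                         // triple-modular redundant mode
    std::array<uint64_t, N> result{};

    for (int r = 0; r < N; ++r) {
        int fd[2];
        if (pipe(fd) != 0) return 1;
        pid_t pid = fork();
        if (pid == 0) {                          // replica: run and externalize result
            uint64_t v = workload();
            write(fd[1], &v, sizeof v);
            _exit(0);
        }
        read(fd[0], &result[r], sizeof result[r]);   // manager: collect output
        close(fd[0]); close(fd[1]);
        waitpid(pid, nullptr, 0);
    }

    // Majority vote as the error detection/correction step.
    bool     agreement = (result[0] == result[1]) || (result[1] == result[2]) ||
                         (result[0] == result[2]);
    uint64_t winner    = (result[0] == result[1] || result[0] == result[2])
                             ? result[0] : result[1];
    std::printf("result=%llu, %s\n", (unsigned long long)winner,
                agreement ? "majority found" : "no majority");
    return agreement ? 0 : 1;
}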

In the context of high-performance computing, Fiala and colleagues noted that HPC applications traditionally use checkpoint/restart mechanisms to tolerate failing nodes. They modeled the cost of such strategies in future exascale systems. Based on these models they claim that, if current hardware failure trends remain constant, the overhead involved in checkpointing will require more than 80% of the total compute time in 10,000-node systems. This observation makes replication – running replicated processes concurrently on the large number of available CPUs – a feasible alternative. Fiala therefore proposed a replication extension for the Message Passing Interface (MPI) that replicates MPI processes as well as the communication among those programs.77

77 David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 78:1–78:12, Salt Lake City, Utah, 2012. IEEE Computer Society Press.

SUMMARY: Software-implemented fault tolerance solutions can be categorized into two classes: compiler-assisted fault tolerance and replication-based approaches. Compiler-level solutions generate machine code that includes additional checksums and validation of results. They require the protected program's source code to be available.
In contrast, replication-based approaches protect software by running it in multiple instances and comparing these instances' results. Replication can be applied at various levels, ranging from virtual machines through operating system processes down to software threads. While these solutions support binary-only software, they often come with a high resource overhead.

2.4.3 Recovery: What to do When Things go Wrong?

Detecting an error before it becomes a failure is only one half of what a fault-tolerant system needs to do. In order to provide availability by the definition introduced in Section 2.2.2, the system also needs to react by correcting the error and delivering a correct result. This process is called recovery. Many implementations of recovery mechanisms exist. Randell categorizes them into two classes: backward and forward recovery.78

Backward (or Rollback) Recovery comprises solutions that — upon detecting an error — return the system into a previous state that is assumed to be error-free. The most intuitive version of such a mechanism is to simply terminate the erroneous software component and restart it.79 Unfortunately, with this approach all non-persistent application state is lost. To address this issue, checkpoint/rollback systems periodically create copies of important application data. Upon a restart, the most recently stored version can be recovered.80 Still, all results computed after the last checkpoint are lost and need to be recomputed. This leads to an increased time to repair.

In contrast to rollback, forward recovery tries to repair the erroneous component so that it can continue without the need for re-execution. Replication-based mechanisms implement forward recovery using majority voting. If state comparison detects a mismatch between replicas, the majority of replicas is assumed to be correct and the mismatching replicas are overwritten. This concept has been formalized by Schneider as t-fault tolerance: a t-fault-tolerant system can handle t erroneous components. In order to perform successful recovery, 2t + 1 replicas are required.81 As a consequence, replication requires a lower time to repair, but in turn may require a larger amount of resources and energy during normal operation.

While replication is able to recover from transient errors, it will not fix permanent ones, because a replica encountering a permanent error will simply suffer from it again. A forward recovery method to deal with permanently broken hardware is to adapt the running system.

78 Brian Randell, Peter A. Lee, and Philip C. Treleaven. Reliability Issues in Computing System Design. ACM Computing Surveys, 10(2):123–165, June 1978.
79 Shubhendu Mukherjee. Architecture Design for Soft Errors. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.
80 Jason Ansel, Kapil Arya, and Gene Cooperman. DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop. In 23rd IEEE International Parallel and Distributed Processing Symposium, Rome, Italy, May 2009.
81 Fred B. Schneider. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys, 22(4):299–319, December 1990.
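Returning to rollback recovery, the following sketch shows a minimal in-memory version of the checkpoint/rollback scheme described above. Systems such as DMTCP checkpoint entire process images; this example only saves application-defined state, and the state layout and step hook are invented for illustration.

#include <cstdint>
#include <vector>

// Application-defined state to be protected.
struct State {
    uint64_t iteration = 0;
    std::vector<double> data;
};

class Checkpointer {
    State saved_;
public:
    void checkpoint(const State &s) { saved_ = s; }   // copy out current state
    void rollback(State &s) const   { s = saved_; }   // restore last good state
};

// Driver: checkpoint every `interval` iterations. `step` performs one unit of
// work on the state and returns false if its validation (e.g., an executable
// assertion) detects corruption, in which case we roll back and recompute.
void run(State &s, Checkpointer &cp, uint64_t total, uint64_t interval,
         bool (*step)(State &))
{
    cp.checkpoint(s);
    while (s.iteration < total) {
        if (!step(s)) {              // error detected
            cp.rollback(s);          // backward recovery: lose at most `interval` steps
            continue;
        }
        if (++s.iteration % interval == 0)
            cp.checkpoint(s);
    }
}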

This may include turning off broken CPU cores and running the respective replicas elsewhere. Alternatively, malfunctioning hardware may be worked around by using a software implementation that does not leverage the broken hardware units. As an example, Meixner developed Detouring, a compiler solution to deal with permanent errors in floating point hardware. Upon detecting a broken FPU, Detouring switches to program paths that use software-level floating point computations, which are slower but work solely using integer arithmetic.82

A different line of research argues that future hardware and software will suffer from so many errors that developers should rather try to live with them instead of trying to build correct systems. Palem suggested using probabilistic hardware that generates results that are correct within certain thresholds. He showed that the resulting hardware may be smaller and less energy-demanding than standard processors.83 Unfortunately, with such an approach all software needs to be redesigned to cope with hardware-level uncertainty. It remains to be shown whether this prerequisite is easier to meet than building traditional fault-tolerant systems.

82 Albert Meixner and Daniel J. Sorin. Detouring: Translating Software to Circumvent Hard Faults in Simple Cores. In International Conference on Dependable Systems and Networks (DSN), pages 80–89, 2008.
83 Krishna V. Palem, Lakshmi N.B. Chakrapani, Zvi M. Kedem, Avinash Lingamneni, and Kirthi Krishna Muntimadugu. Sustaining Moore's Law in Embedded Computing Through Probabilistic and Approximate Design: Retrospects and Prospects. In International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES '09, pages 1–10, Grenoble, France, 2009. ACM.

SUMMARY: Error recovery mechanisms can be distinguished into backward and forward recovery. In general, these mechanisms are orthogonal to error detection. Therefore, most of the previously discussed error detection mechanisms can be combined with any suitable recovery technique.

2.4.4 Fault-Tolerant Operating Systems

Research in operating system fault tolerance is mostly concerned with software errors. Researchers and developers argue that software bugs, especially in device drivers and other hardware-related code, are the main reason for failures in today's systems.84 In this section I take a closer look at how operating systems deal with software errors. I argue that their main design principles – such as the use of micro-rebootable components – may be employed for tolerating hardware faults as well.

Building Reliable Operating Systems   Operating system crashes, such as the infamous Windows blue screens or Linux' kernel panics, are a major annoyance for computer users. When the system reaches panic mode, it has already failed and a reboot is the only way of returning into a working state. Reboots may take several minutes and therefore have a major impact on a system's availability ratio.85

To improve recovery times after software crashes, Candea proposed to design systems to be micro-rebootable.86 Such systems are built from many small, isolated components that do not share state. If one of these components fails, it is enough to restart this single component instead of rebooting the whole machine. Hence, service downtime is reduced drastically.

Componentization comes with another advantage: in traditional — monolithic — operating systems, device drivers are a main source of software failures. Swift's Nooks system demonstrated that these drivers can be isolated into separate Linux address spaces.87

84 Nicolas Palix, Gaël Thomas, Suman Saha, Christophe Calvès, Julia Lawall, and Gilles Muller. Faults in Linux: Ten Years Later. In International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '11, pages 305–318, Newport Beach, California, USA, 2011. ACM.
85 Alex Depoutovitch and Michael Stumm. Otherworld: Giving Applications a Chance to Survive OS Kernel Crashes. In European Conference on Computer Systems, EuroSys '10, pages 181–194, Paris, France, 2010. ACM.
86 George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox. Microreboot: A Technique For Cheap Recovery. In Symposium on Operating Systems Design & Implementation, OSDI'04, Berkeley, CA, USA, 2004. USENIX Association.
87 Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, and Henry M. Levy. Recovering Device Drivers. ACM Transactions on Computing Systems, 24(4):333–360, November 2006.

This approach protects unrelated kernel data from being overwritten by a faulty device driver and thereby increases system reliability.87

Operating systems focusing on the isolation of components are nothing new. These design principles have been advocated by the microkernel community since the 1980s.88 Microkernels move traditional kernel services, such as file systems, network stacks, and device drivers, out of privileged kernel mode into isolated user-level applications. Microkernel proponents argue that this improved isolation leads to an increase in scalability, portability, and security. While early microkernels were dismissed for their performance overheads, later kernel generations showed that the improved isolation properties can be gained at a low cost.89

Minix3 is a microkernel-based operating system that was specifically designed to tolerate software failures.90 When a Minix3 process crashes or stops sending heartbeat messages, a system-wide manager terminates and restarts the respective program. Thereafter, all other applications are notified of this restart, so that they can adapt to this situation, for instance by resending outstanding service requests. This approach works best for stateless processes, such as device drivers. Based on Minix3, Giuffrida investigated how stateful services can be restructured to fit into this paradigm. He showed that applications can be written in a transactional manner where either all state is written to a central state storage upon commit, or an operation is rolled back if the application crashes while processing it.91

Formally Verified Systems Code   Programming errors often lead to software failures. In the worst case, attackers can exploit these bugs to attack the system. Ryzhyk analyzed failing systems code and found that for device drivers, these errors are often not syntactic or algorithmic errors but instead stem from subtle misunderstandings regarding the hardware or operating system interface.92 Based on this observation he proposed to formalize software development by creating well-defined models of the underlying software and hardware components and then have a compiler automatically generate code that adheres to these models.93

The idea of generating code from formal models to improve security is also found in safety-critical systems, such as the Partitioned Operating System Kernel (POK).94 However, creating the respective models requires a non-negligible manual effort. The involved cost is therefore only spent on building critical systems, whereas standard consumer electronics rather live with occasional crashes to reduce development cost.

In contrast to monolithic operating systems that consist of hundreds of thousands of lines of code, microkernels have a relatively small code size.95 It is therefore possible to model such a kernel and formally verify that this model adheres to well-defined properties. This approach is prohibitive for large systems software because building the formal models requires a huge manual effort. Klein et al. were able to formally prove the correctness of the seL4 microkernel.96

88 Mike Accetta, Robert Baron, William Bolosky, David Golub, Richard Rashid, Avadis Tevanian, and Michael Young. Mach: A New Kernel Foundation for UNIX Development. In USENIX Technical Conference, pages 93–112, 1986.
89 Kevin Elphinstone and Gernot Heiser. From L3 to seL4: What Have We Learnt in 20 Years of L4 Microkernels? In Symposium on Operating Systems Principles, SOSP'13, pages 133–150, Farminton, Pennsylvania, 2013. ACM.
90 Jorrit N. Herder. Building a Dependable Operating System: Fault Tolerance in MINIX3. Dissertation, Vrije Universiteit Amsterdam, 2010.
91 Cristiano Giuffrida, Lorenzo Cavallaro, and Andrew S. Tanenbaum. We Crashed, Now What? In Workshop on Hot Topics in System Dependability, HotDep'10, Vancouver, BC, Canada, 2010. USENIX Association.
92 Leonid Ryzhyk, Peter Chubb, Ihor Kuz, and Gernot Heiser. Dingo: Taming Device Drivers. In ACM European Conference on Computer Systems, EuroSys '09, pages 275–288, Nuremberg, Germany, 2009. ACM.
93 Leonid Ryzhyk, Peter Chubb, Ihor Kuz, Etienne Le Sueur, and Gernot Heiser. Automatic Device Driver Synthesis with Termite. In Symposium on Operating Systems Principles, SOSP '09, pages 73–86, Big Sky, Montana, USA, 2009. ACM.
94 Julian Delange and Laurent Lec. POK, an ARINC653-Compliant Operating System Released Under the BSD License. In Realtime Linux Workshop, RTLWS'11, 2011.
95 Kevin Elphinstone and Gernot Heiser. From L3 to seL4: What Have We Learnt in 20 Years of L4 Microkernels? In Symposium on Operating Systems Principles, SOSP'13, pages 133–150, Farminton, Pennsylvania, 2013. ACM.
96 Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. seL4: Formal Verification of an OS Kernel. In Symposium on Operating Systems Principles, SOSP'09, pages 207–220, Big Sky, MT, USA, October 2009. ACM.
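As a rough sketch of the restart-based approach, the following supervision loop mimics what Minix3's reincarnation server does: a component whose heartbeat stays silent for too long is assumed to have failed and is restarted. The component table and restart hook are invented for illustration; in Minix3 the manager would additionally notify other processes so they can resend outstanding requests.

#include <chrono>
#include <functional>
#include <string>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

// One supervised component: it periodically refreshes `last_heartbeat`;
// `restart` tears it down and starts a fresh instance.
struct Component {
    std::string           name;
    Clock::time_point     last_heartbeat;
    std::function<void()> restart;
};

// Supervision loop: micro-reboot only the failed component, not the system.
void supervise(std::vector<Component> &components, std::chrono::seconds timeout)
{
    for (;;) {
        auto now = Clock::now();
        for (auto &c : components) {
            if (now - c.last_heartbeat > timeout) {
                c.restart();                      // restart just this component
                c.last_heartbeat = Clock::now();
            }
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}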

While formal verification of practical code is still in an early phase, efforts are underway to not only prove the correctness of a microkernel, but to also include other operating system components, such as file systems.97 It is therefore safe to assume that future systems software is going to contain far fewer programming errors than today. However, all these formal efforts still assume that the hardware their software runs on behaves correctly. As I have shown in the previous sections, this is not necessarily the case. We therefore need additional hardware or software layers that allow us to deal with faulty hardware.

97 Gabriele Keller, Toby Murray, Sidney Amani, Liam O'Connor, Zilin Chen, Leonid Ryzhyk, Gerwin Klein, and Gernot Heiser. File Systems Deserve Verification Too! In Workshop on Programming Languages and Operating Systems, PLOS '13, pages 1:1–1:7, Farmington, Pennsylvania, 2013. ACM.

SUMMARY: Operating system research has largely focused on hardening the kernel against software errors. A commonly found concept to tolerate software failures is to isolate components into separate address spaces, so that a malfunctioning component can only harm itself. Formal modeling and verification of systems components strives to drastically reduce programming errors and the resulting failures. All these efforts assume hardware to work correctly. In order to really protect a system, we need additional measures to tolerate hardware faults.

2.5 Thesis Goals and Design Decisions

In this thesis I develop a fault-tolerant operating system architecture, which I will call ASTEROID from now on. ASTEROID aims to protect software from transient and permanent faults arising at the hardware level. Based on the observations presented in the previous sections, I derive the following design goals, which I aim to accomplish:

1. Support for COTS Hardware: Building fault-tolerant hardware components incurs additional design cost, chip area, as well as execution time overhead. ASTEROID strives to avoid these costs by solely relying on features that are available in commercial-off-the-shelf hardware, so that its results are applicable to other existing COTS platforms. For some hardware mechanisms, such as ECC-protected memory, it is hard to say whether they are already a commodity or still considered specialized. In such situations I will conservatively assume these features to be unavailable.

2. Exploit Hardware-Level Concurrency: Modern hardware platforms usually contain multiple CPU cores in combination with abundant memory and complex cache hierarchies. It is likely that the number of cores will continue to grow, although not all cores might be powered at the same time anymore.98
These practical realities make the use of replication-based fault tolerance feasible, and ASTEROID therefore uses replication as the foundation for detecting and correcting errors. Additionally, I will in this context also have a look at the interaction between caches, memory, and CPU cores in modern platforms in order to make informed decisions about resource allocation and replica placement.

98 Hadi Esmaeilzadeh et al. Dark Silicon and the End of Multicore Scaling. ISCA'11, 2011.

3. Efficient Error Detection and Correction: Hardware faults are still rare enough that a system will run error-free for most of its lifetime. Error detection mechanisms should therefore aim to decrease execution time overhead. Correction needs to be fast in order to reduce its impact on system availability.
ASTEROID replicates software on distinct physical CPU cores, allowing replicas to run independently as long as possible. I implement a replication service leveraging the redundant multithreading concept in order to minimize execution time overhead.
Recovery mechanisms are orthogonal to error detection techniques. ASTEROID supports various ways of recovery, including application restart, checkpoint/rollback, as well as majority voting. Restart and checkpointing have already been considered in previous work.99 I will focus on majority voting as a recovery technique in this thesis.

4. Use a Componentized Operating System: Previous work in OS-level fault tolerance has shown that isolating OS components into separate processes is useful with respect to tolerating software faults. I believe that this assumption holds with respect to hardware errors and will therefore base ASTEROID on a microkernel. I use the FIASCO.OC microkernel,100 because this kernel is freely available and has a proven track record of working in different scenarios, such as real-time, security, and virtualization.

5. Binary Application Support: Modern software is often distributed in its binary form and the source code is unavailable to the general public. ASTEROID protects any existing program without relying on parsing or modifying its source code. I do not leverage compiler-based fault tolerance mechanisms in this thesis. However, I will discuss situations in which compiler support may be useful and evaluate what impact this would have on ASTEROID in general.

99 Dirk Vogt, Björn Döbel, and Adam Lackorzynski. Stay Strong, Stay Safe: Enhancing Reliability of a Secure Operating System. IIDS'10, 2010.
100 Adam Lackorzynski and Alexander Warg. Taming Subsystems: Capabilities as Universal Resource Access Control in L4. IIES'09, 2009.

6. Support Multithreaded Programs: With the abundance of CPU cores in modern hardware, software developers speed up program execution times by parallelizing compute operations. The OS kernel typically implements threads and schedules them on the available CPU cores. Scheduling multi-threaded applications introduces non-determinism into the system. Determinism is however a prerequisite for replication. I will present a solution to replicate multithreaded applications.

7. Protect the Reliable Computing Base: Software-level fault tolerance mechanisms suffice to protect execution of user-level programs as well as high-level operating system services against hardware errors. However, most of these mechanisms implicitly rely on the fact that at least parts of the underlying hardware and software stack always function correctly. I call these components the Reliable Computing Base (RCB). I will investigate ASTEROID's RCB in this thesis and propose ways of hardening this part of the system against hardware faults.

SUMMARY: In this chapter I reviewed hardware effects that may lead to erroneous behavior and surveyed existing research that tries to mitigate these problems. From this survey I derived requirements and design goals for ASTEROID, the fault-tolerant operating system architecture I present in this thesis. ASTEROID aims to leverage replicated execution to provide error detection and correction to binary-only programs. The architecture relies on modern multi-core COTS hardware and aims to protect all parts of the software stack against the effects of hardware errors.

In the remainder of this thesis I present how ASTEROID achieves the above goals. In Chapter 3, I give an overview of ASTEROID and introduce ROMAIN, an operating system service for replicated execution on top of FIASCO.OC. Thereafter, I describe how to extend ROMAIN to replicate multithreaded applications in Chapter 4 and evaluate ASTEROID's overhead and error detection capabilities in Chapter 5. Finally, I investigate ASTEROID's Reliable Computing Base in Chapter 6.

3 Redundant Multithreading as an Operating System Service

In the previous chapter I explained that hardware faults do occur in today's systems and that their rate is likely to increase in future hardware generations. The goal of my thesis is to develop an operating system architecture that detects errors by replicating applications and that uses majority voting to recover from errors. The main focus of this chapter is ROMAIN, an operating-system service implementing redundant multithreading for applications running on top of FIASCO.OC. I describe how ROMAIN interposes itself between application replicas and the operating system kernel. I present mechanisms to manage replicas' resources. Furthermore, I describe how ROMAIN provides error detection and recovery based on these mechanisms. The ideas and decisions described in this chapter were originally published in EMSOFT 2012.1 and SOBRES 2013.2

1 Björn Döbel, Hermann Härtig, and Michael Engel. Operating System Support for Redundant Multithreading. EMSOFT'12, 2012.
2 Björn Döbel and Hermann Härtig. Where Have all the Cycles Gone? – Investigating Runtime Overheads of OS-Assisted Replication. SOBRES'13, 2013.

3.1 Architectural Overview

In order to protect platforms based on commercial-off-the-shelf (COTS) hardware components, we need to apply software-level fault tolerance techniques. A large fraction of today's software is only available in binary form and vendors do not provide access to the source code. It is therefore infeasible to protect the whole software stack solely using existing compiler-level fault tolerance methods.
The operating system lies at the boundary between hardware and software and is in control of all programs. Placing a fault tolerance mechanism at the OS level therefore protects the largest possible set of applications. However, fault tolerance is potentially expensive in terms of execution overhead and resource requirements. The OS should therefore not enforce a mechanism on all applications but allow the system designer to selectively protect programs. This approach has two advantages over full-system replication:
1. Applications that have been implemented using fault-tolerant algorithms or programs that were compiled using a fault-tolerant compiler take care of fault tolerance themselves. These applications do not require additional support from the OS.

2. In resource-constrained environments, a user may be willing to accept a failure in an unimportant application while still wanting to protect another, more important program. For the above reasons, I structured the ASTEROID fault-tolerant operating system design as shown in Figure 3.1. ASTEROID uses software-level replication to detect and correct errors that manifest as the effects of hardware faults. It runs on COTS hardware components and protects binary-only applications. This approach makes fault tolerance transparent to user-level applications and developers do not have to take specific precautions to counter hardware errors.

[Figure 3.1: ASTEROID System Architecture – the ROMAIN replication service extends the L4 Runtime Environment (L4Re) on top of the FIASCO.OC microkernel. Replication on a per-process basis allows to integrate both replicated and unreplicated applications into the system. Using a microkernel architecture, replication also covers traditional OS services, such as file systems.]

ASTEROID replicates on a per-application basis using an OS service named ROMAIN. This design decision allows users to selectively turn on replication for applications that require this service. ASTEROID's sphere of replication is a process. As long as a process only performs internal computation, there is no interaction with the replication service or the kernel. Only when application state becomes visible to outside observers are replicas compared for state deviations that indicate an error. This redundant multithreading approach reduces the execution time overhead imposed by the replication service. This benefit does not come for free, because replication increases memory and CPU usage compared to native execution.
In the previous chapter we saw that splitting software into small, isolated components improves system reliability. Hence, a microkernel is a natural choice for building a fault-tolerant operating system. ASTEROID is based on the FIASCO.OC microkernel developed at TU Dresden. In addition to its isolation properties, a microkernel foundation offers another advantage: as traditional operating system services – such as device drivers,3 file systems,4 and networking stacks5 – run in user space, ROMAIN can replicate them and thereby provide protection against hardware faults.6
ASTEROID reuses L4Re, FIASCO.OC's existing user-level runtime environment.7 In the remainder of this chapter I focus on my extension to this system, the ROMAIN replication service. I describe how ROMAIN adds replication capabilities to L4Re and how it detects and recovers from the effects of hardware errors.

3 Joshua LeVasseur et al. Unmodified Device Driver Reuse and Improved System Dependability via Virtual Machines. OSDI'04, 2004.
4 Carsten Weinhold and Hermann Härtig. jVPFS: Adding Robustness to a Secure Stacked File System with Untrusted Local Storage Components. USENIX ATC'11, 2011.
5 Tomas Hruby et al. Keep Net Working – On a Dependable and Fast Networking Stack. DSN'12, 2012.
6 We will see in Chapter 7 that replication of device drivers is still an open issue.
7 http://l4re.org

3.2 Process Replication

ROMAIN – the Robust Multithreaded Application Infrastructure – is an extension to L4Re that allows replicated execution of single processes. It provides replication as a form of redundant multithreading as described in Section 2.4.1 on page 26 and leverages software threads provided by the FIASCO.OC kernel. Figure 3.2 shows ROMAIN's architecture for a triple-modular redundant setup.

[Figure 3.2: ROMAIN Architecture – the replicas of a replicated application run on top of the ROMAIN master, which provides resource management and a system call proxy.]

Replicas execute a single instance of a protected application. They run in separate address spaces to achieve fault isolation and prevent a failing replica from overwriting the correct state of other replicas. Users can configure the number of replicas at application startup time and thereby create arbitrary n-modular redundant setups. For every replicated application, a master process is responsible for managing replicas, validating their states, and performing error recovery. The ROMAIN master process has three tasks:
1. Binary Loading: During application startup, the master acts as a program loader. According to the configured number of replicas, the master creates address spaces and loads the respective binary code and data segments from the executable file. Thereafter, the master creates a new thread for every replica and makes these threads start executing program code within their respective address spaces.
2. Resource Management: Redundant multithreading executes replicas independently while they only modify their internal state. In terms of ROMAIN a replica's internal state comprises:
• Replica-owned memory regions,
• Each replica's view on FIASCO.OC kernel objects, and
• Each thread's CPU register state.
The master process maintains full control over the above resources for every replica. Thereby, the master ensures that the replicas always receive identical inputs. The replicas then execute identical code and will produce identical outputs as long as they are not affected by a hardware fault. We will see in Chapter 4 that we need to take additional precautions to handle multithreaded replicas, because these programs may suffer from scheduling-related non-determinism, which the master process needs to cope with as well. In the scope of this chapter I will however assume single-threaded replicas.

3. Error Detection and Correction: The master process detects and corrects errors by monitoring replica outputs. In the context of ROMAIN a replica output is any event that makes application state visible to the outside world. These events include system calls, CPU exceptions (such as page faults), as well as writes to memory regions that are shared with other applications. I will refer to these events as externalization events in the remainder of this thesis.

3.3 Tracking Externalization Events

To remain in control over the replicas’ states, the master process needs a way to intercept externalization events. Shye’s Process Level Redundancy (PLR), which I described in Section 2.4.2 on page 32, applies binary recompilation for this purpose and rewrites the replicated program so that all system calls are reflected to the PLR replication manager.

Why not use Binary Rewriting?   Binary rewriting is a complex task8 and applying it to general-purpose programs needs to solve two problems: First, we need to identify all binary instructions in order to instrument them. This is difficult for instruction set architectures with variable-length instructions, such as Intel x86. The rewriter here needs to disassemble all instructions step by step. Furthermore, dynamic branches due to function pointers and register-indirect addressing mean that not all instructions can be rewritten statically, because runtime information is needed to identify the targets of dynamic branches. The rewriting process therefore needs to be carried out incrementally.
The second issue with binary rewriting is how to instrument code. Naively, we might assume that instrumentation would simply replace the instrumented instruction with a call to an external instrumentation handler. Unfortunately, given x86's variable instruction lengths this does not work: instructions may be as short as a single byte, but a call instruction requires 5 bytes to be overwritten.
Solving these problems correctly and efficiently is difficult9 and out of scope for this thesis. Furthermore, Kedia's efficient binary translation work reports an execution time overhead of about 10% for application workloads.8 This is the base overhead even a very efficient replication service would have to work with. I therefore decided to avoid binary rewriting and instead rely on existing FIASCO.OC infrastructure for intercepting externalization events.

8 Piyus Kedia and Sorav Bansal. Fast Dynamic Binary Translation for the Kernel. SOSP'13, 2013.
9 Derek Bruening and Qin Zhao. Practical Memory Checking with Dr. Memory. CGO'11, 2011.

Virtual CPUs   ROMAIN tracks a replicated application's externalization events using a software exception mechanism implemented by the FIASCO.OC kernel. Whenever a monitored thread performs an activity that becomes visible to the kernel – such as issuing a system call or raising a page fault – FIASCO.OC notifies a user-level exception handler of this fact.
In previous versions of FIASCO.OC, developers had to distinguish between different types of exceptions (page faults, system calls, protection faults) and register specific handler threads for these events.10 In its most recent version, FIASCO.OC unifies all types of exception handling into a single mechanism called a virtual CPU (vCPU).

10 Adam Lackorzynski. Porting Optimizations. Diploma thesis, TU Dresden, 2004.

vCPUs constitute threads that execute within the bounds of an address space and are subject to the kernel's scheduling as well as resource and access limitations. Besides being easier to program against, the vCPU exception handling model requires fewer kernel resources and is slightly faster.11
vCPUs differ from normal threads in the way system calls and exceptions are handled by the kernel. I illustrate this handling in Figure 3.3. Instead of directly handling a CPU trap (1), the kernel delivers this trap as an exception message to a user-level handler process (2). This exception handler inspects the vCPU's register state and then reacts upon the exception by modifying this state or the vCPU's resource mappings (3). Thereafter the handler instructs the kernel (4) to resume execution of the vCPU (5).

11 Adam Lackorzynski, Alexander Warg, and Michael Peter. Generic Virtualization with Virtual Processors. Twelfth Real-Time Linux Workshop, 2010.

[Figure 3.3: FIASCO.OC – Handling CPU Exceptions in User Space. A trap raised by a vCPU (1) is delivered by the FIASCO.OC kernel as an exception message to a user-level exception handler (2), which inspects and modifies the vCPU state (3) and instructs the kernel (4) to resume the vCPU (5).]

Using the vCPU model, the external exception handler can intercept all CPU traps caused by a program. These traps include system calls, page faults, as well as all other traps defined by the x86 manual.12 Hardware- and software-induced traps are the only way for an x86 program to break out of user-mode execution and communicate with other applications through a system call. ROMAIN uses vCPUs for executing replica code. The master process takes over the role of the external exception handler and thereby intercepts all such externalization events. We will see in Chapter 5 that this way of inspecting replicas has an execution time overhead of close to 0% for application benchmarks. It is therefore more efficient and less complex than applying binary rewriting.

12 Intel Corp. Intel 64 and IA-32 Architectures Software Developer's Manual. Technical documentation at http://www.intel.com, 2013.
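To make the interception flow more concrete, the following sketch shows what a minimal exception handling loop on top of such a vCPU mechanism could look like. It is an illustration only: the types and functions (VcpuState, wait_for_vcpu_exit(), resume_vcpu(), and the trap classification) are hypothetical stand-ins and not the actual FIASCO.OC or ROMAIN interface.

// Illustrative sketch of a vCPU exception handling loop in the master.
// All types and functions are hypothetical stand-ins.

#include <cstdint>
#include <cstdio>

enum class TrapType { Syscall, PageFault, Other };

struct VcpuState {                 // state snapshot delivered on a vCPU exit
    TrapType trap = TrapType::Other;
    uint64_t fault_address = 0;    // valid for page faults
};

// Stand-ins for the kernel interface: block until the vCPU traps (1)+(2),
// and resume it after the handler has updated its state (4)+(5).
static VcpuState wait_for_vcpu_exit() { return VcpuState{}; }
static void      resume_vcpu(const VcpuState&) {}

void exception_handler_loop() {
    for (int i = 0; i < 3; ++i) {              // bounded loop for illustration
        VcpuState state = wait_for_vcpu_exit();

        switch (state.trap) {                  // (3): inspect and react
        case TrapType::Syscall:
            std::puts("proxy or emulate the system call");
            break;
        case TrapType::PageFault:
            std::puts("establish a memory mapping for the faulting address");
            break;
        default:
            std::puts("handle or forward the CPU exception");
            break;
        }

        resume_vcpu(state);
    }
}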

Limitations of Using vCPUs   There is one type of externalization event that the ROMAIN master cannot intercept using the vCPU model: reads and writes to memory regions shared with external applications. If a memory region is mapped to an application, any further accesses to it will not raise an externally visible trap that the kernel can intercept. I will describe my solution to this problem in Section 3.6 on page 57.
A second drawback of my decision to use vCPUs is that ROMAIN depends on FIASCO.OC. The implementation cannot easily be transferred to another operating system, such as Linux. This problem can be solved by retrofitting the target OS with a vCPU implementation. However, this would require a substantial effort. Alternatively, we can exploit mechanisms in the target OS that provide features similar to the vCPU model. Florian Pester showed that a ROMAIN implementation on Linux is possible13 by leveraging Linux' kernel-level virtual machine (KVM) feature.14

13 Florian Pester. ELK Herder: Replicating Linux Processes with Virtual Machines. Diploma thesis, TU Dresden, 2014.
14 Avi Kivity. KVM: The Linux Virtual Machine Monitor. The Ottawa Linux Symposium, 2007.

Replication using vCPUs To replicate a program, ROMAIN launches one vCPU for each replica. A replica vCPU then executes inside a dedicated ad- dress space as explained in the previous section. System calls and other CPU exceptions are handled as depicted in Figure 3.4. (I do not show involvement of the microkernel for better readability.)

[Figure 3.4: ROMAIN – Handling of externalization events. Replicas 1 and 2 run independently until each raises an externalization event (1, 2); the ROMAIN master then validates their states and processes the event (3), possibly issuing system calls itself (4), before resuming all replicas (5).]

In line with the concept of redundant multithreading, replicas execute independently until their next externalization event occurs. At this point the kernel delivers information about the event along with the faulting vCPU's state to the ROMAIN master process. Externalization acts as a barrier, blocking execution of all replicas that raise an event (1) until the last replica raises an event as well (2). Once all replicas arrive at their next externalization event, the master begins processing it (3).
The master first compares the replicas' states and detects and corrects potential errors. I will have a closer look at this stage in Section 3.8 on page 65. Once this phase is completed, the master is sure that all replicas agree on their state. The master then handles the actual event. Depending on the event type, the master performs different actions. While most system calls are simply proxied to the kernel, resource management requests are handled by the master itself. I will discuss in detail how replica resources are managed in Sections 3.4–3.7. During event processing the master may perform one or more additional system calls and allocate additional resources and kernel objects (4). Once the event is handled, the master updates the replicas' states according to the event's outcome and directs the vCPUs to continue execution (5).

Replica Event Processing   From a software engineering perspective, ROMAIN's event processing is implemented using the Observer design pattern.15 Each event observer is capable of handling a specific event type. After validating replica states, the master's exception handler iterates over a list of these independent observer objects. Each observer inspects the current event and decides whether the event can be handled. If the observer handled the event, processing is stopped and the replicas resume execution. Otherwise the event is passed to the next observer in the list.
Table 3.1 lists ROMAIN's most important observer objects. Listing 3.5 below gives an overview of the event processing steps described in this section as pseudo-code.

Table 3.1: ROMAIN event observers
Observer       Purpose
Syscalls       Proxies or emulates system calls (see Section 3.4)
PageFault      Handles memory faults and shared memory access (see Sections 3.5 and 3.6)
Time           Handles timing information, such as gettimeofday() (see Section 3.7)
Debug          Allows to set breakpoints and debug replicas
SWIFI          Performs fault injection experiments
LockObserver   Implements deterministic lock acquisition for replicating multithreaded programs (see Chapter 4)

15 Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.

Listing 3.5: ROMAIN processing of externalization events

void Master::handle_event()
{
    state_list = wait_for_all_replicas();

    compare_states_and_recover(state_list); // see Section 3.8

    // We know all states to be identical, so just
    // pick the first one.
    ReplicaState state = state_list.first();

    for (Observer o : eventObservers)
    {
        // Event processing, see Sections 3.4-3.7
        auto ret = o.process_event(state);

        if (ret == Event::Handled) // processing successful
        {
            for (ReplicaState replica : state_list) {
                // update from processed state
                replica.update(state);
                replica.resume();
            }
            return;
        }
    }

    ERROR("Invalid event detected!");
}
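For illustration, the following sketch shows one possible shape of the observer interface that Listing 3.5 iterates over, together with a trivial observer. It is a minimal, hypothetical sketch; the names (Observer, ReplicaState, Event::Handled) mirror the pseudo-code above, but the real ROMAIN class hierarchy may differ.

// Minimal sketch of the observer interface assumed by Listing 3.5.
// Names mirror the pseudo-code above; the actual ROMAIN classes may differ.

#include <cstdio>

struct ReplicaState { unsigned long trap_number = 0; /* registers, UTCB, ... */ };

namespace Event { enum Result { Handled, NotHandled }; }

class Observer {
public:
    virtual ~Observer() = default;
    // Inspect the (already validated) replica state and decide whether
    // this observer is responsible for the event.
    virtual Event::Result process_event(ReplicaState& state) = 0;
};

// Example observer: pretends to be responsible for system call traps only.
class SyscallObserver : public Observer {
public:
    Event::Result process_event(ReplicaState& state) override {
        const unsigned long kSyscallTrap = 0x30;   // illustrative value only
        if (state.trap_number != kSyscallTrap)
            return Event::NotHandled;              // let the next observer try
        std::puts("proxying system call on behalf of the replicas");
        return Event::Handled;
    }
};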

SUMMARY: ROMAIN replicates binary applications running on top of the FIASCO.OC microkernel. To make replicas deterministic, a master process needs to remain in control of all input going into a replicated application. Outputs need to be intercepted to validate replica states before they become visible to external applications. These properties are achieved using FIASCO.OC's vCPU mechanism.

3.4 Handling Replica System Calls

Whenever programs access OS services, they interact with the underlying kernel using system calls. The kernel provides services through specific kernel objects. The actual implementation of these objects varies: Unix-like systems – such as Linux – provide kernel services through files, whereas Windows uses service handles. FIASCO.OC enables object access using an object-capability system.16

16 Mark Miller et al. Capability Myths Demolished. Technical report, Johns Hopkins University, 2003.

Referencing Objects Through Capabilities   Before introducing system call replication, I explain FIASCO.OC's object model using Figure 3.6 as an example. The kernel manages objects (A, B, C) that represent execution abstractions (threads), address spaces (processes), and communication channels. These objects exist inside the kernel. The kernel implements access control for these objects using a per-process capability table that stores object references.

[Figure 3.6: FIASCO.OC's object-capability mechanism in action – Programs 1 and 2 each hold a capability table whose slots reference the kernel objects A, B, and C.]

Programs invoke kernel functionality by issuing a system call to a kernel object.

They denote the kernel object using an integer capability selector, which is interpreted by the kernel as an index into the process' capability table. In the example, Program 1 has access to objects A and C through capability

selectors 1 and 3. Program 2 has access to objects B and C through capability selectors 2 and 4. Programs can furthermore create new kernel objects. During creation a reference to the new object will be added to the creator’s capability table. The kernel does not track allocation of capability table entries – it is up to the application to decide where to place the new object reference.

System Call Types In order to perform a system call, an application needs to send parameters to the kernel. FIASCO.OC provides a per-thread memory region for this purpose, the user-level thread control block (UTCB). A calling thread puts parameters into its UTCB and then invokes a specific kernel object. The kernel then interprets the UTCB’s content and handles it depending on the type of invoked object. In the context of ROMAIN, a system call constitutes an externalization event where data exits the application’s sphere of replication. The master process intercepts these events and compares replica states. If they match, the event handler inspects system call parameters and distinguishes between kernel object management, messaging system calls, and resource mappings. 1. FIASCO.OC implements kernel objects, such as processes and threads. For the purpose of creating and managing these objects, the kernel pro- vides specific system calls. The ROMAIN master needs to intercept these calls and replicate kernel-level objects in order to implement redundant multithreading. For instance, when a replicated application creates a new thread, the master needs to ensure that a separate instance of this thread is created inside every single replica.

2. Messaging system calls send a message through FIASCO.OC’s Inter- Process Communication (IPC) mechanism. The kernel provides a specific object for this purpose, the IPC channel. Sending data through such a channel copies the message payload to a receiver. IPC system calls do not modify any kernel or application state. The master process therefore simply executes them using the replicas’ system call parameters. 3. Resource mappings are an extension to the IPC mechanism. In addition to a data payload they contain resource descriptors, which are called flexpages in FIASCO.OC terminology. Flexpages are used to transfer access rights to kernel objects to or from the calling thread. Flexpage IPC therefore modifies the state of a replicated application and requires special handling by the master process. On the following pages I first describe how ROMAIN handles data-only IPC messages. Thereafter I discuss the handling of object-capability map- pings. I defer the discussion of memory management and related issues to Sections 3.5 and 3.6.

3.4.1 Proxying IPC Messages

Microkernel-based operating systems implement most OS functionality inside isolated user-level applications. Clients use these OS services through a kernel-provided IPC mechanism. IPC therefore constitutes the largest fraction of system calls in any microkernel-based system and is considered to be most crucial when designing such a kernel.17

17 Jochen Liedtke. Improving IPC by Kernel Design. SOSP'93, 1993.

FIASCO.OC provides IPC through the previously mentioned communication channel kernel object. A sender thread puts data into its UTCB and issues a system call. The kernel then copies the data into the receiver thread's UTCB. Messages are limited by the UTCB payload size of 256 bytes. Aigner showed that this size is sufficient for most messaging use cases in microkernel-based systems.18
When detecting a messaging system call, the ROMAIN master sends the respective IPC message once to the specified target thread. Conceptually, it does so by copying the message from a replica's UTCB into its own UTCB and then performing a system call to the same communication channel that was originally invoked by the replica. Replica threads are always vCPUs, which are blocked while the master is processing a system call. Therefore we can in practice avoid the UTCB-to-UTCB copy. Instead, the master reuses the trapping replica's UTCB when it proxies the IPC message.
Sending a replicated IPC message only once enables replicated and unreplicated applications to coexist on the same system as shown in Figure 3.7: Unreplicated programs will only receive messages once, regardless of the number of replicas in the sending application. They will furthermore always send their messages to the master process, which then multiplexes incoming messages to the actual replicas.

[Figure 3.7: Integrating replicated and unreplicated applications – an unreplicated application communicates with the replicas through a single channel terminated at the ROMAIN master.]

However, this design decision makes the transmission of a single IPC message vulnerable to hardware errors. The same is true for any other execution within the master process and other interactions with the kernel, because these mechanisms remain outside ROMAIN's sphere of replication and therefore form a single point of failure for the ASTEROID system. I will return to this problem and possible solutions in Chapter 6.

18 Ronald Aigner. Communication in Microkernel-Based Systems. Dissertation, TU Dresden, 2011.
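To summarize the proxying step described in this section, the sketch below relays a replica's message exactly once and distributes the result to all replicas. All types and calls (Utcb, Capability, ipc_call) are hypothetical stand-ins, not the real L4Re/FIASCO.OC API.

// Illustrative sketch of proxying a replica's IPC message exactly once.
// All types and calls are hypothetical stand-ins for the kernel interface.

#include <cstdint>
#include <vector>

struct Utcb { uint8_t payload[256]; };        // message registers, 256-byte limit
struct Capability { unsigned selector; };

// Stand-in for performing the actual kernel IPC on a given UTCB.
void ipc_call(Capability channel, Utcb* utcb) { /* trap into the kernel */ }

struct Replica { Utcb* utcb; };

// The master relays the system call using the trapping replica's UTCB and
// copies the result back into every replica's UTCB afterwards.
void proxy_ipc(Capability channel, std::vector<Replica>& replicas)
{
    Utcb* source = replicas.front().utcb;     // reuse the trapping replica's UTCB
    ipc_call(channel, source);                // the message leaves the sphere of replication once

    for (Replica& r : replicas)               // distribute identical results
        if (r.utcb != source)
            *r.utcb = *source;
}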

3.4.2 Managing Replica Capabilities

In order to correctly redirect IPC messages, the master needs to know which kernel objects the replicas were trying to invoke. For this purpose all the capabilities of these objects need to be mapped into the master's capability space. New kernel objects are always created through a system call and the master intercepts all of these calls. Through this mechanism the master can always add such object mappings to its own capability table.
As mentioned in Section 3.4, the layout of a capability table is managed by each user application itself. L4Re applications do so by keeping a bitmap of used and unused capability slots in memory. Modifications of this bitmap are plain memory accesses, will never lead to a system call, and can therefore not be inspected by the master process. When replicating an application this leads to the problem shown in Figure 3.8.

[Figure 3.8: Replica capability selectors need to be translated into master capability selectors – the replicas' capability tables are always identical, but the master's table may assign different slots to the same objects.]

The figure shows the capability tables of two replicas and their master process after running for some time. As the replicas are deterministic, their capability tables are always laid out identically. Modifications to the capability table happen in the form of system calls, so that any divergence would be detected by the master process. In this example, two objects are mapped into slots 1 and 2. The remaining slots are still unused. The master process also gets access to the replicas' capabilities by mapping these objects into free slots inside its own capability table. Additionally, the master may allocate objects

for private use, so that its capability table contains more object references than the replicas’ tables. In the example, the master has allocated private objects into slots 2 and 3, whereas the replica-visible objects are mapped to slots 1 and 4.

A Capability Translation Mechanism   If the master wants to relay IPC messages to the destination requested by the replicas, it has to translate the capability selector specified by the calling replica into a valid selector in the master's capability table. An intuitive solution to this problem would be to maintain a Selector Lookup Table (SLT). This SLT is then used in the following three cases (a sketch follows below):
1. If the replicas request to add a kernel object to their capability tables at slot R, find an empty slot M in the master's capability table. Store R → M in the SLT. Rewrite the replica system call parameters to obtain the mapping into master capability slot M.19 Then perform the system call.
2. If the replicas perform a system call using an existing capability selector R, look up the corresponding master capability selector M from the SLT. Rewrite the system call parameters to use M. Then perform the system call.
3. If the replicas remove a capability R using the l4_fpage_unmap() system call, look up the master capability selector M from the SLT. Rewrite the system call parameters to delete M. Perform the system call. Finally, remove R → M from the SLT.

19 The master can rewrite replicas' system call parameters by modifying their UTCB and their register state. Both are available during vCPU exception handling.
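A minimal sketch of such a selector lookup table is shown below. It only illustrates the three cases listed above; the class and method names are hypothetical and not part of ROMAIN, which, as described next, avoids the SLT altogether.

// Hypothetical Selector Lookup Table (SLT) sketch for translating
// replica capability selectors (R) into master selectors (M).

#include <unordered_map>
#include <optional>
#include <cstdint>

using CapSelector = uint64_t;

class SelectorLookupTable {
public:
    // Case 1: replicas map a new object into slot R; the master picks M.
    CapSelector add(CapSelector replica_slot) {
        CapSelector master_slot = next_free_master_slot_++;
        table_[replica_slot] = master_slot;
        return master_slot;           // rewrite the system call to use M
    }

    // Case 2: replicas invoke an existing selector R.
    std::optional<CapSelector> translate(CapSelector replica_slot) const {
        auto it = table_.find(replica_slot);
        if (it == table_.end()) return std::nullopt;
        return it->second;
    }

    // Case 3: replicas unmap selector R; drop the mapping afterwards.
    std::optional<CapSelector> remove(CapSelector replica_slot) {
        auto it = table_.find(replica_slot);
        if (it == table_.end()) return std::nullopt;
        CapSelector master_slot = it->second;
        table_.erase(it);
        return master_slot;
    }

private:
    std::unordered_map<CapSelector, CapSelector> table_;
    CapSelector next_free_master_slot_ = 1;  // naive free-slot allocation
};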

Avoiding Capability Translation   Using an SLT to translate capability selectors is feasible, but adds rewriting and lookup overhead to every replicated system call. To avoid this overhead, ROMAIN uses partitioned capability tables as shown in Figure 3.9. As explained in Section 3.2 on page 41, the master acts as the program loader for its replicas. During program loading the master also sets up the replicas' initial memory regions. The replicas' capability bitmap resides at a fixed location in one of these memory regions.

[Figure 3.9: Partitioning capability tables allows the master to relay replica system calls without translation overhead.]

To partition the capability tables, the master marks the first 16,384 entries in the replicas' capability bitmaps as reserved by setting their bits to 1 during application loading.20 In turn, all but the first 16,384 entries in the master's capability bitmap are marked as used as well. Reserved regions are marked gray in Figure 3.9. As a result the replicas will always map kernel objects into capability slots R ≥ 16,384, whereas the master will allocate all its private objects within capability slots M < 16,384.
Using this partitioning approach, replica and master capability selectors will never overlap. Furthermore, all capability selectors used by replicas will have matching empty slots in the master's capability table. The master may therefore map copies of replicas' capabilities into the same slots in its own capability table, thereby creating a 1:1 mapping between replica and master capability selectors. For this reason the master neither has to perform any translation of capability selectors, nor does it need to rewrite replicas' system call parameters.
As a drawback, the partitioning approach reduces the number of available capability selectors for both the replicas and the master. However, this was not a problem in any of the experiments I conducted during this thesis.

20 I chose the number 16,384 after some experimentation and it sufficed for all applications that I used throughout this thesis. The number can be adapted if necessary.

FIASCO.OC’s possible capability space is much larger than the number of capabilities actually used by applications. Partitioning furthermore only works for applications where the master can locate the fixed capability management bitmap. This is the case for all applications that link against the L4Re libraries, which in turn is the default way of building FIASCO.OC applications. If an application did not use L4Re, the master would have to fall back to using an SLT as described above. However, I did not implement an SLT yet as this feature was not necessary for any of the experiments conducted in this thesis.
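As an illustration of the partitioning step, the sketch below marks the reserved regions of the capability bitmaps during application loading. It is a simplified model with hypothetical names; the real L4Re bitmap layout and allocator differ, and the total capability-space size is assumed.

// Sketch: partitioning capability bitmaps so that replica selectors (>= 16384)
// and master selectors (< 16384) never overlap. Names and sizes are illustrative.

#include <bitset>
#include <cstddef>

constexpr std::size_t kCapSpaceSize = 1u << 20;  // assumed total number of slots
constexpr std::size_t kPartition    = 16384;     // boundary used by ROMAIN

// bit set   == slot reserved/used
// bit clear == slot free for allocation
using CapBitmap = std::bitset<kCapSpaceSize>;

// Replica side: reserve the low slots so replicas allocate from >= 16384.
void partition_replica_bitmap(CapBitmap& bitmap) {
    for (std::size_t slot = 0; slot < kPartition; ++slot)
        bitmap.set(slot);
}

// Master side: keep only the low slots free for private objects.
void partition_master_bitmap(CapBitmap& bitmap) {
    for (std::size_t slot = kPartition; slot < kCapSpaceSize; ++slot)
        bitmap.set(slot);
}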

SUMMARY: System calls are the main way for an application to perform input and output. ROMAIN uses FIASCO.OC’s virtual CPU mechanism that allows to intercept any CPU exception raised by a replica. Upon a system call, the master compares replica states for error detection. If the replicas’ states match, the master performs the requested system call on behalf of the replicas. Afterwards, the replica vCPUs are modified as if they had done the system call themselves. Thereby the master ensures that identical inputs reach the replicas.

FIASCO.OC’s object-capability system maintains a table of capa- bility selectors for every process. These selector tables will always be identical between correct replicas. However, the master’s selector table may differ. To avoid complicated translations between replica and master capabilities, the master applies a partitioning scheme that allows 1:1 translation of replica to master capabilities.

3.5 Managing Replica Memory

In the previous sections I described replication techniques that allow the ROMAIN master to execute replicas inside isolated address spaces, intercept and monitor their system calls, and maintain control over all kernel objects that are used by a replicated application. Replicas furthermore access data in memory, which therefore needs to be managed as well.
From the perspective of redundant multithreading, we can distinguish between private and shared memory regions. Private memory, which is the focus of this section, is only used by an application internally and never becomes directly accessible to an outside observer. ROMAIN provides each replica with dedicated physical copies of private memory regions. After an initial setup, the replicas can use these copies without synchronizing with the master or other replicas. Private memory regions therefore only have a low impact on replicated execution overhead.
In contrast to private memory, shared memory regions exist between multiple applications and are therefore not in full control of a single ROMAIN master process. I will show in Section 3.6 that this can lead to inconsistencies within a replicated application and that these regions therefore need to be handled in a special way by the master process.

3.5.1 FIASCO.OC Memory Management

Microkernel-based operating systems implement the management of memory regions inside user-level applications and only provide kernel mechanisms to enforce these management decisions. FIASCO.OC's memory management is derived from Sawmill's hierarchical memory managers.21 I explain the concepts involved in this management using Figure 3.10 as an example.

[Figure 3.10: A FIASCO.OC application's address space is managed by a local region manager, which combines memory pages from different dataspaces. In the example, the region map stores Region 1 → (DS1, Offset A, Size B), Region 2 → (DS1, Offset C, Size D), and Region 3 → (DS2, Offset X, Size Y).]

Transferring Memory Mappings   Making chunks of memory available to another application requires modifications to the target's hardware page table, which is only accessible in privileged processor mode. FIASCO.OC's IPC mechanism provides means to send memory mappings from one application to another. The kernel then carries out the respective page table modifications. This procedure is identical to the one used to delegate kernel object access rights described in Section 3.4.22

Dataspaces as Generic Memory Objects   Memory content can originate from different sources, such as anonymous physical memory, memory-mapped files, or even a hardware device. In FIASCO.OC all these sources of memory are managed by user-level servers, which provide access to the memory they own through a generic memory object, a dataspace.

Region Manager   An application's address space consists of a combination of regions. Regions are parts of remote dataspaces mapped to a specific virtual address range within the local address space. For every application, a region manager (RM) maintains a region map, which stores a mapping between regions and the dataspaces that provide backing storage for them.

Page Fault Handling   Whenever an application accesses a virtual address that has no corresponding entry in the hardware page table, the CPU raises a page fault exception. The kernel's page fault handler redirects this exception to the faulting application's RM. The RM then looks up the faulting address' region and asks the corresponding dataspace to receive a valid memory mapping via IPC. Using this mechanism all page faults are handled by a user-level component as well.

21 Mohit Aron et al. The SawMill Framework for Virtual Memory Diversity. Asia-Pacific Computer Systems Architecture Conference, 2001.
22 The kernel distinguishes between object mappings that require modifications to the capability table and memory mappings that lead to modification of the page table.
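To make the region-map concept concrete, here is a small, self-contained sketch of how a region manager might resolve a page fault address to a dataspace and offset. The data structures are simplified illustrations and do not reflect L4Re's actual implementation.

// Sketch: resolving a page-fault address through a region map.
// Simplified model; L4Re's real region manager uses an AVL tree and IPC.

#include <cstdint>
#include <map>
#include <optional>

using Address = uint64_t;

struct DataspaceRef { unsigned capability; Address offset; };

struct Region { Address size; DataspaceRef backing; };

class RegionMap {
public:
    void attach(Address start, Address size, DataspaceRef backing) {
        regions_[start] = Region{size, backing};
    }

    // Return the backing dataspace and offset for a faulting address, if any.
    std::optional<DataspaceRef> resolve(Address fault) const {
        auto it = regions_.upper_bound(fault);       // first region starting after fault
        if (it == regions_.begin()) return std::nullopt;
        --it;                                        // candidate region containing fault
        const Address start = it->first;
        const Region& r = it->second;
        if (fault >= start + r.size) return std::nullopt;
        DataspaceRef ref = r.backing;
        ref.offset += fault - start;                 // offset of the fault inside the dataspace
        return ref;
    }

private:
    std::map<Address, Region> regions_;              // keyed by region start address
};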

3.5.2 ROMAIN Memory Management

Before a FIASCO.OC application can successfully access an address in memory, it has to complete three actions:
1. The application needs to obtain a capability to a dataspace.
2. A new region in the application's virtual memory must be associated with this dataspace. For this purpose the application asks its RM to attach this region to its address space.
3. Finally, the application accesses the memory address. The RM catches the resulting page fault and asks the corresponding dataspace manager for a memory mapping to the appropriate virtual address.

The main difference with respect to memory management in a replicated environment is that ROMAIN needs to provide each replica with a dedicated copy of its respective memory regions. Then, the replica can access memory independently from other replicas with no further execution time overhead, while ROMAIN still remains in control of all input and output operations for the purpose of error detection.

Obtaining Dataspace Capabilities (Step 1) Dataspaces are implemented using the communication channel kernel object. ROMAIN does not replicate these objects, but instead attaches any incoming dataspace to the replicas’ and master’s capability tables as described in Section 3.4.2.

Region Management (Step 2)   For the second step, the ROMAIN master process takes over the role of each replica's region manager. To achieve this, the master uses the system call interception mechanism described in Section 3.4.1 to intercept all messages replicas send to their respective RM.
The master then performs replicated region management as shown in Figure 3.11. Upon interception of a dataspace attach() call, the master creates a copy of the respective dataspace for every replica (a). For this purpose the master obtains additional empty dataspaces and copies the original dataspace's content. Thereafter, each replica gets its dedicated copy inserted into its region map (b).

[Figure 3.11: ROMAIN provides each replica with a dedicated copy of each memory region – the master copies the original dataspace (a) and inserts a per-replica copy into each replica's region map (b).]

L4Re uses an AVL tree23 to store the region map. Such trees allow fast lookups, which in the case of memory management are required for every page fault handling operation.
The intuitive solution to manage replicated memory in ROMAIN would be to let the master maintain one copy of the region map for every replica. This requires no modification to L4Re's region management code at all. However, it would multiply the number of lookups and modifications that need to be performed during every RM operation.
Looking closer at the problem we find that the replicas' address space layouts will never differ, because correctly functioning replicas will always attach identical dataspaces to identical memory regions. Incorrect replica operations will be detected by ROMAIN before any modification of the region map takes place. Hence, the region maps can be stored in a single AVL tree. ROMAIN extends L4Re's region map implementation with a specialized leaf node type. ROMAIN's leaf nodes do not store a single mapping from region to dataspace, but instead store one dataspace capability for every replica. Thereby ROMAIN is able to find all dataspace copies for a memory region with a single lookup operation instead of performing N lookups when managing N replicas.

23 Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison Wesley Longman, 1998.
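The following sketch illustrates the specialized leaf node idea: a single region entry stores one dataspace per replica, so that one lookup yields all copies. It is a hypothetical simplification of ROMAIN's actual data structure, reusing the simplified region-map model sketched earlier.

// Sketch: a region-map entry that stores one dataspace copy per replica,
// so a single lookup returns the copies for all N replicas.
// Hypothetical simplification of ROMAIN's extended region map.

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Address = uint64_t;
using DataspaceCap = unsigned;

struct ReplicatedRegion {
    Address size;
    std::vector<DataspaceCap> replica_dataspaces;  // index = replica number
};

class ReplicatedRegionMap {
public:
    void attach(Address start, Address size, std::vector<DataspaceCap> copies) {
        regions_[start] = ReplicatedRegion{size, std::move(copies)};
    }

    // One lookup returns the per-replica dataspaces for a faulting address.
    const ReplicatedRegion* resolve(Address fault) const {
        auto it = regions_.upper_bound(fault);
        if (it == regions_.begin()) return nullptr;
        --it;
        if (fault >= it->first + it->second.size) return nullptr;
        return &it->second;
    }

private:
    std::map<Address, ReplicatedRegion> regions_;
};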

Page Fault Handling (Step 3)   The ROMAIN master process sets itself up to be the page fault handler for all replica threads. This is necessary to prevent an external page fault handler from obtaining unvalidated replica state. Furthermore, it allows the master to remain in control of the replicas' address space layout. When receiving a page fault message, the master process looks up the replicas' memory regions in the region map. It then translates the page

fault address to an address within its own virtual address space to find the local mapping of the replica region. Finally, the master uses FIASCO.OC’s memory mapping mechanism to map the respective memory regions into each replica address space.
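A compressed sketch of this page fault path is shown below, under the same simplified model as above. The map_into_replica() function is a hypothetical stand-in for FIASCO.OC's map operation issued by the master.

// Sketch of the replica page fault path (simplified, hypothetical interface):
// look up the region once, then map each replica's dataspace copy at the fault address.

#include <cstdint>
#include <cstdio>

using Address = uint64_t;
using DataspaceCap = unsigned;

// Stand-in for FIASCO.OC's map operation issued by the master.
void map_into_replica(unsigned replica, DataspaceCap ds,
                      Address region_offset, Address replica_address) {
    std::printf("replica %u: map ds %u (+0x%llx) at 0x%llx\n",
                replica, ds,
                (unsigned long long)region_offset,
                (unsigned long long)replica_address);
}

void handle_replica_page_fault(Address region_start, Address fault_address,
                               const DataspaceCap* copies, unsigned num_replicas) {
    Address offset = fault_address - region_start;   // offset inside the region
    for (unsigned r = 0; r < num_replicas; ++r)
        map_into_replica(r, copies[r], offset, fault_address);
}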

3.5.3 What if Memory was ECC-Protected?

As I explained in Section 2.4.1 on page 28, many modern COTS memory controllers provide ECC protection of the stored data. Therefore, software fault tolerance mechanisms – such as SWIFT – often rely on such protection. They can thereby reduce their performance overhead by never validating data in memory.
ROMAIN maintains dedicated copies of all memory regions for every replica and therefore does not require this kind of memory protection. Any fault in a memory word that leads to incorrect outputs by a replica will be detected on the next externalization event. Furthermore, ROMAIN replicas do not suffer from cases where ECC protection does not suffice to detect errors, which were for instance reported by Hwang and colleagues.24 However, ROMAIN's solution leads to increased memory overhead: N replicas require N times the amount of memory compared to a single application instance. If ECC memory protection allowed us to significantly decrease this memory consumption, it might therefore be a useful configuration option.

24 Andy A. Hwang, Ioan A. Stefanovici, and Bianca Schroeder. Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design. ASPLOS XVII, 2012.

If we assume having functioning ECC-protected memory, ROMAIN can improve memory consumption for read-only regions: This kind of data never gets modified by the application, and the master can therefore use a single copy of a read-only region and map it to all replicas' address spaces. ECC will make sure that replicas always read correct data. Furthermore, the virtual memory's write protection can be used to intercept any attempt to overwrite this data by a faulty replica.
Unfortunately, this approach does not work for writable memory regions. As ROMAIN's replicas execute independently, they may read or write those regions at different points in time. If their memory accesses went to the same physical memory location, this might induce inconsistent states and application failures. In order to overcome this, replicas would have to synchronize upon every access to such a memory region, which in turn would largely increase ROMAIN's runtime overhead.
ROMAIN's memory manager supports an ECC mode as shown in Figure 3.12. Read-only memory regions are stored as a single copy and mapped to all replicas. Writable regions are copied as explained in the previous section.

[Figure 3.12: Combining ROMAIN and ECC-protected memory allows to reduce memory overhead by using a single copy of read-only memory for all replicas. Read-only (RO) regions are mapped once into all replicas, whereas read-write (RW) regions remain per-replica copies.]

While this mode integrates ECC-protected memory into ROMAIN, it does not significantly reduce memory overhead. For example, the SPEC CPU 2006 benchmarks – which I use for evaluating ROMAIN's execution time overheads in Chapter 5 – may share their read-only executable code and data regions this way. However, these regions only occupy a few kilobytes of memory, whereas the benchmarks dynamically allocate hundreds of megabytes for their heap. Heap regions are however writable and hence need to be copied. As a result, ROMAIN's ECC optimization is disabled by default.
This optimization may however be useful in other scenarios where the text segment makes up a larger fraction of an application's address space. For

example, my work computer runs the chromium web browser. For this appli- cation, the text segment and the dynamically loaded libraries consume about 50% of the application’s address space. In this setup, not physically repli- cating read-only memory segments would significantly reduce the memory overhead required for replication.
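The per-region decision in this ECC mode boils down to a simple predicate. The sketch below shows that decision in isolation; the region flags and the notion of an "ECC mode" switch are modeled with hypothetical types rather than ROMAIN's real configuration interface.

// Sketch: deciding whether a region can be shared across replicas
// (single copy, relying on ECC plus write protection) or must be copied.
// Hypothetical model of ROMAIN's optional ECC mode.

struct RegionFlags {
    bool writable;
    bool shared_with_external_apps;   // handled separately (Section 3.6)
};

enum class ReplicaBacking { SharedReadOnlyCopy, PerReplicaCopy };

ReplicaBacking choose_backing(const RegionFlags& flags, bool ecc_mode_enabled) {
    if (ecc_mode_enabled && !flags.writable && !flags.shared_with_external_apps)
        return ReplicaBacking::SharedReadOnlyCopy;  // map one copy into all replicas
    return ReplicaBacking::PerReplicaCopy;          // default: dedicated copies
}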

3.5.4 Increasing Memory Management Flexibility

Memory management requires a substantial amount of work on the side of the ROMAIN master process. It is therefore a source of runtime overhead. To quantify the memory-related runtime overhead I implemented a microbench- mark that stresses FIASCO.OC’s memory subsystem. Listing 3.13 shows the benchmark as pseudocode.

Listing 3.13: Memory management microbenchmark

void membench()
{
    // Step 1: Dataspace allocation
    Dataspace ds = memory_allocator.alloc(1 GiB);

    // Step 2: Attach to address space
    Address start = region_manager.attach(ds, size=1 GiB);

    // Step 3: Touch memory once per page
    for (Address address = start; address < start + 1 GiB; address += L4_PAGE_SIZE) {
        *address = 0;
    }
}

An application first allocates a dataspace of 1 GiB size. This first phase constitutes simple object allocation, which in the ROMAIN case is proxied by the master process. The application thereafter attaches this dataspace to its address space by calling its region manager. This second step triggers region management within the master process. We can use this second phase to compare ROMAIN's and L4Re's original region management. In the third phase, the benchmark touches every virtual memory page in this dataspace exactly once by writing a single memory word in each page. This last step is dominated by the page fault handling that every memory access will cause.
I first executed the benchmark natively on top of FIASCO.OC to get a baseline measurement. Thereafter I ran the benchmark as a single replica using ROMAIN. The second measurement therefore solely shows the overhead introduced by proxying system calls and memory management and does not contain replication cost. I will investigate replication cost for real-world applications in Chapter 5.
I executed the benchmark on an Intel Core i7 (Nehalem) clocked at 3.4 GHz with 4 GB of RAM. Figure 3.14 shows the benchmark results and breaks down total execution time into the dataspace allocation, region management, and page fault handling overhead. The results are arithmetic means of five benchmark runs each. The test machine was rebooted for each run to avoid cache effects. The standard deviation was below 0.1% for all measurements and is therefore not shown in the graph.

[Figure 3.14: Microbenchmark – memory management in ROMAIN compared to native FIASCO.OC. Total execution time is broken down into dataspace allocation, region management, and page fault handling. (Note the changing scale of the Y axis.)]

We see that page fault handling overhead dominates native execution. Allocating a dataspace and attaching it to a region each cost one IPC message

plus server-side request handling time. These steps are completed within 2 µs. The remainder of the time is spent resolving one page fault for every 4 KiB page. These faults result in sending 1 GiB / 4 KiB = 262,144 page fault messages, which the pager then translates into the same number of dataspace mapping requests.
In the ROMAIN case we see that dataspace allocation takes slightly longer than in the native case. This overhead stems from the fact that the single allocation message is now intercepted and proxied by the master process. However, this overhead is negligible compared to the other two phases.
Region management costs around 400 ms, in contrast to 0.5 µs in the native case. In contrast to native execution, the master has to perform additional work in this phase: instead of only managing application regions, it also attaches all memory to its own address space, causing the respective page faults in the course of doing so. Given this fact, attaching a region with ROMAIN should be as expensive as the total page fault handling time in the native case, because all 4 KiB page faults need to be resolved here as well. This is not the case, because ROMAIN does not touch every single page, cause a page fault, and translate it into a dataspace mapping request. Instead, the master process uses the dataspace interface directly and thereby avoids additional page fault messages.
Once the region is attached in the master, the replica continues execution and touches all memory pages. This case is equivalent to the native case and causes the same amount of page faults to be handled.25 Hence, the page fault handling phase takes as long in ROMAIN as in the native case.

25 Remember, master and replica run in different address spaces. The page faults during region management made this memory only available within the master's address space.

Reducing Memory Management Cost   I implemented two optimizations in ROMAIN that reduce the overhead for managing replica memory. First, ROMAIN tries to use memory pages of a larger granularity to manage replicas' address spaces and thereby reduce page fault overhead. As we will see, this requires hardware support and is not always a viable option. As a second optimization, ROMAIN leverages a FIASCO.OC feature that allows mapping more than a single memory page in the case of a page fault.

Using Larger Hardware Pages Page fault handling is expensive because x86/32 systems by default manage memory using a page size of 4 KiB. While this allows for flexible memory allocation, it requires handling many page faults and has been observed to pollute the translation lookaside buffer (TLB).26 To address this issue, most modern processor architectures support larger page sizes for memory management. On x86/32, these larger pages of 4 MiB size are called superpages. In the example above, using superpages reduces the number of page faults the master needs to service to 1 GiB / 4 MiB = 256. This decrease directly leads to a reduction of total page fault handling time.

26 Narayanan Ganapathy and Curt Schimmel. General Purpose Operating System Support for Multiple Page Sizes. In USENIX Annual Technical Conference, ATC '98, Berkeley, CA, USA, 1998. USENIX Association

L4Re's dataspace manager for physical memory pages supports requesting superpage dataspaces, but clients do not use this feature by default for reasons explained in the next section. However, ROMAIN intercepts all replicas' dataspace allocation messages. It can therefore inspect these requests' allocation sizes and, above a certain threshold (e.g., 4 MiB), modify the allocation to request a superpage dataspace in order to reduce page fault overhead.
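The following sketch illustrates this allocation-size check. It is only an illustration: the AllocRequest structure and the adjust_allocation() helper are assumptions made for this example and do not correspond to ROMAIN's or L4Re's actual interfaces.

#include <cstddef>

// Illustrative constants: 4 MiB superpages on x86/32; upgrade allocations
// that are at least one superpage large.
constexpr std::size_t SUPERPAGE_SIZE      = 4UL * 1024 * 1024;
constexpr std::size_t SUPERPAGE_THRESHOLD = SUPERPAGE_SIZE;

struct AllocRequest {            // assumed view of an intercepted allocation message
    std::size_t size;            // requested dataspace size in bytes
    bool        use_superpages;  // ask the dataspace manager for superpage backing
};

// Called by the master while proxying a replica's dataspace allocation request.
void adjust_allocation(AllocRequest& req)
{
    if (req.size >= SUPERPAGE_THRESHOLD) {
        // Round up to a superpage multiple and request superpage backing so that
        // later page faults can be resolved with 4 MiB mappings.
        req.size = (req.size + SUPERPAGE_SIZE - 1) & ~(SUPERPAGE_SIZE - 1);
        req.use_superpages = true;
    }
}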

Using Larger Memory Mappings Most modern processor architectures (x86, ARM, SPARC, PowerPC) support some form of superpage mappings. However, it is not useful to rely solely on this feature, for two reasons: First, many applications allocate memory in chunks much smaller than 4 MiB. In these cases, using superpages wastes otherwise available physical memory. Second, using multiple different page sizes makes memory management more complex: allocating superpages requires contiguous physical memory regions of 4 MiB size to be available, which may not always be the case because of fragmentation arising from memory allocations with smaller page sizes. As an implementation artifact, L4Re's physical dataspace manager restricts allocation even more and requires superpage dataspaces to be completely physically contiguous, meaning that allocating a 1 GiB dataspace composed of superpages requires exactly 1 GiB of contiguous physical memory to be available, lest allocation fails. For these reasons applications by default refrain from using superpages.

However, the FIASCO.OC kernel provides an additional mechanism to reduce page fault processing cost that is independent of the actual hardware page size. When sending a memory mapping via IPC, the sender may choose to send more than a single page at once. The kernel reacts to such requests by simply modifying multiple page table entries during the map operation. The FIASCO.OC Application Binary Interface (ABI) allows sending memory mappings of any power-of-two size (e.g., 4 KiB, 8 KiB, 16 KiB, and so on), but requires the start address of such mappings to be a multiple of the mapping size.27

27 This requirement allows the kernel to fit memory mappings into the smallest possible number of page table entries and simplifies the map operation.

Figure 3.15: If source and destination regions are improperly aligned, the master cannot send large-page mappings to the replica.

Figure 3.15 illustrates the alignment problem. We see a master and a replica address space consisting of 8 pages each. A three-page region shall be mapped from the master (pages 3-5) to the replica (pages 4-6). The maximum possible mapping size for this region is two pages. However, the pages are improperly aligned, so that every page fault on an even page number in the replica would have to be resolved with a mapping from an odd page number in the master and vice versa. This does not satisfy FIASCO.OC's alignment requirements, and the master therefore has to map three single pages.

The master can reduce paging cost by properly aligning memory regions according to the replica's needs, as shown in Figure 3.16. Here, master pages 2-4 are mapped to replica pages 4-6. Any page fault in replica pages 4 or 5 allows the master to handle the fault by mapping a complete two-page region (pages 2 and 3) to the replica at once. This reduces the number of replica page faults and therefore decreases page fault handling overhead.

Figure 3.16: Adjusting alignment in replica and master memory reduces the number of page faults to handle.

ROMAIN leverages FIASCO.OC's support for multi-page mappings to reduce the number of page faults. Whenever a replica raises a page fault, the master process tries to map the largest possible power-of-two region that contains the page fault address to the replica.
To facilitate this approach, during every region attach() call the master identifies the best possible alignment and the largest possible mapping size for each replica and master region. This approach makes sure that later page faults can be resolved using the largest possible memory mapping.
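The following sketch shows one way to compute such a mapping size for a given fault address. The function and its parameters are illustrative assumptions; ROMAIN's actual implementation operates on its internal region data structures.

#include <cstdint>
#include <cstddef>

using addr_t = std::uintptr_t;

// Largest power-of-two mapping (>= 4 KiB) that covers the faulting address,
// starts at an address that is size-aligned in both replica and master, and
// stays within the attached region.
std::size_t max_mapping_size(addr_t fault, addr_t replica_base,
                             addr_t master_base, std::size_t region_size)
{
    std::size_t best = 4096;                                   // one page always works
    for (std::size_t size = 8192; size <= region_size; size <<= 1) {
        addr_t start = fault & ~static_cast<addr_t>(size - 1); // replica-side start
        if (start < replica_base)
            continue;                                          // would leave the region
        std::size_t off = start - replica_base;
        if (off + size > region_size)
            continue;                                          // would exceed the region
        if (((master_base + off) & (size - 1)) != 0)
            continue;                                          // master side not size-aligned
        best = size;                                           // larger aligned mapping found
    }
    return best;
}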

Effect of Larger Mappings I repeated the microbenchmark introduced at the beginning of this section and applied the previously described memory management optimizations. First I enabled the alignment optimization that reduces the number of page faults independently of hardware-supported superpages. I configured the ROMAIN master process to map up to 4 MiB of memory at once during page fault handling. Figure 3.17 compares the benchmark's outcome (Best Align) to the previously measured results.

Figure 3.17: Microbenchmark: Memory management minimizing the number of page faults, compared to the previous microbenchmarks. (Note the changing scale of the Y axis.)

We see that dataspace allocation and region management do not change in ROMAIN. This is expected, because we did not modify these parts of the system. Additionally, we see that the Best Align strategy reduces page fault handling cost for ROMAIN. Overall ROMAIN execution time is reduced by 30% for this benchmark. The overhead compared to native execution is reduced to 16%.

Effect of Using Superpages In the next step I enabled the use of superpages for mapping the 1 GiB memory region. This modification reduces the number of page faults to be handled to 1 GiB / 4 MiB = 256. Comparing the result to the previous native benchmark would not be fair, because this would compare native execution with thousands of page faults to ROMAIN with much fewer faults. Instead, I also adjusted the native version of the benchmark to use superpages and compare the results in Figure 3.18.

Figure 3.18: Microbenchmark: Memory management combining superpages and the best alignment strategy.

We see that native execution of the microbenchmark with superpages enabled is already five times faster than the previous native benchmark (102 ms vs. 512 ms). Dataspace allocation actually gets much slower (97 ms vs. 1.5 µs), because the dataspace manager has to perform more work to reserve a physically contiguous memory region. The observation that this is external dataspace management overhead is underlined by the fact that dataspace allocation in ROMAIN is as fast as in native execution, because most of this time is spent executing outside ROMAIN's sphere of replication. Execution in ROMAIN also gets faster than in the previous benchmark (190 ms vs. 590 ms). However, the relative overhead compared to native execution increases to 90%.

Are These Optimizations FIASCO.OC-Specific? The optimizations I introduced in this section appear to leverage features specific to the FIASCO.OC microkernel. This raises the question whether they are micro-optimizations for a specific kernel or can be applied to other OS environments as well. Superpages are a feature of the underlying hardware and are supported by many operating systems. Linux's mmap() system call supports allocating anonymous superpage regions using the MAP_HUGETLB flag. The System V shmget() operation for creating shared memory regions likewise supports shared memory segments consisting of superpages.28 Hence, a ROMAIN implementation on Linux could apply optimizations similar to the ones I implemented on top of FIASCO.OC.

28 The IEEE and The Open Group. The Open Group Base Specifications – Issue 7. http://pubs.opengroup.org, 2013

SUMMARY: ROMAIN manages replica memory by maintaining a dedicated copy of each memory region for each replica. To do that, ROMAIN interposes dataspace allocation, address space management, and page fault handling for the replicated application.

Memory overhead can be reduced by relying on ECC-protected mem- ory. ROMAIN can then work with a single copy for each read-only memory region and does not need to copy these. The effect of this optimization is limited, because most memory regions are writable and must still be copied. Replica memory management leads to a significant execution time overhead. I mitigate this overhead by reducing the number of page faults that need to be serviced by the master process. For this purpose ROMAIN uses hardware-provided superpages and maps the largest possible amount of memory when handling a single page fault.

3.6 Managing Memory Shared with External Applications

As we saw in the previous section, memory that is private to a replicated application can be efficiently replicated using ROMAIN. However, additional mechanisms are required to deal with shared memory regions. I consider all virtual memory regions that are accessible to more than one process as shared memory. Microkernel-based systems use such regions, for instance, to implement communication channels that allow streaming data between applications without requiring kernel interaction.29

29 Jork Löser, Lars Reuther, and Hermann Härtig. A Streaming Interface for Real-Time Interprocess Communication. Technical report, TU Dresden, August 2001. URL: http://os.inf.tu-dresden.de/papers_ps/dsi_tech_report.pdf

Shared memory therefore constitutes both input and output from the perspective of a replicated application. Due to its nature, shared memory content is not under full control of the ROMAIN master process. This results in two problems, which I call read inconsistency and write propagation.

Read Inconsistencies Replicas in a redundant multithreading scenario exe- cute independently and may access memory at different points in time. For example, assume we have a read-only shared memory region that is accessible to a replicated application. Further assume the master process has found a way to directly map this region to all replicas.

A read inconsistency between two replicas occurs if first a replica R1 reads a value v from the shared region and starts to process this datum. Thereafter, an external application updates v to a value v′ with v ≠ v′. Last, a second replica R2 reads the new value v′ and processes it. This scenario violates the determinism principle: replicas have to obtain identical inputs at all times. Otherwise they may execute different code paths, which leads to falsely detected errors and the respective recovery overhead.

Write Propagation If shared memory channels are writable for the replicated application, a second problem needs to be solved: We saw in the previous section that independently running replicas must perform all their write operations to a privately owned memory region. However, in the shared memory scenario the ROMAIN master needs to make the replicas' modifications visible to outside users of the shared memory channel. This write propagation needs to be done at a point where all replicas' private copies contain identical and consistent content. It is impossible for the master process to know when this is the case without cooperation from or knowledge about the replicated application.

ROMAIN solves both the read inconsistency and the write propagation problem by translating all accesses to shared memory into externalization events. I implemented two strategies to achieve this: Trap & Emulate intercepts every memory access and emulates the trapping instruction. Copy & Execute removes the software complexity and execution overhead of this emulator from the ROMAIN master process.

3.6.1 Trap & Emulate

As accesses to shared memory may be input or output operations, the ROMAIN master needs to intercept all these accesses and validate them to ensure correct execution. If replicas got shared memory regions directly mapped, such interception of accesses would be impossible. Therefore, ROMAIN handles page faults in shared memory regions differently than for private regions.

When a replicated application accesses a shared region for the first time, the resulting page fault is delivered to the master process. Instead of establishing a memory mapping to the replica, the master now emulates the faulting instruction and adjusts the faulting vCPU as well as the master's view of the shared memory region as if the access was successfully performed. Thereafter, replica execution resumes at the next instruction. This approach is conceptually identical to the trap & emulate concept that Popek and Goldberg developed for implementing virtual machines.30 By trapping all read and write operations to shared memory, trap & emulate solves both the read inconsistency and the write propagation problem.

30 Gerald J. Popek and Robert P. Goldberg. Formal Requirements for Virtualizable Third Generation Architectures. Communications of the ACM, 17(7):412–421, July 1974

An Instruction Emulator for x86 I added an instruction emulator to ROMAIN that is able to emulate the most common memory-related instructions of the x86 instruction set architecture. The emulator disassembles instructions using the UDIS86 disassembler.31 Based on UDIS86's output, the emulator is able to emulate the call, mov, movs, push, pop, and stos instructions.32

31 https://github.com/vmt/udis86
32 Intel Corp. Intel64 and IA-32 Architectures Software Developer's Manual. Technical Documentation at http://www.intel.com, 2013

These instructions only comprise a subset of all x86 instructions that access memory. However, they suffice to start a "Hello World" application in ROMAIN and emulate all its memory accesses on the way. Implementing a full-featured instruction emulator is out of scope for this thesis. I will show in Section 3.6.2 that such an emulator is furthermore unnecessary for most shared memory use cases.
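As an illustration of the disassembly step, the sketch below feeds a copy of the bytes at the faulting instruction pointer into UDIS86. The surrounding helper and the way the instruction bytes are obtained are assumptions for this example; only the UDIS86 calls themselves come from the library's API.

#include <udis86.h>
#include <cstdio>
#include <cstddef>

// Decode the instruction at a replica's faulting instruction pointer. `code`
// is assumed to be a master-accessible copy of the bytes at that address.
void decode_faulting_instruction(const unsigned char* code, std::size_t len,
                                 unsigned long long ip)
{
    ud_t ud;
    ud_init(&ud);
    ud_set_mode(&ud, 32);                  // x86/32 replicas
    ud_set_syntax(&ud, UD_SYN_INTEL);
    ud_set_input_buffer(&ud, code, len);
    ud_set_pc(&ud, ip);                    // report addresses relative to the fault IP

    if (ud_disassemble(&ud)) {
        // A real emulator would now dispatch on the decoded operands;
        // this sketch only prints the decoded instruction and its length.
        std::printf("%u bytes: %s\n", ud_insn_len(&ud), ud_insn_asm(&ud));
    }
}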

Overhead for Trap & Emulate Emulating instructions instead of allowing direct access to memory considerably slows down execution: rather than reading a memory word within a few CPU cycles, a shared memory access causes a page fault that gets reflected to the ROMAIN master. This overhead stems from hardware and kernel-level fault handling. Additionally, the master adds overhead itself because the faulting instruction needs to be disassembled and emulated.

I implemented three microbenchmarks to quantify the overhead caused by trap & emulate. These benchmarks all work on a 1 GiB shared memory region and mimic typical memory access patterns:

1. Memset: The first benchmark fills the whole 1 GiB region with a constant value using the memset() function provided by the standard C library. A typical x86 compiler optimizes this function call into a set of rep stos (repeat store string) instructions.

2. Memcopy: The next benchmark uses the C library's memcpy() function to copy data from the first 512 MiB of the shared memory region into the second half. This function call usually gets optimized into rep movs (repeat move string) instructions.

3. Random access: The last benchmark writes a single data word to a random address within the shared memory region. As we will see, this benchmark is the most demanding in terms of emulation overhead, because random accesses by their very nature cannot be optimized into anything else but single-word mov instructions.

I executed the benchmarks on the test machine described in Section 3.5.4 and once again compared native execution to the execution of a single replica in ROMAIN. Native execution measures the pure cost of memory operations – memory is directly mapped into the application, and all page faults are resolved before the measurements begin. Within ROMAIN, I allocated and attached the shared memory region before starting the benchmark. The Memset and Memcopy benchmarks were repeated 50 times each. The random access benchmark performed 10,000,000 random memory accesses to the shared region.33

33 The number of iterations was chosen so that all three benchmarks had comparable execution times.

Figure 3.19 shows the benchmark results. I compare my results to native execution and to execution in ROMAIN with the memory marked as a private region. The latter two execution times are identical, because once memory is mapped to a replica, all memory accesses go directly to the hardware and there is no difference to native execution. In contrast, if the memory region is marked as shared memory in ROMAIN, there is a visible overhead for all benchmarks.

Figure 3.19: Microbenchmark: Overhead for trap & emulate memory handling. (Bars: Native, Romain/Private, Romain/Shared; Y axis: execution time in seconds.)

For the Memset and Memcopy benchmarks, emulating memory accesses is comparatively cheap (57% overhead for Memset, 9% overhead for Memcopy). These overheads result from the optimizations I explained above: in both cases the compiler replaces calls to the C library's memset() and memcpy() functions by a single x86 instruction with a rep prefix. As a result, each benchmark run results in only a single emulation fault for the whole operation. In contrast, randomly accessing memory leads to a slowdown by a factor of 120. In this benchmark every single memory access causes a separate emulation fault. This means that the ROMAIN instruction emulator is called 10,000,000 times. Random memory access therefore constitutes a worst case for instruction emulation.
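For reference, the access patterns of the three benchmarks roughly correspond to the following sketch; buffer setup, timing, and the 50 repetitions of the first two patterns are omitted, and the shared-region pointer is assumed to be provided by the surrounding benchmark code.

#include <cstring>
#include <cstdlib>
#include <cstdint>
#include <cstddef>

constexpr std::size_t REGION = 1UL << 30;   // 1 GiB shared region

void run_patterns(unsigned char* shm)
{
    // 1. Memset: compiled into a single rep stos over the whole region.
    std::memset(shm, 0xAA, REGION);

    // 2. Memcopy: first half -> second half, compiled into a single rep movs.
    std::memcpy(shm + REGION / 2, shm, REGION / 2);

    // 3. Random access: 10,000,000 single-word writes, each one an individual
    //    mov instruction and hence, under trap & emulate, a separate fault.
    for (int i = 0; i < 10000000; ++i) {
        std::size_t off = (static_cast<std::size_t>(std::rand()) * 4096u
                           + static_cast<std::size_t>(std::rand())) % REGION;
        *reinterpret_cast<volatile std::uint32_t*>(shm + (off & ~std::size_t(3))) = i;
    }
}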

A Closer Look at Random Access Cost To find out how the benchmarks relate to a real-world scenario, I inspected the source code of SHMC, an L4Re library that provides packet-based communication channels through a shared memory ring buffer. SHMC uses memcpy() operations for the potentially

large payload-related operations. However, SHMC additionally needs to do packet bookkeeping within the shared data region. For the latter purpose, SHMC requires random access operations. I therefore had a closer look at where the overhead for emulating random shared memory accesses comes from. Figure 3.20 breaks down a single emulated memory access into four phases: First, an emulated access causes a page fault in the CPU, which gets delivered to the ROMAIN master process. This exception handling part takes roughly 1,900 CPU cycles. Another 1,400 CPU cycles are spent by the master for validating CPU state and dispatching the event to the PageFault observer. Inside the emulator, time is spent on disassembling the faulting instruction (6,400 cycles) and emulating the write effects (2,400 cycles).

Figure 3.20: Breakdown of random access emulation cost (trap & emulate). (Phases: exception handling, ROMAIN dispatch, disassembly, emulation; X axis: execution time in CPU cycles.)

In total, ROMAIN spends around 12,000 CPU cycles emulating a shared memory access, whereas the benchmark indicates a cost of about 100 cycles for a native memory access. About half of this time is spent inside the UDIS86 disassembler for parsing the faulting instruction and filling a disassembler data structure. (While 100 cycles sounds high for a memory access, Intel's Performance Analysis Guide roughly approximates an uncached DRAM access to cost about 60 ns, which would come down to about 150 cycles on my test computer.34 As the Random Access microbenchmark uses 1 GiB of memory, randomly picking one word for the next access is highly likely to miss the cache.)

34 David Levinthal. Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 Processors. Technical report, Intel Corp., 2009. https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

Additionally, the disassembly and emulation infrastructure add a large amount of source code complexity to the ROMAIN master. The UDIS86 library alone comprises about 8,000 lines of code. For these reasons I implemented an alternative shared memory interception technique that does not use the disassembler at all.

3.6.2 Copy & Execute

The ROMAIN instruction emulator has two jobs: First, replica-virtual addresses need to be translated into master-virtual ones, because instruction emulation takes place in the master and the master's address space layout may differ from the replicas'. Second, the actual instruction needs to be emulated and the replica vCPUs need to be adjusted accordingly. If we can solve both problems without a disassembler, we can avoid both the implementation complexity and the performance drawbacks explained above.

Avoiding Address Translation Using Identity Mappings As explained previously, the master process keeps dedicated copies of every private memory region for every replica. For this reason, virtual addresses in the replicas and the master may differ. However, shared memory regions are never copied, but only exist as a single instance in the master. ROMAIN uses this fact to solve the address translation problem.

The master intercepts IPC calls that are known to obtain capabilities to shared-memory dataspaces. When the replicated application then tries to attach such a dataspace to its address space, the master maps the respective region to the same virtual address in replica and master address spaces. Due to this 1:1 mapping of shared memory regions, no address translation is necessary anymore. Unfortunately, this approach only works if the virtual memory region requested by the replica is still available in the master process. This may be a problem if replicas request mappings at a fixed address. However, most L4Re applications leave selection of an attached region's virtual address to their memory manager.35 In the case of the replicas this is the ROMAIN master, which can then select a suitable region.

35 This behavior is similar to most Linux applications, which obtain anonymous memory at an arbitrary address using the mmap() function.

A Fast and Complete Instruction Emulator Using identity mappings, any faulting memory instruction will now use the faulting replica's register state and virtual addresses that are identical to addresses in the master. Hence, there is no need for instruction emulation anymore. Instead, we can perform the faulting shared memory access directly in the master process using the fastest and most complete instruction "emulator" available: the physical CPU!

I added a shared memory emulation strategy to ROMAIN that I call copy & execute. This strategy performs shared memory accesses by executing the following four steps (a sketch follows the list):

1. Create a dynamic function: Allocate a buffer in memory and fill this buffer with the faulting instruction followed by a return instruction (byte 0xC3).

2. Adjust physical CPU state: Store the current physical CPU state in memory

and copy the faulting replica's CPU state into the physical CPU.

3. Perform the shared memory access: Call the dynamic function created in step 1. This will perform the memory access in the master's context and works because we use identity-mapped shared memory regions. After the memory access, the dynamic function will return to the previous call site.

4. Restore CPU state: Store the potentially modified physical CPU state into the replica's vCPU. Restore the previously stored master CPU state.
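A much-simplified sketch of steps 1 and 3 follows. It uses POSIX mmap() to obtain an executable buffer and omits steps 2 and 4, because swapping the full register state between the replica's vCPU and the physical CPU requires assembly and access to ROMAIN's internal vCPU structures; the instruction bytes and their length are assumed to come from the fault handler.

#include <sys/mman.h>
#include <cstring>
#include <cstdint>
#include <cstddef>

// Build a one-instruction "dynamic function" terminated by ret (0xC3) and call
// it in the master's context. With identity-mapped shared memory the access
// touches the same addresses as it would have in the replica.
void copy_and_execute(const std::uint8_t* insn, std::size_t len)
{
    std::uint8_t* buf = static_cast<std::uint8_t*>(
        mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));

    std::memcpy(buf, insn, len);   // step 1: the faulting instruction...
    buf[len] = 0xC3;               // ...followed by a return instruction

    // Step 2 (omitted): load the replica's register state into the physical CPU.
    reinterpret_cast<void (*)()>(buf)();   // step 3: perform the access
    // Step 4 (omitted): copy the possibly modified state back into the vCPU.

    munmap(buf, 4096);
}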

The copy & execute strategy does not require expensive instruction disassembly. However, as the x86 instruction set uses variable-length instructions, we need to identify the actual length of the faulting instruction to perform the copy operation in step 1. For this purpose I used MLDE32,36 a tiny instruction length decoder that does this job faster than UDIS86.

36 http://www.woodmann.com/collaborative/tools/index.php/Mlde32

Benefit of Copy & Execute I repeated the previously introduced memory microbenchmarks and show the results in Figure 3.21. The Memset and Memcopy benchmarks do not show any overhead anymore. The overhead for random accesses decreased from a factor of 120 down to about a factor of 47.

Figure 3.21: Microbenchmark: The Copy & Execute strategy improves the performance of memory access emulation. (Bars: Native, Trap & Emulate, Copy & Execute; Y axis: execution time in seconds.)

Furthermore, Figure 3.22 compares the cost for a single random memory access for copy & execute to the trap & emulate case I presented before. Exception handling and ROMAIN’s internal event dispatch are not touched by the modifications at all and therefore remain the same. For the copy & execute case, disassembly contains the cost of determining the instruction length, which is significantly smaller than for trap & emulate. In total, a random access to shared memory using copy & execute takes about 4,800 CPU cycles.

Figure 3.22: Breakdown of random access emulation cost (copy & execute), compared to trap & emulate. (Phases: exception handling, ROMAIN dispatch, disassembly, emulation; X axis: execution time in CPU cycles.)

Limitations of Copy & Execute While copy & execute significantly decreases shared memory access cost, this approach has two drawbacks: it does not support all types of memory accesses, and it relies on identity-mapped memory regions. I will explain these limitations below.

Emulating memory accesses using copy & execute only supports memory instructions that modify the state of shared memory and the general-purpose registers. It does not support memory-indirect modifications of the instruction pointer, such as a jump through a function pointer, because this would divert control flow within the master process. So far I did not encounter any application that uses function pointers stored in a shared memory region. I therefore argue that it is safe to assume that well-behaving applications never modify the instruction pointer based on a shared-memory access.

I already explained that the ROMAIN master process can establish identity mappings for the majority of shared memory regions. Most memory instructions on x86 have only a single memory operand, so they will always read or write a dedicated 1:1-mapped shared memory location. There is one exception to this rule, though: the rep movs instruction, which is for instance used to implement memcpy(), uses two memory operands. In this case, one of the addresses may point to a replica-private virtual address that still requires translation into a master address.

For this special case, ROMAIN's implementation of copy & execute includes a pattern-matching check for the respective opcode bytes (0xF3 0xA5). If this opcode is detected, ROMAIN additionally inspects the instruction's operands for whether they point into a shared memory region. If one of the pointers is replica-private, the master rewrites this operand to the respective master address.
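The check itself is small, as the following sketch illustrates. The register structure and the two address helpers are assumptions standing in for ROMAIN's vCPU state and region map; they are stubbed here only to keep the example self-contained.

#include <cstdint>

struct GuestRegs { std::uint32_t esi, edi; };   // slice of the replica's vCPU state

// Assumed helpers: region lookup and replica-to-master address translation.
static bool in_shared_region(std::uint32_t) { return true; }
static std::uint32_t to_master_address(std::uint32_t a) { return a; }

// If the faulting instruction is "rep movs" (0xF3 0xA5) and one of its two
// memory operands is replica-private, rewrite that operand to the master's view.
void fixup_rep_movs(const std::uint8_t* insn, GuestRegs& regs)
{
    if (!(insn[0] == 0xF3 && insn[1] == 0xA5))
        return;                                   // not the special case
    if (!in_shared_region(regs.esi)) regs.esi = to_master_address(regs.esi);
    if (!in_shared_region(regs.edi)) regs.edi = to_master_address(regs.edi);
}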

SUMMARY: From the perspective of a replicated application, shared memory content may constitute input as well as output. The ROMAIN master never gives replicas direct access to these regions, but instead handles every access as an externalization event and emulates the access.

I showed that the overhead of the naive trap & emulate approach to emulating memory operations can be significantly reduced by a copy & execute strategy. The latter strategy is, however, not universally applicable, because it makes assumptions about the types of memory-accessing instructions and their memory operands.

3.7 Hardware-Induced Non-Determinism

The ROMAIN master process intercepts all replica system calls, administers kernel objects on behalf of replicas, and manages the replicas' address space layouts. These three mechanisms cover close to all replica input and output paths. They ensure that replicas execute deterministically and that replica states are validated before data leaves the sphere of replication.

In addition to these sources of input, applications can also obtain input through hardware features, such as reading time information or using a hardware-implemented random number generator. Hence, accesses to such hardware resources need to be intercepted and handled by the ROMAIN master in order to ensure deterministic replica execution.

3.7.1 Source #1: gettimeofday()

Software can read the current system clock through the gettimeofday() function provided by the C library. To speed up clock access, many operating systems implement this function without using an expensive system call. Linux, for example, provides fast access to clock information through the virtual dynamic shared object (vDSO).37 The vDSO is a read-only memory region shared between kernel and user processes. The kernel's timer interrupt handler updates the vDSO's time entry on every timer interrupt. Based on this mechanism the gettimeofday() function can be implemented as a simple read operation from the vDSO.

37 Matt Davis. Creating a vDSO: The Colonel's Other Chicken. Linux Journal, mirror: http://tudos.org/~doebel/phd/vdso2012/, February 2012

FIASCO.OC provides a mechanism similar to the vDSO, which is called the Kernel Info Page (KIP). The KIP provides information about the running kernel's features as well as a clock field that is used to obtain time information. An intuitive solution to make time access through a KIP or vDSO mechanism deterministic would be to handle these regions as shared memory. With this approach the ROMAIN master would intercept and emulate all accesses to time information.

Avoiding KIP Access Emulation Virtual memory only allows us to configure access rights for whole memory pages (i.e., 4 KiB regions). ROMAIN can therefore only mark the whole KIP as shared memory and emulate all accesses. Unfortunately, the KIP not only contains a clock field, but also a heavily used memory region specifying FIASCO.OC's kernel entry code. This means that the KIP is accessed multiple times for every system call. Emulating all KIP accesses to maintain control over time input is therefore prohibitive, because, as collateral damage, we would slow down every system call by several orders of magnitude.

To avoid this slowdown, I implemented TimeObserver, an event observer that emulates the gettimeofday() function within the ROMAIN master. During application startup, the observer patches the replicated application's code and places a software breakpoint (byte 0xCC) on the first instruction of the gettimeofday() function. When this function gets called by the replicated program at a later point in time, this will cause a breakpoint trap in the CPU. FIASCO.OC then notifies the ROMAIN master about this CPU exception. During event processing, the TimeObserver then reads the KIP's clock value once and adjusts the replicas' vCPU states as if a real call to the instrumented function had taken place.
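A sketch of the patching step follows. The pointer argument stands in for the master's mapping of the replica's gettimeofday() entry point, which ROMAIN locates using symbol information; everything around it is an assumption for illustration.

#include <cstdint>

constexpr std::uint8_t INT3 = 0xCC;   // x86 software breakpoint opcode

// Install a breakpoint on the first byte of gettimeofday() and remember the
// original byte so the master can emulate the call when the trap arrives.
std::uint8_t install_time_breakpoint(std::uint8_t* entry_in_master)
{
    std::uint8_t original = *entry_in_master;
    *entry_in_master = INT3;          // the replica's next call now traps into the master
    return original;                  // kept by the TimeObserver for emulation
}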

Limitations of the TimeObserver To patch the function entry point, TimeObserver needs to know the start address of the gettimeofday() function. This symbol information is not available for every binary program that ROMAIN replicates. If the binary was, for instance, stripped of symbol information, TimeObserver could not perform its instrumentation duties. However, software vendors in practice often provide debug symbol information along with their binary-only software.38 This information can be sourced by TimeObserver to determine the respective address. If such information is not available, ROMAIN could still fall back to emulating all KIP accesses as I explained above. However, I did not implement this mechanism yet.

38 Microsoft Corp. Symbol Stores and Symbol Servers. Microsoft Developer Network, accessed on July 12th 2014, http://msdn.microsoft.com/library/windows/hardware/ff558840(v=vs.85).aspx

3.7.2 Source #2: Hardware Time Stamp Counter

Apart from timing information provided through a kernel interface, CPUs often have dedicated instructions that allow determining time. On x86 the rdtsc instruction provides such a mechanism and allows an application to determine the number of clock cycles since the CPU was started.39

39 Since the Nehalem microarchitecture, rdtsc is actually incremented using its own frequency to provide a constant clock regardless of CPU-internal frequency scaling. This detail is not important for my explanation.

By default, rdtsc can be executed at any CPU privilege level. Replicas may use this instruction to once again obtain different inputs depending on their temporal order and the physical CPU they run on. ROMAIN therefore needs to intercept rdtsc calls to provide deterministic input.

This problem can in principle be solved using the TimeObserver approach. We would have to know all locations of rdtsc instructions in advance and would then convert these instructions into software breakpoints during application startup. However, as explained in Section 3.3, finding specific x86 instructions within an unstructured instruction stream is difficult. To avoid these difficulties, ROMAIN instead uses a feature provided by x86 hardware: kernel code can set the Time Stamp Disable (TSD) bit in the CPU's CR4 control register to disallow execution of the rdtsc instruction in user mode.40

40 Intel Corp. Intel64 and IA-32 Architectures Software Developer's Manual. Technical Documentation at http://www.intel.com, 2013

ROMAIN asks the kernel to set the TSD bit for every replica. Executing the rdtsc instruction within the replica will then cause a General Protection Fault. This fault is reflected to the master process and then handled by the TimeObserver without having to patch any code beforehand.

3.7.3 Source #3: Random Number Generation

In addition to time-related hardware accesses, modern processors provide special instructions to access other non-deterministic hardware features. Intel's most recent CPUs, for instance, provide access to a hardware random number generator through the rdrand instruction.41 Intel have furthermore announced the introduction of the Software Guard Extensions (SGX) instruction set extension in future processor generations. SGX allows executing parts of an application within an isolated compartment, called an enclave.42 Enclave memory is protected from outside access by encryption with a random encryption key.

41 Intel Corp. Intel Digital Random Number Generator (DRNG) – Software Implementation Guide. Technical Documentation at http://www.intel.com, 2012
42 Intel Corp. Software Guard Extensions – Programming Reference. Technical Documentation at http://www.intel.com, 2013

For replication purposes both rdrand and SGX instructions need to be intercepted by ROMAIN in order to ensure determinism across replicas. I did not implement this kind of interception yet, but there are two options to do so:

1. Disallow Instructions: Both kinds of instructions will only be available in a subset of Intel's CPUs, and software is therefore required to check the availability of these extensions on the current CPU before using them. This is done using the cpuid instruction. ROMAIN can intercept cpuid and pretend to a replicated application that these instructions are unavailable on the current CPU. Software will then have to work around this problem and must not use the non-deterministic instructions (see the sketch after this list).

2. Virtualize Instructions: Intel’s VMX virtualization extensions allow the user to configure for which reasons a virtual machine will cause a VM exit that is then seen by the hypervisor. Random number generation, SGX, as well as rdtsc can thereby be configured to raise a visible externalization event. ROMAIN could be extended to run replicas not only as OS processes but as hardware-supported virtual machines and thereby intercept and emulate these instructions.
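For the first option, the master's cpuid handling only needs to clear the relevant feature bits before returning the register values to the replica, as sketched below. The function is an illustrative assumption; the constant reflects that RDRAND is reported in CPUID leaf 1, ECX bit 30.

#include <cstdint>

// Mask non-deterministic features out of an intercepted cpuid result before it
// is written back into the replica's registers.
void filter_cpuid(std::uint32_t leaf, std::uint32_t& eax, std::uint32_t& ebx,
                  std::uint32_t& ecx, std::uint32_t& edx)
{
    (void)eax; (void)ebx; (void)edx;     // unchanged in this sketch
    if (leaf == 1)
        ecx &= ~(1u << 30);              // hide the RDRAND feature flag
}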

Florian Pester demonstrated that replication based on hardware-assisted virtual machines is feasible.43 With such an extension we can implement both of the above options. However, while the first alternative will be easier to implement, some applications may simply refuse to work if their expected hardware functionality is unavailable. Therefore, I suggest investigating option number two in future work.

43 Florian Pester. ELK Herder: Replicating Linux Processes with Virtual Machines. Diploma thesis, TU Dresden, 2014

3.8 Error Detection and Recovery

With the mechanisms described above, ROMAIN is able to execute replicas of a single-threaded application. The replicas run as isolated processes, and the management mechanisms I presented in the previous sections provide four isolation properties, which are fundamental for successful error detection and recovery:

1. Replicas have an identical view of their kernel objects. Kernel objects are created using system calls. ROMAIN intercepts all system calls and is therefore able to ensure that replicas always have an identical view of these objects. As a result, we can assume all system calls that target the same object to have the same semantics.

2. Replicas have identical address space layouts. The ROMAIN master process acts as the memory manager for all replicas. For this purpose it intercepts all system calls related to dataspace acquisition and address space management. ROMAIN thereby establishes identical address space layouts in all replicas. By servicing replica page faults, the master process furthermore ensures that the replicas’ accessible memory regions exactly match.

3. Replicas obtain identical inputs. Replicas receive inputs through system calls, shared memory, and timing-specific mechanisms. ROMAIN intercepts all these sources of input and thereby provides all replicas with the same inputs. Replicas execute the same code and therefore will deterministically produce the same output unless they suffer from the effects of hardware faults.

4. Data never leaves the sphere of replication without being validated. Replicas output data using system calls or shared memory channels. ROMAIN intercepts both types of output. As a consequence of properties 1, 2, and 3 these outputs will be identical as long as the replicas do not experience hardware faults.

3.8.1 Comparing Replica States

While executing a replicated application, the FIASCO.OC kernel reflects all externalization events to the ROMAIN master process. Once all replicas reach their next externalization event, the master compares the replicas' register states and the content of their UTCBs. If all these data match, the intercepted externalization event is valid and can be further handled by the master.

The replica state comparison only validates that the replicas at this point in time still agree about their outputs. In order to decrease comparison overhead, the master does not compare the replicas' memory contents. If a hardware error modifies a replica's memory state and the replica still reaches its next system call in the same state as all other replicas, this faulty state remains undetected. Such an undetected error can lead to two scenarios: First, the erroneous memory value does not cause the application to misbehave at all. This is a case of a benign fault, and not detecting it does not harm the replicated application. Redundant multithreading mechanisms often avoid detecting benign errors in order to reduce their execution time overhead. In the second scenario, the faulty memory location is used to compute subsequent output operations. The affected replica will then produce a future externalization event that differs from the other replicas, and ROMAIN will then detect the error. Not detecting faulty memory state immediately in this scenario increases error detection latency, but does not impact the correctness of the replication mechanism.

If replica states mismatch, the master initiates a recovery procedure. First, ROMAIN tries to perform forward recovery using majority voting. The master checks whether the states of a majority of replicas match. If this is the case, the faulty replicas' states are overwritten with the majority's state. This includes overwriting registers, UTCBs, as well as all memory content. After successful forward recovery all replicas are in an identical state once again and may immediately continue operation. If the number of replicas does not allow a majority decision, ROMAIN falls back to providing fail-stop behavior. The replicated application is terminated and an error message is returned.
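The recovery decision boils down to a simple majority vote over the replica states, as in the following sketch. The ReplicaState comparison stands in for ROMAIN's comparison of registers and UTCB contents; copying memory and the other recovery details are omitted.

#include <vector>
#include <optional>
#include <cstdint>
#include <cstddef>

struct ReplicaState {
    std::uint32_t eip, esp, eax;   // illustrative subset of the compared state
    bool operator==(const ReplicaState& o) const {
        return eip == o.eip && esp == o.esp && eax == o.eax;
    }
};

// Return the index of a state shared by a strict majority of replicas, if any.
std::optional<std::size_t> majority_state(const std::vector<ReplicaState>& states)
{
    for (std::size_t i = 0; i < states.size(); ++i) {
        std::size_t votes = 0;
        for (const auto& s : states)
            if (s == states[i]) ++votes;
        if (2 * votes > states.size())
            return i;       // forward recovery: copy states[i] over the minority
    }
    return std::nullopt;    // no majority: fall back to fail-stop
}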

3.8.2 Reducing Error Detection Latency

ROMAIN detects faulty replicas as soon as they cause an externalization event that differs from the other running replicas. The underlying assumption is that applications regularly cause externalization events in the form of page faults and system calls that trigger validation. Unfortunately, this assumption does not hold in all cases.

Replicas Stuck in an Infinite Loop The first problem results from errors that cause replicas to execute infinite loops that do not make any progress. This may for instance happen if a hardware fault modifies the state of a loop variable in a way that the loop's terminating condition is never reached.44 In this case the ROMAIN master will never be able to compare all replicas' states, because the faulty replica never reaches its next externalization event for comparison.

44 Martin Unzner. Implementation of a Fault Injection Framework for L4Re. Belegarbeit, TU Dresden, 2013

ROMAIN solves this problem by starting a watchdog timer whenever the first replica raises an externalization event. If this watchdog expires before all replicas reach their next externalization event, error correction is triggered. In case a majority of replicas reached their externalization event, the remaining replicas are considered faulty. ROMAIN then halts these replicas and triggers recovery as described before. If fewer than half of the replicas reached their externalization event, these replicas may be the faulty ones. ROMAIN then continues to wait for externalization events from the remaining replicas before performing error detection.
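The decision the master takes when the watchdog fires can be summarized by the following sketch; the names are illustrative and not taken from ROMAIN's code.

enum class WatchdogAction { RecoverStragglers, KeepWaiting };

// `arrived` replicas have reached the current externalization event out of
// `total` replicas when the watchdog expires.
WatchdogAction on_watchdog_expired(unsigned arrived, unsigned total)
{
    if (2 * arrived > total)
        return WatchdogAction::RecoverStragglers; // majority agrees: stragglers are faulty
    return WatchdogAction::KeepWaiting;           // the arrived replicas may be the faulty ones
}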

Bounding Validation Latencies A second problem related to error detection was analyzed by Martin Kriegel in his Bachelor's thesis,45 which I advised.

45 Martin Kriegel. Bounding Error Detection Latencies for Replicated Execution. Bachelor's thesis, TU Dresden, 2013

Compute-bound applications may perform long stretches of computation between system calls, as shown in Figure 3.23 on the following page. This increases the period between state validations, t_v. A fault happening within this computation may have a potentially long error detection latency, t_e1. Kriegel identified this as a problem in three cases:

1. Multiple errors: If t_v becomes greater than the expected inter-arrival time of hardware faults, a replicated application may suffer from multiple independent faults before validation takes place. In this case, recovery through majority voting may no longer be possible.

2. Checkpoint Overhead: If ROMAIN operates in fail-stop mode, recovery may trigger checkpoint rollback and re-computation. The longer t_v is, the longer the required re-computation takes.

3. Loss of Timing Guarantees: The above two effects may eventually lead to a loss of timing guarantees due to hardware errors. Note that this thesis does not deal with providing real-time guarantees for replicated applications in the presence of hardware faults. Research on this topic is ongoing and has for instance been published by Philip Axer.46

46 Philip Axer, Moritz Neukirchner, Sophie Quinton, Rolf Ernst, Björn Döbel, and Hermann Härtig. Response-Time Analysis of Parallel Fork-Join Workloads with Real-Time Constraints. In Euromicro Conference on Real-Time Systems, ECRTS'13, Jul 2013

Figure 3.23: Long-running computations increase error detection latency if ROMAIN only relies on system calls for state comparison.

To address these issues, Kriegel proposed to insert artificial state validation operations at intervals smaller than t_v. These intervals are shown in cyan in Figure 3.23. With these additional state validations, detection latency is

reduced to t_e2, and ROMAIN may therefore trigger recovery operations faster.

Kriegel validated his proposal with an extension to ROMAIN and the FIASCO.OC kernel. This extension inserts artificial exceptions into running replicas by leveraging a hardware extension. Modern CPUs provide programmable performance counters to monitor CPU-level events.47 These counters can be programmed to trigger an exception upon overflow. Kriegel used this feature to trigger exceptions, for instance, once a replica retired 100,000 instructions. As replicas execute deterministically, they will raise such an exception at the same point within their execution, and the ROMAIN master can use these exceptions to validate their states.

During his work, Kriegel found that counting retired instructions leads to imprecise interrupts, because this performance counter depends on complex CPU features, such as speculative execution. Depending on the workload and other hardware effects, speculation may lead to the retirement of multiple instructions within the same cycle. Due to this effect, replicas may get interrupted by a performance counter overflow, but their instruction pointers and states may still differ, because some replicas may already have executed more instructions than others. Kriegel devised a complicated algorithm to let replicas catch up with each other in such situations. However, the fundamental problem is well known, and Intel's Performance Analysis Guide suggests using other performance counters – such as Branches Taken – to obtain more precise events.48

47 Intel Corp. Intel64 and IA-32 Architectures Software Developer's Manual. Technical Documentation at http://www.intel.com, 2013
48 David Levinthal. Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 Processors. Technical report, Intel Corp., 2009. https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

SUMMARY: ROMAIN detects erroneous replicas upon their next externalization event by comparing the states of all replicas. If possible, ROMAIN provides forward recovery by majority voting among the replicas. If this is impossible, ROMAIN falls back to fail-stop behavior.

ROMAIN relies on frequent externalization events, but some applications may not use system calls often enough. In this case, hardware performance counters can be used to insert artificial state validation operations.

In this chapter I described how ROMAIN instantiates application replicas, manages their resources, and validates their states. Redundant multithreading assumes that replicas always execute deterministically if presented with the same inputs. This is unfortunately not the case if the replicated application is multithreaded, because scheduling decisions made by the underlying kernel may introduce additional non-determinism. Replicating multithreaded applications therefore requires additional mechanisms, which I am going to discuss in the next chapter.

4 Can we Put the Concurrency Back Into Redundant Multithreading?

Many software-level fault tolerance methods — such as SWIFT, introduced in Section 2.4.2 on page 29 — were developed or tested solely targeting single-threaded application benchmarks. And despite their names, even redundant multithreading techniques — such as PLR and ROMAIN in the version I introduced until now — cannot replicate multithreaded applications. In this chapter I explain that scheduling non-determinism causes false positives in error detection for such programs. Multithreaded replication needs to correctly distinguish these false positives from real errors in order to be both correct and efficient.

Deterministic multithreading techniques solve the non-determinism problem. I review related work in this area and then present enforced and cooperative determinism – two mechanisms that allow ROMAIN to achieve deterministic multithreaded replication by making lock acquisition and release deterministic across replicas. I compare these two mechanisms with respect to their execution time overhead and their reliability implications. The ideas discussed in this chapter were published at EMSOFT 2014.1

1 Björn Döbel and Hermann Härtig. Can We Put Concurrency Back Into Redundant Multithreading? In 14th International Conference on Embedded Software, EMSOFT'14, New Delhi, India, 2014

4.1 What is the Problem with Multithreaded Replication?

Developers make use of modern multicore CPUs by adapting their applications to leverage the available resources concurrently. Multithreaded programming frameworks — such as OpenMP,2 Cilk,3 or Callisto4 — support this adaptation. These frameworks usually build on a low-level threading implementation.

The POSIX thread library (libpthread5) is one of the most widely used low-level thread libraries. Extending ROMAIN to support libpthread applications therefore allows replicating a wide range of multithreaded programs. Throughout this chapter I will therefore focus on applications using libpthread. In this section I first give a short overview of multithreading primitives. Thereafter I explain how non-determinism in multithreaded environments counteracts replicated execution.

2 L. Dagum and R. Menon. OpenMP: An Industry Standard API for Shared-Memory Programming. Computational Science Engineering, IEEE, 5(1):46–55, Jan 1998
3 Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. In Conference on Programming Language Design and Implementation, PLDI'98, pages 212–223, Montreal, Quebec, Canada, June 1998
4 Tim Harris, Martin Maas, and Virendra J. Marathe. Callisto: Co-Scheduling Parallel Runtime Systems. In European Conference on Computer Systems, EuroSys '14, Amsterdam, The Netherlands, 2014. ACM
5 The IEEE and The Open Group. POSIX Thread Extensions 1003.1c-1995. http://pubs.opengroup.org, 2013

4.1.1 Multithreading: An Overview

A thread is the fundamental software abstraction of a physical processor in a multithreaded application and represents a single activity within this program. The thread library manages thread properties, such as which code a thread executes and which stack it uses. The execution order of concurrently running threads is determined by the underlying OS scheduler. This separation has two advantages: First, applications do not need to be aware of the actual number of CPUs and can launch as many threads as they need. The OS will then take care of selecting which thread gets to run when and on which CPU. Second, in contrast to a single application, the OS can incorporate global system knowledge into its load balancing decisions.

To cooperatively compute results, threads use both global and local resources. While local resources are only accessed by a single thread, global resources are shared among all threads. The state of global resources and hence program results heavily depend on the order in which threads read and write this shared data. Situations in which the outcome depends on such unsynchronized ordering are called data races. Races and the potential misbehavior they may induce are an important concern when developing and testing parallel applications.6

A code path where threads may race for access to a shared global resource is called a critical section. Thread libraries provide synchronization mechanisms, which developers can use to protect critical sections from data races. These mechanisms include a range of different interfaces, such as blocking and non-blocking locks, condition variables, monitors, and semaphores.7

6 Konstantin Serebryany and Timur Iskhodzhanov. ThreadSanitizer: Data Race Detection in Practice. In Workshop on Binary Instrumentation and Applications, WBIA'09, pages 62–71, New York, NY, USA, 2009. ACM
7 Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers, 2008

Figure 4.1: Blocking Synchronization. (Thread T1 holds lock L inside its critical section; T2 blocks on lock(L) until T1 calls unlock(L).)

Figure 4.1 gives a general overview of how synchronization primitives work. Critical sections are protected by one or more synchronization variables,

such as a lock L. Whenever a thread T1 tries to execute a critical section, it issues a synchronization call, such as lock(L), to mark the critical section

as busy. When a thread T2 tries to enter a critical section protected by the same lock while the lock is owned by T1, T2 gets blocked until T1 leaves its critical section by calling unlock(L). The synchronization mechanism thereby makes sure that only one thread at a time can execute the critical section.
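In terms of the libpthread interface used throughout this chapter, the pattern from Figure 4.1 corresponds to the following minimal example:

#include <pthread.h>

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;

void enter_critical_section(void)
{
    pthread_mutex_lock(&L);    /* T2 blocks here while T1 holds L */
    /* ... access the shared global resource ... */
    pthread_mutex_unlock(&L);  /* leaving the critical section unblocks a waiter */
}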

4.1.2 Multithreading Meets Replication

As presented in the previous chapter, ROMAIN implements fault tolerance by replicating an application N times and validating the replicas' system call parameters. Intuitively, extending this approach to multithreaded applications is straightforward: ROMAIN should launch N replicas of every application thread and compare these threads' system calls independently. Unfortunately,

this approach fails for applications where threads behave differently depending on the state of a global resource and the time and order of accesses to this resource.

To illustrate the problem, let us consider a multithreaded application that uses the ThreadPool design pattern8 to distribute work across a set of worker threads as shown in Figure 4.2. An application consists of two worker threads W1 and W2. The workers operate on work packets (A-C), which they obtain one packet at a time from a globally shared work queue. The workers execute the code shown in Listing 4.3: A packet is first removed from the work queue (get_packet()). Let us assume this function is properly synchronized so that no data race exists and it always removes and returns the first entry from the shared work queue. In the second step, the worker processes this packet (process()). Finally, the worker makes this operation's results externally visible (output()).

8 Doug Lea. Concurrent Programming In Java. Design Principles and Patterns. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 1999

Figure 4.2: Example ThreadPool: Two worker threads obtain work items from a globally shared work queue.

void worker()
{
  while (true) {
    p = get_packet();
    process(p);
    output(p);
  }
}

Listing 4.3: Worker thread implementation

If the two worker threads are scheduled concurrently by the OS scheduler, their behavior depends on who gets to access the work queue first. Figures 4.4 a and b show two possible schedules for processing work packets A and B. Both schedules are valid, but lead to different program behavior from an external observer's point of view. Schedule a) will produce the event sequence "output(A); output(B)", whereas schedule b) will produce "output(B); output(A)".

Figure 4.4: Example schedules for the ThreadPool example. (In schedule a, W1 obtains and outputs packet A while W2 handles packet B; in schedule b, the workers obtain the packets in the opposite order.)

We see in the example that different timing of events can impact program behavior and lead to non-deterministic execution even given the same inputs. Scheduling decisions made by the underlying OS are a main source of such non-determinism and remain out of control of the application or the thread library.9

9 That is, unless the application uses OS functionality to micromanage all its threads and thereby forgoes any benefits from OS-level load balancing and scheduling optimizations.

Let us now assume we use ROMAIN to replicate our application using the intuitive approach of replicating threads independently. Each application thread is instantiated twice: W1a and W1b are replicas of worker 1, W2a and W2b are replicas of worker 2. Replicas execute independently and ROMAIN intercepts their externalization events (output()) to validate their states. In this scenario, schedules a and b from Figure 4.4 may constitute the schedules executed by the two application replicas. To detect errors, ROMAIN will compare externalization events generated by replicas of the same application thread. The master will thereby find that replica W1a executes output(A), while replica W1b calls output(B).

ROMAIN will deem this a replica mismatch, report a detected hardware fault, and trigger error recovery. The same will happen for replicas W2a and W2b. In the best case, this false-positive error detection will induce additional execution time overhead, because ROMAIN performs unnecessary error recovery. In the worst case, non-deterministic execution will lead all replicas of an application to execute different schedules and produce different events. In this case, no majority of threads with identical states can be found; ROMAIN is then no longer able to perform any error recovery at all.

SUMMARY: Scheduling-induced non-determinism may yield multiple valid schedules of a multithreaded application that generate different outputs. This non-determinism may either seriously impact replication performance or hinder successful replication completely.

4.2 Can we make Multithreading Deterministic?

Non-determinism originating from data races and from OS-level scheduling decisions makes development of concurrent software complex and error-prone. These issues raise the question whether thread-parallel programming really is a useful paradigm and whether we can replace traditional concurrency models with a more comprehensible approach. [10] Deterministic multithreading (DMT) is such an alternative.

[10] Edward A. Lee. The Problem with Threads. Computer, 39(5):33–42, May 2006.

The goal of DMT is to make every run of a parallel application exhibit identical behavior. Methods to do so have been proposed at the levels of programming languages, middleware, and operating systems. Applying these mechanisms to multithreaded replicas in ROMAIN will solve the non-determinism problem introduced in the previous section — deterministic replicas yield deterministic externalization events unless affected by a hardware error. I will therefore review DMT to find techniques that are applicable in the context of ROMAIN.

Language-Level Determinism   Programming-language extensions introduce new syntactic constructs or compiler-level analyses to allow developers to express concurrency while maintaining freedom from data races. As an example, Deterministic Parallel Java augments the Java programming language with annotations to specify which regions in memory are accessed by a piece of code. Developers specify these regions and then use parallel programming constructs to parallelize code segments. The compiler uses the annotations to verify that concurrent code segments are race-free. [11]

[11] Robert L. Bocchino, Jr., Vikram S. Adve, Danny Dig, Sarita V. Adve, Stephen Heumann, Rakesh Komuravelli, Jeffrey Overbey, Patrick Simmons, Hyojin Sung, and Mohsen Vakilian. A Type and Effect System for Deterministic Parallel Java. In Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA'09, pages 97–116, Orlando, Florida, USA, 2009. ACM.

If applications are implemented to be completely deterministic, they will never exhibit alternative schedules. Such deterministic programs can be replicated without intervention from the replication system. ROMAIN therefore benefits from these language extensions. However, my aim is to support a wide range of binary-only applications and it is hence impractical to rely on all applications being implemented using specific programming languages.

Determinism in Multithreaded Distributed Systems   Distributed systems often use state machine replication to distribute work across compute nodes and tolerate node failures. Recent distributed replication frameworks address the fact that the software running on single nodes may exhibit behavioral variations due to multithreaded non-determinism.

Cui's Tern [12] analyzes parallel applications and their inputs with respect to the schedules they induce. The Tern runtime then classifies incoming data and tries to force scheduling and lock acquisition down a path that was previously learned from similar inputs. Tern batches inputs into larger chunks that are executed concurrently. The runtime can thereby reduce the number of input classifications. Tern has low execution time overhead unless data does not match the precomputed schedule. In this latter case, the runtime has to start a new learning run. As Tern requires applications to be analyzed before running them, it is not a practical alternative for ROMAIN, because that would require adding a complex input classification and schedule prediction engine to the master process.

[12] Heming Cui, Jingyue Wu, Chia-Che Tsai, and Junfeng Yang. Stable Deterministic Multithreading Through Schedule Memoization. In Conference on Operating Systems Design and Implementation, OSDI'10, pages 1–13, Vancouver, BC, Canada, 2010. USENIX Association.

Similar to Tern, Storyboard enforces deterministic replication by relying on application-specific input knowledge to force applications to take precomputed schedules. [13] EVE batches inputs into groups that are likely to have no data conflicts. If this is the case, concurrent processing of these inputs is likely to have no non-deterministic effects. [14] Rex avoids an expensive training phase by using a leader/follower scheme. Leader replicas execute, log their non-deterministic decisions, and finally validate that they reached the same result. Follower replicas then consume the logged values and replay the logged decisions. [15]

[13] Rüdiger Kapitza, Matthias Schunter, Christian Cachin, Klaus Stengel, and Tobias Distler. Storyboard: Optimistic Deterministic Multithreading. In Workshop on Hot Topics in System Dependability, HotDep'10, pages 1–8, Vancouver, BC, Canada, 2010. USENIX Association.

[14] M. Kapritsos, Y. Wang, V. Quema, A. Clement, L. Alvisi, and M. Dahlin. EVE: Execute-Verify Replication for Multi-Core Servers. In Symposium on Operating Systems Design & Implementation, OSDI'12, October 2012.

[15] Zhenyu Guo, Chuntao Hong, Mao Yang, Dong Zhou, Lidong Zhou, and Li Zhuang. Rex: Replication at the Speed of Multi-core. In European Conference on Computer Systems, EuroSys '14, Amsterdam, The Netherlands, 2014. ACM.

As ROMAIN replicates operating system processes, batching their system calls and distributing them deterministically is unfortunately not an option. We will however see on page 78 that the idea of leader/follower determinism can be applied to multithreaded replication as well.

Deterministic Memory Consistency Models   Memory models at both the hardware and the programming language level describe how modifications to data in memory become visible to the rest of the system. Based on these models we can reason about whether programs will expose deterministic behavior. Lamport defined sequential consistency as a parallel execution that orders memory writes as if they were executed by one arbitrary interleaving of sequential threads on a single processor. [16] Most importantly, all threads observe the same interleaving. Note that sequential consistency does not provide freedom from data races — it simply provides a framework to reason about the existence of these problems.

[16] Leslie Lamport. How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs. IEEE Transactions on Computers, 28(9):690–691, September 1979.

Other researchers recognized that sequential consistency is too strict, as it forbids compiler-level or hardware-level optimizations that would improve the performance of concurrent execution. Alternatives — such as release consistency — therefore weaken consistency rules to allow for optimizations. [17] Release consistency diverges from the need for a common shared view of memory and supports arbitrary reordering of memory accesses. However, the model requires all accesses to globally shared objects to be protected by a pair of acquire and release operations. In combination, reordering can be used to improve concurrent performance, while the rules about protecting global objects still allow reasoning about the order of object updates and hence about the existence of data races.

[17] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. In International Symposium on Computer Architecture, ISCA '90, pages 15–26, Seattle, Washington, USA, 1990. ACM.

Aviram and Ford argued that in order to reduce implementation complexity, we not only need a memory model to reason about races, but one that enforces determinism. They proposed a model that ensures completely deterministic execution and called it workspace consistency. [18] Instead of acquiring access to each global object, workspace consistency requires threads to work on dedicated copies of these objects. Threads obtain copies using a fork operation and merge this copy back into the global program state using a join call. This approach is similar to release consistency. However, instead of preventing concurrent modification, workspace consistency lets each thread modify its local object copy. The consistency model furthermore defines rules about the order in which updated copies are merged back into the global application view. Thereby any potential data races that exist between threads are resolved in a deterministic order. As a result, the whole program becomes deterministic.

[18] Amittai Aviram, Bryan Ford, and Yu Zhang. Workspace Consistency: A Programming Model for Shared Memory Parallelism. In Workshop on Determinism and Correctness in Parallel Programming, WoDet'11, Newport Beach, CA, 2011.

Aviram first implemented workspace consistency in the Determinator operating system. [19] This approach requires all programs to be rewritten for a new system and is therefore impractical if we want to reuse the large quantities of existing applications on existing operating systems. The authors later demonstrated that the idea of workspace consistency can also be retrofitted into existing parallel programming frameworks. For that purpose they implemented a deterministic version of OpenMP and showed that most state-of-the-art parallel benchmarks can be adapted to use workspace consistency. [20] Merrifield later added workspace-consistent memory management to Linux and showed that many concurrent applications – including deterministic threading systems, shared-memory data structures, and garbage collectors – can benefit from such management mechanisms being present in the OS. [21]

[19] Amittai Aviram, Shu-Chun Weng, Sen Hu, and Bryan Ford. Efficient System-Enforced Deterministic Parallelism. Pages 193–206, Vancouver, BC, Canada, 2010. USENIX Association.

[20] Amittai Aviram and Bryan Ford. Deterministic OpenMP for Race-Free Parallelism. In Conference on Hot Topics in Parallelism, HotPar'11, Berkeley, CA, 2011. USENIX Association.

[21] Timothy Merrifield and Jakob Eriksson. Conversion: Multi-Version Concurrency Control for Main Memory Segments. In European Conference on Computer Systems, EuroSys '13, pages 127–139, Prague, Czech Republic, 2013. ACM.

Programs using workspace consistency are automatically deterministic. As with language-level determinism, replicating such applications does not require additional support from ROMAIN and works out of the box. As with language-level approaches, however, we cannot assume that all applications are implemented deterministically. Hence, deterministic memory consistency provides no silver bullet for replication.

Deterministic Runtimes   In addition to developing new deterministic programming methods, researchers proposed ways to retrofit existing systems with determinism. Bergan's CoreDet splits multithreaded execution into parallel and serial phases and dynamically assigns each memory segment an owner. [22] In the parallel phase threads are only allowed to modify memory regions they own privately. Once a thread accesses a shared variable it is blocked until the serial phase. Serial execution is started after a preset amount of time. Here, threads perform their accesses to shared state in a deterministic order. CoreDet relies on compiler-generated hints to track memory ownership and provides a runtime that periodically switches between parallel and serial thread execution.

[22] Tom Bergan, Owen Anderson, Joseph Devietti, Luis Ceze, and Dan Grossman. CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution. In Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 53–64, Pittsburgh, Pennsylvania, USA, 2010. ACM.

A large quantity of multithreaded applications already exists, and rewriting or recompiling them to become deterministic is often infeasible. As mentioned previously, while these applications may use one of many different parallel programming paradigms, these paradigms in the end map to a low-level thread library, such as libpthread.

Most of today's applications are linked dynamically. [23] These programs do not provide their own version of commonly used libraries – such as libC, libpthread and libX11 – but use a library version globally provided by the underlying system. While this concept was originally introduced to reduce binary program sizes and save system memory, it also makes it possible to transparently replace a system's implementation of a library. Deterministic versions of libpthread have been proposed as a drop-in replacement using this approach.

[23] John R. Levine. Linkers and Loaders. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 1999.

Strongly deterministic libraries — such as DTHREADS [24] and Grace [25] — provide fully deterministic ordering of every memory access. Both approaches do so by emulating workspace consistency: each thread runs in a dedicated address space and works on dedicated copies of data. When threads reach predefined synchronization points — such as well-known libpthread functions — their changes are merged back into the main address space deterministically.

[24] Tongping Liu, Charlie Curtsinger, and Emery D. Berger. Dthreads: Efficient Deterministic Multithreading. In Symposium on Operating Systems Principles, SOSP '11, pages 327–336, Cascais, Portugal, 2011. ACM.

[25] Emery D. Berger, Ting Yang, Tongping Liu, and Gene Novark. Grace: Safe Multithreaded Programming for C/C++. In Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '09, pages 81–96, Orlando, Florida, USA, 2009. ACM.

Spawning per-thread address spaces and merging data back and forth does not come for free. DTHREADS' authors report a slowdown of up to 4 times when comparing DTHREADS applications to their native libpthread versions. Parrot [26] reduces DTHREADS's overhead using developer hints. These hints allow the programmer to specify concurrent regions and performance-critical non-deterministic sections within their application. This approach is unfortunately no option for ROMAIN because it a) requires modifications to the application and b) forgoes determinism to improve performance, which is not a viable alternative for replicated execution.

[26] Heming Cui, Jiri Simsa, Yi-Hong Lin, Hao Li, Ben Blum, Xinan Xu, Junfeng Yang, Garth A. Gibson, and Randal E. Bryant. Parrot: A Practical Runtime for Deterministic, Stable, and Reliable Threads. In ACM Symposium on Operating Systems Principles, SOSP'13, pages 388–405, Farminton, Pennsylvania, 2013. ACM.

Olszewski's Kendo [27] and Basile's LSA algorithm [28] observe that as long as a multithreaded application is race-free and protects all accesses to shared data with locks, we do not need to enforce deterministic ordering of every memory access. Instead, it suffices to ensure that all lock acquisition and release operations are performed in a deterministic order. Their weakly deterministic libraries implement such behavior by intercepting libpthread's mutex_lock and mutex_unlock operations.

[27] Marek Olszewski, Jason Ansel, and Saman Amarasinghe. Kendo: Efficient Deterministic Multithreading in Software. In Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIV, pages 97–108, Washington, DC, USA, 2009. ACM.

[28] Claudio Basile, Zbigniew Kalbarczyk, and Ravishankar K. Iyer. Active Replication of Multithreaded Applications. Transactions on Parallel and Distributed Systems, 17(5):448–465, May 2006.

Weak determinism provides lower execution time overheads than strongly deterministic methods. Kendo's authors report less than 20% execution time overhead compared to native libpthread. As a downside, their approach requires applications to be race-free, which limits its applicability.

Deterministic Multithreading for Replicating Multithreaded Applications   While many of the previously discussed solutions use replication as one example to motivate their work, only few researchers showed that their approach actually works for this purpose.

Bergan implemented dOS, an operating system modification in Linux that adds CoreDet deterministic management to a group of processes and thereby makes this subset of Linux applications deterministic. [29] Using this solution, the authors were able to replicate a multithreaded web server. However, their solution relies on a significant modification to the Linux kernel. Their patch adds and modifies more than 8,000 lines of code in the kernel and touches all major subsystems, such as memory management, scheduling, file systems, and networking.

[29] Tom Bergan, Nicholas Hunt, Luis Ceze, and Steve Gribble. Deterministic Process Groups in dOS. In Symposium on Operating Systems Design & Implementation, OSDI'10, pages 177–192, Vancouver, BC, Canada, 2010. USENIX Association.

Mushtaq implemented a deterministic replicated thread library on Linux that requires only minor kernel modifications. [30] His solution uses a modified libpthread. A leader process executes libpthread calls and logs their order and results into a shared memory area. A second follower process executes behind the leader and reads the leader's log data. The follower uses this log to detect errors by comparing his results to the logged ones. Furthermore, the follower uses the log to assign locks to his threads in the same order as the leader process.

[30] Hamid Mushtaq, Zaid Al-Ars, and Koen L. M. Bertels. Efficient Software Based Fault Tolerance Approach on Multicore Platforms. In Design, Automation & Test in Europe Conference, Grenoble, France, March 2013.

Mushtaq's work is attractive because it works solely in user space [31] and he reports low overheads for replicated execution. Upon detecting an error, Mushtaq rolls back to a previous checkpoint. Similar to other Linux checkpointing solutions, [32] he creates a lightweight checkpoint using the fork() system call: fork() creates a copy-on-write copy of the calling process. Parent and child thereby share all memory until the parent starts modifying pages, which are then dynamically copied by the OS kernel. In the checkpointing scenario, the child process never runs. It is solely used to store the memory contents of its parent. If the system detects an error, rollback is achieved by killing the erroneous parent and continuing execution in its child.

[31] Their only addition to Linux is a non-POSIX-compliant multithreaded fork system call.

[32] Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn. Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism. In Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 77–90, Pittsburgh, Pennsylvania, USA, 2010. ACM.

Checkpointing using fork() works well for restarting after a software error, but is seriously flawed if we want to tolerate hardware errors: due to the copy-on-write nature of a forked process, the original process and its checkpoint will share any data that is read-only. If this data is affected by a fault in the memory hardware, the data will be modified but no copy will be created. Hence, if this fault leads to a failure in the protected process, rolling back to the previous checkpoint will not fix the problem but instead re-execute from a corrupted checkpoint. Using ECC-protected memory would help to detect such corruption, but Mushtaq discusses neither the problem nor the solution in his paper.
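To make the fork()-based scheme concrete, here is a minimal sketch under the assumption of a Linux-like environment. The pipe-based hand-over, the function names, and the omission of error handling and repeated checkpoints are simplifications for illustration; this is not Mushtaq's actual code.

    #include <stdlib.h>
    #include <unistd.h>

    static int notify_pipe[2];

    /* Take a lightweight, copy-on-write checkpoint. Returns 1 in the active
     * process; returns 0 in the checkpoint copy once a rollback happened. */
    static int checkpoint(void)
    {
        if (pipe(notify_pipe) != 0)
            abort();

        pid_t pid = fork();
        if (pid == 0) {
            /* Checkpoint child: never executes application code. It keeps
             * the parent's memory image alive and waits for the parent to
             * terminate (read() returns 0 once the write end is closed). */
            char c;
            close(notify_pipe[1]);
            (void)read(notify_pipe[0], &c, 1);
            close(notify_pipe[0]);
            return 0;   /* execution resumes from the checkpoint here */
        }
        close(notify_pipe[0]);
        return 1;       /* active process continues normally */
    }

    /* Roll back by terminating the (presumably faulty) active process;
     * the checkpoint child observes this and takes over. */
    static void rollback(void)
    {
        _exit(EXIT_FAILURE);
    }

Note that the copy-on-write sharing that makes such a checkpoint cheap is exactly what causes the flaw discussed above: pages that are never written remain physically shared between the active process and its checkpoint copy.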

SUMMARY: To replicate multithreaded applications we need to make their execution deterministic. Deterministic multithreading methods provide this feature and implementations exist at the level of programming languages, runtime environments, and operating systems.

Modifying the system’s libpthread thread library seems to most generic option to make replication deterministic, because these mech- anisms apply to a wide range of programs and do not rely on the program or the underlying OS to be modified in any way. OPERATINGSYSTEMSUPPORTFORREDUNDANTMULTITHREADING 79

4.3 Replication Using Lock-Based Determinism

I will now describe how I extended ROMAIN to support multithreaded applications. Starting from the status presented in the previous chapter, I will first explain how concurrent threads generating externalization events are handled by the master process. Thereafter I will describe how multithreaded replicas are made deterministic in order to avoid the problems introduced in the previous section.

4.3.1 An Execution Model for Multithreaded Replication

I extended ROMAIN’s replication model with another abstraction to facilitate multithreaded replication. Figure 4.5 illustrates the resulting terminology. As before, ROMAIN runs multiple replicas (Replica 1, Replica 2) of an Replica 1 Replica 2 application. These replicas constitute isolated address spaces and serve as Replica resource containers. Similar to Kendo, which I introduced on page 77, each Thread T1,3 T2,3 replica of a multithreaded application runs all its replica threads concurrently within the replica address space. In the figure we see three such threads in Thread T1,2 T2,2 every replica. To distinguish threads across replicas I will from now on denote Group

Ti, j as the j-th thread in the i-th replica. T T The master process intercepts and handles replica threads’ externalization 1,1 2,1 events. In Section 3.3 I described how the master waits for all replica threads Externa- to reach their next externalization event, compares their results, and then lization Event handles the respective event. This is insufficient for multithreaded replication, Master because we now no longer need to wait for all replica threads, but only for Figure 4.5: Terminology overview all replicas of the same replicated application thread. For instance, if all else is identical, then thread T1,1 in Replica 1 and T2,1 in Replica 2 will execute the same code and their externalization events need to be compared for error detection. I refer to the set of threads that have the same local thread ID and execute identical code within different replicas as a thread group.

[Figure 4.6: Multithreaded event handling in ROMAIN]

Figure 4.6 shows how thread groups are handled by the master. We see two replicas running two threads each. The thread pairs (T1,1, T2,1) and (T1,2, T2,2) form two thread groups. Once T1,1 raises an externalization event E1, it gets blocked until the other thread in its thread group, T2,1, also reaches its next externalization event (E2). The events are reflected to the master for event handling in thread M1, which resumes execution of threads T1,1 and T2,1 once handling is finished. The second thread group executes independently from the first one. While T1,1 and T2,1 are handled by the master, T1,2 and T2,2 continue execution until they hit their next externalization events E3 and E4, which are then handled by the master in thread M2.

Besides distinguishing events by their thread groups, ROMAIN handles system calls and manages resources in the same way as I described in Chapter 3. However, in order to ensure that thread groups always behave identically, we need to resolve non-deterministic behavior in our replicas as discussed in the previous section.
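As a rough illustration of this per-thread-group handling, consider the following sketch. The thread_group structure and the helpers states_match(), start_recovery(), process_event(), and resume_group() are hypothetical names standing in for ROMAIN's actual master-side interfaces; synchronization between concurrently arriving replicas is omitted for brevity.

    #define MAX_REPLICAS 3

    struct replica_state;                    /* CPU state captured at an event */

    struct thread_group {
        unsigned              num_replicas;
        unsigned              arrived;       /* replicas waiting at an event   */
        struct replica_state *states[MAX_REPLICAS];
    };

    /* Hypothetical master-side helpers, assumed to exist elsewhere: */
    int  states_match(struct replica_state *states[], unsigned n);
    void start_recovery(struct thread_group *tg);
    void process_event(struct thread_group *tg);
    void resume_group(struct thread_group *tg);

    /* Called in the master whenever a replica thread of this thread group
     * raises an externalization event. */
    void handle_group_event(struct thread_group *tg, unsigned replica,
                            struct replica_state *state)
    {
        tg->states[replica] = state;
        if (++tg->arrived < tg->num_replicas)
            return;                          /* wait for the rest of the group */

        /* All replicas of this application thread arrived: detect errors and
         * handle the event exactly once on their behalf. */
        if (!states_match(tg->states, tg->num_replicas))
            start_recovery(tg);
        else
            process_event(tg);               /* e.g., execute the system call  */

        tg->arrived = 0;
        resume_group(tg);                    /* let all group members continue */
    }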

4.3.2 Options for Deterministic Multithreading in ROMAIN

As stated in Chapter 2, one of my design goals for ASTEROID is to protect unmodified binary applications against the effects of hardware faults. When considering multithreaded replication, this requirement rules out any solutions that demand applications to be rewritten using DMT mechanisms or recompiled using a DMT-aware compiler. Based on this requirement and the review of DMT techniques in the previous section, two options remain for protecting multithreaded applications: I can make applications deterministic by either modifying the whole system or by implementing a deterministic multithreading library.

Determinism at the System or Library Level?   To provide system-level determinism I need to adapt the FIASCO.OC kernel, the L4 Runtime Environment, as well as important libraries — such as libC — to guarantee deterministic execution for multithreaded applications. This approach mirrors CoreDet and dOS, which I introduced on page 76, and has been shown to induce low execution time overheads. However, this approach affects all applications running on FIASCO.OC regardless of whether they are protected by ROMAIN. As a consequence, applications that protect themselves against hardware faults suffer from overheads related to deterministic execution even if they are not affected by non-determinism at all.

Library-level determinism uses a modified version of the libpthread library to implement deterministic multithreading on a per-application basis. As explained previously, the dynamic nature of this library allows us to replace its implementation without modifying application code. Furthermore, a system can provide different implementations of this library: when ROMAIN loads a replicated application it can dynamically load a deterministic version of libpthread. In contrast, L4Re's system-wide application loader may still use an unmodified, non-deterministic libpthread for applications that should not suffer from determinism-related overheads.

SUMMARY: I decided to implement multithreaded replication using library-level determinism as this approach provides more flexibility to the system's users and also promises to be less complex to implement than modifying FIASCO.OC and L4Re.

Weak or Strong Determinism?   In the previous section I distinguished between strongly and weakly deterministic thread libraries. Weak determinism — such as Kendo — requires the application to be race-free and to protect all shared data accesses using synchronization operations. This approach promises low execution time overheads and low resource requirements: all threads run within the same address space and share global resources.

Strong determinism — as implemented by DTHREADS — provides determinism even in the presence of data races. This benefit is paid for with higher runtime overheads and resource requirements: by running all threads in individual address spaces and implementing workspace consistency, DTHREADS needs a copy of all globally shared memory for every thread. In the worst case this means that a deterministic application with N threads requires N times the amount of memory of the original application. [33]

[33] DTHREADS reduces this worst case by not copying thread-private memory.

As described in Section 3.5.2, ROMAIN maintains a copy of each memory region for every replica it runs. If we replicated a DTHREADS application with N threads using M replicas, this means we need M × N times the amount of memory of the unreplicated and non-deterministic version. This requirement leads to practical limitations: I developed ROMAIN for the x86/32 architecture, where every application can address 3 GiB of memory in user space. [34] The ROMAIN master is the pager for all replicas and hence needs to service their page faults from his private 3 GiB of user memory.

[34] 1 GiB is reserved for the FIASCO.OC kernel. Linux and Windows do the same.

Let us now assume we replicate a multithreaded application in triple-modular redundant mode. Memory replication demands that this application use at most 1 GiB per replica, because otherwise the ROMAIN master cannot serve all replica page faults. If we now assume our application to run 4 threads, this 1 GiB includes workspace-consistent copies of memory regions for each thread. In the worst case this means that our application can only use 256 MiB of distinct memory for computation purposes.

To reduce replication overhead in terms of execution time and resource usage I therefore decided to implement a weakly deterministic thread library for replicating multithreaded applications. I am going to describe this weakly deterministic library and its integration into ROMAIN in the upcoming sections. In Section 4.3.7 I will outline how a strongly deterministic solution would differ from the design presented in this section.
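The memory budget argument can be summarized in one line (a back-of-the-envelope bound, assuming M replicas, N threads, worst-case copying, and the 3 GiB of user memory available to the master):

\[
  \text{distinct application memory} \;\le\; \frac{3\,\mathrm{GiB}}{M \cdot N}
  \;=\; \frac{3\,\mathrm{GiB}}{3 \cdot 4} \;=\; 256\,\mathrm{MiB}
  \qquad \text{(TMR with four threads)}
\]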

4.3.3 Enforced Determinism

Ensuring weakly deterministic multithreaded execution requires that ROMAIN replicas reach an agreement on the order in which threads acquire locks. In my first approach to implement this agreement, I let the master process decide on lock ordering. I call this approach enforced determinism because an external instance imposes ordering on the otherwise non-deterministic replicas.

Adapting the Thread Library   I implemented enforced determinism with an adapted libpthread library that transforms synchronization operations into externalization events visible to the ROMAIN master. For this purpose I analyzed the synchronization operations in L4Re's libpthread library, which is derived from µClibC. [35]

[35] http://www.uclibc.org/

Four functions within µClibC need to be adapted in order to enforce deterministic ordering of synchronization events:

1. pthread_mutex_lock(),

2. pthread_mutex_unlock(),

3. __mutex_lock(), and

4. __mutex_unlock().

The former two functions implement libpthread's mutex synchronization primitive. The latter two functions are used for synchronizing concurrent accesses to data structures internal to libpthread. If we manage to ensure proper ordering for calls to these functions across all replicas, we will also achieve deterministic ordering of higher-level synchronization primitives – such as barriers, condition variables, and semaphores – because libpthread internally implements these primitives using the above four operations.

I adapted L4Re's libpthread implementation and replaced the entry points to the four synchronization functions with an INT3 instruction as shown in Listing 4.7. This single-byte x86 instruction raises a debug trap when it is executed. Thereby, whenever a program calls into one of the synchronization functions, the program will raise a debug exception, which gets reflected to the ROMAIN master process.

    int
    attribute_hidden
    __pthread_mutex_lock(pthread_mutex_t *mutex)
    {
        asm volatile ("int3");

        /* rest of code omitted
         * [..]
         */
    }

Listing 4.7: Introducing debug exceptions into libpthread mutex operations

Lock Event Handling in the Master   To enforce deterministic replica operation, the master process needs to implement ordering while handling the debug exceptions raised by lock operations. For this purpose I added a new event observer to ROMAIN's chain of event handlers, the LockObserver. The LockObserver handles all exceptions related to lock operations by intercepting these events through ROMAIN's event handling mechanism. The observer mirrors each libpthread mutex M that exists in the replicas with a dedicated mutex M' within the context of the master process. The LockObserver handles synchronization events in three steps:

1. Inspect parameters: We inspect the faulting replica's stack to determine the current function call's parameters. [36] Thereby we obtain the mutex ID M that is currently being used as well as the return address at which the replicas shall resume execution after the lock operation. The LockObserver uses a hash table to map mutex M to the corresponding master mutex M'. If no such entry exists in the hash table, a new mutex M' is allocated and initialized.

[36] Remember, the master has full access to all replica memory as it also functions as the replicas' memory manager.

2. Carry out lock operation: The replica's synchronization operation is performed by the master using the master mutex M'. All thread groups that perform lock operations will go through the LockObserver's event handler. If these thread groups use the same replica mutex M, they will at this point carry out an operation on the same master mutex M'. Thereby, the LockObserver achieves synchronization between concurrently executing thread groups identical to the synchronization a native mutex operation achieves in the context of a single application instance.

3. Adjust replica state: Once the master synchronization operation returns, we know that we can also return in the replica context. To do so, the LockObserver adjusts the thread group's states to emulate a return from the lock function. This means setting the replica threads' instruction pointers to the previously determined return address and setting the return value (EAX on x86_32) and stack pointer registers appropriately.

The LockObserver mechanism provides deterministic lock acquisition across replicas, because all synchronization operations are serialized and ordered within the master’s handler function. In contrast to other DMT techniques, multiple runs of the same application will not necessarily produce identical behavior. The LockObserver still works for replication purposes, because we are only interested in deterministic ordering between the replicas in a single application run.
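The following sketch condenses these three steps into C-style pseudocode for the lock path; the unlock path is analogous with pthread_mutex_unlock(). The thread_group structure and the helpers read_replica_stack_arg(), read_replica_stack_retaddr(), lookup_or_create_master_mutex(), set_instruction_pointer(), set_return_value(), and pop_replica_stack_frame() are illustrative assumptions, not ROMAIN's actual interfaces.

    /* Illustrative sketch of the LockObserver's lock handling. */
    void lock_observer_handle_lock(struct thread_group *tg)
    {
        /* Step 1: inspect parameters on the faulting replica's stack. */
        uintptr_t replica_mtx = read_replica_stack_arg(tg, 0);
        uintptr_t return_addr = read_replica_stack_retaddr(tg);

        /* Map the replica mutex M to the master-side mutex M'. */
        pthread_mutex_t *master_mtx = lookup_or_create_master_mutex(replica_mtx);

        /* Step 2: carry out the operation on the master mutex. Thread
         * groups contending for the same replica mutex serialize here. */
        pthread_mutex_lock(master_mtx);

        /* Step 3: adjust replica state to emulate a return from the
         * replica's lock function. */
        for (unsigned i = 0; i < tg->num_replicas; ++i) {
            set_instruction_pointer(tg, i, return_addr);
            set_return_value(tg, i, 0);      /* report success in EAX      */
            pop_replica_stack_frame(tg, i);  /* adjust the stack pointer   */
        }
    }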

4.3.4 Understanding the Cost of Enforced Determinism: Worst Case Experiments

I implemented a microbenchmark to evaluate the worst-case overhead induced by using enforced determinism. Two concurrent threads execute the code shown in Listing 4.8, so that each thread increments a global counter variable 5,000,000 times. For each increment a global mutex mtx is acquired and released. The benchmark therefore spends most of its time within libpthread's synchronization operations and will suffer most from any slowdown induced by a DMT mechanism.

    int counter = 0;
    const int increments = 5000000;
    pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

    void thread()
    {
        for (unsigned i = 0; i < increments; ++i) {
            pthread_mutex_lock(&mtx);
            counter++;
            pthread_mutex_unlock(&mtx);
        }
    }

Listing 4.8: Thread microbenchmark

Similar to the microbenchmarks in Chapter 3, I executed this benchmark using ROMAIN with one, two, and three replicas and compared their execution times to native execution. For these experiments I used a system with 12 physical Intel Xeon 5650 CPU cores running at 2.67 GHz, distributed across two sockets. Each replica thread as well as each native thread was pinned to a dedicated physical CPU to maximize concurrency.

Figure 4.9 shows the measured execution times for the benchmark. The results represent the average over five benchmark runs in each setup. The runs' standard deviation was below 0.1% in all cases and is therefore not shown in the figure. We see that the pure overhead of intercepting all lock and unlock operations already slows down the benchmark by a factor of 121. Double and triple-modular redundancy increase this cost even more, with a maximum slowdown of 309 for TMR. These high overheads demand a more thorough investigation to find their sources.

[Figure 4.9: Execution times measured for the multithreading microbenchmark (native execution time: 0.286 s).]

Adjusting CPU Placement   While the benchmark is designed to run two worker threads, a closer inspection of libpthread showed that in addition to these worker threads, a third manager thread is launched. libpthread uses this thread to internally distribute signals, launch new threads, and perform cleanups once threads are torn down. The manager thread is launched lazily once an application becomes multithreaded, e.g., when it first calls pthread_create(). As a result, the startup order of these threads is Worker 1 – Manager – Worker 2.

As mentioned before, ROMAIN assigns replica threads to dedicated physical CPUs. My naive implementation of this assignment was to distribute threads sequentially across all CPUs starting at CPU 0. As my test machine has two CPU sockets with six cores each, the replicas in a TMR setup will be distributed as shown in Figure 4.10.

[Figure 4.10: Sequential assignment of the benchmark threads to CPU cores on the test machine.]

Unfortunately, this setup assigns replicas of the two heavily synchronizing worker threads to different processor sockets. Every synchronization operation requires messages to be sent between the sockets. While L4Re implements such messaging using FIASCO.OC's IPC primitives, these will eventually require Inter-Processor Interrupts (IPIs) to be sent between CPUs. Using a microbenchmark I found that sending messages between cores on the same socket requires about 8,500 CPU cycles, whereas sending messages between cores on different sockets costs about 19,500 CPU cycles. [37] Additionally, the manager thread does not perform any real work in the benchmark, so distributing these replica threads across CPUs does not gain any performance and only wastes resources.

[37] FIASCO.OC provides a pingpong benchmark suite to evaluate the cost of kernel operations and hardware features.

To optimize for low IPI latencies I manually adapted the CPU placement algorithm. The idle replicas of the manager thread are co-located with the first worker's thread group, making room to place the second worker thread group on CPUs 3–5. As Figure 4.11 illustrates, all replica threads thereby run on a single socket and therefore benefit from reduced IPI latencies for synchronization purposes.

[Figure 4.11: Assignment of the benchmark threads to CPU cores on the test machine, optimized to minimize synchronization cost.]

Reducing Synchronization Cost   In a next step I instrumented the ROMAIN master to determine where the remaining overhead comes from. I separated replica execution into four different phases, which are depicted in Figure 4.12. These phases distinguish between active and passive replicas: one active replica enters the master, performs state validation and potential event handling. The remaining replicas are passive; they wait for the active replica to finish its processing and then resume execution based on its results.

1. User time measures the time spent executing outside the master process. This includes both actual application time as well as the time spent in the FIASCO.OC kernel for delivering vCPU exceptions. Note that this time does not include time spent in actual system calls, because those are handled within the master process.

2. Pre-Synchronization measures the time between entering the master process and executing event handling. For passive replicas this is equivalent to the time waiting for all other replicas to enter the master. For active replicas, this includes state validation time as well as the management overhead for processing the list of event observers.

3. Observer time measures the time an active replica spends in one of ROMAIN's event observers for handling replica events. This time also includes time the master spends in system calls on behalf of the replica as described in Section 3.4 on page 45.

4. Post-Synchronization time tracks the time spent between event handling and resuming replica execution. For the active replica this includes time for storing the event handling result and waking up all passive replicas. For passive replicas this is the time to obtain the leader's state and resume execution.

[Figure 4.12: Execution phases of a single replica.]

I re-executed the previous microbenchmark with the CPU placement optimization turned on, measured the fraction of time replicas spend in each of the four phases, and show the results in Figure 4.13. The libpthread manager thread that runs in every application spends 100% of its time in event handling, because it is indefinitely waiting for an incoming management request. It is not shown in the figure. The bars show the average time each worker thread replica spends in one of the four phases. I always show the distribution for the first replica of the first worker thread. All other replicas have similar phase distributions with standard deviations below 1%.

[Figure 4.13: Time the microbenchmark threads spend in the execution phases shown in Figure 4.12 when running with optimized CPU placement.]

First of all we see that the workers spend about 8 seconds of the benchmark in the actual user code and that this value does not change when increasing the number of replicas. Given the test machine's clock speed of 2.67 GHz and the fact that we execute 10 million lock and unlock operations in each thread, this number maps to an average user time of around 2,000 CPU cycles per lock operation. I confirmed with a microbenchmark that this is roughly equivalent to the cost of delivering a FIASCO.OC vCPU exception and resuming the vCPU afterwards. This shows that user time is dominated by vCPU exception delivery.

Second, the results show that in the single-replica case we spend most of the master execution time (87.4%) in the master's event handler. This time decreases for double and triple modular redundancy: here the replicas spend most of their time (73.3% for DMR, 82.8% for TMR) in the pre- and post-synchronization phases.

I had a closer look at where synchronization overhead comes from and found two sources. First, replicas within a thread group need to wait for each other whenever they enter the master. This is a problem for exception-heavy benchmarks such as this multithreaded one: when an exception is handled, the active replica wakes up the passive ones and performs cleanup work. In the meantime the passive replicas resume to user space and immediately raise their next exception. There they have to wait for the active replica to catch up. This source of overhead is inherent to replicated execution and fortunately only a problem for such microbenchmarks.

As a second source I found that in my initial implementation, the master used a libpthread condition variable at the beginning of event handling to wait for incoming replicas and at the end of the event handling phase to wait for all replicas to leave the master. Each of these synchronization operations requires expensive message-based notifications.

I implemented an optimization for the synchronization phase that replaces the condition variables with globally shared synchronization variables and lets the replicas poll on these variables instead of using message-based synchronization. This optimization avoids synchronization IPIs between replicas; I call it fast synchronization.
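A minimal sketch of such a polling-based synchronization step is shown below. It assumes C11 atomics and illustrative names; ROMAIN's actual implementation may differ in detail.

    #include <stdatomic.h>

    /* Shared between the replica threads of one thread group. */
    struct sync_state {
        atomic_uint arrived;     /* replicas that reached the current event */
        atomic_uint generation;  /* incremented once handling is finished   */
    };

    /* Called by every replica thread when it enters the master. Returns
     * non-zero for the one replica that becomes the active handler. */
    static int wait_for_thread_group(struct sync_state *s, unsigned num_replicas)
    {
        unsigned gen = atomic_load(&s->generation);
        unsigned pos = atomic_fetch_add(&s->arrived, 1);

        if (pos == num_replicas - 1)
            return 1;                      /* last one in: act as the handler */

        while (atomic_load(&s->generation) == gen)
            ;                              /* spin instead of blocking IPC    */
        return 0;
    }

    /* Called by the active replica once event handling is finished. */
    static void release_thread_group(struct sync_state *s)
    {
        atomic_store(&s->arrived, 0);
        atomic_fetch_add(&s->generation, 1);   /* wake the spinning replicas */
    }

Spinning trades a small amount of CPU time on the waiting replicas' dedicated cores for the cross-CPU interrupts that blocking, message-based synchronization would require.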

Improved Synchronization Overhead   I compare the effects of the two optimizations to the previously presented microbenchmark results in Figure 4.14. We see that TMR execution benefits most from the CPU placement optimization, whereas DMR shows nearly no effect because the DMR replicas were already placed on a single socket in the first place. In turn, DMR shows a better improvement from the fast synchronization optimization than TMR. This is due to the fact that in the TMR case more overhead is spent waiting for other replicas to catch up, whereas in the DMR case more time is spent sending synchronization messages.

[Figure 4.14: Execution times measured for the multithreading microbenchmark when run with optimizations turned on. Unoptimized data from Figure 4.9 plotted for reference.]

Figure 4.15 breaks down the fully optimized version (CPU placement and fast synchronization) of the benchmark into user, synchronization, and event handling times. Compared to the previous results shown in Figure 4.13 we see a decrease in both the pre- and post-synchronization phases for DMR and TMR execution.

[Figure 4.15: Breakdown of benchmark overhead sources with CPU placement and fast synchronization optimizations turned on.]

SUMMARY: I presented enforced determinism, a mechanism that transforms lock operations into exceptions visible to the ROMAIN master process. The master handles these exceptions and thereby establishes lock ordering. In turn, multithreaded replicas behave deterministically. I used a microbenchmark to analyze replication overhead and found that the placement of replicas on different CPUs influences overhead. I furthermore pinpointed inter-replica synchronization during event handling as a second source of overhead. I devised optimizations to address both of these issues in order to reduce replication overhead.

4.3.5 Cooperative Determinism

We saw in the previous section that enforced determinism has a high worst-case overhead. Even after optimizing, TMR execution is slowed down by two orders of magnitude. We also saw that there are three main contributors to overhead:

1. Mirroring of data structures and work in the master: All lock operations in the replicas are mirrored with additional data structures and operations in the ROMAIN master process. Furthermore, every lock operation has to go through the master's event handling mechanism.

2. CPU traps per lock operation: Every lock operation leads to a CPU trap. This adds a constant cost of about 2,000 CPU cycles to each operation, whereas a normal lock function call would require less than 100 CPU cycles if the lock is uncontended.

3. Replica synchronization: Waiting for all replicas to reach their next lock operation is the biggest contributor to replication overhead. This synchronization is often unnecessary: If a replica thread wants to acquire a lock and the other threads in the same thread group simply lag behind this first thread, there is actually no need to wait for them. The first thread can optimistically continue and trust the others to make the same decision afterwards.

This optimistic locking solution does not impact reliability, because ROMAIN will still compare replica states at externalization events. This approach only reduces the number of such events in order to save execution time overhead. Even optimistic locking however needs to properly synchronize lock operations whenever more than one thread group tries to acquire a lock.

To address these issues, I designed a replication-aware libpthread library, which I call libpthread_rep. Replicas no longer reflect their lock operations to the master process but instead cooperate internally to achieve deterministic lock ordering. This solution – which I call cooperative determinism – avoids mirroring lock operations in the master (problem #1), eliminates the need for CPU traps in lock operations (problem #2), and reduces the amount of inter-replica synchronization to those cases where it is really necessary (problem #3).

Architecture for Cooperative Determinism   ROMAIN provides an infrastructure for cooperative determinism using two building blocks shown in Figure 4.16. First, replicated applications use libpthread_rep as a drop-in replacement for libpthread. Second, libpthread_rep establishes ordering of lock operations using a lock info page (LIP) that is shared among all replicas. The master process is only involved in this architecture in the setup phase: it loads libpthread_rep into the replicas, creates the LIP, and makes sure that this LIP is mapped into each replica's address space at a pre-defined address.

[Figure 4.16: Cooperative Determinism: Applications are linked with a replication-aware thread library. This library uses a lock info page shared among all replicas to establish lock ordering.]

The Lock Info Page   libpthread_rep uses the LIP to share information about the state of lock acquisitions between replicas without requiring expensive synchronization messages. The LIP contains information about the number of replicas as well as information about all locks that are used by the replicated application. Listing 4.17 shows the LIP data structure. In my current prototype, the LIP is dimensioned to support 2,048 unique locks. This number suffices for the benchmarks I present in this thesis and may be adapted at compile time if necessary.

    struct LIP {
        unsigned num_replicas;
        struct {
            unsigned spinlock;
            Address  owner;
            unsigned acq_count;
            unsigned epoch;
        } locks[MAX_LOCKS];
    };

Listing 4.17: Lock Info Page data structure

For each lock in the application, the LIP stores a spinlock field that protects the LIP's lock information from concurrent access by the replicas. The owner field keeps track of the ID of the thread that currently possesses the lock. acq_count is used internally by libpthread_rep to count the number of replica threads that already entered a critical section. Finally, the epoch field serves as a logical progress indicator for threads and is incremented whenever a thread calls a lock or unlock function.

Using the LIP to Enforce Lock Ordering   As with enforced determinism (explained in Section 4.3.3), libpthread_rep adjusts four functions from libpthread: pthread_mutex_lock(), pthread_mutex_unlock(), __mutex_lock(), and __mutex_unlock(). I modified both of the lock functions to call lock_rep(), which I show as a control flow graph in Figure 4.18.

[Figure 4.18: lock_rep(): a replication-aware lock function]

[Figure 4.19: unlock_rep(): a replication-aware unlock function]

When a thread tries to acquire a lock, it consults the shared LIP to inspect the lock's global state. If the lock is currently marked as free, the thread stores its own thread ID and epoch counter in the global state and continues. At this point the thread has acquired the lock and can continue its operation without having to wait for the other threads in its thread group.

If the acquiring thread finds the lock to be taken, it checks the owner's thread ID. If the owner ID matches the calling thread's ID, this means that another thread from the same thread group already acquired the lock and the calling thread can continue operation. However, if the owner ID does not match the calling thread, we found a situation where a different thread group owns the lock. This means the calling thread has to wait until the lock either becomes free or changes ownership to the caller's ID. Inconsistencies may arise if a thread releases a lock and then tries to reacquire it before all other threads of the thread group released the lock. The epoch counter is used to detect such a situation and prevent this thread from overtaking the rest of its thread group.

Figure 4.19 shows a flow chart for the unlock operation. The acq_count counter tracks how many threads of a thread group need to release the lock before a new owner can be established. Each thread releasing a lock decrements this counter. Only the last thread to leave a critical section has to additionally reset the owner field.
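The following simplified sketch restates the control flow of Figures 4.18 and 4.19 in C. The field names come from Listing 4.17; the struct lock_info name, the thread-local variables my_tid and my_epoch, the helpers spin_lock(), spin_unlock(), and yield_cpu(), and the exact way acq_count and the epoch are updated are illustrative assumptions rather than the actual libpthread_rep code.

    #define LOCK_FREE ((Address)0)

    extern struct LIP *lip;              /* mapped at a fixed address        */
    extern __thread Address  my_tid;     /* identical across a thread group  */
    extern __thread unsigned my_epoch;   /* this thread's logical progress   */

    static void lock_rep(unsigned mtx)
    {
        struct lock_info *l = &lip->locks[mtx];

        for (;;) {
            spin_lock(&l->spinlock);

            if (l->owner == LOCK_FREE) {          /* first thread group here   */
                l->owner     = my_tid;
                l->epoch     = my_epoch;
                l->acq_count = lip->num_replicas;
                break;
            }
            if (l->owner == my_tid && l->epoch == my_epoch)
                break;                            /* our group already owns it */

            spin_unlock(&l->spinlock);            /* another group owns it, or */
            yield_cpu();                          /* we would overtake our own */
        }                                         /* thread group: retry later */

        my_epoch++;
        spin_unlock(&l->spinlock);
    }

    static void unlock_rep(unsigned mtx)
    {
        struct lock_info *l = &lip->locks[mtx];

        spin_lock(&l->spinlock);
        if (--l->acq_count == 0)                  /* last replica to release:  */
            l->owner = LOCK_FREE;                 /* allow a new owner         */
        my_epoch++;
        spin_unlock(&l->spinlock);
    }

Note that the common case, in which all replicas of one thread group acquire an uncontended lock, requires no inter-replica communication beyond the cache coherency traffic for the shared LIP.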

Cooperative Determinism Runtime Overhead   To evaluate the efficiency of cooperative determinism, I repeated the worst-case microbenchmark that I used to evaluate enforced determinism in Section 4.3.4 on page 83. Again, I ran this microbenchmark (native execution time: 0.286 s) with no optimizations, with optimized CPU placement, as well as with fast synchronization, and show the results in Figure 4.20. At first glance we see that the worst-case overhead for the optimized version of cooperative determinism is about six times lower than for the similarly optimized version of enforced determinism (TMR: 30.3x vs. 192x).

[Figure 4.20: Execution times measured for the multithreading microbenchmark running with different optimization levels and cooperative determinism.]

Once again, triple-modular redundant execution benefits significantly from optimizing replica placement. Even though cooperatively deterministic replicas do not use explicit synchronization upon every lock operation, the replicas still use the shared-memory LIP, which needs to be kept consistent by the CPU's cache coherency implementation. Hardware cache coherency again requires cross-CPU messaging, which is less expensive on a single CPU socket. Hence, we see better performance if all replicas run on a single socket.

As a last point, the cooperatively deterministic benchmark causes only 136 externalization events, compared to the more than 20,000,000 events the master needs to handle in the enforced deterministic case. The master process is therefore seldom involved and, as a consequence, the fast synchronization optimization that speeds up master-level event handling has nearly no effect on this benchmark.

SUMMARY: I designed a replication-aware thread library that uses a lock info page shared among all replicas to establish deterministic ordering of lock acquisitions. This solution avoids expensive externalization events and inter-replica synchronization wherever possible and thereby achieves six times lower execution time overheads compared to enforced determinism.

4.3.6 Limitations of Lock-Based Determinism

The solution for deterministic replication I presented in this chapter assumes that the application solely uses the synchronization operations provided by the libpthread library. Beyond that, the application must be free of data races. This requirement limits ROMAIN's applicability for programs that use ad-hoc synchronization (spinlocks) or lock-free data structures. [38] This problem can be solved by adapting the respective libraries to be replication-aware, similar to my adaptation of libpthread. As an alternative solution, fully deterministic execution may resolve or at least detect data races in non-locked accesses while merging thread-local data back into the globally consistent view. I explain how my solution would be extended to full determinism in the next section.

[38] Maurice Herlihy. A Methodology for Implementing Highly Concurrent Data Objects. ACM Transactions on Programming Languages and Systems, 15(5):745–770, November 1993.

4.3.7 Fully Deterministic Execution

In Section 4.3.2 I based my decision to only support weakly deterministic multithreading in ROMAIN on the resource overheads that would be required to implement strongly deterministic multithreading. In particular, I pointed out that having to maintain per-thread copies of all shared memory regions limits the practical applicability of DTHREADS-like strong determinism on 32-bit processor architectures.

While I developed ROMAIN focussing on the x86/32 architecture, modern 64-bit systems allow processes to address much larger amounts of physical memory. Hence, these resource limitations will become less of a problem in the future. [39] Therefore, I will now discuss the changes required to support strong determinism in ROMAIN similar to DTHREADS.

[39] Porting ROMAIN to x86/64 is work in progress at the time of this writing.

Adjusting the Execution Model The execution model I described in Section 4.3.1 also applies to strong determinism. The ROMAIN master process still monitors replica execution and each thread group's externalization events need to be handled independently. In contrast to my implementation, each replica thread would execute within a dedicated address space. These address spaces would be allocated by the master process whenever an application executes pthread_create() and then remain fixed for the whole lifetime of each replica thread.
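As an illustration of this execution model change, the sketch below shows master-side bookkeeping that could run when a replicated application's pthread_create() is intercepted. All types are placeholders for illustration; the actual ROMAIN master manages L4Re/Fiasco.OC kernel objects for tasks and threads.

```cpp
#include <vector>

// Placeholder types -- not ROMAIN's real data structures.
struct AddressSpace { /* one dedicated address space per replica thread */ };

struct ReplicaThread {
    unsigned      thread_group;   // same group index across all replicas
    AddressSpace* address_space;  // allocated once, fixed for the thread's lifetime
};

struct Replica {
    std::vector<ReplicaThread> threads;
};

// Called by the master once all replicas of one thread group reached the
// intercepted pthread_create() call.
void handle_pthread_create(std::vector<Replica>& replicas, unsigned new_group)
{
    for (auto& r : replicas) {
        // Under strong determinism every new replica thread receives its
        // own address space instead of sharing one per replica.
        r.threads.push_back(ReplicaThread{new_group, new AddressSpace()});
    }
}
```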

Memory Management As strongly deterministic multithreading requires more than one address space per replica, ROMAIN’s memory management needs to be aligned with these new requirements. To implement workspace consistency, ROMAIN needs to maintain one reference copy of all memory regions plus one additional copy for every replica thread. Whenever a thread group raises a page fault, the ROMAIN master needs to serve it by mapping the respective duplicate memory regions into each replica thread’s address space. For this purpose, the master needs to maintain a mapping between these copies and the respective replica threads. The page fault handler can use a lazy memory allocation strategy, so that per-replica memory copies are only created for those replicas that actually access a region. This will help to reduce the amount of memory required for strong determinism.
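A minimal sketch of the lazy allocation strategy in the master's page-fault path could look as follows. The types and the resolve_pagefault() helper are hypothetical and only illustrate the bookkeeping described above (one reference copy per region plus lazily created per-thread copies); the real memory manager is the one described in Section 3.5.

```cpp
#include <cstdint>
#include <cstring>
#include <map>

// Hypothetical bookkeeping for strongly deterministic replication.
struct Region {
    uintptr_t base;                       // region start in the replica's view
    size_t    size;
    void*     reference;                  // master-maintained reference copy
    std::map<int, void*> thread_copies;   // replica thread id -> private copy
};

// Called by the master when replica thread 'tid' faults at address 'addr'.
// Returns the backing memory that should be mapped at the faulting page.
void* resolve_pagefault(Region& r, int tid, uintptr_t addr)
{
    auto it = r.thread_copies.find(tid);
    if (it == r.thread_copies.end()) {
        // Lazy allocation: only threads that actually touch the region
        // pay the memory cost of a private copy.
        void* copy = ::operator new(r.size);
        std::memcpy(copy, r.reference, r.size);
        it = r.thread_copies.emplace(tid, copy).first;
    }
    // The caller would now establish the mapping from the faulting page in
    // the thread's address space to the corresponding part of the copy.
    uintptr_t offset = addr - r.base;
    return static_cast<char*>(it->second) + offset;
}
```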

Workspace Consistency At synchronization points replica threads need to merge their local updates into the reference view and vice versa. Similar to DTHREADS we can do so by instrumenting well-known libpthread functions and reflecting them to the ROMAIN master for performing deterministic memory updates.

To reduce merge effort, we can use the same optimization that was proposed by DTHREADS's authors: The ROMAIN master can map reference memory regions read-only to the replica threads, so that they share read-only data to reduce resource overhead. As discussed in Section 3.5 this still requires one copied region for every distinct replica if we want to avoid relying on ECC-protected memory.
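One way to realize this merge step is the twin/diff technique known from DTHREADS and software distributed shared memory: keep a pristine twin of each page handed to a thread and, at the next synchronization point, copy only the bytes that differ from the twin into the reference copy. The sketch below is a minimal illustration under that assumption; buffer management and the deterministic merge order are left to the caller.

```cpp
#include <cstddef>
#include <cstdint>

// Merge one page of a thread's private copy back into the reference copy.
// 'twin' holds the page content as it was when the private copy was created,
// so (private_copy[i] != twin[i]) identifies bytes written by this thread
// since the last synchronization point.
void merge_page(uint8_t* reference, const uint8_t* private_copy,
                const uint8_t* twin, size_t page_size = 4096)
{
    for (size_t i = 0; i < page_size; ++i) {
        if (private_copy[i] != twin[i]) {
            // The caller processes threads in a fixed order at each
            // synchronization point, which keeps the merge deterministic.
            reference[i] = private_copy[i];
        }
    }
}
```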

SUMMARY: It is possible to extend ROMAIN to support strongly deterministic multithreading. This approach allows replication of multithreaded applications with data races, but its feasibility is limited due to resource constraints on 32-bit architectures. I propose to provide strong determinism using workspace consistency, which requires adapting ROMAIN's memory manager and the replication-aware thread library.

4.4 Reliability Implications of Multithreaded Replication

So far I focussed this chapter on implementing deterministic multithreaded replication with a low execution time overhead. However, the aim of the ROMAIN operating system service is to detect and correct errors. While error detection is equivalent to the single-threaded case described in Chapter 3, recovery requires additional care in a multithreaded environment.

4.4.1 Error Detection and Recovery

The architecture I presented in this chapter allows ROMAIN to deterministically replicate multithreaded applications. ROMAIN's architecture makes sure that replicas perceive identical inputs. The deterministic multithreading extensions described in the previous section make sure that threads process these inputs in the same order and hence deterministically generate identical outputs. Hardware-induced errors may modify a replica's state and may therefore cause a replica thread to behave differently. Similar to the single-threaded case, the ROMAIN master process will detect such a deviation by comparing the thread's state with the other threads within its thread group while processing externalization events.

Once the ROMAIN master process detects a state mismatch, it starts an error recovery routine. In line with related work I assume a single-fault model here, which means hardware faults are rare enough so that we can assume only one fault to be active at a given point in time. In contrast to single-threaded recovery we face an additional layer of complexity, though: before a faulty replica thread triggers error detection, it may have written arbitrary data to memory. As ROMAIN executes all threads of a replica in the same address space, all other threads in this replica may have read these faulty values. Hence, all threads of a faulting replica must be assumed faulty, even if they did not yet trigger their next externalization event. Given the single-fault assumption we can however assume that all other replicas are still correct.

To return the replicated application into a correct state, the ROMAIN master first halts all replica threads.40

40 On Fiasco.OC, we can halt threads by setting their priority to 0.

As threads execute independently, they will be stopped at arbitrary points within their execution; even threads in the same thread group will almost certainly be interrupted at different points. ROMAIN selects one of the correct replicas Rc to be the recovery template. Then, the master brings all other replicas into the state of the template:

1. The recovery template's address space and memory layout are copied to all other replicas.

2. Each replica thread Ti,j (where i is the replica number and j is the respective thread group number) has its architectural state set to the state of the corresponding recovery template thread Tc,j from its thread group.

Returning all replicas – even the correct ones – to the exact state of the recovery template allows us to handle the fact that we potentially halted replica threads from the same thread group at different points within their computation. Eventually, all replica threads will be returned to an identical and correct state. The replicas can then resume execution.
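The recovery sequence above can be summarized in a few lines of code. The sketch below assumes a single-fault model and hypothetical helper types for replica memory and per-thread architectural state; it is an illustration of the described steps, not ROMAIN's actual implementation.

```cpp
#include <vector>

struct ThreadState { /* architectural registers: instruction pointer, stack pointer, ... */ };

struct Replica {
    bool faulty;                        // flagged by state comparison in the master
    std::vector<ThreadState> threads;   // one entry per thread group
    // Placeholder: duplicate the address-space layout and contents of
    // another replica.
    void copy_memory_from(const Replica& other);
};

// Recovery under the single-fault assumption: pick any non-faulty replica
// as the template and force every other replica into its state.
void recover(std::vector<Replica>& replicas)
{
    // Step 0: halt all replica threads (omitted; e.g. by lowering their
    // priority to 0 on Fiasco.OC).

    // Select a correct replica as recovery template.
    const Replica* tmpl = nullptr;
    for (const auto& r : replicas)
        if (!r.faulty) { tmpl = &r; break; }
    if (!tmpl) return;  // no majority left -- unrecoverable

    // Copy the template's memory and per-thread architectural state to all
    // other replicas, including the correct ones, so that every thread
    // resumes from one identical point.
    for (auto& r : replicas) {
        if (&r == tmpl) continue;
        r.copy_memory_from(*tmpl);
        for (size_t j = 0; j < r.threads.size(); ++j)
            r.threads[j] = tmpl->threads[j];
        r.faulty = false;
    }
    // Finally, resume all replica threads (omitted).
}
```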

4.4.2 Recovery Limitations

The recovery approach I presented in this section assumes that we can always return all replicas into a consistent and correct state. This assumption is true as long as replicas work on isolated resources, because then we can simply copy the state of a correct resource over the state of a faulty one. However, this assumption no longer holds if we have resources shared among all replicas, for two reasons:

1. Faulty data may propagate through the shared resource to other replicas, thereby undermining fault isolation.

2. If no replicated copies of a shared resource exist, we cannot return the resource to a correct state. This problem applies to the lock info page that ROMAIN uses to implement cooperative determinism: the respective memory area is shared among all replicas. Corrupting the LIP may affect other replicas’ behavior (problem #1), for instance by blocking threads that would otherwise be able to acquire a lock. Furthermore, as the LIP only exists as a single copy, we cannot return it to a consistent state during recovery (problem #2). To work around these issues, we need to have a closer look at what errors the LIP may suffer from. As we deal with hardware faults, we do not need to protect ourselves against targeted attacks of a single replica on the LIP. Instead, the LIP may be affected by hardware errors that I roughly classified into two categories:

1. Corruption of the LIP: LIP data is part of an application’s address space. As the result of a fault, this data may become corrupted:

• A fault affecting the target pointer of a write operation may change this pointer to point into the LIP. The write operation will then overwrite arbitrary data within the LIP.

• A fault affecting the size of a write operation (e.g., the size parameter of a memcpy() operation) may cause a valid write operation to overflow its target buffer. If this buffer is located before the LIP in the replica's address space, the overflow may affect LIP content.

• Last, an SEU in memory or during a computation may affect the LIP directly.

2. Inconsistent lock acquisition: Even if the LIP is correctly modified with respect to the cooperative determinism protocol, its content may be inconsistent for recovery purposes:

• A replica may decide to acquire a lock based on a previous error. In this case, the replica will correctly acquire the lock, but none of the correct threads in the same thread group will ever do so. Given that we only reset lock ownership once all threads of a thread group have called rep_unlock(), this lock will remain locked indefinitely and other threads trying to correctly acquire the lock will block.

• I explained above that ROMAIN halts all threads during recovery and that we may stop threads at different points in their execution. As a result, we may encounter situations where one correct thread already acquired a lock, while a second correct thread did not reach the point of acquisition yet. During recovery, we need to bring all threads into the same state. Consequently, we need to either release or acquire the lock for everyone, depending on which replica we select as the recovery template.

Detecting Corruption using Guards and Canaries In order to reduce the chance of LIP corruption, I modified the way ROMAIN attaches the LIP to each replica's address space as shown in Figure 4.21. This modification is inspired by Cowan and colleagues' work on protecting memory against buffer overflow attacks.41

Figure 4.21: Protecting the LIP within the replica's address space: guard pages prevent overflow writes from different sources, canaries aim to detect faulty writes within the LIP.

To prevent buffer overflows in application data from corrupting the LIP, ROMAIN places an inaccessible guard page before and behind the LIP. Thereby, any overflowing write sequence will cause a page fault exception before modifying LIP state.42 To increase the likelihood of detecting corruptions within the LIP, all lock entries are separated by a canary value. The canary is a fixed bit pattern that will never be modified by normal operation. Whenever a thread tries to acquire a lock, it first validates the correctness of the respective lock entry's canary value. If the canary is found to be corrupt, the thread notifies the master of an unrecoverable error. As the LIP is shared across all replicas, the only suitable reaction to such an event is for the master to terminate the replicated application. However, this approach at least makes sure that replicas generate no incorrect output.

An alternative solution to increase the chance of successful recovery would be to replicate the LIP itself similar to Borchert's replication of kernel data structures.43 However, I did not implement this alternative.

41 C. Cowan, P. Wagle, C. Pu, S. Beattie, and J. Walpole. Buffer Overflows: Attacks and Defenses for the Vulnerability of the Decade. In DARPA Information Survivability Conference and Exposition, volume 2, pages 119–129, 2000.
42 The guard page behind the LIP is necessary as some memcpy implementations copy backwards.
43 Christoph Borchert, Horst Schirmeier, and Olaf Spinczyk. Generative Software-Based Memory Error Detection and Correction for Operating System Data Structures. In International Conference on Dependable Systems and Networks, DSN'13. IEEE Computer Society Press, June 2013.

Consistent LIP Recovery using Acquisition Bitmaps I addressed the problem of consistent LIP recovery by adding a mechanism that tracks which replica thread already acquired a lock. For this purpose I added an acq_bitmap field
to each LIP lock entry. When a thread Ti,j from thread group j acquires a lock, it sets the i-th bit in this field to 1. Upon lock release, the thread resets this bit to 0. Using this mechanism, we can address the two inconsistent lock acquisition subproblems as follows (a sketch of the resulting lock entry layout follows the list below):

1. If a thread erroneously acquired a lock while the correct threads did not, the acq_bitmap will have this thread’s bit set to 1, while all other bits are 0. During recovery, ROMAIN can set this bit to 0 and thereby return the respective lock entry to a consistent and correct state.

2. If correct threads are stopped during recovery and some threads halt before acquiring a lock and others halt afterwards, the respective lock entry will have both 0 and 1 entries in its acq_bitmap. To correct this consistently, ROMAIN sets all bits in the bitmap to the value of the recovery template's bit. Hence, all other threads acquire the lock during recovery only if the recovery template thread had already acquired it.
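To make the layout concrete, the following sketch shows what a protected LIP lock entry and the recovery normalization described in the two cases above could look like. The exact struct layout, the canary constant's value, and all field names besides acq_bitmap are assumptions for illustration; only the canary and bitmap semantics are taken from the text.

```cpp
#include <cstdint>

constexpr uint16_t kCanary = 0xA5A5;  // fixed 16-bit pattern (assumed value)

struct LipLockEntry {
    uint16_t canary;       // separates lock entries, detects stray writes
    uint32_t acq_bitmap;   // bit i set => replica thread i holds the lock
    // ... further cooperative-determinism bookkeeping omitted ...
};

// Called by a replica thread before touching the lock entry.
bool canary_intact(const LipLockEntry& e)
{
    // A corrupted canary is an unrecoverable error: the thread notifies the
    // master, which terminates the replicated application.
    return e.canary == kCanary;
}

// Recovery normalization of one lock entry, given the bit index of the
// recovery template replica and the number of threads per thread group.
void normalize_for_recovery(LipLockEntry& e, unsigned template_bit,
                            unsigned num_threads)
{
    uint32_t all_bits = (num_threads >= 32) ? ~0u : ((1u << num_threads) - 1);
    bool template_holds = (e.acq_bitmap >> template_bit) & 1u;
    // Both recovery rules reduce to copying the template's bit to every
    // thread: a lone erroneous acquisition is cleared (the template's bit
    // is 0), and partially acquired locks become consistent with the template.
    e.acq_bitmap = template_holds ? all_bits : 0u;
}
```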

While the above enhancements increase LIP reliability, they also increase resource and execution time overhead. I chose to use 16-bit canaries, so that the LIP grows by 2,048 × 2 bytes = 4,096 bytes. Furthermore, I repeated the microbenchmark that I used in the previous sections with cooperative determinism and the enhancements described above. DMR runtime increases by about 10%, TMR runtime by about 15%.

SUMMARY: Detecting errors in multithreaded replicas works similarly to error detection for single-threaded ones. Error recovery needs additional care, because all replicas and their threads need to be returned to the same consistent state. Cooperative determinism suffers from the problem that the LIP is shared among all replicas. Corrupt LIP entries may therefore constitute unrecoverable errors. I combined guard pages, lock entry canaries, and fine-grained ownership tracking to reduce the chance that such a situation arises.

The above analysis shows that there is another dimension when evaluating different reliability mechanisms: While cooperative determinism provides a low execution time overhead, it is harder to protect against corruption from a faulty replica. In contrast, enforced determinism does not need to care for shared data, but suffers from higher execution time overheads.

In Chapter 6 I will show that this difference between enforced and cooperative determinism is only one particular instance of a more general problem: every software-implemented fault tolerance mechanism relies on specific (but different) software and hardware features to function correctly. We need to identify and specially protect these components to achieve full-system reliability.

5 Evaluation

In the previous two chapters I described the design of ROMAIN and used microbenchmarks to motivate my design decisions. In this chapter I present a larger-scale evaluation of how well ROMAIN achieves its goal of providing fault-tolerant execution to binary-only user applications on top of FIASCO.OC. For this purpose I analyze ROMAIN's error detection capabilities (error coverage and detection latency), the accompanying resource and execution time overheads, as well as ROMAIN's implementation complexity. I show that ROMAIN detects 100% of all errors within a replicated application and recovers from more than 99.6% of them. ROMAIN's best-case execution time overhead for replicating single-threaded applications is about 13% for triple-modular redundant execution. Multithreaded replication is more costly and implies up to 65% overhead when running three replicas of an application with four worker threads. By providing majority voting, ROMAIN allows for fast error recovery. Parts of the experiments presented in this chapter were published in SOBRES 2013 and EMSOFT 2014.1,2

1 Björn Döbel and Hermann Härtig. Where Have all the Cycles Gone? – Investigating Runtime Overheads of OS-Assisted Replication. In Workshop on Software-Based Methods for Robust Embedded Systems, SOBRES'13, Koblenz, Germany, 2013.
2 Björn Döbel and Hermann Härtig. Can We Put Concurrency Back Into Redundant Multithreading? In 14th International Conference on Embedded Software, EMSOFT'14, New Delhi, India, 2014.

5.1 Methodology

Wilken identified five properties that can be analyzed to assess a fault-tolerant system.3

3 Kent D. Wilken and John Paul Shen. Continuous Signature Monitoring: Low-Cost Concurrent Detection of Processor Control Errors. IEEE Transactions on CAD of Integrated Circuits and Systems, 9(6):629–641, 1990.

I apply his taxonomy to analyze ROMAIN and evaluate these properties:

1. Error Coverage measures the fraction of errors that a fault tolerance mechanism is able to detect and recover from.

2. Error Detection Latency determines how long a fault resides in the system before it is detected.

3. Replicated execution incurs execution time overhead because the ROMAIN master process spends additional time waiting for replicas, inspecting their states, and performing replicated resource management.

4. Replicated execution furthermore induces a resource overhead by maintaining resource copies to facilitate independent execution of replicas.

5. ROMAIN adds code complexity to the L4 Runtime Environment by adding the master process component.

I evaluate error coverage and detection latency using fault injection experiments in Section 5.2. Thereafter, I evaluate ROMAIN's execution time and resource overhead using application benchmarks in Section 5.3. I analyze the master process' code complexity in Section 5.4. Finally, I compare the achieved results to related work in Section 5.5.

In line with related work4,5 I assume my fault model to be single-event upsets in memory and registers. In this fault model, we can detect errors by running two replicas (double-modular redundancy – DMR) and we can correct detected errors by majority voting when running three replicas (triple-modular redundancy – TMR). I therefore only investigate DMR and TMR setups and leave the analysis of multi-error fault models and ROMAIN running more than three replicas for future work.

4 Ute Schiffel, André Schmitt, Martin Süßkraut, and Christof Fetzer. ANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software. In International Conference on Computer Safety, Reliability and Security, Safecomp'10, Vienna, Austria, 2010.
5 V. B. Kleeberger, C. Gimmler-Dumont, C. Weis, A. Herkersdorf, D. Mueller-Gritschneder, S. R. Nassif, U. Schlichtmann, and N. Wehn. A Cross-Layer Technology-Based Study of how Memory Errors Impact System Resilience. IEEE Micro, 33(4):46–55, 2013.

5.2 Error Coverage and Detection Latency

The main goal of every fault tolerance mechanism is to detect and recover from errors before they cause system failure. Given a fault model and a workload, the mechanism’s error coverage measures the ratio of faults of this specific type that are detected and corrected:

Coverage = N_detected / N_total

In addition to error coverage, we can measure the error detection latency as the time between the activation of an error and its detection by the respective fault tolerance mechanism. As I explained in Chapter 2, redundant multithreading (RMT) techniques – such as ROMAIN – aim to maintain best possible error coverage while reducing the mechanism's execution time overheads as much as possible. For this purpose, RMT often trades higher detection latencies for lower overhead by reducing state validation. This trade-off does not affect error coverage as long as two properties hold:

1. A fault must never lead to a visible application failure.

2. The error detection latency must not exceed the minimum expected inter-arrival time of hardware faults. Otherwise, the number of replicas in the system might no longer suffice to detect and recover from these faults.

Similar to other RMT techniques, ROMAIN ensures the first property by performing state validation whenever application state is about to become visible to an external observer. In Chapter 3 I showed the techniques the ROMAIN master uses to intercept all possible externalization events for this purpose. As RMT techniques only validate state at externalization points, ROMAIN may suffer from problems related to the second property. This will be the case if a replicated application performs long stretches of computation without performing any interceptable externalization event in between. However, ROMAIN implements two mechanisms that allow it to cope with such situations:

1. The user can tune the number of replicas to the number of expected concurrent faults using Schneider's 2N+1 rule6 and thereby create a setup that can handle multiple concurrent faults. This number is a startup option passed to the ROMAIN master.

2. As I explained in Section 3.8.2, Martin Kriegel extended ROMAIN with a mechanism to bound error detection latencies by enforcing checks after a fixed number of retired instructions.7 The application can thereby be forced into state validation in periods shorter than the expected inter-arrival time of faults.

6 Fred B. Schneider. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys, 22(4):299–319, December 1990.
7 Martin Kriegel. Bounding Error Detection Latencies for Replicated Execution. Bachelor's thesis, TU Dresden, 2013.

5.2.1 Vulnerability Analysis Using Fault Injection

Fault injection (FI)8 is a standard technique to evaluate the behavior of a system in the presence of faults. In contrast to waiting for errors to manifest in physical hardware, these experiments allow us to insert these errors in a controlled environment. Given a fault model, such a controlled environment also allows us to inject every possible error and thereby assess the error coverage of a given fault tolerance method.

Fault-injection methods can be implemented at the hardware and software levels. Hardware-level injectors work on implementations of the hardware in question and often augment this hardware with specific points to inject faults. These approaches can perform fine-grained instrumentation of the hardware and inject faults down to the transistor level. Sterpone used an extended FPGA implementation of a microprocessor for this purpose.9 In contrast, Heinig and colleagues performed fault injection on an embedded ARM processor and used an attached hardware debugger to inject faults.10 The downside of these approaches is that they require dedicated hardware and labor-intensive manual setup.

Software-level fault injection tools try to inject faults without requiring direct access to the underlying hardware. For example, Gu performed fault injection experiments for Linux on x86 hardware using a debugging kernel module and hardware breakpoints, but without access to an expensive x86 hardware debugger.11 Other researchers – such as Yalcin12 – proposed to use hardware simulators instead of real hardware for fault injection. Simulator-based fault injection restricts reliability assessment to the level of detail provided by the underlying simulator. Cho pointed out that these inaccuracies make it hard to use simulator-based FI to draw any conclusions about the vulnerability properties of physical hardware.13

In this section I am not interested in hardware properties, but in the reaction of ROMAIN to misbehaving hardware. For this purpose, software-level FI still works: we can select a representative fault model (such as SEUs in memory), apply this fault model in a simulator and observe if ROMAIN detects and corrects these errors. While the resulting error distributions may not always be representative of physical hardware behavior, we can still make an apples-to-apples comparison between the same experiment running without replication and protected by ROMAIN. Under these circumstances we can benefit from another property of simulation-based fault injection: we can now run many such simulators in parallel and thereby drastically reduce the time needed to perform large-scale fault injection experiments.

8 Mei-Chen Hsueh, Timothy K. Tsai, and Ravishankar K. Iyer. Fault Injection Techniques and Tools. IEEE Computer, 30(4):75–82, April 1997.
9 Luca Sterpone and Massimo Violante. An Analysis of SEU Effects in Embedded Operating Systems for Real-Time Applications. In International Symposium on Industrial Electronics, pages 3345–3349, June 2007.
10 Andreas Heinig, Ingo Korb, Florian Schmoll, Peter Marwedel, and Michael Engel. Fast and Low-Cost Instruction-Aware Fault Injection. In GI Workshop on Software-Based Methods for Robust Embedded Systems (SOBRES '13), 2013.
11 Weining Gu, Z. Kalbarczyk, and R.K. Iyer. Error Sensitivity of the Linux Kernel Executing on PowerPC G4 and Pentium 4 Processors. In Conference on Dependable Systems and Networks, DSN'04, pages 887–896, June 2004.
12 G. Yalcin, O.S. Unsal, A. Cristal, and M. Valero. FIMSIM: A Fault Injection Infrastructure for Microarchitectural Simulators. In International Conference on Computer Design, ICCD'11, 2011.
13 Hyungmin Cho, Shahrzad Mirkhani, Chen-Yong Cher, Jacob A. Abraham, and Subhasish Mitra. Quantitative Evaluation of Soft Error Injection Techniques for Robust System Design. In Design Automation Conference (DAC), 2013 50th ACM / EDAC / IEEE, pages 1–10, 2013.

In order to determine the coverage of a fault tolerance mechanism using fault injection, we need to inject all faults that might affect the given system and compute the ratio between detected and injected faults. This fault space often grows very large. As an example, Figure 5.1 shows the required set of experiments if we assume single-event upsets in memory as the fault model. In this example we need to perform one experiment for every discrete time instant and every bit in memory.

Figure 5.1: Fault Space for Single-Event Upsets in Memory (one experiment per memory bit and per discrete time instant, measured in instructions).

For most applications, this fault space is too large to fully enumerate even with fast FI mechanisms. Therefore, two techniques are applied to obtain practical results in a timely manner:

1. Fault Space Sampling: Instead of performing all necessary experiments, sampling approaches select a subset of these experiments and try to approximate the global result by performing only the experiments in this sample. The accuracy of the obtained results varies depending on the sample size and the method with which the sample was chosen. Random sampling selects a random subset of experiments. This approach works if the selected sample is large enough and we are only interested in high-level questions, such as "What fraction of faults is going to crash the application?" Gu's Linux study used this approach.11 Random sampling fails if we are interested in the vulnerability of specific data structures or functions, because the random sample may misrepresent them. An alternative – which I call workload reduction – focuses fault injection on interesting subsections of a larger workload and fully enumerates these. This approach was for instance applied by Arlat and colleagues to analyze the reliability of kernel subsystems in LynxOS.14

2. Fault Space Pruning: If we take a closer look at the fault space we find groups of experiments that will produce identical results. As an example, consider a scenario where a memory word W is read at time instants 0 and 10. Any fault in instants 1 through 10 will have the same consequence: the wrong value will be read at instant 10 and impact program behavior thereafter. Hence, we can speed up experimentation by only performing a single one of these experiments and apply its result to all others. Modern fault injection frameworks, such as Relyzer15 and FAIL*,16 analyze the fault space to identify such groups. They thereby reduce the number of required experiments without reducing fault injection accuracy.

I use the FAIL* fault injection framework to analyze ROMAIN's error coverage and detection latencies. FAIL* uses the Bochs17 emulator to inject faults and monitor program behavior. Bochs performs instruction-level emulation of the underlying platform and can therefore only model faults that are visible at this granularity. I specifically look at two fault models that FAIL* supports out of the box: SEUs in (1) memory and (2) general-purpose CPU registers.

14 Jean Arlat, Jean-Charles Fabre, Manuel Rodríguez, and Frédéric Salles. Dependability of COTS Microkernel-Based Systems. IEEE Transactions on Computing, 51(2):138–163, February 2002.
15 Siva Kumar Sastry Hari, Sarita V. Adve, Helia Naeimi, and Pradeep Ramachandran. Relyzer: Exploiting Application-Level Fault Equivalence to Analyze Application Resiliency to Transient Faults. In International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 123–134, New York, NY, USA, 2012. ACM.
16 Horst Schirmeier, Martin Hoffmann, Rüdiger Kapitza, Daniel Lohmann, and Olaf Spinczyk. FAIL*: Towards a Versatile Fault-Injection Experiment Framework. In Gero Mühl, Jan Richling, and Andreas Herkersdorf, editors, International Conference on Architecture of Computing Systems, volume 200 of ARCS'12, pages 201–210. German Society of Informatics, March 2012.
17 http://bochs.sourceforge.net

The current version of FAIL* does not distinguish between different address spaces on top of a modern operating system. As I want to inject faults into a specific replica address space, I therefore use an extension to FAIL* that was originally developed by Martin Unzner in a thesis I advised.18 With this extension we can distinguish between code executing within one dedicated address space (i.e., the faulty replica) and code executing outside this address space.

As FAIL* executes all experiments in the Bochs emulator, it has no notion of wall-clock time. Instead, FAIL* discretizes time in terms of instructions retired by the observed platform. Hence, time (i.e., error detection latency) in this subsection is measured in retired instructions.

FAIL* provides a campaign server that sends experiments to concurrent instances of the FAIL* fault injection client and collects the results. This allows fault injection runs to be parallelized and I had the opportunity to do so on the Taurus HPC Cluster at the Center for Information Services and High Performance Computing (ZIH) at TU Dresden. The FI experiments I describe below consumed a total of 66,000 CPU hours on this cluster.

18 Martin Unzner. Implementation of a Fault Injection Framework for L4Re. Belegarbeit, TU Dresden, 2013.

5.2.2 Benchmark Setup

I selected four applications as benchmarks to evaluate ROMAIN's fault tolerance capabilities. To keep fault injection manageable, I focus on small benchmarks. I will use longer-running examples to validate ROMAIN's computational overhead in the next section.

1. Bitcount is a simple, compute-bound benchmark from the MiBench benchmark suite.19 It compares different implementations of bit counter algorithms. This property makes its results susceptible to bit flip effects.

2. Dijkstra is another benchmark from the MiBench suite. It represents an implementation of Dijkstra's path finding algorithm20 that is used for instance in network routing.

3. IPC is an example for FIASCO.OC's inter-process communication mechanism. Two threads run inside the same address space and exchange a message. This benchmark therefore focuses on faults that happen directly before or after invoking a system call on FIASCO.OC.

4. CRC32 uses the Boost C++ implementation21 of the CRC32 checksum to compute a checksum over a chunk of data in memory. CRC32 is a commonly used checksum algorithm in network applications.22

19 M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In International Workshop on Workload Characterization, pages 3–14, Austin, TX, USA, 2001. IEEE Computer Society.
20 Edsger W. Dijkstra. A Note on Two Problems in Connexion With Graphs. Numerische Mathematik, 1:269–271, 1959.
21 http://www.boost.org
22 Philip Koopman. 32bit Cyclic Redundancy Codes for Internet Applications. In Conference on Dependable Systems and Networks, DSN '02, pages 459–472, Washington, DC, USA, 2002. IEEE Computer Society.

I prepared the experiments by creating boot images of the application setups that can be run by FAIL*. For every setup I created one image that runs the benchmark natively on L4Re and a second image that runs the benchmark in ROMAIN with TMR. Table 5.1 shows the dimension of my fault injection campaigns. While previous works23,24 used random sampling with a sample size of up to 10,000 experiments, my study covers the whole fault space of these four applications and thereby represents several million experiments for every benchmark and fault model.

To reduce fault injection time, I reduced the workloads to their interesting parts – i.e., the main work loop of the benchmarks – leaving out application startup and teardown. The table shows the number of instructions executed in the injection phase of each benchmark.

23 Ute Schiffel, André Schmitt, Martin Süßkraut, and Christof Fetzer. ANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software. In International Conference on Computer Safety, Reliability and Security, Safecomp'10, Vienna, Austria, 2010.
24 Weining Gu, Z. Kalbarczyk, and R.K. Iyer. Error Sensitivity of the Linux Kernel Executing on PowerPC G4 and Pentium 4 Processors. In Conference on Dependable Systems and Networks, DSN'04, pages 887–896, June 2004.

I furthermore benefit from FAIL*’s fault space pruning to reduce the number of required experiments while still covering the whole fault space. The table therefore shows the total number of experiments covered by each fault injection campaign as well as the pruned number of experiments, i.e., the runs that I actually had to perform.

Benchmark   Fault Model    # of Instructions   Total Experiments   Experiments after Pruning
Bitcount    Register SEU   54,866              9,100,824           1,635,808
            Memory SEU                         405,710,740         374,865
Dijkstra    Register SEU   108,171             16,990,912          3,694,177
            Memory SEU                         2,399,115,368       1,221,705
IPC         Register SEU   10,800              1,567,088           341,049
            Memory SEU                         46,656,894          116,881
CRC32       Register SEU   108,089             23,976,128          2,408,329
            Memory SEU                         510,278,440         288,881

Table 5.1: Overview of Fault Injection Experiments

For each FI experiment I collected the following information for later processing of the results:

• Experiment Information: For every experiment I keep track of what kind of fault (e.g., a bit flip in which bit of which register) was injected at which discrete point in time.

• Experiment Outcome and Output: A fault injection experiment is executed until the program reaches a predefined terminating instruction (successful completion) or a timeout expires. I set this timeout to 4,000,000 instructions, which is 40 times larger than the longest of the benchmarks in question. This long timeout is necessary to give ROMAIN sufficient time for error detection and correction in the failure case. Depending on the result type I classify the experiment outcome:

1. No Effect: The experiment reached the terminating instruction and the experiment's output matches the output of an initial fault-free run. In case of an injected fault this means that the fault did not alter visible program behavior.

2. Silent Data Corruption (SDC): The program successfully terminated, but the output differs from the initial fault-free run. This happens if a hardware fault modifies the program's output but does not lead to a visible crash.

3. Crash: The experiment terminated, but the terminating instruction pointer or the logged output indicate that the program crashed, for instance by accessing an invalid memory address.

4. Timeout: The program did not reach its final instruction and the output does not indicate a user-visible crash. This happens for instance if a hardware fault makes the program get stuck in an infinite loop.

5. Corrected Error: When injecting faults while running in ROMAIN, the replication service detected and corrected an error. In the discussion below I furthermore distinguish between two sources of error detection:

(a) Detection By State Comparison: All replicas reached their next externalization event and ROMAIN detected a state mismatch.

(b) Detection By Timeout: At least one of the replicas did not reach the next externalization event before the first replica's event watchdog timeout expired.

• Faulty Execution Time: I log the number of instructions the faulty replica executes before an error is detected. As we will see in Section 5.2.4, this data is the closest information we can get about error detection latency in a FAIL* setup.
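The outcome classification described above can be expressed as a small post-processing step over the logged experiment data. The following sketch is an illustrative re-implementation of these rules, not the actual tooling used for the campaigns; the record fields are assumed.

```cpp
#include <string>

enum class Outcome { NoEffect, SilentDataCorruption, Crash, Timeout, Corrected };

struct ExperimentLog {
    bool        reached_end;        // hit the predefined terminating instruction
    bool        crash_indicated;    // e.g. unhandled page fault visible in the log
    bool        recovery_reported;  // ROMAIN logged a detected and corrected error
    std::string output;             // program output captured during the run
};

// Classify one experiment against the output of an initial fault-free run.
Outcome classify(const ExperimentLog& run, const std::string& golden_output)
{
    if (run.recovery_reported)       return Outcome::Corrected;
    if (run.crash_indicated)         return Outcome::Crash;
    if (!run.reached_end)            return Outcome::Timeout;
    if (run.output != golden_output) return Outcome::SilentDataCorruption;
    return Outcome::NoEffect;
}
```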

5.2.3 Coverage Results

I executed one fault injection campaign for every benchmark and every fault model. In every campaign I first injected faults into the benchmark without protection to get a base distribution of fault outcomes. Thereafter I injected faults into the same benchmark running with ROMAIN in TMR mode. Figure 5.2 shows the distribution of fault types when injecting SEUs into general-purpose registers during the runtime of the benchmarks. Figure 5.3 shows the same distribution when injecting bit flips into application memory. (Single benchmark names represent the native runs. Replicated benchmark results are suffixed with TMR.)

Figure 5.2: ROMAIN Error Coverage, Fault Model: SEUs in General-Purpose Registers. Outcome categories: No Effect, Crash, SDC, Timeout, Recovered (Compare), Recovered (Timeout).

Figure 5.3: ROMAIN Error Coverage, Fault Model: SEUs in memory accessed by the application.

We see crash, timeout, and SDC failures in native execution across all benchmarks. Register SEUs are more likely to lead to crashes, because registers are often used for dereferencing pointers. A bit flip in such a pointer may modify the target address to point into an invalid memory area, leading to an unhandled page fault. A closer look at the native benchmark outcomes confirms that in all campaigns more than 80% of the crashes can be attributed to page faults.

When looking at native memory errors, the CRC32 benchmark stands out as it shows an SDC rate of 97%. This can be explained by the fact that most memory accesses performed by this benchmark target the memory region that a CRC checksum is computed from, because the program does not access any other memory. Hence, bit flips in this area will lead to a diverging checksum and show up as SDC errors.

Comparing the native experiments to the TMR experiments we see that ROMAIN detects and corrects all memory errors and close to all register SEUs. This confirms that ROMAIN works as intended and provides fault tolerant execution to the applications it protects.

I had a closer look at the CRASH errors that appear when injecting faults into the Bitcount, IPC, and Dijkstra benchmarks. In all of these cases, an error is detected by ROMAIN but recovery does not succeed within the bounds of the fault injection experiment. While these experiments are a small fraction of all injections, they still number in the thousands. I manually repeated several of those experiments and in all cases recovery succeeded during my repeated experiments. I conclude that these crashes were only identified as
such because the experiments did not run to completion before the timeout of my fault injection experiments triggered. Nevertheless I show these experiments as crashes in my experiment results, because I did not revalidate all of the experiments.

From the results we see that ROMAIN's error detection rate is 100%. Recovery succeeds in at least 99.6% of the injected register errors and in 100% of the injected memory errors. The largest fraction of errors are detected by state comparison. Only a few errors (less than 10% of the faults in Bitcount, IPC, and CRC32) are found as the result of ROMAIN's event watchdog expiring.

This is contrasted by the Dijkstra benchmark, where most errors are found as the result of a timeout. In contrast to the other benchmarks, the distance between externalization events in Dijkstra is high, because the program executes the whole path finding algorithm before performing another system call. Hence, the correct replicas execute for a long time before reaching this call. If we now inject a fault and the affected replica fails fast, e.g., by raising a page fault, the replica will then enter the ROMAIN master and wait for the remaining replicas to raise an event. As these replicas still execute for a long time, the faulty replica's watchdog expires before the correct ones arrive for state comparison. Recovery then finds that no majority of replicas is available, waits for all other replicas and then succeeds in returning the program to a correct state. Hence, these errors are actually corrected by state comparison, but my automated outcome classification tool marked them as "detected by timeout" due to their output. I repeated the Dijkstra memory experiment with the event watchdog programmed to a timeout three times as high as before. In this case, the correct replicas reach their next externalization event before the faulty replica's timeout expires. In turn, 93% of the detected errors are then classified as "detected by state comparison."

All in all, 100% error detection and close to 100% correction rates show that ROMAIN achieves its goal of protecting applications against the effects of hardware faults. Note however that these coverages are computed only for code running inside the protected application. While this is in line with all other SWIFT mechanisms I am aware of, it ignores the fact that there are other software components that remain unprotected by ROMAIN. These components include the operating system kernel as well as the ROMAIN master process and I will return to this problem in Chapter 6.

5.2.4 Error Detection Latency

As I explained previously, redundant multithreading trades higher error detection latencies for reduced execution time overhead. We saw in the previous experiments that ROMAIN achieves fault tolerance and we will see in Section 5.3 that the respective execution time overheads are indeed lower than for other non-RMT methods. In this section I now try to estimate ROMAIN's error detection latency.

One main assumption I make in this thesis is that modern hardware provides a sufficient number of physical processors and that ROMAIN can therefore execute every replica on a dedicated CPU as shown in Figure 5.4.

Figure 5.4: Error Detection Latency in a ROMAIN TMR setup on real hardware (fault injection into one replica; detection latency spans from injection to the master's state comparison).

Three replicas execute concurrently and the ROMAIN master validates their states whenever they execute an externalization event. Let us now assume that we inject a fault into Replica 1 as indicated in the figure. The replicas will run until their next externalization event, the ROMAIN master will compare their states and flag an error. On a physical computer we can therefore measure the error detection latency as the wall clock time difference between the injection and detection times. In contrast to physical hardware, FAIL* executes all replicas on a single emulated CPU. Therefore, replicas share this CPU and their execution order will be determined by FIASCO.OC’s scheduler. Depending on the actual situation, replicas may be scheduled in arbitrary order, such as one of the orderings shown in Figure 5.5. As we see in the figure, the measured wall clock time may vary depending on the chosen schedule and it is therefore difficult to draw any conclusions about error detection latency.

Figure 5.5: Possible replica schedules in a single-CPU ROMAIN TMR setup (two orderings of the replicas and the master lead to different wall clock detection latencies).

However, my instrumentation in FAIL* allows me to measure the number of instructions the faulty replica executes before ROMAIN detects an error and corrects it. Figure 5.6 plots the cumulative distribution function for this faulty execution time for each of my fault injection campaigns. Each plot distinguishes between the results for memory and register SEUs. I furthermore show separate plots for errors that were detected by comparing replica states and errors that were detected due to ROMAIN's event watchdog.

The distributions differ across applications. Every application has a specific offset at which all errors that will be detected by state comparison are found. Bitcount, which performs several system calls during the fault injection experiment, has this offset at around 50,000 instructions. The fairly short IPC benchmark reaches it at 10,000 instructions. In contrast, Dijkstra and CRC32 perform long stretches of computation and therefore only reach this point later (2 million instructions for Dijkstra, 150,000 instructions for CRC32).

The results furthermore show ROMAIN's event watchdog in action. In the Bitcount, IPC, and CRC32 benchmarks, the rate of errors detected by watchdog timeout jumps from close to zero to 100% at around 500,000 instructions after the injection. This corresponds to the timeout value that I configured for these experiments. We also see that this is not the case for the Dijkstra benchmark. Again, this is caused by the fact that Dijkstra computes for several million cycles before reaching its next externalization event and the timeouts we see in Dijkstra are caused by the faulty replica whose watchdog expires before the correct replicas reach an externalization event.

Figure 5.6: Number of instructions the faulty replica executed before ROMAIN detected an error (CDF over all FI experiments). One panel per benchmark (Bitcount, IPC, Dijkstra, CRC32); separate curves for register and memory SEUs, each detected either by state comparison or by timeout.

Can we Compute Error Detection Latency? If we reconsider concurrently executing replicas as in Figure 5.7, we see that the faulty replica's execution time is not identical to the error detection latency:

• ROMAIN performs state comparison once all replicas reach their next externalization event. As replicas execute in parallel and do not access shared resources in this time, their wall clock execution time is the maximum of the single replica execution times:

t_rmax = max(t_R1, t_R2, t_R3)

• The error detection latency needs to additionally incorporate the time t_Master that the master process requires to perform replica state comparison as well as the time t_Kernel that is spent inside FIASCO.OC for processing scheduling interrupts as well as delivering replica events to the master process.

The error detection latency can therefore be expressed as

t_detect = t_rmax + t_Master + t_Kernel

Figure 5.7: Computing error detection latency.

Calculating this latency requires proper measurements or analysis of the respective components. Such an analysis is out of scope of my thesis, but
appears to be an interesting direction for future research. I would like to provide two starting points for this research here:

1. In order to determine t_rmax we need to measure the maximum time it may take a correct replica to execute before reaching its next externalization event. This time provides a lower bound for when the next replica state comparison will set in. As replicas execute independently on different cores and do not access any shared state between these events, such an analysis will be similar to traditional worst case execution time (WCET) analysis.25

2. In addition to t_rmax we furthermore need to incorporate t_Master and t_Kernel. These components can first be determined using separate WCET analyses of the respective components. The replicas, the master, and the operating system kernel can then be modeled as a sequence of fork-join parallel tasks. Axer showed that it is possible to perform a response time analysis for this class of tasks.26

25 Reinhard Wilhelm, Jakob Engblom, Andreas Ermedahl, Niklas Holsti, Stephan Thesing, David Whalley, Guillem Bernat, Christian Ferdinand, Reinhold Heckmann, Tulika Mitra, Frank Mueller, Isabelle Puaut, Peter Puschner, Jan Staschulat, and Per Stenström. The Worst-Case Execution-Time Problem – Overview of Methods and Survey of Tools. ACM Transactions on Embedded Computing Systems, 7(3):36:1–36:53, May 2008.
26 Philip Axer, Moritz Neukirchner, Sophie Quinton, Rolf Ernst, Björn Döbel, and Hermann Härtig. Response-Time Analysis of Parallel Fork-Join Workloads with Real-Time Constraints. In Euromicro Conference on Real-Time Systems, ECRTS'13, July 2013.

SUMMARY: I performed fault injection experiments to analyze ROMAIN's error coverage. Injecting SEUs into memory and general-purpose registers I found that ROMAIN detects 100% of the errors in a replicated application and is able to recover from more than 99.6% of these errors. I furthermore explored opportunities to analyze error detection latency. My fault injection experiments show that replicas, depending
on the actual application, execute several thousands up to several millions of instructions before an error is detected. Actual determi- nation of error detection latencies bears similarities with worst-case execution time analysis and is left as an open issue for future work.

5.3 Runtime and Resource Overhead

Having shown that ROMAIN achieves its goal of detecting and correcting hardware errors before they lead to application failure, I now investigate the runtime characteristics of the replication service. I use the SPEC CPU 2006 and SPLASH2 benchmark suites to evaluate replication overhead. Thereafter I measure the slowdown ROMAIN introduces when replicating a shared-memory application and show that the time needed for error recovery is dominated by the replica's memory footprint.

5.3.1 Test Machine and Setup

All experiments in this section are executed on a machine with 12 physical Intel Xeon X5650 CPU cores running at 2.67 GHz. The cores are distributed across two sockets with six cores each. Each socket has a 12 MiB L3 cache shared among all cores. Each core has 256 KiB local L2 cache. I turned off Hyperthreading, TurboBoost, and dynamic frequency scaling in order to obtain reproducible results. On the test machine I run 32-bit versions of the FIASCO.OC microkernel, L4Re, ROMAIN, and the respective benchmarks. Therefore, the available memory for my experiments is limited to 3 GiB. All software was compiled with a recent version of GCC (Debian 4.8.3).

I assume a single-fault model, where at most one erroneous replica exists at a single point in time. As I explained before, we need two replicas to detect an error, and three replicas to also provide error correction in such a scenario.27 I therefore measure execution times for ROMAIN running two and three replicas of an application. I compare these results to native execution of the same application on top of L4Re. In addition to that, I also measure the execution time of ROMAIN running a single replica. While this does not give any benefit in terms of fault tolerance, this benchmark allows us to estimate the overhead of ROMAIN's mechanism to intercept externalization events.

I configured all runs (native and replicated) so that every thread executes on a dedicated physical CPU. No background workload interfered with the experiment runs. The results therefore represent the best possible execution time we can achieve on top of ROMAIN.

27 Fred B. Schneider. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys, 22(4):299–319, December 1990.

5.3.2 Single-Threaded Replication: SPEC CPU 2006

I analyze ROMAIN's execution time overhead for replicating single-threaded applications using the SPEC CPU 2006 benchmark suite. SPEC CPU is a computation-heavy suite that does not intensively communicate with the operating system or other applications. The programs are nevertheless representative of common use cases, such as video decoding, spam filtering, image processing, and computer gaming.

Benchmark Coverage and Methodology SPEC CPU 2006 consists of 29 benchmark programs. ROMAIN is able to replicate 19 of them. The remaining benchmarks either do not work on L4Re at all (2 benchmarks) or acquire too many resources so that replicating them is infeasible on a 32-bit system (8 benchmarks):

• The 453.povray benchmark requires a working UNIX system providing the fork() and system() functions. These are not available on L4Re.

• The 483.xalancbmk benchmark uses a deprecated feature of the C++ Standard Template Library (std::strstream), which is not provided by L4Re's version of this library.

• As explained in the previous section, the setup I used for my benchmarks is only able to address 3 GiB of memory. This memory needs to suffice for the microkernel, the L4Re resource managers, the ROMAIN master, and the respective benchmarks. While this is enough to run a single instance of each benchmark, some SPEC CPU programs allocate so much memory that there is not enough left to run a second or third replica of this application. Under these circumstances, the 410.bwaves, 434.zeusmp, and 450.soplex benchmarks were only able to run as a single replica on top of ROMAIN. The 403.gcc, 436.cactusADM, 447.dealII, 459.GemsFDTD, and 481.wrf benchmarks were only able to run up to two replicas and failed to allocate sufficient memory for a third instance.

I executed each benchmark in four modes: natively, and with ROMAIN running one, two, and three replicas. In each mode I executed five iterations of each benchmark and computed the average benchmark execution time over these runs. The standard deviation across all modes was below 1%, except for 433.milc (8.24%), 454.calculix (5.65%), 465.tonto (1.8%), 481.wrf (6.64%), and 482.sphinx3 (1.1%).
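For reference, the normalization against native execution and the geometric mean used for the aggregate overhead numbers can be computed as in the following generic sketch; it reflects the standard procedure, not the exact evaluation scripts used for this chapter.

```cpp
#include <cmath>
#include <vector>

// Normalized runtime of one benchmark: replicated runtime divided by native
// runtime (values > 1.0 mean slowdown).
double normalized(double replicated_seconds, double native_seconds)
{
    return replicated_seconds / native_seconds;
}

// Geometric mean over the per-benchmark normalized runtimes, as used for the
// aggregate overhead figures.
double geometric_mean(const std::vector<double>& normalized_runtimes)
{
    if (normalized_runtimes.empty()) return 1.0;  // no data: no overhead
    double log_sum = 0.0;
    for (double v : normalized_runtimes) log_sum += std::log(v);
    return std::exp(log_sum / normalized_runtimes.size());
}
```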

Benchmark Results Figure 5.8 shows the execution time overheads for those benchmarks that ROMAIN was completely able to replicate. The results are normalized to native execution of the benchmark on L4Re and constitute averages over five benchmark runs each. For these 19 benchmarks the geometric mean overhead for running a single replica in ROMAIN is 0.3%. The overhead for double-modular redundancy is 6.4%, and the overhead for triple-modular redundancy is 13.4%.

Remember that these results represent the best-case overhead: all replica threads run on dedicated CPUs with no background load. While the results therefore might appear overly optimistic for real-world deployments, I argue that future hardware platforms are likely to come with an abundant amount of CPUs and therefore running replicas on their own CPUs is feasible. However, computer architects have suggested that such platforms might not be able to power all CPUs at the same time.28 Therefore, consolidating replicas onto fewer processors while maintaining acceptable overheads is an interesting area for future research.

28 Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark Silicon and the End of Multicore Scaling. In Annual International Symposium on Computer Architecture, ISCA'11, pages 365–376, San Jose, California, USA, 2011. ACM.

Figure 5.9 on the next page shows the benchmark results for those benchmarks that were only able to run one or two replicas. If we incorporate these additional results into the overhead calculation, running a single replica in ROMAIN increases execution time by 1.8%. Running two replicas adds 7.3%
execution time, whereas triple-modular redundancy remains at a normalized overhead of 13.4%.

Figure 5.8: SPEC CPU 2006: Normalized execution time overhead for replication (single replica, two replicas, and three replicas, normalized vs. native execution; 19 benchmarks plus geometric mean).

Figure 5.9: SPEC CPU: Overhead for incompletely replicated benchmarks.

Is There a Connection Between Overhead and Master Invocations? In the experiment results we see that some SPEC CPU benchmarks have close to no overhead even when we replicate them, whereas other benchmarks show high overheads. My first intuition to explain the cause of overheads was to look at ROMAIN's internals: the master process is invoked whenever a replica raises an externalization event. Hence, benchmarks with more externalization events should have higher overheads. To validate this intuition I counted the externalization events (system calls and page faults) for each benchmark and normalized these counts to the benchmarks' native execution time. Figure 5.10 on the following page plots replication overhead as a function of this normalized event rate. I highlight the eight benchmarks with the largest replication-induced execution time overheads. There appears to be a general trend that higher event rates also lead to higher execution time overheads. However, there are also exceptions, which require further investigation.

Figure 5.10: Relation between externalization event rate and replication overhead (logarithmic x axis). One panel each for single, two, and three replicas plots normalized execution overhead against externalization events per second; the highlighted benchmarks are 437.leslie3d, 470.lbm, 429.mcf, 471.omnet++, 482.sphinx3, 433.milc, 400.perl, and 465.tonto.

465.tonto and 400.perl are the benchmarks with the highest externalization event rates. Nevertheless, 400.perl shows significantly lower execution time overhead due to replication. Closer investigation into these two benchmarks reveals that they differ in the types of externalization events they raise. 400.perl performs a large number of IPC calls to an external log server as it writes data to the standard output. In contrast, 465.tonto predominantly allocates and deallocates memory regions. As I explained in Section 3.4.1, IPC messages are simply proxied by the ROMAIN master. Unlike IPC messages, memory management requires more work on the master's side, because the master needs to maintain per-replica memory copies as I described in Section 3.5.2. Therefore, replicating 465.tonto is more expensive than replicating 400.perl.
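The difference between the two event classes can be illustrated with a simplified sketch of the master's event dispatch. This is not ROMAIN's actual code; the types and helper functions are hypothetical stand-ins for the mechanisms described in Sections 3.4.1 and 3.5.2:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical description of an externalization event raised by a replica.
    enum class EventType { IpcCall, MemoryAllocation };

    struct Event {
        EventType type;
        uintptr_t addr;   // start address of the affected memory region
        size_t    size;   // size of the region for memory events
    };

    struct Replica { /* per-replica state and address space */ };

    // Hypothetical helper: map a private copy of a region into one replica.
    void allocate_region(Replica&, uintptr_t /*addr*/, size_t /*size*/) { }

    // Cheap path: the message content was already voted on, so the master
    // forwards it to the external server exactly once and fans out the reply.
    void proxy_ipc(const Event&) { }

    // Expensive path: every replica needs its own private copy of the region
    // so that later memory faults stay isolated between replicas.
    void handle_memory_allocation(const Event& e, std::vector<Replica>& replicas)
    {
        for (Replica& r : replicas)
            allocate_region(r, e.addr, e.size);
    }

    void dispatch(const Event& e, std::vector<Replica>& replicas)
    {
        if (e.type == EventType::IpcCall)
            proxy_ipc(e);                          // work independent of replica count
        else
            handle_memory_allocation(e, replicas); // work grows with replica count
    }

In this model, 465.tonto pays the per-replica cost for most of its events, whereas 400.perl's events mostly take the proxy path, which matches the overhead difference observed above.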

The Effect of Caches on Replication Overhead  The second interesting result from Figure 5.10 is that the 429.mcf, 437.leslie3d, and 470.lbm benchmarks have relatively high replication overheads even though their rate of externalization events is low. My first suspicion was that these benchmarks perform special system calls that require costly handling in the master process. To substantiate this suspicion I measured the time these applications spent in user and master code during replicated execution. Figure 5.11 on the facing page compares the normalized execution times of the eight SPEC benchmarks with the highest execution time overheads. The figure also shows the ratio of user code execution versus master execution. We see that different classes of benchmarks exist:

1. The 400.perl and 465.tonto benchmarks have nearly constant user execution time while their master execution time increases with increasing number of replicas. Their overhead is therefore explained by additional execution within the master process and these benchmarks confirm the initial intuition.

2. 429.mcf, 437.leslie3d, 470.lbm, 471.omnet++, and 482.sphinx3 spend nearly all their time executing application code. Their execution time increases with an increasing number of replicas, but this increase cannot be attributed to master execution. My initial intuition does not hold for these benchmarks.

3. 433.milc shows signs of both effects. Its user execution time increases with an increasing number of replicas, but most of the replication overhead still comes from increased time spent executing ROMAIN master code.

Figure 5.11: SPEC CPU: Breakdown of user versus master execution time for the eight benchmarks with the highest overheads (400.perl, 429.mcf, 433.milc, 437.leslie3d, 465.tonto, 470.lbm, 471.omnet++, 482.sphinx3; single, DMR, and TMR runs; execution time relative to single-replica execution in %).

The fact that replication-induced execution time overheads mainly appear in user code indicates that these overheads may have hardware-level causes. I used hardware performance counters available in the test machine to analyze the last-level cache miss rate of each benchmark. Table 5.2 shows the total number of last-level cache misses when running one, two, and three replicas as well as the relative increase of cache misses in comparison to single-replica execution.
We see that the cache miss rates increase manifold when running multiple replicas. For the 433.milc and 465.tonto benchmarks this increase in cache misses is a result of master interaction. Whenever the applications raise an externalization event, we switch to the ROMAIN master process. As the master's address space is disjoint from the replica address space, caches need to be flushed during this context switch. This leads to increased last-level cache misses, because after switching back to the benchmark previously cached data needs to be read from main memory again.
The increased cache miss rates also explain increased overheads for the other SPEC benchmarks. Here, the miss rates rise because multiple instances of the same application compete for the L3 cache, which can only fit 12 MiB of data. Hence, where a single replica can still use all of this cache, three replicas need to share the cache. In the best case this leaves only 4 MiB of L3 cache for each replica.

Table 5.2: Last-Level (L3) Cache Miss Rates (misses per second of execution time) for eight SPEC CPU benchmarks. Miss rates are normalized to single-replica execution time.

Benchmark      Single Replica    Two Replicas (vs. single)    Three Replicas (vs. single)
400.perl       0.5 × 10^6        2.9 × 10^6   (× 5.3)         4.6 × 10^6    (× 8.44)
429.mcf        9.87 × 10^6       15.6 × 10^6  (× 1.6)         19.44 × 10^6  (× 1.99)
433.milc       14.14 × 10^6      14.77 × 10^6 (× 1.04)        19.1 × 10^6   (× 1.35)
437.leslie3d   5.49 × 10^6       7.34 × 10^6  (× 1.34)        8.46 × 10^6   (× 1.54)
465.tonto      0.07 × 10^6       0.58 × 10^6  (× 7.99)        0.96 × 10^6   (× 13.28)
470.lbm        3.11 × 10^6       6.9 × 10^6   (× 2.22)        11.48 × 10^6  (× 3.69)
471.omnet++    5.2 × 10^6        7.97 × 10^6  (× 1.53)        8.59 × 10^6   (× 1.65)
482.sphinx3    0.19 × 10^6       4.93 × 10^6  (× 26.51)       9.42 × 10^6   (× 50.65)

Given my observations I hypothesize that the SPEC benchmarks in question are cache-bound. Replicating them reduces the available cache per replica and thereby impacts execution time even for benchmarks that do not heavily interact with the ROMAIN master. Apart from my measurements, this hypothesis is supported by a similar analysis by Harris and colleagues.29
29 Tim Harris, Martin Maas, and Virendra J. Marathe. Callisto: Co-Scheduling Parallel Runtime Systems. In European Conference on Computer Systems, EuroSys '14, Amsterdam, The Netherlands, 2014. ACM

Better Performance Using Reduced Cache Miss Rates  I demonstrated in Section 4.3.4 that replication overhead can be reduced by placing replica threads on the same CPU socket, because this placement reduces the cost of the synchronization messages that are sent between replicas for every externalization event. It turns out that this optimization only benefits communication-intensive applications, such as the microbenchmark I used to evaluate multithreaded replication. In contrast, my analysis in this section indicates that this strategy might not be ideal for cache-bound applications, such as at least some of the SPEC CPU benchmarks. Given the test machine described in Section 5.3.1, forcing all replicas to run on the first socket and share an L3 cache leaves the second socket with an additional 12 MiB of cache completely unused. If my hypothesis is true, distributing replicas across both sockets should reduce replication overhead by doubling the amount of available L3 cache. To confirm this theory, I adapted ROMAIN's replica placement algorithm as shown in Figure 5.12: the first and third replica of an application are now placed on socket 0, whereas the second replica always runs on socket 1.

Figure 5.12: Assigning replicas to available CPUs in order to optimize L3 cache utilization (replicas 1 and 3 on socket 0, CPUs 0–5; replica 2 on socket 1, CPUs 6–11).
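A minimal sketch of such a cache-aware placement policy is shown below. It is not ROMAIN's implementation; the topology constants follow the test machine described in Section 5.3.1 (two sockets with six CPUs each), replica indices are zero-based, and the function name is illustrative:

    #include <cstdio>

    // Assumed topology of the evaluation machine: two sockets with a
    // private 12 MiB L3 cache and six CPUs each.
    constexpr int SOCKETS         = 2;
    constexpr int CPUS_PER_SOCKET = 6;

    // Alternate sockets so that two replicas only share an L3 cache
    // once every socket already hosts a replica.
    int cpu_for_replica(int replica)
    {
        int socket = replica % SOCKETS;   // 0, 1, 0, ...
        int slot   = replica / SOCKETS;   // next free CPU on that socket
        return socket * CPUS_PER_SOCKET + slot;
    }

    int main()
    {
        for (int r = 0; r < 3; ++r)
            std::printf("replica %d -> CPU %d\n", r, cpu_for_replica(r));
        // replica 0 -> CPU 0, replica 1 -> CPU 6, replica 2 -> CPU 1
    }

With this assignment, the first and third replica share socket 0 while the second replica gets socket 1 to itself, which is exactly the configuration shown in Figure 5.12.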

Using this adjusted setup, I repeated my experiments for the eight SPEC benchmarks in question and show the improved execution time overheads in Figure 5.13 on the next page. The numbers in the figure show the relative improvement of this setup compared to the execution times shown in Figure 5.8 on page 111. We see that cache-aware CPU assignment reduces replication overhead for five of the eight benchmarks. Running two replicas is nearly as cheap as running a single one, because the second replica uses a dedicated L3 cache and does not interfere with the first replica. The exceptions to this observation are once again 433.milc and 465.tonto, whose overheads do not differ from the previous runs. This confirms the assumption that these benchmarks are dominated by interactions with the ROMAIN master instead of suffering from cache thrashing.

Figure 5.13: SPEC CPU: Execution time overhead with improved replica placement (429.mcf, 433.milc, 437.leslie3d, 465.tonto, 470.lbm, 471.omnet++, 482.sphinx3; single, two, and three replicas; runtime normalized vs. native).

SUMMARY: ROMAIN provides efficient replication for single-threaded applications. Based on measurements using the SPEC CPU 2006 benchmark suite, the geometric mean overhead for running two replicas is 7.3%. Three replica instances lead to an overhead of 13.4%.

In addition to interaction with the ROMAIN master process, cache utilization has a major impact on replication performance. While a single application instance may fit its data into the local cache, running multiple instances may exceed the available caches. This problem may be mitigated by cache-aware placement of replicas on CPUs.

5.3.3 Multithreaded Replication: SPLASH2

The SPEC CPU benchmarks I used in the previous section are all single-threaded and hence do not leverage ROMAIN's support for replicating multithreaded applications that I introduced in Chapter 4. To cover such programs with my evaluation, I evaluate ROMAIN's overhead using the SPLASH2 benchmark suite.30 As previous research found, these benchmarks contain data races31 and thereby violate my requirement that multithreaded applications need to be race-free in order to replicate them. I therefore analyzed these benchmarks using Valgrind's data race detector32 and removed data races from the Barnes, FMM, Ocean, and Radiosity benchmarks.33
30 Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. SIGARCH Comput. Archit. News, 23(2):24–36, May 1995
31 Adrian Nistor, Darko Marinov, and Josep Torrellas. Light64: lightweight hardware support for data race detection during systematic testing of parallel programs. In International Symposium on Microarchitecture, MICRO 42, pages 541–552, New York, NY, USA, 2009. ACM
32 Konstantin Serebryany and Timur Iskhodzhanov. ThreadSanitizer: Data Race Detection in Practice. In Workshop on Binary Instrumentation and Applications, WBIA'09, pages 62–71, New York, NY, USA, 2009. ACM
33 I make the respective patches available at http://tudos.org/~doebel/emsoft14.

I compiled the benchmarks using cooperative determinism provided by the replication-aware libpthread_rep library I introduced in Section 4.3.5. Figure 5.14 shows the execution time overheads for these benchmarks running with two application threads, normalized to native execution without ROMAIN. The geometric mean overheads are 13% for double-modular redundant execution and 24% for triple-modular redundant execution.
These overheads become higher if we run four application threads in each benchmark. The overheads, shown in Figure 5.15, are 22% for DMR execution and 65% for TMR execution.
Investigating the sources of overhead I found similar causes as for single-threaded replication.

FMM, Ocean, FFT, and Radix allocate a significant amount of memory and the measurements show that most of their overhead comes from the initialization phase where these memory resources are allocated. These benchmarks also measure their compute times without initialization. For these measurements the data shows TMR overheads of less than 5% with two worker threads and less than 10% with four worker threads.

Figure 5.14: SPLASH2: Replication overhead for two application threads (Radiosity, Barnes, FMM, Raytrace, Water, Volrend, Ocean, FFT, LU, Radix, and the geometric mean; single, two, and three replicas; runtime normalized vs. native).

Figure 5.15: SPLASH2: Replication overhead for four application threads (same benchmarks and layout as Figure 5.14; individual bars exceed the plotted range, reaching up to 3.93x native runtime).

The Barnes benchmark is interesting because it has fairly low overhead when replicating the version using two worker threads, but shows drastic increases in overhead when running four workers. Throughout its execution, the benchmark touches nearly 900 MiB of memory and therefore three replicas use most of the available RAM in the test computer. I measured the L3 cache miss rates of each benchmark run and found that the total number of L3 misses across all cores and replica threads is identical when running a single replica of the two-worker and the four-worker versions. However, when running three replicas, the two-worker version doubles its L3 miss rate whereas the four-worker version's L3 miss rate multiplies by five. This effect explains part of the huge overhead of the Barnes benchmark with four worker threads.
The same findings can however not be applied to the Radiosity, Raytrace, and Water benchmarks. They show high replication-induced overheads even though their memory footprint is low – they use only around 40 MiB of memory. I suspected concurrency to be the replication bottleneck here and therefore measured the rate of lock/unlock operations all benchmarks perform when executing natively with two threads. I show these rates in Table 5.3.

Table 5.3: Rate of lock operations for the SPLASH2 benchmarks executing natively with two worker threads

Benchmark    Lock Operations per second
Radiosity    13,800,000
Barnes        1,455,000
FMM           1,040,000
Water           850,000
Raytrace        451,000
Volrend         169,000
Ocean             7,100
FFT                 141
LU                   75
Radix                 7

We see that the benchmarks in question are among the ones with the highest rates of lock operations. I therefore attribute the overheads of these benchmarks to their lock rate. Note that the Barnes and FMM benchmarks are also in the group with high lock rates, but in their case locking-induced overheads and memory-related overheads overlap.
Figure 5.16 shows why lock operations imply overhead. We see three replicas with two threads each that compete for access to a lock L. In the example, R1,1 executes the lock acquisition first and therefore becomes the lock owner as explained in Section 4.3.5. This lock ownership ensures that replicas R2,1 and R3,1 from the same thread group will also acquire the lock, even though, in terms of timing, replicas R2,2 and R3,2 reach the respective lock operation first.
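A minimal sketch of this lock-ownership rule is shown below. It is a simplified model rather than the actual libpthread_rep implementation: a single shared arbiter records which thread group currently owns the lock, the corresponding threads in all replicas must pass the lock before ownership is released, and the sketch assumes each replica acquires and releases the lock exactly once per ownership round. NUM_REPLICAS and the class name are illustrative.

    #include <condition_variable>
    #include <mutex>

    // Assumed configuration: three replicas of the application.
    constexpr int NUM_REPLICAS = 3;

    // Simplified replication-aware lock: the thread group whose replica
    // acquires the lock first becomes the owner; the same thread group
    // must take the lock in every other replica before another group may.
    class RepLock {
    public:
        // 'group' identifies the application thread, identical across replicas.
        void lock(int group)
        {
            std::unique_lock<std::mutex> guard(mtx_);
            cv_.wait(guard, [&] { return owner_ == -1 || owner_ == group; });
            owner_ = group;                     // first group to arrive wins
        }

        void unlock(int /*group*/)
        {
            std::unique_lock<std::mutex> guard(mtx_);
            if (++released_ == NUM_REPLICAS) {  // all replicas passed the lock
                released_ = 0;
                owner_ = -1;                    // ownership may move on
                cv_.notify_all();
            }
        }

    private:
        std::mutex mtx_;
        std::condition_variable cv_;
        int owner_ = -1;      // thread group currently owning the lock
        int released_ = 0;    // owner replicas that already unlocked
    };

The wait in lock() is exactly the loss of concurrency illustrated in Figure 5.16: the threads corresponding to R2,2 and R3,2 block until all replicas of the owning thread group have passed the lock, although natively they could enter their critical section immediately.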

Figure 5.16: Multithreaded replication: the first replica to obtain a lock slows down all other replicas if their threads have different timing behavior (timeline of lock(L) and unlock(L) operations for replica threads R1,1 through R3,2).

Using this approach, ROMAIN ensures deterministic execution of multithreaded replicas. However, the example shows that this mechanism also reduces concurrency, because replicas R2,2 and R3,2 have to wait even though they could enter their critical section in the native case. This situation may occur at any replicated lock operation and is more likely to happen if the replica performs more lock operations. Hence, lock-intensive applications show higher replication-induced overheads and these overheads increase with more replicated threads.
For completeness, I also show the execution time overheads when running the benchmarks using enforced determinism, which I introduced in Section 4.3.3 and which reflects every lock operation as an externalization event to the master. Figure 5.17 shows the execution time overheads when running SPLASH2 with two worker threads. We see that this approach enlarges the overheads we observed for cooperative determinism.

Figure 5.17: SPLASH2: Replication overhead for two application threads using enforced determinism (same benchmarks as Figure 5.14; individual bars exceed the plotted range, reaching 12.71x for a single replica, 16.27x for DMR, and 17.89x for TMR).

SUMMARY: The execution time overhead of ROMAIN for replicat- ing multithreaded applications is higher than for single-threaded ones. Using the SPLASH2 benchmark suite as a case study I showed that in addition to the previously measured overhead sources (system calls and memory effects), multithreaded applications also suffer from replication overhead due to their lock density. High lock densities are known to be a scalability bottleneck in concurrent applications and replication magnifies this effect as replicas have to wait for each other in order to deterministically acquire locks.

5.3.4 Application Benchmark: SHMC

In Section 3.6 I showed that ROMAIN also supports replicating applications that require access to memory regions shared with other processes. In contrast to replica-private memory, these accesses must be intercepted by the master process as these operations constitute externalization events as well as input to the replicas.
I evaluate shared-memory performance overhead using a worst-case scenario depicted in Figure 5.18. Two applications, a sender and a receiver, share a 512 KiB shared memory channel. The sender uses L4Re's packet-based shared memory ring buffer protocol, SHMC, to send 400 MiB of payload data to the receiver. The receiver only reads the data and does no processing on it. The scenario therefore evaluates the best possible throughput we can achieve in such a setup.

Figure 5.18: SHMC Application Benchmark (the sender writes 400 MiB through a 512 KiB shared memory region; the receiver reads the data).

I executed the benchmark natively on L4Re and varied the packet size used by the SHMC protocol from 32 bytes up to 2,048 bytes. Every packet going through the channel requires additional work by the protocol implementation. Hence, increasing the packet size will also increase SHMC channel throughput because the implementation needs to perform less management work.
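The structure of the sender side of this benchmark can be sketched as follows. This is not the actual benchmark code; channel_send() is a hypothetical stand-in for SHMC's packet transmission and the sketch only shows how throughput follows from packet size and transmitted volume:

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Hypothetical stand-in for transmitting one packet over the SHMC channel.
    void channel_send(const char* /*data*/, std::size_t /*size*/) { }

    int main()
    {
        constexpr std::size_t TOTAL = 400u * 1024 * 1024;   // 400 MiB payload
        const std::size_t packet_sizes[] = { 32, 128, 256, 512, 1024, 2048 };

        for (std::size_t pkt : packet_sizes) {
            std::vector<char> packet(pkt, 'x');
            auto start = std::chrono::steady_clock::now();

            for (std::size_t sent = 0; sent < TOTAL; sent += pkt)
                channel_send(packet.data(), pkt);            // per-packet protocol work

            std::chrono::duration<double> secs =
                std::chrono::steady_clock::now() - start;
            std::printf("packet size %5zu B: %.1f MiB/s\n",
                        pkt, (TOTAL / (1024.0 * 1024.0)) / secs.count());
        }
    }

Because the per-packet management work is roughly constant, larger packets amortize it over more payload, which is why throughput rises with packet size both natively and under replication.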

I then replicated the sender application using ROMAIN and performed the same variation in packet size as in the native case. Figure 5.19 shows the obtained throughputs for the two shared memory interception solutions (trap & emulate and copy & execute) I presented in Section 3.6 and relates them to the results for native execution.

Figure 5.19: Throughput achieved when replicating an application transmitting data through a shared-memory channel (note the logarithmic y axis). One panel each for trap & emulate and copy & execute plots throughput in MiB/s over packet sizes from 32 to 2,048 bytes for native execution as well as one, two, and three replicas.

The results show that replicating shared-memory accesses is costly. While native execution achieves throughputs between 55 MiB/s (32-byte packets) and 1.7 GiB/s (2,048-byte packets), replicated shared memory accesses are two orders of magnitude slower. Trap & emulate replication achieves throughputs between 110 KiB/s (32-byte packets) and 14.7 MiB/s (2,048-byte packets) for three replicas. Copy & execute – which avoids using an instruction emulator to perform memory accesses – performs significantly better and achieves 520 KiB/s for 32-byte packets and 33.5 MiB/s for 2,048-byte packets. In both native and replicated execution, increasing the packet size also increases throughput.
I inspected the protocol closer and found that for every packet, SHMC performs 23 shared-memory accesses. One of these accesses is a rep movs instruction, for which we saw in Section 3.6 that its overhead is negligible. The remaining 22 memory accesses are, however, random accesses to SHMC's management data structures. Handling such instructions is expensive and explains why the replication-induced overhead for small packets is higher (factor 100 for copy & execute TMR) than for larger packets (factor 50 for copy & execute TMR).

SUMMARY: While ROMAIN is capable of replicating applications that use shared memory, replication will significantly slow down these applications.

5.3.5 The Cost of Recovery

As a next experiment I investigated the cost of recovering from a detected error. For this purpose I executed those benchmarks on my test machine that I used for the fault injection experiments in Section 5.2. I randomly selected ten injections into general purpose registers that led to a detected error in my previous experiments and injected those errors manually during native execution using a special ROMAIN fault injection observer. I injected one such error into every benchmark run and repeated each injection ten times to compute an average of the time it took ROMAIN to perform recovery after the error had been detected. Figure 5.20 shows the averages I observed.

Figure 5.20: Average time (CPU cycles) to recover from a detected error in the fault injection benchmarks Bitcount, CRC32, Dijkstra, and IPC (logarithmic y axis from 50k to 2M cycles; error bars represent the standard deviation across 10 experiments).

Recovering from an error in ROMAIN consists of bringing the register states and memory content of all replicas into an identical state. This process is dominated by the cost of memory recovery. Returning the architectural registers to the same state cost around 700 CPU cycles in every experiment and I do not show this in the plot. The remaining thousands to millions of cycles are spent correcting memory content. As I described in Section 3.5, the ROMAIN master process has access to all memory of the replicas. Hence, this part of the recovery process maps to a set of memcpy operations from a correct replica into a faulty one.
I suspected that the difference in recovery times we see across the benchmarks is related to the amount of memory these benchmarks consume. Two facts support this hypothesis:

1. In my setup, Bitcount consumes 192 KiB of memory, CRC32 consumes 434 KiB, Dijkstra uses 1.3 MiB and the IPC benchmark allocates 3.4 MiB. This is reflected by the respective recovery times.

2. The Dijkstra and IPC benchmarks show a large standard deviation. This is due to the fact that in each of those two experiments, one out of the ten injections consistently showed much lower recovery times (60,000 cycles vs. 350,000 cycles (Dijkstra) and 1.8 million cycles (IPC)) than the remaining 9 runs. In both cases, the fast recovery runs stem from faults that were injected early within the benchmarks' main() functions, that is, before the benchmarks allocate all their memory. Hence, recovery does not need to copy as much data as in the remaining runs.

To further validate that recovery time is dominated by copying memory content, I created another microbenchmark: an application allocates a memory buffer and touches all data in this buffer once, so that the ROMAIN master has to map the memory pages into all replicas of the program. Thereafter, I flip a bit in the first replica, which is detected and corrected by the ROMAIN master. I varied the buffer size from 1 MiB to 500 MiB, executed ten fault injection runs for each size, and plot the average recovery times and the resulting recovery throughput in Figure 5.21. The standard deviations in this experiment were below 0.1% and are therefore not plotted.
We see that recovery time grows linearly with the replicated application's memory footprint and even recovering a replicated application with 500 MiB of memory is done within 130 ms. This forward recovery mechanism is extremely fast compared to traditional checkpoint/rollback mechanisms.

For instance, DMTCP reports a restart overhead between one and four seconds for a set of application benchmarks.34 Note that this advantage in recovery speed is not a special feature of ROMAIN, but is inherent to all replication-based approaches that provide forward recovery.
34 Jason Ansel, Kapil Arya, and Gene Cooperman. DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop. In 23rd IEEE International Parallel and Distributed Processing Symposium, Rome, Italy, May 2009
Recovery throughput – that is, the amount of memory recovered per second – is high for buffer sizes below 4 MiB. Remember, my test machine has 12 MiB of last-level cache per socket. When running three replicas with a memory footprint of up to 4 MiB, this data fits into the L3 cache, and in fact it is prefetched into this cache because the benchmark touches all memory before injecting a fault. For larger footprints, recovery becomes memory-bound: the larger the footprint, the closer recovery throughput gets to around 3.8 GiB/s. I measured the theoretical optimum for memcpy() on my test machine to be 4.5 GiB/s. The difference to recovery throughput comes from the fact that ROMAIN's recovery does not copy a single contiguous region of memory, but several smaller ones.
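Conceptually, the memory part of this recovery is a copy loop over the faulty replica's memory regions. The sketch below is an illustration, not ROMAIN's code; the region list and the assumption that the master can reach every replica's memory through ordinary pointers mirror the description in Section 3.5:

    #include <cstddef>
    #include <cstring>
    #include <vector>

    // One memory region of a replica as visible from the master.
    struct Region {
        char*       master_ptr;  // master's mapping of this replica's copy
        std::size_t size;
    };

    // All regions of one replica; every replica has the same regions
    // in the same order.
    using ReplicaMemory = std::vector<Region>;

    // Forward recovery: overwrite the faulty replica's memory with the
    // content of a replica that the majority vote identified as correct.
    void recover_memory(const ReplicaMemory& correct, ReplicaMemory& faulty)
    {
        for (std::size_t i = 0; i < correct.size(); ++i)
            std::memcpy(faulty[i].master_ptr,
                        correct[i].master_ptr,
                        correct[i].size);
    }

Since every mapped byte is copied once, the loop's runtime grows linearly with the memory footprint, and copying many smaller regions instead of one contiguous block explains the gap between the observed 3.8 GiB/s and the 4.5 GiB/s memcpy() optimum.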

Figure 5.21: Recovery time depending on application memory footprint (left panel: recovery time in ms; right panel: recovery throughput in GiB/s; x axis: application memory use from 0 to 500 MiB).

SUMMARY: ROMAIN provides forward recovery by majority voting. Recovery times scale linearly with the replicated application's memory footprint. Forward recovery takes place in a few milliseconds, whereas state-of-the-art backward recovery may take an order of magnitude more time.

5.4 Implementation Complexity

ROMAIN adds complexity in terms of source code to FIASCO.OC's L4 Runtime Environment. I measure this complexity by giving the source lines of code (SLOC) required to implement the features described in this thesis. I used David A. Wheeler's sloccount35 to evaluate these numbers for ROMAIN and show the results in Table 5.4.
35 http://www.dwheeler.com/sloccount
The two components required to implement replication as an OS service are the ROMAIN master process (7,124 SLOC) and the replication-aware libpthread library (756 SLOC). I furthermore categorized the master process' code into five categories to gain more insight into where implementation complexity originates from:

1. Replica Infrastructure subsumes code for loading an application binary during startup, creating the respective number of replicas, and maintaining each replica's memory layout.

2. Fault Observers lists the implementation complexity of the specific event observers I introduced in Section 3.3. While most observers are fairly small and comprehensible, system call handling and the implementation of deterministic locking are substantially more complex. Observers for debugging replicas are available as well but can be disabled at compile time to reduce complexity.

3. Event Handling includes all code to intercept replicas' externalization events and perform error detection and recovery using majority voting.

4. Shared Memory Interception comprises the implementations of replicated shared memory access I described in Section 3.6. The table shows that the source code complexity for implementing each interception method (trap & emulate and copy & execute) in ROMAIN is about the same. Note, however, that the trap & emulate approach requires an additional x86 disassembler library. For this purpose I used the libudis86 disassembler, which adds another 2,187 lines of code to the implementation.

5. Runtime Support includes all remaining code, such as master startup and logging. This runtime support furthermore includes Martin Kriegel's implementation of a hardware watchdog to bound error detection latencies as described in Section 3.8.2.

Table 5.4: ROMAIN: Source Lines of Code

ROMAIN Component                SLOC
Master                         7,124
  Replica Infrastructure       2,652
    Binary Loading               385
    Replica Management         1,484
    Memory Management            783
  Fault Observers              1,962
    Observer Infrastructure      190
    Time Input                   112
    Trap Handling                 82
    Debugging                    438
    Page Fault Handling          186
    Locking                      430
    System Calls                 524
  Event Handling                 511
  Shared Memory Interception     950
    Common                       378
    Trap & emulate               262
    Copy & execute               310
  Runtime Support              1,049
    Startup, Logging             562
    Hardware Watchdog            487
libpthread_rep                   756

5.5 Comparison with Related Work

With the experiments in this chapter I demonstrated that ROMAIN provides an operating system service that efficiently detects and recovers from hardware-induced errors in binary-only user applications. I now compare ROMAIN to other software-implemented fault tolerance techniques. I do not compare against hardware-level techniques, as a major point of software-based fault tolerance methods is to avoid custom hardware features and work solely with software primitives.

For the comparison I selected ROMAIN and five additional mechanisms that provide this property:

1. Software-Implemented Fault Tolerance (SWIFT) is a compiler technique that duplicates operations using different hardware resources (such as registers) and compares the results of these operations to detect errors.36

2. Encoding Compiler – ANB (EC-ANB) is a compiler that arithmetically encodes operands and control flow of an instrumented program.37 This approach provides higher error coverage than SWIFT.

3. Process-Level Redundancy (PLR) pioneered the idea of using operating system processes as the sphere of replication, which ROMAIN builds upon.38

4. Efficient Software-Based Fault Tolerance on Multicore Platforms (EFTMP) provides applications with a set of system call wrappers and a replication-aware deterministic thread library to support replication of multithreaded applications.39

5. Runtime Asynchronous Fault Tolerance (RAFT) improves the speed of PLR by speculatively executing system calls instead of waiting for all replicas to reach their next system call for state comparison.40

36 George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, and David I. August. SWIFT: Software Implemented Fault Tolerance. In International Symposium on Code Generation and Optimization, CGO '05, pages 243–254, 2005
37 Ute Schiffel, André Schmitt, Martin Süßkraut, and Christof Fetzer. ANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software. In International Conference on Computer Safety, Reliability and Security, Safecomp'10, Vienna, Austria, 2010
38 A. Shye, J. Blomstedt, T. Moseley, V.J. Reddi, and D.A. Connors. PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures. IEEE Transactions on Dependable and Secure Computing, 6(2):135–148, 2009
39 Hamid Mushtaq, Zaid Al-Ars, and Koen L. M. Bertels. Efficient Software Based Fault Tolerance Approach on Multicore Platforms. In Design, Automation & Test in Europe Conference, Grenoble, France, March 2013
40 Yun Zhang, Soumyadeep Ghosh, Jialu Huang, Jae W. Lee, Scott A. Mahlke, and David I. August. Runtime Asynchronous Fault Tolerance via Speculation. In International Symposium on Code Generation and Optimization, CGO '12, pages 145–154, 2012

Table 5.5 summarizes the comparison between ROMAIN and these other mechanisms. The numbers and properties shown in the table are taken from the respective scientific papers and from this thesis. I explain the different points in more detail below.

                            Compiler-Based Approaches   Replication-Based Approaches
                            SWIFT      EC-ANB           PLR              EFTMP     RAFT            ROMAIN
Covers Memory Errors        No         Yes              Yes              No        Yes             Yes
Covers Processor SEUs       Yes        Yes              Yes              Yes       Yes             Yes
Error Coverage              >92%       >99%             100%             Unknown   100%            100%
Recovery Built In           No         No               Yes              No        No              Yes
Memory Overhead             1x         2x               Nx               2x        2x              Nx
Exec. Overhead, Single      41%        2x - 75x         17% (detection)  Unknown   5 - 10%         7% (detection)
                                                        41% (recovery)                             13% (recovery)
Support for Multithreading  Unknown    Unknown          No               Yes       No              Yes
Exec. Overhead, Multithr.   Unknown    Unknown          Not supported    < 18%     Not supported   22% (detection)
                                                                                                   65% (recovery)
Source Code Required        Yes        Yes              No               No        No              No

Table 5.5: Comparison of ROMAIN and other software fault tolerance methods

Note that a direct comparison of overhead numbers and coverage rates is not always possible: SWIFT, for instance, was implemented on the Itanium architecture whereas all other tools were implemented for x86. Error coverage rates were computed from different kinds of experiments: SWIFT, PLR, and RAFT used a similar random sampling approach and injected several thousand bit flips into application memory, general purpose registers, and control registers. EC-ANB used a custom fault injector that modified computations, operators, and memory operations at runtime. ROMAIN was tested by injecting errors into memory and general purpose registers using the FAIL* framework.

Error Coverage and Recovery  All mechanisms in this comparison support detection of SEUs in the processor, such as incorrect computations, register SEUs, and control flow errors. With the exception of SWIFT and EFTMP, all mechanisms furthermore support detection of memory errors. SWIFT explicitly rules out memory errors and requires ECC-protected RAM. EFTMP simply ignores the fact that it is not safe against memory errors, as I explained in Section 4.2.
In contrast to all other mechanisms, PLR and ROMAIN support recovery from errors using majority voting as a built-in feature. This feature is paid for with a higher memory overhead, because at least three replica instances need to be maintained to allow voting. All other mechanisms rely on orthogonal recovery methods, such as a working checkpointing mechanism, being available. Using such recovery methods will lead to additional overheads (e.g., for taking checkpoints and re-executing from the last checkpoint during recovery) that are not included in the overhead numbers provided by the respective authors.

Overheads  SWIFT is the only mechanism in this analysis that does not require additional memory. As explained before, PLR and ROMAIN multiply memory consumption by the number of replicas they are running. EC-ANB doubles the amount of memory because it transforms all 32-bit data words into 64-bit encoded data. EFTMP and RAFT run two replicas of an application and hence require double the amount of memory.
All mechanisms except EC-ANB have execution time overheads between 5% and 41% when running single-threaded benchmarks. ROMAIN's 13% overhead for running three replicas is competitive and has the advantage of providing forward recovery.
ROMAIN was evaluated on top of FIASCO.OC whereas the other mechanisms were analyzed on Linux. These systems differ in the type of externalization events that need to be handled, which may impact execution time overheads. However, Florian Pester's Linux version of ROMAIN, which I mentioned in Section 3.3, shows similar overheads (16.3% for TMR execution) for the same benchmarks (SPEC CPU 2006) on the same test machine that I used for my evaluation.41
41 Florian Pester. ELK Herder: Replicating Linux Processes with Virtual Machines. Diploma thesis, TU Dresden, 2014

Supported Applications  Only EFTMP and ROMAIN support protection of multithreaded applications. SWIFT and EC-ANB do not discuss this problem and have not been evaluated using multithreaded benchmarks.

PLR and RAFT explicitly rule out replication of multithreaded programs for their prototypes. EFTMP provides lower runtime overheads than ROMAIN. However, as I pointed out above, ROMAIN covers a wider range of hardware errors and provides forward recovery.
As a last point of comparison, compiler-based solutions, such as SWIFT and EC-ANB, require the whole source code of the protected application to be recompiled. External binary code, such as the standard C library, remains unprotected by these mechanisms. In contrast, PLR and RAFT use binary recompilation to protect complete application binaries. EFTMP protects applications by requiring them to use a custom-designed libpthread library. ROMAIN avoids expensive recompilation by running at the operating system level and leveraging FIASCO.OC's virtualization support to intercept externalization events.

SUMMARY: ROMAIN combines the advantages of other software-implemented fault tolerance mechanisms and addresses their deficiencies:

• ROMAIN protects binary-only applications and does not require source code availability in contrast to compiler-based solutions.

• ROMAIN protects both single-threaded and multithreaded applications against CPU and memory errors, whereas most other mechanisms were only tested using single-threaded workloads.

• ROMAIN provides built-in forward recovery using majority voting, while most other techniques only allow for error detection. As a matter of flexibility, ROMAIN can, however, also be executed in error detection mode, which reduces its overhead but in turn requires the combination with an external error recovery mechanism, such as application-level checkpointing.

While ROMAIN provides efficient and flexible error detection and recovery for unmodified user-level applications, there is a remaining drawback that my solution shares with most other software fault tolerance mechanisms: They only protect application code and do not detect and recover from errors in the fault-tolerant runtime or the underlying operating system kernel. This is a serious problem, because those components are crucial for the correct functioning of the whole system. I will therefore continue my thesis with an investigation of this Reliable Computing Base.

6 Who Watches the Watchmen?

At several points in the previous chapters we realized that ROMAIN relies on specific hardware and software features to function correctly. Protecting this Reliable Computing Base (RCB) remains an open issue. In this chapter I argue that every software-implemented fault tolerance mechanism has a specific RCB and investigate what constitutes ROMAIN’s RCB. After having a closer look at what constitutes the RCB, I present three case studies which future work may extend in order to achieve full system protection.

1. As the OS kernel is the largest part of ROMAIN's RCB, I first study the vulnerability of the FIASCO.OC microkernel to understand how RCB code reacts to hardware faults and give ideas about how the kernel can be better protected.

2. Thereafter, I investigate how current hardware trends can lead to an architecture providing a mixture of reliable and less reliable CPU cores and how the ASTEROID OS architecture can benefit from such a platform.

3. Lastly, I point out that ASTEROID's software-level RCB is purely open-source software and therefore allows the application of compiler-based fault tolerance mechanisms. Lacking a suitable compiler, I use simulation experiments to estimate the performance impact applying such mechanisms could have on ASTEROID as a whole.

I developed the concept of the Reliable Computing Base together with Michael Engel.1 Two of the case studies I present in this chapter were previously published at HotDep 20122 (mixed-reliability hardware) and SOBRES 20133 (estimating the effect of compiler-based RCB protection).
1 Michael Engel and Björn Döbel. The Reliable Computing Base: A Paradigm for Software-Based Reliability. In Workshop on Software-Based Methods for Robust Embedded Systems, 2012
2 Björn Döbel and Hermann Härtig. Who Watches the Watchmen? – Protecting Operating System Reliability Mechanisms. In Workshop on Hot Topics in System Dependability, HotDep'12, Hollywood, CA, 2012
3 Björn Döbel and Hermann Härtig. Where Have all the Cycles Gone? – Investigating Runtime Overheads of OS-Assisted Replication. In Workshop on Software-Based Methods for Robust Embedded Systems, SOBRES'13, Koblenz, Germany, 2013

6.1 The Reliable Computing Base

Computer systems security is often evaluated with respect to the Trusted Computing Base (TCB). The US Department of Defense's "Trusted Computer System Evaluation Criteria"4 define the TCB as

"[...] all of the elements of the system responsible for supporting the security policy and supporting the isolation of objects (code and data) on which the protection is based. [...] the TCB includes hardware, firmware, and software critical to protection and must be designed and implemented such that system elements excluded from it need not be trusted to maintain protection."

Hence, the TCB comprises those hardware and software components that need to be trusted in order to obtain a secure service from a computer system.
4 Department of Defense. Trusted Computer System Evaluation Criteria, December 1985. DOD 5200.28-STD (supersedes CSC-STD-001-83)

Industry experts estimate that the main source of security vulnerabilities in modern systems are programming or configuration errors at the software level.5 As the number of software errors correlates with code size,6 security research focuses on minimizing the amount of code within the TCB in order to improve system security.7
5 Cristian Florian. Report: Most Vulnerable Operating Systems and Applications in 2013. GFI Blog, accessed on July 29th 2014, http://www.gfi.com/blog/report-most-vulnerable-operating-systems-and-applications-in-2013/
6 Steve McConnell. Code Complete: A Practical Handbook of Software Construction. Microsoft Press, Redmond, WA, 2 edition, 2004
7 Lenin Singaravelu, Calton Pu, Hermann Härtig, and Christian Helmuth. Reducing TCB Complexity for Security-Sensitive Applications: Three Case Studies. In European Conference on Computer Systems, EuroSys'06, pages 161–174, 2006

When implementing software-level fault tolerance mechanisms – such as ROMAIN – we face a similar problem: we rely on the correct functioning of hardware and software components in order to implement fault tolerance. In the case of ROMAIN these components are:

• Memory Management Hardware: ROMAIN provides fault isolation between replicas by running them in different address spaces. This isolation relies on the hardware responsible for enforcing address space isolation (i.e., the MMU) to work properly.

• Hardware Exception Delivery: ROMAIN intercepts the externalization events generated by a replicated application. For this purpose it configures the replicas in a way that all these exceptions get reflected to the master process for state comparison and event processing. Using the watchdog mechanism I described in Section 3.8.2, ROMAIN can already cope with missing hardware exceptions. However, the master still relies on the fact that exception state is written to the right memory location within the master and does not accidentally overwrite important master state.

• Operating System Kernel: The FIASCO.OC kernel is crucial for ROMAIN because it configures the hardware properties I mentioned above, and ROMAIN needs to rely on this configuration to work properly. Furthermore, the kernel schedules replicas and delivers hardware exceptions to the master process.

• ROMAIN Master Process: The master process manages replicas and their resources and furthermore handles externalization events. While running in user mode on top of FIASCO.OC, the master is still unreplicated and therefore unable to detect and recover from the effects of a hardware fault.

Inspired by the term TCB, in our paper we opted to call those components that are unprotected by a software fault tolerance mechanism and still need to work in order for the whole system to tolerate hardware faults the Reliable Computing Base (RCB):1

“The Reliable Computing Base (RCB) is a subset of software and hardware components that ensures the operation of software-based fault-tolerance meth- ods and that we distinguish from a much larger amount of components that can be affected by faults without affecting the program’s desired results.”

6.1.1 Minimizing the RCB

As the RCB is unprotected by our software fault tolerance mechanisms, any code that runs within the RCB remains vulnerable to hardware faults. In order to reduce the probability of an error striking during unprotected execution, we argue that, similar to the TCB, the RCB should be minimized. This raises the question of how we can accomplish this minimization.

Minimizing Code Size As mentioned above, minimizing the TCB is achieved by reducing the amount of code inside the TCB because there is a direct relation between code size and security vulnerabilities. This is not the case for dependability: A program can – but does not need to – actually become more dependable using more code, for instance when this code is used to implement fault-tolerant algorithms or data validation.

Minimizing RCB Execution Time  As hardware faults are often modeled as being uniformly distributed over time,8 reducing the time spent executing RCB code will reduce the probability of an error occurring within the RCB. Two design decisions I presented in this thesis provide a reduction of RCB execution time:
8 Shubhendu Mukherjee. Architecture Design for Soft Errors. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008

1. By running on top of a microkernel, I limit the amount of time spent executing unprotected kernel code. Code that would traditionally run inside a monolithic OS kernel – such as a file system – runs as a user-level application and can therefore be replicated and protected using ROMAIN.

2. I presented several performance optimizations that reduce the time spent executing in the ROMAIN master:

• In Section 3.5 I described how ROMAIN's memory manager maps multiple pages at once during page fault handling.

• The fast synchronization mechanism I introduced in Section 4.3.4 reduces the time replicas spend in synchronization operations.

While these optimizations were mainly implemented to decrease ROMAIN's performance overhead, a useful side effect is that they also reduce the time spent executing unprotected kernel and master code.

Minimizing Software/Hardware Vulnerability  The previous examples seem to indicate that anything that speeds up kernel execution will also reduce the RCB's vulnerability. As an example, when comparing replica states we could choose to perform a hardware-assisted SSE4 memcmp() instead of using a pure C implementation.9
9 Richard T. Saunders. A Study in Memcmp. Python Developer List, 2011
However, computer architecture researchers pointed out that different functional units of a CPU10 or different types of instructions of an instruction set architecture11 have different levels of vulnerability against hardware errors. Hence, using a more complex functional unit to reduce RCB execution time may lead to an increase in total vulnerability as more functional units participate in the execution of this RCB code.
10 Shubhendu S. Mukherjee, Christopher Weaver, Joel Emer, Steven K. Reinhardt, and Todd Austin. A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor. In International Symposium on Microarchitecture, MICRO 36, Washington, DC, USA, 2003. IEEE Computer Society
11 Semeen Rehman, Muhammad Shafique, Florian Kriebel, and Jörg Henkel. Reliable Software for Unreliable Hardware: Embedded Code Generation Aiming at Reliability. In International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS '11, pages 237–246, Taipei, Taiwan, 2011. ACM
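To make the trade-off concrete: state comparison in a replication system essentially compares the externalized state of all replicas byte by byte, for example with memcmp(). The sketch below is a generic illustration, not ROMAIN's code; the register frame layout is hypothetical, and whether memcmp() resolves to an SSE-assisted or a plain C implementation is up to the C library:

    #include <cstdint>
    #include <cstring>

    // Hypothetical externalized state captured when a replica traps
    // into the master (x86 general purpose registers only).
    struct ReplicaState {
        uint32_t eax, ebx, ecx, edx, esi, edi, ebp, esp, eip, eflags;
    };

    // Majority voting over three replicas: returns the index of the
    // replica that disagrees with the other two, -1 if all states match,
    // or -2 if there is no majority (unrecoverable in a real system).
    int find_faulty(const ReplicaState s[3])
    {
        bool eq01 = std::memcmp(&s[0], &s[1], sizeof(ReplicaState)) == 0;
        bool eq02 = std::memcmp(&s[0], &s[2], sizeof(ReplicaState)) == 0;
        bool eq12 = std::memcmp(&s[1], &s[2], sizeof(ReplicaState)) == 0;

        if (eq01 && eq02) return -1;   // all three agree
        if (eq01) return 2;            // replica 2 is the outlier
        if (eq02) return 1;
        if (eq12) return 0;
        return -2;                     // all three differ
    }

Whichever memcmp() variant is used, this code path is part of the RCB, so a faster but more complex implementation is only a net win if its own vulnerability does not outweigh the shorter execution time.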

Future Research  The most promising approach to reducing the vulnerability of the RCB against hardware faults is to consider both software-level and hardware-level vulnerability metrics. I argue that Sridharan's Program Vulnerability Factor (PVF) may be a good starting point for such analysis.12 As a side project of this thesis, I implemented a PVF analysis tool for x86 binary code13 and showed that

• PVF analysis is a fast alternative to traditional fault injection analysis and can predict the impact of a hardware error on sequences of instructions, and

• In its current state, PVF is limited to predicting whether an error will be benign or lead to a program failure. PVF analysis does not incorporate quality information that could indicate whether a wrong result is still "good enough" given a specific application scenario.

By incorporating PVF analysis tools into the software development process, modifications to RCB components can be evaluated for their impact on the component's vulnerability in addition to traditional correctness and performance analysis.
12 Vilas Sridharan and David R. Kaeli. Quantifying Software Vulnerability. In Workshop on Radiation Effects and Fault Tolerance in Nanometer Technologies, WREFT '08, pages 323–328, Ischia, Italy, 2008. ACM
13 Björn Döbel, Horst Schirmeier, and Michael Engel. Investigating the Limitations of PVF for Realistic Program Vulnerability Assessment. In Workshop on Design For Reliability (DFR), 2013

6.1.2 The RCB in Software Fault Tolerance Mechanisms

ROMAIN is not the only software-implemented fault tolerance mechanism that relies on RCB components. I will now have a look at other mechanisms that implement fault tolerance using compiler-level or infrastructure-level techniques and show that most of these solutions have an RCB specific to the respective mechanism.

The RCB and Compiler-Based Fault Tolerance  Compiler-based fault tolerance mechanisms should be able to compile the complete software stack including all RCB components if their source code is available. However, in some cases additionally inserted code – such as SWIFT's result validation14 – remains unprotected from hardware faults and hence forms the RCB of these mechanisms.
14 George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, and David I. August. SWIFT: Software Implemented Fault Tolerance. In International Symposium on Code Generation and Optimization, CGO '05, pages 243–254, 2005
Furthermore, to the best of my knowledge none of the compiler-based mechanisms available has been applied to an operating system kernel so far. As none of these tools are openly available for download, it is impossible to try this with FIASCO.OC. I argue that fault-tolerant compilation of an OS kernel will have to solve three problems:

1. Asynchronous Execution arises from the necessity to run interrupt handling code whenever a hardware interrupt needs to be serviced. This may result in random jumps that confuse signature-based control-flow checking15 or arithmetic protection of the instruction pointer.16

2. Interaction with Hardware requires accessing specific I/O memory regions or using I/O-specific instructions. These accesses work on concrete values dictated by the hardware specification. I/O values cannot be arithmetically encoded or otherwise replicated. Hence, additional code to convert between encoded and concrete values is required, which may in turn prove to be a single point of failure for the respective solution.

3. Thread-based replication17 relies on a runtime that implements threading and mechanisms to communicate results among these threads. Such mechanisms are usually implemented by the OS kernel. Therefore, the threading implementation would require additional protection.

15 Namsuk Oh, Philip P. Shirvani, and Edward J. McCluskey. Control-Flow Checking by Software Signatures. IEEE Transactions on Reliability, 51(1):111–122, March 2002
16 Ute Schiffel, André Schmitt, Martin Süßkraut, and Christof Fetzer. ANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software. In International Conference on Computer Safety, Reliability and Security, Safecomp'10, Vienna, Austria, 2010
17 Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August. DAFT: Decoupled Acyclic Fault Tolerance. In International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pages 87–98, Vienna, Austria, 2010. ACM

These problems can be solved. For instance, Borchert presented an aspect-oriented compiler that is able to protect important kernel data structures in the long term18 using checksums and data replication. However, his work does not protect those data structures during modification and it does not apply to short-term storage like the stack, so that parts of the kernel still remain in the RCB.
18 Christoph Borchert, Horst Schirmeier, and Olaf Spinczyk. Generative Software-Based Memory Error Detection and Correction for Operating System Data Structures. In International Conference on Dependable Systems and Networks, DSN'13. IEEE Computer Society Press, June 2013

For now I therefore assume that the OS kernel remains a less-protected part of the RCB for compiler-based fault tolerance mechanisms.
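To illustrate the first problem above, the sketch below shows the basic idea behind signature-based control-flow checking in a greatly simplified form (inspired by the technique cited in footnote 15, not taken from it): each basic block has a compile-time signature, a runtime signature register is updated with the XOR difference to the statically expected predecessor, and a mismatch signals an illegal control transfer. An asynchronously invoked interrupt handler would perturb the runtime signature unless it explicitly saves and restores it.

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    // Static per-block signatures and, for each block, the XOR difference
    // to its single statically expected predecessor (block 0 is the entry).
    static const uint32_t S[3] = { 0x11, 0x22, 0x33 };
    static const uint32_t D[3] = { 0x11, 0x11 ^ 0x22, 0x22 ^ 0x33 };

    static uint32_t G;  // runtime signature register

    void enter_block(int cur)
    {
        G ^= D[cur];    // compile-time constant embedded in each block
        if (G != S[cur]) {
            std::fprintf(stderr, "control-flow error entering block %d\n", cur);
            std::abort();
        }
    }

    int main()
    {
        G = 0;
        enter_block(0);   // entry block
        enter_block(1);   // legal edge 0 -> 1
        enter_block(2);   // legal edge 1 -> 2
        // A jump into block 2 while G still holds S[0] (e.g., after a stray
        // interrupt return) would yield G == S[0] ^ D[2] != S[2] and abort.
        std::puts("control flow verified");
    }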

The RCB and Fault Tolerant Infrastructure  In contrast to methods that generate fault-tolerant code, other approaches integrate a fault-tolerant infrastructure into a system. Replication-based methods add such infrastructure in the form of a replica manager. I already explained that in the case of ROMAIN this manager as well as the underlying OS kernel remain unprotected. This is in line with other such work: the authors of PLR19 and RAFT20 explicitly state that their mechanisms do not support protecting the underlying OS.
19 A. Shye, J. Blomstedt, T. Moseley, V.J. Reddi, and D.A. Connors. PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures. IEEE Transactions on Dependable and Secure Computing, 6(2):135–148, 2009
20 Yun Zhang, Soumyadeep Ghosh, Jialu Huang, Jae W. Lee, Scott A. Mahlke, and David I. August. Runtime Asynchronous Fault Tolerance via Speculation. In International Symposium on Code Generation and Optimization, CGO '12, pages 145–154, 2012
Other infrastructure approaches rework the operating system to be inherently more fault tolerant. For example, the Tandem NonStop system aimed to structure all kernel and application code to use transactions and thereby guarantee that a consistent state can be reached in the case of any error.21 Lenharth and colleagues implemented a similar idea in Linux: their Recovery Domains restructure kernel code paths to allow rollback.22
21 Jim Gray. Why Do Computers Stop and What Can Be Done About It? In Symposium on Reliability in Distributed Software and Database Systems, pages 3–12, 1986
22 Andrew Lenharth, Vikram S. Adve, and Samuel T. King. Recovery Domains: An Organizing Principle for Recoverable Operating Systems. In 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIV, pages 49–60, New York, NY, USA, 2009. ACM
However, Gray acknowledges that Tandem's transactions rely on a transaction manager to correctly implement rollback recovery. Lenharth's solution does not cover important kernel parts, such as scheduling, interrupt handling, and memory allocation. We therefore see that even those mechanisms have an RCB that remains vulnerable to the effects of hardware faults.
As a notable exception, Lovellette's software fault tolerance mechanism for the ARGOS space project actually aimed at protecting the whole RCB. ARGOS sent commercial off-the-shelf processors into space and tried to protect them using software-only methods. As a result of the high rate of memory errors they saw in flight, they implemented a software-level ECC that periodically scrubbed all important code and data – including their OS kernel – to detect and recover from these errors before they led to system malfunction.23 Furthermore, they duplicated the ECC scrubber in order to avoid malfunctions in this area.
23 M.N. Lovellette, K.S. Wood, D. L. Wood, J.H. Beall, P.P. Shirvani, N. Oh, and E.J. McCluskey. Strategies for Fault-Tolerant, Space-Based Computing: Lessons Learned from the ARGOS Testbed. In Aerospace Conference Proceedings, 2002, IEEE, volume 5, pages 5-2109–5-2119, 2002

SUMMARY: Although software-implemented fault tolerance mechanisms protect user applications, they still rely on a set of hardware and software components – the Reliable Computing Base – to function correctly. In most cases the RCB includes the OS kernel as well as additional infrastructure leveraged by the respective mechanisms.
These RCB components are the Achilles' Heel of any such mechanism and require additional effort in order to achieve full-system fault tolerance.

6.2 Case Study #1: How Vulnerable is the Operating System?

The examples in the previous section showed that the operating system kernel is part of the Reliable Computing Base for all software-implemented fault tolerance mechanisms. For the case of ROMAIN, this kernel is FIASCO.OC. I therefore conducted a series of fault injection (FI) campaigns to understand how FIASCO.OC behaves in the presence of hardware-induced errors and gain ideas about what mechanisms are suitable to protect ROMAIN's RCB.

My analysis is similar to an older study performed by Arlat and colleagues that investigated the Chorus and LynxOS microkernels.24 Their study used microbenchmarks to drive execution towards interesting kernel components. They then injected transient memory faults into regions used by the kernel and observed the outcome. Gu25 and Sterpone26 also used benchmarks to drive fault injection experiments on the Linux kernel.

In contrast to these previous studies, which resorted to randomly sampling the fault space, I perform fault injection for all potential faults in the fault space. As in Section 5.2, I use the FAIL* fault injection framework to run these experiments in parallel and use fault space pruning to eliminate experiments whose outcome is already known.

24 Jean Arlat, Jean-Charles Fabre, Manuel Rodríguez, and Frédéric Salles. Dependability of COTS Microkernel-Based Systems. IEEE Transactions on Computing, 51(2):138–163, February 2002
25 Weining Gu, Z. Kalbarczyk, and R.K. Iyer. Error Sensitivity of the Linux Kernel Executing on PowerPC G4 and Pentium 4 Processors. In Conference on Dependable Systems and Networks, DSN'04, pages 887–896, June 2004
26 Luca Sterpone and Massimo Violante. An Analysis of SEU Effects in Embedded Operating Systems for Real-Time Applications. In International Symposium on Industrial Electronics, pages 3345–3349, June 2007

6.2.1 Benchmarks and Setup

I selected four microbenchmarks to trigger commonly used mechanisms within FIASCO.OC for fault injection purposes. In the selection process I focused on triggering important kernel mechanisms. My experiments may therefore over-represent these paths in the kernel while they do not cover rarely used paths, such as error handling code.

1. Inter-Process Communication (IPC4, IPC252): Microkernel-based systems are built from small software components that run isolated for the purpose of safety and security. These components exchange data and delegate access rights through IPC channels implemented by the kernel. IPC is often considered the most important mechanism in a microkernel-based system.27
   This microbenchmark consists of two threads running in the same address space. The first thread sends a message to the second one, which sends a reply in return. I inject faults into both phases of the IPC operation and consider the experiment correct if the first thread successfully prints the reply message. I ran this benchmark in two versions: IPC4 sends a minimum IPC message with a payload of 4 bytes. IPC252 extends this size to the maximum possible payload size of 252 bytes. (A simplified sketch of the benchmark structure follows this list.)

2. ThreadCreate: As I explained previously, FIASCO.OC manages kernel objects, such as address spaces, threads, and communication channels as building blocks for user applications. The correct functioning of the system therefore depends on the proper creation of such objects.
   The second microbenchmark creates a user thread and lets this thread print a message. This involves kernel activity to allocate a new kernel data structure, register the thread with the kernel's scheduler, and run the creator as well as the newly created thread appropriately. I inject faults into each of the system calls required to start up a thread.

3. MapPage: Microkernels implement resource management in user-level applications. The kernel facilitates this management by providing a mechanism to delegate resources from one program to another. In FIASCO.OC terms this mechanism is called resource mapping.
   This third microbenchmark exemplifies resource mappings using a virtual memory page. I run two threads in different address spaces. The first thread requests a memory resource from the second one, which then selects a virtual memory page from its address space and delegates it to the requestor. I inject faults into the map operation and validate benchmark success by inspecting the content of the mapped page on the receiver side.

4. vCPU: In Section 3.3 I explained that ROMAIN leverages FIASCO.OC's virtual CPU (vCPU) mechanism to monitor all externalization events generated by replicas. This event delivery mechanism is therefore crucial for the correctness of ROMAIN. In this benchmark I launch a new vCPU, which executes a couple of mov instructions to bring its registers into a dedicated state. Then the vCPU raises an event that is intercepted by the vCPU master. I inject faults into the kernel's mechanism for exception delivery.

27 Jochen Liedtke. Improving IPC by Kernel Design. In ACM Symposium on Operating Systems Principles, SOSP '93, pages 175–188, Asheville, North Carolina, USA, 1993. ACM
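The structure of the IPC benchmark can be sketched as follows. The stand-in below uses POSIX threads and a shared mailbox merely to mirror the send/reply/verify pattern and the two payload sizes; the actual benchmark exchanges the payload through FIASCO.OC's kernel IPC path, which is where the faults are injected, so everything here is an illustrative assumption rather than the real benchmark code.

```cpp
// Stand-in for the IPC4/IPC252 microbenchmark structure: two threads in the
// same address space, a request, an echoed reply, and a success check.
#include <array>
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <mutex>
#include <thread>

template <size_t PAYLOAD>                       // 4 for IPC4, 252 for IPC252
struct Mailbox {
  std::mutex m;
  std::condition_variable cv;
  std::array<char, PAYLOAD> msg{};
  bool request_ready = false, reply_ready = false;
};

template <size_t PAYLOAD>
void receiver_thread(Mailbox<PAYLOAD> *mb) {
  std::unique_lock<std::mutex> lock(mb->m);
  mb->cv.wait(lock, [mb] { return mb->request_ready; });
  // "Reply" by leaving the payload untouched and signalling completion.
  mb->reply_ready = true;
  mb->cv.notify_all();
}

int main() {
  constexpr size_t PAYLOAD = 252;               // switch to 4 for IPC4
  Mailbox<PAYLOAD> mb;
  std::thread receiver(receiver_thread<PAYLOAD>, &mb);

  {
    std::unique_lock<std::mutex> lock(mb.m);
    std::memset(mb.msg.data(), 0xAB, PAYLOAD);  // message payload
    mb.request_ready = true;                    // "send"
    mb.cv.notify_all();
    mb.cv.wait(lock, [&mb] { return mb.reply_ready; });  // wait for reply
  }
  receiver.join();
  // Success criterion of the benchmark: the sender prints the reply.
  std::printf("reply byte 0: 0x%02x\n",
              static_cast<unsigned>(static_cast<unsigned char>(mb.msg[0])));
  return 0;
}
```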

I compiled 32bit versions of FIASCO.OC, the L4 Runtime Environment, and the benchmarks with GCC (version Debian 4.8.1-10). I then used FAIL* to select and perform FI experiments. I inject bit flips into memory and general-purpose registers. My experiments focus on execution in privileged kernel mode, while I assume that user-level code is protected by ROMAIN, which, as I showed in Section 5.2, detects all such errors that manifest during user-level execution. Table 6.1 shows the number of instructions each benchmark executed inside the kernel, the number of register and memory bit flips that make up the whole fault space, and the number of experiments I actually had to carry out after FAIL* successfully pruned known-outcome experiments.

Benchmark      Fault Model    # of Instructions   Total Experiments   Experiments after Pruning
IPC4           Register SEU   2,193               326,632             70,857
               Memory SEU                         9,553,400           21,865
IPC252         Register SEU   2,313               341,992             74,697
               Memory SEU                         13,542,904          25,705
ThreadCreate   Register SEU   26,893              5,095,840           891,481
               Memory SEU                         175,464,208         57,905
MapPage        Register SEU   6,956               1,048,040           223,074
               Memory SEU                         40,377,232          67,074
vCPU           Register SEU   535                 73,456              16,330
               Memory SEU                         591,384             4,530

Table 6.1: Overview of FIASCO.OC Fault Injection Experiments
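To illustrate why pruning removes so many experiments (compare the last two columns of Table 6.1), the sketch below applies the def/use equivalence argument commonly used for fault-space pruning. I assume it here as a simplified stand-in for FAIL*'s actual pruning logic: a bit flip injected anywhere between two accesses to a location behaves like a flip injected right before the next read, and a flip that is overwritten before being read cannot change the outcome at all.

```cpp
// Def/use pruning for single-bit flips in one memory cell (illustrative).
#include <cstdio>
#include <vector>

enum class Access { Read, Write };

struct Plan {
  long total  = 0;  // experiments in the raw fault space (one per time step)
  long to_run = 0;  // representative experiments that must be executed
  long benign = 0;  // experiments whose outcome ("no effect") is already known
};

// 'trace' holds the accesses to one memory cell, 'times' the instruction
// index of each access; the raw fault space has one candidate injection per
// instruction between consecutive accesses.
Plan prune(const std::vector<Access> &trace, const std::vector<long> &times) {
  Plan p;
  for (size_t i = 1; i < trace.size(); ++i) {
    long interval = times[i] - times[i - 1];   // candidate injection points
    p.total += interval;
    if (trace[i] == Access::Read)
      p.to_run += 1;          // one representative flip right before the read
    else
      p.benign += interval;   // value is overwritten: outcome known to be OK
  }
  return p;
}

int main() {
  // Toy trace: a write, a read 1,000 instructions later, then another write.
  Plan p = prune({Access::Write, Access::Read, Access::Write}, {0, 1000, 1500});
  std::printf("total=%ld run=%ld known-benign=%ld\n", p.total, p.to_run, p.benign);
}
```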

6.2.2 Experiment Results

I executed the fault injection campaigns for every microbenchmark and as a first step classified the experiment outcomes similarly to my classification in Section 5.2:

1. OK means the experiment terminated and produced exactly the same output as an unmodified run.

2. CRASH indicates that the experiment terminated with a visible error message either in the kernel or user space.

3. TIMEOUT indicates that the experiment did not terminate within a predefined period of time. I set this timeout to 200,000 instructions, which – depending on the workload – means the benchmark executed 10 to 100 times as long as the initial run at this point.

4. Silent Data Corruption (SDC) means that an experiment terminated and did not raise an error. However, the experiment's output differed from the initial, fault-free run.

Figure 6.1 shows the distributions of experiment results for all benchmarks and distinguishes between register and memory SEUs. On average, 43% of the register faults and 56% of the memory faults did not lead to a change in system behavior. This observation confirms other studies that also found that many SEUs have no influence on the outcome of an application run, such as Arlat's study24 and my user-level fault injection experiments discussed in Chapter 5.

Note that my experiments focus on the execution of a single microbenchmark. They therefore only cover the short-term effects of an injected fault. Experiments that are classified OK after my experiments may still corrupt kernel state in a way that affects the execution of a different system call at a later point in time. Future work needs to investigate for which experiments this would be the case and which kernel data structures are prone to such long-term corruption issues. These data structures will then be likely candidates for Borchert's aspect-oriented data protection.28

[Figure 6.1: Distribution of FI results targeting the FIASCO.OC kernel; per-benchmark breakdown of register and memory faults into OK, CRASH, TIMEOUT, and SDC]

Those injection runs leading to a visible deviation of application behavior are dominated by CRASH errors: an average of 44% of all register faults and 26% of all memory faults led to crashes, whereas for both fault models about 10% of the experiments timed out. In contrast, only 2% of the register faults and 8% of the memory faults led to SDC errors. The SDC numbers are significantly smaller than Hari's previously reported error rates for user-level applications.29

28 Christoph Borchert, Horst Schirmeier, and Olaf Spinczyk. Generative Software-Based Memory Error Detection and Correction for Operating System Data Structures. In International Conference on Dependable Systems and Networks, DSN'13. IEEE Computer Society Press, June 2013
29 Siva K. S. Hari, Sarita V. Adve, and Helia Naeimi. Low-Cost Program-Level Detectors for Reducing Silent Data Corruptions. In 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 1–12, 2012

Understanding Silent Data Corruption   I attribute the difference in SDC behavior to the fact that my experiments focus on kernel-level code. Hari's study pointed out that SDC errors are often the result of long-running computations on large chunks of data. Such computations rarely occur in kernel mode. Instead, the kernel touches vital data structures, such as the scheduler's run queue or hardware page tables. These structures have a direct impact on the behavior of the system and corrupting them is therefore more likely to lead to a crash.

There is an outlier in my measurements that confirms this hypothesis: when injecting memory errors into the IPC252 benchmark we see an SDC rate of 28%. Closer inspection of this result reveals that nearly all SDC errors here happen within two distinct memory regions: the IPC sender's and receiver's user-level thread control blocks (UTCBs). As I explained in Section 3.4, the UTCB contains a program's system call parameters when entering the kernel. In the case of IPC this is the message payload to be transmitted. As the kernel only copies UTCB content without performing any processing, errors within the UTCB show up as silent data corruption.

As FIASCO.OC is a microkernel, the amount of memory accessed during these microbenchmarks is fairly low. For the IPC4 and IPC252 benchmarks, which only differ in the amount of bytes transferred through the IPC mechanism, the kernel touches about 1 KiB (IPC4) and 1.5 KiB (IPC252) of memory respectively. The increase in IPC252's memory footprint is directly explained by the increased message payload: 248 more bytes in the sender UTCB plus 248 more bytes in the receiver UTCB lead to an increase of 496 bytes. Consequently, the increase in SDC errors we observe is caused by these additional bytes and underlines the fact that user-level data passing through the UTCB is more prone to undetected corruption than internal kernel data structures.

Investigating CRASH Errors   As CRASH errors are the most dominant misbehavior seen in the previous experiments, I took a closer look at what kinds of crashes we are seeing as a result of injecting faults into the kernel. I inspected the output and termination information of those experiments labeled as CRASH in the previous result distribution and further distinguished between three types of crashes:

1. MEMORY failures are those where the error caused the kernel to access an invalid memory address, i.e., an address that is not part of any valid virtual memory region. These crashes trigger kernel-level page fault or double fault handler functions.

2. PROTECTION failures indicate those runs where the injected fault caused the kernel to raise hardware protection faults (e.g., by writing to a read-only memory page) or access invalid kernel objects (e.g., because a capability index was corrupted).

3. USER failures classify experiments where control returned an error condition to a user-level application. This happens in one of two ways: first, the kernel may return from the current system call with an error that is then detected by the user application. Second, an injected fault may lead to an exception that FIASCO.OC deems to originate from the user (e.g., because of a page fault in user-addressable memory). This exception is then delivered to a user-level exception handler.

Figure 6.2 breaks down the CRASH errors from the previous fault injection campaigns into these three categories. We see that slightly more than half of all crashes are reflected to user space (register faults: 57%, memory faults: 54%). An average of 37% of the register faults and 27% of the memory faults led to memory-related exceptions within the kernel. Protection failures make up the smallest fraction of results with 18% of the injected memory faults and 6% of the injected register faults.

[Figure 6.2: Distribution of CRASH failure types (USER, PROTECTION, MEMORY) for register and memory faults per benchmark]

6.2.3 Directions for Future Research

The distributions I showed above allow us to understand what impact SEUs have on kernel execution. Based on these results we can draw conclusions about what kinds of kernel errors we can detect and recover from.

Handling Silent Data Corruption   Detecting and correcting SDC failures depends on the actual workload. If these errors happen within the payload of an IPC message, FIASCO.OC's execution is not affected at all. User-level programs can instead detect those errors using message checksums. To demonstrate this, I modified the IPC252 benchmark so that the sender adds a CRC32 checksum of the message payload and the receiver validates this checksum. With this approach I was able to detect 94.6% of all SDCs in the IPC252 benchmark.

Incorporating checksums and making IPC protocols retry failed message transfers requires a substantial amount of work, because every application needs to be adapted for this purpose. Future research should therefore investigate whether and how this step can be automated.

SDCs in other parts of the kernel may not be as easy to detect, because they, for instance, rely on the correctness of hardware mechanisms. As an example, the vCPU benchmark relies on correct delivery of the exception state to the vCPU master. This involves the CPU's exception mechanism dumping the proper register state onto the kernel stack.30 No software mechanism will be able to detect a data corruption happening in this step, because software will only get called after the exception state is copied to the stack. The respective window of vulnerability could be reduced by a hardware extension that adds a checksum to the exception state on the stack. However, this modification would require customized hardware instead of relying on COTS hardware features. As SDCs only constitute a tiny fraction of the kernel failures we are seeing, we should rather focus on detecting more prominent failure types.

30 Intel Corp. Intel64 and IA-32 Architectures Software Developer's Manual. Technical Documentation at http://www.intel.com, 2013
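To make the checksum idea described above concrete, the following sketch shows how sender-side sealing and receiver-side validation of an IPC payload could look. The layout (the CRC stored in the last four bytes of the 252-byte message) and all names are illustrative assumptions; the actual modified benchmark may differ.

```cpp
// End-to-end protection of an IPC payload with CRC32 (illustrative).
#include <cstdint>
#include <cstring>

// Bitwise CRC32 (IEEE 802.3 polynomial, reflected form): small and table-free.
static uint32_t crc32(const uint8_t *data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit)
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
  }
  return ~crc;
}

constexpr size_t MSG_SIZE  = 252;                // maximum IPC payload
constexpr size_t DATA_SIZE = MSG_SIZE - sizeof(uint32_t);

// Sender side: append the checksum before handing the buffer to the kernel.
void seal_message(uint8_t (&msg)[MSG_SIZE]) {
  uint32_t crc = crc32(msg, DATA_SIZE);
  std::memcpy(msg + DATA_SIZE, &crc, sizeof(crc));
}

// Receiver side: recompute and compare. A mismatch means the payload was
// corrupted on the way (e.g., by an SEU in a UTCB) and should be re-requested.
bool verify_message(const uint8_t (&msg)[MSG_SIZE]) {
  uint32_t received;
  std::memcpy(&received, msg + DATA_SIZE, sizeof(received));
  return crc32(msg, DATA_SIZE) == received;
}
```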

Kernel Crashes are Detected Errors   Given my breakdown of crash errors above, we saw that hardware faults may trigger unhandled page faults or protection faults inside the FIASCO.OC kernel. FIASCO.OC itself is designed in a way that these events will never happen during normal kernel execution: kernel memory is always mapped and accesses never lead to page faults. Protection faults will also only happen in the case of a programming error or a hardware fault.

If we assume FIASCO.OC to be thoroughly tested before going to production, a software error leading to a kernel crash will be extremely rare. Therefore, if we encounter a page fault or protection fault hardware exception at runtime, we can assume this to be the result of a hardware fault and start error recovery. Hence, all MEMORY and PROTECTION failures we saw actually constitute detected errors.

Crashes Reflected to User Space are Detected Errors   In addition to kernel crashes we saw that more than half of all CRASH errors are reflected to user space. If we assume that programs at the user level are replicated using ROMAIN, then these exceptions will get sent to the ROMAIN master, which will then detect a deviation from other non-faulty replicas and initiate error recovery.

Unfortunately, this only covers the rare case in which a kernel error occurs while a replica is executing, for instance because the kernel's timer interrupt handler was triggered for scheduling reasons. In contrast, replication does not protect us from kernel errors that arise during system call handling. As I explained in Chapter 3, the ROMAIN master executes all system calls on behalf of the replicas. As the master is not replicated, a failing system call cannot be handled using replication.

Nevertheless, these errors are noticed by user-level code:

• If the error gets reflected to user space in the form of an exception message, the ROMAIN master's exception handler will detect this issue. Again, by design we can expect no exceptions to occur during normal execution of a system call. Therefore, these exceptions constitute detected hardware errors.

• If the error becomes visible as an error code returned by the system call, ROMAIN will deliver it to the replicated application just as in the case of any other system call return error. The program is then responsible for handling the error code properly.

Can We Recover from Those Crashes? While the previous analysis showed that CRASH failures can be detected, they require recovery procedures at three different levels: kernel failures need to be handled inside FIASCO.OC, visible exceptions require handling by the ROMAIN master, and system call errors need to be recovered from by the application. As an experiment I implemented a mechanism to bridge the gap between those layers by turning all CRASH failures into system call errors. I modified FIASCO.OC and ROMAIN so that in the case of an unhandled exception during kernel execution they turn this exception into a system call error visible to the application:

• FIASCO.OC’s double fault, page fault, and protection fault handler func- tions return an error value to the currently active thread instead of stopping execution as they do by default.

• I modified ROMAIN’s exception handler so that in the case of an exception during a system call, the replica currently executing this system call is resumed with an error return value.

With this mechanism, no recovery needs to be implemented in the lower levels and only application code has to deal with these issues. Unfortunately, this places the burden of recovery on each application developer. To relieve developers, I implemented a generic recovery mechanism, which is part of FIASCO.OC's system call bindings, so that application code is completely oblivious of error handling and recovery (a sketch of this retry wrapper follows the list below):

• Before issuing a system call, the user-level thread pushes its current register state and system call parameters from the UTCB to the stack.

• The thread then issues its system call.

• If the call returns an error value, the original state is retrieved from the stack and the system call is issued again.
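The following is a minimal sketch of this retry wrapper. The argument structure, the error code, and the way the system call is issued are illustrative placeholders; the real mechanism lives inside FIASCO.OC's system call bindings.

```cpp
// Generic save-retry wrapper around a system call (illustrative sketch).
#include <cstdio>
#include <cstring>

struct SyscallArgs { unsigned long regs[8]; };        // UTCB snapshot (illustrative)
constexpr long ERR_KERNEL_RECOVERED = -1001;          // placeholder error value

// 'sys' stands in for the actual trap into the kernel.
template <typename Syscall>
long checkpointed_syscall(SyscallArgs &args, Syscall &&sys) {
  SyscallArgs saved;                                  // 1. checkpoint parameters
  std::memcpy(&saved, &args, sizeof(saved));
  for (;;) {
    long ret = sys(args);                             // 2. issue the system call
    if (ret != ERR_KERNEL_RECOVERED)                  // 3. done unless the kernel
      return ret;                                     //    reported a crash
    std::memcpy(&args, &saved, sizeof(saved));        //    restore and retry
  }
}

int main() {
  SyscallArgs a{};
  int attempts = 0;
  long r = checkpointed_syscall(a, [&](SyscallArgs &) {
    return ++attempts < 3 ? ERR_KERNEL_RECOVERED : 0; // fails twice, then succeeds
  });
  std::printf("result=%ld after %d attempts\n", r, attempts);
}
```

Note that this simple restore-and-retry loop is only safe for idempotent system calls, which is exactly the first of the open problems discussed below.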

This approach allowed me to recover from a set of manually crafted kernel crashes. However, when I evaluated this approach with fault injection experiments on a larger scale, I discovered that there are many cases where generic recovery fails either by crashing the kernel again or by getting stuck inside one of the next system calls. The experiment therefore was a failure and requires future work to investigate the following three problems:

1. Non-idempotent system calls: Assuming a user application receives a system call error indicating a kernel crash, the application does not know at what point during the system call the crash happened. The kernel may at this point already have modified kernel state by creating new kernel objects or deleting old ones. Generic recovery only works if system calls are idempotent – such as FIASCO.OC's IPC send operation – and requires additional care if kernel state might have been modified.

2. Thread Dependencies: Inter-Process Communication – FIASCO.OC’s most heavily used system call – involves the synchronization between a sender and a receiver thread. The IPC path is therefore a series of updates to the state of two complex state machines. If the sender of a message crashes, my modifications were successful in returning an error to the sender. However, depending on which part of the IPC protocol is affected by the crash, we may also have to return an error on the receiver side of the IPC. Implementing this feature was not completed in time for this thesis.

3. Recovery for multithreaded programs on multicore platforms: My experi- ment only worked for recovering single-threaded programs running on a single-core CPU. I did not yet consider potential race conditions and other side effects that may arise when trying to recover on a system that runs other threads concurrently.

Timeouts are Difficult   In addition to SDC and CRASH errors, my fault injection campaigns show that a non-negligible number of hardware-induced errors lead to TIMEOUTs. The reasons for these errors are manifold. In the simplest case the kernel skips delivering an IPC message because of a bit flip, and immediately returns to the sender. This scenario can be handled by a fault-aware user application that retries requests if no proper answer is received within a certain time interval. Other TIMEOUT errors are harder to detect: I noticed that most timeouts occur due to corruption of the kernel's scheduling data structures. In these cases the kernel may lose track of a thread and never schedule it again even though it would be ready to run. Unfortunately, detecting these errors is often impossible because we cannot distinguish between a system that simply has no work to do and a system that lost a thread due to a hardware error. I conclude from this observation that scheduling data structures need to be additionally protected using redundancy in order to avoid timeouts.

SUMMARY: I conducted a series of fault injection campaigns to analyse the FIASCO.OC kernel's reaction to SEUs in memory and general-purpose registers. I found that detectable crash failures constitute the largest source of hardware-induced misbehavior for the kernel. However, implementing recovery in such situations remains an open issue.

Silent data corruption rarely occurs during kernel execution. Most of the SDC errors happened within the message payload of an IPC message. Error detection and recovery may be achieved using message checksums and failure-aware communication protocols. Corruption of scheduling-related data may furthermore lead to TIMEOUT errors where execution of an active thread does not resume properly. These errors are hard to detect and the affected data structures therefore require additional protection.

6.3 Case Study #2: Mixed-Reliability Hardware Platforms

In Chapter 2 I argued that ROMAIN should run on commercial-off-the-shelf (COTS) hardware because COTS components make up a majority of modern embedded, workspace, and high-performance computers. In contrast, hardware that is custom-tailored for reliability is often very expensive. Reliability features are therefore seldom added to COTS hardware. On the other hand, we are currently seeing an increase in hardware diversity due to the advent of heterogeneous manycore platforms provided by major CPU vendors. It is likely that these heterogeneous CPUs are also heterogeneous with respect to their vulnerability against hardware faults. While heterogeneous compute platforms are currently optimized for their compute throughput and their energy consumption, I argue that we are likely to see platforms combining specially reliable CPUs with cheap but more vulnerable cores in the future. The ASTEROID OS architecture I developed in this thesis suits such platforms well, because it allows protecting RCB components by running them on reliable processors while executing replicas on less reliable CPUs.

6.3.1 Heterogeneous Hardware Platforms

Some of the most widely available heterogeneous platforms today come as I/O board extensions to desktop and data center computers. These platforms, such as Intel's Xeon Phi31 and Nvidia's Kepler platform,32 are derived from previous generations of graphics accelerators. The main features provided by these general-purpose graphics processing units (GPGPUs) are additional compute cores and specialized vector processing units. With these properties they extend their predecessors' focus on graphics processing to general-purpose compute-intensive and parallel applications. As a side effect, these systems also often perform the same computation at a lower energy consumption,32 which makes them additionally attractive for large-scale use.

ARM pioneered a slightly different architecture with its big.LITTLE platform.33 This architecture is motivated by the observation that while a computer might need a fast processor for some applications, a lot of the remaining work can be done on a slower, more energy-efficient CPU. big.LITTLE therefore combines four energy-efficient, in-order ARM Cortex A7 CPUs with four fast, super-scalar, and more power-hungry Cortex A15 processors. The operating system can dynamically assign applications to the CPU that suits their needs and switch off the remaining cores to save energy in the meantime.

31 James Jeffers and James Reinders. Intel Xeon Phi Coprocessor High Performance Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2013
32 Nvidia Corp. Kepler: The World's Fastest, Most Efficient HPC Architecture. http://www.nvidia.com/object/nvidia-kepler.html, accessed August 1st, 2014
33 ARM Ltd. Big.LITTLE Processing with ARM Cortex-A15. Whitepaper, 2011

When looking at hardware reliability, we see that big.LITTLE already combines two types of processors with different reliability characteristics. The Cortex A7 has fewer features and requires less chip area. This makes the processor less vulnerable to the effects of cosmic radiation than the larger Cortex A15.

While the previously mentioned platforms are used in production today, other research platforms investigated the idea of using hardware heterogeneity explicitly to separate between CPU cores that are heavily protected against hardware faults and smaller, less protected ones. Leem proposed the Error-Resilient System Architecture (ERSA), which consists of a small set of Super-Reliable Cores (SRCs), whereas the majority of the chip area is spent on smaller, faster, and less reliable Relaxed Reliability Cores (RRCs).34 Leem intended these RRCs to execute stochastic compute workloads. These programs have the specific property that a few errors during computation will not have a dramatic influence on the correctness of a program run. In contrast, he proposed to use SRCs to execute software that must never fail due to the effects of hardware errors, such as the operating system.

In a different work, Motruk and colleagues presented IDAMC, a reconfigurable manycore platform that allows to safely isolate programs in space (by enforcing hardware resource partitioning between groups of CPUs) and time (by partitioning access to the network-on-chip (NoC)).35 IDAMC thereby allows to run mixed-criticality workloads concurrently on the same chip while still fulfilling automotive safety regulations.36 In this architecture we also find worker cores that are solely intended to perform computations at the application level combined with monitor cores that implement special monitoring and configuration tasks. Only monitor cores are allowed to reconfigure the system-wide isolation setup and assign resources to worker cores. Therefore, monitors need to be designed to be less vulnerable to hardware faults than the workers.

34 L. Leem, Hyungmin Cho, J. Bau, Q.A. Jacobson, and S. Mitra. ERSA: Error Resilient System Architecture for Probabilistic Applications. In Design, Automation Test in Europe Conference Exhibition, DATE'10, pages 1560–1565, 2010
35 Boris Motruk, Jonas Diemer, Rainer Buchty, Rolf Ernst, and Mladen Berekovic. IDAMC: A Many-Core Platform with Run-Time Monitoring for Mixed-Criticality. In International Symposium on High-Assurance Systems Engineering, HASE'12, pages 24–31, Oct 2012
36 International Organization for Standardization. ISO 26262: Road Vehicles – Functional Safety, 2011

[Figure 6.3: Mixed-Reliability Manycore Architecture (RRC = Relaxed Reliability Core): a single Super-Reliable Core surrounded by many Relaxed Reliability Cores]

The observations discussed above indicate that we are likely to see heterogeneous multicore systems integrating cores of different levels of hardware vulnerability in the future. These systems will consist of at least two kinds of processors as shown in Figure 6.3. Borrowing Leem's naming, I call these processor types Super-Reliable Cores and Relaxed Reliability Cores.

• Super-Reliable Cores (SRCs) are processors that are specially protected against hardware faults. This protection may be implemented by replicating hardware units. Alternatively, SRCs might be produced with a larger structure size to be less vulnerable against production-, temperature-, and radiation-induced errors. To reduce vulnerability against variation in signal runtime, they may furthermore be clocked at a lower rate than other CPUs.

• Relaxed Reliability Cores (RRCs) will be produced at the smallest possible structure size to integrate more cores and accelerators onto the chip. These can then run at the highest possible clock rate to achieve best performance. As a consequence, these cores are more likely to suffer from the error effects I explained in Chapter 2.

If such platforms become COTS hardware, their inherent properties may allow us to protect ASTEROID's Reliable Computing Base by running the FIASCO.OC kernel and the ROMAIN master process on SRCs, while replicas may be executed on RRCs. Such a system will protect RCB components in hardware and concurrently use ROMAIN to protect application software using replication. In contrast to statically replicated setups, this architecture benefits from ASTEROID's additional flexibility: the system designer can decide whether to replicate an important application or rather run it on an SRC. Furthermore, less important applications may run unreplicated on RRCs.

There is one open problem that we need to solve in order to run ROMAIN on top of a mixed-reliability manycore architecture. The mechanisms I presented in this thesis so far execute replicas concurrently until they raise an externalization event. At this point, the replicas wait for each other and then one of the replicas switches to master execution and performs the actual event handling. This handling is executed on whichever CPU the replica is currently running on and does not switch to another, more reliable core for this purpose. To fit a mixed-reliability platform, ROMAIN has to move execution of master code to an SRC. This requires interaction between RRC and SRC code. I investigate three ways to do so in the following section. I compare those alternatives to ROMAIN's original event handling mechanism, which I will refer to as local event handling.

6.3.2 Communication Across Resilience Levels

Based on the previous discussion of a mixed-reliability architecture I now make the following assumptions:

• The platform consists of SRCs and RRCs.

• SRCs and RRCs can communicate over the network-on-chip (NoC). Rambo and colleagues pointed out that the NoC also requires protection against the effects of hardware faults.37

• Hardware enforces resource isolation between RRCs. This isolation is configured by software running on an SRC, as demonstrated by Motruk's IDAMC.

• The operating system and the ROMAIN master run on an SRC and schedule replica execution on RRCs.

37 Eberle A. Rambo, Alexander Tschiene, Jonas Diemer, Leonie Ahrendts, and Rolf Ernst. Failure Analysis of a Network-on-Chip for Real-Time Mixed-Critical Systems. In Design, Automation Test in Europe Conference Exhibition, DATE'14, 2014

Given these assumptions, we need efficient and reliable communication between RCB and non-RCB components. For this purpose I implemented and evaluated three inter-core communication mechanisms, which I describe below: (1) thread migration, (2) synchronous messaging, and (3) shared-memory polling.

Thread Migration   The first option to reliably handle replica events is to migrate a replica thread from an RRC to the master SRC in the case of an event. I show this in Figure 6.4. After this migration, event handling proceeds as in the local fault handling scenario, but benefits from hardware protection mechanisms provided by the SRC. Once event handling is complete, the replica threads are migrated back to their RRCs.

This mechanism requires that the underlying operating system supports migration of threads between cores. This feature is provided by FIASCO.OC, but it remains to be investigated whether migration will still be possible on a hardware platform consisting of isolated SRC and RRC processors.

[Figure 6.4: Switching to an SRC by thread migration; the replica thread migrates from its RRC to the SRC for master-side event handling and back afterwards]
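As an illustration of the idea only (not of ROMAIN's actual implementation), the sketch below uses Linux/glibc CPU affinity calls as a stand-in for FIASCO.OC's scheduling interface: the calling thread pins itself to an assumed SRC core, handles the event there, and then moves back to its home RRC.

```cpp
// Thread "migration" via CPU affinity (Linux/glibc stand-in; on GCC/g++
// _GNU_SOURCE is defined by default, which these calls require).
#include <pthread.h>
#include <sched.h>

static void pin_to_core(int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

void handle_event_on_src(int src_core, int home_rrc_core,
                         void (*handle_event)(void)) {
  pin_to_core(src_core);        // migrate to the Super-Reliable Core
  handle_event();               // master-side event handling runs on the SRC
  pin_to_core(home_rrc_core);   // migrate back and resume replica execution
}
```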

Synchronous Messaging (Sync IPC)   I implemented a second technique that avoids migrating all threads to an SRC. Instead, I start a helper thread HT on the SRC, which waits for event notifications. Replicas send these notifications through a dedicated messaging mechanism, such as FIASCO.OC's IPC channels. Alternatively, this communication could also be implemented using specially protected hardware mechanisms similar to the message passing extensions Intel proposed in their Single Chip Cloud Computer (SCC).38 As I show in Figure 6.5, once all replicas have sent their state to the helper thread on the master side, the helper validates their correctness and performs event handling. In the meantime, the replicas block waiting for a reply. After the helper completes event processing, it replies to the replicas with an update to their states. The replicas then resume concurrent execution on their RRCs.

38 Rob F. van der Wijngaart, Timothy G. Mattson, and Werner Haas. Light-weight Communications on Intel's Single-Chip Cloud Computer Processor. SIGOPS Operating Systems Review, 45(1):73–83, February 2011

[Figure 6.5: Triggering SRC execution using synchronous IPC; replicas on RRCs send messages to the helper thread HT on the SRC, which performs the event handling and replies]
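A sketch of the rendezvous logic follows, using standard C++ threads as a stand-in for FIASCO.OC IPC; the replica count, the contents of the event state, and all names are assumptions for illustration.

```cpp
// Synchronous-messaging rendezvous between replicas and an SRC helper thread.
#include <condition_variable>
#include <mutex>
#include <vector>

struct EventState { unsigned long regs[16]; };   // externalized replica state

class SyncEventChannel {
  std::mutex m_;
  std::condition_variable cv_;
  std::vector<EventState> pending_;
  unsigned generation_ = 0;                      // counts completed handlings
  const unsigned n_replicas_;

public:
  explicit SyncEventChannel(unsigned n) : n_replicas_(n) {}

  // Called by a replica on an RRC: behaves like a blocking IPC call to HT.
  void call(const EventState &s) {
    std::unique_lock<std::mutex> lock(m_);
    unsigned my_gen = generation_;
    pending_.push_back(s);
    cv_.notify_all();                                       // wake the helper
    cv_.wait(lock, [&] { return generation_ != my_gen; });  // wait for reply
  }

  // Helper thread on the SRC: waits until all replicas reported, handles the
  // event (compare states, perform the system call, ...), then replies.
  template <typename Handler>
  void serve_one(Handler &&handle) {
    std::unique_lock<std::mutex> lock(m_);
    cv_.wait(lock, [&] { return pending_.size() == n_replicas_; });
    handle(pending_);                            // master-side event handling
    pending_.clear();
    ++generation_;
    cv_.notify_all();                            // resume the replicas
  }
};
```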

Shared-Memory Polling (SHM Poll)   Finally, my third mechanism avoids relying on a dedicated messaging mechanism and instead uses shared memory between RRCs and SRCs to transfer notifications and replica states. This mechanism was motivated by FlexSC, which observed that asynchronous messaging primitives may lead to better system call throughput and latencies.39 I therefore implemented a variant of the previous mechanism where the SRC helper thread and the RRC replicas poll on a shared-memory region for updates as depicted in Figure 6.6.

39 Livio Soares and Michael Stumm. FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. In Conference on Operating Systems Design and Implementation, OSDI'10, pages 1–8, 2010

[Figure 6.6: Notifying the SRC using shared-memory polling; replica and helper thread HT poll a shared memory region for requests and replies]
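A sketch of the polling variant, again with illustrative names: replica and helper exchange requests and replies through atomic flags in a shared mailbox rather than through a kernel messaging primitive.

```cpp
// Shared-memory polling between a replica (RRC) and the SRC helper thread.
#include <atomic>

struct EventState { unsigned long regs[16]; };

struct alignas(64) PollMailbox {              // one mailbox per replica, placed
  std::atomic<bool> request{false};           // in memory shared with the SRC
  std::atomic<bool> reply{false};
  EventState state;
};

// Replica side: publish the event state, then spin until the helper replies.
void replica_raise_event(PollMailbox &mb, const EventState &s) {
  mb.state = s;
  mb.request.store(true, std::memory_order_release);
  while (!mb.reply.load(std::memory_order_acquire)) { /* poll */ }
  mb.reply.store(false, std::memory_order_relaxed);   // re-arm for next event
}

// Helper side: poll all replica mailboxes; once every replica has raised its
// event, run master event handling and release the replicas.
template <typename Handler>
void helper_poll_once(PollMailbox *mailboxes, unsigned n, Handler &&handle) {
  for (unsigned i = 0; i < n; ++i)
    while (!mailboxes[i].request.load(std::memory_order_acquire)) { /* poll */ }
  handle(mailboxes, n);                       // validate states, handle event
  for (unsigned i = 0; i < n; ++i) {
    mailboxes[i].request.store(false, std::memory_order_relaxed);
    mailboxes[i].reply.store(true, std::memory_order_release);
  }
}
```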

6.3.3 Evaluation

As I implemented ROMAIN for the 32bit x86 architecture, there is no current platform available that provides mixed-reliability hardware. I therefore evaluate my communication mechanisms on the test machine described in Section 5.3.1. I simulate SRCs and RRCs in the following way:

• I assume one of the twelve available cores to be an SRC, while all other cores are RRCs.

• I modified the ROMAIN master process to perform all event handling on the SRC. For this purpose I integrated the three communication mechanisms explained above into ROMAIN and adjusted the master's event handling accordingly.

• As explained before, the SRC might be protected from signal fluctuations by running at a lower clock speed than the remaining cores. I simulate this fact by artificially slowing down master event handling: whenever the ROMAIN master handles a replica event, I measure the time required to handle this event. Thereafter, I introduce a wait phase, which is a multiple of the event handling time, to simulate the SRC being slower than an RRC (a sketch of this slowdown simulation follows the benchmark list below). In the subsequent experiments I show three different speed ratios between SRCs and RRCs: I measure overhead for both processors running at the same speed (ratio 1:1), as well as for the SRC being five times (ratio 1:5) and ten times (ratio 1:10) slower than an RRC.

I selected four benchmarks from the MiBench benchmark suite for this evaluation:40

1. Bitcount is a purely computation-bound benchmark that does not perform expensive system calls. The benchmark spends a large fraction of its time executing user code and is therefore likely to not suffer from RCB slowdown.

2. Susan and Lame both memory-map an input file and then perform image processing and audio decoding respectively. They therefore represent a mix of kernel interaction and intensive compute work.

3. CRC32 memory-maps 30 files to its address space and then computes a checksum of their content. Checksum computation is relatively cheap so that the benchmark is dominated by the cost of loading the data files. CRC32 therefore heavily interacts with the kernel and the ROMAIN master and I expect it to suffer most from RCB-induced slowdown.

40 M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In International Workshop on Workload Characterization, pages 3–14, Austin, TX, USA, 2001. IEEE Computer Society
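A sketch of how such a slowdown simulation can be implemented follows; the structure is my assumption of the mechanism described above, not ROMAIN's actual code: the master measures its own event handling time and then busy-waits for a multiple of it.

```cpp
// Simulating a slower Super-Reliable Core by stretching event handling time.
#include <chrono>

template <typename Handler>
void handle_event_with_simulated_src(unsigned ratio, Handler &&handle_event) {
  using clock = std::chrono::steady_clock;

  auto start = clock::now();
  handle_event();                                  // real master event handling
  auto handling_time = clock::now() - start;

  // Artificial wait phase: total handling time becomes 'ratio' times the
  // measured one ('ratio' is 1, 5, or 10 in the experiments above).
  auto deadline = clock::now() + (ratio - 1) * handling_time;
  while (clock::now() < deadline) { /* busy-wait */ }
}
```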

As in previous experiments, I execute these benchmarks natively on L4Re and then in ROMAIN running one, two, and three replicas. In Figures 6.7–6.10 I plot the experiment results for each of the four benchmarks. I group the results by the replica slowdown ratios (1:1, 1:5, and 1:10). Within each group I order the results by the communication mechanism used. I furthermore show the overhead for ROMAIN's original local fault handling for reference.

[Figure 6.7: Bitcount: Replication overhead (runtime normalized vs. native) when running RCB code on a Super-Reliable Core with different cross-core communication mechanisms and SRC slowdowns of 1:1, 1:5, and 1:10]

[Figure 6.8: Susan: Replication overhead (runtime normalized vs. native) when running RCB code on a Super-Reliable Core with different cross-core communication mechanisms and SRC slowdowns]

[Figure 6.9: Lame: Replication overhead (runtime normalized vs. native) when running RCB code on a Super-Reliable Core with different cross-core communication mechanisms and SRC slowdowns]

[Figure 6.10: CRC32: Replication overhead (runtime normalized vs. native) when running RCB code on a Super-Reliable Core with different cross-core communication mechanisms and SRC slowdowns]

As expected, we see that overheads rise when we slow down execution of RCB code on the SRC. We also see that slowing down RCB execution has a higher impact on system-call heavy workloads. Replicating CRC32 with three replicas multiplies execution time by a factor of 3.5 when the RCB is executed ten times slower. In contrast, the overhead for Bitcount in the same setup is merely 11%.

While there are small fluctuations across the different communication mechanisms, the main trend is that their impact on replication appears to be nearly identical. This is not surprising: I measured the cost of handling externalization events, and handling a page fault, for example, costs an average of 10,000 CPU cycles in the ROMAIN master. Redirecting IPC messages and other system calls is even more costly: the Bitcount benchmark spends an average of 1.3 million cycles on every system call it performs. Most of this time is spent processing these calls at the other end, such as writing data to a serial terminal or allocating memory dataspaces. In contrast, delivering events to the ROMAIN master contributes between a few hundred and 1,000 CPU cycles to these handling times, depending on the selected delivery method. Hence, the impact of the messaging mechanism on overall processing is low. This impact becomes even lower when we slow down the RCB as this increases processing times even more.

As the choice of cross-core communication mechanism does not influence replication performance, we should focus on the reliability properties of a mechanism in order to best protect the Reliable Computing Base. A thorough analysis of this issue is left for future work because it requires a functioning mixed-reliability hardware platform. However, for the following reasons I anticipate that synchronous messaging may be a useful communication mechanism from an RCB perspective:

• Thread migration requires kernel and hardware support. Migrating threads between mixed-reliable cores could be disallowed by the hardware platform for the purpose of isolating the resources of SRCs and RRCs.

• For a similar reason, shared-memory polling might not work on such platforms. Relaxed Reliability Cores might encounter hardware errors that cause them to overwrite arbitrary memory regions. A simple hardware solution to prevent those errors from affecting SRCs is to disallow sharing of memory regions across reliability zones. If this is implemented in hardware, sharing memory is impossible.

• In contrast, synchronous IPC can be implemented using a specially protected hardware message channel. The feasibility of such hardware mechanisms in multicore platforms was already demonstrated by Motruk's IDAMC and the Intel SCC.

SUMMARY: Trends towards heterogeneous manycore platforms indicate that future manycore systems will comprise processors with different levels of reliability. Researchers already proposed hardware that is built from many cheap and fast Relaxed Reliability Cores (RRCs) and few, specially protected Super-Reliable Cores (SRCs). The ASTEROID operating system architecture I developed in this thesis fits well onto such an architecture. Components of the Reliable Computing Base – such as the FIASCO.OC kernel and the ROMAIN master process – can run on an SRC while user applications can be protected by running them on RRCs in a replicated fashion.

6.4 Case Study #3: Compiler-Assisted RCB Protection

The motivation for implementing ROMAIN as an OS service was the need to support arbitrary binary-only applications without requiring a recompilation from source code. When we think of protecting the Reliable Computing Base against hardware faults, we may reconsider this argument. All components that are within ROMAIN's RCB – the FIASCO.OC kernel, services of the L4 Runtime Environment, as well as the ROMAIN master process – are available as open source software. In this case it is certainly possible to protect those components using compiler-assisted reliable code transformations, such as the ones I discussed in Section 2.4.2.

Unfortunately, to the best of my knowledge none of the commonly referenced fault-tolerant compilers is freely available for download. Furthermore, as I explained in Section 6.1.2, none of these solutions were previously applied to operating system kernels. Hence, implementing such a compiler and applying it to RCB components is out of the scope of this thesis.

The Cost of Compiler-Assisted RCB Protection   Nevertheless, we can try to estimate what impact such a compiler-based solution would have on ROMAIN's performance, resource usage, and reliability. In ASTEROID, user-level applications are protected against hardware faults using replicated execution provided by ROMAIN. We can improve the reliability of the whole system by compiling its RCB components using a fault-tolerant compiler. Using state-of-the-art approaches, such as encoded processing,41 this will allow us to detect close to 100% of commonly visible hardware errors that affect the RCB.

41 Ute Schiffel, André Schmitt, Martin Süßkraut, and Christof Fetzer. ANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software. In International Conference on Computer Safety, Reliability and Security, Safecomp'10, Vienna, Austria, 2010

Applying compiler-assisted fault tolerance to the RCB will also lead to resource and execution time overheads. Encoded processing, for instance, adds redundancy to data by transforming 32bit data words into 64bit encoded values. This means the RCB's memory footprint is likely to double from applying such techniques. Neither the FIASCO.OC kernel nor the ROMAIN master process requires more than a few megabytes of data for their management needs. The largest fraction of data in a system usually belongs to user-level applications. These programs are protected by replication and, as I explained in Chapter 3, ROMAIN maintains copies of their memory for every replica. Therefore, while compiler-assisted fault tolerance will increase the memory footprint of the RCB, its impact will be negligible in comparison to the impact of replication-induced memory overhead for user applications.

Execution Time Overhead I performed a simulation experiment to estimate the performance impact of compiler-based fault tolerance methods on the RCB. For this purpose I executed the integer subset of the SPEC CPU 2006 benchmarks on top of ROMAIN using triple-modular redundancy. Similar to the experiment in the previous section I executed the benchmarks on the test machine described in Section 5.3.1 and slowed down all master execution by a given factor. In contrast to the previous experiment, I did not distinguish between reliable and non-reliable CPUs, because in this setup the RCB is protected using pure software methods. For the slowdown I selected three factors that represent widely cited compiler-assisted fault tolerance methods:

1. SWIFT represents Reis and colleagues' Software-Implemented Fault Tolerance,42 a low-overhead mechanism that duplicates instructions and compares their results. SWIFT has a reported mean overhead of 9.5%.

2. EC-ANBD represents Schiffel and colleagues' AN-encoding compiler.41 ANBD improves on SWIFT's error coverage, but reports a significantly larger execution time overhead of about 289%.

3. SRMT refers to Wang and colleagues' implementation of redundant multithreading in software.43 The authors report an overhead of 900% when running their replicated threads on different CPU sockets. While other RMT approaches have reported lower overheads, I selected this mechanism explicitly to show the impact of high slowdowns on RCB execution.

42 George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, and David I. August. SWIFT: Software Implemented Fault Tolerance. In International Symposium on Code Generation and Optimization, CGO '05, pages 243–254, 2005
43 Cheng Wang, Ho-seop Kim, Youfeng Wu, and Victor Ying. Compiler-managed Software-based Redundant Multithreading for Transient Fault Detection. In International Symposium on Code Generation and Optimization, CGO '07, pages 244–258, 2007
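The simulation can be summarized by a simple cost model; the formula below is my own simplification for illustration, not a formula from the evaluation. Only the time spent executing RCB code is scaled by the slowdown factor s of the respective compiler technique, while replicated user code runs at native speed:

\[
  T_{\text{protected}} \;\approx\; T_{\text{user}} + s \cdot T_{\text{RCB}},
  \qquad s \in \{1.095,\; 3.89,\; 10.0\},
\]

where the three values of s correspond to the reported overheads of SWIFT (+9.5%), EC-ANBD (+289%), and SRMT (+900%).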

Figure 6.11 shows the resulting overheads, which include slowing down all inter-replica synchronization as well as handling of page faults in the ROMAIN master and any system calls the master performs on behalf of the replicas. As was to be expected, we see that SWIFT's 9.5% overhead does not matter at all and execution overheads are identical to execution with an unprotected RCB.

[Figure 6.11: Estimating the cost of compiler-assisted RCB protection for the SPEC INT 2006 benchmarks running in triple-modular redundancy within ROMAIN; runtime normalized vs. native for ROMAIN, SWIFT, EC-ANBD, and SRMT; one value reaches 2.04x]

445.gobmk, 458.sjeng, and 473.astar are purely compute-bound benchmarks and slowing down the RCB does not increase their execution time. Similarly, the overhead of 429.mcf and 471.omnet++ results mainly from cache-related effects that I explained in Section 5.3.2. These benchmarks also do not spend a lot of time executing RCB code and therefore do not suffer from slowing down the RCB. In contrast, the remaining benchmarks (400.perl, 401.bzip2, 456.hmmer, 462.libquant, 464.h264ref) perform system calls and interact with the memory manager. Here we see that for a system-call intensive workload, such as 400.perl, slowing down the kernel and the replication master significantly increases replication overhead.

Nevertheless, the total overheads for combining replication with an RCB protected by fault-tolerant code transformations are significantly lower than those slowdowns the authors reported for protecting user applications solely using compiler-assisted fault tolerance. This is because my proposed combination of user-level replication and compiler-protected RCB code retains the advantages of ROMAIN (low execution time overheads) for user-level code and only applies expensive compiler-level safeguards to those parts of the system that ROMAIN is unable to protect.

I conclude from this experiment that a combination of replication and compiler techniques to protect RCB code appears to be a promising path towards achieving full-system protection against hardware errors on commercial-off-the-shelf platforms. To further investigate this idea, future work will first have to solve the remaining problems for applying fault-tolerant compiler transformations to kernel code as I explained in Section 6.1.

SUMMARY: As all software parts of ROMAIN's RCB are available as open source software, we can protect them against hardware errors using compiler-based fault tolerance methods. Based on a simulation of potential RCB slowdowns, the approach of combining replication at the user level and potentially expensive compiler-level protection at the RCB level appears to be a promising solution for fully protecting the software stack while achieving low execution time overheads.

7 Conclusions and Future Work

In this thesis I developed the ASTEROID operating system architecture to protect user applications against the effects of hardware errors. In this chapter I summarize the contributions of my thesis. Thereafter I outline ideas for future work, which mainly focus on reducing ASTEROID’s resource footprint.

7.1 Operating-System Assisted Replication of Multithreaded Binary-Only Applications

The ASTEROID operating system architecture protects user-level applica- tions against the effects of hardware errors arising in commercial-off-the- shelf (COTS) hardware. ASTEROID’s main component is ROMAIN, an op- erating system service that replicates unmodified binary-only multi-threaded applications. ASTEROID meets the design goals that I identified in Sec- tion 2.5:

1. COTS Hardware Support and Hardware-Level Concurrency: I designed ASTEROID to work on COTS hardware. ROMAIN replicates applications for error detection and correction. To make replication efficient, I leverage the availability of parallel processors.

2. Use of a Componentized System: I implemented ASTEROID on top of the FIASCO.OC microkernel. Using a microkernel design allows ASTEROID to benefit from a system that is split into small, isolated components that can independently be recovered in the case of a failure. Furthermore, as microkernels run most of the traditional OS services – such as file systems and network stacks – in user space, these applications can be transparently protected against hardware errors using replication.

3. Binary Application Support: ROMAIN does not make any assumptions about applications with respect to the development model, programming language, libraries, or tools used for their implementation. My solution therefore allows replicating any binary application that is able to run on top of FIASCO.OC's L4 Runtime Environment.

There are still two exceptions that limit ROMAIN’s applicability:

(a) As explained in Chapter 4, ROMAIN requires multithreaded applications to be race-free and use a standard lock implementation in order to be replicated. In Section 4.3.7 I showed how this limitation can be removed using strongly deterministic multithreading.

(b) Device drivers access input/output resources in ways that may have side effects, such as sending a network packet or writing data to a hard disk. Such accesses cannot easily be replicated, and ROMAIN is therefore unable to replicate device drivers yet. This limitation needs to be addressed in future work.

4. Efficient Error Detection, Correction, and Replication of Multi-threaded Programs: My evaluation in Chapter 5 showed that ROMAIN is able to efficiently replicate single- and multithreaded application software. In my fault injection experiments I demonstrated that ROMAIN detects 100% of all injected single-event upsets in memory and general-purpose registers. ROMAIN furthermore provides error recovery using majority voting, which succeeded in at least 99.6% of my experiments.

Throughout this thesis I discussed how ROMAIN manages replicated applications as well as their resources. I evaluated design alternatives to select those mechanisms that make ROMAIN efficient:

• By replicating applications via redundant multithreading, ROMAIN achieves low replication overheads because it limits the number of state validation operations to those locations where application state becomes visible outside the sphere of replication. While this strategy reduces validation and recovery overhead, we saw in Section 5.2 that it may in turn lead to replicas executing several thousand instructions before an error is detected. For this reason ROMAIN also includes a mechanism to artificially force long-running applications to trigger state validation from time to time. This mechanism was developed by Martin Kriegel1 and I explained it in Section 3.8.2.

• In Section 3.5.2 I advocated the use of hardware-supported large page mappings and proactive handling of page faults to reduce the memory overhead that replicated execution implies.

• In Section 4.3 I compared two strategies to achieve deterministic replication of multithreaded, race-free applications by enforcing deterministic lock ordering across replicas. I showed that we can reduce the overhead of intercepting lock acquisition and release operations by leveraging a replication-aware libpthread library.

• In Section 3.6 I showed that replicated access to shared memory has a large impact on performance and designed a copy & execute strategy for emulating shared memory accesses that is faster than traditional trap & emulate mechanisms.

1 Martin Kriegel. Bounding Error Detection Latencies for Replicated Execution. Bachelor's thesis, TU Dresden, 2013

5. Protection of the Reliable Computing Base: In Chapter 6 I explained that software-level fault tolerance mechanisms always require a correctly functioning set of software and hardware components, the Reliable Computing Base (RCB). I explained what comprises ASTEROID's RCB and discussed ideas towards protecting the RCB. While my thesis shows that these ideas – such as making kernel failures visible to user applications, leveraging mixed-criticality hardware, and protecting the RCB using compiler-assisted protection mechanisms – are feasible, I left their implementation and thorough evaluation for future work.

7.2 Directions for Future Research

ROMAIN replicates applications with low execution time overhead. I now outline ideas for future research to reduce this overhead further and to improve ROMAIN's error coverage in multi-error scenarios. For this purpose I distinguish between ideas that reduce replica resource consumption and other promising optimizations.

7.2.1 Reducing Replication-Induced Resource Consumption

Resource overhead is a major problem for any mechanism that uses replication to provide fault tolerance. Running N replicas will usually require N times the resources of a single application instance. This leads to problems in systems, such as embedded computers, that need to constrain resource availability to reduce energy consumption and production cost. We furthermore saw in Section 5.3.2 that resource replication may lead to secondary problems, such as performance reduction due to an increase in last-level cache misses. I believe that in order to reduce the resource consumption of replicated systems we have to investigate how we can reduce the number of running replicas while maintaining the fault tolerance properties ROMAIN provides.

Dynamically Adapting the Number of Replicas   Replication systems such as ROMAIN usually fix the number of active replicas based on static assumptions about the expected rate of faults. Real-world systems may however experience dynamically changing fault rates depending on environmental and software-level conditions:

• Systems become more vulnerable to soft errors when they experience higher temperatures or are located at a higher altitude above sea level (Ziegler and Curtis et al., IBM Experiments in Soft Fails in Computer Electronics (1978–1994), IBM Journal of Research and Development, 40(1):3–18, 1996). In such situations it may be beneficial to increase the number of running replicas when environmental conditions – reported by external sensors – change.

• Different parts of a program may have different vulnerabilities to soft errors. Program Vulnerability Factor (PVF) analysis allows detecting these variations (Sridharan and Kaeli, Quantifying Software Vulnerability, WREFT '08). Some software functionality may therefore require increased protection while other sequences of code may run less protected.

Both observations open up the potential for reducing the number of replicas, and hence the amount of replicated resources, during periods of lower vulnerability. Robert Muschner extended ROMAIN to support the dynamic adjustment of the replica count in his Diploma thesis (Resource Optimization for Replicated Applications, TU Dresden, 2013), which I co-advised with Michael Roitzsch. The general idea of his thesis was to increase or decrease the replica count when triggered by an external sensor. To increase the number of replicas, new replica vCPUs are started and brought into the same state as existing vCPUs by copying the state from a previously validated replica. This process is similar to how error recovery works in ROMAIN; the difference here is that new memory needs to be allocated for the newly spawned replica. To decrease the number of replicas, we simply halt an existing replica at its next externalization event and release the accompanying resources.

Muschner's thesis shows the feasibility of this approach. He also pointed out that dynamic replicas do not come for free: the adjustment requires additional execution time for acquiring and releasing resources. This overhead limits the frequency at which we can adjust the number of running replicas. To hide these latencies, Muschner proposed to perform resource releases in the background while the active replicas continue operation. Furthermore, he devised a copy-on-write scheme for allocating resources to a newly added replica. This latter scheme limits error coverage, because replicas sharing data copy-on-write will suffer from undetected errors affecting these regions. The approach therefore requires a combination with a hardware-level error detection mechanism, such as Error-Correcting Codes (ECC) (Mukherjee, Architecture Design for Soft Errors, Morgan Kaufmann, 2008).
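
The following is a minimal sketch of a master-side controller for such dynamic replica-count adjustment. The helper functions (spawn_replica_from, halt_at_next_externalization) and the sensor-driven policy are assumptions for illustration only; they do not correspond to interfaces in ROMAIN or Muschner's implementation.

    // Hypothetical replica-count controller (C++). Stubs stand in for the
    // master's real resource management.
    #include <cstddef>
    #include <vector>

    struct Replica { /* vCPU state, address space, ... */ };

    // Stubbed helpers, assumed to be provided by the replication master:
    // copy the state of an already validated replica into a new one ...
    Replica* spawn_replica_from(const Replica&) { return new Replica(); }
    // ... and stop a replica at its next externalization event, then free it.
    void halt_at_next_externalization(Replica* r) { delete r; }

    // Assumes 'replicas' holds at least one previously validated replica.
    void adjust_replica_count(std::vector<Replica*>& replicas, std::size_t desired)
    {
        // Grow: new replicas start from the state of a validated replica,
        // similar to how error recovery copies state.
        while (replicas.size() < desired)
            replicas.push_back(spawn_replica_from(*replicas.front()));

        // Shrink: surplus replicas are halted at their next externalization
        // event so that no partially handled event is lost.
        while (replicas.size() > desired) {
            halt_at_next_externalization(replicas.back());
            replicas.pop_back();
        }
    }

    // Example policy mapping a sensor-reported fault-rate level to a count.
    std::size_t replicas_for(int fault_rate_level)
    {
        return fault_rate_level > 1 ? 3   // majority voting possible
             : fault_rate_level > 0 ? 2   // error detection only
             : 1;                         // unreplicated
    }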

Reducing Resource Consumption by Leveraging ECC Memory   In Section 3.5.3 I already pointed out that ROMAIN can benefit from a combination with ECC hardware, because then we can avoid creating copies of read-only memory regions and share them across all replicas. Future work can extend this approach by leveraging application-specific memory access patterns: memory is often written once and later only accessed in a read-only fashion. To save replica resources, ROMAIN could duplicate such memory regions during the write phases and later remove all copies except one, which would then be mapped read-only to all replicas. This idea is closely related to the field of memory deduplication for virtual machines, where such regions are searched for in order to reduce memory consumption in data centers (Miller, Franz, Rittinghaus, Hillenbrand, and Bellosa, XLH: More Effective Memory Deduplication Scanners Through Cross-Layer Hints, USENIX ATC'13).

ECC-protected read-only memory furthermore relates to Walfield's idea of discardable memory (Viengoos: A Framework For Stakeholder-Directed Resource Allocation, technical report, 2009): here applications can allocate and use memory to cache data as long as there is no memory pressure. In the case of memory scarcity, the OS drops discardable data and notifies the application, which in turn may re-obtain the data once it is needed at a later point in time. ROMAIN could maintain read-only copies of such cached objects. In the case of a detected ECC failure, it would then drop all discardable memory regions and let the application take care of recovering this cached data from its previous source.
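
To make the write-once idea more concrete, the following sketch shows a periodic scan that collapses replica-private copies of regions that have not been written since the last scan into a single shared copy, relying on ECC to guard the remaining mapping. The helpers was_written_since_last_scan and remap_single_copy_readonly are hypothetical placeholders (e.g., for dirty-bit tracking and mapping operations), not existing ROMAIN or L4Re interfaces.

    // Hypothetical deduplication scan over replicated memory regions (C++).
    #include <cstddef>
    #include <cstring>
    #include <vector>

    struct Region {
        std::vector<void*> replica_copies;  // one private copy per replica
        std::size_t        size = 0;
        bool               collapsed = false;
    };

    // Stubbed placeholders for dirty tracking and remapping:
    bool was_written_since_last_scan(const Region&) { return false; }
    void remap_single_copy_readonly(Region&) { /* keep one copy, map read-only */ }

    void deduplication_scan(std::vector<Region>& regions)
    {
        for (Region& r : regions) {
            if (r.collapsed || was_written_since_last_scan(r))
                continue;
            // Only collapse regions on which all replicas still agree;
            // otherwise an undetected error would be propagated to everyone.
            bool identical = true;
            for (std::size_t i = 1; i < r.replica_copies.size(); ++i)
                identical = identical &&
                    std::memcmp(r.replica_copies[0], r.replica_copies[i], r.size) == 0;
            if (identical) {
                remap_single_copy_readonly(r);  // ECC now guards the single copy
                r.collapsed = true;
            }
        }
    }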

Heuristics for Error Recovery   I described in this thesis that ROMAIN uses majority voting to determine which replicas are faulty and need to be corrected. I argue that the existence of a majority is not always required to perform recovery. If a fault causes a replica to crash, for instance by accessing an invalid memory region, two replicas suffice for correction: the faulting replica will raise a page fault in a region that is either unknown to the ROMAIN memory manager or a region where ROMAIN knows that a valid mapping was previously established. In this case, a second replica, raising a valid externalization event, can be assumed to be correct and serve as the origin for recovery. The situation becomes more difficult for the correction of silent data corruption (SDC) errors. If we only have two replicas running in this case, we will see two valid system calls that differ in their system call arguments. I believe that many applications have a specific set of valid system call arguments and that we can use machine learning techniques (Mitchell, Machine Learning, McGraw-Hill, 1997) to train a classifier to distinguish between valid and invalid arguments.

Once trained, the classifier will be able to decide which of two replicas is the faulty one in the case of SDC. Future work will have to evaluate for what kinds of applications such heuristics work and what kind of probabilistic recovery guarantees this enables.
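
The sketch below captures these two-replica heuristics: a crash-versus-valid-event decision, and a classifier-based tie-breaker for differing system call arguments. The event representation and classifier interface are assumptions for illustration; ROMAIN's implemented recovery path uses majority voting among three or more replicas.

    // Hypothetical two-replica recovery decision (C++).
    #include <cstdint>
    #include <functional>
    #include <vector>

    enum class EventKind { Syscall, InvalidPageFault };

    struct ReplicaEvent {
        EventKind kind;
        std::vector<uint64_t> syscall_args;  // only meaningful for Syscall
    };

    // A classifier trained offline on argument vectors observed in known-good
    // runs; returns true if the arguments look like normal application behavior.
    using ArgumentClassifier = std::function<bool(const std::vector<uint64_t>&)>;

    // Returns 0 or 1 for the replica assumed correct, -1 if undecidable.
    int pick_correct_replica(const ReplicaEvent& a, const ReplicaEvent& b,
                             const ArgumentClassifier& looks_valid)
    {
        // Crash heuristic: a replica faulting on an unexpected region is
        // assumed faulty; the other replica serves as the recovery origin.
        if (a.kind == EventKind::InvalidPageFault && b.kind == EventKind::Syscall)
            return 1;
        if (b.kind == EventKind::InvalidPageFault && a.kind == EventKind::Syscall)
            return 0;

        // SDC heuristic: both replicas raised valid system calls that differ
        // in their arguments; ask the classifier which vector is plausible.
        if (a.kind == EventKind::Syscall && b.kind == EventKind::Syscall) {
            bool va = looks_valid(a.syscall_args);
            bool vb = looks_valid(b.syscall_args);
            if (va != vb)
                return va ? 0 : 1;
        }
        return -1;  // fall back to restarting or adding a third replica
    }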

7.2.2 Optimizations Using Application-Level Knowledge and Hardware Extensions

Besides the previously described experiments towards reducing the number of running replicas, future work should also investigate performance and error coverage optimizations that may be enabled by incorporating more application-level knowledge into the process of replication or by leveraging hardware extensions.

Application-Defined State Validation   As I explained in Section 3.8, ROMAIN currently checks the replicas' architectural register states and user-level thread control blocks in order to detect errors. For performance reasons I do not perform a complete comparison of all replica memory, arguing that as long as erroneous data remains internal to the replica, it does not hurt system correctness. However, such erroneous data may remain stored for a long time, and if this time frame exceeds the expected inter-arrival time of hardware errors, an independent second error may affect another replica and finally lead to a situation where the number of running replicas no longer suffices for error correction by majority voting. One way to address this problem would be to make applications replication-aware so that they report important in-memory state to the ROMAIN master. The master can then incorporate this state into validation in order to improve error coverage.

ROMAIN furthermore assumes that any system call is equally important for the outcome of a program and constitutes a point where data leaves the sphere of replication. There may be situations where this is not necessarily the case:

• Applications might occasionally write debug or reporting output. Errors in such messages may affect program analysis or debugging. However, depending on the actual application, they may not necessarily constitute failures from the perspective of a user of the affected service.

• Applications might store temporary data in files on a file system. Writes to these files will therefore be considered data leaving the sphere of replication. However, if such a file is never read externally, it will once again not affect the service obtained by an outside observer.

If an application were replication-aware, it could mark such system calls as less important. ROMAIN could then decide to forgo state validation and the accompanying replica synchronization overhead and just execute such a system call unchecked. Alternatively, ROMAIN could check only a fraction of these system calls to maintain some error coverage while avoiding the cost of checking all of them. A hypothetical application-facing interface for both ideas (reporting important state and marking low-importance system calls) is sketched below.
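
The sketch assumes invented entry points (romain_register_state, romain_hint_unimportant) that a replication-aware application could call; no such API exists in ROMAIN today, and the names, signatures, and semantics are purely illustrative.

    // Hypothetical application-facing hints for the replication master (C++).
    #include <cstddef>
    #include <cstdio>

    // Stub implementations so the sketch compiles stand-alone; a real system
    // would forward these calls to the ROMAIN master.
    extern "C" void romain_register_state(const void*, std::size_t) {}
    extern "C" void romain_hint_unimportant(unsigned /*next_n_syscalls*/) {}

    struct Accounts { double balance[1024]; };

    int main()
    {
        static Accounts accounts{};

        // Critical in-memory state: ask the master to include it in every
        // state validation so that silent corruption is found early.
        romain_register_state(&accounts, sizeof(accounts));

        // Low-importance output: debug messages do not affect the service
        // seen by external users, so their validation may be skipped or sampled.
        romain_hint_unimportant(1);
        std::puts("debug: checkpoint reached");

        return 0;
    }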

Leveraging Application Knowledge to Improve Shared Memory Replication   My analysis in Section 5.3.4 showed that while ROMAIN supports the replication of shared-memory applications, intercepting those shared-memory accesses incurs a significant execution time cost. The main problem here is that ROMAIN conservatively needs to assume that every such access may suffer from potential inconsistencies. Depending on the specific applications involved in a shared-memory scenario, ROMAIN may improve replication performance by leveraging application knowledge.

Consider for example a shared-memory scenario where a producer application uses shared memory to send a large amount of data to a consumer, as is often the case in zero-copy data transmission scenarios (Stecklina, Shrinking the Hypervisor one Subsystem at a Time: A Userspace Packet Switch for Virtual Machines, VEE'14). In this case the producer knows exactly when data is ready to be sent to the consumer, and the consumer will never read or modify data before this point in time. If a replication-aware producer is able to convey to ROMAIN that data is ready, we can implement the following optimization for shared-memory replication:

1. The ROMAIN master maps a private copy of the shared-memory region to each replica of a replicated producer. The replicas then treat this memory as private memory and can directly read and write it.

2. Once data is ready, the producer notifies ROMAIN about data being updated. ROMAIN can also try to infer this information by inspecting the producer's system calls, as such notifications are often sent through dedicated software interrupt system calls in FIASCO.OC.

3. If the notification is seen by the ROMAIN master, it first compares the replicas’ private memory regions. Upon success, the content of one such region is merged back into the original shared memory region that is seen by the consumer.

4. Finally, ROMAIN delivers the data update notification to the consumer, which then reads the data.
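
The following sketch shows the master-side step of this protocol: upon the producer's "data ready" notification, the master compares the producer replicas' private copies and, if they match, merges one copy back into the region the consumer sees. The data structures and the notification hook are assumptions for illustration, not ROMAIN's actual shared-memory implementation.

    // Hypothetical master-side commit of a producer update (C++).
    #include <cstddef>
    #include <cstring>
    #include <vector>

    struct SharedRegion {
        void*              consumer_view;    // region actually seen by the consumer
        std::vector<void*> producer_copies;  // one private copy per producer replica
        std::size_t        size;
    };

    // Returns true if the update was committed; false means the replicas
    // disagreed and normal error recovery (majority voting) must run first.
    bool commit_producer_update(SharedRegion& r)
    {
        // Step 3a: compare the replicas' private copies of the region.
        for (std::size_t i = 1; i < r.producer_copies.size(); ++i)
            if (std::memcmp(r.producer_copies[0], r.producer_copies[i], r.size) != 0)
                return false;

        // Step 3b: merge one validated copy into the consumer-visible region.
        std::memcpy(r.consumer_view, r.producer_copies[0], r.size);

        // Step 4 happens in the caller: forward the data-update notification
        // (e.g., the producer's software interrupt) to the consumer.
        return true;
    }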

Hardware-Level Execution Fingerprinting to Improve Error Coverage   My previous optimization suggestions focused on reducing the number of state comparisons and the amount of resources required for replicated execution. If we are willing to accept specialized hardware instead of running ROMAIN on COTS components, replication can furthermore benefit from hardware extensions aiming at improving the fault tolerance of software. Smolens proposed an extension to the CPU pipeline that computes hash sums of the instructions and data accessed by the different pipeline stages (Smolens, Gold, Kim, Falsafi, Hoe, and Nowatzyk, Fingerprinting: Bounding Soft-Error Detection Latency and Bandwidth, ASPLOS XI, 2004).

Philip Axer implemented a similar extension in hardware for the SPARC LEON3 processor (Axer, Ernst, Döbel, and Härtig, Designing an Analyzable and Resilient Embedded Operating System, SOBRES'12). Both works showed that implementing such fingerprinting is cheap in terms of chip area and energy consumption. I had the opportunity to supervise Christian Menard's implementation of such fingerprinting in a simulated, in-order x86 processor in the GEM5 hardware simulator (Menard, Improving Replication Performance and Error Coverage Using Instruction and Data Signatures, study thesis, TU Dresden, 2014). Menard also showed the feasibility of integrating this mechanism into ROMAIN for fast state comparison.

The benefit of the fingerprinting approach is twofold: First, instead of comparing registers and memory areas, the ROMAIN master only needs to compare two registers per replica (the instruction and memory footprint registers). This speeds up comparison. Second, fingerprint comparison covers all data that influenced application behavior, and this approach is therefore also able to detect erroneous data in memory that ROMAIN otherwise would not find or would only find at a later point in time.

The remaining problem with pipeline fingerprinting is that all existing implementations (including Menard's x86 one) were only done for in-order processors. Modern CPUs however use sophisticated speculation and prefetching techniques that make it hard to determine exactly when an instruction or a datum from memory should be incorporated into the checksum and which data should be discarded for hash computation. This problem needs to be overcome in order to make a fingerprinting extension viable for real-world high-performance processors.
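
As a purely software-level illustration of the idea, the following sketch models a per-replica fingerprint register that accumulates a hash over retired instruction addresses and accessed data words, so that the master's state comparison shrinks to one value per replica. The FNV-1a-style hash and the update hook are illustrative assumptions; the cited implementations compute the fingerprint in hardware.

    // Illustrative software model of pipeline fingerprinting (C++).
    #include <cstdint>
    #include <vector>

    struct Fingerprint {
        uint64_t value = 0xcbf29ce484222325ull;   // FNV-1a offset basis

        // Conceptually called by the pipeline for every retired instruction
        // address and for every datum read from or written to memory.
        void update(uint64_t word)
        {
            value ^= word;
            value *= 0x100000001b3ull;            // FNV-1a prime
        }
    };

    // Master-side check: with fingerprinting, validating replica state only
    // requires comparing one register value per replica instead of full memory.
    bool replicas_agree(const std::vector<Fingerprint>& fps)
    {
        if (fps.empty())
            return true;
        for (const Fingerprint& fp : fps)
            if (fp.value != fps.front().value)
                return false;
        return true;
    }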

Acknowledgements

The process of research and writing a dissertation is a long and winding road. By intent we move off the beaten track to discover new and interesting things. But it is easy to get lost in this jungle, and more often than we would like to admit, we need a helping hand that leads us back to the path. I had the pleasure to work with bright people who lent me this hand whenever I needed it.

First of all, I would like to thank my advisor Professor Hermann Härtig for giving me the opportunity to join his group and for giving me the freedom to explore the topics that interested me most. My colleagues in the TU Dresden Operating Systems Group supported my research with curiosity, encouragement, criticism – whichever was necessary at a given point in time. Carsten Weinhold and Michael Roitzsch had an open ear for the problems that were haunting me and often pointed out details or shortcuts that I was too blind to see. Adam Lackorzynski and Alexander Warg developed the L4 Runtime Environment, which this work is based on, and patiently explained its intricacies to me over and over again. Martin Pohlack and Ronald Aigner were the first to introduce me to research and scientific writing, and I hope that this thesis meets their expectations. Benjamin Engel, Bernhard Kauer, Julian Stecklina, and Tobias Stumpf gave valuable feedback and ideas for the development of ROMAIN. Thomas Knauth and Stephan Diestelhorst burdened themselves with commenting on many early versions of research papers I wrote. Angela Spehr was always there to help me deal with the bureaucratic tribulations of a German university.

At TU Dresden I was also in the lucky position to advise some extraordinarily bright students during their Master's theses. This dissertation benefited from the accompanying discussions, their questions, and their results. I would therefore like to thank Dirk Vogt, Martin Unzner, Martin Kriegel, Robert Muschner, Florian Pester, and Christian Menard.

ROMAIN and the ASTEROID OS architecture were designed within the DFG-funded project ASTEROID. Philip Axer was my partner in this project and it was a pleasure to work and collaborate with him. Horst Schirmeier developed the FAIL* fault injection framework and went out of his way to help me with my fault injection encounters. Additionally, I enjoyed my work with Michael Engel on the Reliable Computing Base concept and any other discussions we had on our research interests.

I furthermore had the opportunity to get to know industrial research and kernel development during two internships at , Cambridge (UK), and at VMWare in Palo Alto, CA. During these months I learned

a lot about research and problem solving from Eno Thereska, Daniel Arai, Bernhard Poess, and Bharath Chandramohan.

Attending conferences and visiting other universities allowed me to get early feedback on my work from colleagues. I enjoyed fruitful discussions with Gernot Heiser, Olaf Spinczyk, Rüdiger Kapitza, Frank Müller, Frank Bellosa, Jan Stoess, and Marius Hillenbrand.

Last but not least I would like to thank my family. My parents supported me in the endeavor of becoming a researcher. My wife Christiane shared the highs and lows of life with me and I would not want to miss any minute of it.

To summarize: Thank you! You guys rock!

List of Figures

1.1 ASTEROID Resilient OS Architecture 10
1.2 Replicated Application 11

2.1 MOSFET Transistor 15
2.2 Switching MOSFET 15
2.3 Chain of errors 18
2.4 Temporal Reliability Metrics 19
2.5 Saggese's Fault Manifestation Study 23
2.6 Triple modular redundancy (TMR) 26
2.7 DIVA Architecture 27

3.1 ASTEROID System Architecture 40
3.2 ROMAIN Architecture 41
3.3 FIASCO.OC Exception Handling 43
3.4 Handling Externalization Events 44
3.5 Event Handling Loop 45
3.6 FIASCO.OC Object Capabilities 45
3.7 Replicated and Unreplicated Interaction 47
3.8 Translating Replica Capability Selectors 47
3.9 Partitioned Capability Tables 48
3.10 FIASCO.OC Region Management 50
3.11 Per-Replica Memory Regions 51
3.12 Replication Meets ECC 52
3.13 Memory management microbenchmark 53
3.14 Memory Management Overhead 53
3.15 The Mapping-Alignment Problem 55
3.16 Adjusting Alignment for Large Page Mappings 55
3.17 Reduced Page Fault Handling Overhead 56
3.18 Optimized Memory Management Results 56
3.19 Trap & emulate Runtime Overhead 59
3.20 Trap & emulate Emulation Cost 60
3.21 Copy & Execute Overhead 61
3.22 Copy & Execute Emulation Cost 62
3.23 Error Detection Latencies 68

4.1 Blocking Synchronization 72
4.2 Thread Pool Example 73
4.3 Worker thread implementation 73
4.4 Example Schedules 73

4.5 Terminology overview 79
4.6 Multithreaded event handling in ROMAIN 79
4.7 Externalizing Lock Operations 82
4.8 Thread microbenchmark 83
4.9 Worst-Case Multithreading Overhead 83
4.10 Sequential CPU Assignment 84
4.11 Optimized CPU Assignment 84
4.12 Execution phases of a single replica 85
4.13 Multithreaded Execution Breakdown 85
4.14 Optimized Multithreading Benchmark 86
4.15 Optimized Multithreading Breakdown 87
4.16 Lock Info Page Architecture 88
4.17 Lock Info Page Structure 88
4.18 Replication-Aware Lock Function 89
4.19 Replication-Aware Unlock Function 89
4.20 Cooperative Determinism Overhead 90
4.21 LIP Protection 94

5.1 Fault Space for Single-Event Upsets in Memory 100
5.2 Error Coverage: Register SEUs 104
5.3 Error Coverage: Memory SEUs 104
5.4 Error Detection Latency 105
5.5 Single-CPU Replica Schedules 106
5.6 Replica Execution after Fault Injection 107
5.7 Computing error detection latency 108
5.8 SPEC CPU Overhead 111
5.9 SPEC CPU Overhead (Incomplete) 111
5.10 Overhead by externalization event ratio 112
5.11 SPEC CPU Breakdown 113
5.12 Cache-Aware CPU Assignment 114
5.13 SPEC CPU Overhead (Improved) 115
5.14 SPLASH2: Replication overhead for two application threads 116
5.15 SPLASH2: Replication overhead for four application threads 116
5.16 Multithreaded Replication Problems 117
5.17 SPLASH2 Overhead with Enforced Determinism 118
5.18 SHMC Application Benchmark 118
5.19 Shared Memory Throughput 119
5.20 Microbenchmarks: Recovery Time 120
5.21 Recovery Time and Memory Footprint 121

6.1 FIASCO.OC Fault Injection Results 134
6.2 Distribution of CRASH failure types 135
6.3 Mixed Reliability Hardware Platform 140
6.4 Switching to an SRC by thread migration 142
6.5 Triggering SRC execution using synchronous IPC 142
6.6 Notifying the SRC using shared-memory polling 143
6.7 Bitcount overhead on SRC 144
6.8 Susan overhead on SRC 144
6.9 Lame overhead on SRC 144

6.10 CRC32 overhead on SRC 145
6.11 Estimation of Compiler-Based RCB Protection Overhead 148

8 Bibliography

Mike Accetta, Robert Baron, William Bolosky, David Golub, Richard Rashid, Avadis Tevanian, and Michael Young. Mach: A New Kernel Foundation for UNIX Development. In USENIX Technical Conference, pages 93–112, 1986. Ronald Aigner. Communication in Microkernel-Based Systems. Dissertation, TU Dresden, 2011. Muhammad Ashraful Alam, Haldun Kufluoglu, D. Varghese, and S. Mahapatra. A Compre- hensive Model for PMOS NBTI Degradation: Recent Progress. Microelectronics Reliability, 47(6):853–862, 2007. Jason Ansel, Kapil Arya, and Gene Cooperman. DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop. In 23rd IEEE International Parallel and Distributed Processing Symposium, Rome, Italy, May 2009. Jean Arlat, Jean-Charles Fabre, Manuel Rodríguez, and Frédéric Salles. Dependability of COTS Microkernel-Based Systems. IEEE Transactions on Computing, 51(2):138–163, February 2002. ARM Ltd. Big.LITTLE processing with ARM Cortex-A15. Whitepaper, 2011. Mohit Aron, Luke Deller, Kevin Elphinstone, Trent Jaeger, Jochen Liedtke, and Yoonho Park. The SawMill Framework for Virtual Memory Diversity. In Asia-Pacific Computer Systems Architecture Conference, Bond University, Gold Coast, QLD, Australia, January 29–February 2 2001. Todd M. Austin. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. In International Symposium on Microarchitecture, MICRO’32, pages 196–207, Haifa, Israel, 1999. IEEE Computer Society. J.-L. Autran, P. Roche, S. Sauze, G. Gasiot, D. Munteanu, P. Loaiza, M. Zampaolo, and J. Borel. Altitude and Underground Real-Time SER Characterization of CMOS 65nm SRAM. In European Conference on Radiation and Its Effects on Components and Systems, RADECS’08, pages 519–524, 2008. Amittai Aviram and Bryan Ford. Deterministic OpenMP for Race-Free Parallelism. In Confer- ence on Hot Topics in Parallelism, HotPar’11, Berkeley, CA, 2011. USENIX Association. Amittai Aviram, Bryan Ford, and Yu Zhang. Workspace Consistency: A Programming Model for Shared Memory Parallelism. In Workshop on Determinism and Correctness in Parallel Programming, WoDet’11, Newport Beach, CA, 2011. Amittai Aviram, Shu-Chun Weng, Sen Hu, and Bryan Ford. Efficient System-enforced Deter- ministic Parallelism. pages 193–206, Vancouver, BC, Canada, 2010. USENIX Association. Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, 2004. Philip Axer, Rolf Ernst, Björn Döbel, and Hermann Härtig. Designing an Analyzable and Resilient Embedded Operating System. In Workshop on Software-Based Methods for Robust Embedded Systems, SOBRES’12, Braunschweig, Germany, 2012. Philip Axer, Moritz Neukirchner, Sophie Quinton, Rolf Ernst, Björn Döbel, and Hermann Härtig. Response-Time Analysis of Parallel Fork-Join Workloads with Real-Time Constraints. In Euromicro Conference on Real-Time Systems, ECRTS’13, Jul 2013. Claudio Basile, Zbigniew Kalbarczyk, and Ravishankar K. Iyer. Active Replication of Mul- tithreaded Applications. Transactions on Parallel Distributed Systems, 17(5):448–465, May 2006. 164 BJÖRNDÖBEL

Robert Baumann. Soft Errors in Advanced Computer Systems. IEEE Design Test of Computers, 22(3):258–266, 2005. Tom Bergan, Owen Anderson, Joseph Devietti, Luis Ceze, and Dan Grossman. CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution. In Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 53–64, Pittsburgh, Pennsylvania, USA, 2010. ACM. Tom Bergan, Nicholas Hunt, Luis Ceze, and Steve Gribble. Deterministic Process Groups in dOS. In Symposium on Operating Systems Design & Implementation, OSDI’10, pages 177–192, Vancouver, BC, Canada, 2010. USENIX Association. Emery D. Berger, Ting Yang, Tongping Liu, and Gene Novark. Grace: Safe Multithreaded Programming for C/C++. In Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA ’09, pages 81–96, Orlando, Florida, USA, 2009. ACM. David Bernick, Bill Bruckert, Paul del Vigna, David Garcia, Robert Jardine, Jim Klecka, and Jim Smullen. NonStop: Advanced Architecture. In International Conference on Dependable Systems and Networks, pages 12–21, June 2005. James R. Black. Electromigration – A Brief Survey and Some Recent Results. IEEE Transac- tions on Electron Devices, 16(4):338–347, 1969. Robert L. Bocchino, Jr., Vikram S. Adve, Danny Dig, Sarita V. Adve, Stephen Heumann, Rakesh Komuravelli, Jeffrey Overbey, Patrick Simmons, Hyojin Sung, and Mohsen Vakilian. A Type and Effect System for Deterministic Parallel Java. In Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA’09, pages 97–116, Orlando, Florida, USA, 2009. ACM. Christoph Borchert, Horst Schirmeier, and Olaf Spinczyk. Protecting the dynamic dispatch in C++ by dependability aspects. In GI Workshop on Software-Based Methods for Robust Embedded Systems (SOBRES ’12), Lecture Notes in Informatics, pages 521–535. German Society of Informatics, September 2012. Christoph Borchert, Horst Schirmeier, and Olaf Spinczyk. Generative Software-Based Mem- ory Error Detection and Correction for Operating System Data Structures. In International Conference on Dependable Systems and Networks, DSN’13. IEEE Computer Society Press, June 2013. Sherkar Borkar. Designing Reliable Systems From Unreliable Components: The Challenges of Transistor Variability and Degradation. IEEE Micro, 25(6):10 – 16, 2005. Francisco V. Brasileiro, Paul D. Ezhilchelvan, Santosh K. Shrivastava, Neil A. Speirs, and S. Tao. Implementing Fail-Silent Nodes for Distributed Systems. Computers, IEEE Transac- tions on, 45(11):1226–1238, 1996. Thomas C. Bressoud and Fred B. Schneider. Hypervisor-Based Fault Tolerance. ACM Transactions on Computing Systems, 14:80–107, February 1996. W. G. Brown, J. Tierney, and R. Wasserman. Improvement of Electronic-Computer Reliability Through the Use of Redundancy. IRE Transactions on Electronic Computers, EC-10(3):407– 416, 1961. Derek Bruening and Qin Zhao. Practical Memory Checking with Dr. Memory. In Symposium on Code Generation and Optimization, CGO ’11, pages 213–223, 2011. George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox. Mi- croreboot: A Technique For Cheap Recovery. In Symposium on Operating Systems Design & Implementation, OSDI’04, Berkeley, CA, USA, 2004. USENIX Association. Hyungmin Cho, Shahrzad Mirkhani, Chen-Yong Cher, Jacob A. Abraham, and Subhasish Mitra. Quantitative Evaluation of Soft Error Injection Techniques for Robust System Design. In Design Automation Conference (DAC), 2013 50th ACM / EDAC / IEEE, pages 1–10, 2013. 
Intel Corp. Intel Digital Random Number Generator (DRNG) – Software Implementation Guide. Technical Documentation at http://www.intel.com, 2012. Intel Corp. Intel64 and IA-32 Architectures Software Developer’s Manual. Technical Docu- mentation at http://www.intel.com, 2013. Intel Corp. Software Guard Extensions – Programming Reference. Technical Documentation at http://www.intel.com, 2013. C. Cowan, P. Wagle, C. Pu, S. Beattie, and J. Walpole. Buffer Overflows: Attacks and Defenses for the Vulnerability of the Decade. In DARPA Information Survivability Conference and Exposition, volume 2, pages 119–129 vol.2, 2000. Heming Cui, Jiri Simsa, Yi-Hong Lin, Hao Li, Ben Blum, Xinan Xu, Junfeng Yang, Garth A. Gibson, and Randal E. Bryant. Parrot: A Practical Runtime for Deterministic, Stable, and Reliable Threads. In ACM Symposium on Operating Systems Principles, SOSP’13, pages 388–405, Farminton, Pennsylvania, 2013. ACM. OPERATINGSYSTEMSUPPORTFORREDUNDANTMULTITHREADING 165

Heming Cui, Jingyue Wu, Chia-Che Tsai, and Junfeng Yang. Stable Deterministic Multi- threading Through Schedule Memoization. In Conference on Operating Systems Design and Implementation, OSDI’10, pages 1–13, Vancouver, BC, Canada, 2010. USENIX Association. L. Dagum and R. Menon. OpenMP: An Industry Standard API for Shared-Memory Program- ming. Computational Science Engineering, IEEE, 5(1):46–55, Jan 1998.

Matt Davis. Creating a vDSO: The Colonel’s Other Chicken. Linux Journal, mirror: http: //tudos.org/~doebel/phd/vdso2012/, February 2012. Julian Delange and Laurent Lec. POK, an ARINC653-compliant operating system released under the BSD license. In Realtime Linux Workshop, RTLWS’11, 2011. Timothy J. Dell. A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory. IBM Whitepaper, 1997. Department of Defense. Trusted Computer System Evaluation Criteria, December 1985. DOD 5200.28-STD (supersedes CSC-STD-001-83). Alex Depoutovitch and Michael Stumm. Otherworld: Giving Applications a Chance to Survive OS Kernel Crashes. In European Conference on Computer Systems, EuroSys ’10, pages 181–194, Paris, France, 2010. ACM. Edsger. W. Dijkstra. A Note on Two Problems in Connexion With Graphs. Numerische Mathematik, 1:269–271, 1959. Artem Dinaburg. Bitsquatting: DNS Hijacking Without Exploitation. BlackHat Conference, 2011. A. Dixit and Alan Wood. The Impact of new Technology on Soft Error Rates. In IEEE Reliability Physics Symposium, IRPS’11, pages 5B.4.1–5B.4.7, 2011. Björn Döbel and Hermann Härtig. Who Watches the Watchmen? – Protecting Operating Sys- tem Reliability Mechanisms. In Workshop on Hot Topics in System Dependability, HotDep’12, Hollywood, CA, 2012. Björn Döbel and Hermann Härtig. Where Have all the Cycles Gone? – Investigating Runtime Overheads of OS-Assisted Replication. In Workshop on Software-Based Methods for Robust Embedded Systems, SOBRES’13, Koblenz, Germany, 2013. Björn Döbel and Hermann Härtig. Can We Put Concurrency Back Into Redundant Multithread- ing? In 14th International Conference on Embedded Software, EMSOFT’14, New Delhi, India, 2014. Björn Döbel, Hermann Härtig, and Michael Engel. Operating System Support for Redundant Multithreading. In 12th International Conference on Embedded Software, EMSOFT’12, Tampere, Finland, 2012. Björn Döbel, Horst Schirmeier, and Michael Engel. Investigating the Limitations of PVF for Realistic Program Vulnerability Assessment. In Workshop on Design For Reliability (DFR), 2013.

Nelson Elhage. Attack of the Cosmic Rays! KSPlice Blog, 2010, https:// blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_rays1, accessed on April 22nd 2013. Kevin Elphinstone and Gernot Heiser. From L3 to seL4: What Have We Learnt in 20 Years of L4 Microkernels? In Symposium on Operating Systems Principles, SOSP’13, pages 133–150, Farminton, Pennsylvania, 2013. ACM. Michael Engel and Björn Döbel. The Reliable Computing Base: A Paradigm for Software- Based Reliability. In Workshop on Software-Based Methods for Robust Embedded Systems, 2012. Dawson Engler and David Yu et al. Chen. Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code. In Symposium on Operating Systems Principles, SOSP’01, pages 57–72, Banff, Alberta, Canada, 2001. ACM. Ernst, Dan and Nam Sung Kim et al. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation. In International Symposium on Microarchitecture, MICRO’36, pages 7–18, 2003. Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark Silicon and the End of Multicore Scaling. In Annual International Symposium on Computer Architecture, ISCA’11, pages 365–376, San Jose, California, USA, 2011. ACM. Dae Hyun Kim et al. 3D-MAPS: 3D Massively Parallel Processor With Stacked Memory. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, pages 188–190, 2012. 166 BJÖRNDÖBEL

David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. Detection and Correction of Silent Data Corruption for Large-Scale High- Performance Computing. In International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pages 78:1–78:12, Salt Lake City, Utah, 2012. IEEE Computer Society Press. Cristian Florian. Report: Most Vulnerable Operating Systems and Applications in 2013. GFI Blog, accessed on July 29th 2014, http://www.gfi.com/blog/report-most-vulnerable- operating-systems-and-applications-in-2013/. International Organization for Standardization. ISO 26262: Road Vehicles – Functional Safety, 2011. Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. In Conference on Programming Language Design and Implementa- tion, PLDI’98, pages 212–223, Montreal, Quebec, Canada, June 1998. Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995. Narayanan Ganapathy and Curt Schimmel. General Purpose Operating System Support for Multiple Page Sizes. In USENIX Annual Technical Conference, ATC ’98, Berkeley, CA, USA, 1998. USENIX Association. Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. Memory Consistency and Event Ordering in Scalable Shared-Memory Multi- processors. In International Symposium on Computer Architecture, ISCA ’90, pages 15–26, Seattle, Washington, USA, 1990. ACM. Cristiano Giuffrida, Lorenzo Cavallaro, and Andrew S. Tanenbaum. We Crashed, Now What? In Workshop on Hot Topics in System Dependability, HotDep’10, Vancouver, BC, Canada, 2010. USENIX Association. James Glanz. Power, Pollution and the Internet. The New York Times, accessed on July 1st 2013, mirror: http://os.inf.tu-dresden.de/~doebel/phd/nyt2012util/ article.html, September 2012. Jim Gray. Why Do Computers Stop and What Can Be Done About It? In Symposium on Reliability in Distributed Software and Database Systems, pages 3–12, 1986. Weining Gu, Z. Kalbarczyk, and R.K. Iyer. Error Sensitivity of the Linux Kernel Executing on PowerPC G4 and Pentium 4 Processors. In Conference on Dependable Systems and Networks, DSN’04, pages 887–896, June 2004. Zhenyu Guo, Chuntao Hong, Mao Yang, Dong Zhou, Lidong Zhou, and Li Zhuang. Rex: Replication at the Speed of Multi-core. In European Conference on Computer Systems, EuroSys ’14, Amsterdam, The Netherlands, 2014. ACM. M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In International Workshop on Workload Characterization, pages 3–14, Austin, TX, USA, 2001. IEEE Computer Society. Tom R. Halfhill. Processor Watch: DRAM+CPU Hybrid Breaks Barriers. Linley Group, 2011, accessed on July 26th 2013, mirror: http://tudos.org/~doebel/phd/linley11core/. Richard W. Hamming. Error Detecting And Error Correcting Codes. Bell System Technical Journal, 29:147–160, 1950. Siva K. S. Hari, Sarita V. Adve, and Helia Naeimi. Low-Cost Program-Level Detectors for Reducing Silent Data Corruptions. In 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 1–12, 2012. Siva Kumar Sastry Hari, Sarita V. Adve, Helia Naeimi, and Pradeep Ramachandran. 
Relyzer: Exploiting Application-Level Fault Equivalence to Analyze Application Resiliency to Transient Faults. In International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 123–134, New York, NY, USA, 2012. ACM. Tim Harris, Martin Maas, and Virendra J. Marathe. Callisto: Co-Scheduling Parallel Runtime Systems. In European Conference on Computer Systems, EuroSys ’14, Amsterdam, The Netherlands, 2014. ACM. Andreas Heinig, Ingo Korb, Florian Schmoll, Peter Marwedel, and Michael Engel. Fast and Low-Cost Instruction-Aware Fault Injection. In GI Workshop on Software-Based Methods for Robust Embedded Systems (SOBRES ’13), 2013. Daniel Henderson and Jim Mitchell. POWER7 System RAS – Key Aspects of Power Systems Reliability, Availability, and Servicability. IBM Whitepaper, 2012. OPERATINGSYSTEMSUPPORTFORREDUNDANTMULTITHREADING 167

Jörg Henkel, Lars Bauer, Nikil Dutt, Puneet Gupta, Sani Nassif, Muhammad Shafique, Mehdi Tahoori, and Norbert Wehn. Reliable On-chip Systems in the Nano-Era: Lessons Learnt and Future Trends. In Annual Design Automation Conference, DAC ’13, pages 99:1–99:10, Austin, Texas, 2013. ACM. John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2003. Jorrit N. Herder. Building a Dependable Operating System: Fault Tolerance in MINIX3. Dissertation, Vrije Universiteit Amsterdam, 2010. Maurice Herlihy. A methodology for implementing highly concurrent data objects. ACM Transactions on Programming Languages and Systems, 15(5):745–770, November 1993. Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers, 2008. Tomas Hruby, Dirk Vogt, Herbert Bos, and Andrew S. Tanenbaum. Keep Net Working - On a Dependable and Fast Networking Stack. In Conference on Dependable Systems and Networks, Boston, MA, June 2012. Mei-Chen Hsueh, Timothy K. Tsai, and Ravishankar K. Iyer. Fault Injection Techniques and Tools. IEEE Computer, 30(4):75–82, Apr 1997. Kuang-Hua Huang and Jacob A. Abraham. Algorithm-Based Fault Tolerance for Matrix Operations. IEEE Transactions on Computers, C-33(6):518–528, 1984. Andy A. Hwang, Ioan A. Stefanovici, and Bianca Schroeder. Cosmic Rays Don’t Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design. In International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 111–122, London, England, UK, 2012. ACM. IBM. PowerPC 750GX lockstep facility. IBM Application Note, 2008. James Jeffers and James Reinders. Intel Xeon Phi Coprocessor High Performance Program- ming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2013. Dawon Kahng. Electrified Field-Controlled Semiconductor Device. US Patent No. 3,102,230, http://www.freepatentsonline.com/3102230.html, 1963. Rüdiger Kapitza, Matthias Schunter, Christian Cachin, Klaus Stengel, and Tobias Distler. Storyboard: Optimistic Deterministic Multithreading. In Workshop on Hot Topics in System Dependability, HotDep’10, pages 1–8, Vancouver, BC, Canada, 2010. USENIX Association. M. Kapritsos, Y. Wang, V. Quema, A. Clement, L. Alvisi, and M. Dahlin. EVE: Execute- Verify Replication for Multi-Core Servers. In Symposium on Opearting Systems Design & Implementation, OSDI’12, Oct 2012. John Keane and Chris H. Kim. An Odomoeter for CPUs. IEEE Spectrum, 48(5):28–33, 2011. Piyus Kedia and Sorav Bansal. Fast Dynamic Binary Translation for the Kernel. In Symposium on Operating Systems Principles, SOSP ’13, pages 101–115, Farminton, Pennsylvania, 2013. ACM. Gabriele Keller, Toby Murray, Sidney Amani, Liam O’Connor, Zilin Chen, Leonid Ryzhyk, Gerwin Klein, and Gernot Heiser. File Systems Deserve Verification Too! In Workshop on Programming Languages and Operating Systems, PLOS ’13, pages 1:1–1:7, Farmington, Pennsylvania, 2013. ACM. Avi Kivity. KVM: The Linux Virtual Machine Monitor. In The Ottawa Linux Symposium, pages 225–230, July 2007. V. B. Kleeberger, C. Gimmler-Dumont, C. Weis, A. Herkersdorf, D. Mueller-Gritschneder, S. R. Nassif, U. Schlichtmann, and N. Wehn. A Cross-Layer Technology-Based Study of how Memory Errors Impact System Resilience. IEEE Micro, 33(4):46–55, 2013. 
Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. seL4: Formal Verification of an OS Kernel. In Symposium on Operating Systems Principles, SOSP’09, pages 207–220, Big Sky, MT, USA, October 2009. ACM. Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1998. Philip Koopman. 32bit Cyclic Redundancy Codes for Internet Applications. In Conference on Dependable Systems and Networks, DSN ’02, pages 459–472, Washington, DC, USA, 2002. IEEE Computer Society. Martin Kriegel. Bounding Error Detection Latencies for Replicated Execution. Bachelor’s thesis, TU Dresden, 2013. Adam Lackorzynski. L4Linux Porting Optimizations. Diploma thesis, TU Dresden, 2004. 168 BJÖRNDÖBEL

Adam Lackorzynski and Alexander Warg. Taming Subsystems: Capabilities as Universal Resource Access Control in L4. In Workshop on Isolation and Integration in Embedded Systems, IIES’09, pages 25–30, Nuremburg, Germany, 2009. ACM. Adam Lackorzynski, Alexander Warg, and Michael Peter. Generic Virtualization with Virtual Processors. In Proceedings of Twelfth Real-Time Linux Workshop, Nairobi, Kenya, October 2010. Leslie Lamport. How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs. IEEE Transactions on Computers, 28(9):690–691, September 1979. Doug Lea. Concurrent Programming In Java. Design Principles and Patterns. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 1999. Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn. Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism. In Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 77–90, Pittsburgh, Pennsylvania, USA, 2010. ACM. Edward A. Lee. The Problem with Threads. Computer, 39(5):33–42, May 2006. L. Leem, Hyungmin Cho, J. Bau, Q.A. Jacobson, and S Mitra. ERSA: Error Resilient System Architecture for Probabilistic Applications. In Design, Automation Test in Europe Conference Exhibition, DATE’10, pages 1560–1565, 2010. Andrew Lenharth, Vikram S. Adve, and Samuel T. King. Recovery Domains: An Organizing Principle for Recoverable Operating Systems. In 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIV, pages 49–60, New York, NY, USA, 2009. ACM. Joshua LeVasseur, Volkmar Uhlig, Jan Stoess, and Stefan Götz. Unmodified Device Driver Reuse and Improved System Dependability via Virtual Machines. In Symposium on Operating Systems Design and Implementation, SOSP’04, San Francisco, CA, December 2004. John R. Levine. Linkers and Loaders. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 1999. David Levinthal. Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 Processors. Technical report, Intep Corp., 2009. https://software.intel.com/sites/ products/collateral/hpc/vtune/performance_analysis_guide.pdf. Dong Li, Zizhong Chen, Panruo Wu, and Jeffrey S. Vetter. Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC’13, pages 44:1–44:12, Denver, Colorado, 2013. ACM. Man-Lap Li, Pradeep Ramachandran, Swarup Kumar Sahoo, Sarita V. Adve, Vikram S. Adve, and Yuanyuan Zhou. Understanding the Propagation of Hard Errors to Software and Implica- tions for Resilient System Design. In International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIII, pages 265–276, Seattle, WA, USA, 2008. ACM. Xin Li, Kai Shen, Michael C. Huang, and Lingkun Chu. A Memory Soft Error Measurement on Production Systems. In USENIX Annual Technical Conference, ATC’07, pages 275–280, June 2007. Jochen Liedtke. Improving IPC by Kernel Design. In ACM Symposium on Operating Systems Principles, SOSP ’93, pages 175–188, Asheville, North Carolina, USA, 1993. ACM. Tongping Liu, Charlie Curtsinger, and Emery D. Berger. Dthreads: Efficient Deterministic Multithreading. In Symposium on Operating Systems Principles, SOSP ’11, pages 327–336, Cascais, Portugal, 2011. ACM. Jork Löser, Lars Reuther, and Hermann Härtig. 
A Streaming Interface for Real-Time In- terprocess Communication. Technical report, TU Dresden, August 2001. URL: http: //os.inf.tu-dresden.de/papers_ps/dsi_tech_report.pdf. M.N. Lovellette, K.S. Wood, D. L. Wood, J.H. Beall, P.P. Shirvani, N. Oh, and E.J. McCluskey. Strategies for Fault-Tolerant, Space-Based Computing: Lessons Learned from the ARGOS Testbed. In Aerospace Conference Proceedings, 2002. IEEE, volume 5, pages 5–2109–5–2119 vol.5, 2002. Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building Customized Program Analysis Tools With Dynamic Instrumentation. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’05, pages 190–200, New York, NY, USA, 2005. ACM. Daniel Lyons. Sun Screen. Forbes Magazine, November 2000, accessed on April 22nd 2013, mirror: http://tudos.org/~doebel/phd/forbes2000sun. OPERATINGSYSTEMSUPPORTFORREDUNDANTMULTITHREADING 169

Henrique Madeira, Raphael R. Some, Francisco Moreira, Diamantino Costa, and David Rennels. Experimental Evaluation of a COTS System for Space Applications. In International Conference on Dependable Systems and Networks, DSN 2002, pages 325–330, 2002. Aamer Mahmood, Dorothy M. Andrews, and Edward J. McClusky. Executable Assertions and Flight Software. Center for Reliable Computing, Computer Systems Laboratory, Dept. of Electrical Engineering and Computer Science, Stanford University, 1984. Jose Maiz, Scott Hareland, Kevin Zhang, and Patrick Armstrong. Characterization of Multi-Bit Soft Error Events in Advanced SRAMs. In IEEE International Electron Devices Meeting, pages 21.4.1–21.4.4, 2003. Steve McConnell. Code Complete: A Practical Handbook of Software Construction. Microsoft Press, Redmond, WA, 2 edition, 2004. Albert Meixner and Daniel J. Sorin. Detouring: Translating Software to Circumvent Hard Faults in Simple Cores. In International Conference on Dependable Systems and Networks (DSN), pages 80–89, 2008. Christian Menard. Improving replication performance and error coverage using instruction and data signatures. Study thesis, TU Dresden, 2014. Timothy Merrifield and Jakob Eriksson. Conversion: Multi-Version Concurrency Control for Main Memory Segments. In European Conference on Computer Systems, EuroSys ’13, pages 127–139, Prague, Czech Republic, 2013. ACM. Microsoft Corp. Symbol Stores and Symbol Servers. Microsoft Developer Network, ac- cessed on July 12th 2014, http://msdn.microsoft.com/library/windows/hardware/ ff558840(v=vs.85).aspx. Konrad Miller, Fabian Franz, Marc Rittinghaus, Marius Hillenbrand, and Frank Bellosa. XLH: More Effective Memory Deduplication Scanners Through Cross-Layer Hints. In USENIX Annual Technical Conference, USENIX ATC’13, San Jose, CA, USA, 2013. Mark Miller, Ka-Ping Yee, Jonathan Shapiro, and Combex Inc. Capability Myths Demolished. Technical report, Johns Hopkins University, 2003. Miguel Miranda. When Every Atom Counts. IEEE Spectrum, 49(7):32–32, 2012. Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1 edition, 1997. Gordon E. Moore. Cramming More Components Onto Integrated Circuits. Electronics, 38(8), 1965. Boris Motruk, Jonas Diemer, Rainer Buchty, Rolf Ernst, and Mladen Berekovic. IDAMC: A Many-Core Platform with Run-Time Monitoring for Mixed-Criticality. In International Symposium on High-Assurance Systems Engineering, HASE’12, pages 24–31, Oct 2012. Shubhendu Mukherjee. Architecture Design for Soft Errors. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008. Shubhendu S. Mukherjee, Joel Emer, Tryggve Fossum, and Steven K. Reinhardt. Cache Scrubbing in Microprocessors: Myth or Necessity? In Pacific Rim International Symposium on Dependable Computing, PRDC ’04, pages 37–42, Washington, DC, USA, 2004. IEEE Computer Society. Shubhendu S. Mukherjee, Christopher Weaver, Joel Emer, Steven K. Reinhardt, and Todd Austin. A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor. In International Symposium on Microarchitecture, MICRO 36, Washington, DC, USA, 2003. IEEE Computer Society. Robert Muschner. Resource Optimization for Replicated Applications. Diploma thesis, TU Dresden, 2013. Hamid Mushtaq, Zaid Al-Ars, and Koen L. M. Bertels. Efficient Software Based Fault Toler- ance Approach on Multicore Platforms. In Design, Automation & Test in Europe Conference, Grenoble, France, March 2013. Nicholas Nethercote and Julian Seward. 
Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation. In ACM SIGPLAN conference on Programming Language Design and Implementation, PLDI ’07, pages 89–100, New York, NY, USA, 2007. ACM. Adrian Nistor, Darko Marinov, and Josep Torrellas. Light64: lightweight hardware support for data race detection during systematic testing of parallel programs. In International Symposium on Microarchitecture, MICRO 42, pages 541–552, New York, NY, USA, 2009. ACM. Nvidia Corp. Kepler: The World’s Fastest, Most Efficient HPC Architecture. http:// www.nvidia.com/object/nvidia-kepler.html, accessed August 1st, 2014, 2014. Namsuk Oh, Philip P. Shirvani, and Edward J. McCluskey. Control-Flow Checking by Software Signatures. IEEE Transactions on Reliability, 51(1):111 –122, March 2002. 170 BJÖRNDÖBEL

Namsuk Oh, Philip P. Shirvani, and Edward J. McCluskey. Error Detection by Duplicated Instructions in Super-Scalar Processors. IEEE Transactions on Reliability, 51(1):63–75, 2002. Marek Olszewski, Jason Ansel, and Saman Amarasinghe. Kendo: Efficient Deterministic Mul- tithreading in Software. In Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIV, pages 97–108, Washington, DC, USA, 2009. ACM. Krishna V. Palem, Lakshmi N.B. Chakrapani, Zvi M. Kedem, Avinash Lingamneni, and Kirthi Krishna Muntimadugu. Sustaining Moore’s Law in Embedded Computing Through Probabilistic and Approximate Design: Retrospects and Prospects. In International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES ’09, pages 1–10, Grenoble, France, 2009. ACM. Nicolas Palix, Gaël Thomas, Suman Saha, Christophe Calvès, Julia Lawall, and Gilles Muller. Faults in Linux: Ten Years Later. In International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’11, pages 305–318, Newport Beach, California, USA, 2011. ACM. Florian Pester. ELK Herder: Replicating Linux Processes with Virtual Machines. Diploma thesis, TU Dresden, 2014. Gerald J. Popek and Robert P. Goldberg. Formal Requirements for Virtualizable Third Genera- tion Architectures. Communications of the ACM, 17(7):412–421, July 1974. J. Postel. Transmission Control Protocol. RFC 793 (Standard), September 1981. Updated by RFCs 1122, 3168, 6093. Eberle A. Rambo, Alexander Tschiene, Jonas Diemer, Leonie Ahrendts, and Rolf Ernst. Failure Analysis of a Network-on-chip for Real-Time Mixed-Critical Systems. In Design, Automation Test in Europe Conference Exhibition, DATE’14, 2014. Brian Randell, Peter A. Lee, and Philip C. Treleaven. Reliability Issues in Computing System Design. ACM Computing Surveys, 10(2):123–165, June 1978. Layali Rashid, Karthik Pattabiraman, and Sathish Gopalakrishnan. Towards Understanding The Effects Of Intermittent Hardware Faults on Programs. In Workshops on Dependable Systems and Networks, pages 101–106, June 2010. David Ratter. FPGAs on Mars. Xilinx Xcell Journal, 2004. Semeen Rehman, Muhammad Shafique, and Jörg Henkel. Instruction Scheduling for Reliability- Aware Compilation. In Annual Design Automation Conference, DAC ’12, pages 1292–1300, San Francisco, California, 2012. ACM. Semeen Rehman, Muhammad Shafique, Florian Kriebel, and Jörg Henkel. Reliable Software for Unreliable Hardware: Embedded Code Generation Aiming at Reliability. In International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS ’11, pages 237–246, Taipei, Taiwan, 2011. ACM. Dave Reid, Campbell Millar, Gareth Roy, Scott Roy, and Asen Asenov. Analysis of Threshold Voltage Distribution Due to Random Dopants: A 100,000-Sample 3-D Simulation Study. IEEE Transactions on Electron Devices, 56(10):2255–2263, 2009. Steven K. Reinhardt and Shubhendu S. Mukherjee. Transient Fault Detection via Simultaneous Multithreading. SIGARCH Comput. Archit. News, 28:25–36, May 2000. George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, and David I. August. SWIFT: Software Implemented Fault Tolerance. In International Symposium on Code Generation and Optimization, CGO ’05, pages 243–254, 2005. John Rhea. BAE Systems Moves Into Third Generation RAD-hard Processors. Military & Aerospace Electronics, 2002, accessed on April 22nd 2013, mirror: http://tudos.org/ ~doebel/phd/bae2002/. Kaushik Roy, Saibal Mukhopadhyay, and Hamid Mahmoodi-Meimand. 
Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits. Proceedings of the IEEE, 91(2):305–327, 2003. Leonid Ryzhyk, Peter Chubb, Ihor Kuz, and Gernot Heiser. Dingo: Taming Device Drivers. In ACM European Conference on Computer Systems, EuroSys ’09, pages 275–288, Nuremberg, Germany, 2009. ACM. Leonid Ryzhyk, Peter Chubb, Ihor Kuz, Etienne Le Sueur, and Gernot Heiser. Automatic Device Driver Synthesis with Termite. In Symposium on Operating Systems Principles, SOSP ’09, pages 73–86, Big Sky, Montana, USA, 2009. ACM. Giacinto P. Saggese, Nicholas J. Wang, Zbigniew T. Kalbarczyk, Sanjay J. Patel, and Rav- ishankar K. Iyer. An Experimental Study of Soft Errors in Microprocessors. IEEE Micro, 25:30–39, November 2005. Richard T. Saunders. A Study in Memcmp. Python Developer List, 2011. OPERATINGSYSTEMSUPPORTFORREDUNDANTMULTITHREADING 171

Ute Schiffel, André Schmitt, Martin Süßkraut, and Christof Fetzer. ANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software. In International Conference on Computer Safety, Reliability and Security, Safecomp’10, Vienna, Austria, 2010.
Ute Schiffel, André Schmitt, Martin Süßkraut, and Christof Fetzer. Software-Implemented Hardware Error Detection: Costs and Gains. In Third International Conference on Dependability, DEPEND’10, pages 51–57, 2010.
Horst Schirmeier, Martin Hoffmann, Rüdiger Kapitza, Daniel Lohmann, and Olaf Spinczyk. FAIL*: Towards a Versatile Fault-Injection Experiment Framework. In Gero Mühl, Jan Richling, and Andreas Herkersdorf, editors, International Conference on Architecture of Computing Systems, volume 200 of ARCS’12, pages 201–210. German Society of Informatics, March 2012.
Richard D. Schlichting and Fred B. Schneider. Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems. ACM Transactions on Computer Systems, 1:222–238, 1983.
Fred B. Schneider. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys, 22(4):299–319, December 1990.
Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. DRAM Errors in the Wild: A Large-Scale Field Study. In International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS’09, 2009.
Konstantin Serebryany and Timur Iskhodzhanov. ThreadSanitizer: Data Race Detection in Practice. In Workshop on Binary Instrumentation and Applications, WBIA’09, pages 62–71, New York, NY, USA, 2009. ACM.
A. Shye, J. Blomstedt, T. Moseley, V. J. Reddi, and D. A. Connors. PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures. IEEE Transactions on Dependable and Secure Computing, 6(2):135–148, 2009.
Lenin Singaravelu, Calton Pu, Hermann Härtig, and Christian Helmuth. Reducing TCB Complexity for Security-Sensitive Applications: Three Case Studies. In European Conference on Computer Systems, EuroSys’06, pages 161–174, 2006.
Timothy J. Slegel, Robert M. Averill III, Mark A. Check, Bruce C. Giamei, Barry W. Krumm, Christopher A. Krygowski, Wen H. Li, John S. Liptay, John D. MacDougall, Thomas J. McPherson, Jennifer A. Navarro, Eric M. Schwarz, Kevin Shum, and Charles F. Webb. IBM’s S/390 G5 Microprocessor Design. IEEE Micro, 19(2):12–23, 1999.
Jared C. Smolens, Brian T. Gold, Jangwoo Kim, Babak Falsafi, James C. Hoe, and Andreas G. Nowatzyk. Fingerprinting: Bounding Soft-Error Detection Latency and Bandwidth. In Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XI, pages 224–234, Boston, MA, USA, 2004.
Livio Soares and Michael Stumm. FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. In Conference on Operating Systems Design and Implementation, OSDI’10, pages 1–8, 2010.
Vilas Sridharan and David R. Kaeli. Quantifying Software Vulnerability. In Workshop on Radiation Effects and Fault Tolerance in Nanometer Technologies, WREFT’08, pages 323–328, Ischia, Italy, 2008. ACM.
Vilas Sridharan and Dean Liberty. A Study of DRAM Failures in the Field. In International Conference on High Performance Computing, Networking, Storage and Analysis, SC’12, pages 76:1–76:11, Salt Lake City, Utah, 2012. IEEE Computer Society Press.
Julian Stecklina. Shrinking the Hypervisor One Subsystem at a Time: A Userspace Packet Switch for Virtual Machines. In Conference on Virtual Execution Environments, VEE’14, pages 189–200, 2014.
Luca Sterpone and Massimo Violante. An Analysis of SEU Effects in Embedded Operating Systems for Real-Time Applications. In International Symposium on Industrial Electronics, pages 3345–3349, June 2007.
Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, and Henry M. Levy. Recovering Device Drivers. ACM Transactions on Computer Systems, 24(4):333–360, November 2006.
Yuan Taur. The Incredible Shrinking Transistor. IEEE Spectrum, 36(7):25–29, 1999.
The IEEE and The Open Group. POSIX Thread Extensions 1003.1c-1995. http://pubs.opengroup.org, 2013.
The IEEE and The Open Group. The Open Group Base Specifications – Issue 7. http://pubs.opengroup.org, 2013.
Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In International Symposium on Computer Architecture, pages 392–403, 1995.

Andrew M. Tyrrell. Recovery Blocks and Algorithm-Based Fault Tolerance. In EUROMICRO 96. Beyond 2000: Hardware and Software Design Strategies, pages 292–299, 1996.
Martin Unzner. Implementation of a Fault Injection Framework for L4Re. Student research project (Belegarbeit), TU Dresden, 2013.
Rob F. van der Wijngaart, Timothy G. Mattson, and Werner Haas. Light-Weight Communications on Intel’s Single-Chip Cloud Computer Processor. SIGOPS Operating Systems Review, 45(1):73–83, February 2011.
Dirk Vogt, Björn Döbel, and Adam Lackorzynski. Stay Strong, Stay Safe: Enhancing Reliability of a Secure Operating System. In Workshop on Isolation and Integration for Dependable Systems, IIDS’10, Paris, France, 2010. ACM.
Neal H. Walfield. Viengoos: A Framework For Stakeholder-Directed Resource Allocation. Technical report, 2009.
Cheng Wang, Ho-seop Kim, Youfeng Wu, and Victor Ying. Compiler-Managed Software-Based Redundant Multithreading for Transient Fault Detection. In International Symposium on Code Generation and Optimization, CGO ’07, pages 244–258, 2007.
Nicholas Wang, Michael Fertig, and Sanjay Patel. Y-Branches: When You Come to a Fork in the Road, Take It. In International Conference on Parallel Architectures and Compilation Techniques, PACT ’03, pages 56–, Washington, DC, USA, 2003. IEEE Computer Society.
Lucas Wanner, Charwak Apte, Rahul Balani, Puneet Gupta, and Mani Srivastava. Hardware Variability-Aware Duty Cycling for Embedded Sensors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21(6):1000–1012, 2013.
Carsten Weinhold and Hermann Härtig. jVPFS: Adding Robustness to a Secure Stacked File System with Untrusted Local Storage Components. In USENIX Annual Technical Conference, ATC’11, pages 32–32, Portland, OR, 2011. USENIX Association.
Reinhard Wilhelm, Jakob Engblom, Andreas Ermedahl, Niklas Holsti, Stephan Thesing, David Whalley, Guillem Bernat, Christian Ferdinand, Reinhold Heckmann, Tulika Mitra, Frank Mueller, Isabelle Puaut, Peter Puschner, Jan Staschulat, and Per Stenström. The Worst-Case Execution-Time Problem – Overview of Methods and Survey of Tools. ACM Transactions on Embedded Computing Systems, 7(3):36:1–36:53, May 2008.
Kent D. Wilken and John Paul Shen. Continuous Signature Monitoring: Low-Cost Concurrent Detection of Processor Control Errors. IEEE Transactions on CAD of Integrated Circuits and Systems, 9(6):629–641, 1990.
Waisum Wong, Ali Icel, and J. J. Liou. A Model for MOS Failure Prediction due to Hot-Carriers Injection. In Electron Devices Meeting, 1996, IEEE Hong Kong, pages 72–76, 1996.
Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. SIGARCH Computer Architecture News, 23(2):24–36, May 1995.
G. Yalcin, O. S. Unsal, A. Cristal, and M. Valero. FIMSIM: A Fault Injection Infrastructure for Microarchitectural Simulators. In International Conference on Computer Design, ICCD’11, 2011.
Ying-Chin Yeh. Triple-Triple Redundant 777 Primary Flight Computer. In Aerospace Applications Conference, volume 1, pages 293–307, 1996.
Takeshi Yoshimura, Hiroshi Yamada, and Kenji Kono. Is Linux Kernel Oops Useful or Not? In Workshop on Hot Topics in System Dependability, HotDep’12, pages 2–2, Hollywood, CA, 2012. USENIX Association.
Yun Zhang, Soumyadeep Ghosh, Jialu Huang, Jae W. Lee, Scott A. Mahlke, and David I. August. Runtime Asynchronous Fault Tolerance via Speculation. In International Symposium on Code Generation and Optimization, CGO ’12, pages 145–154, 2012.
Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August. DAFT: Decoupled Acyclic Fault Tolerance. In International Conference on Parallel Architectures and Compilation Techniques, PACT ’10, pages 87–98, Vienna, Austria, 2010. ACM.
James F. Ziegler and William A. Lanford. Effect of Cosmic Rays on Computer Memories. Science, 206(4420):776–788, 1979.
James F. Ziegler, Huntington W. Curtis, et al. IBM Experiments in Soft Fails in Computer Electronics (1978–1994). IBM Journal of Research and Development, 40(1):3–18, 1996.