Applying Microreboot to System Software Michael Le and Yuval Tamir Concurrent Systems Laboratory UCLA Computer Science Department {Mvle,Tamir}@Cs.Ucla.Edu
Total Page:16
File Type:pdf, Size:1020Kb
IEEE International Conference on Software Security and Reliability Washington, D.C., June 2012. Applying Microreboot to System Software Michael Le and Yuval Tamir Concurrent Systems Laboratory UCLA Computer Science Department {mvle,tamir}@cs.ucla.edu Abstract—Av ailability is increased with recovery based on ev entually continue operating normally.Finally, component microreboot instead of whole system reboot. There microreboot is simpler if a component with greater privilege areunique challenges that must be overcome in order to apply can manage the microreboot process. With LLSS there may microreboot to low-levelsystem software. These challenges not be anysoftware component in the system with greater arise from the need to interact with immutable hardware privilege. Hence, the LLSS must somehowmicroreboot components on one hand and, on the other hand, with a wide itself while in a potentially corrupted state. variety of higher levelworkloads whose characteristics may be unknown. As an example, we describe our experience with Ke y contributions of this paper include identifying applying microreboot to system-levelvirtualization software. unique challenges to applying microreboot to low-level Specifically,implementing microreboot for all the components system software and presenting general approaches to of the widely-used Xen virtualization infrastructure. We addressing these challenges. This is based on our identify the unique difficulties with applying microreboot for experience using these approaches with complexlow-level such low-levelsoftwareand present our solutions. We present software — software that provides system-level measures of the complexity of different classes of solutions and virtualization. Specifically,wehav e applied microreboot to experimental results, based on extensive fault injection, showing the effectiveness of the solutions. all the components of the Xen [1] virtualization software. As explained in Section 4, this software consists of three − Keywords Recovery; Virtualization; Fault injection components: the privileged virtual machine (PrivVM), the 1. Introduction device drivervirtual machine (DVM), and the virtual machine monitor (VMM). We denote all three of these Microreboot enables fast restoration of a valid state of a components as the virtualization infrastructure (VI). failed process or system by rebooting the specific failed component instead of the entire system [5]. Low-level We describe howwehav e employed the approaches to system software (LLSS) manages and controls hardware using microreboot with LLSS with all three components of resources and mediates access to hardware resources by the VI. In each case, we explain the specific difficulties and higher software layers. While in previous work microreboot present experimental results that showthe effectiveness of has been used mostly for user-levelprocesses, the technique the techniques. While the use of microreboot to recover is also applicable to system software. However, as from failures of twoofthe VI components has been explained below, due to key characteristics of system presented in prior work [13, 10, 14, 15], recovery from software, using microreboot to recoverfrom failures in such failures of the PrivVM is presented here for the first time. software is particularly challenging. Hence, the description and evaluation of the specific technique used for PrivVM recovery,based on microreboot, In general, it is difficult to apply microreboot to is another key contribution of this work. software with tightly-coupled components and non-modular organization. Unfortunately,LLSS tends to possess these The next section is an overviewofmicroreboot. characteristics. LLSS has additional properties that makes it Section 3 describes the challenges with the use of less amenable than application software to the microreboot microreboot for LLSS and general approaches to addressing technique. One such property is that LLSS interacts directly these challenges. The Xen VI is described in Section 4. with hardware resources. The hardware cannot be modified Section 5 explains howwehav e employed microreboot for and the interfaces to the hardware may not provide well- recovery from failures of each one of the Xen VI defined semantics in the face of an LLSS component reboot. components. The experimental setup used to evaluate the Hardware resources may be shared by multiple LLSS recovery mechanism is described in Section 6. The results components, potentially coupling these components to each of the evaluation are in Section 7. Section 8provides other,making it difficult to microreboot a single component. measures of the implementation complexity of our recovery schemes and related work is presented in Section 9. Asecond property that complicates the use of microreboot for LLSS is that the software that runs on top of 2. Microreboot Overview the LLSS (application software) is typically not under the When system components fail, a simple way to recoveristo control of the LLSS developer.This makes it difficult to reboot the entire system. This is slowand reduces system microreboot LLSS components while allowing the availability.Microreboot is a recovery technique that application software, that interacts with the LLSS, to -2- reduces the down time caused by failure of some accommodate LLSS microreboot. Hence, microreboot for components in the system by recovering (rebooting) only LLSS components may require implementing work-arounds the failed components in the system while allowing other to deal with specific problematic properties of specific functioning components to continue operating normally. hardware components. This may involveidentifying ways As explained in [5], there are three main design goals to perform partial resets of the hardware, evenifthat may for microrebootable software: (1) fast and correct not always be sufficient [13]. It may lead to special component recovery,(2) localized recovery to minimize operations performed during microreboot which, while not impact on other parts of the system, and (3) fast and correct an actual hardware reset, can eliminate specific known reintegration of recovered components. To achieve these ‘‘problem states’’inthe hardware. An example of this is to goals, software components need to be loosely coupled, acknowledge anypending interrupts, waiting for a response have states that are kept in dedicated state store separate from the rebooted LLSS component, that would otherwise from program logic, and be able to hide component failures block future interrupts. Finally,insome cases, the by providing mechanisms for retrying failed requests. To considerations above may force a recovery technique where decrease recovery latency, the software components should system functionality is restored before the microreboot of be fine-grained to minimize the restart and reinitialization the failed LLSS component and/or the reset of some time. Lastly,tofacilitate cleaning up after microreboots and particular hardware component. This can be done by fail- prevent resource leaks, system resources should be leased overmechanisms that use redundant software and hardware and requests should carry time-to-live values. resources while the failed components are restored. Workload transparency: Typically,manydifferent 3. Microrebooting Low-levelSystem Software applications and user-levelsubsystems (workloads) run on There are unique challenges to employing microrebooting top of the LLSS and interact with LLSS components. The with low-levelsystem software. In the rest of this section, software that runs on top of the LLSS may not be known at these challenges are described and general approaches to the time the LLSS is developed and/or may not be under the addressing these challenges are presented. control of the LLSS developer.Hence, ideally,the Immutable shared hardware: LLSS interacts directly microreboot of LLSS components should be transparent to with hardware that may not provide interfaces with the the layers above it. properties required for ‘‘clean’’microreboot and may cause Akey challenge to meeting the transparencygoal undesirable coupling among different LLSS components described above isthe possibility of requests from the that interact with the hardware. The coupling among LLSS workload that are in progress in the LLSS when an LLSS components may force multiple components to be component fails. Since the workload is typically not microrebooted together,thus violating the key goal of the designed to interact with components that may be rebooted, microreboot technique and increasing the recovery latency it cannot be expected to retry such in-progress requests once of the system. the microreboot is performed. Hence, an LLSS recovery An LLSS component failure can lead to corruption of mechanism that employs microreboot must have some way hardware state by providing the hardware with faulty inputs to log, in a safe location, information that allows in progress or failing to respond properly to inputs from the hardware. requests to be retried and completed following microreboot. The only way to restore the hardware component (e.g., a Ideally,this is done strictly on the LLSS side and must be device controller,aninterrupt controller,orabus) to a sane considered part of implementing the microreboot capability. state may be to reset it. The hardware component that needs ‘‘Last’’softwarelayer: It is typically simplest for to be reset may be coupled to other hardware components microreboot of a component to be performed by