Reducing Downtime Due to System Maintenance and Upgrades Shaya Potter and Jason Nieh – Columbia University
Total Page:16
File Type:pdf, Size:1020Kb
Reducing Downtime Due to System Maintenance and Upgrades Shaya Potter and Jason Nieh – Columbia University ABSTRACT Patching, upgrading, and maintaining operating system software is a growing management complexity problem that can result in unacceptable system downtime. We introduce AutoPod, a system that enables unscheduled operating system updates while preserving application service availability. AutoPod provides a group of processes and associated users with an isolated machine- independent virtualized environment that is decoupled from the underlying operating system instance. This virtualized environment is integrated with a novel checkpoint-restart mechanism which allows processes to be suspended, resumed, and migrated across operating system kernel versions with different security and maintenance patches. AutoPod incorporates a system status service to determine when operating system patches need to be applied to the current host, then automatically migrates application services to another host to preserve their availability while the current host is updated and rebooted. We have implemented AutoPod on Linux without requiring any application or operating system kernel changes. Our measurements on real world desktop and server applications demonstrate that AutoPod imposes little overhead and provides sub-second suspend and resume times that can be an order of magnitude faster than starting applications after a system reboot. AutoPod enables systems to autonomically stay updated with relevant maintenance and security patches, while ensuring no loss of data and minimizing service disruption. Introduction a system administrator chooses to fix an operating sys- tem security problem immediately, he risks upsetting As computers become more ubiquitous in large corporate, government, and academic organizations, his users because of loss of data. Therefore, a system the total cost of owning and maintaining them is administrator must schedule downtime in advance and becoming unmanageable. Computers are increasingly in cooperation with users, leaving the computer vul- networked, which only complicates the management nerable until repaired. If the operating system is problem, given the myriad of viruses and other attacks patched successfully, the system downtime may be commonplace in today’s networks. Security problems limited to just a few minutes during the reboot. Even can wreak havoc on an organization’s computing in- then, users are forced to incur additional inconve- frastructure. To prevent this, software vendors fre- nience and delays in starting applications again and quently release patches that can be applied to address attempting to restore their sessions to the state they security and maintenance issues that have been dis- were in before being shutdown. If the patch is not suc- covered. This creates a management nightmare for cessful, downtime can extend for many hours while administrators who take care of large sets of machines. the problem is diagnosed and a solution is found. For these patches to be effective, they need to be Downtime due to security and maintenance problems applied to the machines. It is not uncommon for sys- is not only inconvenient but costly as well. tems to continue running unpatched software long We present AutoPod, a system that provides an after a security exploit has become well-known [22]. easy-to-use autonomic infrastructure [11] for operat- This is especially true of the growing number of server ing system self-maintenance. AutoPod uniquely appliances intended for very low-maintenance opera- enables unscheduled operating system updates of tion by less skilled users. Furthermore, by reverse commodity operating systems while preserving appli- engineering security patches, exploits are being cation service availability during system maintenance. released as soon as a month after the fix is released, AutoPod provides its functionality without modifying, whereas just a couple of years ago, such exploits took recompiling, or relinking applications or operating closer to a year to create [12]. system kernels. This is accomplished by combining Even when software updates are applied to three key mechanisms: a lightweight virtual machine address security and maintenance issues, they com- isolation abstraction that can be used at the granularity monly result in system services being unavailable. of individual applications, a checkpoint-restart mecha- Patching an operating system can result in the entire nism that operates across operating system versions system having to be down for some period of time. If with different security and maintenance patches, and 19th Large Installation System Administration Conference (LISA ’05) 47 Reducing Downtime Due to System Maintenance and Upgrades Potter and Nieh an autonomic system status service that monitors the machine is maintained and rebooted, further minimiz- system for system faults as well as security updates. ing application service downtime. This enables secu- AutoPod provides a lightweight virtual machine rity patches to be applied to operating systems in a abstraction called a POD (PrOcess Domain) that timely manner with minimal impact on the availability encapsulates a group of processes and associated users of application services. Once the original machine has in an isolated machine-independent virtualized envi- been updated, applications can be returned and can ronment that is decoupled from the underlying operat- continue to execute even though the underlying oper- ing system instance. A pod mirrors the underlying ating system has changed. Similarly, if the service operating system environment but isolates processes detects an imminent system fault, AutoPod can check- from the system by using host-independent virtual point the processes, migrate, and restart them on a new identifiers for operating system resources. Pod isola- machine before the fault can cause the processes’ exe- tion not only protects the underlying system from cution to fail. compromised applications, but is crucial for enabling We have implemented AutoPod in a prototype applications to migrate across operating system system as a loadable Linux kernel module. We have instances. Unlike hardware virtualization approaches used this prototype to securely isolate and migrate a that require running multiple operating system wide range of unmodified legacy and network applica- instances [4, 29, 30], pods provide virtual application tions. We measure the performance and demonstrate execution environments within a single operating sys- the utility of AutoPod across multiple systems running tem instance. By operating within a single operating different Linux 2.4 kernel versions using three real- system instance, pods can support finer granularity world application scenarios, including a full KDE desk- isolation and can be administered using standard oper- top environment with a suite of desktop applications, ating system utilities without sacrificing system man- an Apache/MySQL web server and database server ageability. Furthermore, since it does not run an oper- environment, and a Exim/Procmail e-mail processing ating system instance, a pod prevents potentially mali- environment. Our performance results show that Auto- cious code from making use of an entire set of operat- Pod can provide secure isolation and migration func- ing system resources. tionality on real world applications with low overhead. AutoPod combines its pod virtualization with a This paper describes how AutoPod can enable novel checkpoint-restart mechanism that uniquely operating system self-maintenance by suspending, decouples processes from dependencies on the under- resuming, and migrating applications across operating lying system and maintains process state semantics to system kernel changes to facilitate kernel maintenance enable processes to be migrated across different and security updates with minimal application down- machines. The checkpoint-restart mechanism intro- time. Subsequent sections describe the AutoPod virtu- duces a platform-independent intermediate format for alization abstractions, present the virtualization archi- saving the state associated with processes and Auto- tecture to support the AutoPod model, discuss the Pod virtualization. AutoPod combines this format with AutoPod checkpoint-restart mechanisms used to facili- the use of higher-level functions for saving and restor- tate migration across operating system kernels that ing process state to provide a high degree of portabil- may differ in maintenance and security updates, pro- ity for process migration across different operating vide a brief overview of the AutoPod system status system versions that was not possible with previous service, provide a security analysis of the AutoPod approaches. In particular, the checkpoint-restart mech- system as well as examples of how to use AutoPod, anism relies on the same kind of operating system and present experimental results evaluating the over- semantics that ensure that applications can function head associated with AutoPod virtualization and quan- correctly across operating system versions with differ- tifying the performance benefits of AutoPod migration ent security and maintenance patches. versus a traditional maintenance approach for several AutoPod combines the pod virtual machine with application scenarios. We discuss related work before an autonomous system status service. The service some concluding remarks. monitors the system for system faults as well as secu- AutoPod