Improving the Dependability of Distributed Systems Through AIR Software Upgrades
Carnegie Mellon University
Carnegie Institute of Technology

Thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Title: Improving the Dependability of Distributed Systems through AIR Software Upgrades
Presented by: Tudor A. Dumitraș
Accepted by the Department of Electrical & Computer Engineering

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical & Computer Engineering

Tudor A. Dumitraș
B.S., Computer Science, “Politehnica” University, Bucharest, Romania
Diplôme d’Ingénieur, Ecole Polytechnique, Paris, France
M.S., Electrical & Computer Engineering, Carnegie Mellon University

Carnegie Mellon University
Pittsburgh, PA
December, 2010

To my parents and my teachers, who showed me the way.
To my friends, who gave me a place to stand on.
For Tanti Lola.

Abstract

Traditional fault-tolerance mechanisms concentrate almost entirely on responding to, avoiding, or tolerating unexpected faults or security violations. However, scheduled events, such as software upgrades, account for most of the system unavailability and often introduce data corruption or latent errors. Through two empirical studies, this dissertation identifies the leading causes of upgrade failure (breaking hidden dependencies) and of planned downtime (complex data conversions) in distributed enterprise systems. These findings represent the foundation of a new benchmark for software-upgrade dependability.

This dissertation further introduces the AIR properties (Atomicity, Isolation and Runtime-testing) required for improving the dependability of distributed systems that undergo major software upgrades. The AIR properties are realized in Imago, a system designed to reduce both planned and unplanned downtime by upgrading distributed systems end-to-end. Imago builds upon the idea of isolating the production system from the upgrade operations, in order to avoid breaking hidden dependencies and to decouple the data conversions from the normal system operation. Imago includes novel mechanisms, such as providing a parallel universe for the new version, performing data conversions opportunistically, intercepting the live workload at the ingress and egress points, and executing an atomic switchover to the new version, which allow it to deliver the AIR properties.

Imago harnesses opportunities provided by the emerging cloud-computing technologies, by trading resource overhead (needed by the parallel universe) for an improved dependability of the software upgrades. This approach separates the functional aspects of the upgrade from the mechanisms for online upgrade, enabling an upgrade-as-a-service model. This dissertation also describes techniques for assessing the impact of software upgrades, in order to reason about the implications of relaxing the AIR guarantees.

This is for the one that I ate, 32 years ago.

Acknowledgments

Since I started to learn about Computer Science, I have been fascinated by our ability to write computer programs that can protect themselves from various adverse conditions in their environments. I equally enjoy building dependable systems firsthand and empirically studying the behavior of systems large and small. This is the result of my interactions with a few great mentors, who helped me discover the secrets of Computer Science and who showed me the ways of scientific research.

At age 13, I received a great gift from my uncle, Florin Covaciu.
It was an HC-85 computer, a Romanian replica of the Sinclair ZX Spectrum (one of the world’s first personal computers). I learned to program on this machine, and, since then, I have not been able to stay away from programming for too long. This path took me to the Computer-Science High School in Bucharest (Liceul de Informatică, now Colegiul Național Tudor Vianu), where teachers Mihai Budiu and Raluca Vasilescu opened my eyes to many computing concepts and to their power to transform human society. These teachers shared their great knowledge and passion for computing, and this inspired me to continue my studies in this field. I was not too surprised when I met Mihai and Raluca again, years later, in the graduate program at Carnegie Mellon University.

Among my professors at the “Politehnica” University in Bucharest were Mircea Petrescu, who had supervised my father’s Honors thesis (diploma de licență) twenty-five years previously, Adrian Petrescu and Francisc Iacob, the creators of the HC-85, and Nicolae Țăpuș, who supervised my own Honors thesis on Argo, a search engine with a distributed Web crawler. Most of all, I am grateful to Zoea Racoviță for encouraging me to continue my graduate studies and to apply to the Ph.D. program at Carnegie Mellon University.

My education at the Ecole Polytechnique in Paris taught me to be responsible, confident and determined, and to keep my commitments. I have to thank Jean-Marc Steyaert for guiding me through those years. Atom, a final-year project supervised by Sam Toueg, focused on implementing a practical group-communication system based on unreliable failure detectors (an exclusively theoretical technique, at the time) and seeded my curiosity about distributed systems.

When I arrived at Carnegie Mellon, Radu Mărculescu taught me the paramount importance of originality in all academic endeavors and the value of anticipating future technology trends for systems-oriented research.
Our first paper together, which identified the need for system-level fault tolerance in the emerging networks-on-chip (NoC) and reevaluated fundamental trade-offs made by classical networking protocols, remains my most cited publication. Phil Koopman taught me everything I know about dependability, and he always gave me good advice throughout my graduate studies.

Many people contributed to the approach described in this dissertation. The members of my thesis committee, Greg Ganger, Bruce Maggs and Asit Dan, encouraged me to pursue the topic of online software upgrades and provided invaluable feedback at every stage of this research. Dan Siewiorek taught me how to create a rigorous taxonomy. Jiaqi Tan and Zhengheng Gho helped me design some of the core algorithms of Imago. At my request, Lorenzo Keller wrote a user manual for ConfErr, his fault-injection tool for mutating configuration files. A dinner-time discussion with Eli Tilevich turned into a publication about the risks of software upgrades across multiple administrative domains. Douglas Schmidt showed me how to present my research in an effective way. Jean-Charles Fabre has always been a great mentor and a true friend. I will never forget that I owe my first job to your unreserved support, Jean-Charles.

Daniela Roșu, Bich Le and Alan Downing (my internship supervisors at IBM Research, VMware and Oracle, respectively) provided me with a wealth of information about the practical challenges of performing software upgrades. My approach also incorporates extensive feedback from the industry members of the Parallel Data Lab consortium. During my dissertation research, I received financial support from the NSF CAREER Award CCR-0238381, the DARPA PCES contract F33615-03-C-4110, as well as Carnegie Mellon’s CyLab and Parallel Data Lab.

While not directly involved in this research, my collaborators Danny Dig and Iulian Neamtiu helped me establish the series of workshops on Hot Topics in Software Upgrades (HotSWUp). The first two editions of the workshop provided a venue for insightful discussions about how upgrades are performed at various levels: in the front-end of cloud computing infrastructures, in EJB-based enterprise applications, in databases, in long-running servers, in middleware frameworks or in satellites orbiting the Earth. Most importantly, HotSWUp emphasized that the challenge of upgrading distributed systems end-to-end calls for an inter-disciplinary approach, combining ideas and techniques from several areas of Computer Science.

Above all, I would like to thank my dissertation advisor, Prof. Priya Narasimhan, for all her guidance. She taught me the secrets of fault-tolerant middleware, and she showed me how to enhance legacy applications with replication and recovery mechanisms by transparently intercepting the application’s system calls. She shared with me her great gift for writing and presenting technical material, she taught me the scientific method and she showed me how to ask the questions that matter. She passed down the teachings of Gottfried Wilhelm Leibniz, our academic ancestor, about how to turn an abundance of experimental data into rigorous scientific findings. She also passed down her aversion to adjectives and hyperbole. She transformed a student into a researcher.

Outside of Computer Science, my good friends (too many to list here) have always been a source of inspiration and vitality. My aunt, Ștefania Dumitraș, taught me about the practical things in life, and she also taught me French. My cousin, Ioana Căprar, has been as close as a sister. My parents, Dan Dumitraș and Monica Dumitraș, are my ultimate role models. Working at the Institute of Atomic Physics (where the