Self-Management for Large-Scale Distributed Systems
Total Page:16
File Type:pdf, Size:1020Kb
Self-Management for Large-Scale Distributed Systems AHMAD AL-SHISHTAWY PhD Thesis Stockholm, Sweden 2012 TRITA-ICT/ECS AVH 12:04 KTH School of Information and ISSN 1653-6363 Communication Technology ISRN KTH/ICT/ECS/AVH-12/04-SE SE-164 40 Kista ISBN 978-91-7501-437-1 SWEDEN Akademisk avhandling som med tillstånd av Kungl Tekniska högskolan framlägges till offentlig granskning för avläggande av teknologie doktorsexamen i elektronik och datorsystem onsdagen den 26 september 2012 klockan 14.00 i sal E i Forum IT-Universitetet, Kungl Tekniska högskolan, Isajordsgatan 39, Kista. Swedish Institute of Computer Science SICS Dissertation Series 57 ISRN SICS-D–57–SE ISSN 1101-1335. © Ahmad Al-Shishtawy, September 2012 Tryck: Universitetsservice US AB iii Abstract Autonomic computing aims at making computing systems self-managing by using autonomic managers in order to reduce obstacles caused by manage- ment complexity. This thesis presents results of research on self-management for large-scale distributed systems. This research was motivated by the in- creasing complexity of computing systems and their management. In the first part, we present our platform, called Niche, for program- ming self-managing component-based distributed applications. In our work on Niche, we have faced and addressed the following four challenges in achiev- ing self-management in a dynamic environment characterized by volatile re- sources and high churn: resource discovery, robust and efficient sensing and actuation, management bottleneck, and scale. We present results of our re- search on addressing the above challenges. Niche implements the autonomic computing architecture, proposed by IBM, in a fully decentralized way. Niche supports a network-transparent view of the system architecture simplifying the design of distributed self-management. Niche provides a concise and ex- pressive API for self-management. The implementation of the platform relies on the scalability and robustness of structured overlay networks. We proceed by presenting a methodology for designing the management part of a dis- tributed self-managing application. We define design steps that include par- titioning of management functions and orchestration of multiple autonomic managers. In the second part, we discuss robustness of management and data con- sistency, which are necessary in a distributed system. Dealing with the effect of churn on management increases the complexity of the management logic and thus makes its development time consuming and error prone. We pro- pose the abstraction of Robust Management Elements, which are able to heal themselves under continuous churn. Our approach is based on replicating a management element using finite state machine replication with a reconfig- urable replica set. Our algorithm automates the reconfiguration (migration) of the replica set in order to tolerate continuous churn. For data consistency, we propose a majority-based distributed key-value store supporting multiple consistency levels that is based on a peer-to-peer network. The store enables the tradeoff between high availability and data consistency. Using majority allows avoiding potential drawbacks of a master-based consistency control, namely, a single-point of failure and a potential performance bottleneck. In the third part, we investigate self-management for Cloud-based storage systems with the focus on elasticity control using elements of control theory and machine learning. We have conducted research on a number of different designs of an elasticity controller, including a State-Space feedback controller and a controller that combines feedback and feedforward control. We describe our experience in designing an elasticity controller for a Cloud-based key-value store using state-space model that enables to trade-off performance for cost. We describe the steps in designing an elasticity controller. We continue by presenting the design and evaluation of ElastMan, an elasticity controller for Cloud-based elastic key-value stores that combines feedforward and feedback control. To my family vii Acknowledgements This thesis would not have been possible without the help and support of many people around me, only a proportion of which I have space to acknowledge here. I would like to start by expressing my deep gratitude to my supervisor Associate Professor Vladimir Vlassov for his vision, ideas, and useful critiques of this research work. With his insightful advice and unsurpassed knowledge that challenged and enriched my thoughts, together with the freedom given to me to pursue independent work, I was smoothly introduced to academia and research and kept focused on my goals. I would like as well to take a chance to thank him for the continuous support, patience and encouragements that have been invaluable on both academic and personal levels. I feel privileged to have the opportunity to work under the co-supervision of Professor Seif Haridi. His deep knowledge in the divers fields of computer science, fruitful discussions, and enthusiasm have been a tremendous source of inspiration. I am also grateful to Dr. Per Brand for sharing his knowledge and experience with me during my research and for his contributions and feedback to my work. I acknowledge the help and support given to me by the director of doctoral studies Associate Professor Robert Rönngren and the head of Software and Com- puter Systems unit Thomas Sjöland. I would like to thank Dr. Sverker Janson, the director of Computer Systems Laboratory at SICS, for his precious advices and guidance to improve my research quality and orient me to the right direction. I am truly indebted and thankful to my colleagues and dear friends Tallat Shafaat, Cosmin Arad, Amir Payberah, and Fatemeh Rahimian for their daily support and inspiration through the ups and downs of the years of this work. I am indebted to all my colleagues at KTH and SICS, specially to Dr. Sarunas Girdzijauskas, Dr. Jim Dowling, Dr. Ali Ghodsi, Roberto Roverso, and Niklas Ekström for making the environment at the lab both constructive and fun. I also acknowledge Muhammad Asif Fayyaz, Amir Moulavi, Tareq Jamal Khan, and Lin Bao for the work we did together. I take this opportunity to thank the Grid4All project team, especially Konstantin Popov, Joel Höglund, Dr. Nikos Parlavantzas, and Professor Noel de Palma for being a constant source of help. This research has been partially funded by the Grid4All FP6 European project; the Complex Service Systems focus project, a part of the ICT-TNG Strategic Re- search Areas initiative at the KTH; the End-to-End Clouds project funded by the Swedish Foundation for Strategic Research; the RMAC project funded by EIT ICT Labs. Finally, I owe my deepest gratitude to my wife Marwa and to my daughters Yara and Awan for their love and support at all times. I am most grateful to my parents for helping me to be where I am now. Contents Contents ix List of Figures xiii List of Tables xvii List of Algorithms xix I Thesis Overview 1 1 Introduction 3 1.1 Summary of Research Objectives . 5 1.2 Main Contributions . 6 1.3 Thesis Organization . 7 2 Background 9 2.1 Autonomic Computing . 10 2.2 The Fractal Component Model . 13 2.3 Structured Peer-to-Peer Overlay Networks . 14 2.4 Web 2.0 Applications . 16 2.5 State of the Art and Related Work in Self-Management for Large Scale Distributed Systems . 19 3 Self-Management for Large-Scale Distributed Systems 25 3.1 Enabling and Achieving Self-Management for Large-Scale Distributed Systems . 26 3.2 Robust Self-Management and Data Consistency in Large-Scale Dis- tributed Systems . 30 3.3 Self-Management for Cloud-Based Storage Systems: Automation of Elasticity . 32 4 Thesis Contributions 35 ix x CONTENTS 4.1 List of Publications . 36 4.2 Contributions . 37 5 Conclusions and Future Work 45 5.1 The Niche Platform . 45 5.2 Robust Self-Management and Data Consistency in Large-Scale Dis- tributed Systems . 47 5.3 Self-Management for Cloud-Based Storage Systems . 47 5.4 Discussion and Future Work . 48 II Enabling and Achieving Self-Management for Large-Scale Distributed Systems 53 6 Enabling Self-Management of Component Based Distributed Ap- plications 55 6.1 Introduction . 57 6.2 The Management Framework . 58 6.3 Implementation and evaluation . 61 6.4 Related Work . 64 6.5 Future Work . 66 6.6 Conclusions . 67 7 Niche: A Platform for Self-Managing Distributed Application 69 7.1 Introduction . 71 7.2 Background . 73 7.3 Related Work . 75 7.4 Our Approach . 76 7.5 Challenges . 77 7.6 Niche: A Platform for Self-Managing Distributed Applications . 79 7.7 Development of Self-Managing Applications . 92 7.8 Design Methodology . 107 7.9 Demonstrator Applications . 110 7.10 Policy Based Management . 120 7.11 Conclusion . 123 7.12 Future Work . 124 7.13 Acknowledgments . 125 8 A Design Methodology for Self-Management in Distributed En- vironments 127 8.1 Introduction . 129 8.2 The Distributed Component Management System . 130 8.3 Steps in Designing Distributed Management . 132 8.4 Orchestrating Autonomic Managers . 133 CONTENTS xi 8.5 Case Study: A Distributed Storage Service . 135 8.6 Related Work . 141 8.7 Conclusions and Future Work . 142 IIIRobust Self-Management and Data Consistency in Large- Scale Distributed Systems 145 9 Achieving Robust Self-Management for Large-Scale Distributed Applications 147 9.1 Introduction . 149 9.2 Background . 151 9.3 Automatic Reconfiguration of Replica Sets . 153 9.4 Robust Management Elements in Niche . 162 9.5 Prototype and Evaluation . 162 9.6 Related Work . 171 9.7 Conclusions and Future Work . 172 10 Robust Fault-Tolerant Majority-Based Key-Value Store Support- ing Multiple Consistency Levels 173 10.1 Introduction . 175 10.2 Related Work . 177 10.3 P2P Majority-Based Object Store . 180 10.4 Discussion . 186 10.5 Evaluation . 188 10.6 Conclusions and Future Work .