Operating System Support for Warehouse-Scale Computing
Malte Schwarzkopf
University of Cambridge
Computer Laboratory
St John’s College
October 2015

This dissertation is submitted for the degree of Doctor of Philosophy

Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text.

This dissertation is not substantially the same as any that I have submitted, or, is being concurrently submitted for a degree or diploma or other qualification at the University of Cambridge or any other University or similar institution.

I further state that no substantial part of my dissertation has already been submitted, or, is being concurrently submitted for any such degree, diploma or other qualification at the University of Cambridge or any other University or similar institution.

This dissertation does not exceed the regulation length of 60,000 words, including tables and footnotes.

Summary

Computer technology currently pursues two divergent trends: ever smaller mobile devices bring computing to new everyday contexts, and ever larger-scale data centres remotely back the applications running on these devices. These data centres pose challenges to systems software and the operating system (OS): common OS abstractions are fundamentally scoped to a single machine, while data centre applications treat thousands of machines as a “warehouse-scale computer” (WSC). I argue that making the operating system explicitly aware of their distributed operation can result in significant benefits.

This dissertation presents the design of a novel distributed operating system for warehouse-scale computers. I focus on two key OS components: the OS abstractions for interaction between the OS kernel and user-space programs, and the scheduling of work within and across machines.
First, I argue that WSCs motivate a revisit of the 1980s concept of a distributed operating system. I introduce and justify six design principles for a distributed WSC OS. “Translucent” primitives combine transparency and control: they free users from being explicitly aware of object locality and properties, but expose sufficient optional hints to facilitate optimisations. A novel distributed capability scheme ensures fine-grained isolation, with cryptographic capabilities being treated as data and segregated capabilities used as temporary handles. Multi-phase I/O with kernel-managed buffers improves scalability and security at the system call level, but also permits the implementation of diverse application-level consistency policies.

Second, I present the DIOS operating system, a realisation of these design principles. The DIOS system call API is centred around distributed objects, globally resolvable names, and translucent references that carry context-sensitive object meta-data. I illustrate how these concepts are used to build distributed applications, and describe an incrementally deployable DIOS prototype.

Third, I present the Firmament cluster scheduler, which generalises prior work on scheduling via a minimum-cost flow optimisation. Firmament can flexibly express many scheduling policies using pluggable cost models; it makes accurate, high-quality placement decisions based on fine-grained information about tasks and resources; and it scales to very large clusters by optimising the flow network incrementally.

Finally, I demonstrate that the DIOS prototype achieves good performance in micro-benchmarks and on a data-intensive MapReduce application, and that it offers improved cross-machine isolation between users’ applications. In two case studies, I show that Firmament supports policies that reduce co-location interference between tasks and that exploit flexibility in the workload to improve the energy efficiency of a heterogeneous cluster.
Moreover, Firmament scales the minimum-cost flow optimisation to very large clusters while still making rapid decisions.

Acknowledgements

“I find Cambridge an asylum, in every sense of the word.”
– attributed to A. E. Housman [Ric41, p. 100].

My foremost gratitude extends to my advisor, Steve Hand, for his help and support over the course of the past six years. Steve’s enthusiasm, encouragement, patience, and high standards have impacted my journey into systems research as much as they have shaped my thesis. Steve also took the time to comment on countless drafts of this work and regularly talked to me about it at length, even as he himself moved between jobs and continents.

Likewise, I am grateful to Ian Leslie, my second advisor, who gave insightful feedback on drafts of this document, and gave me the space and time to finish it to my satisfaction. In the same vein, I am also indebted to Frans Kaashoek and Nickolai Zeldovich for their seemingly infinite patience during my longer-than-expected “final stretch” prior to joining the MIT PDOS group.

Other current and former members of the Systems Research Group have also supported me in various ways. I am grateful to Ionel Gog, Richard Mortier, Martin Maas, Derek Murray, Frank McSherry, Jon Crowcroft, and Tim Harris for comments that have much improved the clarity of this dissertation. Moreover, I owe thanks to Robert Watson for our discussions on security and capabilities, and for “adopting” me into the MRC2 project for two years; and to Anil Madhavapeddy, Andrew Moore, and Richard Mortier, who assisted with equipment and feedback at key moments.

Ionel Gog and Matthew Grosvenor deserve huge credit and gratitude for our close collaboration and camaraderie over the years. Our shared spirit of intellectual curiosity and deep rigour, as well as the humour to go with it, embodies everything that makes systems research enjoyable.
I have also been privileged to work with several enthusiastic undergraduate and master’s students: Adam Gleave, Gustaf Helgesson, Matthew Huxtable, and Andrew Scull all completed projects that impacted my research, and I thank them for their excellent contributions.

The ideas for both systems presented in this dissertation go back to my internship at Google, where I worked with Andy Konwinski, John Wilkes, and Michael Abd-El-Malek. I thank them for the insights I gained in our collaboration on Omega and in the Borgmaster team, which made me realise the opportunities for innovative systems software to support the massive compute clusters deployed at internet companies.

In addition to those already mentioned above, I am grateful to my other friends in the systems research community – especially Allen Clement, Natacha Crooks, Arjun Narayan, Simon Peter, Justine Sherry, and Andrew Warfield – as well as my family, whose support, opinions and counsel have impacted my work in many ways, and who I value immensely.

Finally and above all, I thank Julia Netter, who has accompanied this journey – from its very beginning to the final proofreading of this document – with wit, insight, and loving support.

Contents

1 Introduction
  1.1 Background
  1.2 Why build a new operating system?
  1.3 Contributions
  1.4 Dissertation outline
  1.5 Related publications

2 Background
  2.1 Operating systems
  2.2 Warehouse-scale computers
  2.3 Cluster scheduling

3 Designing a WSC operating system
  3.1 Design principles
  3.2 Distributed objects
  3.3 Scalable, translucent, and uniform abstractions
  3.4 Narrow system call API
  3.5 Identifier and handle capabilities
  3.6 Flat persistent object store
  3.7 Request-based I/O with kernel-managed buffers
  3.8 Summary

4 DIOS: an operating system for WSCs
  4.1 Abstractions and concepts
  4.2 Objects
  4.3 Names
  4.4 Groups
  4.5 References
  4.6 System call API
  4.7 Distributed coordination
  4.8 Scalability
  4.9 Prototype implementation
  4.10 Summary

5 Firmament: a WSC scheduler
  5.1 Background
  5.2 Scheduling as a flow network
  5.3 Implementation
  5.4 Scheduling policies
  5.5 Cost models
  5.6 Scalability
  5.7 Summary

6 Evaluation
  6.1 Experimental setup
  6.2 DIOS
  6.3 Firmament
  6.4 Summary and outlook

7 Conclusions and future work
  7.1 Extending DIOS
  7.2 Evolving Firmament
  7.3 Summary

Bibliography

A Additional background material
  A.1 Operating system specialisation
  A.2 Additional workload interference experiments
  A.3 CPI and IPMA distributions in a Google WSC

B Additional DIOS material
  B.1 DIOS system call API
  B.2 DIOS I/O requests
  B.3 Incremental migration to DIOS

C Additional Firmament material
  C.1 Minimum-cost, maximum-flow optimisation
  C.2 Flow scheduling capacity assignment details
  C.3 Quincy cost model details
  C.4 Details of flow scheduling limitations
  C.5 Firmament cost model API

List of Figures

1.1 Outline of topics that have influenced my research.
2.1 The Google infrastructure stack.
2.2 The Facebook infrastructure stack.
2.3 Machine types and configurations in a Google WSC.
2.4 Micro-benchmarks on heterogeneous machine types.
2.5 Micro-architectural topologies of the systems used in my experiments.
2.6 WSC application interference on different CPU platforms.
2.7 WSC application interference of batch workloads on a shared cluster.
2.8 Comparison of different cluster scheduler architectures.
3.1 Results of exploratory system call tracing on common WSC applications.
3.2 Sequence diagrams of read and write I/O requests.
3.3 Example I/O requests with different concurrent access semantics.
3.4 Stages of an I/O request.
3.5 Handling an I/O request to a remote object.
3.6 Buffer ownership state transitions in the DIOS I/O model.