DURABILITY AND CRASH RECOVERY IN DISTRIBUTED IN-MEMORY STORAGE SYSTEMS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Ryan Scott Stutsman November 2013

© 2013 by Ryan Scott Stutsman. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/kk538ch0403

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

John Ousterhout, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

David Mazieres

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Mendel Rosenblum

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

This dissertation presents fast crash recovery for the RAMCloud distributed in-memory data center storage system. RAMCloud is designed to operate on thousands or tens of thousands of machines, and it stores all data in DRAM. Rather than replicating in DRAM for redundancy, it provides inexpensive durability and availability by recovering quickly after server crashes. Overall, its goal is to reconstitute the entire DRAM contents of a server and to restore full performance to the cluster in 1 to 2 seconds after failures. Consequently, RAMCloud provides continuous availability by recovering from failures so quickly that applications never notice failures.

The guiding design principle behind fast recovery is leveraging scale. RAMCloud scatters data across thousands of disks, and it harnesses hundreds of servers in parallel to reconstruct lost data. The system uses a decentralized log-structured approach for all of its data, in DRAM as well as on disk; this provides high performance both during normal operation and during recovery. RAMCloud employs randomized techniques at several key points to balance load and to manage the system in a scalable and decentralized fashion.

We have implemented and evaluated fast crash recovery. In an 80-node cluster, RAMCloud recovers 40 GB of data from a failed server in 1.86 seconds. Extrapolations based on measurements suggest that the current implementation approach should scale to recover 200 GB of DRAM from a failed server in about 2 seconds and may scale to recover 1 TB of data or more in the same period with few changes. Our analysis also suggests that 1 second recoveries may be as effective as increased on-disk redundancy at preventing data loss due to independent server failures while being less costly.

Fast crash recovery provides a foundation for fault-tolerance and it simplifies RAMCloud’s overall design. In addition to handling single node failures, it is the key mechanism for recovering from multiple simultaneous machine failures and performing complete cluster restarts. RAMCloud servers always recover to a consistent state regardless of the timing of failures, which allows RAMCloud to provide strong consistency semantics even across failures. In addition to fast recovery itself, this dissertation details how RAMCloud stores data in decentralized logs for durability, how the durability and consistency of data is protected from failures, how log replay is divided across all the nodes of the cluster after failures, and how recovery is used to handle the many failure scenarios that arise in large-scale clusters.

Acknowledgements

I owe a huge debt of gratitude to the hundreds of people who have provided me with encouragement, guidance, opportunities, and support over my time as a graduate student. Among those hundreds, there is no one that could have been more crucial to my success and happiness than my wife, Beth. She has supported my academic endeavor now for nearly a decade, patiently supporting me through slow progress and encouraging me to rebound after failures. I am thankful for her every day, and I am as excited as I have ever been about our future together.

No one has made me smile more during my time at Stanford than my daughter, Claire. She is energetic, independent, and beautiful. She gives me overwhelming optimism, and nothing cheers me more than seeing her at the end of a long day. Getting to know her has been the greatest joy of my life, and I look forward to seeing her grow in new ways.

At Stanford, there is no one who has impacted my life more than my advisor, John Ousterhout. John is the rare type of person who is so positive, so collected, and so persistent that it simultaneously makes one feel both intimidated and invigorated. He taught me how to conduct myself as a scientist and an engineer, but he also taught me how to express myself, how to focus amid confusion, and how to remain positive when challenges stack up. There is no one I admire more, and I will sorely miss scheming with him.

David Mazières has been more encouraging to me than anyone else in my time at Stanford. David persistently encourages me to worry less, to think for myself, and to take joy in what I do. He is the reason I ended up at Stanford, and I will miss his unique personality.

Mikhail Atallah, Christian Grothoff, and Krista Grothoff changed the trajectory of my life by introducing me to research as an undergraduate. Their guidance and support opened opportunities that I would have never imagined for myself.

Thanks to Mendel Rosenblum for feedback and review of this dissertation as well as sitting on my defense committee. Also, thanks to Christos Kozyrakis and Christine Min Wotipka for sitting on my defense committee.

I couldn’t have asked for a better pair of labmates than Stephen Rumble and Diego Ongaro. I have learned more from them in the past 6 years than from anyone else. Working closely together on such a large project was the highlight of my academic career. I am excited to see where they head in the future. This dissertation wouldn’t have been possible without their help and continuous chastisement. I am looking forward to the results of the next generation of high-energy RAMCloud students: Asaf Cidon, Jonathan Ellithorpe, Ankita Kejriwal, Collin Lee, Behnam Montazeri Najafabadi, and Henry Qin.

Also, thanks to the many people who helped build RAMCloud over the past three years: Arjun Gopalan, Ashish Gupta, Nandu Jayakumar, Aravind Narayanan, Christian Tinnefeld, Stephen Yang, and many more. Stephen, you’re weird and I’ll miss you.

I also owe thanks to several current and former members of the Secure Computer Systems lab at Stanford. In particular, Nickolai Zeldovich, Daniel Giffin, and Michael Walfish all helped show me the ropes early on in my graduate career. Thanks to Andrea Bittau for continuously offering to fly me to Vegas.

Unquestionably, I owe enormous thanks to my parents; they impressed the value of education on me from a young age. I am perpetually thankful for their unconditional love, and I only hope I can be as good a parent for my children as they were for me. My sister, Julie, has been a persistent source of encouragement for me, and I owe her many thanks as well. My friends have also been a source of encouragement to me as I have pursued my degree. Thanks to Nathaniel Dale, Brandon Greenawalt, and Zachary Tatlock for all your support.

This research was partly performed under an appointment to the U.S. Department of Homeland Security (DHS) Scholarship and Fellowship Program, administered by the Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the U.S. Department of Energy (DOE) and DHS. ORISE is managed by Oak Ridge Associated Universities (ORAU) under DOE contract number DE-AC05-06OR23100. All opinions expressed in this dissertation are the author’s and do not necessarily reflect the policies and views of DHS, DOE, or ORAU/ORISE. This research was also partly supported by the National Science Foundation under Grant No. 0963859, by STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and by grants from Facebook, Google, Mellanox, NEC, NetApp, SAP, and Samsung.

Figure 1: Spiderweb by Claire Stutsman.

Contents

Abstract

Acknowledgements

1 Introduction
   1.1 Dissertation Contributions

2 Motivation

3 RAMCloud Overview
   3.1 Basics
   3.2 Data Model
   3.3 System Structure
      3.3.1 The Coordinator
      3.3.2 Tablets
   3.4 Durability and Availability
   3.5 Log-Structured Storage
   3.6 Recovery
      3.6.1 Using Scale

4 Fault-tolerant Decentralized Logs
   4.1 Log Operations
   4.2 Appending Log Data
   4.3 Scattering Segment Replicas

   4.4 Log Reassembly During Recovery
      4.4.1 Finding Log Segment Replicas
      4.4.2 Detecting Incomplete Logs
   4.5 Repairing Logs After Failures
      4.5.1 Recreating Lost Replicas
      4.5.2 Atomic Replication
      4.5.3 Lost Head Segment Replicas
      4.5.4 Preventing Inconsistencies Due to Overloaded Backups
      4.5.5 Replica Corruption
   4.6 Reclaiming Replica Storage
      4.6.1 Discarding Replicas After Master Recovery
      4.6.2 Redundant Replicas Found on Storage During Startup
   4.7 Summary

5 Master Recovery
   5.1 Recovery Overview
   5.2 Failure Detection
   5.3 Setup
      5.3.1 Finding Log Segment Replicas
      5.3.2 Detecting Incomplete Logs
      5.3.3 Grouping Tablets Into Partitions
      5.3.4 Assigning Partitions to Recovery Masters
   5.4 Replay
      5.4.1 Partitioning Log Entries on Backups
      5.4.2 Segment Replay Order
   5.5 Cleanup
   5.6 Coordinating Recoveries
      5.6.1 Multiple Failures
      5.6.2 Failures During Recovery
      5.6.3 Cold Start
   5.7 Zombie Servers

   5.8 Summary

6 Evaluation
   6.1 Experimental Configuration
   6.2 How Large Should Segments Be?
   6.3 How Large Should Partitions Be?
   6.4 How Well Does Recovery Scale?
      6.4.1 Bottlenecks in the Experimental Configuration
      6.4.2 Recovery Coordination Overhead
   6.5 The Limits of Fast Crash Recovery
      6.5.1 Scaling Up
      6.5.2 Scaling Down
      6.5.3 What Is the Fastest Possible Recovery?
   6.6 How Well Does Segment Scattering Work?
   6.7 Impact of Fast Crash Recovery on Data Loss
      6.7.1 Vulnerability to Correlated Failures
   6.8 Summary

7 Related Work
   7.1 DRAM in Storage Systems
      7.1.1 Caching
      7.1.2 Application-Specific Approaches
      7.1.3 In-memory Databases
   7.2 Crash Recovery
      7.2.1 Fast Crash Recovery
      7.2.2 Striping/Chunk Scattering
      7.2.3 Partitioned Recovery
      7.2.4 Reconstruct Versus Reload
   7.3 Log-structured Storage
   7.4 Randomized Load Balancing

8 Conclusion
   8.1 Future Directions for Fast Crash Recovery
   8.2 Lessons Learned About Research
   8.3 Final Comments

List of Tables

6.1 Experimental cluster configuration
6.2 Storage devices used in experiments
6.3 Recovery time metrics for an 80 node recovery
6.4 Network and storage statistics for an 80 node recovery

List of Figures

1 Spiderweb by Claire Stutsman

3.1 Cluster architecture
3.2 Buffered logging
3.3 Recovery bottlenecks

4.1 Basic log replication
4.2 Segment replica scattering
4.3 Transitioning between log digests
4.4 Neutralizing lost replicas after backup failures

5.1 Recovery masters replay objects for one partition
5.2 Partitioning tablets for recovery
5.3 Recovery parallelism and pipeline stages

6.1 Nodes and processes used in the experimental configuration
6.2 Bandwidth of storage devices used in experiments
6.3 Recovery time for partitions of varying object sizes
6.4 Recovery time as amount recovered and cluster size scales
6.5 Recovery time as cluster and size of lost data scales using 300 MB partitions
6.6 Recovery time distribution
6.7 Memory bandwidth for Xeon X3470 with 800 MHz DDR3
6.8 Memory bandwidth utilization during recovery
6.9 Recovery time with masters and backups on separate nodes
6.10 Management overhead as recovery size scales

6.11 Projection of maximum server size recoverable in 1 to 2 seconds
6.12 Recovery time distribution for various replica placement strategies
6.13 Impact of fast crash recovery on data loss

Chapter 1

Introduction

The role of DRAM in storage systems has been increasing rapidly in recent years, driven by the needs of large-scale Web applications. These applications access data intensively, and, due to the growing disparity between disk performance and disk capacity, their needs cannot be satisfied by disks alone. As a result, applications are keeping more and more of their data in DRAM. For example, large-scale caching systems such as memcached [37] have seen wide deployment in the past few years. In 2009, Facebook used a total of 150 TB of DRAM in memcached and other caches for a database containing 200 TB of disk storage [27]. Similarly, the major web search engines now keep their search indexes entirely in DRAM.

Although DRAM’s role is increasing in large-scale data center applications, DRAM still tends to be used in limited or specialized ways. In most cases DRAM is just a cache for some other storage system, such as a database; in other cases (such as search indexes) DRAM is managed in an application-specific fashion. Both approaches contrast with disk storage, which is accessed with uniform and general-purpose interfaces (for example, filesystems and databases) that allow applications to use it easily.

Two main problems complicate using distributed caches to take advantage of DRAM. First, applications must be augmented to manage consistency between caches and the backing database, and the frequently resulting inconsistencies complicate the work of the developer. Second, because of the gap in performance between the backing database and the caches, even just a small fraction of cache misses dramatically reduces overall performance. Ultimately, this makes it difficult for the developer to predict the performance they can expect from their storage.


Application-specific uses of DRAM, such as in-memory indexes, are problematic because they require hand-tailoring of applications. Such specialized DRAM use in applications can be effective, but it requires tight coupling between the application and the storage. Developers must repeat the effort for each new application; they are left designing complex and specialized consistency, availability, and fault-tolerance systems into their application.

RAMCloud solves these problems by providing a general-purpose storage system that makes it easy for developers to harness the full performance potential of large-scale DRAM storage. RAMCloud keeps all data in DRAM all the time, so there are no cache misses. It is durable and available, so developers need not manage a separate backing store. It is designed to scale up to ten thousand servers and to store petabytes of data while providing uniform low-latency access. RAMCloud achieves 5 µs round-trip times for small read operations, which is a 50-5000× improvement over access times of state-of-the-art data center storage systems. (Today, in 2013, data center storage systems typically have access times ranging from a few hundred microseconds for volatile caches up to 100 milliseconds or more for replicated disk-based systems.) RAMCloud’s overall goal is to enable a new class of data center applications. These applications will be capable of manipulating massive datasets in a way that has previously only been possible for small datasets in the local memory of a single machine.

This dissertation describes and evaluates RAMCloud’s scalable fast crash recovery, which is at the heart of how RAMCloud provides durability and availability while retaining high performance and low cost. While durability, availability, and fault-tolerance make RAMCloud practical and simple for developers to use, RAMCloud’s reliance on DRAM prevents it from using the same solutions that typical data center storage systems use. Specifically, four properties of DRAM guide the design of RAMCloud and motivate its use of fast recovery for fault-tolerance:

• Volatility: DRAM is not durable; data must be replicated on stable storage or it risks loss on power failure.

• Performance: DRAM is low-latency; in order to expose the full performance benefits to applications, RAMCloud’s approach to durability must not impact the performance of normal operations. Any performance overheads jeopardize RAMCloud’s potential to support new classes of data-intensive applications.

• Cost: DRAM is expensive per bit both in terms of fixed cost and operational cost. This makes approaches that replicate data in DRAM prohibitive. For example, triplicating data in DRAM would triple the cost and energy consumption of the cluster (and would still lose data on cluster-wide power failure).

• Low Per-Node Capacity: Typical data center servers can be cost-effectively equipped with 64 to 256 GB of DRAM today. The large-scale datasets typical of modern web applications require the DRAM capacity of many machines, so RAMCloud must scale to support thousands of storage servers. However, scaling across thousands of machines works against durability since, in a large cluster, server failures happen frequently. RAMCloud must compensate for this and ensure data is safe even when server failures are frequent.

RAMCloud’s approach is to keep only a single copy of data in DRAM; redundant copies are kept on disk or flash, which are cheaper and more durable than DRAM. However, this means that a server crash will leave some of the system’s data unavailable until it can be reconstructed from secondary storage. Scalable fast crash recovery solves this problem by reconstructing the entire contents of a lost server’s memory (64 to 256 GB) from disk and resuming full service in 1 to 2 seconds. The assumption is that this is fast enough to be considered “continuous availability” for most applications. This dissertation describes and evaluates scalable fast crash recovery. Additionally, it details the other problems that arise in providing fault tolerance in a large-scale distributed system and describes solutions and their implementation in RAMCloud. This includes retaining strong consistency (linearizability) despite faults, handling multiple simultaneous failures, handling massive failures including completely restarting a RAMCloud cluster, and cleaning up storage and resources across the cluster after servers crash.

There are several key aspects to the RAMCloud recovery architecture:

• Harnessing scale: Each server scatters its backup data across all of the other servers, allowing thousands of disks to participate in recovery. Simultaneously, hundreds of recovery master servers work together in a highly data parallel and tightly pipelined effort to rapidly recover data into DRAM to restore availability when servers fail.

• Load balancing: For crash recovery to take full advantage of the large aggregate resources of the cluster, it must avoid bottlenecks by spreading work evenly across thousands of CPUs, NICs, and disks. Recovery is time sensitive and involves many machines; without care, slow servers (“stragglers”) can easily drag out recovery time, which manifests itself as unavailability.

• Decentralization: In order to scale to support clusters of thousands of machines, servers must operate independently both during normal operation and during recovery. Each server minimizes its requests to centralized services by having its own self-contained and self-describing log.

• Randomized techniques: Careful use of randomization is critical for RAMCloud’s load balancing and decentralization. Servers rely on randomization at key points to enable decentralized decision making and to achieve precise balancing of work without the need for explicit coordination.

• Log-structured storage: RAMCloud uses techniques similar to those from log-structured file systems [42], not just for information on disk but also for information in DRAM. Its log-structured approach provides high performance by safely buffering updates and eliminating disks from the fast path while simplifying many issues related to crash recovery.

We have implemented the RAMCloud architecture in a working system and evaluated its crash recovery properties. Our 80-node cluster recovers in 1.86 seconds from the failure of a server with 40 GB of data, and the approach scales so that larger clusters can recover larger memory sizes in less time.

1.1 Dissertation Contributions

The main contributions of this dissertation are:

• The design and implementation of a large-scale DRAM-based data center storage system as reliable as disk-based storage systems with minimal performance and cost penalties. This is the only published approach to maintaining durability and availability for distributed DRAM-based storage that optimizes for both cost-effectiveness and low-latency operation. Overall, it achieves the reliability of today’s disk-based storage systems for the cost of today’s volatile distributed DRAM caches with performance 50-5000× greater than either.

• The design, implementation, and evaluation of RAMCloud’s scalable fast crash recovery system which

– makes DRAM-based storage cost-effective by eliminating the need for redundant data in DRAM while retaining high availability;
– and improves in recovery time as cluster size scales by harnessing the resources of thousands of servers.

• The design, implementation, and evaluation of a decentralized log which

– efficiently creates disk-based backups of data stored in DRAM while imposing no overhead for read operations and eliminating disk operations for writes;
– scales to tens of thousands of machines through aggressive decentralization;
– and provides strong consistency semantics while remaining durable in spite of failures.

Overall, RAMCloud’s decentralized log and scalable fast crash recovery make it easy for applications to take advantage of large-scale DRAM-based storage by making failures transparent and retaining the system’s low-latency access times.

Chapter 2

Motivation

RAMCloud’s use of DRAM and its focus on large-scale deployment shape how it provides fault tolerance, which centers on fast crash recovery. This chapter describes why traditional replicated approaches to fault-tolerance are impractical in RAMCloud, primarily due to the high cost-per-bit of DRAM. This motivates fast crash recovery as an alternative to traditional fault-tolerance approaches. Overall, fast crash recovery provides continuous availability while allowing the use of inexpensive disks for redundancy and avoiding disk-limited normal-case performance.

RAMCloud is a data center storage system where every byte of data is present in DRAM at all times. RAMCloud aggregates the DRAM of thousands of servers into a single coherent storage system. RAMCloud’s overall goal is to enable a new class of data center applications capable of manipulating massive datasets in a way that has previously only been possible for small datasets in the local memory of a single machine. For more details on the motivation for RAMCloud and some of its architectural choices see [40].

Fault-tolerance is a key attribute of RAMCloud. Though RAMCloud stores all data in volatile DRAM, it is designed to be as durable and available as disk-based data center storage systems. Fault-tolerance is essential to enable developers to easily incorporate large-scale DRAM storage into applications.

Redundancy is fundamental to fault-tolerance. Data center storage systems rely on redundant copies of data to maintain availability when servers fail. Given that some replication is required for availability, systems have a choice on how to implement it.


In today’s disk-based data center storage systems, the most common approach is to replicate all data and metadata on stable storage in the same form used to service requests during normal operation. When a server fails, another server with a replica of the data can continue to service requests as usual.

Does this approach work for RAMCloud? How should a large-scale DRAM-based storage system provide fault-tolerance? Today, there are no general-purpose, fault-tolerant, large-scale, DRAM-based storage systems, perhaps because DRAM presents special challenges. It is volatile and loses data on power failure. It is costly per bit, so redundancy in DRAM is impractically expensive. It is high-performance; any performance penalty in the storage system will be immediately reflected in client application access times. Finally, it is also less dense; more machines are needed to house large datasets, which indirectly increases the frequency of machine failures in a cluster.

RAMCloud could use the same approach as other data center storage systems by storing each object in the DRAM of multiple servers; however, with a typical replication factor of three, this approach would triple both the cost and energy usage of the system. Even without triplication, in RAMCloud the DRAM of each machine contributes more than half of the hardware cost (∼$10/GB compared to ∼$0.10/GB for disk) and energy consumption (∼1 W/GB [35] compared to ∼5 mW/GB for disk). Thus, using DRAM for replication is prohibitively expensive. As a result, RAMCloud keeps only a single copy of each object in DRAM, with redundant copies on secondary storage such as disk or flash. Since the cost and energy usage of DRAM for primary copies dominates compared to the redundant copies on disk or flash, replication is nearly free. For example, the three-year total cost of ownership of 64 GB of DRAM is about $800, compared to $22 for the 192 GB of disk storage needed to back up the DRAM in triplicate (a rough breakdown of these figures appears later in this chapter).

However, the use of disks raises two issues. First, if data were to be written synchronously to disks for durability before responding to client requests, then the normal-case performance of the system would be limited by the high latency of disk writes. Today’s networking hardware allows round trips in a few microseconds, whereas a single disk access takes milliseconds; synchronous disk operations would slow RAMCloud by a factor of 1,000. The primary value of RAMCloud comes from its high performance, so disk-bound performance would severely limit its usefulness. Second, this approach could result in long periods of unavailability or poor performance after server crashes, since data will have to be reconstructed from secondary storage.

Data center applications often require continuous availability, so any noticeable delay due to failures is unacceptable.

This dissertation describes how RAMCloud uses scalable fast crash recovery and decentralized buffered logs to solve these two problems. Recovery minimizes unavailability due to server failures; buffered logs record data durably without the need for synchronous disk writes. During normal operation, each server scatters backup data across all of the nodes of the cluster, which temporarily store data in small non-volatile buffers. Each time a buffer is filled, it is written to disk. When a server fails in RAMCloud, nodes cooperate to rapidly reconstitute the DRAM contents of the failed server. Since data for a failed server is stored on all of the servers of the cluster, it can be reloaded quickly by reading from all of the disks in parallel. Furthermore, the work of processing data as it is loaded from disks is split across hundreds of machines. By harnessing all of the disks, NICs, and CPUs of a large cluster, RAMCloud recovers from failures in 1 to 2 seconds and provides an illusion of continuous availability.

Fast crash recovery together with buffered logging solves each of the four key issues with using DRAM as a large-scale storage medium:

• Volatility: Servers scatter replicas of data across the disks of other nodes in the cluster; data is protected against power failure, since it is always stored on disk or in a non-volatile buffer.

• Performance: Since writes are buffered in non-volatile storage and written to disk in bulk, access latency is determined by network delay rather than disk latency even for update operations.

• Cost: Fast crash recovery makes large-scale DRAM more cost-effective in two ways; it eliminates the need for expensive redundancy in DRAM, and it retains the high performance for updates that makes DRAM storage so valuable.

• Reliability at Scale: DRAM may require the use of thousands or tens of thousands of machines for large datasets, which increases the rate of failures RAMCloud must tolerate. Fast recovery more than compensates for the increased risk due to more frequent failures.

By restoring full replication quickly after a node failure, it reduces vulnerability to data loss. Section 6.7 analyzes the impact of recovery time on reliability and shows that improving recovery time reduces the probability of data loss nearly as significantly as increasing redundancy.
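The cost comparison quoted earlier in this chapter ($800 of three-year DRAM cost versus $22 of disk) decomposes roughly as sketched below. The electricity price of about $0.10 per kWh is an assumption added here for illustration and is not stated elsewhere; real total cost of ownership includes additional factors.

    DRAM:  64 GB × ∼$10/GB                         ≈ $640 hardware
           64 GB × ∼1 W/GB × 26,280 h (3 years)    ≈ 1,680 kWh ≈ $170 energy
           total                                    ≈ $800

    Disk:  192 GB × ∼$0.10/GB                      ≈ $19 hardware
           192 GB × ∼5 mW/GB × 26,280 h            ≈ 25 kWh ≈ $3 energy
           total                                    ≈ $22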

Overall, fast recovery is the key to making RAMCloud fault-tolerant and cost-effective while retaining access times determined by network latency rather than disk access times. The next chapter gives an overview of RAMCloud and recovery. The remaining chapters describe how RAMCloud stores redundant data on disk without limiting performance and how clusters use that on-disk data to reconstitute the DRAM of failed servers to create an illusion of continuous availability.

Chapter 3

RAMCloud Overview

This chapter describes the key ideas behind RAMCloud, the high level components of the system, and its basic operation. Additionally, because crash recovery and normal request processing are tightly intertwined, this chapter also gives an overview of how single machine failures are handled and how that informs and motivates many design choices in RAMCloud.

3.1 Basics

RAMCloud is a storage system where every byte of data is present in DRAM at all times. The hardware for RAMCloud consists of hundreds or thousands of off-the-shelf servers in a single data center, each with as much DRAM as is cost-effective (64 to 256 GB today). RAMCloud aggregates the DRAM of all these servers into a single coherent storage system. It uses backup copies on disk or flash to make its storage durable, but the performance of the system is determined by DRAM, not disk.

The RAMCloud architecture combines two interesting properties: low latency and large scale. First, RAMCloud is designed to provide the lowest possible latency for remote access by applications in the same data center. Most of today’s data center networks cannot meet RAMCloud’s latency goals, but this is changing as the deployment of 10 gigabit Ethernet has started in data centers. The current implementation achieves end-to-end times of 5 µs for reading small objects (∼100 to 1000 B) using inexpensive Infiniband networking hardware, and initial tests indicate similar times are achievable with commodity 10 gigabit equipment.


This represents an improvement of 50-5000× over existing data center-scale storage systems. Each RAMCloud storage server can handle about 650 thousand small read requests per second.

The second important property of RAMCloud is scale; a single RAMCloud cluster must support up to ten thousand servers in order to provide a coherent source of data for large applications (up to about 2.5 PB with today’s machines). Scale creates several challenges, such as the likelihood of frequent component failures and the need for distributed decision-making to avoid bottlenecks. However, scale also creates opportunities, such as the ability to enlist large numbers of resources on problems like fast crash recovery.

RAMCloud’s overall goal is to enable a new class of data center applications. These applications will be capable of manipulating massive datasets in a way that has previously only been possible for small datasets in the local memory of a single machine. For more details on the motivation for RAMCloud see [40].

3.2 Data Model

The current data model in RAMCloud is a simple key-value store. RAMCloud supports any number of tables, each of which may contain any number of objects. An object consists of

• a variable-length byte array (up to 64 KB), provided by the client, that serves as its key,

• a variable length byte array of opaque data (up to 1 MB) that is provided by the client and left uninterpreted by RAMCloud,

• and a 64-bit system-managed version number that strictly increases each time the value for a particular key is set, deleted, or overwritten.

RAMCloud provides a simple set of operations for creating and deleting tables and for reading, writing, and deleting objects within a table. Objects are addressed using their keys and are read and written in their entirety and atomically.

[Figure 3.1 diagram: client application nodes connect over the data center network to the coordinator and to RAMCloud storage nodes, each containing a master and a backup.]

Figure 3.1: RAMCloud cluster architecture. Each storage server contains a master and a backup. A central coordinator manages the server pool and tablet configuration. Client applications run on separate machines and access RAMCloud using a client library that makes remote procedure calls.

There is no built-in support for atomic updates across multiple objects, but RAMCloud does provide a conditional update (“replace the contents of object O in table T only if its current version number is V”), which can be used to implement more complex transactions in application software. Clients can compare version numbers between objects retrieved from the same table and having the same key to determine which was more recently written.
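To make these semantics concrete, the toy, single-process sketch below models tables of key-to-{value, version} mappings with the conditional update just described. The class and method names are illustrative assumptions for exposition, not the RAMCloud client library.

    #include <cstdint>
    #include <map>
    #include <stdexcept>
    #include <string>

    // Toy model of the data model described above: tables of
    // key -> {value, version}, with a conditional-update primitive.
    struct Object { std::string value; uint64_t version = 0; };

    class ToyKeyValueStore {
        std::map<uint64_t, std::map<std::string, Object>> tables;
        uint64_t nextTableId = 1;
    public:
        uint64_t createTable() { return nextTableId++; }

        uint64_t write(uint64_t tableId, const std::string& key,
                       const std::string& value) {
            Object& o = tables[tableId][key];
            o.value = value;
            return ++o.version;          // version increases on every write
        }

        Object read(uint64_t tableId, const std::string& key) {
            return tables.at(tableId).at(key);
        }

        // Conditional update ("replace O in table T only if its version is V"),
        // the primitive the text says can be used to build transactions.
        uint64_t writeIfVersion(uint64_t tableId, const std::string& key,
                                const std::string& value, uint64_t expected) {
            Object& o = tables.at(tableId).at(key);
            if (o.version != expected)
                throw std::runtime_error("version mismatch; retry");
            o.value = value;
            return ++o.version;
        }
    };

A client using an interface of this shape would read an object, remember its version, and then issue the conditional write; a version mismatch signals that another client updated the object in the meantime.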

3.3 System Structure

Figure 3.1 shows the composition of a RAMCloud cluster. Each of the storage servers has two components: a master, which manages RAMCloud objects in its DRAM and services client requests, and a backup, which stores redundant copies of objects from other masters using its disk or flash memory. Client applications use a RAMCloud client library that presents the cluster as a single storage system. The library transparently fetches (and caches) cluster configuration information from the RAMCloud cluster coordinator (described in the next section), which allows it to direct application requests to the appropriate RAMCloud machines.

3.3.1 The Coordinator

Each RAMCloud cluster also contains one distinguished server called the coordinator. The coordinator manages cluster membership, initiates recoveries when machines fail, and maintains the mapping of keys to masters. The use of a centralized coordinator has a substantial impact on the design of the rest of the system. Centralization simplifies many operations, but, in order for the system to scale to thousands of machines, use of the coordinator must be limited. By design, the coordinator is involved in few client requests.

To maintain high availability, RAMCloud must take care to prevent the coordinator from being a single point of failure. The coordinator’s core data structures are stored in ZooKeeper [25], a replicated, highly-available configuration storage system. At any one time, a RAMCloud cluster has a single operating coordinator with multiple “warm” passive coordinators that are fed the same state changes as the active coordinator. One of these passive coordinators takes over in case the active coordinator fails. ZooKeeper mediates the selection of the replacement coordinator using its consensus algorithm. The details of ZooKeeper and coordinator failover are outside the scope of this dissertation.

3.3.2 Tablets

Clients create and work with tables, but the coordinator assigns objects to masters in units of tablets. Tablets allow a single table to be distributed across multiple masters. Keys for objects in a table are hashed into a 64-bit space. The hash space for each table is divided into one or more tablets. Each tablet is a contiguous subrange of the key hash space for its table. The coordinator may assign a single master any number of tablets from any number of tables.

New tables are initially created with a single empty tablet assigned to a single server. Clients can request that a tablet be split at a particular key hash. The system also supports a basic form of tablet migration between masters. Tablet splitting together with migration allow large tables to be broken up and distributed across multiple masters. An automatic load balancer is planned for future work; it will migrate tablets to optimize for locality by minimizing the number of masters a particular application must access.

The coordinator stores the mapping between tablets and masters.

The RAMCloud client library maintains a cache of this information, fetching the mappings for each table the first time it is accessed. Once a client’s configuration cache has been populated, a client can issue storage requests directly to the relevant master without involving the coordinator. If a client’s cached configuration information becomes stale because a tablet has moved, the client library discovers this when it makes a request to a master that no longer contains the tablet, at which point it flushes the stale data from its cache and fetches up-to-date information from the coordinator. As will be seen in Section 3.6, recovery from server failures causes tablets to relocate across the cluster; clients use the same tablet location mechanism to find the new location for data after a server crash.
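The sketch below illustrates how a client-side cache of this kind might map a key to the master for its tablet and invalidate stale entries. The hash function and data structures are simplifications chosen for illustration, not RAMCloud’s actual ones.

    #include <cstdint>
    #include <functional>
    #include <map>
    #include <string>

    // Each tablet covers a contiguous subrange of the 64-bit key-hash space
    // for its table and is served by one master.
    struct Tablet {
        uint64_t startHash;        // inclusive start of key-hash range
        uint64_t endHash;          // inclusive end of key-hash range
        std::string masterAddress; // where to send requests for this range
    };

    class TabletCache {
        // For each table: tablets keyed by the start of their hash range.
        std::map<uint64_t, std::map<uint64_t, Tablet>> cache;
    public:
        static uint64_t keyHash(const std::string& key) {
            return std::hash<std::string>{}(key);   // stand-in 64-bit hash
        }

        // Returns the cached master for this key, or "" if unknown (the
        // client would then fetch the table's tablet map from the coordinator).
        std::string lookup(uint64_t tableId, const std::string& key) const {
            auto t = cache.find(tableId);
            if (t == cache.end()) return "";
            uint64_t h = keyHash(key);
            auto it = t->second.upper_bound(h);  // first tablet starting after h
            if (it == t->second.begin()) return "";
            --it;                                 // tablet with largest start <= h
            return (h <= it->second.endHash) ? it->second.masterAddress : "";
        }

        // Called when a master rejects a request because the tablet moved:
        // drop the stale entries so the next lookup refetches fresh mappings.
        void invalidateTable(uint64_t tableId) { cache.erase(tableId); }

        void install(uint64_t tableId, const Tablet& tablet) {
            cache[tableId][tablet.startHash] = tablet;
        }
    };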

3.4 Durability and Availability

The internal structure of a RAMCloud storage server is determined primarily by the need to provide durability and availability. In the absence of these requirements, a master would consist of little more than a hash table that maps from ⟨table identifier, object key⟩ pairs to objects in DRAM. The main challenge is providing durability and availability without sacrificing performance or greatly increasing system cost.

One possible approach to availability is to replicate each object in the DRAM of several servers. However, DRAM-based replicas would still be vulnerable in the event of power failures. Machines with 64 to 256 GB of DRAM are common in data centers, and data center battery backups [44] do not last long enough to flush the entire DRAM contents of a server to disk. Furthermore, with a typical replication factor of three, this approach would triple both the cost and energy usage of the system; each server is already fully loaded, so adding more memory would also require adding more servers and networking.

Instead, RAMCloud keeps only a single copy of each object in DRAM, with redundant copies on secondary storage such as disk or flash. In a RAMCloud cluster, the DRAM of each machine contributes more than half of the hardware cost and energy consumption. Since the cost and energy usage of DRAM for primary copies dominates compared to the redundant copies on disk or flash, replication is nearly free. However, the use of disks raises two issues. First, if data must be written synchronously to disks for durability before responding to client requests, then the normal-case performance of the system will be limited by the high latency of disk writes.

Second, this approach could result in long periods of unavailability or poor performance after server crashes, since data will have to be reconstructed from secondary storage. Section 3.5 describes how RAMCloud solves the performance problem. Section 3.6 gives an overview of recovery, which restores access to data quickly after crashes so that clients never notice failures.

3.5 Log-Structured Storage

RAMCloud manages object data using a logging approach. This was originally motivated by the desire to transfer backup data to disk or flash as efficiently as possible, but it also provides an efficient memory management mechanism, enables fast recovery, and has a simple implementation. The data for each master is organized as a log as shown in Figure 3.2. When a master receives a write request, it appends the new object to its in-memory log and forwards that log entry to several backup servers. The backups buffer this information in memory and return immediately to the master without writing to disk or flash. The master completes its request and returns to the client once all of the backups have acknowledged receipt of the log data. When a backup’s buffer fills, it writes the accumulated log data to disk or flash in a single large transfer, then deletes the buffered data from memory.

Backups must ensure that buffered log data is as durable as data on disk or flash; information must not be lost in a power failure. As mentioned earlier, many existing data centers provide per-server or rack-level battery backups that can extend power briefly after outages. For example, one specification from Facebook [44] provides for about 45 seconds of battery power in the case of a power outage. Backups limit the amount of data they buffer and regularly flush data to stable storage to ensure they can safely write out any partial, dirty buffers in the case of an outage. (The amount backups buffer is configurable; to ensure safety, it should be based on a conservative estimate of local storage performance and battery life.) An alternative to batteries is to use new DIMM memory modules that incorporate flash memory and a super-capacitor that provides enough power for the DIMM to write its contents to flash after a power outage [1]; each backup could use one of these modules to hold all of its buffered log data. Enterprise disk controllers with persistent cache memory [34] are widely available and could also be used.

[Figure 3.2 diagram: (1) the master processes a write request; (2) it appends the object to its in-memory log and updates its hash table; (3) it replicates the object to several backups, each of which buffers the data in a segment before writing it to disk; (4) it responds to the write request.]

Figure 3.2: When a master receives a write request, it updates its in-memory log and forwards the new data to several backups that buffer the data in their memory. The data is eventually written to disk or flash in large batches. Backups must use an auxiliary power source to ensure that buffers can be written to stable storage during power failures.
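The following toy, single-process model walks through the write path of Figure 3.2. The type and method names (ToyBackup, ToyMaster, appendToBuffer) are assumptions for exposition; the real RPC interfaces and log entry format are richer.

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    // Toy model of buffered logging: the master appends to its in-memory log
    // and forwards the entry to backups, which only buffer it; disk writes
    // happen later, in one large transfer.
    struct ToyBackup {
        std::string buffer;   // stands in for a non-volatile staging buffer
        std::string disk;     // stands in for the backup's disk or flash

        void appendToBuffer(const std::string& entry) {
            buffer += entry;  // no disk I/O here; acknowledge immediately
        }
        void flushIfFull(size_t limit = 8 * 1024 * 1024) {
            if (buffer.size() >= limit) {   // one large sequential transfer
                disk += buffer;
                buffer.clear();
            }
        }
    };

    class ToyMaster {
        std::string inMemoryLog;            // head of the master's DRAM log
        std::vector<ToyBackup*> backups;    // typically three backups
    public:
        explicit ToyMaster(std::vector<ToyBackup*> b) : backups(std::move(b)) {}

        void write(const std::string& key, const std::string& value) {
            std::string entry = key + "=" + value;   // real format is richer
            inMemoryLog += entry;                    // 1. append to DRAM log
            for (ToyBackup* b : backups) {           // 2. replicate to backups
                b->appendToBuffer(entry);            //    (they buffer and ack)
                b->flushIfFull();                    //    background flush
            }
            // 3. Only now respond to the client: the data is held in the
            //    master's DRAM and in every backup's buffer.
        }
    };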

RAMCloud manages its logs using techniques similar to those in log-structured file systems [42]. Each master’s log is divided into 8 MB segments. The master keeps a count of unused space within each segment, which accumulates as objects are deleted or overwritten. It reclaims wasted space by occasionally invoking a log cleaner [42, 43]; the cleaner selects one or more segments to clean, reads the live records from the segments and rewrites them at the head of the log, then deletes the cleaned segments along with their backup copies. Segments are also the unit of buffering and I/O on backups; the large segment size enables efficient I/O for both disk and flash.

RAMCloud uses a log-structured approach not only for backup storage, but also for information in DRAM. The memory of a master is structured as a collection of log segments which map directly to the segment replicas stored on backups. This allows masters to manage both their in-memory data and their backup data using a single mechanism. The log provides an efficient memory management mechanism, with the cleaner implementing a form of generational garbage collection. The details of memory management and log cleaning are outside the scope of this dissertation; see [43] for more on this topic.
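A minimal sketch of one cleaning pass, under the structure just described; the types and the liveness callback (which the real system answers using the hash table discussed below) are illustrative assumptions.

    #include <cstddef>
    #include <functional>
    #include <list>
    #include <string>
    #include <vector>

    struct Entry { std::string tableKey; std::string value; };

    struct Segment {
        std::vector<Entry> entries;
        size_t freeBytes = 0;  // grows as objects are deleted or overwritten
    };

    // Liveness test for an entry, e.g. "does the hash table still point here?"
    using IsLive = std::function<bool(const Entry&)>;

    void cleanOnce(std::list<Segment>& log, Segment& head, IsLive isLive) {
        if (log.empty()) return;
        // 1. Choose the segment with the most reclaimable space.
        auto victim = log.begin();
        for (auto it = log.begin(); it != log.end(); ++it)
            if (it->freeBytes > victim->freeBytes) victim = it;
        // 2. Re-append only the live entries at the head of the log
        //    (in the real system these are also re-replicated to backups).
        for (const Entry& e : victim->entries)
            if (isLive(e))
                head.entries.push_back(e);
        // 3. Drop the cleaned segment; backups can now free its replicas.
        log.erase(victim);
    }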

In order to support random access to objects in memory, each master keeps a hash table that maps from a ⟨table identifier, object identifier⟩ pair to the current version of an object in a segment. The hash table is used both to look up objects during storage operations and to determine whether a particular object version is the current one during cleaning. For example, if there is no hash table entry for a particular object in a segment being cleaned, it means the object has been deleted.

The buffered logging approach allows writes to complete without waiting for disk operations, but it limits overall system throughput to the bandwidth of the backup storage. For example, each RAMCloud server can handle about 300 thousand 100-byte writes/second (versus 650 thousand reads/second) assuming 2 disks per server, 100 MB/s write bandwidth for each disk, 3 disk replicas of each object, and a 100% bandwidth overhead for log cleaning. Flash or additional disks can be used to boost write throughput.
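The 300 thousand writes/second figure follows from the stated assumptions; a rough back-of-the-envelope version, ignoring per-entry metadata (which accounts for the remaining gap), is:

    2 disks × 100 MB/s               = 200 MB/s of raw backup write bandwidth
    200 MB/s ÷ 3 replicas            ≈ 67 MB/s of new log data per master
    67 MB/s ÷ 2 (cleaning overhead)  ≈ 33 MB/s of live client data
    33 MB/s ÷ 100 B per object       ≈ 330 thousand writes/second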

3.6 Recovery

When a RAMCloud storage server crashes, the objects that had been present in its DRAM must be recovered by replaying the log the master left behind. This requires reading log segment replicas from backup storage back into DRAM, processing the records in those replicas to identify the current version of each live object, and reconstructing the hash table used for storage operations. The crashed master’s data will be unavailable until the objects from the crashed server’s log are read into DRAM and the hash table has been reconstructed.

Fortunately, if the period of unavailability can be made very short, so that it is no longer than other delays that are common in normal operation, and if crashes happen infrequently, then crash recovery may be made unnoticeable to application users. If recovery from server failures were instantaneous, applications and their users would be unable to determine whether storage server failures had occurred at all. Though instantaneous recovery is impractical, for certain classes of applications RAMCloud may be able to recover from failures quickly enough to avoid perception by users. Many data center applications are exposed to users as web applications accessed through a browser over the Internet. Because of the high variance in delivering content over the Internet, web application interfaces and their users expect to occasionally experience delays of a few seconds.

As a result, RAMCloud is designed to achieve 1 to 2 second recovery, which is fast enough to constitute “continuous availability” for many applications. Our goal is to build a system that achieves this speed for servers with 64 to 256 GB of memory and scales to recover larger servers as cluster size scales.

3.6.1 Using Scale

The key to fast recovery in RAMCloud is to take advantage of the massive resources of the cluster. This section introduces RAMCloud’s overall approach for harnessing scale during recovery; Chapter 5 describes the process of recovery in detail.

As a baseline, Figure 3.3a shows a simple mirrored approach where each master chooses 3 backups and stores copies of all its log segments on each backup. Unfortunately, this creates a bottleneck for recovery because the master’s data must be read from only a few disks. In the configuration of Figure 3.3a with 3 disks, it would take about 3.5 minutes to read 64 GB of data.

RAMCloud works around the disk bottleneck by using more disks during recovery. Each master scatters its log data across all of the backups in the cluster as shown in Figure 3.3b; each segment is replicated on a different set of backups. During recovery, these scattered replicas can be read simultaneously; with 1,000 disks, 64 GB of data can be read into memory in less than one second.

Once the segment replicas have been read from disk into backups’ memories, the log must be replayed to find the most recent version for each object. No backup can tell in isolation whether a particular object found in a particular replica was the most recent version of that object, because a different replica on another backup may contain a more recent version of the object. One approach is to send all of the log segment replicas for a crashed server to a single recovery master and replay the log on that master, as in Figure 3.3b. Unfortunately, the recovery master is a bottleneck in this approach. With a 10 Gbps network interface, it will take about one minute for a recovery master to receive 64 GB of data, and the master’s CPU can also become a bottleneck due to the cache misses caused by rapidly inserting objects into its hash table.

[Figure 3.3 diagrams: (a) one recovery master reading from 3 mirrored backup disks: 64 GB / 3 disks / 100 MB/s per disk ≈ 3.5 minutes; (b) segments scattered across roughly 1,000 backup disks: 64 GB / 1,000 disks / 100 MB/s per disk ≈ 0.6 seconds from disk, but 64 GB / 10 Gbps ≈ 1 minute at a single recovery master; (c) many recovery masters, each recovering one partition of the crashed master from the scattered backups.]

Figure 3.3: (a) Disk bandwidth is a recovery bottleneck if each master’s data is mirrored on a small number of backup machines. (b) Scattering log segments across many backups removes the disk bottleneck, but recovering all data on one recovery master is limited by the network interface and CPU of that machine. (c) Fast recovery is achieved by partitioning the data of the crashed master and recovering each partition on a separate recovery master.

To eliminate the recovery master as the bottleneck, RAMCloud uses multiple recovery masters as shown in Figure 3.3c. During recovery, RAMCloud divides up the tablets that belonged to the crashed master into partitions of roughly equal size. Each partition of the crashed master is assigned to a different recovery master, which fetches the log records for objects in its partition from backups. The recovery master then incorporates those objects into its own log and hash table. With 100 recovery masters operating in parallel, 64 GB of data can be transferred over a 10 Gbps network in less than 1 second. As will be shown in Chapter 6, this is also enough time for each recovery master’s CPU to process the incoming data. Thus, the overall approach to recovery in RAMCloud is to combine the disk bandwidth, network bandwidth, and CPU cycles of many machines in a highly parallel and tightly pipelined effort to bring the data from the crashed server back online as quickly as possible.

The next chapter details how the logs that underlie both recovery and normal operation are created and maintained by masters. The log replication module on each master handles all aspects of data durability even in the face of failures; it recreates lost replicas when a backup with segment replicas fails, and it ensures damaged or inconsistent replicas cannot affect the correctness of recovery. By transparently handling many of the problems that can arise with replicated data, the log replication module simplifies the complex task of recovery. The coordinator, the recovery masters, and all the backups of the cluster are left to focus solely on restoring access to the data of crashed servers as quickly as possible. Chapter 5 details how recovery divides work among thousands of machines and how it coordinates their resources to recover in 1 to 2 seconds.

Chapter 4

Fault-tolerant Decentralized Logs

RAMCloud’s decentralized logs are at the core of all of its operations. This chapter describes how these logs are created and maintained in detail, while Chapter 5 explains the specifics of using these logs to rapidly recover the contents of failed servers.

First, this chapter describes the normal operation of logs, including how masters replicate their log data efficiently for durability, how logs are segmented and scattered across the cluster for fast recovery, and how logs are reassembled during recovery. Afterward, the chapter details the ways failures can affect logs and how RAMCloud compensates for them. The chapter concludes with details of how disk storage space is reclaimed.

As described in Section 3.5, each storage server’s master service stores all object data in its own fault-tolerant log that is scattered across the cluster. Each master has a replica manager whose primary responsibility is to construct and maintain this log, so that it can be used to safely recover the master’s data after server failure with minimal delay. To the extent possible, the replica manager must ensure that the master’s log is available and safe for replay at all times, since failures are unexpected by nature. Master recovery is complex and highly-optimized, and it must work correctly even under the most elaborate failure scenarios. By shifting the burden of producing and maintaining a safe, consistent, and available log to the master’s log replication module, recovery is freed to focus solely on fast replay of the log.

RAMCloud’s decentralized logs have several important properties:

• Scalability: A master never needs to consult the coordinator in order to append to its log.


This prevents bottlenecks on centralized metadata services (for example, the coordinator). Some similar data center filesystems also scatter data blocks across many machines, but they centrally track where the replicas for each block are located in the cluster (for example, Windows Azure Storage [6]). RAMCloud servers keep no centralized record of log contents. In part by avoiding centralization, RAMCloud is designed to scale to clusters of tens of thousands of machines.

• High bandwidth: Each master scatters log data across the entire cluster, which provides high bandwidth both for writing and for reading during recovery. This is key in meeting RAMCloud’s 1 to 2 second crash recovery goal.

• Cost effectiveness: Because of the high cost per bit of DRAM, redundant copies of log data are only stored on disk or flash. This provides durability for the data stored in memory on masters without using more DRAM than is absolutely required for the lowest latency performance.

• Low-latency operation: For applications to gain the full performance benefit from DRAM, log operations must provide durability at the speed of the network (in microseconds) rather than the speed of disk (in milliseconds). Backups judiciously buffer updates in DRAM and rely on limited battery backups to safely flush data to disk on power failures. Because data can be written to disk off the fast path, client writes can be serviced quickly.

• Durability and fault-tolerance: Masters’ replica managers react to server failure notifications from the coordinator and automatically take action to restore full replication to portions of the log affected by failure. This limits the complexity of preventing data loss to each master’s replica manager.

• Consistency and safety: Each master’s log provides strong consistency even while allowing concurrent updates to facilitate high throughput. In fact, logs support both concurrent log append operations and concurrent replication requests to backups. Each log guarantees stale data will never be seen when it is read for recovery, and any attempt to use a log that has experienced data loss will always be correctly detected.

These strong guarantees simplify the logic for appending to logs and for using log data during recovery.

The rest of this chapter first describes how RAMCloud manages its decentralized logs during normal operation and then details how the logs provide durability and consistency even as servers fail.

4.1 Log Operations

Each server has exactly one log, and the lifetime of every log is tied directly to the lifetime of its server. When a server enlists with the coordinator, it is assigned a server id that doubles as the server’s log id. This id is used to tag and identify data written by a particular server regardless of where it is stored in the cluster. A server’s log data is automatically deleted when the coordinator removes the server from the cluster, which typically happens after the failure and recovery of that server.

Each master organizes its DRAM as a log, and the master’s replica manager works to replicate this in-memory log to stable storage on other machines. This ensures that the objects will be recovered when the master fails. From the perspective of the master replicating client data, the replica manager has a simple interface consisting of just four methods. Each is summarized below and explained in more detail later.

• allocateHead(segment) registers an empty 8 MB region of the master’s DRAM called a segment with the replica manager. After a segment is registered, the replica manager incrementally replicates data as it is added to the segment by sending it to other servers for backup (the replica manager is notified of added data via the sync call, described below). The replica manager also takes responsibility for recreating replicas of any segment that are lost due to server failures. Policy on how a master allocates DRAM for segments is left outside the replica manager, so allocation can be tuned without changes to replication code.

• sync(segment, offset) waits for an object appended to the log to be safely replicated to backup servers. It blocks when called until a segment has been replicated safely up through a specified byte offset. Masters call this after adding data to a segment to ensure that the data is durable before acknowledging the associated operation to the client.

• free(segment) reclaims log space. This call disassociates a segment’s DRAM buffer from the replica manager, which allows it to be reused. It also sends a notification to backups to inform them that it is safe to reclaim the associated space on their stable storage. Section 4.6 details how space is reclaimed both in the master’s DRAM and on the stable storage of backups.

• findReplicasForLog(serverId) is called by the coordinator during master crash recovery to find all log data for a failed server. By broadcasting to the cluster, it produces a map of where all segment replicas for a master’s log are located across the cluster. This map is used by servers participating in recovery to find and acquire log data. Section 4.4 gives the details of this operation, and Chapter 5 describes how the replicas are used during recovery.
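For reference, the interface above can be summarized in a short C++ sketch. The declarations below are illustrative assumptions based only on the descriptions in this section; the real signatures, types, and namespaces in RAMCloud differ.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Hypothetical identifiers; RAMCloud's real types differ in detail.
    using ServerId = uint64_t;
    using SegmentId = uint64_t;

    struct Segment;       // 8 MB in-memory region owned by the master
    struct ReplicaLocation { ServerId backup; SegmentId segment; };

    // Sketch of the per-master replica manager interface described above.
    class ReplicaManager {
      public:
        // Register a new, empty head segment; replication to backups proceeds
        // incrementally as data is appended and sync() is called.
        virtual void allocateHead(Segment* segment) = 0;

        // Block until the segment is durably replicated up through `offset`.
        virtual void sync(Segment* segment, size_t offset) = 0;

        // Release the segment's DRAM and tell backups to reclaim its storage.
        virtual void free(Segment* segment) = 0;

        virtual ~ReplicaManager() = default;
    };

    // Coordinator-side call (conceptually a cluster-wide broadcast): gather the
    // locations of every replica of the failed master's log.
    std::unordered_map<SegmentId, std::vector<ReplicaLocation>>
    findReplicasForLog(ServerId failedMaster);

The first three calls are made by the master that owns the log during normal operation; only the last is issued by the coordinator, and only when that master must be recovered.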

4.2 Appending Log Data

When a master receives a write request, it must make the new object durable; the master appends the new data to its DRAM log and forwards the log entry to a configurable number of backup servers. The default policy is to keep three copies of objects on disk in addition to the copy kept in the master’s DRAM. These backups buffer the information in memory and return immediately to the master without writing to disk or flash. The master completes its request and returns to the client once all of the backups have acknowledged receipt of the log data. When a backup’s buffer fills, it writes the accumulated log data to disk or flash in the background in a single large transfer, then deletes the buffered data from its memory.

Buffering and writing out data in large blocks in the background has two key performance benefits. First, disks are removed from the fast path; client write operations never wait on disks, giving RAMCloud write latencies that are only bounded by RPC round-trip times (microseconds rather than milliseconds). Second, reading and writing in large blocks amortizes the high cost of the seek and rotational latency of disks. Backups accumulate 8 MB of buffered data from a master’s log before flushing it to disk. RAMCloud uses 8 MB chunks in order to achieve effective bandwidth for each disk that is 85% of its maximum sequential read and write bandwidth even with low-end commodity hard disks (§6.2). This helps performance both while writing during normal operation and while reading during recovery.

However, while critical to RAMCloud’s low-latency operation, buffering updates also requires that buffers on backups be as durable as disk storage. Otherwise, RAMCloud would lose recent object updates during power failures, since objects buffered on backups would be lost along with the data stored in the DRAM of masters. To solve this, backups depend on battery power to allow them to flush buffers to storage in the case of power loss. Unfortunately, typical data center batteries are limited [44] and do not last long enough to reliably flush 64 to 256 GB to disk during a power failure. To make this safe without requiring additional hardware, each backup limits its total buffer capacity to ensure all data can be flushed safely to disk on power failure. If a master attempts to replicate data to a backup whose buffers are full, the backup rejects the request and the master must retry elsewhere. Backup buffer capacity is configurable and should be based on a conservative estimate of the storage device performance (which determines how quickly it can write data to disk during a power failure) and expected battery life. Ideally, backups would rely on only a small fraction of their expected battery life; for example, backups could be configured with 1.5 GB of buffering. With two flash SSDs, each backup would be able to flush its buffers to disk in less than 5 seconds, which provides a wide margin of safety.

Masters allocate and manage space for their DRAM log in units of segments. Because new objects are always appended to the log, all new data goes to the most recently added segment in the log, called the head segment. Whenever a master’s head segment becomes full, the master allocates an empty 8 MB DRAM segment and registers it with the replica manager using allocateHead. A segment also corresponds to the unit of buffering on backups, so a replica manager’s main job is to replicate the DRAM segment identically into the buffers of backups as data is added. This same segment allocation process is used when a master first starts; before it can accept incoming data from client requests, it must first register a head segment in the same way.
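The head-segment rollover just described can be sketched as follows. This is a simplified model under stated assumptions: real log entries carry headers, the allocation policy shown here is a placeholder, and the ReplicaManager methods are the hypothetical ones from the interface sketch in Section 4.1.

    #include <cstddef>
    #include <cstring>

    constexpr size_t SEGMENT_BYTES = 8 * 1024 * 1024;   // 8 MB segments (see text)

    struct Segment {
        char   data[SEGMENT_BYTES];
        size_t used = 0;
    };

    struct ReplicaManager {                 // see the interface sketch in Section 4.1
        void allocateHead(Segment* segment);
        void sync(Segment* segment, size_t offset);
    };

    // Append an entry to the in-memory log, rolling over to a freshly registered
    // head segment when the current one is full. The old head becomes immutable
    // once a new head is registered (as described below).
    void appendToLog(Segment*& head, ReplicaManager& rm,
                     const void* entry, size_t length) {
        if (head == nullptr || head->used + length > SEGMENT_BYTES) {
            head = new Segment();           // DRAM allocation policy is a placeholder
            rm.allocateHead(head);          // register the new head for replication
        }
        std::memcpy(head->data + head->used, entry, length);
        head->used += length;
        rm.sync(head, head->used);          // block until durable on backups
    }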

After registering a new head segment, the contents of the old head segment are immutable. However, the old head segment may still have active replication operations ongoing for it, since the master may still be waiting for some of the data in the old head to be replicated. Whenever a replica for an immutable segment is fully replicated to a backup, it is flushed to disk and evicted from its DRAM buffer on the backup. The replica is never read or written again unless it is needed for recovery.

To improve throughput, masters service read and write requests from clients on multiple threads in parallel. This allows reads to proceed in parallel, but the replica manager also optimizes throughput for concurrent write operations by batching backup replication requests together. To store new data, a thread copies the data to the head segment of the in-memory log and then calls sync to wait for the data to be replicated. Meanwhile, if the replica manager is already busy sending or waiting on replication requests to backups, then other threads may append data to the in-memory copy of the head segment as well. Each time the replica manager completes a replication request to a backup, it checks to see if new data has been added to the head segment that needs to be replicated. If it finds new data, then the replica manager sends out a single replication request to the backup covering all of the new data in the head segment. Figure 4.1 shows a detailed example of how batching works and how a master uses sync to ensure data is durable.
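The batching behavior can be illustrated with a minimal sketch of the per-backup loop a replica manager might run. This is an assumption-laden model (single head segment, one outstanding RPC per backup, synchronous RPC helper) rather than RAMCloud's actual multi-threaded implementation.

    #include <cstddef>

    // Minimal stand-ins for the real structures.
    struct HeadSegment {
        size_t appended;    // bytes copied into the in-memory head by writer threads
    };
    struct BackupSession {
        size_t replicated;  // bytes acknowledged as durable on this backup
        // Hypothetical RPC: replicate bytes [replicated, end) of the head segment.
        void sendReplicateRpc(const HeadSegment& head, size_t end);
    };

    // After each completed replication RPC, ship everything appended since the
    // last request as one batched request. A single request can therefore cover
    // data added by many client threads while the previous RPC was in flight.
    void replicateLoopIteration(const HeadSegment& head, BackupSession& backup) {
        size_t end = head.appended;          // snapshot of the current head contents
        if (end > backup.replicated) {
            backup.sendReplicateRpc(head, end);
            backup.replicated = end;         // advance once the RPC is acknowledged
        }
    }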

4.3 Scattering Segment Replicas

One of the main challenges of fast recovery is providing high read bandwidth from disk storage. For example, to recover 64 GB of DRAM in 1 second, at least 64 GB/s of disk bandwidth is needed. Masters solve this problem by scattering segment replicas across all the storage devices attached to all the backups of the cluster; each log segment is backed up on a different set of servers (Figure 4.2). When a master fails, all of the backups of the cluster read replicas in parallel, providing the high aggregate disk bandwidth needed for recovery.

To scatter log segments, whenever a new head segment is registered, a master’s replica manager randomly chooses enough backups from among the other servers in the cluster to meet the log’s replication requirement (three by default).


Figure 4.1: An example of how a master appends data to its DRAM log, how that data is replicated to backups, and how a master ensures appended data is durable for each operation. Objects colored gray on the master are not yet fully replicated and, thus, not guaranteed to be durable; objects colored white are fully replicated to backups. Step 1: The master’s log contains two objects that are already replicated to the buffers of three backups. Subsequently, a client makes a request to write object C on the master. Step 2: The master copies the data to the head of its DRAM log and calls sync on the offset of the log up through C. This both blocks the master thread until the data has been replicated and notifies the replica manager that it should replicate all new log data. During the call to sync, additional requests arrive at the master to write objects D and E (with these operations being handled by additional threads). Step 3: Objects D and E are copied to the head of the in-memory log. Their threads call sync on the offsets in the log up through D and E, respectively. Replication of D and E to the backups is delayed on each backup until the backup is finished with the earlier replication request. Shortly afterward, the backups finish the replication of object C. The sync call for the write of C returns and the master returns to the client notifying it that the data has been stored. Step 4: After the completion of each replication RPC, the master checks for new log data to replicate; the master batches D and E in a single replication operation to each of the backups. Step 5: When the master receives acknowledgment of the replication of D and E, it returns a success notification to each of the respective waiting clients.


Figure 4.2: Masters randomly scatter replicas of segments from their DRAM log across all the backups of the cluster. By default, three replicas of each segment are written to disks on different backups in addition to the copy of the segment the master keeps in DRAM. Scattering segments enables recovery to take advantage of the high aggregate disk bandwidth of the cluster since all the disks of the cluster can read log data in parallel.

The replica manager opens a replica on each chosen backup. Backups respond by allocating both disk and buffer space and buffering a header for the replica. The header contains the id of the master that created the replica and a segment id that is used during recovery.

An added benefit of this approach of scattering replicas across backups is that each master can write at a rate that exceeds the disk bandwidth of an individual machine. When a master fills a segment replica buffer on one set of backups, it does not wait for the buffers to be flushed to disk before starting replication for a new segment on a different set of backups. This allows a single master to use the disk bandwidth of many backups in parallel, with different groups of backups replicating different segments. As an extreme example, a single master could hypothetically write to its log using the aggregate disk bandwidth of the entire cluster if other masters in the cluster were idle. In practice, the aggregate disk bandwidth of the cluster is likely to exceed the rate at which a single master can transmit data over the network.

Carefully choosing segment replica locations is critical in minimizing recovery times. In effect, as a master creates log segments during normal operation, it is optimizing the placement of its replicas to speed the recovery of its data in the case of its demise.

Ideally, in order to recover the data from a failed server’s log as quickly as possible, all the disks in the cluster

• would begin reading replicas from disk immediately,

• would read ceaselessly at maximum rate,

• collectively would read only one replica of each segment,

• and would all finish reading their share of the data in unison.

That is, for the fastest recovery, the work of reading the replicas of the failed server from disk should be distributed uniformly across all of the storage devices in the cluster. Unfortunately, there are several factors that make this difficult to achieve:

• Replica placement must reflect failure modes. For example, a segment’s master and each of its backups must reside in different racks in order to protect against top-of-rack switch failures and other problems that disable an entire rack.

• Different backups may have different bandwidth for I/O: different numbers of disks, different disk speeds, or different storage classes such as flash memory. Masters must be aware of these differences in order to distribute replicas in a way that balances disk read times across all storage devices during recovery.

• Masters are all writing segments simultaneously, and they must avoid overloading any individual backup for best performance. Request load imbalance would result in queuing delays on backups. Furthermore, backups have limited buffer space; backups will ensure safety by rejecting new replicas, but unnecessary rejections would result in lost performance.

• Utilization of secondary storage should be balanced across the cluster.

• Servers are continuously entering and leaving the cluster, which changes the pool of available backups and may unbalance the distribution of segments.

One possible approach is to make all replica placement decisions on the coordinator. The coordinator has complete information on the number of servers in the cluster, their rack placement, and their storage capacities and bandwidths. By using this information and tracking where it had placed replicas, the coordinator could balance load evenly across backups and balance each master’s log data across backups to minimize recovery times. Unfortunately, making segment replica placement decisions in a centralized fashion on the coordinator would limit RAMCloud’s scalability. For example, a cluster with 10,000 servers could back up 100,000 segments or more per second, so requests to the coordinator to determine replica placement would become a performance bottleneck.

Instead, each master decides independently where to place each replica, using randomization with an additional optimization step. When a master needs to select a backup and storage device for a segment replica, it uses the so-called “power of two choices” randomized load balancing technique [38]. It chooses a few candidate devices at random from a list of all the storage devices from all of the backups in the cluster. Then, the master selects the best from among the candidate devices using its knowledge of where it has already allocated replicas and information about the speed of each device. To make this possible, backups measure the speed of their disks when they start up and provide this information to the coordinator, which relays it on to masters. The best storage device is the one that can read its share of the master’s segment replicas most quickly from disk during recovery. A storage device is rejected if it is in the same rack as the master or any other replica for the same segment.

Using randomization gives almost all the desired benefits; it avoids centralization that could become a bottleneck, and it avoids pathological behaviors. Randomization naturally prevents masters from choosing backups in lock-step. It also balances request load and storage utilization across the cluster. Adding the extra refinement step to masters’ random backup selection balances expected disk reading time nearly as well as a centralized manager (see [38] and [2] for a theoretical analysis). For example, if a master uses a purely random approach to scatter 8,000 segments (enough to cover 64 GB of DRAM) across 1,000 backups with one disk each, then each disk will have 8 replicas on average. However, with 80% probability some disk will end up with 18 replicas or more, which will result in uneven disk utilization during recovery. Since recovery time is determined by the slowest disk, this imbalance would more than double master recovery time.
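A minimal sketch of this candidate-based selection follows. The scoring function, rack check, and data structures are assumptions chosen to illustrate the idea; RAMCloud's actual placement code differs in structure and detail.

    #include <cstddef>
    #include <limits>
    #include <random>
    #include <vector>

    struct Device {
        int      rack;            // rack containing the backup that owns this device
        double   readBandwidth;   // MB/s, measured by the backup at startup
        unsigned replicasPlaced;  // primary replicas this master already put here
    };

    // Expected time for this device to read its share of the master's replicas.
    static double expectedReadTime(const Device& d, double segmentMB = 8.0) {
        return (d.replicasPlaced + 1) * segmentMB / d.readBandwidth;
    }

    // Pick the best of a few randomly chosen candidate devices, rejecting any in
    // a forbidden rack (the master's rack or racks already holding this segment).
    int chooseDevice(std::vector<Device>& devices,
                     const std::vector<int>& forbiddenRacks,
                     std::mt19937& rng, int candidates = 5) {
        std::uniform_int_distribution<size_t> pick(0, devices.size() - 1);
        int best = -1;
        double bestTime = std::numeric_limits<double>::max();
        for (int i = 0; i < candidates; i++) {
            size_t c = pick(rng);
            bool conflict = false;
            for (int rack : forbiddenRacks)
                if (devices[c].rack == rack) conflict = true;
            if (conflict) continue;
            double t = expectedReadTime(devices[c]);
            if (t < bestTime) { bestTime = t; best = static_cast<int>(c); }
        }
        if (best >= 0) devices[best].replicasPlaced++;
        return best;   // -1 means no acceptable candidate; retry with fresh picks
    }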

Adding just a small amount of choice makes the replica distribution nearly uniform and also allows for compensation based on other factors such as disk speed; Section 6.6 shows this technique works well even when the performance of many of the disks in a cluster is severely impaired. Choosing the best of two candidate devices is sufficient to get most of the benefit of the approach, but masters choose among five for each replica they place; there is little cost in doing so, and it slightly improves balance. Specifically, considering just two choices consistently results in 9 or 10 replicas on some of the backups, which would add 80 to 160 ms to recovery time compared with considering five candidates. This power-of-choice technique also handles the entry of new backups gracefully. Devices on a new backup are likely to be selected (five times) more frequently than existing devices until every master has brought the count of replicas on the new backup into balance with the rest of its backups.

While scattering replicas randomly has several benefits, it also has a drawback: as the cluster size scales, the system becomes increasingly vulnerable to correlated failures. For example, even for RAMCloud cluster sizes as small as 1,000 nodes, if ten machines fail simultaneously, all the replicas for at least one segment will be lost with near certainty. Intuitively, as more servers are added to the cluster, more segments must be scattered across the cluster. Each of these segments is stored on some 4-combination of the servers of the cluster (the master and its three backups). If the entire set of servers for any of these 4-combinations is lost, then some data will be unrecoverable. Because a cluster of 1,000 nodes will have nearly 8 million segments at capacity, even with just ten machines simultaneously unavailable, it is exceedingly likely that for at least one of these 4-combinations all four of its servers are among the ten that have failed. Thus, it is almost certain that all of the replicas of at least one segment would be unavailable. This problem affects not only RAMCloud but also other popular data center filesystems such as HDFS [46]. The details of this problem are outside the scope of this dissertation, but a full analysis as well as a solution that has been integrated into RAMCloud can be found in [8].

Ideally, only one replica of each segment would be read from storage during recovery; otherwise, backups would waste valuable disk bandwidth reading unneeded data, slowing recovery. Masters solve this by marking one replica of each segment they create as the primary replica.

Only the primary replicas are read during recovery unless they are unavailable, in which case backups read the redundant replicas on demand (§5.4.2). This typically happens when the backup that contained a primary replica crashes, or when a primary replica is corrupted on storage. The recovery time balancing optimizations described above consider only primary replicas, since the other replicas are never intended to be read. The redundant replicas of a segment are scattered uniformly at random across all the backups of the cluster, subject to the constraint that they are not placed on backups in the same rack as one another, in the same rack as the master, or in the same rack as the backup holding the primary replica.

Another alternative would be to store one of the segment replicas on the same machine as the master rather than forcing all replicas to be stored on remote machines. This would reduce network bandwidth requirements for replication, but it has two disadvantages. First, it would reduce system fault tolerance: the master already has one copy in its DRAM, so placing a second copy on the master’s disk provides little benefit. If the master crashes, the disk copy will be lost along with the memory copy; it would only provide value in a cold start after a power failure. Second, storing one replica on the master would limit the burst write bandwidth of a master to the bandwidth of its local disks. In contrast, with all replicas scattered, a single master can potentially use the disk bandwidth of the entire cluster (up to the limit of its network interface).

4.4 Log Reassembly During Recovery

Recovering from a log left behind by a failed master requires some reassembly steps because the log’s contents are scattered. First, the coordinator must find replicas of the failed master’s log segments among the backups of the cluster. Then, the coordinator must perform checks to ensure that none of the segments are missing among the discovered replicas. A treatment of the more complex log consistency issues that arise due to failures is deferred until Section 4.5, which details how masters and the coordinator prevent stale, inconsistent, or corrupted replicas from being used during recovery.

4.4.1 Finding Log Segment Replicas

Only masters have complete knowledge of where their segment replicas are located, but this information is lost when they crash; as described in Section 4.3, recording this information on the coordinator would limit scale and hinder common-case performance. Consequently, the coordinator must first find the replicas that make up a log among the backups of the cluster before recovery can begin.

At the start of recovery, the coordinator reconstructs the locations of the crashed master’s replicas by querying all of the backups in the cluster using the findReplicasForLog operation. Each backup maintains an in-memory list of replicas it has stored for each master, and quickly responds with the list for the crashed master. The coordinator aggregates the locations of the replicas into a single map. By using RAMCloud’s fast RPC system and querying multiple backups in parallel, the replica location information is collected quickly.

Using broadcast to locate replicas among backups may sound expensive; however, broadcasting to the entire cluster is already unavoidable at the start of recovery, since all of the servers of the cluster participate. In fact, the cost of this broadcast is mitigated, since the backups of the cluster begin reading replicas from storage immediately after receiving the broadcast; this allows backups to begin loading and processing recovery data even while the coordinator finishes setting up the recovery.

4.4.2 Detecting Incomplete Logs

After backups return their lists of replicas, the coordinator must determine whether the reported segment replicas form the entire log of the crashed master. Redundancy makes it highly likely that the entire log will be available, but the system must be able to detect situations where some data is missing (such as if multiple servers crash simultaneously). RAMCloud avoids centrally tracking the list of the segments that comprise a master’s log by making each log self-describing; the completeness of the log can be verified using data in the log itself. Each new head segment added to the log includes a log digest, which is a list of identifiers for all segments in the log at the time the segment was created. The log digest contains the exact list of segments that must be replayed in order to recover to a consistent state. Log digests are small (less than 1% storage overhead even when uncompressed, assuming 8 MB segments and 8,000 segments per master).

During the replica discovery phase of recovery, as part of findReplicasForLog, each backup returns the most recent log digest from among the segments it is storing for that log. The coordinator extracts the log digest from the most recently written segment among those on all the backups, and it checks the list of segments in the log digest against the list of replicas it collects from the backups to ensure a complete log has been found.

One threat to this approach is that if all the replicas for the most recent segment in the log are unavailable, then the coordinator could unknowingly use an out-of-date log digest from a prior segment. In this case, the most recent digest the coordinator could find would not list the most recent segment, and it would have no indication that a newer segment or log digest ever existed. This must be handled; otherwise, the system could silently recover to an inconsistent state that may not include all the data written before the master crashed.

To prevent this without resorting to centralization, a master marks replicas that contain an up-to-date digest and clears the mark on each replica when its digest becomes out-of-date. Figure 4.3 shows how this works as a master adds a new head segment to its log. When a master uses allocateHead to transition to a new head segment, a log digest is inserted and each of the replicas is marked as open on backups. The open bit indicates that the digest found in the replica is safe for use during recovery. Only after all of the replicas for the new segment contain the new log digest and have been marked as open does the master mark the replicas for the old head segment in the log as closed. This approach guarantees that the replicas from at least one segment are marked as open on stable storage; if the coordinator cannot find an open replica during recovery, then the log must be incomplete.

One complication with this approach is that open replicas from two different segments may be found during recovery: replicas from the head segment or its predecessor may simultaneously be marked as open on stable storage. One additional constraint ensures this is safe. The log ensures that no data is ever replicated into a segment until all of the replicas of the preceding segment have been marked closed (Section 4.5.3 describes what happens if this is impossible due to an unresponsive backup). As a result, the coordinator can use any of the log digests it finds in open segments; if a newer open head segment had existed and all of its replicas were lost, then the replicas could not have contained object data.
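The completeness check itself is simple once a digest has been found; a sketch under stated assumptions is shown below. The data structures are illustrative, and the real check also accounts for the log version numbers described in Section 4.5.3.

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    using SegmentId = uint64_t;

    // Given the digest from the newest open replica and the set of segment ids
    // for which at least one replica was reported by backups, decide whether the
    // coordinator has found a complete log.
    bool logIsComplete(const std::vector<SegmentId>& digest,
                       const std::unordered_set<SegmentId>& replicasFound) {
        for (SegmentId id : digest)
            if (replicasFound.count(id) == 0)
                return false;   // some segment in the digest has no replica
        return true;
    }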


Figure 4.3: When adding a segment to its log, a master orders operations to ensure that if it crashes at any point, it will always leave behind an up-to-date log digest that covers all the segments in the log that contain data needed for recovery. Step 1: A master has a log that contains two segments: 7 and 19. Segment 19 is the current head segment and is full. The master adds a new segment to its log using allocateHead. Step 2: The master injects a new log digest into Segment 20 upon its registration and waits for the digest to be fully replicated on all of the segment’s backups. Step 3: After the digest in Segment 20 is replicated, the master sends a message to the backups for Segment 19 marking it as closed. During recovery, log digests from closed replicas are ignored. After this step, only the log digest in replicas of Segment 20 will be used during recovery. Step 4: The master can safely begin to add data to Segment 20, knowing that the data stored in it is guaranteed to be recovered. If the master fails between Steps 2 & 3: Recovery could use the digest from replicas of Segment 19 or 20, but either is safe since Segment 20 contains no data. If the master fails after Step 3: Recovery is guaranteed to use the digest in replicas of Segment 20, so the master can safely add data to the segment.

If an open replica containing a log digest cannot be found, or a replica for each segment in that digest cannot be found, then RAMCloud cannot recover the crashed master. In this unlikely case, RAMCloud notifies the operator and waits for backups to return to the cluster with replicas of the missing segments. Alternatively, at the operator’s discretion, RAMCloud can continue recovery with loss of data. Section 5.6.1 details how RAMCloud handles multiple server failures where some data is unavailable.

So far these rules describe how to prevent RAMCloud from recovering when it is unsafe to do so; however, a dual to that concern is to ensure, to the extent possible, that RAMCloud always recovers when it is safe to do so. For example, the replication rules given earlier prevent a master from replicating data into a segment until all preceding segments are closed, but there is no limit on how far ahead a master can allocate empty open segments on backups. This allows masters to open several segments in advance, working ahead to hide the initial cost of contacting backups. If a master works ahead to create several open segments that only contain a log digest and no data, then the system is prone to detecting data loss even when none has occurred. The problem occurs when the log digest for one of these empty segments is flushed to backups before the log digest for a preceding empty segment. The log digest from the later segment will be found on recovery, but it includes the preceding segment that has yet to be created on backups. During recovery, the coordinator would assume data had been lost when it had not (since the missing segment must have been empty).

To prevent this problem while still allowing masters to open log segments in advance, replicas are not opened on backups for a segment until replicas for all segments preceding it in the log have been opened on backups. That is, replicas for segments are always opened in log order on backups.

4.5 Repairing Logs After Failures

A master’s log is key in recovering from server failures, but failures can damage the log. Each master must take corrective steps after the loss or corruption of log segment replicas. Section 4.5.1 describes how a master recreates lost replicas in order to maximize the chance that the coordinator will find a complete log if the master crashes. Subsequent subsections describe how inconsistent replicas can arise under specific failure scenarios, and how RAMCloud prevents these replicas from being used during recovery.

RAMCloud allows recovery to use any available replica of a segment in order to improve the chance that master recovery will be possible after a crash. This read-any property creates a tension between durability and availability; requiring only a single replica of each segment for recovery improves reliability, but it also requires the immediate correction of any inconsistencies before operation can continue after a failure. The set of replicas that RAMCloud can use for recovery must be consistent at all times, and this constraint guides the steps masters take to safeguard their logs when a server fails.

4.5.1 Recreating Lost Replicas

Each server acts as both a master and a backup, so when a server fails, not only are the DRAM contents of the master lost, but segment replicas stored on its disk are also lost. Without action, over time replica loss could accumulate and lead to data loss. Masters restore full durability after backup failures by recreating lost replicas. The coordinator notifies all of the masters whenever a backup fails. Every log is likely to have had at least some replicas on the failed backup, since masters scatter replicas across the entire cluster. Each master checks its segments to identify those that had a replica on the failed backup, and it begins creating new replicas in the background.

Recovering from a lost backup is faster than recovering from a lost master. New replicas are scattered across all of the disks in the cluster, and all of the masters recreate lost replicas concurrently. If each master has 64 GB of memory, then each backup will have about 192 GB of data that must be rewritten (assuming 3 replicas for each segment). A cluster with 200 backups each having 1 GB of unused durable buffer space can store the new replicas in less than one second; maintaining an additional 1 GB of durable buffer space would require a few seconds of spare battery backup capacity, which should be practical for most deployments. For comparison, master recovery requires about 30% more disk I/O. In master recovery, 256 GB of data must be transferred to recover the DRAM of a failed master: 64 GB must be read, then 192 GB must be written for replication while incorporating the data into the logs of the recovery masters (§5.4).
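The reaction each master takes to a backup-failure notification can be sketched as below. The structures and the scheduleRereplication hook are hypothetical placeholders; atomic recreation of the new replica is discussed in Section 4.5.2.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using ServerId = uint64_t;

    struct SegmentInfo {
        std::vector<ServerId> replicaBackups;  // backups holding this segment's replicas
    };

    // Hypothetical hook: start creating one new replica of `segment` on a freshly
    // chosen backup (placement as in Section 4.3), written atomically.
    void scheduleRereplication(SegmentInfo& segment);

    // Invoked on every master when the coordinator announces a backup failure.
    void handleBackupFailure(std::vector<SegmentInfo>& segments, ServerId failed) {
        for (SegmentInfo& segment : segments) {
            auto& backups = segment.replicaBackups;
            for (size_t i = 0; i < backups.size(); i++) {
                if (backups[i] == failed) {
                    backups.erase(backups.begin() + i);  // forget the lost replica
                    scheduleRereplication(segment);      // restore full replication
                    break;
                }
            }
        }
    }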

Lost replica recreation is not on the critical path to restoring availability after a failure, since the replicas on a backup are stored redundantly on other backups. However, rapidly recreating lost replicas improves durability, because it reduces the window of time that segments are under-replicated after failures (an analysis of this effect is given in Section 6.7). On the other hand, aggressively recreating replicas increases peak network and disk bandwidth utilization after failures. In particular, large network transfers are likely to interfere with normal client requests, which are small. In the future, it will be important that RAMCloud retain low-latency access times and high performance even with high background network bandwidth utilization, though this will require innovation at the network transport layer that is deferred to future work.

4.5.2 Atomic Replication

Unfortunately, using the normal segment replication routines to recreate segment replicas could lead to inconsistencies when the cluster experiences repeated failures. If a master fails while in the middle of recreating a replica on a backup, then the partially recreated replica may be encountered while recovering the master. RAMCloud assumes it can recover using any available replica for each of the segments, so it must ensure these inconsistent replicas are never used.

A master prevents these inconsistencies by recreating each lost replica atomically; such a replica will never be used unless it has been completely written. To implement this, a master includes a flag in each replication request that indicates to the backup that the replica is incomplete. During recovery, backups check this flag and do not report incomplete replicas to the coordinator. Once the data in the backup’s durable replica buffer is complete and identically matches the segment in the master’s memory, the master clears the flag on the replica. Effectively, the replica atomically becomes available for use in recovery when it becomes consistent with the master’s DRAM copy of the segment.
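A simplified sketch of this flag-based protocol follows. The field names and the single-call recreation are illustrative assumptions; in practice the data is shipped to the backup through a series of replication RPCs.

    #include <cstdint>
    #include <vector>

    // Sketch of the backup-side view of an atomically recreated replica.
    struct Replica {
        std::vector<uint8_t> data;
        bool open       = false;  // buffer/disk space reserved, accepting writes
        bool incomplete = true;   // if true, never report this replica for recovery
    };

    // Master-driven recreation: write all data with the incomplete flag set, then
    // clear the flag only once the replica matches the master's in-memory segment.
    void recreateReplica(Replica& r, const std::vector<uint8_t>& segmentInDram) {
        r.open = true;
        r.incomplete = true;                 // invisible to recovery until finished
        r.data = segmentInDram;              // (really a series of replication RPCs)
        r.incomplete = false;                // atomically becomes usable for recovery
    }

    // Backup-side filter used when answering findReplicasForLog.
    bool reportableForRecovery(const Replica& r) {
        return r.open && !r.incomplete;
    }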

4.5.3 Lost Head Segment Replicas

Replacing lost replicas restores durability after failures, but the lost replicas themselves are also a problem; if they are found later and used in a master recovery, then they could cause recovery to an inconsistent state.

If a master is replicating its head segment to a backup that fails, it recreates the replica on another backup and continues operation. In the meantime, the backup containing the lost replica may restart, but the lost replica stored on it may be out-of-date. The replica would be missing updates the master had made to its head segment since the backup first failed. If the master were to fail, then recovery could use this out-of-date replica and recover to an inconsistent state.

One solution to this problem is to have masters suspend all update operations until the backup containing the lost replica restarts. However, this is impractical, since the data on masters must be highly available. A master cannot suspend operations for the minutes it could take for a backup to restart. Worse, a master cannot be sure that a backup with a lost replica will ever restart at all.

Another approach is to never use replicas found on storage after a backup restarts; this would ensure that any replicas that missed updates while inaccessible would not be used during recovery. Unfortunately, this approach does not work under correlated failures. The simplest example is a power failure. If all the servers in a cluster were forced to restart, then none of the replicas found on disk could be used safely. The operator would be faced with a difficult choice; if she restarted the cluster using the replicas found on disk, the system might be in an inconsistent state. The only other option would be to discard all of the data.

Ideally, even if all of the replicas for a segment were temporarily unavailable, RAMCloud would recover if consistent replicas of the segment were later found on disk. To make this possible, masters take steps to neutralize potentially inconsistent replicas. By eliminating problematic replicas, masters can resume performing client requests quickly while still ensuring consistency, and recovery can still use the up-to-date replicas found on backups when they restart while ensuring that stale replicas are never used. When a master discovers that a replica of its head segment has become unavailable, it takes the following steps to neutralize potentially inconsistent replicas.

1. First, the master suspends replication and acknowledgment for all ongoing client operations. This is because a replica exists on the failed backup that cannot be changed or removed, so ongoing operations that were not known to have been recorded to that replica must be held until the lost replica can be neutralized. If master recovery occurs before the lost replica is neutralized, it is safe to use any of the replicas including the lost replica, since all of them include the data for all acknowledged client operations. Barring additional failures, the master only has to hold operations for a few milliseconds before it is safe to resume normal operation.

2. Second, the master atomically creates a new replica elsewhere in the cluster just as described in Section 4.5.2. This step eliminates the master’s dependency on the lost replica to fulfill its replication requirement; after this new replica is created, the master is free to kill off the lost replica without jeopardizing durability.

3. Third, the master neutralizes the lost replica. This is a challenge, since the failed backup cannot be contacted. Instead, the master informs the coordinator of the lost replica, and the coordinator agrees not to use the replica during recovery. Minimizing the burden on the coordinator is critical; more detail on how this is achieved is given below.

4. Finally, after neutralizing the out-of-date replica, the master resumes normal operation and begins replicating and acknowledging client requests.

One possible approach masters could use to invalidate potentially inconsistent replicas is to keep an explicit list of “bad” replicas on the coordinator for each master’s log. If a master were informed of the crash of a backup to which it was actively replicating a segment, the master would add a pair to the list on the coordinator that contains the affected segment id and the id of the failed backup server. During recovery, the coordinator would ignore any replicas that are part of the invalid replica list for the crashed master. Unfortunately, these invalid replica lists would be cumbersome to maintain; each list would need to be part of the coordinator’s persistent state, and when a master freed a segment that had a replica in the list, the corresponding list entry would need to be removed so that the list would not grow without bound.

Instead, each master stores a single log version number on the coordinator to achieve the same effect. Figure 4.4 gives an illustration of how this mechanism works. As a master replicates its head segment, its replicas are stamped with the master’s current log version. During recovery, the coordinator compares the failed master’s log version number with the versions the master stamped on each replica to ensure they are up-to-date.


Figure 4.4: An example of how a master and the coordinator work together to neutralize inconsistent replicas. Inconsistent replicas arise when a master is actively replicating its head segment to a backup that fails. In this example the master’s goal is to maintain two on-backup replicas of its segment. Replicas lost due to server failures are indicated with an ‘X’. Step 1: A master has a partially filled segment tagged with log version number 2. Each of its two backups is fully up-to-date and consistent with the master’s in-memory segment. A client starts a write operation, which requires the master to append data to its in-memory head segment. Step 2: The master successfully replicates the newly appended data to its first backup, but the second backup fails before the replication operation is acknowledged to the master. Step 3: The master sends a new version number of 3 to its first backup, which tags its replica with the new version number. Finally, the master recreates a new, up-to-date replica on a third backup, elsewhere in the cluster. Step 4: The master notifies the coordinator that its log version number is now 3, and it waits for an acknowledgment from the coordinator that it will never use replicas with a lower version number. Only after this point can the master safely acknowledge the pending client write. Step 5: Afterward, if the master and both backups were to crash, and the backup containing the out-of-date replica were to restart, the coordinator would not use the replica. Importantly, a crash at any point in this sequence is safe. If the master were to crash during step 2, 3, or 4, then the master may or may not be recovered with the ongoing write operation; however, since the write has not been acknowledged, this is consistent with the client’s expectations. After step 4, the client is guaranteed to see the effects of its write even if the master fails and recovers.

When a master loses a replica for its head segment, it increments the local copy of its log version number, recreates the replica elsewhere, and updates the log version number on the remaining available replicas for its head segment. Afterward, it updates its log version number on the coordinator as well; this ensures that the coordinator will discard any replicas with a lower log version during recovery. Replicas marked as closed on stable storage are excluded from this protocol; they can never be a source of inconsistency since they are immutable and are recreated atomically when lost. To achieve this, masters set the log version number of replicas to ∞ as they close them.

While this approach involves the coordinator, it does not significantly limit the scalability of the cluster, since it only requires RPCs to the coordinator on failures (as opposed to regularly during normal operation). A master has a single head segment that it is (by default) triplicating across the backups of the cluster, so a typical backup will only have three replicas being actively replicated to it at a time from the head segments of all of the masters of the cluster. Therefore, a typical backup failure will only result in about three RPCs to the coordinator to invalidate replicas; this is a trivial load even in a cluster of 10,000 machines where server failure is expected about once every 15 minutes.
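The coordinator-side filter implied by this protocol reduces to a single comparison; a sketch is shown below. The names are assumptions, and representing the closed-replica version as the maximum integer value stands in for the ∞ tag described above.

    #include <cstdint>
    #include <limits>

    using LogVersion = uint64_t;

    // Closed replicas are tagged with an "infinite" version when they are closed,
    // so they always pass the check below (they can never be stale).
    constexpr LogVersion CLOSED_VERSION = std::numeric_limits<LogVersion>::max();

    // Coordinator-side filter during recovery: a replica of the head segment is
    // usable only if its stamped version is at least the log version the master
    // last recorded on the coordinator.
    bool replicaIsUsable(LogVersion stampedOnReplica,
                         LogVersion recordedOnCoordinator) {
        return stampedOnReplica >= recordedOnCoordinator;
    }

In the scenario of Figure 4.4, the out-of-date replica is stamped with version 2 while the coordinator records version 3, so the check above rejects it.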

4.5.4 Preventing Inconsistencies Due to Overloaded Backups

As mentioned in Section 4.2, overloaded backups may need to reject replication operations from masters. A backup is overloaded when

• it has insufficient disk space to store a replica,

• it has insufficient buffer space to store a replica, or

• it has insufficient network bandwidth to accept new replica data quickly.

Rejecting replication requests is a form of transient failure; if backups arbitrarily refuse replication operations then the replicas they store may become inconsistent. To prevent inconsistencies, a backup may only reject a master’s initial replication re- quest for each replica. If a backup rejects this first request to store data for a replica, then it never records any state about the segment and the master retries elsewhere. However, if the backup acknowledges it has accepted and replicated the data in the first request, then it must have already reserved all resources needed to ensure the durability of the replica. This CHAPTER 4. FAULT-TOLERANT DECENTRALIZED LOGS 43

This means backups must always allocate the needed disk and buffer space for the replica before they respond to a replication request that starts a new replica. Once a replica is opened on a backup, all well-formed replication requests for that replica by the master are guaranteed to succeed in the absence of a serious failure (for example, a network partition or a failed disk), so replicas never become inconsistent due to rejected replication operations.
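A simplified backup-side admission check illustrates the rule: resources are reserved in full before the open request is acknowledged, and subsequent writes to an open replica are never rejected. The resource accounting here is a hypothetical simplification.

    #include <cstddef>

    struct BackupResources {
        size_t freeBufferBytes;
        size_t freeDiskBytes;
    };

    constexpr size_t SEGMENT_BYTES = 8 * 1024 * 1024;

    // A request that opens a new replica may be rejected if resources are short.
    bool acceptOpenRequest(BackupResources& res) {
        if (res.freeBufferBytes < SEGMENT_BYTES || res.freeDiskBytes < SEGMENT_BYTES)
            return false;                       // reject: master retries elsewhere
        res.freeBufferBytes -= SEGMENT_BYTES;   // reserve everything the replica
        res.freeDiskBytes   -= SEGMENT_BYTES;   // will ever need before replying
        return true;
    }

    // Once the replica is open, well-formed writes to it are always accepted.
    bool acceptWriteToOpenReplica() {
        return true;
    }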

4.5.5 Replica Corruption

Finally, RAMCloud must address inconsistencies that arise due to storage or media corruption. Masters prevent the use of corrupted replicas by embedding a checksum in metadata stored along with each replica. The checksum covers log entry types and lengths in the prefix of the replica up through a length specified in the metadata. The metadata itself (excluding the checksum field) is covered by the checksum as well. Upon reading a replica during recovery, a backup computes a checksum of the metadata and the log entry types and lengths stored in the specified prefix of the replica and compares the result with the checksum stored in the metadata. If the checksums do not match, then the replica will not be used. Object data is not covered by the checksum. Each object is covered by an individual checksum that is checked by recovery masters during recovery (§5.6.2).
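The check a backup performs when loading a replica can be sketched as follows. The metadata layout and checksum routine are placeholders, since the text does not specify them; only the overall compare-and-reject logic is taken from the description above.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct ReplicaMetadata {
        uint32_t certifiedLength;   // length of the prefix covered by the checksum
        uint32_t checksum;          // covers entry types/lengths plus this metadata
    };

    // Placeholder: compute a checksum over the entry types and lengths in the
    // replica's prefix and over the metadata fields (excluding `checksum` itself).
    uint32_t computeChecksum(const ReplicaMetadata& meta,
                             const std::vector<uint8_t>& replica,
                             size_t prefixLength);

    // Reject the replica if the recomputed checksum does not match the stored one.
    bool replicaPassesIntegrityCheck(const ReplicaMetadata& meta,
                                     const std::vector<uint8_t>& replica) {
        if (meta.certifiedLength > replica.size())
            return false;                   // metadata itself is implausible
        return computeChecksum(meta, replica, meta.certifiedLength) == meta.checksum;
    }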

4.6 Reclaiming Replica Storage

As clients append new objects to a log, these new object values supersede older values in the log, and logs accumulate entries over time that waste space both in DRAM and on disk. Since recovery only depends on the most recent value associated with each key, these stale values are not needed. Eventually, this wasted space grows and must be consolidated and reclaimed for reuse with new objects.

To combat this, masters use cleaning techniques similar to those used in log-structured filesystems [42] to “repack” log data and reclaim space. Each master keeps a count of unused space within each segment, which accumulates as objects are deleted or overwritten. The master reclaims wasted space by occasionally invoking a log cleaner. The cleaner selects one or more segments to clean and rewrites the live entries from them into a smaller set of new segments.

Afterward, the master is free to reuse the DRAM that contained the cleaned segments. Similarly, replicas of the cleaned segments are no longer needed; the master contacts backups that hold these replicas, notifying them that it is safe to reuse the space that the replicas occupy on storage and any associated buffers. The details and policies of the log cleaner are analyzed elsewhere [43].

Typically, masters notify backups when it is safe to reuse storage space, but due to server failures, replicas cannot always be removed from backups using explicit free RPCs from the masters that created them. There are two cases when a replica must be freed by some other means: when the master that created it crashes (§4.6.1), and when a backup with the replica on its storage crashes and restarts later (§4.6.2).

4.6.1 Discarding Replicas After Master Recovery

When a master crashes, all of the backups are likely to contain some replicas from the crashed master, and eventually these replicas must be freed from storage. However, the normal-case approach of using explicit free notifications to backups does not work, since the master that created them is gone. At the same time, RAMCloud needs these replicas immediately after a master failure to perform recovery, so backups must not be overly eager to discard replicas when masters crash. As a result, backups are notified when a failed master has been recovered. Afterward, the replicas of the master’s log segments are no longer needed and backups discard them.

To make this possible, each backup tracks changes to a server list that contains the identities, statuses, and network addresses of other cluster members. The coordinator maintains the authoritative copy of this list and disseminates updates to cluster members as server statuses change. Initially, when a server joins the cluster it is marked as up. When a server crashes, its entry in the server list is marked as crashed until all tablets owned by the crashed master are recovered elsewhere. At that point, the server list entry is removed. During the course of normal operation, a backup retains replicas from a server that is marked as up or crashed in the server list, and it reclaims space used by replicas from a server that is not present in the server list (which indicates the server is down).
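The normal-operation retention rule reduces to a lookup in the backup's view of the server list; a sketch under assumed types is shown below.

    #include <cstdint>
    #include <unordered_map>

    using ServerId = uint64_t;
    enum class ServerStatus { UP, CRASHED };   // absence from the list means "down"

    // Keep replicas whose creating master is up or crashed (it may still need
    // recovery); discard them once the master has been removed from the list.
    bool shouldRetainReplica(
            ServerId creator,
            const std::unordered_map<ServerId, ServerStatus>& serverList) {
        return serverList.find(creator) != serverList.end();
    }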

4.6.2 Redundant Replicas Found on Storage During Startup

A second problem arises because of backup failures. As described in Section 4.5.1, masters recreate replicas that were stored on failed backups. If masters have already recreated the replicas stored on a failed backup before a new backup process restarts on the same node, then the newly starting process may find replicas on storage that are no longer needed. On the other hand, if multiple servers failed at the same time, then a restarting backup process may find replicas on storage that are needed to recover some failed masters: all of the other replicas of some segments may have been lost before they could be recreated.

Therefore, each backup scans its storage when it first starts up, and it checks replicas left behind by any old, crashed RAMCloud process on that node to see if they are still needed. As before, for each of the replicas it finds on disk, it uses the server list to determine whether the replica should be retained or discarded. Just as during normal operation, replicas for a crashed server must be kept, since the server may be under active recovery. Replicas for a down server can be discarded, since its data must have been recovered elsewhere.

Replicas from a master that is still up present a challenge. If a replica from a server marked as up in the server list is found on storage, then the server that created it may have already recreated the replica elsewhere, in which case it can be discarded. However, it is also possible that the master has not yet recreated the lost replica, in which case the replica found on storage may still be needed if the master crashes. For example, this occurs when a backup finds a replica on storage for a master that is marked as up in the backup’s server list, but the master has actually crashed and the backup has not been notified yet; in that case, the replica on storage may be critical for recovering the failed master and must not be discarded by the backup.

On startup, if a backup finds a replica on storage that was created by a master that is still up in the backup’s server list, then the backup queries the originating master directly to determine whether the replica is safe to discard or not. When queried about a specific replica, a master informs the calling backup that it is safe to discard the replica if the master’s corresponding segment is already adequately replicated. If the segment is still undergoing replication, then the master tells the backup to keep the replica. In this case, the backup will recheck with the master for this replica periodically to see if the segment has been fully replicated (in the expectation that the master will eventually recover from the loss of the replica, making it safe to discard).

If the master fails in the meantime, then the backup reverts to the usual rules; it retains the replica until the master has been recovered. This process of querying replica statuses from masters only lasts briefly when a server starts, until all replicas found on storage have been recovered elsewhere by masters or the masters from which they originated have crashed; after that, replicas for servers that are up in the server list are retained until explicitly freed, as usual.

This approach of querying the masters requires one additional constraint. The problem is that when queried about whether a particular replica is needed, a master’s reply is only based on whether or not the corresponding segment is fully replicated. This presupposes that the master has already been informed that the backup crashed. However, this presupposition could be wrong: it may be the case that the master has yet to receive or process the notification of the failure that led to the restart of the backup. If this is the case, then the master is counting the old replica found on storage as a valid replica for the segment. If the master reports to the backup that the segment is fully replicated and the restarting backup process deletes it, then the segment will be under-replicated until the master receives the server failure notification and recreates the replica elsewhere.

To solve this, the coordinator guarantees that all servers in the cluster will learn of a backup’s crash before they learn that the backup has been restarted. When a server process first starts up and enlists with the coordinator, it provides the server id of the prior RAMCloud server process that ran on that machine, which it recorded on its storage device. The coordinator ensures that when it sends server list change events to the cluster members, the failure notification for the former server precedes the enlistment notification for the server that replaces it. After enlistment, the restarting server updates the server id on its storage, so that the next crash will be handled correctly.

To ensure that a master has been informed of the failure of a restarting backup before it replies to a replica status query, the master ensures that a querying backup is up in its server list. The ordering constraint on server list updates ensures that if the backup is up in the master’s server list, then the master must have processed the failure notification about the failed process that the new backup is replacing. When a master processes a failure notification of a backup, it immediately “forgets” about replicas on the failed backup and begins replication elsewhere for affected segments.

begins replication elsewhere for affected segments. Hence, if a backup querying a master about a replica is up in the server list of the master and the master is not actively replicating the segment, then the master must have already re-replicated the segment elsewhere. When a querying backup is not in a master’s server list, then the master tells the backup to ask about the replica again later. This ensures that a master never prematurely tells a backup to discard a replica.
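
The retention rules above reduce to a small decision procedure on the restarting backup. The following C++ sketch is illustrative only: the type and function names (decideFate, queryOriginatingMaster, the three-way fate) are inventions for this example and do not correspond to RAMCloud's actual interfaces.

    #include <cstdint>
    #include <iostream>

    enum class ServerStatus { UP, CRASHED, REMOVED };     // status in the backup's server list
    enum class Fate { KEEP, DISCARD, ASK_MASTER_LATER };

    struct MasterReply {
        bool backupKnownToMaster;     // is the querying backup up in the master's server list?
        bool segmentFullyReplicated;  // has the master re-replicated this segment elsewhere?
    };

    // Stand-in for the RPC a backup would send to the originating master;
    // a real implementation would go over the network.
    static MasterReply queryOriginatingMaster(uint64_t /*masterId*/, uint64_t /*segmentId*/)
    {
        return {true, true};
    }

    static Fate decideFate(ServerStatus masterStatus, uint64_t masterId, uint64_t segmentId)
    {
        switch (masterStatus) {
        case ServerStatus::CRASHED:
            return Fate::KEEP;        // the master may be under active recovery
        case ServerStatus::REMOVED:
            return Fate::DISCARD;     // its data has already been recovered elsewhere
        case ServerStatus::UP: {
            MasterReply reply = queryOriginatingMaster(masterId, segmentId);
            if (!reply.backupKnownToMaster)
                return Fate::ASK_MASTER_LATER;   // master has not yet processed our predecessor's failure
            return reply.segmentFullyReplicated ? Fate::DISCARD : Fate::ASK_MASTER_LATER;
        }
        }
        return Fate::KEEP;            // default to keeping the replica
    }

    int main()
    {
        Fate f = decideFate(ServerStatus::UP, 17, 3);
        std::cout << (f == Fate::DISCARD ? "discard\n" : "keep or ask again later\n");
    }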

4.7 Summary

Each master in RAMCloud uses a log that provides high durability for client data, low-latency client reads and writes during normal operation, and high-bandwidth reads for fast recovery. Masters store one copy of their log entirely in local DRAM to support low-latency data access. Masters replicate their logs for durability using cheap disk-based storage on other nodes, called backups. Because backups buffer updates (requiring them to briefly rely on battery backups during power failures), masters can durably replicate objects with latencies limited by network round-trip times rather than hard disk seek times. To provide the high read bandwidth needed for fast recovery, masters divide their logs into segments and scatter replicas of them across the entire cluster, so replicas can be read from all the disks of the cluster in parallel. A master's replica manager works in the background to maintain redundancy of log segments as nodes fail, and it takes steps to ensure the master will only be recovered when a consistent and up-to-date view of its log is available. Masters inject self-describing metadata (log digests) into their logs so that they can detect data loss during recovery. When backups crash and restart, some of the replicas they hold on storage may become stale. Masters ensure these stale replicas are never used during recovery by tagging each replica with a log version number that is also stored on the coordinator. The coordinator uses the log version number to ensure that recovery only uses up-to-date replicas. The log version number on the coordinator is only updated by a few servers in the cluster when a backup fails, so it will not become a bottleneck, even in large clusters. By containing the complexity of fault-tolerance and consistency, replica managers ease the task of recovery. Replicas stored on backups are typically freed by a master's log manager when they are

no longer needed, but space reclamation must happen in other ways when masters crash. Backups receive server status notifications from the coordinator. When a master has been fully recovered, backups discard the replicas created by that master that are still on storage. When a backup fails and a new backup process restarts on the same node, some of the replicas found on storage may no longer be needed. Backups query the originating master for each on-storage replica to see if it is still needed. Once the master has recreated the replica elsewhere in the cluster, the old replica is removed from the restarting backup's storage. The next chapter describes how RAMCloud uses these logs to recover the contents of a failed master in just 1 to 2 seconds.

Chapter 5

Master Recovery

When a master crashes in RAMCloud, the data it contained is unavailable until it can be recovered into DRAM on other servers in the cluster. As described in Chapter 4, each master organizes the objects in its DRAM as a log that is also stored redundantly across all of the disks of the cluster. This chapter details how RAMCloud uses that log to restore availability when a master fails. RAMCloud uses fast recovery instead of relying on expensive in-memory redundancy or operating in a prohibitively degraded state using data on disk. If master recovery takes no longer than other delays that are common in normal operation and if crashes happen infrequently, then crash recovery will be unnoticeable to applications. RAMCloud's goal is to recover failed servers with 64 to 256 GB of DRAM in 1 to 2 seconds. For applications that deliver content to users over the Internet, this is fast enough to constitute continuous availability. RAMCloud's master failure recovery mechanism relies on several key concepts that make it fast and safe while minimizing complexity.

• Leveraging scale: Most importantly, RAMCloud leverages the massive resources of the cluster to recover quickly when masters fail. No single machine has the disk bandwidth, network bandwidth, or CPU cycles to recover the contents of a crashed server quickly. Recovery takes advantage of data parallelism; the work of recovery is divided across thousands of backups and hundreds of recovery masters, and all of the machines of the cluster perform their share of recovery concurrently.


• Heavy pipelining: Not only is recovery parallelized across many machines to minimize recovery time, but equally critically, each server pipelines its work to efficiently use all of its available resources. Each machine overlaps all of the stages of recovery data processing; during recovery, a single server may be loading data from disk, processing object data, sending log entries out over the network, receiving log entries from other machines over the network, reconstructing data structures, and sending out data for replication, all at the same time.

• Balancing work: Recovery time is limited by the server that takes the longest to complete its share of work, so precisely balancing recovery work across all the servers of the cluster is essential for fast recovery. Large scale makes this a challenge; with hundreds or thousands of machines working together, recovery must operate with minimal coordination, yet each individual machine must stay busy and complete its share of work at about the same time as the rest of the cluster.

• Autonomy of servers: Each server makes local choices to the extent possible during recovery. This improves scalability since the cluster needs little coordination. After initial coordination at the start of recovery, backups and recovery masters are given all the information they need to work independently. Recovery masters only call back to the coordinator when their share of recovery is complete.

• Atomicity and collapsing error cases: In a large-scale cluster, failures can happen at the worst of times; as a result, RAMCloud must operate correctly even when failures happen while recovering from failures. Failure handling is simplified during recovery by making the portion of recovery on any individual server atomic and abortable. That is, if a server participating in recovery crashes or encounters a problem, it can give up on its share of the work without harm. Portions of a recovery that were aborted are retried later, hopefully when the conditions that prevented success have subsided. This reduces the number of exceptional cases that recovery must handle, and it allows most issues to be handled by retrying recovery. Similarly, multiple failures and total cluster restart are handled using the master recovery mechanism, which eliminates the need for a separate, complex mechanism and exercises the master recovery code more thoroughly.

Figure 5.1: The work of recovery is divided across all the backups of the cluster and many recovery masters. Each backup reads segment replicas from its disk and divides the objects into separate buckets for each partition. Each recovery master only fetches the objects from each backup for the keys that are part of its assigned partition.

5.1 Recovery Overview

When a RAMCloud storage server crashes, the objects that had been present in its DRAM must be recovered by "replaying" the master's log before the objects can be accessed by clients again. This requires reading log segment replicas from backup storage back into DRAM, processing the records in those replicas to identify the current version of each live object, and reconstructing a mapping from each key to the current version of each object. The crashed master's data will be unavailable until the objects from its log are read into DRAM and a mapping from each key to the most recent version of each object has been reconstructed. As described in Chapter 3, recovery must overcome two key bottlenecks. First, data must be read from storage at high bandwidth in order to recover an entire server's DRAM in 1 to 2 seconds. To solve this, each master scatters its log data across all of the disks of the cluster as 8 MB segments. This eliminates disks as a bottleneck, since the log data can be read from all of the disks of the cluster in parallel. Second, no single server can load all of the failed master's data into DRAM and reconstruct a hash table for the objects in 1 to 2 seconds. Consequently, the failed master's data is partitioned across 100 or more

recovery masters by key. Each recovery master replays the log entries for a subset of the crashed master's keys, reconstructs a hash table for its assigned portion of the failed master's key space, and becomes the new home for the recovered objects. This allows RAMCloud to spread the work of replaying log data across hundreds of NICs, CPUs, and memory controllers. Figure 5.1 illustrates how recovery is spread across all of the machines of the cluster and how data is loaded from disks and shuffled among recovery masters during recovery. As an example, a cluster of 1,000 machines should be able to recover a failed server with 64 GB of DRAM in less than one second. Recovery masters process data as it is loaded by backups. With a single 100 MB/s disk on each server, a failed server's log can be read in about half a second. Similarly, 100 recovery masters can load that data into memory in about half a second using a 10 Gbps network. Furthermore, backups and recovery masters overlap work with each other to minimize recovery time. Overall, recovery combines the disk bandwidth, network bandwidth, and CPU cycles of many machines in a highly parallel and tightly pipelined effort to bring the data from the crashed server back online as quickly as possible. The coordinator supervises the recovery process, which proceeds in four phases that will be described in detail in the sections that follow:

1. Failure detection: When a server crashes, the first step in recovery is detecting the failure. Failure detection must be fast since any delay in detecting a failure also delays recovery. To speed detection, servers monitor one another, and peers of a crashed server detect its failure and report it to the coordinator.

2. Setup: During the setup phase of recovery, the coordinator decides how the work of recovery will be divided across the machines of the cluster, and it provides each machine with all the information it needs to perform its share of the work without further coordination. During this phase, the coordinator tells backups to start loading replicas, verifies that the complete log is available on backups, and assigns the crashed master's tablets to recovery masters.

3. Replay: The replay phase is both data-parallel and pipelined. Each recovery master sends requests to several backups in parallel to fetch the objects relevant to the portion

of the crashed master it is recovering. Recovery masters incorporate the data returned by backups into their log and hash table and re-replicate the data as part of their log, just as during normal operation. At the same time, backups load segment replicas from the crashed master stored on disk, bucket objects by partition, and accept new replica data. Machines operate independently of the coordinator during the replay phase.

4. Cleanup: Each recovery master notifies the coordinator when it finishes replay, and it begins servicing requests for the portion of the tablets that it recovered. Once all of the recovery masters complete and all of the partitions of the key space have been recovered, the backups of the cluster begin to reclaim the on-disk storage occupied by the failed master’s log and free up any state and buffers used as part of recovery. If recovery fails for any of the partitions of the crashed master, then the log data is retained and recovery is retried later.

The following sections provide detail on each of these phases.

5.2 Failure Detection

Server failures must be detected quickly in order to recover quickly, since detection time delays recovery. Whereas traditional systems may take 30 seconds or more to decide that a server has failed, RAMCloud must make that decision in a few hundred milliseconds. Fast failure detection also improves durability, since the recreation of lost replicas from failed backups starts more quickly. RAMCloud detects failures in two ways. First, if a server is slow in responding to an RPC, the caller (either a RAMCloud client or another RAMCloud server) pings the callee periodically in the background until the RPC response is received. These pings are application-level RPCs (as opposed to lower-level ICMP pings that would only be serviced by the kernel), so they provide some measure of assurance that the remote RAMCloud server is still functional. If the callee is responsive to the ping requests, then the caller assumes the callee is still alive and that the callee is still processing the caller's RPC request. If a ping request times out, then the caller assumes the callee has crashed.

When one server suspects another has failed, it notifies the coordinator, which then investigates the server. This extra step prevents servers from being detected as failed when a communication problem between two servers is due to a problem at the caller rather than the callee. After receiving a failure suspicion notification, the coordinator issues ping requests to the suspected server. If the coordinator's application-level ping requests time out, then the server is marked as failed and the coordinator initiates recovery for it. Unfortunately, this first technique only detects the failure of servers that are actively servicing RPCs; if some servers only service RPCs infrequently, then the approach described above may delay recovery. RAMCloud uses a second technique so that these failures can be detected more quickly. Cluster members continuously ping one another (at application level) to detect failures even in the absence of client activity; each server periodically issues a ping RPC to another server chosen at random. Unanswered ping requests are reported to the coordinator. Just as before, the coordinator verifies the problem by attempting to directly ping the server, then initiates recovery if the server does not respond. Randomizing ping operations means the detection of failed idle servers is only probabilistic. There is some chance that failed idle servers could go undetected indefinitely. However, this is exceedingly unlikely in practice. The probability of detecting a crashed machine in a single round of pings is about 63% for clusters with 100 or more nodes; the odds are greater than 99% that a failed server will be detected within five rounds. As an alternative, the coordinator could periodically ping all cluster members to ensure their liveness. The current coordinator can broadcast to 80 machines in 600 µs by using low-latency Infiniband messaging and by sending pings in parallel. We expect this approach to scale to broadcast to 10,000 nodes in about 75 ms, so it is plausible for the coordinator to monitor the liveness of a large cluster. However, the coordinator would need to keep its network card mostly utilized to detect failures quickly, leaving little capacity for other operations. Having servers randomly ping one another reduces the load on the coordinator, improves failure detection time, and scales naturally as the cluster grows. Since the work of pinging the servers is divided up, servers can afford to ping each other more frequently than a single coordinator could in a large cluster. Servers are pinged much more frequently

than every 75 ms; with high probability each server is pinged every 100 to 500 µs. Consequently, RAMCloud speeds the failure detection of idle servers without substantially increasing coordinator load by distributing the work of monitoring for failures across the cluster; this is another example of how RAMCloud uses randomized techniques to avoid centralized bottlenecks that could limit scalability. Short failure detection timeouts also present some potential pitfalls. For example, overly hasty failure detection introduces a risk that RAMCloud will treat performance glitches as failures, resulting in unnecessary recoveries. The load of these unnecessary recoveries could destabilize the system further by causing additional recoveries. In the worst case, the result would be a sort of "recovery storm" where recoveries would trigger waves of cascading failures. Currently, it is unclear whether this will be a problem in practice, so it is left to future work. One potential way to mitigate this threat is to have the coordinator give servers a second chance before recovering them. In this approach, the coordinator would ping a server marked for recovery one final time just before recovering it. If it responded to the ping, then it would be restored to normal status in the cluster without undergoing recovery. This would give misbehaving servers more time to respond to pings when the cluster is already performing another recovery, since the coordinator performs recoveries one-at-a-time. This approach might be effective in fending off recovery storms, since the grace period given to a misbehaving server would be proportional to the number of concurrently failed servers in the cluster. Another potential pitfall is that short timeouts may preclude the use of some network protocols. For example, most TCP implementations wait 200 ms before retransmitting lost packets. Since RAMCloud pings are configured to time out more quickly, a single lost packet could lead to a spurious failure detection. RAMCloud supports TCP, but it must use a long failure detection interval, which lengthens unavailability on failures. Other transports, such as Infiniband, support lower timeouts and faster failure detection, and RAMCloud provides an IP-capable transport similar to TCP that also supports lower timeouts.
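
The detection-probability figures above can be checked with a short calculation: if each of the n − 1 surviving servers pings one uniformly random peer per round, the chance that none of them happens to pick the crashed server in a round is roughly (1 − 1/(n − 1))^(n−1) ≈ 1/e. The C++ sketch below is only an illustration of this arithmetic under that simple model, not RAMCloud code.

    #include <cmath>
    #include <cstdio>

    int main()
    {
        for (int n : {100, 1000, 10000}) {
            double live = n - 1;   // servers still alive after one crash
            // Chance that no live server picks the crashed server in one round of pings.
            double missOneRound = std::pow(1.0 - 1.0 / live, live);
            std::printf("n=%5d  detect in 1 round: %4.1f%%  within 5 rounds: %6.2f%%\n",
                        n, 100.0 * (1.0 - missOneRound),
                        100.0 * (1.0 - std::pow(missOneRound, 5.0)));
        }
    }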

5.3 Setup

Once the coordinator has verified that a master has failed, it enters the setup phase of master recovery. During this phase, the coordinator gives each server all of the information it needs to perform its share of the work of recovery, so once this phase completes, recovery proceeds independently of the coordinator. The coordinator finds the log segment replicas left behind by the crashed master, verifies that the log is undamaged and safe to use for recovery, divides up the work of recovery into equal-size partitions, and distributes the work of recovery across several of the servers in the cluster.

5.3.1 Finding Log Segment Replicas

As described in Section 4.4.1, a master does not keep a centralized list of the segments that comprise its log. This prevents the coordinator from becoming a bottleneck during normal operation. Each master tracks where its segments are replicated, but this information is lost when it crashes. The coordinator discovers the locations of the crashed master's replicas by querying all of the backups in the cluster using findReplicasForLog (§4.4). The coordinator aggregates the locations of the consistent replicas into a single map. This query broadcast is fast; the coordinator uses RAMCloud's fast RPC system and queries multiple backups in parallel. The current coordinator implementation can broadcast to 80 machines in 600 µs. Scaling this approach to 10,000 nodes would result in broadcast times of about 75 ms, so in large clusters it may make sense to structure broadcasts in a tree-like pattern. The coordinator broadcast is also used to give an early start to disk reading operations on backups, so its overhead is mitigated. Backups start reading replicas from storage immediately after they reply with a replica list. Even as the coordinator finishes setup, backups are already reading the data from storage that the recovery masters will request during the replay phase.
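
To illustrate the shape of this broadcast (not RAMCloud's actual RPC layer), the sketch below issues a hypothetical findReplicasForLog call to every backup concurrently and merges the replies into a single location map; all names and types here are illustrative assumptions.

    #include <cstdint>
    #include <future>
    #include <map>
    #include <vector>

    struct ReplicaInfo { uint64_t segmentId; bool primary; };

    // Stand-in for the RPC sent to one backup; a real call would use RAMCloud's RPC system.
    static std::vector<ReplicaInfo>
    findReplicasForLog(uint64_t /*backupId*/, uint64_t /*crashedMasterId*/)
    {
        return {};
    }

    // Query every backup concurrently and merge the replies into one
    // segmentId -> "backups holding a replica" map.
    static std::map<uint64_t, std::vector<uint64_t>>
    locateReplicas(const std::vector<uint64_t>& backups, uint64_t crashedMasterId)
    {
        std::vector<std::future<std::vector<ReplicaInfo>>> rpcs;
        for (uint64_t backupId : backups)
            rpcs.push_back(std::async(std::launch::async,
                                      findReplicasForLog, backupId, crashedMasterId));

        std::map<uint64_t, std::vector<uint64_t>> locations;
        for (size_t i = 0; i < rpcs.size(); i++)
            for (const ReplicaInfo& r : rpcs[i].get())
                locations[r.segmentId].push_back(backups[i]);
        return locations;
    }

    int main()
    {
        locateReplicas({1, 2, 3}, 42);
    }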

5.3.2 Detecting Incomplete Logs

After all of the backups return their lists of replicas, the coordinator must determine whether the reported segment replicas form the entire log of the crashed master. Redundancy makes it highly likely that the entire log will be available, but the system must be able to detect situations where data is missing (for example, during massive failures such as network partitions or power outages). As described in Section 4.4.2, backups return a log digest to the coordinator during findReplicasForLog that is used to verify whether the set of replicas found constitutes a complete and up-to-date view of the log from the failed server. In the unlikely case that findReplicasForLog cannot find a complete log, the coordinator aborts and retries the recovery. In the meantime, if crashed servers rejoin the cluster, then a future recovery for the server may succeed. If replicas for some of the segments are persistently unavailable, then the user can drop the affected tablets, which will cause the coordinator to give up retrying the recovery. Another option would be to allow the user to permit recovery with an incomplete log. The user would have no guarantees on the completeness or consistency of the data after recovery, and RAMCloud currently provides no interface for this. For safety and consistency, each recovery master must restrict the replicas it replays to those that are part of the coordinator's aggregated replica location list. The coordinator ensures that the replicas in its list are all consistent and represent the entire log of the crashed master (§4.4). At the end of the setup phase, the coordinator provides this list to each recovery master (§5.3.4), and recovery masters only replay replicas from the list. Even if new servers with additional replicas are added to the cluster during recovery, recovery masters must not use replicas stored on them, since some of them may be stale.
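
In essence, the completeness check reduces to set coverage: every segment id named in the newest log digest must appear in the aggregated replica location map. The following minimal C++ sketch uses invented types; selecting the correct digest and validating replica consistency (§4.4) are omitted.

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <vector>

    // Returns true only if every segment named in the newest log digest has at
    // least one usable replica in the aggregated location map; otherwise the
    // coordinator would abort and retry the recovery later.
    static bool logIsComplete(const std::vector<uint64_t>& digestSegmentIds,
                              const std::map<uint64_t, std::vector<uint64_t>>& replicaLocations)
    {
        for (uint64_t segmentId : digestSegmentIds) {
            auto it = replicaLocations.find(segmentId);
            if (it == replicaLocations.end() || it->second.empty())
                return false;
        }
        return true;
    }

    int main()
    {
        std::map<uint64_t, std::vector<uint64_t>> locations = {{1, {7}}, {2, {9, 11}}};
        std::cout << logIsComplete({1, 2}, locations) << "\n";     // 1: complete
        std::cout << logIsComplete({1, 2, 3}, locations) << "\n";  // 0: segment 3 has no replica
    }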

5.3.3 Grouping Tablets Into Partitions

Next, the coordinator must divide up the work of recovery among a set of recovery masters. The coordinator does this by grouping the tablets of the crashed master into partitions. Then, the coordinator assigns each partition to exactly one recovery master for recovery (§5.3.4). Later, during the replay phase of recovery, each recovery master replays the objects from the crashed master's log that are part of its partition and re-replicates them as part of

its own log. When recovery completes, the recovery masters become the new home for the tablets in their assigned partitions. RAMCloud partitions log data entirely at recovery time rather than having masters divide their data during normal operation. Waiting until recovery time ensures that the work of splitting log data into partitions is only done once and only when it actually matters: at the time the server has crashed. Furthermore, delaying partitioning allows the partition scheme to be based on the final distribution of the object data across the crashed master's tablets. Another benefit is that the coordinator could factor in the global state of the cluster. For example, the coordinator could use server availability or load information in determining partitions. The current implementation does not factor global cluster state into partitioning. The coordinator's goal in determining the partitions is to balance the work of recovery such that all recovery masters finish within the 1 to 2 second recovery time budget. Two key factors determine the time a recovery master needs to perform its share of the recovery. First, recovery masters will be limited by network bandwidth for large partitions. Second, recovery masters will be limited by cache misses for hash table inserts for partitions with many objects. As a result, the coordinator must take care to ensure that no single recovery master is saddled with recovering a partition that is too large or that has too many objects. What constitutes "too much" or "too many" is determined experimentally (§6.3) by benchmarking recovery masters. For our machines, each partition is typically limited to 500 MB to 600 MB and 2 to 3 million objects. In order to create a balanced set of partitions, the coordinator must know something about the tablets that were on the crashed master. One possibility would be for each master to record tablet statistics on the coordinator when the balance of data in its tablets shifts. However, RAMCloud's high performance means tablet sizes can shift quickly. A master may gain 600 MB of data in about 1 second, and data can be deleted even more quickly (an entire 64 GB server can be emptied in a fraction of a second). With a centralized approach in a large cluster, tablet statistics on the coordinator might be updated more than 10,000 times per second. Instead, masters rely on decentralized load statistics that they record as part of their log along with the log digest in each segment. Whenever a master creates a new segment, it

records the number of objects and the number of bytes in each of its tablets at the start of the segment. However, a single master may contain millions of small tablets. To prevent tablet statistics from using a substantial fraction of the log (and DRAM), statistics are only approximate beyond the largest two thousand tablets on each master; the remaining tablets are summarized as one record containing the total number of objects and bytes in the remaining smaller tablets along with a count of how many smaller tablets there are. These statistics are recovered from the log along with the log digest during the findReplicasForLog operation at the start of recovery (§4.4). A log digest is always necessary for recovery, so this approach is guaranteed to find a usable statistics block if recovery is possible. Since statistics may be summarized if there are more than two thousand tablets on a master, partitioning will not always divide data perfectly. In practice, the error from this optimization is minimal. For example, if a 64 GB master with more than 2,000 tablets is divided into 600 MB partitions, then on average each full partition will have 20 tablets or more. The coordinator groups tablets into partitions using a randomized approach, so the likelihood that any single partition is saddled with many of the largest tablets from those with averaged statistics is low. Even masters with millions of tablets do not pose a problem; the likelihood of imbalance goes down as more and smaller tablets are grouped into partitions. After acquiring the tablet statistics from the log, the coordinator performs its randomized bin packing algorithm to split up the work. The coordinator considers each tablet in turn. For a tablet, the coordinator randomly chooses five candidate partitions from among those that remain underfilled (that is, those that would take less than 1 to 2 seconds to recover given their current contents). For each candidate partition, the coordinator computes how full the partition would be if the tablet under consideration were added to it. If adding the tablet would result in a partition that would take too long to recover (that is, it does not "fit"), then it tries the next candidate. If the tablet fits, then it is added to the partition. If the partition becomes full enough, then it is removed from consideration in future rounds. If the tablet does not fit into any of the candidate partitions, then a new partition is created, the tablet is added to it, and it is added to the set of potential candidate partitions. The algorithm initially starts with an empty set of partitions, and it terminates when all of the tablets have been placed in a partition. This randomized bin packing will not produce optimal results; with high probability it

Crashed Master Tablets

Tablet  Table ID  Start Key Hash  End Key Hash     Estimated Size
t0      10        2^63            2^64 − 1         800 MB
t1      12        0               2^64 − 1         2000 MB
t2      21        0               2^63 − 1         200 MB
t3      31        0               2^64 − 1         100 MB

Tablets After Splitting

Tablet  Table ID  Start Key Hash  End Key Hash     Estimated Size
t0.0    10        2^63            3 × 2^62 − 1     400 MB
t0.1    10        3 × 2^62        2^64 − 1         400 MB
t1.0    12        0               2^62 − 1         500 MB
t1.1    12        2^62            2^63 − 1         500 MB
t1.2    12        2^63            3 × 2^62 − 1     500 MB
t1.3    12        3 × 2^62        2^64 − 1         500 MB
t2      21        0               2^63 − 1         200 MB
t3      31        0               2^64 − 1         100 MB

Partition Scheme for Recovery

Partition  Tablets     Estimated Size
p0         t0.1        400 MB
p1         t1.1, t3    600 MB
p2         t0.0        400 MB
p3         t2          200 MB
p4         t1.0        500 MB
p5         t1.2        500 MB
p6         t1.3        500 MB

Figure 5.2: The coordinator uses statistics from a crashed master's log to partition tablets. The top box shows the crashed master's estimated tablet statistics from its log. The middle box shows the tablets after tablets that are too large to be recovered by a single recovery master are split. The bottom box shows the result of the random partition packing. Here, the coordinator limits partitions to 600 MB of object data; in practice, the coordinator limits the number of objects in each partition as well. This run resulted in seven partitions, four tablet splits, and 1200 MB of wasted capacity; an ideal packing would have resulted in six partitions, four splits, and 500 MB of wasted capacity.

will spread recovery across a few more partitions than strictly necessary. However, spreading recovery across a few additional recovery masters has little impact on recovery time. What is essential is that partition size is tightly bounded, which this algorithm ensures; otherwise, some recovery masters would drag out recovery time. Some tablets may be too large to be recovered quickly even when placed alone in an empty partition. To solve this, the coordinator performs a splitting pass over all of the tablets before grouping them into partitions. Each tablet is broken into the minimum number of equal-sized tablets that meet the size and object count constraints. In splitting tablets, the coordinator approximates good places to split by using the statistics collected from the log and by assuming all objects and object data are spread uniformly across the tablet by its key hash function. Figure 5.2 illustrates an example of how the coordinator splits and partitions tablets. One potential pitfall of splitting tablets during recovery is that, without care, repeated failures might lead to compounded splitting that would explode the number of tablets in the cluster. Left unchecked, this could create a burden on clients, since it would increase the number of tablet-to-master mappings they would need to access data. Fortunately, minimizing the number of splits per tablet and equalizing the size of the resulting tablets mitigates this problem. One key property shows that cascading splits will not happen unless intervening tablet growth requires it: whenever a tablet is split, each of the resulting tablets must be at least half the partition size (otherwise, the tablet would have fit in a single partition to begin with). This holds regardless of the number of times a tablet undergoes recovery: a tablet will never be split across more than twice the optimal number of partitions needed to contain it. This same argument holds for tablets that are split due to the amount of data they contain and for tablets that are split due to the number of objects they contain.
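
The splitting pass and the randomized packing can be sketched as follows. The capacity limits and the five-candidate choice come from the description above; the data layout and names are invented, splitting is folded into the packing loop, and the bookkeeping that retires "full enough" partitions from the candidate set is omitted for brevity.

    #include <algorithm>
    #include <cstdint>
    #include <random>
    #include <vector>

    struct Tablet    { uint64_t bytes; uint64_t objects; };
    struct Partition { uint64_t bytes; uint64_t objects; std::vector<Tablet> tablets; };

    const uint64_t MAX_BYTES   = 600 * 1000 * 1000ULL;  // ~600 MB of object data per partition
    const uint64_t MAX_OBJECTS = 3 * 1000 * 1000ULL;    // ~3 million objects per partition

    // Split a tablet into the minimum number of equal-sized pieces that satisfy
    // both the byte and the object-count constraints.
    static std::vector<Tablet> split(const Tablet& t)
    {
        uint64_t pieces = std::max<uint64_t>(
            1, std::max((t.bytes + MAX_BYTES - 1) / MAX_BYTES,
                        (t.objects + MAX_OBJECTS - 1) / MAX_OBJECTS));
        return std::vector<Tablet>(pieces, Tablet{t.bytes / pieces, t.objects / pieces});
    }

    static bool fits(const Partition& p, const Tablet& t)
    {
        return p.bytes + t.bytes <= MAX_BYTES && p.objects + t.objects <= MAX_OBJECTS;
    }

    static std::vector<Partition> pack(const std::vector<Tablet>& tablets, std::mt19937& rng)
    {
        std::vector<Partition> partitions;
        std::vector<size_t> open;                   // indices of partitions still accepting tablets
        for (const Tablet& original : tablets) {
            for (const Tablet& t : split(original)) {
                // Examine up to five randomly chosen candidates among the open partitions.
                std::vector<size_t> candidates = open;
                std::shuffle(candidates.begin(), candidates.end(), rng);
                candidates.resize(std::min<size_t>(5, candidates.size()));
                bool placed = false;
                for (size_t idx : candidates) {
                    if (fits(partitions[idx], t)) {
                        partitions[idx].bytes += t.bytes;
                        partitions[idx].objects += t.objects;
                        partitions[idx].tablets.push_back(t);
                        placed = true;
                        break;
                    }
                }
                if (!placed) {                      // no candidate fit: start a new partition
                    partitions.push_back({t.bytes, t.objects, {t}});
                    open.push_back(partitions.size() - 1);
                }
            }
        }
        return partitions;
    }

    int main()
    {
        std::mt19937 rng(1);
        pack({{800000000, 2000000}, {2000000000, 5000000}, {200000000, 500000}}, rng);
    }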

5.3.4 Assigning Partitions to Recovery Masters

After determining the partitions that will be used for recovery, the coordinator assigns each recovery master a partition with a recover RPC. As part of the RPC, each recovery master receives two things from the coordinator:

• a partition that contains the list of tablets it must recover, and

Figure 5.3: During recovery, data in replicas flows from disk or flash on a backup over the network to a recovery master, then back to new backups as part of the recovery master's log.

• a replica location list that contains the locations of all the crashed master's log segment replicas.

It is possible that the coordinator’s partitioning algorithm results in more partitions than the number of available masters. In this case, just one partition is assigned to each of the available masters. The remaining partitions are left marked as failed and are recovered in a subsequent recovery (failed recovery masters are handled in the same way, as described in Section 5.6.2).

5.4 Replay

The vast majority of recovery time is spent replaying segments on the recovery masters. During replay, log entries are processed in six stages (see Figure 5.3).

1. A replica is read from disk into the memory of a backup.

2. The backup divides the object records in the replica into separate groups for each partition based on the table id and the hash of the key in each log record.

3. The records for each partition are transferred over the network to the recovery master for that partition. The recovery master fetches records from backups as it becomes ready for them, which prevents backups from inundating it.

4. The recovery master incorporates the data into its in-memory log and hash table.

5. As the recovery master fills segments in memory, it replicates those segments over the network to backups with the same scattering mechanism used in normal operation.

6. The backup writes the new segment replicas from recovery masters to disk or flash.

RAMCloud harnesses concurrency in two dimensions during recovery. The first dimension is data parallelism: different backups read different replicas from disk in parallel, different recovery masters reconstruct different partitions in parallel, and so on. The second dimension is pipelining: all of the six stages listed above proceed concurrently, with segment replicas as the basic unit of work. While one replica is being read from disk on a backup, another replica is being partitioned by that backup's CPU, and records from another replica are being transferred to a recovery master; similar pipelining occurs on recovery masters. For fastest recovery, all of the resources of the cluster should be kept fully utilized, including disks, CPUs, and the network.
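
As a toy illustration of this pipelining on a single backup (not RAMCloud's implementation, and reduced to just two of the six stages), one thread can act as the disk-read stage while another partitions replicas read earlier, with a segment replica as the unit handed between them; all names here are invented.

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    struct Replica { int segmentId; };

    // A tiny thread-safe queue: stage 1 pushes replicas as they are "read from
    // disk"; stage 2 pops and partitions them concurrently.
    class ReplicaQueue {
        std::queue<Replica> q;
        std::mutex m;
        std::condition_variable cv;
        bool closed = false;
    public:
        void push(Replica r) {
            { std::lock_guard<std::mutex> lk(m); q.push(r); }
            cv.notify_one();
        }
        void close() {
            { std::lock_guard<std::mutex> lk(m); closed = true; }
            cv.notify_all();
        }
        bool pop(Replica& out) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty() || closed; });
            if (q.empty())
                return false;            // producer finished and queue drained
            out = q.front();
            q.pop();
            return true;
        }
    };

    int main()
    {
        ReplicaQueue ready;

        std::thread diskReader([&] {      // stage 1: read replicas from storage
            for (int id = 0; id < 8; id++)
                ready.push(Replica{id});
            ready.close();
        });

        std::thread partitioner([&] {     // stage 2: bucket records by partition
            Replica r;
            while (ready.pop(r))
                std::printf("partitioning segment %d while later replicas are still being read\n",
                            r.segmentId);
        });

        diskReader.join();
        partitioner.join();
    }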

5.4.1 Partitioning Log Entries on Backups

The objects in each segment replica must be bucketed according to partition during recovery, because each master intermingles all of the objects from all of its tablets in each segment. Backups use the partitioning scheme created by the coordinator and help out recovery masters by grouping object data as it is loaded from replicas on disk. As a result, each recovery master can quickly fetch just the objects it is responsible for replaying. Overall, deferring partitioning and spreading the work of partitioning across all of the backups of the cluster saves network bandwidth and CPU load on masters during normal operation; masters never have to perform any kind of rebalancing or bucketing. Furthermore, backups vastly outnumber recovery masters in a RAMCloud deployment, so the load of partitioning is spread across all of the nodes of the cluster. Partitioning is fully pipelined with the rest of

recovery and proceeds in parallel with disk operation and data replay on each node, which mitigates the penalty of waiting until recovery time to partition. One drawback to partitioning log data at recovery time is that a crashed master's objects must be rewritten into the recovery masters' logs. If objects were not rewritten and a master that had recovered part of another master later crashed, then the logs of both servers would have to be replayed to recover it. This would slow recovery times and would make it difficult to reclaim replica storage. Rewriting the objects essentially persists the data in its partitioned form, which ensures that crashes are always recoverable using a single log. The burden of rewriting data to storage is mitigated if backups have enough non-volatile buffer capacity in aggregate to store the entire set of new replicas generated during a recovery (by default, about three times the size of the crashed master's DRAM). As an alternative to partitioning at recovery time, masters could do it during normal operation, as they log objects. Each master could group its tablets into partitions and place objects from different partitions into different logs. When the master crashed, each recovery master would only have to replay the replicas from the log for its assigned partition. This also has the advantage that recovery masters would not have to rewrite log data during recovery; instead, they could adopt the already-partitioned replicas left behind by the crashed master as part of their logs. Unfortunately, this approach has two disadvantages. First, masters would actively replicate roughly one hundred segments at a time during normal operation (one for each partition). This would require 100 times more non-volatile DRAM buffering on backups than having all objects appended to a single log. Second, partitioning objects when they are first written to the master's log makes it difficult to repartition later. In RAMCloud, tablet sizes can change rapidly since each master supports hundreds of thousands of operations per second. It would be costly for masters to frequently "repack" objects in their logs to accommodate repartitioning due to changing tablet sizes.
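
A minimal sketch of the bucketing step on a backup follows; the record and partition-table layouts are invented for this example (real RAMCloud log records carry additional metadata). Each record is routed by its table id and key hash to the recovery master whose partition covers it.

    #include <cstdint>
    #include <map>
    #include <vector>

    struct LogRecord { uint64_t tableId; uint64_t keyHash; /* object bytes omitted */ };

    struct PartitionRange {
        uint64_t tableId, startKeyHash, endKeyHash;
        int partitionId;
    };

    static int partitionFor(const std::vector<PartitionRange>& scheme, const LogRecord& r)
    {
        for (const PartitionRange& p : scheme)
            if (p.tableId == r.tableId &&
                p.startKeyHash <= r.keyHash && r.keyHash <= p.endKeyHash)
                return p.partitionId;
        return -1;   // record belongs to no partition being recovered; skip it
    }

    // Group all records from one replica into per-partition buckets that the
    // corresponding recovery masters can later fetch.
    static std::map<int, std::vector<LogRecord>>
    bucketReplica(const std::vector<PartitionRange>& scheme,
                  const std::vector<LogRecord>& records)
    {
        std::map<int, std::vector<LogRecord>> buckets;
        for (const LogRecord& r : records) {
            int p = partitionFor(scheme, r);
            if (p >= 0)
                buckets[p].push_back(r);
        }
        return buckets;
    }

    int main()
    {
        std::vector<PartitionRange> scheme = {{10, 0, ~0ULL, 0}, {12, 0, ~0ULL, 1}};
        bucketReplica(scheme, {{10, 123}, {12, 456}, {99, 789}});
    }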

5.4.2 Segment Replay Order

In order to maximize concurrency, recovery masters and backups operate independently. As soon as the coordinator contacts a backup to obtain its list of replicas, the backup begins prefetching replicas from disk. Almost immediately afterward, when the coordinator has

determined the partitions it will use for recovery, masters begin fetching segment data from backups and replaying it. Ideally, backups will constantly run ahead of masters, so that data from a replica is ready and waiting whenever a recovery master requests it. However, this only works if the recovery masters and backups process replicas in the same order. If a recovery master requests the last replica in the backup's order, then the master will stall; it will not receive any data to process until the backup has read all of its replicas. To avoid pipeline stalls, each backup decides in advance the order in which it will read its replicas. It returns this information to the coordinator during the setup phase, and the coordinator includes the order information when it communicates with recovery masters to initiate recovery. Each recovery master uses its knowledge of backup disk speeds (§4.3) to estimate when each replica's data is likely to be ready, and it requests data in order of expected availability. This approach causes all masters to request segments in the same order, which could lead to contention if masters persistently work in a lock-step fashion. To solve this, servers could randomize their requests for data to spread load. One simple option would be to use randomness with refinement, similar to the technique used in segment scattering. When deciding which replica data to request, a recovery master would choose a few candidate servers randomly; among the candidates, the recovery master would send a request to the one that is most likely to have unreplayed data ready. The use of refinement would ensure that each recovery master replays data from each of the backups at an even rate, while the use of randomness would avoid hot spots. In our current cluster, convoying does not appear to be a bottleneck, so this optimization remains unimplemented. Unfortunately, even with replicas read in a fixed order, there will still be variations between backups in the speed at which they read and process replicas. In order to avoid stalls because of slow backups, each recovery master keeps several concurrent requests for data outstanding at any given time during recovery. The current implementation keeps four requests outstanding to backups; this is enough to hide network latency and compensate for when a few backups are slow in responding without causing unnecessary queuing delays. Recovery masters replay data in whatever order the requests return. Because of the optimizations described above, recovery masters will end up replaying replicas in a different order than the one in which the segments were originally written.

Fortunately, the version numbers in log records allow the log to be replayed in any order without affecting the result. During replay, each recovery master simply retains the version of each object with the highest version number, discarding any older versions that it encounters. Although each segment has multiple replicas stored on different backups, only the primary replicas are read during recovery; reading more than one would waste valuable disk bandwidth. Masters identify one replica of each segment as the primary replica when scattering their segments as described in Section 4.3. During recovery, each backup reports all of its replicas, but it identifies the primary replicas and only prefetches the primary replicas from disk. Recovery masters request non-primary replicas only if there is a failure reading the primary replica, and backups load and partition them only on demand. This typically happens when the backup that contained a primary replica crashes, or if a primary replica is corrupted on storage. Each non-primary replica is cached in DRAM after it is first accessed, since all of the recovery masters will most likely need it. Since I/O for non-primary replicas is done on demand, in the worst case, recovery time could be extended 30 to 80 ms for every non-primary replica accessed. In practice, this is unlikely for two reasons. First, depending on when a non-primary replica is accessed during recovery, it may not impact recovery time at all, since recovery can work ahead on other segments and come back to replay data from the non-primary replica later. Second, if the primary replicas for more than one segment have been lost, then several non-primary replicas may be requested (and, thus, loaded) in parallel from different backups. Recovery masters use the list of replica locations provided by the coordinator to determine when they have replayed the entire log of the crashed master. The coordinator verifies that the replica location list contains at least one replica from every segment in the crashed master's log. A recovery master finishes replay when at least one replica for every segment mentioned in the replica location list has been replayed. Section 5.6.2 describes what happens when this is impossible due to failures or corrupted replicas.
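
The order-independence argument boils down to a last-writer-wins rule keyed on version numbers. The stripped-down C++ sketch below uses invented names; tombstones for deleted objects, which real replay must also honor, are omitted.

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct ObjectRecord {
        std::string key;
        uint64_t version;
        std::string value;
    };

    class RecoveredTable {
        std::unordered_map<std::string, ObjectRecord> hashTable;
    public:
        void replay(const ObjectRecord& record) {
            auto it = hashTable.find(record.key);
            if (it == hashTable.end() || it->second.version < record.version)
                hashTable[record.key] = record;   // newer (or first) version wins
            // Older versions, and records replayed more than once, are simply ignored,
            // so replicas can arrive in any order and even be replayed twice.
        }
    };

    int main()
    {
        RecoveredTable t;
        t.replay({"k1", 3, "new"});
        t.replay({"k1", 2, "stale"});   // arrives later but loses to version 3
    }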

5.5 Cleanup

After a recovery master completes the recovery of its assigned partition, it notifies the coordinator that it is ready to service requests. The coordinator updates its tablet configuration information to indicate that the master now owns the tablets in the recovered partition, at which point the partition is available for client requests. Clients with failed RPCs to the crashed master have been waiting for new configuration information to appear on the coordinator; they discover it and retry their RPCs with the new master. Each recovery master can begin service independently without waiting for other recovery masters to finish. Once all recovery masters have completed recovery, the coordinator removes the crashed master from its server list and broadcasts its removal to the remaining servers of the cluster. Backups listen for these events and remove any recovery state (in-memory partitioned replica data, etc.) as well as the on-disk replicas for the recovered server as part of the log replica garbage collection mechanism described in Section 4.6.

5.6 Coordinating Recoveries

5.6.1 Multiple Failures

Given the large number of servers in a RAMCloud cluster, there will be times when multiple servers fail simultaneously. When this happens, RAMCloud recovers from each failure independently. The only difference in recovery is that some of the primary replicas for each failed server may have been stored on the other failed servers. In this case, the recovery masters will need to use secondary replicas for some segments, but recovery will complete as long as there is at least one replica available for each segment. RAMCloud can recover multiple failures concurrently; for example, if a RAMCloud cluster contains 5,000 servers with flash drives for backup, the measurements in Chapter 6 indicate that a rack failure that disabled 40 masters, each with 64 GB of DRAM, could be recovered in about 2 seconds. Multiple simultaneous recoveries are supported but must be explicitly enabled; by default, the coordinator only performs one recovery at a time. If many servers fail simultaneously, such as in a power failure that disables many racks,

RAMCloud may not be able to recover immediately. This problem arises if no replicas are available for a segment or if the remaining masters do not have enough spare capacity to take over for all the failed masters. In this case, RAMCloud must wait until enough machines have rebooted to provide the necessary data and capacity. This happens naturally, since the coordinator repeatedly retries a recovery until it succeeds; if some replicas are unavailable, the coordinator will hold back recovery until they can be found. RAMCloud clusters should be configured with enough redundancy and spare capacity to make situations like this rare.

5.6.2 Failures During Recovery

Nearly all failures and exceptional conditions that occur during recovery are solved by aborting all or part of a recovery and retrying a new recovery for the tablets later. Each recovery master works independently and may succeed or fail at recovering its assigned partition without affecting the other recovery masters. A recovery master can give up on recovery if it is impossible to complete, and the coordinator treats any crash of a recovery master as an implicit recovery abort for that recovery master. As a result, some or all of the tablets of the crashed master may still be unavailable at the end of a master recovery. To remedy this, the coordinator repeatedly attempts to recover tablets from failed servers until all tablets are available. Whenever a problem that prevented recovery from succeeding is corrected, a follow-up recovery attempts to restore availability shortly thereafter. Furthermore, a partition is likely to be assigned to a different recovery master in a follow-up recovery, which resolves problems that may be specific to a single recovery master. The coordinator currently only performs one master recovery at a time, so recoveries do not always start right away when a server fails. Instead, the coordinator enqueues the new recovery and delays its start until the ongoing recovery completes. The queue of recoveries is serviced in first-in-first-out (FIFO) order, so a repeatedly failing recovery for one master cannot block recovery for another master. Whenever a recovery master wishes to abort an ongoing recovery, it must undo some of the effects of the recovery to ensure safety after future recoveries. Specifically, aborting the recovery of a partition requires a recovery master to scan its hash table and remove entries

for keys that were part of the partition that it failed to recover. This prevents leftover stale entries in the hash table from causing consistency issues if the recovery master is later assigned ownership of some of the keys from the partition it partially recovered. Removing the hash table entries makes the log data left behind by the aborted recovery eligible for log cleaning, so these entries eventually get cleaned up in the usual way. Some of the many failure scenarios that can arise during recovery are discussed below; many rely on the atomic and abortable nature of recovery on each individual recovery master.

• A server crashes during recovery. When a server crashes, it creates two sets of problems, since each server acts as both a master and a backup.

– Handling the loss of a backup. Since each segment is replicated, recovery can succeed even with the loss of some backups. If a backup that contained the last remaining replica of a segment crashes during recovery, any recovery master that had not yet replayed the objects from that segment must abort the recovery of its partition. This does not preclude other recovery masters from recovering their partitions successfully, depending on the timing of the loss of the final replica.

– Handling the loss of a master. In this case, a recovery is enqueued for the newly crashed master, which will be performed after the ongoing recovery. If the master that crashed during an ongoing recovery was acting as a recovery master, then it will also fail to recover its assigned partition. In this case, the coordinator enqueues a new recovery for the master whose partition was not successfully recovered. When this new recovery is started, the coordinator only attempts to recover tablets that were not already successfully recovered in the prior recovery. In total, when a recovery master fails, two new recoveries are enqueued (in addition to the ongoing recovery): one for the tablets owned by the failed recovery master and one to retry recovery for the tablets of the master that was only partially recovered.

• A recovery master runs out of memory. If a recovery master runs out of log space recovering a partition, it aborts the recovery and reports the failure to the coordinator. Just as when a recovery master crashes, the coordinator schedules a follow-up recovery to retry the recovery of the partition elsewhere in the cluster.

• A segment replica is corrupted or misidentified. Each segment replica's metadata and identifying information is protected by a checksum stored along with it on disk. This checksum covers the master and segment ids, the replica length, and a bit indicating whether its log digest is open or closed. It also covers the length and type of all of the log records in the replica, but it does not include the contents of the log entries themselves (they have individual checksums). Verifying a replica's checksum ensures it can be identified and iterated over safely, but objects must be individually verified against their checksums to ensure they have not been corrupted. Backups discard all data from replicas whose checksums do not match their metadata. Recovery masters automatically request data for discarded replicas from other backups as needed. Backups never verify object checksums; instead, an object's checksum is only verified just before it is replayed. This prevents an object from passing a check on the backup, only to be corrupted in transit to the recovery master. When an object with a corrupted checksum is encountered, the recovery master aborts the replay of the remainder of the replica, and it tries to fetch another replica of the same segment from another backup. The version numbers on log records (§5.4.2) ensure replaying the same replica multiple times is safe.

• A recovery is split across more partitions than there are masters available. Such a situation implies the existing cluster is too small to meet the 1 to 2 second recovery goal. In this case, every master in the cluster is assigned a single partition to recover, and the remaining unassigned partitions are left marked as needing recovery. A follow-up recovery is enqueued and is started immediately following the completion of the first recovery. By iteratively retrying recovery, each of the available masters in the cluster will recover some fraction of the failed server's tablets every 1 to 2 seconds until all of its tablets have been returned to service.

• No backups have storage space left. When backups run out of space either during recovery or during normal operation, they begin to reject segment replication requests from masters. As a result, operations stall until new backups join the cluster with additional storage capacity or until the user frees space by deleting objects or dropping tablets. In the current implementation, it may be dangerous to allow backups to fill to capacity, since deleting objects requires new log records to be written. If recovery were to fill backups completely, it may become impossible to release storage space and resume operation. One fix to this is to have recovery masters abort recovery if they take too long and receive too many backup replication rejections. In-memory log space used by the recovery objects would become reclaimable, and the log cleaner on the recovery masters would free space on the backups. In practice, it will be important for operators to ensure that RAMCloud has adequate disk space to perform recoveries. An operator user interface that can aid with these types of administrative tasks is left for future work.

• The coordinator crashes during recovery. The coordinator persists its critical state to a consensus-based high-availability storage system [25], and RAMCloud uses coordinator standbys to take over if the active coordinator crashes. When a coordinator crashes, a leader election chooses a new coordinator process, which reloads the critical cluster state and resumes operation. However, coordinators do not persist information about ongoing recoveries. Instead, they mark a server (and its tablets) as failed in their persistent server list. Newly elected coordinators always restart master recoveries based on server and tablet status information left behind by the previous coordinator. When a new coordinator is elected, it scans its server list to find any crashed and unrecovered servers, and it enqueues a recovery for each of them. As usual, it performs the recoveries one-at-a-time. With this approach, it is possible that the cluster may still be running a master recovery started by a failed coordinator process when a new coordinator takes over and starts a competing recovery for the same master. The new coordinator must ensure only one of these recoveries succeeds, since it is imperative for consistency that each

tablet is only served by one master at a time. Coordinators solve this by tagging each master recovery with a randomly generated id. When recovery masters complete their portion of recovery, they report to the current coordinator using the id of the recovery they are participating in. If the coordinator has failed in the meantime, it will have no record of a master recovery with the reported id, so it notifies the recovery master that it should abort the recovery instead of servicing requests for the data. In this way, ongoing recoveries are neutralized after a coordinator rollover (a minimal sketch of this id check follows this list). A second, less serious problem arises since a newly elected coordinator may try to assign a partition to a recovery master that is already participating in an existing (doomed) recovery. If a recovery is already ongoing on a recovery master, the recovery master should abort its ongoing recovery and start work on the new recovery. Currently, this is unimplemented, since recovery masters block while performing a recovery; recovery may block for a few seconds after a coordinator rollover until preoccupied recovery masters complete their ongoing recovery (necessarily ending in an abort).
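
As referenced above, the id check after a coordinator failover can be sketched as follows; the class and method names are invented for illustration, and persistence and RPC details are elided.

    #include <cstdint>
    #include <random>
    #include <unordered_set>

    class Coordinator {
        std::unordered_set<uint64_t> activeRecoveries;  // ids this coordinator instance issued
        std::mt19937_64 rng{std::random_device{}()};
    public:
        uint64_t startRecovery() {
            uint64_t id = rng();            // randomly generated recovery id
            activeRecoveries.insert(id);
            return id;
        }
        // Called by a recovery master that finished its partition.
        // Returns true if the recovery master may start serving the recovered tablets.
        bool recoveryMasterFinished(uint64_t recoveryId) {
            if (activeRecoveries.count(recoveryId) == 0)
                return false;   // started by a previous coordinator: abort, do not serve
            return true;
        }
    };

    int main()
    {
        Coordinator c;
        uint64_t id = c.startRecovery();
        bool serve = c.recoveryMasterFinished(id);       // true: recovery is recognized
        bool stale = c.recoveryMasterFinished(12345);    // false: doomed recovery, abort
        (void)serve; (void)stale;
    }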

5.6.3 Cold Start

On rare occasions, an entire cluster may fail or need to be restarted, such as during a power outage. In these cases, RAMCloud must guarantee the durability of the data that was written to it, and it must gracefully resume service whenever servers begin restarting. When the coordinator comes back online, it proceeds with normal coordinator recovery, and it begins continuously retrying recovery for each of the logs left behind by the former masters. When servers restart, they retain all replicas they find on disk until the objects in them have been recovered and the replicas can safely be determined to be no longer needed (§4.6). When the first few servers come back online, recovery will be impossible, since some of the segments will be missing for every log that needs recovery. Once enough servers have rejoined the cluster, the coordinator's recovery attempts will begin to succeed and tablets will begin to come back online. RAMCloud uses its normal master recovery during cluster restart, but it may eventually make more sense to use a different approach where masters are reconstructed exactly as

they existed before the cold start. When restarting all servers, the coordinator could recover each crashed master as a single unit, rather than partitioning its tablets. Thus, data in a crashed master’s replicas would not need to be partitioned, and ownership of its replicas could be given directly to the recovery master. In effect, the recovery master could load the log replicas into DRAM and then resume operations on the log, as if it had created them. Consequently, recovery masters would avoid the need to recreate replicas of the crashed masters’ data. Such an optimization would reduce disk traffic and cold start time by a factor of four. In the end, we choose to use the existing recovery mechanism, which fits with the goal of minimizing and collapsing error cases during recovery. For example, transferring log ownership becomes complicated when servers experience repeated failures during cold start (see §5.4.1). To make the approach viable, RAMCloud would need to be extended to support logs comprised of replicas written with multiple log ids, its server ids would have be adjusted to support reuse, or the coordinator and backups would need to support a cross-backup, multi-replica, atomic server id change operation.
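A minimal sketch of the coordinator's cold-start behavior described above follows: recovery of each former master's log is retried continuously and only succeeds once enough restarted backups have rejoined that a complete log can be assembled. The method names (find_replicas, log_is_complete, recover) are hypothetical, not RAMCloud's actual API.

    # Sketch (hypothetical interfaces) of cold start: keep retrying recovery of
    # each crashed master's log; attempts fail until a complete copy of the log
    # (verified against the log digest) can be found among restarted backups.
    import time

    def cold_start(coordinator, crashed_master_ids, retry_interval=0.5):
        pending = set(crashed_master_ids)
        while pending:
            for master_id in list(pending):
                replicas = coordinator.find_replicas(master_id)       # broadcast to backups
                if coordinator.log_is_complete(master_id, replicas):  # log digest check
                    coordinator.recover(master_id, replicas)          # normal master recovery
                    pending.remove(master_id)
            time.sleep(retry_interval)  # more backups may rejoin in the meantime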

5.7 Zombie Servers

RAMCloud seeks to provide a strong form of consistency (linearizability [22]), but server failures and network partitions make this a challenge. One key invariant RAMCloud uses to ensure strong consistency is that requests for a particular object are always handled by at most one master at a time. If a single master performs all operations for a particular object, then it can provide atomicity and strong consistency by ordering all operations on that object. Unfortunately, due to network partitions, the coordinator may not be able to reach a master even while it continues to service client requests. A master that is suspected of failure may continue to operate as a “zombie”, and it must stop servicing requests before it can be safely recovered. Otherwise, if the contents of a master were recovered elsewhere while it continued to service requests, then clients could read and write differing and inconsistent views of the same keys.

RAMCloud uses two techniques to disable operations by these zombie masters: one for write operations and another for read operations. To prevent a zombie master from servicing a client write request, the coordinator broadcasts to the cluster indicating that the master is entering recovery. Backups refuse all replication operations from masters that are under recovery, so any requests from the master are rejected by backups afterward. Whenever a master's replication request is rejected in this way, it stops servicing requests, and it attempts to contact the coordinator to verify its status. The master continuously tries to contact the coordinator until it is successful, and its other operations remain suspended. When the master successfully contacts the coordinator, if recovery for the master has already started, then the master commits suicide: it terminates immediately. Otherwise, it resumes normal operation. This approach ensures that by the time recovery masters begin replaying log data from a crashed master, the master can no longer complete new write operations.

Under correlated failures and network partitions, the coordinator may not be able to contact all of the backups of the cluster to prevent a zombie master from replicating new log data. A zombie master may continue to replicate data to its head segment if the backups holding the segment's replicas could not be contacted by the coordinator. Fortunately, if this case arises, then recovery cannot proceed on the coordinator; the coordinator would be missing all the replicas for the head segment, so recovery would be reattempted later.

RAMCloud uses its distributed failure detection mechanism to help ensure that a zombie master stops servicing client read requests before a recovery for its tablets completes. Recall that each server occasionally pings another randomly-chosen server in the cluster for failure detection. A master uses the information it gains from these pings to quickly neutralize itself if it becomes disconnected from the cluster. As part of each ping request, the server sending the ping includes its server id. If the sending server is not marked as up in the server list on the receiving server, then a warning is returned. Just as with rejected backup replication requests, a server begins deferring all client operations and checks in with the coordinator when it receives a warning. If recovery has already started for the master, then it terminates immediately. Otherwise, it resumes normal operation. Similarly, if a server is separated from the cluster and cannot contact any other servers, then the server suspects itself to be a zombie and takes the same steps by checking in with the coordinator.

This approach to preventing stale reads is only probabilistic; groups of servers that are disconnected may continue to ping one another, even after the coordinator has detected their failure and started recovery. RAMCloud's goal in this respect is to make the probability of stale reads no greater than the probability of silently losing data (which would also constitute a violation of linearizability). In practice, a client is highly unlikely to experience stale reads, since several rounds of pings will have occurred by the time recovery completes for one of the disconnected servers. If a substantial fraction of the cluster is disconnected, then the coordinator cannot perform recovery anyway (all of the replicas for some segment are likely to be in the disconnected set of servers). If a small fraction of the cluster is disconnected, then each disconnected server is likely to check in with the coordinator quickly, since most of the ping attempts to and from it will fail.

Neutralizing zombie masters strengthens consistency but is insufficient alone to provide linearizability; two other guarantees are needed, one of which remains unimplemented. First, when recovering from suspected coordinator failures, RAMCloud must ensure that only one coordinator can manipulate and serve the cluster's configuration at a time. Without this guarantee, independent coordinators could assign the same tablets to different masters, again creating different and inconsistent views of some of the keys. This problem is prevented by using the consensus-based ZooKeeper [25] for coordinator election. Second, linearizability requires exactly-once semantics, which requires support from the RPC system that remains unimplemented in RAMCloud.
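The sketch below illustrates the ping-based self-check described above. The functions and method names are hypothetical and greatly simplified; they are not RAMCloud's actual interfaces.

    # Sketch (hypothetical names) of the ping-based zombie check: a server
    # includes its own id in each ping; if the receiver does not list the
    # sender as up, the sender suspends client operations and checks in with
    # the coordinator before either terminating or resuming.
    def handle_ping(receiver_server_list, sender_id):
        """Runs on the server receiving a ping."""
        if not receiver_server_list.is_up(sender_id):
            return "WARNING_NOT_IN_CLUSTER"
        return "OK"

    def on_ping_response(self_server, coordinator, response):
        """Runs on the server that sent the ping."""
        if response == "WARNING_NOT_IN_CLUSTER":
            self_server.suspend_client_operations()
            status = coordinator.check_status(self_server.id)  # retried until successful
            if status == "RECOVERY_STARTED":
                self_server.terminate()   # commit suicide; recovery now owns the data
            else:
                self_server.resume_client_operations()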

5.8 Summary

When a master crashes in RAMCloud, the data stored in its DRAM becomes unavailable. RAMCloud uses fast crash recovery to restore availability by reconstituting the data from the failed server's DRAM in 1 to 2 seconds. The key to fast recovery is to leverage scale by dividing the work of recovery across all of the nodes of the cluster. Recovery employs all of the CPUs, NICs, and disks of the cluster to recover quickly.

Fast crash recovery depends on fast failure detection; RAMCloud nodes use randomized and decentralized probing to speed the detection of crashed nodes. At the start of recovery, the coordinator broadcasts to find replicas of log data for the crashed master. It verifies that the set of replicas it discovers constitutes a complete copy of the log by checking it against a log digest that is extracted from the head segment of the crashed master's log. The coordinator uses statistics extracted from replicas and a randomized bin packing algorithm to divide the crashed master's tablets into roughly equal-sized partitions. Each partition is assigned to a separate recovery master.

During the replay phase, which is the main portion of recovery, backups load replica data and bucket objects by partition. At the same time, each recovery master fetches from backups the objects relevant to the tablets it is assigned to recover. It copies these into its in-memory log, updates its hash table, and re-replicates the log data to backups. After a recovery master has replayed objects from all of the segments left behind by the crashed master, it begins servicing requests for its assigned tablets. The replay phase is highly parallel and pipelined; each node is reading data from disk, partitioning objects found in replicas, fetching data from backups, updating its hash table, and re-replicating objects to other nodes in the cluster all at the same time.

Most failures that can happen during recovery are handled by cleanly aborting recovery and retrying it later. Recovery safely deals with the loss of masters or backups, the loss or corruption of replica data, resource exhaustion, and coordinator crashes. The coordinator must be sure a master has stopped servicing requests before the master can be safely recovered. The coordinator disables a zombie master's writes by notifying backups that the master is dead, and it reuses the decentralized failure detector to quickly stop zombie masters from servicing read requests. Cold start is handled by iteratively applying the single master recovery mechanism.

Overall, fast crash recovery quickly restores availability and full performance to the cluster after a server crash, rather than relying on expensive in-memory redundancy or operating in a degraded state using data on disk. Its short recovery time is no longer than other delays that are common in normal operation, so recovery is unnoticeable to applications. Fast crash recovery is the key to providing continuous availability in RAMCloud.

Chapter 6

Evaluation

We implemented and evaluated the RAMCloud architecture described in Chapters 3 through 5. Ideally, evaluation would answer the question, “can a realistic RAMCloud deployment reliably restore access to the lost DRAM contents of a full-size failed server in 1 to 2 seconds?” Unfortunately, this question cannot be answered directly without hundreds of nodes, so instead we focus on evaluating the scalability of crash recovery. If RAMCloud can recover quickly from increasingly large failures as cluster size grows, then it should be possible to recover the entire contents of a failed server in a sufficiently large cluster. Any design will have limits to its scalability, so it is reasonable to expect that as the cluster grows recovery bandwidth will see diminishing gains. Then, the key questions are, “how much data can be recovered in 1 to 2 seconds as cluster size scales,” and “what are the limits of recovery scalability?” Each section of this chapter addresses a different question related to the performance of recovery, mostly aimed at understanding its scalability. Some of the key questions and related results are summarized here. How well does recovery scale? (§6.4) A crashed server with 40 GB of data can be recovered in 1.86 seconds, which is only 12% longer than it takes to recover 3 GB with 6 servers. As a result, we expect that the entire contents of a 64 GB crashed master could be recovered in about 2 seconds with a cluster of 120 machines similar to those used in our experiments.


What is the overhead of adding nodes to the cluster? (§6.4) The coordination overhead of adding a node to recovery is only about 600 µs, so the current implementation should scale to spread recovery across hundreds of nodes.

What limits recovery times in small clusters? (§6.4.1) In our 80-node cluster, recovery times are limited by our CPUs, since each backup must perform many times more work than in an expected RAMCloud deployment. During recovery, CPU utilization and memory bandwidth remain saturated at nearly all times. Our CPUs only provide 4 cores with a single memory controller; newer CPUs, which have more cores and more memory controllers, should improve recovery times.

How large should segments be? (§6.2) Backups utilize at least 85% of the maximum bandwidth of their storage devices by using 8 MB segments with both flash and traditional hard disks.

How large should partitions be? (§6.3) Each recovery master can recover 75 to 800 MB/s, depending on the average size of objects in the crashed master’s log. To target 1 to 2 second recovery times for expected workloads, the size of each partition must be limited to around 500 to 600 MB and 2 to 3 million objects.

How well does segment scattering compensate for variances in disk performance? (§6.6) When subjected to vibration, disk performance in our cluster varies by more than a factor of four. RAMCloud’s randomized segment scattering effectively compensates for this by biasing placement based on storage performance. As a result, recovery times only increase by 50% when more than half of the disks are severely impaired.

What is the impact of fast recovery on data loss? (§6.7) Shorter recovery times are nearly as effective at preventing data loss as increased redundancy; our analysis shows that decreasing recovery times by a factor of 100 (which RAMCloud does compared to typical recovery times) improves reliability by about a factor of 10,000.

This chapter also makes projections about the limits of fast recovery based on the experimental observations, and it explores how future hardware trends will affect recovery capacity and times. Some of these projections are summarized below. How large of a server could be recovered in 1 to 2 seconds with a large cluster? (§6.5.1) It may be possible to recover 200 GB of data from a failed server in 2 seconds using about 560 machines. Spreading recovery across even more servers would result in longer recovery times, since the cost of coordinating hundreds of servers begins to dominate.

How will recovery need to adapt to changing hardware over time? (§6.5.1) In order to keep up with DRAM capacity growth, in the future RAMCloud will need to recover more data per server and scale across more recovery masters. Improving recovery master performance will require multicore and non-uniform memory access optimizations for recovery master replay. To improve scalability, coordinator broadcasts will need to be faster to allow more recovery masters to participate efficiently in recovery. Since network latency is unlikely to improve significantly in the future, the coordinator will have to employ more efficient broadcast patterns.

What would be needed to make RAMCloud cost-effective for small clusters? (§6.5.2) A cluster of 6 machines can only recover 3 GB of data in 1 to 2 seconds, and smaller clusters recover even less. This makes RAMCloud expensive in small clusters, since only a small fraction of each server's DRAM can be used without threatening availability. For small clusters, it will be more cost-effective to keep redundant copies of objects in DRAM.

What is the fastest possible recovery for a 64 GB server with a large cluster? (§6.5.3) It may be possible to recover a 64 GB server in about 330 ms using 2,000 machines. With more machines, coordination overhead begins to degrade recovery times.

6.1 Experimental Configuration

To evaluate crash recovery we use an 80-node cluster consisting of standard off-the-shelf components (see Table 6.1), with the exception of its networking equipment, which is based on Infiniband.

CPU        Intel Xeon X3470 (4×2.93 GHz cores, 3.6 GHz Turbo)
RAM        24 GB DDR3 at 800 MHz
SSDs       2× 128 GB Crucial M4 (CT128M4SSD2)
NIC        Mellanox ConnectX-2 Infiniband HCA
Switches   4× 36-port Mellanox SX6036 (4X FDR)
           1× 36-port Mellanox IS5030 (4X QDR)
           1× 36-port Mellanox IS5025 (4X QDR)

Table 6.1: Experimental cluster configuration. All 80 nodes have identical hardware. Each node’s network adapter provides 32 Gbps of network bandwidth in each direction, but nodes are limited to about 25 Gbps by PCI Express. The switching fabric is arranged in a two layer fat-tree topology [32] that provides full bisection bandwidth to the cluster. Two SX6036 switches form the core of the fabric with the remaining switches acting as leaf switches.

With Infiniband, our end hosts achieve both high bandwidth (25 Gbps) and low latency (5 µs round-trip RPCs).

The default experimental configuration (Figure 6.1) uses each node as both a master and a backup. In most experiments each master recovers a partition of a failed master. One additional machine runs the coordinator, a client application, and an extra master filled with data that is crashed to induce recovery. The CPUs, disks, and network are otherwise idle and entirely dedicated to recovery. In practice, recovery would have to compete for resources with application workloads, though we would argue for giving priority to recovery. The backup on each node manages two solid-state flash drives (SSDs). All experiments use a disk replication factor of three: three replicas on disk in addition to one copy in DRAM.

At the start of each experiment, the client fills a master with objects of a single size (1,024 bytes by default). It then sends an RPC to the master that causes it to crash. The cluster detects the failure and reports it to the coordinator, which verifies the failure and starts recovery. The client waits until all partitions have been successfully recovered, then reads a value from one of those partitions and reports the end-to-end recovery time.

All reported recovery times factor out failure detection time and only include the client-observed unavailability due to recovery itself. Failure detection time has not been optimized, and the current implementation would add variability to recovery time measurements. The median failure detection time over 100 recoveries spread across 80 recovery masters is 380 ms, with times ranging from 310 ms up to 930 ms.
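A minimal sketch of this measurement procedure is shown below. The client API used here (write, crash_master, wait_for_recovery, read) is hypothetical and stands in for RAMCloud's actual client bindings; the object size matches the default described above.

    # Sketch (hypothetical client API) of the measurement procedure: fill a
    # master with fixed-size objects, crash it with an RPC, and report the
    # client-observed recovery time.
    import time

    def measure_recovery(client, table, num_objects, object_size=1024):
        value = b"x" * object_size
        for key in range(num_objects):                 # fill the master
            client.write(table, str(key), value)
        client.crash_master(table)                     # RPC that makes the master exit
        start = time.time()                            # reported numbers additionally
                                                       # factor out failure detection time
        client.wait_for_recovery(table)                # blocks until all partitions recover
        client.read(table, "0")                        # confirm data is available again
        return time.time() - start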

[Figure 6.1 (diagram): a client, the coordinator, and a crashed master above a row of nodes labeled “Backups and Recovery Masters”, each running both a master and a backup.]

Figure 6.1: In most experiments, each node runs as a recovery master and a backup. Each recovery master is idle and empty at the start of recovery, and each backup manages two solid-state flash drives (SSDs). Some noted experiments replace the SSDs with traditional disks. Each recovery master recovers a partition of a failed master. One additional machine runs the coordinator, a client application, and an extra master filled with data that is crashed to induce recovery (indicated with an ‘X’ in the figure).

6.2 How Large Should Segments Be?

To achieve efficient storage performance, RAMCloud servers must be configured with an appropriate segment size. Each access to disk has a high fixed cost due to seek and rotational latency, so minimizing the number of disk accesses increases effective disk bandwidth. Flash SSDs help, but disk controller latency still mandates large accesses to achieve maximum disk bandwidth. Consequently, backups buffer objects into full segments and write each segment replica out to disk with a single contiguous operation. This also makes reading replicas from disk efficient, since backups read each replica from disk in a single access from a contiguous region of storage.

However, segments cannot be arbitrarily large. Backups must be able to buffer entire segments in non-volatile DRAM, so using large segments could substantially increase the cost of the system if larger battery backups were needed. Furthermore, gains to storage bandwidth diminish as segment size increases, so segments need only be large enough to amortize the fixed access costs. Finally, large segments hinder load balancing, since they limit choice in replica placement. A good segment size is one that provides high disk bandwidth utilization and limits the buffering required on backups.

Configuration 1   Flash Disks   2× 128 GB Crucial M4 SSDs (CT128M4SSD2)
                                Access Latency: 93 µs read, 222 µs write
Configuration 2   Disk 1        Western Digital WD2503ABYX (7200 RPM, 250 GB)
                                Access Latency: 9-12 ms read, 4-5 ms write
                  Disk 2        Seagate ST3500418AS (7200 RPM, 500 GB)
                                Access Latency: 10-14 ms read, 6-8 ms write

Table 6.2: Storage devices used in experiments. Most experiments use the first configuration, which consists of two identical flash SSDs in each node. The high bandwidth of the SSDs allows high write throughput during normal operation and speeds recovery after failures. The second configuration, used in a few experiments, uses two disks in each node. The first disk is an enterprise class disk and the second is a desktop class disk. The disk-based configuration gives up some write throughput and recovery bandwidth for more cost-effective durable storage. Access latency is determined by averaging the duration of 1,000 random 1 KB accesses. Access latency is given as a range for traditional disks, since it depends on which portion of the disk is accessed. The shorter time is when accesses are limited to the first 25% of the disk; the longer time is when accesses span the entire disk. The bandwidth of each of these devices is characterized in Figure 6.2.

Also, RAMCloud must use a segment size that works well for a variety of storage devices. Some applications may wish to use traditional disks for low-cost durability. Other applications may use solid-state flash drives to provide higher write throughput and faster recovery. Table 6.2 shows two storage configurations used in our experiments. In the first (and default) configuration, each node uses two flash SSDs. In the second configuration, each node uses two traditional hard disk drives, each of a different model. The first model is an enterprise-class disk and the second model is a desktop-class disk. These configurations have significantly different bandwidth and latency characteristics.

A segment size of 8 MB provides high effective disk bandwidth for both experimental configurations. Figure 6.2 shows the effective disk bandwidth of each of these devices as block access size varies. For both traditional disk devices, reading replicas in 8 MB blocks achieves about 85% of each device's maximum bandwidth. For each flash SSD, 8 MB segments achieve 98% of the device's maximum bandwidth. All of the remaining experiments in this chapter use 8 MB segments. Using an 8 MB segment size also sufficiently limits the buffering needed on backups. During normal operation, each master keeps three replicas of its head segment buffered on

[Figure 6.2 (plots): average transfer rate (MB/s) versus block size (1 KB to 64 MB) for random reads and writes on the CT128M4SSD2 flash SSD, the WD2503ABYX disk, and the ST3500418AS disk, with separate lines for 100%, 50%, and 25% stroke.]

Figure 6.2: Average transfer rate of three different storage devices when randomly reading and writing data in various block sizes. With 8 MB block accesses, flash SSDs operate at about 98% of their maximum transfer rate. Each of the disk models operates at about 85% of its maximum transfer rate. For traditional disk devices, transfer rate is non-uniform across the surface of the disk. Individual lines on each graph show the average transfer rate depending on how much of the device is randomly accessed (starting from the beginning of the disk, where transfer rates are highest). The “50%” line shows that if random block accesses are limited to the first half of the drive, then average transfer rate is improved by about 15 to 20% for the traditional disk-based storage devices. Limiting accesses to the first quarter of each drive provides a further small improvement.

backups. Correspondingly, 8 MB segments will require about 24 MB to 48 MB of non-volatile buffering on each backup. Thus, backup replica buffers typically require a tiny fraction of the total DRAM capacity of each node and can be flushed to storage on power failure in a few hundred milliseconds.
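A back-of-the-envelope sketch makes the segment-size choice concrete: each access pays a fixed latency, then transfers at the device's peak rate. The access latencies below are taken from Table 6.2; the peak transfer rates are illustrative assumptions, not measurements from this chapter.

    # Sketch of effective storage bandwidth versus segment size under a simple
    # model: fixed per-access latency plus transfer at an assumed peak rate.
    def effective_bandwidth(segment_mb, access_latency_s, peak_mb_per_s):
        transfer_time = segment_mb / peak_mb_per_s
        return segment_mb / (access_latency_s + transfer_time)

    for name, latency, peak in [("flash SSD", 93e-6, 260.0),          # assumed peak rates
                                ("disk (full stroke)", 12e-3, 110.0)]:
        for size in [1, 8, 64]:
            bw = effective_bandwidth(size, latency, peak)
            print(f"{name}: {size} MB segments -> {bw:.0f} MB/s "
                  f"({100 * bw / peak:.0f}% of peak)")

Under these assumptions, 8 MB segments reach roughly 85-99% of peak bandwidth, consistent with the measurements in Figure 6.2, while larger segments yield only marginal further gains.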

6.3 How Large Should Partitions Be?

Before the coordinator can perform partitioning during recovery, it must know how much data a recovery master can process in 1 to 2 seconds. Figure 6.3 measures how quickly a single recovery master can process backup data, assuming enough backups to keep the recovery master fully occupied. The top graph of the figure shows that, depending on the object size, a recovery master can replay log data at a rate of 75-800 MB/s. This includes the overhead for reading the data from backups and writing new backup copies. With small objects the speed of recovery is limited by the per-object costs of updating the hash table. With large objects the speed of recovery is limited by memory bandwidth while re-replicating segments. For example, with 800 MB partitions and a disk replication factor of 3, each recovery master would need to write 2.4 GB of data to backups, and each replica written to a backup crosses the memory controller multiple times.

For one-second recovery, Figure 6.3 suggests that partitions should be limited to no more than 800 MB and no more than 2 million log records (with 128-byte objects a recovery master can process 300 MB of data per second, which is roughly 2 million log records). We limit partition size to 500 MB for the remaining experiments in order to target a 1 to 2 second recovery time. If 10 Gbps Ethernet is used instead of Infiniband, partitions would need to be limited to 300 MB due to the network bandwidth requirements for re-replicating the data.

In our measurements we filled the log with live objects, but the presence of deleted versions will, if anything, make recovery faster. Replicas may be replayed in any order during recovery, and if a newer version of an object is replayed before an older version, then the older version is discarded without adding it to the recovery master's log. So, deleted versions may not need to be re-replicated, depending on the order in which segments are replayed.
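The partition-sizing rule above reduces to simple arithmetic; the sketch below just restates it using the replay rates reported in this section (the helper function is illustrative, not part of RAMCloud).

    # Worked example of the partition-sizing rule, using replay rates from the
    # text (75-800 MB/s depending on object size).
    def max_partition_mb(replay_mb_per_s, target_seconds=1.0):
        return replay_mb_per_s * target_seconds

    for object_size, replay_rate in [("128 B", 300), ("1 KB", 800)]:
        limit = max_partition_mb(replay_rate)
        print(f"{object_size} objects: about {limit:.0f} MB per partition "
              f"for one-second replay")
    # With a disk replication factor of 3, an 800 MB partition also implies
    # about 2.4 GB of re-replication traffic from each recovery master.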

[Figure 6.3 (plots): recovery time (ms) for a single recovery master versus partition size (0 to 1000 MB, top) and versus objects recovered (0 to 2.5 million, bottom), with separate lines for 1 B, 64 B, 128 B, 256 B, 1 KB, and 2 KB objects.]

Figure 6.3: Recovery time as a function of partition size with a single recovery master and 80 backups. Using a large set of backups with a single recovery master means that the recovery master is the bottleneck in recovery. The top graph shows recovery time as a function of the amount of data recovered; the bottom graph shows recovery time as a function of the number of objects recovered. Each line uses objects of a single uniform size. Recovery time grows proportionally as partition size increases, but the rate of increase depends on the number of objects in the partition. Among equal-sized partitions, those containing many small objects take longer to recover than those containing fewer larger objects. Depending on the size of objects in a crashed master's log, a single recovery master can recover between 75 and 800 MB per second. We limit partition size to 500 MB for the remaining experiments in order to target a 1 to 2 second recovery time.

[Figure 6.4 (plot): recovery time (ms) versus total data recovered (0 to 40,000 MB), showing total recovery time along with maximum, average, and minimum disk reading times.]

Figure 6.4: Recovery time as the size of the crashed master is scaled proportionally to the number of nodes in the cluster. A master is filled with 500 MB for each node in the cluster, then crashed and recovered. Each node also acts as a backup with replicas from the crashed master stored on two flash disks, providing about 460 MB/s of storage read bandwidth per node. The minimum, average, and maximum amount of time backups spend reading data from disk are also shown. Each point is an average of 5 runs (Figure 6.6 shows the variance of master recoveries). A horizontal line would indicate perfect scalability.

6.4 How Well Does Recovery Scale?

The most important issue in recovery for RAMCloud is scalability; specifically, does recovery bandwidth grow as cluster size grows? Ultimately, recovery bandwidth determines how much data can be stored on each master and how quickly a master can be recovered when it fails. If one recovery master can recover 500 MB of data in one second, can 10 recovery masters recover 5 GB in the same time, and can 100 recovery masters recover 50 GB? Figure 6.4 seeks to answer this question. It plots recovery time as the amount of lost data is increased and the cluster size is increased to match. For each additional 500 MB

[Figure 6.5 (plot): recovery time (ms) versus total data recovered (0 to 25,000 MB) with 300 MB partitions, showing total recovery time along with maximum, average, and minimum disk reading times.]

Figure 6.5: Same experiment as Figure 6.4, except each recovery master only recovers 300 MB of data. Recovery times for 300 MB partitions are significant, since they are a rough estimate of the maximum partition size that could be used with 10 Gbps Ethernet for fast recovery. With smaller partitions, the 80 recovery masters only recover 24 GB of data, but recovery finishes in 1.0 seconds. Compared with Figure 6.4, recovery times improve slightly more than in proportion to the reduction in partition size. Though the amount of recovered data is less overall, the effective recovery bandwidth improves to 24 GB/s compared to the 21 GB/s achieved using 500 MB partitions.

partition of lost data, the cluster includes one additional node that acts as both a recovery master and a backup. The additional node should be able to replay the additional 500 MB of data in less than a second, and it should be able to help read an additional 460 MB/s of data from disk. Consequently, recovery times should remain constant as the amount of data recovered scales with cluster size. Figure 6.4 shows that larger clusters scale to recover larger failures while retaining acceptable recovery times. With 80 recovery masters and 160 flash SSDs, RAMCloud recovers 40 GB of data in 1.86 seconds, which is only 12% longer than it takes to recover

3 GB with 6 recovery masters using 12 SSDs. Recovery times are consistently within the 2-second time budget. Extrapolating, it appears that recovery will scale to recover 64 GB in 1 to 2 seconds with clusters of 120 machines or more.

Figure 6.5 repeats the same experiment, but limits partition size to 300 MB. In that configuration, the 80 recovery masters recover 24 GB in 1.0 seconds. The overall amount of data that can be recovered is limited, but the effective recovery bandwidth improves to 24 GB/s compared to the 21 GB/s achieved using 500 MB partitions.

To measure the variability in recovery time, Figure 6.6 gives a cumulative distribution of recovery time for three hundred 40 GB recoveries with 80 recovery masters and 500 MB partitions. The results show that recovery times are consistent across recoveries; in the experiments recovery time varied by less than 200 ms across all 300 recoveries.

Table 6.3 and Table 6.4 give a detailed breakdown of recovery time, network utilization, and storage performance for an 80-node 40 GB recovery. The breakdown of recovery master time shows that each recovery master spends about half of recovery time replaying objects. Most of the rest of the time is spent waiting on data from backups. Of replay time, more than one-third is spent checking object checksums (about 20% of recovery time on each recovery master). RAMCloud uses the highly optimized Intel 64 CRC32 instruction to minimize checksum overhead, but it still takes the largest fraction of time for object replay on recovery masters.
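The per-object replay work whose costs Table 6.3 breaks down can be sketched as follows. The structures and method names are hypothetical simplifications (and zlib.crc32 here merely stands in for the hardware CRC32 computed by RAMCloud), not the actual implementation.

    # Sketch (hypothetical structures) of per-object replay on a recovery
    # master: verify the checksum, keep only the newest version, append to the
    # in-memory log, update the hash table, and queue re-replication.
    import zlib

    def replay_object(obj, hash_table, log, replica_manager):
        if zlib.crc32(obj.data) != obj.checksum:        # checksum verification
            raise ValueError("corrupt object in replica")
        existing = hash_table.get(obj.key)
        if existing is not None and existing.version >= obj.version:
            return                                       # stale version: discard
        entry = log.append(obj)                          # copy into in-memory log
        hash_table[obj.key] = entry                      # point hash table at it
        replica_manager.replicate(entry)                 # re-replicate to backups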

6.4.1 Bottlenecks in the Experimental Configuration

It appears the current implementation scales well enough to recover an entire server's DRAM, but back-of-the-envelope calculations imply that it should be possible to recover more quickly. Each recovery master should be able to recover 500 MB in less than one second, and each backup should be able to read its share of data in about 1.1 seconds (500 MB at 460 MB/s). If recovery masters overlapped work perfectly with backups reading data from storage, then recovery times should be about 1.1 seconds, even with an 80-node recovery for 40 GB of data. Figure 6.4 shows that the full 80-node cluster takes 1.86 seconds to recover 40 GB. So, why is recovery 760 ms slower than expected? Figures 6.4 and 6.6 give some suggestions. First, both of these graphs show that the flash

Recovery Summary
  End-to-end recovery time                              1,890.8 ms
  Nodes (one recovery master and backup on each)        80
  On-storage replicas per segment                       3
  Object size                                           1,024 bytes
  Objects per partition                                 498,078
  Total objects recovered                               39,846,240 (38,912 MB)
  Total log data recovered                              40,328 MB
  Metadata overhead (in-memory and on-storage)          3.5%

Coordinator Time
  Starting recovery on backups (two broadcasts)         9.7 ms      0.5%
  Starting recovery on masters                          4.0 ms      0.2%
  Servicing recovery master completed RPCs              4.0 ms      0.2%
  Waiting for recovery masters                          1,872.5 ms  99.1%
  Total master recovery time on coordinator             1,890.3 ms  100.0%

Recovery Master Time                                    Average              Min / Max (ms)
  Idling, waiting for objects from backups              696.7 ms    38.9%    495 / 816
  Replaying objects into log and hash table             901.2 ms    50.3%    741 / 1,080
    Verifying object checksums                          333.0 ms             255 / 457
    Appending objects to in-memory log                  270.3 ms             232 / 315
    Managing replication RPCs                           23.5 ms              14 / 60
  Replicating objects after replay completes            26.1 ms     1.5%     11 / 53
  Other time                                            167.9 ms    9.4%     118 / 358
  Recovery time for assigned partition                  1,791.9 ms  100.0%   1,700 / 1,880

Backup Time                                             Average              Min / Max (ms)
  Servicing request for replica list                    0.2 ms               0.0 / 0.8
  Servicing replication RPCs                            802.2 ms             576 / 1,007
    Copying replication data to buffers                 796.8 ms             572 / 1,000
  Servicing requests for partitioned objects            6.4 ms               5.6 / 10.1
  Reading replicas from disk                            1,268.7 ms           1,115 / 1,465
  Reading replicas from disk and partitioning           1,320.8 ms           1,148 / 1,504

Table 6.3: Breakdown of recovery time among the steps of recovery. All metrics are taken from one recovery with 80 recovery masters using 500 MB partitions, which recovers 40 GB of data. Metrics taken from backups and recovery masters report average times along with the minimum, maximum, and a distribution. Table 6.4 includes additional statistics taken from the same recovery.

[Figure 6.6 (plot): cumulative fraction of recoveries versus recovery time (ms), with lines for recovery time and for maximum, average, and minimum disk reading times.]

Figure 6.6: Distribution of recovery time for an 80-node cluster recovering 40 GB. Nodes act both as a recovery master and a backup; each node recovers one 500 MB partition and manages two flash SSDs (160 total SSDs cluster-wide). Each line is a cumulative distribution of 300 recoveries showing the fraction of recoveries that completed within a given time. The minimum, average, and maximum amount of time backups spent reading data from disk is also shown. The worst recovery time of all the runs is 1967 ms. The median recovery time is 1864 ms; recovery time varies by less than 200 ms across all of the recoveries.

SSDs perform more slowly during recovery than they do when benchmarked in Figure 6.2. During recovery the slowest SSD takes about 1.4 seconds to read its share of the segments rather than the expected 1.1 seconds. Second, even once all of the data is loaded from storage, recovery takes another 450 ms to complete, which suggests that recovery masters cannot replay data as fast as backups can load and filter it. In fact, recovery masters seem to be replaying log data at about 270 MB/s as compared to the expected 800 MB/s for 1 KB objects reported by the benchmark in Figure 6.3. What causes both storage I/O performance and replay performance to decrease? Backups and masters are combined on a single server, so contention for CPU

Network Utilization                                     Average          Min / Max
  Average transmit rate per host                        9.4 Gb/s         9.0 / 9.9
  Aggregate average network utilization                 715.5 Gb/s       (36% of full bisection)

Disk Utilization                                        Average          Min / Max
  Active host read bandwidth                            398.5 MB/s       349 / 447
    Aggregate                                           31,878.9 MB/s
  Average host read bandwidth                           266.6 MB/s       258 / 271
    Aggregate                                           21,329.0 MB/s
  Segments read from storage per host                   63               61 / 64

Table 6.4: Network and storage statistics for a recovery. All statistics are taken from a single recovery with 80 recovery masters using 500 MB partitions, which recovers a total of 40 GB of data. Statistics are reported as averages along with the minimum, maximum, and a distribution when appropriate. Table 6.3 includes a breakdown of recovery time for the same recovery. Average transmit rates are total bytes transmitted divided by end-to-end recovery time; peak rates are higher, since transmit rates vary significantly throughout recovery. Similarly, average storage read bandwidth is total data read divided by end-to-end recovery time; active storage read bandwidth reports storage bandwidth over just the intervals of recovery when storage is actively being read.

resources could be a problem. However, that alone would not explain the loss of I/O performance, since it should be mostly independent of CPU utilization. As a result, we focused on the shared resource between these two tasks: memory bandwidth. To determine whether memory bandwidth utilization is limiting recovery time, two questions must be answered. First, what is the maximum memory bandwidth that each node can sustain? Second, what is the actual memory bandwidth utilization during recovery?

To answer the first question, we ran an experiment that sequentially accesses cache lines and measures memory bandwidth utilization. During the experiment, the workload is gradually varied between a pure read-only workload and a pure write-only workload. Figure 6.7 shows the results. The mix of read and write operations has some impact on the amount of data that can be transferred to and from DRAM, but, for the most part, a node's memory bandwidth saturates at about 10 to 11 GB/s.

To address the second question, we measured memory bandwidth utilization during recovery. In order to understand the impact of combining a backup and a recovery master on each node, experiments were run in two process configurations. The first configuration was the default “combined” configuration, where a master and backup are run together on each node.

[Figure 6.7 (plot): total memory bandwidth (GB/s) versus memory bandwidth used to write back cache lines (GB/s), with separate lines for read and write access types.]

Figure 6.7: Maximum achievable memory bandwidth across a node's DRAM controller as a function of DRAM write traffic. A benchmark sequentially accesses cache lines for a large region of memory to infer the maximum bandwidth to DRAM. The benchmark varies the ratio of read accesses to write accesses to show the total memory bandwidth for varying mixes of traffic. At the left end of the graph, the benchmark only reads cache lines. At the right end of the graph, the benchmark only modifies cache lines; 50% of memory controller traffic is due to cache line reads even when the benchmark is only modifying cache lines, since each cache line must be fetched into cache, modified, and written back. Each node has about 10 GB/s of total memory bandwidth to DRAM.

The second configuration split the roles of the master and backup onto separate nodes in order to reduce the memory bandwidth utilization on each node. Furthermore, each configuration was tested both for a smaller recovery including only 6 partitions of data and 6 hosts (12 hosts when backups and masters are split), and for a larger recovery with 40 partitions of data and 40 hosts (80 hosts when backups and masters are split). Recoveries using more than 40 partitions were not possible in the split configuration; all 80 nodes were in use, each running either a recovery master or a backup. These recovery sizes correspond to the sizes of the recoveries at the leftmost and middle data points in Figure 6.4.

Figure 6.8 plots the results, and shows that memory bandwidth is saturated during recovery, mostly due to the load on backups.

[Figure 6.8 (plots): total memory bandwidth (GB/s) versus time from the start of recovery (ms) for the Combined, Master, and Backup configurations, in columns for “6 Partitions, 6/12 Hosts” and “40 Partitions, 40/80 Hosts”, with separate lines for read and write accesses.]

Figure 6.8: Memory bandwidth utilization during recovery. “Combined” shows memory bandwidth utilization during recovery when each node acts as both a recovery master and a backup as in Figure 6.4. The “Master” and “Backup” graphs show memory bandwidth utilization throughout recovery when recovery masters and backups are run on separate nodes as in Figure 6.9. Recoveries with masters and backups on separate nodes use twice as many nodes as the corresponding recoveries with masters and backups combined on the same node. Recoveries labeled “6 Partitions, 6/12 Hosts” correspond to the leftmost data point in Figure 6.4 and recover 3 GB of data total. Recoveries labeled “40 Partitions, 40/80 Hosts” correspond to the middle point of Figure 6.4 and recover 20 GB of data total. Nodes running a recovery master alone do not saturate memory bandwidth. Nodes running a backup alone saturate memory bandwidth during most of recovery. When a recovery master and a backup run on the same node, memory bandwidth is saturated for the duration of recovery, and recovery time increases.

A recovery master running alone on a node uses about 2 to 4 GB/s of memory bandwidth, and a backup alone uses about 9 GB/s of memory bandwidth. Since memory bandwidth saturates at about 10 GB/s, it appears that a backup alone may already be memory bandwidth constrained. When recovery masters and backups are placed on a single server, it becomes overloaded; memory bandwidth remains saturated at about 9 GB/s for the entire recovery, and recovery slows down. Overall, the results show that the backup component on each node is much more memory bandwidth intensive than the recovery master component.

Figure 6.9 shows how using the alternate configuration, where masters and backups are run on separate nodes, improves overall recovery times. Compared to Figure 6.4, recovery times improve by about 20% for both large and small recoveries. Mitigating memory bandwidth overload also improves the scalability of recovery; the slope in recovery time is about 35% lower overall when backups and recovery masters run on separate servers.

The reason memory bandwidth is saturated is an artifact of the size of our experimental cluster; the work of recovery is concentrated on a small number of nodes, which become overloaded. In an expected RAMCloud deployment of 1,000 machines or more, recovery would be spread across more backups. To recover 64 GB from a crashed master, each backup would only load and process 8 segments and would only expect to receive about 24 new segment replicas. In our 80-node cluster, each backup must process 64 replicas and back up 192 new replicas. As a result, each backup does eight times the work in nearly the same period as it would in a realistic deployment. Furthermore, each node in this configuration also acts as a recovery master, whereas in a full-scale cluster, less than 10% of nodes would have to perform both roles at the same time. Thus, recovery should be both faster and more scalable in a larger production cluster than in our experimental cluster.

Newer CPUs provide substantially higher memory bandwidth and additional cores compared to the nodes of our experimental cluster. On newer machines, placing recovery masters together with backups in a small cluster should have a less drastic effect on recovery time. For example, benchmarking a machine that is two years newer showed that its memory bandwidth is about 47 GB/s: more than four times that of the machines used in these experiments. On the other hand, newer machines also have higher DRAM capacity, so newer hardware may still require a large cluster to recover an entire server's DRAM quickly.
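The per-backup workload comparison above is simple arithmetic; the sketch below just restates it. The 8 MB segment size and disk replication factor of 3 come from earlier sections; the cluster sizes are the ones discussed in the text.

    # Arithmetic behind the per-backup workload comparison above.
    def per_backup_load(crashed_gb, num_backups, segment_mb=8, replication=3):
        segments = crashed_gb * 1024 // segment_mb
        return {
            "replicas_to_read": segments // num_backups,
            "new_replicas_to_write": segments * replication // num_backups,
        }

    print(per_backup_load(64, 1000))  # large deployment: ~8 read, ~24 written
    print(per_backup_load(40, 80))    # our cluster:      64 read, 192 written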

[Figure 6.9 (plot): recovery time (ms) versus total data recovered (0 to 40,000 MB) when recovery masters and backups run on separate nodes, showing total recovery time and maximum, average, and minimum disk reading times, with the combined-configuration lines from Figure 6.4 shown for comparison.]

Figure 6.9: Recovery time as crashed master size and cluster size are scaled when recovery masters and backups are run on separate nodes. Just as in Figure 6.4, a master is filled with 500 MB for each recovery master in the cluster. The master is then crashed and recovered. In these recoveries, two nodes are added for each partition recovered: one operating as a recovery master and one operating as a backup. Lines from Figure 6.4 are shown as dotted gray lines for comparison. Separating the work of backups and masters improves overall recovery time. The flash SSDs perform more consistently and finish reading replicas more quickly. Also, the time between when backups finish loading replicas into DRAM and when recovery masters finish replaying data is shorter. Each point is an average of 5 runs.

6.4.2 Recovery Coordination Overhead

It is important to keep the overhead for additional masters and backups small, so that recovery can span hundreds of hosts in large clusters. For example, the coordinator performs three rounds of broadcast during recovery; if the time for these broadcasts grows to a substantial fraction of recovery time in a large cluster, then large recoveries may become impractical. As another example, recovery masters must perform more RPCs per byte of data recovered as the number of partitions used during recovery increases. A full failed master always leaves behind the same number of replicas, and each recovery master always recovers the same amount of data regardless of the number of partitions. However, as more partitions/recovery masters participate in recovery, each replica is broken down into more buckets, each of which is smaller. Recovery masters fetch data from backups one bucket at a time, so they must make more requests to collect the same amount of data as the number of partitions increases. If it becomes impossible to keep recovery masters busy as a result, then recovery time will suffer.

In order to isolate the overhead of the recovery coordination broadcasts, we measured recovery time with artificially small segments that each held a single 1 KB object and kept all segment replicas in DRAM to eliminate disk overheads. The bottom series of points in Figure 6.10 shows the recovery time using trivial partitions containing just a single 1 KB object in a single segment; this measures the cost for the coordinator to contact all the backups and masters during the recovery setup phase, since the time to replay the objects on recovery masters is negligible. Our cluster scales to 80 recovery masters and backups with only about a 10 ms increase in total recovery time thanks to our low-latency RPC system.

The upper series of points in Figure 6.10 exposes the overhead for recovery masters to perform the increased communication with backups required as the number of partitions scales. It plots recovery time for a crashed master with 63 segments each containing one object. In this configuration, the cluster performs roughly the same number of RPCs (shown as the dashed line) to fetch data from backups and re-replicate data to backups as it does when recovering 500 MB partitions using 8 MB segments, but very little data is processed. As a result, these measurements expose RPC overheads as recovery scales. As the

[Figure 6.10 (plot): recovery time (ms, left axis) and total RPC count (right axis) versus number of partitions (recovery masters), with lines for 1 segment per partition, 63 segments per partition, and total RPCs.]

Figure 6.10: Recovery management overhead as a function of system scale. Segment size is limited to a single 1 KB object and segment replicas are stored in DRAM in order to eliminate overheads related to data size or disk. Recoveries with “1 Segment per Partition” recover a single object per recovery master. This measures the coordinator's overheads for contacting masters and backups. With 80 nodes, the fastest possible recovery time is about 10 ms. Recoveries with “63 Segments per Partition” recover the same number of segments as the full 500 MB partitions used in Figure 6.4, but each segment only contains a single 1 KB object. In this configuration, each node performs roughly the same number of RPCs (shown as the dashed line) as it would in a full recovery but without the transfer and processing overheads for full 8 MB segments. It measures the overhead of the increased communication required between masters and backups to split recovery across more recovery masters. Each node acted as both a recovery master and a backup. Each data point represents a single recovery; recoveries are repeated five times each for all numbers of partitions from 6 to 80 to show variability.

system scale increases, each master must contact more backups, retrieving less data from each individual backup. Each additional recovery master adds about 600 µs of overhead, so work can be split across the 120 recovery masters needed to recover a full 64 GB server while accounting for less than 7% of total recovery time. At the same time, this overhead represents about 12% (120 ms) of a one-second recovery time using 200 recovery masters, so it may constitute a noticeable delay for larger recoveries.

6.5 The Limits of Fast Crash Recovery

The experiments of this chapter raise questions about the limits of fast crash recovery as cluster size is scaled up or down. Experiments indicate that recovery times will begin to suffer noticeable overheads when scaling beyond a few hundred recovery masters. What is the largest server that can be recovered in 1 to 2 seconds today? Will recovery be able to retain 1 to 2 second recovery times as server hardware evolves, particularly as DRAM capacity grows? What changes will be needed for fast recovery to be effective for future clusters? At the same time, what is the smallest cluster for which fast crash recovery is effective? Our 80-node cluster is already heavily overloaded in providing 2-second recovery times for 40 GB of data. Will long recovery times in small clusters make RAMCloud impractical for small datasets? Finally, what is the fastest possible recovery for a full 64 GB master in a large cluster, and what would be the bottleneck in sub-second recoveries?

6.5.1 Scaling Up

The key question for the long-term viability of fast recovery is, “how much data can a sufficiently large RAMCloud cluster be expected to recover in 1 to 2 seconds?” To answer this question, we use the measurements from earlier experiments to project the amount of data a cluster could recover if recovery time were fixed to either 1 or 2 seconds. Figure 6.9 showed that when servers are not overloaded they recover data at about 360 MB/s. Analyzing the slope of that same figure, recovery time increases by about 1.8 ms for each server added to recovery. Using these numbers we can project how much data can be recovered as cluster size grows. The top graph of Figure 6.11 shows the results. The second line from

[Figure 6.11 (plots): “Projected Maximum Server Recovery Size”: total recoverable data (GB) versus cluster size (recovery masters), with one panel where the overhead of adding a recovery master is 1.8 ms and one where it is 0.6 ms, and lines for 1-second and 2-second recoveries with 360 MB/s and 720 MB/s recovery masters.]

Figure 6.11: Projection of the maximum size of a crashed server recoverable in 1 or 2 seconds. Each line shows how much data can be recovered depending on how quickly each recovery master replays data and whether recovery time is fixed at 1 or 2 seconds. The top graph assumes each additional recovery master adds 1.8 ms of wasted time due to coordination overhead (derived from the slope of Figure 6.9). The bottom graph assumes each additional recovery master adds 600 µs of wasted time to recovery (derived from the more optimistic slope of Figure 6.10). Note that the scales differ between the two graphs. Today's implementation should scale to recover a failed master with 200 GB of data in 2 seconds with about 560 recovery masters (top graph, second line from the top). If the overhead of adding a recovery master drops to 600 µs, recovery scales to 3 times as many machines and should recover 600 GB servers in 2 seconds (bottom graph, second line from the top). Lines labeled “720 MB/s recovery masters” project how much could be recovered if recovery master performance is doubled (for example, by spreading replay across multiple cores). With the higher recovery master performance, servers as large as 1.2 TB could be recovered in 2 seconds.

the top shows that RAMCloud should be able to recover 200 GB of data using 560 recovery masters when recovery time is limited to 2 seconds. When recovery time is limited to 1 second (the solid line), server capacity must be limited to 50 GB, even in large clusters. These estimates are likely to be optimistic. There will be unexpected super-linear costs hidden in the implementation that will be exposed as RAMCloud is run on large clusters; however, these bottlenecks should not be fundamental to the design.

Recovery bandwidth can be improved in two ways to recover larger servers. The first way is by making each server faster. If backups can load data more quickly and recovery masters can replay data more quickly, then more data can be recovered in 1 to 2 seconds. The second way is to reduce the overhead of including more recovery masters in recovery, which allows recovery to be split efficiently across more servers. Figure 6.11 suggests that both optimizations will be needed in order to scale with the growing DRAM capacity of servers; each recovery master will need to perform more work, and RAMCloud will need to eliminate overheads to incorporate more recovery masters.

Improving recovery master performance will require additional server-localized optimizations to take advantage of multicore and non-uniform memory access (NUMA). Recently, CPUs have seen little single-core performance gain. Instead, newer CPUs provide more cores and more sockets. Server memory controllers have seen a similar trend; NUMA is now ubiquitous in server architectures. To achieve the best performance, developers must partition data across DRAM banks attached to different cores and make an effort to co-locate computation on the core nearest to the desired data. To recover more data more quickly, recovery masters will have to be adapted to spread the work of recovery across all of the cores and all of the memory controllers of each machine.

Some effort will be required to determine the most efficient way to divide the work of recovery across many cores and memory controllers on each recovery master. Recovery masters could break up the work of recovery in a fine-grained way by replaying different data from segment replicas on different cores in parallel. However, cores would need to share access to a single hash table, in-memory log, and replica manager. All of these structures are thread-safe and provide fine-grained locking where possible, but parallel recovery threads would contend for access to these structures. The in-memory log may be a particularly serious bottleneck since all appends to the head of the log are serialized.

A more coarse-grained approach would be to run logically separate masters on each server. This would limit contention on shared data structures, but it may not provide an overall improvement if the overhead of including each of these logical recovery masters in recovery is just as high as it is for a standard recovery master today.

Servers with 8 cores are ubiquitous today and 16 cores are common, so obtaining a two-fold recovery master performance improvement over the 4-core machines used in our evaluation seems attainable. Multicore and NUMA optimizations would have provided little benefit in our experimental cluster; each server only provided a single memory controller and all cores were fully utilized even with a single core replaying recovery data. Additional lines in Figure 6.11 show recovery capacity when recovery master performance is doubled to 720 MB/s; recovery capacity is doubled, so failed servers with 400 GB of DRAM could be recovered in 2 seconds. If recovery master performance were improved to more than 1 GB/s, then network bandwidth would start to become a limitation, since each recovery master must triplicate all the data it processes; servers would need to move to 56 Gbps Infiniband or 100 Gbps Ethernet if recovery master performance quadrupled.

Reducing the overhead for including additional recovery masters in recovery is just as essential to improving recovery capacity as improving recovery master performance. Figure 6.10 measured “data-less” recoveries that perform the same number of RPCs as a full recovery. The results showed that recovery time increases by about 600 µs for each additional recovery master. In a large-scale recovery, each RPC contains very little data, similar to these data-less recoveries. Optimistically, if full recoveries could be tuned to scale with equally low overheads, then the amount of data that can be recovered in 1 to 2 seconds would improve by a factor of 3. The bottom graph of Figure 6.11 shows the same recovery configurations as the upper graph except with a 600 µs overhead to add a recovery master instead of 1.8 ms. With this change, 600 GB masters could be recovered in 2 seconds. Taken together with multicore optimizations, recovering a failed 1.2 TB master should be possible in 2 seconds with about 1600 machines.
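The projections in Figure 6.11 appear to follow a simple model, reconstructed below as a sketch (our interpretation, not code or a formula from the dissertation): with n recovery masters, per-master replay rate r, per-master coordination overhead o, and a recovery-time budget T, roughly n·r·(T − n·o) bytes can be recovered, and the best cluster size is about T/(2o). Plugging in the measured rates and overheads approximately reproduces the figure's numbers.

    # Sketch of the projection model that seems to underlie Figure 6.11.
    def recoverable_gb(n, rate_mb_s, overhead_ms, budget_s):
        replay_time = budget_s - n * overhead_ms / 1000.0   # time left for replay
        return max(0.0, n * rate_mb_s * replay_time / 1024.0)

    for rate, overhead, budget in [(360, 1.8, 2.0), (360, 1.8, 1.0),
                                   (360, 0.6, 2.0), (720, 0.6, 2.0)]:
        best_n = int(budget / (2 * overhead / 1000.0))       # optimal cluster size
        print(f"{rate} MB/s, {overhead} ms overhead, {budget:.0f} s budget: "
              f"~{recoverable_gb(best_n, rate, overhead, budget):.0f} GB "
              f"with {best_n} recovery masters")

For example, 360 MB/s masters with 1.8 ms overhead and a 2-second budget give roughly 200 GB at about 560 masters, and 720 MB/s masters with 0.6 ms overhead give roughly 1.2 TB at about 1,700 masters, matching the projections above.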

6.5.2 Scaling Down

Not only do our experiments raise questions about scaling up recovery, but they also raise questions about how well recovery will work in small RAMCloud clusters. Figure 6.5 shows that RAMCloud recovers the full capacity of one of our (24 GB) nodes in 1.0 seconds using our entire 80-node cluster, but less and less data can be recovered in the same time period as cluster size scales down. With fewer nodes there are fewer recovery masters, which limits the amount of data that can be recovered in a few seconds. For example, the same figure shows that a cluster of 6 nodes only recovers 1.8 GB of data in about 900 ms. Using 500 MB partitions (Figure 6.4), 6 nodes can recover 3 GB of data in about 1.65 seconds. Thus, a 6-node cluster can only support about 3 GB of data per node while retaining a 2-second recovery time. Severely limiting per-server capacity hurts RAMCloud's cost-effectiveness at small scale. This problem is compounded by the fact that servers with 64 GB to 256 GB of DRAM are the most cost-effective today. For small clusters, a different or supplementary approach may be more cost-effective. For example, a small cluster could duplicate each object in the DRAM of two servers in addition to logging data on disk as usual. Since failures are infrequent in a small cluster, two-way replication is likely to be sufficient. Online redundancy would eliminate lapses in availability and provide additional time to reconstruct the lost object replicas. Returning to the 6-node cluster, DRAM utilization could be safely improved to 50%, a ten-fold increase in per-server capacity over using fast recovery alone. Buffered logging would still be necessary to protect against data loss during correlated failures and power loss. Adding servers to a small cluster causes a quadratic improvement in the overall capacity of a RAMCloud cluster, because each added server brings two benefits. First, it adds more DRAM capacity. Second, it can act as a recovery master, which allows more of the DRAM capacity of each existing master to be used safely. Thus, adding servers is a particularly cost-effective way to increase cluster capacity, up until the amount of data that can be recovered in 1 to 2 seconds exceeds the DRAM capacity of an individual server. Our 2 TB 80-node cluster with 24 GB servers and a 1.0-second recovery takes full advantage of this quadratic effect; we can use all of the DRAM of the cluster safely.
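The quadratic effect can be made concrete with a short sketch. If each recovery master can replay roughly one 500 MB partition within the recovery deadline (the partition size from Figure 6.4) and each server holds 24 GB of DRAM (as in our cluster), then the safely usable capacity of an n-node cluster grows roughly as n × min(n × 0.5 GB, 24 GB). The constants, and the simplification that all n nodes act as recovery masters, are assumptions for illustration.

    #include <algorithm>
    #include <cstdio>

    // Illustrative sketch of the quadratic capacity effect in small clusters.
    // Assumes every node can replay about one 500 MB partition within the
    // recovery deadline and that each server has 24 GB of DRAM; the constants
    // come from our cluster but the model itself is only approximate.
    static double safeClusterCapacityGB(int nodes, double partitionGB = 0.5,
                                        double dramPerServerGB = 24.0) {
        // Per-server capacity is limited by how much the cluster can recover
        // in time, capped by the physical DRAM of a single server.
        double perServer = std::min(nodes * partitionGB, dramPerServerGB);
        return nodes * perServer;
    }

    int main() {
        std::printf("6 nodes:  %4.0f GB total (%.1f GB per server)\n",
                    safeClusterCapacityGB(6),  std::min(6 * 0.5, 24.0));
        std::printf("80 nodes: %4.0f GB total (%.1f GB per server)\n",
                    safeClusterCapacityGB(80), std::min(80 * 0.5, 24.0));
        return 0;
    }

The 6-node case yields about 3 GB per server (18 GB total), matching the measurement above, while the 80-node case is capped by the 24 GB of DRAM per server, yielding the full 2 TB of the cluster.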

6.5.3 What Is the Fastest Possible Recovery?

Assuming a large cluster, it should be possible to recover an entire server in less than 1 second by using more backups, more recovery masters, and smaller partitions. Given a cluster unbounded in size, what is the fastest that a full server with 64 GB of DRAM can be recovered? The key constraint is how quickly the coordinator can contact servers; the faster the coordinator can broadcast, the wider it can spread the work of recovery, which boosts recovery bandwidth. Estimates of coordinator broadcast times, recovery master replay bandwidth, and flash I/O performance yield a best-case lower bound of 330 ms to recover a 64 GB server using a 2,000-node cluster with today's implementation. The bottom set of points in Figure 6.10 measures the cost of coordinator broadcast during recovery. It shows that an 80-node recovery can perform the three broadcasts required for recovery (along with a small amount of other work) in about 10 ms. Assuming each of the three 80-node broadcasts takes about one-third of that (3 ms), each coordinator broadcast should scale to take about 80 ms in a 2,000-node cluster. Each flash disk can read a segment replica in about 30 ms, so if a 64 GB server is divided into 8,000 segments scattered across 2,000 servers with 2 flash disks each, then each server should be able to load its primary replicas in about 60 ms. Factoring in the time for the initial broadcast, all the data required for recovery should be completely loaded from storage within 140 ms after the start of recovery. The coordinator would start its third broadcast to start recovery masters 160 ms into recovery (after the first two rounds of broadcast). Assuming each recovery master can replay data at 360 MB/s and accounting for broadcast time, 2,000 recovery masters could replay the data in about 170 ms. As a result, 330 ms would be a near-optimal recovery time for a failed 64 GB server, and a 2,000-machine cluster would be required. Using a larger cluster is slower due to longer broadcast times, and using fewer machines is slower due to reduced I/O and replay bandwidth. If coordinator broadcast times were made nearly instantaneous, then recovery times as low as 50 ms should be possible. Each server would read a single replica from storage in about 30 ms, all replicas would become available in DRAM at the same time, and the

recovery masters would replay them in about 20 ms. Though broadcasts cannot be instantaneous, recovery could use a tree-like pattern to spread the work quickly. Recovery would then be able to spread work across a number of nodes exponential in broadcast time (instead of linear, as it is now). In the time of about 12 back-to-back RPCs by the coordinator, recovery could be spread across 4,000 nodes; in this case, each SSD would only need to load a single segment, which would provide very nearly minimal recovery times.
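The 330 ms estimate is just the sum of the broadcast, storage-read, and replay stages described above. The arithmetic is sketched below using the per-stage figures quoted in this section (an ~80 ms broadcast round at 2,000 nodes, 30 ms per flash replica read, and 360 MB/s of replay per master); the constants are rough assumptions rather than measurements of a cluster of that size.

    #include <cstdio>

    // Back-of-the-envelope breakdown of the ~330 ms best case for recovering a
    // 64 GB server on a 2,000-node cluster. Constants are the rough per-stage
    // figures quoted in this section, not measurements of such a cluster.
    int main() {
        const int    nodes        = 2000;
        const double broadcastSec = 0.080;              // one coordinator round at 2,000 nodes
        const double readSec      = 0.030;              // one 8 MB replica from flash
        const int    segments     = 8000;               // 64 GB divided into 8 MB segments
        const int    disks        = nodes * 2;          // two flash disks per server
        const double loadSec      = readSec * segments / disks;      // ~60 ms
        const double replaySec    = (64.0 * 1024 / nodes) / 360.0;   // ~90 ms at 360 MB/s

        // Reads start after the first broadcast and overlap with the second;
        // recovery masters begin replaying after the third broadcast.
        double dataLoaded = broadcastSec + loadSec;                      // ~140 ms
        double total      = 2 * broadcastSec + broadcastSec + replaySec; // ~330 ms
        std::printf("replicas loaded by ~%.0f ms, recovery complete by ~%.0f ms\n",
                    dataLoaded * 1000, total * 1000);
        return 0;
    }
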

6.6 How Well Does Segment Scattering Work?

During normal operation, each master works to optimize recovery time by balancing the amount of time each disk will have to spend reading data during recovery. Balancing disk reading time eliminates stragglers and improves recovery time, since the cluster never has to wait idly on a few slow disks to finish loading data. Is this optimization necessary, and is it effective? In particular, how does it impact recovery time when disk performance is relatively uniform and when disk performance is highly variable? Can it mask degraded performance due to defective disks, and will it be able to hide performance differences between disk models? To answer these questions, we ran an experiment that varied the algorithm used to distribute segment replicas and measured the impact on recovery time. In these experiments, the two flash SSDs in each node were replaced with two traditional hard disks, one each of two different models (Configuration 2 in Table 6.2). Our disks have greater performance variance than our SSDs, and RAMCloud must effectively compensate for this additional variance in order to recover quickly. We measured three different variations of the placement algorithm: the full algorithm, which considered both disk speed and the number of primary replicas already present on each backup; a version that used purely random placement; and an in-between version that attempted to even out the number of primary replicas on each backup but did not consider disk speed. The top graph in Figure 6.12 shows that the full algorithm improved recovery time by about 33% over a purely random placement mechanism. Much of the improvement came from evening out the number of segments on each backup; considering disk speed improved recovery time by only 12% over the even-segment approach because the disks did not vary much in speed.
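For reference, the "Even Read Time" variant can be sketched as randomized selection with refinement: a master considers a handful of random candidate backups and places the replica on the one whose disk would finish reading its primary replicas soonest. The Backup structure, candidate count, and exact scoring function below are illustrative assumptions, not RAMCloud's actual interfaces.

    #include <cstddef>
    #include <random>
    #include <vector>

    // Simplified sketch of the "Even Read Time" placement policy (randomized
    // selection with refinement, in the spirit of Section 4.3). The types and
    // the scoring function are illustrative, not RAMCloud's real interfaces.
    struct Backup {
        double diskMBPerSec;     // measured bandwidth of this backup's disk
        int    primaryReplicas;  // primary replicas already placed on it
    };

    // Time this backup's disk would need to read its primary replicas during
    // recovery if one more 8 MB replica were placed on it.
    static double projectedReadSec(const Backup& b, double segmentMB = 8.0) {
        return (b.primaryReplicas + 1) * segmentMB / b.diskMBPerSec;
    }

    // Pick the best of a few random candidates; "Uniform Random" would skip
    // the refinement, and "Even Segments" would ignore diskMBPerSec.
    static std::size_t choosePlacement(const std::vector<Backup>& backups,
                                       std::mt19937& rng, int candidates = 5) {
        std::uniform_int_distribution<std::size_t> pick(0, backups.size() - 1);
        std::size_t best = pick(rng);
        for (int i = 1; i < candidates; ++i) {
            std::size_t c = pick(rng);
            if (projectedReadSec(backups[c]) < projectedReadSec(backups[best]))
                best = c;
        }
        return best;
    }
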

[Figure 6.12 consists of two cumulative distribution plots (top panel: "Fans Normal"; bottom panel: "Fans High") of recovery time in seconds against the cumulative fraction of recoveries, with one curve each for the "Even Read Time", "Even Segments", and "Uniform Random" placement policies.]

Figure 6.12: Impact of segment replica placement and variance in storage performance on recovery time. In this experiment, the two flash SSDs in each node are replaced with two traditional hard disks, one each of two different models (Configuration 2 in Table 6.2). In each recovery, 20 recovery masters each recover 600 MB of a 12,000 MB crashed master from 60 backups (120 total disks). Each line is a cumulative distribution of 120 recoveries showing the fraction of recoveries that completed within a given time. The disks provide about 12,000 MB/s of combined read bandwidth, so ideally recovery should take about 1 second. "Even Read Time" uses the placement algorithm described in Section 4.3; "Uniform Random" uses a purely random approach; and "Even Segments" attempts to spread segment replicas evenly across backups without considering disk speed. The top graph measured the cluster in its normal configuration, with relatively uniform disk performance; the bottom graph measured the system as it was shipped (unnecessarily high fan speed caused vibrations that degraded performance significantly for some disks). With fans at normal speed, "Even Read Time" and "Even Segments" perform nearly the same since there is little variation in disk speed.

To test how the algorithm compensates for degraded disk performance, we also took measurements using the configuration of our cluster when it first arrived. The fans were shipped in a "max speed" debugging setting, and the resulting vibration caused large variations in speed among the disks (as much as a factor of 4×). The performance degradation disproportionately affected one model of disk, so about half of the disks in the cluster operated in a severely degraded state. In this environment, the full algorithm provided an even larger benefit over purely random placement, but there was relatively little benefit from considering replica counts without also considering disk speed (Figure 6.12, bottom graph). RAMCloud's placement algorithm compensates effectively for variations in the speed of disks, allowing recovery times almost as fast with highly variable disks as with uniform disks. Disk speed variations may not be significant in our current cluster, but we think they will be important in large data centers where there are likely to be different generations of hardware.

6.7 Impact of Fast Crash Recovery on Data Loss

RAMCloud replicates each segment across backups of the cluster to protect against data loss. By default, each master stores three copies of data on backups in addition to the copy of data stored in its DRAM. As a result, RAMCloud will never lose data unless at least four nodes are inaccessible. However, the threat of data loss rises when too many machines are simultaneously unavailable. This suggests two ways to improve durability: increased redundancy and fast recovery. Increasing redundancy increases the number of simultaneous failures a cluster can sustain without threat of data loss. Fast recovery reduces the length of each individual failure, which reduces the probability of overlapping failures. Fast recovery has an advantage in that it does not require additional storage capacity; however, additional redundancy is also beneficial, because it improves resilience against bursts of failures. Given these two approaches, how should a system trade off between them? Should a system just keep two copies of data on disk and recreate lost replicas in 1 second? Or is it better to keep three replicas of every segment and recreate lost replicas in 100 seconds?

[Figure 6.13 is a log-scale plot of the probability of data loss in one year (y-axis, from 1 down to 1e-18) against the number of servers (x-axis, 100 to 100,000), with one curve for each combination of 2 or 3 on-disk replicas and 1-second or 100-second recovery.]

Figure 6.13: Probability of data loss in one year as a function of cluster size, assuming 8,000 segments per master and two crashes per year per server with a Poisson arrival distribution. Different lines represent different recovery times and levels of replication. "2 replicas" indicates two disk-based replicas with one additional replica in DRAM; "3 replicas" keeps a third replica on disk. When keeping two on-disk replicas, reducing recovery time by a factor of 100 reduces the probability of data loss by a factor of 10,000. If recovery time is fixed at 100 seconds and three copies are stored on disk rather than two, then the probability of data loss improves by a factor of 25,000. The gap between 1-second recovery with 2 and 3 replicas shows that fast recovery reduces the probability of data loss even more effectively as the replication level goes up. Many real systems have recovery times on the order of 100 seconds (like GFS [18]), and improving their recovery times would increase their fault-tolerance in larger clusters.

To answer these questions, we simulated the probability of data loss with varying recovery times and replication configurations. For these simulations, we assume that a single server will fail about twice per year and that server failures are independent. Data provided by Ford et al. [15] suggest that servers fail with about this frequency: over a 3-month period they recorded between about four and ten failures per 1,000 servers per day on most days, which translates to about two to four failures per server per year. Simulations were run with 1-second recovery and with 100-second recovery, which is the recovery time of some other data center storage systems (for example, GFS [18]). Simulations were repeated with two and three on-disk replicas. Note that two on-disk replicas corresponds to the typical triple replication level of many disk-based systems (like GFS), since RAMCloud stores an additional replica in DRAM. Figure 6.13 plots the probability of data loss as cluster size grows for each of these configurations. The top line shows that with long recovery times, two replicas on disk are insufficient; a large cluster has a 20% chance of losing data each year. Adding an additional replica reduces the probability of data loss by about a factor of 25,000 (the third line from the top). If two replicas are retained and recovery time is decreased to 1 second instead, then the probability of data loss each year drops by a factor of about 10,000. Overall, the key observation is that improving recovery time by a factor of 100 (as RAMCloud does over typical systems) is nearly as effective at preventing data loss as adding an additional replica. Corroborating this further, adding a third replica on disk decreases the risk of data loss more significantly when recovery time is 1 second than when it is 100 seconds. That is, fast recovery makes adding a third replica more effective in preventing data loss. Although we made all of the performance measurements in this chapter with 3× on-disk replication to be conservative, Figure 6.13 suggests that the combination of two copies on disk and one copy in DRAM may be enough protection against independent server failures. The main argument for 3× disk replication is to ensure 3-way redundancy even in the event of a data center power outage, which would eliminate the DRAM copies. With 3× disk replication in addition to the DRAM copy, the likelihood of data loss is extremely small: less than 1 × 10^-12 per year with 1-second recovery times, even with 100,000 nodes.
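The qualitative shape of Figure 6.13 can be reproduced with a crude analytic approximation (this is not the simulator used to generate the figure). Treat a segment as lost only if every one of its R on-disk replicas sits on a server that also fails within the recovery window after the master's failure; ignoring backup re-replication and failure ordering gives the rough model below, whose main point is the (failure rate × recovery time)^R scaling that makes a 100× shorter recovery roughly 100^R times safer.

    #include <cmath>
    #include <cstdio>

    // Crude approximation of the independent-failure analysis (not the
    // simulator behind Figure 6.13). A segment is assumed lost only if, after
    // its master fails, all R servers holding its on-disk replicas also fail
    // within the recovery window; backup re-replication and failure ordering
    // are ignored, so absolute values are rough.
    static double lossProbabilityPerYear(int servers, double failuresPerServerYear,
                                         double recoverySec, int segmentsPerMaster,
                                         int onDiskReplicas) {
        const double secondsPerYear = 365.0 * 24 * 3600;
        double pOverlap = failuresPerServerYear * recoverySec / secondsPerYear;
        double expectedLossEvents = servers * failuresPerServerYear *
                                    segmentsPerMaster *
                                    std::pow(pOverlap, onDiskReplicas);
        return 1.0 - std::exp(-expectedLossEvents);
    }

    int main() {
        // 100,000 servers, 2 failures per server per year, 8,000 segments each.
        std::printf("2 replicas, 100 s: %.2g\n",
                    lossProbabilityPerYear(100000, 2.0, 100.0, 8000, 2));
        std::printf("2 replicas,   1 s: %.2g\n",
                    lossProbabilityPerYear(100000, 2.0, 1.0, 8000, 2));
        std::printf("3 replicas, 100 s: %.2g\n",
                    lossProbabilityPerYear(100000, 2.0, 100.0, 8000, 3));
        std::printf("3 replicas,   1 s: %.2g\n",
                    lossProbabilityPerYear(100000, 2.0, 1.0, 8000, 3));
        return 0;
    }
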

6.7.1 Vulnerability to Correlated Failures

One risk with Figure 6.13 is that it assumes server failures are independent. There is considerable evidence that this is not the case in data centers [10, 15, 46]; for example, it is not unusual for entire racks to become inaccessible at once. Thus, it is important for the segment scattering algorithm to compensate for sources of correlated failure, such as rack boundaries. For example, to recover from rack failures RAMCloud should ensure that no two replicas of a segment are stored on servers in the same rack. If there are unpredictable sources of correlated failure (for example, disks, CPUs, or DRAM that share a manufacturing flaw), they will result in longer periods of unavailability while RAMCloud waits for one or more of the backups to reboot or for defective hardware to be repaired; RAMCloud is no better or worse than other systems in this respect. Cidon et al. [8] show that RAMCloud's random scattering (as proposed in this dissertation) will result in data loss when even a small percentage of machines fail simultaneously. Unfortunately, such failures happen as frequently as a few times per year, due to data center power failures. The intuition for the problem is that each segment is scattered on three servers selected at random plus its master, and losing those four machines simultaneously results in the loss of that segment. In a large cluster, tens of millions of segments must be scattered, which creates many four-combinations of servers whose simultaneous failure will cause data loss. The authors provide a solution, which has been implemented in RAMCloud, that preserves the throughput benefit of scattering replicas for fast recovery and has no impact on normal-case performance. They also provide analysis of how the same mechanism can be used to mitigate data loss in large GFS and HDFS [46] clusters.
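The vulnerability can be illustrated with a short combinatorial sketch for the power-outage case, where every DRAM copy is already gone and a segment survives only if at least one of the three backups holding its disk replicas comes back. The cluster size and failure fraction below are assumptions for illustration, not numbers from Cidon et al. [8].

    #include <cmath>
    #include <cstdio>

    // Illustration of why purely random scattering risks data loss during
    // simultaneous failures (the intuition behind the copyset work [8]). After
    // a power outage all DRAM copies are gone; a segment is lost if the three
    // backups holding its disk replicas are all among the machines that fail
    // to come back. Cluster size and failure fraction are assumptions.
    int main() {
        const double servers    = 1000;
        const double segments   = servers * 8000;                 // 8,000 per master
        const double tripleSets = servers * (servers - 1) * (servers - 2) / 6.0;

        // Chance that one particular 3-backup combination holds some segment's
        // complete replica set, given random scattering.
        double pTripleUsed = 1.0 - std::pow(1.0 - 1.0 / tripleSets, segments);

        // If 1% of the machines (10 servers) never come back, any 3-combination
        // among them may be such a replica set (treated as independent here).
        const double failedTriples = 10.0 * 9.0 * 8.0 / 6.0;      // C(10, 3)
        double pLoss = 1.0 - std::pow(1.0 - pTripleUsed, failedTriples);

        std::printf("P(a given triple backs some segment)  = %.3f\n", pTripleUsed);
        std::printf("P(data loss when 10 servers are lost) = %.3f\n", pLoss);
        return 0;
    }

Even though any single 3-backup combination is unlikely to hold a complete replica set, the roughly 120 combinations among 10 permanently failed servers make some loss nearly certain in this sketch, which is why constraining replica placement (as in the copyset approach) matters.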

6.8 Summary

The evaluation of crash recovery shows that recovery bandwidth scales as cluster size scales. With 80 nodes equipped with flash SSDs, RAMCloud recovers data from a crashed master at about 21 GB/s. The size of our cluster limits our experiments; however, our results show that recovery can be spread across more nodes with little overhead. As a result,

we expect that the entire contents of a 64 GB crashed master could be recovered in about 2 seconds with a cluster of 120 machines similar to those used in our experiments. Our experiments also show that recovery time is hampered by memory bandwidth in our small cluster. This would be mitigated in a full-scale RAMCloud deployment, where recovery could be spread across hundreds or thousands of machines. Consequently, we expect recovery times to improve significantly in larger clusters. Using 8 MB segments with RAMCloud's buffered logging allows backups to utilize at least 85% of the maximum bandwidth of their storage devices, for both flash SSDs and traditional hard disks. RAMCloud also compensates well for differences in storage performance by scattering segments intelligently based on disk speeds. With about half of our disks operating in a severely degraded state that resulted in I/O bandwidth variance of 4×, recovery time only increased by about 30%. Extrapolating from the experiments of this chapter, we can project the limits of the current recovery implementation. A cluster of about 560 machines should be able to recover about 200 GB of data in 2 seconds. In order to keep up with growing DRAM capacities, recovery will need two improvements. First, the broadcast overhead must be reduced in order to spread recovery across more machines. Second, each recovery master will need to recover more data. With improvements to broadcast times and recovery master performance, 2-second recoveries for machines with more than 1 TB of data should be possible. Small clusters can only recover a fraction of a server's DRAM, since they have a limited number of servers to act as recovery masters. As a result, a small cluster would need to leave most of the DRAM of its servers idle to ensure fast recovery after failures. However, this may make RAMCloud impractically expensive for some applications. In the future, it may make sense for RAMCloud to use some form of in-memory redundancy in small clusters to supplement fast recovery. This would allow masters to use more of their DRAM for data storage without jeopardizing availability.

In large clusters it should be possible to recover entire servers in less than one second. Based on measured broadcast and flash I/O performance, it may be possible to recover a 64 GB server in about 330 ms using 2,000 machines. With more machines, coordination overhead begins to degrade recovery times; with fewer machines, I/O time hampers recovery. By modeling data loss due to server failures in larger clusters, we have determined that under independent failures, shorter recovery times are more effective at preventing data loss than increased redundancy. Our analysis shows that 1-second recovery times improve reliability by about a factor of 10,000.

Chapter 7

Related Work

Data center storage systems build on the well-studied areas of distributed file systems and databases. Consequently, RAMCloud uses ideas from each of these areas, while recombining them in a new context. It draws heavily on prior work on in-memory databases, log-structured file systems, striping filesystems, and database/file server crash recovery. At the same time, RAMCloud's push for unnoticeably-fast crash recovery along with its dependence on large-scale operation prevents the straightforward application of many existing ideas. A variety of "NoSQL" storage systems have appeared recently, driven by the demands of large-scale Web applications and the inability of relational databases to meet their needs. Examples include Dynamo [11], Cassandra [30], and PNUTS [9]. All of these systems are fundamentally disk-based and none attempt to provide latencies in the same range as RAMCloud. These systems provide availability using symmetric replication and failover to redundant copies on disk, rather than relying on fast crash recovery.

7.1 DRAM in Storage Systems

7.1.1 Caching

DRAM has historically been a critical part of storage; early database systems, and operating systems as early as Unix, augmented storage performance with page/buffer caching in DRAM. With buffer caching, rather than directly accessing disk, applications request data from a DRAM cache, which is filled from storage when the requested data is not in memory. The growing performance gap between disk and DRAM, along with increasing application working sets [12], has made buffer caches an indispensable component of all operating systems. Work on file system caching resurged in the 1980s and 1990s as distributed file systems saw wide deployment. Distributed file system clients in Echo, DEcorum, SpriteFS, and Spritely NFS compensated for the low bandwidth of early workstation networks by caching blocks in memory [23, 29, 41, 47]. More recently, many web applications have found the caching and buffer management provided by their storage systems to be inadequate and instead rely on external application-managed look-aside DRAM caches like memcached [37]. Memcached instances are often federated across many servers and allow cache size to be tuned independently of the underlying stable storage system. Facebook offers an extreme example of this approach to large-scale DRAM caching; it offloads database servers with hundreds of terabytes of DRAM cache spread across thousands of machines running memcached [39]. In short, caches are growing, and DRAM is playing an ever-increasing role in storage systems. Unfortunately, even with aggressive caching, performance in these systems is still limited by disks. For example, due to the performance gap between disk and DRAM, an application with a cache hit rate of 99% would still see a 10× slowdown compared to a 100% cache hit rate (that is, when all data is kept in DRAM). Furthermore, cache misses result in access times 1,000× worse than hits, which makes cache performance difficult to predict. RAMCloud avoids these problems, since it is not a cache; all operations are serviced from DRAM. Access times are more predictable and are limited only by networking latency. Caching mechanisms like memcached appear to offer a simple mechanism for crash recovery; if a cache server crashes, its contents can simply be reloaded on demand, either on the crashed server (after it restarts) or elsewhere. However, in large-scale systems, caching approaches can cause large gaps in availability after crashes. Since these systems depend on high cache hit rates to meet their performance requirements, if caches are flushed, then the system may perform so poorly that it is essentially unusable until the cache has refilled. Unfortunately, these caches are large and take a long time to refill, particularly since

they are typically reloaded using small, random disk I/Os. This happened in an outage at Facebook in September 2010 [26]: a software error caused 28 TB of memcached data to be flushed, rendering the site unusable for 2.5 hours while the caches refilled from slower database servers. Nishtala et al. [39] further describe these cascading failures due to overload when cache servers fail. They describe reserving 1% of their memcached servers as standbys that are swapped into service when a cache server fails; the result is a 10-25% reduction in requests to database servers due to server failures. RAMCloud avoids these issues by recovering full performance after failures in 1 to 2 seconds. It never operates in a degraded state. Some primarily disk-based data center storage systems allow in-memory operations on parts of datasets. For example, Bigtable [7] is a data center storage system in wide use at Google that allows entire "column families" to be loaded into memory. By contrast, RAMCloud keeps all data in DRAM, so all design choices in RAMCloud are optimized for in-memory operation; RAMCloud provides no means to query data stored on disk.

7.1.2 Application-Specific Approaches

Recently, some large-scale web applications have found fast and predictable access important enough to build application-specific approaches to storing entire datasets in DRAM spanning many machines. Web search is an example; search applications must sustain high throughput while delivering fast responses to users. A single user query may involve RPCs to hundreds or thousands of machines. Even if just 1% of RPCs resulted in disk access, nearly every response to a user query would be delayed. Consequently, both Google and Microsoft keep their entire Web search indexes in DRAM, since predictable low-latency access is critical in these applications. These systems use DRAM in ad hoc ways; new services and data structures are developed to expose DRAM for each new application. In contrast, RAMCloud exposes DRAM as a general-purpose storage system. One of RAMCloud's goals is to simplify the development of these types of applications.

7.1.3 In-memory Databases

Designs for main-memory databases [13, 16] emerged in the 1980s and 1990s as DRAM grew large enough for some datasets (about 1 GB), though these designs were limited to a single machine. Recently, interest in main-memory databases has revived, as techniques for scaling storage systems have become commonplace. One example is H-Store [28], which keeps all data in DRAM, supports multiple servers, and is general-purpose. However, H-Store is focused more on achieving full RDBMS semantics and less on achieving large scale or low latency to the same degree as RAMCloud. H-Store keeps redundant data in DRAM and does not attempt to survive coordinated power failures.

Avoiding Disk-Limited Performance

Early single-machine in-memory databases pioneered techniques to leverage disks for durability without incurring high disk write latencies on the fast path. Like RAMCloud, a key goal of these systems was to expose the full performance of DRAM storage to applications, and logging was a key component of the solution. Unlike RAMCloud, these logs were not the final home of data. Instead, the DRAM state was periodically checkpointed to disk, at which point these "commit" logs were truncated. As a result, these databases wrote all data to disk twice: once as part of the log and once as part of a checkpoint. This increased their consumption of disk bandwidth. In RAMCloud, the log is the location of record for data; requests for data are always serviced directly from the log. This makes a substantial difference in how recovery is performed; recovery in RAMCloud consists purely of log replay, and there is never a checkpoint to load. Also, RAMCloud reclaims log space using a log cleaner, rather than relying on checkpointing and truncation. One key technique these systems used for taking full advantage of disk bandwidth was group commit, which batched updates and flushed them to disk in bulk [13, 17, 19, 33]. Group commit allowed these databases to operate at near-maximum disk bandwidth, but since transaction commit was delayed, update operations still incurred full disk latency. Group commit played a similar role to RAMCloud's segments, allowing many operations to be flushed to disk in a single contiguous operation. Group commit alone is insufficient for RAMCloud, since synchronous disk writes are incompatible with its latency goals.

To hide the latency mismatch between disks and DRAM, several of these early in-memory databases also used or proposed non-volatile memory for log buffering [13, 14, 16, 31]. Baker suggested a similar solution in the context of the Sprite distributed filesystem [4], which used battery-backed RAM to allow safe write caching. These earlier works found that even a modest amount of non-volatile storage (a megabyte or less) is enough to hide storage latency and achieve high effective throughput. RAMCloud draws a similar conclusion and only requires the most recently written segment replicas to be buffered for high normal-case performance. RAMCloud provides stronger fault-tolerance than these single-machine systems, since it replicates data in non-volatile buffers on other servers in the cluster, whereas these systems stored a single copy of data locally.

7.2 Crash Recovery

7.2.1 Fast Crash Recovery

Many of the advantages of fast crash recovery were outlined by Baker [3, 5] in the context of distributed file systems. Her work explored fast recovery from server failures in the Sprite distributed filesystem. Restarting after server failures in SpriteFS required recovering cache state distributed across clients. Baker noted that fast recovery

• may constitute continuous availability if recovery time is short enough,

• may be more cost-effective than redundancy or high-reliability hardware,

• may be simpler than engineering reliability into normal operation,

• and may have no impact on normal-case performance.

RAMCloud's fast crash recovery is motivated precisely by these advantages, and its reliance on DRAM amplifies each of these benefits. Server cost was a key constraint for Baker's work in the 1990s, but data centers have since made redundant, large-scale use of hardware cost-effective. Baker achieved recovery times of 30 seconds for failed distributed file system servers. RAMCloud is able to improve this by an order of magnitude, even while performing more work, by relying on

scale. Baker's recovery times were limited almost entirely by server reboot time; RAMCloud sidesteps this problem by storing data redundantly across the cluster and relocating data to new machines when servers fail. Such a solution would have been impractical in 1994, but is natural in today's data centers, where machines are abundant.

7.2.2 Striping/Chunk Scattering

Striping data across multiple disks or nodes has been used in several contexts for improved reliability and throughput. Garcia-Molina and Salem [16] observe that the write throughput of a durable in-memory database is limited by local disk throughput, so they suggest striping commit logs across many disks. RAMCloud's primary motivation is high read bandwidth during recovery rather than high write bandwidth, but striping across all of the disks of the cluster also affords masters high burst write bandwidth. The Zebra [20] distributed filesystem recorded client operations in a log, which it striped across multiple storage servers. More recently, striping has become ubiquitous in data center filesystems like the Google File System (GFS) [18], the Hadoop Distributed File System (HDFS) [46], and Windows Azure Storage [6]. These systems provide specialized support for applications to use them as a high-bandwidth, high-reliability log. As applications log data, it is divided into chunks that are replicated and scattered across a cluster; these chunks are equivalent to RAMCloud masters' segments, which are also used as a unit of replication and scattering. These systems play a similar role to RAMCloud's backups, and they provide durability for data center storage systems like Bigtable [7], HBase [21], and Windows Azure Tables [6]. RAMCloud is even more dependent on striping than these disk-based systems, since high read bandwidth is a prerequisite for fast crash recovery.

7.2.3 Partitioned Recovery

While discussing single-node in-memory database recovery, Garcia-Molina and Salem [16] observe that local DRAM bandwidth can be a barrier to fast recovery. They note that if a database checkpoint is striped across many disks, then each disk needs an independent path to memory to be most useful. This same insight drives RAMCloud to partition recovery

across all of the nodes of the cluster. Since each backup reads a small fraction of a crashed master's replicas, each of the disks has an independent path to memory. Similarly, each recovery master recovers a single partition of a crashed master, which divides the data flowing from storage across the network, CPU, and memory bandwidth of many hosts. Bigtable uses a partitioned approach to recovery that shares many similarities with RAMCloud's recovery. It distributes recovery by reassigning tablets across many other servers in the cluster. Tablets are kept small (100 to 200 MB) by splitting them as they grow, so the work of recovery can be distributed relatively evenly among all the nodes of the cluster. RAMCloud does not pre-divide or pre-partition masters' tablets; instead, it allows tablets to grow to the full size of DRAM, and it only decides how tablets will be divided at recovery time. Similar to RAMCloud recovery, Bigtable servers must replay data from a commit log that contains objects intermingled from multiple tablets. During normal operation, each Bigtable server records update operations in a single commit log. Using a single log for all the tablets on a single server improves the effectiveness of group commit, which reduces disk seeks and improves throughput. During recovery, a failed server's commit log is first divided and cooperatively sorted by the cluster. This allows servers recovering a tablet to read just the relevant entries of the failed server's commit log with a single disk seek. Each RAMCloud master similarly intermingles log records for all of its assigned tablets during normal operation. Similar to Bigtable's sort step, RAMCloud backups bucket the objects in each replica according to a partition scheme determined at the start of recovery; RAMCloud performs this bucketing in memory, whereas Bigtable sorts commit log data on disk.
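For concreteness, the bucketing step can be sketched as a single in-memory pass over a replica that routes each entry to the recovery master responsible for it. The types and the table-to-partition lookup below are simplified assumptions (RAMCloud actually partitions by tablet and key-hash range), intended only to show the shape of the operation.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Simplified sketch of the in-memory bucketing that backups perform at the
    // start of recovery: each log entry in a replica is routed to the partition
    // (recovery master) that will replay it. The flat table-id lookup below is
    // a simplification of RAMCloud's tablet map, used only for illustration.
    struct LogEntry {
        std::uint64_t tableId;
        std::uint64_t keyHash;
        // ... object header and data would follow in a real entry
    };

    using PartitionMap = std::unordered_map<std::uint64_t, int>;

    static std::vector<std::vector<LogEntry>>
    bucketReplica(const std::vector<LogEntry>& replica,
                  const PartitionMap& partitionOfTable, int numPartitions) {
        std::vector<std::vector<LogEntry>> buckets(numPartitions);
        for (const LogEntry& entry : replica) {
            auto it = partitionOfTable.find(entry.tableId);
            if (it != partitionOfTable.end())
                buckets[it->second].push_back(entry);  // entry belongs to a live tablet
        }
        return buckets;
    }
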

7.2.4 Reconstruct Versus Reload

Metadata and indexing structures pose an interesting question for recovery systems: should these structures be stored durably along with the data and reloaded after a failure, or is it better to reconstruct them from the stored data itself? For example, Bigtable stores indexes explicitly on disk along with object data. Servers reload the indexes into DRAM when they are assigned a tablet for recovery. In RAMCloud, a master's index (that is,

its hash table) is only stored in DRAM and is lost when the master crashes. Instead, recovery masters rebuild hash table entries, since replaying data already in DRAM is inexpensive. Rebuilding such indexes would be prohibitive in most disk-based systems like Bigtable, since the data left behind on storage is large (presumably, much larger than a machine's DRAM) and is only read on demand. This same question arises for replicated segment/chunk locations: should chunk locations be explicitly stored somewhere, or is it better to broadcast and reconstruct the chunk location mapping from responses? Bigtable and RAMCloud both reconstruct this mapping after failures. Bigtable uses GFS for durability. Similar to RAMCloud masters, each filesystem "master" in GFS and HDFS only tracks chunk replica locations in DRAM, which is lost when it crashes. When a new filesystem master is chosen, it rediscovers the replica locations as chunk servers detect the old master's failure and re-enlist with it. The RAMCloud coordinator broadcasts during recovery for the same reason.

7.3 Log-structured Storage

RAMCloud's log-structured approach to storage management is similar in many ways to log-structured file systems (LFS) [42]. However, log management in RAMCloud is simpler and more efficient than in LFS. RAMCloud is simpler because the log need not contain metadata to enable random-access reads as in LFS. Instead, each master uses a hash table to enable fast access to data in DRAM. During recovery, a master's lost hash table is reconstructed by replaying the entire log, rather than having the hash table explicitly stored in the log. RAMCloud does not require checkpoints as in LFS, because it replays the entire log during recovery. Log size is bounded by the DRAM capacity of a machine, which indirectly bounds recovery time as well. RAMCloud is more efficient than LFS because it need not read data from disk during cleaning: all live data is always in memory. The only I/O during cleaning is to rewrite live data at the head of the log; as a result, RAMCloud consumes less bandwidth for cleaning than LFS [43]. Cleaning cost was a controversial topic for LFS; see [45], for example.

7.4 Randomized Load Balancing

Mitzenmacher and others have studied the theoretical properties of randomization with refinement and have shown that it produces near-optimal results [2, 38]. RAMCloud uses similar techniques in two places. First, the coordinator uses randomness with refinement to achieve near-uniform partition balance across recovery masters. Second, masters randomly scatter segment replicas across the disks of the cluster with a refinement that weights selection according to storage bandwidth.

Chapter 8

Conclusion

This dissertation described and evaluated scalable fast crash recovery, which is key to providing durability and availability in the RAMCloud data center storage system. When servers fail in RAMCloud, the work of recovery is split across all of the nodes of the cluster in order to reconstitute the DRAM contents of the failed server as rapidly as possible. By fully recovering the contents of a failed server in a few seconds, fast crash recovery provides an illusion of continuous availability; client applications see server failures as brief storage system performance anomalies. Several key techniques allow RAMCloud to recover quickly from failures while providing low-latency normal operation and cost-effective durability.

• Most importantly, it harnesses the large-scale resources of data center clusters. When a master fails in RAMCloud, data is read from thousands of disks in parallel and processed by many recovery masters working in a tightly pipelined effort to reconstitute the contents of the failed server. RAMCloud takes advantage of all of the CPUs, NICs, and disks of the cluster to recover data as quickly as possible.

• It eliminates stragglers that limit recovery time by balancing the work of recovery evenly across all of the machines of the cluster. No backup or recovery master is saddled with a workload that takes significantly longer than the others.

• It avoids scalability-limiting centralization both during normal operation and during recovery. Each backup and each recovery master work independently without coordination.

• It uses randomization techniques to allow localized decision making that results in globally balanced work. In particular, it achieves improved balance for disk and log replay work by using random choice with refinement.

• It uses log-structured storage techniques that provide durability while avoiding disk-limited performance on the fast path.

Experiments show that an 80-node cluster recovers 40 GB of data from a failed server in 1.86 seconds and that the approach scales to recover more data in the same time as cluster size grows. Consequently, we expect a cluster of 120 machines could recover the entire DRAM of a failed 64 GB node in 1 to 2 seconds. Overall, RAMCloud's scalable fast crash recovery makes it easy for applications to take advantage of large-scale DRAM-based storage by making server failures transparent while retaining low-latency access times. Fast crash recovery provides RAMCloud with the reliability of today's disk-based storage systems for the cost of today's volatile distributed DRAM caches, with performance 50-5000× greater than either.

8.1 Future Directions for Fast Crash Recovery

In order for fast recovery to reconstitute the full DRAM of future data center servers, recovery will need multicore and non-uniform memory access (NUMA) optimizations. Recent CPU architectures have seen only modest gains in single-core performance. Applications now must intelligently divide work across cores and partition memory accesses among core-specific memory controllers for the best performance. At the same time, per-server DRAM capacity continues to grow, so for the same number of recovery masters to recover larger servers, data processing will have to be split across cores. It may be that work can be divided in a rather coarse-grained way, with several mostly independent recovery masters working within each server, or it may require more fine-grained optimization if, for example, the single hash table and single in-memory log on each master are to be preserved.

Improving per-server recovery performance alone will be insufficient to compensate for the growth in per-server DRAM capacity. To retain fast recovery times in the future, server communication overheads will have to be improved. Network latency is increasingly dominated by speed-of-light delays, so the latency of network hardware is unlikely to improve significantly in the future. This means recovery will need to restructure communication patterns to minimize network delays. As one example, the simple, iterative approach to broadcast that the coordinator uses during recovery may need to be converted to use a tree-like communication structure. Using in-memory redundancy to avoid unavailability is not incompatible with fast recovery. One interesting direction for future work is to extend RAMCloud to support a minimal amount of in-memory redundancy to avoid the 1 to 2 second availability gaps caused by fast crash recovery. In such an approach, fast recovery would still be important, since aggressively limiting redundancy would make the system more vulnerable to availability gaps if failures occurred during recovery. Minimizing the capacity and performance overhead of redundancy would be critical, but recent coding techniques like Local Reconstruction Codes [24] achieve capacity overheads of 1.33× with protection stronger than triplicating data. Significant rework to how data is organized on masters would be required to take advantage of coding effectively.

8.2 Lessons Learned About Research

From the outset, the goal of the RAMCloud project was to produce a usable and deployable software artifact rather than just a research reference design. As a result, a large fraction of student effort went into practical design, software engineering, documentation, and testing. As of November 2013, RAMCloud had accumulated 50,000 lines of C++ source code with an additional 35,000 lines of unit tests. Of these 85,000 lines, about 30,000 were documentation and comments alone. One key question raised by projects like RAMCloud is whether building polished software projects is worth the time cost. Is it more important for a student to mature as an engineer and to spend his efforts building systems, testing them, documenting them, supporting them, and making them generally useful? Or is it more important for a student to

mature creatively and to spend his efforts learning how to set and pursue a research agenda? Or perhaps this is a false choice; perhaps one can optimize for both. Personally, the major benefit from extensive engineering on RAMCloud is that I have gained practical skill in building large-scale software and complex systems that work. Furthermore, this experience has helped me realize that even the most challenging problems can be tackled with a motivated and energetic team. Hopefully, this realization will encourage me to take on tough goals in the future. I certainly benefited greatly from collaborating closely on a large project. Even in retrospect, it is hard to draw conclusions about RAMCloud's stringent development style. In the future, I will encourage students to build software that they can take pride in. At the same time, I believe a more lightweight approach to design and prototyping could have been beneficial at certain points in the project. Allowing students to "hack up" features could create energy behind new ideas and allow for fast exploration. Developing RAMCloud was sometimes an arduous process. Three aspects of RAMCloud's recovery system made progress on it challenging: it is complex, distributed, and performance-sensitive. The combination of these three attributes meant regressions were common yet hard to track down. RAMCloud's recovery performance is intricately dependent on nearly every subsystem, including the RPC system, the hash table, the in-memory log, the distributed log replication system, the backup service, the I/O subsystem, the task management and threading model, server status broadcasts, and the failure detector. Even hardware was an adversary, with fans affecting disk performance, inexplicable losses in I/O performance, and unreliable networking due to faulty NIC firmware upgrades and PCI Express incompatibilities. Over the few years since recovery was prototyped, nearly every non-trivial change to any of RAMCloud's subsystems has caused a regression in either recovery performance or normal-case logging performance. Extending the prototype to improve modularity and to cover all of the practical failure cases that arise in a real cluster was a slow process, requiring performance to be reconsidered at each step. Over time, experience helped to mitigate some of this difficulty, but it is clear that despite decades of systems research, building fast and reliable distributed systems is still an art.

8.3 Final Comments

Scalable fast crash recovery is the cornerstone that makes RAMCloud a viable data center storage system. It provides the durability, availability, and fault tolerance that make RAMCloud practical and simple for developers to use. It makes RAMCloud cost-effective by eliminating the need for replication in DRAM. Finally, it does all this without impacting normal-case performance, which is key to enabling new data center applications that can interact with datasets more intensively than has been possible in the past.

Bibliography

[1] AGIGARAM. AgigA Tech AGIGARAM, October 2013. URL http://www.agigatech.com/agigaram.php. 15

[2] Yossi Azar, Andrei Z. Broder, Anna R. Karlin, and Eli Upfal. Balanced allocations (extended abstract). In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, STOC '94, pages 593–602, New York, NY, USA, 1994. ACM. ISBN 0-89791-663-8. 30, 120

[3] Mary Baker and John K. Ousterhout. Availability in the Sprite Distributed File System. Operating Systems Review, 25(2):95–98, 1991. 116

[4] Mary Baker, Satoshi Asami, Etienne Deprit, John Ousterhout, and Margo Seltzer. Non-Volatile Memory for Fast, Reliable File Systems. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS V, pages 10–22, New York, NY, USA, 1992. ACM. ISBN 0-89791-534-8. 116

[5] Mary Louise Gray Baker. Fast Crash Recovery in Distributed File Systems. PhD thesis, University of California at Berkeley, Berkeley, CA, USA, 1994. 116

[6] Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP '11, pages 143–157, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0977-6. 22, 117

[7] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A Distributed Storage System for Structured Data. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI '06, pages 205–218, Berkeley, CA, USA, 2006. USENIX Association. ISBN 1-931971-47-1. 114, 117

[8] Asaf Cidon, Stephen M. Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout, and Mendel Rosenblum. Copysets: Reducing the Frequency of Data Loss in Cloud Storage. In Proceedings of the 2013 USENIX Annual Technical Conference, USENIX ATC '13, pages 37–48, Berkeley, CA, USA, 2013. USENIX Association. 31, 109

[9] Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni. PNUTS: Yahoo!'s Hosted Data Serving Platform. Proceedings of the VLDB Endowment, 1:1277–1288, August 2008. ISSN 2150-8097. 112

[10] Jeffrey Dean. Evolution and Future Directions of Large-Scale Storage and Computation Systems at Google. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 1–1, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0036-0. 109

[11] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's Highly Available Key-Value Store. In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, SOSP '07, pages 205–220, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-591-5. 112

[12] Peter J. Denning. The Working Set Model for Program Behavior. Communications of the ACM, 11(5):323–333, May 1968. ISSN 0001-0782. 113

[13] David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker, and David A. Wood. Implementation Techniques For Main Memory Database Systems. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, SIGMOD '84, pages 1–8, New York, NY, USA, 1984. ACM. ISBN 0-89791-128-8. 115, 116

[14] Margaret H. Eich. Mars: The Design of a Main Memory Database Machine. In Database Machines and Knowledge Base Machines, volume 43 of The Kluwer International Series in Engineering and Computer Science, pages 325–338. Springer US, 1988. ISBN 978-1-4612-8948-7. 116

[15] Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. Availability in Globally Distributed Storage Systems. In Remzi H. Arpaci-Dusseau and Brad Chen, editors, OSDI, pages 61–74. USENIX Association, 2010. ISBN 978-1-931971-79-9. 108, 109

[16] Hector Garcia-Molina and Kenneth Salem. Main Memory Database Systems: An Overview. IEEE Transactions on Knowledge and Data Engineering, 4:509–516, December 1992. ISSN 1041-4347. 115, 116, 117

[17] Dieter Gawlick and David Kinkade. Varieties of Concurrency Control in IMS/VS Fast Path. IEEE Database Engineering Bulletin, 8(2):3–10, 1985. 115

[18] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, pages 29–43, New York, NY, USA, 2003. ACM. ISBN 1-58113-757-5. 107, 108, 117

[19] Robert Hagmann. Reimplementing the Cedar File System Using Logging and Group Commit. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, SOSP '87, pages 155–162, New York, NY, USA, 1987. ACM. ISBN 0-89791-242-X. 115

[20] John H. Hartman and John K. Ousterhout. The Zebra Striped Network File System. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, SOSP ’93, pages 29–43, New York, NY, USA, 1993. ACM. ISBN 0-89791-632-8. 117

[21] HBase. Hbase, October 2013. http://hbase.apache.org/. 117

[22] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems, 12:463–492, July 1990. ISSN 0164-0925. 73

[23] Andy Hisgen, Andrew Birrell, Timothy Mann, Michael D. Schroeder, and Garret Swart. Availability and Consistency Tradeoffs in the Echo Distributed File System. In Proceedings of the 2nd IEEE Workshop on Workstation Operating Systems, pages 49–54, 1989. 113

[24] Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin. Erasure Coding in Windows Azure Storage. In Proceedings of the 2012 USENIX Annual Technical Conference, USENIX ATC’12, pages 15–26, Berkeley, CA, USA, 2012. USENIX Association. 123

[25] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. ZooKeeper: Wait-free coordination for Internet-scale systems. In Proceedings of the 2010 USENIX Annual Technical Conference, USENIX ATC '10, pages 145–158, Berkeley, CA, USA, 2010. USENIX Association. 13, 71, 75

[26] Robert Johnson. More Details on Today’s Outage - Facebook, September 2010. URL http://www.facebook.com/note.php?note_id=431441338919. 114

[27] Robert Johnson and J. Rothschild. Personal Communications, March 24 and August 20, 2009. 1

[28] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi. H-store: A High-Performance, Distributed Main Memory Transaction Processing System. Proceedings of the VLDB Endowment, 1:1496–1499, August 2008. ISSN 2150-8097. 115

[29] Michael L. Kazar, Bruce W. Leverett, Owen T. Anderson, Vasilis Apostolides, Beth A. Bottos, Sailesh Chutani, Craig Everhart, W. Anthony Mason, Shu-Tsui Tu, and Edward R. Zayas. DEcorum File System Architectural Overview. In Proceedings of the Summer 1990 USENIX Conference, pages 151–164, 1990. 113

[30] Avinash Lakshman and Prashant Malik. Cassandra: A Decentralized Structured Storage System. ACM SIGOPS Operating Systems Review, 44(2):35–40, April 2010. ISSN 0163-5980. 112

[31] Tobin J. Lehman and Michael J. Carey. Query Processing in Main Memory Database Management Systems. In Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, SIGMOD '86, pages 239–250, New York, NY, USA, 1986. ACM. ISBN 0-89791-191-1. 116

[32] Charles E. Leiserson. Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Transactions on Computers, 34(10):892–901, October 1985. ISSN 0018-9340. 80

[33] Kai Li and Jeffrey F. Naughton. Multiprocessor Main Memory Transaction Processing. In Proceedings of the First International Symposium on Databases in Parallel and Distributed Systems, DPDS '88, pages 177–187, Los Alamitos, CA, USA, 1988. IEEE Computer Society Press. ISBN 0-8186-0893-5. 115

[34] Bill Lynn and Ansh Gupta. Non-Volatile CACHE for Host-Based RAID Controllers, January 2011. URL http://www.dell.com/downloads/global/products/pvaul/en/NV-Cache-for-Host-Based-RAID-Controllers.pdf. 15

[35] Krishna T. Malladi, Benjamin C. Lee, Frank A. Nothaft, Christos Kozyrakis, Karthika Periyathambi, and Mark Horowitz. Towards Energy-proportional Datacenter Memory with Mobile DRAM. In Proceedings of the 39th International Symposium on Computer Architecture, pages 37–48. IEEE Press, 2012. 7

[36] Timothy Mann, Andrew Birrell, Andy Hisgen, Charles Jerian, and Garret Swart. A Coherent Distributed File Cache with Directory Write-Behind. ACM Transactions on Computer Systems, 12(2):123–164, 1994.

[37] memcached. memcached: a distributed memory object caching system, October 2013. http://www.memcached.org/. 1, 113

[38] Michael David Mitzenmacher. The Power of Two Choices in Randomized Load Balancing. PhD thesis, University of California at Berkeley, Berkeley, CA, USA, 1996. 30, 120

[39] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, et al. Scaling Memcache at Facebook. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, pages 385–398. USENIX Association, 2013. 113, 114

[40] John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Diego Ongaro, Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. The Case for RAMCloud. Communications of the ACM, 54:121–130, July 2011. ISSN 0001-0782. 6, 11

[41] John K. Ousterhout, Andrew R. Cherenson, Frederick Douglis, Michael N. Nelson, and Brent B. Welch. The Sprite Network Operating System. IEEE Computer, 21:23–36, February 1988. ISSN 0018-9162. 113

[42] Mendel Rosenblum and John K. Ousterhout. The Design and Implementation of a Log-Structured File System. ACM Transactions on Computer Systems, 10:26–52, February 1992. ISSN 0734-2071. 4, 16, 43, 119

[43] Stephen Mathew Rumble. Memory and Object Management in RAMCloud. PhD thesis, Stanford University, Stanford, CA, USA, 2014. 16, 44, 119

[44] Pierluigi Sarti. Battery Cabinet Hardware V1.0, July 2011. URL http://www.opencompute.org/wp/wp-content/uploads/2011/07/DataCenter-Battery-Cabinet-Specifications.pdf. 14, 15, 25

[45] Margo Seltzer, Keith A. Smith, Hari Balakrishnan, Jacqueline Chang, Sara McMains, and Venkata Padmanabhan. File System Logging Versus Clustering: A Performance Comparison. In Proceedings of the USENIX 1995 Technical Conference, TCON’95, pages 249–264, Berkeley, CA, USA, 1995. USENIX Association. 119

[46] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies, MSST ’10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-1-4244-7152-2. 31, 109, 117

[47] V. Srinivasan and Jeffrey C. Mogul. Spritely NFS: Experiments with Cache-Consistency Protocols. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, SOSP '89, pages 44–57, New York, NY, USA, 1989. ACM. ISBN 0-89791-338-8. 113