Storing and Managing Data in a Distributed Hash Table
Storing and Managing Data in a Distributed Hash Table
by
Emil Sit
S.B., Computer Science (1999); S.B., Mathematics (1999); M.Eng., Computer Science (2000)
Massachusetts Institute of Technology

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2008

© Massachusetts Institute of Technology 2008. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 1, 2008
Certified by: M. Frans Kaashoek, Professor, Thesis Supervisor
Certified by: Robert T. Morris, Associate Professor, Thesis Supervisor
Accepted by: Terry P. Orlando, Chair, Department Committee on Graduate Students

Abstract

Distributed hash tables (DHTs) have been proposed as a generic, robust storage infrastructure for simplifying the construction of large-scale, wide-area applications. For example, UsenetDHT is a new design for Usenet News developed in this thesis that uses a DHT to cooperatively deliver Usenet articles: the DHT allows a set of N hosts to share storage of Usenet articles, reducing their combined storage requirements by a factor of O(N). Usenet generates a continuous stream of writes that exceeds 1 Tbyte/day in volume, comprising over ten million writes. Supporting this and the associated read workload requires a DHT engineered for durability and efficiency.
Recovering from network and machine failures efficiently poses a challenge for DHT replication maintenance algorithms that provide durability. To avoid losing the last replica, replica maintenance must create additional replicas when failures are detected. However, creating replicas after every failure stresses network and storage resources unnecessarily. Tracking the location of every replica of every object would allow a replica maintenance algorithm to create replicas only when necessary, but when storing terabytes of data, such tracking is difficult to perform accurately and efficiently.

This thesis describes a new algorithm, Passing Tone, that maintains durability efficiently, in a completely decentralized manner, despite transient and permanent failures. Passing Tone nodes make replication decisions with just basic DHT routing state, without maintaining state about the number or location of extant replicas and without responding to every transient failure with a new replica. Passing Tone is implemented in a revised version of DHash, optimized for both disk and network performance.

A sample 12-node deployment of Passing Tone and UsenetDHT supports a partial Usenet feed of 2.5 Mbyte/s (processing over 80 Tbyte of data per year), while providing 30 Mbyte/s of read throughput, limited currently by disk seeks. This deployment is the first public DHT to store terabytes of data. These results indicate that DHT-based designs can successfully simplify the construction of large-scale, wide-area systems.

Thesis Supervisor: M. Frans Kaashoek
Title: Professor

Thesis Supervisor: Robert T. Morris
Title: Associate Professor

Previously Published Material

Portions of this thesis are versions of material that was originally published in the following publications:

SIT, E., DABEK, F., AND ROBERTSON, J. UsenetDHT: A low overhead Usenet server. In Proc. of the 3rd International Workshop on Peer-to-Peer Systems (Feb. 2004).
CHUN, B.-G., DABEK, F., HAEBERLEN, A., SIT, E., WEATHERSPOON, H., KAASHOEK, F., KUBIATOWICZ, J., AND MORRIS, R. Efficient replica maintenance for distributed storage systems. In Proc. of the 3rd Symposium on Networked Systems Design and Implementation (May 2006).

SIT, E., MORRIS, R., AND KAASHOEK, M. F. UsenetDHT: A low overhead design for Usenet. In Proc. of the 5th Symposium on Networked Systems Design and Implementation (Apr. 2008).

Acknowledgments

This thesis would not have been possible without collaboration with my colleagues, and the support of my friends and family.

My advisors, M. Frans Kaashoek and Robert Morris, have been an endless source of wisdom, optimism and advice; by example, they have taught me how to conduct and present systems research. In addition to their support and feedback, they have created a fun and engaging environment at PDOS.

The members of PDOS and other CSAIL systems students and faculty have been invaluable in preparing talks, thus indirectly refining the presentation of ideas in this thesis. In addition to my advisors, particular thanks go to Russ Cox, who can always be counted on to attend yet another practice and offer suggestions for improvement.

Systems research requires the development of significant software. Frank Dabek worked closely with me to develop many of the ideas in this thesis and implement the core Chord and DHash infrastructure. Josh Cates wrote the initial Merkle tree implementation and devised the earliest maintenance protocols used in DHash. Jeremy Stribling has also been a close collaborator on papers and code; he implemented the initial prototypes of Passing Tone and on-disk Merkle tree storage. Andreas Haeberlen implemented the early version of the simulator used to evaluate Passing Tone and Carbonite; his ability to transform ideas or questions into software and graphs overnight is inspiring.
Max Krohn and Russ Cox were often sounding boards for my questions about coding and debugging; I am particularly grateful to them for taking on the burden of keeping the computer systems at PDOS running smoothly. Jinyang Li, Athicha Muthitacharoen, James Robertson and Benjie Chen also contributed code and ideas to what has become the work in this thesis.

Garrett Wollman provided the live Usenet feed from CSAIL used for evaluation and graciously answered many questions. Trace data used in Chapter 3 was provided by Vivek Pai and Aaron Klingaman. The evaluation in Chapter 6 made use of the PlanetLab and the RON test-beds. David Andersen helped solve many problems with accessing and using RON nodes. Additional nodes were provided by Magda Balazinska (University of Washington), Kevin Fu (University of Massachusetts, Amherst), Jinyang Li (New York University), and Alex Snoeren (University of California, San Diego).

My graduate studies were funded by the Cambridge-MIT Institute (as part of the Design and Implementation of Third Generation Peer-to-Peer Systems project), an NSF ITR grant (ANI-0225660 for Robust Large-Scale Distributed Systems), and various teaching assistant positions.

Graduate school is only partly about research. My life has been immeasurably enriched by my wife, Aleksandra Mozdzanowska, who has loved and supported me through my years in graduate school, and the many hobbies and distractions I acquired along the way. These last few years were made wonderful by the time I spent with my teachers and friends at the Back Bay and South Boston Yoga studios, most notably David Vendetti, whose wisdom has guided me and so many others.

Finally, my parents deserve thanks for instilling a life-long interest in learning and supporting me along the way.

Contents

1 Introduction
  1.1 Motivating application: UsenetDHT
  1.2 Goals
    1.2.1 Bandwidth-efficient durability
    1.2.2 Performance
  1.3 Contributions
  1.4 Summary
2 Requirements and challenges
  2.1 Workloads and requirements
  2.2 DHT challenges
    2.2.1 Failure handling
    2.2.2 Limited disk capacity
    2.2.3 Managing object movement
    2.2.4 Efficient maintenance
    2.2.5 I/O performance
3 Achieving efficient durability
  3.1 Algorithm
    3.1.1 Local maintenance
    3.1.2 Global maintenance
    3.1.3 Synchronization
    3.1.4 Expiration support
  3.2 Design discussion
    3.2.1 Context
    3.2.2 Importance of re-integration
    3.2.3 Setting replication level
    3.2.4 Scope and parallelism
  3.3 Evaluation
4 Implementation
  4.1 DHash
    4.1.1 Process structure
    4.1.2 Storage
    4.1.3 Efficient data transfer
  4.2 Maintenance
  4.3 Synchronization
  4.4 Routing and node management
5 Application: UsenetDHT
  5.1 Usenet
  5.2 Architecture
    5.2.1 Design
    5.2.2 Write and read walk-through
    5.2.3 Expiration
    5.2.4 Trade-offs