Data Redundancy and Maintenance for Peer-To-Peer File Backup Systems Alessandro Duminuco
Total Page:16
File Type:pdf, Size:1020Kb
Data Redundancy and Maintenance for Peer-to-Peer File Backup Systems Alessandro Duminuco To cite this version: Alessandro Duminuco. Data Redundancy and Maintenance for Peer-to-Peer File Backup Systems. domain_other. Télécom ParisTech, 2009. English. pastel-00005541 HAL Id: pastel-00005541 https://pastel.archives-ouvertes.fr/pastel-00005541 Submitted on 7 Jul 2010 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. École Doctorale d’Informatique, Télécommunications et Électronique de Paris Thèse présentée pour obtenir le grade de docteur de TELECOM ParisTech Spécialité : Informatique et réseaux Alessandro DUMINUCO Redondance et maintenance des données dans les systèmes de sauvegarde de fichiers pair-à-pair Data Redundancy and Maintenance for Peer-to-Peer File Backup Systems Soutenue le 20 Octobre 2009 devant le jury composé de Prof. Isabelle Demeure Président Prof. Pascal Felber Rapporteurs Dr. Philippe Nain Prof. Thorsten Strufe Examinateur Prof. Ernst Biersack Directeur de thèse Copyright c 2009 by Alessandro Duminuco ( [email protected] ). Typeset in LATEX. a Piero, che mi ha insegnato piú d’ogni altro. Abstract The amount of digital data produced by users, such as photos, videos, and digital docu- ments, has grown tremendously over the last decade. These data are very valuable and need to be backed up safely. Solutions based on DVDs and external hard drives, though very common, are not practical and do not provide the required level of reliability, while centralized solutions are costly. For this reason the research community has shown an increasing interest in the use of peer-to- peer systems for file backup. The key property that makes peer-to-peer systems appealing is self-scaling, i.e. as more peers become part of the system the service capacity increases along with the service demand. The design of a peer-to-peer file backup system is a complex task and presents a consider- able number of challenges. Peers can be intermittently connected or can fail at a rate that is considerably higher than in the case of centralized storage systems. Our interest focused particularly on how to efficiently provide reliable storage of data applying appropriate re- dundancy schemes and adopting the right mechanisms to maintain this redundancy. This task is not trivial since data maintenance in such systems may require significant resources in terms of storage space and communication bandwidth. Our contribution is twofold. First, we study erasure coding redundancy schemes able to combine the bandwidth effi- ciency of replication with the storage efficiency of classical erasure codes. In particular, we introduce and analyze two new classes of codes, namely Regenerating Codes and Hierarchi- cal Codes. Second, we propose a proactive adaptive repair scheme, which combines the adaptiveness of reactive systems with the smooth bandwidth usage of proactive systems, generalizing the two existing approaches. Résumé La quantité de données numériques produites par les utilisateurs, comme les photos, les vidéos et les documents numériques, a énormément augmenté durant cette dernière décen- nie. Ces données possèdent une grande valeur et nécessitent d’être sauvegardées en sécurité. D’une part, les solutions basées sur les DVDs et les disques durs externes, bien que très communes, ne fournissent pas un niveau suffisant de fiabilité. D’autre part les solutions basées sur de serveurs centralisées sont très coûteuses. Pour ces raisons, la communauté de recherche a manifesté un grand intérêt pour l’utilisation des systèmes pair-à-pair pour la sauvegarde de donnés. Les systèmes pair-à-pair représentent une solution intéressante grâce à leur capacité de passage à l’échelle. En effet, la capacité du service augmente avec la demande. La conception d’un réseau de sauvegarde de fichiers pair-à-pair est une tâche très complexe et présente un nombre considérable de défis. Les pairs peuvent avoir une durée de connexion limitée et peuvent quitter le système à un taux qui est considérablement plus élevé que dans le cas des systèmes de stockage centralisés. Notre intérêt se concentre sur la manière de fournir efficacement du stockage de données suffisamment fiable en appliquant des schémas de redondance appropriés et en adoptant des bons mécanismes pour maintenir une telle redondance. Cet effort n’est pas négligeable, dans la mesure où la maintenance du stockage de données dans un tel système exige des ressources importantes en termes de capacité de stockage et de largeur de bande passante. Notre contribution se porte sur deux aspects. Premièrement, nous proposons et étudions des codes correcteurs pour la redondance capa- bles de combiner l’efficacité en bande passante de la réplication à l’efficacité en stockage des codes correcteurs classiques. En particulier, nous présentons et analysons deux nouvelles classes de codes: Regenerating Codes et Hierarchical Codes. Deuxièmement, nous proposons un système de réparation, nommé "adaptive proactive re- pair scheme", qui combine l’adaptabilité des systèmes réactifs avec l’utilisation régulière de la bande passante des systèmes proactifs, en généralisant les deux approches existantes. Acknowledgements This thesis is the result of more than three years spent at EURECOM. During this time I had the chance to interact with a lot of people, each of them gave her contribution to my accomplishments, and I feel obliged to thank them all. The first place must be occupied by my advisor Prof. Ernst Biersack, not because of his po- sition, but because he is without any doubt the person who contributed most to the success of my Ph.D. He gave tireless feedback to my ideas and my proposals, he helped me find the right research directions, he has been supportive when results did not arrive, and he showed appreciation when results arrived. I feel that I grew up a lot as a researcher in these three years and I owe a lot to him. I had a great opportunity to work with you and I want to thank you for everything you taught to me. I want to thank all the people who are or have been at EURECOM during these years. It has been a pleasure and helpful to work, chat and exchange ideas with them in lunch breaks or in front of our beloved espresso coffee machine. Some of them, however, deserve a special mention. Taoufik En-Najjary was extremely helpful in formalizing my ideas, especially at the beginning, thanks to his precious competence in statistics related topics. He was co- author of my first publication as a Ph.D. student and I really appreciated his collaboration. I am really grateful to Damiano Carra, who was the person who helped me the most after my advisor. I went countless times to his office to explain my ideas, my doubts and my questions and he was always available to listen and to give precious hints. Finally, I want to thank Pietro Michiardi, with whom I had the opportunity to discuss about my topic and to exchange ideas and suggestions. A special thanks also to Matteo Dell’Amico and Erik- Oliver Blass, who have been brave enough to read and comment the preliminary version of this manuscript. My thesis was partially supported by the “Microsoft Research European PhD Scholarship” and I thank Microsoft Research for the financial support that made my research possible. In particular I want to thank Fabien Petitcolas, who organized the scholarship and offered the opportunity to participate to two summer-schools, which enriched my skills and my vision of the research. His work was just perfect from any point of view. I also got the opportunity to spend three months at Microsoft Research in Cambridge as an intern, where I had the chance to meet and work with a lot of smart and skilled researchers. I really enjoyed this experience and I want to thank especially Christos Gkantsidis, who supervised me during my internship and has been my second advisor, he guided me in my research and in the hard life of distributed system designers. x My work would have not been the same if my housemates were not there. So thanks to Corrado, Marco and Xiaolan for having shared all the ”joys and sorrows” of being a PhD student in the French Riviera. Living, eating and fighting with you has been really a priceless experience. As any good Italian, I express all my gratitude to my two families. My natural family, which supported me at any step of my Ph.D., and my second family that I had as student in Turin, which has been always present, in the best and in the worst days of the last eight years. Finally, I need to thank all my friends from Sapri, my hometown in Italy. I cannot say that they contributed directly to this thesis, since they have always been a good reason for giving up and run back home, but I have to admit that without them I would not be the same person, and this thesis would definitely be worse. Grazie a tutti! Alessandro Contents List of Figures xiii List of Tables xvii 1 Introduction 1 1.1 Motivation ....................................... 1 1.2 Description of a peer-to-peer storage system ................... 4 1.3 Issues in peer-to-peer storage systems ....................... 8 1.4 Focus of the thesis ................................... 18 1.5 Organization of the thesis .............................. 19 2 Related Work 21 2.1 Introduction ...................................... 21 2.2 The genesis ......................................