BlobSeer: Towards efficient data storage management for large-scale, distributed systems

Bogdan Nicolae

To cite this version: Bogdan Nicolae. BlobSeer: Towards efficient data storage management for large-scale, distributed systems. Computer Science [cs]. Université Rennes 1, 2010. English. tel-00552271.

HAL Id: tel-00552271
https://tel.archives-ouvertes.fr/tel-00552271
Submitted on 5 Jan 2011

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

No d'ordre: 4310                                                        ANNÉE 2010

THESIS / UNIVERSITÉ DE RENNES 1
under the seal of the Université Européenne de Bretagne
for the degree of DOCTOR OF THE UNIVERSITÉ DE RENNES 1
Specialty: Computer Science
Doctoral school MATISSE

presented by Bogdan Nicolae
prepared at research unit no 6074 - IRISA
Institut de Recherche en Informatique et Systèmes Aléatoires, IFSIC

Thesis defended at Rennes on 30 November 2010 before the jury composed of:

Luc BOUGÉ (thesis advisor), Professor, ENS Cachan Antenne de Bretagne, France
Gabriel ANTONIU (thesis advisor), Research scientist, INRIA Rennes, France
Franck CAPPELLO (reviewer), Research director, INRIA Saclay, France
Frédéric DESPREZ (reviewer), Research director, INRIA Grenoble, France
Kate KEAHEY (examiner), Scientist, Argonne National Laboratory, USA
María PÉREZ (examiner), Professor, Universidad Politécnica de Madrid, Spain
Valentin CRISTEA (examiner), Professor, Politehnica University Bucharest, Romania

Jede Burg wurde Stein auf Stein aufgebaut. – Siebenbürger Spruch
Every castle was built stone by stone. – Transylvanian proverb

Acknowledgments

This dissertation was made possible through the patience and guidance of my advisors, Gabriel and Luc. I am most grateful for their constant support and encouragement, which steered my work in the right direction, and I can only hope that it ends up as a solid foundation for many others.

Warm regards go to my family: my parents, Adrian and Maria, as well as my sister Adriana and my grandmother Maria. Their love and understanding kept my motivation up throughout the duration of this doctorate and helped me overcome several difficult times.

I would also like to thank the members of the jury: María, Kate and Valentin, for evaluating my work and traveling many miles to attend my defense. In particular, many thanks to Kate for hosting me for two months at Argonne National Laboratory, USA, as a visiting student within the Nimbus project. During this stay, I had the chance to apply the ideas of my work in a very challenging setting. Special thanks go to my two main evaluators, Franck and Frédéric, for taking the time to carefully read my manuscript and give me important feedback on my work.

Many thanks also for the specific contributions of various people. Jesús Montes contributed his expertise on global behavior modeling to refine the quality of service delivered by this work in the context of cloud storage. Matthieu Dorier helped during his internship with the integration of this work as a storage backend for Hadoop MapReduce, a joint work with Diana Moise. He was also kind enough to translate an extended 10-page abstract of this manuscript into French. Alongside Diana Moise, Alexandra Carpen-Amarie also contributed to a central publication around this work. Viet-Trung Tran applied this work in the context of distributed file systems, which led to a common publication.
We also had many interesting exchanges of ideas towards the end of my doctorate, and I sincerely hope we can pursue them further beyond this work.

Thanks go as well to the various members of the KerData team: Houssem Chihoub and Radu Tudoran, as well as several people inside and outside of INRIA Rennes Bretagne-Atlantique: Eugen Feller, Eliana Tîrşa, Pierre Riteau, Alexandru Costan, Alice Mărăscu, Sînziana Mazilu, Tassadit Bouadi and Peter Linell. I happily recall the countless relaxing discussions at coffee breaks and the quality time we spent together.

The experimental part of this work would not have been possible without the Grid'5000/ALADDIN-G5K testbed, an initiative of the French Ministry of Research through the ACI GRID incentive action, INRIA, CNRS, RENATER and other contributing partners. Many thanks to Pascal Morillon and David Margery for their continuous support with Grid'5000.

Finally, many thanks to all other people who had a direct or indirect contribution to this work and were not explicitly mentioned above. Your help and support are very much appreciated.

Contents

1 Introduction 1
    1.1 Objectives 2
    1.2 Contributions 2
    1.3 Publications 4
    1.4 Organization of the manuscript 6

I Context: data storage in large-scale, distributed systems 9

2 Large scale, distributed computing 11
    2.1 Clusters 12
        2.1.1 Computing clusters 13
        2.1.2 Load-balancing clusters 13
        2.1.3 High-availability clusters 13
    2.2 Grids 14
        2.2.1 Architecture 15
        2.2.2 Middleware 16
    2.3 Clouds 16
        2.3.1 Architecture 18
        2.3.2 Emerging platforms 19
    2.4 Conclusions 20

3 Data storage in large-scale, distributed systems 21
    3.1 Centralized file servers 23
    3.2 Parallel file systems 24
    3.3 Data grids 25
        3.3.1 Architecture 25
        3.3.2 Services 26
    3.4 Specialized storage services 28
        3.4.1 Revision control systems 28
        3.4.2 Versioning file systems 29
        3.4.3 Dedicated file systems for data-intensive computing 30
        3.4.4 Cloud storage services 31
    3.5 Limitations of existing approaches and new challenges 32

II BlobSeer: a versioning-based data storage service 35

4 Design principles 37
    4.1 Core principles 37
        4.1.1 Organize data as BLOBs 37
        4.1.2 Data striping 38
        4.1.3 Distributed metadata management 39
        4.1.4 Versioning 40
    4.2 Versioning as a key to support concurrency 41
        4.2.1 A concurrency-oriented, versioning-based access interface 41
        4.2.2 Optimized versioning-based concurrency control 44
        4.2.3 Consistency semantics 46

5 High level description 49
    5.1 Global architecture 49
    5.2 How reads, writes and appends work 50
    5.3 Data structures 52
    5.4 Algorithms 53
        5.4.1 Learning about new snapshot versions 53
        5.4.2 Reading 54
        5.4.3 Writing and appending 55
        5.4.4 Generating new snapshot versions 57
    5.5 Example 58

6 Metadata management 61
    6.1 General considerations 61
    6.2 Data structures 63
    6.3 Algorithms 65
        6.3.1 Obtaining the descriptor map for a given subsequence 65
        6.3.2 Building the metadata of new snapshots 66
        6.3.3 Cloning and merging 70
    6.4 Example 72

7 Implementation details 75
    7.1 Event-driven design 75
        7.1.1 RPC layer 77
        7.1.2 Chunk and metadata repositories 79
        7.1.3 Globally shared containers 79
        7.1.4 Allocation strategy 80
    7.2 Fault tolerance 80
        7.2.1 Client failures 81
        7.2.2 Core process failures 82
    7.3 Final words 82

8 Synthetic evaluation 83
    8.1 Data striping 84
        8.1.1 Clients and data providers deployed separately 84
        8.1.2 Clients and data providers co-deployed 86
    8.2 Distributed metadata management 87
        8.2.1 Clients and data providers deployed separately 87
        8.2.2 Clients and data providers co-deployed 88
    8.3 Versioning 89
    8.4 Conclusions 90

III Applications of the BlobSeer approach 91

9 High performance storage for MapReduce applications 93
    9.1 BlobSeer as a storage backend for MapReduce 94
        9.1.1 MapReduce 94
        9.1.2 Requirements for a MapReduce storage backend 95
        9.1.3 Integrating BlobSeer with Hadoop MapReduce 95
    9.2 Experimental setup 97
        9.2.1 Platform description 97
        9.2.2 Overview of the experiments 97
    9.3 Microbenchmarks 97
        9.3.1 Single writer, single file 98
        9.3.2 Concurrent reads, shared file 99
        9.3.3 Concurrent appends, shared file 100
    9.4 Higher-level experiments with MapReduce applications