EXTENSIBLE NETWORKED-STORAGE VIRTUALIZATION WITH METADATA MANAGEMENT AT THE BLOCK LEVEL

by

Michail D. Flouris

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Computer Science University of Toronto

Copyright © 2009 by Michail D. Flouris


Abstract

Extensible Networked-Storage Virtualization with Metadata Management at the Block Level

Michail D. Flouris Doctor of Philosophy Graduate Department of Computer Science University of Toronto 2009

Increased scaling costs and lack of desired features is leading to the evolution of high-performance storage systems from centralized architectures and specialized hardware to decentralized, commodity storage clusters. Existing systems try to address storage cost and management issues at the filesystem level. Besides dictating the use of a specific filesystem, however, this approach leads to increased complexity and load imbalance towards the file-server side, which in turn increase costs to scale.

In this thesis, we examine these problems at the block-level. This approach has several advantages, such as transparency, cost-efficiency, better resource utilization, simplicity and easier management.

First of all, we explore the mechanisms, the merits, and the overheads associated with advanced metadata-intensive functionality at the block level, by providing versioning at the block level. We find that block-level versioning has low overhead and offers transparency and simplicity advantages over filesystem-based approaches.

Secondly, we study the problem of providing extensibility required by diverse and changing application needs that may use a single storage system. We provide support for (i) adding desired functions as block-level extensions, and (ii) flexibly combining them to create modular I/O hierarchies. In this direction, we design, implement and evaluate an extensible block-level storage virtualization framework, Violin, with support for metadata-intensive functions. Extending Violin we build Orchestra, an extensible framework for cluster storage virtualization and scalable storage sharing at the block-level.

We show that Orchestra’s enhanced block interface can substantially simplify the design of higher-level storage services, such as cluster filesystems, while being scalable.

Finally, we consider the problem of consistency and availability in decentralized commodity clusters. We propose RIBD, a novel storage system that provides support for handling both data and metadata consistency issues at the block layer. RIBD uses the notion of consistency intervals (CIs) to provide fine-grain consistency semantics on sequences of block level operations by means of a lightweight transactional mechanism. RIBD relies on Orchestra's virtualization mechanisms and uses a roll-back recovery mechanism based on low-overhead block-level versioning. We evaluate RIBD on a cluster of 24 nodes, and find that it performs comparably to two popular cluster filesystems, PVFS and GFS, while offering stronger consistency guarantees.

To my wife, Stavroti, for her support and patience to see the end of it.

To my children, Kleria and Dimitris, for making my days (and nights) worth it.

Acknowledgements

I have been extremely fortunate to have Professor Angelos Bilas as my supervisor. His constant support, insightful comments and patient guidance have aided me greatly throughout my research endeavors. I am deeply grateful to him for all his support and all I have learned under his supervision.

I would also like to thank the members of my committee, Professors Angela Demke Brown, H.-Arno Jacobsen, Cristiana Amza and my external appraiser Remzi Arpaci-Dusseau, for their thoughtful comments that significantly improved this thesis.

Special thanks go to all colleagues and paper co-authors, whose help and insight has been invaluable to my research efforts, more specifically Renaud Lachaize, Manolis Marazakis, Kostas Magoutis, Jesus Luna, Maciej Brzezniak, Zsolt Németh, Stergios Anastasiadis, Evangelos Markatos, Dionisios Pnevmatikatos, Sotiris Ioannidis, Dimitris Xinidis, Rosalia Christodoulopoulou, Reza Azimi, and Periklis Papakonstantinou.

A great thanks to all the people of the legendary Grspam list and all my friends in Toronto for making my life there warmer and more fun. I am especially grateful to Yannis Velegrakis, Tasos Kementsientsidis, Stavros Vassos, Giorgos Giakkoupis and Vasso Bartzoka for their hospitality.

I would like to thank also all the present and past members of the CARV lab and FORTH-ICS for all the conversations, the fun and for creating an enjoyable and stimulating research environment. Many special thanks go to Stavros Passas, Michalis Ligerakis, Sven Karlsson, Markos Foundoulakis and Yannis Klonatos for their help with the hardware, the cluster administration, and debugging. A big “thank you” goes to Linda Chow for her help with the administrative tasks. Thanks to all the friends in Greece for the moral support.

Finally, I would like to thank the people that really made all this possible; my wife Stavroti Liodaki, my children Kleria and Dimitris, my parents Maria and Dimitris, sister Irini, brother Andreas, and my family in-law. Thank you all for everything you have done for me.

Contents

1 Introduction 1

1.1 Motivation ...... 1
1.2 Our Approach ...... 6
1.3 Problems and Contributions ...... 8
1.4 Overview ...... 9
1.4.1 Block-level versioning ...... 10
1.4.2 Storage resource sharing and management ...... 10
1.4.3 Scalable storage distribution and sharing ...... 13
1.4.4 Reliability and Availability ...... 14
1.5 Thesis Organization ...... 17

2 Related Work 18

2.1 Clotho: Block-level Versioning ...... 21
2.2 Violin: Extensible Block-level Storage ...... 24
2.2.1 Extensible filesystems ...... 24
2.2.2 Extensible network protocols ...... 25
2.2.3 Block-level storage virtualization ...... 26
2.3 Orchestra: Extensible Networked-storage Virtualization ...... 27
2.3.1 Conventional cluster storage systems ...... 28
2.3.2 Flexible support for distributed storage ...... 29
2.3.3 Support for cluster-based storage ...... 30
2.3.4 Summary ...... 32
2.4 RIBD: Taxonomy and Related Work ...... 32

2.4.1 Taxonomy of existing solutions ...... 32
2.4.2 Related Work ...... 36

3 Clotho: Transparent Data Versioning at the Block I/O Level 38
3.1 Introduction ...... 38
3.2 System Design ...... 42
3.2.1 Flexibility and Transparency ...... 42
3.2.2 Reducing Metadata Footprint ...... 45
3.2.3 Version Management Overhead ...... 47
3.2.4 Common I/O Path Overhead ...... 48
3.2.5 Reducing Disk Space Requirements ...... 51
3.2.6 Consistency ...... 53
3.3 System Implementation ...... 54
3.4 Experimental Results ...... 55
3.4.1 Bonnie++ ...... 57
3.4.2 SPEC SFS ...... 57
3.4.3 Compact version performance ...... 60
3.5 Limitations and Future work ...... 62
3.6 Conclusions ...... 63

4 Violin: A Framework for Extensible Block-level Storage 65
4.1 Introduction ...... 65
4.2 System Architecture ...... 68
4.2.1 Virtualization Semantics ...... 68
4.2.2 Violin I/O Request Path ...... 72
4.2.3 State Persistence ...... 75
4.2.4 Module API ...... 78
4.3 System Implementation ...... 80
4.3.1 I/O Request Processing in Linux ...... 80
4.3.2 Violin I/O path ...... 81
4.4 Evaluation ...... 82
4.4.1 Ease of Development ...... 82

4.4.2 Flexibility ...... 84
4.4.3 Performance ...... 86
4.4.4 Hierarchy Performance ...... 91
4.5 Limitations and Future work ...... 93
4.6 Conclusions ...... 94

5 Orchestra: Extensible, Shared Networked Storage 95
5.1 Introduction ...... 95
5.2 System Design ...... 97
5.2.1 Distributed Virtual Hierarchies ...... 98
5.2.2 Distributed Block-level Sharing ...... 100
5.2.3 Distributed File-level Sharing ...... 103
5.2.4 Differences of Orchestra vs. Violin ...... 105
5.3 System Implementation ...... 106
5.4 Experimental Results ...... 108
5.4.1 Orchestra ...... 109
5.4.2 ZeroFS (0FS) ...... 113
5.4.3 Summary ...... 115
5.5 Limitations and Future work ...... 115
5.6 Conclusions ...... 116

6 RIBD: Recoverable Independent Block Devices 118
6.1 Introduction ...... 118
6.2 Motivation ...... 121
6.2.1 File vs. block level ...... 121
6.2.2 Roll forward vs. roll backward ...... 122
6.3 RIBD Overview ...... 123
6.3.1 System abstraction ...... 125
6.3.2 Fault model ...... 126
6.4 Underlying Protocols ...... 127
6.4.1 Client-server transactional protocol ...... 127
6.4.2 Versioning protocol ...... 129

6.4.3 protocol ...... 131
6.4.4 Lock protocol ...... 132
6.4.5 Liveness protocol ...... 132
6.5 System Roll-back and Recovery ...... 133
6.5.1 Roll-back operation ...... 133
6.5.2 Network failures ...... 134
6.5.3 File server/client failures ...... 135
6.5.4 Disk failures ...... 136
6.5.5 Disk server failures ...... 136
6.5.6 Global failures ...... 137
6.6 System Implementation ...... 137
6.6.1 ZeroFS (0FS) ...... 138
6.7 Experimental Platform ...... 139
6.8 Experimental Results ...... 142
6.8.1 Overhead of Dependability Protocols ...... 142
6.8.2 Overhead of Availability ...... 146
6.8.3 Impact of Outstanding I/Os ...... 147
6.8.4 Impact on System Resources ...... 150
6.8.5 Impact of Versioning Protocol ...... 150
6.8.6 Summary ...... 152
6.9 Limitations ...... 153
6.10 Conclusions ...... 155

7 Conclusions and Future Work 156
7.1 Conclusions ...... 156
7.2 Future Work ...... 157
7.2.1 Data and Control Flow from Lower to Higher Layers ...... 157
7.2.2 Block-level Caching and Consistent Client-side Caches ...... 158
7.2.3 Support for High-Availability of Client-side Metadata ...... 158
7.2.4 Security for Virtualized Block-level Storage ...... 158
7.2.5 New methods for ensuring metadata consistency and persistence ...... 159

7.3 Lessons Learned ...... 159

Bibliography 161

List of Tables

2.1 Categorization of existing distributed storage systems...... 34

4.1 Linux drivers and Violin modules in kernel code lines...... 83

5.1 Orchestra modules in kernel and user-space code lines...... 107
5.2 Measurements of individual control operations. Numbers are in µsec...... 110

6.1 Client and server module functions in RIBD...... 129
6.2 Lines of code (LOC) for RIBD components...... 139
6.3 Cluster PostMark workloads...... 141

List of Figures

1.1 High-end SAN-based vs. cluster-based storage architectures...... 2

1.2 Reliability, Availability and Feature support in decentralized storage clusters...... 4

1.3 A virtual device graph in Violin...... 11

1.4 A distributed Orchestra hierarchy...... 11

1.5 Overview of typical cluster-based scalable storage architectures and RIBD...... 15

3.1 Clotho in the block device hierarchy...... 43

3.2 Logical space segments in Clotho...... 46

3.3 Subextent addressing in large extents...... 47

3.4 Translation path for write requests...... 50

3.5 Translation path for read requests...... 51

3.6 Read translation for compact versions...... 52

3.7 Bonnie++ throughput for write, rewrite, and read operations...... 56

3.8 Bonnie++ “seek and read” performance...... 56

3.9 SPEC SFS response time using 4-KByte extents...... 58

3.10 SPEC SFS throughput using 4-KByte extents...... 58

3.11 SPEC SFS response time using 32 KByte extents with subextents...... 59

3.12 SPEC SFS throughput using 32-KByte extents with subextents...... 59

3.13 Random “compact-read” throughput...... 61

3.14 Random “compact-read” latency...... 61

4.1 Violin in the context...... 67

4.2 Violin’s virtual device graph...... 69

4.3 The LXT address translation table...... 71

4.4 Violin API for LXT and PXB data structures...... 72
4.5 Example of request flows (dashed lines) through devices...... 73
4.6 Violin API for metadata management...... 75
4.7 API methods for Violin’s I/O modules...... 78
4.8 Binding of extension modules to Violin...... 80
4.9 Path of a write request in Violin...... 81
4.10 A hierarchy with advanced semantics and the script that creates it...... 85
4.11 IOmeter Results: Raw disk throughput for sequential and random workloads...... 86
4.12 IOmeter Results: LVM throughput for sequential and random workloads...... 87
4.13 IOmeter Results: RAID-0 throughput for sequential and random workloads...... 88
4.14 IOmeter Results: RAID-1 throughput for sequential and random workloads...... 89
4.15 PostMark results: Total runtime ...... 90
4.16 PostMark results: Read throughput ...... 90
4.17 PostMark results: Write throughput ...... 91
4.18 IOmeter throughput (read and write) for hierarchy configuration as layers are added...... 92
4.19 IOmeter throughput (70% read and 30% write mix) as layers are added...... 93

5.1 A distributed Orchestra hierarchy...... 99
5.2 Byte-range mapping process through a hierarchy...... 99
5.3 View of the allocator layer’s capacity...... 102
5.4 FS clients sharing a distributed volume mapped on many nodes ...... 105
5.5 Configuration for multiple node experiments...... 109
5.6 TCP/IP network throughput and latency measured with TTCP...... 110
5.7 Orchestra vs. Linux MD performance on direct-attached disks...... 111
5.8 Throughput scaling with xdd for the 1x1, 4x4, and 8x8 setups...... 112
5.9 Aggregate throughput of 0FS with IOzone...... 113
5.10 PostMark results over 0FS for the 1x1, 4x4, and 8x8 setups...... 114

6.1 Overview of typical cluster-based scalable storage architectures and RIBD...... 119
6.2 Roll forward vs. roll back recovery at the block level...... 122
6.3 Pseudo-code for a simple file create operation with one CI...... 124
6.4 RIBD system architecture and protocol stack...... 128

6.5 RIBD transactional protocol operation ...... 130
6.6 RIBD versioning protocol operation ...... 131
6.7 PostMark CI breakdown to write, unlock operations in RIBD...... 142
6.8 IOZone sequential read/write results...... 143
6.9 IOZone random read/write results...... 144
6.10 IOZone random mixed workload (70% read - 30% write) results...... 144
6.11 PostMark aggregate transaction rate (transactions/sec) for workloads A and B...... 145
6.12 PostMark: Total data volume that reaches system disks...... 146
6.13 IOZone sequential results for 12 clients on RAIN-10 with up to 3 threads per client...... 147
6.14 IOZone random results for 12 clients on RAIN-10 with up to 3 threads per client...... 147
6.15 IOZone random mixed workload results for 12 clients on RAIN-10 ...... 148
6.16 PostMark aggregate transaction rate for workload A ...... 148
6.17 PostMark aggregate transaction rate for workload B ...... 149
6.18 PostMark average CPU utilization on the servers and clients for workloads A,B...... 151
6.19 PostMark average (read/write) disk latency...... 152
6.20 PostMark average total (read and write) throughput per disk...... 152
6.21 PostMark average disk utilization across all disks...... 153
6.22 PostMark aggregate transaction rate for workload B and four versioning frequencies...... 153

Chapter 1

Introduction

The beginning of knowledge is the discovery of something we do not understand. —Frank Herbert (1920 - 1986)

1.1 Motivation

Data are the lifeblood of computing and an invaluable asset of any organization, which makes data storage systems a critical component of current computing infrastructures. The significance and the value of this massive, ever-increasing volume of stored data [IDC, 2008; Lyman and Varian, 2003], imposes the need for improving the overall quality of data storage systems [Wilkes, 2001; Veitch et al., 2001].

Increasing requirements for improving storage efficiency and cost-effectiveness introduce the need for scalable storage systems in the data center that consolidate system resources into a single system image. Consolidation enables easier storage management, continuous system operation uninterrupted by device faults, and ultimately lower cost of purchase and maintenance. The need for consolidation has largely led, over the past decade, to the adoption of storage area networks (SANs) as a popular solution to improve storage efficiency and manageability [Barker and Massiglia, 2002]. SANs are high-performance interconnects, enabling many application servers to share a set of disk arrays, as shown in Figure 1.1(left), and are usually considered synonymous with either Fibre Channel [Jurgens, 1995; Fibre Channel] or iSCSI [iSCSI] (Ethernet) networks.

Figure 1.1: High-end SAN-based vs. cluster-based storage architectures: (a) centralized SAN storage; (b) decentralized storage cluster.

However, there are two main issues with SAN-based storage architectures today [Anderson, 1995; Gartner Group, 2000; SNIA]: (i) they are not cost-effective to scale either in capacity or performance, and (ii) they lack desired functions and advanced features, such as data compression, deduplication or encryption. The reasons for cost-ineffectiveness are:

1. The centralized architecture of the storage arrays. Storage arrays are based on one or more controller backplanes (two is typical for highly-available configurations), which connect up to a few hundred disks, to allow scaling in capacity. In order to provide the highest level of performance, the array controllers are made from high-end customized hardware with high-speed buses and memory, and run specialized software. As a result they are expensive. However, no controller today is able to harness the full bandwidth of the maximum number of supported disks, if they were all operating at full speed (e.g. doing sequential reads). The operation of the storage arrays depends on the fact that, in reality, most I/O workloads have a high degree of randomness, which leads to disk seeks and low disk throughput.

Scaling in capacity to more than a few hundred disks is not feasible simply by adding more storage arrays and controllers, because storage volumes cannot span storage arrays. In this case, capacity scaling requires software running on the application server side, either a distributed filesystem, a software RAID or a volume manager, which aggregates storage volumes across arrays.

2. The cost of scaling custom interconnects used in high-end systems (i.e. Fibre Channel). Cost per switch port rises super-linearly in large SAN switches, because of the specialized nature of the interconnect, and the small number of available vendors.

Due to the lack of cost-efficient scaling, high-performance storage systems are evolving from centralized architectures and specialized hardware and interconnects, towards commodity-based scalable “decentralized storage clusters” (often referred to also as “storage brick architectures” or “storage grids”) [HYDRAstor, 2008; Hoffman, 2007]. These systems, shown in Figure 1.1(right), are based on clusters of many commodity storage nodes with CPU, memory and I/O resources connected via a scalable, high-throughput, low-latency network, without a central controller [Gray, 2002; IBM Corp., 2008a; Saito et al., 2004].

The new storage architecture has strong potential for scaling both storage capacity and performance at reduced cost. Current commodity computers and networks have significant computing power and network performance, at a fraction of the cost of custom storage-specific hardware (controllers or networks) [Gibson et al., 1998]. Using such components, one can build low-cost storage clusters with enough capacity and performance to meet practically any storage needs, and scale them economically. This concept has been proposed by several researchers [Gray, 2002; Gibson et al., 1998; Anderson et al., 1996] and the necessary commodity technologies are maturing now with the emergence of new standards, protocols and interconnects, such as Serial ATA [SATA], 10-Gbit Ethernet, Infiniband [IBTA], Myrinet [Myrinet], PCI-X/Express/AS [Zirin and Joe Bennett], ATA over Ethernet (AoE) [Hopkins and Coile] and iSCSI [iSCSI].

On the other hand, this new architecture poses new challenges, due to the lack of central controller(s). The elimination of all centralization points and the increase of I/O request asynchrony are desirable for performance and scalability purposes; however, they introduce significant challenges that do not exist in traditional centralized architectures. These challenges impose a new approach and protocols towards offering availability and strong consistency guarantees for data and metadata [Saito et al., 2004].

The second drawback of traditional, centralized storage systems is the lack of desired functions and advanced features [SNIA], such as deduplication, compression, encryption or tiered-storage support. The main reason for this is that it is difficult and expensive to add advanced features to the array controllers or the SAN switches, and few storage vendors currently support them.

The ability to add new functions (“extensibility”), and to allow control of which available functions to use on each volume (“flexibility”), in order to tailor storage to application-specific requirements, is an important concern, because distinct application domains have diverse storage requirements [Leavitt, 2008; Bigelow, 2007]. Systems designed for the needs of scientific computations, data mining, e-mail serving, e-commerce, search engines, operating system (OS) image serving or data archival impose different trade-offs in terms of dimensions such as speed, reliability, capacity, high-availability, security, data sharing and consistency. Tailoring storage to application needs leads to reduced system cost and improved efficiency [Leavitt, 2008], yet current centralized systems cannot be easily tailored to the needs of a particular application.

Figure 1.2: Reliability, Availability and Feature support in decentralized storage clusters: (a) cluster filesystem approach; (b) object-based approach; (c) block-level approach.

Support for reliability, availability and advanced functionality in decentralized storage clusters can be provided in any of three system layers: (a) at the filesystem, (b) at the object level, in the case of object-storage devices [Weber (Editor), 2004, 2007], or (c) at the block level. These three cases and the corresponding system layers are illustrated in Figure 1.2.

In the first case, new functionality would be implemented within a cluster filesystem (e.g. GPFS [Schmuck and Haskin, 2002], Global File System (GFS) [Preslan et al., 1999]) that would run on the application servers. Functionality in such a filesystem would need to be implemented in either a monolithic [Bonwick and Moore, 2008] or an extensible manner [Heidemann and Popek, 1994; Skinner and Wong, 1993; Zadok and Nieh, 2000]. In the block-level or object-level approaches, the added functions would be implemented at the block or object level (e.g. Lustre [Sun Microsystems, 2007]) and executed either on the application servers or the storage nodes, with emphasis on the latter. Note also that solutions at the object or block level would require changes to the filesystem, although one could argue that the changes in the block-level case would be less intrusive, involving mainly new control primitives, while preserving the traditional block-level API for data operations.

Although there are arguments for implementing features in each of the three layers, in this thesis we decide to follow the block-level approach. We believe that implementing advanced functionality at the block-level is appropriate for the following reasons:

• Transparency to applications and mildly-intrusive changes to filesystems. Users and organizations are usually unable or reluctant to change their applications and their filesystems, which are usually expensive and whose features are tightly controlled by vendors. On the other hand, block-level operation allows transparent evolution of the storage system.

• The storage hardware at the back-end has evolved significantly from simple disks and fixed controllers to powerful storage nodes [Gray, 2002; Gibson et al., 1998; Acharya et al., 1998] that offer block-level storage to multiple applications over a high-performance network. Block-level functions can exploit the processing capacity, hardware acceleration engines and memory resources of these storage nodes, whereas filesystem code (running mainly on the application servers and the clients) cannot.

• Off-loading the bulk of the processing to the block-level processors, and using the file server CPU only for coordination and I/O to individual storage nodes, allows the storage system to scale as new capacity is added. Furthermore, the cost-effectiveness of the system is increased by using more of the cheap commodity hardware instead of the expensive file-server hardware (processors, RAM, etc.).

• Storage management (e.g. migrating data, backup, redundancy) at the block level is less complex to implement and easier to administer, because of the unstructured form of data blocks.

• Many functions, such as compression, deduplication or encryption, must operate at volume-level granularity in order to be efficient. Deduplication, for example, would find more duplicate data blocks when its search scope is an entire volume than when it is restricted within a file or a file tree. In these cases of volume-wide optimization, knowledge of the file or object semantics is useless, and the block level is the right choice.

• Certain CPU-intensive functionality, such as compression, deduplication or encryption, may be simpler and/or more efficient to provide on unstructured, fixed-size data blocks rather than variable-sized files or objects, because (i) hardware accelerators can be utilized more efficiently for fixed-size blocks than for variable-sized data, and (ii) blocks, being more fine-grain than files or objects, incur less overhead on partial file reads or updates, where CPU-intensive operations (e.g. encryption) are performed only on the blocks read or written rather than the whole file each time.

• Current, high-performance disk controllers increasingly implement optimizations or features that alter the disk layout, such as block remapping [English and Alexander, 1992; Wang et al., 1999; Wilkes et al., 1996; Sivathanu et al., 2003] or logging [Wilkes et al., 1996; Stodolsky et al., 1994]. On these systems, many filesystem-level optimizations, such as log-structured writes or file defragmentation, besides doubling the implementation effort, are usually irrelevant or have adverse effects on performance. Furthermore, other features are duplicated in the file and block layers, such as double caching in the buffer cache and the controller cache, also resulting in lower performance. These effects are caused by the well-known “information gap” between the file and block layers [Denehy et al., 2002]. We believe that due to this gap, implementing functionality at the block level, and providing an enhanced block-level interface to the filesystem, is advantageous.

The downside of working at the block level is that we have to restrict our system to the traditional block-level interface and lose the semantics of the file abstraction, and block liveness information [Sivathanu et al., 2003, 2004]. A higher-level API, such as objects, could be better in this case, and we are considering it for future work. However, our aim in this thesis is to explore the limits of the block-level API, before dealing with higher-level abstractions [Flouris et al., 2005].

1.2 Our Approach

An effective way of dealing with the scalability, reliability, flexibility and extensibility issues of a decentralized storage cluster is through block-level virtualization. The term “virtualization” has been used to describe mainly two concepts in storage systems: indirection and sharing [Flouris et al., 2005]. The first notion refers to the addition of an indirection layer between the physical resources and the logical abstractions accessed by applications. Such a layer allows administrators to create, and applications to access, various types of virtual volumes that are mapped to a set of physical devices on the storage system, while offering higher-level semantics through multiple layers of mappings. This kind of virtualization has several advantages:

i. It adds flexibility to the system configuration, since physical storage space can be arbitrarily mapped to virtual volumes. For example a virtual volume can be defined as any combination of basic operators, such as aggregation, partitioning, striping or mirroring. Thus storage can be tailored to the requirements of a particular application, allowing trade-offs between dimensions such as speed, reliability, capacity, high-availability, security, data sharing and consistency [Leavitt, 2008; Bigelow, 2007].

ii. It enables numerous useful management and reconfiguration operations, such as non-disruptive data reorganization and migration during the system operation.

iii. It allows off-loading of important functionality from higher layers, such as fault-tolerance with RAID levels, backup with volume versions or encryption with encryption data filters.

iv. It allows extensibility, that is, the addition of new functionality to the system. This is achieved by implementing new virtualization modules with the desired functions and incrementally adding them to the system.

Currently, the “indirection” virtualization mechanisms of existing storage systems and products mostly support a set of predefined functions and leave freedom to the administrator for combining them [Lee and Thekkath, 1996; MacCormick et al., 2004; Menon et al., 2003]. That is, they provide configuration flexibility, but little extensibility. For example, predefined virtualization semantics include virtual volumes mapped to an aggregation of disks or RAID levels. Both research prototypes of volume managers [De Icaza et al., 1997; EVMS; GEOM; Lehey, 1999; Teigland and Mauelshagen, 2001] as well as commercial products [EMC Enginuity; EMC Symmetrix; HP OpenView; Veritas, b,c] belong in this category. In all these cases the storage administrator can switch on or off various features at the volume level. However, there is no support for extending the I/O protocol stack with new advanced functionality, such as versioning, encryption, compression, deduplication, virus-scanning, application-specific data placement or content hash computation, and it is not possible to combine such new functions in a flexible manner. The main reason behind this is that such extensibility with advanced functions requires support for novel mechanisms, such as dynamic metadata management, end-to-end block-mapping and advanced control messaging.

Dynamic metadata management is essential in order to support advanced functionality at the virtualized block layer. The “dynamic” term denotes that such metadata, referring to a set of storage blocks, need to be processed and potentially updated during every block I/O operation. In contrast, some types of virtualization software, such as software RAID and logical volume managers, require small amounts of “static” metadata that store configuration information and change only when the system is reconfigured. A good example of a function that requires dynamic metadata support is online block-level versioning. In this thesis we use block-level versioning as a guide for determining the merits, the mechanisms and the overheads associated with metadata-intensive functions at the block level.
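To make the distinction concrete, the following user-space C sketch contrasts the two kinds of metadata: a striping translation that touches only static configuration, and a remapping translation that must consult and update a per-extent map on the write path. All structure and function names, sizes and the toy address arithmetic are hypothetical simplifications introduced here for illustration; they are not the actual interfaces of Violin or Clotho, which are described in Chapters 3 and 4.

/* Illustrative sketch only: contrasts "static" configuration metadata
 * (a striping layer) with "dynamic" per-extent metadata (a remapping
 * layer). Names and layouts are hypothetical, not the thesis code. */
#include <stdint.h>
#include <stdio.h>

#define EXTENT_SIZE   8          /* blocks per extent (toy value)    */
#define NUM_DISKS     4          /* static config: set at creation   */
#define NUM_EXTENTS   16         /* logical extents of the volume    */

/* Static metadata: fixed at configuration time, never touched by I/O. */
static uint64_t raid0_translate(uint64_t logical_block)
{
    uint64_t stripe = logical_block / NUM_DISKS;
    uint64_t disk   = logical_block % NUM_DISKS;
    return disk * 1000000 + stripe;      /* toy physical address */
}

/* Dynamic metadata: a per-extent map that a write may update
 * and that must be persisted along with the data. */
static int64_t extent_map[NUM_EXTENTS];  /* -1 = unmapped */
static int64_t next_free_extent = 0;

static int64_t remap_translate_write(uint64_t logical_block)
{
    uint64_t ext = logical_block / EXTENT_SIZE;
    if (extent_map[ext] < 0) {
        extent_map[ext] = next_free_extent++;  /* allocate on first write */
        /* a real system must also log/persist this map update */
    }
    return extent_map[ext] * EXTENT_SIZE + logical_block % EXTENT_SIZE;
}

int main(void)
{
    for (int i = 0; i < NUM_EXTENTS; i++)
        extent_map[i] = -1;

    printf("RAID-0 block 42 -> %llu (no metadata touched)\n",
           (unsigned long long)raid0_translate(42));
    printf("Remapped block 42 -> %lld (map updated, must persist)\n",
           (long long)remap_translate_write(42));
    return 0;
}

The difference in cost is visible even in this toy form: the striping layer is pure arithmetic over static parameters, whereas the remapping layer carries per-extent state whose consistency and persistence the virtualization framework must manage.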

The second notion of virtualization refers to the ability of sharing the storage resources across many hosts and multiple applications without compromising the performance of each application in any significant way. A networked filesystem (e.g. NFS) that allows many nodes to share access to the same files is an example of such a notion. Another example at the block level is a disk volume exported to multiple hosts via iSCSI [iSCSI] or ATA over Ethernet (AoE) [Hopkins and Coile]. In this case, there is currently no standard support for sharing volumes across nodes at the block level. Generally, even though some systems claim to support such sharing, either at the file or block level, efficient virtualization remains an elusive target in large-scale systems, because of the software complexity. Providing scalable resource sharing and maintaining consistency is complex and usually introduces significant overheads in the system.

We should note that the two notions of virtualization are orthogonal to each other. That is, an indirection layer can be used to map physical resources into logical ones, while the sharing layer is used on top to allow shared access of the logical storage resources to many nodes and applications. This organization maximizes the flexibility and the accessibility of the system.

1.3 Problems and Contributions

Our hypothesis is that storage virtualization at the block-level with support for metadata management can be used to build scalable, reliable, available and customizable storage systems with advanced fea- tures. To prove the hypothesis we address the following problems in a decentralized storage cluster:

1. Providing transparent, low-overhead versioning at the block level.

2. Generic support for advanced, metadata-intensive functionality at the block-level in distributed storage architectures.

3. Providing mechanisms for concurrent, scalable sharing of block-level storage.

4. Providing availability, replica consistency and recovery after failures.

Our contributions in this thesis at addressing the above problems are: CHAPTER 1. INTRODUCTION 9

• We explore the mechanisms, the merits, and the overheads associated with metadata-intensive functions at the block level, with our work on block-level versioning. We find that block-level versioning has low overhead and offers advantages over filesystem-based approaches, such as transparency and simplicity.

• We design, implement and evaluate a block-level storage virtualization framework with support for metadata-intensive functions. We show that our framework, Violin: (i) significantly reduces the effort to introduce new functionality in the block I/O stack of a commodity storage node, and (ii) provides the ability to combine simple virtualization functions into hierarchies with semantics that can satisfy diverse application needs.

We extend our single-node virtualization infrastructure, Violin, to Orchestra, an extended framework that allows almost arbitrary distribution of virtual hierarchies to application and storage nodes, providing a lot of flexibility in the mapping of system functionality to available resources.

• We design and implement locking and allocation mechanisms for concurrent, scalable sharing of block-level storage. Our framework supports shared dynamic metadata at the storage node side, but not at the application servers.

• We address consistency issues at the block level, providing simple abstractions and necessary protocols. Our approach, RIBD, uses consistency intervals (CIs), a lightweight transactional mechanism, agreement, block-level versioning, and explicit locking, to address consistency of both replicas and metadata. We implement RIBD on a real prototype, in the Linux kernel. In the same area, we provide a taxonomy of alternative approaches and mechanisms for dealing with replica and metadata consistency. Finally, we evaluate RIBD in comparison with two popular filesystems, showing that RIBD performs comparably to systems with weaker consistency guarantees.

Next, in Section 1.4, we provide an overview of our work and contributions. Our contributions are presented in detail in chapters 3 to 6.

1.4 Overview

In this section we provide an overview of the work presented in the following chapters of the thesis.

1.4.1 Block-level versioning

Block-level versioning is a useful virtualization layer that provides transparent versioned volumes that can be used with multiple, third party, filesystems without the need for modifications. Data versions of the whole volume can be taken on demand and previous versions can be accessed online simultaneously with the current version, making recovery instant and easy. Moreover, it reduces complexity in higher layers of storage systems, namely the filesystem and storage management applications [Hutchinson et al., 1999]. Finally, it inherits all the block-level advantages mentioned previously in Section 1.2, such as exploiting the increased processing capabilities and memory sizes of active storage nodes and off-loading expensive host-processing overheads to the disk subsystem, thus increasing the scalability of a storage archival system [Hutchinson et al., 1999]. Block-level versioning poses several challenges as well:

1. Memory and disk space overhead: Because we only have access to blocks of information, depending on application data access patterns, there is increased danger for higher space overhead in storing previous versions of data and the related metadata.

2. I/O path performance overhead: It is not clear at what cost versioning functionality can be provided at the block level.

3. Consistency of the versioned data when the versioned volume is used in conjunction with a filesystem on top.

4. Versioning granularity: Since versioning occurs at a lower system layer, information about the content of data is not available, as is, for instance, the case when versioning is implemented in the filesystem or the application level. Thus, we only have access to versions of whole volumes as opposed to individual files.

In Chapter 3 we present Clotho, a system that provides online versioning at the block level and addresses all the above issues, demonstrating that this can be done at minimal space and performance overheads.
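As a rough illustration of the mechanism, and not of Clotho's actual data structures or algorithms (which Chapter 3 presents), the sketch below tags each extent with the version in which it was last written; a write that targets an extent captured by an earlier version is redirected to a newly allocated extent, so the old data remains readable under the old version.

/* Illustrative copy-on-write sketch for block-level versioning.
 * All names and structures are hypothetical simplifications. */
#include <stdint.h>
#include <stdio.h>

#define NUM_EXTENTS 8

struct extent_meta {
    int64_t  phys;      /* physical extent backing the logical extent */
    uint32_t version;   /* volume version that last wrote this extent */
};

static struct extent_meta map[NUM_EXTENTS];
static uint32_t current_version = 1;
static int64_t  next_free = NUM_EXTENTS;   /* toy allocator */

static void create_version(void)
{
    /* Capturing a version is just a counter increment; no data copied. */
    current_version++;
}

static int64_t versioned_write(uint32_t ext)
{
    if (map[ext].version < current_version) {
        /* Extent belongs to a captured version: redirect the write
         * to a fresh extent so the old one stays intact. */
        map[ext].phys    = next_free++;
        map[ext].version = current_version;
    }
    return map[ext].phys;   /* write goes to this physical extent */
}

int main(void)
{
    for (uint32_t i = 0; i < NUM_EXTENTS; i++)
        map[i] = (struct extent_meta){ .phys = i, .version = 1 };

    printf("write ext 3 -> phys %lld\n", (long long)versioned_write(3));
    create_version();                       /* snapshot, O(1) */
    printf("write ext 3 -> phys %lld (copy-on-write)\n",
           (long long)versioned_write(3));
    return 0;
}

Under this copy-on-write scheme, capturing a version is a constant-time counter increment, and extra space is consumed lazily, only for extents that are actually overwritten after the capture.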

1.4.2 Storage resource sharing and management

Scalable storage systems provide a means of consolidating all storage in a single system and increasing storage efficiency. However, storage consolidation leads to increased requirements for “flexibility”, so that a single system will be able to serve multiple applications and their diverse needs. This flexibility refers to both storage management and application access issues and is usually provided through virtualization techniques: administrators and applications see various types of virtual volumes that are mapped to physical devices but offer higher-level semantics through virtualization mechanisms.

Figure 1.3: A virtual device graph in Violin.

Figure 1.4: A distributed Orchestra hierarchy.

In Chapter 4 of this thesis we address this problem by providing a kernel-level framework for (i) building and (ii) combining virtualization functions. We design, implement, and evaluate Violin (Virtual I/O Layer INtegrator), a virtual I/O framework for storage nodes that replaces the current block-level I/O stack with an improved I/O hierarchy that allows for (i) easy extension of the storage hierarchy with new mechanisms and (ii) flexibly combining these mechanisms to create modular hierarchies with rich semantics.

Although our approach shares similarities with work in modular and extensible filesystems [Heidemann and Popek, 1994; Schermerhorn et al., 2001; Skinner and Wong, 1993; Zadok and Nieh, 2000] and network protocol stacks [Kohler et al., 2000; Mosberger and Peterson, 1996; O’Malley and Peterson, 1992; Van Renesse et al., 1995], existing techniques from these areas are not directly applicable to block-level storage virtualization. A fundamental difference from network stacks is that the latter are essentially stateless (except for configuration information) and packets are ephemeral, whereas storage blocks and their associated metadata need to persist. Compared to extensible filesystems, block-level storage systems operate at a different granularity, with no information about the relationships and lifecycle of blocks. Thus, metadata need to be maintained at the block level, resulting potentially in large memory overhead. Moreover, block I/O operations cannot be associated precisely with each other, limiting possible optimizations.

The main contributions of Violin are: (i) it significantly reduces the effort to introduce new functionality in the block I/O stack of a storage node and (ii) it provides the ability to combine simple virtualization functions into hierarchies with semantics that can satisfy diverse application needs. Violin provides virtual devices with full access to both the request and completion paths of I/Os, allowing for easy implementation of synchronous and asynchronous I/O. Supporting asynchronous I/O is important for performance reasons, but also raises significant challenges when implemented in real systems. Also, Violin deals with metadata persistence for the full storage hierarchy, off-loading the related complexity from individual virtual devices.

To achieve flexibility, Violin allows storage administrators to create arbitrary, directed acyclic graphs of virtual devices, each adding to the functionality of the successor devices in the graph. In each hierarchy, the blocks of each virtual device can be mapped in arbitrary ways to the successor devices, enabling advanced storage functions, such as dynamic block remapping. An example of a virtual device graph in Violin is shown in Figure 1.3.

In Chapter 4, we evaluate the effectiveness of Violin in three areas: ease of module development, configuration flexibility, and performance. In the first area we are able to quickly prototype modules for RAID levels (0, 1 & 5), versioning, partitioning, aggregation, MD5 hashing, migration and encryption. Further examples of useful functionality that can be implemented as modules include all kinds of encryption algorithms, deduplication, compression, storage virus-scanning modules, online migration and transparent disk layout algorithms (e.g. log-structured allocation or cylinder group placement). In many cases, writing a new module is just a matter of recompiling existing user-level library code. Overall, using Violin encourages the development of simple virtual modules that can later be combined into more complex hierarchies. Regarding configuration flexibility, we are able to easily configure I/O hierarchies that combine the functionality of multiple layers and provide complex high-level semantics that are difficult to achieve otherwise. Finally, we use the Postmark [Katcher] and IOmeter [Iometer] benchmarks to examine the overhead that Violin introduces over traditional block-level I/O hierarchies. We find that, overall, internal modules perform within 10% (in terms of throughput) of their native Linux block-driver counterparts.
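The fragment below gives a flavor of what a stackable block-module interface looks like; the names and the toy modules are invented for illustration and differ from Violin's real module API, which is presented in Chapter 4 (Figure 4.7). It shows the two hooks a module exposes, one on the request path and one on the completion path, and how a request is remapped as it flows down a two-device graph.

/* Hypothetical sketch of a stackable block-module interface; the real
 * Violin module API (Chapter 4, Figure 4.7) differs in names and detail. */
#include <stdint.h>
#include <stdio.h>

struct bio_req {               /* simplified block I/O request */
    uint64_t sector;
    int      is_write;
};

struct vdev {                  /* a virtual device in the graph */
    const char   *name;
    struct vdev  *below;                                  /* successor device  */
    int  (*submit)(struct vdev *d, struct bio_req *r);    /* request path      */
    void (*complete)(struct vdev *d, struct bio_req *r);  /* completion path   */
};

/* Leaf "disk": just report where the request landed. */
static int disk_submit(struct vdev *d, struct bio_req *r)
{
    printf("%s: %s sector %llu\n", d->name,
           r->is_write ? "write" : "read", (unsigned long long)r->sector);
    return 0;
}

/* Partition module: adds a fixed offset, then forwards downward. */
static int part_submit(struct vdev *d, struct bio_req *r)
{
    r->sector += 2048;                         /* toy partition offset */
    int rc = d->below->submit(d->below, r);
    if (d->complete) d->complete(d, r);        /* hook the completion path */
    return rc;
}

static void part_complete(struct vdev *d, struct bio_req *r)
{
    printf("%s: completed sector %llu\n", d->name,
           (unsigned long long)r->sector);
}

int main(void)
{
    struct vdev disk = { "disk0", NULL, disk_submit, NULL };
    struct vdev part = { "part0", &disk, part_submit, part_complete };

    struct bio_req r = { .sector = 7, .is_write = 1 };
    part.submit(&part, &r);                    /* request flows down the graph */
    return 0;
}

Violin itself additionally handles metadata persistence on behalf of such modules, so that mappings like the partition offset above, or the dynamic remappings of a versioning module, survive restarts, as discussed in Chapter 4.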

Multi-storage-node Virtualization

Our approach towards a scalable virtualization platform is to enhance the Violin framework with the necessary primitives in order to allow multi-node (distributed) hierarchies. We find that two primitives are required to achieve this: locking of layer metadata, and control messaging through the distributed I/O stack.

Orchestra, presented in Chapter 5, builds on Violin and aims to take advantage of decentralized storage architectures. Orchestra allows exporting of storage volumes to other storage nodes through block-level network protocols, such as iSCSI, NBD or other protocols. The nodes can further build I/O stacks on top of the remote storage volumes, creating higher-level volumes as needed. Orchestra manages I/O requests through these distributed hierarchies via the layers distributed to different nodes. Reliability issues regarding node or device failures are effectively handled by fault-tolerance layers (e.g. RAID levels) that map storage volumes between nodes or devices. The system can thus tolerate storage node failures as a RAID system tolerates device failures. To sum up, our main contribution in this area is that we provide a novel approach for building cluster storage systems which consolidates system complexity in a single point, the Orchestra hierarchies.

In Chapter 5, we examine how extensible, block-level I/O paths can be supported over distributed storage systems. Our proposed infrastructure deals with metadata persistence and control messaging, and allows for hierarchies to be distributed almost arbitrarily over application and storage nodes, providing a lot of flexibility in the mapping of system functionality to available resources. An example of a distributed Orchestra hierarchy is shown in Figure 1.4.

1.4.3 Scalable storage distribution and sharing

In this area we examine how sharing can be provided at the block level, that is, how multiple hosts and applications can share storage volumes. Our approach is to design block-level locking and allocation facilities that can be inserted in Orchestra’s distributed I/O hierarchies where required. A novel feature of this approach is that allocation and locking are both provided as in-band mechanisms, i.e. they are part of the I/O stack and are not provided as external services, such as distributed lock managers [Preslan et al., 2000; Schmuck and Haskin, 2002]. We believe that our approach simplifies the overall design of a storage system and leaves more flexibility for determining the most appropriate hardware/software boundary.

Orchestra, presented in Chapter 5, not only supports distributed virtualization hierarchies, but also implements sharing mechanisms for the virtualized resources on brick-based architectures. Our main goal for providing such support is to ease the development, deployment, and adaptation of shared storage services for various environments and application needs. We show that providing flexibility does not introduce significant overheads and does not limit scalability.

In the same chapter we examine how higher system layers can benefit from an enhanced block interface. Orchestra’s support for distributed block locking, allocation, and metadata persistence can significantly simplify distributed filesystem design. We design and implement a stateless, pass-through filesystem, called 0FS, that builds on Orchestra’s advanced features and distributed virtualization mechanisms. In particular, this filesystem provides basic file and directory semantics and I/O operations, while it (i) does not maintain any internal distributed state, and (ii) does not require explicit communication among its instances running on different nodes.
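The following sketch suggests how a pass-through filesystem can sequence in-band block-level locking and allocation to append a block to a file; the call names (blk_lock, blk_alloc, and so on) are invented placeholders for illustration and do not correspond to Orchestra's or 0FS's actual interfaces.

/* Sketch only: invented call names illustrating an enhanced block
 * interface with in-band locking and allocation (cf. Chapter 5). */
#include <stdint.h>
#include <stdio.h>

/* Stubs standing in for block-level services provided by the I/O stack. */
static void blk_lock(uint64_t blk)   { printf("lock   block %llu\n", (unsigned long long)blk); }
static void blk_unlock(uint64_t blk) { printf("unlock block %llu\n", (unsigned long long)blk); }
static uint64_t blk_alloc(void)      { static uint64_t next = 100; return next++; }
static void blk_write(uint64_t blk, const char *what)
{
    printf("write  block %llu (%s)\n", (unsigned long long)blk, what);
}

/* Appending a block to a file: the filesystem holds no distributed
 * state of its own; it only sequences block-level operations. */
static void fs_append_block(uint64_t inode_blk, const char *data)
{
    blk_lock(inode_blk);                  /* serialize with other clients      */
    uint64_t data_blk = blk_alloc();      /* allocation done below the FS      */
    blk_write(data_blk, data);
    blk_write(inode_blk, "inode update"); /* record the new block in the inode */
    blk_unlock(inode_blk);
}

int main(void)
{
    fs_append_block(17, "file data");
    return 0;
}

Because locking and allocation are served inside the block-level hierarchy, filesystem instances on different clients do not need to exchange any messages of their own, which is what keeps a design like 0FS stateless.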

We find that our approach significantly reduces system complexity by (i) moving many functions currently performed in higher system layers to the block-level store and (ii) providing an extensible framework that supports adding simple modules in Orchestra hierarchies that can span storage and application nodes. Performance-wise, we find that Orchestra introduces little overhead beyond existing, monolithic implementations that provide the same functionality. Moreover, scalability results show that our approach has the potential for scaling to large system sizes, because of its block-based, asynchronous operation.

1.4.4 Reliability and Availability

A decentralized storage architecture is appealing because the elimination of all centralization points increases I/O request asynchrony and allows the system to scale in capacity and performance. In a decentralized architecture, however, dealing with failures and always maintaining a consistent state of the system becomes a real challenge. This is exacerbated by reliance on commodity (and thus less reliable) equipment. Availability and survivability under system failures are complicated in this case and may result in significant performance penalties. In our view, there are two main issues in achieving availability, fault-tolerance and recoverability in a decentralized storage architecture:

• Consistent redundancy scheme: In centralized or single-node storage architectures, consistency of redundant data used for availability (i.e. replicas, parity or erasure codes) is ensured through a hardware or software RAID layer that manages redundancy through reliable hardware channels. A decentralized storage system, however, must ensure the consistency of redundant data (e.g. data replicas) across storage nodes. Logging or distributed agreement protocols (e.g. voting or two-phase commit) are used for this purpose in a variety of storage architectures [Saito et al., 2004; Schmuck and Haskin, 2002; Hartman and Ousterhout, 1992; Liskov et al., 1991].

Figure 1.5: Overview of typical cluster-based scalable storage architectures and RIBD: (a) cluster-based storage system; (b) RIBD.

• Recovery of data and metadata: A number of component failures should be tolerated and the system should remain available, depending on the redundancy scheme being used. However, in the event of a global (non-tolerated) failure, data and system metadata must be recoverable to a consistent state so that normal system operation can either continue or be restarted without the need to reconstruct extensive state, e.g. by performing extensive consistency checks at the file level with fsck. Typically, this problem is addressed by using journaling techniques [Hagmann, 1987; Seltzer et al., 2000; Tweedie, 2000].

Cluster-based storage architectures in use today resemble the structure shown in Figure 1.5-(a), where a high-level layer such as a filesystem uses the services of an underlying block-level storage system through a standard block-level interface such as SCSI. Concerns such as data availability and metadata recovery after failures are commonly addressed through redundancy (e.g., consistent replication) at the file or block (or both) layers, and by metadata journaling techniques at the file layer. This structure, where both redundancy and metadata consistency are built into the filesystem, leads to filesystems that are complex, hard to scale, and hard to debug and tune for specific application domains [Prabhakaran et al., 2005a; Yang et al., 2004; Sivathanu et al., 2005a; Prabhakaran et al., 2005b].

Modern high-end storage systems incorporate increasingly advanced features such as thin provisioning [Compellent, 2008; IBM Corp., 2008c], snapshots [IBM Corp., 2008d], volume management [EVMS; GEOM; Teigland and Mauelshagen, 2001], data deduplication [Quinlan and Dorward, 2002], block remapping [English and Alexander, 1992; Wang et al., 1999; Wilkes et al., 1996; Sivathanu et al., 2003], and logging [Wilkes et al., 1996; Stodolsky et al., 1994]. The implementation of all such advanced features requires the use of significant block-level metadata, which are kept consistent using techniques similar to those used in filesystems (Figure 1.5-(a)). Filesystems running over such advanced systems often duplicate effort and complexity by focusing on file-level optimizations such as log-structured writes, or file defragmentation. Besides doubling the implementation and debugging effort, these file-level optimizations are usually irrelevant or adversely impact performance [Denehy et al., 2002].

In Chapter 6 we propose Recoverable Independent Block Devices (RIBD), an alternative storage system structure (depicted in Figure 1.5-(b)) that moves the necessary support for handling both data and metadata consistency issues to the block layer in decentralized commodity platforms. RIBD extends Orchestra presented in Chapter 5, introducing the notion of consistency intervals (CIs) to provide fine-grain consistency semantics on sequences of block level operations by means of a lightweight transaction mechanism. Our model of CIs is related to the notion of atomic recovery units (ARUs), which were defined in the context of a local, centralized system [Grimm et al., 1996]. A key distinction between CIs and ARUs is that CIs handle both atomicity and consistency over multiple distributed copies of data and/or metadata, whereas ARUs do not face consistency issues within a single system. RIBD extends the traditional block I/O interface with commands to delineate CIs, offering a simple yet powerful interface to the file layer.
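The sketch below conveys the flavor of delineating a sequence of block operations with a CI; the primitive names (ci_begin, ci_commit, ci_abort) are invented for illustration, and the thesis's actual pseudo-code for a file create under one CI appears in Chapter 6 (Figure 6.3).

/* Sketch with invented primitives (ci_begin/ci_commit/ci_abort);
 * RIBD's real interface and file-create example are in Chapter 6. */
#include <stdint.h>
#include <stdio.h>

static int  ci_begin(void)    { printf("CI: begin\n");  return 1; }
static void ci_commit(int ci) { printf("CI %d: commit (replicas and metadata become durable together)\n", ci); }
static void ci_abort(int ci)  { printf("CI %d: abort (roll back to the last consistent version)\n", ci); }
static int blk_write(uint64_t blk, const char *what)
{
    printf("  write block %llu (%s)\n", (unsigned long long)blk, what);
    return 0;   /* a nonzero return would simulate a failed write */
}

/* File create: inode, directory and allocator updates are grouped in
 * one CI so they take effect on all replicas atomically, or not at all. */
static int fs_create(uint64_t dir_blk, uint64_t inode_blk, uint64_t bitmap_blk)
{
    int ci = ci_begin();
    if (blk_write(bitmap_blk, "mark inode allocated") ||
        blk_write(inode_blk,  "initialize inode")     ||
        blk_write(dir_blk,    "add directory entry")) {
        ci_abort(ci);
        return -1;
    }
    ci_commit(ci);
    return 0;
}

int main(void)
{
    return fs_create(5, 42, 1) ? 1 : 0;
}

From the file layer's point of view, the only change relative to a standard block interface is the delineation calls; the atomicity, replica agreement and recovery work happens below the interface.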

One distinctive feature of RIBD is its roll-back recovery mechanism based on low-cost versioning, which allows recovery and access not only to the latest system state (as journaling does), but to a set of previous states. This allows recovery from malicious attacks and human errors, as discussed in detail in Section 3.1. To the best of our knowledge, RIBD is the first system to propose and demonstrate the combination of versioning and lightweight transactions for recovery purposes at the block storage level.

RIBD has been implemented as a kernel module in the Linux kernel, and evaluated on a cluster of 24 nodes. To understand the overheads in our approach, we compare RIBD with two popular filesystems, PVFS [Carns et al., 2000] and GFS [Preslan et al., 1999], which however offer weaker consistency guarantees. We find that RIBD offers a simple and clean interface to higher system layers and performs comparably to existing solutions with weaker guarantees. CHAPTER 1. INTRODUCTION 17

1.5 Thesis Organization

The rest of this thesis is organized as follows: Chapter 2 surveys previous and related work for this thesis. Chapter 3 presents our work in the block-level versioning area with Clotho, demonstrating the applications, limits and overheads of metadata-intensive tasks at the block level. The content of this chapter roughly corresponds to publication [Flouris and Bilas, 2004]. Chapter 4 presents the foundation of our work on single-host block-level virtualization with Violin, a virtualization framework that allows (i) building and (ii) combining virtualization functions to create extensible virtual hierarchies. The content of Chapter 4 corresponds roughly to publication [Flouris and Bilas, 2005]. Chapter 5 presents our work on Orchestra, which extends Violin, allowing distributed placement of virtualization functions and supporting mechanisms for sharing virtualized storage among many hosts and applications. The content of this chapter roughly corresponds to publication [Flouris et al., 2008]. Chapter 6 presents our work on Recoverable Independent Block Devices (RIBD), which have been designed and implemented by extending Orchestra. The content of this chapter, with an early version of the protocol design and without the implementation and evaluation, has been presented in [Flouris et al., 2006]. Finally, Chapter 7 concludes this thesis and presents directions for future research.

Chapter 2

Related Work

True knowledge exists in knowing that you know nothing. —Socrates (469 BC - 399 BC)

The idea of moving application-specific or OS-specific code closer to the disks has been proposed by three different research groups at approximately the same time. “Network-attached secure disks” (or NASD) [Gibson et al., 1998], IDisks [Keeton et al., 1998] and “active disks” [Acharya et al., 1998] are based on similar concepts. NASD goes one step further in proposing an object-based storage API instead of the traditional block-level interface. Several publications on active disks and object storage devices (OSD) have followed these first projects [Wang et al., 1999; Uysal et al., 2000; Sivathanu et al., 2002; Yokota, 2000; Akinlar and Mukherjee, 2000]. Although these systems have provided motivation for our work, there are differences between these approaches and our work, as discussed next.

The Active Disk and IDisks projects propose the use of the on-disk controller specifically for database and decision support applications. However, the IDisk project has not explored the issues of what functions would be appropriate for running on the disk controller and what their runtime would be, and has not evaluated a real or simulated prototype [Keeton et al., 1998]. On the other hand, the Active Disk project [Acharya et al., 1998; Uysal et al., 2000] has proposed and evaluated a runtime system with functions, called disklets, running on the disk controllers. The disklets, however, have very different characteristics from the functions we use, and although they run at the block level, close to the disk, they cannot be called virtualization functions, for the following reasons: (i) Disklets are executed in a stream-based, sandboxed runtime, and their code is intended for data filtering. As such, they are allowed to deal only with data stream processing, not with metadata. (ii) Disklets are not allowed to allocate or free memory, to initiate I/O on their own (e.g. for writing metadata), or to change the I/O addresses of the input or output streams (e.g. to remap blocks as we do in versioning). As a result, the disklet model would not be appropriate even for writing a simple RAID-0 virtualization function. In contrast, the virtualization framework we build supports all these operations as well as metadata management for I/O blocks.
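To make this contrast concrete, the sketch below shows the kind of interface a block-level virtualization function needs; all types and names here are hypothetical illustrations, not the actual API of our framework. The module keeps private metadata, may issue its own I/O to persist them, and may remap the addresses of incoming requests, which is exactly what the disklet model disallows:

    /* Illustrative sketch of a block-level virtualization module interface;
     * all names are hypothetical. A disklet-style runtime would forbid the
     * metadata I/O and address remapping shown here. */
    #include <stdint.h>
    #include <stddef.h>

    struct io_request {
        uint64_t block_nr;   /* logical address seen by the layer above */
        void    *data;
        size_t   len;
        int      is_write;
    };

    struct virt_module {
        void *private_metadata;                      /* e.g. mapping tables */
        /* Translate the request's address and forward it to a lower device. */
        int (*map_io)(struct virt_module *m, struct io_request *req);
        /* Persist the module's own metadata with self-initiated I/O. */
        int (*sync_metadata)(struct virt_module *m);
    };

    /* A minimal RAID-0-like mapping over two devices: it must change the
     * target address of the request, something a disklet cannot do. */
    static int stripe_map_io(struct virt_module *m, struct io_request *req)
    {
        const uint64_t unit = 128;                   /* blocks per stripe unit */
        uint64_t disk   = (req->block_nr / unit) % 2;
        uint64_t offset = (req->block_nr / (2 * unit)) * unit
                        + (req->block_nr % unit);
        req->block_nr = offset;                      /* remapped address */
        (void)m; (void)disk;                         /* disk selects the lower device */
        return 0;
    }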

Object-based storage devices (OSD), although not widely available yet [Mesnier et al., 2003], have recently been ratified as a new storage standard [Weber (Editor), 2004, 2007]. OSDs provide built-in support for metadata storage and retrieval operations for objects, as well as support for collections of objects. However, the use of metadata in object storage has been explored mostly for storing file attributes in objects and providing security for objects, as in NASD [Gibson et al., 1998]. To our knowledge, no system based on OSDs has explored the use of object metadata for moving advanced metadata-intensive functionality, such as versioning, deduplication or compression, from the filesystem to the object store. Moreover, we are not aware of an extensible framework for object-based storage. However, if such a system is built, we expect that, in order to provide extensibility and flexibility transparently, it will use the mechanisms we use in our block-level framework, such as an extended object command API for control, logical-to-physical object mappings, and a hierarchy of virtual object-level modules on OSDs. It would also have to use availability and consistency protocols similar to the ones we propose. Finally, it should be mentioned that using our block-level virtualization framework, we can build an OSD layer with features similar to the ones described in the OSD standards [Weber (Editor), 2004, 2007]. As a result, we consider our work more generic than an object-level virtualization framework.

Implementing functionality below the block-level API has also been proposed in semantically-smart disks [Arpaci-Dusseau et al., 2006; Sivathanu et al., 2003]. This approach explores the capability of inferring, within the block-level subsystem, essential information missing from the higher layers. Such block-level metadata include block liveness, file system structure and block relationships [Sivathanu et al., 2004, 2005b]. Knowledge of such information at the block level would enable off-loading of many useful tasks, such as secure block deletion, data placement optimizations and garbage collection. The gray-box approach employed by semantically-smart disks is interesting; however, in many cases it incurs a risk of false inference (e.g. about block liveness in the file system) that may lead to data loss. In our approach, the goal is not to infer block-level metadata from higher layers, but to provide advanced functionality at the block level through an enhanced block API.

A block-level virtualization system, and perhaps the closest system to our work regarding the virtualization aspects, is Petal [Lee and Thekkath, 1996] and its distributed filesystem Frangipani [Thekkath et al., 1997]. Petal’s main focus was making storage management easier and for this reason it provides some advanced functionality features, such as dynamically configuring and reconfiguring virtual disk volumes over a pool of physical disk storage, automatically adding to or removing physical disks from the pool, providing high-availability with chained declustering and providing block-level versioning using copy-on-write and storing metadata for blocks.

Regarding block-level versioning, our scheme is close to Petal’s; however, our algorithm has lower overhead, because it does not perform copy-on-write but rather remap-on-write, using at the same time sub-extent bitmaps that track the valid sub-extents. We thus avoid the copy on each write, trading disk capacity for write performance. On the other hand, the remap-on-write approach alters data placement and may affect read performance. Moreover, in contrast to Petal, we propose, implement and evaluate differential compression for blocks belonging to subsequent versions in order to increase disk space efficiency.
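A minimal sketch of the idea follows (the data structures are illustrative only, not Clotho's actual ones): instead of copying an extent before overwriting part of it, a write is redirected to a newly allocated extent and a per-extent bitmap records which sub-extents now live in the new extent; reads consult the bitmap to decide whether to fetch a sub-extent from the new or the old extent.

    /* Sketch of remap-on-write with sub-extent validity bitmaps; the
     * structures and sizes are hypothetical, for illustration only. */
    #include <stdint.h>
    #include <stdbool.h>

    #define SUBEXTENTS_PER_EXTENT 32      /* one bit per sub-extent */

    struct extent_map {
        uint64_t old_extent;   /* extent holding the previous version        */
        uint64_t new_extent;   /* extent allocated on the first write        */
        uint32_t valid;        /* bit i set: sub-extent i lives in new_extent,
                                  otherwise it is still in old_extent        */
    };

    /* On a write: remap to the new extent and mark the sub-extent valid.
     * No old data is copied, unlike copy-on-write. */
    static void remap_on_write(struct extent_map *m, unsigned subext)
    {
        m->valid |= (1u << subext);
    }

    /* On a read: pick whichever extent currently holds the sub-extent. */
    static uint64_t locate_for_read(const struct extent_map *m, unsigned subext)
    {
        bool in_new = (m->valid >> subext) & 1u;
        return in_new ? m->new_extent : m->old_extent;
    }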

In terms of features, we can classify Petal as being rich in features and supporting dynamic metadata for blocks; however, its feature set is fixed, that is, it is not extensible. Instead, we support modular hierarchies that are flexible to configure and make it easy to add new functionality, such as deduplication, encryption, or tiered storage. Thus, while we can create volumes with application-specific functionality, Petal cannot. Also, metadata in our system depend on the stack of functions used in a volume and are not of a fixed size that would consume resources even when the application does not use all features.

In Orchestra, presented in Chapter 5, we offer block-level sharing mechanisms for virtual volumes. Petal-Frangipani does not offer such block-level mechanisms, but uses filesystem-based locking through a lock service. An advantage of our in-band, block-level locking at the storage nodes is that it can scale as more nodes are added. Moreover, Petal-Frangipani implements distributed block allocation at the filesystem level, while we provide it through an in-band mechanism at the block level.

Petal uses write-ahead logging for determining replica consistency after failures. We, on the other hand, propose and evaluate a lightweight transactional API coupled with low-overhead block-level versioning. We believe that our approach offers the same strong consistency guarantees with scalable performance, since versioning is a particularly promising approach for building reliable storage systems as it matches current disk capacity/cost tradeoffs.

In the rest of this chapter we examine previous and related work for each of the Chapters 3 to 6 of this thesis. Section 2.1 presents previous work related to block-level versioning, such as WAFL [Hitz et al., 1994], SnapMirror [Patterson et al., 2002], Petal [Lee and Thekkath, 1996], Venti [Quinlan and Dorward, 2002] and CVFS [Soules et al., 2003]. Work related to (a) extensible filesystems (e.g. Ficus [Heidemann and Popek, 1994], Vnodes [Skinner and Wong, 1993], FiST [Zadok and Nieh, 2000]), (b) extensible network protocols (e.g. Click Router [Kohler et al., 2000], X-kernel [O’Malley and Peterson, 1992], Horus [Van Renesse et al., 1995]), and (c) block-level storage virtualization (e.g. EVMS [EVMS], GEOM [GEOM]) is presented in Section 2.2. Previous work on shared cluster storage systems (e.g. Petal-Frangipani [Lee and Thekkath, 1996; Thekkath et al., 1997], Boxwood [MacCormick et al., 2004], Swarm [Hartman et al., 1999]) is discussed in Section 2.3. Finally, work related to consistency and availability issues in cluster storage systems and a detailed taxonomy of existing solutions is presented in Section 2.4.

2.1 Clotho: Block-level Versioning

A number of projects have highlighted the importance of storage management and the issues it involves [Gray, 1999; Kubiatowicz et al., 2000; Patterson, 2000; Wilkes, 2001]. Our goal in Clotho, presented in Chapter 3, is to define innovative functionality that can be used in future storage protocols and APIs to reduce management overheads.

Block-level versioning was discussed and used in WAFL [Hitz et al., 1994], a filesystem designed for Network Appliance’s NFS appliance. WAFL works at the block level within the filesystem and can create up to 20 snapshots of a volume and keep them available online through NFS. However, since WAFL is a filesystem and works in an NFS appliance, this approach depends on the filesystem. In Chapter 3 we demonstrate that Clotho is filesystem agnostic by presenting experimental data with two production-level filesystems. Moreover, WAFL can manage a limited number of versions (up to 20), whereas Clotho can manage a practically unlimited number. Hitz et al. [Hitz et al., 1994] mention that WAFL’s performance cannot be compared to other general purpose filesystems, since it runs on a specialized NFS appliance and much of its performance comes from its NFS-specific tuning. Hutchinson et al. [Hutchinson et al., 1999] use WAFL to compare the performance of filesystem- and block-level-based snapshots (within WAFL). They advocate the use of block-level backup, due to cost and performance reasons. However, they do not provide any evidence on the performance overhead of block-level versioned disks compared to regular, non-versioned block devices. In Chapter 3 we thoroughly evaluate this with both micro-benchmarks as well as standard workloads. SnapMirror [Patterson et al., 2002] is an extension of WAFL, which adds management of remote replicas of WAFL’s snapshots to optimize data transfer and ensure consistency.

Venti [Quinlan and Dorward, 2002] is a block-level network storage service, intended as a repository for backup data. Venti follows a write-once storage model and uses content-based addressing by means of hash functions to identify blocks with identical content. Instead, Clotho uses differential compression concepts. Furthermore, Venti does not support versioning features. Clotho and Venti are designed to perform complementary tasks, the former to version data and the latter to serve as a repository that safely stores the archived data blocks over the network.

Distributed block-level versioning support was included in Petal [Lee and Thekkath, 1996]. Although Petal’s versioning concepts are similar to Clotho’s, Petal uses a copy-on-write algorithm, while we propose remap-on-write and sub-extent addressing to minimize both write overhead and memory footprint. Moreover, in contrast to Petal, we propose, implement and evaluate differential compression for blocks belonging to subsequent versions in order to increase disk space efficiency.

Since backup and archival of data is an important problem, there are many products available that try to address the related issues. However, specific information about these systems and their performance with commodity hardware, filesystems, or well-known benchmarks is scarce. LiveBackup [Storactive, 2002] captures changes at the file level on client machines and sends modifications to a back-end server that archives previous file versions. EMC’s SnapView [SnapView] runs on the CLARiiON storage servers at the block level and uses a “copy-on-first-write” algorithm. However, it can capture only up to 8 snapshots and its copy algorithm does not use logging block allocation to speed up writes. Instead, it copies the old block data to hidden storage space on every first write, overwriting another block. Veritas’s FlashSnap [Veritas, a] software works inside the Veritas File System, and thus, unlike Clotho, is not filesystem agnostic. Furthermore, it supports only up to 32 snapshots of volumes. Sun’s Instant Image [Sun Microsystems, 2008] also works at the block level in the Sun StorEdge storage servers. Its operation appears similar to Clotho. However, it is used through drivers and programs in Sun’s StorEdge architecture, which runs only on Solaris and is also filesystem aware.

Each of the above systems, especially the commercial ones, uses proprietary customized hardware and system software, which makes comparisons with commodity hardware and general purpose operating systems difficult. Moreover, these systems are intended as standalone services within centralized storage appliances, whereas Clotho is designed as a transparent autonomous block-level layer for active storage devices, appropriate for pushing functionality closer to the physical disk. In this direction, Clotho categorizes the challenges of implementing block-level versioning and addresses the related problems.

De Jonge et al. [De Jonge et al., 1993] examine the possibility of introducing an additional layer in the I/O device stack to provide certain functionality at lower system layers, which also affects the functionality provided by the filesystem. Other efforts in this direction mostly include work in logical volume management and storage virtualization that tries to create a higher-level abstraction on top of simple block devices. Teigland et al. [Teigland and Mauelshagen, 2001] present a survey of such systems for Linux. Such systems usually provide the abstraction of a block-level volume that can be partitioned, aggregated, expanded, or shrunk on demand. Other such efforts [De Icaza et al., 1997] add RAID capabilities to arbitrary block devices. Our work is complementary to these efforts and proposes adding versioning capabilities to the block-device level.

Other previous work in versioning data has mostly been performed either at the filesystem layer or at higher layers. Santry et al. [Santry et al., 1999] propose versioning of data at the file level, discussing how the filesystem can transparently maintain file versions as well as how these can be cleaned up. Moran et al. [Moran et al., 1993] try to achieve similar functionality by providing mount points to previous versions of directories and files. They propose a solution that does not require kernel-level modification but relies on a set of user processes to capture user requests to files and to communicate with a back-end storage server that archives previous file versions. Other, similar efforts [Olson, 1993; Pike et al., 1990; Roome, 1992; Soules et al., 2003; Strunk et al., 2000] approach the problem at the filesystem level as well and either provide the ability for checkpointing of data or explicitly manage time as an additional file attribute.

Self-securing storage [Strunk et al., 2000] and its filesystem, CVFS [Soules et al., 2003], target secure storage systems and operate at the filesystem level. Some of the versioning concepts in self-securing storage and CVFS are similar to Clotho, but there are numerous differences as well. The most significant one is that self-securing storage policies are not intended for data archiving and thus retain versions of data for a short period of time called the detection window. No versions are guaranteed to exist outside this window of time and no version management control is provided for specifying higher-level policies. CVFS introduces certain interesting concepts for reducing metadata space, which, however, are also geared towards security and are not intended for archival purposes. Since certain concepts in [Soules et al., 2003; Strunk et al., 2000] are similar to Clotho, we believe that a block-level self-secure storage system could be based on Clotho, separating the orthogonal versioning and security functionalities in different subsystems. Systems more recent than Clotho, such as VDisk [Wires and Feeley, 2007], have focused on “secure” versioning at the block level, where the versioning layer is protected within a virtual machine. The advantage of VDisk is that storage data and metadata are additionally protected from operating system bugs that may cause corruption. On the other hand, VDisk uses expensive metadata management techniques which result in a 50% performance penalty on block writes. We believe that similar protection from operating system faults, at a much lower overhead, can be provided by encapsulating Clotho in a virtual machine or by embedding it in a hardware disk controller.

2.2 Violin: Extensible Block-level Storage

In this section we comment on previous and related work on (a) extensible filesystems, (b) extensible network protocols, and (c) block-level storage virtualization.

2.2.1 Extensible filesystems

Storage extensions can be implemented at various levels in a storage system. The highest level is within the filesystem, where combined software layers implement the desired functionality. The concept of layered filesystems has been explored in previous research. Ficus [Heidemann and Popek, 1994], one of the earliest approaches, proposes a filesystem with a stackable layer architecture, where desired functionality is achieved by loading appropriate modules. A later approach [Skinner and Wong, 1993] proposes layered filesystem implementation by extending the vnode/VFS interface to allow “stacking” vnodes. A similar concept at the user level was proposed in the recent Mona FS [Schermerhorn et al., 2001], which allows application extensions to files through streaming abstractions. Another recent approach, FiST [Zadok and Nieh, 2000], uses a high-level specification language and a compiler to produce a filesystem with the desired features. FiST filesystems are built as extensions on top of baseFS, a native low-level filesystem. A similar concept was used in the Exokernel [Kaashoek et al., 1997], an extensible OS that comes with XN, a low-level in-kernel storage system. XN allows users to build library filesystems with their desired features. However, developing library filesystems for the Exokernel requires significant effort and can be as difficult as developing a native monolithic FS for an operating system.

Our work on Violin, presented in Chapter 4, shares similarity with these approaches to the extent that we are also dealing with extensible storage stacks. However, the fact that we operate at the block level of the storage system requires that our framework provide different APIs to modules, use different I/O abstractions for its services, and rely on different kernel facilities in its implementation. Thus, although the high-level concept is similar to extensible filesystems, the challenges faced and solutions provided are different.

Implementing storage extensions at either the filesystem level or the storage system level is desirable, each for different reasons, and neither approach excludes the other. The basic pros and cons of each approach stem from the API and metadata each level is aware of. One fundamental characteristic of filesystems, for example, is that they have file metadata. Thus, they can associate blocks that belong to the same file and are able to provide policies at the file level. On the other hand, storage systems operate at the block level and have no information about the relationships of blocks. Thus, metadata need to be maintained at the block level, potentially resulting in large memory overhead. Moreover, block I/O operations cannot be associated precisely with each other, limiting possible optimizations. On the positive side, block-level extensions are transparent to any filesystem and volume-level policies can be provided. Moreover, many block-level functions, e.g. encryption or RAID levels, can operate faster than at the filesystem level, since they operate on raw fixed-size blocks and not on the structured variable-sized data of the filesystem.

2.2.2 Extensible network protocols

Our work on storage virtualization shares similarities with frameworks for layered network stacks. The main efforts in this direction are the Click Router [Kohler et al., 2000], X-kernel [O’Malley and Peterson, 1992], and Horus [Van Renesse et al., 1995]. All three approaches aim at building a framework for synthesizing network protocols from simpler elements. They use a graph representation for layering protocols, they envision simple elements in each protocol, and provide mechanisms for combining these simple elements in hierarchies with rich semantics and low performance overhead. Scout [Mosberger and Peterson, 1996] is a communication-centric OS, also supporting stackable protocols. The main abstraction of Scout is the I/O paths between data sources and sinks, an idea applicable both to network and storage stacks. In Violin we are interested in providing a similar framework for block-level storage systems. We share the same observation that there is an increased need to extend the functionality of block-level storage (network protocol stacks) and that doing so is a challenging task with many implications for storage (network) infrastructure design. However, the underlying issues in network and storage stacks are different:

• Network stacks distinguish flows of self-contained packets, while storage stacks cannot distinguish flows, but map data blocks from one device to another.

• Network and storage stacks exhibit fundamentally different requirements for state persistence. Network stacks do not need to remember where they scheduled every packet in order to recover it at a later time, and thus do not require extensive metadata. On the other hand, storage stacks must be able to reconstruct the data path for each I/O request passing through the stack, often requiring large amounts of persistent metadata.

• Send and receive paths in network protocol stacks are fairly independent, whereas in a storage hierarchy there is strong coupling between the request and completion paths for each read and write request. Moreover, an issue not present in network protocol stacks is the need for asynchronous handling of I/O requests and completions, which introduces additional complexity in the system design and implementation (a brief sketch follows this list).
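The following sketch is purely illustrative (hypothetical structure and function names): a request descriptor in a layered storage stack carries a completion callback and per-layer context, so the asynchronous completion must retrace the request path through the hierarchy, unlike the largely independent send and receive paths of a network stack.

    /* Illustrative request/completion coupling in a layered storage stack;
     * all names are hypothetical. */
    #include <stdint.h>

    struct stack_request {
        uint64_t logical_block;     /* address as issued by the layer above  */
        uint64_t mapped_block;      /* address after this layer's remapping  */
        int      error;
        void   (*on_complete)(struct stack_request *req);  /* upper layer's callback */
        void    *layer_ctx;         /* per-layer state needed on the way back up */
    };

    /* Invoked asynchronously when the lower device finishes the I/O: the layer
     * consults its context (e.g. to finalize a remapping or metadata update)
     * and then propagates the completion to the layer above. */
    static void layer_complete(struct stack_request *req)
    {
        /* ... per-layer bookkeeping kept in req->layer_ctx ... */
        if (req->on_complete)
            req->on_complete(req);
    }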

2.2.3 Block-level storage virtualization

The most common type of virtualization software is the volume manager. The two most advanced open-source volume managers currently are EVMS and GEOM. EVMS [EVMS] is a user-level distributed volume manager for Linux. It uses the MD [De Icaza et al., 1997] and device-mapper kernel modules to support user-level plug-ins called features. However, it does not offer persistent metadata or block remapping primitives to these plug-ins. Moreover, EVMS focuses on configuration flexibility with predefined storage semantics (e.g. RAID levels) and does not easily allow generic extensions (e.g. versioning). GEOM [GEOM] is a stackable BIO subsystem under development for FreeBSD. The concepts behind GEOM are, to our knowledge, the closest to Violin. However, GEOM does not support persistent metadata which, combined with dynamic block mapping, are necessary for advanced modules such as the versioning module discussed in Chapter 3. LVM [Teigland and Mauelshagen, 2001] and Vinum [Lehey, 1999] are simpler versions of EVMS and GEOM. Violin has all the configuration and flexibility features of a volume manager coupled with the ability to write extension modules with arbitrary virtualization semantics.

Besides open-source software, there exist numerous virtualization solutions in the industry. HP OpenView Storage Node Manager [HP OpenView] helps administrators control, plan, and manage direct-attached and networked storage, acting as a central management console. EMC Enginuity [EMC Enginuity], a storage operating environment for high-end storage clusters, employs various techniques to deliver optimized performance, availability and data integrity. Veritas Volume Manager [Veritas, c] and Veritas File System aim at assisting with online storage management. Similar to other volume managers, physical disks can be grouped into logical volumes to improve disk utilization and eliminate storage-related downtime. Moreover, administrators have the ability to move data between different storage arrays, balance I/O across multiple paths to improve performance, and replicate data to remote sites for higher availability. However, in all cases, the offered virtualization functions are predefined and they do not seem to support extensibility of the I/O stack with new features. For instance, EMC Enginuity currently supports only the following predefined data protection options: RAID-1, Parity RAID (1 parity disk per 3 or 7 data disks), and RAID-5 [EMC Symmetrix].

RAIDframe [CourtrightII et al., 1996] enables rapid prototyping and evaluation of RAID architectures, which is similar to, but narrower than, our goals. RAIDframe uses a DAG representation for specifying RAID architectures, similar to Violin. However, its goal is to evaluate certain architectural parameters (encoding, mapping, caching) and not to provide extensible I/O hierarchies with arbitrary virtualization functions.

The latest versions of the Windows OS support an integrated model for building device drivers, the Windows Driver Model (WDM) [Oney]. This model specifies the runtime support available to writers of device drivers. However, unlike Violin for block-level devices, it only provides generic kernel support and does not include functionality specific to storage.

Research in virtual machines, such as [Barham et al., 2003; Sugerman et al., 2001], aims at virtualizing hardware resources so that multiple instances of the same I/O resource are shared in a safe manner by different instances of the same or different operating systems. Our work is orthogonal and aims at providing additional functionality on top of disk storage resources and within a single operating system.

2.3 Orchestra: Extensible Networked-storage Virtualization

In this section we present previous and related work on building storage systems for high performance clusters of servers. Next, we discuss the limitations of current, conventional solutions and we review prototypes that share similar goals with Orchestra, contrasting our approach to other contributions.

2.3.1 Conventional cluster storage systems

Currently, building scalable storage systems that provide storage sharing for multiple applications relies on layering a distributed filesystem on top of a pool of block-level storage. This approach is dictated by the fact that block-level storage has limited semantics that do not allow advanced storage functions to be performed and, especially, cannot support transparent sharing without application support.

Efforts in this direction include distributed cluster file systems, often based on VAXclusters [Kronenberg et al., 1986] concepts, that allow for efficient sharing of data among a set of storage servers with strong consistency semantics and fail-over capabilities. Such systems typically operate on top of a pool of physically shared devices through a SAN. However, they do not provide much control over the system’s operation. Modern cluster filesystems such as the Global File System (GFS) [Preslan et al., 1999] and the General Parallel File System (GPFS) [Schmuck and Haskin, 2002] are used extensively today in medium and large scale storage systems for clusters. However, their complexity makes them hard to develop and maintain, prohibits any practical extension to the underlying storage system, and forces all applications to use a single, almost fixed, view of the available data. In Chapter 5 we examine how such issues can be addressed in future storage systems.

The complexity of cluster filesystems is often mitigated by means of a logical volume manager, which virtualizes the storage space and allows transparent changes to the mappings between data and devices while the system is on-line. The two most advanced open-source volume managers currently are EVMS and GEOM. EVMS [EVMS] is a user-level distributed volume manager for Linux. It uses the MD [De Icaza et al., 1997] and device-mapper kernel modules to support user-level plug-ins. Recent versions offer persistent metadata and block re-mapping primitives to these plug-ins. However, EVMS does not support generic extensions (such as versioning or advanced reliability features), unlike Orchestra. GEOM [GEOM] is a single-node stackable block I/O subsystem under development for FreeBSD. The basic I/O stack concepts of GEOM are similar to Orchestra’s; however, GEOM does not support distributed hierarchies, volume sharing or persistent metadata, which, combined with dynamic block mapping, are necessary for advanced modules such as the versioning module presented in Chapter 3.

Beyond open-source software, there exist numerous commercial virtualization solutions as well, such as HP OpenView Storage Node Manager [HP OpenView], EMC Enginuity [EMC Enginuity] and Veritas Volume Manager [Veritas, c]. However, in all cases, the offered virtualization functions are predefined and there is no support for extending the I/O stack with new features. In contrast, Orchestra, presented in Chapter 5, has all the configuration and flexibility features of a volume manager coupled with the ability to write extension modules with arbitrary virtualization semantics.

2.3.2 Flexible support for distributed storage

To overcome the aforementioned limitations of the traditional tools, a number of research projects have aimed at building scalable storage systems by defining more modular architectures and pushing functionality towards the block level.

A popular approach is based on the concept of a shared virtual disk [Shillner and Felten, 1996; Lee and Thekkath, 1996; MacCormick et al., 2004], which handles most of the critical concerns in distributed storage systems, including fault-tolerance, dynamic addition or removal of physical resources, and sometimes (consistent) caching. In this way, the design of the filesystem can be significantly simplified. However, most of the complexity is pushed to the level of the virtual disk (assisted sometimes by a set of external services, e.g. for locking or consensus), whose design remains monolithic.

Our work on Orchestra bears similarity with this approach, and in particular with Petal-Frangipani [Lee and Thekkath, 1996; Thekkath et al., 1997], in that all filesystem communication happens through a distributed volume layer, simplifying filesystem design and implementation. However, contrary to Frangipani, which uses an out-of-band lock server and allocates blocks through the FS, Orchestra performs locking as well as block allocation through the block layer. We believe that this in-band management of all the core functions is an important feature, which leaves more freedom to system designers regarding the hardware/software boundary of a storage system. Thus, different trade-offs can be explored more quickly because all the functionalities exported by the virtual volume are built from a stack of modules. Moreover, Orchestra allows storage systems to provide varying functionality through virtual hierarchies and also increases flexibility and control in distributing many storage layers to a number of (possibly cascaded) storage nodes.

Device-served locks have been envisioned in the past, notably in the context of GFS, which led to the specifications of the DMEP/DLOCK SCSI commands [Preslan et al., 2000]. However, DMEP/DLOCK capabilities never really became a reality due to the reluctance of the storage vendors to evolve the interface of their products. Unlike these past efforts, our propositions do not require any hardware modification and enable a broader range of constructs (such as hierarchical locking) so as to allow more scalability.

Recent research has investigated more deeply the modular approach for building cluster storage systems, in particular with the Swarm [Hartman et al., 1999] and Abacus [Amiri et al., 2000] projects. However, Swarm does not enable data sharing among multiple (client) nodes. Besides, due to its log-based interface to storage devices, Swarm leaves little freedom regarding the placement of most modules: much of the functionality is pushed on the client side. Abacus has mostly focused on optimal, dynamic function placement through automatic migration of components. However, to the best of our knowledge, it has not addressed issues such as functional extensibility and simplified metadata management.

Furthermore, a number of frameworks for extensible filesystems have been proposed [Heidemann and Popek, 1994; Schermerhorn et al., 2001; Skinner and Wong, 1993; Zadok and Nieh, 2000]. However, these proposals have been limited to schemes relying on a central server, which are not well suited to data centers, where scalability is a primary concern. While we strive to present a single system image to the clients, our work targets large storage systems based on a distributed (and more scalable) topology.

2.3.3 Support for cluster-based storage

A number of research projects have explored the potential of decentralized storage systems made of a large number of nodes with increased processing capabilities. One of the pioneering efforts in this regard is based on Object-based Storage Devices (OSD). The OSD approach defines an object structure that is understood by the storage devices, and which may be implemented at various system components, e.g. the storage controller or the disk itself. Our approach does not specify fixed groupings of blocks, i.e. objects. Instead, it allows virtual modules to use metadata and define block groupings dynamically, based on the module semantics. These groupings and associations may occur at any layer in a virtual storage hierarchy. For instance, a versioning virtual device (e.g. Clotho presented in Chapter 3) may be inserted either at the application server or storage node side and specifies through its metadata which blocks form each version of a specific device. In addition, our propositions do not necessarily require modifications of the current interface of storage devices. In both approaches, allocation occurs closer to the block level. In OSD, objects are allocated by the storage device, whereas in Orchestra they are allocated by a virtual module that is inserted in the storage hierarchy.

In terms of locking, and to the best of our knowledge, the current OSD specification [Weber (Editor), 2004] does not state explicitly how mutual exclusion should be provided; however, object attributes may be used to implement mutual exclusion mechanisms on top of it. In Orchestra, similarly to allocation, locking is provided by a new virtual module that is inserted into the storage hierarchy on demand. OSD specifies that object devices perform protection checking, whereas in Orchestra we envision that, beyond traditional permissions in the filesystem, protection will also be provided by custom virtual modules at fine granularity.

Overall, Orchestra shares many goals with OSD in enriching the semantics of the block level. However, our approach allows for more flexibility on how storage devices are “composed” from distributed physical disks and, as we explore in Chapter 5, facilitates simpler cluster filesystem design.

Ursa Minor [Abd-El-Malek et al., 2005], a system for object-based storage bricks coupled with a central manager, provides flexibility with respect to the data layout and the fault-model (both for client and storage nodes). These parameters can be adjusted dynamically on a per data item basis, according to the needs of a given environment. Such fine grain customization yields noticeable performance improvements. However, the system imposes a fixed architecture based on a central metadata server that could limit scalability and robustness, and is not extensible, i.e. the set of available functions and reconfigurations is pre-determined.

The Federated Array of Bricks (FAB) [Saito et al., 2004] discusses how storage systems may be built out of commodity storage nodes and interconnects and yet compete (in terms of reliability and performance) with custom, high-end solutions for enterprise environments. Overall, we share similar objectives. However, FAB has mostly investigated optimized protocols for consistent updates to replicas and quick background recovery using a non-extensible, out-of-band approach, while our main focus is to examine how future block-level systems can be tailored to support changing application needs without compromising transparency and scale.

Previous work has also investigated a number of issues raised by the lack of a central controller and the distributed nature of cluster-based storage systems, e.g. consistency for erasure-coded redundancy schemes [Amiri et al., 2000] and efficient request scheduling [Lumb et al., 2004]. We consider these concerns to be orthogonal to our work but we note that the existing solutions in these domains could be implemented as modules within our framework.

Finally, our design of the block allocator, described in Section 5.2.2, bears similarity with Hoard [Berger et al., 2000], a scalable memory allocator for multi-threaded applications. Both systems partition the address space into regions, called “supernodes” by Hoard and “allocation zones” in Orchestra, and can scale to many processes or nodes. However, since Hoard was designed for allocating memory, its operation is designed according to memory hierarchy characteristics, such as cache misses. On the other hand, Orchestra’s allocator considers networked storage characteristics, such as the latency of locking regions, as well as data placement, because of the performance difference between sequential and random access to disks.
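A rough sketch of this style of zone-partitioned allocation is shown below; the structures and policy are hypothetical illustrations, not Orchestra's actual allocator. The volume's block address space is split into zones, each node prefers allocating from its own "home" zone (locked in-band at the storage node holding it), and allocation within a zone proceeds sequentially to favour disk locality:

    /* Sketch of a zone-partitioned block allocator; all structures and the
     * policy are illustrative. zone_try_lock() stands in for an in-band lock
     * request sent through the block layer to the node owning the zone. */
    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_ZONES   64
    #define ZONE_BLOCKS (1u << 20)      /* blocks per allocation zone */

    struct alloc_zone {
        uint64_t base;                  /* first block of the zone */
        uint64_t next_free;             /* bump pointer: sequential allocation */
        bool     locked_by_us;          /* do we already hold the zone lock? */
    };

    bool zone_try_lock(struct alloc_zone *z);   /* in-band lock, defined elsewhere */

    /* Allocate one block, preferring the node's home zone so that different
     * nodes rarely contend for the same zone lock. */
    static int64_t alloc_block(struct alloc_zone zones[NUM_ZONES], int home)
    {
        for (int i = 0; i < NUM_ZONES; i++) {
            struct alloc_zone *z = &zones[(home + i) % NUM_ZONES];
            if (!z->locked_by_us && !zone_try_lock(z))
                continue;                       /* zone busy on another node */
            z->locked_by_us = true;
            if (z->next_free < z->base + ZONE_BLOCKS)
                return (int64_t)z->next_free++; /* sequential within the zone */
        }
        return -1;                              /* no free blocks found */
    }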

2.3.4 Summary

Overall, current approaches have important limitations. First, block-level virtualization systems tend to provide simple semantics, and rely on higher-level layers to deal with consistent and structured data sharing. Moreover, cluster filesystems are complex and monolithic, which makes it very challenging to tune them for specific application needs. Recent work on cluster-based storage has mostly investigated issues related to reliability, security, and performance optimizations. However, to the best of our knowledge, techniques for building highly configurable storage systems from (distributed) block devices have received far less attention.

2.4 RIBD: Taxonomy and Related Work

In this section we present a taxonomy of existing and possible solutions for handling data replica and metadata consistency issues in distributed storage systems. We are interested in solutions that (1) assume no centralization point in the data path from clients to data and (2) deal with data replication for availability purposes. Then we discuss related work on RIBD.

2.4.1 Taxonomy of existing solutions

First, we categorize solutions based on two high-level parameters used for recovering the whole system to a consistent state after a failure: (a) recovering data replicas to a consistent state and (b) ensuring consistency of metadata and possibly data as well.

Data replica consistency: This problem can be addressed either by extending the block level or the filesystem. In either case, it is necessary to ensure that read operations to different disk servers will always return consistent block contents after a failure. This requires the file or block level to know, after a failure that may have resulted in updating only a subset of the replicas, which replica is the right one to use for subsequent requests. The first approach, file-level replication (FLR), requires the filesystem on the node performing the I/O to be aware of replica placement and management, eliminating the possibility of providing replication by various RAID options at the block level. The latter, block-level replication (BLR), complicates replica updates, as it introduces dependencies among updates in different disk servers.

When replication happens at the block level, it can apply to all blocks of a volume, both data and metadata. File-level replication can either refer to metadata only (FR-M) or to both data and metadata (FR-MD), as in practice it does not make sense to replicate only data blocks.

Metadata consistency: This issue can be addressed either by logging operations and rolling forward (RF) after a failure by replaying the logged operations, or by maintaining multiple versions of data and rolling backwards (RB) to a previously consistent state of the system after a failure. Either approach can be implemented at the file level (FC) or the block level (BC), leading to four options: FC-RF, FC-RB, BC-RF, BC-RB.

Finally, in this taxonomy, we do not distinguish between systems that log only metadata or both metadata and data. Both types of system would be able to recover to a consistent state after a failure.

Table 2.1 categorizes some existing systems based on this taxonomy. There are various observations we can make:

First, there are a number of systems [PVFS2; Nieuwejaar and Kotz, 1996; White et al., 2001; Menon et al., 2003] that focus on scalability and performance, leaving reliability and availability issues to other layers in the system. Typically, these systems assume that underlying storage devices do not fail or deal with failures themselves, and that metadata consistency is either maintained by avoiding the associated problems (e.g. with centralized, synchronous operations) or is provided by other system layers [Menon et al., 2003].

Second, most systems that deal with failures themselves provide metadata consistency by using file-level roll-forward techniques, either based on logs or shadow buffers. These systems vary mostly in their assumptions about replication of storage and fall in all possible categories: systems that do not replicate storage, assuming that the underlying devices are reliable [Sun Microsystems, 2007; Devarakonda et al., 1996]; systems that replicate only file metadata [Preslan et al., 1999; Mitchell and Dion, 1982; Brown et al., 1985]; and systems that replicate both metadata and data, either at the file level [Schmuck and Haskin, 2002; Hartman and Ousterhout, 1992; Liskov et al., 1991] or at the block level [Thekkath et al., 1997; Anderson et al., 1996; Weil et al., 2006]1.

Table 2.1: Categorization of existing distributed file systems and storage systems depending on their replica consistency and metadata consistency mechanisms. Replication categories: file-level replication of metadata only (FR-M), file-level replication of metadata and data (FR-MD), block-level replication (BR), or none (NONE). Consistency categories: consistency at the file level (FC) or the block level (BC), using roll-forward logging (RF) or roll-back versioning (RB), leading to four combinations: FC-RF, FC-RB, BC-RF, BC-RB. Entries are listed below as "consistency category -- replication category: systems".

  BC-RF/RB -- BR: RIBD, BSTs [Amiri et al., 2000], Echo [Hisgen et al., 1993]
  FC-RB: ZFS [Bonwick and Moore, 2008]
  FC-RF -- NONE: Lustre [Sun Microsystems, 2007], Calypso [Devarakonda et al., 1996]
  FC-RF -- FR-M: GFS [Preslan et al., 1999], CFS [Mitchell and Dion, 1982], XDFS [Mitchell and Dion, 1982], Alpine [Brown et al., 1985]
  FC-RF -- FR-MD: GPFS [Schmuck and Haskin, 2002], Harp [Liskov et al., 1991], Zebra [Hartman and Ousterhout, 1992]
  FC-RF -- BR: Frangipani [Thekkath et al., 1997], xFS [Anderson et al., 1996], Ceph [Weil et al., 2006]
  NONE -- NONE: PVFS2 [PVFS2], ST [Menon et al., 2003], Legion [White et al., 2001], Galley [Nieuwejaar and Kotz, 1996]
  NONE -- BR: Petal [Lee and Thekkath, 1996], FAB [Saito et al., 2004], Kybos [Wong et al., 2005]

Third, there have been fewer systems in the rest of the categories. This can be explained as follows. File-level roll-back recovery (FC-RB) can generally be implemented in two ways: either by logging old values of metadata and data, e.g. in a journal, or by using in-place versioning on the storage devices. The first approach is not practical, as it requires updating data on the disk (and thus performing seeks) while also storing the old values in the journal for roll-back purposes, incurring all these I/Os in the critical path. Instead, it makes more sense to just use roll forward, by only writing the new values directly in the journal and applying them to the actual data later on, during a cleanup procedure, off the critical path. Thus, roll-back approaches make sense only for systems that have support for low-cost versioning, such as ZFS [Bonwick and Moore, 2008]2. Block-level recovery, on the other hand, requires maintaining state at the block level, which has been difficult to provide in previous generation systems. Block-level roll-forward recovery (BC-RF) has been examined in the Echo system [Hisgen et al., 1993]. Echo uses a transactional API at the block level, implemented with a logging mechanism, to resolve inconsistencies both for replication purposes as well as for system metadata. Dealing with these issues at the block level allows Echo to use a single log for both purposes. The BST approach [Amiri et al., 2000] introduces base storage transactions (BSTs) at the block level. By restricting their scope, it is possible to reduce the amount of state required and to avoid maintaining a log. Finally, the remaining classes do not seem to make sense for practical, real-life systems, whereas the FC-RB/BR class is not populated, but it appears to be a valid design point. This would correspond, e.g., to a versioning filesystem implemented appropriately on top of a block level such as Petal [Lee and Thekkath, 1996], FAB [Saito et al., 2004], or Kybos [Wong et al., 2005].

Given these observations, there are two main questions to consider:

1. Does it make sense to deal with consistency issues at the file or block level?

2. Is roll forward or roll back recovery more appropriate?

In Chapter 6 we argue that dealing with consistency issues at the block level by using roll back recovery is a promising approach for building storage systems.

1 Ceph [Weil et al., 2006] is a filesystem based on object-based storage; nevertheless, it relies on the object storage devices for replication of data and guarantees metadata consistency at the file level.
2 Although ZFS is not a cluster filesystem, we mention it here because it provides an interesting design point.

2.4.2 Related Work

Besides the taxonomy presented above, our work on RIBD, presented in Chapter 6, bears similarity with various storage systems in three respects:

First, RIBD is related to systems that provide a transactional API at the block level. Echo [Hisgen et al., 1993] provides arbitrary transactions but uses roll-forward recovery. BSTs [Amiri et al., 2000] allow the system to maintain consistency at all block operations, without the need for recovery after failures. Instead, our API uses the notion of consistency intervals (CIs) that deal with atomicity, consistency, and durability, relying for isolation on the use of explicit locks within CIs. CIs can include arbitrary block operations. To maintain consistency, unlike Echo, we use roll-back recovery based on versioning. Finally, unlike both Echo and BSTs, we evaluate the impact on scalability using a real prototype. CIs are similar to atomic recovery units (ARUs) [Grimm et al., 1996]. However, ARUs do not deal with replica consistency, as they target recovery in local storage systems.
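A hedged usage sketch of this division of labour is shown below, reusing the hypothetical CI calls sketched in Chapter 1 (none of these names are RIBD's actual API): explicit locks provide isolation from other file servers, while the CI provides atomicity, consistency and durability for the enclosed block writes.

    /* Hypothetical usage of consistency intervals plus explicit locking;
     * all function names are illustrative, not RIBD's actual API. */
    #include <stdint.h>

    extern uint64_t ci_begin(int volume_fd);
    extern int      ci_end(int volume_fd, uint64_t ci);
    extern int      block_write(int volume_fd, uint64_t ci, uint64_t block_nr,
                                const void *buf, unsigned len);
    extern int      blk_lock(int volume_fd, uint64_t block_nr);
    extern int      blk_unlock(int volume_fd, uint64_t block_nr);

    /* Update an on-disk structure spanning two blocks. The locks isolate the
     * update from other file servers; the CI guarantees that after a failure
     * either both writes survive or the system rolls back to a version in
     * which neither is visible. */
    static int update_two_blocks(int vol, uint64_t a, uint64_t b,
                                 const void *abuf, const void *bbuf, unsigned len)
    {
        blk_lock(vol, a);
        blk_lock(vol, b);

        uint64_t ci = ci_begin(vol);
        block_write(vol, ci, a, abuf, len);
        block_write(vol, ci, b, bbuf, len);
        int rc = ci_end(vol, ci);

        blk_unlock(vol, b);
        blk_unlock(vol, a);
        return rc;
    }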

Second, RIBD is related to systems that aim at low-overhead versioning. Systems that use versioning at the file level include ZFS [Bonwick and Moore, 2008], Elephant [Santry et al., 1999], 3DFS [Roome, 1992], the Inversion filesystem [Olson, 1993], Plan9 [Pike et al., 1995], Spiralog [Johnson and Laing, 1996], and self-securing storage [Strunk et al., 2000]. Also, Cedar [Hagmann, 1987] and Venti [Quinlan and Dorward, 2002] use a similar concept of immutable files. Finally, Soules et al. [Soules et al., 2003] examine issues related to metadata in versioning filesystems.

Systems that use versioning at the block level include Clotho (presented in Chapter 3), which is extended in Chapter 6 to support distributed, globally-consistent versions, Petal [Lee and Thekkath, 1996], WAFL [Hitz et al., 1994], SnapMirror [Patterson et al., 2002], HP Olive [Aguilera et al., 2006] and Timeline [Hue Moh and Liskov, 2004].

WAFL and SnapMirror are single-node versioning systems. Petal offers crash-consistent snapshots using a copy-on-write mechanism but requires disruption of user access to stored data when capturing or accessing versions. Olive builds upon the FAB [Saito et al., 2004] clustered storage system, adding the ability to create writable storage branches with minimal overhead. However, Olive’s snapshots are crash-consistent (rather than consistent at the level of the higher-layer software, as in RIBD) and require a recovery mechanism at the application level (e.g. fsck), combined with write support, in order to be consistent. Finally, Timeline has a different scope, supporting versioning for a persistent object store distributed over the wide area.

Third, our implementation of RIBD relies on Orchestra, an extensible storage system prototype presented in Chapter 5. Other systems that have used similar approaches for extending the functionality of the I/O stack are the Hurricane filesystem [Krieger and Stumm, 1997] and FiST [Zadok and Nieh, 2000]. Unlike Orchestra, which replaces the block level in the Linux kernel, these systems operate at the file level.

Besides storage systems, there has been a lot of work on issues related to failures and recovery of distributed systems. In particular, Sinfonia [Aguilera et al., 2007] provides mini-transactions among groups of nodes and has been used to build a cluster filesystem, SinfoniaFS, for evaluation purposes. Like Sinfonia, RIBD targets data centers, does not require interaction between file servers, and offers configurable durability semantics. There are nonetheless significant differences. First, Sinfonia proposes a relatively general-purpose transactional abstraction while our model is simpler and focuses on storage systems. In addition, Sinfonia follows a roll-forward recovery model relying on logging while our system implements a roll-backwards recovery scheme based on versioning. In addition, SinfoniaFS explores a different design space by supporting coherent, client-side caching with a write-through, read-validation strategy. Our approach would also impose a write-through scheme for (optional) client-side caching at the end of a transaction. However, the use of explicit locking would allow client caches to avoid read validation.

The Federated Array of Bricks (FAB) [Saito et al., 2004] is also related to RIBD and is interesting in that it uses a voting protocol to deal with all types of failures affecting replica consistency. Unlike CIs, FAB allows updates to replicas to complete when a quorum is guaranteed. During reads, a voting process takes place and decides the value of the block. This approach incurs fairly different tradeoffs compared to our protocols (writes are faster and reads are slower). Supporting a voting mechanism for replica consistency of non-critical data in our system would only require modifications to a small subset of modules. However, we choose to use a single, transactional mechanism for all purposes, as we believe it is more appropriate for primary storage applications, as opposed to archival-type applications.

Finally, RIBD does not address Byzantine failures and silent data errors, which have been the topic of systems such as PASIS [Goodson et al., 2003], as well as off-site disaster recovery [Patterson et al., 2002].

Chapter 3

Clotho: Transparent Data Versioning at the Block I/O Level

Those who cannot remember the past are condemned to repeat it. —George Santayana (1863 - 1952), The Life of Reason

The content of this chapter roughly corresponds to publication [Flouris and Bilas, 2004].

3.1 Introduction

Storage is currently emerging as one of the major problems in building tomorrow’s computing infrastructure. Future systems will provide tremendous storage, CPU processing, and network transfer capacity in a cost-efficient manner and they will be able to process and store ever increasing amounts of data. The cost of managing these large amounts of stored data becomes the dominant complexity and cost factor for building, using, and operating modern storage systems. Recent studies [Gartner Group, 2000] show that storage expenditures represent more than 50% of the typical server purchase price for applications such as OLTP (On-Line Transaction Processing) or ERP (Enterprise Resource Planning) and these percentages will keep growing. Furthermore, the cost of storage administration is estimated at several times the purchase price of the storage hardware [Anderson et al., 2002; Cox et al., 2002; Lee and Thekkath, 1996; Gibson and Wilkes, 1996; Thekkath et al., 1997; Veitch et al., 2001; Wilkes, 2001]. Thus, building self-managed storage devices that reduce management-related overheads and complexity is of paramount importance.


One of the most cumbersome management tasks that requires human intervention is creating, maintaining, and recovering previous versions of data for archival, durability, and other reasons. The problem is exacerbated as the capacity and scale of storage systems increases. In large-scale storage systems, proper backup is of paramount importance; however, current solutions are expensive and arduous. Today, backup is the main mechanism used to serve these needs. However, traditional backup systems are limited in the functionality they provide. Moreover, they usually incur high access and restore overheads on magnetic tapes, they impose a very coarse granularity in the allowable archival periods, usually at least one day, and they result in significant management overheads [Cox et al., 2002; Quinlan and Dorward, 2002]. Automatic versioning, in conjunction with increasing disk capacities, has been proposed [Cox et al., 2002; Quinlan and Dorward, 2002] as a method to address these issues. In particular, magnetic disks are becoming cheaper and larger and it is projected that disk storage will soon be as competitive as tape storage [Cox et al., 2002; Esener et al., 1999]. With the advent of inexpensive high-capacity disks, we can perform continuous, real-time versioning and we can maintain online repositories of archived data. Moreover, online storage versioning offers a new range of possibilities beyond simply recovering users’ files, possibilities that are available today only in expensive, high-end storage systems:

• Recovery from user mistakes. Online versioning allows frequent storage snapshots that reduce the amount of data lost compared to traditional backup systems. The users themselves can recover accidentally deleted or modified data by rolling back to a saved version.

• Recovery from system corruption. In the event of a malicious incident on a system, administrators can quickly identify corrupted data as well as recover to a previous, consistent system state [Soules et al., 2003; Strunk et al., 2000].

• Historical analysis of data modifications. When it is necessary to understand how a piece of data has reached a certain state, versioning proves a valuable tool.

Our goal in the work presented in this chapter is to provide online storage versioning capabilities in commodity storage systems, in a transparent and cost-effective manner. Storage versioning has previously been proposed and examined purely at the filesystem level [Pike et al., 1990; Santry et al., 1999] or at the block level, but in a filesystem-aware manner [Hitz et al., 1994; Sun Microsystems, 2008]. These approaches to versioning were intended for large, centralized storage servers or appliances. We argue that to build self-managed storage systems, versioning functionality should be pushed to lower system layers, closer to the disk, to offload higher system layers [Strunk et al., 2000]. This is made possible by the underlying technologies that drive storage systems. Disk storage, network bandwidth, processor speed, and main memory are reaching speeds and capacities that make it possible to build cost-effective storage systems with significant processing capabilities [Esener et al., 1999; Gibson et al., 1998; Gray, 1999; Patterson, 2000] that will be able both to store vast amounts of information [Gray, 1999; Lesk, 1997] and to provide advanced functionality.

Our approach to providing online storage versioning is to provide all related functionality at the block level. This approach has a number of advantages compared to approaches that try to provide the same features either at the application or the filesystem level. First, it provides a higher level of transparency and, in particular, is completely filesystem-agnostic. For instance, we have used our versioned volumes with multiple, third-party filesystems without the need for any modifications. Data snapshots can be taken on demand and previous versions can be accessed online simultaneously with the current version. Second, it reduces complexity in higher layers of storage systems, namely the filesystem and storage management applications [Hutchinson et al., 1999]. Third, it takes advantage of the increased processing capabilities and memory sizes of active storage nodes and offloads expensive host-processing overheads to the disk subsystem, thus increasing the scalability of a storage archival system [Hutchinson et al., 1999].

However, block-level versioning poses certain challenges as well: (i) Memory and disk space overhead: because we only have access to blocks of information, and depending on application data access patterns, there is a risk of higher space overhead in storing previous versions of data and the related metadata. (ii) I/O path performance overhead: it is not clear at what cost versioning functionality can be provided at the block level. (iii) Consistency of the versioned data when the versioned volume is used in conjunction with a filesystem. (iv) Versioning granularity: since versioning occurs at a lower system layer, information about the content of data is not available, as is, for instance, the case when versioning is implemented at the filesystem or the application level. Thus, we only have access to full volumes as opposed to individual files.

We design Clotho¹, a system that provides versioning at the block level and addresses all the above issues, demonstrating that this can be done at minimal space and performance overheads. First, Clotho has low memory space overhead and uses a novel method to avoid copy-on-write costs when the versioning extent size is larger than the block size. Furthermore, Clotho employs off-line differential compression (or diffing) to reduce disk space overhead for archived versions. Second, using advanced disk management algorithms, Clotho’s operation is reduced in all cases to simply manipulating pointers in in-memory data structures. Thus, Clotho’s common-path overhead follows the rapidly increasing processor-memory curve and does not depend on the much lower disk speeds. Third, Clotho deals with version consistency by providing mechanisms that can be used by higher system layers either to guarantee that all data is consistent or to mark which data (files) are not. Finally, we believe that volumes are an appropriate granularity for versioning policies. Given the amounts of information that will need to be managed in the future, specifying volume-wide policies and placing files on volumes with the appropriate properties will result in more efficient data management.

¹Clotho, one of the Fates in ancient Greek mythology, spins the thread of life for every mortal.

We implement Clotho as an additional layer (driver) in the I/O hierarchy of Linux. Our implementation approach gives Clotho the flexibility to be inserted at many different points in the block-layer hierarchy of a single machine, a clustered I/O system, or a SAN. Clotho works over simple block devices, such as a standard disk driver, or over more advanced device drivers, such as volume managers or hardware/software RAIDs. Furthermore, our implementation provides to higher layers the abstraction of a standard block device and thus can be used by other disk drivers, volume/storage managers, object stores, or filesystems.

The main memory overhead of Clotho for metadata is about 500 KBytes per GByte of disk space using 32-KByte extents (i.e. 500 MBytes per TByte of disk). This is a significant amount of memory; however, it can be reduced by using larger extents (e.g. with 128-KByte extents Clotho needs 128 MBytes per TByte of disk). Further reduction of the memory overhead can be achieved by swapping unused metadata from memory to disk, as discussed in the next chapters.

We evaluate our implementation with micro-benchmarks as well as with real filesystems and the SPEC SFS 3.0 suite over NFS. We find that the performance overhead of Clotho for I/O operations is minimal; however, it may change the behavior of higher layers (including the filesystem), especially if they make implicit assumptions about the placement of data blocks on the underlying block device. Clotho uses remap-on-write with a logging policy, which optimizes write performance but in the long run may have adverse effects on read performance. To avoid such behavior, co-design or tuning of the two layers (file system and Clotho) may be necessary. Overall, we find that our approach is promising in offloading significant management overhead and complexity from higher system layers to the disk itself and is a concrete step towards building self-managed storage systems.

The rest of this chapter is organized as follows. Section 3.2 presents our design and discusses the related challenges in building block-level versioning systems. Section 3.3 presents our implementation. Section 3.4 presents our experimental evaluation and results, while Section 3.5 presents limitations and future work. Finally, Section 3.6 draws our conclusions.

3.2 System Design

The design of Clotho is driven by the following high-level goals and challenges:

• Flexibility and transparency.

• Low metadata footprint and low disk space overhead.

• Low-overhead common I/O path operation.

• Consistent online snapshots.

Next we discuss how we address each of these challenges separately.

3.2.1 Flexibility and Transparency

Clotho provides versioned volumes to higher system layers. These volumes look similar to ordinary physical disks that can, however, be customized, based on user-defined policies, to keep previous versions of the data they store. Essentially, Clotho provides a set of mechanisms that allow the user to add time as a dimension in managing data by creating and manipulating volume versions. Every piece of data passing through Clotho is indexed based not only on its location on the block device, but also on the time the block was written. When a new version is created, a subsequent write to a block will create a new block, preserving the previous version. Multiple writes to the same data block between versions result in overwriting the same block. Using Clotho, device versions can be captured either on demand or automatically at specified intervals. The user can view and access all previous versions of the data online, as independent block devices, along with the current version. The user can also compact and/or delete previous volume versions.

In this chapter we focus on the mechanisms Clotho provides and we only present simple policies we have implemented and tested ourselves. We expect that system administrators will further define their own policies in the context of higher-level storage management tools. Clotho provides the following set of primitives (mechanisms) that higher-level policies can use for automatic version management; a hypothetical sketch of how this interface can be encoded as ioctl commands is given after the list:

Figure 3.1: Clotho in the block device hierarchy (three example configurations: Clotho directly over a physical disk, over a virtual volume built by a volume manager, and over a RAID controller).

• CreateVersion() provides a mechanism for capturing the lower-level block device’s state into a version. This operation can be triggered either explicitly by OS services or user applications or optionally by Clotho based on internal events, such as on each write or every few writes.

• DeleteVersion() explicitly removes a previously archived version and reclaims the corresponding volume space.

• ListVersions() lists all saved versions of a specific block device.

• ViewVersion() enables creating a virtual device that corresponds to a specific version of the volume and is accessible in read-only mode.

• CompactVersion() and UncompactVersion() provide the ability to compact and un-compact existing versions for reducing disk space overhead.
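These primitives are exposed by the driver through its ioctl interface (Section 3.3). The following is a minimal, hypothetical sketch of how such commands could be encoded; the command numbers and argument types are assumptions for illustration and are not the actual encoding used by Clotho.

```c
/* Hypothetical ioctl encoding of Clotho's version-management primitives.
 * Command numbers and argument types are illustrative assumptions.      */
#include <stdint.h>
#include <sys/ioctl.h>

#define CLT_IOC_MAGIC 'c'

#define CLT_IOC_CREATE_VERSION    _IOR(CLT_IOC_MAGIC, 1, uint32_t) /* out: new version number    */
#define CLT_IOC_DELETE_VERSION    _IOW(CLT_IOC_MAGIC, 2, uint32_t) /* in: version to delete      */
#define CLT_IOC_LIST_VERSIONS     _IO(CLT_IOC_MAGIC, 3)            /* versions reported via /proc */
#define CLT_IOC_VIEW_VERSION      _IOW(CLT_IOC_MAGIC, 4, uint32_t) /* in: version to link as a read-only device */
#define CLT_IOC_COMPACT_VERSION   _IOW(CLT_IOC_MAGIC, 5, uint32_t)
#define CLT_IOC_UNCOMPACT_VERSION _IOW(CLT_IOC_MAGIC, 6, uint32_t)
```

A user-level policy daemon could, for example, issue CLT_IOC_CREATE_VERSION on a timer and CLT_IOC_DELETE_VERSION when backup space runs low.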

Versions of a volume have the following properties. Each version is identified by a unique version number, which is an integer counter starting from 0 and increasing with each new version. Version numbers are associated with timestamps for presentation purposes. All blocks of the device that are accessible to higher layers during a period of time will be part of the version of the volume taken at that moment (if any) and will be identified by the same version number. Each of the archived versions exists solely in read-only state and is presented to the higher levels of the block I/O hierarchy as a distinct, virtual, read-only block device. The latest version of a device is both readable and writable, exists through the entire lifetime of Clotho’s operation, and cannot be deleted.

One of the main challenges in Clotho is to provide all versioning mechanisms at the block level in a transparent and flexible manner. Clotho can be inserted arbitrarily in a system’s layered block I/O hierarchy. This stackable driver concept has been employed to design other block-level I/O abstractions, such as software RAID systems or volume managers, in a clean and flexible manner [Sun Microsystems, 2008]. The input (higher) layer can be any filesystem or other block-level abstraction or application, such as a RAID, volume manager, or another storage system. Clotho accepts block I/O requests (e.g. read, write) from this layer. Similarly, the output (lower) layer can be any other block device or block-level abstraction. This design provides great flexibility in configuring a system’s block device hierarchy. Figure 3.1 shows some possible configurations for Clotho. On the left part of Figure 3.1, Clotho operates on top of a physical disk device. In the middle, Clotho acts as a wrapper of a single virtual volume constructed by a volume manager, which abstracts multiple physical disks. In this configuration Clotho captures versions of the whole virtual volume. On the right side of Figure 3.1, Clotho is layered on top of a RAID controller, which adds reliability to the system. The result is a storage volume that is both versioned and tolerant of disk failures.

Most higher-level abstractions that are built on top of existing block devices (e.g. filesystems) assume a device of fixed size, with few rare exceptions such as resizable filesystems. However, the space occupied by previous versions of data in Clotho changes dynamically, depending on the volume of modified data and the frequency of data versioning. Clotho can provide both a fixed-size block device abstraction to higher layers, as well as dynamically resizable devices, if the higher layers support it. At device initialization time Clotho reserves a configurable percentage of the available device space for keeping previous versions of the data. This essentially partitions (logically, not physically) the capacity of the wrapped device into two logical segments, as illustrated in Figure 3.2: the Primary Data Segment (PDS), which contains the data of the current (latest) version, and the Backup Data Segment (BDS), which contains all the data of the archived versions. When the BDS becomes full, Clotho simply returns an appropriate error code and the user has to reclaim parts of the BDS by deleting or compacting previous versions, or by moving them to some other device. These operations can also be performed automatically by a module that implements high-level data management policies. The latest version of the block device continues to be available and usable at all times. Clotho enforces this capacity segmentation by reporting as its total size to the input layer only the size of the PDS. The space reserved for storing versions is hidden from the input layer and is accessed and managed only through the API provided by Clotho.
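As a concrete illustration of this logical partitioning, the following minimal sketch (with assumed names and a percentage-based reservation, which is one possible policy) computes the PDS size that would be reported to the input layer; the actual initialization code of the driver is not shown in the text.

```c
/* Minimal sketch of Clotho's logical capacity split (assumed names).
 * The PDS size is what the input layer sees as the device's total size;
 * the BDS is reserved for archived versions and hidden from it.        */
#include <stdint.h>
#include <stdio.h>

struct clt_capacity {
    uint64_t pds_extents;   /* extents visible to the input layer */
    uint64_t bds_extents;   /* extents reserved for old versions  */
};

static struct clt_capacity clt_partition(uint64_t usable_extents, unsigned backup_pct)
{
    struct clt_capacity c;
    c.bds_extents = (usable_extents * backup_pct) / 100;
    c.pds_extents = usable_extents - c.bds_extents;
    return c;
}

int main(void)
{
    /* Example: 1 TByte of usable space, 32-KByte extents, 25% reserved. */
    uint64_t extents = (1ULL << 40) / (32 * 1024);        /* 32M extents */
    struct clt_capacity c = clt_partition(extents, 25);
    printf("PDS: %llu extents, BDS: %llu extents\n",
           (unsigned long long)c.pds_extents, (unsigned long long)c.bds_extents);
    return 0;
}
```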

Finally, Clotho’s metadata need to be saved on the output device along with the actual data. Losing the indexing metadata would render the data stored throughout the block I/O hierarchy unusable. Currently, Clotho lazily flushes metadata to the output device periodically in order to keep them consistent. Using this scheme, Clotho is able to recover after a crash to the latest saved volume version (i.e. to roll back) by doing a full scan over the metadata, which are stored sequentially on the disk. It is not, however, able to recover to the latest version of the data volume. Block-level metadata management and consistency in the case of crashes are important problems and hard to address without performance overheads. While Clotho does not currently implement sophisticated metadata consistency mechanisms that could recover the latest data version, such issues and solutions are discussed in more detail in Section 4.2.3. The size of the metadata depends on the size of the encapsulated device and the extent size. In general, Clotho’s metadata are much smaller than the metadata of a typical filesystem and they are stored sequentially on the disk. Thus, lazily synchronizing them to stable storage, or reading them to recover, does not introduce large overheads.

3.2.2 Reducing Metadata Footprint

The three main types of metadata in Clotho are the Logical Extent Table (LXT), the Device Version List (DVL), and the Device Superblock (DSB).

The Logical Extent Table (LXT) is a structure used for logical-to-physical block translation. Clotho presents to the input layer logical block numbers as opposed to the physical block numbers provided by the wrapped device. Note that these block numbers need not directly correspond to actual physical locations, if another block I/O abstraction, such as a volume manager (e.g. LVM [Teigland and Mauelshagen, 2001]), is used as the output layer. Clotho uses the LXT to translate logical block numbers to physical block numbers.

The Device Version List (DVL) is a list of all versions of the output device that are available to higher layers as separate block devices. For every existing version, it stores its version number, the virtual device it may be linked to, the version creation timestamp, and a number of flags.

The Device Superblock (DSB) is a small table containing important attributes of the output versioned device. It stores information about the capacity of the input and output device, the space partitioning, the size of the extents, the sector and block size, the current version counter, the number of existing versions, and other usage counters.

The LXT is the most demanding type of metadata and is conceptually an array indexed by block numbers. The basic block size for most block devices varies between 512 Bytes (the size of a disk sector) and 8 KBytes. This results in large memory requirements. For instance, for 1 TByte of disk storage with 4-KByte blocks, the LXT has 256M entries. In the current version of Clotho, every LXT entry is 128 bits (16 bytes). These include 32 bits for block addressing and 32 bits for versions, which allow for a practically unlimited number of versions. Thus, such an LXT requires about 4 GBytes per TByte of disk storage. Note that a 32-bit address space with 4-KByte blocks can address 16 TBytes of storage.

Figure 3.2: Logical space segments in Clotho (the output layer capacity is split into metadata, the Backup Data Segment, and the Primary Data Segment, the latter being the capacity seen by the input layer).
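To make the 128-bit entry concrete, the following struct is one possible layout consistent with the sizes stated above; the field names, the per-extent version-history link, and the placement of the subextent bitmap are assumptions for illustration, since the exact bit layout of Clotho's entries is not given in the text.

```c
/* One possible 16-byte (128-bit) LXT entry layout; names and the exact
 * field split beyond "32 bits of addressing + 32 bits of version" are
 * illustrative assumptions.                                             */
#include <stdint.h>

#define LXT_NONE 0xFFFFFFFFu     /* "no entry" marker for list links */

struct clt_lxt_entry {
    uint32_t phys_extent;    /* physical (output) extent number            */
    uint32_t version;        /* version this mapping belongs to            */
    uint32_t older;          /* backup-LXT index of the next-older version */
    uint8_t  valid_bitmap;   /* valid-subextent bits (Section 3.2.2)       */
    uint8_t  flags;
    uint16_t reserved;
};

_Static_assert(sizeof(struct clt_lxt_entry) == 16, "LXT entry must be 128 bits");

/* Footprint check: 1 TByte / 4-KByte extents  = 256M entries * 16 B = 4 GBytes,
 *                  1 TByte / 32-KByte extents =  32M entries * 16 B = 512 MBytes
 * (about the 500 MBytes per TByte quoted in Section 3.1).                     */

struct clt_version {         /* Device Version List entry (illustrative) */
    uint32_t version;
    uint32_t linked_minor;   /* virtual device it is linked to, if any   */
    uint64_t timestamp;
    uint32_t flags;
};
```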

To reduce the footprint of the LXT and at the same time increase the addressing range of LXT, we use extents as opposed to device blocks as our basic data unit. An extent is a set of consecutive (logical and physical) blocks. Extents can be thought of as Clotho’s internal blocks, which one can configure to arbitrary sizes, up to several hundred KBytes or a few MBytes. Similarly to physical and logical blocks, we denote extents as logical (input) extents or physical (output) extents. We have implemented and tested extent sizes ranging from 1 KByte to 64 KBytes. With 32-KByte extents and subextent addressing, we need only 500 MBytes of memory per TByte of storage. Moreover with a 32-KByte extent size we can address 128 TBytes of storage.

However, using large extent sizes may result in significant performance overhead. Operating systems support I/O blocks only up to a maximum size (e.g. 4 KBytes in Linux). When the extent size and the operating system block size for Clotho block devices are the same (e.g. 4 KBytes), Clotho receives from the operating system the full extent for which it has to create a new version. When using extents larger than this maximum size, Clotho sees only a subset of the extent for which it needs to create a new version. Thus, it would need to copy the rest of the extent into the new version, even though only a small portion of it is written by the higher system layers. This copy-on-write operation can significantly decrease performance in the common I/O path, especially for large extent sizes, yet large extents are desirable for reducing the metadata footprint.

To address this problem we use subextent addressing, which enables remap-on-write operation on every write. Remap-on-write does not copy the old data to a new extent, but allocates a new, free extent for the data and translates the logical extent there. To solve the problem of partial updates we use a small bitmap in each LXT entry, marking with one bit each valid subextent, that is, each subextent containing valid data. The invalid subextents are considered to contain garbage.

Figure 3.3: Subextent addressing in large extents (a 32-KByte extent with 4-KByte subextents and its valid-subextent bitmap, next to a plain 4-KByte extent; the device block size at input and output is 4 KBytes).

Remap-on-write, enabled by subextent addressing, has much lower overhead than copy-on-write for write operations, since it avoids data copying. However, subextent addressing has a significant implication: it changes the way read operations proceed. During a read, we need to search for the valid subextents in the LXT, both in the current but also in older extents, before translating the read operation to physical extents. Depending on the size of the read request, this may lead to more than one read to the disk. According to our experience and testing with real workloads and file systems, this occurs rarely. However, if this is not acceptable for an application, Clotho can also operate using a copy-on-write algorithm, which may exhibit better behavior in read-intensive workloads.

Another possible approach to reduce memory footprint is to store only part of the metadata in main memory and perform swapping of active metadata to and from stable storage. This solution is orthogonal to subextent addressing and can be combined with it. Active metadata swapping is necessary for large-scale storage systems where the storage capacity is so large that, even with subextent addressing, the metadata footprint exceeds available memory. Support for such functionality in Clotho is left for future work.
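The bitmap manipulation itself is simple; the sketch below shows, under assumed names and an assumed 32-KByte extent with 4-KByte subextents, how a partial write marks its subextents valid and how a read can test whether a given version of an extent fully covers the requested range or whether older versions must also be consulted.

```c
/* Sketch of the valid-subextent bitmap of Figure 3.3 (assumed names;
 * 32-KByte extents with 4-KByte subextents give 8 bits per entry).   */
#include <stdbool.h>
#include <stdint.h>

#define SUBEXTENTS_PER_EXTENT 8u

/* A partial write that touches subextents [first, first + count) marks them
 * valid in the freshly remapped extent; nothing is copied from old data.   */
static inline void clt_mark_valid(uint8_t *bitmap, unsigned first, unsigned count)
{
    for (unsigned i = first; i < first + count && i < SUBEXTENTS_PER_EXTENT; i++)
        *bitmap |= (uint8_t)(1u << i);
}

/* True if this version of the extent holds valid data for the whole
 * requested range; false means the read must also look at older versions
 * of the extent in its version list.                                      */
static inline bool clt_covers(uint8_t bitmap, unsigned first, unsigned count)
{
    uint8_t want = (uint8_t)(((1u << count) - 1u) << first);
    return (bitmap & want) == want;
}
```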

3.2.3 Version Management Overhead

All version management operations can be performed at negligible cost by manipulating in-memory data structures. Creating a new version in Clotho involves simply incrementing the current version counter and does not involve copying any data. When CreateVersion() is called, Clotho stalls all incoming I/O requests for the time required to flush all its outstanding writes to the output layer. When everything is synchronized on stable storage, Clotho increases the current version counter, appends a new entry to the device version list, and creates a new virtual block device that can be used to access the captured version of the output device, as explained later. Since each version is linked to exactly one virtual device, the (OS-specific) device number that sends the I/O request can be used to retrieve the I/O request’s version. The fact that device versioning is a low-overhead operation makes it possible to create flexible versioning policies. Versions can be created by external processes periodically or based on system events. For instance, a user process can specify that it requires a new version every hour, whenever all files on the device are closed, or on every single write to the device. Some of the mechanisms to detect such events, e.g. whether there are any open files on a device, may be (and currently are) implemented in Clotho but could also be provided by other system components.

In order to free backup disk space, Clotho provides a mechanism to delete volume versions. On a DeleteVersion() operation, Clotho first locates the stored volume version that immediately follows (time-wise) the delete candidate. Let us assume that the delete candidate is the version numbered v, while the “next-in-time” version is v′. Clotho then traverses the primary LXT segment and follows the list of older versions for every extent. For every list entry that has a version number equal to v (i.e. the delete candidate), it checks whether there is also a list entry with version number v′. If not, it changes the version number of the v entry to v′, because version v′ needs this extent and it cannot be deleted. If yes, it deletes the entry with version number v from the version list, marks the LXT entry as free, and frees the related physical extents. As with version creation, all the above operations for version deletion are performed in memory and may overlap with regular I/O. DeleteVersion() is provided to the higher layers in order to implement version cleaning policies. Since storage space is finite, such policies are necessary in order to continue versioning without running out of backup storage. The DeleteVersion() operation is also necessary to proceed with writing when both the primary (PDS) and backup (BDS) data segments are full. In this case, Clotho automatically deletes the oldest existing volume version, in order to continue writing to the latest version of the data.
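The following sketch restates the two operations in code over the illustrative in-memory structures introduced earlier; the helper functions are declared but not implemented, and the exact bookkeeping of Clotho's driver (list unlinking, space accounting) is omitted, so this is a model of the algorithm rather than the actual implementation.

```c
/* Model of CreateVersion()/DeleteVersion() over illustrative structures
 * (a trimmed variant of the earlier LXT entry sketch). Helpers are assumed
 * and left unimplemented; list unlinking and space accounting are omitted. */
#include <stddef.h>
#include <stdint.h>

#define LXT_NONE 0xFFFFFFFFu

struct clt_lxt_entry { uint32_t phys_extent, version, older; uint8_t valid_bitmap; };

struct clt_dev {
    uint32_t             current_version;
    struct clt_lxt_entry *primary;      /* 1-1 map for the current version */
    struct clt_lxt_entry *backup;       /* entries of archived versions    */
    size_t               primary_len;
};

void clt_flush_outstanding_writes(struct clt_dev *d);   /* assumed helpers */
void clt_sync_metadata(struct clt_dev *d);
void clt_append_version_entry(struct clt_dev *d, uint32_t v);
void clt_free_backup_entry(struct clt_dev *d, uint32_t idx);

/* CreateVersion(): no data is copied; the counter is bumped once all
 * outstanding writes (and the metadata) have reached stable storage.   */
uint32_t clt_create_version(struct clt_dev *d)
{
    clt_flush_outstanding_writes(d);
    clt_sync_metadata(d);
    clt_append_version_entry(d, ++d->current_version);
    return d->current_version;
}

/* DeleteVersion(v): for every extent, an entry numbered v is renumbered to
 * the next-in-time version v_next if v_next has no entry of its own (it
 * still needs this extent), and freed otherwise.                          */
void clt_delete_version(struct clt_dev *d, uint32_t v, uint32_t v_next)
{
    for (size_t i = 0; i < d->primary_len; i++) {
        uint32_t target = LXT_NONE;
        int has_next = (d->primary[i].version == v_next);

        for (uint32_t j = d->primary[i].older; j != LXT_NONE; j = d->backup[j].older) {
            if (d->backup[j].version == v)      target = j;
            if (d->backup[j].version == v_next) has_next = 1;
        }
        if (target == LXT_NONE)
            continue;
        if (!has_next)
            d->backup[target].version = v_next;   /* v_next inherits the extent  */
        else
            clt_free_backup_entry(d, target);     /* reclaim the physical extent */
    }
}
```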

3.2.4 Common I/O Path Overhead

We consider as the common path for Clotho the I/O path that reads and writes the latest (current) version of the output block device, while versioning occurs frequently. Accesses to older versions are of less importance, since they are not expected to occur as frequently as accesses to the current version. Accordingly, we divide read and write accesses to volume versions into two categories: accesses to the current version and accesses to any previous version. The main technique to reduce common-path overhead is to divide the LXT into two logical segments, corresponding to the primary and backup data segments of the output device, as illustrated in Figure 3.2. The primary segment of the LXT (labeled PLX in the figures) has a number of logical extents equal to that of the input layer, to allow a direct, 1-1 mapping between the logical extents and the physical extents of the current version on the output device. By using a direct, 1-1 mapping, Clotho can locate a physical extent of the current version of a data block with a single lookup in the primary LXT segment, when translating I/O requests to the current version of the versioned device. If the input device needs to access previous versions of a versioned output device, then multiple accesses to the LXT may be required to locate the appropriate version of the requested extent.

To find the physical extent that holds a specific version of the requested block, Clotho first references the primary LXT segment entry to locate the current version of the requested extent (a single table access). Then it uses the linked list that represents the version history of the extent to locate the appropriate version of the requested block. Depending on the type of each I/O request and the state of the requested block, I/O requests can be categorized as follows.

Write requests can only be performed on the current version of a device, since older versions are read-only. Thus, Clotho can locate the LXT entry of a current-version extent with a single LXT access. Write requests can be one of three kinds, as shown in Figure 3.4 (a code sketch of the resulting translation is given after the list):

a. Writes to new, unallocated blocks. In this case, Clotho calls its extent allocator module, which returns an available physical extent of the output device, updates the corresponding entry in the LXT, and forwards the write operation to the output device. The default extent allocation policy in our current implementation is a scan-type (or logging) policy, starting from the beginning of the PDS and proceeding to its end. Free extents are ignored until we reach the end of the device, when we rewind the allocation pointer and start allocating the free extents. We have also implemented other allocation policies, such as first free block and logging with immediate reuse on free.

b. Writes to existing blocks that have been modified after the last snapshot was captured (i.e. their version number is equal to the current version number). In this case Clotho locates the corre- sponding entry in the primary LXT segment with a single lookup and translates the request’s block address to the existing physical block number of the output device. Note that in this case the blocks are updated in place.

c. Writes to existing blocks that have not been modified since the last snapshot was captured (i.e. their version number is lower than the current version number). The data in the existing physical extent must not be overwritten; instead, the new data should be written in a different location and a new version of the extent must be created. Clotho allocates a new LXT entry in the backup segment and swaps the old and new LXT entries, so that the old one is moved to the backup LXT segment. The block address is then translated to the new physical extent address, and the request is forwarded to the output layer. This “swapping” of LXT entries maintains the 1-1 mapping of current-version logical extents in the LXT, which optimizes common-path references to a single LXT lookup.

Figure 3.4: Translation path for write requests.

This write translation algorithm allows for independent, fine-grain versioning at the extent level. Every extent in the LXT is versioned according to its updates from the input level: extents that are updated more often have more versions than extents written less frequently.
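A compact way to see the three cases together is the sketch below, which reuses the illustrative LXT structures introduced earlier; the allocator and backup-entry helpers are assumed, error handling is omitted, and the handling of the subextent bitmap on partial writes is reduced to a reset for brevity.

```c
/* Sketch of the write-path translation of Figure 3.4 over illustrative
 * structures; allocator helpers are assumed, error handling is omitted. */
#include <stdint.h>

#define LXT_NONE 0xFFFFFFFFu
#define EXT_NONE 0xFFFFFFFFu

struct clt_lxt_entry { uint32_t phys_extent, version, older; uint8_t valid_bitmap; };

struct clt_dev {
    uint32_t             current_version;
    struct clt_lxt_entry *primary;
    struct clt_lxt_entry *backup;
};

uint32_t clt_alloc_extent(struct clt_dev *d);        /* logging allocator (assumed) */
uint32_t clt_alloc_backup_entry(struct clt_dev *d);  /* free backup-LXT slot        */

/* Returns the physical extent that the write should go to. */
uint32_t clt_translate_write(struct clt_dev *d, uint32_t logical_extent)
{
    struct clt_lxt_entry *cur = &d->primary[logical_extent];

    if (cur->phys_extent == EXT_NONE) {
        /* (a) first write to this logical extent: allocate and map it */
        cur->phys_extent  = clt_alloc_extent(d);
        cur->version      = d->current_version;
        cur->older        = LXT_NONE;
        cur->valid_bitmap = 0;
        return cur->phys_extent;
    }
    if (cur->version == d->current_version) {
        /* (b) already modified since the last snapshot: update in place */
        return cur->phys_extent;
    }
    /* (c) unmodified since the last snapshot: preserve the old mapping in
     * the backup LXT and remap the logical extent, without copying data.  */
    uint32_t b = clt_alloc_backup_entry(d);
    d->backup[b]      = *cur;                 /* old version is preserved      */
    cur->older        = b;                    /* ... and linked into the list  */
    cur->version      = d->current_version;
    cur->phys_extent  = clt_alloc_extent(d);  /* remap-on-write                */
    cur->valid_bitmap = 0;                    /* set as subextents get written */
    return cur->phys_extent;
}
```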

Read request translation is illustrated in Figure 3.5. First, Clotho determines the desired version of the device from the virtual device name and number in the request (e.g. /dev/clt1-01 corresponds to version 1 and /dev/clt1-02 to version 2). Then, Clotho traverses the version list in the LXT for the specific valid extent and subextent and locates the appropriate physical block. Note that, as mentioned in Section 3.2.2, due to remap-on-write and subextent addressing, it may be necessary to search for the valid subextents in the LXT, both in the current but also in older extents, before translating the read operation to physical extents. Depending on the size of the read request and the placement of valid subextents in the versioned extents, an application read may lead to more than one read operation to the disk. According to our experiments, however, this occurs infrequently, and we found that remap-on-write is a better choice than copy-on-write in the workloads we have measured.

Figure 3.5: Translation path for read requests.

Finally, previous versions of a Clotho device appear as different virtual block devices. Higher layers, e.g. filesystems, can use these devices to access old versions of the data. If the device ID of a read request is different from the normal input layer’s device ID, the read request refers to an extent belonging to a previous version. Clotho determines from the device ID the version of the extent requested. Then, it traverses the version list associated with this extent to locate the backup LXT entry that holds the appropriate version of the logical extent. This translation process is illustrated in Figure 3.5.
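Expressed over the same illustrative structures as before, the per-extent version lookup for a read amounts to the short walk below; it assumes that each extent's version list is ordered from newest to oldest and that an entry's version number records the version the mapping belongs to, and it leaves out the subextent gathering sketched in Section 3.2.2.

```c
/* Sketch of the read-path version lookup of Figure 3.5 (illustrative).
 * Assumes the per-extent version list is ordered newest to oldest.     */
#include <stdint.h>

#define LXT_NONE 0xFFFFFFFFu
#define EXT_NONE 0xFFFFFFFFu

struct clt_lxt_entry { uint32_t phys_extent, version, older; uint8_t valid_bitmap; };

struct clt_dev {
    struct clt_lxt_entry *primary;
    struct clt_lxt_entry *backup;
};

/* want_version is derived from the virtual device that received the read
 * (e.g. /dev/clt1-02 for version 2); the latest version needs one lookup. */
uint32_t clt_translate_read(const struct clt_dev *d,
                            uint32_t logical_extent, uint32_t want_version)
{
    const struct clt_lxt_entry *e = &d->primary[logical_extent];

    if (e->version <= want_version)          /* current mapping is old enough */
        return e->phys_extent;

    for (uint32_t j = e->older; j != LXT_NONE; j = d->backup[j].older)
        if (d->backup[j].version <= want_version)
            return d->backup[j].phys_extent; /* newest entry not newer than wanted */

    return EXT_NONE;                         /* extent did not exist back then */
}
```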

3.2.5 Reducing Disk Space Requirements

Since Clotho operates at the block level, it incurs some space overhead when storing data updates. For instance, if an application modifies a few consecutive bytes in a file, Clotho will create a new version of the full extent that contains the modified data. To reduce this space overhead we provide a differential, content-based compaction mechanism, which we describe next.

Clotho provides the user with the ability to compact device versions and still be able to transparently access them online. The policy decision on when to compact a version is left to higher layers in the system, similarly to all policy decisions in Clotho. We use a form of binary differential compression [Ajtai et al., 2002] to store only the data that has been modified since the last version capture. When CompactVersion() is called, Clotho constructs a differential encoding (or delta) between the extents that belong to a given version and the corresponding extents in its previous version [Ajtai et al., 2002]. Although several differential policies could be applied in this case, such as comparing the content of a specific version with its next version, or with both the previous and the next version, at this stage we only explore diffing with the previous version. Furthermore, although versions can also be compressed independently of differential compression, using algorithms such as Lempel-Ziv encoding [Ziv and Lempel, 1977] or Burrows-Wheeler encoding [Burrows and Wheeler, 1994], this is beyond the scope of our work. We envision that such functionality can be provided by other layers in the I/O device stack.

Figure 3.6: Read translation for compact versions.

The differential encoding algorithm works as follows. When a compaction operation is triggered, the algorithm runs through the backup data segment of the LXT and locates the extents that belong to the version under consideration. If an extent does not have a previous version, it is not compacted. For each of the extents to be compacted, the algorithm locates its previous version, diffs the two extents, and writes the diffs to a physical extent on the output device. If the diff size is greater than a threshold, i.e. diffing is not effective, Clotho discards this pair and proceeds with the next extent of the version to be compacted. In other words, Clotho’s differential compression algorithm works selectively on the physical extents, compacting only the extents whose size can be reduced. The rest are left in their normal format to avoid the performance penalties required for their reconstruction.

Since the compacted form of an extent occupies less space than a whole physical extent, the algorithm stores multiple deltas in the same physical extent, effectively imposing a different structure on the output block device. Furthermore, for compacted versions, multiple entries in the LXT may point to the same physical extent. The related entries in the LXT and the ancestor extent are kept in Clotho’s metadata.

Physical extents that are freed after compaction are reused for storage. Figure 3.6 shows sample LXT mappings for a compacted version of the output layer. Data on a compacted version can be accessed transparently online, just like data on non-compacted volumes (Figure 3.6). Clotho follows the same path to locate the appropriate version of the logical extent in the LXT. To recreate the original, full extent data, Clotho needs the data of the previous version of the logical extent as well as the stored delta. With this information Clotho can reconstruct the requested block and return it to the input driver. We evaluate the related overheads in Section 3.4.

Clotho supports recursive compaction of devices: the next version of a compacted version can itself be compacted. Also, compacted versions can be un-compacted to their original state with the reverse process. A side-effect of the differential encoding concept is that it creates dependencies between two consecutive versions of a logical extent, which affects the way versions are accessed, as explained next. When deleting versions, Clotho checks for dependencies of compacted versions on a previous version and does not delete extents that are required for un-diffing, even if their versions are deleted. These logical extents are marked as “shadow” and are attached to the compacted version. It is left to higher-level policies to decide whether keeping such blocks increases the space overhead to the point where it would be better to un-compact the related version and delete any shadow logical extents.
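The selective nature of the compaction pass can be summarized in a few lines; the delta encoder, the threshold, and the packing of several deltas into one physical extent are shown here only schematically, with assumed names, since the text does not specify the actual encoder or packing format.

```c
/* Schematic of the selective diff-based compaction of Section 3.2.5.
 * The delta encoder and the packing format are assumptions.          */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed delta encoder: writes a delta of `prev` -> `curr` into `out`
 * and returns its size in bytes (out has room for extent_size bytes). */
size_t clt_delta_encode(const uint8_t *prev, const uint8_t *curr,
                        size_t extent_size, uint8_t *out);

/* A physical extent being filled with packed deltas. */
struct clt_delta_pack {
    uint8_t *extent;        /* physical extent buffer */
    size_t   extent_size;
    size_t   used;          /* bytes already occupied */
};

/* Try to compact one extent. Returns the offset of the stored delta inside
 * the pack, or -1 if diffing was not effective (the extent stays as-is) or
 * the pack is full (the caller allocates a new physical extent and retries). */
long clt_compact_extent(struct clt_delta_pack *pack,
                        const uint8_t *prev, const uint8_t *curr,
                        size_t extent_size, size_t threshold, uint8_t *scratch)
{
    size_t n = clt_delta_encode(prev, curr, extent_size, scratch);
    if (n > threshold)
        return -1;                         /* not worth compacting          */
    if (pack->used + n > pack->extent_size)
        return -1;                         /* pack full; start a new extent */
    memcpy(pack->extent + pack->used, scratch, n);
    pack->used += n;
    return (long)(pack->used - n);         /* delta offset kept in metadata */
}
```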

3.2.6 Consistency

One of the main issues in block-device versioning at arbitrary times is the consistency of the stored data. There are three levels of consistency for online versioning.

System state consistency: This refers to the consistency of system buffers and data structures that are used in the I/O path. To deal with this, Clotho flushes all device buffers in the kernel as well as filesystem metadata before creating a volume version. Additionally, modern filesystems, such as XFS, provide user-level support for filesystem “freezing”, where data and metadata flushing occurs while the filesystem is “frozen”. This guarantees that the data and metadata on the block device correspond to a valid point-in-time snapshot of the filesystem; that is, there are no consistency issues in internal system data structures. Clotho’s own metadata are also synchronized to disk at version creation. This allows safe rollback to a previous version in the case of a crash, as discussed in Section 3.2.1.

Open file consistency: When a filesystem is used on top of a versioned device, certain files may be open at the time of a snapshot. Since Clotho is designed as a transparent block device, it cannot deal with this issue directly, because of the limitations of the block-level API. However, in our Linux implementation we have access to higher layers and we provide a mechanism to assist users. When a new version is created, Clotho’s user-level module queries the system for open files on the particular device. If such files are open, Clotho creates a special directory with links to all open files and includes the directory in the archived version. Thus, when accessing older versions, the user can find out which files were open at versioning time.

Application consistency: Applications using the versioned volume may have a specialized notion of consistency. For instance, an application may be using two files that are both updated atomically. If a version is created after the first file is updated and closed but before the second one is opened and updated, then, although no files were open during version creation, the application data may still be inconsistent. This type of consistency cannot be handled transparently without application knowledge or support, and thus is not addressed by Clotho.
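A user-level agent can drive a system-state-consistent snapshot with a short sequence like the one below; the device name and the ioctl command reuse the hypothetical encoding sketched in Section 3.2.1, not Clotho's actual interface, and a filesystem-level freeze/thaw (where the filesystem supports one, e.g. XFS) would bracket the ioctl but is omitted here.

```c
/* User-level sketch of capturing a snapshot; the device name and ioctl
 * command are the hypothetical ones from the earlier sketch.           */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define CLT_IOC_CREATE_VERSION _IOR('c', 1, uint32_t)   /* assumed encoding */

int main(void)
{
    uint32_t version;
    int fd = open("/dev/clt1", O_RDONLY);               /* assumed device name */
    if (fd < 0) { perror("open /dev/clt1"); return 1; }

    sync();                                 /* flush dirty buffers and metadata */
    if (ioctl(fd, CLT_IOC_CREATE_VERSION, &version) < 0) {
        perror("CLT_IOC_CREATE_VERSION");
        close(fd);
        return 1;
    }
    printf("captured version %u\n", version);
    close(fd);
    return 0;
}
```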

3.3 System Implementation

We have implemented Clotho as a block device driver module in the Linux 2.4 kernel and a user-level control utility, in about 6,000 lines of C code. The kernel module can be loaded at runtime and configured for any output layer device by means of an ioctl() command triggered by the user-level agent. After configuring the output layer device, the user can manipulate the Clotho block device depending on the higher layer that they want to use. For instance, the user can build a filesystem on top of a Clotho device with mkfs or newfs and then mount it as a regular filesystem.

Our module adheres to the framework of block I/O devices in the Linux kernel and provides two interfaces to user programs: an ioctl command interface and a /proc interface for device information and statistics. All operations described in the design section to create, delete, and manage versions have been implemented through the ioctl interface and are initiated by the user-level agent. The /proc interface provides information about each device version through readable ASCII files. Clotho also uses this interface to report a number of statistics, including a version’s creation time and time span, the size of data modified since the previous version, and some information specific to compacted versions, such as the compaction level and the number of shadow extents.

The Clotho module uses the zero-copy make_request_fn() mechanism, in the fashion used by LVM [Teigland and Mauelshagen, 2001]. With this mechanism Clotho is able to translate the device driver ID (kdev_t) and the sector address of a block request (struct buffer_head) and redirect it to other devices with minimal overhead. To achieve persistence of metadata, Clotho uses a kernel thread, created at module load time, which flushes the metadata to the output layer at configurable (currently 30-second) intervals.
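For readers unfamiliar with the 2.4 block layer, the fragment below is a schematic of how such a remapping make_request_fn can look, modeled on LVM's approach; it is not Clotho's actual driver code, clotho_translate() stands in for the LXT lookup of Section 3.2.4, and registration of the function with the block layer is omitted.

```c
/* Schematic 2.4-style remapping make_request_fn (not the actual driver).
 * clotho_translate() stands for the LXT lookup and is assumed here.      */
#include <linux/blkdev.h>
#include <linux/fs.h>

/* Maps (input device, sector) to (output device, sector); returns < 0 on error. */
extern int clotho_translate(kdev_t in_dev, unsigned long in_sector, int rw,
                            kdev_t *out_dev, unsigned long *out_sector);

static int clotho_make_request(request_queue_t *q, int rw, struct buffer_head *bh)
{
    kdev_t out_dev;
    unsigned long out_sector;

    if (clotho_translate(bh->b_rdev, bh->b_rsector, rw, &out_dev, &out_sector) < 0) {
        bh->b_end_io(bh, 0);      /* complete the request with an error */
        return 0;                 /* do not resubmit                    */
    }
    bh->b_rdev    = out_dev;      /* redirect, zero-copy: the buffer itself is untouched */
    bh->b_rsector = out_sector;
    return 1;                     /* ask generic_make_request() to resubmit */
}
```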

The virtual device creation uses the partitionable block device concepts of the Linux kernel. A limit of the Linux kernel minor numbering is that there can be at most 255 minor numbers for a specific device and thus only 255 versions can be seen simultaneously as partitions of a Clotho device. However, the number of versions maintained by Clotho can be much larger. To overcome this limitation we have created a mechanism, through an ioctl call, that allows the user to link and unlink on demand any of the available versions to any of the 255 minor-number partitions of a Clotho device. As mentioned, each of these partitions is read-only and can be used as a normal block device, e.g. it can be mounted at a mount point.
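The link/unlink mechanism boils down to a small table from minor numbers to version numbers; the sketch below, with assumed names and the assumption that minor 0 is the writable, latest version, illustrates the idea.

```c
/* Sketch of the minor-number-to-version link table (names and the use of
 * minor 0 for the writable, latest version are assumptions).             */
#include <stdint.h>

#define CLT_MAX_MINORS 255
#define CLT_NO_VERSION 0xFFFFFFFFu

static uint32_t clt_minor_to_version[CLT_MAX_MINORS + 1];

/* Link an archived version to a read-only partition, e.g. so that
 * /dev/clt1-01 exposes version 1 and can be mounted read-only.    */
int clt_link_version(unsigned minor, uint32_t version)
{
    if (minor == 0 || minor > CLT_MAX_MINORS)
        return -1;                              /* minor 0 is the live device */
    clt_minor_to_version[minor] = version;
    return 0;
}

void clt_unlink_version(unsigned minor)
{
    if (minor > 0 && minor <= CLT_MAX_MINORS)
        clt_minor_to_version[minor] = CLT_NO_VERSION;
}
```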

3.4 Experimental Results

Our experimental environment consists of two identical PCs running Linux. Each system has two Pentium III 866 MHz CPUs, 768 MBytes of RAM, an IBM-DTLA-307045 ATA disk with a capacity of 46116 MBytes (2-MByte cache), and a 100 Mbit/s Ethernet NIC. The operating system is Linux 7.1 with the 2.4.18 SMP kernel. All experiments are performed on a 21-GByte partition of the IBM disk. With a 32-KByte extent size, Clotho needs only 10.5 MBytes of memory for our 21-GByte partition.

We evaluate Clotho with respect to two parameters: memory and performance overhead. We use two extent sizes, 4 and 32 KBytes. Smaller extent sizes have higher memory requirements: for our 21-GByte partition, Clotho with a 4-KByte extent size uses 82 MBytes of in-memory metadata, the dirty parts of which are flushed to disk every 30 seconds. We evaluate Clotho using both micro-benchmarks (the publicly available Bonnie++ filesystem I/O benchmark, version 1.02 [Coker], and an in-house developed micro-benchmark) and real-life setups with production-level filesystems. For the real-life setup we run the SPEC SFS V3.0 suite on top of two well-known Linux filesystems, Ext2FS and the high-performance journaling ReiserFS [ReiserFS]. In our results we use the label Disk to denote experiments with the regular disk, without the Clotho driver on top.

Figure 3.7: Bonnie++ throughput for write, rewrite, and read operations (throughput in KBytes/sec vs. block size in KBytes, for Disk and Clotho).

Figure 3.8: Bonnie++ “seek and read” performance (seeks/sec vs. block size in KBytes, for Disk and Clotho).

3.4.1 Bonnie++

We use the Bonnie++ micro-benchmark to quantify the basic overhead of Clotho. The filesystem used in all Bonnie++ experiments is Ext2FS, with Clotho configured with a 4-KByte extent size. The size of the test file is 2 GBytes, with block sizes ranging from 1 KByte to 64 KBytes. We measure accesses to the latest version of a volume with the following operations:

• Block Write: A large file is created using the write() system call.

• Block Rewrite: Each block of the file is read with read(), dirtied, and rewritten with write(), requiring an lseek().

• Block Read: The file is read using a read() for every block.

• Random Seek: Processes running in parallel are performing lseek() to random locations in the file and read()ing the corresponding file blocks.

Figure 3.7 shows that the overhead in write throughput is minimal: the two curves are practically the same. In the read throughput case, Clotho performs slightly better than the regular disk. This is due to the logging disk allocation policy that Clotho uses on writes, which, however, may have a negative effect on reads because it alters the data placement of the filesystem. In the rewrite case, the overhead of Clotho becomes more significant. This is due to the overhead of the random “seek and read” operation, shown in Figure 3.8. Since the seeks in this experiment are random, Clotho’s logging allocation has no effect, and the overhead of translating I/O requests and flushing filesystem metadata to disk dominates. Even in this case, however, the observed overhead is at most 7.5% relative to the regular disk.

3.4.2 SPEC SFS

We use the SPEC SFS 3.0 benchmark suite to measure NFS file server throughput and response time over Clotho. We use one NFS client and one NFS server; the two systems are connected with a switched 100 Mbit/s Ethernet network. We use the following settings: NFS version 3 protocol over UDP/IP, one NFS-exported directory, biod_max_read=2, biod_max_write=2, and requested loads ranging from 300 to 1000 NFS V3 operations/s with an increment step of 100. Both warm-up and run time are 300 seconds for each run, and the time for all the SPEC SFS runs in sequence is approximately 3 hours. As mentioned above, we report results for the Ext2FS and ReiserFS (with the -notail option) filesystems [ReiserFS]. A new filesystem is created before every experiment.

Figure 3.9: SPEC SFS response time using 4-KByte extents (average response time in msec/operation vs. requested load in NFS V3 operations/second, for Disk and Clotho over Ext2FS and ReiserFS, without versioning and with 5- and 10-minute versioning).

Figure 3.10: SPEC SFS throughput using 4-KByte extents (measured throughput in operations/sec vs. requested load in NFS V3 operations/second, for the same configurations as Figure 3.9).

Figure 3.11: SPEC SFS response time using 32-KByte extents with subextents (RFS denotes ReiserFS).

Figure 3.12: SPEC SFS throughput using 32-KByte extents with subextents (RFS denotes ReiserFS).

We conduct four experiments with SPEC SFS for each of the two filesystems: using the plain disk, using Clotho over the disk without versioning, using Clotho with versioning every 5 minutes, and using Clotho with versioning every 10 minutes. Versioning is performed throughout the entire 3-hour run of SPEC SFS. Figures 3.9 and 3.10 show our response time and throughput results for 4-KByte extents, while Figures 3.11 and 3.12 show the results using 32-KByte extents with subextent addressing.

Our results show that Clotho outperforms the regular disk in all cases except ReiserFS without versioning. The higher performance is due to the logging (sequential) block allocation policy that Clotho uses. This explanation is reinforced by the performance in the cases where versions are created periodically. In these cases, frequent versioning prevents disk seeks caused by overwriting of old data, which is now written to new locations on the disk in a sequential fashion. Furthermore, we observe that the more frequent the versioning, the higher the performance. The 32-KByte extent size experiments (Figures 3.11 and 3.12) show that, even with much lower memory requirements, subextent mapping offers almost the same performance as the 4-KByte case. We attribute the small difference to disk rotational latency when skipping unused space to write subextents, whereas with the 4-KByte extent size the extents are written “back-to-back” in a sequential manner.

Finally, comparing the two filesystems, Ext2FS and ReiserFS, we find that the latter performs worse on top of Clotho. We attribute this behavior to Clotho’s altered data layout, which affects reads of sequential files. ReiserFS tries to optimize data placement for reads; however, Clotho’s logging policy underneath places data depending on the sequence of write requests and not on the logical data address. This incurs more disk seeks for reading files that ReiserFS has placed sequentially, and thus performance in the SPEC SFS workload is lower compared to Ext2FS.

3.4.3 Compact Version Performance

Finally, we measure the read throughput of compacted versions to evaluate the space-time trade-off of diff-based compaction. Since old versions are only accessible in read-only mode, we developed a two-phase micro-benchmark that performs only read operations. In the first phase, our micro-benchmark writes a number of large files and captures multiple versions of the data through Clotho. While writing the data, the benchmark is also able to control the amount of similarity between two versions, and thus the percentage of space required by compacted versions. In the second phase, our benchmark mounts a compacted version and performs 2000 random read operations on the files of the compacted version. Before every run, the benchmark flushes the system’s buffer cache.

Figure 3.13: Random “compact-read” throughput (read throughput in KBytes/sec vs. read buffer size in KBytes, for snapshots that are 0%, 25%, 50%, 75%, and 100% packed).

Figure 3.14: Random “compact-read” latency (average response time in msec/operation vs. read buffer size in KBytes, for snapshots that are 0%, 25%, 50%, 75%, and 100% packed).

Figures 3.13 and 3.14 present throughput and latency results for different percentages of compaction. At 100% compaction we save the maximum space on the disk, whereas at 0% compaction we do not save any space at all. The difference in performance is mainly due to the higher number of disk accesses per read operation required for compacted versions. Each such read operation requires two disk reads to reconstruct the requested block: one to fetch the block of the previous version and one to fetch the diffs. In particular, with 100% compaction, each and every read results in two disk accesses and thus performance is about half that of the 0% compaction case.

3.5 Limitations and Future Work

The main limitation of Clotho is that it cannot be layered below or above abstractions that do not serialize versioning commands to virtual devices consisting of multiple physical devices. There are two instances of this issue: (i) when Clotho is layered below a virtual RAID device, or (ii) when Clotho is used at the client-side with shared block devices. If Clotho is layered below a volume abstraction that performs aggregation, policies for creating versions need to perform synchronized versioning across aggregated devices to ensure data consistency. However, this may not be possible in a transparent manner to higher system layers. We deal with this issue in RIBD, presented in Chapter 6, which supports globally consistent versions combined with a lightweight transactional API to higher layers. In the second case, when multiple clients have access to a shared block device, as is usually the case with distributed block devices [Lee and Thekkath, 1996; Thekkath et al., 1997], Clotho cannot be layered on top of the shared volume in each client, since internal metadata will become inconsistent across Clotho instances. A solution to this problem is an interesting topic for future work.

Another limitation of Clotho is that it imposes a change in the block layout from the input to the output layer, due to the remap-on-write operation. In contrast to the copy-on-write approach, which has major write overhead, remap-on-write with a logging policy may affect read performance if free blocks are scattered over the disk or if higher layers rely on a specific block placement, e.g. blocks 0 to x being sequential on the disk. However, this is an issue not only with Clotho, but with any block-level layer that performs logging, such as the systems described in [English and Alexander, 1992; Wang et al., 1999; Wilkes et al., 1996; Sivathanu et al., 2003; Stodolsky et al., 1994]. This issue is essentially related to the limitations of the block-level API and the imbalance in the I/O software stack towards the file system. We believe that as block-level I/O subsystems become more complex and provide more functionality, general solutions to this problem will become necessary.

Finally, a drawback of using large extents for versioning in order to minimize the metadata footprint is that we may waste disk space. Coarse-grain versioning (extents), compared to versioning per sector or per small logical block, requires copying or remapping a whole extent even if only part of it has been modified. This is not an issue related to remap-on-write or copy-on-write, but to the granularity of versioning blocks. Depending on the file system and workload, a tradeoff between the size of the metadata footprint and the versioning granularity should be made to minimize wasted disk space.

3.6 Conclusions

Storage management is an important problem in building future storage systems. Online storage versioning can help reduce management costs directly, by addressing data archival and retrieval costs, and indirectly, by providing novel storage functionality. In this chapter we propose pushing versioning functionality closer to the disk and implementing it at the block-device level. This approach takes advantage of technology trends in building active self-managed storage systems to address issues related to backup and version management.

We present a detailed design of our system, Clotho, that provides versioning at the block level. Clotho imposes small memory and disk space overhead for version data and metadata management by using large extents, subextent addressing, and diff-based compaction. It imposes minimal performance overhead in the I/O path by eliminating the need for copy-on-write, even when the extent size is larger than the disk block size, and by employing logging (sequential) disk allocation. It provides mechanisms for dealing with data consistency and allows for flexible policies for both manual and automatic version management.

We implement our system in the Linux operating system and evaluate its impact on I/O path performance with both micro-benchmarks as well as the SPEC SFS standard benchmark on top of two production-level file systems, Ext2FS and ReiserFS. We find that the common-path overhead is minimal for read and write I/O operations when versions are not compacted. For compact versions, the user has to pay the penalty of double disk accesses for each I/O operation that accesses a compact block.

Overall, our work in block-level versioning has allowed us to explore the usefulness, the mechanisms, and the overheads associated with designing and implementing metadata-intensive functions at the block level. We find that block-level versioning has low overhead and offers advantages over filesystem-based approaches, such as transparency and simplicity. We thus believe that it is worthwhile to add support for extensions with advanced, metadata-intensive functionality at the block level, in order to allow flexible creation of customized, application-specific storage volumes. We present our efforts in this direction in the next chapter.

Violin: A Framework for Extensible Block-level Storage

Life is like playing the violin in public and learning the instrument as one goes on. —Samuel Butler (1835 - 1902)

The content of this chapter roughly corresponds to publication [Flouris and Bilas, 2005].

4.1 Introduction

The quality of the virtualization mechanisms provided by a storage system affects storage management complexity and storage efficiency, both of which are important problems in modern storage systems. Storage virtualization may occur either at the filesystem or at the block level. Although both approaches are currently in use, in this thesis we address virtualization issues at the block level. Storage virtualization at the filesystem level has mainly appeared in the form of extensible filesystems [Heidemann and Popek, 1994; Schermerhorn et al., 2001; Skinner and Wong, 1993; Zadok and Nieh, 2000].

We believe that the importance of block-level virtualization is increasing for two reasons. First, certain virtualization functions, such as compression or encryption, may be simpler and more efficient to provide on unstructured fixed-size data blocks rather than variable-size files. Second, block-level storage systems are evolving from simple disks and fixed controllers to powerful storage nodes [Acharya et al., 1998; Gibson et al., 1998; Gray, 2002] that offer block-level storage to multiple applications over a storage area network [Barker and Massiglia, 2002; Patterson, 2000]. In such systems, block-level storage extensions can exploit the processing capacity of the storage nodes, whereas filesystems (running on the servers) cannot. For these reasons, over time and with the evolution of storage technology, a number of virtualization features, e.g. volume management functions, RAID and snapshots, have moved from higher system layers to the block level.

Today’s operating systems provide flexibility in managing and accessing storage through I/O drivers (modules) in the I/O stack or through the filesystem. For instance, Linux-based storage systems use drivers such as MD [De Icaza et al., 1997] and LVM [Teigland and Mauelshagen, 2001] to support RAID and logical volume management functions. Another example is block-level versioning systems, such as Clotho, presented in Chapter 3.

However, through our experience with Clotho we found that adding more block-level features is limited by the fact that current I/O stacks require the use of monolithic I/O drivers that are both complex to develop and hard to combine into storage volumes with advanced, customized features. That is why current block-level systems offer predefined virtualization semantics without metadata-intensive functions, such as virtual volumes mapped to an aggregation of disks or RAID levels. Both research prototypes [De Icaza et al., 1997; EVMS; GEOM; Lehey, 1999; Teigland and Mauelshagen, 2001] and commercial products [EMC Enginuity; EMC Symmetrix; HP OpenView; Veritas, b,c] belong in this category. In all these cases the storage administrator can tune various parameters at the volume level, but cannot extend the system with new, metadata-intensive functions.

In this chapter we address this problem by providing a kernel-level framework for (i) building and (ii) combining virtualization functions. We propose, implement, and evaluate Violin (Virtual I/O Layer INtegrator), a virtual I/O framework for commodity storage nodes that replaces the current block-level I/O stack with an improved I/O hierarchy that allows for (i) easy extension of the storage hierarchy with new mechanisms and (ii) flexible combining of these mechanisms to create modular hierarchies with rich semantics.

Although our approach shares similarities with work in modular and extensible filesystems [Heidemann and Popek, 1994; Schermerhorn et al., 2001; Skinner and Wong, 1993; Zadok and Nieh, 2000] and network protocol stacks [Kohler et al., 2000; Mosberger and Peterson, 1996; O’Malley and Peterson, 1992; Van Renesse et al., 1995], existing techniques from these areas are not directly applicable to block-level storage virtualization. A fundamental difference from network stacks is that the latter are essentially stateless (except for configuration information) and packets are ephemeral, whereas storage blocks and their associated metadata need to persist. Compared to extensible filesystems, block-level storage systems operate at a different granularity, with no information about the relationships of blocks. Thus, metadata need to be maintained at the block level, resulting potentially in large memory overhead. Moreover, block I/O operations cannot be associated precisely with each other, limiting possible optimizations.

Figure 4.1: Violin in the operating system context.

The main contributions of Violin are: (i) it significantly reduces the effort to introduce new functionality in the block I/O stack of a commodity storage node and (ii) it provides the ability to combine simple virtualization functions into hierarchies with semantics that can satisfy diverse application needs. Violin provides virtual devices with full access to both the request and completion paths of I/Os, allowing for easy implementation of synchronous and asynchronous I/O. Supporting asynchronous I/O is important for performance reasons, but also raises significant challenges when implemented in real systems. Also, Violin deals with metadata persistence for the full storage hierarchy, offloading the related complexity from individual virtual devices. To achieve flexibility, Violin allows storage administrators to create arbitrary, acyclic graphs of virtual devices, each adding to the functionality of the successor devices in the graph. In each hierarchy, blocks of each virtual device can be mapped in arbitrary ways to the successor devices, enabling advanced storage functions, such as dynamic relocation of blocks.

We have implemented Violin as a block device driver under Linux. To demonstrate the effectiveness of our approach in extending the I/O hierarchy we have also implemented various virtual modules as dynamically loadable kernel modules that bind to Violin’s API. We have also developed simple user-level tools that are able to perform on-line fine-grain configuration, control, and monitoring of arbitrary hierarchies of instances of these modules.

We evaluate the effectiveness of our approach in three areas: ease of development, flexibility, and performance. In the first area, we are able to quickly prototype modules for RAID levels, versioning, partitioning and aggregation, MD5 hashing, migration and encryption. In many cases, writing a new module is just a matter of recompiling existing user-level library code. Overall, using Violin encourages the development of simple virtual modules that can later be combined into more complex hierarchies. Regarding flexibility, we are able to easily configure I/O hierarchies that combine the functionality of multiple layers and provide complex high-level semantics that are difficult to achieve otherwise. Finally, we use PostMark and IOmeter to examine the overhead that Violin introduces over traditional block-level I/O hierarchies. We find that, overall, internal modules perform within 10% (throughput) of their native Linux block-driver counterparts.

The rest of the chapter is organized as follows. Sections 4.2 and 4.3 present the design and implementation of Violin. Section 4.4 presents our results. Finally, Section 4.5 discusses limitations and future work and Section 4.6 draws our conclusions.

4.2 System Architecture

Violin is a virtual I/O framework that provides (i) support for easy and incremental extensions to I/O hierarchies and (ii) a highly configurable virtualization stack that combines basic storage layers in rich virtual I/O hierarchies. Violin’s location in the kernel context is shown in Figure 4.1, illustrating the I/O path from the user applications to the disks. There are three basic components in the architecture of Violin:

1. High-level virtualization semantics and mappings.

2. Simple control over the I/O request path.

3. Metadata state persistence.

Next we discuss each of these components in detail.

4.2.1 Virtualization Semantics

A virtual storage hierarchy is generally represented by a directed acyclic graph (DAG). In this graph, the vertices or nodes represent virtual devices. The directed edges between nodes signify device dependencies and the flow of control through I/O requests. Control in Violin flows from higher to lower layers. This arises from the traditional view of the block-level device as a dumb passive device. Each virtual device in the DAG is operated by a virtualization module that implements the desired functionality. Virtual devices that provide the same functionality are handled by different instances of the same module. From now on, we will use the terms module and device interchangeably.

Figure 4.2: Violin’s virtual device graph.

Figure 4.2 shows an example of such a device graph. Graph nodes are represented with horizontal bars illustrating the mappings of their address spaces, and they are connected with directed edges. There are three kinds of nodes and, accordingly, three kinds of I/O modules in the architecture:

• Source nodes that do not have incoming edges and are top-level devices that initiate I/O requests in the storage hierarchy. The requests are initiated by external kernel components such as file systems or other block-level storage applications. Each of the source devices has an external name, e.g. an entry in /dev for Unix systems.

• Sink nodes that do not have outgoing edges and correspond to bottom-level virtual devices. Sink nodes sit on top of other kernel block-level drivers, such as hardware disk drivers and, in practice, are the final recipients of I/O requests.

• Internal nodes that have both incoming and outgoing edges and provide virtualization functions. These nodes are not visible to external drivers, kernel components, or user applications.

Violin uses the above generic DAG representation to model its hierarchies. A virtual hierarchy is defined as a set of connected nodes in the device graph that do not have links to nodes outside the hierarchy. A hierarchy within a device graph is a self-contained sub-graph that can be configured and managed independently of other hierarchies in the same system. Hierarchies in Violin are objects that are explicitly created before adding virtual devices (nodes). To manage devices and hierarchies, users may specify the following operations on the device graph:

• Create a new internal, source, or sink node and link it to the appropriate nodes, depending on its type. The DAG is created in a bottom-up fashion to guarantee that an I/O sink will always exist for any I/O path.

• Delete a node from the graph. A node may be deleted only if its input devices have been deleted (top-down).

• Change an edge in the graph, i.e. remap a device.

Violin checks the integrity of a hierarchy at creation time and each time it is reconstructed. The integrity check applies various simple rules, such as detecting cycles in the hierarchy graph or internal nodes that lack input or output edges. Creating hierarchies and checking dependencies in the framework reduces the complexity of each Violin module. A hierarchy in Violin is constructed with simple user-level tools that implement the above graph operations and link the source and sink nodes to external OS block devices. Currently, the storage administrator has to specify the exact order in which virtual devices will be combined. The user-level tools can also be used online, during system operation, to modify or extend I/O hierarchies. However, it is the administrator’s responsibility to maintain data consistency while performing online structural changes to a hierarchy.
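To make the cycle rule concrete, the following sketch shows one way such a check could be performed with a depth-first walk over the device DAG. The _device_t fields used here (out_devs, num_out, visit_mark) and the function name are assumptions made for illustration; they are not Violin's actual data structures.

enum { UNVISITED = 0, IN_PROGRESS = 1, DONE = 2 };

typedef struct _device_t {
        struct _device_t **out_devs;   /* successor devices (outgoing edges) */
        int                num_out;    /* number of outgoing edges           */
        int                visit_mark; /* UNVISITED, IN_PROGRESS or DONE     */
} _device_t;

/* Returns 1 if a cycle is reachable from dev, 0 otherwise. */
static int vl_has_cycle(_device_t *dev)
{
        int i;

        if (dev->visit_mark == IN_PROGRESS)
                return 1;                      /* back edge: a cycle exists  */
        if (dev->visit_mark == DONE)
                return 0;                      /* sub-graph already checked  */

        dev->visit_mark = IN_PROGRESS;
        for (i = 0; i < dev->num_out; i++)
                if (vl_has_cycle(dev->out_devs[i]))
                        return 1;
        dev->visit_mark = DONE;
        return 0;
}

Running such a check from every source node of a hierarchy (with all marks initially UNVISITED) rejects cyclic configurations before any I/O can be issued.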

Dynamic Block Mapping and Allocation

Nodes in a hierarchy graph do not simply show output dependencies from one device to another but rather map between block address spaces of these devices. As can be seen in Figure 4.2, a storage device in the system represents a specific storage capacity and a block address space, while the I/O path in the graph represents a series of translations between block address spaces. Violin provides transparent and persistent translation between device address spaces in virtual hierarchies.


Figure 4.3: The LXT address translation table.

Devices in a virtual hierarchy may have widely different requirements for mapping semantics. Some devices, such as RAID-0, use simple mapping semantics. In RAID-0 blocks are mapped statically between the input and output devices. There are, however, modules that require more complex mappings. For instance, versioning at the block level requires arbitrary translations and dynamic block remappings, as discussed in Chapter 3. Similarly, volume managers [Teigland and Mauelshagen, 2001] require arbitrary block mappings to support volume resizing and data migration between devices. Another use of arbitrary mappings is to change the device allocation policy, for instance, to a log-structured policy. The Logical Disk [De Jonge et al., 1993] uses dynamic block remapping for this purpose.

Violin supports dynamic block mapping through a logical-to-physical block address translation table (LXT). Figure 4.3 shows an example of such an LXT mapping between the input and output address spaces of a device. Dynamic block mapping capabilities give a virtual device the freedom to use its own disk space management mechanisms and policies, without changing the semantics of its input devices, higher in the I/O stack.

Additionally, Violin provides free-block allocation and management facilities. It offers a variety of disk allocation schemes, including first-available, log-structured, and closer-to-last-block, all of which use either a free list (FL) or a physical block bitmap (PXB) to distinguish between occupied and free physical blocks. However, since modules may need their own space allocation algorithm, Violin allows module code to directly access the PXB data structure as well as its persistent metadata objects (explained below). In this case, however, code and complexity increase for the module developer.
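As an illustration of the simplest of these policies, the sketch below shows a first-available extent allocator over a physical extent bitmap. The pxb_t layout and the helper names are assumptions made for illustration only; Violin's internal PXB representation is not shown in this chapter.

#define EXTENT_FREE 0
#define EXTENT_USED 1

typedef struct {
        unsigned char *bits;      /* one bit per physical extent */
        int            nextents;  /* total number of extents     */
} pxb_t;

static int pxb_test(const pxb_t *pxb, int e)
{
        return (pxb->bits[e / 8] >> (e % 8)) & 1;
}

static void pxb_mark(pxb_t *pxb, int e, int used)
{
        if (used)
                pxb->bits[e / 8] |=  (unsigned char)(1 << (e % 8));
        else
                pxb->bits[e / 8] &= (unsigned char)~(1 << (e % 8));
}

/* First-available policy: return the first free extent and mark it used,
 * or -1 if the device is full. */
static int pxb_alloc_first_available(pxb_t *pxb)
{
        int e;

        for (e = 0; e < pxb->nextents; e++) {
                if (pxb_test(pxb, e) == EXTENT_FREE) {
                        pxb_mark(pxb, e, EXTENT_USED);
                        return e;
                }
        }
        return -1;
}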

The API for dynamic block mapping and allocation is shown in Figure 4.4. Since the LXT and the PXB are addressed in extents (explained below), the library functions take as an argument an extent number, which addresses an LXT or PXB entry. The get/set_lxt_bool_flag_on/off() functions handle binary flag values that can be maintained on the LXT for the extents of a device. This is useful for modules that tag extents according to their properties (e.g. clean/dirty). The vl_phys_address_get/set() functions handle the physical addresses stored in the LXT entries. The vl_phys_extent_alloc() and vl_phys_extent_free() functions operate on the PXB bitmap. The first routine locates a free physical block (according to an allocation algorithm) and marks it on the PXB, while the second frees a used data block and clears its PXB bit. The LXT and PXB data structures are allocated as persistent objects using the persistent metadata calls we describe in Section 4.2.3, and their persistence is handled automatically by Violin similarly to the rest of the hierarchy metadata.

err_code vl_alloc_lxt(_device_t *dev);
err_code vl_alloc_pxb(_device_t *dev);
int get/set_lxt_bool_flag_on/off(int extent, VL_FLAGMASK_TYPE flagmask);
int vl_phys_address_get/set(int extent);
int vl_phys_extent_alloc(int alloc_policy);
void vl_phys_extent_free(int extent);

Figure 4.4: Violin API for LXT and PXB data structures.

An issue with supporting arbitrary block mappings is the size of the LXT and FL or PXB persistent objects, which can become quite large, depending on the capacity and block size of a device. Keeping such objects in memory can increase the memory footprint substantially. Violin offers two solutions. First, it supports an internal block size (“extent size”) for its devices, which is independent of the OS device block size. The extent size, in other words, is a fixed logical block size for the device and is specified when a device is created. Increasing the extent size can greatly reduce the size of the LXT. For example, the LXT of a 1 TByte device with 32-bit extent addressing and a 4 KByte extent size would require about 1 GByte of memory. By increasing the extent size to 32 KBytes, we reduce the required metadata to 128 MBytes. Second, Violin provides the capability to explicitly load and unload persistent metadata to and from memory, as explained in Section 4.2.3.
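As a usage sketch, the fragment below shows how a module's write path might combine the Figure 4.4 calls to implement a simple remap-on-write translation. The prototypes are copied from Figure 4.4, but the allocation-policy constant and the second argument of vl_phys_address_set() are assumptions: the figure abbreviates the get/set prototypes, so the exact signatures may differ.

/* Prototypes as listed in Figure 4.4 (set() assumed to take the new address). */
extern int  vl_phys_address_get(int extent);
extern int  vl_phys_address_set(int extent, int phys_extent);
extern int  vl_phys_extent_alloc(int alloc_policy);
extern void vl_phys_extent_free(int extent);

#define VL_ALLOC_LOG_STRUCTURED 1      /* hypothetical policy identifier */

/* Redirect a logical extent to a freshly allocated physical extent before
 * writing it; returns the new physical extent, or -1 if space ran out. */
static int remap_extent_on_write(int logical_extent)
{
        int old_phys = vl_phys_address_get(logical_extent);
        int new_phys = vl_phys_extent_alloc(VL_ALLOC_LOG_STRUCTURED);

        if (new_phys < 0)
                return -1;

        vl_phys_address_set(logical_extent, new_phys);

        /* Release the previous location, unless (e.g.) a version still pins it. */
        if (old_phys >= 0)
                vl_phys_extent_free(old_phys);

        return new_phys;               /* the write is issued to this extent */
}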

4.2.2 Violin I/O Request Path

The second significant aspect of Violin is how the I/O request path works. Violin is not only reentrant but also supports synchronous and asynchronous I/O requests. I/O requests never block in the framework, unless a driver module has explicitly requested it. Moreover, since Violin is reentrant, it runs in the issuer’s context for each request issued from the kernel. Thus, many requests can proceed concurrently in the framework, each in a different context.

Figure 4.5: Example of request flows (dashed lines) through devices. Forward paths are directed downwards, while return paths upwards.

A generic virtual storage framework must support two types of I/O requests:

• External requests are initiated by the kernel. They enter the framework through the source devices (nodes), traverse a hierarchy through internal nodes usually until they reach a sink node and then return back up the same path to the top of the hierarchy.

• Internal requests are generated from internal devices as a response to an external request. Consider, for example, a RAID-5 module that needs to write parity blocks. The RAID-5 device generates an internal write request to the parity device. Internal requests are indistinguishable from external ones for all but the generating module and are handled in the same manner.

I/O requests in Violin move from source to sink nodes through some path in a virtual hierarchy, as shown in Figure 4.5. Sink devices are connected to external block devices in the kernel, so after a request reaches a sink device it is forwarded to an external device. When multiple output nodes exist, routing decisions are taken at every node, according to its mapping semantics. Virtual devices can control requests beyond simple forwarding. When a device receives an I/O request it can make four control decisions and set corresponding control tags on a request:

• Error Tag indicates a request error. Erroneous requests are returned through the stack to the issuer with an error code.

• Forward Tag indicates that the request should be forwarded to an output device. In this case, the target device and block address must also be indicated. Forwarding occurs in the direction of one of the outgoing graph edges.

• Return Control Path Tag indicates that a device needs return path control over the request. Some devices need to know when an asynchronous I/O request has completed and need control over its return path through the hierarchy (Figure 4.5). For instance, an encryption module requires access to the return path of a read request, because data needs to be decrypted after it is read from the sink device. However, many modules do not require return path control. If no module in a path through the hierarchy requests return path control, the request is merely redirected to another kernel device outside the framework (e.g. the disk driver) and returned directly to the issuer application, bypassing the framework’s return-path control. If, however, some module requests return-path control, the framework must gain control of the request when it completes in another kernel device. In Violin’s Linux implementation, the driver sets an asynchronous callback on the request to gain control when it completes. When the callback receives the completed request, it passes it back up through the module stack in the reverse manner, calling the pop I/O handlers, described later in the API Section 4.2.4. Errors in the return path are handled in a similar manner as in the forward (request) path.

• Complete Tag indicates that a request is completed by this device. Consider the example of a caching module in the hierarchy. A caching device is able to service requests either from its cache or from the lower-level devices. If a requested data block is found in the cache, the device loads the data into the request buffer and sets the “complete” tag, as sketched below. The completed request is not forwarded deeper in the hierarchy, but returns from this point upwards to the issuer, as shown at the right of Figure 4.5 for device C. If the return control path tag is set, the framework passes control to the devices in the stack that have requested it. Using this control tag, an internal device behaves as a new internal sink device.
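The caching example above can be sketched as follows. The request structure, the tag constants and the cache helpers are invented for this illustration; Violin's real request descriptor is not exposed in this chapter.

#define VL_TAG_ERROR        (1 << 0)
#define VL_TAG_FORWARD      (1 << 1)
#define VL_TAG_RETURN_PATH  (1 << 2)
#define VL_TAG_COMPLETE     (1 << 3)

typedef struct vl_request {
        int   tags;         /* control tags set by the handling device      */
        int   target_dev;   /* output device index, when forwarding         */
        int   block;        /* logical block address of the request         */
        void *buf;          /* data buffer associated with the request      */
} vl_request_t;

extern void *cache_lookup(int block);               /* hypothetical cache API */
extern void  copy_block(void *dst, const void *src);

static int cache_read_push(vl_request_t *req)
{
        void *cached = cache_lookup(req->block);

        if (cached) {
                /* Hit: fill the buffer and complete here; the request is not
                 * forwarded deeper in the hierarchy. */
                copy_block(req->buf, cached);
                req->tags = VL_TAG_COMPLETE;
        } else {
                /* Miss: forward to the single output device and ask for
                 * return-path control, so the block can be inserted into the
                 * cache in the read_pop() handler once the read completes. */
                req->target_dev = 0;
                req->tags = VL_TAG_FORWARD | VL_TAG_RETURN_PATH;
        }
        return 0;
}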

A final issue with I/O requests flowing through the framework is dependencies between requests. For instance, there are cases where a module requires an external request to wait for one or more internal requests to complete. To deal with this, when an internal request X is issued (asynchronously), the issuer module may register one or more dependencies of X on other requests (Y, Z, ...) and provide asynchronous callback functions. Requests X, Y, Z are processed concurrently and, as each completes, the callback handler is called. The callback handler of the module then processes the dependent requests according to the desired ordering (e.g. it may wait for all or a few requests to finish before releasing them). This mechanism supports arbitrary dependencies among multiple requests. Cyclic dependencies that lead to deadlock can be caused by erroneous module code; avoiding them is the responsibility of the module developer. One simple such case is when a device sends requests to devices higher in the same I/O path, which can end up back in the same module.

void *vl_persistent_obj_alloc(int oid, int size);
void *vl_persistent_obj_[un]load(int oid);
void vl_persistent_obj_flush(int oid);
void vl_persistent_obj_free(int oid);

Figure 4.6: Violin API for metadata management.

4.2.3 State Persistence

One of the most important issues in storage virtualization is metadata management and persistence. Violin provides metadata management facilities. The three main issues associated with metadata are: facilitating the use of persistent metadata, reducing memory footprint, and providing consistency guarantees.

Persistent Metadata

In Violin, modules can allocate and manage persistent metadata objects of varying sizes using the API summarized in Figure 4.6. To allocate a metadata object, a module calls vl_persistent_obj_alloc() with a unique object ID and the object’s size, as it would allocate memory. Modules access metadata directly in memory, as any other data structure. However, since device metadata are loaded dynamically when the device is initialized, pointers may not point to the right location. There are two methods for addressing this issue: (i) disallow pointers in metadata accesses and (ii) always force loading of metadata to the same address. Since the latter can be a problem for large metadata objects and, in our experience, the former is not overly restrictive, in our current design we disallow pointers in metadata objects. However, this issue requires further investigation and is left for future work. When a module needs to destroy a persistent object it calls the vl_persistent_obj_free() function.

By default, when a virtual device is created in Violin, the system stores its metadata in the same output address space as the data. Alternatively, the user may explicitly specify a metadata device that will be used for storing persistent metadata for the new virtual device. This can be very useful when metadata devices are required to have stronger semantics compared to regular data devices, e.g. a higher degree of redundancy.

Violin’s metadata manager automatically synchronizes dirty metadata from memory to stable storage in an asynchronous manner. The metadata manager uses a separate kernel thread to write to the appropriate device the metadata that are loaded in memory. The user can also flush a metadata object explicitly with the vl_persistent_obj_flush() call. Internally, each metadata object is represented with an object descriptor, which is modified only by allocation/deallocation calls. During these calls metadata object descriptors are stored in a small index at the beginning of the metadata device. A pointer to the metadata header index is stored with each virtual device in the virtual hierarchy. Thus, when the virtual hierarchy is being loaded and recreated, Violin reads the device metadata and loads them into memory.
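A sketch of how a module's initialize() method might use these calls follows. Only the vl_persistent_obj_* prototypes come from Figure 4.6; the object ID, the metadata layout and the integer return convention are assumptions for illustration.

/* Prototypes from Figure 4.6. */
extern void *vl_persistent_obj_alloc(int oid, int size);
extern void  vl_persistent_obj_flush(int oid);

typedef struct _device_t _device_t;     /* opaque Violin device handle   */

#define MYMOD_META_OID 42               /* hypothetical unique object ID */

struct mymod_meta {
        int extent_size;
        int version_counter;
};

static int mymod_initialize(_device_t *dev)
{
        struct mymod_meta *m;

        /* Allocated once, at device creation; afterwards the framework
         * reloads the object automatically whenever the hierarchy loads. */
        m = vl_persistent_obj_alloc(MYMOD_META_OID, sizeof(*m));
        if (m == NULL)
                return -1;

        m->extent_size = 32 * 1024;
        m->version_counter = 0;

        /* Optionally push the initial state to stable storage right away;
         * otherwise the metadata manager flushes it asynchronously. */
        vl_persistent_obj_flush(MYMOD_META_OID);
        return 0;
}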

Reducing Memory Footprint

Large metadata objects may consume a large amount of main memory or incur large amounts of I/O during metadata synchronization to stable storage. In these cases reducing the metadata memory footprint is important, in order to keep overheads low. Violin provides API calls for explicitly loading and unloading metadata to and from memory within each module (Figure 4.6). Another possible solution is to transparently swap metadata to stable storage. This solution is necessary for extremely large devices, where the size of the required metadata can sometimes exceed the available memory of a machine. Since it is not clear whether such modules actually need to be implemented, Violin does not support this feature yet.

Violin retains a small amount of internal metadata that represent the hierarchy structure and are necessary to reconstruct the device graph. Each graph node and edge requires a pointer, which amounts to a few bytes per virtual device. The hierarchy metadata is saved in all physical devices within a hierarchy. For this purpose, Violin reserves a superblock of a few KBytes in every physical device of a hierarchy. A complete hierarchy configuration can thus be loaded from the superblock of any physical device that belongs to the hierarchy.

In summary, each module needs to allocate its own metadata, and in the common case, where metadata can be kept in memory, it does not need to perform any other operation. The framework will keep metadata persistent and will reload the metadata from disk each time the virtual device is loaded, significantly simplifying metadata management in individual modules.

Metadata Consistency

In the event of system failures, where a portion of the in-memory metadata may be lost or partially written, application and/or Violin state may be corrupted. We can define the following levels of metadata consistency:

1. Lazy-update consistency, that is, metadata are synchronized to disk, overwriting the older version, every few seconds. This means that if a failure occurs between or during updates of metadata, then the on-disk metadata may be left inconsistent and Violin may not be able to recover. In this case, there is a need for a Violin-level recovery procedure (similar to fsck), which, however, we do not currently provide. This level of consistency would be sufficient when systems are expected to crash infrequently (e.g. when they are considered stable and are supported by an uninterruptible power supply (UPS)). If, however, a system is expected to crash often and/or stronger consistency guarantees are required, then one of the next forms of consistency may be used instead.

2. Shadow-update consistency, where we use two metadata copies on disk and maintain at least one of the two consistent at all times (a sketch of this scheme appears at the end of this section). If, during an update, the copy that is currently being written becomes inconsistent due to a failure, Violin uses the second copy to recover. In this case, it is guaranteed that Violin will recover the device hierarchy and all its persistent objects and will be able to service I/O requests. However, application data may be inconsistent with respect to system metadata.

3. Atomic metadata consistency guarantees that after a failure the system will be able to access a previous, consistent state of application data and system metadata. This level of consistency can be implemented using a roll-forward log (journal) [Hagmann, 1987] or a roll-back versioning scheme. In the former case the log is played forward for recovery, while the latter is equivalent to a rollback of metadata to a previous point in time. The second option can be achieved in Violin using a versioning layer similar to Clotho (Chapter 3) at the leaves of a hierarchy. Although such a layer is available in Violin, given its current implementation we would need to modify it slightly so that its own metadata are handled differently in this particular case.

-> initialize (_device_t *dev, ...);
-> open (_device_t *dev, void *data);
-> close (_device_t *dev, void *data);
-> read_push (_device_t *dev, ...);
-> read_pop (_device_t *dev, ...);
-> write_push (_device_t *dev, ...);
-> write_pop (_device_t *dev, ...);
-> ioctl (_device_t *dev, int cmd, ...);
-> resize (_device_t *dev, int new_size);
-> device_info (_device_t *dev, ...);

Figure 4.7: API methods for Violin’s I/O modules.

Violin currently supports the first and second forms of metadata consistency. We expect that all three forms of consistency will be available in the future versions of the framework code.
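The shadow-update scheme (the second level above) can be sketched as follows: two fixed on-disk copies, each stamped with a sequence number and a checksum, are written alternately so that at least one copy is always valid. The layout, the helper functions and the recovery rule shown here are illustrative assumptions, not Violin's on-disk format.

#include <stdint.h>

struct meta_copy {
        uint64_t seq;        /* monotonically increasing update counter   */
        uint32_t checksum;   /* computed over the metadata payload only   */
        /* ... metadata payload follows ...                               */
};

/* Hypothetical helpers: read/write one of the two fixed copy locations. */
extern int      read_copy(int slot, struct meta_copy *out);
extern int      write_copy(int slot, const struct meta_copy *in);
extern uint32_t checksum_payload(const struct meta_copy *c);

/* Write the new state into the slot that does NOT hold the latest copy,
 * so a crash mid-write leaves the other copy intact. */
static int shadow_update(struct meta_copy *state, uint64_t last_seq, int last_slot)
{
        state->seq = last_seq + 1;
        state->checksum = checksum_payload(state);
        return write_copy(1 - last_slot, state);
}

/* On recovery, pick the copy with a valid checksum and the highest seq. */
static int recover_slot(struct meta_copy copies[2])
{
        int i, best = -1;

        for (i = 0; i < 2; i++) {
                if (read_copy(i, &copies[i]) == 0 &&
                    copies[i].checksum == checksum_payload(&copies[i]) &&
                    (best < 0 || copies[i].seq > copies[best].seq))
                        best = i;
        }
        return best;        /* -1 only if both copies are damaged */
}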

4.2.4 Module API

Extending an I/O hierarchy with new functionality is an arduous task in modern kernels. The interface provided by kernels for block I/O is fairly low-level. A block device driver has the role of servicing block I/O read and write requests. Block requests adhere to the simple block I/O API, where every request is denoted as a tuple of (block device, read/write, block number, block size, data). In Linux, this API involves many tedious and error-prone tasks, such as I/O request queue management, locking and synchronization of the I/O request queues, buffer management, translating block addresses and interfacing with the buffer cache and the VM subsystem.

Violin provides its modules with high-level API calls that intuitively support its hierarchy model and hide the complexity of kernel programming. The author of a module must set up a module object, which consists of a small set of variables with the attributes of the implemented module and a set of methods or API functions, shown in Figure 4.7. The variables of the module object in our current implementation include various static pieces of information about each module, such as the module name and id, the minimum and maximum number of output nodes supported by the device, and the number of non-read/write (ioctl) operations supported by the module. The role of each method prototype is:

• initialize() is called once, the first time a virtual device is created, to allocate and initialize persistent metadata and module data structures for this device. Note that since Violin retains state in persistent objects, this method will not be called when a hierarchy is loaded, since device state is also loaded.

• open(), close() are called when a virtual device is loaded or unloaded during the construction of a hierarchy. The module writer can use the close() call to perform garbage collection and other necessary tasks for shutting down the device.

• read(), write() handle I/O requests passed to a virtual device. Each has two methods, push and pop, that represent the different I/O paths. Push handlers are called in the forward path, while pop handlers are called in the return path, if requested. These API methods are higher-level compared to their kernel counterparts and do not require complex I/O buffer management and I/O queue management.

• ioctl() handles custom commands and events in a module, in case direct access to it is required. This is used in some of our modules, for example to explicitly trigger a version capture event in the versioning module.

• resize() is an optional method that specifies a new size for an output virtual device. When this method is called in a virtual device, it means that one of its output devices has changed size as specified and thus, the virtual device has to adjust all internal module metadata appropriately.

• device info() is used to export runtime device information in human or machine readable form.

The main difference of this API from the classic block-driver API in Unix systems is the distinct push() and pop() methods for the read and write operations, versus a single read() and write() in the classic API. This is the key feature of Violin’s API that allows asynchronous behavior. Finally, the role of the initialize() call is different from the classic API: since state persistence is maintained, device initialization is necessary only once in a device’s lifetime, at its creation.
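As an illustration, a minimal pass-through module built against the Figure 4.7 method names might look as follows. The vl_module structure, the request argument type, the registration details and the integer return convention are assumptions; Figure 4.7 abbreviates the trailing arguments of the prototypes.

typedef struct _device_t  _device_t;    /* opaque Violin device handle   */
typedef struct vl_request vl_request_t; /* opaque request handle         */

struct vl_module {                      /* hypothetical module object    */
        const char *name;
        int         min_outputs, max_outputs;
        int  (*initialize)(_device_t *dev);
        int  (*open)      (_device_t *dev, void *data);
        int  (*close)     (_device_t *dev, void *data);
        int  (*read_push) (_device_t *dev, vl_request_t *req);
        int  (*write_push)(_device_t *dev, vl_request_t *req);
};

static int pt_initialize(_device_t *dev)                    { return 0; }
static int pt_open (_device_t *dev, void *data)             { return 0; }
static int pt_close(_device_t *dev, void *data)             { return 0; }

/* The push handlers do nothing: the framework forwards the request
 * unchanged to the module's single output device. No pop handlers are
 * registered, so the return path bypasses this module entirely. */
static int pt_read_push (_device_t *dev, vl_request_t *req) { return 0; }
static int pt_write_push(_device_t *dev, vl_request_t *req) { return 0; }

static struct vl_module pass_through_module = {
        .name        = "pass-through",
        .min_outputs = 1,
        .max_outputs = 1,
        .initialize  = pt_initialize,
        .open        = pt_open,
        .close       = pt_close,
        .read_push   = pt_read_push,
        .write_push  = pt_write_push,
};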


Figure 4.8: Binding of extension modules to Violin.

4.3 System Implementation

We have implemented Violin as a loadable block device driver in the Linux 2.4 kernel, accompanied by a set of simple user-level management tools. Our prototype fully implements the I/O path model described in Section 4.2.4. Violin extension modules are implemented as separate kernel modules that are loaded on demand. However, they are not full Linux device drivers themselves but require the framework’s core. Upon loading, each module registers with the core framework module, binding its methods to internal module objects as shown in Figure 4.8.

A central issue with the Violin driver is the implementation of its asynchronous API on Linux 2.4. Next, we explain how I/O requests are processed in Linux block device drivers and then describe Violin’s implementation.

4.3.1 I/O Request Processing in Linux

The Linux kernel uses an I/O request structure to schedule I/O operations to block devices. Every block device driver has an I/O request queue, containing all the I/O operations intended for this block device. Requests are placed in this queue by the kernel function make_request_fn(). This function can be overridden by a block driver’s own implementation. Requests are removed from the queue and processed with the strategy or request function of a block driver. When a request is complete, the driver releases the I/O request structure back to the kernel after tagging it as successful or erroneous. If a released request is not appropriately tagged or the data is incorrect, the kernel may crash.

Thus, there are two ways of processing I/O requests in a Linux device driver. First, a device driver can use a request queue as above, where the kernel places requests and the strategy driver function removes and processes them. This approach is used by most hardware block devices. Second, a driver may override the make_request_fn() kernel function with its own version and thus gain control of every I/O request before it is placed in the request queue. This mode of operation is more appropriate for virtual I/O drivers. In this mode, the driver does not use an I/O request queue, but rather redirects requests to other block devices. This is the mode of operation used by many virtual I/O drivers, such as LVM and MD. When a new I/O request arrives, the device driver performs simple modifications to the I/O request structure, such as modifying the device ID and/or the block address, and returns the request to the kernel, which will redirect it to the new device ID.

Figure 4.9: Path of a write request in Violin.
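The remapping technique described above can be sketched against the Linux 2.4 block-layer interfaces roughly as follows. The mapping helper vl_map_block() is hypothetical, and only the general shape of the make_request_fn override and the buffer_head fields follow the 2.4 conventions, so details may differ from the kernel version actually used.

#include <linux/blkdev.h>

/* Hypothetical helper: translate (device, sector) through the hierarchy. */
extern int vl_map_block(kdev_t in_dev, unsigned long in_sector,
                        kdev_t *out_dev, unsigned long *out_sector);

/* Registered with blk_queue_make_request() at driver initialization. */
static int vl_make_request(request_queue_t *q, int rw, struct buffer_head *bh)
{
        kdev_t target;
        unsigned long sector;

        if (vl_map_block(bh->b_rdev, bh->b_rsector, &target, &sector) < 0) {
                bh->b_end_io(bh, 0);   /* complete the request with an error */
                return 0;              /* 0: the kernel must not resubmit it */
        }

        /* Redirect the buffer head to the mapped device and sector. */
        bh->b_rdev = target;
        bh->b_rsector = sector;

        /* A non-zero return asks generic_make_request() to resubmit the
         * (modified) request to the queue of the new target device. */
        return 1;
}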

4.3.2 Violin I/O path

Violin uses the method of replacing the make_request_fn() call and operates asynchronously without blocking inside this function. For every external I/O request received through the make_request_fn() call, Violin’s device driver traverses the I/O stack and calls the necessary module I/O handlers (read_push() or write_push(), depending on the type of request). The I/O path traversal for a write request can be seen in Figure 4.9. Each handler processes the incoming request (modifying its address, data, or both) and then returns control to the push_device_stack() function, indicating the next device in the path from one of its dependent (or lower) devices, marked with outward arrows. Thus, the push_device_stack() function passes the request through all the devices in the path. When the request reaches an exit node, the driver issues an asynchronous request to the kernel to perform the actual physical I/O to another kernel driver (e.g. the disk driver). When return path control is needed, the driver uses callback functions that Linux attaches to I/O buffers. Using these callbacks the driver is able to regain control of the I/O requests when they complete in another driver’s context and perform all necessary post-processing.

A Linux-specific issue that Violin’s kernel driver must deal with is handling buffer cache data that need module processing and modification (e.g. encryption). The Linux kernel I/O request descriptors contain pointers to memory pages with the data, mapped by the buffer cache and other kernel layers, e.g. the filesystem, to disk blocks. While an I/O request is in progress, the memory page remains available for read access in the buffer cache. Data-modifying layers, e.g. encryption, however, need to transform the data on the page before writing it to the disk. Data modification on the original memory page would result in the corruption of the filesystem and buffer cache buffers. To resolve this problem with data-modifying modules, Violin creates a copy of the memory page, where modules can modify the data, and writes the new page to the disk. The original page remains available in the buffer cache.
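A conceptual sketch of this copy-before-modify step is shown below. The transformation routine and the page-sized copy are illustrative assumptions; Violin's actual driver operates on Linux 2.4 buffer heads, which are not shown here.

#include <linux/mm.h>        /* __get_free_page(), free_page(), GFP_KERNEL */

/* Hypothetical module-provided transformation (e.g. Blowfish encryption). */
extern void encrypt_block(void *dst, const void *src, unsigned long len);

/* Encrypt into a private copy of the page; the original page stays intact
 * in the buffer cache, and the returned copy is what gets written to disk. */
static void *vl_clone_and_transform(const void *orig_data, unsigned long len)
{
        void *copy = (void *)__get_free_page(GFP_KERNEL);

        if (copy == NULL)
                return NULL;

        encrypt_block(copy, orig_data, len);
        return copy;   /* released with free_page() after the write completes */
}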

4.4 Evaluation

In this section we evaluate the effectiveness of Violin in three areas: ease of development, flexibility, and performance.

4.4.1 Ease of Development

Measuring development effort is a hard task, since there exist no widely accepted metrics for it. We attempt to quantify the development effort by looking at the code size reduction that we can achieve using Violin. Similar to FiST [Zadok and Nieh, 2000], we use code size as a metric to examine relative complexity. We compare the code sizes of various block device drivers that exist for Linux with implementations of these modules under Violin:

• RAID: Implements RAID levels 0 and 1. Failure detection and recovery are also implemented for RAID level 1. Using Violin’s ability to synthesize complex hierarchies, these modules can be used to create composite RAID levels, such as 1+0.

• Versioning: Implements the on-line block-level versioning functionality of Clotho in a Violin hierarchy. The design of Clotho was presented in Chapter 3.

• Partitioning: Creates a partition on top of another device. It supports partition tables and partition resizing.

• Aggregation: Functions as a volume group manager. It aggregates many devices into one, either appending each after the other or using striping, and allows resizing of the individual or aggregate device. Also, devices can be added or removed and data can be migrated from one device to another.

Virtualization Layers / Functions   |  Linux Driver   |  Violin Module
------------------------------------|-----------------|---------------
RAID                                |  11223 (MD)     |  2509
Partition & Aggregation             |   5141 (LVM)    |  1521
Versioning                          |   4770 (Clotho) |   809
MD5 Hashing                         |      –          |   930
Blowfish Encryption                 |      –          |   804
DES & 3DES Encryption               |      –          |  1526
Migration                           |      –          |   422
Core Violin Framework               |  14162          |     –

Table 4.1: Linux drivers and Violin modules in kernel code lines.

• MD5 hashing: Computes the MD5 fingerprint for each block. Fingerprints are inserted in a hash table and can be queried. Its purpose is to offer content-based search capabilities and facilitate differential compression.

• Encryption: Encrypts/decrypts data blocks, using a user-specified key. We have currently imple- mented the Blowfish, DES and 3DES encryption algorithms.

• Online Migration: Transparently migrates data from one device to another, at a user-controlled speed (throttling).

Table 4.1 shows the code size for each functional layer implemented as a driver under Linux and as a module under Violin. The features implemented in Violin modules are close to the features provided by the block drivers under Linux, including LVM. The only exception is the Linux MD driver [De Icaza et al., 1997]: our implementation, although it includes code for failure detection and recovery in RAID, does not support spare disks in a RAID volume. The main reason for not adding this functionality is that we expect it to be handled in Violin by remapping failed disks to spare ones.

We see that the size of the Violin framework core is about 14200 lines of code. Looking at individual modules, in the case of LVM code length is reduced by a factor of three, while for MD and Versioning the reduction is about a factor of four and six, respectively. In the RAID case, more than half of the Violin module code (1300 lines) is copied from MD and is used for fast XOR computation. This piece of code is written mainly in assembly and tests various XOR implementations for speed in order to select the fastest one for each system. If we compute code-length differences without this copied code, the difference for the RAID module reaches an order of magnitude.

Clearly, code size cannot be used as a precise measure of complexity. However, it is an indication of the reduction in effort when implementing new functionality under Violin. For the modules we implement, we can attribute the largest code reductions to the high-level module API and to the persistent metadata management support.

To examine how new functionality can be provided under Violin, we implement two new modules for which we could not find corresponding Linux drivers: MD5 hashing and DES-type encryption. For each module we use publicly available user-level code. Table 4.1 shows that MD5 hashing is about 900 lines of code in Violin, out of which 416 lines are copied from the user-level code. Similarly, our DES encryption module uses about 1180 lines of user-level code, out of a total of 1526 lines. Creating each module was an effort of just a few hours.

4.4.2 Flexibility

To demonstrate the flexibility of the proposed framework we show how one can create more complex multi-layered hierarchies by combining simple modules. Providing the same functionality with standard OS mechanisms is much more difficult. In this demonstration, we construct a complex hierarchy of virtual devices over a set of physical devices, adding one layer at a time. Figure 4.10 depicts how such a hierarchy is built in a bottom-up fashion using simple user-level tools. Initially two disks are inserted in the hierarchy as sink devices. Next, a partitioning layer is added, which creates two partitions, one on each disk. Then, a RAID-0 layer creates a striped volume on top of the two partitions. Next, a versioning layer is added on top of the striped volume to provide online versioning. Finally, a Blowfish encryption layer is added in the hierarchy to encrypt data. While this module can be placed anywhere in the hierarchy, we chose to place it directly above the disks. Overall, we find that Violin simplifies the creation of complex hierarchies that provide advanced functionality, once the corresponding modules are available.

[Hierarchy diagram: a Versioning device on top of a RAID-0 device, which stripes over two Partition devices, each created on a Blowfish encryption device over a Disk.]

#SCRIPT TO CREATE THE HIERARCHY:
#Initialize a hierarchy named "voila"
hrc_init /dev/vld voila 32768 1
#Import two IDE disks: /dev/hdb and /dev/hdc
dev_import /dev/vld /dev/hdb voila hdb
dev_import /dev/vld /dev/hdc voila hdc
#Create a Blowfish Encryption Layer on each disk
dev_create /dev/vld Blowfish voila BF_Dev_1 1 hdb BF_Dev_1 0
dev_create /dev/vld Blowfish voila BF_Dev_2 1 hdc BF_Dev_2 0
#Create two Partitions on top of these
dev_create /dev/vld Partition voila Part_Dev_1 1 BF_Dev_1 -1 900000 Part_Dev_1 0
dev_create /dev/vld Partition voila Part_Dev_2 1 BF_Dev_2 -1 900000 Part_Dev_2 0
#Create a RAID-0 device on the 2 partitions
dev_create /dev/vld RAID voila RAID0_Dev 2 Part_Dev_1 Part_Dev_2 0 32 RAID0_Dev 0
#Create a Versioned device on top
dev_create /dev/vld Versioning voila Version_Dev 1 RAID0_Dev 40 Version_Dev 200000
#Link hierarchy "voila" to /dev/vld1
dev_linkpart /dev/vld voila Version_Dev 1
#Make a filesystem on /dev/vld1 and mount it
mkfs.ext2 /dev/vld1
mount -t ext2 /dev/vld1 /vldisk

Figure 4.10: A hierarchy with advanced semantics and the script that creates it.


Figure 4.11: Raw disk throughput (MBytes/sec) for sequential (left) and random (right) workloads.

4.4.3 Performance

To evaluate the performance of Violin we use two well-known I/O benchmarks, Iometer [Iometer] and PostMark [Katcher]. Iometer is a benchmark that generates and measures I/O workloads with various parameters. PostMark is a synthetic benchmark that emulates the operation of a mail server. PostMark runs on top of a filesystem, while Iometer runs on raw block devices.

The system we use in our evaluation is a commodity x86-based Linux machine, with two AMD Athlon MP 2200+ CPUs, 512 MB RAM, three Western Digital WDC WD800BB-00CAA1 ATA Hard Disks with 80 GB capacity, 2MB cache, and UDMA-100 support. The OS is Red Hat Linux 9.0, with RedHat’s latest 2.4.20-31.9smp kernel.

First, we examine the performance of single Violin modules compared to their Linux driver counter- parts. We use three different configurations: raw disk, RAID, and logical volume management. In each configuration we use layers implemented as a block driver under Linux and as modules under Violin:

A. The raw Linux disk driver (Disk) vs. a pass-through module in Violin (Disk Module). The same physical disk partition is used in both cases.


Figure 4.12: LVM throughput (MBytes/sec) for sequential (left) and random (right) workloads.

B. The Linux Volume Manager driver (LVM) vs. a dual module setup in Violin using the partitioning and aggregation modules (Aggr.+Part. Modules). In the Violin setup we use two disks that are initially aggregated into a single resizable striped volume. On top of this large volume we create a partition where the filesystem is created. In LVM we use the same setup. We initialize the same disks into physical volumes, create a volume group consisting of the two disks, and finally create a striped logical volume on top. The size of the stripe block is 32 KBytes in both setups.

C. The Linux MD driver (MD RAID-0 & 1) vs. the Violin RAID module (RAID-0 & 1 Module). Both modules support RAID levels 0, 1, and 5. We compare performance of RAID levels 0 (striping) and 1 (mirroring) for MD and Violin. In both setups we use the same two disks and a stripe block of 32 KBytes for RAID-0. Note that the system setup in the RAID-0 case is identical to the LVM experiments, with the exception of different software layers being used.

Iometer

Iometer [Iometer] generates I/O workloads based on specified parameters (e.g. block size, randomness) and measures them on raw block devices. We vary three parameters in our workloads: (i) Access pattern: we use workloads that are either sequential or random. (ii) Read-to-write ratio: we explore three ratios, 100% reads, 100% writes, and a mix of 70% reads and 30% writes. (iii) Block size: we use block sizes from 512 bytes to 64 KBytes.

Figure 4.13: RAID-0 throughput (MBytes/sec) for sequential (left) and random (right) workloads.

Figures 4.11-4.14 summarize our results with Iometer version 2004.07.30. Figure 4.11 compares the performance of Violin’s pass-through disk module to the plain system disk. Figures 4.12, 4.13 and 4.14 compare the performance of Violin modules to their kernel counterparts, LVM and MD RAID levels 0 and 1. The left-column graphs represent sequential I/O performance, while the right graphs show random I/O performance. In most cases we observe that the performance difference between modules of similar functionality is less than 10%. In some cases Violin performs slightly better than the kernel drivers, and in others slightly worse.

The larger performance differences observed are: (i) Low sequential performance of the LVM and Violin modules for block sizes smaller than 4 KB. This is due to the default block size of their device drivers, which is set to 4 KB, compared to 1 KB for the raw disk and the MD driver. This explains the steep performance increase for Violin and LVM that is consistently observed between block sizes of 2 KB and 4 KB. Using a smaller system block size in Violin is possible, but requires changes to the current Violin code. (ii) MD has higher performance for random write operations, both in RAID-0 and RAID-1 modes.


Figure 4.14: RAID-1 throughput (MBytes/sec) for sequential (left) and random (right) workloads.

This is due to MD’s optimized buffer cache management, which caches writes and can more effectively cluster write requests. This is mainly a Linux implementation issue and we consider improving buffer cache management in Violin’s driver as future work.

PostMark results

PostMark [Katcher] creates a pool of continually changing files on a filesystem and measures the transaction rate for a workload approximating a large Internet electronic mail server. Initially a pool of random text files is generated, ranging in size from a configurable low bound to a configurable high bound. Once the pool has been created, a specified number of transactions occurs. Each transaction can be one of two pairs of I/O operations: (i) create file or delete file and (ii) read file or append file. Each transaction type and its affected files are chosen randomly. When all of the transactions have completed, the remaining active files are deleted.

To evaluate the performance of each configuration, we use a workload consisting of: (i) files ranging from 100 KBytes to 10 MBytes in size, (ii) an initial pool of 1000 files, and (iii) 3000 file transactions with equal biases for read/append and create/delete. Each run is repeated five times on the same filesystem to account for filesystem staleness and the results are averaged over the five runs. During each run over 2500 files are created, about 8.5 GBytes of data are read and more than 15 GBytes of data are written.


Figure 4.15: PostMark results: Total runtime


Figure 4.16: PostMark results: Read throughput

Figures 4.15, 4.16 and 4.17 show our PostMark results. In each graph and configuration, the light bars depict the kernel device driver measurements, while the darker bars show the numbers for the Violin modules. The bars have been grouped according to each system configuration and PostMark experiment. In all cases except RAID-1, there is a small difference, less than 10%, between the two systems, Linux driver vs. Violin module. In the RAID-1 case there is a large difference of about 30%, due again to the better buffer-cache management of MD. As mentioned before, we are currently improving this aspect of Violin’s driver.

Figure 4.17: PostMark results: Write throughput

4.4.4 Hierarchy Performance

Finally, we use Iometer to evaluate the performance of the hierarchy of Figure 4.10 as each module is introduced. Our goal is to provide some intuition on expected system performance with complex configurations. Figures 4.18 and 4.19 show the results for sequential (left) and random (right) workloads. Each curve corresponds to a different configuration as each module is introduced to the hierarchy. In the sequential workloads, we note that variations in performance occur depending on the functionality of the layers being introduced. We see a large performance reduction in the versioning module, which is due to the disk layout that this module alters, as discussed in Chapter 3. Encryption overhead, on the other hand, appears to be about 10-15%. In random workloads we observe that the hierarchy performance is influenced only by the versioning layer functionality, which can increase performance substantially. This happens because of the change in the disk layout that versioning incurs, through write request logging, as shown also in Clotho’s results (Section 3.4). Encryption or RAID-0, on the other hand, do not seem to influence performance for random I/O.

Figure 4.18: IOmeter throughput (read and write) for the hierarchy configuration as layers are added (Partition; +RAID-0; +Versioning; +Blowfish), for 100% sequential and random read and write workloads over block sizes from 512 Bytes to 64 KBytes.

Figure 4.19: IOmeter throughput (70% read - 30% write mix) for the hierarchy configuration as layers are added (Partition; +RAID-0; +Versioning; +Blowfish), for sequential and random workloads over block sizes from 512 Bytes to 64 KBytes.

4.5 Limitations and Future work

Overall, Violin offers a highly configurable environment to easily experiment with advanced storage functionality. One drawback of our work is that we have evaluated Violin with a specific set of modules. However, we believe that this set is broad enough to demonstrate the benefits of our approach. Further extensions to Violin are possible as we gain more experience with its strengths and limitations.

A current limitation of Violin is that it does not support "upcalls", or calls from a lower layer in the hierarchy to a higher one. This stems from the traditional I/O stack and the block-level API, where control and initiation of requests proceed in a downward fashion. That is, the application and/or the file system initiate I/O commands and query or modify the status of the underlying layers. We believe that in order for the block-level subsystem to implement advanced functionality, a more balanced, bi-directional scheme is necessary. In other words, the block-level subsystem should be able to send commands that query and/or modify the status of higher layers in the I/O stack, such as the file system.

Also, this work is focused on examining mechanisms, rather than policies, for building virtual I/O hierarchies. Since we can extend the I/O hierarchy with a rich set of mechanisms, it is important to examine how user requirements can be mapped to these mechanisms automatically [Keeton and Wilkes, 2003].

An interesting aspect of this research direction is the question of how much a system can do in terms of optimizations in this translation process, both statically when the virtual I/O hierarchy is built [Wilkes, 2001], but mostly dynamically during system operation.

Finally, it is interesting to examine how Violin can be extended to include the network path in virtual hierarchies. Networked storage systems offer the opportunity to distribute the virtual I/O hierarchy throughout the path from the application to the disk. We explore this topic in the next chapter, along with the various possibilities and trade-offs, in order to provide insight on how virtualization functionality should be split among components in storage systems.

4.6 Conclusions

In this chapter we design, implement, and evaluate Violin, a virtualization framework for block-level disk storage. Violin allows easy extensions to the block I/O hierarchy with new mechanisms and flexible combining of these mechanisms to create modular hierarchies with rich semantics. To demonstrate its effectiveness we implement Violin within the Linux operating system and provide several I/O modules. We find that Violin significantly reduces implementation effort. For instance, in cases where user-level library code is available, new Violin modules can be implemented within a few hours. Using a simple user-level tool we create I/O hierarchies that combine the functionality of various modules and provide a set of features difficult to offer with monolithic block-level drivers. Finally, we use two benchmarks to examine the performance overhead of Violin over traditional, monolithic drivers and driver-based hierarchies, and find that Violin modules and hierarchies perform within 10% of their counterparts.

Overall, we find that our approach provides adequate support for embedding powerful mechanisms in the storage I/O stack with manageable effort and small performance overhead. We believe that Violin is a concrete step towards supporting advanced storage virtualization, reducing storage management overheads and complexity, and building self-managed storage systems.

Chapter 5

Orchestra: Extensible Block-level Support for Resource and Data Sharing in Networked Storage Systems

A man who wants to lead the orchestra must turn his back on the crowd. —Max Lucado

The content of this chapter roughly corresponds to publication [Flouris et al., 2008].

5.1 Introduction

Sharing in a storage system usually refers to two distinct aspects of the system: (i) physical resources and (ii) access to data. Traditionally, both of these functions have been supported by the filesystem, which has been responsible for pooling the available resources (mainly disks in a directly-attached storage system) and coordinating read/write access to stored data. Over time, sharing of physical resources has been increasingly supported by adding management and configuration flexibility at the block level. Today's storage area networks allow multiple hosts to share physical resources by placing multiple virtual disks over a single pool of physical disks. However, current levels of resource sharing incur significant limitations; for instance, logical volumes in a Fibre-Channel storage system cannot span multiple storage controllers. Besides, distinct application domains have very diverse storage requirements: systems designed for the needs of scientific computations, data mining, e-mail serving, e-commerce, operating system (OS) image serving or data archival impose different trade-offs for a storage system in terms of dimensions such as speed, reliability, capacity, availability, security, and consistency. Yet, current block-level systems provide only a limited set of functions; for instance, they rarely support versioning or space-saving techniques.

In terms of data sharing, e.g. concurrently accessing data in a single logical disk from many application clients, storage systems employ a (distributed) filesystem to coordinate data accesses. Such systems require mechanisms for distributed coordination, allocation, and metadata consistency, which have proved extremely challenging to scale at the filesystem level.

Finally, combining mechanisms for resource and data sharing in the filesystem results in unmanageable storage systems that are both hard to tailor to application needs and hard to scale. Thus, although this emerging, decentralized cluster architecture for networked storage offers a lot of potential for improving scalability and reducing cost, it needs to support both resource and data sharing efficiently.

In this chapter, we present Orchestra, a low-level software framework that extends Violin, presented in the previous chapter, and supports resource and data sharing at the block level in a decentralized storage cluster. Our main goal is to ease the development, deployment, and adaptation of block-level networked-storage services for various environments and application needs. Orchestra treats data and control requests in a uniform manner, propagating all I/O requests through the same I/O path from the application to the physical disks. Thus, it essentially eliminates out-of-band services, such as lock servers and block allocation, that complicate the design of distributed (clustered) storage systems. We show that providing flexibility does not introduce significant overheads and does not limit scalability. Moreover, our propositions can adapt to high requirements in terms of important concerns such as reliability (e.g., data integrity), which is explored in the next chapter, and manageability (dynamic reconfiguration, automatic tuning), which is outside the scope of the present thesis.

The main contributions of the work presented in this chapter can be summarized as follows:

• We provide a novel approach for building cluster storage systems which consolidates system complexity in a single point, the Orchestra hierarchies. We examine how extensible, block-level I/O paths can be supported over distributed storage systems. Our proposed infrastructure deals with metadata persistence and allows for hierarchies to be distributed almost arbitrarily over application and storage nodes, providing a lot of flexibility in the mapping of system functionality to available resources.

• We examine how sharing can be provided at the block level. We design block-level locking and allocation facilities that can be inserted in distributed I/O hierarchies where required. A distinguishing feature of our approach is that allocation and locking are both provided as in-band mechanisms, i.e. they are part of the I/O stack and are not provided as external services. We believe that our approach simplifies the overall design of a storage system and leaves more flexibility for determining the most adequate hardware/software boundary. Moreover, using in-band control operations takes advantage of available support for scaling regular data requests. Our current prototype provides simple but scalable policies for locking and allocation. More involved policies can be implemented through additional modules.

• We examine how higher system layers can benefit from an enhanced block interface. Orchestra's support for block locking, allocation, and metadata persistence can significantly simplify filesystem design. We design and implement a stateless, pass-through filesystem (ZeroFS or 0FS) that takes advantage of Orchestra's advanced features and distributed virtualization mechanisms. In particular, this filesystem provides basic file and directory semantics and I/O operations, while it does not maintain any internal distributed state and does not require explicit communication among its instances.

To demonstrate our approach, we implement Orchestra as a block device driver under Linux and ZeroFS as a user-level library. We perform experiments with a cluster of 16 nodes (8 storage and 8 application nodes) interconnected with Gigabit Ethernet. We evaluate the effectiveness of our approach by examining the overheads introduced by the block-level extensions for locking and block allocation, the overheads associated with providing a stateless file system, and the scalability of the system at both the block and filesystem level.

The rest of the chapter is organized as follows. Section 5.2 presents the design of Orchestra, emphasizing its main contributions, whereas Section 5.3 discusses our prototype implementation. Section 5.4 presents our results. Section 5.5 discusses limitations and future work and Section 5.6 draws our conclusions.

5.2 System Design

Orchestra aims at providing the underlying infrastructure for building scalable and extensible storage systems to support a wide range of application needs in a cost-effective manner.

Orchestra is designed around three goals:

• Distributed virtual hierarchies: Orchestra uses the concept of distributed virtual hierarchies to allow storage systems to extend the functions they support and to provide different views and semantics to applications without compromising scalability. New virtual modules may be used to provide novel storage functions while maintaining the illusion of a single data store to the upper layers (filesystem and applications). This requires “splitting” hierarchies over the network in a transparent manner, and in particular, transferring control requests among nodes.

• Distributed block-level sharing: Orchestra allows applications to share its virtual devices. To facilitate sharing, Orchestra incorporates block-level locking and allocation mechanisms that are implemented as virtual modules. Essentially, this makes locking and allocation in-band operations, eliminating the out-of-band services that are usually used in cluster storage systems (e.g. Petal, FAB). The locking mechanism is integrated in the virtual hierarchies as an optional virtual module and may be used by Orchestra devices to lock shared metadata or by applications that share data at the block level. The block allocator performs distributed free-block management and is also built as an optional virtual device that may be inserted at appropriate places in a virtual hierarchy.

• Distributed file-level sharing: To support seamless, coherent sharing for distributed, file-based applications, Orchestra provides a simple user-level library, ZeroFS (0FS), that implements a conventional filesystem API. Since Orchestra already includes advanced functions, 0FS is essentially a stateless, pass-through filesystem that translates application calls to the underlying block-level operations.

Orchestra allows users to create distributed hierarchies that span application and storage nodes, based on application and system requirements. A sample distributed Orchestra hierarchy is shown in Figure 5.1. Next, we discuss each of the above issues in more detail.

5.2.1 Distributed Virtual Hierarchies

Orchestra builds on Violin, presented in Chapter 4. Violin is a framework developed for single-node storage virtualization and, by itself, does not provide all the mechanisms required to support distributed and shared storage systems. Next we discuss the extensions to Violin developed in the context of Orchestra.


Figure 5.1: A distributed Orchestra hierarchy.


Figure 5.2: Byte-range mapping process through a hierarchy.

In-band control and data requests

As mentioned earlier, Violin supports control and data I/O requests in the same manner, by using the same in-band mechanism for data and control propagation in virtual hierarchies. An issue with both data and control requests that traverse virtual hierarchies is how we treat block address arguments. Each layer is able to "see", and thus reference, only the block addresses of its underlying layers. Thus, as control requests propagate in the virtual hierarchy, the addresses of the blocks they reference need to be translated, similarly to I/O requests.

We have generalized this concept in Orchestra by allowing layers to provide address-mapped control requests (commands) in addition to the existing regular control requests. All control requests may be issued by any virtual layer and traverse the virtual hierarchy top to bottom. If a layer does not understand a specific control request it forwards it to its lower layers. Eventually, a control command reaches a layer that handles it, possibly generating or responding to other control and data requests. Control requests provided by individual layers in Orchestra hierarchies are automatically inherited by all higher layers in the hierarchy. When a control request is issued, it is either handled by the current layer in the hierarchy or automatically forwarded to the next layer(s).

In addition, address-mapped commands are subject to translation of arguments that represent byte ranges. This is achieved by augmenting each layer, i.e., extending the layer API, with an address-mapping API call, address_map(). This call is written by module developers for every module that is loaded in an Orchestra hierarchy and translates an input byte range to any output byte range(s) in one or more output devices, as shown in Figure 5.2. Block mapping depends on the functionality of individual modules. Although complex mappings are possible, in many cases mappings are simple functions. Implementing the address-mapping method is a straightforward task for the module developer, since it is usually identical to the mappings used for read and write I/O calls through the layer to the output devices.

Finally, to allow virtual hierarchies to span multiple nodes, Orchestra needs to transfer I/O requests between virtual modules that execute on different nodes. For this purpose, we currently use a simple network protocol over TCP/IP that implements a network block device able to handle both data read-write requests and control requests between system nodes.
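To make the address-mapping API more concrete, the sketch below shows what an address_map()-style callback might look like for a simple striping (RAID-0) module. The structure and helper names are illustrative assumptions rather than the actual Orchestra module API; only the general idea of translating an input byte range into per-device byte ranges follows the description above.

/* Hypothetical, simplified sketch of an address-mapping callback for a
 * RAID-0 (striping) module.  Types and names are illustrative only. */
#include <stdint.h>
#include <stddef.h>

struct byte_range {
    int      dev_idx;   /* index of the output (lower) device */
    uint64_t offset;    /* byte offset within that device     */
    uint64_t length;    /* number of bytes in this sub-range  */
};

/* Translate [offset, offset+length) on the striped volume into sub-ranges
 * on the underlying devices.  Returns the number of sub-ranges written
 * into 'out' (at most 'max_out'). */
static size_t raid0_address_map(uint64_t offset, uint64_t length,
                                unsigned ndevs, uint64_t stripe_size,
                                struct byte_range *out, size_t max_out)
{
    size_t n = 0;
    while (length > 0 && n < max_out) {
        uint64_t stripe    = offset / stripe_size;    /* global stripe number   */
        uint64_t in_stripe = offset % stripe_size;    /* offset inside stripe   */
        uint64_t chunk     = stripe_size - in_stripe; /* bytes to end of stripe */
        if (chunk > length)
            chunk = length;

        out[n].dev_idx = (int)(stripe % ndevs);
        out[n].offset  = (stripe / ndevs) * stripe_size + in_stripe;
        out[n].length  = chunk;
        n++;

        offset += chunk;
        length -= chunk;
    }
    return n;
}

Since the same arithmetic already drives the module's read and write paths, writing such a callback adds little effort beyond what the module implements anyway.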

5.2.2 Distributed Block-level Sharing

Sharing virtual volumes requires coordinating (i) accesses to data and, as mentioned above, metadata via mutual exclusion and (ii) allocation and deallocation of storage space. In Orchestra, concurrent accesses to the data are coordinated by means of a locking virtual module, whereas space allocation is managed by a distributed block allocation facility.

Byte-range locking

As mentioned, Orchestra provides support for byte-range locking over a distributed block volume. The main metadata in the locking layer is a free-list that contains the unlocked ranges of the managed virtual volume. When a lock control request arrives, the locking layer uses its internal metadata to either complete or block the request. At an unlock request the locking layer updates its metadata and possibly unblocks and completes a previously pending lock request. Our locking API currently supports locks in both blocking and non-blocking (i.e. trylock) modes.

Support for byte ranges, instead of block ranges, is provided to facilitate the implementation of multiple-reader, single-writer locks at the filesystem (e.g. 0FS) or application level. A simple algorithm to implement such file locks, used currently by the 0FS filesystem, is as follows: to lock a file in multiple-reader mode, 0FS locks just one byte in the inode block address range for each reader. This allows as many concurrent readers as the number of bytes in the file inode, which is currently 4096. In the case of a writer, the filesystem locks the whole inode block address range, thus gaining exclusive access to the file.

Lock and unlock requests are address-mapped commands, which allows us to distribute the locking layers to any desirable serialization point in a distributed hierarchy. However, to achieve mutual exclusion, locks for a specific range of bytes should be exclusively serviced by a single locking virtual layer. This is achieved by placing locking layers at specific points in the distributed hierarchy, where they can intercept all I/O requests for a part of the storage address space. Such points include, for example, places where a local device is exported from a storage node, as shown in Figures 5.4 and 5.5. A locking device in this case provides locking support for any remote layer using this exported device. The lock and unlock commands are forwarded to the specific node through Orchestra's communication mechanism and the associated addresses are mapped according to the distributed hierarchy mappings.

Additionally, partitioning the locking address space and allowing many locking layers to service separate byte ranges in parallel achieves locking parallelization and load balancing across multiple nodes. Furthermore, the locking service scales as more storage nodes are added. Note that the metadata of a locking layer does not need to be persistent. Instead, a lease-based mechanism to reclaim locks from a failed client is adequate. Finally, lock availability can be achieved through mere replication of storage nodes, in the same way that one would configure a storage hierarchy for increased data availability.
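As an illustration of the file-locking algorithm just described, the following sketch expresses it in terms of two byte-range primitives; lock_range() and unlock_range() are hypothetical stand-ins for the Orchestra lock and unlock commands, and INODE_SIZE reflects the 4096-byte inode address range mentioned above.

/* Hypothetical sketch: multiple-reader/single-writer file locks built on
 * byte-range locking.  lock_range()/unlock_range() stand in for the actual
 * lock commands. */
#include <stdint.h>

#define INODE_SIZE 4096

int lock_range(uint64_t offset, uint64_t length);    /* blocking lock request */
int unlock_range(uint64_t offset, uint64_t length);

/* A reader locks a single byte inside the inode's address range, so up to
 * INODE_SIZE readers may hold the file concurrently. */
static int file_lock_read(uint64_t inode_offset, unsigned reader_slot)
{
    return lock_range(inode_offset + (reader_slot % INODE_SIZE), 1);
}

static int file_unlock_read(uint64_t inode_offset, unsigned reader_slot)
{
    return unlock_range(inode_offset + (reader_slot % INODE_SIZE), 1);
}

/* A writer locks the whole inode address range, excluding all readers and
 * other writers until it unlocks. */
static int file_lock_write(uint64_t inode_offset)
{
    return lock_range(inode_offset, INODE_SIZE);
}

static int file_unlock_write(uint64_t inode_offset)
{
    return unlock_range(inode_offset, INODE_SIZE);
}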

Block allocation

The role of Orchestra's block allocator is to handle distributed block management in a consistent manner for clients sharing the same block volume. The allocator distributes free blocks to the clients and maintains a consistent view of used and free blocks. All such block-liveness information is maintained by the allocator, offloading all the potentially complex free-block handling code from higher system and application layers.

In addition, providing a space allocation mechanism at the block level allows Orchestra to handle (dynamic) reconfiguration in an easy way: the allocation policy and size of a volume can be modified transparently to higher layers. Besides, the allocator's block-liveness state can be used to make necessary management tasks, such as backup and migration, more (time- and space-) efficient.

Figure 5.3: View of the allocator layer's capacity: the total volume capacity is divided into zones (Z0, Z1, Z2, ..., Zx), with metadata storage reserved by the framework.

The allocator metadata for managing free blocks consist of free-lists and bitmaps to handle blocks of various sizes. The allocator uses a free-list for keeping track of large blocks and bitmaps for small blocks. Currently, the allocator is configured to use two block sizes, a large block size (4 KBytes) for data and a small block size (256 Bytes) for filesystem i-nodes. However, we envision that future versions of Orchestra may rely on a single block size using the notion of “stuffed i-nodes” [Mullender and Tanenbaum, 1984].
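A rough sketch of how the per-zone allocator metadata could be organized is shown below; the text does not list the actual structures, so the types and field names are assumptions that merely mirror the two block sizes and the free-list/bitmap split described above.

/* Hypothetical sketch of per-zone allocator metadata: an extent free-list
 * tracks large (4-KByte) blocks and a bitmap tracks small (256-byte) blocks. */
#include <stdint.h>

#define LARGE_BLOCK_SIZE 4096
#define SMALL_BLOCK_SIZE 256
#define SMALL_PER_ZONE   (1 << 16)         /* illustrative zone sizing */

struct free_extent {
    uint64_t start_block;                  /* first free large block */
    uint64_t nblocks;                      /* length of the free run */
    struct free_extent *next;
};

struct alloc_zone {
    uint64_t zone_start;                   /* first byte of the zone       */
    uint64_t zone_len;                     /* zone capacity in bytes       */
    int      locked_locally;               /* zone lock held by this node? */
    struct free_extent *large_free;        /* free-list for 4 KB blocks    */
    uint8_t  small_bitmap[SMALL_PER_ZONE / 8];  /* bitmap for 256 B blocks */
};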

The allocator metadata need to be kept consistent across allocator instances running on different nodes. This is achieved by using the persistent metadata locking primitives to synchronize access to the shared free-list and bitmaps. Frequent locking at fine granularity would result in high allocation overheads. To address this issue we amortize the overhead associated with locking metadata by dividing the available (block) address space of a shared volume into a sufficiently large number of allocation zones, as shown in Figure 5.3. Each zone is independent and has its own metadata, which can be locked and cached on the node using this particular zone.

The locking algorithm works as follows. When the first allocation request for a free block arrives, the allocator identifies an unused (and unlocked) zone in the shared device and locks the zone. When a module successfully locks a zone, it loads the zone's metadata in memory. Subsequent allocation requests will use the same zone, as long as there are available free blocks, thus avoiding locking overheads. When a zone does not have enough free blocks to satisfy a request, a new zone is allocated and locked. This scheme essentially increases the locking granularity to full zones. The metadata of locked zones are automatically synchronized to stable storage, similarly to all other module metadata in Violin, on two occasions: (i) periodically, every few seconds, and (ii) when a zone is unlocked and its metadata released from the cache.
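A minimal sketch of this allocation path, under the assumption of hypothetical zone-locking and extent-allocation helpers, might look as follows.

/* Hypothetical sketch of the zone-based allocation path: reuse the currently
 * locked zone while it still has free blocks, and only lock (and load the
 * metadata of) a new zone when needed, amortizing locking cost over many
 * allocations. */
#include <stdint.h>

struct alloc_zone;                                   /* opaque zone handle */

static struct alloc_zone *current_zone;              /* zone locked by this node */

struct alloc_zone *pick_unlocked_zone(void);         /* find a candidate zone   */
int  zone_trylock(struct alloc_zone *z);             /* lock via the lock module */
void zone_unlock(struct alloc_zone *z);
void zone_load_metadata(struct alloc_zone *z);       /* read free-list/bitmaps  */
int  extent_alloc(struct alloc_zone *z, uint64_t n, uint64_t *first);

static int allocate_blocks(uint64_t nblocks, uint64_t *first_block)
{
    /* Fast path: the zone we already hold still has room. */
    if (current_zone &&
        extent_alloc(current_zone, nblocks, first_block) == 0)
        return 0;

    /* Slow path: find, lock and load another zone. */
    for (;;) {
        struct alloc_zone *z = pick_unlocked_zone();
        if (!z)
            return -1;                               /* no free zone available */
        if (zone_trylock(z) != 0)
            continue;                                /* lost the race; retry   */
        zone_load_metadata(z);
        if (extent_alloc(z, nblocks, first_block) == 0) {
            current_zone = z;
            return 0;
        }
        zone_unlock(z);                              /* too fragmented; try next */
    }
}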

Locking full zones of free blocks for allocation purposes results in internal fragmentation. We expect that in future storage systems with large amounts of physical disk space this will not be an important issue. To minimize internal fragmentation, the number of zones in a shared volume must be selected carefully. Using a large number of zones may reduce internal fragmentation but increase locking frequency (and thus allocation overhead), whereas a small number of larger zones may have the opposite effect. We believe that the number of zones for a volume should be calculated according to the expected number of users of a shared volume, and we intend to determine acceptable ratios through experiments with realistic workloads.

Freeing blocks can be more complex than block allocation. If the blocks to be freed are in a locally locked zone, the module will free the blocks in the local metadata maps. If, however, the blocks belong to a zone not locally locked, the module will first attempt to lock the corresponding zone in order to free the blocks. If it is successful, it will free the blocks and will keep the zone locked for a short time to avoid the locking penalty of successive free operations. Even though this case may incur high overhead, we expect that block deallocation will be clustered in zones due to the allocation policy and the (expected) large zone sizes, thus minimizing deallocation overheads.

If, during free operations, the allocator fails to lock a specific zone, it uses small logs for deferred free operations, called defer logs. The defer log for a zone is a persistent metadata object. When an allocator cannot lock a zone for deallocation purposes, it locks the zone's defer log, appends the pending free operation, and releases the defer log lock in case other allocator modules run into the same situation. If the defer log is full or already locked, the allocator module waits until the log is emptied or unlocked. The responsibility for processing the defer log lies with the allocator that has locked the corresponding zone. Every allocator module that owns a lock on a zone periodically checks the defer log for pending deallocation requests. If there are any pending operations in the log, they are performed on the locked metadata and the defer log is emptied.

As mentioned in Section 2.3, our design of the block allocator bears similarity to Hoard [Berger et al., 2000], a scalable memory allocator for multi-threaded applications.
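The corresponding deallocation path, including the defer-log fallback, might look roughly as follows; all helper names are again hypothetical.

/* Hypothetical sketch of the deallocation path with the defer log: free in
 * place when the zone lock is already held, otherwise try to lock the zone,
 * and fall back to appending a deferred free to the zone's defer log. */
#include <stdint.h>

struct alloc_zone;

struct alloc_zone *zone_of(uint64_t block);          /* zone containing block */
int  zone_locked_locally(struct alloc_zone *z);
int  zone_trylock(struct alloc_zone *z);
void zone_unlock_later(struct alloc_zone *z);        /* keep the lock briefly */
void extent_free(struct alloc_zone *z, uint64_t first, uint64_t n);
int  defer_log_append(struct alloc_zone *z, uint64_t first, uint64_t n);

static int free_blocks(uint64_t first, uint64_t nblocks)
{
    struct alloc_zone *z = zone_of(first);

    if (zone_locked_locally(z)) {                    /* common, cheap case */
        extent_free(z, first, nblocks);
        return 0;
    }
    if (zone_trylock(z) == 0) {                      /* lock the foreign zone */
        extent_free(z, first, nblocks);
        zone_unlock_later(z);                        /* amortize successive frees */
        return 0;
    }
    /* Zone held elsewhere: record a deferred free; the allocator holding the
     * zone lock applies it when it periodically scans the defer log. */
    return defer_log_append(z, first, nblocks);
}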

5.2.3 Distributed File-level Sharing

Many applications access storage through a filesystem interface. To allow such applications to take advantage of the advanced features of Orchestra and to demonstrate the effectiveness of our volume sharing mechanisms, there is a need to provide a filesystem interface on top of Orchestra.

One approach to achieve this is to use an existing distributed file system. Depending on the requirements of the distributed filesystem, Orchestra can be used to provide either a number of virtual volumes, each residing in a single storage node [Preslan et al., 1999], or a single distributed virtual volume built out of multiple storage nodes [Thekkath et al., 1997]. However, existing filesystems in either case would not take advantage of the distributed block allocation and locking primitives provided by Orchestra at the block level.

For these reasons, we provide our own filesystem on top of Orchestra, using a user-level library that provides the basic filesystem functions, such as file and directory naming, as well as standard file I/O operations1. 0FS is a stateless, pass-through file system that translates file calls to the underlying Orchestra block device. Our approach also demonstrates that by extending the block layer we are able to significantly simplify filesystem design in distributed environments, especially when it is necessary to support system extensibility. Thus, 0FS uses the block volume facilities for block allocation and locking.

The main feature of ZeroFS is that, unlike distributed file systems, it does not require explicit communication between separate instances running on different application nodes. Usually, communication is required for two purposes: (i) mutual exclusion and (ii) metadata consistency. 0FS uses the corresponding block-level mechanisms provided by Orchestra volumes for these purposes:

• Mutual exclusion: 0FS uses the byte-range locking mechanism provided by Orchestra to achieve mutual exclusion between multiple applications accessing a single filesystem through one or more application nodes. 0FS implements multiple-reader, single-writer file locks as described in Section 5.2.2. Currently, 0FS locks and unlocks files during the open/close calls, which is coarser than locking on read/write operations but realistic enough for many applications. Support for locking on reads/writes is nonetheless important for clustered applications that rely heavily on shared files. We expect to introduce these features in a future version without significant effort.

• Metadata consistency: The only metadata required by 0FS are i-nodes and the directory structure. To avoid maintaining internal consistent state in memory, 0FS does not perform any caching of metadata or data, but instead accesses the underlying Orchestra block device directly. Thus, accesses to files or directories in 0FS may result in multiple reads to the underlying block device for the corresponding i-nodes and directory blocks. In the worst case, four read requests may be necessary for very large files.

1Most of the conventional filesystem operations are supported, such as open(), close(), remove(), creat(), read(), write(), stat(), rename(), mkdir(), rmdir(), opendir(), readdir(), chdir(), getcwd(), etc.
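To make the cost of this statelessness concrete, a lookup in a 0FS-like library might proceed roughly as sketched below, re-reading directory and i-node blocks from the volume on every call; the on-disk layout and helper functions are assumptions for illustration only, not the actual 0FS format.

/* Hypothetical sketch of an uncached 0FS-style lookup: every step issues
 * block reads to the shared Orchestra volume, because the filesystem library
 * caches neither data nor metadata. */
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct inode { uint64_t size; uint64_t data_block; /* ... */ };

int dir_lookup(const char *dir_block, const char *name, uint64_t *inode_blk);
int read_block(uint64_t block, char *buf);           /* one block from volume */

static int lookup(uint64_t dir_block, const char *name, struct inode *out)
{
    char buf[BLOCK_SIZE];
    uint64_t inode_blk;

    /* 1. Read the directory block from the shared volume. */
    if (read_block(dir_block, buf) != 0)
        return -1;

    /* 2. Find the entry and the block that holds its i-node. */
    if (dir_lookup(buf, name, &inode_blk) != 0)
        return -1;

    /* 3. Read the block containing the i-node and copy it out. */
    if (read_block(inode_blk, buf) != 0)
        return -1;
    memcpy(out, buf, sizeof(*out));

    /* A subsequent read() issues further block reads for the file data (and,
     * for very large files, for indirect blocks), which is why a single file
     * access may require several reads in the worst case. */
    return 0;
}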


Figure 5.4: The allocator and locking layers allow many FS clients to share a distributed volume mapped on many nodes.

Figure 5.4 shows how 0FS is combined with Orchestra to provide file-level sharing over distributed Orchestra volumes.

5.2.4 Differences of Orchestra vs. Violin

Violin, presented in Chapter 4, and Orchestra, presented in this chapter, are both virtualization frameworks and thus share basic virtualization principles, such as easily extensible hierarchies made of virtual devices. However, there are significant differences between the two systems, the most important being that Violin offers single-node virtualization, while Orchestra is a distributed virtualization framework for a cluster.

In terms of design and implementation, the major differences between Orchestra and Violin are: (i) the networking layer that allows hierarchy distribution in the cluster; (ii) extensions to the command API for propagating commands across nodes and manipulating control and configuration information; (iii) support for block-level locking and allocation in distributed configurations, which is essential for sharing volumes between many clients; and (iv) the design, implementation, and evaluation of a stateless filesystem (0FS) built on top of distributed Orchestra hierarchies using the block-level locking and allocation facilities. To our knowledge, Orchestra is the only existing system that decouples free-block allocation and locking from the distributed filesystem and integrates them as independent components in the block-level I/O stack for improving scalability.

5.3 System Implementation

Orchestra is implemented as a loadable block device driver module in the Linux 2.6 kernel, extending Violin's device driver presented in Chapter 4. User-level tools can be used online, during system operation, to create new or modify existing I/O hierarchies. However, the user is responsible for maintaining data consistency while modifying the structure of virtual hierarchies. The user-level tools communicate with the framework device driver through the kernel's ioctl() interface.
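As an example of this interface, a user-level tool might issue a configuration command along the following lines; the device path and ioctl request code are hypothetical placeholders rather than the actual Orchestra command set.

/* Hypothetical sketch of a user-level tool sending a configuration command
 * to the Orchestra framework driver via ioctl().  The device path and the
 * request code are placeholders; the real command set is not shown here. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#define ORCH_IOC_MAGIC       'O'
#define ORCH_IOC_LOAD_MODULE _IOW(ORCH_IOC_MAGIC, 1, char[64])

int main(void)
{
    int fd = open("/dev/orchestra_ctl", O_RDWR);     /* placeholder device node */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char module_name[64];
    strncpy(module_name, "raid0", sizeof(module_name));  /* module to insert */

    if (ioctl(fd, ORCH_IOC_LOAD_MODULE, module_name) < 0)
        perror("ioctl");

    close(fd);
    return 0;
}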

Virtual I/O layers are implemented as separate kernel modules that are loaded on demand. Upon loading, Orchestra layers register their presence with the Orchestra framework driver. These modules are not "classical" device drivers themselves, but rather use the framework's programming API, which extends Violin's API. Modules implemented for Violin, such as RAID-0 and RAID-1 (including recovery), versioning, device partitioning, device aggregation, DES and triple-DES encryption, and on-line data migration between devices, have been ported to Orchestra. New modules, such as the block allocator, byte-range locking, and I/O request tracing, have also been implemented.

The networking channels for exporting Orchestra devices to other storage nodes or clients are currently implemented over TCP/IP in the core Orchestra framework. The networking component implements the client and server kernel threads and the request queues needed for the networking channels, as well as the Orchestra commands to connect and disconnect remote devices from the local Orchestra stack.

When a virtual I/O hierarchy is configured in the framework and linked with a /dev device, it can be manipulated as a normal block device with existing tools. For instance, the user can build a file system on top of it with mkfs and then mount it as a regular single-node filesystem or share it through 0FS.

ZeroFS is currently implemented as a user-level library that may be linked with any application during execution. We chose to implement 0FS at user level to reduce the development effort as well as to allow for easy customization. 0FS currently supports most of the conventional filesystem operations, such as open, close, remove, creat, read, write, stat, rename, mkdir, rmdir, opendir, readdir, chdir, and getcwd. It does not support the full functionality of a kernel-level filesystem, such as fcntl, file permission and attribute operations, and memory-mapped files. We were, however, able to compile and run standard filesystem benchmarks, such as IOzone and PostMark, without any code modifications.

Kernel Component               # code lines
Orchestra Core Framework              19558
RAID (0, 1 levels)                      950
Partition & Aggregation                1546
Versioning (Clotho)                     800
DES & 3DES Encryption                  1550
Blowfish Encryption                     791
Migration                               442
Networking                             3033
Byte-range Locking                     1137
Block Allocator                        2479

Kernel Code (approx.)                 32300
0FS (user-space)                       7600

Orchestra Total (approx.)             40000

Table 5.1: Orchestra modules in kernel and user-space code lines.

Table 5.1 shows the size of the code for Orchestra. The current version, as used in our experiments, is about 40000 lines of code.

Finally, the current Orchestra implementation uses a number of kernel threads mainly for network channels between devices: (a) One thread that listens, authenticates and accepts new connections from remote Orchestra hierarchies. This thread also creates the threads needed for a new remote device connection. (b) A client request sender thread that removes I/O requests or control command requests from a FIFO queue and sends them to the appropriate server. (c) A server request receiver that receives requests from all client connections. (d) A server response sender thread that sends responses when the I/O requests or I/O control commands are done. (e) A client request receiver thread that receives server responses and sends call-backs to the waiting application(s).
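The following is a simplified, user-space model of the client request sender thread (b); it is only meant to convey the queue-and-send structure, since the real code runs as a kernel thread inside the Orchestra driver.

/* Simplified user-space model of the client request sender thread: requests
 * queued by the I/O path are removed from a FIFO and written to the TCP
 * connection of the corresponding storage node. */
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

struct net_request {
    const void         *buf;               /* serialized I/O or control request */
    size_t              len;
    struct net_request *next;
};

static struct net_request *queue_head;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

static void *sender_thread(void *arg)
{
    int sock = *(int *)arg;                 /* connected TCP socket */

    for (;;) {
        /* Wait for a request to appear in the FIFO queue. */
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_cond, &queue_lock);
        struct net_request *req = queue_head;
        queue_head = req->next;
        pthread_mutex_unlock(&queue_lock);

        /* Ship the serialized request to the remote Orchestra node. */
        write(sock, req->buf, req->len);
    }
    return NULL;
}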

Our lock module provides a simple API that allows locking block ranges. Both the allocator and lock components are implemented as virtual modules.

The commands currently supported by the allocator layer are:

• get_block_size(): returns the large block size.

• get_small_block_size(): returns the small block size.

• allocate_blocks(): returns the requested number of sequential blocks allocated, either large or small blocks.

• free_blocks(): frees the requested number of sequential blocks, either large or small blocks.
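Expressed as a C interface, these commands might look like the sketch below; the exact prototypes are assumptions, since only the command names and their semantics are given above.

/* Hypothetical C rendering of the allocator layer's command interface. */
#include <stdint.h>

uint32_t get_block_size(void);        /* large block size (4 KBytes)  */
uint32_t get_small_block_size(void);  /* small block size (256 Bytes) */

/* Allocate 'count' sequential blocks of the requested kind; returns the
 * first block number, or (uint64_t)-1 on failure. */
uint64_t allocate_blocks(uint64_t count, int small);

/* Free 'count' sequential blocks starting at 'first'. */
int free_blocks(uint64_t first, uint64_t count, int small);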

5.4 Experimental Results

In this section we present preliminary results on the overhead of basic operations in Orchestra and we examine the scalability of Orchestra and 0FS on a setup with multiple storage and application nodes.

Our evaluation platform is a 16-node cluster of commodity x86-based Linux systems. All the cluster nodes are equipped with dual AMD Opteron 242 CPUs and 1 GByte of RAM, while storage nodes have four 80GB Western Digital SATA Disks. All nodes are connected with a 1 Gbit/s (Broadcom Tigon3 NIC) Ethernet network through a single 48-port GigE switch (D-Link DGS-1024T). All systems run Fedora Core 3 Linux with the 2.6.12 kernel.

We use three I/O benchmarks: xdd [Xdd], IOzone [Norcott and Capps] and PostMark [Katcher]. We use PostMark and IOzone to examine the basic overheads in 0FS, and xdd on raw block devices to examine block I/O overheads.

xdd [Xdd] generates I/O workloads based on specified parameters, such as read/write mix, request size, access pattern, and number of outstanding I/Os. We vary three parameters in our workloads: (i) Number of outstanding I/Os: this is equivalent to specifying the maximum queue depth in the I/O path. In our experiments, we vary the queue depth from 1 to 32 I/Os. (ii) Read-to-write ratio: we use 100% reads, 100% writes, and a mix of 70% reads–30% writes. (iii) Block size: we use block sizes ranging from 4 KBytes to 1 MByte. In all cases, we run xdd on an 8 GByte file (8 times the RAM of the nodes) and we report numbers averaged over multiple runs for each block size. All runs for a specific device are run in sequence and the total running time for all block sizes and workloads is approximately 4 hours.

IOzone is a filesystem benchmark tool that generates and measures a variety of file operations. We use IOzone version 3.233 to study file I/O performance for the following workloads: read, write, re-read, and re-write. We vary the block size between 64 KBytes and 8 MBytes and we use a file size of 2 GBytes for each client.

PostMark [Katcher] is a synthetic filesystem benchmark that measures the transaction rate for a workload simulating a large Internet electronic mail server. PostMark's operation is described in detail in Section 4.4.3.

Figure 5.5: Configuration for multiple node experiments (Setup A: 1 application node and 1 storage node; Setup B: 4 application nodes and 4 storage nodes; Setup C: 8 application nodes and 8 storage nodes).

In our evaluation we examine the basic overheads and scalability of Orchestra and 0FS. We present three setups, where we vary the total number of nodes between 2, 8, and 16. To facilitate interpretation of the results, we use the same number of storage and application nodes, resulting in three configurations: 1x1, 4x4, and 8x8 (Figure 5.5).

5.4.1 Orchestra

In this subsection we examine the overheads associated with Orchestra in single- and multi-node setups.

Basic costs

We measure the cost of individual operations in a local and a distributed virtual hierarchy (volume). Table 5.2 summarizes our results. Values are averaged over one thousand calls. We see that a local null ioctl costs about 4 µs. Allocating and freeing blocks through a local allocator virtual module costs about 12 and 13 µs, respectively. This cost does not include saving the allocator metadata to disk, which is performed periodically (every 10 seconds) in the background. In the distributed volume case, there is an additional overhead of about 310 µs, which consists mainly of the network delay and the remote CPU interrupt and thread scheduling costs.

               NULL   Allocate   Free   Lock   Unlock
Local             4         12     13     14       15
Remote ioctls   295          -      -    310      317

Table 5.2: Measurements of individual control operations. Numbers are in µsec.

Overall, we notice that the main overhead in networked volumes comes from the communication subsystem, which is expected to improve dramatically in storage systems with current developments in system-area interconnects, such as PCI Express/AS and 10 Gigabit Ethernet.

Figure 5.6: TCP/IP network throughput and latency measured with TTCP ((a) throughput, (b) latency).

Figure 5.6 shows the throughput and latency between two system nodes as reported by the TTCP benchmark. We see that the maximum throughput achieved between two nodes in our system is about 112 MBytes/s (900 Mbits/s), which is the expected value for a commodity Gigabit Ethernet interconnect.

Figure 5.7 compares the throughput and latency of a simple Orchestra hierarchy with a single RAID-0 virtual device on four SATA disks to a similar hierarchy built with the existing Linux MD driver on a single node. As shown, the throughput obtained for all workloads (sequential and random) over Orchestra is similar to the throughput of MD. In terms of latency, shown in Figure 5.7(b), Orchestra also exhibits performance similar to MD.

Scalability

Now we examine the scalability of Orchestra at the block level using xdd. Figure 5.5 shows the setup we use for these experiments. Each xdd process runs at different block offsets and thus accesses separate block ranges, minimizing caching effects on the storage nodes.

Figure 5.7: Throughput and latency for Orchestra vs. Linux MD over direct-attached disks for sequential (left) and random (right) workloads.

Figure 5.8 shows sequential and random read and write throughput and latencies with xdd for each configuration: 1x1, 4x4, and 8x8. Overall, we notice that as we increase the number of nodes, both read and write performance scale. For instance, in the 1x1 setup, maximum throughput is about 100 MBytes/s, while in the 8x8 setup it exceeds 400 MBytes/s. However, scalability is limited by the disks, because as the number of nodes (and in turn xdd processes) increases, the efficiency of individual disks drops, as they perform more seeks between additional separate block ranges. The network, on the other hand, is not close to saturation, since, for instance, in the 4x4 setup maximum throughput is about 200 MBytes/s, whereas total network throughput is at least twice as much.

Figure 5.8: Throughput scaling with xdd for the 1x1, 4x4, and 8x8 setups ((a) sequential I/O, (b) random I/O).

Figure 5.9: Aggregate throughput of 0FS with IOzone ((a) read/re-read, (b) write/re-write).

5.4.2 ZeroFS (0FS)

In this subsection we examine the overheads associated with 0FS using IOzone and PostMark.

First, we look into the base performance of 0FS on a single node. We contrast its behavior to the standard Linux Ext2FS using both IOzone and PostMark. The IOzone results show similar performance for reads, while for writes Ext2FS is about 20% faster. In the PostMark experiments, and for small workloads that fit in the buffer cache, 0FS is about 25% slower. We attribute this performance difference to the increased number of system calls of the user-level 0FS compared to the in-kernel Ext2FS, and to the metadata caching that Ext2FS uses. Thus, we see that 0FS performs, in the local case, close to current filesystems.

Next, we examine the scalability of 0FS in distributed setups. In the base distributed configuration (1x1) we create a single volume over all available system storage and then create different directories for every application node that will access this volume. The separate instances of 0FS use the locking, allocation, and RAID-0 modules, as shown in Figure 5.5. Each application node uses a separate directory on the distributed 0FS to perform its file I/O workload with either IOzone or PostMark. In our opinion this is the most realistic scenario for evaluating the scalability of a distributed FS, since no system should be expected to scale well when there is heavy sharing of files between clients. Finally, note that since 0FS is stateless, it does not perform any client-side caching, and thus all I/O requests generate network traffic. Figures 5.9 and 5.10 show our multi-node results with IOzone and PostMark, respectively.

Figure 5.10: PostMark results over 0FS in multiple node setups (1x1, 4x4, 8x8): (a) throughput (MBytes/sec) and (b) transactions/sec. The workload consists of: (i) files ranging from 128 KB to 1 MByte in size, (ii) an initial pool of 2000 files, and (iii) 5000 file transactions with equal biases for read/append and create/delete. During each run there are over 4500 files created, about 1.5 GBytes of data read and more than 3 GBytes of data written by each client.

Figure 5.9 shows that IOzone (over 0FS) scales as the number of nodes increases for both read and write, reaching a maximum throughput of about 330 MBytes/s in the 8x8 configuration. We also see that IOzone performance and scaling follow the behavior of block-level I/O (xdd) in Figure 5.8. Similarly to the block-level I/O, scaling from 1x1 to 4x4 to 8x8 nodes is not linear because of reduced disk efficiency as the number of nodes (and disks) increases. Thus, when disks are not the bottleneck, 0FS is able to scale well in cases where there is limited contention. Similarly, our PostMark results (Figure 5.10) show scaling behavior similar to IOzone. We see that quadrupling the number of storage nodes from 1x1 to 4x4 results in about doubling all metrics, whereas further increasing the number of storage nodes from 4 to 8 results in doubling performance.

Comparison to other distributed filesystems.

Considering how different our approach is from current distributed storage systems, we believe it is difficult to make an apples-to-apples comparison with existing, freely-available cluster filesystems such as the Global File System (GFS) [Preslan et al., 1999] (note that GPFS [Schmuck and Haskin, 2002] is a commercial closed-source system), since: (i) Cluster filesystems usually rely on hard-wired reliability mechanisms (e.g. journaling) which incur a significant performance penalty. Since currently 0FS provides only weak reliability guarantees, a comparison with a filesystem like GFS would be neither meaningful nor very fair. Note also that it is very hard to extend or remove features from GFS because of its monolithic structure. (ii) Our FS prototype is currently implemented at the user level and thus has more overheads than a filesystem implemented at the kernel level.

Our main objective in this chapter is to show that the proposed architecture does not introduce major overheads or scalability bottlenecks and that our novel approach (i.e. decoupling block allocation and locking from the filesystem) simplifies the distributed filesystem. In Chapter 6 we present RIBD, a system that extends Orchestra with mechanisms required to provide reliability, enforce consistency of distributed metadata and ensure availability. In Section 6.8 we compare RIBD to two popular filesystems, PVFS2 [Carns et al., 2000] and GFS [Preslan et al., 1999].

5.4.3 Summary

Overall, Orchestra is able to scale well when throughput is not limited by increased seek overheads in physical disks. Moreover, 0FS follows the same scaling pattern as Orchestra in cases where there is little sharing contention. Finally, these results suggest that our approach for providing increased functionality and higher-level semantics at the block-level has potential for scaling to larger system sizes.

5.5 Limitations and Future work

Although Orchestra provides support both for easily extending storage systems and for pushing new functions closer to the storage devices, it is currently limited in three ways: (i) continuous operation in the presence of faults, (ii) data protection, and (iii) client-side consistent caching.

To recover from failures, Orchestra uses two (shadow) copies of virtual device metadata for the full hierarchy. Thus, when one copy is updated on the disk, the other (shadow) copy of the metadata is stable and reflects a previous consistent state of the system. When a failure occurs, the system can revert to the previous copy of metadata, possibly losing updates that have occurred in-between. However, Orchestra does not offer support for continuous operation (availability) and consistency when redundancy across storage nodes is used (e.g. network RAID levels 1, 5 or 6). The filesystems implemented on top of it need to provide their own recovery mechanisms, such as write-ahead logs. Finally, another concern regarding high availability that deserves further investigation is dynamic, consistent modification of the structure of a distributed hierarchy, e.g. to transparently add or remove storage resources and migrate data accordingly. RIBD, presented in Chapter 6, extends Orchestra with mechanisms required to enforce consistency of distributed metadata (and data) in the presence of failures and to ensure availability. Our approach relies on using lightweight transactions for providing atomicity with respect to failures and locks for serializability.

Regarding data protection, Orchestra currently relies on traditional protection mechanisms, i.e. checking for file attributes in kernel-level file systems. However, the ability to associate blocks with metadata dynamically and on demand provides the potential for (i) performing finer-grain protection checking, and (ii) dynamically adjusting protection levels for blocks based on high-level features, e.g. time-based protection mechanisms.

Finally, Orchestra does not support client-side consistent caching. It uses the OS buffer cache in storage nodes for caching purposes; however, this results in at least one network round-trip time for each I/O operation. Support for client-side caching requires support for consistency, since data and metadata may be cached in multiple application nodes. Such mechanisms can be implemented in Orchestra and RIBD as virtual modules, but are currently left as a topic for future work.

5.6 Conclusions

In this chapter we present Orchestra, a low-level software framework aimed at taking advantage of brick-based architectures. Orchestra facilitates the development, deployment, and adaptation of low-level storage services for various environments and application needs, by providing distributed virtual hierarchies over storage systems built out of commodity compute nodes, storage devices, and interconnects. The main contributions of Orchestra can be summarized as follows:

• Orchestra provides a novel approach for building cluster storage systems which consolidates system complexity in a single point, the Orchestra hierarchies. We examine how extensible, block-level I/O paths can be supported over distributed storage systems. Orchestra deals with metadata persistence and allows hierarchies to be distributed almost arbitrarily over application and storage nodes, providing a lot of flexibility in the mapping of system functionality to available resources.

• Orchestra provides sharing at the block level, through block-level locking and allocation facilities that can be inserted in distributed I/O hierarchies where required. A distinguishing feature of Orchestra is that allocation and locking are both provided as in-band mechanisms, i.e. they are part of the I/O stack and are not provided as external services. We believe that our approach simplifies the overall design of a storage system and leaves more flexibility for determining the most adequate hardware/software boundary.

• We examine how higher system layers can benefit from an enhanced block interface. Orchestra's support for block locking, allocation, and metadata persistence can significantly simplify filesystem design. We design and implement a stateless, pass-through filesystem, ZeroFS (0FS), that takes advantage of Orchestra's advanced features and distributed virtualization mechanisms. 0FS does not maintain internal state, uses Orchestra facilities for locking and block allocation, and does not require explicit communication among its instances.

We implement Orchestra and 0FS under Linux and evaluate them using various setups with single and multiple storage and application nodes. We find that the modularity of Orchestra introduces little overhead beyond TCP/IP communication and existing kernel overheads in the I/O protocol stack. Results on a cluster with 16 nodes show that Orchestra scales well both at the block and at the filesystem level. Overall, we show that providing flexibility and extensibility in a distributed storage stack does not introduce significant overheads and does not limit scalability. More importantly, our approach of providing extensible, virtual, block-level hierarchies over clustered storage systems takes advantage of current technology trends and provides the ability to repartition storage functionality among various system components, based on performance and semantic trade-offs.

Chapter 6

RIBD: Recoverable Independent Block Devices

To find a fault is easy; to do better may be difficult. —Plutarch (46 AD - 120 AD)

An early version of the protocol design in this chapter, without the implementation and evaluation, has been presented in [Flouris et al., 2006].

6.1 Introduction

Cluster-based storage architectures in use today resemble the structure shown in Figure 6.1-(a), where a high-level layer such as a filesystem uses the services of an underlying block-level storage system through a standard block-level interface such as SCSI. Concerns such as data availability and metadata recovery after failures are commonly addressed through data redundancy (e.g., consistent replication) at the file or block (or both) layers, and by metadata journaling techniques at the file layer. This structure, where both redundancy and metadata consistency are built into the filesystem, leads to filesystems that are complex and hard to scale, debug, and tune for specific application domains [Prabhakaran et al., 2005a; Yang et al., 2004; Prabhakaran et al., 2005b].

Figure 6.1: Overview of typical cluster-based scalable storage architectures (a) and RIBD (b).

Modern high-end storage systems incorporate increasingly advanced features such as thin provisioning [Compellent, 2008; IBM Corp., 2008c], snapshots [IBM Corp., 2008d], volume management [EVMS; GEOM; Teigland and Mauelshagen, 2001], data deduplication [Quinlan and Dorward, 2002], block remapping [English and Alexander, 1992; Wang et al., 1999; Wilkes et al., 1996; Sivathanu et al., 2003], and logging [Wilkes et al., 1996; Stodolsky et al., 1994]. The implementation of all such advanced features requires the use of significant block-level metadata, which are kept consistent using techniques similar to those used in filesystems (Figure 6.1-(a)). Filesystems running over such advanced systems often duplicate effort and complexity by focusing on file-level optimizations such as log-structured writes or file defragmentation. Besides doubling the implementation and debugging effort, these file-level optimizations are usually irrelevant or adversely impact performance [Denehy et al., 2002].

In this Chapter we propose Recoverable Independent Block Devices (RIBD), an alternative storage system structure (depicted in Figure 6.1-(b)) that moves the necessary support for handling both data and metadata consistency issues to the block layer in decentralized commodity platforms. RIBD extends Orchestra, presented in Chapter 5, introducing the notion of consistency intervals (CIs) to provide fine-grain consistency semantics on sequences of block-level operations by means of a lightweight transaction mechanism. Our model of CIs is related to the notion of atomic recovery units (ARUs), which were defined in the context of a local, centralized system [Grimm et al., 1996]. A key distinction between CIs and ARUs is that CIs handle both atomicity and consistency over multiple distributed copies of data and/or metadata, whereas ARUs do not face consistency issues within a single system. RIBD extends the traditional block I/O interface with commands to delineate CIs, offering a simple yet powerful interface to the file layer.

One distinctive feature of RIBD is its roll-back recovery mechanism based on low-cost versioning, which allows recovery of, and access to, not only the latest system state (as journaling does), but also a set of previous states. This allows recovery from malicious attacks and human errors, as discussed in detail in Section 3.1. We believe that versioning is a particularly promising approach for building reliable storage systems as it matches current disk capacity/cost tradeoffs. To the best of our knowledge, RIBD is the first system to propose and demonstrate the combination of versioning and lightweight transactions for recovery purposes at the block storage level.

We have implemented RIBD in the Linux kernel as a set of kernel modules extending Orchestra and exporting virtual devices layered over local or remote physical storage devices. We have also extended ZeroFS (0FS), the simple file layer presented in Chapter 5, to operate on top of RIBD. 0FS is a user-level, stateless, pass-through filesystem that merely translates file names to sets of blocks and specifies CI boundaries in file operations. We evaluated RIBD using micro-benchmarks on a platform of 24 nodes (12 disk servers and 12 application/file servers). To understand the overheads in our approach, we compared RIBD with two popular filesystems, PVFS2 [Carns et al., 2000] and the Global File System (GFS) [Preslan et al., 1999], which, however, offer weaker guarantees. We found that RIBD offers a simple and clean interface to higher system layers and performs comparably to existing solutions with weaker guarantees.

An early version of our protocol design, without the implementation and evaluation, has been presented in [Flouris et al., 2006]. In this Chapter, we present the complete system design, its implementation, and its evaluation.

The rest of the Chapter is organized as follows. In Section 6.2 we present our motivation for RIBD. Sections 6.3, 6.4, and 6.5 present an overview of RIBD and the design of the underlying protocols and components, as well as the steps taken for recovering from faults. Sections 6.6 and 6.7 discuss our prototype implementation and the platform we use for our evaluation. Section 6.8 presents our experimental results. Finally, Section 6.10 draws our conclusions and outlines future work.

6.2 Motivation

In this Chapter we design a system that uses block-level, roll-back recovery (BC-RB/RF) and examine its performance and scalability using a real prototype. Our design is motivated by the following issues.

6.2.1 File vs. block level

Local filesystems have traditionally used journaling (roll forward) for recovering from failures [Tweedie, 2000; Sweeney et al., 1996; Chutani et al., 1992]. Thus, for historical reasons, providing recovery by means of journaling in scalable storage systems has been the natural evolution. However, this leads to a number of issues:

• It requires the filesystem to be aware of replica allocation and placement. However, over the past few years these functions have largely moved to the block level (e.g. with RAID systems). This approach leads to duplicating certain functionality, necessarily increasing system complexity and possibly incurring higher overheads.

• If replication is to be provided by the block level, e.g. using a shared virtual disk such as Petal [Lee and Thekkath, 1996] or RVSD [Attanasio et al., 1994], then the shared virtual disk has to be designed specifically for the filesystem to ensure that faults are tolerated. For instance, when GPFS [Schmuck and Haskin, 2002] is used with filesystem replication, it requires RVSD [Attanasio et al., 1994] to provide a shared disk abstraction, but does not guarantee recovery from all types of failures, since RVSD does not provide consistent replication. Frangipani [Thekkath et al., 1997] relies on Petal [Lee and Thekkath, 1996] for data and metadata replication and performs metadata logging. However, Frangipani has been designed specifically for Petal, and it is not easy to use either layer with other systems.

• Assuming that replication and metadata consistency are divided between the file and block levels, journaling at the file level in an efficient manner becomes challenging. Typically, the filesystem journal has to be accessible by other file servers, in case a file server fails and there is a need to replay its log. Also, the log itself needs to be replicated to deal with disk failures. As block placement, e.g. in disk arrays, is typically not under filesystem control, the filesystem cannot always optimize accesses to the journal.

Figure 6.2: Roll forward vs. roll back recovery at the block level.

Managing consistency issues for both replication and metadata at the block level has the potential to address these issues and to provide higher system layers (including the filesystem) with a simple and clean interface for reliably using the available storage.

6.2.2 Roll forward vs. roll backward

Figure 6.2 summarizes the main configurations with respect to roll-forward or roll-back recovery at the block level. The “thick” path in the figure describes the logging approach, with solid lines indicating synchronous logging (marked A) and dashed lines indicating batched asynchronous logging (marked B). Similarly, the “thin” path describes the versioning approach (marked C and D for synchronous and batched asynchronous versioning, respectively). The two approaches (versioning and logging) have many similarities:

In all cases the filesystem provides the boundaries of consistency intervals using a transaction-type API. The block level ensures that these intervals are handled appropriately. For this purpose, each disk server can either maintain a log of operations, periodically flushing the log to the disk, or always apply operations to the disk data, periodically creating different versions of the data on disk.

In the strictest (synchronous) case, both logging and versioning need to happen at each interval. For performance reasons, however, it makes sense to perform these operations periodically, every few intervals. In this (batched) case, there is a need for an agreement protocol to ensure that the system can recover to a point that is consistent in terms of both data and metadata.

The main difference between the two approaches is that in logging, there is a need for a cleanup procedure that will re-apply the log to the actual disk blocks, whereas in versioning, there is a need for maintaining additional metadata for handling block versions. We believe that versioning has significant advantages, as it maintains previous versions of the data in a structured manner, allowing recovery to any previous state of the system. Given current disk capacity trends, we believe that recovery issues should be resolved using versioning techniques.

6.3 RIBD Overview

RIBD tries to address issues at three levels: (a) the abstraction and primitives it presents to higher layers, (b) cost-effective scalability without compromising reliability, and (c) flexibility in its use. In this section we present an overview of RIBD and discuss the abstraction it provides. The next sections discuss primarily how it deals with consistency issues and, to some extent, how its implementation results in flexibility.

Previous work and other systems show that explicitly handling atomicity in higher layers, e.g. by using journaling alongside non-trivial file semantics, results in complex systems. Filesystems that support journaling tend to be hard to implement and even harder to modify and tune. Moreover, as lower system layers provide higher-level abstractions of system resources, e.g. by abstracting block allocation and placement in a networked system, solutions for atomicity become difficult to optimize for performance. For these reasons, RIBD hides this complexity by providing to higher system layers and applications a simple abstraction based on consistency intervals (CIs), similar to transactions. Higher layers can delineate a set of consecutive (block) operations as a consistency unit that will be handled appropriately by RIBD. Figure 6.3 shows an example code segment from ZeroFS (0FS) that uses the RIBD API for the file creation operation.

To provide cost-effective scalability without compromising reliability, RIBD uses a single, lightweight transactional mechanism for both replica and metadata consistency, based on the CI interface. Typically, filesystem approaches require different mechanisms for dealing with metadata consistency and replica consistency, as these refer to different entities: data blocks, filesystem metadata, and consistency metadata (e.g. the log itself). RIBD, by operating at the block level, is able to use a single mechanism for all types of blocks, regardless of whether they refer to file data or metadata.

...
begin_ci();
lock( directory );
if ( ! lookup( filename, directory ) ) {   /* file does not already exist */
    allocate_inode( &inode_addr );
    lock( inode_addr );
    write_file_metadata( inode_addr, name, attrib );
    unlock( inode_addr );
    init_dirent( dirent, inode_addr, name, attrib );
    write_dir_metadata( directory, dirent );
}
unlock( directory );
end_ci();
...

Figure 6.3: Pseudo-code for a simple file create operation with one CI. It can also be implemented with two smaller CIs, one for updating the file inode and one for the directory.

Our transactional mechanism uses agreement protocols for consistency, low-overhead versioning for atomicity, and explicit locking for isolation. Given that all requests are eventually written to disk, durability is provided in a straightforward manner, at a configurable granularity, by specifying a flush interval. The combination of these low-level mechanisms results in little contention when there is no (true) access sharing at the filesystem level, does not impose dependencies among concurrent I/Os during system operation, and only increases the response time of individual I/O requests if the application requires the ability to recover after each single I/O operation.
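To make these trade-offs more concrete, the following sketch groups the reliability-related knobs discussed above into a single per-volume policy structure. This is purely our own illustration under stated assumptions: the structure, field names, and example values are hypothetical and are not part of the actual RIBD configuration interface.

#include <stdint.h>

/* Hypothetical per-volume reliability settings; names are illustrative only. */
struct ribd_volume_policy {
    uint32_t replication_degree;    /* number of replicas per block (e.g. 2 for RAIN-1)    */
    uint32_t flush_interval_ms;     /* how often buffered writes are forced to disk        */
    uint32_t version_interval_ms;   /* how often a new globally consistent version is cut  */
    int      data_in_cis;           /* 1: wrap data writes in CIs, 0: metadata only        */
};

/* A strict volume: every CI is made durable and versioned almost immediately. */
static const struct ribd_volume_policy strict_policy = {
    .replication_degree  = 2,
    .flush_interval_ms   = 0,       /* flush on every commit */
    .version_interval_ms = 1000,
    .data_in_cis         = 1,
};

/* A relaxed volume: data consistency is traded for performance. */
static const struct ribd_volume_policy relaxed_policy = {
    .replication_degree  = 2,
    .flush_interval_ms   = 30000,
    .version_interval_ms = 60000,
    .data_in_cis         = 0,       /* only metadata updates go through CIs */
};

A “strict” policy of this kind trades throughput for the ability to recover after every single CI, whereas a “relaxed” policy batches durability and versioning, matching the discussion above.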

Finally, the few optimized distributed storage systems (such as GPFS [Schmuck and Haskin, 2002]) addressing dependability issues in a single layer, the filesystem, have a monolithic structure and can hardly be adapted or extended according to the needs of a given application, e.g. by providing customizable replication. In contrast, the protocols that we propose can be easily added to (or removed from) a modular software stack.

6.3.1 System abstraction

Overall, CIs provide atomicity, consistency, and durability. Isolation is provided by a separate locking API offered by RIBD. All block operations enclosed in a CI are guaranteed to be treated as a single operation during recovery, i.e. all or none of the operations persist after a failure. A CI is opened and closed by the client application using explicit block-level API commands. Given that we are operating on the critical I/O path, we do not allow nested CIs. Essentially, a CI buffers all update operations on the client side until the commit operation.
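The sketch below illustrates what this client-side behavior could look like: a per-thread CI context that rejects nesting and buffers updates until end_ci() hands them to the commit protocol. The begin_ci()/end_ci() names follow Figure 6.3; everything else (the types and the helper ccm_two_phase_commit) is our own illustration, not the actual RIBD code, and error handling is omitted.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* One buffered update inside an open CI (illustrative types only). */
struct ci_op {
    enum { CI_WRITE, CI_UNLOCK } kind;
    uint64_t      block;       /* block address (or lock id for CI_UNLOCK) */
    void         *data;        /* payload for writes */
    size_t        len;
    struct ci_op *next;
};

/* Per-thread CI context: RIBD uses thread-level CI granularity. */
struct ci_ctx {
    int           open;        /* nested CIs are not allowed */
    struct ci_op *ops;         /* buffered operations, sent at end_ci() */
};

int ccm_two_phase_commit(struct ci_op *ops);   /* hypothetical CCM entry point */

void begin_ci(struct ci_ctx *ci)
{
    assert(!ci->open);         /* nesting is rejected on the critical I/O path */
    ci->open = 1;
    ci->ops  = NULL;
}

/* Writes (and unlocks) issued inside a CI are buffered, not sent to servers. */
void ci_buffer_write(struct ci_ctx *ci, uint64_t block, const void *buf, size_t len)
{
    struct ci_op *op = malloc(sizeof(*op));
    op->kind  = CI_WRITE;
    op->block = block;
    op->data  = malloc(len);
    memcpy(op->data, buf, len);
    op->len   = len;
    op->next  = ci->ops;
    ci->ops   = op;
}

/* end_ci() closes the interval and hands the buffered batch to the
 * two-phase commit protocol described in Section 6.4.1. */
int end_ci(struct ci_ctx *ci)
{
    assert(ci->open);
    ci->open = 0;
    return ccm_two_phase_commit(ci->ops);
}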

Higher system layers use RIBD CIs to delineate units of work that may leave the system in an inconsistent state after a failure. This includes all critical metadata updates, because the logical structure of the storage system, such as a filesystem directory tree, must never become corrupted. CIs may also be used to guarantee the consistency of data. However, a different trade-off may be adopted in this regard: sometimes, users may be willing to accept occasional risks of data corruption, e.g. if they can rely on backups or can easily regenerate the data, in exchange for increased performance. In this case, the higher system layer need not include accesses to data in CIs. The decision on how CIs are used is left to the filesystem. In our work and in 0FS we choose to ensure both metadata and data consistency, because managing inconsistencies quickly becomes a difficult problem for the (inexperienced) user as the volume of information grows.

As mentioned, RIBD CIs do not provide isolation. Isolation is instead achieved with appropriate block-level locking operations in the filesystem code. Typically, such locking operations will occur within CIs. Thus, along with write operations, we also need to buffer unlock operations that occur within CI boundaries, to avoid cascading effects during recovery.

We choose not to implement locking transparently at the CI boundaries, to allow for possible optimizations at the file level, e.g. by using a different granularity for atomicity and for mutual exclusion. Locks are only required for mutual exclusion, that is, to ensure that two sets of operations on a given resource do not overlap but are serialized. For instance, a client thread may lock a given file (or range of blocks) to ensure that its updates will not be interleaved with updates from other clients. However, in a filesystem, specific metadata may be associated with specific data blocks. In this case, it may be adequate for the client to include all metadata and data operations in a transaction but only lock the metadata blocks.

In summary, RIBD provides atomicity, consistency, isolation, and durability properties for CIs as follows:

Atomicity Our mechanism guarantees that the updates encapsulated in a CI will be performed atomically, i.e. that a server will perform either all or none of the updates, but never any other combination. This is achieved by buffering the updates on the client side until the CI is closed.

Consistency CIs complete only after they are applied to all involved servers and replicas. Thus, at any time, all data replicas impacted by CIs will be in a consistent state. This is achieved by means of a two-phase commit protocol among disk servers.

Isolation CIs do not support implicit isolation. Instead, higher layers can guarantee isolation by using explicit block-range locks that are provided by RIBD.

Durability All I/O operations are eventually performed on the disk. Although this can happen before a CI is committed, resulting in strong guarantees, it may have an adverse impact on performance. RIBD uses a versioning mechanism of configurable granularity to allow tuning the trade-off between data loss tolerance and performance impact for different applications. This configurable granularity is achieved through an agreement protocol among disk servers.

Finally, there are practically two possible levels of granularity for CIs: node-level or process/thread-level. Node-level CIs are easier to implement, require less metadata for versioning purposes (agreement phase), and map directly to the most common failure domain, assuming that hardware or OS crashes are more frequent than application-level thread crashes. On the other hand, thread-level CIs are conceptually clearer (there is no imposed grouping of client threads based on the node they run on), induce less contention on the client side for delineating CIs (one counter per thread), and offer the potential for stronger recovery semantics, due to their finer granularity. In our work we choose to use thread-level CI granularity. This choice impacts only the client-side module in charge of CIs, and does not affect the rest of the code.

6.3.2 Fault model

We assume that the networked storage system we are proposing behaves according to the timed asynchronous model [Cristian and Fetzer, 1999]. We also assume that failures of server components (CPUs, memories, disks, software faults) are fail-stop and that the whole server containing the component crashes. Similarly, we assume that network components fail permanently, and that transient network failures at links or switches are dealt with by the communication subsystem, as is typical for high-throughput, low-latency interconnects. In case of any failure, data remains available as long as there is a functional path from the application to the specific data blocks. In other words, all non-faulty components of the system keep operating in the presence of faults; however, data is only available if at least one copy is accessible through non-faulty components. RIBD does not actively employ mechanisms to detect incorrect behavior. It assumes that component errors are reported through return codes of synchronous or asynchronous operations. Thus, RIBD does not deal with undetected errors or Byzantine behavior of components. Finally, we assume that file and disk servers belong to a single administrative domain and are co-located. Thus, we do not deal with disaster recovery issues. Overall, RIBD ensures that:

• Any transient or permanent (hardware or software) component failure does not corrupt or destroy stored information, unless it is the only remaining copy of the data.

• Any transient or permanent (hardware or software) component failure does not prevent a volume from being accessed, unless there is no alternative path to accessing volume data.

• After any global failure, e.g. a total power failure, the system can restart quickly from a consistent and recent state, based on a configurable policy (tradeoff).

6.4 Underlying Protocols

Figure 6.4 illustrates the system structure. Table 6.1 describes the base functionality of each module.

6.4.1 Client-server transactional protocol

Figure 6.4: RIBD system architecture and protocol stack.

The protocol implementing atomicity involves two entities: the client-side CI manager (CCM), deployed on all the client nodes, and the server-side CI manager (SCM), deployed on all the disk servers, as shown in Figure 6.5. The CCM buffers the write and lock operations associated with a given CI (and services read requests from the buffer if necessary). Once the end of a CI has been detected, the CCM starts the two-phase commit protocol. First, it batches the data updates in a single prepare request, which is propagated to the SCM modules through the hierarchy (steps 1 and 2). At some point, this request is acknowledged by the SCMs on the servers and the CCM checks its status (step 3). If the prepare request was successful, all the concerned disk servers have received the data to be written and agreed on it. The CCM then sends a commit request to all the involved disk servers (step 4). Note that a commit request can be acknowledged (step 5) before it has reached the disk (step 6). This is done on purpose, in order to lower the cost of the two-phase protocol, at the expense of weaker reliability guarantees. On the SCM, the committed write requests are placed in a queue, whose behavior depends on the current state of the versioning protocol. When the protocol is not running, the queue is pass-through (it just logs the IDs of the CIs that are issued and keeps track of which ones have completed). When the versioning protocol is running, the queue acts as a buffer that allows enforcing a distributed agreement among the server nodes regarding the CIs that should be included in the next version. The lock(s) associated with a CI are released as soon as the commit message is sent to all involved servers, without waiting for the commit acknowledgment. This optimization only works under the assumption that the communication channel between any client and any server is reliable and ordered, and that requests are not processed out of order on the server, at least up to the level of the SCM module. These assumptions are in line with the features of our prototype system and with most cluster communication subsystems today.
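The following sketch summarizes the client-side (CCM) commit path described above. The helper functions (send_prepare, send_commit, and so on) are placeholders we introduce for illustration; they are not the actual RIBD or Orchestra entry points, and error handling is reduced to the minimum.

#include <stdbool.h>

struct ci_batch;                        /* buffered writes + unlocks of one CI */

bool send_prepare(int server, const struct ci_batch *b);   /* steps 1-2 */
bool prepare_acked_ok(int server);                         /* step 3    */
void send_commit(int server, const struct ci_batch *b);    /* step 4    */
void release_ci_locks(const struct ci_batch *b);
void abort_ci(const struct ci_batch *b, int nservers);

int ccm_commit_ci(const struct ci_batch *b, const int *servers, int n)
{
    int i;

    /* Phase 1: every involved disk server must receive and accept the batch. */
    for (i = 0; i < n; i++)
        if (!send_prepare(servers[i], b) || !prepare_acked_ok(servers[i])) {
            abort_ci(b, n);             /* some server did not agree */
            return -1;
        }

    /* Phase 2: commit. The servers may acknowledge (step 5) before the data
     * reaches the disk (step 6), trading stronger durability for latency. */
    for (i = 0; i < n; i++)
        send_commit(servers[i], b);

    /* Locks are released as soon as the commits are sent; this assumes a
     * reliable, ordered channel and in-order processing up to the SCM. */
    release_ci_locks(b);
    return 0;
}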

Client Modules                          Server Modules

CCM   CI Manager                        SCM   CI Manager
VC    Version Coordinator               VM    Versioning Module
CRM   Redundancy Module                 SRM   Redundancy Module
CLM   Leasing Module                    SLM   Leasing Module
                                        SXM   Lock Manager

Table 6.1: Client (left) and server (right) module functions in RIBD. Client denotes block-level modules on file servers, whereas Server denotes block-level modules on disk servers.

The SCM only handles CIs and behaves as a pass-through layer for all requests outside the scope of CIs (including read, write, lock, and unlock). As explained before, some setups may use CIs only for metadata updates and use basic write requests to access blocks associated with user data. One should note that our terminology differs, in this regard, from the one traditionally employed in the database community; in particular, we do not provide isolation and durability in the strict transactional sense.

6.4.2 Versioning protocol

Figure 6.5: Transactional protocol between the client CI manager and multiple server CI managers.

Versioning involves three main modules: the version coordinator (VC), the server CI manager (SCM), and the versioning module (VM) that is in charge of local versions. To simplify our description, we assume that there is only one version coordinator, i.e. only one client node in charge of periodically triggering the protocol to create a new version. However, in a realistic setup, this role may be assigned to several clients, for better load balancing and increased fault tolerance. Figure 6.6 shows an overview of the versioning protocol in RIBD. The creation of a new version relies on a two-phase protocol driven by the VC. The first phase aims at determining a globally consistent point for the version. Upon reception of the message from the VC (step 1), the SCM of each server temporarily queues the newly committed CIs and replies to the VC with a list specifying, for each client stream, the ID of the last CI that was written to disk (step 2). The VC subsequently examines the replies from all the servers and computes a globally consistent version map, which is sent to the servers (step 3). The second phase triggers the creation of local versions on the servers according to the version map (steps 5 and 6). Before a version is actually taken on a server, CIs that are included in the version map but not yet committed to disk are extracted from the queue and flushed to stable storage. The protocol used for CIs guarantees that, for any CI included in the version map, each involved server has received the corresponding prepare request and thus the required data updates. Triggering (periodic) versions can occur in any one or more of several instances of the VC. After a global crash of the system, the VC coordinates the recovery process by communicating with the storage nodes in order to determine the most recent (and complete) version that can be used. This role (coordination upon recovery) is assigned to a single instance of the module. Besides, this module can also be used when a client application wants to trigger the creation of a new version.
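As an illustration of the first phase, the sketch below computes one possible globally consistent version map from the per-stream replies: for every client stream it takes the minimum of the last CI IDs reported by the servers. This is our own simplified policy and data layout, not necessarily the exact rule or structures used by RIBD.

#include <stdint.h>

#define MAX_STREAMS 64

struct server_report {
    uint64_t last_ci[MAX_STREAMS];   /* last CI id written to disk, per client stream */
};

/* Assumes nservers >= 1; reports[] holds one reply per disk server. */
void compute_version_map(const struct server_report *reports, int nservers,
                         uint64_t version_map[MAX_STREAMS])
{
    for (int s = 0; s < MAX_STREAMS; s++) {
        uint64_t cut = reports[0].last_ci[s];
        for (int i = 1; i < nservers; i++)
            if (reports[i].last_ci[s] < cut)
                cut = reports[i].last_ci[s];
        version_map[s] = cut;        /* all CIs of stream s up to 'cut' enter the version */
    }
}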

The VC is also in charge of a periodic garbage collection protocol, aimed at reclaiming physical storage space. The latter essentially detects versions that are too old or that did not (globally) succeed, and asks the disk servers to destroy the corresponding version of their local volume.

We use versioning to offer different recovery guarantees, depending on the requirements of (end-user) applications. As described, the versioning facility is based on CIs. Stronger recovery guarantees allow for smaller data loss during recovery roll-back, but incur the penalty of more frequent versioning. In the extreme case, each CI commit can cause a new version, resulting in no data loss at all during recovery. This batching ability is provided by an agreement protocol during the versioning protocol described above.


Figure 6.6: Versioning protocol between the versioning coordinator, all server CI managers, and server versioning modules.

A second, more involved decision is whether to acknowledge a CI as soon as all the servers have agreed on its status (but have not yet performed the related I/O operations on disk) or only after the CI is persistent. In the first case, if a global crash of the storage system occurs, it is not guaranteed that the recovery point will include the modifications, even though the updates were successfully acknowledged to the applications. The system can still recover to a previous consistent point; however, it will have acknowledged a CI that was eventually not committed. In the second case, all committed CIs are guaranteed to be part of the system after recovery. Our current prototype is based on the first strategy, as this seems to be adequate for most applications today. However, we believe that this parameter should be set at the volume granularity.

6.4.3 Replication protocol

Availability can be achieved through redundancy mechanisms, such as replication, parity, or erasure coding. In our system we use data replication across different disk servers (called network RAID [Kenchammana-Hosekotea et al., 2004] or RAIN: Redundant Array of Independent Nodes [Bohossian et al., 2001]) as the only form of redundancy. We nonetheless believe that our approach can be adapted to more complex schemes, such as parity or erasure codes. The degree of replication (i.e. the number of replicas) is configurable and independent of the reliability protocols we present. Currently, we only support RAIN-0, RAIN-1, and RAIN-10 functionality. This choice is based on current architectural and cost trends of storage devices [Gray, 2005], but is not fundamental to the operation of RIBD. The CRM and SRM modules are responsible for providing local and network replication.

6.4.4 Lock protocol

We use locks based on leases to achieve mutual exclusion. Locks are provided for block ranges and are handled by RIBD. A disk server module (SXM) manages lock requests from higher-layer modules. The information on the current state of the locks is not stored in stable storage but only in memory. Disk servers also keep track (SLM), on a per-client basis, of the locks that have been granted, and check that a client sends periodic heartbeat messages to renew its leases. When a client is found to be unresponsive, the server leasing module tells the server lock manager to release the locks and takes measures so that future requests from this client will not corrupt the system (lock and read/write requests will be discarded). Finally, a client leasing module (CLM) has the role of periodically renewing the leases associated with the locks that it holds.1
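A minimal sketch of the server-side lease check is shown below, under the assumption that lease state is kept in an in-memory table; the data structures, timeout value, and helper names are illustrative and do not correspond to the actual SLM/SXM interfaces.

#include <stdint.h>
#include <stdbool.h>

#define LEASE_TIMEOUT_MS 5000

struct client_lease {
    int      client_id;
    uint64_t last_heartbeat_ms;    /* updated whenever a renewal message arrives */
    bool     fenced;               /* once set, further requests are discarded   */
};

uint64_t now_ms(void);                        /* assumed clock source */
void slm_release_all_locks(int client_id);    /* asks the lock manager (SXM) */

void slm_check_leases(struct client_lease *leases, int nclients)
{
    uint64_t now = now_ms();

    for (int i = 0; i < nclients; i++) {
        if (leases[i].fenced)
            continue;
        if (now - leases[i].last_heartbeat_ms > LEASE_TIMEOUT_MS) {
            /* Client is considered dead: reclaim its locks and make sure
             * stale requests from it can no longer corrupt the system. */
            slm_release_all_locks(leases[i].client_id);
            leases[i].fenced = true;
        }
    }
}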

6.4.5 Liveness protocol

Temporary or permanent network partitions, although relatively unlikely, can put the system in an incorrect state and must be addressed. To operate correctly despite network partitions, the system must respect the two following invariants: (i) clients agree on the set of alive storage servers and (ii) storage servers agree on the set of alive clients. The typical solution to deal with network partitions is to use a cluster manager, which uses heartbeat messages and voting in order to establish a quorum among the cluster nodes.2 Once a majority of the nodes agree on the current members of the cluster, the remaining nodes (considered faulty) must be isolated from their well-behaving peers through a fencing service. The fencing can be enforced at the hardware level (brutal power-off via a remote power switch, or network filtering using a manageable switch fabric) or, in some cases, at the software level (through reconfiguration of the stack of the remaining nodes).

In our system, clients are not aware of each other and do not cooperate directly, which complicates handling “split brain” scenarios, because there is no mechanism allowing the servers to reach an agreement on the current set of clients. A key difference with Petal/Frangipani in this regard is that, in Frangipani, a given lock is only managed by a single lock server, whereas in our design, we may have multiple servers in charge of the same (conceptual) lock (because we have one lock for each copy of a block). Thus, the complexity stems from the fact that we have multiple managers in charge of the same logical resource. We deal with this by deploying a (distributed) cluster manager (CM) only for the server nodes, which not only makes decisions on the liveness of the member nodes but also elects a leader among them. The leader server is assigned an additional and specific network address, which is known by the clients. When the CM detects that a server node is not alive, it triggers a fencing procedure for it.

1 It is also possible to place an additional module (on top of the client replication module), which will cache locks that are not held anymore in order to increase responsiveness for future accesses by the same node. This module works in conjunction with another module on the server node (so that the locks can be reclaimed through up-calls, if necessary).
2 Studies have shown that a cluster manager can be scalable provided that the underlying network features hardware support for multicast [Vogels et al., 1998], which is the case for Ethernet.

6.5 System Roll-back and Recovery

In this section we briefly discuss system roll-back and recovery from five main types of failures: network partitions, client (file server) failures, disk server failures, disk device failures, and global failures.

6.5.1 Roll-back operation

Rolling back to any of a set of previous states is useful to quickly recover the system from malicious activity or human error. In this case, the administrator issues a roll-back command from one of the file servers. Through the server-side CI managers (SCMs), the “rollback” command reaches all the client-side CI managers (CCMs), which freeze all outstanding CI traffic to the system. Subsequently, the versioning modules on the disk servers execute the roll-back command in their local versioned storage. All disk servers thus roll back in parallel to a globally consistent version. When the local roll-back is complete in all disk servers, the CI managers resume accepting CIs and the system continues operation in the rolled-back version. The filesystems running on the file servers should not need to be restarted, since they do not keep state about the underlying storage layer. However, applications relying on them may need to be restarted.

The duration of a roll-back operation depends on the capacity of the storage within a disk server; however, processing of versioning metadata in our current prototype proceeds quickly because the metadata are placed sequentially on the disks. For a single disk server with a capacity of 16 TBytes, roll-back time has been measured at less than a minute. Since disk servers roll back in parallel, we expect approximately the same time for a system with similar capacity in each of the disk servers.
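The control flow of an administrator-initiated roll-back can be summarized by the sketch below. The helper functions are assumptions standing in for RIBD module calls (CCM freeze/resume, per-server versioning module roll-back), and the sequential loop stands in for what is in practice a parallel operation across disk servers.

/* Illustrative roll-back coordination; helper names are hypothetical. */
void freeze_ci_traffic(void);               /* client-side CI managers (CCMs) stop new CIs */
void resume_ci_traffic(void);
int  vm_rollback_local(int disk_server, unsigned version_id);   /* per-server VM */
int  wait_for_all(int ndisk_servers);       /* barrier: all local roll-backs done */

int ribd_rollback(unsigned version_id, int ndisk_servers)
{
    int i, ret = 0;

    freeze_ci_traffic();                    /* no new CIs enter the system */

    /* Disk servers roll back their local versioned volumes; in RIBD this
     * happens in parallel, here they are triggered one by one for brevity. */
    for (i = 0; i < ndisk_servers; i++)
        ret |= vm_rollback_local(i, version_id);

    ret |= wait_for_all(ndisk_servers);

    resume_ci_traffic();                    /* continue in the rolled-back version */
    return ret;
}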

6.5.2 Network failures

On the clients, the management of network partitions only involves the communication layer. When a client sends a request to a server, it arms a timer, which is destroyed when an acknowledgment is received. The expiration of a timer indicates that a server is unreachable. In this case, the client must contact the leader server to ask whether the unresponsive node currently belongs to the group of alive servers.3 There are three possibilities:

(i) The leader replies that the server is not alive, in which case the communication layer returns an error to the upper layer.4

(ii) The leader replies that the server is alive, which means that a network partition prevents the client from communicating with the server. In this case, the client considers itself disconnected from all the servers and the failed request should be acknowledged with a “disconnected” status. Upon propagation of the acknowledgment in the RIBD hierarchy, all the modules will take the necessary measures to invalidate the state information that they hold (locks, cached data and metadata). All future requests should be rejected in the same way until a proper reconnection procedure succeeds.

(iii) The leader does not reply, which means that a network partition prevents the client from communicating with some or all of the servers. In this case, the disconnection procedure is employed as well.

Since the above protocol is handled at the communication layer, these issues do not impact the rest of the I/O stack (the client-side modules only need to take proper measures for state invalidation in case they receive a notification of disconnection). A sketch of this client-side decision logic is shown below.
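The following sketch condenses cases (i)-(iii) into a single handler. The enum values and helper functions are illustrative placeholders, not the actual communication-layer interfaces of RIBD.

/* Client-side decision when a request to a server times out (illustrative). */
enum leader_answer { SERVER_DEAD, SERVER_ALIVE, NO_REPLY };

enum leader_answer ask_leader_about(int server);   /* query the elected leader */
void return_error_to_upper_layer(int server);      /* replication layer drops the server */
void self_disconnect(void);                        /* invalidate locks, cached data and metadata */

void on_request_timeout(int server)
{
    switch (ask_leader_about(server)) {
    case SERVER_DEAD:
        /* (i) the server was fenced: report the error upwards. */
        return_error_to_upper_layer(server);
        break;
    case SERVER_ALIVE:
        /* (ii) a partition separates us from a live server: fence ourselves. */
        self_disconnect();
        break;
    case NO_REPLY:
        /* (iii) we cannot even reach the leader: treat it as a disconnection. */
        self_disconnect();
        break;
    }
}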

Both clients and servers may need to be fenced in order to preserve the integrity of a system experiencing network partitions. However, a distinction should be made on the severity of the fencing method to be used, according to the role played by a “failed” node. A server must be physically fenced, because we need to make sure that all the clients always get the same view of the set of available servers. On the other hand, a client is responsible for fencing itself by voluntarily disconnecting from the set of servers. A self-fenced client does not have to be shut down or rebooted5 and can (periodically) try to reconnect to the servers. Failure to reach all the alive servers (a list can be obtained from the leader) would indicate that communication issues are still occurring and that reconnection cannot happen.

3 Note that the leader server should always be available. Using a cluster manager ensures this (if the leader crashes, a new one will be elected).
4 Typically, upon notification of this error, the replication layer will switch over and discard the failed server.
5 However, the errors returned to the (user-level) application may require application-specific treatment (possibly including a restart), but this is beyond the scope of our contribution.

6.5.3 File server/client failures

RIBD does not rely on client metadata for correct operation, and thus client failures require little maintenance. The main issues are releasing any acquired locks and guaranteeing appropriate handling of CIs.

If a client fails while holding locks, this will eventually be detected by timeouts of the leases on a subset of the servers. The server lease module on these nodes will react to the timeout by reclaiming all the locks held by the client (sending unlock requests to the server-side locking module).

The atomicity of a CI is ensured by two properties: (i) the updates associated with a CI are buffered on the client side until its closure and (ii) the 2-phase update protocol ensures that all the servers agree on whether a CI should be committed or not.

If a client fails before a CI is closed (or before any prepare request is sent), then the CI will be automatically discarded, because it will not reach any server. If a client fails after sending the prepare requests (a fraction or all of them), then the two-phase protocol is not sufficient to decide. This is a well-known issue of the 2PC protocol, which is blocking when the coordinator fails [Skeen, 1981]. Our solution, without assuming that the client may be able to recover quickly, if at all, relies on server timeouts and the versioning protocol. On each server, when the SCM module receives a prepare request, it also arms a corresponding timer. The timer is normally discarded when the associated commit request arrives. If the timer expires, the prepare request is considered suspect and further inquiry is necessary to determine whether it deserves to be committed or not. The resolution actually comes from the next round of the versioning protocol: if at least one server has received a commit request for the CI, then the latter can be committed safely. Otherwise, the CI should be discarded. This resolution process happens through the next “regular” round of the versioning protocol.
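The sketch below captures this resolution rule for a single suspect CI during the next versioning round; the helper functions and data gathering are assumptions, not the actual RIBD implementation.

#include <stdbool.h>
#include <stdint.h>

bool any_server_saw_commit(uint64_t ci_id, const int *servers, int n);   /* gathered via the VC */
void commit_ci_locally(uint64_t ci_id);
void discard_ci_locally(uint64_t ci_id);

void resolve_suspect_ci(uint64_t ci_id, const int *servers, int n)
{
    /* The 2PC coordinator (the client) may have crashed after sending only
     * some commit messages: if any server saw a commit, it is safe to commit
     * the CI everywhere, since every involved server already holds the
     * corresponding prepare request. */
    if (any_server_saw_commit(ci_id, servers, n))
        commit_ci_locally(ci_id);
    else
        discard_ci_locally(ci_id);   /* conservative: may drop a committable CI */
}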

Note that the protocol described above is not optimal because it may discard a CI that could actually be committed (if all the concerned servers have acknowledged the prepare request and the client failed just before sending the first commit request). Yet, this scenario would seldom occur in practice and thus, our approach trades recovery to a less recent point for simplicity in this regard.

6.5.4 Disk failures

Disk failures on a server can be masked from the client if enough redundant data has been stored on other available drives in the same disk server. In this case the server RAID layer will nonetheless notify the administrator about the failure. After the faulty disk is replaced, the administrator triggers the synchronization (a.k.a. “rebuild”) of the new disk by the server module. Since we are in the context of a single node, the server RAID module acts as a central controller and traditional rebuild techniques can be used.

If the disk failure cannot be masked by the disk server, the corresponding request will fail and an error will be returned to the client replication module (CRM). For a read request, the CRM can then retry the request to a replicated server node where it should hopefully succeed. If it fails, and no redundant data are available, an error will be returned to the application.
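A minimal sketch of this retry logic on the client redundancy module (CRM) side is shown below; the function names are our own illustration, not the actual module interface.

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed transport call: returns 0 on success, negative on I/O error. */
int read_from_server(int server, uint64_t block, void *buf, size_t len);

int crm_read(const int *replicas, int nreplicas,
             uint64_t block, void *buf, size_t len)
{
    for (int i = 0; i < nreplicas; i++) {
        /* Try each replica in turn; a disk failure that the disk server
         * could not mask locally shows up here as an I/O error. */
        if (read_from_server(replicas[i], block, buf, len) == 0)
            return 0;
    }
    return -EIO;    /* no redundant copy is available: propagate the error */
}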

A question is whether the faulty-disk node should be discarded eagerly (i.e. as soon as the problem is spotted) or kept on-line because a subset of the data may still be available. The first option is simple but radical.6 The second option is more complex and requires (i) keeping information on the client RAID module (in order to keep track of the blocks that are still valid on the faulty node) as well as (ii) informing (through up-calls) all the clients that the server is in a degraded mode. For simplicity, we choose to eagerly discard a server node with faulty disk(s), similar to discarding a failed disk in a local RAID-1 module.

6.5.5 Disk server failures

In the event of a disk server failure, all disks attached to this server become inaccessible. This situation is equivalent to multiple disk failures and is thus handled in a similar manner. Re-adding the disk server (after repair) to the system requires synchronizing the on-disk data and the (volatile) state of RIBD’s I/O stack with the rest of the active servers. Note that when the server restarts, the data on its disks are considered lost and all data need to be recovered from its replica.7

When a disk server (re)starts, RIBD’s modules connect to the replica disk server and retrieve their state from their replica modules. The versioning module, in particular, requires transferring live metadata from the replica disk server node. For this purpose, while the disk server is recovering, the replica node transfers all data and commands to the recovering server. When the resynchronization process is complete, the disk server may resume normal operation in the next round of the versioning protocol.

6 The fencing operation could be triggered by the defective server itself. The latter may just discard (return with an error) all the subsequent requests that it receives.
7 Using the state in the versioning metadata maps, it may be possible to avoid copying all the data from the replica. This process, however, is beyond the scope of this work and is left for future work.

6.5.6 Global failures

After a global crash of the system (for instance due to a power failure), and after all disk servers boot, the first file server that connects to the system acts as a recovery coordinator. The version coordinator (VC) runs on this file server and coordinates the recovery process by communicating with the available disk servers in order to determine the globally successful, consistent versions that are available for recovery. After all disk servers agree on the version to recover (possibly chosen by a human administrator), they roll back to that version, the system resumes normal operation, and more file servers are allowed to connect.

6.6 System Implementation

We implement all related protocols under Orchestra, an extensible block-level framework for decentralized, cluster-based storage architectures presented in Chapter 5. Orchestra is implemented as a block device driver module in the Linux 2.6 kernel, accompanied by a set of simple user-level management tools. The modular nature of Orchestra allows tailoring the functionality provided by a clustered storage system to the diverse needs of applications (e.g. encryption, versioning, branching, caching, replication, duplicate elimination). As discussed in Chapter 5, Orchestra supports sharing at the block level by providing (optional) locking and allocation facilities.

We implement all RIBD protocols related to reliability as a set of building blocks, which can be layered appropriately in an Orchestra virtual hierarchy, as shown in Figure 6.4 and Table 6.1. It is, therefore, possible to enable or disable all reliability extensions, regardless of the “functional” features of the storage system, e.g. the specific type of replication provided. Next, we comment on implementation issues of each system component.

The CI management modules (CCM, SCM) implement the core of the CI mechanism. The versioning module (VM) creates and manages remap-on-write versions of a local volume. It is placed right on top of the SRM, so that the “real” data and the metadata from all the above modules can be treated in the same fashion. It interacts with the server CI manager (SCM) in the context of the versioning protocol. Version coordination (VC) is a lightweight process and will typically be performed by a single VC that is elected once, at boot time. Another VC election can be triggered anytime the current VC fails and is removed from the system.

The redundancy modules (CRM, SRM) operate in the same way for both the disk and file server sides. Neither the SRM nor the CRM needs to run an agreement protocol: the SRM handles only local replication, whereas for the CRM, agreement is handled by the two-phase protocol of the CCM. In our current implementation, the server lock and leasing modules (SLM, SXM) are integrated in a single virtual device. Also, the client leasing module (CLM) works in conjunction with the server leasing module (SLM). These modules need to be “paired” in all RIBD configurations where locking is necessary.

6.6.1 ZeroFS (0FS)

To evaluate our approach we have extended ZeroFS (0FS), the stateless, pass-through filesystem presented previously in Chapter 5. 0FS allows shared, distributed access to files from different nodes. Unlike distributed filesystems, but similar to Frangipani [Thekkath et al., 1997], 0FS does not require explicit communication between separate instances running on different application nodes. Typically, such communication is required for agreement purposes. Instead, 0FS uses the corresponding block-level mechanisms provided by RIBD volumes. 0FS is implemented as a user-level library that provides I/O operations on files and directories. Being user-level, 0FS needs to cross the kernel boundary several times during a file operation, whereas a kernel implementation need only perform a single crossing.

The current version of 0FS does not support a (client-side) cache but relies on RIBD for any caching it may perform. Our current design and implementation of RIBD does not support client-side caching, mainly because storing state on the file servers (clients) would complicate versioning and roll-back recovery functionality. We believe that for scaling to large numbers of clients, client state should not affect the system state required for recovery purposes. Thus, existing approaches for client-side caching need to be re-thought, especially given the availability of high-throughput, low-latency interconnects, which is beyond the scope of our work. Furthermore, as shown in our results (Section 6.8), the use of a client-side cache does not always improve, but may also degrade performance, for example when the workload is not filesystem metadata-intensive (i.e. consists of few large files and directories, as in our IOzone experiments). Such workloads are typical of many parallel applications.

RIBD Component                                                      LOC

Core Orchestra Framework Driver                                   28285
Versioning Module (VM)                                             1606
Server Leasing Module (SLM) & Server Lock Manager (SXM)            2109
Server CI Manager (SCM)                                            3452
Client Version Coordinator (VC) & Client CI Manager (CCM)          2920
Server Redundancy Module (SRM) / Client Redundancy Module (CRM)    1835
Client Leasing Module (CLM)                                         924

User-level config tools                                            4729

0FS user-level lib & tools                                        10670

Total                                                             56530

Table 6.2: Lines of code (LOC) for RIBD components.

We acknowledge, however, that the lack of a client-side cache in 0FS and RIBD is a significant limitation, but we consider this outside the scope of this thesis. We intend to work on this in the future.

Finally, Table 6.2 shows the number of lines of code for each system component in our Linux 2.6 prototype. Components are divided into kernel modules, user-level tools, and 0FS. In total, RIBD consists of approximately 40K lines of kernel code and 15K lines of user-level code.

6.7 Experimental Platform

Our evaluation platform is a 24-node cluster of commodity x86-based Linux systems. All cluster nodes are equipped with dual AMD Opteron 244 CPUs and 1 GByte of RAM, while the 12 nodes acting as disk servers additionally have four 80-GB Western Digital SATA disks each. All nodes are connected with a 1 Gbit/s Ethernet network (on-board Broadcom Tigon3 NIC) through a single 24-port GigE switch (D-Link DGS-1024T). All systems run RedHat Enterprise Linux 5.0 with the default 2.6.18-8.el5 kernel.

To examine the overhead of our protocols compared to existing systems, we compare RIBD against two other systems: a cluster filesystem, the Global File System (GFS) [Preslan et al., 1999], and a parallel filesystem, PVFS2 [Carns et al., 2000; PVFS2]. These filesystems offer weaker guarantees compared to RIBD. We choose to use them because (a) they are widely used in real setups and (b) contrasting RIBD to them will reveal the cost of offering stronger guarantees based on our approach.

All data and metadata passing through RIBD’s stack in all experiments are processed by the versioning modules at the disk-server side. In Section 6.8.5 we study the overhead of the versioning agreement protocol, measuring 0FS with PostMark under four versioning frequencies. In the rest of the experiments, however, we do not capture versions during the runs, in order to demonstrate the overhead of the transaction protocol, which is in the critical I/O path. Versioning is not part of the system’s critical path (i.e. it does not run on every update), and we expect that it will usually be triggered in the order of tens of seconds or minutes to capture new globally consistent versions. Finally, we do not report degraded-mode performance, since our goal is to show versioning and CI protocol overheads under normal operation.

In the GFS setup we use GNBD and CLVM, provided by RedHat’s Enterprise Linux 5.0. Note that neither GNBD nor CLVM supports data replication. Thus, to configure a replicated (RAIN-10) setup for GFS we use the Linux software RAID driver (MD), which, however, does not maintain replica consistency and does not incur any related overheads. In this setup, the GFS reliability guarantees are weaker than those of 0FS. PVFS2 supports only striping configurations, because it uses its own networking protocol, which does not support data replication. In our setup we use the kernel module of PVFS2 for the clients and ext3 for the server-exported storage.

All three filesystems (GFS, PVFS2, and 0FS) are installed and evaluated on the same 12 cluster nodes acting as clients and use the same 12 server nodes with 48 disks in total. In the case of GFS and PVFS2 we have tuned the filesystems for optimal performance according to the vendors’ manuals. In our evaluation we use two cluster I/O benchmarks: IOZone [Norcott and Capps] and clustered PostMark [Katcher].

IOZone is a filesystem benchmark tool that generates and measures a variety of file operations. We use the distributed IOZone setup in version 3.233 to study file I/O performance for the following workloads: sequential read and write, random read and write, reverse read, and stride read. In all workloads we vary the block size between 32 KBytes and 16 MBytes. We use a different 8-GByte file for each client. The aggregate data volume for every IOZone run is 96 GBytes of data for all 12 clients.

             File Sizes   Initial Files   Transactions   Files Created   Read traffic   Write traffic
                          (per client)    (per client)   (aggregate)     (aggregate)    (aggregate)

Workload A   1-10 MB      800             4000           33588           162 GBytes     223 GBytes
Workload B   10-100 MB    100             300            3072            126 GBytes     191 GBytes

Table 6.3: Cluster PostMark workloads.

PostMark [Katcher] is a synthetic filesystem benchmark that measures the transaction rate for a workload simulating a large Internet electronic mail server. PostMark’s operation is described in detail in Section 4.4.3. The original version of PostMark is a single-node application. To use it in our setup, we modify its initialization and termination code using MPI to: (a) spawn processes on a cluster of nodes, (b) synchronize the various benchmark phases, and (c) communicate aggregate results at the end. Note that the benchmark code itself remains unchanged. In our setup, each client node/process uses a different directory on the cluster filesystem. The transactions issued by each process consist of (i) a create or delete file operation and (ii) a read or append file operation. Each transaction type and its affected files are chosen randomly. When all transactions complete, the remaining files are deleted. We use two workloads for PostMark (Table 6.3) that differ in the file size distribution (1-10 MBytes for the first and 10-100 MBytes for the second), the number of initial files per client, and the number of transactions per client. Both workloads create a sufficiently large number of files and enough read and write traffic to fill the node caches and allow for consistent results.

To facilitate interpretation of results, we use a symmetric system configuration with the 12 disk servers and 12 file servers/application clients for all filesystems. For the configuration of 0FS we have used the protocol stack shown in Figure 6.4 and two data distribution setups: (i) a striped volume with no redundancy, where all 48 disks in the cluster are striped at the disk server side (RAID-0) and the disk servers are striped at the file server side (RAIN-0), and (ii) a RAIN-10 volume with consistent RAIN-1 replication, where the disks are striped at the disk servers using RAID-0 and the disk servers are mirrored and striped at the file server side (RAIN-10). In all RAID and RAIN setups we use a chunk size of 128 KBytes.

Figure 6.7 shows the average number of write and unlock operations in each CI for workloads A and B, for different request sizes. Writes refer to individual I/Os that are performed to the disks by the disk servers.

Although the number of operations specified by the application in a CI is statically known, the number of I/Os that are eventually generated by these operations depends on the block placement. Since we measure this statistic on the disk server side, RAIN replication does not affect the number of I/Os in each CI (but it affects the number of CIs that disk servers see).

Figure 6.7: PostMark CI breakdown into write and unlock operations in RIBD: (a) workload B, (b) workload A.

6.8 Experimental Results

In this section we examine the overhead associated with our approach on a setup with multiple storage and filesystem nodes.

6.8.1 Overhead of Dependability Protocols

Figures 6.8, 6.9, and 6.10 show IOZone results for sequential, random, and mixed workloads. There are five curves on each graph, showing the three filesystems in a RAIN-0 setup and, additionally, GFS and 0FS in a RAIN-10 configuration. As mentioned, GFS in the RAIN-10 configuration does not support a replica consistency scheme, and thus has lower overhead.

In the case of sequential reads (Figure 6.8-a), 0FS outperforms GFS and is very close to PVFS2 for large block sizes. In the case of small requests, PVFS2 and GFS perform better due to their client-side cache, which aids sequential prefetching. For sequential writes (Figure 6.8-b), 0FS performs worse than PVFS2 and GFS at large block sizes, showing the overhead of the reliability protocols. In RAIN-10 and for average request sizes (512 KB-1 MB), 0FS is similar to GFS. However, when the request size increases, the increased number of acknowledgments for the two-phase protocol incurs a performance degradation of 15-20%. For small request sizes, performance is again lower due to the lack of client-side caching.

Figure 6.8: IOZone sequential read/write results: (a) sequential read, (b) sequential write.

In the case of random I/O (Figures 6.9-a and 6.9-b), 0FS performs better than both GFS and PVFS2, since the random workload minimizes the client-side caching effects. Random writes for the RAIN-10 volume (Figure 6.9-b) show practically no performance overhead compared to GFS, which has no reliability protocol. In the RAIN-0 case, the overhead of the protocol appears at the very large block sizes. Please note that, even in the case of RAIN-0, our prototype uses CIs and the two-phase protocol for writes, since this is required to guarantee atomic updates of distributed data blocks.

For the random mixed I/O results (70% reads, 30% writes) in Figure 6.10, 0FS performs slightly better than GFS and PVFS2 in the RAIN-0 configuration. In the replicated setup (RAIN-10), 0FS performs significantly better than GFS when the record size is increased. As mentioned above, this occurs because the random workload minimizes the client-side caching effects.

(a) Random Read

(b) Random Write

Figure 6.9: IOZone random read/write results.

Figure 6.10: IOZone random mixed workload (70% read - 30% write) results.

(a) Workload A (1 - 10MB files).

(b) Workload B (10 - 100MB files).

Figure 6.11: PostMark aggregate transaction rate (transactions/sec) for workloads A and B.

Finally, in PostMark (Figure 6.11) we observe that 0FS mostly outperforms PVFS2 on RAIN-0. GFS, on the other hand, maintains a lead in all cases, except at large request sizes, where 0FS performs very close to it. We attribute this performance lead of GFS in metadata-intensive workloads to the use of the client cache for metadata and to the fact that 0FS needs to cross the kernel boundary 3-4 times per filesystem operation. As the block size increases, this does not happen as often, and we are able to match GFS performance at large block sizes. The performance of GFS also indicates that it performs asynchronous logging, flushing its metadata log to disk infrequently.

Note: the transactions reported in Figure 6.11 are application-level transactions as reported by PostMark and are not related to RIBD's operation.

Figure 6.12: PostMark: Total data volume that reaches system disks.

As mentioned, 0FS does not use a client-side cache. To examine the impact of the lack of a cache on 0FS, we show in Figure 6.12 the total data volume that reaches the physical disks. First, we see that in the 0FS configurations this amount is higher by up to about 20%. Second, the amount of data for each request size remains approximately the same (not shown here). Given that GFS performs better only for smaller requests, we believe that the extra traffic is not a bottleneck that skews our results. On the other hand, the GFS cache reduces request response time, especially for small requests, and since PostMark is latency-bound, we believe that this role of the client cache explains the better GFS results for small request sizes. This effect is exacerbated by the use of TCP/IP as our communication subsystem, which not only incurs high latencies compared to state-of-the-art networks, but also exhibits problematic behavior, such as the Incast problem [Phanishayee et al., 2008].

6.8.2 Overhead of Availability

To examine the performance overhead of providing availability, we compare the performance of RAIN-0 vs. RAIN-10 with IOZone (Figures 6.8-6.9). This comparison is only applicable to GFS and 0FS. We see that performance degrades by approximately a factor of two, as expected, because of mirroring. Read operations, however, are an exception: both GFS and 0FS perform better in RAIN-10 than in RAIN-0 for large requests because of the efficient load-balancing of replica reads.
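One simple way to obtain such a read benefit is to spread read requests across the mirror legs while sending writes to all of them; the sketch below uses plain round-robin replica selection and is only an illustration, not the exact policy implemented in our prototype.

#include <stdio.h>

#define REPLICAS 2          /* RAIN-1 mirror legs */

/* Writes must go to every replica; reads can be served by any one of them.
 * A simple round-robin counter spreads read load over both legs. */
static int pick_read_replica(void)
{
    static unsigned next;
    return (int)(next++ % REPLICAS);
}

int main(void)
{
    for (int i = 0; i < 6; i++)
        printf("read %d -> mirror leg %d\n", i, pick_read_replica());
    printf("writes -> all %d mirror legs\n", REPLICAS);
    return 0;
}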

For the PostMark workloads (Figure 6.11) we also observe an average 20% performance degradation in the transaction rate, even though the available disk throughput is effectively reduced to half. Overall, we find that the performance impact of availability with 0FS in metadata-intensive workloads is in the range of 15-20%.

(a) Sequential Read for RAIN-10 Setup (b) Sequential Write for RAIN-10 Setup

Figure 6.13: IOZone sequential results for 12 clients on RAIN-10 with up to 3 threads per client.

(a) Random Read for RAIN-10 Setup (b) Random Write for RAIN-10 Setup

Figure 6.14: IOZone random results for 12 clients on RAIN-10 with up to 3 threads per client.

6.8.3 Impact of Outstanding I/Os

To examine the impact of increasing the number of outstanding I/Os in the system, we use IOZone with GFS and 0FS on a RAIN-10 configuration, and PostMark with workloads A and B on RAIN-0 and RAIN-10. In these experiments we have used multiple benchmark threads per client node, thus multiplying the load on the system and inducing more concurrent I/Os. IOZone supports a multi-threaded mode; PostMark, however, does not have such an option, so we executed multiple PostMark instances in parallel on each node.
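A minimal sketch of the kind of per-node harness we mean is shown below; it forks several instances of a single-threaded benchmark and waits for them to finish. The postmark invocation and directory names are placeholders (actual runs pass a PostMark configuration file).

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Launch N copies of a single-threaded benchmark on one client node, each in
 * its own working directory, and wait for all of them to finish. */
int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 3;   /* up to 3 instances per client */

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            char dir[64];
            snprintf(dir, sizeof(dir), "instance%d", i);
            /* Placeholder invocation: real runs pass a PostMark config file. */
            execlp("postmark", "postmark", dir, (char *)NULL);
            perror("execlp");
            _exit(1);
        }
    }
    for (int i = 0; i < n; i++)
        wait(NULL);                            /* collect all benchmark children */
    return 0;
}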

In Figures 6.13, 6.14 and 6.15 we see that increasing the number of outstanding I/Os of IOZone from one to two increases throughput for both GFS and 0FS by up to about 30%.

Figure 6.15: IOZone random mixed workload (70% read - 30% write) results for 12 clients on RAIN-10 with up to 3 threads per client.

(a) Workload A for RAIN-0 Setup

(b) Workload A for RAIN-10 Setup

Figure 6.16: PostMark aggregate transaction rate (transactions/sec) for workload A with up to 3 threads per client.

(a) Workload B for RAIN-0 Setup

(b) Workload B for RAIN-10 Setup

Figure 6.17: PostMark aggregate transaction rate (transactions/sec) for workload B with up to 3 threads per client.

We observe, however, that in the case of GFS, more than two outstanding I/Os per client can degrade performance (for sequential reads) by up to 20%.

In the case of PostMark (Figures 6.16 and 6.17), we also observe that increasing the number of outstanding I/Os increases throughput by up to about 30% in GFS and 0FS. On the other hand, PVFS2 does not seem to benefit from the increased I/O concurrency, as its performance remains the same as with a single thread. Finally, Figures 6.16 and 6.17 show that GFS does not seem to scale with more than two threads per client. 0FS results with more than two threads are not available because of stability issues in our prototype.

6.8.4 Impact on System Resources

Next, we examine the impact of our protocols on system resources by looking at CPU utilization, disk utilization, and latency.

CPU utilization: Figure 6.18 shows the CPU utilization on the server and client sides for both PostMark workloads. We show only system and wait times, as user time is always less than 5% and in most cases less than 2%. First, we observe that overall system CPU utilization remains low, up to 25% in the worst case, while I/O wait times can rise up to 40% on both the server and the client side. However, the CPU is not saturated in any of the experiments. Second, we note that all filesystems have similar system CPU utilization on the disk server side. Third, we observe that 0FS has a higher variance in CPU utilization. CPU utilization increases with smaller request sizes; request size affects CI size and, thus, how frequently RIBD uses the underlying protocols. Small request sizes incur higher CPU utilization on the disk servers. Finally, we note that on the client side, GFS incurs significantly higher CPU utilization, due to the use of a client-side cache.
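CPU system, user, and I/O-wait percentages of the kind reported above can be derived from the counters in /proc/stat; the following sketch shows one way to sample them over an interval and is not the monitoring tool we actually used.

#include <stdio.h>
#include <unistd.h>

/* Sample the aggregate CPU counters from /proc/stat twice and report the
 * fraction of time spent in user, system and iowait over the interval. */
static int read_cpu(unsigned long long v[8])
{
    FILE *f = fopen("/proc/stat", "r");
    if (!f) return -1;
    int n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
    fclose(f);
    return (n == 8) ? 0 : -1;
}

int main(void)
{
    unsigned long long a[8], b[8];
    if (read_cpu(a)) return 1;
    sleep(5);
    if (read_cpu(b)) return 1;

    unsigned long long total = 0, d[8];
    for (int i = 0; i < 8; i++) { d[i] = b[i] - a[i]; total += d[i]; }
    /* Field order: user nice system idle iowait irq softirq steal */
    printf("user %.1f%%  system %.1f%%  iowait %.1f%%\n",
           100.0 * (d[0] + d[1]) / total, 100.0 * d[2] / total,
           100.0 * d[4] / total);
    return 0;
}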

Disk latency and utilization: Figure 6.19 shows the average disk response time for PostMark. Differences in response time reflect longer seeks due to differences in block placement and access locality. 0FS incurs higher latency than GFS because it does not use a client-side cache and it cannot fence metadata accesses. Thus, data and metadata requests alternate more frequently than in GFS, resulting in longer seek times. Figure 6.21 shows average disk utilization. We see that RIBD results in up to double the disk utilization compared to GFS and PVFS2. Figure 6.20 shows the average disk throughput for read and write requests. We see that 0FS reaches higher throughput. This is mainly due to the fact that RIBD is able to send larger requests to the disk than GFS. GFS requests go through the client-side cache and are eventually issued to the disk at smaller sizes, resulting in lower throughput.
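Similarly, per-disk utilization and average response time can be derived from /proc/diskstats counters, in the same spirit as iostat; the sketch below is illustrative (the device name is a placeholder) and is not the tool used for our measurements.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct dstat { unsigned long long rd, wr, rd_ms, wr_ms, io_ms; };

/* Find the /proc/diskstats line for one whole device (e.g. "sda") and keep
 * the counters needed for utilization and average response time. */
static int read_disk(const char *dev, struct dstat *s)
{
    char line[256], name[32];
    unsigned long long f[11];
    FILE *fp = fopen("/proc/diskstats", "r");
    if (!fp) return -1;
    while (fgets(line, sizeof(line), fp)) {
        if (sscanf(line, "%*u %*u %31s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                   name, &f[0], &f[1], &f[2], &f[3], &f[4], &f[5],
                   &f[6], &f[7], &f[8], &f[9], &f[10]) < 12)
            continue;                           /* skip lines with fewer fields */
        if (strcmp(name, dev))
            continue;
        s->rd = f[0];  s->rd_ms = f[3];         /* reads completed, ms spent reading  */
        s->wr = f[4];  s->wr_ms = f[7];         /* writes completed, ms spent writing */
        s->io_ms = f[9];                        /* ms the device had I/O in flight    */
        fclose(fp);
        return 0;
    }
    fclose(fp);
    return -1;
}

int main(void)
{
    const char *dev = "sda";                    /* placeholder device name */
    struct dstat a, b;
    if (read_disk(dev, &a)) return 1;
    sleep(5);
    if (read_disk(dev, &b)) return 1;

    unsigned long long ios = (b.rd - a.rd) + (b.wr - a.wr);
    double resp_ms = ios ? ((b.rd_ms - a.rd_ms) + (b.wr_ms - a.wr_ms)) / (double)ios : 0.0;
    printf("%s: utilization %.1f%%, avg response time %.2f ms\n",
           dev, 100.0 * (b.io_ms - a.io_ms) / 5000.0, resp_ms);
    return 0;
}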

6.8.5 Impact of Versioning Protocol

To examine the impact of the versioning protocol in RIBD, we repeat the PostMark experiment with 0FS using workload B on RAIN-10 (shown in Figure 6.11(b)) and four versioning frequencies: 30 seconds, 60 seconds, 180 seconds, and never (i.e. no versioning). The garbage collector, which deletes versions to free disk space when the disk is close to filling up, also runs during the experiments. The results are shown in Figure 6.22.
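The interaction between the versioning frequency and the capacity-triggered garbage collector can be pictured with the toy loop below; capture_version() and reclaim_oldest_version() are hypothetical placeholders for the corresponding RIBD operations, and the free-space accounting is simulated.

#include <stdio.h>
#include <unistd.h>

#define VERSION_PERIOD_SEC 30      /* versioning frequencies tested: 30, 60, 180 s */
#define GC_LOW_WATERMARK   10      /* run the garbage collector below 10% free space */

static int free_pct = 14;          /* simulated free-space percentage */

/* Hypothetical placeholders for the corresponding RIBD operations. */
static void capture_version(int id)      { printf("captured global version %d\n", id); }
static void reclaim_oldest_version(void) { free_pct += 5; printf("garbage collector: one version reclaimed\n"); }

int main(void)
{
    int version = 0;
    for (int tick = 0; tick < 5; tick++) {  /* a few periods, for illustration */
        sleep(VERSION_PERIOD_SEC);          /* wait one versioning period */
        free_pct -= 3;                      /* pretend the workload consumed space */
        capture_version(++version);         /* the versioning agreement protocol runs here */
        while (free_pct < GC_LOW_WATERMARK) /* reclaim old versions when space is low */
            reclaim_oldest_version();
    }
    return 0;
}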

(a) Client CPU System Time for Workload A (b) Client CPU Wait Time for Workload A

(c) Server CPU System Time for Workload A (d) Server CPU Wait Time for Workload A

(e) Client CPU System Time for Workload B (f) Client CPU Wait Time for Workload B

(g) Server CPU System Time for Workload B (h) Server CPU Wait Time for Workload B

Figure 6.18: PostMark average CPU utilization (%) on the servers and clients for workloads A and B.

(a) Workload A (Files 1-10MB) (b) Workload B (Files 10-100MB)

Figure 6.19: PostMark average (read/write) disk latency.

(a) Workload A (Files 1-10MB) (b) Workload B (Files 10-100MB)

Figure 6.20: PostMark average total (read and write) throughput per disk.

As shown, the maximum overhead of this protocol on the PostMark transaction rate is 21.7%, for the 30-second versioning case with a block size of 64 KBytes. As the block size is increased to 1 MByte, the overhead of the protocol becomes less than 2%. We also find that lowering the versioning frequency improves performance slightly, since the versioning agreement protocol and the garbage collector run less frequently.

6.8.6 Summary

Overall, we find that although RIBD's protocols for maintaining consistency affect system behavior and consume resources, performance and scalability, especially for larger requests, remain comparable to GFS and PVFS2. We also find that performance for small requests and metadata-intensive workloads is greatly enhanced by a client-side cache. Finally, we conclude that RIBD's approach improves on existing solutions, since it offers stronger consistency guarantees with similar performance.

(a) Workload A (Files 1-10MB) (b) Workload B (Files 10-100MB)

Figure 6.21: PostMark average disk utilization across all disks.

Figure 6.22: PostMark aggregate transaction rate (transactions/sec) for workload B and four versioning frequencies.

6.9 Limitations

A limitation of our RIBD and 0FS prototypes is that they do not support a client-side cache, mainly because storing state on the file servers (clients) would complicate the versioning and roll-back recovery functionality. A client-side cache is expected to offer performance benefits in certain workloads and to reduce network and disk I/O traffic, as shown in our results. We believe, however, that for scaling to large numbers of clients, client state should not affect the system state required for recovery purposes. Thus, existing approaches for client-side caching may need to be re-thought, especially given the availability of high-throughput, low-latency interconnects. Furthermore, as shown in our results (Section 6.8), the use of a client-side cache does not always improve performance, but may also degrade it, for example when the workload is not filesystem metadata-intensive (i.e. consists of few large files and directories, as in our IOZone experiments). Such workloads are typical of many parallel applications. We believe that the design of a client-side cache involves tradeoffs between network latencies, consistency and recovery protocol overheads, and scalability. We consider this outside the scope of this thesis and we intend to work on it in the future.

Another limitation of RIBD is that, after a global failure, the most recent recovery point is the most recent global version captured. Data written after this event will be discarded. This limitation is inherent to any asynchronous consistency scheme, whether versioning or asynchronous logging, and can only be fully removed with synchronous metadata updates. Increasing the versioning or log-flush frequency mitigates the problem at a performance cost; this is essentially a performance tradeoff.

In this work we have not implemented and validated the roll-back and recovery procedures for disk server failures or global failures (described in Sections 6.5.1, 6.5.5 and 6.5.6) in real failure scenarios. Our belief that they are correct is based on the fact that versioning does not overwrite data but keeps previous versions available; since all data modifications remain on disk, the system can always roll back to a consistent captured version. Since our focus is on examining scalability and protocol overheads in the failure-free path rather than on recovery, we do not implement and evaluate the recovery procedures.

Furthermore, as discussed in Section 6.5.5, our resynchronization process for a failed disk server is simple and inefficient. Using the state in the versioning metadata maps, it should be possible to avoid copying all the data from the replica server, resulting in a much shorter rebuild time. This process, however, is beyond the scope of this work and is left as future work.

Finally, in our evaluation platform we have used a Gigabit Ethernet interconnect, which, especially with TCP/IP, exhibits high latencies and problems such as Incast [Phanishayee et al., 2008]. Since RIBD targets data-center environments, it would be more appropriate to use higher-throughput, lower-latency interconnects, such as 10-Gbit Ethernet, Infiniband [IBTA], or Myrinet [Myrinet]. We believe that providing support for additional networking platforms is straightforward due to the modular design of our framework and its well-defined APIs.

6.10 Conclusions

Cluster-based storage built from commodity components is a promising alternative for scaling the capacity and performance of future storage systems in a cost-effective manner. However, its decentralized nature poses important challenges, especially in terms of reliability and availability. Current solutions to this problem focus mostly at the file level and result in complex systems that are difficult to design, scale, and tune. In this Chapter we discuss how consistency issues can be addressed at the block level, providing a simple abstraction. Our approach, RIBD, uses CIs, a lightweight transactional mechanism, agreement, versioning, and explicit locking to address the consistency of both replicas and metadata. We discuss the associated protocols in detail and we implement RIBD in a real prototype in the Linux kernel. We evaluate our approach using a setup of 24 nodes (12 disk and 12 file servers). Overall, our contributions in this Chapter are: (a) a taxonomy of alternative approaches and mechanisms for dealing with replica and metadata consistency, (b) the design and implementation of a real prototype, and (c) the evaluation of RIBD in comparison with two popular filesystems, which shows that RIBD performs comparably to systems with weaker consistency guarantees.

Chapter 7

Conclusions and Future Work

Reality is the murder of a beautiful theory by a gang of ugly facts. —Robert L. Glass, CACM, 39(11), 1996.

7.1 Conclusions

Our hypothesis in this dissertation is that storage virtualization at the block level with support for metadata management can be used to build scalable, reliable, available and customizable storage systems with advanced features.

We have proved the hypothesis by addressing several problems in the emerging architecture of decentralized storage clusters. First, we have explored the requirements and merits of advanced block-level virtualization functions, by providing transparent, low-overhead versioning at the block level. Second, we have dealt with the issues of supporting extensions with advanced, metadata-intensive functionality at the block level, and the flexible creation of customized application-specific storage volumes. Third, we have dealt with distributed storage resource sharing and management and provided mechanisms for concurrent, scalable sharing of block-level storage. Finally, we have addressed the issues of providing availability, replica consistency and recovery after failures in a decentralized storage cluster.

Overall, our contributions in this thesis are:

• We explore the usefulness, the mechanisms, and the overheads associated with metadata-intensive functions at the block level, with our work on block-level versioning. We find that block-level versioning has low overhead and offers advantages over filesystem-based approaches, such as transparency and simplicity.

• We design, implement and evaluate a block-level storage virtualization framework with support for metadata-intensive functions. We show that our framework, Violin: (i) significantly reduces the effort to introduce new functionality in the block I/O stack of a commodity storage node, and (ii) provides the ability to combine simple virtualization functions into hierarchies with semantics that can satisfy diverse application needs.

• We examine how extensible, block-level I/O paths can be supported over decentralized storage clusters. We extend our single-node virtualization infrastructure, Violin, to Orchestra. Our extended framework allows virtual hierarchies to be distributed almost arbitrarily over application and storage nodes, providing significant flexibility in the mapping of system functionality to available resources.

• We design and implement locking and allocation mechanisms for concurrent, scalable sharing of block-level storage. Our framework supports shared dynamic metadata at the storage node side, but not at the application servers.

• We address consistency issues at the block level, providing simple abstractions and the necessary protocols. Our approach, RIBD, uses consistency intervals (CIs), a lightweight transactional mechanism, agreement, block-level versioning, and explicit locking to address the consistency of both replicas and metadata. We implement RIBD in a real prototype in the Linux kernel. In the same area, we provide a taxonomy of alternative approaches and mechanisms for dealing with replica and metadata consistency. Finally, we evaluate RIBD in comparison with two popular filesystems, showing that RIBD performs comparably to systems with weaker consistency guarantees.

7.2 Future Work

Through our work we have identified interesting areas for future work on storage virtualization and on off-loading metadata-intensive functionality to the block level. We present these topics next.

7.2.1 Data and Control Flow from Lower to Higher Layers

Traditionally, control and data in the I/O stack flow from higher to lower layers, for example from the application to the filesystem, then to the block-layer stack, and finally to the hardware device drivers. We believe that in some cases it makes sense to initiate "upcalls" from a lower layer to a higher one, such as from a lower layer in the block I/O stack to a higher block layer or even to the filesystem on top. We believe that support for upcalls would enable interesting functionality in the block-level stack (e.g. client-side consistent block-level caching), and would be an interesting topic for future research.
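A minimal sketch of the shape such an upcall interface could take is shown below; the event types and function names are ours and do not correspond to an existing kernel API.

#include <stdio.h>

/* Events a lower block layer might report upward, e.g. to a cache or filesystem. */
enum blk_event { BLK_BLOCK_INVALIDATED, BLK_MEDIA_DEGRADED };

typedef void (*blk_upcall_fn)(enum blk_event ev, unsigned long long block, void *priv);

/* The lower layer keeps a small table of registered upper-layer callbacks. */
#define MAX_UPCALLS 8
static struct { blk_upcall_fn fn; void *priv; } upcalls[MAX_UPCALLS];
static int n_upcalls;

static int blk_register_upcall(blk_upcall_fn fn, void *priv)
{
    if (n_upcalls >= MAX_UPCALLS) return -1;
    upcalls[n_upcalls].fn = fn;
    upcalls[n_upcalls].priv = priv;
    n_upcalls++;
    return 0;
}

/* Called by the lower layer when something the upper layers care about happens. */
static void blk_raise_upcall(enum blk_event ev, unsigned long long block)
{
    for (int i = 0; i < n_upcalls; i++)
        upcalls[i].fn(ev, block, upcalls[i].priv);
}

/* Example consumer: a higher layer invalidating its cached copy of a block. */
static void cache_listener(enum blk_event ev, unsigned long long block, void *priv)
{
    (void)priv;
    if (ev == BLK_BLOCK_INVALIDATED)
        printf("upper layer: dropping cached block %llu\n", block);
}

int main(void)
{
    blk_register_upcall(cache_listener, NULL);
    blk_raise_upcall(BLK_BLOCK_INVALIDATED, 4711);   /* e.g. a remote node modified block 4711 */
    return 0;
}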

7.2.2 Block-level Caching and Consistent Client-side Caches

Data and metadata caching has traditionally been placed at the filesystem level. The various caches, such as the buffer cache, the inode cache, and the directory entry cache, have been designed for the purposes of a filesystem and are normally accessible only by it. This cache placement, in combination with block-level virtualization, creates problems with data and metadata caching on the storage node side, since data no longer needs to pass through the filesystem layers and thus has no access to the built-in OS caching mechanisms. We believe that exploring such issues is an interesting topic for future work, especially since we expect block-level caching to significantly boost the performance of virtualized block-level storage systems. Finally, we believe that caching at the client (application server) side is worthwhile to explore; however, this is a more complex topic and probably requires support for "upcalls" in order to provide consistency.
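As an illustration of the kind of block-level cache we have in mind, the sketch below implements a small direct-mapped block cache whose entries are dropped when an invalidation (e.g. delivered via an upcall) arrives for a block modified elsewhere; it is a user-level toy, not a proposed design.

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE  4096
#define CACHE_SLOTS 1024             /* direct-mapped, one 4-KByte block per slot */

struct cache_slot {
    unsigned long long block;        /* block number held in this slot */
    int valid;
    char data[BLOCK_SIZE];
};
static struct cache_slot cache[CACHE_SLOTS];

/* Return cached data for a block, or NULL on a miss. */
static char *cache_lookup(unsigned long long block)
{
    struct cache_slot *s = &cache[block % CACHE_SLOTS];
    return (s->valid && s->block == block) ? s->data : NULL;
}

/* Install a block read from the storage nodes. */
static void cache_insert(unsigned long long block, const char *data)
{
    struct cache_slot *s = &cache[block % CACHE_SLOTS];
    s->block = block;
    s->valid = 1;
    memcpy(s->data, data, BLOCK_SIZE);
}

/* Called when another client has modified the block (e.g. via an upcall);
 * dropping the copy keeps the cache consistent without keeping local state
 * that recovery would have to track. */
static void cache_invalidate(unsigned long long block)
{
    struct cache_slot *s = &cache[block % CACHE_SLOTS];
    if (s->valid && s->block == block)
        s->valid = 0;
}

int main(void)
{
    char buf[BLOCK_SIZE] = "hello";
    cache_insert(7, buf);
    printf("block 7 cached: %s\n", cache_lookup(7) ? "yes" : "no");
    cache_invalidate(7);
    printf("after invalidation: %s\n", cache_lookup(7) ? "yes" : "no");
    return 0;
}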

7.2.3 Support for High-Availability of Client-side Metadata

The systems we have explored in this thesis provide support for redundant paths and resources between each client and the storage node side. However, we have not explored redundancy for providing high availability at the client side, that is, support for active replication of requests between two clients, so that in the case of a single client failure the system continues to be available. We believe that this is an interesting topic for future work.

7.2.4 Security for Virtualized Block-level Storage

The topic of security mechanisms for decentralized block-level storage was not within the scope of this thesis; however, we believe it is an interesting topic for future work. Significant work on back-end storage security support has been presented in NASD [Gibson et al., 1998] with object-level capabilities. We believe, however, that it is worthwhile to explore whether the same concepts are applicable to the finer granularity of block-level storage.

7.2.5 New methods for ensuring metadata consistency and persistence

Block-level metadata consistency and persistence are critical to preventing data loss or corruption; however, providing such guarantees typically incurs high performance overheads. The problem is currently solved using synchronous metadata consistency schemes (logging or versioning). To mitigate this impact, high-performance non-volatile memory (e.g. NVRAM or battery-backed DRAM) is used in high-end, expensive systems. We believe that new approaches to addressing this problem are an important topic for future work.
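The cost of synchronous metadata consistency comes from forcing every metadata update to stable storage before it takes effect, as in the minimal write-ahead sketch below; the record layout and log file name are illustrative only.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/* One block-level metadata update, e.g. a remapping entry. */
struct meta_update {
    unsigned long long logical_block;
    unsigned long long physical_block;
};

/* Append the record and fsync() before the update becomes visible in memory;
 * this synchronous step is the cost that NVRAM is used to hide in high-end systems. */
static int log_update_sync(int log_fd, const struct meta_update *u)
{
    if (write(log_fd, u, sizeof(*u)) != (ssize_t)sizeof(*u))
        return -1;
    return fsync(log_fd);
}

int main(void)
{
    int fd = open("meta.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct meta_update u = { .logical_block = 100, .physical_block = 8421 };
    if (log_update_sync(fd, &u) == 0)
        printf("metadata update is persistent; safe to apply in memory\n");

    close(fd);
    return 0;
}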

7.3 Lessons Learned

During the course of this thesis we have been involved in the design, implementation, and evaluation of several storage system prototypes. In our experience, decentralized storage architectures have great potential for scaling storage capacity and performance in a cost-effective manner. However, the lack of central controllers poses important dependability challenges. For this reason, few existing products are currently based on this architecture [IBM Corp., 2008b; Hoffman, 2007; HYDRAstor, 2008].

Through our work we have realized that metadata at the block level are a key enabler of advanced features, and we believe that they will be increasingly used in future systems. Systems that use block-level metadata are already emerging. Examples range from flash-based SSDs, which need large amounts of dynamic metadata to remap blocks in order to achieve performance and wear-leveling, to the use of block-level metadata to support features such as thin provisioning [Compellent, 2008; IBM Corp., 2008c], snapshots [IBM Corp., 2008d], volume management [EVMS; GEOM; Teigland and Mauelshagen, 2001], data deduplication [Quinlan and Dorward, 2002], block remapping [English and Alexander, 1992; Wang et al., 1999; Wilkes et al., 1996; Sivathanu et al., 2003], and logging [Wilkes et al., 1996; Stodolsky et al., 1994].

Metadata at the block level is a fundamental change with major repercussions for the traditional I/O stack. The need for consistency, coherency, and caching mechanisms for block-level metadata will result in duplicating filesystem-level mechanisms at the block level. If the file system and block-level storage evolve separately on each side of the block API, the complexity of the storage hierarchy will increase significantly. In this case, the different mechanisms, assumptions, and optimizations in the two separate layers (file and block) will lead to limited efficiency, limited dependability, and behavior that is hard to understand. Instead, we believe that the layers in the I/O stack, whether at the file or the block level, will need to operate in harmony, being aware of each other's behavior, optimizations, and assumptions. For example, if the block layer performs block remapping without being aware of the assumptions of the file layer, the latter's data placement optimizations will become irrelevant, with significant performance implications, as we have also seen in this thesis.

An alternative approach is to eliminate block-level virtualization and perform all virtualization at the filesystem level, as has been done with ZFS [Bonwick and Moore, 2008] and GPFS [Schmuck and Haskin, 2002]. However, this approach leads to monolithic systems that become complex and unmanageable as they continue to incorporate more features. Furthermore, such systems will not work effectively on virtualized block storage, because some of their optimizations (e.g. data placement) become irrelevant. Finally, our experience tells us that one-size-fits-all solutions rarely work as expected, especially when they are controlled by a single vendor.

In a different direction, semantically-smart disk systems have explored the possibility of the block-level subsystem inferring essential information about the higher layers, such as I/O access patterns, block liveness, or file system structure and block relationships [Arpaci-Dusseau et al., 2006]. Such information would mitigate the problem and enable off-loading of many useful tasks to the block level, e.g. secure deletion, data placement optimizations, and garbage collection. However, this gray-box approach works well only in cases where hints are adequate and there is no need for precise information. For example, making assumptions about file system block liveness at the block level incurs a risk of false inference that may lead to data loss.

Our view is that maintaining the traditional block API will continue to hinder the off-loading of virtualization functions to the block level. Virtual block devices can certainly perform several optimizations by observing the file layer's behavior, as has been demonstrated in the semantically-smart disk approach. However, in other cases, such as block liveness, there is a need for exact, accurate information from the file layer; such information would enable off-loading of significant functionality, such as secure block deletion and disk space savings in the case of versioning and thin provisioning. Thus, in order to facilitate off-loading and better integration of the file and block layers, we believe that there is a need for a new, redesigned storage stack standard with richer APIs between I/O layers.
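As an example of the kind of richer API we mean, the sketch below adds explicit block-liveness calls with which the file layer could inform the block layer that a block has been allocated or freed; the function names are hypothetical and not part of any existing interface.

#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS 64

/* Per-block liveness map maintained from explicit hints, instead of being
 * inferred from observed traffic as in gray-box approaches. */
static bool live[NUM_BLOCKS];

/* Hypothetical additions to the block API: the file layer declares which
 * blocks are in use and which have been freed. */
static void blk_mark_allocated(unsigned b) { live[b] = true; }

static void blk_mark_freed(unsigned b)
{
    live[b] = false;
    /* With exact liveness information the block layer can act immediately,
     * e.g. securely erase the block or reclaim its space in older versions. */
    printf("block %u freed: eligible for secure deletion / space reclamation\n", b);
}

int main(void)
{
    blk_mark_allocated(12);
    blk_mark_allocated(13);
    blk_mark_freed(12);

    unsigned in_use = 0;
    for (unsigned b = 0; b < NUM_BLOCKS; b++)
        if (live[b]) in_use++;
    printf("%u of %u blocks live according to file-layer hints\n", in_use, NUM_BLOCKS);
    return 0;
}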

Finally, we believe that this is a period of change for personal storage, as people rely more and more on stored digital information. Since individuals (and small businesses) do not have the resources to efficiently manage data storage, there is a question whether storage will become mostly a (commodity) utility provided by storage providers, or whether, for social (e.g. concerns about the privacy of personal data) and/or technical (e.g. network performance and availability) reasons, the current trend of storing data at the owner's physical space will continue. How this question will be answered depends both on technology and on social implications, and it will be interesting to see how issues from both sides will affect the evolution of storage.

Bibliography

Michael Abd-El-Malek, II William V. Courtright, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, and Jay J. Wylie. Ursa Minor: Versatile Cluster-Based Storage. In Proceedings of the 4th USENIX Conference on File and Storage Technology (FAST ’05), Berkeley, CA, USA, December 2005. USENIX Association.

Anurag Acharya, Mustafa Uysal, and Joel Saltz. Active Disks: Programming Model, Algorithms and Evaluation. In Proceedings of the Eighth International Conference on Architectural Support for Pro- gramming Languages and Operating Systems (ASPLOS), pages 81–91, San Jose, California, October 3–7, 1998. ACM SIGARCH, SIGOPS, SIGPLAN, and the IEEE Computer Society.

Marcos K. Aguilera, Susan Spence, and Alistair Veitch. Olive: Distributed point-in-time branching storage for real systems. In Proceedings of the 3rd Symposium on Networked System Design and Implementation (NSDI ’06), 2006.

Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, and Christos Karamanolis. Sinfonia: A New Paradigm for Building Scalable Distributed Systems. In Proc. of 21st ACM SIGOPS Sympo- sium on Operating Systems Principles (SOSP’07), pages 159–174, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-591-5.

M. Ajtai, R. Burns, R. Fagin, D. Long, and L. Stockmeyer. Compactly Encoding Unstructured Inputs with Differential Compression. Journal of the ACM, 39(3), 2002.

Cuneyt Akinlar and Sarit Mukherjee. A Scalable Distributed Multimedia File System Using Network Attached Autonomous Disks. In IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pages 180–, 2000.


K. Amiri, D. Petrou, G. Ganger, and G. Gibson. Dynamic function placement for data-intensive cluster computing. In Proceedings of the USENIX Annual Technical Conference, San Diego, CA, USA, June 2000. USENIX Association.

Khalil Amiri, Garth A. Gibson, and Richard Golding. Highly Concurrent Shared Storage. In Proceedings of 20th International Conference on Distributed Computing Systems (ICDCS ’00), pages 298–307, Taipei, Taiwan, R.O.C., April 2000. IEEE Computer Society.

Eric Anderson. Results of the 1995 SANS Survey. ;login, The Usenix Association Newsletter, 20(5), October 1995.

Eric Anderson, Michael Hobbs, Kimberly Keeton, Susan Spence, Mustafa Uysal, and Alistair Veitch. Hippodrome: Running Circles Around Storage Administration. In Proceedings of the FAST ’02 Conference on File and Storage Technologies (FAST-02), pages 175–188, Berkeley, CA, January 28– 30 2002. USENIX Association.

Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe, David A. Patterson, Drew S. Roselli, and Randolph Y. Wang. Serverless Network File Systems. ACM Transactions on Computer Systems,14 (1):41–79, February 1996.

Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Lakshmi N. Bairavasundaram, Timothy E. Denehy, Florentina I. Popovici, Vijayan Prabhakaran, and Muthian Sivathanu. Semantically-Smart Disk Systems: Past, Present, and Future. Sigmetrics Performance Evaluation Review (PER), 33(4): 29–35, March 2006.

C. Attanasio, M. Butrico, C. Polyzois, S. Smith, and J. Peterson. Design and implementation of a recoverable virtual shared disk. Technical Report IBM Research Report RC 19843, IBM T. J. Watson Research Center, Yorktown Heights, NY, 1994.

Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, and Rolf Neugebauer. Xen and the art of virtualization. In Proc. of the nineteenth ACM symposium on Operating systems principles (SOSP19), pages 164–177, October 2003.

Richard Barker and Paul Massiglia. Understanding Storage Area Networks. John Wiley & Sons, Inc., 2002.

Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX), pages 117–128, November 2000.

Stephen J. Bigelow. Application-aware storage promises intelligence and automation. http://searchstorage.techtarget.com/ generic/ 0,295582,sid5 gci1267371,00.html, August 2007.

Vasken Bohossian, Chenggong C. Fan, Paul S. LeMahieu, Marc D. Riedel, Jehoshua Bruck, and Lihao Xu. Computing in the RAIN: A Reliable Array of Independent Nodes. IEEE Trans. Parallel Distrib. Syst., 12(2):99–114, 2001.

Jeff Bonwick and Bill Moore. ZFS: The Last Word in File Systems. http://opensolaris.org/ os/ commu- nity/ / docs/ zfs last.pdf, 2008.

M. R. Brown, K. N. Kolling, and E. A. Taft. The Alpine File System. ACM Transactions in Computing Systems, 3(4):261–293, 1985. ISSN 0734-2071.

M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. Technical Report SRC-RR-124, HP Labs, 1994.

Philip H. Carns, Walter B. Ligon III, Robert B. Ross, and Rajeev Thakur. PVFS: A Parallel File System For Linux Clusters. In Proc. of the 4th Annual Linux Showcase and Conference, 2000.

S. Chutani, O. T. Anderson, M. L. Kazar, B. W. Leverett, W. A. Mason, and R. N. Sidebotham. The Episode File System. In Proc. of the USENIX Winter 1992 Technical Conference, pages 43–60, San Fransisco, CA, USA, 1992.

Russell Coker. Bonnie++ I/O Benchmark. http:// www.coker.com.au/ bonnie++.

Compellent. Storage Center Data Sheet. http://www.compellent.com/∼/media/ com/ Files/ Datasheets/DS FT 021908.ashx, 2008.

W. V. Courtright II, G. A. Gibson, M. Holland, and J. Zelenka. RAIDframe: Rapid Prototyping for Disk Arrays. Proceedings of the 1996 Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), 24(1):268–269, May 1996.

Landon P. Cox, Christopher D. Murray, and Brian D. Noble. Pastiche: Making Backup Cheap and Easy. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI-02), Berkeley, CA, December 9–11 2002. The USENIX Association.

Flaviu Cristian and Christof Fetzer. The Timed Asynchronous Distributed System Model. IEEE Trans- actions on Parallel and Distributed Systems, 10(6):642–657, 1999.

Miguel de Icaza, Ingo Molnar, and Gadi Oxman. The Linux RAID-1,-4,-5 code. In LinuxExpo, April 1997.

Wiebren de Jonge, M. Frans Kaashoek, and Wilson C. Hsieh. The Logical Disk: A New Approach to Improving File Systems. In Proc. of 14th SOSP, pages 15–28, 1993.

T. Denehy, A. Arpaci-Dusseau, and R. Arpaci-Dusseau. Bridging the Information Gap in Storage Pro- tocol Stacks. In Proceedings of the USENIX Annual Technical Conference (USENIX ’02), pages 177–190, June 2002.

Murthy Devarakonda, Bill Kish, and Ajay Mohindra. Recovery in the Calypso file system. ACM Transactions in Computing Systems, 14(3):287–310, 1996. ISSN 0734-2071.

EMC Enginuity. EMC. Enginuity: The Storage Platform Operating Environment (White Paper). http://www.emc.com/pdf/techlib/c1033.pdf.

EMC Symmetrix. EMC: Introducing RAID 5 on Symmetrix DMX. http://www.emc.com/ products/ systems/ enginuity/ pdf/ H1114 Intro raid5 DMX ldv.pdf.

Robert English and Stephnov Alexander. Loge: A Self-Organizing Disk Controller. In Proceedings of the Winter 1992 USENIX Conference, Berkeley, CA, 1992. The USENIX Association.

Sadik C. Esener, Mark H. Kryder, William D. Doyle, Marvin Keshner, Masud Mansuripur, and David A. Thompson. WTEC Panel Report on The Future of Data Storage Technologies. International Tech- nology Research Institute. World Technology (WTEC) Division, June 1999.

EVMS. Enterprise Volume Management System. http://evms.sourceforge.net.

Fibre Channel. Fibre Channel Industry Association. Executive Summary. http://www.fibrechannel.org/ OVERVIEW/ index.html.

Michail D. Flouris and Angelos Bilas. Clotho: Transparent Data Versioning at the Block I/O Level. In 12th NASA Goddard & 21st IEEE Conference on Mass Storage Systems and Technologies (MSST2004), April 2004.

Michail D. Flouris and Angelos Bilas. Violin: A Framework for Extensible Block-level Storage. In Proceedings of 13th IEEE/NASA Goddard (MSST2005) Conference on Mass Storage Systems and Technologies, Monterey, CA, April 11–14 2005.

Michail D. Flouris, Stergios V. Anastasiadis, and Angelos Bilas. Block-level Virtualization: How far can we go? In Proceedings of Second IEEE-CS International Symposium on Global Data Interoperability - Challenges and Technologies, Sardinia, Italy, June 20–24 2005.

Michail D. Flouris, Renaud Lachaize, and Angelos Bilas. Using Lightweight Transactions and Snap- shots for Fault-Tolerant Services Based on Shared Storage Bricks. In Proc. of the International Workshop on High Performance I/O Techniques and Deployment of Very Large Scale I/O Systems (HiperIO ’06), September 2006.

Michail D. Flouris, Renaud Lachaize, and Angelos Bilas. Orchestra: Extensible Block-level Support for Resource and Data Sharing in Networked Storage Systems. In Proc. of the 14th IEEE Interna- tional Conference on Parallel and Distributed Systems (ICPADS’08), Melbourne, Victoria, Australia, December 2008.

Gartner Group. Total Cost of Storage Ownership – A User-oriented Approach, September 2000.

GEOM. FreeBSD: GEOM Modular Disk I/O Request Transformation Framework. http://kerneltrap.org/node/view/454.

Garth A. Gibson and John Wilkes. Self-managing network-attached storage. ACM Computing Surveys, 28(4es):209–209, December 1996. ISSN 0360-0300.

Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang, Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, and Jim Zelenka. A Cost-Effective, High-Bandwidth Storage Architecture. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 92–103, San Jose, California, October 3–7, 1998. ACM SIGARCH, SIGOPS, SIGPLAN, and the IEEE Computer Society.

Garth R. Goodson, Jay J. Wylie, Gregory R. Ganger, and Michael K. Reiter. Efficient Consistency for Erasure-coded Data via Versioning Servers. Technical Report CMU-CS-03-127, Carnegie Mellon University, April 2003.

Jim Gray. Greetings from a Filesystem User. Keynote Address at the 4th USENIX FAST Conference, December 2005.

Jim Gray. What Next? A Few Remaining Problems in Information Technology (Turing Lecture). In ACM Federated Computer Research Conferences (FCRC), May 1999.

Jim Gray. Storage Bricks Have Arrived. Invited Talk at the First USENIX Conference on File And Storage Technologies (FAST ’02), 2002.

R. Grimm, W.C. Hsieh, M.F. Kaashoek, and W. de Jonge. Atomic Recovery Units: Failure Atomic- ity For Logical Disks. In Proceedings of the 16th IEEE International Conference on Distributed Computing Systems (ICDCS), pages 26–37. IEEE Computer Society, 1996.

R. Hagmann. Reimplementing the Cedar file system using logging and group commit. In Proc. of the 11th ACM Symposium on Operating Systems Principles (SOSP’87), pages 155–162, New York, NY, USA, 1987. ACM. ISBN 0-89791-242-X.

J. H. Hartman, I. Murdock, and T. Spalink. The Swarm Scalable Storage System. In Proceedings of the 19th IEEE International Conference on Distributed Computing Systems (ICDCS ’99). IEEE Computer Society, June 1999.

John H. Hartman and John K. Ousterhout. Zebra: A Striped Network File System. Technical report, University of California at Berkeley, Berkeley, CA, USA, 1992.

J.S. Heidemann and G.J. Popek. File System Development with Stackable Layers. ACM Transactions on Computer Systems, 12(1):58–89, February 1994.

Andy Hisgen, Andrew Birrell, Charles Jerian, Timothy Mann, and Garret Swart. New-value logging in the Echo replicated file system. Technical Report 104, Xerox, Palo Alto CA (USA), 1993.

Dave Hitz, James Lau, and Michael Malcolm. File System Design for an NFS File Server Appliance. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 235–246, San Francisco, CA, USA, 17–21 1994. ISBN 1-880446-58-8.

Patrick Hoffman. NEC Storage Platform Reduces Cost, Complexity. http://www.eweek.com/ article2/ 0,1759,2103110,00.asp, March 2007.

Sam Hopkins and Brantley Coile. AoE (ATA over Ethernet). http://www.coraid.com/ documents/ AoEr10.txt.

HP OpenView. Hewlett-Packard: OpenView Storage Area Manager. http://h18006. www1.hp.com/ products/ storage/ software/ sam/ index.html.

Chuang hue Moh and Barbara Liskov. Timeline: A high performance archive for a distributed object store. In In 1st Symposium on Networked Systems Design and Implementation (NSDI ’04), 2004.

Norman C. Hutchinson, Stephen Manley, Mike Federwisch, Guy Harris, Dave Hitz, Steven Kleiman, and Sean O’Malley. Logical vs. Physical File System Backup. In Proc. of the 3rd USENIX Symposium on Operating Systems Design and Impl. (OSDI99), February 1999.

HYDRAstor. NEC Corp. http://www.necam.com/HYDRAstor, 2008.

IBM Corp. Collective Intelligent Bricks (CIB) Project. http://www.almaden.ibm.com/ StorageSystems/ autonomic storage/ clockwork/ index.shtml, 2008a.

IBM Corp. XIV Storage. http://www.xivstorage.com, 2008b.

IBM Corp. Delivering the Thin Provisioning Advantage with XIV’s Nextra Architecture White Pa- per. http://www.xivstorage.com/ materials/ white papers/ nextra thin provisioning white paper.pdf, 2008c.

IBM Corp. XIV Nextra Snapshot Implementation White Paper. http://www.xivstorage.com/ materials/ white papers/ nextra snapshot white paper.pdf, 2008d.

IBTA. InfiniBand Trade Association. About InfiniBand. http://www.infinibandta.org/ about/ about infiniband/.

IDC. International Data Corporation. The Diverse and Exploding Digital Universe - An Updated Forecast of Worldwide Information Growth Through 2011. White Paper - sponsored by EMC, http://www.emc.com/ leadership/ digital-universe/ expanding-digital-universe.htm, March 2008.

Iometer. The I/O Performance Analysis Tool. http://www.iometer.org.

iSCSI. IETF IP Storage Working Group. iSCSI Internet Draft. http://www.ietf.org/ internet-drafts/ draft-ietf-ips-iscsi-20.txt.

James E. Johnson and William A. Laing. Overview of the Spiralog File System. Digital Technical Journal, 8(2):5–14, 1996. ISSN 0898-901X.

C. Jurgens. Fibre Channel: A Connection to the Future. IEEE Computer, 28(8):88–90, 1995.

M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Hector Briceno, Russel Hunt, David Mazieres, Thomas Pinckney, Robert Grimm, John Janotti, and Kenneth Mackenzie. Application Performance and Flexibility on Exokernel Systems. In Symposium on Operating Systems Principles (SOSP ’97), pages 52–65, 1997.

Jeffrey Katcher. PostMark: A New File System Benchmark. http://www.netapp.com/tech library/ 3022.html.

Kimberly Keeton and John Wilkes. Automatic design of dependable data storage systems. In Proc. of Workshop on Algorithms and Architectures for Self-managing Systems, pages 7–12, San Diego, CA, June 2003.

Kimberly Keeton, David A. Patterson, and Joseph M. Hellerstein. A Case for Intelligent Disks (IDISKs). SIGMOD Record, 27(3):42–52, 1998.

Deepak R. Kenchammana-Hosekote, Richard A. Golding, Claudio Fleiner, and Omer A. Zaki. The Design and Evaluation of Network RAID protocols. Research report RJ 10316, IBM Almaden Research Center, March 2004.

Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3):263–297, August 2000.

Orran Krieger and Michael Stumm. HFS: a performance-oriented flexible file system based on building- block compositions. ACM Transactions in Computing Systems, 15(3):286–321, 1997. ISSN 0734- 2071.

Nancy P. Kronenberg, Henry M. Levy, and William D. Strecker. Vaxcluster: a closely-coupled distributed system. ACM Transactions on Computer Systems, 4(2), 1986.

John Kubiatowicz, David Bindel, Yan Chen, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westly Weimer, Christopher Wells, and Ben Zhao. OceanStore: An Architecture for Global-scale Persistent Storage. In Proceedings of ACM ASPLOS, November 2000.

Neal Leavitt. Application Awareness Makes Storage More Useful. IEEE Computer, 41(7):11–13, 2008. ISSN 0018-9162.

Edward K. Lee and Chandramohan A. Thekkath. Petal: Distributed Virtual Disks. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), Computer Architecture News, volume 24:Special, pages 84–93. ACM SIGARCH/SIGOPS/SIGPLAN, October 1996.

Greg Lehey. The Vinum Volume Manager. In Proceedings of the FREENIX Track (FREENIX-99), pages 57–68, Berkeley, CA, June 6–11 1999. USENIX Association.

Michael Lesk. How Much Information Is There In the World? http://www.lesk.com/ mlesk/ ksg97/ ksg.html, 1997.

Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson, and Liuba Shrira. Replication in the Harp File System. In SOSP ’91: Proceedings of the thirteenth ACM symposium on Operating systems principles, pages 226–238, New York, NY, USA, 1991. ACM. ISBN 0-89791-447-3.

C. R. Lumb, R. Golding, and G. R. Ganger. D-SPTF: Decentralized Request Distribution in Brick-Based Storage Systems. In Proceedings of the 11th ACM ASPLOS Conference, Boston, MA, USA, October 2004.

Peter Lyman and Hal R. Varian. How Much Information. http://www.sims.berkeley.edu/ how-much- info-2003, 2003.

John MacCormick, Nick Murphy, Marc Najork, Chandramohan A. Thekkath, and Lidong Zhou. Box- wood: Abstractions as the Foundation for Storage Infrastructure. In Proc. of the 6th Symposium on Operating Systems Design and Implementation (OSDI-04). USENIX, December 6–8 2004.

J. Menon, D. A. Pease, R. Rees, L. Duyanovich, and B. Hillsberg. IBM Storage Tank - A heterogeneous scalable SAN file system. IBM Systems Journal, 42(2):250–267, 2003.

M. Mesnier, G. R. Ganger, and E. Riedel. Object-Based Storage. IEEE Communications Magazine,41 (8):84–90, August 2003.

James G. Mitchell and Jeremy Dion. A comparison of two network-based file servers. Communications of the ACM, 25(4):233–245, 1982. ISSN 0001-0782.

Joe Moran, Bob Lyon, and Legato Systems Incorporated. The Restore-o-Mounter: The File Motel Revisited. In Proc. of USENIX ’93 Summer Technical Conference, June 1993.

David Mosberger and Larry L. Peterson. Making Paths Explicit in the Scout Operating System. In Proc. of the 2nd USENIX Symposium on Operating Systems Design and Impl. (OSDI96), October 28–31 1996.

S. J. Mullender and A. S. Tanenbaum. Immediate files. Software Practice and Experience, 14(4): 365–368, 1984.

Myrinet. Myricom, Inc., Myrinet Overview. http://www.myri.com/ myrinet/ overview/.

Nils Nieuwejaar and David Kotz. The Galley Parallel File System. In Proc. of the 10th ACM Interna- tional Conference on Supercomputing, pages 374–381, Philadelphia, PA, 1996. ACM Press.

William D. Norcott and Don Capps. IOzone Filesystem Benchmark. http://www.iozone.org.

Michael A. Olson. The design and implementation of the inversion file system. In Proc. of USENIX ’93 Winter Technical Conference, January 1993.

S. W. O’Malley and L. L. Peterson. A Dynamic Network Architecture. ACM Transactions on Computer Systems, 10(2):110–143, May 1992.

Walter Oney. Programming the Microsoft Windows Driver Model, Second Edition. http://www.microsoft.com/ mspress/ books/ 6262.asp.

D. Patterson. The UC Berkeley ISTORE Project: bringing availability, maintainability, and evolutionary growth to storage-based clusters. http://roc.cs.berkeley.edu, January 2000.

R. Hugo Patterson, Stephen Manley, Mike Federwisch, Dave Hitz, Steve Kleiman, and Shane Owara. SnapMirror: File-System-Based Asynchronous Mirroring for Disaster Recovery. In Proceedings of the FAST ’02 Conference on File and Storage Technologies (FAST-02), pages 117–130, Berkeley, CA, January 28–30 2002. USENIX Association.

Amar Phanishayee, Elie Krevat, Vijay Vasudevan, David G. Andersen, Gregory R. Ganger, Garth A. Gibson, and Srinivasan Seshan. Measurement and Analysis of TCP Throughput Collapse in Cluster- based Storage Systems. In Proc. of the 6th USENIX Conference on File and Storage Technologies (FAST08), pages 175–188, 2008.

Rob Pike, Dave Presotto, Ken Thompson, and Howard Trickey. Plan 9 from bell labs. In Proc. of the Summer UKUUG Conference, 1990.

Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken Thompson, Howard Trickey, and Phil Winterbottom. Plan 9 from Bell Labs. Computing Systems, 8(3):221–254, Summer 1995.

Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Analysis and Evo- lution of Journaling File Systems. In Proceedings of the USENIX Annual Technical Conference (USENIX ’05), pages 105–120, Anaheim, CA, April 2005a.

Vijayan Prabhakaran, Lakshmi N. Bairavasundaram, Nitin Agrawal, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. IRON File Systems. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP ’05), pages 206–220, Brighton, United Kingdom, October 2005b.

K. W. Preslan, A. Barry, J. Brassow, M. Declerk, A. J. Lewis, A. Manthei, B. Marzinski, E. Nygaard, S. Van Oort, D. Teigland, M. Tilstra, S. Whitehouse, and M. O’Keefe. Scalability and Recovery in a Linux Cluster File System. In Proceedings of the 4th Annual Linux Showcase and Conference, College Park, Maryland, USA, October 2000.

Kenneth W. Preslan, Andrew P. Barry, Jonathan E. Brassow, Grant M. Erickson, Erling Nygaard, Christopher J. Sabol, Steven R. Soltis, David C. Teigland, and Matthew T. O’Keefe. A 64-bit, shared disk file system for Linux. In 16th IEEE Conference on Mass Storage Systems and Technologies (MSST ’99), March 1999.

PVFS2. PVFS2 Project Home Page. http://www.pvfs.org.

Sean Quinlan and Sean Dorward. Venti: A New Approach to Archival Data Storage. In Proceedings of FAST ’02, pages 89–102. USENIX, January 28–30 2002.

ReiserFS. Namesys. http://www.namesys.com.

W. D. Roome. 3DFS: A Time-Oriented File Server. In Proceedings of USENIX ’92 Winter Technical Conference, January 1992.

Yasushi Saito, Svend Frølund, Alistair C. Veitch, Arif Merchant, and Susan Spence. FAB: building dis- tributed enterprise disk arrays from commodity components. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2004, Boston, MA, USA, October 7-13, 2004, pages 48–58. ACM Press, 2004.

Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Alistair C. Veitch, Ross W. Carton, and Jacob Ofir. Deciding When to Forget in the Elephant File System. In Proceedings of 17th SOSP, December 1999.

SATA. Serial ATA Working Group. Serial ATA Specifications. http://www.serialata.org.

Paul W. Schermerhorn, Robert J. Minerick, Peter W. Rijks, and Vincent W. Freeh. User-level Extensi- bility in the Mona File System. In Proc. of Freenix 2001, pages 173–184, June 2001.

Frank Schmuck and Roger Haskin. GPFS: A Shared-disk File System for Large Computing Centers. In USENIX Conference on File and Storage Technologies, pages 231–244, Monterey, CA, January 2002.

Margo I. Seltzer, Gregory R. Ganger, M. Kirk McKusick, Keith A. Smith, Craig A. N. Soules, and Christopher A. Stein. Journaling Versus Soft Updates: Asynchronous Meta-data Protection in File Systems. In Proceedings of the 2000 USENIX Annual Technical Conference (USENIX ’00), pages 71–84, Berkeley, CA, June 18–23 2000. USENIX Association.

R. A. Shillner and E. W. Felten. Simplifying Distributed File Systems Using a Shared Logical Disk. Technical Report TR-524-96, Princeton University, Princeton, NJ, USA, October 1996.

Muthian Sivathanu, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Evolving RPC for Active Storage. In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS X), pages 264–276, San Jose, CA, Octo- ber 2002.

Muthian Sivathanu, Lakshmi Bairavasundaram, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Life or Death at Block-Level. In Proceedings of the Sixth Symposium on Operating Systems Design and Implementation (OSDI ’04), San Francisco, CA, December 2004.

Muthian Sivathanu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Somesh Jha. A logic of file systems. In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST’05), Berkeley, CA, USA, 2005a. USENIX Association.

Muthian Sivathanu, Lakshmi Bairavasundaram, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. Database-aware semantically-smart storage. In Proceedings of the 4th Conference on USENIX Con- ference on File and Storage Technologies (FAST’05), pages 18–18, Berkeley, CA, USA, 2005b. USENIX Association.

Muthian Sivathanu, Vijayan Prabhakaran, Florentina Popovici, Timothy E. Denehy, Andrea C. Arpaci- Dusseau, and Remzi H. Arpaci-Dusseau. Semantically-Smart Disk Systems. In Proceedings of the FAST ’03 Conference on File and Storage Technologies (FAST-03). USENIX Association, April 2003.

Dale Skeen. Nonblocking commit protocols. In SIGMOD ’81: Proceedings of the 1981 ACM SIGMOD international conference on Management of data, pages 133–142, New York, NY, USA, 1981. ACM. ISBN 0-89791-040-0.

Glenn C. Skinner and Thomas K. Wong. Stacking vnodes: A progress report. In Proc. of the USENIX Summer 1993 Technical Conference, pages 161–174, Berkeley, CA, USA, June 1993. USENIX Association.

EMC SnapView. Snapview data sheet. http://www.emc.com/pdf/ products/ clariion/SnapView2 DS.pdf.

SNIA. Storage Networking Industry Association End User Council Survey - Top Ten Pain Points Report. http://www.snia.org/ collateral/ TopTenPainPoints Report Final 040823.pdf.

Craig A.N. Soules, Garth R. Goodson, John D. Strunk, and Gregory R. Ganger. Metadata Efficiency in Versioning File Systems. In Proceedings of the FAST ’03 Conference on File and Storage Technolo- gies (FAST-03), Berkeley, CA, April 2003. The USENIX Association.

Daniel Stodolsky, Mark Holland, William V. Courtright II, and Garth A. Gibson. Parity-logging disk arrays. ACM Trans. Comput. Syst., 12(3):206–235, 1994. ISSN 0734-2071.

Storactive. Delivering Real-Time Data Protection & Easy Disaster Recovery for Windows Workstations. http://www.storactive.com/files/ Storactive Whitepaper.doc, January 2002.

John D. Strunk, Garth R. Goodson, Michael L. Scheinholtz, Craig A. N. Soules, and Gregory R. Ganger. Self-Securing Storage: Protecting Data in Compromised Systems. In Proceedings of the 4th Sympo- sium on Operating Systems Design and Implementation (OSDI-00), pages 165–180, Berkeley, CA, October 23–25 2000. The USENIX Association.

Jeremy Sugerman, Ganesh Venkitachalam, and Beng-Hong Lim. Virtualizing I/O Devices on VMware Workstation’s Hosted Virtual Machine Monitor. In Proc. of USENIX 2001 Annual Technical Confer- ence, June 2001.

Sun Microsystems. Instant Image White Paper. http://www.sun.com/ storage/ white-papers/ ii soft arch.pdf, 2008.

Sun Microsystems. Lustre File System: High-Performance Storage Architecture and Scalable Cluster File System. http://www.sun.com/ software/ products/ lustre/ docs/ lustrefilesystem wp.pdf, Decem- ber 2007.

Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. Scalabil- ity in the XFS file system. In Proc. of the 1996 USENIX Annual Technical Conference, pages 1–1, Berkeley, CA, USA, 1996. USENIX Association.

David Teigland and Heinz Mauelshagen. Volume managers in linux. In Proceedings of USENIX 2001 Technical Conference, June 2001.

Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee. Frangipani: A Scalable Distributed File System. In Proc. of the 16th Symposium on Operating Systems Principles (SOSP-97), volume 31,5 of Operating Systems Review, pages 224–237, New York, October 5–8 1997. ACM Press.

Stephen Tweedie. Ext3, journaling filesystem. In Presentation at Ottawa Linux Symposium, Ottawa Congress Centre, Canada, July 2000.

Mustafa Uysal, Anurag Acharya, and Joel Saltz. Evaluation of Active Disks for Decision Support Databases. In Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, pages 337–348, Toulouse, France, January 8–12, 2000. IEEE Computer Society TCCA.

Robbert van Renesse, Kenneth P. Birman, Roy Friedman, Mark Hayden, and David A. Karr. A Framework for Protocol Composition in Horus. In Symposium on Principles of Distributed Computing, pages 80–89, 1995.

Alistair C. Veitch, Eric Riedel, Simon J. Towers, and John Wilkes. Towards Global Storage Management and Data Placement. In Eighth IEEE Workshop on Hot Topics in Operating Systems (HotOS-VIII). May 20–23, 2001, Schloss Elmau, Germany, pages 184–184, 1109 Spring Street, Suite 300, Silver Spring, MD 20910, USA, 2001. IEEE Computer Society Press. ISBN 0-7695-1040-X.

Veritas. Flashsnap. http://eval.veritas.com/ downloads/ pro/ fsnap guide wp.pdf, a.

Veritas. Storage Foundation(TM). http://www.veritas.com/ Products/ www?c=product&refId=203, b.

Veritas. Volume Manager(TM). http://www.veritas.com/vmguided, c.

Werner Vogels, Dan Dumitriu, Ashutosh Agrawal, Teck Chia, and Katherine Guo. Scalability of the Microsoft Cluster Service. In WINSYM’98: Proceedings of the 2nd conference on USENIX Windows NT Symposium, Seattle, WA, USA, August 1998. USENIX Association.

Randolph Y. Wang, Thomas E. Anderson, and David A. Patterson. Virtual Log-Based File Systems for a Programmable Disk. In Proceedings of Operating Systems Design and Implementation (OSDI), pages 29–43, 1999.

Ralph O. Weber (Editor). Information Technology – SCSI Object-Based Storage Device Commands (OSD), Revision 10. Technical Council Proposal Document T10/1355-D, Technical Committee T10, July 2004.

Ralph O. Weber (Editor). Information Technology – SCSI Object-Based Storage Device Commands-2 (OSD-2), Revision 2. T10 Working Draft T10/1729-D, Technical Committee T10, July 2007.

Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. Ceph: A Scalable, High-performance Distributed File System. In Proc. of the 7th Symposium on Operating Systems Design and Implementation (OSDI’06), pages 307–320, Berkeley, CA, USA, 2006. USENIX Association. ISBN 1-931971-47-1.

Brian S. White, Michael Walker, Marty Humphrey, and Andrew S. Grimshaw. LegionFS: a secure and scalable file system supporting cross-domain high-performance applications. In Proc. of the 2001 ACM/IEEE conference on Supercomputing (Supercomputing’01), pages 59–59, New York, NY, USA, 2001. ACM. ISBN 1-58113-293-X.

John Wilkes. Traveling to Rome: QoS Specifications for Automated Storage System Management. In Proceedings of the International Workshop on QoS (IWQoS’2001). Karlsruhe, Germany, June 2001.

John Wilkes, Richard A. Golding, Carl Staelin, and Tim Sullivan. The HP AutoRAID Hierarchical Storage System. ACM Transactions on Computer Systems, 14(1):108–136, February 1996.

Jake Wires and Michael J. Feeley. Secure file system versioning at the block level. In EuroSys ’07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pages 203–215, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-636-3.

Theodore M. Wong, Richard A. Golding, Joseph S. Glider, Elizabeth Borowsky, Ralph A. Becker- Szendy, Claudio Fleiner, Deepak R. Kenchammana-Hosekote, and Omer A. Zaki. Kybos: Self- management for distributed brick-based storage. Technical Report RJ10356, IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA, August 2005.

Xdd. Version 6.3, I/O Performance Inc. http://www.ioperformance.com.

Junfeng Yang, Paul Twohey, Dawson R. Engler, and Madanlal Musuvathi. Using Model Checking to Find Serious File System Errors. In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 273–288, December 6-8, 2004, San Francisco, California, USA. USENIX Association, 2004.

H. Yokota. Performance and Reliability of Secondary Storage Systems. In Proceedings of the 4th World Multiconference on Systemics, Cybernetics and Informatics (SCI2000), Invited Paper, pages 668–673, July 2000.

Erez Zadok and Jason Nieh. FiST: A Language for Stackable File Systems. In Proceedings of the 2000 USENIX Annual Technical Conference (USENIX-00), pages 55–70, Berkeley, CA, June 18–23 2000. USENIX Association.

Seth Zirin and Joe Bennett (Principal Engineers, Intel Corp.). Advanced Switching Overview. http://download.intel.com/ netcomms/ as/ devcon as overview.pdf.

J. Ziv and A. Lempel. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory, 23:337–343, May 1977.