By Michail D. Flouris a Thesis Submitted in Conformity with The

EXTENSIBLE NETWORKED-STORAGE VIRTUALIZATION WITH METADATA MANAGEMENT AT THE BLOCK LEVEL

by

Michail D. Flouris

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto

Copyright © 2009 by Michail D. Flouris

Abstract

Extensible Networked-Storage Virtualization with Metadata Management at the Block Level
Michail D. Flouris
Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto
2009

Increased scaling costs and a lack of desired features are leading to the evolution of high-performance storage systems from centralized architectures and specialized hardware to decentralized, commodity storage clusters. Existing systems try to address storage cost and management issues at the filesystem level. Besides dictating the use of a specific filesystem, however, this approach leads to increased complexity and load imbalance towards the file-server side, which in turn increase the cost of scaling. In this thesis, we examine these problems at the block level. This approach has several advantages, such as transparency, cost-efficiency, better resource utilization, simplicity, and easier management.

First of all, we explore the mechanisms, the merits, and the overheads associated with advanced metadata-intensive functionality at the block level, by providing versioning at the block level. We find that block-level versioning has low overhead and offers transparency and simplicity advantages over filesystem-based approaches.

Secondly, we study the problem of providing the extensibility required by the diverse and changing needs of applications that may use a single storage system. We provide support for (i) adding desired functions as block-level extensions, and (ii) flexibly combining them to create modular I/O hierarchies.
In this direction, we design, implement and evaluate an extensible block-level storage virtualization framework, Violin, with support for metadata-intensive functions. Extending Violin we build Orchestra, an extensible framework for cluster storage virtualization and scalable storage sharing at the block level. We show that Orchestra’s enhanced block interface can substantially simplify the design of higher-level storage services, such as cluster filesystems, while being scalable.

Finally, we consider the problem of consistency and availability in decentralized commodity clusters. We propose RIBD, a novel storage system that provides support for handling both data and metadata consistency issues at the block layer. RIBD uses the notion of consistency intervals (CIs) to provide fine-grain consistency semantics on sequences of block-level operations by means of a lightweight transactional mechanism. RIBD relies on Orchestra’s virtualization mechanisms and uses a roll-back recovery mechanism based on low-overhead block-level versioning. We evaluate RIBD on a cluster of 24 nodes, and find that it performs comparably to two popular cluster filesystems, PVFS and GFS, while offering stronger consistency guarantees.

To my wife, Stavroti, for her support and patience to see the end of it.
To my children, Kleria and Dimitris, for making my days (and nights) worth it.

Acknowledgements

I have been extremely fortunate to have Professor Angelos Bilas as my supervisor. His constant support, insightful comments and patient guidance have aided me greatly throughout my research endeavors. I am deeply grateful to him for all his support and all I have learned under his supervision.

I would also like to thank the members of my committee, Professors Angela Demke Brown, H.-Arno Jacobsen, Cristiana Amza and my external appraiser Remzi Arpaci-Dusseau, for their thoughtful comments that significantly improved this thesis.
Special thanks go to all colleagues and paper co-authors, whose help and insight has been invaluable to my research efforts, more specifically Renaud Lachaize, Manolis Marazakis, Kostas Magoutis, Jesus Luna, Maciej Brzezniak, Zsolt Németh, Stergios Anastasiadis, Evangelos Markatos, Dionisios Pnevmatikatos, Sotiris Ioannidis, Dimitris Xinidis, Rosalia Christodoulopoulou, Reza Azimi, and Periklis Papakonstantinou.

A great thanks to all the people of the legendary Grspam list and all my friends in Toronto for making my life there warmer and more fun. I am especially grateful to Yannis Velegrakis, Tasos Kementsientsidis, Stavros Vassos, Giorgos Giakkoupis and Vasso Bartzoka for their hospitality.

I would like to thank also all the present and past members of the CARV lab and FORTH-ICS for all the conversations, the fun and for creating an enjoyable and stimulating research environment. Many special thanks go to Stavros Passas, Michalis Ligerakis, Sven Karlsson, Markos Foundoulakis and Yannis Klonatos for their help with the hardware, the cluster administration, and debugging. A big “thank you” goes to Linda Chow for her help with the administrative tasks. Thanks to all the friends in Greece for the moral support.

Finally, I would like to thank the people that really made all this possible; my wife Stavroti Liodaki, my children Kleria and Dimitris, my parents Maria and Dimitris, sister Irini, brother Andreas, and my family in-law. Thank you all for everything you have done for me.

Contents

1 Introduction
  1.1 Motivation
  1.2 Our Approach
  1.3 Problems and Contributions
  1.4 Overview
    1.4.1 Block-level versioning
    1.4.2 Storage resource sharing and management
    1.4.3 Scalable storage distribution and sharing
    1.4.4 Reliability and Availability
  1.5 Thesis Organization

2 Related Work
  2.1 Clotho: Block-level Versioning
  2.2 Violin: Extensible Block-level Storage
    2.2.1 Extensible filesystems
    2.2.2 Extensible network protocols
    2.2.3 Block-level storage virtualization
  2.3 Orchestra: Extensible Networked-storage Virtualization
    2.3.1 Conventional cluster storage systems
    2.3.2 Flexible support for distributed storage
    2.3.3 Support for cluster-based storage
    2.3.4 Summary
  2.4 RIBD: Taxonomy and Related Work
    2.4.1 Taxonomy of existing solutions
    2.4.2 Related Work

3 Clotho: Transparent Data Versioning at the Block I/O Level
  3.1 Introduction
  3.2 System Design
    3.2.1 Flexibility and Transparency
    3.2.2 Reducing Metadata Footprint
    3.2.3 Version Management Overhead
    3.2.4 Common I/O Path Overhead
    3.2.5 Reducing Disk Space Requirements
    3.2.6 Consistency
  3.3 System Implementation
  3.4 Experimental Results
    3.4.1 Bonnie++
    3.4.2 SPEC SFS
    3.4.3 Compact version performance
  3.5 Limitations and Future work
  3.6 Conclusions

4 Violin: A Framework for Extensible Block-level Storage
  4.1 Introduction
  4.2 System Architecture
    4.2.1 Virtualization Semantics
    4.2.2 Violin I/O Request Path
    4.2.3 State Persistence
    4.2.4 Module API
  4.3 System Implementation
    4.3.1 I/O Request Processing in Linux
    4.3.2 Violin I/O path
  4.4 Evaluation
    4.4.1 Ease of Development
    4.4.2 Flexibility
    4.4.3 Performance
    4.4.4 Hierarchy Performance
  4.5 Limitations and Future work
  4.6 Conclusions

5 Orchestra: Extensible, Shared Networked Storage
  5.1 Introduction
  5.2 System Design
    5.2.1 Distributed Virtual Hierarchies
    5.2.2 Distributed Block-level Sharing
    5.2.3 Distributed File-level Sharing
    5.2.4 Differences of Orchestra vs. Violin
  5.3 System Implementation
  5.4 Experimental Results
    5.4.1 Orchestra
    5.4.2 ZeroFS (0FS)
    5.4.3 Summary
  5.5 Limitations and
