Cluster Storage for Commodity Computation

UCAM-CL-TR-690 Technical Report ISSN 1476-2986 Number 690 Computer Laboratory Cluster storage for commodity computation Russell Glen Ross June 2007 15 JJ Thomson Avenue Cambridge CB3 0FD United Kingdom phone +44 1223 763500 http://www.cl.cam.ac.uk/ c 2007 Russell Glen Ross This technical report is based on a dissertation submitted December 2006 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Wolfson College. Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/ ISSN 1476-2986 Summary Standards in the computer industry have made basic components and en- tire architectures into commodities, and commodity hardware is increas- ingly being used for the heavy lifting formerly reserved for specialised plat- forms. Now software and services are following. Modern updates to virtualization technology make it practical to subdivide commodity servers and manage groups of heterogeneous services using commodity operating systems and tools, so services can be packaged and managed independent of the hardware on which they run. Computation as a commodity is soon to follow, moving beyond the specialised applications typical of today’s utility computing. In this dissertation, I argue for the adoption of service clusters— clusters of commodity machines under central control, but running services in virtual machines for arbitrary, untrusted clients—as the basic building block for an economy of flexible commodity computation.I outline the requirements this platform imposes on its storage system and argue that they are necessary for service clusters to be practical, but are not found in existing systems. Next I introduce Envoy, a distributed file system for service clusters. In addition to meeting the needs of a new environment, Envoy introduces a novel file distribution scheme that organises metadata and cache management according to runtime demand. In effect, the file system is partitioned and control of each part given to the client that uses it the most; that client in turn acts as a server with caching for other clients that require concur- rent access. Scalability is limited only by runtime contention, and clients share a perfectly consistent cache distributed across the cluster. As usage patterns change, the partition boundaries are updated dynamically, with urgent changes made quickly and more minor optimisations made over a longer period of time. Experiments with the Envoy prototype demonstrate that service clusters can support cheap and rapid deployment of services, from isolated instances to groups of cooperating components with shared storage de- mands. 3 Acknowledgements My advisor, Ian Pratt, helped in countless ways to set the course of my research and see it through, demanding rigour and offering many helpful suggestions. Major inflection points in my work always seemed to follow meetings with him. Steve Hand also helped me throughout my time at Cambridge, offering sound advice when needed and sharing his mastery of the research game. Jon Crowcroft read drafts of my dissertations, even taking time from his Christmas holiday to send me feedback. I was fortunate to be in the Systems Research Group during a time when it was full of clever and interesting people doing good research. Andy Warfield, Alex Ho, Eva Kalyvianaki, Christian Kreibich, and Anil Madhavapeddy have been particularly rich sources of conversation and camaraderie. I might not have survived the whole experience without Nancy, whose love and constant support provided a foundation for everything else. 4 Table of Contents 1 Introduction 9 1.1 Contributions ......................... 11 1.2 Outline ............................ 12 2 Background 14 2.1 Commodity computing .................... 14 2.1.1 Commodity big iron ................. 15 2.1.2 Computation for hire ................. 17 2.2 Virtualization ......................... 20 2.2.1 Virtual machine monitors .............. 21 2.2.2 XenoServers ...................... 22 2.3 Storage systems ........................ 23 2.3.1 Local file systems ................... 23 2.3.2 Client-server file systems ............... 25 2.3.3 Serverless file systems ................. 27 2.3.4 Wide-area file systems ................ 29 2.3.5 Cluster storage systems ................ 31 2.3.6 Virtual machines ................... 36 2.4 Summary ........................... 37 3 Service Clusters 39 3.1 Commodity computation ................... 41 3.1.1 Flexible commodity computation .......... 41 3.2 Service containers ....................... 42 3.2.1 Decoupling hardware and software ......... 43 3.2.2 Decoupling unrelated services ............ 44 3.2.3 Supporting commodity tools ............. 45 3.3 Service clusters ........................ 46 3.3.1 Economies of scale .................. 47 3.3.2 Heterogeneous workloads .............. 48 3.3.3 Central management ................. 49 3.3.4 Supporting a software ecosystem .......... 50 3.4 Storage for service clusters .................. 51 5 3.4.1 Running in a cluster ................. 52 3.4.2 Supporting heterogeneous clients .......... 53 3.4.3 Local impact ..................... 56 3.5 Summary ........................... 56 4 The Envoy File System 58 4.1 Assumptions ......................... 58 4.2 Architecture .......................... 59 4.2.1 Distribution ...................... 60 4.2.2 Storage layer ..................... 63 4.2.3 Envoy layer ...................... 65 4.2.4 Caching ........................ 68 4.2.5 Data paths for typical requests ............ 70 4.3 File system images ...................... 74 4.3.1 Security ........................ 75 4.3.2 Forks and snapshots ................. 78 4.3.3 Deleting snapshots .................. 80 4.4 Territory management .................... 83 4.4.1 Design principles ................... 84 4.4.2 Special cases ...................... 84 4.4.3 Dynamic boundary changes ............. 85 4.5 Recovery ........................... 87 4.5.1 Prerequisites ...................... 89 4.5.2 Recovery process ................... 89 4.5.3 Special cases ...................... 91 4.6 Summary ........................... 93 5 The Envoy Prototype 94 5.1 Scope and design coverage .................. 94 5.1.1 Implementation shortcuts .............. 95 5.2 The 9p protocol ........................ 96 5.2.1 Alternatives ...................... 96 5.2.2 Mapping Envoy to 9p ................ 99 5.3 Storage service ........................ 101 5.3.1 Protocol ........................ 102 5.3.2 Implementation .................... 103 5.3.3 Limitations ...................... 106 5.4 Envoy service ......................... 107 5.4.1 Protocol ........................ 108 5.4.2 Data structures .................... 109 5.4.3 Freezing and thawing ................. 112 5.4.4 Read operations ................... 114 5.4.5 Write operations ................... 116 6 5.4.6 Modifying territory ownership ............ 122 5.4.7 Limitations ...................... 126 5.5 Summary ........................... 128 6 Evaluation 129 6.1 Methodology ......................... 130 6.1.1 Assumptions and goals ................ 130 6.1.2 Test environment ................... 133 6.2 Independent clients ...................... 136 6.2.1 Performance ...................... 137 6.2.2 Architecture ...................... 140 6.2.3 Cache ......................... 144 6.3 Shared images ......................... 149 6.3.1 Dynamic behaviour .................. 150 6.4 Image operations ....................... 153 6.4.1 Forks ......................... 154 6.4.2 Snapshots ....................... 154 6.4.3 Service deployment .................. 155 6.5 Summary ........................... 157 7 Conclusion 158 7.1 Future Work .......................... 158 7.1.1 Caching ........................ 158 7.1.2 Territory partitioning ................. 159 7.1.3 Read-only territory grants .............. 159 7.1.4 Storage-informed load balancing .......... 159 7.2 Summary ........................... 160 References 162 7 8 Chapter 1 Introduction This dissertation presents the design and implementation of a distributed file system. The design is motivated by the requirements of a new environment that emerges at the intersection of three trends: the renaissance of machine virtualization as a tool for hardware management, the increasing use of commodity hardware and software for server applications, and the emergence of computation as a commodity service. Providing computing services for hire is nothing new, but previous of- ferings have always been on a limited scale, restricted to a specific type of application or tied to the tool stack of a single vendor. Web and email hosting services are commonplace, and many applications like tax preparation software and payroll management are available as hosted services. Nu- merous frameworks for hosted computing have been proposed [Ami98, Vah98, Tul98] and commercially implemented [Ama06, Sun06, Kal04], but these require using a specific distributed framework or middleware that introduces a “semantic bottleneck” between the application and the hosting environment [Ros00a]. The only way to make hosted computing a commodity is to convince vendors and customers to agree on a common set of interfaces so that applications can be easily moved from one host to the next and an open marketplace can emerge. Seeking universal agreement on a new application environment is a daunting task, but one that can be circumvented by using an existing platform that already enjoys

Cluster Storage for Commodity Computation

Venti Analysis and Memventi Implementation

HTTP-FUSE Xenoppix

Accelerated AC Contingency Calculation on Commodity Multi

Plan 9 from Bell Labs

(Brief) 2. an Extensible Compiler for Systems Programming

The Effectiveness of Deduplication on Virtual Machine Disk Images

Large Scale Disk Image Storage and Deployment in the Emulab Network Testbed

DASH 1 JDA@O?KmanjOok=HAHA=@EncHECDJNom,>KJEn>OokBohm

Thesis May Never Have Been Completed

6 Read-Performance Optimization for Deduplication-Based Storage

Scale Enterprise Systems – for Better Business Computing

Proceedings of 3Rd International Workshop on Plan 9 October 30-31, 2008