Cumulus: Filesystem Backup to the Cloud
Michael Vrable, Stefan Savage, and Geoffrey M. Voelker
Department of Computer Science and Engineering
University of California, San Diego

Abstract

In this paper we describe Cumulus, a system for efficiently implementing filesystem backups over the Internet. Cumulus is specifically designed under a thin cloud assumption: that the remote datacenter storing the backups does not provide any special backup services, but only provides a least-common-denominator storage interface (i.e., get and put of complete files). Cumulus aggregates data from small files for remote storage, and uses LFS-inspired segment cleaning to maintain storage efficiency. Cumulus also efficiently represents incremental changes, including edits to large files. While Cumulus can use virtually any storage service, we show that its efficiency is comparable to integrated approaches.

1 Introduction

It has become increasingly popular to talk of “cloud computing” as the next infrastructure for hosting data and deploying software and services. Not surprisingly, there are a wide range of different architectures that fall under the umbrella of this vague-sounding term, ranging from highly integrated and focused (e.g., Software As A Service offerings such as Salesforce.com) to decomposed and abstract (e.g., utility computing such as Amazon’s EC2/S3). Towards the former end of the spectrum, complex logic is bundled together with abstract resources at a datacenter to provide a highly specific service, potentially offering greater performance and efficiency through integration, but also reducing flexibility and increasing the cost to switch providers. At the other end of the spectrum, datacenter-based infrastructure providers offer minimal interfaces to very abstract resources (e.g., “store file”), making portability and provider switching easy, but potentially incurring additional overheads from the lack of server-side application integration.

In this paper, we explore this thin-cloud vs. thick-cloud trade-off in the context of a very simple application: filesystem backup. Backup is a particularly attractive application for outsourcing to the cloud because it is relatively simple, the growth of disk capacity relative to tape capacity has created an efficiency and cost inflection point, and the cloud offers easy off-site storage, always a key concern for backup. For end users there are few backup solutions that are both trivial and reliable (especially against disasters such as fire or flood), and ubiquitous broadband now provides sufficient bandwidth resources to offload the application. For small to mid-sized businesses, backup is rarely part of critical business processes and yet is sufficiently complex to “get right” that it can consume significant IT resources. Finally, larger enterprises benefit from backing up to the cloud to provide a business continuity hedge against site disasters.

However, to price cloud-based backup services attractively requires minimizing the capital costs of data center storage and the operational bandwidth costs of shipping the data there and back. To this end, most existing cloud-based backup services (e.g., Mozy, Carbonite, Symantec’s Protection Network) implement integrated solutions that include backup-specific software hosted on both the client and at the data center (usually using servers owned by the provider). In principle, this approach allows greater storage and bandwidth efficiency (server-side compression, cleaning, etc.) but also reduces portability, locking customers into a particular provider.

In this paper we explore the other end of the design space: the thin cloud. We describe a cloud-based backup system, called Cumulus, designed around a minimal interface (put, get, delete, list) that is trivially portable to virtually any on-line storage service. Thus, we assume that any application logic is implemented solely by the client. In designing and evaluating this system we make several contributions. First, we show through simulation that, through careful design, it is possible to build efficient network backup on top of a generic storage service, competitive with integrated backup solutions, in spite of having no specific backup support in the underlying storage service. Second, we build a working prototype of this system using Amazon’s Simple Storage Service (S3) and demonstrate its effectiveness on real end-user traces. Finally, we describe how such systems can be tuned for cost instead of for bandwidth or storage, both using the Amazon pricing model as well as for a range of storage to network cost ratios.

In the remainder of this paper, we first describe prior work in backup and network-based backup, followed by a design overview of Cumulus and an in-depth description of its implementation. We then provide both simulation and experimental results of Cumulus performance, overhead, and cost in trace-driven scenarios. We conclude with a discussion of the implications of our work and how this research agenda might be further explored.

USENIX Association 7th USENIX Conference on File and Storage Technologies 225

2 Related Work

Many traditional backup tools are designed to work well for tape backups. The dump, cpio, and tar [16] utilities are common on Unix systems and will write a full filesystem backup as a single stream of data to tape. These utilities may create a full backup of a filesystem, but also support incremental backups, which only contain files which have changed since a previous backup (either full or another incremental). Incremental backups are smaller and faster to create, but mostly useless without the backups on which they are based.

Organizations may establish backup policies specifying at what granularity backups are made, and how long they are kept. These policies might then be implemented in various ways. For tape backups, long-term backups may be full backups so they stand alone; short-term daily backups may be incrementals for space efficiency. Tools such as AMANDA [2] build on dump or tar, automating the process of scheduling full and incremental backups as well as collecting backups from a network of computers to write to tape as a group. Cumulus supports flexible policies for backup retention: an administrator does not have to select at the start how long to keep backups, but rather can delete any snapshot at any point.

The falling cost of disk relative to tape makes backup to disk more attractive, especially since the random access permitted by disks enables new backup approaches. Many recent backup tools, including Cumulus, take advantage of this trend. Two approaches for comparing these systems are by the storage representation on disk, and by the interface between the client and the storage; while the disk could be directly attached to the client, often (especially with a desire to store backups remotely) communication will be over a network.

Rsync [22] efficiently mirrors a filesystem across a network using a specialized network protocol to identify and transfer only those parts of files that have changed. Both the client and storage server must have rsync installed. Users typically want backups at multiple points in time, so rsnapshot [19] and other wrappers around rsync exist that will store multiple snapshots, each as a separate directory on the backup disk. Unmodified files are hard-linked between the different snapshots, so storage is space-efficient and snapshots are easy to delete.

age system. Venti [17] uses hashes of block contents to address data blocks, rather than a block number on disk. Identical data between snapshots (or even within a snapshot) is automatically coalesced into a single copy on disk, giving the space benefits of incremental backups automatically. Data Domain [26] offers a similar but more recent and efficient product; in addition to performance improvements, it uses content-defined chunk boundaries so de-duplication can be performed even if data is offset by less than the block size.

A limitation of these tools is that backup data must be stored unencrypted at the server, so the server must be trusted. Box Backup [21] modifies the protocol and storage representation to allow the client to encrypt data before sending, while still supporting rsync-style efficient network transfers.

Most of the previous tools use a specialized protocol to communicate between the client and the storage server. An alternate approach is to target a more generic interface, such as a network file system or an FTP-like protocol. Amazon S3 [3] offers an HTTP-like interface to storage. The operations supported are similar enough between these different protocols (get/put/delete on files and list on directories) that a client can easily support multiple protocols. Cumulus tries to be network-friendly like rsync-based tools, while using only a generic storage interface.

Jungle Disk [13] can perform backups to Amazon S3. However, the design is quite different from that of Cumulus. Jungle Disk is first a network filesystem with Amazon S3 as the backing store. Jungle Disk can also be used for backups, keeping copies of old versions of files instead of deleting them. But since it is optimized for random access it is less efficient than Cumulus for pure backup: features like aggregation in Cumulus can improve compression, but are at odds with efficient random access.

Duplicity [8] aggregates files together before storage for better compression and to reduce per-file storage costs at the server. Incremental backups use space-efficient rsync-style deltas to represent changes. However, because each incremental backup depends on the previous, space cannot be reclaimed from old snapshots without another full backup, with its associated large upload cost. Cumulus was inspired by duplicity, but avoids