
pStore: A Secure Peer-to-Peer Backup System∗

Christopher Batten, Kenneth Barr, Arvind Saraf, Stanley Trepetin
{cbatten|kbarr|arvind_s|stanleyt}@mit.edu

∗ pStore was developed October-December 2001 as a project for MIT 6.824: Distributed Computer Systems.

Abstract

In an effort to combine research in peer-to-peer systems with techniques for incremental backup systems, we propose pStore: a secure distributed backup system based on an adaptive peer-to-peer network. pStore exploits unused personal hard drive space attached to the Internet to provide the distributed redundancy needed for reliable and effective data backup. Experiments on a 30-node network show that 95% of the files in a 13 MB dataset can be retrieved even when 7 of the nodes have failed. On top of this reliability, pStore includes support for file encryption, versioning, and secure sharing. Its custom versioning system permits arbitrary version retrieval similar to CVS. pStore provides this functionality using less than 10% of the network bandwidth and 85% less storage capacity than simpler local tape backup schemes for a representative workload.

1 Introduction

Current backup systems for personal and small-office computer users usually rely on secondary on-site storage of their data. Although these on-site backups provide data redundancy, they are vulnerable to localized catastrophe. More sophisticated off-site backups are possible, but they are usually expensive, difficult to manage, and still a centralized form of redundancy. Independent of backup systems, current peer-to-peer systems focus on file sharing, distributed archiving, distributed file systems, and anonymous publishing. Motivated by the strengths and weaknesses of current peer-to-peer systems, as well as the specific desires of users needing to back up personal data, we propose pStore: a secure peer-to-peer backup system.

pStore provides a user with the ability to securely back up files in, and restore files from, a distributed network of untrusted peers. Insert, update, retrieve, and delete commands may be invoked by various user interfaces (e.g., a command line, file system, or GUI) according to a user's needs. pStore maintains snapshots for each file, allowing a user to restore any snapshot at a later date. This low-level versioning primitive permits several usage models. For example, works in progress may be backed up hourly so that a user can revert to a last-known-good copy, or an entire directory tree can be stored to recover from a disk crash.
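As an illustration of how these primitives and per-file snapshots might fit together, the toy sketch below keeps every snapshot of each file in memory; the class name, method signatures, and 1-based snapshot numbering are hypothetical, and the real system stores data on remote peers rather than in a local dictionary.

```python
# Toy, in-memory illustration of the insert/update/retrieve/delete commands
# and the per-file snapshot model described above. All names and behavior
# here are illustrative assumptions, not the actual pStore client interface.
from pathlib import Path


class ToySnapshotStore:
    def __init__(self) -> None:
        # Maps a file's path to the ordered list of its snapshots.
        self._snapshots: dict[str, list[bytes]] = {}

    def insert(self, path: Path) -> int:
        """Back up a file for the first time; returns the snapshot number (1)."""
        self._snapshots[str(path)] = [path.read_bytes()]
        return 1

    def update(self, path: Path) -> int:
        """Append a new snapshot of an already backed-up file."""
        versions = self._snapshots[str(path)]
        versions.append(path.read_bytes())
        return len(versions)

    def retrieve(self, path: Path, snapshot: int | None = None) -> bytes:
        """Return the requested snapshot (1-based), or the latest by default."""
        versions = self._snapshots[str(path)]
        return versions[-1] if snapshot is None else versions[snapshot - 1]

    def delete(self, path: Path) -> None:
        """Remove every snapshot of the file from the store."""
        del self._snapshots[str(path)]
```

Under this model, an hourly job would simply call update(), and reverting to a last-known-good copy is a retrieve() with an earlier snapshot number.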
pStore has three primary design goals: reliability, security, and resource efficiency. pStore provides reliability through replication; copies are available on several servers in case some of these servers are malicious or unavailable. Since a client's data is replicated on nodes beyond his control, pStore strives to provide reasonable security: private data is readable only by its owner; data can be remotely deleted only by its owner; and any unwanted changes to data can be easily detected. Finally, since backups can be frequent and large, pStore aims to reduce resource usage by sharing stored data and exchanging data only when necessary.

Section 2 discusses related systems. pStore draws from their strengths while discarding functionality which adds overhead or complexity in the application-specific domain of data backup. Section 3 outlines the pStore architecture, and Section 4 presents our implementation. Section 5 evaluates the design in terms of the goals stated above, and Section 6 concludes.

2 Related Work

A peer-to-peer backup system has two major components: the underlying peer-to-peer network and the backup/versioning framework. While much work has been done in the two fields individually, there is little literature integrating the two.

2.1 Distributed Storage Systems

There has been a wealth of recent work on distributed storage systems. Peer-to-peer file sharing systems, such as Napster [15] and Gnutella [12], are in wide use and provide a mechanism for file search and retrieval among a large group of users. Napster handles searches through a centralized index server, while Gnutella uses broadcast queries. Both systems focus more on information retrieval than on publishing.

Freenet provides anonymous publication and retrieval of data in an adaptive peer-to-peer network [5]. Anonymity is provided through several means, including encrypted search keys, data caching along lookup paths, source-node spoofing, and probabilistic time-to-live values. Freenet deletes data which is infrequently accessed to make room for more recent insertions.

Eternity proposes redundancy and information dispersal (secret sharing) to replicate data, and adds anonymity mechanisms to prevent selective denial-of-service attacks [1]. Document queries are broadcast, and delivery is achieved through anonymous remailers. Free Haven, Publius, and Mojo Nation also use secret sharing to achieve reliability and author anonymity [9, 22, 13].

SFSRO is a content distribution system providing secure and authenticated access to read-only data via a replicated database [11]. Like SFSRO, CFS aims to achieve high performance and redundancy without compromising integrity in a read-only file system [8]. Unlike the complete database replication in SFSRO, CFS inserts file system blocks into a distributed storage system and uses Chord as a distributed lookup mechanism [7]. The PAST system takes a similar layered approach, but uses Pastry as its distributed lookup mechanism [10]. Lookups using Chord and Pastry scale as O(log n) with the number of nodes in the system. Farsite is similar to CFS in that it provides a distributed file system among cooperative peers [2], but uses digital signatures to allow delete operations on the file data.

Several systems have proposed schemes to enforce storage quotas over a distributed storage system. Mojo Nation relies on a trusted third party to increase a user's quota when he contributes storage, network, and/or CPU resources to the system. The PAST system suggests that the same smart cards used for authentication could be used to maintain storage quotas [10]. The Tangler system proposes an interesting quota scheme based on peer monitoring: nodes monitor their peers and report badly behaving nodes to others [21].

2.2 Versioning and Backup

The existing distributed storage systems discussed above are intended for sharing, archiving, or providing a distributed file system. As a result, these systems do not provide specific support for incremental updates and/or versioning. Since many file changes are incremental (e.g., the evolution of source code, documents, and even some aspects of binary files [14]), there has been a significant amount of work on exploiting these similarities to save bandwidth and storage space.

The Concurrent Versions System (CVS), popular among software development teams, combines the current state of a text file with a set of commands necessary to incrementally revert that file to its original state [6]. Network Appliance incorporates the WAFL file system in its network-attached storage devices [3]. WAFL provides transparent snapshots of a file system at selected instances, allowing the file system data to be viewed either in its current state or as it was at some time in the past.

Overlap between file versions can enable a reduction in the network traffic required to update older versions of files. Rsync is an algorithm for updating files on a client so that they are identical to those on a server [20]. The client breaks a file into fixed-size blocks and sends a hash of each block to the server. The server checks whether its version of the file contains any blocks which hash to the same values as the client's hashes. The server then sends the client any blocks for which no matching hash was found and instructs the client how to reconstruct the file. Note that the server hashes fixed-size blocks at every byte offset, not just at multiples of the block size. To reduce the time required when hashing at each byte offset, the rsync algorithm uses two types of hash functions: its slower cryptographic hash function is used only when its fast rolling hash establishes a probable match.

LBFS also uses file block hashes to help reduce the amount of data that needs to be transmitted when updating a file [14]. Unlike rsync's fixed block sizes, LBFS uses content-dependent "fingerprints" to determine file block boundaries.
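To make the two-level hashing idea concrete, the sketch below builds per-block signatures for the receiver's copy of a file and scans the sender's copy byte by byte, trying a cheap weak checksum at every offset and computing the slower cryptographic hash only when the weak checksum suggests a probable match. It is a simplified illustration rather than the actual rsync or LBFS code; the additive checksum, SHA-1, and the tiny block size are placeholder choices.

```python
# Simplified sketch of rsync-style block matching. A weak checksum is tried
# at every byte offset; the slower cryptographic hash is computed only when
# the weak checksum finds a probable match. Checksum choice, hash choice,
# and block size are illustrative assumptions, not rsync's actual parameters.
import hashlib

BLOCK = 4  # tiny block size so the example is easy to trace by hand


def weak(block: bytes) -> int:
    """Toy additive checksum standing in for rsync's rolling checksum."""
    return sum(block) & 0xFFFF


def strong(block: bytes) -> str:
    """Cryptographic hash used to confirm a probable match."""
    return hashlib.sha1(block).hexdigest()


def signatures(data: bytes) -> dict[int, dict[str, int]]:
    """Receiver-side signatures: weak checksum -> strong hash -> block index."""
    sigs: dict[int, dict[str, int]] = {}
    for i in range(0, len(data) - BLOCK + 1, BLOCK):
        blk = data[i:i + BLOCK]
        sigs.setdefault(weak(blk), {})[strong(blk)] = i // BLOCK
    return sigs


def delta(new: bytes, sigs: dict[int, dict[str, int]]) -> list:
    """Describe `new` as literal bytes plus references to the receiver's blocks."""
    out: list = []
    literal = bytearray()
    i = 0
    while i + BLOCK <= len(new):
        blk = new[i:i + BLOCK]
        idx = None
        w = weak(blk)  # recomputed here for clarity; rsync rolls it in O(1)
        if w in sigs:  # strong hash is computed only on a weak-checksum match
            idx = sigs[w].get(strong(blk))
        if idx is None:
            literal.append(new[i])
            i += 1
        else:
            if literal:
                out.append(("literal", bytes(literal)))
                literal = bytearray()
            out.append(("block", idx))
            i += BLOCK
    literal.extend(new[i:])
    if literal:
        out.append(("literal", bytes(literal)))
    return out


# Only the changed region travels as literal bytes; unchanged blocks become references.
old = b"the quick brown fox "
new = b"the quick  blue fox "
print(delta(new, signatures(old)))
# [('block', 0), ('block', 1), ('literal', b'k  blue '), ('block', 4)]
```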
3 System Architecture

Before discussing the details of the pStore architecture, we present an overview of how the system works for one possible implementation. A pStore user first invokes a pStore client which helps him generate keys and mark files for backup.

3.1 Data Structures

This section describes the data structures used to manage files, directories, and versions. These data structures were designed for reliability, security, and resource efficiency.

3.1.1 File Block Lists and File Blocks

[Figure 1: File Block List and File Blocks: (a) shows a file with three equal-sized blocks; (b) shows how a new version can be added by updating the file block list and adding a single new file block.]

A pStore file is represented by a file block list (FBL) and several file blocks (FBs). Each FB contains a portion of the file data, while the FBL contains an ordered list of all the FBs in the pStore file. The FBL has four pieces of information for each FB: a file block identifier used to uniquely
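As Figure 1 suggests, each version recorded in an FBL is an ordered list of block identifiers of the form H(E(block)), so a new version that changes one block can reuse the remaining blocks of the previous version. The sketch below shows one plausible in-memory shape for this structure; the class and field names, and the use of SHA-1 for H, are illustrative assumptions rather than the actual pStore format.

```python
# Minimal sketch of an FBL whose versions reference file blocks by the hash
# of their encrypted contents, H(E(block)), as in Figure 1. Field names and
# the choice of SHA-1 for H are illustrative assumptions.
import hashlib
from dataclasses import dataclass, field


def block_id(encrypted_block: bytes) -> str:
    """H(E(block)): a block is identified by hashing its encrypted contents."""
    return hashlib.sha1(encrypted_block).hexdigest()


@dataclass
class FileBlock:
    """An FB holds one encrypted portion of the file's data."""
    data: bytes


@dataclass
class FileBlockList:
    """An FBL records, for every version, the ordered list of FB identifiers."""
    versions: list[list[str]] = field(default_factory=list)

    def add_version(self, encrypted_blocks: list[bytes]) -> list[str]:
        ids = [block_id(b) for b in encrypted_blocks]
        self.versions.append(ids)
        return ids


# Mirroring Figure 1: version 1 references blocks A, B, and C; version 2
# replaces A with a new block D while reusing B, so only D must be stored.
fbl = FileBlockList()
fbl.add_version([b"E(A)", b"E(B)", b"E(C)"])
fbl.add_version([b"E(D)", b"E(B)"])
```

Because an identifier depends only on the block's encrypted contents, a block that appears unchanged in several versions needs to be stored only once.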