<<

Peer-to-Peer Cooperative Backup, December 16th, 2004

MoSAIC, LAAS-CNRS

Introduction 1.2. Discovering Similarities Amongst Peers’ This document summarizes a technical dis- Data cussion held by part of the MoSAIC team at Pastiche [8] is a peer-to-peer backup sys- Toulouse, on December 16th, 2004. Past and cur- tem that implements a so-called lighthouse sweep rent research work in the area of peer-to-peer mechanism aimed at helping backup peers dis- content storage, peer-to-peer content location as cover peers whose personal data are similar to the well as applications of these techniques to peer- data it is willing to back up. The goal is to limit to-peer cooperative backup were presented and the amount of data being backed up: data shared discussed. Slides by Ludovic Courtès are avail- among several peers is backed up only once. able at http://www.laas.fr/mosaic/ and related dis- While Pastiche’s authors claim that their sys- cussions are summarized here. tem computes an abstract (« a small, random sub- set of their [data chunks’] signatures ») of peers’ data in a way that is similar to a hash, it was high- 1. Discussions lighted in the discussion that hash functions are The presentation of existing peer-to-peer file designed in a way to emphasize differences, not sharing and backup systems yielded the follow- similarities. Indeed, while the authors do not de- ing discussions. scribe the exact algorithm they use for finding re- dundancy, they refer the reader to an article by 1.1. On the Robustness of Content Location A. Z. Broder, On the resemblance and containment of Algorithms documents, 1998. Their exists a number of efficient distribut- 1.3. Trade-off Between Superfluous Data ed content location algorithms for overlay net- Redundancy Elimination and Data works [12,5,1] some of which build upon net- Locality work topologies and associated routing algo- rithms found in high-performance computing. As discussed in section 1.2, backup sys- Since these algorithms rely on peers being well- tems try to reduce data chunks redundancy in behaved, they look vulnerable to the participa- the overall peer network by preferably group- tion of malicious peers not complying with the al- ing peers holding similar data rather than peers gorithm assumptions. with completely different data sets. At the same On the other hand, non-deterministic con- time, peer-to-peer backup systems try to favor tent location algorithms have been designed to peers’ physical locality (i.e. in terms of the number suit the needs of anonymous content publishing of IP hops needed to reach each other)so as to im- networks such as Freenet and GNUnet [7]. These prove the latency when sending/restoring data protocols have been designed with censorship re- to/from one’s peer. Furthermore, in Pastiche [8], silience in mind and can tolerate malicious nodes. backup peers (or buddies) directly communicate However, it is unclear how well such schemes (over IP)with each other and only use the overlay perform on large-scale networks. network as a means of discovering peers. Fortunately, solutions have been envisioned This scheme yields to trade-off between com- to make the former type of algorithm more ro- munications efficiency (good latency and direct bust, by increasing the number of peers known communication channels) and storage efficiency to each peer and by randomizing the routing al- (elimination of superfluousdata redundancy).As gorithm when possible, therefore making it non- noted during the meeting, Pastiche’s protocol for deterministic. discovering peers with similar data may be inef- ficient for several reasons: - 2 -

• data similarity in only checked upon peers’ ior of peers (not only data storage or availability) introduction into the network; might achieve better overall efficiency.

• more importantly, computing abstracts of all the data of a node may not be as efficient 2.2. Archival and Versioning Systems as using abstracts of well-chosen subsets of Archival and versioning systems were men- each nodes’contents. A possibility would be tioned as a research area particularly relevant to split data into various themes where a node to MoSAIC. By archival or versioning systems, (its user) would hold data corresponding to we mean systems that are able to keep track of different themes. It would then be more ef- changes that are made to individual files, file ficient to use a buddy-list for each theme, i.e. metadata, or directories. This includes regu- the similarities would be maximized. lar configuration management systems (SCM)such as CVS, GNU Arch, etc., as well as file Some experiments, however, tend to show [8] systems archival systems like Plan 9’s Venti [11] that this mechanism effectively discovers peers and versioning file systems [4,2,13]. Such sys- with a high data similarity. However, the authors tems offer interesting properties: are considering whole system installations, in- cluding the , libraries and exe- • They allow to keep track of modifications cutables, etc. This type of data are obviously very and access a specific version of a file. MoSA- similar from one system to another, while the IC could benefit from such a facility since it data produced by the user – which is typically the may not be able to back up a whole file be- type of data most people value the most – may fore it is modified and users would prefer vary significantly. having done a complete backup copy of an Another issuewith the peer selection scheme older version of a file rather than having of Pastiche is that it may not be very efficient in only copied piece of successive versions of a terms of the overall network superfluous redundan- file. It is worth noting that simple file lock- cy: a peer may end up backing up a data chunk ing mechanisms could be used to implement on its set of backup peers (called buddies in Pas- this feature instead of a complete versioning tiche) even though this chunk has already been file system , although this would not be as ef- backed up by other peers (but not by his bud- ficient and practical. dies). PeerStore [10] solves this problem by using a distributed hash table (DHT) in order to locate • Versioningsystemsefficiently use storage re- replicas of data chunks: before backing up a data sources by storing only differences between chunk, each peer looks for other replicas of that versions. block (wherever they are physically located) amongst participating nodes and only keeps it if it isn’t • File systems benefit have an intimate knowl- stored already. edge of how files are represented in the un- derlying block store; integrating versioning capabilities at the file system level allows to 2. Future Directions benefit from this fine-grained vision of files’ This section describes topics that have only contents and to coalesce the contents (blocks) been briefly discussed and will require further of different versions of a file. research. • The file system knows each time a new ver- sion is created, although it cannot distin- 2.1. Fairness Enforcement Technique in guish which versions are valuable to the user Cooperative Backup without the user’s (or application’s) help. It Most peer-to-peer backup systems propose could guarantee that a version being used on the one hand a dedicated mechanism to en- (e.g. being backed up) will not vanish. force fair contribution (e.g. [9,10]) and on the other hand another mechanism to deal with Moreover, as long as storage is available, a user peers’ unavailability (being malicious or not). It might wish to keep copies of different versions was noted that a global economic model (such as of his important files on his backup peers. From those found in most modern peer-to-peer net- these observations, one can envision the peer-to- works [3,6]) involved in all interactions amongst peer backup system itself as something that is peers and taking into account the whole behav- layered on top of a versioning file system. - 3 -

On the other hand, relying on a version- hancing Technologies (PET 2003), Dresden, ing file system implies modifying the end-users Germany, March, 2003. file system, this could not be an option in wide deployment scenarios. As a consequence, we [8] L. P. COX, B. D. NOBLE – Pastiche: Making should probably design MoSAIC in a way that al- backup cheap and easy – Fifth USENIX lows it to adapt to a versioning file system when SymposiumonOperating SystemsDesignand available. Implementation, Boston, MA, USA, Decem- ber, 2002.

References [9] LANDON P. COX,BRIAN D. NOBLE – Samsara: Honor Among Thieves in Peer-to-Peer Storage – 19th ACM Symposium on Operat- [1] ANTONY ROWSTRON,PETER DRUSCHEL – Pas- ing Systems Principles, Bolton Landing, NY, try: Scalable, distributed object location USA, October, 2003. and routing for large-scale peer-to-peer systems – Proceedings of the 18th IFIP/ACM [10] MARTIN LANDERS,HAN ZHANG,KIAN-LEE International Conference on Distributed Sys- TAN – PeerStore: Better Performance by tems Platforms (Middleware 2001), Hei- Relaxing in Peer-to-Peer Backup – Proceed- delberg, Germany, November, 2001, pp. ings of the Fourth International Conference 329-350. on Peer-to-Peer Computing, Zurich, Switzer- land, August, 2004. [2] B. CORNELL, P.DINDA, F. BUSTAMANTE – Way- back: A User-level Versioning File System [11] SEAN QUINLAN,SEAN DORWARD – Venti: A for – Proceedings of the USENIX An- new approach to archival storage – First nual Technical Conference, FREENIX Track, USENIX conference on File and Storage Tech- 2004, pp. 19-28. nologies, Monterey,CA, 2002.

[3] CHRISTIAN GROTHOFF – An Excess-Based [12] SYLVIA RATNASAMY,PAUL FRANCIS,MARK Economic Model for Resource Allocation HANDLEY,RICHARD KARP,SCOTT SHENKER –A in Peer-to-Peer Networks – Wirtschaftsin- Scalable Content-Addressable Network – formatik, 3-2003June, 2003. Proceedings of ACM SIGCOMM 2001, UC San Diego, California, USA, August, 2001. [4] DOUGLAS S. SANTRY,MICHAEL J, FEELEY, NORMAN . HUTCHINSON,ALISTAIR C. VEITCH, [13] Z.N.J. PETERSON, .C. BURNS – Ext3cow: ROSS W. CARTON,JACOB OFIR – Deciding The Design, Implementation, and Anal- when to forget in the Elephant file system ysis of Metadata for a Time-Shifting File – Proceedings of the 17th ACM Symposium System – HSSL-2003-03, Hopkins Storage on Operating Systems Principles (SOSP‘99), Systems Lab, Department of Computer December, 1999, pp. 110-123. Science, Johns Hopkins University, USA, 2003. [5] ION STOICA,ROBERT MORRIS,DAVID LIBEN-NOWELL,DAVID R. KARGER, M. FRANS KAASHOEK,FRANK DABEK,HARI BALAKRISHNAN – Chord: A ScalablePeer-to- peer Lookup Protocol for Internet Appli- cations – IEEE/ACM Transactionson Net- working, 2005?.

[6] KEVIN LAI,MICHAL FELDMAN,JOHN CHUANG, ION STOICA – Incentives for Cooperation in Peer-to-Peer Networks – Workshop on Economics of Peer-to-Peer Systems.

[7] KRISTA BENNETT,CHRISTIAN GROTHOFF – GAP - practical anonymous networking – Pro- ceedings of the 3rd Workshop on Privacy En-