Improving Performance in the Gnutella Protocol

Jonathan Hess and Benjamin Poon
Department of Computer Science, University of California, Berkeley
jonhess (at) uclink.berkeley.edu, bpoon (at) uclink.berkeley.edu

Abstract

The Gnutella protocol describes a completely decentralized P2P file sharing system in which queries are flooded to all neighbors in the search for files. As originally specified, the protocol has no notion of providing privacy; as agencies have begun to censor and threaten users of such systems, participation has decreased. In turn, users who continue to use the network often choose not to share data for fear of litigation. This reduces data redundancy and increases the workload of fully participating peers. As files become less available, Gnutella peers must broadcast queries deeper into the network. While data participation is relatively uncontrollable, increased redundancy and decreased workload can be achieved by replicating files to other peers. This, however, must be done in a way that preserves the ability of the proxy peers to deny knowledge of file content. In this paper, we present an extension to the Gnutella protocol that achieves replication through encrypted mirroring. We further improve performance by directing queries using a Bloom filter mechanism. Through simulation, we explore the performance gains of these protocol extensions in terms of query success rate, query bandwidth consumption, and aggregate bandwidth consumption. In the end, BloomNet satisfies queries more readily than Gnutella while using approximately one-fourth of the bandwidth for queries.

1 Introduction

Traditionally, computers have communicated in a fashion modeled by the client-server paradigm: a client computer makes requests of a server computer that fulfills those requests. This model has served as a central idea of computer networking for many years. It can be found anywhere from common protocols like HTTP and FTP to online banking systems. The problems inherent in this paradigm are rooted in its centralization: there is a single point of failure, which makes denial-of-service attacks and loss of privacy real possibilities. Recently, however, the peer-to-peer (P2P) paradigm has become increasingly popular because of its ability to provide ad-hoc collaboration, information sharing, privacy, self-administration, and efficient accumulation of existing distributed resources over a large-scale environment. Peer-to-peer file sharing (P2PFS) is specific to the information sharing and privacy aspects of the P2P paradigm, in which any two hosts make a connection through a decentralized network in order to share files.

One of the necessities of all P2P systems is cooperation; without it, these systems lose the very fabric of their existence. In the P2PFS domain, without peers sharing files, there are no files to download, making the system useless. In [4], it was empirically shown that over 70% of users of the popular P2PFS system Gnutella chose to free-ride: to download from the huge library of files without making any of their own files available. As more and more peers choose to free-ride, P2PFS degenerates into the client-server model with all of its disadvantages. [4] shows that a small number of Gnutella peers contribute a disproportionately large number of files. This behavior is indeed reminiscent of the client-server paradigm, where the few contributors act as servers and the remaining population acts as clients. Clearly, for all P2PFS systems, as fewer peers contribute files for the common good, the system's performance degrades; further, as mentioned before, if all peers choose to free-ride, the system collapses. To make matters worse, the increased threat of litigation by some agencies has decreased the replication of files in P2PFS systems. The network still boasts the same library of files; it simply has fewer copies. Unfortunately, demand does not change. This decreased replication causes an increase in the workload for sharing peers: fewer peers must now supply the unchanged demand. An increase in query depth is similarly required to find data in the now more sparsely populated network.

Therefore, the goal of this work is to improve the performance of such systems in the face of decreased replication. In particular, we make an extension to the Gnutella protocol, called BloomNet, that includes two performance-improving techniques: file mirroring and directed search. After introducing Gnutella further in Section 2, we discuss the overall design of the protocol extension in Section 3. Section 4 follows with a description of the constructed simulation model as well as the metrics for determining performance, with Section 5 evaluating the results from the simulations according to those metrics. Lastly, Section 6 examines related work, Section 7 concludes, and Section 8 discusses possibilities for further improvement of BloomNet. The key contributions of this work are the addition of several improvements to the Gnutella protocol that allow for less query traffic with improved query success rates, and the creation of a versatile Gnutella simulator with many adjustable parameters.

2 Gnutella

The Gnutella protocol is a P2PFS model that provides a mechanism for the distributed searching of shared files across many connected hosts, called peers. To share files, a peer starts a Gnutella client A on her local networked computer. This client then connects to an already-existing Gnutella client B, finding its address through some out-of-band means. Now, B announces to all of the clients it knows (its neighbors) that a new client has joined the network. This occurs recursively out into the network, until the announcement message has traveled a certain distance: the time-to-live, or TTL. Similarly, when querying for a file, client A sends out a Query message telling its neighbors that it is looking for a certain file. As other clients see this message, they check their locally stored files to see if any of them match. If a match is found, a QueryHit message is returned to the sender along the path taken by the Query. Subsequent to checking for local matches, the client repeats the broadcasting of the Query message to all of its neighbors. The number of messages, and hence the bandwidth, required for a query is clearly exponential in the breadth and depth of the broadcast; moreover, if a file exists in the network, it is not guaranteed to be found if the Query message does not reach a client that is sharing the file. (A sketch of this flooding logic appears at the end of this section.)

In contrast to Gnutella, P2PFS systems have also been built on top of distributed hash tables (DHTs), which ameliorate the problem of creating too much traffic and guarantee the location of an object if it exists anywhere in the network. However, several factors arise in comparing Gnutella with DHT-based models that prompt us to favor improving Gnutella. First, DHTs can only provide exact-match file querying in a scalable manner, as opposed to Gnutella's built-in support for keyword searches. Second, DHTs expend much bandwidth when nodes join or leave the network (which happens extremely frequently), whereas Gnutella's ad-hoc topology creation requires little to no maintenance. Third, as argued in [9], DHTs enable the efficient location of a single file in the network, similar to finding a needle in a haystack. While DHTs are very adept at this, most queries in P2PFS systems are for hay: files that are widely replicated. Gnutella finds such files very easily. Fourth, Gnutella is already widely deployed, and applying incremental changes to already-deployed systems is more likely to succeed than trying to deploy a new system. It is for these four reasons that we chose to focus our efforts on improving the Gnutella protocol in designing BloomNet, as opposed to creating a new DHT-based model.
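To make the flooding mechanics above concrete, the following sketch shows one way a peer might process a Query message: check locally shared files, return a QueryHit toward the previous hop, then decrement the TTL and rebroadcast to every other neighbor. The class and method names (Peer, Query, handleQuery, receiveQueryHit) are our own illustration and not part of the Gnutella specification; duplicate suppression by message ID is a simplification of Gnutella's GUID-based routing tables.

    import java.util.*;

    // A minimal sketch of Gnutella-style query flooding; names are illustrative only.
    class Peer {
        final Set<String> sharedFiles = new HashSet<>();
        final List<Peer> neighbors = new ArrayList<>();
        final Set<UUID> seenQueries = new HashSet<>();   // suppress re-broadcast of duplicates

        static class Query {
            final UUID id;
            final String searchString;
            final int ttl;
            Query(UUID id, String searchString, int ttl) {
                this.id = id; this.searchString = searchString; this.ttl = ttl;
            }
        }

        void handleQuery(Query q, Peer from) {
            if (!seenQueries.add(q.id)) return;          // already processed this query
            // Check locally shared files; a hit is returned along the reverse path.
            for (String file : sharedFiles)
                if (file.contains(q.searchString))
                    from.receiveQueryHit(q.id, this, file);
            // Decrement the TTL and flood to every neighbor except the previous hop.
            if (q.ttl <= 1) return;
            Query forwarded = new Query(q.id, q.searchString, q.ttl - 1);
            for (Peer n : neighbors)
                if (n != from)
                    n.handleQuery(forwarded, this);
        }

        void receiveQueryHit(UUID queryId, Peer responder, String file) {
            // In the real protocol the QueryHit is routed hop-by-hop back to the
            // originator; this simplification just hands it to the previous hop.
        }
    }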

3 BloomNet Design

BloomNet makes two major additions to the Gnutella protocol, both of which are aimed at improving performance given decreased file replication. Each addition introduces its functionality to the protocol through the creation of a new message type, both described below in Table 1. The use of file mirroring is discussed first in Section 3.1, followed by Section 3.2's explanation of directed queries.

Table 1. A listing and description of the two messages used by BloomNet.

  Message                    Description
  Mirroring Request (MRM)    The mechanism by which mirrors are chosen and created
  Bloom                      Used in conjunction with Ping messages to discover the Bloom filter associated with a node on the network

3.1 File Mirroring

The goal of mirroring is to increase the replication factor of files while keeping sole legal blame on the original sharer, called the originator. This gives BloomNet a way to deal with flash-crowd situations, as well as a means to allow more peers to find mirrored files. We do so by means of a new protocol message, the Mirroring Request Message (MRM), coupled with file encryption. Throughout this section, we explore the problem space from the point of view of a single client.

The first decision the originator must make is the strategy with which to replicate its f files F1 … Ff. A naïve technique would be to replicate all f files as much as possible. However, this would consume so much bandwidth that it would outweigh the benefits, as seen below in Figure 1.

Figure 1. The originator sends MRMs for all of its files to all of its neighbors.

Figure 2. The originator sends one MRM for a single file, each time its demand is above mirrorThresh.

The more conservative approach taken by BloomNet is to replicate only certain files, requiring the client to decide which file to mirror at what time (see Figure 2, above). This is most appropriately decided by tracking the demand Di of each file Fi, and mirroring Fi when Di exceeds a given threshold mirrorThresh. For this paper, we assume that only one mirror is created at a time, although multiple mirrors could be made at once (discussed further as possible future work in Section 8). Note that when each mirror is created, mirrorThresh should increase and Di should be reset; additional mirrors are not required as readily when there is a fresh mirror in the system.

Having decided when to replicate files, the second decision the client must make is where to send the file. This is done through the use of MRMs: to mirror file Fi, the originator O sends an MRM to find a client to act as the mirror M. This message contains a Gnutella header (including a time-to-live, or TTL), O's file-transfer listening port, and a specially created file mirror index I that represents Fi. O also writes I to its list of outstanding MRMs, MRMList, used later in the mirroring process. The peer that receives the MRM when its TTL reaches 0 becomes the designated mirror. While the client could decide to flood MRMs on all outbound connections, this is not needed when only one mirror is required; therefore, the client sends the MRM on only one randomly chosen outbound connection, and all clients that receive an MRM likewise forward it along only one randomly chosen outbound connection. This effectively routes the MRM to a single, distant mirror. A sketch of this decision and routing logic follows.
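The demand-tracking decision and the single-connection random walk of the MRM can be sketched as below. This is a minimal illustration under stated assumptions: the starting threshold, the doubling of mirrorThresh after each mirror, and the class and field names (MirroringPeer, recordDemand, becomeMirror) are ours; the protocol only requires that the threshold increase and that Di be reset.

    import java.util.*;

    // Sketch of BloomNet's mirroring decision and MRM random-walk routing; names
    // and the threshold-doubling policy are illustrative assumptions.
    class MirroringPeer {
        final Map<String, Integer> demand = new HashMap<>();    // D_i per shared file
        int mirrorThresh = 10;                                   // assumed starting threshold
        final List<MirroringPeer> outbound = new ArrayList<>();
        final Set<Integer> mrmList = new HashSet<>();            // outstanding mirror indices I
        final Random rng = new Random();

        // Called whenever a query hit is served for file f.
        void recordDemand(String f) {
            int d = demand.merge(f, 1, Integer::sum);
            if (d > mirrorThresh) {
                int mirrorIndex = rng.nextInt(Integer.MAX_VALUE); // special index I for f
                mrmList.add(mirrorIndex);
                forwardMRM(new MRM(mirrorIndex, /* ttl = */ 6), this);
                mirrorThresh *= 2;      // assumed policy: demand must climb higher next time
                demand.put(f, 0);       // reset D_i now that a fresh mirror exists
            }
        }

        static class MRM { final int index; int ttl; MRM(int i, int t) { index = i; ttl = t; } }

        // Each hop forwards the MRM on exactly one randomly chosen outbound edge.
        void forwardMRM(MRM m, MirroringPeer origin) {
            if (m.ttl == 0 || outbound.isEmpty()) {
                becomeMirror(m, origin);   // TTL exhausted: this peer is the designated mirror
                return;
            }
            m.ttl--;
            outbound.get(rng.nextInt(outbound.size())).forwardMRM(m, origin);
        }

        void becomeMirror(MRM m, MirroringPeer origin) {
            // The mirror now fetches the (encrypted, name-stripped) file from the
            // originator over HTTP using index m.index, as described in the text.
        }
    }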

Figure 3. The path of an MRM as it is sent out from the originator, O, and routed on randomly chosen outbound connections, with the TTL decremented from 6 at each hop. Note that when MRMTTL = 0, the mirror, M, is chosen.

At this time, M connects to O over HTTP to perform a normal Gnutella file-transfer request for the special file mirror index I. If O were to simply send the contents of the file as usual, the mirror could be made liable for copyright infringement due to its possibly illegal possession of the protected file. Thus, we provide deniability for mirrors by encrypting and name-mangling files before mirroring. O knows that the file-transfer request was initiated in response to one of its MRMs because the file index in the request will be in MRMList. In this case, O encrypts the file and strips its name before sending it to M. O, however, does not send M the key. This affords M deniability. Lastly, O adds M to its list of mirrors for the file, mirrorList.

Now, when O receives a Query message from a requester R, it follows the pseudocode in Figure 4. First, it searches its locally shared files for any matches to the search string in the query. If O has enough bandwidth to serve files, it goes through each match and sends QueryHit messages containing O's local file index for the matched file. If O does not have enough bandwidth, it checks whether each match is mirrored. If it is, then O multiplexes the request for the file index over its set of mirrors by sending back a QueryHit message with a mirror's address in the header, as well as the encryption key K, the original name, and the remote index I for the particular file. Upon receipt of such a QueryHit, R requests the file from the mirror Mr named in the message. When it receives this (encrypted) file, R uses K to decrypt the file, restores the name, and the process is complete.

    handleQuery(String searchString)
        // Search locally shared files for a match to searchString
        matchingIndices = search(searchString)
        if (availableUpKBps > 0)
            // Return a QueryHit message for each match found
            makeQueryHits(matchingIndices)
        else
            foreach index in matchingIndices
                if (index is mirrored)
                    // Multiplex the request for index over the set of mirrors
                    // by sending a QueryHit with the from field set to the
                    // address of a mirror
                else
                    // Reject the request

Figure 4. Pseudocode for receiving a Query message.
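The encrypt-and-strip step performed by the originator can be made concrete with a small sketch. This is a minimal illustration, assuming AES as the cipher (the paper does not specify one); the names Originator, MirrorRecord, and handleFileRequest are hypothetical, and a real implementation would choose an explicit cipher mode rather than the platform default.

    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import java.util.*;

    // Sketch of the originator's side of a mirror transfer. The cipher choice and
    // all names are our assumptions; the paper only requires that the mirror never
    // see the plaintext, the original filename, or the key.
    class Originator {
        final Set<Integer> mrmList = new HashSet<>();                  // outstanding mirror indices I
        final Map<Integer, MirrorRecord> mirrorList = new HashMap<>(); // index -> mirror info

        static class MirrorRecord {
            final String mirrorAddress; final SecretKey key; final String originalName;
            MirrorRecord(String a, SecretKey k, String n) { mirrorAddress = a; key = k; originalName = n; }
        }

        // Called when an HTTP file request arrives for index `index`.
        byte[] handleFileRequest(int index, String requesterAddress,
                                 byte[] plaintext, String originalName) throws Exception {
            if (!mrmList.remove(index)) {
                return plaintext;                       // ordinary Gnutella transfer
            }
            // Request matches an outstanding MRM: encrypt, strip the name, keep the key.
            SecretKey key = KeyGenerator.getInstance("AES").generateKey();
            Cipher cipher = Cipher.getInstance("AES");  // default mode; illustrative only
            cipher.init(Cipher.ENCRYPT_MODE, key);
            byte[] ciphertext = cipher.doFinal(plaintext);
            mirrorList.put(index, new MirrorRecord(requesterAddress, key, originalName));
            return ciphertext;                          // the mirror stores only this opaque blob
        }
    }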

Looking back at the setting of MRMTTL: if it were set to a small number, M would be close to O, in which case M could possibly subvert the encryption by intercepting query traffic going to O and mirror traffic coming from O to find out which files are being sent out. Therefore, MRMTTL should be set to a relatively large number.

An inherent flaw with this extension to the Gnutella protocol is that the originator of mirror requests must stay in the network in order for the created mirrors to be used. If the originator leaves, requests for a mirrored file will not receive hits from the mirrors, because the mirrors do not know the contents of the files they store and thus cannot reply with QueryHits.

3.2 Directed Queries

While mirroring increases the replication of files, directed search aims to forward queries to the nodes that have the highest probability of returning results. Since many of the network's nodes contribute no data, forwarding queries to those nodes is not desirable. For the equivalent of fewer network resources, directed search should be able to satisfy as many (or more) queries. Query direction is achieved by means of an expensive initial Bloom filter broadcast, realized through a protocol extension: the Bloom message. Directed search is best described in terms of a policy P consisting of a query depth Dq, a Bloom depth Db, a filter timeout set T, and a branching width B.

Upon initializing itself, a new BloomNet client builds an index of the files it is willing to share. Each file's name is broken into tokens, or strings of alphanumeric characters. Each token is then inserted into a Bloom filter of a system-wide agreed-upon specification*. Upon making a new overlay connection, each BloomNet client forwards its local Bloom filter, encapsulated in a Bloom message, on this new connection. Clients keep a set of Bloom filters associated with each of their overlay edges. Each set contains Db elements, and each element's index corresponds to a depth. The elements themselves are the merge of all the Bloom filters that the client has received from that edge at that depth. When a client receives a Bloom message, it extracts the filter data and checks the message's hop count h. It then increments a counter noting the number of filters merged into that edge's depth h. Upon exceeding the threshold T_h for that depth, the counter is reset and the filter is zeroed. The client then uses binary OR to merge the filter data into the set of filters for that overlay edge at precisely depth h, and forwards the Bloom message to all of its other neighbors. In this manner, Bloom filter information is shared.

Query propagation must also change to leverage the Bloom filter information that is now in the network. Upon receiving a query, a client breaks the query text into tokens as described above. It then scores each edge based on the number of Bloom filters associated with that edge that indicate containment of all the tokens. This scoring function weights edges exponentially with respect to depth: Bloom filter matches at depth 1 score exponentially more points than matches at greater depths. After scoring, the query is forwarded to the B edges with the highest scores. Note that the edge on which the query was originally received is excluded from the scoring process. When new clients join the network, they will not necessarily learn the Bloom information for their distant neighbors. The backpressure of filter replies was deemed too expensive for this extra information propagation: because Bloom filters are already a lossy source of data, we found it tolerable to likewise lose this information.

Like a normal query, a directed query still grows exponentially. Both broadcast methods have a cost function of Fanout^QueryTTL; however, since directed search aims to choose better edges for forwarding, it lowers Fanout and increases QueryTTL. For example, where a normal query might have a QueryTTL of 6 and a Fanout of 4, costing 4^6 = 4096 messages, a directed query would choose a QueryTTL of 7 and a Fanout of 2, costing 2^7 = 128 messages, a significant savings.
With these additions, BloomNet now pays an expensive up front cost for Bloom filter broadcasts that it expects to amortize over the savings gained by using fewer query messages. We will see in Section 5 that using a small value for Db can indeed be successful and this amortization will prove worthwhile.
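The Bloom filter construction, per-edge scoring, and top-B forwarding described in Section 3.2 can be sketched as follows. The filter size, the number of hash functions, the string-hash scheme, and the base of the exponential scoring are all placeholder choices on our part; the protocol only assumes a system-wide agreed-upon filter specification and an exponential weighting toward shallow depths.

    import java.util.*;

    // Sketch of BloomNet's directed query forwarding; parameters are placeholders.
    class DirectedSearch {
        static final int FILTER_BITS = 256 * 8;   // e.g. a 256-byte filter
        static final int NUM_HASHES = 4;

        // Insert each alphanumeric token of every shared filename into a bit set.
        static BitSet buildFilter(Collection<String> fileNames) {
            BitSet filter = new BitSet(FILTER_BITS);
            for (String name : fileNames)
                for (String token : name.toLowerCase().split("[^a-z0-9]+"))
                    if (!token.isEmpty()) insert(filter, token);
            return filter;
        }

        static void insert(BitSet filter, String token) {
            for (int i = 0; i < NUM_HASHES; i++)
                filter.set(Math.floorMod((token + i).hashCode(), FILTER_BITS));
        }

        static boolean mightContain(BitSet filter, String token) {
            for (int i = 0; i < NUM_HASHES; i++)
                if (!filter.get(Math.floorMod((token + i).hashCode(), FILTER_BITS))) return false;
            return true;
        }

        // Each edge keeps one merged filter per depth 1..Db (index 0 = depth 1).
        // Matches at shallower depths count exponentially more toward the score.
        static double scoreEdge(List<BitSet> filtersByDepth, List<String> queryTokens) {
            double score = 0;
            for (int depth = 1; depth <= filtersByDepth.size(); depth++) {
                BitSet f = filtersByDepth.get(depth - 1);
                boolean allTokensMatch = queryTokens.stream().allMatch(t -> mightContain(f, t));
                if (allTokensMatch) score += Math.pow(2, filtersByDepth.size() - depth);
            }
            return score;
        }

        // Forward the query on the B highest-scoring edges, excluding the arrival edge.
        static <E> List<E> chooseEdges(Map<E, List<BitSet>> edgeFilters, E arrivalEdge,
                                       List<String> queryTokens, int branching) {
            return edgeFilters.entrySet().stream()
                    .filter(e -> !e.getKey().equals(arrivalEdge))
                    .sorted(Comparator.comparingDouble(
                            (Map.Entry<E, List<BitSet>> e) -> scoreEdge(e.getValue(), queryTokens)).reversed())
                    .limit(branching)
                    .map(Map.Entry::getKey)
                    .collect(java.util.stream.Collectors.toList());
        }
    }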

* It is assumed that the set of shared files does not vary during the life of the client.

4 Simulation Model

To quantitatively measure how well BloomNet performs, we created a simulator that is able to simulate both standard Gnutella networks and BloomNet networks with the protocol extensions in action. In order to keep all other behavior alike, we created the BloomNet portion of the simulator as a subclass of the Gnutella portion, making the only functional differences the mirroring and Bloom filter additions.

4.1 Metrics

To judge how well a simulation performs, we chose three metrics that capture the most important measurable aspects of query performance in a P2PFS system: query success rate, query bandwidth consumption, and total bandwidth consumption. Query success rate is defined as the percentage of queries that received a response out of the total number of queries. It is measured by incrementing a global count of queries when one is made from any peer in the simulation, and incrementing a global count of successful queries when the first QueryHit message is received for a given query. Query bandwidth consumption is the amount of bandwidth used by query traffic for all peers over the entire simulation; this metric was measured by recording the size of all query messages as they arrive at a peer. Lastly, total bandwidth consumption is measured by accounting for both Bloom and Query messages.

4.2 Modeling P2PFS

In order to model the real world as closely as possible, we surveyed recent measurement studies of real-world P2PFS systems. Much of this research was done to determine the characteristics of today's P2PFS environment, and it has revolved around several methods: passive network monitoring [1, 3, 4, 5], monitoring border routers [4], network crawlers [6], monitoring ping/pong messages [4], and sending query messages [1, 3, 4, 5]. The data for these characteristics have been gathered here and split into five categories: characteristics of file sharing, file popularity, file types, queries, and participating peers. While individual numbers come from sources ranging from measurements of the Gnutella system [4] to the University of Washington network [3], the general trends are supported across numerous sources [1, 2, 3, 4, 5, 6, 7]; therefore, our simulator takes all of these findings into account.

4.2.1 File Sharing

The most important characteristic of P2PFS is the general cooperation of peers in the system. Unfortunately, studies show that users tend not to share any files at all, resulting in low levels of cooperation across P2PFS systems. In the Gnutella system, 66% of peers share no files and 73% share ten or fewer [4]. A general trend that stems from this lack of cooperation is the degradation of P2PFS into a client-server model. In the University of Washington campus network, the Gnutella network, and the Napster network, there are two general classes of peers: those that exhibit server-like qualities (share many files and download infrequently) and those that exhibit client-like qualities (share few or no files and download frequently). For Gnutella in particular, the top 1% of sharing peers account for 37% of all shared files in the network, a staggering statistic. Even worse, the top 20% of sharing peers account for 98% of all shared files [4]. In the Napster network, 40-60% of peers share only 5-20% of all shared files [6]. Fortunately, research has shown that freeloaders do not have a large negative impact on the performance of non-freeloaders. Some other statistics that do not directly result in general trends are: 1) less than 20% of file requests at the University of Washington campus (over a nine-day period) resulted in successful transactions [3], and 2) the average P2PFS peer shares 340 files [7].

4.2.2 File Popularity

The Zipf distribution, often used to characterize word-usage frequency in natural language, is also useful in discussing file popularity in P2PFS systems. A Zipf distribution follows a straight line when plotted on a log-log graph: it has few data points with high y-values and many data points close to the x-axis (as in Figure 5).

Figure 5. Left: A Zipf distribution plotted on a log-log scale [10]. Right: A Zipf distribution plotted on a linear scale [10].

If the x-axis is the ranking of usage of an English word and the y-axis is its frequency, it is easy to see that the Zipf distribution characterizes English word usage from the following examples: 1) the words "the" and "and" are used extremely often, and there are few such words; 2) the words "cat" and "paper" are used relatively often, and there are a fair number of such words; 3) the words "Zipf" and "logarithmic" are used very rarely, and there are a huge number of such words. Much like the English language, studies show that the Zipf distribution occurs in file popularity, query-string popularity, and replication of files in P2PFS [2, 5]. One exception is that, for file popularity, very popular files are roughly equally popular (a deviation from Zipf), while less popular files follow the Zipf curve more strictly [8]. Generally, the most popular files account for a large portion of communication and storage: 1) the most popular 5% of files account for 50% of all transfers [1], 2) the most popular 10% of files account for 50% of all stored files [1], and 3) the most popular 10% of transferred files account for over 60% of total transfers [1].
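One way to draw file ranks from such a distribution in a simulator is sketched below. The exponent s and the flat-head cutoff are illustrative choices, and the class name ZipfSampler is ours; the measurements above say only that the most popular files deviate from Zipf by being roughly equally popular.

    import java.util.*;

    // Sketch of rank selection under a Zipf-like popularity curve with a flat head.
    class ZipfSampler {
        final double[] cumulative;   // cumulative probability by rank
        final Random rng = new Random();

        ZipfSampler(int numFiles, double s, int flatHead) {
            double[] weights = new double[numFiles];
            double total = 0;
            for (int rank = 1; rank <= numFiles; rank++) {
                // Ranks inside the flat head all receive the same weight (assumed behavior).
                int effectiveRank = Math.max(rank, flatHead);
                weights[rank - 1] = 1.0 / Math.pow(effectiveRank, s);
                total += weights[rank - 1];
            }
            cumulative = new double[numFiles];
            double running = 0;
            for (int i = 0; i < numFiles; i++) {
                running += weights[i] / total;
                cumulative[i] = running;
            }
        }

        // Returns a 0-based rank; rank 0 is the most popular file.
        int sampleRank() {
            double u = rng.nextDouble();
            int idx = Arrays.binarySearch(cumulative, u);
            int rank = idx >= 0 ? idx : -idx - 1;
            return Math.min(rank, cumulative.length - 1);  // guard against rounding at the tail
        }
    }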

4.2.3 File Types

As one might guess, most files being shared on P2PFS networks are in the MP3 format [1, 2, 3]. This is the predominant reason that the mode of file sizes is 4MB and the download time for the most popular file is 150 seconds (the time it takes to download 4MB on a broadband connection, the most popular form of Internet connection among P2PFS peers, as seen in Section 4.2.5) [2]. Additionally, it has been shown that generally only a small fraction of shared files are large, but those files account for a huge amount of storage: 3% of all files are videos, but they account for 21% of all stored bytes; in one measured network in particular, only 5% of the objects are over 100MB [1, 3].

4.2.4 Queries

[4] shows that a huge fraction of queries (63%) never receive a response. Adar and Huberman postulate that the reason for this is that many cooperating peers are not actually cooperating: they share only "undesirable" files, which results in a lack of query responses. Assuming this is true, the high percentage of unanswered queries would extend to all P2PFS architectures, not just ones that do not provide deterministic file location.

As introduced in Section 4.2.1 on file-sharing statistics, the degradation of P2PFS into a client-server model is prevalent in query statistics as well. An amazing 47% of all query answers are provided by the top 1% of peers; even worse, 98% of all query answers are provided by the top 25%.

4.2.5 Participating Peers

The lack of participation from peers in P2PFS contributes to the breakdown of the system. First, peers are frequently unavailable: the distribution of continuous availability time is heavily skewed toward short durations. This can cause problems for many requests because the median duration for request fulfillment is 130 seconds [1, 3]. Second, peers are frequently unavailable even when they are connected: the majority of connected peers are actually too busy sending files and answering/forwarding queries to handle additional requests [1]. Third, most peers do not correctly report their bandwidth, which shows an unwillingness to cooperate: 30% of Napster users report their bandwidth as "64Kbps or less" when they actually have significantly greater bandwidth; however, extremely high-bandwidth peers rarely misreport their bandwidth [6]. Some additional statistics are: 1) the average access link of a peer is 200Kbps, 2) 50% of Napster users and 60% of Gnutella users have broadband connections (Cable, DSL, T1, or T3), 3) 20% of Napster users and 30% of Gnutella users have better-than-broadband connections (3 Mbps or greater), 4) few peers actually have worse-than-broadband connections (64Kbps or less), and 5) most peers have medium latencies (20% of Gnutella peers have latencies greater than 280ms and 20% have less than 70ms) [2, 6].

4.3 BloomNet Simulator

Both the Gnutella and BloomNet portions of the simulator take into account the measurement metrics as well as the P2PFS characteristics discussed in Sections 4.1 and 4.2. The simulator was built and run on a 2.8GHz machine with 1GB of available RAM. Due to the memory limitations of the available simulation machines, and despite several iterations of streamlining memory requirements, we were unable to run simulations with more than approximately 1000 clients. We were, however, able to incorporate many parameter options in our simulations using a specially built GUI front-end, as seen below in Figure 6. This GUI both takes in parameters to the simulator and displays output on screen and in easy-to-parse text files. In addition to the GUI, a script was created to parse, average, and accumulate all of the output.

To better model real-world Gnutella networks, we modeled P2PFS characteristics using random numbers. For example, to model the fact that, in the Gnutella system, 66% of peers share no files and 73% share ten or fewer (Section 4.2.1), for each simulated client we chose a random number r where 0 ≤ r ≤ 100. If r ≤ 66, we made that client share no files; if 66 < r ≤ 73, we made that client share between 1 and 10 files; if r > 73, we made that client share a random number of files between 11 and 5072, with the probability weighted towards fewer files (a sketch of this assignment follows below). The simulator is limited to 5072 files only because that is all we chose to harvest from available data; the number could easily be increased. To make a query, a client picks a random filename from the list according to the modified Zipf distribution discussed in Section 4.2.2.

The topology used in the simulator followed the power-law distribution observed in real networks. This distribution gives a few nodes very high connectivity, with most nodes having very low connectivity, much like the social network of our world [12]. Furthermore, we were only interested in capturing bandwidth information in our simulations. The constant increase in latency from increasing the TTL for Query messages by two hops was not of interest to us; we were much more concerned with the bandwidth tradeoff present in broadcasting Bloom filters to reduce query breadth. Since control traffic on the Gnutella network is within the means of a broadband connection, we modeled each node as having infinite bandwidth for control traffic and zero latency.
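The file-count assignment described above might be implemented as in the sketch below. The exact weighting toward fewer files is our assumption (the product of two uniform draws, which is denser near zero); the text states only that the choice is weighted in that direction.

    import java.util.Random;

    // Sketch of the per-client file-count assignment from Section 4.3; the skew
    // function is an illustrative assumption.
    class FileCountAssigner {
        static final int MAX_FILES = 5072;
        final Random rng = new Random();

        int filesToShare() {
            double r = rng.nextDouble() * 100;
            if (r <= 66) return 0;                               // 66% of peers share nothing
            if (r <= 73) return 1 + rng.nextInt(10);             // next 7% share 1-10 files
            // Remaining peers share 11-5072 files, skewed toward the low end.
            double skewed = rng.nextDouble() * rng.nextDouble(); // denser near 0
            return 11 + (int) Math.round(skewed * (MAX_FILES - 11));
        }
    }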


Figure 6. Custom-made GUI that takes in many different parameters as input to the simulator, with output shown both on screen and in easy-to-parse text files.

5 Results

In this section, we present the results of our simulations according to the three measurement metrics discussed above in Section 4.1. Note that since we incorporated randomness into our simulations, we averaged the results of six simulations to obtain each point on all graphs. Parameters common to all result sets included: 1) having nodes attempt to acquire up to, and no more than, four outbound edges, 2) forming the network into a power-law structure by means of an introduction service, and 3) making each network consist of 768 nodes. For each set of results, we compare many versions of BloomNet, organized around different Bloom policies, against a traditional Gnutella client. We looked to see how factors such as Bloom broadcast depth and filter size affected query performance.

Each node in our simulation had a probability of executing a query at each tick of the simulation. To execute a query, a client would choose one file from its local copy of the global file list with probability proportional to a Zipf curve, and would then initiate a query based on that file's name. After receiving a hit, the client would remove that file from its local list of choices. This reflects the "choose at most once" behavior presented in [8] (a sketch of this behavior follows below). We ran each simulation for 200 seconds of wall-clock time, which yielded 900 opportunities for each client on the network to take action. An average of 1350 queries were executed in each run of the simulation.
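A sketch of this per-tick query behavior, including the choose-at-most-once rule, follows. The per-tick query probability and the names QueryingClient, tick, and onQueryHit are illustrative; the surrounding text gives the simulator's behavior only in aggregate (900 ticks per client, roughly 1350 queries per run).

    import java.util.*;

    // Sketch of per-tick query execution with "choose at most once" semantics.
    class QueryingClient {
        final List<String> candidateFiles;          // local copy of the global file list
        final double queryProbability = 0.01;       // illustrative chance of querying per tick
        final Random rng = new Random();

        QueryingClient(List<String> globalFileList) {
            this.candidateFiles = new ArrayList<>(globalFileList);
        }

        // Called once per simulation tick.
        void tick() {
            if (candidateFiles.isEmpty() || rng.nextDouble() >= queryProbability) return;
            int rank = Math.min(sampleZipfRank(), candidateFiles.size() - 1);
            issueQuery(candidateFiles.get(rank));
        }

        void onQueryHit(String fileName) {
            candidateFiles.remove(fileName);        // never query for this file again
        }

        int sampleZipfRank() { /* e.g. delegate to the ZipfSampler sketch above */ return 0; }
        void issueQuery(String fileName) { /* broadcast or direct the query */ }
    }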

5.1 Query Success

Query success is measured by the percentage of queries that achieve at least one hit (Figure 7). Note that the x-axis of all graphs in this section represents the Bloom filter policy: the depth to which the Bloom filters are sent and the number of bytes used for each Bloom filter (bytes/depth). We looked to this experiment to see how factors such as Bloom broadcast depth and filter size affect query performance.

Surprisingly, systems with Bloom depth = 1 fared best. It appears as though the exponential scoring scheme only added complication to the system, and it is not clear whether a better scoring system would warrant broadcasting Bloom filters beyond depth = 1. Increasing the depth of the Bloom filter broadcast appears to have added only noise to the system, forcing depth = 3 into dismal query satisfaction. As one would expect, systems with larger filters were able to reduce the number of false positives reported by their filters, so as filter size increased, so did the success of queries. The trend for the policies with broadcast depth three is, however, interesting: as the size of the Bloom filter increases, this policy catches up to the success rates of the other policies. Should this trend continue, deeper broadcast of large filters could provide the highest query success rate. However, as we will see in the next section, the bandwidth cost of depth three is prohibitive.

Figure 7. Query success for Gnutella and a variety of combinations of Bloom depths and buckets.

5.2 Bandwidth Consumption

Traffic consumption is broken down in terms of query traffic, Bloom filter traffic, and the sum of query and Bloom traffic. Looking at query bandwidth consumption, the results confirm our earlier prediction: if the Bloom procedure's query fan-out is decreased and the query TTL is slightly increased, it is much less expensive than the traditional broadcast, as seen below in Figure 8. Here, Gnutella's query traffic is approximately 40MB, while BloomNet's decreases to under 10MB, a 75% reduction in cost. Again in Figure 8, as long as the Bloom depth stays below three, filter traffic is manageable. However, when combined with the results from Figure 7, it appears that the policy with filter size 256 bytes and broadcast depth = 1 behaves best. When the depth is increased above two, BloomNet's bandwidth explodes into an exponential curve.


Figure 8. Bandwidth consumption for Gnutella and a variety of combinations of Bloom depths and buckets.

Similarly, in Figure 9, one sees that the growth of Bloom traffic dominates query traffic as Bloom's parameters increase. For a given depth, as the filter size in bytes increases, one sees a dramatic increase in Bloom domination. For example, when depth = 2, the ratio increases gradually from 1 to 2; when depth = 3, the ratio increases even more, from 11 to over 25. Again, this shows the exponential cost of increasing Bloom depth.

Figure 9. Ratio of Bloom traffic to query traffic over a variety of combinations of Bloom depths and buckets.

In summary, BloomNet is able to satisfy queries more readily than Gnutella, using approximately one-fourth of the query bandwidth in our simulation results. Further, as the network size increases, the performance gap will widen, because Gnutella's query bandwidth would be governed by a higher query TTL, the exponent in the cost function, while BloomNet would capitalize on its smaller fan-out, the base of the exponent in the cost function. BloomNet is also able to achieve a higher percentage of successful queries than Gnutella while using only one-fourth of the query bandwidth.

6 Related Work

BloomNet has related work in both its overall design and its file mirroring and Bloom filter features. One work in particular, by Chawathe et al., shares our view on the importance of focusing on Gnutella-like systems [9]. They propose to improve Gnutella's scalability by dynamically adapting the network topology and search algorithms to take advantage of the heterogeneity inherent in many P2PFS systems. In terms of improving Gnutella's performance, several papers include proposals for addressing its lack of scalability in particular. Adamic et al. in [11] try to exploit the existence of a power-law distribution in the nodal connections of Gnutella networks; Krishnamurthy et al. propose a cluster-based architecture for P2P systems, grouping peers into clusters using a central-server, network-aware clustering technique [13].

A number of works have also been written on using hierarchies of Bloom filters to limit P2P query space. In [14], Mohan and Kalogeraki propose propagation and routing algorithms for fully distributed networks through the use of a Kundali data structure. They aim to maximize the chances of getting query hits while minimizing latency and balancing load among many peers. In [15], Ledlie et al. look at the tradeoffs between regular and compressed Bloom filters, expressly with the purpose of helping to solve the name-query problem in distributed file systems. Their results show an improvement similar to BloomNet's, but only for web caching hierarchies in particular. Perhaps most related, Rhea and Kubiatowicz explore probabilistic location through attenuated Bloom filters, a lossy distributed index, in [16]. Their algorithm finds nearby replicas quickly, which is a goal BloomNet shares, but they encourage its use alongside deterministic algorithms in order to improve overall performance. Their work goes further than BloomNet in its accounting for mobile replicas.

7 Conclusion

In this paper, we propose two additions to the Gnutella protocol in order to offset the decreased user participation and file replication caused by opponents of P2PFS. To do so, we introduce a mirroring technique that allows us to efficiently use files that have been replicated onto multiple peers (mirrors), without compromising the mirrors' legal standing. Further, we improve performance by using Bloom filters to direct queries in a more efficient manner, allowing BloomNet to send queries down far fewer paths than Gnutella's broadcast style. Through simulation, we found that BloomNet was able to find hits more readily and achieve a higher percentage of successful queries while using less query bandwidth than Gnutella. Our main result is that BloomNet satisfies more queries than Gnutella while using only one-fourth of the query bandwidth.

8 Possible Future Work

The most beneficial future work for our simulator would be to streamline memory usage further by porting the code to C and running it on a more powerful machine or cluster. This would allow the simulator to handle larger numbers of clients before thrashing. For file mirroring, we may look into more sophisticated demand-realization techniques that involve more distributed tracking of demand, possibly using gossiping protocols to pass information between clients. This would need to be balanced against the already large portion of control traffic that Gnutella and BloomNet use. In terms of directed search, further studies of filter merging and scoring functions should be explored to determine whether broadcast depths greater than one can increase performance.

References

[1] J. Chu, K. Labonte, and B. Levine. "Availability and Locality Measurements of Peer-to-Peer File Systems," in Proceedings of ITCom: Scalability and Traffic Control in IP Networks, July 2002.
[2] Z. Ge, D. R. Figueiredo, S. Jaiswal, J. Kurose, and D. Towsley. "Modeling Peer-Peer File Sharing Systems," to appear in the proceedings of INFOCOM 2003.
[3] S. Saroiu, K. P. Gummadi, R. J. Dunn, S. D. Gribble, and H. M. Levy. "An Analysis of Internet Content Delivery Systems," in Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI 2002), Boston, MA, USA, December 2002.
[4] E. Adar and B. A. Huberman. "Free Riding on Gnutella." First Monday, Vol. 5, No. 10, October 2000.
[5] K. Sripanidkulchai. "The Popularity of Gnutella Queries and its Implications on Scalability," in Proceedings of the O'Reilly Peer-to-Peer and Web Services Conference, 2001.
[6] S. Saroiu, P. K. Gummadi, and S. D. Gribble. "A Measurement Study of Peer-to-Peer File Sharing Systems," in Proceedings of Multimedia Computing and Networking 2002 (MMCN '02), San Jose, CA, USA, January 2002.
[7] B. Yang and H. Garcia-Molina. "Improving Search in Peer-to-Peer Networks," October 2001.
[8] B. Yang and H. Garcia-Molina. "Efficient Search in Peer-to-Peer Networks." In Proceedings of ICDCS, 2002.
[9] Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, and S. Shenker. "Making Gnutella-like P2P Systems Scalable." In Proceedings of ACM SIGCOMM 2003, Karlsruhe, Germany, August 2003.
[10] "Zipf Distribution of Website Popularity (Alertbox Sidebar)." [Available at http://www.useit.com/alertbox/zipf.html]
[11] L. Adamic, R. Lukose, A. Puniyani, and B. Huberman. "Search in Power-law Networks." Physical Review E 64, 2001.
[12] J. Kleinberg. "Small World Phenomena and the Dynamics of Information." 2001.
[13] B. Krishnamurthy, J. Wang, and Y. Xie. "Early Measurements of a Cluster-based Architecture for P2P Systems." In Proceedings of the ACM SIGCOMM Internet Measurement Workshop 2001, San Francisco, CA, November 2001.
[14] A. Mohan and V. Kalogeraki. "Speculative Routing and Update Propagation: A Kundali Centric Approach." In IEEE 2003 International Conference on Communications, Anchorage, AK, May 2003.
[15] J. Ledlie, L. Serban, and D. Toncheva. "Scaling Filename Queries in a Large-scale Distributed File System." Research Report TR-03-02, Harvard University, January 2002.
[16] S. Rhea and J. Kubiatowicz. "Probabilistic Location and Routing." In Proceedings of INFOCOM 2002.
