Improving Performance in the Gnutella Protocol

Jonathan Hess and Benjamin Poon
Department of Computer Science, University of California, Berkeley
jonhess (at) uclink.berkeley.edu, bpoon (at) uclink.berkeley.edu

Abstract

The Gnutella protocol describes a completely decentralized P2P file sharing system in which queries are flooded to all neighbors in the search for files. As originally specified, the protocol has no notion of providing privacy; as agencies have begun to censor and threaten users of such systems, participation has decreased. In turn, users who continue to use the network often choose not to share data for fear of litigation. This reduces data redundancy and increases the workload of fully participating peers. As files become less available, Gnutella peers must broadcast queries deeper into the network. While data participation is relatively uncontrollable, increased redundancy and decreased workload can be achieved by replicating files to other peers. This, however, must be done in a way that preserves the ability of the proxy peers to deny knowledge of file content. In this paper, we present an extension to the Gnutella protocol that achieves replication through encrypted mirroring. We further improve performance by directing queries using a Bloom filter mechanism. Through simulation, we explore the performance gains of these protocol extensions in terms of query success rate, query bandwidth consumption, and aggregate bandwidth consumption. In the end, BloomNet satisfies queries more readily than Gnutella while using approximately one-fourth of the bandwidth for queries.

1 Introduction

Traditionally, computers have communicated in a fashion modeled by the client-server paradigm: a client computer makes requests of a server computer that fulfills those requests. This model has served as a central idea of computer networking for many years. It can be found anywhere from common protocols like HTTP and FTP to online banking systems. The problems inherent in this paradigm are rooted in its centralization: there is a single point of failure, which makes denial-of-service attacks and loss of privacy real possibilities. Recently, however, the peer-to-peer (P2P) paradigm has become increasingly popular because of its ability to provide ad-hoc collaboration, information sharing, privacy, self-administration, and efficient accumulation of existing distributed resources over a large-scale environment. Peer-to-peer file sharing (P2PFS) is specific to the information sharing and privacy aspects of the P2P paradigm, in which any two hosts make a connection through a decentralized network in order to share files.

One of the necessities of all P2P systems is cooperation; without it, these systems lose the very fabric of their existence. In the P2PFS domain, without peers sharing files, there are no files to download, making the system useless. In [4], it was empirically shown that over 70% of users of the popular P2PFS system Gnutella chose to free-ride: to download from the huge library of files without making any of their own files available. As more and more peers choose to free-ride, P2PFS degenerates into the client-server model with all of its disadvantages. [4] shows that a small number of Gnutella peers contribute a disproportionately large number of files. This behavior is indeed reminiscent of the client-server paradigm, where the few contributors act as servers and the remaining population acts as clients. Clearly, for all P2PFS systems, as fewer peers contribute files for the common good, the system's performance degrades; further, as mentioned before, if all peers choose to free-ride, the system collapses. To make matters worse, the increased threat of litigation by some agencies has decreased the replication of files in P2PFS systems. The network still boasts the same library of files; it simply has fewer copies. Unfortunately, demand does not change. This decreased replication causes an increase in the workload for sharing peers: fewer peers must now supply the unchanged demand. An increase in query depth is similarly required to find data in the now more sparsely populated network.

Therefore, the goal of this work is to improve the performance of such systems in the face of decreased replication. In particular, we make an extension to the Gnutella protocol, called BloomNet, that includes two performance-improving techniques: file mirroring and directed search. After introducing Gnutella further in Section 2, we discuss the overall design of the protocol extension in Section 3. Section 4 follows with a description of the constructed simulation model as well as the metrics for determining performance, with Section 5 evaluating the results from the simulations according to those metrics. Lastly, Section 6 examines related work, Section 7 concludes, and Section 8 discusses possibilities for further improvement of BloomNet. The key contributions of this work are the addition of several improvements to the Gnutella protocol that allow for less query traffic with improved query success rates, and the creation of a versatile Gnutella simulator with many adjustable parameters.

2 Gnutella

The Gnutella protocol is a P2PFS model that provides a mechanism for the distributed searching of shared files across many connected hosts, called peers. To share files, a peer starts a Gnutella client A on her local networked computer. This client then connects to an already-existing Gnutella client B, finding its address through some out-of-band means. Now, B announces to all of the clients it knows (its neighbors) that a new client has joined the network. This occurs recursively out into the network, until the announcement message has traveled a certain distance: the time-to-live, or TTL. Similarly, when querying for a file, client A sends out a Query message telling its neighbors that it is looking for a certain file. As other clients see this message, they check their locally stored files to see if any of them match. If a match is found, a QueryHit message is returned to the sender along the path taken by the Query. Subsequent to checking for local matches, the client repeats the broadcasting of the Query message to all of its neighbors. The number of messages, and hence the bandwidth, required for a query is clearly exponential in the breadth and depth of the broadcast; moreover, if a file exists in the network, it is not guaranteed to be found if the Query message does not reach a client that is sharing the file. (A sketch of this flooding logic appears at the end of this section.)

In contrast to Gnutella, P2PFS systems have also been built on top of distributed hash tables (DHTs), which ameliorate the problem of creating too much traffic and guarantee the location of an object if it exists anywhere in the network. However, several factors arise in comparing Gnutella with DHT-based models that prompt us to favor improving Gnutella. First, DHTs can only provide exact-match file querying in a scalable manner, as opposed to Gnutella's built-in support for keyword searches. Second, DHTs expend much bandwidth when nodes join or leave the network (which happens extremely frequently), whereas Gnutella's ad-hoc topology creation requires little to no maintenance. Third, as argued in [9], DHTs enable the efficient location of a single file in the network, similar to finding a needle in a haystack. While DHTs are very adept at this, most queries in P2PFS systems are for hay: files that are widely replicated. Gnutella finds such files very easily. Fourth, Gnutella is already widely deployed, and applying incremental changes to already-deployed systems is more likely to succeed than trying to deploy a new system. It is for these four reasons that we chose to focus our efforts on improving the Gnutella protocol in designing BloomNet, as opposed to creating a new DHT-based model.
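To make the flooding mechanics above concrete, the following sketch shows one way a peer might process a Query message: check locally shared files, return a QueryHit toward the previous hop, then decrement the TTL and rebroadcast to every other neighbor. The class and method names (Peer, Query, handleQuery, receiveQueryHit) are our own illustration and not part of the Gnutella specification; duplicate suppression by message ID is a simplification of Gnutella's GUID-based routing tables.

    import java.util.*;

    // A minimal sketch of Gnutella-style query flooding; names are illustrative only.
    class Peer {
        final Set<String> sharedFiles = new HashSet<>();
        final List<Peer> neighbors = new ArrayList<>();
        final Set<UUID> seenQueries = new HashSet<>();   // suppress re-broadcast of duplicates

        static class Query {
            final UUID id;
            final String searchString;
            final int ttl;
            Query(UUID id, String searchString, int ttl) {
                this.id = id; this.searchString = searchString; this.ttl = ttl;
            }
        }

        void handleQuery(Query q, Peer from) {
            if (!seenQueries.add(q.id)) return;          // already processed this query
            // Check locally shared files; a hit is returned along the reverse path.
            for (String file : sharedFiles)
                if (file.contains(q.searchString))
                    from.receiveQueryHit(q.id, this, file);
            // Decrement the TTL and flood to every neighbor except the previous hop.
            if (q.ttl <= 1) return;
            Query forwarded = new Query(q.id, q.searchString, q.ttl - 1);
            for (Peer n : neighbors)
                if (n != from)
                    n.handleQuery(forwarded, this);
        }

        void receiveQueryHit(UUID queryId, Peer responder, String file) {
            // In the real protocol the QueryHit is routed hop-by-hop back to the
            // originator; this simplification just hands it to the previous hop.
        }
    }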

3 BloomNet Design

BloomNet makes two major additions to the Gnutella protocol, both of which are aimed at improving performance given decreased file replication. Each addition introduces its functionality to the protocol through the creation of a new message type, both described below in Table 1. The use of file mirroring is discussed first in Section 3.1, followed by Section 3.2's explanation of directed queries.

Table 1. A listing and description of the two messages used by BloomNet.

  Message                    Description
  Mirroring Request (MRM)    The mechanism by which mirrors are chosen and created
  Bloom                      Used in conjunction with Ping messages to discover the Bloom filter associated with a node on the network

3.1 File Mirroring

The goal of mirroring is to increase the replication factor of files while keeping sole legal blame on the original sharer, called the originator. This gives BloomNet a way to deal with flash-crowd situations, as well as a means to allow more peers to find mirrored files. We do so by means of a new protocol message, the Mirroring Request Message (MRM), coupled with file encryption. Throughout this section, we explore the problem space from the point of view of a single client.

The first decision the originator must make is the strategy with which to replicate its f files F1 … Ff. A naïve technique would be to replicate all f files as much as possible. However, this would consume so much bandwidth that it would outweigh the benefits, as seen below in Figure 1.

Figure 1. The originator sends MRMs for all of its files to all of its neighbors.

Figure 2. The originator sends one MRM for a single file, each time its demand is above mirrorThresh.

The more conservative approach taken by BloomNet is to replicate only certain files, requiring the client to decide which file to mirror at what time (see Figure 2, above). This is most appropriately decided by tracking the demand Di of each file Fi, and mirroring Fi when Di exceeds a given threshold mirrorThresh. For this paper, we assume that only one mirror is created at a time, although multiple mirrors could be made at once (discussed further as possible future work in Section 8). Note that when each mirror is created, mirrorThresh should increase and Di should be reset; additional mirrors are not required as readily when there is a fresh mirror in the system.

Having decided when to replicate files, the second decision the client must make is where to send the file. This is done through the use of MRMs: to mirror file Fi, the originator O sends an MRM to find a client to act as the mirror M. This message contains a Gnutella header (including a time-to-live, or TTL), O's file-transfer listening port, and a specially created file mirror index I that represents Fi. O also writes I to its list of outstanding MRMs, MRMList, used later in the mirroring process. The peer that receives the MRM when its TTL reaches 0 becomes the designated mirror. While the client could decide to flood MRMs on all outbound connections, this is not needed when only one mirror is required; therefore, the client sends the MRM on only one randomly chosen outbound connection, and all clients that receive an MRM likewise forward it along only one randomly chosen outbound connection. This effectively routes the MRM to a single, distant mirror. A sketch of this decision and routing logic follows.
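The demand-tracking decision and the single-connection random walk of the MRM can be sketched as below. This is a minimal illustration under stated assumptions: the starting threshold, the doubling of mirrorThresh after each mirror, and the class and field names (MirroringPeer, recordDemand, becomeMirror) are ours; the protocol only requires that the threshold increase and that Di be reset.

    import java.util.*;

    // Sketch of BloomNet's mirroring decision and MRM random-walk routing; names
    // and the threshold-doubling policy are illustrative assumptions.
    class MirroringPeer {
        final Map<String, Integer> demand = new HashMap<>();    // D_i per shared file
        int mirrorThresh = 10;                                   // assumed starting threshold
        final List<MirroringPeer> outbound = new ArrayList<>();
        final Set<Integer> mrmList = new HashSet<>();            // outstanding mirror indices I
        final Random rng = new Random();

        // Called whenever a query hit is served for file f.
        void recordDemand(String f) {
            int d = demand.merge(f, 1, Integer::sum);
            if (d > mirrorThresh) {
                int mirrorIndex = rng.nextInt(Integer.MAX_VALUE); // special index I for f
                mrmList.add(mirrorIndex);
                forwardMRM(new MRM(mirrorIndex, /* ttl = */ 6), this);
                mirrorThresh *= 2;      // assumed policy: demand must climb higher next time
                demand.put(f, 0);       // reset D_i now that a fresh mirror exists
            }
        }

        static class MRM { final int index; int ttl; MRM(int i, int t) { index = i; ttl = t; } }

        // Each hop forwards the MRM on exactly one randomly chosen outbound edge.
        void forwardMRM(MRM m, MirroringPeer origin) {
            if (m.ttl == 0 || outbound.isEmpty()) {
                becomeMirror(m, origin);   // TTL exhausted: this peer is the designated mirror
                return;
            }
            m.ttl--;
            outbound.get(rng.nextInt(outbound.size())).forwardMRM(m, origin);
        }

        void becomeMirror(MRM m, MirroringPeer origin) {
            // The mirror now fetches the (encrypted, name-stripped) file from the
            // originator over HTTP using index m.index, as described in the text.
        }
    }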

Figure 3. The path of an MRM as it is sent out from the originator, O, and routed on randomly chosen outbound connections, with the TTL decremented from 6 at each hop. Note that when MRMTTL = 0, the mirror, M, is chosen.

At this time, M connects to O over HTTP to perform a normal Gnutella file-transfer request for the special file mirror index I. If O were to simply send the contents of the file as usual, the mirror could be made liable for copyright infringement due to its possibly illegal possession of the protected file. Thus, we provide deniability for mirrors by encrypting and name-mangling files before mirroring. O knows that the file-transfer request was initiated in response to one of its MRMs because the file index in the request will be in MRMList. In this case, O encrypts the file and strips its name before sending it to M. O, however, does not send M the key. This affords M deniability. Lastly, O adds M to its list of mirrors for the file, mirrorList.

Now, when O receives a Query message from a requester R, it follows the pseudocode in Figure 4. First, it searches its locally shared files for any matches to the search string in the query. If O has enough bandwidth to serve files, it goes through each match and sends QueryHit messages containing O's local file index for the matched file. If O does not have enough bandwidth, it checks whether each match is mirrored. If it is, then O multiplexes the request for the file index over its set of mirrors by sending back a QueryHit message with a mirror's address in the header, as well as the encryption key K, the original name, and the remote index I for the particular file. Upon receipt of such a QueryHit, R requests the file from the mirror Mr named in the message. When it receives this (encrypted) file, R uses K to decrypt the file, restores the name, and the process is complete.

    handleQuery(String searchString)
        // Search locally shared files for a match to searchString
        matchingIndices = search(searchString)
        if (availableUpKBps > 0)
            // Return a QueryHit message for each match found
            makeQueryHits(matchingIndices)
        else
            foreach index in matchingIndices
                if (index is mirrored)
                    // Multiplex the request for index over the set of mirrors
                    // by sending a QueryHit with the from field set to the
                    // address of a mirror
                else
                    // Reject the request

Figure 4. Pseudocode for receiving a Query message.
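The encrypt-and-strip step performed by the originator can be made concrete with a small sketch. This is a minimal illustration, assuming AES as the cipher (the paper does not specify one); the names Originator, MirrorRecord, and handleFileRequest are hypothetical, and a real implementation would choose an explicit cipher mode rather than the platform default.

    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import java.util.*;

    // Sketch of the originator's side of a mirror transfer. The cipher choice and
    // all names are our assumptions; the paper only requires that the mirror never
    // see the plaintext, the original filename, or the key.
    class Originator {
        final Set<Integer> mrmList = new HashSet<>();                  // outstanding mirror indices I
        final Map<Integer, MirrorRecord> mirrorList = new HashMap<>(); // index -> mirror info

        static class MirrorRecord {
            final String mirrorAddress; final SecretKey key; final String originalName;
            MirrorRecord(String a, SecretKey k, String n) { mirrorAddress = a; key = k; originalName = n; }
        }

        // Called when an HTTP file request arrives for index `index`.
        byte[] handleFileRequest(int index, String requesterAddress,
                                 byte[] plaintext, String originalName) throws Exception {
            if (!mrmList.remove(index)) {
                return plaintext;                       // ordinary Gnutella transfer
            }
            // Request matches an outstanding MRM: encrypt, strip the name, keep the key.
            SecretKey key = KeyGenerator.getInstance("AES").generateKey();
            Cipher cipher = Cipher.getInstance("AES");  // default mode; illustrative only
            cipher.init(Cipher.ENCRYPT_MODE, key);
            byte[] ciphertext = cipher.doFinal(plaintext);
            mirrorList.put(index, new MirrorRecord(requesterAddress, key, originalName));
            return ciphertext;                          // the mirror stores only this opaque blob
        }
    }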

Looking back at the setting of MRMTTL: if it were set to a small number, M would be close to O, in which case M could possibly subvert the encryption by intercepting query traffic going to O and mirror traffic coming from O to find out which files are being sent out. Therefore, MRMTTL should be set to a relatively large number.

An inherent flaw with this extension to the Gnutella protocol is that the originator of mirror requests must stay in the network in order for the created mirrors to be used. If the originator leaves, requests for a mirrored file will not receive hits from the mirrors, because the mirrors do not know the contents of the files they store and thus cannot reply with QueryHits.

3.2 Directed Queries

While mirroring increases the replication of files, directed search aims to forward queries to the nodes that have the highest probability of returning results. Since many of the network's nodes contribute no data, forwarding queries to those nodes is not desirable. For the equivalent of fewer network resources, directed search should be able to satisfy as many (or more) queries. Query direction is achieved by means of an expensive initial Bloom filter broadcast, realized through a protocol extension: the Bloom message. Directed search is best described in terms of a policy P consisting of a query depth Dq, a Bloom depth Db, a filter timeout set T, and a branching width B.

Upon initializing itself, a new BloomNet client builds an index of the files it is willing to share. Each file's name is broken into tokens, or strings of alphanumeric characters. Each token is then inserted into a Bloom filter of a system-wide agreed-upon specification*. Upon making a new overlay connection, each BloomNet client forwards its local Bloom filter, encapsulated in a Bloom message, on this new connection. Clients keep a set of Bloom filters associated with each of their overlay edges. Each set contains Db elements, and each element's index corresponds to a depth. The elements themselves are the merge of all the Bloom filters that the client has received from that edge at that depth. When a client receives a Bloom message, it extracts the filter data and checks the message's hop count h. It then increments a counter noting the number of filters merged into that edge's depth h. Upon exceeding the threshold T_h for that depth, the counter is reset and the filter is zeroed. The client then uses binary OR to merge the filter data into the set of filters for that overlay edge at precisely depth h, and forwards the Bloom message to all of its other neighbors. In this manner, Bloom filter information is shared.

Query propagation must also change to leverage the Bloom filter information that is now in the network. Upon receiving a query, a client breaks the query text into tokens as described above. It then scores each edge based on the number of Bloom filters associated with that edge that indicate containment of all the tokens. This scoring function weights edges exponentially with respect to depth: Bloom filter matches at depth 1 score exponentially more points than matches at greater depths. After scoring, the query is forwarded to the B edges with the highest scores. Note that the edge on which the query was originally received is excluded from the scoring process. When new clients join the network, they will not necessarily learn the Bloom information for their distant neighbors. The backpressure of filter replies was deemed too expensive for this extra information propagation: because Bloom filters are already a lossy source of data, we found it tolerable to likewise lose this information.

Like a normal query, a directed query still grows exponentially. Both broadcast methods have a cost function of Fanout^QueryTTL; however, since directed search aims to choose better edges for forwarding, it lowers Fanout and increases QueryTTL. For example, where a normal query might have a QueryTTL of 6 and a Fanout of 4, costing 4^6 = 4096 messages, a directed query would choose a QueryTTL of 7 and a Fanout of 2, costing 2^7 = 128 messages, a significant savings.
With these additions, BloomNet now pays an expensive up front cost for Bloom filter broadcasts that it expects to amortize over the savings gained by using fewer query messages. We will see in Section 5 that using a small value for Db can indeed be successful and this amortization will prove worthwhile.
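The Bloom filter construction, per-edge scoring, and top-B forwarding described in Section 3.2 can be sketched as follows. The filter size, the number of hash functions, the string-hash scheme, and the base of the exponential scoring are all placeholder choices on our part; the protocol only assumes a system-wide agreed-upon filter specification and an exponential weighting toward shallow depths.

    import java.util.*;

    // Sketch of BloomNet's directed query forwarding; parameters are placeholders.
    class DirectedSearch {
        static final int FILTER_BITS = 256 * 8;   // e.g. a 256-byte filter
        static final int NUM_HASHES = 4;

        // Insert each alphanumeric token of every shared filename into a bit set.
        static BitSet buildFilter(Collection<String> fileNames) {
            BitSet filter = new BitSet(FILTER_BITS);
            for (String name : fileNames)
                for (String token : name.toLowerCase().split("[^a-z0-9]+"))
                    if (!token.isEmpty()) insert(filter, token);
            return filter;
        }

        static void insert(BitSet filter, String token) {
            for (int i = 0; i < NUM_HASHES; i++)
                filter.set(Math.floorMod((token + i).hashCode(), FILTER_BITS));
        }

        static boolean mightContain(BitSet filter, String token) {
            for (int i = 0; i < NUM_HASHES; i++)
                if (!filter.get(Math.floorMod((token + i).hashCode(), FILTER_BITS))) return false;
            return true;
        }

        // Each edge keeps one merged filter per depth 1..Db (index 0 = depth 1).
        // Matches at shallower depths count exponentially more toward the score.
        static double scoreEdge(List<BitSet> filtersByDepth, List<String> queryTokens) {
            double score = 0;
            for (int depth = 1; depth <= filtersByDepth.size(); depth++) {
                BitSet f = filtersByDepth.get(depth - 1);
                boolean allTokensMatch = queryTokens.stream().allMatch(t -> mightContain(f, t));
                if (allTokensMatch) score += Math.pow(2, filtersByDepth.size() - depth);
            }
            return score;
        }

        // Forward the query on the B highest-scoring edges, excluding the arrival edge.
        static <E> List<E> chooseEdges(Map<E, List<BitSet>> edgeFilters, E arrivalEdge,
                                       List<String> queryTokens, int branching) {
            return edgeFilters.entrySet().stream()
                    .filter(e -> !e.getKey().equals(arrivalEdge))
                    .sorted(Comparator.comparingDouble(
                            (Map.Entry<E, List<BitSet>> e) -> scoreEdge(e.getValue(), queryTokens)).reversed())
                    .limit(branching)
                    .map(Map.Entry::getKey)
                    .collect(java.util.stream.Collectors.toList());
        }
    }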

* It is assumed that the set of shared files does not vary during the life of the client.

4 Simulation Model

To quantitatively measure how well BloomNet performs, we created a simulator that is able to simulate both standard Gnutella networks and BloomNet networks with the protocol extensions in action. In order to keep all other behavior alike, we created the BloomNet portion of the simulator as a subclass of the Gnutella portion, making the only functional differences the mirroring and Bloom filter additions.

4.1 Metrics

To judge how well a simulation performs, we chose three metrics that capture the most important measurable aspects of query performance in a P2PFS system: query success rate, query bandwidth consumption, and total bandwidth consumption. Query success rate is defined as the percentage of queries that received a response out of the total number of queries. It is measured by incrementing a global count of queries when one is made from any peer in the simulation, and incrementing a global count of successful queries when the first QueryHit message is received for a given query. Query bandwidth consumption is the amount of bandwidth used by query traffic for all peers over the entire simulation; this metric was measured by recording the size of all query messages as they arrive at a peer. Lastly, total bandwidth consumption is measured by accounting for both Bloom and Query messages.

4.2 Modeling P2PFS

In order to model the real world as closely as possible, we surveyed recent measurement studies of real-world P2PFS systems. Much of this research was done to determine the characteristics of today's P2PFS environment, and it has revolved around several methods: passive network monitoring [1, 3, 4, 5], monitoring border routers [4], network crawlers [6], monitoring ping/pong messages [4], and sending query messages [1, 3, 4, 5]. The data for these characteristics have been gathered here and split into five categories: characteristics of file sharing, file popularity, file types, queries, and participating peers. While individual numbers come from sources ranging from measurements of the Gnutella system [4] to the University of Washington network [3], the general trends are supported across numerous sources [1, 2, 3, 4, 5, 6, 7]; therefore, our simulator takes all of these findings into account.

4.2.1 File Sharing

The most important characteristic of P2PFS is the general cooperation of peers in the system. Unfortunately, studies show that users tend not to share any files at all, resulting in low levels of cooperation across P2PFS systems. In the Gnutella system, 66% of peers share no files and 73% share ten or fewer [4]. A general trend that stems from this lack of cooperation is the degradation of P2PFS into a client-server model. In the University of Washington campus network, the Gnutella network, and the Napster network, there are two general classes of peers: those that exhibit server-like qualities (share many files and download infrequently) and those that exhibit client-like qualities (share few or no files and download frequently). For Gnutella in particular, the top 1% of sharing peers account for 37% of all shared files in the network, a staggering statistic. Even worse, the top 20% of sharing peers account for 98% of all shared files [4]. In the Napster network, 40-60% of peers share only 5-20% of all shared files [6]. Fortunately, research has shown that freeloaders do not have a large negative impact on the performance of non-freeloaders. Some other statistics that do not directly result in general trends are: 1) less than 20% of file requests at the University of Washington campus (over a nine-day period) resulted in successful transactions [3], and 2) the average P2PFS peer shares 340 files [7].

4.2.2 File Popularity

The Zipf distribution, often used to characterize word-usage frequency in natural language, is also useful in discussing file popularity in P2PFS systems. A Zipf distribution follows a straight line when plotted on a log-log graph: it has few data points with high y-values and many data points close to the x-axis (as in Figure 5).

Figure 5. Left: A Zipf distribution plotted on a log-log scale [10]. Right: A Zipf distribution plotted on a linear scale [10].

If the x-axis is the ranking of usage of an English word and the y-axis is its frequency, it is easy to see that the Zipf distribution characterizes English word usage from the following examples: 1) the words "the" and "and" are used extremely often, and there are few such words; 2) the words "cat" and "paper" are used relatively often, and there are a fair number of such words; 3) the words "Zipf" and "logarithmic" are used very rarely, and there are a huge number of such words. Much like the English language, studies show that the Zipf distribution occurs in file popularity, query-string popularity, and replication of files in P2PFS [2, 5]. One exception is that, for file popularity, very popular files are roughly equally popular (a deviation from Zipf), while less popular files follow the Zipf curve more strictly [8]. Generally, the most popular files account for a large portion of communication and storage: 1) the most popular 5% of files account for 50% of all transfers [1], 2) the most popular 10% of files account for 50% of all stored files [1], and 3) the most popular 10% of transferred files account for over 60% of total transfers [1].
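One way to draw file ranks from such a distribution in a simulator is sketched below. The exponent s and the flat-head cutoff are illustrative choices, and the class name ZipfSampler is ours; the measurements above say only that the most popular files deviate from Zipf by being roughly equally popular.

    import java.util.*;

    // Sketch of rank selection under a Zipf-like popularity curve with a flat head.
    class ZipfSampler {
        final double[] cumulative;   // cumulative probability by rank
        final Random rng = new Random();

        ZipfSampler(int numFiles, double s, int flatHead) {
            double[] weights = new double[numFiles];
            double total = 0;
            for (int rank = 1; rank <= numFiles; rank++) {
                // Ranks inside the flat head all receive the same weight (assumed behavior).
                int effectiveRank = Math.max(rank, flatHead);
                weights[rank - 1] = 1.0 / Math.pow(effectiveRank, s);
                total += weights[rank - 1];
            }
            cumulative = new double[numFiles];
            double running = 0;
            for (int i = 0; i < numFiles; i++) {
                running += weights[i] / total;
                cumulative[i] = running;
            }
        }

        // Returns a 0-based rank; rank 0 is the most popular file.
        int sampleRank() {
            double u = rng.nextDouble();
            int idx = Arrays.binarySearch(cumulative, u);
            int rank = idx >= 0 ? idx : -idx - 1;
            return Math.min(rank, cumulative.length - 1);  // guard against rounding at the tail
        }
    }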

4.2.3 File Types

As one might guess, most files being shared on P2PFS networks are in the MP3 format [1, 2, 3]. This is the predominant reason that the mode of file sizes is 4MB and the download time for the most popular file is 150 seconds (the time it takes to download 4MB on a broadband connection, the most popular form of Internet connection among P2PFS peers, as seen in Section 4.2.5) [2]. Additionally, it has been shown that generally only a small fraction of shared files are large, but those files account for a huge amount of storage: 3% of all files are videos, but they account for 21% of all stored bytes; in one measured network in particular, only 5% of the objects are over 100MB [1, 3].

4.2.4 Queries

[4] shows that a huge fraction of queries (63%) never receive a response. Adar and Huberman postulate that the reason for this is that many cooperating peers are not actually cooperating: they share only "undesirable" files, which results in a lack of query responses. Assuming this is true, the high percentage of unanswered queries would extend to all P2PFS architectures, not just ones that do not provide deterministic file location.

As introduced in Section 4.2.1 on file-sharing statistics, the degradation of P2PFS into a client-server model is prevalent in query statistics as well. An amazing 47% of all query answers are provided by the top 1% of peers; even worse, 98% of all query answers are provided by the top 25%.

4.2.5 Participating Peers

The lack of participation from peers in P2PFS contributes to the breakdown of the system. First, peers are frequently unavailable: the distribution of continuous availability time is heavily skewed toward short durations. This can cause problems for many requests because the median duration for request fulfillment is 130 seconds [1, 3]. Second, peers are frequently unavailable even when they are connected: the majority of connected peers are actually too busy sending files and answering/forwarding queries to handle additional requests [1]. Third, most peers do not correctly report their bandwidth, which shows an unwillingness to cooperate: 30% of Napster users report their bandwidth as "64Kbps or less" when they actually have significantly greater bandwidth; however, extremely high-bandwidth peers rarely misreport their bandwidth [6]. Some additional statistics are: 1) the average access link of a peer is 200Kbps, 2) 50% of Napster users and 60% of Gnutella users have broadband connections (Cable, DSL, T1, or T3), 3) 20% of Napster users and 30% of Gnutella users have better-than-broadband connections (3 Mbps or greater), 4) few peers actually have worse-than-broadband connections (64Kbps or less), and 5) most peers have medium latencies (20% of Gnutella peers have latencies greater than 280ms and 20% have less than 70ms) [2, 6].

4.3 BloomNet Simulator

Both the Gnutella and BloomNet portions of the simulator take into account the measurement metrics as well as the P2PFS characteristics discussed in Sections 4.1 and 4.2. The simulator was built and run on a 2.8GHz machine with 1GB of available RAM. Due to the memory limitations of the available simulation machines, and despite several iterations of streamlining memory requirements, we were unable to run simulations with more than approximately 1000 clients. We were, however, able to incorporate many parameter options in our simulations using a specially built GUI front-end, as seen below in Figure 6. This GUI both takes in parameters to the simulator and displays output on screen and in easy-to-parse text files. In addition to the GUI, a script was created to parse, average, and accumulate all of the output.

To better model real-world Gnutella networks, we modeled P2PFS characteristics using random numbers. For example, to model the fact that, in the Gnutella system, 66% of peers share no files and 73% share ten or fewer (Section 4.2.1), for each simulated client we chose a random number r where 0 ≤ r ≤ 100. If r ≤ 66, we made that client share no files; if 66 < r ≤ 73, we made that client share between 1 and 10 files; if r > 73, we made that client share a random number of files between 11 and 5072, with the probability weighted towards fewer files (a sketch of this assignment follows below). The simulator is limited to 5072 files only because that is all we chose to harvest from available data; the number could easily be increased. To make a query, a client picks a random filename from the list according to the modified Zipf distribution discussed in Section 4.2.2.

The topology used in the simulator followed the power-law distribution observed in real networks. This distribution gives a few nodes very high connectivity, with most nodes having very low connectivity, much like the social network of our world [12]. Furthermore, we were only interested in capturing bandwidth information in our simulations. The constant increase in latency from increasing the TTL for Query messages by two hops was not of interest to us; we were much more concerned with the bandwidth tradeoff present in broadcasting Bloom filters to reduce query breadth. Since control traffic on the Gnutella network is within the means of a broadband connection, we modeled each node as having infinite bandwidth for control traffic and zero latency.
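The file-count assignment described above might be implemented as in the sketch below. The exact weighting toward fewer files is our assumption (the product of two uniform draws, which is denser near zero); the text states only that the choice is weighted in that direction.

    import java.util.Random;

    // Sketch of the per-client file-count assignment from Section 4.3; the skew
    // function is an illustrative assumption.
    class FileCountAssigner {
        static final int MAX_FILES = 5072;
        final Random rng = new Random();

        int filesToShare() {
            double r = rng.nextDouble() * 100;
            if (r <= 66) return 0;                               // 66% of peers share nothing
            if (r <= 73) return 1 + rng.nextInt(10);             // next 7% share 1-10 files
            // Remaining peers share 11-5072 files, skewed toward the low end.
            double skewed = rng.nextDouble() * rng.nextDouble(); // denser near 0
            return 11 + (int) Math.round(skewed * (MAX_FILES - 11));
        }
    }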


Figure 6. Custom-made GUI that takes in many different parameters as input to the simulator, with output shown both on screen and in easy-to-parse text files.

5 Results

In this section, we present the results of our simulations according to the three measurement metrics discussed above in Section 4.1. Note that since we incorporated randomness into our simulations, we averaged the results of six simulations to obtain each point on all graphs. Parameters common to all result sets included: 1) having nodes attempt to acquire up to, and no more than, four outbound edges, 2) forming the network into a power-law structure by means of an introduction service, and 3) making each network consist of 768 nodes. For each set of results, we compare many versions of BloomNet, organized around different Bloom policies, against a traditional Gnutella client. We looked to see how factors such as Bloom broadcast depth and filter size affected query performance.

Each node in our simulation had a probability of executing a query at each tick of the simulation. To execute a query, a client would choose one file from its local copy of the global file list with probability proportional to a Zipf curve, and would then initiate a query based on that file's name. After receiving a hit, the client would remove that file from its local list of choices. This reflects the "choose at most once" behavior presented in [8] (a sketch of this behavior follows below). We ran each simulation for 200 seconds of wall-clock time, which yielded 900 opportunities for each client on the network to take action. An average of 1350 queries were executed in each run of the simulation.
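A sketch of this per-tick query behavior, including the choose-at-most-once rule, follows. The per-tick query probability and the names QueryingClient, tick, and onQueryHit are illustrative; the surrounding text gives the simulator's behavior only in aggregate (900 ticks per client, roughly 1350 queries per run).

    import java.util.*;

    // Sketch of per-tick query execution with "choose at most once" semantics.
    class QueryingClient {
        final List<String> candidateFiles;          // local copy of the global file list
        final double queryProbability = 0.01;       // illustrative chance of querying per tick
        final Random rng = new Random();

        QueryingClient(List<String> globalFileList) {
            this.candidateFiles = new ArrayList<>(globalFileList);
        }

        // Called once per simulation tick.
        void tick() {
            if (candidateFiles.isEmpty() || rng.nextDouble() >= queryProbability) return;
            int rank = Math.min(sampleZipfRank(), candidateFiles.size() - 1);
            issueQuery(candidateFiles.get(rank));
        }

        void onQueryHit(String fileName) {
            candidateFiles.remove(fileName);        // never query for this file again
        }

        int sampleZipfRank() { /* e.g. delegate to the ZipfSampler sketch above */ return 0; }
        void issueQuery(String fileName) { /* broadcast or direct the query */ }
    }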

5.1 Query Success

Query success is measured by the percentage of queries that achieve at least one hit (Figure 7). Note that the x-axis of all graphs in this section represents the Bloom filter policy: the depth to which the Bloom filters are sent and the number of bytes used for each Bloom filter (bytes/depth). We looked to this experiment to see how factors such as Bloom broadcast depth and filter size affect query performance.

Surprisingly, systems with Bloom depth = 1 fared best. It appears as though the exponential scoring scheme only added complication to the system, and it is not clear whether a better scoring system would warrant broadcasting Bloom filters beyond depth = 1. Increasing the depth of the Bloom filter broadcast appears to have added only noise to the system, forcing depth = 3 into dismal query satisfaction. As one would expect, systems with larger filters were able to reduce the number of false positives reported by their filters, so as filter size increased, so did the success of queries. The trend for the policies with broadcast depth three is, however, interesting: as the size of the Bloom filter increases, this policy catches up to the success rates of the other policies. Should this trend continue, deeper broadcast of large filters could provide the highest query success rate. However, as we will see in the next section, the bandwidth cost of depth three is prohibitive.

Figure 7. Query success for Gnutella and a variety of combinations of Bloom depths and buckets.

5.2 Bandwidth Consumption

Traffic consumption is broken down in terms of query traffic, Bloom filter traffic, and the sum of query and Bloom traffic. Looking at query bandwidth consumption, the results confirm our earlier prediction: if the Bloom procedure's query fan-out is decreased and the query TTL is slightly increased, it is much less expensive than the traditional broadcast, as seen below in Figure 8. Here, Gnutella's query traffic is approximately 40MB, while BloomNet's decreases to under 10MB, a 75% reduction in cost. Again in Figure 8, as long as the Bloom depth stays below three, filter traffic is manageable. However, when combined with the results from Figure 7, it appears that the policy with filter size 256 bytes and broadcast depth = 1 behaves best. When the depth is increased above two, BloomNet's bandwidth explodes into an exponential curve.


Figure 8. Bandwidth consumption for Gnutella and a variety of combinations of Bloom depths and buckets.

Similarly, in Figure 9, one sees that the growth of Bloom traffic dominates query traffic as Bloom's parameters increase. For a given depth, as the filter size in bytes increases, one sees a dramatic increase in Bloom domination. For example, when depth = 2, the ratio increases gradually from 1 to 2; when depth = 3, the ratio increases even more, from 11 to over 25. Again, this shows the exponential cost of increasing Bloom depth.

Figure 9. Ratio of Bloom traffic to query traffic over a variety of combinations of Bloom depths and buckets.

In summary, BloomNet is able to satisfy queries more readily than Gnutella, using approximately one-fourth of the query bandwidth in our simulation results. Further, as the network size increases, the performance gap will widen, because Gnutella's query bandwidth would be governed by a higher query TTL, the exponent in the cost function, while BloomNet would capitalize on its smaller fan-out, the base of the exponent in the cost function. BloomNet is also able to achieve a higher percentage of successful queries than Gnutella while using only one-fourth of the query bandwidth.

6 Related Work

BloomNet has related work in both its overall design and its file mirroring and Bloom filter features. One work in particular, by Chawathe et al., shares our view on the importance of focusing on Gnutella-like systems [9]. They propose to improve Gnutella's scalability by dynamically adapting the network topology and search algorithms to take advantage of the heterogeneity inherent in many P2PFS systems. In terms of improving Gnutella's performance, several papers include proposals for addressing its lack of scalability in particular. Adamic et al. in [11] try to exploit the existence of a power-law distribution in the nodal connections of Gnutella networks; Krishnamurthy et al. propose a cluster-based architecture for P2P systems, grouping peers into clusters using a central-server, network-aware clustering technique [13].

A number of works have also been written on using hierarchies of Bloom filters to limit P2P query space. In [14], Mohan and Kalogeraki propose propagation and routing algorithms for fully distributed networks through the use of a Kundali data structure. They aim to maximize the chances of getting query hits while minimizing latency and balancing load among many peers. In [15], Ledlie et al. look at the tradeoffs between regular and compressed Bloom filters, expressly with the purpose of helping to solve the name-query problem in distributed file systems. Their results show an improvement similar to BloomNet's, but only for web caching hierarchies in particular. Perhaps most related, Rhea and Kubiatowicz explore probabilistic location through attenuated Bloom filters, a lossy distributed index, in [16]. Their algorithm finds nearby replicas quickly, which is a goal BloomNet shares, but they encourage its use alongside deterministic algorithms in order to improve overall performance. Their work goes further than BloomNet in its accounting for mobile replicas.

7 Conclusion

In this paper, we propose two additions to the Gnutella protocol in order to offset the decreased user participation and file replication caused by opponents of P2PFS. To do so, we introduce a mirroring technique that allows us to efficiently use files that have been replicated onto multiple peers (mirrors), without compromising the mirrors' legal standing. Further, we improve performance by using Bloom filters to direct queries in a more efficient manner, allowing BloomNet to send queries down far fewer paths than Gnutella's broadcast style. Through simulation, we found that BloomNet was able to find hits more readily and achieve a higher percentage of successful queries while using less query bandwidth than Gnutella. Our main result is that BloomNet satisfies more queries than Gnutella while using only one-fourth of the query bandwidth.

8 Possible Future Work

The most beneficial future work for our simulator would be to streamline memory usage further by porting the code to C and running it on a more powerful machine or cluster. This would allow the simulator to handle larger numbers of clients before thrashing. For file mirroring, we may look into more sophisticated demand-realization techniques that involve more distributed tracking of demand, possibly using gossiping protocols to pass information between clients. This would need to be balanced against the already large portion of control traffic that Gnutella and BloomNet use. In terms of directed search, further studies of filter merging and scoring functions should be explored to determine whether broadcast depths greater than one can increase performance.

References

[1] J. Chu, K. Labonte, and B. Levine. "Availability and Locality Measurements of Peer-to-Peer File Systems," in Proceedings of ITCom: Scalability and Traffic Control in IP Networks, July 2002.
[2] Z. Ge, D. R. Figueiredo, S. Jaiswal, J. Kurose, and D. Towsley. "Modeling Peer-Peer File Sharing Systems," to appear in the proceedings of INFOCOM 2003.
[3] S. Saroiu, K. P. Gummadi, R. J. Dunn, S. D. Gribble, and H. M. Levy. "An Analysis of Internet Content Delivery Systems," in Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI 2002), Boston, MA, USA, December 2002.
[4] E. Adar and B. A. Huberman. "Free Riding on Gnutella." First Monday, Vol. 5, No. 10, October 2000.
[5] K. Sripanidkulchai. "The Popularity of Gnutella Queries and its Implications on Scalability," in Proceedings of the O'Reilly Peer-to-Peer and Web Services Conference, 2001.
[6] S. Saroiu, P. K. Gummadi, and S. D. Gribble. "A Measurement Study of Peer-to-Peer File Sharing Systems," in Proceedings of Multimedia Computing and Networking 2002 (MMCN '02), San Jose, CA, USA, January 2002.
[7] B. Yang and H. Garcia-Molina. "Improving Search in Peer-to-Peer Networks," October 2001.
[8] B. Yang and H. Garcia-Molina. "Efficient Search in Peer-to-Peer Networks." In Proceedings of ICDCS, 2002.
[9] Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, and S. Shenker. "Making Gnutella-like P2P Systems Scalable." In Proceedings of ACM SIGCOMM 2003, Karlsruhe, Germany, August 2003.
[10] "Zipf Distribution of Website Popularity (Alertbox Sidebar)." [Available at http://www.useit.com/alertbox/zipf.html]
[11] L. Adamic, R. Lukose, A. Puniyani, and B. Huberman. "Search in Power-law Networks." Physical Review E 64, 2001.
[12] J. Kleinberg. "Small World Phenomena and the Dynamics of Information." 2001.
[13] B. Krishnamurthy, J. Wang, and Y. Xie. "Early Measurements of a Cluster-based Architecture for P2P Systems." In Proceedings of the ACM SIGCOMM Internet Measurement Workshop 2001, San Francisco, CA, November 2001.
[14] A. Mohan and V. Kalogeraki. "Speculative Routing and Update Propagation: A Kundali Centric Approach." In IEEE 2003 International Conference on Communications, Anchorage, AK, May 2003.
[15] J. Ledlie, L. Serban, and D. Toncheva. "Scaling Filename Queries in a Large-scale Distributed File System." Research Report TR-03-02, Harvard University, January 2002.
[16] S. Rhea and J. Kubiatowicz. "Probabilistic Location and Routing." In Proceedings of INFOCOM 2002.
