The Protocol

Gayatri

March 16, 2007 1 Abstract

Peer to peer technology has certainly improved mechanisms of downloading files at a very high rate. Statistics and research show that as more number of peer to peer protocols developed, the number of nodes in the entire network increased. In this paper, we observe the structure and operation of Gnutella, a peer to peer protocol. The paper also focuses on some issues of this protocol and what improvements were further made to improve its scalability. We also have a look at some of the security issues of this protocol. Some statistics in the paper also reflect important changes that the protocol inflicted onto the entire net.

1 2 Introduction

The Gnutella protocol is an open, decentralized group membership and search protocol, mainly used for file searching and . The term Gnutella also designates the virtual network of Internet acces- sible hosts running Gnutella-speaking applications. Gnutella nodes, called servents by developers, perform tasks normally associated with both servers and clients. They provide -side interfaces through which users can issue queries and view search results, accept queries from other servents, check for matches against their local data set, and respond with corresponding results. These nodes are also responsible for managing the background traffic that spreads the information used to maintain network integrity. Queries are not sent to a central site, but are instead distributed among the peers. Gnutella, the first of such systems, uses an unstructured in that the topology of the overlay network and placement of files within it is largely unconstrained. It floods each query across this overlay with a limited scope. Upon receiving a query, each peer sends a list of all content matching the query to the originating . Because the load on each node grows linearly with the total number of queries, which in turn grows with system size, this approach is clearly not scalable.

2 3 Design goals of Gnutella

Like most P2P file sharing applications, Gnutella was designed to meet the following goals: o Ability to operate in a dynamic environment. P2P applications operate in dynamic environments, where hosts may join or leave the network frequently. They must achieve flexibility in order to keep operating transparently despite a constantly changing set of resources. o Performance and Scalability. The P2P paradigm shows its full potential only on large-scale deploy- ments where the limits of the traditional client/server paradigm become ob- vious. Moreover, scalability is important as P2P applications exhibit what economists call the ”network effect” the value of a network to an individual user scales with the total number of participants. Ideally, when increas- ing the number of nodes, aggregate storage space and file availability should grow linearly, response time should remain constant, while search throughput should remain high or grow. o Reliability. External attacks should not cause significant data or performance loss. o Anonymity. Anonymity is valued as a means of protecting the privacy of people seeking or providing unpopular information.

3 4 Gnutella Protocol Description

4.1 Gnutella at a glance Gnutella’s distributed search protocol allows a set of peers, servents or clients, to perform filename searches over other clients without the need of an inter- mediate Index Server. The Gnutella is a pure Ad-Hoc topology where Clients may join or leave the network at any time without affecting the rest of the topology in any sense. The Protocol hence, is designed in a highly fault- tolerant fashion with a quite big overhead of synchronization messages trav- elling through its network. All searches are performed over the Gnutella network while all file are done offline. In this way every servent that needs to serve a file launches a mini HTTP 1.1 web server and commu- nicates with the interested servents with HTTP commands. The core of the protocol consists of a set of descriptors which are used for communication between servents and also sets rules for the inter-servents de- scriptors exchange. These descriptors are Ping, Pong, Query, Query Hit and Push. Ping messages are sent by a servent that needs to discover hosts that are currently active on the network. Pong messages on the other hand are responses to Ping requests every time a host wants to come in communication with the peer that requests to join the network (Figure 1, step 3). Although the protocol doesn’t set any bound on the amount of incoming or outcoming connections from a particular host, most Gnutella clients come with a default value for these parameters, usually 10, but which are adjustable by the user. A client joining the Gnutella network for the first time may of course not have any clue regarding the current topology or it’s neighbouring peers making the connection to the network impossible. For this reason most client vendors have setup Host-Caches Servers which serve clients with IP addresses of servents currently connected to the network (step 1). Another mechanism which is widely deployed, but is not part of the protocol, is the use of local caches of IP addresses from previous connections. In this way a servent doesn’t need to connect to an IP acquisition server (i.e. a host-cache) but can rather try to establish a connection with one of its past peers. Right after a peer has obtained a valid IP address and socket port of another servent it may perform several queries by sending Query descriptors and receive asynchronously results in Query Hit descriptors (step 4). Step 6 shows how a servent p1 may a file from another servent p2. This procedure is done offline with the HTTP protocol. If the servent p2 is fire walled, then p1 may request from p2, with a Push descriptor (step 5), to ”push” the file.

4 4.2 Types of packets used for communication DESCRIPTOR HEADERS Once a servent successfully connects to the Gnutella network, it commu- nicates with other servents by sending a Gnutella protocol descriptor. PING Ping descriptors have no associated payload and are of zero length. A Ping is simply represented by a Descriptor Header whose PayloadDescriptorfieldis0x00. A servent uses Ping descriptors to actively probe the network for other servents. A servent receiving a Ping descriptor may elect to respond with a Pong descriptor, which contains the address of an active Gnutella servent (possibly the one sending the Pong descriptor) and the amount of data it’s sharing on the network. This specification makes no recommendations as to the frequency at which a servent should send Ping descriptors, although servent implementers should make every attempt to minimize Ping traffic on the network. PONG Pong descriptors are only sent in response to an incoming Ping descriptor. It is valid for more than one Pong descriptor to be sent in response to a single Ping descriptor. This enables host caches to send cached servent address information in response to a Ping request. QUERY The primary mechanism for searching the distributed network. A servent receiving a Query descriptor will respond with a Query Hit if a match is found against its local data set. QUERY HIT The response to a Query. This descriptor provides the recipient with enough information to acquire the data matching the corresponding Query PUSH Used when the servent containing the file is behind a firewall. A servent may send a Push descriptor if it receives a Query Hit descriptor from a servent that doesn’t support incoming connections. This might occur when the servent sending the Query Hit descriptor is behind a firewall. When a servent receives a Push descriptor, it may act upon the push request if and only if the Servent Identifier field contains the value of its servent identifier. The DescriptorId field in the Descriptor Header of the Push descriptor should not contain the same value as that of the associated Query Hit descriptor, but should contain a new value generated by the servent’s DescriptorId generation algorithm.

5 4.3 Descriptor Routing The peer-to-peer nature of the Gnutella network requires servents to route network traffic (queries, query replies, push requests, etc) appropriately. A well-behaved Gnutella servent will route protocol descriptors according to the following rules: 1. Ping Pong routing Pong descriptors may only be sent along the same path that carried the incoming Ping descriptor. This ensures that only those servents that routed the Ping descriptor will see the Pong descriptor in response. A servent that receives a Pong descriptor with Descriptor ID = n, but has not seen a Ping descriptor with Descriptor ID = n should remove the Pong descriptor from the network. 2. Query - Query hit- Push Routing Query Hit descriptors may only be sent along the same path that carried the incoming Query descriptor. This ensures that only those servents that routed the Query descriptor will see the Query Hit descriptor in response. A servent that receives a Query Hit descriptor with Descriptor ID = n, but has not seen a Query descriptor with Descriptor ID = n should remove the Query Hit descriptor from the network. Push descriptors may only be sent along the same path that carried the incoming Query Hit descriptor. This ensures that only those servents that routed the Query Hit descriptor will see the Push descriptor. A servent that receives a Push descriptor with ServentIdentifier = n, but has not seen a Query Hit descriptor with Servent Identifier = n should remove the Push de- scriptor from the network. Push descriptors are routed by ServentIdentifier, not by DescriptorId.

6 5 Mechanisms

5.1 Functioning mechanisms How does a servent join the Gnutella network? A Gnutella servent connects itself to the network by establishing a con- nection with another servent currently on the network. The acquisition of another servent’s address is not part of the protocol definition and will not be described here (Host cache services are currently the predominant way of automating the acquisition of Gnutella servent addresses). Once the ad- dress of another servent on the network is obtained, a TCP/IP connection to the servent is created, and the following Gnutella connection request string (ASCII encoded) may be sent: GNUTELLA CONNECT/protocol version string where protocol version string is defined to be the ASCII string ”0.4” (or, equivalently, ”302e34”) in this version of the specification. A servent wishing to accept the connection request must respond with: GNUTELLA OK Any other response indicates the servent’s unwillingness to accept the connection. A servent may reject an incoming connection request for a variety of reasons - a servent’s pool of incoming connection slots may be exhausted, or it may not support the same version of the protocol as the requesting servent, for example.

5.2 Searching Mechanism Gnutella protocol uses the standard Breadth first search mechanism. The steps of search are as follows:

• Node initiates search for file

• Sends message to all neighbors

• Neighbors forward message

• Nodes that have the file initiate a reply message

• Query reply message is back-propagated

• File download occurs

Note: if one client is behind a firewall, another servent can request that servent push the file to itself

7 5.3 File Download Files are downloaded out-of-network i.e. a direct connection between the source and target servent is established in order to perform the data transfer. File data is never transferred over the Gnutella network. The file download protocol is HTTP. The servent initiating the download sends a request string of the following form to the target server:

GET /get/File Index/File Name/ HTTP/1.0 Connection: Keep-Alive Range: bytes=0 User-Agent: Gnutella

where File Index and File Name are one of the File Index,File Name pairs from a Query Hit descriptor’s Result Set. The file data then follows and should be read up to, and including, the number of bytes specified in the Content-length provided in the server’s HTTP response. The Gnutella pro- tocol provides support for the HTTP Range parameter, so that interrupted downloads may be resumed at the point at which they terminated.

8 6 Drawbacks of Gnutella

1. Upon receiving a query, each peer sends a list of all content matching the query to the originating node. Because the load on each node grows linearly with the total number of queries, which in turn grows with system size, this approach is clearly not scalable. 2. 50 percent of all files on Gnutella are served by just 1 percent of nodes. This makes the actual architecture of the Gnutella net closer to a client-server model than a true peer-to-peer system. Consequently, the resulting network suffers from single points of failure, problems associated with flash crowds, and resource unavailability. 3. Gnutella networks may distribute two varieties of poor quality files: masquerading files that simply waste , and malicious files that also threaten system security. DETAILED CASES OF GNUTELLA WEAKNESSES Distributed Denial Service In the rest of the paper, we will make use of the following conventions: Gnutella Network, denoted as ”G”, consists of an arbitrary number of inter- connected Gnutella Peers, forming a random graph, which might be con- nected or not. A Gnutella-peer or Servent denoted as ”p”, is a regular Gnutella client which has no malicious intention and participates in the ad-hoc topology in order to search and download files. A Malicious-peer, denoted as ”Mp”, is a modified Gnutella client which instead of behaving according to the recommendations of the Gnutella Protocol performs activ- ity which harms other peers’ p. We denote that a Mp, ”advertises” itself in the network with the best parameters (i.e. high bandwidth, large number of shared files). The obvious reason of doing such as thing is to attract as many as possible ”victims” p. A Host under attack, denoted as ”A”, might be either another p or any other Internet Service, such as a popular web site (e.g. yahoo.com), which is the target of an attack. Both Mp and p connect to the Gnutella network, as described by exchanging a number of Ping and Pong messages. Every p connects to a variable number of k other peers’ p0, p1 pk where k depends on the preference of the p client. A typical number for k in Gnutella is 10. Right after a connection to the Network G is achieved a peer might start accepting Query messages and respond to them with Query Hit messages if it meets the requirements of the given query. How Far can these spammed messages travel? Estimating how far our spammed messages can spread is probably a dif- ficult question to answer since it depends on the number of p that are con- nected to the Gnutella network, the random topology at the time of attack (i.e. is the network a connected graph), the number of hosts to which some

9 peer connects, the maximum TTL to which a peer is set and finally the per- centage of messages lost due to disconnections or failures. J.Ritter, one of the founding developers of presents an interesting set of mathematical equations which aim to calculate the reachability, capacity, and bandwidth throughput of the Gnutella Network. His assumptions are too broad, which makes it difficult to validate the correctness of his equations. An experiment was performed with 30 misbehaving peers Mp0, Mp1.Mp29 distributed on 3 hosts, in order to get a better understanding. Each host launched 10 Mp, each of which connected to a Gnutella Host-Cache Server to obtain a list of peers p currently connected to the Gnutella Network. After a period of 2, 4, 6, 8 hours we connected to the Gnutella Network with a commercial Gnutella client, namely , and performed several queries to see if we can actually receive an answer from one of the Mp. In all cases we were able to receive a spammed message from one or more of the 30 Mp. After shutting down the 30 Mp we observed that the misbehaving peers answered to 7,344, 838 Query requests in the 8 hour period. In most occasions the spammed message indicated that the resource is available 7 hops away, which is typi- cally the maximum TTL in Gnutella. This experiment proves that spammed messages can affect a large number of Gnutella users p and spread confusion all over the Gnutella Network. Shielding Peers from the DDoS Attack through Spamming It is obvious that spamming is probably one of the most difficult things to solve in P2P file-sharing systems such as Gnutella. This happens because a client has no liability against the queries it receives. Although the protocol specification states clearly that a peer should only reply to a Query with a Query Hit if it contains data that strictly meets the Query Search Criteria, that is simply not enough. Instead it is believed that if the protocol was slightly changed then a download activity of some peer p would not turn into a DDoS attack against some web server A. It was proposed that p should upon receiving a Query Hit from Mp, try to validate that Mp really holds the file that it claims to have. This could be achieved by an extra Query/Query Hit message exchange between p and the host A. In order to achieve this p should try to connect to A (with the Ping/Pong pair) and consequently try to exchange the Query/Query Hit pair. Since A doesn’t know anything about the particular file and may even not know anything about the Gnutella protocol (e.g. it is a web server), it will simply reject the request. P will know from that point that is was misdirected and cancel any further activity. This change on the protocol has a relatively small overhead since it results in the exchange of one extra Ping/Pong and one Query/Query Hit pair but it would save p from passively participating in a DDoS attack. It would also save many resources from p since it would not need to re-establish a

10 download session with A every 30 seconds. Pong Attack Since Mp is a misbehaving peer it will respond with a Pong which contains the IP and the PORT of another host A which is the host under attack. p1 would subsequently think that it established a connection with A, cache A0s address instead of Mp0s and forward all Query messages to A. In the figure we can see analytically the 3 steps that were described above. As we mention in the beginning of this subsection, this attack is not comparable to the DDoS through Spamming because this attack will last only for a short period. This happens because p1 will after a short interval re-send a Ping message to all the hosts in its cache in order to re-discover the topology. Since A is not really a Gnutella client and is not able to reply with a Pong message to p10s request, p1 will simply remove A from its cache and the attack will terminate. Other minor weaknesses include: Injecting viruses through Push leaks: Can be resolved by using virus sensing protocols that avoid sending the file to the servent. Losing anonymity by GUID tracing: In order to avoid GUIDS from being traced, a log must be maintained of all the GUID’s that are already in the network. Mapping of the Gnutella Network Unfortunately, it is prohibitively expensive to compute exactly the map- ping of the Gnutella onto the Internet topology, due both to the inherent difficulty of extracting Internet topology and to the computational scale of the problem. Instead, an experiment was performed that highlighted the mismatch between the topologies of the two networks. The Internet is a collection of Autonomous Systems (AS) connected by routers. AS’s, in turn, are collections of local area networks under a single technical administration. From an ISP point of view traffic crossing AS borders is more expensive than local traffic. We found that only 2-5 percent of Gnutella connections link nodes located within the same AS, although more than 40 percent of these nodes are located within the top ten ASs. This result indicates that most Gnutella-generated traffic crosses AS borders, thus increasing costs, unnecessarily.

11 7 GNUTELLA2

How it works Gnutella2 divides nodes into two groups: leaves and hubs. Leaves main- tain one or two connections to hubs, while hubs accept hundreds of leaves, and many connections to other hubs. When a search is initiated, the node obtains a list of hubs if needed, and contacts the hubs in the list, noting which have been searched, until the list is exhausted, or a predefined search limit has been reached. This allows a user to find a popular file easily without loading the network, while theoretically maintaining the ability for a user to find a single file located anywhere on the network. Hubs index what files a leaf has by means of a Query Routing Table, which is filled with single bit entries of hashes of keywords which the leaf to the hub, and which the hub then combines with all the hash tables its leaves have sent it in order to create a version to send to their neighboring hubs. This allows for hubs to reduce bandwidth greatly by simply not forwarding queries to leaves and neighboring hubs if the entries which match the search are not found in the routing tables. Gnutella2 relies extensively on UDP, rather than TCP, for searches. The overhead TCP introduces would make a random walk search system unworkable, though UDP is not without its own drawbacks. Protocol Features Gnutella2 has an extensible binary XML-like packet format which was conceived as an answer for Gnutella’s many kludges, as future network im- provements and individual vendor features could be added without worry of causing bugs in other clients on the network . While many developers who came after the flame war have proclaimed that this feature makes it much easier to code a client for Gnutella2 than Gnutella], Gnutella developers still maintain that the Generic Gnutella Extension Protocol (GGEP) allows for flexible additions to the Gnutella 0.6 protocol. Gnutella2 employs SHA-1 hashes for file identification and secure integrity check of files. To allow for a file to be reliably downloaded in parallel from multiple sources as well as to allow for the reliable uploading of parts as the file is being downloaded (swarming), tree hashes are used.[10] To create a more robust and com- plete system for searching, Gnutella2 also has a metadata system for more complete labeling, rating, and quality information to be given in the search results than would simply be gathered by the file names.[11] Nodes can even this information after they have deleted the file, allowing users to mark viruses and worms on the network without requiring them to keep a copy. Gnutella2 also utilizes compression in its network connections to reduce the bandwidth used by the network Gnutella2 versus Gnutella:

12 Overall, the two networks are fairly similar, with the primary differences being in the packet format and the search methodology. Gnutella’s packet format has been criticized because it was not originally designed with extensi- bility in mind, and has had many additions over the years, leaving the packet structure cluttered and inefficient. Gnutella2 learned from this, and aside from having many of the added features of Gnutella standard in Gnutella2, designed in future extensibility from the start. The other major difference is the search algorithm. While Gnutella uses a query flooding method of search- ing, Gnutella2 uses a walk system where a searching node gathers a list of hubs and contacts them directly, one at a time. This has several advantages over the Gnutella’s query flooding system. It is more efficient, as continuing a search does not increase the network traffic exponentially, queries are not routed through as many nodes, and it increases the granularity of a search, allowing a client to stop once a pre-defined threshold of results has been obtained more effectively than in Gnutella. It also increases the complexity of the network and the network maintenance required, as well as requir- ing safeguards to prevent a malicious attacker from using the network for Denial-of-service attacks. There is also difference in terminology, with the more capable nodes which are used to condense the network being referred to differently, Ultra peer in Gnutella and Hub in Gnutella2, and they are also used differently in topology. In Gnutella, the Ultra peers generally maintain a relatively small number of leaves and a high number of peer connections, while Gnutella2 Hubs maintain far more leaves, and fewer peer connections. The reason for this is that the search methods of the various networks have different optimum topologies.

13 8 Conclusion and future work

The Gnutella protocol is a P2P networking model that thanks to its de- centralized nature holds many advantages over competing centralized P2P models such as Napster. For instance, question of reliability cannot be an- swered in centralized P2P networks since these networks are dependent on their central access point, which if disabled incapacitates the entire network. In contrast, distributed models such as Gnutella have many access points and are more difficult to incapacitate if one or more of its access nodes are dis- abled. Gnutella node connectivity follows a multi-modal distribution, com- bining a power law and a quasi-constant distribution. This property keeps the network as reliable as a pure power-law network when assuming random node failures, and makes it harder to attack by a malicious adversary. However, Gnutella takes few precautions to ward off potential attacks. For example, the network topology information that we obtain here is easy to obtain and would permit highly efficient denial-of-service attacks. Some form of security mechanisms that would prevent an intruder to gather topol- ogy information appears essential for the long-term survival of the network. Although Gnutella is not a pure power-law network, it preserves good fault tolerance characteristics while being less dependent than a pure power-law network on highly connected nodes that are easy to single out (and attack). More analysis is being done in the security aspects of Gnutella.

9 References

[1] R. Albert, A. L. Barabasi, Statistical mechanics of complex networks, Re- views of Modern Physics, 74 (47) 2002. [2] R. Albert, H. Jeong, A. Barabsi, Attak and tolerance in complex networks, Nature 406, 378, 2000. [3] The Gnutella protocol specification v4.0. http://dss.clip2.com/GnutellaProtocol04.pdf. [4] K. Coffman, A. Odlyzko, Internet growth, Handbook of Massive Data Sets, J. Abello Kluwer, 2001. [5] S. Saroiu, P. Gummadi, S. Gribble, A Measurement Study of P2P Systems, University of Washington Technical Report UW-CSE-01-06-02, July 2001. [6] K. Sripanidkulchai, The popularity of Gnutella queries and its implications on scalability, February 2001. [7] Locating Data in (Small-World?) P2P Scientific Collaborations, Adriana Iamnitchi, Matei Ripeanu, Ian Foster, in Proc. of the 1st Interna- tional Workshop on Peer-to-Peer Systems Cambridge, MA, March 2002

14