
Searching the Peer-to-Peer Networks: The Community and Their Queries

Sai Ho Kwok, Department of Information and Systems Management, The Hong Kong University of Science and Technology.

Christopher C. Yang, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong.

Peer-to-Peer (P2P) networks provide a new distributed computing paradigm on the Internet for file sharing. The decentralized nature of P2P networks fosters cooperative and non-cooperative behaviors in sharing resources. Searching is a major component of P2P file sharing. Several studies have been reported on the nature of queries of World Wide Web (WWW) search engines, but studies on queries of P2P networks have not been reported yet. In this report, we present our study on the Gnutella network, a decentralized and unstructured P2P network. We found that the majority of Gnutella users are located in the United States. Most queries are repeated. This may be because the hosts of the target files connect to or disconnect from the network at any time, so clients resubmit their queries. Queries are also forwarded from peer to peer. Findings are compared with the data from two other studies of Web queries. Queries in the Gnutella network are longer than those reported in the studies of WWW search engines. Queries with the highest frequency are mostly related to the names of movies, songs, artists, singers, and directors. Terms with the highest frequency are related to file formats, entertainment, and sexuality. This study is important for the future design of applications, architecture, and services of P2P networks.

Accepted December 3, 2003. © 2004 Wiley Periodicals, Inc. Published online 9 April 2004 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.20022. Journal of the American Society for Information Science and Technology, 55(9):783–793, 2004.

Introduction

Peer-to-peer (P2P) computing is currently attracting enormous media attention, spurred by the popularity of file sharing systems such as Napster, Gnutella, and Morpheus. The peers are simply autonomous, or, as some call them, first-class citizens. P2P networks are emerging as a new distributed computing paradigm for their potential to harness the computing power of the hosts composing the network and make their underutilized resources available to each other. The decentralized nature of P2P computing also makes it ideal for economic environments that foster knowledge sharing and collaboration as well as cooperative and noncooperative behavior in sharing resources (Kwok, 2002). The pure P2P network provides the mechanism for a knowledge user to make a direct connection with another knowledge user. Information is shared among the knowledge users through direct communication, which is an important factor for the success of knowledge management and reuse. Business models are being developed that rely on incentive mechanisms to supply contributions to the system and on methods for controlling free riding (Kwok, Lang, & Tam, 2002). Clearly, the growth and the management of P2P networks must be regulated to ensure adequate compensation of content and/or service providers.

P2P can be described as a technique for facilitating file sharing over a network of peers. Specifically, P2P networks (or communities) contain a large number of nodes (or peers). These nodes, also known as servants, act simultaneously as both clients and servers. The popularity of P2P file sharing has attracted considerable interest. Consequently, a large number of researchers have investigated its extensibility and applicability. Under closer scrutiny, it would seem that network scalability is an inherent problem of P2P (The OceanStore Project, 2002). Many researchers have devoted their efforts to network analysis, for example, network traffic patterns (Matei, Iamnitchi, & Foster, 2002) and network performance measurement (Vaucher, Kropf, Babin, & Jouve, 2002), in an attempt to find ways to reduce P2P network traffic and improve the efficiency of P2P file sharing. However, the key factor is how P2P users behave (i.e., their information-searching behavior). This factor has not been explicitly addressed in previous studies, and relevant data is lacking in the literature. This report attempts to explore this problem by observing peers' activities using the servant's log files, and suggests ways to reduce network traffic.

Searching is a major component of P2P file sharing. One of the most prevalent problems for P2P users is that a considerable amount of time is spent in searching. The reason is that clients usually do not find the files they need. Even when they have found the required files, they have to query for the same files again to find other sources when the remote peers disconnect from the network. Consequently, a large number of query messages flood the P2P network, which jeopardizes the interests of the P2P communities. One of the ways to enhance the P2P network and file-sharing protocol is to understand the behavior of P2P users. For example, it is important to know how the query messages are generated and what kind of query message most frequently occupies the P2P network. In addition, it is necessary to know the kinds of files that P2P users query most, because user behavior has a direct impact on the network traffic.

Studies on the queries of P2P networks are beneficial to the future design of applications, architecture, and services of P2P networks, and of the P2P protocol (Kwok, Lui, Cheung, Chan, & Yang, 2003). A better understanding of user behavior on queries of P2P networks can help to enhance the searching mechanisms in order to provide efficient and effective searching. Recent research studies have examined the nature of queries on WWW search engines; however, there have not yet been any studies that examine the queries of P2P networks.

Information Behavior

Information behavior is defined by Wilson (2000) as the totality of human behavior in relation to sources and channels of information, including both active and passive information seeking, and information use. Wilson has also defined information searching behavior as the "micro-level" of behavior employed by the searcher in interacting with information systems of all kinds, and information use behavior as the physical and mental acts involved in incorporating the information found into the person's existing knowledge base. In this work, we are interested in investigating human behavior in P2P networks in terms of the queries submitted by P2P users, in contrast with human behavior in WWW search engines as reported by other researchers. We are not studying the micro-level behavior of P2P users or how P2P users act when they incorporate the files found into their knowledge base. We are rather interested in the information behavior in relation to two different channels: P2P networks and the WWW. A comparison between our study and previous WWW studies is presented in the following sections.

Studies on Searching Queries

In the past few years, there have been a growing number of studies investigating the nature of queries on WWW search engines. Jansen, Spink, and Saracevic (2000) reported the result of a study on 51,000 queries posed by 18,000 users of Excite's search engine, focusing on sessions, queries, and terms. Among the 114,000 terms used, 22,000 terms are unique. The report shows that the sessions are short (2.8 queries per session on average) and the queries are short (2.21 terms per query on average). Boolean operators and relevance feedback are seldom used. A small number of terms are used with high frequency while many terms are used only once. Silverstein, Henzinger, Marais, and Moricz (1999) reported the result of a study of 154,000,000 queries of the Alta Vista search engine. However, the definition of a term in a Web query is not the same across studies. In the study of the Alta Vista search engine, terms can be field-value designators. Spink, Wolfram, Jansen, and Saracevic (2001) reported the result of another study on 1,026,000 queries of the Excite search engine. Similar results are reported. Later, Wolfram, Spink, Jansen, and Saracevic (2001) compared the results of the studies of Excite queries between 1997 and 1999. It was found that there were fewer terms per query, fewer queries per session, and little modification in subsequent queries. As well, the searching topics shifted from entertainment, recreation, and sex to e-commerce related topics. Spink, Ozmutlu, and Ozmutlu (2002) conducted four studies, two of which were survey responses by 11 Excite search engine users and 114 search sessions by Excite search engine users. They found that multitasking information seeking and searching was a common behavior. The mean number of topic changes per session was 2.11. Multitasking search sessions usually take a longer time than single-topic sessions. They further investigated the characteristics of the question format of the Ask Jeeves search engine (Spink & Ozmutlu, 2002). Thirty thousand queries were included in the study. The questions are mainly in "where," "what," or "how" format. "Where can I find" is the most common format.

Although there are a number of studies on queries of WWW search engines, studies on queries of P2P networks have not been reported yet. However, there have been some studies on the network traffic and connectivity of P2P networks. Markatos (2002) investigated the magnitude and traffic patterns of the P2P network by tracing the queries going through a P2P servant in an hour. Matei et al. (2002) sent a crawler to collect the topology information of the P2P network. The study evaluated costs and benefits of the P2P approach and, according to the data, the mismatch between the P2P and Internet infrastructure topologies has considerable impact on the overall performance of the P2P system. These studies have all discussed the network traffic and connectivity of the peers, which provide implications for the topology design of P2P applications.

Saroiu, Gummadi, and Gribble (2002) studied the characteristics of the participating peers. Their findings suggest that there is significant heterogeneity and lack of cooperation among peers participating in P2P networks. Adar and Huberman (2000) recorded the P2P network traffic for 24 hours continuously. They indicate that free riding is a serious problem for the P2P network. The free-riding problem refers to the tendency of many P2P users to request file downloads but rarely share their own files with others.

In this report, we focus on the queries on P2P networks. A report of the preliminary study has been presented (Kwok et al., 2003), in which we proposed an enhancement of the P2P protocol based on some of the data collected. Here, we provide a detailed analysis of the Gnutella queries and compare these queries with the queries to WWW search engines. However, the searching processes on WWW search engines and P2P networks are completely different. Therefore, in order to understand the possible difference in the information behavior on Web search engines and P2P networks, a brief discussion of their searching processes is presented in the next section.

Searching on World Wide Web and Peer-to-Peer Networks

The searching mechanisms on WWW and P2P networks are very different. Traditional WWW search engines rely on a centralized index (Chen, Chung, Ramsey, & Yang, 1998a,b; Yang & Chung, 2002; Yang, Yen, & Chen, 2000), while searching on a decentralized and unstructured P2P network (as in the case of Gnutella) relies on forwarding query requests through peers.

For decentralized and unstructured P2P networks, such as Gnutella, searching relies on message passing among peers on the dynamic network to locate the peers that have the requested files (Lui & Kwok, 2002; Lv, Cao, Cohen, Li, & Shenker, 2002). Each peer in Gnutella acts as a client, which submits queries; a server, which provides content; and a router, which transmits queries and responses when it does not have the file requested. "Ping" messages are sent out by peers to identify other peers on the dynamic network, and a "Pong" message is received from an identified peer. Each peer may only identify a few peers among all the peers on the network. However, when a query is submitted, the identified peers may forward the query to other peers that they have identified. When a peer is searching for a file, it sends out a "Query" message containing some filtering criteria. If the requested file is identified, the peer holding it responds with a "Query hit" message containing the list of files matching the filtering criteria and the IP address of the peer that is the content provider. Gnutella uses a time-to-live (TTL) value to control the number of hops that a query can propagate. If the requested files are not identified before the TTL is exhausted, the query is terminated. Gnutella adopts "owner replication"; this means that the requested files are replicated at the requesting peer when the search is successful. Figure 1 illustrates the searching mechanism of Gnutella P2P networks.

FIG. 1. Searching mechanism of P2P networks, Gnutella.

As illustrated, P2P network searching does not rely on a centralized index to provide the address of the relevant files but instead relies on its peers to provide the matching files. The results of queries sent to Web search engines depend on the performance of a particular search engine (the power of indexing and fetching and their ranking policies). However, the results of queries sent to P2P networks depend on the protocol of a particular P2P network. They also depend on the peers appearing in the neighborhood and the files being shared by these peers.

There are a number of research projects that have attempted to improve the searching mechanisms of P2P networks by analyzing the properties of the networks. The Chord Project (2002), from the Massachusetts Institute of Technology, aims at building scalable, robust distributed systems using P2P ideas. Pastry (2002), by members from Microsoft, Rice University, and Purdue University, aims to develop a scalable, distributed object location and routing application for large-scale P2P systems. The Stanford Peers Project, which consisted of a group of researchers working in the area of P2P networks, investigated how to improve searching in P2P networks (Stanford Peers Project, 2002) by iterative deepening, the directed BFS technique, and the local indices technique, and how to increase the reliability of data replication (Thadani, 2002). In addition, the PIER Project (P2P Information Exchange & Retrieval) (2002), from the University of California, Berkeley, is working on building complex query facilities (i.e., a subset of SQL) on top of these distributed hash table-based P2P systems. Another related project from the University of California at Berkeley, named the OceanStore Project (2002), attempts to develop a global persistent data store designed to scale to billions of users. The Piazza project (2002), from the University of Washington, based on the concept of data placement and utilization, develops techniques for the reuse of query results. The BestPeer project (2002), a collaborative effort between the National University of Singapore and Fudan University, looks at the architecture of BestPeer, a generic P2P platform, and the various application layers that have been built on top of it. However, none of these projects takes user behavior into account. In this report, we investigate user behavior on the P2P network Gnutella (www.gnutella.com), in terms of the participants of Gnutella, the files shared by the participants, and their queries.

TABLE 1. Example of data in the connection log file.

Host | Start Time | End Time | Type | Msg Sent | Msg Received | No. of Files | File Size (kB) | Up Time (s)
connect2.gnutellanet.com:6346 | 5/26/02 5:12:13 PM | 5/26/02 5:12:16 PM | Outgoing | 31 | 1 | 0 | 0 | 15
public.bearshare.net:6346 | 5/26/02 5:12:14 PM | 5/26/02 5:12:20 PM | Outgoing | 3 | 1 | 0 | 0 | 18
211.209.170.99:6346 | 5/26/02 5:12:49 PM | 5/26/02 5:12:49 PM | Outgoing | 20 | 51 | 0 | 0 | 9
68.45.120.69:6346 | 5/26/02 5:14:48 PM | 5/26/02 5:14:48 PM | Outgoing | 20 | 51 | 0 | 0 | 9

The Study: Gnutella

The Gnutella protocol (The Gnutella Protocol Specification v0.4, 2003) is an open, decentralized group membership and search protocol, mainly used for file sharing. The term Gnutella also designates the virtual network of Internet-accessible hosts running Gnutella-speaking applications and a number of smaller, and often private, disconnected networks.

Connections in Gnutella

To become a member of the Gnutella network, a servant (node) has to open one or many connections with nodes that are already in the network. In the dynamic environment where Gnutella operates, nodes often join and leave, and network connections are unreliable. To cope with this environment, after joining the network, a node periodically PINGs its neighbors to discover other participating nodes. Using this information, a disconnected node can always reconnect to the network. Nodes decide where to connect in the network based only on local information; thus, the entire network forms a dynamic, self-organizing network of independent entities. This virtual, application-level network has Gnutella servants as its nodes and open TCP connections as its links.

Peers interact with each other by means of messages. Peers create and initiate broadcasts of messages and also re-broadcast others' messages (receiving them and transmitting them to neighbors). There are five types of messages running over the network. Every message has a header and a payload; the last four bytes of the header describe the size of the payload in bytes. The message types in the Gnutella network are described below.

Ping Messages: Essentially, an "are you there?" message directed at a host.

Pong Messages: A reply to a ping ("yes, I'm here"). The pong message contains information about the peer, such as its IP address and port as well as the number of files shared and the total size of those files. Peers forward this kind of message to their neighbors so that it is possible to find other peers later.

Query Messages: These are messages stating, "I am looking for x," and they can get forwarded throughout the entire network. Query messages are uniquely identified, but their source is unknown.

Query Hit Messages: These are replies to query messages, and they include the information necessary to download the file (IP, port, and other location information). Responses also contain a unique client ID associated with the replying peer. These messages are propagated backwards along the path that the query message originally took. Since these messages are not broadcast, it is impossible to trace all query responses in the system.

Get/Push Messages: Get messages are simply a request for a file returned by a query. The requesting peer connects to the serving peer directly and requests the file. Certain hosts, usually located behind a firewall, are unable to directly respond to requests for files. For this reason, the Gnutella protocol includes push messages. Push messages request the serving client to initiate the connection to the requesting peer and upload the file. However, if both peers are located behind a firewall, a connection between the two will be impossible.
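The statement that the last four bytes of the header give the payload size can be illustrated with a small parser. This is a minimal sketch, not the servant used in the study: it assumes the 23-byte descriptor header layout of the Gnutella v0.4 specification (16-byte descriptor ID, one byte each for payload descriptor, TTL, and hops, then a 4-byte little-endian payload length), and the names `Header`, `parse_header`, and `DESCRIPTOR_NAMES` are hypothetical. Get requests do not appear in the table because they are made over a direct connection between the two peers rather than as routed descriptors.

```python
import struct
from dataclasses import dataclass

# Payload descriptor codes, assumed from the Gnutella v0.4 specification.
DESCRIPTOR_NAMES = {
    0x00: "Ping",
    0x01: "Pong",
    0x40: "Push",
    0x80: "Query",
    0x81: "QueryHit",
}

# Assumed layout: 16-byte ID + 1 type + 1 TTL + 1 hops + 4 payload length.
HEADER_SIZE = 23


@dataclass
class Header:
    guid: bytes          # descriptor ID that uniquely identifies the message
    kind: str            # human-readable message type
    ttl: int             # remaining hops the message may still travel
    hops: int            # hops travelled so far
    payload_length: int  # size in bytes of the payload that follows


def parse_header(data: bytes) -> Header:
    """Decode the fixed-size header at the start of a message."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 23 bytes for a header")
    guid = data[:16]
    # "<" = little-endian, no padding; B = unsigned byte; I = unsigned 32-bit.
    code, ttl, hops, length = struct.unpack("<BBBI", data[16:HEADER_SIZE])
    kind = DESCRIPTOR_NAMES.get(code, f"Unknown(0x{code:02x})")
    return Header(guid, kind, ttl, hops, length)


# Usage: a Query header with TTL 5, 2 hops travelled, 9-byte payload.
raw = bytes(16) + bytes([0x80, 5, 2]) + (9).to_bytes(4, "little")
header = parse_header(raw)
print(header.kind, header.ttl, header.hops, header.payload_length)
```

Reading the length first lets a receiver know how many further bytes belong to the current message before it decodes the payload.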

TABLE 2. Example of data in the monitor log file.

Date | Time | Min. speed | TTL | Hops | GUID | Origin host:port | Search criteria
8/6/2002 | 16:08 | 33,665 | 0 | 1 | [dd]-[d1]-[a8]-[18]-[e0]-[7c]-[83]-[de]-[ff]-[d5]-[13]-[e3]-[bf]-[7a]-[aa]-[0] | connect2.gnutellanet.com:6346 | dvia
8/6/2002 | 16:08 | 44,417 | 0 | 1 | [92]-[c0]-[bc]-[cc]-[88]-[90]-[a7]-[28]-[ff]-[2e]-[30]-[4]-[77]-[36]-[30]-[0] | connect1.gnutellanet.com:6346 | good mp3
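The TTL and Hops columns of the monitor log reflect the forwarding rule described for Gnutella searching: each forwarding step uses up one unit of TTL and adds one hop, and a query whose TTL has reached zero is not forwarded further. The following is a minimal sketch of that rule on a toy network. The function `flood_query`, the peer names, and the breadth-first traversal are illustrative assumptions and not the behavior of any particular Gnutella client.

```python
from collections import deque


def flood_query(neighbors, holders, origin, ttl):
    """Return {peer: hops} for peers that would answer with a Query hit.

    neighbors: dict mapping each peer to the peers it is connected to
    holders:   set of peers that share a file matching the query
    origin:    peer that submits the query
    ttl:       initial time-to-live of the query
    """
    seen = {origin}                    # peers that already handled this query
    queue = deque([(origin, ttl, 0)])  # (peer, remaining TTL, hops so far)
    hits = {}
    while queue:
        peer, ttl_left, hops = queue.popleft()
        if peer in holders and peer != origin:
            hits[peer] = hops          # this peer would send a Query hit back
        if ttl_left == 0:
            continue                   # TTL exhausted: do not forward further
        for nxt in neighbors.get(peer, ()):
            if nxt not in seen:        # drop duplicates of the same query
                seen.add(nxt)
                queue.append((nxt, ttl_left - 1, hops + 1))
    return hits


# Hypothetical overlay: a - b - c - d in a chain, plus a - e.
network = {
    "a": ["b", "e"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": ["a"],
}
print(flood_query(network, {"c", "d"}, "a", ttl=2))  # only c is within reach
print(flood_query(network, {"c", "d"}, "a", ttl=3))  # c and d are both reached
```

In this sketch a file held by peer d is invisible to a query sent with TTL 2. This is one reason why, on a network whose neighborhood keeps changing, resubmitting the same query later can return different results.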

Data Collection

Markatos (2002) shows that the overall P2P network characteristics can be represented by a randomly chosen node in the P2P network and that every P2P node has similar network characteristics. As well, the characteristics are location independent. We relied on these properties in collecting querying data in our experiment. The querying data was collected from July 16 through July 22, 2002 (7 consecutive days), with a P2P servant situated in Hong Kong. The P2P servant ran on a P4 1.2G PC with 100-Mbps bandwidth. The P2P servant used for data collection was written in Java, based on a Java API (Jtella) that follows the Gnutella protocol. All query messages going through the P2P program were logged in two log files, the connection log and the monitor log. Sample data from the connection log and the monitor log are given in Table 1 and Table 2, respectively. The log files were then imported into a database management system for data analysis.

Distribution of Peers

Based on the collected data found in the connection log file (PONG messages), there were 47,489 peer connection sessions in seven days. There were 10,230 unique peers, of which 3,585 peers performed searching activities; 1,663 peers connected to the network but did not host any files for sharing. Figure 2 presents the geographical distribution of the Gnutella peers. The majority of Gnutella peers (80%) were located in North America. To be more specific, 85% of North American peers were located in the United States (thus the U.S. alone constituted 68% of the Gnutella population). In Europe, the Gnutella population was spread out over several countries, where the United Kingdom and Germany had 16 and 13%, respectively, of the European Gnutella population. In Asia, the Gnutella population was also spread out over a few countries, where Hong Kong and the Republic of Korea had 38 and 13%, respectively, of the Asian Gnutella population.

FIG. 2. Geographical distribution of Gnutella peers.

Files Shared by Gnutella Peers

The peers in the Gnutella network are not required to be hosts that share files with other peers. They can act just as clients that request the files they are interested in. The peers may connect to or disconnect from the network at any time. It was found that the total number of files in the Gnutella network, regardless of the types of files, was 435,000, and the total size of shared files was 836,725 MB. The average connection time of Gnutella peers was 69.51 seconds. The findings indicate that there are a large number of files available in the Gnutella network; however, not every peer is willing to serve as a host to share files, and some may be interested only in retrieving the files they need. The large collection of files combined with the scarcity of hosts makes searching in the Gnutella network even more difficult.

Queries of Gnutella

To analyze the queries of the Gnutella network, we focus on the QUERY message among the five types of message in Gnutella as described in Connections in Gnutella. A QUERY message contains the query terms that are usually entered by the P2P user or sometimes issued by the P2P servants. Over the seven days of continuous operation, our P2P servant collected 5,052,754 QUERY messages in total from the Gnutella network. We label it the "5M Gnutella study."

TABLE 3. Occurrences and distributions of different types of queries in Gnutella network.

Type of query | Occurrences | Percentage (%)
Total number of queries (7 days) | 5,052,754 | 100
Unique queries | 676,402 | 13.39
Repeat queries | 4,373,813 | 86.56
Zero-term queries | 2,539 | 0.05
XML queries | 39,151 | 0.77
English queries | 4,983,919 | 98.64
Non-English queries | 29,684 | 0.59
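The percentages in Table 3 and the U.S. share quoted in Distribution of Peers follow directly from the reported counts. As a quick arithmetic check, the sketch below recomputes them. The counts are copied from Table 3, while the variable and function names are illustrative only.

```python
# Counts copied from Table 3 of this report.
TOTAL = 5_052_754
by_repetition = {"unique": 676_402, "repeat": 4_373_813, "zero-term": 2_539}
by_language = {"XML": 39_151, "English": 4_983_919, "non-English": 29_684}


def shares(counts, total):
    """Percentage of the total for each category, rounded to two decimals."""
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


rep = shares(by_repetition, TOTAL)
lang = shares(by_language, TOTAL)

# Each of the two categorizations partitions the full set of queries.
assert sum(by_repetition.values()) == TOTAL
assert sum(by_language.values()) == TOTAL

# 80% of peers are in North America and 85% of those are in the U.S.
us_share = round(0.80 * 0.85, 2)

print(rep)       # {'unique': 13.39, 'repeat': 86.56, 'zero-term': 0.05}
print(lang)      # {'XML': 0.77, 'English': 98.64, 'non-English': 0.59}
print(us_share)  # 0.68
```

The two sums confirm that Table 3 reports two independent breakdowns of the same 5,052,754 queries, one by repetition and one by query language.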

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—July 2004 787 TABLE 4. Comparisons of 5M Gnutella study with 51K Excite study and while the unique queries and zero-term queries are only 1M Excite study in terms of types of queries. 13.39 and 0.05%, respectively. The repeat queries repeated 5M Gnutella 1M Excite 7.16 times on average. In extreme cases, some repeat que- study 51K Excite study study ries repeated over 10,000 times. The number of repeat queries is 6.5 times more than the number of unique queries. Unique queries (%) 13.39 57 51.80 Compared to the results reported by the study of 51,474 (Unique ϩ modified) Repeat queries (%) 86.56 43 38.55 Excite queries (here labeled the “51K Excite study”) Zero-term queries (%) 0.05 N/A 9.65 (Jansen et al., 2000), and the study of 1,025,910 Excite queries (here labeled the “1M Excite study”), as shown in Table 4, (Spink et al., 2001), the percentage of repeat queries in our 5M Gnutella study are significantly more that We generally follow the metrics developed by Spink et those in the 51K Excite study and the 1M Excite study. As al. (2001) for queries of Web search engines but with minor discussed in Introduction, users submit the same queries in modification to analyze the queries of the Gnutella network P2P networks when they cannot find the files from the since the format of queries in the Gnutella network is not existing peers in the network or when the peers that have the exactly the same as those of Web search engines. Terms in files have disconnected. In the 51K Excite study, zero-term queries are any unbroken strings of alphanumeric charac- queries are not considered; however, zero-term queries may ters. Terms include words, abbreviations, numbers, and be counted as repeat queries. The unique queries in the 51K logical operators. Queries are sets of one or more terms. 
Queries can be categorized into unique queries, repeat que- Excite study are further categorized into unique queries and ries, and zero-term queries or XML queries, English que- modified queries while both are considered as unique que- ries, and non-English queries. Unique queries are all differ- ries in our 5M Gnutella study and the 1M Excite study. The ent queries entered by one user in one session. Repeat queries of the P2P network have significantly more repeat queries are all multiple occurrences of the same query queries in comparison with the queries of Web search entered by one user or automatically by the system to update engines due to the randomness of connectivity with peers in the list of results. Zero-term queries are queries without any P2P networks. Users of the Gnutella network expect that the terms; they are generated by Gnutella clients, such as “[en- connectivity with existing peers will be changed after some ter search term]” or “enter search here,” when users enter no time and that resubmitting the same queries may obtain terms. XML queries are queries containing XML substring, results that cannot be obtained in the previous trials. There- such as “ ‘1.0’?” They are usually generated by advanced fore, users of the Gnutella network will continue to submit Gnutella clients; e.g., LimeWire (Thadani, 2002). English repeat queries until they find the matched files. The client queries are queries containing only English characters, systems may also automatically resubmit queries to update numbers, and symbols. Non-English queries are queries or extend the list of results. For Web queries, on the other containing non-English characters, such as Chinese charac- hand, submitting the same queries to a Web search engine ters and Korean characters. 
will not obtain different results within a period of time Table 3 presents the occurrences and distributions of unless the fetching robot has refreshed the updated Web different types of queries of the Gnutella network. The pages and the updated Web pages have been indexed by the results indicate that 86.56% of queries are repeat queries index server. Such cycles of Web revision usually take

FIG. 3. Distributions of the number of terms in queries of Gnutella.

788 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—July 2004 TABLE 5. Comparisons of the 5M Gnutella study, 51K Excite study, and the 1M Excite study, the length of Gnutella queries is longer 1M Excite study in terms of number of terms per query. than those of Excite queries. The comparisons are presented 5M Gnutella 51K Excite 1M Excite in Table 5. The comparison of the distribution between the study study study 5M Gnutella study and the 51K Excite study is presented in Figure 4. This is partly because the Gnutella users usually Median number of terms specify the file formats in the queries, such as, jpg, divx, avi, per query 3 2 2 mp3, etc., while the results of Web search engines are Average number of terms per query 3.74 2.21 2.16 usually files in the format of hypertext markup language (HTML). Therefore, specifying the file format in the queries of Web search engines is not necessary. Such query patterns are further elaborated in Search Queries, Terms, and Topics. weeks. The repeat queries of Web search engines only occur In addition, users of Gnutella usually submit the names of when users want to see more results on the subsequent result movies, songs, singers, actors, and directors as their queries pages. It indicates that the performance of searching in P2P to search for the specific multimedia files. The names are networks can be improved if we can solve the problem of usually longer than two terms, especially for the names of repeat queries. Also, when compared to the 1M Excite movies and songs. For example, “The Lord of the Rings: study, our study shows that the percentage of zero-term The Two Towers” is the name of a movie with 8 terms. queries in the 5M Gnutella queries is significantly less because many P2P servants disallow zero-term queries. Table 3 also shows that the percentage of non-English Search Queries, Terms, and Topics queries is less than 0.6% and the percentage of XML queries is less than 0.8%. 
These types of queries are the minority in In this section, we further investigate the queries and the the Gnutella community. Perhaps, this indicates that En- terms used in the queries. We shall first identify the most glish-speaking P2P users are dominant in the community. frequent queries in the Gnutella network. A query usually The contents of the shared files are mostly in English. has multiple terms. Terms can be used in many different Moreover, only a few P2P servants support advanced XML- queries. Based on all the queries and all the unique queries, based queries. As a result, we find that almost 99% of the we identify the most frequent terms being used. The differ- queries are English queries. ence between the most frequent queries and the most fre- Figure 3 presents the distributions of the number of terms quent terms in all queries and all unique queries can help us in Gnutella queries. The median number of terms per query to understand the specific and general topics that users are was 3. The average number of terms per query was 3.74. interested in. About 50% of the queries contained one or two terms. Table 6 lists the 15 most frequently occurring queries of Fewer than 5% of the queries had more than 10 terms. When the Gnutella network. Their frequencies and percentages are comparing the results reported by the 51K Excite study and also presented in Table 6. It is observed that Gnutella users

FIG. 4. Comparison of the distribution of the number of terms in the 5M Gnutella study, 51K Excite study, and 1M Excite study.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—July 2004 789

TABLE 6. Top 15 queries in the Gnutella network.

Rank  Query            Frequency  %
1     divx             11,820     0.23
2     qwerty jpg       8,746      0.17
3     porn             6,510      0.13
4     eminem           6,365      0.13
5     techno mp3       6,159      0.12
6     divx avi         5,385      0.11
7     porn mpg         4,805      0.10
8     spiderman        4,402      0.09
9     chris isaak      4,370      0.09
10    return to me     4,115      0.08
11    joey gian        4,088      0.08
12    nelly            3,845      0.08
13    sex              3,567      0.07
14    aqua mp3         3,567      0.07
15    Minority report  3,330      0.07

are interested in timely contents; for example, recently released movies (“spiderman,” “minority report”), artist names (“eminem,” “nelly,” and “chris isaak”), or band names (“aqua mp3”). Another popular category of query is sexuality (“porn,” “porn mpg,” and “sex”), but it is not as popular as movies or songs. This matches the Web searching behavior reported in the 51K Excite study and the 1M Excite study. In addition to the timely content queries and the sexuality content queries, other frequent queries of the Gnutella network name file formats (“divx” and “divx avi”), which are rare in queries of Web search engines. Indeed, the most frequent Gnutella query is “divx,” which is a file format.

We further investigate the most frequent terms in all queries and in unique queries of the Gnutella network; the results are presented in Tables 7 and 8, respectively. It is found that many terms appearing in the top 15 queries are not among the top 50 terms in all queries or unique queries, for example, “qwerty,” “techno,” “spiderman,” “chris,” “isaak,” “return,” “joey,” “gian,” “nelly,” “aqua,” “minority,” and “report.” Most of the terms that appear in the top 15 queries but not in the top 50 terms are parts of the names of movies, songs, actors, actresses, directors, etc. On the other hand, many terms that appear in the top 50 terms in all queries and unique queries, such as “mp3,” “avi,” “mpg,” “mpeg,” “xxx,” and “sex,” do not appear in the top 15 queries. These terms are mostly file formats or related to sexuality. This suggests that most users who have a specific target item in mind are interested in specific works in the entertainment domain. In general, however, users may submit general terms related to file formats, entertainment, and sexuality as part of their queries, as shown in Tables 7 and 8.
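The distinction between term counts over all queries and term counts over unique queries can be illustrated with a minimal sketch. It is not the procedure used in the study: the stop list, the normalization (lowercasing and whitespace collapsing), and the toy query log are assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical stop list: the paper does not list its "common terms without content".
STOP_TERMS = {"a", "and", "in", "of", "the", "to"}


def normalize(query):
    """Lowercase a query and collapse runs of whitespace (assumed normalization)."""
    return " ".join(query.lower().split())


def term_counts(queries):
    """Count content terms over an iterable of queries."""
    counts = Counter()
    for query in queries:
        counts.update(t for t in normalize(query).split() if t not in STOP_TERMS)
    return counts


def top_terms_all_and_unique(query_log, n=50):
    """Return (top terms over all queries, top terms over unique queries)."""
    all_counts = term_counts(query_log)
    # Each distinct normalized query contributes once to the unique-query counts.
    unique_counts = term_counts({normalize(q) for q in query_log})
    return all_counts.most_common(n), unique_counts.most_common(n)


# Toy log, invented for illustration: one specific query is resubmitted often.
log = ["eminem mp3", "eminem mp3", "Eminem  MP3", "eminem mp3",
       "love song mp3", "dance mix"]
all_top, unique_top = top_terms_all_and_unique(log, n=10)
print(dict(all_top))
print(dict(unique_top))
```

In this toy log the specific term "eminem" is counted four times over all queries but only once over unique queries, which is the kind of effect that separates the two term rankings.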

TABLE 7. Top 50 terms in all queries of the Gnutella network (after removing common terms without content).

Rank  Term      Frequency   Rank  Term      Frequency
1     mp3       1,118,333   26    movie     26,255
2     urn:      907,641     27    trek#     25,873
3     avi       519,966     28    episode#  25,464
4     mpg       393,443     29    big       24,441
5     zip       94,724      30    men#      24,417
6     mpeg      77,487      31    p^a       23,535
7     jpg       71,404      32    gay       23,254
8     you       68,075      33    p^a       22,258
9     xxx       59,999      34    young     21,990
10    sex       55,635      35    john      21,518
11    porn      54,308      36    remix     21,222
12    star      51,929      37    boys      20,997
13    divx      50,736      38    mix       20,595
14    love      50,444      39    your      20,443
15    me        49,199      40    all       20,406
16    black     48,588      41    man       20,157
17    my        44,481      42    new       29,366
18    teen      39,118      43    german#   19,064
19    asf       36,594      44    wars#     18,931
20    girl      33,482      45    Dj        18,831
21    full      30,838      46    eminem#   18,460
22    red       30,703      47    dvd#      18,196
23    live      28,005      48    time      18,110
24    hot       27,977      49    pdf       17,154
25    girls     27,213      50    p^a,#     16,762

# Term unique to this table (absent from the other top 50 list).
^a p = expletive.

TABLE 8. Top 50 terms in all unique queries of the Gnutella network (after removing common terms without content).

Rank  Term      Frequency   Rank  Term         Frequency
1     mp3       252,338     26    all          4,497
2     urn:      233,133     27    mix          4,473
3     mpg       55,872      28    young        4,307
4     avi       43,996      29    p^a          4,216
5     zip       17,274      30    gay          4,179
6     you       16,665      31    girls        4,177
7     jpg       16,543      32    full         4,113
8     mpeg      14,624      33    dj           4,023
9     me        11,571      34    john         3,957
10    love      11,523      35    red          3,901
11    my        10,315      36    new          3,762
12    sex       9,176       37    man          3,761
13    porn      7,764       38    time         3,653
14    asf       6,727       39    pdf          3,651
15    girl      6,718       40    p^a,#        3,343
16    live      6,536       41    soundtrack#  3,331
17    teen      6,444       42    boys         3,266
18    black     6,079       43    get#         3,263
19    xxx       5,573       44    p^a          3,189
20    divx      5,464       45    song#        3,130
21    star      5,333       46    movie        3,111
22    your      5,304       47    rock#        3,096
23    remix     5,300       48    little#      3,077
24    big       4,700       49    dance#       3,051
25    hot       4,541       50    music#       3,011

# Term unique to this table (absent from the other top 50 list).
^a p = expletive.
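The “#” markers in Tables 7 and 8 flag terms that appear in only one of the two top 50 lists. A minimal sketch of how such a comparison could be computed is given below; the function names and the five-term toy lists are hypothetical and are not taken from the study.

```python
def compare_top_lists(list_a, list_b):
    """Compare two ranked term lists.

    Returns (share, only_a, only_b): the fraction of list_a's distinct terms
    that also appear in list_b, and the terms found in only one list, kept
    in rank order.
    """
    set_a, set_b = set(list_a), set(list_b)
    only_a = [t for t in list_a if t not in set_b]
    only_b = [t for t in list_b if t not in set_a]
    share = len(set_a & set_b) / len(set_a)
    return share, only_a, only_b


def mark_unique(terms, other_terms):
    """Append '#' to terms absent from the other list, as in Tables 7 and 8."""
    other = set(other_terms)
    return [t if t in other else t + "#" for t in terms]


# Toy lists, invented for illustration (the real lists have 50 terms each).
all_query_terms = ["mp3", "avi", "love", "trek", "eminem"]
unique_query_terms = ["mp3", "avi", "love", "song", "music"]

share, only_all, only_unique = compare_top_lists(all_query_terms, unique_query_terms)
print(share, only_all, only_unique)
print(mark_unique(all_query_terms, unique_query_terms))
```

Applied to two 50-term lists that share 42 terms, the same calculation gives a share of 42/50 = 0.84.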

TABLE 9. Top 50 terms in the 51K Excite study and the 1M Excite study in order of ranking (after removing common terms without content).

51K Excite study 1M Excite study sex, nude, free, pictures, new, university, women, chat, gay, girls, xxx, sex, free, nude, pictures, university, pics, chat, adult, women, music, , pics, ncaa, home, stories, pa, college, naked, adult, state, new, xxx, girls, music, porn, gay, school, home, college, state, big, basketball, men, employment, school, jobs, American, real, world, naked, American, stories, software, games, Diana, pa, black, black, porn, photos, york, young, history, page, celebrities, estate, photos, jobs, world, magazine, nudes, news, football, page, magazine, computer, news, texas, games, war, john, internet, car, computer, princess, airlines, download, real, education, art, wrestling web, history, video, sports, California, men, national, big

^a p = expletive.

Comparing the top 50 terms in all queries and in unique queries, we found that 42 terms (84%) are common to both lists. Terms that appear in only one of Tables 7 and 8 are marked with a “#” sign. Terms such as “soundtrack,” “song,” “dance,” and “music” enter the top 50 when only unique queries are considered, whereas terms such as “trek,” “episode,” and “eminem” drop out. In general, specific terms that are parts of names drop out of the top 50, and general terms enter it, when only unique queries are considered. This suggests that users repeatedly submit specific terms to search for items of interest. Since these specific items may not be available in the P2P network all the time, users resubmit their queries frequently until the items are retrieved. General terms can relate to many items; therefore, queries with general terms are not repeated as frequently.

The top 50 terms in unique queries in the 5M Gnutella study are also compared with those of the 51K Excite study and the 1M Excite study, which are given in Table 9. Some terms are common to all three studies, for example, “sex,” “porn,” “black,” “xxx,” “big,” “gay,” “girls,” and “new.” Terms related to file formats are frequent only in the 5M Gnutella study, not in the 51K Excite study or the 1M Excite study. On the other hand, the 51K Excite study and the 1M Excite study have more terms on the topics of recreation, education, and technology, such as “basketball,” “football,” “magazine,” “school,” “college,” “computer,” and “software.” It is clear that users of the Gnutella network are more focused on multimedia files on the topic of entertainment.

File Extensions and Categories

Since file formats appear in most of the queries of the Gnutella network, we further investigated the 15 most frequent file formats, as shown in Table 10. Table 11 shows the distribution of the file categories specified in the queries of the Gnutella network. It concurs with our earlier finding that users of the Gnutella network are interested in multimedia files.

TABLE 10. Top 15 file types specified in queries.

Rank  File extension  Occurrence  %
1     mp3             1,149,893   22.76
2     avi             859,048     17.00
3     mpg             491,146     9.72
4     zip             83,001      1.64
5     mpeg            81,526      1.61
6     jpg             54,791      1.08
7     asf             40,924      0.81
8     divx            35,040      0.69
9     ra              32,214      0.64
10    rm              21,094      0.42
11    pdf             19,792      0.39
12    exe             17,369      0.34
13    rar             16,990      0.34
14    ps              8,590       0.17
15    mov             8,388       0.17

TABLE 11. Distribution of file categories specified in queries.

File category    Percentage (%)
Video            30.01
Audio            24.15
Compressed file  2.01
Graphic          1.12
Document         0.67
Software         0.36
Other            41.69

Conclusion

P2P networks are popular for file sharing among today’s Internet users. Unfortunately, searching P2P networks is neither efficient nor effective. Many studies have been reported on P2P network traffic and connectivity, but none have been reported on the nature of P2P queries. In this report, a study of the queries of the Gnutella network is presented and trends in P2P searching are identified. Repeat queries make up the majority of queries in the Gnutella network: the number of repeat queries is 6.5 times the number of unique queries. The length of queries is 3.74 terms on average, which is longer than the query lengths reported for the Excite search engine in the 51K Excite study and the 1M Excite study. The most frequent queries are related to the names of movies, songs, artists, singers, and directors. The most frequent terms in queries are related to file formats, entertainment, and sexuality. Among all the queries, video and audio files are frequently specified as the formats of the files being searched, accounting for over 50% of all queries. Although the performance of P2P searching is not investigated in this study, the behavior of P2P users provides insights on how to improve the future design of applications, architectures, and services for P2P networks.

The heavy P2P network traffic caused by repeat queries can be reduced by reusing the results of popular queries. When a query is forwarded by an initiating peer to a processing P2P servant, the processing servant forwards the query to another servant and returns its own query result to the initiating peer. When the query is a repeat query and the processing servant has already collected sufficient and up-to-date results for it from other servants, the processing servant may decide not to forward the query but to return all collected results to the initiating peer directly. The processing servant is occasionally required to update its result set. To achieve this, intelligent logic can be implemented in P2P servants.

P2P networking and file sharing can be more efficient and effective when peers who share similar interests are connected to form P2P special interest groups (SIGs). These SIGs can be built around newsgroups, which usually house Internet users of similar interests. Regional SIGs could be set up for regions that have large populations of active P2P users, such as the United States. This would not only facilitate their P2P activities, but also reduce network traffic on the Internet. More importantly, this would not disadvantage P2P users in a regional SIG, because a sufficient number of in-demand and popular files would be available for sharing within it.

Our findings show that high-bandwidth multimedia files (video and audio) are the most popular, together accounting for about 54% of all queries (Table 11). With the current P2P architecture, completely downloading a movie file (about 500 MB) may take a few days. This is primarily due to the upload restrictions set by the file providers, which are intended to prevent unknown peers from using the providers’ bandwidth excessively. With P2P SIGs in place, members of a SIG can benefit mutually by allowing one another to share files at higher bandwidth.

The differences in user information behavior between P2P networks and the WWW are also presented. We found that there are significantly more repeat queries, and that the number of terms per query is higher, in P2P networks. Many terms are common to both P2P networks and the WWW. WWW queries have more terms on the topics of recreation, education, and technology, while P2P queries are more focused on multimedia files on the topic of entertainment. The difference in information behavior between P2P and the WWW indicates that users may have different information needs when they choose to use P2P or the WWW, because the information coverage of these channels is different.

Acknowledgments

We thank S. M. Lui, Ricky Cheung, and Sally Chan for collecting the data in our research project.

References

Adar, E., & Huberman, B.A. (2000). Free riding on Gnutella. First Monday, 5(10).

BestPeer: Adaptive peer-to-peer platform for object sharing. (2002). Retrieved August 27, 2002, from http://xena1.ddns.comp.nus.edu.sg/p2p/index.

Chen, H., Chung, Y., Ramsey, M., & Yang, C.C. (1998a). An intelligent personal spider (agent) for dynamic internet/intranet searching. Decision Support Systems, 23(1), 41–58.

Chen, H., Chung, Y., Ramsey, M., & Yang, C.C. (1998b). A smart itsy bitsy spider for the Web. Journal of the American Society for Information Science, Special Issue on Artificial Intelligence Techniques for Emerging Information Systems Applications, 49(7), 604–618.

The Chord Project. (2002). Retrieved August 27, 2002, from http://www.pdos.lcs.mit.edu/chord/

The Gnutella Protocol Specification v0.4. (2003). Retrieved August 27, 2002, from http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf

Jansen, B.J., Spink, A., & Saracevic, T. (2000). Real life, real users and real needs: A study and analysis of users queries on the Web. Information Processing & Management, 36(2), 207–227.

Kwok, S.H. (2002). Decentralized knowledge reuse with peer-to-peer technology. Proceedings of the First Workshop on e-Business (Web2002).

Kwok, S.H., Lang, K.R., & Tam, K.Y. (2002). Peer-to-peer technology business and service models: Risks and opportunities. Electronic Markets, 12(3), 175–183.

Kwok, S.H., Lui, S.M., Cheung, R., Chan, S., & Yang, C.C. (2003). Searching Behavior in Peer-to-Peer Community. Special track of Web and Information Retrieval Technologies in the Proceedings of the Fourth IEEE Conference on Information Technology (ITCC-2003), Las Vegas, NV, April 28–30, 2003.

Lui, S.M., & Kwok, S.H. (2002). Interoperability of peer-to-peer file sharing protocols. ACM SIGecom Exchanges, 3(3), 25–33.

Lv, Q., Cao, P., Cohen, E., Li, K., & Shenker, S. (2002). Search and replication in unstructured peer-to-peer networks. Proceedings of the ACM ICS’02 (pp. 84–95), New York, June 22–26.

Matei, R., Iamnitchi, A., & Foster, P. (2002). Mapping the Gnutella network. Internet Computing, IEEE, 6(1), 50–57.

Markatos, E.P. (2002). Tracing a large-scale peer to peer system: An hour in the life of Gnutella. Cluster Computing and the Grid 2nd IEEE/ACM International Symposium CCGRID2002, Berlin, Germany.

The Ocean Store Project. (2002). Providing Global-Scale Persistent Data. Retrieved August 27, 2002, from http://oceanstore.cs.berkeley.edu

Pastry: An infrastructure for peer-to-peer applications. (2002). Retrieved August 27, 2002, from http://research.microsoft.com/~antr/Pastry

Piazza Peer Data Management System. (2002). Retrieved August 27, 2002, from http://data.cs.washington.edu/p2p/piazza/

The PIER Project Homepage (P2P Information Exchange & Retrieval). (2002). Retrieved August 27, 2002, from http://www.cs.berkeley.edu/~huebsch/pier/index.html

Saroiu, S., Gummadi, P.K., & Gribble, S.D. (2002). A measurement study of peer-to-peer file sharing systems. Proceedings of the Multimedia Computing and Networking (MMCN), San Jose, CA, January 23–24.

Silverstein, C., Henzinger, M., Marais, H., & Moricz, M. (1999). Analysis of a very large Web search engine log. ACM SIGIR Forum, 33(3).

Spink, A., & Ozmutlu, H.C. (2002). Characteristics of question format Web queries: An exploratory study. Information Processing & Management, 38, 453–471.

Spink, A., Ozmutlu, H.C., & Ozmutlu, S. (2002). Multitasking information seeking and searching processes. Journal of the American Society for Information Science and Technology, 53(8), 639–652.

Spink, A., Wolfram, D., Jansen, M.B.J., & Saracevic, T. (2001). Searching the Web: The public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226–234.

Stanford Peers Project. (2002). Retrieved August 27, 2002, from http://www-db.stanford.edu/peers

Thadani, S. (2002). Meta information searches on the Gnutella network. Retrieved August 8, 2002, from http://www.limewire.com/index.jsp/metainfo_searches

Vaucher, J., Kropf, P., Babin, G., & Jouve, T. (2002). Experimenting with Gnutella communities. Proceedings of the Distributed Communities on the Web (DCW 2002), Sydney, Australia, April 3–5, 2002.

Wilson, T.D. (2000). Human information behavior. Informing Science, 3(2), 49–55.

Wolfram, D., Spink, A., Jansen, B.J., & Saracevic, T. (2001). Vox populi: The public searching of the Web. Journal of the American Society for Information Science and Technology, 52(12), 1073–1074.

Yang, C.C., & Chung, A. (2002). A personal agent for Chinese financial news on the Web. Journal of the American Society for Information Science and Technology [Special issue on Web Research], 53(2), 186–196.

Yang, C.C., Yen, J., & Chen, H. (2000). Intelligent Internet searching agent based on hybrid simulated annealing. Decision Support Systems, Special Issue on Intelligent Agents and Digital Community, 28(3), 269–277.
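As an illustration of the repeat-query reuse proposed in the Conclusion, the following is a minimal sketch of the logic a processing servant might apply. It is not part of the Gnutella protocol or of any existing client: the class and function names, the thresholds for "sufficient" and "up-to-date" results, and the synchronous `forward` callback are all assumptions made to keep the example self-contained.

```python
import time


class RepeatQueryCache:
    """Illustrative per-servant cache of results collected for earlier queries."""

    def __init__(self, min_results=5, max_age_seconds=300.0, clock=time.monotonic):
        self.min_results = min_results          # assumed meaning of "sufficient"
        self.max_age_seconds = max_age_seconds  # assumed meaning of "up-to-date"
        self.clock = clock
        self._entries = {}  # normalized query -> (time stored, list of results)

    @staticmethod
    def _key(query):
        # Treat queries that differ only in case or spacing as repeats.
        return " ".join(query.lower().split())

    def store(self, query, results):
        """Record the collected results, replacing any older result set."""
        self._entries[self._key(query)] = (self.clock(), list(results))

    def lookup(self, query):
        """Return cached results if they are sufficient and fresh, else None."""
        entry = self._entries.get(self._key(query))
        if entry is None:
            return None
        stored_at, results = entry
        if self.clock() - stored_at > self.max_age_seconds:
            return None  # stale: the servant must refresh its result set
        if len(results) < self.min_results:
            return None  # too few results to answer without forwarding
        return list(results)


def handle_query(query, local_results, cache, forward):
    """Answer a repeat query from the cache; otherwise forward and remember."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached  # repeat query: nothing is forwarded
    remote_results = forward(query)  # hypothetical call to neighboring servants
    combined = list(local_results) + list(remote_results)
    cache.store(query, combined)
    return combined
```

The expiry check is one simple way to meet the requirement that the servant occasionally update its result set; a real servant would also have to bound the cache size and handle results that arrive asynchronously.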
