Searching the Peer-to-Peer Networks: The Community and Their Queries

Sai Ho Kwok
Department of Information and Systems Management, The Hong Kong University of Science and Technology.

Christopher C. Yang
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong. E-mail: [email protected]

Peer-to-Peer (P2P) networks provide a new distributed computing paradigm on the Internet for file sharing. The decentralized nature of P2P networks fosters cooperative and non-cooperative behaviors in sharing resources. Searching is a major component of P2P file sharing. Several studies have been reported on the nature of queries of World Wide Web (WWW) search engines, but studies on queries of P2P networks have not been reported yet. In this report, we present our study on the Gnutella network, a decentralized and unstructured P2P network. We found that the majority of Gnutella users are located in the United States. Most queries are repeated. This may be because the hosts of the target files connect to or disconnect from the network at any time, so clients resubmit their queries. Queries are also forwarded from peer to peer. Findings are compared with the data from two other studies of Web queries. The length of queries in the Gnutella network is longer than those reported in the studies of WWW search engines. Queries with the highest frequency are mostly related to the names of movies, songs, artists, singers, and directors. Terms with the highest frequency are related to file formats, entertainment, and sexuality. This study is important for the future design of applications, architecture, and services of P2P networks.

Accepted December 3, 2003
© 2004 Wiley Periodicals, Inc. Published online 9 April 2004 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.20022
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 55(9):783–793, 2004

Introduction

Peer-to-peer (P2P) computing is currently attracting enormous media attention, spurred by the popularity of file sharing systems such as Napster, Gnutella, and Morpheus. The peers are simply autonomous, or, as some call them, first-class citizens. P2P networks are emerging as a new distributed computing paradigm for their potential to harness the computing power of the hosts composing the network and make their underutilized resources available to each other. The decentralized nature of P2P computing also makes it ideal for economic environments that foster knowledge sharing and collaboration as well as cooperative and noncooperative behavior in sharing resources (Kwok, 2002). The pure P2P network provides the mechanism for a knowledge user to make a direct connection with another knowledge user. Information is shared among the knowledge users through direct communication, which is an important factor for the success of knowledge management and reuse. Business models are being developed, which rely on incentive mechanisms to supply contributions to the system and methods for controlling free riding (Kwok, Lang, & Tam, 2002). Clearly, the growth and the management of P2P networks must be regulated to ensure adequate compensation of content and/or service providers.

P2P is a technique that can be described as facilitating file sharing over a P2P network. Specifically, the P2P networks (or communities) contain a large number of nodes (or peers). These nodes, also known as servants, act simultaneously as both clients and servers. The popularity of P2P file sharing has attracted considerable interest. Consequently, a large number of researchers have investigated its extensibility and applicability. Under closer scrutiny, it would seem that network scalability is an inherent problem of P2P (The Ocean Store Project, 2002). Many researchers have devoted their efforts to network analysis, for example, network traffic patterns (Matei, Iamnitchi, & Foster, 2002) and network performance measurement (Vaucher, Kropf, Babin, & Jouve, 2002), in an attempt to find ways to reduce the P2P network traffic and improve the efficiency of P2P file sharing. However, the key factor is how P2P users behave (i.e., their information-searching behavior). This factor has not been explicitly addressed in previous studies and relevant data is lacking in the literature. This report will attempt to explore this problem by observing peers' activities using the servant's log files, advocating ways to reduce network traffic.

Searching is a major component of P2P file sharing. One of the most prevalent problems with P2P users is that a considerable amount of time is spent in searching. The reason for this is that the clients usually do not find the files they need. Even when they have found the required files, they have to query the same files again for other sources when the remote peers disconnect from the network. Consequently, there are a large number of query messages flooding the P2P network that will jeopardize the interests of the P2P communities. One of the ways to enhance the P2P network and file-sharing protocol is to understand the behavior of P2P users. For example, it is important to know how the query messages are generated and what kind of query message most frequently occupies the P2P network. In addition, it is necessary to have access to information on the kinds of files that the P2P users query most because user behavior has a direct impact on the network traffic.

Studies on the queries of P2P networks are beneficial to the future design of applications, architecture, and services of P2P networks, and P2P protocol (Kwok, Lui, Cheung, Chan, & Yang, 2003). A better understanding of user behavior on queries of P2P networks can help to enhance the searching mechanisms in order to provide efficient and effective searching. Recent research studies have examined the nature of queries on WWW search engines; however, there have not yet been any studies that examine the queries of P2P networks.

Information Behavior

Information behavior is defined by Wilson (2000) as the totality of human behavior in relation to sources and channels of information, including both active and passive information seeking, and information use. Wilson has also defined information searching behavior as the “micro-level” of behavior employed by the searcher in interacting with information systems of all kinds and information use behavior as the physical and mental acts involved in incorporating the information found into the person’s existing knowledge base. In this work, we are interested in investigating the human behavior in P2P networks in terms of the queries submitted by P2P users in contrast with the human behavior in WWW search engines as reported by other researchers. We are not studying the micro-level behavior of P2P users or how the P2P users act when they incorporate the files found into their knowledge base. We are rather interested in the information behavior in relation to two different channels: P2P networks and WWW search engines.

Jansen, Spink, and Saracevic (2000) reported the result of a study on 51,000 queries posed by 18,000 users of Excite’s search engine, focusing on sessions, queries, and terms. Among the 114,000 terms used, 22,000 terms are unique. The report shows that the sessions are short (2.8 queries per session on average) and the queries are short (2.21 terms per query on average). The Boolean operators and relevance feedback are seldom used. A small number of terms are used in high frequency while many terms are used only once. Silverstein, Henzinger, Marais, and Moricz (1999) reported the result of a study of 154,000,000 queries of the Alta Vista search engine. However, the definition of term in a Web query is not the same. In the study of the Alta Vista search engine, terms can be field-value designators. Spink, Wolfram, Jansen, and Saracevic (2001) reported the result of another study on 1,026,000 queries of the Excite search engine. Similar results are reported. Later, Wolfram, Spink, Jansen, and Saracevic (2001) compared the results of the studies of Excite queries between 1997 and 1999. It was found that there were fewer terms per query, fewer queries per session, and little modification in subsequent queries. As well, the searching topics shifted from entertainment, recreation, and sex to e-commerce related topics. Spink, Ozmutlu, and Ozmutlu (2002) conducted four studies, two of which were survey responses by 11 Excite search engine users and 114 search sessions by Excite search engine users. They found that multitasking information seeking and searching was a common behavior. The mean number of topic changes per session was 2.11. Multitasking search sessions usually take a longer time than single topic sessions. They further investigated the characteristics of question format of the Ask Jeeves search engine (Spink & Ozmutlu, 2002). Thirty thousand queries were included in the study. The questions are mainly in “where,” “what,” or “how” format. “Where can I find ” is the most common format.

Although there are a number of studies on queries of WWW search engines, studies on queries of P2P networks have not been reported yet. However, there have been some studies on the network traffic and connectivity of P2P networks. Markatos (2002) has investigated the magnitude and traffic patterns of the P2P network by tracing the queries going through a P2P servant in an hour. Matei et al. (2002) sent a crawler to collect the topology information of the P2P network. The study evaluated costs and benefits of the P2P approach and, according to the data, the mismatching of the P2P and Internet infrastructure topology has considerable impact on the overall performance of the P2P system. These studies have all discussed the network traffic
