The Robustness of Content-Based Search in Hierarchical Peer to Peer Networks
Total Page:16
File Type:pdf, Size:1020Kb
The Robustness of Content-Based Search in Hierarchical Peer to Peer Networks ∗ M. Elena Renda Jamie Callan I.S.T.I. – C.N.R. Language Technologies Institute and School of Computer Science Scuola Superiore Sant’Anna Carnegie Mellon University I-56100 Pisa, Italy Pittsburgh, PA 15213, US [email protected] [email protected] ABSTRACT Keywords Hierarchical peer to peer networks with multiple directory services Peer to Peer, Hierarchical, Search, Retrieval, Content-based, Ro- are an important architecture for large-scale file sharing due to their bustness effectiveness and efficiency. Recent research argues that they are also an effective method of providing large-scale content-based federated search of text-based digital libraries. In both cases the 1. INTRODUCTION directory services are critical resources that are subject to attack Peer to peer (P2P) networks have emerged in recent years as a or failure, but the latter architecture may be particularly vulnera- popular method of sharing information, primarily popular music, ble because content is less likely to be replicated throughout the movies, and software, among large numbers of networked com- network. puters. Computers (‘nodes’) in P2P networks can be information This paper studies the robustness, effectiveness and efficiency providers and information consumers, and may provide network of content-based federated search in hierarchical peer to peer net- services such as message routing and regional directory services. works when directory services fail unexpectedly. Several recovery A single node may play several roles at once, for example, posting methods are studied using simulations with varying failure rates. queries, sharing files, and relaying messages. Experimental results show that quality of service and efficiency Information providers in P2P networks can be viewed as digital degrade gracefully as the number of directory service failures in- libraries. In many of the well-known operational P2P networks, for creases. Furthermore, they show that content-based search mech- example, Gnutella [5], KaZaA [10], or LimeWire [13], the digital anisms are more resilient to failures than the match-based search libraries are relatively small, ranging from a few dozen to a few techniques. thousand files. The search services they offer tend to be based on exact or partial match of file names that are known or easily guessed Categories and Subject Descriptors (e.g., names of popular songs, movies, or software) and/or small amounts of manually-assigned metadata. H.3.3 [Information Storage and Retrieval]: Information Search Search for information in P2P networks is essentially federated and Retrieval—retrieval models, search process, selection process; search (or ‘distributed information retrieval’ [22]) across a possibly H.3.4 [Information Storage and Retrieval]: Systems and Soft- large set of digital libraries. Recent P2P research has focused on ware—distributed systems, performance evaluation (efficiency and how to efficiently and effectively determine which digital libraries effectiveness); H.3.5 [Information Storage and Retrieval]: On- have desired information. The two major approaches are highly- line Information Services—data sharing organized networks (e.g., Tapestry [8], OceanStore [12], Pond [20], Pastry [21], Chord [24]), in which content location can be inferred easily due to well-defined data placement policies; and hierarchical General Terms P2P networks with local directory services. The latter approach Algorithms, Design, Experimentation, Performance has became the most popular architecture for widely-deployed P2P applications (e.g., Gnutella 0.6 [5], KaZaA [10], Morpheus [15], Napster [16], Project JXTA [17]). ∗This work was done when the author was a Visiting Graduate Re- Our research interest is in using P2P networks to provide fed- search Assistant at the Language Technologies Institute, School of erated search of text-based digital libraries. For example, P2P Computer Science, Carnegie Mellon University. protocols could be an effective and convenient method of provid- ing federated search across the various document repositories of a medium-sized company. A P2P architecture is appealing in this environment because there is no need to collect and maintain cen- Permission to make digital or hard copies of all or part of this work for tralized repositories or indexes. P2P protocols could also be an personal or classroom use is granted without fee provided that copies are effective method of providing federated search of digital libraries not made or distributed for profit or commercial advantage and that copies that can be searched but not crawled by a Web crawler (so called bear this notice and the full citation on the first page. To copy otherwise, to “Hidden Web” repositories). republish, to post on servers or to redistribute to lists, requires prior specific Prior research demonstrated that a hierarchical P2P network ar- permission and/or a fee. CIKM’04, November 8–13, 2004, Washington, DC, USA. chitecture can provide effective and efficient access to several thou- Copyright 2004 ACM 1-58113-874-1/04/0011 ...$5.00. sand small text-based digital libraries [14]. In a hierarchical P2P Figure 1: Pure P2P Network. network some nodes (‘hub’ nodes) provide regional or topical di- rectory services that improve the accuracy and efficiency of the net- work by directing messages appropriately. Our interest in this pa- per is to study the stability and robustness of such networks when Figure 2: Hierarchical P2P Network. directory services fail. There is ample ‘real world’ evidence that hi- erarchical P2P networks such as KaZaA and Morpheus are stable as directory services (‘ultra peers’) enter and exit the network, but Napster was highly influential, most later systems adopted decen- their stability relies on massive sharing of information. Our goal tralized search. is to discover whether an architecture that gives digital libraries greater control over their intellectual property is as stable in the 2.2 Decentralized Search Without Directory presence of directory service failure. Services The next section of this paper describes some prior work related One reaction to the vulnerability of P2P networks with a sin- to our research. Section 3 provides more details about the motiva- gle directory service was P2P networks with no directory ser- tion for our research. Section 4 describes the search mechanisms vices, sometimes called ‘pure’ or ‘flat’ P2P networks (Figure 1). we used. Section 5 presents the behaviour of the search proto- Gnutella 0.4 [5] is an early example. It adopted unstructured search col under dynamic conditions. Section 6 describes the data and to find content: A peer sends a query to its neighbors (‘flooding’), methodology adopted for the experiments. Section 7 reports and which attempt to satisfy it, and which also relay it to their neigh- discusses the results obtained. Finally, Section 8 concludes. bors, until the request has traveled a maximum distance from the initiating node (TTL, Time to Live). The Gnutella 0.4 protocol is 2. PEER-TO-PEER SYSTEMS simple, robust, and easy to apply to any type of content, but it is not P2P applications for file sharing have become widespread in re- scalable because the message overhead is high, and because mes- cent years, which has focused considerable research attention on sages do not propagate to distant parts of the network (which may this area, particularly on how to improve P2P search. P2P search have the best matching files). methods can be classified into three broad categories, based on how A more focused interest-based locality approach [23] is based they manage the search process. We discuss these categories, and on collaborative filtering: If peer A is looking for document d and related prior research, below, with an emphasis on well-known so- peer B owns it, A creates an interest shortcut to B because B might lutions and solutions likely to be useful for federated search of text- have other documents A is interested in. The shortcuts provide a based digital libraries. loose overlay structure on top of Gnutella 0.4. To avoid flooding, the protocol first looks for content through shortcuts; when all of a 2.1 Centralized Search (One Directory node’s shortcuts fail, it reverts to Gnutella 0.4. Shortcuts seem to Service) have good load distribution properties, and help reduce the number of peers touched by each query. Prior experimental results were Napster, one of the first P2P systems, was essentially a sin- 1 based on relatively short simulations (one hour), where the system gle, network-wide directory service. When a new digital library was always running in optimal conditions [23], so the generality of (‘node’) wanted to join the network, it provided a catalog of its cur- the experimental results are not yet clear. rent holdings to Napster. Napster handled search requests, return- P-Grid [1] is a self-organizing and completely decentralized P2P ing lists of matching files and the addresses of the digital libraries lookup system based on a virtual distributed search tree. It has providing them. some similarities with Chord [24] (see below), but it exploits lo- The Napster architecture was efficient and effective, but not ro- cal knowledge and does not require central coordination or global bust because of the central and unique point of failure. Although knowledge, as Chord does. The P-Grid’s search has good ro- 1For efficiency reasons, this single, logical directory service was bustness and scalability characteristics, due to the replication dis- actually implemented as several distinct, slightly different directory tributed over all the peers. services. Other approaches [9, 26] select which nodes to search based on specific heuristics (past behaviour, past number of results, past ciency, DHTs have several disadvantages as a solution for federated query response time, etc.). These methods are generally efficient, search of text-based digital libraries. Name-based retrieval is gen- but the heuristic-based selection of nodes can fail to find relevant erally not effective for text retrieval.