On the Feasibility of Peer-to-Peer Web Indexing and Search
University of Pennsylvania ScholarlyCommons, Departmental Papers (CIS), Department of Computer & Information Science, October 2003.

Recommended Citation
Jinyang Li, Boon Thau Loo, Joseph M. Hellerstein, M. Frans Kaashoek, David Karger, and Robert Morris, "On the Feasibility of Peer-to-Peer Web Indexing and Search", October 2003. Postprint version. Published in Lecture Notes in Computer Science, Volume 2735, Peer-to-Peer Systems II, 2003, pages 207-215. Publisher URL: http://dx.doi.org/10.1007/b11823

NOTE: At the time of publication, author Boon Thau Loo was affiliated with the University of California at Berkeley. Currently (April 2007), he is a faculty member in the Department of Computer and Information Science at the University of Pennsylvania.

This paper is posted at ScholarlyCommons: https://repository.upenn.edu/cis_papers/328

On the Feasibility of Peer-to-Peer Web Indexing and Search

Jinyang Li*   Boon Thau Loo†   Joseph M. Hellerstein†   M. Frans Kaashoek*   David Karger*   Robert Morris*
*MIT Lab for Computer Science   †UC Berkeley
[email protected], {boonloo, [email protected]}, {kaashoek, karger, [email protected]}

Abstract

This paper discusses the feasibility of peer-to-peer full-text keyword search of the Web. Two classes of keyword search techniques are in use or have been proposed: flooding of queries over an overlay network (as in Gnutella), and intersection of index lists stored in a distributed hash table. We present a simple feasibility analysis based on the resource constraints and search workload. Our study suggests that the peer-to-peer network does not have enough capacity to make naive use of either of these search techniques attractive for Web search. The paper presents a number of existing and novel optimizations for P2P search based on distributed hash tables, estimates their effects on performance, and concludes that in combination these optimizations would bring the problem to within an order of magnitude of feasibility. The paper suggests a number of compromises that might achieve the last order of magnitude.

1 Introduction

Full-text keyword search of the Web is arguably one of the most important Internet applications. It is also a hard problem; Google currently indexes more than 2 billion documents [2], a tiny fraction of the estimated 550 billion documents on the Web [5]. While centralized search engines such as Google work well, peer-to-peer (P2P) Web search is worth studying for the following reasons. First, Web search offers a good stress test for P2P architectures. Second, P2P search might be more resistant than centralized search engines to censoring or manipulated rankings. Third, P2P search might be more robust than centralized search, as the demise of a single server or site is unlikely to paralyze the entire search system.

A number of P2P systems provide keyword search, including Gnutella [1] and KaZaA [4]. These systems use the simple and robust technique of flooding queries over some or all peers. The estimated number of documents in these systems is 500 million [3]; documents are typically music files, and searches examine only file meta-data such as title and artist. These systems have performance problems [6] even with workloads much smaller than the Web.

Another class of P2P systems achieves scalability by structuring the data so that it can be found with far less expense than flooding; these are commonly called distributed hash tables (DHTs) [19, 16, 18, 23]. DHTs are well suited for exact-match lookups using unique identifiers, but do not directly support text search. There have been recent proposals for P2P text search [17, 20, 11, 10] over DHTs. The most ambitious known evaluation of such a system [17] demonstrated good full-text keyword search performance with about 100,000 documents. Again, this is a tiny fraction of the size of the Web.

This paper addresses the question: Is P2P Web search likely to work? The paper first estimates the size of the problem: the size of a Web index and the rate at which people submit Web searches. Then it estimates the magnitude of the two most fundamental resource constraints: the capacity of the Internet and the amount of disk space available on peer hosts. An analysis of the communication costs of naive P2P Web search shows that it would require orders of magnitude more resources than are likely to be available. The paper evaluates a number of existing and novel optimizations, and shows that in combination they should reduce costs to within an order of magnitude of available resources. Finally, the paper outlines some design compromises that might eliminate the last order of magnitude difference.

The main contribution of this paper is an evaluation of the fundamental costs of, and constraints on, P2P Web search. The paper does not claim to have a definitive answer about whether P2P Web search is likely to work, but it does provide a framework for debating the question.

2 Background

A query consists of a set of search terms (words) provided by a user. The result is usually a list of documents that contain the terms, ranked by some scoring mechanism. Search engines typically precompute an inverted index: for each word, a posting list of the identifiers of documents that contain that word. These posting lists are intersected for queries involving more than one term. Since the intersection is often large, search engines usually present only the most highly ranked documents. Systems typically combine many ranking factors; these may include the importance of the documents themselves [15], the frequency of the search terms, or how close the terms occur to each other within the documents.
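To make this concrete, the following minimal sketch (illustrative only; the toy corpus and function names are ours, not the paper's) builds an in-memory inverted index and answers a multi-term query by intersecting posting lists:

    from collections import defaultdict
    from functools import reduce

    def build_inverted_index(docs):
        """docs: mapping of docID -> document text."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for word in text.lower().split():
                index[word].add(doc_id)   # posting list for each word
        return index

    def search(index, query):
        """Return docIDs containing every query term (posting-list intersection)."""
        postings = [index.get(term, set()) for term in query.lower().split()]
        if not postings:
            return set()
        postings.sort(key=len)            # intersect the smallest list first
        return reduce(set.intersection, postings)

    docs = {
        "d1": "peer to peer web search",
        "d2": "centralized web search engines",
        "d3": "distributed hash table lookup",
    }
    index = build_inverted_index(docs)
    print(search(index, "web search"))    # {'d1', 'd2'}

A real engine would additionally score the intersected candidates and return only the most highly ranked documents, as noted above.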
3 Fundamental Constraints

Whether a search algorithm is feasible depends on the workload, the available resources, and the algorithm itself. We estimate the workload first. Google indexes more than 2 billion Web documents [2], so to be conservative we assume 3 billion. Assuming 1000 words per document, an inverted index would contain 3 ∗ 10^9 ∗ 1000 document identifiers (docIDs). In a DHT, a docID would likely be a key with which one could retrieve the document; typically this is a 20-byte hash of the document's content. At 20 bytes per docID, an uncompressed inverted index would therefore total about 6 ∗ 10^13 bytes.
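The index-size figure follows directly from these assumptions; the short calculation below is only an illustration (the choice of SHA-1 as the 20-byte content hash, and the one-gigabyte-per-peer figure used in the storage discussion that follows, are restatements of the surrounding text):

    import hashlib

    # Workload estimate from the text: 3 billion documents, ~1000 words each,
    # one 20-byte docID per (word, document) posting.
    DOCS = 3e9
    WORDS_PER_DOC = 1000
    DOCID_BYTES = 20                       # e.g., a SHA-1 hash of the content

    postings = DOCS * WORDS_PER_DOC        # 3 * 10^12 docIDs
    index_bytes = postings * DOCID_BYTES   # 6 * 10^13 bytes, uncompressed
    print(index_bytes)                     # 6e+13

    # A 20-byte docID could be a content hash (illustrative assumption):
    doc_id = hashlib.sha1(b"<document contents>").digest()
    assert len(doc_id) == 20

    # At roughly one gigabyte of index per participating PC, storing the
    # uncompressed index takes on the order of 60,000 peers.
    print(index_bytes / 1e9)               # 60000.0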
Two fundamental resources constrain such a system: the disk space available on peer hosts and the communication capacity of the Internet.

• Storage Constraints: We assume that each peer can dedicate about a gigabyte, a small fraction of the size of a typical PC hard disk. It is not atypical today for some large desktop applications to have an installed size of around 1 GB. An inverted index of size 6 ∗ 10^13 bytes would require 60,000 PCs, assuming no compression.

• Communication Constraints: A P2P query consumes bandwidth on the wide-area Internet; the total bandwidth consumed by all queries must fit comfortably within the Internet's capacity.

Given the importance of finding information on the Web, we assume that it is reasonable for Web search to consume a noticeable fraction of Internet capacity. DNS uses a few percent of wide-area Internet capacity [21]; we optimistically assume that Web search could consume 10%. One way to estimate the Internet's capacity is to look at the backbone cross-section bandwidth. For example, the sum of the bisection bandwidths of Internet backbones in the U.S. was about 100 gigabits per second in 1999 [7]. Assuming 1,000 queries per second, the per-query communication budget is 10 megabits, or roughly one megabyte. This is a very optimistic assessment.

Another way to derive a reasonable query communication cost is to assume that the query should send no more data than the size of the document ultimately retrieved. Assuming that the average Web page size is about 10 kilobytes, this leads to a pessimistic query communication budget of 10 kilobytes.
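For reference, the two per-query budgets can be recomputed from the numbers above; the script below merely restates that arithmetic (the explicit bits-to-bytes conversion is ours):

    # Optimistic budget: 10% of ~100 gigabits/s of U.S. backbone bisection
    # bandwidth, divided among 1,000 queries per second.
    backbone_bits_per_s = 100e9
    search_share = 0.10
    queries_per_s = 1000

    bits_per_query = backbone_bits_per_s * search_share / queries_per_s
    print(bits_per_query)                  # 1e7 bits = 10 megabits
    print(bits_per_query / 8 / 1e6)        # ~1.25 megabytes ("roughly one megabyte")

    # Pessimistic budget: no more traffic than the ~10 KB page finally fetched.
    pessimistic_budget_bytes = 10 * 1000   # 10 kilobytes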