DEWS: A Decentralized Engine for Web Search

Reaz Ahmed∗, Md. Faizul Bari∗, Rakibul Haque∗, Raouf Boutaba∗, and Bertrand Mathieu†
∗David R. Cheriton School of Computer Science, University of Waterloo
{r5ahmed | mfbari | m9haque | [email protected]
†Orange Labs, Lannion, France
[email protected]

Abstract—Contemporary Web search is governed by centrally controlled search engines, which is not healthy for our online freedom and privacy. A better solution is to enable the Web to index itself in a decentralized manner. In this work we propose a decentralized Web search mechanism, named DEWS, which enables existing webservers to collaborate with each other to build a distributed index of the Web. DEWS can rank search results based on query keyword relevance and the relative importance of webpages. DEWS also supports approximate matching of query keywords in web documents. Simulation results show that the ranking accuracy of DEWS is very close to the centralized case, while the network overhead for collaborative search and indexing is logarithmic in network size.

I. INTRODUCTION

The Internet is the largest knowledge base that mankind has ever created. Autonomous hosting infrastructure and voluntary contributions from millions of Internet users have given the Internet its edge. However, contemporary Web search services are governed by centrally controlled search engines, which is not healthy for our online freedom, for the following reasons. A Web search service provider can be compromised to evict certain websites from the search results, which reduces those websites' visibility. The relative ranking of websites in search results can be biased according to the service provider's preference. Moreover, a service provider can record its users' search history for targeted advertisements or spying. For example, the recent PRISM scandal surfaced the secret role of major service providers in continuously tracking our web search and browsing history.

A decentralized Web search service can mitigate these problems by distributing control over a large number of network nodes. No single authority controls the search result; it is computed by combining partial results from multiple nodes, so a large number of nodes would have to be compromised to bias a search result. Moreover, a user's queries are resolved by different nodes, all of which would have to be compromised to accumulate that user's search history.

A number of research works ([1], [2], [3]) and implementations (YaCy, www.yacy.net; Faroo, www.faroo.com) have focused on distributed Web search and ranking in peer-to-peer (P2P) networks. These approaches have two potential problems in common: (a) lookup overhead: the number of network messages required for index/peer lookup is much higher in P2P networks than in a centralized alternative; (b) churn: maintaining a consistent index in the presence of high peer churn is not feasible. Thus, those solutions have difficulty meeting performance and accuracy requirements.

In this paper we take a very different approach to decentralized web indexing and ranking. Instead of relying on an overlay of regular Internet users, we build an overlay between webservers. We exploit the stability of the webserver overlay to heavily cache links (network addresses), which we use as routing shortcuts. Thus we achieve faster lookups, lower messaging overhead, and higher ranking accuracy in search results.

The rest of this paper is organized as follows. First we present the DEWS architecture in §II. Then we validate the concepts presented in this work through extensive simulations and present the results in §III. We present and compare with the related works in §IV. Finally, we conclude with future research directions in §V.

II. SYSTEM ARCHITECTURE

We have used the Plexus protocol [4] to build an overlay network between the webservers participating in DEWS. Since the webserver overlay is fairly stable, each webserver caches links (network addresses) to other servers. This link caching reduces network overhead during the indexing and routing processes. On top of the overlay topology we maintain two indexes, for distributed ranking (§II-A1) and keyword search (§II-A2), respectively. Functionally, DEWS is similar to a centralized search engine: it generates ranked results for keyword search (§II-B1). From now on we use the terms webserver and node interchangeably.

Plexus is a unique Distributed Hash Table (DHT) technique with built-in support for approximate matching, which is not easily achievable with other DHT techniques. Plexus routing scales logarithmically with network size, and Plexus delivers a high level of fault resilience by using systematic replication and redundant routing paths. Because of these advantages we have used the Plexus protocol to build the webserver overlay. Here we summarize the basic concepts of Plexus, followed by the proposed extensions.
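The link caching mentioned above can be pictured as a small lookup helper: a node first consults its local cache of previously resolved addresses and falls back to full Plexus routing only on a miss. The following Python sketch is illustrative only; LinkCache, NodeAddress, and the overlay_lookup routine passed to it are hypothetical names, not part of the Plexus protocol or the DEWS implementation.

# Minimal sketch (not the DEWS code base): a per-node cache of network
# addresses that avoids repeated overlay lookups for the same key.
from typing import Callable, Dict

NodeAddress = str  # placeholder representation, e.g. "10.0.0.7:4000"

class LinkCache:
    def __init__(self, overlay_lookup: Callable[[str], NodeAddress]):
        self._lookup = overlay_lookup             # full Plexus routing (O(log N) messages)
        self._cache: Dict[str, NodeAddress] = {}  # key (URL/codeword) -> cached address

    def resolve(self, key: str) -> NodeAddress:
        # Return the node responsible for `key`, using the cache when possible.
        if key not in self._cache:
            self._cache[key] = self._lookup(key)  # only cache misses touch the network
        return self._cache[key]

Because the webserver overlay is stable, cached entries rarely become stale, which is what makes this kind of aggressive caching pay off in DEWS.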
A. Indexing Architecture

Metrics used for ranking web search results can be broadly classified into two categories: a) the hyperlink structure of the webpages, and b) keyword-to-document relevance. Techniques from the Information Retrieval (IR) literature are used for measuring relevance ranks, while link-structure analysis algorithms like PageRank [7] are used for computing the weight, or relative significance, of each URL. In the following, we present a distributed mechanism that uses both of these measures for ranking search results. The same ranking mechanism can be used to incorporate other ranking factors, such as user feedback. In this work we assume that each webserver will advertise its websites to the DEWS indexing structure. Alternatively, a distributed crawler can be deployed in each webserver to crawl the webservers not participating in DEWS.

1) Hyperlink Index: Algorithms for computing webpage weights based on hyperlink structure are iterative and require many iterations to converge. In each iteration webpage weights are updated and the new weights are propagated to adjacent URLs for computation in the next iteration. To implement such ranking mechanisms over websites distributed across an overlay network, we need to preserve the adjacency relationships of the hyperlink graph while mapping websites to nodes. If hyperlinked websites are mapped to the same node or to adjacent nodes, then the network overhead for computing URL weights is significantly reduced. Unfortunately, there exists no straightforward hyperlink-structure-preserving mapping of the Web to an overlay network.

In DEWS, we retain the hyperlink structure as a virtual overlay on top of the Plexus overlay. We use a standard shift-add hash function (ℏ(·)) to map a website's base URL, say u_i, to a codeword, say c_k = ℏ(u_i). Then we use Plexus routing to look up β(u_i), which is the node responsible for indexing codeword c_k (Fig. 1(a)). For each outgoing hyperlink u_it of u_i, we find the responsible node β(u_it) in a similar manner. During distributed link-structure analysis, β(u_i) has to frequently send weight update messages to β(u_it). Hence we cache the network address of node β(u_it) at node β(u_i), which we call a soft-link. Soft-links mitigate the network overhead generated by repeated lookups during PageRank computation.

The index stored in β(u_i) has the form <u_i, w_i, {<u_it, β(u_it)>}>, where w_i is the PageRank weight of u_i. w_i is computed as w_i = (1 − η) + η · Σ_{t=1..g} w_it / L(u_it). Here, η (usually 0.85) is the damping factor of the PageRank algorithm, {u_it} is the set of webpages linked by u_i, and L(u_it) is the number of outgoing links from webpage u_it.

Each node periodically executes Algorithm 1 to keep the PageRank weights updated in a distributed manner. To communicate PageRank information between nodes, we use a PageRank update message containing the triplet <u_i, u_it, w_i/L(u_i)>.

Algorithm 1 Update PageRank
 1: Internals: Q_ui : PageRank message queue for u_i
      L(u_i) : number of outlinks of u_i
      w_i : PageRank weight of u_i
      η : damping factor of the PageRank algorithm
      θ : update propagation threshold
 2: for all URLs u_i indexed in this node β(u_i) do
 3:     temp ← 0
 4:     for all <u_si, u_i, w_si/L(u_si)> ∈ Q_ui do
 5:         temp ← temp + w_si/L(u_si)
 6:     end for
 7:     w_i^new ← (1 − η) + η · temp
 8:     if |w_i^new − w_i| > θ then
 9:         w_i ← w_i^new
10:         for all outlinks u_it of u_i do
11:             send PageRank message <u_i, u_it, w_i/L(u_i)> to β(u_it)
12:         end for
13:     end if
14: end for
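To make the update step concrete, below is a minimal single-node sketch of Algorithm 1 in Python. The index layout, the message queues, and the send_message transport are assumptions introduced for this example; only the weight-update logic mirrors the pseudocode above.

# Illustrative sketch of Algorithm 1 (not the DEWS implementation).
ETA = 0.85     # damping factor η
THETA = 1e-4   # update propagation threshold θ (assumed value)

def update_pagerank(index, queues, send_message):
    # index:  {u_i: {"w": weight, "outlinks": [u_it, ...]}} for the URLs this node indexes
    # queues: {u_i: [(u_si, u_i, w_si_over_L), ...]} of received PageRank messages
    # send_message(target_url, triplet) routes the triplet to β(target_url)
    for u_i, entry in index.items():
        temp = 0.0
        for (_src, _dst, w_si_over_L) in queues.get(u_i, []):   # lines 4-6
            temp += w_si_over_L
        w_new = (1.0 - ETA) + ETA * temp                        # line 7
        if abs(w_new - entry["w"]) > THETA:                     # line 8
            entry["w"] = w_new                                  # line 9
            share = w_new / max(len(entry["outlinks"]), 1)      # w_i / L(u_i)
            for u_it in entry["outlinks"]:                      # lines 10-12
                send_message(u_it, (u_i, u_it, share))
        queues[u_i] = []  # housekeeping: drop processed messages (not part of the pseudocode)

Note that each message already carries w_si/L(u_si), so the receiving node never needs to know the sender's outlink count, and the threshold θ keeps small weight changes from flooding the overlay with update messages.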
2) Keyword Index: The keyword index lets DEWS locate the webpages containing a query keyword by forwarding the query message to only a small number of nodes. Suppose K_i^rep = {k_ij^rep} is the set of representative keywords for u_i. For each keyword k_ij^rep in K_i^rep, we generate k_ij^dmp by applying Double Metaphone encoding [8] to k_ij^rep. Double Metaphone encoding attempts to detect phonetic ('sounds-alike') relationships between words. The motivation behind adopting phonetic encoding is twofold: i) any two phonetically equal keywords have no edit distance between them, and ii) phonetically inequivalent keywords have a smaller edit distance than the edit distance between the original keywords. In both cases, the Hamming distance between the encoded advertisement and search patterns is smaller than that of the patterns generated from the original keywords. This low Hamming distance increases the percentage of common codewords computed during advertisement and search, which eventually increases the likelihood of finding relevant webpages.

The process of generating the keyword index is depicted in Fig. 1(a). To generate an advertisement or query pattern P_ij from keyword k_ij^rep, we fragment k_ij^rep into k-grams ({k_ij^g}) and encode these k-grams, along with k_ij^dmp, into a b-bit Bloom filter. We use this Bloom filter as a pattern P_ij in F_2^b and list-decode it to a set of codewords, ζ_ρ(P_ij) = {c_k | c_k ∈ C ∧ δ(P_ij, c_k) < ρ}, where ζ_ρ(·) is a list-decoding function and ρ is the list-decoding radius. Finally, we use Plexus routing to look up and store the index on k_ij^rep at the nodes responsible for the codewords in ζ_ρ(P_ij).
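As an illustration of this pipeline, the sketch below builds an advertisement or query pattern from a single keyword: the keyword's k-grams and its phonetic code are hashed into a b-bit Bloom filter. The parameter values (b = 64, k = 3, two bit positions per item) and the phonetic_encode argument are assumptions for this example; a real deployment would plug in an actual Double Metaphone implementation, and the list decoding ζ_ρ to Plexus codewords is omitted here.

# Sketch of keyword pattern generation (illustrative only; parameters assumed).
import hashlib
from typing import Callable, List

B_BITS = 64       # Bloom filter width b (assumed)
K = 3             # k-gram length (assumed)
NUM_HASHES = 2    # bit positions set per inserted item (assumed)

def kgrams(word: str, k: int = K) -> List[str]:
    # Fragment a keyword into overlapping k-grams, e.g. "search" -> ["sea", "ear", "arc", "rch"].
    return [word[i:i + k] for i in range(len(word) - k + 1)] or [word]

def bloom_insert(pattern: int, item: str) -> int:
    # Set NUM_HASHES bit positions derived from `item` in the b-bit pattern.
    digest = hashlib.sha256(item.encode()).digest()
    for h in range(NUM_HASHES):
        pos = int.from_bytes(digest[4 * h:4 * (h + 1)], "big") % B_BITS
        pattern |= 1 << pos
    return pattern

def build_pattern(keyword: str, phonetic_encode: Callable[[str], str]) -> int:
    # Encode the keyword's k-grams plus its phonetic code into one b-bit pattern P_ij.
    pattern = 0
    for gram in kgrams(keyword.lower()):
        pattern = bloom_insert(pattern, gram)
    pattern = bloom_insert(pattern, phonetic_encode(keyword.lower()))
    return pattern

Because the same procedure generates both advertisement and query patterns, keywords that share k-grams or sound alike produce patterns with many common bits, and therefore overlap in the codewords returned by list decoding, which is what enables approximate keyword matching in DEWS.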