TWO-LAYERED ARCHITECTURE for PEER-TO-PEER CONCEPT SEARCH Janakiram Dharanipragada, Fausto Giunchiglia, Harisankar Haridas and Ul
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Unitn-eprints Research DISI ‐ Via Sommarive 14 ‐ 38123 Povo ‐ Trento (Italy) http://www.disi.unitn.it TWO-LAYERED ARCHITECTURE FOR PEER-TO-PEER CONCEPT SEARCH Janakiram Dharanipragada, Fausto Giunchiglia, Harisankar Haridas and Uladzimir Kharkevich February 2011 Technical Report # DISI-11-351 Also: submitted to the 4th International Semantic Search Workshop, Semsearch'11 co-located with WWW'11 Two-layered architecture for peer-to-peer concept search Janakiram Fausto Giunchiglia Dharanipragada University of Trento, Italy Indian Institute of Technology, [email protected] Madras [email protected] ∗ Harisankar Haridas Uladzimir Kharkevich Indian Institute of Technology, University of Trento, Italy Madras [email protected] [email protected] ABSTRACT 1. INTRODUCTION The current search landscape consists of a small number Global scale search has become a vital infrastructure for of centralized search engines posing serious issues including the society in the current information age. But, the current centralized control, resource scalability, power consumption search landscape is almost fully controlled by a small num- and inability to handle long tail of user interests. Since, ber of centralized search engines. Centralized search engines the major search engines use syntactic search techniques, crawl the content available in the world wide web, index and the quality of search results are also low, as the meanings maintain it in centralized data-centers. The user queries are of words are not considered effectively. A collaboratively directed towards the data-centers where the queries are pro- managed peer-to-peer semantic search engine realized using cessed internally and results are provided to the users. The the edge nodes of the internet could address most of the functioning of the search engines is controlled by individ- issues mentioned. We identify the issues related to knowl- ual companies. This situation raises a lot of serious issues edge management, word-to-concept mapping and efficiency which need to be studied and addressed carefully, given the in realizing a peer-to-peer concept search engine, which ex- importance of search as a vital service. tends syntactic search with background knowledge of peers The individual companies which control the centralized and searches based on concepts rather than words. We search engines decide upon what to(not to) index, rank etc. propose a two-layered architecture for peer-to-peer concept Moreover, the complete details of the ranking and prepro- search to address the identified issues. In the two-layered cessing methods are not made available publicly by most approach, peers are organized into communities and back- of the search engines. This leads to political and security ground knowledge and document index are maintained at concerns. With the increase in the information needs of two levels. Universal knowledge is used to identify the ap- the global community, the demands on the search engine in- propriate communities for a query and search within the frastructure have increased manifold over the years. Since, communities proceed based on the background knowledge search involves a large amount of computational, storage and developed independently by the communities. We developed bandwidth resources, the cost of acquiring and maintaining proof-of-concept implementations of peer-to-peer syntactic resources are very high. Recently, the energy consumption search, straightforward single-layered and the proposed two- of search engines is causing environmental concerns. The layered peer-to-peer concept search approaches. Our evalu- centralized nature also makes them a single point of failure. ation concludes that the proposed two-layered approach im- The syntactic search techniques used in major search en- proves the quality and network efficiency substantially com- gines consider words and multi-word phrases in the docu- pared to a straightforward single-layered approach. ments and queries to determine relevance. But, this leads to low quality of search results due to the ambiguity of nat- ural language. For example, "bank" could mean "depository Keywords financial institution" or "slope bordering a water body". Se- peer-to-peer, semantic, search mantic search techniques like Concept search[8] address this problem by incorporating knowledge bases to identify con- cepts (eg: concept "bank-2" means "slope bordering a water ∗The authors are listed in alphabetical order of last name body") and relationships between concepts(eg: poodle-1 ISA dog-2). Thus, search can be performed based on the con- cepts instead of words thereby improving quality of search results. However, realizing such semantic search techniques over a centralized search engine is difficult because of multi- Permission to make digital or hard copies of all or part of this work for ple reasons. These techniques require availability of human personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies generated knowledge bases which cater to the interests of bear this notice and the full citation on the first page. To copy otherwise, to the users. But, it is observed that the users of search en- republish, to post on servers or to redistribute to lists, requires prior specific gines exhibit a "long tail of interests", i.e., a diverse range permission and/or a fee. of numerous niche interests. It will be extremely tough and Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00. expensive, if not impossible for a centrally managed search be addressed. engine to develop and maintain rich knowledge bases ad- Knowledge management at large scale: The com- dressing the long tail of user interests. Another difficulty is mon knowledge which every one agrees on is minimal as it that the semantic search techniques are often more computa- is difficult to have consensus at a large scale. This is be- tionally expensive compared to syntactic search techniques cause of the large amount of time and effort required to get making them difficult to be offered at global scale. a wide understanding and meet requirements in a large di- A peer-to-peer semantic search engine could address most verse group as explained in [3]. Hence, a commonly agreed of the issues raised above related to centralized search en- monolithic knowledge base will be insufficient for realizing gines. A peer-to-peer semantic search engine uses the free semantic search. shared resources available in the edge nodes(peers) of the Difficulty in identifying concepts from words:A internet to realize a distributed semantic search using the major issue in semantic information retrieval approaches like background knowledge of the peers. In this paper, we in- Concept search is that, given limited context, it is tough to vestigate the applicability of a peer-to-peer semantic search identify the concept from the words in queries and docu- engine in addressing the issues related to global scale search. ments(word sense disambiguation problem). For example, The new issues which need to be addressed while realizing a from a query "bank" without any context, it is difficult to peer-to-peer semantic search engine are discussed. We show infer the intended meaning. Statistical techniques for word- the importance of a multi-layered approach in realizing peer- sense disambiguation(WSD) also suffer from low accuracy. to-peer semantic search. Experiments show the advantage of Network bandwidth usage of peer-to-peer search a two-layered approach to peer-to-peer concept search over techniques: Although there has been substantial amount of a straightforward single-layered approach. work done in the area of peer-to-peer (syntactic) search, the We explain the motivation and details of the architecture network bandwidth consumption of these techniques remain in Section 2. Sections 3 and 4 describe the design choices for high due to the distributed nature of the index. This affects the current implementation and the implementation details the applicability of the approach at internet scale. respectively. We present the details of the evaluation per- We propose a two-layered architecture for peer-to-peer formed and the results in Section 5. We review the related concept search to address the above mentioned issues re- work in Section 6 and state our conclusions in Section 7. lated to single-layered peer-to-peer concept search. 2. ARCHITECTURE 2.2 Two-layered architecture for peer-to-peer In a peer-to-peer semantic search engine, the edge nodes concept search of the internet(peers) participate in providing the semantic In the two-layered architecture for peer-to-peer concept search service in addition to being consumers of the service. search, the peers are organized as communities based on The peers collaboratively maintain the background knowl- common interests. Each community maintains its own back- edge base required for realizing semantic search. The open ground knowledge base(DBK) pertaining to its specific in- and shared nature of the infrastructure and peer-to-peer terests distributed among the peers of the community. The community will address the centralized control and trans- DBK of a community is created and collaboratively man- parency issues inherent in centralized search engines. Since, aged by the peers in the community. Thus the long tail the peers participate in providing the search service in a vol- of peer interests can be supported in the community-based untary fashion, the overhead and cost involved in acquiring knowledge management scheme. Since, the DBKs in differ- and maintaining dedicated resources in centralized data cen- ent communities evolve independently based on the specific ters are avoided. As the idle computational power available interests of the smaller set of peers within the communi- in non-dedicated edge nodes is used in providing the search ties, the "consensus problem of monolithic knowledge base" service, the power usage concerns are reduced compared to present in the single-layered approach is absent.