Volume-3, Issue-5, October-2013 ISSN No.: 2250-0758 International Journal of Engineering and Management Research Available at: www.ijemr.net Page Number: 82-88

Learning Based Approach for Selection in Metasearch

R. Kumar¹, A. K. Giri²
¹Ph.D. Scholar, School of Computer and System Sciences, Jawaharlal Nehru University, New Delhi, INDIA
²Ph.D. Scholar, School of Computer and System Sciences, Jawaharlal Nehru University, New Delhi, INDIA

ABSTRACT

In recent years, the web has become a huge source of information, most of it unstructured text or images, and every search engine uses its own method or algorithm for ranking the retrieved results. The main goal of metasearch over a single search engine is increased coverage and a consistent interface that ensures results from several sources can be meaningfully combined. In this paper we propose a learning-based query-similarity approach, built on a rank-merged list of documents, for the search engine selection problem: identifying the search engines most likely to contain relevant documents for the user query. The objective of search engine selection is to improve efficiency, since the query is sent only to potentially useful underlying search engines. We conclude by pointing out some open issues and possible directions for future research related to search engine selection.

Keywords: query, search engine, metasearch engine, query similarity, MRDD algorithm.

I. INTRODUCTION

In recent years, the web has become a huge source of information, most of it unstructured text or images. People all over the world pose queries to their favorite search engines to find relevant information. However, every search engine uses its own method for ranking the retrieved results. Metasearch engines usually send the query to several search engines simultaneously, so the queries are processed in parallel, saving time. A metasearch engine [2] is a tool that searches multiple search engines at the same time and returns a more comprehensive set of relevant documents that satisfies the user's information need efficiently. In other words, a metasearch engine is a system that provides unified access to multiple existing search engines [3]. When a user poses a query to the metasearch engine through its interface, the metasearch engine is responsible for identifying the underlying search engines that hold documents relevant to the query. Each search engine covers a text database (unstructured data) defined by the set of documents it can search. The underlying search engines retrieve most of the relevant documents, which the metasearch engine combines into a single ranked list and displays to the user. The ranking is based on the user query; a top-ranked document carries a high query weight.

The main advantages of metasearch over a single search engine are increased coverage and a consistent interface [1]. A consistent interface is necessary so that results from several sources can be meaningfully combined while the user remains unaware of the underlying search engines. Two types of search engines exist [3], namely general-purpose and special-purpose search engines. A general-purpose search engine aims to search all types of web pages with respect to the user query (AltaVista and HotBot, etc.) [3]. A special-purpose search engine retrieves documents for a defined domain, such as a specific subject area; for example, the Cora search engine focuses on computer science research papers, and Medical World Search focuses on medical information. It is believed that hundreds of thousands of special-purpose search engines currently exist on the web [4]. The motivations for the metasearch engine [5] include: (1) increased search coverage, since the web has been growing much faster than the indexing capability of any single search engine; (2) effective combination of the coverage of all underlying search engines; (3) retrieval of relevant documents by merging the documents returned by the underlying search engines and ranking them with respect to the user query, making retrieval convenient and reliable for the user; and (4) facilitation of the invocation of multiple search engines. Without a metasearch engine, a user seeking the most relevant documents would first have to identify the search engines with the most relevant information and then send the query to each of them in its own query format.

The rest of the paper is organized as follows. Section II describes the metasearch engine and its components. Section III reviews related work on search engine selection. The proposed query-similarity approach, using a rank-merged document list built by a Round Robin algorithm, is presented in Section IV. Section V walks through the algorithm on an example, and Section VI reports an experimental simulation. Finally, Section VII concludes and discusses future work.

II(A). METASEARCH ENGINE

A metasearch engine is a tool that sends the user's request to multiple primary search engines, combines the results retrieved from them, and displays them to the user. The information on the web has increased rapidly over time. When a query is submitted to a metasearch engine, three questions arise: (1) which underlying search engines should be selected for the query; (2) how the submitted query should be preprocessed to make the best use of each underlying search engine; and (3) how the results retrieved from the search engines should be merged. The metasearch engine takes all of these decisions based on the keywords of the user query. The challenging aims of a metasearch engine are therefore the selection of appropriate search engines and the merging of the results retrieved from them with respect to the user query.

II(B). ARCHITECTURE OF A METASEARCH ENGINE

Information in a digital library is stored in disparate sources, and each of these sources has its own search capability. In general, organizations have their own web sites and their own search engines. When a user poses a query to find relevant information, the metasearch engine finds that information using its components. The main components of a metasearch engine are the search engine selector, document selector, query dispatcher, and result merger. The aim of metasearch is to maximize precision, or retrieval effectiveness, while minimizing cost. To achieve this, the metasearch engine selects the most appropriate search engines containing relevant information, and each selected search engine should retrieve as much relevant information as possible. The architecture of a metasearch engine, as in [3], is shown in Figure (1); it consists of the four components named above.

[Figure (1): Architecture of a metasearch engine. The user query passes through the interface to the search engine selector, document selector, query dispatcher, and result merger, which operate over the underlying search engines SE-1, SE-2, ..., SE-N.]

Search Engine Selector: The search engine selector selects the appropriate underlying search engines with respect to the user query. A good selector should correctly identify relevant search engines while selecting as few irrelevant ones as possible. Approaches for selecting search engines are discussed later in this paper.

Document Selector: The document selector determines which documents to retrieve from the selected search engines. The aim is to retrieve many relevant documents and few irrelevant ones. To find relevant information, different similarity measures are used to estimate the relevance between a document and the user query, judged against a pre-defined threshold value; a high similarity value indicates that the information is more relevant to the user query.

Query Dispatcher: The query dispatcher establishes a server connection with each selected search engine in order to dispatch the query to it. In general, the user query is sent to a search engine after preprocessing, so a search engine may or may not receive exactly the same query as was posed to the metasearch engine.

Result Merger: The result merger merges the documents retrieved from the selected search engines, combining all results into a single ranked list arranged in descending order of global similarity with respect to the user query. The topmost documents, those with the highest global similarity, are returned to the user through the interface.

III. RELATED WORK

A good search engine selector uses a selection algorithm to identify the potentially useful search engines for a given user query. The main task of search engine selection is to identify the search engines most likely to contain relevant documents for the user query; its objective is to improve efficiency, since the query is sent only to potentially useful underlying search engines. The approach of [7] utilizes the retrieved results of past queries for selecting the appropriate search engines for a specific user query; the selection is based on the relevance between the user query and each search engine. Modeling Relevant Document Distribution (MRDD) [8][3] is a static learning-based approach that uses a set of training queries. With the help of the training queries, it identifies all the relevant documents returned by every search engine and derives a distribution vector for each training query; the average distribution vector is then used to identify the appropriate search engines. In [9], each component search engine is represented by a set of pairs consisting of the document frequency of a term and the sum of that term's weights over all documents in the search engine; a threshold associated with each query indicates that only documents whose similarities with the query exceed the threshold are of interest. In [10], search engine selection is a static approach that selects appropriate search engines using the document frequency of each term as well as the documents in each underlying search engine. The Collection Retrieval Inference Network (CORI-Net) [11] uses two pieces of information for each distinct term, its document frequency and its search engine frequency: if a term appears in k documents of a search engine, the term is repeated k times in a "super document" containing all distinct terms of that search engine, so the document frequency of a term in the search engine becomes the term frequency in the super document. SavvySearch [12] and ProFusion [13] are hybrid learning approaches that combine static and dynamic learning. In the ProFusion approach, when a user query is received by the metasearch engine, the query is first mapped to one or more categories of underlying search engines; a query is mapped to a category that has at least one term belonging to the query. A neural network approach is given in [14]. The Relevant Document Distribution Estimation method for resource selection (ReDDE) [15] estimates the distribution of relevant documents across the set of available search engines; for its estimation, ReDDE considers both search engine size and content similarity, and it is a resource selection algorithm that normalizes the search engine size. In [16], an approach for long queries, which require more space and time to execute on the server side, forwards probes instead of the entire query to the server; the probes are useful because they have low cost and small size, and therefore use little bandwidth and incur low latency. The Decision-Theoretic Framework (DTF) [17], also called the cost-based resource selection approach, associates a retrieval cost with each digital library from which documents are retrieved for the query. Overall, the search engine selection approaches can be classified into four categories [3], as shown in Figure (2).

[Figure (2): Classification of search engine selection approaches: rough representative (ALIWEB, WAIS), static representative (D-WISE, CORI-Net, gGlOSS, ReDDE, ESoMSD), cost-based (TRD-CS, ENPUD), and learning representative (QuerySim, MRDD, SavvySearch, ProFusion).]

IV. PROPOSED ALGORITHM

We propose an algorithm inspired by [7], in which the documents retrieved for each past query from all selected search engines are used to calculate the relevance between each search engine and that past query using the top-k documents. The top-k documents come from a rank-merged list of the documents from all search engines; such a merged list may contain common (duplicate) documents, which must not be kept. In the proposed algorithm, the rank-merged list is built with the Round Robin algorithm [18], which avoids common documents in the merged list. The algorithm utilizes the retrieved results of past queries for selecting the appropriate search engines for a specific user query.

The selection of search engines is based on the relevance between the user query q and each search engine s_j, denoted Rel(s_j | q), which indicates how likely the search engine is to be relevant to q, or what percentage of the relevant documents for q is in s_j. A larger Rel(s_j | q) means the search engine contains more relevant information for the user query. The value of Rel(s_j | q) depends on Rel(s_j | p_i), the relevance between search engine s_j and past query p_i, and on Sim(p_i | q), the similarity between past query p_i and the user query. The search engines are ranked according to Rel(s_j | q), and those with the highest values are selected by the metasearch engine.

Algorithm

Input: a set H[p, q, s], where p is the number of past queries, q is the user query, and s is the number of search engines.
Output: a ranked list of the topmost search engines.
Method:

Step 1: For each i-th past query, rank the documents retrieved from all the search engines into a single merged ranked list.

Step 2: Compute

  Rel(s_j | p_i) = (1/T) × Σ over the top-T documents doc in the merged ranked list of Rel(s_j | doc),

where T is a pre-defined number of top documents.
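Step 2 can be read operationally: Rel(s_j | p_i) is the fraction of the top-T documents of the merged list that search engine s_j itself returned. A minimal sketch in Python (the function name and the toy lists are ours, not from the paper):

```python
def rel_engine_past_query(engine_docs, merged_list, T):
    """Rel(s_j | p_i): fraction of the top-T merged documents that
    engine s_j returned. Rel(s_j | doc) is 1 if doc is in the
    engine's result list, 0 otherwise."""
    returned = set(engine_docs)
    top_t = merged_list[:T]
    return sum(1 for doc in top_t if doc in returned) / T

# Hypothetical data: two engines and a merged top-4 list.
se_a = [1, 2, 3]
se_b = [3, 4]
merged = [1, 3, 2, 4]

print(rel_engine_past_query(se_a, merged, T=4))  # 0.75 (docs 1, 3, 2)
print(rel_engine_past_query(se_b, merged, T=4))  # 0.5  (docs 3, 4)
```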

Here Rel(s_j | doc) = 1 if doc ∈ s_j, and Rel(s_j | doc) = 0 otherwise.

Step 3: Generate the ranked list R_q of documents returned from the search engines for the user query q.

Step 4: For each i-th past query p_i and the user query q, compute the similarity between the past query and the user query as

  Sim(p_i | q) = (1/|R_pi|) × Σ over doc ∈ R_pi ∩ R_q of Score(doc, R_pi, R_q),

where R_pi and R_q are the ranked lists returned from the search engines for the past query and the user query, and the Score function is

  Score(doc, R_pi, R_q) = 1 - | rank of doc in R_pi / |R_pi| - rank of doc in R_q / |R_q| |.

Step 5: Normalize the values of Sim(p_i | q) using

  MaxSim_q = max over i of Sim(p_i | q),
  CutSim_q = 0.8 × MaxSim_q,
  Normalized Sim(p_i | q) = 0 if Sim(p_i | q) < CutSim_q, and (Sim(p_i | q) - CutSim_q) / (MaxSim_q - CutSim_q) otherwise.

Step 6: For the user query q and each j-th search engine, compute

  Rel(s_j | q) = Σ over i of Rel(s_j | p_i) × Sim(p_i | q).

Step 7: Rank the search engines according to the value of Rel(s_j | q). A larger value of Rel(s_j | q) means the search engine is more likely to contain the most relevant documents with respect to the user query q.

V. ALGORITHM-BASED EXAMPLE

Let the set of past queries TQ = {Tq1, Tq2, Tq3}, the set of search engines SE = {Se1, Se2, Se3, Se4, Se5, Se6}, and a user query q be the input to the algorithm.

Step 1: Assume each past query has at most three terms:

  Tq1 = {1, 3, 2}
  Tq2 = {3, 4, 2}
  Tq3 = {5, 2, 1}

For every past query Tqi, the metasearch engine selects the following underlying search engines:

  Tq1: Se1, Se3, Se4, Se5, Se6
  Tq2: Se2, Se4, Se5, Se6
  Tq3: Se1, Se2, Se3, Se4, Se5, Se6

Now apply all the training queries to the selected search engines and assume the following documents are retrieved.

For Tq1 = {1, 3, 2}:

  Se1: 5 3 5 13 4 11 10 2
  Se3: 6 5 7 5 8 14
  Se4: 3 9 1 9 2 5
  Se5: 12 6 1 10 1 2 10

For Tq2 = {3, 4, 2}:

  Se2: 2 10 7 14 4 8
  Se4: 3 8 11 6 12 1 10 2
  Se5: 1 13 9 10 12 3 7
  Se6: 2 9 14 3 11 8

For Tq3 = {5, 2, 1}:

  Se1: 2 6 15 8 2 13 5 6
  Se2: 9 7 5 6 8 11 1 4
  Se3: 8 13 2 1 12 7 4
  Se4: 2 15 10 3 7 6 5

The single merged list of documents returned by the search engines for each training query, built with the Round Robin algorithm [18], is:

  Tq1: 5 6 3 12 9 7 1 13 10 8 2 11 14 ...
  Tq2: 2 3 1 10 8 13 9 7 11 14 6 12 15 ...
  Tq3: 2 9 8 6 7 13 15 5 10 1 3 12 11 ...

Step 2: Compute Rel(Se_i | Tq_j) between each search engine and each training query for the top-8 documents.

For Tq1 = {1, 3, 2}:

  Rel(Se1|Tq1) = 0.500, Rel(Se3|Tq1) = 0.333, Rel(Se4|Tq1) = 0.416, Rel(Se5|Tq1) = 0.333

For Tq2 = {3, 4, 2}:

  Rel(Se2|Tq2) = 0.500, Rel(Se4|Tq2) = 0.583, Rel(Se5|Tq2) = 0.416, Rel(Se6|Tq2) = 0.500
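The merged lists of Step 1 can be reproduced with a Round-Robin merge [18] that visits the engines' result lists position by position and drops documents that are already merged. The paper does not spell out the duplicate handling; the variant sketched below (in round i, each engine offers exactly its i-th document, and duplicates are skipped) reproduces the merged list shown above for Tq3:

```python
def round_robin_merge(ranked_lists):
    """Round-Robin merge of per-engine ranked lists into a single
    list, skipping documents that have already been merged."""
    merged, seen = [], set()
    rounds = max((len(lst) for lst in ranked_lists), default=0)
    for i in range(rounds):            # round i looks at rank position i
        for lst in ranked_lists:       # engines visited in a fixed order
            if i < len(lst) and lst[i] not in seen:
                seen.add(lst[i])
                merged.append(lst[i])
    return merged

# Result lists retrieved for Tq3 = {5, 2, 1} in the example.
tq3_lists = [
    [2, 6, 15, 8, 2, 13, 5, 6],   # Se1
    [9, 7, 5, 6, 8, 11, 1, 4],    # Se2
    [8, 13, 2, 1, 12, 7, 4],      # Se3
    [2, 15, 10, 3, 7, 6, 5],      # Se4
]
print(round_robin_merge(tq3_lists)[:13])
# [2, 9, 8, 6, 7, 13, 15, 5, 10, 1, 3, 12, 11]
```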

For Tq3 = {5, 2, 1}:

  Rel(Se1|Tq3) = 0.333, Rel(Se2|Tq3) = 0.266, Rel(Se3|Tq3) = 0.400, Rel(Se4|Tq3) = 0.333, Rel(Se5|Tq3) = 0.200, Rel(Se6|Tq3) = 0.266

Step 3: Now apply the user query q = {5, 3, 2} to the search engines selected by the metasearch engine, and assume they return the following documents:

  Se1: 6 7 15 8 12 3 15 1
  Se2: 2 5 11 9 8 11 10 4
  Se5: 7 13 12 13 12 3 2
  Se6: 2 15 10 13 11 9 5

The single merged list of documents returned by the search engines for the user query, built with the Round Robin algorithm [18], is:

  q: 6 2 7 5 13 15 11 12 10 8 ...

Step 4: Find the similarity between the user query and each past query:

  Sim(Tq1|q) = 0.5556, Sim(Tq2|q) = 0.4514, Sim(Tq3|q) = 0.6111

Step 5: Normalize the values of Sim(Tqi|q):

  Normalized Sim(Tq1|q) = 0.5455, Normalized Sim(Tq2|q) = 0, Normalized Sim(Tq3|q) = 1.000

Step 6: Find the values of Rel(s_j | q):

  Rel(Se1|q) = 0.0000, Rel(Se2|q) = 0.5715, Rel(Se3|q) = 0.1325, Rel(Se4|q) = 0.6061, Rel(Se5|q) = 0.4000, Rel(Se6|q) = 0.1716

Step 7: The search engines are ranked according to the value of Rel(s_j | q); the higher the value, the more appropriate the search engine for the user query. The order of the selected search engines for the user query is therefore {Se4, Se2, Se5, Se6, Se3, Se1}.

VI. EXPERIMENTAL SIMULATION

To test the effectiveness of the proposed algorithm, it was simulated in MATLAB 2010b. In this experiment we assume 30 search engines, six training queries, and one user query, and we compare the proposed query-similarity algorithm using the rank-merged document list against the algorithm of [3]. Figure (3) shows the relevance values between the search engines and the user query: a higher value indicates a more appropriate search engine. Figure (4) shows the average distribution vector (ADV) for all search engines. In the algorithm of [3], the ADV reflects the average number of documents retrieved from each search engine that are relevant to the user query, which is less effective and less robust than the proposed algorithm.

[Figure (3): Relevance between each search engine and the user query.]

[Figure (4): Average distribution vector (ADV) for all search engines.]

Figures (5) and (6) compare the number and the percentage of search engines selected in common by the two algorithms for the user query; the selection of search engines is determined by the relevance values between the search engines and the user query.
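Steps 4 through 6 of the algorithm, as exercised in the example above, can be sketched in Python. The Score function below is one plausible reading of the Step 4 formula (positional agreement of a shared document in the two ranked lists), and the function names are ours; the cutoff normalization reproduces the Step 5 values of the example:

```python
def score(doc, r_p, r_q):
    """Score(doc, R_pi, R_q): 1 minus the gap between the document's
    relative rank positions in the two ranked lists (Step 4)."""
    return 1 - abs((r_p.index(doc) + 1) / len(r_p)
                   - (r_q.index(doc) + 1) / len(r_q))

def similarity(r_p, r_q):
    """Sim(p_i | q): sum of Score over documents common to both
    lists, divided by |R_pi| (Step 4)."""
    if not r_p:
        return 0.0
    common = set(r_p) & set(r_q)
    return sum(score(d, r_p, r_q) for d in common) / len(r_p)

def normalize(sims, cut_ratio=0.8):
    """Step 5: zero out similarities below 0.8 * max, rescale the rest."""
    max_sim = max(sims)
    if max_sim == 0:
        return [0.0 for _ in sims]
    cut = cut_ratio * max_sim
    return [0.0 if s < cut else (s - cut) / (max_sim - cut) for s in sims]

def rel_engine_user_query(rel_past, norm_sims):
    """Step 6: Rel(s_j | q) = sum_i Rel(s_j | p_i) * Sim(p_i | q)."""
    return sum(r * s for r, s in zip(rel_past, norm_sims))

# Step 5 applied to the example's Step 4 similarities:
sims = [0.5556, 0.4514, 0.6111]   # Sim(Tq1|q), Sim(Tq2|q), Sim(Tq3|q)
print([round(s, 3) for s in normalize(sims)])
# [0.546, 0.0, 1.0] (the paper reports 0.5455, 0, 1.000)
```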


[Figure (5): Number of search engines selected in common by the two algorithms. Figure (6): Percentage of search engines selected in common by the two algorithms.]

It is observed that the two algorithms tend to select a reasonably high number of search engines in common when a reasonably high number of search engines is selected by them.

VII. CONCLUSION

This paper proposes a query-similarity approach using a rank-merged list of documents that is more effective and robust than the algorithm of [3]. The proposed algorithm retrieves more relevant documents than [3] because it prunes the common documents and provides the ranked document list used to compute the relevance between a search engine and the user query. The purpose of retrieving the relevant documents is to find the appropriate search engines for the user query. In the proposed algorithm, the relevance between a search engine and the user query is found with the help of the rank-merged list, which yields a refined and relevant selection of search engines because common documents are avoided in the merged list.

Future Work

The problem of search engine selection is compounded when large numbers of search engines are selected for the user query, as it becomes more difficult to choose the relevant ones among them. It is therefore a challenging task to select the relevant search engines, among the many available, with respect to the user query. Using metaheuristic algorithms such as the genetic algorithm, particle swarm optimization (PSO), magnetic optimization algorithms, simulated annealing, and ant colony optimization, the selection of the search engines most relevant to the user queries could be optimized.

REFERENCES

[1] E. Selberg and O. Etzioni. The MetaCrawler architecture for resource aggregation on the Web. IEEE Expert (January-February), 11-14, 1997.
[2] Liu, Bing. Web Data Mining. (1998).
[3] Meng, W., Yu, C., and Liu, K. Building effective and efficient metasearch engines. ACM Computing Surveys, Vol. 34, No. 1, (March 2002).
[4] Bergman, M. The deep Web: Surfacing the hidden value. BrightPlanet, www.completeplanet.com/Tutorials/DeepWeb/index.asp
[5] C. Yu, W. Meng, K. Liu, W. Wu, and N. Rishe. Efficient and effective metasearch for a large number of text databases. ACM International Conference on Information and Knowledge Management, 1999, pp. 217-224.
[6] Keyhanipour, A.H., Piroozmand, M., Moshiri, B., and Lucas, C. A multi-layer/multi-agent architecture for metasearch engines. In AIML 05 Conference, CICC, Cairo, Egypt, (Dec 2005).
[7] Cetintas, S., Si, L., and Yuan, H. Learning from past queries for resource selection. In CIKM, pages 1867-1870, (Nov 2009).
[8] Voorhees, E., Gupta, N.K., and Laird, B.L. Learning collection fusion strategies. In Proceedings of the ACM SIGIR Conference (Seattle, WA), pages 172-179, (July 1995).
[9] Gravano, L., and Garcia-Molina, H. Generalizing GlOSS to vector-space databases and broker hierarchies. In Proceedings of the International Conference on Very Large Data Bases, pages 196-205, (1995).
[10] Yuwono, B. and Lee, D. Search and ranking algorithms for locating resources on the World Wide Web. In Proceedings of the 5th International Conference on Data Engineering, pages 164-177, (Feb 1996).
[11] Callan, J., Lu, Z., and Croft, W.B. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR Conference (Seattle, WA), pages 21-28, (July 1995).
[12] Dreilinger, D. and Howe, A. Experiences with selecting search engines using metasearch. (1997).
[13] Fan, Y. and Gauch, S. Adaptive agents for information gathering from multiple, distributed information sources. In Proceedings of the AAAI Symposium on Intelligent Agents in Cyberspace, pages 40-46, (1999).
[14] Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning internal representations by error propagation. In D.E. Rumelhart & J.L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. Cambridge, MA: MIT Press, (1986).
[15] Si, L. and Callan, J. A semisupervised learning method to merge search engine results. ACM Transactions on Information Systems, Vol. 21, No. 4, pages 457-491, (Oct 2003).
[16] Hawking, D. and Thistlewaite, P. Methods for information server selection. ACM Transactions on Information Systems, Vol. 17, pages 40-76, (Jan 1999).
[17] Fuhr, N. A decision-theoretic approach to database selection in networked IR. ACM Transactions on Information Systems (TOIS), Volume 17, Issue 3, pages 229-249, (July 1999).
[18] Voorhees, E. and Tong, R. Multiple search engines in database merging. In Proceedings of the Second ACM International Conference on Digital Libraries (Philadelphia, PA), pages 93-102, (July 1997).

© 2011-13 Vandana Publications. All Rights Reserved