SAVVYSEARCH: a Metasearch Engine That Learns Which Search
Total Page:16
File Type:pdf, Size:1020Kb
AI Magazine Volume 18 Number 2 (1997) (© AAAI) Articles SAVVYSEARCH A Metasearch Engine That Learns Which Search Engines to Query Adele E. Howe and Daniel Dreilinger ■ Search engines are among the most successful single query to multiple search engines. applications on the web today. So many search The simplest metasearch engines are forms engines have been created that it is difficult for that allow the user to indicate which search users to know where they are, how to use them, engines should be contacted (for example, and what topics they best address. Metasearch ALL-IN-ONE [www.albany.net/allinone] and engines reduce the user burden by dispatching METASEARCH [members.gnn.com/infinet/ queries to multiple search engines in parallel. The meta.htm]). PROFUSION (Gauch, Wang, and SAVVYSEARCH metasearch engine is designed to efficiently query other search engines by carefully Gomez 1996; www.designlab.ukans.edu/pro- selecting those search engines likely to return fusion) gives the user the choice of selecting useful results and responding to fluctuating load search engines themselves or letting PROFUSION demands on the web. SAVVYSEARCH learns to iden- select three of six robot-based search engines tify which search engines are most appropriate using hand-built rules. METACRAWLER (Selberg for particular queries, reasons about resource and Etzioni 1995; metacrawler.cs.washing- demands, and represents an iterative parallel ton.edu:8080/home.html) significantly search strategy as a simple plan. enhances the output by downloading and analyzing the links returned by the search engines to prune out unavailable and irrele- ompanies, institutions, and individuals vant links. must have a presence on the web; each Metasearch engines reduce the burden on Care vying for the attention of millions the user. They make available search engines of people. Not too surprisingly then, the most that might have been unknown to the user. successful applications on the web to date are They handle the simultaneous submission of search engines: tools that assist users in queries; some direct the query to appropriate finding information on specific topics. engines, and some postprocess the results as A variety of search engines are available, well. They provide a single interface (on the from general, robot based (for example, down side, they might not support all the fea- ALTAVISTA [www.altavista.digital.com] and tures of the target search engines). WEBCRAWLER [webcrawler.com]) to topic or area Unfortunately, metasearch can lead to the specific (for example, FTPSEARCH [ftpsearch. tragedy of the commons problem from eco- unit.no/ftpsearch] and DEJANEWS [www. nomics in which an individual’s best interests dejanews.com]). Each uses different algo- run counter to society’s. Individual users rithms for collecting, indexing, and searching appear to be best served by simultaneously links; thus, each returns different results for searching every possible search engine on the similar queries. Empirical results indicate that web for desired information. However, the no single search engine is likely to return process might waste web resources: network more than 45 percent of the relevant results load and search-engine computation. (Selberg and Etzioni 1995). To find what they We believe that a metasearch system can be desire, users might need to query several a good web citizen (Eichmann 1994) by tar- search engines; metasearch engines automate geting those search engines likely to return this process by simultaneously submitting a useful results and responding to changing Copyright © 1997, American Association for Artificial Intelligence. All rights reserved. 0738-4602-1997 / $2.00 SUMMER 1997 19 Articles load demands on the web. To provide this be present) or can be presented as an ordered function, we incorporated simple AI tech- phrase. Three aspects of the results display niques in a metasearch engine. Our can be varied: (1) the number of links metasearch engine learns to identify which returned, (2) the format of the links descrip- search engines are most appropriate for par- tion, and (3) the timing. By default, 10 links ticular queries, reasons about resource are displayed with the uniform resource loca- demands, and represents an iterative parallel tors (URLs) and descriptions when available, search strategy as a simple plan. and the results of each search engine are list- ed separately as they arrive. Alternatively, we could change the number of links to 50, Our Metasearch System: return less or more description of the links, SAVVYSEARCH and interleave the results of the separate search engines. The interface is also available S AVVYSEARCH is our metasearch system in 23 different languages.1 (Dreilinger and Howe 1997, 1996) available at guaraldi.cs.colostate.edu:2000. It runs on five Processing a Query machines (three Sun SPARCSTATIONs and two When a user submits the query, SAVVYSEARCH IBM RS 6000s) at Colorado State University. must make two decisions: (1) how many The system was first made available in March search engines to contact simultaneously and AVVYSEARCH 1995 and has undergone several revisions S (2) what order the search engines should be since the original design. At present, two ver- is designed to contacted in. The first decision requires rea- sions of the system are available: (1) the one soning about the available resources and the balance two described here and (2) an experimental inter- second about ranking the search engines. potentially face that is mentioned in the section entitled Retrospective and Prospective Views of the Resource Reasoning Each search engine conflicting SAVVYSEARCH Project. queried expends network and local computa- tional resources. Thus, modifying concurren- goals: (1) SAVVYSEARCH is designed to balance two potentially conflicting goals: (1) maximizing cy (number of search engines queried in par- maximizing the likelihood of returning good links and (2) allel) is the best way to moderate resource the likelihood minimizing computational and web resource consumption. Concurrency is a function of consumption. The key to compromise is network load estimates, which are determined of returning knowing which search engines to contact for from a lookup table created from observa- tions of the network load at this time of day good links specific queries at particular times. SAVVY- in the past, and local central processing unit and (2) SEARCH tracks long-term performance of search engines on specific query terms to load, which is computed using the UNIX minimizing determine which are appropriate and moni- uptime command. computa- tors recent performance of search engines to Concurrency has a base value of two; as determine whether it is even worth trying to many as two additional units are added for tional and contact them. each load estimate for periods of low load. Thus, the maximum concurrency value is six. web resource In this section, we describe SAVVYSEARCH from a user’s perspective. We follow a run- Ranking Search Engines SAVVYSEARCH consumption. ning example, indicating what a user sees includes both large robot-based search engines and what goes on behind the scenes in pro- and small specialized search engines in its set. cessing a search request. The large search engines are likely to return links for any query, but these links might not Submitting a Query be as appropriate as links returned by a spe- To find out about artificial intelligence con- cialized search engine for a query in its area. ferences, we enter the query, select the Inte- The purpose of ranking is to determine grate Results option, and click on the SAVVY- which search engines are most worthwhile to SEARCH! button, as shown in an image of the contact for a given query. Search engines are interface in figure 1. The search form, the ranked based on learned associations between query interface to SAVVYSEARCH, asks the user search engines and query terms (stored in a to specify a set of keywords (query terms) and metaindex) and recent data on search-engine options for the search. Users typically enter performance. two query terms. The metaindex—A compendium of The options cover the treatment of the search experience: The metaindex maintains terms, the display of results, and the interface associations between individual query terms language. Query terms can be combined with (simplified by stemming and case stripping) logical and (all query terms must be included and search engines as effectiveness values. in documents) or or (any query term should High positive values indicate excellent perfor- 20 AI MAGAZINE Articles Figure 1. User Interface to SAVVYSEARCH for Entering a Query. mance of a search engine on queries contain- as some questionable responses. For each ing a specific term; high negative values indi- search, we collect two types of information: cate extremely poor performance. (1) no results, the search engine failed to The effectiveness values are derived from return links, and (2) visits, the number of two types of observation of the results of links that are explored by the user. No results users’ searches. We used observations (passive reduces confidence that the search engine is measures) because we obtained a low rate of appropriate for the particular query, and Vis- response to requests for user feedback as well its indicates that the user found some SUMMER 1997 21 Articles returned links to be interesting and so and are likely to return something for any increases confidence. query, their weights will tend to grow larger SAVVYSEARCH uses a simple weight-adjust- and more quickly than the specialized search ment scheme for learning effectiveness val- engines. Ts mitigates the tendency toward ues. No results and Visits are treated as nega- their dominance, allowing more specialized tive and positive reinforcement, respectively, engines to be selected when appropriate. amortized by the number of terms in the The penalties (Ps,h and Ps,r) are accrued query. Thus, if a search engine returned noth- only if thresholds on the minimum number ing for the example query, the effectiveness of hits and the maximum response time are values for artificial, intelligence, and conferences exceeded.