ARTICLE IN PRESS

Decision Support Systems 992 (2002) xxx–xxx
www.elsevier.com/locate/dsw

Complementing search engines with online web mining agents

Filippo Menczer
Department of Management Sciences, The University of Iowa, Iowa City, IA 52242, USA
E-mail address: [email protected] (F. Menczer). URL: http://dollar.biz.uiowa.edu/~fil

Abstract

While search engines have become the major decision support tools for the Internet, there is a growing disparity between the image of the World Wide Web stored in search engine repositories and the actual dynamic, distributed nature of Web data. We propose to attack this problem using an adaptive population of intelligent agents mining the Web online at query time. We discuss the benefits and shortcomings of using dynamic search strategies versus the traditional static methods in which search and retrieval are disjoint. This paper presents a public Web intelligence tool called MySpiders, a threaded multiagent system designed for information discovery. The performance of the system is evaluated by comparing its effectiveness in locating recent, relevant documents with that of search engines. We present results suggesting that augmenting search engines with adaptive populations of intelligent search agents can lead to a significant competitive advantage. We also discuss some of the challenges of evaluating such a system on current Web data, introduce three novel metrics for this purpose, and outline some of the lessons learned in the process. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Web mining; Search engines; Web intelligence; InfoSpiders; MySpiders; Evaluation metrics; Estimated recency; Precision; Recall

1. Introduction

With the increasing amount of information available online, the World Wide Web has rapidly become the main source of competitive intelligence for businesses, and consequently search engines represent invaluable decision support tools. However, disciplines like information retrieval and machine learning are facing serious difficulties in dealing with the Web in a scalable way. The search technologies employed on the Web are faced with several features of this environment that make the task of indexing the Web a formidable challenge. As of February 1999, the size of the publicly indexable Web was reported to be over 800 million pages [17], more than double that reported only a year earlier [15] and growing at a very fast pace—Google alone reports over 1.6 billion URLs indexed at the time of this writing. Furthermore, the Web has a dynamic nature, with pages being added, deleted, moved, updated, or linked to each other at a fast rate and in an unprincipled manner. It is also heterogeneous, without imposed restrictions on document content, language, style, or format.

Search engines are powered by state-of-the-art indexing algorithms, which make use of automated programs, called robots or crawlers, to continuously visit large parts of the Web in an exhaustive fashion. The pages accessed undergo some performance-enhancing steps, such as the removal of "noise" words (too frequent to be representative), the conflation of terms via stemming and/or the use of thesauri, and the use of different weighting schemes to represent indexed pages. Then, the pages are stored in inverted index databases. Such indexes map every term from a controlled vocabulary onto the collection of URLs containing that term, and are used at query time to return the sets of URLs containing the terms in users' queries.
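To make the inverted index mechanism concrete, here is a minimal sketch (not from the paper; plain Python over a hypothetical toy collection, with the stop-word removal, stemming, and term weighting mentioned above omitted for brevity) of how terms can be mapped to the URLs containing them and how a query is answered by intersecting those sets.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to the set of URLs whose text contains it.

    `pages` is a dict {url: text}. Noise-word removal, stemming, and
    term weighting are omitted in this illustration.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for term in set(text.lower().split()):
            index[term].add(url)
    return index

def lookup(index, query):
    """Return the URLs containing every term of the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical toy collection, for illustration only.
pages = {
    "http://example.com/a": "web mining agents crawl the web",
    "http://example.com/b": "search engines index the web",
}
index = build_inverted_index(pages)
print(lookup(index, "web mining"))   # {'http://example.com/a'}
```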
The key idea of all search engines is that the index built by a crawl of the Web can be used many times, so that the cost of crawling and indexing is amortized over the many queries that are answered by a search engine by looking up the same static database. The two factors that differentiate among distinct search engines are the crawling and ranking algorithms. Crawling strategies determine which documents are indexed and ultimately affect the recall (fraction of relevant pages that are retrieved) and the recency (fraction of retrieved pages that are current) of a search engine. Ranking algorithms determine how the documents matching a query are sorted and ultimately affect the precision (fraction of retrieved pages that are relevant) of a search engine. For example, the recent success of the Google search engine is generally attributed to its ranking algorithm, which is based on weighted counts of links pointing to pages rather than on lexical similarity between pages and query [13].
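As a minimal illustration of these three quantities (my own sketch in Python, not code or data from the paper), the functions below compute precision, recall, and recency from a retrieved set, a known relevant set, and the subset of pages that are still current; the URL identifiers are hypothetical.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved pages that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of relevant pages that are retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def recency(retrieved, current):
    """Fraction of retrieved pages that are current (not stale or dead)."""
    return len(retrieved & current) / len(retrieved) if retrieved else 0.0

# Hypothetical example: 4 pages retrieved, 5 relevant overall, 3 still current.
retrieved = {"u1", "u2", "u3", "u4"}
relevant  = {"u1", "u2", "u5", "u6", "u7"}
current   = {"u1", "u3", "u4"}
print(precision(retrieved, relevant))  # 0.5
print(recall(retrieved, relevant))     # 0.4
print(recency(retrieved, current))     # 0.75
```

On the open Web the full relevant set is not available to the experimenter, which is why, as discussed later in the paper, estimated variants of such metrics are needed.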
While the indexing approach is ideal for static collections of documents, in the case of the World Wide Web the sets of relevant pages draw one important characteristic from their environment: they are highly dynamic. The static snapshots of the Web captured in search indexes, left to themselves, become less and less accurate and complete. They need to be updated periodically, by having the crawlers revisit the same documents to assess the changes that might have occurred since the last visit—in addition to looking for new URLs. Therefore, as any Web site administrator can easily verify by inspecting server logs, search engine crawlers impose a huge load on network resources, even if this is amortized over multiple queries.

Search engine users are often faced with very large hit lists, low recall, low precision, and low recency. Despite increasing computing power and network bandwidth, no search engine is able to index more than 16% of the Web, and it takes a search engine an average of 6 months to index a new document [17]. These problems are explained by the simple observation that the static "index snapshot" approach does not scale with the growing size and rate of change of the Web. This scalability issue is at the root of the discrepancies between the current Web content and its snapshots represented by search engines' indexes. Further, such discrepancies can only grow with the size of the Web, diminishing the expected level of performance of search engines and hindering their effectiveness as decision support systems for competitive intelligence (Web intelligence tools).

The scalability problem posed by the size and dynamic nature of the Web may be addressed by complementing index-based search engines with intelligent search agents at the user's end. In this paper, we discuss a type of agent-based system inspired by ecological and artificial life models. Such agents can browse online on behalf of the user, and evolve an intelligent behavior that exploits the Web's linkage and textual cues [21,23].

Despite its huge size, the Web forms a small-world network, characteristic of social and biological systems, in which any two documents on the Web are on average 19 clicks away from each other [1]. This result suggests that "smart" browsing agents that can distinguish the relevant links inside the pages might be able to find the desired information in a limited time. By using this approach, the relative load of the network may be switched from the blind crawlers employed by search engines to intelligent browsing agents that can reach beyond search engines and mine for new, up-to-date information. Online agents search only the current environment and therefore will not return stale information, and will have a better chance to improve the recency of the visited documents.

InfoSpiders [25,26] is the name of a multiagent model for online, dynamic information mining on the Web, which uses several artificial intelligence techniques to adapt to the characteristics of its networked information environment—heterogeneity, noise, decentralization, and dynamics. In this approach, each agent navigates from page to page following hypertext links, trying to locate new documents relevant to the user's query, having limited interactions with other agents. Agents mimic the intelligent browsing behavior of human users, being able to evaluate the relevance of a document's content with respect to the user's query, and to reason autonomously about future links to be followed. In order to improve the performance of the system, adaptation occurs at both individual and population levels, by evolutionary and reinforcement learning. The goal is to maintain diversity at a global level, trying to achieve a good coverage of all different aspects related to the query, while capturing the relevant features of the local information environment, i.e., the textual hints leading to relevant pages.

The idea of intelligent online crawlers such as InfoSpiders is to complement search engines in order to improve upon their precision, recall, and recency. Unfortunately, assessing whether one is successful with respect to this goal is a difficult task. Traditional measures used to evaluate the performance of information retrieval systems assume that (i) all systems have global knowledge of the information space in the form of an index, and (ii) the experimenter has knowledge of the relevant set of documents for each test query. Neither of these assumptions holds in the present case.

2. The InfoSpiders model

InfoSpiders are an evolutionary multiagent system in which each agent in a population of peers adapts to its local information environment by learning to estimate the value of hyperlinks, while the population as a whole attempts to cover all promising areas through selective reproduction. Fig. 1 shows the representation of each InfoSpiders agent. The agent interacts with the information environment that consists of the actual networked collection (the Web) plus information kept on local data structures.

The adaptive representation of InfoSpiders roughly consists of a list of keywords, initialized with the