
Decision Support Systems 28 (2000) 269–277
www.elsevier.com/locate/dsw

Intelligent searching agent based on hybrid simulated annealing

Christopher C. Yang a,*, Jerome Yen a, Hsinchun Chen b

a Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, People's Republic of China
b Department of Management Information Systems, University of Arizona, Tucson, AZ, USA

* Corresponding author. E-mail address: [email protected] (C.C. Yang).

Abstract

World-Wide Web (WWW) based Internet services have become a major channel for information delivery. For the same reason, information overload has also become a serious problem for the users of such services. It has been estimated that the amount of information stored on the Internet doubles every 18 months. The number of homepages may grow even faster; some estimate that it doubles every 6 months. Therefore, a scalable approach to support Internet searching is critical to the success of Internet services and other current or future National Information Infrastructure (NII) applications. In this paper, we discuss a modified version of the simulated annealing algorithm used to develop an intelligent personal spider (agent), which is based on automatic textual analysis of Internet documents and hybrid simulated annealing. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Information retrieval; Intelligent agent; Searching agent; Simulated annealing; World-Wide Web

1. Introduction

Information searching over cyberspace has become more and more important. It has been estimated that the amount of information stored on the Internet doubles every 18 months. The number of home pages grows even faster, doubling every 6 months or less, and in some areas, such as Hong Kong and Taiwan, the growth is faster still. Searching for the needed homepages or information has therefore become a challenge to the users of the Internet.

Developing searching engines or spiders that are ``intelligent'', that is, that reach high recall and high precision, has always been the dream of researchers in this area. To qualify as an agent or intelligent agent, a searching agent or spider must be able to make adjustments according to the progress of the search, or be personalized to adjust its behavior according to the users' preferences or behavior.

The major problem with current searching engines or spiders is that few spiders provide communication between the spider and the user who dispatched it. Without such communication, users find it difficult to trace or understand the progress of the search and have to tie themselves to their terminals.

This paper reports a searching engine that uses a combination of CGI and Java to build the user interface. It allows users to keep track of the progress of the search and to change the searching parameters, such as the number of homepages to retrieve. Several algorithms have been used to develop spiders, for example, best-first searching and genetic algorithms. In this paper, we discuss the spider that we developed with a hybrid simulated annealing algorithm, and we report a comparison of its performance with spiders developed with best-first search. Best-first search has been developed and reported in our earlier publications [3,4]. Here, we propose an approach based on automatic textual analysis of Internet documents and a hybrid simulated annealing based searching engine.

Although network protocols and Internet applications, such as HTTP, have significantly improved the efficiency and effectiveness of searching for and retrieving online information, users still cannot always explore and find what they want in cyberspace. As Internet services become popular, the difficulty of searching on the Internet is expected to get worse as the amount of on-line information increases, the number of Internet users (and thus traffic) increases, and more and more multimedia are used to develop home pages. This is the problem of information overload or information explosion.

Development of searching engines has become easier; for example, it is possible to download executable spider programs or even source code. However, it is difficult to develop spiders that have satisfactory performance or unique features, such as learning and collaboration.

There are two major approaches to developing searching engines: those based on keywords and huge index tables, for example, Alta Vista and Yahoo, and those based on hypertext linkages, for example, Explorer. It is difficult for keyword search to reach high precision and high recall. Slow response, due to the limitations of the indexing methodology and network traffic, and the inability of users to articulate their needs with the appropriate terms, always frustrate users.

Our approach, based on automatic textual analysis of Internet documents such as HTML files, aims to address the Internet searching problem by creating an intelligent personal spider (agent) based on the hybrid simulated annealing algorithm.

In Section 2, a short literature review is provided. Section 3 discusses the architecture and algorithms for building our searching spider. Section 4 reports the experiments that we conducted to compare its performance with other searching spiders. The paper concludes with some comments about our spider and other factors that affect its performance.

2. Literature review: machine learning, intelligent agents

2.1. Machine learning

Research on intelligent Internet/Intranet searching relies significantly on machine learning. Neural networks, symbolic learning algorithms, and genetic algorithms are three major approaches in machine learning.

Neural networks model computation in terms of complex topologies and statistics-based error correction algorithms, which fit well with conventional information retrieval models such as the vector space model and the probabilistic model. Doszkocs et al. [7] give an overview of connectionist models for information retrieval. Belew [1] developed a three-layer neural network of authors, index terms, and documents, using relevance feedback from users to change its representation over time. Kwok developed a similar three-layer network using a modified Hebbian learning rule. Lin et al. [11] adopted the Kohonen feature map to produce a two-dimensional grid representation for N-dimensional features.

Symbolic learning algorithms are based on production rule and decision tree knowledge representations. Fuhr et al. [8] adopted regression methods and ID3 for a feature-based automatic indexing technique. Chen and She adopted ID3 and the incremental ID5R algorithm for constructing decision trees of important keywords that represent users' queries.

Genetic algorithms are based on evolution and heredity. Gordon [9] presented a genetic algorithm based approach for document indexing. Chen and Kim [5] presented a GA-neural-network hybrid system for concept optimization. Our hybrid simulated annealing approach is similar to genetic algorithms in producing new generations; however, the selection is stochastic based instead of probability based.

2.2. Intelligent internet searching agent

There are two major approaches to Internet searching: (1) client-based searching agents, and (2) online database indexing and searching. Some systems combine both approaches.

A client-based searching agent on the Internet is a program that operates autonomously to search for relevant information without direct human supervision. Several such programs have been developed; tueMosaic and the WebCrawler are two prominent examples, both using best first search techniques.

DeBra and Post [6] reported tueMosaic v2.42, modified at the Eindhoven University of Technology (TUE) using the Fish Search algorithm, at the First WWW Conference in Geneva. Using tueMosaic, users can enter keywords, specify the depth and width of search for links contained in the homepage currently displayed, and request the spider agent to fetch homepages connected to the current homepage. The Fish Search algorithm is a modified best first search: each URL corresponds to a fish, and after the document is retrieved, the fish spawns a number of children (URLs). These URLs are produced depending on whether they are relevant and on how many URLs are embedded; a URL is removed if no relevant documents are found after following several links. Searches are conducted by keywords, regular expressions, or relevancy ranking with external filters. However, potentially relevant homepages that are not connected with the current homepage cannot be retrieved, and the search space becomes enormous when the depth and breadth of search become large (an exponential search). The inefficiency and local search characteristics of the BFS/DFS-based spiders and the communication bandwidth bottleneck on the Internet severely constrain the usefulness of such an agent approach.

At the Second WWW Conference, Pinkerton [14] reported a more efficient spider (crawler). The WebCrawler extends tueMosaic's concept: it initiates the search using its index and follows links in an intelligent order. It first appeared in April of 1994 and was purchased by America Online in January of 1995. The WebCrawler evaluates the relevance of a link based on the similarity of the anchor text to the user's query. Anchor texts are the words that describe a link to another document; they are usually short and do not provide as much relevance information as full document texts. Moreover, the problems of local search and the communication bottleneck persist. A more efficient and global Internet search algorithm is needed to improve client-based searching agents.

The TkWWW robot was developed by Scott Spetka and was funded by the Air Force Rome Laboratory [15]. TkWWW robots are dispatched from the TkWWW browser. The robot is designed to search Web neighborhoods to find logically related homepages and return a list of links that looks like a hot list. However, the search is limited to one or two hops, or links, from the original homepages. TkWWW robots can also be run in the background to build HTML indexes, compile WWW statistics, collect a portfolio of pictures, or perform any other function that can be described by the TkWWW Tcl extensions. The major advantage of TkWWW robots over other existing spiders is their flexibility in adapting to virtually any criteria possible to guide the search path and control the selection of data for retrieval.

WebAnts, developed by Leavitt at Carnegie Mellon University, investigates the distribution of information collection tasks to a number of cooperating processors. The goal of WebAnts is to create cooperating explorers (ants) that share the searching results and the indexing load without repeating each other's effort. When an ant finds a document that satisfies the search criteria, it shares it with the other ants so that they will not examine the same document. For indexing, cooperation among ants allows each indexer to conserve resources by distributing the indexing load between different explorers. During query, the user can restrict the query to the local ant or allow it to propagate to the entire colony.

The RBSE (Repository Based Software Engineering) spider was developed by David Eichmann and was funded by NASA. The RBSE spider was the first spider to index documents by content. It uses the Mite program to fetch documents and uses four searching mechanisms: (i) breadth first search from a given URL, (ii) limited depth first search from a given URL, (iii) breadth first search from unvisited URLs in the database, and (iv) limited depth first search from unvisited URLs in the database. The database of the RBSE spider consists of three tables: the Spider Table keeps track of URLs that have been visited, the Leaf Table contains node documents, which have no links, and the Avoid Table contains a permanent list of patterns to match documents against for visit suppression.

An alternative approach to Internet resource discovery is based on the database concept of indexing and keyword searching. Such systems retrieve entire Web documents, or parts of these documents, and store them on the host server. This information is then indexed on the host server to provide a server-based replica of the information on the Web. The index is used to search for Web documents that contain information relevant to a user's query and to point the user to those documents.

The World Wide Web Worm (WWWW) [12], developed by McBryan, uses crawlers to collect Internet homepages and was made available in March of 1994. It was an early ancestor of the newer species of spiders on the Web today. After homepages are collected, the system indexes their titles, subject headings, and anchor texts and provides a ``pure'' grep-like routine for database keyword searching. However, it does not index the content of the documents, and the title, subject headings, and anchor texts do not always describe a document's information accurately. Moreover, WWWW does not build a large collection of relevant documents in the database, due to the limitations of its crawlers, and its grep-like keyword searching capability is very limited.

AliWeb [10], developed by Koster, adopts an owner-registration approach to collecting Internet homepages. Homepage owners write descriptions of their services in a standard format to register with AliWeb, and AliWeb regularly retrieves all homepages according to its registration database. AliWeb alleviates the spider traversal overhead. However, the manual registration approach places a special burden on homepage owners and thus is unpopular; the AliWeb database is small and incomplete for realistic searches.

The Harvest information discovery and access system [2], developed by Bowman et al., represents a big leap forward in Internet searching technology. The Harvest architecture consists of five subsystems: Gatherer, Broker, Index/Search subsystem, Object Cache, and Replicator. The Gatherer collects indexing information and is designed to run at the service provider's site, which saves a great deal of server load and network traffic. The Gatherer also feeds information to many Brokers, which saves repeated gathering cost.

3. Intelligent agent for internet searching

Our intelligent Internet agent is based on automatic indexing and hybrid simulated annealing.

3.1. Indexing
The goal of indexing is to identify the content of an Internet document. The method includes three procedures: (a) word identification, (b) stop-wording, and (c) term-phrase formation.

Words are retrieved by word identification and stop-wording. Words are first identified by ignoring punctuation and case. Words on a stop list, which includes about 1000 common function words and ``pure'' verbs, are then deleted. The common function words are non-semantic-bearing words, such as on, in, at, this, there, etc. The ``pure'' verbs are words that are verbs only, such as calculate, articulate, teach, listen, etc. High frequency words that are too general to be useful in representing document content are also deleted.

After identifying words and removing stop words, adjacent words remaining in the document are used to form phrases. Phrases are limited to three words. The resulting phrases and words are referred to as the keywords of the Internet document. The occurrence pattern of the indexes that appear across the documents is identified. A sketch of these three steps is given below.
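The following Python sketch illustrates the three indexing steps. It is our own minimal illustration, not the paper's implementation: the stop-word list is a tiny stand-in for the roughly 1000-entry list described above, and phrases are formed from the filtered word sequence.

```python
# A minimal sketch of (a) word identification, (b) stop-wording,
# and (c) term-phrase formation.
import re

# Illustrative subset only; the paper uses ~1000 function words and verbs.
STOP_WORDS = {"on", "in", "at", "this", "there", "a", "an", "the", "to",
              "calculate", "articulate", "teach", "listen"}

def identify_words(text):
    """(a) Word identification: ignore punctuation and case."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(words):
    """(b) Stop-wording: drop common function words and 'pure' verbs."""
    return [w for w in words if w not in STOP_WORDS]

def form_phrases(words, max_len=3):
    """(c) Term-phrase formation: adjacent surviving words form phrases
    of up to three words; phrases and words together are the keywords."""
    keywords = set(words)
    for n in (2, 3):
        for i in range(len(words) - n + 1):
            keywords.add(" ".join(words[i:i + n]))
    return keywords

text = "Teach the spider to search homepages on the Internet"
print(sorted(form_phrases(remove_stop_words(identify_words(text)))))
```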
We use the Jaccard's score to measure the similarity of Internet documents. The score is computed in terms of the documents' common links and indexing; a document with a higher Jaccard's score has a higher fitness with the input document.

The Jaccard's score based on links is computed by dividing the number of common links by the total number of links of the two documents. Given two documents, x and y, and their links, X = {x_1, x_2, ..., x_m} and Y = {y_1, y_2, ..., y_n}, the Jaccard's score between x and y based on links is:

JS_{links}(x, y) = \frac{\#(X \cap Y)}{\#(X \cup Y)} \qquad (1)

The Jaccard's score based on indexing is computed in terms of term frequency and document frequency. The term frequency, tf_{xj}, is the number of occurrences of term j in document x. The document frequency, df_j, is the number of documents in a collection of N documents in which term j occurs. The combined weight of term j in document x, d_{xj}, is:

d_{xj} = tf_{xj} \times \log\left( \frac{N}{df_j} \times w_j \right) \qquad (2)

where w_j is the number of words in term j. The Jaccard's score between documents x and y based on indexing is then computed as:

JS_{index}(x, y) = \frac{\sum_{j=1}^{L} d_{xj} d_{yj}}{\sum_{j=1}^{L} d_{xj}^2 + \sum_{j=1}^{L} d_{yj}^2 - \sum_{j=1}^{L} d_{xj} d_{yj}} \qquad (3)

where L is the total number of terms.
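A sketch of the two similarity measures follows, assuming each document is represented by its link set and a {term: weight} dictionary of the d_{xj} weights of Eq. (2); the function names are ours, not the paper's.

```python
# Sketch of the Jaccard's scores of Eqs. (1)-(3).
import math

def js_links(links_x, links_y):
    """Eq. (1): |X intersect Y| / |X union Y| over the two link sets."""
    x, y = set(links_x), set(links_y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def term_weight(tf_xj, df_j, n_docs, w_j):
    """Eq. (2): d_xj = tf_xj * log((N / df_j) * w_j),
    where w_j is the number of words in term j."""
    return tf_xj * math.log((n_docs / df_j) * w_j)

def js_index(d_x, d_y):
    """Eq. (3): sum(d_xj*d_yj) / (sum d_xj^2 + sum d_yj^2 - sum d_xj*d_yj),
    taking d_x and d_y as {term: weight} dictionaries."""
    terms = set(d_x) | set(d_y)
    cross = sum(d_x.get(t, 0.0) * d_y.get(t, 0.0) for t in terms)
    denom = (sum(v * v for v in d_x.values())
             + sum(v * v for v in d_y.values()) - cross)
    return cross / denom if denom else 0.0
```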

3.2. Searching algorithms

In this section, we present two searching algorithms: best first search and our proposed algorithm, hybrid simulated annealing.

3.2.1. Best first search

Best first search is a state space search algorithm [13]. We explore the best homepage at each iteration, where the best homepage is the one with the highest Jaccard's score. The iteration terminates when the number of homepages required by the user has been obtained. The algorithm is summarized as follows, illustrated in Fig. 1, and sketched in code below.

• Initialization: Initialize the counter k to 0. Obtain a set of anchor homepages, A = (A_1, A_2, ...), and the number of desired homepages, D, from the users. Fetch the anchor homepages and save their links in the unexplored set of homepages, H = (H_1, H_2, H_3, ...).
• Find the best homepage: Fetch the unexplored homepages and compute their Jaccard's scores. The homepage with the highest Jaccard's score is the best homepage; it is saved in the output set of homepages.
• Explore the best homepage: Fetch the best homepage determined in the last step and insert its links into the unexplored set of homepages. Increment the counter by 1.
• Repeat steps 2 and 3 until k = D.

Fig. 1. Best first search.
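Below is a minimal sketch of these steps. The helpers fetch_links and jaccard_score are assumptions of ours: the first returns the outgoing links of a page, and the second a page's combined Jaccard's score against the anchor set, computed as in Section 3.1.

```python
# Sketch of the best first search steps above.

def best_first_search(anchors, d, fetch_links, jaccard_score):
    """Return d homepages, always exploring the highest-scoring page next."""
    unexplored = set()                      # H: links of the anchor pages
    for a in anchors:
        unexplored.update(fetch_links(a))
    output, k = [], 0                       # k counts explored pages
    while k < d and unexplored:
        # Find the best homepage: the highest Jaccard's score wins.
        best = max(unexplored, key=jaccard_score)
        unexplored.remove(best)
        output.append(best)
        # Explore it: its links join the unexplored set.
        unexplored.update(fetch_links(best))
        k += 1
    return output
```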

3.2.2. Hybrid simulated annealing

The simulated annealing algorithm is based on the analogy between the simulation of the annealing of solids and the problem of solving large combinatorial optimization problems. A high temperature is set at the first iteration, and a set of documents (a configuration) is generated at each iteration. We use the Jaccard's score as the cost function to evaluate the current configuration. If its score is higher than that of the previous configuration, the new configuration is accepted; otherwise, it is accepted with a probability computed in terms of the current temperature and the Jaccard's score. The temperature is decreased at the beginning of each iteration. The process terminates when the change of the Jaccard's score between two consecutive iterations is less than a threshold, or when the temperature reaches a certain value.

A sketch of the hybrid simulated annealing is presented below and illustrated in Fig. 2.

• Initialization: The search engine takes a set of documents (homepages with their URLs), input_1, input_2, ..., from the users and saves them into a set CurrentConfiguration, CC = {cc_1, cc_2, ...}. The temperature, T, is initialized to a large value.

• Generation of a new configuration: Fetch the documents linked by the documents in CurrentConfiguration. Compare the linked documents of these fetched documents with the linked documents of the documents in CurrentConfiguration. The fetched documents that have the highest number of overlapping linked documents with the documents in CurrentConfiguration are saved in a set, A = {a_1, a_2, ...}. Documents are also obtained from the documents ranked by SWISH in the user-selected category in our database and saved in a set, B = {b_1, b_2, ...}. The sizes of A and B are both half the size of CurrentConfiguration, and the size of CurrentConfiguration is double the number of requested output documents.

Compute the Jaccard's score for each document in CurrentConfiguration as follows (a code sketch is given below):

JS_{links}(cc_i) = \frac{1}{N} \sum_{j=1}^{N} JS_{links}(input_j, cc_i)

JS_{index}(cc_i) = \frac{1}{N} \sum_{j=1}^{N} JS_{index}(input_j, cc_i)

JS(cc_i) = \frac{1}{2} \left( JS_{links}(cc_i) + JS_{index}(cc_i) \right)
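As a sketch, the per-document score can be computed as below, with js_links and js_index the pairwise scores of Eqs. (1) and (3), and inputs the user's input homepages; the function name is ours.

```python
# Sketch of the configuration-scoring formulas above.

def config_score(cc_i, inputs, js_links, js_index):
    """JS(cc_i): average link score and average index score over the N
    input homepages, then the mean of the two."""
    n = len(inputs)
    links = sum(js_links(inp, cc_i) for inp in inputs) / n
    index = sum(js_index(inp, cc_i) for inp in inputs) / n
    return 0.5 * (links + index)
```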

Compute the average Jaccard's score, Avg, of the documents in CurrentConfiguration:

Avg = \frac{1}{M} \sum_{i=1}^{M} JS(cc_i) \qquad (4)

For each document x in A and B: if the Jaccard's score of the document is higher than Avg, save the document in CurrentConfiguration. Otherwise, if

e^{(JS(x) - Avg)/T} \geq rand

where rand is a random value between 0 and 1, save the document in CurrentConfiguration.

Decrease the temperature, T, and repeat the same procedure until T reaches a certain value or the change of Avg is less than a threshold (Fig. 2).

Fig. 2. Hybrid simulated annealing.
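A compressed sketch of the annealing loop follows, under our own assumptions: propose(config) stands in for the generation step (the union of the sets A and B), js for the combined score JS, and we approximate the paper's fixed-size CurrentConfiguration by trimming to the strongest pages after each acceptance pass.

```python
# Sketch of the hybrid simulated annealing loop and Eq. (4) acceptance rule.
import math
import random

def anneal_step(config, candidates, js, temperature):
    """One iteration: a candidate is kept if its score beats Avg (Eq. (4)),
    or with probability exp((JS(x) - Avg) / T) >= rand otherwise."""
    avg = sum(js(c) for c in config) / len(config)        # Eq. (4)
    kept = list(config)
    for x in candidates:
        if js(x) > avg or math.exp((js(x) - avg) / temperature) >= random.random():
            kept.append(x)
    # The paper holds CurrentConfiguration at twice the requested output
    # size; trimming to the highest-scoring pages is our stand-in for that.
    kept.sort(key=js, reverse=True)
    return kept[:len(config)], avg

def hybrid_sa(config, propose, js, t=10.0, t_min=0.1, cooling=0.9, eps=1e-4):
    """Cool until T reaches t_min or Avg changes by less than eps."""
    prev_avg = float("-inf")
    while t > t_min:
        t *= cooling                       # decrease T at the start of each pass
        config, avg = anneal_step(config, propose(config), js, t)
        if abs(avg - prev_avg) < eps:
            break
        prev_avg = avg
    return config
```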


4. Experimental result

To examine the quality of the results obtained by the hybrid simulated annealing algorithm, we conducted an experiment comparing the performance of the simulated annealing spider with spiders built with best-first search.

In our experiment, 40 cases were set up. For each case, 1 to 4 input homepages were submitted to the spiders based on hybrid simulated annealing and best-first search, and ten output homepages were obtained as the result of searching. Homepages were chosen in the entertainment domain. The average of the output homepages' Jaccard's scores for each case was recorded to compare their fitness. In the experiment, we also recorded whether each output URL originated from the database; this indicates the percentage of the output URLs that are contributed by the database in the global search.

Table 1 shows the statistics of the fitness of the results obtained by hybrid simulated annealing and best-first search on the 40 cases. The results show that the output homepages obtained by hybrid simulated annealing have a slightly higher fitness score than those obtained by best-first search, but the difference is not significant. The averages of the 40 average Jaccard's scores for hybrid simulated annealing and best-first search are 0.08703 and 0.04138, respectively.

Table 1
The statistics of the average Jaccard's scores obtained from 40 cases of searching by hybrid simulated annealing and best-first search

                      Hybrid simulated annealing    Best first search
Mean                  0.08703                       0.04138
Standard deviation    0.08519                       0.04996

A user evaluation was also conducted, with 8 subjects evaluating the relevance of each case. It shows that higher precision and recall are obtained compared with best-first search (Tables 2 and 3).

Table 2
Precision statistics of user evaluation

Number of input homepages         Hybrid simulated annealing    Best first search
1    mean                         0.75                          0.70
     standard deviation           0.061                         0.073
2    mean                         0.75                          0.69
     standard deviation           0.057                         0.070
3    mean                         0.77                          0.72
     standard deviation           0.055                         0.071
4    mean                         0.80                          0.73
     standard deviation           0.060                         0.065

Table 3
Recall statistics of user evaluation

Number of input homepages         Hybrid simulated annealing    Best first search
1    mean                         0.52                          0.50
     standard deviation           0.024                         0.029
2    mean                         0.54                          0.51
     standard deviation           0.022                         0.032
3    mean                         0.57                          0.54
     standard deviation           0.025                         0.025
4    mean                         0.61                          0.57
     standard deviation           0.030                         0.029

Although the Jaccard's score does not show a significant difference between the performance of hybrid simulated annealing and best-first search, 50% of the homepages obtained from hybrid simulated annealing did not originate from the input homepages. The homepages obtained from our database (homepages in the set B) are most probably not linked to the input homepages; therefore, most of them would never be obtained by best-first search. In particular, when the number of links of the input URLs is restricted, the result of best-first search is poor, and best-first search may not be able to provide as many URLs as the users request. The hybrid simulated annealing does not have this problem because it performs a global search. In the situation of restricted links of the input URLs, the hybrid simulated annealing still performs very well, and a significant difference between their performances is observed.

The results also show that the Jaccard's score increases as the number of input homepages increases. The information provided to the agent grows with the number of input homepages; given more information, the searching agent is able to understand more about the users, such as their interests and expectations. This is also the reason why the intelligent searching agent performs better than a traditional search engine, which takes only keywords as input.

Fig. 3 shows the homepage that displays the final result of the search. It displays the total average fitness (based on the Jaccard's score) and the number of homepages visited in the search. For each homepage, the address, score, title, and matched keywords are also displayed.

Fig. 3. The homepage of a searching result.

5. Conclusion

We have presented a hybrid simulated annealing based intelligent Internet agent for searching for relevant homepages on the World-Wide Web. It supports a global search on the Web: the spider obtains a set of homepages from users and searches for the most relevant homepages, operating autonomously without any human supervision. We conducted an experiment showing that a higher fitness score is obtained by the hybrid simulated annealing compared with best-first search. From the user evaluation, we also find that higher precision and recall are obtained.

References

[1] R.K. Belew, Adaptive information retrieval, in: Proceedings of the Twelfth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Cambridge, MA, June 22–28, 1989.
From user evaluation, hybrid simulated annealing and best-first search, 50% we also find that higher precision and recall is of the homepages that obtained from hybrid simu- obtained. lated annealing are not originated from the input homepages. These homepages that obtained from our databaseŽ. homepages in the set B are most probably References not linked to the input homepages, therefore, most of them will never be obtained by best-first search. In wx1 R.K. Belew, Adaptive information retrieval, in: Proceedings particular, when the number of links of the input of the Twelfth Annual International ACMrSIGIR Confer- C.C. Yang et al.rDecision Support Systems 28() 2000 269±277 277

[2] C.M. Bowman, P.B. Danzig, U. Manber, F. Schwartz, Scalable internet resource discovery: research problems and approaches, Communications of the ACM 37 (8) (1994) 98–107, August.
[3] H. Chen, Y. Chung, M. Ramsey, C.C. Yang, A smart itsy bitsy spider for the Web, Journal of the American Society for Information Science 49 (7) (1998) 604–618.
[4] H. Chen, Y. Chung, M. Ramsey, C.C. Yang, An intelligent personal spider (agent) for dynamic Internet/Intranet searching, Decision Support Systems 23 (1) (1998) 41–58.
[5] H. Chen, J. Kim, GANNET: a machine learning approach to document retrieval, Journal of Management Information Systems 11 (3) (1995) 7–14, Winter.
[6] P. DeBra, R. Post, Information retrieval in the World-Wide Web: making client-based searching feasible, in: Proceedings of the First International World Wide Web Conference, Geneva, Switzerland, 1994.
[7] T.E. Doszkocs, J. Reggia, X. Lin, Connectionist models and information retrieval, Annual Review of Information Science and Technology (ARIST) 25 (1990) 209–260.
[8] N. Fuhr, S. Hartmann, G. Knorz, G. Lustig, M. Schwantner, K. Tzeras, AIR/X - a rule-based multistage indexing system for large subject fields, in: Proceedings of the Eighth National Conference on Artificial Intelligence, Boston, MA, 1990.
[9] M. Gordon, Probabilistic and genetic algorithms for document retrieval, Communications of the ACM 31 (10) (1988) 1208–1218, October.
[10] M. Koster, ALIWEB: Archie-like indexing in the Web, in: Proceedings of the First International World Wide Web Conference '94, Geneva, Switzerland, 1994.
[11] X. Lin, D. Soergel, G. Marchionini, A self-organizing semantic map for information retrieval, in: Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Chicago, IL, October 13–16, 1991.
[12] O. McBryan, GENVL and WWW: tools for taming the web, in: Proceedings of the First International World Wide Web Conference '94, Geneva, Switzerland, 1994.
[13] J. Pearl, Heuristics: Intelligent Search Strategies for Computer Problem Solving, Addison-Wesley Publishing, Reading, MA, 1984.
[14] B. Pinkerton, Finding what people want: experiences with the WebCrawler, in: Proceedings of the Second International World Wide Web Conference, Chicago, IL, October 17–20, 1994.
[15] S. Spetka, The TkWWW robot: beyond browsing, in: Proceedings of the Second World Wide Web Conference '94: Mosaic and the Web, October 17–20, 1994.

Christopher C. Yang is an assistant professor in the Department of Systems Engineering and Engineering Management at the Chinese University of Hong Kong, starting from 2000. He was an assistant professor in the Department of Computer Science and Information Systems and associate director of the Authorized Academic Java Campus at the University of Hong Kong from 1997 to 1999. He received his BS, MS, and PhD in Electrical Engineering from the University of Arizona, Tucson, AZ, in 1990, 1992, and 1997, respectively. From 1995 to 1997, he was a research scientist in the Artificial Intelligence Laboratory in the Department of Management Information Systems, where he was active in the Illinois Digital Library project during the Digital Library Initiative I. His current research interests are digital library, cross-lingual information retrieval, Internet agents, information visualization, and color image retrieval. He was the program co-chair of the First Asia Digital Library Workshop. In 1998 and 1999, he was an invited panelist of the NSF Digital Library Initiative II Review Panel.

Jerome Yen is an associate professor of the Department of Systems Engineering and Engineering Management at the Chinese University of Hong Kong and also an adjunct associate professor of the School of Computer Science at Carnegie Mellon University. He received his PhD degree in Systems Engineering and Management Information Systems from the University of Arizona in 1992. His current research interests include digital library, information economics, information retrieval, intelligent agents, and financial engineering. He has published more than twenty journal articles in international journals, such as Journal of Management Information Systems, IEEE Computer, JASIS, Journal of Mathematical Economics, and International Review of Economics and Finance. He is the recipient of the best paper award at the 28th Hawaii International Conference on System Sciences. He is the Program Chair of the First Asia Digital Library Workshop.

Hsinchun Chen is a Professor of Management Information Systems at the University of Arizona and head of the UA/MIS Artificial Intelligence Group. He is also a Visiting Senior Research Scientist at the National Center for Supercomputing Applications (NCSA). He is a PI of the Illinois Digital Library Initiative project, funded by NSF/ARPA/NASA, 1994–1998, and has received several grants from NSF, DARPA, NASA, NIH, and NCSA. He is the guest editor of the IEEE Computer special issue on ``Building Large-Scale Digital Libraries'' and the Journal of the American Society for Information Science special issue on ``Artificial Intelligence Techniques for Emerging Information Systems Applications''.