
E-Business’s Page Ranking with Ant Colony Algorithm

Asst. Prof. Chonawat Srisa-an, Ph.D.

Faculty of Information Technology, Rangsit University, 52/347 Phaholyothin Rd., Lak Hok, Pathumthani 12000, Thailand
[email protected], [email protected]

Abstract

Almost all E-Business transactions use classical keyword-based methods for searching commercial goods. One of the main problems that plague modern search engines is poor relevance: in many situations a search engine retrieves thousands of pages that at least partially satisfy the input query. Human attention, of course, remains more or less constant, and most users can cope with only about 100-200 retrieved documents. The PageRank index method has been used successfully for sorting search engine output; two such experimental search engines have been constructed at Stanford University [1]. There appears to be a high correlation between a high PageRank index and the general importance of a page as judged by human users. This is especially striking for very general queries, for which many relevant pages exist on the Internet. This work proposes a page-ranking method, the Ant Colony Ranking (ACR) algorithm, and investigates it in the context of such self-modifying systems in order to solve two drawbacks of the classical method. The experiments show that the algorithm has good convergence properties over the hypertext structure of the Internet.

Keywords: web, text mining, information retrieval, search engines, Expert Systems and AI in e-Business

1. Introduction

In many situations, search engines retrieve thousands of pages that at least partially satisfy the input query, and in the E-business world people are overwhelmed by such non-relevant pages every day. PageRank succeeds in these situations by separating highly respected sites from junk pages that merely happen to contain words from the query. PageRank is therefore a method that allows us to approximate a page's importance for the user, regardless of its relevance to the query as measured by classical methods. Having such an approximation, in fact a ranking score, we can either sort the pages, presenting those with higher rank first, or even entirely remove those with very poor rank, thus allowing the user to concentrate on a set of pages of reasonable size.

This classical method has two drawbacks. First, it requires access to (ideally) the entire Web structure to perform a proper computation. This is possible only in large systems that maintain entire Web hyperlink databases [4], such as general-purpose search engines and web crawlers. This line of research emphasizes Web topology and the structural dependencies between links and documents, without examining the contents of hypertext documents; it shows that the structure of the Web itself carries a great deal of useful information that should be retrieved.

Second, it is difficult to determine how many times the calculation must be repeated for big networks. For a network as large as the World Wide Web [2], it can take many millions of iterations. The "damping factor" is quite subtle: if it is too high, the numbers take ages to settle; if it is too low, the values repeatedly overshoot, both above and below the average, swinging about it like a pendulum and never settling down. Choosing the order of the calculations can also help: the answer always comes out the same no matter which order is chosen, but some orders reach it faster than others.

The goal of this paper is to investigate the Ant Colony Ranking (ACR) algorithm in the context of such self-modifying systems in order to overcome these two drawbacks of the classical ranking method. The ACR software constructs a web structure and computes a ranking score for each page.

2. Standard PageRank Formula

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. C(A) is defined as the number of links going out of page A. The PageRank of page A is given as follows:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

The terms of this formula can be read as follows:

1. PR(Tn): Each page has a notion of its own self-importance, "PR(T1)" for the first page in the web all the way up to "PR(Tn)" for the last page.
2. C(Tn): Each page spreads its vote evenly among all of its outgoing links. The count of outgoing links is "C(T1)" for page 1, "C(Tn)" for page n, and so on for all pages.
3. PR(Tn)/C(Tn): If our page (page A) has a backlink from page n, the share of the vote that page A receives is "PR(Tn)/C(Tn)".
4. d(...): All these fractions of votes are added together, but, to stop the other pages from having too much influence, the total vote is "damped down" by multiplying it by 0.85 (the factor d).
5. (1 - d): The (1 - d) term at the beginning is a probability correction that keeps the sum of all web pages' PageRanks equal to one: it adds back the portion removed by the damping in d(...). It also means that a page with no backlinks still receives a small PR of 0.15 (i.e., 1 - 0.85).

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.
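As a concrete illustration of this iterative algorithm, the following Python sketch applies the PR(A) formula repeatedly until the scores settle. It is a minimal sketch, not the implementation used in this paper; the pagerank function and the four-page example graph are assumptions introduced here.

def pagerank(outlinks, d=0.85, iterations=50):
    """outlinks maps each page to the list of pages it links to."""
    pages = list(outlinks)
    pr = {p: 1.0 for p in pages}              # any starting value converges
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum the vote shares PR(T)/C(T) over every backlink T of this page.
            votes = sum(pr[t] / len(outlinks[t])
                        for t in pages if page in outlinks[t])
            new_pr[page] = (1 - d) + d * votes  # PR(A) = (1-d) + d * (sum)
        pr = new_pr
    return pr

# Hypothetical graph: A -> B, B -> A and C, C -> A, D -> A.
links = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A"]}
print(pagerank(links))

In this sketch, page D has no backlinks and settles at 1 - d = 0.15, matching point 5 above; larger values of d slow convergence, which is the damping-factor trade-off described in the Introduction.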

3. Ant Colony Ranking Algorithm

The Ant Colony Ranking (ACR) algorithm is a system based on agents that simulate the natural behavior of ants, including their mechanisms of cooperation and adaptation. The ACR algorithm is based on the following rules:

• Each path followed by an ant is associated with a candidate solution for a given problem.
• When an ant follows a path, the amount of pheromone deposited on that path is proportional to the quality of the corresponding candidate solution for the target problem. Artificial ants have a probabilistic preference for paths with a larger amount of pheromone. Two kinds of pheromone, forward pheromone and backward pheromone, serve different purposes.
• Each destination has food available for only one artificial ant; therefore, each node needs to be visited only once. When an ant has to choose between two or more paths, the path(s) with a larger amount of pheromone have a greater probability of being chosen.
• An appropriate representation of the problem allows the ants to incrementally construct and modify solutions through a probabilistic transition rule, based on the amount of pheromone in the trail (a sketch follows at the end of this section).
• Artificial ants can be thought of as having two working modes, forward and backward. They are in forward mode when moving from the nest toward the food, and in backward mode when moving from the food back to their nest. Once an artificial ant in forward mode reaches its destination, it switches to backward mode, travels back to the source along the same route it previously used, and switches to its backward pheromone.
• A rule for pheromone updating specifies how to modify the pheromone trail (τ). Artificial ants use the pheromone trail to guide their route. Assume that food is available for only one ant; once an ant reaches the destination (food), it eliminates its pheromone on the way back to its nest to prevent loops.
• Artificial ants implement loop elimination by eliminating all forward pheromone and depositing backward pheromone to set up a path. The permanent path is created by "backward pheromone".
• Each ant following a forward pheromone has to give up its search once it finds backward pheromone on the path.
• In the ACR system, the ants memorize the nodes they visit during the forward path and exchange their journey knowledge with the others once they reach their nest. The permanent path is created by backward pheromone and is not explored again. At this point, information about the incoming and outgoing links of each node is stored in a database.

Note: For clustering purposes, there may be many nests in the ACR system. The ACR software reduces the size of the problem by clustering methods that decompose large problems into smaller ones.
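The probabilistic transition rule referenced above can be made concrete with a short sketch. The paper does not give an explicit formula, so the roulette-wheel selection below, where the probability of choosing an edge is simply proportional to its pheromone amount, is an assumption of this illustration, as are the node names and pheromone values.

import random

def choose_next(pheromone):
    """pheromone maps candidate next nodes to their pheromone amounts."""
    total = sum(pheromone.values())
    r = random.uniform(0, total)
    cumulative = 0.0
    for node, amount in pheromone.items():
        cumulative += amount
        if r <= cumulative:
            return node
    return node  # guard against floating-point round-off

# "page_b" carries twice the pheromone, so it is chosen about half the time.
print(choose_next({"page_b": 2.0, "page_c": 1.0, "page_d": 1.0}))

An ant in forward mode would apply this rule at every node; finding backward pheromone on the chosen path would instead terminate its search, per the rules above.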


4. Ant Colony Ranking Software

In order to solve the Web structure problem, this research replaces the web crawler with artificial ants for the path construction phase and uses web log analysis for a better understanding of reference pages. Because some reference paths are hard to detect in the WWW environment, since any page can point to any other page at any time, the web log is used to improve results but is not the main part of the system.

The ACR software creates enough "artificial ants" to explore the Web structure. All artificial ants must return to their nest once they reach food (a destination). Each ant detects and tags links in the source code of a web page and starts its journey to construct a permanent path. Once it arrives home, the ACR software destroys the ant and stores its knowledge in a database. (The case of a heavily changing topology is outside the scope of this paper.)

A page can have a high index if it is pointed to by many pages, or if it has relatively few backlinks that come from pages with high index values. This index works in hypertext environments that can be represented by either non-cyclic or heavily interconnected graphs. Unfortunately, the structure of the World Wide Web is different, and looping situations are not uncommon; therefore, the artificial ants implement loop elimination by depositing backward pheromone to set up a path. The permanent path is created by the "backward pheromone". Moreover, the normalized index can be interpreted as a probability measure of a certain page being viewed by the so-called "random surfer", who randomly follows hyperlinks between web documents.

Web server logs can be analyzed for information about the popularity of certain web pages among surfers. Such data can help searching and browsing processes in a variety of ways. In self-modifying hypertext environments, hyperlinks have been devised as a means of facilitating navigation among huge numbers of documents. User navigation patterns and habits, however, vary, and since the hyperlink structure is determined only by the Web service creators, it can sometimes be ineffective and lead to confusion. There is even a special name for this hyperlink-structure inefficiency, one that predates the World Wide Web: the "lost in hyperspace" phenomenon. The ACR algorithm reduces this problem because each ant memorizes the nodes it visits during the forward path and exchanges its journey knowledge with the others once it reaches its nest. The permanent path is created by backward pheromone and is not explored again. With this knowledge and web log analysis, a clear topology can be constructed.

In the case of a large system, the ACR software reduces the size of the problem by clustering methods that decompose it into smaller ones. Separating a large number of pages into many nests helps an artificial ant quickly identify which groups are relevant and important to it. Each artificial ant lives in one nest. Once all paths are explored, a substructure is created; composing the substructures into one big web structure is the task of the ACR software. Once an ant reaches its nest, the ACR software destroys the ant and stores its knowledge in a database. The incoming and outgoing link information of each node is stored in the database for the ranking computation using the standard formula above. Once all artificial ants arrive home, the whole structure is built, and the page rank calculation can be computed from the information stored in the database.
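The life cycle described above, forward exploration, the switch to backward mode at the food, backward pheromone deposition, and storage of the ant's knowledge, can be sketched as follows. This is a simplified single-ant illustration under assumptions of our own: a Python dictionary stands in for the hyperlink database, the walk is deterministic rather than pheromone-guided, and the run_ant function and example graph are hypothetical.

def run_ant(outlinks, nest, food):
    visited = [nest]                  # nodes memorized on the forward path
    node = nest
    # Forward mode: move from the nest toward the food, never revisiting
    # a node (loop elimination; each destination feeds only one ant).
    while node != food:
        candidates = [n for n in outlinks.get(node, []) if n not in visited]
        if not candidates:
            return None               # dead end: the ant gives up its search
        node = candidates[0]          # a real ant would choose probabilistically
        visited.append(node)
    # Backward mode: retrace the same route, depositing backward pheromone.
    permanent_path = list(zip(visited[:-1], visited[1:]))
    # The ACR software would now destroy the ant and store its knowledge
    # (incoming/outgoing links per node) in a database; we just return it.
    return {"forward_path": visited, "permanent_edges": permanent_path}

links = {"nest": ["p1"], "p1": ["p2", "nest"], "p2": ["food"]}
print(run_ant(links, "nest", "food"))

Note how the backlink from p1 to the nest is skipped because the nest was already visited, which is the loop-elimination behavior the rules in Section 3 require.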
5. Conclusion

The ACR algorithm and software were built to enhance the ranking method by eliminating its two main drawbacks: web structure construction and the large number of iterations. Once all artificial ants arrive at their nests, the whole structure is built, and the page rank calculation can be computed from the information stored in the database. The experiments show that the algorithm has good convergence properties over the hypertext structure of the Internet.

Acknowledgements

We would like to thank Prof. Chom Kimpan for discussions and invaluable advice given during the preparation of this paper.

6. Bibliography

[1] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd (1998), "The PageRank Citation Ranking: Bringing Order to the Web", Stanford University Digital Libraries papers.
[2] Sergey Brin, Lawrence Page (1998), "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Proc. of the 7th International World Wide Web Conference.
[3] Jon M. Kleinberg (1998), "Authoritative Sources in a Hyperlinked Environment", Proc. of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms.
[4] D. Gibson, J. Kleinberg, P. Raghavan (1998), "Structural Analysis of the World Wide Web", IBM Almaden Research Center.
[5] S. Chakrabarti, Byron E. Dom, D. Gibson, R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins (1998), "Experiments in Topic Distillation", IBM Almaden Research Center.
[6] R. A. Botafogo, B. Shneiderman (1991), "Identifying Aggregates in Hypertext Structures", Third ACM Conference on Hypertext.
[7] P. Kazienko (1998), "Grupowanie stron WWW na podstawie odsyłaczy hipertekstowych" [Clustering WWW pages based on hypertext links], MISSI'98 conference proceedings, Wrocław.
[8] D. Gibson, J. Kleinberg, P. Raghavan (1998), "Inferring Web Communities from Link Topology", Proc. of the 9th ACM Conference on Hypertext and Hypermedia.
[9] P. Raghavan (1998), "Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text", Proc. of the 7th World Wide Web Conference.
[10] S. Chakrabarti, B. Dom, P. Indyk (1998), "Enhanced Hypertext Categorisation Using Hyperlinks", ACM SIGMOD'98 proceedings.
[11] Matthew Merzbacher (1999), "Discovering Semantic Proximity for Web Pages", ISMIS'99 conference proceedings.
[12] P. Pirolli, J. Pitkow, R. Rao (1996), "Silk from a Sow's Ear: Extracting Usable Structures from the Web", Proc. of the Conference on Human Factors in Computing Systems: Common Ground, pages 118-125, New York.

