Optimized Query Execution in Large Scale Web Search Engines with Page Rank Algorithm


G KEERTHY1, G S AJAY KUMAR REDDY1*, P ARAVIND1*, Ch HARSHA VARDAN1*

1 Assistant System Manager, Tata Consultancy Services, Bengaluru Section, INDIA
1* Graduating in Department of Electronics and Communication Engineering, Lakireddy Bali Reddy Autonomous Engineering College, Mylavaram – 521 230, Krishna, A.P., India. [email protected]

Keywords: Page Rank, Query Integrator, links, Out bound link, Dangling Links, random Surfer, damping factor.

Abstract:

The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. We present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems using the PageRank algorithm. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages, and how to apply PageRank to search and to user navigation. We also compare Google (which follows the PageRank algorithm) with other search engines, taking some basic parameters into consideration.

1. Introduction:

Within the past few years, Google has become by far the most widely used search engine worldwide. A decisive factor for this was, besides high performance and ease of use, the superior quality of search results compared to other search engines. This quality of search results is substantially based on PageRank, a sophisticated method to rank web documents.

The aim of these pages is to provide a broad survey of all aspects of PageRank. The contents of these pages primarily rest upon papers by Google founders Lawrence Page and Sergey Brin from their time as graduate students at Stanford University. It is often argued that, especially considering the dynamics of the internet, too much time has passed since the scientific work on PageRank for it still to be the basis of the ranking methods of the Google search engine. There is no doubt that within the past years many changes, adjustments and modifications regarding the ranking methods of Google have most likely taken place, but PageRank was absolutely crucial for Google's success, so that at least the fundamental concept behind PageRank should still be constitutive.

2. The PageRank Concept:

Since the early stages of the world wide web, search engines have developed different methods to rank web pages. Until today, the occurrence of a search phrase within a document is one major factor within the ranking techniques of virtually any search engine. The occurrence of a search phrase can thereby be weighted by the length of a document (ranking by keyword density) or by its accentuation within a document by HTML tags. For the purpose of better search results, and especially to make search engines resistant against automatically generated web pages based upon the analysis of content specific ranking criteria (doorway pages), the concept of link popularity was developed. Following this concept, the number of inbound links for a document measures its general importance. Hence, a web page is generally more important if many other web pages link to it.

Contrary to the concept of link popularity, PageRank is not simply based upon the total number of inbound links. The basic approach of PageRank is that a document is in fact considered the more important the more other documents link to it, but those inbound links do not count equally. First of all, a document ranks high in terms of PageRank if other high ranking documents link to it.

So, within the PageRank concept, the rank of a document is given by the rank of those documents which link to it. Their rank again is given by the rank of documents which link to them. Hence, the PageRank of a document is always determined recursively by the PageRank of other documents.

2.1 The PageRank Algorithm:

The original PageRank algorithm was described by Lawrence Page and Sergey Brin. It is given by

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where

• PR(A) is the PageRank of page A,
• PR(Ti) is the PageRank of pages Ti which link to page A,
• C(Ti) is the number of outbound links on page Ti and
• d is a damping factor which can be set between 0 and 1.

So, first of all, we see that PageRank does not rank web sites as a whole, but is determined for each page individually. Further, the PageRank of page A is recursively defined by the PageRanks of those pages which link to page A.

The PageRank of pages Ti which link to page A does not influence the PageRank of page A uniformly. Within the PageRank algorithm, the PageRank of a page T is always weighted by the number of outbound links C(T) on page T. This means that the more outbound links a page T has, the less page A will benefit from a link to it on page T.

The weighted PageRank of pages Ti is then added up. The outcome of this is that an additional inbound link for page A will always increase page A's PageRank.

Finally, the sum of the weighted PageRanks of all pages Ti is multiplied by a damping factor d which can be set between 0 and 1. Thereby, the extent of the PageRank benefit that a page receives from another page linking to it is reduced.
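As a small illustration of the formula in Section 2.1, the following Python sketch evaluates PR(A) for a single page from the PageRanks and outbound-link counts of the pages that link to it. The function name `pagerank_of_page` and the input values are assumptions chosen for this example; they do not come from the original papers.

```python
from typing import Iterable, Tuple


def pagerank_of_page(inbound: Iterable[Tuple[float, int]], d: float) -> float:
    """Evaluate PR(A) = (1-d) + d * sum(PR(Ti)/C(Ti)) for a single page A.

    `inbound` holds one (PR(Ti), C(Ti)) pair per page Ti that links to A.
    """
    if not 0.0 <= d <= 1.0:
        raise ValueError("damping factor d must lie between 0 and 1")
    weighted_sum = 0.0
    for pr_t, c_t in inbound:
        if c_t < 1:
            # A page that links to A has at least that one outbound link.
            raise ValueError("a linking page needs at least one outbound link")
        # Each linking page passes on its PageRank divided by its link count.
        weighted_sum += pr_t / c_t
    return (1.0 - d) + d * weighted_sum


# Hypothetical inputs: T1 has PageRank 1.0 and two outbound links,
# T2 has PageRank 1.0 and a single outbound link.
example = pagerank_of_page([(1.0, 2), (1.0, 1)], d=0.5)
print(example)  # 0.5 + 0.5 * (0.5 + 1.0) = 1.25
```

With these hypothetical inputs the weighted sum is 1.5, so the result is 0.5 + 0.5 × 1.5 = 1.25. A page without any inbound links receives the value 1 - d.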

2.2 The Random Surfer Model:

PageRank can be considered a model of user behaviour, where a surfer clicks on links at random with no regard towards content.

The random surfer visits a web page with a certain probability which derives from the page's PageRank. The probability that the random surfer clicks on one link is solely given by the number of links on that page. This is why one page's PageRank is not completely passed on to a page it links to, but is divided by the number of links on the page.

So, the probability for the random surfer reaching one page is the sum of the probabilities for the random surfer following links to this page. Now, this probability is reduced by the damping factor d. The justification within the Random Surfer Model is that the surfer does not click on an infinite number of links, but gets bored sometimes and jumps to another page at random.

The probability that the random surfer does not stop clicking on links is given by the damping factor d, which, being a probability, is set between 0 and 1. The higher d is, the more likely the random surfer is to keep clicking links. Since the surfer jumps to another page at random after he has stopped clicking links, the probability of this is implemented as the constant (1-d) in the algorithm. Regardless of inbound links, the probability for the random surfer jumping to a page is always (1-d), so a page always has a minimum PageRank.

2.3 A Different Notation of the PageRank Algorithm:

There are two different versions of the PageRank algorithm. In the second version of the algorithm, the PageRank of page A is given as

PR(A) = (1-d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where N is the total number of all pages on the web.

The second version of the algorithm does not differ fundamentally from the first one. Regarding the Random Surfer Model, the second version's PageRank of a page is the actual probability for a surfer reaching that page after clicking on many links. The PageRanks then form a probability distribution over web pages, so the sum of all pages' PageRanks will be one.

In contrast, in the first version of the algorithm the probability for the random surfer reaching a page is weighted by the total number of web pages. So, in this version PageRank is an expected value for the random surfer visiting a page, when he restarts this procedure as often as the web has pages.

As mentioned above, the two versions of the algorithm do not differ fundamentally from each other. A PageRank which has been calculated by using the second version of the algorithm has to be multiplied by the total number of web pages to get the corresponding PageRank that would have been calculated by using the first version. The first version of the algorithm therefore does not form a probability distribution over web pages with the sum of all pages' PageRanks being one; its values sum to the total number of web pages.

In the following, we will use the first version of the algorithm. The reason is that PageRank calculations by means of this algorithm are easier to compute, because we can disregard the total number of web pages.

2.4 The Characteristics of PageRank:

We regard a small web consisting of three pages A, B and C, whereby page A links to the pages B and C, page B links to page C and page C links to page A. According to Page and Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5. The exact value of the damping factor d admittedly has effects on PageRank, but it does not influence the fundamental principles of PageRank. So, we get the following equations for the PageRank calculation:

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

These equations can easily be solved. We get the following PageRank values for the single pages:

PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615

It is obvious that the sum of all pages' PageRanks is 3 and thus equals the total number of web pages. As shown above, this is not a specific result for our simple example.

For our simple three-page example it is easy to solve the corresponding system of equations to determine PageRank values. In practice, the web consists of billions of documents and it is not possible to find a solution by inspection.

2.5 The Iterative Computation of PageRank:

Because of the size of the actual web, the Google search engine uses an approximate, iterative computation of PageRank values. This means that each page is assigned an initial starting value and the PageRanks of all pages are then calculated in several computation cycles based on the equations determined by the PageRank algorithm. The iterative calculation shall again be illustrated by our three-page example, whereby each page is assigned a starting PageRank value of 1.

Iteration   PR(A)       PR(B)       PR(C)
0           1           1           1
1           1           0.75        1.125
2           1.0625      0.765625    1.1484375
3           1.0742187   0.7685546   1.1528320
4           1.0764160   0.7691040   1.1536560
5           1.0768280   0.7692070   1.1538105
6           1.0769052   0.7692263   1.1538394
7           1.0769197   0.7692299   1.1538449
8           1.0769224   0.7692306   1.1538459
9           1.0769229   0.7692307   1.1538461
10          1.0769230   0.7692307   1.1538461
11          1.0769230   0.7692307   1.1538461
12          1.0769230   0.76923077  1.15384615

In order to provide search results, Google computes an IR score out of page specific factors and the anchor text of inbound links of a page, which is weighted by position and accentuation of the search term within the document. This way the relevance of a document for a query is determined. The IR score is then combined with PageRank as an indicator for the general importance of the page.

If pages are optimised for highly competitive search terms, it is essential for good rankings to have a high PageRank, even if a page is well optimised in terms of classical search engine optimisation. The reason for this is that, to avoid spam by extensive keyword repetition, the increase of the IR score diminishes the more often the keyword occurs within the document or the anchor texts of inbound links. Thereby, the potential of classical search engine optimisation is limited and PageRank becomes the decisive factor in highly competitive areas.

3. The Effect of Inbound Links:
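The small example webs used in this paper can be recomputed with the iterative scheme of Section 2.5. The following Python sketch is an illustration only and not Google's implementation: the function name `iterate_pagerank`, the `links` dictionary and the `fixed` argument (pages whose PageRank is held constant, such as an external page X) are assumptions introduced here. It applies the first version of the formula page by page and reuses values already updated within the same sweep, an update order that is consistent with the first rows of the iteration table for the three-page example.

```python
from typing import Dict, List, Optional


def iterate_pagerank(
    links: Dict[str, List[str]],
    d: float,
    sweeps: int,
    fixed: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages.

    links maps every page to the list of pages it links to; each link
    target must itself be a key of `links`.
    fixed maps pages to PageRank values that are never updated (assumption).
    """
    fixed = fixed or {}
    # Every page starts with a PageRank of 1 unless its value is held constant.
    pr = {page: fixed.get(page, 1.0) for page in links}
    # Invert the link structure once: inbound[A] lists the pages linking to A.
    inbound: Dict[str, List[str]] = {page: [] for page in links}
    for source, targets in links.items():
        for target in targets:
            inbound[target].append(source)
    for _ in range(sweeps):
        for page in links:  # dictionary order defines the update order
            if page in fixed:
                continue
            weighted = sum(pr[t] / len(links[t]) for t in inbound[page])
            # The new value is used immediately by pages updated later.
            pr[page] = (1.0 - d) + d * weighted
    return pr


# Three-page example: A links to B and C, B links to C, C links to A.
three_pages = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(iterate_pagerank(three_pages, d=0.5, sweeps=1))
# {'A': 1.0, 'B': 0.75, 'C': 1.125}
print(iterate_pagerank(three_pages, d=0.5, sweeps=50))

# Four pages linked in a circle plus an external page X with a constant
# PageRank of 10 that links only to A, evaluated here with d = 0.75.
circle = {"X": ["A"], "A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
print(iterate_pagerank(circle, d=0.75, sweeps=100, fixed={"X": 10.0}))
```

After one sweep the three-page values match iteration 1 of the table, and with more sweeps they approach 14/13, 10/13 and 15/13. For the circle with page X and d = 0.75 the values approach 419/35, 323/35, 251/35 and 197/35. Holding PR(X) constant through `fixed` is a modelling shortcut for this example, not part of the PageRank formula itself.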

It has already been shown that each additional inbound link for a web page always increases that page's PageRank. Taking a look at the PageRank algorithm, which is given by

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

one may assume that an additional inbound link from page X increases the PageRank of page A by

d × PR(X) / C(X)

where PR(X) is the PageRank of page X and C(X) is the total number of its outbound links. But page A usually links to other pages itself, which then also benefit and may pass part of that benefit back to page A. The single effects of additional inbound links shall be illustrated by an example.

We regard a website consisting of four pages A, B, C and D which are linked to each other in a circle. Without external inbound links to one of these pages, each of them obviously has a PageRank of 1. We now add a page X to our example, for which we presume a constant PageRank PR(X) of 10. Further, page X links to page A by its only outbound link.

4. The Influence of the Damping Factor:

The degree of PageRank propagation from one page to another by a link is primarily determined by the damping factor d. If we set d to 0.75 we get the following equations for our above example:

PR(A) = 0.25 + 0.75 (PR(X) + PR(D)) = 7.75 + 0.75 PR(D)
PR(B) = 0.25 + 0.75 PR(A)
PR(C) = 0.25 + 0.75 PR(B)
PR(D) = 0.25 + 0.75 PR(C)

Solving these equations gives us the following PageRank values:

PR(A) = 419/35 = 11.97
PR(B) = 323/35 = 9.23
PR(C) = 251/35 = 7.17
PR(D) = 197/35 = 5.63

First of all, we see that there is a significantly higher initial effect of the additional inbound link for page A, which is given by

d × PR(X) / C(X) = 0.75 × 10 / 1 = 7.5

This initial effect is then propagated even more strongly by the links on our site. In this way, the PageRank of page A is almost twice as high at a damping factor of 0.75 as it is at a damping factor of 0.5 (for d = 0.5 the same equations yield PR(A) = 19/3 = 6.33). So, the higher the damping factor, the larger is the effect of an additional inbound link on the PageRank of the page that receives the link and the more evenly PageRank is distributed over the other pages of a site.

5. The Effect of Outbound Links:

We regard a web consisting of two websites, each having two web pages. One site consists of pages A and B, the other consists of pages C and D. Initially, both pages of each site solely link to each other. It is obvious that each page then has a PageRank of one. Now we add a link which points from page A to page C.

6. Additional Factors Influencing PageRank:

Lawrence Page outlined the following potential influencing factors in his patent specifications for PageRank:

• Number of pages
• Position of a link within a document
• Distance between web pages
• Importance of a linking page
• Up-to-dateness of a linking page
• Visibility of a link

First of all, the implementation of additional criteria in PageRank would result in a better approximation of human usage regarding the Random Surfer Model. Considering the visibility of a link and its position within a document implies that a user does not click on links completely at random, but rather follows links which are highly and immediately visible regardless of their anchor text.

• The PageRank algorithm doesn't fully address all the possibilities a surfer could take.

CONCLUSION

Taking all the benefits and the major drawbacks into consideration, along with parameters such as:

• Quality of the Document Content
• Quality/Relevance of Links to External Sites/Pages
• Link Popularity within the Site's Internal Link Structure
• Frequency of Updates to Page
• Accuracy of Spelling & Grammar
• Rate of New Inbound Links to Site

and many other such parameters, many forums and organizations conclude that the Google search engine (which follows the PageRank algorithm) is the most efficient among search engines such as Yahoo and MSN.

REFERENCES:

1. 'The Anatomy of a Large-Scale Hypertextual Web Search Engine', Sergey Brin and Lawrence Page.
2. 'The PageRank Citation Ranking: Bringing Order to the Web', Page, Brin, Motwani and Winograd.
3. 'When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics', Bharat and Mihaila.
4. Optimal implementation of Gauss-Seidel/SOR Algorithm.
5. Markov Chain Updating Problem with Linear Solving.

AUTHOR'S PROFILE:

G Keerthy graduated from the Department of Computer Science Engineering of Jawaharlal Technological University Kakinada, India, and is presently working as an assistant system engineer at TCS, Bangalore section. Her research focuses on web development systems. She has attended about 3 international/national conferences, seminars, workshops and guest lectures, has published her research work in 5 papers in reputed international journals, and has given oral presentations at many leading institutions in India, such as the IITs and IIITs, and at private universities such as BITS Dubai, VIT and SRM.
