Webpage Ranking Algorithms Second Exam Report
Grace Zhao
Department of Computer Science
Graduate Center, CUNY

Exam Committee:
Professor Xiaowen Zhang, Mentor, College of Staten Island
Professor Ted Brown, Queens College
Professor Xiangdong Li, New York City College of Technology

Initial version: March 8, 2015
Revision: May 1, 2015

Abstract

Traditional link analysis algorithms exploit the context information inherent in the hyperlink structure of the Web, the premise being that a link from page A to page B denotes an endorsement of the quality of B. The exemplary PageRank algorithm weighs backlinks with a random-surfer model; Kleinberg's HITS algorithm promotes the use of hubs and authorities over a base set; Lempel and Moran traverse this structure through their bipartite stochastic algorithm; Li examines the structure from head to tail, counting ballots over hypertext. The Semantic Web and the technologies it inspired bring new core factors into the ranking equation. Beyond the continuing effort to improve the importance and relevancy of search results, semantic ranking algorithms strive to capture (1) the meaning of the search query, and (2) the relevancy of each result in relation to the user's intention. This survey presents an overview of eight selected search ranking algorithms.

Contents

1 Introduction
2 Background
  2.1 Core Concepts
    2.1.1 Search Engine
    2.1.2 Hyperlink Structure
    2.1.3 Search Query
    2.1.4 Web Graph
    2.1.5 Base Set of Webpages
    2.1.6 Semantic Web
    2.1.7 Resource Description Framework and Ontology
  2.2 Mathematical Notations
3 Classical Ranking Algorithms
  3.1 PageRank [1][2]
  3.2 HITS [3][4]
  3.3 SALSA [5]
  3.4 HVV [6][7]
4 Semantic Ranking Algorithms
  4.1 OntoRank [8][9]
  4.2 TripleRank [10]
  4.3 RareRank [11]
  4.4 Semantic Re-Rank [12]
5 Conclusion and Future Research Direction

1 Introduction

According to Netcraft.com (http://news.netcraft.com/archives/category/web-server-survey/), there were 915,780,262 websites worldwide as of December 2014, compared to 2,738 websites twenty years earlier, in 1994 (www.internetlivestats.com/total-number-of-websites/). Today Google claims that the Web is "made up of over 60 trillion individual pages and constantly growing" (http://www.google.com/insidesearch/howsearchworks/thestory/). The vast amount of information available on the Internet is a double-edged sword: an information blessing or, alternatively, a potential information nightmare. Web search engines play a key role in today's life, delivering the blessing and abating the nightmare by sifting the vast information on the Web and providing the most relevant and authoritative information in a digestible amount to the Internet user.

To cope with ever-growing web data and ever-increasing user demands while fighting web spam, search engine engineers make relentless efforts to refine their ranking algorithms and bring new factors into the equation. Google is said to use over 200 factors in its ranking algorithms (http://backlinko.com/google-ranking-factors).

The Web has evolved rapidly over the past two decades: from "the Web of documents" in the 1990s, to "the Web of people" in the early 2000s, to the present "Web of data and social networks" [13]. During this evolution, the emerging Semantic Web (SW) set out a new Web platform that augments the highly unordered web data with suitable semi-structured, self-describing data and metadata, which greatly improves the quality of search results in a SW-enabled environment. Most importantly, the SW brings the vision of adding "meaning" to Web resources via knowledge representation and reasoning. Understanding a user's intention in performing a search is the key to providing accurate and customized search results.

The structure of this report is organized as follows. In Section 2, I introduce core concepts in the field of search engines and ranking algorithms.
In addition, the mathematical notations used in this report will be introduced. In Section 3, I review four iconic ranking algorithms: PageRank, Hyperlink-Induced Topic Search (HITS), the Stochastic Approach for Link-Structure Analysis (SALSA), and Hyperlink Vector Voting (HVV). Semantic ranking algorithms are examined in Section 4, and the report concludes in Section 5 with a summary and future research directions.

2 Background

2.1 Core Concepts

2.1.1 Search Engine

A search engine is typically composed of a crawler, an indexer, and a ranker. The web crawler, also called a spider, discovers and gathers websites and webpages on the Web. An indexer uses various methods to index the contents of a website, or of the Internet as a whole. The ranking engine is the "brain" of the search engine: it processes the query and produces a relevant result set. Ranking algorithms work closely with the indexed data and metadata.

Other than crawler-based search engines, human-powered directories, such as the Yahoo directory, depend on human editors' manual effort and discretion to build their listings and search databases.

2.1.2 Hyperlink Structure

A hyperlink has two components: the link address and the link text (hypertext). The link address is a Uniform Resource Locator (URL), which Li [6] refers to as the head anchor at the destination; Li calls the hypertext the tail anchor on the source page. Since a URL is a unique identifier on the Web, ranking algorithms tend to use the hyperlink address as the webpage (document) ID.

Internal hyperlinks are links that go from one page on a domain to a different page on the same domain. They are commonly used for internal navigation.
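The two hyperlink components described above, the link address and the link text, can be extracted from a page with the Python standard-library HTML parser. This is an illustrative sketch only; the HTML snippet and class name are made up for the example:

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collects (link address, link text) pairs from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the currently open <a> tag
        self._text = []      # text fragments collected inside it

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

page = '<p>See <a href="http://b.example/">page B</a> for details.</p>'
parser = AnchorExtractor()
parser.feed(page)
print(parser.links)   # [('http://b.example/', 'page B')]
```

The first element of each pair is the head anchor (URL) and the second is the tail anchor (hypertext), in Li's terminology.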
Internal links are sometimes referred to as intralinks. An external link, or interlink, points at a page on a different domain. Most ranking algorithms examine the interlinks carefully but give little or no consideration to the intralinks. The web document containing a hyperlink is known as the source document; the destination document the hyperlink points to is the target document.

Examining the nature of the links between sources and targets reveals an underlying graph structure: hyperlinks can be viewed as directed edges between source nodes and target nodes. "Backlinks, also known as incoming links, inbound links, inlinks, and inward links" (http://en.wikipedia.org/wiki/Backlink), are the in-edges to a node, either a website or a webpage. A forward link is the out-edge from the source node to the target node.

[Figure 1: Webpages A, B, and C connected by links a through e; link b and link d are backlinks of C]

Search engines tend to give special attention to the number and quality of backlinks to a node, since these are considered an indication of the node's popularity or authority. This is similar to the importance the library community gives to the number of citations of an academic paper [14].

Almost all classical ranking algorithms perform hyperlink analysis; such algorithms are called Link Analysis Ranking (LAR) algorithms.

2.1.3 Search Query

There are three categories of web search queries (http://en.wikipedia.org/wiki/Web_search_query): transactional, informational, and navigational, often called "do, know, go." Ranking algorithms mostly work with informational queries.

Query topics can be broad or narrow. The former pertains to "topics for which there is an abundance of information on the Web, sometimes as many as millions of relevant resources (with varying degrees of relevance)" [5]. Narrow-topic queries are those for which very few resources exist on the Web.
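Since backlinks are simply the reversal of forward links, collecting a node's in-edges is a one-pass inversion of the link graph. The following Python sketch uses a hypothetical three-page graph loosely modelled on Figure 1 (page names and links are invented for the example):

```python
from collections import defaultdict

# Hypothetical forward-link graph: each page maps to the pages it links to.
forward_links = {
    "A": ["B", "C"],   # page A links to B and to C
    "B": ["C"],        # page B links to C
    "C": [],           # page C has no outgoing links
}

def backlinks(graph):
    """Invert the forward links: map each page to the set of pages linking to it."""
    inlinks = defaultdict(set)
    for src, targets in graph.items():
        inlinks[src]               # ensure every node appears, even with no in-edges
        for dst in targets:
            inlinks[dst].add(src)
    return dict(inlinks)

print(sorted(backlinks(forward_links)["C"]))   # ['A', 'B'] -> C has two backlinks
```

A ranking engine counting backlinks as a popularity signal would rank C above A and B in this toy graph, since C is the only page with two in-edges.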
Different techniques may therefore be required to handle each kind of query. Search Engine Persuasion [15] refers to the observation that "there may be millions of sites pertaining in some manner to broad-topic queries, but most users will only browse through the first k (e.g., 10) results returned by the search engine." To streamline and simplify the complexity of ranking algorithms, the user queries and anchor links referred to in this report are textual only.

2.1.4 Web Graph

It is important to understand the general graph structure of the Web in order to design ranking algorithms. The Web Graph is the graph of the webpages together with the hypertext links between them [16]. Each webpage or website, identified by a URL, is a vertex of the Web Graph, and a link between two webpages is an edge.

Broder et al. [17] presented a bow-tie structure of the Web Graph in 2000 (see Figure 2), based on a crawl of 200 million pages via the AltaVista search engine. In the center of the bow-tie structure is a giant Strongly Connected Component (SCC), a strongly connected directed subgraph in which every node can reach every other node. IN denotes the set of pages that can reach the SCC but cannot be reached from it; OUT consists of pages that can be reached from the SCC but cannot reach it. TENDRILS are the orphan web resources that can neither reach the SCC nor be reached from it. The 2000 study concluded that the SCC contains 28% of the nodes on the Web. In addition, Broder et al. show that the in-degree and out-degree distributions follow a heavy-tailed power law.
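The bow-tie partition can be computed with two reachability passes around a node known to lie in the core: SCC members both reach and are reached from that node, IN members only reach it, and OUT members are only reached from it. A toy sketch in Python (the six node names and the core-node choice are hypothetical):

```python
from collections import defaultdict, deque

# Toy web graph: a 3-node cycle forms the SCC, 'in1' can reach it,
# 'out1' is reached from it, and 'tendril' is connected to neither side.
edges = {
    "in1": ["s1"],
    "s1": ["s2"], "s2": ["s3"], "s3": ["s1", "out1"],
    "out1": [],
    "tendril": [],
}

def reachable(graph, start):
    """BFS: return the set of nodes reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def classify(graph, core_node):
    """Partition nodes into SCC / IN / OUT / TENDRILS around core_node."""
    reverse = defaultdict(list)
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].append(src)
    down = reachable(graph, core_node)     # reached from the core
    up = reachable(reverse, core_node)     # can reach the core
    scc = down & up
    return {"SCC": scc, "IN": up - scc, "OUT": down - scc,
            "TENDRILS": set(graph) - down - up}

parts = classify(edges, "s1")
print(sorted(parts["SCC"]))   # ['s1', 's2', 's3']
```

On the real Web Graph one would compute all strongly connected components first and pick the giant one as the core, but the two-pass reachability above captures the IN/SCC/OUT/TENDRILS definitions used in the text.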