Webpage Ranking Algorithms Second Exam Report

Grace Zhao Department of Computer Science Graduate Center, CUNY

Exam Committee Professor Xiaowen Zhang, Mentor, College of Staten Island Professor Ted Brown, Queens College Professor Xiangdong Li, New York City College of Technology

Initial version: March 8, 2015 Revision: May 1, 2015

Abstract

The traditional link analysis algorithms exploit the context information inherent in the structure of the Web, with the premise being that a link from page A to page B denotes an endorsement of the quality of B. The exemplary PageRank algorithm weighs backlinks with a random surfer model; Kleinberg's HITS algorithm promotes the use of hubs and authorities over a base set; Lempel and Moran traverse this structure through their bipartite stochastic algorithm; Li examines the structure from head to tail, counting ballots over hypertext. The Semantic Web and the technologies it has inspired bring new core factors into the ranking equation. While search engines make continuous efforts to improve the importance and relevancy of search results, semantic ranking algorithms strive to improve result quality along two dimensions: (1) the meaning of the search query, and (2) the relevancy of a result in relation to the user's intention. This survey provides an overview of eight selected search ranking algorithms.

Contents

1 Introduction
2 Background
   2.1 Core Concepts
      2.1.1 Search Engine
      2.1.2 Hyperlink Structure
      2.1.3 Search Query
      2.1.4 Web Graph
      2.1.5 Base Set of Webpages
      2.1.6 Semantic Web
      2.1.7 Resource Description Framework and Ontology
   2.2 Mathematical Notations
3 Classical Ranking Algorithms
   3.1 PageRank[1][2]
   3.2 HITS[3][4]
   3.3 SALSA[5]
   3.4 HVV[6][7]
4 Semantic Ranking Algorithms
   4.1 OntoRank[8][9]
   4.2 TripleRank[10]
   4.3 RareRank[11]
   4.4 Semantic Re-Rank[12]
5 Conclusion and Future Research Direction

1 Introduction

According to Netcraft.com, there were 915,780,262 websites worldwide as of December 2014 1, compared with 2,738 websites twenty years earlier in 1994 2. Today Google claims that the Web is "made up of over 60 trillion individual pages and constantly growing 3." The vast amount of information available on the Internet is a double-edged sword: an information blessing or, alternatively, a potential information nightmare. Web search engines play a key role in today's life, harvesting the blessing and abating the nightmare by sifting the vast information on the Web and providing the most relevant and authoritative information, in a digestible amount, to the Internet user.

In order to cope with ever-growing web data and ever-increasing user demands, while fighting web spam, search engine engineers make relentless efforts to refine their ranking algorithms and bring new factors into the equation. Google is said to use over 200 factors 4 5 in its ranking algorithms.

The Web has evolved rapidly over the past two decades, from "the Web of documents" in the 1990s, to "the Web of people" in the early 2000s, to the present "Web of data and social networks"[13]. During this evolution, the emerging Semantic Web (SW) set out a new Web platform that augments the highly unordered web data with suitable semi-structured, self-describing data and metadata, which greatly improves the quality of search results in a SW-enabled environment. Most importantly, SW brings the vision of adding "meaning" to Web resources via knowledge representation and reasoning. Understanding a user's intention in performing a search is the key to providing accurate and customized search results.

The structure of this report is organized as follows: In Section 2, I will introduce some core concepts in the field of search engines and ranking algorithms, along with the mathematical notations used in this report.

1. http://news.netcraft.com/archives/category/web-server-survey/
2. www.internetlivestats.com/total-number-of-websites/
3. http://www.google.com/insidesearch/howsearchworks/thestory/
4. http://www.google.com/insidesearch/howsearchworks/thestory/
5. http://backlinko.com/google-ranking-factors

In Section 3, I will review four iconic ranking algorithms: PageRank, Hyperlink-Induced Topic Search (HITS), Stochastic Approach for Link-Structure Analysis (SALSA), and Hyperlink Vector Voting (HVV). Semantic ranking algorithms are examined in Section 4, and the report concludes in Section 5 with a summary and future research directions.

2 Background

2.1 Core Concepts

2.1.1 Search Engine

A search engine is typically composed of a crawler, an indexer, and a ranker. The crawler, also called a spider, discovers and gathers websites and webpages on the Web. An indexer uses various methods to index the contents of a website or of the Internet as a whole. The ranking engine, which processes the query data and produces a relevant result set, is the "brain" of the search engine. Ranking algorithms work closely with the indexed data and metadata. Other than crawler-based search engines, human-powered directories, such as the Yahoo directory, depend on human editors' manual effort and discretion to build their listings and search databases.

2.1.2 Hyperlink Structure

A hyperlink has two components: the link address and the link text (hypertext). The link address is a Uniform Resource Locator (URL), which Li[6] refers to as the head anchor, at the destination. Li calls the hypertext the tail anchor, on the source page. Since a URL is a unique identification on the Web, ranking algorithms tend to use the hyperlink address as the webpage (document) ID.

Internal links are links that go from one page on a domain to a different page on the same domain. They are commonly used for internal navigation and are sometimes referred to as intralinks. An external link, or interlink, points at a page on a different domain. Most ranking algorithms examine the interlinks carefully but give little or no consideration to the intralinks. The web document containing a hyperlink is known as the source document; the destination document, to which the hyperlink points, is the target document.

Examining the nature of the links between targets and sources reveals an underlying graph structure: the hyperlinks can be viewed as directed edges between source nodes and target nodes in the resulting graph. "Backlinks, also known as incoming links, inbound links, inlinks, and inward links, are the in-edges to a node, either a website or webpage. 6" A forward link is defined as the out-edge between the source node and the target node.

Figure 1: The link b and link d are backlinks of C

Search engines tend to give special attention to the number and quality of backlinks to a node, since they are considered an indication of the popularity or authority of the node. This is similar to the importance given to the number of citations of an academic paper by the library community[14].

Almost all classical ranking algorithms perform hyperlink analysis. Algorithms of this type are called Link Analysis Ranking (LAR) algorithms.

6. http://en.wikipedia.org/wiki/Backlink

2.1.3 Search Query

There are three categories of web search queries 7: transactional, informational, and navigational. These are often called "do, know, go." Ranking algorithms often work with informational queries. Query topics can be broad or narrow. The former pertains to "topics for which there is an abundance of information on the Web, sometimes as many as millions of relevant resources (with varying degrees of relevance)"[5]. Narrow-topic queries refer to those for which very few resources exist on the Web. Therefore, different techniques may be required to handle each type of query.

Search Engine Persuasion[15] stems from the observation that "there may be millions of sites pertaining in some manner to broad-topic queries, but most users will only browse through the first k (e.g., 10) results returned by the search engine."

To simplify the discussion of ranking algorithms, the user queries and anchor links referred to in this report are textual only.

2.1.4 Web Graph

It is important to understand the general graph structure of the Web in order to design ranking algorithms.

The Web Graph is the graph of the webpages together with the hypertext links between them[16]. Each webpage or website, identified by a URL, is a vertex of the Web Graph; a link between two webpages is an edge.

Broder et al.[17] presented a bow-tie structure of the Web Graph in 2000 (see Figure 2), based on crawling 200 million pages via the Alta Vista search engine.

In the center of the bow-tie structure is a giant Strongly Connected Component (SCC), a strongly connected directed subgraph in which every node can reach every other node. IN denotes the set of pages that can reach the SCC but cannot be reached from it; OUT consists of pages that can be reached from the SCC but can not

7. http://en.wikipedia.org/wiki/Web_search_query

reach the SCC. TENDRILS are the orphan web resources that can neither reach the SCC nor be reached from it.

The 2000 study concluded that the SCC contains 28% of the nodes on the Web. In addition, Broder et al. showed that the in-degree and out-degree distributions follow a heavy-tailed power law.

A study[18] of the Web Graph conducted in 2014 challenged Broder's power-law presumption. Meusel et al. analyzed 3.5 billion webpages gathered by the Common Crawl Foundation and claimed that the distributions may follow a log log law instead. They also found that the SCC, LSCC in their lingo (the paper does not spell out the 'L', presumably 'largest'), covers 51.28% of the web resources on the Web, up from the 28% reported in the 2000 study (see Figure 3).

Figure 2: Web Graph depicted by Broder et al. in 2000[17]

The Web Graph can be further categorized into the Page-Level Graph, the Host Graph, and the Pay-Level-Domain (PLD) Graph 8.

8. http://webdatacommons.org/hyperlinkgraph

Figure 3: Web Graph depicted by Meusel et al. in 2014[18]

2.1.5 Base Set of Webpages

Since the Web Graph is colossal, searching through the entire Web Graph is practically impossible. Therefore, every ad hoc ranking algorithm based on link analysis starts with an initial set of webpages, which Kleinberg calls a "focused subgraph of the WWW"[3]. The general method of obtaining such a set is to use a spider and Breadth First Search (BFS).

The algorithms for obtaining the initial set can be either query-independent or query-dependent.

Brin and Page’s PageRank algorithm computes a query-independent authority score for every page. Whereas, the HITS algorithm by Klein- berg crawls a root set — a collections of pages likely to contain the most authoritative pages for a given topic, and then augments it to a “focused subgraph of WWW”, the base set, by adding pages linked from, or link to, the root set.

2.1.6 Semantic Web

The term "semantic" relates to meaning in language or logic, such as the meaning of a word. When used in the context of the Semantic Web, however, the term refers to formally defined meaning that can be used in computation.

The Semantic Web, envisioned by Tim Berners-Lee in the late 1990s and early 2000s, is an extension of the current Web in which information is given well-defined meaning, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users 9. The Semantic Web is also called the "web of data." The relations among data (resources) do not resemble the hyperlinks of the WWW, which connect the current page with a target page; SW relationships can be established between any two resources, not necessarily involving a "current" page. Another major difference is that the relationship (i.e., the link) itself is named. The definition of those relations allows for better, automatic interchange of data 10.

The vision of the Semantic Web was captured and illustrated in the Semantic Web Tower 11. One more axis, "P", was later added to the tower; P stands for perception or people (Figure 4). The "perception" axis signifies the abstraction and adaptation of the technologies in the tower towards people. Thus, Tim Berners-Lee's original vision was "adjusted" over the years to acknowledge that human involvement is inevitable.

2.1.7 Resource Description Framework and Ontology

The Resource Description Framework (RDF) 12 is the fundamental exchange protocol of the Semantic Web. It is a metadata data model for web resources in the form of subject-predicate-object (triple) expressions. A triple is usually represented by a Uniform Resource Identifier (URI) of the source document (subject), a URI or literal for the target document (object), and a URI pointing to a property definition (predicate). A collection of RDF statements intrinsically represents a labeled, directed multi-graph, suitable for knowledge representation.

9. http://www.cs.umd.edu/~golbeck/LBSC690/SemanticWeb.html
10. http://www.w3.org/RDF/FAQ
11. http://www.w3.org/RDF/Metalog/docs/sw-easy.html
12. http://www.w3.org/RDF/
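As a tiny illustration of the triple model just described, the sketch below represents RDF statements as (subject, predicate, object) tuples and groups them into the labeled, directed multigraph they induce. The example.org resources are placeholders; the FOAF property URIs are standard vocabulary terms used only for flavor.

```python
# Hedged sketch: RDF statements as (subject, predicate, object) tuples and the
# labeled, directed multigraph they induce. The example.org URIs are invented.
triples = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/bob"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),  # literal object
]

# Group edges by (subject, object) to expose the multigraph: the edge label is
# the predicate URI, so two resources can be linked by several named relations.
graph = {}
for s, p, o in triples:
    graph.setdefault((s, o), []).append(p)

for (s, o), predicates in graph.items():
    print(s, "->", o, "via", predicates)
```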


Figure 4: Semantic Web Tower

Ontologies are considered one of the pillars of the Semantic Web (SW). SW technologies have gone through ups and downs since their inception in the late 1990s; ontology, however, has been gaining continuous, keen interest from both industry and academia.

An ontology is a formal knowledge description of concepts and their relationships[19]. Ontologies, sometimes called "concept maps", play an important role in a knowledge-based system 13. In order to build an ontology, a well-defined lexicon and logic system, such as an ontology language, a taxonomy/metadata system, and a vocabulary, has to be in place to safeguard the validity and soundness of the ontology. An ontology is usually written in RDF-based languages such as Resource Description Framework Schema (RDFS) 14 or the Web Ontology Language (OWL) 15.

13. A system that is able to find implicit consequences of its explicitly represented knowledge[20].
14. http://www.w3.org/TR/rdf-schema/
15. http://www.w3.org/TR/owl-semantics

A SW vocabulary can be considered a special form of ontology, usually lightweight, or sometimes a collection of URIs with a described meaning.

An ontology can describe a body of knowledge, or a process. A well-built ontology can enable subsequent data display, analysis, inferencing, entailments, and the like. DBpedia, one of the largest online SW knowledge bases (KB), currently describes 4.22 million things in its consistent 2014-version ontology 16. Savvy users can query the KB and draw inferences using vocabularies.

Ontologies help bring semi-structured web data into order and optimize query results, particularly in domain-specific web communities.

Any document available on the Web featuring SW technologies such as RDF, ontologies, etc., is considered a Semantic Web Document (SWD).

2.2 Mathematical Notations

We closely follow the notations in [21].

Let S be the base set of webpages (nodes). Let G = (S, E) be the underlying directed graph, where |S| = n. If node i has a hyperlink pointing to node j in the graph, there is a directed edge placed between the two vertices.

The graph can be transformed into an n × n adjacency matrix P, where P[i, j] = 1 if there is a link from i → j, and 0 otherwise.

We define the backlink set B and the forward link set F for some node i as follows:

B(i) = {j : P[j, i] = 1}
F(i) = {j : P[i, j] = 1}

|B(i)| is the in-degree of node i, and |F(i)| is the out-degree of node i.

16. http://wiki.dbpedia.org/Ontology2014

An authority node in the graph G is defined as a node with nonzero in-degree, and a hub node is one with nonzero out-degree. Let A be the set of authority nodes, and H be the set of hub nodes. If G is a connected graph, meaning there are no isolated nodes in the graph, S = A ∪ H.

A = {s_a | s_a ∈ S and in-degree(s_a) > 0}

H = {s_h | s_h ∈ S and out-degree(s_h) > 0}

A link weight is a non-zero real number.
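To make the notation concrete, here is a minimal Python sketch (the graph is invented for illustration, not taken from the report) that builds the adjacency matrix P for a small base set and derives the backlink set B(i), the forward-link set F(i), and the corresponding in- and out-degrees.

```python
# Minimal sketch of the notation above on a hypothetical 4-node graph.
import numpy as np

# An edge (i, j) means node i links to node j.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
n = 4

P = np.zeros((n, n), dtype=int)
for i, j in edges:
    P[i, j] = 1                      # P[i, j] = 1 iff there is a link i -> j

def backlinks(P, i):
    """B(i): nodes j with an edge j -> i."""
    return set(np.nonzero(P[:, i])[0])

def forward_links(P, i):
    """F(i): nodes j with an edge i -> j."""
    return set(np.nonzero(P[i, :])[0])

for i in range(n):
    print(i, "in-degree:", len(backlinks(P, i)),
          "out-degree:", len(forward_links(P, i)))
```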

3 Classical Ranking Algorithms

All ranking algorithms introduced in this section are LAR algorithms. PageRank, HITS, and SALSA are tightly related; all are eigenvector-based ranking algorithms. HVV adopts information retrieval techniques and takes both the link address and the link text as factors in its ranking.

3.1 PageRank[1][2]

PageRank considers that not all links have the same weight. For example, links from the New York Times website should carry more weight than links from an unpopular personal blog. To assess rank weight, PageRank leans towards backlinks (citations): a page has a high rank if the sum of the ranks of its backlinks is high. This covers two cases: 1) when a page has many backlinks, and 2) when a page has a few highly ranked backlinks. In general, the more backlinks a webpage has on the WWW, the more important it is.

Let’s start with a slightly simplified version of PageRank. For some node u, let c be a factor used for normalization, the ranking R is:

R(u) = c \sum_{v \in B(u)} \frac{R(v)}{|F(v)|},    (1)

where c < 1 because there are a number of pages with no forward links, so their weight is lost. The equation can be computed iteratively from any initial set of ranks until it converges.

However, there is a problem with this simplified version of the PageRank algorithm. Suppose two nodes point to each other but to no other nodes, and some third node points to one of them. Then, during the iteration process, the pair will accumulate rank but never dispatch any of it. This situation is called a rank sink; it is similar to an "absorbing state" in a Markov Chain. To overcome this problem, a rank source and random surfer behavior are introduced:

Definition: For some node u ∈ S, let the vector E be the rank source. Then:

R(u) = c \sum_{v \in B(u)} \frac{R(v)}{|F(v)|} + c E(u),    (2)

such that c is maximized and ||R||_1 = 1 (||R||_1 denotes the L1 norm of R). E(u) is the component of E corresponding to u.

Other than being a decay factor, E can also be viewed as a way of modeling random surfer behavior: "The surfer periodically 'gets bored' and jumps to a random page chosen based on the distribution in E[1]."

In [2], there is a variation of the PageRank algorithm which better illustrates the random surfer model:

R(u) = \frac{1 - d}{N} + d \sum_{v \in B(u)} \frac{R(v)}{|F(v)|}.    (3)

The original paper writes (1 - d) instead of (1 - d)/N, which would make the sum of all rankings N. However, Brin and Page state in the same paper: "the sum of all PageRanks is one."

The damping factor d, 0 < d < 1, usually set to 0.85, enables the following two behaviors:

1. From a given state s (a webpage), with probability d, an outgoing link of s is picked uniformly at random and the surfer moves to its destination, state s'.
2. With probability 1 - d, the surfer chooses a node uniformly at random and jumps to it (s'), with no consideration of s. This is the core of the random surfer model.

It is likely that [2] is the predecessor of [1]. The introduction of the E concept is the key difference between the two PageRank formulations.

PageRank can also be understood as a Markov Chain in which the states are the node set S and the transitions are the edge set E; the normalized link weights can then be understood as transition probabilities.

In PageRank, S can be almost any set of webpages, such as the pages in E, or the entire WWW. The PageRank algorithm is summarized in Algorithm 1.

Algorithm 1 PageRank algorithm
1: R_0 ← S
2: do
3:   R_{i+1} ← P R_i
4:   d ← ||R_i||_1 - ||R_{i+1}||_1
5:   R_{i+1} ← R_{i+1} + dE
6:   δ ← ||R_{i+1} - R_i||_1
7: while δ > ε

Note: E is a user-defined parameter. In most cases E is uniform over all webpages with value α. However, different choices of E can generate "customized" page ranks.
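As a rough illustration of Equation 3 and Algorithm 1, the following sketch runs a damped power iteration with a uniform jump vector. The toy graph, the convergence tolerance, and the handling of dangling nodes are assumptions made for the example; only d = 0.85 comes from the discussion above.

```python
# Hedged sketch of PageRank-style power iteration (Equation 3) on a toy graph.
# Dangling nodes (no forward links) spread their rank uniformly, one common
# convention; the original papers discuss other treatments.
import numpy as np

def pagerank(P, d=0.85, eps=1e-8, max_iter=100):
    n = P.shape[0]
    out_deg = P.sum(axis=1)
    # Column-stochastic transition matrix: M[j, i] = 1/|F(i)| if i -> j.
    M = np.zeros((n, n))
    for i in range(n):
        if out_deg[i] > 0:
            M[:, i] = P[i, :] / out_deg[i]
        else:
            M[:, i] = 1.0 / n            # dangling node: jump anywhere
    R = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        R_next = (1 - d) / n + d * M @ R
        if np.abs(R_next - R).sum() < eps:   # L1 convergence test
            return R_next
        R = R_next
    return R

P = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]])
print(pagerank(P))                       # ranks sum to ~1
```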

Since PageRank is query-independent, it cannot by itself distinguish between pages that are authoritative in general and pages that are authoritative on the query topic. A generally authoritative node may not be considered valuable within the domain of the query topic.

PageRank was the embryo of the Google search engine.

3.2 HITS[3][4]

Unlike PageRank, HITS puts backlinks and forward links on an equal footing. It finds a set of relevant authoritative pages and a set of hub pages on the WWW, relative to a broad-topic query, by extracting information from the hyperlink structure of the Web.

To define the ranking criteria, the author proposed two notions, or rather two distinct types of webpages: hubs and authorities. See the definitions in Section 2.2. The mutually reinforcing relationship between hubs and authorities is reciprocal: "a good hub is a page that points to many good authorities, and a good authority is a page that is pointed to by many good hubs"[3].

Kleinberg defined the hub weight of a node to be the sum of the authority weights of its forward links, and the authority weight to be the sum of the hub weights of its backlinks.

Let h denote the n-dimensional vector of hub weights, where h_i, the i-th coordinate of h, is the hub weight of node i. Let a be the n-dimensional vector of authority weights, where a_i, the i-th coordinate of a, is the authority weight of node i in the graph.

a_i = \sum_{j \in B(i)} h_j.    (4)

h_i = \sum_{j \in F(i)} a_j.    (5)

With adjacency matrix P, we have

a = P^T h and h = P a.

The author states that if a hub node points to many pages with large authority weights, then it should receive a large hub weight; and if an authority node is pointed to by many pages with large hub weights, then it should receive a large authority value. He therefore proposed a two-level weight-propagation iterative algorithm, the H operation and the A operation, for computing the hub and authority weights, respectively, using Equations 4 and 5. After each iteration, a normalization is performed on the vectors a and h so that they become unit vectors. The iteration stops upon convergence.

Let k be a natural number. Let z denote the vector (1, 1, 1, ..., 1) ∈ R^n. Algorithm 2 gives the pseudocode of HITS.

Figure 5: A densely linked set of hubs and authorities

Algorithm 2 HITS algorithm
1: Iterate(S, k)
2: Set a_0 := z.
3: Set h_0 := z.
4: for i = 1, 2, ..., k do
5:   Apply the A operation to (a_{i-1}, h_{i-1}), obtaining new authority weights a'_i.
6:   Apply the H operation to (a'_i, h_{i-1}), obtaining new hub weights h'_i.
7:   Normalize a'_i, obtaining a_i.
8:   Normalize h'_i, obtaining h_i.
9: end for
10: Return (a_k, h_k).

A "filter" algorithm then selects the top c authorities and top c hubs (Algorithm 3). Let k, c be natural numbers.

For arbitrarily large k, Kleinberg proved that the sequences of vectors {a_k} and {h_k} converge to fixed points a* and h*. The singular vectors a* and h* are the principal eigenvectors of P^T P and P P^T, respectively.

Algorithm 3 HITS filter algorithm
1: Filter(S, k, c)
2: (a_k, h_k) := Iterate(S, k).
3: Report the pages with the c largest coordinates in a_k as authorities.
4: Report the pages with the c largest coordinates in h_k as hubs.

Lempel and Moran further illustrated the concept via two association matrices, the authority matrix AU and the hub matrix HU[5].

AU =_Def P^T P is the cocitation matrix of the set S. [AU]_{i,j} is the number of pages which jointly point at pages i and j. Kleinberg's iterative algorithm converges to authority weights which correspond to the entries of the (unique, normalized) principal eigenvector of AU.

HU =_Def P P^T is the bibliographic coupling matrix. [HU]_{i,j} is the number of pages which are jointly pointed at by nodes i and j[5]. Kleinberg's iterative algorithm converges to hub weights which correspond to the entries of HU's (unique, normalized) principal eigenvector.
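The following sketch illustrates the Iterate routine of Algorithm 2 in matrix form (a = P^T h, h = P a) on an invented adjacency matrix; the choice of L2 normalization and the fixed number of iterations are assumptions for the example.

```python
# Hedged sketch of the HITS iteration (Algorithm 2): alternate the A and H
# operations, normalizing after each pass, until the weights stabilize.
import numpy as np

def hits(P, k=50):
    n = P.shape[0]
    a = np.ones(n)          # authority weights, a_0 = z
    h = np.ones(n)          # hub weights, h_0 = z
    for _ in range(k):
        a = P.T @ h         # A operation: a_i = sum of hub weights of backlinks
        h = P @ a           # H operation: h_i = sum of authority weights of forward links
        a /= np.linalg.norm(a)   # normalize to unit vectors
        h /= np.linalg.norm(h)
    return a, h

P = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
a, h = hits(P)
print("nodes by authority weight:", np.argsort(-a))
print("nodes by hub weight:", np.argsort(-h))
```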

The HITS algorithm was used by a number of search engines, such as Yahoo! and Alta Vista.

3.3 SALSA[5]

Lempel and Moran present SALSA, a new LAR algorithm with a stochastic approach. The algorithm was inspired by the random surfer model of PageRank and by the mutually reinforcing relationship between hubs and authorities of HITS.

SALSA, a “weighted in/out-degree analysis of the link-structure of WWW subgraphs”, is said to be equivalent to Kleinberg’s algorithm that both are broad-topic-dependent, working on a focused subgraph, and employing the same metagorithm.

SALSA computes the weights by simulating a random walk through two different Markov Chains among the nodes of S: a chain of hubs, H, and a chain of authorities, A. This differs from PageRank, whose walk is "a single random leap on the entire WWW," or within E, independent of the base set S, with no distinction made between hubs and authorities. SALSA's bipartite random surfing model is also a departure from HITS' mutually reinforcing relationship.

Lempel and Moran construct a bipartite undirected graph G' = (A, H, E). A and H are two independent sets; a node i can appear in both A and H.

Figure 6: Transforming the directed graph G to the bipartite graph G'

SALSA launches two distinct random walks. Each walk only visits nodes on either the authority side or the hub side of the graph. In the initial state, the random walk starts from some authority node selected uniformly at random. In the next state, the random surfer crosses to the hub side and picks a hub node uniformly at random. In the state after, the random surfer selects one of its outgoing links (to the authority side) uniformly at random and moves to that authority node. The random walk proceeds by alternating between backward and forward steps.

Each node is assigned an authority weight and a hub weight, defined to be the stationary distributions of these random walks.

Here is the process of defining the authority and hub random-walk matrices, Ã and H̃, two doubly stochastic matrices. First, two matrices are derived from P.

Let P_r denote the matrix derived from matrix P by normalizing the entries such that, for each row, the sum of the entries is 1, and let P_c denote the matrix derived from matrix P by normalizing the

entries such that, for each column, the sum of the entries is 1. Then Ã consists of the nonzero rows and columns of P_c^T P_r, and H̃ consists of the nonzero rows and columns of P_r P_c^T. The stationary distributions of the SALSA algorithm are then the principal eigenvectors of these matrices.

The transition probabilities of the two Markov Chains can be computed directly via Equations 6 and 7.

P_a(i, j) = \sum_{k \in B(i) \cap B(j)} \frac{1}{|B(i)|} \cdot \frac{1}{|F(k)|}.    (6)

P_h(i, j) = \sum_{k \in F(i) \cap F(j)} \frac{1}{|F(i)|} \cdot \frac{1}{|B(k)|}.    (7)

SALSA can be seen as a variation of HITS[15]. In the H operation of the HITS algorithm, the hubs broadcast their weights to the authorities, and the authorities sum up the weights of the hubs that point to them. In SALSA, however, each hub divides its weight equally among the authorities to which it points. Similarly, the SALSA algorithm modifies the A operation so that each authority divides its weight equally among the hubs that point to it. Therefore, we have Equations 8 and 9:

a_i = \sum_{j \in B(i)} \frac{1}{|F(j)|} h_j.    (8)

h_i = \sum_{j \in F(i)} \frac{1}{|B(j)|} a_j.    (9)

The authors argue that in some cases the mutually reinforcing relationship may result in a topological phenomenon called the Tightly Knit Community (TKC) Effect, which refers to a highly connected subgraph within the Web that may cause the Mutual Reinforcement approach to identify false authorities or hubs. SALSA, on the other hand, is less vulnerable to this effect due to its looser coupling approach.
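Below is a hedged sketch of the SALSA computation described above: the row- and column-normalized matrices P_r and P_c are formed, the two chains built from P_c^T P_r and P_r P_c^T are assembled, and their stationary distributions are approximated by power iteration. The toy graph is invented, and zero rows/columns are simply skipped rather than removed as in the formal definition.

```python
# Hedged sketch: SALSA authority/hub weights as stationary distributions of
# the chains derived from Pc^T Pr and Pr Pc^T.
import numpy as np

def salsa(P):
    n = P.shape[0]
    Pr = np.zeros((n, n))
    Pc = np.zeros((n, n))
    row_sums = P.sum(axis=1)
    col_sums = P.sum(axis=0)
    for i in range(n):
        if row_sums[i] > 0:
            Pr[i, :] = P[i, :] / row_sums[i]     # row-normalized
        if col_sums[i] > 0:
            Pc[:, i] = P[:, i] / col_sums[i]     # column-normalized
    A_chain = Pc.T @ Pr     # authority-chain transition matrix
    H_chain = Pr @ Pc.T     # hub-chain transition matrix

    def stationary(M, iters=200):
        pi = np.full(n, 1.0 / n)
        for _ in range(iters):
            pi = pi @ M                          # left power iteration
        s = pi.sum()
        return pi / s if s > 0 else pi

    return stationary(A_chain), stationary(H_chain)

P = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
auth, hub = salsa(P)
print("authority weights:", auth.round(3))
print("hub weights:", hub.round(3))
```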

SALSA is said to be computationally lighter than HITS since its ranking is equivalent to a weighted in/out degree ranking. Both HITS and SALSA are ad hoc algorithms, meaning they are computed at

query time; therefore, their computational costs are crucial and directly affect the response time of a search engine. PageRank, in contrast, is a query-independent approach that can be computed off-line.

In a study[15] comparing 34 term-based queries, SALSA performed best among the three aforementioned algorithms in finding highly relevant pages.

The SALSA algorithm was implemented by Twitter.

3.4 HVV[6][7]

HVV, like PageRank, holds that "what a site says about itself is not considered reliable," but only "what others say that page is about! 17" In other words, HVV emphasizes the importance of backlinks. Like HITS and SALSA, HVV is an ad hoc algorithm; however, it follows an entirely different ranking approach from the other three. HVV computes the page rank for a given query based on the weighted terms in the hypertext of backlinks.

Li calls the hyperlink URL pointing to the destination the head anchor, and the hypertext (at the source) the tail anchor. The "document ID" is usually defined as the hyperlink's head anchor; the hypertext is treated as the content of the link.

A traditional Information Retrieval method, the Vector Space Model (VSM), is the building block of HVV. VSM is often used to compute the similarity between two documents, or between a query and a document. It handles the full text of a document, whereas HVV deals only with the hyperlinks within a webpage.

Like VSM, HVV represents both the document (webpage) and the query as vectors.

In HVV, a document vector is represented by a vector of vectors. Each dimension of the document vector is a hyperlink vector, and a document can have zero or more link vectors. Each dimension of a link vector is the weight of a term (excluding stop words) extracted from the hypertext. This is different from VSM, where the document vector is a vector of document terms.

17. http://tech-insider.org/internet/research/1997/1210.html

D_j = (\vec{L}_1, \vec{L}_2, ..., \vec{L}_n)^T,

where D_j is a document ID, and \vec{L}_i is the i-th hyperlink vector whose head anchor is D_j.

\vec{L}_l = (w_{l,1}, w_{l,2}, ..., w_{l,m}),

where m is the number of unique terms in the link vector \vec{L}_l. The value of each link-vector dimension is calculated using a term-weighting method, such as the popular Term Frequency - Inverse Document Frequency (tf-idf) model[22].

As in VSM, the query vector in HVV is a vector of the weights of the keywords in the query.

\vec{Q} = (w_1, w_2, ..., w_n),

where w_i is the weight of the i-th term in the query.

In VSM, the relevance score between the document and the query is calculated by the dot product of the document vector and query vector. Since the document vector in HVV is a vector of link vectors, the HVV ranking score is therefore defined as the summation of all the dot products between the query vector and each hyperlink vector for a given document.

R = \sum_{t=1}^{n} (\vec{Q} \cdot \vec{L}_t),    (10)

where R is the ranking score, \vec{Q} is the query vector, and \vec{L}_t is the t-th link vector.

The vector model of classic information retrieval suggests that the more rare words two documents share, the more similar they are considered to be.
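A toy sketch of the HVV scoring idea of Equation 10: each backlink's hypertext becomes a term-weight vector, and a page's score is the sum of dot products between the query vector and those link vectors. Raw term counts stand in for the tf-idf weighting mentioned above, and the anchor texts are invented.

```python
# Hedged sketch of HVV-style scoring: score(document) = sum over its backlink
# hypertext vectors of (query . link_vector). Term weights here are raw counts;
# the papers use a tf-idf style weighting instead.
from collections import Counter

STOP_WORDS = {"the", "a", "of", "for"}

def term_vector(text):
    terms = [t.lower() for t in text.split() if t.lower() not in STOP_WORDS]
    return Counter(terms)

def hvv_score(query, anchor_texts):
    q = term_vector(query)
    score = 0.0
    for text in anchor_texts:       # one vector per backlink pointing at the page
        link_vec = term_vector(text)
        score += sum(q[t] * link_vec[t] for t in q)   # dot product Q . L_t
    return score

# Hypothetical backlink anchor texts harvested for one target page.
anchors = ["graph ranking tutorial", "link analysis ranking survey", "cat pictures"]
print(hvv_score("link analysis ranking", anchors))
```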

Since the HVV ranker does not search full body text but hyperlinks only, the index of Rankdex (an HVV-based search engine) was significantly smaller than other search engine indexes. Document size is no longer a factor in relevance ranking, and thus shorter documents are more likely to be selected. In addition, Li suggested that images, graphics, and sounds, which are not searchable by conventional methods, become searchable via the hyperlink descriptions pointing to them. The same is true of foreign-language documents if there are hyperlinks to them in the user's native language.

The HVV algorithm underpinned the search engine Baidu.

4 Semantic Ranking Algorithms

The Semantic Web and the semantic technologies it has inspired have been at the center of search engine research in recent years. Consumers increasingly expect search engines to understand natural language and perceive the intent behind the words they type, and search engine researchers are seeking new horizons to take up the challenge.

In 2011, Microsoft, Yahoo, and Google jointly launched the "Schema.org" initiative, which defines a set of HTML markup terms that can be used as clues to the meaning of a page and assist search engines in recognizing specific people, events, attributes, and so on. Meanwhile, semantic search came into being.

Much of semantic search research is directed at adding semantic annotations to data in order to "improve search accuracy by understanding searcher's intent and the contextual meaning of terms as they appear in the searchable database, whether on the Web or within a closed system, to generate more relevant results 18."

18. http://en.wikipedia.org/wiki/Semantic_search

In this report, we focus on semantic search over the Semantic Web. Data on the Semantic Web is divided into two categories: ontological data and instance data. The actual data the user is interested in are the instance data belonging to a class, while the domain knowledge and relationships are described primarily as class relationships in the ontology.

RDF is a standard model for data interchange on the Semantic Web. It defines the main concepts, such as classes and properties, and how they interact to create meaning. In a semantic graph, entities (classes, instances, property entities) are the nodes, and the relationships (properties) between the entities are the edges.

In contrast to traditional rankers, which swim in the traditional Web Graph, semantic spiders and semantic rankers leap through a conceptual network of the Web, the Web Knowledge Graph.

Jindal[23] classified semantic ranking into three types: Entity, Relationship, and Semantic Document.

4.1 OntoRank[8][9]

Swoogle's OntoRank is a term-based, query-dependent ranking algorithm for the Semantic Web. Instead of crawling webpages, Swoogle looks for SWDs such as ontologies and RDF documents published on the Web.

There are two types of SWDs defined in the paper: SW ontologies (SWOs), and SW databases (SWDBs), documents that mostly describe instance data and individuals. An SWO is said to be the TBox in a Description Logic (DL) knowledge base, and an SWDB the ABox. A DL knowledge base typically comprises two components, a TBox and an ABox. The TBox contains intensional knowledge in the form of a terminology or taxonomy and is built through declarations that describe general properties of concepts. The ABox contains extensional knowledge that is specific to the individuals of the domain of discourse[20].

Swoogle extracts the metadata from the harvested SWOs and SWDBs, and builds or extends its RDF graph accordingly. The graph is directed and labeled: the edges represent named links with explicit semantics between two resources, which are represented by the graph nodes. Swoogle identifies three categories of metadata: (i)

basic metadata, which considers the syntactic and semantic features of an SWD, (ii) relations, which consider the explicit semantics between individual SWDs, and (iii) analytical results, such as SWO/SWDB classification and SWD ranking.

For the directional relations (links) among SWDs, in other words the navigational paths between the nodes, Swoogle classifies four types of interlinks. For some SWDs i and j:

1. link(j, imports, i) denotes that j imports all terms and content of i.
2. link(j, uses-term, i) denotes that j uses some of the terms defined by i without importing i.
3. link(j, extends, i) denotes that j extends the definitions of terms defined by i.
4. link(j, asserts, i) denotes that j makes assertions about the individuals defined by i.

In all cases, j contains the backlink to i.

Heavily influenced by PageRank, Swoogle's Ontology Rank (OntoRank) uses the number of backlinks to assess the importance of an SWD. OntoRank follows a random surfer model called the rational surfer model. Let link(i, l, j) be the semantic link from SWD i to SWD j using semantic tag l, let d be a constant between 0 and 1, and let weight(l) be the user's preference for choosing semantic links with tag l. The initial ranking of some SWD i is defined as:

R(i) = (1 - d) + d \sum_{j \in B(i)} R(j) \frac{f(j, i)}{f(j)},    (11)

f(j, i) = \sum_{link(j, l, i)} weight(l),    (12)

f(j) = \sum_{k \in F(j)} f(j, k).    (13)

In the rational surfer model, in addition to the damping factor d, the forwarding link is chosen with unequal probability f(j, i)/f(j), where j is the current SWDB, i is the SWD that j links to, and f(j, i) is

the sum of all link weights from j to i.

The final rank, the OntoRank, of SWD i is defined as:

OntoRank(i) = R(i) + \sum_{link(j, imports, i)} R(j),    (14)

where link(j, imports, i) ranges over the transitive closure of the documents j that import SWO i. Thus i receives accumulated scores from its internal and external nodes. Evidently, OntoRank gives priority to SWOs over instance data.
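A rough sketch of Equations 11-14 under simplifying assumptions: link weights f(j, i) are given directly, the rational-surfer rank is iterated to convergence, and the OntoRank of an ontology then adds the ranks of its (transitive) importers. The toy documents, weights, and damping value are invented.

```python
# Hedged sketch of OntoRank (Equations 11-14) on a tiny, invented SWD graph.
# links[j] maps a source SWD j to {target i: f(j, i)}, the summed tag weights.
def ontorank(links, imports, d=0.85, iters=50):
    nodes = set(links) | {i for tgt in links.values() for i in tgt} | set(imports)
    R = {i: 1.0 for i in nodes}
    for _ in range(iters):
        new_R = {}
        for i in nodes:
            # backlink contribution weighted by f(j, i) / f(j)
            s = sum(R[j] * w[i] / sum(w.values())
                    for j, w in links.items() if i in w)
            new_R[i] = (1 - d) + d * s
        R = new_R
    # OntoRank(i) = R(i) + sum of R(j) over SWDs j that transitively import i
    onto = dict(R)
    for i, importers in imports.items():
        onto[i] = R[i] + sum(R[j] for j in importers)
    return onto

links = {"doc1": {"ontoA": 2.0, "doc2": 1.0},      # f(doc1, ontoA) = 2, etc.
         "doc2": {"ontoA": 1.0}}
imports = {"ontoA": {"doc1", "doc2"}}               # transitive importers of ontoA
print(ontorank(links, imports))
```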

Apart from ranking SWDs with OntoRank, Swoogle's TermRank ranks Semantic Web terms (SWTs) found on the Semantic Web. Given a term t and an SWD i, fq(t, i) denotes the number of occurrences of t in i. Let the SWD collection be D_t = {d | fq(t, d) > 0}; |D_t| denotes the number of SWDs that use t. Then,

TermRank(t) = \sum_{fq(t,d) > 0} \frac{OntoRank(d) \times TW(d, t)}{\sum_{fq(t',d) > 0} TW(d, t')},    (15)

TW(d, t) = fq(t, d) \times |D_t|.    (16)

A general user can query with keywords, and the SWDs matching those keywords are returned in ranked order. An advanced user can query the underlying knowledge base using keywords, content-based constraints, and language- and encoding-based constraints.
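A small sketch of the TermRank idea in Equations 15 and 16: a term inherits OntoRank mass from the documents that use it, in proportion to its weighted share TW(d, t) of each document's terms. The document ranks and term frequencies below are invented.

```python
# Hedged sketch of TermRank (Equations 15-16). fq[d][t] is the number of
# occurrences of term t in SWD d; onto_rank[d] is the OntoRank of d.
def term_rank(term, fq, onto_rank):
    # |D_t| for every term: number of SWDs containing it
    doc_count = {}
    for d in fq:
        for t in fq[d]:
            doc_count[t] = doc_count.get(t, 0) + 1

    def tw(d, t):
        return fq[d].get(t, 0) * doc_count.get(t, 0)   # TW(d, t) = fq(t, d) * |D_t|

    score = 0.0
    for d in fq:
        if fq[d].get(term, 0) > 0:
            denom = sum(tw(d, t2) for t2 in fq[d])
            score += onto_rank[d] * tw(d, term) / denom
    return score

fq = {"docA": {"sensor": 3, "network": 1}, "docB": {"sensor": 1}}
onto_rank = {"docA": 1.2, "docB": 0.15}
print(term_rank("sensor", fq, onto_rank))
```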

In Swoogle’s metadata database, 13.29% SWDs are classified as SWOs. About half of all SWDs have rank of 0.15, which means they are not referred to by any other SWDs. The mean of ranks is 0.8376, which implies that the SWDs Swoolgle spider have found are poorly connected.

OntoRank is a Semantic Document Ranking model.

4.2 TripleRank[10]

TripleRank is a HITS-inspired algorithm for authority ranking in the context of RDF knowledge bases. It uses a 3-dimensional tensor model to represent SW triples, bringing geometric structure into the

linear algebraic world. It approaches the Semantic Web graph from a different vantage point, associating the graph with a function that returns the link relations (properties, types) between two resources.

Definition: Let G = (V,E, Γ, θ), where V is a set of SWDs or SW resources, E is a set of links between SWDs/resources, Γ is a set of literals, and function (random variable) θ : V → E returns the URI of the property that links two resources.

Franz et al. model the graph using a 3rd-order tensor (3-way array), with object, subject, and predicate/property as the modes. An order-3 tensor produces three views (fibers, facets): rows, columns, and tubes, and subsequently allows three-way slicing: horizontal, vertical, and frontal[24] (see Figure 7). When sliced in the right direction, each slice can represent an adjacency matrix with respect to one link type or property. We can then use HITS to calculate hub and authority scores for the slices. However, decomposing this way results in very sparse, unconnected matrices. The tensor model, analyzed through Parallel Factor Analysis (PARAFAC) decomposition, not only connects all the link properties together, but may detect further hidden relationships as well. PARAFAC is based on the trilinear model x_{ijk} = \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr}[25].

Figure 7: (Top) FIBERS: (A) Columns, (B) Rows, and (C) Tubes of a 3rd-order tensor. (Bottom) SLICES: (A) Horizontal, (B) Vertical, and (C) Frontal slices of a 3-way tensor[24]

Formally, a tensor T ∈ R^{k×l×m} is decomposed by PARAFAC into component matrices U_1 ∈ R^{k×n}, U_2 ∈ R^{l×n}, and U_3 ∈ R^{m×n}. PARAFAC thus decomposes a tensor as a sum of rank-one (Kruskal) tensors[26]:

T = \sum_{k=1}^{n} U_1^k ∘ U_2^k ∘ U_3^k,

where U_i^k is the k-th column of U_i and ∘ is the outer product. If U_1, U_2, U_3 represent subject, object, and property respectively, then, in line with HITS, the largest entries of U_1^1 correspond to the largest hub scores and those of U_2^1 to the largest authority scores. As such, PARAFAC can be considered a 3-D version of HITS.
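The sketch below builds the subject-object-predicate tensor from a handful of invented RDF-style triples and runs a very small alternating-least-squares CP (PARAFAC) decomposition, whose leading factor columns play the hub- and authority-like roles described above. A real system would use a dedicated tensor library and the preprocessing discussed next; this is only a naive illustration.

```python
# Hedged sketch: build the subject x object x predicate tensor from RDF-style
# triples, then run a naive CP/PARAFAC decomposition by alternating least squares.
import numpy as np

triples = [("a", "cites", "b"), ("a", "cites", "c"),
           ("b", "authorOf", "d"), ("c", "cites", "b")]   # invented toy data

ents = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
preds = sorted({p for _, p, _ in triples})
e_idx = {e: i for i, e in enumerate(ents)}
p_idx = {p: i for i, p in enumerate(preds)}

T = np.zeros((len(ents), len(ents), len(preds)))
for s, p, o in triples:
    T[e_idx[s], e_idx[o], p_idx[p]] = 1.0        # one adjacency slice per predicate

def cp_als(T, rank=2, iters=30):
    """Very small ALS for the CP model T ~ sum_r U1[:,r] o U2[:,r] o U3[:,r]."""
    dims = T.shape
    U = [np.random.rand(d, rank) for d in dims]
    for _ in range(iters):
        for mode in range(3):
            others = [U[m] for m in range(3) if m != mode]
            # Khatri-Rao product of the other two factor matrices
            kr = np.einsum('ir,jr->ijr', others[0], others[1]).reshape(-1, rank)
            unfolded = np.moveaxis(T, mode, 0).reshape(dims[mode], -1)
            U[mode] = unfolded @ kr @ np.linalg.pinv(kr.T @ kr)
    return U

U1, U2, U3 = cp_als(T)
print("hub-like ordering (subjects):", np.argsort(-np.abs(U1[:, 0])))
print("authority-like ordering (objects):", np.argsort(-np.abs(U2[:, 0])))
```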

The authors proposed a pre-processing step before ranking, which resonates with the traditional IR principle that rare words weigh more than frequent words.

1. Predicates linking the majority of resources are pruned, as they convey little information and dominate the data set.
2. Statements with less frequent predicates are amplified more strongly than more common statements.

The authors experimented with TripleRank in faceted search, a technique allowing users to explore a collection of information by applying multiple filters, and conclude that the TripleRank approach results in substantially increased recall without loss of precision.

TripleRank is a Relationship Ranking model.

4.3 RareRank[11]

RareRank stands for the "Rational Research Ranking" model. This semantic ranker emulates a researcher's search behavior in a scientific research environment. It argues that a researcher tends to make a "rational choice" when searching for an answer, as opposed to taking a "random walk" as in PageRank.

RareRank focuses on entities and the relationships among them. The authoritativeness of an entity, say a book, is based on three factors: citations (backlinks), the popularity of its authors, and relevancy to the query topic. As such, a newly written document can rise to prominence if it is highly related to the query topic or its author is highly venerated in the field, a capability missing in citation-based algorithms[14].

In RareRank, there are two types of graphs in the system: the ontology schema graph and the knowledge base (instance data) graph; both are directed, labeled, and weighted.

Definition: Let the schema graph be G_s = (V, E, Ω(e, v)), where V is a set of classes, V = {v_i | v_i ∈ V, 0 < i ≤ |V|}; E is a set of predicates, E = {e_j | e_j ∈ E, 0 < j ≤ |E|}; and Ω(e, v) is the set of weights of predicate e whose domain is class v, Ω(e, v) = {ω(e_j, v_i) | ω(e_j, v_i) ∈ Ω(e, v), ω(e_j, v_i) ∈ [0, 1]}. |F(v_i)| denotes the number of outgoing links from v_i.

The schema graph designates the relations between ontological classes and their transition weights.

Definition: Let the knowledge base graph be G_k = (V', E'), where V' is the set of all instances defined in G_k and instantiated from V, V' = {v'_i | v'_i ∈ V', 0 < i ≤ |V'|}, and E' is the set of all predicate instances defined in G_k and instantiated from E, E' = {e'_j | e'_j ∈ E', 0 < j ≤ |E'|}. Let N denote the number of instances in G_k.

The knowledge base graph consists of instances (or entities) and their relationships instantiated from the schema ontology. The weight of a relation from an instance i_d in its domain to an instance i_r in its range is determined by (i) the weight of the relation between the corresponding classes in the schema graph, (ii) how many instances of the same type as i_r that i_d links to, and (iii) the strength of the association between the instances. The RareRank score integrates both relevance (using a domain-topic ontology) and quality.

In RareRank, computation of the ranking scores is based on the principle of convergence of a Markov Chain. An important property of a Markov Chain is that for any starting point, the chain will converge to the stationary distribution as long as the transition probability matrix P obeys two properties, irreducibility and aperiodicity.

\vec{π} P = \vec{π},

where \vec{π} is the stationary probability vector associated with eigenvalue 1. This eigenvector represents the ranking values for all the entities in the graph and can be obtained with the power iteration method. The transition probability matrix, constructed over both the ontology

schema graph and the knowledge base graph, is therefore the focal point of RareRank.

There are four types of teleport operations that govern the RareRank transition probabilities. The notion of teleport in RareRank means that from each node there is a probability of reaching all other nodes in the graph. In general, there are two scenarios for teleporting: the class in question has no outgoing links in the ontology schema graph, or the class has outgoing links in the schema graph.

1. Full Teleport Probability – This applies when the class has no outgoing links in the ontology schema, which also implies that the corresponding instance in the knowledge base has no outgoing links. The teleport probability of this type is denoted pr^{ft}.

In the schema matrix,

pr_s^{ft} = 1.

In the knowledge base,

pr_k^{ft} = 1/N.

2. Base Teleport Probability – This is the probability of initiating a teleport operation when a class has outgoing links in the ontology schema (so an instance of the class in the knowledge base possibly has outgoing links), namely when \sum_{j}^{|F(v_i)|} ω(e_j, v_i) = 1. The teleport probability pr^{bt} is set to 1 - d, where d is the damping factor.

In the schema,

pr_s^{bt} = 1 - d.

In the knowledge base,

pr_k^{bt} = (1 - d)/N.

RareRank sets d = 0.95 (recall that PageRank sets it to 0.85) in order to minimize the "random surfer" behavior and, in turn, increase the "rationality" of the ranker.

3. Schema Imbalance Teleport Probability – If the sum of transition probabilities from one class to all other classes is less than 1, i.e., \sum_{j}^{|F(v_i)|} ω(e_j, v_i) ∈ (0, 1) in the schema, the teleport probability,

denoted pr^{it}, is the difference between 1 and the sum.

In the schema,

pr_s^{it} = d (1 - \sum_{j}^{|F(v_i)|} ω(e_j, v_i)).

In the knowledge base,

pr_k^{it} = d \frac{1 - \sum_{j}^{|F(v_i)|} ω(e_j, v_i)}{N}.

4. Link Zero-Instantiation Teleport Probability – When a predicate e_j is defined in the schema but not instantiated in the knowledge base, the weight of the predicate is transferred to teleporting, pr^{zt}:

In the knowledge base,

pr_k^{zt} = d \frac{\sum_{j: e_j \notin E'}^{|F(v_i)|} ω(e_j, v_i)}{N}.

In addition to using teleport probabilities for the transition matrices, RareRank has a jumping case that contributes to the transition probabilities. If a predicate e_j with domain v_i is present in G_k, the transition probability is defined as:

pr(i, j) = d \frac{ω(e_j, v_i)}{|(e_j, v_i)|},

where |(e_j, v_i)| is the number of times that the predicate e_j with domain v_i is instantiated in G_k.

Thus, the transition probability in G_s and the transition probability from instance i to j in G_k can be computed by Equations 17 and 18, respectively:

pr_s = 1 if there are no outlinks, and pr_s = pr_s^{bt} + pr_s^{it} + pr_s^{zt} otherwise.    (17)

pr_k = 1/N if there are no outlinks, and pr_k = pr_k^{bt} + pr_k^{it} + pr_k^{zt} + pr(i, j) otherwise.    (18)
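Whatever the exact teleport and jump probabilities, the final step is the same: find the stationary distribution of the assembled transition matrix. A minimal sketch of that step follows, with an invented row-stochastic matrix standing in for the RareRank transition matrix built from Equations 17 and 18.

```python
# Hedged sketch of the final RareRank step: power iteration to the stationary
# distribution pi with pi P = pi. The transition matrix here is invented; in
# RareRank it is built from the teleport and jump probabilities above.
import numpy as np

def stationary_distribution(P, eps=1e-10, max_iter=1000):
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        nxt = pi @ P                       # one step of the Markov chain
        if np.abs(nxt - pi).sum() < eps:
            return nxt
        pi = nxt
    return pi

# Toy row-stochastic transition matrix over four entities.
P = np.array([[0.05, 0.90, 0.025, 0.025],
              [0.05, 0.05, 0.45, 0.45],
              [0.90, 0.05, 0.025, 0.025],
              [0.25, 0.25, 0.25, 0.25]])
print(stationary_distribution(P).round(3))   # entity ranking values
```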

The characteristic feature of RareRank is that it adds a terminological topic ontology into the ranking equation, in addition to the knowledge base, to simulate a more structured and "rational" research environment; the relationships between entities simulate the behavior of a rational researcher. Computation of the RareRank scores is based on a set of teleport rules plus a jumping feature for building the transition probability matrix, and it is guaranteed to converge to an invariant distribution.

RareRank is an Entity Ranking model.

4.4 Semantic Re-Rank[12]

Wang et al. propose a re-ranking method that first fetches the top N results returned by an authoritative search engine such as Google, and then employs lexical semantic similarity to re-rank the results.

The paper critiques three limitations of keyword search. (i) A few keywords may not effectively convey the intention of the user. (ii) Homonyms, homophones, homographs 19, and different orderings of the keywords may result in inaccurate search results through exact keyword matching. (iii) If a webpage does not contain any of the search keywords but is highly relevant to the topic in search, it gets no attention from keyword search.

The semantic re-ranking method proposed in the paper is said to "downplay the limitations" of keyword search, better adapt to human thinking patterns, and thus attune the search results to the user's search intention.

The re-ranking procedure is carried out in three steps:

1. Converting the returned ranking position of a candidate document to an importance score;

19. http://www.vocabulary.com/articles/chooseyourwords/homonym-homophone-homograph/

2. Computing the semantic similarity score and the relevance score between each query keyword and each non-stop word in the document;

3. Computing the new ranking via a linear combination of the importance score and the similarity score of the candidate document.

The importance score θ of a candidate webpage w is calculated as:

θ(w) = \frac{1 - (w - 1)/N}{\log_2(w + 1)},    (19)

where w is the original ranking position and N is the number of webpages fetched for a query.

This formula is based on the Discounted Cumulative Gain (DCG)[27] formula that is commonly used in evaluating search result quality. It measures the usefulness, or gain, of a document based on its position in the result list.

The authors developed an ontology platform, WorkiNet, that integrates Wikipedia into WordNet 20. Beyond the words and concepts already collected in WordNet, WorkiNet adopted 1,782,276 new concepts from Wikipedia. The authors state that in order to exploit semantic similarity, an ontology must be specified first.

The algorithm calculates the semantic similarities between the query keywords and each non-stop word in the candidate document in order to obtain the relevancy score. Wang et al. borrowed Leacock and Chodorow's formula[28] (Equation 20) to calculate the similarity score π between two concepts c_i, c_j (words on the candidate webpage vs. the query keywords):

π(c_i, c_j) = max[-\log(\frac{len(c_i, c_j)}{2D})],    (20)

where len(c_i, c_j) is the shortest path from c_i to c_j in WorkiNet, and D is the maximum depth of the taxonomy.

20. http://wordnet.princeton.edu

The above formula does not consider where the two concepts appear in the ontology. However, it is a general understanding that a node's upper-level nodes in the ontology graph tend to be more general in concept than its sibling nodes (of a similar level of generalization or specification) or child nodes (of more specification). Hence, "sibling-concepts with larger depth are more likely to have semantic correlations than the higher ones"[29]. The authors accordingly proposed a more sophisticated formula:

π(c_i, c_j) = \frac{\log \frac{len(c_i, c_j)}{d(c_i) + d(c_j)}}{\log \frac{1}{2(D+1)}},    (21)

where d(c_i) is the length of the path from c_i to the root in the WorkiNet ontology graph.

After getting the similarity score of each word in a webpage, the relevance score σ of the webpage w is calculated as:

σ(w) = \frac{\sum_{j \in w} π(j) fq(j)}{\sum_{j \in w} fq(j)},    (22)

where fq(j) is the number of occurrences of the word j in webpage w.

Finally, the authors use a linear combination to determine the re-ranking of the candidate pages.

R(w) = α × θ(w) + (1 - α) × σ(w), α ∈ [0, 1],    (23)

where α is the adjusting parameter.
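A compact sketch tying Equations 19 and 21-23 together: each returned page gets a position-based importance score, a frequency-weighted relevance score from concept similarities, and a linear blend of the two. The similarity inputs (path lengths, concept depths, frequencies) are invented stand-ins for values that would come from WorkiNet, which this sketch does not model.

```python
# Hedged sketch of the semantic re-ranking pipeline (Equations 19, 21-23).
# Similarities are fed in as precomputed (path length, depths, frequency) tuples
# instead of being looked up in WorkiNet, which this sketch does not model.
import math

def importance(position, N):
    """Equation 19: position-based importance of the w-th returned page."""
    return (1 - (position - 1) / N) / math.log2(position + 1)

def similarity(path_len, depth_i, depth_j, D):
    """Equation 21: depth-aware similarity between two concepts."""
    return math.log(path_len / (depth_i + depth_j)) / math.log(1 / (2 * (D + 1)))

def relevance(word_stats, D):
    """Equation 22: frequency-weighted average similarity over a page's words."""
    num = sum(similarity(p, di, dj, D) * f for (p, di, dj, f) in word_stats)
    den = sum(f for (_, _, _, f) in word_stats)
    return num / den

def rerank_score(position, N, word_stats, D, alpha=0.5):
    """Equation 23: linear blend of importance and relevance."""
    return alpha * importance(position, N) + (1 - alpha) * relevance(word_stats, D)

# Invented example: page originally ranked 3rd out of 10, two content words
# with (shortest path, depth_i, depth_j, frequency) against the query concept.
stats = [(2, 5, 6, 4), (7, 3, 4, 1)]
print(rerank_score(position=3, N=10, word_stats=stats, D=16))
```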

In their experiments, the authors state that setting α to 0 gives the worst result, and the result improves as α increases.

Since the algorithm does a full-text search, the computation is intensive. However, it can be done off-line.

5 Conclusion and Future Research Direction

This report presents a summary of my learning process in search ranking algorithms and the Semantic Web. It is by no means a thorough and complete review of the field, nor do I try to draw conclusions about the competitive edges of these ranking algorithms. The algorithms surveyed in this paper are primarily generic, non-domain-specific, broad-topic, textual-term-based search algorithms. In addition to generic search, localized search (such as Yelp), industry search (such as airline search), language-specific search, and domain-specific search, to name a few, are other landscapes in the land of Internet search. Re-ranking is gaining attention in both industry and research as a way to provide high-quality and highly relevant search results.

It is my intent to conduct further research in the area of domain-specific semantic re-ranking algorithms and systems.

References

[1] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. 1999.
[2] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1):107–117, 1998.
[3] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.
[4] David Gibson, Jon Kleinberg, and Prabhakar Raghavan. Inferring web communities from link topology. In Proceedings of the ninth ACM conference on Hypertext and hypermedia: links, objects, time and space—structure in hypermedia systems, pages 225–234. ACM, 1998.
[5] Ronny Lempel and Shlomo Moran. SALSA: the stochastic approach for link-structure analysis. ACM Transactions on Information Systems (TOIS), 19(2):131–160, 2001.
[6] Yanhong Li. Toward a qualitative search engine. IEEE Internet Computing, 2(4):24–29, 1998.
[7] Yanhong Li and Larry Rafsky. Beyond relevance ranking: Hyperlink vector voting. In RIAO, volume 97, pages 648–650, 1997.
[8] Li Ding, Tim Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal Doshi, and Joel Sachs. Swoogle: a search and metadata engine for the semantic web. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 652–659. ACM, 2004.
[9] Tim Finin, Li Ding, Rong Pan, Anupam Joshi, Pranam Kolari, Akshay Java, and Yun Peng. Swoogle: Searching for knowledge on the semantic web. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 1682. AAAI Press; MIT Press, 2005.
[10] Thomas Franz, Antje Schultz, Sergej Sizov, and Steffen Staab. TripleRank: Ranking semantic web data by tensor decomposition. Springer, 2009.

[11] Wang Wei, Payam Barnaghi, and Andrzej Bargiela. Rational research model for ranking semantic entities. Information Sciences, 181(13):2823–2840, 2011.
[12] Ruofan Wang, Shan Jiang, Yan Zhang, and Min Wang. Re-ranking search results using semantic similarity. In Fuzzy Systems and Knowledge Discovery (FSKD), 2011 Eighth International Conference on, volume 2, pages 1047–1051. IEEE, 2011.
[13] Wendy Hall and Thanassis Tiropanis. Web evolution and web science. Computer Networks, 56(18):3859–3865, 2012.
[14] Steve Lawrence, C. Lee Giles, and Kurt Bollacker. Digital libraries and autonomous citation indexing. Computer, 32(6):67–71, 1999.
[15] Massimo Marchiori. The quest for correct information on the web: Hyper search engines. Computer Networks and ISDN Systems, 29(8):1225–1235, 1997.
[16] Jean-Loup Guillaume and Matthieu Latapy. The web graph: an overview. In Actes d'ALGOTEL'02 (Quatrièmes Rencontres Francophones sur les aspects Algorithmiques des Télécommunications), 2002.
[17] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the web. Computer Networks, 33(1):309–320, 2000.
[18] Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, and Christian Bizer. Graph structure in the web—revisited: a trick of the heavy tail. In Proceedings of the companion publication of the 23rd international conference on World Wide Web, pages 427–432. International World Wide Web Conferences Steering Committee, 2014.
[19] Jihyun Lee, Jun-Ki Min, Alice Oh, and Chin-Wan Chung. Effective ranking and search techniques for web resources considering semantic relationships. Information Processing & Management, 50(1):132–155, 2014.
[20] Franz Baader. The description logic handbook: theory, implementation, and applications. Cambridge University Press, 2003.
[21] Allan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal, and Panayiotis Tsaparas. Link analysis ranking: algorithms, theory, and experiments. ACM Transactions on Internet Technology (TOIT), 5(1):231–297, 2005.

[22] Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
[23] Vikas Jindal, Seema Bawa, and Shalini Batra. A review of ranking approaches for semantic search on web. Information Processing & Management, 50(2):416–425, 2014.
[24] Bülent Yener, Evrim Acar, Pheadra Aguis, Kristin Bennett, Scott L. Vandenberg, and George E. Plopper. Multiway modeling and analysis in stem cell systems biology. BMC Systems Biology, 2(1):63, 2008.
[25] Richard A. Harshman and Margaret E. Lundy. PARAFAC: Parallel factor analysis. Computational Statistics & Data Analysis, 18(1):39–72, 1994.
[26] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[27] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.
[28] Claudia Leacock and Martin Chodorow. Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database, 49(2):265–283, 1998.
[29] Michael Sussna. Word sense disambiguation for free-text indexing using a massive semantic network. In Proceedings of the second international conference on Information and knowledge management, pages 67–74. ACM, 1993.
