INSTITUT FÜR INFORMATIK
LEHR- UND FORSCHUNGSEINHEIT FÜR PROGRAMMIER- UND MODELLIERUNGSSPRACHEN

Ranking for Social Semantic Media

Georg Klein

Diplomarbeit

Start of the work: 7 September 2009
Submission of the work: 10 August 2010
Supervisors: Prof. Dr. François Bry, Klara Weiand, Christoph Wieser

Declaration

I hereby declare that I have written this diploma thesis on my own. I have used no sources or aids other than those indicated.

Munich, 10 August 2010 ...... Georg Klein

Zusammenfassung

This diploma thesis deals with the ranking of the results returned by search engines. We present a state of the art covering several datatypes, among them the Web, XML, RDF, and folksonomies. For each datatype, the computation of both the content score and the popularity score is presented. The content score indicates how well a retrieved object matches the query, while the popularity score measures the popularity of a page independently of the concrete query. For most datatypes, the relevance of two objects to each other is considered as well; this is of particular interest for RDF data. The purpose of this state of the art is to convey a basic understanding of the existing ranking technologies. Building on this, it becomes easier either to improve existing techniques or to design a ranking scheme for datatypes that have not yet been studied. In particular, to the best of our knowledge, there is no ranking scheme yet for Social Semantic Wikis, which have both semantic annotations and structured data. Some suggestions for such ranking schemes are discussed in the final section, Conclusion and future work.

Abstract

This diploma thesis is devoted to the ranking of results returned by search engines. We present a State of the Art which covers ranking for various datatypes, including the Web, XML, RDF, and folksonomies. For every datatype, the calculation of a popularity score as well as the computation of a content score is presented. For most datatypes we also discuss the relevance of two objects to each other, which is of particular interest for RDF data. The purpose of this State of the Art is to provide a solid understanding of the existing ranking technologies, making it easier either to improve an existing ranking algorithm or to use the technologies presented here as a basis for developing new algorithms for new datatypes. In particular, to the best of our knowledge, no ranking scheme is available for Social Semantic Wikis, which have both semantic annotations and structured data. We provide some suggestions for developing such a new algorithm in the section Conclusion and future work.

Acknowledgements

First of all I would like to thank my supervisors Klara Weiand and Christoph Wieser. Klara and Christoph helped me a lot with the structuring of this thesis and were a valuable source of information, inspiration, and advice. Furthermore, I am indebted to Prof. Dr. François Bry, who, with his great experience in supervising students, set up a sound time schedule in which progress was checked on a weekly basis. He furthermore insisted on free Sundays, which enabled me to cope with the workload. Finally, I would like to thank Jenna Arnold for the final reviewing of this thesis, as well as my friends and family for their understanding and continuous support.

Contents

1 Introduction
2 Fundamentals
  2.1 Ranking
  2.2 World Wide Web
  2.3 Web 2.0 - Social Software
  2.4 Structured Data
  2.5 Semantic Web
  2.6 Folksonomies
3 Datatype independent ranking schemes
  3.1 Content Score
    3.1.1 Vector Space Model
    3.1.2 Vector Space Model in Lucene
  3.2 Relevance of two resources
    3.2.1 Vector Space Model
    3.2.2 SimRank
  3.3 Spread Activation
4 Ranking on the World Wide Web
  4.1 Popularity Score
    4.1.1 Query-dependent Link Analysis
    4.1.2 Query-independent Link Analysis
  4.2 Content Score
    4.2.1 Google
    4.2.2 Hilltop
  4.3 User personalization
    4.3.1 Personalization based on what?
    4.3.2 Personalized Popularity Score
  4.4 Similarity of Documents
  4.5 Enhancing Web Search using Semantic Annotations
    4.5.1 Content Score
    4.5.2 Popularity Score
5 Ranking for XML Data
  5.1 Popularity Score
  5.2 Content Score
    5.2.1 Vector Space Model
    5.2.2 Distance between keywords
    5.2.3 Ranking query answer trees
    5.2.4 Distance between candidate item and keyword
  5.3 Top-k evaluation
6 Ranking for RDF Data
  6.1 Weighting RDF properties
  6.2 Popularity Score
  6.3 Content Score
  6.4 Relevance of the relation between two resources
    6.4.1 DBPedia Relationship Finder
    6.4.2 SemRank
    6.4.3 Maguitman et al.
    6.4.4 Corese
  6.5 Query ranking
  6.6 Ranking for RDF rule languages
7 Ranking for Folksonomies
  7.1 Popularity Score
    7.1.1 Overall Popularity Score
    7.1.2 Popularity Score for Users
    7.1.3 Popularity Score for Tags
    7.1.4 Popularity Score for Resources
  7.2 Content Score
  7.3 Similarity
    7.3.1 Similarity for two users
    7.3.2 Similarity for two tags
8 Rank Aggregation
  8.1 Similarity Scores
  8.2 Rank positions
    8.2.1 Borda
    8.2.2 Copeland
    8.2.3 Kemeny
    8.2.4 Local Kemenization
  8.3 Selected Aggregations
    8.3.1 Aggregation in XRank
    8.3.2 Aggregation in XSearch
    8.3.3 Aggregation in RSS
    8.3.4 Aggregation in Web Search Engines
9 Conclusion and future work

List of Figures

1 Developing paths to the Social Semantic Web [6, p. 8]
2 Small Web graph [37, p. 1]
3 Bipartite Graph [37, p. 3]
4 Hubs & Authorities [54, p. 6]
5 Simplified PageRank Calculation [51, p. 4]
6 Example Hyperlink Graph [43, p. 250]
7 An Example XML Document [27, p. 2]
8 XML fragment for XXL [68, p. 107]
9 DBPedia Relationship Finder: Ludwig van Beethoven and Vienna
10 Example RDF Graph [2, p. 119]
11 Example Ontology [47, p. 436]

List of Tables

1 Comparison of combination functions
2 Pair-wise comparisons of a, b, c, d
3 Comparisons won for a, b, c, d
4 Kemeny sequence scores


1 Introduction

Result rankings are an important and integral feature of most current Information Retrieval systems; web search engines, for example, are often evaluated on how highly a relevant result for a query is ranked. Fuzzy matching, that is, approaches that include not only strict matches but also other results which are relevant yet do not match the strict interpretation of the query, is closely related to ranking. Though the two need not be used in conjunction, they often are, in particular to allow a fuzzy matching engine to differentiate looser results from results that adhere more strictly to the query. The true power of ranking and fuzzy matching is unleashed only in combination: fuzzy matching extends the set of results, while ranking brings it into an order that makes the results easily consumable by the user even when the number of results is very large.

While ranking is widely used in web search and other IR applications, conventional query languages for (semi-)structured data such as XQuery, SQL, or SPARQL do not usually employ fuzzy matching or rank their results. As the amount of structured web data increases and the semantic web continues to emerge, the need for solutions that allow laymen to query structured data arises. This is true in particular in the context of the social semantic web, where a heterogeneous user base interacts to create (semi-)structured data. Research has been dedicated to combining web querying and web search and to introducing information retrieval methods to querying, for example in the form of extensions to conventional query languages, visual tools for exploratory search, extensions of web keyword search to include (some) structure, and keyword search over structured data.

One important issue that arises in information retrieval systems for structured data is how ranking can be realized in this context. Ranking of documents is a topic that has been extensively researched. However, ranking of structured data in general, and ranking in the context of Social Semantic Media in particular, are new areas that are just beginning to emerge. Developing a new algorithm requires a solid understanding of what currently exists. Therefore, this diploma thesis presents a state of the art of ranking. It gives a thorough summary and classification of approaches, intended to serve as a means for getting an overview of the area, for example as the basis for developing new algorithms or improving existing ones. Similarities and differences between approaches are pointed out and, where appropriate, illustrative examples are given.

This diploma thesis is organized as follows: The next section introduces the terminology and explains basic terms used throughout. In the third section, general approaches for ranking are introduced which are independent of the type of data to be ranked. Section 4 deals with ranking on the World Wide Web, that is, the ranking of web documents without regard to their structure. Data structured in XML trees is the topic of section 5, whereas sections 6 and 7 address ranking with respect to semantic annotations: annotations in RDF are covered in section 6, and folksonomic annotations, that is, tags, are handled in section 7. Finally, the last section discusses the combination of several rankings into one. All but the last section contain subsections on the popularity score and the content score.

The popularity score tries to measure the global popularity of a resource, whereas the content score measures the relevance of a resource with respect to a query. Most sections also deal with the relevance of two resources to each other. Though not directly related to ranking, this is still interesting because such relevance information can be used to build rankings.

2 Fundamentals

2.1 Ranking

Let D be the set of all considered documents. A ranking then refers to the ordering of a subset S of D [24, p. 615], i.e.

\[ ranking = [x_1 \geq x_2 \geq \dots \geq x_n] \quad \text{where } x_i \in S \text{ and } \geq \text{ is an ordering relation} \]

The ranking is restricted to a subset of documents if the set of all documents D is not accessible. For instance, on the Web not all documents are visible to Web crawlers, because some documents are only available in "databases that can only be accessed through parametrized query interfaces" [44, p. 394]. The content of the set S thus depends on the Web crawler used in the search engine. This is especially a problem in rank aggregation (see section 8). The definition above only takes positions into account. An extension of this definition is therefore to assign scores to the documents and to rank them according to these scores. Most search engines compute both a content score and a popularity score. The content score determines the relevance of a document to the query, whereas the popularity score measures the overall popularity of an object independently of the query. Consequently, the popularity score can be computed offline, whereas the content score can naturally only be calculated at query time.
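As a small illustration of ranking by scores, the following Python sketch orders documents by a weighted combination of the two scores. The linear combination and the weight alpha are illustrative assumptions, not a prescribed scheme; section 8 discusses actual aggregation methods.

```python
# Illustrative sketch: ranking as sorting by score. The linear
# combination and the weight alpha are assumptions for illustration;
# real engines aggregate scores differently (see section 8).

def rank(documents, content_score, popularity_score, alpha=0.7):
    """Return documents ordered by a combined score, best first."""
    def combined(d):
        return alpha * content_score(d) + (1 - alpha) * popularity_score(d)
    return sorted(documents, key=combined, reverse=True)
```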

2.2 World Wide Web

Sites on the World Wide Web (WWW) are stored in the HTML format. The most important feature of HTML pages is the ability to set hyperlinks to other pages; when a user clicks on a hyperlink, the destination page is loaded. Besides that, HTML enables rich formatting of the text. Both the hyperlink structure and the formatting are used for the ranking of pages (see section 4). However, creating and maintaining pages for the Web requires some expert knowledge, such as obtaining storage space for the Web site, editing the site using either a special editor or the markup language HTML, and finally uploading the pages to the storage space. These tasks are too complicated for ordinary users, so in the early days of the World Wide Web the content was dominated by companies.

2.3 Web 2.0 - Social Software

Social Software refers to Web pages where users can easily generate content or participate in some other way. The infrastructure, also known as social platforms, is provided by companies, so the user does not have to deal with the obstacles that are inherent in the creation of a Web page. According to Blumauer and Pellegrini [6], users can participate in the following ways: In Social Networks (Facebook, MySpace), users can create and manage their profiles in order to present themselves to other users. In Wikis (e.g. Wikipedia), users can publish content online, which is of great value for the community of users. Finally, on video platforms (YouTube), users can share self-made videos.

2.4 Structured Data

As there are many applications that store information in one way or another, the need for a standardized format for information interchange soon arose. A format which allows for structuring was chosen, because many applications operate on structured data. Such structuring is quite natural: a good example of structured data is a book, which consists of several chapters; each chapter can be separated into several sections, which themselves can be separated into paragraphs. This standardized format is called the eXtensible Markup Language (XML). The markup consists of opening tags (e.g. <chapter>) and closing tags (e.g. </chapter>). The level where the XML documents contain the actual data is called the instance level, whereas the level where the format of the XML document is specified is called the schema level. The format can be described with an XML schema language such as DTD or XML Schema. Since the tags can be nested, an XML document can also be seen as a tree, where the internal nodes contain the structural information and the leaf nodes contain the text. An example of an XML document can be found in figure 7.

2.5 Semantic Web

Although the XML standard provides a standardized and consistent way of creating structured documents, it still does not allow for computational reasoning, because the tags used to structure XML documents can be freely chosen. They do not have a meaning that can be computationally inferred; e.g. a computer cannot infer a relationship between two structural elements with synonymous names [31, pp. 29-30]. Apart from approaches which try to handle this problem at the XML level (see section 5.2.4), some sort of database is needed from which the names for the structural elements are obtained. This database is called an ontology. Instead of structuring the document using XML, elements in the document can be annotated with semantic annotations. In contrast to XML, the annotations are defined in the ontology and are therefore unique, i.e. all authors of Web pages are encouraged to use the same annotations. The ontology also contains taxonomic information, i.e. the fact that apple and pear are both fruits can be derived from the ontology. The semantic annotations have the form of triples which can be understood by machines in such a way that automatic reasoning is possible. An RDF triple consists of (subject, predicate, object). An example of such an RDF triple is (Ludwig van Beethoven, deathPlace, Vienna).

Triples form an RDF graph G = (V, E), where
• V, the nodes, are referred to as resources and
• E, the edges, are referred to as properties.
An RDFS schema [8] defines a vocabulary for an RDF graph. The vocabulary for the labels of the graph nodes is called class or literal type; the vocabulary for the graph's edges is called property type. Instead of urging the authors of Web pages to add RDF triples to their Web sites, another possibility is to provide templates for content which are transformable to RDF. For instance, in Wikipedia, standard templates ensure a common look for all pages with similar content, e.g. all pages concerning cities have an info-box which contains information about the number of inhabitants, the mayor, etc. Since this information is available for all cities, the dbpedia project1 transforms it into RDF triples.

2.6 Folksonomies

The ideas presented in the last section are intended for use by web masters, and the same problems arise as discussed at the beginning of section 2.3. To sum it up: RDF annotations are not compatible with the ideas of the Web 2.0. Therefore an easier mechanism for annotating was found: the tag. A tag normally consists of one word and can be applied to many kinds of objects, like pictures, bookmarks, and text. Such a tag has nothing to do with the tags used for structuring XML (section 2.4). All tags together form the folksonomy, a portmanteau of folk and taxonomy. Tags are used on Web 2.0 platforms for various purposes. On delicious.com2, users can tag their bookmarks; it is also possible to search for shared tags to retrieve bookmarks. It is expected that the Web 2.0, enriched by folksonomic tagging, will converge with the ideas of the Semantic Web (see section 2.5). In this scenario, the folksonomies are post-processed using Semantic Web technology. An overview of this process can be found in figure 1.

3 Datatype independent ranking schemes

In this section we present approaches which are independent of the datatype of the resources to be ranked. Some of these general approaches are modified in later sections to fit the context they are used in. The first section describes the Vector Space Model. The second section provides approaches that can be used to determine the relationship of two resources. This is not directly related to ranking, but approaches exist which use such relevance values in order to build a ranking.

1 http://dbpedia.org
2 www.delicious.com

[Figure 1 shows a quadrant diagram of the developing paths to the Social Semantic Web, spanned by the axes manual vs. automatic indexing: Web 1.0/Libraries (expert-centric knowledge organization, manual annotation using classification schemata, taxonomies, and thesauri), Web 2.0/Social Software (community-based knowledge organization through folksonomies, i.e. free annotations, and tag clouds), Semantic Web (semantic annotation of data with interoperable meta data, combined with ontologies for automatic processing), and Social Semantic Web (automatic post-processing of folksonomies and combination of distributed databases and tag clouds using Semantic Web technology).]

Figure 1: Developing paths to the Social Semantic Web [6, p. 8]

The last section presents Spread Activation and shows how it is used in combination with the datatype-dependent approaches described later in this thesis.

3.1 Content Score

This section explains how the relevance of a document to a query can be determined with the Vector Space Model. A practical implementation of the Vector Space Model is investigated in the second section.

3.1.1 Vector Space Model

In the Vector Space Model [59], each document is assigned a vector which represents the document's words. Computing the similarity of a document to a query is thereby reduced to computing the angle between their vectors. Ranking is achieved by sorting the results by this similarity.

The Vector Space Model was designed to measure the content-based relevance of a document to a query and is therefore presented as such here, i.e. this section deals with "documents", though there are adaptations which require a more generic view, e.g. the applications to folksonomic data described in section 7.2.

Let $D_i$ represent a document in the document space:

\[ D_i = (d_{i1}, \dots, d_{it}) \]

where $d_{ij}$ represents the weight of each individual term. The query $D_j$ is also represented as such a vector.

As a similarity measure $s(D_i, D_j)$, the angle between the vectors is taken:

\[ s(D_i, D_j) = \arccos \frac{D_i \cdot D_j}{\|D_i\| \, \|D_j\|} \]

There are various approaches for determining the weights of the individual terms ($d_{ij}$), depending on the application. In the next few paragraphs we describe some approaches common in Information Retrieval; adaptations to other domains require different weightings. The simplest approach to weighting is the Boolean representation, i.e. to assign 1 if the term occurs in the document and 0 if it does not. In this model, documents are either considered relevant or not relevant with respect to a query, so ranking is effectively disabled [43, p. 188]. This is sometimes referred to as the Boolean Model, as opposed to the Vector Space Model with more complex term weights. A weighting measure which works better in Information Retrieval is the term frequency (TF), defined as the relative frequency of the term in the document [59], i.e.

\[ \mathrm{TF} = \frac{\#\text{ occurrences of the term in the document}}{\#\text{ words in the document}} \]

This idea has one important drawback: it assigns high values to frequent words, which are not necessarily characteristic of the document. Therefore it is better to apply an approach called term frequency - inverse document frequency (TF-IDF), in which a weight is calculated by multiplying the term frequency (TF) with the inverse document frequency (IDF). The IDF is calculated as:

\[ \mathrm{IDF} = \frac{\text{total number of documents}}{\text{number of documents the term appears in}} \]

The IDF is often logarithmized, which dampens its growth and avoids extreme values that are difficult to deal with computationally. Salton et al compared TF with TF-IDF in [59] using benchmarks and came to the conclusion that TF-IDF works about 14% better than TF. TF and TF-IDF are not the only possible approaches for weighting terms; the interested reader is referred to the overviews presented in [58] and [55]. For applications of the Vector Space Model outside Information Retrieval, the weighting of terms has to be adapted. Some adaptations are presented in later sections; please refer to sections 5.2.1 and 7.2.
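To make the above concrete, the following Python sketch computes TF-IDF vectors and ranks documents by cosine similarity to a query. The toy corpus is purely illustrative; note that sorting by descending cosine is equivalent to sorting by ascending angle, since arccos is monotonically decreasing.

```python
import math

# Sketch of TF-IDF weighting and cosine similarity over a toy corpus.

def tf(term, doc):
    return doc.count(term) / len(doc)          # relative term frequency

def idf(term, docs):
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing) if containing else 0.0

def tfidf_vector(doc, docs, vocabulary):
    return [tf(t, doc) * idf(t, docs) for t in vocabulary]

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

docs = [["ranking", "of", "web", "pages"],
        ["ranking", "xml", "data"],
        ["semantic", "web", "data"]]
vocabulary = sorted({t for d in docs for t in d})
query = ["web", "ranking"]
q_vec = tfidf_vector(query, docs, vocabulary)
ranked = sorted(docs, reverse=True,
                key=lambda d: cosine(q_vec, tfidf_vector(d, docs, vocabulary)))
```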

16 3.1.2 Vector Space Model in Lucene

A practical implementation of the Vector Space Model can be found in Lucene [45], a framework for developing search engines. Lucene uses both the Boolean Model (the Vector Space Model with Boolean weighting) and the Vector Space Model: the query is first evaluated using the Boolean Model, and the returned documents are then ranked using the Vector Space Model. Lucene's default scoring is TF-IDF, but this is adjustable. Lucene also uses the cosine similarity to evaluate the relevance of a document to a query, or of a document to another document, but modifies it in the following way [46]: The normalization factor of the document is configurable, since normalizing to the unit vector removes the document length information, which might be problematic in some applications. The scoring also supports boost factors for both queries and documents. Boost factors for documents are of particular interest, since there are algorithms which assign popularity scores to documents (sections 4.1, 5.1, 6.2, 7.1); applying weights to queries can be useful for query ranking, which is investigated in section 6.5. The last factor the Lucene scoring formula takes into account is a coordination factor, which ranks results higher when more query terms are matched; this matters, of course, only if the query does not require all terms to be matched.
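The following fragment sketches how the factors just described could interact. It is a conceptual illustration only, not Lucene's actual API or exact formula; query_boost, doc_boost, and doc_norm are hypothetical stand-ins for Lucene's boosts and its configurable length normalization.

```python
# Conceptual sketch of the scoring factors described above; NOT
# Lucene's actual API or exact formula.

def conceptual_score(matched_terms, query_terms, term_weights,
                     query_boost=1.0, doc_boost=1.0, doc_norm=1.0):
    # Coordination factor: documents matching more query terms score higher.
    coord = len(matched_terms) / len(query_terms)
    raw = sum(term_weights[t] for t in matched_terms)  # e.g. TF-IDF weights
    return coord * query_boost * doc_boost * doc_norm * raw
```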

3.2 Relevance of two resources

This section describes two approaches for determining the relevance of two resources. The first approach is again the Vector Space Model from the previous section, which is also applicable to comparing two resources; it determines similarity according to content. The second approach determines the similarity of two resources in graph-shaped data. This is not directly related to ranking, but approaches exist which use such relevance values in order to build a ranking [48].

3.2.1 Vector Space Model

The Vector Space Model can also be used to determine the relationship of two documents to each other. The Vector Space Model described in section 3.1.1 determines the relationship of a query to a resource by representing the query as a resource vector. If a comparison of two resources is desired, the step in which the query is represented as a document vector is simply skipped: both resources are represented as vectors and compared directly.

3.2.2 SimRank

Jeh and Widom [37] define SimRank, which calculates a similarity between two resources located in an arbitrary structural context. It captures the idea that "two objects are similar, if they are referenced by similar objects" [37, p.3]. The approach is datatype-independent, i.e. it works for arbitrary objects and their relationships.

17 Figure 2: Small Web graph

In its simplest form, the degree of similarity between two nodes a and b is calculated as the average similarity between all the nodes which link to a and all the nodes which link to b. This value is multiplied by a decay factor C, 0 ≤ C ≤ 1, which is often set to 0.8. Consider for example a web graph where the page of a university links to two professors, ProfA and ProfB (figure 2 (a)). These professors are then considered similar to some degree. Without the decay factor, the formula would conclude that ProfA and ProfB are identical, which of course is not true. The computation can be visualized in a node-pair graph (figure 2 (b)). The similarity of 0.414, for instance, is derived from the node-pairs {Univ, Univ} and {Univ, StudentB} as the average similarity of the incoming edges: $C \cdot \frac{1 + 0.034}{2}$. If the objects can be separated into two different domains, the similarity computation is extended from one equation to two mutually recursive equations which are similar to the basic one. This is called Bipartite SimRank.

This algorithm operates on a bipartite graph (V1, V2, E), i.e. a graph where the nodes can be separated into two domains V1 and V2, and the edges E only go from V1 to V2.

Since nodes in V1 have only outgoing edges and nodes in V2 only incoming ones, computing SimRank based solely on in-edges is not feasible. Instead, two mutually recursive equations make up for this: one equation handles nodes in V1, the other nodes in V2.

Specifically, SimRank(A, B) for A, B ∈ V1 is computed as the average similarity of all the nodes pointed to by A and all the nodes pointed to by B. Likewise, SimRank(c, d) for c, d ∈ V2 is computed as the average similarity of all the nodes which point to c and all the nodes which point to d. Both similarities also contain the decay factor, as in ordinary SimRank. So, for instance, the similarity of sugar and eggs (figure 3) is computed as

\[ C \cdot \frac{SimRank(A, A) + SimRank(A, B)}{2} = C \cdot \frac{1 + 0.547}{2} \]

Figure 3: Bipartite Graph

Neither in the bipartite nor in the basic case can the computation be done directly, since it depends on values that in most cases have not been calculated yet. Therefore the authors suggest using an iterative fix-point algorithm, which converges quite fast (about 5 iterations). However, both the space and the time complexity are quadratic, which can be too much for big graphs such as the WWW. To overcome this complexity problem, the authors suggest a pruning algorithm whose presentation is beyond the scope of this diploma thesis.
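The following Python sketch implements the iterative fix-point computation of basic SimRank. The example graph reproduces the university example of figure 2 (node names as in the figure); after a few iterations, the similarity of ProfA and ProfB converges to about 0.414.

```python
# Iterative fix-point computation of basic SimRank.
# in_links maps each node to the nodes linking to it; C is the decay factor.

def simrank(nodes, in_links, C=0.8, iterations=10):
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                ia, ib = in_links.get(a, []), in_links.get(b, [])
                if not ia or not ib:
                    new[(a, b)] = 0.0
                    continue
                total = sum(sim[(i, j)] for i in ia for j in ib)
                new[(a, b)] = C * total / (len(ia) * len(ib))
        sim = new
    return sim

# The university example of figure 2:
nodes = ["Univ", "ProfA", "ProfB", "StudentA", "StudentB"]
in_links = {"ProfA": ["Univ"], "ProfB": ["Univ", "StudentB"],
            "StudentA": ["ProfA"], "StudentB": ["ProfB"], "Univ": ["StudentA"]}
sim = simrank(nodes, in_links)
print(round(sim[("ProfA", "ProfB")], 3))   # approximately 0.414
```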

3.3 Spread Activation

Spread Activation deals with the marking of nodes in a graph. Starting from a set of nodes which carry initial activation values, these values spread along the edges until certain conditions are satisfied. This is not directly related to ranking, but some ranking algorithms make use of it in a preprocessing step; these are explained in the second part of this section.

Spread Activation is a technique that was originally invented in the field of cognitive psychology [17]. Its basic functionality is as follows: First, a start set of nodes is determined, and an initial activation value is assigned to these nodes. Spreading then occurs along the edges reachable from the start set. The activation value of a reached node is calculated from factors including (i) the weight of the edge and (ii) the activation value of the prior node, such that the activation value of the current node is lower than the activation values of the prior ones. The termination condition stops the spreading process once the activation value has fallen below a certain threshold. In order not to traverse the whole graph, some of the following stopping constraints are used in addition to the termination condition [21]: The distance constraint ensures that spreading does not exceed a certain distance from the start set. The fan-out constraint stops spreading at nodes with a very high connectivity. The path constraint allows for spreading across preferential paths, which requires the edges to be weighted. Finally, the activation constraint adjusts the thresholds of the termination condition during spreading. The result of the spreading process is a graph in which certain nodes are marked; most algorithms collect the marked nodes for further processing. According to [21], interest in Spread Activation declined in the 1990s. Recently, however, Spread Activation has become one of the possible preprocessing steps for ranking algorithms.
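Before turning to these algorithms, a minimal sketch of the basic spreading mechanism may help. It uses a threshold-based termination condition and a distance constraint; the graph encoding, edge weights, and parameter values are illustrative assumptions.

```python
from collections import deque

# Minimal sketch of spread activation with a threshold termination
# condition and a distance constraint; weights and parameters are
# illustrative assumptions.

def spread_activation(graph, start, threshold=0.1, max_distance=3):
    """graph: node -> list of (neighbor, edge_weight in (0, 1]);
    start: node -> initial activation value."""
    activation = dict(start)
    queue = deque((node, 0) for node in start)
    while queue:
        node, dist = queue.popleft()
        if dist >= max_distance:                 # distance constraint
            continue
        for neighbor, weight in graph.get(node, []):
            value = activation[node] * weight    # decreases along the path
            if value < threshold:                # termination condition
                continue
            if value > activation.get(neighbor, 0.0):
                activation[neighbor] = value
                queue.append((neighbor, dist + 1))
    return activation   # the marked nodes and their activation values
```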

Some of these algorithms are presented in the next few paragraphs.

ORank [63] is a system which enhances Web search using RDF meta-data (section 4.5.1). Spread Activation is used here to find related concepts. Starting from a node whose activation value is set to 1, the RDF graph is traversed by setting the activation value of each visited node to the product of the weight of the edge that led to this node and the activation value of the previous node. The further processing of these results is discussed in section 4.5.1.

RSS [49] applies the PageRank algorithm to RDF graphs (see section 6.2). Spread Activation is used in the following way: the initial set of nodes and their activation values are computed using both a content and a popularity score of the nodes, as search engines do (see section 8.3.3). The activation value of the next node is calculated as follows: the weight of each incoming edge is multiplied with the weight of the previous node, and these products are summed up over all incoming edges. Additionally, a decay factor and the distance to the starting node are taken into account, which serve as stopping constraints.

Rocha et al. [56] provide a hybrid approach for searching the Semantic Web. They use hybrid spread activation, a combination of the spread activation of semantic and associative networks. Hybrid spread activation works on a hybrid instance graph, where the edges have both a label and a weight. An initial set of nodes is taken from the ontology, but how this set is obtained is not explained; the activation values of these initial nodes have to be set up by the administrator. The weight of an edge is twofold: the first part is its actual numerical weight, the second the relative weight associated with the symbol of the edge. This relative weight has to be set up by a knowledge engineer. In addition to the distance constraint, a new kind of constraint is proposed by the authors: a concept type constraint prevents the spreading from going through nodes of a certain concept type.

4 Ranking on the World Wide Web

With the birth of the World Wide Web and its great success, the information it contained soon grew to vast dimensions, and search engines became crucial for the usability of the Web. The number of results returned by a search engine can still be enormous, necessitating a proper ranking. Since a high ranking is an important economic factor, companies try to boost their rank. Therefore it is necessary to determine a ranking by exploiting some kind of general agreement about the importance of a page. This general agreement is reached by considering hyperlinks between pages as recommendations and deriving a ranking from the hyperlink structure. These approaches are considered in the first section. The second section deals with the degree of matching between the query and the documents: for a document to get the highest ranking, it is not sufficient that it contains all the keywords; the keywords must also appear at certain prominent positions. The third section describes how the similarity of two documents can be computed, and the last section explains how search on the World Wide Web can be enhanced using RDF or folksonomic meta data.

20 4.1 Popularity Score

One of the key concepts of the Web is the fact that everyone can publish content without external review. This leads to certain problems, for example spam pages. Classical publications are not affected, because it can be assumed that certain quality standards are met, since publishers review the content prior to publication. A popularity score measures the overall importance of Web pages by exploiting the link structure of the Web. A link from site A to site B is considered a recommendation from the author of A for the author of B. Sites that have many links pointing to them should be more popular than sites with fewer recommendations. These recommendations account for the problem of spam pages: a page which mainly consists of advertisements will presumably not receive many recommendations. Note that the algorithms for link analysis are not necessarily query-independent; some of them rely on obtaining a query-focused sub-graph of the web from a search engine. We first examine query-dependent and then query-independent algorithms.

4.1.1 Query-dependent Link Analysis

Carrière and Kazman [12] developed an algorithm for link analysis which was later extended by Jon M. Kleinberg [39] into the Hubs and Authorities algorithm. Both approaches gather a start set from an existing search engine at query time.

Both algorithms depend on the query because they start with a subset of the Web graph, determined as the t highest ranked pages (the root set) returned by a search engine. In both algorithms this set is augmented by the pages which have incoming or outgoing links to or from the root set, yet only Kleinberg's algorithm provides an upper bound d for the number of pages included on account of a single page. Common values are t = 200 and d = 50. A further speciality of Kleinberg's algorithm is the removal of navigational links. For the ranking, both algorithms explore the link structure of the sub-graph, but in different ways. Carrière and Kazman rank the pages according to the connectivity of the nodes in the graph, i.e. the node with the most edges is ranked highest. However, simply ranking by connectivity does not distinguish between pages which are universally popular and pages which belong to the topic of the graph. An algorithm which is able to make this distinction is Kleinberg's HITS algorithm. It makes the assumption that there are two types of pages on the web: hubs and authorities. A good authority is a page which is pointed to by many hubs, and a good hub is a page which points to many authorities; this is a mutually reinforcing relationship. Hubs serve as organizers of the web, whereas authorities are popular pages. Figure 4 illustrates the relationship between hubs and authorities.

21 Figure 4: Hubs & Authorities

For every page, two values are computed, quantifying the degree to which it can be considered a hub and the degree to which it can be considered an authority. The authority value depends on the hub values and vice versa. The pages are ranked according to the authority value. Universally popular pages are not ranked as high as pages matching the topic, because only pages matching the topic are expected to have enough hubs in the query set.
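A compact Python sketch of this mutually reinforcing computation, run on the query-focused sub-graph, might look as follows; the normalization step and the iteration count are standard choices rather than values prescribed in [39].

```python
import math

# Sketch of the hub/authority iteration: authorities are reinforced by
# hubs pointing to them, hubs by the authorities they point to.

def hits(nodes, out_links, iterations=50):
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    in_links = {n: [] for n in nodes}
    for n in nodes:
        for t in out_links.get(n, []):
            in_links[t].append(n)
    for _ in range(iterations):
        auth = {n: sum(hub[m] for m in in_links[n]) for n in nodes}
        hub = {n: sum(auth[t] for t in out_links.get(n, [])) for n in nodes}
        for scores in (auth, hub):   # normalize to keep values bounded
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub   # pages are ranked by their authority value
```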

The Hilltop algorithm, suggested by Bharat & Mihaila [4], is similar to HITS but is more selective about choosing the hubs, which are called experts here. First, a list of the most relevant experts with respect to the query is computed. The targets are then ranked according "to the number and relevance of non-affiliated experts that point to them" [4, p.4].

An expert is defined as a "page that is about a certain topic and has links to many non-affiliated pages on that topic" [4, p.4]. Two pages are considered affiliated if they are hosted by the same organization; in particular, two pages are affiliated if their hosts' IP addresses share the first 3 octets or if certain parts of their URLs coincide. The other criterion, belonging to the same topic, is only considered if a broad classification of the page exists. Such a classification can, for example, be found in the Open Directory Project (ODP) [50], a free web catalogue which tries to organize the web using a taxonomy of categories. If a broad classification is available, a page is additionally required to point to many pages sharing the same topic in order to be considered an expert.

In any case, a page has to point to at least k pages (e.g. k = 5) to be considered an expert. This section covered the determination of experts independently of the query; section 4.2.2 explains how to choose experts based on a query.

4.1.2 Query-independent Link analysis

PageRank is a query-independent ranking scheme. It assigns weights to web pages based on the link structure of the Web, using a reinforcing notion of importance: pages become even more important if they are pointed to by important pages.

This ranking scheme became famous due to the success of the Google search engine, whose ranking algorithm makes use of PageRank [51]. Its name is not derived from its ability to rank pages, but from one of the inventors of PageRank, Larry Page. PageRank models the Web as a directed graph with nodes and edges. Weights are assigned to the nodes, and the weight of a node is distributed among its outgoing edges. Recall that links are regarded as recommendations, so the incoming links of a page are important; they are referred to as backlinks. For a web page u, a simple version of PageRank can be given as

\[ R(u) = \sum_{v \in B_u} \frac{R(v)}{|F_v|} \qquad (1) \]

where $B_u$ is the set of pages pointing to u; the formula thus sums up the contributions of all pages pointing to u. The weight of a backlinking page v is spread evenly among its outgoing links $F_v$. Intuitively, this formula can be thought of in terms of a random surfer walking on the Web graph by clicking links. Refer to figure 5 for an illustration of how this calculation is done.

This formula still has some problems. The first problem are pages which do not have any out-links, called rank leaks [62]. Another problem are rank sinks: a rank sink is a cycle of directed edges which does not point to any node outside of the cycle; such a cycle accumulates rank but never distributes any rank to the outside [51]. Both problems are due to the fact that the Web is not fully connected. The problem of rank leaks is solved by either (i) removing both the pages without out-links and the links pointing to them during the PageRank calculation and adding them back in afterwards [9], or (ii) artificially adding links from every rank leak to every other page on the web. Dangling links, i.e. links which point to a page that does not exist, are removed; they can be due to mistakes by the webmaster of the page or to insufficient crawling. To solve the problem of rank sinks, the formula above is adapted by introducing a rank source P(u), which is a vector over Web pages.

\[ R(u) = (1 - d)\,P(u) + d \sum_{v \in B_u} \frac{R(v)}{|F_v|} \qquad (2) \]

Figure 5: Simplified PageRank Calculation

P(u) is the uniform distribution over Web pages, $P(u) = \frac{1}{N}$, if no personalization is desired (see section 4.3.2 for personalization). d is the probability that the random surfer follows a subsequent link rather than jumping to a new site; d is often set to 0.8. Formulae 1 and 2 are both systems of equations, and an efficient way to solve such systems is to use linear algebra. We use the example graph in figure 6, taken from [43], to illustrate the key concepts of this calculation. The first step is to create a state transition matrix based on the hyperlink graph. Such a matrix represents the probabilities that the random surfer jumps from page i to page j: conceptually, these probabilities are $\frac{1}{\#\text{ outgoing links of } i}$ if there is an edge from i to j, and 0 if there is no edge.


Figure 6: Example Hyperlink Graph

For figure 6 the state transition matrix corresponding to equation 1 is given as equation 3.

                to page
             1    2    3    4    5    6
  from page
      1      0   1/2  1/2   0    0    0
      2     1/2   0   1/2   0    0    0
      3      0    1    0    0    0    0      (3)
      4      0    0   1/3   0   1/3  1/3
      5     1/6  1/6  1/6  1/6  1/6  1/6
      6      0    0    0   1/2  1/2   0

In equation 3, the probability to go from page 5 to any other page is not 0, since an artificial link to every page is introduced; this is one of the possibilities discussed above for overcoming the problem of rank leaks. The matrix corresponding to equation 2 with d = 0.8 is given as equation 4:

                to page
              1      2      3      4      5      6
  from page
      1      1/30  13/30  13/30   1/30   1/30   1/30
      2     13/30   1/30  13/30   1/30   1/30   1/30
      3      1/30  25/30   1/30   1/30   1/30   1/30     (4)
      4      1/30   1/30   9/30   1/30   9/30   9/30
      5      1/6    1/6    1/6    1/6    1/6    1/6
      6      1/30   1/30   1/30  13/30  13/30   1/30

This represents the transition matrix of the example hyperlink graph in figure 6, where artificial edges with low probability are introduced, corresponding to the case that the random surfer jumps to a new site instead of following a subsequent link. The graph thereby becomes strongly connected, which is important for the computation. In fact, the calculation of PageRank is done using linear algebra and a transposed version of matrix 4. The power method in particular is very efficient for computing the PageRank values. Good references for the actual computation are [43], [10] and, with more mathematical detail, [11]. The authors of [51] claim that PageRank takes about 52 iterations for a 322 million link database. In 2008, Google's index contained 1 trillion (1,000,000 million) Web pages [1]; still, PageRank is run several times a day.
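For illustration, the following Python sketch applies the power method to the example graph of figure 6. The uniform rank source and d = 0.8 follow equation 2, and the rank leak (page 5) is given artificial links to every page, as in matrices 3 and 4.

```python
# Power-method sketch of equation 2 with a uniform rank source
# P(u) = 1/N. A page without out-links (a rank leak) distributes its
# weight evenly to all pages, as in the example matrices.

def pagerank(out_links, nodes, d=0.8, iterations=50):
    N = len(nodes)
    rank = {n: 1.0 / N for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - d) / N for n in nodes}
        for v in nodes:
            targets = out_links.get(v) or nodes   # leak: artificial links
            share = d * rank[v] / len(targets)
            for u in targets:
                new[u] += share
        rank = new
    return rank

# The example hyperlink graph of figure 6 (page 5 has no out-links):
nodes = [1, 2, 3, 4, 5, 6]
out_links = {1: [2, 3], 2: [1, 3], 3: [2], 4: [3, 5, 6], 6: [4, 5]}
print(pagerank(out_links, nodes))
```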

Kamvar et al [38] provide an extension to PageRank which speeds up the computation by exploiting the nested block structure of the web.

The authors observe that the structure of the web is as follows: most links on the web connect pages on the same host, and fewer links interlink different hosts. Their approach is therefore threefold: First, they compute local PageRanks for each host, which only consider the link structure inside the host, since the PageRank of these pages is expected to depend mainly on the links within the host. The local PageRank is run on a sub-graph of the web restricted to the pages of the host; links to pages outside the host are removed. The relative importance of each host (that is, block) is referred to as its BlockRank. It is estimated by running the PageRank algorithm on a clustered graph: the vertices of this graph are the hosts, and all edges between two hosts are clustered into one edge, whose weight is the sum of the former edges, each weighted by the local PageRank of the page the link originated from. The global PageRank is estimated by combining the local PageRanks and the BlockRank. The weight of a page is first estimated as its local PageRank weighted by the BlockRank

of the block the page resides in. This first estimate of the PageRank vector is then used as the start vector for the PageRank algorithm over the whole Web graph. Note that the only difference between this approach and the original PageRank algorithm is the start vector. The authors expect the global PageRank algorithm to converge faster and conducted experiments which show that the calculation is sped up by a factor of 2 compared to the original PageRank algorithm.
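A minimal sketch of this combination step, assuming local PageRanks and BlockRanks have already been computed, might look as follows; all names here are hypothetical. The resulting vector would serve as the start vector for a global PageRank run such as the one sketched above.

```python
# Sketch of the BlockRank initialization step (names are hypothetical):
# the start vector for the global PageRank run is the local PageRank
# of a page weighted by the BlockRank of its host.

def blockrank_start_vector(local_pagerank, blockrank, host_of):
    """local_pagerank: page -> score within its host (each host sums to 1);
    blockrank: host -> relative host importance; host_of: page -> host."""
    return {page: score * blockrank[host_of[page]]
            for page, score in local_pagerank.items()}
```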

Comparison of the Approaches for Popularity Score

HITS uses a search engine to determine a set of pages which correspond to the keywords and then determines popular sites which are specific to this set. Hilltop also relies on a search engine, but additionally works only for subjects of broad coverage, since for queries which return only a few results it is likely that no experts will be found. Therefore Hilltop should be used in conjunction with another algorithm to provide reasonable results; for example, [53] assumes that Hilltop is used in conjunction with PageRank in the Google search engine. PageRank, on the other hand, is a measure of the universal popularity of sites, so the final ranking consists of PageRank and another IR-like ranking (see section 8.3.4) to find pages matching the query. The results of HITS do not undergo a second IR-specific ranking, so the textual content of pages is considered only once, namely at the construction of the root set. Here HITS relies on the ranking of a search engine, so it can only enhance the ranking of a search engine rather than compute a new one.

4.2 Content Score

The content score indicates how closely a document matches a query. How a content score is probably calculated in the Google search engine is the subject of the first section; unfortunately, Google's algorithms are a well-kept secret, so this section might not be accurate. The second section deals with identifying the experts in the Hilltop algorithm based on the query.

4.2.1 Google

The Google search engine uses the following information to calculate a content score: position of the word, font, capitalization, and anchor text [9].

The sources for this section are a paper from 1998 [9], which likely does not accurately reflect Google's current practices, and the Web site of a Search Engine Optimization company [60] [61].

An anchor text is the text that is placed in the link element. Links in HTML have two components: the address of the page the link points to, and the text associated with the link. This anchor text is normally underlined when the user browses the Web site. Anchor texts are associated with the pages the links point to. For query evaluation and ranking, an inverted index is used. For every keyword, this index contains a list of documents that contain the keyword, together with some meta data, including the position of the word in the document, the font, and the capitalization information. Based on the meta data, a type is assigned to each hit. The types are of the form title, anchor, URL, large font, small font, etc. The type weight vector represents the weight of each type. According to a Search Engine Optimization company, the order of importance of the type weights is likely as follows [60] [61]:

• Occurrence of the keywords in the title element
• Occurrence of the keywords in the domain name of the URL
• Occurrence of the keywords in the path of the URL
• Occurrence of the keywords in headings
• Occurrence of the keywords in file-names of images, PDFs
• Formatting of the keywords (bold, italics)
• Occurrence of the keywords in the beginning or the end of the text
• Occurrence of the keywords elsewhere

For a single-keyword query, the occurrences of the several types are counted for each document, resulting in a count weight vector. The counts are weighted linearly up to a threshold, above which the weights do not increase further. An IR score is calculated as the dot product of the type weight vector and the count weight vector. For multiple-word queries, the distance between the words is considered additionally.
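The following sketch illustrates the described dot product of type weights and capped count weights. The concrete weights and the cap are hypothetical stand-ins chosen to mirror the ordering above, not Google's actual values.

```python
# Illustrative sketch of the IR score as a dot product of type weights
# and capped count weights. Weights and cap are hypothetical values.

TYPE_WEIGHTS = {"title": 10, "url_domain": 9, "url_path": 8, "heading": 7,
                "file_name": 6, "formatting": 5, "text_edges": 4, "plain": 1}
COUNT_CAP = 8   # counts are weighted linearly only up to a threshold

def ir_score(hit_counts):
    """hit_counts: hit type -> occurrences of the keyword in that type."""
    return sum(TYPE_WEIGHTS[t] * min(c, COUNT_CAP)
               for t, c in hit_counts.items() if t in TYPE_WEIGHTS)

# Example: a keyword occurring once in the title and three times in plain text.
print(ir_score({"title": 1, "plain": 3}))   # 10*1 + 1*3 = 13
```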

4.2.2 Hilltop

In the Hilltop algorithm, the experts for a query are identified based on the degree of matching between their phrases and the keywords. Hence an expert score based on the following aspects is computed:
• the number of keywords the phrase contains (the more the better)
• the type of the phrase (title, heading, anchor text)
• the number of surplus terms in the phrase (the fewer the better)
However, the ranked list returned does not consist of experts, but of targets. Targets are pages which are pointed to by at least two non-affiliated experts, where the target is also not affiliated with the experts. To compute the target score, a graph G = (E, T) of experts and targets is built, whose directed edges correspond to some of the hyperlinks of the experts.

Which hyperlinks are chosen from an expert depends on the place where the phrase matching the keywords was found: if the keywords were found in the title of the expert page, all hyperlinks are considered; if they were found in a heading, only the hyperlinks of the following paragraph are considered; and if the keywords were found in an anchor text, only the hyperlink in the anchor is considered. An edge score is computed based on the number of keywords which matched the phrase of the expert. The final ranking of a target is computed by summing up the scores of its in-edges after removing affiliated edges.

4.3 User personalization

The personalization of the ranking of search results is one of the latest developments in search engines. There are several degrees of personalization, ranging from language dependency to full personalization for individual users. These different degrees are described in the first section. The second section deals with the computational problems of creating personalized popularity scores.

4.3.1 Personalization based on what?

True personalization means that the ranking is essentially user-dependent. This is one of the recent features of search engines. Google, for instance, personalizes the searches of its users: if the user is signed in to his Google account, the history of visited pages stored in that account is used for personalization; if the user is not signed in, an anonymous cookie is used to track the visited pages. The latter strategy was introduced in December 2009 [32]. Another way to personalize search results is localization, where the user's locale is guessed based on the user's IP address. For example, when visiting www.mail.ru, it presents the local weather of Berlin, and not of a Russian city as would be done without localization. Such localization information could also be used to influence the computation of PageRank. The third important way in which search results are personalized is to select a personalization vector which corresponds to the languages the user understands. The calculation of language-dependent rankings is a smaller problem than personalization for every single user, because the number of users is much bigger than the number of languages. All algorithms presented here use a personalized version of the PageRank algorithm; the problems lie mainly in the computation and storage of the personalized PageRank vectors.

4.3.2 Personalized Popularity Score

Personalization is achieved by building a personalized popularity score. However, it is not possible to build a personalized popularity score for every single user, because storing these scores would occupy too much space and computing them would take too much time. As already suggested by Page himself [51], PageRank can be personalized. In this case, the uniform vector over all Web pages, $P(u) = \frac{1}{N}$, is replaced by a personalization vector

$P_{pers}(u)$:

\[ R_{pers}(u) = (1 - d)\,P_{pers}(u) + d \sum_{v \in B_u} \frac{R_{pers}(v)}{|F_v|} \qquad (5) \]

Recall that the computation of PageRank for the whole Web takes several hours, so calculating fully personalized PageRanks for every user is not possible: there is neither enough computing capacity nor enough storage capacity available. Therefore, following [30], some approaches which approximate a fully personalized PageRank are presented.

Topic-sensitive PageRank by Haveliwala [29] calculates 16 PageRanks, each influenced by a certain topic. At query time, the topic of the query is determined, and the final popularity ranking is computed based on the biased PageRanks and the topic of the query. This approach does not allow for full user personalization; instead, the personalization is approximated by a combination of the 16 topic-sensitive PageRanks.

In the approach by Haveliwala [29], the personalization vectors are obtained from the Open Directory Project (ODP) [50], a free web catalogue which tries to organize the Web into categories. If u is found in the topic's category, P(u) = 1/N, else P(u) = 0, where N denotes the number of URLs found in the topic. The intuition in the random surfer model is that the random surfer, in case he gets bored, does not jump to an arbitrary site on the web but is restricted to the sites in the topic's category. This concludes the calculation of the topic-specific PageRanks; what follows is how a personalization can be approximated using these 16 topic-specific PageRanks and the query. The ranking is calculated as the expected value:

\[ \mathrm{Ranking}(query, d) = \sum_j P(topic_j \mid query) \cdot \mathrm{Rank}(j, d) \]

Rank(j, d) denotes the ranking using topic j for the document d. Since there are 16 different topics, there are 16 such rankings.

$P(topic_j \mid query)$ is the probability that the query belongs to topic j. It is calculated using Bayes' theorem.

\[ P(topic_j \mid query) = \frac{P(topic_j) \cdot P(query \mid topic_j)}{P(query)} \]

The authors chose to make $P(topic_j)$ uniform, i.e. $P(topic_j) = \frac{1}{16}$. $P(query \mid topic_j)$ is computed by counting the occurrences of the query terms in the documents located in the ODP category $topic_j$. $P(query)$ is neglected since it is the same for every topic. If more information about the context of the query is available (e.g. because the user starts a query by highlighting keywords in a text u), the query is expanded by u.
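Putting these pieces together, a sketch of the expected-value ranking might look as follows. The per-topic term counts are a hypothetical stand-in for the ODP statistics, and smoothing of zero counts is omitted for brevity.

```python
# Sketch of the expected-value ranking over topic-sensitive PageRanks.
# term_counts is a hypothetical stand-in for the ODP term statistics.

def topic_probabilities(query_terms, term_counts):
    """P(topic_j | query) ~ P(topic_j) * P(query | topic_j), uniform prior."""
    prior = 1.0 / len(term_counts)
    probs = {}
    for topic, counts in term_counts.items():
        total = sum(counts.values()) or 1
        p_query = 1.0
        for term in query_terms:
            p_query *= counts.get(term, 0) / total
        probs[topic] = prior * p_query
    norm = sum(probs.values()) or 1.0    # P(query) cancels out here
    return {t: p / norm for t, p in probs.items()}

def personalized_ranking(query_terms, document, topic_ranks, term_counts):
    """Expected rank value of `document` over all topics."""
    probs = topic_probabilities(query_terms, term_counts)
    return sum(p * topic_ranks[topic][document] for topic, p in probs.items())
```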

Another approach, which allows for more personalization vectors, is "Scaling Personalized Web Search" by Glen Jeh and Jennifer Widom [35]. In their approach, the personalization vectors $P_{pers}(u)$ can be equivalently represented as a linear combination of vectors which are computed at indexing time.

A hub set H is a subset of web pages, and a preference set P ⊂ H is the set of pages personalization is based on, restricting $P_{pers}(u)$ to be a vector over pages in P. Suggestions for the selection of H are either popular pages or the categories of a web directory such as the ODP [50]. The authors introduce Personalized PageRank Vectors (PPVs), where each PPV corresponds to the solution of the PageRank computation using the personalization vector $P_{pers}(u)$. The authors prove that it is possible to create a set of basis vectors such that the PPVs are retrievable from the basis vectors. They further show how the basis vectors can be decomposed in order to store them efficiently.

Kamvar et al [38]’s approach for speeding up the computation of PageRank (see section 4.1.2) can also be used to speed up personalization.

Recall that in their approach the global PageRank is calculated from local PageRanks, which are based on the relationships of pages within a host, and a BlockRank, which reflects the relationships of the hosts to each other. The authors limit the choice of the random surfer to jumping to an arbitrary host instead of an arbitrary page in case he gets bored. It is therefore sufficient to individualize only the BlockRanks, so the local PageRanks do not have to be personalized. However, the computation of the global PageRank (the last step in the algorithm described in section 4.1.2) requires a vector P(u, pers) which reflects the jump probabilities in case the random surfer gets bored. The authors derive this probability as $p(j) = p(J) \cdot p(j \mid J)$, where $p(J)$ is known and $p(j \mid J)$ is the same as the local PageRank vector of the host J.

4.4 Similarity of Documents

Sometimes it is desirable not only to find the closest match of a query to a document, but also to calculate a sort of distance between two documents. Kleinberg's [39] Hubs and Authorities algorithm (see section 4.1.1) can also be used to determine pages similar to a given page. As described in section 4.1.1, a keyword string is sent to a search engine to obtain a set of pages as a root set. If it is desired to find pages similar to a page p, the root set can be constructed by asking the search engine not for pages matching a keyword, but for pages similar to p. The rest of the algorithm works as described in section 4.1.1. Approaches which determine the similarity of two documents using the documents' content are not specific to the Web and were therefore already discussed in section 3.2.

30 4.5 Enhancing Web Search using Semantic Annotations

In this section we investigate approaches which use semantic annotations defined elsewhere to rank ordinary Web sites. These approaches do not cover the case where the author created a Web page and enriched it with semantic annotations; instead, the semantic annotations are gathered from external web sites, e.g. delicious.com. Approaches which rank pages according to semantic annotations created by the author of the page are examined in sections 6 and 7. In the following two sections, both the content score and the popularity score are considered.

4.5.1 Content Score

One approach which tries to enhance web search using RDF meta-data is ORank [63], which uses the Vector Space Model (see section 3.1.1). It is a semantic search engine for non-semantic Web documents. The words of the Web documents are automatically annotated with labels from an ontology; the document vector then consists solely of these labels. The query vector is built by expanding the query using Spread Activation (3.3), which results in a query vector that contains not only the concepts of the query but also the related ones. Bao et al [3] make use of the social annotations users stored in the delicious.com bookmarking tool to evaluate the similarity between a query and a document. A simple measure they provide is the number of terms the query and the annotation share. Another measure they provide is based on the idea that tags which often appear together should be associated to some degree. Queries should then be expanded not only to the tags matching the keywords, but also to the tags associated with the keywords. One of the factors used in the ranking of the query results is the degree of association between the tags. To quantify this association, the authors provide SocialSimRank, which evaluates the similarity between two annotations. It is an adaptation of the bipartite version of SimRank (see section 3.2.2).
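To make the first, simple measure concrete, here is a small Python sketch of the term-overlap count together with an expansion step driven by a tag-association score; the function names and the threshold parameter are illustrative choices of ours, and sim stands in for an association measure such as SocialSimRank.

def term_overlap(query_terms, annotations):
    # Simple measure: number of terms shared by query and annotations.
    return len(set(query_terms) & set(annotations))

def expanded_score(query_terms, annotations, sim, threshold=0.5):
    # For each annotation tag, take its best match against the query:
    # 1.0 for an exact match, otherwise the association score sim(q, a).
    score = 0.0
    for a in set(annotations):
        best = max(1.0 if a == q else sim(q, a) for q in query_terms)
        if best >= threshold:          # threshold is our own assumption
            score += best
    return score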

4.5.2 Popularity Score

In [3], Bao et al. adapt PageRank in such a way that it profits from the social annotations of Web pages. The tags found on delicious.com for a certain web site serve as social annotations here. Three association matrices are used to represent the relations between pages and users (MPU), annotations and pages (MAP), and users and annotations (MUA). Element MPU(pi, uj) represents the number of annotations user uj gave to page pi, element MAP(ai, pj) represents the number of users that annotated page pj with annotation ai, and element MUA(ui, aj) denotes the number of pages user ui annotated with annotation aj. The PageRank algorithm as described in section 4.1.2 is calculated using a single transition matrix representing a function of the form page → page. Since no association matrix takes this form, none of them can be used directly in the PageRank computation. Therefore the authors provide SocialPageRank, a cascade of 6 equations

using the three matrices defined above and their transposed forms.
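Since the six equations are not reproduced here, the following Python sketch shows one plausible reading of the cascade, propagating scores page → user → annotation → page and back via the transposed matrices; it is a reconstruction under our own assumptions, not necessarily the exact equations of [3].

import numpy as np

def social_pagerank(M_PU, M_AP, M_UA, iters=50):
    """M_PU: pages x users, M_AP: annotations x pages,
    M_UA: users x annotations."""
    p = np.ones(M_PU.shape[0]) / M_PU.shape[0]   # initial page scores
    for _ in range(iters):
        u = M_PU.T @ p        # users scored via their pages
        a = M_UA.T @ u        # annotations scored via users
        p2 = M_AP.T @ a       # pages scored via annotations
        a2 = M_AP @ p2        # propagate back to annotations ...
        u2 = M_UA @ a2        # ... to users ...
        p = M_PU @ u2         # ... and to pages
        p = p / np.linalg.norm(p, 1)   # normalize to prevent blow-up
    return p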

5 Ranking for XML Data

The main aspects in which ranking for XML differs from ranking on the Web are the following [27]:

1. The proximity of keywords in the document is not simply the textual distance; the underlying tree structure is also taken into account.

2. The result of a query need not be the entire document but can also be a deeply nested element.

3. The ranking needs to be computed differently. On the Web, the popularity score is calculated based on the hyperlink structure and at the document level. Both aspects differ in XML: ranking should be done at the element level, since nested sub-elements are returned, and the tree structure of the XML document should be considered. In fact, it is more important than the link structure.

We address the first problem in section 5.2.2, the second in section 5.2.1, and the third in section 5.1. Besides that, there are two main categories of XML query languages. The first is in the tradition of database languages like SQL. These languages often do not support ranking because of their Boolean nature; answers are either relevant or not relevant for a query. An exception is XXL by Theobald et al [67], which is examined in section 5.2.4. The other type of query language supports keyword queries. These languages use the Vector Space Model for ranking, see section 5.2.1. It is also possible to compute a popularity score for XML, which is covered in section 5.1.

5.1 Popularity Score

The structure of an XML document is a tree. Yet if links are considered additionally, this structure becomes a graph to which the PageRank algorithm can be applied. Lin Guo et al took this idea to introduce ElemRanks [27]. The original PageRank algorithm (see section 4.1.2) weights all edges in the graph equally. Since one does not want hyperlinks to be weighted equally with containment edges, XRank weights forward containment edges, reverse containment edges, and (hyper)links differently. The term containment edge refers to the link between a node and its child. The authors argue for the distinction between forward and reverse containment edges as follows: (1) consider a conference containing many important papers (figure 7); then the importance of the conference should also be higher. On the other hand, (2) the ElemRank of a section of a paper should also depend on the number of sibling sections: if there are more sections, the individual sections lose importance. In the case of (1), “ElemRanks of a parent element should be directly proportional to the aggregate of the ElemRanks of its sub-elements”[27, p.5], whereas in (2) “ElemRanks of

sub-elements should be inversely proportional to the number of sibling sub-elements”[27, p.5].

Figure 7: An Example XML Document

The formula for the ElemRank of an element u is defined as follows:

ER(u) = \frac{1 - d_1 - d_2 - d_3}{N_d \times N_{de}(u)} + d_1 \sum_{x \in B_u} \frac{ER(x)}{N_h(x)} + d_2 \sum_{x \in CE_u} \frac{ER(x)}{N_c(x)} + d_3 \sum_{x \in CE_u^{-1}} \frac{ER(x)}{1}

N_d and N_{de}(u) denote the total number of documents and the number of elements in the XML document containing u, respectively. N_h(u) and N_c(u) denote the number of outgoing links of u and the number of sub-elements of u, respectively. B_u, CE_u, and CE_u^{-1} denote the backlinks, forward containment edges, and backward containment edges of u, respectively. In the last sum, ER(x) is divided by 1 because every element has exactly one parent.

The weights d1, d2, d3 must be set by the administrator. The ElemRank can also be interpreted in the random surfer model. The random surfer in PageRank (section 4.1.2) follows an outgoing link with probability d and jumps to an arbitrary page with probability 1 − d. The random surfer in ElemRank instead jumps to an arbitrary element with probability 1 − d1 − d2 − d3, follows a hyperlink with probability d1, goes down in the tree with probability d2, and goes up with probability d3.
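A minimal power-iteration sketch of this formula in Python; the graph encoding (edge lists and per-document element counts) is our own convention.

from collections import defaultdict

def elem_rank(nodes, hyperlinks, containment, n_docs, doc_size,
              d1=0.25, d2=0.25, d3=0.25, iters=50):
    """hyperlinks: (src, dst) pairs; containment: (parent, child) pairs;
    n_docs: N_d; doc_size[u]: number of elements in u's document."""
    out_links = defaultdict(int)   # N_h(x)
    n_children = defaultdict(int)  # N_c(x)
    back = defaultdict(list)       # B_u: sources of hyperlinks to u
    parents = defaultdict(list)    # CE_u: rank flows down from the parent
    children = defaultdict(list)   # CE_u^{-1}: rank flows up from children
    for s, t in hyperlinks:
        out_links[s] += 1
        back[t].append(s)
    for p, c in containment:
        n_children[p] += 1
        parents[c].append(p)
        children[p].append(c)
    er = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        new = {}
        for u in nodes:
            r = (1 - d1 - d2 - d3) / (n_docs * doc_size[u])
            r += d1 * sum(er[x] / out_links[x] for x in back[u])
            r += d2 * sum(er[x] / n_children[x] for x in parents[u])
            r += d3 * sum(er[x] for x in children[u])  # each child: 1 parent
            new[u] = r
        er = new
    return er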

5.2 Content Score

There are various approaches for computing a content score for XML. As always, the Vector Space Model can be adapted to XML (see section 5.2.1). Section 5.2.2 presents approaches for calculating the distance between two keywords, whereas section 5.2.4 calculates a distance between a candidate item and a keyword. Gaining ranking factors from properties of the answer tree is the topic of section 5.2.3, and section 5.3 deals with approaches for top-k evaluation.

5.2.1 Vector Space Model

In the classical Vector Space Model, each document to be indexed is represented as a vector of term weights. However, term weights as used in information retrieval, e.g. TF-IDF, do not work well for XML because they do not preserve the structure of the document. Therefore, most approaches adapt the Vector Space Model to XML by changing the weighting such that the granularity of the model is refined from the document level to the element level [41] [16]. The term frequency (TF) is normally defined as the relative frequency of the appearances of a term in a document. However, as the result of a query is not necessarily the whole document, but often a subtree, the TF should be adapted.

Term Frequency

There are various approaches for adapting the term frequency to XML. Some calculate the term frequency solely based on the occurrences within one node [16], while others take the occurrences within a node and its descendants into account [72].

Cohen et al [16] define a term frequency of a term t in a leaf node nl as:

TF(t, n_l) = \frac{\#\text{ occurrences of } t \text{ in } n_l}{\#\text{ occurrences of the most frequent word in } n_l}

Wolff et al [72] [70] have a similar definition of the term frequency of a term t in an element e, but do not restrict the term frequency to occurrences of terms in the leaf node. Therefore they also take the descendants of e into account for normalization:

TF(t, e) = \frac{\#\text{ occurrences of } t \text{ in } e}{\text{maximum frequency of any term in } e \text{ or its descendants}}

Li et al [41] also count the number of occurrences of the term in the node, but normalize differently: the term frequencies themselves are not normalized; instead, when the weights are computed as TF-IDF, these values are normalized by the maximal TF-IDF value found.

Inverse (Document) Frequency

The structure of an XML document is a tree. The result of a query is often a subtree. Therefore defining an Inverse Document Frequency normally does not make sense. An exception is the approach by Fuhr et al [26] which partitions the XML document prior to query evaluation and calculates flat IR scores inside the partitions.

Cohen et al [16] and Li et al [41] replace the inverse document frequency with an Inverse Leaf Frequency (ILF). The ILF denotes the number of leaves in the corpus divided by the number of leaves that contain the keyword. Wolff et al [72] [70] define an Inverse Element Frequency (IEF) which makes use of the structural role of a term. Their query language, XPRES, is designed such that a query consists of terms and associated structural roles. An example is {('creation', section.title)}, which looks for the word 'creation' appearing in the title within a section element. The IEF of a term t and its structural role r is defined as:

IEF(t, r) = \frac{\#\text{ elements having role } r}{\#\text{ elements having role } r \text{ and term } t}

Fuhr et al [26] argue that it is better to partition the XML tree into so-called index objects. Index objects are subtrees of the XML tree which must contain at least one leaf node. The weight of a root node n is computed using TF-IDF as if the index object with n as root were a single flat document. The TF-IDF weight is then calculated as the product of TF and ILF. In Li et al's approach [41], this value is normalized by the maximal TF-ILF value found, since the TF values were not normalized before (see section 3.1.1).
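As an illustration of the leaf-level weighting, a small Python sketch of TF and ILF in the spirit of Cohen et al [16]; the corpus representation (each leaf as a list of terms) is our assumption, and the ILF is kept as the plain ratio defined above.

from collections import Counter

def leaf_tf(term, leaf_terms):
    # TF(t, n_l): occurrences of t, normalized by the most frequent word.
    counts = Counter(leaf_terms)
    return counts[term] / max(counts.values()) if counts else 0.0

def leaf_ilf(term, all_leaves):
    # ILF: number of leaves divided by the number of leaves containing t.
    containing = sum(1 for leaf in all_leaves if term in leaf)
    return len(all_leaves) / containing if containing else 0.0

def tf_ilf(term, leaf_terms, all_leaves):
    return leaf_tf(term, leaf_terms) * leaf_ilf(term, all_leaves)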

5.2.2 Distance between keywords

One possibility to measure the distance between keywords is to determine their lowest common ancestor (LCA) and sum the reciprocals of the lengths of the paths from the LCA to the nodes containing the keywords. XBridge [41] implements this idea.
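The following Python sketch illustrates the idea using Dewey-style paths (tuples of child indices from the root); this encoding is our assumption, not necessarily XBridge's.

def lca(paths):
    # The longest common prefix of Dewey paths is the LCA.
    prefix = paths[0]
    for p in paths[1:]:
        i = 0
        while i < min(len(prefix), len(p)) and prefix[i] == p[i]:
            i += 1
        prefix = prefix[:i]
    return prefix

def keyword_proximity(paths):
    # Sum of the reciprocal path lengths from the LCA to each keyword node.
    anc = lca(paths)
    return sum(1.0 / (len(p) - len(anc)) for p in paths if len(p) > len(anc))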

Maguitman et al [47] measure the semantic similarity of two topics t1, t2 by:

\sigma(t_1, t_2) = \frac{2 \cdot \log Pr(LCA(t_1, t_2))}{\log Pr(t_1) + \log Pr(t_2)} \qquad (6)

This formula takes both the meaning of the LCA of t1 and t2 and the individual meanings of t1 and t2 into account. It is an adaptation of Lin's [42] information-theoretic definition of semantic similarity to taxonomies. In a taxonomy, the probability of a topic t can be calculated as Pr(t) = \frac{\text{objects wanted}}{\text{all objects}}, where “objects wanted” corresponds to the number of objects stored in node t and its descendants. Formula 6 is closely related to the information content in information theory, defined as -\log(Pr(t)) = \log(\frac{1}{Pr(t)}). This approach is further expanded to RDF in section 6.4.3. XRank [27], on the other hand, uses a distance measure that is inversely proportional to the size of the smallest text window containing the keywords in the subtree to be ranked. If the data is heavily structured, the authors suggest ignoring the keyword proximity.

5.2.3 Ranking query answer trees

The result of an XML query is a subtree N of the source XML tree. Ranking can be done partially by taking properties of these trees into account. One possible property is the size of the resulting tree: smaller subtrees are expected to be more specific than bigger subtrees. Cohen et al [16] therefore suggest counting the number of nodes in the subtree (tsize(N)) and using the reciprocal of this value as a ranking factor. Another ranking factor Cohen et al [16] derive from this tree is the number of node pairs that participate in an ancestor-descendant relationship, anc-des(N). In XRank [27], a decay factor comes into play in case the node that the ranking is calculated for does not directly contain the keyword. If v1 denotes the node that the ranking is computed for, and vt the node that contains the keyword, the decay factor is computed as

decay(v_t) = df^{\,t-2}

where the decay factor df is between 0 and 1.

5.2.4 Distance between candidate item and keyword

Some systems are able not only to retrieve exact matches of keywords on words but can also retrieve fuzzy matches. This implies a measure of distance between the keyword and the candidate item that was matched. In the case of an exact match this distance is of course 0, but in the case of an inexact match some sort of distance is desired. XXL [67] [68] adds fuzzy search to XML data by introducing a similarity operator ~. The main idea is that the underlying text search engine calculates a degree of similarity based on its thesaurus, ontology, etc. The authors give an example using the XML fragment in figure 8, where x ~ "bass saxophone" assigns 0.8, 0.6, and 0.4 to x for baritone sax, saxophone, and soprano saxophone respectively. Also, for "~cd" a value of 1 is assigned to the node "" and a value of 0.9 is assigned to the node "".


Figure 8: XML fragment for XXL

5.3 Top-k evaluation

One possibility to generate top-k answers efficiently is to use a special index structure. Another is to evaluate only some of the generated, ranked queries.

XRank [27] uses Dewey Inverted Lists to store the XML tree efficiently and to capture the ancestor-descendant information. This works by assigning each keyword occurrence a Dewey ID which represents the document and the position of the keyword in the tree. Yet this alone does not solve the problem of efficiently retrieving the top k elements. A Ranked Dewey Inverted List is a Dewey Inverted List which is sorted by ElemRank (section 5.1), so that exploring only the top k becomes possible. XBridge evaluates a keyword query by converting it into a list of ranked SPARQL queries. The authors can provide an upper bound on the score of each query by setting the score of each keyword to its maximum value. Queries with higher rank are evaluated first, and the upper bound of the next query serves as the current threshold.

6 Ranking for RDF Data

There are three ways in which ranking can be done on RDF data, depending heavily on the type of the query language. The following types of query languages for RDF exist: one type follows the tradition of database languages, another is inspired by rule languages [28], and other approaches support keyword queries. This broad variety of query languages is reflected by different ranking approaches. For rule languages, the main criterion for ranking an answer is the way it was inferred (6.6). For database-like languages, a ranking is normally not desired, since all results are supposed to be equally relevant, but there are some important exceptions. For keyword queries, one common approach is to translate the keyword query into a ranked list of formal, database-like queries (6.5). Section 6.1 gathers approaches which weight the labels of the RDF graph. These weights can be used to calculate a popularity score (section 6.2). How a content score can be calculated for RDF data is discussed in section 6.3. In section 6.4 the relevance of two resources to each other is discussed. Approaches which enhance the ranking of Web documents by using semantic annotations defined elsewhere were already discussed in section 4.5; this chapter covers the ranking of Web documents which have semantic annotations stored inside them.

6.1 Weighting RDF properties

Since RDF data is represented as graphs with labelled edges, many algorithms assign weights to the edges based on their labels prior to executing algorithms on the graph. This assignment can be done either manually or automatically. For manual assignment it is sufficient to assign weights at the schema level, from which they are automatically inferred for the instance level [49]. This section explains some approaches for automatically assigning property weights.

Stojanovic et al [65] provide a specificity measure based on the ambiguity of a relation. They define the ambiguity of a concept Aj with respect to a relation r by counting the number of instances of r that have Aj at the corresponding position. Written as a formula, the ambiguity is defined as:

amb(A_j, r(A_1, A_2, \ldots, A_j, \ldots, A_n)) = |\{(x, y, \ldots, w) \mid r(x, \ldots, A_j, \ldots, w)\}|

Based on this ambiguity measure they define a specificity measure as:

specificity_1(r(A, \ldots, B)) = \frac{1}{amb(A, r(A, \ldots, B))} \cdot \ldots \cdot \frac{1}{amb(B, r(A, \ldots, B))}

Rocha et al [56] assign two weights to each edge. The first weight is based on the label; these weights must be set up by a human knowledge engineer. The second weight serves to express a strength measurement of the edge, because the authors feel that a Boolean “edge yes” or “edge no” is not sufficient in all cases. The authors suggest weighting the relationship between two concepts A and B using a formula consisting of two parts, which is reminiscent of the TF-IDF formula. The first part is called the cluster measure and was originally developed by Chen et al [14]:

cluster(A, B) = \frac{\#\text{ concepts related to both } A \text{ and } B}{\#\text{ concepts related to } A}

The second part of the formula is called the specificity measure and is an adaptation of the IDF measure:

specificity_2(A, B) = \frac{1}{\sqrt{\#\text{ relations that have } B \text{ as destination node}}}

The square root was chosen for smoothing. The final weight of the link between two concepts A and B is computed in TF-IDF manner as:

weight(A, B) = cluster(A, B) · specificity2(A, B)

The first term measures the similarity of A and B whereas the second term “measures how specific the destination concept is” [56, p. 376]. Anyanwu et al [2] provide a specificity measure for properties as follows:

specificity_3(r(A, B)) = \frac{\#\text{ occurrences of relation } r}{\#\text{ all properties}}

This measure weights the property independently of A and B. They also provide another specificity measure, called θ-specificity:

\theta\text{-specificity}(r(A, B)) = \frac{\#\text{ occurrences of relation } r}{\#\text{ occurrences of properties which are possible between } A \text{ and } B}

The θ-specificity of a property measures the uniqueness of the property r relative to all properties which have the same domain and range [2, p.121].
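A small Python sketch of the two frequency-based measures, representing triples as (subject, predicate, object) tuples; the helper possible_predicates, which returns the predicates admissible between the classes of two resources, is hypothetical and stands in for schema knowledge.

from collections import Counter

def specificity3(triples, pred):
    # Occurrences of relation `pred` relative to all property instances.
    counts = Counter(p for _, p, _ in triples)
    return counts[pred] / len(triples)

def theta_specificity(triples, pred, s, o, possible_predicates):
    # Occurrences of `pred` relative to the occurrences of all predicates
    # that are possible between the classes of s and o.
    admissible = possible_predicates(s, o)   # hypothetical helper
    occ = Counter(p for _, p, _ in triples if p in admissible)
    return occ[pred] / sum(occ.values()) if occ else 0.0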

6.2 Popularity Score

Since RDF is already a graph, the PageRank algorithm can be adapted to work on the graph in order to get a ranking for the resources.

The PageRank algorithm as originally formulated and described in section 4.1.2 operates on a graph whose nodes represent Web documents and whose edges represent hyperlinks; it assigns weights to the Web documents. An RDF graph consists of RDF resources as nodes and RDF properties as edges. As discussed in the previous section, it is possible to assign weights to these edges. So in contrast to classical PageRank, which considers all links as equally important, these weights should be taken into account when applying the PageRank algorithm to the graph [49]. Another possibility is not to run PageRank on the RDF graph but at the document level. Swoogle [22] ranks Semantic Web Documents based on the relationships of Semantic Web Documents, Semantic Web Terms, and Semantic Web Ontologies to each other. A Semantic Web Document (SWD) is an RDF graph that was serialized using one of the RDF syntax languages like RDF/XML, N-Triples, or N3. A Semantic Web Term (SWT) is one of those RDF resources that are universal in the sense that they are intended to be reused; SWTs are instances of rdfs:Class or rdf:Property. A Semantic Web Ontology (SWO) is a Semantic Web Document that consists mainly of Semantic Web Terms. Links exist among SWTs, between SWDs and SWTs, and among SWDs. These links have the form of RDF properties. They are not weighted uniformly but classified into different categories, where all properties that fall into the same category are assigned the same weight, which the user has to provide [23]. The categories are imports, uses-term, extends, and asserts. Both Ning et al [49] and Ding et al [23] adapt the PageRank algorithm so that it allows for weighted edges.

6.3 Content Score

The Vector Space Model as described in section 3.1.1 can also be adapted to documents containing RDF annotations. A similar approach was already investigated in section 4.5.

Both this section and section 4.5 deal with ranking documents using RDF annotations, but this section considers annotations attached by the creator of the document, in contrast to the approaches dealing with automatic annotation in section 4.5. An adaptation of the TF-IDF weighting to documents containing RDF annotations was suggested by Castells et al [13]. Instead of calculating a weight for a word in a document, they calculate a weight for an instance x in a document d. They modify the term frequency to:

TF(x, d) = \frac{\#\text{ keywords attached to } x \text{ in } d}{\text{maximum frequency of any instance in } d}

The inverse document frequency is defined as normal:

IDF(x) = \frac{\#\text{ all documents}}{\#\text{ documents annotated with } x}

A weight vector for each document is then calculated as TF · IDF, which results in a vector over instances. Naturally, a query vector is also built. The authors use RDQL, a query language with an SQL-like syntax, and take the results of the RDQL query into account for the query vector. Specifically, the query vector is a vector over all possible instances, and the weight of an instance x is defined as the number of variables in the SELECT clause that had x as a result. This corresponds to the number of occurrences of x in the columns of the result table, where multiple occurrences in one column are counted only once. The similarity is computed as usual from the angle between the query and document vectors. The authors discovered that this semantic ranking performed poorly in their experiments, so they suggest calculating the final ranking as a combination of the ranking explained above and a keyword-based approach. Ning et al [49] propose the following approach for calculating a relevance measure between an RDF node and a query term. Since in their approach the query terms consist of an RDF triple without subject, i.e. Q = (_, pred, obj), the authors argue for choosing different relevance measures depending on the type of the predicate pred. In most cases pred will be Boolean (e.g. has-author, born-in), so the relevance measure will be either 0 or 1. In a few cases more sophisticated measures such as TF-IDF can be used, e.g. when pred is has-fulltext.
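To make the instance weighting concrete, a brief Python sketch under our own document representation (a mapping from instance to the number of keywords attached to it in the document); the IDF is kept as the plain ratio defined above.

def instance_tf(x, doc_annotations):
    # doc_annotations: {instance: number of keywords attached to it in d}
    return doc_annotations.get(x, 0) / max(doc_annotations.values())

def instance_idf(x, corpus):
    # corpus: list of doc_annotations dictionaries, one per document
    annotated = sum(1 for d in corpus if x in d)
    return len(corpus) / annotated if annotated else 0.0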

6.4 Relevance of the relation between two resources

Some applications require determining the relevance of two resources in an RDF graph to each other. The idea is to explore the relationships which exist between the two resources. These relationships can form the foundation for a ranking, as for example in [48].

40 6.4.1 DBPedia Relationship Finder

The DBPedia Relationship Finder³ [40] is a tool which visualizes the relations between two resources found in the DBPedia database. The DBPedia project aims to extract the structured information (info-boxes etc.) found in Wikipedia⁴. As an example, the various relationships between Ludwig van Beethoven and Vienna are displayed in figure 9. However, the DBPedia Relationship Finder does not support a ranking of the results found. It is further limited to relations at the instance level and neglects the information given by the schema: it will, for instance, not find the relation that “Ludwig van Beethoven” and “György Ligeti” are both composers, but only the information that they both died in Vienna.

6.4.2 SemRank

An approach which considers schema information and provides a flexible ranking is that of Anyanwu et al [2]. Relations can be ranked in two different ways: a conventional way and a discovery way. Ranking the conventional way ranks obvious results first, whereas a discovery search ranks surprising results higher. Consider for example the graph in figure 9. A conventional search would retrieve the relation deathplace(Vienna, Ludwig van Beethoven), whereas a discovery search would retrieve relations that are more hidden and less obvious, e.g. the other paths in this figure. The authors state that discovering such hidden paths might be important for homeland security, where these connections might reveal new insights. For example, the connection between George W. Bush and Osama Bin Laden via the former president being Commander in Chief of the military, which is looking for Osama Bin Laden, is not really interesting in this context. The more hidden information, that the Bush family has friends in the Saudi Arabian administration who themselves might have some connection to somebody who knows somebody who eventually has some form of affiliation to Osama Bin Laden, is of great value in this context. This argues for the discovery search. The user can choose gradually between conventional and discovery search. We first explain the ranking in purely discovery mode; how ranking is done in conventional mode is explained later in this section. The authors use two factors for measuring the predictability of results: the first is the specificity of the result and the second is the discrepancy between the result's structure and the schema. This discrepancy arises when the nodes of a path do not all belong to one schema, i.e. some nodes are multiply classified, so the path starts in one schema and ends in another. Consider the graph in figure 10.

³ http://relfinder.dbpedia.org
⁴ http://www.wikipedia.org

Figure 9: DBPedia Relationship Finder: Ludwig van Beethoven and Vienna

Figure 10: Example RDF Graph

Schema 2 would only allow a path between nodes &r1 and &r4, but there is also a path between nodes &r1 and &r6 which is not restricted to schema 2 but uses schemas 2, 3, and 4. The authors therefore count the number of such refractions; in conventional mode, paths with fewer refractions rank higher, whereas in discovery mode paths with many refractions rank higher. Specificity measures between two directly connected resources were already investigated in section 6.1. In SemRank a measure for two indirectly connected resources (a path) is provided by means of information theory.

The extent to which the occurrence of a single property is surprising is denoted I_S. It is defined as the information content of the property's frequency, giving high values to rare properties and low values to frequent ones. I_S ignores the type information available in the schema; therefore I_{θ-S} is defined as the information content of the frequency of that property with respect to properties of the right type. I_{θ-S} uses the θ-specificity from section 6.1. The information content of a path I(ps) (a sequence of properties) is the sum of two values. The first value is calculated using I_S by simply taking the maximal I_S(p) over all properties in the path, i.e. this value is the bigger, the rarer the rarest property is. For the second value, the average of I_{θ-S} is taken, where the element with the lowest information content (i.e. the most frequent one) is considered separately. The authors do not explain why they chose the maximum for I_S but the average for I_{θ-S}. If the ranking mode is set to conventional search, i.e. non-surprising results are preferred, ranking is done according to the reciprocal of I(ps). Unfortunately, unlike the DBPedia Relationship Finder, SemRank is no longer online and therefore cannot be tried out. An approach which applies these techniques for ranking Web documents is presented in [48]. It ranks documents by ranking the relationships between the named entities of a document. The relationships are gained by automatic semantic annotation of the documents.

6.4.3 Maguitman et al

Maguitman et al [47] extend their idea of semantic similarity presented in 5.2.2 from a taxonomy to an ontology. The ontology graph is represented by an adjacency matrix (G), and this matrix is then split into a matrix which represents the graph's hierarchical component (T) and two matrices representing the non-hierarchical components of the graph: one for the “symbolic” cross links (S) and one for the “related” ones (R). The idea of splitting the ontology in this way derives from the fact that the ontology considered here is derived from the ODP [50], which is merely a taxonomy enriched by cross links. Finally they define a matrix W which essentially denotes the reachability between two nodes in the ontology. It is built from the adjacency matrix G and a closure version of T, called T⁺, which treats indirect subtrees like direct subtrees. Consider for example the ontology in figure 11. The nodes reachable from node t8 are t3, t5, t6, t7, and t8 itself. Note that t2 is not reachable because “related” edges are not followed transitively (in contrast to “symbolic” edges). Matrix W is used to generalize formula 6 by substituting the LCA by a matrix computation in which W is involved.⁵

6.4.4 Corese

A similar approach, addressing a different application, is taken by Corby et al [20] in the Corese search engine. This search engine translates the query into a graph and then tries to match this graph against the RDF graph. Due to possibly different viewpoints of the author of the ontology and the user of the search engine, Corese also returns items that do not match the items of the query exactly, but are merely related. A key concept to achieve this is the computation of a semantic similarity on the ontology, which measures the degree of relatedness. This semantic similarity considers the following facts: the distance of two concepts is defined as the distance of their classes, which is calculated as the sum of the distances of both classes to their LCA. To account for the fact that classes are more related when they are more deeply nested, the distance between a parent and its child decreases with increasing depth. Links of the form rdfs:seeAlso are also considered: the distance between two classes which are linked is divided by two.

5 The formula in [47, p.439] might be misleading:

\sigma_S^G(t_1, t_2) = \max_k \frac{2 \cdot \min(W_{k1}, W_{k2}) \cdot \log Pr[t_k]}{\log(Pr[t_1|t_k] \cdot Pr[t_k]) + \log(Pr[t_2|t_k] \cdot Pr[t_k])}

In W_{k1} and W_{k2} the indices 1, 2 of the matrix are not constant, but correspond to t_1, t_2.

Figure 11: Example Ontology

Comparison of the approaches

The approaches investigated in this section are quite different. One thing they have in common is that they estimate how closely related two items in an ontology are. SemRank ranks paths between two resources according to the weights of the properties on the path, the information available at the schema level, and the information from the user whether she wants to be surprised by the results or not. Maguitman et al follow the idea that the ontology is roughly a tree and calculate the relevance between two resources using the LCA. Corese transfers the calculation of the distance from the instance to the schema level and calculates the distance between the corresponding classes using an LCA. The approaches of Corby et al (Corese) and Maguitman et al are somewhat similar, but the approach of Maguitman et al only works for ontologies that more or less have a tree structure. SemRank's approach to ranking is different, because one of the applications it was originally developed for is homeland security, which has different demands.

6.5 Query ranking

Some systems translate keyword queries to a ranked list of formal queries.

SPARK [75] takes two measures into account: the probability of generating a formal query out of the keywords (Keyword Query Model (KQM)) and the probability of generating a formal query out of the knowledge base (Knowledge Base Model (KBM)). In the KQM the authors combine the proximity of the keywords to the query and the relevance of the keywords to the query. The proximity is measured as the average term mapping proximity, but unfortunately no further information on it is given. The relevance is a measure for the number of keywords that were translated into the query. In the KBM the authors assume that the information content of a relation is higher if the frequency of this relation is low. The user can choose whether she prefers frequent, not so surprising results or surprising results. SemRank (see section 6.4) reflects a similar idea. Q2Semantic [69] ranks formal queries according to the length of the paths in a formal query and the matching distance between keyword and literal as in SPARK's KQM. Since Q2Semantic clusters the RDF graph prior to evaluating the query, the numbers of nodes and edges that were clustered are also taken into account (the more nodes/edges clustered into one, the better).

6.6 Ranking for RDF rule languages

Stojanovic et al [65] have developed a scheme for ranking the query results of rule languages by taking into account the way the results were inferred.

If the language used for modelling the semantic data supports rules, the knowledge can be separated into two parts: the ontology, which contains the rules and type information, and the knowledge base, which contains the actual knowledge. This corresponds to the separation of rules and facts in Prolog. The authors suggest determining the relevance of a query according to the relevancies of the relation instances. A relation instance for a concept is the part of a query which deals with the concept. Consider for example the following Prolog code:

human(pete).
human(john).
human(dave).
vegetarian(dave).
eats_pork(X) :- human(X), not(vegetarian(X)).

Here the relation instance for the concept “john” and the query “find all people in the knowledge base eating pork” would be “eats_pork(john)”. The ranking is based on the derivation tree of a relation instance, which contains all derivations for a query. The nodes in that tree are and-nodes or or-nodes. The or-nodes “connect various ways in which a relation instance is derived” [65, p.509], whereas an and-node “define[s] a way in which a relation instance is derived” [65, p.509]. Therefore, for an and-node the relevancies of its children are multiplied, and for an or-node the relevancies of its children are summed. The relevance of a leaf node is calculated as described in section 6.1.
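To make the aggregation concrete, a minimal Python sketch of this recursive computation over a derivation tree; the tree encoding as tagged tuples is our own.

def relevance(node):
    # node is ("leaf", weight), ("and", children) or ("or", children)
    kind, payload = node
    if kind == "leaf":
        return payload               # weight computed as in section 6.1
    scores = [relevance(child) for child in payload]
    if kind == "and":
        product = 1.0
        for s in scores:
            product *= s             # and-node: multiply the children
        return product
    return sum(scores)               # or-node: sum the children

For instance, relevance(("or", [("and", [("leaf", 0.5), ("leaf", 0.8)]), ("leaf", 0.1)])) yields 0.5 · 0.8 + 0.1 = 0.5.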

46 7 Ranking for Folksonomies

One of the more recent developments on the Internet is social tagging. The users who visit a Web site may assign tags to resources. Such resources can be various (real-world) items, such as bookmarks, books, photos, etc. The labels of the tags are freely chosen, which can be a problem because users often use different words for the same resource. The collection of all tags and resources is called a folksonomy, a portmanteau made up of folk and taxonomy: unlike a taxonomy developed by professional engineers, a folksonomy is created by the users. We investigate both the popularity score, which measures the global popularity of tags, users, or resources, and the content score, which measures the relevance to the query. Approaches to enhance search on the World Wide Web by using folksonomic information are not covered here but in section 4.5.

7.1 Popularity Score

A popularity score measures the quality of objects. Since a folksonomy consists of users, tags, and resources, popularity scores can be defined for each of these types of objects. There are two approaches for determining a popularity score: the first is to do the calculation all at once and hence obtain a popularity for users, tags, and resources at the same time. Such an approach is investigated in section 7.1.1. Other approaches weight users, tags, and resources separately; these are considered in sections 7.1.2, 7.1.3, and 7.1.4 respectively.

7.1.1 Overall Popularity Score

In this section, FolkRank [33], an approach which generates a ranking for users, tags, and resources is investigated. FolkRank converts the folksonomy into a hypergraph and runs PageRank on it twice. One run is performed with the preference vector and the other without. The combination of these values results in the FolkRank value.

The first step is the conversion of the folksonomy into a tripartite undirected hypergraph representing users, tags, and resources as vertices. The weight of an edge is derived from the folksonomy by counting the occurrences of the element type which appears on neither end of the edge. E.g., the weight of the edge between a user u and a tag t is determined as the number of resources tagged with t by u. In order to obtain a global ranking for users, tags, and resources, the PageRank algorithm is run on this hypergraph. This ranking reflects the idea “that a resource which is tagged with important tags by important users becomes important itself. The same holds, symmetrically, for tags and users. Thus we have a graph of vertices which are mutually reinforcing each other by spreading their weights.”[33, p.2]

It is also possible to compute a content score with FolkRank by modifying the personalization vector using the query. This approach is discussed in section 7.2. FolkRank determines a popularity score for users, tags, and resources at once. In the next sections, approaches which rank only users, only tags, or only resources are investigated.

7.1.2 Popularity Score for Users

John and Seligmann [36] developed the ExpertRank algorithm, which selects the best user for a given task. The authors give an example in which, during a meeting, the need for a software developer with certain skills arises. For this automatic detection, the number of bookmarks tagged by a user is considered. The authors provide two approaches: a simple model, which assumes independence between the tags, and a more complex model, which allows for a tag space partitioned into clusters of strongly related tags.

The simple model for calculating an expert score is as follows:

ExpertRankSimple(u, t) = \frac{\#\text{ bookmarks user } u \text{ tagged with tag } t}{\#\text{ bookmarks tagged with tag } t \text{ by all users}}

So a user is considered an expert with respect to a topic associated with tag t if she has tagged many resources with t. However, this simple approach does not perform well because it assumes independence between the tags. In a real folksonomy this is not the case, because users often choose slightly different tags, e.g. due to synonymy. Therefore the authors argue for partitioning the tag space into clusters, where a cluster consists of strongly related tags. The enhanced ranking ExpertRank(u, t) considers not only the bookmarks tagged with tag t but also the ones tagged with tags from the same cluster. To do this, the authors exploit the link structure between the tags in one cluster by using the PageRank algorithm.

Szekeley and Torres [66] argue that simply counting the number of tags is not feasible for all applications. Instead they estimate a ranking for a user based on the other users' tagging behaviour. This approach is called UserRank.

The authors use a modified version of the PageRank algorithm to determine the UserRank. The graph the algorithm operates on is defined as follows: a node consists of all the tags a user asserted to a resource. The authors draw a directed edge from user A to user B if user A asserted the same tag to a page as user B, and user B asserted the tag prior to A. So rather than counting the number of tags a user asserted, as in ExpertRank, the UserRank of a user A depends on how many other users agree with the tagging of A.

48 7.1.3 Popularity Score for Tags

Szekeley and Torres [66] also define a TagRank algorithm on top of their UserRank algorithm (section 7.1.2).

They define TagRank as the sum of the UserRanks of the users who asserted this tag. “This approach is based on the belief that the most relevant tags are those used by the best users”[66, p.4].

Another factor that can be taken into account when ranking tags is time.

Peters [52, p.350] suggests applying a time weight in the interval between 1 and 2 to tags. A weight of 2 is applied to tags newer than 30 days, a weight of 1 to tags older than 360 days, and in the remaining interval the time weight is spread evenly between 1 and 2.
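As a sketch, this weighting can be written as the following Python function, assuming linear interpolation over the 30–360 day interval.

def time_weight(age_days):
    if age_days <= 30:
        return 2.0                   # tags newer than 30 days
    if age_days >= 360:
        return 1.0                   # tags older than 360 days
    return 2.0 - (age_days - 30) / (360 - 30)   # evenly spread from 2 to 1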

7.1.4 Popularity Score for Resources

Folksonomies often use Web documents as resources. Approaches which use tags to rank Web sites are investigated in section 4.5.2. Some of the approaches described there can also be modified for ranking resources in folksonomies: the step of looking up the semantic annotations defined elsewhere (e.g. on delicious.com) is skipped, and the annotations defined within the folksonomy are used instead.

7.2 Content Score

SocialRanking [74] uses the tag’s similarity defined in section 7.3.2 and the user’s similarity defined in section 7.3.1 to create a content score based on a query.

The query consists of a sequence of tags (t1, . . . , tn). The first step is to expand this set of tags using the similarity function from section 7.3.2. For every tag in the query the k nearest neighbours are included: tags which are equal (sim(ti, tj) = 1) are added first, and then related tags (sim(ti, tj) < 1) are added until k neighbours are found. In a second step, the content score with respect to the query and the querying user is computed as a sum of user scores. Each user who tagged the resource with at least one tag contributes to the ranking. A user score is the aggregated similarity of this user's tags to the query, multiplied by the similarity between this user and the querying user.

49 Hotho et al [33] suggest generating a content score by adjusting the preference vector in the FolkRank algorithm (see section 7.1.1) such that it corresponds to the query.

On the hypergraph defined in section 7.1.1 the personalized PageRank algorithm is run (see section 4.3.2 on page 28). The preference vector “reflects the users preferences or search goals” and “is given by the user, extracted from a query, or determined from his behaviour”[34, p.8]. One possibility to personalize FolkRank is thus to build a preference vector based on the keywords of the query, which necessitates running FolkRank at query time. However, it turned out that the preference vector is not strong enough, because it cannot overcome extremely popular resources. Therefore, Hotho et al developed a differential approach, which consists of doing the computation twice: the first time the preference vector is considered (d < 1 in equation 5 on page 29), the second time it is neglected (d = 1 in equation 5). The FolkRank value is computed by subtracting the universally popular ranks from the personalized ones: R = R_{d<1} − R_{d=1}. If PageRank is run without the preference vector, it might not converge; however, the authors do not discuss whether this is a problem. Probably the preference vector is not needed when PageRank is run on the tripartite hypergraph the authors constructed. The authors also introduced a factor α which “speeds up convergence and avoids oscillation”, but do not explain whether this already solves the problem [34, p.8].

Hotho et al claim that the Vector Space Model is not feasible for folksonomies since the “documents consist of short text snippets only” [34, p.6]. Peters [52, p.346] claims that this is not true.

A relative term frequency can thus be defined as:

relative\ TF(t, r) = \frac{\#\text{ occurrences of } t \text{ in } r}{\#\text{ all tags in } r}

An inverse document frequency can be defined as:

IDF(t) = \frac{\text{total number of resources}}{\#\text{ resources containing the tag } t}
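A Python sketch of these two definitions, treating a resource as the list of all tag assignments it received.

from collections import Counter

def relative_tf(tag, resource_tags):
    return Counter(resource_tags)[tag] / len(resource_tags)

def idf(tag, all_resources):
    # all_resources: list of tag lists, one per resource
    containing = sum(1 for r in all_resources if tag in r)
    return len(all_resources) / containing if containing else 0.0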

Peters [52, p.348–350] argues for computing the term weight of tags in resources not only as TF · IDF but additionally taking a Relevance Feedback Weight (RFW(t, r)) into account, which reflects the users' ratings of the tag. The calculation of this weight depends on the application: some applications allow positive and negative rating of tags, some only negative, and some only positive. The calculation should be such that positively rated tags have more influence than negatively rated ones.

50 7.3 Similarity

7.3.1 Similarity for two users

Valentina Zanardi and Licia Capra [74] provide a measure for the similarity of users using the Vector Space Model. Their basic notion of user similarity (sim(ui, uj)) is that “the more tags two users have used in common, the more similar they are, regardless of what resources they used it on” [74, p.3]. Since this definition neglects the resources tagged, the 3-dimensional relationship between users, tags, and resources is projected onto a 2-dimensional one. So for every user a vector over tags is built, denoting how often the user used each tag. The similarity of two users is then calculated from the angle between their vectors, as in the Vector Space Model (see section 3.1.1).

7.3.2 Similarity for two tags

In “Social Ranking”, Zanardi and Capra [74] also define a tag similarity (sim(ti, tj)) in the following way: the more resources have been tagged with the same pair of tags, the more similar these tags are, regardless of the users who used them [74, p. 3]. So a vector over resources is built for every tag, and the similarity of two tags is computed from the dot product of their vectors, as in section 7.3.1 and in the Vector Space Model (see section 3.1.1). User similarity and tag similarity are aggregated to form a single ranking in section 7.2.
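Both similarities reduce to the same vector comparison, as the following Python sketch shows; the count-vector representation is our assumption.

import math
from collections import Counter

def cosine(c1, c2):
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def user_similarity(tags_of_u1, tags_of_u2):
    # Inputs: lists of tags each user has used (with repetition).
    return cosine(Counter(tags_of_u1), Counter(tags_of_u2))

The tag similarity is obtained analogously by building, for every tag, a count vector over the resources it was assigned to.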

8 Rank Aggregation

The term ’rank aggregation’ refers to the combination of several rankings, derived by different algorithms, into a global ranking. There are two main applications of rank aggregation: the aggregation of ranks within a search engine and the aggregation of rankings returned by several search engines into a meta search engine. The aggregation of ranks within a search engine comes into play when the final ranking of a search engine is made up of several ranking algorithms. In this case, the similarity scores in the rankings are available, so the global ranking can be computed using them. Several methods for combining similarity scores are discussed in section 8.1. A meta search engine is a search engine which passes the query through to several component search engines and presents an aggregated ranking to the user. Meta search engines are useful for searching rare terms, which might not be covered by every search engine. The results returned by the component search engines are typically not augmented with similarity scores, so the aggregation must be done using the positions in the ranked lists only. These methods are discussed in section 8.2.

51 8.1 Similarity Scores

Combining the rankings of results using their similarity scores is possible in two cases: the first is the combination of similarity scores within one search engine; the second arises in meta search engines, if the results of the query are augmented with similarity scores. However, only a few search engines provide this information.

Let the set of documents to be ranked be D = {d1, d2, . . . , dn}, and let the similarity scores assigned by k ranking methods to document j be denoted s1j, . . . , skj. Fox and Shaw [25] define a number of combination functions for similarity scores. The CombMIN and CombMAX combination functions are defined as:

CombMIN(dj) = min(s1j, . . . , skj)

CombMAX(dj) = max(s1j, . . . , skj)

The CombMIN combination method serves to “minimize the probability that a non-relevant document would be highly ranked” [25, p.246] while the CombMAX method “minimizes the number of relevant documents being poorly ranked” [25, p.246]. A combination method the authors propose to avoid both extremes is CombMED, which uses the median value of all scores:

CombMED(dj) = med(s1j, . . . , skj)

The median is defined as the element in the middle of the sorted similarity scores; if k is even, the arithmetic mean of the two middle elements is taken. The combination methods defined above do not consider the relative ranks of the documents. The relative ranks can be considered if the combination function does not merely pick one similarity score from the list but combines the scores arithmetically. Fox and Shaw [25] thus suggest adding all similarity values of an element, which results in the CombSUM combination function:

CombSUM(d_j) = s_{1j} + \ldots + s_{kj} = \sum_{i=1}^{k} s_{ij}

However, CombSUM does not give any special treatment to resources which have no score because they were not found by the corresponding search engine. In the example below, the CombSUM of element a is made up of only three retrieval scores, because ranking systems λ4 and λ5 have no retrieval score for a, which is treated as a value of zero in CombSUM. Let rj denote the number of ranking systems that have found document j. Then it is possible to derive two measures which address the problem of documents not found:

CombANZ(dj) = CombSUM(dj) ÷ rj

CombMNZ(dj) = CombSUM(dj) × rj

     CombMIN  CombMAX  CombMED  CombSUM  CombANZ  CombMNZ
a    3.22     5.23     4.22     12.67    4.22     38.01
b    0.68     4.52     3.23     13.79    2.76     68.95
c    0.12     6.63     1.23     13.94    2.79     69.70
d    0.34     2.12     1.18     4.82     1.21     19.28

Table 1: Comparison of combination functions

CombANZ is the average of the nonzero similarity values. It “ignores the effects of a query failing to retrieve a relevant document”[25, p.246]. CombMNZ rewards documents that were found by multiple search engines. It is disputed which of the proposed combination methods is the best: Fox and Shaw [25] conducted experiments where CombSUM outperformed the others, while Liu [43, p.227] states that researchers have found that CombMNZ outperforms CombSUM in some cases. The following example is taken from [43, p.228] but augmented with some made-up retrieval scores. There are 5 ranking systems (λ1, . . . , λ5) which return the following rankings for 4 candidate web pages a, b, c, d:

λ1: a=5.23 b=2.12 c=1.23 d=0.34
λ2: b=4.52 a=4.22 d=1.02 c=0.12
λ3: c=6.63 b=3.23 a=3.22 d=1.34
λ4: c=4.94 b=3.24 d=2.12
λ5: c=1.02 b=0.68

The results of applying the combination functions to these values are presented in table 1.
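For illustration, the combination functions can be written compactly in Python; here scores holds only the values actually returned for a document, so that r_j = len(scores).

import statistics

def comb_min(scores): return min(scores)
def comb_max(scores): return max(scores)
def comb_med(scores): return statistics.median(scores)
def comb_sum(scores): return sum(scores)            # missing scores act as 0
def comb_anz(scores): return sum(scores) / len(scores)
def comb_mnz(scores): return sum(scores) * len(scores)

For example, comb_mnz([5.23, 4.22, 3.22]) = 12.67 · 3 = 38.01, the CombMNZ value of a in table 1.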

8.2 Rank positions

Often the similarity scores of the results of a query are not available. This is especially the case when a meta search engine combines the results of its component search engines. Then the aggregation has to be done without the retrieval scores, purely using the order in which the search results were returned. A rank position refers to the position of a document within the ranking returned by a component search engine. The aggregation of rank positions can be modelled by means of Social Choice Theory [64], which deals with voting procedures. The modelling is such that the search engines represent voters and the Web pages represent candidates. The aggregation of Web pages into a global ranking can then be traced back to the aggregation of votes to form an election result. However, in aggregating rank positions few search engines vote for many candidates, whereas in political elections many people vote for few candidates. In the field of social choice theory, a function which aggregates several rankings into a global ranking is called a social welfare function. In this section, D denotes the set of elements to be ranked, d ∈ D a single element out of this set, the lists ranking1, ranking2, . . . , rankingk denote possible rankings of D, and the function pos(list, elem) returns the position of the element elem in the ranking list.

This is the example from above, but without the retrieval scores.

λ1: a b c d
λ2: b a d c
λ3: c b a d
λ4: c b d
λ5: c b

Most aggregation methods can only aggregate full rankings, so the elements which are not covered by a ranking are inserted at the very end.

λ1: a b c d
λ2: b a d c
λ3: c b a d
λ4: c b d a
λ5: c b a d

The aggregation methods Borda, Copeland, and Kemeny (sections 8.2.1, 8.2.2, and 8.2.3) are well known in the field of social choice. Local Kemenization, presented in section 8.2.4, combines the virtues of the Kemeny election with a good complexity.

8.2.1 Borda

A consistent aggregation method is Borda's method [7]. It assigns scores to the positions in the ranking lists: the first element receives the maximum score, the second element one point less, and so on down to the last element. This method is consistent, which means that if an alternative is preferred over another in two different groups, it should also be preferred in the combined group [5]. If an element does not appear among the results of a search engine, the most common way is to spread the remaining points evenly among the unranked elements. This is suggested in [57] and [43, p.228]; however, Dwork et al [24] claim that this spreading of the remaining points might lead to undesirable outcomes in some cases. In the example, system λ4 did not return a ranking for candidate a and there is 1 point remaining, so a receives 1 point. System λ5 did not return rankings for candidates a and d, so the remaining 3 points are spread among a and d, and they both receive 1.5 points. The aggregated ranking is obtained by sorting the elements by their Borda scores. In this example the following scores are achieved:

Borda(a) = 4 + 3 + 2 + 1 + 1.5 = 11.5
Borda(b) = 3 + 4 + 3 + 3 + 3 = 16
Borda(c) = 2 + 1 + 4 + 4 + 4 = 15
Borda(d) = 1 + 2 + 1 + 2 + 1.5 = 7.5

So the overall ranking is b, c, a, d. Borda scores can be computed in linear time. Although the computation is quick, Borda has an important drawback: it does not satisfy the Condorcet criterion, which states that a candidate who is preferred over every other candidate by a majority of the voters should win.

a ≻ b: 1    a ≻ c: 2    a ≻ d: 4
b ≻ a: 4    b ≻ c: 2    b ≻ d: 5
c ≻ a: 3    c ≻ b: 3    c ≻ d: 4
d ≻ a: 1    d ≻ b: 0    d ≻ c: 1

Table 2: Pair-wise comparisons of a, b, c, d

In the example above, c is a Condorcet winner, because it is preferred over every other candidate by a majority (i.e. ≥ 3) of the voters; it is also the only one.
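A Python sketch of Borda aggregation with this even spreading of leftover points; rankings may be partial lists.

def borda(rankings, candidates):
    n = len(candidates)
    scores = dict.fromkeys(candidates, 0.0)
    for ranking in rankings:
        for pos, c in enumerate(ranking):
            scores[c] += n - pos              # first place receives n points
        unranked = [c for c in candidates if c not in ranking]
        if unranked:
            # points of the remaining positions, spread evenly
            leftover = sum(range(1, n - len(ranking) + 1))
            for c in unranked:
                scores[c] += leftover / len(unranked)
    return sorted(candidates, key=scores.get, reverse=True)

On the example data this reproduces the scores above and returns the ranking b, c, a, d.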

8.2.2 Copeland

One method which satisfies the Condorcet criterion was developed by Copeland [19]. Pair-wise comparisons of the rankings returned by the search engines are undertaken: A wins the comparison over B if a majority prefers A over B. The fact that A is preferred over B is denoted as A ≻ B. Common to all Condorcet methods is that they select the Condorcet winner if there is one; if a Condorcet winner does not exist, the Condorcet methods behave differently. In particular, in the case of no Condorcet winner, the Copeland method may lead to ties. Ties are not as big a problem in rank aggregation as they are in political elections. Applying the method to our example data yields the results in table 2. As there are 5 voters (search engines), A has won the comparison over B if A ≻ B for at least three voters. With table 3 the result of the Copeland algorithm can be calculated, which is c, b, a, d.
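A corresponding Python sketch of the Copeland method over full rankings (unranked elements appended, as above).

from itertools import combinations

def copeland(rankings, candidates):
    wins = dict.fromkeys(candidates, 0)
    majority = len(rankings) / 2
    for x, y in combinations(candidates, 2):
        x_before_y = sum(1 for r in rankings if r.index(x) < r.index(y))
        if x_before_y > majority:
            wins[x] += 1            # a majority prefers x over y
        elif len(rankings) - x_before_y > majority:
            wins[y] += 1
    return sorted(candidates, key=wins.get, reverse=True)

On the example this yields the comparison wins of table 3 and the ranking c, b, a, d.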

8.2.3 Kemeny

Kemeny is in a sense the best method for aggregating rankings, because it is the only method which is both consistent and satisfies the Condorcet criterion [73]. It also satisfies the Extended Condorcet Principle, which is important for fighting spam according to [24]. However, its computation is NP-hard, so it is not feasible for the aggregation of rankings where the number of candidates can be very large. It aggregates the individual rankings into a collective ranking (the Kemeny consensus), which can roughly be described via the number of pairs of alternatives (sequence scores) on which the individual rankings disagree with the global ranking [15]. Since there are n! sequence scores for n elements in the ranking list, this method is clearly not feasible for rank aggregation. A fast algorithm for computing the Kemeny scores is presented in Conitzer et al [18], but the experimental results described there only cover up to 40 elements and 5 voters, and one can clearly see that the computational complexity still grows exponentially. This is not fast enough for aggregating the thousands of elements a query might return.

The computation can be done by considering all possible permutations [71]. For every permutation a sequence score is computed by summing the pairwise comparisons which agree with that sequence. E.g., if the sequence is a, b, c, d, the corresponding sequence score is (a ≻ b) + (a ≻ c) + (a ≻ d) + (b ≻ c) + (b ≻ d) + (c ≻ d). The table with pairwise comparisons is constructed as described in section 8.2.2. Table 4 lists the sequence scores for the example; the pairwise comparisons were taken from table 2. The rank aggregation according to the Kemeny method is the sequence with the maximal score, here: c, b, a, d.
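For illustration, the brute-force computation can be sketched as follows (our own code; it maximizes the sum of pairwise agreements, i.e. the sequence scores of table 4, and is only usable for very small n):

    from itertools import permutations

    def kemeny_brute_force(pairwise, candidates):
        # Enumerate all n! sequences and keep the one whose sequence
        # score (sum of pairwise agreements) is maximal.
        best_sequence, best_score = None, float('-inf')
        for sequence in permutations(candidates):
            score = sum(pairwise[(x, y)]
                        for i, x in enumerate(sequence)
                        for y in sequence[i + 1:])
            if score > best_score:
                best_sequence, best_score = sequence, score
        return best_sequence, best_score

    # With the counts from table 2 this returns ('c', 'b', 'a', 'd')
    # and the maximal sequence score 23, matching table 4.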

8.2.4 Local Kemenization

To combine the advantages of Kemeny optimal aggregation with an approach that is computable in reasonable time, Dwork et al. [24] suggest the following procedure for aggregating rankings: an interim aggregation µ is built using an arbitrary rank aggregation method, and from this interim aggregation the final aggregation is computed by applying an algorithm called Local Kemenization. The advantage of this method is that although the interim aggregation might not put the Condorcet winner at the top, Local Kemenization will raise the Condorcet winner in the ranking. It is also possible to apply Local Kemenization only to the top elements; Dwork et al. suggest that applying it to the top 100 elements should be sufficient. The algorithm is defined inductively as follows: assume that π is already a local Kemenization of the first l − 1 elements. The element l is inserted into π just below an element y, where y is selected such that the following two conditions are satisfied:

• No majority of the original rankings (the λ’s) prefers l to y.
• For all successors z of y in π, a majority prefers l to z.

An intuitive explanation is given by the authors: “We try to insert l at the end of the list π; we bubble it up toward the top of the list as long as a majority of the λ’s insists that we do” [24, p. 617]. Applied to the Borda ranking of the example data (b, c, a, d), Local Kemenization passes through the following intermediate lists:

b → c, b → c, b, a → c, b, a, d

b is inserted first. Whether c is inserted before or after b depends on whether a majority of the λ’s prefers b over c or c over b. These values can be looked up in table 2, which shows that a majority of the λ’s prefers c over b.
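A compact sketch of the insertion procedure (our own rendering of the “bubble up” description in [24]; the parameter majority, the number of voters constituting a majority, here 3 of 5, is our own addition):

    def local_kemenize(interim, pairwise, majority):
        # Insert each element of the interim aggregation at the end of
        # pi and bubble it up while a majority of the input rankings
        # prefers it to the element directly above it [24].
        pi = []
        for x in interim:
            pi.append(x)
            i = len(pi) - 1
            while i > 0 and pairwise[(x, pi[i - 1])] >= majority:
                pi[i], pi[i - 1] = pi[i - 1], pi[i]
                i -= 1
        return pi

    # local_kemenize(['b', 'c', 'a', 'd'], pairwise, 3) yields
    # ['c', 'b', 'a', 'd']: c bubbles above b, while a and d stay put.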

Candidate    Comparisons won
a            1
b            2
c            3
d            0

Table 3: Comparisons won for a, b, c, d

first choice  second choice  third choice  fourth choice   sequence score
a             b              c             d               1+2+4+2+5+4 = 18
a             b              d             c               1+4+2+5+2+1 = 15
a             c              b             d               2+1+4+3+4+5 = 19
a             c              d             b               2+4+1+4+3+0 = 14
a             d              b             c               4+1+2+0+1+2 = 10
a             d              c             b               4+2+1+1+0+3 = 11
b             a              c             d               4+2+5+2+4+4 = 21
b             a              d             c               4+5+2+4+2+1 = 18
b             c              a             d               2+4+5+3+4+4 = 22
b             c              d             a               2+5+4+4+3+1 = 19
b             d              a             c               5+4+2+1+1+2 = 15
b             d              c             a               5+2+4+1+1+3 = 16
c             a              b             d               3+3+4+1+4+5 = 20
c             a              d             b               3+4+3+4+1+0 = 15
c             b              a             d               3+3+4+4+5+4 = 23
c             b              d             a               3+4+3+5+4+1 = 20
c             d              a             b               4+3+3+1+0+1 = 12
c             d              b             a               4+3+3+0+1+4 = 15
d             a              b             c               1+0+1+1+2+2 = 7
d             a              c             b               1+1+0+2+1+3 = 8
d             b              a             c               0+1+1+4+2+2 = 10
d             b              c             a               0+1+1+2+4+3 = 11
d             c              a             b               1+1+0+3+3+1 = 9
d             c              b             a               1+0+1+3+3+4 = 12

Table 4: Kemeny sequence scores

8.3 Selected Aggregations

In this section we present how aggregation is realized in some of the ranking approaches introduced earlier in this diploma thesis.

8.3.1 Aggregation in XRank

In section 5.1, ElemRanks were introduced, which are an adaptation of the PageRank algorithm to XML. In XRank [27], aggregation depends both on the ElemRanks and on the query. The following factors are combined to form a global ranking:

• The ElemRank ER, whose computation was discussed in section 5.1.
• An aggregation function agg, as discussed in section 8.1.
• A distance function between keywords, dist, as discussed in section 5.2.2.

• A decay factor decay(v_t), which decreases the value of the ElemRank when the node contains the keyword only indirectly; it is explained in section 5.2.3.

These factors are used for calculating a ranking R(v_1, Q) for the node v_1 with respect to the query Q.

Let v_{t+1} be the node that directly contains the keyword k_i. Then there is a chain of ancestor nodes v_{t+1}, v_t, ..., v_1, and the ranking for v_1 is computed from the ElemRank ER and the decay factor using the formula for r(v_1, k_i) given below.

If there are multiple occurrences of the keyword k_i in the XML tree, a list of rankings r_1, r_2, ..., r_m is obtained. This list of rankings is aggregated using one of the methods described in section 8.1; the authors of [27] suggest CombMAX or CombSUM. The resulting ranking is called r̂(v_1, k_i).

If there are multiple keywords k_1, ..., k_n, the rankings r̂(v_1, k_i) are aggregated once again using CombSUM and multiplied by a distance function dist, which accounts for the distance between the keywords (section 5.2.2):

r(v_1, k_i) = ER(v_t) × decay(v_t)

r̂(v_1, k_i) = agg(r_1, r_2, ..., r_m)

R(v_1, (k_1, ..., k_n)) = ( Σ_{1 ≤ i ≤ n} r̂(v_1, k_i) ) × dist(v_1, k_1, k_2, ..., k_n)
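To make the interplay of these formulas concrete, here is a small sketch (our own illustration, not the XRank implementation) of the final combination step, assuming the per-occurrence scores r(v_1, k_i) have already been computed along the ancestor chains:

    def xrank_score(occurrence_scores, dist_factor, agg=max):
        # occurrence_scores maps each keyword k_i to the list of scores
        # r(v_1, k_i) = ER(v_t) * decay(v_t) of its occurrences.
        # agg plays the role of CombMAX (max) or CombSUM (sum) from
        # section 8.1.
        r_hat = {k: agg(scores) for k, scores in occurrence_scores.items()}
        # Sum over all keywords, scaled by the keyword distance function.
        return sum(r_hat.values()) * dist_factor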

8.3.2 Aggregation in XSearch

The following factors are used in XSearch [16] to form a final ranking:

• The similarity of the query to the answer, sim(Q, N), which is determined using the vector space model (see section 5.2.1).

• The size of the resulting tree, tsize(N), as discussed in section 5.2.3.
• An ancestor-descendant relationship, anc-des(N), as discussed in section 5.2.3.

The combination is done according to the following formula:

sim(Q, N)^α × (1 + γ × anc-des(N)) / tsize(N)^β

The authors of [16] experimented with different values for α; their result is that α > 1 is a feasible choice. However, they state that the optimal values for α, β, and γ are subject to further experimentation.
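As a sketch (our own, with purely illustrative parameter defaults, since [16] leaves the optimal choices open):

    def xsearch_score(sim, tsize, anc_des, alpha=2.0, beta=1.0, gamma=1.0):
        # sim(Q, N)^alpha * (1 + gamma * anc-des(N)) / tsize(N)^beta;
        # the default parameter values are assumptions for illustration.
        return (sim ** alpha) * (1 + gamma * anc_des) / (tsize ** beta)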

8.3.3 Aggregation in RSS

In RSS [49] the global and query-dependent ranking for a node v_i ∈ V is calculated using a simple approach:

AV_i = λ × PR(v_i) / max_{v_j ∈ V} PR(v_j) + (1 − λ) × Rel(Q, v_i)

The default value is λ = 0.5. Rel(Q, v_i) is a relevance measure of the query with respect to the node (see section 6.3). The values obtained with this formula are used as the activation values for Spread Activation, which is described in section 3.3.
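A direct transcription of this formula into code could look as follows (our own sketch; pagerank and relevance are assumed to map node identifiers to scores):

    def activation_values(pagerank, relevance, lam=0.5):
        # AV_i = lam * PR(v_i) / max_j PR(v_j) + (1 - lam) * Rel(Q, v_i)
        max_pr = max(pagerank.values())
        return {v: lam * pr / max_pr + (1 - lam) * relevance.get(v, 0.0)
                for v, pr in pagerank.items()}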

8.3.4 Aggregation in Web Search Engines

In [9] it is described that Google combines the IR score determined for a document (see section 4.2.1) with the document’s PageRank (see section 4.1.2), though it is not explained how this combination is achieved. In Lucene, a global ranking is formed by using the PageRank of a page as one of the factors for calculating the page’s content score; see section 3.1.2 for ranking in Lucene.

9 Conclusion and future work

This diploma thesis surveys the current state of the art of ranking algorithms for Web pages, XML data, and RDF data.

The section about ranking on the World Wide Web, section 4, discusses approaches concerning both the popularity score and the content score of Web pages. For the popularity score, PageRank is the approach that is perceived to be the best, owing to the commercial success of the Google search engine and its scientific acceptance; there are many adaptations of the PageRank algorithm. PageRank can also be used stand-alone, in contrast to the HITS and Hilltop algorithms, which rely on existing search engines. The section about the content score of Web pages summarizes common assumptions of current ranking techniques; the concrete implementations, however, are at best only vaguely published because of commercial interests.

In the section about ranking for XML data, most approaches calculate a content score for XML documents by adapting the term weights used in the Vector Space Model so that they take the tree structure of XML documents into account. An approach which does not modify the term weights but instead computes a popularity score by adapting the PageRank algorithm is XRank. XRank does not use term weights for calculating the content score, but aggregates the corresponding ElemRank values (see section 8.3.1).

Ranking for RDF data can be roughly divided into two categories. The first category ranks documents using the semantic annotations attached to them. In this field, applications range from ranking Web documents using the tags found on delicious.com to ranking documents consisting of serialized RDF graphs. Sections 4.5, 6.3, and in parts 6.2 deal with these issues; both the content and the popularity score of RDF documents can be calculated here. The other category deals with ranking bare RDF data, i.e. data without documents. Section 6.1 explains different approaches for calculating weights for the edges of the RDF graph. Using these weights it is possible either to calculate weights for the nodes using the PageRank algorithm (see section 6.2) or to investigate the relationships of two resources on the graph (see section 6.4).

A folksonomy consists of the dimensions users, tags, and resources, and ranking can be done on each of these dimensions. For calculating a popularity score, the relationships of users, tags, and resources are considered. This can be done either using the PageRank algorithm, which considers all three dimensions at once, or using simpler measures, such as ranking users higher if they have assigned more tags. For building a content score for users, tags, or resources (depending on the query), the PageRank algorithm can be used by modifying the personalization vector with the keywords from the query. A content score for tags can also be built using the Vector Space Model or the approach “Social Ranking” [74] by Zanardi and Capra; the latter exploits the similarity of users and tags, which serves as the basis for building a content score.

The state of the art summarized above shows that the PageRank algorithm and the Vector Space Model are the most successful approaches, judging by their adaptation to various application fields. The PageRank algorithm can be applied not only to ranking on the World Wide Web, which was its authors’ original intention, but also to many other applications. The most far-reaching adaptation is FolkRank by Hotho et al. [33], which adapts PageRank to folksonomies. The authors propose to use FolkRank both for building a popularity score and a content score; in the latter case, the FolkRank algorithm is run at query time with a personalization vector biased using the keywords from the query. Other algorithms to mention are Guo et al.’s adaptation to XML (see section 5.1) and Ning et al.’s adaptation to RDF (see section 6.2); these two algorithms use a variant of PageRank which operates on weighted edges. For building a content score with the Vector Space Model, on the other hand, there are adaptations for every datatype investigated. An adaptation that does not follow this pattern is Social Ranking [74], which uses the Vector Space Model to calculate similarities between users as well as between tags in folksonomic applications; a content score is then calculated based on these similarities.

Future Work

The state of the art analysis in this thesis reveals the lack of a ranking approach capable of considering semi-structured data and semantic annotations at the same time. Two approaches seem promising.

One way could be to combine the ideas of FolkRank and XRank. Section 7.1.1 described how a global ranking for users, tags, and resources can be obtained (FolkRank), whereas section 5.1 described how a global ranking can be found for XML documents (ElemRank). Both approaches build a graph to which the PageRank algorithm is applied. The combination could be accomplished by building one big graph that takes into account, on the one hand, the relationships among users, tags, and resources as in the FolkRank algorithm and, on the other hand, the structure of the resources as in the ElemRank algorithm.

Another possibility to obtain a popularity score would be to run the ranking algorithms separately and combine the retrieved scores using one of the methods described in section 8; a sketch is given below. For the content score there are likewise two possibilities: either a term weighting is devised which accounts for the tags as well as for the nested structure, or the rankings are computed separately and aggregated afterwards.
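As a first impression of the second proposal, the following hypothetical sketch normalizes two separately computed popularity scores (e.g. FolkRank and ElemRank values) and blends them linearly, in the spirit of the CombSUM-style aggregations of section 8.1; all names and the weighting are our own assumptions:

    def combined_popularity(folkrank, elemrank, weight=0.5):
        # Normalize each (non-empty) score set by its maximum, then
        # blend the two views linearly; weight balances the folksonomy
        # view against the XML structure view.
        def normalize(scores):
            m = max(scores.values())
            return {k: v / m for k, v in scores.items()}
        fr, er = normalize(folkrank), normalize(elemrank)
        return {k: weight * fr.get(k, 0.0) + (1 - weight) * er.get(k, 0.0)
                for k in set(fr) | set(er)}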

References

[1] J. Alpert and N. Hajaj. We knew the web was big... Google Blog, http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.
[2] K. Anyanwu, A. Maduko, and A. Sheth. SemRank: ranking complex relationship search results on the semantic web. In WWW ’05: Proceedings of the 14th international conference on World Wide Web, pages 117–127, New York, NY, USA, 2005. ACM.
[3] S. Bao, G. Xue, X. Wu, Y. Yu, B. Fei, and Z. Su. Optimizing web search using social annotations. In Proceedings of the 16th international conference on World Wide Web. ACM, 2007.
[4] K. Bharat and G. A. Mihaila. Hilltop: A search engine based on expert documents. ftp://ftp.db.toronto.edu/pub/reports/csrg/405/hilltop.html.
[5] T. Biswas. Efficiency and consistency in group decisions. Public Choice, 80:23–34.
[6] A. Blumauer and T. Pellegrini. Social Semantic Web, chapter Semantic Web Revisited – Eine kurze Einführung in das Social Semantic Web. X.media.press, Springer Verlag, Berlin, Heidelberg, 2009.
[7] J. C. Borda. Mémoire sur les élections au scrutin. Histoire de l’Académie Royale des Sciences, 1781.
[8] D. Brickley and R. Guha. Resource Description Framework (RDF) schema specification 1.0. http://www.w3.org/TR/rdf-schema/, 2004.
[9] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW7: Proceedings of the seventh international conference on World Wide Web, pages 107–117, Amsterdam, The Netherlands, 1998. Elsevier Science Publishers B. V.
[10] F. Bry, N. Eisinger, and T. Furche. Web information systems. 2009. http://www.pms.ifi.lmu.de/lehre/webinfosys/09ws10/unterlagen-public/bry2009webinfosys_11-04.pdf.
[11] K. Bryan and T. Leise. The $25,000,000,000 eigenvector – the linear algebra behind Google. 2006.
[12] S. J. Carrière and R. Kazman. WebQuery: searching and visualizing the web through connectivity. In Selected papers from the sixth international conference on World Wide Web, pages 1257–1267, Essex, UK, 1997. Elsevier Science Publishers Ltd.
[13] P. Castells, M. Fernandez, and D. Vallet. An adaptation of the vector-space model for ontology-based information retrieval. IEEE Trans. on Knowl. and Data Eng., 19(2):261–272, 2007.
[14] H. Chen and K. Lynch. Automatic construction of networks of concepts characterizing document databases. In IEEE Transactions on Systems, Man, and Cybernetics, vol. 22, no. 5, pages 885–902. Institute of Electrical and Electronics Engineers, New York, NY, USA, 1992.
[15] Y. Chevaleyre, U. Endriss, J. Lang, and N. Maudet. A short introduction to computational social choice. In SOFSEM 2007: Theory and Practice of Computer Science. http://www.springerlink.com/content/768446470rplj120.
[16] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. XSEarch: a semantic search engine for XML. In Proceedings of the 29th international conference on Very Large Data Bases – Volume 29. VLDB Endowment, 2003.
[17] A. Collins and E. Loftus. A spreading-activation theory of semantic processing. 1975.
[18] V. Conitzer. Improved bounds for computing Kemeny rankings. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), pages 620–627. AAAI Press, 2006.
[19] A. Copeland. A ’reasonable’ social welfare function. Seminar on Mathematics in Social Sciences, 1951.
[20] O. Corby, R. Dieng-Kuntz, and C. Faron-Zucker. Querying the semantic web with the Corese search engine. 2004.
[21] F. Crestani. Application of spreading activation techniques in information retrieval. Artif. Intell. Rev., 11(6):453–482, 1997.
[22] L. Ding, T. Finin, A. Joshi, R. Pan, R. S. Cost, Y. Peng, P. Reddivari, V. Doshi, and J. Sachs. Swoogle: a search and metadata engine for the semantic web. In CIKM ’04: Proceedings of the thirteenth ACM international conference on Information and Knowledge Management, pages 652–659, New York, NY, USA, 2004. ACM.
[23] L. Ding, R. Pan, T. Finin, A. Joshi, Y. Peng, and P. Kolari. Finding and ranking knowledge on the semantic web. In The Semantic Web – ISWC 2005, pages 156–170. Springer, 2005.
[24] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In WWW ’01: Proceedings of the 10th international conference on World Wide Web, pages 613–622, New York, NY, USA, 2001. ACM.
[25] E. Fox and J. A. Shaw. Combination of multiple searches. In Proceedings of the Second Text Retrieval Conference, pages 243–252, 1993.
[26] N. Fuhr and K. Großjohann. XIRQL: a query language for information retrieval in XML documents. In SIGIR ’01: Proceedings of the 24th annual international ACM SIGIR conference on Research and Development in Information Retrieval, pages 172–180, New York, NY, USA, 2001. ACM.
[27] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: ranked keyword search over XML documents. In SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD international conference on Management of Data, pages 16–27, New York, NY, USA, 2003. ACM.
[28] P. Haase, J. Broekstra, A. Eberhart, and R. Volz. A comparison of RDF query languages. In The Semantic Web – ISWC 2004. http://www.springerlink.com/content/ftxy71qedrhb945v.
[29] T. Haveliwala. Topic-sensitive PageRank. 2002.
[30] T. Haveliwala, S. Kamvar, and G. Jeh. An analytical comparison of approaches to personalizing PageRank. Technical report, Stanford University, July 2003.
[31] P. Hitzler, M. Krötzsch, S. Rudolph, and Y. Sure. Semantic Web, chapter Struktur mit XML. Springer Verlag, Berlin, Heidelberg, 2008.
[32] B. Horling and M. Kulick. Google Blog: Personalized search for everyone.
[33] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme. FolkRank: A ranking algorithm for folksonomies. Proc. FGIR 2006, 2006.
[34] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme. Information retrieval in folksonomies: Search and ranking.
[35] G. Jeh and J. Widom. Scaling personalized web search. In WWW ’03: Proceedings of the 12th international conference on World Wide Web, pages 271–279, New York, NY, USA, 2003. ACM.
[36] A. John and D. Seligmann. Collaborative tagging and expertise in the enterprise. In Proceedings of the Collaborative Web Tagging Workshop held in conjunction with WWW 2006, 2006.
[37] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. 2002.
[38] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Exploiting the block structure of the web for computing PageRank. Technical report, Stanford University, 2003.
[39] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.
[40] J. Lehmann, J. Schüppel, and S. Auer. Discovering unknown connections – the DBpedia relationship finder. In Proceedings of the 1st Conference on Social Semantic Web (CSSW’07), Leipzig, 2007.
[41] J. Li, C. Liu, R. Zhou, and B. Ning. Processing XML keyword search by constructing effective structured queries. In Lecture Notes in Computer Science. Springer-Verlag, 2009.
[42] D. Lin. An information-theoretic definition of similarity. In Proceedings of the thirteenth ACM conference on Information and knowledge management, 1998.
[43] B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[44] L. Liu and M. T. Özsu. Encyclopedia of Database Systems. Springer-Verlag, 2008.
[45] Lucene text search engine. http://lucene.apache.org/.
[46] Apache Lucene API documentation: Class Similarity. http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/search/Similarity.html.
[47] A. G. Maguitman, F. Menczer, F. Erdinc, H. Roinestad, and A. Vespignani. Algorithmic computation and approximation of semantic similarity. 2006.
[48] B. A. Meza. Ranking Documents Based on Relevance of Semantic Relationships. PhD thesis, University of Georgia, 2007.
[49] X. Ning, H. Jin, and H. Wu. RSS: A framework enabling ranked search on the semantic web. In Information Processing and Management, 2007.
[50] Open Directory Project, a.k.a. Directory Mozilla. http://www.dmoz.org/.
[51] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998.
[52] I. Peters. Folksonomies. Indexing and Retrieval in Web 2.0. De Gruyter, 2009.
[53] rankforsales.com. The Google Hilltop algorithm. http://www.rankforsales.com/search-engine-algorithms/google-hilltop-algorithm.html.
[54] S. Rendle. Hubs and authorities – student presentation in link mining. http://www.informatik.uni-freiburg.de/~ml/teaching/ws04/lm/20041102_HITS_Rendle.pdf.
[55] S. E. Robertson and K. S. Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129–146, 1976.
[56] C. Rocha, D. Schwabe, and M. P. de Aragão. A hybrid approach for searching in the semantic web. 2004.
[57] D. G. Saari. The mathematics of voting: Democratic symmetry. The Economist, page 83, March 4, 2000.
[58] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. 1988.
[59] G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. 1975.
[60] F. Schräpler. http://www.profi-ranking.de/faq/google-ranking/.
[61] F. Schräpler. http://www.profi-ranking.de/sitemap/.
[62] K. Schulz. Information retrieval. 2004.
[63] M. Shamsfard, A. Nematzadeh, and S. Motiee. ORank: An ontology based system for ranking documents. 2006.
[64] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations, chapter Aggregating Preferences: Social Choice. Cambridge University Press, New York, NY, USA, 2008.
[65] N. Stojanovic, R. Studer, and L. Stojanovic. An approach for the ranking of query results in the semantic web. 2003.
[66] B. Szekely and E. Torres. Ranking bookmarks and bistros: Intelligent community and folksonomy development. 2005.
[67] A. Theobald and G. Weikum. Adding relevance to XML. 2001.
[68] A. Theobald and G. Weikum. The index-based XXL search engine for querying XML data with relevance ranking. 2002.
[69] H. Wang, K. Zhang, Q. Liu, T. Tran, and Y. Yu. Q2Semantic: A lightweight keyword interface to semantic search. In Lecture Notes in Computer Science, 2008.
[70] F. Weigel, H. Meuss, K. U. Schulz, and F. Bry. Content and structure in indexing and ranking XML. In WebDB ’04: Proceedings of the 7th International Workshop on the Web and Databases, pages 67–72, New York, NY, USA, 2004. ACM.
[71] Wikipedia, the free encyclopedia: Kemeny–Young method. http://en.wikipedia.org/wiki/Kemeny–Young_method.
[72] J. E. Wolff, H. Flörke, and A. B. Cremers. Searching and browsing collections of structural information. Advances in Digital Libraries Conference, IEEE, 0:141, 2000.
[73] H. P. Young and A. Levenglick. A consistent extension of Condorcet’s election principle. SIAM Journal on Applied Mathematics, 35(2):285–300, 1978.
[74] V. Zanardi and L. Capra. Social Ranking: Finding relevant content in Web 2.0. In International Workshop on Recommender Systems, 2008.
[75] Q. Zhou, C. Wang, M. Xiong, H. Wang, and Y. Yu. SPARK: Adapting keyword query to semantic search. In Lecture Notes in Computer Science, 2007.