Ranking for Social Semantic Media
INSTITUT FÜR INFORMATIK
LEHR- UND FORSCHUNGSEINHEIT FÜR PROGRAMMIER- UND MODELLIERUNGSSPRACHEN

Georg Klein

Diploma thesis (Diplomarbeit)

Start of work: 7 September 2009
Submission of work: 10 August 2010
Supervisors: Prof. Dr. François Bry, Klara Weiand, Christoph Wieser

Declaration

I hereby declare that I wrote this diploma thesis independently. I used no sources or aids other than those indicated.

Munich, 10 August 2010
Georg Klein

Summary

This diploma thesis deals with the ranking of search engine results. We present a State of the Art that covers various datatypes, including the Web, XML, RDF, and folksonomies. For each datatype, we present the computation of both the content score and the popularity score. The content score is a value that indicates how well a retrieved object matches the query, whereas the popularity score is a value that indicates how popular a page is independently of the specific query. For most datatypes we additionally consider the relevance of two objects to each other. This is of particular interest for RDF data.

The purpose of this State of the Art is to convey a basic understanding of the technologies that exist for ranking. Building on this, it becomes easier either to improve existing methods or to design a ranking method for datatypes that have not yet been studied. In particular, to the best of our knowledge, there is not yet a method for Social Semantic Wikis, which have both semantic annotations and structured data. Some proposals for such ranking methods are discussed in the last section, Conclusion and future work.

Abstract

This diploma thesis is devoted to the ranking of results returned by search engines.
We present a State of the Art that covers ranking for various datatypes, including the Web, XML, RDF, and folksonomies. For every datatype, the computation of a popularity score as well as of a content score is presented. For most datatypes we also discuss the relevance of two objects to each other. This is of particular interest for RDF data.

The purpose of this State of the Art is to provide a sound understanding of the existing ranking technologies, so that it becomes easier either to devise an enhanced ranking algorithm or to use the technologies presented here as a basis for developing algorithms for new datatypes. In particular, to the best of our knowledge, no ranking scheme is available for Social Semantic Wikis, which have both semantic annotations and structured data. We provide some suggestions for developing such an algorithm in the section Conclusion and future work.

Acknowledgements

First of all, I would like to thank my supervisors Klara Weiand and Christoph Wieser. Klara and Christoph helped me a lot with the structuring of this thesis and were a valuable source of information, inspiration, and advice. Furthermore, I am indebted to Prof. Dr. François Bry: with his great experience in supervising students, he was able to come up with a sound time schedule in which progress was checked on a weekly basis. He also insisted on free Sundays, which enabled me to manage the workload. Finally, I would like to thank Jenna Arnold for the final review of this thesis, as well as my friends and family for their understanding and continuous support.

Contents

1 Introduction
2 Fundamentals
  2.1 Ranking
  2.2 World Wide Web
  2.3 Web 2.0 - Social Software
  2.4 Structured Data
  2.5 Semantic Web
  2.6 Folksonomies
3 Datatype independent ranking schemes
  3.1 Content Score
    3.1.1 Vector Space Model
    3.1.2 Vector Space Model in Lucene
  3.2 Relevance of two resources
    3.2.1 Vector Space Model
    3.2.2 SimRank
  3.3 Spread Activation
4 Ranking on the World Wide Web
  4.1 Popularity Score
    4.1.1 Query-dependent Link Analysis
    4.1.2 Query-independent Link Analysis
  4.2 Content Score
    4.2.1 Google
    4.2.2 Hilltop
  4.3 User personalization
    4.3.1 Personalization based on what?
    4.3.2 Personalized Popularity Score
  4.4 Similarity of Documents
  4.5 Enhancing Web Search using Semantic Annotations
    4.5.1 Content Score
    4.5.2 Popularity Score
5 Ranking for XML Data
  5.1 Popularity Score
  5.2 Content Score
    5.2.1 Vector Space Model
    5.2.2 Distance between keywords
    5.2.3 Ranking query answer trees
    5.2.4 Distance between candidate item and keyword
  5.3 Top-k evaluation
6 Ranking for RDF Data
  6.1 Weighting RDF properties
  6.2 Popularity Score
  6.3 Content Score
  6.4 Relevance of the relation between two resources
    6.4.1 DBPedia Relationship Finder
    6.4.2 SemRank
    6.4.3 Maguitman et al.
    6.4.4 Corese
  6.5 Query ranking
  6.6 Ranking for RDF rule languages
7 Ranking for Folksonomies
  7.1 Popularity Score
    7.1.1 Overall Popularity Score
    7.1.2 Popularity Score for Users
    7.1.3 Popularity Score for Tags
    7.1.4 Popularity Score for Resources
  7.2 Content Score
  7.3 Similarity
    7.3.1 Similarity for two users
    7.3.2 Similarity for two tags
8 Rank Aggregation
  8.1 Similarity Scores
  8.2 Rank positions
    8.2.1 Borda
    8.2.2 Copeland
    8.2.3 Kemeny
    8.2.4 Local Kemenization
  8.3 Selected Aggregations
    8.3.1 Aggregation in XRank
    8.3.2 Aggregation in XSearch
    8.3.3 Aggregation in RSS
    8.3.4 Aggregation in Web Search Engines
9 Conclusion and future work

List of Figures

1 Developing paths to the Social Semantic Web [6, p. 8]
2 Small Web graph [37, p. 1]
3 Bipartite Graph [37, p. 3]
4 Hubs & Authorities [54, p. 6]
5 Simplified PageRank Calculation [51, p. 4]
6 Example Hyperlink Graph [43, p. 250]
7 An Example XML Document [27, p. 2]
8 XML fragment for XXL [68, p. 107]
9 DBPedia Relationship Finder: Ludwig van Beethoven and Vienna
10 Example RDF Graph [2, p. 119]
11 Example Ontology [47, p. 436]

List of Tables

1 Comparison of combination functions
2 Pair-wise comparisons of a, b, c, d
3 Comparisons won for a, b, c, d
4 Kemeny sequence scores

1 Introduction

Result rankings are an important and integral feature of most current Information Retrieval systems; web search engines, for example, are often evaluated on how highly a relevant result for a query is ranked. Fuzzy matching (approaches that include not only strict matches but also results that are relevant without matching the strict interpretation of the query) and ranking are closely related. Though they do not have to be used in conjunction, they often are, in particular to allow a fuzzy matching engine to differentiate looser results from results that adhere more strictly to the query. The true power of ranking and fuzzy matching is unleashed only in combination: fuzzy matching extends the set of results, and ranking brings it into an order that makes the results easily consumable by the user even if the number of results is very large.

While ranking is widely used in web search and other IR applications, conventional query languages for (semi-)structured data such as XQuery, SQL, or SPARQL do not usually employ fuzzy matching or rank results. As the amount of structured web data increases and the semantic web continues to emerge, the need arises for solutions that allow laypeople to query structured data. This is true in particular in the context of the social semantic web, where a heterogeneous user base interacts to create (semi-)structured data.
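The interplay of fuzzy matching and ranking described in this introduction can be illustrated with a minimal Python sketch. It is not taken from any system discussed in this thesis: the function names, the sample collection DOCS, and the threshold of 0.7 are assumptions chosen for illustration, and the character-level similarity from Python's standard difflib module merely stands in for whatever fuzzy matcher a real engine would use.

```python
from __future__ import annotations

from difflib import SequenceMatcher

# Hypothetical sample collection; not taken from the thesis.
DOCS = [
    "ranking of web pages",
    "rankings for XML data",
    "folksonomies and tags",
    "banking regulation",
]


def similarity(query: str, text: str) -> float:
    """Return the best similarity (0.0 to 1.0) between the query and any word of the text."""
    q = query.lower()
    return max(
        (SequenceMatcher(None, q, word).ratio() for word in text.lower().split()),
        default=0.0,
    )


def strict_match(query: str, documents: list[str]) -> list[str]:
    """Return only the documents that contain the query as an exact word."""
    q = query.lower()
    return [d for d in documents if q in d.lower().split()]


def fuzzy_ranked(query: str, documents: list[str], threshold: float = 0.7) -> list[str]:
    """Return documents whose similarity reaches the threshold, best match first."""
    scored = [(similarity(query, d), d) for d in documents]
    # Fuzzy matching: admit looser results, not only exact ones.
    hits = [(s, d) for s, d in scored if s >= threshold]
    # Ranking: order the enlarged result set so stricter matches come first.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in hits]
```

With this sample collection, a strict match for "ranking" finds only the first document, while the fuzzy variant also admits "rankings for XML data" and, further down the order, "banking regulation". The last hit shows why an ordering is needed once matching is loosened.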
Research has been dedicated to combining web querying and web search and to introducing information retrieval methods to querying, for example in the form of extensions to conventional query languages, visual tools for exploratory search, extensions of web keyword search to include (some) structure, and keyword search over structured data. One important issue that arises in information retrieval systems for structured data is how ranking can be realized in this context. Ranking of documents is a.