Term-Propagation Over Structured Data Using Eigenvector Computation
Total Page:16
File Type:pdf, Size:1020Kb
INSTITUT FÜR INFORMATIK LEHR- UND FORSCHUNGSEINHEIT FÜR PROGRAMMIER- UND MODELLIERUNGSSPRACHEN Term-Propagation over Structured Data using Eigenvector Computation Fabian Kneißl Diplomarbeit Beginn der Arbeit: 15. März 2010 Abgabe der Arbeit: 13. September 2010 Betreuer: Prof. Dr. François Bry Klara Weiand, M.Sc. Dr. Tim Furche Erklärung Hiermit versichere ich, dass ich diese Diplomarbeit selbständig verfasst habe. Ich habe dazu keine anderen als die angegebenen Quellen und Hilfsmittel verwendet. München, den 13. September 2010 . Fabian Kneißl Zusammenfassung In dieser Diplomarbeit stellen wir pest (term-propagation using eigenvector computation over structured data) vor, einen neuen Ansatz für approximative Suche auf strukturierten Daten. pest nutzt die Struktur der Daten aus, um Termgewichte zwischen Objekten, die miteinander in Beziehung stehen, zu propagieren. Dabei gehen wir speziell auf solche struk- turierten Daten ein, bei denen sinnvolle Antworten bereits durch die anwendungsbezogene Semantik gegeben sind, z.B. bei Seiten in Wikis oder Personen in sozialen Netzwerken. Die pest-Matrix verallgemeinert die bei PageRank verwendete Google-Matrix durch Faktoren, die von den Termgewichten abhängen, und ermöglicht die unterschiedlich starke Gewich- tung (semantischer) Ähnlichkeit für verschiedene Beziehungen in den Daten, z.B. Freund vs. Arbeitskollege in einem sozialen Netzwerk. Die Eigenvektoren dieser pest-Matrix stel- len die Verteilung der Terme nach der Propagation dar. Die Eigenvektoren aller Terme zusammen bilden einen Vektorraum-Index, der die Struktur der Daten einbezieht und der mit Standardtechniken des Information Retrieval gehandhabt werden kann. In umfassenden Experimenten mit einem Wiki aus dem wirklichen Leben zeigen wir, wie pest die Qualität der Suchergebnisse im Vergleich zu mehreren existierenden Ansätzen verbessert. Außerdem stellen wir anhand von Experimenten mit einer sozialen Lesezeichen-Plattform dar, wie unterschiedliche Ansätze zur Auswahl der Beziehungen die Suchergebnisse beeinflussen. Abstract In this thesis, we present pest (term-propagation using eigenvector computation over structured data), a novel approach to approximate querying of structured data that ex- ploits the structure to propagate term weights between related data items. We focus on structured data for which meaningful answers are given through the application semantics, e.g., pages in wikis or persons in social networks. The pest matrix generalizes the Google Matrix used in PageRank with a term weight dependent factor and allows different levels of (semantic) closeness for different relations in the data, e.g., friend vs. co-worker in a social network. Its eigenvectors represent the distribution of terms after propagation. The eigenvectors for all terms together form a vector space index that takes the structure of the data into account and can be used with standard information retrieval techniques. In extensive experiments on a real life wiki, we show how pest improves the quality of the ranking over a range of existing ranking approaches. Furthermore, experiments on a social bookmarking site show how different approaches to relation selection influence the result rankings. Acknowledgements I want to thank everyone who supported me during my thesis. First of all, I am deeply grateful to Prof. François Bry, Klara Weiand and Tim Furche for their outstanding and helpful supervision of this thesis. The deep integration into this research topic right from the beginning and the constant feedback helped me to advance quickly. Moreover, I would like to thank my family for their kind all-time support and for making my studies possible. Last but not least, many thanks also go to Tilman Dingler, Richard Röttger and Kevin Wiesner for proofreading my thesis. I really appreciate all these contributions. Contents 1. Introduction 13 2. Preliminaries 19 2.1. Graph and Matrix Properties . 19 2.1.1. Graph Definition . 19 2.1.2. Adjacency Matrix . 20 2.1.3. Special Properties of Matrices . 21 2.2. Google’s PageRank Algorithm . 22 2.3. Vector Space Model . 24 2.4. KiWi as an Example for a Semantic Wiki . 25 3. Related Work 27 3.1. Fuzzy Matching in Information Retrieval . 27 3.1.1. Tree Edit Distance . 27 3.1.2. Adapting the Vector Space Model . 28 3.1.3. Differences to pest ............................. 29 3.2. Fuzzy Matching using Adapted PageRank Algorithms . 30 3.2.1. Adapted PageRank Algorithms . 30 3.2.2. pest in Comparison to Existing PageRank Approaches . 31 4. The pest Algorithm 33 4.1. Overview . 33 4.2. Detailed Steps of the Algorithm . 33 4.2.1. Weighted Propagation Graph . 33 4.2.2. Normalization of the Weighted Adjacency Matrix . 35 4.2.3. Transformation into the pest Matrix . 36 4.2.4. Term Propagation using Eigenvector Computation . 37 4.3. Structure Propagation with the pest Matrix: An Example . 38 4.4. Normalization Techniques for the Adjacency Matrix . 40 5. Implementation 43 5.1. Applied Technology . 43 5.1.1. Java as Programming Language . 43 5.1.2. Parsing and Accessing the Data . 44 5.1.3. Text Analysis with Lucene . 44 5.2. Overview of the pest Architecture . 45 5.2.1. Preprocessing and Parsing of Data Sources . 45 5.2.2. Implementation of the pest Algorithm . 48 5.2.3. Comparison of Result Rankings . 49 5.2.4. Presentation of Results . 50 5.3. Data Structures . 52 5.3.1. DOT Language Input Format . 52 5.3.2. MySQL as Input Source . 53 5.3.3. In-Memory Matrix . 54 9 5.3.4. In-File Matrix . 55 6. Experiment Results 57 6.1. Simpsons Wiki . 57 6.1.1. Experiment: Setup and Parameters . 57 6.1.2. Comparing pest .............................. 58 6.1.3. Run Time Performance Evaluation and Benchmarks . 60 6.1.4. Results from the Different Normalization Techniques . 62 6.2. Delicious . 63 6.2.1. Description of the Dataset . 63 6.2.2. Experiment Setup . 63 6.2.3. Results . 66 6.3. Properties of Datasets Processed by pest .................... 68 6.3.1. Identification of Data Items . 68 6.3.2. Connectivity between Data Items . 69 6.3.3. Distinctiveness of Terms . 70 7. Outlook and Conclusion 71 7.1. Outlook . 71 7.1.1. Improvements to the pest Algorithm . 71 7.1.2. Approaches to Increase the Run Time Performance . 72 7.1.3. Extension for a Document Collection with Frequent Changes . 73 7.2. Conclusion . 74 A. Appendix 75 A.1. Command Line Syntax . 75 A.2. ANTLR Grammar for DOT-Like Syntax . 76 A.3. Results of the Simpsons Experiments . 77 A.3.1. MediaWiki-Style Ranking for Query “Bart” . 77 A.3.2. Ranking for Query “Bart” . 78 A.3.3. MediaWiki-Style Ranking for Query “Moe beer” . 79 A.3.4. Ranking for Query “Moe beer” . 80 A.3.5. MediaWiki-Style Ranking for Query “Bart” and Normalization 1 . 81 A.3.6. MediaWiki-Style Ranking for Query “Bart” and Normalization 2 . 82 A.3.7. MediaWiki-Style Ranking for Query “Bart” and Normalization 3 . 83 A.3.8. MediaWiki-Style Ranking for Query “Bart” and Normalization 4 . 84 A.3.9. MediaWiki-Style Ranking for Query “Bart” and Normalization 5 . 85 A.4. Results of the Delicious Experiments . 86 A.4.1. Method 1, Query “art” . 86 A.4.2. Method 1, Query “photography” . 87 A.4.3. Method 2, Query “art” . 88 A.4.4. Method 2, Query “photography” . 89 A.4.5. Method 3, Query “art” . 90 A.4.6. Method 3, Query “photography” . 91 A.4.7. Method 4, Query “art” . 92 A.4.8. Method 4, Query “photography” . 93 10 A.4.9. Method 5, Query “art” . 94 A.4.10.Method 5, Query “photography” . 95 A.4.11.Method 6, Query “art” . 96 A.4.12.Method 6, Query “photography” . 97 A.4.13.Method 7, Query “art” . 98 A.4.14.Method 7, Query “photography” . 99 A.4.15.Method 8, Query “art” . 100 A.4.16.Method 8, Query “photography” . 101 11 12 1. Introduction Mary wants to get an overview of software projects in her company that are written in Java and that make use of the Lucene library for full-text search. According to the conventions of her company’s wiki, a brief introduction to each software project is provided by a wiki page tagged with “introduction”. Thus, Mary enters the query for wiki pages containing “java” and “lucene” that are also tagged with “introduction”. The search engine in the semantic wiki KiWi [41] allows for such a search on structured data with different kinds of data items [13]. However, the results fall short of Mary’s expectations for two reasons that are also illustrated in the sample wiki of Figure 1: 1. Some projects may not follow the wiki’s convention (or the convention may have changed over time) to use the tag “introduction” for identifying project briefs. This may be the case for Document 5 in Figure 1. Mary could loosen her query to retrieve all pages containing “introduction” (but not necessarily tagged with it). However, in this case pages that follow the convention are not necessarily ranked higher than other matching pages. 2. Some projects use the rich annotation and structuring mechanism of a wiki to split a wiki page into sub-sections, as in the case of the description of KiWi in Documents 1 and 2 from Figure 1; or the mechanism to link to related projects or technologies (rather than discuss them inline), as in the case of Document 4 and 5 in Figure 1. Such projects are not included in the results of the original query at all. Again, Mary could try to change her query to allow keywords to occur in sub-sections or in linked to documents, but such queries quickly become rather complex or impossible with the limited search facilities most wikis provide. Furthermore, this solution suffers from the same problem as addressed above: Documents following the wiki’s conventions are not necessarily ranked higher than those only matched due to the relaxation of the query.