Wikitology: a Novel Hybrid Knowledge Base Derived from Wikipedia
Total Page:16
File Type:pdf, Size:1020Kb
WIKITOLOGY: A NOVEL HYBRID KNOWLEDGE BASE DERIVED FROM WIKIPEDIA by Zareen Saba Syed Thesis submitted to the Faculty of the Graduate School of the University of Maryland in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2010 ABSTRACT Title of dissertation: WIKITOLOGY: A NOVEL HYBRID KNOWLEDGE BASE DERIVED FROM WIKIPEDIA Zareen Saba Syed, Doctor of Philosophy, 2010 Dissertation directed by: Professor Timothy W. Finin Department of Computer Science and Electrical Engineering World knowledge may be available in different forms such as relational databases, triple stores, link graphs, meta-data and free text. Human minds are capable of un- derstanding and reasoning over knowledge represented in different ways and are influenced by different social, contextual and environmental factors. By following a similar model, we have integrated a variety of knowledge sources in a novel way to produce a single hybrid knowledge base i.e., Wikitology, enabling applications to better access and exploit knowledge hidden in different forms. Wikipedia proves to be an invaluable resource for generating a hybrid knowl- edge base due to the availability and interlinking of structured, semi-structured and un-structured encyclopedic information. However, Wikipedia is designed in a way that facilitates human understanding and contribution by providing interlinking of articles and categories for better browsing and search of information, making the content easily understandable to humans but requiring intelligent approaches for being exploited by applications directly. Research projects like Cyc [61] have resulted in the development of a complex broad coverage knowledge base however, relatively few applications have been built that really exploit it. In contrast, the design and development of Wikitology KB has been incremental and has been driven and guided by a variety of applications and approaches that exploit the knowledge available in Wikipedia in different ways. This evolution has resulted in the development of a hybrid knowledge base that not only incorporates and integrates a variety of knowledge resources but also a variety of data structures, and exposes the knowledge hidden in different forms to applications through a single integrated query interface. We demonstrate the value of the derived knowledge base by developing prob- lem specific intelligent approaches that exploit Wikitology for a diverse set of use cases, namely, document concept prediction, cross document co-reference resolution defined as a task in Automatic Content Extraction (ACE) [1], Entity Linking to KB entities defined as a part of Text Analysis Conference - Knowledge Base Population Track 2009 [65] and interpreting tables [94]. These use cases directly serve to evalu- ate the utility of the knowledge base for different applications and also demonstrate how the knowledge base could be exploited in different ways. Based on our work we have also developed a Wikitology API that applications can use to exploit this unique hybrid knowledge resource for solving real world problems. The different use cases that exploit Wikitology for solving real world prob- lems also contribute to enriching the knowledge base automatically. The document concept prediction approach can predict inter-article and category-links for new Wikipedia articles. Cross document co-reference resolution and entity linking pro- vide a way for specifically linking entity mentions in Wikipedia articles or external articles to the entity articles in Wikipedia and also help in suggesting redirects. In addition to that we have also developed specific approaches aimed at automatically enriching the Wikitology KB by unsupervised discovery of ontology elements using the inter-article links, generating disambiguation trees for entities and estimating the page rank of Wikipedia concepts to serve as a measure of popularity. The set of approaches combined together can contribute to a number of steps in a broader uni- fied framework for automatically adding new concepts to the Wikitology knowledge base. WIKITOLOGY: A NOVEL HYBRID KNOWLEDGE BASE DERIVED FROM WIKIPEDIA by Zareen Saba Syed Thesis submitted to the Faculty of the Graduate School of the University of Maryland in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2010 c Copyright by Zareen Saba Syed 2010 To My Parents. ii Table of Contents List of Tables v List of Figures vii 1 Introduction 1 1.1 Advantages of Deriving a Knowledge Base from Wikipedia . 4 1.2 Thesis Statement . 7 1.3 Contributions . 7 1.4 Thesis Outline . 12 2 Background and Related Work 15 2.1 Wikipedia Structure and Content . 16 2.2 Wikipedia in-built Search . 24 2.3 Existing Approaches for Refining and Enriching Wikipedia . 24 2.4 Use of Wikipedia as a Knowledge Resource . 27 2.5 Other Open Knowledge Sources . 27 3 Introduction to Wikitology Knowledge Base 36 3.1 Evolution of Wikitology Knowledge Base . 40 3.2 Evaluating the Knowledge Base . 42 3.3 Automatic Enrichment of Knowledge Base . 42 3.4 Wikitology v/s Existing Wikipedia based Knowledge Resources . 43 3.5 Conclusion . 45 4 Novel Wikitology Based Approaches for Common Use Cases 46 4.1 Named Entity Classification . 47 4.2 Document Concept Prediction . 59 4.3 Cross Document Co-reference Resolution . 83 4.4 Entity Linking . 94 4.5 Interpreting Tables . 101 5 Wikitology Hybrid KB Design, Query Interface and API 112 5.1 Introduction . 112 5.2 Related Work . 113 5.3 Wikitology Architecture and Design . 115 5.4 Wikitology API . 130 5.5 Example Applications . 131 5.6 Conclusions . 139 6 Approaches for Automatically Enriching Wikitology 140 6.1 Unsupervised techniques for discovering ontology elements from Wikipedia article links . 141 6.2 Disambiguation Trees . 161 6.3 A Broader Framework for automatically enriching Wikitology . 168 iii 7 Evaluation Summary 173 7.1 Approaches for Evaluating Knowledge Bases . 173 7.2 Evaluating Wikitology Knowledge Base . 174 7.3 Evaluation for Document Concept Prediction task . 176 7.4 Evaluation for Cross Document Coreference Resolution Task . 177 7.5 Evaluation for Entity Linking Task . 179 7.6 Evaluation for Interpreting Tables . 181 7.7 Discussion . 182 7.8 Target Applications for Wikitology . 183 7.9 Conclusions . 185 8 Conclusions 186 8.1 Discussion . 188 8.2 Future Work . 192 Bibliography 193 iv List of Tables 2.1 Wikipedia Category Statistics (Wikipedia March 2006 dump). 21 2.2 Property related statistics in DBpedia Infobox Ontology. 30 3.1 Summary of features distinguishing Wikitology from existing knowl- edge resources derived from Wikipedia. 44 4.1 Wikipedia statistics, articles with word born in them. 53 4.2 Results showing accuracy obtained using different feature sets. H: Headings Only, F: Words in first paragraph, H+F: Headings + Words in first paragraph, H+F+S: Headings + Words in first paragraph + Stop words. 57 4.3 Confusion matrix using SVM. 57 4.4 Confusion matrix using Decision Trees . 57 4.5 Confusion matrix using Nave Bayes . 58 4.6 Top three concepts predicted for a single test document using Method 1 and Method 2. 71 4.7 Titles of documents in the test sets. 73 4.8 Common concept prediction results for a set of documents . 73 4.9 Common concept prediction results for a set of documents related to \Organic farming" using Method 3 with different pulses and activa- tion functions. 74 4.10 Twelve features were computed for each pair of entities using Wikitol- ogy, seven aimed at measuring their similarity and five for measuring their dissimilarity. 90 4.11 Evaluation results for Cross Document Entity Coreference Resolution using Wikitology Generated Features. 93 4.12 Accuracy obtained for Entity Linking Task for entities that exist in Wikipedia using different approaches. 98 4.13 Accuracy obtained for positive and negative entity matches using learned score thresholds. 99 4.14 Accuracy for PageRank prediction for different classification Algo- rithms. 105 4.15 Accuracy per PageRank class using Decision Tree (J48 pruned) algo- rithm. 106 4.16 Feature contributions for PageRank approximations. 106 4.17 Test Data Set for Interpreting Tables. 110 4.18 Entity Type Distribution in Test Data Set for Interpreting Tables. 111 4.19 Evaluation Results for Class Label Prediction and Cell Linking. 111 5.1 Property related statistics in DBpedia Infobox Ontology. 126 5.2 Wikitology Index Structure and Field Types. 128 6.1 Fifteen slots were discovered for musician Michael Jackson along with scores and example fillers. 150 v 6.2 Comparison of the precision, recall and F-measure for different feature sets for entity classification. The k column shows the number of clusters that maximized the F score. 154 6.3 Manual evaluation of discovered properties. 155 6.4 Evaluation results for class hierarchy prediction using different simi- larity functions. 156 6.5 Disambiguation Features evaluation for different entity types. 167 7.1 Summary of performance of Wikitology based approaches. 184 vi List of Figures 2.1 Structure of a typical Wikipedia Article with Title, Contents, Infobox, Table of Contents with different sections and Categories. 18 2.2 Demonstration of DBpedia resource on Barack Obama. 29 2.3 Linked Open Data datasets on the web (april 2008). 31 2.4 Freebase Topic on Barack Obama along with structured data. 32 2.5 Demonstration of Powerset Factz available online. 33 3.1 Wikitology is a hybrid knowledge base integrating information in structured and unstructured forms. 40 3.2 Evolution of the Wikitology System showing the different data struc- tures incorporated over time and the different applications targeted. 41 4.1 These graphs show the precision, average precision, recall and f- measure metrics as the average similarity threshold varies from 0.1 to 0.8. Legend label M1 is method 1, M1(1) and M1(2) are method 1 with scoring schemes 1 and 2, respectively, and SA1 and SA2 repre- sent the use of spreading activation with one and two pulses, respec- tively. 76 4.2 In the concept prediction task, precision, average precision and re- call improve at higher similarity thresholds, with average precision remaining higher than precision, indicating that our ranking scheme ranks relevant links higher than irrelevant links.