Extending Linked Open Data Resources Exploiting Wikipedia as Source of Information
Università degli Studi di Milano
DIPARTIMENTO DI INFORMATICA
Scuola di Dottorato in Informatica – XXV ciclo

PhD thesis

Extending Linked Open Data resources exploiting Wikipedia as source of information

Student: Alessio Palmero Aprosio (Matricola R08605)
Advisor: Prof. Ernesto Damiani
Co-Advisors: Alberto Lavelli, Claudio Giuliano

Year 2011–2012

Abstract

DBpedia is a project that aims to represent Wikipedia content as RDF triples. It plays a central role in the Semantic Web, due to the large and growing number of resources linked to it. Currently, the information contained in DBpedia is mainly collected from Wikipedia infoboxes, sets of attribute-value pairs that summarize the corresponding Wikipedia page. The extraction procedure requires Wikipedia infoboxes to be manually mapped onto the DBpedia ontology. Thanks to crowdsourcing, a large number of infoboxes in the English Wikipedia have been mapped to the corresponding classes in DBpedia, and the same procedure has subsequently been applied to other languages to create localized versions of DBpedia. However, (i) the number of accomplished mappings is still small and limited to the most frequent infoboxes, as the task is done manually by the DBpedia community; (ii) the mappings need maintenance, due to the constant and rapid changes of Wikipedia articles; and (iii) infoboxes are compiled manually by Wikipedia contributors, so the infobox is missing in more than 50% of Wikipedia articles. As a demonstration of these issues, only 2.35M Wikipedia pages are classified in the DBpedia ontology (with a class other than the top-level owl:Thing), although the English Wikipedia contains almost 4M pages. This shows a clear problem of coverage, and the issue is even worse in other languages (such as French and Spanish).

The objective of this thesis is to define a methodology to increase the coverage of DBpedia in different languages, using various techniques to reach two goals: automatic mapping and DBpedia dataset completion. A key aspect of our research is multi-linguality in Wikipedia: we bootstrap the available information through cross-language links, starting from the available mappings in some pivot languages, and then extend the existing DBpedia datasets (or create new ones from scratch) by comparing the classifications in different languages. When the DBpedia classification is missing, we train a supervised classifier using DBpedia as training data. We also use the distant supervision paradigm to extract the missing properties directly from the Wikipedia articles.

We evaluated our system using a manually annotated test set and some existing DBpedia mappings excluded from the training. The results demonstrate the suitability of the approach in extending the DBpedia resource. Finally, the resulting resources are made available through a SPARQL endpoint and as a downloadable package.
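As a minimal sketch of how such data can be queried, the snippet below asks the public DBpedia SPARQL endpoint (assumed here to be http://dbpedia.org/sparql) for the ontology classes assigned to a single Wikipedia page, using the Python SPARQLWrapper library; the resource Trento is only an illustrative choice. Pages whose only class is the top-level owl:Thing are exactly the coverage gap discussed above.

from SPARQLWrapper import SPARQLWrapper, JSON

# Ask DBpedia for the ontology classes of one resource
# (endpoint URL and example resource are assumptions).
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT ?class WHERE {
        <http://dbpedia.org/resource/Trento> rdf:type ?class .
        FILTER(STRSTARTS(STR(?class), "http://dbpedia.org/ontology/"))
    }
""")
sparql.setReturnFormat(JSON)

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["class"]["value"])  # e.g. http://dbpedia.org/ontology/Town

Since the resources produced in this thesis are likewise published through a SPARQL endpoint, an analogous query against that endpoint returns the newly added classifications.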
Acknowledgments

This thesis would not have been possible without the help and guidance of some valuable people, who assisted me in many ways. First of all, I would like to express my gratitude to my advisors Alberto Lavelli and Claudio Giuliano, who supported me with patience and enthusiasm. I thank Bernardo Magnini, head of the Human Language Technology unit at Fondazione Bruno Kessler, who accepted me despite my particular academic situation. I also wish to thank all the members of the HLT group for always being present, both as inspiring colleagues and as precious friends. I am also thankful to Prof. Silvio Ghilardi and Prof. Ernesto Damiani, from the University of Milan, for their availability and collaboration. I would like to thank Volha Bryl, Philipp Cimiano and Alessandro Moschitti for agreeing to be my thesis referees and for their valuable feedback. I gratefully thank Elena Cabrio, Julien Cojan and Fabien Gandon from INRIA (Sophia Antipolis) for letting me be a part of their research work. Outside research, I thank all the friends and flatmates I met during these years, for adding precious moments to my everyday life in Trento. Finally, I thank my parents and family for their constant and loving support throughout all my studies.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 The context
  1.2 DBpedia
  1.3 The problem
    1.3.1 Coverage expansion
    1.3.2 Automatic mapping
  1.4 The solution
  1.5 Interacting with the Semantic Web
  1.6 Contributions
    1.6.1 DBpedia expansion
    1.6.2 Question Answering
  1.7 Structure of the thesis

2 Linked Open Data
  2.1 Origins
  2.2 Linked Data principles
  2.3 Linked Data in practice
    2.3.1 Resource Description Framework
    2.3.2 Resource Description Framework in Attributes
    2.3.3 SPARQL query language
    2.3.4 Processing RDF data
  2.4 The LOD cloud
  2.5 Resources
    2.5.1 Wikipedia
    2.5.2 DBpedia
    2.5.3 Wikidata

3 Related work
  3.1 LOD Resources
    3.1.1 YAGO
    3.1.2 Freebase
  3.2 Entity classification
  3.3 Schema matching
  3.4 Distant supervision
  3.5 Question answering

4 Pre-processing data
  4.1 Filtering Wikipedia templates
  4.2 Wikipedia and DBpedia entities representation
    4.2.1 Building the entity matrix
    4.2.2 Assigning DBpedia class to entities

5 Automatic mapping generation for classes
  5.1 Infobox mapping
  5.2 Experiments and evaluation

6 Automatic mapping generation for properties
  6.1 Problem Formalization
  6.2 Workflow of the System
  6.3 Pre-processing
    6.3.1 Cross-language information
    6.3.2 DBpedia dataset extraction
    6.3.3 Template and redirect resolution
    6.3.4 Data Extraction
  6.4 Mapping extraction
  6.5 Inner similarity function
    6.5.1 Similarity between object properties
    6.5.2 Similarity between datatype properties
  6.6 Post-processing
  6.7 Evaluation

7 Extending DBpedia coverage on classes
  7.1 Kernels for Entity Classification
    7.1.1 Bag-of-features Kernels
    7.1.2 Latent Semantic Kernel
    7.1.3 Composite Kernel
  7.2 Experiments
    7.2.1 Pre-processing Wikipedia and DBpedia
    7.2.2 Benchmark
    7.2.3 Latent Semantic Models
    7.2.4 Learning Algorithm
    7.2.5 Classification Schemas
    7.2.6 Results

8 Extending DBpedia coverage on properties
  8.1 Workflow
  8.2 Pre-processing
    8.2.1 Retrieving sentences
    8.2.2 Selecting sentences
    8.2.3 Training algorithm
  8.3 Experiments and evaluation

9 Airpedia, an automatically built LOD resource
  9.1 Mapping generation
    9.1.1 Classes (released April 2013)
    9.1.2 Properties (released May 2013)
  9.2 Wikipedia page classification
    9.2.1 Version 1 (released December 2012)
    9.2.2 Version 2 (released June 2013)
    9.2.3 Integration with the Italian DBpedia
  9.3 DBpedia error reporting

10 Case study: QAKiS
  10.1 WikiFramework: collecting relational patterns
  10.2 QAKiS: a system for data answer retrieval from natural language questions
    10.2.1 NE identification and Expected Answer Type (EAT)
    10.2.2 Typed questions generation
    10.2.3 WikiFramework pattern matching
    10.2.4 Query selector
  10.3 Experimental evaluation
  10.4 Demo

11