Efficient Extraction and Query Benchmarking of Wikipedia Data

Submitted to the Faculty of Mathematics and Computer Science of Leipzig University

DISSERTATION

for the award of the academic degree

Doktor-Ingenieur (Dr. Ing.)

in the field of Computer Science

submitted by M.Sc. Mohamed Mabrouk Mawed Morsey

born on 15 November 1980 in Cairo, Egypt

Leipzig, 13 April 2014

Acknowledgement

First of all I would like to thank my supervisors, Dr. Jens Lehmann, Prof. Sören Auer, and Prof. Klaus-Peter Fähnrich, without whom I could not have started my Ph.D. at Leipzig University. Special thanks to my direct supervisor Dr. Jens Lehmann, with whom I started work on the Ph.D. proposal submitted to the Deutscher Akademischer Austauschdienst (DAAD) in order to pursue my Ph.D. at the Agile Knowledge Engineering and Semantic Web (AKSW) group. He has continuously supported me throughout my Ph.D. work, giving advice and recommendations for further research steps. His comments and notes were very helpful, particularly while writing the papers we have published together. I would also like to thank him for proofreading this thesis and for his helpful feedback, which improved its quality.

Special thanks also to Prof. Sören Auer, whom I first contacted asking for a vacancy to conduct my Ph.D. research at his research group. As the head of our group, Prof. Sören Auer established a weekly meeting, called the "Writing Group", for discussing how to write robust scientific papers. His notes and comments during these meetings were very useful in directing me towards writing good scientific papers. I would also like to thank Prof. Klaus-Peter Fähnrich for the regular follow-up meetings he organized in order to evaluate the performance of all Ph.D. students. During these meetings, he proposed several directions, for me and for other Ph.D. students as well, on how to deepen and extend our research topics.

I would like to thank all of my colleagues in the Machine Learning and Ontology Engineering (MOLE) group for providing me with their useful comments and guidelines, especially during the initial phase of my Ph.D.

I would like to dedicate this work to the souls of my parents, without whom I could not have done anything in my life. With their help and support, I could take my first steps in my scientific career. Special thanks go to all my family members: my son Momen, my wife Hebaalla, and my dearest sisters Reham and Nermeen.

Bibliographic Data

Title: Efficient Extraction and Query Benchmarking of Wikipedia Data
Author: Mohamed Mabrouk Mawed Morsey
Institution: Universität Leipzig, Fakultät für Mathematik und Informatik
Statistical Information: 128 pages, 29 figures, 19 tables, 2 appendices, 98 literature references

Abstract

Knowledge bases are playing an increasingly important role for integrating information between systems and over the Web. Today, most knowledge bases cover only specific domains, they are created by relatively small groups of knowledge engineers, and it is very cost intensive to keep them up-to-date as domains change. In parallel, Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors. The DBpedia (http://dbpedia.org) project makes use of this large collaboratively edited knowledge source by extracting structured content from it, interlinking it with other knowledge bases, and making the result publicly available. DBpedia had and has a great effect on the Web of Data and became a crystallization point for it. Furthermore, many companies and researchers use DBpedia and its public services to improve their applications and research approaches. However, the DBpedia release process is heavy-weight and the releases are sometimes based on data that is several months old. Hence, a strategy for keeping DBpedia always in synchronization with Wikipedia is highly desirable. In this thesis we propose the DBpedia Live framework, which reads a continuous stream of updated Wikipedia articles and processes it. DBpedia Live processes that stream on-the-fly to obtain RDF data and updates the DBpedia knowledge base with the newly extracted data. DBpedia Live also publishes the newly added/deleted facts in files, in order to enable synchronization between our DBpedia endpoint and other DBpedia mirrors. Moreover, the new DBpedia Live framework incorporates several significant features, e.g. abstract extraction, ontology changes, and changeset publication. Basically, knowledge bases, including DBpedia, are stored in triplestores in order to facilitate accessing and querying their respective data. Furthermore, triplestores constitute the backbone of increasingly many Data Web applications. It is thus evident that the performance of those stores is mission critical for individual projects as well as for data integration on the Data Web in general. Consequently,

it is of central importance during the implementation of any of these applications to have a clear picture of the weaknesses and strengths of current triplestore implementations. We introduce a generic SPARQL benchmark creation procedure, which we apply to the DBpedia knowledge base. Previous approaches often compared relational databases and triplestores and, thus, settled on measuring performance against a relational database which had been converted to RDF by using SQL-like queries. In contrast to those approaches, our benchmark is based on queries that were actually issued by humans and applications against existing RDF data not resembling a relational schema. Our generic procedure for benchmark creation is based on query-log mining, clustering and SPARQL feature analysis. We argue that a pure SPARQL benchmark is more useful to compare existing triplestores and provide results for the popular triplestore implementations Virtuoso, Sesame, Apache Jena-TDB, and BigOWLIM. The subsequent comparison of our results with other benchmark results indicates that the performance of triplestores is by far less homogeneous than suggested by previous benchmarks.

Further, one of the crucial tasks when creating and maintaining knowledge bases is validating their facts and maintaining the quality of their inherent data. This task includes several subtasks, and in this thesis we address two of those major subtasks, specifically fact validation and provenance, and data quality. The subtask of fact validation and provenance aims at providing sources for facts in order to ensure correctness and traceability of the provided knowledge. This subtask is often addressed by human curators in a three-step process: issuing appropriate keyword queries for the statement to check using standard search engines, retrieving potentially relevant documents and screening those documents for relevant content. The drawbacks of this process are manifold. Most importantly, it is very time-consuming as the experts have to carry out several search processes and must often read several documents. We present DeFacto (Deep Fact Validation), an algorithm for validating facts by finding trustworthy sources for them on the Web. DeFacto aims to provide an effective way of validating facts by supplying the user with relevant excerpts of webpages as well as useful additional information, including a score for the confidence DeFacto has in the correctness of the input fact.

On the other hand, the subtask of data quality maintenance aims at evaluating and continuously improving the quality of the data of knowledge bases. We present a methodology for assessing the quality of knowledge bases' data, which comprises a manual and a semi-automatic process. The first phase includes the detection of common quality problems and their representation in a quality problem taxonomy. In the manual process, the second phase comprises the evaluation of a large number of individual resources, according to the quality problem taxonomy, via crowdsourcing. This process is accompanied by a tool wherein a user assesses an individual resource and evaluates each fact for correctness. The semi-automatic process involves the generation and verification of schema axioms. We report the results obtained by applying this methodology to DBpedia.

Contents

1. Introduction
   1.1. Motivation
   1.2. Contributions
   1.3. Chapter Overview

2. Semantic Web Technologies
   2.1. Semantic Web Definition
   2.2. Resource Description Framework - RDF
        2.2.1. RDF Resource
        2.2.2. RDF Property
        2.2.3. RDF Statement
        2.2.4. RDF Serialization Formats
               2.2.4.1. N-Triples
               2.2.4.2. RDF/XML
               2.2.4.3. N3
               2.2.4.4. Turtle
        2.2.5. Ontology
        2.2.6. Ontology Languages
               2.2.6.1. RDFS
               2.2.6.2. OWL
        2.2.7. SPARQL Query Language
        2.2.8. Triplestore

3. Overview on the DBpedia Project
   3.1. Introduction to DBpedia
   3.2. DBpedia Extraction Framework
        3.2.1. General Architecture
        3.2.2. Extractors
        3.2.3. Raw Infobox Extraction
        3.2.4. Mapping-Based Infobox Extraction
        3.2.5. URI Schemes
        3.2.6. Summary of Recent Developments
   3.3. DBpedia Ontology
   3.4. Interlinking
        3.4.1. Outgoing Links
        3.4.2. Incoming Links


4. DBpedia Live Extraction
   4.1. Live Extraction Framework
        4.1.1. General System Architecture
        4.1.2. Extraction Manager
   4.2. New Features
        4.2.1. Abstract Extraction
        4.2.2. Mapping-Affected Pages
        4.2.3. Unmodified Pages
        4.2.4. Changesets
        4.2.5. Synchronization Tool
        4.2.6. New Extractors
        4.2.7. Delta Calculation
        4.2.8. Important Pointers
   4.3. DBpedia Live Usage

5. Data Quality
   5.1. Crowdsourcing for Data Quality
        5.1.1. Assessment Methodology
        5.1.2. Quality Problem Taxonomy
               5.1.2.1. Accuracy
               5.1.2.2. Relevancy
               5.1.2.3. Representational-Consistency
               5.1.2.4. Interlinking
        5.1.3. A Crowdsourcing Quality Assessment Tool
        5.1.4. Evaluation of DBpedia Data Quality
               5.1.4.1. Evaluation Methodology
                        5.1.4.1.1. Manual Methodology
                        5.1.4.1.2. Semi-automatic Methodology
               5.1.4.2. Evaluation Results
                        5.1.4.2.1. Manual Methodology
                        5.1.4.2.2. Semi-automatic Methodology
   5.2. Fact Validation
        5.2.1. Trustworthiness Analysis of Webpages
        5.2.2. Features for Deep Fact Validation
        5.2.3. Evaluation
               5.2.3.1. Training DeFacto
               5.2.3.2. Experimental Setup
               5.2.3.3. Results and Discussion

6. DBpedia SPARQL Benchmark
   6.1. Overview on Benchmarking
        6.1.1. Objectives of Benchmarking
   6.2. Dataset Generation
   6.3. Query Analysis and Clustering


   6.4. SPARQL Feature Selection and Query Variability
   6.5. Experimental Setup
        6.5.1. Benchmark Phases
   6.6. Benchmarking Results
        6.6.1. DBPSB Version 1
        6.6.2. DBPSB1 Results Discussion
        6.6.3. DBPSB Version 2

7. Related Work
   7.1. Semantic Data Extraction from Wikipedia
   7.2. RDF Benchmarks
        7.2.1. Existing Benchmarks
        7.2.2. Comparison between DBPSB and The Other Benchmarks
   7.3. Data Quality Assessment
   7.4. Fact Validation

8. Conclusions and Future Work
   8.1. Conclusions
        8.1.1. DBpedia Live Extraction
        8.1.2. DBPSB
        8.1.3. DeFacto
   8.2. Future Work
        8.2.1. DBpedia Live Extraction
        8.2.2. DBPSB
        8.2.3. DeFacto

A. DBpedia SPARQL Benchmark (DBPSB) Queries
   A.1. DBPSB Version 1
   A.2. DBPSB Version 2

B. Curriculum Vitae

List of Tables

List of Figures

Selbständigkeitserklärung

1. Introduction

Tim Berners-Lee defines the Semantic Web as "an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation" [Berners-Lee et al., 2001]. As Semantic Web technology evolves, many open areas emerge which attract more research focus. Examples of these areas are ontology management, semantic data provenance, and interlinking of knowledge bases. In this work we investigate three topics of the Semantic Web, specifically DBpedia Live knowledge base development and maintenance, triplestore benchmarking, and semantic data provenance.

1.1. Motivation

Knowledge bases are playing an increasingly important role for integrating information between systems and over the Web. Today, most knowledge bases cover only specific domains, they are created by relatively small groups of knowledge engineers, and it is very cost intensive to keep them up-to-date as domains change. In parallel, Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors. Wikipedia is the largest online encyclopedia and currently the 6th most visited website according to alexa.com. The DBpedia (http://dbpedia.org) project makes use of this large collaboratively edited knowledge source by extracting structured content from it, interlinking it with other knowledge bases, and making the result publicly available. Over the past few years the DBpedia knowledge base has turned into a crystallization point for the emerging Web of Data. Several tools have been built on top of it, e.g. DBpedia Mobile1, Query Builder2, Relation Finder [Lehmann et al., 2007], and Navigator3. It is used in a variety of applications, for instance Muddy Boots, Open Calais, Faviki, Zemanta, LODr, and TopBraid Composer (cf. [Lehmann et al., 2009]). Despite this success, DBpedia faces several obstacles:

- Producing a DBpedia release requires manual effort and – since dumps of the Wikipedia database are created on a monthly basis – DBpedia has never reflected the current state of Wikipedia.

- DBpedia uses a wiki for managing the ontology and the mappings4. Contributors of that wiki continuously revise the ontology and/or the mappings, with the result that the ontology kept in DBpedia gets outdated, i.e. out of synchronization with the mappings wiki.

1 http://beckr.org/DBpediaMobile/
2 http://querybuilder.dbpedia.org
3 http://navigator.dbpedia.org
4 http://mappings.dbpedia.org

- As DBpedia extracts data from Wikipedia, many data quality related problems and/or issues may arise, e.g. the datatype of a phone number could be misclassified as a number instead of a string.

In this dissertation, each of these obstacles is discussed in detail, and a solution to each one is proposed and discussed.

Basically, knowledge bases use triplestores for hosting and managing their data. It is thus evident that the performance of those stores is mission critical for individual projects as well as for data integration on the Data Web in general. Consequently, it is of central importance during the implementation of any of these applications to have a clear picture of the strengths and weaknesses of the available triplestores. Most contemporary triplestores adopt SPARQL [Prud'hommeaux and Seaborne, 2008] as the front-end query language for their inherent data. In this thesis, we propose a generic SPARQL benchmark creation methodology. This methodology is based on flexible data generation mimicking an input data source, query-log mining, clustering and SPARQL feature analysis. We apply the proposed methodology to datasets of various sizes derived from the DBpedia knowledge base.

The significance of knowledge bases is mainly determined by the quality of their inherent data. In other words, when the data of a knowledge base is of high quality, this ensures success and value for that knowledge base, and vice versa. The quality of the data of a knowledge base depends on several factors, e.g. the origin (the source) of the data, and the stability of the extraction software. One of the most important aspects of knowledge quality is provenance. While provenance is an important aspect of data quality [Hartig, 2009], to date only few knowledge bases actually provide provenance information. For instance, less than 3% of the more than 708.19 million RDF (Resource Description Framework) documents indexed by Sindice5 contain metadata such as creator, created, source, modified, contributor, or provenance.6 This lack of provenance information makes the validation of the facts in such knowledge bases utterly tedious. In addition, it hinders the adoption of such data in business applications as the data is not trusted [Hartig, 2009]. In this work, we propose a fact validation approach and tool which can make use of one of the largest information sources, i.e. the Web.

5 http://www.sindice.com
6 Data retrieved on July 9, 2013


1.2. Contributions

This thesis proposes approaches to tackle each of the aforementioned challenges. Regarding the first challenge, i.e. the challenge of keeping the DBpedia data up-to-date, we develop the DBpedia Live framework, which aims at keeping DBpedia always in synchronization with Wikipedia with a minimal delay of only a few minutes. The main motivation behind this enhancement is that our approach turns DBpedia into a real-time editable knowledge base, while retaining the tight coupling to Wikipedia. It also opens the door for an increased use of DBpedia in different scenarios. For instance, a user may like to add to her movie website a list of the highest-grossing movies produced in the current year. Due to the live extraction process, this becomes much more appealing, since the contained information will be as up-to-date as Wikipedia instead of being several months delayed. Overall, we have accomplished the following contributions:

1. migration of the previous incomplete PHP-based DBpedia Live framework to the new Java-based framework, which also maintains up-to-date information,

2. addition of abstract extraction capability,

3. re-extraction of mapping-affected pages,

4. flexible low-priority re-extraction of pages, which have not been modified for a longer period of time – this allows changes in the underlying extraction framework, which potentially affect all Wikipedia pages, while still processing the most recent Wikipedia changes,

5. publishing added and deleted triples as compressed N-Triples files,

6. synchronization of the ontology of DBpedia with the ontology contained in the mappings wiki,

7. building a synchronization tool, which downloads those compressed N-Triples files and updates another triplestore with them accordingly, in order to keep it in synchronization with ours,

8. deployment of the framework on our server.

Apart from the DBpedia Live framework, we also devise a data quality assessment methodology and, based on this methodology, empirically assess the data quality of DBpedia. We crowdsource the quality assessment of individual DBpedia resources, in order to evaluate the type and extent of data quality problems occurring in DBpedia.


With regard to the challenge of developing a benchmark for RDF triplestores, we have developed a novel benchmark called DBPSB (DBpedia SPARQL Benchmark). We apply the underlying benchmark creation methodology to the DBpedia dataset and its SPARQL query log. However, the same methodology can be used to obtain application-specific benchmarks for other knowledge bases and query workloads. In general, our benchmark follows the four key requirements for domain-specific benchmarks which are postulated in the Benchmark Handbook [Gray, 1991], i.e. it is (1) relevant, thus testing typical operations within the specific domain, (2) portable, i.e. executable on different platforms, (3) scalable, e.g. it is possible to run the benchmark on both small and very large data sets, and (4) understandable. Basically, DBPSB has the following features:

1. it is based on real data, as it uses the DBpedia knowledge base as its test dataset,

2. it uses real queries to measure the performance of the test triplestores,

3. the queries used for performance evaluation are selected carefully to cover a wide spectrum of SPARQL constructs,

4. the triplestores are tested on datasets of different sizes, in order to assess their scalability.

Regarding data provenance, we have developed the DeFacto (Deep Fact Validation) system, which implements algorithms for validating statements, specifically RDF triples, by finding confirming sources for them on the Web. It takes a statement as input and then tries to find evidence for the validity of that statement by searching for textual information on the Web. In contrast to typical search engines, it does not just search for textual occurrences of parts of the statement, but tries to find webpages which contain the actual statement phrased in natural language. It presents the user with a confidence score for the input statement as well as a set of excerpts of relevant webpages, which allows the user to manually judge the presented evidence. DeFacto has two major use cases: (1) Given an existing true statement, it can be used to find provenance information for it. For instance, the Wikidata project7 aims to create a collection of facts, in which sources should be provided for each fact. DeFacto could be used to achieve this task. (2) It can check whether a statement is likely to be true, providing the user with a confidence score for the statement as well as evidence for the score assigned to it. Our main contributions are thus as follows:

1. an approach that allows checking whether a webpage confirms a fact, i.e., an RDF triple,

7http://meta.wikimedia.org/wiki/Wikidata

4 1.3. Chapter Overview

2. an adaptation of existing approaches for determining indicators for trustworthiness of a webpage,

3. an automated approach to enhancing knowledge bases with RDF provenance data at triple level,

4. a running prototype of DeFacto, the first system able to provide useful confidence values for an input RDF triple given the Web as background text corpus.

1.3. Chapter Overview

Chapter 2 introduces the Semantic Web and its associated technologies, which constitute the basic scientific background required for the reader to understand the thesis. It starts by defining the Semantic Web, and afterwards it discusses the Resource Description Framework (RDF) and its components. It also explains the various RDF serialization formats, e.g. N-Triples, and the differences among them. Then, it moves to a crucial topic of the Semantic Web, namely ontologies and the various languages that can be used to develop them. Eventually, it describes the SPARQL query language, what triplestores are, and how they support SPARQL.

Chapter 3 gives a broad overview of the DBpedia project and how it works. First it gives a general introduction to the DBpedia project, its objectives, the languages it supports, and the applications that depend on it. Furthermore, it outlines the extraction framework, its core extractors, and the part of a Wikipedia article each extractor handles. Finally, it discusses the ontology on which DBpedia relies and the mappings between Wikipedia attributes and the ontology properties.

Chapter 4 introduces the live extraction process, which aims at keeping the DBpedia data up-to-date, in order to always reflect the current state of Wikipedia. In the beginning, it outlines the general system architecture of DBpedia Live, and then it details the extraction manager, which constitutes the core of the live extraction process. Afterwards, it lists the new features of DBpedia Live and explains each feature, how it works, and its value to DBpedia. Lastly, it gives a list of important URLs, and presents a graph indicating how important DBpedia Live is to the community.

Chapter 5 deals mainly with the problems of data quality and of data validation and provenance. Firstly, it describes a data quality evaluation methodology, including the assessment methodology used, the taxonomy of quality problems and issues, and the automated quality evaluation tool. It then presents the results of applying this methodology to DBpedia as a use case, in order to measure its inherent quality. Then, it introduces a fact validation approach called DeFacto, and describes its architecture and how it works. Moreover, it details the trustworthiness analysis approach used in DeFacto and the features upon which DeFacto depends. Eventually, it presents the results of DeFacto.


Chapter 6 addresses the issue of benchmarking RDF triplestores. It introduces our novel RDF benchmark called the DBpedia SPARQL Benchmark (DBPSB), which is based on real data and uses real queries for benchmarking. It starts by giving an overview of benchmarking in general and its objectives. Thereafter, it explains the dataset generation strategy and the benchmark query selection process. It ends by describing the experimental setup, and comparing the results of the tested triplestores for both versions of DBPSB.

Chapter 7 discusses the state-of-the-art research work related to each area of the thesis. It discusses the approaches that extract semantic data from Wikipedia, and then talks about the quality assessment methodologies. Then it describes existing RDF benchmarks, and the data provenance strategies. Eventually, Chapter 8 concludes each point of the thesis and proposes some future work for it.

2. Semantic Web Technologies

This chapter gives a general overview of the Semantic Web. Thereafter, it describes the RDF serialization formats, as well as ontologies and ontology languages, in detail. This chapter is mainly based on [Yu, 2007].

2.1. Semantic Web Definition

There are several different definitions of what the Semantic Web is. Tim Berners-Lee, the inventor of the Semantic Web, defines it as follows: "The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation" [Berners-Lee et al., 2001]. In other words, the Semantic Web makes machines capable of not only presenting data but also processing it. There is a team at the World Wide Web Consortium (W3C) working on improving and standardizing the system, and many languages, publications, tools etc. have already been developed. They have defined it as follows: "Semantic Web is the idea of having data on the Web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration, and reuse of data across various applications" [W3C, 2009]. In other words, the Semantic Web is a machine-readable and machine-understandable Web. You can also think of it as an efficient way of representing data on the World Wide Web, or as a globally linked database. The Semantic Web depends on several technologies, including the Resource Description Framework (RDF), Uniform Resource Identifiers (URIs), and others. In the following sections we elaborate on each of these technologies.

2.2. Resource Description Framework - RDF

RDF is a language used to describe the information of any Web resource. It uses XML as its base standard. The Web resource can be anything, ranging from a Web page to a book or a person. RDF plays a similar role for the Semantic Web as HTML does for the conventional Web. The properties of RDF are:

- RDF is a W3C recommendation [W3C, 2004], and it is centered around metadata, i.e. this metadata on the one hand adds meaning to the Web resource it describes, and on the other hand it can be understood and processed by machines,

- RDF can describe a Web resource regardless of the domain,

- RDF is the foundation for encoding and reusing structured metadata,

- RDF is structured, which makes it machine processable. Thus machines can accomplish some valuable operations with the knowledge represented in RDF,

- RDF enables interoperability among applications exchanging machine-readable information on the Web.

2.2.1. RDF Resource

A resource is anything, e.g. a city, a person, or a Web site, which is described by RDF expressions. The resource is identified and characterized by a so-called Uniform Resource Identifier (URI). The rationale of using URIs for identifying resources is that the name of a resource must be globally unique. Actually, URLs (Uniform Resource Locators), commonly used for accessing Web sites, are simply a subset of URIs. URIs take the same format as URLs, e.g. http://dbpedia.org/resource/William_Shakespeare. The reason behind this is that the domain name part used in the URL, i.e. http://dbpedia.org/ in our example, is ensured to be unique. Therefore the global uniqueness of the resource is ensured. That domain name part plays the role of a namespace. Whereas URLs always refer to Web sites, URIs may or may not refer to an actual Web site.

2.2.2. RDF Property

A property is a form of resource which is used to declare a trait, i.e. it can be used to denote a characteristic, an attribute, or a relation of the resource it is related to. A property is also called a predicate. For instance, http://dbpedia.org/ontology/birthPlace denotes the birth place of a human being. In other words, this property relates a resource representing a person to his/her birth place.

Figure 2.1.: RDF statements represented as a directed graph.


2.2.3. RDF Statement

An RDF statement describes a property of a specific resource. It is also called a triple. It has the following format: "resource (subject) + property (predicate) + property value (object)" [Yu, 2007]. The property value of a statement can be another resource or a string literal. For example:

<http://dbpedia.org/resource/William_Shakespeare> <http://dbpedia.org/ontology/birthPlace> <http://dbpedia.org/resource/Warwickshire> .

This RDF statement simply says "The subject identified by http://dbpedia.org/resource/William_Shakespeare has a property identified by http://dbpedia.org/ontology/birthPlace, whose value is equal to http://dbpedia.org/resource/Warwickshire". This means that the person "William Shakespeare" has a "birthPlace" whose value is "Warwickshire". Another example:

<http://dbpedia.org/resource/William_Shakespeare> <http://dbpedia.org/property/name> "William Shakespeare"@en .

This RDF statement says "The subject identified by http://dbpedia.org/resource/William_Shakespeare has a property identified by http://dbpedia.org/property/name, whose value is equal to "William Shakespeare"". This means that the person "William Shakespeare" has a "name" whose value is "William Shakespeare", and the trailing "@en" is the English language tag. Actually, RDF statements can also be expressed as directed graphs, as shown in Figure 2.1. It is worth pointing out here that the subject or the object or both can be an anonymous resource, called a "blank node". A blank node has no data of its own, but it plays the role of a parent of other properties, in order to allow them to appear. In order to differentiate each blank node from the others, the parser creates an internal unique identifier for each blank node. In other words, this identifier given to the blank node helps in identifying that node within a certain RDF document, whereas the URI given to a resource is guaranteed to be globally unique. Since URIs can be quite long, there is a short format for writing them, using a prefix. For instance, if we use http://dbpedia.org/resource/ as a prefix and give it a name, e.g. dbpedia, then the resource http://dbpedia.org/resource/William_Shakespeare can be written as dbpedia:William_Shakespeare. Similarly, if http://dbpedia.org/ontology/ is used as a prefix with the name dbo, then the property http://dbpedia.org/ontology/birthPlace can be written as dbo:birthPlace for short. This format is very useful in writing RDF statements. Whenever more triples describing a specific resource are added, the machine gets more knowledge about that resource.
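The following small sketch, written in the N3/Turtle shorthand described in Section 2.2.4, illustrates both ideas at once: prefixed names as abbreviations of full URIs, and a blank node grouping properties. The properties dbo:residence and dbo:location are used here purely for illustration and are not part of our running example.

@prefix dbo:     <http://dbpedia.org/ontology/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

# The prefixed name dbpedia:William_Shakespeare abbreviates the full URI
# <http://dbpedia.org/resource/William_Shakespeare>.
dbpedia:William_Shakespeare dbo:birthPlace dbpedia:Warwickshire .

# A blank node (written [] here) has no URI of its own; it only groups the
# properties attached to it and receives an internal identifier from the parser.
dbpedia:William_Shakespeare dbo:residence [
    dbo:location dbpedia:Stratford-upon-Avon
] .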


Table 2.1 shows more RDF statements about William Shakespeare.

Subject                        Predicate        Object
dbpedia:William_Shakespeare    rdf:type         dbo:Person
dbpedia:William_Shakespeare    dbo:birthPlace   dbpedia:Warwickshire
dbpedia:William_Shakespeare    dbo:child        dbpedia:Judith_Quiney
dbpedia:William_Shakespeare    dbo:occupation   dbpedia:Poet
dbpedia:William_Shakespeare    dbo:deathDate    "1616-04-23"^^xsd:date
dbpedia:William_Shakespeare    dbpprop:name     "William Shakespeare"@en
dbpedia:Judith_Quiney          dbo:spouse       dbpedia:Thomas_Quiney

Table 2.1.: Sample RDF statements.

The resource of William Shakespeare is thus the subject of other statements, which give more details about that resource. Note that the object of a particular statement can in turn be the subject of other statement(s), e.g. William Shakespeare has a child identified by the URI dbpedia:Judith_Quiney and the knowledge base contains more information about that child as well. Note also that the object of the fifth statement is a date, so it has a trailing datatype. This small knowledge base can also be viewed as a directed graph, as shown in Figure 2.2.

Figure 2.2.: Small knowledge base about William Shakespeare represented as a graph.

Now using those simple RDF statements you can pose complex queries to the machine, e.g. "Who is the spouse of William Shakespeare's child?".

2.2.4. RDF Serialization Formats

Serializing RDF data is a very crucial issue. There are several formats for serializing RDF data:

2.2.4.1. N-Triples

N-Triples is a simple line-based RDF serialization format. Each RDF triple is written as a separate line and terminated by a period (.).


<http://dbpedia.org/resource/William_Shakespeare> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> .
<http://dbpedia.org/resource/William_Shakespeare> <http://dbpedia.org/ontology/birthPlace> <http://dbpedia.org/resource/Warwickshire> .
<http://dbpedia.org/resource/William_Shakespeare> <http://dbpedia.org/ontology/child> <http://dbpedia.org/resource/Judith_Quiney> .
<http://dbpedia.org/resource/William_Shakespeare> <http://dbpedia.org/ontology/occupation> <http://dbpedia.org/resource/Poet> .
<http://dbpedia.org/resource/William_Shakespeare> <http://dbpedia.org/ontology/deathDate> "1616-04-23"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://dbpedia.org/resource/William_Shakespeare> <http://dbpedia.org/property/name> "William Shakespeare"@en .
<http://dbpedia.org/resource/Judith_Quiney> <http://dbpedia.org/ontology/spouse> <http://dbpedia.org/resource/Thomas_Quiney> .

Figure 2.3.: Sample N-Triples format.

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dbo="http://dbpedia.org/ontology/"
         xmlns:dbpprop="http://dbpedia.org/property/">
  <rdf:Description rdf:about="http://dbpedia.org/resource/William_Shakespeare">
    <rdf:type rdf:resource="http://dbpedia.org/ontology/Person"/>
    <dbo:birthPlace rdf:resource="http://dbpedia.org/resource/Warwickshire"/>
    <dbo:child rdf:resource="http://dbpedia.org/resource/Judith_Quiney"/>
    <dbo:occupation rdf:resource="http://dbpedia.org/resource/Poet"/>
    <dbo:deathDate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1616-04-23</dbo:deathDate>
    <dbpprop:name xml:lang="en">William Shakespeare</dbpprop:name>
  </rdf:Description>
  <rdf:Description rdf:about="http://dbpedia.org/resource/Judith_Quiney">
    <dbo:spouse rdf:resource="http://dbpedia.org/resource/Thomas_Quiney"/>
  </rdf:Description>
</rdf:RDF>

Figure 2.4.: Sample RDF/XML format.

Typically, files with N-Triples content have the .nt extension [Grant and Beckett, 2004]. Figure 2.3 shows our sample triples encoded in the N-Triples format.

2.2.4.2. RDF/XML

RDF/XML represents RDF triples in XML format [Beckett, 2004]. The RDF/XML format is more convenient for machines than N-Triples. Figure 2.4 shows our RDF example in RDF/XML format. Files containing RDF/XML data have .rdf as extension.

2.2.4.3. N3

N3 stands for Notation3 and is a shorthand notation for representing RDF graphs. N3 was designed to be easily read by humans, and it is not an XML-compliant language [Berners-Lee and Connolly, 2011]. Figure 2.5 shows our RDF example in N3 format. Files containing RDF data in N3 format normally have the .n3 extension.


@prefix dbo:     <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

dbpedia:William_Shakespeare a dbo:Person;
    dbo:birthPlace dbpedia:Warwickshire;
    dbo:child dbpedia:Judith_Quiney;
    dbo:deathDate "1616-04-23"^^xsd:date;
    dbo:occupation dbpedia:Poet;
    dbpprop:name "William Shakespeare"@en .

dbpedia:Judith_Quiney dbo:spouse dbpedia:Thomas_Quiney .

Figure 2.5.: Sample N3 format.

2.2.4.4. Turtle

Turtle is a subset of N3. Turtle stands for Terse RDF Triple Language. Turtle files have a .ttl extension [Dave and Berners-Lee, 2011]. This particular serialization is popular among developers of the Semantic Web.
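Because Turtle is a subset of N3, the statements in Figure 2.5 are already valid Turtle. As a minimal illustration, the following sketch (using the same DBpedia prefixes as before) is a complete Turtle document that could be stored in a .ttl file:

@prefix dbo:     <http://dbpedia.org/ontology/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

dbpedia:William_Shakespeare dbo:birthPlace dbpedia:Warwickshire ;
                            dbo:child      dbpedia:Judith_Quiney .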

2.2.5. Ontology

The W3C defines an ontology as follows: "An ontology defines the terms used to describe and represent an area of knowledge" [Heflin, 2004]. This definition has several aspects that should be discussed. First, the definition states that an ontology is used to describe and represent a specific area of knowledge. This means that the ontology is domain specific; it does not cover all knowledge disciplines, but rather a specific area of knowledge. A domain is simply a specific area or discipline of knowledge, such as literature, medicine, or engineering. Second, the ontology defines the terms as well as the relationships among those terms. The terms are often called classes, or concepts. The relationships among these classes can be expressed using a hierarchy, i.e. superclasses represent higher-level concepts and subclasses represent finer concepts. The finer classes, or lower-level classes, include all attributes and features that their superclasses have. Third, in addition to defining the relationships among classes, there is another sort of relationship expressed by using a special set of terms called properties. These properties define various characteristics of the classes, and they can relate different classes to each other. In other words, the relationships among classes are not only hierarchical relationships (parent and child classes), but also the relationships expressed using properties. Put differently, an ontology defines a set of classes (e.g. "Person", "Book", "Writer") and their hierarchy, i.e. which class is a subclass of another one (e.g. "Writer" is a subclass of "Person"). It also defines how those classes interact with each other, i.e. how different classes are connected to each other via properties (e.g. a "Book" has an author of type "Writer"). Figure 2.6 shows a sample ontology taken from the DBpedia ontology.


Figure 2.6.: Sample ontology snapshot taken from the DBpedia ontology.

This ontology says that there is a class called "Writer" which is a subclass of "Artist", which is in turn a subclass of "Person". William Shakespeare, Johann Wolfgang von Goethe, and Dan Brown are candidate instances of the class "Writer". The same applies to the class "Work" and its subclasses. Note that there is a property called "author" relating an instance of class "Work" to an instance of class "Person", i.e. it relates a work to its author. For instance, the book titled "First Folio" is an instance of the classes "Work" and "Book", and related via the property "author" to its author "William Shakespeare", which is an instance of the classes "Person", "Artist", and "Writer". So, why do we need ontologies? The benefits of an ontology are:

- it provides a common and shared understanding/definition of certain key concepts in the domain,

- it provides a way for domain knowledge reuse,

- it makes the domain assumptions explicit,

- it provides a way to encode knowledge and semantics such that machines can understand it.

2.2.6. Ontology Languages

The question now is "What are the languages used to create ontologies?". Actually, there are several languages which can be used to encode ontologies, e.g. RDF Schema (RDFS) and the Web Ontology Language (OWL).

2.2.6.1. RDFS

RDFS is an ontology language which can be used to create vocabularies defining classes, their respective subclasses, and properties of the various RDF resources. Furthermore, RDFS is a W3C recommendation [Brickley and Guha, 2004]. The RDFS ontology language also relates the properties to the classes it defines. RDFS can add semantics to RDF predicates and resources. In other words, it describes the meaning of a specific term by defining its associated properties and what sort of objects these properties might take. It is worth noting here that RDFS is written in RDF, so any RDFS document is a legal RDF document.
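As a brief illustration of these capabilities, the following sketch in Turtle declares a class hierarchy and constrains a property with rdfs:domain and rdfs:range. It uses class and property names from the sample ontology of Figure 2.6 under the assumption that they live in the DBpedia ontology namespace.

@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Class hierarchy: a Writer is an Artist, and an Artist is a Person.
dbo:Artist rdfs:subClassOf dbo:Person .
dbo:Writer rdfs:subClassOf dbo:Artist .

# The author property links a Work (subject) to a Person (object).
dbo:author rdfs:domain dbo:Work ;
           rdfs:range  dbo:Person .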

2.2.6.2. OWL

The Web Ontology Language (OWL) is used to create ontologies and it is also a W3C recommendation [Bechhofer et al., 2004]. It is built on RDFS. We can say that "OWL = RDFS + new constructs for expressiveness" [Yu, 2007]. All RDFS classes and properties can also be used for creating OWL ontologies. OWL and RDFS have the same purpose, which is describing classes, properties, and the respective relations among those classes. However, OWL has an advantage over RDFS, which is its expressive power. In other words, OWL can describe more complex relationships. Due to its expressive power, most ontology developers use OWL to develop their ontologies. As an example of how powerful OWL is, an ontology developer can create a new class as the union or intersection of two or more classes. With OWL you can also declare that two classes represent the same thing. For instance, consider the case that there are two separate ontologies created by different developers; in the first ontology there is a class called "Poet", and in the other ontology there is a class called "PoetryWriter". Actually, those classes are equivalent to each other; in RDFS you cannot declare that those classes are equivalent, but with OWL you can. OWL provides some powerful features for properties as well. For example, in OWL you can declare that two properties are the inverse of each other (e.g. "author" and "isAuthorOf"). Figure 2.7 shows a part of our ontology expressed in OWL. Note that for the property author we have defined two properties, domain and range. The domain property defines the class of instances which can be the subject of that property (the author property), while the range property defines the class of instances which can be the object of that property. OWL has many more powerful features; the interested reader can find more about those features in [Bechhofer et al., 2004].
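The equivalence and inverse-property features mentioned above can be written down as follows. This is a minimal sketch in Turtle; the ex1:/ex2: namespaces and the names Poet, PoetryWriter, author and isAuthorOf are hypothetical, taken from the discussion above rather than from the DBpedia ontology.

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex1: <http://example.org/ontology1/> .
@prefix ex2: <http://example.org/ontology2/> .

# The two classes, defined in different ontologies, describe the same concept.
ex1:Poet owl:equivalentClass ex2:PoetryWriter .

# The two properties are declared to be the inverse of each other.
ex1:author owl:inverseOf ex1:isAuthorOf .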

2.2.7. SPARQL Query Language

"The SPARQL Protocol and RDF Query Language (SPARQL) is a query language and protocol for RDF" [Clark et al., 2008]. It is a W3C standard, and it is used to pose queries against RDF graphs. SPARQL allows the user to write queries that consist of triple patterns, conjunctions (logical "and"), disjunctions (logical "or"), and/or a set of optional patterns [Wikipedia, 2013]. Examples of those optional patterns are FILTER, REGEX, and LANG.


<http://dbpedia.org/ontology/Person> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://dbpedia.org/ontology/Work> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://dbpedia.org/ontology/Artist> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://dbpedia.org/ontology/Writer> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://dbpedia.org/ontology/Book> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://dbpedia.org/ontology/Artist> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://dbpedia.org/ontology/Person> .
<http://dbpedia.org/ontology/Writer> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://dbpedia.org/ontology/Artist> .
<http://dbpedia.org/ontology/Book> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://dbpedia.org/ontology/Work> .
<http://dbpedia.org/ontology/author> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#ObjectProperty> .
<http://dbpedia.org/ontology/author> <http://www.w3.org/2000/01/rdf-schema#domain> <http://dbpedia.org/ontology/Work> .
<http://dbpedia.org/ontology/author> <http://www.w3.org/2000/01/rdf-schema#range> <http://dbpedia.org/ontology/Person> .

Figure 2.7.: OWL representation of a part of our ontology in N-Triples format.

1 PREFIX dbpedia: <http://dbpedia.org/resource/>
2 PREFIX dbo: <http://dbpedia.org/ontology/>
3 SELECT ?spouse
4 WHERE { dbpedia:William_Shakespeare dbo:child ?child.
5         ?child dbo:spouse ?spouse. }

Figure 2.8.: SPARQL query to get the spouse of Shakespeare’s child.

The SPARQL query specifies the pattern(s) that the resulting data should satisfy. The results of SPARQL queries can be result sets or RDF graphs. SPARQL has four query forms, specifically SELECT, CONSTRUCT, ASK, and DESCRIBE [Prud’hommeaux and Seaborne, 2008]. Let’s take an example to clarify the usage of SPARQL. Assume that we want to ask the query ”Who is the spouse of William Shakespeare’s child?” to our small knowledge base. SPARQL language can be used to do so. Figure 2.8 shows a SPARQL query to get information about the spouse of Shakespeare’s child. In Figure 2.8, lines 1, and 2, defines some prefixes in order to write URIs in their short forms. Line 3, declares the variables that should be rendered to the output of that query, which is only one variable ?spouse. Note that SPARQL variables start either with a question mark ”?”, or with a dollar sign ”$”. Line 4 states that for statement with subject dbpedia:William_Shakespeare and with property dbo:child, we want the value of its object to be assigned to a variable called ?child. Upon execution, this variable will take the value of dbpedia:Judith_Quiney. In line 5, we want variable ?child which now has the value dbpedia:Judith_Quiney, to be the subject of the other statement. In other words, the statement will be dbpedia:Judith_Quiney dbo:spouse ?spouse. Now, variable ?spouse is the only unknown variable of the statement, and it will take the value dbpedia:Thomas_Quiney. Eventually, its value will be rendered to the

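Besides SELECT, the other query forms can be used on the same knowledge base. For instance, the following sketch (assuming the same prefixes as in Figure 2.8) is an ASK query that simply returns true or false depending on whether the given pattern has a match:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbo:     <http://dbpedia.org/ontology/>
ASK { dbpedia:William_Shakespeare dbo:birthPlace dbpedia:Warwickshire . }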

2.2.8. Triplestore

The crucial question here is "How do we store RDF data for efficient and quick access?". Basically, RDF data is stored in what are called triplestores. A triplestore is a software program capable of storing and indexing RDF data efficiently, in order to enable querying this data easily and effectively. A triplestore is for RDF data what a Relational Database Management System (DBMS) is for relational databases. Most triplestores support the SPARQL query language for querying RDF data. Just as there are several DBMSs in the wild, such as Oracle1, MySQL2, and SQL Server3, there are also several triplestores. Virtuoso [Erling and Mikhailov, 2007], Sesame [Broekstra et al., 2002], and BigOWLIM [Bishop et al., 2011] are typical examples of triplestores. DBpedia uses Virtuoso as its underlying triplestore.
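As an illustration of how such a triplestore is accessed in practice, a query like the one sketched below could be submitted over HTTP to the public DBpedia SPARQL endpoint (assumed here to be http://dbpedia.org/sparql, served by Virtuoso). It also exercises the FILTER and LANG constructs mentioned in Section 2.2.7:

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?writer ?label
WHERE {
  ?writer a dbo:Writer ;
          rdfs:label ?label .
  FILTER ( LANG(?label) = "en" )
}
LIMIT 10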

1 http://www.oracle.com/us/products/database/overview/index.html
2 http://www.mysql.com
3 http://www.microsoft.com/en-us/sqlserver/default.aspx

3. Overview on the DBpedia Project

This chapter gives an introduction to the DBpedia project in general, including its structure, its core extractors and the role of each one, etc. It also describes the significance of the project for the community. This chapter is based on [Lehmann et al., 2013], which I co-authored together with core members of the DBpedia community.

3.1. Introduction to DBpedia

The DBpedia community project extracts knowledge from Wikipedia and makes it widely available via established Semantic Web standards and Linked Data best practices. Wikipedia is currently the 6th most popular website1, the most widely used encyclopedia, and one of the finest examples of truly collaboratively created content. However, because the inherent structure of Wikipedia articles is barely exploited, Wikipedia itself only offers very limited querying and search capabilities. For instance, it is difficult to find all rivers that flow into the Rhine or all Italian composers from the 18th century. One of the goals of the DBpedia project is to provide those querying and search capabilities to a wide community by extracting structured data from Wikipedia, which can then be used for answering expressive queries. The core of DBpedia consists of an infobox extraction process, which was first described in [Auer and Lehmann, 2007]. Infoboxes are templates contained in many Wikipedia articles. They are usually displayed in the top right corner of articles and contain factual information (cf. Figure 3.1). The infobox extractor processes an infobox as follows: the DBpedia URI, which is created from the Wikipedia article URL, is used as subject, while the infobox attributes and their values are used as predicates and objects, respectively. The DBpedia project was started in 2006 and has meanwhile attracted significant interest in research and practice. It has been a key factor for the success of the Linked Open Data initiative and serves as an interlinking hub for other datasets (see Section 3.4). For the research community, DBpedia provides a testbed serving real data spanning various domains and more than 100 language editions. Numerous applications, algorithms and tools have been built around or applied to DBpedia. Due to the continuous growth of Wikipedia and improvements in DBpedia, the extracted data provides an increasing added value for data acquisition, reuse and integration tasks within organizations.

1See http://www.alexa.com/topsites. Retrieved in February 2013.

One of the reasons why DBpedia's data quality has improved over the past years is that the structure of the knowledge in DBpedia itself is maintained by its community. Most importantly, the community creates mappings from Wikipedia information representation structures to the DBpedia ontology. This ontology, which will later be explained in detail, unifies different template structures – both within single Wikipedia language editions and across the currently 27 different languages. The maintenance of different language editions of DBpedia is spread across a number of organizations. Each organization is responsible for the support of a certain language. The local DBpedia chapters are coordinated by the DBpedia Internationalization Committee. In addition to multilingual support, DBpedia also provides data-level links into more than 30 external datasets, which are partially also contributed by partners beyond the core project team.

{{Infobox settlement
| official_name     = Algarve
| settlement_type   = Region
| image_map         = LocalRegiaoAlgarve.svg
| mapsize           = 180px
| map_caption       = Map showing Algarve Region in Portugal
| subdivision_type  = [[Countries of the world|Country]]
| subdivision_name  = {{POR}}
| subdivision_type3 = Capital city
| subdivision_name3 = [[Faro, Portugal|Faro]]
| area_total_km2    = 5412
| population_total  = 410000
| timezone          = [[Western European Time|WET]]
| utc_offset        = +0
| timezone_DST      = [[Western European Summer Time|WEST]]
| utc_offset_DST    = +1
| blank_name_sec1   = [[NUTS]] code
| blank_info_sec1   = PT15
| blank_name_sec2   = [[GDP]] per capita
| blank_info_sec2   = €19,200 (2006)
}}

Figure 3.1.: Mediawiki infobox syntax for Algarve (left) and rendered infobox (right).


Figure 3.2.: Overview of DBpedia extraction framework.

3.2. DBpedia Extraction Framework

Wikipedia articles consist mostly of free text, but also comprise various types of structured information in the form of wiki markup. Such information includes infobox templates, categorization information, images, geo-coordinates, links to external web pages, disambiguation pages, redirects between pages, and links across different language editions of Wikipedia. The DBpedia extraction framework extracts this structured information from Wikipedia and turns it into a rich knowledge base. In this section, we give an overview of the DBpedia knowledge extraction framework.

3.2.1. General Architecture

Figure 3.2 shows an overview of the DBpedia extraction framework. The DBpedia extraction is structured into four phases:

Input: Wikipedia pages are read from an external source. Pages can either be read from a Wikipedia dump or directly fetched from a MediaWiki installation using the MediaWiki API.

Parsing: Each Wikipedia page is parsed by the wiki parser. The wiki parser transforms the source code of a Wikipedia page into an Abstract Syntax Tree.

Extraction: The Abstract Syntax Tree of each Wikipedia page is forwarded to the extractors. DBpedia offers extractors for many different purposes, for instance, to extract labels, abstracts or geographical coordinates. Each extractor consumes an Abstract Syntax Tree and yields a graph of RDF statements.

Output: The collected RDF statements are written to a sink. Different formats, such as N-Triples, are supported (a small example is sketched below).
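For illustration only, the N-Triples output for a single page might contain statements like the following hypothetical sketch. The rdfs:label and georss:point properties follow the extractor overview in Table 3.1; the coordinate values shown here are made up.

# Hypothetical output of the label and geo coordinates extractors for the Algarve page.
<http://dbpedia.org/resource/Algarve> <http://www.w3.org/2000/01/rdf-schema#label> "Algarve"@en .
<http://dbpedia.org/resource/Algarve> <http://www.georss.org/georss/point> "37.0 -8.0" .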

3.2.2. Extractors

The DBpedia extraction framework employs various extractors for translating different parts of Wikipedia pages to RDF statements. A list of all available extractors is shown in Table 3.1. DBpedia extractors can be divided into four categories:

Mapping-Based Infobox Extraction: The mapping-based infobox extraction uses manually written mappings that relate infoboxes in Wikipedia to terms in the DBpedia ontology. The mappings also specify a datatype for each infobox property and thus help the extraction framework to produce high quality data. The mapping-based extraction will be described in detail in Section 3.2.4.

Raw Infobox Extraction: The raw infobox extraction provides a direct mapping from infoboxes in Wikipedia to RDF. As the raw infobox extraction does not rely on explicit extraction knowledge in the form of mappings, the quality of the extracted data is lower. The raw infobox data is useful, if a specific infobox has not been mapped yet and thus is not available in the mapping-based extraction.

Feature Extraction: The feature extraction uses a number of extractors that are specialized in extracting a single feature from an article, such as a label or geographic coordinates. Section 3.2.6 discusses some more of them.

Name                  Description
abstract              Extracts page abstracts.
article categories    Extracts links from concepts to categories using the SKOS vocabulary.
category label        Extracts labels for categories.
disambiguation        Extracts disambiguation links.
external links        Extracts links to external web pages.
geo coordinates       Extracts geo-coordinates.
homepage              Extracts links to the official homepage of an instance.
image                 Extracts the first image of a Wikipedia page.
infobox               Extracts all properties from all infoboxes.
interlanguage links   Extracts interwiki links.
label                 Extracts labels for articles based on their title.
mappings              Extraction based on mappings of Wikipedia infoboxes to the DBpedia ontology.
meta information      Extracts page meta-information, e.g. edit link and revision link.
page ID               Extracts page ids of articles.
page links            Extracts internal links between DBpedia instances.
persondata            Extracts information about persons.
PND                   Extracts PND (Personennamendatei) data about a person.
redirects             Extracts redirect links between articles.
revision ID           Extracts revision ids.
SKOS categories       Extracts information about which concept is a category and how categories are related using the SKOS vocabulary.
wiki page             Extracts links to corresponding articles in Wikipedia.

Table 3.1.: Overview of DBpedia extractors.


3.2.3. Raw Infobox Extraction

The type of Wikipedia content that is most valuable for the DBpedia extraction are infoboxes. Infoboxes are frequently used to list an article's most relevant facts as a table of attribute-value pairs on the top right-hand side of the Wikipedia page (for right-to-left languages on the top left-hand side respectively). Infoboxes that appear in a Wikipedia article are based on a template that specifies a list of attributes that can form the infobox. A wide range of infobox templates are used in Wikipedia. Common examples are templates for infoboxes that describe persons, organizations or automobiles. As Wikipedia's infobox template system has evolved over time, different communities of Wikipedia editors use different templates to describe the same type of things (e.g. Infobox_city_japan, Infobox_swiss_town and Infobox_town_de). In addition, different templates use different names for the same attribute (e.g. birthplace and placeofbirth). As many Wikipedia editors do not strictly follow the recommendations given on the page that describes a template, attribute values are expressed using a wide range of different formats and units of measurement. An excerpt of an infobox that is based on a template for describing automobiles is shown below:

{{Infobox automobile
| name         = Ford GT40
| manufacturer = [[Ford Advanced Vehicles]]
| production   = 1964-1969
| engine       = 4181cc [[V8 engine|V-8]]
(...)
}}

In this infobox, the first line specifies the infobox type and the subsequent lines specify various attributes of the described entity. An excerpt of the extracted data is as follows:2

dbpedia:Ford_GT40 [
  dbpprop:name "Ford GT40"@en;
  dbpprop:manufacturer dbpedia:Ford_Advanced_Vehicles;
  dbpprop:engine 4181, 4737, 4942, 6997;
  dbpprop:production 107, 1964;
  (...)
].

This extraction output has weaknesses: the resource is not associated to a class in the ontology, and the engine and production values are literal values, which are not meaningful. Those problems can be overcome by the mapping-based infobox extraction presented in the next subsection.

3.2.4. Mapping-Based Infobox Extraction

In order to homogenize the description of information in the knowledge base, in 2010 a community effort was initiated to develop an ontology schema and

2We use dbpedia for http://dbpedia.org/resource/, dbo for http://dbpedia.org/ontology/ and dbpprop for http://dbpedia.org/property/ as prefixes throughout the thesis.

mappings from Wikipedia infobox properties to this ontology. The alignment between Wikipedia infoboxes and the ontology is performed via community-provided mappings that help to normalize name variations in properties and classes. Heterogeneity in the Wikipedia infobox system, like using different infoboxes for the same type of entity or using different property names for the same property (cf. Section 3.2.3), can be alleviated in this way. This significantly increases the quality of the raw Wikipedia infobox data by typing resources, merging name variations and assigning specific datatypes to the values. This effort is realized with the DBpedia Mappings Wiki3, a MediaWiki installation set up to enable the users to collaboratively create and edit mappings. These mappings are specified using the DBpedia Mapping Language. The mapping language makes use of MediaWiki templates that define DBpedia ontology classes and properties as well as template/table to ontology mappings. A mapping assigns a type from the DBpedia ontology to the entities that are described by the corresponding infobox. In addition, attributes in the infobox are mapped to properties in the DBpedia ontology. In the following, we show a mapping that maps infoboxes that use the Infobox_automobile template to the DBpedia ontology:

{{TemplateMapping
|mapToClass = Automobile
|mappings =
  {{PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}
  {{PropertyMapping | templateProperty = manufacturer | ontologyProperty = manufacturer }}
  {{DateIntervalMapping | templateProperty = production
    | startDateOntologyProperty = productionStartDate
    | endDateOntologyProperty = productionEndDate }}
  {{IntermediateNodeMapping | nodeClass = AutomobileEngine | correspondingProperty = engine
    | mappings =
      {{PropertyMapping | templateProperty = engine | ontologyProperty = displacement | unit = Volume }}
      {{PropertyMapping | templateProperty = engine | ontologyProperty = powerOutput | unit = Power }}
  }}
  (...)
}}

The RDF statements that are extracted from the previous infobox example are

3http://mappings.dbpedia.org

shown below. As we can see, the production period is correctly split into a start year and an end year and the engine is represented by a distinct RDF node.

dbpedia:Ford_GT40 [
  rdf:type dbo:Automobile;
  rdfs:label "Ford GT40"@en;
  dbo:manufacturer dbpedia:Ford_Advanced_Vehicles;
  dbo:productionStartYear "1964"^^xsd:gYear;
  dbo:productionEndYear "1969"^^xsd:gYear;
  dbo:engine [
    rdf:type dbo:AutomobileEngine;
    dbo:displacement "0.004181";
  ]
  (...)
].
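Conceptually, the mapping-based extraction takes the attribute-value pairs produced by the raw extraction and rewrites them according to the community-provided mapping. The following Python sketch illustrates this rewriting step with a hand-written, simplified mapping for Infobox_automobile; the mapping representation and function names are hypothetical and only serve to make the idea concrete:

# A simplified, hand-written stand-in for the Infobox_automobile mapping.
AUTOMOBILE_MAPPING = {
    "mapToClass": "dbo:Automobile",
    "properties": {
        "name": "foaf:name",
        "manufacturer": "dbo:manufacturer",
    },
}

def apply_mapping(resource_uri, infobox_attributes, mapping):
    """Turn raw infobox attributes into (subject, predicate, object) triples."""
    triples = [(resource_uri, "rdf:type", mapping["mapToClass"])]
    for attribute, value in infobox_attributes.items():
        ontology_property = mapping["properties"].get(attribute)
        if ontology_property is not None:
            triples.append((resource_uri, ontology_property, value))
    return triples

attrs = {"name": "Ford GT40", "manufacturer": "dbpedia:Ford_Advanced_Vehicles"}
for triple in apply_mapping("dbpedia:Ford_GT40", attrs, AUTOMOBILE_MAPPING):
    print(triple)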

The DBpedia Mappings Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 3.3 shows how the infobox properties author and συγγραφέας – author in Greek – are both being mapped to the global identifier dbo:author. That means, in turn, that information from all language versions of DBpedia can be merged and DBpedias for smaller languages can be augmented with knowledge from larger DBpedias such as the English edition. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns which is often only present in the corresponding language edition [Tacchini et al., 2009]. Besides hosting the mappings and the DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

• Mapping Validator: When editing a mapping, the mapping can be directly validated by a button on the edit page. This checks changes for syntactic correctness before saving them and highlights inconsistencies such as missing property definitions.

• Extraction Tester: The extraction tester linked on each mapping page tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and what the resulting data will look like.

• Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users in creating and editing mappings.


Figure 3.3.: Depiction of the mapping from the Greek and English templates about books to the same DBpedia ontology class [Kontokostas et al., 2012].

3.2.5. URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

• http://dbpedia.org/resource/ (prefix dbpedia) for representing article data. Apart from a few mapping cases, there is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin4, DBpedia will produce the URI dbpedia:Berlin.

• http://dbpedia.org/property/ (prefix dbpprop) for representing properties extracted from the raw infobox extraction (cf. Section 3.2.3), e.g. dbpprop:population.

• http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 3.2.4), e.g. dbo:populationTotal.

Although data from other Wikipedia language editions were extracted as well, they were published under the same namespaces. This was achieved by exploiting the

4http://en.wikipedia.org/wiki/Berlin


Wikipedia inter-language links5. For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbpedia:Berlin). Recent DBpedia internationalization developments showed that this approach resulted in less data as well as redundant data [Kontokostas et al., 2012]. Thus with the DBpedia 3.7 release, we started to produce two types of datasets. The localized datasets contain all things that are described in a specific language. Within the datasets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized dataset for each language. The canonicalized datasets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized datasets, the same thing is identified with the same URI from the generic language-agnostic namespace http://dbpedia.org/resource/.
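The URI construction itself is straightforward: the article title is percent-encoded (with spaces replaced by underscores) and appended to the language-specific or generic namespace. The following Python sketch, under the simplifying assumption that the title needs no further normalization, shows how a localized and a canonicalized URI could be derived for an article that carries an inter-language link to the English edition:

from urllib.parse import quote

def localized_uri(lang, title):
    """URI in the language-specific namespace, e.g. http://de.dbpedia.org/resource/..."""
    return "http://%s.dbpedia.org/resource/%s" % (lang, quote(title.replace(" ", "_")))

def canonical_uri(english_title):
    """URI in the generic namespace shared by all canonicalized datasets."""
    return "http://dbpedia.org/resource/" + quote(english_title.replace(" ", "_"))

# German article "Berlin" with an inter-language link to the English article "Berlin":
print(localized_uri("de", "Berlin"))   # http://de.dbpedia.org/resource/Berlin
print(canonical_uri("Berlin"))         # http://dbpedia.org/resource/Berlin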

3.2.6. Summary of Recent Developments

This section summarizes the improvements of the DBpedia extraction framework since the publication of the DBpedia overview article [Lehmann et al., 2009] in 2009. One of the major changes on the implementation level is that the extraction framework was rewritten in Scala in 2010, improving the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. The new, more modular framework also allows extracting data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki as described above. In addition, there were several smaller improvements and general maintenance: Overall, over the past four years, the parsing of the MediaWiki markup improved considerably, which led to better overall coverage, for example, concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal etc. in English that indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy dataset (SKOS) is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed. The individual data extractors have been improved as well in both number and quality in many areas. The abstract extraction was enhanced to produce more accurate plain text representations of the beginning of Wikipedia article texts.

5http://en.wikipedia.org/wiki/Help:Interlanguage_links


More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.) and for a number of classes and properties, specific datatypes were added (e.g. inhabitants/km2 for the population density of populated places and m3/s for the discharge of rivers). Many issues related to data parsers were resolved and the quality of the owl:sameAs dataset for multiple language versions was increased by an implementation that takes bijective relations into account. There are also new extractors such as extractors for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved since. For the redirect data the transitive closure is computed while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the past, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links, and hence increased the overall inter-connectivity of resources in the DBpedia ontology substantially. Finally, a new heuristic to increase the connectivity of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object properties in the DBpedia ontology.
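The redirect resolution mentioned above boils down to computing the transitive closure of the redirect relation while guarding against cycles. The following Python sketch illustrates that step only; the data structures are illustrative, whereas the actual framework operates on the full redirects dataset:

def resolve_redirects(redirects):
    """Map every redirect page to its final target, following chains and catching cycles.

    `redirects` maps a redirect page title to the title it redirects to.
    """
    resolved = {}
    for start in redirects:
        seen = {start}
        target = redirects[start]
        while target in redirects:
            if target in seen:      # cycle detected: stop following the chain
                break
            seen.add(target)
            target = redirects[target]
        resolved[start] = target
    return resolved

redirects = {"ArtificalLanguages": "Constructed_language", "A": "B", "B": "A"}
print(resolve_redirects(redirects))
# {'ArtificalLanguages': 'Constructed_language', 'A': 'A', 'B': 'B'}  (the A/B cycle is cut short)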

3.3. DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow which fits use cases in which the ontology is visualized or navigated. Figure 3.4 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances. The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 3.5 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki.


Figure 3.4.: Snapshot of a part of the DBpedia ontology.

Figure 3.5.: Growth of the DBpedia ontology


SELECT * { {
  SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
    ?com rdf:type fts-o:Commitment .
    ?com fts-o:year ?year .
    ?year rdfs:label ?ftsyear .
    ?com fts-o:benefit ?benefit .
    ?benefit fts-o:detailAmount ?amount .
    ?benefit fts-o:beneficiary ?beneficiary .
    ?beneficiary fts-o:country ?country .
    ?country owl:sameAs ?ftscountry .
  } } {
  SELECT ?dbpcountry ?gdpyear ?gdpnominal {
    ?dbpcountry rdf:type dbo:Country .
    ?dbpcountry dbpprop:gdpNominal ?gdpnominal .
    ?dbpcountry dbpprop:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry)) }

Figure 3.6.: SPARQL query to compare funding per year (from FTS) and country with the gross domestic product of that country.

3.4. Interlinking

DBpedia is interlinked with numerous external datasets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other datasets, as well as the external datasets that set links pointing at DBpedia resources.

3.4.1. Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party datasets. The DBpedia project maintains a link repository6 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets that are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 3.2 lists the linksets created by the DBpedia community as of April 2013. The first column names the dataset that is the target of the links. The second and third columns contain the predicate that is used for linking as well as the overall number of links that are set between DBpedia and the external dataset. The last column names the tool that was used to generate the links, if the tool is known. The value S refers to Silk, L to LIMES, and C to a custom script. Links in DBpedia have been used for various purposes. One example is the combination of data about European Union project funding (FTS) [Martin et al., 2013] and data about countries in DBpedia. The query shown in Figure 3.6 compares funding per year (from FTS) and country with the gross domestic product of that country (from DBpedia).

6https://github.com/dbpedia/dbpedia-links


Dataset Predicate Count Tool
Amsterdam Museum owl:sameAs 627 S
BBC Wildlife Finder owl:sameAs 444 S
Book Mashup rdf:type, owl:sameAs 9 100
Bricklink dc:publisher 10 100
CORDIS owl:sameAs 314 S
Dailymed owl:sameAs 894 S
DBLP Bibliography owl:sameAs 196 S
DBTune owl:sameAs 838 S
Diseasome owl:sameAs 2 300 S
Drugbank owl:sameAs 4 800 S
EUNIS owl:sameAs 3 100 S
Eurostat (Linked Stats) owl:sameAs 253 S
Eurostat (WBSG) owl:sameAs 137
CIA World Factbook owl:sameAs 545 S
flickr wrappr dbpprop:hasPhotoCollection 3 800 000 C
Freebase owl:sameAs 3 600 000 C
GADM owl:sameAs 1 900
GeoNames owl:sameAs 86 500 S
GeoSpecies owl:sameAs 16 000 S
GHO owl:sameAs 196 L
Project Gutenberg owl:sameAs 2 500 S
Italian Public Schools owl:sameAs 5 800 S
LinkedGeoData owl:sameAs 103 600 S
LinkedMDB owl:sameAs 13 800 S
MusicBrainz owl:sameAs 23 000
New York Times owl:sameAs 9 700
OpenCyc owl:sameAs 27 100 C
OpenEI (Open Energy) owl:sameAs 678 S
Revyu owl:sameAs 6
Sider owl:sameAs 2 000 S
TCMGeneDIT owl:sameAs 904
UMBEL rdf:type 896 400
US Census owl:sameAs 12 600
WikiCompany owl:sameAs 8 300
WordNet dbpprop:wordnet_type 467 100
YAGO2 rdf:type 18 100 000
Sum 27 211 732

Table 3.2.: Datasets linked from DBpedia


In addition to providing outgoing links on instance-level, DBpedia also sets links on schema-level pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in Class templates and owl:equivalentProperty in DatatypeProperty or ObjectProperty templates respectively. In 2011 Google, Microsoft, and Yahoo! announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 owl:equivalentClass and 31 owl:equivalentProperty links pointing at http://schema.org terms.

3.4.2. Incoming Links

DBpedia is being linked to from a variety of datasets. The overall number of links pointing at DBpedia from other datasets is 39,007,478 according to the Data Hub.7 However, those counts are entered by users and may not always be valid and up-to-date. In order to identify actually published and online datasets that link to DBpedia, we used Sindice [Oren et al., 2008]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a dataset is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same dataset. A triple is considered to be a link if the datasets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the dataset of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary over all resources it stores. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 3.4, Sindice knows about 248 datasets linking to DBpedia. 70 of those datasets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 8 million links pointing at DBpedia. Table 3.3 lists the 10 datasets which set most links to DBpedia along with the used link predicate and the number of links. It should be noted that the data in Sindice is not complete, for instance it does not contain all datasets that are cataloged by the DataHub8. However, it crawls for RDFa snippets, converts microformats etc. Despite the inaccuracy, the relative comparison of different datasets can still give us some insights. Therefore, we analyzed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 3.5 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data and DBpedia is currently ranked second.

7See http://wiki.dbpedia.org/Interlinking for details. 8http://datahub.io/
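The dataset-level counting rule described above can be reproduced in principle from any triple stream by grouping on the second-level domain of subject and object. The following Python sketch illustrates the rule on toy data; each triple is represented together with the URL it was retrieved from, and the second-level-domain heuristic is deliberately naive (the real analysis was done on the Sindice infrastructure):

from collections import Counter
from urllib.parse import urlparse

def dataset_of(uri):
    """Naive second-level domain heuristic, e.g. 'http://data.nytimes.com/x' -> 'nytimes.com'."""
    host = urlparse(uri).netloc
    return ".".join(host.split(".")[-2:])

def count_incoming_links(triples, target="dbpedia.org"):
    """Count links per source dataset: subject and object datasets differ, object points at `target`."""
    counts = Counter()
    for subject, obj, retrieved_from in triples:
        source = dataset_of(subject)
        if source != dataset_of(retrieved_from):
            continue                      # not authoritative; skip
        if dataset_of(obj) == target and source != target:
            counts[source] += 1
    return counts

triples = [
    ("http://data.nytimes.com/A", "http://dbpedia.org/resource/Berlin", "http://data.nytimes.com/A"),
    ("http://example.org/B", "http://dbpedia.org/resource/Berlin", "http://other.org/B"),
]
print(count_incoming_links(triples))   # Counter({'nytimes.com': 1}); the second triple is not authoritative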


Dataset Predicate Count Link Count
okaboo.com 4 2,407,121
tfri.gov.tw 57 837,459
naplesplus.us 2 212,460
fu-berlin.de 7 145,322
freebase.com 108 138,619
geonames.org 3 125,734
opencyc.org 3 19,648
geospecies.org 10 16,188
dbrec.net 3 12,856
faviki.com 5 12,768

Table 3.3.: Top 10 datasets in Sindice ordered by the number of links to DBpedia.

Metric Value
Total links: 3,960,212
Total distinct datasets: 248
Total distinct predicates: 345

Table 3.4.: Sindice summary statistics for incoming links to DBpedia.

domain datasets links
purl.org 498 6,717,520
dbpedia.org 248 3,960,212
creativecommons.org 2,483 3,030,910
identi.ca 1,021 2,359,276
l3s.de 34 1,261,487
rkbexplorer.com 24 1,212,416
nytimes.com 27 1,174,941
w3.org 405 658,535
geospecies.org 13 523,709
livejournal.com 14,881 366,025

Table 3.5.: Top 10 datasets by incoming links in Sindice.

4. DBpedia Live Extraction

This chapter describes the DBpedia Live extraction framework in detail, how it works, and how it can keep DBpedia in synchronization with Wikipedia. It also details the features of the DBpedia Live system and how important it is for the community. This chapter is based on [Morsey et al., 2012b].

4.1. Live Extraction Framework

A prerequisite for being able to perform a live extraction is access to changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream, the Wikipedia OAI-PMH 1 live feed. The protocol allows pulling updates in XML via HTTP. A Java component, serving as a proxy, constantly retrieves new updates and feeds the DBpedia framework. The proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. It also handles other maintenance tasks such as the removal of deleted articles and it processes the new templates, which we will introduce in Section 4.2. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.
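The proxy essentially runs an endless poll loop against the OAI-PMH endpoint and hands every updated article on to the extraction framework. A minimal Python sketch of such a loop is shown below; the real component is written in Java, and the endpoint URL, the metadataPrefix and the handle_record function are assumptions used only for illustration:

import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://example.org/wiki-oai"   # placeholder for the OAI-PMH base URL
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def handle_record(record):
    """Placeholder: hand the updated article over to the extraction framework."""
    print("update received:", record.findtext(f"{OAI_NS}header/{OAI_NS}identifier"))

def poll(last_checked):
    # 'mediawiki' is assumed here as the format carrying the article wikitext.
    params = {"verb": "ListRecords", "metadataPrefix": "mediawiki", "from": last_checked}
    with urllib.request.urlopen(OAI_ENDPOINT + "?" + urllib.parse.urlencode(params)) as response:
        tree = ET.parse(response)
    for record in tree.iter(f"{OAI_NS}record"):
        handle_record(record)

while True:                       # endless consumption of the update stream
    poll("2012-01-01T00:00:00Z")  # in practice, the timestamp of the last processed update
    time.sleep(5)                 # short pause between polls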

4.1.1. General System Architecture

The general system architecture of DBpedia Live is depicted in Figure 4.1. The main components of the DBpedia Live system are as follows:

• Local Wikipedia: We have installed a local Wikipedia that will be in synchronization with Wikipedia. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [Lagoze et al., 2008] enables an application to get a continuous stream of updates from a wiki. OAI-PMH is also used to feed updates into the DBpedia Live Extraction Manager.

• Mapping Wiki: DBpedia mappings can be found at http://mappings.dbpedia.org. It is also a wiki. We can also use OAI-PMH to get a stream of updates in DBpedia mappings. Basically, a change of a mapping affects several Wikipedia pages, which should be reprocessed.

1Open Archives Initiative Protocol for Metadata Harvesting, cf. http://www.mediawiki.org/ wiki/Extension:OAIRepository


Figure 4.1.: General DBpedia Live system architecture.

• DBpedia Live Extraction Manager: This component is the actual DBpedia Live extraction framework. When there is a page that should be processed, the framework applies the extractors on it. After processing a page, the newly extracted triples are inserted into the backend triplestore (Virtuoso), overwriting the old triples. The newly extracted triples are also written as an N-Triples file and compressed. Other applications or DBpedia Live mirrors that should always be in synchronization with our DBpedia Live can download those files and feed them into their own triplestores. The extraction manager is discussed in more detail below.

• Synchronization Tool: This tool is used for synchronizing other DBpedia Live mirrors with our DBpedia Live.

4.1.2. Extraction Manager

Figure 3.2 gives a detailed overview of the DBpedia knowledge extraction framework and its main components. DBpedia Live uses the same extraction manager as the core of the extraction process. In live extraction mode, article texts are accessed via the LiveWikipedia page collection, which obtains the current version of the article, which was preprocessed by the Java proxy from the OAI-PMH stream. The content is comprised of the current wikisource code, language (English only at the moment), an OAI identifier and a page revision id2. The SPARQL-Update Destination deletes existing triples and inserts new ones into the target triplestore. According to our measurements, about 1.4 article pages are updated each second on Wikipedia. This amounts to 120,000 page updates per day and a maximum processing time of 0.71s per page for the live extraction framework. Currently, the framework can handle up

2see here for an example http://en.wikipedia.org/wiki/Special:Export/Algarve

to 1.8 pages per second on a 2.8 GHz machine with 6 core CPUs (this includes consumption from the stream, extraction, diffing and loading the triples into a Virtuoso triplestore, and writing the updates into compressed files)3. Performance is one of the major engineering hurdles we had to overcome in order to be able to deploy the framework. The time lag for DBpedia to reflect Wikipedia changes lies between one and two minutes. The bottleneck here is the update stream, since changes normally need more than one minute to arrive from Wikipedia. Apart from performance, another important problem is to identify which triples have to be deleted and re-extracted upon an article change. DBpedia contains a "static" part, which is not affected by the live extraction framework. This includes links to other knowledge bases, which are manually updated, as well as the YAGO4 and Umbel5 class hierarchies, which cannot be updated via the English Update Stream. We store the structure of those triples using a SPARQL graph pattern. Those static triples are stored in a separate graph. All other parts of DBpedia are maintained by the extraction framework. We redesigned the extractors in such a way that each generates triples with disjoint properties. Each extractor can be in one of three states: active, keep, and purge. Depending on the state when a Wikipedia page is changed, the triples extracted by this extractor are either updated (active), not modified (keep), or removed (purge). In order to decide which triples were extracted by an extractor, and also to identify the triples that should be replaced, we use an RDB (relational database) assisted method, which is described in more detail in [Stadler et al., 2010]. We create an RDB table consisting of 3 fields, namely page id, resource uri, and serialized data. Page id is the unique ID of the Wikipedia page. Resource uri is the URI of the DBpedia resource representing that Wikipedia article in DBpedia. Serialized data is the JSON representation of all extracted triples. It is worth noting here that we store the extractor responsible for each group of triples along with those triples in that field. Whenever a Wikipedia page is edited, the extraction method generates a JSON object holding information about each extractor and its generated triples. After serialization of such an object, it will be stored in combination with the corresponding page identifier. In case a record with the same page identifier already exists in this table, the old JSON object and the new one are compared. The results of this comparison are two disjoint sets of triples which are used on the one hand for adding statements to the DBpedia RDF graph and on the other hand for removing statements from this graph. We had to make a number of further changes within the DBpedia extraction framework in order to support live extraction. For instance, to parse article abstracts properly, we need to be able to resolve templates. This can only be done if (parts of) the MediaWiki database for the corresponding language edition is (are) available. For this reason we delegate updates from the stream to the

3see current statistics at http://live.dbpedia.org/livestats 4http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html 5http://fgiasson.com/blog/index.php/2008/09/04/exploding-dbpedias-domain-using-umbel/


MediaWiki database so that the local MediaWiki database remains in sync. In general, all parts of the DBpedia framework which relied on static databases, files etc. needed to be exchanged so that no user interaction is required. Also, the framework had to be refactored to be language independent to make it compatible with future deployment on language-specific DBpedia versions such as the Greek or German DBpedia6.
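The comparison of the stored JSON object with the newly generated one, described above, is essentially a per-extractor set difference. The following Python sketch illustrates the idea on a simplified representation (a dict mapping extractor names to lists of triples, with illustrative values); it is only a conceptual illustration of the RDB-assisted method:

def diff_extraction(old_json, new_json):
    """Return (added, removed) triple sets when comparing two serialized extraction results."""
    old_triples = {t for triples in old_json.values() for t in triples}
    new_triples = {t for triples in new_json.values() for t in triples}
    return new_triples - old_triples, old_triples - new_triples

old = {"MappingExtractor": [("dbpedia:Berlin", "dbo:populationTotal", "3499879")]}
new = {"MappingExtractor": [("dbpedia:Berlin", "dbo:populationTotal", "3520031")]}

added, removed = diff_extraction(old, new)
print(added)    # the triple with the new value: inserted into the triplestore and published as 'added'
print(removed)  # the triple with the old value: deleted from the triplestore and published as 'removed'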

Figure 4.2.: Mapping for infobox of a book.

4.2. New Features

The old PHP-based framework is deployed on one of OpenLink's7 servers and currently has a SPARQL endpoint at http://dbpedia-live.openlinksw.com/sparql. In addition to the migration to Java, the new DBpedia Live framework has the following new features:

1. Abstract extraction: The abstract of a Wikipedia article consists of the first few paragraphs of that article. The new framework has the ability to cleanly extract the abstract of an article.

6http://de.dbpedia.org 7http://www.openlinksw.com/

2. Mapping-affected pages: Upon a change in mapping, the pages affected by that mapping should be reprocessed and their triples should be updated to reflect that change.

3. Updating unmodified pages: Sometimes a change in the system occurs, e.g. a change in the implementation of an extractor. This change can affect many pages even if they are not modified. In DBpedia Live, we use a low-priority queue for such changes, such that the updates will eventually appear in DBpedia Live, but recent Wikipedia updates are processed first.

4. Publication of changesets: Upon modifications old triples are replaced with updated triples. Those added and/or deleted triples are also written as N-Triples files and then compressed. Any client application or DBpedia Live mirror can download those files and integrate them and, hence, update a local copy of DBpedia. This enables that application to always be in synchronization with our DBpedia Live.

5. New extractors: We have added new extractors, which enable extracting more knowledge out of the Wikipedia article.

6. Delta calculation: Via delta, the user can view the latest updates done to a specific DBpedia resource.

In the following subsections, we will describe each feature in detail.

4.2.1. Abstract Extraction

The abstract extractor extracts two types of abstracts:

1. Short abstract: is the first paragraph of a Wikipedia article and is represented in DBpedia by rdfs:comment.

2. Long abstract: is the whole text before the table of contents of an article, which is represented by dbo:abstract.

The hurdle of abstract extraction is the resolution of templates. A template is a simple sequence of characters that has a special meaning for Wikipedia. Wikipedia renders those templates in a specific way. The following example indicates a typical Wikipedia template:

Example 4.1 Typical Wikipedia Template

{{convert|1010000|km2|sp=us}}


This template tells Wikipedia that the area of some country is 1010000 square kilometers, and when it is rendered, Wikipedia should display its area in both square kilometers and square miles. So, Wikipedia will render it as "1,010,000 square kilometers (390,000 sq mi)". DBpedia should behave similarly towards those templates. In order to resolve those templates used in the abstract of the article, we should install a copy of Wikipedia. The required steps to install a local copy of Wikipedia are:

1. MySQL: Install MySQL server as the back-end relational database for Wikipedia,

2. SQL dumps: Download the latest SQL dumps for Wikipedia, which are freely available at http://dumps.wikimedia.org/enwiki/,

3. Clean SQL dumps: Those SQL dumps need some adjustment before you can insert them into MySQL. You can perform this adjustment by running "clean.sh", which you can download from the website containing the sourcecode, see Section 4.2.8.

4. Import SQL dumps: You can now use a script called "import.sh", which is also available with the sourcecode,

5. HTTP Server: An Apache server should be installed, which will provide a front-end for abstract extraction.

4.2.2. Mapping-Affected Pages

Whenever a mapping change occurs, some pages should be reprocessed. For example, in Figure 4.2, if the template property called translator, which is mapped to the DBpedia property translator, is changed to another property, then all entities belonging to the class Book should be reprocessed. Upon a mapping change, we identify the list of affected DBpedia entities, along with the IDs of their corresponding Wikipedia pages. Basically, the DBpedia Live framework has a priority-queue which contains all pages waiting for processing. This priority-queue is considered the backbone of our framework, as several streams including the live-update stream, mapping-update stream, and unmodified-pages stream place the IDs of their pages in this queue. DBpedia Live consumes the contents of that queue taking the priority of updates into account. Specifically, IDs of pages fetched from the live update stream are placed in that queue with the highest priority. The IDs of pages affected by a mapping change are also placed in that queue but with lower priority.


4.2.3. Unmodified Pages

Naturally, there is a large variation in the update frequency of individual articles in Wikipedia. If any change in the DBpedia extraction framework occurs, e.g. a modification of the implementation of an extractor or an addition of a new extractor, this will not be a problem for the frequently updated articles, as it is likely that they will be reprocessed soon. However, less frequently updated articles may not be processed for several months and would, therefore, not reflect the current implementation state of the extraction framework. To overcome this problem, we obtain the IDs of pages which have not been modified between one and three months ago, and place their IDs in our priority-queue. Those pages have the lowest priority in order not to block or delay live extraction. Since we use a local synchronized instance of the Wikipedia database, we can query this instance to obtain the list of such articles, which have not been modified between one and three months ago. Directing those queries against Wikipedia itself would place too high a burden on the Wikipedia servers, because the number of unmodified pages can be very large.
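Taken together, the three streams feed one queue with three priority levels: live updates first, mapping-induced updates second, and unmodified pages last. The following Python sketch of such a queue uses the standard heapq module; the priority values and page IDs are purely illustrative:

import heapq
import itertools

LIVE, MAPPING, UNMODIFIED = 0, 1, 2     # lower value = higher priority

class PageQueue:
    def __init__(self):
        self._heap = []
        self._tie_breaker = itertools.count()   # keeps FIFO order within a priority level

    def push(self, priority, page_id):
        heapq.heappush(self._heap, (priority, next(self._tie_breaker), page_id))

    def pop(self):
        priority, _, page_id = heapq.heappop(self._heap)
        return page_id

queue = PageQueue()
queue.push(UNMODIFIED, 4021)   # an article untouched for months
queue.push(MAPPING, 1337)      # affected by a mapping change
queue.push(LIVE, 25)           # just edited on Wikipedia
print(queue.pop(), queue.pop(), queue.pop())   # 25 1337 4021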

4.2.4. Changesets

Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and another set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files on our server. If another triplestore wants to synchronize with DBpedia Live, it can just download those files, decompress them and integrate them with its store. The folder to which we publish the changesets has a specific structure. The folder structure is as follows:

• The parent folder contains a folder for each year while running, e.g. 2010, 2011, 2012, ....

• The folder of a year contains a folder for each month passed while running, e.g. 1, 2, 3, ..., 12.

• The folder of a month contains a folder for each day passed while running, e.g. 1, 2, 3, ..., 28/29/30/31.

• The folder of a day contains a folder for each hour passed while running, e.g. 0, 1, 2, ..., 23.

• Inside the folder of an hour, we write the compressed N-Triples files with the added or removed triples, e.g. 000000.added.nt.gz and 000000.removed.nt.gz. These represent the two disjoint sets of added and/or removed triples.

To clarify that structure, let us take the following example:


Example 4.2

dbpedia_publish/2011/06/02/15/000000.added.nt.gz and dbpedia_publish/2011/06/02/15/000000.removed.nt.gz

This indicates that in the year 2011, in the 6th month of that year, on the 2nd day of that month, during hour 15, two files were written: one for added triples, and one for removed triples.
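Constructing the publication path for a given point in time is thus a matter of formatting the timestamp. A small Python sketch of how a publisher might derive the two file names is shown below; the sequence number 000000 is presumably a counter within the hour, which is assumed here:

from datetime import datetime

def changeset_paths(timestamp, sequence=0):
    """Derive the .added/.removed file paths for a changeset written at `timestamp`."""
    prefix = "dbpedia_publish/%04d/%02d/%02d/%02d/%06d" % (
        timestamp.year, timestamp.month, timestamp.day, timestamp.hour, sequence)
    return prefix + ".added.nt.gz", prefix + ".removed.nt.gz"

print(changeset_paths(datetime(2011, 6, 2, 15)))
# ('dbpedia_publish/2011/06/02/15/000000.added.nt.gz',
#  'dbpedia_publish/2011/06/02/15/000000.removed.nt.gz')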

4.2.5. Synchronization Tool

The synchronization tool enables a DBpedia Live mirror to stay in synchronization with our live endpoint. It downloads the changeset files sequentially, decompresses them and integrates them with another DBpedia Live mirror. As described in Section 4.2.4, two files are written, one for added triples, and the other for deleted ones. That tool simply downloads both of them, creates the appropriate SPARUL INSERT/DELETE statements and executes them against the triplestore. An interested user can download this tool and configure it properly, i.e. configure the address of his/her triplestore, login credentials for that triplestore, and so forth. Afterwards, he/she can run it to synchronize that triplestore with ours.
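In essence, the tool turns each pair of changeset files into a pair of SPARUL statements. The Python sketch below outlines this consumer side; the mirror's update endpoint, the graph URI, the download URL layout (taken from the example above) and the absence of authentication are simplifying assumptions, whereas the real tool is configurable:

import gzip
import urllib.parse
import urllib.request

UPDATE_ENDPOINT = "http://localhost:8890/sparql"   # placeholder for the mirror's update endpoint
GRAPH = "http://live.dbpedia.org"                  # placeholder graph URI

def download_ntriples(url):
    with urllib.request.urlopen(url) as response:
        return gzip.decompress(response.read()).decode("utf-8")

def run_update(update_statement):
    data = urllib.parse.urlencode({"update": update_statement}).encode("utf-8")
    urllib.request.urlopen(urllib.request.Request(UPDATE_ENDPOINT, data=data))

def apply_changeset(base_url):
    """Apply one changeset pair, e.g. base_url = '.../2011/06/02/15/000000'."""
    added = download_ntriples(base_url + ".added.nt.gz")
    removed = download_ntriples(base_url + ".removed.nt.gz")
    run_update("DELETE DATA { GRAPH <%s> { %s } }" % (GRAPH, removed))
    run_update("INSERT DATA { GRAPH <%s> { %s } }" % (GRAPH, added))

apply_changeset("http://live.dbpedia.org/liveupdates/2011/06/02/15/000000")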

4.2.6. New Extractors

Each extractor in the DBpedia framework is responsible for handling a specific part or parts of the Wikipedia article and extracting structured data out of it. So, the more extractors the framework has, the more knowledge it generates. We have created and added two new extractors to the DBpedia framework:

1. Contributor extractor: This extractor basically extracts the contributor of the Wikipedia article, i.e. the editor of the article. This extractor is important as it can track the heavily updated articles, i.e. the articles with a high number of edits, as well as the most active contributors, i.e. the editors with a high number of edits.

2. Meta-Information extractor: The meta-information extractor extracts the Wikipedia page's meta-information, including edit-link, revision-link, and modification date. The edit-link holds the URL through which the user can edit the Wikipedia article, e.g. http://en.wikipedia.org/w/index.php?title=Berlin&action=edit. The revision-link represents the URL of the latest revision of a Wikipedia article, e.g. http://en.wikipedia.org/w/index.php?title=Berlin&oldid=514878139. The modification date represents the edit date of a Wikipedia article, e.g. for the Wikipedia article of Berlin the modification date is "2012-09-12T20:52:27Z", which means that this article was edited at that timestamp.


4.2.7. Delta Calculation

The delta calculation feature enables the user to view the latest updates done to a specific DBpedia resource. In other words, it displays a list of added, deleted, and modified triples of that resource which were generated during the last iteration of DBpedia Live over that resource. For example, to view the latest updates performed on resource "Leipzig", you should use http://dbpedia.aksw.org:8899/delta.vsp?uri=http://dbpedia.org/resource/Leipzig.

4.2.8. Important Pointers

• DBpedia project homepage: http://dbpedia.org.

• SPARQL-endpoint: The DBpedia Live SPARQL-endpoint can be accessed at http://live.dbpedia.org/sparql.

• DBpedia Live statistics: Some simple statistics are provided upon extraction on http://live.dbpedia.org/livestats.

• Updates: The N-Triples files containing the updates can be found at http://live.dbpedia.org/liveupdates.

• DBpedia sourcecode: https://github.com/dbpedia/extraction-framework.

• Synchronization tool: http://sourceforge.net/projects/dbpintegrator/files/.

• Dump files: We regularly create dumps of DBpedia Live data on a monthly basis, which can be found under http://live.dbpedia.org/dumps/.

• Delta: The delta of a DBpedia resource is available at http://live.dbpedia.org/delta.

• DBpedia discussion mailing list: For reporting and discussing DBpedia-related issues, users can join our mailing list https://sourceforge.net/mailarchive/forum.php?forum_name=dbpedia-discussion.

4.3. DBpedia Live Usage

Since its official release at the end of June 2011, DBpedia Live attracted many users, and an increasing number of requests are sent to our DBpedia Live endpoint every day. Furthermore, more users tend to use the synchronization tool to synchronize their own DBpedia Live mirrors. This leads to an increasing number of live update, i.e. changeset download, requests. Figure 4.3 indicates the number of

daily SPARQL and synchronization requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Figure 4.3.: Number of daily requests sent to the DBpedia Live endpoint for a) SPARQL queries and b) synchronization requests from August 2012 until January 2013.

5. Data Quality

This chapter deals with the data quality evaluation and data validation methodologies. It first describes a crowdsourcing methodology for data quality evaluation, mainly centered on DBpedia. It also discusses the various types of errors and issues that can appear in semantic data in general and in DBpedia in particular, and how those errors and issues can be organized in a hierarchy. Afterwards, it introduces a data validation algorithm called "DeFacto", and how its prototype is implemented. This chapter uses material from [Zaveri et al., 2013a] and [Lehmann et al., 2012]. It is worth mentioning that this is joint work, i.e. I have contributed as a co-author and co-developer; in particular, I was involved in the following tasks:

1. Evaluating several DBpedia resources and identifying the errors that exist in them.

2. Fixing several bugs and enhancing the DBpedia framework, which leads to resolving some of the detected errors.

3. Assessing the evaluations of the participants, who evaluate random DBpedia resources and identify their problems, for correctness.

4. Developing the front-end interface of DeFacto.

5. Building the module that generates the provenance data.

5.1. Crowdsourcing for Data Quality

5.1.1. Assessment Methodology

In this section, we describe a generalized methodology for the assessment and subsequent data quality improvement of resources belonging to a dataset. The assessment methodology we propose is depicted in Figure 5.1. This methodology consists of the following four steps: 1. Resource selection, 2. Evaluation mode selection, 3. Resource evaluation and 4. Data quality improvement. In the following, we describe these steps in more detail.

Step I: Resource selection In this first step, the resources belonging to a particular dataset are selected. This selection can be performed in three different ways:


• Per Class: select resources belonging to a particular class

• Completely random: a random resource from the dataset

• Manual: a resource selected manually from the dataset

Choosing resources per class (e.g. animal, sport, place etc.) gives the user the flexibility to choose resources belonging to only those classes he/she is familiar with. However, when choosing resources from a class, the selection should be made in proportion to the number of instances of that class. Random selection, on the other hand, ensures an unbiased and uniform coverage of the underlying dataset. In the manual selection option, the user is free to select resources with problems that he/she has perhaps previously identified.

Step II: Evaluation mode selection The assignment of the resources to a person or machine, selected in Step I, can be accomplished in the following three ways:

• Manual: the selected resources are assigned to a person (or group of individuals) who will then proceed to manually evaluate the resources individually.

• Semi-automatic: selected resources are assigned to a semi-automatic tool which performs data quality assessment employing some form of user feedback.

• Automatic: the selected resources are given as input to an automatic tool which performs the quality assessment without any user involvement.

For the semi-automatic evaluation, machine learning can be applied as shown in [Bühmann and Lehmann, 2012] and provided by the DL-Learner framework [Lehmann, 2009, Lehmann and Hitzler, 2010, Lehmann, 2010]. Those algorithms are based on statistical analysis or refinement operators [Lehmann and Hitzler, 2007a, Lehmann, 2007, Lehmann and Haase, 2009] and have been applied to various problems [Lehmann and Hitzler, 2007b, Iglesias and Lehmann, 2011] and shown to be scalable [Hellmann et al., 2009, Hellmann et al., 2011, Lehmann et al., 2011]. The workflow can be as follows: (1) based on the instance data, generate OWL axioms which can also be seen as restrictions1, e.g. learn characteristics (irreflexivity, (inverse) functionality, asymmetry) of properties as well as definitions and disjointness of classes in the knowledge base; (2) ask queries via SPARQL or a reasoner for violations of these restrictions, e.g. in case of an irreflexive property, triples where subject and object are the same would indeed violate the characteristic of irreflexivity. In the automatic case, a possible approach is to check for inconsistencies and other modeling problems as, e.g., described in [Lehmann and Bühmann, 2010].

1A local Unique Name Assumption is used therefore, i.e. every named individual is assumed to be different from every other, unless stated explicitly otherwise


Figure 5.1.: Workflow of the data quality assessment methodology.

Step III: Resource evaluation In case of manual assignment of resources, the person (or group of individuals) evaluates each resource individually to detect the potential data quality problems. In order to support this step, a quality assessment tool can be used which allows a user to evaluate each individual triple belonging to a particular resource. If, in case of Step II, the selected resources are assigned to a semi-automatic tool, the tool points to triples likely to be wrong. For example, domain or range problems are identified by the tool and then assigned to a person to verify the correctness of the results.

Step IV: Data quality improvement After the evaluation of resources and identification of potential quality problems, the next step is to improve the data quality. There are at least two ways to perform an improvement:

• Direct: editing the triple, identified to contain the problem, with the correct value

• Indirect: using the Patch Request Ontology2 [Knuth et al., 2012] which allows gathering user feedback about erroneous triples.

5.1.2. Quality Problem Taxonomy

A systematic review done in [Zaveri et al., 2013c] identified several different data quality dimensions (criteria) applicable to Linked Data. After carrying out

2http://141.89.225.43/patchr/ontologies/patchr.ttl#

an initial data quality assessment on DBpedia (as part of the first phase of the manual assessment methodology, cf. Section 5.1.4.1.1), the problems identified were mapped to this list of the identified dimensions. In particular, Accuracy, Relevancy, Representational-consistency and Interlinking were identified to be problems affecting a large number of DBpedia resources. Additionally, these dimensions were further divided into categories and sub-categories. Table 5.1 gives an overview of these data quality dimensions along with their categories and sub-categories. Additionally, we indicate whether the problems are automatically detectable (column D) and fixable (column F). If the problem is fixable, we determined whether the problem can be fixed by amending the (i) extraction framework (E), (ii) the mappings wiki (M) or (iii) Wikipedia itself (W). Moreover, the table specifies whether the problems are specific to DBpedia (marked with a tick) or could potentially occur in any RDF dataset. For example, the sub-category Special template not properly recognized is a problem that occurs only in DBpedia due to the presence of specific keywords in Wikipedia articles that do not cite any references or sources (e.g. Unreferenced stub|auto=yes). On the other hand, the problems that are not DBpedia specific can occur in any other datasets. In the following, we provide the quality problem taxonomy and discuss each of the dimensions along with its categories and sub-categories in detail by providing examples.

5.1.2.1. Accuracy

Accuracy is defined as the extent to which data is correct, that is, the degree to which it correctly represents the real world facts and is also free of error [Zaveri et al., 2013c]. We further classify this dimension into the categories (i) object incorrectly extracted, (ii) datatype problems and (iii) implicit relationship between attributes.

Object incorrectly extracted. This category refers to those problems which arise when the object value of a triple is flawed. This may occur when the value is either (i) incorrectly extracted, (ii) incompletely extracted or (iii) the special template in Wikipedia is not recognized:

• Object value is incorrectly extracted, e.g.: dbpedia:Oregon_Route_238 dbpprop:map "238.0"^^http://dbpedia.org/datatype/second. This resource about the state highway Oregon Route 238 has the incorrect property 'map' with value 238. In Wikipedia the attribute 'map' refers to the image name as the value: map=Oregon Route 238.svg. The DBpedia property only extracted the value 238 from this attribute value and gave it the datatype 'second', assuming it is a time value, which is incorrect.

• Object value is incompletely extracted, e.g.: dbpedia:Dave_Dobbyn dbpprop:dateOfBirth


"3"^^xsd:integer. In this example, only the day of birth of a person is extracted and mapped to the ’dateofBirth’ property when it should have been the entire date i.e. day, month and year. Thus, the object value is not completely extracted.

• Special template not properly recognized, e.g.: dbpedia:328_Gudrun dbpprop:auto "yes"@en. Certain article classifications in Wikipedia (such as "This article does not cite any references or sources.") are performed via special templates (e.g. Unreferenced stub|auto=yes). Such templates should be listed on a black-list and omitted by the DBpedia extraction in order to prevent non-meaningful triples.

Datatype problems. This category refers to those triples which are extracted with an incorrect datatype for a typed literal.

• Datatype incorrectly extracted, e.g.: dbpedia:Stephen_Fry dbo:activeYearsStartYear "1981-01-01T00:00:00+02:00"^^xsd:gYear. In this case, the DBpedia ontology datatype property activeYearsStartYear has xsd:gYear as range. Although the datatype declaration is correct, the value is formatted as xsd:dateTime. The expected value is "1981"^^xsd:gYear.

Implicit relationship between attributes. This category of problems may arise due to (i) representation of one fact in several attributes, (ii) several facts encoded in one attribute or (iii) an attribute value computed from another attribute value in Wikipedia.

• One fact is encoded in several attributes, e.g.: dbpedia:Barlinek dbpprop:postalCodeType "Postal code"@en. In this example, the value of the postal code of the town of Barlinek is encoded in two attributes, 'postal code type = Postal code' and 'postalcode = 74-320'. DBpedia extracts both these attributes separately instead of combining them together to produce one triple, such as: dbpedia:Barlinek dbpprop:postalCode "74-320"@en.

• Several facts are encoded in one attribute, e.g.: dbpedia:Picathartes dbo:synonym "Galgulus Wagler, 1827 (non Brisson, 1760:preoccupied)"@en. In this example, even though the triple is not incorrect, it contains two pieces of information. Only the first word is the synonym, the rest of the value is a reference to that synonym. In Wikipedia, this fact is represented as "synonyms = ''Galgulus'' <small>Wagler, 1827 (''non'' [[Mathurin Jacques


Brisson|Brisson]], 1760: [[Coracias|preoccupied]])</small>". The DBpedia framework should ideally recognize this and separate these facts into several triples.

• Attribute value computed from another attribute value, e.g.: dbpedia:Barlinek dbpprop:populationDensityKm "auto"@en. In Wikipedia, this attribute is represented as "population density km2 = auto". The word "auto" is an indication in Wikipedia that the value associated with that attribute should be computed "automatically". In this case, the population density is computed automatically by dividing the population by the area.

Dimension | Category | Sub-category | D | F | DBpedia specific
Accuracy | Triple incorrectly extracted | Object value is incorrectly extracted | – | E | –
Accuracy | Triple incorrectly extracted | Object value is incompletely extracted | – | E | –
Accuracy | Triple incorrectly extracted | Special template not properly recognized | ✓ | E | ✓
Accuracy | Datatype problems | Datatype incorrectly extracted | ✓ | E | –
Accuracy | Implicit relationship between attributes | One fact encoded in several attributes | – | M | ✓
Accuracy | Implicit relationship between attributes | Several facts encoded in one attribute | – | E | –
Accuracy | Implicit relationship between attributes | Attribute value computed from another attribute value | – | E + M | ✓
Relevancy | Irrelevant information extracted | Extraction of attributes containing layout information | ✓ | E | ✓
Relevancy | Irrelevant information extracted | Redundant attribute values | ✓ | – | –
Relevancy | Irrelevant information extracted | Image related information | ✓ | E | ✓
Relevancy | Irrelevant information extracted | Other irrelevant information | ✓ | E | –
Representational-Consistency | Representation of number values | Inconsistency in representation of number values | ✓ | W | –
Interlinking | External links | External websites | ✓ | W | –
Interlinking | Interlinks with other datasets | Links to Wikimedia | ✓ | E | –
Interlinking | Interlinks with other datasets | Links to Freebase | ✓ | E | –
Interlinking | Interlinks with other datasets | Links to Geospecies | ✓ | E | –
Interlinking | Interlinks with other datasets | Links generated via Flickr wrapper | ✓ | E | –

Table 5.1.: Data quality dimensions, categories and sub-categories identified in the DBpedia resources. Detectable (column D) means problem detection can be automated. Fixable (column F) means the issue is solvable by amending either the extraction framework (E), the mappings wiki (M) or Wikipedia (W). The last column marks the dataset specific subcategories.

5.1.2.2. Relevancy

Relevancy refers to the provision of information which is in accordance with the task at hand and important to the users' query [Zaveri et al., 2013c]. The

only category Irrelevant information extracted of this dimension can be further sub-divided into the following sub-categories: (i) extraction of attributes containing layout information, (ii) image related information, (iii) redundant attribute values and (iv) other irrelevant information.

• Extraction of attributes containing layout information, e.g.: dbpedia:Lærdalsøyri dbpprop:pushpinLabelPosition "bottom"@en. Information related to the layout of a page in Wikipedia, such as the position of the label on a pushpin map relative to the pushpin coordinate marker, in this example specified as "bottom", is irrelevant when extracted in DBpedia.

• Image related information, e.g.: dbpedia:Three-banded_Plover dbpprop:imageCaption "At Masai Mara National Reserve, Kenya"@en. Extraction of an image caption or the name of the image is irrelevant in DBpedia as the image is not displayed for any DBpedia resource.

• Redundant attribute values, e.g.: The resource dbpedia:Niedersimmental_District contains the redundant properties dbo:thumbnail, foaf:depiction, dbpprop:imageMap with the same value "Karte Bezirk Niedersimmental 2007.png" as the object.

• Other irrelevant information, e.g.: dbpedia:IBM_Personal_Computer dbpedia:Template:Infobox_information_appliance "type"@en. Information regarding a template's infobox information, in this case with an object value of "type", is completely irrelevant.

5.1.2.3. Representational-Consistency

Representational-consistency is defined as the degree to which the format and structure of information conforms to previously returned information and other datasets [Zaveri et al., 2013c], and has the following category:

• Representation of number values, e.g.: dbpedia:Drei_Flüsse_Stadion dbpprop:seatingCapacity "20"^^xsd:integer. In Wikipedia, the seating capacity for this stadium has the value "20.000", but in DBpedia the value displayed is only 20. This is because the value is inconsistently represented with a dot as the thousands separator instead of a comma, which is interpreted as a decimal point.


5.1.2.4. Interlinking

Interlinking is defined as the degree to which entities that represent the same concept are linked to each other [Zaveri et al., 2013c]. This type of problem is recorded when links to external websites or external data sources are either incorrect, do not show any information or are expired. We further classify this dimension into the following categories:

• External websites: Wikipedia usually contains links to external web pages such as, for example, the home page of a company or a music band. It may happen that these links are either incorrect, do not work or are unavailable.

• Interlinks with other datasets: Linked Data mandates interlinks between datasets. These links can either be incorrectly mapped or may not contain useful information. These problems are recorded in the following sub-categories: 1. links to Wikimedia, 2. links to Freebase, 3. links to Geospecies, 4. links generated via Flickr wrapper.

5.1.3. A Crowdsourcing Quality Assessment Tool

In order to assist several users in assessing the quality of a resource, a tool called TripleCheckMate3 has been developed, aligned with the methodology described in Section 5.1.1, in particular with Steps 1 – 3. To use the tool, the user is required to authenticate himself, which not only prevents spam but also helps in keeping track of his evaluations. After authenticating himself, he/she proceeds with the selection of a resource (Step 1). He/she is provided with three options: (i) per class, (ii) completely random and (iii) manual (as described in Step I of the assessment methodology). After selecting a resource, the user is presented with a table showing each triple belonging to that resource on a single row. Step 2 involves the user evaluating each triple and checking whether it contains a data quality problem. The link to the original Wikipedia page for the chosen resource is provided on top of the page, which facilitates checking against the original values. If the triple contains a problem, he/she checks the box "is wrong". Moreover, he/she is provided with a taxonomy of pre-defined data quality problems where he/she assigns each incorrect triple to a problem. If the detected problem does not match any of the existing types, he/she has the option to provide a new type and extend the taxonomy. After evaluating one resource, the user saves the evaluation and proceeds to choosing another random resource, following the same procedure. Another important feature of the tool is to allow measuring of inter-rater agreements. That is, when a user selects a random method (Any or Class) to choose a resource, there is a 50% probability that he/she is presented with a resource that was already evaluated by another user. This probability as well

3available at http://github.com/AKSW/TripleCheckMate

as the number of evaluations per resource is configurable. Allowing many users to evaluate a single resource not only helps to determine whether incorrect triples are recognized correctly, but also to detect incorrect evaluations (e.g. incorrect classification of the problem type or marking correct triples as incorrect), especially when crowdsourcing the quality assessment of resources. One important feature of the tool is that, although it was built for DBpedia, it is parametrizable to accept any endpoint and, with very few adjustments in the database back-end (i.e. ontology classes and problem types), one could use it for any Linked Data dataset (open or closed).

5.1.4. Evaluation of DBpedia Data Quality 5.1.4.1. Evaluation Methodology 5.1.4.1.1 Manual Methodology We performed the assessment of the quality of DBpedia in two phases: Phase I: Problem detection and creation of a taxonomy, and Phase II: Evaluation via crowdsourcing. Phase I: Creation of a quality problem taxonomy. In the first phase, two researchers independently assessed the quality of 20 DBpedia resources each. During this phase, an initial list of data quality problems that occurred in these resources was identified. The identified problems were mapped to the 26 quality dimensions from [Zaveri et al., 2013c]. After analyzing the root causes of these problems, the quality dimensions were refined to obtain a finer classification. This classification of the dimensions into sub-categories resulted in a total of 17 types of data quality problems (cf. Table 5.1), as described in Section 5.1.2. Phase II: Crowdsourcing quality assessment. In the second phase, we crowdsourced the quality evaluation, wherein we invited researchers who are familiar with RDF to use the TripleCheckMate tool (described in Section 5.1.3). After authenticating, each user chooses a resource by one of the three options mentioned in Section 5.1.4.1.1. Thereafter, the extracted facts about that resource are shown to the user. The user then looks at each individual fact, records whether it contains a data quality problem and maps it to the type of quality problem.

5.1.4.1.2 Semi-automatic Methodology We applied the semi-automatic method (cf. Section 5.1.1), which consists of two steps: (1) the generation of a particular set of schema axioms for all properties in DBpedia and (2) the manual verification of these axioms. Step I: Automatic creation of an extended schema. In this step, the enrichment functionality of DL-Learner [Bühmann and Lehmann, 2012] for SPARQL endpoints was applied. Thereby, for all properties in DBpedia, axioms expressing the (inverse) functional, irreflexive and asymmetric characteristics were generated, with


a minimum confidence value of 0.95. For example, for the property dbo:firstWin, which is a relation between Formula One racers and grands prix, axioms for all four mentioned types were generated: each Formula One racer has only one first win in his career (functional), and each grand prix can only be the first win of one Formula One racer (inverse functional). It is not possible to use the property dbo:firstWin in both directions (asymmetric), and the property is also irreflexive. Step II: Manual evaluation of the generated axioms. In the second step, we used at most 100 random axioms per axiom type and manually verified whether each axiom is appropriate. To focus on possible data quality problems, we restricted the evaluation data to axioms where at least one violation can be found in the knowledge base. Furthermore, we tried to facilitate the evaluation by taking the target context into account, i.e. if it exists we consider the definition, domain and range as well as one random sample of a violation. When evaluating the inverse functionality for the property dbo:firstWin, we can therefore make use of the following additional information:

Domain: dbo:FormulaOneRacer    Range: dbo:GrandPrix
Sample Violation:
dbpedia:Fernando_Alonso dbo:firstWin dbpedia:2003_Hungarian_Grand_Prix .
dbpedia:WikiProject_Formula_One dbo:firstWin dbpedia:2003_Hungarian_Grand_Prix .
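To illustrate how such violations can be detected automatically, the following Python sketch queries a SPARQL endpoint for objects of a property that are related to more than one distinct subject, i.e. candidate violations of an inverse-functional axiom. It is only a sketch under the assumption that the public DBpedia endpoint is used; the endpoint URL and the helper name are illustrative and not part of the original evaluation setup.

import requests

ENDPOINT = "http://dbpedia.org/sparql"  # assumption: public DBpedia SPARQL endpoint

def inverse_functionality_violations(prop, limit=10):
    # objects of `prop` that have more than one distinct subject
    query = f"""
    SELECT ?o (COUNT(DISTINCT ?s) AS ?subjects) WHERE {{
        ?s <{prop}> ?o .
    }}
    GROUP BY ?o
    HAVING (COUNT(DISTINCT ?s) > 1)
    LIMIT {limit}
    """
    resp = requests.get(ENDPOINT, params={"query": query},
                        headers={"Accept": "application/sparql-results+json"})
    resp.raise_for_status()
    return [(b["o"]["value"], int(b["subjects"]["value"]))
            for b in resp.json()["results"]["bindings"]]

if __name__ == "__main__":
    for obj, count in inverse_functionality_violations("http://dbpedia.org/ontology/firstWin"):
        print(obj, count)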

5.1.4.2. Evaluation Results 5.1.4.2.1 Manual Methodology An overview of the evaluation results is shown in Table 5.2. Overall, only 16.5% of all resources were not affected by any problem. On average, there were 5.69 problems per resource, and 2.24 problems excluding errors in the dbprop namespace5 [Lehmann et al., 2009]. While the vast majority of resources have problems, it should also be remarked that each resource has 47.19 triples on average, which is higher than in most other LOD datasets. About 83.49% of all resources have at least one problem according to our data quality taxonomy. The tool was configured to allow two evaluations per resource, which resulted in a total of 268 inter-evaluations. We computed the inter-rater agreement for those resources that were evaluated by two persons by adjusting the observed agreement with the agreement expected by chance, as done in Cohen's kappa6. The inter-rater agreement results – 0.34 for resource agreement and 0.38 for triple agreement – indicate that the same resource should be evaluated more than twice in future evaluations. To assess the accuracy of the crowdsourcing evaluation, we took a random sample of 700 assessed triples (out of the total 2928) and evaluated them for correctness; the sample size is based on the formula in [Krejcie and Morgan, 1970] and is intended to be representative of all assessed triples. Additionally, we assumed a margin of error of 3.5%, which is a bound that we can place on the difference between the estimated correctness of the triples and the true value, and a 95% confidence level, which is the measure of how confident we are in that margin of error7.

4Also available at: http://aksw.org/Projects/DBpediaDQ 5http://dbpedia.org/property/ 6http://en.wikipedia.org/wiki/Cohen%27s_kappa


Total no. of users: 58
Total no. of distinct resources evaluated: 521
Total no. of resources evaluated: 792
Total no. of distinct resources without problems: 86
Total no. of distinct resources with problems: 435
Total no. of distinct incorrect triples: 2928
Total no. of distinct incorrect triples in the dbprop namespace: 1745
Total no. of inter-evaluations: 268
No. of resources with evaluators having different opinions: 89
Resource-based inter-rater agreement (Cohen's Kappa): 0.34
Triple-based inter-rater agreement (Cohen's Kappa): 0.38
No. of triples evaluated for correctness: 700
No. of triples evaluated to be correct: 567
No. of triples evaluated incorrectly: 133
% of triples correctly evaluated: 81
Average no. of problems per resource: 5.69
Average no. of problems per resource in the dbprop namespace: 3.45
Average no. of triples per resource: 47.19
% of triples affected: 11.93
% of triples affected in the dbprop namespace: 7.11

Table 5.2.: Overview of the manual quality evaluation.

From these 700 triples, 133 were evaluated incorrectly by the crowd, i.e. about 81% of the triples were evaluated correctly. Table 5.3 shows the total number of problems, the number of distinct resources and the percentage of affected triples for each problem type. Overall, the most prevalent problems, such as broken external links, are outside the control of the DBpedia extraction framework. After that, several extraction and mapping problems that occur frequently, mainly affecting accuracy, can be remedied by manually adding mappings or possibly by improving the extraction framework. When looking at the detectable and fixable problems from Table 5.1 in light of their prevalence, we expect that approximately one third of the problems can be detected automatically and two thirds are fixable by improving the DBpedia extraction framework. In particular, implicitly related attributes can be properly extracted with a new extractor, which can be configured using the DBpedia Mappings Wiki. As a result, we expect that the problem rate in DBpedia can be reduced from 11.93% to 5.81% (calculated by subtracting 7.11% from 11.93% reported in Table 5.2). After revising the DBpedia extraction framework, we will perform subsequent quality assessments using the same methodology in order to realize and demonstrate these improvements.

5.1.4.2.2 Semi-automatic Methodology The evaluation results in Table 5.4 show that for the irreflexive case all 24 properties that would lead to at least one violation should indeed be declared as irreflexive. Applying the irreflexive characteristic would therefore help to

7http://research-advisors.com/tools/SampleSize.htm


Criteria                                      IT    DR    AT (%)

Accuracy
Object incorrectly extracted                  32    14    2.69
Object value is incorrectly extracted         259   121   23.22
Object value is incompletely extracted        229   109   20.92
Special template not recognized               14    12    2.30
Datatype problems                             7     6     1.15
Datatype incorrectly extracted                356   131   25.14
Implicit relationship between attributes      8     4     0.77
One fact is encoded in several attributes     670   134   25.72
Several facts encoded in one attribute        87    54    10.36
Value computed from another value             14    14    2.69
Accuracy unassigned                           31    11    2.11

Relevancy
Irrelevant information extracted              204   29    5.57
Extraction of layout information              165   97    18.62
Redundant attributes value                    198   64    12.28
Image related information                     121   60    11.52
Other irrelevant information                  110   44    8.45
Relevancy unassigned                          1     1     0.19

Representational-consistency
Representation of number values               29    8     1.54
Representational-consistency unassigned       5     2     0.38

Interlinking
External websites (URLs)                      222   100   19.19
Interlinks with other datasets (URIs)         2     2     0.38
Links to Wikimedia                            138   71    13.63
Links to Freebase                             99    99    19.00
Links to Geospecies                           0     0     0.00
Links generated via Flickr wrapper            135   135   25.91
Interlinking unassigned                       3     3     0.58

Table 5.3.: Detected number of problems for each of the defined quality problem types. IT = Incorrect triples, DR = Distinct resources, AT = Affected triples (%).

find overall 236 critical triples, e.g. dbpedia:2012_Coppa_Italia_Final dbo:followingEvent dbpedia:2012_Coppa_Italia_Final, which is not meaningful as no event is the following event of itself. For asymmetry, we obtained 81 approved properties, for example dbo:starring with domain Work and range Actor. In contrast, there are also some properties where asymmetry is not always appropriate, e.g. dbo:influenced. Functionality, i.e. having at most one value of a property, can be applied to 76 properties. During the evaluation, we observed invalid facts such as, for example, two different values 2600.0 and 1630.0 for the density of the moon Himalia. We spotted overall 199,480 errors of this type in the knowledge base. As the result of the inverse functionality evaluation, we obtained 13 properties where the object in a triple should only be related to one unique subject, e.g. there should only be one Formula One racer who won a particular grand prix, which is implicit when using the property dbo:lastWin.

5.2. Fact Validation


Characteristic           #Properties (Total / Violated / Correct)    #Violations (Min. / Max. / Avg. / Total)
Irreflexivity            142 / 24 / 24                               1 / 133 / 9.8 / 236
Asymmetry                500 / 144 / 81                              1 / 628 / 16.7 / 1358
Functionality            739 / 671 / 76                              1 / 91581 / 2624.7 / 199480
Inverse Functionality    52 / 49 / 13                                8 / 18236 / 1685.2 / 21908

Table 5.4.: Results of the semi-automatic evaluation. The table shows the total number of properties that have been suggested to have the given characteristic by Step I of the semi-automatic methodology, the number of properties that would lead to at least one violation when applying the characteristic, the number of properties where the characteristic is meaningful (manually evaluated) and some metrics for the number of violations.

"Jamaica Inn" "written and directed by" "Alfred Hitchcock"

Alfred Jamaica Hitchcock Trustworthiness Inn Search Engine Machine director Fact Learning Confirmation

Machine Learning

Training Set BOA Pattern Training Set Library

Training Set

Figure 5.2.: Overview of Deep Fact Validation.

Input and Output: The DeFacto system consists of the components depicted in Figure 5.2. The system takes an RDF triple as input and returns a confidence value for this triple as well as possible evidence for the fact. The evidence consists of a set of webpages, textual excerpts from those pages and meta-information on the pages. The text excerpts and the associated meta-information allow the user to quickly get an overview of possible credible sources for the input statement: instead of having to use search engines, browse several webpages and look for relevant pieces of information, the user can more efficiently review the presented information. Moreover, the system uses techniques adapted specifically for fact validation instead of relying only on the generic information retrieval techniques of search engines.

Retrieving Webpages: The first task of the DeFacto system is to retrieve webpages which are relevant for the given task. The retrieval is carried out by issuing several queries to a regular search engine. These queries are computed by verbalizing the RDF triple using natural-language patterns extracted by the BOA


dbpedia:Jamaica_Inn_%28film%29 dbo:director dbpedia:Alfred_Hitchcock .

Figure 5.3.: Input data for DeFacto.

framework8 [Gerber and Ngonga Ngomo, 2011, Gerber and Ngomo, 2012]. As a next step, the highest ranked webpages for each query are retrieved. Those webpages are candidates for being sources for the input fact. Both the search engine queries as well as the retrieval of webpages are executed in parallel to keep the response time for users within a reasonable limit. Note that this usually does not put a high load on particular web servers, as the retrieved webpages typically come from many different domains.

Evaluating Webpages: Once all webpages have been retrieved, they undergo several further processing steps. First, plain text is extracted from each webpage by removing most of the HTML markup. In essence, the algorithm then decides whether the webpage contains a natural-language formulation of the input fact; this fact confirmation step distinguishes DeFacto from standard information retrieval methods. If no webpage confirms a fact according to DeFacto, the system falls back on lightweight NLP techniques and computes whether the webpage at least provides useful evidence. In addition, several trustworthiness indicators are computed for each webpage (cf. Section 5.2.1). These indicators are of central importance because a single trustworthy webpage confirming a fact may be a more useful source than several webpages with low trustworthiness. The fact confirmation results and the trustworthiness indicators of the most relevant webpages are presented to the user.

Confidence Measurement: In addition to finding and displaying useful sources, DeFacto also outputs an overall confidence value for the input fact. This confidence value ranges between 0% and 100% and serves as an indicator for the user: higher values indicate that the found sources appear to confirm the fact and can be trusted, while low values mean that not much evidence for the fact could be found on the Web and that the websites that do confirm the fact (if such exist) only display low trustworthiness. Naturally, DeFacto is a (semi-)automatic approach: we assume that users will not blindly trust the system but will additionally analyze the provided evidence. A prototype implementing the above steps is available at http://defacto.aksw.org. It shows relevant webpages, text excerpts and five different rankings per page. The generated provenance output can also be saved directly as RDF. For representing the provenance output, we use the W3C provenance group9 vocabularies. The source code of both the DeFacto algorithms and the user interface is openly available10.

8http://boa.aksw.org 9http://www.w3.org/2011/prov/ 10https://github.com/AKSW/DeFacto


It should be noted that we decided not to check for negative evidence of facts in DeFacto, since a) we considered this to be too error-prone and b) negative statements are much less frequent on the Web. It is also noteworthy that DeFacto is a self-training system on two levels: for each fact, the user can confirm, after reviewing the possible sources, whether he/she believes it is true. This feedback is then added to the training set and helps to improve the performance of DeFacto. The same can be done for text excerpts of webpages: users can confirm or reject whether a given text excerpt actually confirms a fact. More details on the BOA framework can be found in [Lehmann et al., 2012].

5.2.1. Trustworthiness Analysis of Webpages To determine the trustworthiness of a website, we first need to determine its similarity to the input triple. This similarity is determined by how many topics belonging to the query are contained in a search result retrieved by the web search. We extended the approach introduced in [Nakamura et al., 2007] by querying Wikipedia with the subject and object labels of the triple in question separately in order to find the topic terms for the triple. A frequency analysis is applied to all returned documents, and all terms above a certain threshold that are not contained in a self-compiled stop word list are considered to be topic terms for the triple. Let s and o be the URIs of the subject and object of the triple in question and t be a potential topic term extracted from a Wikipedia page. In addition, let X = (s, p, o). We compare the values of the following two formulas:

p(t | X) = |topic(t, d(X))| / |d(X)|

p(t | intitle(d(X), s ∨ o)) = |topic(t, intitle(d(X), s) ∪ intitle(d(X), o))| / |intitle(d(X), s) ∪ intitle(d(X), o)|

where d(X) is the set of all web documents retrieved for X, intitle(d(X), x) is the set of web documents which have the label of the URI x in their page title, and topic(t, d(X)) is the set of documents which contain t in the page body. We consider t to be a topic term for the input triple if p(t | intitle(d(X), s ∨ o)) > p(t | X). Let T_X = {t1, t2, . . . , tn} be the set of all topic terms extracted for an input triple. DeFacto then calculates the trustworthiness of a webpage as follows:

Topic Majority in the Web represents the number of webpages that have topics similar to those of the webpage in question. Let P be the set of topic terms appearing on the current webpage. The Topic Majority in the Web for a webpage w is then calculated as:

tm_web(w) = |topic(t1, d(X)) ∪ · · · ∪ topic(tn, d(X))| − 1


where t1 is the most occurring topic term in the webpage w. Note that we subtract 1 to prevent counting w.

Topic Majority in Search Results calculates the similarity of a given webpage to all webpages found for a given triple. Let wk be the webpage to be evaluated, v(wk) be the feature vector of webpage wk, where v(wk)_i is 1 if ti is a topic term of webpage wk and 0 otherwise, ‖v‖ be the norm of v, and θ a similarity threshold. We calculate the Topic Majority for the search results as follows:

tm_search(wk) = |{ wi | wi ∈ d(X), (v(wk) · v(wi)) / (‖v(wk)‖ ‖v(wi)‖) > θ }|

Topic Coverage measures the ratio between all topic terms for X and the topic terms occurring in w:

tc(w) = |T_X ∩ P| / |T_X|

Pagerank: The Pagerank11 of a webpage is a measure of the relative importance of that webpage compared to all others, i.e. a higher Pagerank means that a webpage is more popular. There is a positive correlation between the popularity of a webpage and its trustworthiness, as popular pages are more likely to be reviewed by more people or may have gone through stricter quality assurance before publication. While a high Pagerank alone is certainly not a sufficient indicator of trustworthiness, we use it in combination with the above criteria in DeFacto.
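The following Python sketch illustrates how the topic-term criterion and the topic coverage measure above can be computed once the relevant document sets have been retrieved; the function and variable names are illustrative and not taken from the DeFacto code base.

# Minimal sketch of the topic-term criterion and topic coverage, assuming the
# retrieved document sets are already available as Python sets of document ids.

def is_topic_term(term, docs_for_X, docs_with_s_in_title, docs_with_o_in_title, contains):
    """contains(term, doc) -> True if the document body mentions the term."""
    topic_all = {d for d in docs_for_X if contains(term, d)}
    intitle = docs_with_s_in_title | docs_with_o_in_title
    topic_intitle = {d for d in intitle if contains(term, d)}
    p_x = len(topic_all) / len(docs_for_X) if docs_for_X else 0.0
    p_intitle = len(topic_intitle) / len(intitle) if intitle else 0.0
    return p_intitle > p_x

def topic_coverage(topic_terms_X, terms_on_page):
    """tc(w) = |T_X ∩ P| / |T_X|"""
    if not topic_terms_X:
        return 0.0
    return len(set(topic_terms_X) & set(terms_on_page)) / len(set(topic_terms_X))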

5.2.2. Features for Deep Fact Validation In order to obtain an estimate of the confidence that there is sufficient evidence to consider the input triple to be true, we decided to train a supervised machine learning algorithm. Similar to the above-presented classifier for fact confirmation, this classifier also requires computing a set of relevant features for the given task. In the following, we describe those features and why we selected them. First, we extend the score of single proofs to a score of webpages as follows: when interpreting the score of a proof as the probability that the proof actually confirms the input fact, we can compute the probability that at least one of the proofs on a webpage confirms the fact. This leads to the following stochastic formula12, which allows us to obtain an overall proof score sc_w for a webpage w:

sc_w(w) = 1 − ∏_{pr ∈ pr_w(w)} (1 − fc(pr))

11http://en.wikipedia.org/wiki/Pagerank
12To be exact, it is the complementary event to the case that none of the proofs actually confirms the fact.


In this formula, fc (fact confirmation) is the classifier trained in [Lehmann et al., 2012], which takes a proof pr as input and returns a value between 0 and 1. pr_w is a function taking a webpage as input and returning all possible proofs contained in it.

Combination of Trustworthiness and Textual Evidence In general, the trustworthiness of a webpage and the textual evidence we find in it are orthogonal features. Naturally, webpages with high trustworthiness and a high score for their proofs should increase our confidence in the input fact. Therefore, it makes sense to combine trustworthiness and textual evidence as features for the underlying machine learning algorithm. We do this by multiplying both criteria and then using their sum and maximum as two different features:

F_f^sum(t) = Σ_{w ∈ s(t)} f(w) · sc_w(w)        F_f^max(t) = max_{w ∈ s(t)} f(w) · sc_w(w)

In these formulas, f can be instantiated with each of the four trustworthiness measures: topic majority on the web (tm_web), topic majority in the search results (tm_search), topic coverage (tc) and Pagerank (pr). s is a function taking a triple t as argument, executing the search queries explained in [Lehmann et al., 2012] and returning a set of webpages. Using these formulas, we obtain 8 different features for our classifier, which combine textual evidence and the different trustworthiness measures.
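As an illustration of how the webpage score and the 8 combined features could be computed, consider the following sketch; the data structures (a list of proofs with classifier scores per webpage, and precomputed trustworthiness measures) are assumptions made for the purpose of the example, not DeFacto's actual interfaces.

from math import prod

# score of a webpage: probability that at least one of its proofs confirms the fact
def page_score(proof_scores):            # proof_scores: list of fc(pr) values in [0, 1]
    return 1.0 - prod(1.0 - s for s in proof_scores)

# combine each trustworthiness measure f(w) with the page score sc_w(w)
def combined_features(pages):
    # pages: list of dicts like {"proofs": [...], "tm_web": .., "tm_search": .., "tc": .., "pr": ..}
    features = {}
    for measure in ("tm_web", "tm_search", "tc", "pr"):
        products = [p[measure] * page_score(p["proofs"]) for p in pages]
        features[f"{measure}_sum"] = sum(products)
        features[f"{measure}_max"] = max(products) if products else 0.0
    return features                      # 8 features in total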

Other Features In addition to the above-described combinations of trustworthiness and fact confirmation, we also defined the following features:

1. The total number of proofs found.

2. The total number of proofs found above a relevance threshold of 0.5. In some cases, a high number of proofs with low scores is generated, so the number of high scoring proofs may be a relevant feature for learning algorithms.

3. The total evidence score: this is the probability that at least one of the proofs is correct, defined analogously to sc_w above:

   1 − ∏_{pr ∈ pr_t(t)} (1 − sc(pr))

4. The total evidence score above a relevance threshold of 0.5. This is an adaption of the above formula, which considers only proofs with a confidence higher than 0.5.

5. Total hit count: search engines usually estimate the number of search results for an input query. The total hit count is the sum of the estimated numbers of search results for each query sent by DeFacto for a given input triple.


6. A domain and range verification: if the subject of the input triple is not an instance of the domain of its property, or the object is not an instance of its range, this violates the underlying schema, which should result in a lower confidence in the correctness of the triple. This feature is 0 if both domain and range are violated, 0.5 if exactly one of them is violated and 1 if there is no domain or range violation.
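A sketch of this last feature is shown below; the data structures (dictionaries mapping properties to their declared domain/range and resources to their classes) are illustrative assumptions, the check itself simply maps the two possible violations to the values 0, 0.5 and 1 described above.

# Minimal sketch of the domain/range verification feature (values 0, 0.5, 1),
# assuming instance-of information is available as a dict: resource -> set of classes.

def domain_range_feature(triple, domain_of, range_of, types):
    s, p, o = triple
    dom, rng = domain_of.get(p), range_of.get(p)
    domain_ok = dom is None or dom in types.get(s, set())
    range_ok = rng is None or rng in types.get(o, set())
    return (0.5 * domain_ok) + (0.5 * range_ok)   # 0.0, 0.5 or 1.0

# Example (hypothetical data):
# types = {"dbpedia:Jamaica_Inn_(film)": {"dbo:Film", "dbo:Work"},
#          "dbpedia:Alfred_Hitchcock": {"dbo:Person"}}
# domain_of = {"dbo:director": "dbo:Work"}; range_of = {"dbo:director": "dbo:Person"}
# domain_range_feature(("dbpedia:Jamaica_Inn_(film)", "dbo:director", "dbpedia:Alfred_Hitchcock"),
#                      domain_of, range_of, types)   # -> 1.0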

5.2.3. Evaluation Our main objective in the evaluation was to find out whether DeFacto can effectively distinguish between true and false input facts. In the following, we describe how we trained DeFacto using DBpedia, which experiments we performed, and then discuss the results of those experiments.

5.2.3.1. Training DeFacto We focus our tests on the 60 most frequently used properties in DBpedia. The system can easily be extended to cover more properties by extending the training set of BOA to those properties. Note that DeFacto itself is also not limited to DBpedia, i.e. while all of its components are trained on DBpedia, the algorithms can be applied to arbitrary URIs. A performance evaluation on other knowledge bases is subject to future work, but it should be noted that most parts of DeFacto (except the LOD background feature described in Section 5.2 and the schema checking feature in Section 5.2.2) work only with the retrieved labels of URIs and therefore do not depend on DBpedia. For training a supervised machine learning approach, positive and negative examples are required. Those were generated as follows:

Positive Examples: In general, we use facts contained in DBpedia as positive examples. For each of the properties we consider, we generate positive examples by randomly selecting triples containing the property. Technically, this is done by counting the frequency of the property and sending a corresponding SPARQL query with a random offset to the DBpedia Live endpoint. We obtained 600 statements this way and verified them: for each statement, we manually evaluated whether it was indeed a true fact. It turned out that some of the obtained triples were incorrectly modeled, e.g. they obviously violated domain and range restrictions, or could not be confirmed by an intensive web search within ten minutes. Overall, 473 out of the 600 checked triples were facts, which we subsequently used as positive examples.
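The random sampling step can be sketched as follows; the endpoint URL and the exact query shapes are assumptions made for illustration, not the exact queries used in the evaluation.

import random
import requests

ENDPOINT = "http://dbpedia.org/sparql"   # assumption: a public DBpedia (Live) endpoint

def sparql_select(query):
    resp = requests.get(ENDPOINT, params={"query": query},
                        headers={"Accept": "application/sparql-results+json"})
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

def random_triple_for_property(prop):
    # count the triples using the property, then pick one at a random offset
    # (note: without ORDER BY the result order is implementation-dependent)
    count = int(sparql_select(
        f"SELECT (COUNT(*) AS ?c) WHERE {{ ?s <{prop}> ?o }}")[0]["c"]["value"])
    offset = random.randrange(count)
    row = sparql_select(
        f"SELECT ?s ?o WHERE {{ ?s <{prop}> ?o }} OFFSET {offset} LIMIT 1")[0]
    return row["s"]["value"], prop, row["o"]["value"]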

Negative Examples: The generation of negative examples is more involved than the generation of positive examples. In order to effectively train DeFacto, we considered it essential that many of the negative examples are similar to

true statements. In particular, most statements should be meaningful subject-predicate-object phrases. For this reason, we derive the negative examples from positive examples by modifying them while following domain and range restrictions. Assume the input triple (s, p, o) in a knowledge base κ is given and let dom and ran be functions returning the domain and range of a property13. We used the following methods to generate the negative example sets dubbed domain, range, domain-range, property, random and 20%mix (in that order); a sketch of the first modification is given below:

1. A triple (s', p, o) is generated where s' is an instance of dom(p), the triple (s', p, o) is not contained in κ and s' is randomly selected from all resources which satisfy the previous requirements.

2. A triple (s, p, o') is generated analogously by taking ran(p) into account.

3. A triple (s', p, o') is generated analogously by taking both dom(p) and ran(p) into account.

4. A triple (s, p', o) is generated in which p' is randomly selected from our previously defined list of 60 properties and (s, p', o) is not contained in κ.

5. A triple (s', p', o') is generated where s' and o' are randomly selected resources, p' is a randomly selected property from our defined list of 60 properties and (s', p', o') is not contained in κ.

6. 20% of each of the above created negative training sets were randomly selected to create a heterogeneous test set.

Note that all parts of the example generation procedure can also take implicit knowledge into account. Since we used SPARQL as the query language for implementing the procedure, this is straightforward when using SPARQL 1.1 entailment14. In the case of DBpedia Live we did not do this for performance reasons and because it would not alter the results in that specific case. Obviously, it is possible that our procedure of generating negative examples may also generate true statements, which just happen not to be contained in DBpedia. Similar to the analysis of the positive examples, we checked a sample of the negative examples on whether they are indeed false statements. This was the case for all examples in the sample. Overall, we obtained an automatically created and manually cleaned training set, which we made publicly available15.
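The first modification (replacing the subject by a random instance of dom(p) such that the resulting triple is not in the knowledge base) could be implemented roughly as follows; the endpoint URL, the ORDER BY RAND() shortcut and the query shape are illustrative assumptions, not the exact implementation used in the thesis.

import requests

ENDPOINT = "http://dbpedia.org/sparql"   # assumption: a DBpedia (Live) endpoint

def negative_example_domain(s, p, o, domain_class):
    # pick a random instance of dom(p) that does not already hold (s', p, o)
    query = f"""
    SELECT ?s2 WHERE {{
        ?s2 a <{domain_class}> .
        FILTER NOT EXISTS {{ ?s2 <{p}> <{o}> }}
        FILTER (?s2 != <{s}>)
    }}
    ORDER BY RAND() LIMIT 1
    """
    resp = requests.get(ENDPOINT, params={"query": query},
                        headers={"Accept": "application/sparql-results+json"})
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return (bindings[0]["s2"]["value"], p, o) if bindings else None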

5.2.3.2. Experimental Setup In a first step, we computed all feature vectors described in Section 5.2.2 for the training set. DeFacto heavily relies on web requests, which are not deterministic, i.e. the same search engine query does not always return the same result.

13Technically, we used the most specific class, which was explicitly stated to be domain and range of a property, respectively. 14http://www.w3.org/TR/sparql11-entailment/ 15http://aksw.org/projects/DeFacto


                     Domain                                    Range
                     P      R      F1     AUC    RMSE          P      R      F1     AUC    RMSE
Logistic Regression  0.799  0.753  0.743  0.83   0.4151        0.881  0.86   0.859  0.844  0.3454
Naïve Bayes          0.739  0.606  0.542  0.64   0.6255        0.795  0.662  0.619  0.741  0.5815
SVM                  0.811  0.788  0.784  0.788  0.4609        0.884  0.867  0.865  0.866  0.3409
J48                  0.835  0.827  0.826  0.819  0.3719        0.869  0.862  0.861  0.908  0.3194
RBF Network          0.743  0.631  0.583  0.652  0.469         0.784  0.683  0.652  0.75   0.4421

Table 5.5.: Classification results for the training sets domain and range.

                     Domain-Range                              Property
                     P      R      F1     AUC    RMSE          P      R      F1     AUC    RMSE
Logistic Regression  0.871  0.85   0.848  0.86   0.3495        0.822  0.818  0.818  0.838  0.3792
Naïve Bayes          0.813  0.735  0.717  0.785  0.5151        0.697  0.582  0.511  0.76   0.6431
SVM                  0.88   0.863  0.861  0.855  0.3434        0.819  0.816  0.816  0.825  0.3813
J48                  0.884  0.871  0.87   0.901  0.3197        0.834  0.832  0.832  0.828  0.3753
RBF Network          0.745  0.687  0.667  0.728  0.4401        0.72   0.697  0.688  0.731  0.4545

Table 5.6.: Classification results for the training sets domain-range and property.

To achieve deterministic behavior, to increase performance and to reduce the load on the servers, all web requests are cached. The DeFacto runtime for an input triple was on average slightly below 5 seconds16 when using caches. We stored the features in the ARFF file format and employed the Weka machine learning toolkit17 for training different classifiers. In particular, we are interested in classifiers which can handle numeric values and output confidence values. Naturally, confidence values for facts, e.g. 95%, are more useful for end users than just a binary response on whether DeFacto considers the input triple to be true, since they allow a more fine-grained assessment. Again, we selected popular machine learning algorithms satisfying those requirements. We performed 10-fold cross-validation for our experiments. In each experiment, we used our created positive examples, but varied the negative example sets described above to see how the changes influence the overall behavior of DeFacto.

5.2.3.3. Results and Discussion The results of our experiments are shown in Tables 5.5–5.7. Three algorithms – J48, logistic regression and support vector machines – show promising results. Given

16The performance is roughly equal on server machines and notebooks, since the web requests dominate. 17http://www.cs.waikato.ac.nz/ml/weka/

                     Random                                    20% Mix
                     P      R      F1     AUC    RMSE          P      R      F1     AUC    RMSE
Logistic Regression  0.855  0.854  0.854  0.908  0.3417        0.665  0.645  0.634  0.785  0.4516
Naïve Bayes          0.735  0.606  0.544  0.853  0.5565        0.719  0.6    0.538  0.658  0.6267
SVM                  0.855  0.854  0.854  0.906  0.3462        0.734  0.729  0.728  0.768  0.4524
J48                  0.876  0.876  0.876  0.904  0.3226        0.8    0.79   0.788  0.782  0.405
RBF Network          0.746  0.743  0.742  0.819  0.4156        0.698  0.61   0.561  0.652  0.4788

Table 5.7.: Classification results for the training sets random and 20%mix.

the challenging tasks, F-measures of up to 78.8% for the combined negative example set appear to be very positive indicators that DeFacto can be used to effectively distinguish between true and false statements, which was our primary evaluation objective. In general, DeFacto also appears to be quite stable against the various negative example sets: the algorithms with overall positive results also seem less affected by the different variations. As expected, the easiest task is to distinguish statements with random subject and object as well as random statements from true statements, whereas all other test sets are similarly difficult. When observing single runs of DeFacto manually, it turned out that our method of generating positive examples is particularly challenging for DeFacto: for many of the facts in DBpedia only few sources exist on the Web. While it is widely acknowledged that the amount of unstructured textual information on the Web by far surpasses the available structured data, we found that a significant number of statements in DBpedia are difficult to trace back to reliable external sources on the Web, even with an exhaustive manual search. There are many reasons for this; for instance, many facts are only relevant for a specific country, such as “Person x studied at University y.”, where x is the son of a local politician and y is a university in a country with only limited internet access compared to first world countries. For this reason, BOA patterns could be found only in some cases: in 29 of the 527 proofs of positive examples, BOA patterns could be found directly. This number increased to 195 out of 527 when employing the WordNet expansion described in [Lehmann et al., 2012]. In general, DeFacto performs better when the subject and object of the input triple are popular on the web, i.e. there are several webpages describing them. In this respect, our manual observation suggests that our training set is indeed challenging.

6. DBpedia SPARQL Benchmark

This chapter deals with benchmarking in general and why it is important. It details the DBpedia SPARQL Benchmark (DBPSB), its architecture and its characteristics. Furthermore, it explains the phases of the benchmarking process, including dataset generation, query log analysis, query clustering, query selection, and eventually the actual benchmarking process. This chapter is based on the papers [Morsey et al., 2011] and [Morsey et al., 2012a].

6.1. Overview on Benchmarking

A benchmark is the process of running a computer program, or a set of programs, in order to measure and/or compare the performance of one or more systems, normally by running a number of standard tests and trials against them. The term 'benchmark' is also used to refer to the benchmarking programs themselves. Benchmarking is usually associated with assessing the performance characteristics of computer hardware, for example the floating point operation performance of a CPU, but there are circumstances in which the technique is also applicable to software. Software benchmarks are, for example, run against database management systems [Wikipedia, 2012]. There are several benchmarks for database management systems (DBMSs), the most valuable and important ones being those of the Transaction Processing Performance Council (TPC) [TPC, 2012]. The main advantage of database benchmarks over triplestore benchmarks is that they are advanced enough to measure the performance of different DBMSs efficiently, whereas triplestore benchmarks still have a long way to go to stabilize and mature. The core concern of triplestore benchmarks is to cover and test as many SPARQL features as possible. In this thesis, we propose a generic SPARQL benchmark creation methodology. This methodology is based on flexible data generation mimicking an input data source, query log mining, clustering and SPARQL feature analysis. We apply the proposed methodology to datasets of various sizes derived from the DBpedia knowledge base. In contrast to previous benchmarks, we perform measurements on real queries that were issued by humans or Data Web applications against existing RDF data. Moreover, we do not only consider the query string but also the SPARQL feature(s) used in each of the queries.


6.1.1. Objectives of Benchmarking Benchmarks aim to simulate the real workload of a system in order to assess the performance of that system under these conditions. Basically, the main objectives of benchmarks are the following:

1. assessment and/or comparison of the relative performances of the systems under test;

2. identification of the strong and weak points of each system, which in turn helps in selecting the most appropriate system for a specific purpose;

3. identification of the best and worst operating conditions of each system;

4. assistance for the system developers in order to enhance their systems;

5. assistance for building more advanced benchmarks.

In the following paragraphs we will discuss each one of those objectives in more detail.

Assessment and/or comparison of the performances of the systems: the benchmark imposes a workload on the system under test, monitors its behavior under that workload, and measures how long the system takes to handle it. That workload should mimic, as closely as possible, the real workload under which the system operates, e.g. the number of queries posed per second or the number of concurrent users accessing the system simultaneously. When the benchmark measures the performance of several systems, it can highlight the efficiency of each system, which opens the way for improvements.

Identification of the strong and weak points of each system: upon running the benchmark, the developers of each system can discover its areas of strength and weakness. For instance, in database benchmarks a system can be fast in running a certain SQL operation, e.g. the "LIKE" operator, whereas it is slow for another operation, e.g. "REGEXP". This narrows down the scope of the research and investigation that the system developers need to conduct in order to improve their system.

Identification of the best and worst operating conditions of each system: the benchmark also assists the system maintainers in determining the most suitable running conditions for their system. This is of great importance because it helps the developers to create the required system manuals, which contain the recommended conditions for their system, e.g. the recommended operating system, the recommended memory assignment, and the recommended hard drive speed.


Assistance for the system developers to enhance their systems: after running the benchmark on the system(s) under test, it should generate a detailed performance report for each system. This performance report should include a set of concrete items describing the environment under which the system was running and an analysis of its performance. The environment parameters include the operating system, the speed of the hard drives, etc. The performance parameters include the running time of the system, the queries it was able to answer and the time it needed to respond, and the queries it was unable to answer. The system developers and maintainers can use the report to identify the best and worst operating conditions under which their system works, e.g. the best underlying operating system. They can also identify the operations and capabilities that should be enhanced in their system.

Assistance for building more advanced benchmarks: building a benchmark can also help in building more advanced and complicated benchmarks in the future. A specific benchmark may stress and concentrate on measuring some performance aspects, e.g. the query response time, whereas it ignores other aspects, e.g. the number of concurrent users trying to access the system at the same time. A careful and extensive study of a benchmark can help in discovering the drawbacks of the benchmark itself, in order to develop more powerful and sophisticated benchmarks. As the systems under test advance and improve, we need better and more comprehensive benchmarks to evaluate their relative performances.

6.2. Dataset Generation

A crucial step in each benchmark is the generation of suitable datasets. Although we describe the dataset generation here with the example of DBpedia, the methodology we pursue is dataset-agnostic. The data generation for DBPSB is guided by the following requirements:

- The DBPSB data should resemble the original data (i.e., DBpedia data in our case) as much as possible; in particular, the large number of classes and properties, the heterogeneous property value spaces as well as the large taxonomic structures of the category system should be preserved,

- The data generation process should allow generating knowledge bases of various sizes, ranging from a few million to several hundred million or even billion triples,

- Basic network characteristics should be similar across the different dataset sizes, in particular the in- and outdegree,

- The data generation process should be easily repeatable with new versions of the considered dataset.


Dataset              Indegree      Outdegree     Indegree       Outdegree      No. of        No. of
                     w/ literals   w/ literals   w/o literals   w/o literals   nodes         triples
Full DBpedia         5.45          30.52         3.09           15.57          27 665 352    153 737 776
10% dataset (seed)   6.54          45.53         3.98           23.05          2 090 714     15 267 418
10% dataset (rand)   3.82          6.76          2.04           3.41           5 260 753     16 739 055
50% dataset (seed)   6.79          38.08         3.82           18.64          11 317 362    74 889 154
50% dataset (rand)   7.09          26.79         3.33           10.73          9 581 470     78 336 781

Table 6.1.: Statistical analysis of DBPSB datasets.

The proposed dataset creation process starts with an input dataset. In the case of DBpedia, it consists of the datasets loaded into the official SPARQL endpoint1. Datasets that are multiples of the size of the original data are created by duplicating all triples and changing their namespaces. This procedure can be applied for any scale factor. While simple, this procedure is efficient to execute and fulfills the above requirements. For generating smaller datasets, we investigated two different methods. The first method (called “rand”) consists of randomly selecting an appropriate fraction of all triples of the original dataset. If RDF graphs are considered as small-world graphs, removing edges in such graphs should preserve the properties of the original graph. The second method (called “seed”) is based on the assumption that a representative set of resources can be obtained by sampling across classes in the dataset. Let x be the desired scale factor in percent, e.g. x = 10. The method first selects x% of the classes in the dataset. For each selected class, 10% of its instances are retrieved and added to a queue. For each element of the queue, its concise bounded description (CBD) [Stickler, 2005] is retrieved. This can lead to new resources, which are appended at the end of the queue. This process is iterated until the target dataset size, measured in number of triples, is reached. Since the selection of the appropriate method for generating small datasets is an important issue, we performed a statistical analysis of the generated datasets for DBpedia. The statistical parameters used to judge the datasets are the average indegree, the average outdegree, and the number of nodes, i.e. the number of distinct IRIs in the graph. We calculated both the in- and the outdegree for the datasets once with literals ignored, and another time with literals taken into consideration, as this gives more insight into the degree of similarity between the dataset of interest and the full DBpedia dataset. The statistics of those datasets are given in Table 6.1. According to this analysis, the seed method fits our purpose of maintaining basic network characteristics better, as the average in- and outdegree of nodes are closer to those of the original dataset. For this reason, we selected this method for generating the DBPSB datasets.
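The seed method can be summarized by the following sketch. The helper functions (select_classes, instances_of, cbd) stand for SPARQL queries against the source endpoint and are illustrative assumptions rather than the original DBPSB implementation.

from collections import deque

def seed_dataset(select_classes, instances_of, cbd, scale_percent, target_triples):
    triples, seen = set(), set()
    queue = deque()
    for cls in select_classes(scale_percent):           # x% of all classes
        queue.extend(instances_of(cls, fraction=0.10))  # 10% of each class's instances
    while queue and len(triples) < target_triples:
        resource = queue.popleft()
        if resource in seen:
            continue
        seen.add(resource)
        for s, p, o in cbd(resource):                   # concise bounded description
            triples.add((s, p, o))
            if isinstance(o, str) and o.startswith("http") and o not in seen:
                queue.append(o)                         # newly discovered resources
    return triples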

1Endpoint: http://dbpedia.org/sparql, Loaded datasets: http://wiki.dbpedia.org/ DatasetsLoaded


6.3. Query Analysis and Clustering

The goal of the query analysis and clustering is to detect prototypical queries that were sent to the official DBpedia SPARQL endpoint, based on a query-similarity graph. Note that two types of similarity measures can be used on queries, i.e. string similarities and graph similarities. Yet, since graph similarities are very time-consuming and do not bear the specific mathematical characteristics necessary to compute similarity scores efficiently, we picked string similarities for our experiments. In the query analysis and clustering step, we follow a four-step approach. First, we select queries that were executed frequently on the input data source. Second, we strip common syntactic constructs (e.g., namespace prefix definitions) from these query strings in order to increase the conciseness of the query strings. Then, we compute a query similarity graph from the stripped queries. Finally, we use a soft graph clustering algorithm for computing clusters on this graph. These clusters are subsequently used to devise the query generation patterns used in the benchmark. In the following, we describe each of the four steps in more detail.

Query Selection For the DBPSB, we use the DBpedia SPARQL query log, which contains all queries posed to the official DBpedia SPARQL endpoint over a three-month period in 20102. For the generation of the current benchmark, we used the log for the period from April to July 2010. Overall, 31.5 million queries were posed to the endpoint within this period. In order to obtain a small number of distinctive queries for benchmarking triplestores, we reduce those queries in the following two ways:

- Query variations. Often, the same or slight variations of the same query are posed to the endpoint frequently. A particular cause of this is the renaming of query variables. We solve this issue by renaming all query variables in a consecutive sequence as they appear in the query, i.e., var0, var1, var2, and so on (a sketch of this normalization is given after this list). As a result, distinguishing query constructs such as REGEX or DISTINCT have a higher influence on the clustering,

- Query frequency. We discard queries with a low frequency (below 10) because they do not contribute much to the overall query performance.

The application of both methods to the query log at hand reduced the number of queries from 31.5 million to just 35,965. This reduction allows our benchmark to capture the essence of the queries posed to DBpedia within the timespan covered by the query log, and it reduces the runtime of the subsequent steps substantially.
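The variable renaming mentioned above can be sketched as follows; the regular expression and the function name are illustrative, not taken from the DBPSB code.

import re

# Rename all variables of a SPARQL query to ?var0, ?var1, ... in order of first appearance,
# so that queries differing only in variable names become identical strings.
def normalize_variables(query):
    mapping = {}
    def rename(match):
        name = match.group(1)
        if name not in mapping:
            mapping[name] = f"var{len(mapping)}"
        return "?" + mapping[name]
    return re.sub(r"[?$](\w+)", rename, query)

# normalize_variables("SELECT ?s WHERE { ?s ?p ?o }")
#   -> "SELECT ?var0 WHERE { ?var0 ?var1 ?var2 }"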

2The DBpedia SPARQL endpoint is available at: http://dbpedia.org/sparql/ and the query log excerpt at: ftp://download.openlinksw.com/support/dbpedia/.


String Stripping Every SPARQL query contains substrings that segment it into different clauses. Although these strings are essential during the evaluation of the query, they are a major source of noise when computing query similarity, as they boost the similarity score without the query patterns being similar per se. Therefore, we remove all SPARQL syntax keywords such as PREFIX, SELECT, FROM and WHERE. In addition, common prefixes (such as http://www.w3.org/2000/01/rdf-schema# for RDF Schema) are removed as they appear in most queries.
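A possible implementation of this stripping step is sketched below; the keyword and prefix lists are abbreviated examples, not the complete lists used for DBPSB.

import re

# Abbreviated lists for illustration; the benchmark strips all SPARQL keywords
# and the common namespace prefixes observed in the query log.
KEYWORDS = ["PREFIX", "SELECT", "DISTINCT", "FROM", "WHERE", "FILTER", "OPTIONAL", "UNION"]
COMMON_PREFIXES = ["http://www.w3.org/2000/01/rdf-schema#",
                   "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
                   "http://dbpedia.org/ontology/"]

def strip_query(query):
    for kw in KEYWORDS:
        query = re.sub(rf"\b{kw}\b", " ", query, flags=re.IGNORECASE)
    for prefix in COMMON_PREFIXES:
        query = query.replace(prefix, "")
    return re.sub(r"\s+", " ", query).strip()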

Similarity Computation The goal of the third step is to compute the similarity of the stripped queries. Computing the Cartesian product of the queries would lead to a quadratic runtime, i.e., almost 1.3 billion similarity computations. To reduce the runtime of the benchmark compilation, we use the LIMES framework [Ngonga Ngomo and Auer, 2011]3. The LIMES approach makes use of the interchangeability of similarities and distances. It presupposes a metric space in which the queries are expressed as single points. Instead of aiming to find all pairs of queries such that sim(q, p) ≥ θ, LIMES aims to find all pairs of queries such that d(q, p) ≤ τ, where sim is a similarity measure and d is the corresponding metric. To achieve this goal, when given a set of n queries, it first computes √n so-called exemplars, which are prototypical points in the affine space that subdivide it into regions of high heterogeneity. Then, each query is mapped to the exemplar it is least distant to. The characteristics of metric spaces (especially the triangle inequality) ensure that the distance from each query q to any other query p obeys the following inequality

d(q, e) − d(e, p) ≤ d(q, p) ≤ d(q, e) + d(e, p),    (6.1)

where e is an exemplar and d is a metric. Consequently,

d(q, e) − d(e, p) > τ ⇒ d(q, p) > τ. (6.2)

Given that d(q, e) is constant, q must only be compared to the elements of the list of queries mapped to e that fulfill the inequality above. By these means, the number of similarity computations can be reduced significantly. In this particular use case, we cut down the number of computations to only 16.6% of the Cartesian product without any loss in recall. For the current version of the benchmark, we used the Levenshtein string similarity measure and a threshold of 0.9.
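The following sketch illustrates the exemplar-based filtering idea on which LIMES relies; it is not the LIMES implementation itself and it simplifies the approach to comparisons within each exemplar's region. Only query pairs that cannot be excluded by the triangle inequality are compared with the expensive distance function.

import math
import random

# Sketch of exemplar-based filtering using the triangle inequality (not the LIMES code).
# dist(a, b) must be a metric, e.g. a normalized Levenshtein distance.
def similar_pairs(queries, dist, tau):
    exemplars = random.sample(queries, max(1, int(math.sqrt(len(queries)))))
    # map every query to its closest exemplar, remembering the distance
    buckets = {e: [] for e in exemplars}
    for q in queries:
        e = min(exemplars, key=lambda ex: dist(q, ex))
        buckets[e].append((q, dist(q, e)))
    pairs = []
    for e, members in buckets.items():
        for i, (q, dq) in enumerate(members):
            for p, dp in members[i + 1:]:
                if abs(dq - dp) > tau:      # triangle inequality: d(q,p) >= |d(q,e) - d(e,p)|
                    continue                # pair can be skipped without computing dist(q, p)
                if dist(q, p) <= tau:
                    pairs.append((q, p))
    return pairs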

Clustering The final step of our approach is to apply graph clustering to the query similarity graph computed above. The goal of this step is to discover groups of very similar queries, out of which prototypical queries can be generated. As a given query can obey the patterns of more than one prototypical query, we opted for the soft clustering approach implemented by the BorderFlow algorithm4.

3Available online at: http://limes.sf.net 4An implementation of the algorithm can be found at http://borderflow.sf.net


BorderFlow [Ngonga Ngomo and Schumacher, 2009] implements a seed-based approach to graph clustering. The default setting for the seeds consists of taking all nodes in the input graph as seeds. For each seed v, the algorithm begins with an initial cluster X containing only v. Then, it expands X iteratively by adding nodes from the direct neighborhood of X until X is node-maximal with respect to a function called the border flow ratio. The same procedure is repeated over all seeds. As different seeds can lead to the same cluster, identical clusters (i.e., clusters containing exactly the same nodes) that resulted from different seeds are subsequently collapsed into one cluster. The set of collapsed clusters and the mapping between each cluster and its seeds are returned as the result. Applying BorderFlow to the input queries led to 12,272 clusters, of which 24% contained only one node, hinting at a long-tail distribution of query types. To generate the patterns used in the benchmark, we only considered clusters of size 5 and above.

6.4. SPARQL Feature Selection and Query Variability

After the detection and clustering of similar queries is complete, our aim is to select a number of frequently executed queries that cover most SPARQL features and allow us to assess the performance of queries with single features as well as combinations of features. The SPARQL features we consider are:

- the overall number of triple patterns contained in the query (|GP|),

- the graph pattern constructors UNION (UON) and OPTIONAL (OPT),

- the solution sequences and modifiers DISTINCT (DST),

- as well as the filter conditions and operators FILTER (FLT), LANG (LNG), REGEX (REG) and STR (STR).

We pick different numbers of triple patterns in order to include the efficiency of JOIN operations in triplestores. The other features were selected because they frequently occurred in the query log. We rank the clusters by the sum of the frequencies of all queries they contain. Thereafter, we select 25 queries as follows: for each of the features, we choose the highest ranked cluster containing queries having this feature. From that particular cluster we select the query with the highest frequency. In order to convert the selected queries into query templates, we manually select a part of the query to be varied. This is usually an IRI, a literal or a filter condition. In Figure 6.1 those varying parts are indicated by %%var%% or, in the case of multiple varying parts, %%varn%%. We exemplify our approach to replacing varying parts of queries by using Query 9, which results in the query shown in Figure 6.1. This query selects a specific settlement along with the airport belonging to that settlement, as indicated in Figure 6.1.


SELECT * WHERE {
    { ?var0 a dbp-owl:Settlement ;
            rdfs:label %%var%% .
      ?var1 a dbp-owl:Airport . }
    { ?var1 dbp-owl:city ?var0 . }
    UNION
    { ?var1 dbp-owl:location ?var0 . }
    { ?var1 dbp-prop:iata ?var2 . }
    UNION
    { ?var1 dbp-owl:iataLocationIdentifier ?var2 . }
    OPTIONAL { ?var1 foaf:homepage ?var3 . }
    OPTIONAL { ?var1 dbp-prop:nativename ?var4 . }
}

Figure 6.1.: Sample query with placeholder.

SELECT DISTINCT ?var WHERE {
    { ?var0 a dbp-owl:Settlement ;
            rdfs:label ?var .
      ?var1 a dbp-owl:Airport . }
    { ?var1 dbp-owl:city ?var0 . }
    UNION
    { ?var1 dbp-owl:location ?var0 . }
    { ?var1 dbp-prop:iata ?var2 . }
    UNION
    { ?var1 dbp-owl:iataLocationIdentifier ?var2 . }
    OPTIONAL { ?var1 foaf:homepage ?var3 . }
    OPTIONAL { ?var1 dbp-prop:nativename ?var4 . }
} LIMIT 1000

Figure 6.2.: Sample auxiliary query returning potential values a placeholder can assume.

The variability of this query template was determined by obtaining a list of all settlements using the query shown in Figure 6.2. By selecting suitable placeholders, we ensured that the variability is sufficiently high (≥ 1000 values per query template). Note that the triplestore used for computing the variability was different from the triplestores that we later benchmarked, in order to avoid potential caching effects. For the benchmarking we then used the list of retrieved concrete values to replace the %%var%% placeholders within the query template. This method ensures that (a) the actually executed queries differ during the benchmarking, but (b) always return results. This variation of the original query avoids the effect of simple caching.
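The template instantiation itself is straightforward and can be sketched as follows; the function name and the random selection strategy are illustrative assumptions.

import random

# Fill the %%var%% / %%var1%% ... placeholders of a query template with concrete
# values obtained beforehand via an auxiliary query (cf. Figure 6.2).
def instantiate(template, values_per_placeholder):
    query = template
    for placeholder, values in values_per_placeholder.items():
        query = query.replace(placeholder, random.choice(values))
    return query

# Example (hypothetical values):
# instantiate(query_template, {"%%var%%": ['"Berlin"@en', '"Leipzig"@en']})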

6.5. Experimental Setup

This section presents the setup we used when applying the DBPSB to four triplestores commonly used in Data Web applications. We first describe the

71 6. DBpedia SPARQL Benchmark

triplestores and their configuration, followed by our experimental strategy and finally the obtained results. All experiments were conducted on a typical server machine with an AMD Opteron 6-core CPU at 2.8 GHz, 32 GB RAM and a 3 TB RAID-5 HDD, running Linux kernel 2.6.35-23-server with Java 1.6 installed. The benchmark program and the triplestore were run on the same machine to avoid network latency.

Triplestores Setup We carried out our experiments by using the triplestores Virtuoso [Erling and Mikhailov, 2007], Sesame [Broekstra et al., 2002], Jena-TDB [Owens et al., 2008b], and BigOWLIM [Bishop et al., 2011]. The configuration and the version of each triplestore were as follows:

1. Virtuoso: Open-Source Edition version 6.1.2: We set the following memory-related parameters: NumberOfBuffers = 1048576, MaxDirtyBuffers = 786432.

2. Sesame: Version 2.3.2 with Tomcat 6.0 as HTTP interface: We used the native storage layout and set the spoc, posc, opsc indices in the native storage configuration. We set the Java heap size to 8GB.

3. Jena-TDB: Version 0.8.7 with Joseki 3.4.3 as HTTP interface: We configured the TDB optimizer to use statistics. This mode is most commonly employed for the TDB optimizer, whereas the other modes are mainly used for investigating the optimizer strategy. We also set the Java heap size to 8GB.

4. BigOWLIM: Version 3.4, with Tomcat 6.0 as HTTP interface: We set the entity index size to 45,000,000 and enabled the predicate list. The rule set was empty. We set the Java heap size to 8GB.

In summary, we configured all triplestores to use 8GB of memory and used default values otherwise. This strategy aims on the one hand at benchmarking each triplestore in a realistic context, as in a real environment a triplestore typically cannot use all of the available memory. On the other hand, it ensures that the whole dataset cannot fit into memory, in order to avoid caching effects.

6.5.1. Benchmark Phases Once the triplestores had loaded the DBpedia datasets with the different scale factors, the benchmark execution phase began. It comprised the following stages:

1. System Restart: Before running the experiment, the triplestore and its associated programs were restarted in order to clear memory caches.

2. Warm-up Phase: In order to measure the performance of a triplestore under normal operational conditions, a warm-up phase was used. In the warm-up phase, query mixes were posed to the triplestore. The queries posed


Figure 6.3.: QMpH for all triplestores of DBPSB version 1.

during the warm-up phase were disjoint with the queries posed in the hot-run phase. For DBPSB, we used a warm-up period of 20 minutes.

3. Hot-run Phase: During this phase, the benchmark query mixes were sent to the tested store. We kept track of the average execution time of each query as well as the number of query mixes per hour (QMpH). The duration of the hot-run phase in DBPSB was 60 minutes.

Since some benchmark queries did not respond within reasonable time, we specified a 180-second timeout after which a query was aborted, and the 180-second maximum query time was used as the runtime for the given query even though no results were returned. The benchmarking code along with the DBPSB queries is freely available5.
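To make the phases concrete, the following sketch shows a possible benchmark driver loop; the run_query helper, the durations and the timeout handling are illustrative assumptions rather than the actual DBPSB implementation.

import time

# Sketch of the warm-up / hot-run phases with a per-query timeout.
# run_query(q, timeout) is assumed to execute q against the SPARQL endpoint and
# raise TimeoutError when it exceeds the timeout.
def run_phase(query_mixes, run_query, duration_s, timeout_s=180):
    runtimes, mixes = {}, 0
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        for query in query_mixes[mixes % len(query_mixes)]:
            t0 = time.monotonic()
            try:
                run_query(query, timeout=timeout_s)
                elapsed = time.monotonic() - t0
            except TimeoutError:
                elapsed = timeout_s            # aborted queries count with the maximum time
            runtimes.setdefault(query, []).append(elapsed)
        mixes += 1
    return runtimes, mixes

# warm-up (results discarded), then hot run:
# _, _ = run_phase(warmup_mixes, run_query, duration_s=20 * 60)
# runtimes, mixes = run_phase(benchmark_mixes, run_query, duration_s=60 * 60)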

We have created two versions of DBPSB, with the following specifications:

• DBPSB version 1: we used 4 different dataset sizes, i.e. 10%, 50%, 100%, and 200%, and 25 queries for performance evaluation. The query list of DBPSB version 1 can be found in Appendix A.

• DBPSB version 2: we used 3 different dataset sizes, i.e. 10%, 50%, and 100%, and 20 queries for performance evaluation. The query list of DBPSB version 2 can also be found in Appendix A.

6.6. Benchmarking Results

We evaluated the performance of the triplestores with respect to two main metrics: their overall performance on the benchmark and their query-based performance. The overall performance of a triplestore was measured by computing its query mixes per hour (QMpH). The metric used for query-based performance evaluation is Queries per Second (QpS). QpS is computed by summing up the runtime of each query in each iteration, dividing it by the QMpH value, and scaling it to seconds.
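As a minimal illustration (not the benchmark's actual implementation), QMpH simply scales the number of query mixes completed during the hot-run phase to one hour; the numbers in the example are placeholders:

def qmph(completed_mixes, hot_run_seconds):
    """Query Mixes per Hour achieved during the hot-run phase."""
    return completed_mixes * 3600.0 / hot_run_seconds

# Placeholder example: 42 query mixes completed within the 60-minute hot-run phase.
print(qmph(42, 60 * 60))  # 42.0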

5https://akswbenchmark.svn.sourceforge.net/svnroot/akswbenchmark/


Figure 6.4.: Geometric mean of QpS of DBPSB version 1.

Figure 6.5.: QMpH for all triplestores of DBPSB version 2 (QMpH on a logarithmic scale over the dataset sizes).

Figure 6.6.: Geometric mean of QpS of DBPSB version 2.


Figure 6.7.: Queries per Second (QpS) of DBPSB version 1 for all triplestores for the 10%, 50%, 100%, and 200% datasets (one panel per dataset size, showing QpS per query for Virtuoso, Sesame, Jena-TDB, and BigOWLIM).


Figure 6.8.: Queries per Second (QpS) of DBPSB version 2 for all triplestores for the 10%, 50%, and 100% datasets (one panel per dataset size, showing QpS per query for Virtuoso, Sesame, Jena-TDB, and BigOWLIM).

6.6.1. DBPSB Version 1

The query mixes per hour (QMpH) of DBPSB version 1 are shown in Figure 6.3. Please note that we used a logarithmic scale in this figure due to the high performance differences we observed. In general, Virtuoso was clearly the fastest triplestore, followed by BigOWLIM, Sesame, and Jena-TDB. The highest observed ratio in QMpH between the fastest and slowest triplestore was 63.5, and it reached more than 10,000 for single queries. The scalability of the stores did not vary as much as their overall performance: there was on average a linear decline in query performance with increasing dataset size. We examined the queries that each triplestore failed to execute within the 180-second timeout and noticed that even much larger timeouts would not have been sufficient for most of those queries. We did not exclude these queries completely from the overall assessment, since this would have affected a large number of the queries and adversely penalized the stores that do complete them within the time frame.

We penalized failed queries with 180 seconds, similar to what was done in SP2Bench [Schmidt et al., 2009]. Virtuoso was the only store that completed all queries in time. For Sesame and BigOWLIM, only a few particular queries timed out, and only rarely. Jena-TDB always had severe problems with queries 7, 10, and 20, as well as with 3, 9, and 12 for the larger two datasets. The QpS results for all triplestores and for the 10%, 50%, 100%, and 200% datasets are depicted in Figure 6.7. Outliers, i.e. queries with very low QpS, significantly affect the mean QpS value of each store. We therefore additionally calculated the geometric mean of all QpS timings of the queries for each store; the main advantage of the geometric mean is that the effect of outliers is weakened. The geometric mean for all triplestores is depicted in Figure 6.4. Detailed results of DBPSB version 1 are given in Tables 6.2 and 6.3.
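A small sketch of the geometric mean used here; the QpS values below are placeholders, not measured results, and merely illustrate how a single outlier affects the two means differently:

import math

def geometric_mean(values):
    """Geometric mean, which dampens the influence of extreme outliers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

qps = [250.0, 120.0, 80.0, 0.1]   # placeholder QpS values with one outlier
print(sum(qps) / len(qps))        # arithmetic mean: about 112.5
print(geometric_mean(qps))        # geometric mean: about 22.1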

6.6.2. DBPSB1 Results Discussion

This section consists of three parts: first, we compare the general performance of the systems under test; then we look at individual queries and the SPARQL features used within those queries in more detail, to observe particular strengths and weaknesses of the stores; thereafter, we compare our results with those obtained with previous benchmarks and elucidate some of the main differences between them.

General Performance Figure 6.3 depicts the benchmark results for query mixes per hour for the four systems and all dataset sizes. Virtuoso leads the field with a substantial head start of double the performance for the 10% dataset (and even quadruple for the other dataset sizes) compared to the second best system (BigOWLIM). While Sesame is able to keep up with BigOWLIM for the smaller two datasets, it considerably loses ground for the larger datasets. Jena-TDB can in general not deliver competitive performance, being a factor of 30-50 slower than the fastest system. If we look at the geometric mean of all QpS results in Figure 6.4, we observe similar results. The spread is reduced, since the geometric mean weakens the effect of outliers. Still, Virtuoso is the fastest system, although Sesame manages to get quite close for the 10% dataset. This shows that most, but not all, queries are fast in Sesame for low dataset sizes. For the larger datasets, BigOWLIM is the second best system and shows promising scalability, but it is still a factor of two slower than Virtuoso.


Virtuoso Sesame Jena-TDB BigOWLIM Query QpS SD GM QpS SD GM QpS SD GM QpS SD GM 1 261.6 45.3 250.1 466.3 136.2 428.8 330.4 155.5 258.9 63 8.9 61.9 2 450.9 59 445.6 427.7 15.5 427.4 255.1 80.4 236.5 64.7 4.1 64.4 3 82.8 16.2 81.3 348.3 97.3 320.7 1.4 1.9 0.6 55.3 14.8 52.4 4 138.1 48.9 122.6 10 60 0.2 71.3 61.1 52.7 20.6 21.4 11.6 5 67.7 10.9 67 287.9 65.3 269.6 116.1 70.5 93.3 46.6 17.5 42.1 6 60.5 17.9 58 49.4 5.5 48.2 82.5 58.1 65.2 19.2 5.1 18.5 7 28.5 8.5 26.7 207.1 79 183.7 1.6 2.5 0.5 26.5 13.9 22.7 8 52.8 67.4 24.4 65.7 112.9 23.6 134 75.1 108 18 21.9 9.1 9 22.9 3.9 22.7 226.9 86.8 197.8 0.6 0.6 0.4 48.9 9.1 47.4 10 8.1 0.4 8.1 1.4 0.4 1.4 0.1 0.04 0.1 2.8 0.1 2.8 11 176 36.2 171 289.7 80 265.8 125.3 67 104.9 51.5 12.6 49.3 12 124.8 20.2 123.1 309.9 118 264.8 1 3 0.1 59.6 12.3 57.2 13 129.3 16.5 128.3 367.2 101.4 337.3 190.1 77.3 157.9 46.7 16.4 43 14 83.1 29.8 74.1 179.2 134.6 116 96.2 69 74 25.8 20.5 16.8 15 128.1 67.6 90.3 162.6 148.5 72.6 97.3 100.2 43.9 43.1 23.9 28.4 16 121.5 23.6 118.7 249.9 67.6 236.4 28.8 30.5 19.9 39.8 15.7 35.5 17 102.2 29.8 95.7 186.2 109.5 135.8 115.6 63.6 97.6 42.8 18.9 36.1 18 182.8 33.1 179.1 0.5 0.1 0.5 178.3 52.6 168.1 23.3 15.5 18.3 19 199.8 47.9 191.1 302.5 69.9 286.7 200.8 88.9 174.5 62.2 4.7 61.9 20 18.9 4.1 18.6 221 63.3 203.7 0.1 0.3 0.02 50.6 8.1 49.5 21 483 48 480.6 459.4 16.7 459.1 289.6 92.5 224.9 66.1 1.8 66.1 22 206.1 60.6 190.2 241.4 173.8 140.6 38 86 8.4 52.8 19.3 44.7 23 140.3 27.4 137.1 354.7 116.9 315.3 23.9 39.7 12.6 64.1 6.1 63.6 24 1.2 0.7 0.8 32.4 100.2 3.7 173 88.1 146.4 0.3 0.2 0.2 25 62.8 19 59.2 259.8 120.6 209.5 110.5 99.8 68 52.3 12.5 49.7

1 264.5 76.1 242 153.5 136.3 116.9 64.8 13.6 63.2 56.7 10.1 55.5 2 22.4 3.2 22.2 137 98 109.9 87.6 87.9 68.6 27.7 12.7 25.6 3 55.4 19.7 51.8 81.1 92 52.6 0.1 0.1 0.1 29.4 18.9 23.3 4 44 62.7 17.6 0 0 0 41.7 19.5 38.4 3.9 3.3 3 5 60.3 14 58 46.6 70 24.8 40.2 32 30.6 21.5 15.4 16.9 6 14 7.3 12.3 6.2 2 5 162.5 69.8 141.3 5 2.3 4.3 7 23.1 7.9 21.8 62.2 52.3 42.3 0.2 0.5 0.02 10.7 5.8 9.2 8 32.5 46.9 5 15.6 39.4 2.7 88.2 67.3 67.8 8.2 13.9 1.7 9 20.4 5.3 19.9 42.3 35.3 29.1 2.3 3.8 0.8 27.2 15.1 22.4 10 1 0.1 1 0.2 0 0.2 4.3 3 1 0.28 0.02 0.28 11 97.7 56.9 73.9 45.2 48.1 36.3 37 7.1 36.3 27 10.1 24 12 62.3 22.2 57.6 56.3 58.2 42.4 0.02 0.02 0.02 34.1 18.9 28.7 13 105.1 52.1 91.7 97.5 60.4 86.8 105 94.3 44.9 19.9 10.4 17.9 14 53.9 40 38 20.4 17.6 12.3 42.8 13.2 40.9 7.9 10.6 3.3 15 51.9 32.7 33.4 33.5 66.1 11.5 43.1 38.9 32.8 8.6 13.7 2.2 16 106.7 25.4 103.4 87.7 75.4 64.8 73.2 48.3 63 38.7 12.9 36.7 17 33 11.2 30.6 20.9 19.9 13.3 31.9 9.5 30.5 10.3 8.5 6 18 203.9 57.9 190.4 0.1 0 0.1 57.6 9.6 56.9 29 12.7 26.4 19 106.6 53.6 93.5 46.1 38.1 39.7 26.8 10.2 25.3 33.7 16.9 29.9 20 15 4 14.5 37.7 30.6 28.9 0.01 0 0.01 20.5 16.2 15.2 21 189.1 138.3 135 105.6 87.3 84.3 50 16.2 46.4 28.1 13.4 25.7 22 109.1 43.9 90.3 29.5 41.4 14 1.2 1.5 0.7 15.7 15.9 6.1 23 81.2 31.4 74.4 78.3 97.4 47 1.5 2.4 0.4 32.4 18.2 27.9 24 0.2 0.1 0.2 0.9 0.4 0.7 53.7 39.3 43.6 0.06 0.05 0.04 25 35.6 10.9 33.3 45.2 57.4 26.4 37.1 16 32.4 11.5 7.6 8.6

Table 6.2.: Queries per second (QpS), geometric mean of query runtime in milliseconds (GM), and standard deviation of query runtime in milliseconds (SD), for the 10% and 50% datasets of DBPSB version 1, respectively.


Virtuoso Sesame Jena-TDB BigOWLIM Query QpS SD GM QpS SD GM QpS SD GM QpS SD GM 1 245.9 30.9 240.9 112.7 47.1 103.9 54.5 5.9 54.2 58.3 6.1 57.9 2 3.6 0.1 3.6 81.1 45.9 69.8 67.1 40 60 31 11.8 28.7 3 42.8 21.8 37.8 32.1 14.5 28.9 0.02 0.03 0.01 23.9 7.7 22.8 4 34 50.2 14.4 0.03 0.01 0.02 4.6 9.1 0.3 29.4 22.4 18.8 5 47.9 19.7 41.7 10.7 10.6 7.6 12.5 4.9 11.5 16.1 14.5 11.1 6 8.6 2.3 8.3 4.2 2.4 2.8 45.5 46.9 9.6 4.9 2.3 4.1 7 21 12.1 18.2 21.5 37.4 13.8 0.04 0.03 0.03 7.4 4.5 6.1 8 38.3 47.7 4.8 17.8 33.7 2.7 27.1 36.7 1.3 8.3 14.5 1 9 17.8 5.7 17.1 23.3 15.9 19.5 0.01 0 0.01 15.9 6.3 14.9 10 1 0.1 0.9 0.1 0 0.1 0.01 0 0.01 0.16 0.01 0.16 11 115.7 31.7 100.2 48.5 41.9 39.4 2.9 5.7 0.2 36.2 14.4 33.2 12 47.1 25.4 41.7 25.4 12.9 21.7 0.01 0 0.01 24.6 9.7 22.9 13 89.5 73.6 64.6 43.1 41.4 28.6 0.2 0.2 0.1 38.1 17.9 33.6 14 25.1 37.3 6 3.4 5.2 1 26.3 15.7 23 28.1 26.8 8.5 15 48.4 33.2 31.1 3.1 4.7 1.3 0.04 0.04 0.02 7.9 12.2 2.5 16 137.3 14.4 136.7 98.8 49.4 89.7 29.3 30.8 6.2 48.5 10.4 47.1 17 32.8 9.3 31.2 6.3 6.5 2.7 21.7 8.1 20.3 21.5 14.8 12.3 18 208.3 27.1 205.6 0.1 0 0 44.8 34.8 9.2 47 8.3 46.1 19 99.2 67.3 81.8 40.3 20.6 35.8 46.5 24.6 40.9 34.9 7 34.2 20 14.1 4.1 13.5 8.2 6.6 6.8 0.01 0 0.01 15.1 8.3 13.5 21 98.8 76.2 78.3 85.7 69 69.2 0.1 0.1 0.1 31.3 10.5 29.7 22 115.1 52.9 93.6 5.5 8.5 2 9.5 13.3 0.6 9.5 10.6 4.6 23 76.7 38.9 67.8 41 17.8 37.6 20.2 34.4 0.8 29.1 8.4 28.2 24 0.2 0.1 0.1 0.7 0.4 0.4 17.4 15 3.7 0.05 0.05 0.03 25 30.1 11 27.1 16.5 14.5 10.3 18.1 22.5 1.2 14.3 9.4 11

1 247.1 44.4 229.9 93.2 39.2 84.4 226.9 465.5 36.6 54.5 13.4 47.7 2 1.8 0.1 1.8 45.7 27.4 39.9 17.6 6.4 16.6 26.8 11.2 24.5 3 42.8 21.1 36.6 19.8 9.1 17.7 0.03 0.02 0.02 18.1 8.2 16.7 4 26.8 46.1 12.1 0 0 0 45.9 33.6 34.4 6.8 9 4.1 5 52.7 19.8 47.4 5.5 2.7 4.7 5.7 4.3 3 11.8 10.4 8.7 6 9.4 6.1 7.6 2.4 2 1.4 34.6 47 15.1 2.9 1.3 2.6 7 14.3 9.3 11.7 7.1 2.4 6.6 0.01 0 0.01 8.7 5.3 7 8 33.6 42.9 2.8 3.8 6.7 0.4 21.8 19.9 13.9 8.8 15.7 1 9 16.1 6 15.1 9.9 3.2 9.5 0.02 0.04 0.01 14.5 5.1 13.6 10 0.5 0 0.5 0.1 0 0 0.01 0 0.01 0.16 0.02 0.16 11 114.7 57.5 73 34.3 17.2 30.5 15.9 2.7 15.6 21 7.9 19.2 12 43.8 31.8 36.1 18 8.1 15.7 0.01 0 0.01 21.1 5.4 20.4 13 64.7 51.8 48.6 21.7 13.2 18.5 13.6 9.5 11.1 20.1 10.3 18.2 14 23.9 29.5 11.4 1.7 2.8 0.4 7.2 2.9 6.5 6.3 6.5 3.4 15 65.1 51.5 36.9 3.3 4.5 1.4 1.8 1.6 1.2 3.8 8.8 0.8 16 191.2 24.1 189.8 83.6 40.3 74.9 11.1 3.1 10.6 45.4 10.6 44 17 24.1 12.2 19.4 1.6 1.8 0.8 6.9 1.7 6.8 9.1 6.7 6.8 18 212 35.8 207.3 0 0 0 19.7 7.9 17.5 44 8.3 43.1 19 84.8 62.5 69 15.9 4.3 15.3 22.4 13 19.6 34 8.9 32.7 20 13.6 4.4 12.9 8.6 17.5 4.5 0.01 0 0.01 13.1 8.2 11.5 21 93.8 77 73.7 53.6 20.8 50 15.9 4.9 15.2 26.5 6.3 25.7 22 120 80.5 74.3 20.7 66.2 2.1 0.6 0.6 0.4 11.1 11.5 5.9 23 67.6 33.3 60.1 30.3 16.7 26.9 9.2 5.9 6.9 24.1 5.5 23.6 24 0.1 0 0.1 0.3 0.2 0.2 8.3 4.8 6.4 0.05 0.05 0.02 25 23.5 15.2 19 8.1 7.8 5.2 9.2 1.6 9 11.6 9.5 8.6

Table 6.3.: Queries per second (QpS), geometric mean of query runtime in milliseconds (GM), and standard deviation of query runtime in milliseconds (SD), for the 100% and 200% datasets of DBPSB version 1, respectively.


Scalability, Individual Queries and SPARQL Features Our first observation with respect to the individual performance of the triplestores is that Virtuoso demonstrates a good scaling factor on the DBPSB. When the dataset size grows by a factor of 5 (from 10% to 50%), the performance of the triplestore only degrades by a factor of 3.12. Further dataset increases (i.e. the doubling to the 100% and 200% datasets) result in only relatively small performance decreases of 20% and 30%, respectively. Virtuoso outperforms Sesame for all datasets. In addition, Sesame does not scale as well as Virtuoso for small dataset sizes, as its performance degrades sevenfold when the dataset size changes from 10% to 50%. However, when the dataset size doubles from the 50% to the 100% dataset, and again from 100% to 200%, its performance only halves. The performance of Jena-TDB is the lowest of all triplestores for all dataset sizes. The performance degradation factor of Jena-TDB is not as pronounced as that of Sesame and almost equal to that of Virtuoso when changing from the 10% to the 50% dataset. However, the performance of Jena-TDB only degrades by a factor of 2 for the transition between the 50% and 100% datasets, and the factor reaches 0.8 between the 100% and 200% datasets, leading to a slight increase of its QMpH. BigOWLIM is the second fastest triplestore for all dataset sizes, after Virtuoso. BigOWLIM degrades by a factor of 7.2 in the transition from the 10% to the 50% dataset, but this factor decreases dramatically to 1.29 for the 100% dataset and eventually reaches 1.26 for the 200% dataset. Due to the high diversity in the performance of different SPARQL queries, we also computed the geometric mean of the QpS values of all queries, as described in the previous section and illustrated in Figure 6.4. By using the geometric mean, the resulting values are less prone to be dominated by a few outliers (slow queries) compared to standard QMpH values. This allows for some interesting observations in DBPSB by comparing Figures 6.3 and 6.4. For instance, it is evident that Virtuoso has the best QpS values for all dataset sizes. With respect to Virtuoso, query 10 performs quite poorly. This query involves the features FILTER, DISTINCT, as well as OPTIONAL. Also, the well-performing query 1 involves the DISTINCT feature. Query 3 involves an OPTIONAL, resulting in worse performance. Query 2, involving a FILTER condition, results in the worst performance of all of them. This indicates that using complex FILTER conditions in conjunction with additional OPTIONAL and DISTINCT features adversely affects the overall runtime of a query. Regarding Sesame, queries 4 and 18 are the slowest queries. Query 4 includes UNION along with several free variables, which indicates that using UNION with several free variables causes problems for Sesame. Query 18 involves the features UNION, FILTER, STR and LANG. Query 15 involves the features UNION, FILTER, and LANG, and its performance is also rather slow, which leads to the conclusion that this combination of features is difficult for Sesame. Adding the STR feature to that feature combination affects the performance dramatically and prevents the query from being successfully executed.
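The degradation factors quoted above follow directly from the QMpH values of two consecutive dataset sizes; a minimal sketch with placeholder values rather than the measured ones:

def degradation_factor(qmph_smaller_dataset, qmph_larger_dataset):
    """How strongly QMpH drops when moving to a larger dataset (1.0 = no slowdown)."""
    return qmph_smaller_dataset / qmph_larger_dataset

# Placeholder QMpH values for the 10% and 50% datasets of a single store.
print(round(degradation_factor(3120.0, 1000.0), 2))  # 3.12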


Figure 6.9.: Comparison of triple store scalability between BSBM V2, BSBM V3, and DBPSB (relative performance of the stores over the respective dataset sizes: 1M-100M triples for BSBM V2, 100M-200M triples for BSBM V3, and the 10%-200% DBpedia datasets for DBPSB).

For Jena-TDB, there are several queries that time out for large dataset sizes, but queries 10 and 20 always time out. The problem with query 10 was already discussed for Virtuoso. Query 20 contains FILTER, OPTIONAL, UNION, and LANG. Query 2 contains FILTER only, query 3 contains OPTIONAL, and query 4 contains UNION only. All of those queries run smoothly on Jena-TDB, which indicates that using the LANG feature together with those features affects the runtime dramatically. For BigOWLIM, queries 10 and 15 are slow. Query 10 was already problematic for Virtuoso, as was query 15 for Sesame. Query 24 is slow on Virtuoso, Sesame, and BigOWLIM, whereas it is faster on Jena-TDB. This is due to the fact that most of the time this query returns many results: Virtuoso and BigOWLIM return all results in one bulk, which takes a long time, whereas Jena-TDB just returns the first result as a starting point and iteratively returns the remaining results via a buffer. It is interesting to note that BigOWLIM shows good performance in general, but almost never manages to outperform any of the other stores. Queries 11, 13, 19, 21 and 25 performed with relatively similar results across the triplestores, indicating that the features of these queries (i.e. UNION, REGEX, and FILTER) are already relatively well supported. With queries 3, 4, 7, 9, 12, 18, and 20 we observed dramatic differences between the different implementations, with factors between the slowest and fastest store being higher than 1000. It seems that a reason for this could be the poor support for OPTIONAL (in queries 3, 7, 9, and 20), as well as for certain filter conditions such as LANG, in some implementations, which demonstrates the need for further optimizations.


Comparison with Previous Benchmarks In order to visualize the performance improvement or degradation of a certain triplestore compared to its competitors, we calculated the relative performance of each store compared to the average and depicted it for each dataset size in Figure 6.9. We also performed this calculation for BSBM version 2 and version 3. Overall, the benchmarking results with DBPSB were less homogeneous than the results of previous benchmarks. While with other benchmarks the ratio between the fastest and slowest query rarely exceeds a factor of 50, the factor for the DBPSB queries (derived from real DBpedia SPARQL endpoint queries) reaches more than 1,000 in some cases. As with the other benchmarks, Virtuoso was also the fastest store in our measurements. However, the performance difference is even higher than reported previously: Virtuoso reaches a factor of 3 in our benchmark compared to 1.8 in BSBM V3. BSBM V2 and our benchmark both show that Sesame is more suited to smaller datasets and does not scale as well as other stores. Jena-TDB is the slowest store in both BSBM V3 and DBPSB, but in our case it falls much further behind, to the point that Jena-TDB can hardly be used for some of the queries that are posed to DBpedia. The main observation in our benchmark is that previously observed differences in performance between different triplestores are amplified when the stores are confronted with actually asked SPARQL queries, i.e. there is now a wider gap in performance compared to the essentially relational benchmarks.
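One natural reading of the relative-performance computation behind Figure 6.9 is to divide each store's QMpH by the average QMpH of all stores for the same dataset size; a sketch with placeholder values:

def relative_performance(qmph_per_store):
    """Each store's QMpH relative to the average of all stores (1.0 = exactly average)."""
    average = sum(qmph_per_store.values()) / len(qmph_per_store)
    return {store: value / average for store, value in qmph_per_store.items()}

# Placeholder QMpH values for one dataset size.
print(relative_performance({"Virtuoso": 3000.0, "Sesame": 800.0,
                            "Jena-TDB": 100.0, "BigOWLIM": 1100.0}))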

6.6.3. DBPSB Version 2

The query mixes per hour (QMpH) of DBPSB version 2 are shown in Figure 6.5. Note that we used a logarithmic scale in this figure due to the high performance differences we observed. In general, Virtuoso was clearly the fastest triplestore, followed by BigOWLIM, Sesame and Jena-TDB. The highest observed ratio in QMpH between the fastest and slowest triplestore was 3.2, and it reached more than 1,000 for single queries. The scalability of the stores did not vary as much as the overall performance; there was on average a linear decline in query performance with increasing dataset size. The QpS results for all triplestores and for the 10%, 50% and 100% datasets are depicted in Figure 6.8. As the outliers (i.e. queries with very low QpS) affect the mean QpS value of each store significantly, we also computed the geometric mean of all the QpS timings of the queries for each store. The geometric mean for all triplestores is depicted in Figure 6.6. Although Virtuoso performed best overall, it displayed very low QpS rates on Q3, Q5 and Q16. All of these queries require dealing extensively with literals (in contrast to resources). Especially Q16, which combines four different SPARQL features (OPTIONAL, FILTER, LANG and DISTINCT), seemed to require a significant amount of processing time. BigOWLIM was mainly characterized by good scalability, as it achieves the slowest decrease of its QMpH rates over all the datasets. Still, some queries were also particularly difficult to process for BigOWLIM: especially Q16 and Q12, which involve three and four SPARQL features respectively and require a lot of string manipulation, were slow to run. Sesame dealt well with most queries for the 10% dataset. The QMpH that it could achieve diminishes significantly with the size of the dataset, though. This behavior becomes especially obvious when looking at Q4, Q10 and Q12. Especially Q4, which combines several triple patterns through a UNION, leads to a considerable decrease of the runtime. Jena-TDB had the most difficulties dealing with the 100% dataset. This can be observed especially on Q9, which contained four triple patterns that might have led to large intermediary results. Especially in the case of Jena-TDB, we observed that the 8GB of RAM were not always sufficient for storing the intermediary results, which led to swapping and a considerable reduction of the overall performance of the system. Detailed results of DBPSB version 2 are given in Tables 6.4 and 6.5.


Virtuoso Sesame Jena-TDB BigOWLIM Query QpS SD GM QpS SD GM QpS SD GM QpS SD GM 1 281.8 32.1 280.2 169 9.6 168.4 353.6 161.1 298.4 68.7 8.3 68 2 541.2 58.2 538.1 174 5.4 173.5 445.4 186.1 395.3 71.4 1.3 71.4 3 77.6 14.6 76.1 150.1 9.9 149.2 5.3 29.5 0.5 67.2 3.6 67 4 207.4 26.2 206.1 150.7 20 148 278.4 124.7 237.7 66.3 7.9 65.6 5 129.2 22 127 134.6 25.4 130 139.6 102.6 103.5 63.9 7 63.3 6 503 56.2 499.9 164.3 7 164 356 128.3 326.3 69.8 2.7 69.7 7 104.2 17.6 103.2 148.2 9.8 147.5 239.4 128.6 193.3 66.2 3.3 66.1 8 387.2 33.3 385.9 169.5 6.3 169.1 325.2 198.2 252.6 68.7 8 68 9 134.2 39.9 128.3 73.8 19.5 70.9 292.5 169.6 249.3 61.1 4.5 60.9 10 210.9 43.7 204.3 140.5 30.7 132.6 118.9 135 56.2 65.7 9.8 64.3 11 266 31 264.6 154.7 6.5 154.5 338.7 176.3 276.2 68.4 2.5 68.4 12 90.6 29.5 83 99 42.4 83.8 112.8 90.9 86.1 44.6 20.2 36.7 13 116.1 20.4 114.7 153.8 14.5 152.2 202.9 134.2 164.6 66.8 6.8 66.3 14 305.7 43 303.1 169.8 8.1 169.3 351.1 189.6 286.1 71.1 3.1 71 15 135.3 21.8 134 156.4 8.4 155.9 242.4 137.6 199.8 66 8.6 65.2 16 146.9 58.7 119.4 106.9 50.7 80.7 193.2 146.2 141.7 52.1 19.4 44.2 17 193.5 34.3 189.9 83.1 23 78.7 229.8 99.6 201.3 0.6 0 0.6 18 285.3 32.8 283.6 161.3 8.6 160.8 306.6 163.1 253 66.9 4.4 66.5 19 202.4 34.5 199.7 156.9 13.8 155.5 290.2 203.6 220.7 68.5 4.5 68.3 20 124.1 21.5 122.2 160.3 10.2 159.5 13.8 48.1 4.6 69.1 1.9 69.1

1 267.2 50.9 258.2 131.3 46.3 119 78.2 87.8 56.8 48.2 16.9 45 2 391.1 52 384.2 128.3 52.1 111.6 132.7 168.1 75.5 61.7 10.6 60.5 3 67.6 13.8 65.8 97.9 47.1 82.5 0.2 0.2 0.1 52 16.4 47.9 4 173.5 33.7 168.7 89.4 56.9 64.8 101.1 101.4 69.6 45.4 17.2 41.5 5 100.8 24.5 97.3 72.2 50.7 51.8 47.8 62.8 33.9 46.1 19.7 38.1 6 360.9 50.1 353.7 118.4 50.2 101.6 112.6 120 75.4 60.3 9.3 59.4 7 74.6 16.7 72.6 98.3 41.8 87.5 87.4 75.6 66.5 53.3 11.8 51.7 8 211.5 50.4 200.6 108.5 53.7 89.5 94.9 108.5 64.2 42.7 16.8 39.4 9 111.8 23.2 109.1 82.9 58.2 57.9 40.7 73.6 25.9 50.1 16.6 46.5 10 142 31.4 136 66.1 55.1 39.1 19.7 51.9 8.7 46.2 15.3 42.3 11 181.6 25.1 179.8 125.5 42.4 115.1 130.5 146.8 84 62.2 6.4 61.4 12 84.3 46.5 62.8 45.1 43.6 22.4 43.9 54.2 33.6 13.9 17.2 5 13 93.8 23 90.2 76.5 54.2 53.9 69.6 77 48.9 50.1 12.2 48.5 14 270.1 61.6 256.7 110.9 47.7 97.8 87.7 91.6 64.6 65.1 3.2 65 15 155.1 35.4 149.3 107.3 42.2 96.9 74.5 72.9 56.6 45.2 15 42.4 16 70.7 32.2 54.3 52.6 49.2 25.8 40.5 40.2 33.3 21.6 19.8 10 17 237 36.5 231.5 84.2 52 64.4 54.2 49 43.7 0.1 0 0.1 18 205.8 43.7 198.3 111.4 47.2 98.2 46 36.2 37.5 58.2 9.4 57.2 19 112.6 23.8 109.8 97.4 51.6 79.6 47.2 70.9 33.9 51.4 17.8 47.2 20 98.5 20 96.5 98.8 49.6 82.6 1.2 2 0.5 54.8 15.2 51.5

Table 6.4.: Queries per second (QpS), geometric mean of query runtime in milliseconds (GM), and standard deviation of query runtime in milliseconds (SD), for the 10% and 50% datasets of DBPSB version 2, respectively.


Virtuoso Sesame Jena-TDB BigOWLIM Query QpS SD GM QpS SD GM QpS SD GM QpS SD GM 1 270 71.3 252.6 24.5 15 21.2 103.7 144.5 52.1 53.4 17.2 49.8 2 226.5 66.5 208.2 33.6 24.6 28.1 110.8 157.6 54.4 50.7 14.7 48.5 3 71.4 22.7 67.3 12.2 5.5 11.2 0.1 0.1 0.1 37.3 14.7 34.2 4 124 39.7 114.4 4.5 2.1 4.2 119.3 118 76.7 30.9 21.6 22 5 73.9 23.4 69.8 14.4 8.2 12 36 60.3 22 46.2 17.9 40.6 6 214 64.6 196 22.9 16.9 19.7 101.3 119.1 57.3 48.8 13.6 46.8 7 63.3 21.2 59.1 22.6 14.9 17.2 73.3 92.4 43 40.6 18.8 35.4 8 266.9 135.3 202.7 41.2 25.4 31.3 81 118.3 41.2 44 21.8 35.1 9 92 34.3 84 5.4 7.2 3.3 37.7 87.3 17.4 30.7 16.7 24.4 10 123.5 35.6 114.7 2.1 0.4 2 11.7 52 1.7 35.3 15.8 31.1 11 131.9 17.8 131 38.5 13.2 35.9 108.1 135 60.6 63.4 3.2 63.3 12 53.7 41 31.2 0.4 1.1 0 24.6 47.2 13.2 9.6 14.7 2.2 13 108 47.5 93.3 8.5 3.7 7.5 36.4 66.1 20.3 34.7 15.8 31 14 234.7 83.5 210.5 47.8 23.9 40.5 102.3 134.2 54 64.7 4.1 64.6 15 168.1 70.8 144.8 16.3 11 13.6 28.5 37.1 20.6 34.1 21.8 26.4 16 70.4 42 42.5 2.1 2.6 0.8 71.2 113.8 32 30 22.6 15.8 17 227 38.5 218.5 41 18.1 37.2 105.3 108.8 61 0.3 0 0.3 18 188 60.4 172.3 30.6 12.8 28.2 101.7 145 49.1 58.1 10.6 56.6 19 128.6 43.5 108.8 28.4 18.6 22.1 78.6 147.4 31.2 60.8 11.2 59.2 20 114 38.7 105.4 19.4 7.2 18.2 4.2 4.9 2.8 43.7 13.5 41.8

Table 6.5.: Queries per second (QpS), geometric mean of query runtime in milliseconds (GM), and standard deviation of query runtime in milliseconds (SD), for the 100% dataset of DBPSB version 2.

7. Related Work

This chapter discusses work similar to ours and contrasts it with our own. It organizes the related work into several categories and compares each category with the closest corresponding part of this thesis.

7.1. Semantic Data Extraction from Wikipedia

There are several projects which aim at extracting semantic data from Wikipedia. YAGO2 is an extension of the YAGO knowledge base [Suchanek et al., 2008]. YAGO2 uses Wikipedia, Geonames, and WordNet as sources of data. The predecessor of YAGO2, i.e. YAGO, used only Wikipedia as its main source of data. It extracts several relations (e.g. subClassOf and type) mainly from the category pages of Wikipedia. Category pages are lists of articles that belong to a specific category. For instance, William Shakespeare is in the category English poets. These lists give candidates for entities (i.e. William Shakespeare), candidates for concepts (IsA(William Shakespeare, Poet)), and candidates for relations (e.g. isCitizenOf(William Shakespeare, England)). YAGO then links the Wikipedia category hierarchy to the WordNet hierarchy, in order to enhance its classification hierarchy [Suchanek et al., 2008]. YAGO2 is the new version of the YAGO project. It uses Geonames as an additional source of data. YAGO2 introduces the integration of the spatio-temporal dimension. In contrast to the original YAGO, the methodology of building YAGO2 (and also maintaining it) is systematically designed top-down with the goal of integrating entity-relationship-oriented facts with the spatial and temporal dimensions. Moreover, YAGO represents facts in the form of subject-property-object triples (SPO triples) according to the RDF data model. YAGO2 introduces a new spatio-temporal model, which represents facts in the form of SPOTL tuples (i.e. SPO + Time + Location). This new knowledge representation scheme is very beneficial. For example, with the old representation scheme a knowledge base may store that a certain person is the president of a certain country, but presidents of countries change; it is therefore crucial to capture the time periods during which facts of this kind actually held [Hoffart et al., 2010]. Both YAGO and YAGO2, however, do not focus on extracting data from Wikipedia infoboxes. KYLIN is another project based on Wikipedia, which aims at creating or completing infoboxes by extracting information from the article text. KYLIN looks for classes of pages with similar infoboxes, determines common attributes, creates training examples, learns the extractors, and runs them on each page, thus creating new infoboxes or completing existing ones.

KYLIN uses learning techniques to automatically fill in missing values in incomplete infoboxes. There are several problems which may exist in an infobox, such as incompleteness or inconsistency. For example, some infoboxes contain incomplete data which may, however, exist in the article text, while other infoboxes may contain data that contradicts the article text [Wu and Weld, 2007]. Although both DBpedia and KYLIN work with Wikipedia infoboxes, they have different objectives: DBpedia aims at extracting data from infoboxes and converting it into semantic data, whereas KYLIN tries to fill the gaps that may exist in some infoboxes. Freebase is a large collaborative knowledge base, which uses various sources of data including Wikipedia and MusicBrainz1. It was originally developed by the software company Metaweb, which was later acquired by Google. Freebase also extracted facts from Wikipedia articles as initial content, which its users can later extend and revise [Bollacker et al., 2008]. Both DBpedia and Freebase use Wikipedia as a data source, but Freebase uses it only as a starting point, in the sense that its users can modify the data, whereas DBpedia aims to stay closely aligned with Wikipedia.

7.2. RDF Benchmarks

Several RDF benchmarks were previously developed. We first present them, and then compare them to our benchmark.

7.2.1. Existing Benchmarks

The Lehigh University Benchmark (LUBM) [Pan et al., 2005] was one of the first RDF benchmarks. LUBM uses an artificial data generator, which generates synthetic data for universities, their departments, their professors, employees, courses and publications. This small number of classes limits the variability of the data and makes LUBM's inherent structure rather repetitive. Moreover, the SPARQL queries used for benchmarking in LUBM are all plain queries, i.e. they contain only triple patterns with no other SPARQL features (e.g. FILTER or REGEX). LUBM performs each query 10 consecutive times and then calculates the average response time of that query. Executing the same query several times without introducing any variation enables query caching, which affects the overall average query times. SP2Bench [Schmidt et al., 2009] is another, more recent benchmark for RDF stores. Its RDF data is based on the Digital Bibliography & Library Project (DBLP) and includes information about publications and their authors. It uses the SP2Bench generator to produce its synthetic test data, which is in its schema heterogeneity even more limited than LUBM. The main advantage of SP2Bench over LUBM is that its test queries include a variety of SPARQL features (such as FILTER and OPTIONAL).

1http://musicbrainz.org/

The main difference between the DBpedia benchmark and SP2Bench is that both test data and queries are synthetic in SP2Bench. In addition, SP2Bench only published results for up to 25M triples, which is relatively small with regard to datasets such as DBpedia and LinkedGeoData. Another benchmark, described in [Owens et al., 2008a], compares the performance of BigOWLIM and AllegroGraph. The size of its underlying synthetic dataset is 235 million triples, which is sufficiently large. The benchmark measures the performance of a variety of SPARQL constructs for both stores when running in single-threaded and in multi-threaded mode. It also measures the performance of adding data, both using bulk adding and partitioned adding. The downside of that benchmark is that it compares the performance of only two triplestores. Also, the performance of each triplestore is not assessed for different dataset sizes, which prevents scalability comparisons. The Berlin SPARQL Benchmark (BSBM) [Bizer and Schultz, 2009] is a benchmark for RDF stores which has been applied to various triplestores, such as Sesame, Virtuoso, and Jena-TDB. It is based on an e-commerce use case in which a set of products is provided by a set of vendors and consumers post reviews about those products. It tests various SPARQL features on those triplestores. It tries to mimic real user operation, i.e. it orders the queries in a manner that resembles a real sequence of operations performed by a human user. This is an effective testing strategy. However, the BSBM data and queries are artificial and the data schema is very homogeneous and resembles a relational database. This is reasonable for comparing the performance of triplestores with RDBMS, but does not give many insights regarding the specifics of RDF data management. In general, existing SPARQL benchmark efforts such as LUBM [Pan et al., 2005], BSBM [Bizer and Schultz, 2009] and SP2Bench [Schmidt et al., 2009] resemble relational database benchmarks. In particular, the data structures underlying these benchmarks are basically relational data structures, with relatively few and homogeneously structured classes. However, RDF knowledge bases are increasingly heterogeneous; they do not resemble relational structures and are not easily representable as such. Examples of such knowledge bases are curated biomedical ontologies such as those contained in Bio2RDF [Belleau et al., 2008], as well as knowledge bases extracted from unstructured or semi-structured sources such as DBpedia [Lehmann et al., 2009, Morsey et al., 2012b] or LinkedGeoData [Auer et al., 2009, Stadler et al., 2012]. For instance, DBpedia contains thousands of classes and properties. DBpedia version 3.6, for example, contains 289,016 classes, of which 275 belong to the DBpedia ontology. Moreover, it contains 42,016 properties, of which 1,335 are in the DBpedia ontology. Also, various data types and object references of different types are used in property values. Such knowledge bases cannot easily be represented according to the relational data model, and hence performance characteristics for loading, querying and updating these knowledge bases might potentially be fundamentally different from knowledge bases resembling relational data structures.


                        LUBM | SP2Bench | BSBM V2 | BSBM V3 | DBPSB
RDF stores tested:      DLDB-OWL, Sesame, OWL-JessKB | ARQ, Redland, SDB, Sesame, Virtuoso | Virtuoso, Jena-TDB, Jena-SDB, Sesame | Virtuoso, 4store, BigData, Jena-TDB, BigOwlim | Virtuoso, Sesame, Jena-TDB, BigOWLIM
Test data:              Synthetic | Synthetic | Synthetic | Synthetic | Real
Test queries:           Synthetic | Synthetic | Synthetic | Synthetic | Real
Size of tested datasets: 0.1M, 0.6M, 1.3M, 2.8M, 6.9M | 10k, 50k, 250k, 1M, 5M, 25M | 1M, 25M, 100M | 100M, 200M | 14M, 75M, 150M, 300M
Dist. queries:          14 | 12 | 12 | 12 | 25
Multi-client:           – | – | x | x | –
Use case:               Universities | DBLP | E-commerce | E-commerce | DBpedia
Classes:                43 | 8 | 8 | 8 | 239 (internal) + 300K (YAGO)
Properties:             32 | 22 | 51 | 51 | 1200

Table 7.1.: Comparison of different RDF benchmarks.

7.2.2. Comparison between DBPSB and The Other Benchmarks

In contrast to other benchmarks, DBPSB performs measurements on real queries that were issued by humans or Data Web applications against existing RDF data. In order to obtain a representative set of prototypical queries reflecting the typical workload of a SPARQL endpoint, we perform a query analysis and clustering on queries that were sent to the official DBpedia SPARQL endpoint. From the highest-ranked query clusters (in terms of aggregated query frequency), we derive a set of SPARQL query templates, which cover the most commonly used SPARQL feature combinations and are used to generate the actual benchmark queries by parametrization. The benchmark methodology and results are also available online2. Although we apply this methodology to the DBpedia dataset and its SPARQL query log in this case, the same methodology can be used to obtain application-specific benchmarks for other knowledge bases and query workloads. Since DBPSB changes with the data and queries in DBpedia, we plan to update it in yearly increments and publish results on the aforementioned website. In general, our methodology follows the four key requirements for domain-specific benchmarks as postulated in the Benchmark Handbook [Gray, 1991], i.e. it is (1) relevant, thus testing typical operations within the specific domain, (2) portable, i.e. executable on different platforms, (3) scalable, e.g. it is possible to run the benchmark on both small and very large data sets, and (4) understandable. A comparison between the benchmarks is shown in Table 7.1. The main difference between previous benchmarks and ours is that we rely on real data and real user queries, while most of the previous approaches rely on synthetic data. LUBM's main drawback is that it solely relies on plain queries without SPARQL features such as FILTER or REGEX. In addition, its querying strategy (10 repeats of the same query) allows for caching.

2http://aksw.org/Projects/DBPSB

SP2Bench relies on synthetic data and a small (25M triples) synthetic dataset for querying. The benchmark described in [Owens et al., 2008a] does not allow for testing the scalability of the stores, as the size of the dataset is fixed. Finally, the BSBM data and queries are artificial, and the data schema is very homogeneous and resembles a relational database. In addition to general-purpose RDF benchmarks, it is reasonable to develop benchmarks for specific RDF data management aspects. One particularly important feature in practical RDF triplestore usage scenarios (as was also confirmed by DBPSB) is full-text search on RDF literals. In [Minack et al., 2009] the LUBM benchmark is extended with synthetic, scalable full-text data and corresponding queries for full-text-related query performance evaluation. RDF stores are benchmarked for basic full-text queries (classic IR queries) as well as hybrid queries (structured and full-text queries).

7.3. Data Quality Assessment

Web data quality assessment frameworks. There are a number of data quality assessment dimensions that have already been identified as relevant to Linked Data, namely accuracy, timeliness, completeness, relevancy, conciseness and consistency, to name a few [Bizer, 2007]. Additional quality criteria such as uniformity, versatility, comprehensibility, amount of data, validity, licensing, accessibility and performance were also introduced as further means of assessing the quality of LOD [Zaveri et al., 2013b]. Additionally, there are several efforts in developing data quality assessment frameworks in order to assess the data quality of LOD. These efforts are either semi-automated [Flemming, 2010], automated [Guéret et al., 2012] or manual [Bizer and Cyganiak, 2009, Mendes P.N., 2012]. Even though these frameworks introduce useful methodologies to assess the quality of a dataset, either the results are difficult to interpret, they do not allow a user to choose the input dataset, or they require a considerable amount of user involvement. Concrete Web Data quality assessments An effort to assess the quality of web data was undertaken in 2008 [Cafarella et al., 2008], where 14.1 billion HTML tables from Google's general-purpose web crawl were analyzed in order to retrieve those tables that have high-quality relations. Additionally, there have been studies focused on assessing the quality of RDF data [Hogan et al., 2010] which report the errors occurring while publishing RDF data and the effects of and means to improve the quality of structured data on the web. As part of an empirical study [Hogan et al., 2012], 4 million RDF/XML documents were analyzed, which provided insights into the level of conformance of these documents with respect to the Linked Data guidelines. Even though these studies accessed a vast amount of web or RDF/XML data, most of the analysis was performed automatically, and therefore the problems arising due to contextual discrepancies were overlooked. Another study aimed to develop a framework for DBpedia quality assessment [Kreis, 2011]. In this study, particular problems of the DBpedia extraction framework were taken into account and integrated into the framework.

However, only a small sample (75 resources) was assessed in this case, and an older DBpedia version (2010) was analyzed. Crowdsourcing-based tasks There are already a number of efforts which use crowdsourcing focused on a specific type of task. For example, crowdsourcing is used for entity linking or resolution [Demartini et al., 2012], quality assurance and resource management [Wang et al., 2012], or for enhancement of ontology alignments [Sarasua et al., 2012], especially in Linked Data. However, in our case, we did not submit tasks to popular Internet marketplaces such as Amazon Mechanical Turk or CrowdFlower3. Instead, we used the intelligence of a large number of researchers who were particularly conversant with RDF to help assess the quality of one of the most important and most interlinked datasets, DBpedia. It is worth mentioning that the Linked Data life-cycle of the LOD2 stack [Auer et al., 2012] incorporates a Quality Analysis step. This step aims at developing techniques (components) for evaluating the quality based on concrete quality metrics, such as provenance and context. The Sieve component has a module called "Quality Assessment", which relies on user-selected metadata as quality indicators in order to produce quality scores.

7.4. Fact Validation

There are three main research areas related to the problem of fact validation: the representation of provenance information in the Web of Data, work on trustworthiness, and relation extraction. The problem of data provenance is a crucial issue in the Web of Data. While data extracted by means of tools such as Hazy4 and KnowItAll5 can easily be mapped to primary provenance information, most knowledge sources were extracted from non-textual sources and are more difficult to link with provenance information. In the work described in [Hartig and Zhao, 2010], Olaf Hartig and Jun Zhao developed a framework for provenance tracking. This framework provides the vocabulary required for representing and accessing provenance information on the web. It keeps track of who created a web entity, e.g. a webpage, when it was last modified, etc. Recently, a W3C working group has been formed and has released a set of specifications on sharing and representing provenance information6. Dividino et al. [Dividino et al., 2011] introduced an approach for managing several provenance dimensions, e.g. source and timestamp. In their approach, they described an extension of RDF called RDF+, which can efficiently work with provenance data. They provided a method to extend SPARQL query processing in such a manner that a specific SPARQL query can request meta knowledge without modifying the query itself.

3http://crowdflower.com/
4http://hazy.cs.wisc.edu/hazy/
5http://www.cs.washington.edu/research/knowitall/
6http://www.w3.org/2011/prov/wiki/


Theoharis et al. [Theoharis et al., 2011] argued that the implicit provenance data contained in SPARQL query results can be used to acquire annotations for several dimensions of data quality. They detailed abstract provenance models, how they are used for relational data, and how they can be used for semantic data as well. Their model requires the existence of provenance data in the underlying semantic data source. DeFacto uses the W3C provenance group standard for representing provenance information. Yet, unlike previous work, it directly tries to find provenance information by searching for confirming facts in trustworthy webpages. The second research area related to fact validation is trustworthiness. Nakamura et al. [Nakamura et al., 2007] developed an efficient prototype for enhancing the search results provided by a search engine based on a trustworthiness analysis of those results. They conducted a survey in order to determine the frequency at which users access search engines and how much they trust the content and ranking of search results. They defined several criteria for the trustworthiness calculation of search results returned by a search engine, such as topic majority. We adapted their approach for DeFacto and included it as one of the features for our machine learning techniques. [Pasternack and Roth, 2011a, Pasternack and Roth, 2011b] present an approach for computing the trustworthiness of web pages. To achieve this goal, the authors rely on a model based on hubs and authorities. This model allows computing the trustworthiness of facts and websites by generating a k-partite network of pages and facts and propagating trustworthiness information across it. The approach returns a score for the trustworthiness of each fact. An older yet similar approach is presented in [Yin et al., 2007]. Here, the idea is to create a 3-partite network of webpages, facts and objects and to apply a propagation algorithm to compute weights for facts as well as webpages. The use of trustworthiness and uncertainty information on RDF data has been the subject of recent research (see e.g. [Hartig, 2008, Meiser et al., 2011]). Our approach differs from these approaches as it does not aim to evaluate the trustworthiness of facts expressed in natural language. In addition, it can deal with the broad spectrum of relations found on the Data Web. Our fact validation approach is also related to relation extraction. Most tools that address this task rely on pattern-based approaches. Some early work on pattern extraction relied on supervised machine learning [Grishman and Yangarber, 1998]. Yet, such approaches demanded large amounts of training data, making them difficult to adapt to new relations. The subsequent generation of approaches to relation extraction aimed at bootstrapping patterns based on a small number of input patterns and instances. For example, [Brin, 1999] presents the Dual Iterative Pattern Relation Expansion (DIPRE) approach and applies it to the detection of relations between authors and titles of books. This approach relies on a small set of seed patterns to maximize the precision of the patterns for a given relation while minimizing their error rate. Snowball [Agichtein and Gravano, 2000] extends DIPRE by a new approach to the generation of seed tuples. Newer approaches aim either to collect redundancy information in an unsupervised manner (see e.g. [Yan et al., 2009]) or to use linguistic analysis [Nguyen et al., 2007] to harvest generic patterns for relations.


8. Conclusions and Future Work

This chapter summarizes our research work, highlights our main contributions, and draws general conclusions from the work. It then points out the future directions in which we can move to extend and broaden the research conducted in these areas.

8.1. Conclusions

Each direction of our research work has its own value and benefits. In each of the following subsections, we discuss the significance of each research direction in detail.

8.1.1. DBpedia Live Extraction

Due to the permanent updates of Wikipedia articles, we also aim to update DBpedia accordingly. We proposed a framework for instantly retrieving updates from Wikipedia, extracting RDF data from them, and storing this data in a triplestore. Our new revision of the DBpedia Live extraction framework adds a number of features and particularly solves the following issues:

• MediaWiki templates in article abstracts are now rendered properly.

• Changes of mappings in the DBpedia mappings wiki are now retrospectively applied to all potentially affected articles.

• Updates can now be propagated easily, i.e. DBpedia Live mirrors can now get recent updates from our framework in order to be kept in sync.

Many users can benefit from DBpedia, not only computer scientists. We have pointed out some directions for how librarians and libraries can make use of DBpedia and how they can become part of the emerging Web of Data. We see a great potential for libraries to become centers of excellence in knowledge management on the Web of Data. As libraries supported the exchange of knowledge through books in previous centuries, they now have the opportunity to extend their scope towards supporting the exchange of knowledge through structured data and ontologies on the Web of Data. Due to the wealth and diversity of structured knowledge already available in DBpedia and other datasets on the Web of Data, many other scientists, e.g. in the life sciences, humanities, or engineering, would also benefit greatly from such a development.


8.1.2. DBPSB

We proposed the DBPSB benchmark for evaluating the performance of triplestores based on non-artificial data and queries. Our solution was implemented for the DBpedia dataset and tested with four different triplestores, namely Virtuoso, Sesame, Jena-TDB, and BigOWLIM. The main advantage of our benchmark over previous work is that it uses real RDF data with typical graph characteristics, including a large and heterogeneous schema part. Furthermore, by basing the benchmark on queries asked against DBpedia, we intend to spur innovation in triplestore performance optimization towards scenarios which are actually important for end users and applications. We applied query analysis and clustering techniques to obtain a diverse set of queries corresponding to feature combinations of SPARQL queries. Query variability was introduced to render simple caching techniques of triplestores ineffective. The benchmarking results we obtained reveal that real-world usage scenarios can have substantially different characteristics than the scenarios assumed by prior RDF benchmarks. Our results are more diverse and indicate less homogeneity than what is suggested by other benchmarks. The creativity and inaptness of real users in constructing SPARQL queries are reflected in DBPSB, which reveals, for a given triplestore and dataset size, the most costly SPARQL feature combinations.
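The query variability mentioned above is obtained by substituting the %%var%% placeholders of the query templates (see Appendix A) with concrete values. A minimal sketch of this parametrization step; the template is taken from Appendix A, while the substitution values are purely illustrative:

import random

def instantiate(template, values_per_placeholder):
    """Replace each %%...%% placeholder with a randomly chosen value, so that
    repeated executions of the template do not hit the store's query cache."""
    query = template
    for placeholder, values in values_per_placeholder.items():
        query = query.replace(placeholder, random.choice(values))
    return query

template = "SELECT DISTINCT ?var1 WHERE { %%var%% rdf:type ?var1 . }"
values = {"%%var%%": ["<http://dbpedia.org/resource/Leipzig>",
                      "<http://dbpedia.org/resource/Cairo>"]}
print(instantiate(template, values))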

8.1.3. DeFacto

DeFacto enables checking the validity of a given RDF triple. When given a test statement, it returns a confidence value for it as well as possible evidence for that statement. The evidence consists of a set of webpages, textual excerpts from those pages, and meta-information on the pages. These text excerpts and the associated meta-information allow the user to quickly get an overview of possible credible sources for the input statement. Instead of having to use search engines, browse several webpages, and look for the relevant pieces of information in each webpage, with DeFacto the user can review the presented information more efficiently.

8.2. Future Work

Each research area has its own directions in which we can move further and expand the work.

8.2.1. DBpedia Live Extraction

There are several directions in which we aim to extend the DBpedia Live framework:


Support of other languages: Currently, the framework supports only the English Wikipedia edition; recently, the Dutch DBpedia Live1 has also been developed. We plan to extend our framework to include other languages as well. The main advantage of such a multi-lingual extension is that infoboxes within different Wikipedia editions cover different aspects of an entity at varying degrees of completeness. For instance, the Italian Wikipedia contains more knowledge about Italian cities and villages than the English one, while the German Wikipedia contains more structured information about people than the English edition. This leads to an increase in the quality of the extracted data compared to knowledge bases that are derived from single Wikipedia editions. Moreover, this also helps in detecting inconsistencies across different Wikipedia and DBpedia editions.

Wikipedia article augmentation: Interlinking DBpedia with other data sources makes it possible to develop a MediaWiki extension that augments Wikipedia articles with additional information as well as media items (e.g. pictures and audio) from these sources. For instance, a Wikipedia page about a geographic location such as a city or a monument can be augmented with additional pictures from Web data sources such as Flickr or with additional facts from statistical data sources such as Eurostat or the CIA Factbook.

Wikipedia consistency checking: The extraction of different Wikipedia editions, along with the interlinking of DBpedia with external Web knowledge, builds the basis for detecting inconsistencies in Wikipedia content. For instance, whenever a Wikipedia author edits an infobox within a Wikipedia article, the new content of the infobox could be checked against external data sources and information extracted from other language editions. Inconsistencies could be pointed out along with proposals on how to solve them. In this way, DBpedia can provide feedback to Wikipedia maintainers in order to keep Wikipedia data more consistent, which eventually may also lead to an increase in the quality of the data in Wikipedia.

8.2.2. DBPSB

Several improvements can be envisioned in future work to cover a wider spectrum of features in DBPSB:

• Coverage of more SPARQL 1.1 features, e.g. reasoning and subqueries.

• Inclusion of further triplestores and continuous usage of the most recent DBpedia query logs.

• Testing of SPARQL update performance via DBpedia Live, which is modified several thousand times each day. In particular, an analysis of the dependency of query performance on the dataset update rate could be performed.

1http://live.nl.dbpedia.org/


8.2.3. DeFacto

DeFacto can be extended in manifold ways. First, BOA is able to detect natural-language representations of predicates in several languages. Thus, we could have the user choose the languages he/she understands and provide facts in several languages, thereby also increasing the portion of the Web that we search through. Furthermore, we could extend our approach to support datatype properties. Moreover, DeFacto can be extended with a temporal dimension, i.e. DeFacto could be used to check whether a statement was true during a certain time span. On a grander scale, we aim to provide even lay users of knowledge bases with the means to check the quality of their data by using natural language input. This would support the transition from the Document Web to the Semantic Web by providing a further means to connect data and documents.

A. DBpedia SPARQL Benchmark (DBPSB) Queries

A.1. DBPSB Version 1

PREFIX rdf:
PREFIX rdfs:
PREFIX owl:
PREFIX xsd:
PREFIX foaf:
PREFIX dc:
PREFIX skos:
PREFIX dbpprop:
PREFIX dbpedia-owl:
PREFIX dbpedia2:
PREFIX yago:
PREFIX umbelBus:
PREFIX umbelCountry:
PREFIX georss:

16 SELECT DISTINCT ? var1 WHERE { %%var%% rdf:type ?var1 . }

SELECT * WHERE { %%var%% ?var2 ?var1. FILTER (?var2 = dbp-prop:redirect || ?var2 = dbp-prop:redirect) }

SELECT ?var0 ?var1 ?var2 WHERE { ?var5 dbpedia-owl:thumbnail ?var0 . ?var3 rdf:type dbpedia-owl:Person . ?var3 rdfs:label %%var%% . ?var3 foaf:page ?var1 . OPTIONAL { ?var3 foaf:homepage ?var2 .} . }

SELECT ?var0 ?var1 ?var2 ?var3 ?var4 WHERE { { %%var%% ?var0 ?var1 . ?var1 foaf:name ?var3 . } UNION { ?var2 ?var0 %%var%% ; foaf:name ?var4 . } }

SELECT DISTINCT ?var0 ?var1 ?var2 WHERE { { ?var0 dbp-prop:series %%var1%% ; foaf:name ?var1 ; rdfs:comment ?var2 ; rdf:type %%var0%% . } UNION { ?var0 dbp-prop:series ?var3 . ?var3 dbp-prop:redirect %%var1%% . ?var0 foaf:name ?var1 ; rdfs:comment ?var2 ; rdf:type %%var0%% . } }

SELECT DISTINCT ?var0 ?var1 ?var2 WHERE { ?var0 rdf:type . ?var0 dbp-prop:numEmployees ?var1 FILTER ( xsd:integer(?var1) >= %%var%% ) . ?var0 foaf:homepage ?var2 . }

SELECT DISTINCT ?var0 ?var1 ?var2 ?var3 ?var5 ?var6 ?var7 ?var10 WHERE { ?var0 rdfs:comment ?var1. ?var0 foaf:page %%var%% OPTIONAL {?var0 skos:subject ?var6} OPTIONAL {?var0 dbp-prop:industry ?var5} OPTIONAL {?var0 dbpedia2:location ?var2} OPTIONAL {?var0 dbpedia2:locationCountry ?var3} OPTIONAL {?var0 dbpedia2:locationCity ?var9; dbp-prop:manufacturer ?var0} OPTIONAL {?var0 dbpedia2:products ?var11; dbp-prop:model ?var0} OPTIONAL {?var0 ?var10} OPTIONAL {?var0 rdf:type ?var7}}

SELECT ?var0 ?var1 WHERE { { ?var0 rdf:type %%var1%%. ?var0 dbpedia2:population ?var1 . FILTER ( xsd:integer(?var1) > %%var0%%) } UNION { ?var0 rdf:type %%var1%%. ?var0 dbpedia2:populationUrban ?var1. FILTER ( xsd:integer(?var1) > %%var0%%) } }


SELECT * WHERE { ?var0 a dbp-owl:Settlement; rdfs:label %%var%% . ?var1 a dbp-owl:Airport. {?var1 dbp-owl:city ?var0} UNION {?var1 dbp-owl:location ?var0} {?var1 dbp-prop:iata ?var2.} UNION {?var1 dbp-owl:iataLocationIdentifier ?var2.} OPTIONAL { ?var1 foaf:homepage ?var1_home. } OPTIONAL { ?var1 dbp-prop:nativename ?var1_name.} }

SELECT DISTINCT ?var0 { ?var1 foaf:page ?var0. ?var1 rdf:type dbp-owl:SoccerPlayer . ?var1 dbp-prop:position ?var2 . ?var1 dbp-prop ?var3 . ?var3 dbp-owl:capacity ?var4 . ?var1 ?var5 . ?var5 ?var6 ?var7. OPTIONAL {?var1 dbp-owl:number ?var8 .} FILTER (?var6 = dbp-prop:populationEstimate || ?var6 = dbp-prop:populationCensus || ?var6 = dbp-prop:statPop ) FILTER ( xsd:integer(?var7) > %%var1%% ) . FILTER ( xsd:integer(?var4) < %%var0%% ). FILTER (?var2 = 'Goalkeeper'@en || ?var2 = || ?var2 = ) }

SELECT DISTINCT ?var0 ?var1 ?var2 WHERE { {%%var%% dbp-prop:subsid ?var0 OPTIONAL {?var2 %%var%% dbp-prop:parent} OPTIONAL {%%var%% dbp-prop:divisions ?var1}} UNION {?var2 %%var%% dbp-prop:parent OPTIONAL {%%var%% dbp-prop:subsid ?var0} OPTIONAL {%%var%% dbp-prop:divisions ?var1}} UNION {%%var%% dbp-prop:divisions ?var1 OPTIONAL {%%var%% dbp-prop:subsid ?var0} OPTIONAL {?var2 %%var%% dbp-prop:parent}} }

SELECT DISTINCT ?var0 WHERE { ?var1 rdf:type dbp-owl:Person . ?var1 dbp-owl:nationality ?var2 . ?var2 rdfs:label ?var0 . ?var1 rdfs:label %%var%% . FILTER ( LANG(?var0) = 'en') }

SELECT * WHERE {{ %%var%% rdfs:comment ?var0. FILTER (LANG(?var0) = 'en')} UNION {%%var%% foaf:depiction ?var1} UNION {%%var%% foaf:homepage ?var2}}

SELECT ?var0 ?var1 ?var2 ?var3 WHERE { ?var3 skos:subject %%var%% . ?var3 foaf:name ?var0 . OPTIONAL { ?var3 rdfs:comment ?var1 . FILTER (LANG(?var1) = 'en') . } OPTIONAL { ?var3 rdfs:comment ?var2 . FILTER (LANG(?var2) = 'de') . } }

SELECT DISTINCT ?var0 ?var1 WHERE { ?var0 rdf:type %%var%% ; rdfs:label ?var1 . FILTER regex( str(?var1), 'pes', 'i') }

SELECT DISTINCT ?var0 ?var1 ?var2 ?var3 WHERE { %%var%% ?var1 ?var3 . OPTIONAL {?var3 rdfs:label ?var2} . FILTER(langMatches(LANG(?var2),'EN') || (! langMatches(LANG(?var2),'*'))) . FILTER(langMatches(LANG(?var3),'EN') || (! langMatches(LANG(?var3),'*'))). OPTIONAL {?var1 rdfs:label ?var0}}

SELECT DISTINCT ?var0 ?var1 { {?var0 skos:subject %%var%%.} UNION {?var0 skos:subject .} UNION {?var0 skos:subject .} ?var0 rdfs:label ?var1. FILTER (LANG(?var1)='fr') }

SELECT ?var0 ?var1 ?var2 WHERE { { %%var%% ?var0 ?var1. FILTER ((STR(?var0) = 'http://www.w3.org/2000/01/rdf-schema#label' && LANG(?var1) = 'en') || (STR(?var0) = 'http://dbpedia.org/ontology/abstract' && LANG(?var1) = 'en') || (STR(?var0) = 'http://www.w3.org/2000/01/rdf-schema#comment' && LANG(?var1) = 'en') || (STR(?var0) != 'http://dbpedia.org/ontology/abstract' && STR(?var0) != 'http://www.w3.org/2000/01/rdf-schema#comment' && STR(?var0) != 'http://www.w3.org/2000/01/rdf-schema#label') ) } UNION { ?var2 ?var0 %%var%% FILTER ( STR(?var0) = 'http://dbpedia.org/ontology/owner' || STR(?var0) = 'http://dbpedia.org/property/redirect' )}}

SELECT ?var1 WHERE { { ?var1 rdfs:label %%var%% } UNION { ?var1 rdfs:label %%var%% }. FILTER(regex( str(?var1),'http://dbpedia.org/resource/') || regex( str(?var1),'http://dbpedia.org/ontology/') || regex( str(?var1),'http://www.w3.org/2002/07/owl') || regex( str(?var1),'http://www.w3.org/2001/XMLSchema') || regex( str(?var1),'http://www.w3.org/2000/01/rdf-schema') || regex( str(?var1),'http://www.w3.org/1999/02/22-rdf-syntax-ns')) }

SELECT * WHERE { ?var0 a dbp-owl:PopulatedPlace; dbp-owl:abstract ?var1; rdfs:label ?var2; geo:lat ?var3; geo:long ?var4. {?var0 rdfs:label %%var%%.} UNION {?var5 dbp-prop:redirect ?var0; rdfs:label %%var%%. } OPTIONAL { ?var0 foaf:depiction ?var6 } OPTIONAL { ?var0 foaf:homepage ?var7 } OPTIONAL { ?var0 dbp-owl:populationTotal ?var8 } OPTIONAL { ?var0 dbp-owl:thumbnail ?var9 } FILTER ( langMatches( LANG(?var1), 'de') && langMatches( LANG(?var2), 'de') )}

SELECT * WHERE { %%var%% dbp-prop:redirect ?var0 . }

SELECT ?var0 WHERE { ?var1 ?var0 . ?var1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> %%var%% . }

SELECT ?var0 WHERE { ?var1 rdf:type dbp-owl:Person . ?var1 rdfs:label %%var%% . ?var1 foaf:page ?var0 . }

SELECT * WHERE { ?var1 a dbp-owl:Organisation . ?var2 dbp-owl:foundationPlace %%var0%% . ?var4 dbp-owl:developer ?var2 . ?var4 a %%var1%% . }

SELECT ?var0 ?var1 ?var2 ?var3 WHERE { ?var6 rdf:type %%var%%. ?var6 dbp-prop:name ?var0. ?var6 dbp-prop:pages ?var1. ?var6 dbp-prop:isbn ?var2. ?var6 dbp-prop:author ?var3.}

A.2. DBPSB Version 2

SELECT DISTINCT ?var0 WHERE { %%var%% skos:subject ?var0 .}

SELECT DISTINCT ?var0 WHERE { %%var%% dbpprop:redirect ?var0 . }

SELECT ?var0 ?var1 ?var2 WHERE { ?var3 dbpedia-owl:thumbnail ?var0 . ?var3 rdf:type dbpedia-owl:Person . ?var3 rdfs:label %%var%% . ?var3 foaf:page ?var1 . OPTIONAL { ?var3 foaf:homepage ?var2 .} . }

SELECT ?var0 ?var1 WHERE { { %%var%% skos:subject ?var1 . } UNION { %%var%% skos:subject ?var0 . ?var3 skos:broader ?var1 . } UNION { %%var%% skos:subject ?var2 . ?var1 skos:broader ?var0 . } }

SELECT ?var0 ?var1 ?var2 ?var3 WHERE { ?var4 dbpedia2:birthPlace %%var%% . ?var3 dbpedia-owl:birthDate ?var1 . ?var3 foaf:name ?var0 . ?var3 dbpedia-owl:deathDate ?var2 FILTER (?var1 < '1900-01-01'^^xsd:date).}

SELECT * WHERE { %%var%% ?var0 ?var1. FILTER (?var0 = dbpedia2:redirect) }

SELECT DISTINCT ?var0 WHERE { { %%var%% dbpprop:writer ?var0 . } UNION { %%var%% dbpprop:executiveProducer ?var0 . } UNION { %%var%% dbpprop:creator ?var0 . } UNION { %%var%% dbpprop:starring ?var0 . } UNION { %%var%% dbpprop:executiveProducer ?var0 . } UNION { %%var%% dbpprop:guest ?var0 . } UNION { %%var%% dbpprop:director ?var0 . } UNION { %%var%% dbpprop:producer ?var0 . } UNION { %%var%% dbpprop:series ?var0 . } }

SELECT ?var0 WHERE { %%var%% dbpedia-owl:abstract ?var0. FILTER langMatches(LANG(?var0), 'en') }

SELECT ?var0 ?var1 WHERE { ?var2 rdfs:label %%var%% . ?var0 skos:broader ?var2 . ?var0 rdfs:label ?var1 . FILTER langMatches( LANG(?var1), 'EN') }

SELECT * WHERE { ?var0 a %%var%% . ?var0 foaf:givenName ?var1 FILTER regex(?var1, '^A'). }

SELECT ?var0 WHERE { %%var%% a ?var1 . OPTIONAL { ?var1 rdfs:subClassOf ?var0 } . FILTER (!bound(?var1)) . FILTER (?var0 != ) . }

SELECT ?var0 ?var1 ?var2 ?var3 WHERE { ?var3 skos:subject %%var%% . ?var3 foaf:name ?var0 . OPTIONAL { ?var3 rdfs:comment ?var1 . FILTER (LANG(?var1) = 'en') . } OPTIONAL { ?var3 rdfs:comment ?var2 . FILTER (LANG(?var2) = 'de') . } }


SELECT DISTINCT ?var0 ?var1 WHERE { ?var2 dbpedia-owl:influenced %%var%% . ?var2 foaf:page ?var0 . ?var2 rdfs:label ?var1 FILTER (LANG(?var1)='en') }

SELECT DISTINCT ?var0 WHERE { %%var%% dbpedia2:instrument ?var0 FILTER ( langMatches(LANG(?var0), 'EN') ) }

SELECT * WHERE {{ %%var%% rdfs:comment ?var0. FILTER (LANG(?var0) = 'en')} UNION {%%var%% foaf:depiction ?var1} UNION {%%var%% foaf:homepage ?var2}}

SELECT DISTINCT ?var0 ?var1 WHERE { ?var2 rdf:type %%var%% OPTIONAL { ?var0 rdfs:label ?var1 . FILTER(LANG(?var1) = 'en') . } }

SELECT ?var0 ?var1 ?var2 WHERE { { %%var%% ?var0 ?var1. FILTER ((STR(?var0) = 'http://www.w3.org/2000/01/rdf-schema#label' && LANG(?var1) = 'en') || (STR(?var0) = 'http://dbpedia.org/ontology/abstract' && LANG(?var1) = 'en') || (STR(?var0) = 'http://www.w3.org/2000/01/rdf-schema#comment' && LANG(?var1) = 'en') || (STR(?var0) != 'http://dbpedia.org/ontology/abstract' && STR(?var0) != 'http://www.w3.org/2000/01/rdf-schema#comment' && STR(?var0) != 'http://www.w3.org/2000/01/rdf-schema#label') ) } UNION { ?var2 ?var0 %%var%% FILTER ( STR(?var0) = 'http://dbpedia.org/ontology/owner' || STR(?var0) = 'http://dbpedia.org/property/redirect' )}}

SELECT ?var1 WHERE { %%var%% rdfs:label ?var1 .}

SELECT * WHERE { ?var0 rdfs:label %%var%% ; rdf:type ?var1 . }

SELECT ?var0 WHERE { ?var1 rdf:type dbpedia-owl:Person . ?var1 rdfs:label %%var%% . ?var1 foaf:page ?var0 . }

B. Curriculum Vitae

Personal Data

Birth date: November 15th, 1980
Birth place: Cairo, Egypt
Nationality: Egyptian
Marital status: Married

Education

2010 – Present: Leipzig University, Leipzig, Germany. Ph.D., Faculty of Mathematics and Computer Science, Department of Computer Science. Thesis title: Efficient Extraction and Query Benchmarking of Wikipedia Data.

2002 – 2006: Ain Shams University, Cairo, Egypt. M.Sc., Faculty of Computer and Information Sciences, Computer Science Department. Thesis title: Intelligent Technique for Computer Virus Detection.

1997 – 2001: Ain Shams University, Cairo, Egypt. B.Sc., Faculty of Computer and Information Sciences, Computer Science Department. Grade: Excellent with honor degree.

1994 – 1997: El Tawfekeya secondary school, Cairo, Egypt. General secondary school certificate. Grade: 98.25%.

Awards and Honors

• Best research paper award at The 10th International Semantic Web Conference (ISWC2011), for the paper "DBpedia SPARQL Benchmark – Performance Assessment with Real Queries on Real Data".

• Spotlight paper at The 11th International Semantic Web Conference (ISWC2012), for the paper "DeFacto - Deep Fact Validation".


• Outstanding paper award at The Electronic Library and Information Systems Journal, for the paper "DBpedia and the Live Extraction of Structured Data from Wikipedia".

• DAAD scholarship, for obtaining the Ph.D. degree from Germany.

Research Interests

• Semantic Web.
• Linked Data.
• Ontology Engineering.
• Object Oriented Analysis and Design.

Publications

1. Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, Christian Bizer. DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Submitted to the Semantic Web Journal.

2. Amrapali Zaveri, Dimitris Kontokostas, Mohamed A. Sherif, Lorenz Bühmann, Mohamed Morsey, Sören Auer, and Jens Lehmann. User-driven Quality Evaluation of DBpedia. To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

3. Jens Lehmann, Daniel Gerber, Mohamed Morsey, and Axel-Cyrille Ngonga Ngomo. "DeFacto - Deep Fact Validation". In The 11th International Semantic Web Conference (ISWC2012), 2012.

4. Mohamed Morsey, Jens Lehmann, Sören Auer, and Axel-Cyrille Ngonga Ngomo. "Usage-Centric Benchmarking of RDF Triple Stores". In the 26th AAAI Conference on Artificial Intelligence (AAAI 2012), 2012.

5. Mohamed Morsey, Jens Lehmann, Sören Auer, Claus Stadler, and Sebastian Hellmann. "DBpedia and the Live Extraction of Structured Data from Wikipedia". In The Electronic Library and Information Systems Journal, 2012.

6. Mohamed Morsey, Jens Lehmann, Sören Auer, and Axel-Cyrille Ngonga Ngomo. "DBpedia SPARQL Benchmark – Performance Assessment with Real Queries on Real Data". In The 10th International Semantic Web Conference (ISWC2011), 2011.

7. M. Mabrouk, M. Siam, M. Hashem, and S. Arafat. ”Mobile Agents for Computer Virus Detection”, 2nd International Conference on Intelligent Computing and Information Systems (ICICIS’2005), Egypt, 2005.

8. M. Mabrouk, M. Siam, M. Hashem, and S. Arafat. ”Data Mining for Computer Virus Detection”, 2nd International Conference on Intelligent Computing and Information Systems (ICICIS’2005), Egypt, 2005.

9. M. Mabrouk, M. Siam, M. Hashem, and S. Arafat. ”Neural Networks for Computer Virus Detection”, 4th WSEAS International Conference on Neural Networks and Applications (WSEAS NNA 2003), Greece, 2003.

Relevant Experience

• 2001 – Present: Teaching assistant, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt.
• 2002 – 2004: Teaching assistant, National Institute for Civil Aviation, Cairo, Egypt.
• 2002 – 2004: Teaching assistant, Higher Institute for Specialized Technological Studies, Cairo, Egypt.

Language Skills

• English: TOEFL iBT test with score of 94.
• German: DSH test grade 1.
• Arabic: Mother tongue.

Technical and Programming Skills

• Programming Languages Skills:


– .NET frameworks 4.0, 3.5, 3.0, 2.0, 1.1, and 1.0, level is excellent, 10 years of experience.
– Visual C#, level is excellent, 9 years of experience.
– Visual C++ and Visual C++.NET, level is excellent, 9 years of experience.
– Java, level is excellent, 5 years of experience.
– Assembly Language, level is good, 2 years of experience.
– Delphi, level is moderate, 1 year of experience.

• Web Programming:
– ASP.NET 4.0.
– ASP.NET 3.5.
– ASP.NET 3.0.
– ASP.NET 2.0.
– ASP.NET 1.1.
– ASP.NET 1.0.
– JSP.
– ASP.
– AJAX.
– JavaScript.
– VB Script.

• Database Systems:
– Microsoft SQL Server 2008.
– Microsoft SQL Server 2005.
– Microsoft SQL Server 2000.
– MySQL.
– ORACLE 8.0i.

• Other Programming Skills:
– WPF.
– WWF.
– WCF.
– MFC.
– DotNetNuke Programming.
– Database programming using ADO, ADO.NET.
– ATL Programming.
– COM Programming.
– DirectX 9.0 Programming.
– ActiveX Programming.
– SDK Programming.

• Programming Certificates:
– MCAD using C#.
– Microsoft certificate of "Web Application Development Using Microsoft Visual InterDev 6.0" using ASP.

Projects

• DBpedia Live: The DBpedia project aims at extracting structured knowledge from Wikipedia and making it freely available on the Web. The main objective of DBpedia Live is to keep DBpedia always in synchronization with Wikipedia. It reads a continuous stream of updates from Wikipedia, containing the changed Wikipedia articles, processes it on-the-fly, and reprocesses those pages in order to keep the DBpedia data always up to date. It is available at http://live.dbpedia.org. Implemented in Java and Scala.

• DBpedia SPARQL Benchmark (DBPSB): DBPSB is a general SPARQL benchmark procedure, which we apply to the DBpedia knowledge base. The benchmark is based on query-log mining, clustering, and SPARQL feature analysis. In contrast to other benchmarks, we perform measurements on actually posed queries against existing RDF data. It is available at http://aksw.org/Projects/DBPSB. Implemented in Java.

• Deep Fact Validation (DeFacto): DeFacto is an algorithm for validating statements by finding confirming sources for them on the Web. It takes a statement (such as "Jamaica Inn was directed by Alfred Hitchcock") as input and then tries to find evidence for the truth of that statement by searching for information on the Web. It is available at http://aksw.org/Projects/DeFacto.html. Implemented in Java.

• eVoucher System: Building and developing a system for managing and delivering vouchers to users. I worked on this system while I was at the Technowireless software company. Implemented in C# and ASP.NET.


• Web Site Builder: Building and developing a system for constructing web sites on the fly, i.e. the user determines the requirements of the website and the system builds the HTML files for him/her. I worked on this system while I was at the ILD Online software company. Implemented using classic ASP.

• Speaker Recognition System (Graduation Project): Building and developing a system that is able to identify the speaker, i.e. a voice-print system. Implemented in Visual C++.

• Speech Recognition System: Developing a package that accepts a wave sound and applies several preprocessing techniques to it. Implemented in Visual C++.

• Image Processing Package: Developing a package to process an image, including filtering, edge detection, and mathematical and logical operations. Implemented in Visual C++.

• Graphics Package: Developing a package that enables the user to draw different shapes such as lines and circles, and to apply different graphics operations on them like translation and rotation. Implemented in Visual C++.

• Students Affairs System: Building a system to accept student details and apply different operations on them, like calculating the total marks of each student and determining the failed students. Implemented in Visual Prolog.

List of Tables

2.1. Sample RDF statements...... 10

3.1. Overview of DBpedia extractors ...... 21
3.2. Datasets linked from DBpedia ...... 30
3.3. Top 10 datasets in Sindice ordered by the number of links to DBpedia ...... 32
3.4. Sindice summary statistics for incoming links to DBpedia ...... 32
3.5. Top 10 datasets by incoming links in Sindice ...... 32

5.1. Data quality dimensions, categories and sub-categories identified in the DBpedia resources. Detectable (column D) means problem detection can be automated. Fixable (column F) means the issue is solvable by amending either the extraction framework (E), the mappings wiki (M) or Wikipedia (W). The last column marks the dataset-specific subcategories ...... 48
5.2. Overview of the manual quality evaluation ...... 53
5.3. Detected number of problems for each of the defined quality problems. IT = Incorrect triples, DR = Distinct resources, AT = Affected triples ...... 54
5.4. Results of the semi-automatic evaluation. The table shows the total number of properties that have been suggested to have the given characteristic by Step I of the semi-automatic methodology, the number of properties that would lead to at least one violation when applying the characteristic, the number of properties where the characteristic is meaningful (manually evaluated), and some metrics for the number of violations ...... 55
5.5. Classification results for training sets domain and range ...... 62
5.6. Classification results for training sets domain-range and property ...... 62
5.7. Classification results for training sets random and 20%mix ...... 62

6.1. Statistical analysis of DBPSB datasets ...... 67
6.2. Queries per second (QpS), geometric mean of query runtime in milliseconds (GM), and standard deviation of query runtime in milliseconds (SD), for the 10% dataset and 50% dataset respectively of DBPSB version 1 ...... 78


6.3. Queries per second (QpS), geometric mean of query runtime in milliseconds (GM), and standard deviation of query runtime in milliseconds (SD), for the 100% dataset and 200% dataset of DBPSB version 1 ...... 79
6.4. Queries per second (QpS), geometric mean of query runtime in milliseconds (GM), and standard deviation of query runtime in milliseconds (SD), for the 10% dataset and 50% dataset of DBPSB version 2 ...... 83
6.5. Queries per second (QpS), geometric mean of query runtime in milliseconds (GM), and standard deviation of query runtime in milliseconds (SD), for the 100% dataset of DBPSB version 2 ...... 84

7.1. Comparison of different RDF benchmarks...... 88

List of Figures

2.1. RDF statements represented as a directed graph ...... 8
2.2. Small knowledge base about William Shakespeare represented as a graph ...... 10
2.3. Sample N-Triples format ...... 11
2.4. Sample RDF/XML format ...... 11
2.5. Sample N3 format ...... 12
2.6. Sample ontology snapshot taken from the DBpedia ontology ...... 13
2.7. OWL representation of a part of our ontology in N-Triples format ...... 15
2.8. SPARQL query to get the spouse of Shakespeare's child ...... 15

3.1. Mediawiki infobox syntax for Algarve (left) and rendered infobox (right) ...... 18
3.2. Overview of DBpedia extraction framework ...... 19
3.3. Depiction of the mapping from the Greek and English Wikipedia templates about books to the same DBpedia Ontology class [Kontokostas et al., 2012] ...... 25
3.4. Snapshot of a part of the DBpedia ontology ...... 28
3.5. Growth of the DBpedia ontology ...... 28
3.6. SPARQL query to compare funding per year (from FTS) and country with the gross domestic product of that country ...... 29

4.1. General DBpedia Live system architecture ...... 34
4.2. Mapping for infobox of a book ...... 36
4.3. Number of daily requests sent to the DBpedia Live for a) SPARQL queries and b) synchronization requests from August 2012 until January 2013 ...... 42

5.1. Workflow of the data quality assessment methodology ...... 45
5.2. Overview of Deep Fact Validation ...... 55
5.3. Input data for DeFacto ...... 56

6.1. Sample query with placeholder ...... 71
6.2. Sample auxiliary query returning potential values a placeholder can assume ...... 71
6.3. QMpH for all triplestores of DBPSB version 1 ...... 73
6.4. Geometric mean of QpS of DBPSB version 1 ...... 74
6.5. QMpH for all triplestores of DBPSB version 2 ...... 74


6.6. Geometric mean of QpS of DBPSB version 2 ...... 74
6.7. Queries per Second (QpS) of DBPSB version 1 for all triplestores for 10%, 50%, 100%, and 200% ...... 75
6.8. Queries per Second (QpS) of DBPSB version 2 for all triplestores for 10%, 50% and 100% ...... 76
6.9. Comparison of triple store scalability between BSBM V2, BSBM V3, DBPSB ...... 81

Bibliography

[Agichtein and Gravano, 2000] Agichtein, E. and Gravano, L. (2000). Snowball: Extracting relations from large plain-text collections. In Proceedings of the 5th ACM International Conference on Digital Libraries, pages 85–94.

[Auer et al., 2012] Auer, S., Bühmann, L., Dirschl, C., Erling, O., Hausenblas, M., Isele, R., Lehmann, J., Martin, M., Mendes, P. N., van Nuffelen, B., Stadler, C., Tramp, S., and Williams, H. (2012). Managing the life-cycle of linked data with the LOD2 stack. In Proceedings of International Semantic Web Conference (ISWC 2012).

[Auer and Lehmann, 2007] Auer, S. and Lehmann, J. (2007). What have Innsbruck and Leipzig in common? extracting semantics from wiki content. In Proceedings of the 4th European conference on The Semantic Web: Research and Applications, volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin / Heidelberg. Springer.

[Auer et al., 2009] Auer, S., Lehmann, J., and Hellmann, S. (2009). LinkedGeoData - adding a spatial dimension to the web of data. In Proc. of 8th International Semantic Web Conference (ISWC).

[Bechhofer et al., 2004] Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D. L., Patel-Schneider, P. F., and Stein, L. A. (2004). OWL Web Ontology Language Reference. Technical report, W3C, http://www.w3.org/TR/owl-ref/.

[Beckett, 2004] Beckett, D. (2004). RDF/XML syntax specification (revised). W3C recommendation, W3C.

[Belleau et al., 2008] Belleau, F., Nolin, M.-A., Tourigny, N., Rigault, P., and Morissette, J. (2008). Bio2rdf: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5):706–716.

[Berners-Lee and Connolly, 2011] Berners-Lee, T. and Connolly, D. (2011). Notation3 (N3): A readable RDF syntax. Technical report, W3C.

[Berners-Lee et al., 2001] Berners-Lee, T., Hendler, J., and Lassila, O. (2001). The semantic web. Scientific American, 284(5):34–43.


[Bishop et al., 2011] Bishop, B., Kiryakov, A., Ognyanoff, D., Peikov, I., Tashev, Z., and Velkov, R. (2011). OWLIM: A family of scalable semantic repositories. Semantic Web, 2(1):1–10.

[Bizer, 2007] Bizer, C. (2007). Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität.

[Bizer and Cyganiak, 2009] Bizer, C. and Cyganiak, R. (2009). Quality-driven information filtering using the wiqa policy framework. Web Semantics, 7(1):1–10.

[Bizer and Schultz, 2009] Bizer, C. and Schultz, A. (2009). The Berlin SPARQL Benchmark. Int. J. Semantic Web Inf. Syst., 5(2):1–24.

[Bollacker et al., 2008] Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. (2008). Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250, New York, NY, USA. ACM.

[Brickley and Guha, 2004] Brickley, D. and Guha, R. V. (2004). RDF Vocabulary Description Language 1.0: RDF Schema. Technical report, W3C.

[Brin, 1999] Brin, S. (1999). Extracting patterns and relations from the world wide web. In Selected papers from the International Workshop on The World Wide Web and Databases, pages 172–183, London, UK. Springer-Verlag.

[Broekstra et al., 2002] Broekstra, J., Kampman, A., and van Harmelen, F. (2002). Sesame: A generic architecture for storing and querying RDF and RDF schema. In Proceedings of the 2nd International Semantic Web Conference (ISWC2002), number 2342 in Lecture Notes in Computer Science (LNCS) 7603, pages 54–68. Springer.

[Bühmann and Lehmann, 2012] Bühmann, L. and Lehmann, J. (2012). Universal OWL axiom enrichment for large knowledge bases. In Proceedings of the 18th international conference on Knowledge Engineering and Knowledge Management, EKAW'12.

[Cafarella et al., 2008] Cafarella, M. J., Halevy, A. Y., Wang, D. Z., Wu, E., and Zhang, Y. (2008). Webtables: exploring the power of tables on the web. Proc. VLDB Endow., 1(1):538–549.

[Clark et al., 2008] Clark, K. G., Feigenbaum, L., and Torres, E. (2008). SPARQL Protocol for RDF. World Wide Web Consortium, Recommendation REC-rdf-sparql-protocol-20080115.

[Dave and Berners-Lee, 2011] Dave, D. and Berners-Lee, T. (2011). Turtle - Terse RDF Triple Language. Technical report, W3C.


[Demartini et al., 2012] Demartini, G., Difallah, D., and Cudré-Mauroux, P. (2012). Zencrowd: Leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In 21st International Conference on World Wide Web WWW 2012, pages 469 – 478.

[Dividino et al., 2011] Dividino, R., Sizov, S., Staab, S., and Schueler, B. (2011). Querying for provenance, trust, uncertainty and other meta knowledge in rdf. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3).

[Erling and Mikhailov, 2007] Erling, O. and Mikhailov, I. (2007). RDF support in the virtuoso DBMS. In Auer, S., Bizer, C., Müller, C., and Zhdanova, A. V., editors, Proceedings of the 1st Conference on Social Semantic Web, volume 113 of LNI, pages 59–68. GI.

[Flemming, 2010] Flemming, A. (2010). Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität of Berlin.

[Gerber and Ngomo, 2012] Gerber, D. and Ngomo, A.-C. N. (2012). Extracting multilingual natural-language patterns for rdf predicates. In Proceedings of the 18th international conference on Knowledge Engineering and Knowledge Management, EKAW’12, pages 87–96, Berlin, Heidelberg. Springer-Verlag.

[Gerber and Ngonga Ngomo, 2011] Gerber, D. and Ngonga Ngomo, A.-C. (2011). Bootstrapping the linked data web. In 1st Workshop on Web Scale Knowledge Extraction @ ISWC 2011.

[Grant and Beckett, 2004] Grant, J. and Beckett, D. (2004). RDF test cases. W3C recommendation, World Wide Web Consortium.

[Gray, 1991] Gray, J., editor (1991). The Benchmark Handbook for Database and Transaction Systems (1st Edition). Morgan Kaufmann.

[Grishman and Yangarber, 1998] Grishman, R. and Yangarber, R. (1998). Nyu: Description of the Proteus/Pet system as used for MUC-7 ST. In Proceedings of the Seventh Message Understanding Conference (MUC-7). Morgan Kaufmann.

[Guéret et al., 2012] Guéret, C., Groth, P. T., Stadler, C., and Lehmann, J. (2012). Assessing linked data mappings using network measures. In Proceedings of the 9th Extended Semantic Web Conference, volume 7295 of Lecture Notes in Computer Science, pages 87–102. Springer.

[Hartig, 2008] Hartig, O. (2008). Trustworthiness of data on the web. In Proceed- ings of the STI Berlin & CSW PhD Workshop.

[Hartig, 2009] Hartig, O. (2009). Provenance information in the web of data. In Proceedings of the Linked Data on the Web (LDOW) Workshop at WWW.


[Hartig and Zhao, 2010] Hartig, O. and Zhao, J. (2010). Publishing and consuming provenance metadata on the web of linked data. In Proceedings of 3rd International Provenance and Annotation Workshop, pages 78–90.

[Heflin, 2004] Heflin, J. (2004). OWL Web Ontology Language Use Cases and Requirements. Technical report, W3C.

[Hellmann et al., 2009] Hellmann, S., Lehmann, J., and Auer, S. (2009). Learning of OWL class descriptions on very large knowledge bases. International Journal on Semantic Web and Information Systems, 5(2):25–48.

[Hellmann et al., 2011] Hellmann, S., Lehmann, J., and Auer, S. (2011). Learning of OWL class expressions on very large knowledge bases and its applications. In Semantic Services, Interoperability and Web Applications: Emerging Concepts, chapter 5, pages 104–130. IGI Global.

[Hoffart et al., 2010] Hoffart, J., Suchanek, F. M., Berberich, K., and Weikum, G. (2010). YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Research Report MPI-I-2010-5-007, Max-Planck-Institut für Informatik, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany.

[Hogan et al., 2010] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. (2010). Weaving the pedantic web. In Linked Data on the Web Workshop (LDOW2010) at WWW’2010.

[Hogan et al., 2012] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. (2012). An empirical survey of linked data conformance. Journal of Web Semantics, 14:14–44.

[Iglesias and Lehmann, 2011] Iglesias, J. and Lehmann, J. (2011). Towards integrating fuzzy logic capabilities into an ontology-based inductive logic programming framework. In Proc. of the 11th International Conference on Intelligent Systems Design and Applications (ISDA).

[Knuth et al., 2012] Knuth, M., Hercher, J., and Sack, H. (2012). Collaboratively patching linked data. Proceedings of 2nd International Workshop on Usage Analysis and the Web of Data (USEWOD 2012), co-located with the 21st International World Wide Web Conference 2012 (WWW 2012).

[Kontokostas et al., 2012] Kontokostas, D., Bratsas, C., Auer, S., Hellmann, S., Antoniou, I., and Metakides, G. (2012). Internationalization of linked data: The case of the greek dbpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51 – 61.

[Kreis, 2011] Kreis, P. (2011). Design of a quality assessment framework for the dbpedia knowledge base. Master's thesis, Freie Universität Berlin.


[Krejcie and Morgan, 1970] Krejcie and Morgan (1970). Determining sample size for research activities. Educational and Psychological Measurement, 30:607–610.

[Lagoze et al., 2008] Lagoze, C., de Sompel, H. V., Nelson, M., and Warner, S. (2008). The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html.

[Lehmann, 2007] Lehmann, J. (2007). Hybrid learning of ontology classes. In Perner, P., editor, Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, 2007, Proceedings, volume 4571 of Lecture Notes in Computer Science, pages 883–898. Springer.

[Lehmann, 2009] Lehmann, J. (2009). DL-Learner: learning concepts in description logics. Journal of Machine Learning Research (JMLR), 10:2639–2642.

[Lehmann, 2010] Lehmann, J. (2010). Learning OWL Class Expressions. PhD thesis, University of Leipzig. PhD in Computer Science.

[Lehmann et al., 2011] Lehmann, J., Auer, S., Bühmann, L., and Tramp, S. (2011). Class expression learning for ontology engineering. Journal of Web Semantics, 9:71 – 81.

[Lehmann et al., 2009] Lehmann, J., Bizer, C., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., and Hellmann, S. (2009). DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165.

[Lehmann and Bühmann, 2010] Lehmann, J. and Bühmann, L. (2010). ORE - a tool for repairing and enriching knowledge bases. In Proceedings of the 9th International Semantic Web Conference (ISWC2010), Lecture Notes in Computer Science, Berlin / Heidelberg. Springer.

[Lehmann et al., 2012] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. (2012). Defacto - deep fact validation. In Proceedings of the 11th International Semantic Web Conference (ISWC2012).

[Lehmann and Haase, 2009] Lehmann, J. and Haase, C. (2009). Ideal downward refinement in the EL description logic. In Inductive Logic Programming, 19th International Conference, ILP 2009, Leuven, Belgium.

[Lehmann and Hitzler, 2007a] Lehmann, J. and Hitzler, P. (2007a). Foundations of refinement operators for description logics. In Blockeel, H., Ramon, J., Shavlik, J. W., and Tadepalli, P., editors, Inductive Logic Programming, 17th International Conference, ILP 2007, Corvallis, OR, USA, June 19-21, 2007, volume 4894 of Lecture Notes in Computer Science, pages 161–174. Springer. Best Student Paper Award.


[Lehmann and Hitzler, 2007b] Lehmann, J. and Hitzler, P. (2007b). A refinement operator based learning algorithm for the ALC description logic. In Blockeel, H., Ramon, J., Shavlik, J. W., and Tadepalli, P., editors, Inductive Logic Programming, 17th International Conference, ILP 2007, Corvallis, OR, USA, June 19-21, 2007, volume 4894 of Lecture Notes in Computer Science, pages 147–160. Springer. Best Student Paper Award.

[Lehmann and Hitzler, 2010] Lehmann, J. and Hitzler, P. (2010). Concept learning in description logics using refinement operators. Machine Learning journal, 78(1- 2):203–250.

[Lehmann et al., 2013] Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kon- tokostas, D., Mendes, P. N., Hellmann, S., Morsey, M., Sahnwaldt, J. C., Stadler, C., van Kleef, P., Auer, S., Bizer, C., and Idehen, K. (2013). DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal.

[Lehmann et al., 2007] Lehmann, J., Schüppel, J., and Auer, S. (2007). Discovering unknown connections - the DBpedia relationship finder. In Proceedings of 1st Conference on Social Semantic Web. Leipzig (CSSW'07), 24.-28. September, volume P-113 of GI-Edition of Lecture Notes in Informatics (LNI). Bonner Köllen Verlag.

[Martin et al., 2013] Martin, M., Stadler, C., Frischmuth, P., and Lehmann, J. (2013). Increasing the financial transparency of european commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions.

[Meiser et al., 2011] Meiser, T., Dylla, M., and Theobald, M. (2011). Interactive reasoning in uncertain RDF knowledge bases. In Berendt, B., de Vries, A., Fan, W., and Macdonald, C., editors, Proceedings of the 20th ACM international conference on Information and knowledge management, pages 2557–2560.

[Mendes P.N., 2012] Mendes, P. N., Mühleisen, H., and Bizer, C. (2012). Sieve: Linked data quality assessment and fusion. In Proceedings of the 2012 Joint EDBT/ICDT Workshops.

[Minack et al., 2009] Minack, E., Siberski, W., and Nejdl, W. (2009). Benchmarking fulltext search performance of RDF stores. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, pages 81–95.

[Morsey et al., 2011] Morsey, M., Lehmann, J., Auer, S., and Ngonga Ngomo, A.-C. (2011). DBpedia SPARQL Benchmark – Performance Assessment with Real Queries on Real Data. In Proceedings of the 10th international conference on The semantic web - Volume Part I.


[Morsey et al., 2012a] Morsey, M., Lehmann, J., Auer, S., and Ngonga Ngomo, A.- C. (2012a). Usage-Centric Benchmarking of RDF Triple Stores. In Proceedings of the 26th AAAI Conference on Artificial Intelligence (AAAI 2012).

[Morsey et al., 2012b] Morsey, M., Lehmann, J., Auer, S., Stadler, C., and Hellmann, S. (2012b). DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27.

[Nakamura et al., 2007] Nakamura, S., Konishi, S., Jatowt, A., Ohshima, H., Kondo, H., Tezuka, T., Oyama, S., and Tanaka, K. (2007). Trustworthiness analysis of web search results. In Research and Advanced Technology for Digital Libraries, 11th European Conference, volume 4675, pages 38–49.

[Ngonga Ngomo and Auer, 2011] Ngonga Ngomo, A.-C. and Auer, S. (2011). Limes - a time-efficient approach for large-scale link discovery on the web of data. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Three.

[Ngonga Ngomo and Schumacher, 2009] Ngonga Ngomo, A.-C. and Schumacher, F. (2009). Border flow – a local graph clustering algorithm for natural language processing. In Proceedings of the 10th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2009), pages 547–558. Best Presentation Award.

[Nguyen et al., 2007] Nguyen, D. P. T., Matsuo, Y., and Ishizuka, M. (2007). Relation extraction from wikipedia using subtree mining. In Proceedings of the 22nd national conference on Artificial intelligence - Volume 2, pages 1414–1420.

[Oren et al., 2008] Oren, E., Delbru, R., Catasta, M., Cyganiak, R., Stenzhorn, H., and Tummarello, G. (2008). Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52.

[Owens et al., 2008a] Owens, A., Gibbins, N., and mc schraefel (2008a). Effective Benchmarking for RDF Stores Using Synthetic Data.

[Owens et al., 2008b] Owens, A., Seaborne, A., Gibbins, N., and mc schraefel (2008b). Clustered TDB: A clustered triple store for jena. Technical report, Electronics and Computer Science, University of Southampton.

[Pan et al., 2005] Pan, Z., Guo, Y., and Heflin, J. (2005). LUBM: A benchmark for OWL knowledge base systems. In Journal of Web Semantics, volume 3, pages 158–182.

[Pasternack and Roth, 2011a] Pasternack, J. and Roth, D. (2011a). Generalized fact-finding. In Proceedings of the 20th international conference companion on World wide web, WWW ’11, pages 99–100, New York, NY, USA. ACM.


[Pasternack and Roth, 2011b] Pasternack, J. and Roth, D. (2011b). Making better informed trust decisions with generalized fact-finding. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Three, IJCAI'11, pages 2324–2329. AAAI Press.

[Prud’hommeaux and Seaborne, 2008] Prud’hommeaux, E. and Seaborne, A. (2008). SPARQL query language for RDF. W3C recommendation, W3C.

[Sarasua et al., 2012] Sarasua, C., Simperl, E., and Noy, N. (2012). Crowdmap: Crowdsourcing ontology alignment with microtasks. In Proceedings of the 11th International Semantic Web Conference (ISWC2012), Lecture Notes in Computer Science, pages 525–541. Springer Berlin Heidelberg.

[Schmidt et al., 2009] Schmidt, M., Hornung, T., Lausen, G., and Pinkel, C. (2009). SP2Bench: A SPARQL Performance Benchmark. In Proceedings of the 25th International Conference on Data Engineering, ICDE 2009, pages 222–233. IEEE.

[Stadler et al., 2012] Stadler, C., Lehmann, J., Höffner, K., and Auer, S. (2012). Linkedgeodata: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354.

[Stadler et al., 2010] Stadler, C., Martin, M., Lehmann, J., and Hellmann, S. (2010). Update Strategies for DBpedia Live. In 6th Workshop on Scripting and Development for the Semantic Web Colocated with ESWC 2010 30th or 31st May, 2010 Crete, Greece.

[Stickler, 2005] Stickler, P. (2005). CBD - concise bounded description. Retrieved February 15, 2011, from http://www.w3.org/Submission/CBD/.

[Suchanek et al., 2008] Suchanek, F. M., Kasneci, G., and Weikum, G. (2008). Yago: A large ontology from wikipedia and wordnet. Journal of Web Semantics, 6(3):203–217.

[Tacchini et al., 2009] Tacchini, E., Schultz, A., and Bizer, C. (2009). Experiments with wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer.

[Theoharis et al., 2011] Theoharis, Y., Fundulaki, I., Karvounarakis, G., and Christophides, V. (2011). On provenance of queries on semantic web data. IEEE Internet Computing, 15:31–39.

[TPC, 2012] TPC (2012). Transaction processing performance council website (TPC). http://www.tpc.org.

[W3C, 2004] W3C (2004). Resource description framework (rdf). http://www.w3.org/RDF/.


[W3C, 2009] W3C (2009). W3C semantic web activity. Last visited 8/6/2010.

[Wang et al., 2012] Wang, J., Kraska, T., Franklin, M. J., and Feng, J. (2012). Crowder: crowdsourcing entity resolution. Proc. VLDB Endow., 5:1483–1494.

[Wikipedia, 2012] Wikipedia (2012). Benchmark — Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Benchmark_(computing). [Online; accessed 12-September-2012].

[Wikipedia, 2013] Wikipedia (2013). SPARQL — Wikipedia, The Free Encyclo- pedia. [Online; accessed 31-March-2013].

[Wu and Weld, 2007] Wu, F. and Weld, D. S. (2007). Autonomously semantifying wikipedia. In CIKM ’07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pages 41–50, New York, NY, USA. ACM.

[Yan et al., 2009] Yan, Y., Okazaki, N., Matsuo, Y., Yang, Z., and Ishizuka, M. (2009). Unsupervised relation extraction by mining wikipedia texts using information from the web. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, ACL ’09, pages 1021–1029.

[Yin et al., 2007] Yin, X., Han, J., and Yu, P. S. (2007). Truth discovery with multiple conflicting information providers on the web. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1048–1052.

[Yu, 2007] Yu, L. (2007). Introduction to Semantic Web and Semantic Web services. Chapman & Hall/CRC, Boca Raton, FL.

[Zaveri et al., 2013a] Zaveri, A., Kontokostas, D., Sherif, M. A., Bühmann, L., Morsey, M., Auer, S., and Lehmann, J. (2013a). User-driven quality evaluation of dbpedia. To appear in Proceedings of 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM.

[Zaveri et al., 2013b] Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., and Auer, S. (2013b). Quality assessment methodologies for linked open data. Semantic Web journal.

[Zaveri et al., 2013c] Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., and Auer, S. (2013c). Quality assessment methodologies for linked open data. Under review, available at http://www.semantic-web-journal.net/content/quality-assessment-methodologies-linked-open-data.

Selbständigkeitserklärung (Declaration of Independent Authorship)

I hereby declare that I have prepared this dissertation independently and without impermissible outside help. I have not used any sources or aids other than those indicated, and all passages taken literally or in substance from published or unpublished writings, as well as all statements based on oral information, have been marked as such. Likewise, all materials provided or services rendered by other persons have been marked as such.

Leipzig, den 13.4.2014

Mohamed Mabrouk Mawed Morsey

120