
User-driven Quality Evaluation of DBpedia∗

Amrapali Zaveri*, Dimitris Kontokostas*, Mohamed A. Sherif*, Lorenz Bühmann*, Mohamed Morsey*, Jens Lehmann*, Sören Auer+

* AKSW/BIS, Universität Leipzig, PO Box 100920, 04009 Leipzig, Germany
+ CS/EIS, Universität Bonn, Römerstraße 164, 53117 Bonn, Germany

ABSTRACT
Linked Open Data (LOD) comprises an unprecedented volume of structured datasets on the Web. However, these datasets are of varying quality, ranging from extensively curated datasets to crowdsourced and even extracted data of relatively low quality. We present a methodology for assessing the quality of linked data resources, which comprises a manual and a semi-automatic process. The first phase includes the detection of common quality problems and their representation in a quality problem taxonomy. In the manual process, the second phase comprises the evaluation of a large number of individual resources, according to the quality problem taxonomy, via crowdsourcing. This process is accompanied by a tool wherein a user assesses an individual resource and evaluates each fact for correctness. The semi-automatic process involves the generation and verification of schema axioms. We report the results obtained by applying this methodology to DBpedia. We identified 17 data quality problem types, and 58 users assessed a total of 521 resources. Overall, 11.93% of the evaluated DBpedia triples were identified to have some quality issue. Applying the semi-automatic component yielded a total of 222,982 triples that have a high probability of being incorrect. In particular, we found that problems such as object values being incorrectly extracted, irrelevant extraction of information and broken links were the most recurring quality problems. With this study, we not only aim to assess the quality of this sample of DBpedia resources but also adopt an agile methodology to improve the quality in future versions by regularly providing feedback to the DBpedia maintainers.

Keywords
Evaluation, DBpedia, Data Quality, RDF, Extraction

∗ Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. I-SEMANTICS 2013, 9th Int. Conf. on Semantic Systems, Sept. 4-6, 2013, Graz, Austria. Copyright 2012 ACM 978-1-4503-1972-0 $10.00.

1. INTRODUCTION
The advent of semantic web technologies, as an enabler of Linked Open Data (LOD), has provided the world with an unprecedented volume of structured data, currently amounting to 50 billion facts represented as RDF triples. Although publishing large amounts of data on the Web is certainly a step in the right direction, the data is only as usable as its quality. On the Data Web, we have varying quality of information covering various domains. There are a large number of high-quality datasets (in particular in the life-sciences domain), which have been carefully curated over decades and recently published on the Web. There are, however, also many datasets which were extracted from unstructured and semi-structured information or are the result of some crowdsourcing process, where large numbers of users contribute small parts. DBpedia [1, 17] is actually an example of both: a dataset extracted from the result of a crowdsourcing process. Hence, quality problems are inherent in DBpedia. This is not a problem per se, since quality usually means fitness for a certain use case [12]. Hence, even datasets with quality problems might be useful for certain applications, as long as the quality is in the required range.
In the case of DBpedia, for example, the data quality is perfectly sufficient for enriching Web search with facts or suggestions about common-sense information, such as entertainment topics. In such a scenario, where the DBpedia background knowledge can, for example, be used to show the movies Angelina Jolie was starring in and the actors she played with, it is rather negligible if, in relatively few cases, a movie or an actor is missing. For developing a medical application, on the other hand, the quality of DBpedia is probably completely insufficient. Note that also on the traditional document-oriented Web we have varying quality of information, and still the Web is perceived to be extremely useful by most people. Consequently, a key challenge is to determine the quality of datasets published on the Web and to make this quality information explicit. Unlike the document Web, where information quality can only be defined indirectly (e.g. via PageRank) or vaguely, for structured information we can have much more concrete and measurable data quality indicators, such as correctness of facts, adequacy of semantic representation or degree of coverage.

In this paper, we devise a data quality assessment methodology which comprises a manual and a semi-automatic process. We empirically assess, based on this methodology, the data quality of one of the major knowledge hubs on the Data Web – DBpedia. The first phase includes the detection of common quality problems and their representation in a comprehensive taxonomy of potential quality problems. In the manual process, the second phase comprises the evaluation of a large number of individual resources, according to the quality problem taxonomy, using crowdsourcing in order to evaluate the type and extent of data quality problems occurring in DBpedia. Here we would like to clarify the use of the term crowdsourcing in this paper. Crowdsourcing typically involves the creation of HITs (Human Intelligence Tasks), submitting them to a crowdsourcing platform (e.g. Amazon Mechanical Turk^1) and providing a (financial) reward for each HIT [11]. However, we use the broader sense of the word, as a large-scale problem-solving approach by which a problem is divided into several smaller tasks (assessing the quality of each triple, in this case) that can be independently solved by a large group of people. Each represented fact is evaluated for correctness by each user and, if found problematic, annotated with one of 17 pre-defined quality criteria. This process is accompanied by a tool wherein a user assesses an individual resource and evaluates each fact for correctness.

The semi-automatic process involves the generation and verification of schema axioms, which yielded a total of 222,982 triples that have a high probability of being incorrect. We find that while a substantial number of problems exist, the overall quality, with an error rate of less than 11.93%, is relatively high. With this study we not only aim to assess the quality of DBpedia but also to adopt a methodology to improve the quality in future versions by regularly providing feedback to the DBpedia maintainers to fix these problems.

Our main contributions are:
• a crowdsourcing-based methodology for data quality assessment (Section 2),
• a comprehensive quality issue taxonomy comprising common knowledge extraction problems (Section 3),
• a crowdsourcing-based data quality assessment tool (Section 4),
• an empirical data quality analysis of the DBpedia dataset performed using crowdsourcing (Section 5) and
• a semi-automated evaluation of data quality problems in DBpedia (Section 5).
We survey related work in Section 6 and conclude with an outlook on future work in Section 7.

2. ASSESSMENT METHODOLOGY
In this section, we describe a generalized methodology for the assessment and subsequent data quality improvement of resources belonging to a dataset. The assessment methodology we propose is depicted in Figure 1. It consists of the following four steps: 1. Resource selection, 2. Evaluation mode selection, 3. Resource evaluation and 4. Data quality improvement. In the following, we describe these steps in more detail.

Step I: Resource selection. In this first step, the resources belonging to a particular dataset are selected. This selection can be performed in three different ways:
• Per class: select resources belonging to a particular class
• Completely random: select a random resource from the dataset
• Manual: select a resource manually from the dataset
Choosing resources per class (e.g. animal, sport, place etc.) gives the user the flexibility to choose resources belonging to only those classes she is familiar with. However, when choosing resources from a class, the selection should be made in proportion to the number of instances of that class. Random selection, on the other hand, ensures an unbiased and uniform coverage of the underlying dataset. In the manual selection option, the user is free to select resources with problems that she has perhaps previously identified.

Step II: Evaluation mode selection. The assignment of the resources selected in Step I to a person or machine can be accomplished in the following three ways:
• Manual: the selected resources are assigned to a person (or group of individuals) who will then proceed to manually evaluate the resources individually.
• Semi-automatic: the selected resources are assigned to a semi-automatic tool which performs data quality assessment employing some form of user feedback.
• Automatic: the selected resources are given as input to an automatic tool which performs the quality assessment without any user involvement.
For the semi-automatic evaluation, axiom learning (schema enrichment) can be applied as shown in [4] and provided by the DL-Learner framework [16, 19], where the workflow can be as follows: (1) based on the instance data, generate OWL axioms which can also be seen as restrictions^2, e.g. learn characteristics (irreflexivity, (inverse) functionality, asymmetry) of properties as well as definitions and disjointness of classes in the knowledge base; (2) ask queries via SPARQL or a reasoner for violations of these restrictions, e.g. in the case of an irreflexive property, triples where subject and object are the same would violate the irreflexivity characteristic. In the automatic case, a possible approach is to check for inconsistencies and other modelling problems as, e.g., described in [18].

^1 http://mturk.com
^2 A local Unique Name Assumption is used, i.e. every named individual is assumed to be different from every other, unless explicitly stated otherwise.
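As an illustration of step (2), the following minimal SPARQL sketch (our own illustrative query, not necessarily the one issued by DL-Learner) retrieves violations of a learned irreflexivity axiom, i.e. triples whose subject and object coincide; dbpedia-owl:followingEvent serves purely as an example property and a DBpedia SPARQL endpoint is assumed:

  PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
  # triples violating irreflexivity: subject equals object
  SELECT ?s WHERE {
    ?s dbpedia-owl:followingEvent ?s .
  }

Analogous queries can surface functionality violations (a subject with two distinct values for the property) and asymmetry violations (the property holding in both directions between two resources).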

[Figure 1: Workflow of the data quality assessment methodology — resource selection (per class, random, manual), evaluation mode selection (manual, semi-automatic, automatic), resource evaluation (pre-selection of triples, evaluation of triples, list of invalid facts, Patch Ontology) and data quality improvement.]

Step III: Resource evaluation. In case of manual assignment of resources, the person (or group of individuals) evaluates each resource individually to detect the potential data quality problems. In order to support this step, a quality assessment tool can be used which allows a user to evaluate each individual triple belonging to a particular resource. If, in case of Step II, the selected resources are assigned to a semi-automatic tool, the tool points to triples likely to be wrong. For example, domain or range problems are identified by the tool and then assigned to a person to verify the correctness of the results.

Step IV: Data quality improvement. After the evaluation of resources and the identification of potential quality problems, the next step is to improve the data quality. There are at least two ways to perform an improvement:
• Direct: editing the triple identified to contain the problem with the correct value
• Indirect: using the Patch Request Ontology^3 [13], which allows gathering user feedback about erroneous triples

^3 http://141.89.225.43/patchr/ontologies/patchr.ttl#

3. QUALITY PROBLEM TAXONOMY
A systematic review done in [23] identified a number of different data quality dimensions (criteria) applicable to Linked Data. After carrying out an initial data quality assessment on DBpedia (as part of the first phase of the manual assessment methodology, cf. Section 5.1), the problems identified were mapped to this list of dimensions. In particular, Accuracy, Relevancy, Representational-consistency and Interlinking were identified to be problems affecting a large number of DBpedia resources. Additionally, these dimensions were further divided into categories and sub-categories. Table 1 gives an overview of these data quality dimensions along with their categories and sub-categories. We indicate whether the problems are automatically detectable (column D) and fixable (column F). The sub-categories marked with a ✓ in column D refer to those categories that can be automatically identified, such as invalid datatypes ("1981-01-01T00:00:00+02:00"^^xsd:gYear), irrelevant properties (dbpprop:imageCaption) or dead links. The column F refers to those categories that can be automatically amended, like fixing an invalid datatype ("1981"^^xsd:gYear) or removing triples with irrelevant properties and dead links. If the problem is fixable, we determined whether it can be fixed by amending (i) the extraction framework (E), (ii) the mappings wiki (M) or (iii) Wikipedia itself (W). Moreover, the table specifies whether the problems are specific to DBpedia (marked with a ✓) or could potentially occur in any RDF dataset. For example, the sub-category Special template not properly recognized is a problem that occurs only in DBpedia, due to the presence of specific keywords in Wikipedia articles that do not cite any references or resources (e.g. {{Unreferenced stub|auto=yes}}). On the other hand, the problems that are not DBpedia-specific can occur in any other dataset. In the following, we provide the quality problem taxonomy and discuss each of the dimensions along with its categories and sub-categories in detail by providing examples.

Accuracy.
Accuracy is defined as the extent to which data is correct, that is, the degree to which it correctly represents the real-world facts and is also free of error [23]. We further classify this dimension into the categories (i) object incorrectly extracted, (ii) datatype problems and (iii) implicit relationship between attributes.
Object incorrectly extracted. This category refers to those problems which arise when the object value of a triple is flawed. This may occur when the value is either (i) incorrectly extracted, (ii) incompletely extracted or (iii) the special template in Wikipedia is not recognized:
• Object value is incorrectly extracted, e.g.: dbpedia:Oregon_Route_238 dbpprop:map "238.0"^^<http://dbpedia.org/datatype/second>. This resource about the state highway Oregon Route 238 has the incorrect property 'map' with value 238. In Wikipedia the attribute 'map' refers to the image name as the value: map=Oregon Route 238.svg. The DBpedia extraction only took the value 238 from this attribute value and gave it the datatype 'second', assuming it is a time value, which is incorrect.
• Object value is incompletely extracted, e.g.: dbpedia:Dave_Dobbyn dbpprop:dateOfBirth "3"^^xsd:integer. In this example, only the day of birth of a person is extracted and mapped to the 'dateOfBirth' property, when it should have been the entire date, i.e. day, month and year. Thus, the object value is not completely extracted.
• Special template not properly recognized, e.g.: dbpedia:328_Gudrun dbpprop:auto "yes"@en. Certain article classifications in Wikipedia (such as "This article does not cite any references or sources.") are performed via special templates (e.g. {{Unreferenced stub|auto=yes}}). Such templates should be put on a black list and omitted by the DBpedia extraction in order to prevent non-meaningful triples.

Datatype problems. This category refers to those triples which are extracted with an incorrect datatype for a typed literal.
• Datatype incorrectly extracted, e.g.: dbpedia:Stephen_Fry dbpedia-owl:activeYearsStartYear "1981-01-01T00:00:00+02:00"^^xsd:gYear. In this case, the DBpedia ontology datatype property activeYearsStartYear has xsd:gYear as range. Although the datatype declaration is correct, the value is formatted as an xsd:dateTime. The expected value is "1981"^^xsd:gYear.

Implicit relationship between attributes. This category of problems may arise due to (i) representation of one fact in several attributes, (ii) several facts encoded in one attribute or (iii) an attribute value computed from another attribute value in Wikipedia.
• One fact is encoded in several attributes, e.g.: dbpedia:Barlinek dbpprop:postalCodeType "Postal code"@en. In this example, the value of the postal code of the town of Barlinek is encoded in two attributes, 'postal code type = Postal code' and 'postalcode = 74-320'. DBpedia extracts both these attributes separately instead of combining them to produce one triple, such as dbpedia:Barlinek dbpprop:postalCode "74-320"@en.
• Several facts are encoded in one attribute, e.g.: dbpedia:Picathartes dbpedia-owl:synonym "Galgulus Wagler, 1827 (non Brisson, 1760: preoccupied)"@en. In this example, even though the triple is not incorrect, it contains two pieces of information. Only the first word is the synonym; the rest of the value is a reference to that synonym. In Wikipedia, this fact is represented as "synonyms = ''Galgulus'' <small>Wagler, 1827 (''non'' [[Mathurin Jacques Brisson|Brisson]], 1760: [[Coracias|preoccupied]])</small>". The DBpedia framework should ideally recognize this and separate these facts into several triples.
• Attribute value computed from another attribute value, e.g.: dbpedia:Barlinek dbpprop:populationDensityKm "auto"@en. In Wikipedia, this attribute is represented as "population density km2 = auto". The word "auto" is an indication in Wikipedia that the value associated with that attribute should be computed "automatically". In this case, the population density is computed automatically by dividing the population by the area.
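Problems of this kind are among the automatically detectable ones (column D in Table 1). As a purely illustrative sketch, and not the actual detection logic of the DBpedia extraction framework, a query along the following lines would list activeYearsStartYear values whose lexical form is not a plain four-digit year on a DBpedia SPARQL endpoint:

  PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
  PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
  # values typed or formatted inconsistently with xsd:gYear
  SELECT ?s ?year WHERE {
    ?s dbpedia-owl:activeYearsStartYear ?year .
    FILTER ( datatype(?year) != xsd:gYear || !regex(str(?year), "^-?[0-9]{4}$") )
  }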
Relevancy.
Relevancy refers to the provision of information which is in accordance with the task at hand and important to the users' query [23]. The only category of this dimension, Irrelevant information extracted, can be further sub-divided into the following sub-categories: (i) extraction of attributes containing layout information, (ii) image related information, (iii) redundant attribute values and (iv) other irrelevant information.
• Extraction of attributes containing layout information, e.g.: dbpedia:Lalsyri dbpprop:pushpinLabelPosition "bottom"@en. Information related to the layout of a page in Wikipedia, such as the position of the label on a pushpin map relative to the pushpin coordinate marker, in this example specified as "bottom", is irrelevant when extracted in DBpedia.
• Image related information, e.g.: dbpedia:Three-banded_Plover dbpprop:imageCaption "At Masai Mara National Reserve, Kenya"@en. Extraction of an image caption or the name of an image is irrelevant in DBpedia, as the image is not displayed for any DBpedia resource.
• Redundant attribute values, e.g.: the resource dbpedia:Niedersimmental_District contains the redundant properties dbpedia-owl:thumbnail, foaf:depiction and dbpprop:imageMap with the same value "Karte Bezirk Niedersimmental 2007.png" as the object.
• Other irrelevant information, e.g.: dbpedia:IBM_Personal_Computer dbpedia:Template:Infobox_information_appliance "type"@en. Information regarding a template's infobox, in this case with an object value of "type", is completely irrelevant.

Representational-consistency.
Representational-consistency is defined as the degree to which the format and structure of information conforms to previously returned information and other datasets [23], and has the following category:
• Representation of number values, e.g.: dbpedia:Drei_Flüsse_Stadion dbpprop:seatingCapacity "20"^^xsd:integer. In Wikipedia, the seating capacity of this stadium has the value "20.000", but in DBpedia the displayed value is only 20. This is because the value is inconsistently represented with a dot after the first two digits instead of a comma.

Interlinking.
Interlinking is defined as the degree to which entities that represent the same concept are linked to each other [23]. This type of problem is recorded when links to external websites or external data sources are either incorrect, do not show any information or have expired. We further classify this dimension into the following categories:
• External websites: Wikipedia usually contains links to external web pages such as, for example, the home page of a company or a music band. It may happen that these links are incorrect, do not work or are unavailable.
• Interlinks with other datasets: Linked Data mandates interlinks between datasets. These links can either be incorrectly mapped or may not contain useful information. These problems are recorded in the following sub-categories: 1. links to Wikimedia, 2. links to Freebase, 3. links to Geospecies, 4. links generated via the Flickr wrapper.
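Broken external links cannot be verified with SPARQL alone, but the candidate links can be enumerated with a query and then checked with plain HTTP requests. A possible sketch, assuming the DBpedia property dbpedia-owl:wikiPageExternalLink for external websites, is:

  PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
  # candidate external links to be dereferenced and checked for dead links
  SELECT ?s ?link WHERE {
    ?s dbpedia-owl:wikiPageExternalLink ?link .
  } LIMIT 1000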
Dimension | Category | Sub-category | D | F | DBpedia specific
Accuracy | Triple incorrectly extracted | Object value is incorrectly extracted | – | E | –
Accuracy | Triple incorrectly extracted | Object value is incompletely extracted | – | E | –
Accuracy | Triple incorrectly extracted | Special template not properly recognized | ✓ | E | ✓
Accuracy | Datatype problems | Datatype incorrectly extracted | ✓ | E | –
Accuracy | Implicit relationship between attributes | One fact encoded in several attributes | – | M | ✓
Accuracy | Implicit relationship between attributes | Several facts encoded in one attribute | – | E | –
Accuracy | Implicit relationship between attributes | Attribute value computed from another attribute value | – | E + M | ✓
Relevancy | Irrelevant information extracted | Extraction of attributes containing layout information | ✓ | E | ✓
Relevancy | Irrelevant information extracted | Redundant attribute values | ✓ | – | –
Relevancy | Irrelevant information extracted | Image related information | ✓ | E | ✓
Relevancy | Irrelevant information extracted | Other irrelevant information | ✓ | E | –
Representational-consistency | Representation of number values | Inconsistency in representation of number values | ✓ | W | –
Interlinking | External links | External websites | ✓ | W | –
Interlinking | Interlinks with other datasets | Links to Wikimedia | ✓ | E | –
Interlinking | Interlinks with other datasets | Links to Freebase | ✓ | E | –
Interlinking | Interlinks with other datasets | Links to Geospecies | ✓ | E | –
Interlinking | Interlinks with other datasets | Links generated via Flickr wrapper | ✓ | E | –

Table 1: Data quality dimensions, categories and sub-categories identified in the DBpedia resources. Detectable (column D) means problem detection can be automated. Fixable (column F) means the issue is solvable by amending either the extraction framework (E), the mappings wiki (M) or Wikipedia (W). The last column marks the dataset-specific sub-categories.

4. A CROWDSOURCING QUALITY ASSESSMENT TOOL
In order to assist several users in assessing the quality of a resource, we developed the TripleCheckMate tool^4, aligned with the methodology described in Section 2, in particular with Steps 1–3. To use the tool, the user is required to authenticate herself, which not only prevents spam but also helps in keeping track of her evaluations. After authenticating herself, she proceeds with the selection of a resource (Step 1). She is provided with three options: (i) per class, (ii) completely random and (iii) manual (as described in Step I of the assessment methodology).

After selecting a resource, the user is presented with a table showing each triple belonging to that resource on a single row. Step 2 involves the user evaluating each triple and checking whether it contains a data quality problem. A link to the original Wikipedia page for the chosen resource is provided on top of the page, which facilitates checking against the original values. If a triple contains a problem, she checks the box "is wrong". Moreover, she is provided with a taxonomy of pre-defined data quality problems, where she assigns each incorrect triple to a problem type. If the detected problem does not match any of the existing types, she has the option to provide a new type and extend the taxonomy. After evaluating one resource, the user saves the evaluation and proceeds to choose another random resource, following the same procedure.

Another important feature of the tool is that it allows measuring inter-rater agreement. That is, when a user selects a random method (Any or Class) to choose a resource, there is a 50% probability that she is presented with a resource that was already evaluated by another user. This probability, as well as the number of evaluations per resource, is configurable. Allowing many users to evaluate a single resource not only helps to determine whether incorrect triples are recognized correctly but also to detect incorrect evaluations (e.g. incorrect classification of the problem type or marking correct triples as incorrect), especially when crowdsourcing the quality assessment of resources. One further important feature of the tool is that, although it was built for DBpedia, it is parametrizable to accept any endpoint and, with very few adjustments in the database back-end (i.e. ontology classes and problem types), one could use it for any Linked Data dataset (open or closed).

^4 Available at http://github.com/AKSW/TripleCheckMate

5. EVALUATION OF DBPEDIA DATA QUALITY

5.1 Evaluation Methodology

Manual Methodology
We performed the assessment of the quality of DBpedia in two phases: Phase I: problem detection and creation of the taxonomy, and Phase II: evaluation via crowdsourcing.

Phase I: Creation of quality problem taxonomy. In the first phase, two researchers independently assessed the quality of 20 DBpedia resources each. During this phase, an initial list of data quality problems that occurred in each resource was identified. These identified problems were mapped to the different quality dimensions from [23]. After analyzing the root cause of these problems, a refinement of the quality dimensions was performed to obtain a finer classification of the dimensions. This classification of the dimensions into sub-categories resulted in a total of 17 types of data quality problems (cf. Table 1), as described in Section 3.

Phase II: Crowdsourcing quality assessment. In the second phase, we crowdsourced the quality evaluation, wherein we invited researchers who are familiar with RDF to use the TripleCheckMate tool (described in Section 4). First, each user, after authenticating herself, chooses a resource by one of the three options mentioned in Section 2. Thereafter, the extracted facts about that resource are shown to the user. The user then looks at each individual fact, records whether it contains a data quality problem and maps it to the type of quality problem.

Semi-automatic Methodology
We applied the semi-automatic method (cf. Section 2), which consists of two steps: (1) the generation of a particular set of schema axioms for all properties in DBpedia and (2) the manual verification of the axioms.

Step I: Automatic creation of an extended schema. In this step, the enrichment functionality of DL-Learner [4] for SPARQL endpoints was applied. Thereby, for all properties in DBpedia, axioms expressing the (inverse) functional, irreflexive and asymmetric characteristics were generated, with a minimum confidence value of 0.95. For example, for the property dbpedia-owl:firstWin, which is a relation between Formula One racers and grands prix, axioms for all four mentioned types were generated: each Formula One racer has only one first win in his career (functional), each grand prix can only be won by one Formula One racer (inverse functional), it is not possible to use the property dbpedia-owl:firstWin in both directions (asymmetric), and the property is also irreflexive.

Step II: Manual evaluation of the generated axioms. In the second step, we used at most 100 random axioms per axiom type and manually verified whether each axiom is appropriate. To focus on possible data quality problems, we restricted the evaluation data to axioms for which at least one violation can be found in the knowledge base. Furthermore, we tried to facilitate the evaluation by also taking the target context into account, i.e. if they exist we consider the definition, domain and range as well as one random sample of a violation. When evaluating the inverse functionality for the property dbpedia-owl:firstWin, we can therefore make use of the following additional information:

Domain: dbpedia-owl:FormulaOneRacer
Range: dbpedia-owl:GrandPrix
Sample Violation:
dbpedia:Fernando_Alonso dbpedia-owl:firstWin dbpedia:2003_Hungarian_Grand_Prix .
dbpedia:WikiProject_Formula_One dbpedia-owl:firstWin dbpedia:2003_Hungarian_Grand_Prix .
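Such violations can be listed directly against the endpoint. The following minimal sketch (an illustrative query, not necessarily the exact one issued by DL-Learner) finds pairs of distinct subjects sharing the same dbpedia-owl:firstWin object, i.e. violations of the learned inverse functionality axiom:

  PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
  # two different subjects with the same first win violate inverse functionality
  SELECT ?s1 ?s2 ?gp WHERE {
    ?s1 dbpedia-owl:firstWin ?gp .
    ?s2 dbpedia-owl:firstWin ?gp .
    FILTER ( ?s1 != ?s2 )
  }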
5.2 Evaluation Results

Manual Methodology.
An overview of the evaluation results is shown in Table 2^5. Overall, only 16.5% of all resources were not affected by any problems. On average, there were 5.69 problems per resource and 2.24 problems excluding errors in the dbprop namespace^6 [17]. While the vast majority of resources have problems, it should also be remarked that each resource has 47.19 triples on average, which is higher than in most other LOD datasets. The tool was configured to allow two evaluations per resource, which resulted in a total of 268 inter-evaluations. We computed the inter-rater agreement for those resources which were evaluated by two persons by adjusting the observed agreement with the agreement expected by chance, as done in Cohen's kappa^7. The inter-rater agreement results – 0.34 for resource agreement and 0.38 for triple agreement – indicate that the same resource should be evaluated more than twice in future evaluations.

Total no. of users | 58
Total no. of distinct resources evaluated | 521
Total no. of resources evaluated | 792
Total no. of distinct resources without problems | 86
Total no. of distinct resources with problems | 435
Total no. of distinct incorrect triples | 2928
Total no. of distinct incorrect triples in the dbprop namespace | 1745
Total no. of inter-evaluations | 268
No. of resources with evaluators having different opinions | 89
Resource-based inter-rater agreement (Cohen's Kappa) | 0.34
Triple-based inter-rater agreement (Cohen's Kappa) | 0.38
No. of triples evaluated for correctness | 700
No. of triples evaluated to be correct | 567
No. of triples evaluated incorrectly | 133
% of triples correctly evaluated | 81
Average no. of problems per resource | 5.69
Average no. of problems per resource in the dbprop namespace | 3.45
Average no. of triples per resource | 47.19
% of triples affected | 11.93
% of triples affected in the dbprop namespace | 7.11

Table 2: Overview of the manual quality evaluation.

To assess the accuracy of the crowdsourcing evaluation, we took a random sample of 700 assessed triples (out of the total 2,928), intended to be representative of all the assessed triples based on the formula in [15], and evaluated them for correctness. Additionally, we assumed a margin of error of 3.5%, which is a bound that we can place on the difference between the estimated correctness of the triples and the true value, and a 95% confidence level, which is the measure of how confident we are in that margin of error^8. Of these 700 triples, 133 were evaluated incorrectly, resulting in about 81% of triples being correctly evaluated.
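For reference (the formula is standard and not spelled out in the text above), Cohen's kappa adjusts the observed agreement p_o between two evaluators by the agreement p_e expected by chance:

  \kappa = \frac{p_o - p_e}{1 - p_e}

Values of 0.34–0.38 are commonly read as only fair agreement, which motivates the recommendation above to have each resource evaluated more than twice.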
Table 3 shows the total number of problems, the distinct resources and the percentage of affected triples for each problem type. Overall, the most prevalent problems, such as broken external links, are outside the control of the DBpedia extraction framework. After that, several extraction and mapping problems that occur frequently, mainly affecting accuracy, can be improved by manually adding mappings or possibly by improving the extraction framework.

When looking at the detectable and fixable problems from Table 1 in light of their prevalence, we expect that approximately one third of the problems can be automatically detected and two thirds are fixable by improving the DBpedia extraction framework. In particular, implicitly related attributes can be properly extracted with a new extractor, which can be configured using the DBpedia Mappings Wiki. As a result, we expect that the improvement potential is that the problem rate in DBpedia can be reduced from 11.93% to 5.81% (calculated by subtracting 7.11% from 11.93% as reported in Table 2). After revising the DBpedia extraction framework, we will perform subsequent quality assessments using the same methodology in order to realize and demonstrate these improvements.

^5 Also available at: http://aksw.org/Projects/DBpediaDQ
^6 http://dbpedia.org/property/
^7 http://en.wikipedia.org/wiki/Cohen%27s_kappa
^8 http://research-advisors.com/tools/SampleSize.htm

Criteria | IT | DR | AT %
Accuracy
Object incorrectly extracted | 32 | 14 | 2.69
Object value is incorrectly extracted | 259 | 121 | 23.22
Object value is incompletely extracted | 229 | 109 | 20.92
Special template not recognized | 14 | 12 | 2.30
Datatype problems | 7 | 6 | 1.15
Datatype incorrectly extracted | 356 | 131 | 25.14
Implicit relationship between attributes | 8 | 4 | 0.77
One fact is encoded in several attributes | 670 | 134 | 25.72
Several facts encoded in one attribute | 87 | 54 | 10.36
Value computed from another value | 14 | 14 | 2.69
Accuracy unassigned | 31 | 11 | 2.11
Relevancy
Irrelevant information extracted | 204 | 29 | 5.57
Extraction of layout information | 165 | 97 | 18.62
Redundant attribute values | 198 | 64 | 12.28
Image related information | 121 | 60 | 11.52
Other irrelevant information | 110 | 44 | 8.45
Relevancy unassigned | 1 | 1 | 0.19
Representational-consistency
Representation of number values | 29 | 8 | 1.54
Representational-consistency unassigned | 5 | 2 | 0.38
Interlinking
External websites (URLs) | 222 | 100 | 19.19
Interlinks with other datasets (URIs) | 2 | 2 | 0.38
Links to Wikimedia | 138 | 71 | 13.63
Links to Freebase | 99 | 99 | 19.00
Links to Geospecies | 0 | 0 | 0.00
Links generated via Flickr wrapper | 135 | 135 | 25.91
Interlinking unassigned | 3 | 3 | 0.58

Table 3: Detected number of problems for each of the defined quality problems. IT = Incorrect triples, DR = Distinct resources, AT = Affected triples (percentage).

Semi-automatic Methodology.
The evaluation results in Table 4 show that for the irreflexive case all 24 properties that would lead to at least one violation should indeed be declared as irreflexive. Applying the irreflexive characteristic would therefore help to find an overall 236 critical triples, e.g. dbpedia:2012_Coppa_Italia_Final dbpedia-owl:followingEvent dbpedia:2012_Coppa_Italia_Final, which is not meaningful as no event is the following event of itself.

For asymmetry, we got 81 approved properties, for example dbpedia-owl:starring with domain Work and range Actor. Compared with this, there are also some properties where asymmetry is not always appropriate, e.g. dbpedia-owl:influenced.

Functionality, i.e. having at most one value for a property, can be applied to 76 properties. During the evaluation, we observed invalid facts such as, for example, two different values, 2600.0 and 1630.0, for the density of the moon Himalia. We spotted overall 199,480 errors of this type in the knowledge base. As the result of the inverse functionality evaluation, we obtained 13 properties where the object in the triple should only be related to one unique subject, e.g. there should only be one Formula One racer who won a particular grand prix, which is implicit when using the property dbpedia-owl:lastWin.

Characteristic | Properties: total | Properties: violated | Properties: correct | Violations: min. | Violations: max. | Violations: avg. | Violations: total
Irreflexivity | 142 | 24 | 24 | 1 | 133 | 9.8 | 236
Asymmetry | 500 | 144 | 81 | 1 | 628 | 16.7 | 1358
Functionality | 739 | 671 | 76 | 1 | 91581 | 2624.7 | 199480
Inverse Functionality | 52 | 49 | 13 | 8 | 18236 | 1685.2 | 21908

Table 4: Results of the semi-automatic evaluation. The table shows the total number of properties that have been suggested to have the given characteristic by Step I of the semi-automatic methodology, the number of properties that would lead to at least one violation when applying the characteristic, the number of properties where the characteristic is meaningful (manually evaluated) and some metrics for the number of violations.

6. RELATED WORK
Web data quality assessment frameworks. There are a number of data quality assessment dimensions that have already been identified as relevant to Linked Data, namely accuracy, timeliness, completeness, relevancy, conciseness and consistency, to name a few [2]. Additional quality criteria such as uniformity, versatility, comprehensibility, amount of data, validity, licensing, accessibility and performance were also introduced as additional means of assessing the quality of LOD [7]. Additionally, there are several efforts in developing data quality assessment frameworks in order to assess the data quality of LOD. These efforts are either semi-automated [7], automated [8] or manual [3, 20]. Even though these frameworks introduce useful methodologies to assess the quality of a dataset, either the results are difficult to interpret, they do not allow a user to choose the input dataset or they require a considerable amount of user involvement. In our experiment, we used crowdsourcing to perform the evaluation because (1) none of the frameworks provided the granularity of quality criteria that we identified to be quality problems in DBpedia resources and (2) we were interested in whether it was possible to use crowdsourcing to assess and thus improve the quality of a dataset.

Concrete Web Data quality assessments. An effort to assess the quality of web data was undertaken in 2008 [5], where 14.1 billion HTML tables from Google's general-purpose web crawl were analyzed in order to retrieve those tables that have high-quality relations. Additionally, there have been studies focused on assessing the quality of RDF data [9], reporting the errors occurring while publishing RDF data as well as the effects and means to improve the quality of structured data on the web. As part of an empirical study [10], 4 million RDF/XML documents were analyzed, which provided insights into the level of conformance of these documents with respect to the Linked Data guidelines. Even though these studies accessed a vast amount of web or RDF/XML data, most of the analysis was performed automatically and therefore the problems arising due to contextual discrepancies were overlooked. Another study aimed to develop a framework for DBpedia quality assessment [14]. In this study, particular problems of the DBpedia extraction framework were taken into account and integrated in the framework. However, only a small sample (75 resources) was assessed in that case and an older DBpedia version (2010) was analyzed.

Crowdsourcing-based tasks. There are already a number of efforts which use crowdsourcing focused on a specific type of task. For example, crowdsourcing is used for entity linking or resolution [6], quality assurance and resource management [22] or for the enhancement of ontology alignments [21], especially in Linked Data. However, in our case, we did not submit tasks to popular internet marketplaces such as Amazon Mechanical Turk or CrowdFlower^9. Instead, we used the intelligence of a large number of researchers who were particularly conversant with RDF to help assess the quality of one of the most important and most interlinked datasets, DBpedia.

^9 http://crowdflower.com/

7. CONCLUSION AND OUTLOOK
To the best of our knowledge, this study is the first comprehensive empirical quality analysis of more than 500 resources of a large Linked Data dataset extracted from crowdsourced content. We found that a substantial number of problems exist and that the overall quality, with an 11.93% error rate, is moderate. Moreover, the semi-automatic analysis revealed more than 200,000 violations of property characteristics. In addition to the quality analysis of DBpedia, we devised a generic methodology for Linked Data quality analysis, derived a comprehensive taxonomy of extraction quality problems and developed a tool which can assist in the evaluation. All these contributions can be reused for analyzing any other extracted dataset (by domain experts). The detailed analysis of data quality problems allows us to devise and implement corresponding mitigation strategies. Many of the problems found can be, firstly, automatically detected and, secondly, avoided by (1) improving existing extractors, (2) developing new ones (e.g. for implicitly related attributes) or (3) improving and extending the mappings and extraction hints on the DBpedia Mappings Wiki.

With this study, we not only aim to assess the quality of this sample of DBpedia resources but also adopt an agile methodology to improve the quality in future versions by regularly providing feedback to the DBpedia maintainers to fix these problems. We plan to improve the DBpedia extraction framework along these detected problems and to periodically revisit the quality analysis (at regular intervals) in order to demonstrate possible improvements.

8. REFERENCES
[1] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In ESWC, volume 4519 of LNCS, pages 503–517. Springer, 2007.
[2] C. Bizer. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[3] C. Bizer and R. Cyganiak. Quality-driven information filtering using the WIQA policy framework. Journal of Web Semantics, 7(1):1–10, Jan 2009.
[4] L. Bühmann and J. Lehmann. Universal OWL axiom enrichment for large knowledge bases. In EKAW, 2012.
[5] M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. WebTables: exploring the power of tables on the web. PVLDB, 1(1):538–549, 2008.
[6] G. Demartini, D. Difallah, and P. Cudré-Mauroux. ZenCrowd: Leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In 21st International Conference on World Wide Web (WWW 2012), pages 469–478, 2012.
[7] A. Flemming. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[8] C. Guéret, P. T. Groth, C. Stadler, and J. Lehmann. Assessing linked data mappings using network measures. In ESWC, volume 7295 of LNCS, pages 87–102. Springer, 2012.
[9] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the pedantic web. In LDOW, 2010.
[10] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of linked data conformance. Journal of Web Semantics, 14:14–44, July 2012.
[11] J. Howe. The rise of crowdsourcing. Wired Magazine, 14(6), June 2006.
[12] J. Juran. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[13] M. Knuth, J. Hercher, and H. Sack. Collaboratively patching linked data. CoRR, 2012.
[14] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.
[15] R. V. Krejcie and D. W. Morgan. Determining sample size for research activities. Educational and Psychological Measurement, 30:607–610, 1970.
[16] J. Lehmann. DL-Learner: learning concepts in description logics. Journal of Machine Learning Research (JMLR), 10:2639–2642, 2009.
[17] J. Lehmann, C. Bizer, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[18] J. Lehmann and L. Bühmann. ORE - a tool for repairing and enriching knowledge bases. In ISWC 2010, Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, 2010.
[19] J. Lehmann and P. Hitzler. Concept learning in description logics using refinement operators. Machine Learning Journal, 78(1-2):203–250, 2010.
[20] P. N. Mendes, H. Mühleisen, and C. Bizer. Sieve: Linked data quality assessment and fusion. In LWDM, 2012.
[21] C. Sarasua, E. Simperl, and N. Noy. CrowdMap: Crowdsourcing ontology alignment with microtasks. In The Semantic Web – ISWC 2012, Lecture Notes in Computer Science, pages 525–541. Springer Berlin Heidelberg, 2012.
[22] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: crowdsourcing entity resolution. Proc. VLDB Endow., 5:1483–1494, July 2012.
[23] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality assessment methodologies for linked open data. Under review, available at http://www.semantic-web-journal.net/content/quality-assessment-methodologies-linked-open-data.