
Test-driven Evaluation of Linked Data Quality

Dimitris Kontokostas (University of Leipzig), Patrick Westphal (University of Leipzig), Sören Auer (University of Bonn and Fraunhofer IAIS), Sebastian Hellmann (University of Leipzig), Jens Lehmann (University of Leipzig), Roland Cornelissen (Stichting Bibliotheek.nl)

ABSTRACT

Linked Open Data (LOD) comprises an unprecedented volume of structured data on the Web. However, these datasets are of varying quality, ranging from extensively curated datasets to crowdsourced or extracted data of often relatively low quality. We present a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development. We argue that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality. We present a methodology for assessing the quality of linked data resources, based on a formalization of bad smells and data quality problems. Our formalization employs SPARQL query templates, which are instantiated into concrete quality test case queries. Based on an extensive survey, we compile a comprehensive library of data quality test case patterns. We perform automatic test case instantiation based on schema constraints or semi-automatically enriched schemata and allow the user to generate specific test case instantiations that are applicable to a schema or dataset. We provide an extensive evaluation of five LOD datasets, manual test case instantiation for five schemas and automatic test case instantiations for all available schemata registered with Linked Open Vocabularies (LOV). One of the main advantages of our approach is that domain specific semantics can be encoded in the data quality test cases, thus being able to discover data quality problems beyond conventional quality heuristics.

Categories and Subject Descriptors

H.2.0 [DATABASE MANAGEMENT]: General—Security, integrity, and protection; D.2.5 [Software Engineering]: Testing and Debugging—Testing tools, Debugging aids

Keywords

Data Quality, Linked Data, DBpedia

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media. WWW'14, April 7–11, 2014, Seoul, Korea. ACM 978-1-4503-2744-2/14/04. http://dx.doi.org/10.1145/2566486.2568002

1. INTRODUCTION

Linked Open Data (LOD) comprises an unprecedented volume of structured data published on the Web. However, these datasets are of varying quality, ranging from extensively curated datasets to crowdsourced and even extracted data of relatively low quality. Data quality is not an absolute measure, but assesses fitness for use [16]. Consequently, one of the main challenges regarding the wider deployment and use of semantic technologies on the Web is the assessment and ensuring of the quality of a certain, possibly evolving, dataset for a particular use case. There have been few approaches for assessing Linked Data quality. However, these were mostly methodologies, which require (1) a large amount of manual configuration and interaction [2, 9, 22] or (2) automated, reasoning based methods [12, 15]. While reasoning based methods allow more automation, they are either limited to very specific quality aspects (such as link quality [12]) or lack scalability to the medium and large datasets being increasingly published as Linked Data. In consequence, we observe a shortage of practical quality assessment approaches for Linked Data, which balance between a high degree of automation and scalability to datasets comprising billions of triples.

In this article, we present a methodology for test-driven Linked Data quality assessment, which is inspired by test-driven software development. In software engineering, a test case can be defined as "an input on which the program under test is executed during testing" and a test set as "a set of test cases for testing a program" [28]. A basic metric in software unit-testing is test adequacy, which measures the completeness of the test set. A key principle of test-driven software development is to start the development with the implementation of automated test methods before the actual functionality is implemented.

Compared to software source code testing, where test cases have to be implemented largely manually or with limited programmatic support, the situation for Linked Data quality testing is slightly more advantageous. On the Data Web, we have a unified data model – RDF – which is the basis for both data and ontologies. In this work, we exploit the RDF data model by devising a pattern-based approach for the data quality tests of RDF knowledge bases. We argue that ontologies, vocabularies and knowledge bases should be

accompanied by a number of test cases, which help to ensure a basic level of quality. We present a methodology for assessing the quality of linked data resources, based on a formalization of data quality integrity constraints. Our formalization employs SPARQL query templates, which are instantiated into concrete quality test case queries. Based on an extensive survey, we compile a comprehensive library of quality test case patterns, which can be instantiated for rapid development of more test cases. We provide a method for automatic test case instantiation from these patterns for a particular ontology or vocabulary schema. Furthermore, we support the automatic derivation from OWL schema axioms. Since many schemata of LOD datasets are not very expressive, our methodology also includes semi-automatic schema enrichment. Concrete test cases are equipped with persistent identifiers to facilitate test tracking over time. We devise the notion of RDF test case coverage based on a combination of six individual coverage metrics (four for properties and two for classes).

As a result, the test case coverage can be explicitly stated for a certain dataset and potential users can thus obtain a more realistic assessment of the quality they can expect. Since the test cases are related to certain parts of the knowledge base (i.e. properties and classes), the quality of particular fragments relevant for a certain use case can also be easily assessed. Another benefit of test-driven data engineering is support for data evolution. Once test cases are defined for a certain vocabulary, they can be applied to all datasets reusing elements of this vocabulary. Test cases can be re-executed whenever the data is altered. Due to the modularity of the approach, where test cases are bound to certain vocabulary elements, test cases for newly emerging datasets, which reuse existing vocabularies, can be easily derived.

Our approach allows us to perform automatic test case instantiation based on schema constraints or semi-automatically enriched schemata and allows users to generate specific test case instantiations that are applicable for a schema or a dataset. A main contribution of our work is an extensive and unprecedented quantitative evaluation involving (a) manual and automatic test case instantiations for five large-scale LOD datasets (two DBpedia editions, datos.bne.es, Library of Congress authority data and LinkedGeoData) and (b) automatic test case instantiations for all available schemata registered with the Linked Open Vocabularies (LOV)^1, resulting in 32,293 total unique test cases for 297 of the LOV vocabularies. One of the main advantages of our approach is that domain specific semantics can be encoded in the data quality test cases, thus being able to discover data quality problems beyond conventional quality heuristics. Finally, our framework implementation is built upon the SPARQL 1.1 standard, which makes it applicable for any knowledge base or triple store implementation.

The remainder of the article is structured as follows: Section 2 describes the methodology we followed to define Data Quality Test Case Patterns. The elicitation of our pattern library is described in Section 3. We instantiate, run and evaluate the test cases in Section 4 and Section 5, followed by a discussion in Section 6. Section 7 elaborates on related work and we conclude in Section 8.

2. TEST-DRIVEN DATA QUALITY METHODOLOGY

We first introduce the basic notions in our methodology, then describe its workflow and finally define test case coverage criteria analogous to unit tests in software engineering.

Basic Notions

Data Quality Test Pattern (DQTP). A data quality test case pattern is a tuple (V, S), where V is a set of typed pattern variables and S is a SPARQL query template with placeholders for the variables from V. Possible types of the pattern variables are IRIs, literals, operators, datatype values (e.g. integers) and regular expressions. With R(v) we denote the value range for a pattern variable v ∈ V, i.e. the set of values by which the variable can be substituted, and with R(V) the union of all these sets, i.e. R(V) = ∪_{v ∈ V} R(v).

Ideally, DQTPs should be knowledge base and vocabulary agnostic. Using %%v%% as syntax for placeholders, an example DQTP is:

  SELECT ?s WHERE { ?s %%P1%% ?v1 .
                    ?s %%P2%% ?v2 .
                    FILTER ( ?v1 %%OP%% ?v2 ) }

This DQTP can be used for testing whether a value comparison of two properties P1 and P2 holds with respect to an operator OP. DQTPs represent abstract patterns, which can be further refined into concrete data quality test cases using test pattern bindings.

Test Case Pattern Binding. A test case pattern binding is a specific instantiation of a DQTP. It is a triple (σ, S, C) in which σ : V → R(V) is a mapping of variables to valid replacements, S is a SPARQL query template and C ∈ {error, bad smell} is used as classification of the error.

Data Quality Test Cases. Applying σ to S results in a SPARQL query, which can then be executed. Each result of the query is considered to be a violation of a unit test. An example test case pattern binding and resulting data quality test case is^2:

  P1 => dbo:birthDate | SELECT ?s WHERE {
  P2 => dbo:deathDate |   ?s dbo:birthDate ?v1 .
  OP => >             |   ?s dbo:deathDate ?v2 .
                      |   FILTER ( ?v1 > ?v2 ) }

A test case has four different results: success (empty result), violation (results are returned), timeout (test case is marked for further inspection) and error (the query cannot be evaluated due to e.g. a network error or SPARQL engine limitations).

Test Case Auto Generators (TAG). Many knowledge bases use RDFS and OWL as modelling languages. While the core of those languages aims at inferring new facts, a number of constructs is also suitable for verifying data quality. In previous work, tools like the Pellet Integrity Constraint Validator [24] made use of this by viewing OWL axioms as constraints and reporting violations of them. Those are then interpreted via integrity constraint semantics, which uses a closed world assumption and a weaker form of the unique names assumption in which

1 http://lov.okfn.org/
2 We use http://prefix.cc to resolve all namespaces and prefixes. A full list can be found at http://prefix.cc/popular/all

two individuals are considered to be different unless they are explicitly stated to be equal. We pursue the same approach for re-using schema information in our test framework. To achieve this, a test case auto generator (TAG) takes a schema as input and returns test cases. We provide support for the following OWL constructs: rdfs:domain, rdfs:range, owl:minCardinality, owl:maxCardinality, owl:cardinality, owl:FunctionalProperty, owl:disjointClass, owl:propertyDisjointWith, owl:complementOf, owl:InverseFunctionalProperty, owl:AsymmetricProperty, owl:IrreflexiveProperty and owl:deprecated.

Generators consist of a detection and an execution part. The detection part is a query against a schema, for instance:

  SELECT DISTINCT ?T1 ?T2 WHERE {
    ?T1 owl:disjointWith ?T2 . }

For every result of a detection query, a test case is instantiated from the respective pattern, for instance:

  SELECT DISTINCT ?s WHERE {
    ?s rdf:type %%T1%% .
    ?s rdf:type %%T2%% . }

Depending on the violation, there is not necessarily a one-to-one mapping between a detection query and the generated test cases. For the owl:cardinality constraint, for example, we use three TAGs: (i) a TAG for the case a cardinality is 0, which checks whether the corresponding triple pattern is instantiated, and two generators for values greater than 0, (ii) one to ensure that the property exists (TYPRODEP) and (iii) a second to validate the property occurrences (OWLCARD). The detection queries can be quite complex; we would like to stress, however, that our goal is not to provide complete reasoning and constraint checking, but rather a lightweight mechanism verifying typical violations efficiently.

Workflow

Our methodology is illustrated in Figure 1. As shown in the figure, there are two major sources for creating test cases. One source is stakeholder feedback from everyone involved in the usage of a dataset and the other source is the already existing RDFS/OWL schema of a dataset. Based on this, there are several ways in which test cases can be created:

1. Using RDFS/OWL constraints directly: As previously explained, test cases can be automatically created via TAGs in this case.

2. Enriching the RDFS/OWL constraints: Since many datasets provide only limited schema information, we perform automatic schema enrichment as recently studied in [3, 4]. These schema enrichment methods can take an RDF/OWL dataset or a SPARQL endpoint as input and automatically suggest schema axioms with a certain confidence value by analysing the dataset. In our methodology, this is used to create further test cases via TAGs. It should be noted that these test cases are explicitly labelled, such that the engineer knows that they are less reliable than manual test cases.

3. Re-using tests based on common vocabularies: Naturally, a major goal in the Semantic Web is to re-use existing vocabularies instead of creating them from scratch for each dataset. We detect the used vocabularies in a dataset, which allows us to re-use test cases from a test case pattern library. The creation of that library is described in the next section.

4. Instantiate existing DQTPs: The aim of DQTPs is to be generic, such that they can be applied to different datasets. While this requires a high initial effort of compiling a pattern library, it is beneficial in the long run, since they can be re-used. Instead of writing SPARQL templates themselves, an engineer can select and instantiate the correct DQTP. This does not necessarily require SPARQL knowledge, but can also be achieved via a textual description of a DQTP, examples and its intended usage.

5. Write own DQTPs: In some cases, test cases cannot be generated by any of the automatic and semi-automatic methods above and have to be written from scratch by an engineer. These DQTPs can then become part of a central library to facilitate later re-use.

Test Coverage and Adequacy

In software engineering, a test case can be defined as an input on which the program under test is executed during testing and a test set as a set of test cases for testing a program [28]. A basic metric in software unit testing is test adequacy. According to [28], adequacy is a notion that measures the completeness of the test set. An Adequacy Stopping Rule (ASR) is a related metric with a range {true|false} that defines whether sufficient testing has been done. Many attempts have been made to quantify test adequacy, with the main coverage criteria being: a) statement coverage, b) branch coverage, c) path coverage and d) mutation adequacy. It is hard to automate the creation of these tests.

In RDF, instead of code, the testing subject is data that is stored in triples and adheres to a schema. We define an RDF test case as a data constraint that involves one or more triples and an RDF test set as a set of test cases for testing a dataset. Since no branches and paths exist in RDF, a test adequacy metric can only be related to the selectivity of the test cases. We will subsequently consider coverage as a composite of the following coverage criteria:

• Property domain coverage (dom): Identifies the ratio of property occurrences where a test case is defined for verifying domain restrictions of the property.

• Property range coverage (ran): Identifies the ratio of property occurrences where a test case is defined for verifying range restrictions of the property.

• Property dependency coverage (pdep): Identifies the ratio of property occurrences where a test case is defined for verifying dependencies with other properties.

• Property cardinality coverage (card): Identifies the ratio of property occurrences where a test case is defined for verifying the cardinality of the property.

• Class instance coverage (mem): Identifies the ratio of classes with test cases regarding class membership.

Figure 1: Flowchart showing the test-driven data quality methodology. The left part displays the input sources of our pattern library. In the middle part, the different ways of pattern instantiation are shown, which lead to the Data Quality Test Cases on the right.

• Class dependency coverage (cdep): Identifies the ratio of class occurrences for which test cases verifying relationships with other classes are defined.

In the above definition, please note that domain and range restrictions are more general than verifying rdfs:domain and rdfs:range, as they cover all test cases which can be performed via SPARQL on subjects and objects of triples using a particular property. Please note that many test cases can be expressed in OWL 2, in particular when using the Pellet integrity constraint semantics. For instance, custom datatypes in OWL 2^3 can be used for range checking of property values using regular expressions. As noted above, we transparently support the usage of OWL, but some test cases are much easier to implement in SPARQL, and others, e.g. the SKOS restriction "A resource has no more than one value of skos:prefLabel per language tag", cannot be checked in OWL at all, but are a straightforward DQTP in our case (ONELANG in Table 1).

The above criteria can be computed by coverage computation functions. Each coverage computation function f : Q → 2^E takes a SPARQL query q ∈ Q corresponding to a test case pattern binding as input and returns a set of entities. As an example, the function f_dom for computing the domain coverage returns the set of all properties p such that the triple pattern (?s, p, ?o) occurs in q and there is at least one other triple pattern using ?s in q. This can straightforwardly be extended to a function F : 2^Q → 2^E taking a set of SPARQL queries as input and returning a set of entities. F computes how many entities are covered by the test case queries.

The other coverage criteria are defined as follows: Range coverage f_ran is analogous to domain coverage. The property dependency coverage f_pdep of a query q returns all properties in q if there are at least two different properties and an empty set otherwise. Property cardinality coverage f_card of a query q returns the set of all properties p such that (?s, p, ?o) occurs in q along with GROUP BY ?s as well as HAVING(count(?s) op n) aggregates (op is one of ≤, <, =, >, ≥ and n a number) or, analogously, the same criteria for ?o instead of ?s. Class instance coverage f_mem of a query q returns the set of all classes c such that (?s, rdf:type, c) occurs in q. The class dependency coverage f_cdep of a query q returns all classes in q if there are at least two different classes and an empty set otherwise. A certain property should also be considered to be covered if the absence of a particular constraint is explicitly stated.

For properties, F can be further extended to a function F' with F'(QS, D) = Σ_{p ∈ F(QS)} pfreq(p), where pfreq(p) is the frequency of a property p, i.e. the number of occurrences of p divided by the number of occurrences of all properties in D. The extension for classes is analogous. This extension weights the entities by their frequency in the dataset. We propose to employ occurrences, i.e. concrete entity usages, instead of the properties themselves in order to reduce the influence of rarely used properties on the coverage.

Formally, we can define RDF test case coverage Cov of a set of test case queries QS with respect to a dataset D as follows:

  Cov(QS, D) = 1/6 ( F'_dom(QS, D) + F'_ran(QS, D)
                   + F'_pdep(QS, D) + F'_card(QS, D)
                   + F'_mem(QS, D) + F'_cdep(QS, D) )

The coverage is a heuristic in the [0, 1] range, which helps to assess whether the defined test cases are sufficient for data quality assessment. Higher results represent better coverage.

3. PATTERN ELICITATION AND CREATION

To start capturing patterns of real data errors, we had a closer look at DBpedia, being one of the bigger and best interlinked datasets in the LOD cloud [21]. We performed three different analyses, which led to a comprehensive library of test case patterns summarized in Table 1:

1. Analysis of incidental error reports by the DBpedia user community.

2. Analysis of error tracking behavior by Wikipedia editors.

3. Analysis of the ontology schema of the DBpedia OWL ontology.

  Schema   Test Cases | Schema   Test Cases
  dicom    8,229      | mo       605
  dbo      5,713      | tio      525
  frbrer   2,166      | uco      516
  biopax   688        | vvo      506
  hdo      682        | ceo      511

Table 2: Top 10 schemas with descending number of automatically generated test cases.

3 http://www.w3.org/TR/owl2-primer/#Advanced_Use_of_Datatypes
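Returning to the coverage metric defined in Section 2, the weighted extension F' can be sketched in a few lines. This is a simplified illustration of the definition, not part of the framework itself; the property IRIs and occurrence counts are invented.

```python
# Sketch of F'(QS, D) = sum over p in F(QS) of pfreq(p), where pfreq(p) is the
# number of occurrences of property p divided by the number of occurrences of
# all properties in dataset D. All names and counts below are invented.

def pfreq(prop, occurrences):
    """Relative frequency of a property among all property occurrences in D."""
    total = sum(occurrences.values())
    return occurrences.get(prop, 0) / total

def weighted_coverage(covered_props, occurrences):
    """F': sum of the frequencies of the properties covered by test queries."""
    return sum(pfreq(p, occurrences) for p in covered_props)

# Dataset D, given as property -> occurrence count (1,000 occurrences in total)
occurrences = {
    "dbo:birthDate": 500,
    "dbo:deathDate": 250,
    "dbo:abstract": 250,
}

# F(QS): the properties mentioned by the test case queries
covered = {"dbo:birthDate", "dbo:deathDate"}

print(weighted_coverage(covered, occurrences))  # 0.75
```

Because the result is weighted by occurrence counts, a test case on a heavily used property such as dbo:birthDate contributes far more to the coverage than one on a rarely used property, which is exactly the intent of the definition.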

Community feedback. We thoroughly reviewed all the DBpedia related mailing lists and Q&A websites, i.e. the DBpedia discussion^4 and DBpedia developers^5 lists, as well as questions tagged with DBpedia on stackoverflow^6 and Semantic Web Answers^7. We picked all the data quality related questions and tried to create SPARQL queries for retrieving the same erroneous data. Finally, we grouped similar SPARQL queries together.

Wikipedia maintenance system. We reviewed the information Wikipedia uses to ensure article quality and tried to reuse it from DBpedia. Such information encompasses special Categories and Templates used by seasoned Wikipedians (e.g. admins and stewards) to administrate and tag errors in the article space^8. Based on the maintenance categories and templates used, new patterns like the TRIPLE Pattern and the PVT Pattern were derived. These patterns are also applicable to other datasets, e.g. LinkedGeoData [25].

OWL ontology analysis. The main purpose of OWL is to infer knowledge from existing schemata and data. While it can also be used to check constraints, this can be difficult in practice due to the Open World Assumption used and the lack of the Unique Name Assumption. Therefore, in addition to standard OWL inference, it can also be useful to convert OWL ontology axioms to SPARQL queries, which check the constraints expressed by them. This is motivated by research on the Pellet Integrity Constraint Validator^9 using the same idea. Specifically, we analysed the ontology and checked which existing constructs are applicable for constraint checking in DBpedia. We identified constructs such as (inverse) functionality, cardinality, domain and range of properties as well as class disjointness as relevant and included them in our pattern template library. The bindings for those patterns can be created automatically from specific OWL ontology axioms.

Pattern Library

Our Pattern Library consists of 17 DQTPs. Table 1 shows a description of all patterns along with two example bindings. In the following, we exemplarily illustrate two patterns in detail and refer the reader to http://svn.aksw.org/papers/2014/WWW_Databugger/public.pdf (see Appendix) for a complete pattern description.

COMP Pattern. Depending on the property semantics, there are cases where two different literal values must have a specific ordering with respect to an operator. P1 and P2 are the datatype properties we need to compare and OP is the comparison operator, R(OP) = { <, <=, >, >=, =, != }.

  SELECT ?s WHERE { ?s %%P1%% ?v1 .
                    ?s %%P2%% ?v2 .
                    FILTER ( ?v1 %%OP%% ?v2 ) }

Example bindings: (a) dbo:deathDate before ('<') dbo:birthDate, (b) dbo:releaseDate after ('>') dbo:latestReleaseDate.

MATCH Pattern. Application logic or real world constraints may put restrictions on the form of a literal value. P1 is the property we need to check against REGEX and NOP can be a not operator ('!') or empty.

  SELECT ?s WHERE { ?s %%P1%% ?value .
                    FILTER ( %%NOP%% regex(str(?value), %%REGEX%%) ) }

Example bindings: (a) dbo:isbn format is different ('!') from "^([iIsSbBnN 0-9-])*$", (b) dbo:postCode format is different ('!') from "^[0-9]{5}$".

4. TEST GENERATION

To evaluate our methodology, we automatically generated test cases for all available vocabularies in the LOV dataset. Using the implemented TAGs, we managed to create 32,293 total unique reusable test cases for 297 LOV vocabularies^10. Table 2 displays the 10 schemas with the most associated test cases. For brevity we use the vocabulary prefixes as defined in LOV^11. Test cases are themselves described in RDF and have a stable URI for tracking them over time. The URI is generated under the application namespace concatenated with the schema prefix, the pattern and an MD5 checksum of the SPARQL query string. The following listing displays a test case that checks whether the rdfs:range of foaf:isPrimaryTopicOf instances is a foaf:Document. We store metadata along with every test case, which allows us to easily filter test cases based on different criteria.

  tddt:foaf-RDFSDOMAIN-8e121cf1111201b5d53de161e245c13
      a tddo:PatternBasedTestCase ;
      tddo:appliesTo tddo:Schema ;
      tddo:generated tddo:AutoGenerated ;
      tddo:source ;
      tddo:references foaf:Document, foaf:isPrimaryTopicOf ;
      tddo:testGenerator tddg:RDFSRANGEC ;
      tddo:basedOnPattern tddp:RDFSRANGE ;
      tddo:binding [ ... ] ;
      tddo:testCaseLogLevel rlog:Error .

4 https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
5 https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
6 http://stackoverflow.com/questions/tagged/dbpedia
7 http://answers.semanticweb.com/tags/dbpedia/
8 http://en.wikipedia.org/wiki/Category:Wikipedia_maintenance
9 http://clarkparsia.com/pellet/icv/
10 LOV had 367 vocabularies at the date of last access (5/10/2013), but not all were accessible.
11 In addition to the LOV schemas, the dbo (http://dbpedia.org/ontology/), frbrer (http://iflastandards.info/ns/fr/frbr/frbrer/) and isbd (http://iflastandards.info/ns/isbd/elements/) schemas are included as prefixes.
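Operationally, instantiating a DQTP such as COMP or MATCH is a plain substitution of the %%...%% placeholders in the SPARQL template. A minimal sketch under that reading (the helper function is ours, for illustration only):

```python
def instantiate(template, binding):
    """Apply a binding (sigma) to a DQTP template S by replacing each
    %%VAR%% placeholder with its bound value."""
    query = template
    for var, value in binding.items():
        query = query.replace("%%" + var + "%%", value)
    return query

# The COMP template and the birth/death date binding from Section 2
comp_template = ("SELECT ?s WHERE { ?s %%P1%% ?v1 . "
                 "?s %%P2%% ?v2 . FILTER ( ?v1 %%OP%% ?v2 ) }")
binding = {"P1": "dbo:birthDate", "P2": "dbo:deathDate", "OP": ">"}

print(instantiate(comp_template, binding))
```

The resulting string is an executable SPARQL query; every non-empty result row would count as one violation of the instantiated test case.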

Pattern (coverage Type): Description, with two example bindings each.

COMP (dom, ran, pdep): Comparison between two literal values of a resource.
  a) dbo:deathDate before dbo:birthDate
  b) dbo:releaseDate after dbo:latestReleaseDate

MATCH (ran): The literal value of a resource matches/does not match a certain regex pattern.
  a) dbo:isbn does not match "^[0-9]{5}$"
  b) foaf:phone contains any letters ("[A-Za-z]")

LITRAN (dom, ran, pdep, mem): The literal value of a specifically typed resource must (not) be within a given range.
  a) dbo:height of a dbo:Person is not within [0.4, 2.5]
  b) geo:lat of a spatial:Feature is not within [-90, 90]

TYPDEP (dom, cdep): Type dependency: the type of a resource may imply the attribution of another type.
  a) a resource is a gml:_Feature but not a dbo:Place
  b) a resource is a foaf:Person but not a dbo:Person

TYPRODEP (dom, mem, pdep): A resource of a specific type should have a certain property.
  a) a foaf:Document should have a foaf:primaryTopic
  b) a dbo:Person should have a dbo:birthDate

PVT (dom, pdep): If a resource has a certain value V assigned via a property P1 that in some way classifies this resource, the existence of another property P2 can be assumed.
  a) DBpedia articles stemming from a Geographic location template must have coordinates assigned via georss:point
  b) DBpedia resources in the category 1907 births should have a dbo:birthDate

TRIPLE: A resource can be considered erroneous if there are corresponding hints contained in the dataset.
  a) resources stemming from maybe copy-pasted Wikipedia articles having the category Possible cut-and-paste moves
  b) geographical features (of the linkedgeodata.org dataset) that are marked with the lgdo:fixme property

ONELANG (ran, card): A literal value should contain at most one literal for a certain language.
  a) a resource should only have one English foaf:name
  b) a resource should only have one English rdfs:label

RDFSDOMAIN (dom, pdep, mem): The attribution of a resource's property (with a certain value) is only valid if the resource is of a certain type.
  a) a resource having a dbo:demographicsAsOf property not being a dbo:PopulatedPlace
  b) a resource has the "Cities of Africa" category assigned but is not of type dbo:City

RDFSRANGE (ran, pdep, mem): The attribution of a resource's property is only valid if the value is of a certain type.
  a) a dbo:Person's spouse not being a dbo:Person
  b) a resource assigned via the foaf:based_near property not being of type geo:SpatialThing

RDFSRANGED (ran, pdep, mem): The attribution of a resource's property is only valid if the literal value has a certain datatype.
  a) the value of the property dbo:isPeerReviewed must be of type xsd:boolean
  b) the value of the property dbo:successfulLaunches must be of type xsd:nonNegativeInteger

INVFUNC (ran): Some values assigned to a resource are considered to be unique for this particular resource and must not occur in connection with other resources.
  a) there must not be more than one resource with the same foaf:homepage
  b) there must not be more than one country with the same dbo:capital

OWLCARD (ran, card): Cardinality restriction on a property.
  a) dbo:birthDate is a functional property
  b) there should be just one skos:prefLabel

OWLDISJC (cdep): Disjoint class constraint.
  a) a foaf:Document is disjoint with foaf:Project
  b) a dbo:Person is disjoint with dbo:Work

OWLDISJP (dom, ran, pdep, mem): Disjoint property constraint.
  a) skos:prefLabel is disjoint with skos:hiddenLabel
  b) dbo:bandMember is disjoint with dbo:birthPlace

OWLASYMP (dom, ran): Asymmetric property constraint.
  a) dbo:child is asymmetric
  b) dbo:birthPlace is asymmetric

OWLIRREFL (dom, ran): Irreflexive property constraint.
  a) dbo:parent is irreflexive
  b) dbo:child is irreflexive

Table 1: Example templates and bindings. The column Type refers to the coverage type.
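To illustrate how a bound pattern flags violations, the sketch below re-implements the COMP check from Table 1 over a small in-memory set of triples instead of a SPARQL endpoint. The sample resources and dates are invented; ISO 8601 dates compare correctly as strings, which keeps the sketch dependency-free.

```python
# Minimal in-memory version of the COMP check: flag subjects whose P1 value is
# greater than their P2 value (here: birth date after death date).
# The triples are invented sample data, not taken from DBpedia.

triples = [
    ("dbr:Alice", "dbo:birthDate", "1920-01-01"),
    ("dbr:Alice", "dbo:deathDate", "1999-05-20"),
    ("dbr:Bob",   "dbo:birthDate", "1975-03-14"),
    ("dbr:Bob",   "dbo:deathDate", "1850-12-31"),  # erroneous: death before birth
]

def comp_violations(triples, p1, p2):
    """Return subjects s for which (s, p1, v1) and (s, p2, v2) exist and v1 > v2."""
    values = {}
    for s, p, o in triples:
        values.setdefault(s, {})[p] = o
    return [s for s, props in values.items()
            if p1 in props and p2 in props and props[p1] > props[p2]]

print(comp_violations(triples, "dbo:birthDate", "dbo:deathDate"))  # ['dbr:Bob']
```

Each returned subject corresponds to one result row of the instantiated SPARQL query, i.e. one violation of the test case.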

5. LINKED DATA QUALITY EVALUATION

To showcase the re-usability of our automatically and manually generated test cases, we chose the following datasets for evaluation:

• dbpedia.org (http://dbpedia.org, version 3.9) extracts data from the English Wikipedia and publishes the data using the following schemas: owl, dbo, foaf, dcterms, dc, skos, geo and prov [21].

• nl.dbpedia.org (http://nl.dbpedia.org, live version, accessed on 05/10) extracts data from the Dutch Wikipedia edition using the same vocabularies as the English DBpedia.

• linkedgeodata.org (http://downloads.linkedgeodata.org/releases/2013-08-14/) provides a linked data mirror of OpenStreetMap (http://www.openstreetmap.org) using the following schemas: ngeo, spatial, lgdo, dcterms, gsp, owl, geo, skos and foaf [25].

• id.loc.gov (http://id.loc.gov/download/, accessed on 05/10/2013) is a SKOS dataset that publishes Library of Congress authority data using the owl, foaf, dcterms, skos, mads, mrel and premis schemas.

• datos.bne.es (http://datos.bne.es/datadumps/, accessed on 05/10/2013) provides open bibliographic linked data from the Spanish National Library using the owl, frbrer, isbd, dcterms and skos schemas.

To identify the schemas for each dataset, we used existing information from the LODStats project [6] (http://stats.lod2.eu/). The English (dben) and Dutch (dbnl) DBpedia share a similar structure, but the actual data differs [18]. Both DBpedia and the LinkedGeoData (lgd) datasets are generated from crowdsourced content and are thus prone to errors. The Library of Congress authority data (loc) and the open bibliographic data from the Spanish National Library (datos) were chosen as high-quality bibliographic datasets, with loc focusing on publishing SKOS and datos on FRBR data (www.oclc.org/research/activities/frbr.html). The DBpedia datasets were tested using their online SPARQL endpoints; the other three datasets were loaded into a local triple store (we used the Virtuoso V7 triple store, because it supports SPARQL 1.1 property paths).

For every dataset evaluated, we applied automatic schema enrichment as described in Section 2. We used a high level of confidence (0.9; see [4] for details) on the produced axioms and applied manual post-processing to remove certain axioms. The number of additional test cases instantiated for the considered schemas is shown in Table 3.

Schema              TC  |  Schema           TC
dbpedia.org      1,723  |  id.loc.gov       48
nl.dbpedia.org     845  |  datos.bne.org    18
linkedgeodata.org   61  |

Table 3: Number of additional test cases (TC) instantiated for the enriched schemas.

Besides the automatically generated test cases, our methodology supports manual test cases that may apply to a schema or a dataset. The manual schema test cases are reusable across different datasets for all RDF data using that schema, whereas the manual dataset test cases can be applied only to a specific dataset. Manual test cases usually require domain knowledge, which the authors have for a subset of the evaluation datasets. For the purposes of this evaluation, we defined 22 manual test cases for the DBpedia ontology (dbo), six for the LinkedGeoData ontology (lgdo), three for the WGS84 Geo Positioning ontology (geo) and 15 manual test cases for the DBpedia in English dataset. Additionally, we defined 20 manual test cases for the SKOS vocabulary, exploiting existing domain expertise [26]. Table 4 presents an aggregation of the defined test cases based on the pattern they stem from.

Pattern      Test Cases  Manual
RDFSDOMAIN       16,645       3
RDFSRANGE         9,727       4
OWLDISJC          5,530       -
RDFSRANGED        5,073       -
OWLDISJP          1,813      10
OWLCARD           1,818       6
TYPRODEP            746      13
OWLASYMP            660       -
OWLIRREFL           342       -
INVFUNC             338       1
MATCH                 9       9
LITRAN                5       5
COMP                  4       4
ONELANG               4       4
PROPDEP               4       4
TYPDEP                2       2
TRIPLE                2       2

Table 4: Number of total and manual test cases per pattern for all LOV vocabularies.

Table 5 provides an overview of the dataset quality evaluation. In Table 6 we present the total errors aggregated per schema and in Table 7 the total errors aggregated per pattern. The test coverage for every dataset is provided in Table 8.

The occurrence of a high number of errors in the English DBpedia is attributed to the data loaded from external sources. For example, the recent load of transformed data (cf. http://www.mail-archive.com/[email protected]/msg05583.html) almost doubled the rdfs:domain and rdfs:range violations and the errors in the geo schema. A common error in DBpedia is the rdfs:range violation: triples are extracted from data streams and complete object range validation cannot occur at the time of extraction. Example violations from the dbo schema are the rdfs:domain of the dbo:sex (1M) and dbo:years (550K) properties. Other dben errors based on the foaf schema are attributed mainly to the incorrect rdfs:domain or rdfs:range of foaf:primaryTopic (12M), foaf:isPrimaryTopicOf (12M), foaf:thumbnail (3M) and foaf:homepage (0.5M).
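The automatic instantiation step described above can be sketched as a loop over (possibly enrichment-derived) schema axioms, each mapped to the matching pattern. This is a hedged illustration: the axiom tuples and pattern bodies below are made-up examples, not the tool's actual data structures.

```python
# Hedged sketch: turn schema axioms (including automatically enriched
# ones) into concrete test case queries, one per matching pattern.
PATTERNS = {
    "rdfs:domain": "SELECT DISTINCT ?s WHERE {{ ?s {p} ?v . "
                   "FILTER NOT EXISTS {{ ?s rdf:type {c} . }} }}",
    "rdfs:range":  "SELECT DISTINCT ?s WHERE {{ ?s {p} ?v . "
                   "FILTER NOT EXISTS {{ ?v rdf:type {c} . }} }}",
}

def generate_test_cases(axioms):
    """axioms: iterable of (axiom_type, property, class) triples."""
    return [PATTERNS[t].format(p=p, c=c)
            for t, p, c in axioms if t in PATTERNS]

tests = generate_test_cases([
    ("rdfs:domain", "dbo:birthDate", "dbo:Person"),  # hypothetical axiom
    ("rdfs:range", "dbo:birthPlace", "dbo:Place"),   # hypothetical axiom
])
print(len(tests))  # → 2
```

Running such a generator over all domain and range axioms of the considered schemas is what yields the large RDFSDOMAIN and RDFSRANGE counts in Table 4.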

Dataset            Triples      Subjects    TC     Pass   Fail   TO  Errors      ManEr      EnrEr    E/R
dbpedia.org        817,467,330  24,922,670  6,064  4,288  1,860  55  63,644,169  5,224,298  249,857  2.55
nl.dbpedia.org     74,790,253   4,831,594   5,173  4,149  812    73  5,375,671   211,604    15,041   1.11
linkedgeodata.org  274,690,851  51,918,417  634    545    86     3   57,693,912  133,140    1        1.11
datos.bne.es       60,017,091   7,470,044   2,473  2,376  89     8   27,943,993  25         537      3.74
id.loc.gov         436,126,273  53,072,042  536    499    28     9   9,392,909   49         3,663    0.18

Table 5: Evaluation overview for the five tested datasets. For every dataset we display the total number of triples and the number of distinct subjects. We list the total number of test cases (TC) that were run on each dataset and how many passed, failed and timed out (TO). Finally, we show the total number of errors, as well as the number of errors that originate from manual (ManEr) and enriched (EnrEr) test cases. The last column shows the average number of errors per distinct subject.
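The E/R column can be reproduced directly from the error and subject counts; the values below are copied from Table 5.

```python
# Recompute the errors-per-distinct-subject (E/R) column of Table 5
# as total errors divided by the number of distinct subjects.
table5 = {
    "dbpedia.org":       (63_644_169, 24_922_670),
    "nl.dbpedia.org":    (5_375_671, 4_831_594),
    "linkedgeodata.org": (57_693_912, 51_918_417),
    "datos.bne.es":      (27_943_993, 7_470_044),
    "id.loc.gov":        (9_392_909, 53_072_042),
}
e_per_r = {name: round(errors / subjects, 2)
           for name, (errors, subjects) in table5.items()}
print(e_per_r["dbpedia.org"])  # → 2.55
```

Note that the metric normalizes by distinct subjects, not triples, which is why the very large id.loc.gov dataset still scores lowest (0.18).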

Among the errors from the manual test cases created for the DBpedia ontology are the following:

• 163K (102K in dbnl) resources with a wrong postal code format.
• 7K (137 in dbnl) books with a wrong ISBN format.
• 40K (1.2K in dbnl) persons with a death date but without a birth date.
• 638K persons without a birth date.
• 197K places without coordinates.
• 242K resources with coordinates that are not a dbo:Place.
• 28K resources with exactly the same coordinates as another resource.
• 9 resources with an invalid longitude.

                        Errors
Schema     TC    dben   dbnl   lgd    dat.   loc
dbo       5.7K   7.9M   716K   -      -      -
frbrer    2.1K   -      -      -      11K    -
lgdo       224   -      -      2.8M   -      -
isbd       179   -      -      -      28M    -
prov       125   25M    -      -      -      -
foaf        95   25M    4.6M   -      -      59
gsp         83   -      -      39M    -      -
mads        75   -      -      -      -      0.3M
owl         48   5      3      2      5      -
skos        28   41     -      -      -      9M
dcterms     28   960    881    191K   37K    659
ngeo        18   -      -      119    -      -
geo          7   2.8M   120K   16M    -      -

Table 6: Total errors in the evaluated datasets per schema.

The lgd dataset also has a high number of errors per resource. Although the LinkedGeoData ontology is big, the information of interest for our methodology is mostly limited to rdfs:domain and rdfs:range axioms. Due to its broad vocabulary, mostly stemming from curated crowdsourced user input, only a few manual test cases were found; in-depth domain knowledge is required to define further test cases. These resulted in 132K errors for resources with a lgdo:fixme predicate and 250 with lgdo:todo, 637 wrong phone numbers, and 22 resources having a lgdo:start property but no lgdo:end.

The datos dataset yielded a total of 28 million errors. In absolute numbers, rdfs:domain and rdfs:range violations were dominant. The isbd:P1016 and isbd:P1185 properties produced the most rdfs:domain violations (2.38M and 2.35M respectively). The schemas used in datos are expressive and there were many violations stemming from owl:disjointWith and owl:propertyDisjointWith constraints. With regard to the manual errors, 6 occurred due to shared literals between skos:prefLabel and skos:altLabel [26] and 25 because of property disjointness violations between skos:broader, skos:narrower and skos:related.

The loc dataset generated a total of 9 million errors. However, 99.9% originated from a single test case: the rdfs:domain of skos:member. Other minor errors occurred in other schemas (cf. Table 6), e.g. an incorrect rdfs:domain of skos:topConceptOf and an incorrect rdfs:domain of foaf:focus. Similar to the datos dataset, 49 manual errors occurred from disjoint properties between skos:broader, skos:narrower and skos:related.

The highest test coverage is found in the datos dataset. This is due to the rich frbrer and isbd schemas. Although dben had a bigger test set than datos, it publishes a lot of automatically generated properties under the dbp namespace [21, Section 2], which lowers its coverage scores. The low test coverage for lgd can be attributed to its very large but relatively flat and inexpressive schema. For DBpedia in Dutch we evaluated the Live endpoint and thus could not calculate property and class occurrences.

6. DISCUSSION

The most frequent errors in all datasets were produced by rdfs:domain and rdfs:range test cases. Domain and range are two of the most commonly expressed axioms in most schemas and thus produce many automated test cases and good test coverage. Errors from such violations alone cannot classify a dataset as low quality. In DBpedia, a resource is generated for every Wikipedia page and connected with the original article through the foaf:primaryTopic, foaf:isPrimaryTopicOf and prov:wasDerivedFrom predicates. DBpedia states neither that the Wikipedia page is a foaf:Document nor that the DBpedia resource is a prov:Entity, as the FOAF and PROV vocabularies demand. This produced a total of 33 million errors (35% of the total errors) in the English DBpedia. In most cases, fixing such errors is easy and dramatically reduces the error rate of a dataset. However, DBpedia, like most LOD datasets, does not load all the schemas it references in its endpoints. Thus, locating such errors by using only local
knowledge is not effective, whereas our pattern library can be used without further overhead.

Testing for external vocabularies. According to our methodology, a dataset is tested against all the schemas it references. Although this approach provides better testing coverage, it can be insufficient when testing against unused data. Properties like foaf:weblog that do not exist in either evaluated dataset auto-generate three test cases, for rdfs:domain, rdfs:range and owl:InverseFunctionalProperty. In the future, the methodology could be refined to intelligently pre-process a dataset and reduce the number of test cases to run.

Revision of manually instantiated patterns. Although our pattern library already covers a wide range of data quality errors, there are cases where the mere instantiation of patterns is not sufficient. Binding COMP-a (cf. Table 1), for example, returns 509 results in the English DBpedia. Some of these results have, however, incomplete dates (i.e. just xsd:gMonthDay). Technically, these results are outside the scope of the binding and the pattern and are, therefore, false positives. This can only be resolved by writing manual test cases or by adding another DQTP. In this scenario, the extended test case could be as follows:

  SELECT COUNT(*) WHERE {
    ?s dbo:birthDate ?v1 .
    ?s dbo:deathDate ?v2 .
    FILTER (?v1 > ?v2 && datatype(?v1) != xsd:gMonthDay
                      && datatype(?v2) != xsd:gMonthDay) }

While axioms in an OWL ontology are intended to be applicable in a global context, our test-driven methodology also depends on domain knowledge to capture more semantics in the data. However, there are cases where data constraints can be very application-specific and not universally valid. For instance, due to the vast size of DBpedia, it is unrealistic to expect completeness, e.g. that every dbo:Person has a foaf:depiction and a dbo:birthDate. However, in the context of an application like "A day like today in history" (http://el.dbpedia.org/apps/DayLikeToday/) these properties are mandatory. Thus, a refinement of the methodology could support manual test cases associated with an application context.

The software used to generate the test cases and produce the evaluation results is available as open source at http://github.com/AKSW/Databugger. At the project website, http://databugger.aksw.org, we provide a user interface and a dump of all results as RDF.

Pattern      dben   dbnl   lgd    datos  loc
COMP         1.7M   7      -      -      -
INVFUNC      279K   13K    -      511    3.5K
LITRAN       9      -      -      -      -
MATCH        171K   103K   637    -      -
OWLASYMP     19K    3K     -      -      -
OWLCARD      610    291    1      1      3
OWLDISJC     92     -      -      8.1K   1.1K
OWLDISJP     3.4K   7K     -      53     223
OWLIRREFL    1.4K   14     -      -      -
PVT          267K   1.2K   22     -      -
RDFSDOMAIN   31M    2.3M   55M    28M    9M
RDFSRANGE    26M    2.5M   191K   320K   111K
RDFSRANGED   760K   286K   2.7M   2      -
TRIPLE       -      -      132K   -      -
TYPDEP       674K   -      -      -      -
TYPRODEP     2M     100K   -      -      -

Table 7: Total errors per pattern.

Metric      dben     lgd      datos    loc
fpdom       20.32%   8.98%    72.26%   20.35%
fpran       23.67%   10.78%   37.64%   28.78%
fpdep       24.93%   13.65%   77.75%   29.78%
fcard       23.67%   10.78%   37.63%   28.78%
fmem        73.51%   12.78%   93.57%   58.62%
fcdep       37.55%   0%       93.56%   36.86%
Cov(QS,D)   33.94%   9.49%    68.74%   33.86%

Table 8: Test coverage on the evaluated datasets.

7. RELATED WORK

Previous Data Quality Measurements on DBpedia. The first publication of DBpedia [1] mainly concentrates on the data source – Wikipedia. Errors in the RDF data are attributed to several shortcomings in the authoring process, e.g. the usage of tables instead of templates, the encoding of layout information like color in templates and so on. Other inaccuracies occur due to an imprecise use of the wiki markup or when duplicate information is given, as in height = 5'11" (180cm). To avoid those errors the authors provide some authoring guidelines in accordance with guidelines created by the Wikipedia community.

In [20], the authors concentrate more on the extraction process, comparing the Generic with the Mapping-based Infobox Extraction approach. It is shown that by mapping Wikipedia templates to a manually created, simple ontology, one can obtain far better data quality, eliminating data type errors and achieving better linkage between entities of the dataset. Other errors concern class hierarchies, e.g. omissions in the automatically created YAGO classification.

Another issue, already addressed in the future work section of [20], is the fusion of cross-language knowledge of the language-specific DBpedia instances. This topic, as well as other internationalization issues, is treated in [18]. There, different extraction problems of the Greek DBpedia are presented that can also apply to other languages, especially those using non-Latin characters.

Another study, aimed at developing a framework for DBpedia quality assessment, is presented in [27] and involves a manual and a semi-automatic process. In the manual phase the authors detect common problems and classify them in a taxonomy. After that, they crowdsource the evaluation of a large number of individual resources and let users structure it according to their taxonomy.

General Linked Data Quality Assessment. There exist several approaches for assessing the quality of Linked Data. Approaches can be broadly classified into (i) automated (e.g. [12]), (ii) semi-automated (e.g. [9]) or (iii) manual (e.g. [2, 22]) methodologies. These approaches are useful at the process level, wherein they introduce systematic methodologies to assess the quality of a dataset. However, their drawbacks include a considerable amount of user involvement, an inability to produce interpretable results, or not allowing a user the freedom to choose the input dataset. In our case, we focused on a very lightweight framework and the development of a library based on real user input.

Additionally, there have been efforts to assess the quality of Web data [5] on the whole, which included the analysis of 14.1 billion HTML tables from Google's general-purpose web crawl in order to retrieve tables with high-quality relations. In a similar vein, in [14] the quality of RDF data was assessed. This study detected the errors occurring while publishing RDF data, along with the effects and means to improve the quality of structured data on the Web. In a recent study, 4 million RDF/XML documents were analysed, which provided insights into the level of conformance of these documents with the Linked Data guidelines. On the one hand, these efforts contributed towards assessing a vast amount of Web or RDF/XML data; however, most of the analyses were performed automatically, thereby overlooking the problems arising due to contextual discrepancies. In previous work, we used similar ideas for describing the evolution of knowledge bases [23].

Rules and SPARQL. The approach described in [11] advocates the use of SPARQL and SPIN for RDF data quality assessment and shares some similarity with our methodology. However, a domain expert is required for the instantiation of test case patterns. SPARQL Inferencing Notation (SPIN) [17] is a W3C submission aiming at representing rules and constraints on Semantic Web models. SPIN also allows users to define SPARQL functions and reuse SPARQL queries. The difference between SPIN and our pattern syntax is that SPIN functions would not fully support our pattern bindings. SPIN function arguments must have specific constraints on the argument datatype or argument class and do not support operators, e.g. '=', '>', '!', '+', '*', or property paths (http://www.w3.org/TR/2010/WD-sparql11-property-paths-20100126/). However, our approach is still compatible with SPIN when templates are initialised with specific sets of applicable operators; in that case, however, the number of templates increases. Due to these restrictions, SPIN defines fewer but more general constraints. The following SPIN example (http://topbraid.org/spin/owlrl-all.html#cax-dw) tries to locate all owl:disjointWith constraint violations:

  SELECT ?x WHERE {
    ?c1 owl:disjointWith ?c2 .
    ?x a ?c1 .
    ?x a ?c2 . }

The problems with these types of queries are that: 1) they are more expensive to execute, 2) they aggregate all errors in a single result, which makes it harder to debug, and 3) they cannot capture violations like those of foaf:primaryTopic if the foaf schema is not loaded in the knowledge base itself. One advantage of converting our templates to SPIN is that the structure of the SPARQL query itself can be stored directly in RDF, which, however, renders it more complex. From the efforts related to SPIN, we re-used their existing data quality patterns and ontologies for error types.

In a similar way, Fürber et al. [10] define a set of generic SPARQL queries to identify missing or illegal literal values and datatypes, as well as functional dependency violations.

Another related approach is the Pellet Integrity Constraint Validator (ICV) (http://clarkparsia.com/pellet/icv/). Pellet ICV [24] translates OWL integrity constraints into SPARQL queries. Similar to our approach, the execution of those SPARQL queries indicates violations. An implication of the integrity constraint semantics of Pellet ICV is that a partial unique names assumption (all resources are considered to be different unless equality is explicitly stated) and a closed world assumption are in effect. We use the same strategy as part of our methodology, but go beyond it by allowing users to directly (re-)use DQTPs not necessarily encoded in OWL and by providing automatic schema enrichment.

For XML, Schematron (http://www.schematron.com/) is an ISO standard for validation and quality control of XML documents based on XPath and XSLT. We argue that similarly adapted mechanisms for RDF are of crucial importance to provide solutions allowing the usage of RDF in settings which require either high-quality data or at least an accurate assessment of its quality.

In database research, there are related approaches to formulating common integrity constraints [7] using First Order Logic (FOL). The work presented in [8] uses FOL to describe data dependencies for quality assessment and suggests repairing strategies. Finally, in [19], the authors suggest extending RDF with constraints akin to those in an RDBMS in order to validate data using SPARQL as a constraint language. This is achieved by providing an RDF view on top of the data.

8. CONCLUSIONS AND FUTURE WORK

In this paper, we described a novel approach for assessing and improving Linked Data quality. The approach is inspired by test-driven software engineering and is centred around the definition of data quality integrity constraints, which are represented in SPARQL query templates. We compiled a comprehensive set of generic Data Quality Test Patterns (DQTP), which we instantiated for 297 schemas, resulting in 32,293 test cases. We reused these test cases to evaluate the quality of five LOD datasets. Our evaluation showed that DQTPs are able to reveal a substantial amount of data quality issues in an effective and efficient way.

We see this work as the first step in a larger research and development agenda to position test-driven data engineering similar to test-driven software engineering. In the future, we plan to create test-driven validators for web service formats such as the NLP Interchange Format [13]. Additionally, we aim to tackle automatic repair strategies, i.e. the use of templates and bindings to fix problems efficiently. We also plan to implement a test-driven data quality cockpit, which allows users to easily instantiate and run DQTPs based on custom knowledge bases. As a result, we hope that test-driven data quality can contribute to solving one of the most pressing problems of the Data Web – the improvement of data quality and the increase of Linked Data fitness for use.

Acknowledgment

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and DIACHRON (GA no. 601043).

9. ADDITIONAL AUTHORS

Additional authors: Amrapali Zaveri (University of Leipzig, email: [email protected]).

10. REFERENCES

[1] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of ESWC 2007, volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, Berlin/Heidelberg, 2007.
[2] C. Bizer and R. Cyganiak. Quality-driven information filtering using the WIQA policy framework. Web Semantics, 7(1):1–10, Jan 2009.
[3] L. Bühmann and J. Lehmann. Universal OWL axiom enrichment for large knowledge bases. In Proceedings of EKAW 2012, pages 57–71. Springer, 2012.
[4] L. Bühmann and J. Lehmann. Pattern based knowledge base enrichment. In 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[5] M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. WebTables: exploring the power of tables on the web. PVLDB, 1(1):538–549, 2008.
[6] J. Demter, S. Auer, M. Martin, and J. Lehmann. LODStats – an extensible framework for high-performance dataset analytics. In Proceedings of EKAW 2012, Lecture Notes in Computer Science (LNCS) 7603. Springer, 2012.
[7] A. Deutsch. FOL modeling of integrity constraints (dependencies). In L. Liu and M. Özsu, editors, Encyclopedia of Database Systems, pages 1155–1161. Springer US, 2009.
[8] W. Fan. Dependencies revisited for improving data quality. In Proceedings of the Twenty-seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '08, pages 159–170, New York, NY, USA, 2008. ACM.
[9] A. Flemming. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[10] C. Fürber and M. Hepp. Using semantic web resources for data quality management. In P. Cimiano and H. Pinto, editors, Knowledge Engineering and Management by the Masses, volume 6317 of Lecture Notes in Computer Science, pages 211–225. Springer Berlin Heidelberg, 2010.
[11] C. Fürber and M. Hepp. Using SPARQL and SPIN for data quality management on the semantic web. In W. Abramowicz and R. Tolksdorf, editors, BIS, volume 47 of Lecture Notes in Business Information Processing, pages 35–46. Springer, 2010.
[12] C. Guéret, P. T. Groth, C. Stadler, and J. Lehmann. Assessing linked data mappings using network measures. In Proceedings of the 9th Extended Semantic Web Conference, volume 7295 of Lecture Notes in Computer Science, pages 87–102. Springer, 2012.
[13] S. Hellmann, J. Lehmann, S. Auer, and M. Brümmer. Integrating NLP using linked data. In 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[14] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the pedantic web. In LDOW, 2010.
[15] Q. Ji, P. Haase, G. Qi, P. Hitzler, and S. Stadtmüller. RaDON – repair and diagnosis in ontology networks. In L. Aroyo, P. Traverso, F. Ciravegna, P. Cimiano, T. Heath, E. Hyvönen, R. Mizoguchi, E. Oren, M. Sabou, and E. P. B. Simperl, editors, ESWC, volume 5554 of Lecture Notes in Computer Science, pages 863–867. Springer, 2009.
[16] J. M. Juran. Quality Control Handbook. McGraw-Hill, 4th edition, August 1988.
[17] H. Knublauch, J. A. Hendler, and K. Idehen. SPIN – overview and motivation. W3C Member Submission, W3C, February 2011.
[18] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[19] G. Lausen, M. Meier, and M. Schmidt. SPARQLing constraints for RDF. In Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology, EDBT '08, pages 499–509, New York, NY, USA, 2008. ACM.
[20] J. Lehmann, C. Bizer, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[21] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 2014.
[22] P. N. Mendes, H. Mühleisen, and C. Bizer. Sieve: linked data quality assessment and fusion. In D. Srivastava and I. Ari, editors, EDBT/ICDT Workshops, pages 116–123. ACM, 2012.
[23] C. Rieß, N. Heino, S. Tramp, and S. Auer. EvoPat – pattern-based evolution and refactoring of RDF knowledge bases. In Proceedings of the 9th International Semantic Web Conference (ISWC2010), Lecture Notes in Computer Science, Berlin/Heidelberg, 2010. Springer.
[24] E. Sirin and J. Tao. Towards integrity constraints in OWL. In Proceedings of the Workshop on OWL: Experiences and Directions, OWLED, 2009.
[25] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[26] O. Suominen and E. Hyvönen. Improving the quality of SKOS vocabularies with Skosify. In Proceedings of the 18th International Conference on Knowledge Engineering and Knowledge Management, EKAW '12, pages 383–397, Berlin, Heidelberg, 2012. Springer-Verlag.
[27] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[28] H. Zhu, P. A. V. Hall, and J. H. R. May. Software unit test coverage and adequacy. ACM Comput. Surv., 29(4):366–427, 1997.
