<<

Sustainable Linked Data Generation: The Case of DBpedia?

Wouter Maroy1, Anastasia Dimou1, Dimitris Kontokostas2, Ben De Meester1, Ruben Verborgh1, Jens Lehmann3,4, Erik Mannens1, and Sebastian Hellmann2

1 imec – IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium {firstname.lastname}@ugent.be
2 Leipzig University – AKSW/KILT, Leipzig, Germany {lastname}@informatik.uni-leipzig.de
3 University of Bonn, Smart Data Analytics Group, Germany [email protected]
4 Fraunhofer IAIS, Sankt Augustin, Germany [email protected]

Abstract. The dbpedia ef, the generation framework behind one of the Linked Open Data cloud's central interlinking hubs, has limitations with regard to quality, coverage and sustainability of the generated dataset. The dbpedia dataset can be further improved both on the schema and on the data level. Errors and inconsistencies can be addressed by amending (i) the dbpedia ef; (ii) the dbpedia mapping rules; or (iii) Wikipedia itself, from which it extracts information. However, even though the dbpedia ef and the mapping rules are continuously evolving and several changes were applied to both of them, the dbpedia dataset has not significantly improved since its limitations were identified. To address these shortcomings, we propose adopting a different, semantics-driven approach that decouples, in a declarative manner, the extraction, transformation and mapping rules execution. In this paper, we provide details regarding the new dbpedia ef: its architecture, technical implementation and extraction results. This way, we achieve an enhanced data generation process, which can be broadly adopted, and which improves the dataset's quality, coverage and sustainability.

1 Introduction

The DBpedia Extraction Framework (dbpedia ef) [12] extracts raw data from Wikipedia and makes it available as Linked Data, forming the well-known and broadly used dbpedia dataset. The majority of the dbpedia dataset is derived from Wikipedia templates and annotated with the dbpedia ontology5 [12]. The dbpedia dataset is further enriched with additional information

derived from the articles, such as free text, abstracts, images, links to other Web pages, and tables. Mapping Templates, i.e., rules generating most of the dbpedia dataset from Wikipedia, are defined by a world-wide crowd-sourcing effort, maintained via the dbpedia mappings wiki6, and executed by the dbpedia ef. Even though dbpedia is one of the central interlinking hubs in the Linked Open Data (lod) cloud [17], its generation framework has limitations that reflect on the generated dataset [14,19]. We distinguish two types of issues with dbpedia:

schema-level Rules that define how to apply vocabularies to raw data [6,10,14], e.g., the dbo:militaryBranch property is used for entities of dbo:MilitaryUnit type, but it should only be used with entities of dbo:Person type [13].
data-level Extracted, or processed and transformed data values. The former includes incorrect, incomplete or irrelevant extracted values and datatypes, or not (well-)recognized templates [19]; the latter includes issues with parsing values [18], interpreting/converting units [18], or transforming cardinal directions [15].

? The described research activities were funded by Ghent University, imec, Flanders Innovation & Entrepreneurship (AIO), the Fund for Scientific Research Flanders (FWO Flanders), and the EU's H2020 program for ALIGNED (GA 644055).
5 http://dbpedia.org/ontology/
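To make the schema-level case concrete, the following Turtle sketch contrasts the ontology's intention with a violating triple; only the dbo:militaryBranch restriction is taken from [13], while the concrete resources are illustrative assumptions:

@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix dbr:  <http://dbpedia.org/resource/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Schema level: the ontology intends the property for persons only.
dbo:militaryBranch rdfs:domain dbo:Person .

# Data level: the property is nonetheless applied to a military unit,
# so the subject would incorrectly be inferred to be a dbo:Person.
dbr:101st_Airborne_Division a dbo:MilitaryUnit ;
    dbo:militaryBranch dbr:United_States_Army .

A validator that processes both the ontology and the generated triples can flag such combinations automatically.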

Errors or inconsistencies in dbpedia can be addressed by amending the dbpedia ef, the mapping rules, or Wikipedia itself [19]. However, even though several changes were applied to both the dbpedia ef and the mapping rules [12], quality issues still persist. For instance, 32% of its mapping rules are involved in at least one inconsistency [13], and more than half of all cardinal direction relations are still invalid or incorrect [15]. These issues hamper the dbpedia dataset's usage; hence the demands for the dbpedia ef to improve in terms of expressivity, i.e., the level of detail that a user can specify in mapping rules; semantic flexibility, i.e., the possibility to interchange among different schemas; and modularity, i.e., declarative rules and modular components for extraction, transformation and mapping. Given that quality issues persist over a long period of time [13,15], the custom dbpedia ef appears not to meet the above, and more foundational changes are required.

In this work, we show that the coupling among extraction, i.e., retrieving data from Wikipedia, transformation, i.e., processing the extracted data values, and mapping rules execution, i.e., applying semantic annotations to the retrieved and transformed data values, contributes to the dbpedia ef's inadequacy to cope with the increasing demands for high quality Linked Data. While merely decoupling these processes and keeping them hard-coded would be (re-)engineering, we look into a more radical solution that makes the dbpedia ef semantics-driven rather than a black box. The goal is to adjust the current dbpedia ef and provide a general-purpose and more sustainable framework that enables more added value. For instance, in terms of semantic flexibility, interchanging among different schema annotations should be supported, instead of, e.g., being coupled to a certain ontology, as is now the case, allowing only a certain semantic representation. In this paper, we show how we incorporate into the existing dbpedia ef a general-purpose, semantics-driven Linked Data generation approach based on rml [7]. It replaces the current solution, decouples extraction, transformations and conditions from the dbpedia ef, and enables generating high quality Linked Data for dbpedia. The activities to achieve this solution started as a GSoC 2016

6 http://mappings.dbpedia.org/index.php/Main_Page

project7,8 and continue to this day. Our contribution is three-fold: we (i) outline the limitations of the current dbpedia ef; (ii) identify the requirements for a more sustainable approach; and (iii) incorporate into the current dbpedia ef a generic approach that fulfills those requirements, and compare the two approaches. The paper is organized as follows: after outlining the state of the art (Section 2), we detail the current dbpedia ef limitations and the requirements for a sustainable framework (Section 3). In Section 4, we introduce our approach and provide a corresponding implementation. In Section 5, we provide the results of our approach's evaluation. Finally, we summarize our conclusions in Section 6.

2 Background and Related Work

In this section, we discuss related work in Linked Data generation (Section 2.1) and we outline the current dbpedia ef functionality (Section 2.2).

2.1 State of the Art

In the past, case-specific solutions were established for Linked Data generation, which couple schema and data transformations. xslt- or xpath-based approaches were established for generating Linked Data from data originally in xml format, e.g., AstroGrid-D9. There are also query-oriented languages that combine sparql with other languages or custom alignments to the underlying data structure. For instance, xsparql [1] maps data in xml format and Tarql10 data in csv. Nevertheless, those approaches cannot be extended to cover other data sources. Different approaches emerged to define data transformations declaratively, such as volt [16] for sparql, Hydra [11] for Web Services, or fno [4]. Hydra and volt depend on the underlying system (Web Services and sparql, respectively), thus their use is inherently limited to it. Using Hydra descriptions for executing transformations only works online, whereas volt only works for data already existing in a sparql endpoint. Describing the transformations using fno does not include this dependency, and thus allows for reuse in other use cases and technologies.

2.2 DBpedia Extraction Framework

The dbpedia Extraction Framework (dbpedia ef) generates the dbpedia dataset [12]. It extracts data from Wikipedia, such as infobox templates or abstracts. The dbpedia ef consumes a source, i.e., an abstraction of a set of Wiki pages in wikitext syntax, parses it with a Wiki parser, and transforms it into an Abstract Syntax Tree (ast). To process the page's content, several ast-processing components, called extractors (e.g., Mapping Extractor, Geo Extractor), traverse a Wiki page's ast and generate rdf triples. Eventually, the dbpedia ef forwards them to a sink that outputs them to different datasets based on their properties.

7 www.mail-archive.com/[email protected]/msg07837.html
8 https://summerofcode.withgoogle.com/projects/#6213126861094912
9 http://www.gac-grid.de/project-products/Software/XML2RDF.html
10 https://tarql.github.io/

Extractors The core dbpedia ef components are the extractors, whose common main functionality is to traverse the ast-representation of a Wiki page and generate rdf triples based on the syntax tree and other heuristics. Many extractors contribute to forming the dbpedia dataset [12, Section 2.2]. A prominent one is the Mapping-Based Infobox Extractor, which uses manually written mapping rules that relate infobox templates in Wikipedia to terms in the dbpedia ontology, and generates rdf triples. It is the most important extractor, as the most valuable content of the dbpedia dataset is derived from infoboxes [12], and it covers the greatest part of the dbpedia ontology. Infobox templates are defined in wiki syntax and summarize information related to a page's article in a structured, consistent format, which is presented on the article's page as a table with attribute-value pairs.

Mapping-Based Infobox Extractor It uses community-provided, but manually written mapping rules, which are available at the dbpedia Mappings Wiki.11 The mappings wiki enables users to collaboratively create and edit mapping rules, specified in the custom dbpedia Mapping Language. The dbpedia Mapping Language relies on MediaWiki templates to define the dbpedia ontology classes and properties and align them with the corresponding template elements [12]. A mapping rule assigns a type from the dbpedia ontology to entities that are described by the corresponding infobox, and the infobox's attributes are mapped to dbpedia ontology properties. This extractor traverses the ast and finds infoboxes for which user-defined mapping rules were created. The infoboxes' attribute-value pairs are extracted and rdf triples are generated.

DBpedia mapping language The custom dbpedia mapping language is in wikitext syntax and is used to define how rdf triples are generated from infoboxes. Each mapping template contains mapping rules for a certain infobox template. The mapping rules might specify another mapping template for an infobox property, such as templates for geocoordinates or date intervals, to help the dbpedia ef produce high quality Linked Data [12]. The following exemplary mapping document contains three different mapping templates: (i) a class mapping (Listing 1, line 2), which maps articles that use that template to the dbpedia Automobile class; (ii) a property mapping template (line 4), which maps the name attribute in the infobox to foaf:name; and (iii) a date interval mapping template (line 7), which splits the production attribute (which should be a start and end date) and semantically annotates each part with different ontology terms.
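For illustration, assume a hypothetical article Example_Car whose infobox sets name to 'Example Car' and production to a 1999–2005 interval. Under the mapping templates of Listing 1, triples along the following lines would be generated (a hedged sketch; the exact datatypes and values depend on the dbpedia ef's date parsing):

@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix dbr:  <http://dbpedia.org/resource/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Class mapping: the article is typed as an Automobile.
dbr:Example_Car a dbo:Automobile ;
    # Property mapping: the 'name' attribute becomes foaf:name.
    foaf:name "Example Car" ;
    # Date interval mapping: 'production' is split into two triples.
    dbo:productionStartDate "1999"^^xsd:gYear ;
    dbo:productionEndDate   "2005"^^xsd:gYear .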

3 Limitations and Requirements

To resolve the dbpedia dataset quality issues, the current dbpedia ef limitations need to be addressed by adopting a more sustainable approach. In this section, we discuss the current limitations (Section 3.1) and the requirements for a sustainable framework (Section 3.2) that enables higher quality Linked Data generation.

11 http://mappings.dbpedia.org

1  {{TemplateMapping
2  | mapToClass = Automobile
3  | mappings =
4     {{PropertyMapping
5     | templateProperty = name
6     | ontologyProperty = foaf:name }}
7     {{DateIntervalMapping
8     | templateProperty = production
9     | startDateOntologyProperty = productionStartDate
10    | endDateOntologyProperty = productionEndDate }} }}

Listing 1: Extract of a dbpedia mapping template in wikitext syntax

3.1 Limitations

The current dbpedia ef has the following limitations:

mapping rules and implementation coupling Although a set of dbpedia Mapping Templates (dmts) is available, the dbpedia community is limited to this set only and cannot easily introduce new ones. The reason is that the dmts' translation happens directly inside the dbpedia ef. Thus, the different mapping rules that are defined in each dmt are coupled to their implementation. Some rules, such as value transformations, are in some cases more flexible to change. For example, changing the language value from 'de' to 'el', or converting a string to an iri by appending a namespace, can be done directly from the mapping rules. However, there are still many useful mapping rules that are not supported without adjusting the dbpedia ef. For instance, combining different values of an infobox is not supported. Any community member who would like to adjust the mapping rules needs to be familiar with both dbpedia ef development and Linked Data principles. Extending, adjusting or adding a new template requires (i) extending the current custom dbpedia mapping language, by specifying and documenting the new constructs, and (ii) providing the corresponding implementation to extract these constructs and generate the corresponding rdf triples. Extending the language is not straightforward and is performed in an ad-hoc manner, while the corresponding implementation is developed as a custom solution, hampering the dbpedia ef's maintenance.

transformations and implementation coupling The dbpedia community can neither adjust nor add new transformations, because the mapping rules only refer to schema transformations. Transformations over extracted data values are hard-coded and executed at different places within the dbpedia ef, from extraction to mapping and rdf generation. For instance, the actual value for the birth_date property of the Person infobox is {{Birth date and age|yyyy|mm|dd}}. The dbpedia ef parses and transforms this into a valid xsd date value, i.e., "yyyy-mm-dd". If another date format is desired, it has to be implemented within the dbpedia ef. Similarly, for the mapping rules that use the dbpedia ontology, the dbpedia ef retrieves the ranges of the dbpedia ontology predicates and uses the defined range to transform

the values accordingly. For example, for object properties, the dbpedia ef tries to extract only links, and for datatype properties, custom parsers for numbers, floats, dates etc. are applied. Nonetheless, there are cases where the values can be of different types, and the users cannot override this behavior without extending the dbpedia ef. Another limitation is the hard-coded unit measurement calculations which, when combined with not consistently formatted input, can lead to wrong values [18]. Overall, the dbpedia dataset is restricted to certain transformations which cannot be easily amended, unless the dbpedia ef is amended, which, in its turn, is not trivial. Even reusing existing transformation functions requires adjusting the dbpedia ef.

hard-coded mapping rules The dbpedia community cannot adjust all rdf triples which form the dbpedia dataset, because not all of them are generated based on mapping rules. Certain rdf terms and triples are generated without mapping rules being defined: the dbpedia ef generates them based on hard-coded mapping rules. Adjusting such mapping rules requires adjusting the dbpedia ef. For instance, each rdf triple's subject is dependent on the context of the extraction's execution. Entities in localized datasets (a dbpedia dataset within a certain language) are identified with a subject that is a language-specific iri, e.g., the http://{lang}.dbpedia.org/resource/{resource} namespace is used in the different dbpedia language editions. Generating rdf terms that represent entities with other identifiers, or adding new entities, cannot be expressed with a mapping rule. For instance, adjusting the aforementioned iri template to http://{lang}.dbpedia.org/example/{resource}, or generating another entity with a different iri, such as http://example.com/{resource}, requires adjusting the implementation. Overall, configuring the current mapping rules has limited influence on the generation of most rdf terms and triples.

restricted to the DBpedia ontology The dbpedia community cannot use any schema other than the dbpedia ontology to annotate the Wikipedia pages. The current mapping extractor functions only with the dbpedia ontology, e.g., the predicate depends on the ontology term used for a certain attribute of an infobox. This occurs because the dbpedia ef interprets the context and selects the corresponding parser based on where the mapping template is used and which ontology term is selected. For instance, dbo:date triggers the date parser. If an ontology term is not added to the dbpedia ontology, it cannot be used, e.g., only dbo:location may be used to indicate an entity's location. Other vocabularies, such as the geo12 vocabulary, cannot be used unless imported into the dbpedia ontology. Incorporating any other vocabulary requires adjusting the dbpedia ef, because the extractor will not recognize its properties: it will not generate rdf triples if geo:location is provided. Only certain vocabularies, such as dcterms13 or foaf14, are supported. Similarly, the assigned datatype is also dependent on the mapping template and ontology term, e.g., the area in square kilometres

12 http://www.w3.org/2003/01/geo/wgs84_pos#
13 http://purl.org/dc/terms/
14 http://xmlns.com/foaf/0.1/

generates an xsd:double but also a dbpedia datatype (dbo:areaTotal) that depends on the used predicate.

domain validation If the dbpedia community uses ontology terms which cause violations, there is no support for schema validation. Domain validation of the mapping templates defined in the custom dbpedia mapping language cannot be supported [6]. Nevertheless, the dbpedia dataset quality would significantly improve if schema validation were applied to the mapping templates [6,13,19]. Currently, custom dbpedia mapping templates are only validated for syntax violations (using the Mapping Syntax Validator15 [12]).

3.2 Requirements

Adjusting the dbpedia ef with a general-purpose Linked Data generation tool allows adopting a more sustainable solution and enables generating higher quality Linked Data. Such a sustainable approach has the following requirements:

1. declarative mapping rules The dbpedia community needs to be able to directly edit, adjust, and define mapping rules for all rdf triples which are generated, while the underlying implementation should be able to interpret them in each case. Hence, a declarative language is required which is able to express all mapping rules. A corresponding underlying implementation should generate all rdf triples relying on declaratively defined mapping rules, whether they refer to schema or data transformations [5]. If the dbpedia mapping rules are formalized in a generic and complete approach, the dbpedia ef mapping process and maintenance will be improved and, thus, the dbpedia dataset quality will improve too, as already indicated by, e.g., [6,13].

2. modular and decoupled implementation The dbpedia community should be able to add new mapping rules for schema annotations, and alter or add new transformation rules, as well as data transformation libraries, without having to adjust the underlying implementation. The extraction, transformation, mapping and rdf generation should be decoupled from each other, but aligned as modular components which can be extended or replaced if, e.g., other or new transformations are desired [5].

3. machine-processable mapping rules The dbpedia community should be able to build applications which can automatically process the mapping rules. To achieve this, mapping rules should be processable by both humans and machines [7]. For instance, machine-processable mapping rules can be assessed not only for syntax but also for schema validation. Moreover, the results of the validation might be automatically processed, as occurs, e.g., with [6,13], or automated mapping rule generation might occur, as indicated by [8,9].

4. vocabulary-independent The dbpedia community should be able to generate any dataset that is desired, uniquely identify entities as desired, and annotate data values derived from Wikipedia pages relying on any vocabulary. Mapping rules should be defined and executed, and rdf terms and triples should be generated, independently of the vocabulary used to annotate the extracted data values.

15 http://mappings.dbpedia.org/server/mappings/en/validate/

4 Sustainable DBpedia EF with RML

To overcome the current dbpedia ef limitations, we developed a solution that fulfills the aforementioned requirements. r2rml [2] is a w3c standardized language for defining mapping rules to generate Linked Data from data residing in relational databases. The rdf Mapping Language (rml) [7] extends r2rml [2] to enable specifying how Linked Data is generated from sources in different (semi-)structured formats, such as csv, xml, and json. Mapping rules in rml are expressed as rdf triples (Requirement 3) and any vocabulary can be used to annotate the data (Requirement 4). Thanks to its extensibility [7], rml can also cover wikitext syntax to generate Linked Data from Wikipedia (Requirement 1). Moreover, rml is aligned with fno [4], an ontology to define data transformations declaratively (Requirement 1). The rmlmapper executes mapping rules expressed in rml, while the fno processor interprets data transformations expressed in fno and discovers corresponding libraries that execute them (Requirement 2). In this section, we discuss how we replaced the existing infobox-based Mapping Extractor, which is custom to the dbpedia ef, with the rmlmapper16 and the challenges we faced. The transition to the new solution was fulfilled in three steps:

1. mapping rules translation The dbpedia mapping rules were translated into rml (schema) and fno (data transformation) statements (Section 4.1);
2. transformations decoupling The dbpedia Parsing Functions were decoupled from the dbpedia ef and aggregated in a distinct module (Section 4.2);
3. mapping rules execution The rmlmapper was integrated as an extractor (the rmlextractor) in the dbpedia ef (Section 4.3) to execute the mapping rules.

4.1 Mapping rules translation

To take advantage of the rml-based solution, we needed to translate the custom dbpedia mapping rules into rml statements. This was one of the most labor-intensive tasks of the integration process and it was performed in two phases: first the translation and then the decoupling. In more detail:

1. The custom mapping templates in wikitext syntax were translated into rml statements (as sketched below). The rml mapping rules after the first phase can be found at http://mappings.dbpedia.org/server/mappings/en/pages/rdf/.
2. The mapping rules which were embedded in the dbpedia ef were expressed as rml statements and the data transformations as fno statements. Every mapping template was assigned its own function, e.g., the DateInterval Template was assigned the DateInterval Function with the same parameters. Moreover, we added basic functions like ExtractDate, which extracts and processes dates from a value. The rml mapping rules after this phase can be found at http://mappings.dbpedia.org/rml_mappings-201705.zip.
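As an impression of phase 1, the following hedged Turtle sketch shows how the class and property mappings of Listing 1 could look once translated to rml; the Triples Map name is illustrative, while the {wikititle} reference follows the pattern of Listing 2:

@prefix rr:   <http://www.w3.org/ns/r2rml#> .
@prefix rml:  <http://semweb.mmlab.be/ns/rml#> .
@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# TemplateMapping -> Triples Map: articles using the infobox are
# typed as dbo:Automobile via the subject map's class.
<#Mapping_en_Infobox_automobile>
    rr:subjectMap [
        rr:template "http://en.dbpedia.org/resource/{wikititle}" ;
        rr:class dbo:Automobile
    ] ;
    # PropertyMapping -> Predicate-Object Map: the infobox attribute
    # 'name' becomes the object of foaf:name.
    rr:predicateObjectMap [
        rr:predicate foaf:name ;
        rr:objectMap [ rml:reference "name" ]
    ] .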

16 https://github.com/RMLio/RML-Mapper

1  <#infobox_country_mapping_en> rr:subjectMap <#SubjectMap> ;
2      rml:predicateObjectMap <#NamePOM> .
3
4  <#SubjectMap> rr:template "http://en.dbpedia.org/resource/{wikititle}" .
5  <#NamePOM> rr:predicateMap [ rr:constant dbo:name ] ;
6      rr:objectMap [ rml:reference "common_name" ; rr:datatype xsd:string ] .
7
8  dbf:extractEntity a fno:Function ;
9      fno:name "generates a DBpedia IRI" ;
10     dcterms:description "returns an entity" ;
11     fno:expects ( [ fno:predicate dbf:property ] ) ;
12     fno:output ( [ fno:predicate dbf:entity ] ) .
13
14 :exe a fno:Execution ;
15     fno:executes dbf:extractEntity ;
16     dbf:property "Bill Gates" ;
17     dbf:entity dbr:Bill_Gates .

Listing 2: rml mapping rules and fno data transformations

Manually translating all mapping rules into rml would have been difficult; hence, a temporary extension of the dbpedia ef was built to automate the translation17. This extension builds on top of the rmlmodel18, and can be run both within the dbpedia ef and standalone. To generate the new rml mapping rules, the original custom dbpedia mapping rules are loaded by the dbpedia ef. Once loaded, the mapping rules are represented in the dbpedia ef data structures and, based on these, rml mapping rules are automatically generated. Each template is translated into rml statements, which are dumped to a file.

The rml building blocks are Triples Maps (Listing 2: line 1), which define how rdf triples are generated. A Triples Map consists of three main parts: the Logical Source, the Subject Map and zero or more Predicate-Object Maps. The Subject Map (line 4) defines how unique identifiers (uris) are generated for the mapped resources and is used as the subject of all rdf triples generated from this Triples Map. A Predicate-Object Map (line 2) consists of Predicate Maps (line 5), which define the rule that generates the triple's predicate, and Object Maps (line 6) or Referencing Object Maps, which define how the triple's object is generated. The Subject Map, the Predicate Map and the Object Map are Term Maps, namely rules that generate an rdf term (an iri, a blank node or a literal).

The Function Ontology (fno) [3,4] allows declaring and describing functions uniformly, unambiguously, and independently of their implementation technology. A function (fno:Function, Listing 2: line 8) is an activity which has input parameters (line 11), output (line 12), and implements certain algorithm(s). A parameter is the description of a function's input value. An output is the description of its output value. An execution (fno:Execution, line 14) assigns values to the function's parameters for a certain execution. For instance, dbf:extractEntity (line 8) is a function extracted from the dbpedia ef which generates a dbpedia iri (line 17)

17 https://github.com/dbpedia/extraction-framework/tree/rml/server/src/main/scala/org/dbpedia/extraction/server/resources/rml
18 https://github.com/RMLio/RML-Model

for a given Wikipedia title (line 16) which is passed as a parameter. An Execution (line 14) can be instantiated to bind a value to the parameter. The result is then bound to that Execution via the dbf:entity property (line 17).

Addressed challenges While the original dbpedia mapping templates mainly build on top of a core template, different templates emerged to cover specific cases, such as templates for geocoordinates and date intervals. Therefore, we had to make sure that each extension and each edge case was covered and that rml mapping rules were automatically generated. Nevertheless, a few cases were deliberately excluded because they were not used or were introduced only for very rare or specific cases; others were omitted because they did not produce sustainable iris (see Section 5.1 for details). For the latter, the community still needs to agree on a more sustainable modeling. Moreover, we had to interpret each parameter's underlying functionality and describe it declaratively. Last, the mapping rules and data transformations which were embedded in the dbpedia ef also had to be declaratively described, which required manually defining additional rml mapping rules.

Refining and extending the mapping rules required several iterations before we reached a version that generates the same rdf triples as the current dbpedia ef. Moreover, even though it is convenient that certain rdf statements are generated without the community being involved, the complete lack of control over what is generated or how the data values are transformed is often the cause of deteriorating dbpedia dataset quality [18,19]. Then again, declaratively defining everything causes an overhead when editing the mapping rules. Nevertheless, it is not expected that the dbpedia community will directly edit the mapping rules, as a corresponding interface is foreseen (GSoC 201719).

4.2 Transformations decoupling

The custom dbpedia mapping rules allow the dbpedia community to partially define how Linked Data is generated from Wikipedia (schema transformations), though without allowing them to customize how the extracted data values are transformed (data transformations) to form the desired Linked Data. The data transformations are hard-coded in the dbpedia ef and restricted to what is implemented. To this end, after formally describing all dbpedia mapping rules and transformations, we decoupled their implementation from the dbpedia ef. All template functionalities were extracted and gathered into an independent module which can be found at https://github.com/FnOio/dbpedia-parsing-functions-scala/. The fno processor is integrated as an independent module in the rmlmapper. It uses the dbpedia data transformation declarations in fno to retrieve and execute the corresponding implementation each time.

19 https://summerofcode.withgoogle.com/projects/#6205485112885248

A function might disclose the functionality of a certain mapping template, e.g., the Property Mapping Template20 became the Simple Property Function21. Moreover, other functions, which performed data transformations without a mapping template being involved, were isolated from the dbpedia ef. There were three major cases: (i) the extract-entity function22, which takes a value of an infobox and creates a dbpedia iri from it. For instance, it takes 'Melinda gates' and returns http://dbpedia.org/resource/Melinda_Gates; (ii) the dates function, which takes a date and generates a valid xsd date; and (iii) the string function, which removes the wikitext syntax and returns the actual data value. All of these functions may be reused beyond the dbpedia ef scope.

Besides data transformations, there were mapping rules which were embedded in the dbpedia ef and automatically generated rdf triples without corresponding mapping rules having been defined, such as the datatype annotations, i.e., http://www.w3.org/2001/XMLSchema#double, as well as the custom dbpedia datatypes, e.g., https://dbpedia.org/ontology/PopulatedPlace/areaTotal.
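The decoupling pays off in the mapping rules: a declared function can be plugged into any term map. The following hedged sketch, using the fnml: alignment of rml and fno from [5], wires the extract-entity function of case (i) into an object map; the predicate-object map name, the dbo:spouse property, the "spouse" reference and the dbf: namespace binding are illustrative assumptions:

@prefix rr:   <http://www.w3.org/ns/r2rml#> .
@prefix rml:  <http://semweb.mmlab.be/ns/rml#> .
@prefix fnml: <http://semweb.mmlab.be/ns/fnml#> .
@prefix fno:  <https://w3id.org/function/ontology#> .
@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix dbf:  <http://dbpedia.org/function/> .

<#SpousePOM>
    rr:predicate dbo:spouse ;
    rr:objectMap [
        a fnml:FunctionTermMap ;
        rr:termType rr:IRI ;
        # The object IRI is computed by the declared function: the raw
        # infobox value (e.g., 'Melinda gates') becomes a dbpedia IRI.
        fnml:functionValue [
            rr:predicateObjectMap [
                rr:predicate fno:executes ;
                rr:objectMap [ rr:constant dbf:extractEntity ]
            ] ;
            rr:predicateObjectMap [
                rr:predicate dbf:property ;
                rr:objectMap [ rml:reference "spouse" ]
            ]
        ]
    ] .

Swapping dbf:extractEntity for another declared function changes the transformation without touching the dbpedia ef.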

Addressed challenges It was challenging to identify every single transformation in the current dbpedia ef, as they were spread all over the implementation. The mapping rules only provide a very abstract, high-level idea (e.g., dates or coordinates). Moreover, extracting those transformations and gathering them as an independent module for reuse beyond the scope of dbpedia required a thorough analysis of the dbpedia ef implementation. The heuristics of these functions are optimized for a large number of cases, and their reuse would contribute to improving the quality of datasets beyond dbpedia.

4.3 Mapping rules execution

The generation procedure of the dbpedia ef is data-driven: the Linked Data generation is driven by the data sources, and an extract of data is considered each time to perform the applicable mapping rules; in contrast to a mapping-driven procedure, where the mapping rules request the applicable data. To make sure that the performance remains good, the mapping rules are pre-loaded. This offers faster lookups among the mapping rules, compared to re-loading them every time from the hard disk.

The rml mapping rules are executed by triggering the rmlextractor. At the beginning of each generation process, the RML-MapDocHandler23 loads into memory all rml mapping rules which are considered for generating dbpedia. Once all rml mapping templates are loaded, the dbpedia dataset generation is initiated. The data is extracted from all Wikipedia pages and the Extraction Manager forwards the incoming ast

20 https://github.com/dbpedia/extraction-framework/blob/0496309a0e142b27d940e9d8baa25446b1da4ccb/core/src/main/scala/org/dbpedia/extraction/mappings/SimplePropertyMapping.scala
21 https://github.com/FnOio/dbpedia-parsing-functions-scala/blob/development/src/main/scala/functions/implementations/SimplePropertyFunction.scala
22 https://github.com/FnOio/dbpedia-parsing-functions-scala/blob/development/src/main/scala/functions/implementations/core/ExtractEntityFunction.scala
23 https://github.com/RMLio/RML-MapDocHandler

to all configured extractors, including the rmlextractor. The rmlextractor runs over the data to identify the infobox templates which are included in a certain Wiki page. When an infobox template is found, the corresponding rml mapping rules are identified and, once all relevant ones are gathered, the rmlmapper is triggered with the ast and mapping rules as input. For instance, when the rmlextractor finds an infobox, e.g., Infobox_person, it loads the corresponding rml mapping rules, e.g., Mapping_en:Infobox_person.rml.ttl. Each infobox, together with its associated set of rml mapping rules, is provided to the rmlmapper, which processes these inputs and returns rdf triples. Eventually, the rmlextractor forwards all generated rdf triples to an output sink.
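Conceptually, the wiring between an infobox and its rules can be pictured as a Logical Source declaration. The following sketch is purely illustrative, since the exact source description and the ql:wikiText reference formulation identifier are assumptions; rml's extensibility [7] is what permits defining such a formulation for wikitext:

@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .

# Hypothetical logical source: the Triples Map declares which infobox
# of the Wiki page's ast it consumes and how references are resolved.
<#Mapping_en_Infobox_person> rml:logicalSource [
    rml:source "Infobox_person" ;
    rml:referenceFormulation ql:wikiText
] .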

Addressed challenges Developing an rmlextractor that performs large extractions from a Wikipedia dataset within an acceptable time frame had some complexities. The rmlmapper was not designed to be used as an external module, so adaptations had to be made to make the integration with the dbpedia ef possible. Additionally, the rmlmapper does not process datasets in parallel, whereas the dbpedia ef does exploit parallel processing. These obstacles have been addressed, but there is still room for improvement. Besides performance, there were issues with processing steps (e.g., storing triples in specific datasets) during mapping rule execution. These processing steps were initially executed by the dbpedia ef but were adapted because, with the new approach, the rmlmapper handles the mapping execution instead of the dbpedia ef.

4.4 New DBpedia architecture

Fig. 1: To the left, the Mappings Extractor; to the right, the new RML Extractor.

The dbpedia ef allows adding different extractors to fulfill different Linked Data generation processes. Taking advantage of its modularity, a new extractor was built, called the rmlextractor. The dbpedia ef with the rmlextractor can be found at https://github.com/dbpedia/extraction-framework/tree/rml.

The rmlextractor, as every other extractor, is available to be used when a user triggers dbpedia's generation. It is independent of other extractors and runs over each ast to generate rdf triples for the infobox values that are contained in the ast, based on the loaded rml mapping templates. The rmlextractor wraps the rmlmapper, which is used as a library (Figure 1). Each infobox, together with the associated mapping template, is given as input to the rmlmapper to generate the corresponding rdf triples. The rmlmapper depends on the Function Processor (fno processor)24, which in turn depends on the dbpedia parsing functions25. The fno processor parses the fno statements [3] which are included in the dbpedia mapping rules. Afterwards, it fetches the functions and executes them, using values derived from each extraction as parameters. The dbpedia parsing functions were extracted from the dbpedia data parsers and were put in an independent module which can be reused beyond the dbpedia ef scope.

5 Evaluation

The new approach was compared with the existing dbpedia ef with respect to coverage (Section 5.1), performance (Section 5.2), and flexibility (Section 5.3).

5.1 Coverage

We generated rdf triples for 16,244,162 pages of the English Wikipedia (version 2017050126), both with the current Mapping Extractor and the new rmlextractor. The former extraction yielded 62,474,123 rdf triples in total, the latter 52,887,156. In terms of entities, 4,747,597 are extracted with the current extractor, whereas 4,683,709 entities are extracted with the rmlextractor, offering 98% coverage. The overlapping subset of rdf triples was verified to be the same, but the rdf datasets differ in size. Even though the mapping rules of all languages were translated into rml statements, certain mapping rules were deliberately omitted, for the following reasons: (i) unsustainable URIs: the dbpedia ef itself generates intermediate entities whose uri is generated relying on an iterator, e.g., http://dbpedia.org/resource/Po_(river)__mouthPosition__1. Assigning uris this way does not generate sustainable identifiers. Therefore, those mapping rules were temporarily omitted until the dbpedia community decides – now that this can be resolved outside of the dbpedia ef – on how they should be modeled in a more sustainable fashion. (ii) custom dbpedia datatypes: the dbpedia ef generates additional rdf triples with custom dbpedia datatypes for units of properties. We omitted them from the rml mapping rules because there is still discussion on whether these should be included in the dataset or not. (iii) rdf triples added by the dbpedia ef: these are rdf triples generated from a value of an infobox if the article's text contains a uri with this value. Currently, the rmlmapper does not support referencing the article's text, thus generating those rdf triples

24 https://github.com/FnOio/function-processor-java
25 https://github.com/FnOio/dbpedia-parsing-functions-fno
26 https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-pages-articles.xml.bz2

Fig. 2: Performance comparison: Mappings Extractor vs RML Extractor

is temporarily omitted, until the rmlmapper is configured to work with data from Wikipedia beyond infoboxes.

5.2 Performance

The prototype rmlextractor generates the dbpedia dataset in reasonable time. Currently, performance is put aside in favor of more sustainable mapping rules and data transformations, as achieving sustainability is much more important than speed, which may be optimized in the future. The rmlextractor is still on average 0.46 ms slower per page than the original framework: the rmlextractor requires 1.31 ms/page, compared to the original dbpedia ef, which requires 0.85 ms/page, i.e., the dbpedia ef needs about 35% less time per page (Figure 2). On a larger scale, the rmlextractor generates Linked Data from 16,244,162 pages in 5 hours and 56 minutes, compared to the current dbpedia ef, which requires 3 hours and 50 minutes.
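As a consistency check, the per-page averages match the reported totals up to rounding:

\[ 16{,}244{,}162 \times 1.31\,\mathrm{ms} \approx 21{,}280\,\mathrm{s} \approx 5\,\mathrm{h}\,55\,\mathrm{min}, \qquad 16{,}244{,}162 \times 0.85\,\mathrm{ms} \approx 13{,}808\,\mathrm{s} \approx 3\,\mathrm{h}\,50\,\mathrm{min}. \]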

5.3 Flexibility

The mapping rules, being rdf triples, can be (automatically) updated, and other semantic annotations can be applied or other datasets can be generated from Wikipedia. For instance, relying on the dbpedia mapping rules and the alignment of dbpedia with schema.org27, we translated the rml mapping rules for dbpedia to schema.org, but any other vocabulary may be used too. This way, we can generate a dbpedia dataset with schema.org annotations, instead of only using the dbpedia ontology. This was accomplished by generating a new rml mapping file with schema.org annotations, based on the mapping rules with the dbpedia ontology annotations. The extracted data values are transformed to a clean format by relying on the same fno functions which are specified in the rml mapping rules. In our

27 http://schema.org

exemplary case, we relied on the Infobox_person mapping template, which generates rdf triples for entities of the person type. An extraction was done over 16,244,162 pages; 191,288 Infobox_person occurrences were found and 1,026,143 rdf triples were generated. Indicatively, 179,037 rdf triples were generated with the schema:name property, 54,664 with schema:jobTitle, 23,751 with schema:nationality, 144,907 with schema:birthPlace, and 139,488 with schema:birthDate. The dataset is available at http://mappings.dbpedia.org/person_schema.dataset.ttl.bz2.
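A hedged sketch of this interchange: the same extraction rule is annotated once with the dbpedia ontology and once with schema.org; only the predicate changes, while the infobox reference and any fno transformations are reused (the map names and the birth_place attribute are illustrative assumptions):

@prefix rr:     <http://www.w3.org/ns/r2rml#> .
@prefix rml:    <http://semweb.mmlab.be/ns/rml#> .
@prefix dbo:    <http://dbpedia.org/ontology/> .
@prefix schema: <http://schema.org/> .

# Annotation with the dbpedia ontology ...
<#BirthPlacePOM_dbo>
    rr:predicate dbo:birthPlace ;
    rr:objectMap [ rml:reference "birth_place" ] .

# ... and the same rule annotated with schema.org.
<#BirthPlacePOM_schema>
    rr:predicate schema:birthPlace ;
    rr:objectMap [ rml:reference "birth_place" ] .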

6 Conclusions and Future Work

In this paper, we presented a generic and semantics-driven approach to replace the Mapping Extractor in the dbpedia ef, which contributes to the adoption of Semantic Web technologies and Linked Data principles. Thereby, we address several major challenges that currently exist in the dbpedia generation process. We show that our solution can deliver the same Linked Data output with a more sustainable and reusable process, without prohibitively increasing the execution time. The resulting new dbpedia ef is easier to maintain, as changes are limited to declarative mapping and transformation rules. We moved from a tightly coupled architecture to a sustainable Linked Data generation approach, as the extractor's code does not require modifications anymore to improve the Linked Data generation. This better prepares dbpedia for changes, and thereby also facilitates quality improvements to the output. Indeed, many errors and inconsistencies in dbpedia are due either to the old dbpedia ef [19], which our solution supersedes, or to the dbpedia ontology [13], which can more easily be adjusted or replaced thanks to our solution. Moreover, we can apply validation to the mappings themselves, catching potential inconsistencies even before the data are generated [6]. For the above reasons, the dbpedia community will apply our solution as the default setup for generating dbpedia in the future. The source code is already available on the official DBpedia GitHub repository in a separate branch.

Importantly, and in contrast to the old dbpedia ef, the solution we propose is not specific to dbpedia and can therefore be applied to other use cases as well. Any part of the Linked Data generation can be reused to generate other datasets, such as the mapping and transformation rules, or the transformation functions that were written as small, reusable units. As this functionality becomes shared with other generation workflows, any improvement to it will directly lead to improvements in the others. Sustainability thereby spreads beyond just dbpedia.

This work gives rise to several opportunities for future work. A user interface is planned, which will enable people to collaboratively create, update, and manage the dbpedia generation, while assessing the impact of new ontologies becomes straightforward. By improving the sustainability of the dbpedia generation, we not only improve dbpedia today, but enable continuous advancements in the future.

References

1. S. Bischof, S. Decker, T. Krennwallner, N. Lopes, and A. Polleres. Mapping between RDF and XML with XSPARQL. Journal on Data Semantics, 1(3):147–185, 2012.
2. S. Das, S. Sundara, and R. Cyganiak. R2RML: RDB to RDF Mapping Language. Working Group Recommendation, W3C, Sep 2012. http://www.w3.org/TR/r2rml/.
3. B. De Meester and A. Dimou. The Function Ontology. Unofficial Draft, 2016. https://w3id.org/function/spec.
4. B. De Meester, A. Dimou, R. Verborgh, E. Mannens, and R. Van de Walle. An Ontology to Semantically Declare and Describe Functions. In The Semantic Web: ESWC 2016 Satellite Events, volume 9989 of LNCS, pages 46–49. Springer, 2016.
5. B. De Meester, W. Maroy, A. Dimou, R. Verborgh, and E. Mannens. Declarative data transformations for Linked Data generation: the case of DBpedia. In The Semantic Web – Latest Advances and New Domains (ESWC 2017). Springer, 2017.
6. A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, and R. Van de Walle. Assessing and Refining Mappings to rdf to Improve Dataset Quality. In The Semantic Web – ISWC 2015, volume 9367 of LNCS, pages 133–149. Springer, 2015.
7. A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle. RML: A Generic Language for Integrated rdf Mappings of Heterogeneous Data. In Proceedings of the 7th Workshop on Linked Data on the Web, volume 1184 of CEUR Workshop Proceedings, 2014.
8. P. Heyvaert. Ontology-Based Data Access Mapping Generation using Data, Schema, Query, and Mapping Knowledge. In ESWC-DC 2017 Doctoral Consortium, 2017.
9. P. Heyvaert, A. Dimou, A.-L. Herregodts, R. Verborgh, D. Schuurman, E. Mannens, and R. Van de Walle. RMLEditor: A Graph-based Mapping Editor for Linked Data Mappings. In The Semantic Web – Latest Advances and New Domains (ESWC 2016), volume 9678 of LNCS, pages 709–723. Springer, 2016.
10. D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven Evaluation of Linked Data Quality. In Proceedings of the 23rd International Conference on World Wide Web, pages 747–758. ACM, 2014.
11. M. Lanthaler. Hydra Core Vocabulary. Unofficial Draft, June 2014. http://www.hydra-cg.com/spec/latest/core/.
12. J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 2015.
13. H. Paulheim. Data-driven Joint Debugging of the DBpedia Mappings and Ontology. In The Semantic Web – Latest Advances and New Domains (ESWC 2017), volume 10250 of LNCS. Springer, 2017.
14. H. Paulheim and A. Gangemi. Serving DBpedia with DOLCE – More than Just Adding a Cherry on Top. In The Semantic Web – ISWC 2015, pages 180–196. Springer International Publishing, 2015.
15. B. Regalia, K. Janowicz, and S. Gao. VOLT: A Provenance-Producing, Transparent SPARQL Proxy for the On-Demand Computation of Linked Data and its Application to Spatiotemporally Dependent Data, pages 523–538. Springer, 2016.
16. B. Regalia, K. Janowicz, and S. Gao. VOLT: A Provenance-Producing, Transparent SPARQL Proxy for the On-Demand Computation of Linked Data and its Application to Spatiotemporally Dependent Data. In The Semantic Web – Latest Advances and New Domains (ESWC 2016), pages 523–538. Springer, 2016.
17. M. Schmachtenberg, C. Bizer, and H. Paulheim. Adoption of the Linked Data Best Practices in Different Topical Domains. In The Semantic Web – ISWC 2014, pages 245–260. Springer, 2014.
18. D. Wienand and H. Paulheim. Detecting Incorrect Numerical Data in DBpedia. In The Semantic Web: Trends and Challenges (ESWC 2014), pages 504–518. Springer International Publishing, 2014.
19. A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven Quality Evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, pages 97–104. ACM, 2013.