arXiv:2005.02614v1 [cs.IR] 6 May 2020
Piveau: A Large-scale Open Data Management Platform based on Semantic Web Technologies

Fabian Kirstein1,2, Kyriakos Stefanidis1, Benjamin Dittwald1, Simon Dutkowski1, Sebastian Urbanek1,2, and Manfred Hauswirth1,2,3

1 Fraunhofer FOKUS, Berlin, Germany
2 Weizenbaum Institute for the Networked Society, Berlin, Germany
3 TU Berlin, Open Distributed Systems, Berlin, Germany
[email protected]

Abstract. The publication and (re)utilization of Open Data is still facing multiple barriers on technical, organizational and legal levels. This includes limitations in interfaces, search capabilities, provision of quality information and the lack of definite standards and implementation guidelines. Many Semantic Web specifications and technologies are specifically designed to address the publication of data on the web. In addition, many official publication bodies encourage and foster the development of Open Data standards based on Semantic Web principles. However, no existing solution for managing Open Data takes full advantage of these possibilities and benefits. In this paper, we present our solution Piveau, a fully-fledged Open Data management solution, based on Semantic Web technologies. It harnesses a variety of standards, like RDF, DCAT, DQV, and SKOS, to overcome the barriers in Open Data publication. The solution puts a strong focus on assuring data quality and scalability. We give a detailed description of the underlying, highly scalable, service-oriented architecture, how we integrated the aforementioned standards, and used a triplestore as our primary database. We have evaluated our work in a comprehensive feature comparison to established solutions and through a practical application in a production environment, the European Data Portal. Our solution is available as Open Source.

Keywords: Open Data · DCAT · Scalability.

1 Introduction

Open Data constitutes a prospering and continuously evolving concept.
At the very core, this includes the publication and re-utilization of datasets. Typical actors and publishers are public administrations, research institutes, and non-profit organizations. Common users are data journalists, businesses, and governments. The established method of distributing Open Data is via a web platform that is responsible for gathering, storing, and publishing the data. Several software solutions and specifications exist for implementing such platforms. Especially the Resource Description Framework (RDF) data model and its associated vocabularies represent a foundation for fostering interoperability and harmonization of different data sources. The Data Catalog Vocabulary (DCAT) is applied as a comprehensive model and standard for describing datasets and data services on Open Data platforms [1]. However, RDF is only a subset of the Semantic Web stack, and Open Data publishing does not benefit from the stack's full potential, which offers more features beyond data modeling. Therefore, we developed a novel and scalable platform for managing Open Data, where the Semantic Web stack is a first-class citizen. Our work focuses on two central aspects: (1) The utilization of a variety of Semantic Web standards and technologies for covering the entire life-cycle of the Open Data publishing process. This covers particularly data models for metadata, quality verification, reporting, harmonization, and machine-readable interfaces. (2) The application of state-of-the-art software engineering approaches for development and deployment to ensure production-grade applicability and scalability. Hence, we integrated a tailored microservice-based architecture and a suitable orchestration pattern to fit the requirements in an Open Data platform. It is important to note that our work currently emphasizes the management of metadata, as intended by the DCAT specification.
Hence, throughout the paper the notion of data is used in terms of metadata.

In Section 2 we describe the overall problem and in Section 3 we discuss related and existing solutions. Our software architecture and orchestration approach is described in Section 4. Section 5 gives a detailed overview of the data workflow and the applied Semantic Web standards. We evaluate our work in Section 6 with a feature analysis and an extensive use case. To conclude, we summarize our work and give an outlook for future developments.

2 Problem Statement

A wide adoption of Open Data by data providers and data users is still facing many barriers. Beno et al. [7] conducted a comprehensive study of these barriers, considering legal, organizational, technical, strategic, and usability aspects. Major technical issues for users are the limitations in the Application Programming Interfaces (APIs), difficulties in searching and browsing, missing information about data quality, and language barriers. Generally, low data quality is also a fundamental issue, especially because (meta)data is not machine-readable or, in many cases, incomplete. In addition, low responsiveness and bad performance of the portals have a negative impact on the adoption of Open Data. For publishers, securing the integrity and authenticity, enabling resource-efficient provision, and clear licensing are highly important issues. The lack of a definite standard and technical solutions is listed as a core barrier. The hypothesis of our work is that a more sophisticated application of Semantic Web technologies can lower many barriers in Open Data publishing and reuse. These technologies intrinsically offer many aspects that are required to improve the current support of Open Data. Essentially, the Semantic Web is about defining a common standard for integrating and harnessing data from heterogeneous sources [2].
Thus, it constitutes an excellent match for the decentralized and heterogeneous nature of Open Data. Widespread solutions for implementing Open Data platforms are based on canonical software stacks for web applications with relational and/or document databases. The most popular example is the Open Source solution Comprehensive Knowledge Archive Network (CKAN) [10], which is based on a flat JSON data schema, stored in a PostgreSQL database. This impedes a full adoption of Semantic Web principles. The expressiveness of such a data model is limited and not suited for a straightforward integration of RDF.

3 Related Work

Making Open Data and Linked Data publicly available and accessible is an ongoing process that involves innovation and standardization efforts in various topics such as semantic interoperability, data and metadata quality, standardization as well as toolchain and platform development. One of the most widely adopted standards for the description of datasets is DCAT and its extension DCAT Application profile for data portals in Europe (DCAT-AP) [12]. The latter adds metadata fields and mandatory property ranges, making it suitable for use with Open Data management platforms. Its adoption by various European countries led to the development of country-specific extensions such as the official exchange standard for open governmental data in Germany [17] and Belgium's extension [24]. Regarding Open Data management platforms, the most widely known Open Source solution is CKAN [10]. It is considered the de-facto standard for the public sector and is also used by private organizations. It does not provide native Linked Data capabilities but only a mapping between existing data structures and RDF. Another widely adopted platform is uData [23].
It is a catalog application for collecting data and metadata focused on being more contributive and inclusive than other Open Data platforms by providing additional functionality for data reuse and community contributions. Other Open Source alternatives include the repository solution DSpace, which dynamically translates [13] relational metadata into native RDF metadata and offers it via a SPARQL endpoint. WikiData follows a similar approach [36]; it uses a custom structure for identifiable items, converts them to native RDF and provides an API endpoint. Another solution, the proprietary OpenDataSoft [26], has limited support for Linked Data via its interoperability mode. There are also solutions that offer native Linked Data support following the W3C recommendation for Linked Data Platforms (LDPs). Apache Marmotta [38] has a native implementation of RDF with a pluggable triplestore for Linked Data publication. Virtuoso [27] is a highly scalable LDP implementation that supports a wide array of data access standards and output formats. Fedora [21] is a native Linked Data repository suited for digital libraries. Recent research efforts [30] focus on the notion of dynamic Linked Data, where context-aware services and applications are able to detect changes in data by means of publish-subscribe mechanisms using SPARQL.

A core feature of most big commercial platforms is the Extract, Transform, Load (ETL) functionality. It refers to the three basic data processing stages of reading data (extract) from heterogeneous sources, converting it (transform) to a suitable format, and storing it (load) into a database. Platforms that offer ETL as a core functionality include IBM InfoSphere [16] with its DataStage module, Oracle Autonomous Data Warehouse [28] with its Data Integrator module and SAS Institute's data warehouse [31]. Moreover, various Open Source solutions such as Scriptella [35] and Talend Open Studio [32] are based on ETL.
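
The three ETL stages described above can be illustrated with a minimal sketch. It is not taken from Piveau or from any of the cited platforms; the two inline sources, the field names, and the `datasets` table are hypothetical and chosen only to show heterogeneous inputs being harmonized into one store, using the Python standard library.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical sources describing datasets in different formats
# (assumptions for illustration only, not real portal exports).
CSV_SOURCE = "id,title\nds-1,Air quality\nds-2,Road works\n"
JSON_SOURCE = '[{"identifier": "ds-3", "name": "Bus stops"}]'


def extract():
    """Extract: read raw records from the heterogeneous sources."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Transform: map both source layouts onto one (id, title) shape."""
    records = [(r["id"], r["title"].strip()) for r in csv_rows]
    records += [(r["identifier"], r["name"].strip()) for r in json_rows]
    return records


def load(records, conn):
    """Load: store the harmonized records in a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS datasets (id TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO datasets VALUES (?, ?)", records)
    conn.commit()


def run_etl(conn):
    """Run the three stages in order and return the stored rows."""
    load(transform(*extract()), conn)
    return conn.execute("SELECT id, title FROM datasets ORDER BY id").fetchall()


if __name__ == "__main__":
    print(run_etl(sqlite3.connect(":memory:")))
```

Keeping each stage as a separate function mirrors the separation of concerns that ETL tools rely on: a source can be added by extending `extract` and `transform` without touching the storage logic.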
The above data