Fabian Kirstein
Fraunhofer FOKUS
THE EUROPEAN DATA PORTAL
A Large-scale Application of Semantic Vocabularies
The European Data Portal (EDP)
• In November 2015, the European Commission launched the European Data Portal (EDP).
• As of March 2020, the EDP lists more than 1 million datasets, in total consisting of more than 1 billion RDF triples, harvested from 80 data providers.
• Ecosystem for fostering the manifestation, reuse and quality improvement of Open Data in Europe.
• Pioneer in adopting the DCAT-AP [1] specification and representing its first reference implementation.
The EDP is Europe's Linked Data-enabled one-stop-shop for open public sector information.
[1] https://joinup.ec.europa.eu/collection/semantic-interoperability-community-semic/solution/dcat-application-profile-data-portals-europe
DCAT-AP
• DCAT Application profile for data portals in Europe.
• Designed to describe public sector datasets.
• Based on Linked Data principles and derived from the Data Catalog Vocabulary (DCAT), a Resource Description Framework (RDF) vocabulary.
• DCAT-AP makes extensive use of the controlled vocabularies and reference data provided by the EU Publications Office (EU Vocabularies).
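To make the bullets above concrete, the following minimal sketch serializes one dataset description as N-Triples using DCAT and Dublin Core terms, plus a language term from the EU Vocabularies authority tables. The dataset URI, the title and the description are invented for illustration, and the helper names `literal` and `describe_dataset` are hypothetical; this is not the EDP's own code.

```python
# Published namespaces of the vocabularies involved.
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Language authority table from EU Vocabularies (EU Publications Office).
EU_LANGUAGE = "http://publications.europa.eu/resource/authority/language/"


def literal(text, lang):
    """Format a language-tagged N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"@{lang}'


def describe_dataset(uri, title, description, lang="en", eu_lang_code="ENG"):
    """Return N-Triples lines describing one dcat:Dataset (illustrative subset)."""
    subject = f"<{uri}>"
    return [
        f"{subject} <{RDF_TYPE}> <{DCAT}Dataset> .",
        f"{subject} <{DCT}title> {literal(title, lang)} .",
        f"{subject} <{DCT}description> {literal(description, lang)} .",
        # dct:language points to a controlled-vocabulary term, not a free-text string.
        f"{subject} <{DCT}language> <{EU_LANGUAGE}{eu_lang_code}> .",
    ]


lines = describe_dataset(
    "https://example.org/set/data/air-quality",  # hypothetical dataset URI
    "Air quality measurements",
    'Hourly "PM10" readings',
)
print("\n".join(lines))
```

The point of the last triple is that the language is expressed as a link to a shared reference term, which is what allows metadata from different portals to be compared.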
Linked Data principles and traditional Open Data portal solutions do not match well.
We developed a novel and scalable platform for harvesting and managing the metadata of the European Data Portal.
Semantic Web, controlled vocabularies and DCAT-AP are first-class citizens!
General Architecture
• Virtuoso Triplestore as primary database.
• Microservice architecture and Single-Page-Application frontends based on Vue.js. [5]
• Reactive Java framework Vert.x and an asynchronous programming paradigm. [1]
• Deployment via Docker [2] and support for container orchestration, like Kubernetes. [3]
• Authentication and authorization on both front-end and back-end services follow the OpenID Connect (OIDC) protocol. [4]
Three main components:
- Harvester
- Registry
- Quality Service
[1] https://vertx.io/
[2] https://www.docker.com/
[3] https://kubernetes.io/
[4] https://openid.net/specs/openid-connect-core-1_0.html
[5] https://vuejs.org/
Harvesting
Scheduler
• Main entry point for the service orchestration.
• Periodically triggers the harvesting process.
• Frequency: Hourly, Daily, Weekly, …
Importer
• Retrieves the metadata from the source portal(s).
• Support for a variety of interfaces and data formats: CKAN-API, OAI-PMH, uData, RDF, and SPARQL.
Transformer
• Applies lightweight scripting-based transformation rules.
• Rules are written in JavaScript.
• The final output is DCAT-AP-compliant RDF.
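The transformer step can be pictured as a mapping from a source record to DCAT-AP triples. The sketch below is written in Python purely for illustration (the rules in the platform itself are JavaScript, as stated above) and maps a CKAN-style package dictionary to N-Triples lines. The function name `transform_ckan_package`, the `example.org` URIs and the small set of recognised file types are assumptions made for this example.

```python
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# File type authority table from EU Vocabularies.
FILE_TYPE = "http://publications.europa.eu/resource/authority/file-type/"
KNOWN_FILE_TYPES = {"CSV", "JSON", "XML", "PDF"}  # small illustrative subset


def quote(text):
    """Format a plain N-Triples literal, escaping special characters."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def transform_ckan_package(package, base_uri):
    """Map a CKAN-style package dict to N-Triples lines (hypothetical rule)."""
    name = package["name"]
    dataset = f"<{base_uri}/dataset/{name}>"
    lines = [f"{dataset} <{RDF_TYPE}> <{DCAT}Dataset> ."]
    if package.get("title"):
        lines.append(f"{dataset} <{DCT}title> {quote(package['title'])} .")
    if package.get("notes"):  # CKAN stores the description in "notes"
        lines.append(f"{dataset} <{DCT}description> {quote(package['notes'])} .")
    for index, resource in enumerate(package.get("resources", []), start=1):
        if not resource.get("url"):
            continue  # a distribution without an access URL is skipped
        dist = f"<{base_uri}/distribution/{name}-{index}>"
        lines.append(f"{dataset} <{DCAT}distribution> {dist} .")
        lines.append(f"{dist} <{RDF_TYPE}> <{DCAT}Distribution> .")
        lines.append(f"{dist} <{DCAT}accessURL> <{resource['url']}> .")
        file_type = (resource.get("format") or "").upper()
        if file_type in KNOWN_FILE_TYPES:
            # Free-text format strings become controlled-vocabulary terms.
            lines.append(f"{dist} <{DCT}format> <{FILE_TYPE}{file_type}> .")
    return lines


package = {
    "name": "air-quality",
    "title": "Air quality measurements",
    "notes": "Hourly readings from city stations.",
    "resources": [
        {"url": "https://example.org/files/air.csv", "format": "csv"},
        {"url": "https://example.org/files/air.bin", "format": "bin"},
    ],
}
triples = transform_ckan_package(package, "https://example.org")
print("\n".join(triples))
```

Unknown formats are simply left out here; a real rule set would need a policy for values that have no counterpart in the controlled vocabulary.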
Harvesting – An Example
Original Dataset in the EU Open Data Portal
Harvesting
Registry
• Middleware to interact with the triplestore (Virtuoso).
• RESTful interface for RDF (Turtle, JSON-LD, N-Triples, RDF/XML, Notation3).
• Application of URI schemata, generation of unique IDs and inter-linking.
Indexing
• Responsible for managing the search index (Elasticsearch).
• Extracting literals from the data, e.g. title and description.
• Application of vocabularies and ontologies.
Translation
• Middleware to EU eTranslation service.
• Bundling literals from multiple datasets to an integrated request.
• Stores the translation by applying the multi-language features of RDF.
Processing and Storing – An Example
Namespace Transformation and Indexing
Processing and Storing – An Example
RDF URIs become human-readable
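One way to obtain human-readable and unique URIs is to derive a slug from the dataset title and add a numeric suffix on collision. The sketch below shows this idea only; the base URI, the suffix rule and the function names `slugify` and `mint_uri` are assumptions for illustration and are not taken from the EDP's URI schema.

```python
import re
import unicodedata

BASE = "https://example.org/set/data/"  # hypothetical URI schema


def slugify(title):
    """Reduce a title to lowercase ASCII words joined by hyphens."""
    ascii_text = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")  # drops combining marks and non-ASCII characters
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def mint_uri(title, taken):
    """Return a URI not yet in `taken` and record it there."""
    slug = slugify(title) or "dataset"  # fallback for titles without ASCII letters
    candidate = slug
    suffix = 0
    while BASE + candidate in taken:
        suffix += 1
        candidate = f"{slug}-{suffix}"
    uri = BASE + candidate
    taken.add(uri)
    return uri


taken = set()
first = mint_uri("Luftqualität in Berlin 2019", taken)
second = mint_uri("Luftqualität in Berlin 2019", taken)
print(first)
print(second)
```

Two datasets with the same title receive distinct URIs, while each URI still hints at the content it identifies.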
Processing and Storing – An Example
Translated literals
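The translation step described earlier (bundle literals from several datasets, send one request, store the results as language-tagged RDF literals) might look roughly like the sketch below. The function `fake_translate` is only a stand-in and does not reflect the eTranslation API; all other names and the sample data are likewise invented for illustration.

```python
DCT = "http://purl.org/dc/terms/"
FIELDS = {"title": DCT + "title", "description": DCT + "description"}


def bundle_literals(datasets):
    """Collect translatable literals of several datasets into one payload."""
    return {
        (uri, field): values[field]
        for uri, values in datasets.items()
        for field in FIELDS
        if values.get(field)
    }


def fake_translate(texts, target_lang):
    """Stand-in for a translation service call; NOT the eTranslation API."""
    return [f"[{target_lang}] {text}" for text in texts]


def translate_bundle(bundle, target_lang, translate=fake_translate):
    """Send all texts in a single call and map the results back to their keys."""
    keys = list(bundle)
    translated = translate([bundle[key] for key in keys], target_lang)
    return dict(zip(keys, translated))


def as_language_tagged_triples(translations, lang):
    """Store translations as language-tagged literals (N-Triples lines)."""
    lines = []
    for (uri, field), text in translations.items():
        escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        lines.append(f'<{uri}> <{FIELDS[field]}> "{escaped}"@{lang} .')
    return lines


datasets = {
    "https://example.org/set/data/a": {"title": "Air quality", "description": "Hourly data"},
    "https://example.org/set/data/b": {"title": "Water levels"},
}
bundle = bundle_literals(datasets)
german = translate_bundle(bundle, "de")
print("\n".join(as_language_tagged_triples(german, "de")))
```

Because each literal carries its language tag, the original and the translated values can share the same property on the same dataset.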
Quality Evaluation
Validator
• Application of the W3C SHACL specification for DCAT-AP. [1]
• Results include detailed information about violations.
• Accessibility tests on each linked distribution (the actual data).
Annotator
• Quality assessment for each dataset with a custom metrics scheme.
• Inspired by the FAIR principles [2].
• Completeness of the metadata, evaluating the format and type of data, availability of licensing information and linked distributions.
Reporter
• Applies W3C Data Quality Vocabulary (DQV) [3] for creating quality reports.
• Attached as RDF to the concerned dataset in the triplestore.
• Offers a variety of human-readable versions (PDF, XLS, ODS, HTML).
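As a rough illustration of the annotator and reporter steps, the sketch below computes a simple completeness score for a dataset and expresses it as a DQV quality measurement in N-Triples. The list of checked properties, the equal weighting and the metric URI are assumptions for this example; the EDP's actual metrics scheme is not reproduced here.

```python
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
DQV = "http://www.w3.org/ns/dqv#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"

# Hypothetical selection of properties that count towards completeness.
CHECKED = [DCT + "title", DCT + "description", DCAT + "keyword", DCAT + "distribution"]
METRIC = "https://example.org/metrics/completeness"  # hypothetical metric URI


def completeness(dataset):
    """Share of checked properties that have at least one value."""
    present = sum(1 for prop in CHECKED if dataset.get(prop))
    return present / len(CHECKED)


def measurement_triples(dataset_uri, score, node="_:m1"):
    """Express the score as a dqv:QualityMeasurement (N-Triples lines)."""
    return [
        f"{node} <{RDF_TYPE}> <{DQV}QualityMeasurement> .",
        f"{node} <{DQV}computedOn> <{dataset_uri}> .",
        f"{node} <{DQV}isMeasurementOf> <{METRIC}> .",
        f'{node} <{DQV}value> "{score}"^^<{XSD_DOUBLE}> .',
    ]


dataset = {
    DCT + "title": ["Air quality measurements"],
    DCT + "description": ["Hourly readings from city stations."],
    DCAT + "keyword": [],  # an empty list counts as missing
    DCAT + "distribution": ["https://example.org/distribution/air-quality-1"],
}
score = completeness(dataset)
report = measurement_triples("https://example.org/set/data/air-quality", score)
print("\n".join(report))
```

Storing the measurement as RDF next to the dataset keeps the quality information queryable through the same interfaces as the metadata itself.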
[1] Knublauch, H., Kontokostas, D.: Shapes Constraint Language (SHACL). https://www.w3.org/TR/shacl/ (accessed 3.12.2019)
[2] Wilkinson, M., Dumontier, M., Aalbersberg, I., et al.: The FAIR guiding principles for scientific data management and stewardship. Sci Data 3 (2016). https://doi.org/10.1038/sdata.2016.18
[3] https://www.w3.org/TR/vocab-dqv/
Quality Evaluation – An Example
SHACL Validation Report as RDF
Quality Evaluation – An Example
Quality Assessment Frontend
Lessons Learned
• Linked Data and RDF remain a challenge!
• The correct implementation of DCAT-AP and the use of controlled vocabularies are still evolving.
• The specifications are often complex and not clearly defined.
• The different pieces do not interact well in some cases. (DCAT-AP – Vocabularies – SHACL)
• Close collaboration between all stakeholders is essential.
An end-to-end application of Linked Data principles and well-defined controlled vocabularies will foster the dissemination, reuse and improvement of Open Data in Europe.
Take Away
• The EDP is a well-known one-stop-shop for Open Data and an advocate for the Linked Open Data movement.
• The European Commission establishes DCAT-AP and EU Vocabularies as a standard, EDP acts as one enabler for its broad adoption.
• The source code is Open Source: https://gitlab.com/european-data-portal
• Registry: https://www.europeandataportal.eu/data
• SPARQL: https://www.europeandataportal.eu/sparql
• Quality: https://www.europeandataportal.eu/mqa
Contact
Fabian Kirstein
@fabiankirstein
https://fabiankirstein.de
Further Reading:
ESWC 2020 Paper: Piveau: A Large-Scale Open Data Management Platform Based on Semantic Web Technologies
https://link.springer.com/chapter/10.1007/978-3-030-49461-2_38