Fabian Kirstein

Fraunhofer FOKUS

THE EUROPEAN DATA PORTAL

A Large-scale Application of Semantic Vocabularies

The European Data Portal (EDP)

• In November 2015, the European Commission launched the European Data Portal.

• As of March 2020, the EDP lists more than 1 million datasets, in total consisting of more than 1 billion RDF triples, harvested from 80 data providers.

• Ecosystem for fostering the manifestation, reuse and quality improvement of Open Data in Europe.

• Pioneer in adopting the DCAT-AP [1] specification and representing its first reference implementation.

The EDP is Europe's -enabled one-stop-shop for open public sector information.

[1] https://joinup.ec.europa.eu/collection/semantic-interoperability-community-semic/solution/dcat-application-profile-data-portals-europe

DCAT-AP

• DCAT Application profile for data portals in Europe.

• Designed to describe public sector datasets.

• Based on Linked Data principles and derived from the Data Catalog Vocabulary (DCAT), a Resource Description Framework (RDF) vocabulary.

• DCAT-AP makes extensive use of the controlled vocabularies and reference data provided by the EU Publications Office (EU Vocabularies).
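To make the bullet points above concrete, here is a minimal sketch of what a DCAT-AP-style dataset description can look like when serialized as Turtle. It is an illustration only, not the EDP implementation: the function name `dataset_to_turtle`, the `example.org` URIs and the choice of properties are assumptions. The `dcat:` and `dct:` terms and the EU Vocabularies data-theme base URI are the published ones.

```python
# Minimal sketch: serialize one dataset with one distribution as Turtle.
# Assumption: titles and descriptions contain no quotes or line breaks,
# so no literal escaping is performed here.
PREFIXES = """@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
"""

# Controlled vocabulary for dataset themes (EU Vocabularies).
DATA_THEME = "http://publications.europa.eu/resource/authority/data-theme/"


def dataset_to_turtle(uri, title, description, theme_code, access_url, lang="en"):
    """Return a Turtle document describing a dcat:Dataset and its dcat:Distribution."""
    return (
        PREFIXES
        + f"\n<{uri}> a dcat:Dataset ;\n"
        + f'    dct:title "{title}"@{lang} ;\n'
        + f'    dct:description "{description}"@{lang} ;\n'
        + f"    dcat:theme <{DATA_THEME}{theme_code}> ;\n"
        + f"    dcat:distribution <{uri}/distribution/1> .\n"
        + f"\n<{uri}/distribution/1> a dcat:Distribution ;\n"
        + f"    dcat:accessURL <{access_url}> .\n"
    )


# Hypothetical example values.
print(dataset_to_turtle(
    "https://example.org/dataset/air-quality",
    "Air quality",
    "Hourly measurements",
    "ENVI",
    "https://example.org/files/air.csv",
))
```

The theme is referenced by URI rather than by a free-text label, which is what makes the controlled vocabulary usable across portals and languages.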

Linked Data principles and traditional Open Data portal solutions do not match well.

We developed a novel and scalable platform for harvesting and managing the of the European Data Portal.

Semantic Web, controlled vocabularies and DCAT-AP are first-class citizens!

General Architecture

• Virtuoso Triplestore as primary .

• Microservice architecture and Single-Page-Application frontends based on Vue.js. [5]

• Reactive Java framework Vert.x and an asynchronous programming paradigm. [1]

• Deployment via Docker [2] and support for container orchestration, such as Kubernetes. [3]

• Authentication and authorization on both front-end and back-end services follow the OpenID Connect (OIDC) protocol. [4]

Three main components:
- Harvester
- Registry
- Quality Service

[1] https://vertx.io/
[2] https://www.docker.com/
[3] https://kubernetes.io/
[4] https://openid.net/specs/openid-connect-core-1_0.html
[5] https://vuejs.org/

Harvesting

Scheduler
• Main entry point for the service orchestration.
• Periodically triggers the harvesting process.
• Frequency: hourly, daily, weekly, …

Importer
• Retrieves the metadata from the source portal(s).
• Support for a variety of interfaces and data formats: CKAN-API, OAI-PMH, uData, RDF, and SPARQL.

Transformer
• Applies lightweight scripting-based transformation rules.
• Rules are written in JavaScript.
• The final output is DCAT-AP-compliant RDF.
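The import and transform steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the EDP code: the actual transformation rules are written in JavaScript, whereas this sketch uses Python, and the function names (`import_ckan`, `transform`, `harvest`) and the `example.org` base URI are hypothetical. The input mimics the shape of a CKAN `package_search` response (`result.results` with `name`, `title` and `notes` fields), and the output is a small set of N-Triples.

```python
import json

DCT = "http://purl.org/dc/terms/"
DCAT = "http://www.w3.org/ns/dcat#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(value):
    """Encode a string as a quoted literal; JSON string escaping is used
    here as an approximation of N-Triples escaping."""
    return json.dumps(value, ensure_ascii=False)


def import_ckan(raw_json):
    """Importer step: extract the records from a CKAN-like search response."""
    return json.loads(raw_json)["result"]["results"]


def transform(package, base_uri):
    """Transformer step: map one source record to DCAT-AP-style triples."""
    subject = f"<{base_uri}{package['name']}>"
    triples = [
        f"{subject} <{RDF_TYPE}> <{DCAT}Dataset> .",
        f"{subject} <{DCT}title> {literal(package['title'])} .",
    ]
    if package.get("notes"):  # CKAN stores the description in "notes"
        triples.append(f"{subject} <{DCT}description> {literal(package['notes'])} .")
    return triples


def harvest(raw_json, base_uri):
    """One harvesting run: import all records, then transform each of them."""
    lines = []
    for package in import_ckan(raw_json):
        lines.extend(transform(package, base_uri))
    return "\n".join(lines)


# Hypothetical source response with a single record.
sample = '{"result": {"results": [{"name": "air-quality", "title": "Air quality", "notes": "Hourly"}]}}'
print(harvest(sample, "https://example.org/dataset/"))
```

A scheduler would call something like `harvest` once per configured interval and per source portal. Keeping import and transformation separate means a new source format needs only a new importer, not a new pipeline.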

Harvesting – An Example

Original Dataset in the EU Open Data Portal

Processing and Storing

Registry
• Middleware to interact with the triplestore (Virtuoso).
• RESTful interface for RDF (including JSON-LD, N-Triples, and RDF/XML).
• Application of URI schemata, generation of unique IDs and inter-linking.

Indexing
• Responsible for managing the search index (Elasticsearch).
• Extracts literals from the data, e.g. title and description.
• Application of vocabularies and ontologies.

Translation
• Middleware to the EU eTranslation service.
• Bundles literals from multiple datasets into a single request.
• Stores the translations by applying the multi-language features of RDF.
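The translation step can be sketched as below: literals from several datasets are bundled into one request, and the returned texts are stored as language-tagged RDF literals. This is a sketch under assumptions; the request and response shapes and the function names are hypothetical, and the call to the eTranslation service is replaced by a stand-in that upper-cases the text.

```python
def bundle_literals(datasets, source_lang="en"):
    """Collect the literals of many datasets into one flat request payload.
    Each entry keeps an id so the translation can be routed back later."""
    return [
        {"id": f"{ds_id}#{field}", "text": text, "source": source_lang}
        for ds_id, fields in datasets.items()
        for field, text in fields.items()
    ]


def store_translations(translated, target_lang):
    """Group translated snippets per dataset and field as language-tagged literals."""
    out = {}
    for item in translated:
        ds_id, field = item["id"].rsplit("#", 1)
        out.setdefault(ds_id, {}).setdefault(field, []).append(
            f'"{item["text"]}"@{target_lang}'
        )
    return out


# Hypothetical input: two datasets, three literals in total.
datasets = {
    "ds1": {"title": "Air quality", "description": "Hourly data"},
    "ds2": {"title": "Water levels"},
}
request = bundle_literals(datasets)

# Stand-in for the translation service: upper-case instead of translating.
response = [{"id": r["id"], "text": r["text"].upper()} for r in request]
print(store_translations(response, "de"))
```

Because RDF literals carry their own language tag, the translated title can sit next to the original under the same property without any schema change.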

Processing and Storing – An Example

Namespace Transformation and Indexing

Processing and Storing – An Example

RDF URIs become human-readable
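One plausible way to obtain readable, unique URIs is to derive a slug from the dataset title and add a counter on collisions. The sketch below illustrates that idea only. The EDP's actual URI scheme is not described here, and `slugify`, `mint_uri` and the `example.org` base are hypothetical.

```python
import re
import unicodedata


def slugify(title):
    """Lower-case the title, strip accents, and replace other characters with hyphens."""
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def mint_uri(title, taken, base="https://example.org/set/data/"):
    """Return a unique, readable URI; append a counter if the slug is already taken."""
    slug = slugify(title)
    candidate, n = slug, 1
    while candidate in taken:
        n += 1
        candidate = f"{slug}-{n}"
    taken.add(candidate)
    return base + candidate


taken = set()
print(mint_uri("Air Quality – Zürich 2019", taken))
print(mint_uri("Air Quality – Zürich 2019", taken))  # same title, different URI
```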

Processing and Storing – An Example

Translated literals

Quality Evaluation

Validator
• Application of the W3C SHACL specification for DCAT-AP. [1]
• Results include detailed information about violations.
• Accessibility tests on each linked distribution (the actual data).

Annotator
• Quality assessment for each dataset with a custom metrics scheme.
• Inspired by the FAIR principles. [2]
• Completeness of the metadata, evaluating the format and type of data, availability of licensing information and linked distributions.

Reporter
• Applies W3C Data Quality Vocabulary (DQV) [3] for creating quality reports.
• Attached as RDF to the concerned dataset in the triplestore.
• Offers a variety of human-readable versions (PDF, XLS, ODS, HTML).
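A metrics scheme of the kind the Annotator applies can be sketched as a set of named boolean checks whose share of passes forms a score. The checks, their equal weighting and the dictionary layout below are assumptions for illustration; they are not the EDP's actual metrics.

```python
# Each check inspects a dataset (a plain dict in this sketch) and returns True or False.
CHECKS = {
    "has_title": lambda ds: bool(ds.get("title")),
    "has_description": lambda ds: bool(ds.get("description")),
    "has_distribution": lambda ds: len(ds.get("distributions", [])) > 0,
    "has_format": lambda ds: any(d.get("format") for d in ds.get("distributions", [])),
    "has_license": lambda ds: any(d.get("license") for d in ds.get("distributions", [])),
}


def assess(dataset):
    """Run all checks and return (per-check results, share of passed checks)."""
    results = {name: check(dataset) for name, check in CHECKS.items()}
    score = sum(results.values()) / len(results)
    return results, score


# Hypothetical dataset: complete except for licensing information.
example = {
    "title": "Air quality",
    "description": "Hourly measurements",
    "distributions": [{"format": "CSV"}],
}
results, score = assess(example)
print(results, score)
```

Keeping the per-check results next to the aggregate score lets a report say which metadata is missing, not merely that the dataset scored low.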

[1] Knublauch, H., Kontokostas, D.: Shapes Constraint Language (SHACL). https://www.w3.org/TR/shacl/ (accessed 3.12.2019)
[2] Wilkinson, M., Dumontier, M., Aalbersberg, I., et al.: The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3 (2016). https://doi.org/10.1038/sdata.2016.18
[3] https://www.w3.org/TR/vocab-dqv/

Quality Evaluation – An Example

dqv:hasQualityAnnotation [
    a dqv:QualityAnnotation ;
    dcterms:isVersionOf "dcatap201" ;
    dqv:inDimension voc:interoperability ;
    oa:hasBody [
        a shacl:ValidationReport ;
        shacl:conforms false ;
        shacl:result [
            a shacl:ValidationResult ;
            shacl:focusNode ;
            shacl:resultMessage "Value must be an instance of dct:Location" ;
            shacl:resultPath dcterms:spatial ;
            shacl:resultSeverity shacl:Violation ;
            ...

SHACL Validation Report as RDF
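The violation in the report above comes from a class constraint: every value of `dcterms:spatial` must be an instance of `dct:Location`. The sketch below emulates that single check over plain tuples to show how such a result is produced. It is not a SHACL engine, it ignores subclass reasoning, and the function name and data layout are hypothetical.

```python
RDF_TYPE = "rdf:type"


def check_class(triples, path, expected_class):
    """Emulate a class constraint: each value of `path` must be typed as `expected_class`."""
    typed = {(s, o) for s, p, o in triples if p == RDF_TYPE}
    results = []
    for s, p, o in triples:
        if p == path and (o, expected_class) not in typed:
            results.append({
                "focusNode": s,
                "resultPath": path,
                "value": o,
                "resultSeverity": "Violation",
                "resultMessage": f"Value must be an instance of {expected_class}",
            })
    return {"conforms": not results, "results": results}


# Hypothetical data: ds1 points to a typed location, ds2 to an untyped value.
triples = [
    ("ds1", "dcterms:spatial", "loc:DE"),
    ("loc:DE", RDF_TYPE, "dct:Location"),
    ("ds2", "dcterms:spatial", "Berlin"),
]
print(check_class(triples, "dcterms:spatial", "dct:Location"))
```

The returned dictionary uses the same field names as the RDF report (`conforms`, `focusNode`, `resultPath`, `resultSeverity`, `resultMessage`).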

Quality Evaluation – An Example

Quality Assessment Frontend

Lessons Learned

• Linked Data and RDF remain a challenge! 

• The correct implementation of DCAT-AP and the use of controlled vocabularies are still evolving.

• The specifications are often complex and not clearly defined.

• The different pieces do not interact well in some cases. (DCAT-AP – Vocabularies – SHACL)

• Close collaboration between all stakeholders is essential.

An end-to-end application of Linked Data principles and well-defined controlled vocabularies will foster the dissemination, reuse and improvement of Open Data in Europe.

Take Away

The EDP is a well-known one-stop-shop for Open Data and an advocate for the Linked Open Data movement.

• The European Commission establishes DCAT-AP and EU Vocabularies as a standard; the EDP acts as one enabler for their broad adoption.

• The source code is Open Source: https://gitlab.com/european-data-portal

• Registry: https://www.europeandataportal.eu/data

• SPARQL: https://www.europeandataportal.eu/sparql

• Quality: https://www.europeandataportal.eu/mqa

Contact

Fabian Kirstein
[email protected]
@fabiankirstein
https://fabiankirstein.de

Further Reading:

ESWC 2020 Paper: Piveau: A Large-Scale Open Data Management Platform Based on Semantic Web Technologies
https://link.springer.com/chapter/10.1007/978-3-030-49461-2_38
