Implementation of a Linked Open Data Solution for the Statistics Agency of Cantabria's Metadata and Data Bank
Total Page:16
File Type:pdf, Size:1020Kb
Proc. Int’l Conf. on Dublin Core and Metadata Applications 2013 Implementation of a Linked Open Data Solution for the Statistics Agency of Cantabria's Metadata and Data Bank Alejandro Villar Fernández Ana Santurtún Zarrabeitia Spain University of Cantabria, contacto@alejandro- Spain villar.es [email protected] Abstract Statistics is a fundamental piece inside the Open Government philosophy, being a basic tool for citizens to know and make informed decisions about the society in which they participate. Due to the great number of organizations and agencies that collect, process and publish statistical data all over the world, several standards and methodologies for information exchange have been created in recent years in order to improve interoperability between data producers and consumers, of which SDMX is one of the most renowned examples. Despite having been developed independently of this, the global Semantic Web effort (backed mainly by the W3C-driven Linked Open Data initiatives) presents itself as an extremely useful tool for publishing both completely contextualized metadata and data, therefore making them easily understandable and processable by third parties. This report details the changes made to the IT systems of the Statistical Agency of Cantabria (Instituto Cántabro de Estadística, ICANE) with the purpose of implementing a Linked Open Data solution for its website and statistical data bank, making all data and metadata published by this Agency available not only to humans, but to automatized consumers, too. Multiple standards, recommendations and vocabularies were used for this task, ranging from Dublin Core metadata RDFa tagging, through the creation of several SKOS concept schemes, to providing statistical data using the RDF Data Cube vocabulary. Keywords: Linked Open Data; statistics; ICANE; Spain; Cantabria; SKOS; RDF; RDFa; Semantic Web. 1. Background Legally created in 1998, but not officially operative until 2004, the Statistics Agency of Cantabria (Instituto Cántabro de Estadística, ICANE) is the official public organization in Cantabria (Northern Spain) in charge of the production and dissemination of statistical data regarding every aspect of Cantabria's economy and society. Despite its young age, ICANE's face to the public has been in constant evolution, from its inception as a static web site populated with Microsoft Excel files, to its current state of a machine-navigable, metadata-annotated portal serving on-the-fly generated data in 7 different formats. 1.1 Initial Situation and Motivation ICANE's web portal and data bank underwent a major visual and functional overhaul in 2011, 1 2 replacing a legacy OpenCMS installation with Liferay Enterprise Portal 5.2, and an in-house- developed web application for data selection and display with a Pentaho Mondrian3 3.1 + 1 http://www.opencms.org/ 2 http://www.liferay.com/es/ 3 http://mondrian.pentaho.com/ 1 Proc. Int’l Conf. on Dublin Core and Metadata Applications 2013 JPivot41.8 combination, which also involved converting its previous data warehouse to an OLAP5-compatible setup. On the second half of 2012, ICANE decided to implement a solution for improving its data and metadata availability and to publish and link its resources on the Semantic Web. At the time, all internal, structural metadata about data series (OLAP cubes, dimensions, measures, etc.) was stored in XML Schema files and deployed onto the data bank application servers, and all descriptive metadata (including hierarchical organization into folders and sections) resided in a relational database, which was used by the web portal and by the data bank application to present a hierarchically navigable view to the end user. This setup relied on the following domain model classes to represent navigational and descriptive metadata: • Area: subject area for a group of series. • Type: type of series (time series, historical data or municipal data). • Group: series group (subject area subdivision), belonging to an area and a type. • Nodes: hierarchical layout of folders and time series. Includes all descriptive metadata for every series and folder. Most series metadata (modification dates, periodicity, data sources, etc.) had not been yet normalized. In addition, no URL path hierarchy was used when accessing the different sections of the web portal or data bank application (with the latter relying on user-provided HTTP GET parameters for time series selection and descriptive metadata retrieval). 2. Metadata Reorganization, Normalization and Centralization A decision was made to reorganize and normalize all database-stored metadata. The first step was dropping the old area/group/type schema in favor of a section/subsection/category one so that: • Four main sections (economy, population, society, territory and environment) would exist. • Each section would contain several, more specific subsections, which would not be dependent on type or category. • Inside each subsection the end user would find up to three different series/folder trees, one for each category. Moreover, the following efforts were made, with regards to metadata normalization and preparation for the Linked Open Data implementation: • URI tags (friendly URL components) were defined for all user-referenceable classes. A concrete URL hierarchy based on these tags was laid out to make it easier for users to access and understand the web portal and data bank application layouts. More information on the use of URI tags can be found in section 3.2. • Folder and series metadata were normalized whenever possible. This included series periodicities (which were linked to their DC CDF Vocabulary, 2007, equivalent) and initial and final periods, data sources, reference areas, etc. • A link repository was created to store all entities' semantic relationships towards external resources. Finally, ICANE's IT department developed a Web Service to provide centralized access to this newly-created metadata database through a REST API, and avoid directly querying the database server containing it, thus protecting the clients from the specific implementation details. This Web Service was to be used by all internal applications, if possible. 4 http://jpivot.sourceforge.net/ 5 Online Analytical Processing. An introduction to OLAP can be found in Pedersen, T.B (2001). 2 Proc. Int’l Conf. on Dublin Core and Metadata Applications 2013 3. Proposed Solution Alejandro Villar’s proposal consisted of 5 different, but interdependent, phases: 1. Definition of a RDF model for all publicly accessible entities, centered around the DCMI model. 2. Implementation of a Linked Open Data strategy to serve entity metadata. 3. Development of an RDF export filter for the data bank application. 4. Implementation of a SPARQL endpoint to access metadata. 5. Semantic linking of entities to external resources, in order to provide context to automatized data consumers. 3.1. RDF Model In addition to property mapping, the definition of the RDF model had to take into consideration the hierarchical nature of the time series and folders tree. Therefore, a SKOS ontology was conceived, comprising several Concept Schemes (one for each section and for each subsection), which would in turn contain a hierarchical array of folders (represented by SKOS Concepts). Time series and folders would be linked using the DCMI subject property. Given that SKOS does not provide an explicit mechanism to define a hierarchy between Concept Schemes, its inScheme property (whose domain is effectively any RDF resource) was used to this effect. A property summary of this model, using CURIE syntax, can be seen in Table 1, while table 2 contains a list of all prefixes used throughout this project report. In addition, a special ICANE vocabulary6 was created to enhance entity classification and to provide additional means of entity interlinking. TABLE 1: Property summary of the RDF model Entity Property Description Class of this resource. Section rdf:type Values: icane:Section and A section has many subsections. skos:ConceptScheme. skos:prefLabel and rdfs:label Resource label. rdf:type Class of this resource. Subsection Values: icane:Subsection and A subsection belongs to a single skos:ConceptScheme. section and has many folders and skos:prefLabel and rdfs:label Resource label. time series. icane:section and skos:inScheme Section that this subsection belongs to. rdf:type Class of this resource. Category Value: icane:Category. A category has many folders and rdfs:label Resource label. time series. icane:acronym Acronym for this category. Class of this resource. rdf:type Value: skos:Concept. Folder skos:prefLabel and rdfs:label Resource label. A folder is a branch node in the icane:section Section that this folder belongs to. subject matter tree. It has many icane:subsection Subsection that this folder belongs to. child folders and time series, and it Concept Schemes that this folder belongs to one subsection (and skos:inScheme therefore to one section) and to one belongs to (Its section and subsection). category. icane:category Category that this folder belongs to. Used for creating a hierarchical structure skos:broader / skos:narrower of folders. Table continued 6 http://www.icane.es/opendata/vocab 3 Proc. Int’l Conf. on Dublin Core and Metadata Applications 2013 Entity Property Description Class of this resource. rdf:type Values: icane:TimeSeries and qb:DataSet. rdfs:label and dcterms:title Resource label. Subject of this resource, which will be its dcterms:subject parent folder in the hierarchy. icane:section