European Long-Term Ecosystem and Socio-Ecological Research Infrastructure

D3.3 Data Models
Authors: Peterseil, J., Magagna, B., Wohner, C., Oggioni, A. & Watkins, J.
Lead partner for deliverable: Umweltbundesamt GmbH (EAA)
Other partners involved: CNR, CEH

H2020-funded project, GA: 654359, INFRAIA call 2014-2015
Start date of project: 01 June 2015
Duration: 48 months
Version of this document: 1.0
Submission date: 30.5.2018

Dissemination level

PU Public X

PP Restricted to other programme participants (including the Commission Services)

CO Confidential, only for members of the consortium (including the Commission Services)

CI Classified, as referred to in Commission Decision 2001/844/EC

Document ID: eLTER D3.3 Data Models © eLTER consortium

Version control              Edited by              Date of revision
Created – V1                 Peterseil & Magagna    31.5.2018
Internal review              Watkins
Internal review              Oggioni
Revised – V2
Revised – V3
Revised – V4
Reviewed                     Haubold                31.5.2018
Revised – V5
Signed off – co-ordinator    Mirtl                  31.5.2018

Publishable Executive Summary

Providing quality-controlled and reliable data as the basis for scientific analysis and as input for the evaluation of existing environmental policies is one of the major aims of long-term ecosystem monitoring and research, not only in Europe (Mirtl 2010) but also on a global scale (Mirtl et al. 2018). In order to foster information exchange and sharing, data must be discoverable and at least the metadata accessible (Michener et al. 1997). This requires proper documentation of data and services as well as infrastructure allowing the discovery, access and integration of data in a web-based environment. Following the FAIR data principles (Wilkinson et al. 2016), data need to be Findable, Accessible, Interoperable and Reusable. Syntactic and semantic interoperability play an important role for the reusability of the data. eLTER Task 3.3 aims to provide the semantic backbone for the documentation and provision of data. This addresses not only the application of metadata models and data formats for data provision but also the development of an underlying common semantics. The report focuses on a) the metadata models for the documentation of sites, datasets and sensors, b) the data formats applied for the provision of time series data, and c) the development of a common semantics.

As eLTER builds on the network of sites and organisations of LTER Europe and ILTER, the metadata and data standards recommended by the regional and global networks are adopted. EML and the INSPIRE metadata model are the most important standards for the documentation of datasets. For research sites, INSPIRE Environmental Monitoring Facilities (EF) was implemented, as well as a SensorML-compliant community profile for any observation device. Finally, EnvThes provides the semantic backbone for the annotation (e.g. keywords) of long-term observation data.

Contents

1 Introduction
1.1 LTER Europe
1.2 eLTER Information System Architecture

2 Data documentation
2.1 Relevant metadata standards
2.2 LTER Metadata models
2.3 Sensor metadata

3 Data provision
3.1 Relevant standards
3.2 eLTER Data Reporting Format

4 Common Semantics
4.1 Standards for controlled vocabularies
4.2 EnvThes

5 Conclusions
5.1 Metadata
5.2 Data format
5.3 Common semantics

References

6 Annexes
6.1 Annex A – Field Specification for data reporting
6.2 Annex B – SensorML Implementation for DEIMS-SDR
6.3 Annex C – SensorML Example DEIMS-SDR:Sensor

List of figures

Fig. 1.1 Global map of LTER sites and LTSER platforms covered by ILTER and LTER Europe
Fig. 1.2 Conceptual architecture of the eLTER Information System
Fig. 2.1 Example of a site record
Fig. 2.2 Example of a data product record
Fig. 2.3 Example of a dataset record
Fig. 2.4 Relation between site, sensor and dataset
Fig. 2.5 Metadata model DEIMS-SDR:Sensor Version 0.9
Fig. 2.6 Example of a sensor record
Fig. 2.7 EDI Metadata Editor (get-IT software suite) – Register sensor
Fig. 3.1 Overview of linking Sensor description and observation
Fig. 3.2 Structure of the eLTER Data Reporting
Fig. 3.3 eLTER Data Reporting Format: basic observation model
Fig. 4.1 OBOE core model (Madin et al. 2007)
Fig. 4.2 Tree diameter at breast height modelled in OBOE
Fig. 4.3 Tree diameter at breast height modelled using protocol in OBOE
Fig. 4.4 Concentration of nitrate in soil water modelled in OBOE
Fig. 4.5 O&M (Cox 2017)
Fig. 4.6 Comparison O&M and OBOE (in red)
Fig. 4.7 SSO pattern
Fig. 4.8 Sensor perspective of SSNO
Fig. 4.9 SERONTO Core Model
Fig. 4.10 Observable Properties in O&M extension
Fig. 4.11 Complex Properties Model as extension of O&M (Leadbetter & Vodden 2016)
Fig. 4.12 Compound versus atomic concepts in EnvThes
Fig. 4.13 Structure of Semantic Repositories
Fig. 5.1 Metadata levels for LTER Data Reporting
Fig. 5.2 Conceptualisation of observation types

List of tables

Tab. 2.1 DEIMS-SDR:Sensor community profile
Tab. 3.1 eLTER Data Reporting - Basic format (row)
Tab. 3.2 eLTER Data Reporting - Basic format alternative version (column)
Tab. 3.3 Example biophysical data basic data format
Tab. 5.1 EML Data Package Completeness levels

Glossary

DC Dublin Core

DCAT Data Catalogue Vocabulary

DEIMS-SDR Dynamic Ecological Information Management System Site and Dataset Registry

DwC Darwin Core

ECOPOTENTIAL ECOPOTENTIAL: improving future ecosystem benefits through earth observations (H2020 Project)

EF Environmental Monitoring Facilities (INSPIRE Data Specification)

eLTER Integrated European Long Term Ecosystem & socio-ecological Research Infrastructure (H2020 project, GA 654359)

EML Ecological Metadata Language

EnvThes Environmental Thesaurus

EUDAT European Data Infrastructure

FAIR FAIR (Findable, Accessible, Interoperable, Re-usable) Principles

GBIF Global Biodiversity Information Facility

GEMET General Multilingual Environmental Thesaurus

ICOS Integrated Carbon Observation System

ILTER International Long-Term Ecosystem Research Network

INSPIRE Infrastructure for Spatial Information in the European Community

ISO International Organization for Standardization

LTER Long term ecosystem research

LTSER Long term socio-ecological research

MD Metadata

O&M OGC Observations and Measurements

OBOE Extensible Observation Ontology

OGC Open Geospatial Consortium

OWL Web Ontology Language

RDA Research Data Alliance

RDA VSSIG RDA Vocabulary and Semantic Services Interest Group

RDF Resource Description Framework

RI Research Infrastructure

SensorML Sensor Model Language

SERONTO Socio-ecological research and observation ontology

SKOS Simple Knowledge Organisation System

SOS Sensor Observation Service

SPARQL SPARQL Protocol and RDF Query Language

SSN Semantic Sensor Network

SSNO Semantic Sensor Network Ontology

SWE OGC Sensor Web Enablement

W3C World Wide Web Consortium

XML Extensible Markup Language

1 Introduction

Providing quality-controlled and reliable data as the basis for scientific analysis and as input into the construction of new and the evaluation of existing environmental policies is one of the major aims of long-term ecosystem monitoring and research, not only in Europe (Mirtl 2010) but also on a global scale (Mirtl et al. 2018). In order to foster information exchange and sharing, data must be discoverable and at least the metadata accessible (Michener et al. 1997). This requires proper documentation of data and services as well as infrastructure allowing the discovery of and access to data in a web-based environment. The requirements for a common data documentation profile can be reduced, in the simplest form, to the goal of implementing the FAIR [1] data principles (Wilkinson et al. 2016): data must be Findable, Accessible, Interoperable, and Reusable. Each of these principles represents a subset of goals that, once achieved, yield a structured data repository with unambiguous data objects and corresponding metadata, provenance information, and semantic relationships defined in a syntax that is open, standardised, and endorsed by the community. Depending on which and how many of the FAIR goals are reached, a data representation may achieve one of several levels of FAIR compliance. In addition, it is necessary to use data coming from different disciplines, domains, and providers. Discovery and integration of data, especially from the ecological domain, is therefore highly labour-intensive and often ambiguous in semantic terms. To improve the discovery, integration and re-usability of data, semantic resources can help to harmonise and enrich the description of datasets and their content.
In the last decade, research groups and infrastructures focusing on the monitoring and analysis of ecosystem properties have increasingly put effort into the development of semantic resources, mainly based on core ontologies such as OBOE or the O&M conceptual schema. To cover these aspects, LTER aims to improve the comparability and interoperability of long-term ecological data and to facilitate the exchange and preservation of these data (Mirtl et al. 2018, Vanderbilt et al. 2015). Currently, the funding and organisation of the different components of the LTER network depend strongly on national funding opportunities, which leads to a diversity of data strategies and data management procedures. The eLTER [2] (H2020) project, funded by the European Union, tackles these challenges by supporting the process of data documentation and access and by building and maintaining a sustainable infrastructure for the project and, beyond it, for the LTER-Europe network. The work is aligned with the major challenges and use cases (see deliverable D3.1) and focuses on the following activities: (a) adopting standards for documentation of research objects (observation facilities and datasets), (b) fostering the use of controlled vocabularies, (c) providing time series data in standardised form, and (d) providing a catalogue of datasets across the different data resources.

[1] See https://www.force11.org/group/fairgroup/fairprinciples
[2] See http://www.lter-europe.net/elter/about, Grant number 654359

1.1 LTER Europe

Long-Term Ecosystem Research (LTER) is an essential component of the world-wide efforts to better understand ecosystems, their functioning and the effects of driving factors on their processes. LTER Europe [3] (Long Term Ecological Research Network) is the European contribution to this effort. It is a network of about 488 LTER Sites and 50 Long-term Socio-ecological Research Platforms organised in 23 national LTER networks across Europe, covering a broad biogeographic gradient as well as important ecosystem types. Aiming to provide a systematic coverage of terrestrial and aquatic ecosystems that allows insight into ecosystem processes and changes, LTER builds on four conceptual pillars (see Mirtl et al. 2018):
● Systems approach – fostering and enabling the long-term investigation of systems as a whole (e.g. ecosystems, Earth systems, environmental systems, socio-ecological systems, hydro-geological systems) as well as their single compartments in order to understand the underlying processes and possible threats. This addresses abiotic and biotic components (which interact at different scales) as well as the human use of such systems and their services.
● Process orientation – identifying, quantifying and studying the interactions of ecosystem processes affected by internal and external drivers. For socio-ecological systems, ‘process orientation’ applies both to processes related to ecosystem services and to social processes (e.g. stakeholder engagement, multi-directional knowledge transfer, and collaborative decision-making; Haberl et al., 2006) required to facilitate transdisciplinary research and policy making.
● Long-term – dedicated to the continuous collection, documentation and provision of data on the observed ecosystem status and processes as well as the use and integration of long-term data on ecosystems with a time horizon of decades to centuries.
● In-situ – site-based data generation at different spatial scales across ecosystem compartments of individual in-natura sites, environmental zones and socio-ecological regions.
LTER Europe is embedded in the global ILTER network, which covers a wide range of ecosystem types and allows the analysis of long-term data along geographic gradients. Fig. 1.1 provides an overview of the LTER sites and LTSER platforms comprising the LTER Europe and ILTER network of sites. All observation and experimentation facilities are documented using DEIMS-SDR [4].

[3] See http://www.lter-europe.net/
[4] See https://data.lter-europe.net/deims/

Fig. 1.1 Global map of LTER sites and LTSER platforms covered by ILTER and LTER Europe

The provision and sharing of information, addressing both metadata and data, is one of the core activities within LTER-Europe. The data management within the LTER Europe network is characterised by (a) a decentralised organisation of data management, quality control and data reporting (national or local scale) following different protocols and methods, (b) a broad thematic coverage of scientific domains and ecosystem types and (c) a large user community and stakeholders involved. This results in a high heterogeneity with respect to data formats, data management practices and use of semantics for data description.

1.2 eLTER Information System Architecture

To overcome the challenges characterising LTER Europe, the eLTER (H2020) project develops tools and infrastructure components that provide easy and central access to information and, in the long term, also to data. Oggioni et al. (2017) collected the requirements for the eLTER Information System, which can be summarised as follows:
1) Distributed data sources - A net of distributed data sources (currently composed of databases, virtual nodes that share data by web services, or other types of data storage) should be supported by the eLTER Information System, allowing access to metadata and data from different data providers. The integration of local data management procedures with the central services of the eLTER Information System should also be ensured.
2) Central data facilities - As not all data are stored in well-managed data repositories, central services to store, document and archive data should be developed.

3) Fostering of standard services - eLTER should support the use of common data services as standard interfaces for data sharing and exchange.
4) Metadata standards - eLTER should support the development of a community profile for the documentation of the data and provide machine-readable endpoints (e.g. CSW) to exchange metadata.
5) Data standards - Different data provision formats need to be supported by the eLTER Information System. This needs to include data stored in databases as well as single data files (in different formats).
6) Common semantics - A common vocabulary is needed in order to describe the structure and content of the data files. This also includes a repository for reference lists so that they can be referenced. If possible, an online reference from the data file to the reference list should be available.
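Requirement 4 names CSW as an example of a machine-readable metadata endpoint. As a minimal sketch of what a harvesting client might send, the following Python snippet builds an OGC CSW 2.0.2 GetRecords request in key-value-pair form. The endpoint URL is a placeholder chosen for illustration and is not an eLTER service address; the function name is likewise hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Placeholder endpoint (assumption): replace with the CSW address of a real catalogue.
CSW_ENDPOINT = "https://example.org/csw"


def build_getrecords_url(endpoint, max_records=10, start_position=1):
    """Return a CSW 2.0.2 GetRecords URL (KVP binding) asking for summary records."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "summary",
        "resultType": "results",
        "startPosition": start_position,
        "maxRecords": max_records,
    }
    return endpoint + "?" + urlencode(params)


url = build_getrecords_url(CSW_ENDPOINT, max_records=5)
query = parse_qs(urlparse(url).query)  # decode again to inspect the parameters
print(url)
```

The request is only constructed here, not sent, so the sketch runs without network access.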

These requirements resulted in the development of the eLTER system architecture (Watkins et al. 2017), which focuses on building a central discovery and data access catalogue (see Fig. 1.2) aiming to link local and regional data nodes by means of a common description of sites, datasets and data services.

Fig. 1.2. Conceptual architecture of the eLTER Information System

It includes the following components:

● Site registry (DEIMS-SDR), providing harmonised and standardised documentation of long-term observation facilities
● Data nodes (DN), providing metadata and access to data (including the link to data repositories, if data are stored in external trusted repositories)
● Data Integration Portal (DIP), providing tools for the discovery of and access to data sources provided through the data nodes
● Common controlled vocabulary (EnvThes), providing a semantic backbone for keyword tagging and discovery

All components of the eLTER Information System are interlinked using references in the metadata and standard web services.

2 Data documentation

Metadata standards - eLTER should support the development of a community profile for the documentation of the data and providing machine-readable endpoints (e.g. CSW) to exchange metadata

Providing quality-controlled and reliable data as the basis for scientific analysis and as input into the construction of new and the evaluation of existing environmental policies is one of the major aims of long-term ecosystem monitoring and research, not only in Europe (Mirtl 2010) but also on a global scale (Mirtl et al. 2018). In order to foster information exchange and sharing, data must be discoverable and at least the metadata accessible (Michener et al. 1997). This requires proper documentation of data and services as well as infrastructure allowing the discovery of and access to data in a web-based environment. With the FAIR principles (Wilkinson et al. 2016), the basic guidelines for sharing data in a collaborative and reproducible manner are laid out. This includes the findability of data sources, which depends mainly on good-quality metadata. While the highest FAIR level is an aspiration for a full and comprehensive data documentation profile and the system it is implemented on, the technology and standards available limit what is feasible within the project. A number of projects contributed to the development and implementation of a common community profile for datasets and research sites, leading to the development of DEIMS-SDR as a common dataset and site catalogue. Within eLTER, the community profiles developed in EnvEurope [5] and ExpeER were adopted. Within the H2020 project ECOPOTENTIAL, the application and extension of the site documentation to protected areas and related in-situ data was shown (see Poursanides et al. 2017).

2.1 Relevant metadata standards

In order to enable a basic level of syntactic and semantic interoperability of data, a number of metadata standards were taken into account. ISO19115/19139 and EML proved to be the most important ones. Nevertheless, additional metadata standards, especially with regard to e.g. the openData initiative or the Global Biodiversity Information Facility (GBIF), need to be taken into account. The following is not a complete listing but points out the most prominent ones.

Dublin Core metadata (DC) [6] is a small set of vocabulary terms that can be used to describe web resources (e.g. video, images, web pages) as well as physical resources (e.g. books, CDs) or objects like artworks. The DC metadata terms are managed by the Dublin Core Metadata Initiative (DCMI) [7] and provide information e.g. on title, identifier, creator or description. DC could thus serve as a common core set of metadata attributes across different resource types (e.g. research sites, datasets) within the eLTER Information System.

[5] LIFE Environment Project LIFE08 ENV/IT/000399
[6] See http://dublincore.org/

Data Catalogue Vocabulary (DCAT) [8] is a W3C standard. DCAT is an RDF vocabulary designed to facilitate interoperability between data catalogues published on the Web. By using DCAT to describe datasets in data catalogues, publishers increase discoverability and enable applications to easily consume metadata from multiple catalogues. It further enables decentralised publishing of catalogues and facilitates federated dataset search across sites. Aggregated DCAT metadata can serve as a manifest file to facilitate digital preservation. The DCAT class Dataset [dcat:Dataset] is defined as a collection of data, published or curated by a single agent, and available for access or download in one or more formats. For each resource, information on e.g. title, description, language, identifier, contact point, distribution, frequency, keyword, landing page, publisher, release date, spatial coverage, temporal coverage, theme and update date can be given. DCAT is not domain specific but could be applied to any ecological dataset, integrating information from more detailed metadata schemata such as the Ecological Metadata Language (EML).
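To make the dcat:Dataset description more concrete, the following Python sketch assembles a JSON-LD record using a few of the properties listed above. The property names come from the DCAT and Dublin Core Terms vocabularies; the function name, the identifiers and all values are invented for illustration and do not describe an existing eLTER dataset.

```python
import json


def dcat_dataset(identifier, title, description, keywords,
                 landing_page, download_url, media_type):
    """Build a JSON-LD description of one dcat:Dataset with a single distribution."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:identifier": identifier,
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": list(keywords),
        "dcat:landingPage": {"@id": landing_page},
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:downloadURL": {"@id": download_url},
                "dcat:mediaType": media_type,
            }
        ],
    }


# Illustrative values only (assumptions, not a real record).
record = dcat_dataset(
    identifier="example-dataset-001",
    title="Air temperature time series (example)",
    description="Hourly air temperature measured at an example research site.",
    keywords=["air temperature", "time series"],
    landing_page="https://example.org/dataset/example-dataset-001",
    download_url="https://example.org/data/example-dataset-001.csv",
    media_type="text/csv",
)
print(json.dumps(record, indent=2))
```

JSON-LD is only one possible RDF serialisation; the same statements could equally be written in Turtle or RDF/XML.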

ISO 19115 [9] / 19139 [10] is a generic metadata schema for describing geographic information and services. It provides information about the identification, the extent, the quality, the spatial and temporal aspects, the content, the spatial reference etc. of digital geographic data and services. The standard consists of different parts, namely ISO 19115-1:2014 Fundamentals, ISO 19115-2:2009 Extensions for imagery and gridded data, and ISO 19139:2007 Metadata XML schema implementation. ISO19115/19139 is the underlying metadata schema adopted by the INSPIRE directive. As ISO19115 can be used generically to describe any geospatial dataset, it can also be applied to any observational data having a spatial context. Nevertheless, limitations in the documentation of methodological aspects may arise.

INSPIRE and INSPIRE MD Specification [11]: INSPIRE is a European Community Directive which entered into force in May 2007. The INSPIRE directive defines the guidelines for the establishment of a spatial data infrastructure in Europe in order to support the Community environmental policies, and policies or activities which may have an impact on the environment. The data infrastructure is based on the infrastructures for spatial information established and operated by the 27 member states of the European Union. The descriptive metadata are based on ISO19115/19139 as defined in the INSPIRE Metadata regulation (2008) [12]. For the implementation, technical guidelines and implementing rules have been specified. The INSPIRE MD Specification describes metadata elements for datasets and dataset series as well as for data services.

[7] See http://dublincore.org/documents/dcmi-terms/
[8] See https://www.w3.org/TR/vocab-dcat/
[9] See https://www.iso.org/standard/53798.html
[10] See https://www.iso.org/standard/32557.html
[11] See http://inspire.ec.europa.eu/document-tags/metadata

Ecological Metadata Language (EML) [13] is a metadata specification for data from the ecological domain (Michener et al. 1997). EML is implemented as a series of XML document types that can be used in a modular and extensible manner to document data. Each EML module is designed to describe one logical part of the total metadata that should be included with any ecological dataset. EML was adopted by ILTER and GBIF as the main supported metadata standard.

Darwin Core (DwC) [14] is a Biodiversity Information Standards (TDWG) standard for sharing biodiversity data (Wieczorek et al. 2012). DwC is a set of terms which can be seen as an extension of the Dublin Core metadata standard for the biodiversity domain.
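Because Darwin Core is essentially a set of agreed term names, a simple way to share occurrence data is a flat table whose column headers are DwC terms. The Python sketch below writes such a table as CSV. The column headers are standard Darwin Core terms; the record values, including the identifier and coordinates, are made up for illustration.

```python
import csv
import io

# A small selection of Darwin Core terms used as column headers.
DWC_TERMS = [
    "occurrenceID",
    "basisOfRecord",
    "scientificName",
    "eventDate",
    "decimalLatitude",
    "decimalLongitude",
]

# Illustrative record (assumption): not taken from an existing dataset.
records = [
    {
        "occurrenceID": "example:occ:0001",
        "basisOfRecord": "HumanObservation",
        "scientificName": "Fagus sylvatica",
        "eventDate": "2017-06-21",
        "decimalLatitude": 47.84,
        "decimalLongitude": 14.44,
    }
]


def to_dwc_csv(rows):
    """Serialise occurrence records to CSV with Darwin Core terms as the header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_TERMS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_dwc_csv(records)
print(csv_text)
```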

2.2 LTER Metadata models

Based on a community process (see Kliment & Oggioni 2011, Oggioni et al. 2012), the basic version of the eLTER community profile for datasets and research sites was developed and implemented within DEIMS-SDR. Within eLTER (H2020) and ECOPOTENTIAL (H2020, Poursanides et al. 2017), the focus was set on extensions for in-situ data provision, leading to the adoption of the community profile for LTER Europe and eLTER. This encompasses the documentation of:
● Observation and experimentation facilities, termed DEIMS-SDR:Site
● Data collections, documented as DEIMS-SDR:DataProduct
● Datasets, documented as DEIMS-SDR:Dataset
● Observation devices, documented as DEIMS-SDR:Sensor

In addition, basic information on DEIMS-SDR:Network and DEIMS-SDR:Person is collected, which is not further detailed in the current report. Whereas the metadata models for DEIMS-SDR:Site, DEIMS-SDR:DataProduct and DEIMS-SDR:Dataset were not modified within the current project context, a community profile for DEIMS-SDR:Sensor was developed based on OGC SensorML. The following sections give a short overview of the relevant metadata models adopted and implemented in DEIMS-SDR.

[12] See http://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:32008R1205
[13] See http://knb.ecoinformatics.org/
[14] See http://rs.tdwg.org/dwc/

2.2.1 DEIMS-SDR:Site

Sufficient and standardised documentation of data is needed in order to ensure the sharing and reuse of data. This applies not only to the description of a single data object but also to the context of the observation, e.g. the research facility or infrastructure. For place-based observations, information on the observation facilities (e.g. the research site) is an intrinsic and important asset for the discovery and reuse of data and expertise. A DEIMS-SDR:Site is defined as the total sum of infrastructure elements (including the study area) needed to address the research. This includes all plots and installations within a given area. Most of the LTER sites apply a catchment approach, meaning that the instrumented part of the catchment is termed the ‘site’. A single observation plot or sensor installation is termed a ‘station’ (see the introduction to Chapter 3, Data provision).

Fig. 2.1 Example of a site record [15]

[15] See https://data.lter-europe.net/deims/site/8eda49e9-1f4e-4f3e-b58e-e0bb25dc32a6

For the documentation of LTER sites, a set of required fields was defined in order to allow proper accreditation of sites within the LTER network. The Site Metadata Model (SMM) encompasses the following metadata fields:
● Name and general description
● Contact details
● Metadata providers
● Geographic location
● Ecosystem and environmental characteristic
● Site classification
● Status and history
● Protection status and resource management
● Focus, design and scale of site
● Information on infrastructure, operation and data management (including general data sharing policy and guidelines)
● Network affiliation and specific characterisation
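A completeness check against the field groups listed above could be sketched as follows. The keys are hypothetical short labels for the eleven groups and are not the field identifiers used in DEIMS-SDR; treating every group as mandatory is also an assumption made for the sake of the example.

```python
# Hypothetical labels for the SMM field groups (assumption, not DEIMS-SDR identifiers).
SITE_FIELD_GROUPS = (
    "name_and_description",
    "contact_details",
    "metadata_providers",
    "geographic_location",
    "ecosystem_and_environment",
    "site_classification",
    "status_and_history",
    "protection_and_management",
    "focus_design_scale",
    "infrastructure_operation_data",
    "network_affiliation",
)


def missing_groups(record):
    """Return the field groups that are absent or empty in a site record."""
    return [group for group in SITE_FIELD_GROUPS if not record.get(group)]


# Illustrative, deliberately incomplete site record.
example_site = {
    "name_and_description": "Example catchment site (illustrative)",
    "contact_details": "site.manager@example.org",
    "geographic_location": {"lat": 47.0, "lon": 14.0},
    "site_classification": "",  # present but empty, so it counts as missing
}

missing = missing_groups(example_site)
print(missing)
```

A check of this kind could run before a record is submitted, so that gaps are reported to the metadata provider rather than discovered during accreditation.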

Full documentation of the metadata model for the research site (Version 1.11) can be found in the metadata model documentation of DEIMS-SDR [16]. A mapping of the site metadata model to the INSPIRE Environmental Monitoring Facility [17] (EF) application schema was developed within the ECOPOTENTIAL (H2020) project (see Magagna et al. 2018). This provides a common exchange format for site information (e.g. protected areas or LTER sites) from existing site catalogues. It enables and fosters the development of a cross-RI registry of observation and experimentation facilities (e.g. extension of SITE_UUID to DEOS-ID), reduces information redundancies and enhances discoverability.

2.2.2 DEIMS-SDR:DataProduct

In order to provide a fast and easy overview of existing and relevant datasets, the metadata model for DEIMS-SDR:DataProduct (see Poursanides et al. 2017) was developed. The metadata model results from the collection of requirements from research projects (e.g. ECOPOTENTIAL) and stakeholder groups (e.g. LTER) aiming for a complete overview of data topics for a given research infrastructure element (e.g. site). If the dataset metadata are not available as a harvestable resource, this can only be achieved by a condensed documentation. The concept of ‘data product’ was added in order to allow a summarised description of a series of data. In this way, a fast overview of the available data sources in a protected area can be created without a full description of each single dataset. The full description of each dataset remains the logical second step and cannot be replaced by the information on the data products.

[16] See https://data.lter-europe.net/deims/metadata-models
[17] See http://inspire.ec.europa.eu/documents/Data_Specifications/INSPIRE_DataSpecification_EF_v3.0rc3.pdf

The Data Product Metadata Model (DPMM) describes the metadata elements of a DEIMS-SDR:DataProduct:
● Identification, title and abstract
● Related sites and datasets
● General information on data product type, parameters and keywords
● Spatial and temporal data resolution
● Data availability
● Contact information (including metadata creator)

Fig. 2.2 Example of a data product record [18]

For the eLTER DIP (Data Integration Portal), information on the DEIMS-SDR:DataProduct is provided as an ISO19115/19139 metadata XML record so that it is discoverable.

2.2.3 DEIMS-SDR:Dataset

In order to lower the barrier to metadata provision, a community metadata profile was defined, selecting the metadata elements required to ensure discovery and reuse of data (Kliment & Oggioni 2011). This includes a mapping of the metadata elements implemented in DEIMS-SDR to both EML (Version 2.1.1) and ISO19115/19139 (INSPIRE Profile). Both are relevant community standards which need to be supported. The Ecological Metadata Language (EML), a metadata specification for data from the ecological domain (Michener et al. 1997), was adopted by ILTER and LTER-Europe as the main supported metadata standard. In addition, LTER-Europe also recommends the use of the INSPIRE metadata specification [19], which is based on a European Community Directive. The INSPIRE directive defines the guidelines for the establishment of a spatial data infrastructure in Europe in order to support the Community environmental policies, and policies or activities that may have an impact on the environment. The descriptive metadata are based on ISO19115/19139 as defined in the INSPIRE Metadata regulation (2008) [20]. The Dataset Metadata Model (DSMM) provides information on:
● Title and identification
● Abstract and short description of the data content
● Keywords (taken from a controlled vocabulary)
● Method description (including sampling and instrumentation)
● Taxonomic, spatial and temporal extent
● Reference to data policy and intellectual property rights
● Documentation on the quality assurance procedure
● Link to the location of the file (or information on access)

[18] See https://data.lter-europe.net/deims/activity/94016f3b-2e6b-4f95-a759-1b0a40126dcd
[19] See http://inspire.ec.europa.eu/document-tags/metadata

Fig. 2.3 Example of a dataset record [21]

For each of the dataset metadata records an export to EML, ISO19115/19139, INSPIRE and BDP is possible. For these the local dataset metadata model is mapped to the respective standards.
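As a rough illustration of such an export, the sketch below maps a small local record onto a skeleton EML 2.1.1 document using Python's standard library. The element names (dataset, title, creator, abstract, keywordSet, contact) and the namespace are those of EML 2.1.1; the keys of the local record and all values are hypothetical, the result is not validated against the EML schema, and this is not the export code used in DEIMS-SDR.

```python
import xml.etree.ElementTree as ET

EML_NS = "eml://ecoinformatics.org/eml-2.1.1"
ET.register_namespace("eml", EML_NS)


def to_eml(record):
    """Map a local dataset record (hypothetical keys) to a skeleton EML 2.1.1 document."""
    root = ET.Element(
        f"{{{EML_NS}}}eml",
        {"packageId": record["id"], "system": record["system"]},
    )
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = record["title"]
    creator = ET.SubElement(dataset, "creator")
    ET.SubElement(creator, "organizationName").text = record["organisation"]
    abstract = ET.SubElement(dataset, "abstract")
    ET.SubElement(abstract, "para").text = record["abstract"]
    keyword_set = ET.SubElement(dataset, "keywordSet")
    for keyword in record["keywords"]:
        ET.SubElement(keyword_set, "keyword").text = keyword
    contact = ET.SubElement(dataset, "contact")
    ET.SubElement(contact, "organizationName").text = record["organisation"]
    return ET.tostring(root, encoding="unicode")


# Illustrative record (assumption): values are invented.
local_record = {
    "id": "example-dataset-001",
    "system": "example-catalogue",
    "title": "Soil water chemistry (example)",
    "organisation": "Example Research Institute",
    "abstract": "Nitrate concentration in soil water at an example site.",
    "keywords": ["nitrate", "soil water"],
}

eml_text = to_eml(local_record)
print(eml_text)
```

An export to ISO19115/19139 would follow the same pattern with a different target structure, which is why a single local metadata model with one mapping per standard is convenient.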

[20] See http://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:32008R1205
[21] See https://data.lter-europe.net/deims/dataset/cd1fb6f8-5e57-11e3-aa73-005056ab003f

2.3 Sensor metadata

The DEIMS-SDR:Site model provides a framework for describing the general setup of the observation and experimentation facilities. In order to link the specific observation setup to a single time series (e.g. meteorological data), documentation of the sensors, whether single sensors or sensor systems, is needed. Through the Data Nodes [DN] the sensors can be registered and described in order to associate them with the relevant time series.

[Diagram: DEIMS-SDR:Site, DEIMS-SDR:DataSet and DEIMS-SDR:Sensor, connected by the relations belongsTo, installedAt and generates]

Fig. 2.4 Relation between site, sensor and dataset

The documentation of the DEIMS-SDR:Sensor provides an important extension of the metadata models in DEIMS-SDR, as the research sites can be divided into different observation devices or plots where data are generated (see Fig. 2.4). When the SensorML model is applied to the documentation of sensor devices, the resulting XML documents can be used to register sensors within the Central Data Node, which uses the 52°North SOS server as its core application.

2.3.1 Background

The OGC Sensor Web Enablement (SWE) 22 framework provides specifications as guidelines for the description of sensors, procedures to create sensor descriptions, standards for representing observations collected by these sensors, and the specifications enabling machine to machine requests to get metadata and observations. The data models are defined by the Sensor Model Language (SensorML) and by Observations and Measurements (O&M), both of which allow the XML encoding of the information.

22 See http://www.opengeospatial.org/ogc/markets-technologies/swe

Sensor Model Language (SensorML) 23 is an Open Geospatial Consortium standard for describing sensors and measurement processes. It can be used to describe a wide range of sensors, including both dynamic and stationary platforms and both in-situ and remote sensors, and it provides standard models and an XML encoding for describing sensors and processes. O&M defines a conceptual schema and encoding for observations and for the features involved in sampling, e.g. when, where and by whom an observation was made. The Sensor Observation Service (SOS) is a web service for querying real-time sensor data and is part of the SWE. SensorML provides a common framework for any process, especially for the description of sensors and systems and the processes surrounding sensor observations.

The aims of SensorML are to:
● Provide descriptions of sensors and sensor systems for inventory management
● Provide sensor and process information in support of asset and observation discovery
● Support the processing and analysis of the sensor observations
● Support the geolocation of observed values (measured data)
● Provide performance and quality of measurement characteristics (e.g., accuracy, threshold, etc.)
● Provide general descriptions of components (e.g. a particular model or type of a sensor) as well as the specific configuration of that component when it is deployed
● Provide a machine interpretable description of the interfaces and data streams flowing in and out of a component
● Provide an explicit description of the process by which an observation was obtained (i.e., its lineage)
● Provide an executable aggregate process for deriving new data products on demand (i.e., derivable products)
● Archive fundamental properties and assumptions regarding sensor systems and computational processes 24
● Provide information on the manufacturer, owner, and operator as contacts who can give more information about the sensors
● Provide historical events of the sensor (e.g. installation, calibration, etc.)

Sensor and transducer 25 components (detectors, transmitters, actuators 26 , filters, and processes) are modelled as physical processes that can be connected and participate equally within a process network or system, and which utilize the same model framework as any other process. Processes are entities that take one or more inputs and, through the application of well-defined methods and configurable parameters, produce one or more outputs. The process model can be used to describe a wide variety of processes, including

23 See http://www.opengeospatial.org/standards/sensorml 24 https://portal.opengeospatial.org/files/?artifact_id=55939, p. 14-15 25 An entity that receives a signal as input and generates a modified signal as output. Includes detectors, actuators, and filters. 26 A type of transducer that converts a signal to some real-world action or phenomenon

not only sensors, but also actuators, spatial transforms, and data processes. SensorML also supports explicit linking between processes and thus supports the concept of process chains, networks, or workflows, which are defined as processes using a composite pattern.

Processes that can be modelled with SensorML are:
● Physical System: an aggregate system that can include multiple components (both physical and non-physical) with explicit links between the outputs, inputs, and parameters of the individual components. In a PhysicalSystem, the spatial position of the System itself is relevant to its application.
● Physical Component: a physical process that will not be further divided into smaller components.

SensorML provides a framework within which the geometric, dynamic, and observational characteristics of sensors and sensor systems can be defined. A variety of sensor types can all be supported through the definition of simple and aggregate processes. The models and schema within the core SensorML specification provide a skeletal framework for describing processes, aggregate processes, and sensor systems 27 . In choosing to use the SensorML and O&M schemas, eLTER gets the benefits of a data model that is highly extensible and can be customized to create a profile that best represents the eLTER data. For example, in other domains such as hydrology and soil, the SensorML and O&M models have been customized to produce profiles of the standards, namely WaterML 28 and a common-core SoilML specification 29 . While this extensibility can be regarded positively, allowing for multiple profiles and flexible definitions, it can also be overwhelming for users first encountering the standards, as can the range of ways in which information may be described. Moreover, O&M has been identified as integrally relevant to five INSPIRE themes (Geology, Oceanographic geographical features, Atmospheric conditions and Meteorological geographical features, Environmental monitoring facilities, and Soil), which are including elements of O&M in their data specifications 30 .

2.3.1.1 SWE Lightweight SOS Profile

SensorML is the recommended sensor metadata format for SOS 2.0 31 . SensorML is used within SOS for encoding the sensor metadata documents that are returned in response to DescribeSensor requests. This lightweight profile 32 defines a minimum set of mandatory metadata that need

27 https://portal.opengeospatial.org/files/?artifact_id=55939, p. 14-15 28 http://www.opengeospatial.org/standards/waterml 29 https://portal.opengeospatial.org/files/?artifact_id=69891 30 http://inspire.ec.europa.eu/documents/Data_Specifications/D2.9_O&M_Guidelines_v2.0rc3.pdf 31 http://www.ogcnetwork.net/sos_2_0/tutorial/sensorml 32 https://portal.opengeospatial.org/files/?artifact_id=52675

to be provided in a SensorML document. Complex elements of SensorML are not considered here.

● gml:description (mandatory): Short textual description of the sensor or sensor system.
● gml:identifier (mandatory): Unique identifier of the sensor system.
● sml:keywords (mandatory): Terms which help to describe the sensor system and serve for discovery purposes. For example, the phenomena observed by the system or the types of contained sensors can be mentioned.
● sml:identification (mandatory): This element contains identifiers of the sensor system. Each "identifier/Term" element contained in the "IdentifierList" must have a "definition" attribute which links to the semantics of the sensor system. One identifier has to be present which contains the definition "urn:ogc:def:identifier:OGC:shortname". The value of its contained "Term" element represents a human understandable name for the instance. One identifier has to be present which contains the definition "urn:ogc:def:identifier:OGC:longname". The value of its contained "Term" element represents a human understandable name for the sensor system.
● sml:classification (mandatory): This element contains classifiers for the sensor system. Each "classifier/Term" element contained in the "ClassifierList" must have a "definition" attribute. This attribute links to the semantics of the identifier. One classifier has to be present which contains the definition “http://www.opengis.net/def/property/OGC/0/SensorType”. The value of its contained “Term” element states the type of the sensor system (e.g., “weather station”).
● sml:contacts (mandatory): This element contains contact information about the operator of the sensor. The element "contacts/ContactList/member/gmd:CI_ResponsibleParty" has to be present to define the responsible party of the sensor system 33 .
● sml:featuresOfInterest (mandatory): This element contains the real world entity, the feature of interest, which is observed by the sensor system. In case of this profile, the feature of interest is a station and modelled as a SamplingPoint.
● sml:outputs (mandatory): The outputs of the sensors attached to the sensor system. Each child-element of an "output" has to use the "definition"-attribute to specify the URI of the observed property. If the child-element of the output is a "swe:Quantity" it has to contain the "swe:uom" element which specifies the "code" attribute stating the UCUM code. Depending on the observation types the outputs have to be described as one of the following elements:
● swe:Quantity (in case of Measurement)
● swe:Count (in case of CountObservation)

33 https://portal.opengeospatial.org/files/?artifact_id=52803

● swe:Boolean (in case of TruthObservation)
● swe:Category (in case of CategoryObservation)
● swe:Text (in case of TextObservation)
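The mandatory elements listed above lend themselves to a simple automated check. The following is a minimal sketch, not part of any eLTER or OGC tooling, that reports which of the eight mandatory elements are missing from a SensorML document. The namespace URIs (SensorML 2.0, GML 3.2), the function name `missing_mandatory` and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs (SensorML 2.0 and GML 3.2); adjust to the schema in use.
NS = {
    "sml": "http://www.opengis.net/sensorml/2.0",
    "gml": "http://www.opengis.net/gml/3.2",
}

# Mandatory elements of the lightweight profile as listed in the text.
MANDATORY = [
    "gml:description", "gml:identifier", "sml:keywords", "sml:identification",
    "sml:classification", "sml:contacts", "sml:featuresOfInterest", "sml:outputs",
]


def missing_mandatory(xml_text):
    """Return the mandatory elements that are not direct children of the root."""
    root = ET.fromstring(xml_text)
    return [name for name in MANDATORY if root.find(name, NS) is None]


# Hypothetical, deliberately incomplete sensor description.
SAMPLE = """<sml:PhysicalSystem
    xmlns:sml="http://www.opengis.net/sensorml/2.0"
    xmlns:gml="http://www.opengis.net/gml/3.2">
  <gml:description>Hypothetical weather station</gml:description>
  <gml:identifier codeSpace="uniqueID">urn:example:sensor:1</gml:identifier>
  <sml:keywords/>
</sml:PhysicalSystem>"""

print(missing_mandatory(SAMPLE))
```

The check only tests for the presence of the elements; the further constraints of the profile (for example the required "definition" attributes) would need additional rules.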

2.3.2 DEIMS-SDR:Sensor

Within the eLTER project a light-weight sensor profile based on SensorML was implemented. This led to the specification of the DEIMS-SDR:Sensor type described in the following chapters. In addition, the EDI metadata editor provided by get-IT also allowed for the implementation of the sensor model. With the goals of obtaining syntactic and semantic interoperability, enabling the discovery of observations through their sensor information, and reducing the cognitive load placed on data managers using the system, the eLTER O&M and SensorML model allows for
● encoding the observations in a simple data model;
● reducing the flexibility of SensorML to a small profile;
● enriching the metadata and data with terms obtained from controlled vocabularies.

The eLTER profile aims to make the task of providing data as smooth and simple as possible, offering a subset of the most important and widely used fields that form part of any monitoring or model projection programme. The selection of elements is based on the lightweight OGC SOS profile (OGC 11-169 document 34 ). Only parts of the SensorML model are used for the DEIMS-SDR:Sensor community profile. Based on the requirements for the development of a valid SensorML document, a community profile for sensors was defined and implemented within DEIMS-SDR. This allows the documentation and extraction of sensor information from DEIMS-SDR. For each documented sensor a landing page (see Fig. 2.6) and an XML representation of the documentation are provided (see Annex C – SensorML Example DEIMS-SDR:Sensor).

Fig. 2.5 Metadata model DEIMS-SDR:Sensor Version 0.9

34 https://portal.opengeospatial.org/files/?artifact_id=52803

Tab. 2.1 DEIMS-SDR:Sensor community profile

Field | Definition | Multiplicity | Mandatory
Title | Custom name or identifier that describes the sensor. | 1-1 | yes
Sensor_UUID | Unique alpha-numeric identifier of the site. The UUID is automatically generated by DEIMS-SDR. The UUID is used for creating the URL for the site, e.g. https://data.lter-europe.net/deims/site/8eda49e9-1f4e-4f3e-b58e-e0bb25dc32a6. The UUID is taken as network independent unique identifier for the observation and experimentation facility. In addition the SITE CODE can be added as a network specific identifier. | 1-1 | yes
Metadata Update | Provides date of metadata creation or last update | 1-1 | yes
Related site | The location, where specific observations are carried out. This links to a valid Site_UUID provided by DEIMS-SDR. | 1-1 | yes
Sensor type | Type of sensor, usually further qualified by an application specific code space | 1-1 | yes
Description | Textual description of the sensor | 1-1 | yes
Keywords | Describes the sensor with keywords based on EnvThes | 1-n | yes
Contact | Person that is associated with the sensor | 1-n | yes
Sensor mobile | State whether or not the sensor is mobile | 1 | yes
Coordinates | Describe the location of the sensor using coordinates | 0-1 | no
Elevation | Elevation in [m] above or below sea level | 0-1 | no
Trajectory | Describes the trajectory of the sensor | 0-1 | no
Sensor operational since | Date when the sensor was installed | 0-1 | no
resultAquistionSource | Categories for different types of the ResultAcquisitionSource: ex-situ, in-situ, remote sensing or submersed | 0-1 | no
Media monitored | Describes the media(s) that are observed/measured by the sensor | 0-n | no
Measure | Parameter observed by the sensor. The values for the parameter are taken from EnvThes (parameter) | 0-n | no
Units of measurement | Describes the units which are used to describe the measurements | 0-n | no
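The multiplicities in Tab. 2.1 can be checked mechanically before a sensor record is submitted. The sketch below is an illustration only and does not reflect how DEIMS-SDR validates records internally: the `PROFILE` dictionary transcribes the field names and multiplicities from the table, while the record layout (a mapping from field name to a list of values), the function name `validate` and the example record are assumptions.

```python
# (min, max) occurrences per field, transcribed from Tab. 2.1; None means unbounded.
PROFILE = {
    "Title": (1, 1),
    "Sensor_UUID": (1, 1),
    "Metadata Update": (1, 1),
    "Related site": (1, 1),
    "Sensor type": (1, 1),
    "Description": (1, 1),
    "Keywords": (1, None),
    "Contact": (1, None),
    "Sensor mobile": (1, 1),
    "Coordinates": (0, 1),
    "Elevation": (0, 1),
    "Trajectory": (0, 1),
    "Sensor operational since": (0, 1),
    "resultAquistionSource": (0, 1),
    "Media monitored": (0, None),
    "Measure": (0, None),
    "Units of measurement": (0, None),
}


def validate(record):
    """Return a list of multiplicity violations for a record {field: [values]}."""
    problems = []
    for field, (low, high) in PROFILE.items():
        count = len(record.get(field, []))
        if count < low:
            problems.append(f"{field}: expected at least {low}, found {count}")
        elif high is not None and count > high:
            problems.append(f"{field}: expected at most {high}, found {count}")
    # Fields that are not part of the community profile are reported as well.
    for name in sorted(set(record) - set(PROFILE)):
        problems.append(f"{name}: not part of the profile")
    return problems


# Hypothetical record: no keywords, and two coordinate entries instead of one.
EXAMPLE = {
    "Title": ["Air temperature sensor A"],
    "Sensor_UUID": ["00000000-0000-0000-0000-000000000000"],
    "Metadata Update": ["2018-05-30"],
    "Related site": ["11111111-1111-1111-1111-111111111111"],
    "Sensor type": ["thermometer"],
    "Description": ["Hypothetical example sensor"],
    "Contact": ["Jane Doe"],
    "Sensor mobile": ["no"],
    "Coordinates": ["47.0 13.0", "47.1 13.1"],
}

for problem in validate(EXAMPLE):
    print(problem)
```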

Fig. 2.6 Example of a sensor record35

Annex B – SensorML Implementation for DEIMS-SDR describes the considerations made in defining the DEIMS-SDR:Sensor model based on SensorML. For each sensor a basic set of mandatory fields was defined (see Tab. 2.1) which allows the basic documentation of the sensor entities. The resulting SensorML document can be used for the registration of a sensor within the central data node.

2.3.3 DataNode:Sensor

As described more comprehensively in deliverable D8.2 Software service prototype (Chapter 2.3, Oggioni et al. 2018), the central data node consists of several software packages gathered in a single suite called GET-IT (Geoinformation Enabling Toolkit Starterkit®) 36 . GET-IT has evolved over time and is fully suitable to support the needs of syntactic and semantic interoperability and the distribution of data and metadata related to observations collected by sensors. GET-IT is free and open-source software, released under the GNU General Public License v3.0. It has been listed among the "INSPIRE directive in practice" tools 37 and among the Open Source Geospatial Foundation (OSGeo) projects 38 . Central data node users can describe sensors and upload observations using a Web GUI, both following the OGC standard schemas and the INSPIRE specifications. The SensorML metadata profile defined within the eLTER project defines a minimum set of metadata that

35 See https://data.lter-europe.net/deims/sensor/fb583610-fe71-4793-b1a9-43097ed5c3e3 36 http://www.get-it.it 37 https://inspire-reference.jrc.ec.europa.eu/tools/get-it-geoinformation-enabling-toolkit-starterkit®-0 38 https://www.osgeo.org/projects/get-it/

shall be provided in a SensorML document. Every sensor system shall be modelled as a PhysicalSystem. The mandatory elements, defined according to the OGC Best Practice for Sensor Web Enablement (OGC 2014), are:

● gml:description
● gml:identifier
● sml:keywords
● sml:identification
● sml:classification
● sml:contacts
● sml:featuresOfInterest
● sml:outputs

A profile adhering to the O&M specification has also been defined for observations within the eLTER project. Single observations, arrays or long-term series of observations can be shared. Further details are described in deliverable D8.2 Software service prototype (Oggioni et al. 2018). The solution adopted in the DataNode makes it possible to achieve semantic interoperability: the SensorML fields identification and classification, and above all the swe:Quantity and swe:uom elements in outputs, are defined as URIs of terms from controlled vocabularies. Within the eLTER project the vocabulary used is EnvThes. For greater clarity, we give below an XML example for the element outputs.
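A minimal sketch of such an outputs element can be generated with Python's standard library. The element nesting (sml:outputs, sml:OutputList, sml:output, swe:Quantity, swe:uom) and the namespace URIs follow our reading of SensorML 2.0 and SWE Common 2.0 and should be treated as assumptions; the two `example.org` URIs are placeholders, not actual EnvThes identifiers.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for SensorML 2.0, SWE Common 2.0 and XLink.
SML = "http://www.opengis.net/sensorml/2.0"
SWE = "http://www.opengis.net/swe/2.0"
XLINK = "http://www.w3.org/1999/xlink"
for prefix, uri in (("sml", SML), ("swe", SWE), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)


def build_outputs(name, property_uri, unit_uri):
    """Build an sml:outputs element with one quantity annotated by two URIs."""
    outputs = ET.Element(f"{{{SML}}}outputs")
    output_list = ET.SubElement(outputs, f"{{{SML}}}OutputList")
    output = ET.SubElement(output_list, f"{{{SML}}}output", {"name": name})
    # The observed property is referenced through the definition attribute.
    quantity = ET.SubElement(output, f"{{{SWE}}}Quantity", {"definition": property_uri})
    # The unit of measurement is referenced through xlink:href.
    ET.SubElement(quantity, f"{{{SWE}}}uom", {f"{{{XLINK}}}href": unit_uri})
    return outputs


# Placeholder URIs standing in for the vocabulary terms.
element = build_outputs(
    "air_temperature",
    "http://example.org/vocabulary/air-temperature",
    "http://example.org/vocabulary/degree-celsius",
)
print(ET.tostring(element, encoding="unicode"))
```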

As can be seen, the definition attribute of the element swe:Quantity and the xlink:href attribute of the element swe:uom are two URIs referring to two terms in the EnvThes vocabulary: air temperature and degree Celsius, respectively. Using resolvable links for the different SensorML elements, rather than unresolvable labels (e.g. only “air temperature”), enriches the sensor information with semantic annotation and improves discoverability and interoperability. To facilitate the entry of sensor metadata, some fields of the template described in this paragraph offer auto-completion (see Fig. 2.7). The figure shows the completion of the swe:Output field, where all terms of the reference controlled vocabulary, in this case EnvThes, starting with the word "air" are suggested.

Fig. 2.7 EDI Metadata Editor (get-IT software suite) – Register sensor

As the user continues typing in the field, increasingly specific terms are suggested. This functionality is implemented through a SPARQL query against EnvThes.

For example:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT * WHERE {
  SERVICE <…> {
    GRAPH <…> {
      BIND (<…> as ?root)
      ?root (skos:narrower*) ?c.
      ?c skos:prefLabel ?l.
      FILTER( REGEX( STR(?l), "$search_param", "i") )
    }
  }
}
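The logic of the query above can be illustrated without a SPARQL endpoint. The following sketch mimics the two steps of the query on a tiny in-memory hierarchy: following skos:narrower zero or more times from a root concept, then keeping the concepts whose preferred label matches the search parameter case-insensitively. The concept identifiers and labels are hypothetical and not taken from EnvThes; the function name `suggest` is likewise an assumption.

```python
import re

# Hypothetical miniature vocabulary: concept -> (preferred label, narrower concepts).
CONCEPTS = {
    "root": ("parameter", ["c1", "c2"]),
    "c1": ("air temperature", ["c3"]),
    "c2": ("soil moisture", []),
    "c3": ("air temperature maximum", []),
}


def suggest(root, search_param):
    """Mimic `?root skos:narrower* ?c` plus a case-insensitive REGEX on the label."""
    matches, stack, seen = [], [root], set()
    while stack:
        concept = stack.pop()
        if concept in seen:  # guard against cycles in the hierarchy
            continue
        seen.add(concept)
        label, narrower = CONCEPTS[concept]
        if re.search(search_param, label, re.IGNORECASE):
            matches.append(label)
        stack.extend(narrower)
    return sorted(matches)


print(suggest("root", "air"))
print(suggest("root", "Air temperature m"))
```

As in the query, the search parameter is interpreted as a regular expression, so longer input narrows the list of suggestions.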

3 Data provision

Data standards - Different data provision formats need to be supported by an eLTER Information System. This needs to include data stored in databases as well as single data files (in different formats).

Providing standardised data enables the automation of workflows, e.g. the generation of added value data products or the application of models. This applies not only to spatial data, where standardisation efforts in Europe are driven by the implementation of the INSPIRE directive, but also to in-situ observation data. It includes not only the definition of common data structures and formats for data exchange but also the semantic documentation of the content. A clear definition of the different data elements is needed. The application of common core models describing the environmental observation (e.g. O&M, SERONTO) as well as their implementations (e.g. SOS services) supports the syntactic harmonisation of data across the different domains. Nevertheless, not all data can be shared using SOS services. Therefore, simple and easy to implement file based formats are needed in addition to ensure data transmission and provision. Providing sufficient metadata on the single observations (e.g. data quality) is one of the core requirements for their implementation.

In the deliverable D3.1 eLTER State of the Art and Requirements (Oggioni et al. 2017) different data types are specified for eLTER. These encompass:
● Single point observations at a single point in space and time
● Time series observations, characterised by continuous or repeated measurements at fixed locations by humans or sensors
● Profile observations, characterised like time series observations but including continuous or repeated measurements at different depth or height levels (e.g. soil temperature at different depths)
● Trajectory observations, characterised like time series observations but including continuous or repeated measurements along a trajectory in space (e.g. surface water temperature)
● Sample based observations, characterised like time series observations but based on sampling events (e.g. soil water samples) that are analysed in a second step, which introduces a time lag in data provision
● Coverage observations, characterised as full coverage observations of a real-world phenomenon (e.g. vegetation types) varying in time and space

Basically all data have a reference in time and space, thus being spatial data in the broader sense. The basic information is on the observation point, profile or trajectory to which each observation value (with a dedicated observation time) can be linked. The specifications and implementations induced by the INSPIRE directive as well as the current developments for sensor services (e.g. get-IT, TERENO) are important inputs in the discussion and definition of the eLTER Data Model for time series data.

3.1 Relevant standards

3.1.1 INSPIRE Directive

The INSPIRE directive 39 aims to create a European Union (EU) spatial data infrastructure in order to enable the sharing of environmental spatial information among public sector organizations and better facilitate public access to spatial information across Europe. A European Spatial Data Infrastructure (SDI) assists in policy-making across boundaries. The INSPIRE Directive came into force on 15 May 2007 and was implemented in various stages, with full implementation required by 2021 based on the roadmap 40 . INSPIRE is based on the infrastructures for spatial information established and operated by the 27 Member States of the European Union. The implementation of the data infrastructure under the INSPIRE directive should follow a number of common principles that ensure the usage and integration of distributed data sources and data providers:
● Data should be collected only once and kept where it can be maintained most effectively.
● It should be possible to combine seamless spatial information from different sources across Europe and share it with many users and applications.
● It should be possible for information collected at one level/scale to be shared with all levels/scales; detailed for thorough investigations, general for strategic purposes.
● Geographic information needed for good governance at all levels should be readily and transparently available.
● It should be easy to find what geographic information is available, how it can be used to meet a particular need, and under which conditions it can be acquired and used.

The Directive addresses 34 spatial data themes needed for environmental applications, with key components specified through technical implementing rules. The spatial information considered under the directive is extensive and includes a great variety of topical and technical themes. With respect to long term observation the following data specifications are of interest:
● Land cover 41 and land use 42
● Species 43 and habitat 44 distribution
● Environmental Monitoring Facilities 45

39 See http://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:32007L0002 40 See https://inspire.ec.europa.eu/inspire-roadmap/61 41 See https://inspire.ec.europa.eu/Themes/123/2892 42 See https://inspire.ec.europa.eu/Themes/129/2892 43 See https://inspire.ec.europa.eu/Themes/133/2892 44 See https://inspire.ec.europa.eu/Themes/146/2892 45 See https://inspire.ec.europa.eu/Themes/120/2892

For more specific applications, especially in LTSER Platforms, Population distribution and demography 46 might also be of interest. The INSPIRE Environmental Monitoring Facilities (EF) specification allows the observation location (feature of interest) to be linked to the specific time series via sensors and the media monitored. This matches the requirements within eLTER to set up central data nodes and to facilitate workflows that enable data provision.

3.1.2 Sensor Web Enablement (SWE)

As introduced for the documentation of sensors, the OGC Sensor Web Enablement (SWE) 47 framework provides specifications as guidelines for the description of sensors, procedures to create sensor descriptions, standards for representing observations collected by these sensors, and the specifications enabling machine to machine requests to get metadata and observations. It is recommended that the eLTER sensor data model implementation follow three distinct sets of guidelines, to ensure a uniform minimal set of metadata across sites and sources. By following these guidelines in both the software implementation and data management areas, users of the services will have a source of data that enables effective discovery, exploration, and reuse. This is because the data model metadata capture the collection procedures, observed phenomena, feature location, and observation details that provide a full context around each individual observation.

The first set of guidelines is published as a minimal SensorML profile for procedures that generate observations. This profile, published in the document OGC 11-169, defines a subset of available SensorML 2.0 elements to capture the metadata of monitoring stations. The mandatory elements require only a minimal amount of effort on behalf of the data manager, are straightforward in their definition and rationale for inclusion, and provide data users with enough information to allow for informed reuse of the data.

The second set of guidelines concerns the O&M observation encoding and is defined in the INSPIRE D2.9 document 48 . It provides a guide on creating an “observation-centric view”, where metadata about the capture of the observation is of utmost importance. Examples for observation generation types such as hardware, software, or human-based are provided, as are example feature types such as fixed point monitoring stations, moving monitoring stations, and sample based observations, giving a full demonstration of the approaches to observation recording. When followed along with the first set of guidelines, an observation is presented with a full record of how it was generated, where it was generated, and any other metadata regarding that single particular observation.

The third and last set of guidelines relates to the encoding of observed phenomena, i.e. the property that an observation measures (Leadbetter and Vodden, 2016). It defines a way to describe phenomena as a group of constituent, discrete parts, that when brought together

46 See https://inspire.ec.europa.eu/Themes/138/2892 47 See http://www.opengeospatial.org/ogc/markets-technologies/swe 48 See http://inspire.ec.europa.eu/id/document/tg/d2.9-o%26m-swe

define complex phenomena in a clear, concise manner. The metadata is built upon a linked-data representation, making it both accessible and searchable. Its use of discrete constituent parts to generate complex phenomena descriptions also allows for precise semantic mediation between differing data sources of phenomena descriptions.

Individually, each of these guidelines provides a specialized view of how a part of an observation’s metadata must be collected. Collectively, these three guidelines provide a standard way of describing how, where and when an observation was generated, and the phenomena that the observation represents. For a complete data model that captures metadata for meaningful reuse while putting as few constraints as possible on a data manager, these three guidelines must be followed.

3.1.3 Semantic Sensor Networks (SSN)

The SSN / SOSA complementary ontologies 49 provide another way of achieving the OGC Sensor Web Enablement (SWE) goal of FAIR observational data, this time using linked-data rather than a document-based data representation. They can be seen as an evolution of SensorML and O&M: data encoded in those formats can be mapped to SSN / SOSA, and vice versa. The driving force behind the development of these ontologies is that the existing SWE standards are “not integrated and aligned with W3C Semantic Web Technologies which are key drivers for creating and maintaining a global and densely interconnected graph of data”. The authors make the point that reuse of data requires much more than observation results, and it is the accessibility of the whole metadata provided by using linked-data and SPARQL that makes this implementation much more powerful than the SWE standards it emulates. Through the emulation of both SensorML and O&M, the SSN / SOSA ontologies provide the means of capturing the same relationships and metadata, but because they use the linked-data representation they provide a base that can meet the highest level of the FAIR principles, which is currently not possible using the other SWE technologies. This is achieved by a number of features: all data objects are assigned a unique ID through their URI; all pieces of data and their relationships are queryable; the use of SPARQL provides an open, free, and universally implementable protocol that allows for querying against multiple data sources; provenance chains are both provided and queryable; and it is not necessary to design specialist software to interact with the data. The last point in the list above, that it is not necessary to design specialist software to store, query, and access the data, is extremely important when comparing the ontologies to SensorML and O&M. 
While the SWE standards are open, they require specialist software to store, query, and serve the data. There are at present only two known open-source projects under current visible development, from 52°N and Geomatys, with others having had development appear to slow down, such as istSOS, or stop completely, such as OOSTethys. This lack of varied choice in software, combined with a specialized access standard in SOS

49 https://www.w3.org/TR/vocab-ssn/

demonstrates the increased level of openness provided by the ontologies using linked-data and SPARQL for representation and access. There are many varied suppliers of linked-data stores, there are many programming libraries for working with triples, and through the use of RDF and RDFS it is possible to draw inferences across new and existing datasets created by different groups.

One drawback of the SSN / SOSA ontology approach, which it shares with the SensorML and O&M approach, is the lack of graphical tools that allow data managers to create metadata and upload observations. Both sets of data models expect the user to know how to use the raw standards, either SPARQL and ontologies, or document formats and SOS requests. This is a level of knowledge we cannot expect or request of users, and so tools must be created to address this.

Another drawback of using linked-data is performance when storing large volumes of observation data. For each observation, at least nine triples can be created, and potentially more. With large time series collections this can push the number of triples past volumes that are feasible in a price to performance comparison with storage systems that are backed by more traditional representations, such as 52°N SOS with PostgreSQL.

The SSN / SOSA ontologies are the current state of the art with regard to observational data and metadata recording for embracing and providing FAIR principles. While there are many and varied providers of the necessary software to support this data model, there is an open question about the performance of these software offerings when tasked with holding large amounts of observational data in comparison to more traditional approaches.
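To give a feeling for the storage volumes discussed above, the following back-of-the-envelope sketch estimates the number of triples for regularly sampled time series, using the lower bound of nine triples per observation quoted in the text. The sampling interval, duration and number of series are illustrative assumptions, and leap days are ignored.

```python
TRIPLES_PER_OBSERVATION = 9  # lower bound quoted in the text


def triples_for_series(interval_minutes, years, series=1):
    """Estimate the triple count for regularly sampled series (365-day years)."""
    observations_per_year = (60 // interval_minutes) * 24 * 365
    return observations_per_year * years * series * TRIPLES_PER_OBSERVATION


# One parameter measured every 10 minutes for 10 years.
print(triples_for_series(10, 10))             # 4730400
# Twenty such parameters, e.g. a hypothetical small station network.
print(triples_for_series(10, 10, series=20))  # 94608000
```

Even a modest set of stations therefore reaches tens of millions of triples, which is the scale at which the price to performance question raised above becomes relevant.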

3.2 eLTER Data Reporting Format

The standardised provision of time series data is one of the key requirements for the eLTER information system. Implementing Sensor Web Enablement and the underlying Observations and Measurements standard is seen as the main solution for providing syntactically harmonised data within the network. This resulted in the setup of the central data node (see Oggioni et al. 2018) and the integration of partner nodes (see Chapter 1.2). Nevertheless, file-based provision is still needed for data which a) cannot currently be placed within the central data node, or b) have a complex structure and cannot be managed by the central data node (e.g. vegetation data). Therefore an attempt was made to provide an easy-to-apply format which can be used across different data types, including time series information.

3.2.1 Background

The common data reporting format should be used to share and publish data from the eLTER domain. Because it follows the main elements describing an observation, transforming the data into a time series managed by Sensor Observation Services should be possible.

Fig. 3.1 provides a simplified overview of the link between sensor information (om:sensor) and the observation (om:resultTime and om:result).

Fig. 3.1 Overview of linking Sensor description and observation50

For the Observation the following information can be defined:
• Observation time [om:phenomenonTime] is defined as the time (instant or period) for which the observation contains observation data. This is the time when the sample (virtual or physical) was taken in the field.
• Publication time [om:resultTime] is defined as the time when the result became available. This is often identical to the phenomenonTime, but for sample-based data a time lag between sampling and analysis is possible.
• Observation method [om:procedure] is defined as the sensor instance (e.g. human or device) which has generated the observation. Often the ‘sensor’ is identified by an identifier (e.g. PID).
• Parameter [om:observedProperty] is defined as the phenomenon that was observed in the field or in the lab.
• Observed feature [om:featureOfInterest] is defined as the reference to the geometric feature (e.g. sensor station) to which the observation is associated. This can e.g. be defined as the sampling point or plot. The installation of the sensor can also be referenced here.
• Resulting value [om:result] is defined as the observed value, which is characterised by the observation method and the unit of measurement. In addition, quality information can be linked to the value.

This basic data model as well as SERONTO (Schentz et al. 2011) was taken into account in the definition and development of the eLTER Data Reporting Format.

50 See https://portal.opengeospatial.org/files/?artifact_id=52803
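
The observation elements listed above can be pictured as a simple record. The following is only a sketch, not part of the eLTER specification: the class name, the attribute names and the example values (method identifier, dates) are hypothetical, and the comments indicate which O&M element each attribute is meant to stand for.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Observation:
    phenomenon_time: date        # om:phenomenonTime (when the sample was taken)
    result_time: date            # om:resultTime (when the result became available)
    procedure: str               # om:procedure (sensor, person or method identifier)
    observed_property: str       # om:observedProperty (parameter)
    feature_of_interest: str     # om:featureOfInterest (e.g. station or plot)
    result: float                # om:result (observed value)
    unit: Optional[str] = None   # unit of measurement of the result

    def lag_days(self) -> int:
        """Days between sampling and availability of the result."""
        return (self.result_time - self.phenomenon_time).days


# Hypothetical sample-based observation: analysed in the lab after sampling.
obs = Observation(date(2016, 3, 15), date(2016, 4, 1),
                  "lab-method-01", "NH4N", "IP1", 5.5, "mg N/l")
print(obs.lag_days())
```

For sensor data the two times would usually coincide and the lag would be zero; for sample-based data the lag makes the difference between phenomenonTime and resultTime explicit.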

3.2.2 Structure

The eLTER Data Reporting Format was developed as a lightweight mechanism to provide and publish data resulting from long-term observations while retaining the main elements of an observation. The specification is based on the experience of the UNECE Integrated Monitoring Programme, which has long-term experience with the standardisation of complex ecosystem data and observations.

Fig. 3.2 Structure of the eLTER Data Reporting

The data specification provides a common vocabulary of 57 terms for the description of the data, ranging from the observation location (e.g. SCODE) to the observation itself (e.g. VALUE). Links to the observation facilities (e.g. LTER site or protected area) as well as to the observation stations (e.g. plots or sensors) are core elements of the data specification. The basic observation is encoded using a combination of WHERE (SCODE), WHAT (SUBST), WHEN (TIME) and VALUE. In the case of biodiversity observations, e.g. vegetation relevés, additional columns (e.g. TAXA) are added, but they follow the basic structure of the reporting format. In order to simplify data provision, a basic and an extended data format are supported. Fig. 3.3 provides an overview of the elements linked in the data model. Details are described in Chapter 3.2.3.

[Figure 3.3 elements: Observation, observedAt, Time, Value, Unit, Station, Method, Parameter, ReferenceList]

Fig. 3.3 eLTER Data Reporting Format: basic observation model

When using Microsoft Excel as reporting file format, the different information blocks (e.g. station, method, data, and references) are tables within one spreadsheet.

ZOEBELBODEN_VEG_SPECCOVER_2015_V20170315.xls
• STATION
• METHOD
• DATA
• Ref_STYPE etc.

If text formats (e.g. csv or txt) are used, the information blocks (e.g. station, method, data, references) are provided in separate files that are zipped into a single archive:

ZOEBELBODEN_VEG_SPECCOVER_2015_V20170315.zip
• STATION.CSV
• METHOD.CSV
• REFERENCE.CSV
• AT003_ZOEBELBODEN_VEGETATION_2015_V20170315.CSV
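
As an illustration of the text-based packaging, the sketch below bundles the information blocks into one zip archive using the Python standard library. The file names are taken from the example above; the function name and the file contents are placeholders, since the actual column layout is defined in Annex A and not reproduced here.

```python
import io
import zipfile


def package_report(blocks):
    """Zip the information blocks (file name -> text content) into one archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_name, text in blocks.items():
            archive.writestr(file_name, text)
    return buffer.getvalue()


# Placeholder contents; real files follow the field specification in Annex A.
blocks = {
    "STATION.CSV": "SCODE\nIP1\n",
    "METHOD.CSV": "placeholder\n",
    "REFERENCE.CSV": "placeholder\n",
    "AT003_ZOEBELBODEN_VEGETATION_2015_V20170315.CSV":
        "SCODE,SUBST,LEVEL,TIME,VALUE,UNIT\nIP1,TEMP,200,2016-03-15,5.5,°C\n",
}

payload = package_report(blocks)
names = zipfile.ZipFile(io.BytesIO(payload)).namelist()
print(names)
```

The archive is built in memory here; in practice the bytes would be written to a file named after the dataset, e.g. ZOEBELBODEN_VEG_SPECCOVER_2015_V20170315.zip.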

The observations can be either organised as data rows (see Tab. 3.1), enabling the provision of extended information on the single observations, or as columns (see Tab. 3.2). Both versions are possible but the first one is recommended.

Tab. 3.1 eLTER Data Reporting - Basic format (row)

SCODE  SUBST  LEVEL  TIME        VALUE  UNIT    FLAGQUA  FLAGSTA
IP1    TEMP   200    2016-03-15  5.5    °C               X
IP1    PREC   100    2016-03-03  10.2   MM               S
IP1    TEMP   200    2016-02-15  2.5    °C               X
IP1    NH4N   100    2016-03     5.5    mg N/l           W
IP1    SO4S   100    2016-03     10.2   mg S/l           W
IP1    CA     100    2016-03     2.5    Mg/l    L        W
…      …      …      …           …      …       …        …

Tab. 3.2 eLTER Data Reporting - Basic format alternative version (column)

SCODE  LEVEL  TIME     TEMP  PREC  NH4N  SO4S  CA   TYPE
IP1    100    2016-03  5.5   10.2  2.5   5.5   2.5  Forest
IP1    100    2016-04  5.2   1.2   2.2   5.8   1.2  Forest
…      …      …        …     …     …     …     …    …

The extended version of the data reporting format includes all columns defining the observation.
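
The relation between the row and the column layout can be sketched as a simple pivot. The example below uses only the Python standard library and a few rows modelled on Tab. 3.1; the function name and the comma-separated encoding are assumptions made for illustration.

```python
import csv
import io

# A few observations in the row layout (subset of the columns of Tab. 3.1).
ROW_FORMAT = """SCODE,SUBST,LEVEL,TIME,VALUE
IP1,NH4N,100,2016-03,5.5
IP1,SO4S,100,2016-03,10.2
IP1,CA,100,2016-03,2.5
"""


def rows_to_columns(rows):
    """Pivot row-oriented records into one record per station, level and time."""
    wide = {}
    for row in rows:
        key = (row["SCODE"], row["LEVEL"], row["TIME"])
        # Each substance code becomes a column holding its value.
        wide.setdefault(key, {})[row["SUBST"]] = row["VALUE"]
    return [
        {"SCODE": scode, "LEVEL": level, "TIME": time, **values}
        for (scode, level, time), values in wide.items()
    ]


records = list(csv.DictReader(io.StringIO(ROW_FORMAT)))
table = rows_to_columns(records)
print(table)
```

The pivot has no place for per-value information such as unit or flags, which is one way to see why the row layout is the recommended one.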

3.2.3 Field description

The list of fields follows the extended data reporting file using the Microsoft Excel template. The fields used in the basic data reporting template are marked with bold letters in the column Field name and an asterisk (*). A full documentation of the data reporting format can be found in Annex A - Field Specification for data reporting.

3.2.3.1 Data:Station

Definition: A station is an observation entity within an LTER Site or LTSER Platform. Station in this respect is synonymous with plot, observation location, sensor location, etc. and is defined by a location, elevation and installation height (if relevant). In general, stations are not described in DEIMS-SDR. In this case additional information needs to be provided together with the data. Stations can also be uploaded as an additional data table.

Basic information about the stations is provided in the table STATION. This includes:

• Site code
• Station code
• Station name
• Station type
• Centre point or bounding box
• Altitude
• Country
• Sampling period since
• Installation height (in case of sensor)
• Plot size and shape (in case of plot based observations)

If additional fields are needed, e.g. habitat type for vegetation data, optional additional fields (columns) can be created. In terms of the Observations and Measurements model, Data:Station refers to the om:featureOfInterest. If a reference to a sensor is provided in addition to the Data:Station reference, this links to the om:procedure element. Stations are often also represented as a spatial dataset (e.g. shapefile, geodatabase, WxS service) which follows the basic data structure.

3.2.3.2 Data:Method

Definition: A method describes the procedure to generate and manipulate the data. The section contains information on the methods applied for the generation and manipulation of the data. The method section should give an overview on the sampling, the field method and the method used in the lab to create the data value. In addition the method needs to be provided with the metadata description in DEIMS-SDR. This includes information on:

• Method identification
• Sampling method
• Field observation method
• Laboratory method
• Statistical or geo-statistical aggregation method

The method can also be a reference to an online documented structured description of the method applied. In this case a detailed documentation of the methods applied can be included in the discovery metadata.

If a link to a sensor is to be provided, the method information needs to be encoded as sml:sensor. Additional information according to the DEIMS-SDR:Sensor model needs to be defined (e.g. sml:SensorMobile), including the documentation of the location (see Data:Station). Information on the sml:MediaMonitored is provided by the single observations encoded in Data:Value.

3.2.3.3 Data:Value

Definition: The data are defined as the section where the observation values are provided.

This section contains data on any observation or measurement in the different compartments of the ecosystem. It includes bio-geochemical measurements as well as biotic observations. This includes information on:

• Sub programme
• Site code
• Organisation name
• Station code
• Medium monitored (including the reference list)
• Height of measurement (min-max in case of a range (e.g. soil) or height in case of a single measurement point (e.g. sensor))
• Date and time of measurement
• Spatial pool of single observations (for spatial aggregations)
• Temporal pool of single observations (for temporal aggregations)
• Temporal level of aggregation or observation
• Taxonomic reference (including reference list)
• Substance code or parameter name (including reference)
• Method reference
• Value of measurement
• Unit of measurement
• Quality flag for data value
• Status flag for data value

Referring to the Observations and Measurements model, the following elements can be mapped to the terms in the data specification:
• Station code [om:featureOfInterest]
• Medium monitored [sml:mediaMonitored]
• Date and time of measurement [om:phenomenonTime]
• Substance code or parameter name [om:observedProperty]
• Method reference [om:procedure]
• Value of measurement [om:result]
• Unit of measurement [sml:Unit]
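
The mapping can be sketched as a small lookup applied to one data record. This is an illustration only: it assumes that SCODE carries the station code and uses just the column codes visible in Tab. 3.1; medium and method are left out because their column codes are not shown in this section, and the function name is hypothetical.

```python
# Column code of the reporting format -> element of the O&M / SensorML model.
# Assumption: SCODE holds the station code of the observation.
FIELD_TO_OM = {
    "SCODE": "om:featureOfInterest",
    "TIME": "om:phenomenonTime",
    "SUBST": "om:observedProperty",
    "VALUE": "om:result",
    "UNIT": "sml:Unit",
}


def to_observation(record):
    """Keep the mapped columns of a record and rename them to model elements."""
    return {
        FIELD_TO_OM[name]: value
        for name, value in record.items()
        if name in FIELD_TO_OM
    }


record = {"SCODE": "IP1", "SUBST": "TEMP", "LEVEL": "200",
          "TIME": "2016-03-15", "VALUE": "5.5", "UNIT": "°C", "FLAGSTA": "X"}
observation = to_observation(record)
print(observation)
```

Columns without a counterpart in the mapping (here LEVEL and FLAGSTA) are simply dropped in this sketch; a real transformation would need to decide how to carry them over.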

Following the record-oriented structure of the data reporting format, (spatial and temporal) resolution or methods can be described in detail for each single parameter and medium. In addition, information for each value (e.g. aggregation level or quality) can be provided in this format. This enables well-documented data values and the integration of metadata at record level.

Tab. 3.3 Example of biophysical data in the basic data format

SCODE  SUBST  LEVEL  TIME        VALUE  UNIT    FLAGQUA  FLAGSTA
IP1    TEMP   200    2016-03-15  5.5    °C               X
IP1    PREC   100    2016-03-03  10.2   MM               S
IP1    TEMP   200    2016-02-15  2.5    °C               X
IP1    NH4N   100    2016-03     5.5    mg N/l           W
IP1    SO4S   100    2016-03     10.2   mg S/l           W
IP1    CA     100    2016-03     2.5    Mg/l    L        W
…      …      …      …           …      …       …        …

A detailed description can be found in Annex A - Field Specification for data reporting.

3.2.3.4 Data:ReferenceLists

If the data are provided as nominal or categorical values, these are often encoded as codes. In this case the reference lists should be provided together with the data, unless they are already available online. This includes the following fields:

• Field name
• Name of the reference list
• Code of the entry
• Name or full name of the entry
• Definition or description of the entry

If text files are used, the references are provided as a separate file, structured as defined in the following. All definitions are provided in a single file.

4 Common Semantics

A common vocabulary is needed in order to describe the structure and content of the data files. In addition, it becomes necessary to provide access to common reference lists as SKOS controlled lists. A repository for the different types of semantic resources (thesauri, controlled lists and ontologies) that need to be referred to from LTER data is an important requirement for improving data interoperability at the semantic level.

As information sources grow enormously, there is a need for more effective information retrieval. The latter is defined as the process of searching a collection of documents in order to identify those documents which deal with a specific subject. The effectiveness of search is often hampered by problems such as ambiguity and synonyms. This is particularly true for the metadata description of data belonging to domains like ecology or biodiversity, where scientists often use local and fuzzy names that lack harmonisation across institutions and countries. A controlled domain vocabulary seems the right answer to this common problem. A thesaurus is defined as a vocabulary of keywords, a standardized set of terms and phrases authorized for use in an indexing system to describe a subject area or information domain. It limits and controls the diversity of natural languages by offering an expression that should be used for each concept. This corpus of clearly defined and consistent terms must be achieved through a harmonization process accepted by a broad community in the subject area, otherwise it will not be used.

In the last decades a series of standards and technologies has been developed to facilitate data discovery, exchange and integration. The use of metadata to provide the context for proper interpretation has helped to make ecological data more discoverable (metadata standards like EML or ISO 19115). But metadata alone do not provide enough information to reuse data. The ability of two or more systems or components to exchange information and to use the information that has been exchanged is called interoperability (IEEE 2001). Distributed data from different research and experimental sites must be interoperable at different levels to allow for common analysis: syntactic, structural and semantic interoperability is required.
• Syntactic interoperability allows data exchanged from one information technology system to be received by another and does not require the ability of the receiving information technology system to interpret the data. It standardises the data format and refers to the packaging and transmission mechanisms for data.
• Structural interoperability is an intermediate level that defines the structure or format of data exchange. Data models like Darwin Core 51 ensure that data exchanges between information technology systems can be interpreted at the data field level, for instance if two datasets both use coordinates to specify locations but name the attributes differently. This is not the case if the same meaning is expressed differently, e.g. with one dataset using named locations instead (Veen et al. 2012).
• Both syntactic and structural interoperability are prerequisites for semantic interoperability. The latter provides interoperability at the highest level, which is the ability of two or more systems or elements to exchange information and to use the information that has been exchanged. Semantic interoperability takes advantage of both the structuring of the data exchange and the codification of the data, including vocabulary, so that the receiving information technology systems can interpret the data. This is achieved by adding data about the data (metadata), linking each data element to a controlled, shared vocabulary.

51 http://rs.tdwg.org/dwc/

In order to integrate data from different sources a common semantic framework is needed (Oggioni et al. 2012). The development of a common semantic backbone needs to build upon a common language between the different data generators, providers and users. But a common vocabulary alone does not guarantee that people understand each other. It is well known that human language is prone to misunderstanding, misinterpretation and information loss. Disclosure and transfer of knowledge can only succeed if the communication between all those involved works. Semantic resources such as controlled vocabularies aim to solve these problems, first of all by achieving disambiguation.

4.1 Standards for controlled vocabularies

Controlled vocabularies mandate the use of predefined, authorised terms or concepts which are related to each other. They refer to taxonomies, thesauri and ontologies, which differ in the degree of semantic expressivity (also known as semantic precision). Taxonomy is the practice and science of classification of things or concepts, including the principles that underlie such classification. A taxonomy is the simplest variant as it contains only terms that are organized into a hierarchical structure. A thesaurus is a special kind of a controlled vocabulary consisting of a collection of structured concepts. The concept is understood as a unit of thought which can have several labels associated with it. In contrast to spoken language a thesaurus must resolve any ambiguities such as homonyms (same spelling with different meanings) by the use of qualifiers, e.g. wood (substance) versus wood (area). The international norm for thesauri (ISO 25964-1) provides recommendations for the development and maintenance of thesauri intended for information retrieval applications. It lists four exchange standard formats for thesauri: MARC (Machine-Readable-Cataloguing for bibliographic information), Zthes (XML based), DD 8723-5 (from the British Standards Institution) and SKOS (Simple Knowledge Organization System).

SKOS was endorsed as a W3C recommendation 52 in 2009 for sharing and linking knowledge organization systems in the semantic web. In contrast to the other three mentioned standards, SKOS was developed with the specific aim of serving as a thesaurus standard. In accordance with ISO 25964-1 it is concept-based, which means that it is built upon concepts and the terms representing them. It provides a standard way to represent knowledge organization systems using the Resource Description Framework (RDF) 53. Encoding this information in RDF allows it to be passed between computer applications in an interoperable way.

SKOS concepts are identified with unique URIs and described by natural language labels. A label is a literal with a language tag (like ‘en’ for English). skos:prefLabel assigns a preferred lexical label to a resource, and skos:altLabel makes it possible to assign an alternative label. Hidden labels may be used to include misspelled variants for text-based indexing and search operations. For each concept only one prefLabel per language tag is allowed. All labels for one concept represent equivalence relationships. Concepts are linked via semantic relations to other concepts, enabling a hierarchical (skos:broader/skos:narrower) and associative (skos:related) organization within the system. In addition to associations within a thesaurus, concepts can also be mapped across vocabularies (skos:exactMatch, skos:closeMatch, skos:broadMatch, skos:relatedMatch). This SKOS functionality helps to semantically enrich each concept and thus also the whole vocabulary.

SKOS-XL is an optional extension of the SKOS standard which allows the definition of specialised relations between two labels, such as an acronym for a term in use. The drawback of this approach is that it can lead to incompatibilities with other vocabularies not using these extensions.
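
The labelling rules can be illustrated with a few SKOS statements held as plain Python tuples. This is a minimal sketch without any RDF library: the example.org namespace, the concepts and the helper function are hypothetical, and only the SKOS namespace URI and property names come from the standard.

```python
from collections import Counter

SKOS = "http://www.w3.org/2004/02/skos/core#"
EX = "http://example.org/vocab/"  # hypothetical vocabulary namespace

# (subject, predicate, object); labels are (text, language tag) pairs.
triples = [
    (EX + "woodSubstance", SKOS + "prefLabel", ("wood (substance)", "en")),
    (EX + "woodSubstance", SKOS + "altLabel", ("timber", "en")),
    (EX + "woodArea", SKOS + "prefLabel", ("wood (area)", "en")),
    (EX + "woodArea", SKOS + "broader", EX + "landCover"),
    (EX + "landCover", SKOS + "prefLabel", ("land cover", "en")),
]


def duplicate_pref_labels(statements):
    """Return (concept, language) pairs that have more than one prefLabel."""
    counts = Counter(
        (subject, obj[1])
        for subject, predicate, obj in statements
        if predicate == SKOS + "prefLabel"
    )
    return sorted(key for key, n in counts.items() if n > 1)


print(duplicate_pref_labels(triples))
```

Adding a second English prefLabel to one of the concepts would make the function report that concept, which mirrors the rule that only one preferred label per language tag is allowed.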
Another type of controlled vocabulary are ontologies, defined as formal, explicit specifications of a shared conceptualization within a community. They are an arrangement of concepts identified by URIs that are related by freely definable kinds of relations. The main components of ontologies are:
• Classes that represent (abstract or specific) concepts,
• Relations that specify the different types of associations between classes,
• Functions specifying arguments in triples in specified relations,
• Axioms expressing constant propositions and
• Instances representing concrete elements and individual objects.
Ontologies are mostly expressed in OWL 2 (Web Ontology Language) 54, a W3C-endorsed specification. OWL is built upon RDF, extending its vocabulary. It is characterized by formal semantics including axioms and inferences. The first version of OWL defines three variants with different levels of expressiveness: OWL Lite, OWL DL and OWL Full.

52 https://www.w3.org/TR/skos-reference/
53 https://www.w3.org/RDF/
54 https://www.w3.org/TR/owl2-overview/

RDF is the standard model for the Semantic Web. It provides the foundation for publishing and linking data (Linked Data). It extends the linking structure of the Web forming a directed, labeled graph. The core structure is a set of triples, each consisting of a subject, a predicate and an object. In an RDF graph each triple is represented as a node-arc-node link. SPARQL (SPARQL Protocol and RDF Query Language) 55 is an RDF query language. It is a semantic query language for databases, able to retrieve and manipulate data stored in RDF. A SPARQL endpoint is a conformant SPARQL protocol service enabling users (human or other) to query a knowledge base via the SPARQL language. Results are typically returned in a pre-selected machine-processable format (e.g. XML, HTML, Simple JSON). Therefore, a SPARQL endpoint is mostly conceived as a machine-friendly interface towards a knowledge base and it is recommended that the formulation of the queries should be implemented by the calling software.56
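
The triple structure and the idea of querying it by pattern can be sketched in a few lines. The example below is not a SPARQL implementation; it only mimics a single triple pattern over an in-memory list, and the ex: identifiers and the function name are hypothetical.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard,
    similar to a variable in a SPARQL triple pattern."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# A tiny graph: each entry is one subject-predicate-object statement.
graph = [
    ("ex:obs1", "ex:observedProperty", "ex:temperature"),
    ("ex:obs1", "ex:hasResult", "5.5"),
    ("ex:obs2", "ex:observedProperty", "ex:precipitation"),
]

# Roughly corresponds to: SELECT ?s ?o WHERE { ?s ex:observedProperty ?o }
properties = match(graph, predicate="ex:observedProperty")
print(properties)
```

A real SPARQL endpoint additionally joins several such patterns, applies filters and serialises the result, which is why query formulation is usually left to the calling software.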

4.2 EnvThes

The Environmental Thesaurus (EnvThes) was developed as a semantic backbone for data resulting from long-term ecosystem research and monitoring (Schentz et al. 2011, 2013) including all related domains such as biodiversity, agriculture, forestry, etc. (see http://vocabs.ceh.ac.uk/evn/tbl/envthes.evn). Built on the US LTER Controlled Vocabulary (Porter 2010) as a primary source it partly incorporates and links to other relevant vocabularies including EUROVOC57, GEMET58, the INSPIRE spatial data themes59, and AGROVOC60. The vocabulary is based on current semantic web standards (SKOS and SPARQL) and supports multilinguality. Initial tests on the use of EnvThes as a multilingual thesaurus for annotation and discovery have been made (Vanderbilt et al. 2010, Vanderbilt et al. 2017).

4.2.1 EnvThes in a nutshell

EnvThes represents the harmonised vocabulary dealing with environmental observations and is adopted by LTER-Europe, ILTER and ECOPOTENTIAL. It provides concepts describing all steps in the data gathering and analysis process, from the observation sites to the statistical analysis. It is a thesaurus expressed in SKOS and edited and governed in TopBraid. It is hosted by CEH and organized by EAA (facilitator: Barbara Magagna). It is a community effort with an editor team composed of three stable members from the terrestrial domain, supported by three additional scientists from diverse domains (remote sensing, freshwater domain and landscape ecology) who contribute when needed. The facilitator supports and supervises the editor team and is responsible for structural and semantic accuracy.

55 http://www.w3.org/TR/rdf-sparql-query/
56 http://semanticweb.org/wiki/SPARQL_endpoint.html
57 See http://eurovoc.europa.eu/drupal/
58 See http://www.eionet.europa.eu/gemet/en/themes/
59 See https://www.eionet.europa.eu/gemet/en/inspire-themes/
60 See http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus

URL: http://vocabs.ceh.ac.uk/evn/tbl/envthes.evn

Scope: EnvThes compiles a set of terms in order to describe in a harmonised way data resulting from observation and measurements of ecosystem processes across different domain specific sciences.

Use: It is used by DEIMS as the source of common keywords for annotating and querying metadata. It is the semantic source for data annotation for later analysis and it serves as a harmonized specification of parameters in the observation and measurement of ecosystem processes.

Status: still in progress, although a second complete release is planned in the upcoming months

Conceptual models: EnvThes is based on a mix of O&M, OBOE and SERONTO, with a focus on implementing the main design principles laid down in the Complex Properties Model (Leadbetter & Vodden 2016).

4.2.2 The conceptual model of EnvThes

4.2.2.1 Overview on conceptual models for observation and measurements

Analysing ecological phenomena across geographic, temporal, or biological scales typically requires access to a variety of existing (already collected) observational data sets. Observational data are typically represented in tabular form (i.e. rows and columns) but often differ in the number of attributes, the names of similar attributes, the relationships implied between attributes, and the coding conventions used for representing information within data sets. These differences not only make discovering relevant data challenging, but also require researchers to spend considerable time interpreting and integrating potential data sets for use within any particular analysis. A model that captures all the details necessary to describe observations of real objects, but is at the same time general enough to be used in different domains, can help to overcome these difficulties. A number of different approaches used in the ecology community are presented and compared in the following. A special focus is laid on how parameters are conceptualized.

As examples to show the different model patterns we use ‘tree diameter at breast height’ and ‘content of nitrogen in soil water’.

OBOE

The Extensible Observation Ontology (OBOE) is a formal ontology for capturing the semantics of scientific observation and measurement. It was developed within the SEEK (Science Environment for Ecological Knowledge) project and first published in 2007 (Madin 2007). It is used within the biodiversity community for the semantic representation of observation data in the DataONE61 initiative and in many other communities (e.g. AquaDiva project) and research infrastructures like LifeWatch and AnaEE. The model identifies the entities, or objects, being observed; the observations of entities and their corresponding measurements; for each measurement, the value of a characteristic of the entity according to a measurement standard or protocol (or procedure); and the context assumed by each measurement and observation (Bowers 2010).

Fig. 4.1 OBOE core model (Madin et al. 2007)

The entity class represents all concrete and conceptual objects that are observable. An observation is composed of exactly one entity and can provide context for the observation of another entity. An observation consists of zero or more measurements of the entity. Measurements assign values to characteristics of entities, where values can be another entity or primitive values like integers or strings. Measurements also include standards and can specify protocols (Madin et al. 2007). Standards stand for the unit or reference list used for the resulting value and protocol for the method of measurement. If we apply OBOE for representing the ‘diameter at breast height of a tree measurement’ it could be modelled as in Fig. 4.2.

61 https://www.dataone.org/

Fig. 4.2: Tree diameter at breast height modelled in OBOE

Another possibility would be to use the protocol class to include the description that the diameter was measured at breast height, while diameter would be the characteristic measured.

Fig. 4.3: Tree diameter at breast height modelled using protocol in OBOE
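
The two modelling options can be sketched with plain data classes. This is an illustration of the OBOE core concepts as described in the text, not a rendering of the ontology itself or of the figures: the class layout, the attribute names and the measured value are assumptions, and the first variant (breast height folded into the characteristic) is one possible reading of the first option.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Measurement:
    characteristic: str             # the characteristic that is measured
    value: float                    # the value assigned to the characteristic
    standard: str                   # unit or reference list of the value
    protocol: Optional[str] = None  # method of measurement, if specified


@dataclass
class Observation:
    entity: str                                   # exactly one observed entity
    measurements: List[Measurement] = field(default_factory=list)  # zero or more
    context: Optional["Observation"] = None       # observation providing context


# Variant A (assumed): breast height is part of the characteristic.
variant_a = Observation("tree", [
    Measurement("diameter at breast height", 32.5, "centimetre"),
])

# Variant B: diameter is the characteristic, breast height sits in the protocol.
variant_b = Observation("tree", [
    Measurement("diameter", 32.5, "centimetre", protocol="measured at breast height"),
])

print(variant_a.measurements[0].characteristic)
print(variant_b.measurements[0].protocol)
```

Both variants carry the same value and standard; they differ only in where the "at breast height" qualifier is stored, which affects how easily diameters measured under different protocols can be found together.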

When trying to describe the more complex parameter ‘nitrate concentration in soil water’ the model seems not to be flexible enough to accommodate each description element as shown in Fig. 4.4. Soil water is interpreted as the entity, while concentration of nitrate is the characteristic which is measured.

Fig. 4.4: Concentration of nitrate in soil water modelled in OBOE

O&M

The conceptual model for observations and measurements (O&M) is described in ISO 19156. According to S. Cox (2017), observations and measurements are used to determine values of properties through the application of some procedure (method) at a particular time and place. The result can only be an estimate of the true value, conditioned by procedure and circumstances. The observation is interpreted as an event, modelled with the observation class with several attributes and classes for the feature of interest, the procedure, the observed property and the result (compare with Fig. 4.5). The term feature is used in the sense defined in the ‘Reference Model’ used by OGC and by ISO Technical Committee 211 (Geographic Information), referring to a conceptualization of an entity in the real world with spatial coordinates indicating its location. An observation serves as a property-value provider for the feature of interest. In many cases, observations are not performed on the feature of ultimate interest, either because the feature is inaccessible (then the focus lies on a subset of the complete feature of interest) or because the property is not directly observable. To meet this challenge proximate sampling features are introduced, which are accessible and have properties that can be sensed (for further details see ISO 19156).

Fig. 4.5: O&M (Cox 2017)

The two models are often compared (see also Fig. 4.6). It is debatable whether entity really corresponds to feature, as an entity is conceived not only as the representation of an object in the real world but also of the concept behind it. Using the tree diameter example for comparison, the feature would refer to the concrete tree at a certain location, while the entity concept in OBOE could also address the general concept class tree as an organism type. A thesaurus like EnvThes could provide the specific vocabulary for these conceptual objects, but not for the feature of interest in O&M.

O&M describes a generic model for metadata associated with property-value estimation. However, much of the detail in specific observations is associated with classes in the ‘second layer’, i.e. phenomena and procedures (see Annex C of OGC 07-022r1), which are often indicated by reference (to an identifiable concept of a vocabulary). A phenomenon is defined as a property type, a characteristic of one or more feature types. It is recommended to use an ontology of observable property types which allows both base property types and complex ones to be described. In the annex the example ‘water temperature’ is defined as a constrained property type with temperature as the base property and water as the constraint. Water could also be interpreted as an entity in OBOE.
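
The idea of a constrained property type can be sketched as a base property plus a list of constraints. The class and attribute names below are hypothetical and do not reproduce the UML of the O&M annex; only the ‘water temperature’ example is taken from the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PropertyType:
    name: str  # base property, e.g. temperature


@dataclass(frozen=True)
class ConstrainedPropertyType:
    base: PropertyType
    constraints: Tuple[str, ...]  # e.g. the medium the property refers to

    def label(self) -> str:
        """Human-readable label: the constraints followed by the base property."""
        return " ".join(self.constraints + (self.base.name,))


temperature = PropertyType("temperature")
water_temperature = ConstrainedPropertyType(temperature, ("water",))
print(water_temperature.label())
```

Keeping the base property separate from its constraints makes it possible to find all temperature observations regardless of the medium, while still distinguishing water temperature from, say, air temperature.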

Fig. 4.6: Comparison O&M and OBOE (in red)62

SSNO (Semantic Sensor Network ontology)

The W3C Semantic Sensor Network Incubator Group developed the Semantic Sensor Network ontology (SSNO) (Compton et al. 2012). It leverages the Stimulus-Sensor-Observation pattern (SSO), which adds the concept of ‘stimulus’ to the core model based on O&M (see Fig. 4.7). The pattern links sensors, the observed property and the resulting observations. Stimuli are defined as changes or states in an environment that a sensor is able to detect and use to measure a property.

62 https://www.slideserve.com/fahim/oboe-model-changes

Fig. 4.7: SSO pattern

The ontology is aligned to DOLCE-UltraLite63 (DUL), with SSNO concepts directly inheriting from a number of DUL classes and properties, which introduces ontological commitments. The model has four perspectives:
• A sensor perspective, focussing on what senses, how it senses and what is sensed,
• An observation perspective, focussing on observation data and related metadata,
• A system perspective, with a focus on systems of sensors and deployments, and
• A feature and property perspective, where both concepts are left as placeholder concepts to be enriched by linking to appropriate vocabularies.
The sensor perspective enriches the model with the capabilities of sensors, which define any property observed by a sensor and the performance of the sensor as affected by prevailing environmental conditions (see Fig. 4.8).

Fig. 4.8: Sensor perspective of SSNO

63 http://ontologydesignpatterns.org/wiki/Ontology:DOLCE+DnS_Ultralite

The ontology is used in the implementation of a transparent and RESTful proxy for OGC’s Sensor Observation Service (SOS). The proxy can be installed in front of an SOS to serve Linked Sensor Data on the fly (Compton et al. 2012). Compared to O&M it omits sampling features and it introduces a large number of dependencies derived from other ISO 19100-series UML models (Cox 2017).

SERONTO (Socio-ecological research and observation ontology)

The ontology SERONTO (Socio-ecological research and observation ontology), developed by EAA (e.g. Schentz et al. 2011), aims to allow for a linked data representation of observation data.

Fig. 4.9 SERONTO Core Model

Fig. 4.9 shows an overview of the basic concepts defined in SERONTO. These encompass:
• Physical thing (i.e. the investigation object, which can also be the experimental unit),
• Parameters (the measurement, classification and treatment of the investigation object),
• Value set (joined concepts holding the information for the investigation object, the combined parameter/method used and the time series of values),
• Reference elements (pointing to references and reference lists such as species lists, available for any used concept and especially for nominal values; this is a necessary intermediate concept since the same references could be part of different reference lists),
• Methods (used for each parameter, including units, scale and dimensions; these could also be represented as method chains),
• Selection descriptions (the origin of the research object or population, including the sampling method),
• Groupings of objects (such as experimental blocks, to which the observer, time or other aspects are assigned or related), and
• Additional information (such as actors (observer, observer groups and institutions), project information etc., which can be attached to several different concepts).

The concepts of the SERONTO core are derived from scientific principles and lean heavily on statistical methodology while adhering to W3C standards and INSPIRE principles, aiming to ensure repeatability of the observations and transparency. The Parameter_Method class combines the two base classes because it is assumed that for each parameter there is an appropriate method to be applied. For complex parameters, the association hasHelpObject can be used to attach any additional element pertaining to reference lists (such as nitrate as a reference element of the reference list chemical substances). In comparison to OBOE, the additional value of SERONTO seems to lie in the use of reference lists, exact descriptions of applied methods (also allowing chains of methods and methods encompassing other methods), the introduction of selection descriptions explaining the origin of the research object, the time stamp bound to every value and the provision of templates for specific domain use.

Complex Model

Whereas a variety of basic observation models and domain-specific models exist, Leadbetter and Vodden (2016) assess the need for semantic mediation between the abstract and the specific to allow mapping from one domain to another. They propose the Complex Property Model (CPM), which is based on O&M but additionally breaks down complex concepts into atomic concepts. Starting from the INSPIRE extension to O&M (Fig. 4.10), they analysed the attribute Base Phenomenon of the Observable Property, which is defined as a code list (Phenomenon Type list).

[Figure: UML class diagram "Observable Properties". Classes shown: AbstractObservableProperty (realising the General Feature Model metaclass GF_PropertyType), ObservableProperty (attributes basePhenomenon: PhenomenonTypeValue and uom: UnitOfMeasure [0..1]), CompositeObservableProperty (attribute count: Integer, with 2..* components), StatisticalMeasure, Constraint with the subtypes ScalarConstraint, RangeConstraint, CategoryConstraint and OtherConstraint, the data type RangeBounds, the enumeration ComparisonOperatorValue (equalTo, notEqualTo, lessThan, greaterThan, lessThanOrEqualTo, greaterThanOrEqualTo) and the code lists StatisticalFunctionTypeValue and PhenomenonTypeValue.]

Fig. 4.10: Observable Properties in O&M extension

The phenomenon types can themselves be complex concepts which are hard to map across domains. The authors propose to map Base Phenomenon to two OWL classes, namely Object of Interest and Property. This approach allows the constituent concepts to be identified. Concentration of nitrate would then be broken into the Property concentration and the Object of Interest nitrate. This enables mapping of a property such as concentration regardless of whether it is measured for carbon or nitrogen. The concept Matrix helps to specify complex properties because in many situations the Object of Interest is embedded, dissolved or otherwise entailed within a medium or layer. Coming back to the original parameter name concentration of nitrate in soil water, soil water would be the matrix. The additional classes Constraint and Statistical Measure help to address mathematical qualifying issues in complex parameter descriptions.
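The decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class ComplexProperty, its field names and the helper with_property are hypothetical and do not reproduce the OWL classes of Leadbetter and Vodden (2016); the second and third example entries are invented for comparison.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ComplexProperty:
    """A compound parameter broken down into atomic concepts (hypothetical)."""
    label: str
    observed_property: str
    object_of_interest: str
    matrix: Optional[str] = None  # medium in which the object is embedded


parameters = [
    # Example from the text: concentration of nitrate in soil water.
    ComplexProperty("concentration of nitrate in soil water",
                    "concentration", "nitrate", "soil water"),
    # Invented entries for comparison.
    ComplexProperty("concentration of carbon in soil water",
                    "concentration", "carbon", "soil water"),
    ComplexProperty("air temperature", "temperature", "air"),
]


def with_property(items: List[ComplexProperty], prop: str) -> List[str]:
    """Return labels of all parameters sharing the same atomic property."""
    return [p.label for p in items if p.observed_property == prop]


concentrations = with_property(parameters, "concentration")
print(concentrations)
```

Because the property is stored separately from the object of interest, both concentration parameters are found by one query, independent of the substance measured.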

Fig. 4.11 Complex Properties Model as extension of O&M (Leadbetter & Vodden 2016)

A similar approach is used in the Observable Property ontology64 by S. Cox.

4.2.2.2 Conceptual Model used in EnvThes

In EnvThes we followed the approach presented in the Complex Properties Model. Using the same terminology, EnvThes consists of the top concepts Object of Interest, Property, Constraint, Matrix (as a sub-concept of Object of Interest) and Statistical Function (as a sub-concept of Method), plus additional concepts considered relevant for the LTER community such as Research Focus, Infrastructure (including Device), Method and, most importantly, Parameter. Parameter is a compound concept formed by determining an observed property for an object of interest as used by the LTER scientist. The corresponding atomic concepts Object of Interest, Property, Matrix and Device can be used to break down the compound concept as shown in Fig. 4.12. Unit is not part of EnvThes as there are more appropriate vocabularies available (e.g. the QUDT ontology).

Fig. 4.12 Compound versus atomic concepts in EnvThes

64 See http://environment.data.gov.au/def/op

EnvThes encompasses a number of top concepts (e.g. research topics, object of interest). The basic guidelines in establishing the structure were
• parsimony, by reducing the levels of the hierarchy under the top concept to the minimum needed, and
• simplicity, by reducing the use of polyhierarchies (which allow the same concept to be assigned to different parent concepts) to a minimum.
Note that all concept terms are indicated by italics. NT is the tag used in ISO 25964-1 for narrower term, BT for broader term.

Research focus defines the scientific scope of the observations or experiments and encompasses the spatial or temporal scale addressed in the work as well as the thematic focus. The concept research focus is subdivided into scale and research topic.

Scale: This concept addresses spatial (e.g. plot, landscape scale) and temporal (e.g. day, year) scales of research projects.

Research topic: This concept covers disciplines and sub-disciplines and their specific issues and challenges. A scientific discipline is knowledge or wisdom associated with one academic field of study or profession. We ended up with three levels under research topic: (1) The most self-evident approach for assigning terms to the top level was to use distinct disciplines like biology, chemistry, hydrology and so on. (2) The next level is built of well-established sub-disciplines like ecology as a child of biology or atmospheric chemistry as a child of chemistry. (3) The third level has only a few concepts, as it seemed to be needed only in cases where the parent is an umbrella for many popular sub-concepts. This is illustrated by the hierarchical sequence biology => ecology => forest ecology.

Infrastructure deals with physical items as well as software and organisation design. We understand infrastructure as (1) observational and experimental facilities and their related characteristics and classifications, (2) any devices related to research, (3) provided services and (4) the organisation design of the research. Examples: maintenance infrastructure NT snow clearing equipment, data management NT metadata standard.

Method is defined as a way of proceeding or doing something, especially a systematic or regular one (GEMET)65. In the current process the restructuring of method was not addressed; however, there is already a collection of sub-concepts in EnvThes. Examples are vegetation survey, pitfall trap or eddy covariance.

Parameter - This concept was formerly named measure. It has been renamed to parameter (synonymous with variable), as the parameter is always the target of a measure and is therefore the more suitable term. Note that parameter in EnvThes is a compound concept built from an observed property for an object of interest. For example, temperature (the observed property) of air (the object of interest) results in the compound concept air temperature. As temperature can be measured in different media (e.g. water, soil, air), it is necessary to use compound concepts. In order to group parameters, we created several containers related to disciplines (e.g. biological parameter) and media (e.g. water parameter). With this kind of structuring we allowed polyhierarchies under this top concept, as it is quite likely that certain parameters which are, for example, in the biological parameter container will be part of the agricultural parameter container too.

Property - The concept of property is an element of the expression “measurements on properties of objects of interest”. Thus, one property can be associated with several measures on several objects of interest. The property of any object of interest is observed during the act of observation. The difference to parameter is that property is a specific quality, but not a compound term. This means that a property is not described in relation to a specific object of interest. In the process of classifying properties we created three sub-concepts as baskets for different qualities: (1) An observable property can be observed or measured by a single action of observation, field or laboratory measurement, scaling, equipment reading, visual estimation, etc., resulting in a single data value (e.g. NT size, mass, length). (2) A derived property cannot be directly observed, but is derived from sets of observed properties, e.g. temperature and oxygen content to determine oxygen saturation. This is appropriate for any calculated property based on different kinds of measurements (e.g. NT ratio, productivity, frequency). (3) The characteristic is a basket for more general properties of the objects of interest like pattern, composition or configuration, which can be approximated by several derived properties. These can also be qualitative properties, e.g. naturalness, which are assessed by expert decision.

Constraint - This concept is a flat list without any hierarchy. The terms are mainly adjectives, e.g. NT altitudinal, apical, seasonal, biochemical. This term list can be combined with properties or objects of interest.

65 http://www.eionet.europa.eu/gemet/en/concept/13088

Object of interest - The object of interest concept covers the objects that research deals with. These could be simple physical objects or items of ecological importance based on complex interactions of abiotic and biotic ecosystem elements. We therefore set up three baskets for the object of interest: (1) event as a basket for things happening to nature in an uncontrolled way (e.g. drought, flood, fire, storm); (2) process, which contains the dynamic development and cycling of elements in a controlled way driven by biology or environmental conditions (e.g. NT growth, decomposition, nitrogen cycling); (3) entity as a basket for all physical objects of interest. The entity concept is the largest basket and deals with a range of different items. We therefore structured it into the sub-concepts (a) organism, which is any living thing made out of substance (e.g. NT bacteria, plants, fungi), (b) substance, which is any matter having mass and/or energy (important sub-concepts: NT agricultural substance, biological substance, chemical substance, elementary particle, radiation, pollutant), (c) matrix, which is the medium in which the actual object of interest is mixed, embedded, suspended or entailed in some way, such that a process is required to extract or separate it before an observation can be made, and (d) organisational unit, which addresses biological systems (e.g. NT population, community) and spatial terms like region and landscape or the environment as a whole.

Deprecated concept - This concept contains concepts which are no longer in use. Concepts which are not used anymore or have been replaced by an updated term are moved to deprecated and maintain their link to the term that replaces them. This guarantees that no links are broken throughout the lifetime of the controlled vocabulary.
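How such replacement links keep references resolvable can be sketched as follows. This is a minimal illustration, not the EnvThes implementation: the function resolve, the mapping replaced_by and the concept identifiers c10, c20 and c30 are hypothetical.

```python
from typing import Dict


def resolve(concept_id: str, replaced_by: Dict[str, str], max_hops: int = 10) -> str:
    """Follow replacement links until a concept that is still in use is reached."""
    current = concept_id
    for _ in range(max_hops):
        if current not in replaced_by:
            return current  # not deprecated: this is the concept to use
        current = replaced_by[current]
    # Guard against cyclic or implausibly long replacement chains.
    raise ValueError("replacement chain too long or cyclic")


# Hypothetical identifiers: c10 was replaced by c20, which was replaced by c30.
replaced_by = {"c10": "c20", "c20": "c30"}

current_concept = resolve("c10", replaced_by)
print(current_concept)
```

A dataset annotated with the old identifier therefore still leads to a valid concept, even after several rounds of vocabulary updates.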

4.2.2.3 Parameter harmonisation ontology

Although EnvThes provides atomic concepts for each of the compound concepts under the top concept Parameter, the associations between the involved elements cannot easily be specified with a thesaurus in a consistent way. These specific thematic relationships (e.g. observedProperty or adressedFeatureOfInterest) can only be defined in an ontology, and the SKOS relation relatedTerm does not allow such extensions. EnvThes provides the source of the terms and concepts needed, which feed into an ontology depicting the relationships, so that each concept in Parameter can be described by its atomic terms and linked together. This concept correlates to the concept Parameter_Method in SERONTO, which was designed as a similar container to link to parameter and method. This mapping will make it possible to integrate parameter names from different sources and to build the basis for the harmonisation of naming for discovery and reuse. With the restructuring of EnvThes the first steps towards the parameter ontology have been taken. Work on the ontology is planned to start in the upcoming months. This ontology will considerably help in the discovery of parameters, as it allows more detailed searches. For example, all observations dealing with ‘soil’ could be easily discovered.
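The kind of discovery this would enable can be sketched with plain Python tuples standing in for ontology statements. The sketch is an assumption-laden illustration: the parameter labels, the relation hasMatrix and the function parameters_about are hypothetical, and a real implementation would query the ontology rather than a list.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (parameter, relation, atomic concept)

# Hypothetical statements linking compound parameters to atomic concepts.
triples: List[Triple] = [
    ("soil temperature", "observedProperty", "temperature"),
    ("soil temperature", "adressedFeatureOfInterest", "soil"),
    ("nitrate in lysimeter water", "observedProperty", "concentration"),
    ("nitrate in lysimeter water", "adressedFeatureOfInterest", "nitrate"),
    ("nitrate in lysimeter water", "hasMatrix", "soil water"),
    ("air temperature", "observedProperty", "temperature"),
    ("air temperature", "adressedFeatureOfInterest", "air"),
]


def parameters_about(term: str, statements: List[Triple]) -> List[str]:
    """Return parameters with at least one atomic concept mentioning `term`."""
    hits = {subject for subject, _, concept in statements
            if term in concept.split()}
    return sorted(hits)


soil_parameters = parameters_about("soil", triples)
print(soil_parameters)
```

Note that the lysimeter parameter is found although its label never mentions soil, because its matrix is linked to soil water. This is the benefit of describing parameters by their atomic terms rather than by name only.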

4.2.3 Semantic Repositories

In recent years, semantic resources such as thesauri and ontologies have increasingly been used for a diverse set of applications. These range from the provision of consistent terminology for data entry and management to the application of logic inference as a means of reasoning over large-scale datasets, serving purposes of discovery and the derivation of new facts. This increasing use has led to an explosion of available resources in a variety of fields, ranging from early pioneers such as the bio-medical domain to the humanities, which have only recently started to embrace large-scale digital means for research. Besides the continuous emergence of new resources, existing ones are often under constant development: they are updated, merged with others, partly deprecated, etc. The result is an increasingly complex landscape which sometimes makes it hard to find the most appropriate resource and its most recent version. A variety of online and offline tools exist to create, edit, manage and publish such resources. Examples of editors are the open source platforms (Web-)Protégé (available offline66 and online67) and VocBench68, or the commercial TopBraid Composer69 software. Especially the online versions of these tools enable geographically separated persons to manage resources collaboratively and often also provide outlets for publishing them on the Web. Making resources available this way enables interested parties to search them via engines such as Google, although usually only in limited form, since in most cases only information about the resource itself, such as title and description, is indexed by the search engines. Individual concepts defined by these resources thus often remain hidden from Web searches and, if not, they often get lost amongst a multitude of related search results. It is moreover very difficult, if not impossible, to cross-check different resources for overlaps or disagreements in a reasonable manner using Web search only.

Another important aspect is that many semantic resources are related to each other, for example via the scientific domain they cover, and hosting them only via separate Web outlets would potentially obscure their mutual relationships. Another set of tools has thus been conceived to enable access to semantic resources for search, re-use and analysis: dedicated ontology libraries, called semantic repositories in this deliverable. As outlined by d’Aquin and Noy (2012), the purpose of such repositories is not only to enable searching for concepts (this has, for example, already been covered by dedicated semantic search engines such as Swoogle70) but also to offer explicit collections of semantic resources, addressing the fact that many resources offer complementary content, together covering different aspects of a specific domain. This curation aspect in particular represents one fundamental difference between existing semantic repositories: while some are strictly curated by the maintainers, who alone decide which resources should be included, others allow registered users to upload resources themselves. According to d’Aquin and Noy (2012), the former thus focus on providing a reference set of resources while the latter seek to foster the publication and dissemination of existing resources. Another distinctive feature is the extent of services offered on top of the hosted collection of semantic resources, such as different means to search and/or browse the collection and the availability of APIs to access the underlying content.

66 https://protege.stanford.edu/, retr. May 3rd, 2018
67 https://webprotege.stanford.edu, retr. May 3rd, 2018
68 http://vocbench.uniroma2.it/, retr. May 3rd, 2018
69 https://www.topquadrant.com/tools/modeling-topbraid-composer-standard-edition/, retr. May 3rd, 2018

Structure of contemporary semantic repositories

Fig. 4.13 provides an overview of the architecture of a contemporary semantic repository. The top left section represents the catalogue component, consisting of a database holding information about the individual resources to be hosted by the repository. This metadata includes attributes such as title, description, version, etc. and, for advanced repositories, information about the location from which the resource can be ingested and the format (OWL, SKOS, etc.) in which it is available there. The reason for providing download locations instead of directly uploading files is that the remote location can then be frequently accessed and checked for updated versions of the resource. A dedicated service performs these lookups and downloads the resource from the stated location in order to store its content in a dedicated database. This step potentially includes data transformations, since resources can in principle be made available in a variety of formats. Once transformed into the internal representation, the hosted resources can be searched together, which is usually accomplished by creating a central search index. Besides offering basic search facilities, contemporary semantic repositories usually provide visualizations of the hierarchies of the hosted resources to support browsing, as well as dedicated APIs to harvest them. The most recent additions to these services include automated mappings between the concepts of the hosted resources based on common ID or label, as well as semantic annotation services for tagging full text with matching concepts or recommending the best-suited resource based on common terminology.
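The label-based mapping mentioned above can be sketched in Python as follows. This is a simplified illustration and not the algorithm of any particular repository: the functions normalise and map_by_label, the vocabularies and the concept identifiers are hypothetical, and labels are assumed to match when they are identical after lower-casing and whitespace normalisation.

```python
from typing import Dict, List, Tuple


def normalise(label: str) -> str:
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def map_by_label(vocab_a: Dict[str, str],
                 vocab_b: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return pairs of concept IDs whose normalised labels are identical."""
    index: Dict[str, List[str]] = {}
    for concept_id, label in vocab_b.items():
        index.setdefault(normalise(label), []).append(concept_id)
    return [(a_id, b_id)
            for a_id, label in vocab_a.items()
            for b_id in index.get(normalise(label), [])]


# Hypothetical vocabularies: concept ID -> preferred label.
vocab_a = {"a:1": "Air temperature", "a:2": "Soil moisture"}
vocab_b = {"b:7": "air  temperature", "b:8": "precipitation"}

mappings = map_by_label(vocab_a, vocab_b)
print(mappings)
```

Building the index over one vocabulary first avoids comparing every pair of labels, which matters once a repository hosts many large resources.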

70 http://swoogle.umbc.edu/2006/, retr. May 3rd, 2018

Fig. 4.13: Structure of Semantic Repositories

One example of a contemporary approach to semantic repositories is NCBO BioPortal71, described by Whetzel et al. (2011). This repository framework provides all the features listed in Fig. 4.13, and its components are made available in pre-configured form as a virtual appliance72 which can be deployed on various virtual machine platforms such as VMware or Amazon AWS. BioPortal is currently the largest semantic repository, hosting close to 700 different resources. Its technology has been re-used in a number of different contexts, one example being AgroPortal, described by Jonquet et al. (2016), which provides a selection of semantic resources for the agronomy domain. Besides BioPortal technology, another important project is the EMBL-EBI Ontology Lookup Service73, which in contrast to BioPortal maintains a strictly curated set of resources but otherwise offers a mostly similar set of functionalities. EBI-OLS is described by Jupp et al. (2015) and made available via a dedicated GitHub repository74.

71 https://bioportal.bioontology.org/, retr. May 3rd, 2018
72 https://www.bioontology.org/wiki/index.php/Category:NCBO_Virtual_Appliance, retr. May 3rd, 2018
73 https://www.ebi.ac.uk/ols/index, retr. May 3rd, 2018
74 https://github.com/EBISPOT/OLS, retr. May 3rd, 2018

Different aspects of storing and publishing resources via semantic repositories

As outlined above, the main advantages of publishing semantic resources via dedicated repositories are centralized search across their content, sophisticated means to browse their structure and the potential for automated alignments between them and other resources hosted in the same repository. This is complemented by emerging services such as the automatic annotation of full text with matching concepts, potentially supporting the re-use of the provided concepts as semantic tags. Using the APIs provided for machine access, recent developments such as the EUDAT Semantic Lookup Service Aggregator, described by Goldfarb and Le Franc (2017), seek to aggregate the content of different repositories. This enables centralized search even across different repositories and moreover provides an additional layer for mutual alignment between their resources. The latter can be used for contextualisation and enrichment during the development of new semantic resources, as described in ECOPOTENTIAL Deliverable 5.6 (Magagna et al. 2018). Seen from the context of individual organizations, existing repository frameworks could thus be used to provide a central repository for the representations of the different knowledge organization systems (KOS) used there. Alignment services such as those outlined above could subsequently be used to find overlaps between the different KOS, potentially resolving redundancies and ambiguities.

Potential pitfalls exist as well, however, especially with respect to publishing resources via repositories. One main problem remains how to keep them up to date and how to prevent concurrent versions of one resource from being hosted in different repositories. Another issue arises from the common approach of using URIs as identifiers for individual concepts. The main question in this regard is where such URIs should resolve to if a resource is hosted both in its own outlet and via one or more repositories. Additional questions arise when considering the establishment of dedicated semantic repositories for new domains. Should there be separate repositories for the different scientific fields at all, or would it be better in the long run to maintain one single global repository for all of them? Reasons for the former would be that different fields have different requirements which would be neglected by a single solution, as well as issues of control over the infrastructure and questions regarding responsibility for maintenance. Reasons for the latter would be the avoidance of redundancies and concurrent versions, central access and better interoperability, as well as potential cross-domain re-use and other synergies.

5 Conclusions

The eLTER Information System aims to provide a framework for the integration and provision of observation data from the different components of the LTER-Europe network. Building on standards and service interfaces, it aims to implement the FAIR principles for open data sharing. The framework is an endeavour of LTER-Europe which will be undertaken with the upcoming eLTER Research Infrastructure. Nevertheless, the different components of the eLTER Information System are still in development. In addition to the development, linking to global data infrastructures such as GEOSS and DataONE is being tested and prepared. The development of the eLTER Information System is an important step toward the implementation of tools and services enabling researchers and users to easily document and share data within and beyond the network. Besides the technical challenges, the cultural and social aspects of data sharing also need to be taken into account (see Vanderbilt et al. 2015, Vanderbilt & Gaiser 2017). While there is agreement on open data in principle on the global scale, the implementation of common data sharing on the local level is still an issue in many of the member networks or single sites. This is also the result of different general data policies and funding regulations applied in the different countries. To enable this vision, eLTER is addressing not only the technological aspects of data publishing and sharing but also the social aspects. Common guidelines and governance will be developed in order to ensure the sustainability of data provision and updating.

5.1 Metadata

Important aspects of data reusability are discoverability and interoperability. Whereas the former is easier to address, the latter poses many challenges. Even for metadata, harmonisation between different communities is not easy. Different requirements, e.g. on the thematic scope of the metadata, limit the integration within a single system. In addition, the application of tools also reduces the reusability of metadata. The aim of the project was to provide a common set of information elements which are needed for the discovery of the datasets. Within EML, different levels of description and usage of the metadata elements are defined, which encompass information, download and integration (see Tab. 5.1, footnote 75).

75 See http://im.lternet.edu/node/1019

Tab. 5.1 EML Data Package Completeness levels

| Level | EML metadata content | Uses |
| Information | title; abstract; personnel*; contacts; publication date; coverage; keywords*; project description*; publisher; access and use statements* | 1. Information about data only, e.g. LTER Type II data; 2. Searches by time, location, taxonomy; 3. Data citation. Related LTER-2004 levels: “Identification”, “Discovery” |
| Information + Download | methods or protocol-link; data description: includes column names, definitions & units*; physical description; download URL at entity-level* | 1. Data are available, but the user may need help with interpretation. Related LTER-2004 levels: “Evaluation”, “Access”, “Integration” (with caveats) |
| Download + Integration | metadata congruent with data* | 1. Integration, workflows, further automated processing; 2. Query applications; 3. Contribute to Network databases. Related LTER-2004 level: “Integration” (intended) |
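To illustrate how the completeness levels of Tab. 5.1 could be checked automatically, the following Python sketch classifies a metadata record by the elements it contains. It rests on two assumptions that are not stated in the table: that every listed element is required for its level, and that the levels are cumulative. The element names are simplified keys and the function completeness_level is hypothetical; this is not part of EML or of any EML tool.

```python
from typing import Iterable, Optional

# Simplified element names per level, following Tab. 5.1 (assumed mandatory).
INFORMATION = {"title", "abstract", "personnel", "contacts", "publication date",
               "coverage", "keywords", "project description", "publisher",
               "access and use statements"}
DOWNLOAD = {"methods or protocol-link", "data description",
            "physical description", "download URL"}
INTEGRATION = {"metadata congruent with data"}


def completeness_level(elements: Iterable[str]) -> Optional[str]:
    """Return the highest completeness level reached, or None if below 'Information'."""
    present = set(elements)
    if not INFORMATION <= present:
        return None
    if not DOWNLOAD <= present:
        return "Information"
    if not INTEGRATION <= present:
        return "Information + Download"
    return "Download + Integration"


level = completeness_level(INFORMATION | DOWNLOAD)
print(level)
```

Such a check could, for example, be run when metadata are harvested, so that records supporting only discovery can be distinguished from those that also support download or integration.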

Elements describing the ‘information’ level can be defined as metadata for discoverability or findability according to the FAIR principles. This is linked to the main metadata elements in the dataset and data product metadata documented in DEIMS-SDR. These models also include important elements for accessibility or downloadability, like the online distribution link and service type. In order to ensure compatibility across the different metadata standards and the integration of metadata in GeoNetwork, the ‘integration level’ metadata are not included in the core set of the dataset metadata elements. Nevertheless, the metadata elements are included in DEIMS-SDR:DataSource, implementing the main EML elements describing the structure of the metadata.

One downside to using SensorML and O&M in conjunction with SOS is that it is only possible to reach level 2 of the FAIR principle implementation, which can be seen as each data object having a unique identifier, where a data object can be a sensing object, a feature, an observed property, a sample or an observation. Higher levels are not achievable, as the individual metadata elements cannot recursively be data objects in their own right. SensorML also has the related problem of being “provider centric” rather than “user centric”76, in that it allows providers to easily describe their sensors, algorithms and procedures, but does not give users an easy way of discovering or accessing the information. These downsides are, however, balanced by the performance of the software it runs on for large time series data sets, when compared to data models that are better in design but lack software that can match the performance-to-cost ratio.

76 https://www.w3.org/TR/vocab-ssn/#TheSSNontology

Only a few software solutions are currently available77 for implementing the above in practice. The 52°North SOS has been chosen as the software base as it appears to be the most actively developed and documented of the open source options, with a strong and wide range of SOS, SensorML and O&M standard implementations. In addition, the site documentation in DEIMS-SDR provides a global service which can also be used by other observation networks beyond LTER-Europe and ILTER. Standard information exchange based on the INSPIRE EF data specification and OGC services provides a valuable tool for further increasing interoperability and reusability.

5.2 Data format

In order to support automated data flows and easy integration of data in workflows, not only the discoverability but also the semantic and syntactic interoperability of data needs to be guaranteed. This can be ensured by the application of strict and common protocols for data collection and reporting. One of the challenges addressed in LTER Europe is the broad scientific scope of the observations. Whereas some of the sites follow strict protocols (e.g. UNECE ICP Integrated Monitoring Programme), others focus more on specific scientific research questions. In addition, the different usages of the data require a broad range of spatial, temporal or thematic aggregations. It is therefore difficult to provide one comprehensive data model for environmental observations. With Observations and Measurements (OGC 2013) and SERONTO (Schentz et al. 2011, 2013), common semantic models for the depiction of environmental data exist. With the 52°North SOS software suite, a reference implementation for time series data provision exists. Within the eLTER project, SOS servers provide a core element of the central data node (see chapter 1.2), enabling syntactically harmonised access to the data. Because the eLTER data reporting format follows the same basic concepts for identifying the single observation value, the integration of file based information into automated workflows should be possible. The eLTER data reporting format allows the integration of record level metadata (see Fig. 5.1), which completes the information on the data sources and thus enhances the interoperability and reusability of the data.
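To illustrate what syntactically harmonised access through an SOS server can look like, the following minimal sketch builds a GetObservation request using the key-value-pair (KVP) binding of SOS 2.0. The endpoint URL, the offering name and the observed property URI are hypothetical placeholders, and it is an assumption that the addressed server has the KVP binding enabled.

```python
from urllib.parse import urlencode


def build_get_observation_url(endpoint, offering, observed_property, start, end):
    """Build a SOS 2.0 KVP GetObservation request URL."""
    params = {
        "service": "SOS",
        "version": "2.0.0",
        "request": "GetObservation",
        "offering": offering,
        "observedProperty": observed_property,
        # Filter on the phenomenon time, given as an ISO 8601 interval.
        "temporalFilter": f"om:phenomenonTime,{start}/{end}",
    }
    return f"{endpoint}?{urlencode(params)}"


url = build_get_observation_url(
    "https://sos.example.org/service",           # hypothetical endpoint
    "AIR_TEMPERATURE_OFFERING",                  # hypothetical offering
    "http://vocabs.example.org/airTemperature",  # hypothetical property URI
    "2017-03-01T00:00:00Z",
    "2017-03-31T23:59:59Z",
)
print(url)
```

The same request parameters apply regardless of which site or sensor is queried, which is what makes the access pattern reusable in automated workflows.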

77 http://www.opengeospatial.org/resource/products/byspec - searched using “Sensor Observation Service Interface Standard v.2.0”


Fig. 5.1 Metadata levels for LTER Data Reporting

Nevertheless, the formulation of a common semantic model for the observations is difficult if the scope and use of the data are not clearly specified. This aspect is addressed by the different research infrastructures (RI), which aim to provide scientifically sound and quality controlled data for specific purposes. With the formulation of clear science cases, the definition of the required data and the resulting data products becomes clearer, and automated workflows for quality control and for the integration of data into data products can be implemented. ICOS, with its implemented workflows to the CarbonPortal, is one of the recent best practice examples. A similar approach is planned with the establishment of the eLTER RI for the long-term ecosystem domain. This results in the following requirements for data standardisation:

• Clear formulation of the data requirements for a scientific use case
• Clear formulation of the potential data use (also addressing possible temporal and spatial resolutions)
• Clear formulation of the data product specification

These requirements lead to the definition of the underlying workflows for data quality control and data integration needed to generate the data products. This can be seen as the next step in the implementation of the eLTER Information System. The current work focuses on the mobilisation of legacy data from the LTER network. With the Central Data Node based on the SOS data model and the eLTER Data Reporting Format for any in-situ based observations, an important step towards its implementation has been taken.

5.3 Common semantics

The vocabulary work on EnvThes during the last year concentrated on enhancing the quality of the corpus. This included the following actions:

• The overall semantics was aligned to the O&M model and to the need for semantic harmonisation with other parameter vocabularies, which led to a redefinition of the top concepts
• The structure was refined into a simpler hierarchy with fewer levels
• All concepts were reviewed and repositioned according to the adopted design model
• All concepts were quality checked according to the EnvThes design principles
• The vocabulary was extended with more than 500 concepts extracted from MS academic keywords
• Concepts were enriched with definitions and mapping relations using alignment techniques

As of the end of May 2018, EnvThes version 2 is available, which should be free from structural inconsistencies and syntactic errors. All URIs of concepts with a foreign namespace (such as USLterCV_) were replaced by an incremental number, while the link to the source is now indicated in the exact match relation. This also means that all old concepts with the former URI are placed in the deprecated concept container, with exact match links to the corresponding current concepts. A clean and coherent structure and semantics is a prerequisite for the next series of improvements, which will focus on semantic mappings to other vocabularies and on translations of the terms.

In addition, a separate scheme will be prepared for Reference Lists. These are pre-existing controlled lists that are often used by the LTER community but are not available elsewhere on the web as referable objects. They can be Excel lists (like the EUNIS habitat classification78), map legend entries, or lists used in documents (like CORINE land cover79). By converting them into SKOS concepts (with stable URIs), they can also be reused by external users. The next effort will be invested in the creation of the parameter harmonisation ontology as described in Chap. 4.2.2.3, which will considerably enhance the discoverability of parameters.
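The renumbering of concept URIs described above can be sketched as follows. This is only an illustration of the idea, not the procedure actually used for EnvThes: the base URI, the starting number and the example concept URIs are hypothetical, and triples are represented as plain Python tuples instead of using an RDF library.

```python
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"


def renumber_concepts(old_uris, base, start):
    """Assign incremental numeric URIs and link each old URI to its new one.

    Returns the old-to-new mapping and a list of (subject, predicate, object)
    triples stating that the deprecated concept exactly matches the new one.
    """
    mapping = {}
    triples = []
    for offset, old_uri in enumerate(old_uris):
        new_uri = f"{base}{start + offset}"
        mapping[old_uri] = new_uri
        triples.append((old_uri, SKOS_EXACT_MATCH, new_uri))
    return mapping, triples


# Hypothetical URIs, for illustration only.
old = [
    "http://vocabs.example.org/EnvThes/USLterCV_7",
    "http://vocabs.example.org/EnvThes/USLterCV_22",
]
mapping, triples = renumber_concepts(old, "http://vocabs.example.org/EnvThes/", 30000)
for triple in triples:
    print(triple)
```

Keeping the old URI as the subject of an exact match statement means that existing annotations pointing to the former URI can still be resolved to the current concept.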

78 https://www.eea.europa.eu/data-and-maps/data/eunis-habitat-classification

79 http://www.umweltbundesamt.at/fileadmin/site/umweltthemen/raumplanung/1_flaechennutzung/corine/CORINE_Nomenklatur.pdf


Fig. 5.2 Conceptualisation of observation types

Another important activity was the establishment of cooperation with other communities curating semantic resources of relevance for the LTER community. The facilitator of EnvThes participated in several semantic workshops and events organised by LifeWatch, EUDAT and RDA VSSIG (Vocabulary and Semantic Services Interest Group) to foster collaboration. This led to the formation of the task group “Harmonize the conceptualization of observation types” within VSSIG. Representatives of ILTER, PANGAEA, GFBio, LifeWatch, ICOS, AnaEE, AquaDiva, TIB, BODC, ENVO, BioPortal and EPIC are involved. The goal is the creation of an RDA endorsed Working Group to develop best practices and a generally accepted model for the conceptualisation of scientific observation and measurement types, possibly also including methods and devices, by using agreed terminologies (compare with Fig. 5.2). This will allow better mappings between the vocabularies of the involved communities and should thus result in improved interoperability for data discovery and data integration across the diverse sources. Another focus will be the collaboration with LifeWatch in developing a semantic repository for ontologies, thesauri (including EnvThes) and reference lists relevant for biodiversity and ecosystem research. A common domain specific portal of semantic resources allows their better integration into the workflows of metadata annotation (e.g. DEIMS-SDR17) and discovery. This fosters semantic interoperability not only on the metadata level but also on the data level (Fiore et al. 2017).


References

Bowers, S., Cao, H., Schildhauer, M., Jones, M., Leinfelder, M. and O’Brien, M. (2010) A semantic annotation framework for retrieving and analyzing observational datasets. In: Proceedings of the third workshop on Exploiting semantic annotations in information retrieval, pp. 31–32.
Compton, M. et al. (2012) The SSN ontology of the W3C semantic sensor network incubator group. Web Semantics: Science, Services and Agents on the World Wide Web, vol. 17, pp. 25–32.
Cox, S. J. (2017) Ontology for observations and sampling features, with alignments to existing models. Semantic Web, vol. 8, no. 3, pp. 453–470.
Fiore, N., Magagna, B., Goldfarb, D. (2017) EcoPortal: a proposition for a semantic repository dedicated to ecology and biodiversity. In: A. Algergawy, N. Karam, F. Klan, C. Jonquet: Proc. of the 2nd International Workshop on Semantics for Biodiversity co-located with 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, October 22nd, 2017. CEUR Workshop Proceedings 1933.
D’Aquin, M., Noy, N. F. (2012) Where to publish and find ontologies? A survey of ontology libraries. Web Semantics: Science, Services and Agents on the World Wide Web, vol. 11, pp. 96–111.
Goldfarb, D., Le Franc, Y. (2017) Enhancing the Discoverability and Interoperability of Multi-Disciplinary Semantic Repositories. In: A. Algergawy, N. Karam, F. Klan, C. Jonquet: Proc. of the 2nd International Workshop on Semantics for Biodiversity co-located with 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, October 22nd, 2017. CEUR Workshop Proceedings 1933.
Haberl, H., Winiwarter, V., Andersson, K., Ayres, R., Boone, C., Castillo, A., et al. (2006) From LTER to LTSER: conceptualizing the socioeconomic dimension of long-term socioecological research. Ecol. Soc. 11.
IEEE Standard Computer Dictionary (1991) A Compilation of IEEE Standard Computer Glossaries. IEEE Std 610, pp. 1–217.
Horsburgh, J.S., Aufdenkampe, A.K., Mayorga, E., Lehnert, K.A., Hsu, L., Song, L., Jones, A.S., Damiano, S.G., Tarboton, D.G., Valentine, D., Zaslavsky, I., Whitenack, T. (2016) Observations Data Model 2: A community information model for spatially discrete Earth observations. Environmental Modelling & Software 79, pp. 55–74, ISSN 1364-8152. [https://doi.org/10.1016/j.envsoft.2016.01.010]
Jonquet, C., Toulet, A., Arnaud, E., Aubin, S., Dzalé-Yeumo, E., Emonet, V., ... & Larmande, P. (2016) Reusing the NCBO BioPortal technology for agronomy to build AgroPortal. In: ICBO: International Conference on Biomedical Ontologies (No. D203).
Jupp, S. et al. (2015) A new Ontology Lookup Service at EMBL-EBI. In: Malone, J. et al. (eds.) Proceedings of SWAT4LS International Conference 2015.
Kliment, T. & Oggioni, A. (2011) Metadatabase: EnvEurope Metadata specification for Dataset Level. EnvEurope (LIFE08 ENV/IT/000339) EnvEurope Project Report PD.A1.1.4, 87pp. [http://www.enveurope.eu/misc/PD_1_1_4_Kliment_Metadatabase_201112_final_v1.0.pdf]
Leadbetter, A.M., Vodden, P.N. (2016) Semantic linking of complex properties, monitoring processes and facilities in web-based representations of the environment. International Journal of Digital Earth 9(3), pp. 300–324.
Madin, J., Bowers, S., Schildhauer, M., Krivov, S., Pennington, D. and Villa, F. (2007) An ontology for describing and synthesizing ecological observation data. Ecological Informatics, vol. 2, no. 3, pp. 279–296.
Magagna, B., Peterseil, J., Kirchhof, S., Bosch, S. (2018) Harmonised delivery of data. ECOPOTENTIAL Project (H2020 GANr. 641762) Deliverable D5.6, 125pp. [http://www.ECOPOTENTIAL-project.eu/images/ECOPOTENTIAL/documents/D5.6.pdf]
Michener, W.K., Brunt, J.W., Helly, J.J., Kirchner, Th.B., Stafford, S.G. (1997) Nongeospatial metadata for the Ecological Sciences. Ecological Applications 7:330–342.
Mirtl, M. (2010) Introducing the Next Generation of Ecosystem Research in Europe: LTER-Europe’s Multi-Functional and Multi-Scale Approach. In: Müller, F., Baessler, C., Schubert, H., Klotz, S. (eds) Long-Term Ecological Research. Springer, Dordrecht.
Mirtl, M., Borer, E. T., Djukic, I., Forsius, M., Haubold, H., Hugo, W., Jourdan, J., Lindenmayer, D., McDowell, W.H., Muraoka, H., Orenstein, D.E., Pauw, J.C., Peterseil, J., Shibata, H., Wohner, C., Yu, C., Haase, P. (2018) Genesis, goals and achievements of Long-Term Ecological Research at the global scale: A critical review of ILTER and future directions. Science of The Total Environment 626, pp. 1439–1462, ISSN 0048-9697. [https://doi.org/10.1016/j.scitotenv.2017.12.001]
OGC (2007) OGC Sensor Web Enablement: Overview and High Level Architecture. OGC White Paper. Ref.Nr. OGC 07-165. 14pp.
OGC (2013) Geographic Information – Observations and measurements. Ref.Nr. OGC 10-004r3. 48pp.
OGC (2014) OGC Best Practice for Sensor Web Enablement. Lightweight SOS Profile for Stationary In-situ Sensors. Ref.Nr. 11-169r1. 35pp. [http://www.opengis.net/doc/BP/sos-profile-in-situ/1.0]
OGC (2016) Time Series Profile of Observations and Measurements. Ref.Nr. 15-043r3. 89pp. [http://www.opengis.net/doc/IS/timeseries-profile-om/1.0]
Oggioni, A., Carrara, P., Kliment, T., Peterseil, J., Schentz, H. (2012) Monitoring of Environmental Status through Long Term Series: Data Management System in the EnvEurope Project. EnviroInfo 2012, Shaker Verlag, Aachen.
Oggioni, A., Wohner, C. et al. (2018) D8.2 Software service prototype. eLTER Project Deliverable D8.2 [#link].
Oggioni, A., Wohner, C., Watkins, J., Ciar, D., Schentz, H., Lanucara, S., Minic, V., Skribic, S., Bodroski, Z., Kunkel, R., Sorg, J., Kliment, T., Sanchez, F., Magagna, B., Peterseil, J. (2017) D3.1 eLTER State of the Art and Requirements. eLTER Project Deliverable D3.1 [http://www.lter-europe.net/document-archive/elter-h2020-project-files/d3-1-data-integration, 24.05.2018].
Porter, J.H. (2010) A controlled vocabulary for LTER datasets. [http://databits.lternet.edu/spring-2010/controlled-vocabulary-lter-datasets]
Poursanidis, D., Peterseil, J., Wohner, C., Chrysoulakis, N., Wetzel, F., Alonso, J., Castro, P., Beierkuhnlein, C., Bernd, A., Zabala, A., Masó, J., Domingo, C., Vetaas, O., Bargmann, T., Bosch, S. (2017) D5.2 Metadata for pre-existing datasets. ECOPOTENTIAL Project (H2020 GANr. 641762) Deliverable, 116pp. [http://www.ECOPOTENTIAL-project.eu/images/ECOPOTENTIAL/documents/D5.2.pdf]
Royal Society (Great Britain), Science Policy Centre (2012) Science as an Open Enterprise. Royal Society.
Schentz, H., Peterseil, J. & Bertrand, N. (2013) EnvThes - interlinked thesaurus for long term ecological research, monitoring, and experiments. Proceedings EnviroInfo 2013: Environmental Informatics and Renewable Energies. Shaker Verlag, Aachen.
Schentz, H., Peterseil, J., Magagna, B. & Mirtl, M. (2011) Semantics in Ecosystem Research and Monitoring. Proceedings EnviroInfo 2011: Innovations in Sharing Environmental Observation and Information. Shaker Verlag, Aachen.
van der Werf, D.C., Adamescu, M., Ayromlou, M., Bertrand, N., Borovec, J., Boussard, H., Cazacu, C., van Daele, T., Datcu, S., Frenzel, M., Hammen, V., Karasti, H., Kertesz, M., Kuitunen, P., Lane, M., Lieskovsky, J., Magagna, B., Peterseil, J., Rennie, S., Schentz, H., Schleidt, K., Tuominen, L. (2008) SERONTO: a Socio-Ecological Research and Observation oNTOlogy. In: Weitzman, A.L., and Belbin, L. (Eds.). Proceedings of TDWG (2008), Fremantle, Australia.
Vanderbilt, K., and Gaiser, E. (2017) The International Long Term Ecological Research Network: a platform for collaboration. Ecosphere 8(2):e01697. 10.1002/ecs2.1697
Vanderbilt, K., Porter, J.H., Lu, S.-S., Bertrand, N., Blankman, D., Guo, X., He, H., Henshaw, D., Jeong, K., Kim, E.-S., Lin, C.-C., O'Brien, M., Osawa, T., Ó Tuama, É., Su, W., Yang, H. (2017) A prototype system for multilingual data discovery of International Long-Term Ecological Research (ILTER) Network data. Ecological Informatics 40:93–101. [http://dx.doi.org/10.1016/j.ecoinf.2016.11.011]
Vanderbilt, K., Lin, Ch.Ch., Lu, Sh-Sh., Kassim, A.R., He, H., Guo, X., San Gil, I., Blankman, D. & Porter, J. (2015) Fostering ecological data sharing: collaborations in the international Long Term Ecological Research Network. Ecosphere (10) Article 204, 18pp.
Vanderbilt, K.L., Blankman, D., Guo, X., He, H., Lin, Ch-Ch., Lu, S.-S., Ogawa, A., Ó Tuama, É., Schentz, H., Su, W. (2010) A multilingual metadata catalog for the ILTER: Issues and approaches. Ecological Informatics 5:187–193.
Veen, L. E., van Reenen, G. B. A., Sluiter, F. P., van Loon, E. E. and Bouten, W. (2012) A semantically integrated, user-friendly data model for species observation data. Ecological Informatics, vol. 8, pp. 1–9.
Watkins, J., Ciar, D., Wohner, C., Peterseil, J., Schentz, H., Oggioni, A., Lanucara, S., Minic, V., Skribic, S., Bodroski, Z., Kunkel, R., Sorg, J. (2017) D8.1 eLTER Information Architecture Report. eLTER Project Deliverable D8.1 [http://www.lter-europe.net/document-archive/elter-h2020-project-files/d8-1-it-design, 24.05.2018].
Whetzel, P. L., Noy, N. F., Shah, N. H., Alexander, P. R., Nyulas, C., Tudorache, T., & Musen, M. A. (2011) BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Research 39(suppl 2), W541–W545.
Wieczorek, J., Bloom, D., Guralnick, R., Blum, S., Döring, M., et al. (2012) Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLOS ONE 7(1): e29715. [https://doi.org/10.1371/journal.pone.0029715]
Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L.B., Bourne, Ph.E., Bouwman, J., Brookes, A.J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, Ch.T., Finkers, R., Gonzalez-Beltran, A., Gray, A. J.G., Groth, P., Goble, K., Grethe, J. S., Heringa, J., ’t Hoen, P. A.C, Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher, S. J., Martone, M. E., Mons, A., Packer, A. L., Persson, B., Rocca-Serra, Ph., Roos, M., van Schaik, R., Sansone, S.-A., Schultes, E., Sengstag, Th., Slater, T., Strawn, G., Swertz, M. A., Thompson, M., van der Lei, J., van Mulligen, E., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K., Zhao, J., Mons, B. (2016) The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3 [2016/03/15/online; http://dx.doi.org/10.1038/sdata.2016.18]


6 Annexes


6.1 Annex A - Field Specification for data reporting

Authors: Johannes Peterseil & Christoph Wohner (EAA)

6.1.1 Introduction

The following document provides a short reference to the columns used in the reporting template (basic and extended). When using Microsoft Excel as reporting file format, the different information blocks (e.g. station, method, data, and references) are tables within one spreadsheet.

ZOEBELBODEN_VEG_SPECCOVER_2015_V20170315.xls

• STATION
• METHOD
• DATA
• Ref_STYPE
• etc.

If text formats (e.g. csv or txt) are used, the information blocks (e.g. station, method, data, references) are provided in separate files, which are zipped into one archive:

ZOEBELBODEN_VEG_SPECCOVER_2015_V20170315.zip

• STATION.CSV
• METHOD.CSV
• REFERENCE.CSV
• AT003_ZOEBELBODEN_VEGETATION_2015_V20170315.CSV

The different sections are described in the following. The list of fields follows the extended data reporting file using the Microsoft Excel template. The fields used in the basic data reporting template are marked in bold and with an asterisk (*) in the column Field name.
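The packaging of text based information blocks into one archive can be sketched as follows. This is a minimal sketch under stated assumptions: the semicolon as column delimiter, the reduced set of columns and the column names of REFERENCE.CSV are chosen for illustration only and are not prescribed by the template.

```python
import csv
import io
import zipfile


def package_report(blocks):
    """Write each information block to a CSV file inside one zip archive.

    `blocks` maps a file name to a list of rows (first row = header).
    Returns the archive content as bytes.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_name, rows in blocks.items():
            text = io.StringIO()
            # Assumption: ';' as delimiter, so that decimal commas
            # (e.g. 7,88749) do not collide with the column separator.
            csv.writer(text, delimiter=";", lineterminator="\n").writerows(rows)
            archive.writestr(file_name, text.getvalue().encode("utf-8"))
    return buffer.getvalue()


blocks = {
    "STATION.CSV": [["SCODE", "STYPE"], ["IP1", "PT"]],
    "METHOD.CSV": [["METHOD_CODE", "SAMPLING"], ["ZOE_IM_VEG", "Random sampling"]],
    # Hypothetical column names for the reference block.
    "REFERENCE.CSV": [["LIST", "CODE", "NAME"], ["IM", "COVE_F", "species cover field layer"]],
    "AT003_ZOEBELBODEN_VEGETATION_2015_V20170315.CSV": [
        ["SCODE", "TIME", "SUBST", "VALUE", "UNIT"],
        ["IP1", "2015", "COVE_F", "25", "%"],
    ],
}
archive_bytes = package_report(blocks)
```

Writing the returned bytes to a file named like the example archive above yields one self-contained delivery per dataset.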

6.1.2 Field description

6.1.2.1 Station

Definition: A station is an observation entity within a LTER Site or LTSER Platform. Station in this respect is synonymous with plot, observation location, sensor location, etc. and is defined by a location, elevation and installation height (if relevant).


Basic information about the stations is provided in the table STATION. If additional fields are needed, e.g. habitat type for vegetation data, optional additional fields (columns) can be created. In general, stations are not described in DEIMS-SDR. In this case, additional information needs to be provided with the data. Stations can also be uploaded as an additional data table.

x … mandatory o … optional c … conditional

Field name: SITE_CODE
Description: Site code – as the reference to the documentation of the LTER site and LTSER Platform in DEIMS. Provide either the Site code (e.g. LTER_EU_AT_003) or the Site-UUID (e.g. https://data.lter-europe.net/deims/site/0ce0d289-9ef9-4232-a981-8f34869db76d). Condition: if more than one site is referenced in the data, the site identification needs to be provided.
Example: LTER_EU_AT_003
M D V: o o o

Field name: SCODE*
Description: Station code or Station identifier (ID) - Code for the station within the site. A station is any measuring unit such as a sampling plot or a meteorological station. If the station equals the site, meaning that only one station is used within the site, only the site identifier is provided in the data recording sheet. If external identification systems (e.g. WMO Station ID) exist, this identifier can be used to reference the station. If a DEIMS-SDR UUID for the station exists, it needs to be used.
Example: 300
M D V: x x x

Field name: SNAME
Description: Station name (UTF-8 character encoding, https://en.wikipedia.org/wiki/UTF-8); provides the name of the station if relevant.
Example: IP1

Field name: STYPE*
Description: Station type – type of station according to a fixed list of values:
  PT … point
  HLN … horizontal transect
  VLN … vertical transect
  PL … areal plot
Example: PT
M D V: x x x

Field name: westBoundingCoordinate*
Description: Bounding box for a station in decimal degree [dec °] WGS84; if a point is represented, east and west bounding coordinates are equal; equals Longitude.
Example: 7,88749
M D V: x x x

Field name: eastBoundingCoordinate*
Description: Bounding box for a station in decimal degree [dec °] WGS84; if a point is represented, east and west bounding coordinates are equal; equals Longitude.
Example: 7,95372
M D V: x x x

Field name: northBoundingCoordinate*
Description: Bounding box for a station in decimal degree [dec °] WGS84; if a point is represented, north and south bounding coordinates are equal; equals Latitude.
Example: 45,34080
M D V: x x x

Field name: southBoundingCoordinate*
Description: Bounding box for a station in decimal degree [dec °] WGS84; if a point is represented, north and south bounding coordinates are equal; equals Latitude.
Example: 45,30056
M D V: x x x

Field name: altitudeMinimum*
Description: Minimum altitude in meter above sea level [m a.s.l.] for the observed station, negative if below water level; if a single point is represented, minimum and maximum are equal.
Example: 265
M D V: x x x

Field name: altitudeMaximum*
Description: Maximum altitude in meter above sea level [m a.s.l.] for the observed station, negative if below water level; if a single point is represented, minimum and maximum are equal.
Example: 270
M D V: x x x

Field name: Country
Description: Country - country code: ISO 3166-1 alpha-3 (https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3), e.g. AUT for Austria.
Example: AUT

Field name: SampPeriodSince
Description: Sampling period since [ISO date], see ISO 8601: calendar dates as YYYY-MM-DD, time as HH:MM:SS plus a time zone designator (as UTC plus offset). Combined date and time: 2007-04-05T12:30:00-02:00. Any time information is given
  • in UTC (e.g. 2017-03-03T11:00+00:00 or 2017-03-03T11:00UTC or 2017-03-03T11:00Z),
  • or in local time, if the UTC offset is provided (e.g. 2017-03-03T13:00+02:00).
Example: 1992-06-01T13:00+02:00

Field name: InstHeight
Description: Installation height – height of the installation of a sensor or device in [cm], measured from the soil surface. Positive and negative values are possible.
Example: 200

Field name: plotSize
Description: Plot size in [m2].
Example: 25
M D V: x

Field name: plotShape
Description: Plot shape (length x width) in [m].
Example: 5x5
M D V: x

Field name: Local_Habitat_Type
Description: Local habitat type [text] using local classification (needs to be defined in the method metadata).
Example: Beech forest
M D V: x

Field name: EUNIS_Habitat_Type
Description: EUNIS habitat type [text] using [EUNIS Habitat Classification] identifier.
Example: G2.3
M D V: x

Field name: Potential_natural_vegetation
Description: Potential natural vegetation [txt] using local syntaxonomic classification schema (needs to be defined in the method metadata).
Example: Fagetum
M D V: x

If additional information on the stations (e.g. soil type, geology) needs to be provided, additional columns are created in the station table, e.g. soilType. A separate reference file containing the reference list should be provided.
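The consistency rules stated in the station field descriptions (fixed list of station types, equal bounding coordinates for points, ISO 8601 time stamps with UTC offset) can be checked automatically. The following is a minimal sketch, assuming that each station row is available as a dictionary keyed by the field names without the asterisk marker; the function names are hypothetical and not part of the template.

```python
from datetime import datetime

STATION_TYPES = {"PT", "HLN", "VLN", "PL"}


def to_float(text):
    """Parse a number written with a decimal comma or a decimal point."""
    return float(text.replace(",", "."))


def parse_timestamp(text):
    """Parse an ISO 8601 time stamp; 'Z' and 'UTC' are read as +00:00."""
    for suffix in ("UTC", "Z"):
        if text.endswith(suffix):
            text = text[: -len(suffix)] + "+00:00"
            break
    return datetime.fromisoformat(text)


def check_station(row):
    """Return a list of problems found in one STATION row."""
    problems = []
    if row["STYPE"] not in STATION_TYPES:
        problems.append("unknown STYPE")
    west = to_float(row["westBoundingCoordinate"])
    east = to_float(row["eastBoundingCoordinate"])
    south = to_float(row["southBoundingCoordinate"])
    north = to_float(row["northBoundingCoordinate"])
    if west > east or south > north:
        problems.append("bounding box is inverted")
    if row["STYPE"] == "PT" and (west != east or south != north):
        problems.append("point station needs equal bounding coordinates")
    if to_float(row["altitudeMinimum"]) > to_float(row["altitudeMaximum"]):
        problems.append("altitudeMinimum exceeds altitudeMaximum")
    since = row.get("SampPeriodSince")
    if since:
        try:
            parse_timestamp(since)
        except ValueError:
            problems.append("SampPeriodSince is not an ISO 8601 time stamp")
    return problems


station = {
    "SCODE": "300", "STYPE": "PL",
    "westBoundingCoordinate": "7,88749", "eastBoundingCoordinate": "7,95372",
    "northBoundingCoordinate": "45,34080", "southBoundingCoordinate": "45,30056",
    "altitudeMinimum": "265", "altitudeMaximum": "270",
    "SampPeriodSince": "1992-06-01T13:00+02:00",
}
print(check_station(station))
```

The example row uses the values from the field list above with the station type set to PL, because the bounding coordinates describe an area rather than a single point.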

6.1.2.2 Method

Definition: A method describes the procedure used to generate and manipulate the data. The section contains information on the methods applied for the generation and manipulation of the data. The method section should give an overview of the sampling, the field method and the method used in the lab to create the data value. In addition, the method needs to be provided with the metadata description in DEIMS.

x … mandatory o … optional c … conditional

Field name: METHOD_CODE
Description: Method code – user defined code for the method description. The code is used to refer to the METHOD_CODE within the data.
Example: ZOE_IM_VEG
M D V: x x x

Field name: SAMPLING
Description: Short method description on how the plots were selected from the total population (selection of plots, observation points, etc.).
Example: Random sampling of spruce stands in the entire area of the site; 5 regularly spaced (10 m) positions on a transect; etc.
M D V: x x x

Field name: FIELD_METHOD
Description: Short method description of the method used in the field either to collect the samples or to do the observation.
Example: Volume weighted mixing from 5 bulk sampler, 2 weeks interval of sampling, cooled transportation of the samples
M D V: x x x

Field name: LAB_METHOD
Description: Short method description on the procedures and methods applied in the lab, e.g. filtering, analysis, etc.
Example: 45µm filtered; ICP-OES
M D V: x

Field name: AGG_METHOD
Description: Short method description of the procedure how the values have been aggregated from primary values; for primary data the aggregation procedure is “NONE”.
Example: Weighted mean value
M D V: x x x

In case the method is sufficiently described by the metadata record in DEIMS, the METHOD.CSV file can contain the following information (including the reference to the metadata record): “For details to the methods applied please refer to the respective metadata record on DEIMS. [https://data.lter-europe.net/deims/dataset/xxxxxxxxx]”
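Because the METHOD_CODE is used to refer from the data to the method description, it can be useful to check that every code used in the data is actually defined. The following minimal sketch assumes that the rows of both blocks are available as dictionaries keyed by field name; the helper name is hypothetical.

```python
def unresolved_codes(data_rows, lookup_rows, key):
    """Return the codes used in `data_rows` that are missing in `lookup_rows`.

    `key` is the name of the shared field, e.g. "METHOD_CODE".
    Empty or missing values in the data are ignored.
    """
    defined = {row[key] for row in lookup_rows}
    used = {row[key] for row in data_rows if row.get(key)}
    return sorted(used - defined)


method_rows = [{"METHOD_CODE": "ZOE_IM_VEG", "AGG_METHOD": "NONE"}]
data_rows = [
    {"SCODE": "IP1", "METHOD_CODE": "ZOE_IM_VEG", "VALUE": "25"},
    {"SCODE": "IP1", "METHOD_CODE": "ZOE_IM_SOIL", "VALUE": "3"},  # not defined
    {"SCODE": "IP1", "METHOD_CODE": "", "VALUE": "7"},
]
print(unresolved_codes(data_rows, method_rows, "METHOD_CODE"))
```

The same helper can be applied to other shared fields, for example to compare the station codes used in the data with those defined in the table STATION.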

6.1.2.3 Data

Definition: The data are defined as the section where the observation values are provided. This section contains data on any observation or measurement in the different compartments of the ecosystem. It includes bio-geochemical measurements as well as biotic observations.

x … mandatory o … optional c … conditional


Field name Description Example M D V SUBPROG Code for the sub programme for which the BIOCHEM data are reported, e.g. BIOCHEM for “biogeochemical data” within the site. This refers to the parameter groups or thematic grouping of data.

BIOCHEM biogeochemistry data STRUCTU Structure and function of ecosystems, communities and populations HUMANEC human population and economy SITECHA site characteristics (land use and land cover) Additional values can be defined, but need to be documented in the REFERENCE section. SITE_CODE Site code – as the reference to the LTER_EU_AT_003 o o o documentation of the LTER site and LTSER Platform in DEIMS. Provide either the Site code (e.g. LTER_EU_AT_003) or the Site- UUID (e.g. https://data.lter- europe.net/deims/site/0ce0d289-9ef9-4232- a981-8f34869db76d) Condition: if more than one site is referenced in the data, the site identification needs to be provided ORG_NAME Abbreviation or name of the organisation EAA providing the data SCODE* Station code – as reference to the IP1 x x x observation location (=station) defined in the table STATION MEDIUM Medium – as the code for the sampled AIR medium in the observation procedure

AIR air including meteorology SOIL soil SOILWAT soil water WATER runoff and groundwater SEDIMENT sediments in aquatic environments LITTER litter fall BIOCOM biological communities HUMPOP human population SITECHAR site characteristics (as habitat or landscape structure) Additional values can be defined, but need to be documented in the REFERENCE section.

Document ID: eLTER D3.3 Data Models © eLTER consortium

- 70 -

Field name Description Example M D V LISTMED Reference list medium – as reference to the ELTER list used. Use the code ‘ADD’ if the code is defined by the user. Otherwise use reference to the reference list. LEVEL* Height of measurement in [cm] above -10 x x ground surface. Condition: If a single measurement for the height of measurement is provided the fields min_level and max_level are not used. MAX_LEVEL Upper measurement level in [cm] if a range -15 for the observation is provided; the land/water surface is the zero level; values below the surface are provided as negative values (e.g. - 20), values above the surface are provided as positive values (e.g. 20). MIN_LEVEL Lower measurement level in [cm] if a range -5 for the observation is provided; the land/water surface is the zero level; values below the surface are provided as negative values (e.g. - 20), values above the surface are provided as positive values (e.g. 20) SIZE Size of the sampling plot in [m²] where the 100 x observation takes place or the size of the area for which the aggregated values are representative (e.g. the site or part of the site such as the forested area) TIME* timestamp of measurement [ISO date] 2017-03- x x x (according to ISO 8601): calendar dates as 03T13:00+02 YYYY-MM-DD, time as HH:MM:SS in UTC :00 plus offset; combined date and time as YYYY- MM-DDTHH:MMOffset, e.g. 2007-04- 05T12:30:00-02:00 2017-03-03 Any time information in converted to UTC time (e.g. 2017-03-03T11:00+00:00 or 2017-03 2017-03-03T11:00UTC or 2017-03- 03T11:00Z) or local times, if the UTC offset is provided (e.g. 2017-03- 2017 03T13:00+02:00) If aggregations are provided the timestamp is provided as the following

Document ID: eLTER D3.3 Data Models © eLTER consortium

- 71 -

| Field name | Description | Example | M D V |
|---|---|---|---|
| TIME (continued) | For aggregated values: annual aggregation, provide only the year as YYYY; monthly aggregation, provide only the month as YYYY-MM; daily aggregation, provide the day as YYYY-MM-DD. Condition: if the field TIME is used, the columns YEAR, MONTH, DAY, HOUR, MINUTE, SECOND are omitted. | | |
| YEAR | Year [YYYY] of the measurement, the year for which the measurements were aggregated, or the year of an observation (e.g. plants). Alternative: time stamp as [ISO date], see ISO 8601: calendar dates as YYYY-MM-DD, time as HH:MM:SS plus a time zone designator (as UTC plus offset), e.g. 2007-04-05T12:30:00-02:00. Any time information is given in UTC (e.g. 2017-03-03T11:00+00:00 or 2017-03-03T11:00UTC or 2017-03-03T11:00Z), or in local time if the UTC offset is provided (e.g. 2017-03-03T13:00+02:00). | 2017; 2017-03-03T13:00+02:00 | |
| MONTH | Month of the measurement or aggregation. Leave blank if not relevant; see the notes under YEAR for aggregated time information. | 08 | |
| DAY | Day of the measurement or the aggregation. Leave blank if not relevant; see the notes under YEAR for aggregated time information. | 01 | |
| HOUR | Hour of the measurement or the aggregation. Leave blank if not relevant; see the notes under YEAR for aggregated time information. | 12 | |
| MINUTE | Minute of the measurement or the aggregation. Leave blank if not relevant; see the notes under YEAR for aggregated time information. | 30 | |
| SECOND | Second of the measurement or the aggregation. Leave blank if not relevant; see the notes under YEAR for aggregated time information. | 05 | |
| SPOOL | Spatial pool as the number of spatial entities (e.g. sensors, plots) used to calculate the data value. If not relevant or described in the method section, leave blank. | 5 | |
| TPOOL | Temporal pool as the number of observations used to calculate the data value. If not relevant or described in the method section, leave blank. | 10 | |
| TLEVEL | Temporal level of aggregation or observation. Values: HOUR hourly values (60 min); DAY daily values (24 hrs); WEEK weekly values (7 days); MONTH monthly values; SEASON seasonal values (e.g. spring); HYEAR half-yearly values (6 months); YEAR yearly values (12 months). Additional values can be defined, but need to be documented in the REFERENCE section. | | |
| TAXA* | Species name, either defined by a species letter code (genus & species) or the full name. Only relevant when reporting species information. If species letter codes are used, the definition needs to be provided in the REFERENCE section. | FAG SYLV; Fagus sylvatica L. | x |
| LISTTAXA | Reference to the taxonomic list used for the observations. | Flora of Austria (2005) | |
| SUBST* | Substance code or parameter name as abbreviation. Examples (LISTSUB, SUBST, Name): DB ALK Alkalinity; DB BOD Biochemical oxygen demand; DB TC Total carbon; …; IM COVE_T species cover tree layer; IM COVE_S species cover shrub layer; IM COVE_F species cover field layer; IM COVE_B species cover bottom layer. Additional values can be defined, but need to be documented in the REFERENCE section. | COVE_F | x x x |
| LISTSUB | Reference to the code list of substances and parameter names, e.g. EnvThes or other vocabularies. Use ADD if defined by the user. | IM | |
| METHOD_CODE | Reference to the METHOD_CODE as defined in the METHOD table/file. | | |
| VALUE* | Data value of the observation. The decimal separator needs to be used consistently in the data file, either ‘,’ or ‘.’. | 25 | x x x |
| UNIT* | Unit of the observation. Condition: provided if relevant. | % | x x c |
| FLAGQUA* | Quality flag for the data values, based on the applied data quality control procedure as provided by the local system. Examples: L less than detection limit; E estimated from measured value. Additional values can be defined, but need to be documented in the REFERENCE section. | | o o o |
| FLAGSTA* | Status flag for the data, defining the level of aggregation of the data value according to the attached list of possible values. If not relevant, leave blank. Values: X arithmetic average, mean (e.g. monthly average); W weighted mean; S sum; M mode; A minimum; Z maximum; XA average monthly minimum; XZ average monthly maximum; SZ maximum daily sum. Additional values can be defined, but need to be documented in the REFERENCE section. | X | o o o |
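The rules for the TIME field can be checked mechanically. The following is a minimal sketch in Python, not part of the eLTER specification: the function name `classify_time` and the returned labels are assumptions chosen for illustration. It distinguishes the three aggregated forms (YYYY, YYYY-MM, YYYY-MM-DD) from full time stamps and rejects time stamps without a UTC offset.

```python
import re
from datetime import datetime

# Patterns for the aggregated forms of TIME (labels are illustrative only)
_AGGREGATED = {
    "YEAR": re.compile(r"^\d{4}$"),
    "MONTH": re.compile(r"^\d{4}-(0[1-9]|1[0-2])$"),
    "DAY": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def classify_time(value: str) -> str:
    """Return 'YEAR', 'MONTH', 'DAY' or 'TIMESTAMP' for a TIME string."""
    for level, pattern in _AGGREGATED.items():
        if pattern.match(value):
            if level == "DAY":
                # Raises ValueError for impossible dates such as 2017-02-30
                datetime.strptime(value, "%Y-%m-%d")
            return level
    # Normalise the 'UTC' and 'Z' suffixes to an explicit +00:00 offset
    normalised = value.replace("UTC", "+00:00").replace("Z", "+00:00")
    stamp = datetime.fromisoformat(normalised)  # ValueError if malformed
    if stamp.tzinfo is None:
        raise ValueError(f"time stamp without UTC offset: {value!r}")
    return "TIMESTAMP"


levels = [
    classify_time(v)
    for v in ("2017", "2017-03", "2017-03-03", "2017-03-03T13:00+02:00")
]
```

A data provider could run such a check over the TIME column before submission; how strictly the separate YEAR, MONTH and DAY columns are validated is left open here.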


If the list of parameters (measures) is standardised and repeated measurements are made (e.g. in an experimental design), the measures can also be provided in columns. In this case no specific information on the single values (e.g. quality) can be given. The examples include this alternative form of data reporting. This format is not recommended for the basic eLTER data reporting.

6.1.2.4 Reference lists

This section provides the definitions for the codes used in the data reporting. If the Microsoft Excel template is used, the reference lists are provided in separate tables within the spreadsheet, e.g. Ref_SUBST. If text files are used, the references are provided as a separate file structured as defined in the following. All definitions are provided in a single file.

| Field name | Description | Example | M C V |
|---|---|---|---|
| FIELD_NAME | Name of the field the reference is referring to. | SUBST | |
| LIST_CODE | Name of the code list; if referring to an existing code list, the name of the code list is provided (e.g. DB, EnvThes). If the code is defined by the user, use ‘ADD’ as identification. | ADD | |
| CODE | Code of the entry, defined as abbreviation of the term, e.g. parameter name. | WOOD_HARVEST | |
| NAME | Full name of the term, e.g. full species name. | Yearly wood harvest | |
| DEFINITION | Definition of the term used, in order to allow the user to understand the data. | Yearly amount of wood harvested from the plot | |

If additional fields (e.g. identifier) are needed, please add them to the reference table.
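Whether every code used in a data file has an entry in the reference list can be checked with a few lines of Python. The sketch below is only an illustration: the semicolon-separated layout, the function name `undefined_codes` and the sample data rows are assumptions, not part of the eLTER format definition.

```python
import csv
import io


def undefined_codes(data_rows, reference_rows, field="SUBST"):
    """Return the codes used in `field` that have no reference entry."""
    defined = {
        row["CODE"] for row in reference_rows if row["FIELD_NAME"] == field
    }
    used = {row[field] for row in data_rows if row.get(field)}
    return sorted(used - defined)


# Hypothetical file contents, inlined to keep the example self-contained
reference_csv = """FIELD_NAME;LIST_CODE;CODE;NAME;DEFINITION
SUBST;ADD;WOOD_HARVEST;Yearly wood harvest;Yearly amount of wood harvested from the plot
"""
data_csv = """SCODE;SUBST;TIME;VALUE
IP1;WOOD_HARVEST;2016;12.5
IP1;TEMP;2016-03-15;5.5
"""

reference = list(csv.DictReader(io.StringIO(reference_csv), delimiter=";"))
data = list(csv.DictReader(io.StringIO(data_csv), delimiter=";"))
missing = undefined_codes(data, reference)
print(missing)
```

In this made-up case TEMP is reported as missing because only WOOD_HARVEST is defined in the reference list.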


6.1.3 Examples

6.1.3.1 Example station description

| SCODE | STYPE | westBoundingCoordinate | eastBoundingCoordinate | northBoundingCoordinate | southBoundingCoordinate | altitudeMinimum | altitudeMaximum |
|---|---|---|---|---|---|---|---|
| IP1 | PT | 14,4466 | 14,4467 | 47,8385 | 47,8384 | 950 | 950 |

6.1.3.2 Example biophysical data

Recommended version basic format

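| SCODE | SUBST | LEVEL | TIME | VALUE | UNIT | FLAGQUA | FLAGSTA |
|---|---|---|---|---|---|---|---|
| IP1 | TEMP | 200 | 2016-03-15 | 5.5 | °C | | X |
| IP1 | PREC | 100 | 2016-03-03 | 10.2 | MM | | S |
| IP1 | TEMP | 200 | 2016-02-15 | 2.5 | °C | | X |
| IP1 | NH4N | 100 | 2016-03 | 5.5 | mg N/l | | W |
| IP1 | SO4S | 100 | 2016-03 | 10.2 | mg S/l | | W |
| IP1 | CA | 100 | 2016-03 | 2.5 | Mg/l | L | W |
| … | … | … | … | … | … | … | … |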

Alternative version

| SCODE | LEVEL | TIME | TEMP | PREC | NH4N | SO4S | CA | TYPE |
|---|---|---|---|---|---|---|---|---|
| IP1 | 100 | 2016-03 | 5.5 | 10.2 | 2.5 | 5.5 | 2.5 | Forest |
| IP1 | 100 | 2016-04 | 5.2 | 1.2 | 2.2 | 5.8 | 1.2 | Forest |

Note: Resolution and methods need to be described in detail for each parameter. Additional information for each value (e.g. aggregation level or quality) cannot be provided in this format.
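The relation between the two layouts can be illustrated with a short Python sketch that pivots rows of the recommended basic format into the alternative column layout. The function name `to_wide` is an assumption made for this illustration; the sample rows are taken from the basic-format example above. The sketch also shows why per-value information is lost: UNIT, FLAGQUA and FLAGSTA have no column to go to.

```python
from collections import defaultdict


def to_wide(rows):
    """Pivot basic-format rows into one record per (SCODE, LEVEL, TIME)."""
    wide = defaultdict(dict)
    for row in rows:
        key = (row["SCODE"], row["LEVEL"], row["TIME"])
        # Only VALUE is carried over; UNIT, FLAGQUA and FLAGSTA are dropped
        wide[key][row["SUBST"]] = row["VALUE"]
    return [
        {"SCODE": scode, "LEVEL": level, "TIME": time, **values}
        for (scode, level, time), values in wide.items()
    ]


rows = [
    {"SCODE": "IP1", "SUBST": "NH4N", "LEVEL": 100, "TIME": "2016-03",
     "VALUE": 5.5, "UNIT": "mg N/l", "FLAGQUA": "", "FLAGSTA": "W"},
    {"SCODE": "IP1", "SUBST": "SO4S", "LEVEL": 100, "TIME": "2016-03",
     "VALUE": 10.2, "UNIT": "mg S/l", "FLAGQUA": "", "FLAGSTA": "W"},
    {"SCODE": "IP1", "SUBST": "CA", "LEVEL": 100, "TIME": "2016-03",
     "VALUE": 2.5, "UNIT": "Mg/l", "FLAGQUA": "L", "FLAGSTA": "W"},
]
wide = to_wide(rows)
```

The quality flag L attached to the CA value is no longer recoverable from the resulting record, which is the limitation stated in the note.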

6.1.3.3 Example biodiversity data

| SCODE | SUBST | TIME | TAXA | VALUE |
|---|---|---|---|---|
| 1 | COVE_T1 | 2016-06-25 | FAG SYLV | 3 |
| 1 | COVE_T1 | 2016-06-25 | PIC ABIE | 3 |
| 1 | COVE_S | 2016-06-25 | FAG SYLV | 1 |
| 1 | COVE_F | 2016-06-25 | OXA ACET | 2 |
| … | … | … | … | … |

6.1.3.4 Example method documentation

| CODE | SAMPLING | FIELD_METHOD | LAB_METHOD | AGG_METHOD |
|---|---|---|---|---|
| METH_1 | providing specification for the sampling methods | providing specification for the field method | providing specification for the analysis method | provide specification for the aggregation method, including the statistical analysis of the data if relevant |
| METH_2 | example | selected water sample | ICP_OES | |


6.1.3.5 Example reference list

| FIELD_NAME | LIST_CODE | CODE | NAME | DEFINITION |
|---|---|---|---|---|
| SUBST | CF | NOXN | atmosphere_mass_content_of_nox_expressed_as_nitrogen | "Content" indicates a quantity per unit area. The "atmosphere content" of a quantity refers to the vertical integral from the surface to the top of the atmosphere. For the content between specified levels in the atmosphere, standard names including content_of_atmosphere_layer are used. "Nox" means a combination of two radical species containing nitrogen and oxygen: NO+NO2. The phrase 'expressed_as' is used in the construction A_expressed_as_B, where B is a chemical constituent of A. It means that the quantity indicated by the standard name is calculated solely with respect to the B contained in A, neglecting all other chemical constituents of A. |
| SUBST | CF | NOYN | atmosphere_mass_content_of_noy_expressed_as_nitrogen | "Content" indicates a quantity per unit area. The "atmosphere content" of a quantity refers to the vertical integral from the surface to the top of the atmosphere. For the content between specified levels in the atmosphere, standard names including content_of_atmosphere_layer are used. "Noy" describes a family of chemical species. The family usually includes atomic nitrogen (N), nitrogen monoxide (NO), nitrogen dioxide (NO2), dinitrogen pentoxide (N2O5), nitric acid (HNO3), peroxynitric acid (HNO4), bromine nitrate (BrONO2), chlorine nitrate (ClONO2) and organic nitrates (most notably peroxyacetyl nitrate, sometimes referred to as PAN, (CH3COO2NO2)). The list of individual species that are included in a quantity having a group chemical standard name can vary between models. Where possible, the data variable should be accompanied by a complete description of the species represented, for example, by using a comment attribute. The phrase 'expressed_as' is used in the construction A_expressed_as_B, where B is a chemical constituent of A. It means that the quantity indicated by the standard name is calculated solely with respect to the B contained in A, neglecting all other chemical constituents of A. |
| … | … | … | … | … |


6.2 Annex B – SensorML Implementation for DEIMS-SDR

Authors: Christoph Wohner (EAA) & Alessandro Oggioni (CNR)

The aim of this document is to clarify whether it is sensible to implement SensorML for sensor description in DEIMS, to give a brief overview of the SensorML format, and to make suggestions for a potential DEIMS community profile. This community profile would also allow the sensor information to be exposed in INSPIRE EMF.

6.2.1 Definitions

SensorML is an Open Geospatial Consortium standard for describing sensors and measurement processes. SensorML aims to:
● Provide descriptions of sensors and sensor systems for inventory management
● Provide sensor and process information in support of asset and observation discovery
● Support the processing and analysis of the sensor observations
● Support the geolocation of observed values (measured data)
● Provide performance and quality of measurement characteristics (e.g., accuracy, threshold, etc.)
● Provide general descriptions of components (e.g. a particular model or type of a sensor) as well as the specific configuration of that component when it is deployed
● Provide a machine interpretable description of the interfaces and data streams flowing in and out of a component
● Provide an explicit description of the process by which an observation was obtained (i.e., its lineage)
● Provide an executable aggregate process for deriving new data products on demand (i.e., derivable products)
● Archive fundamental properties and assumptions regarding sensor systems and computational processes [80]
● Provide information on the manufacturer, owner, and operator as contacts who can give more information about the sensors
● Provide historical events of the sensor (e.g. installation, calibration, etc.)

SensorML thus provides a common framework for any process, especially for the description of sensors and systems and the processes surrounding sensor observations. Sensor and transducer [81] components (detectors, transmitters, actuators [82], and filters) are modelled as physical processes that can be connected and participate equally within a process network or system, and which utilize the same model framework as any other process. Processes are entities that take one or more inputs and, through the application of well-defined methods and configurable parameters, produce one or more outputs. The process model can be used to describe a wide variety of processes, including not only sensors, but also actuators, spatial transforms, and data processes. SensorML also supports explicit linking between processes and thus supports the concept of process chains, networks, or workflows, which are themselves defined as processes using a composite pattern. Processes that can be modelled with SensorML are:
● Physical System: an aggregate system that can include multiple components (both physical and non-physical) with explicit links between the outputs, inputs, and parameters of the individual components. In a PhysicalSystem, the spatial position of the System itself is relevant to its application;
● Physical Component: a physical process that will not be further divided into smaller components.

The examples below use Physical System because it is the most widely used scheme.

[80] https://portal.opengeospatial.org/files/?artifact_id=55939, p. 14-15
[81] An entity that receives a signal as input and generates a modified signal as output. Includes detectors, actuators, and filters.
[82] A type of transducer that converts a signal to some real-world action or phenomenon

SensorML provides a framework within which the geometric, dynamic, and observational characteristics of sensors and sensor systems can be defined. A variety of sensor types can all be supported through the definition of simple and aggregate processes. The models and schema within the core SensorML specification provide a skeletal framework for describing processes, aggregate processes, and sensor systems [83].

6.2.2 Minimum information for SensorML

In order to generate a valid SensorML file only very basic information has to be provided.

6.2.2.1 List of required fields for valid SensorML

The following information is mandatory and has to be provided in order for a SensorML file to be valid [84]:

[83] https://portal.opengeospatial.org/files/?artifact_id=55939, p. 14-15
[84] http://www.sensorml.com/sensorML-2.0/examples/helloWorld.html


● Header information:

● gml:identifier http://data.lter-europe.net/sensors/54321
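As an illustration of how small such a document can be, the following Python sketch builds a SensorML element that carries only a `gml:id` and a `gml:identifier`. It is a minimal sketch under stated assumptions, not the DEIMS implementation: the choice of `sml:PhysicalSystem` as root, the `gml:id` value `sensor_1`, the `codeSpace` value and the function name `minimal_sensorml` are illustrative, and the result is not validated against the SensorML schema here.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of SensorML 2.0 and GML 3.2
SML = "http://www.opengis.net/sensorml/2.0"
GML = "http://www.opengis.net/gml/3.2"
ET.register_namespace("sml", SML)
ET.register_namespace("gml", GML)


def minimal_sensorml(identifier: str, gml_id: str = "sensor_1") -> str:
    """Serialise a PhysicalSystem that carries only an identifier."""
    root = ET.Element(f"{{{SML}}}PhysicalSystem", {f"{{{GML}}}id": gml_id})
    # codeSpace="uid" is an assumed value for this illustration
    ident = ET.SubElement(root, f"{{{GML}}}identifier", {"codeSpace": "uid"})
    ident.text = identifier
    return ET.tostring(root, encoding="unicode")


xml_text = minimal_sensorml("http://data.lter-europe.net/sensors/54321")
print(xml_text)
```

The identifier is the one given in the list above; everything else would be filled in by the system that exports the record.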

6.2.2.2 SensorML profile (minimum information)

In order to be useful, a sensor description should at least tell the type of measurement and the location of the sensor. Since a sensor in SensorML is simply a physical process that outputs a measurement, the type of measurement is provided in the sml:outputs element. The location is provided by the sml:position element.

● Observed Property = Output Relative humidity of the atmosphere

● Sensor Location 11.977484 44.883448
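The two items above can be sketched in Python as follows. This is an illustrative sketch, not the profile's normative encoding: the function name `describe_sensor`, the output name, the `gml:id` values, the definition URI, the UCUM code and the CRS identifier are all assumptions; only the label and the coordinate pair are taken from the text.

```python
import xml.etree.ElementTree as ET

NS = {
    "sml": "http://www.opengis.net/sensorml/2.0",
    "swe": "http://www.opengis.net/swe/2.0",
    "gml": "http://www.opengis.net/gml/3.2",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix: str, name: str) -> str:
    """Build a qualified tag in ElementTree's {uri}name notation."""
    return f"{{{NS[prefix]}}}{name}"


def describe_sensor(label, definition, uom_code, position, srs_name):
    """Return a PhysicalSystem with one output and a GML point location."""
    system = ET.Element(q("sml", "PhysicalSystem"), {q("gml", "id"): "sensor_1"})
    # Observed property = output of the physical process
    outputs = ET.SubElement(system, q("sml", "outputs"))
    output_list = ET.SubElement(outputs, q("sml", "OutputList"))
    output = ET.SubElement(output_list, q("sml", "output"), {"name": "observedProperty"})
    quantity = ET.SubElement(output, q("swe", "Quantity"), {"definition": definition})
    ET.SubElement(quantity, q("swe", "label")).text = label
    ET.SubElement(quantity, q("swe", "uom"), {"code": uom_code})
    # Sensor location as a GML point
    position_el = ET.SubElement(system, q("sml", "position"))
    point = ET.SubElement(
        position_el,
        q("gml", "Point"),
        {q("gml", "id"): "sensorLocation", "srsName": srs_name},
    )
    ET.SubElement(point, q("gml", "pos")).text = position
    return system


system = describe_sensor(
    label="Relative humidity of the atmosphere",
    definition="http://example.org/def/relative_humidity",  # hypothetical URI
    uom_code="%",
    position="11.977484 44.883448",
    srs_name="http://www.opengis.net/def/crs/EPSG/0/4326",  # assumed CRS
)
```

The axis order of the coordinate pair depends on the chosen CRS and is copied here exactly as it appears in the text.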


Information about the location can be expressed in different ways:

● Sensor with location by description (byDescription) [85]
● Sensor with GML point location (byPoint) [86]
● Sensor with static location (byLocation) [87]
● Sensor with orientation (byState) [88]
● Sensor with dynamic position (byTrajectory) [89]
● Sensor with location output (byTrajectory) [90]
● Sensor with orbit propagator (byProcess)
● Sensor with SOS for location (byProcess)

GML point location allows mapping to the INSPIRE EMF geometry field.

6.2.2.3 Standardised description of sensors

The Semantic Sensor Network Ontology (SSN [91]) describes sensors and observations, and related concepts. It does not describe domain concepts, time, locations, etc.; these are intended to be included from other ontologies via OWL imports.

6.2.2.4 Sensor Web Enablement Lightweight SOS Profile mandatory fields

SensorML is the recommended sensor metadata format for SOS 2.0 [92]. SensorML is used within SOS for encoding sensor metadata documents that are returned in case of DescribeSensor requests. This lightweight profile [93] defines a minimum set of mandatory metadata that need to be provided in a SensorML document. Complex elements of SensorML are not considered here.

● gml:description (mandatory): Short textual description of the sensor or sensor system.
● gml:identifier (mandatory): Unique identifier of the sensor system.
● sml:keywords (mandatory): Terms which help to describe the sensor system and serve for discovery purposes. For example, the phenomena observed by the system or the types of contained sensors can be mentioned.
● sml:identification (mandatory): This element contains identifiers of the sensor system. Each "identifier/Term" element contained in the "IdentifierList" must have a "definition" attribute which links to the semantics of the sensor system. One identifier has to be present which contains the definition "urn:ogc:def:identifier:OGC:shortname". The value of its contained "Term" element represents a human understandable name for the instance. One identifier has to be present which contains the definition "urn:ogc:def:identifier:OGC:longname". The value of its contained "Term" element represents a human understandable name for the sensor system.
● sml:classification (mandatory): This element contains classifiers for the sensor system. Each "classifier/Term" element contained in the "ClassifierList" must have a "definition" attribute. This attribute links to the semantics of the identifier. One classifier has to be present which contains the definition “http://www.opengis.net/def/property/OGC/0/SensorType”. The value of its contained “Term” element states the type of the sensor system (e.g., “weather station”).
● sml:contacts (mandatory): This element contains contact information about the operator of the sensor. The element "contacts/ContactList/member/gmd:CI_ResponsibleParty" has to be present to define the responsible party of the sensor system [94].
● sml:featuresOfInterest (mandatory): This element contains the real world entity, the feature of interest, which is observed by the sensor system. In case of this profile, the feature of interest is a station and modelled as a SamplingPoint.
● sml:outputs (mandatory): The outputs of the sensors attached to the sensor system. Each child-element of an "output" has to use the "definition"-attribute to specify the URI of the observed property. If the child-element of the output is a "swe:Quantity" it has to contain the "swe:uom" element which specifies the "code" attribute stating the UCUM code. Depending on the observation types the outputs have to be described as one of the following elements:
  ○ swe:Quantity (in case of Measurement)
  ○ swe:Count (in case of CountObservation)
  ○ swe:Boolean (in case of TruthObservation)
  ○ swe:Category (in case of CategoryObservation)
  ○ swe:Text (in case of TextObservation)

[85] http://www.sensorml.com/sensorML-2.0/examples/locDescription.html
[86] http://www.sensorml.com/sensorML-2.0/examples/locGML.html
[87] http://www.sensorml.com/sensorML-2.0/examples/locStatic.html
[88] http://www.sensorml.com/sensorML-2.0/examples/locOrientation.html
[89] http://www.sensorml.com/sensorML-2.0/examples/locDynamic.html
[90] http://www.sensorml.com/sensorML-2.0/examples/locOutput.html
[91] https://www.w3.org/2005/Incubator/ssn/ssnx/ssn
[92] http://www.ogcnetwork.net/sos_2_0/tutorial/sensorml
[93] https://portal.opengeospatial.org/files/?artifact_id=52675
[94] https://portal.opengeospatial.org/files/?artifact_id=52803
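A first, coarse check of a sensor description against the mandatory fields listed above could look like the following Python sketch. It is an assumption-laden illustration rather than a profile validator: it only tests whether elements with the listed local names occur anywhere in the document, it ignores namespaces on purpose, and it does not check the nested rules (definition attributes, identifiers, UCUM codes). The function names and the sample document with its namespace URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Local names of the mandatory elements named in the profile description
MANDATORY = (
    "description",
    "identifier",
    "keywords",
    "identification",
    "classification",
    "contacts",
    "featuresOfInterest",
    "outputs",
)


def local_name(tag: str) -> str:
    """Strip the '{namespace}' part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def missing_mandatory(xml_text: str) -> list:
    """Return the mandatory element names not found in the document."""
    root = ET.fromstring(xml_text)
    present = {local_name(element.tag) for element in root.iter()}
    return [name for name in MANDATORY if name not in present]


# Hypothetical, deliberately incomplete sensor description
sample = (
    '<sml:System xmlns:sml="http://example.org/sml" '
    'xmlns:gml="http://example.org/gml">'
    "<gml:description>Weather station</gml:description>"
    "<gml:identifier>station-1</gml:identifier>"
    "<sml:keywords/>"
    "<sml:outputs/>"
    "</sml:System>"
)
missing = missing_mandatory(sample)
print(missing)
```

A full conformance test would still need schema validation and the per-element rules of the profile.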

6.2.3 SensorML and INSPIRE EMF

INSPIRE Data Specification on EMF (Environmental Monitoring Facilities) currently has limited means to describe sensors, therefore sensor metadata expressed in SensorML can be linked to EMF files.

EMF Sensor fields

inspireId = XXXXXXX

name = Hydrometric sensor : O12525100101

geometry = GM_Point (X/Y/ of the sensor)

responsibleParty = DREAL Midi-Pyrénées

mediaMonitored [95] = http://inspire.ec.europa.eu/codeList/MediaValue/water

measurementRegime [96] = http://inspire.ec.europa.eu/codeList/MeasurementRegimeValue/continuousDataCollection

mobile = False

resultAcquisitionSource [97] = http://inspire.ec.europa.eu/codeList/ResultAcquisitionSourceValue/inSitu

specialisedEMFType = http://sandre.eaufrance.fr/?urn=urn:sandre:dictionnaire:HYD::entite:Capteur:ressource:2.1:::html

EMF Sensor fields - XML example [98]

● Definition: Representative location for the EnvironmentalMonitoringFacility.
● Definition: Regime of the measurement.
● Definition: Indicate whether the EnvironmentalMonitoringFacility is mobile (repositionable) during the acquisition of the observation.
● Definition: Source of result acquisition.
● Definition: Categorisation of EnvironmentalMonitoringFacilities generally used by domain and in national settings. Description: EXAMPLE: platform, site, station, sensor, ...
● Definition: Lifespan of the physical object (facility).
● Definition: Any Thematic Link to an Environmental Monitoring Facility. The association has additional properties as defined in the association class AnyDomainLink.
● Definition: A link pointing to the EnvironmentalMonitoringNetwork(s) this EnvironmentalMonitoringFacility pertains to. The association has additional properties as defined in the association class NetworkFacility.

ef:contains

[95] http://inspire.ec.europa.eu/codelist/MediaValue
[96] http://inspire.ec.europa.eu/codelist/MeasurementRegimeValue
[97] http://inspire.ec.europa.eu/codelist/ResultAcquisitionSourceValue
[98] http://inspire.ec.europa.eu/schemas/ef/4.0/EnvironmentalMonitoringFacilities.xsd

6.2.4 Potential DEIMS Community Profile

Only parts of the SensorML model should be used for a potential DEIMS community profile. SensorML Profile for Sensor Discovery (OGC 09-033) https://portal.opengeospatial.org/files/?artifact_id=33284&version=2

6.2.4.1 Proposed DEIMS community profile

A community profile for DEIMS should include all fields necessary for valid SensorML, namely:
● The required SensorML fields
  ○ A resolvable ID (syntax deims_base_url+sensor+uuid)
  ○ header information (not exposed for users)

Sensor description is not mandatory for EMF files, therefore all sensor description fields are voidable. However, at minimum the following fields should be used to create semantically useful sensor records:
● Recommended, but voidable EMF sensor fields:
  ○ name
  ○ media monitored [99]
  ○ mobile (true/false) [100]
    ■ if false: geometry (as GM_Point (X/Y/SRS of the sensor)) [101]
    ■ if true: then specify trajectory (e.g. number of points and corresponding timestamps [102])
  ○ resultAcquisitionSource
  ○ specialisedEMFType [103] (not exposed for users)
● Linkable and voidable EMF field
  ○ OperationalActivityPeriod (Activity Time)

LOVs (lists of values) are provided for EMF fields. Other SensorML-specific fields can be added, but cannot be exposed within EMF.

These include:
● Textual description
● Keywords
● Contact (within the SensorML; INSPIRE EMF offers that information for the site itself)
● Position (including altitude)
  ○ Components
    ■ Including information about input and output of each component
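The proposed profile can be sketched as a small Python data structure. This is a hypothetical illustration only: the class name `SensorRecord`, its attribute names, the `/sensor/` path separator and the wording of the messages are assumptions. It encodes two rules from the lists above: the resolvable ID is built from the DEIMS base URL, the word sensor and the UUID, and a non-mobile sensor needs a geometry while a mobile one needs a trajectory.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from uuid import UUID


@dataclass
class SensorRecord:
    """Voidable EMF sensor fields of the proposed profile (illustrative)."""

    uuid: str
    name: Optional[str] = None
    media_monitored: Optional[str] = None
    mobile: Optional[bool] = None
    geometry: Optional[Tuple[float, float]] = None  # X/Y of a static sensor
    trajectory: Optional[List[Tuple[str, float, float]]] = None  # (timestamp, X, Y)
    result_acquisition_source: Optional[str] = None

    def resolvable_id(self, base_url: str) -> str:
        """Build the ID following the pattern deims_base_url+sensor+uuid."""
        UUID(self.uuid)  # raises ValueError for a malformed UUID
        return f"{base_url.rstrip('/')}/sensor/{self.uuid}"

    def problems(self) -> List[str]:
        """List violations of the mobile/geometry/trajectory rule."""
        issues = []
        if self.mobile is False and self.geometry is None:
            issues.append("static sensor without geometry")
        if self.mobile is True and not self.trajectory:
            issues.append("mobile sensor without trajectory")
        return issues


record = SensorRecord(uuid="fb583610-fe71-4793-b1a9-43097ed5c3e3", mobile=False)
sensor_url = record.resolvable_id("https://data.lter-europe.net/deims")
issues = record.problems()
```

Because all EMF sensor fields are voidable, the sketch reports problems instead of refusing to create the record.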

6.2.4.2 Example SensorML file

[99] Observed Property
[100] Could be mapped by specifying the trajectory of a sensor http://www.sensorml.com/sensorML-2.0/examples/locDynamic.html
[101] Can be mapped to Sensor Location in SensorML
[102] http://www.sensorml.com/sensorML-2.0/examples/locDynamic.html
[103] Categorisation of object generally used by domain and in national settings


$title $uuid Air Temperature $coordinates

6.2.4.3 Community Profile DEIMS field mapping

| Community Profile Field | Corresponding field in DEIMS | LOV |
|---|---|---|
| ID* | UUID | X |
| | user-defined sensor ID (that isn’t exported; only for internal purposes, e.g. improved usability) | X |
| name* | title (textfield) | free text |
| media monitored* | parameter (existing field that might be extended) or new field | |
| mobile* | mobile (new field of type boolean) | Yes; No |
| geometry* | boundaries | |
| Position | Elevation - Average | float value |
| resultAcquisitionSource* | resultAcquisitionSource | ex-situ; in-situ; remote-sensing; subsumed |
| OperationalActivityPeriod | Date | only starting date |
| Textual description* | General Site Description | free text |
| Keywords* | Keyword originating from EnvThes thesaurus | EnvThes |
| Contact* | Contact | List of people on DEIMS |
| SpecialisedEMFType | Site Type/Content Type | For SensorML description always = “codespace/sensor” |
| sensorType* | new field | free text for the beginning and then switch to reference list |
| output name | new field? | text field with existing code list? |
| uom code | new field? | text field with existing code list? |

Legend: * required; existing field; new field
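The mapping in the table can be expressed as a simple lookup, sketched here in Python. This is only an illustration under assumptions: the dictionary keys reuse the labels from the table, the function name `to_profile` and the sample record are hypothetical, and only rows that point to an existing DEIMS field are included.

```python
# Profile field -> DEIMS field, limited to rows with an existing DEIMS field
PROFILE_TO_DEIMS = {
    "ID": "UUID",
    "name": "title",
    "geometry": "boundaries",
    "resultAcquisitionSource": "resultAcquisitionSource",
    "Textual description": "General Site Description",
    "Keywords": "Keyword",
    "Contact": "Contact",
}

# Fields marked with * in the table
REQUIRED = {
    "ID", "name", "media monitored", "mobile", "geometry",
    "resultAcquisitionSource", "Textual description", "Keywords",
    "Contact", "sensorType",
}


def to_profile(deims_record: dict):
    """Map a DEIMS record to profile fields and list missing required ones."""
    profile = {
        profile_field: deims_record[deims_field]
        for profile_field, deims_field in PROFILE_TO_DEIMS.items()
        if deims_field in deims_record
    }
    missing = sorted(REQUIRED - profile.keys())
    return profile, missing


# Hypothetical DEIMS record with only two fields filled in
profile, missing = to_profile({"UUID": "1234", "title": "Precipitation sensor"})
```

Fields that the table marks as new (e.g. mobile, sensorType) have no source in this lookup and therefore always appear as missing until DEIMS provides them.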

6.2.5 Conclusion

SensorML can be used for DEIMS to describe and save metadata about sensors. A reduced community profile can be implemented easily. SensorML sensor metadata, e.g. type, location, etc. can be linked to EMF. Sensor metadata is therefore exportable as SensorML and EMF. Information about observations could be stored in addition to that.


Sensor documentation of some sort should be implemented in DEIMS. An integration with EMF is recommended. A SensorML implementation in DEIMS is possible, but debatable.

In any case a generic community profile should be implemented that allows mapping between these formats.

6.3 Annex C – SensorML Example DEIMS-SDR:Sensor

https://data.lter-europe.net/deims/sensor/fb583610-fe71-4793-b1a9-43097ed5c3e3 [104]

Precipitation measurement at LTER Zöbelboden Austria, Wildwiese (forest clearing area)
fb583610-fe71-4793-b1a9-43097ed5c3e3
precipitation
short name: LTER Zöbelboden Austria precipitation WW
long name: LTER Zöbelboden Austria precipitation WW
deployed at site: LTER Zöbelboden - Austria
sensorType: precipitation sensor
Main Offering: fb583610-fe71-4793-b1a9-43097ed5c3e3/offering/1
Thomas Dirnboeck, Environment Agency Austria (EAA), Sensor Contact person, Spittelauer Lände 5, Vienna 1090, AT
[email protected]
Environment Agency Austria (EAA)
https://data.lter-europe.net/deims/sensor/fb583610-fe71-4793-b1a9-43097ed5c3e3
47.842000000000 14.442000000000

[104] See https://data.lter-europe.net/deims/node/10763/sensorml
