European Long-Term Ecosystem and Socio-Ecological Research Infrastructure

D3.1 eLTER State of the art and requirements

Authors: Alessandro Oggioni (CNR-IREA), Christoph Wohner (EAA), John Watkins (CEH), David Ciar (CEH), Herbert Schentz (EAA), Simone Lanucara (CNR-IREA), Vladan Minić (BSI), Srđan Škrbić (BSI), Žarko Bodroški (BSI), Ralf Kunkel (FZ-Juelich), Jurgen Sorg (FZ-Juelich), Tomas Kliment (MK18), Francisco Sanchez (UGR), Barbara Magagna (EAA), Johannes Peterseil (EAA)

Lead partner for deliverable: CNR

Other partners involved: NERC/CEH, EAA, BSI, FZJ, UGR

H2020-funded project, GA: 654359, INFRAIA call 2014-2015
Start date of project: 01 June 2015
Duration: 48 months

Version of this document: 6
Submission date: 27 February 2017

Dissemination level

PU Public X

PP Restricted to other programme participants (including the Commission Services)

CO Confidential, only for members of the consortium (including the Commission Services)

CI Classified, as referred to in Commission Decision 2001/844/EC

Document history

Date | Author | Description
20.7.2016 | Peterseil, Johannes | Generating the document and basic structure
10.8.2016 | Oggioni, Alessandro | General things and text about Observation and Spatial data, Semantic harmonisation, LifeWatch, GET-IT and EDI
12.8.2016 | Wohner, Christoph | Text about DEIMS, Geonetwork and B2Share
6.9.2016 | Minić, Vladan | Text about GEOSS and INSPIRE
17.1.2017 | Kunkel, Ralf | Text about TERENO
19.1.2017 | Peterseil, Johannes | Evaluation of the
19.2.2017 | Minić, Vladan | Implementation strategy for DIP
1.2.2017 | Oggioni, Alessandro | Semantic harmonisation, LifeWatch, GET-IT and EDI
24.2.2017 | Oggioni, Alessandro | Review and final version
27.2.2017 | Peterseil, Johannes | Review WP Lead
27.2.2017 | Oggioni, Alessandro | Signed off to coordinator


Publishable Executive Summary

The understanding of ecosystem processes and functioning, as well as their relation to environmental pressures and threats, is one of the important questions being addressed at different scientific and political scales. In order to analyse these questions, sound and well-documented data are needed. The long-term ecological research and monitoring network (LTER), with its wide range of LTER sites and LTSER platforms, provides an important source of data for this assessment. LTER is organised in different national networks sharing a common conceptual basis, which are nevertheless managed by different organisations with their own funding regimes and organisational structures.

The eLTER H2020 project aims to develop the core components of a common eLTER Information System, enhancing the accessibility and usability of the data. The work aims to link data from the different distributed resources and make them available not only to scientific experts but also for more general use. The documentation of the site network as well as the provision of metadata and data are important tasks.

The current report aims to collect information on the general use cases describing the needs for a common eLTER Information System. In addition, relevant existing large-scale e-infrastructures are evaluated and relevant components for the implementation of the eLTER Information System are described. The requirements were collected by analysing existing documents from previous projects related to the implementation of data infrastructures.

The report is the basis for the development of the common architecture for the implementation of the eLTER Information System, which is described in the report D8.1 Architecture design.


Contents

1. Introduction
1.1. Aim of the document
1.2. eLTER approach to systems architecture development
2. Methods
3. Characterisation of LTER Europe
3.1. Network of networks
3.2. Network of Sites
3.3. Network of Data Management
3.4. Network of Observation and Spatial Data
4. Requirements
4.1. User Stories
4.1.1. Site documentation
4.1.1.1. Exchange of site information
4.1.1.2. Site registry service
4.1.2. Person documentation
4.1.3. Metadata authoring
4.1.4. Metadata harvesting and exchange
4.1.5. Data sharing and data services
4.1.6. Data discovery and visualisation
4.1.7. Semantic harmonisation
4.1.8. Provide facilities to author MD on sites, datasets and data services
5. State of the art - general architectures for distributed data
5.1. LifeWatch
5.2. GEOSS
5.3. DataOne
5.4. TERENO
5.5. UK Data Infrastructure
5.6. INSPIRE
5.7. EUDAT Collaborative Data Infrastructure (CDI)
5.8. ENVRI
6. State of the art - tools and services
6.1. Data discovery
6.1.1. B2Find
6.1.2. GeoNetwork
6.1.3. pyCSW
6.1.4. GI-CAT
6.1.5. Metacat
6.2. Metadata authoring and sharing
6.2.1. DEIMS
6.2.1.1. Site documentation
6.2.1.2. Dataset documentation
6.2.1.3. Export metadata
6.2.2. Morpho
6.2.3. INSPIRE MD Editor
6.2.4. EDI MD Editor
6.3. Metadata harvesting
6.4. Data archiving and sharing
6.4.1. B2Share
6.4.2. GET-IT
6.4.3. TEODOOR Suite
6.5. Implementation strategy
6.5.1. Implementation strategy for data nodes (GET-IT)
6.5.2. Implementation strategy of SOS web map client in DIP
6.5.3. Implementation strategy of GeoNetwork in DIP
6.5.4. Implementation strategy of B2FIND in DIP
7. References


Glossary

Term Definition

API An application programming interface (API) is a set of subroutine definitions, protocols, and tools for building application software. In general terms, it's a set of clearly defined methods of communication between various software components

CSW Catalog Service for the Web (CSW), sometimes seen as Catalog Service - Web, is a standard for exposing a catalogue of geospatial records in XML on the Internet (over HTTP). The catalogue is made up of records that describe geospatial data (e.g. KML), geospatial services (e.g. WMS), and related resources

DEIMS The Dynamic Ecological Information Management System, providing a site and dataset catalogue that enables users to author, discover and share metadata on long-term research sites, datasets, data products, persons, and networks.

DIP The Data Integration Portal is the central discovery and data visualisation component of the eLTER Information System.

EUDAT The EUDAT Collaborative Data Infrastructure is essentially a European e-infrastructure of integrated data services and resources to support research.

GET-IT The Geoinformation Enabling ToolkIT starterkit® (GET-IT) is an open-source software suite developed by researchers from IREA and ISMAR in the context of the RITMARE Flagship Project and LifeWatch Italy (http://www.ritmare.it). The suite is the first open-source collaborative effort toward the integration of traditional geographic information with observational data.

ISO The International Organization for Standardization (ISO) is an international standard- setting body composed of representatives from various national standards organizations.

LDAP Lightweight Directory Access Protocol (LDAP) is a client/server protocol used to access and manage directory information. It reads and edits directories over IP networks and runs directly over TCP/IP using simple string formats for data transfer.

MD modules MetaData modules within the DIP that manage, transform and deliver metadata to users


OGC The Open Geospatial Consortium (OGC), an international voluntary consensus standards organization, originated in 1994. In the OGC, more than 500 commercial, governmental, non-profit and research organizations worldwide collaborate in a consensus process encouraging development and implementation of open standards for geospatial content and services, sensor web and Internet of Things, GIS data processing and data sharing.

RITMARE The RITMARE Flagship Project is one of the National Research Programmes funded by the Italian Ministry of University and Research. RITMARE is the leading national marine research project for the period 2012-2016; the overall project budget amounts to 250 million euros, co-funded by public and private resources.

SOS The Sensor Observation Service (SOS) is an Open Geospatial Consortium web service created for the provision of observational data, including specialised observation types (e.g. multipointObservation, trajectoryObservation, gridObservation, specimenObservation).

W[FCM]S The OGC Web Feature Service (WFS) Interface Standard provides an interface allowing requests for geographical features across the web using platform-independent calls. The OGC Web Coverage Service (WCS) Interface Standard defines Web-based retrieval of coverages – that is, digital geospatial information representing space/time-varying phenomena. The Web Map Service (WMS) is a standard protocol for serving (over the Internet) georeferenced map images which a map server generates using data from a GIS database.


1. Introduction

To assess the effects of environmental pressures on ecosystem functions and the derived ecosystem services, a deep understanding of the underlying processes is needed. Long-term observations distributed along different geographic, ecosystem and environmental gradients are an important source of data to address these questions (Mirtl, 2010). Mostly starting in the early 1990s, a range of monitoring sites and research projects was initiated, collecting information on the environment in different compartments, many addressing important environmental threats like acidification and forest die-back. This also resulted in a variety of different sites. Getting an overview of the available resources is often difficult and sometimes hampered by the fragmentation of the information sources or unclear access conditions.

LTER Europe (see https://www.lter-europe.net/), established in 2003, aims to improve harmonisation and standardisation of long-term observation in the ecosystem domain. With its network of sites and researchers it is one of the European-scale infrastructures in place providing data and expertise in order to bridge the gap between science and decision making. Getting an overview of and access to information on available data and scientific observation infrastructures is therefore one of the central points for the success of the implementation of data-intensive science.

The curation of data and the provision of metadata are key processes in each of the environmental projects. Getting an added value on the comparison of the ecological data, also between different projects, requires that the data is available and discoverable for usage by researchers. A distributed infrastructure allows each site to maintain data that has been collected, but a structured life cycle and information technology (IT) tools must be defined.

The project eLTER (H2020) works towards the further implementation of services for the LTER community and enhancing data accessibility. Focusing on the requirements and conceptual framework for data integration is part of the tasks to implement an eLTER Information System. For this, the IT requirements of various user groups were collected, ranging from data users (e.g. research communities) and data providers (e.g. LTER site managers) to data networks (e.g. GEOSS harvesting metadata). This includes the development of a common data policy in order to reduce data access restrictions to the degree possible. Figure 1 provides an overview of the design of the eLTER project showing the interlinking of the data infrastructure design components (WP NA3), the data infrastructure implementation components (WP JRA1) and the scientific work packages as data providers and data consumers.


Figure 1 - Overall project design of the eLTER H2020 project and the links of WP3.

European-scale infrastructure projects like ENVRI and ENVRI+1 support the work on the development of the eLTER Information System by providing general guidelines for the design and implementation of interfaces. There is a need to foster not only the sharing of metadata in a standardised and machine-readable manner, applying metadata standards (e.g. ISO 19115/19139), but also to address the sharing of data as an important step in the information flow.

By this, the eLTER Information System will be an additional information provisioning node in the European and global landscape of e-infrastructures dealing with environmental and biodiversity data. The evaluation of relevant data infrastructures, either providing best practice or being a relevant node to connect to, is part of the work. European-scale regulations, like INSPIRE, or global data networks, like LifeWatch or DataOne, were also taken into consideration.

1 See http://www.envriplus.eu/ and https://confluence.egi.eu/display/EC/ENVRI+Reference+Model


1.1. Aim of the document

This document provides an overview of the actions taken to provide the basis for the decisions on the design and the implementation of the eLTER Information System. It describes the basic high-level user stories defined in order to provide a basis for the evaluation of different architectures and the related existing or planned tools. Comparing best practices and architectures from the field of environmental data management helps to define the needs and components of the intended implementation of the eLTER Information System.

1.2. eLTER approach to systems architecture development

The eLTER project is taking an iterative approach to the development of the information system architecture. The different components and links in the system will be developed over time rather than designing a complete system before implementation. The architecture document provides an overview of the system architecture as currently envisaged by the project, but it can be expected that this will change as assumptions about the design are challenged by users as they use initial implementations of system components. The process will be driven by a number of “user stories” or test workflows of how we think users will work with the system. These are documented at the end of the report.


2. Methods

The eLTER H2020 project is, inter alia, supporting the LTER Europe community by providing guidelines for the harmonisation of the observation programme as well as by developing common services for data discovery and sharing. The work also aims to provide the basic e-infrastructure to be implemented by the different member nodes. eLTER H2020 is therefore in a line of activities starting with the ALTER-Net project in 2003.

The requirements for the eLTER Information System were collected based on the evaluation of previous reports (e.g. ALTER-Net, EnvEurope or ExpeER) as well as by performing targeted interviews with selected partners. These covered three groups: (a) data providers, (b) data users, and (c) infrastructure providers. A basic ranking of the needed functionalities was done based on group discussion in the eLTER Development team. This led to the basic functionalities which are also the outline for the architecture design and decision.

In addition, relevant research infrastructures, e-infrastructures, and service providers were analysed in order to select options for the development and implementation. Special focus was given to communities and regulations relevant for scientific data sharing. This included INSPIRE as the main data publishing framework for European-scale administrative data as well as GEOSS and DataOne. Evaluations done by the ENVRI+ project were taken into account. This led to the selection of relevant software tools covering different aspects of the requirement catalogue.

Finally, a matrix was developed mapping the requirements against the software tools. Based on this work the architecture for the eLTER Information System was defined, which was designed as an iterative, user-driven process (see chapter 1.2).

The report is structured along these lines:

(a) Providing an overview and a short characterisation of the LTER Europe network
(b) Providing an outline of the user requirements
(c) Providing an overview of relevant e-infrastructures and data architectures
(d) Providing an overview of relevant open source tools and services
(e) Mapping requirements against the software tools

The work was also guided by previous developments in the LTER Europe context. A guiding principle was to focus on open source products as well as to foster an open access policy wherever possible and useful.


3. Characterisation of LTER Europe

3.1. Network of networks

Long-Term Ecosystem Research (LTER) is an essential component of worldwide efforts to better understand ecosystems. Through research and monitoring, LTER seeks to improve our knowledge of the structure and functions of ecosystems and their long-term response to environmental, societal and economic drivers. LTER contributes to the knowledge base informing policy and to the development of management options in response to the Grand Challenges under Global Change.

LTER-Europe is the European network for Long-Term Ecosystem Research (LTER), comprising the different national LTER networks. The network is the result of a 15-year de-fragmentation and integration process of ecosystem research infrastructures in 30 countries, which resulted in formal LTER networks in 21 countries with well-established national and European governance structures, embedded in the global LTER network (ILTER). It was founded in 2003 as the outcome of the Network of Excellence project ALTER-Net. Figure 2 gives an overview of the distribution of the current member networks and the year of joining LTER Europe.

Figure 2 - LTER Europe Member networks: Year a member joined the network


Since its launch in 2003, LTER-Europe has sought to better integrate traditional natural sciences and holistic ecosystem research approaches that include studies of the human component. LTER-Europe was heavily involved in developing the concept of Long-Term Socio-Ecological Research (LTSER). As a result, LTER-Europe now comprises not only LTER sites but also larger LTSER platforms, where long-term interdisciplinary research is encouraged.

Figure 3 - Hierarchical design of the elements of LTER Europe

The long-term mission of LTER-Europe is (a) to track and understand the effects of global, regional and local changes on socio-ecological systems and their feedbacks to environment and society, as well as (b) provide recommendations and support for solving current and future environmental problems.

In order to implement this mission, the following main objectives can be formulated for LTER Europe and the implemented scientific use cases:
● To identify drivers of ecosystem change across European environmental and economic gradients
● To explore relations between these drivers, responses and developmental challenges under the framework of a common research agenda, and referring to harmonised parameters and methods
● To develop criteria for LTER Sites and LTSER Platforms to support cutting edge science with a unique in-situ infrastructure
● To improve co-operation and synergy between different actors, interest groups, networks, etc.

LTER-Europe works towards its objectives by providing a framework for project development, conceptual work, education, exchange of know-how, communication and institutional integration.


LTER-Europe is a regional network of ILTER2, the international Long-Term Ecological Research Network (see Figure 3, hierarchical design). European LTER networks are members of ILTER. LTER-Europe helps to coordinate ILTER activities in Europe and represents European-level interests in ILTER. There are 20 European national networks in ILTER, from a total of 41 networks.

3.2. Network of Sites

eLTER, as a subset of the LTER Europe network, builds on and further consolidates the LTER in-situ infrastructure of the European LTER network, “LTER-Europe”3, comprising about 420 formally acknowledged ecosystem research sites (65% terrestrial, 26% aquatic and 9% transitional-water LTER Sites) and 35 LTSER Platforms for socio-ecological research at the regional scale. The infrastructures are operated by around 100 institutions. LTER-Europe has condensed research sites originally set up in varying contexts (projects and networks driven by national/institutional strategies and domain-specific requirements), which focus on investigating entire ecosystems and comply with the five pillars of LTER:

1. Long-term - dedicated to the continuous collection, documentation, provisioning and use of long-term data on ecosystems with a time horizon of decades to centuries (covering the aspect of natural capital for sustainable development)
2. In-situ - data generation at different spatial scales across ecosystem compartments of individual in-natura sites, European environmental zones and socio-ecological regions
3. Process orientation - aims at identifying, quantifying and studying the interactions of ecosystem processes affected by internal and external drivers. For socio-ecological systems the process orientation implies processes related to ecosystem services and their use.
4. System approach - interactions of abiotic and biotic components at different scales in a given system
5. Wide-scale coverage - of major European terrestrial and aquatic environments

All sites and national networks comply with a refined site classification ranging from highly instrumented master sites (19%) to regular LTER sites (44%), extensive sites (24%) and emerging sites (4%) (Figure 4). A set of about 50 metadata attributes (installations at the sites, covered research topics, data policy, etc.) in the DEIMS site documentation of LTER (see chapter 1.4) allows for the fast and objective selection of subsets of sites which are suited for specific project involvements.

2 http://www.ilternet.edu/ 3 http://www.lter-europe.net


Figure 4 - LTER Europe Site network: classified by Site categories (n=510)

Currently, in total 420 LTER Sites (regular, extensive and master) and 35 LTSER platforms conducting socio-ecological research are described for the site network of LTER Europe (Table 1). Each of the sites needs to be approved by the national LTER network. eLTER thereby builds on a European infrastructure pool and data legacy established over the past one hundred years (most sites in the 1980s and 1990s) and an approximate cumulative infrastructure value of 450 million EUR.

LTER Europe is a network of national networks which are running the different long-term monitoring sites. Funding and organisation of the work lies with the different local institutions, coordinated by the national as well as the European LTER network. The site network is comprised of different facility types, which are defined as follows:

LTER Facility: An umbrella term for any location where LTER might take place (site, in-situ components) and whatever might facilitate LTER-activities (e.g. logistics, laboratories, on-site supporting institutions).


LTER Site: (“traditional” LTER-Site; Long Term Ecosystem Research Site): LTER-facility of limited size (up to 10 km²) and comprising mainly one habitat type and form of land use. Activities are concentrating on small scale ecosystem processes and structures (biogeochemistry, selected taxonomic groups, primary production, disturbances etc.).

LTSER Platform (Long-Term Socio-Ecological Research Platform): Modular LTER-facility consisting of sites which are located in an area with defined boundaries. Besides this physical component LTSER-Platforms provide multiple services like the networking of client groups (e.g. research, local stakeholders), data management, communication and representation (management component). The elements of LTSER Platforms represent the main habitats, land use forms and practices relevant for the broader region (up to 10000 km²) and cover all scales and levels relevant for LTSER (from local to landscape). LTSER-Platforms should represent economic and social units or coincide/overlap with such units where adequate information on land use history, economy and demography is available to allow for socio-ecological research.

In general LTER facilities are provided and operated by national networks, which are members of LTER-Europe.

Table 1 - Number of LTER sites and LTSER platforms in Europe per country (including non-accredited)

National network | LTER Site (Emerging / Extensive / Regular / Master) | LTSER | Total

Austria 8 13 16 4 1 42

Belgium 2 2 25 8 37

Bulgaria 2 5 7

Czech Republic 12 5 7 24

Denmark 1 1 2

Finland 2 8 5 1 16

France 1 4 12 17

Germany 13 10 10 33

Greece 1 2 3 6

Hungary 1 4 4 9

Israel 1 4 5 2 12


Italy 12 76 7 5 100

Latvia 2 2 1 5

Netherlands 2 2

Norway 1 7 1 9

Poland 1 16 17

Portugal 1 10 11

Romania 4 6 10

Serbia 1 3 1 5

Slovakia 2 4 1 7

Slovenia 1 4 8 13

Spain 2 18 1 2 23

Sweden 1 1 8 4 14

Switzerland 1 1 2 19 23

United Kingdom 40 8 13 61

Total 16 117 248 90 34 505

LTER Sites and LTSER Platforms vary in size and distribution. Whereas an LTER Site has an average size of up to 100 ha, LTSER Platforms can be much more extensive, covering several thousand square kilometres. Figure 5 shows the median size in ha of the LTER sites and LTSER Platforms for the different national LTER networks. This ranges from 0.25 ha in Poland to 600,000 ha in France.


Figure 5 - LTER Europe Site Network: Median size (in ha) of the LTER sites

3.3. Network of Data Management

The management of data for most of the LTER sites lies within the responsibility of the respective site. This creates a high heterogeneity in the data management processes and tools, which is a challenge to cope with in the development of a common data sharing infrastructure within the LTER Europe context. The implementation of mechanisms and workflows for data integration, harmonisation and accessibility is challenging. The implementation of a service-based architecture for data exchange is on-going, also with respect to the implementation of the INSPIRE directive. This mainly focuses on geospatial data, which also forms an important part in the LTER context.

The starting point for the evaluation of the requirements was the analysis of the current data management tools and services. The analysis of the technical infrastructure of LTER Europe is based on a data export from DEIMS4 and provides an overview of the (data) infrastructure of LTER Europe. Please note that in many cases multiple selections are possible for a site, e.g. a site can offer both a WMS and a WFS service. Additionally, not all sites provided information about each aspect of their data infrastructure.

Basic information on the data management at the different LTER sites was provided using DEIMS. In the section on Infrastructure and Data Management relevant characteristics are listed. This included data storage, standards, metadata and data provision.

4 https://data.lter-europe.net/deims/


A short description of the policies applied for data sharing can be extracted. 43.41% of the sites provided information on these subjects. This is an ongoing survey which is updated annually. Data were extracted from DEIMS using SQL queries and analysed using R and Excel.

Management of data within the LTER network is distributed. Data are usually managed by the local sites and shared e.g. with international programmes (e.g. ICP Integrated Monitoring, ICP Forests) using the respective data sharing formats. Within the EnvEurope project a common data reporting format was developed (Peterseil et al., 2013) which is also used in eLTER H2020 to collect datasets from the different LTER sites contributing to the VA programme.

Looking at the storage and management of long-term data at the different LTER sites, it shows that the vast majority of information is stored in structured files or spreadsheets (45.5%), followed by relational databases (32.4%). Spatial information is stored either in files (21.7%) or in a geodatabase (19.0%). Figure 6 shows the distribution of the different data storage formats. The majority of data is managed in a structured way (structured files or databases). Nevertheless, a minority of the data is still managed either in proprietary file formats (e.g. of a given sensor) or on paper.

Figure 6 - Data Formats: Storage formats used by LTER Europe sites (n=1066; multiple choice possible)

Also different data models are used to store and manage the data. Whereas the data models underlying the databases are often better described and fixed, the data models used for structured files can be very flexible. This is true for the naming of the different columns as well as for the reference lists (e.g. species lists) used within the data. DEIMS does not account for this information, but based on previous projects an exemplary analysis of selected data was done (e.g. EnvEurope Common Data Reporting Format).

Derived requirements for eLTER H2020:

1. Different data storage formats need to be supported by an eLTER Information System. This needs to include data stored in databases as well as single data files (in different formats).
2. A common vocabulary is needed in order to describe the structure and content of the data files.
3. A repository for reference lists is needed in order to reference them. If possible, an online reference from the data file to the reference list should be available.

For each observation site, most of the data are managed in a central data storage location (46.6%, see Figure 7). Nevertheless, about half of the sites answered that data are distributed either over different locations within one organisation (29.1%) or even between different organisations (23.1%). This reflects the situation that the collection and management of the data for a big share of the LTER Sites is done by the research community themselves, who can belong to different research teams or even organisations. Providing a common repository, e.g. to document available data for the LTER network, is one of the requirements derived from that.

Figure 7 - Data storage location on the LTER Europe sites (n=515; single choice)


When analysing the average number of different data storage locations (see Figure 8) it can be shown that even for the answer ‘central data management location’ (see Figure 7) more than one location (e.g. database) could be present. We assume that the data is managed as close to the research groups as possible, following the principle to manage the data as close to the place of data generation as possible. This allows for a good data quality evaluation, but also accounts for issues with regard to standardised data documentation and reporting, as many different user communities and groups are involved in the process. Issues concerning data security and long-term archiving can also be related to these characteristics.

Figure 8 - Number of data storage location of LTER Europe sites (n=515; single choice)

Derived requirements for eLTER H2020:

1. A distributed network of data sources (currently composed of databases, virtual nodes that share data, different types of data storage, etc.) should be supported by the eLTER Information System, allowing access to metadata and data from different data providers.
2. As not all data is stored in well-managed data repositories, central services to store, document and archive data should be developed.
3. The integration of local data management procedures using central services of the eLTER Information System should be ensured.


When focusing on data sharing, only two thirds of the LTER sites provided information on this aspect (see Figure 9). Only about one fifth of the LTER sites provide online access to the data (20.2%). These are mostly classic data portals providing information about the data access or allowing data access or visualisation via a data portal (see Figure 10).

The majority of the sites either provide only offline requests for the data (e.g. by requesting data via mail or getting in contact via phone) or did not provide any information.

Figure 9 - Data request formats by LTER Europe sites (n=919; single choice)

Machine-readable services, such as CSW for metadata, WMS or WFS for geospatial data or SWE for time series, are hardly widespread. About a third of the sites did not provide a detailed answer. This shows that data services for automated machine-to-machine communication are not widely implemented, leading to an important task in the development of the eLTER Information System dealing with capacity building and creating a community of practice for data sharing. Observation networks like TERENO (Germany) or RITMARE (Italy) are focusing on the service-based provision of data.


Figure 10 - Data Services provided by LTER Europe sites (n=477; multiple choice possible)

Derived requirements for eLTER H2020:

1. eLTER should support the use of common data services as standard interfaces for data sharing and exchange.
2. eLTER should support the development of a community profile for the documentation of the data and provide machine-readable endpoints (e.g. CSW) to exchange metadata.

3.4. Network of Observation and Spatial Data

The main aim of the LTER Europe site infrastructure is to provide data to analyse and understand ecosystem processes and their vulnerability to environmental changes (e.g. climate change). This addresses the different compartments in an ecosystem, resulting in a wide range of data types to be supported. The following listing of data types is intended to show the variety of data types and does not try to cover the full range of possibilities. The approach is to describe the different types of data (observation and spatial data) in order to illustrate what types of data can be managed in the eLTER Information System. In the user stories (see chapter 4.1) some of the data types are included in the specific use cases.


Point observation

Definition: single observation at a fixed location by means of a human or an instrument (e.g. species function, property or characteristics).
Sampling interval: a time instance.
Features: the object on which the observation is performed (e.g. the species).
Procedure: observation methodology (e.g. floristic-statistical method) or sensor.
Values type: numeric, textual, category, count, and truth.
Original data format: file based (e.g. occurrences at different points) or database.

Time series observations

Definition: continuous or repeated observation at a fixed location by means of human observation (e.g. vegetation or habitat mapping) or an instrument. It can be referred to as a time series collection with a defined frequency or with different sampling times.
Sampling interval: single point measurement, campaign based or continuous (larger intervals).
Features: the object on which the observation is performed (e.g. species, population, habitat, etc.).
Procedure: observation methodology (e.g. floristic-statistical method) or sensor.
Values type: numeric, textual, category, count, and truth.
Original data formats: file based (e.g. occurrences at different points) or database.

Profile Observation

Definition: observations collected at varying or discrete depths or altitudes (e.g. water column, air column, borehole or drilling of soil or ice), usually by instrument. The location of observations can be identified by the position of the column or located within the column with a relative position.
Sampling interval: all observations are collected at the same time instance or at different time instants according to the depth or altitude.
Features: water column, air column, borehole or drilling of soil or ice, etc.
Procedure: sensor.
Values type: numeric, textual, category, count, truth, and array.
Original data formats: file based or database.

Trajectory Observation

Definition: observations collected continuously along a trajectory (e.g. surface water temperature, transect measurement). The location of the observations, including altitude or depth, can be identified by continuous GPS data logging.
Sampling interval: continuous observation.
Features: water, air, habitat, etc.
Procedure: sensor.
Values type: numeric, textual, category, count, truth, array, and geometry.
Original data formats: file based or database.

Sample based observations

Definition: the observation is done using a sample taken in the field. Sampling can either be a single event or a continuous process. The samples are taken to the laboratory to be analysed; data are generated with a time delay. The location can be identified by the position where the sample was collected.
Sampling interval: single event or continuous sampling.
Features: water, air, habitat, etc.
Procedure: laboratory method.
Values type: numeric, textual, category, count, truth, and array.
Original data formats: file based or database.

Spatial information

Definition: data with a direct or indirect reference to a specific location or geographic area. The dataset is distributed in a spatial data format (e.g. shapefile, KMZ, GeoJSON, GML, geo-relational database); the data are gathered as features and the geographical representation is of three types: point, line and polygon.
Sampling interval: single event.
Feature: all environmental areas that can be represented as points, lines, or polygons.
Procedure: observation methodology, human and sensor.
Values type: numeric, textual, category, count, and truth.
Original data format: file based or database.

Coverage data

Definition: coverages are used to describe characteristics of real-world phenomena that vary over space and/or time. In practice, the notion of coverages encompasses regular and irregular grids, point clouds, and general meshes. Typical examples are 1-D temperature (time series or vertical profiles), 2-D elevation, 2-D precipitation, 2-D imagery, 2-D x/y/t image time series, x/y/z geophysical voxel data, and 4-D x/y/z/t weather data. A coverage contains a set of such values, each associated with one of the elements in a spatial, temporal or spatio-temporal domain.
Sampling interval: single event or continuous sampling.
Feature: all environmental areas.
Procedure: observation methodology, human and sensor.
Values type: numeric, textual, category, count, and truth.
Original data format: file based.


4. Requirements

The eLTER H2020 project aims to collect the requirements for a common and shared eLTER Information System supporting scientific workflows and related information. By this, eLTER aims to develop IT services for data provisioning and user-need-driven public access. In line with the description of action it will aim to:

● Open up access to the long-term data legacy gathered by about 160 VA sites and further in-kind contributors through an easily accessible Data Integration Portal (eLTER-DIP).
● Enable question-driven selections of sites for multiple uses (physical use or data) through an improved site identification web service on the basis of consistent and robust site metadata on up to 500 sites in the European Research Area.
● Enhance the ICT capacities of data providers (sites) by catalysing the establishment of Data Nodes supported by the eLTER Service Suite for the provisioning of data. This will take into account the practical realities facing data providers (e.g. technical capabilities, data policies) with the intention to improve capacities in a customised way.

Beyond Europe, eLTER will make efforts to (1) secure the embedding in related international networks and infrastructure designs, and (2) avoid duplication of conceptual efforts by considering best examples for integrated ecosystem research infrastructure designs.

As described in chapter 2, use cases and derived user stories were formulated covering the main user requirements and needs. These user stories were used to map requirements against functionalities of the different tools (see chapter 7).

4.1. General user stories and use cases

As a result of the requirement analysis process, a number of user stories and use cases were defined for the LTER Europe network covering the priority aspects for data publishing and sharing. The user stories provide the general outline for the use cases, forming a bracket around a number of use cases.

Chapter 4.1 provides a description of the user stories addressed, focusing on the documentation, provision, and discovery of metadata as well as data.

User story 1: As a [data provider], I want [to publish] a [citable eLTER dataset] or a [stream of time series data] so that I can securely share my data (e.g. research data, sensor data), meet journal / funding requirements and gain recognition for data outputs


Description: A [data provider] wants to publish and share a defined dataset with the user community. This dataset needs to be citable in order to create credit for the data provider. He/she goes to the LTER Europe node in order to document and reference the information. In addition, a [data provider] wants to publish and share a stream of data (from a sensor) with the user community for on-going analysis. This data service needs to be citable in order to create credit for the data provider. He/she goes to the LTER Europe node in order to document and upload the information and enable others to use and cite this service. A minimal sketch of this deposit step is given after the lists below.

● A [data provider] wants [to share] a [citable dataset] with the user community
○ A [data provider] wants to [document] his [dataset]
○ A [data provider] wants to [deposit] his [dataset]
○ A [data provider] retrieves a persistent data identifier (e.g. DOI) from a data repository while uploading a data file
○ A [data provider] provides the [dataset] for service based [data publication]

● A [data provider] wants to [document] his [research site] and [person information]
○ A [data provider] wants to [document] related [research sites]
○ A [data provider] wants to [document] related [persons]

● A [data provider] wants to [share] his [observations]
○ A [data provider] wants to [document] the [sensor] which is used to gather the [observations]
○ A [data provider] wants to [deposit] his [observations]
○ A [data provider] retrieves a persistent data identifier (e.g. DOI) from a data repository while uploading [observations]
○ A [data provider] provides the [observations] for service based [data publication]

● A [data provider] wants to [share] his [spatial/coverage data]
○ A [data provider] wants to [deposit] his [spatial/coverage data]
○ A [data provider] retrieves a persistent data identifier (e.g. DOI) from a data repository while uploading [spatial/coverage data]
○ A [data provider] provides the [spatial/coverage data] for service based [data publication]
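
The deposit-and-cite step above can be illustrated with a short sketch. The repository endpoint, field names and publish workflow below are hypothetical placeholders (no specific eLTER repository API is implied); they only show the general pattern of creating a draft record, attaching a file and receiving a persistent identifier on publication.

```python
# Minimal sketch of the "deposit and get a citable identifier" step from user story 1.
# Endpoint paths and field names are illustrative placeholders, not a real repository API.
import requests

REPOSITORY_API = "https://repository.example.org/api/records"  # hypothetical endpoint
API_TOKEN = "my-access-token"                                  # data provider's access token
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

def deposit_dataset(csv_path: str, title: str, creator: str) -> str:
    """Create a draft record, attach the data file and return the identifier (e.g. a DOI)."""
    metadata = {"title": title, "creator": creator, "open_access": True}
    draft = requests.post(REPOSITORY_API, json=metadata, headers=HEADERS)
    draft.raise_for_status()
    record = draft.json()

    # Upload the data file to the draft record
    with open(csv_path, "rb") as fh:
        requests.put(f"{REPOSITORY_API}/{record['id']}/files/data.csv",
                     data=fh, headers=HEADERS).raise_for_status()

    # Publish the record; the repository mints a persistent identifier on publication
    published = requests.post(f"{REPOSITORY_API}/{record['id']}/publish", headers=HEADERS)
    published.raise_for_status()
    return published.json().get("doi")

# doi = deposit_dataset("site_timeseries.csv", "Soil temperature 2016", "Jane Doe")
```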

User story 3: As a [site network manager], I want [to join the network] with an [existing data node] so that I can enable my site managers to publish their data to eLTER


Description: A [network manager] wants to join the existing eLTER data network to publish and share data files and data streams with the user community and to publicise their site network data for reuse and citation. The data services need to be visible through the eLTER DIP and the data accessible to eLTER users.
● A [data manager] configures and runs a [Data Node] as MD provider in order to share the information with a wider community
○ A [data manager] sets up the [Data Node]
○ A [data manager] runs the [Data Node]
○ A [data manager] backs up the [Data Node]
○ A Data Node [DN] can be registered at the central discovery platform [DIP]

User story 2: As a [data user], I want [to discover] a [citable eLTER dataset] or a [stream of time series data] and [access] them in order to perform a scientific analysis. Clear access rules and IPR regulations are needed.

Description: A [data user] wants to discover datasets or streams of data (e.g. from a sensor) from the central data catalogue in order to perform an analysis with the data based on defined research questions. Data access needs to be granted to the [data user] and clear guidelines for the data use need to be given. He/she goes to the LTER Europe DIP and searches for the data needed based on spatial, temporal or thematic criteria.

● All metadata can be searched and accessed from a central location - the Data Integration Portal - and data accessed whilst residing on Virtual Data Nodes and other linked eLTER data resources.
○ A [data user] accesses the DIP to search for information
○ A [data user] posts a criteria-based search for [datasets]
○ A [data user] selects relevant [datasets] from the search list
○ A [data user] accesses (either directly or via download) the selected [datasets]
○ A [data user] requests access to restricted [datasets]
○ A [data provider] grants access to a data user based on a request
○ A [data user] logs in or accesses the DIP anonymously
○ A [data user] filters displayed SOS sources by giving a time frame

In addition two user stories were defined:

● User Story: Temperature data across eLTER can be accessed, searched and visualised.
● User Story: As a Research Infrastructure, integrate DOIs within the eLTER metadata infrastructure.


4.2. Specific use cases

Based on the general user stories and use cases defined in chapter 4.1, specific technical requirements and tools can already be formulated. These address different technical aspects of the implementation. This provides additional input to the evaluation of the state-of-the-art architectures and tools (see chapters 5 and 6). The current chapter also includes the needs for adaptation of the current tools in place for LTER Europe.

4.2.1. Site documentation

The Dynamic Ecological Information Management System (DEIMS) is the current metadata catalogue used by LTER-Europe to describe the information assets of the network, including datasets and site facilities. The site documentation contains comprehensive information about the ecological aspects of a site and its infrastructure. Site information can be exposed as XML. Currently, DEIMS stores comprehensive information about more than 900 sites that can be searched, visualised and downloaded. DEIMS not only allows a site to be described directly by filling in the site and person forms; in addition, datasets and data products can be described and connected to a site.

Exchange of site information

Using a standardised format for site information, such as EMF, allows the exchange of site information between different data nodes. An implementation for DEIMS to support the export of site and linked information as INSPIRE EMF (Environmental Monitoring Facilities) is currently being developed.

In the existing architecture of DEIMS and the eLTER Information System an XML export is a more sustainable data format, using e.g. the INSPIRE EF Data Specification as the basic metadata model.

Sharing and exchange of site information

In addition to exposing site information as EMF, a harvesting process has to be defined. There are multiple ways of realising this. One of them includes the generation of harvest lists that link to each EMF record. A harvester can be provided with the URL of the harvest list and can automatically download all EMF records, possibly updating existing records based on their checksums. For that form of harvesting a simple harvester would need to be developed.
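
A minimal sketch of such a simple harvester is given below. It assumes a plain-text harvest list with one EMF record URL per line; the list URL and the local storage layout are illustrative assumptions.

```python
# Minimal sketch: fetch a harvest list, download each EMF record and only overwrite a
# local copy when its checksum has changed. URLs and list format are assumptions.
import hashlib
import pathlib
import requests

HARVEST_LIST_URL = "https://deims.example.org/emf/harvest-list.txt"  # hypothetical
LOCAL_DIR = pathlib.Path("emf_records")

def harvest():
    LOCAL_DIR.mkdir(exist_ok=True)
    record_urls = requests.get(HARVEST_LIST_URL).text.splitlines()
    for url in filter(None, record_urls):
        xml = requests.get(url).content
        checksum = hashlib.sha256(xml).hexdigest()
        target = LOCAL_DIR / (url.rstrip("/").split("/")[-1] + ".xml")
        # Skip records whose content did not change since the last harvest
        if target.exists() and hashlib.sha256(target.read_bytes()).hexdigest() == checksum:
            continue
        target.write_bytes(xml)

if __name__ == "__main__":
    harvest()
```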

Another way of harvesting EMF records using existing software and architecture is the development of an EMF plugin for Geonetwork. Such a plugin would not only allow Geonetwork to be used as an EMF editor, but would also make Geonetwork's harvesting capabilities available for distributing that information. That way Geonetwork could also be connected to an existing EMF editor to aid it with harvesting capabilities. The development of a GeoNetwork plugin is likely to consume more resources than a custom harvester.


Site registry service

With the dispersion of site information there is also a need for unique identification of sites. An environmental monitoring facility might be part of multiple different research networks and have a different identifier in each network. This creates the need for a universal identifier that refers to a site independently of the network affiliation. For that purpose a site registry will be developed that is able to issue identifiers for research sites that are stable, resolvable and persistent (Figure 11). Much like an ORCID5 is designed to specifically identify a person, a site identifier provided by a designated site registry service will be used to identify a site. This service, which will be developed and provided by FZJ during this project, should distribute information in RDF (Resource Description Framework) to enable semantic interoperability.

Figure 11 - Potential workflow of site registry service
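
As an illustration of the intended RDF output, the following sketch builds a single site registry record with rdflib. The base URIs, the eLTER vocabulary namespace and the chosen properties are assumptions for illustration only; the actual registry service and its RDF profile are still to be defined.

```python
# Minimal sketch of a site registry record expressed as RDF with rdflib.
# All URIs and the vocabulary used for typing the site are hypothetical placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, RDFS

REGISTRY = Namespace("https://siteregistry.example.org/site/")  # hypothetical registry base URI
ELTER = Namespace("https://vocab.example.org/elter/")           # hypothetical eLTER vocabulary

g = Graph()
site = URIRef(REGISTRY["4f8c2b1e"])  # stable, resolvable identifier issued by the registry
g.add((site, RDF.type, ELTER.Site))
g.add((site, RDFS.label, Literal("Example LTER site")))
g.add((site, DCTERMS.isPartOf, URIRef("https://deims.example.org/network/example-network")))
g.add((site, DCTERMS.identifier, Literal("LTER_EU_XX_001")))  # network-specific identifier (placeholder)

print(g.serialize(format="turtle"))
```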

4.2.2. Person documentation

Users can describe themselves in the Dynamic Ecological Information Management System (DEIMS). The information includes: name, surname, address of the institution, role, membership network (e.g. LTER-Italy), and also the ORCID, in order to have a relationship between the person and a persistent digital identifier.

ORCID provides a unique person ID that can be used by other systems (ResearchGate, Scopus and ResearcherID all have easy ways to add your ORCID). ORCID is an open, non-profit, community-based effort to provide a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers. ORCID is unique in its ability to reach across disciplines and research sectors. ORCID cannot be considered as a persistent identifier.

5 http://orcid.org/


DEIMS, at this moment, does not utilise ORCID as a unique identifier, and the person is not represented by any type of Unique IDentifier (UID). Information about persons collected in DEIMS will be transformed into the RDF data model, in particular Friend of a Friend (FOAF), a machine-readable ontology describing persons, their activities and their relations to other people and objects.
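
A minimal sketch of this transformation is shown below, assuming person entries exported from DEIMS are available as plain dictionaries. The property choices (e.g. linking the ORCID via dcterms:identifier) and the base URI are illustrative assumptions, not the final DEIMS mapping.

```python
# Minimal sketch: express a DEIMS person entry as FOAF using rdflib.
# Base URI and property mapping are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, FOAF, RDF

PEOPLE = Namespace("https://deims.example.org/person/")  # hypothetical base URI

def person_to_foaf(entry: dict) -> Graph:
    g = Graph()
    person = URIRef(PEOPLE[entry["id"]])
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal(f"{entry['first_name']} {entry['surname']}")))
    g.add((person, FOAF.mbox, URIRef(f"mailto:{entry['email']}")))
    if entry.get("orcid"):
        # Link the person to the ORCID URI as an identifier
        g.add((person, DCTERMS.identifier, URIRef(f"https://orcid.org/{entry['orcid']}")))
    return g

example = {"id": "p-0001", "first_name": "Jane", "surname": "Doe",
           "email": "jane.doe@example.org", "orcid": "0000-0002-1825-0097"}
print(person_to_foaf(example).serialize(format="turtle"))
```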

4.2.3. Metadata authoring

The Dynamic Ecological Information Management System (DEIMS) is the current metadata catalogue used by LTER-Europe to describe the information assets of the network, including datasets and field site facilities. DEIMS has editing facilities to enter information, which is then held largely within an Ecological Metadata Language (EML) schema within a Drupal application. The information in this schema can then be transformed into ISO 19115 records for dataset descriptions and INSPIRE EMF records for site descriptions. This allows DEIMS to be used as an authoring tool for dataset and facility metadata that can then be shared through web service APIs, such as CSW, provided by the eLTER Metadata Management System.

4.2.4. Metadata harvesting and exchange

The LTER network is very heterogeneous in terms of technical and organisational aspects. This results in numerous different systems, setups and formats in usage.

Using INSPIRE-compliant standards (in addition to already existing standards and formats) is one way to cope with this situation. ISO 19139 records can be harvested using Geonetwork. This allows information to be easily shared between different nodes and can be used as a support system in the backend.

In order to be discovered and harvested, all metadata sources must offer their metadata through standardised services, e.g. OGC CSW.
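
The following sketch shows how such a standardised CSW endpoint can be queried and harvested with OWSLib. The catalogue URL is a placeholder; any GeoNetwork or pyCSW instance exposing CSW could be used instead.

```python
# Minimal sketch of querying an OGC CSW endpoint for metadata records with OWSLib.
# The catalogue URL is a hypothetical placeholder.
from owslib.csw import CatalogueServiceWeb
from owslib.fes import PropertyIsLike

CSW_URL = "https://catalogue.example.org/geonetwork/srv/eng/csw"  # hypothetical endpoint

csw = CatalogueServiceWeb(CSW_URL)
query = PropertyIsLike("csw:AnyText", "%LTER%")  # free-text filter on the metadata
csw.getrecords2(constraints=[query], maxrecords=25, esn="full")

for identifier, record in csw.records.items():
    # Each record is returned as a Dublin Core summary; ISO 19139 can be requested
    # via the outputschema parameter if needed.
    print(identifier, record.title)
```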

4.2.5. Data sharing and data services

The interoperability of spatial data services is characterised by the capability to communicate, execute or transfer data among them. Therefore, the spatial data services need to be further documented with additional metadata. To a lesser degree, it also concerns the harmonisation of the content of the services, in contrast to the implementing rules for spatial data sets.

For the development of the implementing rules mandated by the INSPIRE Directive 2007/2/EC the emphasis was first put on the core services, i.e., the network services, with Commission Regulation (EC) No 976/2009, and on the interoperability of the spatial data sets, in Regulation (EU) No 1089/2010.

The LTER scientific community, through the tools implemented or adopted in the project, can share data and metadata via OGC web services. The capability of different nodes to work together has only recently been realised and, from many viewpoints, is still a challenge. Data integration, through the use of a data service approach, is considered the prerequisite to advanced environmental monitoring and sustainable development. In particular, managing different and heterogeneous data sources of several types is a critical requirement for environmental monitoring, in particular in the field of sustainable development, for which the integration of information from different disciplines and domains (e.g. economic indicators, traditional GIS sources, output of models, time series of sensors monitoring the environment continuously) is a key point.
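
As an example of this service-based data access, the sketch below retrieves features from an OGC WFS endpoint with OWSLib. The service URL is a placeholder for a data node of an LTER site or network; GeoJSON output is assumed to be supported by the server.

```python
# Minimal sketch of retrieving features from an OGC WFS data service with OWSLib.
# The endpoint URL is a hypothetical placeholder; GeoJSON output depends on server support.
from owslib.wfs import WebFeatureService

WFS_URL = "https://data.example.org/geoserver/wfs"  # hypothetical data node endpoint

wfs = WebFeatureService(WFS_URL, version="2.0.0")
print(list(wfs.contents))                    # feature types offered by the data node

first_type = list(wfs.contents)[0]
response = wfs.getfeature(typename=[first_type], outputFormat="application/json")
with open("features.geojson", "wb") as fh:   # save the returned features locally
    fh.write(response.read())
```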

4.2.6. Data discovery and visualisation

Data discovery and visualisation is performed using a modified version of the Sensor Web application (Figure 12). It is based on the 52°North Helgoland project and enables exploration, analysis and visualisation of sensor time series data.


Figure 12 - Visualization - Map view

The application is able to connect to different Sensor Web endpoints (REST APIs) and offers a number of features: access to SOS instances (supporting the OGC SOS specification), diagram view of multiple time series (Figure 13), data export (PDF, CSV), etc.

Figure 13 - Visualization - Multiple data graph

These Sensor Web REST-APIs provide a thin access layer to sensor data via RESTful Web binding with different output formats.
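
A minimal sketch of pulling time-series data through such a Sensor Web REST API is given below. The base URL is a placeholder and the exact resource paths and parameters may differ between API versions; the time-frame filter corresponds to the use case of filtering SOS sources by a time span.

```python
# Minimal sketch of accessing a Sensor Web REST API of the kind used by the
# 52°North Helgoland client. Base URL and resource paths are assumptions.
import requests

API_BASE = "https://sensorweb.example.org/api/v1"  # hypothetical Sensor Web REST endpoint

# List the time series offered by this endpoint
series = requests.get(f"{API_BASE}/timeseries").json()
first_id = series[0]["id"]

# Request the observations of one time series for a fixed time frame
params = {"timespan": "2016-01-01T00:00:00Z/2016-01-31T23:59:59Z"}
data = requests.get(f"{API_BASE}/timeseries/{first_id}/getData", params=params).json()
print(len(data["values"]), "observations")
```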

4.2.7. Semantic harmonisation

Semantic interoperability is the ability of computer systems to exchange data with unambiguous, shared meaning. Semantic interoperability is a requirement to enable machine computable logic, inferencing, knowledge discovery, and data federation between information systems. It is therefore concerned not just with the packaging of data (syntax), but the simultaneous transmission of the meaning with the data (semantics). This is accomplished by adding data about the data (metadata), linking each data element to a controlled, shared vocabulary. The meaning of the data is transmitted with the data itself, in one self-describing "information package" that is independent of any information system. It is this shared vocabulary, and its associated links to an ontology, which provides the foundation and capability of machine interpretation, inference, and logic. Syntactic interoperability is a prerequisite for semantic interoperability. Syntactic interoperability refers to the packaging and transmission mechanisms for data.

The LTER infrastructure is still working on both syntactic and semantic interoperability in order to harmonise the metadata content and improve the ability to find LTER network data. Two actions in particular have been implemented in order to improve interoperability:

- adoption of OGC standards for the distribution of data,
- expansion of the LTER thesaurus (EnvThes).
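To illustrate the second point, the sketch below looks up a free-text keyword in a SKOS vocabulary such as EnvThes via SPARQL, so that the keyword can be linked to an unambiguous concept URI; the endpoint URL and the keyword are placeholders.

import requests

ENDPOINT = "https://vocabulary.example.org/sparql"  # placeholder SPARQL endpoint
QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept ?label WHERE {
  ?concept skos:prefLabel ?label .
  FILTER(LCASE(STR(?label)) = "soil moisture")
}
"""

# Standard SPARQL protocol: GET with the query and a JSON results Accept header.
response = requests.get(ENDPOINT, params={"query": QUERY},
                        headers={"Accept": "application/sparql-results+json"}, timeout=30)
for row in response.json()["results"]["bindings"]:
    print(row["concept"]["value"], "->", row["label"]["value"])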

4.2.8. Provide facilities to author MD on sites, datasets and data services

Currently, metadata authoring and sharing is done by the different communities themselves, using different systems (e.g. TERENO) at different information levels (e.g. site or national network level). Establishing a network of distributed information sources, as defined for the eLTER Information System, requires the integration of the existing metadata catalogues for discovery and access of the data.

5. State of the art - general architectures for distributed data

5.1. LifeWatch

The mission of LifeWatch is to advance biodiversity research and to provide major contributions to addressing the big environmental challenges, including knowledge-based solutions for environmental managers for the preservation of biodiversity. This mission is achieved by providing access through a single infrastructure to a multitude of sets of data, services and tools enabling the construction and operation of Virtual Research Environments (VREs) linked to LifeWatch, in which specific issues related to biodiversity research and preservation are addressed.

LifeWatch was included in the Roadmap of the European Strategy Forum on Research Infrastructures (ESFRI6), the body that identifies new research infrastructures (RIs) of pan-European interest with the goal of promoting the long-term competitiveness of European research and innovation.

6 http://www.esfri.eu


The concepts behind this European e-Infrastructure were developed in the 1990s and early 2000s, with the support of EU Networks of Excellence related to biodiversity and ecosystem dynamics and functioning. They initiated the design plan for LifeWatch with the understanding that breakthroughs in biodiversity science require a sufficiently large European-scale research infrastructure capable of providing advanced capabilities for data integration, analysis and simulation to complement reductionist experimentation.

Those concepts were formally described during the preparatory phase of LifeWatch (2008-2011) which was funded by a specific project from the 7th Framework Programme. This phase included preparing a Master Plan of the infrastructure and detailing both its building blocks and associated costs.

Subsequently, those concepts were refined taking into account a realistic provision of funds, the existing research facilities presently supported by the different countries that have expressed their interest in participating in LifeWatch (see the e-Infrastructure Resources layer in Figure 14, which is mainly based on the National LifeWatch Centres and Third Party contributions), and the suggestions and comments provided by both the ESFRI high-level Assessment Expert Group and the ESFRI Strategic Working Group on Environment.

Further inputs for the fine-tuning of the LifeWatch construction plan arose from the conclusions of recent LifeWatch operational meetings held in Lecce (Italy, November 2013), Granada (Spain, February 2014), Crete (Greece, July 2014) and Málaga (Spain, February 2015).

Its statutes and governance were modified following the comments provided by the European Commission in January 2014 on the step-1 submission of the LifeWatch ERIC application in July 2013.

The LifeWatch infrastructure meets a number of key requirements:

● ‘Fit for Purpose’: flexible, secure, adaptable, robust, resilient, scalable, and maintainable.
● Integration of “external resources” provided by institutions and networks concerned with ICT technologies and biodiversity research.
● Offering an attractive set of capabilities to users and other stakeholders.
● User-friendly at different levels of knowledge in both science and policy domains.
● Non-proprietary – based on open standards (in application of EU Openness Directives).
● Based on existing technological solutions wherever appropriate, and adaptive to the heterogeneous IT landscape of Europe-wide research IT.
● In selected areas, parallel research into cutting-edge technologies to ensure adoption of new approaches and to contribute to ERA bioinformatics development.
● Staged approach to construction and deployment, with a long-term outlook on all desired functionality within a realistic, controlled and manageable construction process.

The LifeWatch e-Infrastructure is composed of four major layers (see Figure 14).


Figure 14 - Architecture of LifeWatch e-Infrastructure

The coordination and management of the distributed construction operations of the LifeWatch ICT e-Infrastructure, and its further incremental (ITIL-based) operational maintenance, will be carried out by an ICT e-Infrastructure Technical Office located at the common facilities in Spain, which will support the ICT Director of LifeWatch ERIC.

The Resource layer contains the specific resources, such as data repositories and collections (e.g. LTER, GBIF, CETAF), computational capacity, sensor networks and High Performance Computing (HPC) resources, which contribute to the LifeWatch system. It is supported by contributing facilities and in turn integrated into an e-Infrastructure layer which also serves shared workflows. The Resource and e-Infrastructure layers will incorporate tools from existing networks and e-Infrastructures such as LTER, GBIF and CETAF, among others. This will provide a basis for interoperability between LifeWatch and other existing and future systems.

The e-Infrastructure layer enables the specific resources to be shared as generic services in a distributed environment spread across multiple administrative domains. Some of the capabilities of this layer will be provided by underlying Europe-wide e-Infrastructures (for example, EGI.eu and its supporting IBERGRID through a dedicated LifeWatch EGI.eu Competence Centre), and these will also play a prominent role in delivering the Composition layer of LifeWatch, under the coordination and supervision of the above-mentioned ICT e-Infrastructure Technical Office of the common facilities. Similarly, in order to ensure commonality of data management modalities for differing data sets and data providers, the ICT e-Infrastructure will have to take into account the data management guidelines that are likely to emerge over the coming years.

The Composition layer supports the selection and combination of services for task completion. It offers resources for new workflow development in a semantic metadata frame.

The User layer enables the different research communities to create their own Virtual Research Environments-VRE (e.g., e-Labs, decision making tools, etc.); users may share their data and analytical and modelling tools with others while controlling access to them.

At the Composition and User layers, LifeWatch expects to adapt and extend mechanisms from existing networks and e-Infrastructures, as well as from relevant e-Science projects.

5.2. GEOSS

The Global Earth Observation System of Systems (GEOSS) seeks to connect the producers of environmental data and decision-support tools with the end users of these products, with the aim of enhancing the relevance of Earth observations to global issues. The result is a global public infrastructure that generates comprehensive, near real-time environmental data, information and analyses for a wide range of users.

One of the first achievements of the Group on Earth Observations (GEO) was the acceptance of a set of high-level Data Sharing Principles as a foundation for GEOSS. The 10-Year Implementation Plan states that "the societal benefits of Earth observations cannot be achieved without data sharing" and sets out the GEOSS Data Sharing Principles.

The GEOSS Data Sharing Principles are formulated as the following:

● There will be full and open exchange of data, metadata and products shared within GEOSS, recognising relevant international instruments and national policies and legislation;
● All shared data, metadata and products will be made available with minimum time delay and at minimum cost;
● All shared data, metadata and products being provided free of charge, or at no more than the cost of reproduction, will be encouraged for research and education.


GEO recognizes that the societal benefits arising from Earth observations can only be fully achieved through the sharing of data, information, knowledge, products and services. GEO has therefore promoted fundamental principles for data sharing, expanding the trend towards open data worldwide. Thus, as it embarks on its second decade, GEO aims to implement the following GEOSS Data Sharing Principles for post 2015:

● Data, metadata and products will be shared as Open Data by default, by making them available as part of the GEOSS Data Collection of Open Resources for Everyone (Data-CORE) without charge or restrictions on reuse, subject to the conditions of registration and attribution when the data are reused;
● Where international instruments, national policies or legislation preclude the sharing of data as Open Data, data should be made available with minimal restrictions on use and at no more than the cost of reproduction and distribution; and
● All shared data, products and metadata will be made available with minimum time delay.

The main reasons for the new Data Sharing Principles are the following:

● Asserting that sharing data as part of the GEOSS Data-CORE is the default standard for GEO elevates the status of this mechanism, as well as its overall importance for the successful operation of GEOSS and the achievement of the GEO goals, including the expanded commitment to sharing of Earth observations emphasised in the Vision for GEO 2025 document adopted by the GEO X Plenary;
● Reference to the term “Open Data” provides context for the interpretation of the use conditions pertinent to data shared as part of the GEOSS Data-CORE, and brings the GEOSS Data Sharing Principles in line with the relevant international, regional, national and organisational developments;
● The option of sharing data through GEOSS with restrictions on use is presented as a deviation from the default mechanism, with the emphasis on imposing as few restrictions on the use of shared data as possible. This shift in emphasis better recognises the motivations for GEOSS: encouraging and facilitating the reuse of EO data and products, as well as helping to make informed decisions within the nine societal benefit areas;
● The definition of Open Data means that data are shared free of charge, for any purpose and to any user. This reflects the current move by many governments towards Open Data and is in accord with the GEO objectives of encouraging data sharing in order to tackle stated societal objectives and promote economic benefits. The current wording of the third Principle, which limits free-of-charge sharing to research and education purposes, is less apt to achieve these objectives;
● Various legal mechanisms for making data available as part of the GEOSS Data-CORE are presented and analysed in the White Paper “Mechanisms to Share Data as Part of GEOSS Data-CORE”, as approved by the GEO Plenary in November 2014.


The GEOSS Data Collection of Open Resources for Everyone (Data-CORE) is characterised by:

● “full access” means that all the data in the GEOSS Data-CORE can be accessed, used, and redistributed;
● “open access” means that data providers may charge at most the cost of reproduction and distribution of the data, although it is expected that in most cases the data in the GEOSS Data-CORE will be made available at no cost;
● “unrestricted access” means that no restrictions are placed on the access to, or use and redistribution of, the data in the GEOSS Data-CORE.

The GEOSS Data-CORE and “legal interoperability”:

● Legal interoperability for data means that the legal rights, terms, and conditions of databases from two or more sources are compatible and the data may be combined by any user without compromising the legal rights of any of the data sources used.
● Achieving the “legal interoperability” of data made available through the GEOSS Data-CORE is essential for the effective sharing of data in GEOSS.

It should be noted that there is no GEOSS Data Policy.


Figure 15 - GEOSS Architecture

5.3. DataOne

The Data Observation Network for Earth (DataONE) is a community-driven project providing access to data across multiple member repositories, supporting enhanced search and discovery of Earth and environmental data. DataONE promotes best practices in data management through responsive educational resources and materials.

The architecture of the DataONE network therefore needs to be reviewed in order to ensure compatibility. DataONE is a distributed network of repositories (Member Nodes) and currently four search facilities (Coordinating Nodes), which contain resource descriptions of the Member Nodes.

The Member Nodes7 maintain a preservation-oriented repository. Different repository products may take different approaches to data preservation, but in general they i) use persistent identifiers for data products, ii) ensure access to these data products over the long term, and iii) ensure that metadata documents exist alongside the data products. Resource Maps provide a common format for describing the bidirectional relationship between a metadata object and the data object(s) it documents. DataONE expects all content submitted via a primary system to have associated Resource Maps; if data owners do not provide these alongside the content submission, this should be a service provided centrally by the Member Node. Once published, end users expect a data product to remain the same over the long term. Curation practices should therefore be designed with data preservation and data reproducibility in mind. Specifically, content update and archiving activities should be transparent to DataONE end users.

The implementation of the DataONE infrastructure (Figure 16) is based on three major components: Member Nodes, which are existing or new data repositories that support the DataONE Member Node Application Programming Interfaces (APIs); Coordinating Nodes, which are responsible for cataloguing content, managing the replication of content, and providing search and discovery mechanisms; and an Investigator Toolkit, a modular set of software and plug-ins that enables interaction with the DataONE infrastructure through commonly used analysis and data management tools.
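The sketch below shows the kind of read-only search request a client can send to a Coordinating Node; the base URL and index fields follow the public DataONE conventions but should be treated as illustrative rather than as the authoritative eLTER integration path.

import requests

CN_BASE = "https://cn.dataone.org/cn/v2"  # illustrative Coordinating Node base URL

# Query the Solr search index exposed by the Coordinating Node.
resp = requests.get(
    f"{CN_BASE}/query/solr/",
    params={"q": "keywords:LTER", "fl": "identifier,title,dataUrl", "wt": "json", "rows": 5},
    timeout=30,
)
for doc in resp.json()["response"]["docs"]:
    print(doc.get("identifier"), "-", doc.get("title"))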

7 https://www.dataone.org/member_node_requirements


Figure 16 - Major Components of the DataONE Infrastructure.8

Regarding how DataONE implements data management, the DataONE data life cycle is described below (Figure 17). It provides a high-level overview of the stages involved in the successful management and preservation of data for use and reuse.

Figure 17 - The DataONE data life cycle.9

8 Allard, S. (2012). DataONE: Facilitating eScience through collaboration. Journal of eScience Librarianship, 1(1), 4-17. http://dx.doi.org/10.7191/jeslib.2012.1004

9 https://www.dataone.org/data-life-cycle


LTER-Europe recently joined DataONE as a member of the DataONE community. The focus of this network is on ecosystem research, in which biodiversity plays an important role; some sites have more than 100 years of data. One of the core interests of the LTER-Europe community is data and method harmonisation. To facilitate this, a number of EU projects (e.g. ALTER-Net, EnvEurope, or ExpeER) with a key role for LTER-Europe have been carried out.

DataONE provides a REST-based API10 for accessing all services, including metadata, and encompasses several Member Nodes including KNB and Dryad.

The Provenance Aware Synthesis Tracking Architecture (PASTA)11 is a data management system which helps scientists to maintain information about LTER sites, organisational units, events, methods, ecosystems, etc. The API12 supports basic HTTP communication through RESTful web services. PASTA uses an instance of Metacat to index all EML documents that are uploaded to PASTA as part of an LTER data package. The LTER Network Information System (NIS) Data Portal13 serves both as the point of presence for producers and consumers of LTER data products on the Internet and as a reference implementation/guide for other projects that interact with the PASTA API.
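A small read-only interaction with the PASTA API, following the documentation referenced above, could look like the sketch below; the package scope and identifier are examples only.

import requests

PASTA = "https://pasta.lternet.edu/package"
scope, identifier = "knb-lter-nwt", "1"  # example data package scope and identifier

# List the revisions of the data package and pick the newest one.
revisions = requests.get(f"{PASTA}/eml/{scope}/{identifier}", timeout=30).text.split()
newest = revisions[-1]

# Retrieve the EML metadata document for that revision.
eml = requests.get(f"{PASTA}/metadata/eml/{scope}/{identifier}/{newest}", timeout=30)
print(eml.text[:200])  # print the beginning of the EML document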

5.4. TERENO

The main goal of the infrastructure measure TERENO is the creation of observation platforms on the basis of an interdisciplinary, long-term research programme involving close cooperation between several facilities of the Helmholtz-Gemeinschaft for the investigation of the consequences of Global Change for terrestrial ecosystems and their socio-economic implications. TERENO will provide long-term statistical series of system variables for the analysis and prognosis of Global Change consequences using integrated model systems, which will be used to derive efficient prevention, mitigation and adaptation strategies.

Important system variables are, amongst others, fluxes of water, matter and energy within the continuum of the groundwater-soil-vegetation-atmosphere system, long-term changes in the composition and functioning of micro-organisms, plants and fauna, as well as socio-economic conditions, which have to be determined with a temporal and spatial resolution adequate for the dynamics of the processes involved.

The complex interrelations and feedbacks of the different parts of the terrestrial systems require an interdisciplinary approach. In this context important questions are:

10 https://mule1.dataone.org/ArchitectureDocs-current/apis/ 11 https://github.com/PASTAplus/PASTA 12 https://pasta.lternet.edu/package/docs/api 13 https://portal.lternet.edu/nis/browse.jsp


• What consequences will the expected climate changes have on the terrestrial compartments (groundwater, soils, vegetation, surface waters)?
• In which way will the feedbacks of the exchange processes of terrestrial systems (e.g. feedbacks between land surface and atmosphere) affect the terrestrial fluxes of water and matter?
• What direct influences do soil and land-use changes (e.g. due to the EU Cross Compliance Directive or the promotion of energy crops) have on water balance, soil fertility, biodiversity and regional climate?
• What are the consequences of large anthropogenic interferences (e.g. open-cast mining, deforestation) for terrestrial systems?

The homogeneous long-term data sets provided by TERENO will significantly foster the validation, advancement and integration of terrestrial models (e.g. groundwater and soil water balance models, regional climate and weather prognostic models, air quality models, runoff and forest/agronomic models as well as diversity and socio-economic models).14 Integrated model systems will significantly support the management of agronomic and forest ecosystems (e.g. optimisation of irrigation systems, development of warning systems for extreme weather occurrences and flooding, integrated control systems for water management constructions, and monitoring systems for air, groundwater and surface water quality).

Four TERENO observatories have been selected in climate-change-sensitive regions of Germany (see Figure 18). They are operated by six Helmholtz Centres in close cooperation with more than ten universities.

14 Bogena, H R, Schulz, K, Vereecken, H, 2006. TERENO: Towards a network of observatories in terrestrial environmental research. Advances in Geosciences 9, 109-114.


Figure 18 - TERENO observatories.

The observatories are operated on a long-term basis in order to facilitate the determination and quantification of environmental changes. Small-scale research facilities and test areas have been set up to enable detailed process studies, and mobile measurement platforms have been deployed for monitoring dynamic processes, from the local scale up to the determination of spatial patterns at the regional scale (e.g. weather radar systems for the determination of regional precipitation fields at fine temporal and spatial scales, micrometeorological stations for the determination of atmospheric parameters and fluxes of water vapour, energy and trace gases, sensor networks for the determination of environmental parameters at high spatial and temporal resolution (e.g. patterns of soil temperature and moisture dynamics), and monitoring systems for the quantification of water and solute discharge in surface waters and groundwater). High-capacity data processing and communication systems have been installed to guarantee fast availability and long-term protection of the gathered environmental data sets.

TERENO is part of ICOS and the eLTER research platforms and presently hosts six Critical Zone Observatories (CZOs), which are part of the international CZEN network. The TERENO observatories are integrated in several environmental monitoring networks and programmes (e.g. FLUXNET, ICOS, LTER, GEOSS, EXPEER). TERENO serves as a blueprint for the establishment of a European network of long-term ecological research sites and critical zone observatories.

A decentralised data infrastructure, TEODOOR (TEreno Online Data repOsitORry), has been established within the framework of TERENO. This decentralised setup aims at interconnecting the regional data and metadata infrastructures established under the “umbrella” of TERENO. A central “umbrella” portal application, the TERENO Data Discovery Portal (DDP), allows all data and metadata from the contributing institutions to be queried, visualised and accessed in a standardised way and thus acts as a database node providing scientists and stakeholders with reliable and easily accessible data and data products.

Data sharing, provision, access and licensing are key issues in any distributed data infrastructure. Through the TERENO data sharing policy, signed by each partner, all partners have agreed that it is essential that data produced within TERENO are made available as soon as possible to the members of the worldwide scientific community. If not specified otherwise by individual intellectual property holders, the data are made available under the Open Data Commons Attribution License.

Data publication and exchange is facilitated predominantly through web services as standardised by the International Organization for Standardization (ISO) and the Open Geospatial Consortium (OGC15), operated by the data providers. TEODOOR provides the following basic features (see Figure 19):

1. A portal application, which allows one to query, find and access data from local TERENO or external data infrastructures according to the data policy.
2. Local data infrastructures hosted by the individual TERENO institutions.
3. Defined interfaces for data exchange between the portal application and the local data infrastructures, provided by standardised, compliant web services operated at FZJ and, as far as they exist, in regional data infrastructures maintained by partners.

Figure 19 gives an overview of the setup of the distributed data infrastructure TEODOOR. The following web service specifications are used:

● OGC Sensor Web Enablement (SWE) standards and interfaces are used to provide time series data from the TERENO observation networks together with contextual details such as the instruments used, intended application, valid time, contact details of responsible persons, security level, and geographic location.
● OGC-compliant Web Map Services (WMS) and Web Processing Services (WPS) are implemented using a web map server (GeoServer) to provide map layers generated from vector or raster data to the users. Non-OGC-compliant web services may be implemented to provide additional functionalities not covered by the OGC standards and interface specifications. Common data file download services (based on HTTP, FTP, etc.) are connected to the aforementioned services where appropriate.
● Metadata are published via OGC-compliant Web Catalogue Services based on the ISO 19115/19139 and ebRIM metadata standards. In addition to the catalogue services operated by the individual partner institutions, a Central Metadata Catalogue (CMC) is implemented at Research Centre Jülich as a common entry point to TERENO metadata.

15 http://www.opengeospatial.org/

Figure 19 - Basic setup of the TERENO distributed data infrastructure TEODOOR.

The TEODOOR web portal (http://teodoor.icg.kfa-juelich.de) consumes these services and acts as a front end that supports data discovery, visualisation and download. Concepts and technologies have been developed to import and process the data from the environmental sensors and to handle the dissemination of data and products. In this context, the quality assessment of large amounts of environmental data and its incorporation into the SWE services are important issues. Figure 20 gives an impression of the DDP and an example of the data visualisation and download options for time series data. Currently, data from more than 900 monitoring stations are available online and freely accessible.
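The sketch below illustrates how a client could talk to such standardised services; the service URLs are placeholders, while the request parameters follow the generic OGC SOS 2.0 and WMS 1.3.0 key-value conventions.

import requests

SOS_URL = "https://teodoor.example.org/sos"  # placeholder SOS endpoint
caps = requests.get(SOS_URL, params={
    "service": "SOS",
    "request": "GetCapabilities",
    "AcceptVersions": "2.0.0",
}, timeout=30)
print("SOS capabilities document length:", len(caps.text))

WMS_URL = "https://teodoor.example.org/geoserver/wms"  # placeholder WMS endpoint
wms_caps = requests.get(WMS_URL, params={
    "service": "WMS",
    "request": "GetCapabilities",
    "version": "1.3.0",
}, timeout=30)
print("WMS capabilities document length:", len(wms_caps.text))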

TEODOOR is the basis for several other initiatives, e.g. the BMBF- and KfW-funded WASCAL (West African Science Service Center on Climate Change and Adapted Land Use), in which a data infrastructure on climate change and adapted land use is being built up at the WASCAL Competence Center in Ouagadougou, Burkina Faso, using knowledge and tools developed within TERENO. In the eLTER H2020 project, the TERENO infrastructure acts as a blueprint for the integration of observation data into the eLTER distributed data infrastructure. The EUDAT H2020 project aims at building a Collaborative Data Infrastructure which covers both access and deposit, from informal data sharing to long-term archiving, and addresses the identification, discoverability, and computability of both long-tail and big data. The TEODOOR infrastructure will be coupled with and made available to the EUDAT network through the EUDAT services (e.g. B2SHARE, B2FIND).

Figure 20 - Data query, time series visualization and download in the DDP.

5.5. UK Data Infrastructure

The UK Government has created the data.gov.uk portal as the gateway to the UK's publicly funded data. This includes data from environmental-sector regulatory agencies such as the UK Environment Agency. For example, the portal has been used as part of the UK Government's open data agenda, with the release of over 8,000 data sets in the summer of 2016. The cataloguing infrastructure uses CKAN with an ISO 19115 profile for discovery metadata. The catalogue links to online data visualisation services referenced in the metadata where available; these are mainly OGC WMS services, with some WCS and WFS also provided. The catalogue harvests from a range of data nodes in the UK using OGC CSW or WFS (Web Feature Service) calls. The data.gov.uk infrastructure checks the compliance of metadata on upload and also polls visualisation services. As such, this infrastructure delivers the UK's basic INSPIRE services, and the catalogue also acts as the UK node for the EU INSPIRE geoportal.

Figure 21 - UK Government data portal

The UK Natural Environment Research Council (NERC) environmental data centres provide a range of data and data services linked to data.gov.uk. The services reflect the science disciplines that the individual data centres serve (e.g. oceanographic, atmospheric, geoscience and terrestrial ecological research). The data centres offer INSPIRE-compliant services for publicly funded data as well as services related to community standards such as NetCDF CF. Within the UK Centre for Ecology and Hydrology (CEH), a wide range of data are curated and made available via a range of portals. As a result of the diversity of requirements for describing data, models, monitoring sites and activities, CEH has developed a flexible approach to the delivery of metadata that is independent of any particular standard. This architecture now forms the main data catalogue for CEH16 and provides information to data.gov.uk as well as supporting other environmental catalogues such as the UK Environmental Observation Framework (UKEOF17).
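Since the cataloguing infrastructure is CKAN-based, datasets can also be queried programmatically through the standard CKAN action API; the sketch below assumes the usual CKAN API path and an example search term.

import requests

# package_search is part of the standard CKAN action API.
resp = requests.get(
    "https://data.gov.uk/api/3/action/package_search",
    params={"q": "river water quality", "rows": 5},
    timeout=30,
)
result = resp.json()["result"]
print(result["count"], "datasets found")
for dataset in result["results"]:
    print("-", dataset["title"])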

16 https://catalogue.ceh.ac.uk/documents 17 https://catalogue.ukeof.org.uk/


Figure 22 - UK Environmental Observation Framework EMF catalogue

The catalogue's main components are a Solr search index and a Git document repository. Metadata documents are indexed in Solr to power the search interface, and Git is used to store the metadata record files. Storing a metadata record as a file allows the storage to be schema-independent. Support has been added for multiple schemas, currently UK GEMINI data sets and services, Environmental Monitoring Facilities, Models and Sample Archives. Different representations of a metadata record (e.g. XML, JSON, HTML) are generated through templates. The metadata editor can be configured to generate these different schemas. There is an API for searching and for editing metadata records. This provides a catalogue that is schema-neutral and can represent the full range of components that need to be described in the environmental sciences, such as data sets, models, field sites and activities. The catalogue software is open source and is likely to be used to support a wide range of environmental programmes both inside and outside the UK; for example, work will start this year on developing a cataloguing system for AfSIS for soil information in Africa.


5.6. INSPIRE

The INSPIRE directive came into force on 15 May 2007 and is being implemented in various stages, with full implementation required by 2021 according to the roadmap18. The INSPIRE directive aims to create a European Union (EU) spatial data infrastructure. This should enable the sharing of environmental spatial information among public-sector organisations and better facilitate public access to spatial information across Europe. A European spatial data infrastructure assists in policy-making across boundaries. The spatial information considered under the directive is therefore extensive and includes a great variety of topical and technical themes.

INSPIRE is based on a number of common principles:

● Data should be collected only once and kept where they can be maintained most effectively.
● It should be possible to combine spatial information seamlessly from different sources across Europe and share it with many users and applications.
● It should be possible for information collected at one level/scale to be shared with all levels/scales; detailed for thorough investigations, general for strategic purposes.
● Geographic information needed for good governance at all levels should be readily and transparently available.
● It should be easy to find what geographic information is available, how it can be used to meet a particular need, and under which conditions it can be acquired and used.

INSPIRE is based on the infrastructures for spatial information established and operated by the 27 Member States of the European Union. The Directive addresses 34 spatial data themes needed for environmental applications, with key components specified through technical implementing rules. This makes INSPIRE a unique example of a legislative “regional” approach.

5.7. EUDAT Collaborative Data Infrastructure (CDI)

The EUDAT CDI is a defined data model and a set of technical standards and policies adopted by European research data centres and community data repositories with the intent of creating a single European e-infrastructure of interoperable data services. This e-infrastructure consists of integrated data services and resources to support research and comprises a network of nodes (mostly data centres or computing centres) that provide a range of services for upload and retrieval, identification and description, movement, replication, data integrity and additional services needed for operating the infrastructure19.

18 http://inspire.ec.europa.eu/index.cfm/pageid/44

Figure 23 - Support of the data life cycle by EUDAT services

EUDAT spawned a number of different services that are used within eLTER and are described in greater detail in the following chapters.

19 https://www.eudat.eu/eudat-cdi/about


Figure 24 - Overview on the EUDAT Service Suite

5.8. ENVRI

The ENVRI Reference Model is an integrative reference approach that the environmental research community can use to potentially enable their research infrastructures to achieve a greater level of interoperability through the use of common standards and approaches for various functions.

The objective of the ENVRI Reference Model20 is to develop a common framework and specification for the description and characterisation of computational and storage infrastructures. This framework can support the RIs to achieve seamless interoperability between the heterogeneous resources of their different infrastructures.

The ENVRI Reference Model serves the following purposes:

20 See https://confluence.egi.eu/display/EC/ENVRI+Reference+Model


● to provide a way of structuring thinking that helps the community to reach a common vision;
● to provide a common language that can be used to communicate concepts concisely;
● to help discover existing solutions to common problems;
● to provide a framework into which different functional components of research infrastructures can be placed, in order to draw comparisons and identify missing functionality.

This will enable greater understanding and cooperation between infrastructures since fundamentally the model will serve to provide a universal reference framework for discussing many common technical challenges facing all of the ESFRI-ENV infrastructures. By drawing analogies between the reference components of the model and the actual elements of the infrastructures (or their proposed designs) as they exist now, various gaps and points of overlap can be identified.

The ENVRI Reference Model is based on the design experiences of state-of-the-art environmental research infrastructures, with a view to informing future implementation. It tackles multiple challenging issues encountered by existing initiatives, such as data streaming and storage management; data discovery and access to distributed data archives; linked computational, network and storage infrastructure; data curation, data integration, harmonisation and publication; data mining and visualisation; and scientific workflow management and execution. It uses Open Distributed Processing (ODP), a standard framework for distributed system specification, to describe the model.


Figure 25 - Viewpoints of the ENVRI RM using the ODP approach

By examining the computational characteristics of these infrastructures, five common subsystems21 have been identified:

• Data Acquisition,
• Data Curation,
• Data Access,
• Data Processing and
• Community Support.

The division into these five subsystems is based on the observation that all applications, services and software tools are designed and implemented around five major physical resources: the sensor network, the storage, the (internet) communication network, application servers and client devices.

21 https://confluence.egi.eu/display/EC/ENVRI+Reference+Model?preview=/8553222/8553232/D3.5%20Guideline%20of%20using%20the%20Reference%20Model_final.pdf


The definitions of the five subsystems are given below:

Data acquisition: collects raw data from sensor arrays, various instruments, or human observers, and brings the measurements (data streams) into the system.

Data curation: facilitates quality control and preservation of scientific data. It is typically operated at a data centre.

Data access: enables discovery and retrieval of data housed in data resources managed by a data curation subsystem.

Data processing: aggregates the data from various resources and provides computational capabilities and capacities for conducting data analysis and scientific experiments.

Community support: manages, controls and tracks users' activities and supports users to conduct their roles in communities.

Figure 26 - Illustration of the major points of reference between different life cycle phases


The model captures the computational requirements and the state-of-the-art design experiences of a collection of representative research infrastructures for environmental sciences. It is the first reference model of this kind which can be used as a basis to inspire future research. Its main contributions can be summed up as follows:

● It provides a common language for communication to unify understanding.
● It serves as a community standard to secure interoperability.
● It can be used as a basis to drive design and implementation.

Figure 27 - Overview on ENVRI RM life cycle phases addressed in the different research infrastructures


6. State of the art - tools and services

6.1. Data discovery

Searching, discovering, and accessing data are key activities in the information management discipline. Several international initiatives have been undertaken to create metadata for describing the diversity of environmental data. Among them, the European Commission, in developing a spatial data infrastructure, adopted Open Geospatial Consortium (OGC) web services and ISO standards for describing data. ISO 19115/19119/19110/19139 form the basis of the metadata schemas proposed in the INSPIRE directive (2007/2/EC). Several IT software solutions have been developed to support the storage of metadata and to provide graphical user interfaces that make data discovery easier.

For better discovery in a distributed architecture, the usage of controlled vocabularies, as well as semantic approaches for harmonising the metadata content, is essential. Below is a list of widely used open source software that we believe to be the most promising and to offer the most functionality for eLTER.

6.1.1. B2Find

B2FIND22 is a discovery service based on metadata steadily harvested from research data collections at EUDAT data centres and other repositories. The service offers faceted browsing and, in particular, allows the discovery of data that are stored through the B2SAFE and B2SHARE services. The B2FIND service includes metadata harvested from many different community repositories.

22 https://eudat.eu/services/b2find


Figure 28 - B2Find on EUDAT platform

B2FIND allows users to:

● Find collections of scientific data quickly and easily, irrespective of their origin, discipline or community
● Get quick overviews of available data
● Browse through collections using standardised facets

The B2Find tab created within the data integration platform (Figure 29) is a search tool that uses the CKAN API to discover information stored using the B2SHARE service. Search results from the B2Find service include metadata harvested from the eLTER/B2Share community repository as well as a download link for each file. The search results are displayed using an “accordion” and pagination display pattern.
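A rough sketch of the kind of CKAN call issued behind the scenes is given below; the host name and the community facet are assumptions for illustration.

import requests

resp = requests.get(
    "http://b2find.eudat.eu/api/3/action/package_search",  # illustrative B2FIND host
    params={"q": "groups:lter", "rows": 10},                # hypothetical community facet
    timeout=30,
)
for record in resp.json()["result"]["results"]:
    resources = record.get("resources", [])
    link = resources[0]["url"] if resources else "no file link"
    print(record["title"], "->", link)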


Figure 29 - B2Find tab within the data integration platform

6.1.2. GeoNetwork

GeoNetwork23 is a free and open source (FOSS) cataloguing application for spatially referenced resources, i.e. a catalogue of location-oriented information. It is written in Java and allows metadata to be generated through a web interface in a variety of formats.

It is capable of harvesting the following sources:

● OGC-CSW 2.0.2 ISO Profile,
● OAI-PMH,
● Z39.50 protocols,
● Thredds,
● WebDAV,
● Web Accessible Folders,
● ESRI GeoPortal,
● other GeoNetwork nodes.

GeoNetwork can itself be harvested, as it has its own API to interact with other systems.

23 http://geonetwork-opensource.org


The built-in metadata editor supports the ISO 19115/119/110 standards used for spatial resources and also Dublin Core. The number of supported schemas can be extended through the use of schema plugins, e.g. ISO 19139. By default the data are stored in an H2 database, but this can be switched to a PostgreSQL database. Direct communication with the database is possible and allows metadata to be queried using XPath.

6.1.3. pyCSW

pyCSW is an OGC CSW server implementation written in Python. It is open source, released under an MIT licence, and runs on all major platforms (Windows, Linux, Mac OS X). pyCSW fully implements the OpenGIS Catalogue Service Implementation Specification (CSW, Catalogue Service for the Web); it supports all mandatory and optional operations (e.g. GetCapabilities, DescribeRecord, GetRecords, GetRecordById, GetRepositoryItem, GetDomain, Harvest, etc.) via GET (KVP), POST (XML) and SOAP methods.

Initial development started in 2010. The project is certified OGC Compliant and is an OGC Reference Implementation24. Since 2015 pyCSW has been an official OSGeo project. pyCSW allows the publishing and discovery of geospatial metadata via numerous APIs (CSW 2/CSW 3, OpenSearch, OAI-PMH, SRU). Existing repositories of geospatial metadata can also be exposed, providing a standards-based metadata and catalogue component for spatial data infrastructures (SDI). The metadata can be output in the following formats:

● XML,
● JSON

while the following output schemas are supported:

● Dublin Core,
● ISO 19139,
● FGDC CSDGM,
● NASA DIF,
● Atom,
● GM03

24 http://www.opengeospatial.org/resource/products/details/?pid=1104


6.1.4. GI-CAT

GI-cat25 features caching and mediation capabilities and can act as a broker towards disparate catalogue and access services: by implementing metadata harmonisation and protocol adaptation, it is able to present query results through a uniform and consistent interface. GI-cat is based on a service-oriented framework of modular components and can be customised and tailored to support different deployment scenarios (NATIVI AND BIGAGLI, 2009).

GI-cat can access a multiplicity of catalogue services, as well as inventory and access services, to discover, and possibly access, heterogeneous ESS resources. Specific components, called Accessors, implement mediation services for interfacing heterogeneous service providers which expose multiple standard specifications. These mediating components map the heterogeneous providers' metadata models onto a uniform data model which implements ISO 19115, based on the official ISO 19139 schemas and its extensions. Accessors also implement the query protocol mapping; they translate query requests expressed according to the interface protocols exposed by GI-cat into the multiple query dialects spoken by the resource service providers. Currently, a number of well-accepted catalogue and inventory services are supported, including several OGC Web Services (e.g. WCS, WMS), THREDDS Data Server, SeaDataNet Common Data Index, and GBIF.

GI-cat itself exposes several interfaces, including the OGC CSW interfaces (Core, ISO, ebRIM EO and ebRIM CIM). The query and result mediation is implemented by Profiler components, while a Distributor component implements the query distribution functionalities (e.g. results aggregation).

6.1.5. Metacat

Metacat26 is a flexible, open source metadata catalogue and data repository that targets scientific data, particularly from ecology and environmental science. Metacat helps scientists find, understand and effectively use data sets they manage or that have been created by others, providing the scientific community with a broad range of science data that–because the data are well and consistently described–can be easily searched, compared, merged, or used in other ways.

Metacat is a key infrastructure component for the NCEAS data catalogue, the Knowledge Network for Biocomplexity (KNB) data catalogue, and for the DataONE system, among others.

25 https://www.uos-firenze.iia.cnr.it/gi-cat 26 https://www.dataone.org/software-tools/metacat


It is worth noting that KNB is an international repository for ecological/environmental data sets and the home site for the Morpho/Metacat publishing system27.

Metacat accepts XML as a common syntax for representing the large number of metadata content standards. Thus, Metacat is a generic XML database that allows storage, query, and retrieval of arbitrary XML documents without prior knowledge of the XML schema.

Metacat is designed and implemented as a Java servlet application that utilizes a relational database management system to store XML and associated meta-level information, and provides a rich client Application Programming Interface (API) which supports a variety of languages, including Java, Python, and Perl.

It contains a replication tool to connect distributed Metacat instances so that they can share data and metadata between them. This tool is used to create local networks and also to connect to DataONE. Additionally, the Metacat graphical user interface has recently been updated with MetacatUI28. This interface shows a map where the datasets can be discovered and provides an improved search tool. It also permits users to log in using ORCID and to publish data and metadata without using Morpho. Metacat offers two service interfaces based on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH): a data provider (or repository) service interface and a harvester service interface.
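A minimal harvesting sketch against the OAI-PMH data provider interface is shown below; the repository URL (and the exact servlet path of a given Metacat installation) is a placeholder, while the verb and parameters are standard OAI-PMH.

import requests
import xml.etree.ElementTree as ET

OAI_URL = "https://metacat.example.org/metacat/oai"  # placeholder provider endpoint

resp = requests.get(OAI_URL, params={"verb": "ListRecords", "metadataPrefix": "oai_dc"}, timeout=60)
tree = ET.fromstring(resp.content)

# Print the OAI identifiers of the harvested records.
for header in tree.iter("{http://www.openarchives.org/OAI/2.0/}identifier"):
    print(header.text)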

6.2. Metadata authoring and sharing

6.2.1. DEIMS

DEIMS (Dynamic Ecological Information Management System) is a system developed by the University of New Mexico, the University of Puerto Rico, the University of Wisconsin and Palantir.net, and extended by LTER Europe to describe ecological data and environmental monitoring facilities. It allows datasets to be registered and exposes their metadata in different formats: ISO 19115-2 North American Profile, ISO 19139 INSPIRE Profile, BDP and EML 2.1.1. It can be harvested using GI-CAT.

It is running on the Content Management System Drupal 7 and thus can be extended by using already existing contributed modules or by writing custom modules.

DEIMS provides documentation for the following content types:

a) Observation and experimentation facilities (Site)
b) Dataset
c) Networks
d) Person

27 https://knb.ecoinformatics.org 28 https://github.com/NCEAS/metacatui


6.2.1.1. Site documentation

DEIMS allows registering sites and storing comprehensive information about research activities, infrastructure and environmental parameters. In order to enter this information, DEIMS offers a multi-layered form covering a broad range of different thematic groups. The full description of the site metadata model can be found on DEIMS.29

Figure 30 - DEIMS Site Form

Within the site form there is a networks section, which allows users to define the national LTER network, request a declaration status and add additional networks in order to also document activities and memberships outside of LTER.

29 https://data.lter-europe.net/deims/documentation/site-1-1


Figure 31 - DEIMS Site Form - Network affiliation

The full description of the (LTER) network metadata model can be found on DEIMS.30

Additionally, the site form allows incorporating information about metadata providers, site managers, site owners and funding organisations.

Figure 32 - DEIMS Site Form - Contact Details

Those entities themselves spawn new forms that are based on their own metadata formats, thus allowing multi-layered information to be built up that is connected across different types of content.

30 https://data.lter-europe.net/deims/documentation/network


Figure 33 - DEIMS Person Form

After completing the site form, the entry is saved to the DEIMS database and is ready to be searched and queried.

6.2.1.2. Dataset documentation

DEIMS allows registering datasets. When adding a dataset, a form with a number of fields is presented to the user. Filling in all mandatory fields provides sufficient information for DEIMS to render valid metadata in ISO 19139 (GMI and GMD), EML 2.1.1 and BDP.

The full description of the dataset metadata model can be found on DEIMS.31

31 https://data.lter-europe.net/deims/documentation/dataset


Figure 34 - DEIMS Dataset Model (Version 1.1)


Figure 35 - DEIMS Dataset Form

Similar to the site form the dataset form also allows including information for a person.

6.2.1.3. Export metadata

Dataset metadata can be exported in the following formats:

● ISO 19115-2 North American Profile
● ISO 19139 INSPIRE Profile
● BDP and
● EML 2.1.1


Figure 36 - DEIMS ISO 19139 Inspire Profile

Each of those formats can be exported and downloaded individually.

Harvesting has been implemented for:

● ISO 19115-2 North American Profile
● ISO 19139 INSPIRE Profile
● EML 2.1.1

Figure 37 - DEIMS EML Harvest List


This is realised using custom Drupal modules written in PHP by US LTER and extended by LTER Europe. Harvesting of EML files is realised through a harvest list.

Discussions about potential ways to realise EF harvesting are currently ongoing. Possible solutions include the generation of harvest lists for EF records.

6.2.2. Morpho

Created for scientists, Morpho is a user-friendly application designed to facilitate the creation of metadata (information that describes your data) so that researchers can easily discover and determine the nature of a wide range of data sets. The user creates a metadata file that explains what the data represent and how they are organised. Morpho interfaces with the Knowledge Network for Biocomplexity (KNB) Metacat server, which is essentially a server from which scientists can upload, download, store, query and view relevant metadata and data. Once the data have been annotated with metadata, it is possible to upload the data, or just the data description (the metadata), to the Metacat server, where they can be accessed from the web by selected researchers or by the public. Data stored on the Metacat server are saved on several geographically separate servers, ensuring that data are archived securely.

In Morpho, the metadata contains information about the content of a data set (its owner, administrator, geographic extent, units, etc.) as well as who has access to the data (the owner, selected users, or the public). This information is stored in a file that conforms to the Ecological Metadata Language (EML) specification, which is commonly used to exchange information among scientists across the world.

The Morpho wizards create metadata files using a subset of the Ecological Metadata Language (EML), a metadata specification developed by the ecology discipline that has since gained wider usage. EML is based on prior work done by the Ecological Society of America and associated efforts32.

MetacatUI, the new Metacat interface, is an alternative to Morpho. It is a client-side, web-based user interface used for searching, displaying, and creating content for a Metacat server, or for other data repositories that can utilise the DataONE API. Using this tool, a user can publish a new data package to the Metacat repository without using an additional tool such as Morpho.

32 http://knb.ecoinformatics.org/software/eml/


6.2.3. INSPIRE MD Editor

The European Open Source Metadata Editor (EUOSME)33 is a web application to create INSPIRE-compliant metadata (ISO 19139). It has been developed by the Joint Research Centre as part of the EuroGEOSS project34.

EUOSME allows describing a spatial data set, a spatial data set series or a spatial data service so that the metadata is also compliant with the standards ISO 19115:2003 (corrigendum 2003/Cor.1:2006) and ISO 19119:2005. It is therefore an implementation of the INSPIRE Metadata Technical Guidelines based on these two ISO standards, as published on the INSPIRE web site. The editor builds on the experience acquired in the development of the INSPIRE Metadata Implementing Rules and includes the INSPIRE Metadata Validator Service available from the INSPIRE EU Geoportal35. EUOSME can be accessed through the INSPIRE Geoportal36.

6.2.4. EDI MD Editor

EDI (PAVESI ET AL., 2016) is a general-purpose, template-driven metadata editor for creating XML-based descriptions. Originally aimed at defining rich and standard metadata for geospatial resources, it can easily be customised to comply with a broad range of schemata and domains. EDI allows external data sources made available as SPARQL endpoints to be plugged in. On the basis of these, EDI can semantically annotate metadata by providing, besides text descriptions of the entities referred to in the metadata (e.g. keywords, points of contact, toponyms, etc.), unique identifiers for them in the form of URIs.

EDI creates HTML5 metadata forms37 with advanced assisted editing capabilities and compiles them into XML files. The examples included in the distribution implement profiles of the ISO 19139 standard for geographic information38, such as core INSPIRE metadata39, as well as the OGC standard for sensor description, SensorML40, in versions 1.0.1 and 2.0.0.

Templates (the blueprints for a specific metadata format) drive form behaviour via element data types and provide advanced features such as code lists underlying combo boxes or autocompletion functionalities.
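The autocompletion pattern can be sketched as a simple prefix lookup against one of the configured SPARQL data sources; the endpoint URL is a placeholder and only the general shape of the query is shown.

import requests

ENDPOINT = "https://sparql.example.org/query"  # placeholder data source plugged into EDI
typed_prefix = "temp"                          # text typed so far into the form field

query = f"""
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?uri ?label WHERE {{
  ?uri skos:prefLabel ?label .
  FILTER(STRSTARTS(LCASE(STR(?label)), "{typed_prefix}"))
}} LIMIT 10
"""

rows = requests.get(ENDPOINT, params={"query": query},
                    headers={"Accept": "application/sparql-results+json"},
                    timeout=30).json()["results"]["bindings"]
print([(r["uri"]["value"], r["label"]["value"]) for r in rows])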

33 http://inspire-geoportal.ec.europa.eu/editor/ 34 http://www.eurogeoss.eu 35 http://inspire-geoportal.ec.europa.eu 36 http://inspire-geoportal.ec.europa.eu/editor/ 37 http://www.w3.org/TR/2014/REC-html5-20141028 38 http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=32557 39 http://inspire.ec.europa.eu/documents/Metadata/MD_IR_and_ISO_20131029.pdf 40 https://portal.opengeospatial.org/files/?artifact_id=55939


The editing of virtually any metadata format can be supported by creating a specific template. EDI is hosted on GitHub at https://github.com/SP7-Ritmare/EDI-NG_client and https://github.com/SP7-Ritmare/EDI-NG_server.

Figure 39 - EDI Metadata editor form - Example of SensorML form

6.3. Metadata harvesting

Metadata harvesting within the data integration platform relies on the harvesting capabilities of a GeoNetwork server. It is used to collect metadata from DEIMS and from the Virtual Nodes, as well as from any other source that supports OAI-PMH, OGC CSW 2.0.2, another GeoNetwork instance or OGC Web Services, or that can provide ISO 19115/19139 XML files. All metadata sources can be added using GeoNetwork's administrator portal (Figure 40).
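
As a sketch of how such a CSW-compatible source can be queried programmatically, the snippet below uses the OWSLib Python library; the endpoint URL and search term are placeholders rather than the actual DIP configuration.

```python
# Minimal sketch of querying a CSW 2.0.2 endpoint such as the one exposed by
# GeoNetwork; the URL below is a placeholder, not the actual DIP address.
from owslib.csw import CatalogueServiceWeb
from owslib.fes import PropertyIsLike

csw = CatalogueServiceWeb("https://example.org/geonetwork/srv/eng/csw")

# Free-text style constraint on the record content (illustrative search term)
query = PropertyIsLike("csw:AnyText", "%LTER%")
csw.getrecords2(constraints=[query], maxrecords=10)

# Print identifier and title of each harvested/retrieved record
for identifier, record in csw.records.items():
    print(identifier, record.title)
```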

Besides DEIMS, the following sources are currently harvested on a weekly basis: CEH (UK), TERENO (Germany), CNR-IREA (Italy) and iEcolab (Spain), amounting to 464 datasets. Users can search for metadata by keyword with the simple or advanced search, or by entering part of a text using the smart search filter. Examples of search results and a detailed view of selected datasets are shown in the figure below.


GeoNetwork may also be used as a brokering tool: it exposes all harvested metadata to other harvesters, e.g. the GEOSS portal.

Figure 40 - Metadata harvesting within the GeoNetwork component

6.4. Data archiving and sharing

6.4.1. B2Share

B2Share41 is a service to store and share small-scale research data from diverse contexts. It can be used either by individual users or by members of a specific research community, such as LTER Europe.

The B2Share service is free of charge for European scientists and researchers.

It offers the possibility to upload data and describe them in a rudimentary way. A generic metadata template contains fields that allow a dataset to be described according to Dublin Core.

41 https://b2share.eudat.eu/docs/b2share-about


Communities can extend that generic template according to their individual needs. A PID (persistent identifier) is attached to every uploaded dataset, allowing it to be uniquely identified and located; a dataset and its corresponding files can thus be found even after they have been moved to another server. Once a dataset has been uploaded it cannot be deleted anymore. Versioning is currently not possible, but that feature is on the roadmap for future development.

Data can be uploaded using a web interface. There is also a REST (Representational State Transfer) API that allows B2Share to be integrated into other systems. Using the API it is possible to mass-upload files with Python scripts or to include an interface for uploading files and their corresponding metadata into DEIMS. The maximum file size is 2 GB, but this limit can be increased in coordination with the technical support.
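
The following sketch outlines such a script-based upload. The endpoint paths and payload fields follow the pattern of the EUDAT B2Share documentation but are simplified and should be checked against the current API reference; the access token and community identifier are placeholders.

```python
# Sketch of a mass-upload script against the B2Share REST API (requests library).
# Endpoint paths and JSON fields below are assumptions based on the EUDAT
# documentation pattern; verify them against the current API reference.
import requests

B2SHARE = "https://b2share.eudat.eu"
TOKEN = {"access_token": "YOUR-ACCESS-TOKEN"}   # personal access token (placeholder)

# 1. Create a draft record with minimal metadata (community UUID is a placeholder).
draft = requests.post(
    f"{B2SHARE}/api/records/",
    params=TOKEN,
    json={
        "titles": [{"title": "Example LTER dataset"}],
        "community": "<community-uuid>",
        "open_access": True,
    },
).json()

# 2. Upload the data file into the draft's file bucket.
files_url = draft["links"]["files"]
with open("observations.csv", "rb") as fh:
    requests.put(f"{files_url}/observations.csv", params=TOKEN, data=fh)

# 3. Publishing the draft (a separate request in the B2Share API) mints the PID.
```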

A local instance of B2Share can also be set up and used independently.

6.4.2. GET-IT

GET-IT42 is a software suite that aims to enable domain-expert researchers to take the lead in creating an interoperable SDI. By using relevant OGC standards (Table 2), it enables the interoperable distribution, through web services, of the data collected by researchers, both geospatial data and observations, allowing them to create their own spatial data repositories and facilitating the entry and maintenance of data and of metadata for datasets and sensors. The resulting services, with the entered data and metadata, are hosted on virtual machines that can be installed on a server or at hosting sites.

Table 2 - Main interoperable services and OGC standards supported by GET-IT.

Web service Version

OGC Web Map Service (WMS) 1.1.1, 1.3.0

OGC Web Feature Service (WFS) 1.0.0, 1.1.0, 2.0

OGC Web Coverage Service (WCS) 1.0, 1.1.0, 1.1, 1.1.1

OGC Styled Layer Descriptor (SLD) 1.0

OGC Tile Map Service (TMS) 1.0.0

42 http://www.get-it.it


OGC Web Map Tile Service (WMTS) 1.0.0

OGC Catalogue Services for the Web (CSW) 2.0.2

OGC Sensor Observation Service (SOS) 1.0.0, 2.0.0

OGC Sensor Model Language (SensorML) 1.0.1, 2.0.0

OGC Observations and Measurements (O&M) 1.0.0, 2.0.0

Developed by a joint research group of CNR IREA and CNR ISMAR under the flagship project RITMARE43, GET-IT is completely free and open source and has among its objectives the creation of a federated interoperable infrastructure for the observational network of marine data, allowing researchers and research organisations to share their data and metadata. GET-IT is distributed as a virtual machine based on the Ubuntu operating system; the basic software used is GeoNode44, a widely known geographic content management system. Since GeoNode lacks Sensor Web Enablement (SWE) and semantic enablement, it has been extended with new client- and server-side software components for the creation and management of observations, sensors and semantically enabled metadata. In particular, the new software implementations include:
● A metadata editor client, named EDI, which allows the creation and validation of metadata in accordance with different profiles or templates (PAVESI ET AL., 2016). EDI allows plugging in external data sources that are made available as SPARQL endpoints;
● A SOS manager that allows the registration of new sensors described in the Sensor Model Language (SensorML) metadata profile through the EDI metadata editor;
● An insert-observation interface that allows observations to be uploaded, by copy-and-paste, for sensors registered in GET-IT;
● A SOS client that allows viewing the information of registered sensors and the recorded data in a web map (a usage sketch is given after this list).
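
As referenced in the list above, the SOS services exposed by a GET-IT node can be accessed with standard clients. The sketch below uses the OWSLib Python library; the service URL, offering name and observed property are placeholders, not actual GET-IT identifiers.

```python
# Sketch of retrieving observations from a GET-IT SOS endpoint with OWSLib;
# the service URL, offering and observed property below are placeholders.
from owslib.sos import SensorObservationService

sos = SensorObservationService("https://example.org/observations/sos", version="2.0.0")

# List the available offerings (typically one per registered sensor/procedure)
for offering in sos.offerings:
    print(offering.id)

# Request observations for one offering in O&M 2.0 encoding
response = sos.get_observation(
    offerings=["placeholder_offering"],
    responseFormat="http://www.opengis.net/om/2.0",
    observedProperties=["http://example.org/property/water_temperature"],
)
print(response[:500])  # raw XML; parse with an O&M reader as needed
```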

6.4.3. TEODOOR Suite

Within TERENO, different time series management systems are used. The integrated time series management system (TSM), developed at Research Centre Jülich, is used in three observatories and

43 http://www.ritmare.it
44 http://geonode.org


accomplishes automated processing and management of observational time series data. TSM components are implemented either in Java or using open-source software.

The basic setup of the system is illustrated in Figure 40. The system consists of four parts, which are fully parameterized and configurable from the underlying database system. First, the input file parser converts the sensor data, which enter the infrastructure from different sources and by various communication protocols (e.g. HTTP, FTP) into a defined structure.

Figure 40 - Basic structure of the operation of the time series management system.

In a second step the data are imported into a relational database by the data processor. Transformation filters may be applied in order to convert raw sensor signals (e.g. ADC voltage levels) to physical parameters (e.g. temperatures), to combine data from different sensors (e.g. mean values from two or more parallel measurements) or to calculate temporal aggregates (e.g. daily precipitation rates from 10-minute observations). In addition, filters allow first


quality checks and labelling of the data. A notification system informs the responsible parties about the import process and any problems that occurred, by e-mail or Sensor Event Service (SES) posts.
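
The behaviour of such a transformation filter can be sketched in a few lines; the calibration coefficients, column names and file name below are invented for illustration and do not correspond to an actual TERENO configuration.

```python
# Illustrative transformation filter: convert raw ADC voltages to temperatures
# and aggregate 10-minute observations to daily values.  Calibration constants
# and column names are invented for the example.
import pandas as pd

raw = pd.read_csv("sensor_raw.csv", parse_dates=["timestamp"], index_col="timestamp")

# Linear sensor calibration (placeholder coefficients)
GAIN, OFFSET = 25.0, -5.0
raw["temperature_c"] = GAIN * raw["voltage_v"] + OFFSET

# Temporal aggregation, e.g. daily mean temperature and daily precipitation sum
daily = raw.resample("1D").agg({"temperature_c": "mean", "precipitation_mm": "sum"})
print(daily.head())
```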

Persisting and archiving of the sensor data is accomplished within the data storage component of the infrastructure. Data storage in the database and registration of the sensor metadata require an underlying data model for time series data. Although time series data may be characterised primarily by the location and the timestamp of the measurement, the observed physical parameter and the observed value, their interpretation requires contextual information, or metadata. Metadata contain all relevant information about the observed quantity, such as name, unit, attributes, description or accuracy, but also about the processes of measurement, data recording, processing and calculation. A comprehensive data model for time series data was developed to store environmental observations along with sufficient metadata to provide traceable heritage from raw measurements to usable information.

Quality assessment of the current data is performed continuously by the responsible persons. In addition, "flagging days" are held once a month, in which institute members (typically around 6-8) together with the responsible scientists perform the quality assessment of historical data and develop rules on how quality assessment is to be done for the different types of parameters. Data quality is characterised by three descriptors, which are stored together with each observation: data uncertainty, data processing level and data quality flags. The observed data values remain unchanged in any case. Data uncertainty arises from the observation process itself and is mainly determined by the accuracy of the sensors used. Data processing levels indicate the status of data handling: level 1, for instance, includes (unpublished) raw data, level 2 refers to data subjected to quality control, whereas the following levels refer to derived data products. Data quality flags record the outcome of quality tests, which may be either computer-generated (e.g. automatic evaluation procedures) or human-generated (e.g. visual inspections). The flagging scheme used here consists of two levels. The primary level includes generic flags, e.g. good, unevaluated, suspicious and bad. The secondary level is use-case specific and indicates either the result of individual quality tests applied (e.g. failed gradient checks) or background events affecting data values (e.g. icing events).
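
A minimal sketch of this two-level flagging idea is given below; the flag vocabulary follows the generic names mentioned above, while the class layout and the gradient threshold are assumptions made purely for the example.

```python
# Sketch of the two-level flagging idea: a generic primary flag plus a
# test-specific secondary flag stored alongside each observation.  The class
# layout and gradient threshold are illustrative, not the TERENO configuration.
from dataclasses import dataclass

PRIMARY_FLAGS = ("good", "unevaluated", "suspicious", "bad")

@dataclass
class FlaggedValue:
    value: float               # the observed value itself is never altered
    primary_flag: str = "unevaluated"
    secondary_flag: str = ""   # e.g. "gradient_check_failed", "icing_event"

def gradient_check(prev: FlaggedValue, curr: FlaggedValue, max_step: float = 2.0) -> FlaggedValue:
    """Mark a value as suspicious if it jumps too far from its predecessor."""
    if abs(curr.value - prev.value) > max_step:
        curr.primary_flag = "suspicious"
        curr.secondary_flag = "gradient_check_failed"
    else:
        curr.primary_flag = "good"
    return curr
```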

The fourth component, the OGC Web Service Provider, publishes the sensor data and metadata from the data model and data storage components using OGC-compliant web services. This is done predominantly with open-source implementations: SWE services and standards implemented and published as free software by the open-source initiative 52°North are used here. For this purpose, connectors have been developed which allow data transfer from the data storage module to the particular web service software. An OGC-compliant Sensor Observation Service has been developed (DEVARAJU ET AL., 2015) which provides quality information and enables data users to take data quality into account.


7. Implementation strategy

Deliverable 8.1 introduces the architecture which will be implemented during the eLTER project. To realise this architecture, a number of requirements must be satisfied with suitable software. The software solutions listed in Table 3 were selected considering how they have already been used, the size of their development communities, and their robustness. They are integrated into the infrastructure to cover all required functions (e.g. data sharing, data discovery, data upload, data description, metadata editing, metadata harvesting). Some requirements can be derived from the user stories (see chapter 4).

Table 3 - Comparison of requirements against selected technologies

Harvesting capabilities | Harvestable | Data upload | Data sharing | Metadata sharing | Metadata generation | Discovery of metadata

DEIMS x x x x x

GeoNetwork x x x x x

GET-IT x x x x x x

GI-CAT x x x

B2Find x x x x

CEH Catalogue (x) x x

TEODOOR x x x x

Most projects (see chapter 5 for more information) make extensive use of FOSS software for the management of an interoperable infrastructure, as well as of standards for the dissemination of data and metadata. Initiatives such as the INSPIRE Directive, collaborations such as EUDAT or ENVRI, and European infrastructures such as LifeWatch serve as guidelines to meet the needs expressed by users. The eLTER project, within its implementation strategy, has taken the state of the art into account and tries to cover gaps that are still present.

Below we give an overall guideline with respect to the adopted strategies:

● Through the use of ISO 19139 as the preferred metadata format for datasets, eLTER can ensure both relatively easy harvesting procedures and compliance with INSPIRE.
● Using the INSPIRE EMF promotes the distribution of site information as well as basic information about the facilities deployed at each site.


● Providing these EMF spatial data theme formats allows comprehensive information about a site and its data output to be distributed.
● Depositing files on B2Share and saving the generated PID within an INSPIRE-conformant metadata record allows long-term data preservation and the distribution of that information.
● Using OGC SensorML ensures that the description of the sensors is compatible with modern research infrastructures.
● Using the Sensor Observation Service (SOS) for sensor observations promotes the interoperability of measurements among initiatives (NOAA, NetLake, TERENO, etc.) within and outside the network.
● Adopting OGC web services also for maps enables interoperability and GEOSS compliance.
● Providing terms through a thesaurus supports the harmonisation of information and semantic interoperability.
● A SOS web map client allows users to visualise the geodata (maps and observations) from distributed data nodes.
● A user-friendly graphical user interface (GUI) promotes data sharing practices based on OGC standards through interfaces for uploading data and creating metadata.
● Harvestable capabilities allow users to discover all types of data within the data integration portal (DIP).
● The CEH catalogue is designed to be extensible so that new record types can be introduced with a minimum of effort. All documents with different record types in the catalogue are held in the same XML repository and are indexed on a small number of common fields derived from Dublin Core.

7.1. Implementation strategy for data nodes (GET-IT)

GET-IT is currently released as a stable version. A data node can provide 10 different OGC web services to share data and metadata (see Table 2). In particular, vector data (shapefiles), raster data (e.g. GeoTIFF, TIFF) and observations collected by sensors can be added and shared through a user-friendly graphical user interface (GUI) or through standard requests to the web services. The purpose of GET-IT is to meet the needs of LTER researchers concerning:

1. Sharing of data (observations and maps) in a standard way;

2. Creating a distributed network where users can share data using a friendly interface;

3. Providing information about the data (metadata) using a common vocabulary and a single interface;

4. Providing user-friendly tools and software for sharing data.


7.2. Implementation strategy of SOS web map client in DIP

Currently the best available SOS web map client is the one developed by 52°North, but it does not fully fit into the DIP architecture. This was the main reason for becoming an official 52°North partner, in order to make the necessary modifications, which will be made available to a wider audience as open source through the official Git repositories. The purpose of this SOS client is to show on a map all available measurement sites from various sources across the LTER-Europe network, providing data visualisation tools and, most importantly, a simple interface for downloading the requested data in a standardised format.

7.3. Implementation strategy of GeoNetwork in DIP

GeoNetwork was chosen and implemented as the core for metadata search in the DIP. Its main role is to harvest all available metadata from different sources such as DEIMS, GET-IT based virtual nodes, other GeoNetwork instances (e.g. TERENO) or other stakeholders. All collected metadata are not only available to end users via smart or geographic search, but can also be harvested by other systems such as GEOSS.

7.4. Implementation strategy of B2FIND in DIP

The implementation of the EUDAT B2Find service in the Data Integration Portal is currently in progress. It will be utilised as a component of a schema-neutral catalogue, offering one-stop shopping for end users. The aim of this service is to provide a tool for easier discovery of all files uploaded to B2Share by the LTER community. The DIP and B2Find are connected via APIs. Search responses are provided as JSON documents containing all metadata for each file, including the PID. All responses are parsed and shown to the user in an intuitive and user-friendly interface.
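
As a sketch of this API-based coupling, the snippet below queries a search endpoint and extracts PIDs from the JSON response. It assumes the CKAN-style action API that B2Find builds on; the exact endpoint path and metadata field names should be verified against the current B2Find documentation.

```python
# Sketch of the DIP-side lookup: query B2Find and pull the PID out of each
# JSON result.  Endpoint path and metadata field names are assumptions based
# on the CKAN-style API and should be verified before use.
import requests

resp = requests.get(
    "https://b2find.eudat.eu/api/3/action/package_search",
    params={"q": "LTER", "rows": 10},
).json()

for dataset in resp["result"]["results"]:
    title = dataset.get("title", "")
    # the PID is typically carried in one of the "extras" fields (assumption)
    pids = [e["value"] for e in dataset.get("extras", []) if "PID" in e.get("key", "")]
    print(title, pids)
```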


8. References

NATIVI, S. AND L. BIGAGLI. 2009. "Discovery, Mediation, and Access Services for Earth Observation Data", IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2(4): 233–240. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05200393

MIRTL, MICHAEL. 2010. "Introducing the Next Generation of Ecosystem Research in Europe: LTER-Europe's Multi-Functional and Multi-Scale Approach." In Long-Term Ecological Research: Between Theory and Application, edited by Felix Müller, Cornelia Baessler, Hendrik Schubert, and Stefan Klotz, 75–93. Dordrecht: Springer Netherlands. doi:10.1007/978-90-481-8782-9_6.

PAVESI, F., A. BASONI, C. FUGAZZA, S. MENEGON, A. OGGIONI, M. PEPE, P. TAGLIOLATO, AND P. CARRARA. 2016. “EDI – A Template-Driven Metadata Editor for Research Data.” Journal of Open Research Software - JORS 4. DOI: 10.5334/jors.106.

PETERSEIL, J., KLIMENT, T., OGGIONI, A., SCHENTZ, H., BLANKMAN, D., & SCHLEIDT, K. 2013. Manual on Data Management Standards for LTER Europe. Deliverable PD_A1.4.1 EnvEurope Life08 ENV/IT/0003999 Project. Deliverable PD_A1.6 EnvEurope Life08 ENV/IT/0003999 Project.
