
data2019opensci: On the road to open science: the data life cycle
Data sharing in collaboration and virtual research environments

Yvan LE BRAS, Project manager (UMS PatriNat) #PNDB @Yvan2935

Anne-Sophie ARCHAMBEAU, GBIF-France (UMS PatriNat)
Cécile CALLOU, Director, UMS BBEES (CNRS-MNHN)
Aurélie DELAVAUD (FRB)
Dominique JOLY, DAS CNRS (INEE)
Laurent PONCET, Deputy Director, in charge of the Data Centre (UMS PatriNat)
Jean-Denis VIGNE, DGD for Research, expertise, valorisation and teaching (MNHN)

French BON, Biodiversity & Ecosystems Action Group

Sharing data -> collaboration… why?

Nongeospatial metadata for the ecological sciences, Michener et al. 1997

http://urtesasoiak.com/?page_id=1572&lang=fr

-> Virtual Research Environment

VRE: what is it?

Virtual Research Environments (VREs) are increasingly being used to support a more dynamic approach to collaborative working in systematics and taxonomy. Researchers who are not co-located are seeking to work dynamically together at various scales from local to international. These shared infrastructures are funded as VREs in Europe, Virtual Laboratories (VLs) in Australia and Science Gateways (SGs) in the USA and all have similar objectives.

« Research done through distributed global collaboration enabled by the Internet, using very large data collections, terascale computing resources and high performance visualization » (Sir John Taylor, 2001)

« A VRE comprises a set of online tools and other network resources and technologies interoperating with each other to facilitate or enhance the processes of research practitioners within and across institutional boundaries» JISC definition

« Virtual Research Environments are innovative, web-based, community-oriented, comprehensive, flexible, and secure working environments conceived to serve the needs of modern science » (Candela et al. 2013)
• Web based
• Serve the needs of research communities (VRCs)
• Open and flexible

VRE, for which purpose?

Entire research lifecycle

Data oriented steps

http://www.jisc.ac.uk/

VRE tools

1. Collaboration oriented tools
• Built on existing platform
• HUBzero (using Joomla)
• Sakai (a first demonstrator project, so old ;) )
• VRE4EIC e-VRE
• Specific VRE framework
• gCube
• Microsoft SharePoint VRE Toolkits
• OpenDreamKit for mathematics
• Parthenos for Humanities
• PhenoMeNal on-demand VRE for metabolomics
• Apache Airavata software framework
• VRE OSF?

Ref: Yang et al. 2006 (Integration of Existing Grid Tools in Sakai VRE)

VRE: compose your stacks!

Microservices:

- Dependencies management
- Code management
- Data / Metadata portal(s)
- Workflow manager
- Cloud infrastructure management (« infrastructure as code »)
- Analyze through GUI

VRE: some issues (at least for me)!

Tool/Software/Script developed by communities of practice:
- free
- usable only on a local PC

SaaS + PaaS + IaaS developed by international projects with « business » deliverables, such as H2020:
- not free
- usable only through private companies' services
- provided by private companies

Tool/Software/Script developed by communities of practice -> gCube packaging through SAI -> gCube packaged tool (hidden source code):
- free
- usable only on a local PC

SaaS + PaaS + IaaS developed by international projects using gCube:
- free?
- possibility to create an account?
- usable only through D4Science VREs
- provided by D4Science

EU projects related to gCube:
• IMARINE - Data e-Infrastructure Initiative for Fisheries Management and Conservation of Marine Living Resources (283644)
• BlueBRIDGE - Building Research environments for fostering Innovation, Decision making, Governance and Education to support Blue growth (675680)
• EGI-Engage - Engaging the EGI Community towards an Open Science Commons (654142)
• PARTHENOS - Pooling Activities, Resources and Tools for Heritage E-research Networking, Optimization and Synergies (654119)
• D4SCIENCE-II - Data Infrastructure Ecosystem for Science (239019)
• EUBRAZILOPENBIO - EU-Brazil Open Data and Cloud Computing e-Infrastructure for Biodiversity (288754)
• ENVRI PLUS - Environmental Research Infrastructures Providing Shared Solutions for Science and Society (654182)
• AGINFRA PLUS - Accelerating user-driven e-infrastructure innovation in Food Agriculture (731001)
• D4SCIENCE - DIstributed colLaboratories Infrastructure on Grid ENabled Technology 4 Science (212488)
• ENVRI - Common Operations of Environmental Research Infrastructures (283465)
• SoBigData - SoBigData Research Infrastructure (654024)

Ref: Open Science Data Analytics Technologies, AGINFRA+ wiki
https://support.d4science.org/projects/aginfraplus_wiki/wiki/D31_-_Open_Science_Data_Analytics_Technologies#SAI

“Required packages are assumed to be preinstalled on the backend system”
Java / Maven / Eclipse -> not suited to ecologists who develop scripts; Java IT specialists are needed


Massimiliano Assante (CNR), IWSG 2019: Realising a Science Gateway for the Agri-food: the AGINFRAplus Experience
https://www.slideshare.net/aginfra/realising-a-science-gateway-for-the-agrifood-the-aginfraplus-experience?next_slideshow=1


VRE: some good news (at least for me)!

An RStudio-like “interface that allows scientists to easily and quickly import R scripts onto DataMiner”, accessible via the WPS (Web Processing Service) standard

“The algorithms installer manage a list of users with visibility rights on an algorithm”

Workflow orientation -> a limitation for HPC, or not?
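
The WPS standard mentioned above is the OGC Web Processing Service, whose basic requests are plain HTTP GET calls with key-value parameters. As a minimal sketch, the helper below only builds such request URLs and does no network access. The endpoint and the process identifier are hypothetical placeholders, not real DataMiner addresses, and a real service will usually also require an authentication token.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the WPS URL of the service you use.
WPS_ENDPOINT = "https://example.org/wps/WebProcessingService"


def wps_url(endpoint, request, identifier=None, version="1.0.0"):
    """Build a WPS key-value-pair GET URL (string only, no network access)."""
    params = {"service": "WPS", "request": request}
    if request == "GetCapabilities":
        # GetCapabilities negotiates the version through AcceptVersions.
        params["AcceptVersions"] = version
    else:
        # Other operations (DescribeProcess, Execute) carry a version parameter.
        params["version"] = version
    if identifier is not None:
        # Identifier of the process to describe or execute.
        params["identifier"] = identifier
    return endpoint + "?" + urlencode(params)


capabilities_url = wps_url(WPS_ENDPOINT, "GetCapabilities")
# "my.r.script" is a made-up process identifier used for illustration only.
describe_url = wps_url(WPS_ENDPOINT, "DescribeProcess", identifier="my.r.script")

if __name__ == "__main__":
    print(capabilities_url)
    print(describe_url)
```

Listing the processes first (GetCapabilities) and then describing one of them (DescribeProcess) is the usual way to discover what an imported script expects as inputs before running it.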

Vincent Negre: Data intensive agricultural sciences: requirements based on Aginfra+ Project and high throughput phenotyping infrastructure
https://www.slideshare.net/aginfra/data-intensive-agricultural-sciences-requirements-based-on-aginfra-project-and-high-throughput-phenotyping-infrastructure

VRE: sustainability…

VRE: METADATA!!!

VRE: PNDB point of view

Tool/Software/Script -> Github -> DOI
Bioconda, Biocontainers

Other initiatives to consider:
- Software Heritage
- EasyBuild
- Linux Guix

Microservices (SaaS, PaaS, IaaS):
- Connecting the microservices
- Connecting software and infrastructure

One free cloud-based GUI solution: Galaxy

Choose the services you want and the price:
- local (Linux/Windows/...)
- server, HPC or cloud
- command line or GUI

VRE: PNDB stacks!

Microservices:

- Dependencies management (?)
- Code management
- Data / Metadata portal: Metacat
- Workflow manager
- Cloud infrastructure management (« infrastructure as code »)
- Analyze through GUI

Batut et al. (2018) Community-Driven Data Analysis Training for Biology
Jimenez et al. (2017) Four simple recommendations to encourage best practices in research software
Grüning et al. (2018) Practical Computational Reproducibility in the Life Sciences

PNDB VRE proposal for Ecology: Research Objects

http://www.researchobject.org/

Focus on metadata / Everything is metadata! (almost ;) )

ROs must be associated with metadata -> here we see the value of the « completeness » of EML, as this standard seems well suited to describing each component of an RO
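
To make the RO/EML pairing above more concrete, here is a minimal sketch using only Python's standard library. It builds an EML-like record that lists the components of a Research Object as `otherEntity` elements. The package identifier, title, surname and file names are invented for illustration, and the output is not validated against the EML schema, so treat it as a starting point rather than a complete metadata record.

```python
import xml.etree.ElementTree as ET

# EML 2.2.0 namespace; only the root element is namespace-qualified.
EML_NS = "https://eml.ecoinformatics.org/eml-2.2.0"
ET.register_namespace("eml", EML_NS)


def build_eml(package_id, title, creator_surname, components):
    """Return a minimal EML-like tree describing the parts of a Research Object.

    `components` is a list of (name, type) pairs, e.g. a data file and a script.
    """
    root = ET.Element(
        f"{{{EML_NS}}}eml", {"packageId": package_id, "system": "example"}
    )
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = title
    # The same person is used as creator and contact to keep the sketch short.
    for tag in ("creator", "contact"):
        person = ET.SubElement(dataset, tag)
        name = ET.SubElement(person, "individualName")
        ET.SubElement(name, "surName").text = creator_surname
    # One otherEntity per component of the Research Object.
    for name, kind in components:
        entity = ET.SubElement(dataset, "otherEntity")
        ET.SubElement(entity, "entityName").text = name
        ET.SubElement(entity, "entityType").text = kind
    return root


ro = build_eml(
    package_id="example.1.1",         # made-up identifier
    title="Example Research Object",  # made-up title
    creator_surname="Doe",            # made-up person
    components=[("counts.csv", "dataset"), ("analysis.R", "R script")],
)
eml_xml = ET.tostring(ro, encoding="unicode")

if __name__ == "__main__":
    print(eml_xml)
```

A real record would describe tabular data with `dataTable` and its attribute list rather than `otherEntity`; the flat form is used here only to show one metadata entry per RO component.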

PNDB VRE proposal for Ecology: Data Management Plan

PNDB VRE for Ecology, here we are…

V0 / V1 / V2
- Project
- Assays
- software
- Investigation
- Methods
- Study
- datasets
- ORCID
- attributes/variables/…
- Protocols

https://github.com/earnaud/MetaShARK-v2

V3: Publish

Jimenez et al. (2017) Four simple recommendations to encourage best practices in research software
Grüning et al. (2018) Practical Computational Reproducibility in the Life Sciences
Batut et al. (2018) Community-Driven Data Analysis Training for Biology

https://ecology.usegalaxy.eu & https://live.usegalaxy.eu/

EcoinfoFAIR

https://etherpad.in2p3.fr/p/ecoinfoFAIR or https://urlz.fr/aYki

Introductory training sessions
- Good practices for R package work, in "live coding" mode
- R Shiny application development
- Code versioning and collaborative development with Git
- Docker applied to GIS
- Using and developing Galaxy tools, creating conda recipes

Collaboration fest / hackathon
- Modelling: predicting biodiversity metrics on user-defined polygons
- Dockerization of R Shiny applications
- Dockerization of the Metacat data/metadata portal
- GIS visualization via Docker
- Discussion on reworking the PAMPA platform with Galaxy or Shiny tools

Training 2/ Docker + GIS

https://labs.play-with-docker.com/

/!\ For the Docker tutorial: to play with Docker, create an account on https://hub.docker.com/ /!\

wget https://raw.githubusercontent.com/jancelin/docker-lizmap/3.2.2/docker-compose.yml
docker-compose up

[Diagram: code hosted on Github (R scripts 1…n, Rproj, Rmd, Python scripts, ipynb, Shiny ui.R / server.R) depends on R packages 1…n and on conda recipes 1…n (BioContainer). Environments and tools shown: RStudio, Jupyter, Shiny, Galaxy, PAMPA, SEEE, RapidMiner, Taverna, Nextflow, …]

Thank you for your attention

Last test, because fun is fun
https://t.co/yR20wKkkzv
IndividualDisplacements.jl source code
http://shorturl.at/wxDKU
https://80a311a51bfbbccb-7d1a41b20ada449ebaae3c503930a9cc.interactivetoolentrypoint.interactivetool.ecology.usegalaxy.eu/ipython/lab

ANNEXES

PNDB VRE proposal for Ecology

Tool/Software/Script -> Github -> Zenodo -> DOI
Bioconda, Biocontainers

Other initiatives to consider:
- Software Heritage
- EasyBuild
- Linux Guix

Microservices (SaaS, PaaS, IaaS):
- Connecting the microservices
- Connecting software and infrastructure

One free cloud-based GUI solution: Galaxy

Choose the services you want and the price:
- local (Linux/Windows/...)
- server, HPC or cloud
- command line or GUI

Nongeospatial metadata for the ecological sciences, Michener et al. 1997

VRE: what is it?

Please take into account that six projects were officially supported by the EC in 2016 to develop VREs (https://ec.europa.eu/programmes/horizon2020/en/news/six-new-projects-e-i...):
* BlueBRIDGE - Building Research environments for fostering Innovation, Decision making, Governance and Education to support Blue growth
* MuG - Multi-Scale Complex Genomics
* OpenDreamKit - Open Digital Research Environment Toolkit for the Advancement of Mathematics
* VI-SEEM - VRE for regional Interdisciplinary communities in Southeast Europe and the Eastern Mediterranean
* VRE4EIC - A Europe-wide Interoperable Virtual Research Environment to Empower Multidisciplinary Research Communities and Accelerate Innovation and Collaboration
* West-Life - World-wide E-infrastructure for structural biology

* The EVER-EST Horizon 2020 project will create a state-of-the-art virtual research environment (VRE) focused on the earth sciences. https://ever-est.eu/

EVER-EST example

H2020 project for Earth science

“VRE as a Services” is the response to the impelling need for an e-Science infrastructure supporting: reproducibility, reuse and interdisciplinarity insights and cross-fertilisation of scientific investigation and scholarly data, by developing and operating an Virtual Research Environment sustaining Open Science and implementing FAIR Guiding Principles for scientific data management and stewardship

Open Science covers the following aspects:
• Open Discovery, tools able to increase the efficiency of research as well as of its diffusion;
• Open Access to scientific inputs and outputs as research outcomes;
• Open Data, freely available to everyone to use and publish, without restrictions from copyright, patents or other mechanisms of control;
• Open Methodology, related to the research strategy that outlines the way in which research is to be undertaken;
• Open Education, broadening access to the learning and training traditionally offered through formal education systems.

EVER-EST example -> what is particularly of interest

ROs must be associated with metadata -> here we see the value of the « completeness » of EML, as this standard seems well suited to describing each component of an RO

- R / RStudio X
- Galaxy-E