
PoS(ISGC2014)022

Realising Virtual Research Environments by Hybrid Data Infrastructures: the D4Science Experience

Leonardo Candela (a), Donatella Castelli (a), Andrea Manzi (b), and Pasquale Pagano (a)

(a) Istituto di Scienza e Tecnologie dell’Informazione “A. Faedo”, Consiglio Nazionale delle Ricerche, Via G. Moruzzi 1, 56124, Pisa, Italy
(b) European Organisation for Nuclear Research CERN, CH – 1211 Geneva 23, Switzerland
E-mail: {candela | castelli | pagano}@isti.cnr.it, andrea.manzi@.ch

Modern science tends to be more than ever multidisciplinary, collaborative and demanding. Scientific investigations span the boundaries of single institutions, disciplines, and countries. In order to serve these scenarios innovative working environments are needed. This paper presents an innovative environment consisting of a Hybrid Data Infrastructure supporting the dynamic deployment and operation of an array of Virtual Research Environments, each tailored to serve the needs of a scientific community towards a research endeavour. The paper reports on the enabling technology and how it has been deployed to realise the infrastructure and serve the needs of different communities.

International Symposium on Grids and Clouds (ISGC) 2014, 23-28 March 2014, Academia Sinica, Taipei, Taiwan

Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence. http://pos.sissa.it/

1. Introduction

Science and scientists are calling for innovative approaches and practices aiming at facilitating research collaborations that span institutions, disciplines, and countries. These approaches should be offered under the “as-a-Service” paradigm, i.e., scientists expect to be provided with innovative working environments that give them the facilities they need while allowing them to save time and money without compromising research quality. The expected facilities include (i) data, usually falling in the big data domain and spreading across multiple information systems and repositories, (ii) services, i.e., an open ended set of [...] and workflows supporting data analysis and mining, and (iii) computing capabilities, i.e., the power to elastically acquire the amount of computing resources needed to effectively and efficiently execute onerous tasks.

To serve these scenarios, D4Science.org is operating a Hybrid Data Infrastructure [1] that, by leveraging Grid [6], Cloud [7], Digital Library [8] and Service-orientation [9] principles and approaches, is delivering a number of data management facilities as-a-Service. One of its distinguishing features is the orientation to serve the needs of diverse Communities of Practice [10] by providing each of them with a dedicated, flexible, ready-to-use, web-based working environment, i.e., a Virtual Research Environment [2, 11]. Virtual Research Environments are web-based working environments where groups of scientists, possibly geographically distant from each other, have user friendly, transparent and seamless access to the flexible and shared set of remote resources (data, services and computing capabilities) needed to perform their work.

In this paper, we present the D4Science Hybrid Data Infrastructure (HDI) and the rich array of Virtual Research Environments deployed and operated to serve communities of practice in domains including biodiversity, environment and fisheries.

The remainder of this paper is organised as follows. Section 2 extensively describes the enabling technology, i.e., gCube. Section 3 documents the D4Science Infrastructure. Section 4 describes the approach leading to the creation of VREs and presents the currently existing ones. Finally, Section 5 concludes the paper.

2. gCube: the enabling technology

gCube [3, 5] is a software system specifically conceived to enable the creation and operation of an innovative typology of infrastructure, i.e., a Hybrid Data Infrastructure [1]. This is an IT infrastructure built as a “system of systems”, i.e., implemented by nicely integrating other infrastructures, services and information systems. This HDI is conceived to enable the delivery of Virtual Research Environments “as-a-Service” [2].

gCube hosts a compelling portfolio of applications having a vast and heterogeneous audience ranging from scientists willing to perform their investigations in a more “simple” way to service providers willing to develop innovative facilities for scientists. The current catalogue of applications captures six main domain bundles that can be customized to meet specific needs (Fig. 1).

Figure 1: gCube Application Bundles

2.1 AppsCube

AppsCube is a framework conceived to support practitioners willing to develop applications interfacing with and benefitting from gCube facilities. It includes applications ranging from the Application Service Layer, i.e., a framework acting as a middleware between the gCube lower level services and the presentation layer, to the Featherweight Stack, i.e., a set of microlibs enabling the interaction with gCube services, and the SmartGears, i.e., a set of Java libraries that transparently turn Servlet-based containers and applications into gCube resources. More details are available in the Developer’s Guide [4].
2.2 BiolCube

BiolCube is a gCube application bundle offering facilities to practitioners working with species occurrence data and taxonomic profiles.

In particular, it offers facilities for discovering and accessing species occurrence and taxonomic data within major repositories and information systems, e.g., OBIS, GBIF, Catalogue of Life, WoRMS [12]. The discovery mechanism is simple (based on species common names or scientific names) yet powerful, since it supports query expansion and additional filters. The identified datasets are enriched with links to other species, and can be displayed on a map as well as saved in standard formats (e.g., DarwinCore, DarwinCore-Archive, CSV) for future uses.

Moreover, it offers facilities for species occurrence datasets processing [12], facilities for species distribution modeling [13], and facilities for taxonomic and nomenclature data comparison [14, 15].

The facilities for species occurrence datasets processing include algebraic operations (union, intersection, subtraction, and duplicates deletion) based on spatial and syntactic similarity measurements, clustering (e.g., density based algorithms such as DBScan, distance based algorithms such as K-means), outliers detection (e.g., Local Outlier Factor approach), occurrence points representativeness (e.g., Habitat Representativeness Score technique), and occurrence points enrichment with chemical and physical environmental parameters.

The facilities for species distribution modeling offer a rich array of dedicated algorithms and approaches including AquaMaps [13]. AquaMaps is actually a family of approaches (e.g., suitable, native) producing species distribution probabilities on half-degree cells by relying on (i) HSPEN, a table containing species envelopes, (ii) HCAF, a table containing environmental parameters and (iii) a table containing species occurrence points (half-degree cells). Moreover, it offers methods for producing new versions of HSPEN and HCAF.

The facilities for comparing taxonomic and nomenclature data include (a) a flexible environment for comparing any two taxonomic checklists in DarwinCore-Archive format to detect, analyse and report relationships among taxa of the compared checklists (e.g., corresponds, includes, overlaps, not found in) [14], and (b) Bionym [15], i.e., a taxonomic data matching workflow based approach that enables users to combine a number of matchers (e.g., GSay, FuzzyMatcher, Levenshtein, Trigram) and tune their contribution while identifying the proper scientific names recognised by a number of authoritative sources.
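The idea of combining several name matchers and tuning their contribution, as described for the taxonomic data matching workflow in Section 2.2, can be illustrated with a minimal Python sketch. This is not the Bionym implementation: the function names, the weighting scheme (a weighted average of similarity scores) and the sample names are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Matcher = Callable[[str, str], float]  # returns a similarity in [0, 1]


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic dynamic-programming scheme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance normalised to a similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def trigrams(text: str) -> set:
    """Character trigrams of a padded, lower-cased string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the trigram sets of the two names."""
    set_a, set_b = trigrams(a), trigrams(b)
    return len(set_a & set_b) / len(set_a | set_b)


def match(name: str,
          references: Sequence[str],
          matchers: Sequence[Tuple[Matcher, float]]) -> List[Tuple[str, float]]:
    """Rank reference names by the weighted average of the matcher scores.

    The weights play the role of the 'tunable contribution' of each matcher
    (a hypothetical scheme, chosen for simplicity).
    """
    total_weight = sum(weight for _, weight in matchers)
    ranked = []
    for reference in references:
        score = sum(weight * matcher(name.lower(), reference.lower())
                    for matcher, weight in matchers) / total_weight
        ranked.append((reference, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    checklist = ["Thunnus albacares", "Gadus morhua"]  # illustrative names
    weights = [(levenshtein_similarity, 2.0), (trigram_similarity, 1.0)]
    for reference, score in match("Gadus morua", checklist, weights):
        print(f"{reference}: {score:.3f}")
```

Raising the weight of one matcher makes the ranking follow that matcher more closely, which is the kind of tuning the text refers to.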
2.3 ConnectCube

ConnectCube is a gCube application bundle offering facilities to practitioners wanting to produce information-rich objects, resulting from the aggregation and synthesis of data from multiple sources.

In particular, it offers an innovative collaboration oriented environment integrating social networking practices in research environments [16]. It is conceptually close to the common facilities promoted by social networks (e.g., posting news, commenting on posted news, re-sharing news) yet adapted to promote large scale collaboration and cooperation on comprehensive scientific products, datasets, theories, and tools. Apart from post-oriented facilities, the environment offers a shared workspace and a messaging application. The workspace resembles a folder-based file system for managing information objects. The added value is represented by the type of information objects it can manage in a seamless way. It supports items ranging from binary files to information objects representing datasets, workflows, species distribution maps, time series, and comprehensive research products. Through it, data sharing is fostered, making results, workflows, annotations and documents immediately available to co-workers, to anyone provided with URIs, and to any other authorized person via WebDAV. The messaging application resembles an email environment with the distinguishing feature of being integrated with the rest, e.g., it is possible to send as attachment any dataset residing in the workspace without consuming bandwidth.

Moreover, it offers an application for creating and managing enhanced documents [17], i.e., rich information objects resembling documents yet aggregating multiple parts. Parts include images, datasets, maps, and graphs. It offers functionality for defining templates the documents should adhere to, as well as facilities for defining and monitoring workflows driving the collaborative production of these “documents”.

A set of facilities providing its users with an integrated discovery and access to heterogeneous data completes the offering of this bundle. In particular, the Information Object Discovery application offers facilities for retrieving information objects from multiple collections and information systems in a seamless way [18]. It offers both a Google-like approach and an advanced search allowing users to characterise in detail the information objects they are looking for. It includes facilities for presenting the results according to domain specific semantic-based clustering [19]. Moreover, it offers a unifying top level ontology providing users with an integrated view over a number of information sources [20].

2.4 GeosCube

GeoCube is a gCube application bundle offering facilities to practitioners dealing with geospatial information.

In particular, it offers facilities for geospatial data discovery and processing [21].

The Geospatial Data Discovery application offers facilities for browsing and visualising geospatial data. In particular, these include facilities to explore, navigate, search and discover layers within a GeoNetwork instance via the OGC CSW protocol [22]. Moreover, these include facilities to interactively manipulate, visualize, compare, and analyse geospatial data.

The Geospatial Data Processing application offers facilities for executing a rich array of data processing tasks on geospatial data. The current set of supported tasks includes maps comparison algorithms (supported formats include WFS, OPeNDAP and ASC), an intersection algorithm, i.e., an approach that computes the percentages of overlap between different areas on the compared maps, and many more. Moreover, the application enables the invocation of services exposing their capabilities via the OGC WPS protocol [23].

2.5 IceCube

IceCube is a gCube application bundle offering core facilities for the deployment, operation and management of a gCube based infrastructure. This is the basic building block of this technology and it is particularly relevant for the mechanism enabling the [...] of the Virtual Research Environments. It includes the following applications:

Resources Management: This application offers facilities for the management (discovery, deployment, monitoring) of resources including hosting nodes, services, software and datasets. It includes (a) an Information System acting as a registry of the infrastructure by offering global and partial views of its resources and their current status and notification instruments; (b) a Resource Management Service that builds on the Information Service to realise resource allocation and deployment strategies. For resource allocation, it enables the dynamic assignment of a number of selected resources to a given community (e.g., the creation of a VRE requires that a set of hosting nodes, service instances and data collections are allocated to a given application). For deployment, it enables the allocation and activation of both gCube software and external software on hosting nodes, i.e., servers able to host running instances of services; (c) a hosting node, a software component that once installed on a server transforms it into a gCube hosting node. A gCube hosting node can be dynamically endowed with a number of services including a local worker to execute computing tasks on that server.

File-oriented Storage Facilities: This is a scalable high-performance storage service. In particular, it relies on a network of distributed storage nodes managed via specialized open-source software for document-oriented databases. This facility is offered by the gCube Storage Manager, a Java based software that presents a unique set of methods for services and applications running on the e-Infrastructure. In its current implementation, three possible document store systems are used [24]: MongoDB, Terrastore and USTO.RE [25]. The Storage Manager was designed to reduce the time required to add a new storage system to the e-Infrastructure. This promotes openness towards other document stores, e.g., CouchDB [26], while hiding the heterogeneous protocols of those systems from the services and applications exploiting the infrastructure storage facility.

Policy-oriented Security Framework: This application offers facilities for authorisation, authentication and accounting as-a-Service. It is based on standard protocols and technologies (e.g., SAML) providing: (a) an open and extensible architecture; (b) interoperability with external infrastructures and domains while obtaining the so-called Identity Federation.

Virtual Research Environments Management: This application offers facilities for dynamically creating and managing virtual research environments [2, 11], i.e., web-based working environments tailored to serve the needs of diverse and evolving user communities. The application supports the specification and deployment of complete VREs in terms of the data and services they should offer by automatically acquiring and aggregating the needed resources including the user interface constituents from the infrastructure [27, 28].

Workflow Management Facilities: This application offers facilities for executing complex processes, i.e., a workflow of tasks. It includes a process execution engine (PE2ng) that manages the execution of software elements in a distributed infrastructure under the coordination of a composite plan that defines the data dependencies among its actors. It provides a powerful, flow-oriented processing model that supports several computational middleware without performance compromises [29]. Thus, a process can be designed as a workflow of invocations of components (including services, binary executables, scripts, map-reduce jobs) by ensuring that prerequisite data are prepared and delivered to their consumers through the control of the flow of data. PE2ng aims to bring together and integrate computing paradigms, execution patterns and Infrastructures. Overall, an unrestrictive meta-Infrastructure is offered with a single submission, monitoring and access execution point offering a single [...] for “Programming in the Large” [30].

2.6 StatsCube

StatsCube is a gCube application bundle offering facilities to practitioners working with a rich array of information, ranging from observational data to statistical data. In particular, it includes applications for data analytics at scale and applications for tabular data and code lists management.

The Statistical Service offers facilities for efficiently and effectively executing a rich array of statistical data processing algorithms [31]. The application relies on the distributed and elastic computing capacities offered by the underlying infrastructure. It offers a set of off-the-shelf algorithms including clustering algorithms such as DBScan. Moreover, it enables a simple integration and execution of user-defined algorithms expressed in a number of programming and scripting languages including R [32]. It currently embeds more than 100 different algorithms ranging from Anomalies Detection, Classification, Clustering, Simulation, Training, Bayesian Methods, Trends, and many more [33]. These algorithms are then executed on a distributed infrastructure by completely hiding the complexity of such an execution while ensuring robustness, throughput, fault-tolerance, and privacy.

The Tabular Data Manager offers facilities for discovery, management and processing of tabular data. In particular, it offers facilities for supporting the entire workflow of tasks on tabular data including tabular data creation, collaborative curation and [...]. It also offers a number of facilities for tabular data manipulation including filtering, grouping, unions and intersections. In addition to that, it is equipped with a powerful mechanism for versioning and a rich set of metadata for describing the tabular data resource including provenance. For discovery, it offers both a Google-like approach and an advanced search allowing users to characterise in detail the information they are looking for. For the processing, it relies on the Statistical Manager to offer effective tabular data manipulation facilities including geocoding, maps projection, clustering, outlier identification, hidden trends, trends comparison, and many more.

COTRIX offers facilities for the management of code lists, i.e., recognised controlled vocabularies. It includes facilities for code lists creation (also via ingestion), collaborative curation, and publishing. It offers both a Google-like approach and an advanced search allowing users to characterise in detail the information they are looking for.
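The tabular data manipulation operations named above for the Tabular Data Manager (filtering, grouping, unions and intersections) can be pictured with a small Python sketch. It is only an illustration under simple assumptions: a table is a list of flat dictionaries, two rows are equal when all their fields are equal, and all function names and sample data are hypothetical rather than taken from the gCube code base.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Row = Dict[str, Hashable]


def filter_rows(rows: Sequence[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Keep only the rows satisfying the predicate."""
    return [row for row in rows if predicate(row)]


def group_sum(rows: Sequence[Row], key: str, value: str) -> Dict[Hashable, float]:
    """Group rows by one column and sum another column within each group."""
    totals: Dict[Hashable, float] = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


def _fingerprint(row: Row) -> Tuple:
    """Hashable representation of a row, independent of field order."""
    return tuple(sorted(row.items()))


def union(left: Sequence[Row], right: Sequence[Row]) -> List[Row]:
    """Rows appearing in either table, without duplicates, in first-seen order."""
    seen, merged = set(), []
    for row in list(left) + list(right):
        mark = _fingerprint(row)
        if mark not in seen:
            seen.add(mark)
            merged.append(row)
    return merged


def intersection(left: Sequence[Row], right: Sequence[Row]) -> List[Row]:
    """Rows appearing in both tables, without duplicates, in left-table order."""
    right_marks = {_fingerprint(row) for row in right}
    seen, common = set(), []
    for row in left:
        mark = _fingerprint(row)
        if mark in right_marks and mark not in seen:
            seen.add(mark)
            common.append(row)
    return common


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    table_a = [{"species": "tuna", "area": "A", "tonnes": 10},
               {"species": "cod", "area": "A", "tonnes": 7}]
    table_b = [{"species": "cod", "area": "A", "tonnes": 7},
               {"species": "tuna", "area": "B", "tonnes": 5}]
    print(group_sum(union(table_a, table_b), "species", "tonnes"))
    print(intersection(table_a, table_b))
```

Treating a row as a set of (column, value) pairs keeps union and intersection independent of column order; a real service would also have to deal with typed columns and much larger tables.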

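The notion of a process as a workflow of tasks whose execution is driven by data dependencies (Section 2.5, Workflow Management Facilities) can be sketched in a few lines of Python. This is not PE2ng: it runs everything in one process, and the task names, the `run_workflow` function and the dictionary-based plan format are assumptions introduced for illustration. It relies on `graphlib` from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, Iterable, List, Tuple

Task = Callable[[Dict[str, Any]], Any]


def run_workflow(tasks: Dict[str, Task],
                 dependencies: Dict[str, Iterable[str]]) -> Tuple[Dict[str, Any], List[str]]:
    """Execute tasks so that each one runs after the tasks it depends on.

    tasks        maps a task name to a callable receiving the results of its
                 prerequisites as a dictionary.
    dependencies maps a task name to the names of its prerequisite tasks.
    Returns the results of all tasks and the order in which they ran.
    A cyclic plan makes TopologicalSorter raise graphlib.CycleError.
    """
    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    results: Dict[str, Any] = {}
    order: List[str] = []
    for name in TopologicalSorter(graph).static_order():
        # Prerequisite data are prepared before the consumer is invoked.
        inputs = {dep: results[dep] for dep in graph.get(name, ())}
        results[name] = tasks[name](inputs)
        order.append(name)
    return results, order


if __name__ == "__main__":
    # A hypothetical four-step plan: load data, derive two products, report.
    plan_tasks = {
        "load": lambda inputs: [3, 1, 2],
        "sort": lambda inputs: sorted(inputs["load"]),
        "total": lambda inputs: sum(inputs["load"]),
        "report": lambda inputs: {"sorted": inputs["sort"], "total": inputs["total"]},
    }
    plan_dependencies = {"sort": {"load"}, "total": {"load"}, "report": {"sort", "total"}}
    outcome, execution_order = run_workflow(plan_tasks, plan_dependencies)
    print(execution_order)
    print(outcome["report"])
```

In a distributed setting the independent tasks ("sort" and "total" here) could be dispatched to different nodes; the sketch only shows how the dependency plan constrains the order.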
3. The D4Science infrastructure

The D4Science infrastructure was designed, developed and put in production back in 2007 with the support of a series of EU projects [34]. It started to serve several use cases and communities ranging from Fisheries to Digital Libraries, to Earth Observation and Biodiversity, and it has acquired its maturity specializing its scope serving Biodiversity scientific communities (both marine and [...]). The two main user communities at the moment are those participating in the EU projects iMarine (http://www.i-marine.eu) and EUBrazilOpenBio (http://www.eubrazilopenbio.eu) [12, 14].

The D4Science infrastructure is geographically distributed by design and its main feature is to enable interconnections among different technology providers and data providers. In addition, it offers interoperability at the level of the computing infrastructure, such as cloud and grid computing.

The D4Science infrastructure hosting resources dedicated to iMarine and EUBrazilOpenBio are provided by project members plus an external partner (ASGC Taiwan). Therefore in total 6 project member sites contribute to the infrastructure: CNR - Pisa, Italy; FAO - Rome, Italy; NKUA - Athens, Greece; VLIZ - Ostende, Belgium; UFF - Niteroi, Brazil; UPV - Valencia, Spain. In addition an external project partner is also providing resources, namely ASGC - Taiwan.

Table 1 provides detailed information about the contribution from each site. The column “type” either reports Hardware (HW) together with type of CPU or Virtual Machine (VM) and the type of virtualization system.

Table 1: D4Science Infrastructures: resources by partner

Site | Type | Resource
CNR | VM | Xen hypervisor
NKUA | VM | Xen hypervisor
FAO | HW | Two Quad-Core Intel Xeon
ASGC | HW | Two Quad-Core Intel Xeon
VLIZ | HW | Intel Xeon
UFF | HW | Intel Xeon
UPV | HW | Intel Xeon
 | HW | Quad-Core Intel Xeon

Per-resource figures (CPU model where given | RAM(GB) | Disk(TB) | CPUs):
 | 844 | 35.4 | 754
 | 106.25 | 1.8 | 56
X5450 @ 3.0 GHz | 8 | 0.3 | 8
CPU 5130 @ 2.00GHz | 8 | 0.2 | 8
CPU 5150 @ 2.66GHz | 4 | 0.1 | 4
CPU E5649 @ 2.53GHz | 3 | 0.4 | 4
CPU 3.06GHz | 2.5 | 0.5 | 2
CPU 2.00 GHz | 38 | 0.75 | 16
Total | 1011.25 | 39.5 | 852

Part of the resources allocated to the infrastructure are dedicated to the pre-production infrastructure, also called Quality Assurance environment, where the software is validated before reaching the production environment. For what concerns the production infrastructure, the hosted resources can be categorized in 3 main areas:

gCube Resources are dedicated to host gCube Service containers (both gHN and Tomcat);

UMD Resources are dedicated to host the UMD middleware (http://repository.egi.eu/category/umd_releases/) production services for the EGI infrastructure (http://www.egi.eu);

Third Party Resources are hosting third party services classified under their functional category:
• Clusters: MongoDB, Cassandra, Hadoop, OGC Services (Geoserver, Thredds, North52 WPS), CouchBase, ElasticSearch;
• Services: ActiveMQ MessageBroker, JackRabbit, RStudio;
• Databases: several PostgreSQL and MySQL instances.

In addition to hosted resources the infrastructure exploits external resources via federated access. This includes access to resources agreed upon signed Memorandum of Understanding or collaborations with project members. This includes the resources offered by the EGI infrastructure which extend the storage and computing capacity available under the D4Science Infrastructure. In detail, the EGI sites supporting the D4Science infrastructure Virtual Organisation (d4science.research-infrastructures.eu) are in Table 2.

Table 2: EGI services supporting the D4Science Virtual Organisation

SiteShortName | SiteOfficialName | CREAM | WN | SE
INFN-TRIESTE | INFN-TRIESTE | X | X | X
Taiwan-LCG2 | Academia Sinica Center | X | X | X

Per-site figures (CPUs | Storage(TB)):
240 | 0.05
2380 | 3.7
Total | 2620 | 3.75

In addition to grid resources, by collaborating with the VENUS-C project (http://www.venus-c.eu/), D4Science has been granted access to Azure (https://www.windowsazure.com/) computation and storage cloud resources, namely 1.5 M CPU Hours and 1.5 TB Storage.

For the data, the D4Science infrastructure offers services for seamless access to a wide spectrum of data including species data (cf. Tab. 3), geospatial data (cf. Tab. 4), statistical data (cf. Tab. 5), and semi-structured data (cf. Tab. 6) from multiple data providers and information systems.

Table 3: Species Data Databases and Information Systems Integrated in D4Science

Data source | Description
Catalogue of Life | The data source offers an integrated checklist and a taxonomic hierarchy of more than 1.3 million species of animals, plants, fungi and micro-organisms.
FAO ASFIS | The List of Species for Fishery Statistics Purpose includes 12,000+ species of interest or relations to fisheries and aquaculture; www.fao.org/fishery/collection/asfis/en
Fishbase | The data source offers access to 32,700 Species, 302,900 Common names, 53,600 Pictures, 49,700 References aggregated thanks to the effort of thousand collaborators.
GBIF | The data source offers more than 430 million records on species and more than 14,000 datasets aggregated from 580+ publishers; www.gbif.org
IRMNG | The Interim Register of Marine and Nonmarine Genera data source offers access to over 465,000 genus names and 1.6 million species names; www.obis.org.au/irmng
ITIS | The Integrated Taxonomic Information System data source offers authoritative taxonomic information on plants, animals, fungi, and microbes of North America and the world; www.itis.gov
NCBI Taxonomy | The National Center of Biotechnology Information data source offers a curated classification and nomenclature for all of the organisms in the public sequence databases. This currently represents about 10% of the described species of life on the planet; www.ncbi.nlm.nih.gov/taxonomy
OBIS | The Ocean Biogeographic Information System data source offers more than 37 million records on species and 1,300+ datasets; www.iobis.org
SeaLifeBase | The data source offers access to 126,000 Species, 27,300 Common names, 11,900 Pictures, 18,200 References aggregated thanks to the effort of hundred collaborators.
WoRDSS | The World Register of Deep-Sea Species data source offers species “names” for deep-sea species based on WoRMS; www.marinespecies.org/deepsea
WoRMS | The World Register of Marine Species data source offers species “names” for more than 200,000 species including 300,000+ species names and synonyms and 400,000+ taxa; www.marinespecies.org

Table 4: Spatial Data Databases and Information Systems Integrated in D4Science

Data source | Description
FAO GeoNetwork | The data source exposes spatial data maintained by FAO and its partners; geonetwork.fao.org
Marine Regions | The data source gives access to a standard list of marine georeferenced place names and areas including EEZ; www.marineregions.org
myOceans | The data source gives access to a number of environmental variables. In particular, D4Science focuses on some indicators including ice concentration, ice thickness, ice velocity, mass concentration of chlorophyll in sea water, meridional velocity, mole concentration of dissolved oxygen in sea water, mole concentration of nitrate in sea water, mole concentration of phosphate in sea water, mole concentration of phytoplankton expressed as carbon in sea water, net primary production of carbon, salinity, sea surface height, temperature, zonal velocity, wind speed, and wind stress; www.myocean.eu
World Ocean Atlas | The data source gives access to a number of environmental variables. In particular, iMarine focuses on some indicators including Apparent Oxygen Utilisation, Dissolved Oxygen, Nitrate, Oxygen Saturation, Phosphate, Sea Water Salinity, Sea Water Temperature, and Silicate; www.nodc.noaa.gov/OC5/WOA09/pr_woa09.html

Table 5: Statistical Data Databases and Information Systems Integrated in D4Science

Data source | Description
Codelists | A set of SDMX Codelists either directly accessed from the FAO Registry, or manually uploaded through the facility developed in the context of ICIS.
IRD Datasets | The UMR EME/Observatoire Thonier SDMX Registry and Repository exposes the Sardara database that contains tuna captures data from several countries, aggregated according to CWP statistical squares (1’x1’ or 5’x5’) and the ObServe database that contains tuna and bycatches captures observed by scientific observers on-board of French industrial purse seiners.
StatBase | This data source collects and organises data about several sectors including Agriculture, Education, Energy, Environment, Industry, Population. Data are collected from several data providers including African Development Bank, Central Bank of Central African States, Freedom House, International Energy Agency, OECD, United Nations Industrial Development Organization.

Table 6: Various Databases and Information Systems Integrated in D4Science

Data source | Description
Aquatic Commons | The data source offers access to thematic material covering natural marine, estuarine/brackish and fresh water environments; aquaticcommons.org
BHL | The Biodiversity Heritage Library data source offers access to legacy literature of biodiversity held by a consortium of natural history and botanical libraries; www.biodiversitylibrary.org
Bioline | The data source offers access to quality research journals published in developing countries; www.bioline.org.br
CEEMar | The Central and Eastern European Marine Repository data source offers material covering marine, brackish and fresh water environments; www.ceemar.org/dspace
DataCite | The data source offers access to the same service whose mission is to give access to research data; www.datacite.org
DBPedia | The knowledge base results from Wikipedia. It contains over 4 million things including persons, places, creative works, organisations, species and diseases; dbpedia.org/About
DRS | The data source at National Institute of Oceanography offers institutional publications including journal articles and technical reports; drs.nio.org/drs
Dryad | The data source offers access to the same service whose mission is to give access to research data underlying research publications; datadryad.org
FactForge | The knowledge base results from the integration of a number of datasets including DBPedia, WordNet, Geonames, and Freebase; factforge.net
FAO Factsheets | The data source gives access to the Aquatic Species Fact Sheets developed by the same FAO programme; www.fao.org/fishery/fishfinder
FAO FLOD | A semantic knowledge base hosted by FAO containing a dense network of relationships among the major entities of the fishery domain, including marine species, water areas, land areas, and exclusive economic zones; www.fao.org/figis/flod
iMarine TLO | The warehouse integrates information from FishBase, WoRMS, ECOSCOPE, FLOD and DBPedia by using the same top-level ontology developed for the marine domain. It currently contains approximately 3 million triples about more than 40,000 entities including marine species, ecosystems, water areas, and vessels.
Nature | The data source offers access to the articles published by nature.com.
OceanDocs | The data source offers research and publication materials in Marine Science by aggregating content from 256 repositories; www.oceandocs.net
OpenAIRE | The data source gives access to the publications aggregated by the same European funded project; www.openaire.eu
PANGAEA | The data source offers georeferenced data from earth system research via OAI-PMH. The system guarantees long-term availability of its content through a commitment of the operating institutions. The aggregated repositories are 475; www.pangaea.de
PenSoft Journals | The data source gives access to a number of open-access journals. In particular, iMarine focuses on BioRisk, Comparative Cytogenetics, International Journal of Myriapodology, Journal of Hymenoptera Research, MycoKeys, Nature Conservation, NeoBiota, PhytoKeys, Subterranean Biology, and ZooKeys.
SmartFish | The SmartFish Chimaera knowledge base offers a unified and integrated view on three marine fisheries information sources, i.e., FIRMS, an international knowledge base including fisheries and resource from West Indian Ocean; StatBase, a statistical database containing statistics provided by West Indian Ocean countries; and WIOFish, a regional knowledge base on West Indian Ocean Fisheries.
WHOAS | The data source offers the production of Woods Hole scientific community including articles and data sets; www.mblwhoilibrary.org/services/whoas-repository-services
YAGO2 | The knowledge base extends the YAGO knowledge base by anchoring entities, facts and events in time and space. The knowledge base is built from Wikipedia, GeoNames and WordNet and contains more than 440 million facts about 9.8 million entities.

4. Virtual Research Environments

Virtual Research Environments (VREs) are “dedicated systems” that provide their users with a web-based set of facilities (including services, data and computational facilities) to accomplish a set of tasks by dynamically relying on the underlying infrastructure.

The development of VREs is actually based on three main activities. Two of them are “preparatory” and consist in (i) the development of software artefacts that realise a set of functions (cf. Sec 2); and (ii) the deployment of these artefacts in an operational infrastructure and the population of the infrastructure itself (cf. Sec 3). The third activity is the actual deployment and operation of a VRE which, thanks to gCube and the D4Science infrastructure, is a very straightforward activity consisting of: (i) a design phase where authorised users are provided with a wizard-based approach to specify the data and the services characterising the envisaged environment by selecting among the available ones; (ii) a deployment phase where authorised users are provided with a wizard-based approach to approve a VRE specification and monitor the automatic deployment of the real components needed to satisfy the specification; and (iii) an operation phase where authorised users are provided with facilities for managing the users of the VRE and altering the VRE specification if needed. Details on this approach have been presented in previous works [27, 28] while a screenshot of the wizard supporting the VRE specification is in Figure 2.

Figure 2: Virtual Research Environment definition phase: selecting the expected facilities

At the end of March 2014, 21 VREs are concurrently hosted and operated by the D4Science infrastructure. Some of them have been in constant use for over four years with over 750 users. A detailed list is in Table 7 where for each VRE it is reported: the name of the VRE, the domain the VRE is for serving, whether the VRE membership is “open”, i.e., any user can join it, or not, and the number of users.

In some cases the users served by the VRE represent a small fraction of the community benefitting from the VRE services, e.g., the AquaMaps VRE [35, 13] is actually exploited by data managers to produce the maps disseminated via the AquaMaps service.

Table 7: D4Science Virtual Research Environments

Columns: VRE Name, Domain, Open, Users.
VRE names: AquaMaps, BiodiversityLab, BiodiversityResearchEnvironment, DocumentsWorkflow, EcologicalModeling, ENVRI, FCPPS, FishFinderVRE, gCube, ICIS, iMarineBoard, iSearch, MarineSearch, ScalableDataMining, SpeciesLab, TBTI, TCom, TimeSeries, VesselActivitiesAnalyzer, VME-DB, VTI.
Domains: Analytics, Any, Biodiversity, Environment, Fisheries, Marine, Policy, Software.
Open: 7 VREs are marked as open.
Users: between 4 and 64 per VRE.

5. Conclusions

Modern science calls for innovative working environments crossing the boundaries and capacities of single scientists, laboratories and institutions.

In this paper we have presented one of such innovative working environments, i.e., that offered by the D4Science organisation. This organisation is making available a Hybrid Data Infrastructure enabling the dynamic deployment and operation of an array of Virtual Research Environments, each tailored to serve the needs of a scientific community towards a research endeavour. We have presented gCube, i.e., the enabling technology, as well as the resulting infrastructure and the existing VREs. The served communities and scenarios have demonstrated that the proposed approach is suitable and can be applied to a range of scientific applications.

The gCube technology is in constant evolution and its community is particularly active: in the period August ’13 – August ’14, 40 contributors have been involved in its development with more than 11,000 software commits (https://www.openhub.net/p/gCube). Future activities and work will mainly focus on three typologies of actions: (i) the creation of new virtual research environments to serve the needs of new scenarios by benefitting from the rich array of facilities so far developed; (ii) the development of plug-ins and mediator services enlarging [...]; (iii) the development of [...] developments can be performed by [...]

[...] iMarine project (FP7 [...] Contract No. 283644).
Article first published Communications of the ACM 016/j.ecoinf.2014.07.006 The DELOS Reference ucture; and ( onwinski, G. Lee, D. Patterson, A. ing of the Statistical Manager service. algorithm worth to share can decide to oost in performances [32]. ware System for Hybrid Data key concepts and principles. through hybrid data infrastructures. esearch Environments in the Cloud a di Scienza e Tecnologie dell’Informazione aldi. Species distribution modeling in the environments: an overview and a research ika, C. Meghini, P. Pagano, S. Ross, D. http: pa, V. Marioli, and P. Pagano. An rsity data. res 13 https://www.gcube-system.org , (83):32–33, October 2010. . DELOS: a Network of Excellence on Digital Libraries, ERCIM News , 12:GRDI75–GRDI81, 2013. The Grid: Blueprint for a Future Computing Infrastructure The work reported has been partially supported by the , (89):37–38, 2012. Communities of Practice: Learning, Meaning and Identity , 9:75–81, 2005. Data Science Journal Concurrency and Computation: Practice and Experience ERCIM News //gcube.wiki.gcube-system.org/gcube/index.php/Devel agenda. 1998. 2014. Article first published online: 1 August 2014 DOI: 10.1 cloud. online: 11 July 2013 http://onlinelibrary.wiley.com/doi/10.1002/cpe.3030 Reality: the gCube Approach. infrastructure-oriented approach for supporting biodive 2008. Computing “A. Faedo”, CNR, 2008. Infrastructures. 2008-TR-035, Istituto Morgan-Kaufmann, 2004. 53(4):50–58, April 2010. Rabkin, I. Stoica, and M. Zaharia. A view of . Soergel, M. Agosti, M. Dobreva, V. Katifori, and H. Schuldt. Model - Foundations for Digital Libraries February 2008. ISSN 1818-8044 ISBN 2-912335-37-X. [1] L. Candela, D. Castelli, and P. Pagano. Managing big data [4] gCube Development Team. gCube Developer’s Guide. [2] L. Candela, D. Castelli, and P. Pagano. Virtual research [3] gCube Development Team. gCube Website. [5] L. Candela, D. Castelli, and P. Pagano. gCube v1.0: A Soft [6] I. Foster and C. Kesselman. [7] M. Armbrust, A. 
Fox, R. Griffith, A.D. Joseph, R. Katz, A. K [8] L. Candela, D. Castelli, N. Ferro, Y. Ioannidis, G. Koutr [9] M.N. Huhns and M.P. Singh. Service-oriented computing: of the European Commission, FP7-INFRASTRUCTURES-2011-2, References Virtual Research Environments by Hybrid Data Infrastructu the set of data sources integrated in the D4Science infrastr new algorithms and approaches aiming atThanks enlarging the to offer the openness ofthe community the in gCube the system, large, e.g., someintegrate every it of scientists into owning these the an deve Statistical Manager and benefit fromAcknowledgements a b [13] L. Candela, D. Castelli, G. Coro, P. Pagano, and F. Sinib [11] L. Candela, D. Castelli, and P. Pagano. Making Virtual R [12] L. Candela, D. Castelli, G. Coro, L. Lelii, F. Mangiacra [10] E. Wenger. PoS(ISGC2014)022 , 3) - , , volume 5173 of . O’Really, 2009. Lecture Notes in Computer e, 12-13 June 2014). Technical Report 2014-TR-022 rez-Canhos, F. Quevedo, tional Conference on Semantic , 39(4):12–27, May 2011. ntic Research Conference, MTSR’13, ptember 16-21, 2007, Proceedings 2014 rk, September 14-19 orting biodiversity studies with the tructure for Distributed Retrieval. In L. pporting the production of live research rmation about Marine Species through a lii. An approach to virtual research 4Science Research-Oriented Social do”, CNR 2014. n B. Christensen-Dalsgaard, D. Castelli, , L. Candela, D. Castelli, C. Flann, al data infrastructure solutions in ENVRI. ldemita, A. Ellenbroek, and P. Pagano. ano, G. Papanikos, P. Polydoras, Y.E. Garcia, and F.A.M. Trinta. USTO.RE: A based Search Results using Entity Mining, , volume 6966 of i, P. Manghi, A. Manzi, P. Pagano, and M. res Fafalios, M. Doerr, N. Minadakis, T. Patkos and Proceedings of the International Conference on , pages 161–173. Springer-Verlag, 2007. SIGMOD Rec. 
Concurrency and Computation: Practice and 14 CouchDB: The Definitive Guide Research and Advanced Technology for Digital Libraries, 12th European Conference on Research and Advanced , 96:n/a, 2014. http://www.opengeospatial.org/standards/wps , pages 122–134. Springer, 2008. 13th International Conference on Web Engineering (ICWE 201 http://www.opengeospatial.org/standards/cat , 9(1), 2013. ERCIM News Lecture Notes in Computer Science , Aalborg, 2013. , n/a:n/a, 2014. The Grey Journal , pages 101–109. Springer, 2011. L. Candela Integrating Heterogeneous and Distributed Info Top Level Ontology. Proceedings of the 7th Metadata and Sema Thessaloniki, Greece, November 2013. In: GEPW-8 - GEO European Projects’ Workshop (Athens, Greec Private System. In Industry track B.A. Jurik, and J. Lippincott, editors, Technology for Digital Libraries, ECDL 2008, Aarhus, Denma Simi. An Extensible Virtual Digital Libraries Generator. I Science environment user interfaces dynamic construction. InTheory and Practice of Digital Libraries (TPDL 2011) Lecture Notes in Computer Science objects. Ioannidis, D. Aarvaag, and F. Crestani.Kovács, A N. Grid-Based Fuhr, and Infras C. Meghini,11th editors, European Conference, ECDL 2007, Budapest,volume Hungary, 4675 Se of and Link Analysis atComputing Query (ICSC’14), Time. Newport Beach, IEEE California, 8th USA, Interna June R. Rafanell, V. Rebello, M. Sousa-Baena,EUBrazilOpenBio and Hybrid E. Data Torres. Infrastructure. Supp R. De Giovanni, W. A. Gray, A. Jones, D. Lezzi, P. Pagano, V. Pe Experience Networking Facilities. BiOnym - a flexible workflow approachIstituto to di Scienza name e matching. Tecnologie dell’Informazione “A. Fae [21] Candela L., Coro G., Cossu R., Pagano P. Realizing spati [22] OpenGIS Catalogue Service [23] OpenGIS Web Processing Service [24] R. Cattell. Scalable SQL and NoSQL[25] data stores. F.A. Durão, R.E. Assad, A.F. Silva, J.F. Carvalho, V.C. [26] J.C. Anderson, J. Lehnardt, and N. Slater. [27] M. Assante, L. 
Candela, D. Castelli, L. Frosini, L. Leli [28] M. Assante, P. Pagano, L. Candela, F. De Faveri, and L. Le [18] F. Simeoni, L. Candela, G. Kakaletris, M. Sibeko, P. Pag [19] P. Fafalios and Y. Tzitzikas Post-Analysis of Keyword- [20] Y. Tzitzikas, C. Allocca, C. Bekiari, Y. Marketakis, P. [17] M. Assante, L. Candela, and P. Pagano. An environment su Virtual Research Environments by Hybrid Data Infrastructu [14] R. Amaral, R. M. Badia, I. Blanquer, Ricardo Braga-Neto [15] E. Vanden Berghe, N. Bailly, G. Coro, F. Fiorellato, C. A [16] M. Assante, L. Candela, D. Castelli, and P. Pagano. The D PoS(ISGC2014)022 , , , pages Data Driven e-Science - Communication of the ACM Technical Report 2014-TR-027 , editors, ber 2013), 2013. Abstract. oese. AquaMaps: Predicted range ization on grid and cloud , 2008. ting Infrastructures (ISGC 2010) Grid Resource Sharing: The Species amming. do”, CNR 2014. -Reyes, P. D. Eastwood, A. B. South, S. O. lgorithms. ano. The D4Science Production do. Parallelising the Execution of Native i Scienza e Tecnologie dell’Informazione “A. , F. Pentaris, P. Polydoras, E. Sitaridi, V. al algorithms as-a-service. TDWG 2013 - res Concurrency and Computation: Practice and 15 , 32(1):67–74, 2009. http://www.aquamaps.org/ IEEE Data Eng. Bull. , n/a:n/a, 2014. Under review. Infrastructure. Technical Report 2009-TR-054, Istituto d Faedo”, CNR, 2009. Occurrence Maps Generation Case. In Simon C. Lin and Eric Yen Use Cases and Successful Applications of225–238. Distributed Springer, Compu 2011. maps for aquatic species. Kullander, T. Rees, C. H. Close, R. Watson, D. Pauly, and R. Fr Stoumpos, and Y.E. Ioannidis. Dataflow processing and optim infrastructures. 38(11):89–99, November 1992. Algorithms for . Experience Taxonomic Database Working Group 2013 (Firenze, 28-31 Octo Istituto di Scienza e Tecnologie dell’Informazione “A. Fae [35] L. Candela and P. Pagano. The D4Science Approach toward [36] K. Kaschner, J. S. Ready, E. Agbayani, J. Rius, K. 
Kesner Virtual Research Environments by Hybrid Data Infrastructu [29] M.M. Tsangaris, G. Kakaletris, H. Kllapi, G. Papanikos [30] G. Wiederhold, P. Wegner, and S. Ceri. Toward Megaprogr [31] G. Coro, P. Pagano, and L. Candela. Providing statistic [32] G. Coro, L. Candela, P. Pagano, A. Italiano and L. Liccar [34] P. Andrade, L. Candela, D. Castelli, A. Manzi, and P. Pag [33] G. Coro and L. Candela. gCube statistical manager: the a