DataONE: Enabling Data-Intensive Biological and Environmental Research through Cyberinfrastructure

Leadership Team: William Michener (1), Suzie Allard (2), John Cobb (3), Robert Cook (3), Patricia Cruse (4), Mike Frame (5), Stephanie Hampton (6), Vivian Hutchison (5), Matthew Jones (6), Steve Kelling (7), Rebecca Koskela (1), Carol Tenopir (2), Dave Vieglais (8), Todd Vision (9), Bruce Wilson (2)

Co-Investigators: Paul Allen (7), Peter Buneman (10), Randy Butler (11), Ewa Deelman (12), David DeRoure (13), Cliff Duke (14), Carole Goble (15), Donald Hobern (16), Peter Honeyman (17), Jeffery Horsburgh (18), John Kunze (4), Bertram Ludaescher (19), Maribeth Manoff (2), Line Pouchard (3), Robert Sandusky (20), Ryan Scherle (9), Mark Servilla (1), Jake Weltzin (5)

Affiliations: (1) University of New Mexico; (2) University of Tennessee; (3) Oak Ridge National Laboratory; (4) University of California - California Digital Library; (5) U.S. Geological Survey; (6) National Center for Ecological Analysis and Synthesis - University of California - Santa Barbara; (7) Cornell University; (8) University of Kansas; (9) National Evolutionary Synthesis Center, University of North Carolina; (10) University of Edinburgh; (11) University of Illinois - Urbana Champaign; (12) University of Southern California; (13) University of Southampton; (14) Ecological Society of America; (15) University of Manchester; (16) Atlas of Living Australia; (17) University of Michigan; (18) Utah State University; (19) University of California - Davis; (20) University of Illinois - Chicago

Abstract: Addressing the Earth's environmental problems requires that we change the ways that we do science; harness the enormity of existing data; develop new methods to combine, analyze, and visualize diverse data resources; create new, long-lasting cyberinfrastructure; and re-envision many of our longstanding institutions. DataONE (Observation Network for Earth) represents a new virtual organization whose goal is to enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it.

DataONE is designed to be the foundation for new innovative environmental science through a distributed framework and sustainable cyberinfrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data. Supported by the U.S. National Science Foundation, DataONE will ensure the preservation and access to multi-scale, multi-discipline, and multi-national science data. DataONE is transdisciplinary, making biological data available from the genome to the ecosystem; making environmental data available from atmospheric, ecological, hydrological, and oceanographic sources; providing secure and long-term preservation and access; and engaging scientists, land-managers, policy makers, students, educators, and the public through logical access and intuitive visualizations. Most importantly, DataONE will serve a broader range of science domains both directly and through the interoperability with the DataONE distributed network.

DataONE is a five-year project that began in Fall 2009 (William Michener, PI, University of New Mexico).

The Vision: "DataONE will be commonly used by researchers, educators, and the public to better understand and conserve life on earth and the environment that sustains it."

By creating an infrastructure of technology and standards, people, and institutions to support the full life cycle of biological, ecological, and environmental data and tools that enable universal access, DataONE will accelerate use of earth observational data in research, education and decision-making. In so doing, DataONE will transform our understanding of ecological processes and conserve life on earth and the environment that sustains it.

Why do we need DataONE?

Societal and Environmental Challenges
Popular press and results from the International Geosphere Biosphere Program show that environmental challenges are of increasing concern for us all.
[Figure: Global Ocean Acidification]

Science Challenges
To understand, we need easy access to different types of data
• Wide range of spatial and temporal scales (plot data to remote sensing data)
• Breadth of science domains (biological, environmental, social, and economic)
• Citizen science networks, of increasing importance
[Figure, adapted from CENR-OSTP: intensive science sites and experiments, extensive science sites, volunteer & education networks, and remote sensing. Axes: Increasing Process Knowledge; Decreasing Spatial Coverage.]

Data Challenges
Scientists need access to the data generated by research to verify findings and test new hypotheses. Poor data practices place the scientific record at risk. Over time the information content of data products can be lost.
• Data are massively dispersed
- Not on the Web
- Orphaned
• Multiple Semantics
- Not easily discovered
• Poor data practices
- Documentation
- Formats / obsolescence
- Lack of Standards
• Poor stewardship
- Media Obsolescence
• Heterogeneous, incompatible formats
- Difficult to combine data from diverse sources
[Figure (Michener et al. 1997): Information Content versus Time. Labels: time of publication, specific details, general details, retirement or career change, accident, death, storage failure.]

What is DataONE?

Cyberinfrastructure enterprise (tools, ideas, people)
Distributed Framework of Coordinating Nodes and Member Nodes
Flexible, scalable, sustainable network

Coordinating Nodes
• retain complete metadata catalog
• perform basic indexing
• provide network-wide services
• ensure data availability (preservation)
• provide replication services
• Tightly coordinated, stable service platform
1. University of New Mexico
2. University of California-Santa Barbara
3. Oak Ridge Campus (UT & ORNL)

Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• Diversity tolerant (less tightly coordinated)
• Freedom to try or build new tools, methods, and leapfrog forward

Member Nodes Deployed in 2010: ORNL DAAC, Knowledge Network for Biocomplexity, Dryad, USGS NBII, and California Digital Library

Provide tools to better manage data
Provide support for the entire data life cycle: preparation of data sets, stewardship, tools to access and use the data
1. Collection/Preparation
2. Deposition/acquisition/ingest
3. Curation and metadata management
4. Protection, including privacy
5. Discovery, access, use, and dissemination
6. Interoperability, standards, and integration
7. Exploration, visualization, and analysis

• Metadata creation, management
• Metadata catalog with >70,000 records
• Search catalog
• Exploration, Visualization, Analysis
http://mercury.ornl.gov/dataone/

How will we build DataONE?

Understand the community's need, have the community envision solutions
Perform baseline and iterative Community Assessments and Usability Studies to see where data practices and policies are now, so we can see how practices change over the life of DataONE.
[Figure: Good Practices versus Time for Libraries, Librarians, Citizen-scientists, Students & Teachers, Decision makers, Scientists, Computer & IT.]
DataONE Assessment: http://vovici.com/wsb.dll/s/aaeg3cfe6

Leverage existing CI whenever possible, build new CI whenever necessary
• CI based on use cases and prioritized need for functionality
• Many existing open source efforts exist
• Support and adapt existing community efforts
– Metadata Editors: Mercury, Morpho
– Data management: MATT, UDig, Specify
– Analysis and modeling: R, Octave, netCDF
– Workflow systems: Kepler, Taverna, VisTrails
– Grid systems: Condor, Globus, BOINC
– Data and workflow portals: VegBank, myExperiment
• Commercial tools important too
– MATLAB, SAS, ArcGIS
• DataONE: help communities build their own tools
– Integrate, interoperate, stabilize
– Create libraries to DataONE Service Interface

Engage the community, reach out, educate, enable new science, and demonstrate success
DataONE will use Working Groups

Structure
• 10 – 20 participants
• Inclusive
• Intensive collaboration
• Neutral territory
• 2 week-long meetings per year

Success
• Community-driven
• Deep analysis
• High productivity e-research
• High impact

[Figure: organization chart with the DataONE International Users Group, the DataONE Project, and the two sets of working groups listed below.]

Infrastructure and Research Working Groups
• Federated security
• Distributed storage
• Data preservation, metadata, and interoperability
• Scientific workflows and semantics
• Exploration, Visualization, Analysis
• Usability and assessment

Engagement Working Groups
• Sociocultural barriers to data sharing and preservation
• Community engagement and education
• Citizen science and public outreach
• Long-term sustainability and governance
• Exploration, Visualization, Analysis
• Usability and assessment

Education and Training
Career-Long Learning:
• best practice guides
• exemplary data management plans
• podcasts, web-casts
• workshops and seminars
• downloadable curricula
[Figure: covers of Best Practice Guides, including "Using Metadata for Work Flows" and "How to Cite Your Data" (5 and 6 in a series).]
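The division of roles between Coordinating Nodes and Member Nodes described on this poster can be illustrated with a minimal sketch. It is not the DataONE service interface: the class names, method names, the `copies` parameter, and the identifier `example-dataset-1` are all hypothetical, chosen only to show a Coordinating Node that retains a metadata catalog, performs basic indexing, and replicates objects across Member Nodes.

```python
from dataclasses import dataclass, field


@dataclass
class MemberNode:
    """Hypothetical Member Node: stores data objects for its local community."""
    name: str
    objects: dict = field(default_factory=dict)  # identifier -> raw bytes

    def store(self, identifier: str, data: bytes) -> None:
        self.objects[identifier] = data


@dataclass
class CoordinatingNode:
    """Hypothetical Coordinating Node: catalog, keyword index, replication."""
    members: list = field(default_factory=list)
    catalog: dict = field(default_factory=dict)  # identifier -> metadata record
    index: dict = field(default_factory=dict)    # keyword -> set of identifiers

    def register(self, origin: MemberNode, identifier: str, metadata: dict) -> None:
        # Retain the metadata record and remember which node holds the object.
        self.catalog[identifier] = {**metadata, "holders": [origin.name]}
        # Basic indexing: map each lower-cased keyword to the identifier.
        for word in metadata.get("keywords", []):
            self.index.setdefault(word.lower(), set()).add(identifier)

    def replicate(self, identifier: str, copies: int = 2) -> list:
        # Copy the object to further Member Nodes until `copies` nodes hold it.
        holders = self.catalog[identifier]["holders"]
        source = next(m for m in self.members if m.name in holders)
        for node in self.members:
            if len(holders) >= copies:
                break
            if node.name not in holders:
                node.store(identifier, source.objects[identifier])
                holders.append(node.name)
        return holders

    def search(self, keyword: str) -> list:
        return sorted(self.index.get(keyword.lower(), set()))


# Usage: two Member Nodes named after repositories listed on the poster.
knb = MemberNode("KNB")
dryad = MemberNode("Dryad")
cn = CoordinatingNode(members=[knb, dryad])

knb.store("example-dataset-1", b"plot survey data")
cn.register(knb, "example-dataset-1",
            {"title": "Plot survey", "keywords": ["Ecology"]})
holders = cn.replicate("example-dataset-1")
hits = cn.search("ecology")
```

In this sketch the Member Node stays simple and loosely coupled, while catalog, index, and replication logic sit in the Coordinating Node, mirroring the "diversity tolerant" versus "tightly coordinated" split.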

DataONE (Observation Network for Earth): https://dataone.org/ March 2010