Journal of eScience Librarianship

Volume 1 Issue 1 Article 3

2012-02-14

DataONE: Facilitating eScience through Collaboration

Suzie Allard

Let us know how access to this document benefits ou.y Follow this and additional works at: https://escholarship.umassmed.edu/jeslib

Part of the Library and Information Science Commons

Repository Citation Allard S. DataONE: Facilitating eScience through Collaboration. Journal of eScience Librarianship 2012;1(1): e1004. https://doi.org/10.7191/jeslib.2012.1004. Retrieved from https://escholarship.umassmed.edu/jeslib/vol1/iss1/3

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License. This material is brought to you by eScholarship@UMMS. It has been accepted for inclusion in Journal of eScience Librarianship by an authorized administrator of eScholarship@UMMS. For more information, please contact [email protected]. JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004

DataONE: Facilitating eScience Through Collaboration

Suzie Allard

University of Tennessee, Knoxville, TN, USA

Abstract

Objective: To introduce DataONE, a multi- are the primary focus of the discussion. institutional, multinational, and interdisciplinary collaboration that is developing the cyberinfra- Results: DataONE is highly collaborative. This structure and organizational structure to support is a result of its cyberinfrastructure architecture, the full information lifecycle of biological, ecologi- its interdisciplinary nature, and its organizational cal, and environmental data and tools to be used diversity. The organizational structure of an agile by researchers, educators, and the public at management team, diverse leadership team, and large. productive working groups provides for a suc- cessful collaborative environment where substan- Setting: The dynamic world of data intensive tial contributions to the DataONE mission have science at the point it interacts with the grand been made by a large number of people. challenges facing environmental sciences. Conclusions: Librarians and information science Methods: Briefly discuss science’s “fourth para- researchers are key partners in the development digm,” then introduce how DataONE is being de- of DataONE. These roles are likely to grow as veloped to answer the challenges presented by more scientists engage data at all points of the this new environment. Sociocultural perspectives data lifecycle.

Introduction librarians will be addressing with their sci- ence communities and that librarians are EScience is changing the way librarians uniquely trained to negotiate successfully. work and the services they provide. An im- Beyond technological changes, as scientific portant aspect of eScience is the focus on research is becoming more data intensive, a data, as noted by Kafel (2010): “A prominent “fourth paradigm” (Hey, Tansley, and Tolle feature of eScience is the generation of im- 2009) has emerged. Gray (2007) identifies mense data sets that can be rapidly dissemi- the first three paradigms over a temporal nated to other researchers via the internet.” span beginning at a thousand years ago There is an enormous increase in the when science was empirically describing amount of data collected, analyzed, re- natural phenomena. In the last few hundred analyzed, and stored, which is a result of years, science added a theoretical branch developments in computational simulation using models and generalizations. Within and modeling, automated data acquisition, the last few decades, science added a third and communication technologies (National paradigm which is a computational branch Academies of Science 2009). These data- enabling simulations. The fourth paradigm is intensive activities present challenges that emerging now and is best described as data

Correspondence to Suzie Allard: [email protected] Keywords: eScience, DataONE, data-intensive science, cyberinfrastructure 4 JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004 exploration that unifies theory, experiment, nections exist is important because it allows and simulation. It is often referred to as us to address complex issues with a better eScience. The fourth paradigm is changing contextual understanding. However, inter- how science is conducted (Hunt, Baldocchi connected information demands that we be and van Ingen 2009), as well as how scien- able to make sense of information across tists and publishers engage the scholarly disparate vocabularies, heterogeneous infor- record (Lynch 2009). The fourth paradigm, mation artifacts, and diverse paradigms. eScience, focuses on unifying theory, experi- This creates intellectual and technological ment, and simulation. The sociocultural challenges that may not be addressed suffi- changes brought about by the fourth para- ciently with traditional information tools and digm also have implications for libraries and methods. It also suggests new roles for the librarianship, suggesting the extension of information managers and librarians who current relationships within the scientific work with the information, and for the people community, including publishers and the de- who create and use the information. velopment of new collaborations. Ultimately, the key to benefitting society is to find solu- The foundation to successfully negotiate this tions to the challenges that arise from con- complex data intensive environment is a ro- ducting data intensive science (Hey, Tans- bust cyberinfrastructure that provides the ley and Tolle 2009). technology and associated tools to support scientists in their activities and to facilitate The science librarian can play an essential new ways to engage science (National Sci- part in enabling the cyberinfrastructure, in- ence Foundation Cyberinfrastructure Council cluding both technology and people, that 2007). The definition of cyberinfrastructure supports eScience, but this role is still includes technological and sociological per- emerging and may not be adequately de- spectives (National Science Foundation Blue fined in existing job descriptions. This paper -Ribbon Panel on Cyberinfrastructure 2003). is designed to help set the context of eSci- Both perspectives are needed to address the ence so that the role of the eScience librari- challenges presented by the increased an can be explored. The paper begins by amount of data collected, analyzed, and briefly discussing the cyberinfrastructure that stored, including a heightened need for tech- is needed to make eScience successful, and nology that assures data preservation, for then introduces one project, DataONE, as an processes that enable digital curation, and exemplar to illustrate how a cyberinfrastruc- for approaches to enable metadata interop- ture may be configured, with particular atten- erability. This means that data intensive sci- tion to the participation of librarians. ence challenges extend beyond the tradition- al hard sciences and require research en- The Need for Cyberinfrastructure gagement from the social sciences. It also suggests that while data-driven science re- Many scientific problems are both data inten- quires persistent and reliable data and tools sive and complex. For example, the grand for scientists to create and use these data, it challenges facing science, such as climate also will benefit from tools that can be used change (International Panel on Climate by a variety of stakeholders beyond scien- Change 2007), destructive pandemics tists, including government decision-makers, (World Health Organization 2009), or sus- academic researchers, industry leaders, non tainable energy (World Energy Council -governmental organizations, and even the 2010), are not confined to one or two disci- public at large. plines, but rather cross many scientific do- mains, creating a situation in which the infor- Over the last five decades, the National Sci- mation is becoming more interconnected ence Foundation (NSF) has played an im- (Hannay 2009). Recognizing that intercon- portant role in supporting the transformation

5 JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004 to data-intensive science, beginning with in that cyberinfrastructure. funding campus-based computational facili- ties in the 1960s, Supercomputer Center Introducing DataONE Programs in the 1980s, and the High Perfor- mance Computing and Communications pro- DataONE is a multi-institutional, multination- gram in the 1990s. In the new millennium, al, and interdisciplinary collaboration working the Office of Cyberinfrastructure created the to develop an organizational structure that vision and coordinated the efforts to provide will support the full information lifecycle of insights into complex problems in science biological, ecological, and environmental da- and engineering with the help of advanced ta and tools to be used by researchers, edu- computational facilities and instruments cators, and the public at large. DataONE (National Science Foundation Cyberinfra- focuses on enabling data-intensive biological structure Council 2007; Computer Science and environmental research through cyber- and Telecommunications Board, 1995). infrastructure that can be used as a tool to enable new science and evidence-based NSF also envisioned the concept that cyber- policy. The key tenet is that data must be infrastructure organizations could be created robust, accessible, and secure; therefore to find solutions to support data-intensive data management, from both the technical scientific and engineering research by inte- and sociocultural perspectives, is crucial. grating domain sciences with cyberinfra- structure, library/information sciences, and “People of all countries are experiencing in- computer sciences so that data could be creasing environmental, social, and technologi- supported throughout its lifecycle (National cal challenges associated with climate variability, Science Foundation 2007). In 2007, this altered land use, population shifts, and changes was introduced as the Sustainable Digital in resource availability (e.g., food, water, and oil). Scientists, educators, librarians, resource Data Preservation and Access Network Part- managers, and the public need open, persistent, ners, or DataNet. NSF noted that multidisci- robust, and secure access to well described and plinary approaches were needed to tackle easily discovered Earth observational data. data issues in order to (1) “provide reliable Such data are critical, as they form the basis for digital preservation, access, integration, and good scientific decisions, wise management and analysis capabilities for science and/or engi- use of resources, and informed decision- neering data over a decades-long timeline; making” (Michener et al. 2009). (2) continuously anticipate and adapt to changes in technologies and in user needs DataONE tackles three problems. First, and expectations; (3) engage at the frontiers DataONE provides support for studying com- of computer and information science and plex environmental issues such as climate cyberinfrastructure with research and devel- change. Environmental issues represent opment to drive the leading edge forward; complex adaptive systems touching on many and (4) serve as component elements of an different disciplines. This results in studies interoperable data preservation and access conducted in different domains of scholarly network” (NSF 2007). interest (Dozier and Gaile 2009; Hunt, Baldocchi and van Ingen, 2009), making it In August 2009, NSF funded the first two difficult to share data and findings. An or- DataNets -- Data Conservancy and the Data ganization that serves researchers from dif- Observation Network for Earth (DataONE). ferent domains by providing a means to This paper focuses on DataONE (http:// share data, expertise, and tools helps to www.dataone.org), a virtual data network bridge that gap. One example of what can focusing on the earth sciences, to explore be accomplished is the State of the Birds the organization of one solution for building 2011 report (www.stateofthebirds.org). This cyberinfrastructure and the role of librarians is the nation’s first assessment of bird distri-

6 JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004 bution on public lands, providing public mation Infrastructure, Long Term Ecolog- agencies with a means to identify bird spe- ical Research Network and others) using cies for conservation efforts. This report was the available cyberinfrastructure; compiled from results of work done by the DataONE Scientific Exploration and Visuali- (2) creating a new global cyberinfrastructure zation Working Group. that contains both biological and environ- mental data coming from different re- The second problem is the lack of compati- sources (e.g. research networks, envi- ble data practices (Hunt, Baldocchi and van ronmental observatories, individual sci- Ingen 2009). This problem has emerged entists, and citizen scientists); more recently, as the value of combining the efforts of different scientists and different dis- (3) changing the science culture and institu- ciplines has been realized. Additional data tions through the new cyberinfrastructure challenges exist as well, including data loss practices by providing education and (natural disaster, format obsolescence, or- training, engaging citizens in science, phaned data), scattered data sources, data and building global communities of prac- deluge (the flood of increasingly heterogene- tice. ous data), poor data practices, and data lon- gevity. An example of how DataONE is This leads to DataONE’s mission to support tackling this challenge is the work of the Ed- science through three core areas: provision ucation and Outreach Working Group, which of a toolkit for data discovery, analysis, visu- is identifying the best practices for data cura- alization and decision making; provision of tion, and producing a comprehensive, easy- easy, secure, and persistent data storage; to-use set of materials about best practices. and facilitation of community engagement of scientists, data specialists, and policy mak- The third problem is the need to address a ers. global problem with a global perspective (Hunt, Baldocchi and Van Ingen 2009). DataONE Cyberinfrastructure Primer Many efforts have been disorganized and scattered due to disciplinary diffusion and This paper provides only a very brief over- the lack of coordination and collaboration view of the DataONE cyberinfrastructure among other stakeholders, such as govern- (see Michener, et al. 2011 for more detailed ments, industry, non-governmental organiza- information). The overall DataONE design is tions, and citizens. There have been some based on three principles. First, DataONE successes, such as the Long Term Ecologi- supports distributed management at both cal Network (LTEeR) existing and new repositories (i.e., DataONE (http://www.lternet.edu/), that demonstrate Member Nodes) and enables replication, that an organization itself can be a tool in caching, and discovery across these reposi- addressing these kinds of issues. tories for preservation, robustness, and per- formance. Second, the DataONE DataONE objectives are designed to ad- must provide benefits for scientists and data dress the need for accessible, secure, and providers today as well as adapt to tomor- robust data, which are essential for produc- row’s needs. Third, DataONE activities tive research efforts and policy-making re- should support and use existing community garding environmental issues. These objec- software, emphasizing free and open source tives are: software.

(1) providing coordinated access to current The cyberinfrastructure implementation of databases (such as Ecological Society DataONE (Figure 1) is based on three major for America, National Biological Infor- components: Member Nodes, which are ex-

7 JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004 Figure 1: Major Components of the DataONE Infrastructure. Source: DataONE

Investigator Toolkit Web Interface Analysis, Visualization Data Management

Client Libraries Java Python Command Line

Member Nodes Coordinating Nodes

Service Interfaces Resolution Discovery Service Interfaces Replication Registration

Bridge to non-DataONE Coordination Layer Member Node services Identifiers Catalog Preservation Monitor Data Repository Object Store Index

isting or new data repositories that support and publish the output to a repository where the DataONE Member Node Application Pro- others may similarly retrieve and utilize the gramming Interfaces (APIs); Coordinating data. DataONE architecture is developed to Nodes that are responsible for cataloging address the following technical challenges content, managing replication of content, facing the researcher: and providing search and discovery mecha- nisms; and an Investigator Toolkit, which is a (1) inconsistent service interface specifica- modular set of software and plug-ins that tions; enables interaction with the DataONE infra- structure through commonly used analysis (2) lack of reliable unique identifier produc- and data management tools. tion and resolution;

A focus of the DataONE infrastructure is to (3) data longevity and availability is depend- address the problems a researcher may find ent on repository lifespan; when she needs content from more than one data repository, each of which may be tai- (4) inconsistent search semantics and effec- lored to the needs of a particular domain or tiveness; community of researchers. The researcher may need to master different tools for each (5) varying service interactions and data repository and she may need to keep sepa- models; rate accounts in order to access data in each of the repositories. This can be a barrier to (6) access to quality metadata limits reuse of use and may result in ambiguity as well as data; confusion of data authorship and access rules. The researcher in this scenario might (7) lack of shared identity and access control want to retrieve content from multiple data policies; repositories, use that content in meta- analyses or in comparison with new studies, (8) difficulty in placing data near analysis,

8 JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004 visualization and other computational organization is built around environmental services. scientists, with a strong collaboration with information scientists. Each of these groups DataONE and Collaboration is highly diversified. The environmental sci- ences include scientists from biology, ecolo- The DataONE cyberinfrastructure team rec- gy, environmental sciences, hydrology, and ognizes that it is important to communicate biodiversity. The information science mem- and collaborate with others who are ad- bers include specialists in informatics, com- dressing these issues of long-term data puter engineering, computer sciences, infor- management, reuse, discovery, and integra- mation sciences, information management, tion. There are a number of ongoing and information technology, and library sciences. new projects ranging from other DataNet projects to projects targeting very specific In the future, DataONE envisions ever- topics such as improvement of semantic strengthening collaborations involving more search capabilities. Overlap in participation associated disciplines. For instance, possi- between members of the various projects ble areas for expansion include researchers helps to ensure that DataONE is up-to-date studying migration and urbanization, such as with ongoing developments and emerging sociologists, and those studying natural re- approaches for data management and source allocation, such as economists. preservation, and also helps to ensure that DataONE’s goal, and challenge, is to create other projects are aware of the base infra- the cyberinfrastructure that can address mul- structure being put into place by DataONE ti-faceted environmental issues and mobilize and how they might leverage that infrastruc- all the interested parties to engage. ture. DataONE is also highly collaborative in DataONE is a Type I partner of the Federa- terms of institutions. At DataONE’s incep- tion of Earth Science Information Partners tion in August 2009, DataONE partners in- and has or is exploring collaborative relation- cluded Cornell University, the National Evo- ships with many other projects including: lutionary Center at Duke University, Oak Ridge National Laboratory, the University of  other DataNet Projects, New Mexico, the California Digital Library at  the Filtered-Push project (http:// the University of California, the National etaxonomy.org/mw/FilteredPush), Center for Ecological Analysis and Synthesis  the Scientific Observations Network at the University of California Santa Barbara, (SONet http://www.sonet.com/), the University of Illinois-Chicago, The Uni-  Semantic Tools for Ecological Data Man- versity of Tennessee-Knoxville, the Universi- agement (SemTools https:// ty of Kansas, the U.S. Geological Survey, semtools.ecoinformatics.org/), and Utah State University. The diversity of  TeraGrid (transitioning to XD/XSEDE), these initial institutions can be seen in Table and, 1. This list of partners continues to grow.  the Avian Knowledge Network (AKN http://www.avianknowledge.net/content/). As noted in the earlier section, the techno- logical design creates collaboration at two DataONE’s multidisciplinary environment levels of participation: Coordinating Nodes requires vibrant collaboration in order to pur- (the initial ones are the University of New sue the organizational goal stated on the Mexico, the partnership between University website: "DataONE will be commonly used of Tennessee and Oak Ridge National La- by researchers, educators, and the public to boratories, and the National Center for Eco- better understand and conserve life on earth logical Analysis and Synthesis at the Univer- and the environment that sustains it." The sity of California, Santa Barbara) and Mem-

9 JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004

Table 1: Institutions that are involved in or are supporting DataONE activities on different levels

Academic institutions from the U.S. (including three EPSCoR [The Experimental Program to Stim- ulate Competitive Research] states—Tennessee, Kansas, and New Mexico) and the United King- dom (i.e., Edinburgh, Manchester, Southampton);

Research networks (e.g., Long Term Ecological Research Network, Consortium of Universities for the Advancement of Hydrologic Science Inc. [CUAHSI], Taiwan Ecological Research Network, South African Environmental Research Network [SAEON]);

Environmental observatories (e.g., The National Ecological Observatory Network [NEON], USA- National Phenology Network, Ocean Observatory Initiative, South African Environmental Obser- vatory Network);

NSF- and government-funded synthesis (i.e., the National Center for Ecological Analysis and Synthesis [NCEAS], the National Evolutionary Synthesis Center [NESCent], Atlas of Living Aus- tralia) and supercomputer centers/networks (Oak Ridge National Laboratories [ORNL], National Center for Supercomputing Applications [NCSA], and TeraGrid);

Governmental organizations (e.g., U.S. Geological Survey [USGS], the National Aeronautics and Space Administration [NASA], Environmental Protection Agency [EPA]);

Academic libraries (e.g., University of California Digital Library, University of Tennessee, and Uni- versity of Illinois-Chicago libraries, which are active in the digital library community and are mem- bers of the Coalition for Networked Information, the Digital Library Federation, and the Associa- tion of Research Libraries);

International organizations (e.g., Global Biodiversity Information Facility, Inter American Biodiversi- ty Information Network, Biodiversity Information Standards);

Numerous large data and metadata archives (e.g., USGS-National Biological Information Infra- structure, ORNL Distributed Active Archive Center for Biogeochemical Dynamics, World Data Center for Biodiversity and Ecology, Knowledge Network for Biocomplexity);

Professional societies (e.g., Ecological Society of America, Natural Science Collections Alliance);

NGOs (e.g., The Keystone Center); and

The commercial sector (e.g., Amazon, Battelle Ventures, IBM, Intel)

Source: DataONE Proposal, 2009. ber Nodes (the first three are in the process ta replication, providing global user identity of coming online as of this writing). Coordi- services, providing log aggregation services, nating Nodes are geographically-distributed and monitoring node and network health. to provide a high-availability, fault-tolerant, Member Nodes will be located inside aca- and scalable set of coordinating services to demia, libraries, government agencies, and the Member Nodes. They are responsible other organizations to provide local data for utility services across the collaboration: storage, data access, access control, repli- member node registration services, metada- cation, metadata quality, and primary user ta indexing, coordinating and monitoring da- interaction.

10 JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004 DataONE Organizational Structure Working groups are also designed to enable research and education activities to evolve DataONE’s organizational structure includes over time. a small managerial team (principal investiga- tor, executive director, and directors for Each working group has two co-leaders, at cyberinfrastructure and community engage- least one of whom is a member of the Lead- ment), as well as a core cyberinfrastructure ership Team, in order to facilitate communi- team that is responsible for designing and cation between groups and to assure that building the cyberinfrastructure. The Man- the Management Team is aware of all activi- agement Team is based at the University of ties. Each group has an additional 8-10 New Mexico with principal investigator Dr. members who are actively engaged in on- William Michener, who is Professor and Di- going activities. Also, there is periodic inter- rector of eScience Initiatives at the Universi- action among the working groups such as ty Libraries at the University of New Mexico. members of different working groups ad- dressing a problem together. When there While DataONE is domain-centric and led by are face-to-face meetings, there are ses- domain scientists, librarians and information sions devoted to join forces and perspec- scientists are integral members of the team tives. The structure is fluid, flexible, and at all levels and across many activities. For adaptive. example, the Leadership Team, which meets each week in a virtual environment to share DataONE Lifecycle and coordinate technical and sociocultural activities, is composed of 14 individuals (in Through the activities of working groups, addition to the management team) repre- DataONE addresses the complete data senting 10 institutions, five of whom are li- lifecycle through a comprehensive program brarians or information scientists. DataONE of research, design, and development to cre- is also advised by two external bodies – the ate a system to preserve, disseminate, and External Advisory Board and the DataONE protect research objects in a secure, reliable, Users Group, each of which has librarian and open approach that is responsive to us- and information science representation. ers’ and scientists' needs.

DataONE is a virtual organization with a DataONE has adopted a lifecycle model that strong network of people. The network is focuses on “the data” and illustrates the dif- built around working groups that help assure ferent stages that data can pass through, that multiple perspectives are represented. although data may skip a stage or stages Working groups are composed of scientists, (Michener et al. 2011). At each stage, differ- academic researchers, educators, govern- ent people may interact with the data, and it ment and industry representatives, and lead- is unlikely that one person will interact with ing computer, information, and library scien- the data at all stages. The data lifecycle is tists. Working groups are central to useful because it can be used to identify da- DataONE research activities and most focus taflows and work processes for scientists, on either cyberinfrastructure or community librarians, or others associated with the sci- engagement issues, although two working ence data process. groups, Usability and Assessment and Ex- ploration, Visualization, Analysis, directly en- Let’s follow the data through the eight stages gage in both cyberinfrastructure and commu- of the lifecycle (Figure 2). The lifecycle be- nity engagement activities. The Working gins when scientists make a plan to conduct group model allows DataONE to conduct their research. They then collect data either targeted research and education activities in the field or laboratory. The scientific team with a broad group of scientists and users. may then review the data to assure the data

11 JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004

Figure 2: DataONE has adopted a data lifecycle which focuses on the way data moves through eight unique stages. These steps begin at the point of creating the re- search plan then progress to data collection, quality assurance and quality control. Data needs to be described – which is when metadata is created. Data are then de- posited in a trusted repository where they may be preserved. Tools and services can then support data discovery, integration, and analysis including visualization. Source: DataONE

quality. The data is now ready to be de- Librarians can provide support and guidance scribed with metadata. Although it is recom- at nearly every stage of the data lifecycle. At mended that the specific domain metadata the Planning stage, librarians can address standard be used, scientists often use a data management questions that can help metadata schema that has been developed scientists develop a data management plan. for their project. When the data are de- In 2011, NSF began requiring that a data scribed, they are ready to be deposited into management plan be submitted with a trusted repository in which they will be pre- each proposal. Librarians have played a served. The data are now discoverable and very active role during the development of a may be accessed by others. At this point, new tool, the DMPTool (https:// data modelers or other scientists might ac- dmp.cdlib.org/), that helps researchers cre- cess the data and integrate multiple data ate data plans online. The DMPTool original sets for analysis. Conversely, data may not partner institutions include four libraries and be integrated and may instead be analyzed the United Kingdom’s Digital Curation Cen- by the original scientist who collected it tre. At the Assure stage, librarians can help (skipping both the discover and integrate scientists identify existing strategies for data stages). quality. At the Describe stage, librarians can help the scientist identify and apply a rele-

12 JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004 Figure 3: DataONE stakeholders. Scientists are the primary stakeholders and circles represent each of five science research environments. There are secondary stake- holders associated with each science research environment. Organizations are repre- sented with boxes and individuals with ovals. The dashed box indicates stakeholders associated with each level of government. Source: Michener, et al., 2011

vant metadata schema. At the Preserve DataONE Stakeholders stage, librarians can identify or perhaps their library can provide an appropriate and trust- DataONE engages a wide group of stake- worthy repository. At the Discover stage, holder communities (Figure 3). The primary librarians can help users find and access stakeholders at the center of the stakeholder data, which is an extension of a traditional network are scientists. Scientist practices role of librarianship. Librarian engagement and attitudes vary depending on the home at the Integrate stage is still evolving, howev- domain, meaning that scientists are not ho- er it may include helping to negotiate the in- mogenous. Science communities were not terconnected information challenges noted categorized by domain since that approach earlier. discouraged crossing disciplinary boundaries and practicing integrative science. Rather, this stakeholder community was character- ized based on how scientists “do” science

13 JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004 that aligned with five science research envi- brary and information science (LIS) profes- ronments: academia, government, private sionals and researchers on the team. This industry, non-profit, and community. While has provided the library and information cen- scientists working in private industry are pri- ter perspective as the cyberinfrastructure mary stakeholders, the proprietary con- has emerged. As the cyberinfrastructure straints placed on them means they are like- matures, LIS professionals and researchers ly to be restricted from sharing and therefore serve as on-going members of the leader- may have a limited relationship with ship team and working groups helping to DataONE. shape how DataONE addresses the issues and builds technical and sociocultural infra- Secondary stakeholders also have a role in structure. LIS professionals have the experi- data-intensive environmental science. The ence and knowledge to help understand how following are the major groups of secondary stakeholders interact across the five science stakeholders: research environments and throughout the data lifecycle. This section focuses primarily (1) Libraries and librarians are important on the sociocultural contributions, although sources of support for science and scien- there are also LIS professionals engaged in tists to negotiate the data-driven and in- answering DataONE’s technical questions. formation-reliant milieu in any of the five science research environments. In sociocultural terms, LIS skills and tools DataONE prioritizes libraries and librari- are helping provide insight into stakeholders’ ans as the most important secondary motivation, practices, and needs. This is community. In the DataONE stakeholder being accomplished through a series of as- network, the definition of libraries in- sessments being conducted with different cludes the full range of information- DataONE stakeholder communities. These centric agencies and services; assessments are designed to explore atti- tudes towards, and practices for, science (2) Administrators and policy makers at the data. The results help developers have a federal, state, and local level are people better understanding of how these targeted influencing the success of science communities are engaging with science data through funding programs and policy that and help developers create tools that will may facilitate or hinder research; provide useful services and also be usable by the community. For instance, LIS re- (3) Publishers and professional societies search shows that scientists’ data practices whose activities include the dissemina- (data management, digital curation, metada- tion of research results and data; ta creation, and data preservation) are poor for various reasons, including a lack of (4) Think tanks which develop evidence- knowledge of existing tools and a lack of de- based position papers or policy sugges- sire to use them (Tenopir et al. 2011; Parse tions; Insight 2010).

(5) Citizen scientists, citizen activists, K-12 Research conducted to learn more about teachers, informal educators, and curric- this include surveys, interviews, usability ulum builders. These stakeholders pro- studies, and analyses of data use. Working vide the bridge between science and the group efforts have been instrumental in con- public. ducting baseline assessments with stake- holders, analyzing the assessment studies Libraries, Librarians & DataONE that others have done, and conducting re- peat assessments of various stakeholder From the proposal stage, DataONE had li- groups every couple of years. These as-

14 JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004 sessments help identify current data needs, Conclusion perceptions, and practices of all parts of the data lifecycle and provides a basis for seeing The data-intensive environment is changing how these change over time. The DataONE the way scientists “do” science, and as li- baseline assessment of scientists was com- brarians, we can provide support and ser- pleted last year (Tenopir et al. 2011). Base- vices that will help scientists concentrate on line assessments are now being conducted “doing” science rather than wrestling with with libraries, librarians, and data managers. barriers that keep them from creating shara- ble datasets and from utilizing the range of Librarians have been actively involved in the data available. At universities across the teams, creating “user scenarios” for primary country, libraries are facing questions about stakeholder groups. These exemplify how a how to address eScience challenges. Li- user might interact with DataONE and high- brarians are assessing what skills they need light specific activities that they would be en- to provide the services associated with eSci- gaged in. By understanding the stakehold- ence, as well as how to integrate these new ers’ needs, motivations, concerns, and skill responsibilities into their workload. base, DataONE developers can better devel- DataONE provides a laboratory to address op appropriate tools and services, and also the new roles and responsibilities facing sci- understand the best way to market them ef- ence librarianship. fectively to appropriate groups. The DataONE data lifecycle helps identify Librarians have also been key members on some of the areas where librarians can offer the team developing personas. A persona essential skills and support. Librarians are provides a way to envision the “average” us- important partners from the moment scien- er in a particular stakeholder group. The tists begin planning their data collection by personas were developed using data from providing information and support for creat- the assessment surveys and from inter- ing data management plans. There are also views. The personas help build an under- roles in this dynamic new information envi- standing of current and potential users. Per- ronment that are based on the very founda- sonas allow developers, LIS professionals, tions of librarianship: metadata creation, and DataONE management to visualize how preservation strategies, and information ac- users from specific communities may use cess. It is likely that these activities have a DataONE. Knowing this facilitates building very different look in the eScience context; better tools and providing better services, however, the basic tenets established from and also increases the ability to make good years of research and practice provide a strategic decisions as the cyberinfrastructure strong foundation for developing the special- grows. ized skill set.

Librarians are also serving on working Another area that DataONE illuminates is groups that are responsible for the develop- the successful partnering of librarians and ment of a large number of best practices for information science researchers with domain data management, and for maintaining a list scientists. There is much potential for librari- of tools available for a range of data activi- ans to become more integrated in the sci- ties, including visualization and management ence workflow. This includes working close- (http://www.dataone.org/dataonepedia). Ed- ly with scientists at all stages of the data ucational modules for data management are lifecycle, as well as participating in the data also being developed. Many of these re- literacy education of the next generation of sources can be found at http:// scientists by helping coach science under- www.dataone.org/resources. graduates and graduates on best practices related to data.

15 JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004 EScience presents a host of challenges for ture. Washington, D.C.: The National Acade- libraries, including having sufficient technical mies Press. capacity, a workforce with the training to ad- dress eScience, and the capacity to do the Dozier, Jeff and William B. Gail. 2009. “The outreach and training needed to engage sci- Emerging Science of Environmental entists and student scientists. Committing Applications.” In The Fourth Paradigm: Data- resources at the library level can be difficult Intensive Scientific Discovery, edited by without strong institutional support, especial- Tony Hey, Stewart Tansley, and Kristin ly since science researchers may not envi- Tolle, 13-19. Microsoft Research. Accessed sion how their data management can benefit June 29, 2011. http:// from greater interaction with the library. This research.microsoft.com/en-us/collaboration/ lack of recognition may make it more difficult fourthparadigm/4th_paradigm_book_complet for the library to illustrate the return on in- e_lr.pdf vestment. The DataONE experience sug- gests that libraries can be involved in high Gray, Jim. 2007. "eScience Talk." Talk pre- profile activities such as creating data man- sented to the NRC-CSTC, Mountain View, agement plans, providing metadata guid- CA, January 11. ance, and identifying reliable data reposito- ries. Since these activities protect university Hannay, Timo. 2009. “From Web 2.0 to the intellectual assets, they may help establish Global Database.” In The Fourth Paradigm: the value of supporting library involvement Data-intensive Scientific Discovery, edited with eScience. by Tony Hey, Stewart Tansley, and Kristin Tolle, 215-220. Microsoft Research. Ac- Librarians also face the challenge of finding cessed June 29, 2011. http:// ways to integrate eScience activities into research.microsoft.com/en-us/collaboration/ their work day. The DataONE experience fourthparadigm/4th_paradigm_book_complet suggests that librarians are invaluable part- e_lr.pdf ners in data description, preservation and access. The necessary skill set is based on Hey, Tony, Stewart Tansley, and Kristin librarianship fundamentals, but does require Tolle, eds. 2009. The Fourth Paradigm: Data the librarian to become acquainted with spe- -intensive Scientific Discovery. Microsoft cific best data practices. Many associations Research. Accessed June 29, 2011. http:// are offering workshop opportunities, but li- research.microsoft.com/en-us/collaboration/ brarians may also utilize resources such as fourthparadigm/4th_paradigm_book_complet the DataONE best practices and tools ar- e_lr.pdf chives. Hunt, James ., Dennis D Baldocchi, and Libraries and librarians have a history of suc- Catharine van Ingen. 2009. “Redefining Eco- cessfully adjusting to a shifting information logical Science Using Data.” In The Fourth landscape. As evidenced by librarian partici- Paradigm: Data-intensive Scientific pation in DataONE, the library community is Discovery, edited by Tony Hey, Stewart already an active partner in shaping the fu- Tansley, and Kristin Tolle, 21-26. Microsoft ture of eScience. Research. Accessed June 29, 2011. http:// research.microsoft.com/en-us/collaboration/ fourthparadigm/4th_paradigm_book_complet References e_lr.pdf Computer Science and Telecommunications Board. 1995. Evolving the high performance International Panel on Climate Change. computing and communications initiative to 2007. Climate change 2007: Synthesis re- support the nation’s information infrastruc- port. Retrieved on June 30, 2011 from http://

16 JESLIB 2012; 1(1): 4-17 doi:10.7191/jeslib.2012.1004 www.ipcc.ch/pdf/assessment-report/ar4/syr/ Advisory Panel on Cyberinfrastructure. 2003. ar4_syr.pdf Revolutionizing Science and Engineering through Cyberinfrastructure. Accessed on Kafel, Donna. 2010. “e-Science and its August 20, 2011. www.nsf.gov/od/oci/ Relevance for Research Libraries.” reports/atkins.pdf Accessed July 30, 2011. http:// esciencelibrary.umassmed.edu/escience National Science Foundation Cyberinfra- structure Council. 2007. Cyberinfrastructure Lynch, Clifford. 2009. “Jim Gray’s Fourth Vision for 21st Century Discovery. Accessed Paradigm and the Construction of the July 29, 2011. http://www.arl.org/bm~doc/ Scientific Record.” In The Fourth Paradigm: ci_vision_march07.pdf Data-intensive Scientific Discovery, edited by Tony Hey, Stewart Tansley, and Kristin PARSE Insight. 2010. PARSE Insight. Ac- Tolle, 177-183. Microsoft Research. Ac- cessed January 31, 2011. http://www.parse- cessed June 29, 2011. http:// insight.eu/downloads/PARSE-Insight_D3- research.microsoft.com/en-us/collaboration/ 4_SurveyReport_final_hq.pdf fourthparadigm/4th_paradigm_book_complet e_lr.pdf Tenopir, Carol, Suzie Allard, Kimberly Douglass, Arsev Umur Aydinoglu, Lei Wu, Michener, William, Todd Vision, Stephanie Eleanor Read, Maribeth Manoff and Mike Hampton, and Robert Cook. 2009. Frame. 2011. “Data Sharing by Scientists: “DataONE Proposal.” Practices and Perceptions.” PLoS One 6 (6):e21101 (2011) doi:10.1371/ Michener, William K., Suzie Allard, Amber journal.pone.0021101 Budden, Robert Cook, Kimberly Douglass, Mike Frame, Steve Kelling, Rebecca World Energy Council. 2010. Energy and Koskela, Carol Tenopir, and David A. urban innovation. Accessed February 2, Vieglais. 2011. “Participatory Design of 2011. http://www.worldenergy.org/ DataONE - Enabling Cyberinfrastructure for documents/eui_2010_1.pdf the Biological and Environmental Sciences.” Ecological Informatics. Accepted, in press, World Health Organization. 2009. World now available online 3 September 2011. at the start of 2009 influenza pandemic. Ac- Accessed 5 September 2011. http:// cessed August 30, 2011. http://who.int/ dx.doi.org/10.1016/j.ecoinf.2011.08.007 mediacentre/news/statements/2009/ h1n1_pandemic_phase6_20090611/en/ National Academies of Science. Committee index.html on Ensuring the Utility and Integrity of Research Data in a Digital Age. 2009. “Ensuring the integrity, accessibility, and Disclosure: The author reports no conflicts stewardship of research data in the digital of interest. age.” Accessed August 2, 2011. http:// www.nap.edu/catalog.php?record_id=12615 All content in Journal of eScience Librarianship, unless otherwise noted, is National Science Foundation. 2007. licensed under a Creative Commons “Sustainable Digital Data Preservation and Attribution 4.0 International License. Access Network Partners (DataNet).” Ac- cessed July 26, 2011. http://www.nsf.gov/ http://creativecommons.org/licenses/by/4.0 pubs/2007/nsf07601/nsf07601.htm ISSN 2161-3974 National Science Foundation Blue-Ribbon

17