Sharing Smithsonian Digital Scientific Research Data from Biology

Total Pages: 16

File Type:pdf, Size:1020Kb

Sharing Smithsonian Digital Scientific Research Data from Biology
March 2011
Office of Policy and Analysis
Washington, DC 20013

Contents

Preface
Acronyms
Executive Summary
  Purpose and Methodology of the Study
  Background
  Conclusions
  Recommendations
Introduction
  Purpose and Scope of the Study
  Methodology
  Terms and Definitions
  Organization of the Report
Findings: A Changing Digital Data-sharing Environment
  Small Science and Data Management
  Forces of Change
    The “Data Deluge”
    Interdisciplinary Global Challenges
    Technology
    Legacy Data
    Ethical Imperatives
    The Costs of Systematic Data Management
  Responses to the Changing Environment
    Individual Organizations
    Collaborations
    Professional Roles
    Top-Down and Bottom-Up
  Policy
    Open Access
    U.S. Federal Agencies
  Funding Organizations
Findings: Functional Areas (Discovery and Access, Use, Preservation)
  Discovery and Access
    Mediating Entities
    The Centrality of Metadata
  Use
  Long-term Preservation
    Trusted Digital Repositories
    The Economics of Preservation
    Preservation Decision Making
  Cross-Cutting Factors
    Human Resources
    Cyberinfrastructure
    Economics
Findings: The Smithsonian
  Background
    Collaborative Engagement
    Legacy Data
    Discovery, Access, and Usability
  OCIO and Pan-Institutional Initiatives
    Scientific Computing Needs Assessment
    Digitization Strategic Plan
    Proposed Smithsonian Institution DataNet
    Digital Asset Management System (DAMS)
    Enterprise Digital Asset Network (EDAN)
    Smithsonian Research Online
    Proposed Smithsonian Institution Geographic Information System (SIGIS)
  Individual Science Units
    National Museum of Natural History
    National Zoological Park
    Smithsonian Environmental Research Center
    Smithsonian Tropical Research Institute
  Other Units
    Smithsonian Institution Archives
    Smithsonian Institution Libraries
  Collaborative Smithsonian Endeavors
  Barriers to Data Management and Sharing
    Resources
    Culture
    Organization
    Taking Care of the Present, Looking to the Future
Conclusions
Recommendations
Appendix A: Selected Bibliography
Appendix B: Organizations Interviewed for the Study
  External Organizations
  Smithsonian Institution
Appendix C: Inter-organizational Efforts
Appendix D: Additional International Initiatives
  Other Nations and the European Union
    Canada

Recommended publications
  • Hello, Everyone. My Name Is Amy Walton
    >> Hello, everyone. My name is Amy Walton. I lead the data project in the Office of Advanced Cyberinfrastructure. I have to tell you a quick, funny story that, before that, for a long time, I was with the NASA's Jet Propulsion Laboratory, and it was my privilege at the time to have been at the, working with the Forest Service to try and do some of their first airborne detections of forest fires. The kinds of equipment that they had at the time was basically the size and weight of a refrigerator that they would roll onto the airplane. It would then take overflight infrared pictures, and they would have to drop the infrared film at a location, have it developed and taken to the fire camps overnight. The dramatic changes you are about to hear in how fires and many other data activities are now being done will be incredibly dramatic, so allow me to take a moment to introduce one of my colleagues. Dr. Ilkay Altintas is the chief data science officer at the San Diego Supercomputer Center at the University of California San Diego, where she is also the founder and director for the Workflows for Data Science Center of Excellence. In her various projects, she leads collaborative, multidisciplinary activities with research objectives to deliver impactful results by making computational data science work more reusable, programmable, scalable, and reproducible. Since joining the San Diego Supercomputer Center in 2001, she has been a principal investigator and a technical leader in a wide range of cross-disciplinary projects.
  • Measuring the Impact of Digital Repositories
    Measuring the Impact of Digital Repositories, February 28-March 1, 2017. Participant Introductions. Bruce Ambacher’s career spans four decades at NARA, George Mason University, and the University of Maryland. His responsibilities included service as acting chief of NARA’s digital preservation unit and as court-appointed preservation officer for the PROFS, Iran-Contra, and Clinton email collections. He represented NARA on the Federal Geographic Data Committee and helped develop federal and international geospatial standards. He was NARA’s representative for the OAIS Reference Model. He co-chaired the development of TRAC. He helped develop the trustworthy digital repositories standards ISO 16363 and ISO 16919. He began teaching courses in archives and digital preservation at George Mason University in 1984, became an adjunct professor at the University of Maryland in 2000 and a Visiting Professor between 2007 and 2013. He has consulted on digital preservation for industry and cultural humanities institutions. He is a Research Affiliate of the Digital Curation Innovation Center at the University of Maryland. Mary Barlow is head of the Strategic Project Management Office at the European Bioinformatics Institute (EMBL-EBI), and has oversight of impact reporting and the development of infrastructure business cases. Prior to taking on this role, Mary served as a Programme Manager for the multi-million pound investment in EMBL-EBI by the UK government's Large Facilities Capital Fund. This programme included the construction of new office space and the on-going public procurement of ICT infrastructure to support EMBL-EBI's growing public databases. Mary's work prior to EMBL-EBI focused on ICT integration and intelligent buildings.
  • Download Program in PDF Format
    www.sdsc.edu/research/IPP.html Industry’s “Gateway” to SDSC: The Industrial Partners Program (IPP) provides member companies with a framework for interacting with SDSC researchers and staff, exchanging information, receiving education & training, and developing collaborations. Joining IPP is an ideal way for companies to get started collaborating with SDSC researchers and to stay abreast of new developments and opportunities on an ongoing basis. The expertise of SDSC researchers spans many domains including computer science, cybersecurity, data management, data mining & analytics, engineering, geosciences, health IT, high performance computing, life sciences & genomics, networking, physics, and many others. The IPP provides multiple avenues for consultation, networking, training, and developing deeper collaborations. The IPP is an annual fee-based program that provides member companies with a variety of mechanisms for interacting and collaborating with SDSC researchers. The IPP serves an important function in maintaining SDSC’s ties to industry and the high-technology economy. Membership fees fund a variety of preplanned and ad hoc activities designed to encourage the exchange of new ideas and “seed” deeper collaborations. SDSC – A Long History of Applied R&D: From its founding in 1985 by a private company, General Atomics, through to its present affiliation with UCSD, SDSC has a long history of collaborating with and delivering value to industry. SDSC has a strong culture of conducting applied R&D, leveraging science and technology to deliver cutting-edge solutions to real-world problems. From its roots in High Performance Computing to its present emphases in “Big Data” and Predictive Analytics, SDSC has much to offer industry partners in terms of knowledge and experience that is relevant to their world and impactful to their business.
  • The Institutional Challenges of Cyberinfrastructure and E-Research
    E-Research and E-Scholarship: The Institutional Challenges of Cyberinfrastructure and E-Research. By Clifford Lynch. Scholarly practices across an astoundingly wide range of disciplines have become profoundly and irrevocably changed by the application of advanced information technology. This collection of new and emergent scholarly practices was first widely recognized in the science and engineering disciplines. In the late 1990s, the term e-science (or occasionally, particularly in Asia, cyber-science) began to be used as a shorthand for these new methods and approaches. The United Kingdom launched its formal e-science program in 2001.1 In the United States, a multi-year inquiry, having its roots in supercomputing support for the portfolio of science and engineering disciplines funded by the National Science Foundation, culminated in the production of the “Atkins Report” in 2003, though there was considerable delay before NSF began to act programmatically on the report.2 The quantitative social sciences, which are largely part of NSF’s funding purview and which have long traditions of data curation and sharing, as well as the use of high-end statistical computation, received more detailed examination in a 2005 NSF report.3 Key leaders in the humanities and qualitative social sciences recognized that IT-driven innovation in those disciplines was also well advanced, though less uniformly adopted (and indeed sometimes controversial). In fact, the humanities continue to showcase some of the most creative and transformative examples of the use of information technology to create new scholarship.4 Based on this recognition of the immense disciplinary scope of the impact of information technology, the more inclusive term e-research (occasionally, e-scholarship) has come into common use, at least in North America and Europe.
  • The OCLC Research Survey of Special Collections and Archives
    Taking Our Pulse: The OCLC Research Survey of Special Collections and Archives. Jackie M. Dooley, Program Officer, and Katherine Luce, Research Intern, OCLC Research. A publication of OCLC Research. Jackie M. Dooley and Katherine Luce, for OCLC Research © 2010 OCLC Online Computer Library Center, Inc. Reuse of this document is permitted as long as it is consistent with the terms of the Creative Commons Attribution-Noncommercial-Share Alike 3.0 (USA) license (CC-BY-NC-SA): http://creativecommons.org/licenses/by-nc-sa/3.0/. October 2010 Updates: 15 November 2010, p. 75: corrected percentage in final sentence. 17 November 2010, p. 2: added Creative Commons license statement. 28 January 2011, p. 25, penultimate para., line 3: deleted “or more” following “300%”; p. 26, final para., 5th line: changed 89 million to 90 million; p. 30, final para.: changed 2009-10 to 2010-11; p. 75, final para.: changed 400 to 80; p. 76, 2nd para.: corrected funding figures; p. 90, final line: changed 67% to 75%. OCLC Research Dublin, Ohio 43017 USA www.oclc.org ISBN: 1-55653-387-X (978-1-55653-387-7) OCLC (WorldCat): 651793026 Please direct correspondence to: Jackie Dooley Program Officer [email protected] Suggested citation: Dooley, Jackie M., and Katherine Luce. 2010. Taking our pulse: The OCLC Research survey of special collections and archives. Dublin, Ohio: OCLC Research. http://www.oclc.org/research/publications/library/2010/2010-11.pdf October 2010 Jackie M.
  • Deep Learning on Operational Facility Data Relat
    For citation, please use: A. Singh, E. Stephan, M. Schram, and I. Altintas, “Deep Learning on Operational Facility Data Related to Large-Scale Distributed Area Scientific Workflows,” in 2017 IEEE 13th International Conference on e-Science (e-Science), 2017, pp. 586–591. https://ieeexplore.ieee.org/document/8109199/ Deep Learning on Operational Facility Data Related to Large-Scale Distributed Area Scientific Workflows. Alok Singh, Ilkay Altintas, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA, USA, {a1singh, ialtintas}@ucsd.edu. Eric Stephan, Malachi Schram, Pacific Northwestern National Laboratory, Richland, WA, USA, {eric.stephan, malachi.schram}@pnnl.gov. Abstract: Distributed computing platforms provide a robust mechanism to perform large-scale computations by splitting the task and data among multiple locations, possibly located thousands of miles apart geographically. Although such distribution of resources can lead to benefits, it also comes with its associated problems such as rampant duplication of file transfers increasing congestion, long job completion times, unexpected site crashing, suboptimal data transfer rates, unpredictable reliability in a time range, and suboptimal usage of storage elements. In addition, each sub-system becomes a potential failure node that [...] training/testing framework. Unsupervised learning techniques lend themselves to interpretation and can tackle uncertainty smoothly and provide mechanisms to infuse domain expertise. Neural networks can be deployed to detect anomalies in dynamic environment with training. Deep Learning algorithms involve development of multilayer neural networks to solve forecasting, classification and clustering solutions. Our approach leverages such Deep Learning algorithms to discover solutions to problems associated with having computational infrastructure that is spread over a wide area.
  • Data Management and Data Sharing in Science and Technology Studies
    Article. Science, Technology, & Human Values 1-18. © The Author(s) 2018. Article reuse guidelines: sagepub.com/journals-permissions. DOI: 10.1177/0162243918798906. journals.sagepub.com/home/sth. Data Management and Data Sharing in Science and Technology Studies. Katie Jane Maienschein1, John N. Parker2, Manfred Laubichler1 and Edward J. Hackett3. Abstract: This paper presents reports on discussions among an international group of science and technology studies (STS) scholars who convened at the US National Science Foundation (January 2015) to think about data sharing and open STS. The first report, which reflects discussions among members of the Society for Social Studies of Science (4S), relates the potential benefits of data sharing and open science for STS. The second report, which reflects discussions among scholars from many professional STS societies (i.e., European Association for the Study of Science and Technology [EASST], 4S, Society for the History of Technology [SHOT], History of Science Society [HSS], and Philosophy of Science Association [PSA]), focuses on practical and conceptual issues related to managing, storing, and curating STS data. As is the case for all reports of such open discussions, a scholar’s presence at the meeting does not necessarily mean that they agree with all aspects of the text to follow. 1Center for Biology and Society, School of Life Sciences, Arizona State University, Tempe, AZ, USA 2Barrett, The Honors College, Arizona State University, Tempe, AZ, USA 3Heller School for Social Policy and Management, Brandeis University, Waltham, MA, USA Corresponding Author: John N. Parker, Barrett, The Honors College, Arizona State University, PO Box 871612, Tempe, AZ 85287, USA.
  • Uva-DARE (Digital Academic Repository)
    UvA-DARE (Digital Academic Repository) Collaborative provenance for workflow-driven science and engineering Altıntaş, İ. Publication date 2011 Link to publication Citation for published version (APA): Altıntaş, İ. (2011). Collaborative provenance for workflow-driven science and engineering. General rights It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons). Disclaimer/Complaints regulations If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible. UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl) Download date:26 Sep 2021 147 Publications [1] Lin, Abel W., Ilkay Altintas, Chris Churas, Madhusudan Gujral, Jeffrey Grethe and Mark Ellisman (2011 (In print.)). REST: From Research to Practice (C. Pautasso and E. Wilde, Eds.). Chap. 16: Case Study on the Use of REST Architectural Principles for Scientific Analysis: CAMERA - Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis. Springer. [2] Altintas, Ilkay, Abel W. Lin, Jing Chen, Chris Churas, Madhusudan Gujral, Shulei Sun, Weizhong Li, Ramil Manansala, Mayya Sedova, Jeffrey S.
  • 2018 Annual Report Alfred P
    Alfred P. Sloan Foundation 2018 Annual Report. Contents: Preface; Mission Statement; From the President; The Year in Discovery; About the Grants Listing; 2018 Grants by Program; 2018 Financial Review; Audited Financial Statements and Schedules; Board of Trustees; Officers and Staff; Index of 2018 Grant Recipients. Cover: The Sloan Foundation Telescope at Apache Point Observatory, New Mexico as it appeared in May 1998, when it achieved first light as the primary instrument of the Sloan Digital Sky Survey. An early set of images is shown superimposed on the sky behind it. (CREDIT: DAN LONG, APACHE POINT OBSERVATORY) Preface: The ALFRED P. SLOAN FOUNDATION administers a private fund for the benefit of the public. It accordingly recognizes the responsibility of making periodic reports to the public on the management of this fund. The Foundation therefore submits this public report for the year 2018. Mission Statement: The ALFRED P. SLOAN FOUNDATION makes grants primarily to support original research and education related to science, technology, engineering, mathematics, and economics. The Foundation believes that these fields, and the scholars and practitioners who work in them, are chief drivers of the nation’s health and prosperity. The Foundation also believes that a reasoned, systematic understanding of the forces of nature and society, when applied inventively and wisely, can lead to a better world for all. From the President ADAM F.
  • Istec Distinguished Lectures, Colorado State
    Colorado State University’s Information Science and Technology Center (ISTeC) presents two lectures by Dr. Fran Berman Director, San Diego Supercomputer Center Professor and HPC Endowed Chair, UC San Diego ISTeC Distinguished Lecture in conjunction with the Electrical and Computer Engineering Department and Computer Science Department Seminar Series “100 Years of Digital Data” Monday, March 31, 2008 Reception: 10:30 a.m. Lecture: 11:00 – 12:00 noon Location: Lory Student Center Room 214 • Joint Electrical and Computer Engineering Department and Computer Science Department Special Seminar sponsored by ISTeC “Cyberinfrastructure Challenges in Computer Science” Monday, March 31, 2008 Lecture: 4:00 – 5:00 p.m. Location: Wagar Room 231 ABSTRACTS “100 Years of Digital Data” The Information Age has brought with it a deluge of digital data. Current estimates are that in 2006, 161 exabytes (10^18 bytes) of digital data were created from cell phones, computers, iPods, DVDs, sensors, satellites, scientific instruments, and other sources, providing a foundation for our digital world. Migrating digital content through new generations of storage media, making sense of its content, and ensuring that needed information is accessible now and for the foreseeable future constitute some of the most critical challenges of the Information Age. The San Diego Supercomputer Center (SDSC) is a national Center leading the development and deployment of a comprehensive infrastructure for managing, storing, preserving, and using digital data. Leveraging ongoing collaborations with the research community (National Science Foundation, Department of Energy, etc.), data preservation and archival communities (Library of Congress, National Archives and Records Administration) and other partners, SDSC is providing innovative leadership in the emerging area of Data Cyberinfrastructure.
  • Lightning Talks
    Lightning Talks Leslie Ray Peter Rose Ilya Zaslavsky Senior Director, Director, Spatial Epidemiologist, Structural Information County of San Bioinformatics Systems Diego Laboratory, San Laboratory, San Diego Super Diego Super Introduction to COVID Computer Center, Computer Center, Epidemiology UC San Diego UC San Diego Sep. 1, 10:00 AM PT Register here COVID-19 Knowledge COVID-19 and Spatial Introduction to Graph and Jupyter Data Science COVID-19 Data in Notebooks Sep. 2, 11:00 AM PT California Sep. 2, 10:30 AM PT Register here Sep. 1, 10:30 AM PT Register here Register here Munseob Lee Melissa Floca Michele Gilman Assistant Border Solutions Venable Professor Professor of Alliance, UC San of Law, University Economics, Diego of Baltimore School of Global Policy and Strategy, UC San Diego The Cost of Privacy: Introduction to Data Justice and Welfare Effects of the COVID-19 and Automated Decision- Disclosure of COVID- Mortality Data in Making Systems 19 Cases (paper) Mexico Sep. 3, 11:30 AM PT Sep. 2, 11:30 AM PT Sep. 3, 10:30 AM PT Register here Register here Register here All talks are recorded and will be available online here Lightning Talks Onedeige Ilkay Altintas Ashok James Chief Data Science Srinivasan Civic Innovation Officer, San Diego William Nystul Apprentice, Super Computer Eminent Scholar BetaNYC Center Chair and Professor, University of West Florida An Intro to Story Fighting Disasters COVID-19 in Crowded Maps: Manhattan’s with Big Data and Locations: Leveraging Disappearing Predictive Modeling New Data Sources to Religious Facilities Sep. 8, 10:00 AM PT Analyze Risk Sep.
  • Sharing Is Caring—Data Sharing Initiatives in Healthcare
    International Journal of Environmental Research and Public Health Review Sharing Is Caring—Data Sharing Initiatives in Healthcare Tim Hulsen Department of Professional Health Solutions & Services, Philips Research, 5656AE Eindhoven, The Netherlands; [email protected] Received: 28 February 2020; Accepted: 24 April 2020; Published: 27 April 2020 Abstract: In recent years, more and more health data are being generated. These data come not only from professional health systems, but also from wearable devices. All these ‘big data’ put together can be utilized to optimize treatments for each unique patient (‘precision medicine’). For this to be possible, it is necessary that hospitals, academia and industry work together to bridge the ‘valley of death’ of translational medicine. However, hospitals and academia often are reluctant to share their data with other parties, even though the patient is actually the owner of his/her own health data. Academic hospitals usually invest a lot of time in setting up clinical trials and collecting data, and want to be the first ones to publish papers on this data. There are some publicly available datasets, but these are usually only shared after study (and publication) completion, which means a severe delay of months or even years before others can analyse the data. One solution is to incentivize the hospitals to share their data with (other) academic institutes and the industry. Here, we show an analysis of the current literature around data sharing, and we discuss five aspects of data sharing in the medical domain: publisher requirements, data ownership, growing support for data sharing, data sharing initiatives and how the use of federated data might be a solution.