Curation Lessons Learned from Data Management Services

2nd International LSDMA Symposium

Sayeed Choudhury

Outline and Key Message

• Current state
• Experience with SDSS
• Comparison of SDSS vs. long-tail science
• Data management layer stack model

• Importance of data preservation – particularly provenance as a means for supporting data analytics

Data Conservancy (DC)

• One of five awards through US National Science Foundation’s (NSF) DataNet program
• $10 million award to build national-scale data infrastructure
• Part of overall Cyberinfrastructure for the 21st Century program
• Sheridan Libraries at Johns Hopkins is the lead organization and contributes majority of funding at this point

Data Conservancy Objectives

• Data Conservancy is a community that develops solutions for data preservation and sharing to promote cross-disciplinary re-use.

• Preserve – collect and take care of research data
• Share – reveal data’s potential and possibilities
• Discover – promote re-use and new combinations

Johns Hopkins University Sheridan Libraries Data Management Services

• Johns Hopkins University Data Management Services (JHUDMS) – http://data.dmp.jhu.edu
• Culmination of over a decade of R&D starting with Sloan Digital Sky Survey (SDSS)
• Implementation of Data Conservancy technology development, educational, workforce development and sustainability programs

Two Stages of JHUDMS

• Pre-proposal consultation and assistance with data management plan preparation for NSF proposals – though rapidly expanding beyond NSF and into other use cases
• Post-proposal data management through JHU Data Archive
• First stage paid for directly by JHU; second stage paid for through line items within NSF proposal budgets

The Sloan Digital Sky Survey (SDSS) is one of the most ambitious and influential surveys in the history of astronomy. Over eight years of operations (SDSS-I, 2000-2005; SDSS-II, 2005-2008), it obtained deep, multi-color images covering more than a quarter of the sky and created 3-dimensional maps containing more than 930,000 galaxies and more than 120,000 quasars.

Engaging the Astronomers

• 2001 – Professor Alex Szalay becomes the Principal Investigator for the National Virtual Observatory (NVO)
• 2002 – Szalay and Choudhury begin dialogue about data preservation
• 2007 – Szalay introduces principals for SDSS to Choudhury and Tim DiLauro
• 2008 – Astrophysical Research Consortium (ARC) and Sheridan Libraries sign formal Memorandum of Understanding (MOU)

SDSS has…

• Raw Data
• Data Archive Server (DAS)
• Catalog Archive Server (CAS)
• Software

• Web-Based Data Documentation
• Publications
• Administrative Archive

• Content
  – Sloan Digital Sky Survey (SDSS) – Phase I & II
    • ~160 TB in ~80 million files
  – Researcher Content
    • Typically 5-200 GB in hundreds to thousands of files per article
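
At roughly 160 TB across 80 million files (about 2 MB per file on average), holdings like these can only be checked mechanically. One common way to do that is a checksum manifest, which is one reading of the "fixity" named in the archiving layer later in this deck. The sketch below is a minimal illustration under stated assumptions: the function names (`sha256_of`, `build_manifest`, `verify`), the choice of SHA-256, and the sample files are all hypothetical and are not taken from the JHU Data Archive or SDSS tooling.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the manifest entries that are now missing or have changed content."""
    current = build_manifest(root)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )


if __name__ == "__main__":
    # Hypothetical two-file deposit, created in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "catalog.csv").write_text("objid,ra,dec\n1,10.5,-0.3\n")
        (root / "notes.txt").write_text("hypothetical researcher notes\n")
        manifest = build_manifest(root)
        print(verify(root, manifest))   # nothing has changed yet
        (root / "notes.txt").write_text("silently edited\n")
        print(verify(root, manifest))   # the edited file is reported
```

A manifest like this only detects change or loss; recovering from it still depends on the backup and restore capabilities of the storage layer.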

Data Flow (Levels of Data)

Pixel data collected by telescope

Sent to Fermilab for processing

Beowulf Cluster produces catalog

Loaded in a SQL database
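
The four steps above form a chain of derivations, and each level of data is only interpretable if it records what it was derived from, how, and by whom. The following is a minimal sketch of that idea, assuming a hypothetical `Record` type and illustrative labels for each level; it is not the SDSS pipeline or the Data Conservancy data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: a record cannot be altered once created
class Record:
    name: str                                # which level of data this is
    process: str                             # how it was produced
    agent: str                               # who or what produced it
    derived_from: Optional["Record"] = None  # the level it came from, if any


def lineage(record: Record) -> list[str]:
    """Walk back from a derived product to the original observation."""
    steps = []
    current: Optional[Record] = record
    while current is not None:
        steps.append(f"{current.name} ({current.process}, by {current.agent})")
        current = current.derived_from
    return steps


# Hypothetical labels mirroring the four steps of the data flow.
pixels = Record("pixel data", "collected by telescope", "telescope")
processed = Record("processed data", "processing at Fermilab", "Fermilab", pixels)
catalog = Record("catalog", "catalog production", "Beowulf cluster", processed)
database = Record("SQL database", "catalog load", "database loader", catalog)

for step in lineage(database):
    print(step)
```

Starting from the SQL database, `lineage` returns one entry per level, ending at the pixel data. Linking objects by reference like this is a simple in-memory stand-in for an information graph; an open protocol would express the same links in a form other systems can read.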

Information graph using OAI-ORE

SDSS Public Wiki

• https://wiki.library.jhu.edu/x/K4Tl

• We will track our activities, data profiles, and lessons learned on this wiki page.

• The wiki space is a work in progress, but it will contain our most earnest description (particularly of data transfer, storage, archiving and preservation issues)

Long Tail Researchers have…

Anything you could imagine.*

* And probably some that you could not

Side-by-Side

| Concern | SDSS* | Long Tail Researchers |
|---|---|---|
| Sophistication | Greater expertise due to economies of scale. | Little or no expertise or sophistication. |
| Data Organization | Well structured and documented. Organized around data publication. | Ad hoc, but often organized around and triggered by article publication. |
| Formats | Community standard.** | Ad hoc. |
| Quality Control | Yes. And well documented. Strong project and community feedback. | Undocumented. Usually left to grad students with some minimal review at publication time. |
| Code management | Primarily CVS, but difficult to determine which version produced a given output. | Rare. Undocumented modifications are common. |
| Incentives | Ambivalent. Investigated only toward end of project. | Few. But more coming. More disincentives. |

* For released data
** But overloaded

Data Management Layers

| Layers | Characteristics | Implication for PI | Implication relative to NSF |
|---|---|---|---|
| Curation | Adding value throughout life-cycle | Feature Extraction; New query capabilities; Cross-disciplinary | Competitive advantage; New opportunities |
| Preservation | Ensuring that data can be fully used and interpreted | Ability to use own data in the future (e.g. 5 yrs); Data sharing | Satisfies NSF needs across directorates |
| Archiving | Data protection including fixity, identifiers | Provides identifiers for sharing, references, etc. | Could satisfy most NSF requirements |
| Storage | Bits on disk, tape, cloud, etc. Backup and restore | Responsible for: Restore; Sharing; Staffing | Could be enough for now but not near-term future |

“Big Data”

• What is Big Data?
• There are definitions based on the “V’s” of Big Data (e.g., volume, velocity, variety)
• What is clear is that it’s different from “spreadsheet science” (or long-tail science)
• For me, if a community’s ability to deal with data is overwhelmed, it’s “Big Data” – it’s more about “M’s” (methods or lack thereof) than “V’s”

Characteristics of Data Conservancy Provenance

• Immutable
• Attributable
• Autonomously creatable
• Finalizable
• Process reflecting

• Paper forthcoming in Data Science Journal

Data Preservation and Data Analytics

• The same design, policies, data treatment and preparation necessary for preservation foster re-use, including analytics
• Information graphs provide a model for tracking provenance – information about objects (both original and derived), time and method of creation and processing steps
• The future of data management may be the generation of information graphs – provenance is a reason to use open protocols like OAI-ORE

Acknowledgements

• NSF Award OCI-0830976
• Sheridan Libraries and JHU financial support
• Tim DiLauro for SDSS slides
• Alex Szalay for Levels of Data slide
• http://dataconservancy.org
• http://dmp.data.jhu.edu -- JHU DMS
• http://www.dlib.org/dlib/september12/mayernik/09mayernik.html -- DC blueprint document
• https://www.youtube.com/watch?v=F6iYXNvCRO4 -- data management layer stack model