Towards a Sustainable Ecosystem for Data Driven Research and Innovation
Dr. Francine Berman
Chair, Research Data Alliance / US
Edward P. Hamilton Distinguished Professor of Computer Science, Rensselaer Polytechnic Institute

Innovation opportunity: Research data driving solutions to scientific and societal challenges
• Who is most at risk of contracting asthma?
• How can we increase wheat yields?
• How can we best address energy needs and sustain the environment?
• How accurate is the Standard Model of Physics?
Images: Lucas Taylor; Ceinturion, Wikipedia

Infrastructure reality: Access, use and re-use of data now and in the future presupposes sustainable data stewardship and preservation today
• Stewardship and preservation critical: “Homeless” data cease to exist
• Sustainable data infrastructure necessary to support
  – Data management plans
  – Public access to research data
  – Use and re-use of data
  – Reproducibility of results

How can we usefully think about sustainability?
Sustainable development: “development that meets the needs of the present without compromising the ability of future generations to meet their own needs.” (Our Common Future, U.N. Brundtland Commission)
• Key components
  – Ecological sustainability
  – Cultural / institutional sustainability
  – Economic sustainability
  – Political sustainability
Planet image: NASA. Quote from “Our Common Future”, http://www.un-documents.net/our-common-future.pdf

How can we measure sustainability?
• Circles of Sustainability developed to assess and understand sustainability. Used
  – For managing projects directed towards socially sustainable outcomes
  – To assess the sustainability of cities and urban settlements
• Used by global organizations including the United Nations Global Compact Cities Programme, The World Association of Metropolises, World Vision, and others.
Assessment, Action: Sustainability analysis helps identify strategic areas for policy, practice and investment
Circles of sustainability provide a useful lens with which to consider sustainability of digital research data
• How can we create a viable support model for digital access and preservation?
• What infrastructure is needed to support digital stewardship and preservation?
• Why is digital preservation and access such a hard sell?
• How do we maximize the access, sharing and exchange of digital research data?

What infrastructure is needed to support digital stewardship and preservation?

Data-driven Geoscience: How can we respond to large-scale earthquakes?
Earthquake simulations enable
• Enhanced scientific understanding of the physical world
• More strategic plans for bridge, building and other physical infrastructure reinforcements to increase safety
• Better disaster response planning for police, fire fighters, ER teams in high-risk areas to increase their effectiveness
[Figure: TeraShake simulation of magnitude 7.7 earthquake on the lower San Andreas fault]
Simulation courtesy of Amit Chourasia, SDSC; table information courtesy of Southern California Earthquake Center

TeraShake data infrastructure*
• Data Management
  – 10 Terabytes moved per day during execution over 5 days
  – Derived data products registered into SCEC digital library (total SCEC library has 168 TB)
• Data Post-processing:
  – Movies of seismic wave propagation
  – Seismogram formatting for interactive on-line analysis
  – Derived data:
    • Velocity magnitude
    • Displacement vector field
    • Cumulative peak maps
    • Statistics used in visualizations

TeraShake Resources
Computers and Systems
• 80,000 hours on IBM Power 4 (DataStar)
• 256 GB memory p690 used for testing, p655s used for production run, TeraGrid used for porting
• 30 TB global parallel file system (GPFS)
• Run-time 100 MB/s data transfer from GPFS to SAM-QFS
• 27,000 hours post-processing for high resolution rendering
People
• 20+ people for IT support
• 20+ people in domain research
Storage
• SAM-QFS archival storage
• HPSS backup
• Storage Resource Broker collection with 1,000,000 files
* circa 2007

Technical infrastructure critical part of the ecosystem for data-driven innovation
• Data access via portals, science gateways, etc.
• Database and data collection systems
• Data services to support use and re-use
• Data analysis algorithms, data-driven models and simulations
• Data visualization tools
• Semantic frameworks
• Data management systems
• Data storage

Social, organizational, and human infrastructure for data-driven results equally important
• Policy
• Common Standards
• Systems Interoperability
• Sustainable Economics
• Workforce and training
• Community Practice
Traffic image: Mike Gonzalez

Data-driven ecosystem requires multiple kinds of infrastructure
Data-driven Innovation rests on Data Infrastructure:
• Technical Infrastructure: SW and systems, Tools and algorithms, Hardware and facilities
• Social Infrastructure: Policy, Practice, Standards, Rights, Community culture

How can we create a viable support model for digital access and preservation?
Data economics: Responsible data stewardship requires a viable business model for sustaining its underlying infrastructure
Data infrastructure costs increase with usage, stewardship and access requirements, perceived value
Greater costs at the extremes (including “big” data) …
Diagram labels: More mgt., stewardship required; More curation required; Long-lived data; “Locally Manageable” Data; Big data; Access control, broad access; Coupled data services

It’s not just about the cost of storage
Data infrastructure costs include
• Maintenance and upkeep
• Software tools and packages
• Utilities (power, cooling)
• Space
• Networking
• Security and failover systems
• People (expertise, help, infrastructure management, development)
• Training, documentation
• Monitoring, auditing
• Reporting costs
• Costs of compliance with regulation, policy, etc. …
Resources and Resource Refresh
[Figure: SDSC Data Storage Growth ’97-’09. Archival storage (TB, logarithmic axis from 10 to 100,000) by date, June 1997 to June 2009. Series: Model A (8-yr, 15.2-mo 2X), TB Stored, Planned Capacity.]
• Most valuable data replicated
• As research collections increase, storage capacity must stay ahead of demand
Information courtesy of Richard Moore, SDSC

Increased requirements for access mean increased need for data infrastructure
• Increasing U.S. federal R&D agency requirements for managed and accessible research data
• February 2013 OSTP Memo: New policies for public access to research data and publications
  – Strategy for capitalizing on what exists and fostering public-private partnerships with scientific journals
  – Strategy for increasing / enhancing discoverability, access, dissemination, stewardship, preservation
  – No new money: “Identification of resources within the existing agency budget to implement the plan”

Economics of public access: Who pays the data bill?
Article: Science Magazine, August 9, 2013. Free public access link at http://www.cs.rpi.edu/~bermaf/

Op-ed recommendations: Cultivate / coordinate preservation and stewardship options in every sector
Sectors: Private Sector, Public Sector, Academia, Individuals

Private Sector
• Facilitate private sector stewardship of public access research data as a public good
Public Sector
• Clarify public sector stewardship commitments: articulate what data will / won’t be supported
Diagram labels: Not govt. supported; ??; Govt. supported
Charleston Ballet blog: http://allianceblog.org/tag/charleston-ballet/ ; corporate and collection logos

Academic Sector
• Create sustainable university library and repository stewardship solutions
Individuals
• Evolve research culture to take advantage of what works in the private sector

No magic economic bullet. Coordination between approaches can provide even more robust options for stewardship
Sectors: Private Sector, Public Sector, Academia, Individuals

Why is digital preservation and access such a hard sell?

Academic and public sector infrastructure challenges
• What is newsworthy?
  – Research: New discoveries and breakthroughs
  – Infrastructure: Failure of systems
• What is the value proposition?
  – Research: Domain and national leadership
  – Infrastructure: Enabler of innovation and competitiveness
• What is the funding model?
  – Research: Fixed-term funding
  – Infrastructure: Continuous long-term support
• Who is responsible?
  – Research: Various govt. R&D agencies, NGOs, etc.
  – Infrastructure: No-one’s major priority

Stephanie A. Miner, the Syracuse mayor, said [infrastructure is] too often overlooked when politicians want to spend money on economic development.
“You don’t cut ribbons for new water mains, but that’s really what matters.” NY Times, February 15, 2014

Systemic challenges to sustainable stewardship (from the Blue Ribbon Task Force Interim Report [at brtf.sdsc.edu])
• Poor alignment between stakeholders