Speaker Suan Gregurick

NIH’s Strategic Vision for Data Science: Enabling a FAIR-Data Ecosystem Susan Gregurick, Ph.D. Associate Director for Data Science Office of Data Science Strategy February 10, 2020 Chronic Obstructive Pulmonary Disease is a significant cause of death in the US, genetic and Genetic & dietary data are available that could be used to dietary effects in further understand their effects on the disease Separate studies have been Challenges COPD done to collect genomic and dietary data for subjects with Obtaining access to all the COPD. relevant datasets so they can be analyzed Researchers know that many of the same subjects participated Understanding consent for in the two studies. each study to ensure that data usage limitations are respected Linking these datasets together would allow them to examine Connecting data from the the combined effects of same subject across genetics and diet using the different datasets so that the subjects present in both studies. genetic and dietary data from However, different identifiers the same subjects can be were used to identify the linked and studied subjects in the different studies. Icon made by Roundicons from www.flaticon.com DOC research requires datasets from a wide variety Dental, Oral and of sources, combining dental record systems with clinical and research datasets, with unique Craniofacial challenges of facial data (DOC) Research Advances in facial imaging Challenges have lead to rich sources of imaging and Joining dental and quantitative data for the clinical health records in craniofacial region. In order to integrate datasets addition, research in from these two parts of the model organisms can health system support DOC research in humans. Necessary datasets are spread across multiple But Data access and repositories consent are significant concerns with such Addressing the ELSI of inherently identifiable data facial data and DNA- based facial identification Mouse skull Image from 3D rendering created by USC types. Center for Craniofacial Molecular Biology Machine learning (ML) has great potential to help scientists make sense of the vast quantities of Machine data being generated by modern instruments Learning in life Challenges Techniques like genomics, sciences imaging, and others are Access to datasets of generating vast amounts of sufficient quality and data, that have to be quantity to support machine processed and could be learning applications integrated with other data sources such as patient data Moving the large datasets across the network to local or ML is becoming increasingly to the cloud attractive as a way to analyze these types of Access to training and/or datasets and find features of staff to develop new interest that may have algorithms mechanistic or diagnostic value Obtaining access to specialized hardware to run Icon made by Roundicons from www.flaticon.com the analysis software The International epidemiologic Databases to Global studies Evaluate AIDS (IeDEA) network assembles data from 7 large regional research cohorts- representing over 500 HIV clinics around the world Challenges NIH is supporting global research projects like IeDEA that have to Harmonizing data from multiple collect and integrate data from sources so that it can be hundreds of clinics worldwide. integrated and analyzed together. This presents significant data management challenges to Data access regulations to ensure that each clinic collects govern who can access data and data in a compatible fashion and what they can do with it are records the data in a compatible complex. format. National regulations may International data access preclude the use of US-based requirements can significantly cloud resources limiting access impact a researcher’s ability to to data and infrastructure that access data collected in another could support these projects Icon made by Roundicons from www.flaticon.com country Researchers understand the concepts behind FAIR FAIR and data but need guidance on how to put FAIR into practice sharing The FAIR principles Challenges (Findable, Accessable, Interoperable and Reusable) Prioritizing dataset annotation are familiar to many. and curation when it is time consuming and perceived as an However, there is confusion added burden about what FAIR means in practice. Selecting metadata to annotate their data that is compatible with It can be time consumingto other datasets and tools in the create FAIR datasets. ecosystem Where to put the data so it can be stored for the long term and securely accessed by authorized users (as appropriate) Icon made by Roundicons from www.flaticon.com the ability to link data in the Framingham Heart Study with IMAGINE… Alzheimer’s health data to understand associations in, for example, cardiovascular health with aging and dementia. Linking Data Platforms 7 the ability to quickly obtain access to data, and related IMAGINE… information, from published articles. Energetics of Chromophore Binding in the Visual Photoreceptor of Rhodopsin, H. Tian et al, Biophysical Journal, 2017. Negative stain EM reveals the principal Absorption spectra of purified CsR-WT (A) and architecture of the rhodopsin/GRK5 CySeR (B) at pH 5 (green), pH 7.4 (red), and pH 9 complex. (Image by Van Andel (blue). R. Fudim, e al, Science Signaling, 2019 Research Institute) Dark Data 8 the ability to link electronic health care records with IMAGINE… personal data and with clinical and basic research data. Data Security and Data Privacy 9 This is the promise of the NIH Strategic Plan for Data Science …and here’s how we will get there. Strategic Plan for Data Science: Goals and Objectives Data Management, Modernized Data Workforce Stewardship and Data Infrastructure Analytics, and Ecosystem Development Sustainability Tools Modernize data Support useful, Enhance the NIH Optimize data repository generalizable, and data science Develop policies storage and ecosystems accessible tools workforce for a FAIR data security ecosystem Support storage Broaden utility of, Expand the and sharing of and access to, national research individual specialized tools workforce datasets Connect NIH data Better integrate Enhance clinical and Improve discovery systems Engage a broader stewardship observational data and cataloging community into biomedical resources data science https://datascience.nih.gov Making Data FAIR • must have unique identifiers, effectively labeling it Findable within searchable resources. • must be easily retrievable via open systems and Accessible effective and secure authentication and authorization procedures. • should “use and speak the same language” via use Interoperable of standardized vocabularies. • must be adequately described to a new user, have Reusable clear information about data-usage licenses, and have a traceable “owner’s manual,” or provenance. 12 Percentage of NIH Supported PMC publications with data availability statement 75% 50% 25% PERCENTAGE PERCENTAGE 0% 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 YEAR 13 NIH Data Management and Sharing Policy Development • Researchers with NIH-funded or conducted research projects resulting in the generation of scientific data will be required to submit a Plan • Plans should explain how scientific data generated by a research study will be managed and which of these scientific data will be shared Community Identified Input need for Develop draft policy for Solicited appropriate Released data Release final infrastructure draft for • 189 management policy (spring submissions community • policy and and sharing 2020) from national implementation input and to go and related international ‘hand-in-hand’ guidance stakeholders Overview of Sharing Publication and Related Data NIH strongly encourages open access Data Sharing Repositories as a first choice. https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html Options of scaled implementation for sharing datasets Datasets up to 2 gigabytes Datasets up to 20*gigabytes High Priority Datasets petabytes PubMed Central Use of commercial and STRIDES Cloud Partners https://datascience.nih.gov/data-ecosystem/biomedicalnon-profit repositories-data-repositories-and-knowledgebases • PMC stores publication- • Assign Unique Identifiers • Store and manage large related supplemental to datasets associated scale, high priority NIH materials and datasets with publications and link datasets. (Partnership with directly associated to PubMed. STRIDES) publications. Up to 2 GB. • Store and manage • Assign Unique Identifiers, • Generate Unique datasets associated with implement authentication, Identifiers for the stored publication, up to 20* GB. authorization and access supplementary materials control. and datasets. Optimized Funding for NIH Data Repositories and Knowledgebases • Data resources are important • Solution: New Funding research tools Announcement for data repositories and knowledgebases • Historically funded through • Resource plan requirement research grants • Funding mechanism should Scientific Community be optimal for type of Impact Engagement resource • End goal: researcher Quality of Data and Services confident in data and Governance information integrity and Efficiency of Operations Supporting data repositories and knowledgebase resources Optimized Funding for NIH Data Repositories and Knowledgebases Funding Opportunities Scientific Community • NIH released two funding opportunities on Impact Engagement Jan. 17 to support biomedical data repositories and knowledgebases: Quality of Data and Services Governance • Biomedical Data Repository (PAR-20-089) and

Speaker Suan Gregurick

CHAPTER 6: Diversity in AI

A Participatory Approach Towards AI for Social Good Elizabeth Bondi,∗1 Lily Xu,∗1 Diana Acosta-Navas,2 Jackson A

David C. Parkes John A

Director's Update

Graduate Student Highlights

Report of the ACD Working Group Ad Hoc Virtual Meeting on AI/ML

Bridging Machine Learning and Mechanism Design Towards Algorithmic Fairness

Jurors Bio Version 1

Generative Modeling for Graphs

Fairly Private Through Group Tagging and Relation Impact

"Enhancing Fairness and Privacy in Search Systems"

June 28, 2019 at Phoenix, Arizona