UNIVERSITY of CALIFORNIA Los Angeles from Open
Total Page:16
File Type:pdf, Size:1020Kb
UNIVERSITY OF CALIFORNIA Los Angeles From Open Data to Knowledge Production: Biomedical Data Sharing and Unpredictable Data Reuses A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Information Studies by Irene Pasquetto 2018 © Copyright by Irene Pasquetto 2018 ABSTRACT OF THE DISSERTATION From Open Data to Knowledge Production: Biomedical Data Sharing and Unpredictable Data Reuses by Irene Pasquetto Doctor of Philosophy in Information Studies University of California, Los Angeles, 2018 Professor Christine L. Borgman, Chair Using a US consortium for data sharing as the primary field site, this three-year ethnographic research project examines the socio-technical, epistemic, and ethical challenges of making biomedical research data openly available and reusable. Public policy arguments for releasing scientific data for reuse by others include increasing trust in science and leveraging public investments in research. In most types of scientific research, data release occurs in parallel with associated publications, after peer-review. In the consortium studied for this project, datasets may also be released independently without an associated publication. Such research datasets are conceptualized as “hypothesis free” resources from which novel knowledge can be extracted indefinitely. Among the findings of this project are that biomedical researchers do not download and re-analyze “hypothesis free” research data from open repositories as a regular practice. Data reuse is a complex, delicate, and often time-consuming process. Metadata and ii ontology schemas appear to be necessary but not sufficient for data reuse processes. For scientists to test new hypotheses on “old” data, they depend on access to peer-reviewed primary analyses, pre-existing trusted relationships with the data creators, and shared research agendas. Data donors (patients, study participants, etc.), on the other hand, retain little control over how open research data are reused. Findings suggest that, in practice, it is impossible to predict – and consequently to regulate – how datasets might be reused once made openly available. Unintended consequences of reusing this consortium’s open data already are emerging, to the concern of some participants. iii This dissertation of Irene Pasquetto is approved. Soraya de Chadarevian Christopher M. Kelty Leah A. Lievrouw Christine L. Borgman, Committee Chair University of California, Los Angeles 2018 iv Table of Contents 1. Introduction .............................................................................................................................. 1 2. Theoretical framework ............................................................................................................ 6 Knowledge infrastructures in science .......................................................................................... 6 Research data and evidentiary power ....................................................................................... 10 Terminology specifications: data reuse, data sharing, open data ................................. 13 3. Literature review .................................................................................................................... 18 Pre-genome perspectives on data sharing and reuse ......................................................... 19 The drosophila community: a practice-driven openness ............................................ 20 Competition and collaboration in semi-open communities ........................................ 21 Centralized open collections and community databases ............................................. 25 Data sharing in the molecular lab: pre and post-publication practices ................. 28 The “genomics revolution:” the rise of unpublished open data ..................................... 31 The HGP: competing regimes for data sharing and open data .................................. 34 Sequence databases: open vs. proprietary ......................................................................... 36 The birth of “UJAD” open data ................................................................................................. 42 Open data, human subjects, and participatory paradigms ............................................... 49 Current controversies in human genomics research .......................................................... 55 Precision medicine, population genetics, and racial categories ................................ 56 Craniofacial research, facial measurements, and identification systems.............. 59 4. Research design ....................................................................................................................... 63 v Introduction to case study .............................................................................................................. 64 Sampling strategy ............................................................................................................................... 67 Research methods .............................................................................................................................. 69 Data collection and analysis with “open coding” .................................................................. 74 Ethical statement ................................................................................................................................ 75 5. Findings ...................................................................................................................................... 76 The DataFace leadership ................................................................................................................. 76 DataFace overarching workflow .................................................................................................. 77 Why open data? Rationales for the DataFace Consortium ............................................... 78 Integrating scarce and segregated knowledge ................................................................. 79 Collecting and accessing “hypothesis free” genomics data ......................................... 85 Funding strategy: the U01 and R03 grants ............................................................................. 90 Curating and making available data “before the fact” ........................................................ 92 Research workflows: data collection, analysis, and release ............................................ 95 Research workflow #1: The Blue Spoke .............................................................................. 95 Research workflow #2: The Green Spoke ........................................................................ 108 Research workflow #3: The Pink spoke ........................................................................... 116 DataFace workflows’ summary ............................................................................................ 121 Building tools for data search, browsing, and discovery ............................................... 122 Data modeling, searching, and browsing ......................................................................... 125 Metadata and ontology work ................................................................................................. 131 Tools for data discovery and visualization ...................................................................... 144 Disciplinary configurations: craniofacial research as team science .......................... 149 vi Craniofacial researchers and data reuse ............................................................................... 155 Accessing and reusing others’ data at a summary level ............................................ 157 Reusing others’ data when data are “raw” ...................................................................... 159 A data reuse story: the case of DNA-based facial reconstruction ............................... 176 Reconstructing faces from DNA samples: appeal and controversies ................... 177 Research on FDP technologies .............................................................................................. 180 The DataFace GWAS Datasets ............................................................................................... 181 Modeling 3D Facial Shape from DNA ................................................................................. 189 The debate over method: complex traits are not that simple ................................. 193 Refining the prediction model for human faces ............................................................ 195 6. Discussion ............................................................................................................................... 199 Regimes for Data Governance and the DataFace Consortium ..................................... 199 Re-purposing others’ data: background and foreground reuse .................................. 205 Trust in the “data” and trust in the “system” .................................................................. 206 Trust in the data creators ........................................................................................................ 206 Data reuse for background research .................................................................................. 208 Trust in others’ data and the “publication status” ....................................................... 209 Data Reuse for