Computational Curation of Open Science Data
Maxim Grechkin
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
University of Washington
2018
Reading Committee:
Bill Howe, Chair Walter L. Ruzzo
Hoifung Poon
Program Authorized to Offer Degree: Computer Science & Engineering
© Copyright 2018
Maxim Grechkin
University of Washington
Abstract
Computational Curation of Open Science Data
Maxim Grechkin
Chair of the Supervisory Committee: Associate Professor Bill Howe, Information School
Rapid advances in data collection, storage, and processing technologies are driving a new, data-driven paradigm in science. In the life sciences, progress is driven by plummeting genome sequencing costs, opening up the new fields of bioinformatics, genomics, and systems biology. The return on the enormous investments in the collection and storage of these data is hindered by a lack of curation, leaving a significant portion of the data stagnant and underused. In this dissertation, we introduce several approaches aimed at making open scientific data accessible, valuable, and reusable. First, in the Wide-Open project, we introduce a text mining system for detecting datasets that are referenced in published papers but are still kept private. After parsing over 1.5 million open access publications, Wide-Open identified hundreds of datasets overdue for publication; 400 of them were then released within one week. Second, we propose a machine learning system, EZLearn, for annotating scientific data into potentially thousands of classes without any manual work to provide training labels. EZLearn is based on the observation that in scientific domains, data samples often come with natural language descriptions meant for human consumption. We take advantage of those descriptions by introducing an auxiliary natural language processing system and training it together with the main classifier in a co-training fashion. Third, we introduce Cedalion, a system that can capture scientific claims from papers, validate them against the data associated with the paper, then generalize and adapt the claims to other relevant datasets in the repository to gather additional statistical evidence. We evaluated Cedalion by applying it to gene expression datasets and producing reports summarizing the evidence for or against each claim based on the entirety of the collected knowledge in the repository. We find that the claim-based algorithms we propose outperform conventional data integration methods and achieve high accuracy against manually validated claims.
TABLE OF CONTENTS
List of Figures
Chapter 1: Introduction
Chapter 2: Related Work
2.1 General data curation
2.2 Curating Gene Expression data
2.3 Repository access tools
2.4 Hypothesis Generation and Discovery
2.5 Workflow systems as an alternative to repository curation services
2.6 Summary
Chapter 3: Enforcing open data policy compliance using text-mining
3.1 Tracking dataset references in open access literature
3.2 Summary
Chapter 4: Aiding data discovery in the repository by building curation classifiers
4.1 Introduction
4.2 Related Work
4.3 EZLearn
4.4 Application: Functional Genomics
4.5 Application: Scientific Figure Comprehension
4.6 Summary
Chapter 5: Reconstructing gene regulatory networks based on public datasets
5.1 Introduction
5.2 Pathway Constrained Sparse Inverse Covariance Estimation
5.3 PathGLasso Learning Algorithm
5.4 Experiments
5.5 Interpretation of the Learned Network
5.6 Summary
Chapter 6: Uncovering network-perturbed genes in public cancer expression datasets
6.1 Introduction
6.2 Results
6.3 Methods
6.4 Summary
Chapter 7: Enabling reproducibility by using claim-aware data integration
7.1 Introduction
7.2 Related work
7.3 Problem Definition
7.4 Claim-Aware Algorithms
7.5 Experiments
7.6 Summary
Chapter 8: Conclusions
8.1 Limitations and future work
Bibliography
LIST OF FIGURES
1.1 a) Total number of human samples submitted and curated in GEO. Curation is defined as being assembled into GEO DataSets (GDSs). b) Number of integrative studies found in the PubMed Central corpus of open access papers, plotted over years. Only studies using datasets from the Affymetrix U133 Plus 2.0 platform were considered. c) The median number of samples used by an integrative study using data from the Affymetrix U133 Plus 2.0 platform vs. the number of available samples with matching tissue (based on EZLearn tissue labels).
3.1 Number of samples in the NCBI Gene Expression Omnibus (GEO).
3.2 Number of GEO datasets overdue for release over time, as detected by Wide-Open. We notified GEO of the standing list in February 2017, which led to the dramatic drop in overdue datasets (magenta portion), with four hundred datasets released within the first week.
3.3 Average delay from submission to release in GEO.
3.4 Wide-Open tracking result from February 2018, showing an initial drop in the number of overdue datasets, together with newly discovered ones.
4.1 The EZLearn architecture: an auxiliary text-based classifier is introduced to bootstrap from the lexicon (often available from an ontology) and co-teaches the main classifier until convergence.
4.2 Example gene expression profile and its text description in the Gene Expression Omnibus (GEO). The description is provided voluntarily and may contain ambiguous or incomplete class information.
4.3 Ontology-based precision-recall curves comparing EZLearn, distant supervision, URSA, and the random baseline (gray). Extrapolated points are shown in transparent colors.
4.4 (a) Comparison of test accuracy with varying amounts of unlabeled data, averaged over fifteen runs. EZLearn gained substantially with more data, whereas co-EM barely improves. (b) Comparison of the number of unique classes in high-confidence predictions with varying amounts of unlabeled data. EZLearn's gain stems in large part from learning to annotate an increasing number of classes, by using organic supervision to generate noisy examples, whereas co-EM is confined to the classes in its labeled data.
4.5 Comparison of test accuracy of the main and auxiliary classifiers at various iterations during learning.
4.6 EZLearn's test accuracy with a varying portion of the distant-supervision labels replaced by random ones in the first iteration. EZLearn is remarkably robust to noise, with its accuracy only starting to deteriorate significantly after 80% of labels are perturbed.
4.7 The Viziometrics project only considers three coarse classes (Plot, Diagram, and Image) for figures due to high labeling cost. We expanded them into 24 classes, which EZLearn learned to accurately predict with zero manually labeled examples.
4.8 Example annotations by EZLearn, all chosen among figures with no class information in their captions.
5.1 Graphical representation of pathways (top) and the corresponding precision matrix (bottom).
5.2 Comparison of learned networks between the pathway graphical lasso (middle) and the standard graphical lasso (right). The true network has the lattice structure (left).
5.3 Example with 4 pathways forming a cycle ("m." means marginalization).
5.4 Run time (y-axis) for (A) Cycle, (B) Lattice, and (C) Random (see text for details).
5.5 Run time for various values of η, with λ = 0.1. η = 1.95 is drawn as a dotted vertical line.
5.6 MILE data (p = 4591, k = 156). (A) Relative error vs. time, (B) test log-likelihood on the Gentles dataset for random pathways, (C) significant pathway interactions.
6.1 (A) A simple hypothetical example that illustrates the perturbation of a network of 7 genes between disease and normal tissues. One possible cause of the perturbation is a cancer driver mutation on gene '1' that alters the interactions between gene '1' and genes '3', '4', '5', and '6'. (B) Another possible cause of network perturbation: gene '1' is regulated by different sets of genes between cancer and normal conditions. (C) The overview of our approach. DISCERN takes two expression datasets as input: an expression dataset from patients with a disease of interest and another expression dataset from normal tissues (top). DISCERN computes a network perturbation score for each gene that estimates the difference in connection between that gene and other genes between disease and normal conditions (middle). We perform various post-analyses to evaluate the DISCERN method by comparing with alternative methods, based on the importance of the high-scoring genes in the disease through a survival analysis and on how well the identified perturbed genes explain the observed epigenomic activity data (bottom).
6.2 (A) Average receiver operating characteristic (ROC) curves from the experiments on synthetic data. We compare DISCERN with 7 alternative methods: 3 existing methods – LNS [99], D-score [252], and PLSNet [87] – and 4 methods we developed for comparison – pLNS, pD-score, D0, and pD0. (B) Comparison of the runtime (hours) between PLSNet and DISCERN for varying numbers of variables (p). The triangles mark the measured run times at specific values of p, and lines connect these measured run times. PLSNet uses the empirical p-values from permutation tests as scores, and DISCERN does not. For a large value of p, DISCERN is two to three orders of magnitude faster than PLSNet.
6.3 The significance of the enrichment for survival-associated genes among the identified perturbed genes. We compared DISCERN with LNS and D-score based on the Fisher's exact test p-value that measures the significance of the overlap between the N top-scoring genes and survival-associated genes in each of three cancers. (A)-(C) We plotted −log10(p-value) from the Fisher's exact test when the N top-scoring genes were considered by each method in 3 datasets: (A) AML (N = 1,351), (B) BRC (N = 2,137), and (C) LUAD (N = 3,836). For ANOVA, we considered 8,993 genes (AML), 7,922 genes (BRC), and 13,344 genes (LUAD) that show significant differential expression at FDR-corrected p-value < 0.05. (D)-(F) We consider up to 1,500 (AML), 2,500 (BRC), and 4,000 (LUAD) top-scoring genes in each method, to show that DISCERN is better than LNS and D-score over a range of N values. The red dotted line indicates 1,351 genes (AML), 2,137 genes (BRC), and 3,836 genes (LUAD) that are identified as significantly perturbed by DISCERN (FDR < 0.05). We compare among 4 methods consisting of 3 methods to identify network-perturbed genes (solid lines) and ANOVA for identifying differentially expressed genes (dotted line) in 3 cancer types.
6.4 Kaplan-Meier plots showing differences in the survival rate measured in AML3 (A and B) and BRC3 (C and D) between two equally sized patient groups, created based on the predicted survival time from each prediction model. We consider the model trained on the top N (N = 1,351 for AML; N = 2,137 for BRC) DISCERN-scoring genes and clinical covariates (blue), and the model trained on only clinical covariates (red) (panels A and C for AML3 and BRC3, respectively). (B) The panel shows the comparison with the model trained using 22 genes from a previously known prognostic marker, called LSC [86], along with the clinical covariates (red). (D) The panel shows the comparison with the model trained using 67 genes from the MammaPrint prognostic marker (70 genes) [88] along with the clinical covariates (red). We used the 67 of the 70 genes that are present in our BRC expression datasets. P-values shown in each plot are based on the log-rank test.
6.5 Kolmogorov-Smirnov test p-value measuring the significance of the difference in score between genes differentially bound by the corresponding transcription factor (TF) (x-axis) and those not differentially bound by the corresponding TF. We performed the one-sided test with the alternative hypothesis that differentially bound genes have higher scores; thus a high −log10(p-value) means that high-scoring genes tend to show differential binding. The TFs are divided into 3 sets: (A) TFs that are known to be associated with leukemia, (B) TFs that are known to be associated with cancer, and (C) TFs that are currently not known to be associated with cancer or leukemia, based on the gene-disease annotation database Malacards [204]. (D) Comparison of the p-values for the Pearson's correlation between the score of each gene and the proportion of differential TFs out of all TFs bound to the gene. (E) Kolmogorov-Smirnov test (one-sided, as above) p-value measuring the significance of the difference in score between genes with differential binding purely based on the DNase-seq data and those without. Here, a differentially bound gene is defined as a gene that has a DNase signal within a 150bp window around its TSS in one condition but not in the other.
7.1 Cumulative number of datasets available in GEO over the years, compared with the number of datasets ever cited by reuse papers in the Open Access portion of PubMed Central. Obtained by automatic text mining following the method described in Piwowar et al.
7.2 Overview of the Cedalion system.
7.3 Bayesian graphical model for estimating a mean with uncertain group labels. α and β are estimated from bootstrapped classifiers for predicting the group.
7.4 F1 score of ClaimCheck with varying values of the tradeoff parameter α on a log scale. α = 1 is used for experiments throughout the paper.
7.5 Report cards for a pair of correlation claims. Each card compares the original study effect (denoted by ∗) with computational replication studies based on tissue-matched datasets found in GEO. The pooled effect based on combining correlations from all datasets with appropriate weighting is denoted as "POOLED".
7.6 Estimates of the effect size from a Bayesian imputation model depending on the true effect size. The median of the distribution is shown as a solid line, with bounds of the 95% HDI represented by dotted lines.
7.7 Overall scatter plot of the original vs. pooled correlations for correlation claim experiments where at least one other related dataset was found. Points in the first and third quadrants are considered successful verifications: original and pooled results have the same sign there.
7.8 Estimates of the effect size from a Bayesian imputation model depending on the imputation classifier confidence. The median of the distribution is shown as a solid line, with bounds of the 95% HDI represented by dotted lines.
7.9 Comparison of the original and pooled mean shift effect sizes. Gray points represent claims with insufficient certainty to be considered by ClaimJump. Each magenta X represents a claim with 95% HDI < 0, suggesting ClaimJump does not support the claim. Each green circle represents a claim with 95% HDI > 0, suggesting ClaimJump corroborates the claim. Two outliers are annotated for future analysis.
7.10 Example report cards for two mean shift claims, showing an effect from the original dataset, as well as estimated effects from additional related datasets.
ACKNOWLEDGMENTS
I am incredibly grateful to my adviser Bill Howe. Bill was an inspiration throughout my years at UW, helping set the research direction and always questioning me about the problems, methods, and approaches. I have enjoyed our brainstorming sessions and am grateful to Bill for broadening my horizons. Bill's interdisciplinary position gave me opportunities to meet people across campus and join the DataLab at the Information School. I wish to thank the people in DataLab who have worked with me and provided helpful comments on my work, including Bill's other students: Emily Furst, Dylan Hutchinson, Shrainik Jain, Dominik Moritz, and Luke Rodriguez. I am grateful to Po-shen Lee, Jevin West, and Sean Yang, who introduced me to Viziometrics. I have very warm memories of my internship at Microsoft Research, working with Hoifung Poon, who became a good friend and adviser. I am deeply grateful to Hoifung for the support and guidance he provided over the following four years, working with me on almost every project I did since that time. I am thankful to Sumit Gulwani, who gave me a chance to work with the incredible PROSE team at MSR on program synthesis. I have learned a lot from Sumit. Larry Ruzzo and David Ribes have my gratitude for serving on my committee and providing invaluable feedback. My graduate journey began in computational biology, and I am indebted to Su-In Lee for guiding me during those years. Su-In, along with other Lee Lab members including Sonya Alexandrova, Safiye Celik, Peter Ney, and PJ Velez, was inspiring, and I am especially grateful to Ben Logsdon and Scott Lundberg for the insightful discussions. I am thankful to my undergraduate advisers Igor Menshov and Alexander Ilushin, who first introduced me to research and launched me on this wonderful path. I would like to thank all of my collaborators, co-authors, friends, and colleagues who have helped me throughout my studies: Maryam Fazel, Andrew Gentles, Bernease Herman, Vu Le, Mark Marron, Mikal Mayer, Tyler McCormick, Alex Polozov, Jian Peng, Rishabh Singh, Gustavo Soares, Ben Taskar, Sheng Wang, Daniela Witten, and Benjamin Zorn. The Allen School is an amazing place for research and I have truly enjoyed my time here. I am deeply grateful to Lindsay Michimoto and Elise deGoede Dorough for their support. Finally, I'm deeply grateful to my friends and family who have supported me over the years. I would not have been here if it were not for your support.
Chapter 1
INTRODUCTION
Data-intensive paradigm in science
With the beginning of the 21st century, the data-intensive paradigm [113] has been transforming science. Oceanography studies are equipping ships with high-throughput sensors, collecting terabytes of data to be processed and analyzed back on shore (e.g. the UW eScience Oceanography collaboration [230]). In Astronomy, projects like the Sloan Digital Sky Survey (SDSS) [69] are systematically scanning the night sky, generating hundreds of terabytes of high-resolution images of stars and galaxies. In Physics, experiments like the Laser Interferometer Gravitational-Wave Observatory (LIGO) [1] and the Large Hadron Collider (LHC) are processing streams of hundreds of gigabytes of data per second, making automated decisions about what to keep for future analysis. Biology has been transformed by an explosion of sequencing and other bio-technologies. Less than 20 years after the completion of the Human Genome Project, the NCBI Sequence Read Archive (SRA) [154] stores over 10 PB of sequencing data. All these data lead to specialization, which is a key attribute of the data-intensive paradigm: the scientists and engineers producing the data are no longer the same as those performing the analysis [113]. This separation means that data needs to be shared or transferred from producers to consumers. According to Jim Gray, the separation means that the data-intensive scientific process can be decoupled into three components: capture, curation, and analysis [113]. Here, curation is not strictly necessary, but it is a common step that occurs in many situations, as we will show later. Data and knowledge curation, which includes questions like data storage, quality control, and discoverability (i.e., problems that arise after data capture but before the actual analysis takes place), is
going to be the main topic of this thesis. According to Renée Miller, data curation includes all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data [172]. A narrower definition of curation from the BioCreAtIvE glossary states that the aim of curation is to "transform information contained in free text (scientific literature) to information stored in form of a structured database record." We will adopt a broader view and consider curation to be all activities that deal with organizing the data, performing quality control procedures, making the data discoverable, and pre-processing it for easier digestion by downstream analysts. The separation between the curator and the analyst is the key abstraction here — multiple researchers can use the same dataset, so it is advantageous to avoid repeating the curation step. Furthermore, curation often deals with making data useful for future uses, which makes the problem harder, as it can be difficult to predict ahead of time what aspects of the data will be of value. For the purposes of this thesis, we adopt the following working definition of data curation introduced by Cragin et al. [49]: Data curation is the active and on-going management of data through its lifecycle of interest and usefulness to scholarship, science, and education; curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time. What are the circumstances that make the curation problem interesting for us? Projects like SDSS, LIGO, and LHC are predicated on exclusive access to special equipment: data capture is centralized (telescopes and particle accelerators are extremely expensive) and the curation problem is solved at the source. This centralization makes it possible, for example, for SDSS to provide convenient access to its publicly released data through a specially developed API against consistent internal data.
In contrast, Molecular Biology is extremely decentralized — hundreds or even thousands of labs around the world can afford to have their own data capture equipment, develop their own measurement protocols, and collect data about incredibly diverse biological questions. This diversity and decentralization exacerbates the curation problem — even though all the data can be stored in a public centralized repository (e.g., the NCBI Gene Expression Omnibus (GEO)), metadata standards are lacking and each submitted dataset is unique in its own way. The semantic heterogeneity arising in these scenarios is the main focus of several projects in this thesis (see Chapter 4). The main goal of improving curation is making productive data sharing and reuse easier, but why is that important? Specifically, in this context we understand data sharing to mean the transfer of data (or access to data) between research groups, potentially located in different universities and even countries. Data sharing within a single research group is excluded, as it is presumably facilitated by in-person discussions and explanations that reduce the need for curation activities. Unlike intra-lab data sharing, public data sharing allows people unaffiliated with the data producers to perform secondary analyses and reproduce primary results. Both of these aspects are crucially important: reproducing the original findings increases the overall quality of science being done, helps catch errors earlier, and disseminates knowledge about the methodology. Secondary analysis opportunities are even broader: publicly shared datasets make studies possible that were not possible before (e.g., by combining two datasets from different competing labs), democratize access to scientific research opportunities (e.g., people with access to modest commodity computational resources can now advance scientific knowledge without access to expensive data-producing labs), and spur advances in processing algorithms by people who would not otherwise have considered studying them. If we accept that curation is important, who should be responsible for the curation process in the decentralized model employed by molecular biology? From the data capture side, producers are numerous, and getting them all to agree on the same standards and workflow practices might be challenging (see Section 2.5 for a discussion of workflow management systems).
On the analysis side, analysts are expected to use multiple disparate datasets, and it is inevitable that the same datasets will be reused time and time again, making duplication of effort on the analyst side an undesirable consequence of any system that pushes curation onto the analysts. In this thesis, we argue that a third party, running centralized curation services at the level of the entire repository, is the right way to approach this problem.
Thesis statement: Public centralized scientific data repositories enable data reuse and secondary analyses. Machine learning techniques, deployed as repository-scale curation services, can improve data reuse, facilitate data discovery, enforce policy compliance, and enable reproducibility.
Public repositories of transcriptomics data and the curation problem
An increasing amount of data is being produced by researchers in the life sciences every year. A significant portion of this data ultimately ends up in large-scale public repositories like the NCBI Gene Expression Omnibus (GEO). Established in 2000, GEO contains over 90 thousand datasets collected in Series (GEO Series, with accession numbers1 starting with "GSE"), with over 2 million processed samples as of February 2018. While the datasets submitted to GEO have to be "MIAME-compliant", this does not mean that they are computationally discoverable or curated. The MIAME standard (Minimum Information About a Microarray Experiment, [28]) does not require that the annotations be machine-readable or use a controlled vocabulary, making large-scale computational studies challenging. GEO curators also manually assemble submitted Series into Datasets (accession numbers starting with "GDS") that provide comparison and analytics tools called GEO Profiles. Unfortunately, this curation process is struggling to catch up with the explosive growth of datasets being submitted to GEO (see Figure 1.1a). As part of a project called Wide-Open (see Chapter 3), we have processed 1.5 million papers from the PubMed Central Open Access dataset, attempting to assess usage of data from the GEO repository. In addition to general information (e.g. title, journal, publication date), we have extracted the list of GEO datasets used by each paper using a regular-expression search for the pattern GSE[0-9]+. This approach allows us to estimate usage of GEO datasets by open-access publications. We observe (Figure 1.1b) that there has been an increase in the number of integrative
1 An accession number is a unique identifier associated with a record in a public data repository, such as GEO.
Figure 1.1: a) Total number of human samples submitted and curated in GEO. Curation is defined as being assembled into GEO DataSets (GDSs). b) Number of integrative studies found in the PubMed Central corpus of open access papers, plotted over years. Only studies using datasets from the Affymetrix U133 Plus 2.0 platform were considered. c) The median number of samples used by an integrative study using data from the Affymetrix U133 Plus 2.0 platform vs. the number of available samples with matching tissue (based on EZLearn tissue labels).
papers (we define them as those that reference at least 3 different GEO Series). This increase is an encouraging finding — one of the goals of GEO was to facilitate research collaborations and data reuse, and integrative papers are one indicator of such reuse taking place. In contrast, when we look at the median number of samples used by integrative papers, we notice a rather slow uptake of data by researchers. Comparing this slow growth with the general trend of an increasing number of samples available every year, our finding suggests that some factor might be inhibiting researchers from using more of these data. There are several potential reasons for this poor reuse:
• Statistical challenges related to integrating multiple datasets. Aggregating data from multiple sources presents a multitude of statistical challenges: different microarray platforms can have different numbers of probes, different platforms might provide results on different scales (e.g. RNA-Seq vs. microarrays), different experimental conditions might perturb gene expression in different ways, batch effects can arise from different labs processing samples, or data required for an analysis might simply be missing (e.g. another microarray platform is not sensitive enough to detect splice variants). All of these are real challenges that, in our opinion, highlight the need for better statistical tools in our toolbox. The computational biology and bioinformatics communities have already developed methods to deal with batch effects [126], approaches to deal with different platform scales [192], and even different numbers of probes among them [39].
• Shift of research interests over time. Perhaps older datasets are just no longer interesting to researchers? However, a simple statistic suggests that this might not be the case: over 35% of the third-party2 datasets referenced by open access papers since 2013 are five years old or older. Additionally, we would argue that, at the very least, control samples from older studies remain valuable as controls, provided researchers can easily find those extra samples and incorporate them into a study using appropriate
2 Defined here as those that were released at least two months before the paper was published, a conservative definition.
statistical tools.
• Lack of curation. It is challenging to estimate the impact of curation on data reuse, but rough estimates (comparing the number of re-uses of Series that are part of a curated dataset vs. stand-alone ones3) and observational studies [194] suggest that curation does indeed help increase the probability of reuse.
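The reuse counts behind these estimates come from the regular-expression mining described earlier: extract every GSE accession from each open access paper, then count distinct references. A minimal sketch of that step, with invented paper texts and accession numbers purely for illustration (Wide-Open applies the same idea to roughly 1.5 million full-text documents):

```python
import re
from collections import Counter

# Invented toy stand-ins for full-text papers, keyed by a paper identifier.
papers = {
    "PMC_A": "We reanalyzed GSE2034 and GSE7390 together with GSE11121.",
    "PMC_B": "Expression profiles were downloaded from GEO (GSE2034).",
}

GSE_RE = re.compile(r"GSE[0-9]+")

# Map each paper to the set of distinct GEO Series it references.
refs = {pmc: set(GSE_RE.findall(text)) for pmc, text in papers.items()}

# An "integrative" paper is one referencing at least 3 distinct GEO Series.
integrative = [pmc for pmc, gses in refs.items() if len(gses) >= 3]

# Per-dataset reuse counts across the corpus.
usage = Counter(gse for gses in refs.values() for gse in gses)

print(integrative)        # ['PMC_A']
print(usage["GSE2034"])   # 2
```

Comparing such per-Series reuse counts between curated (GDS-assembled) and stand-alone Series is what underlies the rough estimate in the last bullet.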
We have attempted to estimate the potential effect of curation using the following experiment: using tissue annotations from the EZLearn project (Chapter 4), we estimated an upper bound on the number of samples that each integrative paper could have used, then compared it to what the paper actually used (Figure 1.1c). The estimate is based on the number of samples available in the GEO repository that are annotated with the same tissue type as was used in the paper and were published earlier than the most recent dataset referenced by the paper. That is, later datasets could not have been used by the authors, and datasets with a different tissue type were not relevant. While there are multiple reasons why this is an upper bound (e.g. some of the tissue-matched samples might be unusable due to their experimental condition or data quality issues), it is instructive to observe that the number of datasets actually used is not growing nearly as fast as our estimated upper bound. This suggests that curation might indeed be a bottleneck: more data is readily available in the repository than is used by integrative papers, motivating our thrust of work in the direction of data curation.
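The upper-bound computation just described can be sketched as follows. The sample records, tissue labels, and dates below are invented for illustration; in the actual experiment the tissue labels come from EZLearn and the release dates from GEO metadata:

```python
from datetime import date

# Invented toy GEO samples: (Series accession, tissue label, release date).
samples = [
    {"gse": "GSE100", "tissue": "liver", "released": date(2010, 5, 1)},
    {"gse": "GSE200", "tissue": "liver", "released": date(2012, 3, 1)},
    {"gse": "GSE300", "tissue": "brain", "released": date(2011, 7, 1)},
    {"gse": "GSE400", "tissue": "liver", "released": date(2014, 1, 1)},
]

def sample_upper_bound(paper_tissue, referenced_gses):
    """Upper bound on samples a paper could have used: same tissue type,
    released no later than the newest dataset the paper actually referenced."""
    cutoff = max(s["released"] for s in samples if s["gse"] in referenced_gses)
    return sum(1 for s in samples
               if s["tissue"] == paper_tissue and s["released"] <= cutoff)

# A hypothetical liver paper that referenced GSE100 and GSE200: its cutoff is
# March 2012, so the 2014 liver samples in GSE400 do not count toward it.
print(sample_upper_bound("liver", {"GSE100", "GSE200"}))   # 2
```

Plotting this bound against the number of samples each paper actually used, per year, gives the comparison shown in Figure 1.1c.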
Reproducibility efforts
Reproducibility and replication are issues of great importance in science, and have motivated significant work on the topic. Larson and Sandberg [146] found in 2006 that less than half of the samples submitted to GEO had raw data attached to them that satisfied basic quality control metrics. A significant increase in the number of samples submitted to GEO since that
3 A caveat here is that datasets chosen for curation might be more "interesting" to begin with.
time suggests that absence of data might no longer be a limiting factor. Out of 125 papers submitted to Biostatistics from July 2009 to July 2011, only 21 contained corresponding source code or data, and only 5 passed an optional reproducibility review offered by the journal editors [190]. An attempt [124] to computationally reproduce 18 studies from Nature Genetics succeeded in reproducing only 8 results (two of them were "reproduced in principle" and six more were "reproduced partially or with some discrepancies"). While half of the failed reproductions were due to outright missing data, the other half were due to methods not being described in enough detail or to missing software. A case study attempting to reproduce a computational biology paper [83] estimated that it took a novice user 280 hours to reproduce the results of the paper, even with access to the first author of the paper to ask questions. While one could envision a completely automated computational solution for such in-depth reproducibility (likely an AI-complete problem), it is prudent to have higher-level screening systems in place that check the plausibility of claims based on the provided data and cross-check them against other datasets available in the repository.
Several recent projects [191, 107] attempt to summarize off-the-shelf methods for producing reproducible research (e.g., containers, source code version control, data provenance tools), but those efforts face issues similar to those of workflow tracking systems. The database community is also actively involved (e.g., the Computational Reproducibility tutorial at SIGMOD 2012 [77]).
In this dissertation, we propose a system (Cedalion4) that provides a way to express and execute data-driven claims from research papers (e.g., gene X’s expression is higher in group A vs. group B) against their source datasets. Furthermore, in some circumstances Cedalion can even generalize a claim, finding other related datasets in the GEO repository (leveraging our own tissue annotations from the EZLearn project) and automatically validating the claim against them. We believe that a system like Cedalion that can both help with assessing
4 In Greek mythology, Cedalion was a servant of Hephaestus [65].
the reproducibility of past studies and make new studies more reproducible from the start by encoding their claims in a formal language will prove useful.
Outline
The rest of this dissertation is organized as follows:
• Chapter 2 will provide an overview of data curation and exploration efforts related to the field of genomics.
• Chapter 3 will introduce Wide-Open, a text mining system we use to track data citations and reuse in open access literature. In addition to elucidating the patterns of data reuse, Wide-Open also allows us to track datasets that are “overdue” for publication — they are referenced from published papers, but are not made available themselves.
• Chapter 4 will introduce EZLearn, a general data curation framework leveraging a concept of organic supervision to produce high-quality data annotations in scientific domains. We will demonstrate that EZLearn outperforms prior state-of-the-art supervised approaches without using any manually labeled data.
• In Chapters 5 and 6 we will explore two computational approaches for analyzing genomics datasets obtained by Wide-Open and curated by EZLearn. Specifically, we will introduce DISCERN — a method for detecting differentially regulated genes based on data samples gathered from two conditions (e.g., cancer and normal) — and Pathway Graphical Lasso — a method for reconstructing gene regulatory networks using an additional “pathway assumption.”
• Chapter 7 will introduce Cedalion, a system that allows interested parties to express scientific claims from papers, validate them against the data associated with the paper, then generalize and adapt the claims to other relevant datasets in the repository to gather additional statistical evidence.
• We conclude in Chapter 8 by discussing limitations of the current work and looking forward at potential next steps.
Chapter 2
RELATED WORK
Data curation spans a wide range of different methods and approaches, both general-purpose and domain-specific. In this section, we will review prominent related work in the general data curation field, taking special interest in approaches that deal with curating gene expression data. We will then discuss curation approaches specifically aimed at scientific data, including tools for knowledge discovery and workflow management.
2.1 General data curation
Knowledge base construction (KBC) is the process of populating a knowledge base with information extracted from documents and structured sources [206]. Some data curation tasks can be viewed as a special case of KBC, where, for example, a tissue curation system for GEO collects a knowledge base of facts of the form (sample, tissue). Furthermore, KBC can be viewed as part of the curation process aimed at increasing the value of the raw data by constructing semantically meaningful knowledge bases on top of it (e.g., the Literome project [196] curated abstracts of scientific papers by extracting implied facts from them). DeepDive [182] is a hybrid system for knowledge base construction and data curation that leverages advances from the fields of database research and machine learning. DeepDive democratizes the use of distant supervision [173] by making it easy for users to write feature extractors (functions that extract information from raw data, e.g., extracting entities from a document, or extracting the phrase between two entities in a sentence) and supervision rules (e.g., if Freebase mentions some fact, then there is a decent probability it is true, but lack of a mention doesn’t prove anything). In Shin et al. [219], the authors describe methods for speeding up incremental KBC in the
DeepDive framework. They employ both standard database techniques (incremental view maintenance using the DRed algorithm [101]) and novel approaches for the inference stage of the DeepDive system: (1) they precompute a number of preliminary samples ahead of time, which are then used in a Metropolis-Hastings scheme to obtain a correct distribution; (2) once the precomputed samples are exhausted, they employ a variational approach. Specifically, they draw a number of samples from the original factor graph, construct the covariance matrix, and then solve the inverse covariance estimation problem [9] to get a sparse approximation, producing a reduced factor graph that is then used for direct inference.

Ratner et al. [205] describe the feature construction portion of a DeepDive-like system. Data programming is a paradigm that allows analysts to define a set of noisy labeling functions (heuristics) that can each annotate a portion of the dataset of interest. An inference algorithm is then run on the data to simultaneously learn a confidence in each labeling function (and the structure between them) and produce a final “consensus” labeling of the dataset. One of the most interesting aspects of this work is its ability to handle explicit hints about dependencies among labeling functions. The system supports similar, fixing, reinforcing, and exclusive dependency predicates that are then converted into factors in the labeling-function factor graph. While promising, data programming also has limitations: some of the datasets used as examples in the paper required over a hundred labeling functions, and the example that used just 7 functions had only 7% coverage of the dataset. Clearly, an approach requiring less supervision would reduce the effort needed for curation, but this is very challenging to achieve in the general case. In this work, we will present a method that takes advantage of multiple modalities that are sometimes available in the data.
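To make the labeling-function idea concrete, here is a toy sketch in Python. The functions, class names, and descriptions are hypothetical, and the consensus step is a simple majority vote rather than the learned factor-graph inference used in data programming.

```python
def lf_mentions_liver(desc):
    # Noisy heuristic: vote "liver" if the word appears, else abstain (None).
    return "liver" if "liver" in desc.lower() else None

def lf_mentions_hepatic(desc):
    return "liver" if "hepatic" in desc.lower() else None

def lf_mentions_kidney(desc):
    return "kidney" if "kidney" in desc.lower() else None

LABELING_FUNCTIONS = [lf_mentions_liver, lf_mentions_hepatic, lf_mentions_kidney]

def consensus_label(desc, lfs=LABELING_FUNCTIONS):
    """Combine labeling-function votes by simple majority.  Data
    programming instead learns a weight for each function (and
    dependencies between functions) via factor-graph inference."""
    votes = [lf(desc) for lf in lfs if lf(desc) is not None]
    return max(set(votes), key=votes.count) if votes else None

label = consensus_label("Hepatic biopsy, liver carcinoma sample")  # "liver"
```

The low-coverage limitation mentioned above is visible even here: any description matched by none of the heuristics is simply left unlabeled.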
Specifically, in the case of gene expression, we will use both raw gene expression data and human-provided text descriptions to derive the final labels. A different approach to data curation is presented in [227]. Data Tamer is a system for scalable, incremental data curation in an enterprise setting, with input from a large group of non-programmer domain experts (DEs). In Data Tamer, an administrator (DTA) designates a set of input sites (collections of records with key-value pairs) and wants to have
a combined relational database at the end with a pre-defined or learned schema. The paper considers several use cases, ranging from a fixed output schema in a health services application to completely free-form web-aggregated data without any predefined schema. The major tasks performed by Data Tamer are deduplication and attribute mapping. Deduplication is achieved by training a Naïve Bayes classifier to detect duplicate pairs, followed by clustering of duplicate entities. Data Tamer also maps attributes from different sites into a common schema by leveraging a set of pluggable experts (the four built-in experts are fuzzy string matching, TF-IDF cosine similarity between data values, Jaccard similarity for categorical-valued columns, and Welch’s t-test for real-valued ones). A separate category of methods is being developed for data cleaning [45]. Systems like ActiveClean [142] provide tools for data scientists to actively inspect their data while building machine learning models, in order to remove or repair dirty data (e.g., outliers, missing data). Data cleaning tools are a necessary and important step in the pipeline that sits above data curation — cleaning tools expect to receive a labeled dataset, while curation systems like EZLearn and DeepDive are aimed at producing those datasets. At the same time, cleaning tasks can be very domain dependent. In biology, for example, there is a wide array of tools for quality control of data produced by microarray [253] and RNA-Seq experiments [128].
2.2 Curating Gene Expression data
Large-scale independent microarray studies [163, 221, 179, 51, 111, 144] suggest that there is a growing interest in understanding tissue-specific or subtype-specific (e.g., in the case of cancer) differences among biological samples. On the other hand, there has been a limited number of integrative studies on the topic [114, 71, 97, 212], highlighting the need for curation. Prominent prior work on data curation applied to gene expression datasets is that of Lee et al. [153], which introduced Unveiling RNA Sample Annotation (URSA). URSA is a supervised classifier, meaning that it requires labeled training data to create a model. In their work, Lee et al. manually constructed an annotated dataset of over 14 thousand samples that they used for training URSA.
The URSA method consists of two parts: a support vector machine (SVM) one-vs-all classifier that predicts the probability of a sample belonging to a particular tissue, and a Bayesian network-based correction (BNC) [12]. An SVM [29] is a discriminative classifier [156] that takes as input positive and negative examples of a particular tissue type (selected in an ontology-aware manner) and finds a linear decision boundary that maximizes the margin between the two classes. Ontology-aware sample selection is done by considering all descendants of a particular term as positive examples, excluding all ancestors, and then considering all the remaining samples as negative examples. An SVM trained with this procedure is a one-vs-all SVM — it will be able to discern whether a particular sample is, say, a leukocyte, but will not be able to provide the correct label if the answer is negative. Thus, a separate SVM needs to be trained for every term we want to distinguish.
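The ontology-aware selection procedure can be illustrated with a toy ontology. The tissue terms, tree, and sample IDs below are hypothetical; the splitting rule (descendants positive, ancestors excluded, the rest negative) is the one described above.

```python
# Toy ontology as a child -> parent map (root "body" has no entry).
PARENTS = {
    "leukocyte": "blood",
    "blood": "body",
    "hepatocyte": "liver",
    "liver": "body",
}

def ancestors(term):
    out = set()
    while term in PARENTS:
        term = PARENTS[term]
        out.add(term)
    return out

def descendants(term):
    return {t for t in PARENTS if term in ancestors(t)}

def split_examples(target, labeled_samples):
    """One-vs-all split for `target`: the term and its descendants are
    positive, its ancestors are excluded, everything else is negative."""
    pos_terms = {target} | descendants(target)
    excluded = ancestors(target)
    pos, neg = [], []
    for sample, term in labeled_samples:
        if term in pos_terms:
            pos.append(sample)
        elif term not in excluded:
            neg.append(sample)
    return pos, neg

samples = [("s1", "leukocyte"), ("s2", "blood"), ("s3", "liver"), ("s4", "hepatocyte")]
pos, neg = split_examples("blood", samples)  # pos: s1, s2; neg: s3, s4
```

Excluding ancestors matters: a sample labeled simply "body" is not evidence against "blood", so it should appear in neither set.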
A Bayesian network correction is applied to promote coherence between the predictions of all the term-specific SVMs. For example, if the SVM classifier responsible for leukocyte gives a 0.7 probability of the leukocyte label being correct, the classifier for blood reports 0.9, and the classifier for liver is also at 0.7, then BNC will prefer to choose leukocyte, as it is better supported by the predictions for related tissues (blood, in this example). BNC works by constructing a Bayesian network that mirrors the ontology tree structure (with the whole-body node at the top and specific tissue types as leaf nodes at the bottom) and adding an observed node for each tissue representing the output of the corresponding SVM classifier. With this construction, the latent variable nodes represent the true labels, which are inferred using standard BN inference methods [148].
One major shortcoming of the URSA method is that it is a supervised method that requires labeled training data. Manually labeling data is a costly process, especially in the case of scientific data curation, where a non-trivial level of domain knowledge is required and the task cannot be easily outsourced to a crowdsourcing service like Amazon Mechanical Turk. Additionally, based on empirical observations, URSA tends to predict general terms (e.g., blood) with very high confidence.
              GEO search           EZLearn
Term          Precision   Recall   Precision   Recall
“liver”       40%         57%      97%         89%
“kidney”      59%         100%     100%        90%
Table 2.1: Comparison of GEO search vs EZLearn as validated against manually curated samples from [153]
2.3 Repository access tools
NCBI provides the authoritative tool for querying the GEO repository1, consisting of a web interface as well as an API. While comprehensive (all data available in GEO is by definition exposed through the official interfaces), it lacks support for basic semantic search functions. For example, if you want to search for microarray expression data from a specific human tissue, you would use a query similar to this one: liver[All Fields] AND "Homo sapiens"[Organism] AND GPL570[All Fields] AND ("gsm"[Filter]). This query executes a full-text search across the description fields of all human samples from the Affymetrix U133 Plus 2.0 platform (GPL570). Using EZLearn, on the other hand, one can query tissue annotations directly, obtaining a precise set of samples curated to a specific tissue. Table 2.1 demonstrates that EZLearn is far superior to regular GEO search in both precision and recall. If one wants to analyze GEO data, there are two interfaces available: GEO Profiles and GEO2R. GEO Profiles is an integrated tool based on the manual curation efforts undertaken by GEO curators. It operates over processed GEO data and takes advantage of manual annotation, so features like “compare expression of gene X between smokers and nonsmokers in dataset Y” are readily available. The downside of GEO Profiles is that it can only operate over the limited subset of GEO data that was manually processed by GEO curators.
1 https://www.ncbi.nlm.nih.gov/geo/
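For illustration, the full-text query above can be issued programmatically through NCBI's E-utilities. The `esearch.fcgi` endpoint and the `gds` database name follow E-utilities conventions, but treat them as assumptions to verify against NCBI's documentation; the sketch only constructs the request URL and does not touch the network.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint; "gds" is the Entrez database
# backing GEO search (both assumed -- check the E-utilities docs).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The tissue query from the text, verbatim.
term = ('liver[All Fields] AND "Homo sapiens"[Organism] '
        'AND GPL570[All Fields] AND ("gsm"[Filter])')

# urlencode percent-encodes the quotes, brackets, and spaces for us.
url = ESEARCH + "?" + urlencode({"db": "gds", "term": term, "retmax": 500})
```

Note that even a correctly issued query inherits the precision/recall problems shown in Table 2.1: it is still a full-text match over free-form descriptions, not a semantic tissue filter.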
GEO2R, on the other hand, is a tool that allows users to explore non-curated datasets: users are invited to manually split a dataset into sections of interest and run analyses available in the R package limma. Under the hood, GEO2R constructs and executes an R script that actually fetches the data and performs the analysis, hence the name. In 2008, Liu et al. [159] introduced TiGER — a database of tissue-specific expression and regulation. TiGER is based on the NCBI EST dataset, containing data from just 30 tissues, and allows easy visualization of observed tissue-specific gene expression and predicted tissue-specific transcription factor (TF) interactions.
2.4 Hypothesis Generation and Discovery
Public repositories of diverse data like GEO make the idea of automating science seem within reach. Just like a human researcher, an AI system can gain a partial understanding of the world based on publicly available data (not to mention reading the published papers), formulate a set of new hypotheses on the boundary of this knowledge, and execute experiments (physical or computational) to prove or disprove them. In fact, over a decade ago, a robot scientist named Adam [137] was already on the way to achieving just this goal. Adam was capable of formulating new hypotheses and carrying out wet-lab experiments to validate them, making its first novel scientific discovery in 2009 [136]. While Adam was specialized to a specific field (yeast functional genomics), a system that relies on computational experiments or an interface with a cloud experiment provider [108] can potentially explore a much wider variety of research directions while being completely confined to a computational agent. Duggan et al. [65] describe Haphaestus, a system that can potentially allow scientists to encode their data-intensive research directions as virtual experiments (VEs). Scientific hypotheses, expressed as VEs, can then be executed against a public data repository, with Haphaestus taking care of performing the statistical analysis and deciding on the optimal plan for executing the expressed query. Multiple challenges remain, two of the major ones being curation quality and p-hacking concerns. Similar concerns will be applicable to many
automated scientific discovery systems [225], so it is prudent to look into each separately. First, the current state of data curation in biomedicine might be insufficient for such systems to be deployed at repository scale. As shown in the previous section, simply querying the repository for samples from a particular tissue type already gives results of poor quality. Tissue type is just a small part of a much broader experimental-covariate curation system (e.g., using the Experimental Factor Ontology [165]) that will be needed to address this issue. EZLearn (Chapter 4) can play a role in solving at least part of this curation challenge. Second, any system that provides a large degree of automation for hypothesis mining will be susceptible to p-hacking issues [123, 263]. P-hacking, or data dredging [223], is a commonly shunned but widely used practice of torturing the data until it confesses [121]. Any algorithm or scientist evaluating a large number of hypotheses using statistical tests will invariably find something “significant” just by random chance. If one were to run millions of tests, the number of these false discoveries could far outstrip the number of real ones. Developing approaches to bound the false discovery rate is a fruitful area of research in statistics, but in practical matters it is prudent to control the number of baseless hypotheses tested. Baseless is an important distinction here, as our prior level of confidence in a set of hypotheses is strongly related to our false discovery rate [123]. Recognizing this potential issue, we will focus only on claims made by published papers in our discussion of automatic claim execution in Chapter 7.
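One standard approach to bounding the false discovery rate mentioned above is the Benjamini-Hochberg step-up procedure; a minimal sketch (the p-values in the example are hypothetical):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: return the indices of the
    hypotheses rejected while controlling the FDR at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by p-value
    k = 0  # largest rank whose p-value clears its step-up threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])

# Six hypothetical p-values: BH at alpha=0.05 rejects only the two
# smallest, whereas naive per-test thresholding at 0.05 would
# "discover" four.
rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27, 0.8])
```

The example illustrates the point in the text: as the number of tests grows, the per-hypothesis bar rises, so flooding the procedure with baseless hypotheses also drowns out the real ones.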
2.4.1 Hypothesis Discovery Tools in Genomics
Several systems have recently been proposed in the functional genomics space, implemented as Shiny R applications, but they all suffer limitations due to the heterogeneous and unstructured nature of the datasets in GEO. ScanGEO [140] is limited to the manually curated portion of GEO, comprising less than 10% of all samples. ShinyGEO [66] allows one to analyze any available dataset but is limited to analyzing one dataset at a time and requires users to provide an exact mapping of their query to the underlying schema of the data through a
GUI. GEOracle [58] uses machine learning and text-mining techniques to extract perturbation experiments from GEO datasets and visualize the results of differential expression. None of these approaches support automatic extension of an analysis to related data or work directly from a scientific claim to formulate the data-analysis hypothesis.
2.5 Workflow systems as an alternative to repository curation services
A possibly attractive solution to the problem of data curation would be to encourage or require everyone to use a standardized set of tools to capture their data processing pipelines, providing curation as a by-product, as every step would be tracked in such a workflow system [33, 183, 10, 207]. For example, the Synapse system by Sage Bionetworks “allows researchers to share and describe data, analyses, and other content” by providing them tools to “describe these data or analyses, where they come from, and how to use them.” Tracking each data sample from its creation through all transformations to the final product would surely give us a better understanding of what is going on in a dataset, but what are the downsides of such systems? First, of course, is an increased burden on the users — workflow systems by definition disrupt the usual analysis workflow. In a world where data sharing is not yet universally practiced [194, 112, 72], imposing additional barriers can be detrimental. In contrast, in this thesis we emphasize server-side tools that operate repository-wide to support curation and reproducibility. This is a conscious trade-off — we believe that getting as much of the “dirty” data into the repository as possible using carrots (with automatic curation after the fact) is better than trying to get “clean” data with a stick. Second, curation is an inherently forward-looking process that tries to optimize the utility of the data for future use cases. It is thus difficult to predict which annotations are going to be needed, pushing a system that tries to capture as much information as possible into the regime of diminishing returns. On the other hand, a system like EZLearn (Chapter 4) that processes raw unstructured data without relying on manually provided labels can tolerate the ontology drift that is naturally expected of bleeding-edge scientific endeavours.
2.6 Summary
The methods presented in this thesis are set apart from prior work along the following dimensions. In EZLearn (Chapter 4) we will demonstrate that curation can be a repository-side service, working without manual supervision and without altering the workflow of data providers in any way. In Wide-Open (Chapter 3) we will reinforce this idea, showing that policy compliance can be enforced at the repository level as well. In Cedalion (Chapter 7) we will demonstrate that a way to avoid p-hacking concerns during hypothesis processing is to remain grounded in the claims already made in published papers, starting by addressing reproducibility issues first.
Chapter 3
ENFORCING OPEN DATA POLICY COMPLIANCE USING TEXT-MINING
3.1 Tracking dataset references in open access literature
Advances in sequencing and other bio-technologies have led to an explosion of biological data. Figure 3.1 shows the remarkable growth in the number of gene expression samples in the NCBI Gene Expression Omnibus (GEO) repository [47]. As of February 2017, GEO contains 80,985 public datasets and 2,097,543 samples, spanning hundreds of tissue types in thousands of organisms. Making such a wealth of data publicly available not only facilitates replication, but also generates new opportunities for discovery by jointly analyzing multiple datasets [209]. Consequently, journals and repositories have increasingly embraced open-data policies. For example, PLoS journals require authors to “make all data underlying the findings described in their manuscript fully available without restriction” [24]. GEO requests that authors inform them “as soon as your manuscript is published so that we can release your records and link them with PubMed”. Enforcing such policies, however, largely relies on manual efforts. Authors often forget to notify repositories when their papers get published. Repositories such as GEO resort to periodically checking private datasets to determine if they should be released, and call upon users to notify them of overdue ones. Still, the lag between the date the paper is published and the date the data is released is significant and appears to grow over time. To help address the opportunity cost of this “hidden data”, and to reduce the burden of manually keeping track of the release process for authors and repository administrators, we developed Wide-Open, a general approach that applies text mining to automatically detect
Figure 3.1: Number of samples in the NCBI Gene Expression Omnibus (GEO).
overdue datasets in a public repository. Wide-Open first scans PubMed articles for dataset unique identifiers (UIDs) using regular expressions. It then determines the validity of each candidate UID and, for valid UIDs, whether the corresponding datasets have been released. To determine if a dataset has been released, Wide-Open calls the repository’s Web API for accessing datasets and searches for signature textual patterns in the query result. When there exists a database that indexes many publicly released datasets, Wide-Open will first check the UIDs against that database to minimize unnecessary Web API calls. To evaluate the effectiveness of this approach, we applied it to two popular NCBI repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). To scan PubMed text for accession numbers, Wide-Open uses the regular expression GSE[0-9]+ for GEO, and SRX[0-9]+ for SRA. For each candidate accession number, Wide-Open first checks GEOmetadb [266] for GEO, and SRAdb [267] for SRA. A hit means that the dataset has been released. If not, Wide-Open calls the Web APIs for GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=
Figure 3.2: Number of GEO datasets overdue for release over time, as detected by Wide-Open. We notified GEO of the standing list in February 2017, which led to the dramatic drop of overdue datasets (magenta portion), with four hundred datasets released within the first week.
number is valid, and if so, whether the dataset is public or private. In the last case, the dataset remains private after being cited in a published article, which means that it is most likely to be overdue.
Specifically, for GEO, Wide-Open looks for strings such as “Could not find a public or private accession,” which signifies an invalid accession number, as well as strings such as “is currently private,” which signifies that the dataset is private. For SRA, the process is similar. The details can be found in our open-sourced code.
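A minimal sketch of this detection logic follows. The regular expression and the two signature strings are taken from the text above; the parsing is greatly simplified relative to the open-sourced Wide-Open code, and the example full text is hypothetical.

```python
import re

GEO_ACCESSION = re.compile(r"GSE[0-9]+")  # the GEO pattern from the text

def find_candidate_accessions(full_text):
    """Scan a paper's full text for candidate GEO accession numbers."""
    return sorted(set(GEO_ACCESSION.findall(full_text)))

def classify_geo_response(page_text):
    """Classify a GEO accession-viewer response using the signature
    strings quoted above (much simplified relative to the real parser)."""
    if "Could not find a public or private accession" in page_text:
        return "invalid"
    if "is currently private" in page_text:
        return "private"  # cited but unreleased: a likely overdue dataset
    return "public"

text = "Expression data are available in GEO under accession GSE12345."
accessions = find_candidate_accessions(text)  # ["GSE12345"]
```

A candidate that classifies as "private" despite appearing in a published paper is exactly what Wide-Open flags as likely overdue; "invalid" hits correspond to the typo cases discussed below.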
Wide-Open identified a large number of overdue datasets in GEO and SRA. Figure 3.2 shows the number of overdue GEO datasets over time. For each time point, we show the number of datasets referenced in prior publications but not yet released at the time of publishing. Notwithstanding some fluctuation, the number has been steadily rising since 23
the advent of next-gen sequencing. The oldest paper that references an overdue dataset was published in 2010. We notified GEO of the overdue datasets Wide-Open had identified in February 2017. We received a prompt acknowledgement and noticed a dramatic drop in the number shortly after our exchange (the magenta portion; about 400 datasets were released within the first week). We applaud the quick action by GEO, and take this response as a promising sign that an automatic monitoring system like Wide-Open could help accelerate the release process. Out of the 473 datasets identified by Wide-Open in February 2017, 455 have since been released by GEO. Of the remaining eighteen candidates, only one is a true precision error (the accession number candidate GSE17200 actually refers to a soil name). Among the other seventeen cases, fourteen were identified due to typos by authors who cited a wrong accession number, while the remaining three were legitimate datasets that could not be released, either due to incomplete submission or privacy issues. In other words, Wide-Open attained a precision of 97%, even with author errors counted against it. Wide-Open identified 84 overdue SRA datasets as of March 2017. Next, we plan to contact SRA and work with them on the verification and release of these datasets as well. The time lag between submission and release has also steadily risen (Figure 3.3). GEO datasets that became public in 2006 took an average of 87 days from submission to release, whereas in 2016 the average delay was over 8 months. GSE2436 was submitted to GEO in March 2005 and was not made public until November 2016, an 11-year wait. While longer reviewing cycles might explain part of this increase [106], it seems clear that the rapid growth in the number of datasets would tax the manual release process and ultimately make it unsustainable. While the initial progress is promising, much remains to be done.
Figure 3.3: Average delay from submission to release in GEO.

We need full-text access to identify published datasets, which limits our current monitoring to the open access subset of PubMed Central (PMC). As of February 2017, this subset contains about 1.5 million papers, a small subset of PMC (4.2 million) and a fraction of PubMed (26 million). There are various ways to substantially increase the number of full-text articles for monitoring, thanks to the open-access movement. Publishers are increasingly open to granting text-mining licenses (see, e.g., http://text.soe.ucsc.edu/progress.html). Through our collaborators, we are beginning to gain access to many more full-text articles, on which we plan to run Wide-Open next. The number of private datasets is rather large: for example, GEO currently has over 10,000 datasets that remain private. We expect that many more overdue datasets could be identified with access to additional full-text articles. Extending Wide-Open to a new repository consists of three simple tasks: creating regular expressions for dataset identifiers, identifying the Web API for dataset access, and adapting the query-result parser to distinguish between invalid IDs, datasets that have been released, and datasets that remain private.
3.2 Summary
In this chapter we have introduced Wide-Open, a text-mining system for detecting dataset references in the full texts of open access papers and tracking their publication status. Wide-Open has been scanning papers from PubMed Central since February 2017, detecting hundreds
Figure 3.4: Wide-Open tracking result from February 2018, showing an initial drop in the number of overdue datasets, together with newly discovered ones.
of overdue datasets in the GEO and SRA repositories. An updated version of the tracking plot is shown in Figure 3.4, demonstrating that Wide-Open is still actively detecting new overdue datasets. The work in this chapter was first reported in [95]. Wide-Open is available under an open source license at https://github.com/wideopen/datawatch/. It is hosted at https://wideopen.github.io/datawatch/ and keeps monitoring publications and identifying overdue datasets.
Chapter 4
AIDING DATA DISCOVERY IN THE REPOSITORY BY BUILDING CURATION CLASSIFIERS
4.1 Introduction
The confluence of technological advances and the open data movement [176] has led to an explosion of publicly available datasets, heralding an era of data-driven hypothesis generation and discovery in high-value applications [194]. A prime example is open science, which promotes open access to scientific discourse and data to facilitate broad data reuse and scientific collaboration [80]. In addition to enabling reproducibility, this trend has the potential to accelerate scientific discovery, reduce the cost of research, and facilitate automation [209, 156]. However, progress is hindered by the lack of consistent and high-quality annotations. For example, the NCBI Gene Expression Omnibus (GEO) [47] contains over two million gene expression profiles, yet only a fraction of them have explicit annotations indicating the tissue from which the sample was drawn, information that is crucial to understanding cell differentiation and cancer [105, 102]. As a result, only 20% of the datasets have ever been reused, and tissue-specific studies are still performed only at small scales [194]. Annotating data samples with standardized classes is the canonical multi-class classification problem, but standard supervised approaches are difficult to apply in these settings. Hiring experts to annotate examples for thousands of classes such as tissue types is unsustainable. Crowd-sourcing is generally not applicable, as annotation requires domain expertise that most crowd workers do not possess. Moreover, the annotation standard is often revised over time, incurring additional cost for labeling new examples. While labeled data is expensive and difficult to create at scale, unlabeled data is usually
in abundant supply. Many methods have been proposed to exploit it, but they typically still require labeled examples to initiate the process [25, 169, 73]. Even zero-shot learning, where the name implies learning with no labeled examples for some classes, still requires labeled examples for related classes [186, 224]. In this chapter, we propose EZLearn, which makes annotation learning easy by exploiting two sources of organic supervision. First, the annotation classes generally come with a lexicon of standardized references (e.g., “liver”, “kidney”, “acute myeloid leukemia cell” for tissue types). While labeling individual data samples is expensive and time-consuming, it takes little effort for a domain expert to provide a few example terms for each class. In fact, in the sciences and other high-value applications, such a lexicon is often available from an existing ontology. For example, the Brenda Tissue Ontology specifies 4931 human tissue types, each with a list of standard names [98]. Second, data samples are often accompanied by a free-text description, some of which directly or indirectly mention the relevant classes (e.g., the caption of a figure, or the description of a gene expression sample). Together with the lexicon, these descriptions present an opportunity for exploiting distant supervision by generating (noisy) labeled examples at scale [173]. We call such indirect supervision “organic” to emphasize that it is readily available as an integral part of a given domain. In practice, however, there are serious challenges to enacting this learning process. Descriptions are created for general human consumption, not as high-quality machine-readable annotations. They are provided voluntarily by data owners and lack consistency; ambiguity, typos, abbreviations, and non-standard references are common [153, 209].
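The lexicon-based distant supervision described above can be sketched as follows. The mini-lexicon and sample descriptions are hypothetical toy values; the sketch labels a description only when exactly one class is mentioned, abstaining otherwise.

```python
# Hypothetical mini-lexicon mapping ontology classes to standard names.
LEXICON = {
    "liver": {"liver"},
    "kidney": {"kidney", "renal"},
    "blood": {"blood", "whole blood"},
}

def distant_label(description):
    """Noisy label from exact lexicon matches; abstain (None) when zero
    or several classes are mentioned, since the class-to-sample
    assignment would then be ambiguous."""
    desc = description.lower()
    hits = {cls for cls, terms in LEXICON.items()
            if any(term in desc for term in terms)}
    return hits.pop() if len(hits) == 1 else None

label = distant_label("Total RNA from renal cortex biopsy")  # "kidney"
```

Exact matching like this yields only a noisy, partial labeling, which is precisely why EZLearn pairs it with a trainable text classifier rather than using the lexicon matches directly as final annotations.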
Multiple samples may share a text description that mentions several classes, introducing uncertainty as to which class label is associated with which sample. Additionally, annotation standards evolve over time, introducing new terms and evicting old ones. As a result, while there are potentially many data samples whose descriptions contain class information, only a fraction of them can be correctly labeled using distant supervision. This problem is particularly acute for domains with numerous classes and frequent updates, such as the life sciences.

To best exploit indirect supervision using all instances, EZLearn introduces an auxiliary
text classifier for handling complex linguistic phenomena. This auxiliary classifier first uses the lexicon to find exact matches to teach the main classifier. In turn, the main classifier helps the auxiliary classifier improve by annotating additional examples with non-standard text mentions and correcting errors stemming from ambiguous mentions. This co-supervision continues until convergence. Effectively, EZLearn represents the first attempt at combining distant supervision and co-training, using text as the auxiliary modality for learning (Figure 4.1).

To investigate the effectiveness and generality of EZLearn, we applied it to two important applications: functional genomics and scientific figure comprehension, which differ substantially in sample input dimension and description length. In functional genomics, there are thousands of relevant classes. In scientific figure comprehension, prior work considers only three coarse classes, which we expand to twenty-four. In both scenarios, EZLearn successfully learned an accurate classifier with zero manually labeled examples.

While standard co-training has labeled examples from the beginning, EZLearn can rely only on distant supervision, which is inherently noisy. We investigate several ways to reconcile distant supervision with the trained classifier's predictions during co-training. We found that it generally helps to “remember” distant supervision while leaving room for correction, especially by accounting for the hierarchical relations among classes. We also conducted experiments to evaluate the impact of noise on EZLearn. The results show that EZLearn can withstand a large amount of simulated noise without suffering a substantial loss in annotation accuracy.
4.2 Related Work
A perennial challenge in machine learning is to transcend the supervised paradigm by making use of unlabeled data. Standard unsupervised learning methods cluster data samples by explicitly or implicitly modeling similarity between them. They cannot be used directly for classification, as there is no direct relation between the learned clusters and the annotation classes.

In semi-supervised learning, direct supervision is augmented by annotating unlabeled examples using either a learned model [181, 25] or similarity between examples [265]. It is an effective paradigm for refining learned models, but it still requires initialization with sufficient labeled examples for all classes. Zero-shot and few-shot learning relax the requirement of labeled examples for some classes, but still need sufficient labeled examples for related classes [186, 224]. In this regard, they bear resemblance to domain adaptation [23, 55] and transfer learning [187, 202]. Zero-shot learning also faces additional challenges such as novelty detection to distinguish between known classes and new ones.

An alternative approach is to ask domain experts to provide example annotation functions, ranging from regular expressions [110] to general programs [205]. Common challenges include combating low recall and semantic drift. Moreover, producing useful annotation functions still requires domain expertise and substantial manual effort, and may be impossible when predictions depend on complex input patterns (e.g., gene expression profiles).

EZLearn leverages domain lexicons to annotate noisy examples from text, similar to distant supervision [173]. However, distant supervision is predominantly used in information extraction, which considers a single view of text [200, 189]. In EZLearn, the text view is introduced to support the main annotation task, resembling co-training [25]. The original co-training algorithm annotates unlabeled examples in batches, whereas EZLearn relabels all examples in each iteration, similar to co-EM [181].
4.3 EZLearn
Let X = {x_i} be the set of data samples and C be the set of classes. Automating data annotation involves learning a multi-class classifier f : X → C. For example, x_i may be a vector of gene expression measurements for an individual, where C is the set of tissue types. Additionally, we denote by t_i the text description that accompanies x_i. If the description is not available, t_i is the empty string. Algorithm 1 shows the EZLearn algorithm.

By default, there are no available labeled examples (x, y*), where y* ∈ C is the true class for annotating x ∈ X. Instead, EZLearn assumes that a lexicon L is available with a set of example terms L_c for each c ∈ C. We do not
Figure 4.1: The EZLearn architecture: an auxiliary text-based classifier is introduced to bootstrap from the lexicon (often available from an ontology) and co-teaches the main clas- sifier until convergence.
assume that L_c contains every possible synonym for c, nor that such terms are unambiguous. Rather, we simply require that L_c is non-empty for any c of interest. We use the L_c's for distant supervision in EZLearn, by creating an initial labeled set D^0, which consists of all (x_i, t_i, c) where the text description t_i explicitly contains at least one term in L_c.

To handle linguistic variations and ambiguities, EZLearn introduces an auxiliary classifier f_T : T → C, where T = {t_i}. At iteration k, we first train a new main classifier f^k using D^{k−1}. We then apply f^k to X and create a new labeled set D_T^k, which contains all (t_i, c) where f^k(x_i) = c. We then train a new text classifier f_T^k using D_T^k, and create the new labeled set D^k with all (x_i, c) where f_T^k(t_i) = c. This process continues until convergence, which is guaranteed given independence of the two views conditioned on the class label [25]. Empirically, it converges quickly.

For samples with distant-supervision labels, a classifier (main or auxiliary) might predict different labels in an iteration. Since distant supervision is noisy, reconciling it with the classifier's prediction could help correct its errors. The Resolve(·) function is introduced for this purpose. The direct analog of standard co-training returns distant-supervision labels if they are available (Standard). Conversely, Resolve could ignore distant supervision and
Algorithm 1 EZLearn
Input: Data samples X, text descriptions T, annotation classes C, and lexicon L containing example terms L_c for each class c ∈ C.
Output: Trained classifiers f : X → C (main) and f_T : T → C (auxiliary).
Initialize: Generate initial training data D^0 as all (x_i, t_i, c) for x_i ∈ X, t_i ∈ T, where t_i mentions some term in L_c.
for k = 1 : N_iter do
    f ← Train_main(D^{k−1}); D_T^k ← Resolve(f(X), D^0)
    f_T ← Train_aux(D_T^k); D^k ← Resolve(f_T(T), D^0)
end for
always return the classifier's prediction (Predict). Alternatively, Resolve may return all labels (Union) or only the common ones (Intersect). However, none of the above approaches considers the hierarchical relations among the label classes. Suppose that the text mentions both neuron and leukemia, whereas the classifier predicts leukocyte with high confidence. Our confidence in leukemia being the correct label should increase, since leukemia is a subtype of leukocyte, and our confidence in neuron should decrease. We thus propose a more sophisticated variant of Resolve that captures such reasoning (Relation). Let c_1, c_2 be the two labels from distant supervision and classifier prediction, respectively. If c_1 and c_2 are the same, Relation returns c = c_1 = c_2. If they have a hierarchical relation, Relation returns the more specific one (i.e., the subtype). Otherwise, Relation returns none. If distant supervision or the classifier prediction assigns multiple labels to a sample, Relation returns the results from all label pairs. (In domains with no hierarchical relations among the classes, Relation is the same as Intersect.)
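The Relation variant can be sketched as follows. This is a minimal illustration, not the dissertation's code: the ontology DAG is simplified to a child → parent map, and all names and the toy hierarchy are ours.

```python
# Hypothetical sketch of the Relation variant of Resolve, assuming the class
# hierarchy is given as a child -> parent map (a simplified stand-in for the
# ontology DAG). Names and the toy data are illustrative.

def ancestors(c, parent):
    """All ancestors of class c under the child -> parent map."""
    seen = set()
    while c in parent:
        c = parent[c]
        seen.add(c)
    return seen

def resolve_relation(distant, predicted, parent):
    """Reconcile distant-supervision labels with classifier predictions.

    For each pair (c1, c2): keep the label if they agree, keep the more
    specific one if one subsumes the other, and drop the pair otherwise.
    """
    resolved = set()
    for c1 in distant:
        for c2 in predicted:
            if c1 == c2:
                resolved.add(c1)
            elif c2 in ancestors(c1, parent):   # c1 is a subtype of c2
                resolved.add(c1)
            elif c1 in ancestors(c2, parent):   # c2 is a subtype of c1
                resolved.add(c2)
            # unrelated pair: contributes nothing
    return resolved

# Toy hierarchy: leukemia -> leukocyte -> cell; neuron -> cell
parent = {"leukemia": "leukocyte", "leukocyte": "cell", "neuron": "cell"}
print(resolve_relation({"neuron", "leukemia"}, {"leukocyte"}, parent))  # {'leukemia'}
```

On the example from the text, the distant labels neuron and leukemia against the prediction leukocyte yield leukemia alone: it is the subtype of the predicted class, while neuron is unrelated and is dropped.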
We can use any classifier for Train_main and Train_aux. Features for the main classifier are domain-specific and can be whatever any supervised approach might use. For the text classifier, we use standard n-gram features, which are effective in both applications we evaluated, though it is possible to tailor them for specific domains. Typically, the classifiers will take a parametric form (e.g., f(x) = f(x, θ)), and training with a labeled set D amounts to minimizing some loss function L (i.e., θ* = argmin_θ Σ_{(x,y*)∈D} L(f(x, θ), y*)).

Generally, a classifier will output a score for each class rather than predicting a single class. The score reflects the confidence in predicting the given class. EZLearn generates the labeled set by adding all (sample, class) pairs for which the score crosses a hyperparameter threshold. We chose 0.3 in preliminary experiments, which allows up to 3 classes to be assigned to a sample. In any iteration, a labeled set might contain more than one class for a sample, which poses no problem for the learning algorithm and is useful when there is uncertainty about the correct class.
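The thresholding step above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions: the sample identifiers and score values below are invented, and the score table format is ours.

```python
# A small sketch of how a labeled set is formed from per-class scores: every
# (sample, class) pair whose score reaches the 0.3 threshold is admitted, so a
# sample can carry up to three labels. Names and data are illustrative.

THRESHOLD = 0.3

def build_labeled_set(score_table):
    """score_table: {sample_id: {class: score}} -> set of (sample_id, class)."""
    return {(i, c)
            for i, scores in score_table.items()
            for c, s in scores.items() if s >= THRESHOLD}

scores = {
    "GSM1": {"liver": 0.85, "kidney": 0.10},
    "GSM2": {"liver": 0.35, "leukocyte": 0.40, "neuron": 0.05},
}
print(sorted(build_labeled_set(scores)))
# [('GSM1', 'liver'), ('GSM2', 'leukocyte'), ('GSM2', 'liver')]
```

Since scores for a sample need not sum to one across admitted classes, at most ⌊1/0.3⌋ = 3 classes can clear the threshold when scores form a distribution, matching the bound stated above.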
Method       # Labeled  # All    AUPRC  [email protected]  Use Expression  Use Text  Use Lexicon  Use EM
URSA         14510      0        0.40   0.52   yes             no        no           no
Co-EM        14510      116895   0.51   0.61   yes             yes       no           yes
Dist. Sup.   0          116895   0.59   0.63   yes             yes       yes          no
EZLearn      0          116895   0.69   0.86   yes             yes       yes          yes
Table 4.1: Comparison of test results between EZLearn and state-of-the-art supervised, semi-supervised, and distantly supervised methods on the CMHGP dataset. We report the area under the precision-recall curve (AUPRC) and the precision at 0.5 recall. EZLearn requires no manually labeled data, and substantially outperforms all other methods. Compared to URSA and co-EM, EZLearn can effectively leverage unlabeled data by exploiting organic supervision from text descriptions and the lexicon. EZLearn amounts to initializing with distant supervision (first iteration) and continuing with an EM-like process as in co-training and co-EM, which leads to further significant gains.
4.4 Application: Functional Genomics
Different tissues, from neurons to blood, share the same genome but differ in gene expression. Annotating gene expression data with tissue type is critical to enable data reuse for cell development and cancer studies [209]. Lee et al. manually annotated a large dataset of 14,510 expression samples to train a state-of-the-art supervised classifier [153]. However, their dataset covers only 176 tissue types, or less than 4% of the classes in the BRENDA Tissue Ontology. In this section, we apply EZLearn to learn a far more accurate classifier that can in principle cover all tissue types in BRENDA. (In practice, the coverage is limited by the available unlabeled gene expression samples; in our experiments EZLearn learned to predict 601 tissue types.)

[Figure: a sample's expression values for genes BRCA1 (0.825), TP53 (1.15), HOXA9 (0.642), ..., ARDS (1.18), paired with its free-text description: “... Enriched tumor-initiating capacity has been linked to poorly differentiated glioblastoma cells sharing features with neural stem cells ...”]

Figure 4.2: Example gene expression profile and its text description in the Gene Expression Omnibus (GEO). The description is provided voluntarily and may contain ambiguous or incomplete class information.
Annotation task The goal is to annotate gene expression samples with their tissue type. The input is a gene expression profile (a 20,000-dimensional vector with a numeric value signifying the expression level of each gene). The output is a tissue type. We used the standard BRENDA Tissue Ontology [98], which contains 4931 human tissue types. For gene expression data, we used the Gene Expression Omnibus (GEO) [68], a popular repository run by the National Center for Biotechnology Information. Figure 4.2 shows an example gene expression profile with its text description in GEO. We focused on the most common data-generation platform (Affymetrix U133 Plus 2.0), and obtained a dataset of 116,895 human samples. Each sample was processed using UPC to minimize batch effects and normalize the expression values to [0, 1] [192]. Text descriptions were obtained from GEOmetadb [266].
Main classifier We implemented Train_main using a deep denoising autoencoder (DAE) with three LeakyReLU layers to convert the gene expression profile to a 128-dimensional vector [248], followed by multinomial logistic regression, trained end-to-end in Keras [44], using L2 regularization with weight 1e−4 and the RMSProp optimizer [237].
Auxiliary classifier We implemented Train_aux using fastText with their recommended parameters (25 epochs and a starting learning rate of 1.0) [130]. In principle, we can continue the alternating training steps until neither classifier's predictions change significantly. In practice, the algorithm converges quickly [181], and we simply ran all experiments with five iterations.
Systems We compared EZLearn with URSA [153], the state-of-the-art supervised method that was trained on a large labeled dataset of 14,510 examples and uses a sophisticated Bayesian method to refine SVM classification based on the tissue ontology. We also compared it with co-training [25] and co-EM [181], two representative methods for leveraging unlabeled data that also use an auxiliary view to support the main classification. Unlike EZLearn, they require labeled data to train their initial classifiers. After the first iteration, high-confidence predictions on the unlabeled data are added to the labeled examples. In co-training, once an unlabeled sample is added to the labeled set, it is not reconsidered again, whereas in co-EM, all samples are re-annotated in each iteration. We found that co-training and co-EM performed similarly, so we only report the co-EM results.
Evaluation The BRENDA Tissue Ontology is a directed acyclic graph (DAG), with nodes being tissue types and directed edges pointing from a parent tissue to a child, such as leukocyte → leukemia cell. We evaluated the classification results using ontology-based precision and recall. We expand each singleton class (predicted or true) into a set that includes all ancestors except the root. We then measure precision and recall as usual: precision is the proportion of correct predicted classes among all predicted classes, and recall
Figure 4.3: Ontology-based precision-recall curves comparing EZLearn, distant supervision, URSA, and the random baseline (gray). Extrapolated points are shown in transparent colors.
is the proportion of correct predicted classes among true classes, with ancestors included in all cases. This metric closely resembles the approach of Verspoor et al. [247], except that we use the “micro” version (i.e., the predictions for all samples are first combined before measuring precision and recall). If the system predicts an irrelevant class in a different branch under the root, the intersection of the predicted and true sets is empty and the penalty is severe. If the predicted class is an ancestor (more general) or a descendant (more specific), the intersection is non-empty and the penalty is less severe, but overly general or overly specific predictions are penalized more than close neighbors. We tested on the Comprehensive Map of Human Gene Expression (CMHGP), the largest expression dataset with manual tissue annotations [238]. CMHGP used tissue types from the Experimental Factor Ontology (EFO) [165], which can be mapped to the BRENDA Tissue Ontology. To make the comparison fair, 7,209 CMHGP samples that were in the supervised training set for URSA were excluded from the test set. The final test set contains 15,129 samples of 628 tissue types.
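The ontology-based "micro" metric can be sketched as follows. This is a hedged reconstruction, not the evaluation code used in the dissertation: the ontology is simplified to a child → parent map reaching a designated root, and the hierarchy and sample labels are invented for illustration.

```python
# Sketch of ontology-based micro precision/recall: each predicted or true
# class is expanded with all its ancestors (excluding the root), and the
# predictions for all samples are pooled before measuring precision/recall.
# Hierarchy format and names are illustrative assumptions.

def expand(classes, parent, root="root"):
    """Expand each class into itself plus all ancestors except the root."""
    out = set()
    for c in classes:
        while c != root:
            out.add(c)
            c = parent[c]  # assumes every chain reaches the root
    return out

def micro_pr(pred_by_sample, true_by_sample, parent):
    tp = n_pred = n_true = 0
    for i in true_by_sample:
        p = expand(pred_by_sample.get(i, set()), parent)
        t = expand(true_by_sample[i], parent)
        tp += len(p & t)
        n_pred += len(p)
        n_true += len(t)
    return tp / n_pred, tp / n_true  # (precision, recall)

# Toy ontology: leukemia -> leukocyte -> cell -> root; neuron -> cell -> root
parent = {"leukemia": "leukocyte", "leukocyte": "cell",
          "neuron": "cell", "cell": "root"}
prec, rec = micro_pr({"s1": {"leukocyte"}}, {"s1": {"leukemia"}}, parent)
print(prec, rec)  # overly general prediction: perfect precision, recall 2/3
```

The toy case shows the behavior described above: predicting the more general leukocyte for a true leukemia keeps precision at 1.0 but loses recall, a milder penalty than an unrelated prediction such as neuron, whose expansion would overlap the truth only at shared ancestors.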
Figure 4.4: (a) Comparison of test accuracy with varying amounts of unlabeled data, averaged over fifteen runs. EZLearn gained substantially with more data, whereas co-EM barely improves. (b) Comparison of the number of unique classes in high-confidence predictions with varying amounts of unlabeled data. EZLearn's gain stems in large part from learning to annotate an increasing number of classes, by using organic supervision to generate noisy examples, whereas co-EM is confined to the classes in its labeled data.
Results We report both the area under the precision-recall curve (AUPRC) and the precision at 0.5 recall. Table 4.1 shows the main classification results (with Resolve = Relation in EZLearn). Remarkably, without using any manually labeled data, EZLearn outperformed the state-of-the-art supervised method by a wide margin, improving AUPRC by an absolute 27 points over URSA, and by over 30 points in precision at 0.5 recall. Compared to co-EM, EZLearn improves AUPRC by 18 points and precision at 0.5 recall by 25 points. Figure 4.3 shows the precision-recall curves.

To investigate why EZLearn attained such a clear advantage even against co-EM, which used both labeled and unlabeled data and jointly trained an auxiliary text classifier, we compared their performance using varying amounts of unlabeled data (averaged over fifteen runs). Figure 4.4(a) shows the results. Note that the x-axis (number of unlabeled examples in use) is in log scale. Co-EM barely improves with more unlabeled data, whereas EZLearn improves substantially from 2% to 100% of the unlabeled data.
To understand why this is the case, we further compared the number of unique classes predicted by the two methods (Figure 4.4(b)). Co-EM is confined to the classes in its labeled data, and its use of unlabeled data is limited to improving predictions for those classes. In contrast, by using organic supervision from the lexicon and text descriptions, EZLearn can expand the classes in its purview with more unlabeled data, in addition to improving predictive accuracy for individual classes. The gain seems to gradually taper off (Figure 4.4(a)), but we suspect that this is an artifact of the current test set. Although CMHGP is large, the number of tissue types in it (628) is still a fraction of that in the BRENDA Tissue Ontology (4931). Indeed, Figure 4.4(b) shows that the number of predicted classes keeps climbing. This suggests that with additional unlabeled data EZLearn can improve even further, and with additional test classes, the advantage of EZLearn might become even larger.

We also evaluated on the subset of CMHGP with tissue types confined to those in the labeled data used by URSA and co-EM, to perfectly match their training conditions. Unsurprisingly, URSA and co-EM performed much better, attaining 0.53 and 0.67 in AUPRC, respectively (though URSA's accuracy is significantly lower than its training accuracy, suggesting overfitting). Remarkably, by exploiting organic supervision, EZLearn still outperformed both URSA and co-EM, attaining 0.71 in AUPRC in this setting.

EZLearn amounts to initializing with distant supervision (first iteration) and continuing with an EM-like process as in co-training and co-EM. This enables the main classifier and the auxiliary text classifier to improve each other during learning (Figure 4.5). Overall, compared to distant supervision alone, adding co-training led to further significant gains of 10 points in AUPRC and 23 points in precision at 0.5 recall (Table 4.1).
If labeled examples are available, EZLearn can simply add them to the labeled sets at each iteration. After incorporating the URSA labeled examples [153], the AUPRC of EZLearn improved by two absolute points, with precision at 0.5 recall increasing to 0.87 (not shown in Table 4.1).

Compared to direct supervision, organic supervision is inherently noisy. Consequently, it
Figure 4.5: Comparison of test accuracy of the main and auxiliary classifiers at various iterations during learning.
Resolve     Standard  Predict  Union  Intersect  Relation
# Classes   623       329      603    351        601
AUPRC       0.59      0.64     0.59   0.66       0.69
Table 4.2: Comparison of test results and numbers of unique classes in high-confidence predictions on the Comprehensive Map of Human Gene Expression by EZLearn with various strategies in resolving conflicts between distant supervision and classifier prediction.
is generally beneficial to reconcile the classifier's prediction with distant supervision when they are in conflict, as Table 4.2 shows. Standard (always choosing distant supervision when available) significantly trailed the alternative approach that always picks the classifier's prediction (Predict). Union predicted more classes than Intersect but suffered a large loss in precision. By taking into account the hierarchical relations in the class ontology, Relation substantially outperformed all other methods in accuracy, while also covering a large number of classes.
To evaluate EZLearn's robustness, we simulated noise by replacing a portion of the initial distant-supervision labels with random ones. Figure 4.6 shows the results. Interestingly,
Figure 4.6: EZLearn’s test accuracy with varying portion of the distant-supervision labels replaced by random ones in the first iteration. EZLearn is remarkably robust to noise, with its accuracy only starting to deteriorate significantly after 80% of labels are perturbed.
EZLearn can withstand a significant amount of label perturbation: test performance only deteriorates drastically when more than 80% of the initial labels are replaced by random ones. This result suggests that EZLearn can still perform well in applications with far more noise in their organic supervision.
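The perturbation procedure itself is simple and can be sketched as below. This is an illustrative stand-in, not the experiment code: the function name, label format, and fixed seed are our assumptions.

```python
# Illustrative sketch of the noise simulation: replace a given fraction of the
# initial distant-supervision labels with classes drawn uniformly at random.
import random

def perturb_labels(labels, classes, fraction, seed=0):
    """Return a copy of `labels` with `fraction` of entries randomized."""
    rng = random.Random(seed)
    ids = list(labels)
    k = int(round(fraction * len(ids)))
    noisy = dict(labels)
    for i in rng.sample(ids, k):
        noisy[i] = rng.choice(classes)
    return noisy

labels = {i: "liver" for i in range(100)}
noisy = perturb_labels(labels, ["liver", "kidney", "neuron"], 0.8)
changed = sum(noisy[i] != labels[i] for i in labels)
print(changed)  # at most 80 entries can change; about 2/3 of them do in expectation
```

Note that a "random" replacement can coincide with the original label, so the effective corruption rate is slightly below the nominal fraction; this does not affect the qualitative finding above.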
4.5 Application: Scientific Figure Comprehension
Figures in scientific papers communicate key results and provide visual explanations of complex concepts. However, while text understanding has been intensely studied, figures have received much less attention. A notable exception is the Viziometrics project [151], which annotated a large number of examples for classifying scientific figures. Due to the considerable cost of labeling examples, they used only five coarse classes: Plot, Diagram, Image, Table, and Equation. We exclude the last two as they do not represent true figures. In practice, figure-comprehension projects would be much more useful if they included a larger set of specialized figure types. To explore this direction, we devised an ontology where Plot, Diagram, and Image are further refined into a total of twenty-four classes, such as Boxplot,
[Figure: the coarse classes Plot, Image, and Diagram refined into 24 fine-grained classes, including Kaplan-Meier plot, scatter plot, bar plot, pie chart, line plot, box plot, heatmap, CT image, MRI, radiography, ultrasound, microscopy, screenshot, map, Venn diagram, chemical structure, tree diagram, flow diagram, phylogenetic tree, Western blot, and gel electrophoresis.]
Figure 4.7: The Viziometrics project only considers three coarse classes Plot, Diagram, and Image for figures due to high labeling cost. We expanded them into 24 classes, which EZLearn learned to accurately predict with zero manually labeled examples.
MRI and PieChart (Figure 4.7). EZLearn naturally accommodates a large and dynamic ontology since no manually labeled data is required.
Annotation task The goal is to annotate figures with semantic types shown in Figure 4.7. The input is the image of a figure with varying size. The output is the semantic type. We obtained the data from the Viziometrics project [151] through its open API. For simplicity, we focused on the non-composite subset comprising single-pane figures, yielding 1,174,456 figures along with free-text captions for use as distant supervision. As in the gene expression case, captions might be empty or missing.
System Each figure image was first resized and converted to a 2048-dimensional real-valued vector using a convolutional neural network [109] trained on ImageNet [57]. We follow [117] and use the ResNet-50 model with pre-trained weights provided by Keras [44]. We used the same classifiers and hyperparameters as in the functional genomics application. We used a lexicon that simply comprises the names of the new classes, and compared EZLearn with the Viziometrics classifier. We also compared with a lexicon-informed baseline that annotates a figure with the most specific class whose name is mentioned in the caption (or the root otherwise).
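The lexicon-informed baseline can be sketched as follows, under the assumption that "most specific" means deepest in the ontology. The toy hierarchy, class names, and matching-by-substring rule are illustrative simplifications, not the dissertation's implementation.

```python
# Sketch of the lexicon-informed baseline: annotate a figure with the deepest
# (most specific) class whose name appears in the caption, falling back to the
# root. Hierarchy is a child -> parent map; everything here is illustrative.

def depth(c, parent, root="figure"):
    d = 0
    while c != root:
        c = parent[c]
        d += 1
    return d

def baseline_annotate(caption, parent, root="figure"):
    """Return the deepest class mentioned in the caption, or the root."""
    text = caption.lower()
    mentioned = [c for c in parent if c in text]
    if not mentioned:
        return root
    return max(mentioned, key=lambda c: depth(c, parent, root))

parent = {"plot": "figure", "box plot": "plot", "image": "figure", "mri": "image"}
print(baseline_annotate("Box plot of tumor volumes.", parent))  # box plot
print(baseline_annotate("Overall survival results.", parent))   # figure
```

When a caption mentions both a class and its subtype (e.g., "plot" inside "box plot"), taking the deepest match implements the "most specific class" rule; captions with no mention fall back to the uninformative root, which is why this baseline trails EZLearn.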
Evaluation We followed the functional genomics application and evaluated using ontology-based precision and recall. Since the new classes are direct refinements of the old ones, we can also evaluate the Viziometrics classifier using this metric. To the best of our knowledge, there is no prior dataset or evaluation for figure annotation with fine-grained semantic classes as in Figure 4.7. Therefore, we manually annotated an independent test set of 500 examples.
Metric      Lexicon  Viziometrics  Distant Supervision  EZLearn
AUPRC       0.44     0.53          0.75                 0.79
[email protected]  0.31     0.43          0.87                 0.88
Table 4.3: Comparison of test results between EZLearn, the lexicon baseline, the Viziometrics classifier, and distant supervision on the test set of 500 images manually labeled using the ontology from Figure 4.7.
Results EZLearn substantially outperformed both the lexicon-informed baseline and the Viziometrics classifier (Table 4.3). The state-of-the-art Viziometrics classifier was trained on 3271 labeled examples, and attained an accuracy of 92% on the coarse classes. So the gain attained by EZLearn reflects its ability to extract a large amount of fine-grained semantic information missing in the coarse classes. Figure 4.8 shows example annotations by EZLearn, all chosen from figures with no class mention in the caption.
[Figure: five example figures, annotated by EZLearn as chemical structure, gel electrophoresis, Venn diagram, microscopy image, and pie chart.]
Figure 4.8: Example annotations by EZLearn, all chosen among figures with no class information in their captions.
4.6 Summary
In this chapter we proposed EZLearn for automated data annotation, combining distant supervision and co-training. EZLearn is well suited to high-value domains with numerous classes and frequent updates. Experiments in functional genomics and scientific figure comprehension show that EZLearn is broadly applicable, robust to noise, and capable of learning an accurate classifier without manually labeled data, even outperforming state-of-the-art supervised systems by a wide margin. EZLearn's annotations for tissue types in the GEO repository, as well as the processed dataset collected for the experiments, will form the basis for the claim verification and generalization work presented in Chapter 7. The work in this chapter was first reported in [94]. EZLearn is available under an open source license at https://github.com/maximsch2/EZLearn.jl. EZLearn additionally uses an ontology parser, available at https://github.com/maximsch2/OBOParse.jl.
Chapter 5
RECONSTRUCTING GENE REGULATORY NETWORKS BASED ON PUBLIC DATASETS
5.1 Introduction
Gaussian graphical models (GGMs) provide a compact representation of the statistical dependencies among variables. Learning the structure of GGMs from data that contain measurements on a set of variables across samples has significantly facilitated data-driven discovery in a diverse set of scientific fields. For example, biologists can gain insights into how thousands of genes interact with each other in various disease processes by learning the GGM structure from gene expression data that measure the mRNA expression levels of genes across hundreds of patients.

Existing algorithms for learning the structure of GGMs lack scalability and interpretability, which limits their utility when there is a large number of variables. Most learning algorithms perform O(p^3) computations per iteration, where p denotes the number of variables; consequently, they are impractical when p exceeds tens of thousands. Furthermore, a network over a large number of variables can be difficult to interpret due to the presence of a large number of connections between the variables.

To resolve these challenges, we propose the pathway graphical lasso (PathGLasso) framework, which consists of the incorporation of pathway-based constraints and an efficient learning algorithm. We assume that we are given a set of pathways a priori, and that each pathway contains a (possibly overlapping) subset of the variables. We assume that a pair of variables can be connected to each other only if they co-occur in at least one pathway. Figure 5.1 illustrates a simple example network of 8 variables: {x1, x2, x3} in Pathway 1, {x2, x3, x4, x5, x6} in Pathway 2, and {x6, x7, x8} in Pathway 3. By incorporating the pathway constraints, we can effectively reduce the search space of network structures by excluding nonsensical edges.
Pathway constraints have the potential to improve structure learning of GGMs in several applications. In the context of gene regulatory networks, one can make use of pathway databases such as Reactome [50] that specify sets of genes that are likely to work together. Making use of such pathways in learning the network can yield results that are more meaningful and interpretable. In computational neuroscience, when learning an interaction network of brain activation from fMRI data, we can use our prior knowledge that nearby brain regions are likely to interact with each other [74]. In computer vision, in which each pixel in an image corresponds to a variable in a network, one can generate overlapping pathways by grouping nearby pixels; this has been shown to be an effective prior in several applications [115]. For example, Figure 5.2 compares network estimates of the true 2D lattice network for the unconstrained graphical lasso model and the pathway-constrained graphical lasso model (5.2) when each pathway contains nearby variables.

The key idea in this chapter is that we define certain edges to be non-existent only when the corresponding variables are not together in any of the pathways. Many of the potential edges within a pathway can also end up becoming zero. The pathway constraints provide a way of reducing the search space of structure learning; they do not determine the structure to a large extent. In this chapter, we present a learning algorithm that takes advantage of the pathway assumption in order to deliver a dramatic improvement in performance relative to existing approaches. We make use of a block-coordinate descent approach, in which we update each pathway individually. We apply a message-passing algorithm in order to enforce the correct solution jointly across all pathways.
5.2 Pathway Constrained Sparse Inverse Covariance Estimation
5.2.1 Preliminaries
Suppose that we wish to learn a GGM with p variables based on n observations x_1, ..., x_n drawn i.i.d. from N(0, Σ), where Σ is a p × p covariance matrix. It is well known that (Σ^{−1})_{jj′} = 0 for
Figure 5.1: Graphical representation of pathways (top) and the corresponding precision matrix (bottom).
Figure 5.2: Comparison of learned networks between the pathway graphical lasso (middle) and the standard graphical lasso (right). The true network has the lattice structure (left).
some j ≠ j′ if and only if X_j and X_{j′} are conditionally independent given {X_k : k ∈ {1, . . . , p} \ {j, j′}} [166, 147]. Hence, the non-zero pattern of Σ^{−1} corresponds to the graph structure of a GGM. In order to obtain a sparse estimate of Σ^{−1}, a number of authors have considered the graphical lasso optimization problem [259, 8, 78]:
$$\underset{\Theta}{\text{minimize}} \;\; -\log\det(\Theta) + \operatorname{trace}(S\Theta) + \lambda \|\Theta\|_1 \quad \text{subject to } \Theta \succ 0, \tag{5.1}$$
where $S$ is the empirical covariance matrix and $\lambda > 0$ is an $\ell_1$ regularization parameter. We denote the estimate of the inverse covariance matrix by $\Theta$ throughout the chapter.
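As a concrete sanity check on the objective in (5.1), the sketch below (plain NumPy; the function name `glasso_objective` is ours, not from any solver in this chapter) evaluates the objective and verifies the closed-form unpenalized case: with $\lambda = 0$ the minimizer is $\Theta = S^{-1}$, where the objective equals $\log\det(S) + p$.

```python
import numpy as np

def glasso_objective(Theta, S, lam):
    """Objective of (5.1): -log det(Theta) + trace(S Theta) + lam ||Theta||_1."""
    sign, logdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must be positive definite"
    return -logdet + np.trace(S @ Theta) + lam * np.abs(Theta).sum()

# With lam = 0, plugging in Theta = S^{-1} gives log det(S) + p.
p = 4
rng = np.random.default_rng(6)
A = rng.standard_normal((p, p))
S = A @ A.T + p * np.eye(p)          # a positive definite "covariance"
val = glasso_objective(np.linalg.inv(S), S, 0.0)
_, logdetS = np.linalg.slogdet(S)
assert np.isclose(val, logdetS + p)
```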
5.2.2 Pathway Graphical Lasso Problem
Consider the set of edges within $k$ pathways $P_1, \ldots, P_k$: $\mathcal{F} = \bigcup_{t=1}^{k} \{(i, j) \mid i, j \in P_t\}$. We assume that edges outside of $\mathcal{F}$ are set to zero. This modifies the graphical lasso problem (5.1) as:
$$\underset{\Theta}{\text{minimize}} \;\; -\log\det(\Theta) + \operatorname{trace}(S\Theta) + \lambda \|\Theta\|_1 \quad \text{subject to } \Theta_{ij} = 0 \;\; \forall (i, j) \notin \mathcal{F}; \;\; \Theta \succ 0. \tag{5.2}$$
To the best of our knowledge, this is a novel problem, and none of the existing algorithms for learning GGMs can solve (5.2) directly. However, many existing approaches for solving (5.1) either support (e.g., QUIC) or can easily be adapted to support (e.g., HUGE) a per-variable regularization scheme. Setting specific regularization parameter values in (5.1) to some very large number (say $10^{10}$) effectively forces the corresponding values in $\Theta$ to be zero. We observed in our experiments that methods that employ active set heuristics can get a significant performance boost from such a setting.
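As a sketch of this adaptation (the function name and default values are ours, not part of QUIC or HUGE), a per-entry penalty matrix can be built from pathway membership, assigning a huge penalty to every pair of variables that shares no pathway:

```python
import numpy as np

def pathway_penalty_matrix(p, pathways, lam=0.1, big=1e10):
    """Per-entry regularization matrix for an adapted graphical lasso
    solver: entries whose variable pair shares no pathway get a huge
    penalty, effectively forcing those precision entries to zero."""
    Lam = np.full((p, p), big)
    for pw in pathways:
        idx = np.array(sorted(pw))
        Lam[np.ix_(idx, idx)] = lam   # within-pathway block: normal penalty
    np.fill_diagonal(Lam, lam)
    return Lam

# Two overlapping pathways over 5 variables; variable 2 is shared.
pathways = [{0, 1, 2}, {2, 3, 4}]
Lam = pathway_penalty_matrix(5, pathways)
```

Any pair inside at least one pathway (e.g., variables 0 and 1, or 2 and 3) keeps the ordinary penalty, while a cross-pathway pair such as (0, 3) receives the prohibitive penalty.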
5.2.3 Related Work
Our proposal, PathGLasso, decomposes the original problem (5.2) into a set of smaller overlapping problems, and uses a divide-and-conquer approach to optimize the local marginal likelihood with a modified sparsity-inducing penalty. This novel combination of ideas differentiates PathGLasso from previous approaches to learning GGMs. Several authors have attempted to optimize the local marginal likelihood of a handful of nearby variables for parameter estimation in GGMs with fixed structures [256, 171]. It was proven that this type of local parameter estimation produces a globally optimal solution under certain conditions [175]. Though these papers adopted a similar idea of exploiting the conditional independence of a set of variables from the rest of the network given their Markov blanket, they solve a fundamentally different problem. PathGLasso learns a pathway-constrained structure of the network in addition to estimating individual parameters. Moreover, the $\ell_1$ regularization that we employ for structure learning makes these previous approaches inapplicable to our setting. Another approach [118] first partitions variables into non-overlapping pathways by using a clustering algorithm, and then estimates a network for each pathway. In contrast, in this work we assume that overlapping pathway information is provided to us, although it could alternatively be estimated from the data, for example by running a clustering algorithm with soft assignment. The approach of [118] is not applicable to our problem, because combining independently estimated networks from overlapping pathways into a global network can
lead to a non-positive definite solution. Like [118], PathGLasso is agnostic to the specific optimization algorithm used for learning the network within each pathway. There are also methods that aim to infer modules, i.e., sets of densely connected variables. Many approaches attempt to learn a network with a prior that induces modules [6], which makes learning significantly less efficient than without the prior. To address this, [38] proposed a method that can jointly learn modules and a network among modules. Although this method achieves scalability and interpretability, it does not learn a network of individual variables. PathGLasso addresses both of these shortcomings.
Finally, a number of methods have been proposed to solve the $\ell_1$-penalized sparse inverse covariance estimation problem (5.1). One such algorithm [78] uses row-wise updates on the dual problem by solving a lasso problem at every step. The lasso problem is solved using a coordinate-descent algorithm that takes advantage of an active set heuristic, reducing the computational load for sparse matrices. In Section 5.4, we provide a comparison to an efficient implementation of this method, provided by the R package HUGE [262]. Another paper [119] proposes a quadratic approximation based algorithm (QUIC) that achieves super-linear convergence rates as well as significant performance improvements due to clever partitioning of variables into free and fixed sets. In the following section, we propose a novel learning algorithm for pathway-constrained sparse inverse covariance estimation, and demonstrate that it shows significant improvement in run time compared to general off-the-shelf methods for (5.1) that we modified to solve (5.2).
5.3 PathGLasso Learning Algorithm
5.3.1 Overview
We employ a version of the block-coordinate descent approach to solve the optimization problem (5.2); a discussion of the convergence properties of the standard block-coordinate descent is provided in [241]. In each iteration, we update the parameters that correspond to one pathway, with all of the other parameters held fixed. Consider updating the parameters in the pathway $P_1$. After re-arranging the variables, the $p \times p$ inverse covariance matrix $\Theta$ takes the form:
$$\Theta = \begin{bmatrix} A & B & 0 \\ B^T & C & D \\ 0 & D^T & E \end{bmatrix}, \tag{5.3}$$
where $\Theta_1 = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}$ contains the parameters in $P_1$, $\Theta_2 = \begin{bmatrix} C & D \\ D^T & E \end{bmatrix}$ contains the parameters in the rest of the pathways, and $C$ corresponds to the subset of variables that are in the intersection of $P_1$ and all other pathways.
5.3.2 Updating Each Pathway
We show that updating the parameters in $P_1$ with all of the other parameters held fixed boils down to estimating a $p_1 \times p_1$ inverse covariance matrix, where $p_1$ is the number of variables in $P_1$. This is not obvious, because $P_1$ overlaps with other pathways ($C$ in (5.3)). To update the parameters in $P_1$, we need to solve the following optimization problem:
$$\underset{A, B, C}{\text{minimize}} \;\; -\log\det(\Theta) + \operatorname{trace}(S\Theta) + \lambda \|\Theta\|_1 \quad \text{subject to } \Theta \succ 0. \tag{5.4}$$
Applying the Schur complement decomposition, we obtain
$$\det \Theta = \det E \cdot \det(\Theta_1 - \Delta), \tag{5.5}$$
where
$$\Delta = \begin{bmatrix} 0 \\ D \end{bmatrix} \cdot E^{-1} \cdot \begin{bmatrix} 0 & D^T \end{bmatrix}. \tag{5.6}$$
Given that $D$ and $E$ are fixed, the optimization problem (5.4) is equivalent to the following problem:
$$\underset{\Theta_1}{\text{minimize}} \;\; -\log\det(\Theta_1 - \Delta) + \operatorname{trace}(S_1(\Theta_1 - \Delta)) + \lambda \|\Theta_1\|_1 \quad \text{subject to } \Theta_1 - \Delta \succ 0, \tag{5.7}$$
where S1 is the portion of the empirical covariance matrix corresponding to P1.
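The decomposition (5.5)-(5.6) is easy to verify numerically. The sketch below (our own construction, not the chapter's implementation) builds a random symmetric matrix with the block structure of (5.3) and checks that $\det\Theta = \det E \cdot \det(\Theta_1 - \Delta)$:

```python
import numpy as np

rng = np.random.default_rng(0)
pA, pC, pE = 3, 2, 4                    # sizes of the A, C, E blocks in (5.3)
p = pA + pC + pE

# Random symmetric matrix with the block sparsity of (5.3): variables
# unique to P1 (the A block) do not couple with the E block.
M = rng.standard_normal((p, p))
Theta = M @ M.T / p + 2.0 * np.eye(p)
Theta[:pA, pA + pC:] = 0.0
Theta[pA + pC:, :pA] = 0.0

Theta1 = Theta[:pA + pC, :pA + pC]      # [[A, B], [B^T, C]]
D = Theta[pA:pA + pC, pA + pC:]         # coupling between C and E
E = Theta[pA + pC:, pA + pC:]

# Delta = [0; D] E^{-1} [0  D^T]: nonzero only in the C block.
Delta = np.zeros_like(Theta1)
Delta[pA:, pA:] = D @ np.linalg.inv(E) @ D.T

assert np.isclose(np.linalg.det(Theta),
                  np.linalg.det(E) * np.linalg.det(Theta1 - Delta))
```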
Let $\Omega = \Theta_1 - \Delta$. Since our estimate of $\Theta$ is always positive definite, $E$ is also positive definite, as it is a principal submatrix of $\Theta$. Thus, constraining $\Omega$ to be positive definite will guarantee the positive definiteness of $\Theta$. Then, (5.7) is equivalent to the following optimization problem:
$$\underset{\Omega}{\text{minimize}} \;\; -\log\det(\Omega) + \operatorname{trace}(S_1\Omega) + \lambda \|\Omega + \Delta\|_1 \quad \text{subject to } \Omega \succ 0. \tag{5.8}$$
Note that (5.8) is the graphical lasso problem with a “shifted” $\ell_1$ penalty. This means that our block-coordinate update can make use of any algorithm for solving the graphical lasso problem, as long as it can be adapted to work with the shifted penalty. In this work, we used the DP-GLASSO (dual-primal graphical lasso) algorithm [168], which works well with restarts and guarantees a positive definite solution at each iteration.
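To illustrate what the shift changes at the level of a single coordinate: in a proximal or coordinate-wise solver, the usual soft-thresholding step for $\lambda|\omega|$ becomes soft-thresholding around $-\delta$ for the shifted penalty $\lambda|\omega + \delta|$, via the substitution $u = \omega + \delta$. A minimal sketch (our own illustration, not part of DP-GLASSO), checked against a brute-force grid search:

```python
import numpy as np

def soft(v, lam):
    """Standard soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_shifted_l1(v, delta, lam):
    """argmin_w 0.5*(w - v)^2 + lam*|w + delta|: substitute u = w + delta,
    soft-threshold, then shift back."""
    return soft(v + delta, lam) - delta

# Brute-force check on a fine grid.
v, delta, lam = 0.7, -0.3, 0.5
grid = np.linspace(-3, 3, 60001)
obj = 0.5 * (grid - v) ** 2 + lam * np.abs(grid + delta)
assert abs(grid[np.argmin(obj)] - prox_shifted_l1(v, delta, lam)) < 1e-3
```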
5.3.3 Probabilistic Interpretation
The marginal distribution of the variables that are in $P_1$ is Gaussian with mean zero and precision matrix $\Omega = \begin{bmatrix} A & B \\ B^T & C - D \cdot E^{-1} \cdot D^T \end{bmatrix}$, where $\Theta$ denotes the true precision matrix of the entire distribution, partitioned as in (5.3). Then, the optimization problem (5.8) can be viewed as maximizing the marginal likelihood of the variables in $P_1$ with adjustments in the regularization term. That term makes it possible to take into account the variables that are outside of $P_1$. For example, in Figure 5.1, even if $x_2$ and $x_3$ are separately connected with $x_4$, maximizing the marginal likelihood of $P_1$ would induce an edge between $x_2$ and $x_3$ because $x_4$ is outside of $P_1$. The term $\Delta$ in (5.8) informs the algorithm that the connection between $x_2$ and $x_3$ can be explained away by $x_4$ when optimizing the marginal likelihood of $P_1$.
5.3.4 Marginalization of More Than One Pathway
In Section 5.3.2, we showed that updating the parameters for a given pathway requires the computation of ∆ (5.6), a function of all of the other pathways. We could compute ∆ directly by inverting a potentially very large matrix E (5.3), and performing two matrix multiplications. This corresponds to marginalizing all other pathways at once. In this section, we show that when more than two pathways are present, it is possible to avoid computing the matrix inverse of E explicitly, by instead marginalizing the pathways one-at-a-time.
As an example, we consider a very simple case of three pathways that form a linear chain,
$$\Theta = \begin{bmatrix} A & B & 0 & 0 \\ B^T & C & D & 0 \\ 0 & D^T & E & F \\ 0 & 0 & F^T & G \end{bmatrix}. \tag{5.9}$$
Suppose that we want to update the top left pathway, corresponding to the matrix $\Theta_1 = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}$. Following the arguments in Section 5.3.2, computing (5.6) involves inverting the matrix $\begin{bmatrix} E & F \\ F^T & G \end{bmatrix}$. Instead, we note that
$$\det \Theta = \det(G) \cdot \det \begin{bmatrix} A & B & 0 \\ B^T & C & D \\ 0 & D^T & E - FG^{-1}F^T \end{bmatrix} = \det(G) \cdot \det(E - FG^{-1}F^T) \cdot \det \begin{bmatrix} A & B \\ B^T & C - D(E - FG^{-1}F^T)^{-1}D^T \end{bmatrix}. \tag{5.10}$$
Recall that our goal is to update the top left pathway in (5.9), with $D$, $E$, $F$, and $G$ held fixed. Therefore, we can re-write (5.10) as
$$\det \Theta = \det(G) \cdot \det(E - FG^{-1}F^T) \cdot \det(\Theta_1 - \Delta) = \text{const} \cdot \det(\Theta_1 - \Delta),$$
where
$$\Delta = \begin{bmatrix} 0 & 0 \\ 0 & D(E - FG^{-1}F^T)^{-1}D^T \end{bmatrix}. \tag{5.11}$$
Using the arguments in Section 5.3.2, we see that it is possible to update $\Theta_1$ using a shifted graphical lasso problem of the form (5.8). We note that the computations in (5.10) allowed us to derive the form of $\Delta$ in (5.11) without needing to invert the matrix $\begin{bmatrix} E & F \\ F^T & G \end{bmatrix}$. Instead, computing $\Delta$ simply required inverting two smaller matrices, $G$ and $E - FG^{-1}F^T$. This is an example of a more general principle: marginalizing pathways one-at-a-time leads to a dramatic improvement in performance over the naive approach of marginalizing all pathways at once outlined in Section 5.3.2. This general principle holds in the case of more than three pathways, and in fact leads to much greater computational improvements as the number of pathways grows. In (5.11), the term $FG^{-1}F^T$ can be interpreted as a message, a piece of information needed to marginalize out the variables that are within pathway 3 and not in the other pathways. In Section 5.3.5, we show that it is possible to cleverly re-use these messages in order to speed up computations.
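The equivalence of the two marginalization routes can be checked numerically. The sketch below (random blocks of our own choosing, with the chain structure of (5.9)) compares the naive computation, which inverts the whole lower-right block at once, against the one-at-a-time route using the message $FG^{-1}F^T$:

```python
import numpy as np

rng = np.random.default_rng(1)
pC, pE, pG = 2, 3, 3

D = rng.standard_normal((pC, pE))           # C-E coupling in (5.9)
F = rng.standard_normal((pE, pG))           # E-G coupling in (5.9)
E = rng.standard_normal((pE, pE)); E = E @ E.T + pE * np.eye(pE)
G = rng.standard_normal((pG, pG)); G = G @ G.T + pG * np.eye(pG)

# Naive: invert the whole lower-right block [[E, F], [F^T, G]] at once.
Big = np.block([[E, F], [F.T, G]])
D_ext = np.hstack([D, np.zeros((pC, pG))])  # C couples only to E
delta_naive = D_ext @ np.linalg.inv(Big) @ D_ext.T

# One-at-a-time: absorb pathway 3 via the message F G^{-1} F^T, then
# marginalize the reduced E block, as in (5.11).
msg = F @ np.linalg.inv(G) @ F.T
delta_chain = D @ np.linalg.inv(E - msg) @ D.T

assert np.allclose(delta_naive, delta_chain)
```

The chain route only ever inverts matrices of pathway size, which is the source of the speedup as the number of pathways grows.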
5.3.5 Message-Passing Approach
A naive application of the idea described in Section 5.3.4 would require computing $O(k^2)$ messages per iteration, where $k$ is the number of pathways. This is because in each iteration, we update all $k$ pathways, and each update requires marginalizing over the $k - 1$ other pathways. In fact, we can drastically speed up computations using a divide-and-conquer message passing scheme. This approach relies on the careful re-use of messages across pathway updates. Using such a scheme, we need to compute only $O(k \log k)$ messages per iteration.
Figure 5.3: Example with 4 pathways forming a cycle (“m.” means marginalization).
An example is shown in Figure 5.3. In the special case of pathways that form a tree structure, we can further improve this approach to compute only O(k) messages per iteration.
5.4 Experiments
Since there is no existing learning algorithm designed to efficiently solve (5.2), we compared PathGLasso with the state-of-the-art learning algorithms for the graphical lasso problem (5.1): QUIC [119] and HUGE [262]. Although neither of these competitors solves (5.2) directly, we adapted them to solve (5.2) by supplying a matrix of separate $\lambda$ values for each entry in the inverse covariance matrix. We set $\lambda = 10^{10}$ for the entries that lie outside of the pathways, making them solve exactly the same problem as PathGLasso. We observed that supplying such a matrix improves the performance of both methods due to the active set heuristics employed by these methods. Additionally, we compared with DP-GLASSO [168], the method that we used to learn parameters in each pathway (5.8), to make sure that the superior performance of PathGLasso is due to our decomposition approach as opposed to the use of
DP-GLASSO. We note that DP-GLASSO is not competitive in this setting because it does not employ active set heuristics. All comparisons were run on a 4-core Intel Core i7-3770 CPU @ 3.40GHz with 8GB of RAM.
5.4.1 Synthetic datasets comparison
We compared PathGLasso with QUIC, HUGE and DP-GLASSO in 3 scenarios: 1) Cycle: pathways form one large cycle, with 50 genes per pathway and an overlap size of 10; 2) Lattice: the true underlying model is a 2D lattice, and each pathway contains between 3 and 7 nearby variables; and 3) Random: each pathway consists of randomly selected genes. For each setting, we generated a true underlying connectivity graph, converted it to a precision matrix following the procedure from [158], and generated 100 samples from the corresponding multivariate Gaussian distribution.
We observed that PathGLasso dramatically improves the run time compared to QUIC, HUGE and DP-GLASSO (Figure 5.4), sometimes up to two orders of magnitude. We note that DP-GLASSO, used as an internal solver for PathGLasso, performs much worse than both HUGE and QUIC. This is because DP-GLASSO is not as efficient as QUIC or HUGE when solving very sparse problems due to the lack of active set heuristics. This is not a problem for PathGLasso, because our within-pathway networks are small, and are much denser on average than the entire network.
In addition to varying the number of variables $p$ (Figure 5.4), we also explored the effect of the degree of overlap among the pathways (Figure 5.5). We denote by $\eta$ the sum of the sizes of all pathways divided by the total number of variables in the entire network. This can be interpreted as the average number of pathways to which each variable belongs. In a non-overlapping model, $\eta = 1$; the parameter $\eta$ grows with the size of the overlap between pathways. A set of pathways from a real biological database, Reactome [50], has $\eta = 1.95$ (see Section 5.4.2).
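The definition of $\eta$ amounts to a one-line computation (a sketch; the function name is ours):

```python
def overlap_eta(pathways, p):
    """Average number of pathways each variable belongs to:
    sum of pathway sizes divided by the number of variables."""
    return sum(len(pw) for pw in pathways) / p

# A non-overlapping cover of 6 variables has eta = 1.
assert overlap_eta([{0, 1, 2}, {3, 4, 5}], 6) == 1.0
# Two pathways of size 4 sharing two of 6 variables: eta = 8/6.
assert abs(overlap_eta([{0, 1, 2, 3}, {2, 3, 4, 5}], 6) - 8 / 6) < 1e-12
```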
Figure 5.4: Run time (y-axis) for (A) Cycle, (B) Lattice and (C) Random (see text for details).
Figure 5.5: Run time for various values of $\eta$, with $\lambda = 0.1$. $\eta = 1.95$ is drawn as a dotted vertical line.
5.4.2 Real data experiments
We considered two gene expression datasets from acute myeloid leukemia (AML) studies: MILE [104] and GENTLES [86], containing 541 and 248 samples, respectively. The raw data were processed using the Affy R package, MAS5 normalized, and corrected for batch effects with ComBat. We used a widely used curated pathway database, Reactome, that contains a set of genes for each biological pathway. We considered pathways containing fewer than 150 genes, which results in 4591 genes in 156 pathways. Large pathways are often supersets of smaller pathways; therefore, this filtering scheme allowed us to focus on finer-grained pathways. Following [119], we plotted the relative error (y-axis) against time (x-axis) in Figure 5.6A. We computed the relative error in the following way. All 3 methods (QUIC, HUGE and PathGLasso) were run for an extended period of time with small tolerance parameters. We denote by $\Theta^*$ the learned parameter that leads to the lowest value of the objective function. The relative error was then computed as $|ll(\Theta) - ll(\Theta^*)| \, / \, |ll(\Theta^*)|$, where $ll$ denotes the log-likelihood. Again, PathGLasso is significantly faster than HUGE and QUIC over the full range of relative errors. This experiment indicates that the choice of stopping criterion did not affect our results.
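For clarity, the relative-error criterion is simply (a sketch; the function name is ours):

```python
def relative_error(ll, ll_star):
    """Relative error in objective: |ll - ll*| / |ll*|, where ll* is the
    log-likelihood of the best solution found by any method."""
    return abs(ll - ll_star) / abs(ll_star)

assert relative_error(-105.0, -100.0) == 0.05
```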
5.5 Interpretation of the Learned Network
We first checked whether the constraints imposed by the Reactome pathways improved the generalization performance of the learned network. We computed the test log-likelihood of the network trained on the MILE data and tested on the GENTLES data. We compared the result with random pathways created by shuffling genes among the pathways, preserving the pathway sizes and the structure among the pathways. We observed that the test log-likelihood of the original Reactome pathways is significantly higher than those of the random pathways (Figure 5.6B). This result indicates that the Reactome pathways capture relevant information about the underlying network structure among genes that is conserved across two independent datasets.
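One way to implement such a shuffle (a sketch; the exact permutation scheme used in our experiments may differ) is a global permutation of gene labels, which preserves both pathway sizes and the overlap structure among pathways exactly:

```python
import numpy as np

def shuffle_pathways(pathways, rng):
    """Random pathways via a global permutation of gene labels,
    preserving each pathway's size and all pairwise overlap sizes."""
    genes = sorted({g for pw in pathways for g in pw})
    perm = dict(zip(genes, rng.permutation(genes)))
    return [{perm[g] for g in pw} for pw in pathways]

rng = np.random.default_rng(2)
pws = [{0, 1, 2, 3}, {2, 3, 4}, {4, 5}]
shuf = shuffle_pathways(pws, rng)
```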
Figure 5.6: MILE data ($p = 4591$, $k = 156$). (A) Relative error vs time, (B) Test log-likelihood on the GENTLES dataset for random pathways, (C) Significant pathway interactions.
We now show that PathGLasso provides a new way of interpreting the learned network. We identified pairs of pathways that have significant dependencies by computing the sum of the magnitudes of the edge weights that connect them, and comparing that quantity with the one obtained from 1500 simulated data sets. Figure 5.6C shows a graphical representation of the dependencies among pathways. Interestingly, all of the cell cycle-related pathways are tightly connected with each other. Cancer is characterized by uncontrolled growth of cells, which is caused by deregulation of cell cycle processes [48]. One of the most densely-connected pathways is the “Cell Cycle Check Points” pathway, which is known to play a role in the central process of cancer progression by tightly interacting with many other pathways involved in the immune system, metabolism, and signaling [48].
5.6 Summary
In this chapter, we have introduced the pathway-constrained sparse inverse covariance estimation problem and a novel learning algorithm to solve it, called PathGLasso. We showed that our algorithm can be orders of magnitude faster than state-of-the-art competitors. We demonstrated that PathGLasso can leverage prior knowledge from curated biological pathways. PathGLasso uses an off-the-shelf algorithm for solving a standard graphical lasso problem as a subroutine, and thus will benefit from future performance improvements in graphical lasso algorithms. The work in this chapter was first reported in [92]. An implementation of PathGLasso is available under an open source license at https://github.com/maximsch2/pglasso.
Chapter 6
UNCOVERING NETWORK-PERTURBED GENES IN PUBLIC CANCER EXPRESSION DATASETS
6.1 Introduction
Genes do not act in isolation but instead work as part of complex networks to perform various cellular processes. Many human diseases, including cancer, are caused by dysregulated genes, with underlying DNA or epigenetic mutations within the gene region or its regulatory elements leading to perturbation (topological changes) in the network [254, 214, 3, 14, 150, 161, 226]. This can ultimately impair normal cell physiology and cause disease [135, 32, 122, 105]. For example, cancer driver mutations [46, 40, 16, 13, 180, 245, 35, 56] on a transcription factor can alter its interactions with many of the target genes that are important in cell proliferation (Figure 6.1A). A key tumor suppressor gene can be bound by different sets of transcription factors in cancer versus normal cells, which leads to different roles [161, 228, 255, 257, 264] (Figure 6.1B). Recent studies stress the importance of identifying the perturbed genes that create large topological changes in the gene network between disease and normal tissues as a way of discovering disease mechanisms and drug targets [135, 32, 21, 220, 103, 197]. However, most existing analysis methods that compare expression datasets between different conditions (e.g., disease vs. normal tissues) focus on identifying the genes that are differentially expressed [242, 75, 260]. For example, a recent review paper on biological network inference [174] emphasized that there is a lack of methods that focus on inferring the differential network between different conditions (e.g., distinct species, or disease conditions). Several recent studies compare gene networks inferred between conditions based on expression datasets [254, 26, 5, 87, 99, 252, 261]. They fall into three categories: 1) Network
construction based on prior knowledge: West et al. (2012) compute the local network entropy, based on a protein interaction network from prior knowledge and expression datasets from cancer and normal tissues [254]. 2) Pairwise correlation-based networks: Guan et al. (2013) [99] proposed the local network similarity (LNS) method to compare the pairwise Pearson's correlation matrices of all genes between two conditions. Still other authors compared pairwise correlation coefficients for all gene pairs between conditions with different correlation measures, including t-test p-values [26, 5, 252]. 3) Learning a condition-specific conditional dependency network for each condition and comparing the networks between conditions: Gill et al. (2010) proposed a method, called PLSNet, that fits a partial least squares model to each gene, computes connectivity scores between genes, and then calculates the L1 distance between score vectors to estimate network perturbation [87]. Zhang et al. (2009) proposed a differential dependency network (DDN) method that uses lasso regression to construct networks, followed by permutation tests to measure the significance of the network differences [261]. There have also been approaches to identify dysregulated genes in cancer by utilizing multiple types of molecular profiles, rather than network perturbation across disease states estimated from expression data. Successful examples use a linear model to infer each gene's expression based on copy number variation, DNA methylation, ChIP-seq, miRNAs, or mRNA levels of transcription factors [155, 217, 7]. The advantage of the aforementioned methods that take only expression datasets as input to identify perturbed genes lies in their applicability to diseases for which only expression data are available.
In this chapter, we focus on identifying perturbed genes purely based on gene expression datasets representing distinct states, and compare our method with the existing methods LNS, D-score, and PLSNet. We present a new computational method, called DISCERN (DIfferential SparSE Regulatory Network), to identify perturbed genes, i.e., the genes with differential connectivity between the condition-specific networks (e.g., disease versus normal). DISCERN takes two expression datasets, each from a distinct condition, as input, and computes a novel perturbation score for each gene. The perturbation score captures how likely it is that a given gene has a distinct set
of regulators between conditions (Figure 6.1A). The DISCERN method contains specific features that provide advantages over existing approaches: 1) DISCERN can distinguish direct associations among genes from indirect associations more accurately than methods that focus on marginal associations, such as LNS; 2) DISCERN uses a penalized regression-based modeling strategy that allows efficient inference of genome-wide gene regulatory networks; and 3) DISCERN uses a new likelihood-based score that is more robust to the expected inaccuracies in local network structure estimation. We elaborate on these three advantages below. First, DISCERN infers gene networks based on conditional dependencies among genes, a key type of probabilistic relationship among genes that is fundamentally distinct from correlation. If two genes are conditionally dependent, then by definition, their expression levels are still correlated even after accounting for (e.g., regressing out) the expression levels of all other genes. Thus, a conditional dependence relationship is less likely to reflect transitive effects than mutual correlation, and provides stronger evidence that those genes are functionally related. These functional relationships could be regulatory, physical, or other molecular functionality that causes two genes' expression to be tightly coupled. As a motivating example, assume that the expression levels of genes ‘3' and ‘5' are regulated by gene ‘1' in a simple 7-gene network (Figure 6.1A). This implies that the expression level of gene ‘1' contains sufficient information to know the expression levels of genes ‘3' and ‘5'. In other words, genes ‘3' and ‘5' are conditionally independent from each other and from the rest of the network given gene ‘1'. Second, DISCERN uses an efficient neighborhood selection strategy based on a penalized regression to enable the inference of a genome-wide network that contains tens of thousands of genes.
Penalized regression is a well-established technique to identify conditional dependencies [170]. Inferring the conditional dependence relationships from high-dimensional expression data (i.e., where the number of genes is much greater than the number of samples) is a challenging statistical problem, due to the very large number of possible network structures among tens of thousands of genes. Unlike pairwise correlation, the conditional
dependence between ‘1' and ‘2' cannot be measured based on just the expression levels of these two genes. We must consider the possible networks among all genes and find the one that best explains the expression data. This involves both computational and statistical challenges. To make this process feasible, DISCERN uses a sparse regression model for each gene to select its neighbors in the network [170, 251]. The use of a scalable method to infer a genome-wide conditional dependence network is a key distinguishing feature of the DISCERN method.
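The distinction between marginal correlation and conditional dependence is easy to see numerically. In the sketch below (synthetic toy data of our own, echoing the 7-gene example), two targets of the same regulator are strongly correlated marginally, but the association vanishes once the regulator is regressed out:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
g1 = rng.standard_normal(n)               # regulator ('1' in Figure 6.1A)
g3 = g1 + 0.5 * rng.standard_normal(n)    # two targets driven by g1
g5 = g1 + 0.5 * rng.standard_normal(n)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def residual(y, x):
    """Residual of y after regressing out x (no intercept; zero-mean data)."""
    return y - x * (x @ y) / (x @ x)

# Marginally, g3 and g5 look strongly related ...
assert corr(g3, g5) > 0.6
# ... but conditioning on the shared regulator removes the association.
assert abs(corr(residual(g3, g1), residual(g5, g1))) < 0.05
```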
Finally, one of the most novel features of DISCERN is its ability to avoid overestimating the degree of network perturbation due to dense correlation among many genes. Revisiting the 7-gene network example (Figure 6.1A), assume that genes ‘5' and ‘7' are highly correlated with each other, in which case a penalized regression that imposes a sparsity penalty, such as the lasso method, may arbitrarily select one of them. This can result in a false positive edge between genes ‘1' and ‘7' instead of ‘1' and ‘5', which may lead to overestimation of the perturbation of gene ‘1' (Figure 6.1A). Our network perturbation score overcomes this limitation by measuring the network differences between conditions based on the likelihood when the estimated networks are swapped between conditions, not based on the differences in the topologies of the estimated networks. We demonstrate the effectiveness of this feature by comparing with methods based on the topology differences of the estimated networks.
We evaluated DISCERN on both synthetic and gene expression data from three human cancers: acute myeloid leukemia (AML), breast cancer (BRC), and lung cancer (LUAD). Integrative analysis using DISCERN on epigenomic data from the Encyclopedia of DNA Elements (ENCODE) project leads to hypotheses on the mechanisms underlying network perturbation (Figure 6.1C).
Figure 6.1: (A) A simple hypothetical example that illustrates the perturbation of a network of 7 genes between disease and normal tissues. One possible cause of the perturbation is a cancer driver mutation on gene ‘1' that alters the interactions between gene ‘1' and genes ‘3', ‘4', ‘5', and ‘6'. (B) One possible cause of network perturbation. Gene ‘1' is regulated by different sets of genes between cancer and normal conditions. (C) The overview of our approach. DISCERN takes two expression datasets as input: an expression dataset from patients with a disease of interest and another expression dataset from normal tissues (top). DISCERN computes the network perturbation score for each gene that estimates the difference in connection between the gene and other genes between disease and normal conditions (middle). We perform various post-analyses to evaluate the DISCERN method by comparing with alternative methods, based on the importance of the high-scoring genes in the disease through a survival analysis and on how well the identified perturbed genes explain the observed epigenomic activity data (bottom).
6.2 Results
6.2.1 Method overview
Here, we describe the DISCERN method, referring to the Methods for a full description. We postulate that a gene can be perturbed in a network largely in two ways. A gene can change how it influences other genes (Figure 6.1A); for example, a driver mutation on a transcription factor can affect cell proliferation pathways [46, 40, 16, 13, 180, 245, 35, 56]. A gene can also change the way it is influenced by other genes, a common example being when a mutated (genetically or epigenetically) gene acquires a new set of regulators, which occurs frequently in development and cancer [161, 228, 255, 257, 264] (Figure 6.1B). Identifying the genes that are responsible for large topological changes in gene networks could be crucial for understanding disease mechanisms and identifying key drug targets [135, 32, 21, 220, 103, 197]. However, most current methods for identifying genes that behave differently in their expression levels between diseased and normal tissues focus on differential expression [242, 75, 260], rather than differential connection with other genes in a gene expression network (Figure 6.1A). We model each gene's expression level using a sparse linear model (lasso regression): let $y_i^{(s)}$ be the expression levels of gene $i$ in individuals with state $s$, cancer ($s = c$) or normal ($s = n$), modeled as $y_i^{(s)} \approx \sum_{r=1}^{p} w_{ir}^{(s)} x_r^{(s)}$. Here, $x_1^{(s)}, \ldots, x_p^{(s)}$ denote candidate regulators, a set of genes known to regulate other genes, including transcription factors, chromatin modifiers or regulators, and signal transduction genes, which were used in previous work on network reconstruction approaches [152, 127, 215, 85] (Supplementary Table 1 from [93]). Linear modeling allows us to capture conditional dependencies efficiently from genome-wide expression data containing tens of thousands of genes. Naturally, a zero weight $w_{ir}$ indicates that regulator $r$ does not affect the expression of the target gene $i$.
Sparsity-inducing regularization helps to select a subset of the candidate regulators, which is a more biologically plausible model than one including all regulators, and makes the problem well-posed in our high-dimensional setting (i.e., the number of genes is much greater than the number of samples).
To determine the regulators for any given gene, we use a lasso penalized regression model [235] with the optimization problem for each lasso regression defined as:
$$\underset{w_{i1}^{(s)}, \ldots, w_{ip}^{(s)}}{\operatorname{argmin}} \;\; \sum_{j=1}^{n} \left( y_{ij}^{(s)} - \sum_{r=1}^{p} w_{ir}^{(s)} x_{rj}^{(s)} \right)^2 + \lambda \sum_{r=1}^{p} \left| w_{ir}^{(s)} \right|,
$$
where $y_{ij}^{(s)}$ denotes the expression level of the $i$th gene in the $j$th patient in the $s$th state, and $x_{rj}^{(s)}$ similarly denotes the expression level of the $r$th regulator in the $j$th patient in the $s$th state. The second term, the L1 penalty function, will zero out many irrelevant regulators for a given gene, because it is known to induce sparsity in the solution [235]. We normalize the expression levels of each gene and each regulator to be mean zero and unit standard deviation, a process called standardization, which is a standard practice before applying a penalized regression method [235, 236, 61, 211]. The difference in the weight vectors between conditions, $w_i^{(n)}$ and $w_i^{(c)}$, can indicate a distinct connectivity of gene $i$ with the $p$ regulators between the conditions. However, simply computing the difference of the weight vectors is unlikely to be successful, due to the correlation among the regulators. The lasso, or other sparsity-inducing regression methods, can arbitrarily choose different regulators between cancer and normal. Examining the difference in the weight vectors between conditions would therefore lead to overestimation of network perturbation. Instead, DISCERN adopts a novel network perturbation score that measures how well each weight vector learned in one condition explains the data in a different condition. This increases the robustness of the score to correlation among regulators, as demonstrated in the next section. We call this score the DISCERN score, defined as
$$\text{DISCERN}_i = \frac{\text{negative log-likelihood computed using the weights } w^{(s)} \text{ learned in the other condition } s'}{\text{negative log-likelihood computed using the weights } w^{(s)} \text{ learned in the same condition } s}.$$
This is equivalent to
$$\text{DISCERN}_i = \frac{\mathrm{err}_i(c, n) + \mathrm{err}_i(n, c)}{\mathrm{err}_i(c, c) + \mathrm{err}_i(n, n)}, \quad \text{where } \mathrm{err}_i(s, s') = \frac{1}{n_s} \left\| y_i^{(s)} - \sum_{r=1}^{p} w_{ir}^{(s')} x_r^{(s)} \right\|_2^2.$$
Here $n_s$ is the number of samples in the data from condition $s$. The numerator measures the error of predicting gene $i$'s expression levels
in cancer (normal) based on the weights learned in normal (cancer). If gene i has different sets of regulators between cancer and normal, it is likely to have a high DISCERN score. The denominator plays an important role as a normalization factor, which is demonstrated by comparing with an alternative score, namely the D0 score (Figure 6.2A), that uses only the numerator of the DISCERN score. We also compare with existing methods, such as LNS [99] and PLSNet [87], that compare the weight vectors between the cancer and normal models, and demonstrate the advantages of the likelihood-based score that DISCERN uses.
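Under these definitions, the score for a single gene can be sketched in a few lines. The helper names below are hypothetical, a simple cyclic coordinate-descent lasso stands in for whichever solver DISCERN actually uses, and all inputs are assumed to be standardized as described above:

```python
import numpy as np

def fit_lasso(X, y, lam, n_iter=200):
    """Lasso weights via cyclic coordinate descent on standardized data.
    Minimizes sum_j (y_j - x_j . w)^2 + lam * ||w||_1 (sketch, not the
    dissertation's actual solver)."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)  # assumes no all-zero columns
    for _ in range(n_iter):
        for r in range(p):
            resid = y - X @ w + X[:, r] * w[r]   # residual excluding feature r
            rho = X[:, r] @ resid
            # soft-thresholding update for the objective above
            w[r] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_sq[r]
    return w

def discern_score(y_c, X_c, y_n, X_n, lam):
    """DISCERN score for one gene: cross-condition prediction error divided
    by within-condition prediction error, (err(c,n)+err(n,c)) / (err(c,c)+err(n,n))."""
    w_c = fit_lasso(X_c, y_c, lam)   # weights learned in cancer
    w_n = fit_lasso(X_n, y_n, lam)   # weights learned in normal
    err = lambda y, X, w: np.mean((y - X @ w) ** 2)
    num = err(y_c, X_c, w_n) + err(y_n, X_n, w_c)
    den = err(y_c, X_c, w_c) + err(y_n, X_n, w_n)
    return num / den
```

A gene whose regulators agree across conditions yields a score near 1, while a rewired gene yields a much larger score, matching the intended behavior of the ratio.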
6.2.2 Comparison with previous approaches on synthetically generated data
In order to systematically compare DISCERN with alternative methods in a controlled setting, we performed validation experiments on 100 pairs of synthetically generated datasets representing two distinct conditions. Each pair of datasets contains 100 variables drawn from the multivariate normal distribution with zero mean and covariance matrices Σ1 and Σ2. We divided the 100 variables into the following three categories: 1) variables that have different sets of edge weights with other variables across the two conditions, 2) variables that have exactly the same sets of edge weights with each other across the conditions, and 3) variables not connected with any other variables in categories 1) and 2) in either condition. For example, in Figure 6.1A, ‘1’ is in category 1), ‘2’, ‘4’, ‘6’, and ‘7’ are in category 2), and ‘3’ and ‘5’ are in category 3). We describe how we generated the network edge weights (i.e., elements of
Σ1^{-1} and Σ2^{-1}) among the 100 variables in more detail in Methods. We compared DISCERN with 4 alternative methods to identify perturbed genes: LNS [99], D-score [252], PLSNet [87], and D0, which uses only the numerator of the DISCERN score. Here, we do not compare with methods that identify differentially expressed genes, such as ANOVA, because the synthetic data were generated from a zero-mean Gaussian distribution. We note that the PLSNet method uses empirical p-values as the network perturbation scores, where the empirical p-value for each gene is estimated from permutation tests that generate the null distribution of the gene’s score [87]. All the other methods, such as DISCERN, LNS, and D-score, do not require permutation tests (see Methods for details). To show that
DISCERN outperforms both the existing methods and methods that use empirical p-values obtained through permutation tests as the network perturbation scores, we developed the following methods for comparison: LNS, D-score, and D0 followed by permutation tests to compute the empirical p-values, called pLNS, pD-score, and pD0, respectively. The average receiver operating characteristic (ROC) curves across the 100 pairs of datasets for these methods (Figure 6.2A) show that DISCERN significantly outperforms all 7 other methods – 3 existing methods (LNS, D-score, and PLSNet), and 4 methods we created for comparison (D0, pD0, pLNS, and pD-score). Aside from DISCERN itself, PLSNet performs best among the existing methods. However, its run time grows too quickly as the number of variables increases, which makes it two to three orders of magnitude slower than DISCERN when run on larger data (Figure 6.2B). PLSNet was too slow to run on genome-scale data, and therefore we did not use it for the subsequent experiments on genome-wide gene expression data from cancer patients. We note that DISCERN does not need permutation tests to generate the null distribution of the score for each gene. All other methods improve when the empirical p-values from permutation tests are used, which indicates that the gene-level bias in the magnitude of the raw scores hurts their ability to identify perturbed genes. DISCERN significantly outperforms D0, which uses only the numerator of the DISCERN score, indicating that the denominator of the DISCERN score serves to normalize the score so that the scores of different genes can be compared to each other. Computing the empirical p-value for each gene based on the gene-specific null distribution obtained through permutation tests is not feasible on genome-wide data.
To obtain a p-value of 0.05 after Bonferroni correction, we need at least (1/0.05) × p permutation tests per gene, where p is the total number of genes, and (1/0.05) × p² permutation tests in total. When p = 20,000, this amounts to 8 × 10⁹ permutation tests, which is not feasible even when using multiple processors at a reasonable cost. This is demonstrated in Figure 6.2B, which shows the run time of PLSNet, a permutation test-based method, when applied to data containing a varying number of genes (p).
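The permutation-count arithmetic above can be checked directly; with p = 20,000 genes and a 0.05 Bonferroni threshold, the per-gene and total counts come out as follows:

```python
# Back-of-the-envelope cost of gene-specific permutation nulls: the smallest
# p-value resolvable from k permutations is 1/k, so a Bonferroni-corrected
# threshold of 0.05/p requires at least (1/0.05) * p permutations per gene,
# and (1/0.05) * p**2 permutations for gene-specific nulls across all genes.
alpha = 0.05
p = 20_000                        # total number of genes
per_gene = round(p / alpha)       # permutations needed for one gene
total = round(p ** 2 / alpha)     # permutations needed for all genes
print(per_gene)                   # 400000
print(total)                      # 8000000000
```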
[Figure 6.2, panel (A) ROC legend: DISCERN AUC=0.81; pD0 AUC=0.72; PLSNet, pLNS, and pD-score AUC=0.68; D0 AUC=0.56; LNS AUC=0.55; D-score AUC=0.53. Panel (B): run time in hours of DISCERN and PLSNet versus the number of variables p.]
Figure 6.2: (A) Average receiver operating characteristic (ROC) curves from the experiments on synthetic data. We compare DISCERN with 7 alternative methods: 3 existing methods – LNS [99], D-score [252], and PLSNet [87] – and 4 methods we developed for comparison – pLNS, pD-score, D0 and pD0. (B) Comparison of the run time (hours) between PLSNet and DISCERN for varying numbers of variables (p). The triangles mark the measured run times at specific values of p, and lines connect these measured run times. PLSNet uses the empirical p-values from permutation tests as scores, and DISCERN does not. For large values of p, DISCERN is two to three orders of magnitude faster than PLSNet.
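The synthetic validation setup from this subsection can be sketched as follows. The construction below is illustrative rather than the exact Methods procedure: edges are drawn at random, positive definiteness is enforced by diagonal dominance, and all counts and weight ranges are assumed values:

```python
import numpy as np

def random_edges(nodes, k, rng):
    """Draw k random weighted edges among the given nodes (assumed generator)."""
    edges = []
    for _ in range(k):
        i, j = rng.choice(nodes, size=2, replace=False)
        edges.append((int(i), int(j), rng.uniform(-0.5, 0.5)))
    return edges

def precision_from_edges(p, edges):
    """Sparse symmetric precision matrix over p variables, made positive
    definite by enforcing diagonal dominance."""
    P = np.zeros((p, p))
    for i, j, w in edges:
        P[i, j] = P[j, i] = w
    np.fill_diagonal(P, np.abs(P).sum(axis=1) + 1.0)
    return P

def synthetic_pair(p=100, n=200, n_perturbed=10, n_shared_edges=150, seed=0):
    """Two zero-mean Gaussian datasets whose networks agree except on the
    first n_perturbed variables (category 1); the last 20 variables stay
    unconnected (category 3), the rest share identical edges (category 2)."""
    rng = np.random.default_rng(seed)
    connected = np.arange(n_perturbed, p - 20)             # category-2 nodes
    shared = random_edges(connected, n_shared_edges, rng)  # identical in both
    X = []
    for _ in range(2):  # each condition rewires the perturbed variables
        pert_edges = [(i, int(rng.choice(connected)), rng.uniform(-0.5, 0.5))
                      for i in range(n_perturbed)]
        cov = np.linalg.inv(precision_from_edges(p, shared + pert_edges))
        cov = (cov + cov.T) / 2  # symmetrize against numerical drift
        X.append(rng.multivariate_normal(np.zeros(p), cov, size=n))
    return X[0], X[1]
```

Because the shared edges are reused in both conditions, only category-1 variables change their connectivity between the two covariance matrices, which is the property the ROC evaluation relies on.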
6.2.3 Comparison of methods on gene expression datasets
We used genome-wide expression datasets consisting of 3 acute myeloid leukemia (AML) datasets, 3 breast carcinoma (BRC) datasets and 1 lung adenocarcinoma (LUAD) dataset (Table 6.1). Details on the data processing are provided in Methods. To evaluate the performance of the DISCERN method, we compared DISCERN with existing methods that scale to tens of thousands of genes: LNS [99] and D-score [252], which aim to estimate network perturbation, and ANOVA, which measures differential expression levels between cancer and normal samples.
We first computed the DISCERN, LNS, D-score, and ANOVA scores in the 3 cancers based on the following datasets that contain normal samples: AML1, LUAD1 and BRC1 (Table 6.1). Then, we used the rest of the datasets to evaluate the performance of each method at identifying genes previously known to be important in the disease, for example, the genes whose expression levels are significantly associated with survival time in cancer. The value of the sparsity tuning parameter λ was chosen via cross-validation, a standard statistical technique for determining the value of λ [235]. For the chosen λ values, the overall average regression fit, measured by cross-validated R², was 0.493.
To address any potential concern about the effect of standardization on genes with very low expression levels, we first show that genes with low mean expression do not tend to have high enough DISCERN scores to be considered in our evaluation in the next sections (see Supplementary Figure 1 from [93]). The Pearson's correlation between the mean expression before standardization and the DISCERN score ranges from 0.08 to 0.43 across datasets. The positive correlation arises because genes with low mean expression tend to have lower DISCERN scores, indicating that genes whose expression is likely essentially noise would not be selected as high-scoring genes. To further reduce the potential concern about genes with low expression in RNA-seq data (LUAD), we applied the voom normalization method, which is specifically designed to adjust for the poor estimate of variance in count data, especially for genes with low counts [149].
We assessed the significance of the DISCERN scores through a conservative permutation testing procedure, where we combined cancer and normal samples, and permuted the cancer/normal labels among all samples (more details in Methods). Unlike the gene-based permutation test described in the previous section, here we generate a single null distribution for all genes, which requires far fewer permutation tests (one million in this experiment). After applying false discovery rate (FDR) correction to these p-values, there are 1,351 genes (AML), 2,137 genes (BRC), and 3,836 genes (LUAD) whose FDR-corrected p-values are less than 0.05. We consider these genes to be significantly perturbed (Supplementary Table 2 from [93]). The difference in these numbers of significantly perturbed genes identified by DISCERN is consistent with a prior study showing that lung cancer has a larger number of non-synonymous mutations per tumor than breast cancer, which in turn has a larger number than AML [250].

Table 6.1: Gene expression datasets used in DISCERN experiments.

Dataset  Reference      Accession                    Platform        # Genes  # of Tumors  # of Normal  Survival?
AML1     MILE [104]     GSE13159                     Affy U133+2.0   16853    541          73           No
AML2     Gentles [86]   GSE12417,GSE14468,GSE10358   Affy U133+2.0   16853    515          0            Yes
AML3     Metzeler       GSE12417                     Affy U133A      11697    152          0            Yes
BRC1     TCGA           Firehose 2013042100          Agilent G4502A  17022    529          61           Yes
BRC2     Metabric       Synapse: syn1688369          Illumina HTv3   10809    1981         0            Yes
BRC3     Oslo           Synapse: syn1688370          Illumina HTv3   10809    184          0            Yes
LUAD1    TCGA           Firehose 2015110100          Illumina HiSeq  10809    504          57           Yes
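The pooled permutation procedure and FDR correction used in this section can be sketched as follows; the function names are illustrative, and the per-gene score computation itself is assumed given:

```python
import numpy as np

def pooled_permutation_pvalues(scores, null_scores):
    """Empirical p-values from a single null distribution shared by all
    genes: for each observed score, the fraction of pooled null scores at
    least as large, with +1 smoothing so no p-value is exactly zero."""
    null_sorted = np.sort(null_scores)
    # searchsorted(..., side='left') counts null values strictly below each
    # score, so the remainder counts values >= score
    greater_equal = len(null_sorted) - np.searchsorted(null_sorted, scores,
                                                       side='left')
    return (greater_equal + 1) / (len(null_sorted) + 1)

def benjamini_hochberg(pvals):
    """FDR-corrected q-values via the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    m = len(p)
    q = np.empty(m)
    prev = 1.0
    # walk from the largest p-value down, enforcing monotonicity
    for rank, idx in zip(range(m, 0, -1), order[::-1]):
        prev = min(prev, p[idx] * m / rank)
        q[idx] = prev
    return q
```

Genes with q-values below 0.05 would then be called significantly perturbed, mirroring the FDR < 0.05 cutoff used above.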
6.2.4 Top scoring DISCERN genes in AML reveal known cancer drivers in AML
The 1,351 genes that were predicted to be significantly perturbed between AML samples and normal non-leukemic bone marrow samples were enriched for genes causally implicated previously in AML pathogenesis (Supplementary Table 2 from [93]). These include a number of genes that we and others have previously identified as being aberrantly activated in leukemic stem cells, such as BAALC, GUCY1A3, RBPMS, and MSI2 [86, 133, 231]. This is consistent with over-production of immature stem-cell-like cells in AML, which is a major driver of poor prognosis in the disease. Prominent among high-scoring DISCERN genes were many HOX family members, which play key roles in hematopoietic differentiation and in the pathogenesis of AML [4]. HOX genes are frequently deregulated by over-expression in AML, often through translocations that result in gene fusions. The highest-ranked gene in AML by DISCERN is HOXB3, which, for example, is highly expressed in multipotent hematopoietic progenitor cells. Thirteen (out of 39 known) HOX genes are among the 1,351 significantly perturbed genes (p-value: 5.99 × 10−6). When compared to known gene sets from the Molecular Signature Database (MSigDB) [229] in an unbiased way, the top hit was a set of genes that are down-regulated by NPM1 (nucleophosmin 1) mutation in AML (Supporting Information 1 from [93]). NPM1 is one of three markers used in AML clinical assessment; the others, FLT3 and CEBPA, are also significantly perturbed genes identified by DISCERN. NPM1 mutation leads to aberrant cytoplasmic localization of the protein and its interaction partners, leading to changes in downstream transcriptional programs that are captured by DISCERN. Also highly significant were genes highly expressed in hematopoietic stem cells [125].
Among these were key regulators of hematopoietic system development such as KIT, HOXA3, HOXA9, and HOXB3 (with the latter homeobox genes also implicated in AML etiology), as well as FLT3, which plays a major role in AML disease biology, with its mutation and constitutive activation conferring significantly worse outcomes for patients [141]. Comparison to Gene Ontology (GO) categories identified dysregulation of genes involved in hemostasis and blood coagulation, a key clinical presentation of AML. Furthermore, GTPase activity/binding and SH3/SH2 adaptor activity were enriched among high-scoring DISCERN genes. These are pertinent to AML due to previously noted high expression in AML leukemic stem cells of GUCY1A3 and SH3BP2, both identified as perturbed genes by DISCERN [86]. However, their function has not been examined in detail, suggesting that they are potential targets for further investigation as to their role in AML disease mechanisms. Several other highly significant enrichments were for AML subtypes that are driven by specific translocations, including MLL (mixed lineage leukemia) translocations with various partners, as well as t(8;21) translocations. The latter is of particular interest, since it is primarily a pediatric AML, whereas our network analysis uses purely adult AML samples, indicating the potential to uncover putative mechanisms that generalize beyond the context of the immediate disease type.
6.2.5 Top scoring DISCERN genes in lung cancer reveal biological processes known to be important in lung cancer
There are 3,836 significantly perturbed genes identified by DISCERN in lung cancer (LUAD) (Supplementary Table 2 from [93]). The 3rd and 4th highest ranked genes are ICOS (inducible costimulator) and YWHAZ (14-3-3-zeta). Both genes have known roles in disease initiation or progression in lung cancer. Polymorphisms in ICOS have been associated with pre-disposition to non-small cell lung cancer [131], while over-expression of YWHAZ is known to enhance proliferation and migration of lung cancer cells through induction of epithelial-mesenchymal transitions via beta-catenin signaling [41]. GIMAP5 (GTPase IMAP Family Member 5), another high-scoring LUAD gene (ranked 11th), is consistently repressed in paired analyses of tumor vs. normal lung tissue from the same patient, and encodes an anti-apoptotic protein [218]. Down-regulation of GIMAP5 in lung tumors therefore potentially facilitates their evasion of programmed cell death, one of the hallmarks of cancer. Several of the GO biological categories enriched in the 3,836 high-scoring DISCERN genes in LUAD (FDR-corrected p-value < 0.05) reflected metabolic and proliferative processes that are commonly de-regulated in solid tumors such as lung adenocarcinoma. Among these were
cellular response to stress, mitotic cell cycle, amino acid metabolism, and apoptosis (Supporting Information 1 from [93]). In fact, the top-ranked gene was MCM7 (minichromosome maintenance protein 7), an ATP-dependent DNA helicase involved in DNA replication which has previously been implicated in carcinogenesis due to its function as a binding partner of PRMT6 [239]. Moreover, it was specifically identified as a potential therapeutic target due to its over-expression in solid tumors relative to normal tissues. The high ranking of genes associated with apoptosis is consistent with the fact that there is often a high rate of tumor cell death. Although the highly-ranked CARD6 (caspase recruitment domain family member 6) functions in apoptotic processes, it is also known as a regulator of downstream NF-κB signaling. Indeed, consistent with this, we found enrichment for NF-κB signaling pathway genes among high DISCERN-scoring genes in LUAD, including NFKBIB (NF-κB inhibitor β), which inhibits the NF-κB complex by trapping it in the cytoplasm, preventing nuclear activation of its downstream targets. Although the role of NFKBIB in lung cancer has not been studied extensively, its related family member NFKBIA is known to be a silencer in non-small-cell lung cancer patients with no smoking history, suggesting that it could play some role in LUAD that arises through inherent genetic influences, or environmental insults other than smoking [81]. Levels of β-catenin have been known for some time to influence progression and poor prognosis in LUAD, potentially through its role in differentiation and metastasis from primary tumor sites [258]. We found that components of β-catenin degradation pathways, most notably CTNNBIP1 (β-catenin interacting protein 1), ranked among the most significant DISCERN genes in our LUAD analysis.
When comparing to other sets of genes in MSigDB, we also found targets of transcription factors including MYC, which is often de-regulated in solid tumors (either by mutation or copy number variation), and targets of the polycomb repressive complex gene EZH2. The developmental regulator EZH2 functions through regulation of DNA methylation [249], and has been implicated in B-cell lymphomas through somatic mutations [177], promotion of transformation in breast cancer [138], as well as progression in prostate cancer [246]. Interestingly, the most highly dys-regulated gene set identified by comparison to GO categories
in LUAD was one related to NGF (nerve growth factor)-TrkA signaling. There are a few reports on the relevance of this axis to cancers, including neuroblastoma, ovarian cancer, and a possible role in promoting metastasis in breast cancer. However, its striking appearance as the most significant hit among high-ranking DISCERN genes suggests that it merits study in lung cancer.
6.2.6 Top scoring DISCERN genes in breast cancer reveal biological processes known to be important in breast cancer
Here, we performed the functional enrichment analysis with the 2,137 genes identified by DISCERN to be significantly perturbed in breast cancer (BRC) (Supplementary Table 2 from [93]). BRC showed perturbation of distinct genes and sets of genes in comparison to LUAD, as well as similarities. Again, these included GO biological processes that one would generically expect to be over-activated in a solid tumor, such as translation initiation, cell cycle, proliferation, and general cellular metabolic processes. As with LUAD, targets of MYC were enriched in high-scoring DISCERN genes in BRC. Another high-scoring group in BRC was comprised of genes that are highly correlated with each other, but with this relationship de-regulated by BRCA1 mutation [198]. Additional significant overlaps were identified with luminal A, luminal B, HER2-enriched, and basal-like breast cancer subtype-specific genes that are associated with clinical outcomes [188], and genes associated with ER-positive breast cancer [244]. The 3rd highest ranked DISCERN gene was BRF2 (TFIIB-related factor 2). BRF2 is a known oncogene in both breast cancer and lung squamous cell carcinoma, and a core RNA polymerase III transcription factor that senses and reacts to cellular oxidative stress [90]. A GO category associated with NGF (nerve growth factor)-TrkA signaling shows the highest overlap with DISCERN genes in BRC (p-value: 3.16 × 10−104). NGF-TrkA signaling is upstream of the canonical phosphatidylinositol 3-kinase (PI3K)/AKT and RAS/mitogen-activated protein kinase (MAPK) pathways, both of which impinge on cell survival and differentiation. In the context of breast cancer, over-expression of TrkA has been connected to promoting growth and metastasis, as an autocrine factor, presumably due to its influence on PI3K-AKT
and RAS/MAPK [143]. TrkA is reportedly over-expressed in breast carcinoma relative to normal breast tissue in a majority of cases [2], supporting the high ranking of genes in this pathway by DISCERN. Taken together, these results indicate that DISCERN highly ranks genes that are connected to known phenotypic and survival-associated processes in breast cancer. However, intriguingly, the top DISCERN gene was CLNS1A (chloride nucleotide-sensitive channel 1A). This chloride channel gene has not, to our knowledge, been implicated in pathogenesis in any cancer, although it is a member of the BRCA1-related correlation network noted above. In fact, there appear to have been few studies of its function, although Entrez Gene notes that it performs diverse functions.
6.2.7 DISCERN scores reveal survival-associated genes across multiple cancer types
In this section, we focus on a quantitative assessment of DISCERN and a comparison with LNS and D-score in terms of how strongly the identified genes are enriched for genes implicated as important in the disease. Specifically, genes whose expression levels are significantly associated, positively or negatively, with survival time are often considered to be associated with tumor aggression. Identifying such genes has been considered an important problem by a number of authors; breast cancer was one of the first cancers to show promise in terms of identifying clinically relevant biomarkers [185, 243]. Here, we evaluated DISCERN based on how well it reveals survival-associated genes identified in an available independent dataset. We chose the datasets with measures of patient prognosis: AML2, BRC2, and LUAD1. AML2 and BRC2 were not used for computing any scores (DISCERN, LNS, D-score, and ANOVA). For each of these datasets we computed the survival p-values based on the Cox proportional hazards model [233], measuring the association between each gene's expression level and survival time. We defined survival-associated genes as the genes whose expression levels are associated with survival time based on the Cox proportional hazards model (p-value < 0.01) (Supplementary Table 3 from [93]). We considered the genes whose DISCERN scores are significantly high at FDR corrected
p-value < 0.05 in each cancer: 1,351 genes (AML), 2,137 genes (BRC), and 3,836 genes (LUAD). We first computed the Fisher's exact test p-values to measure the statistical significance of the overlap between these significantly perturbed genes and survival-associated genes in each of the three cancers. For each cancer, we compared with existing methods to detect network perturbation – LNS and D-score – when exactly the same number of top-scoring genes were considered (Figure 6.3A-C). Since these numbers of genes were chosen specifically for DISCERN, there is a chance that LNS and D-score would show a higher enrichment for survival-associated genes if different numbers of top-scoring genes were considered. As discussed in the previous section, performing the gene-based permutation tests to estimate the confidence of each gene's score in genome-wide data is not feasible. Instead, we compared the Fisher's exact test p-values of the three methods across a range of numbers of top-scoring genes from 0 to N (Figure 6.3D-F). Neither LNS nor D-score outperforms DISCERN in revealing survival-associated genes, even when different numbers of top-scoring genes are considered, across all cancer types. ANOVA is a well-established method for identifying differentially expressed genes across distinct conditions; DISCERN, LNS, and D-score are methods for identifying differentially connected genes across conditions. Therefore, the purpose of the comparison with ANOVA is not to evaluate DISCERN in identifying survival-associated genes as perturbed genes. The purpose is to compare between differentially expressed genes (which are commonly considered important) and perturbed genes estimated by the three methods (DISCERN, LNS, and D-score), in terms of the enrichment for genes with potential importance to the disease.
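The enrichment test used throughout this comparison can be sketched directly. The one-sided Fisher's exact test on a 2×2 overlap table is the upper tail of a hypergeometric distribution, which a few lines of standard-library Python can compute; the function and argument names are illustrative:

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N genes from M total, of which n are
    survival-associated: the one-sided Fisher's exact test tail."""
    return sum(comb(n, x) * comb(M - n, N - x)
               for x in range(k, min(n, N) + 1)) / comb(M, N)

def overlap_enrichment(top_genes, survival_genes, all_genes):
    """Enrichment p-value for the overlap between a method's top-scoring
    genes and survival-associated genes, against the gene universe."""
    top, surv, universe = set(top_genes), set(survival_genes), set(all_genes)
    k = len(top & surv)  # observed overlap
    return hypergeom_sf(k, len(universe), len(surv & universe),
                        len(top & universe))
```

Evaluating this p-value while sweeping the number of top-scoring genes from 0 to N reproduces the kind of curve shown in Figure 6.3D-F.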
For ANOVA, in Figure 6.3A-C, we considered 8,993 genes (AML), 7,922 genes (BRC) and 13,344 genes (LUAD) that show significant differential expression between cancer and normal samples at FDR corrected p-value < 0.05. The perturbed genes identified by DISCERN are more associated with survival than differentially expressed genes captured by ANOVA in AML and LUAD (Figure 6.3). In addition to the comparison with other methods – LNS and D-score – we also compare with frequently mutated genes and genes annotated to be involved in the respective
Figure 6.3: The significance of the enrichment for survival-associated genes among the identified perturbed genes. We compared DISCERN with LNS and D-score based on the Fisher's exact test p-value that measures the significance of the overlap between N top-scoring genes and survival-associated genes in each of three cancers. (A)-(C) We plotted −log10(p-value) from the Fisher's exact test when N top-scoring genes were considered by each method in 3 datasets: (A) AML (N = 1,351), (B) BRC (N = 2,137), and (C) LUAD (N = 3,836). For ANOVA, we considered 8,993 genes (AML), 7,922 genes (BRC) and 13,344 genes (LUAD) that show significant differential expression at FDR corrected p-value < 0.05. (D)-(F) We consider up to 1,500 (AML), 2,500 (BRC), and 4,000 (LUAD) top-scoring genes in each method, to show that DISCERN is better than LNS and D-score across a range of N values. The red dotted line indicates the 1,351 genes (AML), 2,137 genes (BRC), and 3,836 genes (LUAD) that are identified to be significantly perturbed by DISCERN (FDR < 0.05). We compare among 4 methods: 3 methods that identify network-perturbed genes (solid lines) and ANOVA for identifying differentially expressed genes (dotted line) in 3 cancer types.
cancer. We considered the following three gene sets: 1) a gene set constructed based on the gene-disease annotation database, Malacards [204], 2) genes known to have cancer-causing mutations based on the Cancer Gene Census [82], and 3) genes predicted to have driver mutations identified by MutSig [150] applied to The Cancer Genome Atlas (TCGA) data for the respective cancer type. The Malacards (gene set #1) and TCGA driver gene sets (#3) are generated for each cancer type – AML, breast cancer, or lung cancer. For example, for Malacards, we used the genes that are annotated to be involved in AML in Malacards to compare with the DISCERN genes identified in AML. Similarly, for the TCGA driver gene sets (#3), we used the AML TCGA data to identify the frequently mutated genes that are likely driver genes, and compared them with high DISCERN-scoring genes in AML. We used the breast cancer TCGA data for BRC, and the lung cancer TCGA data for LUAD. The Cancer Gene Census (CGC) gene set is a rigorously defined set of genes with multiple sources of evidence that its genes are cancer drivers in one or multiple cancers. For each cancer type, we compared these three sets of genes with the perturbed genes identified by DISCERN – 1,351 (AML), 2,137 (BRC), and 3,836 (LUAD) genes with high DISCERN scores – on the basis of the significance of the enrichment for survival-associated genes. Supplementary Figure 2 in [93] shows that the perturbed genes identified by DISCERN are more significantly enriched for survival-associated genes.
6.2.8 Prognostic model based on high DISCERN-scoring genes
In this section, we evaluated the DISCERN score based on how well it identifies genes that are predictive of patient prognosis. Here, we test the possibility of using the network-perturbed genes identified by DISCERN as prognostic markers. For the cancer types with at least three datasets (AML and BRC; see Table 6.1), we construct a survival time prediction model using the genes with significant DISCERN scores (AML: 1,351 genes, BRC: 2,137 genes) identified based on one dataset (Data #1: AML1 and BRC1) as described in the previous subsection. Then, we trained the prediction model using one of the other datasets (Data #2: AML2 and BRC2) not used for the computation of the DISCERN score. Finally, we tested the
prediction accuracy on the third dataset (Data #3: AML3 and BRC3). We controlled for clinical covariates whose data are available – age in the case of AML, and age, grade and subtype in the case of BRC – by adding them as unpenalized covariates into our elastic net Cox regression model. We trained the Cox regression model using Data #2 and tested the survival prediction model on Data #3. Since we evaluated the survival prediction in separate data (AML3 and BRC3) that were not used when training the survival prediction model, using more predictors, e.g., by adding clinical covariates, does not necessarily improve the prediction performance. Adding more predictors often leads to a higher chance of overfitting. Our survival prediction model based on the high DISCERN-scoring genes works at least as well as models based on the genes contained in previously established prognostic markers, such as the Leukemic Stem Cell (LSC) score [86] for AML and the MammaPrint signature (with ∼70 genes) [88] for BRC, as shown in Figure 6.4. The c-index in AML is 0.669 with standard error (se) 0.031 (Figure 6.4B); in BRC, the c-index is 0.668 (se: 0.027) (Figure 6.4D). The DISCERN-based expression marker with clinical covariates makes better predictions than clinical covariates alone.
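The c-index reported above can be computed from predicted risks and right-censored survival times; a minimal sketch of Harrell's concordance index (an O(n²) pairwise version, sufficient for illustration) is:

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index for right-censored data. A pair (i, j) is
    comparable when patient i has an observed event strictly before time j;
    it is concordant when i's predicted risk is higher. Ties in predicted
    risk count as half-concordant."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:          # censored patients cannot anchor a pair
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so c-indices of about 0.67, as above, indicate a clearly informative but imperfect marker.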
6.2.9 DISCERN explains epigenomic activity patterns in cancer and normal cell types more accurately than alternative methods
One of the possible mechanisms underlying network perturbation identified in gene expression datasets representing different conditions (e.g., cancer and normal) is the following: a transcription factor (TF) ‘X’ binds to gene ‘Y’’s promoter or its enhancer region in cancer but not in normal (or vice versa). Then, ‘X’ or its co-regulator could be an expression regulator for ‘Y’ in cancer but not in normal (or vice versa), and ‘Y’ is identified as a perturbed gene (i.e., a high DISCERN-scoring gene). It is possible that ‘X’’s binding information is not available and that ‘X’’s protein level is not reflected in its mRNA expression level; thus we cannot expect the DISCERN score of a gene inferred from expression data to be perfectly correlated with whether the gene has differential binding of a certain TF, inferred from ChIP-seq or DNase-seq data. However, the degree of correlation between the network perturbation score
Figure 6.4: Kaplan-Meier plots showing differences in the survival rate measured in AML3 (A and B) and BRC3 (C and D) between two patient groups of equal size, created based on the predicted survival time from each prediction model. We consider the model trained on the top N (N=1,351 for AML; N=2,137 for BRC) DISCERN-scoring genes and clinical covariates (blue), and the model trained on only clinical covariates (red) (panels A and C for AML3 and BRC3, respectively). (B) Comparison with the model trained using the 22 genes of a previously known prognostic marker, the LSC score [86], along with the clinical covariates (red). (D) Comparison with the model trained using 67 genes from the MammaPrint prognostic marker (70 genes) [88] along with the clinical covariates (red). We used the 67 of the 70 genes that are present in our BRC expression datasets. P-values shown in each plot are based on the log-rank test.
(DISCERN, LNS or D-score) of a gene and whether a TF differentially binds to the gene can be a way to evaluate the network perturbation scoring methods. To determine whether or how much our statistical estimates of network perturbation reflect perturbation of the underlying TF regulatory network, we queried epigenomic data from the ENCODE project. Two of the ENCODE cell lines – NB4 (an AML subtype [145]) and CD34+ (mobilized CD34-positive hematopoietic progenitor cells) – are closest to the AML and normal conditions, and DNase-seq data from these cell lines are available. We used the DNase-seq data from NB4 and the position weight matrices (PWMs) of 57 TFs available in the JASPAR database [210] to find the locations of the PWM motifs that fall in hypersensitive regions. This is a widely used approach to estimate active binding motifs using DNase-seq data when ChIP-seq data are not available. We identified the locations of these PWM motifs on the hg38 assembly using the FIMO [91] method (p-value ≤ 10−5). We then intersected these motif locations with hypersensitive regions identified by the DNase-seq data for each TF. We repeated this for the other cell line, CD34+. For each TF, we measured how well the DISCERN score of a gene can predict the differential binding of the TF in active enhancer regions (marked by H3K27Ac) within 15kb of the transcription start site (TSS) of the gene (Figure 6.5A-C) and within 5kb of the gene, between blood cancer and normal cell lines (NB4 and CD34+) (Supplementary Figure 3 from [93]). We show that the DISCERN score reflects differential binding of most of the TFs better than existing methods that identify network perturbation (LNS and D-score) and a method that identifies differentially expressed genes (ANOVA). As a way to summarize these results across all 57 TFs, we computed the Pearson's correlation between the score of each gene and the proportion of TFs that differentially bind to that gene out of all TFs that bind to that gene.
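The summary statistic just described, the correlation between each gene's score and the fraction of its bound TFs that bind differentially, can be sketched as follows; the boolean-matrix encoding and function name are assumptions for illustration:

```python
import numpy as np

def perturbation_binding_correlation(scores, binds, diff_binds):
    """Pearson correlation between each gene's perturbation score and the
    fraction of its bound TFs that bind differentially between conditions.
    scores: (n_genes,) array; binds, diff_binds: boolean (n_genes, n_tfs)
    matrices, with diff_binds a subset of binds."""
    n_bound = binds.sum(axis=1)
    keep = n_bound > 0  # restrict to genes bound by at least one TF
    frac = diff_binds[keep].sum(axis=1) / n_bound[keep]
    return np.corrcoef(scores[keep], frac)[0, 1]
```

Computing this once per scoring method (DISCERN, LNS, D-score, ANOVA) yields the comparison summarized in Figure 6.5D.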
Figure 6.5D shows that DISCERN detects genes with many TFs differentially bound between cancer and normal better than the other network perturbation detection methods (LNS and D-score) and ANOVA. Considering hypersensitive sites identified by DNase-seq data as the indication of "general" binding of TFs or other DNA-associated proteins, we assume that a gene is differentially
bound if there is a DNase signal within a 150bp window around its TSS in one condition (cancer or normal), but not in the other condition. We observe that the DISCERN scores of the genes that are differentially bound are significantly higher than those of the genes that are not (Figure 6.5E). These results suggest that DISCERN identifies possible regulatory mechanisms underlying network perturbation more accurately than existing network perturbation detection methods (LNS and D-Score) and a method for identifying differential expression levels (ANOVA). As a specific example, STAT3 has been shown to differentially regulate the mRNA expression of BATF in myeloid leukemia but not in the normal condition [216]. We found that STAT3 differentially binds to BATF in the AML cell line but not in the normal cell line, based on our differential binding analysis using the DNase-seq/motif data described above (Supplementary Table 4 from [93]). Interestingly, DISCERN identifies BATF as a perturbed gene in AML (FDR-corrected p-value < 0.05). DISCERN also identifies STAT3 as the strongest regulator for BATF in AML expression data, but STAT3 is not selected as an expression regulator in normal expression data (Supporting Information 1 from [93]). In contrast, LNS and D-Score detect STAT3 as an expression regulator of BATF in both conditions, not as a differential expression regulator. Two of the Tier 1 ENCODE cell lines – K562 (a chronic myeloid leukemia cell line) and GM12878 (a lymphoblastoid cell line) – correspond to blood cancer and normal tissues as well [234]. Tier 1 data contain the largest number of TFs with ChIP-seq datasets, which allows us to perform this kind of analysis using ChIP-seq datasets for these TFs. We repeated the same analysis with these cell lines and observed similar results (see Supplementary Figures 4 and 5 from [93]).
6.2.10 Combining DISCERN with ENCODE data improves the enrichment of known pathways
Additionally, we investigated whether one can use DISCERN as a filtering step to increase the power in a pathway enrichment analysis. We consider hypersensitive sites identified
Figure 6.5: Kolmogorov-Smirnov test p-value measuring the significance of the difference in score between genes differentially bound by the corresponding transcription factor (TF) (x-axis) and those not differentially bound by the corresponding TF. We performed the one-sided test with the alternative hypothesis that differentially bound genes have higher scores; thus a high −log10(p-value) means that high-scoring genes tend to show differential binding. The TFs are divided into 3 sets: (A) TFs that are known to be associated with leukemia, (B) TFs that are known to be associated with cancer, and (C) TFs that are currently not known to be associated with cancer or leukemia, based on the gene-disease annotation database Malacards [204]. (D) Comparison of the p-values for the Pearson's correlation between the score of each gene and the proportion of differential TFs out of all TFs bound to the gene. (E) Kolmogorov-Smirnov test (one-sided as above) p-value measuring the significance of the difference in score between the genes with differential binding purely based on the DNase-seq data and those without. Here, a differentially bound gene is defined as a gene that has a DNase signal within a 150bp window around its TSS in one condition but not in the other condition.
by DNase-seq data as the indication of "general" binding of TFs or other DNA-associated proteins, and of important regulatory events. As described above, we identified differentially regulated genes between AML and normal cell lines (NB4/CD34+) by identifying genes that have DNase-seq peaks within 150bp around the TSS in one condition (cancer or normal), but not in the other condition. There are 3,394 differentially regulated genes selected based on the DNase-seq data, of which 339 are significant DISCERN genes (Supporting Information 1 from [93]). Presumably, these disease-specific targets should be enriched for pathways or categories that will help us understand mechanisms underlying the disease. Alternatively, some targets may be spurious, especially considering experimental noise and the use of cell lines that are not a perfect match to healthy and diseased bone marrow samples. Here we attempt to identify differentially regulated genes between AML and normal samples by integrating the information from the DNase-seq data (i.e., differentially bound genes) and the significantly perturbed genes identified by DISCERN based on the expression datasets from AML samples and normal non-leukemic bone marrow samples. To show that combining these two pieces of information helps us to identify pathways that are specifically active in one condition and not in the other, we compared the significance of the enrichment for Reactome pathways, measured in fold enrichment, between 1) the 339 differentially bound DISCERN genes (intersection of the 3,394 differentially bound genes and high DISCERN-scoring genes), and 2) the 3,394 differentially bound genes. Supplementary Figure 6 in [93] shows that for most of the pathways, using the intersection of differentially bound and perturbed genes increases the fold enrichment compared to when only differentially bound genes were used (Wilcoxon p-value < 7 × 10−5).
Among the pathways, 'platelet activation signalling and aggregation' shows a significant improvement in fold enrichment: 1) when differentially bound DISCERN genes were used (f = 2.9; FDR q-value = 0.01), compared to 2) when differentially bound genes were used (f = 1.03). It has been shown that the interactions between platelets and AML cells have considerable effects on metastasis, and various platelet abnormalities have been observed in AML and other leukemias [76]. G-alpha signalling-related pathways also show a significant
boost in fold enrichment when DISCERN was used as a filtering mechanism for differentially bound genes. The 'Gq signalling pathway' shows a significant increase in fold enrichment: 1) when differentially bound DISCERN genes were used (f = 2.16; FDR q-value = 0.05), compared to 2) when differentially bound genes were used (f = 0.92). The 'G12/13 signalling pathway' shows a significant improvement in fold enrichment: 1) when differentially bound DISCERN genes were used (f = 3.4; q-value < 0.03), compared to 2) when differentially bound genes were used (f = 1.5). These pathways have been implicated in leukemias [222].
6.3 Methods
6.3.1 Data Preprocessing
Raw cell intensity files (CEL) for gene expression data in AML1, AML2, and AML3 were retrieved from GEO [11] and The Cancer Genome Atlas (TCGA). Expression data were then processed using MAS5 normalization with the Affy Bioconductor package [84], mapped to Entrez gene annotations [164] using custom chip definition files (CDF) [52], and batch-effect corrected using ComBat [126] as implemented in the sva package from CRAN. BRC1 expression data were accessed through the Broad Firehose pipeline (build 2013042100). We checked whether BRC1 processed by Firehose shows evidence of batch effects, and confirmed that the first three principal components are not significantly associated with the plate number (which we assumed to be a batch variable), indicating no strong evidence of batch effects. BRC2 and BRC3 were accessed through Synapse (syn1688369, syn1688370). All probes were then filtered and mapped using the illuminaHumanv3.db Bioconductor package [67]. Probes mapped to the same gene were collapsed by averaging if the probes being averaged were significantly correlated (Pearson's correlation coefficient greater than 0.7). LUAD1 expression data were accessed through the Broad Firehose pipeline (build 2015110100). Genes with a very weak signal were filtered out of the LUAD1 data. We then applied the voom normalization method, which is specifically designed to adjust for the poor estimate
of variance in count data, especially for genes with low counts [149]. The voom algorithm adjusts for this variance by estimating precision weights designed to account for the increased variance of observations of genes with low counts. This stabilizes the estimated distribution of RSEM values in the LUAD data, making it more normally distributed. Since the LUAD data come from different tissue source sites, we applied batch-effect correction using ComBat. For all datasets, only probes that map to genes with Entrez gene names were considered. Table 6.1 shows the number of samples and genes used in each dataset. For AML1, BRC1, and LUAD1, which were used for score computation, we split each dataset into two matrices, one with only cancerous patients and one with normal patients. These matrices were normalized to zero-mean, unit-variance gene expression levels for each gene before each network perturbation score (DISCERN, LNS, and D-score) was computed, which is a standard normalization step for accurately measuring the difference in network connectivity. For methods that measure differential expression levels (ANOVA), this normalization was not applied. Lastly, candidate regulators were identified from a set of 3,545 genes known to be transcription factors, chromatin modifiers, or genes performing other regulatory activity, which have been used in many studies on learning a gene network from high-dimensional expression data [152, 127, 215, 85] (Supplementary Table 1 from [93]).
6.3.2 DISCERN score
DISCERN uses a likelihood-based scoring function that measures, for each gene, how likely the gene is to be differentially connected with other genes in the inferred network between two conditions (e.g., cancer and normal). We model each gene's expression level with a sparse linear model. Let $y_i^{(s)}$ be the standardized expression level of gene $i$ in an individual with condition $s$ (cancer or normal), modeled as $y_i^{(s)} \approx \sum_{r=1}^{p} w_{ir}^{(s)} x_r^{(s)}$, where $x_1^{(s)}, \ldots, x_p^{(s)}$ denote standardized expression levels of candidate regulator genes in condition $s$. Standardization — normalizing the expression levels of each gene to zero mean and unit standard deviation before applying a penalized regression method — is standard practice [235, 236, 61, 211]. To estimate the weight vector $w_i^{(s)}$, lasso [235] optimizes the following objective function:
$$\operatorname*{argmin}_{w_{i1}^{(s)}, \ldots, w_{ip}^{(s)}} \; \sum_{j=1}^{n} \Big( y_{ij}^{(s)} - \sum_{r=1}^{p} w_{ir}^{(s)} x_{jr}^{(s)} \Big)^2 + \lambda \sum_{r=1}^{p} \big| w_{ir}^{(s)} \big|,$$
where the subscript $j$ iterates over all patients, used as training instances for lasso. Here, $y_{ij}^{(s)}$ corresponds to the expression level of the $i$th gene in the $j$th patient in the $s$th state, and $x_{ij}^{(s)}$ similarly corresponds to the expression level of the $i$th regulator in the $j$th patient in the $s$th state. The second term, the L1 penalty, zeroes out many irrelevant regulators for a given gene, because it is known to induce sparsity in the solution [235]. After estimating $w_i^{(s)}$ for each $s$, the DISCERN score measures how well each weight vector learned on one condition explains the data in the other condition, using a novel model selection criterion defined as:
$$\mathrm{DISCERN}_i = \frac{\text{per-sample negative log-likelihood based on } w_i^{(s)} \text{ on data in the other condition } s'}{\text{per-sample negative log-likelihood based on } w_i^{(s)} \text{ on data in the same condition } s} = \frac{\mathrm{err}_i(c, n) + \mathrm{err}_i(n, c)}{\mathrm{err}_i(c, c) + \mathrm{err}_i(n, n)}, \quad (6.1)$$
where $\mathrm{err}_i(s, s') = \frac{1}{n_s} \big\| y_i^{(s)} - \sum_{r=1}^{p} w_{ir}^{(s')} x_r^{(s)} \big\|_2^2$ and $n_s$ is the number of samples in the data from condition $s$. The numerator in Eq (6.1) measures the error of predicting gene $i$'s expression levels in cancer (normal) based on the weights learned in normal (cancer). If gene $i$ has different sets of regulators between cancer and normal, it will have a high DISCERN score. The denominator plays an important role as a normalization factor. To show that, we defined an alternative score, the $D'$ score, that uses only the numerator of the DISCERN score, Eq (6.1):
$$D'_i = \mathrm{err}_i(c, n) + \mathrm{err}_i(n, c) = \frac{1}{n_c} \Big\| y_i^{(c)} - \sum_{r=1}^{p} w_{ir}^{(n)} x_r^{(c)} \Big\|_2^2 + \frac{1}{n_n} \Big\| y_i^{(n)} - \sum_{r=1}^{p} w_{ir}^{(c)} x_r^{(n)} \Big\|_2^2. \quad (6.2)$$
The first step of calculating the DISCERN score and the $D'$ score is to fit a sparse linear model (such as lasso [235]) for each gene's expression level. We used the scikit-learn Python
package (version 0.14.1) to calculate these scores, with the sparsity tuning parameter λ chosen using 5-fold cross-validation.
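As a concrete illustration, the two-step computation — fit a lasso model per condition, then form the error ratio of Eq (6.1) — can be sketched with scikit-learn. The sample counts, regulator count, and toy expression data below are invented for illustration; the actual pipeline operates on the standardized expression matrices and the full candidate regulator set described above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

def fit_weights(X, y):
    """Sparse linear model for one gene; lambda chosen by 5-fold CV, as in the text."""
    return LassoCV(cv=5, random_state=0).fit(X, y).coef_

def per_sample_err(y, X, w):
    """err_i(s, s'): mean squared residual when predicting y with weights w."""
    return np.mean((y - X @ w) ** 2)

# Toy data: 120 cancer / 20 normal samples, 30 candidate regulators, one target gene.
# The target depends on regulator 0 in cancer but regulator 1 in normal (perturbed).
Xc = rng.standard_normal((120, 30)); yc = 0.8 * Xc[:, 0] + 0.3 * rng.standard_normal(120)
Xn = rng.standard_normal((20, 30));  yn = 0.8 * Xn[:, 1] + 0.3 * rng.standard_normal(20)

wc, wn = fit_weights(Xc, yc), fit_weights(Xn, yn)

# Eq (6.1): cross-condition prediction error over same-condition prediction error.
discern = (per_sample_err(yc, Xc, wn) + per_sample_err(yn, Xn, wc)) / \
          (per_sample_err(yc, Xc, wc) + per_sample_err(yn, Xn, wn))
print(f"DISCERN score for the toy gene: {discern:.2f}")  # well above 1 for a perturbed gene
```

Because the toy gene's regulators differ between conditions, the cross-condition errors in the numerator dominate, yielding a score above 1.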
ANOVA score computation
Analysis of Variance (ANOVA) is a standard statistical technique for measuring the statistical significance of the difference in means between two or more groups of numbers. For each gene, the one-way ANOVA test produces a p-value from the F-test, which measures how significantly its expression level differs between conditions (e.g., cancer and normal). The ANOVA score was computed as the negative logarithm of the p-value obtained from the one-way ANOVA test, using the f_oneway function in the scipy.stats Python package.
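For instance, the ANOVA score for a single gene can be computed as follows (the expression values here are synthetic; the real analysis iterates this over every gene):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
cancer = rng.normal(loc=1.0, scale=1.0, size=100)  # toy expression levels, cancer samples
normal = rng.normal(loc=0.0, scale=1.0, size=25)   # toy expression levels, normal samples

f_stat, p_value = f_oneway(cancer, normal)         # one-way ANOVA F-test
anova_score = -np.log10(p_value)                   # ANOVA score as defined in the text
```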
PLSNet score computation
The PLSNet score attempts to measure how likely each gene is to be differentially connected with other genes between conditions. It was computed using the dna R package, version 0.2-1 [87]. The network perturbation score for each gene is computed based on the empirical p-value from 1,000 permutation tests.
LNS score computation
In Guan et al. (2013) [99], the authors defined the local network similarity (LNS) score for gene $i$ as the correlation of the Fisher's z-transformed correlation coefficients between the expression of gene $i$ and all other genes across the two conditions:
$$\mathrm{LNS}_i = \operatorname{corr}\big(\operatorname{arctanh}(c_{ij}^{n}),\ \operatorname{arctanh}(c_{ij}^{c})\big), \quad (6.3)$$
where $c_{ij}^{s}$ represents the correlation coefficient between the expression levels of genes $i$ and $j$ in condition $s$ ($s = n$ for normal and $s = c$ for cancer).
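Eq (6.3) translates directly into NumPy; the self-correlation must be dropped before the Fisher transform, since arctanh(1) is infinite. The toy data below (the same underlying signal in both conditions) is invented to show the unperturbed case:

```python
import numpy as np

def lns_score(expr_n, expr_c, i):
    """LNS score for gene i (Eq 6.3): correlation, over genes j != i, of the
    Fisher z-transformed correlations of gene i in normal vs. cancer."""
    cn = np.corrcoef(expr_n, rowvar=False)[i]  # gene i's correlation profile, normal
    cc = np.corrcoef(expr_c, rowvar=False)[i]  # gene i's correlation profile, cancer
    mask = np.arange(len(cn)) != i             # drop self-correlation: arctanh(1) = inf
    return np.corrcoef(np.arctanh(cn[mask]), np.arctanh(cc[mask]))[0, 1]

rng = np.random.default_rng(2)
base = rng.standard_normal((200, 10))                 # shared underlying signal
expr_n = base + 0.1 * rng.standard_normal((200, 10))  # "normal" condition
expr_c = base + 0.1 * rng.standard_normal((200, 10))  # "cancer" condition, same network
score = lns_score(expr_n, expr_c, 0)                  # close to 1: gene 0 is unperturbed
```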
D-score score computation
For synthetic data analysis, we also introduced a D-score, computed as follows (as used in Wang et al. (2009) [252]):
$$D_i = \big\| d_{ij}^{n} - d_{ij}^{c} \big\|_1, \quad (6.4)$$
where $d_{ij}^{s}$ is a normalized correlation (normalized to have zero mean and unit variance across genes) between genes $i$ and $j$ in condition $s$, also known as the Glass' d score [89].
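A sketch of Eq (6.4) under one reading of the normalization — zero mean, unit variance across each gene's correlation profile, which is an assumption on our part since the text does not pin down the exact normalization axis:

```python
import numpy as np

def d_score(expr_n, expr_c, i):
    """D-score for gene i (Eq 6.4): L1 distance between its normalized
    correlation profiles in the two conditions."""
    def normalized_profiles(expr):
        c = np.corrcoef(expr, rowvar=False)
        # normalize each row (a gene's profile) to zero mean, unit variance across genes
        return (c - c.mean(axis=1, keepdims=True)) / c.std(axis=1, keepdims=True)
    return np.abs(normalized_profiles(expr_n)[i] - normalized_profiles(expr_c)[i]).sum()

rng = np.random.default_rng(3)
expr_n = rng.standard_normal((150, 8))  # toy expression, normal condition
expr_c = rng.standard_normal((60, 8))   # toy expression, cancer condition
score = d_score(expr_n, expr_c, 0)      # non-negative; larger means more perturbed
```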
Synthetic data generation
We generated 100 pairs of datasets, each pair representing disease and normal conditions. Each pair of datasets contains 100 variables drawn from a multivariate normal distribution with zero mean and covariance matrices Σ1 and Σ2. The two datasets contain n1 and n2 samples, respectively, where n1 is randomly selected from a uniform distribution between 100 and 110, and n2 from a uniform distribution between 16 and 26. This difference between n1 and n2 reflects the ratio of cancer samples to normal samples in the gene expression data (Table 6.1). For each of the 100 pairs of datasets, we divided the 100 variables into the following three categories: 1) variables that have different sets of edge weights with other variables across the two conditions, 2) variables that have exactly the same sets of edge weights across the conditions, and 3) variables not connected with any other variables in categories 2) and 3) in both conditions. For example, in Figure 6.1A, '1' is in category 1) (i.e., perturbed genes), '2', '4', '6', and '7' are in category 2), and '3' and '5' are in category 3). In each of the 100 pairs of datasets, the number of genes in category #1 (perturbed genes), p, is randomly selected from a uniform distribution between 5 and 15. The number of genes in each of the other two categories, #2 and #3, is determined as (100 − p)/2.
We describe below how we generated the network edge weights (i.e., the elements of $\Sigma_1^{-1}$ and $\Sigma_2^{-1}$) among the 100 variables. To ensure that only the genes in category #1 have differing edge weights between the two conditions, we generated two p × p matrices, X1 and X2, with elements randomly drawn from a uniform distribution between −1 and 1. Then, we generated symmetric matrices, $X_1^{\top} X_1$ and $X_2^{\top} X_2$, and added positive values to the diagonal elements of these symmetric matrices if the minimum eigenvalue was negative – a commonly used method to generate positive definite matrices [27]. These become the submatrices of $\Sigma_1^{-1}$ and $\Sigma_2^{-1}$ for these p variables. Similarly, we generated a common submatrix for the variables in category #2 – variables that have the same edge weights with other variables across conditions. Variables in category #3 have the identity matrix as the inverse covariance matrix among the variables in that category. Finally, we added zero-mean Gaussian noise to each element of $\Sigma_1^{-1}$ and $\Sigma_2^{-1}$, where the standard deviation of the Gaussian noise is randomly selected between 0.5 and 5. This procedure yields datasets of varying levels of difficulty in terms of high dimensionality and network perturbation, which provides an opportunity to compare the average performance of the methods in various settings.
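The generation procedure for one pair of datasets can be sketched as follows. The final element-wise Gaussian noise step is omitted here, since preserving positive definiteness under that perturbation needs extra care, and the diagonal-boost rule is slightly strengthened (boosting near-zero as well as negative eigenvalues) so the precision matrices are safely invertible:

```python
import numpy as np

rng = np.random.default_rng(4)

def random_pd_block(p):
    """X^T X plus a diagonal boost -- a common way to build a positive definite matrix."""
    X = rng.uniform(-1, 1, size=(p, p))
    S = X.T @ X
    min_eig = np.linalg.eigvalsh(S).min()
    if min_eig < 0.1:                     # guard against zero/negative eigenvalues
        S += (0.1 - min_eig) * np.eye(p)
    return S

p = int(rng.integers(5, 16))             # category #1: number of perturbed genes
k = (100 - p) // 2                       # size of categories #2 and #3
prec1, prec2 = np.eye(100), np.eye(100)  # category #3 keeps the identity block
prec1[:p, :p] = random_pd_block(p)       # condition 1 gets its own perturbed block...
prec2[:p, :p] = random_pd_block(p)       # ...condition 2 a different one
shared = random_pd_block(k)
prec1[p:p + k, p:p + k] = shared         # category #2: identical across conditions
prec2[p:p + k, p:p + k] = shared

n1 = int(rng.integers(100, 111))         # cancer-like sample size
n2 = int(rng.integers(16, 27))           # normal-like sample size
data1 = rng.multivariate_normal(np.zeros(100), np.linalg.inv(prec1), size=n1)
data2 = rng.multivariate_normal(np.zeros(100), np.linalg.inv(prec2), size=n2)
```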
Conservative permutation tests
To generate a conservative null distribution, we performed permutation tests by randomly reassigning cancer/normal labels to each sample, preserving the total numbers of cancer and normal samples. The correlation structure among genes is preserved because every gene is assigned the same permuted label in each permutation test. We then computed the DISCERN score for a random subset of 300 genes. We repeated this process to obtain over one million DISCERN scores, forming a stable null distribution that was used to compute empirical p-values.
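The label-permutation scheme can be sketched generically; here `score_fn` stands in for the DISCERN computation (the toy score below is a simple mean difference, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

def permutation_null(expr, labels, score_fn, n_perm=200):
    """Null distribution from shuffling cancer/normal labels across samples.
    The gene-gene correlation structure is preserved because every gene sees
    the same permuted labels in each round."""
    null = []
    for _ in range(n_perm):
        perm = rng.permutation(labels)   # preserves the cancer/normal counts
        null.append(score_fn(expr[perm == 1], expr[perm == 0]))
    return np.array(null)

def empirical_pvalue(observed, null):
    """Empirical p-value with the standard +1 correction (never exactly zero)."""
    return (1 + np.sum(null >= observed)) / (1 + len(null))

expr = rng.standard_normal((60, 1))            # toy expression: 60 samples, 1 gene
labels = np.array([1] * 40 + [0] * 20)         # 40 "cancer", 20 "normal"
score = lambda a, b: abs(a.mean() - b.mean())  # stand-in for the DISCERN score
null = permutation_null(expr, labels, score)
p = empirical_pvalue(score(expr[labels == 1], expr[labels == 0]), null)
```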
Identifying survival-associated genes
For the survival-associated gene enrichment analysis, we first computed the association between survival time and each gene's expression level. Genes with a p-value from the Cox proportional hazards model (computed using the survival R package) smaller than 0.01 were considered significantly associated with survival. These include 1,280 genes (AML), 1,891 genes (BRC) and 1,273 genes (LUAD) (Supplementary Table 3 from [93]). Statistical significance of the overlap with the top N DISCERN-, LNS-, D-score- and ANOVA-scoring genes was computed using the Fisher's exact test, based on the hypergeometric distribution function from the scipy.stats Python package [129].
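The overlap test is a one-sided hypergeometric tail probability. The counts below are hypothetical (only the 1,280 survival-associated AML genes and the 1,351 top DISCERN genes come from the text; the universe size and overlap are invented):

```python
from scipy.stats import hypergeom

def overlap_pvalue(n_universe, n_set_a, n_set_b, n_overlap):
    """P(overlap >= observed) if the two gene sets were drawn independently
    from a universe of n_universe genes (hypergeometric tail)."""
    return hypergeom.sf(n_overlap - 1, n_universe, n_set_a, n_set_b)

# Hypothetical: 15,000-gene universe, 250 genes in the observed overlap.
p = overlap_pvalue(15000, 1280, 1351, 250)   # expected overlap by chance is ~115
```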
Gene sets previously known to be important in cancer
We presented the results of the comparison with three sets of genes that are known to be important in cancer (Supplementary Figure 2 from [93]). Here, we describe how we obtained these gene sets. First, the Malacards gene sets were constructed based on data from the malacards.org website, accessed in September 2012. Second, we used a set of 488 genes downloaded from the Catalogue of Somatic Mutations in Cancer website (CGC) [82]. For each cancer type, we considered the intersection between this list and the genes present in the expression data. Finally, a set of genes likely to contain driver mutations selected by MutSig was defined as those passing a q < 0.5 threshold based on the 20141017 MutSig2.0 report from Broad Firehose.
Cross-dataset survival time prediction
To evaluate the performance of the DISCERN score in identifying genes to be used in a prognosis prediction model, we trained the survival prediction model on one dataset and tested it on an independent dataset (Figure 6.4). To train the survival prediction model, we used elastic net regression (α = 0.5) via the glmnet CRAN package (version 1.9-8). Available clinical covariates – age for AML; and age, grade, and subtype for BRC – were added as unpenalized covariates. The regularization parameter λ was chosen using the built-in cross-validation function. Testing was always performed on held-out samples from an independent dataset that was not used for training. For comparison, we trained the prediction model using the 22 LSC genes [86] with age in AML, and 67 genes from the 70-gene signature [88] (3 genes from the signature were missing in the dataset we were using) with clinical covariates (age, stage, and subtype) in BRC, as shown in Figure 6.4B and Figure 6.4D, respectively.
Epigenomics analysis
The Encyclopedia of DNA Elements (ENCODE) is an international collaboration providing transcription factor binding and histone modification data in hundreds of different cell lines [70]. Data for ENCODE analysis were accessed through the UCSC Genome Browser data matrix [132] and processed using the BedTools and pybedtools packages [199, 53]. Two of the ENCODE cell lines – NB4 (an AML subtype [145]) and CD34+ (mobilized CD34 positive hematopoietic progenitor cells) – are closest to AML and normal conditions, and the DNase-seq data from these cell lines are available.
For each cell line, we used the DNase-seq data and the position weight matrices (PWMs) of 57 transcription factors (TFs) available in the JASPAR database [210] to find the locations of the PWM motifs that fall in hypersensitive regions. We identified the locations of these PWM motifs on the hg38 assembly by using FIMO [91] (p-value ≤ 10−5). We then intersected these motif locations with hypersensitive regions identified by the DNase-seq data for each TF. We repeated this process to identify active binding motifs of the 57 TFs in each of the cell lines, NB4 and CD34+.
For each TF, we identified the genes the TF differentially binds to between cancer and normal cell lines. We assumed that a certain TF is bound near a gene if the center of the peak is in an active enhancer region (marked by H3K27Ac) within 15kb of the transcription start site (TSS) of the gene, or within 5kb around the gene's transcription start site. We show that for most of the TFs, differentially bound genes have significantly higher DISCERN scores than those that are not (Figure 6.5A-C).
The differential regulator score for each gene was computed by taking the number of differentially bound TFs and dividing it by the total number of TFs bound to the gene in any condition. We show that the differential regulator score is highly correlated with the DISCERN score (Figure 6.5D). For the DNase-based analysis (Figure 6.5E), we defined a gene to be differentially regulated if hypersensitive sites detected by DNase-seq are within 150bp upstream of the gene in one condition and not in the other.
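Per gene, this score reduces to a simple set computation; the TF binding sets below are hypothetical:

```python
def differential_regulator_score(tfs_cancer, tfs_normal):
    """Number of differentially bound TFs (bound in exactly one condition)
    over the number of TFs bound in either condition."""
    bound_any = tfs_cancer | tfs_normal
    differential = tfs_cancer ^ tfs_normal   # symmetric difference
    return len(differential) / len(bound_any) if bound_any else 0.0

# Hypothetical binding sets for one gene.
score = differential_regulator_score({"STAT3", "FOS", "JUN"}, {"FOS", "JUN", "MAX"})
# STAT3 and MAX are differential, out of 4 TFs bound in any condition -> 0.5
```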
Reactome enrichment and DISCERN filtering
A set of 605 Reactome pathways was downloaded from the Broad Molecular Signature Database (MSigDB) [229]. We postulate that hypersensitive sites identified by DNase-seq in a particular cell line indicate regions where important regulatory events occur, such as transcription factor binding. We constructed the list of differentially regulated genes by comparing the hypersensitive sites identified by DNase-seq data between cancer and normal cell lines within 150bp upstream of the TSS of each gene. For each pathway, we computed the fold enrichment (= the number of genes in the intersection of the two groups of genes, divided by the number of genes expected in the intersection by random chance), which measures the significance of the overlap between the genes in the pathway and the identified differentially regulated genes. We compared this fold enrichment with that obtained when the genes in the intersection of the differentially regulated genes and the 1,351 significantly perturbed genes identified by DISCERN were used (Supplementary Figure 6 in [93]). To reduce noise, we only considered pathways that had ≥ 5 genes in the overlap before filtering. The p-values were then FDR-corrected for multiple hypothesis testing. Although p-values would measure the significance of the overlap between a gene set and a pathway, we used the fold enrichment as the measure of significance because we compared a set of genes with another set of much smaller size.
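Fold enrichment, as used above, is simply the observed overlap divided by the overlap expected by chance; the counts below are hypothetical:

```python
def fold_enrichment(n_universe, n_pathway, n_selected, n_overlap):
    """Observed intersection size over the size expected by random chance."""
    expected = n_pathway * n_selected / n_universe
    return n_overlap / expected

# Hypothetical: 15,000-gene universe, a 100-gene pathway,
# 339 differentially bound DISCERN genes, 7 of them in the pathway.
f = fold_enrichment(15000, 100, 339, 7)   # expected overlap by chance is ~2.26
```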
6.4 Summary
In this chapter, we have presented a general computational framework for identifying perturbed genes, i.e., genes whose network connections with other genes are significantly different across conditions, and tested the identified genes against statistical and biological benchmarks on multiple human cancers. Our method outperforms existing alternatives, such as LNS, D-score, and PLSNet, based on synthetic data experiments and through biological validation performed using seven distinct cancer genome-wide gene expression datasets, gathered on five different platforms and spanning three different cancer types – AML, breast cancer, and lung cancer. We have demonstrated that DISCERN is better than other network-perturbation identification methods at identifying genes known to be or potentially important in cancer, as well as genes that are subject to differential binding of transcription factors according to the ENCODE DNase-seq data. We have also demonstrated a method for using DISCERN scores to boost the signal in enrichment tests of targets of differential regulation constructed using DNase-seq data available through the ENCODE Project. The work in this chapter was first reported in [93]. The resulting DISCERN scores for each gene in AML, BRC, and LUAD, the implementation of DISCERN, and the data used in the study are freely available on our website http://discern-leelab.cs.washington.edu/.
Chapter 7
ENABLING REPRODUCIBILITY BY USING CLAIM-AWARE DATA INTEGRATION
7.1 Introduction
In response to increasing awareness of reproducibility issues in many fields of science [124, 83, 167, 123], funding agencies and publishers have made significant investments to create public data repositories (e.g. [68]). In some fields, authors are required to submit data supporting their findings as a condition of publication or funding [24]. But while these policies have been successful in forcing researchers to make data public, there is little evidence that the data are being reused to any significant degree. For example, the number of datasets deposited in the Gene Expression Omnibus (GEO) has been growing superlinearly over time, but the number of datasets used in integrative studies (those referring to at least 3 different datasets) has been lagging behind (Figure 7.1). Moreover, the number of datasets referenced per paper has not increased significantly (with the mean rising from 1.7 to just 2.3 over the last 10 years), suggesting that the friction of finding, cleaning, and integrating other people's data remains prohibitively expensive. Cultural pressures also limit the reuse of data; those who reuse other people's data have been labeled "data parasites" [160, 96]. Existing services over these public data repositories amount to little more than keyword search [37], which itself performs poorly. A search for datasets for common tissue types in GEO, for example "liver", yields a precision of just 40% and a recall of 57% (estimated by comparing search results with manually curated tissue annotations provided by Lee et al. [153]). Some projects have attempted to improve dataset discovery in open data repositories by automatically profiling data at scale, but these approaches offer no help in integrating data once relevant data is found, and do nothing to encourage reuse in the first place [153, 94].
Projects aimed at improving reproducibility tend to emphasize tools for scientists to use beginning from the early phases of their data analysis, requiring significant changes to researchers' behavior and workflow [190, 191, 77]. We advocate a more direct approach that does not rely on influencing researchers' behavior. Researchers publish papers and upload datasets as usual. Each claim from the paper can be represented in a machine-readable form (automatically extracted in some cases, or manually expressed where necessary), and the experimental conditions implied by the claim are aligned to the schema of a relevant dataset. Once the claim is expressed and the schema has been aligned to it, a statistical test based on the extracted claim can be applied and the results aggregated. We apply this process first to the authors' own uploaded data, addressing the schema misalignment and information extraction issues resulting from inconsistent use of terms between the text of the paper and the attributes in the data itself. We extract a set of candidate claims and possible schema matchings, then prune this set of candidates to those that provide statistical evidence in support of the claim. Then we can attempt to generalize the claim by gathering relevant subsets of data that were collected under consistent experimental conditions and re-running the relevant statistical test. An overview of our approach is shown in Figure 7.2.
Example 1 A researcher uploads a new paper reporting a significant correlation between the expression levels of gene X and gene Y in the lung tissue of smokers. The researcher also uploads the dataset D on which the analysis was carried out. For simple claims like this one, we can often extract candidate claims directly from the paper. For more complex claims, authors, repository administrators, reviewers, or reproducibility researchers can express the claim in a domain-specific language for capturing simple statistical tests. In this case, the claim is written as X ∼ Y | lung, smoker ∈ D. The more general claim implied by this result is that this correlation holds for all smokers, so we can drop the dataset reference and just write X ∼ Y | lung, smoker. But the dataset uses an uninformative numeric code to indicate smokers; the terms used in the paper do not necessarily match the attribute names in the
Figure 7.1: Cumulative number of datasets available in GEO over the years, compared with the number of datasets ever cited by reuse papers in the Open Access portion of PubMed Central. Obtained by automatic text mining following the method described in Piwowar et al.
dataset. We therefore adapt this claim by finding a mapping between the schema implied by the claim and the schema of the dataset. Our approach is to assume that claims are true on the uploaded dataset, and to select from among the candidate extracted claims and candidate mappings by evaluating the statistical test. Finally, we can validate the generalized claim by finding relevant data items in the broader repository, imputing the conditions (e.g., smoker) when necessary using a Bayesian model, and evaluating a broader experiment. The result is a report showing the confidence interval of the claim against the fused knowledge in the overall repository. A key challenge lies in the semantic heterogeneity between a) the language used in the paper to describe the claim, b) the schema of the dataset associated with the paper, and c) the schemas of other relevant datasets. Our key insight is that we can use the stated claim to help resolve semantic heterogeneity in all three contexts. The claim language is naturally extensible; each new type of claim needs only a corresponding statistical test that can be evaluated over the datasets available in a repository.
The basic idea — to use the validity of the statistical test to disambiguate difficult problems in semantic information extraction and data integration — is independent of the nature of the claim. In this work, we study correlation claims (that two variables are correlated under certain conditions) and mean shift claims (that a single variable has statistically different average values under two different conditions). These two types of claims are sufficient to model the results in hundreds of papers in our application of interest. We use the term "mean shift" to refer to a very general type of claim asserting a statistically significant difference between two variables under similar conditions: treatment outcomes in biomedicine, algorithm performance in computer science, Likert-scale survey responses in user studies, economic outcomes in political science, etc. Our approach does not depend crucially on the specific statistical test; for example, a chi-square test compares frequencies for categorical variables, and could be added to the language. We focus on t-tests in this work, as they cover the great majority of claims we see in practice. The term replicability typically refers to repeating the exact experiment presented in a paper, while reproducibility implies collecting new data to support or challenge a previous conclusion [36]. We address both concepts: we first attempt a simplified (yet automated) replication strategy on the authors' own data, then attempt an automated reproducibility study using other data in the repository. To avoid overloading the term reproducibility, we refer to this second step as generalization. In this work, we consider this approach in the context of transcriptomics studies that compare gene expression levels across treatment groups, demographic groups, etc.
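To make the idea concrete, a correlation claim like X ∼ Y | lung, smoker can be represented and tested with very little machinery. This is a hypothetical sketch, not Cedalion's actual claim language: the class name, the rows-as-dictionaries dataset representation, and the fixed significance threshold are all our inventions for illustration.

```python
from dataclasses import dataclass
import numpy as np
from scipy import stats

@dataclass
class CorrelationClaim:
    """Represents "x ~ y | conditions": x and y are positively correlated
    in the samples matching every key/value pair in `conditions`."""
    x: str
    y: str
    conditions: dict

    def test(self, dataset, alpha=0.05):
        rows = [r for r in dataset
                if all(r.get(k) == v for k, v in self.conditions.items())]
        xs = [r[self.x] for r in rows]
        ys = [r[self.y] for r in rows]
        r_val, p_val = stats.pearsonr(xs, ys)
        return bool(p_val < alpha and r_val > 0)

# Toy dataset in which the claim holds by construction.
rng = np.random.default_rng(6)
data = []
for _ in range(50):
    x = rng.standard_normal()
    data.append({"tissue": "lung", "smoker": True,
                 "GENE_X": x, "GENE_Y": x + 0.1 * rng.standard_normal()})

claim = CorrelationClaim("GENE_X", "GENE_Y", {"tissue": "lung", "smoker": True})
print(claim.test(data))  # True: the correlation holds on the matching samples
```

A mean-shift claim would be represented analogously, with a t-test in place of the correlation test.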
Multiple methods for measuring gene expression exist (RNA-seq and microarray chips being the two most prominent); these methods have matured significantly, but were originally prone to severe reproducibility concerns. In some cases, the lab technician operating the equipment was a better predictor of the outcome than the treatment being studied! In part due to this history of poor reproducibility, the Gene Expression Omnibus (GEO) was established to promote data sharing and improve reuse. But as shown in Figure 7.1, the rate of reuse is still low: the number of datasets is growing superlinearly over time, but instances of reuse
for those datasets are growing much more slowly. To evaluate our algorithms, we use a set of correlation claims in genomics extracted by Poon et al. [196] as well as a corpus of manually encoded mean-shift claims. Cedalion then provides a way to represent these claims in a domain-specific language, and to use these claim expressions to invoke repository-wide meta-experiments. Cedalion is enabled in part by our previous work on machine learning techniques to provide high-quality tissue annotations for human samples in GEO [94]. We use these annotations to filter out irrelevant datasets from the repository during the claim generalization step. The future combined system built using these components can take scientific papers and a public repository as input and produce a report summarizing the evidence for or against the claims in the paper. Some other proposed systems attempt to automate the discovery process itself, under the banner of hypothesis generation [65, 42, 225]. However, these approaches have a fundamental vulnerability to finding and reporting spurious results — so-called p-hacking. That is, mining for correlations in a large data repository will always produce strong but meaningless signals. In contrast, our system is conservative in the sense that we only gather evidence related to published claims that have undergone peer review. Our contributions are as follows:
• A model of the data integration problem in the context of replicability and reproducibility studies against public data repositories.
• A claim-aware algorithm ClaimCheck for replicability that uses the statistical test implied by the claim to prune the search space for schema matching.
• An algorithm ClaimJump that uses shared information between datasets to impute missing attributes for use in data integration and to generalize the claim for validation against experimentally valid datasets.
• An end-to-end system prototype Cedalion that combines both algorithms together with our prior work from [196, 94] to generate reports summarizing the evidence for and against the genomics claim.
• An evaluation of these methods against the Gene Expression Omnibus repository, including a new assessment of hundreds of claims in published papers.
An example of the role Cedalion can play: a 2013 study [134] found a link between the Chemokine ligand 9 gene (CXCL9) and acute rejection of kidney transplants using various datasets available at the time. Cedalion automatically found and integrated relevant data from 2014, uploaded after the original study was published, to confirm the results, suggesting that Cedalion can be used to continuously and automatically monitor claims as new data becomes available.
The rest of this chapter is organized as follows. In Section 7.2 we consider related work. In Section 7.3 we describe the claim validation problem setting precisely and derive the specific sub-problems we address in this chapter. In Section 7.4 we describe the core algorithms addressing the sub-problems, specialized where noted to our application setting. In Section 7.5 we evaluate these algorithms against a gene expression repository, finding that our methods provide a way to include claims in the solution of large-scale scientific data integration and claim validation problems. In Section 7.6 we discuss our conclusions and position them within a broader context of improving the robustness of science based on statistical arguments.
7.2 Related work
Hypothesis Discovery Duggan et al. provide a cogent discussion of the hypothesis generation problem and propose a solution, but no system has been built, and the approach is fundamentally vulnerable to spurious results and p-hacking [65]. Chirigati et al. describe a novel topology-based method for finding patterns across datasets with shared time and space; this approach is compatible with our framing of the problem, but again lacks guidance from hypotheses developed by domain experts and is therefore vulnerable to spurious results, as
[Figure: Cedalion pipeline. Claims extracted from the paper (correlation: X ∼ Y; mean shift: X | C1 > X | C2) are mapped to the author’s dataset by ClaimCheck; ClaimJump imputes missing attributes on related datasets from the repository; the output is a report card with replication results.]
Figure 7.2: Overview of Cedalion system
has been pointed out [42]. Zhao et al. developed new methods for controlling for spurious results by adjusting the family-wise error rates [263]. Spangler et al. [225] mine knowledge from papers and use this information to identify potential new hypotheses to guide future research, but make no effort to incorporate public data to verify the claims.
Workflow systems A number of systems aim to provide upstream tools to standardize the analysis process and improve reproducibility as a result, but require that scientists change their behavior to use them [33, 183, 10]. In contrast, we emphasize server-side tools that operate repository-wide to support reproducibility studies.
Code-sharing systems A number of systems allow researchers to remain in their preferred programming environments, but provide services for sharing and reproducibility. Jupyter Notebooks [139] simplify code sharing and data sharing, and have been widely adopted, but make no effort to adapt results to other data, nor to link the communicable claims made in
papers to the code and data used to implement the analysis; they are “expert tools.” The noWorkflow system augments the Python programming environment to collect provenance information [178, 193]. ReproZip and an earlier system CDE monitor access to files and shared libraries, automatically packaging them to allow re-execution in different environments [100, 43].
Reproducibility Studies Several papers have described studies to measure the extent of the reproducibility problem in various fields, but these studies are primarily manual, or limit their studies to papers whose authors have conformed to strict guidelines for reproducibility [124, 83]. In recognition of these concerns, various fields are enacting top-down standards to enforce reproducibility mandates, including SIGMOD reproducibility efforts [77].
Knowledge Fusion Web-scale knowledge extraction [62, 63, 64] presents opportunities driven by the fact that the same information is repeated multiple times across different web pages. In the scientific domain, we do not have the same degree of redundancy to exploit; studies are expensive and researchers are strongly incentivized to publish unique, novel results. Survey papers and meta-analysis studies are exceptions, but usually only cover a tiny fraction of the overall hypothesis space in the literature and require intense manual effort to explore a single question. Complementary to deep-yet-manual meta-analysis studies, we emphasize broad-yet-automated tools to increase the value of shared data repositories even for those questions where no meta-analysis is likely to be undertaken.
Data Cleaning A large number of methods and tools for data cleaning have been proposed, but these approaches focus on errors in the datasets (typically in the enterprise setting), generally assume the existence of ground truth information that can inform cleaning, and do not consider the statistical claims as a source of information to guide cleaning [54, 142].
Data integration The problem of matching schemas between different relations has been extensively studied (c.f. [201, 20]), but most approaches tend to rely on information only available in conventional enterprise settings, such as a large amount of instance-level data [59], engineered schemas with rich constraints, expressive query languages, or explicit require- ments for Extract Transform Load (ETL) pipelines. In our “weakly structured” setting, we do not have access to any of this information, but we do have access to task-specific information in the form of peer-reviewed scientific claims.
Semantic parsing The problem of claim extraction from free text can be cast as a semantic parsing problem where a logical form is being induced directly from text. Approaches to semantic parsing range from unsupervised [195] to supervised by question-answer pairs [15] to a supervision through schema matching in a large labeled database [31]. We consider these methods complementary to our core approach of using statistical claims to guide information extraction and data integration.
Hypothesis validation and discovery systems in genomics Several systems were recently proposed in the functional genomics space, implemented as Shiny R applications, but they all suffer limitations due to the heterogeneous and unstructured nature of the datasets in GEO. ScanGEO [140] is limited to the manually curated portion of GEO, comprising less than 10% of all samples. ShinyGEO [66] allows one to analyze any available dataset but is limited to one dataset at a time and requires users to provide an exact mapping of their query to the underlying schema of the data through a GUI. GEOracle [58] uses machine learning and text-mining techniques to extract perturbation experiments from GEO datasets and visualize the results of the differential expression. None of these approaches supports automatic extension of an analysis to related data or works directly from the scientific claim to formulate the data-analysis hypothesis.
Meta-analysis studies in genomics Tseng et al. [240] analyzed over 300 transcriptomics meta-analysis studies, finding that 66% of them focused on gene differential expression (encoded by our “mean shift” claim type) and a further 10% on gene co-expression (encoded by our correlation claim type). Microarray meta-analysis is fraught with issues of data quality, heterogeneity, and batch effects [203, 30]. We set those issues aside in the current study to focus on our claim-centric methods; we do so by using a consistent corpus of datasets collected from a single genomic expression platform [94]. Extension to a wider range of platforms is an engineering exercise left as future work.
7.3 Problem Definition
In this section we make the problem setting precise and define the two sub-problems we address in this chapter. A repository R is a population of data items: people in the case of a repository of survey results, or tissue samples in the case of a repository of gene expression studies. All data items in the repository have a (potentially very large) set of unobserved attributes X ∪ L, where X is a set of numeric measured attributes over which statistical claims are asserted (e.g., income in the case of surveys or the gene expression level for a particular gene) and L is a set of labeling attributes, e.g., demographic information, tissue types, or experimental conditions. That is, we can consider the entire repository as a large relation with an unobserved “true” schema.
A dataset D ⊂ R is a subset of the repository with a set of observed attributes XD ∪ LD. Metadata provided at the dataset level are captured as constant attributes. For example, if a dataset consists entirely of liver tissue samples, a constant attribute tissue is assumed, with a value of liver for every record in the dataset. We assume every labeling attribute is binary; non-binary labeling attributes can be pivoted into binary attributes. For example, an attribute hair color with domain {blond, red, black} can be pivoted into three binary attributes hair color:blond, hair color:red, and hair color:black, all with domain {true, false}. We will at times refer to a binary
attribute as a selector. A general claim is a pair (Q, p), where Q is a boolean query over the unobserved numeric attributes X and p is a predicate over the labeling attributes L. The semantics of a general claim is an assertion that the query Q evaluates to true on a virtual dataset consisting of all data items in R that satisfy the predicate p. An observational claim is a triple (Q, p0, D): a general claim associated with a specific dataset D ⊂ R. The semantics of an observational claim is an assertion that the query Q evaluates to true over the portion of the dataset D that satisfies the predicate p0. Oftentimes, the claim from a paper will correspond to the entire dataset released with the paper, and in those cases the predicate p0 is vacuously true. But we still need to perform schema mapping in this case because Q can contain selectors referring to two sub-populations of the entire dataset. Since the query Q and the predicate p are expressed over the unobserved attributes X and
L respectively, while each dataset D has a schema XD ∪ LD, the query Q cannot be executed directly over any particular D. This complication reflects the fact that the terminology used in the paper to express the claim may or may not be aligned with the attributes of the dataset — the claim in the paper is designed to be interpreted by a human reader and the dataset may involve coded attributes, abbreviations, technical nomenclature, or other differences. A key challenge is to resolve this semantic heterogeneity by finding the mapping from X ∪ L to XD ∪ LD intended by the author that demonstrates the validity of Q on D.
Example of semantic heterogeneity A study of the dataset GSE31210 [184] notes: “In ALK-positive tumors, 30 genes, including ALK and GRIN2A, were commonly overexpressed...”. But the corresponding dataset includes the attribute gene alteration status with a domain including {ALK-fusion +, EGFR/KRAS/ALK -}, making it necessary to determine that ALK-fusion + corresponds to ALK-positive samples. Tumor samples are annotated with the tissue: primary lung tumor selector, also presenting a matching problem. The observational claim asserts that the provided data supports the claim. The general
form of the claim asserts that the observational claim is not limited to only the experimental data, but will hold true for all data items that match the experimental conditions. Usually both are desired: without the general claim, the result only applies to the observed sample, which is only an interesting statement if the entire population was observed — a rare situation. But a general claim without an observational claim corresponds to an untested hypothesis, with no supporting experimental data. We distinguish the two claims even though they are closely related because we will handle them differently: we assume that the observational claim is true, and use this assumption to assist with information extraction and data integration (see Algorithm 2). We do not assume the general claim is true; we find relevant data items from the broader repository and use them to assess the validity of the claim (see Algorithm 3). In typical cases, the queries associated with claims will be associated with a probability rather than a boolean truth value. That is, claims are true or false only in a probabilistic sense, so it is natural to talk about the probability of the query being true. The types of claims we consider will be interpreted as statistical tests returning a probability. The corresponding boolean query can be constructed by assigning a threshold, e.g., a p-value threshold of 0.05 associated with significance tests.
7.3.1 Problem Statements
We consider two sub-problems within this framework:
1. Claim-Aware Query Rewriting Given an observational claim (Q, p, D), find a query Q′ and a predicate p′ such that Q(p(R)) = Q′(p′(D)), where p(D) is shorthand for {x | x ∈ D, p(x)}.
2. Claim Generalization Given a general claim (Q, p), derive a new observational claim (Qi, pi, Di) for each compatible dataset D1, D2, ... in R, resolving the semantic heterogeneity between D and Di, then aggregate the results to assess the validity of the claim.
For claim-aware query rewriting, we assume the claim is true in order to assess the rewritings; for claim generalization, we will model the uncertainty in the rewritings and assess the validity of the claim. This approach exploits the fact that we have access to peer-reviewed claims in addition to author-provided datasets. We will describe this process in more detail in Section 7.4.1. For claim generalization, we will impute missing attributes for relevant datasets in the repository based on a machine learning model trained on the original dataset. A Bayesian model will jointly evaluate the quality of the imputation as well as the validity of the claim. This approach exploits the fact that we have a large repository of data with a coherent population. We will describe this process in more detail in Section 7.4.2.
7.3.2 Application Scenarios
In their general form, these sub-problems have very little information available; we instantiate them in two application scenarios where we make additional assumptions. We will develop and evaluate algorithms for the Gene Expression scenario in Section 7.4.
Scenario: A Repository of Surveys A repository of survey results (e.g., ICPSR [232]) contains datasets associated with social science studies. We assume a subset of the attributes in each dataset corresponding to demographic data are explicitly aligned, or can be easily matched using conventional techniques. For example, the Chicago Regional Household Travel Inventory (CRHTI, ICPSR 34910) is a survey of the travel behavior of individuals in the Chicago area, containing both demographic information and attributes like the travel mode used. Similarly, the New York, New Jersey, Connecticut Regional Travel - Household Interview Survey (RT-HIS, ICPSR 35294) contains similar information for a different region, with a potentially different encoding for travel modes. The semantic coherence between these two studies suggests an opportunity for Cedalion to validate claims made in one city against data collected in another.
Scenario: A Repository of Gene Expression Data For each dataset in a repository of gene expression data, the numeric measured attributes correspond to gene expression levels for a large set of genes. It is again possible to align the gene attributes: while different naming conventions exist and probe mapping can depend on the choice of the particular probes, a reasonable baseline is possible by processing all datasets through a small number of consistent pipelines. Remember that our goal is not to exactly replicate all the intricate details of the original research claim, but rather to assist the user in quickly getting a rough estimate of its validity and generality in light of the entire repository. We therefore assume that the alignment between the numeric attributes has no uncertainty. The heterogeneity between datasets stems from the labeling attributes, which are study-specific and defined by the authors. As a first step, we restrict ourselves to the two largest types of claims made by the primary genomics literature:1 correlation claims between two genes in the same set of patients, and mean shift claims comparing the distributions of gene expression levels of two subsets of patients under different experimental conditions. The query rewriting problem therefore reduces to finding a mapping between the schema implied by the predicate in a claim and the schema of a given dataset. A simple string-matching approach on attribute names is the best we can assume using conventional data integration approaches; we do not have instance-level data available for the unobserved schema over which the predicate is expressed. The claim generalization problem is simplified when there is a statistical relationship between the measured attributes and the labeling attributes. In the biological case, we can use the genetic information to impute certain kinds of labels, even when these labels are not provided by the author.
For example, the gender of the patient can be inferred from the gene expression data, as can environmental effects, such as whether the patient smokes. In the survey scenario, the analogous situation is when demographic information
1Secondary, or computational, analysis will usually employ significantly more sophisticated methods that are out of scope for this work.
(e.g., age, job title) can be used to impute survey responses (e.g., income). We will use imputation rather than schema matching for claim generalization. There may or may not be sufficient information in the repository to reliably impute a given attribute, so our approach (Algorithm 3) carefully tracks the uncertainty of the imputation to detect those cases where imputation is impossible.
7.3.3 Claim Expression Language
To describe and reason about claims, we introduce a domain-specific language called ClaimQL. The language has two components: a claim language for expressing claims themselves and a transformation language for expressing computations involving those claims. The grammar for claims is as follows:
⟨claim⟩ ::= ⟨test⟩  |  ⟨test⟩ on ⟨ds⟩  |  ⟨claim⟩ , ⟨claim⟩
⟨test⟩ ::= ⟨attr⟩ ∼ ⟨attr⟩ | ⟨pred⟩  |  ⟨attr⟩ | ⟨pred⟩ > ⟨attr⟩ | ⟨pred⟩
⟨pred⟩ ::= ⟨attr⟩ = ⟨val⟩  |  ⟨pred⟩ , ⟨pred⟩
(The “|” appearing inside a ⟨test⟩ is part of the claim syntax and denotes conditioning on a predicate; the widely spaced “ | ” separates grammar alternatives.)
A predicate is an attribute-value pair or a conjunction of attribute-value pairs. In our initial system, a statistical test is either a correlation test (written X ∼ Y) or a mean shift test (written X > Y). Each test can be interpreted as a boolean condition, but is also associated with a numeric value. For a mean shift test, the numeric value is the difference in means. For a correlation test, the numeric value is the correlation. A test also includes the predicates to which it applies. For a correlation test, there is only a single predicate, since the correlation must be between attributes in the same dataset: they both must have the same number of elements, and batch effects will dominate if we test the correlation between two separate datasets. For a mean shift test, there can be predicates associated with each of the two attributes being compared. For example, smokers vs. non-smokers, or those receiving a treatment vs. control.
Scenario: Social Science A study might want to explore the correlation between income and prevalence of smoking, using the data D from a survey in Illinois. The claim will be expressed as smoking ∼ income on D.
Scenario: Genomics Similarly, a study might want to explore the expression of a particular gene X in lung tissue of smokers vs. non-smokers, writing X | lung, smoker > X | lung, non-smoker to denote their claim.
Scenario: Computer Science A VLDB paper might introduce a new query optimization technique and measure its impact on the runtime of TPC-H queries. While we are used to those benchmarks being deterministic, this is still a clear mean shift claim: time | optimization=1 > time | optimization=0. Performing multiple runs of the benchmark and averaging them, or running the same benchmark against a range of different hardware configurations, are all ways of collecting a dataset against which this claim is expressed. The lack of established data sharing practices in the field precludes this sort of analysis, but the claim is nonetheless present in the same form. To connect the claim expression language with the repository structure and selectors, we use the following function.
The function AllSelectors returns the schema implied by the claim, defined as
AllSelectors(X ∼ Y | S0, S1, ...) = {X, Y, S0, S1, ...}
AllSelectors((X | R0, ...) > (Y | S0, ...)) = {X, R0, ..., Y, S0, ...}
AllSelectors(C1, C2) = AllSelectors(C1) ∪ AllSelectors(C2)
We will also call AllSelectors on a dataset, in which case it simply returns the set of all possible selectors of the (binarized) dataset.
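The recursive definition of AllSelectors can be sketched directly over an in-memory claim representation. The tagged-tuple encoding below is a hypothetical simplification of the ClaimQL grammar, and the example claim is illustrative.

```python
# Sketch of AllSelectors over a simplified claim representation.
def all_selectors(claim):
    kind = claim[0]
    if kind == "corr":   # ("corr", X, Y, [S0, S1, ...]) for X ~ Y | S0, S1, ...
        _, x, y, preds = claim
        return {x, y} | set(preds)
    if kind == "shift":  # ("shift", (X, [R0, ...]), (Y, [S0, ...]))
        _, (x, rs), (y, ss) = claim
        return {x} | set(rs) | {y} | set(ss)
    if kind == "conj":   # ("conj", C1, C2): union of the two sub-claims
        _, c1, c2 = claim
        return all_selectors(c1) | all_selectors(c2)
    raise ValueError(f"unknown claim kind: {kind}")

# An illustrative conjunctive claim.
claim = ("conj",
         ("corr", "CXCL9", "CXCL10", ["tissue:kidney"]),
         ("shift", ("CXCL9", ["smoker:yes"]), ("CXCL9", ["smoker:no"])))
selectors = all_selectors(claim)
```

Each case mirrors one line of the definition above: correlation and mean shift tests contribute their attributes and predicate selectors, and a conjunction takes the union of its parts.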
7.4 Claim-Aware Algorithms
In this section, we present an algorithm ClaimCheck for claim-aware query rewriting, and an algorithm ClaimJump for claim generalization.
7.4.1 Claim-Aware Query Rewriting
In this section, we describe an approach to the query rewriting problem in the gene expression setting, formulate it as an optimization problem, and describe an algorithm to solve it.
A claim (Q, p) involves a set of attributes XQ ∪ Lp. To evaluate the claim on the dataset D provided by the author, we must find a mapping between the attributes XQ ∪ Lp and the schema of D. This problem is related to schema matching and mapping in a conventional data integration context: the unobserved schema is akin to a mediated schema in a global-as-view framework [60]. Recall we assume that all attributes are binary, such that both gender:male and gender:female are separate binary attributes. Under this assumption, we can model the problem as a) finding a schema mapping between the mediated schema and the observed schema, then b) rewriting the query using this mapping. However, there are some complications in our setting. First, there is no instance-level data for the unobserved schema, preventing us from using state-of-the-art instance-level schema matchers (e.g., BigGorilla [120]). Second, we do not have access to any auxiliary information about the attributes, such as types, value constraints, key constraints, or nesting or grouping structures among attributes [162]; our labels are simply strings associated with data items. Third, the attribute names can be domain-specific jargon, or worse, arbitrary strings with no obvious semantic content. For example, linguistic matching will not help us with a claim made for the dataset GSE28654 comparing a group called “UMZAP-70+” and a group called “MTZAP-70-” (the correct mappings are “class2” and “class1”, respectively). As a result of these complications, the only schema-level information we can use is the string itself. We will therefore consider a fuzzy string similarity algorithm as a baseline
approach. More generally, our approach is to search through the space of possible mappings, optimizing some objective function that expresses the quality of the mapping. The baseline approach uses an objective function that provides a similarity score based on fuzzy string matching. As mentioned, we cannot use instance-level information to help us choose a mapping, since we do not have instance data available for the unobserved schema [59]. Our approach is instead to consider all possible mappings, evaluate Q under each mapping, then choose the mapping that returns the highest probability of the claim being true. In this case, we are assuming that Q is a statistical test that returns a score rather than a boolean value. We cast this task as an optimization problem that combines fuzzy string matching with statistical claim validity, then solve it. We formulate the query rewriting task as the following optimization problem.
M̃ = arg max_{M ∈ 𝓜} f(Q, p, D, M)    (7.1)
where M̃ is the result mapping we seek, Q and p are the query and predicate representing a claim such that Q returns the probability that the claim is true, D is a dataset, 𝓜 is the set of all possible mappings between attributes in the unobserved schema implied by the claim C = (Q, p) and the observed schema of the dataset D, and f is an objective function defined separately for different approaches. We consider three objective functions, corresponding to three different methods:
Baseline string matching As a baseline approach, we use fuzzy string matching on column names. Here, we define f to be the sum of the inverse Levenshtein distances (scaled between 0 and 1) between each mapped pair of attribute names:
Fstring(Q, p, D, M) = Σ_{(s, s′) ∈ M} ILD(s, s′)
Simple claim-based matching In the case of purely claim-based matching, f is defined by evaluating the query on D, subject to p, under the mapping M. The result of this query
indicates the probability that the claim is true:
Fclaim(Q, p, D, M) = Q(p(M(D)))
where M(D) indicates the dataset D with attributes renamed according to the mapping M. And finally, our proposed algorithm ClaimCheck combines both methods in a weighted-sum objective function:
FClaimCheck(Q, p, D, M) = Q(p(M(D))) + α Σ_{(s, s′) ∈ M} ILD(s, s′)
where α is a trade-off parameter, set to 1 for the main results presented in the paper. We further evaluate the sensitivity of ClaimCheck to this parameter with additional experiments.
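The combined objective can be sketched as follows. This is a minimal illustration, not the actual ClaimCheck implementation: difflib's ratio stands in for an inverse Levenshtein distance, claim_prob stands in for evaluating Q(p(M(D))), and the search is a naive enumeration of injective mappings.

```python
import difflib
from itertools import permutations

def ild(s, t):
    # String similarity scaled to [0, 1]; a stand-in for the inverse
    # Levenshtein distance ILD used in the objective.
    return difflib.SequenceMatcher(None, s.lower(), t.lower()).ratio()

def claimcheck(claim_attrs, dataset_attrs, claim_prob, alpha=1.0):
    # Maximize F(M) = claim_prob(M) + alpha * sum of string similarities
    # over all injective mappings from claim attributes to dataset columns.
    best, best_score = None, float("-inf")
    for perm in permutations(dataset_attrs, len(claim_attrs)):
        m = dict(zip(claim_attrs, perm))
        score = claim_prob(m) + alpha * sum(ild(a, m[a]) for a in claim_attrs)
        if score > best_score:
            best, best_score = m, score
    return best
```

When attribute names carry no signal (e.g., "class1" vs. "class2"), the string term ties and the statistical term decides; when the test is inconclusive, string similarity breaks the tie.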
Implementation Algorithm 2 details the steps taken by ClaimCheck to recover the best mapping. We start by mapping those attributes from the unobserved schema that have exact matches in the observed schema to reduce the search space. We then attempt to find a maximum value of the objective function (corresponding to an optimal mapping). This is a discrete optimization problem, and thus computational cost is of concern. We propose three approaches to solving the problem (corresponding to three different implementations of the Optimize function):
• Naïve approach. We can exhaustively enumerate all possible mappings and find the optimal one. While clearly costly in general for datasets with a large number of selectors, we find that for most of the datasets we have examined this approach is viable.
• We employ a stochastic optimization package called HyperOpt that implements a sequential model-based optimization (SMBO) approach based on a Tree-structured Parzen Estimator [17, 18, 19]. HyperOpt maintains a statistical model that aims to approximate the objective function based on the history of evaluations.
• Finally, we exploit the fact that most claims are conjunctions, such that we can sort the list of claims by increasing number of selectors and iteratively solve each claim
[Figure: plate diagram with variables µ, α, β, g, P, X, σg, and N.]
Figure 7.3: Bayesian graphical model for estimating the mean with uncertain group labels. α and β are estimated from bootstrapped classifiers for predicting the group.
mapping problem separately, repairing the mapping learned in the previous step for the next one. Surprisingly, we find that this greedy approach does not significantly compromise performance, while providing substantial run time improvements.
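The greedy strategy can be sketched as below. All names are hypothetical: solve_one stands in for a single-claim optimizer (e.g., exhaustive search over the objective), and match_by_name is a toy solver based on string similarity used only for illustration.

```python
import difflib

def greedy_map(subclaims, dataset_attrs, solve_one):
    # Sort conjuncts by selector count and map them incrementally; earlier
    # assignments are fixed ("repaired") before the next sub-claim is solved.
    fixed = {}
    for claim_attrs in sorted(subclaims, key=len):
        todo = [a for a in claim_attrs if a not in fixed]
        free = [c for c in dataset_attrs if c not in fixed.values()]
        fixed.update(solve_one(todo, free, dict(fixed)))
    return fixed

def match_by_name(todo, free, fixed):
    # Toy single-claim solver: greedily pick the most similar free column.
    out, avail = {}, list(free)
    for a in todo:
        best = max(avail,
                   key=lambda c: difflib.SequenceMatcher(None, a, c).ratio())
        out[a] = best
        avail.remove(best)
    return out
```

Each sub-claim only has to solve for its not-yet-mapped attributes, which is what keeps the greedy variant cheap relative to optimizing the full conjunction at once.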
7.4.2 Claim Generalization
The generalization problem requires finding a mapping between the claim schema and the schema of an unrelated dataset. Given a claim (Q, p, D) and another dataset D′ in the repository, we can consider three methods for solving the claim generalization problem:
1. Conventional schema matching Use instance-level and schema-level information to infer a match between D and D′. In some cases no match may exist.
2. Imputation Identify shared attributes between D and D′ (either provided explicitly or derived from conventional schema matching techniques), then train a classifier to impute the missing attributes in D′.
3. Combined Use imputation to produce missing attributes, then look for correlations between the imputed attributes and the existing attributes as a source of information for conventional schema matchers.
The effectiveness of these methods in any given application will depend on the strength of the signal from instance-level data and shared attributes. In some settings, there will be high semantic coherence between all datasets in the repository, and conventional schema matching techniques will work well. In other settings, the labeling attributes will be semantically incoherent, but there may be a large number of shared attributes on which to train a classifier. In the worst case, the schemas will be entirely discordant, and there is no effective way to find an appropriate mapping.
In the gene expression scenario, the core shared attributes of gene expression data are standard, but the metadata attributes are highly study-specific. In this scenario, the im- putation approach can be effective and sufficient, but schema matchers alone are likely to fail.
In the survey scenario, the core demographic data that can be assumed to be present is somewhat limited (gender, race, age, etc.). However, a number of other attributes are likely to be frequent, if not universally shared: income, profession, national origin, etc. So there is some possibility that schema matchers can work, but a combined approach can help identify cases where attributes are coded differently in different studies. For example, political affiliation could be coded as republican = 1 and democrat = 2 in D as opposed to republican = R and democrat = D in D′. In these cases, conventional schema matchers can struggle because the domains are different, but we can first impute a column in the correct domain as a pre-processing step.
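To make the combined approach concrete, the following sketch imputes the source coding onto target rows and recovers the code correspondence by co-occurrence. All data, names, and the toy nearest-centroid classifier are hypothetical illustrations; they are not the dissertation's actual implementation (which uses Lasso logistic regression).

```python
from collections import Counter, defaultdict

# Hypothetical survey rows: shared attributes (age, income) plus a
# study-specific affiliation column coded differently in each dataset.
D_src = [  # affiliation coded as 1 (republican) / 2 (democrat)
    {"age": 65, "income": 90, "affiliation": 1},
    {"age": 70, "income": 85, "affiliation": 1},
    {"age": 30, "income": 40, "affiliation": 2},
    {"age": 25, "income": 35, "affiliation": 2},
]
D_tgt = [  # affiliation coded as "R" / "D"
    {"age": 68, "income": 88, "affiliation": "R"},
    {"age": 28, "income": 38, "affiliation": "D"},
]

def centroid_classifier(rows, label_key, features):
    """Toy nearest-centroid classifier trained on the shared attributes."""
    sums = defaultdict(lambda: [0.0] * len(features))
    counts = Counter()
    for r in rows:
        counts[r[label_key]] += 1
        for i, f in enumerate(features):
            sums[r[label_key]][i] += r[f]
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}
    def predict(row):
        dist = lambda c: sum((row[f] - c[i]) ** 2 for i, f in enumerate(features))
        return min(centroids, key=lambda lab: dist(centroids[lab]))
    return predict

# Impute the source-domain coding onto target rows, then align codes
# by co-occurrence: each target code maps to its majority imputed code.
predict = centroid_classifier(D_src, "affiliation", ["age", "income"])
cooc = defaultdict(Counter)
for row in D_tgt:
    cooc[row["affiliation"]][predict(row)] += 1
mapping = {tgt: imp.most_common(1)[0][0] for tgt, imp in cooc.items()}
```

The imputed column now lives in the source's domain, so a conventional schema matcher can match it directly.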
Problem Summary Given a rewritten claim (Qs, ps, Ds) found using the ClaimCheck algorithm, and given a target dataset Dt drawn from the repository, find a mapping M between Ds and Dt such that Qs(M(Dt)) = Qs(Ds) if the claim is true. We cannot know if the claim is true in general; we will evaluate this method with manually validated claims as ground truth.
Solution Our solution is to train a classifier using the claim-specific dataset Ds to predict the values of each of the selectors from AllSelectors((Qs, ps)), then use this classifier to impute these labels on Dt so we can evaluate the query. The correctness of the algorithm relies on there being a relationship between the shared training attributes (e.g., demographic data or gene expression data) and the labels being imputed. We construct a Bayesian model to capture the uncertainty in the classifier and use a bootstrap approach to estimate the parameters of the model.
Specifically, given that the mean classifier prediction is µi with variance σi² (estimated through a bootstrap process for data item i), we use the following formulas to estimate the α and β parameters of the Beta distribution:

αi = µi (µi(1 − µi)/σi² − 1)
βi = αi (1/µi − 1)

Next, we assume that the true probability of a particular data item belonging to a group is pi ∼ Beta(αi, βi), with the full Bayesian model given as follows:

xi | pi ∼ Normal(µg, σg)
pi ∼ Beta(αi, βi)
µg ∼ Normal(µ0, 1)
σg ∼ half-Cauchy(0, 1)

where αi, βi are parameters derived from the uncertainty of the prediction (they can be thought of as soft group labels), µ0 is the background mean expression value (calculated over the entire dataset D), and µg, σg are the estimated mean and standard deviation of the posterior distribution of the mean of the group value, with standard priors. This model gives us the probability distribution for the mean µg of interest, based on the probabilistic group assignments encoded by αi and βi. The Bayesian model is illustrated in Figure 7.3. We use this Bayesian model as the basis of Algorithm 3, which we refer to as ClaimJump. The ClaimJump algorithm estimates the validity of a mean shift claim on a new dataset. It
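The moment-matching step above can be written directly as a small helper (the function name is my own; the formulas are valid only when 0 < σi² < µi(1 − µi)):

```python
def beta_from_moments(mu, var):
    """Moment-match a Beta(alpha, beta) distribution to the mean and
    variance of a bootstrapped classifier's predicted group probability."""
    if var <= 0 or var >= mu * (1 - mu):
        raise ValueError("variance must satisfy 0 < var < mu*(1-mu)")
    alpha = mu * (mu * (1 - mu) / var - 1)
    beta = alpha * (1 / mu - 1)
    return alpha, beta

a, b = beta_from_moments(0.5, 0.01)
# Sanity check: the matched Beta reproduces the input moments.
mean = a / (a + b)
variance = a * b / ((a + b) ** 2 * (a + b + 1))
```

For µ = 0.5 and σ² = 0.01 this yields α = β = 12, and the round-trip moments recover 0.5 and 0.01 exactly.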
calls the EstimateImputedMean procedure twice to estimate the group mean for each of the groups defined by the two claim predicates using the Bayesian model. For example, for a claim G|R > G|S, we call EstimateImputedMean for the predicate R and for the predicate S. In lines 9-10 we find common attributes already aligned between the two datasets that we will use for the imputation. In line 14 we build a classifier based on those common attributes using a bootstrap process (BuildClassifier). This means that before the classifier construction the input data is perturbed, i.e., sampled with replacement, to estimate the variability of the resulting classifier. We then apply the classifier to the common attributes of the target dataset, producing the group probabilities used in the Bayesian model. Note that due to the Markov Chain Monte Carlo approach used for estimating the mean in the probabilistic model, ClaimJump outputs an entire distribution of possible mean shifts, not just a point estimate. As will be shown in Section 7.5, ClaimJump can automatically detect cases when the group classifiers are uncertain and output wide highest density intervals (HDIs).
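The bootstrap step in line 14 can be sketched as follows. This is a stand-in: BuildClassifier in the actual system fits Lasso logistic regressions via GLMNet, while here a toy one-feature threshold classifier keeps the example self-contained. The point is the per-item mean and variance (µi, σi²) computed across bootstrap replicates.

```python
import random
random.seed(0)

# Toy labeled source data: (shared_attribute_value, group_label).
src = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
targets = [0.15, 0.5, 0.85]  # shared attribute values in the target dataset

def fit_threshold(rows):
    # Stand-in for BuildClassifier: threshold at the midpoint of the
    # two resampled class means.
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    thr = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1.0 if x > thr else 0.0

B = 50
preds = {t: [] for t in targets}
for _ in range(B):
    boot = [random.choice(src) for _ in src]  # sample with replacement
    if len({y for _, y in boot}) < 2:         # degenerate resample: skip
        continue
    clf = fit_threshold(boot)
    for t in targets:
        preds[t].append(clf(t))

# Per-item mean and variance across replicates give (mu_i, sigma_i^2),
# the inputs to the Beta moment-matching step.
stats = {}
for t, ps in preds.items():
    mu = sum(ps) / len(ps)
    stats[t] = (mu, sum((p - mu) ** 2 for p in ps) / len(ps))
```

Items far from the decision boundary (0.15, 0.85) get confident predictions with near-zero variance, while the ambiguous item (0.5) gets a wide spread, which the Beta prior then propagates into the posterior.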
Implementation We use Stan [34] to estimate the posterior distribution of the model shown in Figure 7.3. Parameters αi and βi are estimated using 50 bootstrapped logistic regression classifiers with a Lasso penalty, trained using GLMNet [79]. The regularization parameter is automatically chosen by GLMNet using 3-fold cross-validation. In cases when cross-validation chooses an empty model, we assume that the covariate imputation is impossible and discard the (dataset, claim) pair. This step is important to avoid imputing covariates that are independent of the gene expression data, and hence to avoid making false generalizations.
Given group membership probabilities, we use Stan [34] to estimate posterior distributions of the group means. Comparing those distributions leads to accepting or rejecting the claim, based on the 95% HDI of the distribution (the HDI is a Bayesian analogue of a confidence interval).
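The HDI-based accept/reject decision can be sketched as follows. This is a minimal implementation assuming a unimodal posterior represented by samples; the function names are mine, not the dissertation's.

```python
import math

def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the posterior samples."""
    s = sorted(samples)
    n = len(s)
    k = int(math.ceil(mass * n))  # window size covering `mass` of samples
    i = min(range(n - k + 1), key=lambda j: s[j + k - 1] - s[j])
    return s[i], s[i + k - 1]

def verdict(shift_samples, mass=0.95):
    """Accept/reject a mean-shift claim from posterior samples of the shift."""
    lo, hi = hdi(shift_samples, mass)
    if lo > 0:
        return "supported"      # entire HDI above zero
    if hi < 0:
        return "contradicted"   # entire HDI below zero
    return "inconclusive"       # HDI straddles zero
```

The three return values correspond to the three outcomes tabulated in the replication experiment of Section 7.5.2.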
7.5 Experiments
Following our tissue annotation work (EZLearn [94]), we focus on a corpus of 116,685 samples arranged into 3567 datasets from the Gene Expression Omnibus, all from the Affymetrix U133 Plus 2.0 platform. Raw data files for each of those samples were downloaded from GEO and consistently processed using Single-Channel Array Normalization (SCAN) [191], a method that aims to reduce batch effects across samples. Apart from the ability to query the repository for related datasets (provided by EZLearn annotations), none of the methods make any particular assumptions about it, and they are generally applicable to other genomics platforms and application domains.
7.5.1 Evaluating Claim-Aware Rewriting
In this experiment, we measure whether the statistical information used by the ClaimCheck algorithm improves precision and recall over baseline methods. Table 7.1 reports the results. We constructed a test set of 52 datasets by finding those which (a) contained the term “overexpressed” in the summary and (b) had at least one claim that could be extracted from the summary or the associated paper. This process produced a set of 223 manually extracted mean shift claims. We then manually determined the correct mapping to the dataset through a careful review of the text, to serve as ground truth during evaluation. Importantly, we found that some of these claims required 1:N mappings, which our current ClaimCheck algorithm does not support; although we have designed a generalization of ClaimCheck that considers disjunctive mappings, we did not employ it in the presented experiments, so ClaimCheck could not achieve 100% accuracy by design. We expect that additional claims could be obtained by a general NLP-based extraction approach or even by searching the repository for a different keyword; our goal here was simply to obtain a representative sample of mean shift claims for evaluation purposes. First, we evaluate ClaimCheck against two baselines: a conventional string similarity approach doing fuzzy selector matching, implemented using the FuzzyWuzzy Python package, and
Method Precision Recall F1 Accuracy
Fuzzy matching 0.75 0.63 0.69 0.72
Claim-only 0.85 0.73 0.79 0.82
ClaimCheck 0.86 0.78 0.82 0.90
Table 7.1: Comparison of performance of different schema matching methods. ClaimCheck outperforms both claim-only and conventional baselines.
a claim-only approach that ignores string similarity, focusing purely on maximizing the claim-matching term. These two approaches are equivalent to setting the trade-off parameter α to infinity (for fuzzy matching) or zero (for claim-only matching). We observe (Table 7.1) that ClaimCheck outperforms both of these approaches on our test set, achieving an F1 score of 0.82 and an accuracy of 90%.
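The trade-off can be illustrated with a toy objective. This is a sketch, not the exact ClaimCheck objective: FuzzyWuzzy is replaced with Python's stdlib difflib, and the claim-consistency term is a hard-coded stand-in for the statistical check.

```python
from difflib import SequenceMatcher
from itertools import permutations

def string_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical claim selectors and candidate attribute values.
claim_selectors = ["rejected transplant", "stable transplant"]
dataset_values = ["rejection", "stable", "control"]

def claim_support(mapping):
    # Stand-in for the statistical claim check: pretend the mean-shift
    # test passes only when "rejected transplant" maps to "rejection".
    return 1.0 if dict(mapping).get("rejected transplant") == "rejection" else 0.0

def score(mapping, alpha):
    # alpha * string similarity + claim consistency: alpha -> infinity
    # recovers fuzzy matching, alpha = 0 recovers claim-only matching.
    sim = sum(string_sim(s, t) for s, t in mapping)
    return alpha * sim + claim_support(mapping)

def best_mapping(alpha):
    candidates = [list(zip(claim_selectors, p))
                  for p in permutations(dataset_values, len(claim_selectors))]
    return max(candidates, key=lambda m: score(m, alpha))

m_claim_only = best_mapping(0.0)     # claim term decides alone
m_fuzzy = best_mapping(1000.0)       # string similarity dominates
```

In this toy instance both extremes agree; the interesting cases are those where only the combined score picks the correct mapping.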
Second, we evaluate the sensitivity of ClaimCheck to the trade-off parameter α by measuring the performance of the method for α values spanning 0.1 to 10 on a logarithmic scale (Figure 7.4). We observe that, while α does impact performance, ClaimCheck still outperforms both baselines across this wide range of α values. Additionally, it is clear that further fine-tuning of the model (e.g., setting α to 2.2) could improve the performance of the method.
Finally, we perform a run-time performance analysis of the different variants of ClaimCheck described in Section 7.4.1 (Table 7.2). We observe that the stochastic optimization implementation based on HyperOpt can lead to an order of magnitude performance improvement with negligible loss of accuracy, while the claim-based greedy solution can surpass that, yielding two orders of magnitude performance improvement at the same accuracy level as the full solution.
Figure 7.4: F1 score of ClaimCheck with varying values of tradeoff parameter α on a log scale. α = 1 is used for experiments throughout the paper.
7.5.2 Evaluating ClaimJump
We first want to determine if ClaimJump is effective at discovering independent sources of support for a claim within the repository. We designed an experiment as follows: We take a large dataset from our corpus of manually annotated datasets (GSE28654, 112 samples) and split it into two halves randomly. Each time, we assume the claim was made against the first half and the second half is another dataset we found in the repository. We attempt to apply ClaimJump to generalize the claim into this second half of the original dataset. Intuitively, we should expect ClaimJump generalization to be successful, as the claim is evaluated on the portion of the same dataset it was made against. We repeat the process 50 times. Empirically, out of 50 random splits, we observe the following distribution of outcomes:
• The 95% HDI (highest density interval) is entirely above zero: a successful replication. This happens in 43 out of 50 cases (86% of the time).
• The 95% HDI is entirely below zero: we wrongly and confidently reject a true hypothesis. This happens in 0 out of 50 cases.
Figure 7.5: Report cards for a pair of correlation claims. Each card compares the original study effect (denoted by *) with computational replication studies based on tissue-matched datasets found in GEO. The pooled effect, based on combining correlations from all datasets with appropriate weighting, is denoted as “POOLED”.
(a) Claim: expression of MYLIP is correlated with TP53. Datasets: GSE39730*, GSE15151, GSE14479, GSE21261, GSE18018, POOLED. The original study achieves a correlation of −0.25 between the genes of interest (MYLIP and TP53). After pooling data across 4 more tissue-matched studies, we see that the combined correlation coefficient is even stronger, close to −0.5.
(b) Claim: expression of HDAC6 is correlated with ERG. Datasets: GSE32984*, GSE13054, GSE28117, GSE17939, POOLED. In this set of studies, the original results become weaker after pooling data across additional related studies. Significant discrepancies between the original results and the pooled results should prompt a more careful look at the claims involved.
Implementation              Time (seconds)   F1     Accuracy
Naïve                       1625             0.82   0.90
HyperOpt (maxEval = 500)    108              0.81   0.87
HyperOpt (maxEval = 250)    51               0.81   0.84
Greedy in claim size        14               0.80   0.90

Table 7.2: Performance comparison of different implementations of ClaimCheck. A naïve search through all possible mappings is the most accurate, but computationally expensive. The HyperOpt-based approach allows us to trade off accuracy and performance by tuning the maximum number of allowed evaluations, and the greedy approach achieves the best overall balance of performance and accuracy.
• The 95% HDI intersects zero, suggesting that our signal is not strong enough to make a conclusion. This happens 14% of the time, higher than the 5% we would expect for a 95% HDI in theory, suggesting that future improvements to the model might improve performance even further.
Limitations This experiment only measures positive results; the inverse evaluation is more difficult to measure, since there are myriad ways a dataset can fail to support a claim, and it would be easy for ClaimJump to detect them in most cases (e.g., incompatible tissue types). An evaluation of these negative results remains future work.
Synthetic experiments
Next, we perform a sensitivity analysis on synthetic data to answer two questions: First, does ClaimJump accurately estimate true effect sizes? Second, can ClaimJump still accurately
estimate the effect size of the claim when the imputation process is highly uncertain? To answer the first question, we evaluated the ability of the model to recover the true effect size at a predetermined classifier confidence level (set at 70%) as a function of the size of the effect. The model was successful at recovering the correct effect with tight bounds from this synthetic dataset of size 100 (Figure 7.6). To answer the second question, we estimated the mean shift effect in a pair of synthetically generated datasets: one that contained a true effect size of 0.4 and another that contained no effect. We observed that the model correctly identified the presence of the effect even when the classifier confidence was only 20%, demonstrating robustness. The model also correctly determined the lack of an effect regardless of the confidence level (Figure 7.8).
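The flavor of the second experiment can be reproduced with a crude weighted-mean stand-in for the full Bayesian model (all numbers below are illustrative and are not the dissertation's synthetic setup):

```python
import random
random.seed(1)

def soft_shift(effect, confidence, n=5000):
    """Generate items from two groups; the 'classifier' only reports a
    group probability equal to `confidence` for true group members and
    1 - confidence for non-members. Estimate the mean shift with a
    probability-weighted mean (a crude proxy for the Bayesian model)."""
    xs, ps = [], []
    for _ in range(n):
        in_group = random.random() < 0.5
        xs.append(random.gauss(effect if in_group else 0.0, 1.0))
        ps.append(confidence if in_group else 1 - confidence)
    mu0 = sum(xs) / len(xs)                               # background mean
    mu_g = sum(p * x for p, x in zip(ps, xs)) / sum(ps)   # soft group mean
    return mu_g - mu0

# An imperfect (80%) classifier shrinks the estimated shift toward zero
# for a true effect of 0.4, but the shift stays clearly positive...
shift_true = soft_shift(effect=0.4, confidence=0.8)
# ...and stays near zero when there is no effect at all.
shift_null = soft_shift(effect=0.0, confidence=0.8)
```

The shrinkage illustrates why a calibrated uncertainty model (rather than a plain weighted mean) is needed to recover the full effect size from noisy group assignments.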
Limitations We do depend on the imputation classifier being well-calibrated. As demonstrated, ClaimJump is robust to the classifier being wrong with low confidence, but it cannot tolerate high-confidence incorrect predictions. We use a standard cross-validation process during training to reduce the risk of this happening.
7.5.3 Applying Cedalion in Practice
In these experiments, we show how Cedalion can be used in practice as part of a process to curate a scientific repository. First, using the correlation claims extracted by the Literome project [195], we run the end-to-end Cedalion system and measure the support for each claim. We find that we can detect independent support corroborating 72% of the evaluated claims (Section 7.5.3). Next, we run ClaimJump on our set of manually extracted mean shift claims to assess their validity using publicly available data. Out of 223 claims, we find significant supporting evidence for 24 of them and significant counter-evidence against 11 of them (Figure 7.9). For the remaining claims, ClaimJump was unable to find enough relevant data items to assess validity (Section 7.5.3).
Figure 7.6: Estimates of the effect size from the Bayesian imputation model as a function of the true effect size, at a classifier confidence of 70%. The median of the distribution is shown as a solid line, with the bounds of the 95% HDI represented by dotted lines.
Figure 7.7: Scatter plot of the original vs. pooled correlations for the correlation claim experiments in which at least one other related dataset was found. Points in the first and third quadrants are considered successful verifications: the original and pooled results have the same sign there.
Correlation claims
We obtained correlation-type claims from the Literome automatic NLP extraction system of Poon et al. [196]. Literome extractions are based only on the abstract of a paper, so they do not require a dataset to be present for the statement to be extracted. For computational verification, we were interested in the set of claims for which the data is also available as part of our corpus of processed Affymetrix U133 Plus 2.0 datasets. Joining the Literome claims with these datasets yields 166 (dataset, claim) pairs. The GEOmetadb database [266] was used to match article identifiers reported by Literome to dataset identifiers available in GEO. Only a small number of claims have datasets associated with them because (1) not all of the claims are derived from gene expression data deposited in GEO, and (2) not every dataset in GEO is associated with a correlation claim.
Out of 166 claims, we were able to match 103 claims to their original dataset and find at least one additional dataset for computing the pooled correlation. Of these 103 matches, the estimated effect size reported by Cedalion had the same sign as the original effect size for 72% (74) of the claims, suggesting agreement with the original study (Figure 7.7).
In Figure 7.5a, we inspect Cedalion results for a typical paper. The paper [208] (along with its dataset, GSE39370) states that expression of gene MYLIP is correlated with expres- sion of gene TP53 in leukemia. As a part of our validation process, we match this dataset with four other datasets and compute the correlation between MYLIP and TP53 in each of them. We then compute the pooled correlation by weighting each correlation coefficient by the inverse of the size of the confidence interval [157].
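The pooling step can be sketched as follows. The exact weighting in [157] may differ; here each study's correlation is Fisher-z transformed and weighted by the inverse of its 95% confidence-interval width (which is proportional to sqrt(n − 3)). The study values below are made up for illustration.

```python
import math

def fisher_z(r):
    return 0.5 * math.log((1 + r) / (1 - r))

def inv_fisher_z(z):
    return math.tanh(z)

def pooled_correlation(studies):
    """Pool per-study correlations, weighting each by the inverse of its
    Fisher-z 95% confidence-interval width."""
    num = den = 0.0
    for r, n in studies:
        width = 2 * 1.96 / math.sqrt(n - 3)  # 95% CI width in z-space
        w = 1.0 / width
        num += w * fisher_z(r)
        den += w
    return inv_fisher_z(num / den)

# Hypothetical studies: (correlation, sample size). The pooled estimate
# is pulled toward the larger, tighter studies.
studies = [(-0.25, 30), (-0.45, 120), (-0.50, 80), (-0.40, 60)]
r_pooled = pooled_correlation(studies)
```

Working in z-space before averaging keeps the pooled value inside (−1, 1) and makes the per-study variances comparable.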
In Figure 7.5b, we show Cedalion results for another paper, where pooling data across different studies makes us less certain about the original claim. This effect may be a statistical artifact of the original study, a manifestation of the underlying noisy biological system, a mismatch between our method of computing correlation (ClaimJump and ClaimQL could be extended to have claims parameterized by a particular implementation of each statistical test, but this is left for future work) and the original statistical analysis in the
paper, or some other factor. In the future, we envision that this generalization process will be part of a human-in-the-loop system, where a biologist interested in a particular claim will be able to review the steps taken by Cedalion to replicate it and decide whether to trust them. Having a high-recall system that finds all possibly related datasets will save time for the user, as filtering out irrelevant datasets from a short list is significantly easier than manually searching for relevant ones in the repository, downloading them, and processing them by hand.
Mean shift claims
Next, we apply ClaimJump to our corpus of 223 manually curated mean shift claims to generate a “report card” for each claim. There is no ground truth for a study like this, but these examples demonstrate the potential value of the approach in practice. First, we use ClaimJump to generalize each claim from the corpus to the entire set of other related datasets at once. Figure 7.9 shows the results of this generalization. Each dot on the graph represents a (claim, dataset) pair, with the original effect plotted on the x-axis and the effect estimated by the Bayesian procedure from the other data in the repository on the y-axis. We note two outlier datasets, GSE50058 and GSE70102 (see Figure 7.9), for further evaluation. Next, we evaluate a claim from dataset GSE50058 stating that the CXCL9 gene is overexpressed in rejected kidney transplants compared to stable ones [134]. Reading the paper associated with GSE50058, we notice that the authors used additional datasets to formulate their hypothesis before collecting their own dataset for validation. It is interesting to observe that Cedalion found several of those datasets (dataset1 and dataset7 in Figure 7.10a) and confirmed that they do indeed support the claim from the paper. This suggests that Cedalion is capable of successfully finding related datasets and confirming their relevance. Furthermore, while dataset11 strongly supports the hypothesis and was in fact collected with the same goal in mind as GSE50058 (studying differences between acutely rejected and stable renal transplants), it was not used in the original study. The explanation