Grechkin Washington 0250E 1

Total Page:16

File Type:pdf, Size:1020Kb

Grechkin Washington 0250E 1 Computational Curation of Open Science Data Maxim Grechkin A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Washington 2018 Reading Committee: Bill Howe, Chair Walter L. Ruzzo Hoifung Poon Program Authorized to Offer Degree: Computer Science & Engineering c Copyright 2018 Maxim Grechkin University of Washington Abstract Computational Curation of Open Science Data Maxim Grechkin Chair of the Supervisory Committee: Associate Professor Bill Howe Information School Rapid advances in data collection, storage and processing technologies are driving a new, data-driven paradigm in science. In the life sciences, progress is driven by plummeting genome sequencing costs, opening up new fields of bioinformatics, genomics, and systems biology. The return on the enormous investments into the collection and storage of the data is hindered by a lack of curation, leaving significant portion of the data stagnant and underused. In this dissertation, we introduce several approaches aimed at making open scientific data accessible, valuable, and reusable. First, in the Wide-Open project, we introduce a text mining system for detecting datasets that are referenced in published papers but are still kept private. After parsing over 1.5 million open access publications, Wide-Open has identified hundreds of datasets overdue for publication, 400 of them were then released within one week. Second, we propose a machine learning system, EZLearn, for annotating scientific data into potentially thousands of classes without manual work required to provide training labels. EZLearn is based on an observation that in scientific domains, data samples often come with natural language descriptions meant for human consumption. We take advantage of those descriptions by introducing an auxiliary natural language processing system, training it together with the main classifier in a co-training fashion. Third, we introduce Cedalion, a system that can capture scientific claims from papers, validate them against the data associated with the paper, then generalize and adapt the claims to other relevant datasets in the repository to gather additional statistical evidence. We evaluated Cedalion by applying it to gene expression datasets, and producing reports summarizing the evidence for or against the claim based on the entirety of the collected knowledge in the repository. We find that the claim-based algorithms we propose outperform conventional data integration methods and achieve high accuracy against manually validated claims. TABLE OF CONTENTS Page List of Figures....................................... iii Chapter 1: Introduction................................1 Chapter 2: Related Work................................ 11 2.1 General data curation.............................. 11 2.2 Curating Gene Expression data......................... 13 2.3 Repository access tools.............................. 15 2.4 Hypothesis Generation and Discovery...................... 16 2.5 Workflow systems as an alternative to repository curation services...... 18 2.6 Summary..................................... 19 Chapter 3: Enforcing open data policy compliance using text-mining........ 20 3.1 Tracking dataset references in open access literature.............. 20 3.2 Summary..................................... 24 Chapter 4: Aiding data discovery in the repository by building curation classifiers 26 4.1 Introduction.................................... 26 4.2 Related Work................................... 28 4.3 EZLearn ...................................... 29 4.4 Application: Functional Genomics........................ 32 4.5 Application: Scientific Figure Comprehension................. 39 4.6 Summary..................................... 43 Chapter 5: Reconstructing gene regulatory networks based on public datasets... 44 5.1 Introduction.................................... 44 5.2 Pathway Constrained Sparse Inverse Covariance Estimation......... 45 i 5.3 PathGLasso Learning Algorithm......................... 49 5.4 Experiments.................................... 54 5.5 Interpretation of the Learned Network..................... 58 5.6 Summary..................................... 60 Chapter 6: Uncovering network-perturbed genes in public cancer expression datasets 61 6.1 Introduction.................................... 61 6.2 Results....................................... 66 6.3 Methods...................................... 88 6.4 Summary..................................... 96 Chapter 7: Enabling reproducibility by using claim-aware data integration.... 98 7.1 Introduction.................................... 98 7.2 Related work................................... 103 7.3 Problem Definition................................ 107 7.4 Claim-Aware Algorithms............................. 114 7.5 Experiments.................................... 121 7.6 Summary..................................... 133 Chapter 8: Conclusions................................. 136 8.1 Limitations and future work........................... 137 Bibliography........................................ 140 ii LIST OF FIGURES Figure Number Page 1.1 a) Total number of human samples submitted and curated in GEO. Curation is defined as being assembled into Geo DataSets (GDSs). b) Number of in- tegrative studies found in the PubMed Centeral corpus of open access papers plotted over years. Only studies using datasets from Affymetrix U133 Plus 2.0 platform were considered. c) The median number of samples used by a in- tegrative study using data from Affymetrix U133 Plus 2.0 platform vs number of available samples with matching tissue (based on EZLearn tissue labels).5 3.1 Number of samples in the NCBI Gene Expression Omnibus (GEO)...... 21 3.2 Number of GEO datasets overdue for release over time, as detected by Wide- Open. We notified GEO of the standing list in February 2017, which led to the dramatic drop of overdue datasets (magenta portion), with four hundred datasets released within the first week...................... 22 3.3 Average delay from submission to release in GEO................ 24 3.4 Wide-Open tracking result from February 2018, showing an initial drop in the number of overdue datasets, together with newly discovered ones....... 25 4.1 The EZLearn architecture: an auxiliary text-based classifier is introduced to bootstrap from the lexicon (often available from an ontology) and co-teaches the main classifier until convergence....................... 30 4.2 Example gene expression profile and its text description in Gene Expression Omnibus (GEO). Description is provided voluntarily and may contain am- biguous or incomplete class information..................... 33 4.3 Ontology-based precision-recall curves comparing EZLearn, distant supervi- sion, URSA, and the random baseline (gray). Extrapolated points are shown in transparent colors............................... 35 iii 4.4 (a) Comparison of test accuracy with varying amount of unlabeled data, aver- aged over fifteen runs. EZLearn gained substantially with more data, whereas co-EM barely improves. (b) Comparison of number of unique classes in high- confidence predictions with varying amount of unlabeled data. EZLearn's gain stems in large part from learning to annotate an increasing number of classes, by using organic supervision to generate noisy examples, whereas co-EM is confined to classes in its labeled data...................... 36 4.5 Comparison of test accuracy of the main and auxiliary classifiers at various iterations during learning............................. 38 4.6 EZLearn's test accuracy with varying portion of the distant-supervision labels replaced by random ones in the first iteration. EZLearn is remarkably robust to noise, with its accuracy only starting to deteriorate significantly after 80% of labels are perturbed.............................. 39 4.7 The Viziometrics project only considers three coarse classes Plot, Diagram, and Image for figures due to high labeling cost. We expanded them into 24 classes, which EZLearn learned to accurately predict with zero manually labeled examples................................. 40 4.8 Example annotations by EZLearn, all chosen among figures with no class information in their captions........................... 42 5.1 Graphical representation of pathways (top) and the corresponding precision matrix (bottom).................................. 46 5.2 Comparison of learned networks between the pathway graphical lasso (middle) and the standard graphical lasso (right). The true network has the lattice structure (left)................................... 47 5.3 Example with 4 pathways forming a cycle m. means marginalization..... 54 5.4 Run time (y-axis) for (A) Cycle, (B) Lattice and (C) Random (see text for details)....................................... 56 5.5 Run time for various values of η, with λ = 0:1. η = 1:95 is drawn as a dotted vertical line..................................... 57 5.6 MILE data (p = 4591; k = 156). (A) Relative error vs time, (B) Test log- likelihood on Gentles dataset for random pathways, (C) Significant pathway interactions..................................... 59 iv 6.1 (A) A simple hypothetical example that illustrates the perturbation of a net- work of 7 genes between disease and normal tissues. One possible cause of the perturbation is a cancer driver mutation on gene `1' that alters the in- teractions between gene `1' and genes `3', `4', `5', and `6'. (B) One possible cause of network perturbation. Gene `1' is regulated by different sets
Recommended publications
  • RECENT ADVANCES in BIOLOGY, BIOPHYSICS, BIOENGINEERING and COMPUTATIONAL CHEMISTRY
    RECENT ADVANCES in BIOLOGY, BIOPHYSICS, BIOENGINEERING and COMPUTATIONAL CHEMISTRY Proceedings of the 5th WSEAS International Conference on CELLULAR and MOLECULAR BIOLOGY, BIOPHYSICS and BIOENGINEERING (BIO '09) Proceedings of the 3rd WSEAS International Conference on COMPUTATIONAL CHEMISTRY (COMPUCHEM '09) Puerto De La Cruz, Tenerife, Canary Islands, Spain December 14-16, 2009 Recent Advances in Biology and Biomedicine A Series of Reference Books and Textbooks Published by WSEAS Press ISSN: 1790-5125 www.wseas.org ISBN: 978-960-474-141-0 RECENT ADVANCES in BIOLOGY, BIOPHYSICS, BIOENGINEERING and COMPUTATIONAL CHEMISTRY Proceedings of the 5th WSEAS International Conference on CELLULAR and MOLECULAR BIOLOGY, BIOPHYSICS and BIOENGINEERING (BIO '09) Proceedings of the 3rd WSEAS International Conference on COMPUTATIONAL CHEMISTRY (COMPUCHEM '09) Puerto De La Cruz, Tenerife, Canary Islands, Spain December 14-16, 2009 Recent Advances in Biology and Biomedicine A Series of Reference Books and Textbooks Published by WSEAS Press www.wseas.org Copyright © 2009, by WSEAS Press All the copyright of the present book belongs to the World Scientific and Engineering Academy and Society Press. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the Editor of World Scientific and Engineering Academy and Society Press. All papers of the present volume were peer reviewed
    [Show full text]
  • Applied Category Theory for Genomics – an Initiative
    Applied Category Theory for Genomics { An Initiative Yanying Wu1,2 1Centre for Neural Circuits and Behaviour, University of Oxford, UK 2Department of Physiology, Anatomy and Genetics, University of Oxford, UK 06 Sept, 2020 Abstract The ultimate secret of all lives on earth is hidden in their genomes { a totality of DNA sequences. We currently know the whole genome sequence of many organisms, while our understanding of the genome architecture on a systematic level remains rudimentary. Applied category theory opens a promising way to integrate the humongous amount of heterogeneous informations in genomics, to advance our knowledge regarding genome organization, and to provide us with a deep and holistic view of our own genomes. In this work we explain why applied category theory carries such a hope, and we move on to show how it could actually do so, albeit in baby steps. The manuscript intends to be readable to both mathematicians and biologists, therefore no prior knowledge is required from either side. arXiv:2009.02822v1 [q-bio.GN] 6 Sep 2020 1 Introduction DNA, the genetic material of all living beings on this planet, holds the secret of life. The complete set of DNA sequences in an organism constitutes its genome { the blueprint and instruction manual of that organism, be it a human or fly [1]. Therefore, genomics, which studies the contents and meaning of genomes, has been standing in the central stage of scientific research since its birth. The twentieth century witnessed three milestones of genomics research [1]. It began with the discovery of Mendel's laws of inheritance [2], sparked a climax in the middle with the reveal of DNA double helix structure [3], and ended with the accomplishment of a first draft of complete human genome sequences [4].
    [Show full text]
  • Gene Prediction: the End of the Beginning Comment Colin Semple
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by PubMed Central http://genomebiology.com/2000/1/2/reports/4012.1 Meeting report Gene prediction: the end of the beginning comment Colin Semple Address: Department of Medical Sciences, Molecular Medicine Centre, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK. E-mail: [email protected] Published: 28 July 2000 reviews Genome Biology 2000, 1(2):reports4012.1–4012.3 The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2000/1/2/reports/4012 © GenomeBiology.com (Print ISSN 1465-6906; Online ISSN 1465-6914) Reducing genomes to genes reports A report from the conference entitled Genome Based Gene All ab initio gene prediction programs have to balance sensi- Structure Determination, Hinxton, UK, 1-2 June, 2000, tivity against accuracy. It is often only possible to detect all organised by the European Bioinformatics Institute (EBI). the real exons present in a sequence at the expense of detect- ing many false ones. Alternatively, one may accept only pre- dictions scoring above a more stringent threshold but lose The draft sequence of the human genome will become avail- those real exons that have lower scores. The trick is to try and able later this year. For some time now it has been accepted increase accuracy without any large loss of sensitivity; this deposited research that this will mark a beginning rather than an end. A vast can be done by comparing the prediction with additional, amount of work will remain to be done, from detailing independent evidence.
    [Show full text]
  • BIOINFORMATICS Doi:10.1093/Bioinformatics/Bti144
    Vol. 00 no. 0 2004, pages 1–11 BIOINFORMATICS doi:10.1093/bioinformatics/bti144 Solving and analyzing side-chain positioning problems using linear and integer programming Carleton L. Kingsford, Bernard Chazelle and Mona Singh∗ Department of Computer Science, Lewis-Sigler Institute for Integrative Genomics, Princeton University, 35, Olden Street, Princeton, NJ 08544, USA Received on August 1, 2004; revised on October 10, 2004; accepted on November 8, 2004 Advance Access publication … ABSTRACT set of possible rotamer choices (Ponder and Richards, 1987; Motivation: Side-chain positioning is a central component of Dunbrack and Karplus, 1993) for each Cα position on the homology modeling and protein design. In a common for- backbone. The goal is to choose a rotamer for each position mulation of the problem, the backbone is fixed, side-chain so that the total energy of the molecule is minimized. This conformations come from a rotamer library, and a pairwise formulation of SCP has been the basis of some of the more energy function is optimized. It is NP-complete to find even a successful methods for homology modeling (e.g. Petrey et al., reasonable approximate solution to this problem. We seek to 2003; Xiang and Honig, 2001; Jones and Kleywegt, 1999; put this hardness result into practical context. Bower et al., 1997) and protein design (e.g. Dahiyat and Mayo, Results: We present an integer linear programming (ILP) 1997; Malakauskas and Mayo, 1998; Looger et al., 2003). In formulation of side-chain positioning that allows us to tackle homology modeling, the goal is to predict the structure for a large problem sizes.
    [Show full text]
  • CHRISTINA “TINA” WARINNER (Last Updated October 18, 2018)
    CHRISTINA “TINA” WARINNER (last updated October 18, 2018) Max Planck Institute University of Oklahoma for the Science of Human History (MPI-SHH) Department of Anthropology Department of Archaeogenetics Laboratories of Molecular Anthropology Kahlaische Strasse 10, 07743 Jena, Germany And Microbiome Research (LMAMR) +49 3641686620 101 David L. Boren Blvd, [email protected] Norman, OK 73019 USA www.christinawarinner.com [email protected] http://www.shh.mpg.de/employees/50506/25522 www.lmamr.org APPOINTMENTS W2 Group Leader, Max Planck Institute for the Science of Human History, Germany 2016-present University Professor, Friedrich Schiller University, Jena, Germany 2018-present Presidential Research Professor, Univ. of Oklahoma, USA 2014-present Assistant Professor, Dept. of Anthropology, Univ. of Oklahoma, USA 2014-present Adjunct Professor, Dept. of Periodontics, College of Dentistry, Univ. of Oklahoma, USA 2014-present Visiting Associate Professor, Dept. of Systems Biology, Technical University of Denmark 2015 Research Associate, Dept. of Anthropology, Univ. of Oklahoma, USA 2012-2014 Acting Head of Group, Centre for Evolutionary Medicine, Univ. of Zürich, Switzerland 2011-2012 Research Assistant, Centre for Evolutionary Medicine, Univ. of Zürich, Switzerland 2010-2011 EDUCATION Ph.D., Anthropology, Dept. of Anthropology, Harvard University 2010 Thesis Title: “Life and Death at Teposcolula Yucundaa: Mortuary, Archaeogenetic, and Isotopic Investigations of the Early Colonial Period in Mexico” A.M., Anthropology, Dept. of Anthropology, Harvard University 2008 B.A., with Honors, Anthropology, University of Kansas 2003 B.A., Germanic Literatures and Languages, University of Kansas 2003 SELECTED HONORS, AWARDS, AND FELLOWSHIPS Invited speaker, British Academy, Albert Reckitt Archaeological Lecture (forthcoming) 2019 Invited speaker, EMBL Science and Society (forthcoming, Nov.
    [Show full text]
  • Ayshwarya Subramanian Updated June 10, 2021
    Ayshwarya Subramanian Updated June 10, 2021 Contact Kuchroo Lab and Klarman Cell Observatory Information Broad Institute twitter: @ayshwaryas 5407G, 415 Main Street website: ayshwaryas.github.io Cambridge, MA 02412 USA email:[email protected] Areas of Computational Biology (Kidney disease, Cancer, Inflammatory Bowel Disease, RNA Biology), Ge- expertise nomic Data Analysis (Single-cell and Bulk-RNAseq, Metagenomics, Exome and single-nucleus DNA sequencing), Machine Learning, Probabilistic modeling, Phylogenetics, Applied Statistics Education 2013 Ph.D., Biological Sciences Carnegie Mellon University, Pittsburgh, PA USA Doctoral Advisor: Russell Schwartz, Ph.D. Dissertation: Inferring tumor evolution using computational phylogenetics 2007 M.Sc. (Hons), Biological Sciences (Undergraduate degree) Birla Institute of Technology and Science (BITS{Pilani), Rajasthan, India CGPA 9.21/10, Major GPA 10/10, with Distinction Undergraduate Honors Thesis: A mathematical model for phototactic responses in Halobac- terium salinarium, Max Planck Institute for Complex Technical Systems, Germany. Current Computational Scientist, Cambridge MA 2017{Present Appointment Mentors: Aviv Regev, Ph.D. & Vijay Kuchroo, Ph.D., D.V.M Research Summary: Single-cell portraits of disease and normal states using human data, mouse and organoid models Klarman Cell Observatory Broad Institute, Cambridge, MA 02142 Publications Pre-prints/Under review [1] Subramanian A†, Vernon KA†, Zhou Y† et al. Obesity-instructed TREM2high macrophages identified by comparative analysis of diabetic mouse and human kidney at single cell resolution. bioRxiv 2021.05.30.446342; doi: https://doi.org/10.1101/2021.05.30.446342. [2] Subramanian A†,Vernon KA†, Slyper M, Waldman J, et al. RAAS blockade, kidney dis- ease, and expression of ACE2, the entry receptor for SARS-CoV-2, in kidney epithelial and endothelial cells.
    [Show full text]
  • The EMBL-European Bioinformatics Institute the Hub for Bioinformatics in Europe
    The EMBL-European Bioinformatics Institute The hub for bioinformatics in Europe Blaise T.F. Alako, PhD [email protected] www.ebi.ac.uk What is EMBL-EBI? • Part of the European Molecular Biology Laboratory • International, non-profit research institute • Europe’s hub for biological data, services and research The European Molecular Biology Laboratory Heidelberg Hamburg Hinxton, Cambridge Basic research Structural biology Bioinformatics Administration Grenoble Monterotondo, Rome EMBO EMBL staff: 1500 people Structural biology Mouse biology >60 nationalities EMBL member states Austria, Belgium, Croatia, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Luxembourg, the Netherlands, Norway, Portugal, Spain, Sweden, Switzerland and the United Kingdom Associate member state: Australia Who we are ~500 members of staff ~400 work in services & support >53 nationalities ~120 focus on basic research EMBL-EBI’s mission • Provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress • Contribute to the advancement of biology through basic investigator-driven research in bioinformatics • Provide advanced bioinformatics training to scientists at all levels, from PhD students to independent investigators • Help disseminate cutting-edge technologies to industry • Coordinate biological data provision throughout Europe Services Data and tools for molecular life science www.ebi.ac.uk/services Browse our services 9 What services do we provide? Labs around the
    [Show full text]
  • Masthead (PDF)
    Proceedings of the National Academy ofPNAS Sciences of the United States of America www.pnas.org PRESIDENT OF Robert G. Roeder Geophysics John J. Mekalanos THE ACADEMY Michael Rosbash Michael L. Bender Peter Palese Ralph J. Cicerone David D. Sabatini Thomas E. Shenk Gertrud M. Schupbach Human Environmental Thomas J. Silhavy SENIOR EDITOR Robert Tjian Sciences Peter K. Vogt Solomon H. Snyder Susan Hanson Cellular and Molecular Physics ASSOCIATE EDITORS Neuroscience Immunology Anthony Leggett William C. Clark Pietro V. De Camilli Peter Cresswell Paul C. Martin Alan Fersht Richard L. Huganir Douglas T. Fearon Jack Halpern L. L. Iversen Richard A. Flavell Physiology and Jeremy Nathans Charles F. Stevens Dan R. Littman Pharmacology Thomas C. Su¨dhof Tak Wah Mak Susan G. Amara Joseph S. Takahashi William E. Paul David Julius EDITORIAL BOARD Hans Thoenen Michael Sela Ramon Latorre Ralph M. Steinman Masashi Yanagisawa Animal, Nutritional, and Chemistry Applied Microbial Sciences Stephen J. Benkovic Mathematics Plant Biology David L. Denlinger Bruce J. Berne Richard V. Kadison Maarten J. Chrispeels R. Michael Roberts Harry B. Gray Robion C. Kirby Enrico Coen Robert Haselkorn Linda J. Saif Mark A. Ratner George H. Lorimer Ryuzo Yanagimachi Barry M. Trost Medical Genetics, Hematology, and Nicholas J. Turro Anthropology Oncology Plant, Soil, and Charles T. Esmon Microbial Sciences Jeremy A. Sabloff Computer and Joseph L. Goldstein Roger N. Beachy Information Sciences Mark T. Groudine Steven E. Lindow Applied Mathematical Sciences Shafrira Goldwasser David O. Siegmund Tony Hunter M. S. Swaminathan Philip W. Majerus Susan R. Wessler Economic Sciences Newton E. Morton Applied Physical Sciences Ronald W.
    [Show full text]
  • Medical Advisory Board September 1, 2006–August 31, 2007
    hoWard hughes medical iNstitute 2007 annual report What’s Next h o W ard hughes medical i 4000 oNes Bridge road chevy chase, marylaNd 20815-6789 www.hhmi.org N stitute 2007 a nn ual report What’s Next Letter from the president 2 The primary purpose and objective of the conversation: wiLLiam r. Lummis 6 Howard Hughes Medical Institute shall be the promotion of human knowledge within the CREDITS thiNkiNg field of the basic sciences (principally the field of like medical research and education) and the a scieNtist 8 effective application thereof for the benefit of mankind. Page 1 Page 25 Page 43 Page 50 seeiNg Illustration by Riccardo Vecchio Südhof: Paul Fetters; Fuchs: Janelia Farm lab: © Photography Neurotoxin (Brunger & Chapman): Page 3 Matthew Septimus; SCNT images: by Brad Feinknopf; First level of Rongsheng Jin and Axel Brunger; iN Bruce Weller Blake Porch and Chris Vargas/HHMI lab building: © Photography by Shadlen: Paul Fetters; Mouse Page 6 Page 26 Brad Feinknopf (Tsai): Li-Huei Tsai; Zoghbi: Agapito NeW Illustration by Riccardo Vecchio Arabidopsis: Laboratory of Joanne Page 44 Sanchez/Baylor College 14 Page 8 Chory; Chory: Courtesy of Salk Janelia Farm guest housing: © Jeff Page 51 Ways Illustration by Riccardo Vecchio Institute Goldberg/Esto; Dudman: Matthew Szostak: Mark Wilson; Evans: Fred Page 10 Page 27 Septimus; Lee: Oliver Wien; Greaves/PR Newswire, © HHMI; Mello: Erika Larsen; Hannon: Zack Rosenthal: Paul Fetters; Students: Leonardo: Paul Fetters; Riddiford: Steitz: Harold Shapiro; Lefkowitz: capacity Seckler/AP, © HHMI; Lowe: Zack Paul Fetters; Map: Reprinted by Paul Fetters; Truman: Paul Fetters Stewart Waller/PR Newswire, Seckler/AP, © HHMI permission from Macmillan Page 46 © HHMI for Page 12 Publishers, Ltd.: Nature vol.
    [Show full text]
  • UNIVERSITY of CALIFORNIA RIVERSIDE Unsupervised And
    UNIVERSITY OF CALIFORNIA RIVERSIDE Unsupervised and Zero-Shot Learning for Open-Domain Natural Language Processing A Dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science by Muhammad Abu Bakar Siddique June 2021 Dissertation Committee: Dr. Evangelos Christidis, Chairperson Dr. Amr Magdy Ahmed Dr. Samet Oymak Dr. Evangelos Papalexakis Copyright by Muhammad Abu Bakar Siddique 2021 The Dissertation of Muhammad Abu Bakar Siddique is approved: Committee Chairperson University of California, Riverside To my family for their unconditional love and support. i ABSTRACT OF THE DISSERTATION Unsupervised and Zero-Shot Learning for Open-Domain Natural Language Processing by Muhammad Abu Bakar Siddique Doctor of Philosophy, Graduate Program in Computer Science University of California, Riverside, June 2021 Dr. Evangelos Christidis, Chairperson Natural Language Processing (NLP) has yielded results that were unimaginable only a few years ago on a wide range of real-world tasks, thanks to deep neural networks and the availability of large-scale labeled training datasets. However, existing supervised methods assume an unscalable requirement that labeled training data is available for all classes: the acquisition of such data is prohibitively laborious and expensive. Therefore, zero-shot (or unsupervised) models that can seamlessly adapt to new unseen classes are indispensable for NLP methods to work in real-world applications effectively; such models mitigate (or eliminate) the need for collecting and annotating data for each domain. This dissertation ad- dresses three critical NLP problems in contexts where training data is scarce (or unavailable): intent detection, slot filling, and paraphrasing. Having reliable solutions for the mentioned problems in the open-domain setting pushes the frontiers of NLP a step towards practical conversational AI systems.
    [Show full text]
  • ISMB 2008 Toronto
    ISMB 2008 Toronto The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Linial, Michal, Jill P. Mesirov, B. J. Morrison McKay, and Burkhard Rost. 2008. ISMB 2008 Toronto. PLoS Computational Biology 4(6): e1000094. Published Version doi:10.1371/journal.pcbi.1000094 Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:11213310 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA Message from ISCB ISMB 2008 Toronto Michal Linial1,2, Jill P. Mesirov1,3, BJ Morrison McKay1*, Burkhard Rost1,4 1 International Society for Computational Biology (ISCB), University of California San Diego, La Jolla, California, United States of America, 2 Sudarsky Center, The Hebrew University of Jerusalem, Jerusalem, Israel, 3 Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America, 4 Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, United States of America the integration of students, and for the of ISMB. One meeting in South Asia support of young leaders in the field. (InCoB; http://incob.binfo.org.tw/) has ISMB has also become a forum for already been sponsored by ISCB, and reviewing the state of the art in the many another one in North Asia is going to fields of this growing discipline, for follow. ISMB itself has also been held in introducing new directions, and for an- Australia (2003) and Brazil (2006).
    [Show full text]
  • From DNA Sequence to Chromatin Dynamics: Computational Analysis of Transcriptional Regulation
    From DNA Sequence to Chromatin Dynamics: Computational Analysis of Transcriptional Regulation Thesis submitted for the degree of “Doctor of Philosophy” by Tommy Kaplan Submitted to the Senate of the Hebrew University May 2008 This work was carried out under the supervision of Prof. Nir Friedman and Prof. Hanah Margalit Abstract All cells of a living organism share the same DNA. Yet, they differ in structure, activities and interactions. These differences arise through a tight regulatory system which activates different genes and pathways to fit the cell’s specialization, condition, and requirements. Deciphering the regulatory mechanisms underlying a living cell is one of the fundamental challenges in biology. Such knowledge will allow us to better understand how cells work, how they respond to external stimuli, what goes wrong in diseases like cancer (which often involves disruption of gene regulation), and how it can be fought. In my PhD, I focus on regulation of gene expression from three perspectives. First, I present an innovative algorithm for identifying the target genes of novel transcription factors, based on their protein sequence (Chapter 1). Second, I consider how several transcription factors cooperate to process external stimuli and alter the behavior of the cell (Chapter 2). Finally, I study how the genomic position of nucleosomes and their covalent modifications modulate the accessibility of DNA to transcription factors, thus adding a fascinating dimension to transcriptional regulation (Chapters 3 and 4). To understand transcriptional regulation, one should first reconstruct the architecture of the cell’s regulatory map, thus identifying which genes are regulated by which transcription factors (TFs).
    [Show full text]