Computational Toxinology Joseph D. Romano Submitted in Partial

Computational Toxinology Joseph D. Romano Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy under the Executive Committee of the Graduate School of Arts and Sciences COLUMBIA UNIVERSITY 2019 Oc 2019 Joseph D. Romano All rights reserved ABSTRACT Computational Toxinology Joseph D. Romano Venoms are complex mixtures of biological macromolecules and other com- pounds that are used for predatory and defensive purposes by hundreds of thousands of known species worldwide. Throughout human history, venoms and venom components have been used to treat a vast array of illnesses, causing them to be of great clinical, economic, and academic interest to the drug discovery and toxinology communities. In spite of major computational advances that facilitate data-driven drug discovery, most therapeutic venom effects are still discovered via tedious trial-and-error, or simply by accident. In this dissertation, I describe a body of work that aims to establish a new subdiscipline of translational bioinformatics, which I name “computational toxinology”. To accomplish this goal, I present three integrated components that span a wide range of informatics techniques: (1) VenomKB, (2) VenomSeq, and (3) VenomKB’s Semantic API. To provide a platform for structuring, representing, retrieving, and integrating venom data relevant to drug discovery, VenomKB provides a database-backed web application and knowledge base for computational toxinology. VenomKB is structured according to a fully-featured ontology of venoms, and provides data aggregated from many popular web resources. VenomSeq is a biotechnology workflow that is designed to generate new high-throughput sequencing data for incorporation into VenomKB. Specif- ically, we expose human cells to controlled doses of crude venoms, conduct RNA-Sequencing, and build profiles of differential gene expression, which we then compare to publicly-available differential expression data for known diseases and drugs with known effects, and use those comparisons to hypothesize ways that the venoms could act in a therapeutic manner, as well. These data are then integrated into VenomKB, where they can be effectively retrieved and evaluated using existing data and known therapeutic associations. VenomKB’s Semantic API further develops this functionality by providing an intelligent, powerful, and user-friendly interface for querying the complex underlying data in VenomKB in a way that reflects the intuitive, human-understandable mean- ing of those data. The Semantic API is designed to cater to the needs of advanced users as well as laypersons and bench scientists without previous ex- pertise in computational biology and semantic data analysis. In each chapter of the dissertation, I describe how we evaluated these 3 components through various approaches. We demonstrate the utility of Ven- omKB and the Semantic API by testing a number of practical use-cases for each, designed to highlight their ability to rediscover existing knowledge as well as suggesting potential areas for future exploration. We use statistics and data science techniques to evaluate VenomSeq on 25 diverse species of venomous an- imals, and propose biologically feasible explanations for significant findings. In evaluating the Semantic API, I show how observations on VenomSeq data can be interpreted and placed into the context of past research by members of the larger toxinology community. Computational toxinology is a toolbox designed to be used by multiple stakeholders (toxinologists, computational biologists, and systems pharmacolo- gists, among others) to improve the return rate of clinically-significant findings from manual experimentation. It aims to achieve this goal by enabling access to data, providing means for easy validation of results, and suggesting specific hypotheses that are preliminarily supported by rigorous inferential statistics. All components of the research I describe are open-access and publicly available, to improve reproducibility and encourage widespread adoption. Contents List of Figures iv List of Tables vi Acknowledgements vii Dedication ix 1. Background and Introduction1 1.1. Dissertation Overview . .3 1.1.1. Specific aims . .3 1.1.2. Knowledge gaps . .7 1.1.3. Significance and contributions . .9 1.1.4. Limitations . 11 1.2. Toxins, toxinology, and natural products . 13 1.3. Informatics methods in natural product drug discovery . 14 1.3.1. Classes of therapeutic natural products . 19 1.3.2. Cheminformatics methods . 24 1.3.3. Bioinformatics methods . 29 1.3.4. Semantic (knowledge-based) methods . 34 1.3.5. Gaps and opportunities in NP drug discovery . 43 1.3.6. Data needs for NP drug discovery . 45 1.4. Semantic knowledge resources . 47 1.4.1. Knowledge representations . 47 1.4.2. Constructing and using ontologies . 49 1.5. Data-driven discovery from gene expression data . 51 1.5.1. Data-driven science . 51 1.5.2. Differential expression analysis . 54 2. An informatics infrastructure for computational toxinology 58 2.1. VenomKB v1 . 58 2.1.1. Introduction . 58 2.1.2. Methods . 61 2.1.3. Description of data in VenomKB v1.0 . 65 2.1.4. Technical validation of VenomKB v1 . 69 i 2.1.5. VenomKB v1 usage notes . 73 2.1.6. VenomKB v1 Data Citations . 75 2.2. Venom Ontology . 75 2.2.1. Introduction . 75 2.2.2. Methods . 76 2.2.3. Results . 79 2.2.4. Discussion . 86 2.3. VenomKB v2 . 90 2.3.1. Results . 91 2.3.2. Discussion . 102 2.3.3. Methods . 106 3. A transcriptomic approach for generating therapeutic effect data from venoms 111 3.1. Introduction (VenomSeq)............................ 111 3.1.1. Enrichment analysis . 111 3.2. Results (VenomSeq)............................... 115 3.2.1. Venom dosages . 116 3.2.2. mRNA sequencing of venom-perturbed human cells . 118 3.2.3. Differential expression signatures of venom-perturbed human cells . 118 3.2.4. Associations between venoms and existing drugs . 119 3.2.5. Associations between venoms and disease regulatory networks . 130 3.3. Discussion (VenomSeq)............................. 131 3.3.1. Venoms versus small-molecule drugs . 131 3.3.2. Venoms versus human diseases . 132 3.3.3. Biologically plausible therapeutic hypotheses . 133 3.3.4. Accessing and querying VenomSeq data . 137 3.3.5. VenomSeq data analysis software . 138 3.3.6. Transitioning from venoms to venom components . 138 3.3.7. Applying VenomSeq to other natural product classes . 139 3.3.8. Interpreting connectivity analysis validation results . 140 3.4. Methods (VenomSeq).............................. 142 3.4.1. Reagents and materials . 142 3.4.2. Obtaining 25 venoms . 143 3.4.3. Growth inhibition assays . 143 3.4.4. mRNA Sequencing . 146 3.4.5. Constructing expression signatures . 147 3.4.6. Comparing venoms to known drugs and diseases . 147 3.4.7. Assessing sequencing technology and cell type compatibility . 153 ii 4. Integrating and delivering venom knowledge using semantic data analysis 155 4.1. Structuring and representing VenomSeq data in VenomKB . 155 4.2. The VenomKB Semantic API . 156 4.2.1. Related work and advantages over existing tools . 158 4.2.2. A theoretical framework for the semantic API . 160 4.2.3. Semantic API implementation . 162 4.2.4. The benefits of a semantic API . 169 4.2.5. Complete example . 171 4.2.6. Task-based evaluation of the semantic API for computational toxinology.................................... 174 4.2.7. Using the semantic API in other research contexts . 182 4.3. Automating discovery with VenomKB . 183 5. Conclusions 186 5.1. Summary of findings and original contributions . 186 5.2. Limitations and future work . 189 Bibliography 191 Appendix A. PLATE-Seq quality control data 217 Appendix B. VenomSeq JSON schema 222 Appendix C. Listing of algorithms 225 C.1. Connectivity analysis . 225 C.2. VIPER and aREA-3T . 226 C.3. The Semantic API . 228 Appendix D. Code availability 231 iii List of Figures 1.1. Therapeutic venom examples . .2 1.2. Graphical abstract . .4 1.3. XKCD webcomic—“How Standards Proliferate” . 12 1.4. Informatics methods in NP drug discovery . 15 1.5. Protégéscreenshot . 50 1.6. Hypothesis-driven vs. data-driven science . 52 2.1. Schematic of VenomKB (v1.0) . 59 2.2. VenomKB v1 home page . 60 2.3. VenomKB v1 home page . 70 2.4. VO class diagram . 79 2.5. Venom sequence similarity networks . 81 2.6. Distributions of venom complexity in VO . 85 2.7. VenomKB v2 home page . 92 2.8. VenomKB v2 class diagram . 93 2.9. Datatype counts in VenomKB . 95 2.10. Venom complexity by taxonomic group . 96 2.11. VenomKB drug and target example . 98 2.12. List of data entries in VenomKB . 101 2.13. Interface for a data record in VenomKB . 110 3.1. VenomSeq graphical abstract . 112 3.2. GSEA example . 115 3.3. Venom growth inhibiton plots . 117 3.4. MA plot; O. macropus . 120 3.5. Connectivity analysis results . 121 3.6. Connectivity score density plots . 124 3.7. Validating VenomSeq technology with known drugs . 126 3.8. Digoxin and ATP1A . 135 3.9. FGFR signaling . 136 3.10. RNA-Seq workflow summary. 142 3.11. VenomSeq species cladogram . 145 3.12. Data analysis workflow summary . 148 4.1. Semantic API schematic; simplified view . 163 iv 4.2. Semantic API schematic; detailed view . ..

Load more