Applications of Machine Learning to Solve Biological Puzzles
Graduate Theses and Dissertations
Iowa State University Capstones, Theses and Dissertations
2019

Applications of machine learning to solve biological puzzles
Carla M. Mann
Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd
Part of the Bioinformatics Commons, and the Computer Sciences Commons

Recommended Citation
Mann, Carla M., "Applications of machine learning to solve biological puzzles" (2019). Graduate Theses and Dissertations. 17508.
https://lib.dr.iastate.edu/etd/17508

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

Applications of machine learning to solve biological puzzles
by
Carla M. Mann

A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY

Major: Bioinformatics and Computational Biology

Program of Study Committee:
Drena L. Dobbs, Co-major Professor
Robert Jernigan, Co-major Professor
Carolyn Lawrence-Dill
Maura McGrail
Kris De Brabanter

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University
Ames, Iowa
2019

Copyright © Carla M. Mann, 2019. All rights reserved.

DEDICATION

There were a number of people who did not participate in the research or writing or lab meetings, but without them this dissertation would not exist and I would be a very different person. Thank you for everything.
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
NOMENCLATURE
ACKNOWLEDGMENTS
ABSTRACT

CHAPTER 1. INTRODUCTION
    1.1 Identifying Complex Recognition Signals in Biological Sequences
    1.2 Why Machine Learning?
    1.3 Specific Aims of This Research
    1.4 Organization of This Thesis
    References

CHAPTER 2. RNA-PROTEIN INTERACTION PREDICTIONS WIKI PAGE
    Introduction
    Significance
    Features
        Sequence Based Features
            Sequence composition
            Sequence motifs
            Hydrophobicity and hydrophilicity
        Structure-based Features
            Protein secondary structure
            Protein disorder
            RNA secondary structure
        Feature Dimensions
    Models
        Machine Learning Methods
            Random forests
            Gradient boosting
            Support vector machines
            Neural networks
            Multi-classifier methods
        Scoring Systems
    Datasets
        Dataset Creation
            Structure-derived datasets
            Datasets from high-throughput experiments
            Non-redundant datasets
            Experimentally-validated negative training datasets
        Publicly Available Datasets
    Methods
    Databases of Known Interactions
        Structure-based Databases:
            Protein Data Bank
            Nucleic Acid Database
        Sequence-based Databases:
            ENCODE
            GEO
            NPInter
            POSTAR2
            UniProt
    See Also
    References

CHAPTER 3. RPIDisorder: A METHOD FOR PREDICTING RNA-PROTEIN PARTNERS USING INTRINSIC PROTEIN DISORDER
    Abstract
    Introduction
        RNA-Protein Interactions Play Important Biological Roles
        Examples of Disruptions in Regulatory RNA-Protein Interaction Networks that Lead to Disease
        Intrinsic Protein Disorder May Play a Role in Determining RNA-Protein Interaction Specificity
        Why Predict RPIs?
        Available RPI Prediction Methods
    Methods
        Datasets
            RPI2241 structure-derived dataset (RPI-PDB)
            RPI12252* NPInter-derived dataset (RPI-NPInter*)