Applications of Machine Learning to Solve Biological Puzzles

Iowa State University Capstones, Theses and Dissertations
Graduate Theses and Dissertations
2019

Applications of machine learning to solve biological puzzles
Carla M. Mann
Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd
Part of the Bioinformatics Commons, and the Computer Sciences Commons

Recommended Citation
Mann, Carla M., "Applications of machine learning to solve biological puzzles" (2019). Graduate Theses and Dissertations. 17508.
https://lib.dr.iastate.edu/etd/17508

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

Applications of machine learning to solve biological puzzles

by

Carla M. Mann

A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Major: Bioinformatics and Computational Biology

Program of Study Committee:
Drena L. Dobbs, Co-major Professor
Robert Jernigan, Co-major Professor
Carolyn Lawrence-Dill
Maura McGrail
Kris De Brabanter

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University
Ames, Iowa
2019

Copyright © Carla M. Mann, 2019. All rights reserved.

DEDICATION

There were a number of people who did not participate in the research or writing or lab meetings, but without them this dissertation would not exist and I would be a very different person. Thank you for everything.
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
NOMENCLATURE
ACKNOWLEDGMENTS
ABSTRACT

CHAPTER 1. INTRODUCTION
    1.1 Identifying Complex Recognition Signals in Biological Sequences
    1.2 Why Machine Learning?
    1.3 Specific Aims of This Research
    1.4 Organization of This Thesis
    References

CHAPTER 2. RNA-PROTEIN INTERACTION PREDICTIONS WIKI PAGE
    Introduction
    Significance
    Features
        Sequence Based Features
            Sequence composition
            Sequence motifs
            Hydrophobicity and hydrophilicity
        Structure-based Features
            Protein secondary structure
            Protein disorder
            RNA secondary structure
        Feature Dimensions
    Models
        Machine Learning Methods
            Random forests
            Gradient boosting
            Support vector machines
            Neural networks
            Multi-classifier methods
        Scoring Systems
    Datasets
        Dataset Creation
            Structure-derived datasets
            Datasets from high-throughput experiments
            Non-redundant datasets
            Experimentally-validated negative training datasets
        Publicly Available Datasets
    Methods
    Databases of Known Interactions
        Structure-based Databases:
            Protein Data Bank
            Nucleic Acid Database
        Sequence-based Databases:
            ENCODE
            GEO
            NPInter
            POSTAR2
            UniProt
    See Also
    References

CHAPTER 3. RPIDisorder: A METHOD FOR PREDICTING RNA-PROTEIN PARTNERS USING INTRINSIC PROTEIN DISORDER
    Abstract
    Introduction
        RNA-Protein Interactions Play Important Biological Roles
        Examples of Disruptions in Regulatory RNA-Protein Interaction Networks that Lead to Disease
        Intrinsic Protein Disorder May Play a Role in Determining RNA-Protein Interaction Specificity
        Why Predict RPIs?
        Available RPI Prediction Methods
    Methods
        Datasets
            RPI2241 structure-derived dataset (RPI-PDB)
            RPI12252* NPInter-derived dataset (RPI-NPInter*)

Details

  • File Type
    pdf
  • Content Languages
    English
  • File Pages
    254 pages
