Applications of Machine Learning to Solve Biological Puzzles

Iowa State University Capstones, Theses and Dissertations
Graduate Theses and Dissertations
2019

Applications of machine learning to solve biological puzzles
Carla M. Mann
Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd
Part of the Bioinformatics Commons, and the Computer Sciences Commons

Recommended Citation
Mann, Carla M., "Applications of machine learning to solve biological puzzles" (2019). Graduate Theses and Dissertations. 17508.
https://lib.dr.iastate.edu/etd/17508

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

Applications of machine learning to solve biological puzzles
by
Carla M. Mann

A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY

Major: Bioinformatics and Computational Biology

Program of Study Committee:
Drena L. Dobbs, Co-major Professor
Robert Jernigan, Co-major Professor
Carolyn Lawrence-Dill
Maura McGrail
Kris De Brabanter

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University
Ames, Iowa
2019

Copyright © Carla M. Mann, 2019. All rights reserved.

DEDICATION

There were a number of people who did not participate in the research or writing or lab meetings, but without them this dissertation would not exist and I would be a very different person. Thank you for everything.
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
NOMENCLATURE
ACKNOWLEDGMENTS
ABSTRACT

CHAPTER 1. INTRODUCTION
    1.1 Identifying Complex Recognition Signals in Biological Sequences
    1.2 Why Machine Learning?
    1.3 Specific Aims of This Research
    1.4 Organization of This Thesis
    References

CHAPTER 2. RNA-PROTEIN INTERACTION PREDICTIONS WIKI PAGE
    Introduction
    Significance
    Features
        Sequence Based Features
            Sequence composition
            Sequence motifs
            Hydrophobicity and hydrophilicity
        Structure-based Features
            Protein secondary structure
            Protein disorder
            RNA secondary structure
        Feature Dimensions
    Models
        Machine Learning Methods
            Random forests
            Gradient boosting
            Support vector machines
            Neural networks
            Multi-classifier methods
        Scoring Systems
    Datasets
        Dataset Creation
            Structure-derived datasets
            Datasets from high-throughput experiments
            Non-redundant datasets
            Experimentally-validated negative training datasets
        Publicly Available Datasets
    Methods
    Databases of Known Interactions
        Structure-based Databases:
            Protein Data Bank
            Nucleic Acid Database
        Sequence-based Databases:
            ENCODE
            GEO
            NPInter
            POSTAR2
            UniProt
    See Also
    References

CHAPTER 3. RPIDisorder: A METHOD FOR PREDICTING RNA-PROTEIN PARTNERS USING INTRINSIC PROTEIN DISORDER
    Abstract
    Introduction
        RNA-Protein Interactions Play Important Biological Roles
        Examples of Disruptions in Regulatory RNA-Protein Interaction Networks that Lead to Disease
        Intrinsic Protein Disorder May Play a Role in Determining RNA-Protein Interaction Specificity
        Why Predict RPIs?
        Available RPI Prediction Methods
    Methods
        Datasets
            RPI2241 structure-derived dataset (RPI-PDB)
            RPI12252* NPInter-derived dataset (RPI-NPInter*)
