Next Generation Machine Learning Prediction of Protein Cellular Sorting
TECHNISCHE UNIVERSITÄT MÜNCHEN
Lehrstuhl für Bioinformatik

Next Generation Machine Learning Prediction of Protein Cellular Sorting

Tatyana Goldberg

Complete reprint of the dissertation approved by the Faculty of Informatics of the Technische Universität München for the award of the academic degree of Doctor of Natural Sciences (Doktor der Naturwissenschaften).

Chair: Univ.-Prof. Dr. Daniel Cremers
Examiners of the dissertation:
1. Univ.-Prof. Dr. Burkhard Rost
2. Prof. Dr. Yana Bromberg, The State University of New Jersey, USA
3. Univ.-Prof. Dr. Iris Antes

The dissertation was submitted to the Technische Universität München on 15 December 2015 and accepted by the Faculty of Informatics on 22 March 2016.

Table of Contents

Abstract
Zusammenfassung
Acknowledgments
List of publications
List of Figures and Tables

1. Introduction
   1.1 Sub-cellular localization is an aspect of protein function
   1.2 Applications of protein sub-cellular localization data
   1.3 Experimental characterization of protein localization
   1.4 In silico methods predicting protein sorting
   1.5 Overview of this work
   1.6 References
2. LocTree2: prediction of protein cellular sorting in all domains of life
   2.1 Preface
   2.2 Journal Article. Goldberg T., Hamp T., Rost B. Bioinformatics 2012;28(18):i458-i465
   2.3 References
3. LocTree3: improved prediction of protein cellular sorting
   3.1 Preface
   3.2 Journal Article. Goldberg T., Hecht M., Rost B., et al. NAR 2014;42:W350-5
   3.3 References
4. Prediction of nuclear import and nuclear protein sorting
   4.1 Introduction
   4.2 Compartmentalization of the nucleus
   4.3 Materials and Methods
   4.4 Results and Discussion
   4.5 References
5. NLSDb2.0: a database of nuclear import and export signals
   5.1 Introduction
   5.2 Materials and Methods
   5.3 Results and Discussion
   5.4 Database description
   5.5 References
6. pEffect: prediction of bacterial type III effector proteins
   6.1 Introduction
   6.2 Materials and Methods
   6.3 Results
   6.4 Discussion
   6.5 References
7. LocText: a manually annotated text corpus for protein localization data
   7.1 Preface
   7.2 Journal Article. Goldberg T., Vinchurkar S., Cejuela J.M., Jensen L.J., Rost B. BMC Proceedings 2015, 9(Suppl 5):A4
   7.3 References
8. Conclusions
   8.1 Our work in the context of the developments in the field
   8.2 Summary of our main findings
   8.3 References
9. Appendix
   9.1 Supplementary Figures
   9.2 Supplementary Tables
   9.3 References

Abstract

Living cells are divided into specific compartments, each responsible for a different cellular function. The identification of protein localization (i.e.
into which sub-cellular compartment a protein is sorted) is important for understanding protein function, as certain functions can only be performed in certain environments. Despite advances in high-throughput experiments for protein localization, the gap between the number of known proteins and the number of proteins with known localization continues to grow. Several computational approaches have been developed to predict protein localization, yet many challenges remain.

The work at hand describes a series of novel machine learning-based approaches that predict protein localization from amino acid sequence. Protein localization is predicted at different resolution levels: (i) a cell, (ii) a compartment and (iii) a pathogenic organism.

The first approach employs machine learning (profile kernel Support Vector Machines) to predict protein sub-cellular localization. Adding homology-based inference improves prediction performance by 25%. The improved method was made publicly available as a web server and was used to annotate over 1,000 entirely sequenced proteomes.

Predicting protein localization at the resolution of a single compartment is a harder problem due to the lack of experimental data. This work presents another method that combines homology-based inference with machine learning to predict proteins in 13 sub-nuclear localizations. In addition, a database is described that archives all experimentally known nuclear signals, i.e. “zip codes” that guide nuclear protein import and export. Derived from the set of experimental signals, the database also proposes a two-fold larger set of potential, computationally determined signals that await experimental verification.

Knowledge of protein localization can assist in the identification of pathogenic bacteria. The type III secretion system is a key mechanism for