Fakultät Für Informatik Masterarbeit Predict Subcellular Localization In
LUDWIG-MAXIMILIANS-UNIVERSITÄT
TECHNISCHE UNIVERSITÄT MÜNCHEN

Fakultät für Informatik

Master's Thesis in Bioinformatics

Predict Subcellular Localization in All Kingdoms

Tatyana Goldberg

Supervisor: Prof. Dr. Burkhard Rost
Advisor: Dipl. Bioinf. Tobias Hamp
Submission date: 17 October 2011

I affirm that I wrote this Master's thesis independently and used only the cited sources and aids.

17 October 2011
Tatyana Goldberg

Abstract

Predicting the subcellular localization of a protein is an important step towards understanding its function. Here, a new method for predicting localization in all six taxonomic kingdoms is presented. The method was developed on a non-redundant data set of proteins of known localization from SWISS-PROT. Three localization classes were targeted for archaea, six for bacteria and eleven for eukaryota. Prediction requires an amino acid sequence and the taxonomic classification.

The method was developed using support vector machines, and a range of string kernels was examined. The kernel using evolutionary profiles was selected as the most appropriate for detecting compartment-specific patterns. A number of multiclass classification techniques were then compared, including one-against-all, various types of ensembles of nested dichotomies and the nested dichotomy with a fixed structure. The latter allowed prediction of protein subcellular localization by mimicking the cascading mechanism of cellular sorting. Its overall accuracy was comparable to or higher than that of the other classification techniques, while its computation time was significantly lower.

Three separate classifiers were trained on non-membrane proteins, transmembrane proteins and proteins of all types, allowing the last of these to be applied to large-scale screenings of entire proteomes.
When evaluated on the non-redundant test sets, the method developed on all types of proteins achieved the highest level of accuracy for archaeal proteins, 84% accuracy for bacterial proteins and 64% for eukaryotic proteins, thus outperforming current state-of-the-art predictors. In addition, the prediction methods were benchmarked on three independent data sets that were not used during their development. The method developed here surpassed the other methods in nearly all benchmarks.

Zusammenfassung (Summary)

Predicting the subcellular localization of a protein is an important step towards elucidating its function. Here, a new method for predicting localization in all six taxonomic kingdoms is presented. The method was developed on a non-redundant data set of proteins of known localization from the SWISS-PROT database. Three classes were predicted for archaea, six for bacteria and eleven for eukaryotes. Prediction requires an amino acid sequence and the taxonomic classification.

Support vector machines were used to develop the prediction method, and a range of string kernels was tested. The kernel that works with evolutionary profiles was selected as the most suitable for detecting compartment-specific patterns in the sequence. A number of multiclass classification techniques were then compared, including one-against-all, various ensembles of nested dichotomies and the nested dichotomy with a fixed structure. The latter allows subcellular localization to be predicted by mimicking the cascading mechanism of protein sorting in the cell. The accuracy of this technique was comparable to or higher than that of the other classification techniques, while its computation time was markedly lower.

Three different classifiers were trained on non-membrane proteins, transmembrane proteins and proteins of all types. The last of these enabled application to large-scale screenings of entire proteomes. Evaluation of the method developed on proteins of all types showed that it can reach the highest level of accuracy for proteins from archaea, 84% accuracy for proteins from bacteria and 64% accuracy for proteins from eukaryota. It thus performed better than the current state-of-the-art prediction methods. The method was additionally compared with other methods on three independent data sets that were not used during its development. The method developed here surpassed the other methods in almost all benchmark tests.

Acknowledgments

I consider myself very fortunate to have had the opportunity to study at two outstanding German universities, the Ludwig Maximilian University and the Technical University of Munich. Both universities provided me with an excellent academic environment throughout the years of my study, which I greatly enjoyed. I want to express my sincere gratitude to the professors who supported me all the way to the point where I am and to my fellow students for the great time we had together.

Most of all, I wish to thank Prof. Burkhard Rost for offering me a warm welcome into his group and for supervising this thesis. His great experience, patience and fruitful discussions have continuously encouraged me. He gave me opportunities such as mentoring two programs for high-school students, helping to teach the 'Bioinformatics Lab' course, and attending an international conference on computational biology and presenting a poster there, all of which helped me to enhance my skills considerably and were greatly appreciated.

I also owe many thanks to Tobias Hamp for providing excellent support with my thesis. I thank him for all the help with machine learning, for making me look at the statistics more seriously and for repeatedly telling me not to worry too much.
I also thank him for proofreading this thesis and for his valuable comments.

Further, I wish to thank the whole Rost Lab group, with whom I had the pleasure to work. In particular, I want to thank Laszlo Kajan and Guy Yachdav for introducing me to the Rost Lab and for always having good advice for me concerning work or future plans. I also thank Timothy Karl for the help with the computer cluster, Shruti Rastogi for the initial help with LOCtree and for providing the LocDB data, Edda Kloppmann for her comments on my work, Yana Bromberg for our delightful conversations, Marc Offman for his eccentric jokes, and Andrea Schafferhans and Shaila Rössle for their good mood. I would like to thank Christian Schäfer, Esmeralda Vicedo, Arthur Dong, Maina Bitar, Dedan Githae and the students for the range of activities we undertook together. Many thanks go to Marlena Drabik and Lothar Richter for their help with administrative matters.

On the personal side, I want to thank my lovely mother Nelly for teaching me to appreciate knowledge and education. I thank my brother Valerij, who has always been my big hero and always will be. Margarita, his wife, and my three little nephews deserve special thanks for the fun we have together. Finally, I thank my boyfriend Taras for always being there for me even over long distances and differences in time.

Last but not least, I wish to thank Sebastian Briesemeister from the University of Tübingen for running MultiLoc2 predictions and all those who make their data and implementations publicly available.

Contents

Abstract
Zusammenfassung (Summary)
Acknowledgments
List of Figures
List of Tables

1 Introduction
  1.1 Subcellular Localization as a Functional Characteristic of a Protein
  1.2 Transmembrane Proteins
  1.3 Cell Compartmentalization
  1.4 Protein Sorting
    1.4.1 The Protein Trafficking System
    1.4.2 Sorting Signals
  1.5 Prediction of Subcellular Localization
  1.6 Novel Method

2 Materials and Methods
  2.1 Workflow
  2.2 Data Sets for Development
    2.2.1 Data Sets Extraction
    2.2.2 Homology Reduction
    2.2.3 Scores for Measuring Sequence Similarity
    2.2.4 Size Increase of the Training Sets
  2.3 Data Sets for Testing Only
    2.3.1 LocDB
    2.3.2 Newly Added SWISS-PROT Proteins
  2.4 Support Vector Machines
    2.4.1 Linear Classification
    2.4.2 Soft Margin
    2.4.3 Non-Linear Classification
    2.4.4 The Kernel Matrix
    2.4.5 Sequential Minimal Optimization
  2.5 String Kernels
    2.5.1 String Subsequence Kernel
    2.5.2 Mismatch Kernel
    2.5.3 Profile Kernel
  2.6 Multiclass Classification
    2.6.1 One-Against-All
    2.6.2 Ensemble of Nested Dichotomies
    2.6.3 Ensemble of Class Balanced Nested Dichotomies
    2.6.4 Ensemble of Data Balanced Nested Dichotomies
    2.6.5 Predefined Nested Dichotomies
  2.7 Performance Evaluation
    2.7.1 Accuracy, Coverage and their Geometric Average
    2.7.2 The Standard Error
    2.7.3 Stratified k-fold Cross Validation
  2.8 External Prediction Methods
    2.8.1 CELLO v.2.5
    2.8.2 LOCtree
    2.8.3 MultiLoc2
    2.8.4 WoLF PSORT
  2.9 Box Plots

3 Results and Discussion
  3.1 Training and Test Sets
  3.2 Kernel Selection
    3.2.1 Parameter Optimization
    3.2.2 Comparative Evaluation
  3.3 Classification Model Selection
  3.4 Model Parameter Optimization
    3.4.1 Generalization Performance
    3.4.2 Classification Runtime
  3.5 Testing
    3.5.1 Performance Evaluation
    3.5.2 Prediction Reliability
    3.5.3 Comparison with the External Classifiers
  3.6 Application to the Independent Test Sets
    3.6.1 Re-Training of the Final Classification Model
    3.6.2 Comparison with the External Classifiers
  3.7 Localization-wise Performance of MyND

4 Conclusion

5 Appendix
  5.1 Effectiveness of the Mismatch Kernel
  5.2 Effectiveness of the String Subsequence Kernel
  5.3 Effectiveness of the Profile Kernel
  5.4 MyND Benchmark on Eukaryotic Non-membrane Proteins
  5.5 MyND Benchmark on Eukaryotic Transmembrane Proteins

Bibliography

List of Figures

1.1 Subcellular compartments of prokaryotic and eukaryotic cells