Advanced Stochastic Protein Sequence Analysis

Dipl.-Inform. Thomas Plötz, AG Angewandte Informatik, Technische Fakultät, Universität Bielefeld, email: [email protected]

Printed version of the approved dissertation for the academic degree Doktor der Ingenieurwissenschaften (Dr.-Ing.). Submitted to the Technische Fakultät of Universität Bielefeld by Thomas Plötz on 18 April 2005; defended and approved on 13 June 2005.

Reviewers: PD Dr.-Ing. Gernot A. Fink, Universität Bielefeld; Dr. rer. nat. Karsten Quast, Boehringer Ingelheim Pharma GmbH und Co. KG; Prof. Dr. Jens Stoye, Universität Bielefeld

Examination committee: Prof. Dr. Robert Giegerich, Universität Bielefeld; PD Dr.-Ing. Gernot A. Fink, Universität Bielefeld; Dr. rer. nat. Karsten Quast, Boehringer Ingelheim Pharma GmbH und Co. KG; Prof. Dr. Jens Stoye, Universität Bielefeld; Dr.-Ing. Frank G. Zöllner, Universität Bielefeld

Printed on ageing-resistant paper according to ISO 9706.

Advanced Stochastic Protein Sequence Analysis

Dissertation for the academic degree Doktor der Ingenieurwissenschaften (Dr.-Ing.)

submitted to the Technische Fakultät of Universität Bielefeld by

Thomas Plötz

Bielefeld – April 2005

Acknowledgements

I often compared the process of writing my PhD thesis to a bicycle ride across an alpine pass: it is a long and hard ascent that requires a lot of endurance. Sometimes it hurts, but most of the time it is very exciting to climb the mountain and to reach different points of view, which is the prerequisite for new thoughts and ideas. Like a ride across an alpine pass, writing my PhD thesis would not have been possible without the support of a strong team, which I would like to thank here. First of all, I am very much obliged to my supervisor PD Dr.-Ing. Gernot A. Fink. Over the years, a very close collaboration with him has developed, including countless fruitful discussions on all kinds of pattern recognition problems. His fresh and honest manner always gave me the motivation to follow new ideas, but also to at least double-check them before arguing. It is difficult to imagine how my work would have developed without the influence of his great experience and his numerous ideas. Thank you very much, Gernot, you actually made me stand here at the top of the alpine pass. The research performed for this thesis was embedded in a cooperation with Boehringer Ingelheim and the Boehringer Ingelheim Pharma GmbH und Co. KG Genomics Group. I would like to thank the project partners, especially Dr. Andreas Weith, Dr. Karsten Quast, Dr. Andreas Köhler, and Ogsen Gabrielyan, for their enthusiastic support. I am very grateful to Dr. Quast, who agreed to review this thesis. Over the years Birgit Möller and I have become real friends, sharing similar ideas, supporting each other in our individual goals, and having fantastic discussions. Although Birgit wrote her PhD thesis at the very same time as I did, including her own hard ascent to her alpine pass, she always had an open mind for my problems. Her very exact proofreading, including productive criticism, substantially helped me to improve the quality of this thesis.
My second proofreader was Erich Wehmeyer, who checked the thesis for any language-related traps and failures. I am very grateful for his native-speaker expertise and his willingness to correct the work. Furthermore, I would like to thank Prof. Dr.-Ing. Gerhard Sagerer, the leader of the Applied Computer Science group at Bielefeld University, who always encouraged me to cut my own path even when it was much rockier than expected. The productive atmosphere within the working group, including discussions, collaborations, chocolate and tea, provided the background for successfully finishing this thesis. Finally, my wife Alexandra Schubert played an important role in the successful completion of this thesis. I would like to thank her for her assistance, affection, and emotional support. Alex, without you, everything would count for nothing.

Contents

1 Introduction

2 Principles of Modern Protein Analysis
  2.1 The Central Dogma of Molecular Biology
  2.2 Proteins: The Fundamentals of Life
    2.2.1 Biochemical Composition
    2.2.2 Biological Relevancy
  2.3 Protein Relationships
    2.3.1 Protein Families
    2.3.2 Exemplary Hierarchical Classification
  2.4 Protein Analysis
    2.4.1 The Drug Discovery Process
    2.4.2 Protein Sequence Analysis
  2.5 Summary

3 Computational Protein Sequence Analysis
  3.1 Pairwise Sequence Alignment
    3.1.1 Principles of Sequence Alignment
    3.1.2 Heuristic Approximations
  3.2 Analysis of Sequence Families
    3.2.1 Profile Analysis
    3.2.2 Profile Hidden Markov Models
    3.2.3 Further Probabilistic Modeling Approaches
  3.3 Signal Processing based Sequence Comparison
    3.3.1 Alternative Representations of Protein Sequences
    3.3.2 Signal Processing Methods for Classification
  3.4 Summary

4 Concepts for Improved HMM Based Sequence Analysis
  4.1 Assessment of Current Methodologies' Capabilities
    4.1.1 Task: Homology Detection at the Superfamily Level
    4.1.2 Capabilities of State-of-the-Art Approaches
  4.2 Improving the Quality of HMM Based Sequence Analysis
    4.2.1 Semi-Continuous Feature Based Modeling
    4.2.2 Model Architectures with Reduced Complexity
    4.2.3 Accelerating the Model Evaluation
  4.3 Summary


5 Advanced Probabilistic Models for Protein Families
  5.1 Feature Extraction from Protein Sequences
    5.1.1 Rich Signal-Like Protein Sequence Representation
    5.1.2 Feature Extraction by Abstraction
  5.2 Robust Feature Based Profile HMMs and Remote Homology Detection
    5.2.1 Feature Space Representation
    5.2.2 General Semi-Continuous Profile HMMs
    5.2.3 Specialization by Adaptation
    5.2.4 Explicit Background Model
  5.3 Protein Family HMMs with Reduced Complexity
    5.3.1 Beyond Profile HMMs
    5.3.2 Protein Family Modeling using Sub-Protein Units (SPUs)
  5.4 Accelerating the Model Evaluation by Pruning Techniques
    5.4.1 State-Space Pruning
    5.4.2 Combined Model Evaluation
    5.4.3 Optimization of Mixture Density Evaluation
  5.5 Summary

6 Evaluation
  6.1 Methodology and Datasets
  6.2 Effectiveness of Semi-Continuous Feature Based Profile HMMs
  6.3 Advanced Stochastic Protein Family Models for Small Training Sets
    6.3.1 Effectiveness of Sub-Protein Unit based Models
    6.3.2 Effectiveness of Bounded Left-Right Models
  6.4 Acceleration of Protein Family HMM Evaluation
    6.4.1 Effectiveness of State-Space Pruning
    6.4.2 Effectiveness of Accelerated Mixture Density Evaluation
  6.5 Combined Evaluation of Advanced Stochastic Modeling Techniques
  6.6 Summary

7 Conclusion

A Wavelets
  A.1 Fourier Analysis
  A.2 Continuous Wavelet Transformation
  A.3 Discrete Wavelet Transformation

B Principal Components Analysis (PCA)

C Amino Acid Indices

D Detailed Evaluation Results

Bibliography

1 Introduction

Millennia ago, the ancient Egyptians used selected micro-organisms to produce cheese, wine, and bread. Apparently, they were very experienced in food-making: Egypt became one of the cradles of civilization, and the high quality of its food certainly had a positive influence on this development. Strictly speaking, however, they had no idea why their food became so tasty, giving them the power to build giant pyramids and to establish science and culture. It took ages until the reasons for it, the foundations of molecular biology, could be explained.¹ In fact, it was not until 1866 that the Augustinian monk Gregor Mendel developed the first general theory of heredity by means of the analysis of garden peas, which represents the basis for all further molecular biology research. Later, in 1953, James Watson and Francis Crick discovered the double-helical structure of DNA, which marked the major breakthrough on the way to understanding the microbiological foundations of life [Wat53]. Almost a hundred years lay between these principal breakthroughs, and they came thousands of years after the Egyptians baked their tasty bread. Due to the development of several revolutionary methodologies in recent decades, the speed of knowledge gain has increased dramatically. Only after Fred Sanger and Walter Gilbert independently invented powerful sequencing methods in 1977 did today's large-scale sequencing projects of complete organisms (e.g. the Human Genome Project [Lan01, Ven01]) become possible. In 1983 the polymerase chain reaction (PCR) was developed by Kary B. Mullis, enabling the massive amplification of DNA into vast numbers of identical copies, which is a prerequisite for further analysis.
Based on these primarily technological developments, which additionally enabled quantitative rather than exclusively qualitative examinations, the focus of molecular biology could shift towards more complex questions such as the understanding of complete metabolic systems. Compared to the age of the ancient Egyptians, the secrets of, e.g., tasty bread are known nowadays thanks to the basic understanding of microbiological processes. Furthermore, insights into molecular biology even allow the development of synthetic drugs aiming at effective therapies against severe illnesses like cancer. The analysis of genetic sequences plays a key role in modern molecular biology research. Once the genome of an organism has been sequenced, the more difficult task of understanding the data, i.e. extracting knowledge from it, needs to be solved. First of all, genes must be localized within the DNA, followed by the prediction and classification of putative proteins. In order to reach higher-level knowledge about complex metabolisms, e.g. to develop therapies for healing illnesses, fundamental insights into the metabolism of organisms, including the interactions between proteins, are essential. Based on this data, pharmaceutical research is performed aiming at new (synthetic) drugs. Since huge amounts of data need to be analyzed, modern computational techniques play a key role in molecular biology.

¹ Good starting points for detailed readings about the life and food of the ancient Egyptians are e.g. [Hel75, Red01].


In the last decade(s), bioinformatics has become an impressive success story. Compared to traditional research in molecular biology, i.e. explorations in the so-called wet labs, in-silico investigations are mostly cheaper and faster by some orders of magnitude. Here, the term in-silico stands for experiments using bioinformatics methods on computers. Thus, contrary to traditional research driven by individual cases (i.e. facts, organisms, and compounds already known), broad systematic investigations in a high-throughput manner have become possible, enabling more exhaustive explorations. Research in molecular biology is mostly based on some kind of pattern recognition, namely sequence comparison. Obviously, computational biology is predestined for such tasks. The mapping of putative functions to various genes, predicted using bioinformatics methods, gave access to the understanding of at least parts of complex metabolic systems. Without such methods, fundamental insights which are now widely accepted would not have been possible for years. Thus, the relevance and the success of computational biology cannot be overestimated. Encouraged by very promising research results, the so-called post-genome era has now widely been proclaimed. Here, the focus of research lies on the analysis and understanding of complete biological systems, i.e. protein-protein interactions, or metabolic pathways. In fact, it is reasonable to analyze higher-level relationships between proteins in order to solve complex questions of molecular biology. However, the post-genome era strongly depends on the results of the "preceding" genome era, i.e. the detection and classification of genes and proteins. Unfortunately, this sequence analysis problem is far from being solved. For example, the exact number of genes coded in the human genome is still not clear. In [Ven01] both a pessimistic gene number of 26 000 and a more optimistic figure of 40 000 are given.
The Human Genome Consortium initially found evidence for approximately 30 000 transcripts, and recently only 20 000 to 25 000 genes were assumed [IHG04]. Surprisingly, the sets of genes found by the two groups overlap only to approximately 21% [Hog01]. It is believed that only about 60 percent of the proteins encoded by human genes can be detected using present methodologies of molecular biology and bioinformatics. The reasons for this are manifold: the process of alternative splicing is still not completely understood, pseudo-genes exist which do not encode any actual proteins, etc. Thus, protein prediction may already be partially doomed. Due to the complex three-dimensional folding mechanism, proteins with more or less similar biological functions exist which are only distantly related at the sequence level. These sequences are very difficult to find and, even worse, their correct classification is currently almost impossible. So, if the classification of sequences already fails, the failure of the analysis of their interactions in metabolic pathways is almost inevitable. The basic assumption of molecular biology is that similar functions of proteins are caused by similar structures (structure-function relationship). Classifying protein data simply follows this principle, whereas the biological functions themselves can be defined at various levels of granularity. The primary structure, i.e. the linear sequence of amino acids, which is obtained by the sequencing process as originally proposed by Sanger or Gilbert, mainly determines the three-dimensional structure of proteins. Once a new protein sequence is predicted, it is classified regarding its similarity to sequences whose functions are already known.

Traditionally, protein sequence analysis is performed using some kind of string comparison. The most obvious technique is Dynamic Programming, where two sequences are mutually aligned and alignment costs are calculated, serving as the basis for classification. These techniques are suitable for closely related sequences, so-called close homologues, encoding basic biological functions. Unfortunately, the more abstract the biological functions of interest, the weaker the sequence-level similarities of the proteins. However, these so-called remote homologues are much more interesting for molecular biologists than the functions encoded by closer homologue sequences. For the classification of remote homologue sequences, probabilistic models of protein families are the methodology of choice. Based on various machine learning approaches, models for sequences sharing the same biological function are established, and a more or less fuzzy evaluation is performed for classification. Although these models significantly outperform the traditional approaches outlined above, the general problem of remote homology classification is still not solved at all! Current probabilistic models suffer from several fundamental problems, thereby preventing further major breakthroughs in remote homology detection. As one example, most of them require large sample sets for training robust models. Unfortunately, for most protein families of interest only very few sample sequences exist. As mentioned above, the functions of about 40 percent of the human proteins are still not clear. However, there is strong evidence that these proteins are in fact remote homologues. Consequently, in order to actually reach the post-genome era and to continue the success of modern molecular biology research, improved probabilistic models of protein families are badly needed. Therefore, substantial effort is dedicated to this field of research.
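The pairwise comparison by Dynamic Programming mentioned above can be made concrete with a minimal sketch of global alignment scoring in the style of Needleman and Wunsch. The match, mismatch, and gap costs below are illustrative placeholders only; real protein analyses use amino acid substitution matrices such as PAM or BLOSUM and affine gap penalties.

```python
# Minimal sketch of global pairwise alignment scoring via Dynamic
# Programming (Needleman-Wunsch style). The match/mismatch/gap scores
# are illustrative placeholders, not biologically meaningful values.

def align_score(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # D[i][j]: best score for aligning the prefixes a[:i] and b[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + s,  # substitution
                          D[i - 1][j] + gap,    # gap in b
                          D[i][j - 1] + gap)    # gap in a
    return D[n][m]

print(align_score("HEAGAWGHEE", "PAWHEAE"))
```

A full implementation would additionally record traceback pointers in order to recover the alignment itself, not only its cost.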

Focus

By formulating the analysis of protein sequences as a general pattern recognition problem, namely the treatment of signals evolving in time, the use of powerful probabilistic models became possible. For proteins, time is conceptually substituted by the position of amino acids in the sequences of interest. Such probabilistic models, applied here to bioinformatics tasks, originate from different application domains of pattern classification like automatic speech recognition or the classification of handwritten script. Consequently, the developments presented here represent a strict pattern recognition view on the bioinformatics problem of protein sequence analysis. The goal of this thesis is the development of advanced stochastic models for protein families. Therefore, the currently most promising probabilistic modeling approach, an enhancement of traditional pairwise sequence analysis techniques, namely Profile Hidden Markov Models (Profile HMMs), is analyzed. Based on the capabilities and drawbacks of Profile HMMs, enhancements for HMM based protein family modeling are developed. When applying advanced stochastic protein family models, substantial improvements in remote homology analysis tasks become possible, serving as the basis for obtaining further insights into biological processes. The results of improved remote homology analysis applying the new techniques can be used, e.g., for pharmaceutical purposes during drug discovery.

3 1 Introduction

The basic idea of enhanced protein family HMMs is to adopt and transfer techniques developed for other pattern recognition applications to the sequence analysis domain. In order to achieve this, a more abstract view of biological sequence data as signals in their fundamental meaning is used. Based on these "protein signals", relevant features are extracted by applying various signal processing and general pattern recognition techniques. The resulting feature based protein sequence representation is the basis for all further developments. Contrary to current discrete Profile HMMs, semi-continuous feature based (SCFB) variants of protein family HMMs are developed. In combination with new techniques for both robust model estimation and evaluation, improved remote homology analysis becomes possible. In addition to SCFB Profile HMMs, which share the same model architecture as state-of-the-art protein family HMMs, models with reduced complexity are developed. The basic motivation for this is to limit the number of model parameters which need to be trained, since training requires substantial amounts of sample data. Once the complexity of protein family models is reduced while keeping (or even improving) their effectiveness for remote homology analysis, significantly fewer training sequences are sufficient for robust model estimation. Since the evaluation of feature based protein family models requires substantially higher computational effort, a further focus of the developments is on efficient model evaluation techniques. Here again, techniques known from other pattern recognition domains are adopted and transferred to bioinformatics tasks.
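The notion of a "protein signal" can be illustrated with a deliberately simplified one-dimensional example: every residue is replaced by a numeric property value, here the well-known Kyte-Doolittle hydropathy index, and the resulting series is smoothed with a sliding window as in classical signal processing. The actual feature extraction developed in this thesis uses richer, multi-dimensional representations; this sketch only conveys the basic idea.

```python
# One-dimensional sketch of a "protein signal": each residue is
# replaced by a numeric property value (the Kyte-Doolittle hydropathy
# index) and smoothed with a sliding window, a typical signal
# processing preprocessing step. Illustration only; the thesis itself
# uses richer multi-dimensional feature representations.

HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def protein_signal(sequence, window=3):
    """Map a sequence to a smoothed hydropathy 'signal'."""
    raw = [HYDROPATHY[aa] for aa in sequence]
    half = window // 2
    smoothed = []
    for i in range(len(raw)):
        lo, hi = max(0, i - half), min(len(raw), i + half + 1)
        smoothed.append(sum(raw[lo:hi]) / (hi - lo))
    return smoothed

signal = protein_signal("MAGWLK")
print([round(x, 2) for x in signal])
```

Once a sequence is represented numerically like this, standard signal processing tools (e.g. the wavelet transformations of appendix A) become directly applicable.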

Organization of the Thesis

The thesis is principally divided into two parts. First, the state of the art in sequence analysis is summarized. Here, the second chapter briefly reviews the foundations of modern protein analysis relevant for this thesis. Following this, the most important current sequence analysis techniques are discussed in chapter 3. The second part of the thesis deals with the development of approaches for advanced stochastic protein family modeling. In the first part of chapter 4, the currently most successful probabilistic approach for remote homology analysis, namely modeling protein families using Profile HMMs, is quantitatively evaluated by means of a representative task. Based on this analysis of the capabilities of state-of-the-art techniques, concepts for enhancements are presented in the second part of chapter 4. In chapter 5, advanced stochastic protein family models are described in detail. They were integrated into a prototypical HMM framework for remote homology detection, the GRAS2P system.² By means of the GRAS2P system, numerous experimental evaluations were performed. The presentation and discussion of their results is given in chapter 6. The thesis closes with a conclusion in chapter 7, where the key issues are reviewed. Furthermore, the practical application of techniques for advanced stochastic protein sequence analysis is summarized.

² GRAS2P is an acronym for Genetic Relationships Analysis based on Statistical Sequence Profiles.

2 Principles of Modern Protein Analysis

The ancient Egyptians probably found their magic ingredients for making tasty cheese and bread, the micro-organisms responsible for fermentation, by chance. Nowadays, research activities related to molecular biology are well founded and more goal-oriented. Basically, turning milk into cheese is an enzymatic reaction caused by certain bacteria. When the milk curdles, several proteins and their mutual interactions play an important role. Proteins are, however, also the reason for various diseases, no matter which organisms are actually affected. Most such illnesses are caused by malfunctions during the synthesis of certain proteins. An excessive increase in the number of certain proteins may lead to severe illnesses, e.g. cancer. On the other hand, a lack of proteins, if too few of them are generated, may have a similarly dramatic effect. As an example of such malfunctions in protein synthesis, diabetes is caused by the pancreas losing its capability to produce the protein insulin. The concept of protein interactions can be generalized to any kind of metabolic process. Consequently, the foundations of everyday life situations as well as of very complex tasks of molecular biology share the same basis: proteins. Thus, research in molecular biology is always more or less related to them. In this chapter the foundations of proteins and protein analysis in typical tasks of molecular biology are described. The explanations here are by no means exhaustive, since that would go far beyond the scope of this work. Since the thesis is concerned with the improvement of probabilistic models for protein analysis, i.e. bioinformatics and general pattern recognition, only the relevant and absolutely necessary principles are summarized. For more detailed information regarding the fundamentals of molecular biology, the reader is referred to the numerous textbooks, monographs, and special publications dealing with the topic from a more biological point of view.
The discussion given here is mainly based on [Str91, Lew94, Bra98, Gon01, Mer03, Jon04]. First, in section 2.1 one of the most important principles of molecular biology is outlined: the central dogma. Here, the so-called information flow between the various levels of molecular biology is described, starting from DNA up to proteins. All further analysis is based on this foundation. The proteins themselves, as the result of the rather complicated process of gene expression, which is unfortunately still not completely understood, are the fundamentals of life. Thus, the focus of this chapter lies on the description of proteins. In section 2.2 their biochemical properties are reviewed, followed by a discussion of their significance for metabolic processes. Protein analysis is motivated throughout the remaining parts of this chapter. In section 2.3 possible relationships between single proteins, i.e. the formation of families at various levels of abstraction, are described. Following this, the protein analysis scheme is summarized by means of the drug design task. This practical example of molecular biology processing will serve as a reference for further argumentation throughout the whole thesis.


2.1 The Central Dogma of Molecular Biology

The Augustinian monk Gregor Mendel established the concept of genes as basic units containing heredity information. Up to the year 1944 it was widely assumed that chromosomal proteins contain this genetic information. Based on Fred Griffith's work on Pneumococcus bacteria [Gri28], Oswald Avery, Colin MacLeod, and Maclyn McCarty published their discovery that a nucleic acid of the deoxyribose type plays an important role in heredity [Ave44]. The major insight they gained was that purified Deoxyribonucleic Acid (DNA) contains the genetic information. Following this, in 1953 James Watson and Francis Crick proposed a model for the double-helical structure of DNA [Wat53]. The discovery of the importance of DNA as well as the correct description of its three-dimensional structure became the foundations of molecular biology. DNA itself was discovered in 1869 by Johann Friedrich Miescher while isolating white blood cells. He (and others) found out that DNA consists of large and rather simple molecules composed of a sugar ring, a phosphate group, and one of four nitrogenous bases: adenine (A), thymine (T), guanine (G), and cytosine (C). A fifth base was discovered, too, namely uracil (U), which is chemically similar to thymine. The chemical bonds linking the nucleotides together are always the same. Thus, the backbone of DNA is very regular, and the "individuality" of each molecule arises from the actual sequence of the bases A, T, C, and G. In figure 2.1 the biochemical composition of DNA is summarized.


Figure 2.1: Illustration of the biochemical composition of DNA (courtesy of [Lej04]).

Generally, DNA contains all the information necessary for describing individual organisms of all living species. Here, genes play an important role, since they are expressed as proteins, the fundamentals of life (which will be described in more detail in the succeeding

section). The genetic code defines the mapping of base triplets, so-called codons, to amino acids, which form the building blocks of proteins. Thus, the sequence of bases can be understood as a template for protein synthesis. However, it is actually not DNA which is directly expressed but ribonucleic acid (RNA). This template is generated by transcription in the form of messenger RNA (mRNA), a working copy of the appropriate DNA fragment. Compared to the chemical composition of DNA, thymine is here replaced by uracil. For prokaryotes, i.e. organisms whose cells do not contain a nucleus, this process of transcription is rather simple, since their genes are always expressed in their complete length. Contrary to this, in organisms whose cells contain a nucleus (eukaryotes), a complicated process of splicing, i.e. the removal of non-coding parts, so-called introns, needs to be performed. For both kinds of organisms, the resulting mRNA is directly translated into proteins using the (redundant) genetic code, i.e. the mapping of base codons to amino acids. The principles of protein synthesis are summarized in figure 2.2.

Figure 2.2: Principle of protein synthesis based on genetic information in DNA.
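The translation step sketched in figure 2.2 can be illustrated in code. The dictionary below contains only a small, hand-picked excerpt of the full 64-codon genetic code; it nevertheless shows the redundancy (several codons mapping to the same amino acid) and the role of the stop codons.

```python
# Illustrative sketch of translation: mapping mRNA base triplets
# (codons) to amino acids. Only a small excerpt of the full 64-codon
# genetic code is listed here; the redundancy of the code is visible
# for alanine and glycine.

GENETIC_CODE = {
    "AUG": "M",                  # methionine, also the start codon
    "GCU": "A", "GCC": "A",      # alanine (redundant codons)
    "GGU": "G", "GGC": "G",      # glycine (redundant codons)
    "UGG": "W",                  # tryptophan
    "UAA": None, "UAG": None, "UGA": None,  # stop codons
}

def translate(mrna):
    """Translate an mRNA string into a single-letter protein string."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = GENETIC_CODE[mrna[i:i + 3]]
        if aa is None:           # a stop codon terminates translation
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGGCUGGCUGGUAA"))  # -> "MAGW"
```

The redundancy visible here is exactly why the simple "back-transcription" from protein to nucleic acid discussed below is impossible.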

Depending on the actual species, genes constitute only a rather small fraction of the DNA. As an example, the human genome, with a length of more than 3 000 megabases, contains only approximately 30 000 to 40 000 genes. The redundant genetic code does not allow simple "back-transcription", since amino acids are encoded by more than one codon. Genes and proteins are directly linked, since every protein is encoded by a gene. The inverse formulation of this principle does not hold for most higher-developed organisms. Compared to the moderate number of genes, the number of proteins synthesized from them is mostly considerably larger. Obviously, some genes code for multiple different proteins. Here the boundaries of introns are not fixed, resulting in different coding parts for the same gene in multiple expressions. In the literature this very complicated behavior is described as alternative splicing. Additionally, several genes exist which do not code for any proteins, so-called pseudo-genes. In summary, the information flow is in principle an irreversible process. Based on these observations, Francis Crick formulated the central dogma of molecular biology:

"The central dogma states that once 'information' has passed into protein it cannot get out again. The transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein may be possible, but transfer from protein to protein, or from protein to nucleic acid, is impossible. Information means here the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein."
Francis Crick, 1958 (taken from [Lew94, p. 161])

After the complete genomes of various organisms had been sequenced (cf. the human genome [Lan01, Ven01]), one of the basic goals of molecular biology research became the actual deciphering of the data. In order to obtain fundamental insights into biochemical processes for

healing diseases etc., a principal understanding of proteins as well as of protein synthesis is required. However, the central dogma of molecular biology severely constrains this discovery process.

2.2 Proteins: The Fundamentals of Life

Most components of the cells of living organisms consist of only six different elements: hydrogen (H), carbon (C), nitrogen (N), oxygen (O), sulphur (S), and phosphorus (P). Inside the cell they are linked and form molecules like water (H2O) or phosphate (PO4). Actually, most molecules are really large, consisting of thousands of atoms. These macro-molecules are built up from many basic units. As very prominent examples of macro-molecules, polysaccharides like starch or cellulose represent long chains of sugar molecules.

2.2.1 Biochemical Composition

Basically, proteins are one of the most complicated kinds of macro-molecules. Here, the basic units mentioned above are amino acids. Although further components were discovered recently (cf. e.g. [Böc91] for a review of selenocysteine and [Atk02] for the description of pyrrolysine), it is widely accepted that only a limited set of 20 standard amino acids exists.¹ Every amino acid has a central carbon atom (Cα) to which a hydrogen atom (H), an amino group (NH2), and a carboxyl group (COOH) are attached (cf. figure 2.3). The differences between the various amino acids are caused by the side chain (R) attached to the Cα. In fact, 20 different side chains are genetically specified, where groups of three nucleotides, so-called codons, encode the biochemical composition of the side chain and thus the amino acid itself.


Figure 2.3: General structure of amino acids consisting of amino group (NH2) and carboxyl group (COOH) as well as of the side chain (R), which determines the general differences among them.

Differences between the 20 amino acids are manifold. The side chains differ in their size, charge, hydrophobicity, chemical reactivity, and shape. Whereas glycine has a rather simple side chain, namely a single hydrogen atom, phenylalanine, for example, contains a ring-shaped side chain of carbon atoms connected to additional hydrogens. All proteins of all species are based on this set of "building blocks". Carl Branden and John Tooze give an excellent

1In fact, the 21st and 22nd amino acids are very special cases that occur rarely, due to posttranslational enzymatic modifications, in a negligible number of species. Thus, the general theory remains valid for the prevailing majority of organisms.

overview of the specialties of the different amino acids in [Bra98, p. 6f.]. In table 2.1 the names of the amino acids as well as their single-letter and three-letter abbreviations are listed. Furthermore, three common groups of amino acids are introduced for cases where the exact identity of the side chain is ambiguous. Further groups can be defined, depending on biochemical properties shared between their members.

Amino Acid        Three-Letter Code   Single-Letter Code
Alanine           Ala                 A
Arginine          Arg                 R
Asparagine        Asn                 N
Aspartic acid     Asp                 D
Cysteine          Cys                 C
Glutamine         Gln                 Q
Glutamic acid     Glu                 E
Glycine           Gly                 G
Histidine         His                 H
Isoleucine        Ile                 I
Leucine           Leu                 L
Lysine            Lys                 K
Methionine        Met                 M
Phenylalanine     Phe                 F
Proline           Pro                 P
Serine            Ser                 S
Threonine         Thr                 T
Tryptophan        Trp                 W
Tyrosine          Tyr                 Y
Valine            Val                 V
Either of D or N  Asx                 B
Either of E or Q  Glx                 Z
Undetermined      X                   X

Table 2.1: Names and abbreviations of the 20 different amino acids occurring in proteins and the ambiguous groups B, Z and X.
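Table 2.1 can be expressed directly as a lookup structure, which is how such alphabets are typically handled in sequence analysis software. The following sketch encodes the mapping and its inverse; note that the undetermined residue, listed simply as X in the table, is given here under the common three-letter form Xaa (an assumption of this sketch, not of the table).

```python
# Table 2.1 as a lookup structure: three-letter codes mapped to the
# single-letter alphabet, including the ambiguous symbols B, Z and X.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Asx": "B", "Glx": "Z", "Xaa": "X",  # ambiguous groups
}

# The inverse mapping turns a one-letter protein sequence back into
# three-letter codes.
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}

print(ONE_TO_THREE["W"])                                  # -> Trp
print([THREE_TO_ONE[c] for c in ("Met", "Gly", "Phe")])   # -> ['M', 'G', 'F']
```

With 20 standard residues plus the three ambiguous symbols, the single-letter alphabet has exactly 23 characters, which is the alphabet size referred to in chapter 3.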

In proteins the ends of two adjacent amino acids are joined by the formation of peptide bonds. Chemically this means that the carboxyl group of one amino acid condenses with the amino group of the next, eliminating water. Repeating this bonding process results in polypeptide or protein chains. The formation of a succession of such peptide bonds generates a “backbone”, from which the various side chains stick out. In figure 2.4 the creation of a peptide bond between two hypothetical amino acids is outlined. Both the carboxyl group of the molecule on the left-hand side and the amino group on the right are broken up. The freed atoms recombine to water, and both amino acids are joined by the peptide bond shown in the middle of the sketch. The general structures of all proteins, namely the backbone chain of carbon and nitrogen atoms, are identical. The differences between proteins are principally caused by the sequence of the side chains Rn of the N amino acids involved. Due to its linear characteristics, the sequence is often called the primary structure of the protein.

[Structure diagram: two amino acids condensing; the carbon-nitrogen backbone with side chains R1 and R2, releasing H2O.]

Figure 2.4: Creation of a peptide bond between two amino acids. The sequence of carbon and nitrogen atoms represents the backbone of the protein.

Generally, proteins adopt well-defined three-dimensional structures. In fact, the function of a particular protein is determined by its structure, i.e. the three-dimensional arrangement of the atoms. Within this conformation further sub-structures can be distinguished. The three-dimensional arrangements of sequences are strongly influenced by biochemical properties of the underlying amino acid residues, such as charge, hydrophobicity, or residue size. For the description of structural properties of proteins, dividing polypeptide chains into building blocks ranging from one Cα atom to the next, instead of using the peptide bond as delimiter as exemplarily shown in figure 2.4, is preferable.2 Now, each Cα atom, except for the first and the last one, belongs to two building blocks. All the atoms in such a unit are fixed in a plane, with the bond angles and bond lengths nearly the same in all units of all proteins. With this alternative definition of peptide units, the side chains are not part of the building blocks. The peptide units effectively represent rigid groups linked into a chain by covalent bonds at the Cα atoms. Thus, the only degrees of freedom they have are rotations around the bonds, with angles φ and ψ. The local spatial arrangement of adjacent amino acids in regular steric conformations is called the secondary structure of proteins. Examples of such regular steric conformations are the α-helix, the β-sheet and the collagen helix illustrated in figure 2.5. Contrary to the secondary structure, the tertiary structure describes relationships between atoms (amino acid residues) that are further apart in the linear sequence. Obviously, the boundary between these two kinds of structure is rather ambiguous.
For proteins consisting of more than a single polypeptide chain, a fourth kind of structure can be described – the quaternary structure. Here, higher-level building blocks are defined, namely the single polypeptide chains. The quaternary structure contains information regarding the spatial arrangement of such higher-level units, including the description of their contact areas. In figure 2.6 the four standard structures are summarized. In addition to the four kinds of protein structures outlined above, additional description levels were defined recently. Super secondary structures express the aggregation of secondary structures, serving as the transition between the secondary and the tertiary structure. Globally compact units are called domains. They have a special relevance for

2Actually, the “natural” peptide bond remains valid since the alternative subdivision of polypeptide chains is used for conceptual argumentation only.


[Figure panels, left to right: Cytochrome C-553 (Bacillus pasteurii), Plastocyanin (French bean), Disaccharide repeating units (Heparin).]

Figure 2.5: Some regular steric conformations of proteins: α-helix, β-sheet and collagen helix (from left to right; images taken from PDB [Ber02]).

higher organisms because domains are often determined by exons, which are the coding parts of eukaryotic DNA. Usually domains represent the smallest units to which actual biological functions can be assigned. Since the three-dimensional structure of proteins generally determines their function, the conformation is of immense importance. It is mainly determined by the sequence of amino acids, which was first proven by Christian Anfinsen [Anf73] and subsequently became one of the principles of molecular biology.3 Thus, for the majority of fundamental bioinformatics applications, namely the so-called sequence analysis techniques, the chains of amino acid symbols – the primary structures – serve as input data.

2.2.2 Biological Relevance

As mentioned at the beginning of this chapter, proteins play a key role in almost all biological processes. In this context, Lubert Stryer described the seven most important functions of proteins in his standard work [Str91, p. 15f.]. In order to demonstrate the relevance of proteins and to emphasize the demand for powerful protein analysis methods, these functions are briefly summarized.

Enzymatic catalysis: Most chemical reactions in biological systems, independent of the actual complexity of the reactions, are catalyzed by specific macro-molecules, so-called enzymes. In vivo, i.e. inside living organisms, the number of chemical reactions executed without these catalysts is almost negligible. In fact, the speed of chemical reactions is amplified by several orders of magnitude when enzymes are involved. The majority of currently known enzymes are proteins! Thus, it is actually proteins which control the chemical reactions in living organisms.

Transport and storage: Numerous small molecules and ions are transported by means of specific proteins. As a very prominent example, oxygen is transported within erythrocytes by hemoglobin and within muscle tissue by myoglobin. Both transporting substances are proteins and closely related.

3In fact, Anfinsen received the Nobel prize in chemistry in 1972 for his epoch-making work concerning the folding of ribonuclease based on the sequence of amino acids.

[Figure labels: primary structure – the sequence of a chain of amino acids; secondary structure – occurs when the sequence of amino acids is linked by hydrogen bonds (α-helix, pleated sheet); tertiary structure – occurs when certain attractions are present between α-helices and pleated sheets; quaternary structure – a protein consisting of more than one amino acid chain.]

Figure 2.6: Summary of the three-dimensional arrangement of proteins: primary, secondary, tertiary and quaternary structures (courtesy of [Lej04]).

Coordinated movement: Proteins are the essential elements of muscle cells. Both macroscopic and microscopic movements are based on contractile systems made of proteins.

Mechanical support: The high tensile strength of skin and bones is ensured by the protein collagen. Several illnesses like cellulitis or, even worse, arteriosclerosis are caused by a lack of collagen.

Immune defense: Antibodies are highly specific proteins that recognize and destroy foreign substances like viruses or bacteria. A lack of such antibodies increases the probability of lethal viral attacks for almost any organism.

Creation and transmission of neural impulses: Receptor molecules, which are nothing other than proteins, transmit responses to specific stimuli. As an example, rhodopsin serves as the photo-optical receptor protein of the retina.


Control of growth and differentiation: The well-controlled and temporally coordinated expression of genetic information is essential for the growth and differentiation of cells. Only a very small part of the genome is expressed. In higher organisms both growth and differentiation are controlled by growth factor proteins. If this mechanism fails, severe illnesses like cancer or diabetes may occur.

This non-exhaustive list of protein functions gives an overview of the immense relevance proteins have in molecular biology. In fact, they are the fundamentals of life, since life would not be possible without them. Nowadays, these fundamentals are investigated in basic research as well as in various application fields of molecular biology. Here, both the pharmaceutical and the food industry need to be mentioned as very prominent examples. Additionally, proteins are of major importance in other areas such as the development of new building materials based on synthetic adhesives [You99]. They may even play an important role for the next generation of computer architectures, especially considering that their (theoretical) storage capacity is enormous [Bir95, Gar99].

2.3 Protein Relationships

One foundation of molecular biology, serving as the general definition for protein relationships, can be summarized as follows: similar function of proteins is caused by similar structure. The direct consequence of this principle is that the problem of protein analysis can be formulated as a classical pattern comparison task. Once the function of a single protein has been determined, e.g. in the traditional way in the wet lab, further knowledge can be obtained by finding proteins with similar structures. Usually, depending on the level of biological abstraction, proteins are clustered into so-called protein families, superfamilies, and folds.4 Generally, the definition of protein similarity is based on the comparison of three-dimensional structures. However, almost always primary structure data needs to be examined. As a result of the complex folding processes that spatially arrange proteins, only weak sequence similarities may remain even where highly similar three-dimensional structures exist, which dramatically complicates the sequence analysis task. In this section a brief overview of the most common definitions of protein families and higher-level classification schemes is provided. Based on the SCOP (Structural Classification Of Proteins) classification [Mur95], a complete structural hierarchy is explained.

2.3.1 Protein Families

In section 2.2.1 the various levels of three-dimensional structures of proteins were described. There, domains were introduced as globally compact units in the three-dimensional structure. In fact, domains play a key role for the definition of relationships between proteins. When analyzing related proteins, it usually becomes obvious that sequence similarity is not given

4Certainly, further subdivisions exist but since most definitions of abstraction levels are somewhat arbitrary, the argumentation in this thesis is restricted to the three most common levels.

for the complete length. Instead, regions of strong similarity alternate with significantly diverging sequence parts. The reason for this is the modular composition of proteins by means of domains, whose exact definition is as follows (cf. [Mer03, p. 12]):

A domain is the smallest unit of a protein containing a well-defined structure which is spatially folded independently. Mostly, protein domains consist of 50–150 residues processing individual reactions whose interactions result in the overall function of the protein.

Due to their fundamental biological meaning, domains are the criterion of choice for the definition of protein relationships. Note that domains are defined at the level of biological function, which does not necessarily coincide with sequential similarity. All sequences belonging to the same protein family contain the same domain. Especially proteins with shorter sequences contain only single domains. The actual name of the protein family is derived from the characteristic domain. The characterization of sequences belonging to a single protein family can be made in several ways, which will be described in detail in chapter 3. Based on this definition of protein families, further higher-level relationships can be defined, establishing a classification hierarchy with an increasing level of abstraction for the definition of common biological functions. Actually, the borders between these levels are defined by means of the sequence identity percentages of the sequences belonging to the same units. To some extent, the definitions are rather arbitrary. “A residue identity of more than about 30% for clustering protein sequence pairs together into families is widely accepted in the literature” [Liu04]. Superfamilies are clusters of sequences sharing similar structures and evolutionary origins.
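The residue-identity criterion quoted above can be computed directly from an alignment of two sequences. The following minimal sketch assumes the alignment is already given (gaps marked with “-”); in practice the alignment itself must first be computed with the methods of chapter 3, and the example sequences are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Residue identity of two pre-aligned, equal-length sequences,
    counted over the aligned (non-gap) columns only."""
    assert len(seq_a) == len(seq_b), "sequences must be aligned"
    columns = [(a, b) for a, b in zip(seq_a, seq_b)
               if a != "-" and b != "-"]
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)

# Two hypothetical aligned fragments: 5 identical residues out of
# 6 aligned columns (the gap column is skipped).
a = "MKV-LIT"
b = "MKVQLIS"
print(round(percent_identity(a, b), 1))  # -> 83.3
```

Under the 30% rule of thumb, these two fragments would clearly be clustered into the same family; note that different tools differ in how gap columns enter the denominator, which is one source of the arbitrariness mentioned above.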
In addition, groups can be defined by means of a common fold: proteins share a fold if they consist of the same major secondary structures in similar arrangements. Already in the 1970s it was postulated that all naturally occurring proteins can be classified into certain families. The classification of protein sequences with respect to their correct structural or functional family is of major importance for e.g. pharmaceutical research. In 1992 Cyrus Chothia conjectured that the number of different families is rather limited [Cho92]. Concretely, he claimed that only little more than 1 400 protein families exist. Although in the last few years the estimated number shifted frequently in both directions, ranging from 1 000 to 30 000 families and 400 to 10 000 folds, the basic assumption of an upper boundary remains valid.5 So, depending on the level of biological abstraction, the relationships between proteins can be formulated in a limited number of different ways.

2.3.2 Exemplary Hierarchical Classification

Over the years a large number of protein sequences have been obtained from various experimental sources. In order to allow molecular biologists a systematic exploration of these sequences, nowadays they are stored in central databases which are mostly publicly available. The most prominent primary database is the Brookhaven Protein Data Bank

5Several statistical analyses concerning the theoretically exact number of protein families can be found e.g. in [Zha97, Ale98]. Due to the increase in the number of available protein sequences, and thus the permanent change of the statistical base, no stable number is presently accepted.


(PDB), which was established in 1971 at the Brookhaven National Laboratory, Long Island, New York, USA [Ber77]. Presently it contains more than 27 000 records describing the three-dimensional structures of macro-molecules.6 The PDB contains descriptions of protein structures without any classification regarding relationships. For this purpose, several additional databases exist, providing such classifications based on various criteria. As an example, the goal of the SCOP database is the hierarchical classification of protein domains in terms of structural and evolutionary relationships [Mur95]. Here, the method used to construct the classification hierarchy is essentially the visual inspection and comparison of structures, producing very accurate and useful results. The levels of abstraction within the classification hierarchy are defined as follows (cf. [Mur95]):

Family: Common evolutionary origins of protein domains are established in two steps: first, all domains having residue identities of 30% and greater are grouped; second, protein domains with lower sequence identities but very similar functions and structures are added.

Superfamily: Families whose members have low sequence identity percentages but whose structures and major functional features suggest that a common evolutionary origin is probable, are grouped in superfamilies.

Common fold: If members of superfamilies and families have the same major secondary structures in the same arrangements with the same topological connections, they are defined as having a common fold.

Class: Different folds are grouped into classes. Most folds are assigned to one of five structural classes, based on the secondary structures of which the sequences are composed:

• All alpha (for domains whose structure is essentially formed by α-helices),
• All beta (the same as before but for β-sheets),
• Alpha and beta (for protein domains with α-helices and β-strands that are largely interspersed),
• Alpha plus beta (for those in which α-helices and β-strands are largely segregated), and
• Multi-domain (for those with domains of different folds and for which no homologues are known at present).

Presently, i.e. in release 1.55, roughly 13 000 records of the PDB, consisting of more than 30 000 domains, are classified into about 600 folds, approximately 1 000 superfamilies and more than 1 500 families [Con02]. Together with the ever increasing number of PDB records and due to new research results, the size of the SCOP database steadily increases.
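The SCOP levels form a tree from class down to family, which can be sketched as nested mappings. The single branch shown (the globin family in the all-α class) is a well-known real SCOP example, but the exact data layout, function name and the query interface here are illustrative assumptions, not SCOP's actual format.

```python
# A sketch of the SCOP-style hierarchy as nested mappings:
# class -> common fold -> superfamily -> families.
scop_like = {
    "All alpha": {
        "Globin-like": {                 # common fold
            "Globin-like": {             # superfamily
                "families": ["Globins"],
            },
        },
    },
}

def family_path(hierarchy, family):
    """Return the (class, fold, superfamily) path containing the given
    family, or None if the family is not classified."""
    for cls, folds in hierarchy.items():
        for fold, superfams in folds.items():
            for sf, content in superfams.items():
                if family in content["families"]:
                    return (cls, fold, sf)
    return None

print(family_path(scop_like, "Globins"))
# -> ('All alpha', 'Globin-like', 'Globin-like')
```

A tree walk like this mirrors how the classification is used in practice: given a family assignment for an unknown sequence, the higher levels (superfamily, fold, class) follow immediately from the hierarchy.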

6For details regarding the PDB the reader is referred to the review article of Helen M. Berman and colleagues [Ber02].


2.4 Protein Analysis

Based on the foundations of molecular biology described so far, research activities nowadays focus on gaining higher-level knowledge regarding the function and meaning of single proteins and their relationships. Traditionally, drug discovery is a very prominent branch of molecular biology research addressing protein analysis tasks. This thesis concentrates on the development of enhanced probabilistic models for remote homology analysis. Especially for pharmaceutical purposes, detecting new members of certain protein families is of major importance. Recently, the incorporation of bioinformatics techniques into certain parts of the drug discovery process initiated a general paradigm shift: away from experiments and studies inductively driven by the data in each segment of the value chain, towards more deductive approaches. Here, instead of abstracting from already known facts about drugs, new knowledge is gained by means of broad analysis of genome and protein data in a high-throughput manner. In the following, the general drug discovery process is outlined with special focus on the incorporation of computational sequence analysis techniques. Throughout the remaining chapters of the thesis, this process will serve as one example for the application of the techniques developed here. The argumentation is mainly adopted from the recent compilation of Alexander Hillisch and Rolf Hilgenfeld [Hil03].

2.4.1 The Drug Discovery Process

Drug development is an expensive and time-consuming process. A new drug today requires on average investments of $880 million and approximately 15 years of development, including the cost and time to discover potential biological targets, i.e. specific receptors identified to be modulated in some way for healing processes. Almost 75% of these costs are attributable to failures along the pharmaceutical value chain [Tol01]. More than half of the development time, and thus the majority of investments, is spent on clinical trials and the approval phase.7

Basically, molecular biology research in general, and drug discovery especially, can be understood as a kind of multi-stage “sifting process”. Roughly speaking, given the universe of proteins, the number of candidates for a specific drug is reduced in a pipeline of cascaded techniques of increasing complexity. At the end of this pipeline, hopefully, the desired substance is found and new drugs can be produced. Sorting out inapplicable substances early is of major importance since higher-level techniques within the drug design pipeline are extremely complex and expensive. Generally, the drug discovery process can be divided into four main steps:

Target identification: “The identification of new and clinically relevant molecular targets for drug intervention is of outstanding importance to the discovery of innovative drugs” [Hil03, p. 4]. Recently, it was estimated that present drug therapy is based on only about a tenth of the potential drug targets [Dre00]. Thus, a large potential for further developments presently remains unexploited. Traditionally, target identification is based on cellular and molecular biology. Bioinformatics techniques are

7For a detailed listing of the cost distribution regarding the specific phases of the drug development process cf. e.g. [Hil03, pp. 2ff] and the references therein.


applied on a large scale to genomics and proteomics tasks in target identification. Here, novel drug targets are identified by systematically searching for paralogues of known drug targets, i.e. evolutionarily related proteins that perform different but related functions. These new methods aim at discovering new genes or proteins and at quantifying and analyzing differences between ill and healthy organisms. Since the genomes of complete organisms became available, it has become more and more evident that the complexity of biological systems lies at the level of proteins. It is at the protein level that diseases become manifest and at which most drugs act [Hil03]. Thus, protein analysis techniques are extremely important for modern drug discovery.

Target validation/verification: Once a target has been identified, its relevance in a disease process needs to be demonstrated. For this purpose, both gain- and loss-of-function studies are accomplished with so-called knock-out (loss of function) and knock-in (gain of function) animal models. Additionally, further proteomics approaches are applicable. Here, target hits obtained in the preceding step of target identification are usually verified by annotating sequence sets with respect to already known targets.

Lead identification: Following the phases dealing exclusively with target proteins, in the succeeding stages actual compounds are sought which interact with the target protein and modulate its activity. Two principal methods of compound identification are distinguished: random screening and rational design approaches. In high-throughput screening approaches, large numbers of compounds are tested for their ability to affect the activity of target proteins. Due to major progress in molecular biotechnology, a high degree of automation can be reached here. Recently, alternative in silico or virtual screening has become more and more common.
Here, the docking processes of proteins are simulated on a computer, and thus putative interactions relevant for pharmaceutical purposes can be investigated for lead identification.8

Lead optimization: In this final stage before actual (pre-)clinical tests and developments, several parameters regarding the biochemical composition of putative new drugs are optimized. This implies the chemical modification of small molecules and their subsequent pharmacological characterization. The goal of this very time-consuming and costly step is to obtain compounds with pharmacologic properties suitable for a drug. Here, higher-level in vivo as well as in vitro and in silico techniques are applied to the drug candidates.

Following these steps, the remaining substances can be tested in pre-clinical and clinical environments (the fifth step of the general drug discovery process), which is extremely relevant to safety and is thus closely monitored by several governmental organizations. Due to the enormous effort and costs required for the final stage, only the most promising candidates should enter this level. In figure 2.7 the drug discovery process is summarized by means of two triangles symbolizing the four main stages and the costs implied up to the final stage of (pre-)clinical development (fifth step at the top). The further the process progresses, the smaller the number of candidates remaining.

8Inbal Halperin and colleagues give an excellent overview of principles of protein docking used for computational lead identification, see [Hal02].


[Figure: candidate-elimination pyramid – target identification, target validation, lead identification, lead optimization, preclinical and clinical development; the number of candidates decreases towards the top while the costs increase.]

Figure 2.7: Sketch of the principal drug discovery process as a candidate elimination task: the more candidates that enter higher levels of drug detection, the higher the costs.

2.4.2 Protein Sequence Analysis

In the last few years, drug discovery has been strongly influenced by modern biotechnology and bioinformatics techniques. Due to intensified automation, procedures within the phases of lead identification and optimization have been improved significantly, yielding better efficiency and thus accelerated developments. In addition, the initial stages of the drug discovery process, specifically target identification and validation, can greatly benefit from bioinformatics and especially from sophisticated sequence analysis methods. The reason is that sequences of amino acids contain a huge amount of information. By means of modern information technology, comprehensive databases, and powerful computers and computer networks, this information can be exploited almost completely automatically. In [Str91, p. 60f.] the relevant information contained in protein sequences is summarized as follows:

• The comparison of a specific protein sequence with known sequences may uncover previously unknown family relationships. Thus, the function of the new protein can be predicted and deeper biological insights into higher-level biological systems can be obtained.

• Comparing the sequences of a single protein in various species gives hints regarding evolutionary pathways. By means of the proteins’ differences across the species, phylogenetic trees can be derived, which are very useful for the analysis of species relationships.

• The analysis of amino acid sequences may reveal repeating sequence parts within proteins. This is important for the analysis and understanding of evolutionary developments, since numerous proteins developed from a single ancient gene by duplication and subsequent diversification.

18 2.5 Summary

• For the understanding of metabolic processes it is of major importance to know the location of the proteins and the timing of their expression. Sequences of amino acids contain signals designating the location of a protein and its processing.

• The analysis of sequence data provides the foundations for the synthesis of specific antibodies attacking specific proteins.

By applying modern bioinformatics technologies, the investments needed to develop drugs could be reduced by approximately $300 million and the development time could be cut by two years [Tol01]. Thus, in addition to the scientific progress due to the additional knowledge gained by the more deductive approach of e.g. target identification and validation, the whole process can be made cheaper and shorter. However, it must be mentioned that several potential obstacles, like quality problems of the new targets or processing bottlenecks, exist and need to be managed.

2.5 Summary

Proteins are the fundamentals of life and hence subject to a wide variety of research activities within molecular biology. Based on 20 different amino acids, numerous different proteins are synthesized by all living organisms, enabling both basic biological functionalities and complex higher-level metabolic processes. Generally, the biological function of a protein is mediated by its three-dimensional structure, which is mainly determined by the linear sequence of amino acids. One of the fundamental principles of molecular biology states that similar biological function of different proteins is caused by similar structure. The majority of molecular biology research is based on this general principle, which also justifies the formalization of protein relationships in so-called protein families, superfamilies, and folds. In order to uncover hidden evolutionary pathways, phylogenetic relationships between organisms and further coherencies, protein sequence analysis is of major importance. Here, unknown protein sequences are classified regarding their affiliation to certain protein families in order to determine their biological functions.

One example for the application of protein analysis techniques is the drug discovery process, which could benefit significantly from so-called in silico approaches, i.e. protein sequence analysis using bioinformatics methods. By means of computationally supported target identification and validation the time- and money-consuming process could be accelerated. Here, additional members of therapeutically relevant protein families are explored. Although promising results could be obtained, the problem of protein sequence classification is far from being solved. Especially for protein families containing sequences with weak residue similarities, automatic prediction often fails. Thus, improved methods are required.


3 Computational Protein Sequence Analysis

“The probability that a functional protein would appear de novo by random association of amino acids is practically zero. In organisms as complex and integrated as those that were already living a long time ago, creation of entirely new nucleotide sequences could not be of any importance in the production of new information.”

François Jacob [Jac77]

In the remarkable article of François Jacob about evolution and tinkering, the general conclusion was drawn that evolutionary processes in nature are in no way comparable to engineering approaches. Richard Durbin and co-workers very laconically summarized Jacob’s argumentation at the beginning of their standard work on biological sequence processing: “Nature is a tinkerer and not an inventor” [Dur98, p. 2]. In fact, it is this basic evolutionary paradigm that opens the research field of biological data analysis to the application of automatic computational sequence comparison techniques. Throughout the generations, by means of an extremely powerful mechanism of selection and duplication – evolution – today’s fundamentals of life, i.e. proteins encoded by genetic sequences, emerged from common ancestors. Basically, the goal of almost all research activities dedicated to the field of protein analysis is to uncover the mutual relationships between proteins implied by evolutionary processes. At the level of primary structure data the natural tinkering process can possibly be reproduced by analyzing differences and similarities of particular protein sequences. Obviously, this is a task which is predestined for automatic approaches.

Strictly speaking in terms of computer science, sequences, either DNA or protein data, can be understood as strings of finite length containing characters from a given lexicon or alphabet. For protein data, this inventory consists of the 23 single-letter codes of the 20 amino acids plus the ambiguous groups B, Z, and X (cf. table 2.1 on page 9 for details). Thus, the universe of proteins generally represents all words of a formal language. Unfortunately, the grammar of this language is not known. For this reason, analysis is performed by string comparison approaches.
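The string-comparison view can be made concrete with the classical dynamic-programming scheme for global alignment (Needleman-Wunsch), which is treated in detail in section 3.1. The sketch below uses a toy match/mismatch/gap scoring; real protein comparison replaces this with substitution matrices such as PAM or BLOSUM, and the example sequences are arbitrary.

```python
def global_alignment_score(s: str, t: str,
                           match: int = 1, mismatch: int = -1,
                           gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score of two sequences under a
    simple additive scoring scheme (toy scores, not a biological matrix)."""
    n, m = len(s), len(t)
    # D[i][j]: best score for aligning the prefixes s[:i] and t[:j].
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # aligning s[:i] against gaps only
        D[i][0] = i * gap
    for j in range(1, m + 1):          # aligning t[:j] against gaps only
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Best of substitution, deletion (gap in t), insertion (gap in s).
            D[i][j] = max(diag, D[i - 1][j] + gap, D[i][j - 1] + gap)
    return D[n][m]

# "ACGT" vs "AGT": best alignment is ACGT / A-GT, i.e. three matches
# and one gap: 3 * 1 + 1 * (-2) = 1.
print(global_alignment_score("ACGT", "AGT"))  # -> 1
```

The resulting score is exactly the kind of quantity used for deciding on putative relationships; tracing back through the table D additionally recovers the alignment itself.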
Traditionally, protein relationships are explored by mutually aligning sequences and calculating scores for the operations necessary to transform one sequence into another. These scores are used for the decision regarding putative relationships.

In this chapter the state-of-the-art of biological sequence comparison techniques is summarized. In the first part (section 3.1) fundamental direct sequence-to-sequence comparison techniques, so-called pairwise alignment methods, are reviewed. Since almost all approaches are based on Dynamic Programming and scoring techniques, this part begins with their general description. Especially for remote homology detection, pairwise sequence alignment is not always the methodology of choice. In order to capture highly diverging sequences belonging to a particular protein family of interest, these families are often explicitly modeled by various approaches. Thus, in the second part of the chapter (section 3.2), sequence family analysis is described in detail. Enhanced probabilistic models for remote homology detection developed in this thesis are based on current stochastic approaches. Consequently, here the focus lies on probabilistic techniques, namely Hidden Markov Models. One basic goal of this thesis is the adoption of general pattern recognition techniques, like signal processing methods, to the bioinformatics task. There is hardly any literature related to this field of research. Although the problem of remote homology detection has still not been solved, this very promising branch of technology is almost completely neglected. In the final part of this chapter some of the rare sequence comparison approaches based on signal processing techniques are outlined.

In the last few years a huge amount of literature dedicated to protein sequence analysis has been published. Where not explicitly referenced otherwise, the argumentation in this chapter is based on [Dur98, Sal98, Bal01, Mer03, Jon04, Mou04].

3.1 Pairwise Sequence Alignment

The process of natural tinkering can be observed for biological substances especially at the molecular level. Specific proteins emerged evolutionarily by steady modifications and selections based on common parental sequences. It is a specialty of tinkering processes that particular goals can be reached in multiple different ways. Speaking in terms of protein evolution, specific biological functions, i.e. three-dimensional structures, can be encoded by various alternative protein sequences. Generally, given two resembling sequences the probability of similar function and/or structure is rather high [Mer03, p.85]. By means of a comparison of spatial protein structures and the corresponding sequences, Chris Sander and Reinhard Schneider examined a threshold for the percentage of sequence similarity that implies common three-dimensional structures with high statistical significance. This threshold depends on the length of the sequences; for example, for sequences containing at least 80 residues, 30 percent identity is sufficient for the implication of structural similarity [San91].

Due to these principles, pairwise sequence comparison is of major importance for molecular biologists. If sequences are identical to some percentage, this implies strong evidence for structural and/or functional similarity. Typical applications of pairwise alignment techniques are the comparison of unknown sequences to a database of sequences already known and the classification regarding similarity, i.e. evolutionary distance, based on alignment scores. Such approaches require a clear definition of sequence similarity, which usually implies some metric d measuring the "distance" between two strings ~s1 and ~s2.

For the comparison of two arbitrary vectors, numerous different metrics were defined mathematically. The Minkowski distance as defined in equation 3.1 represents the most prominent general metric for arbitrary n-dimensional data vectors ~x and ~y, parameterized by k, k = 1 ... ∞:

    d_M(\vec{x}, \vec{y}) := \left( \sum_{i=1}^{n} |x_i - y_i|^k \right)^{\frac{1}{k}}.    (3.1)

Most notably, for k = 1 the city block (or Manhattan) distance is obtained, and k = 2 represents the well known Euclidean distance. In addition to such general metrics, specialized distance measures for strings exist. As one prominent example, the Hamming distance, which originated in information theory, counts the number of differing characters in the two compared strings. Distances of that kind are very important, especially for information transfer.

Although well defined and commonly used in various application fields, metrics as defined above can only rarely be used for sequence comparison tasks. This is because they require strings of equal lengths. Due to evolutionary tinkering, in protein analysis tasks the comparison of sequences with identical lengths is usually the exception. Furthermore, sequences containing common sub-strings which are slightly shifted from one string to another will not be identified as similar by means of the distances described so far (cf. figure 3.1). Contrary to this, in a biological context such sequences are indeed similar! Thus, more flexible metrics were introduced processing both different string lengths and putative internal shifts.

[Figure not reproduced: two strings of different lengths, C O N T E X T S E Q U E N C E C O N T E X T and C T S E Q U E N C E C O N T E X T, with the shared part S E Q U E N C E highlighted at shifted positions.]

Figure 3.1: Illustration of a shift of identical sequence parts in strings of different lengths. By means of traditional general metrics such as the Minkowski or Hamming distance the actual similarity between both strings (red boxes in the original figure) won't be captured correctly.
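The limitation illustrated in figure 3.1 is easy to reproduce: a position-wise metric such as the Hamming distance scores the two strings as highly dissimilar although they share the sub-string SEQUENCE. A minimal sketch (the truncation to equal lengths is only needed because the Hamming distance is undefined otherwise):

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires equal string lengths")
    return sum(a != b for a, b in zip(s, t))

a = "CONTEXTSEQUENCECONTEXT"
b = "CTSEQUENCECONTEXT"

# Both strings contain the identical sub-string "SEQUENCE" ...
assert "SEQUENCE" in a and "SEQUENCE" in b
# ... but a position-wise comparison of equal-length prefixes misses it:
print(hamming(a[:len(b)], b))  # 14 of 17 positions differ
```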

Generally, for the comparison of biologically related strings, the edit or Levenshtein distance is widely used [Lev66]. Here, the minimal costs of the insertions, deletions and substitutions required for transforming one string into another determine the distance between both sequences. Depending on the actual application, these edit operations are individually scored. Compared to plain distance calculation for strings of equal length, here sequences are mutually aligned by wisely inserting, deleting, substituting, or matching characters, and the scores for similarities are calculated as the "reciprocal" distances. For biological sequence analysis such scoring techniques are the methodology of choice. Throughout the years, a large variety of approaches for efficient and/or optimal alignment were proposed. The basic technique for optimally aligning sequences to each other is called Dynamic Programming (DP).

3.1.1 Principles of Sequence Alignment

In the following, the fundamentals of Dynamic Programming are briefly summarized with respect to the application of protein sequence analysis. First, the general scheme which produces scored alignments for two strings is introduced. Since the general meaning of the sequence data to be aligned is rather important, several different scoring techniques, i.e. definitions of costs for the particular edit operations, exist. Mostly they are summarized in scoring matrices defining the substitution costs for exchanging symbols (amino acids). Thus, the principles of such scoring schemes are outlined. Although DP techniques guarantee optimal alignments scored depending on the substitution scheme actually used, they cannot provide the general decision whether two sequences are related. Even completely unrelated sequences (e.g. random data) will be aligned optimally in the mathematical sense. The final classification result based on alignment scores needs to be extracted by means of an analysis of the statistical significance of the score. Thus, in the third part of this section, the scoring process is briefly discussed in a more statistical manner.

Dynamic Programming

In his survey of the principles of Dynamic Programming, Sean Eddy gives an interesting explanation of the etymology of the term 'Dynamic Programming' [Edd04]. The technique itself was formalized in the early 1950s by Richard Bellman, who was working as a mathematician at RAND Corporation on optimal decision processes. He was searching for an impressive name that would shield his work from US Secretary of Defense Charles Wilson, who apparently was rather negatively minded concerning mathematical research. Since Bellman's work involved time series and planning, he opted for 'dynamic' and 'programming', which initially had nothing to do with computers. He also liked the impossibility of using the term 'dynamic' in a pejorative sense. Bellman figured 'Dynamic Programming' was something not even a congressman could object to. For obvious reasons, the explanations regarding Dynamic Programming given in this thesis are focused on its applications to bioinformatics tasks implying string alignments. However, the original work of Richard Bellman is much more universal [Bel57].

The trivial solution of the alignment problem is the scoring of all possible alignments for two given strings and the selection by means of the optimal, i.e. the maximal, score. Unfortunately, this is computationally not feasible, since there are approximately 2^{2N} different alignments for two sequences of length N [Edd04]. For sequences consisting of different numbers of residues the situation gets even worse. Thus, more sophisticated methods were developed. The general goal of all DP based techniques discussed here is the mutual alignment of two strings whereby partial results are calculated just-in-time, i.e. they are available at the time they are required for further calculations. Generally, each step of the algorithm reuses results of preceding steps, enabling a recursive definition of DP. The global alignment problem is divided into sub-problems, lowering the complexity. For this reason, DP algorithms consist of four parts:

1. A recursive definition of the optimal score,

2. A Dynamic Programming matrix for storing optimal scores of sub-problems,

3. A bottom-up approach for filling the matrix by solving the smallest sub-problem first, and

4. A technique for traceback of the matrix to obtain the global optimal solution.

In the following, the DP algorithm will briefly be outlined by means of a practical example. For this purpose, two hypothetical protein sequences ~s1 = ACDEF of length N = 5 and ~s2 = GAHCDFE consisting of M = 7 residues are considered.


Recursive definition of the optimal score: The global alignment of both sequences ~s1 and ~s2 can end in three different ways: (i) the residues s1_N and s2_M are aligned to each other; (ii)+(iii) either final residue is aligned to a gap character, whereas the end-residue of the remaining sequence was already aligned before. The optimal alignment will be the highest scoring of these three cases. Since the global alignment problem breaks into independently optimizable pieces, the solution of this sub-problem can be generalized recursively. As an example, the scores of the three cases mentioned can be defined in terms of the optimal alignment of the preceding subsequences (prefixes). The score of case (i) above is the score σ(s1_N, s2_M) for aligning s1_N and s2_M plus the score S(~s1_{1...N−1}, ~s2_{1...M−1}) for an optimal alignment of everything else up to this point. For the remaining cases (ii) and (iii) above, gap-penalties γ, i.e. negative scores penalizing the insertion of gaps, are added to the scores S(~s1_{1...N−1}, ~s2_{1...M}) and S(~s1_{1...N}, ~s2_{1...M−1}), respectively. Consequently, the recursive definition of the optimal alignment score can be formulated as follows:

    S(s_{1_i}, s_{2_j}) = \max \begin{cases}
        S(s_{1_{i-1}}, s_{2_{j-1}}) + \sigma(s_{1_i}, s_{2_j}), \\
        S(s_{1_{i-1}}, s_{2_j}) + \gamma, \\
        S(s_{1_i}, s_{2_{j-1}}) + \gamma,
    \end{cases}

where σ(s1_i, s2_j) designates the score for aligning two residues s1_i and s2_j. For simplicity of argumentation, identical scores σ for all substitutions of any residues and identical gap-penalties γ are assumed here. In fact, the actual adjustment of these scores is rather crucial regarding the quality of the resulting alignment. Thus, substantial differences between various alignment algorithms are mostly based on sophisticated selections of these scores (see pages 28ff for details). The recursion is followed until trivial alignment problems with obvious solutions are reached (aligning nothing to nothing produces a score of zero).

DP matrix for storing optimal scores of sub-problems: A principal problem for recursive algorithms is the likely waste of computational time due to multiple calculations of sub-problems occurring more than once. The deeper the recursion goes, the worse the situation becomes. One obvious solution for such problems is to keep track of intermediate results. The fundamental difference between simple recursion and DP techniques is the use of a matrix for memorizing results, so that each sub-problem is solved only once. Structurally, the DP matrix consists of a tabular arrangement of the sequences to be aligned (cf. e.g. figure 3.2). Step by step, every cell is filled with local alignment scores as described in the following.
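The memorization idea can be illustrated with a top-down variant of the alignment recursion, in which a cache takes the role of the DP matrix so that every prefix pair (i, j) is evaluated only once (a sketch with assumed example scores; the function name is my own):

```python
from functools import lru_cache

def align_score(s1, s2, match=5, mismatch=-2, gap=-6):
    """Optimal global alignment score via memoized top-down recursion."""
    @lru_cache(maxsize=None)           # cache = the "DP matrix"
    def S(i, j):
        if i == 0 and j == 0:          # aligning nothing to nothing
            return 0
        options = []
        if i > 0 and j > 0:            # case (i): align the two residues
            sigma = match if s1[i - 1] == s2[j - 1] else mismatch
            options.append(S(i - 1, j - 1) + sigma)
        if i > 0:                      # case (ii): gap in the second sequence
            options.append(S(i - 1, j) + gap)
        if j > 0:                      # case (iii): gap in the first sequence
            options.append(S(i, j - 1) + gap)
        return max(options)
    return S(len(s1), len(s2))

print(align_score("ACDEF", "GAHCDFE"))  # -1
```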

Bottom-up approach for matrix filling: Based on the recursive definition of the optimal alignment score, the boundary conditions for the leftmost column and the topmost row of the DP matrix are already known: S(ε, ε) = 0, S(~s1_{1...i}, ε) = iγ, and S(ε, ~s2_{1...j}) = jγ, where ε denotes the empty sequence. For all trivial alignment cases these scores are used for filling the relevant parts of the matrix. Now, in a bottom-up way, from the smallest problems to progressively bigger problems, the matrix is filled by evaluating the recursive definition of the alignment scores. Since all results are stored inside the matrix, redundant calculations are avoided. For the global alignment of both strings, at every step a traceback-pointer is kept, designating the predecessor which leads to the locally optimal score of the appropriate step.


Traceback of the matrix to obtain the global optimal solution: The final global alignment of both strings considered can be obtained after the DP matrix has been filled completely. The traceback for uncovering the global alignment starts in cell (N, M), i.e. at the lower-right corner of the matrix. The traceback follows the pointer yielding the maximal score for the next step. Note that equal scores may occur; here, the actual successor in the global alignment is determined in an application-dependent manner. The algorithm stops when the traceback arrives at the upper-left corner of the matrix. In figure 3.2 the optimal alignment of the exemplary sequences ~s1 = ACDEF and ~s2 = GAHCDFE is outlined. They are arranged in tabular form as described above and the DP matrix is filled using the recursive definition of the alignment score. Here, an exemplary scoring system of +5 for a match, -2 for a mismatch and -6 for each insertion or deletion is applied. The red cells of the figure represent the final globally optimal path for the string alignment, which is shown below the matrix, yielding a score of -1.

             j (~s2)    0    1    2    3    4    5    6   7=M
    i (~s1)             -    G    A    H    C    D    F    E
    0          -        0   −6  −12  −18  −24  −30  −36  −42
    1          A       −6   −2   −1   −7  −13  −19  −25  −31
    2          C      −12   −8   −4   −3   −2   −8  −14  −20
    3          D      −18  −14  −10   −6   −5   +3   −9  −15
    4          E      −24  −20  −16  −12   −8   −3   +1   −4
    N=5        F      −30  −26  −22  −18  −14   −9   +2   −1

Optimum alignment score: −1

     G  A  H  C  D  F  E
     −  A  −  C  D  E  F
    −6 +5 −6 +5 +5 −2 −2

Figure 3.2: Global sequence alignment by applying the Dynamic Programming technique: By means of a recursive definition of the alignment score the DP matrix is filled and the global optimal path is extracted by traceback (red cells in the original figure). The final alignment is shown below the matrix. Scoring system: match +5, mismatch -2, insertion/deletion -6.
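The bottom-up filling and the traceback of figure 3.2 can be sketched as follows (a minimal sketch, not the thesis' implementation; ties in the traceback are resolved here by preferring the diagonal predecessor):

```python
def needleman_wunsch(s1, s2, match=5, mismatch=-2, gap=-6):
    """Global alignment: DP matrix filling plus traceback."""
    n, m = len(s1), len(s2)
    # boundary conditions: trivial alignments against the empty sequence
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    # bottom-up filling via the recursive score definition
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if s1[i - 1] == s2[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    # traceback from (n, m) back to (0, 0)
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + (
                match if s1[i - 1] == s2[j - 1] else mismatch):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return S[n][m], "".join(reversed(a1)), "".join(reversed(a2))

score, a1, a2 = needleman_wunsch("ACDEF", "GAHCDFE")
print(score)  # -1
print(a1)     # -A-CDEF
print(a2)     # GAHCDFE
```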

According to the authors who first used this dynamic programming approach for global alignments of protein sequences, this technique is mostly referred to as the Needleman-Wunsch algorithm [Nee70, Got82]. Its algorithmic complexity can be approximated by O(NM) for both memory and time requirements.

The previously described global alignment algorithm is widely used for protein classification tasks. Here, complete sequences are mutually aligned, which is very useful for segmented data, i.e. sequences containing well defined start and end positions. Contrary to this, in molecular biology applications the alignment of parts of sequences is often demanded. Methods tackling the problem of finding partial matches are usually called local-alignment techniques. The local version of Dynamic Programming was developed in the early 1980s and is usually known as the Smith-Waterman algorithm, according to its inventors [Smi81, Got82]. The method is closely related to the algorithm described above. Basically, there are two major differences. First, a bound on the locally optimal scores is introduced, allowing S(s1_i, s2_j) to become zero if all other options have values less than 0:

    S(s_{1_i}, s_{2_j}) = \max \begin{cases}
        0, \\
        S(s_{1_{i-1}}, s_{2_{j-1}}) + \sigma(s_{1_i}, s_{2_j}), \\
        S(s_{1_{i-1}}, s_{2_j}) + \gamma, \\
        S(s_{1_i}, s_{2_{j-1}}) + \gamma.
    \end{cases}

Reaching the bound, i.e. the 0-case, implies the termination of a local alignment. Due to the zero option the initial filling of the DP-matrix is slightly changed: instead of inserting the γ-values, these cells are filled with 0. The second difference of local alignment methods compared to the general DP approach is that now an alignment can start anywhere in the matrix. Instead of taking the value in the bottom right corner as the best score, now the global maximum is searched over the whole matrix. The traceback ends when a cell containing 0 is met. Basically, multiple local maxima can occur, and in fact one refinement of standard local alignment techniques is to process all high-scoring fragments of a pairwise alignment. However, the basic technique remains the same. Thus, all appropriate argumentation given here is directed at the principal procedure. In figure 3.3 the local alignment of the sequences given above is shown.

             j (~s2)    0    1    2    3    4    5    6   7=M
    i (~s1)             -    G    A    H    C    D    F    E
    0          -        0    0    0    0    0    0    0    0
    1          A        0    0   +5    0    0    0    0    0
    2          C        0    0    0   +3   +5    0    0    0
    3          D        0    0    0    0   +1  +10   +4    0
    4          E        0    0    0    0    0   +4   +8   +9
    N=5        F        0    0    0    0    0    0   +9   +6

Optimum alignment score: +10

     C  D
     C  D
    +5 +5

Figure 3.3: Local sequence alignment by applying the Smith-Waterman algorithm: The final alignment starts at the global maximum within the whole matrix and ends at the first cell containing a value of 0. Again, the optimal path is extracted by traceback (red cells in the original figure). The final alignment is shown below the matrix. Scoring system: match +5, mismatch -2, insertion/deletion -6.
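Analogously, the local variant of figure 3.3 only requires the zero floor, the search for the overall maximum, and a traceback that stops at a 0 cell (again a minimal sketch, not the thesis' implementation):

```python
def smith_waterman(s1, s2, match=5, mismatch=-2, gap=-6):
    """Local alignment: zero-bounded DP matrix, traceback from the maximum."""
    n, m = len(s1), len(s2)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # 0 boundary instead of gamma
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if s1[i - 1] == s2[j - 1] else mismatch
            S[i][j] = max(0,                     # termination of a local alignment
                          S[i - 1][j - 1] + sigma,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:                   # remember the global maximum
                best, best_pos = S[i][j], (i, j)
    # traceback from the overall maximum until a 0 cell is reached
    i, j = best_pos
    a1, a2 = [], []
    while S[i][j] != 0:
        sigma = match if s1[i - 1] == s2[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sigma:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))

score, a1, a2 = smith_waterman("ACDEF", "GAHCDFE")
print(score, a1, a2)  # 10 CD CD
```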


The alignment algorithms outlined here are the basis for numerous refined approaches. Important enhancements are the possibility of finding multiple local matches or of finding overlapping matches. Since the focus of this thesis is on probabilistic models of sequence families, no exhaustive review of such refinements will be given here. Instead, the reader is referred to the tremendous amount of specialized publications addressing such refinements. A good starting point is the standard work of Richard Durbin and colleagues [Dur98]. However, due to their enormous importance for today's molecular biology research, two of the most prominent heuristic approaches for bounding the computational complexity of DP techniques will be presented in section 3.1.2. In the following, the parameterization of the general DP framework, i.e. the adjustment of the particular scores for edit operations, is explained.

Scoring Matrices

For simplicity of argumentation, the basic sequence-to-sequence alignment techniques the previous sections dealt with were described using fixed scores for the substitution of residues. When applying these sequence alignment techniques to practical applications of molecular biology, this simplification does not usually hold. Instead, the plain algorithms are used as a kind of general framework for sequence alignment. Depending on the actual application, i.e. the particular data inspected and the target domain of the investigations, these frameworks need to be configured wisely. Configuration here means the adjustment of specific substitution scores for the edit operations involved in Dynamic Programming.

The alignment procedures outlined above used identical scores (+5 for match and -2 for mismatch) for all arbitrary residue alignments. According to biological realities this adjustment is insufficient, since it assumes the same probabilities for substituting any amino acid with any other. Obviously, this does not hold true for real applications. It is quite unrealistic to score the substitution of residues with similar biochemical properties in the same way as the substitution of amino acids that completely differ. Thus, usually the substitution scores are adjusted in a per-residue manner. These specific scores are arranged in tabular form in so-called substitution matrices. Recently, Shuichi Kawashima and Minoru Kanehisa compiled 71 mutation matrices [Kaw00]. According to the two main categories of sequence alignment applications, two general types of scoring matrices were created:

Reconstruction of evolutionary processes: The scores contained in matrices addressing the reconstruction of evolutionary processes represent mutation rates. Usually, they are constructed based on the analysis of sequences and their reconstructed ancestors.

Comparison of protein domains: Here, the scores are based on the composition of the protein domains at hand or close relatives of them. Usually, these entries are calculated by means of substitution frequencies, which can either be measured in wet lab experiments or via information theoretical approaches.

Throughout the years, a large variety of specific matrices were created for both categories. In figure 3.4 the most prominent matrices of both types are shown.

[Figure not reproduced: the full triangular PAM250 (left) and BLOSUM62 (right) substitution matrices.]

Figure 3.4: Two major scoring matrices: PAM250 (left) and BLOSUM62 (right). The amino acids of PAM250 are ordered according to their biochemical properties – related amino acids are adjacent.

These matrices are symmetric, implying identical scores for both directions of residue substitutions. In addition to this, occasionally non-symmetric matrices are used (cf. e.g. [Lin01]). On the left-hand side of figure 3.4, the widely used PAM250 matrix [Day78] represents the evolutionarily motivated scoring matrices. It is one member of the family of PAM matrices. Generally, PAM is an acronym for "point accepted mutations" or "percent accepted mutations" and designates a measure of evolutionary divergence (distance) between two amino acid sequences. The general definition of PAM is as follows (cf. [Mer03, p.112]):

"Two sequences ~s1 and ~s2 differ by one PAM unit if ~s2 developed from ~s1 due to a sequence of accepted point mutations whereby one point mutation occurred on average per 100 residues."

PAMn matrices are created by comparing protein sequences diverging by n PAM units. The substitution scores are adjusted according to the expected frequencies of the appropriate residue changes, which are measured in various experiments. In figure 3.5 the global alignment of the two exemplary sequences given above using the PAM250 substitution matrix is illustrated. Obviously, the alignment has slightly changed at its end, resulting in a global score of +3.

At the right-hand side of figure 3.4, the BLOSUM62 (block substitution) matrix is shown as one prominent member of the second general category of scoring matrices [Hen92]. Such matrices are created by counting the substitution frequencies of all amino acids when analyzing sequences of related protein domains. For the creation of the BLOSUM62 matrix, Steven and Jorja Henikoff analyzed blocks (conserved parts of protein families extracted from multiple alignments without gaps) contained in the BLOCKS database [Hen99, Hen00]. By means of a heuristic procedure of block analysis, the scores are extracted using the following definition [Mer03, p.114]:


"Let f(a_i) be the frequency of a residue a_i occurring at all positions within blocks of the BLOCKS database. Let f(a_i, a_j) be the frequency of the column-wise occurrences of pairs a_i, a_j. Then the score S_{a_i,a_j} can be defined as:

    S_{a_i, a_j} = \frac{f(a_i, a_j)}{f(a_i) f(a_j)}."
A common refinement of the extraction process is the restriction of the analysis to sequences with similarities less than or equal to a certain percentage. For the BLOSUM62 matrix this implies that all sequences analyzed for estimating the substitution scores must not have similarities larger than 62 percent.
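The quoted extraction scheme can be sketched on hypothetical toy data (three gap-free block columns; all counts are invented for illustration, and published BLOSUM entries additionally apply a scaled logarithm to such ratios):

```python
from collections import Counter
from itertools import combinations

# hypothetical gap-free block columns (one string = one column of a block)
columns = ["AAAL", "AALL", "LLLL"]

pair_counts, single_counts = Counter(), Counter()
for col in columns:
    single_counts.update(col)              # counts for f(a_i)
    for a, b in combinations(col, 2):      # column-wise residue pairs f(a_i, a_j)
        pair_counts[frozenset((a, b))] += 1

n_single = sum(single_counts.values())
n_pairs = sum(pair_counts.values())

def score(a, b):
    """S_{a_i,a_j} = f(a_i, a_j) / (f(a_i) * f(a_j)) from the quoted definition."""
    f_pair = pair_counts[frozenset((a, b))] / n_pairs
    f_a = single_counts[a] / n_single
    f_b = single_counts[b] / n_single
    return f_pair / (f_a * f_b)

print(score("A", "A"))  # ≈ 1.28
print(score("A", "L"))  # ≈ 1.6
```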

             j (~s2)    0    1    2    3    4    5    6   7=M
    i (~s1)             -    G    A    H    C    D    F    E
    0          -        0   −6  −12  −18  −24  −30  −36  −42
    1          A       −6   +1   −4  −10  −16  −22  −28  −34
    2          C      −12   −5   −1   −7   +2   −4  −10  −16
    3          D      −18  −11   −5    0   −4   +6    0   −6
    4          E      −24  −17  −11   −4   −5    0   +1   +4
    N=5        F      −30  −23  −17  −10   −8   −6   +9   +3

Optimum alignment score: +3

     G  A  H   C  D  −  F  E
     −  A  −   C  D  E  F  −
    −6 +2 −6 +12 +4 −6 +9 −6

Figure 3.5: Global sequence alignment by applying the Dynamic Programming technique using the PAM250 scoring matrix and -6 for deletion/insertion.

In correspondence with their biological relevance, substitution matrices were studied in detail in recent years. From a more theoretical point of view, the underlying statistics of scoring schemes is well defined and formulated in terms of information theory (for details cf. e.g. [Alt91, Hen96b]).

Gap penalties

In addition to the scoring of substituted and matched residues of two sequences to be aligned, the general Dynamic Programming approach contains penalties for deleting and inserting symbols – so-called gap penalties γ. In the explanations and examples given so far, fixed penalties were assumed. For the alignment of the two artificial protein sequences ~s1 = ACDEF and ~s2 = GAHCDFE, the gap penalties were adjusted to a value of -6. Due to the trivial alignment task all gaps had a length of one. In principle, however, gaps span arbitrary numbers of residues within a particular alignment. Thus, the mathematically correct definition of gap penalties is a cost function γ(g) of the gap length g. Note that in order to keep consistency, the gap penalties for single insertions/deletions will still be called γ, and the cost function depending on the gap length g will similarly be denoted γ(g). Applying constant gap penalties as in the examples given above implies a linear cost function for gaps of length g:

    γ(g) = gγ.

Usually, i.e. for alignments of sequences containing several hundreds of residues, an alternative definition of the gap costs as an affine function is used:

    γ(g) = γ − (g − 1)e,

where γ is called the gap-open penalty and e is called the gap-extension penalty. Depending on the expectations regarding the characteristics of gaps in an alignment, both penalties are set individually, taking into account probabilities for gaps and their lengths. Usually, e is set to something smaller in magnitude than the gap-open penalty γ, resulting in lower costs for long insertions and deletions than when using linear penalty functions. The reason for this is that the "biological" probability of single long gaps in protein sequence alignments is higher than that of many small gaps.

Usually, gap penalties γ are adjusted independently of the residues the gap contains. Additionally, specialized residue-specific penalties can be used if certain expectations about gap characteristics are available (e.g. due to prior knowledge).
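The two cost functions can be contrasted directly (a sketch; the gap-open value -6 follows the examples above, while the extension penalty e = 2 is an assumed value):

```python
def gamma_linear(g, gap=-6):
    """Linear gap cost: gamma(g) = g * gamma."""
    return g * gap

def gamma_affine(g, gap_open=-6, extend=2):
    """Affine gap cost: gamma(g) = gamma - (g - 1) * e."""
    return gap_open - (g - 1) * extend

for g in (1, 5, 10):
    print(g, gamma_linear(g), gamma_affine(g))
# a gap of length 10 costs -60 linearly, but only -24 with the affine function
```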

Alignment Scores

Both sequences used so far for explaining pairwise alignment approaches do not originate from any real protein – they are artificial examples. Nevertheless, all algorithms delivered solutions to the general alignment problem by revealing the optimal sequences of edit operations, resulting in alignment scores. In the next step, by means of these scores, a classification decision needs to be taken which is biologically reasonable, giving evidence for homology.

Log-odds Scores: Formally, the biologically driven decisions regarding homology via alignment scores can be performed within a well defined statistical framework by comparing two hypotheses. According to the null hypothesis, two sequences are assumed to be non-homologous, which implies a random alignment R. The probability of a random alignment of two sequences ~s1 and ~s2 is just the product of the probabilities p(·) of the occurrences of each amino acid s1_i and s2_i:

    P(\vec{s}_1, \vec{s}_2 \,|\, R) = \prod_{i=1}^{N} p(s_{1_i}) \prod_{i=1}^{N} p(s_{2_i}).    (3.2)

Contrary to this, in the match hypothesis, aligned pairs of residues occur with a joint probability p(s1_i, s2_i), which can be understood as the probability that both residues s1_i and s2_i were derived independently from their common (unknown) ancestor. The probability of this match alignment M can be formulated as follows:

    P(\vec{s}_1, \vec{s}_2 \,|\, M) = \prod_{i=1}^{N} p(s_{1_i}, s_{2_i}).    (3.3)

The homology decision can now be based on the ratio of both likelihoods, which is called the odds ratio:

    \frac{P(\vec{s}_1, \vec{s}_2 \,|\, M)}{P(\vec{s}_1, \vec{s}_2 \,|\, R)}
    = \frac{\prod_{i=1}^{N} p(s_{1_i}, s_{2_i})}{\prod_{i=1}^{N} p(s_{1_i}) \prod_{i=1}^{N} p(s_{2_i})}
    = \prod_{i=1}^{N} \frac{p(s_{1_i}, s_{2_i})}{p(s_{1_i}) p(s_{2_i})}.

In order to obtain an additive scoring scheme (which is easier to handle), the logarithm is taken, resulting in the log-odds ratio S(~s1, ~s2), a sum of individual scores S(s1_i, s2_i):

    S(\vec{s}_1, \vec{s}_2) = \sum_{i=1}^{N} S(s_{1_i}, s_{2_i})    (3.4)

    S(s_{1_i}, s_{2_i}) = \log \frac{p(s_{1_i}, s_{2_i})}{p(s_{1_i}) p(s_{2_i})}.    (3.5)

In fact, scoring schemes like BLOSUM or PAM are usually based on log-odds ratios where biological expertise is considered for the adjustment of the particular probabilities of the amino acids.
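For a single aligned residue pair, equation 3.5 is one line of code (the probabilities are assumed toy values, not estimates from real data):

```python
import math

# assumed toy values for one aligned residue pair (a, b)
p_a, p_b = 0.05, 0.05   # background frequencies p(a), p(b)
p_ab = 0.01             # joint probability p(a, b) under the match model

# equation 3.5: positive score <=> the pair is more likely under the match model
S_ab = math.log(p_ab / (p_a * p_b))
print(round(S_ab, 3))   # log(4) ≈ 1.386
```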

Extended Bayesian Alignment Scores: The general idea of log-odds scoring as defined in equations 3.4 and 3.5 is the comparison of two hypotheses which generally means the comparison of two models: the random model R and the match model M. Basically, the posterior probability P (M|~s1, ~s2) that the match model M is correct, is required for the assessment of sequential relationships. By means of Bayes’ rule, the prior probabilities P (M) and P (R) = 1 − P (M), and equations 3.4 and 3.5, this posterior probability can be derived as follows [Dur98, p.36 f]:

P (~s1, ~s2|M)P (M) P (M|~s1, ~s2) = P (~s1, ~s2) P (~s , ~s |M)P (M) = 1 2 P (~s1, ~s2|M)P (M) + P (~s1, ~s2|R)P (R) P (~s1,~s2|M)P (M) = P (~s1,~s2|R)P (R) . 1 + P (~s1,~s2|M)P (M) P (~s1,~s2|R)P (R) When using the following definitions: P (M) P (~s , ~s |M) S0 = S + log where S = log 1 2 , (3.6) P (R) P (~s1, ~s2|R)

then the posterior probability that the match model is correct results in

P(M | ~s1, ~s2) = σ(S′),   where   σ(x) = e^x / (1 + e^x).

Equation 3.6 implies adding the prior log-odds ratio, log( P(M) / P(R) ), to the general alignment score S, which corresponds to multiplying the likelihood ratio by the prior odds ratio. If all expressions used are converted into real probabilities (which is difficult for the scoring scheme itself), the resulting value can in principle be compared to 0, indicating whether the sequences are related. Especially when searching a database, i.e. analyzing a large number of alignments for significant matches, the prior odds ratio becomes important. Here, care needs to be taken, since the larger the number of sequences compared, the larger the probability of significant hits (falsely) obtained by chance. Thus, one solution is to compare the log-odds score with log(N) (instead of 0), where N is the number of sequences in the database.
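The mapping from score to posterior probability via the logistic function σ can be sketched as follows; the score values and priors below are arbitrary illustrations, not calibrated parameters:

```python
import math

def sigmoid(x):
    # Numerically safe logistic function sigma(x) = e^x / (1 + e^x).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def match_posterior(score, prior_match=0.5):
    """P(M | s1, s2) = sigma(S'), with S' = S + log(P(M) / P(R))."""
    prior_log_odds = math.log(prior_match / (1.0 - prior_match))
    return sigmoid(score + prior_log_odds)

# With an uninformative prior P(M) = 0.5 the prior term vanishes,
# so a score of 0 yields a posterior of exactly 0.5:
print(match_posterior(0.0))
# A small prior (e.g. only one true match expected among many candidate
# sequences) pushes the posterior down for the same score:
print(match_posterior(5.0, prior_match=1e-4) < match_posterior(5.0))
```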

Statistics of Alignment Scores: Assessing the significance of alignment scores can also be performed using a classical statistical framework. The scores delivered by an alignment approach depend on several parameters; among others, these are the lengths of the sequences and the characteristics of the scoring scheme itself. As stated above, for database searches the number of sequences analyzed is important for the distribution of the scores. Thus, it is reasonable to determine an expectation value E for the number of alignments actually scored with S. Depending on the scoring scheme used, which implies specific values for the two constants κ and λ, this expectation value for two sequences of length N and M (where for database searches usually N ≪ M) can be described as follows (E-value):

E(S) = κ M N e^{−λS}.    (3.7)

The number of alignments producing a score ≥ S is determined by a Poisson distribution:

P(S ≥ S0) = 1 − e^{−E(S0)}.

Thus, the distribution of the alignment scores can be described by an extreme value distribution (EVD):

F(x) = 1 − e^{−e^{−(x−µ)/β}}.

If the probability that random sequences produce the same or a larger score than the actually observed one is small, then the observation is considered significant. As an example for database search applications, an E-value of 9 is interpreted as follows: for a database of a given size M, nine alignments are expected to reach the observed alignment score purely by chance. Thus, the smaller the E-value, the more significant the alignment score. Generally, the description of the scores' distribution by means of an EVD is analytically proven only for local alignments without gaps [Kar90, Alt96], but various simulations provide strong evidence for the validity of the theory for gapped local and global alignments. This has been only a brief overview of alignment statistics relevant for this thesis. A more detailed explanation is given by David Mount in [Mou04].
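A minimal sketch of the E-value of equation 3.7 and the associated Poisson-based significance estimate; the constants κ and λ below are made-up placeholders (in practice they are fitted to the particular scoring scheme), as are the sequence and database lengths:

```python
import math

def e_value(score, m, n, kappa, lam):
    """Expected number of chance alignments scoring >= score (equation 3.7)."""
    return kappa * m * n * math.exp(-lam * score)

def p_value(score, m, n, kappa, lam):
    """Probability of at least one chance alignment with score >= score,
    from the Poisson approximation: 1 - exp(-E)."""
    return 1.0 - math.exp(-e_value(score, m, n, kappa, lam))

# Hypothetical parameters; kappa and lambda depend on the scoring scheme:
kappa, lam = 0.13, 0.32
e1 = e_value(40, m=1_000_000, n=300, kappa=kappa, lam=lam)
e2 = e_value(60, m=1_000_000, n=300, kappa=kappa, lam=lam)
# Higher scores yield exponentially smaller E-values:
print(e2 < e1)
```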


3.1.2 Heuristic Approximations

Basic Dynamic Programming techniques as described so far guarantee the optimum alignment of two sequences containing N and M residues, respectively, either globally (Needleman-Wunsch) or locally (Smith-Waterman). The price for this is rather high since their general computational complexity is quadratic, namely O(NM). For practical applications, where homologues of sequences containing chains of several hundred amino acids are searched for in databases consisting of thousands of protein sequences, a complete search using algorithms with quadratic complexity is mostly impossible. Under these premises, heuristic techniques were developed which significantly reduce the actual computational effort by approximating the basic DP algorithms. The basic goal of such methods is the efficient analysis of large amounts of sequence data while keeping the accuracy as close as possible to that of exact algorithms. Strategies for tackling the problem of high computational effort are mostly based on two basic approaches:

Indexing: Scores for matching short substrings of length k, k ≪ N, M (so-called k-tuples; for protein sequence analysis k is usually on the order of 2), are pre-calculated and stored in an indexed database. Due to the limited number of k-tuples (for protein sequences containing 20 different amino acids there are 20^k possible k-tuples), the calculation as well as the database storage and retrieval can be performed efficiently. When two sequences are compared, they are first subdivided into k-tuples and the database is searched for entries containing these k-tuples in the correct order. Due to indexing, the alignment scores of these k-tuples are already calculated, and in the best case the alignment problem with computational complexity O(NM) can be scaled down to a simple database querying problem whose complexity is significantly lower. The fallback case of ordinary (thus computationally more expensive) alignment needs to be considered for those database entries which are not captured by the scoring index, e.g. due to inconsistencies caused by outdated index structures. Since present databases grow rapidly and index calculation is rather slow, this case is not that exceptional, and most implementations of alignment systems are optimized for smart fallback alignments.

Pruning: Comparing a query sequence with the entries of a database generally implies the limitation to alignments of sequences that are "sufficiently" similar. Thus, the calculation of the (local) alignment can be stopped (thereby accelerating the database screening) if the local scores drop below a threshold, since in this case the complete calculation would only result in an alignment score which is not significant. In practice this implies a narrowing of the evaluated DP matrix area. Using this pruning technique, the computational effort can be severely limited.
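The indexing strategy can be sketched with a toy k-tuple index in Python; the sequences are artificial, and the sketch ignores scoring, ordering constraints, and the fallback path described above:

```python
from collections import defaultdict

def build_ktuple_index(sequence, k=2):
    """Map every k-tuple of `sequence` to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

def shared_ktuples(query, index, k=2):
    """Collect (query_pos, db_pos) pairs of exactly matching k-tuples."""
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            hits.append((i, j))
    return hits

db_seq = "MKVLAAGKVLA"            # artificial database sequence
index = build_ktuple_index(db_seq, k=2)
hits = shared_ktuples("KVLA", index, k=2)
print(hits)
```

The index is built once per database entry; each query then costs only one lookup per k-tuple instead of a full O(NM) matrix evaluation.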

Usually, both approaches described are used in a combined manner. Two prominent examples of heuristic alignment techniques are FASTA and BLAST, which are briefly outlined in the following sections.


FASTA

In 1988 William Pearson and David Lipman presented a program suite for improved biological sequence comparison [Pea88]. These tools (FASTA, FASTP, and LFASTA) can be used for analyzing both protein and DNA sequences. In the following, the basic version of the heuristic alignment method for protein sequences – FASTA – is explained. Generally, FASTA is an approximation of the Smith-Waterman algorithm for local sequence alignments. Based on a four-stage approach, local high-scoring alignments are created, starting from exact short sub-sequence matches, through maximal scoring ungapped extensions, up to the final identification of gapped alignments most likely to be homologous to the query sequence. Note that in the original publication of the FASTA algorithm [Pea88], the single steps were not explicitly named. Rainer Merkl and Stephan Waack introduced comprehensible names in their description of the algorithm (cf. [Mer03, pp. 123ff]), which are adopted for clarity in the explanations given here.

1. Basically, the major speedup of the alignment calculation process is gained in the first stage of the FASTA approach – hashing. Here, for all k-tuples of the query sequence starting at position i within the input, the starting positions j of exactly matching k-tuples within particular database entries are searched. These pairs (i, j) are called hot-spots and are obtained very efficiently by hashing techniques. In order to retrieve the ten best diagonal sequences within a (hypothetical) DP matrix, the relative positions of all hot-spots are evaluated using a simple scoring scheme. The actual limitation to the ten best diagonals is part of the FASTA heuristic. The diagonals extracted now contain mutually supporting word matches without gaps, serving as seeds for further processing.

2. In the second step – scoring 1 – the ten maximally scoring diagonal sequences are processed further. Within the diagonal sequences, optimum local ungapped alignments are obtained using a PAM or BLOSUM scoring scheme. Here, exact word matches from the first step are extended, possibly joining several seed matches. The alignments produced here are called initial regions.

3. In the third step – scoring 2 – gapped alignments of joined initial regions (based on the ungapped initial regions obtained in the second step of FASTA) are created, allowing for gap costs. The resulting alignment scores of the joined regions are called initn.

4. In the final phase of FASTA – alignment – the highest scoring candidate matches in a database search are realigned by means of the conventional Smith-Waterman algorithm. Here, the evaluation of the DP matrix is restricted to a small corridor around the initial region from step two scored with init1, producing the final alignment and the appropriate score.
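The hashing stage (step one) can be sketched as follows; the hot-spot coordinates are invented, and the "simple scoring scheme" for diagonals is reduced here to a plain hot-spot count, which is only a stand-in for the actual FASTA scoring:

```python
from collections import Counter

def top_diagonals(hits, n_best=10):
    """Group hot-spots (i, j) by diagonal d = j - i and rank diagonals by
    a trivial score, here simply the number of hot-spots they contain."""
    counts = Counter(j - i for i, j in hits)
    return counts.most_common(n_best)

# Hypothetical hot-spots from a k-tuple lookup: three of them share
# diagonal d = 1 and thus mutually support one ungapped word match.
hits = [(0, 1), (2, 3), (5, 6), (0, 7), (4, 0)]
print(top_diagonals(hits))
```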

The principles of the FASTA approach can easily be summarized graphically. In figure 3.6 the four phases of the algorithm are illustrated.


Figure 3.6: The four phases of the FASTA algorithm (adopted from [Mer03, p.125]): 1) Determination of positions of identical sub-strings (k-tuples) and scores for diagonals; 2) Determination of locally best scoring diagonals – best: init1; 3) Merging of locally optimal alignments – score: initn; 4) Final sequence alignment in small corridor of DP matrix around init1 – score: opt.

BLAST

The second important technique for approximating the Smith-Waterman algorithm for local sequence alignments is the Basic Local Alignment Search Tool – BLAST – of Stephen Altschul and colleagues [Alt90]. In fact, in the last decade BLAST has become the major tool for sequence analysis, and most experimental evaluations in molecular biology research these days start with a BLAST run on one of the large sequence databases.1 The basic idea of this heuristic approximation is the extension of high-scoring matches of short sub-strings of the query sequence. Similar to the initial step of FASTA described in the previous section, BLAST starts with the localization of short sub-sequences contained in both the query sequence and the database sequences which produce significant scores. Such pairs are called segment-pairs or hits. Based on these hits, locally optimal pairs of sequences containing one hit are searched for – so-called High-Scoring Segment-Pairs (HSP). The boundaries of HSPs are determined in such a way that extensions or shortenings of the string would decrease the score. Retrieving high-scoring alignments of a query sequence from a database is performed in a multi-stage approach. Initially, all sub-strings consisting of w residues (w ≪ N, M) are extracted from the query sequence. Based on these so-called w-mers, all further steps are performed with respect to all database entries. As in the description of the FASTA algorithm, the names of the particular BLAST steps are not part of the original publication but adopted from [Mer03, p.129f].

• In the first step of BLAST – localization of hits – the database entry is inspected for high-scoring matches of all w-mers of the query sequence. Note that BLAST does not require exact matches in the first stage, only significant scores.

• Based on the hits extracted in the first phase, in the second stage of BLAST – identification of HSPs – pairs of hits located on the same diagonal of a (hypothetical)

1Due to the overwhelming success of the tool and the resulting importance of BLAST alignments, in the common speech of molecular biologists even the original meaning of the word 'blast' (→ blow up with explosive [Swa86]) seems to have been extended towards 'perform an alignment of the query sequence against a database' . . .


Smith-Waterman matrix with a spatial distance shorter than A are obtained. This distance can be measured by analyzing the differences between the positions of the first symbols of two w-mers. Both hits are extended to an HSP, and if the score of an HSP exceeds a threshold Sg, an extension with gaps is initiated.

• Starting from a residue pair (a so-called seed) the alignment is extended in both directions by means of standard DP. Here, only those cells of the DP matrix are considered where the calculated score is higher than the current maximum score minus a threshold Xg. Compared to the FASTA algorithm, the matrix area evaluated by BLAST is dynamically adjusted.

• In the final output stage, the resulting alignment containing gaps is returned if the calculated score (E-value) is below the given threshold.

The BLAST algorithm is likewise best summarized in a graphical manner, as shown in figure 3.7.

Figure 3.7: Illustration of the BLAST algorithm (adopted from [Mer03, p.130]): 1) Localization of hits (marked by '+') and extension to High-Scoring Segment-Pairs (HSP) if the distance of two hits on the same diagonal is < A; 2) Calculation of a gapped alignment (conjunction of HSPs across gaps) for HSPs with a score > Sg. The starting point for the alignment is the pair of residues designated as seed. Only those cells of the DP matrix are considered whose scores differ from the maximum by no more than Xg (area shaded grey in the right sketch).
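The HSP-seeding criterion (two hits on the same diagonal within distance A) can be sketched as a toy in Python; the hit coordinates and the distance parameter (here `max_gap`, standing in for A) are invented, and extension and scoring are omitted:

```python
def hsp_candidates(hits, max_gap):
    """Pairs of hits on the same diagonal d = j - i whose first positions
    differ by less than `max_gap` become candidate HSP seeds."""
    by_diagonal = {}
    for i, j in sorted(hits):
        by_diagonal.setdefault(j - i, []).append(i)
    pairs = []
    for diag, positions in by_diagonal.items():
        for a, b in zip(positions, positions[1:]):
            if b - a < max_gap:
                pairs.append((diag, a, b))
    return pairs

# Three hits on diagonal 2; only the first two are close enough to pair up:
hits = [(0, 2), (5, 7), (30, 32), (4, 10)]
print(hsp_candidates(hits, max_gap=10))
```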

3.2 Analysis of Sequence Families

By means of pairwise alignment techniques as described in the previous sections, relationships between two sequences can be analyzed. Although widely used (BLAST), such techniques represent only the fundamentals of general sequence analysis. Especially for biologically related sequences with highly diverging primary structures, pairwise comparisons are likely to fail. This means that no significant scores are obtained when using classical pairwise alignment techniques, although the particular sequences share, for example, a common ancestor. As stated in section 2.3.1, protein sequences can be grouped into families, super-families, folds etc., with respect to certain higher-level relationships. The correct classification of a

query sequence with respect to its family affiliation2 is of major importance for molecular biologists. Additionally, protein families contain substantially more information than single sequences, since they capture properties relevant for the whole group of family members. Although a single member sequence might contain such properties, the probabilistic basis for their detection is rather small when comparing the query sequence to single members. Due to statistical noise, such weak features are usually hidden, and by means of pairwise techniques they cannot be recovered. The incorporation of multiple sequences sharing such properties into the analysis process can amplify them or alleviate the statistical noise, respectively. Thus, the global analysis of complete families, or the incorporation of family-related information into the alignment process, is promising, especially for remote homology detection tasks often performed in e.g. drug discovery. One method for respecting family-based relationships and properties is the extension of pairwise alignment techniques towards the creation of multiple sequence alignments (MSA). Here, the general problem is the definition of a final alignment score for aligning the query sequence to all members of a particular family. The solution is the calculation of the sum of all scores obtained by pairwise alignments of the query sequence to all member sequences [Car88]. Unfortunately, the computational complexity for k sequences of length N is enormous, making the plain technique feasible only for small numbers of sequences. Thus, numerous refinements and (heuristic) optimizations were proposed throughout the years. The most promising and widely used techniques are based on iterative, progressive determinations of the MSA, where, starting from an optimal global alignment of a single pair of sequences, edit operations (insertions of gaps etc.) are performed for establishing the final MSA.
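The sum-of-pairs family score can be sketched with a deliberately trivial pairwise score; a real system would use a full DP alignment with a substitution matrix, and the sequences below are artificial:

```python
def pairwise_score(a, b, match=1, mismatch=-1):
    """Toy ungapped score over equal-length sequences; a stand-in for a
    real pairwise DP alignment score."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def family_score(query, family):
    """Sum-of-pairs family score: align the query against every member
    and add up the scores. The cost grows with the family size times
    the cost of one alignment, which is why heuristics are needed."""
    return sum(pairwise_score(query, member) for member in family)

family = ["GACDEF", "GACDQF", "GVCDEF"]   # artificial family members
print(family_score("GACDEF", family))
```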
Prominent examples of MSA algorithms are CLUSTAL W [Tho94], DIALIGN [Mor99], and T-Coffee [Not00], which are exhaustively compared and evaluated in the work of Timo Lassmann and Erik Sonnhammer [Las02]. In the following sections an alternative and more promising approach for the analysis of sequence families is reviewed. The essentials of protein families are explicitly modeled using probabilistic approaches. Once robust family models are determined, query sequences are aligned to these models instead of to single sequences. Afterwards, the resulting alignment score is used for classification. Compared to pairwise sequence comparison techniques, the alignment of query sequences to explicit stochastic models of protein families incorporates family information just as the creation of MSAs does, but generalized in a well-defined statistical framework with much lower computational complexity. Every query sequence is aligned only once to every family model and not once to every sequence belonging to a particular protein family (as in the MSA case). Generally, such techniques are the basis for the developments performed for this thesis. Starting from position-specific scoring schemes like Profiles and related refinements of standard alignment approaches, stochastic models for protein families are explained. The focus is concentrated on the technique which is currently most successful, namely modeling protein families using Profile Hidden Markov Models.

2For simplicity of argumentation in the following the term ’family’ is used as a generalized description of sequence relationships instead of detailed discrimination between protein families, super-families etc.


3.2.1 Profile Analysis

The analysis of multiple sequence alignments is principally very instructive for the understanding of the characteristics of sequences belonging to a particular protein family. Usually, sequences of interest are mutually aligned and the resulting MSA is analyzed by visual inspection according to the consensus, i.e. the columns of the MSA containing high-scoring conserved sequence parts. The automatic analysis of multiple sequence alignments can be performed using some kind of threshold-based score analysis, i.e. the higher the scores of a contiguous part of the alignment, the higher the probability for conserved and thus putatively biologically relevant regions. The basic problem of pairwise sequence analysis techniques when comparing distantly related data is that alignment scores get lost within statistical noise. In order to sharpen the distinction between hits and misses of sequence database searches, Michael Gribskov and colleagues proposed the concept of Profile analysis [Gri87]. Contrary to pairwise analysis, the information obtained from multiple alignments is integrated into the Dynamic Programming approach. Thus, pairwise techniques and MSA approaches are combined into one framework allowing more sensitive database searches. The basic idea of Profile analysis is the creation of position-specific scoring matrices. Here, contrary to traditional pairwise sequence comparison using one global scoring matrix as described in section 3.1.1, specific scores and gap penalties depending on the actual position within the alignment of the sequences of interest are applied. The position-specific scores are obtained by analyzing the local residue frequencies of an underlying MSA. They are summarized in the so-called Profile by directly assigning the scores to the appropriate column, resulting in an N × 20 dimensional matrix, where N is the length of the MSA.
The position-specific scores themselves are extracted from the relative frequencies of all residues at a particular column of the MSA. After creating the Profile, it can be used for more sensitive sequence comparison by performing Profile-based Dynamic Programming. In figure 3.8 the Profile for an exemplary multiple alignment of artificial protein sequences is illustrated. Obviously, due to the very small number of sequences involved in the exemplary multiple alignment, the Profile is rather sparsely filled, i.e. most entries are set to zero. A relative frequency of zero implies the hard decision that the corresponding amino acid at the specified position is never expected to occur, which is rather critical. Although the problem is alleviated in real applications where significantly more sequences are mutually aligned (statistically the matrix contains fewer zero entries), it still remains. In order to avoid hard zero probabilities, pseudo-count methods are usually applied, thereby adjusting the probabilities of any unseen residues according to prior knowledge. Generally, Profiles are not limited to complete MSAs. Instead, local Profiles for strongly conserved regions of MSAs are often created. The general procedure remains the same, though.
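Profile construction from an MSA, including the pseudo-count correction mentioned above, can be sketched as follows; the MSA and the pseudo-count value are artificial, and the (simple additive) pseudo-count scheme is only one of several options:

```python
from collections import Counter

def build_profile(msa, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=0.0):
    """Column-wise relative residue frequencies of an MSA, i.e. a
    position-specific matrix of shape (columns x 20). Gap symbols '-'
    are ignored; an optional pseudocount avoids hard zero probabilities."""
    n_cols = len(msa[0])
    profile = []
    for c in range(n_cols):
        counts = Counter(seq[c] for seq in msa if seq[c] != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profile

msa = ["GAHCD", "-ACCD", "G-HCD"]          # artificial aligned sequences
profile = build_profile(msa, pseudocount=0.1)
# Column 3 is fully conserved 'C'; its probability dominates that column:
print(max(profile[3], key=profile[3].get))
```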

Iterative Pairwise Sequence Analysis

In section 3.1.2, two heuristic approximations of the general Dynamic Programming approach were outlined. For molecular biologists the most important tool for fast pairwise



Figure 3.8: Profile for an exemplary multiple alignment of artificial protein sequences. For every column the relative frequencies are summarized in a substitution vector used for obtaining the resulting position specific scoring matrix (non-zero elements are marked red).

sequence comparison is the BLAST program. As stated above, if the protein sequences to be aligned contain only minor similarities, the sensitivity of traditional pairwise DP approaches decreases significantly. Based on the Profile approach described above, one refinement of the general BLAST algorithm exists which increases its sensitivity for database searches regarding remote homologies: Position-Specific Iterated (PSI) BLAST [Alt97]. Basically, it is a direct implementation of the Profile approach integrated within the powerful BLAST framework. A position-specific scoring matrix, the Profile, is automatically constructed from the output of a BLAST run. Afterwards, a modified BLAST operates on such a matrix in place of a simple query as before. These two basic steps are iterated multiple times, thereby continuously refining the database search. PSI-BLAST creates global Profiles, i.e. position-specific scoring matrices for the whole sequence (and thus the complete MSA) of interest. The determination of the database search hits which will be included into the MSA for creating the Profile is rather heuristic: all sequences whose alignments to the query sequence score better than a predefined threshold represent candidates, which are filtered further according to sequence similarity. All hits identical to the query sequence are eliminated, and only one sequence is included from all candidates sharing more than 98 percent similarity. Additionally, pairwise alignments are only performed between the candidates and the query sequence. Residues corresponding to gaps within the query sequence are not included. For zero frequencies of amino acids in particular columns of the Profile, pseudo-counts are calculated. Step by step, by means of a slightly modified pairwise DP technique, a statistical representation of the protein family the query sequence belongs to is implicitly established. Using this, sensitive database searches can be performed.
The consequential enhancement of such

techniques is the explicit modeling of protein families. In fact, explicit stochastic models of the essentials of sequence families are currently the methodology of choice for remote homology detection. Due to the enhanced consideration of residue-wise variation within the query sequence, a broader search can be performed which is more effective than traditional pairwise comparison. In the remaining parts of this thesis the most promising approaches for explicit protein family modeling are introduced and enhanced.

3.2.2 Profile Hidden Markov Models

Based on the Profile approach of Michael Gribskov as described in the previous section, several techniques for probabilistic modeling of protein families were developed. Especially Profile Hidden Markov Models (Profile HMMs), first introduced to the bioinformatics community by David Haussler and colleagues in 1993 [Hau93] and first generalized by Anders Krogh and coworkers in 1994 [Kro94a], and by Pierre Baldi et al. [Bal94], play a key role in probabilistic sequence analysis. The comparison of highly diverging but related sequences using Profile HMMs for modeling the particular protein family is superior to traditional approaches like MSA analysis etc. Following this, large databases of sequence families were created using Profile HMMs as the modeling basis, e.g. Pfam, containing HMM profiles for protein domains [Son98]. Furthermore, in recent years numerous bioinformatics applications were realized using (general) Hidden Markov Models. Probably as one of the first, in 1989 Gary Churchill used such stochastic models for heterogeneous DNA sequences [Chu89] and later for the analysis of genome structure [Chu92]. One of the main applications for HMMs in bioinformatics is their use for gene finding (cf. e.g. [Kro94b, Bal95, Hen96a, Kul96, Bur97, Kro97, Luk98, Ped03]). The origins of Hidden Markov Models, as one representative of graphical (Bayes) models, are general pattern recognition applications. Here, especially automatic speech recognition (cf. e.g. [Hua01] for a comprehensive overview) and the analysis of handwritten script (cf. e.g. [Wie03]) need to be mentioned. Generally, HMMs are applicable to all signal data evolving in time. Substituting time dependency with position or location dependency, Hidden Markov Models can be used for the analysis of sequence data, where the particular positions of residues are artificially interpreted as signal values at a given time-step.
The detailed explanation of (Profile) HMMs in the succeeding sections is organized as follows: first, their formal definition is presented by means of general, discrete models. Furthermore, common (semi-)continuous enhancements for non-bioinformatics applications of Hidden Markov Models, like the automatic recognition of spoken language or handwritten script, are discussed. Following this, the most important algorithms for HMM training and evaluation are presented. After their formal introduction, the use of HMMs for bioinformatics purposes is explained in detail – Profile HMMs as stochastic protein family models. The general description of the theory of Hidden Markov Models is based on the monographs of Ernst Günter Schukat-Talamazzini [Sch95] and Gernot A. Fink [Fin03]. For the argumentation regarding the application of HMMs to probabilistic modeling of protein sequence families, the standard works of Richard Durbin and colleagues [Dur98] as well as Pierre Baldi and Søren Brunak [Bal01] are used. In addition, numerous special publications, reviews, tutorials etc. have been published throughout the years. Generally, they are summarized and captured by the books given above. Exceptional cases will be marked explicitly.


Formal Definition of Hidden Markov Models

Principally, a Hidden Markov Model can be described as a generating finite automaton containing a fixed set of states, probabilistic state transitions and state-dependent emissions. The emissions can be either discrete symbols (of a finite inventory) or vectors of continuous observations. Speaking more formally, a Hidden Markov Model represents a two-stage stochastic process. Here, the first stage describes a discrete, stationary, causal, random process by means of a sequence

~s = (s1, s2, . . . , sT ) of discrete random variables st whose domain is a finite set of states S:

S = {S1,S2,...,SN }.

The parameter t can be interpreted either as time-steps for signals evolving in time or as the positions of particular residues within sequences for the analysis of biological data. This stochastic process fulfills the so-called Markov property, i.e. the probabilistic selection of a particular state st depends only on a limited number of preceding states. For a first-order Markov process, this "memory" is limited to the immediate predecessor:

P (st|s1 . . . st−1) = P (st|st−1).

In combination with the stationary character of the process, i.e. its independence from absolute values of t, it is called a homogeneous Markov chain. Generally, besides first-order HMMs, higher-order HMMs can be defined, too. Since no efficient algorithms for both parameter training and model evaluation exist, and such models usually require enormous numbers of training samples for robust estimation, their practical relevance is rather small. They will not be dealt with in the thesis at hand. Note that higher-order HMMs can principally be mapped to first-order models. Due to the limitation to first-order Markov processes, the probabilistic state transitions P(st = Sj | st−1 = Si) can be summarized in a quadratic N × N dimensional transition matrix A:

A = [aij] = [P(st = Sj | st−1 = Si)],    (3.8)

where

∀i, j: aij ≥ 0   and   ∀i: Σ_{j=1}^{N} aij = 1

is fulfilled. The initialization of the Markov chain is described using the vector ~π of start probabilities:

~π = [πi] = [P(s1 = Si)],   where   Σ_{i=1}^{N} πi = 1.    (3.9)

It is the characteristic of HMMs that the state sequence produced by the first stage of the stochastic process cannot be observed. Instead, in the second stage, depending on the state actually selected, so-called emissions are produced probabilistically. Since only these

emissions are observable, while the internal state sequence stays hidden, the complete two-stage stochastic process is called a Hidden Markov Model. The elements of the sequence of emissions

~o = (o1, o2, . . . , oT) can originate either from a finite and discrete set of symbols

O = {O1,O2,...,OD},

or they can be represented by vectors ~ot ∈ R^D of a D-dimensional vector space. These emissions are produced in dependence on the particular state selected in the first stage of the stochastic process according to the probability

P (ot|o1 . . . ot−1, s1 . . . st) = P (ot|st).

Hidden Markov Models emitting discrete symbols are called discrete HMMs and, similarly to the transition parameters, their emission probabilities can be summarized in an N × D dimensional matrix:

B = [bjk] = [P(ot = Ok | st = Sj)],   1 ≤ j ≤ N, 1 ≤ k ≤ D,    (3.10)

where

∀j, k: bjk ≥ 0   and   ∀j: Σ_{k=1}^{D} bjk = 1.

In the latter case of continuous emissions, so-called continuous HMMs, N-dimensional density vectors are defined:

B~ = [bj(~o)] = [p(~ot = ~o | st = Sj)],   1 ≤ j ≤ N, ~o ∈ R^D,    (3.11)

where

∀j: bj(~o) ≥ 0   and   ∀j: ∫_{R^D} bj(~o) d~o = 1.

By means of the definitions given above, a Hidden Markov Model λ is completely defined by

λ = (~π, A, B)    (3.12)

for discrete emissions and by

λ = (~π, A, B~)    (3.13)

for continuous emissions. It describes a two-stage stochastic process where first, in a homogeneous Markov chain initialized by the start probabilities ~π, a state st is probabilistically selected depending only on its immediate predecessor. However, this state cannot be observed. Instead, depending on the actual state, in the second stage of the stochastic process an emission (either a discrete symbol or a continuous vector) is probabilistically generated. In figure 3.9 the general definition of Hidden Markov Models is illustrated by means of a discrete model containing three states.
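The definition λ = (~π, A, B) for a discrete model can be sketched as a small generative simulation in Python; all parameter values below are invented for illustration and the model has three states and two symbols:

```python
import random

# Minimal discrete HMM lambda = (pi, A, B); all parameters are made up.
states = ["S1", "S2", "S3"]
symbols = ["O1", "O2"]
pi = [1.0, 0.0, 0.0]             # start probabilities (always start in S1)
A = [[0.6, 0.4, 0.0],            # a_ij = P(s_t = S_j | s_{t-1} = S_i)
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
B = [[0.9, 0.1],                 # b_jk = P(o_t = O_k | s_t = S_j)
     [0.5, 0.5],
     [0.2, 0.8]]

def sample(length, rng):
    """Run the two-stage generative process: pick hidden states via the
    Markov chain, then emit one observable symbol per state."""
    emissions, hidden = [], []
    state = rng.choices(range(3), weights=pi)[0]
    for _ in range(length):
        hidden.append(states[state])
        emissions.append(rng.choices(symbols, weights=B[state])[0])
        state = rng.choices(range(3), weights=A[state])[0]
    return hidden, emissions

hidden, observed = sample(5, random.Random(0))
print(observed)   # only the emissions are observable, not `hidden`
```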


[Figure 3.9 depicts a three-state discrete HMM with its transition probabilities P(s_t = S_j \mid s_{t-1} = S_i) and the per-state emission distributions P(o_t = O_k \mid s_t = S_j).]

Figure 3.9: Definition of an exemplary discrete Hidden Markov Model containing three states and discrete emissions.

Modeling Aspects

Usually, for general explanations of Hidden Markov Models some standard "toy examples" are used. Richard Durbin and coworkers illustrate HMMs by means of the "occasionally dishonest casino", where, based on the observations of a loaded die, the most probable sequence of the two internal states 'fair' or 'loaded' is recovered [Dur98, p. 54f]. Alternatively, Lawrence Rabiner created an example of weather observations (in terms of temperature values) which are analyzed to uncover the sequence of "internal" weather states ('fine', 'sunny', 'rainy' etc.) [Rab89].

Actually, due to their two-stage architecture, HMMs are predestined for the classification of arbitrary data sequences with varying lengths. This is especially the case for signals evolving in time, like real-world speech signals or trajectories of handwritten script. Additionally, amino acid sequences belonging to certain protein families fulfill the same "constraints" for successful applicability: due to the high variability in both sequence length and actual residue composition, Hidden Markov Models are well suited for the analysis of biological data.

Although the theory of HMMs is well formulated and established, for practical applications several fundamental modeling decisions need to be made, which significantly influence the effectiveness of HMMs for their particular use. Among the most important are the choice of the actual model topology and the way the emissions are modeled. In the following, both modeling aspects are discussed.

Model Topology: Hidden Markov Models are one example of machine learning approaches used for pattern recognition tasks. Independent of the actual modeling subject, the parameters of the stochastic model are estimated in a separate training step. In the succeeding recognition stage, these models are evaluated for the classification of new data. Depending on the quality of the trained model (in particular its ability to generalize to unknown data), the general classification problem can be solved satisfactorily.

One important decision in the design of HMMs for pattern recognition tasks is the selection of the model topology, i.e. the actual shape of the matrix A of state transition probabilities (cf. equation 3.8). In figure 3.10 some very common model architectures are illustrated.

Figure 3.10: Common HMM model architectures for general pattern recognition applications, differently restricting the possible state transitions: Linear, Bakis, Left-Right, Ergodic (most restrictive → most flexible).

The most flexible modeling is achieved with a fully occupied matrix A, corresponding to a so-called ergodic model architecture. Here, every state s_i is connected to every state s_j, ∀i, j, of the model. The major drawback of ergodic HMMs is their enormous number of parameters. Since all parameters need to be trained using representative sample data, large amounts of training examples are required, which is quite unrealistic for the majority of applications. Furthermore, the computational effort for both training and evaluation of ergodic models is considerable.

Most processes to be modeled by Hidden Markov Models contain some kind of structural information allowing the free transitions within the model architecture to be restricted, namely by predefining certain transition probabilities as zero. Besides the ergodic topology, the three model topologies shown in figure 3.10 represent the most common choices for general pattern recognition applications in various non-bioinformatics domains:

• Linear models contain non-zero transition probabilities in A only on the diagonal and the first secondary diagonal adjacent to the right (s_i → {s_i, s_{i+1}}), i.e. direct transitions are possible only to the particular state itself and to its immediate neighbor.

• Bakis models additionally define non-zero transition probabilities on the second secondary diagonal to the right of the diagonal of A (s_i → {s_i, s_{i+1}, s_{i+2}}), i.e. the above-mentioned linear models are extended by transitions skipping a state's immediate neighbor.

• Left-Right models allow transitions to all neighbors to the right of a particular state (∀j ≥ i: s_i → s_j, which corresponds to an upper triangular form of A).
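The four topologies differ only in which entries of A may be non-zero. A minimal sketch (the helper name `topology_mask` is my own) constructs the corresponding Boolean masks and counts the free transition parameters, illustrating the trade-off between flexibility and the number of parameters to be trained:

```python
import numpy as np

def topology_mask(n, kind):
    """Boolean mask of allowed transitions for the four standard topologies."""
    i, j = np.indices((n, n))         # i = source state, j = target state
    if kind == "linear":              # s_i -> {s_i, s_{i+1}}
        return (j == i) | (j == i + 1)
    if kind == "bakis":               # additionally skip one state: s_i -> s_{i+2}
        return (j >= i) & (j <= i + 2)
    if kind == "left-right":          # all states to the right: upper triangular A
        return j >= i
    if kind == "ergodic":             # fully occupied A
        return np.ones((n, n), dtype=bool)
    raise ValueError(kind)

# Number of free transition parameters per topology for N = 10 states:
for kind in ("linear", "bakis", "left-right", "ergodic"):
    print(kind, int(topology_mask(10, kind).sum()))  # 19, 27, 55, 100
```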


Generally, the decision for a particular model architecture represents a trade-off between flexibility and feasibility and needs to be made wisely. Especially for protein sequence analysis using HMMs, the choice of the actual model topology is rather crucial. Here, the majority of approaches are based on rather complicated model architectures, which are described later in this section (cf. pages 54ff).

Type of Emissions: According to the formal definition of Hidden Markov Models, in the second stage of the stochastic process emissions are generated depending on the state selected in the preceding stage (cf. equations 3.10 and 3.11, respectively). In the simplest case, such emissions are represented by discrete symbols from a finite inventory. Especially for biological sequence analysis, discrete HMMs are presently the methodology of choice, which seems obvious since the input data is apparently of discrete nature (sequences of 20 discrete amino acids).

In early applications of HMMs for signal-based pattern recognition tasks, discrete models were the technology of choice, too. However, modeling the emission distributions with discrete symbols requires (for continuous signals) a preprocessing step mapping the data to their representatives, i.e. vector quantization. Due to the distortion of the signal space implied by such a discretization step, substantial information is lost at a very early stage of signal processing. Usually, this fundamental and potentially erroneous modification of the data cannot be corrected in later modeling steps. Thus, discrete modeling is rather critical.

In order to avoid such negative quantization effects, in continuous HMMs the emission distributions b_j(\vec{o}) are used directly. The data-driven modeling of the continuous emission space requires parameterizable density functions. Due to their moderate number of parameters, and thus their easy mathematical treatment, weighted sums of a sufficient number K of multivariate normal densities \mathcal{N} (Gaussian mixture densities) are usually the methodology of choice for the representation of the emission data.3 In this way, arbitrary density functions can be approximated:

b_j(\vec{o}) = \sum_{k=1}^{K_j} c_{jk} g_{jk}(\vec{o}) = \sum_{k=1}^{K_j} c_{jk} \mathcal{N}(\vec{o} \mid \vec{μ}_{jk}, C_{jk}) \quad (3.14)

where

\forall j: \sum_{k=1}^{K_j} c_{jk} = 1 \quad and \quad \forall j, k: c_{jk} \ge 0.

The parameters \vec{μ}_{jk} and C_{jk} represent the mean vector and the covariance matrix of the appropriate Gaussian. During a preceding training step, the mixture components are usually approximated by applying a standard vector quantization method (like k-means [Mac67] or LBG [Lin80]) to a representative and sufficiently large sample set. In summary, HMMs containing a mixture-density-based representation of emissions are in fact three-stage stochastic processes:

3Note that the terms ‘mixture’ and ‘mixture component’ are used synonymously, whereas ‘mixture density’ explicitly designates a weighted sum of Gaussians.


1. For every t (time-step, position etc.), a state st ∈ S is selected probabilistically.

2. Depending on the actual state st, a mixture component N (~µjk, Cjk) is selected.

3. According to the mixture component selected, an emission \vec{o}_t is generated.

Actually, the quantization of the continuous signals is delayed up to the third stage of the stochastic process. Thus, contrary to discrete models, the complete modeling process is based on undistorted data. Due to their excellent approximation capabilities and the delayed quantization of the input data, continuous HMMs are usually superior to their discrete counterparts. Due to this outstanding performance, one of the fundamental enhancements of HMMs for protein sequences developed in this thesis is based on the substitution of state-of-the-art discrete models by continuous HMMs as described here.

In the case of continuous modeling of the emissions, a large number of parameters needs to be estimated, namely (for every state s_j) the mean vectors \vec{μ}_{jk} and covariance matrices C_{jk} of the K_j mixture components. This requires large and representative sample sets, which are often not available. Thus, it is often more advantageous to use a common set of mixture components for all states. In the literature, this intermediate between discrete and continuous modeling is referred to as semi-continuous HMMs, developed by Xuedong Huang and colleagues [Hua89]:

b_j(\vec{o}) = \sum_{k=1}^{K} c_{jk} g_k(\vec{o}) = \sum_{k=1}^{K} c_{jk} \mathcal{N}(\vec{o} \mid \vec{μ}_k, C_k), \quad (3.15)

where

\forall j: \sum_{k=1}^{K} c_{jk} = 1 \quad and \quad \forall j, k: c_{jk} \ge 0.

Semi-continuous HMMs are mostly interpreted as discrete HMMs containing an integrated "soft" vector quantization, where the mixture coefficients c_{jk} represent the emission probabilities of the discrete model, weighted by means of the density values g_k(\vec{o}). In figure 3.11 the non-discrete emission modeling is summarized by means of a comparison between continuous (left) and semi-continuous (right) HMMs.
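The only difference between equations 3.14 and 3.15 is whether the Gaussians are state-specific or shared. A minimal sketch of the semi-continuous case (all parameter values are invented for illustration): all states share one codebook of K = 2 Gaussians, and only the weights c_{jk} are state-specific.

```python
import numpy as np

def gaussian(o, mu, C):
    """Density of a multivariate normal N(o | mu, C)."""
    d = len(mu)
    diff = o - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(C, diff)) / \
        np.sqrt((2 * np.pi) ** d * np.linalg.det(C))

# Shared codebook (equation 3.15): mean vectors mu_k and covariances C_k.
mus = [np.zeros(2), np.ones(2)]
Cs = [np.eye(2), np.eye(2)]
c = np.array([[0.3, 0.7],        # state-specific weights c_jk, one row per s_j
              [0.8, 0.2]])

def b(j, o):
    """Semi-continuous emission density b_j(o) of state s_j."""
    return sum(c[j, k] * gaussian(o, mus[k], Cs[k]) for k in range(len(mus)))
```

For a continuous HMM in the sense of equation 3.14, each state would instead carry its own `mus`/`Cs`, multiplying the number of parameters accordingly.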

Algorithms

The theory of Hidden Markov Models has been well investigated. It is the existence of efficient and powerful algorithms for both training and evaluating the models which makes HMMs so attractive for various pattern classification tasks. In the following, the most important algorithms are presented. In order to motivate them, first the actual application of HMMs to classification tasks needs to be discussed. For bioinformatics purposes, this includes the general determination of protein classes for single protein sequences as well as the detection of new members of protein families within sequence databases.

The basis for classification using HMMs is Bayes' rule. If every pattern class ω_k is represented by a separate HMM λ_k, its posterior probability is defined as follows:

P(λ_k \mid \vec{o}) = \frac{P(\vec{o} \mid λ_k) P(λ_k)}{P(\vec{o})}. \quad (3.16)


Figure 3.11: Comparison of non-discrete HMMs: continuous emission modeling using state-specific mixture densities (left) vs. semi-continuous modeling sharing a common set of mixture components (right); adopted from [Sch95, p. 144].

The final decision regarding ω_k is taken according to the HMM λ_k delivering the maximum posterior probability P(λ_k \mid \vec{o}). Since the denominator of equation 3.16 is irrelevant for the maximization, it is ignored, resulting in the following decision rule:

λ^* = \operatorname{argmax}_{λ_k} P(\vec{o} \mid λ_k) P(λ_k). \quad (3.17)

If no information about the prior probability P(λ_k) is available, a uniform distribution is assumed. Thus, the final classification depends exclusively on the generation probability P(\vec{o} \mid λ_k), which implies a Maximum-Likelihood classifier.

Based on this definition of the classification problem, in the following the particular algorithms for model evaluation and training are explained. The discussion is structured according to the three fundamental problems of Hidden Markov Models, which is very common in the literature:

Evaluation: Based on the classification rule given above, here, the general probability of an HMM for generating the sequence of observation symbols is addressed.

Decoding: Here, the goal is to uncover the internal state sequence selected during the generation of the emissions.

Training: Finally, the estimation of optimal model parameters for the description of patterns assigned to a particular class is discussed.
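The decision rule of equation 3.17 is easily sketched in the log domain, which is how it is evaluated in practice to avoid numerical underflow. The class names, log-likelihoods, and priors below are purely hypothetical:

```python
import math

# Hypothetical scores log P(o | lambda_k) for three family models,
# plus class priors P(lambda_k); equation 3.17 in the log domain.
log_likelihoods = {"family_A": -210.4, "family_B": -198.7, "family_C": -205.1}
priors = {"family_A": 0.5, "family_B": 0.25, "family_C": 0.25}

# argmax_k [ log P(o | lambda_k) + log P(lambda_k) ]
best = max(log_likelihoods,
           key=lambda k: log_likelihoods[k] + math.log(priors[k]))
print(best)  # family_B
```

With uniform priors the prior term is a constant and can be dropped, yielding the pure Maximum-Likelihood classifier described above.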

Estimation of the Generation Probability – The Evaluation Problem: If HMMs are applied to pure classification tasks, i.e. every pattern class is modeled using a specialized HMM, and the general probability that a query sequence belongs to a particular pattern class ω_k needs to be estimated, the so-called generation probability is determined. Here, the actual state sequence selected during generation is not important and is thus not considered.


The trivial solution estimates the generation probability in a "brute-force" manner. Here, P(\vec{o} \mid λ) is calculated by summing over all possibilities of creating the sequence of observations \vec{o} while selecting a particular state path \vec{s}:

P(\vec{o} \mid λ) = \sum_{\vec{s}} P(\vec{o}, \vec{s} \mid λ) = \sum_{\vec{s}} P(\vec{o} \mid \vec{s}, λ) P(\vec{s} \mid λ) = \sum_{\vec{s}} \prod_{t=1}^{T} a_{s_{t-1}, s_t} b_{s_t}(o_t), \quad (3.18)

where a_{0i} = π_i. Because of its computational complexity of O(T \cdot N^T), this procedure is not feasible for most practical applications. Due to the limited memory of first-order Markov processes, an efficient algorithm for estimating the generation probability with linear complexity (in the length of the observation sequence) can be formulated by means of classic Dynamic Programming as described earlier. Therefore, specialized auxiliary variables are defined, namely the so-called Forward-variables

α_t(j) = P(\vec{o}_1, \ldots, \vec{o}_t, s_t = S_j \mid λ), \quad (3.19)

capturing the probability that the partial sequence \vec{o}_1, \ldots, \vec{o}_t is observed while the appropriate state s_t = S_j is selected, and Backward-variables

β_t(j) = P(\vec{o}_{t+1}, \ldots, \vec{o}_T \mid s_t = S_j, λ), \quad (3.20)

analogously capturing the probability that the sequence \vec{o}_{t+1}, \ldots, \vec{o}_T will be observed from t + 1 on if the current state is s_t = S_j. Thus, the overall generation probability can be defined as follows:

\forall t: \quad P(\vec{o} \mid λ) = \sum_{i=1}^{N} α_t(i) β_t(i), \quad (3.21)

where N designates the number of states. Since every state depends only on its immediate predecessor, the auxiliary variables can be defined recursively, resulting in two equivalent algorithms efficiently evaluating Hidden Markov Models by estimating P(\vec{o} \mid λ). Both algorithms, the Forward- and the Backward-algorithm, are summarized in figure 3.12.
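Both recursions of figure 3.12 translate directly into a few lines of NumPy. The function names and the small two-state model are my own, added only as a sanity check that both directions yield the same probability:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward-algorithm: alpha recursion, returns P(o | lambda)."""
    alpha = pi * B[:, obs[0]]              # alpha_1(i) = pi_i b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(o_{t+1})
    return float(alpha.sum())              # termination: sum_i alpha_T(i)

def backward(pi, A, B, obs):
    """Backward-algorithm: beta recursion, returns the same probability."""
    beta = np.ones(len(pi))                # beta_T(j) = 1
    for o in obs[:0:-1]:                   # o_{t+1} for t = T-1, ..., 1
        beta = A @ (B[:, o] * beta)        # beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
    return float((pi * B[:, obs[0]] * beta).sum())

# Hypothetical two-state model with two emission symbols:
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0]
assert np.isclose(forward(pi, A, B, obs), backward(pi, A, B, obs))
```

In practical implementations the recursion is additionally scaled or computed in the log domain, since the raw probabilities underflow for long sequences.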

Uncovering the most probable state path – Viterbi Decoding: Hidden Markov Models are the methodology of choice for pattern recognition tasks dealing with data containing substantial variance. Here, basically some kind of "real" internal states are assumed to produce the observed data. In the last section, the estimation of the general probability of an HMM generating the observed sequence was discussed. The (hypothetical) "origin" of the processed data, i.e. the internal HMM state sequence, was completely neglected. In addition to this, for numerous applications the internal state sequence actually selected for generating the observed data is of major importance. The internal structure of the data


1. Initialization:
   α_1(i) = π_i b_i(\vec{o}_1)   |   β_T(j) = 1
2. Recursion:
   ∀t = 1, …, T−1:   α_{t+1}(j) = [\sum_{i=1}^{N} α_t(i) a_{ij}] \, b_j(\vec{o}_{t+1})
   ∀t = T−1, …, 1:   β_t(i) = \sum_{j=1}^{N} a_{ij} b_j(\vec{o}_{t+1}) β_{t+1}(j)
3. Termination:
   P(\vec{o} \mid λ) = \sum_{i=1}^{N} α_T(i)   |   P(\vec{o} \mid λ) = \sum_{j=1}^{N} π_j b_j(\vec{o}_1) β_1(j)

Figure 3.12: Forward- (left) and Backward-algorithm (right) for efficiently estimating the generation probability P(\vec{o} \mid λ) of HMMs; both algorithms are equivalent and differ only in the direction of the recursion.

generation process can give valuable hints for the overall classification task. HMMs became so attractive because, by uncovering the hidden state sequence for unknown data, both the classification and the segmentation task can be solved in one integrated step. Especially for continuous speech recognition and for database search using protein family HMMs, this feature is very important, because classifying unsegmented data usually results in some kind of "chicken and egg" dilemma: If the correct segmentation of the data, i.e. the correct determination of the start- and end-points of the units to be classified (words, genes, proteins etc.), is not available, the classification process itself is very likely to fail. Unfortunately, correct segmentation in turn requires the classification problem to be solved first. The internal state sequence of an HMM represents the actual segmentation of the data. Due to the statistical modeling, only the most probable internal state sequence can be obtained.

Because of the discrete set of states, the search space for decoding has a graph structure. Uncovering the most probable path through this graph corresponds to uncovering the most probable internal state sequence, which is referred to as the decoding task. Formally, the goal of the decoding task is the determination of the state sequence \vec{s}^* maximizing the posterior probability

P(\vec{s} \mid \vec{o}, λ) = \frac{P(\vec{o}, \vec{s} \mid λ)}{P(\vec{o} \mid λ)}

for a given HMM λ and an observation sequence \vec{o}. Again, the maximization is independent of the denominator. Thus, the most probable path is defined as:

P(\vec{o}, \vec{s}^* \mid λ) = \max_{\vec{s} \in S^T} P(\vec{o}, \vec{s} \mid λ) = P^*(\vec{o} \mid λ). \quad (3.22)

Generally, the decoding task can be solved efficiently by means of Dynamic Programming techniques.
The principal procedure is similar to the calculation of the forward probabilities (cf. figure 3.12), except that the summation is replaced by a maximization. Thus, the forward probability α_t(j) is replaced by

δ_t(j) = \max_{s_1 \ldots s_{t-1}} \{ P(\vec{o}_1, \ldots, \vec{o}_t, s_1 \ldots s_t \mid λ) \mid s_t = S_j \} \quad (3.23)

designating the maximum probability of generating the partial observation sequence \vec{o}_1, \ldots, \vec{o}_t while selecting the most probable partial path ending in s_t = S_j. Simultaneously, a traceback matrix ψ = [ψ_t(j)] is created for the extraction of the most probable global state sequence, which can be performed after the last observation. For classification purposes, the probability P^*(\vec{o} \mid λ) = P(\vec{o}, \vec{s}^* \mid λ) of generating the observation sequence \vec{o} while selecting the most probable state sequence \vec{s}^* is used analogously to the generation probability defined before (cf. figure 3.12). However, both values usually differ, because (partial) paths alternative to the Viterbi path may also contribute to the final generation probability, but they are considered only in the Forward- and Backward-algorithms. Since these differences are small and the contribution of the most probable path dominates the general generation probability, in practical applications the classification accuracy does not suffer from these minor differences. This efficient procedure for uncovering the most probable internal state sequence is usually referred to as the Viterbi-algorithm according to its inventor [Vit67]. In figure 3.13 the complete algorithm is summarized.

1. Initialization: t = 1, ∀j = 1, …, N:
   δ_1(j) = π_j b_j(\vec{o}_1),   ψ_1(j) = 0
2. Recursion: ∀t = 1, …, T−1, ∀j = 1, …, N:
   δ_{t+1}(j) = \max_i \{ δ_t(i) a_{ij} \} \, b_j(\vec{o}_{t+1}),   ψ_{t+1}(j) = \operatorname{argmax}_i \{ δ_t(i) a_{ij} \}
3. Termination:
   P^*(\vec{o} \mid λ) = P(\vec{o}, \vec{s}^* \mid λ) = \max_i δ_T(i),   \vec{s}^*_T = \operatorname{argmax}_i δ_T(i)
4. Traceback: ∀t = T−1, …, 1:
   \vec{s}^*_t = ψ_{t+1}(\vec{s}^*_{t+1})

Figure 3.13: Viterbi-algorithm for efficiently estimating the most probable internal state sequence of a Hidden Markov Model generating the observation sequence. During the recursion a traceback matrix is created, which is evaluated for the actual extraction of the most probable path after the final observation.
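The four steps of figure 3.13 can be sketched compactly in NumPy; the function name and the small two-state model are my own, chosen only for illustration:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi-algorithm (figure 3.13): most probable path and P*(o | lambda)."""
    T, N = len(obs), len(pi)
    delta = pi * B[:, obs[0]]                      # 1. initialization
    psi = np.zeros((T, N), dtype=int)              # traceback matrix
    for t in range(1, T):                          # 2. recursion
        scores = delta[:, None] * A                # delta_t(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta.argmax())]                   # 3. termination
    for t in range(T - 1, 0, -1):                  # 4. traceback
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta.max())

# Hypothetical two-state model with two emission symbols:
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
path, p_star = viterbi(pi, A, B, [0, 1, 0])
```

As discussed above, P^*(\vec{o} \mid λ) obtained here never exceeds the full generation probability of the Forward-algorithm, since the latter additionally sums over all alternative paths.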

Parameter Estimation – Baum-Welch Training: In the previous sections, two general methods for evaluating HMMs were discussed. Both of them implicitly assumed well-established models optimized for the best possible representation of the statistical properties of data belonging to particular classes ω (words, protein families etc.). Since no analytical solution is known for optimally adjusting the parameters of an HMM to the pattern class it represents given a set of sample data, these parameters are usually estimated by means of some kind of guided training procedure. The guidance of the parameter estimation process mostly consists in the choice of a proper model topology (cf. figure 3.10) according to the assumed internal structure of the data examined. Based on this expert-made selection of the model architecture, which might be optimized in succeeding training steps,

the model parameters, namely the transition and emission probabilities, are estimated iteratively. In the following, the most prominent training procedure is briefly outlined. In addition, applications of alternative machine learning approaches to the parameter estimation task have been reported in the literature. Among others, simulated annealing techniques for gradient descent approaches [Edd95a] and combinations of vector quantization and decoding techniques resulting in the so-called segmental k-means training [Jua90] were proposed.

Formally, the task of parameter estimation can be defined as follows: Given an observation sequence \vec{o}, the model λ^* needs to be determined which maximizes the generation probability P(\vec{o} \mid λ^*):

P(\vec{o} \mid λ^*) = \max_λ \sum_{\vec{s} \in S^T} P(\vec{o}, \vec{s} \mid λ). \quad (3.24)

The final HMM parameters maximizing the generation probability as defined above are estimated iteratively, starting from a suitable initial model λ_0. The optimization progress of all such iterative training approaches is monotonic:

P(\vec{o} \mid λ') \ge P(\vec{o} \mid λ).

The actual convergence of the generation probability towards its maximum strongly depends on the initialization step. Thus, random initialization is usually unsuitable; mostly, relative frequencies counted on the training set serve as starting points for the training procedure. Normally, the difference ∆P = P(\vec{o} \mid λ') − P(\vec{o} \mid λ) of the generation probabilities estimated using the HMMs λ and λ' of two succeeding training steps is used as stop criterion. If ∆P < ε for a suitable threshold ε, it is very probable that the generation probability has reached its maximum.

In order to keep the argumentation clear, in the following explanations the treatment of a single observation sequence \vec{o} is assumed. However, for "real world" applications usually a larger set of sequences is exploited, where the parameters are estimated by averaging over the complete sample set.

The most common training procedure is the Baum-Welch algorithm, a variant of the Expectation-Maximization (EM) algorithm [Dem77], which is a method for Maximum-Likelihood parameter estimation of stochastic processes with hidden variables. Here, the optimization criterion is the general generation probability P(\vec{o} \mid λ) as defined in terms of the Forward- and Backward variables α_t(j) and β_t(j) in equation 3.21. The basic idea can be summarized in two steps:

1. The statistical parameters of a given HMM λ are replaced by improved estimates \vec{π}', A', B', which are obtained by applying the most recent model to the training set and counting the relative frequencies.

2. For the modified model λ' the generation probability P(\vec{o} \mid λ') is estimated, and by evaluating the difference ∆P = P(\vec{o} \mid λ') − P(\vec{o} \mid λ) the decision to continue or stop the iteration is made.4

4Generally, both the generation probability P (~o|λ) obtained from the Forward- or Backward-algorithm, or the state sequence specific generation probability P ∗(~o|λ) as generated by the Viterbi-algorithm can be used equivalently.


For efficiency, the Forward- and Backward variables (cf. equations 3.19 and 3.20) are used. First, the posterior probability γ_t(i, j) of a transition from S_i to S_j for a given t,

γ_t(i, j) = P(s_t = S_i, s_{t+1} = S_j \mid \vec{o}, λ) = \frac{P(s_t = S_i, s_{t+1} = S_j, \vec{o} \mid λ)}{P(\vec{o} \mid λ)} = \frac{α_t(i) a_{ij} b_j(\vec{o}_{t+1}) β_{t+1}(j)}{\sum_{j=1}^{N} α_T(j)}, \quad (3.25)

and the posterior probability of selecting a state S_i for a given t,

γ_t(i) = P(s_t = S_i \mid \vec{o}, λ) = \sum_{j=1}^{N} γ_t(i, j), \quad (3.26)

are defined. By means of these auxiliary variables, the HMM parameters can be estimated based on their preceding values (and the actual observations). The resulting definition of the start probabilities is as follows:

π'_i = P(s_1 = S_i) = γ_1(i) = \frac{α_1(i) β_1(i)}{\sum_j α_T(j)}, \quad (3.27)

whereas the transition probabilities can be estimated in the following way:

a'_{ij} = \frac{\sum_{t=1}^{T-1} P(s_t = S_i, s_{t+1} = S_j \mid \vec{o}, λ)}{\sum_{t=1}^{T-1} P(s_t = S_i \mid \vec{o}, λ)} = \frac{\sum_{t=1}^{T-1} γ_t(i, j)}{\sum_{t=1}^{T-1} γ_t(i)}. \quad (3.28)

For discrete Hidden Markov Models, the emission parameters are estimated by:

b'_{jk} = \frac{\sum_{t: o_t = O_k} P(s_t = S_j \mid \vec{o}, λ)}{\sum_{t=1}^{T} P(s_t = S_j \mid \vec{o}, λ)} = \frac{\sum_{t: o_t = O_k} γ_t(j)}{\sum_{t=1}^{T} γ_t(j)}. \quad (3.29)

Contrary to this, for continuous HMMs, where the emissions are modeled by mixture densities, both the mean vectors \vec{μ}_{jk} and the covariance matrices C_{jk} need to be estimated. As previously mentioned (cf. page 47), the weights of the mixture components can be interpreted as the output of a discrete HMM. Thus, estimates of improved weights c_{jk} can be given in a form similar to the one above. Therefore, the probability ξ_t(j, k) of selecting the k-th mixture component (M_t = k) in state S_j at a given t for the creation of a particular emission \vec{o}_t is defined as

ξ_t(j, k) = P(s_t = S_j, M_t = k \mid \vec{o}, λ) = \frac{\sum_{i=1}^{N} α_{t-1}(i) a_{ij} c_{jk} g_{jk}(\vec{o}_t) β_t(j)}{P(\vec{o} \mid λ)}. \quad (3.30)


By means of this further auxiliary variable, the mixture weights can be estimated as follows:

c'_{jk} = \frac{\sum_{t=1}^{T} P(s_t = S_j, M_t = k \mid \vec{o}, λ)}{\sum_{t=1}^{T} P(s_t = S_j \mid \vec{o}, λ)} = \frac{\sum_{t=1}^{T} ξ_t(j, k)}{\sum_{t=1}^{T} γ_t(j)}. \quad (3.31)

For the estimation of both the mean vectors and the covariance matrices, the observations \vec{o}_t are taken into account with probability ξ_t(j, k):

\vec{μ}'_{jk} = \frac{\sum_{t=1}^{T} ξ_t(j, k) \, \vec{o}_t}{\sum_{t=1}^{T} ξ_t(j, k)} \quad (3.32)

C'_{jk} = \frac{\sum_{t=1}^{T} ξ_t(j, k) (\vec{o}_t − \vec{μ}'_{jk})(\vec{o}_t − \vec{μ}'_{jk})^T}{\sum_{t=1}^{T} ξ_t(j, k)}. \quad (3.33)

For the estimation of the emission parameters for semi-continuous modeling the ξt(j, k) need to be replaced by their marginal distributions

ξ_t(k) = \sum_{j=1}^{N} ξ_t(j, k).
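For the discrete case, one complete re-estimation step (equations 3.25–3.29) can be sketched compactly. The function name and the toy model are my own, and, as in the text, a single observation sequence is assumed:

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch iteration for a discrete HMM and a single sequence."""
    obs = np.asarray(obs)
    T, N = len(obs), len(pi)
    # Forward- and Backward variables (equations 3.19 and 3.20)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_o = alpha[-1].sum()                       # P(o | lambda)
    # Posteriors gamma_t(i, j) (eq. 3.25) and gamma_t(i) (eq. 3.26)
    xi = np.array([alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])
                   for t in range(T - 1)]) / p_o
    gamma = alpha * beta / p_o
    # Re-estimation (equations 3.27 - 3.29)
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.array([gamma[obs == k].sum(axis=0)
                      for k in range(B.shape[1])]).T
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new

# Hypothetical two-state model, re-estimated on one short sequence:
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi2, A2, B2 = baum_welch_step(pi, A, B, [0, 1, 1, 0])
```

The re-estimated parameters remain properly normalized by construction; iterating this step until ∆P falls below the threshold ε yields the monotonic training procedure described above.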

Profile HMMs for Sequence Families

In the last sections, the general theory of Hidden Markov Models was summarized, including their mathematical definition and the most important algorithms for estimating and evaluating them. Originally, HMMs were proposed for processing (one-dimensional) signals evolving in time. A large variety of applications within the research field of pattern recognition based on "natural" signals can be realized using HMMs for the probabilistic modeling of the "essentials" of specific pattern classes. Certainly the most prominent example is the application of HMMs to the task of automatic speech recognition.

Generally, HMMs are the methodology of choice for the classification of observation sequences with varying lengths containing moderate changes in their actual composition. The reason is that most sequential data can be interpreted as "signals evolving in time" after a minor abstraction from "time". Although protein sequences obviously do not evolve in time, their functional dependency on the residues' positions is comparable to the temporal relations of natural signals, as previously mentioned. The most common application concept is the modeling of certain protein families using Hidden Markov Models. Here, all sequences belonging to a particular family (corresponding to the general pattern class ω) are interpreted as observations generated from a common origin. Due to the evolutionary variance, simple template matching approaches fail for sequence comparison. Contrary to this, protein family specific HMMs cover the complete statistical variance, i.e. evolutionary

divergences, in one global model, as MSAs do. If sufficient training samples are available for the creation of Profile HMMs, i.e. if the model parameters can be estimated in a statistically robust way, Profile HMMs for protein families are superior in their classification accuracy.

Basically, Profile HMMs represent an enhancement of the Profile concept discussed earlier. Their conceptual linkage can be described as follows: Besides the probabilistic generation of residues at every column of an MSA, further structural information regarding the protein family covered by the MSA is incorporated into a stochastic model. Depending on its appropriate predecessor, a column of the MSA is probabilistically "activated", i.e. the probability distributions of the emission symbols assigned to every column of the MSA are evaluated depending on the actual column and its predecessor.

In the following, several practical aspects of Profile HMM applications for protein sequence comparison tasks are discussed. First, their actual use for sequence analysis is explained, before certain modeling aspects relevant for the successful applicability of these models are outlined. These explanations will make clear how Profile HMMs are used for statistically modeling the essentials of protein families and how unknown sequences are classified (aligned) with respect to these models. The fundamental difference to pairwise sequence comparison is the alignment of query sequences to models instead of to single sequences.

Practical use of Profile HMMs: In order to illustrate the general concepts of applying Profile HMMs to protein sequence comparison, their use for the initially mentioned example task of drug discovery, namely target identification, is considered (cf. section 2.4.1). Generally, two main categories of applications within the task of sequence analysis can be distinguished, directly corresponding to the application concepts of Profile HMMs:

1. Target search (screening): The major impact of the paradigm shift in molecular biology research towards the application of computational methods is the change from very specific investigations focused on previous expertise to broader analyses of complete organisms. In the early stages of the drug design pipeline, as illustrated in figure 2.7 on page 18, complete genomes are scanned for putative targets. Usually, this process is called screening and is performed with respect to distinct protein families of interest (e.g. those that are therapeutically relevant) in a high-throughput manner, i.e. highly automated for large databases.

Speaking more technically with respect to the current context, Profile HMMs are created using initial sample sequences of the particular protein family obtained elsewhere and are afterwards applied to the target search process. Here, every sequence of the database (e.g. all proteins corresponding to the genome of interest) is separately aligned to the model, and the sequences producing significant scores are interpreted as search hits belonging to the protein family the Profile HMM represents. Within the formalism of Hidden Markov Models, the scores are calculated by estimating the generation probability P(\vec{o} \mid λ) (Forward-algorithm), optionally with respect to the most probable state sequence, P(\vec{o}, \vec{s}^* \mid λ) (Viterbi-algorithm). Applying the Viterbi-algorithm additionally reveals structural details of the sequences analyzed, e.g. the correct segmentation of protein domains.


The probability of a target hit is derived from the scores created by the appropriate Profile HMMs. As usual in pairwise sequence comparison, log-odds scores are calculated, comparing the likelihoods of the target model and of some kind of background (or random) model. By means of statistical tests as described in section 3.1.1, the significance of a target hit or miss can be estimated. The actual selection of the background model is rather crucial, and the quality of database search results obtained using Profile HMMs strongly depends on it. Several strategies exist for the actual choice of the background model as well as for regularizations of the Profile HMMs themselves. They will be discussed later in this chapter (cf. pages 59f).

2. Sequence classification (annotation): The second concept of Profile HMM application is the classification of sequences regarding multiple protein families. This task directly corresponds to classical pattern recognition applications. As in automatic speech recognition, a limited set of classes (i.e. protein families) exists, and a complete database needs to be annotated regarding this inventory. Contrary to the screening task described above, here family models are used for scoring in a competitive way (putatively enhanced by a general or background model capturing all sequences not belonging to any protein family covered by the set of Profile HMMs). Usually, within the drug discovery process, sequence classification as described here is performed for target validation. The scoring process itself is similar to the one described for target search procedures. Sequences are aligned to all Profile HMMs, and the model yielding the best score (Viterbi- or Forward-probability) determines the actual decision regarding putative family affiliation. Contrary to e.g. speech recognition applications, the classification of sequences regarding protein family memberships is presently performed sequentially, i.e. every query sequence is aligned serially to every Profile HMM.
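The competitive decision rule can be sketched as follows. For brevity, each family model is reduced here to a match-only chain of position-specific emission distributions — a strong simplification of a real Profile HMM; all names and profiles are illustrative:

```python
import math

def profile_log_score(sequence, profile):
    """Log-likelihood of a sequence under a match-only profile:
    profile[i][a] is the emission probability of residue a at position i.
    Sequences whose length differs from the profile are not handled here."""
    return sum(math.log(profile[i][a]) for i, a in enumerate(sequence))

def classify(sequence, family_profiles):
    """Competitive scoring: score the query against every family model
    and pick the family with the best score (the argmax decision rule)."""
    scores = {name: profile_log_score(sequence, prof)
              for name, prof in family_profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

A real system would replace `profile_log_score` by a full Viterbi- or Forward-alignment to each Profile HMM; the argmax decision over the per-model scores is the same.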

Emission type: The fundamental question to be answered for determining the function of a protein is how its three-dimensional structure is organized. Unfortunately, the general problem has not been solved so far. However, an accurate alignment of unknown sequences to sequences whose functions are already known provides a wealth of information for further analysis. Thus, most sequence analysis approaches are based on raw amino-acid data, i.e. string processing. Profile HMMs can be interpreted as stochastic derivatives of MSAs. On these premises, at present discrete Hidden Markov Models are used exclusively for modeling protein families. In fact, to the author's knowledge there is no literature available addressing alternative emission modeling. The discrete symbol set of Profile HMMs consists of the 20 standard amino acids as described in section 2.2.1. Thus, every state of a Profile HMM contains discrete probability distributions for these symbols.

Modeling: So far the application concepts and emission details of Profile HMMs were discussed, which gave an overview of the principles of model evaluation for protein sequence analysis. The existence of proper models stochastically capturing the essentials of the particular protein families was implicitly assumed.

56 3.2 Analysis of Sequence Families

However, before query sequences can be aligned to protein family models, these Profile HMMs need to be established, which is a rather crucial process because the quality of the models directly influences the quality of the alignments and thus the classification performance itself. As stated above, Profile HMMs are a generalization of the Profile concept. Profile HMMs stochastically represent multiple sequence alignments. Thus, the design of the model architecture strongly follows the concepts of MSA. The conserved parts of an MSA (consensus), i.e. those columns containing fewer gaps than a given threshold ϑ (usually ϑ is set to 50 or 75 percent), are taken as the base of a Profile HMM. In fact, every column of the consensus string belonging to a particular protein family corresponds to an ordinary HMM state including its (state-specific) probability distribution of the emission symbols. Initially, the emission distribution is obtained by counting the frequencies of the amino acids at the particular column of the MSA. The consensus based states are directly connected in a linear chain. Since such a chain represents the conserved parts of a sequence family that most members share and thus match during alignment in terms of DP, these states are called Match states. Usually, the number of Match states determines the length of the complete Profile HMM. The consensus of an MSA and thus the chain of Match states only represents sequence parts common to the majority of family members. In order to achieve more flexibility, i.e. to generate high alignment scores for sequences belonging to a particular family but not sharing all consensus parts, like remote homologues, special states are included in the Profile HMM architecture. For every Match state two specialized states are added. Together, these three states are subsumed in so-called nodes with high connectivity between states of the same node and between different nodes for flexibility.
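The consensus-column selection and the initial emission estimates described above can be sketched as follows; the gap character, the threshold value, and the function names are illustrative assumptions:

```python
def match_columns(msa, gap_threshold=0.5, gap_char="-"):
    """Select consensus (Match) columns of an MSA: a column becomes a
    Match state if its fraction of gaps stays below the threshold."""
    n_rows = len(msa)
    columns = []
    for c in range(len(msa[0])):
        gaps = sum(1 for row in msa if row[c] == gap_char)
        if gaps / n_rows < gap_threshold:
            columns.append(c)
    return columns

def emission_frequencies(msa, column, gap_char="-"):
    """Initial Match-state emissions: relative amino acid frequencies
    observed in the selected MSA column (gaps ignored)."""
    residues = [row[column] for row in msa if row[column] != gap_char]
    return {a: residues.count(a) / len(residues) for a in set(residues)}
```

In practice these raw frequencies would subsequently be regularized (see the discussion of pseudocounts later in this section) rather than used directly.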
For obvious reasons this kind of model topology (which is the base for almost all present Profile HMM approaches) is called the three-state architecture. First, insertions during the alignment of a query sequence to a Profile HMM need to be managed. This is performed using Insert states. Since residues are actually added during insertions, Insert states are in principle of the same type as Match states. However, due to the possibility of inserting sequence parts of generally arbitrary length, these states contain self-transitions. By means of the probability of a self-transition the length of the insertion is stochastically adjusted. Second, deletions might occur during alignment. Consequently, in the Profile HMM architecture separate Delete states are incorporated for every node. Contrary to Match and Insert states, these states are "silent", i.e. no emissions are generated. The connectivity of Profile HMMs is almost comparable to fully connected Left-Right topologies (cf. figure 3.10). This is due to the fact that highly diverging sequences need to be captured by the protein family models. Contrary to e.g. speech recognition applications where small semantic units are modeled, Profile HMMs are mostly very large, often consisting of several hundreds of states. Due to the high connectivity rate (especially caused by the Delete states allowing transitions from every state to all its successors) the computational effort for model evaluation is rather high. The general model architecture of Profile HMMs consisting of three different kinds of states is shown in figure 3.14. For every column in the consensus, a node consisting of a Match (squares), an Insert (diamonds), and a Delete (circles) state is drawn. Exemplarily for Match states, the emission probability distributions are shown below the states. Principally,



Figure 3.14: Common Profile HMM architecture based on three different kinds of states: Match (squares) for conserved columns of an MSA, Insert (diamonds) for insertions into, and Delete (circles) for deletions from the consensus sequence. Delete states are silent whereas Match and Insert states produce emission symbols according to state-specific probability distributions (omitted for better readability for Insert states). For consistency with the standard literature (cf. e.g. [Dur98]), the silent Begin and End states are shown as (small) squares although non-emitting states are represented by circles.

Insert states also contain such distributions but for better readability they are omitted here. The rather complex connectivity of the model architecture is illustrated by arrows, each representing transition probabilities. The silent Begin and End states exist just for simplicity of the model management (all alignments principally start in Begin and end in End). In addition to the basic three-state architecture, several refinements of the model topology were proposed. Generally, all of them are enhancements of the basic architecture, usually introducing special states and transitions allowing local alignments of the sub-model to parts of larger sequences. The most notable enhancement is the so-called Plan7 architecture, which incorporates several "garbage" states capturing all observations not explicitly modeled by the Profile HMM. In figure 3.15 this architecture is illustrated. Most Profile HMMs are established by exploiting the result of a preceding multiple sequence alignment. Here, the model initialization is straightforward due to MSA analysis. Furthermore, modeling approaches not requiring preceding MSAs are described in the literature [Kro94a]. Based on several heuristics, the model length and thus the final architecture



Figure 3.15: Refinement of the three-state Profile HMM architecture: Plan7. In the middle of the sketch the general Profile HMM architecture is drawn whereas at the borders flanking states and transitions for arbitrary alignment positions within larger sequences and “garbage” collection are shown.

is automatically learned in an iterative process by dynamically inserting and deleting nodes as necessary. The process is initialized by taking the first sequence of the training set as the (trivial) initial model. All following sequences are aligned to this model and, depending on the actual use of the Insert and Delete states belonging to a certain node, this node is removed or a new one is created. The procedure converges rather quickly, resulting in powerful Profile HMMs.

Model regularization: When applying Profile HMMs, query sequences are aligned to stochastic models established using machine learning approaches. It is characteristic of such techniques to learn the models' parameters from representative training sets. On the one hand this procedure is very smart because models are created data-driven, i.e. without major external expertise regarding the data. In fact, this is the major reason for the superior performance of sequence analysis approaches using probabilistic models – if sufficient numbers of suitable training samples are available. Unfortunately, it is this advantage which turns into a major drawback if not enough training material is available. Without any further restrictions, all model parameters, i.e. transition and emission probabilities, to which no training samples were assigned are fixed to zero, which implies extremely poor generalization abilities for the resulting Profile HMM. Due to the complicated model architecture of Profile HMMs incorporating vast amounts of states, weak or even missing estimates are not the exceptional case. In order to overcome this so-called sparse data problem, several approaches were proposed for regularizing Profile models. Most of them are focused on the incorporation of prior knowledge into the probabilistic framework of Hidden Markov Models by means of some kind of pseudocounts, i.e. adding artificial counts for the appropriate parameters. Generally, the problem exists for both transition and emission probabilities, but the emission probabilities are more important for the actual scoring. Kevin Karplus already compared several different regularizers for estimating distributions of amino acids in 1995 [Kar95]. The simplest way of avoiding zero probabilities is to add small, fixed, positive zero-offsets to each count of amino acids, which is usually called


Laplacian's rule. Although this method is very simple, and thus seems attractive, the results are rather poor because really small probabilities are systematically overestimated. The suggested refinement of adjusting the offset values more specifically, e.g. by evaluating scoring matrices in dependency of the model parameters and the data observed, is only an option for special cases because again a lot of example data is required. At present, the most promising method for regularizing Profile HMMs is the modeling of prior knowledge using mixtures of Dirichlet distributions [Bro93, Sjö96]. For certain alignment environments specific Dirichlet distributions are defined, and the evaluation of the mixture of these Dirichlets delivers suitable pseudocounts for smoothing the amino acid probability distribution. Since Dirichlet distributions D are the simplest form of parametric, multinomial discrete distributions they are a "natural choice" [Dur98, p.302] for probability parameters to use as prior distributions (cf. figure 3.16 for a more detailed motivation). A Dirichlet distribution is a distribution over the set of all probability vectors \vec{p} (i.e., p_i \geq 0 and \sum_i p_i = 1). For proteins p_i = P(\text{amino acid } i), and K = 20:

D(\vec{p}\,|\,\vec{\alpha}) = \frac{1}{Z} \prod_{i=1}^{K} p_i^{\alpha_i - 1}.

Here, \vec{\alpha} = (\alpha_1, \ldots, \alpha_K) with \alpha_i > 0 are constant parameters specifying a single distribution. Z is a normalizing factor which can be expressed in terms of the gamma function. Further mathematical details regarding the definition of Dirichlet distributions and their statistical properties, which are beyond the scope of this thesis, are explained in e.g. [Joh72], and Dan Geiger discusses several theoretical aspects regarding Dirichlet distributions within general machine learning approaches in [Gei96]. For regularizing amino acid distributions, several sets of \vec{\alpha}^e are defined, usually corresponding to different types of alignment environments e.
By means of these (mostly manual) choices of \vec{\alpha} the prior knowledge for avoiding zero probabilities is incorporated. Given counts c_{ja} for a state j and amino acids a, the likelihood of a particular prior distribution D_e is estimated based on how well it fits the observed data. Both estimations are combined according to the posterior probabilities, yielding the estimate of the emission parameter to be regularized [Dur98, p.117]:

b_j(a) = \sum_e P(e\,|\,\vec{c}_j)\, \frac{c_{ja} + \alpha_a^e}{\sum_{a'} (c_{ja'} + \alpha_{a'}^e)}

where P(e\,|\,\vec{c}_j) are the posterior mixture coefficients calculated by Bayes' rule:

P(e\,|\,\vec{c}_j) = \frac{p_e\, P(\vec{c}_j\,|\,e)}{\sum_{e'} p_{e'}\, P(\vec{c}_j\,|\,e')}

and p_e are the prior probabilities of each mixture component D_e. P(\vec{c}_j\,|\,e) is the probability of the data according to the e-th Dirichlet mixture component:

P(\vec{c}_j\,|\,e) = \frac{\left(\sum_a c_{ja}\right)!}{\prod_a c_{ja}!} \cdot \frac{\prod_a \Gamma(c_{ja} + \alpha_a^e)}{\Gamma\!\left(\sum_a (c_{ja} + \alpha_a^e)\right)} \cdot \frac{\Gamma\!\left(\sum_a \alpha_a^e\right)}{\prod_a \Gamma(\alpha_a^e)}

with \Gamma(x) representing the standard gamma function. Dirichlet mixtures can be adjusted very finely, thus generating high quality multinomial distributions, and it could be shown that good Profile HMMs can be created with small training sets [Sjö96].
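The pseudocount computation can be sketched as follows; the Dirichlet-multinomial likelihood is evaluated in log-space via the log-gamma function. The mixture components and weights used in the sketch are toy values, not one of the published Dirichlet mixtures:

```python
import math
from math import lgamma

def log_dirichlet_multinomial(counts, alpha):
    """log P(counts | component): marginal likelihood of an observed
    count vector under a single Dirichlet prior with parameters alpha."""
    n, a = sum(counts), sum(alpha)
    result = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    result += lgamma(a) - lgamma(n + a)
    result += sum(lgamma(c + al) - lgamma(al) for c, al in zip(counts, alpha))
    return result

def regularized_emissions(counts, mixture):
    """Mixture-of-Dirichlets regularization: mixture is a list of
    (prior_weight, alpha_vector) pairs; returns smoothed probabilities."""
    # Posterior responsibility of each component given the counts (Bayes' rule).
    log_post = [math.log(p) + log_dirichlet_multinomial(counts, alpha)
                for p, alpha in mixture]
    m = max(log_post)
    weights = [math.exp(lp - m) for lp in log_post]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Blend the per-component pseudocount estimates.
    probs = [0.0] * len(counts)
    for w, (_, alpha) in zip(weights, mixture):
        denom = sum(counts) + sum(alpha)
        for i in range(len(counts)):
            probs[i] += w * (counts[i] + alpha[i]) / denom
    return probs
```

Even for a count vector with zeros, all resulting emission probabilities stay strictly positive, which is exactly the purpose of the regularization.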

Background model: The major difficulty in screening applications is the distinction between matches and mismatches for alignments of query sequences to a particular Profile HMM. As already introduced for the general pairwise sequence analysis approach (cf. page 31f.), the raw alignment scores are substituted by their likelihood ratio according to a proper null model, i.e. a background model which captures the random hypothesis. Generally, the choice of a suitable background model is an issue for Profile HMMs, too. Alignment scores, which are here generation probabilities of Profile HMMs, are strongly dependent on the length of the query sequences. Thus, besides generally distinguishing model matches from random alignments, some kind of length normalization needs to be provided. Once a proper background model is found, log-odds ratios can be used for simple threshold based discrimination. Several choices of null models were proposed for Profile HMMs. Technically, the generation probabilities are divided by the background distribution of the observation symbols as defined by the particular null model R (cf. section 3.1.1). Thus the final log-odds scores S(~o,λ) for aligning a sequence of observations ~o of length T to a model λ are principally defined as follows:

S(\vec{o},\lambda) = \log \frac{P(\vec{o}\,|\,\lambda)}{P(\vec{o}\,|\,R)} = \log \frac{P(\vec{o}\,|\,\lambda)}{\prod_{t=1}^{T} P(o_t)}.

The simplest choice of R is a uniform background distribution of all possible observation symbols, i.e. P(o_t) = 1/K where K represents the number of different possible observations (20 when using plain amino acid symbols). Unfortunately, this simple choice of a background model did not prove very robust for practical applications because, independent of the actual emissions, a uniform background distribution is a rather artificial assumption. Christian Barrett and colleagues summarized a variety of further null models in [Bar97].
Besides the flat distribution, where each amino acid is assumed equally likely as described above, they investigated the following choices of null models based on real experiments with Globins, EF-hands, and Ferredoxins:

• The background distribution of amino acids over all proteins,

• The amino acid frequencies of the appropriate training set,

• The average amino acid frequencies of the training set limited to the match states,

• The normalized geometric mean of match state probabilities of the amino acids,

• The amino acid frequencies of the scored sequence, and

• A complex null model based on the trained model’s transitions multiplied with the geometric mean of its match states.
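Two of these choices — the flat distribution and the composition of the scored sequence — can be sketched as follows; the function names are illustrative and the model score is assumed to be given (e.g. by a Forward-alignment):

```python
import math

def log_odds(log_model_score, sequence, null_probs):
    """S(o, lambda) = log P(o|model) - sum_t log P(o_t | null)."""
    return log_model_score - sum(math.log(null_probs[a]) for a in sequence)

def flat_null(alphabet):
    """Flat null model: every symbol is equally likely."""
    p = 1.0 / len(alphabet)
    return {a: p for a in alphabet}

def composition_null(sequence):
    """Null model derived from the composition of the scored sequence itself."""
    n = len(sequence)
    return {a: sequence.count(a) / n for a in set(sequence)}
```

Note how the null term grows with the sequence length T, which provides the length normalization discussed above.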


The most common procedure for estimating a statistical model M, i.e. its parameter vector \vec{\theta}, using a given sample set D, is the so-called Maximum Likelihood (ML) estimation. Given the data vectors of D, the corresponding optimal parameter set is obtained by maximizing the likelihood function P(D\,|\,\vec{\theta}) = P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T\,|\,\vec{\theta}) (for details regarding the actual maximization process cf. e.g. [Fin03, p.49ff]):

\vec{\theta}^{\,*} = \operatorname{argmax}_{\vec{\theta}} P(D\,|\,\vec{\theta}).

Especially when estimating the parameters of a multinomial distribution (like the distribution of the 20 standard amino acids) the actual ML estimation mostly implies counting relative frequencies. Here, the basic problem is the treatment of so-called zero frequencies, i.e. which values to assign to components of \vec{\theta} where no corresponding data could be observed. The simplest solution is to "trust" the data observed and assign zero to the appropriate parameters. However, it is rather unrealistic to assume hard zero probabilities only because the appropriate samples did not occur within the sample set used for model estimation. Instead, mostly certain prior knowledge regarding the model parameters exists which could be incorporated into the model estimation process. One possible solution is the incorporation of prior knowledge using Bayes' rule:

P(\vec{\theta}\,|\,D) = \frac{P(D\,|\,\vec{\theta})\, P(\vec{\theta})}{P(D)} = \frac{P(D\,|\,\vec{\theta})\, P(\vec{\theta})}{\int_{\vec{\theta}} P(D\,|\,\vec{\theta})\, P(\vec{\theta})\, d\vec{\theta}},

which implies a posterior estimation of the model parameters given D and some kind of prior knowledge regarding \vec{\theta}. The basic question to be answered here is how to model this prior knowledge. The most convenient solution is the use of prior distributions which are generally similar to the distributions to be modeled – conjugate distributions. The distribution P(\vec{\theta}) (the prior knowledge) is called the conjugate distribution for P(D\,|\,\vec{\theta}) if P(\vec{\theta}\,|\,D) (the posterior distribution) and P(\vec{\theta}) are of the same kind, i.e. if they belong to the same family of distributions. This means that if a likelihood function is given and the appropriate conjugate distribution is known, the corresponding posterior distribution is known to be of the same kind. Thus, closed forms can directly be given and calculated, which extremely simplifies the general problem of prior knowledge incorporation. In fact, Dirichlet distributions are conjugates for multinomial distributions. Thus, if prior knowledge is modeled using a Dirichlet distribution with parameter vector \vec{\alpha}, the posterior P(\vec{\theta}\,|\,D) is a Dirichlet, too. It can easily be shown that the parameters of the posterior distribution are \vec{\alpha}' = \vec{\alpha} + \vec{n}, where \vec{n} denotes the observed counts, which implies very easy model estimation.

Figure 3.16: General motivation for using Dirichlet distributions for model parameterization.


They summed up with the conclusion that family specific null models seem to perform better than global models. Especially for remote homology detection the complex null model can improve the discrimination abilities of Profile HMMs. In a consecutive analysis, they found that ". . . when scoring databases with an HMM built for [. . . ] families, sequences that are compositionally biased towards [more rarely seen residues like cysteine] tend to receive inflated scores and become false positives [i.e. false classifications]" [Kar98]. Consequently, they propose to use the score of the alignment of the reversed sequence as the null probability, because the reversed sequence has the same length and composition as the sequence itself, which eliminates the bias mentioned above.

Motif based Models: By means of Profile HMMs usually complete protein families or superfamilies, mostly based on domains, are modeled. The more abstract the functional relationships of the member sequences are, the weaker are their similarities at the sequence level. Compared to traditional (pairwise) sequence comparison techniques the classification performance of Profile HMMs is still acceptable for such highly diverging sequences. This is due to the probabilistic modeling of the essentials of the appropriate protein families. However, in order to capture these essentials of the sequence families, rather complex model architectures allowing as much flexibility as possible are required (cf. figures 3.14 and 3.15, respectively). The price for this flexibility is really high because the parameters for O(3N) states need to be estimated, where N designates the length of the consensus sequence. Generally, for robust modeling this implies large amounts of training samples. Even the use of sophisticated regularization techniques as described earlier does not solve the problem in general, it only alleviates the symptoms. The sparse data problem for modeling robust Profile HMMs for protein families is generally well-known and principally the same as for modeling robust Hidden Markov Models for automatic speech recognition. Here, the apparent modeling base would be to establish HMMs for every word of a given lexicon. Unfortunately, especially for complex languages containing large numbers of different words, the training material is sufficient only in a few cases. Thus, almost all state-of-the-art systems for automatic speech recognition are based on HMMs for much smaller units – so-called triphones, i.e. phonemes including their surrounding (generalized) neighbor phonemes (cf. e.g. [Hua01, pp. 430ff]). The conceptual equivalent of triphones in protein sequence analysis applications is the so-called motif.
A motif is the smallest contiguous, conserved unit of e.g. a multiple alignment of related sequences. Several motif finding approaches were proposed in the literature (cf. e.g. [Hud99] for an overview) but so far only one method was described which directly exploits the results of a motif finding algorithm for establishing HMMs. William Grundy and colleagues developed both the Multiple Expectation Maximization for Motif Elicitation (MEME) algorithm for discovering motifs shared by a set of sequences using the EM-algorithm [Bai95] and the Meta-MEME system which establishes HMMs for the motifs obtained using MEME [Gru97]. By means of the Motif Alignment & Search Tool (MAST) these motif based HMMs are used for protein sequence analysis [Bai98]. Since the motifs found by MEME are highly conserved throughout most of the sequences analyzed, in both length and composition, the resulting model architecture is less complex than the general Profile HMM topology. The motif based models include the chain of Match

states which is already known from Profile HMMs, with transition probabilities of 1.0 between them, and single Insert states at the end of each motif model allowing for residues not captured by the models. In figure 3.17 the resulting model architecture is illustrated with respect to the original Profile HMM topology, whose additional states and transitions, which are skipped by the Meta-MEME models, are drawn in grey.

Figure 3.17: HMM topology as defined in the Meta-MEME system: Compared to the standard three-state Profile HMM architecture (grey and dark parts) only the darker states and transitions are defined (adopted from [Gru97]).

According to the authors of the Meta-MEME system, the concatenation of motif based HMMs instead of using conventional Profile HMMs leads to superior classification results, especially for smaller training sets. In [Gru97] they present significant improvements for the classification of 4Fe-4S ferredoxins. However, the only motif based modeling approach using HMMs contains several heuristics and weak points which seem to prevent the good result mentioned above from generalizing. Especially the isolated motif search and modeling procedures seem to be critical since the (very simple) model architecture is not optimized depending on the specifics of the appropriate protein families. Furthermore, the motif finding approach needs to be parameterized using substantial prior knowledge, such as the number of occurrences of single motifs, the range of their lengths etc. In summary, the present principle of modeling smaller parts is promising, but several constraints let doubts arise regarding a general improvement of remote homology detection.
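The ordered matching of motif models against a query can be approximated by a greedy scan with position-specific scores, as sketched below. This is a strong simplification of the actual MAST scoring (no insert-state length modeling, no probabilistic treatment of the gaps between motifs); all names and score values are illustrative:

```python
import math

def pssm_score(window, pssm):
    """Score of a sequence window under a motif score matrix:
    pssm[i][a] is the score of residue a at motif position i."""
    return sum(pssm[i][a] for i, a in enumerate(window))

def ordered_motif_scan(sequence, motifs):
    """Greedy sketch of ordered motif matching: for each motif in turn,
    take its best-scoring occurrence to the right of the previous hit."""
    total, start = 0.0, 0
    for pssm in motifs:
        w = len(pssm)
        best, best_pos = -math.inf, None
        for pos in range(start, len(sequence) - w + 1):
            s = pssm_score(sequence[pos:pos + w], pssm)
            if s > best:
                best, best_pos = s, pos
        if best_pos is None:
            return -math.inf  # the motif can no longer be placed
        total += best
        start = best_pos + w
    return total
```

The greedy placement preserves the linear motif order enforced by the Meta-MEME topology, but unlike a proper Viterbi alignment it does not guarantee the globally optimal joint placement of all motifs.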

Summary: Hidden Markov Models are presently the methodology of choice for the classification of signals evolving in time containing both length and content variance. Especially for the automatic recognition of spoken language or handwritten script, HMMs are the dominating concept. The analyzed data is generally interpreted as the result of a two-stage stochastic process represented by a generating finite automaton. In the first step one state out of a finite set is probabilistically selected dependent on its immediate predecessor. By evaluating state specific probability distributions the emission (symbol) is generated. In the last decade, a variant of Hidden Markov Models relevant for protein sequence analysis tasks has been established – Profile HMMs. Based on a generalization of the Profile approach, which assigns position-specific scoring matrices to a multiple sequence alignment, stochastic models for protein families of interest are established. The classification of a query sequence is performed by aligning it to the appropriate Profile HMM.


Presently, two major toolkits for protein sequence analysis using Profile HMMs exist: SAM [Hug96] and HMMER [Edd01]. Profile HMMs created by both systems are generally based on the classical three-state model architecture consisting of Match, Insert, and Delete states. Traditional pairwise sequence alignment techniques are outperformed by Profile HMMs for remote homology detection if sufficient numbers of training samples are available. However, the general problem of remote homology detection is not solved even by means of the most powerful Profile HMMs.

3.2.3 Further Probabilistic Modeling Approaches

Compared to conventional pairwise sequence analysis approaches as described in section 3.1, probabilistic approaches are currently the methodology of choice especially for remote homology detection tasks. Their superior classification performance is due to the fact that by means of stochastic models more "fuzzy" search procedures become possible, since uncertainty is explicitly integrated into the classification framework, which is advantageous for highly diverging but related sequence data. In the last sections, the most promising probabilistic models of protein sequence families were discussed – Profile Hidden Markov Models. Actually this thesis is directed to the analysis and improvement of HMM based sequence analysis approaches for remote homology detection. However, in addition to this powerful stochastic modeling technique, several alternative approaches for probabilistic sequence analysis were developed in the machine learning community. In the following, a selection of alternative machine learning based sequence comparison approaches is presented. Here, the focus is on a general overview of applications, thus there is neither any claim for completeness nor for exhaustive explanations of all details. Instead, the interested reader is referred to the excellent monograph of Pierre Baldi and Søren Brunak [Bal01]. Although the selection presented here is certainly more or less subjective, it covers the most promising state-of-the-art alternatives to Profile HMMs.

Neural Networks

The application of artificial neural networks to the task of biological sequence comparison has a fairly long history. Beginning with applying the perceptron to the prediction of ribosome binding sites based on amino acid sequence input in 1982 [Sto82], up to present structure prediction approaches utilizing sophisticated model architectures (cf. e.g. [Wu00] for a review of current techniques and applications), they are the most prominent probabilistic alternative to Profile HMMs for modeling sequence families. Generally, neural networks can be described as parallel computational models consisting of densely interconnected adaptive processing elements called neurons. They can also be viewed as one broad class of parameterized graphical models representing a directed graph where the connection between two neurons i and j is weighted by w_{ij}. Usually, the basic units of neural networks are organized in multiple hierarchical layered architectures consisting of specific input and output layers, whose neurons are visible and represent the interface of the network to the outer world, and internal layers containing hidden neurons. Artificial neural networks are applied to various classification tasks as a general function approximation technique. Therefore, data is presented to the input layer and the classification result can be extracted from the membership probabilities of the N output neurons representing N classes. The membership probabilities are implicitly calculated by the neural network by means of the level of activation of all neurons. The activity of a single neuron i depends on its signal input, which originates from the output of its connected neighbor neurons j, amplified or attenuated by the weights w_{ij}. The actual level of activation y_i is determined by the transfer function f_i of the particular neuron, resulting in the following activation rule:

y_i = f_i(h_i) \quad \text{where} \quad h_i = \sum_{j \neq i} w_{ij}\, y_j.

Usually, the transfer function is of a non-linear, sigmoid type. Common examples are the logistic function, tanh, or arctan. For specialization regarding certain classification tasks of signal data, the weights between neurons are adjusted in a training step. Among other possibilities, this is mostly realized by minimizing an error function for the given training samples. Within the Bayesian framework (cf. HMMs) this corresponds to the usual procedure of model fitting and parameter estimation. Especially the existence of powerful training techniques like the widely used Backpropagation algorithm (cf. e.g. [Zel97]) is a good argument for their broad acceptance for probabilistic classification tasks. The most important and critical decision which needs to be taken when artificial neural networks are applied to a particular pattern classification task is the choice of a suitable representation of the data analyzed. Usually, in a preprocessing step related regions of the analyzed signal are subsumed in so-called receptive fields which are presented to the network. For the comparison of sequences with varying lengths, adjacent residues within a local context window are usually subsumed in the abovementioned preprocessing step. The length of the window is highly application specific and usually subsequent windows overlap significantly. Since only numerical values are processed during function approximation, all input data needs to be encoded in a proper numerical representation. For sequence analysis tasks, several approaches were developed:

• Amino acids are represented by 20-dimensional orthogonal vectors, which means e.g. $(1, 0, 0, \ldots, 0)^T$ for Alanine, $(0, 1, 0, 0, \ldots, 0)^T$ for Cysteine etc.

• Amino acids are grouped according to some biochemical criterion (hydrophobicity, polarity etc.) resulting in usually less than 20 new symbolic representations which are encoded using orthogonal vectors as described above.

• Based on the local context windows analyzed, some higher order statistics like the relative frequencies of certain n-mers, i.e. the term frequencies of n-gram terms, are treated as input data (cf. e.g. [Wan01]).

• Amino acids are directly mapped to numerical values by means of encoding schemes measuring certain biochemical properties based on practical expertise.
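The first encoding above, orthogonal 20-dimensional vectors, can be sketched as follows; the alphabetically sorted residue ordering is one arbitrary but fixed choice:

```python
# One-letter codes of the 20 amino acids in a fixed (here alphabetical) order
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    # 20-dimensional orthogonal vector: Alanine -> (1, 0, ..., 0),
    # Cysteine -> (0, 1, 0, ..., 0), etc.
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(residue)] = 1
    return vec
```

Grouped encodings (second bullet) work the same way, only with a smaller symbol inventory.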

According to the facts discussed above, the typical procedure for sequence analysis using neural networks can be summarized as follows:

66 3.2 Analysis of Sequence Families

1. Amino acid sequences are mapped to a proper numerical representation. In addition, a certain abstraction is included since averaged values of local context windows are used instead of raw sequence data. As a general design issue, the output representation, i.e. how to obtain the classification result, needs to be defined.

2. Depending on the actual application a suitable model architecture is chosen for the desired neural network. Usually, the general characteristics of the network are fixed here (e.g. time-delay neural networks for explicit consideration of the position dependency of the residues vs. "simple" feed-forward networks) and the number of layers and neurons is determined (or learned).

3. Once the model architecture and the input and output representations are determined, the model is trained using representative sample sets. The training method used mostly depends on the actual model type.

4. After the model building and training stages, the neural network can be used for the classification of unknown sequences regarding the protein families which the model captures. Therefore, all query sequences are converted into the numerical representation (chosen in step 1) and the data is presented to the model. The activity levels of the output neurons determine the actual classification decision.

Artificial neural networks are very powerful especially for highly diverging but related sequences. Generally, by means of more or less complex model architectures almost arbitrary functions can be approximated, which is advantageous for remote homology detection. The major drawback preventing general applicability to sequence analysis tasks is their substantial demand for training samples. Additionally, rather complicated model types with complex topologies are required for processing unsegmented data. In figure 3.18 the sequence analysis approach using neural networks is illustrated.
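The window-based preprocessing of step 1 — overlapping local context windows whose values are averaged — can be sketched as follows; the window width and step size are hypothetical, application-specific choices:

```python
def context_windows(signal, width=5, step=2):
    # overlapping local context windows over a numerically encoded sequence
    return [signal[i:i + width] for i in range(0, len(signal) - width + 1, step)]

def averaged_features(signal, width=5, step=2):
    # one averaged value per window instead of the raw sequence data
    return [sum(win) / len(win) for win in context_windows(signal, width, step)]
```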

Stochastic Grammars

As described in chapter 2, protein sequences consist of residues originating from a fixed set of amino acids. They are represented as strings, and sequence analysis is mostly based on some kind of string comparison. More formally, protein sequences of a particular protein family are words over the alphabet of amino acids. All valid words of the protein family are summarized in the "language" of protein sequences, which is defined by its grammar – a compact set of rules for generating it. If the grammar of the language were known, the problem of sequence analysis would be solved because the generation rules would only need to be evaluated for the query sequences regarding acceptance to the language, i.e. the protein family of interest. Unfortunately, except for trivial cases the grammar of protein families cannot be specified ab initio, which complicates the sequence analysis task as already discussed. In the last few years a generalization of the Hidden Markov formalism has become popular for sequence classification tasks: stochastic (context free) grammars (SCFGs). This is especially the case for RNA analysis. Here, the basic idea is to derive probabilistic rules for generating the language of words, i.e. RNA belonging to a certain class, or members of a particular protein family. Once the rules are obtained, they can be evaluated for query

Figure 3.18: Illustration of the sequence classification approach using artificial neural networks: Based on the raw sequence data some kind of feature representation is extracted which is presented to the neural network whose output neuron activities are used for the final classification decision (from left to right).

sequences, delivering a probability for accepting them as words of the language modeled. In summary, stochastic grammars are obtained by superimposing a probability structure on the set of production rules, e.g. (cf. [Bal01, p.282]):

$\alpha \rightarrow \beta : P(\alpha \rightarrow \beta) \quad \text{with} \quad \sum_{\beta} P(\alpha \rightarrow \beta) = 1.$

Thus, stochastic grammars are characterized by a set of parameters and can be interpreted as probabilistic generative models for the corresponding languages. Usually, the set of rules needs to be specified by experts, which is a major drawback of the general method. Besides this, there are approaches where one tries to derive the set of rules from multiple alignments by creating and analyzing the assigned parse tree. Due to their context free character, SCFGs are less restrictive than HMMs, theoretically allowing more flexible sequence comparison. This seems attractive, especially for remote homology detection, but the price for this flexibility is rather high. In order to obtain robust SCFGs for protein families, large datasets are required for deriving the generation rules. Although the same regularization mechanisms as for HMMs can be applied in principle, the problem has not been solved so far. In [Bal01] further limitations such as computational complexity are reported which currently prevent general applicability of SCFGs for protein sequence analysis.
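The normalization constraint on stochastic production rules can be checked mechanically. The toy grammar below is entirely hypothetical and only illustrates the data structure:

```python
# Hypothetical toy grammar: each nonterminal maps its productions
# (right-hand sides) to probabilities P(alpha -> beta)
RULES = {
    "S": {("A", "S"): 0.4, ("B",): 0.6},
    "A": {("a",): 1.0},
    "B": {("b", "b"): 0.3, ("b",): 0.7},
}

def is_stochastic(grammar, tol=1e-9):
    # check that sum over beta of P(alpha -> beta) equals 1 for every alpha
    return all(abs(sum(probs.values()) - 1.0) < tol for probs in grammar.values())
```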

Support Vector Machines

The majority of state-of-the-art sequence analysis methods, including the probabilistic approaches described so far, are based on the examination and modeling of sequence similarities. As an example, Profile HMMs are optimized towards the detection of remote but still somehow similar sequences affiliated to a particular protein family. As previously mentioned (cf. pages 61ff), the discrimination between target hits and misses is of major importance in order to keep the number of false predictions as low as possible. This is especially true for screening applications like those performed for target identification in drug discovery tasks. Usually this problem is tackled by explicitly analyzing the ratio of scores obtained by evaluation of the appropriate target model and a sophisticated background model. Consequently, this implies nothing else than explicitly analyzing the dissimilarities between two models.

Recently, in the machine learning community certain approaches have been developed addressing optimized discrimination between classes. Certainly the most prominent example of such techniques are the so-called Support Vector Machines (SVMs) introduced by Vladimir Vapnik. The monograph of Nello Cristianini and John Shawe-Taylor on this topic is the basis for the following argumentation [Chr00]. For problems of linearly separating patterns of arbitrary dimensionality originating from two different classes5, the specific hyperplane which optimally discriminates both classes including maximum generalization to unknown data is searched. An optimal hyperplane is defined as the linear decision function with maximal margin between the vectors of the two classes (cf. figure 3.19), and it was observed that to construct such optimal hyperplanes, only a small amount of the training data needs to be taken into account: the so-called Support Vectors which determine the margin [Cor95]. Due to their mathematical simplicity, linear discrimination planes are favorable.

Figure 3.19: Example of a linearly separable problem in a two-dimensional space – the discriminating hyper- plane is defined by means of the grey shaded support vectors (adopted from [Cor95]).

Unfortunately, most classification problems are not linearly separable when processing the appropriate data in its actual dimensionality, i.e. within the original data space. However, by transferring the data into a higher-dimensional space, usually the classification problem can be solved by means of the desired discriminating hyperplane. In fact, this is the basic idea of SVM based classification approaches.

5Generally, every n-class problem can easily be reduced to n − 1 two-class problems. Thus, the theoretical considerations are usually restricted to two-class problems.


For a classification problem which cannot be solved by linear discrimination in the original data space but should be solved by means of the SVM technique, it first needs to be investigated how to obtain a proper higher-dimensional data space which allows the discrimination using a hyperplane. If Φ is assumed to be a nonlinear transformation which maps data vectors $\vec{x} \in \mathbb{R}^n$ into a higher-dimensional space $\mathbb{R}^N$ with $N \gg n$, where the training samples are linearly separable, then the discrimination function is defined as follows:

$f(\vec{x}) = \vec{w}^T \Phi(\vec{x}) + b.$

Here, $\vec{w}$ and $b$ designate the parameters of the desired hyperplane. Due to the linear separability in the target space $\mathbb{R}^N$, $\vec{w}$ can be expressed as a linear combination of the transformed sample vectors, which corresponds to the Perceptron learning algorithm (cf. [Zel97]):

$f(\vec{x}) = \vec{w}^T \Phi(\vec{x}) + b = \sum_{i=1}^{I} \alpha_i y_i \Phi(\vec{x}_i)^T \Phi(\vec{x}) + b, \qquad (3.34)$

with $\alpha_i$ denoting the number of false classifications during training and $y_i \in \{-1, 1\}$ designating the classification result for the two-class problem. The size of the sample set used for obtaining the discrimination function is represented by $I$. The basic practical problem with equation 3.34 is the exploding computational effort when enlarging the dimensionality of the data. Thus, usually the straightforward solution of direct transformation of the data cannot be applied. However, there are transformations Φ whose dot product $\Phi(\vec{x})^T \Phi(\vec{z})$ can be expressed as a function $k$ of the dot product of $\vec{x}$ and $\vec{z}$ in the original space, i.e.:

$\Phi(\vec{x})^T \Phi(\vec{z}) = k(\vec{x}^T \vec{z}).$

This implies that the dot product of two transformed vectors, which is a prerequisite for the actual classification according to equation 3.34, can be calculated without the computationally expensive transformation. Such functions $k(\vec{x}, \vec{z})$ are called kernels and some examples are:

• Polynomial kernels: $k(\vec{x}, \vec{z}) = (\vec{x}^T \vec{z} + c)^d$, $c \in \mathbb{R}$, $d \in \mathbb{N}$,

• Gaussian kernels or radial basis functions (RBFs): $k(\vec{x}, \vec{z}) = \exp\left(-\frac{(\vec{x}-\vec{z})^T(\vec{x}-\vec{z})}{\sigma^2}\right)$, or

• Sigmoidal kernels: $k(\vec{x}, \vec{z}) = \tanh(a\,\vec{x}^T \vec{z} - \Theta)$ for some $a$ and $\Theta$.
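The three kernels listed above, together with the kernel form of the decision function (equation 3.35), can be sketched as follows; the support set in the usage example is a hypothetical toy problem, not a trained SVM:

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def poly_kernel(x, z, c=1.0, d=2):
    # k(x, z) = (x^T z + c)^d
    return (dot(x, z) + c) ** d

def rbf_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-(x - z)^T (x - z) / sigma^2)
    diff = [a - b for a, b in zip(x, z)]
    return math.exp(-dot(diff, diff) / sigma ** 2)

def sigmoid_kernel(x, z, a=1.0, theta=0.0):
    # k(x, z) = tanh(a x^T z - theta)
    return math.tanh(a * dot(x, z) - theta)

def svm_decision(x, support, kernel, b=0.0):
    # f(x) = sum_i alpha_i * y_i * k(x_i, x) + b
    # support: list of (alpha_i, y_i, x_i) triples
    return sum(a_i * y_i * kernel(x_i, x) for a_i, y_i, x_i in support) + b

# hypothetical support set for a two-class toy problem
support = [(1.0, 1, [1.0, 1.0]), (1.0, -1, [-1.0, -1.0])]
score = svm_decision([0.5, 0.5], support, rbf_kernel)
```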

By means of such kernels, discriminating functions in higher-dimensional spaces can be defined without explicit calculation of the transformation Φ:

$f(\vec{x}) = \vec{w}^T \Phi(\vec{x}) + b = \sum_{i=1}^{I} \alpha_i y_i \Phi(\vec{x}_i)^T \Phi(\vec{x}) + b = \sum_{i=1}^{I} \alpha_i y_i k(\vec{x}_i, \vec{x}) + b. \qquad (3.35)$

In order to obtain complete classification systems for real problems, several practical problems need to be solved after the formulation of the general problem (including the principal approach for a computationally feasible solution). The discrimination function given via

formula 3.35 is optimized regarding the training samples. For effective use of SVMs, the discrimination power needs to be optimized for unknown data, which implies the maximization of the so-called generalization ability of SVMs. Therefore, the generalization error needs to be minimized, which is usually achieved by maximization of the so-called hard margin $\gamma = \min_{1 \leq i \leq I} \gamma_i$, where $\gamma_i = \frac{\bar{\gamma}_i}{\|\vec{w}\|}$ denotes the geometric distance of an annotated training sample $(\vec{x}_i, y_i)$ regarding the discrimination function $f$, and $\bar{\gamma}_i = y_i(\vec{w}^T \vec{x}_i + b)$ determines the corresponding functional distance. For complicated problems where even the training samples cannot be separated linearly, the maximization of the hard margin $\gamma$ is usually performed with respect to the minimum of a so-called slack vector $\vec{\xi}$ whose components define soft margins for every training sample. The maximization itself is often performed by exploiting the Kuhn-Tucker theorem for a Lagrange optimization approach, and a complete classification system can be obtained by means of the Sequential Minimal Optimization (SMO) technique. For details which are not in the focus of this thesis the interested reader is referred to e.g. [Cor95, Chr00].

According to the arguments given at the beginning of this section, Support Vector Machines are applied to the sequence analysis problem especially for screening applications where homologues of single sequences or complete families are searched within larger amounts of unknown data. Generally, two variants of applying SVMs are reported in the bioinformatics literature:

Direct sequence data processing: Sequences are mapped to some kind of numerical representation (cf. section 3.3.1 for an overview of the most common techniques) and SVMs are trained for every target class.
As usual for SVM applications the optimization strategy is directed towards the maximum discrimination between sequences originating from the appropriate protein family of interest and all others. As one example, Christina Leslie and colleagues in several publications used the so-called n-mer feature space representation where protein sequences are mapped to the corresponding n-spectrum, which is the set of all n-length subsequences that the appropriate protein sequence contains.6 Based on this representation the general SVM framework is used for remote homology detection by applying various kernel functions. Generally, the choice or design of proper kernel functions is the crucial part of SVM based sequence analysis techniques. According to the sequence data, usually some kind of string kernels are used, e.g. in [Les02] a simple spectrum kernel is applied which represents the dot product of two n-mer spectrum vectors. As an enhancement, in [Les04] the mismatch kernel was developed which smoothes the n-mer spectra by allowing at most m mismatches for contributing to a particular spectrum coefficient.

Post-processing of alignment scores: As an alternative to the direct processing of sequence data within the discriminative SVM framework, several researchers utilize the actual scores generated by a preceding conventional similarity based alignment step. Tommi Jaakkola and colleagues in [Jaa98, Jaa99] apply the Fisher kernel to alignment scores obtained by Profile HMM evaluation of protein sequences. Consequently, the generative approach of Profile HMM alignment delivers the features (i.e.

6In fact the so-called spectrum is actually a histogram of n-mer usage.


the scores) which are "postprocessed" by the discriminative SVM technique. Similar to this, the approach of Li Liao and William Noble is based on the scores obtained by a pairwise alignment technique like the standard Smith-Waterman algorithm and a succeeding application of SVMs [Lia02]. The authors argue that their technique is significantly faster than the SVM Fisher method mentioned before.
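The n-mer spectrum representation and the spectrum kernel of [Les02], described in the first variant above, can be sketched as follows; the sequences in the usage are toy strings, not real protein data:

```python
from collections import Counter

def n_spectrum(seq, n):
    # histogram of all length-n subsequences of seq (the n-mer "spectrum")
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def spectrum_kernel(s1, s2, n):
    # dot product of two n-mer spectrum vectors
    sp1, sp2 = n_spectrum(s1, n), n_spectrum(s2, n)
    return sum(count * sp2[mer] for mer, count in sp1.items())
```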

The general idea of applying discriminative methods to the problem of homology detection is straightforward because it is a generalization of the widely used log-odds scoring approach. Support Vector Machines are a very powerful framework for discrimination tasks which recently emerged from the machine learning community. In several applications it could be shown that the proper application of SVMs for sequence analysis tasks, either directly to sequence data or by postprocessing conventional alignment scores, can improve the classification accuracy. So far, the price for this enhancement is still rather high, since larger training sets are required for robust modeling and the overall computational effort is not negligible despite efficient kernel functions.

In addition to SVM based techniques, the idea of discriminative analysis can be generalized to optimized training techniques for Hidden Markov Models (cf. [Hua01, pp. 150ff] for a general overview and [Dur98, p.67f.], and [Edd95b, Mam96] for Profile HMM specific explanations). The major drawback of such techniques is again the larger number of training samples required for robust model estimation. This prevents them from general applicability.

3.3 Signal Processing based Sequence Comparison

For several applications of molecular biology research like target identification or validation within drug discovery tasks (cf. section 2.4), the overall situation regarding suitable training data is rather difficult. Since the applications are situated almost at the beginning of the general processing pipeline, primary structure data is the only source of information for detecting members of protein families which are in some way interesting for the researcher. This data is the direct result of the preceding sequencing operations, and the result of sequence comparison directly influences all the succeeding steps of molecular biology processing. In the last sections, a broad overview of the state-of-the-art in analysis techniques based on amino acid sequence data was given. Most methods described, no matter whether conventional pairwise alignments or machine learning based probabilistic models of sequence families, are based on string processing.

Generally, symbolic information like strings, e.g. protein sequences, represents discrete data. All approaches for processing this data are limited to discrete techniques, which are certainly powerful, but the big arsenal of general signal processing methods cannot be applied at all or only in a very limited way. So far, discrete techniques were the obvious choice for sequence data because discrete data (sequences consisting of residues originating from a fixed inventory) was processed. The only exceptions were alternative representations for sequence classification using neural networks or support vector machines. Throughout the years very effective techniques were developed for the analysis of natural, real-valued signals in a wide range of applications. Very prominent examples are various transformations like the well-known Fourier transformation. Since the problem of detecting

remote homologue protein sequences could not be solved in general in the last few years, new approaches to the task are demanded. One idea of this thesis is to open the most effective probabilistic models for protein families, Profile Hidden Markov Models, to the field of general signal processing. To the author's knowledge, all present Profile HMM approaches reported in the literature are based on raw sequence, i.e. discrete, data. Furthermore, there are only few publications available regarding signal processing based protein sequence classification – the promising signal based analysis is currently neglected by most researchers. In the following sections these rare approaches are summarized. Here, both techniques for protein classification and the related task of DNA analysis are considered, because these techniques can usually be generalized to the analysis of protein sequences (which will be discussed in the appropriate sections). First, in section 3.3.1 alternative sequence representations which enable the use of signal processing techniques are discussed. Following this, in section 3.3.2 the most promising current classification approaches are outlined.

3.3.1 Alternative Representations of Protein Sequences

Almost the whole set of powerful signal processing techniques was developed for the analysis of natural signals, i.e. real-valued functions. Thus, they cannot be applied to protein sequences without any modifications of the data. The trivial numerical representation of protein sequences would be to assign (arbitrary) numbers to every amino acid, resulting in discrete but real-valued "signals". One choice of assignment could be: A = 1, R = 2, N = 3, . . . , V = 20. The major drawback of this trivial assignment scheme is the incorporation of an artificial and certainly completely wrong relation: there is no reason for a dramatically larger distance between Alanine (A) and Valine (V) compared to the distance between Alanine (A) and Asparagine (N). Consequently, numerical encoding schemes must not incorporate distortions of the relationships between amino acids. The signal based representations of protein sequences reported in the literature can be divided into two categories:

1. Based on theoretical considerations of the discrete symbol set, sequences are encoded into signals preserving the inter-symbol distances by means of various vector repre- sentations or statistical properties (cf. “Vector Space Derived Encoding Schemes”).

2. For the second encoding method, the actual biochemical properties of amino acids are considered. Real-valued signals are obtained by means of direct mappings of protein sequences’ residues to numerical values representing some biochemical property (cf. “Encoding based on Biochemical Properties”).

In the following, both categories of sequence encoding including their most important representatives are discussed.

Vector Space Derived Encoding Schemes

The first category of encoding schemes was already introduced in the discussion of neural network based sequence classification approaches (cf. page 66). Here, the simplest correct, i.e. distance conserving, encoding method was described as the creation of 20-dimensional orthogonal vectors. Generally, symbolic data is mapped to some N-dimensional vector space. For the case of orthogonal base vectors described above, N designates the number of different symbols, i.e. for protein sequences N = 20. Besides the simple base vectors, certain alternative definitions are imaginable. Here, the most important point is the conservation of distances between the appropriate symbols, which does not necessarily imply equal distances. Depending on prior knowledge about the mutual pairwise relations between the appropriate amino acids (e.g. depending on biochemical properties – see next section) the distances can be adjusted individually. As one example for a vector space of DNA data, Dimitris Anastassiou in [Ana00] and [Ana01] defined the base vectors as the complex conjugate pairs T = A∗ and G = C∗, e.g.:

A = 1 + i, T = 1 − i, C = −1 − i, G = −1 + i. (3.36)

By means of these definitions and the genetic code, i.e. the well defined mapping of nucleotide codons to amino acids, numerical values can also be assigned to amino acids using the following procedure. Generally, the protein coding process can be modeled as a FIR digital filter7, in which the input x[n] is the numerical nucleotide sequence, and the output y[n] represents the possible numerical amino acid sequence:

y[n] = h[0]x[n] + h[1]x[n − 1] + h[2]x[n − 2].

If h[0] = 1, h[1] = 1/2, and h[2] = 1/4 and x[n] is defined by the parameters in equation 3.36, then y[n] can only take one out of 64 possible values. Thus, the entire genetic code consisting of 64 codons which encode the 20 amino acids (or STOP) can be drawn on the complex plane as shown in figure 3.20. Here, for example, Methionine (labeled Met) corresponds to the complex number (1 + i) + 0.5(1 − i) + 0.25(−1 + i) = 1.17 + 0.88i [Ana01]. Note that this encoding scheme is partially redundant since some of the amino acids are multiply defined. Compared to the 20-dimensional vector space of the simple orthogonal base vector approach, here two-dimensional representations of the 20 amino acids are sufficient.

As an alternative to the complex number representation of amino acids, Paul Cristea proposed a tetrahedral representation of the genetic code and thus of the amino acids [Cri01]. He argued that the classic Cartesian representation of the genetic code, which depends on the actual nucleotides at the first, second and third position in codons determining the actual amino acid, does not correctly reflect the natural structure of the genetic code. He developed optimal symbolic-to-digital mappings for nucleotides as well as for amino acids which are apparently suitable for the comparison of whole genomes (cf. [Cri03]). Besides the encoding schemes described above, which are all more or less related to some kind of vector space analysis, further encoding schemes based on theoretical considerations were proposed in the literature. As one example, Gerhard Kauer and Helmut Blöcker in [Kau03] adopted an early approach of Kenneth Breslauer and colleagues regarding enthalpy based signal representation of DNA sequences [Bre86]. Here, the enthalpies of residues

7FIR is the acronym for Finite Impulse Response which designates a filter whose impulse response is finite in length. For details regarding general digital signal processing including filtering, the standard work of Alan Oppenheim and Ronald Schafer [Opp89] is a good starting point.


Figure 3.20: Numerical amino acid representation using complex plane; the exemplarily marked point desig- nates the actual numerical representation of Methionine (adopted from [Ana01]).

with their respective neighboring nucleotides are taken as signal values. Such enthalpy data can be obtained from the literature, and the mapping approach can be generalized to sequences of amino acids, too.8
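The complex-plane codon mapping of equation 3.36 can be sketched as follows. This is a minimal illustration, not Anastassiou's original implementation: following the expansion shown above for Methionine (codon ATG), the filter coefficient h[0] is applied here to the first codon position:

```python
from itertools import product

# Complex base vectors from equation 3.36
NUC = {"A": 1 + 1j, "T": 1 - 1j, "C": -1 - 1j, "G": -1 + 1j}
H = [1.0, 0.5, 0.25]  # FIR coefficients h[0], h[1], h[2]

def codon_value(codon):
    # weighted sum over the three codon positions, following the
    # expansion shown above for Methionine (codon ATG)
    return sum(h * NUC[n] for h, n in zip(H, codon))

# every one of the 64 codons maps to its own point on the complex plane
values = {"".join(c): codon_value(c) for c in product("ATCG", repeat=3)}
```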

Encoding based on Biochemical Properties

Natural signals evolving in time usually consist of measurements of some physical values captured at well defined, discrete time-steps. As one example, acoustic signals as processed in automatic speech recognition applications represent the progression of the sound pressure during uttering. Other examples of natural signals are the progression of air pressure or of temperature values, which are important for weather forecast applications. All such signals have one characteristic in common: they represent the progression of some kind of physical, i.e. natural, properties. On these premises, for the second category of signal representation approaches for protein sequences, biochemical properties of the sequences' residues are used for obtaining the required numerical representations. This is plausible since the biological function of proteins is implied by their biochemical properties. By means of the so-called "wet-lab" analysis, i.e. by actual biochemical investigations of the biological matter of proteins, biochemical properties like hydrophobicity, charge, pH-value etc. are in fact measured. Of course, the sum of biochemical properties corresponds to the appropriate amino acids, but the symbolic data "shields" these details due to abstraction. Throughout the years many researchers analyzed the biochemical properties of amino acids in great detail. The results of these experimental evaluations are usually reported in so-called amino acid indices serving as

8Enthalpy (symbolized H, also called heat content) is a thermodynamic quantity which represents the sum of the internal energy U of matter (here the particular residue) and the product of its volume V multiplied by the pressure P : H = U + PV [Her89].

a mapping scheme for amino acids to numerical values. Shuichi Kawashima and Minoru Kanehisa in [Kaw00] compiled a large amount of such amino acid indices, which are the basis for most of the encoding schemes based on biochemical properties.

As one example, the Electron Ion Interaction Potential (EIIP) is used for obtaining numerical representations of protein sequences. Originally proposed by physicists in the 1970s (cf. [Vel72]), Irena Cosic and colleagues developed various approaches for the classification of DNA as well as protein sequences based on this mapping [Cos97, De 00, De 02]. The EIIP describes the average energy of all valence electrons in a particular amino acid. Every amino acid or nucleotide, irrespective of its actual position in a sequence, can be represented by a unique number, which is summarized for both nucleotides and amino acids in table 3.1.

Nucleotide  EIIP     Amino Acid  EIIP
A           0.1260   Ala (A)     0.0373
G           0.0806   Arg (R)     0.0959
T           0.1335   Asn (N)     0.0036
C           0.1340   Asp (D)     0.1263
U           0.0289   Cys (C)     0.0829
                     Gln (Q)     0.0761
                     Glu (E)     0.0058
                     Gly (G)     0.0050
                     His (H)     0.0242
                     Ile (I)     0.0000
                     Leu (L)     0.0000
                     Lys (K)     0.0371
                     Met (M)     0.0823
                     Phe (F)     0.0946
                     Pro (P)     0.0198
                     Ser (S)     0.0929
                     Thr (T)     0.0941
                     Trp (W)     0.0548
                     Tyr (Y)     0.0516
                     Val (V)     0.0057

Table 3.1: The Electron Ion Interaction Potential (EIIP) values for nucleotides and amino acids (cf. [Cos97, p.13]).

The process of obtaining a signal based representation when utilizing biochemical properties can be summarized as follows:

1. Depending on the actual application of sequence analysis a proper amino acid index is selected. This step is rather crucial and expert knowledge is required. Very common choices are the EIIP values explained above or hydrophobicity indices (cf. e.g. [Man97, Mur02, Qiu03]).

2. All residues of all sequences (training, query and comparison data) are directly mapped to numerical representations using the amino acid index selected in the first step.
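Step 2 of this procedure amounts to a direct table lookup. The sketch below uses the EIIP amino acid values transcribed from table 3.1:

```python
# EIIP amino acid index values from table 3.1
EIIP = {
    "A": 0.0373, "R": 0.0959, "N": 0.0036, "D": 0.1263, "C": 0.0829,
    "Q": 0.0761, "E": 0.0058, "G": 0.0050, "H": 0.0242, "I": 0.0000,
    "L": 0.0000, "K": 0.0371, "M": 0.0823, "F": 0.0946, "P": 0.0198,
    "S": 0.0929, "T": 0.0941, "W": 0.0548, "Y": 0.0516, "V": 0.0057,
}

def eiip_signal(sequence):
    # map every residue directly to its numerical index value
    return [EIIP[residue] for residue in sequence]
```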


3.3.2 Signal Processing Methods for Classification

The motivation for applying signal processing techniques to the task of sequence analysis is given by the assumption that highly diverging but related data contains regularities which are important for the actual biological meaning of the sequences but cannot be considered when raw amino acid data is processed. Furthermore, such characteristics can be hidden, e.g. due to (unknown) noise disturbing the hypothetical "protein generation process". Note that for this hypothetical process of protein generation no complete biological interpretation has been formulated so far. It is rather a question of theoretical considerations within the (artificial) framework of signal based representations of protein sequences.

After the description of signal based representations of protein sequences which are reported in the literature, in the following, protein sequence analysis techniques actually based on signal processing approaches are outlined. Among the rare publications dedicated to this topic, the majority is concerned with spectral analysis of protein sequences. Thus, first the general approach of spectral analysis is presented. Following this, actual applications for sequence comparison are briefly described.

General Spectral Analysis

In order to gain direct access to signal characteristics, spectral analysis has been established as a powerful tool in the natural and engineering sciences. Here, frequency information of a particular signal is directly accessible. The spectral representation, i.e. the direct description of frequencies and their contribution within the original signal, is usually obtained by transforming the signal of interest using some kind of function transformation technique. As the most prominent example, by means of the well-known Fourier transformation a signal is expressed as a weighted sum of sine and cosine terms. Besides this fundamental technique, throughout the years several refined methods were developed. Recently, a very flexible framework for obtaining spectral representations was developed, namely the Wavelet transformation.9

By means of this spectral representation, which is fully equivalent to the original signal representation, several signal analysis techniques can be applied much more easily. As one example, the removal of (hypothetical) noise in a strictly signal theoretical sense can be performed by simple filtering, which corresponds to a spectral multiplication. Since the essentials of the particular family are usually of major interest for protein sequence analysis, lowpass filters can be applied here. In addition, explicit structural filters can be used if some general clues about the protein family of interest are available a priori [Ana01, Kau03]. Certainly the most relevant technique is the detection of specialties of a particular protein family by spectral analysis. Such specialties, which might distinguish one family from another, can be obtained by e.g. peak detection. Peaks in the spectrogram represent some kind of characteristic frequency for certain biological function [Kri04].
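A spectral representation as described above can be obtained with a naive discrete Fourier transform; this O(N^2) sketch is for illustration only (practical implementations use the FFT):

```python
import cmath

def dft(signal):
    # naive discrete Fourier transform of a real-valued signal
    N = len(signal)
    return [sum(signal[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def magnitude_spectrum(signal):
    # contribution of each frequency within the original signal
    return [abs(coeff) for coeff in dft(signal)]

spec = magnitude_spectrum([1.0, 0.0, 1.0, 0.0])  # a period-2 oscillation
```

For this alternating toy signal the energy concentrates in the DC component and the highest resolvable frequency; peak detection then operates on such magnitude values.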
As one example, in [De 02] peaks within certain scales of the multi-resolution analysis obtained by applying the Wavelet transform are explicitly interpreted as biological functions of the underlying protein, such as the oxygen-carrying function. Furthermore, several approaches are

9For mathematical and technical details of this powerful transformation see appendix A.

reported where spectral analysis based peak detection using the Wavelet transformation is used for the prediction of structural properties [Mur02, Qiu03]. Finally, as one very interesting application of signal processing techniques, in [Arn96] the fractal scaling and organization properties of DNA sequences are analyzed using the Wavelet transformation, i.e. spectral analysis of DNA signals. Although explicitly dedicated to genomic data, this fascinating approach might be applied to protein data, too.

The Resonant Recognition Model (RRM)

By means of the signal representation based on the EIIP mapping described in the previous section, Irena Cosic and coworkers developed the so-called Resonant Recognition Model (RRM) where protein classification is based on the analysis of cross-spectra [Cos97]. Several publications regarding this approach exist, all describing slight variations of the basic method for different applications and interpretations of the relations between the RRM and two-dimensional as well as three-dimensional protein structures [De 00, De 01, Pir01, De 02]. Basically, within the RRM, pairwise sequence comparison is performed in the following way:

1. Both sequences involved in the comparison ($\vec{s}_1$ and $\vec{s}_2$, each consisting of amino acid data) are mapped to the EIIP based signal representation ($\vec{x}_1$ and $\vec{x}_2$, each consisting of numerical values), i.e.:

$\vec{s}_k \rightarrow \vec{x}_k, \quad k \in \{1, 2\}$. If the sequences differ in length, the shorter one is extended by zero-padding.

2. For both sequences the spectral representation is obtained by applying (discrete) Fourier transformation (DFT):

$$X_k(n) = \sum_{m=0}^{N-1} x_k(m)\, e^{-\frac{2\pi i n m}{N}}, \qquad n = 1, 2, \ldots, \frac{N}{2}, \quad k \in \{1, 2\},$$

where N designates the length of the appropriate sequence and Xk(n) represents the n-th Fourier coefficient of the k-th protein sequence. Since the distances between adjacent residues of protein sequences are assumed to be equal (3.8 Å) and are therefore set to d = 1, the resolution of the spectra is given by 1/N. Because of its advantageous mathematical properties, recently the Fourier transform was substituted by the Wavelet transform (cf. appendix A).

3. In order to extract common spectral characteristics from sequences sharing the same or similar biological functions, cross-spectra are calculated using the following definition:
$$S_{1,2}(n) = X_1(n)\, X_2(n)^{*}, \qquad n = 1, 2, \ldots, \frac{N}{2},$$

where X1(·) are the DFT coefficients of the first sequence ~s1 in its numerical representation ~x1 and X2(·)* are the complex conjugate DFT coefficients of the second sequence, analogously.


4. Peak frequencies in the amplitude cross-spectral function define common frequency components of the two sequences analyzed. Thus, protein sequence classification and remote homology detection are performed based on peak detection within the cross-spectrum of two sequences.

The procedure described above can be generalized to the analysis of common frequency components, i.e. biological functions, for a group of K > 2 protein sequences. Therefore, the definition of the cross-spectrum is extended towards a multiple cross-spectrum:

$$|M_{\{1,2,\ldots,K\}}(n)| = |X_1(n)| \cdot |X_2(n)| \cdot \ldots \cdot |X_K(n)|, \qquad n = 1, 2, \ldots, \frac{N}{2}.$$

Peak frequencies in a multiple cross-spectrum denote common frequency components for all sequences analyzed. In figure 3.21 the procedure is summarized for two exemplary sequences of type Fibroblast Growth Factor Protein.

For sequence classification, significant peaks within the cross-spectrum need to be determined. Therefore, in the original publications the authors proposed to utilize the signal-to-noise ratio (SNR). For each peak, the SNR is calculated as the ratio between the signal intensity at the particular peak frequency and the mean value over the whole spectrum. Here, the basic assumption is that the prominent common frequency components manifested by the peak (if existing) describe the fundamental signal, which is potentially disturbed by some kind of noise (all other frequency components). If the SNR is larger than some suitable threshold, the peak is significant and common biological functions can be assumed. According to [Cos97], an SNR of at least 20 can be considered significant. By means of the RRM, protein classification can be performed based on the following criteria [Cos97, p.18]:

1. Only one peak exists for a group of protein sequences sharing the same biological function.

2. No significant peak exists for biologically unrelated protein sequences.

3. Peak frequencies are different for different biological functions.

Especially the last point is important for real applications like functional mapping of a complete genome. Unfortunately, the differentiation between different biological functions is rather complicated since extremely small numerical differences need to be considered. Due to this difficulty and several further technical problems with the spectral analysis, like the spectrum-distorting padding of sequences with different lengths, the RRM based approach does not seem to be robust enough for remote homology detection at a broader scale, although it is claimed otherwise.
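The four steps of the pairwise RRM procedure, together with the SNR based significance test, can be sketched as follows. The EIIP constants are the values commonly tabulated in the literature; the function name, example sequences, and significance threshold handling are illustrative assumptions:

```python
import numpy as np

# EIIP values for the 20 amino acids, as commonly tabulated in the literature
EIIP = {'L': 0.0000, 'I': 0.0000, 'N': 0.0036, 'G': 0.0050, 'V': 0.0057,
        'E': 0.0058, 'P': 0.0198, 'H': 0.0242, 'K': 0.0371, 'A': 0.0373,
        'Y': 0.0516, 'W': 0.0548, 'Q': 0.0761, 'M': 0.0823, 'S': 0.0829,
        'C': 0.0829, 'T': 0.0941, 'F': 0.0946, 'R': 0.0959, 'D': 0.1263}

def rrm_compare(seq1, seq2):
    """Pairwise RRM comparison: returns the amplitude cross-spectrum,
    the index of its strongest peak, and that peak's SNR."""
    x1 = np.array([EIIP[a] for a in seq1])     # step 1: EIIP signal mapping
    x2 = np.array([EIIP[a] for a in seq2])
    n = max(len(x1), len(x2))
    x1 = np.pad(x1, (0, n - len(x1)))          # zero-padding to equal length
    x2 = np.pad(x2, (0, n - len(x2)))
    X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)  # step 2: DFT
    S = np.abs(X1[1:] * np.conj(X2[1:]))       # step 3: cross-spectrum (DC removed)
    peak = int(S.argmax())                     # step 4: peak detection
    snr = S[peak] / S.mean()                   # peak intensity vs. spectral mean
    return S, peak, snr

# hypothetical example: two short, clearly related toy sequences
S, peak, snr = rrm_compare('MKVLAAGIVALLA', 'MKVLAAGVVALLAT')
significant = snr >= 20.0   # significance criterion proposed in [Cos97]
```

The multiple cross-spectrum for K > 2 sequences follows the same pattern by multiplying the K amplitude spectra elementwise before peak detection.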


[Figure 3.21, panels: (a) amino acid sequence 1 (Basic Bovine); (b) amino acid sequence 2 (Acidic Bovine); (c)+(d) EIIP representations of both sequences (EIIP values over residues); (e)+(f) power spectra of both sequences; (g) cross-power-spectrum of both sequences (absolute spectral amplitudes over ω/2π).]

Figure 3.21: The Resonant Recognition Model (RRM) procedure for two exemplary protein sequences: a+b) amino acid representation of both sequences; c+d) graphical representation of the corresponding EIIP signals of both sequences; e+f) power spectra of both signals; g) resulting cross-power-spectrum with prominent peak at 0.46 denoting common biological function (adopted from [Cos97, p.17], examples from [Vai04]).


3.4 Summary

Research performed in the field of molecular biology is generally dedicated to uncovering the biological functions of proteins. This is reasoned by the immense importance that proteins have in the metabolism of every organism. According to the fundamental theory of evolutionary development, at the molecular level specializations of proteins regarding certain biological functions are reflected by changes in their three-dimensional structure. Principally, the spatial folding of proteins and thus their biological function is determined by their primary structures. Due to this paradigm of similar function caused by similar amino acid sequence, vast amounts of basic molecular biology research can be performed in-silico by computational sequence analysis. Since this thesis is focused on the improvement of remote homology detection methods, in this chapter the state-of-the-art in general sequence analysis was summarized. Due to the fact that protein data is usually given as sequences of amino acids, the majority of techniques is based on some kind of string processing.

Traditionally, pairwise alignment approaches based on Dynamic Programming (DP) techniques are applied to the sequence classification problem. Here, powerful globally (Needleman-Wunsch) as well as locally (Smith-Waterman) optimizing algorithms were developed throughout the years. Unfortunately, especially for remote homology detection tasks, i.e. the analysis of highly diverging data, the performance of basic DP based techniques is not sufficient.

Thus, probabilistic approaches for explicitly modeling protein families recently became popular. One reason for their popularity is the fact that by means of these stochastic models a less restrictive, i.e. some kind of "fuzzy", sequence classification can be performed. This is advantageous for the analysis of diverging but related data. Several techniques were developed within the bioinformatics community to address probabilistic sequence alignments.
Examples of these are the use of Neural Networks or Support Vector Machines. The approach which is currently most promising is based on the statistical modeling of protein families using Profile Hidden Markov Models, which was described in detail in this chapter.

Besides the predominant string processing based techniques, some approaches are documented in the literature which try to apply signal processing based techniques to the problem of sequence analysis. One major idea developed in this thesis is the investigation of alternatives to string processing. Therefore, the last section of this chapter was dedicated to related approaches. Although very powerful general signal processing techniques exist, their relevancy for protein sequence analysis is rather small at present.

In this chapter a detailed qualitative analysis of existing sequence classification techniques with special focus on probabilistic techniques was given. In the following chapter a quantitative exploration of the currently most promising approach will be outlined. The effectiveness of Profile Hidden Markov Models is investigated based on a representative remote homology detection task whose results are the basis for formulating the requirements for improved techniques: Advanced Stochastic Protein Sequence Analysis.


4 Concepts for Improved HMM Based Sequence Analysis

The basic motivation for almost all research activities within the field of molecular biology is to understand the fundamentals of biological processes at the level of protein synthesis and mutual interactions. By means of such insights, further knowledge gain about complex metabolic processes is addressed in order to e.g. develop drugs and therapies against severe illnesses like cancer or Parkinson's disease.

The general evolutionary concept states that similar biological functions are caused by similar three-dimensional structures. Thus amino acid sequences are a rich pool of information, and in the last few years a general paradigm shift in protein analysis took place. Instead of (manually) generalizing previously obtained expertise on specific proteins, now broad screening techniques are applied, allowing more general investigations on complete genomes by means of computational methods.

As introduced in section 2.4.1, the overall procedure of drug discovery can be interpreted as some kind of multi-stage "sifting process". Based on the analysis of the universe of proteins, sequences not being of pharmaceutical interest are excluded from further complex analysis. This stage of early separation is rather crucial for the general success of drug discovery. If, on the one hand, too many actually improper candidate sequences remain after the target identification step, all following more complex stages of the drug discovery pipeline (cf. figure 2.7 on page 18) will become more expensive and time consuming. On the other hand, if protein sequences actually suitable for the desired pharmaceutical task are skipped, the complete drug discovery process will probably fail. Thus, the quality of computational methods applied to target identification and validation is of extreme importance.

In the previous chapter, the state-of-the-art in computational protein sequence comparison was summarized.
Especially for the analysis of highly diverging but related data, probabilistic models of protein families, namely Profile HMMs, are currently the most promising technique. However, despite improved sensitivity as well as specificity compared to pairwise sequence analysis techniques, the general problem is still very challenging. New approaches for probabilistic computational protein analysis are demanded for further improvements in protein classification.

This thesis is directed to the development of enhanced sequence analysis techniques which are based on Hidden Markov Models. In this chapter, new concepts for HMM based modeling of protein families are presented. Preceding this, in section 4.1 an exhaustive quantitative assessment of the capabilities of current Profile HMM based protein family models is presented. The analysis is performed for a typical task of remote homology detection for certain superfamilies. The concepts developed in this chapter are the basis for enhanced protein family Hidden Markov Models which will be described in detail in the next chapter.


4.1 Assessment of Current Methodologies’ Capabilities

Profile Hidden Markov Models are applied to the task of protein sequence analysis as stochastic models of particular protein families. The general description of their theory and application concepts was presented in detail in section 3.2.2. A qualitative assessment of their capabilities for the actual processing of biological data was given there, too.

In order to obtain a more detailed assessment of the principal capabilities of current Profile HMMs, and as a baseline reference for further comparisons, in the following quantitative investigations regarding a typical task of remote homology detection at the superfamily level are presented. First, the datasets used for the task are completely specified. Following this, the evaluation results obtained by using common state-of-the-art software are presented in section 4.1.2. Currently, two important frameworks for Profile HMM based sequence analysis exist:

SAM: The Sequence Alignment and Modeling System (SAM) was developed by the Computational Biology Group of David Haussler at the University of California in Santa Cruz, USA [Hug96]. As one specialty of SAM, unaligned sequences can be used for the establishment of models for protein families.

HMMER: The second major Profile HMM framework is basically the result of research activities performed by Sean Eddy and colleagues at the School of Medicine at the Washington University in St. Louis, USA [Edd01]. Profile HMMs are established by applying the standard training algorithms to multiply aligned training sequences. Using HMMER, libraries of Profile HMMs for protein families were created allowing broad database screening using probabilistic models of protein families [Bat00].

Both frameworks are rather similar regarding their underlying concepts of Profile HMM implementation and application. The common three-state architecture (cf. figure 3.14 on page 58) is generally used, with refinements in HMMER where the Plan7 architecture (cf. figure 3.15 on page 59) is applied. Generally, raw sequence data is processed, and log-odd scoring using specialized null models is the basis for classification as well as detection tasks. Due to the general similarity of both frameworks, in the following the baseline results are obtained by using one of the two packages, namely SAM. Informal experiments delivered only minor quantitative differences in evaluation results for both packages. Thus, the limitation to SAM experiments for the quantitative evaluation of the general method is arbitrary but justified.

4.1.1 Task: Homology Detection at the Superfamily Level

In order to evaluate the capabilities of current Profile HMM based approaches for protein sequence analysis and to compare the enhanced concepts developed in this thesis to them, the following task is defined:

For a given set of superfamilies, Hidden Markov Models will be established using reasonable amounts of training data. By means of these models, classification tasks representative for target validation as well as detection tasks


representative for target identification within the process of general drug discovery are evaluated. The general focus will be put on classification accuracy for the first case as well as on sensitivity and specificity for the latter.

Here, classification means the assignment of the family affiliation to sequences which are known to originate from a closed set of protein families. This application is important especially for target validation tasks where sequences are annotated with respect to certain models. Additionally, the general discrimination power of family models can be assessed since in general only the Profile HMMs themselves but no background models are involved. Thus, the results of this kind of evaluation give a first clue about the effectiveness of the appropriate methods.

In the second use case, homologous sequences of a particular protein family are searched in a general database of protein sequences, which is typical for target identification tasks: for a family which is therapeutically relevant, sequences belonging to it are searched by comparing query data to the probabilistic model of the appropriate family. By means of a threshold based analysis of the log-odd alignment scores, the decision regarding affiliation to the target family is taken.

Datasets

For an objective judgment of the capabilities of certain sequence analysis techniques, some kind of standardized datasets would be the optimal basis for benchmarking. In other pattern recognition domains such datasets are very common, and new methods are almost always assessed by comparison to state-of-the-art techniques based on this data. Unfortunately, for the bioinformatics domain the situation is rather different. There are hardly any standardized datasets available which are generally used within the community for the abovementioned objective judgment of sequence analysis techniques. Often new developments are explicitly directed to some specialized biological problem and the datasets used for evaluation are gathered correspondingly. Thus, the datasets for training as well as for the evaluation of both state-of-the-art probabilistic sequence analysis techniques and the advanced stochastic protein family models developed in this thesis were defined by the author. The basic criterion for data selection was to maximize objectivity, i.e. the remote homology data used is as unbiased as possible regarding certain biological specialties.

According to Rainer Spang and colleagues, who cited in [Spa02] the "chicken and egg" problem for evaluating the effectiveness of annotation and search methods, formulated by Steven Brenner and coworkers in [Bre98], it is rather difficult to assess the power of Profile HMMs by means of unknown data. The actual family affiliation of the data processed needs to be known in advance. Thus, existing sequence annotations obtained from one of the major public databases are used for the analysis of current approaches as well as for the comparison to the new techniques developed. Care needs to be taken with the actual selection of the reference database. If the database was created using automatic clustering techniques, the annotation will certainly not be completely error-free.
An accurate annotation of the complete database implies the preceding successful solution of the computational sequence analysis problem, which is unrealistic these days. However, when comparing the results of Profile HMM based predictions to such databases, the reference annotation created by an alternative classifier will be accepted as optimal. The

prediction results can only be compared to the potentially erroneous reference annotation obtained using alternative automatic classification approaches. Basically, false predictions need to be further examined with biological expertise since it is not clear whether the reference annotation was wrong or the new prediction. Unfortunately, many current databases were created automatically. Thus, they can hardly be used for serious assessments.

In order to properly simulate the actual situation for remote homology detection, the analysis needs to cover sequence data which is highly divergent. Many databases contain large amounts of redundant data or sequences being almost identical. Such close homologues can be processed by means of standard pairwise techniques as described in section 3.1. Probabilistic approaches address remote homologues, i.e. highly diverging but related sequences.

On these premises, the SUPERFAMILY hierarchy [Gou01] of the SCOP database [Mur95] is used for the assessments at the level of maximally 95% sequence similarity. Here, sequences belonging to a distinct superfamily must not have similarity values above 95%. This means that even data having sequence identities of only a few percent may belong to these superfamilies. Since the structural classification of the protein sequences contained in SCOP was performed manually, massively exploiting well-founded biological expertise, the quality of the labeling of the data is extraordinarily good (cf. the description of SCOP in section 2.3.2). Thus, it is predestined for the general assessment of the capabilities of certain sequence analysis techniques.

Due to the complex model architecture of current Profile HMMs, including the rather large number of parameters to be trained, the minimum number of training sequences was set to 44.
This minimum number was chosen according to an analysis of the average length of sequences belonging to superfamilies and a rule of thumb for the number of examples per model parameter to be trained. Together with at least 22 sequences for the test case, 16 superfamilies containing at least 66 sequences each were selected for the evaluation. The subdivision of the sequences into training and test sets was fixed, i.e. no further leave-N-out tests etc. are considered. In the following, the resulting corpus of training and test data is called SCOPSUPER95_66.

In table 4.1 a general overview of the corpus is given, whereas in figure 4.1 the distribution of the similarity percentages over the datasets is illustrated by means of a histogram capturing all superfamilies included. Inspecting the general corpus overview, it can be seen that the sequences vary substantially in length for both training and test sets. This is the usual case for proteins subsumed in superfamilies which contain data with low sequence identity percentages but whose structures and major functional features suggest a probable common evolutionary origin (according to the superfamily definition of SCOP, cf. section 2.3.2). In fact, the sequence similarities, which are limited to maximally 95 percent, are rather uniformly distributed over the whole range, with a small preference for the first third of the histogram for the appropriate training sets as well as for the assigned test sets. The bins of the similarity range histogram are defined via the SUPERFAMILY hierarchy, each capturing a five percent range with one exception in the first bin (0-10%).


SCOP Id  SCOP Superfamily Name                     # Samples        Length: Mean (Std. Dev.)
                                                   Training  Test   Training       Test
a.1.1    Globin-like                                  60      30    150.3 (13.6)   151.6 (11.1)
a.3.1    Cytochrome c                                 44      22    102.6 (24.1)   118.4 (32.6)
a.39.1   EF-hand                                      49      25    138.1 (48.0)   122.0 (39.3)
a.4.5    "Winged helix" DNA-binding domain            49      25     93.8 (26.6)    92.9 (23.1)
b.1.1    Immunoglobulin                              207     104    108.9 (15.3)   106.7 (12.3)
b.10.1   Viral coat and capsid proteins               64      32    278.0 (92.9)   262.1 (85.2)
b.29.1   Concanavalin A-like lectins/glucanases       52      27    221.2 (51.2)   220.8 (72.9)
b.40.4   Nucleic acid-binding proteins                47      24    113.1 (36.6)   111.5 (47.2)
b.47.1   Trypsin-like serine proteases                55      28    231.4 (29.5)   226.0 (30.1)
b.6.1    Cupredoxins                                  50      26    143.9 (34.6)   139.0 (31.5)
c.1.8    (Trans)glycosidases                          62      31    376.5 (76.4)   397.8 (84.0)
c.2.1    NAD(P)-binding Rossmann-fold domains        102      51    204.3 (58.9)   211.5 (75.1)
c.3.1    FAD/NAD(P)-binding domain                    45      23    226.1 (93.3)   223.3 (86.3)
c.37.1   P-loop containing nucleotide                127      64    259.3 (120.4)  253.4 (85.6)
         triphosphate hydrolases
c.47.1   Thioredoxin-like                             56      28    111.6 (38.2)   105.6 (35.3)
c.69.1   Alpha/Beta-Hydrolases                        51      26    350.1 (103.7)  323.7 (25.0)
Total:                                             1120     566

Table 4.1: Overview of the SCOPSUPER95_66 corpus created for the assessment of current Profile HMM capabilities as well as for the comparison of the effectiveness of the methods developed in this thesis. For every superfamily the alphanumerical SCOP Id as well as its real name as defined in the database is given. In the last row the total numbers of samples are summarized.

4.1.2 Capabilities of State-of-the-Art Approaches

The experimental evaluation of state-of-the-art techniques for both tasks defined for superfamily based remote homologue sequence analysis, classification and detection, was performed using the SAM package with default parameters. According to the suggestions of the authors, for all superfamilies of the SCOPSUPER95_66 corpus Profile HMMs were estimated by applying the hmmbuild program with -randseed 0 to the unaligned training sequences. During the training iterations, SAM created Profile HMMs of reasonable lengths, including specialized Dirichlet mixture based model regularization. The emission probabilities of all Insert states were set to the geometric mean of the family specific amino acid frequencies of the Match states as obtained during the final training step. Again according to the suggestions of the SAM authors, all test sequences were aligned to the models using hmmscore with -sw 2, which implies local alignments comparable to the standard Smith-Waterman approach (cf. section 3.1.1). Using the default setting, the alignment itself is performed using the Forward algorithm as described in section 3.2.2 on page 48f. The alignment scores were determined as log-odd scores. As proposed by Kevin Karplus in


[Figure 4.1: histogram for the SCOPSUPER95_66 corpus, Sequences [%] (y-axis) over Similarity Ranges [%] (x-axis) from 0-10 up to 90-95, with separate bars for Training and Test sets.]

Figure 4.1: Histogram of similarity ranges for the SCOPSUPER95_66 corpus averaged over all 16 superfamilies involved (black: Training / blue: Test), illustrating the almost uniform distribution of similarities over the whole range with one exception at 20-25%. Note that the lower limit of the identity ranges as defined by SUPERFAMILY is 10%. Thus, the first bin covers a broader range of 10% compared to the 5% bins otherwise.

[Kar98], the scores are calculated with respect to scores obtained by aligning the reversed sequence. Within the SAM package this null model is called the reversed null model (the general motivation for this choice of null model was explained in section 3.2.2 on page 61). In order to determine the statistical significance of alignment scores for discriminating between target hits and misses during homology detection, extreme value distributions are evaluated as described in section 3.1.1.¹ For both tasks all models are evaluated independently for all appropriate test sequences, which is, compared to general pattern recognition applications, rather unusual, at least for the classification task.

Classification Accuracy

In general pattern recognition applications, the capabilities of certain approaches for classification tasks are usually measured by means of the classification accuracy C, or more commonly by its inverse, the classification error E. They are defined as the ratio of correct (for C) / false (for E) classifications and the overall number of decisions. Usually, the numerator of this ratio is further subdivided into the number of substitutions, deletions, and insertions. For protein sequence classification, i.e. global string alignment as defined for Dynamic Programming techniques, this subdivision is not important because usually complete sequences are analyzed exclusively, i.e. without concatenations which could include insertions or deletions in the above meaning. Thus, both measurements are defined as follows:

$$C = \frac{N_{\text{correct classifications}}}{N_{\text{overall decisions}}}, \qquad E = \frac{N_{\text{false classifications}}}{N_{\text{overall decisions}}} = 1 - C. \tag{4.1}$$

When applying state-of-the-art discrete Profile HMMs as created by SAM, for the SCOPSUPER95_66 corpus the classification accuracy is 67.1 percent. This corresponds to a classification error of 32.9 percent. For these values, the particular size of the test set, and a selected level of confidence of 95%, statistically significant changes are obtained when the percentages differ by more than 3.9 percent, i.e. the confidence interval is [±3.9%] (cf. figure 4.2). To re-emphasize: for this representative task of remote homology classification, almost one third of the decisions regarding the appropriate superfamily affiliation are wrong when using the currently most powerful sequence analysis technique, namely Profile HMMs. This is problematic especially for target validation applications within the drug discovery pipeline. In table 4.2 the results for the classification task are summarized.

¹In fact SAM's E-values are derived from log-odd scores S using the following formula: $E(S) = \frac{N}{1+\exp(-\lambda S)}$, where N denotes the database size (SCOPSUPER95_66: 566), and λ is a scaling parameter which is 1 when using natural logarithms [Hug96, p.91].

Measure                        Results [%]
Classification Accuracy C         67.1
Classification Error E            32.9
95% confidence interval          ± 3.9

Table 4.2: Summary of the classification results for the SCOPSUPER95_66 corpus when applying discrete Profile HMMs (SAM).

Following the assessment of current methodologies' capabilities for remote homology classification, in the next section their actual effectiveness for detection tasks is evaluated.

Detection Performance

Usually, target detection is performed by analyzing alignment scores with respect to some threshold which discriminates between target hits and target misses. As discussed for pairwise alignment techniques (cf. section 3.1 on pages 31ff), the significance of scores is usually judged by means of so-called E-values denoting the probability of randomly occurring false predictions for particular scores. Especially for remote homology detection, the major difficulty is the actual determination of a suitable threshold for the target or non-target decision.

For the comparison of different techniques, often Receiver Operating Characteristic (ROC) curves are used (cf. e.g. [Mou04, pp. 192ff]). Here, the number of false positive predictions is illustrated as a function of the corresponding number of false negative predictions. The threshold selected for discrimination between target hit and miss, which is usually defined with respect to e.g. E-values, is implicitly given as a particular point within the ROC curve. By means of certain criteria, e.g. the costs for false positive vs. false negative predictions, the optimum threshold can be determined by analyzing ROC plots. Generally, the closer the ROC curve is located to the lower and left borders of the diagram, the better the detection performance of the underlying approach.
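Such a ROC curve can be traced by sweeping the decision threshold over all observed alignment scores, assuming higher log-odd scores indicate target hits. The function name and the score values below are invented purely for illustration:

```python
import numpy as np

def roc_points(target_scores, nontarget_scores):
    """For each candidate threshold, count false negatives (targets scoring
    below the threshold) and false positives (non-targets scoring at or
    above it); higher scores are assumed to indicate target hits."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    return [(int(np.sum(target_scores < t)),      # false negatives
             int(np.sum(nontarget_scores >= t)))  # false positives
            for t in thresholds]

# toy log-odd scores: targets tend to score higher than non-targets
targets = np.array([8.1, 6.5, 7.2, 3.0, 9.4])
nontargets = np.array([1.2, 2.5, 0.3, 3.8, 1.9, 2.2])
curve = roc_points(targets, nontargets)
```

Raising the threshold trades false positives for false negatives; plotting the resulting (false negative, false positive) pairs yields the ROC curve described above, and the operating point closest to the origin (under given costs) identifies a suitable threshold.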


Classification error rates which are determined using a finite test set are, strictly speaking, only estimates of the general capabilities of a recognition system. In order to obtain correct rates for the sample space of the underlying statistical process, theoretically, an infinite number of experiments would need to be performed. Instead, for the error probability Ê estimated using the finite test set, so-called confidence intervals [E_l, E_u] are defined. Such an interval defines a range containing the actual probability E with a given statistical evidence, the so-called level of confidence. A classification experiment can be interpreted as a Bernoulli process where only two events may occur: A (correctly recognized) and Ā (not or falsely recognized). Thus, the confidence interval for a given probability E needs to be determined by evaluating the Binomial distribution. According to the de Moivre-Laplace theorem (local limit theorem), the estimate can be interpreted as asymptotically normally distributed:

$$\frac{\hat{E} - E}{\sqrt{\dfrac{E(1-E)}{N}}} \approx \mathcal{N}(0, 1).$$

When performing N experiments, the lower and the upper boundaries of the confidence interval are defined as follows:

$$E_l = \frac{N}{N + z^2}\left[\hat{E} + \frac{z^2}{2N} - z\sqrt{\frac{\hat{E}(1-\hat{E})}{N} + \frac{z^2}{4N^2}}\right]$$

$$E_u = \frac{N}{N + z^2}\left[\hat{E} + \frac{z^2}{2N} + z\sqrt{\frac{\hat{E}(1-\hat{E})}{N} + \frac{z^2}{4N^2}}\right].$$

Here, z designates the (1 − α/2)-quantile of the standard normal distribution, whose values are tabulated. Usually, for classification experiments as performed in this thesis, a level of confidence of 1 − α = 95% is used, which means that the actual classification error rate E lies within the given confidence range with a probability of 95 percent. Based on confidence intervals, changes in classification error rates, caused e.g. by changes in the parameters of the underlying classification system, can additionally be evaluated regarding their statistical significance. If the changed classification error rate lies outside the confidence interval, the change can be interpreted as statistically significant. Otherwise, it was most likely caused by chance.

Figure 4.2: General estimation of confidence intervals for probabilities, i.e. classification error or accuracy rates (cf. e.g. [Bro91, Sch96]). The derivation given is adopted from [Wie03].
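The interval boundaries given in figure 4.2 can be computed directly; the test-set size below is a hypothetical value, chosen only to show that a half-width near the quoted 3.9% arises for an error rate of 32.9%:

```python
import math

def wilson_interval(e_hat, n, z=1.96):
    """Confidence interval [E_l, E_u] for an error rate e_hat estimated
    from n Bernoulli trials; z = 1.96 is the (1 - alpha/2)-quantile of
    N(0, 1) for a 95% level of confidence."""
    centre = e_hat + z * z / (2.0 * n)
    half = z * math.sqrt(e_hat * (1.0 - e_hat) / n + z * z / (4.0 * n * n))
    scale = n / (n + z * z)
    return scale * (centre - half), scale * (centre + half)

# n = 550 is a hypothetical test-set size, not taken from the thesis.
lo, hi = wilson_interval(0.329, 550)
```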

For the assessment of the detection performance of discrete Profile HMMs, the scores obtained by aligning the sequences of the complete corpus to the appropriate superfamily models were analyzed. The complete SUPERFAMILY hierarchy of SCOP sequences with a maximum of 95% residue-level similarity contains approximately 8 000 entries. The number of false predictions was determined by means of the E-values generated by SAM. For a general overview of the detection performance of discrete Profile HMMs, figure 4.3 summarizes the results of all alignments of SCOP sequences to all 16 superfamilies in a single ROC curve. Since detection decisions are based on absolute scores, which are analyzed regarding their statistical significance, the number of false predictions can usually be limited to some reasonable number. Thus, a working area corresponding to such an exemplary limitation is separately shown as a gray shaded rectangle. It can be seen that the generally critical performance measured for the classification task is problematic for the detection task as well. The number of false positive predictions remains rather high for reasonable numbers of false negatives. This is especially problematic for drug discovery applications because of the implied increased costs for useless further investigations within the drug design pipeline. Additionally, the number of false positives cannot be reduced until the number of false negative predictions increases significantly, which is again problematic for drug discovery because suitable candidate sequences might be missed.

[ROC plot: # false positives vs. # false negatives for the SCOPSUPER95_66 corpus using discrete Profile HMMs (SAM); an inset magnifies the working area.]

Figure 4.3: ROC curve illustrating the combined results for remote homology detection based on the SCOPSUPER95_66 corpus using discrete Profile HMMs (SAM). The grey shaded rectangle highlights the detection performance for the biologically most relevant working area when limiting the number of false predictions reasonably.

For a further family-specific analysis of the detection performance, the detection results are inspected individually for every superfamily contained in the SCOPSUPER95_66 corpus.


Generally, the results are comparable for all superfamilies (cf. figure 4.4; for clarity the presentation of the ROC curves is split into two diagrams). Some exceptions exist, performing either better than average (a.39.1, b.6.1, b.47.1) or worse (b.1.1, b.40.4, c.2.1). The analysis of these exceptional cases regarding the sequence similarities of the appropriate training and test sets, as well as regarding the appropriate number of training samples, does not reveal any peculiarities. In figure 4.5 the histograms of sequence similarities for the exceptional superfamilies mentioned above are shown individually, illustrating the generally uniform distribution of the sequence similarities for the specific superfamilies. Thus, the reason for the non-average performance seems to be intrinsic. In order to judge the effectiveness of certain target detection methods for practical applications within e.g. pharmaceutical tasks, usually some characteristic values are extracted from the appropriate ROC curves. When fixing the maximally allowed number of false predictions of one kind (e.g. false positives), the corresponding number of false predictions of the other kind (here false negatives) is a good measure. Usually, the percentage of false predictions is set to 5%. Table 4.3 summarizes these characteristic values, illustrating the still rather weak performance of current discrete Profile HMMs for remote homology detection tasks.

False Negative Predictions [%]    False Positive Predictions [%]
for 5% False Positives            for 5% False Negatives
26.1                              57.6

Table 4.3: Characteristic values for SCOPSUPER95 66 detection experiments: At fixed working points of the ROC curve allowing 5% false predictions, the numbers of corresponding false predictions are given. It can be seen that improvements are demanded since rather large percentages of false predictions occur.
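Reading off such a characteristic value from a ROC curve amounts to a simple search over its operating points; a minimal sketch with made-up counts:

```python
def fn_rate_at_fp_rate(points, n_targets, n_nontargets, fp_rate_max=0.05):
    """Given ROC operating points (fn, fp) as absolute counts, return the
    smallest false-negative rate reachable while the false-positive rate
    stays at or below fp_rate_max (e.g. the 5% working point)."""
    feasible = [fn / n_targets for fn, fp in points
                if fp / n_nontargets <= fp_rate_max]
    return min(feasible) if feasible else None

# Invented operating points: (false negatives, false positives).
points = [(0, 3), (1, 1), (2, 0)]
rate = fn_rate_at_fp_rate(points, n_targets=10, n_nontargets=20)
# With at most 5% false positives (fp <= 1), the best reachable
# false-negative rate here is 1/10 = 0.1.
```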

4.2 Improving the Quality of HMM Based Sequence Analysis

After the detailed assessment of the general capabilities of the currently most promising technique for probabilistic modeling of protein families, whose results were presented in the previous section, the remaining parts of this thesis present new concepts for improving the basic method. Obviously, such improvements are needed because even when using the most sophisticated sequence analysis technique, namely Profile HMMs, the classification as well as the detection performance for a typical sequence analysis task is far from satisfying. Generally, three basic issues relevant for the successful application of new, powerful probabilistic models capturing the essentials of protein families for remote homology analysis can be formulated:

1. According to the assessment of the capabilities of state-of-the-art Profile HMMs for both classification and detection tasks, general improvements for both application types are demanded (General Performance Improvement).

2. Since one of the major applications of remote homology analysis using Profile HMMs is the detection of new members of therapeutically relevant protein families within drug discovery tasks, the number of sample sequences available for robust model estimation is usually limited. Thus, enhanced probabilistic protein family HMMs need to address this so-called Sparse Data Problem.

[Two ROC plots: # false positives vs. # false negatives, shown per superfamily for the SCOPSUPER95_66 corpus.]

Figure 4.4: Results for remote homology detection based on the SCOPSUPER95_66 corpus using discrete Profile HMMs (SAM), shown individually for all 16 superfamilies by means of ROC curves. For clarity the illustration is split into two diagrams.

[Two histograms: percentage of sequences per similarity range (0-10% up to 90-95%) for the superfamilies a.39.1, b.47.1, b.40.4, b.6.1, b.1.1, and c.2.1; panel (a) training sets, panel (b) test sets.]

Figure 4.5: Individual histograms (training/test) of similarity ranges for selected superfamilies with non-average detection performance, either significantly better (a.39.1, b.6.1, b.47.1) or worse than average (b.1.1, b.40.4, c.2.1).

3. Nowadays, remote homology detection is usually performed by exhaustively screening large sequence databases for target families of interest. Since the amount of data to be analyzed is already huge and still constantly growing, efficient evaluation of protein family HMMs is required (Efficient Model Evaluation).

In order to address these issues, in the following concepts for advanced stochastic protein family modeling using Hidden Markov Models will be presented. First, the focus lies on a general description of the concepts. After this general presentation, detailed explanations of the new approaches addressing the three issues mentioned above will be given in chapter 5.

4.2.1 Semi-Continuous Feature Based Modeling

The general principle of molecular biology which justifies sequence comparison based research activities is that the sequence of amino acids mainly determines the three-dimensional structure and thus the function of proteins. Based on this, sophisticated string processing techniques were developed that allow the biological function of newly sequenced proteins to be predicted by aligning their primary structure to sequences whose functions are already known. The majority of protein classification techniques is based on direct sequence-to-sequence or sequence-to-family alignment of raw amino acid data. However, especially for protein families whose members share only minor sequence similarities, the classification and detection of new family members is still a challenging problem. Obviously (cf. the results of the assessment in the previous section), even the most sophisticated string processing techniques are not powerful enough for those remote homologue sequences. Since for several applications, like target identification at the beginning of the drug discovery pipeline, principally no further information regarding the proteins of interest is available (e.g. their secondary structure), improved treatment of the data actually available seems to be the only option for better remote homology detection. The three-dimensional structure of proteins, which according to the foundations of molecular biology (cf. chapter 2) determines their function, results from a complex folding process. Unfortunately, so far this process of spatially arranging the proteins' atoms is not completely understood. However, the particular angles of the backbone of a protein are determined by the local biochemical properties of the underlying sequence of amino acids. Thus, the actual three-dimensional shape of a protein depends on properties like the hydrophobicity or electric charge of amino acids in their local context.
Certainly, the biochemical properties are compactly summarized by the 20 standard amino acids, but the number of such properties determined over the years in countless wet-lab investigations is much higher than 20. Thus, exclusively analyzing amino acid symbols and, furthermore, neglecting their local neighborhood seems rather critical. One central concept for enhanced probabilistic protein family HMMs addressed by this thesis is the explicit consideration of the biochemical properties mentioned above for sequence analysis. Therefore, signal-like representations of biological data will be developed

which are richer than the standard representation using sequences consisting of the 20 standard amino acids. By means of this numerical data, features will be extracted which are relevant for the family affiliation of protein data. Here, the huge arsenal of powerful signal-processing and pattern recognition techniques can be applied. Such methods, which are very common for tasks of general pattern analysis, are currently not applicable to state-of-the-art Profile HMMs because these models are based on discrete symbolic data. When Hidden Markov Models are applied to tasks where feature vectors originating from a principally continuous feature space are processed, discrete models are not suitable. The reason for this is the quantization error which inevitably arises when mapping continuous data to a symbolic representation. Instead, continuous modeling approaches as described in section 3.2.2 (pp. 46ff) directly integrate a mixture density representation of the continuous feature space. The mixture components are evaluated when processing feature vectors, delivering continuous emissions. Generally, continuous Hidden Markov Models are the methodology of choice for HMM based pattern recognition tasks where feature vectors are processed. Thus, in addition to the richer sequence representation of biological data, continuous Profile HMMs will be developed. These models retain the general three-state topology (cf. figure 3.14 on page 58) but their emissions are based on a mixture density representation of the continuous space of the new feature vectors. Since the amount of family specific training data is usually rather small for e.g. target identification in drug discovery tasks, the developments will focus on a variant of continuous HMMs, namely semi-continuous Profile HMMs.
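A minimal sketch of such a signal-like representation: a biochemical property is averaged over a sliding window so that each position reflects its local context rather than a single symbol. The property table is an illustrative subset of the Kyte-Doolittle hydropathy scale; the window size and the choice of a single property are assumptions, not the feature extraction actually developed in this thesis:

```python
# Kyte-Doolittle hydropathy values for a few residues (illustrative subset;
# the full scale covers all 20 standard amino acids).
HYDROPATHY = {"A": 1.8, "I": 4.5, "L": 3.8, "V": 4.2,
              "D": -3.5, "E": -3.5, "K": -3.9, "R": -4.5, "G": -0.4}

def hydropathy_profile(sequence, window=5):
    """Map an amino acid string to a signal-like numerical representation
    by averaging the property over a sliding window (truncated at the
    sequence boundaries)."""
    values = [HYDROPATHY[aa] for aa in sequence]
    half = window // 2
    profile = []
    for i in range(len(values)):
        ctx = values[max(0, i - half): i + half + 1]
        profile.append(sum(ctx) / len(ctx))
    return profile
```

In the thesis' setting, several such property signals would be stacked into feature vectors per sequence position.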
By means of such models, especially small training sets are efficiently exploited, which makes them attractive for remote homology analysis. The estimation of semi-continuous HMMs can principally be divided into two parts, namely the estimation of the feature space representation and the actual model optimization. In fact, this separation is the basic advantage of semi-continuous modeling, which can be exploited for robust estimation of protein family models using small family specific sample sets. The mixture density representation can thereby be obtained using general feature data, i.e. protein data which are not specifically assigned to the particular target family. In order to focus the resulting general feature space representation on a particular target family, mixture density adaptation can be performed. Various techniques were proposed for general feature space adaptation, e.g. MAP- or MLLR-adaptation. In this thesis, semi-continuous protein family models are obtained using a mixture density based feature space representation which is estimated using general protein data. This considerable amount of sequences originates from one of the public sequence databases. Since the statistical base for general mixture density estimation is substantially larger than the amount of actual target family specific training data, the protein feature space can be represented suitably. The specialties of a target family of interest are respected by family specific weights of the mixture components, which can be trained using little training data. Furthermore, this protein feature space representation is focused on a particular target family by applying mixture density adaptation techniques.
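The division of labor between a shared codebook and family-specific weights can be sketched as follows; the one-dimensional codebook densities and the weight vectors are invented for illustration only:

```python
import math

def gaussian(x, mean, var):
    """Univariate normal density."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Shared codebook: mixture components estimated once from general protein
# data (means and variances here are made up for illustration).
CODEBOOK = [(-3.0, 1.0), (0.0, 1.0), (3.0, 1.0)]

def emission(x, state_weights):
    """Semi-continuous emission: every state shares the codebook densities
    and differs only in its mixture weights, which can be trained robustly
    from small family-specific sample sets."""
    return sum(w * gaussian(x, m, v)
               for w, (m, v) in zip(state_weights, CODEBOOK))

# A state tuned to high feature values puts most weight on the last component:
p = emission(2.8, [0.1, 0.2, 0.7])
```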
According to the three basic issues relevant for the successful application of new probabilistic models as defined in the previous section, the approach of feature based (semi-)continuous modeling principally addresses the general performance improvement and partially the sparse data problem.

96 4.2 Improving the Quality of HMM Based Sequence Analysis

4.2.2 Model Architectures with Reduced Complexity

The analysis of the technical procedure performed for the representative experiments of superfamily based remote homology analysis as described in section 4.1 reveals one principal problem of current Profile HMMs. In order to best cover the sequences belonging to a particular protein family, very complex models are used. Robustly establishing such models requires rather large amounts of sample data. Even with the most sophisticated regularization techniques, like Dirichlet mixture modeling of prior knowledge regarding amino acid distributions, the models' performance for remote homology detection is not satisfying. Sequences summarized in protein families which are modeled by Profile HMMs are usually rather long. This is also the case when modeling the smallest functional protein unit – the protein domain. Thus, consensus strings consisting of a hundred or more residues are very common (cf. the average lengths of the sequences belonging to the superfamilies of the SCOPSUPER95_66 corpus shown in table 4.1, which are reasonable indicators for the lengths of the consensus strings). Usually, the complete protein family is modeled using a single Profile HMM. According to the consensus' length, these models contain large numbers of states (e.g. the Immunoglobulin – b.1.1 – model contains 300 states). For the flexibility that is necessary especially for the analysis of related sequences, which can be highly divergent in both length and residue constitution, the general three-state Profile HMM architecture is very common. Every node of such a model represents a single column in the consensus string and contains three specialized states: Match, Insert, Delete. According to the general paradigm of sequence analysis using Dynamic Programming techniques this is straightforward.
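A rough back-of-the-envelope count illustrates why such models are data-hungry; the accounting of nine transitions per node is an approximation for illustration, not the exact SAM parameterization:

```python
def profile_hmm_parameter_count(consensus_length, alphabet=20):
    """Approximate free-parameter count of a three-state Profile HMM:
    each node carries a Match and an Insert state with discrete emission
    distributions over the amino acid alphabet, plus transitions
    (approximated as 3 outgoing arcs from each of Match/Insert/Delete)."""
    emissions = consensus_length * 2 * alphabet   # Match + Insert emissions
    transitions = consensus_length * 9
    return emissions + transitions

# e.g. a 300-state-column model like the Immunoglobulin (b.1.1) family:
n = profile_hmm_parameter_count(300)
# 300 * 40 + 300 * 9 = 14700 parameters to estimate from limited samples
```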
However, in order to achieve the necessary flexibility when processing long sequences, principally large numbers of HMM parameters need to be trained, which requires substantial amounts of example data. In order to alleviate the sparse data problem, protein family models with reduced architectural complexity are investigated with respect to their general performance for successfully classifying protein family affiliations of sequence data. Due to the newly developed feature based sequence representation, the complex three-state model topology becomes dispensable, which opens new chances for reducing the overall number of states necessary for complete protein family modeling using HMMs. Compared to general pattern recognition applications using HMMs, modeling large and diverging parts in a single model is rather uncommon. The analogy for speech recognition tasks would be to establish word models using single HMMs. Especially for languages containing a huge inventory of words, this level of modeling is critical because the enormous amount of suitable training data required for robust model estimation is rarely available. Thus, usually substantially smaller parts of words are captured by HMMs (triphones, representing a phoneme in its local neighborhood). Based on such so-called sub-word units (cf. e.g. [Fin03, pp. 152ff]), complex models for the description of complete words and compounds are created by concatenation of the basic parts. Principally, such modeling approaches are the methodology of choice for the majority of current general HMM based pattern recognition applications. In order to tackle the sparse data problem defined in the previous section, modeling approaches based on smaller protein units are investigated in this thesis. Generally, these units are comparable to motifs (cf. section 3.2.2 on page 63 for motif based HMMs for protein

sequences). However, they will not be estimated on the raw amino acid data but using the feature based sequence representation developed in this thesis. In a completely unsupervised and data-driven approach, relevant parts of protein families are determined and modeled using less complex and tighter models. These low-level and feature based building blocks of the protein family will be called Sub-Protein Units (SPU). Compared to standard protein family modeling using (global) Profile HMMs, in the SPU based approach only the absolute essentials relevant for (i.e. conserved within) the sequences belonging to the particular protein family are considered. These essentials are captured by models with reduced complexity containing significantly fewer parameters. Due to the reduced number of parameters, the amount of training data required for robust model estimation can be reduced, too. For complete protein family modeling, SPUs will be appropriately concatenated. Both approaches for estimating robust probabilistic protein family models containing significantly fewer parameters, i.e. the global feature based modeling technique and the SPU based approach, will be compared to the standard discrete Profile HMM modeling, to the semi-continuous feature based Profile HMMs, and among each other. Finally, the decision for the best suited variant will be discussed.

4.2.3 Accelerating the Model Evaluation In the "Introduction for the impatient" of the manual for the Wise tools, one commonly used framework for general sequence analysis using HMMs, the currently widespread opinion of the research community regarding the efficiency of model evaluation is characterized by the following (hypothetical) dialog between a Wise user and the developers [Bir01]:

“[Question:] It goes far too slow [Answer:] Well . . . I have always had the philosophy that if it took you over a month to sequence a gene, then 4 hours in a computer is not an issue.”

Most current HMM based approaches to sequence analysis were developed with exclusive respect to the general method, i.e. neglecting the efficiency of the actual model evaluation. Contrary to the argumentation of the developers of Wise as cited above, "4 hours in a computer" is indeed an issue. As discussed in chapter 1, modern molecular biology has been strongly influenced by the paradigm shift brought about by computational sequence analysis. The gain in biological knowledge obtained by broad screening of large databases for particular protein families became possible through powerful and efficient techniques like BLAST. Thus, when applying more sensitive and powerful techniques like Profile HMMs, efficiency is important, too. Some of the present techniques are run on specialized, distributed hardware solutions (e.g. HMMER was ported to the massively parallel Paracel GeneMatcher architecture). Although by means of such "brute-force" accelerations HMM based sequence analysis can be performed on large databases, the principal problem of inefficient model evaluation remains. Increasing the computational power for faster evaluation only treats the symptoms. Especially when more complex procedures for emission estimation, like feature based approaches, are applied to improve the classification accuracy for remote homologue sequences, algorithmic accelerations of the model evaluation are required.

98 4.2 Improving the Quality of HMM Based Sequence Analysis

Inspired by general pattern recognition applications of Hidden Markov Models, in this thesis concepts for accelerating the evaluation of protein family models are adopted and transferred to the bioinformatics domain. Following the argumentation given above, the focus of such acceleration techniques lies on algorithmic changes within the evaluation process. For annotation tasks (e.g. for drug target validation), currently multiple Profile HMMs are evaluated sequentially, i.e. every query sequence is aligned serially to every Profile HMM considered and the classification decision is determined by alignment score comparison. Such tasks are generally comparable to automatic speech recognition applications where signal parts are classified with respect to a fixed (large) inventory of words. Especially online speech recognition is not possible when performing sequential model evaluation combined with a posterior decision as in the bioinformatics case. Instead, all models are evaluated in parallel, which makes the overall process much faster when using sophisticated model combinations via combined state spaces and pruning techniques. In this thesis, protein family models are similarly evaluated in parallel and certain pruning techniques are applied, allowing for fast model evaluation. Furthermore, effective pruning techniques are applied which significantly limit the computational effort for model evaluation on average. The basic motivation for such pruning techniques is the observation that complex Hidden Markov Models (like Profile HMMs) capture very different local properties of the data. However, when evaluating the models using the Viterbi or the Forward (Backward) algorithm, large numbers of possible paths through the state space are analyzed. Paths covering certain local characteristics are very probable when observing such data, but paths representing alternative local characteristics are very improbable to match this data.
However, although they hardly contribute to the global solution, they are also considered in the general evaluation scheme. Pruning such "irrelevant" paths accelerates the model evaluation significantly. In fact, when combining multiple models into a common state space for parallel protein family HMM evaluation and applying pruning techniques, significant parts of complete models can be skipped during evaluation. Efficient model evaluation techniques generally address the third basic issue relevant for the successful application of enhanced probabilistic models for protein sequence analysis (cf. page 92). In this thesis it is considered in all parts of the developments.
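The effect of beam pruning in a combined state space can be sketched in a single Viterbi time step; the state labels, the beam width, and the log-domain scoring are illustrative assumptions:

```python
def viterbi_beam_step(active, transitions, emit, observation, beam=10.0):
    """One time step of beam-pruned Viterbi search in a combined state space.
    'active' maps state -> log score of its best partial path; successors
    falling more than 'beam' below the locally best score are dropped, so
    improbable paths through irrelevant models are abandoned early."""
    successors = {}
    for state, score in active.items():
        for nxt, log_a in transitions.get(state, []):
            s = score + log_a + emit(nxt, observation)
            if nxt not in successors or s > successors[nxt]:
                successors[nxt] = s
    best = max(successors.values())  # assumes at least one successor exists
    return {s: v for s, v in successors.items() if v >= best - beam}

# Toy example: from state 0, state 2 is reached via a very unlikely arc
# (log prob -20) and therefore falls outside the beam of width 10.
transitions = {0: [(1, 0.0), (2, -20.0)]}
survivors = viterbi_beam_step({0: 0.0}, transitions,
                              emit=lambda s, o: 0.0, observation=None)
# survivors == {1: 0.0}; the path into state 2 has been pruned
```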

Overview of the Concepts The basic concepts for general improvements of HMM based sequence analysis which are developed in this thesis are graphically summarized in figure 4.6. The fundamental approach for improved classification performance of protein family HMMs, namely the feature based sequence representation, is illustrated in the upper frame. For all considered sequences of amino acids, relevant feature vectors based on biochemical properties of the sequences are extracted. The resulting high-dimensional and principally continuous feature space is represented using a mixture density (from left to right). Based on the new feature representation, semi-continuous Profile HMMs are developed, which is shown in the next frame (second from top). The model topology of Profile HMMs is kept fixed while the discrete emissions are substituted with semi-continuous values obtained using the mixture density representation of the feature space and the feature vectors extracted from the appropriate protein sequences.

99 4 Concepts for Improved HMM Based Sequence Analysis

[Figure 4.6, four frames from top to bottom: *) Rich feature based protein sequence representation (protein sequences → feature vector representation → mixture density representation of the feature space); 1) Feature based semi-continuous Profile HMMs (three-state Profile HMM architecture with semi-continuous emissions for Match and Insert states based on the mixture density representation); 2) Protein family HMMs with reduced model complexity (complete protein family PF, Sub-Protein Units SPU1/SPU2, and weaker conserved parts "General", conceptually joined by pseudo-states shown as filled squares); *) Efficient model evaluation (parallel evaluation of all models in a combined state space, accelerated by pruning techniques that skip non-relevant Viterbi paths early). The two concepts marked "*)" are the fundamentals for 1) and 2).]
Figure 4.6: Concepts for improved HMM based sequence analysis developed in this thesis (see text for explanation).


In the third frame, models with reduced complexity are shown, either for the whole protein family of interest (PF) or for Sub-Protein Units (SPU1 and SPU2) and weaker conserved parts of the protein family (General). All models are based on, and only possible due to, the new feature representation. Generally, variants of a certain protein family or of parts of it are possible, which are evaluated in parallel. The particular models are therefore conceptually combined using so-called pseudo-states (filled squares). These pseudo-states gather the transitions from any variant to any other. The lower frame illustrates the overall acceleration of the model evaluation using a combined state space and general pruning techniques. The states of three exemplary models are combined in the common state space, and during evaluation irrelevant paths (black circles) are skipped for further analysis. Only the states marked red are actually evaluated.

4.3 Summary

In the previous sections, the performance of state-of-the-art discrete Profile HMMs for superfamily modeling was evaluated. It could be seen that remote homology analysis is still a challenging task. The classification error of almost 33 percent is problematic for target validation, and the detection performance of the models is insufficient, too. The focus of the assessment was on the evaluation of the general method. Iterative model estimation techniques (cf. SAM's target98 approach in [Kar98]) are based on this approach. Thus, if the performance of general Profile HMMs can be improved, iterative techniques will benefit from the new methods, too. Following the assessment of the capabilities of state-of-the-art Profile HMM techniques, concepts for improved HMM based sequence analysis were presented. Three basic issues relevant for such improvements are addressed by these concepts:

1. General Performance Improvement,

2. Sparse Data Problem, and

3. Efficient Model Evaluation.

Based on a richer feature representation of the protein sequences, semi-continuous Profile HMMs were proposed. Compared to discrete Profile HMMs, their semi-continuous counterparts capture the biochemical properties of residues in their local neighborhood much better. Additionally, approaches for less complex model architectures were presented, allowing robust protein family model estimation even when using only small training sets. Because of its major importance for high-throughput screening tasks on large databases, the acceleration of the model evaluation at the algorithmic level was proposed. By means of a parallel model evaluation scheme in combination with state space pruning techniques, fast model evaluation will become possible. In this chapter, the concepts for improved HMM based protein sequence analysis were presented in general. According to these concepts, in the following chapter the approaches and methods developed are presented in detail. All described methods were implemented in a prototypical HMM framework (the GRAS2P system; cf. [Plö02]). By means of this toolkit, the effectiveness of the new approaches was evaluated using representative sequence analysis tasks. The results of these experimental evaluations will be presented in chapter 6.


5 Advanced Probabilistic Models for Protein Families

Common practical experience of molecular biologists with even the most sophisticated state-of-the-art probabilistic sequence analysis techniques leads to the conclusion that current approaches to remote homology detection, as summarized in chapter 3, have reached their limits. The results of an experimental evaluation for a representative sequence comparison task using Profile HMM based stochastic models of protein families (cf. section 4.1) demonstrate that improved techniques are required.

The goal of this thesis is the development of enhanced probabilistic methods for general improvements of sequence analysis results. Currently, the development of the most promising approaches is very goal oriented, which means that several concepts are almost exclusively influenced by the actual biological task. One prominent example is the complex Profile HMM architecture with its three different kinds of states, which reflects the traditional sequence alignment including insertions, deletions, and substitutions of sequence residues. Another example is model regularization using mostly manually designed background distributions (cf. Dirichlet mixture based regularization in section 3.2.2 on page 59). Unfortunately, the generalization capabilities of such techniques, i.e. their effectiveness for obtaining really new knowledge regarding protein relationships, are apparently limited (cf. section 4.1). Since protein sequence analysis can generally be understood as a pattern recognition problem, where more or less modified occurrences of patterns need to be assigned to the correct classes, the approaches of this thesis address the incorporation and adaptation of general pattern recognition techniques into the bioinformatics domain. By taking a more abstract view of the protein sequence analysis task during modeling, enhanced probabilistic models for protein families become possible.
The general concepts for enhanced protein family models were presented in the previous chapter. Based on this, detailed explanations of the newly developed techniques are given in this chapter addressing the three basic issues as formulated in section 4.2:

1. General Performance Improvement,

2. Sparse Data Problem, and

3. Efficient Model Evaluation.

According to the concepts developed, this chapter is organized as follows: The fundamental idea for all developments in this thesis is a feature based protein sequence representation. In section 5.1 the approach for obtaining a rich sequence representation is presented. Based on this, section 5.2 deals with techniques for robust parameter estimation of feature based Profile Hidden Markov Models and with new application concepts for such enhanced models. Following this, and based on the feature representation of protein sequences, in section


5.3 the complex model architecture of Profile HMMs is discarded and replaced by topologies with reduced complexity. The focus is on an alternative model architecture, namely Bounded Left-Right models. Furthermore, a new concept of protein family modeling using small building blocks, so-called Sub-Protein Units, which are extracted automatically and in an unsupervised manner from training samples, is discussed. Approaches for a general acceleration of the evaluation of protein family HMMs are presented in section 5.4. The assignment of the new developments to the three basic issues relevant for the successful application of new protein family modeling approaches is given at the appropriate passages in the text. Descriptions of some of the new approaches developed for enhanced probabilistic protein family modeling can also be found in [Plö04].

5.1 Feature Extraction from Protein Sequences

As an example of protein grouping into (super-)families or folds, the family of kinases contains enzymes that transfer phosphate groups from, e.g., Adenosine triphosphate (ATP) to a specified substrate or target.1 Such family affiliations are in fact caused by functional similarities of the particular proteins (here the transfer of phosphate groups). Generally, the biological function of a protein is determined by its three-dimensional structure, which is mainly influenced by the underlying linear sequence of amino acids. According to this well-known and accepted theory, protein sequence analysis is performed using primary structure data.

However, a large number of proteins exists which are functionally similar but whose similarities at the primary-structure level are rather weak. This phenomenon can be observed at almost every level of granularity when analyzing protein data. Even when considering the smallest functional units usually modeled by Profile HMMs, namely protein domains, sequence similarity is not always given for the whole chain of amino acids. In fact, these remote homologues are problematic for the majority of current sequence analysis techniques.

In order to generally improve the performance of current probabilistic sequence analysis techniques, which was defined as the first basic issue for enhanced HMM based protein family modeling (cf. section 4.2), the central aspect of the developments described in this thesis is the explicit consideration of biochemical properties of the protein families during modeling. Compared to the standard description of protein data using the underlying chains of amino acids, a rich feature based protein data representation is used for establishing stochastic models of protein families. In the following, this feature extraction approach is presented.

5.1.1 Rich Signal-Like Protein Sequence Representation

In section 2.2 the biochemical composition of proteins was described. Proteins are macromolecules composed of 20 standard amino acids. The amino acids differ in the side chain attached to the Cα atom. Due to these different side chains, the amino acids themselves vary in their biochemical properties like hydrophobicity, electric charge etc. The biochemical

1This process is termed “phosphorylation”. Typically, the target is “activated” or “energized” by being phosphorylated (cf. [Str91, p.368f]).

properties of protein domains, and furthermore of the complete protein, are determined via the local combination of the residues' properties.

The symbolic description of protein data using amino acid sequences summarizes the general biochemical properties of the residues. Every amino acid has its specific biochemical characteristics. These numerous specifics are only implicitly considered by discriminating between the 20 standard amino acids. However, neither local contextual relationships between residues nor specific mutual relationships between actual biochemical properties, which might be important for the overall function of the protein (domain), are considered.2 It seems unrealistic that all biochemical properties change radically from one residue to the next, as the juxtaposition of different amino acid symbols suggests. Instead, less abrupt changes and especially higher level mutual relationships between different biochemical properties are to be expected.3 One hypothetical example could be that from one residue to the next the hydrophobicity does not change substantially while the electric charge does. On these premises, an alternative protein sequence representation is developed which is based on protein specific primary structure data and general knowledge about biochemical properties of amino acids. Biologically meaningful biochemical properties of residues are explicitly considered.

Protein Sequence Encoding

Basically, the biochemical characteristics of amino acids have been well investigated throughout the years. Numerous researchers performed countless wet-lab experiments measuring various properties of the standard amino acids. In the literature, some general sequence analysis techniques have been described that exploit selected individual biochemical properties (cf. section 3.3.1 on page 75). However, most of these techniques use biochemical properties for more or less technical reasons (like applying spectral analysis based on a numerical sequence mapping), as an alternative but completely equivalent representation compared to the conventional amino acid sequences. Mostly, no further information is incorporated into the protein data representation since the 20 standard amino acids are mapped to 20 (or slightly fewer) different numerical values.

It is basically impossible, even for very experienced molecular biologists, to determine a single biochemical property that is exclusively responsible for protein family affiliations. Instead, multiple properties are responsible for the biological function of proteins and thus for their family affiliation. Certainly, hydrophobicity, electric charge, and residue size are rather important for the three-dimensional structure and the surface properties of proteins, but other biochemical properties are important, too.

Biochemical properties of amino acids are usually collected in so-called amino acid indices. Such indices contain mapping rules from every amino acid to numerical values representing the appropriate biochemical property.

2Profile HMMs for protein (domain) families cover global contextual relationships of the residues via the model architecture. However, the emissions are determined completely neglecting any residual neighborhood.

3For standard pairwise sequence analysis techniques, substitution groups are implicitly defined using substitution matrices, which alleviates the abrupt character of symbol changes by assigning similar scores to similar amino acids.

By exploiting these indices, amino acid sequences are numerically encoded, resulting in “signal”-like representations. As previously mentioned (section 3.3.1 on page 75), Shuichi Kawashima and Minoru Kanehisa compiled a large collection of such amino acid indices [Kaw00]. For a rich protein sequence representation, in the new approach presented here the particular amino acids are mapped to multiple properties carefully selected from the almost 500 indices. The selection was performed in cooperation with several biologists, i.e. incorporating expert knowledge. It turned out that 35 indices are sufficient for describing the biochemical properties of amino acids which are assumed most relevant for family affiliation. The authors of the amino acid index compilation additionally performed a clustering of their indices into certain categories of amino acid properties:

• Hydrophobicity indices,

• Composition indices,

• Indices covering α and turn propensities,

• β propensity indices,

• Indices for physiochemical properties, and

• Indices covering other properties.

The 35 indices selected for the biochemical property based sequence representation developed in this thesis cover these clusters reasonably well. A more detailed description of the amino acid indices used for sequence mapping can be found in appendix C.

As the result of the mapping procedure, protein sequences are encoded into multi-channel signal representations, i.e. given the linear chain of amino acids of a particular protein, its biochemical properties are represented in a 35-channel signal of the same length. Note that all components of the resulting feature vectors, i.e. the numerical mappings for all channels, are normalized to the range [−1 ... 1], which is mandatory for further processing.

The general process of representation change by amino acid mapping is graphically summarized in figure 5.1. It is the base for all further developments, namely the extraction of relevant features describing the essentials of protein sequences.
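The index-based mapping and the subsequent [−1 ... 1] normalization can be sketched in a few lines. This is a minimal illustration only: the index values below are invented stand-ins, not real AAindex entries, and the helper names (`normalize`, `encode`) are our own.

```python
import numpy as np

# Hypothetical mini-table with two of the 35 channels; the numbers are
# illustrative stand-ins, not real AAindex entries.
INDICES = [
    {"A": 1.8, "C": 2.5, "D": -3.5, "G": -0.4, "K": -3.9},  # e.g. a hydrophobicity scale
    {"A": 0.0, "C": 0.0, "D": -1.0, "G": 0.0, "K": 1.0},    # e.g. electric charge
]

def normalize(index):
    """Rescale one amino acid index linearly to the range [-1, 1]."""
    lo, hi = min(index.values()), max(index.values())
    return {aa: 2.0 * (v - lo) / (hi - lo) - 1.0 for aa, v in index.items()}

def encode(sequence, indices=INDICES):
    """Map a sequence to a (channels x length) signal-like representation."""
    tables = [normalize(ix) for ix in indices]
    return np.array([[table[aa] for aa in sequence] for table in tables])

signal = encode("ACKGD")   # 2-channel "signal" of length 5
```

With the full set of 35 indices, the same procedure yields the 35-channel signal described in the text.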

Analysis of Local Neighborhoods

The previously described sequence encoding method subsumes information regarding biochemical properties of the amino acids from various sources in a multi-channel representation. Generally, when using state-of-the-art Profile HMMs for protein family modeling, two levels of residual context are considered. The classification result, i.e. the decision regarding the probable affiliation of the analyzed sequence to a particular protein family, is obtained using the complete sequence, i.e. the complete residual neighborhood. Thus, the global context is captured by the HMM. Contrary to this, for the estimation of the emission probabilities no residual context is used at all.

Neglecting any residual context for the estimation of the HMM state emissions seems rather critical, since the biochemical characteristics certainly do not abruptly change from



Figure 5.1: Overview of the approach for obtaining a biochemical property based protein sequence representation: Based on protein specific primary structure data (sequence at the upper part) and general knowledge about relevant biochemical properties of amino acids (left-hand side), the new representation is obtained by mapping the biochemical properties of residues to multiple numerical values (lower part). Note that all sequence and biochemical property data is hypothetical, not corresponding to reality, and for illustration purposes only.

one residue to another. Instead, they depend on their local environment, since adjacent residues strongly interact with each other. As one example, the electric charge of a residue depends on the electric charge of its immediate environment. Since amino acid indices, like the symbolic description of amino acids themselves, do not respect the residues' environment, these local characteristics are captured alternatively in the new approach for HMM state emissions.

In order to respect local signal characteristics already at the level of emission probabilities, local contexts of residues are considered in the feature extraction procedure. The emissions estimated on the basis of the local neighborhoods of a particular residue contain much more information since they cover the residues' environment. Note that the global context, i.e. the structure of the protein family data, is still captured by the appropriate HMM. Generally, the HMM now describes the structure of protein data on the basis of residues in their local neighborhoods. In the new approach, consecutive samples of the 35-channel signals are analyzed using a sliding window technique (extracting frames). These frames are used for short-length signal analysis.

There are two general parameters for configuring the sliding window based context analysis. First, the length of the context analyzed for emission estimation needs to be determined. Certainly, it is dependent on the actual data analyzed. Thus, it generally needs to be treated as a parameter to be learned during training. Unfortunately, especially for HMM based modeling there are no techniques available for such parameter training. However, in informal experiments a fixed window size of 16 could be determined heuristically, which is suitable for the majority of protein data. Thus, it was kept fixed for the general procedure. The sliding window containing the context of a particular residue is (almost) symmetrically organized, i.e. the context of an amino acid consisting of seven neighbors to the left and eight neighbors to the right is considered.


The second parameter determines the actual treatment of the sequence borders. Due to the (almost) centered context of a residue, at the beginning and at the end of a sequence the sliding window cannot be filled using real data – because in fact there is no data. This border problem is also known from general signal filtering approaches, e.g. within image processing applications when convolving images with specialized denoising filters. Since all residues have a special meaning for the particular protein, the borders cannot be skipped (as one common solution for image processing applications suggests), but three general “solutions” exist:

1. Zero-padding: The borders are filled with 0.

2. Distinct values-padding: The borders are filled with distinct values, optionally discriminating between the beginning and the end of the sequence.

3. Prior-padding: The sequence is extended at the beginning as well as at the end using prior knowledge about the general distribution of amino acids (i.e. filled with the amino acid wildcard 'X').

Note that all three options incorporate artificial, i.e. potentially erroneous, data into the protein sequence. Since the first two options seem to distort the sequence data too much, in the actual feature extraction approach the borders are treated using prior-padding. Generally, when modeling protein families the influence of the borders is not negligible, but also not dramatic. In figure 5.2 the sliding window approach, which is the prerequisite for the analysis of the residues' local neighborhoods, is schematically illustrated.
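The windowing just described can be sketched in a few lines. The function name `extract_frames` is our own, and the sketch assumes the prior-padding variant (the third option above) together with the asymmetric 7-left/8-right context from the text.

```python
def extract_frames(sequence, left=7, right=8, pad="X"):
    """Slide a (left + 1 + right)-residue window over the sequence; the
    borders are prior-padded with the amino acid wildcard 'X'."""
    width = left + 1 + right                         # window size 16
    padded = pad * left + sequence + pad * right     # pad both borders
    return [padded[i:i + width] for i in range(len(sequence))]

frames = extract_frames("ACDEFGHIKLMNPQRSTVWY")
# one 16-residue frame per residue; the central residue sits at offset 7
```

Each frame is subsequently mapped to its 35-channel numerical representation and processed further.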

5.1.2 Feature Extraction by Abstraction

The plain multi-channel signal-like representation of residues in their local neighborhoods as described in the previous sections is already a rich base for enhanced protein family models. It could be used directly as features for emission probability estimation in Profile HMMs. However, for remote homology detection the biochemical essentials of a particular protein family are of major interest. Thus, for good generalization, any putatively misguiding signal specialties in any of the channels encoding particular biochemical properties, i.e. those relevant only for a minority of sequences belonging to the family of interest, should be neglected.

In general pattern recognition applications, putatively misguiding characteristics of arbitrary signals evolving in time are usually identified and discarded by means of signal processing techniques. If a coarse but meaningful general shape of such a signal is desired, specific frequencies causing signal distortions can, e.g., be eliminated. Therefore, the signal of interest is usually transformed into a more convenient but equivalent frequency representation. Generally, the standard transformation for signal analysis applications is the well-known Fourier transformation, providing a signal representation based on frequency coefficients. Every coefficient of the resulting spectrum represents a certain frequency part of the original signal. Once the Fourier coefficients are estimated, further frequency based signal analysis, and thus filtering, can be performed very easily by explicitly changing particular coefficients, which results in the desired essentials of the processed signal. The spectral signal representation is completely equivalent to the original time-series based representation


Figure 5.2: Sketch of the sliding window approach for the analysis of residues' local neighborhoods: For every residue of a protein sequence (upper part) the biochemical properties are analyzed with respect to the 15 surrounding amino acids captured by a sliding window, illustrated by red-filled rectangles (middle part). At the borders of the sequence, the windows are filled using prior knowledge about general amino acid distributions (lower part, left and right border). The filled windows are subject to further processing, i.e. multi-channel amino acid mapping and feature extraction.

if all coefficients are considered. This means that the original signal can be reconstructed from the Fourier coefficients if all of them are used for the inverse function transformation.

Interpreting the multi-channel numerical protein sequence representation of a residue in its local neighborhood as a general signal evolving in time4, the coarse shape of this “protein signal” can generally be obtained as described above for arbitrary signals. In fact, by e.g. eliminating specific “frequencies” of the residual context signal, putatively misguiding peaks, i.e. biochemical properties which are not relevant for the protein in general, can be removed. The resulting coarse shape of the protein signal then represents the biochemical essentials of the underlying actual protein. Note that the signal analysis is performed channel-wise, i.e. every single channel encoding a specific biochemical property is analyzed separately.

While analyzing signals of protein sequences it turned out that the standard spectral analysis approach using the Fourier transformation as described above is not suitable for the biological signals subsumed in the frames covering the biochemical properties of residues' neighborhoods.
This function transformation assumes periodic signals of infinite length, which is in no way the case for the numerical representation of the biochemical properties of 16 adjacent amino acids. Furthermore, due to the rather artificial signal interpretation of the protein data, explicit signal analysis, e.g. the manual identification of specific frequencies not characteristic for the protein essentials, can hardly be performed. Instead, the abstraction from detailed signal shapes needs to be guided implicitly by the frequency transformation used for signal analysis.

4Conceptually, here time is substituted by the amino acids' positions within the protein sequence, i.e. here within the context window, as usual for Profile HMM based sequence processing.

A frequency transformation is required which does not rely on signal properties that are not fulfilled here, and which allows easy signal abstraction, e.g. by discarding the first or the last k coefficients according to the amount of information they represent with respect to the original signal. A transformation which almost perfectly fulfills these constraints is the Discrete Wavelet Transformation (DWT). After the Wavelet transformation the coefficients are ordered with respect to the amount of information they contain, which is one reason for the better suitability of the DWT compared to the traditional Fourier transformation. A detailed discussion of the DWT is far beyond the scope of this thesis; however, in appendix A the basic theory and some necessary essentials regarding Wavelets are described.

As usual for function transformations, the signal representation using all Wavelet coefficients is completely equivalent to the original time-series based signal – the original signal can be reconstructed without distortion. In addition to the previously mentioned improved suitability for protein signals, Wavelets are rather convenient for obtaining more abstract signals. In a Wavelet based signal representation, the coefficients are ordered according to their importance for the complete signal analyzed. First, the approximation coefficients represent the most relevant information necessary for the reconstruction of the original signal. Following these, detail coefficients describe signal specialties which generally do not contribute substantially to the coarse shape of the signal. Note that these detail coefficients do not necessarily correspond to a specific kind of frequencies, e.g. high frequencies.
By means of a Wavelet representation, the coarse shapes of the protein signals in any of the channels covering specific biochemical properties can easily be extracted by skipping a reasonable number of detail coefficients. For the extraction of relevant features from protein signals, a two-level Wavelet decomposition using a second order Daubechies filter (length four) is performed, i.e. in addition to the basic decomposition of the signal, the approximation coefficients obtained are further decomposed, which corresponds to a two-scale analysis (cf. appendix A on pages 197ff). The actual parameterization of the feature extraction approach was obtained in various informal experiments. Since pure signal analysis is performed (as opposed to signal detection), the actual choice of the Wavelet and scaling filter pair is of minor importance and a standard filter pair is used. Skipping the upper five detail coefficients is a straightforward choice when inspecting their average energy, which is usually very low for all channels. The remaining coefficients vary substantially for different proteins and channels and thus cover the main energy of the signals, i.e. the most relevant information contained in them. They will be used for further processing. In figure 5.3 the general Wavelet based signal decomposition, including the abstraction by discarding the upper five detail coefficients, is graphically illustrated for one channel of one frame extracted from a hypothetical protein sequence.

Applying the feature extraction method described above, residues of protein sequences are represented as coarse shapes of signals obtained by multi-channel numerical encoding and signal abstraction using the DWT. The 11 Wavelet coefficients for each of the 35 channels are concatenated to a 385-dimensional feature vector which represents the summary of the relevant biochemical properties of a residue in its local neighborhood.
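The two-level db2 decomposition and the truncation to eleven coefficients can be sketched as follows. This is a periodized implementation under our own conventions (circular border handling inside the transform, helper names `dwt_step` and `channel_features` are ours); the thesis does not specify these details, so treat it as an illustrative sketch only.

```python
import numpy as np

s3 = np.sqrt(3.0)
# Daubechies-2 (db2) analysis filters, length four
LO = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))
HI = np.array([LO[3], -LO[2], LO[1], -LO[0]])  # quadrature mirror filter

def dwt_step(x):
    """One periodized DWT step: (approximation, detail), each len(x)//2."""
    n = len(x)
    approx = np.empty(n // 2)
    detail = np.empty(n // 2)
    for k in range(n // 2):
        idx = [(2 * k + m) % n for m in range(4)]  # circular indexing
        approx[k] = np.dot(LO, x[idx])
        detail[k] = np.dot(HI, x[idx])
    return approx, detail

def channel_features(window):
    """Two-level db2 decomposition of a 16-sample channel window; keep
    A2 (4 coeffs), D2 (4 coeffs) and 3 of the 8 D1 coeffs -> 11 values."""
    a1, d1 = dwt_step(np.asarray(window, dtype=float))  # 8 + 8 coefficients
    a2, d2 = dwt_step(a1)                               # 4 + 4 coefficients
    return np.concatenate([a2, d2, d1[:3]])             # 11 coefficients
```

Applied to each of the 35 channels of one frame and concatenated, this yields the 385-dimensional vector described above.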
Processing such high-dimensional feature vectors is not feasible for remote homology detection applications where usually rather small training sets are available. Robust models



Figure 5.3: Illustration of the principle of the Wavelet based signal decomposition as performed for every channel of the frames extracted from protein sequences: 0 + 1) one frame from a hypothetical protein sequence including one channel of the numerical encoding; 2) result of the first stage of the two-scale Wavelet analysis, yielding approximation and detail coefficients; 3) result of the further decomposition of the first-level approximation coefficients; 4) concatenation of first and second stage coefficients; 5) final frame representation using eleven of 16 Wavelet coefficients.

based on a 385-dimensional feature representation can hardly be estimated. Thus, further reasonable dimension reduction needs to be performed.

Skipping the upper five Wavelet coefficients was quite straightforward. By means of informal experimental evaluations it turned out that the neighborhood of 16 residues can be described properly using eleven coefficients. Further reduction of the dimensionality could not be performed in this way. However, despite the careful selection of the 35 channels actually used for the description of the relevant biochemical properties of amino acids, redundancies within the multi-channel representation are more than likely. Furthermore, since the sliding window approach is performed with an overlap of 15/16 of the frame size, redundancies are to be expected here, too. Thus, further dimension reduction by automatic decorrelation seems possible without losing too much of the information describing the essentials of the protein data.

The standard procedure for automatic decorrelation and dimension reduction is the Principal Components Analysis (PCA). Here, an N-dimensional feature space is projected onto an M-dimensional subspace (M ≪ N) which covers the majority of the data variance.
This subspace is spanned by the M eigenvectors corresponding to the largest M eigenvalues of the covariance matrix of the sample data (the relevant theory of PCA is briefly summarized in appendix B). On these premises, the 385-dimensional feature vectors are projected onto their eigenspace. Inspecting the eigenvalue spectrum of the component-wise normalized data, it becomes clear that a compact representation in a 99-dimensional subspace is sufficient. Note that this substantial dimension reduction is only possible when

using the two-scale Wavelet analysis. Informal experiments showed that a single stage decomposition is less effective.

The overall procedure for feature extraction from protein sequences is summarized in figure 5.4. All further modeling approaches are based on this feature based protein sequence representation.
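The eigenspace projection can be sketched with a plain covariance eigendecomposition. This is a generic PCA sketch, not the thesis' actual implementation; the function name `pca_projection` and the toy dimensions are our own.

```python
import numpy as np

def pca_projection(X, m):
    """Project N-dimensional feature vectors (rows of X) onto the subspace
    spanned by the m eigenvectors of the sample covariance matrix that
    correspond to the m largest eigenvalues."""
    mean = X.mean(axis=0)
    Xc = X - mean                                      # center the data
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = eigvecs[:, np.argsort(eigvals)[::-1][:m]]      # top-m eigenvectors
    return Xc @ W, mean, W

# Toy data for illustration; in the thesis setting the rows of X would be
# the 385-dimensional concatenated Wavelet coefficient vectors and m = 99.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 20))
Y, mean, W = pca_projection(X, 5)
```

The projection matrix W is estimated once from training data and then applied to every 385-dimensional vector to obtain the final 99-dimensional features.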

Figure 5.4: Overview of the feature extraction method for obtaining a rich protein sequence representation covering the most relevant biochemical properties: By means of a sliding window technique, frames ~a = (a1, . . . , a16) are extracted containing the 15 adjacent amino acids of a particular residue (upper part). Every frame ~a is mapped to a 35-channel numerical representation covering the most relevant biochemical properties of proteins (middle-left part). The coarse shapes of such “protein signals” are extracted by applying a Discrete Wavelet Transformation and skipping the upper five coefficients for signal abstraction (middle-right). The concatenation of the resulting 385 Wavelet coefficients is projected onto the 99-dimensional eigenspace, resulting in the final feature vectors ~x (lower part).


5.2 Robust Feature Based Profile HMMs and Remote Homology Detection

In the previous section the new approach for obtaining a rich protein sequence representation based on feature extraction was presented. It is the central aspect of this thesis. In the following, by means of the new 99-dimensional vector representation of protein data, feature based Profile HMMs are developed. They represent a substantial enhancement of the basic method of protein family modeling using Profile Hidden Markov Models and serve as its replacement for a general improvement of remote homology detection.

Compared to the 20 discrete standard amino acid symbols, the new feature representation of protein data corresponds to a 99-dimensional continuous feature space. When processing feature vectors, generally continuous instead of discrete HMMs are used (cf. section 3.2.2 on page 46). However, current Profile HMMs are defined for discrete data only. Thus, for the feature based sequence representation, Profile HMMs need to be modified for a suitable processing of protein feature data. According to the general argumentation regarding proper model emission types, purely continuous models seem problematic because they require substantial amounts of training samples for robust model estimation: every state includes its own individual feature space representation which covers the specifics of the feature vectors assigned to it. Especially for the effective exploitation of small training sets, Xuedong Huang and colleagues proposed the concept of semi-continuous modeling, where a common feature space representation is shared by all states of the particular model [Hua89].
The attractiveness of semi-continuous HMMs is also due to the fact that a good estimate of the common feature space representation can be obtained in a separate estimation step using general feature data.5 Since target specific data is not necessarily required for obtaining the general feature space representation, usually larger sample sets are available, resulting in a more robust feature space representation. This separation of the estimation of a general feature space representation from position specific modeling is the basic advantage of semi-continuous HMMs, which can be exploited especially for the robust estimation of protein models using small sample sets. Note that the abovementioned separate model and feature space estimation is not mandatory for semi-continuous HMMs. Instead, both components can also be obtained from an integrated training procedure; however, the number of training samples required is then substantially larger. Thus, semi-continuous Profile HMMs are developed by explicitly exploiting the separate feature space estimation. In the following sections, first, an appropriate feature space representation using mixture components is presented. A general representation can be obtained using very effective standard estimation techniques. Following this, general semi-continuous Profile HMMs are introduced by means of the parametric description of the feature space. For robust protein family modeling, target family directed specialization is developed in section 5.2.3. Remote homology detection is improved using an explicit background model capturing all data not belonging to a particular protein family (section 5.2.4). Corresponding to the three basic issues addressing improved probabilistic protein family models, the developments presented in this section are directed towards general performance improvement and alleviating the sparse data problem (cf. section 4.2 on page 92).

5Note that general data denotes feature vectors which are not specifically assigned to a particular HMM state. However, the principal origin of all feature vectors used for the estimation of the feature space representation as well as for the specific modeling process needs to be identical (here: protein data).


5.2.1 Feature Space Representation

For discrete data the distribution of a random variable Y can generally be defined tabularly by assigning a probability P(y) to every discrete event y which Y can take.6 This is the usual case for protein data where, given a set of sample sequences, the relative frequencies of every amino acid define the discrete distribution representing the data space of amino acids. Due to their non-discrete character, the underlying probability density distributions of continuous data cannot be defined tabularly. Thus, parametric models are required for proper descriptions of continuous densities, usually resulting in compact representations of the particular data spaces. According to the central limit theorem, most natural processes can be described using normal distributions if they are observed for a reasonable time. Normal distributions N are mathematically very attractive since they contain only a small number of parameters for the effective representation of unimodal densities:

\[
\mathcal{N}(\vec{x}\,|\,\vec{\mu}, C) = \frac{1}{\sqrt{|2\pi C|}}\, e^{-\frac{1}{2}(\vec{x}-\vec{\mu})^T C^{-1} (\vec{x}-\vec{\mu})},
\]
where ~µ denotes the mean vector of the data and C the appropriate covariance matrix. General continuous density functions p(~x) can be approximated with arbitrary precision using linear combinations of normal distributions, so-called mixture density models [Yak70]. Usually, only a finite sum of K mixture components is used and the parameters of the model, namely the mixture weights, i.e. their prior probabilities ci, the mean vectors ~µi, and the covariance matrices Ci of the particular normal densities, are summarized in a set of parameters θ:
\[
p(\vec{x}\,|\,\theta) = \sum_{i=1}^{\infty} c_i\, \mathcal{N}(\vec{x}\,|\,\vec{\mu}_i, C_i) \approx \sum_{i=1}^{K} c_i\, \mathcal{N}(\vec{x}\,|\,\vec{\mu}_i, C_i).
\]
Feature vectors extracted from protein sequences as introduced in the previous section generally represent continuous data. Certainly, they originate from discrete amino acid symbols, but after their enrichment using biochemical property mapping and their signal-like treatment by applying Wavelet transformation and Principal Components Analysis to their local neighborhood, the character of the data becomes continuous. Thus, the 99-dimensional feature space is represented using mixture density distributions.
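As an illustration, the mixture density above can be evaluated directly; the two-component toy mixture below is invented for demonstration and is not thesis data:

```python
import numpy as np

def gauss(x, mu, C):
    """Multivariate normal density N(x | mu, C) as defined above."""
    diff = x - mu
    return (np.exp(-0.5 * diff @ np.linalg.solve(C, diff))
            / np.sqrt(np.linalg.det(2 * np.pi * C)))

def mixture_density(x, weights, means, covs):
    """Finite mixture p(x | theta) = sum_i c_i N(x | mu_i, C_i)."""
    return sum(c * gauss(x, mu, C) for c, mu, C in zip(weights, means, covs))

# Toy two-component mixture in two dimensions (illustrative numbers only).
weights = [0.6, 0.4]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
p = mixture_density(np.zeros(2), weights, means, covs)   # approx. 0.0958
```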

Estimation of the General Feature Space Representation

As previously mentioned, the general feature space representation can be estimated independently from the actual protein family modeling. In the approach presented here, mixture components are estimated by exploiting general protein data from major protein sequence databases. For the developments of this thesis, the SWISSPROT database [Boe03] was used for obtaining the general protein feature space representation. By means of its almost 90 000 sequences, 1024 normal densities are estimated which provide a sufficiently accurate feature space representation. The actual limitation to 1024 mixture components represents a

6The argumentation regarding the representation of general densities is mainly influenced by [Fin03, p.46f].

good compromise between suitably covering the general feature space and allowing further specialization as described later.

Due to this limitation the accuracy of the data space representation strongly depends on the actual choice of the mixture components, i.e. their "position" within the data space and their shape. For high-dimensional data these parameters correspond to the mean vectors and covariance matrices of N-dimensional normal densities. Usually, they are estimated using the well-known Expectation Maximization (EM) procedure [Dem77]. By means of this iterative technique both mean vectors and covariance matrices of a predefined number of mixture components are estimated by maximizing the probability of the training data X = {~x1, ~x2, . . . , ~xT } depending on the parameters of the mixture density model. Although widely used, the EM algorithm has severe drawbacks. First, it is in no way guaranteed that the global optimum for the data space representation will be obtained; instead, the local optimum closest to the initialization will be found. Thus, the quality of the solution depends severely on the initialization of the procedure; especially a random assignment of mean vectors and covariance matrices is problematic. Furthermore, and most critically, the algorithm converges rather slowly, i.e. usually many iterations are required for estimating suitable mixture components. In figure 5.5 the EM algorithm for estimating mixture density models of probability distributions is summarized. The EM algorithm is closely related to general vector quantization techniques where clusters within data spaces are searched. However, during clustering the data vectors are assigned deterministically to particular clusters, in contrast to the probabilistic assignment used by EM. Unfortunately, the EM algorithm is computationally expensive.
Thus, in several practical applications, instead of EM a much faster procedure is applied for obtaining suitable parametric representations of high-dimensional continuous data spaces: a slightly modified version of the k-means vector quantization procedure which was originally developed by James MacQueen [Mac67]. The k-means algorithm is a non-iterative technique for estimating vector quantizers where, based on a single-pass approach, both the cluster representatives, so-called prototypes, and the corresponding partition of the data space can be estimated very efficiently. In several informal experiments using different kinds of data vectors it turned out that the quality of the vector quantizers estimated using the k-means approach is at least comparable to those obtained using iterative clustering techniques like the well-known Lloyd or LBG algorithms (cf. e.g. [Ger92, Fin03]). The crucial prerequisite for an equivalent or even better vector quantizer design is the existence of a reasonable number of samples, which is the case when using general protein data from SWISSPROT. In figure 5.6 the standard k-means algorithm is summarized. Once a vector quantizer has been obtained for the general data space using k-means, a mixture density can easily be estimated by exploiting the resulting prototypes as well as the partition of the data space. The prototypes of the clusters correspond to the mean vectors of the mixture components, and the appropriate covariance matrices Ci can be calculated using the data vectors ~x assigned to the particular cells of the data space partition (clusters) Ri:

\[
C_i = \sum_{\vec{x} \in R_i} (\vec{x} - \vec{\mu}_i)(\vec{x} - \vec{\mu}_i)^T.
\]
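In code, deriving mixture parameters from a k-means partition looks as follows. Note that the scatter is normalized here by the cluster size to obtain a sample covariance, and the mixture weights are taken as relative cluster sizes; both are standard choices that the displayed formula leaves implicit:

```python
import numpy as np

def mixture_from_partition(clusters):
    """Turn a k-means partition into mixture parameters: cluster means become
    the mixture means, per-cluster scatter (normalized by cluster size) the
    covariances, and relative cluster sizes the mixture weights."""
    weights, means, covs = [], [], []
    total = sum(len(c) for c in clusters)
    for data in clusters:
        X = np.asarray(data, dtype=float)
        mu = X.mean(axis=0)
        diff = X - mu
        weights.append(len(X) / total)
        means.append(mu)
        covs.append(diff.T @ diff / len(X))
    return weights, means, covs
```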


Given a set of sample data X = {~x1, ~x2, . . . , ~xT }, the number K of desired base normal densities, and a lower limit ∆Lmin for the relative improvement of the likelihood
\[
L(\theta) = \ln P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T \,|\, \theta) = \sum_{\vec{x} \in X} \ln p(\vec{x} \,|\, \theta),
\]
where θ represents the parameter vector describing the mixture density model.

1. Initialization: Choose initial parameters θ^0 = (c_i^0, ~µ_i^0, C_i^0) of the mixture density model, and initialize the iteration counter m ← 0.

2. Expectation Estimation (E-step): Determine estimates for the posterior probabilities of all mixture components ω_i (given the current model θ^m) for every data vector ~x ∈ X:
\[
P(\omega_i \,|\, \vec{x}, \theta^m) = \frac{c_i^m\, \mathcal{N}(\vec{x}\,|\,\vec{\mu}_i^m, C_i^m)}{\sum_j c_j^m\, \mathcal{N}(\vec{x}\,|\,\vec{\mu}_j^m, C_j^m)}.
\]
Calculate the likelihood of the data given the current model θ^m:
\[
L(\theta^m) = \ln P(\vec{x}_1, \ldots, \vec{x}_T \,|\, \theta^m) = \sum_{\vec{x} \in X} \ln \sum_j c_j^m\, \mathcal{N}(\vec{x}\,|\,\vec{\mu}_j^m, C_j^m).
\]

3. Maximization (M-step): Calculate updated parameters θ^{m+1} = (c_i^{m+1}, ~µ_i^{m+1}, C_i^{m+1}):
\[
c_i^{m+1} = \frac{\sum_{\vec{x} \in X} P(\omega_i \,|\, \vec{x}, \theta^m)}{|X|}
\]
\[
\vec{\mu}_i^{m+1} = \frac{\sum_{\vec{x} \in X} P(\omega_i \,|\, \vec{x}, \theta^m)\, \vec{x}}{\sum_{\vec{x} \in X} P(\omega_i \,|\, \vec{x}, \theta^m)}
\]
\[
C_i^{m+1} = \frac{\sum_{\vec{x} \in X} P(\omega_i \,|\, \vec{x}, \theta^m)\, \vec{x}\vec{x}^T}{\sum_{\vec{x} \in X} P(\omega_i \,|\, \vec{x}, \theta^m)} - \vec{\mu}_i^{m+1} (\vec{\mu}_i^{m+1})^T.
\]

4. Termination: Calculate the relative change of the likelihood since the last iteration:
\[
\Delta L^m = \frac{L(\theta^m) - L(\theta^{m-1})}{L(\theta^m)}.
\]
If ∆L^m > ∆Lmin set m ← m + 1 and continue with step 2, terminate otherwise.

Figure 5.5: EM algorithm for estimating mixture density models (adopted from [Fin03, p.64], cf. also [Dem77, Ger92]).
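A compact sketch of the procedure of figure 5.5 follows. It is a minimal implementation: the random initialization and the small regularization term added to the covariances are pragmatic choices for numerical stability, not part of the figure:

```python
import numpy as np

def _gauss(X, mu, C):
    """Row-wise multivariate normal densities N(x_t | mu, C)."""
    diff = X - mu
    sol = np.linalg.solve(C, diff.T).T
    return (np.exp(-0.5 * (diff * sol).sum(axis=1))
            / np.sqrt(np.linalg.det(2 * np.pi * C)))

def em_gmm(X, K, dl_min=1e-4, seed=0):
    """EM for a full-covariance Gaussian mixture, following figure 5.5."""
    rng = np.random.default_rng(seed)
    T, N = X.shape
    # 1. Initialization: random data points as means (the step the text flags
    #    as critical; a k-means initialization would be preferable).
    c = np.full(K, 1.0 / K)
    mu = X[rng.choice(T, K, replace=False)].copy()
    C = np.array([np.cov(X.T) + 1e-6 * np.eye(N) for _ in range(K)])
    prev_L = None
    while True:
        # 2. E-step: posteriors P(omega_i | x, theta^m) and likelihood L(theta^m).
        dens = np.array([c[i] * _gauss(X, mu[i], C[i]) for i in range(K)])
        L = np.log(dens.sum(axis=0)).sum()
        post = dens / dens.sum(axis=0)
        # 3. M-step: updated weights, means, and covariances.
        Nk = post.sum(axis=1)
        c = Nk / T
        mu = (post @ X) / Nk[:, None]
        for i in range(K):
            C[i] = (post[i, :, None] * X).T @ X / Nk[i] - np.outer(mu[i], mu[i])
            C[i] += 1e-6 * np.eye(N)
        # 4. Termination on the relative change of the likelihood.
        if prev_L is not None and abs((L - prev_L) / L) <= dl_min:
            return c, mu, C
        prev_L = L
```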

Without further optimization, a high-quality representation of the biochemical feature space created for protein sequence data can be obtained, which is used as the starting point for feature based protein family HMMs.

In figure 5.7 the process of mixture density estimation for feature space representation using general protein data is summarized graphically.


Given a set of sample data X = {~x1, ~x2, . . . , ~xT } and the number K of desired clusters.

1. Initialization: Choose the first K vectors of the sample set as initial cluster prototypes Y^0 = {~y1, ~y2, . . . , ~yK} = {~x1, ~x2, . . . , ~xK}. Alternatively choose K data vectors randomly distributed over the sample set. Initialize the partition of the data set to R_i^0 = {~y_i}, i = 1, . . . , K. Initialize the counter m ← 0.

2. Iteration: For all data vectors ~x_t, K + 1 ≤ t ≤ T, not processed so far:

(a) Classification: Determine the optimal prototype ~y_i within the current set of cluster prototypes Y^m for ~x_t using some metric d (usually the Euclidean distance, cf. equation 3.1 on page 22):
\[
\vec{y}_i = \operatorname{argmin}_{\vec{y} \in Y^m} d(\vec{x}_t, \vec{y}).
\]

(b) Re-partitioning: Change the partition of the data space by updating the cluster R_i belonging to the previously determined prototype ~y_i:
\[
R_j^{m+1} = \begin{cases} R_j^m \cup \{\vec{x}_t\}, & \text{if } j = i \\ R_j^m, & \text{otherwise.} \end{cases}
\]

(c) Prototype Update: Update the prototype ~y_i of the cluster which was changed in the previous step and keep all others:
\[
\vec{y}_j^{m+1} = \begin{cases} \operatorname{cent}(R_j^{m+1}), & \text{if } j = i \\ \vec{y}_j^m, & \text{otherwise,} \end{cases}
\]
where cent(R_j^{m+1}) designates the centroid of the current j-th cluster:
\[
\operatorname{cent}(R_j^m) = \operatorname{argmin}_{\vec{y} \in R_j^m} E\{d(\vec{x}, \vec{y}) \,|\, \vec{x} \in R_j^m\}
\]
which corresponds to:
\[
\operatorname{cent}(R_j^m) = E\{\vec{x} \,|\, \vec{x} \in R_j^m\} = \int_{R_j^m} \vec{x}\, p(\vec{x})\, d\vec{x}
\]
when using elliptically symmetric metrics like e.g. the Euclidean distance.

Update counter m ← m + 1.

Figure 5.6: k-means algorithm for vector quantization (adopted from [Fin03, p.61], cf. also [Mac67, Ger92]).
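The single-pass scheme of figure 5.6, assuming the Euclidean distance and the "first K vectors" initialization, can be sketched as:

```python
import numpy as np

def kmeans_single_pass(X, K):
    """MacQueen-style single-pass k-means (figure 5.6): the first K sample
    vectors seed the prototypes; every further vector is classified once,
    its cluster is extended, and that prototype moves to the centroid."""
    Y = [np.array(X[i], dtype=float) for i in range(K)]   # initial prototypes
    clusters = [[X[i]] for i in range(K)]                 # initial partition
    for x in X[K:]:
        dists = [np.linalg.norm(np.asarray(x) - y) for y in Y]
        i = int(np.argmin(dists))                         # (a) classification
        clusters[i].append(x)                             # (b) re-partitioning
        Y[i] = np.mean(clusters[i], axis=0)               # (c) prototype update
    return np.array(Y), clusters

protos, parts = kmeans_single_pass(
    [[0.0, 0.0], [10.0, 10.0], [0.5, 0.0], [10.0, 9.5], [0.0, 0.5]], K=2)
```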

5.2.2 General Semi-Continuous Profile HMMs

When processing feature vectors, usually continuous instead of discrete HMMs are used. The emissions of HMM states are based on parametric representations of the underlying high-dimensional feature space. Linear combinations of normal densities are a suitable choice for easy mathematical treatment since the number of parameters required for a proper description is reasonably small (cf. section 3.2.2 on pages 46ff). For the general case of continuous modeling, every state of such HMMs consists of its own set of mixture components. The major advantage of such a continuous modeling approach is the estimation of state-specific mixture components which allows sharp specialization individually for particular positions within the HMM. However, especially for HMMs consisting of complex model architectures and thus large numbers of states, the overall number of normal densities to be estimated is usually rather high. In order to obtain robust estimates of the mixture density, state specific training data is required for all of these mixture components. Unfortunately, for typical practical problems usually only small training sets are available (cf. the second issue relevant for the application of probabilistic protein families, the sparse data problem, in section 4.2 on page 92). Since the amount of state-specific training samples is typically extremely small when modeling protein families using Profile HMMs, continuous emission modeling is out of the question for feature based protein sequence representations. Instead, the semi-continuous modeling approach of Xuedong Huang and colleagues (cf. pages 46ff) is very attractive for feature based Profile HMMs. The training sets are effectively exploited by sharing a common set of normal densities between all HMM states. The state-dependent specialization is achieved by the individual prior probabilities for the mixture components. Furthermore, as mentioned earlier, one basic advantage of semi-continuous modeling is especially the principal possibility of dividing the model estimation process into two steps, namely obtaining the feature space representation and the actual model optimization, where the latter requires less model specific training data.

Figure 5.7: Estimation of the mixture density for feature space representation: Using feature data extracted from general SWISSPROT protein sequences (left part), the k-means algorithm is applied for estimating the mixture density used (right part).
Thus, enhanced Profile HMMs which are based on continuous feature vector representations capturing the biochemical properties of amino acids in their local sequential neighborhood are of semi-continuous type. As described in the previous section, the general parametric feature space representation is efficiently determined by applying a modified version of the k-means algorithm for vector quantization to general protein data (approximately 90 000 sequences). Semi-continuous Profile HMMs themselves are derived from the architecture of discrete models, i.e. the general model architecture is kept fixed. Given the Profile structure, standard Viterbi training is performed using the component densities of the general feature space representation and small amounts of family specific data. By means of this procedure, enhanced Profile HMMs for robust protein family modeling are estimated using SWISSPROT protein data for the general feature space representation (hence the name general semi-continuous Profile HMMs) and small amounts of protein family specific data. In figure 5.8 the new enhanced Profile HMMs are illustrated. The general model architecture is kept fixed whereas the emissions of Insert and Match states are now based on the underlying common mixture density.


Del. Del. Del. Del. Del. Del. Del. Del.

Ins. Ins. Ins. Ins. Ins. Ins. Ins. Ins. Ins.

B Match Match Match Match Match Match Match Match Match Match E


Figure 5.8: Semi-continuous feature based Profile HMMs for robust protein family modeling: The original model architecture of Profile HMMs is kept fixed (upper part). The emissions of both Insert and Match states are now based on a general mixture density representation of the new underlying feature space covering the essential biochemical properties of residues in their local neighborhood (lower part).
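The shared emission scheme of figure 5.8 can be sketched as follows: each state keeps only a weight vector over one common pool of normal densities. The toy pool and weights below are illustrative; in the thesis the pool would be the 1024 SWISSPROT-derived densities, and during Viterbi training only the weights, not the pooled densities, would be re-estimated:

```python
import numpy as np

def gauss(x, mu, C):
    """Multivariate normal density N(x | mu, C)."""
    diff = x - mu
    return (np.exp(-0.5 * diff @ np.linalg.solve(C, diff))
            / np.sqrt(np.linalg.det(2 * np.pi * C)))

def emission_prob(x, state_weights, pool_means, pool_covs):
    """Semi-continuous emission: all states share the density pool; a state
    is characterized solely by its mixture weight vector."""
    dens = np.array([gauss(x, m, C) for m, C in zip(pool_means, pool_covs)])
    return float(state_weights @ dens)

# Toy shared pool of two densities and two states with different weights.
pool_means = [np.zeros(2), np.array([4.0, 4.0])]
pool_covs = [np.eye(2), np.eye(2)]
match_state = np.array([0.9, 0.1])    # prefers the first pooled density
insert_state = np.array([0.1, 0.9])   # prefers the second pooled density
```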

5.2.3 Specialization by Adaptation

The mixture density representation of the feature space obtained from SWISSPROT captures the global properties of general feature data. As mentioned earlier, family specific specialization of the general feature space representation is implicitly performed by estimating the state-dependent prior probabilities for all mixture components. In the previous section, semi-continuous Profile HMMs were developed by means of this concept. They are very efficient for robust protein family modeling if only little target specific data is available. However, mixture densities specifically estimated for particular protein family models, i.e. their HMM states, do in principle better represent the specialties of the feature space covered by this family since the specialized normal densities are more focused on the target data. In order to explicitly focus the general feature space representation on the essentials relevant for a particular protein family, two general possibilities exist: either family specific mixture density estimation is performed, which is problematic considering the sparse data situation as argued above, or family specific mixture density adaptation is performed. Adaptation means explicitly changing the general mixture components, i.e. the mean vectors ~µi and covariance matrices Ci, with respect to the (small amounts of) target specific data. Depending on the amount of actually available family specific data, three different kinds of mixture adaptation can be applied:

1. The complete re-estimation of all family model parameters using the Maximum Likelihood (ML) approach,

2. The re-estimation of the mixture density with respect to the optimization of the posterior probabilities of the mixture parameters for the adaptation samples, the Maximum A-Posteriori (MAP) adaptation, or

3. The rotation and translation of the general feature space towards the family specific essentials by estimating affine transformations using the Maximum Likelihood Linear Regression (MLLR) approach on small adaptation sets.

In figure 5.9 the principal idea of mixture density adaptation is graphically summarized. Given a mixture density based representation of the general feature space, specialization techniques are applied in order to focus the general representation on the essentials relevant for a particular protein family. Both mixture density representations are generally similar but not identical, which is symbolized in the figure by the sketched distortion of the distributions.


Figure 5.9: Principles of mixture density specialization by adaptation: The general feature space representation (left part) is focused on particular protein families (right part) by means of adaptation techniques.

In the following, all three adaptation techniques used for the robust estimation of protein family models are discussed. The number of family specific training samples, i.e. the size of the usually very small adaptation set X = (~x1, ~x2, . . . , ~xT ), is denoted by T. The adaptation techniques are applied to semi-continuous HMMs as already described in section 3.2.2 on pages 46ff.

Maximum Likelihood (ML) Estimation

In the simplest case of target family based specialization of the general feature space, the adaptation is performed by maximizing the likelihood of the mixture density for the family specific sample sequences using the EM algorithm up to convergence. As previously mentioned (cf. section 5.2.1), Expectation Maximization is a standard technique for mixture density estimation. Its drawbacks (discussed on page 115) prevent its general application to the mixture density estimation process. However, when an initial feature space representation based on a mixture density estimated using general protein data is available and furthermore small family specific training sets are used, EM can be applied for adaptation. In this case, the problematic initialization problem has already been solved by the

modified k-means approach, alleviating the problem of finding only local optima, and slow convergence is not an issue since T is small. Thus, given the initial mixture density representation of the feature space derived from SWISSPROT, the adaptation samples ~xt are assigned probabilistically to all mixture components gk = N (~x|~µk, Ck) in the iterative re-estimation of all parameters (depending on the complete set of mixture density parameters valid for the m-th iteration, represented by θm):

\[
\hat{c}_k^{m+1} = \frac{1}{T} \sum_{t=1}^{T} \xi_t^m(k), \qquad (5.1)
\]
with
\[
\xi_t^m(k) = P(g_k \,|\, \vec{x}_t, \theta^m), \qquad \theta^m = (\hat{c}_{(\cdot)}^m, \hat{\vec{\mu}}_{(\cdot)}^m, \hat{C}_{(\cdot)}^m),
\]
\[
\hat{\vec{\mu}}_k^{m+1} = \frac{\sum_{t=1}^{T} \xi_t^m(k)\, \vec{x}_t}{\sum_{t=1}^{T} \xi_t^m(k)}, \qquad (5.2)
\]
\[
\hat{C}_k^{m+1} = \frac{\sum_{t=1}^{T} \xi_t^m(k)\, \vec{x}_t \vec{x}_t^T}{\sum_{t=1}^{T} \xi_t^m(k)} - \hat{\vec{\mu}}_k^{m+1} (\hat{\vec{\mu}}_k^{m+1})^T. \qquad (5.3)
\]
However, since the parameters of all mixture components are re-estimated by the ML procedure, usually still rather large sample sets are required for robust adaptation.

Maximum A-Posteriori (MAP) Adaptation

Although specialized mixture density representations can be obtained by means of the previously described ML estimation, the quality of the resulting transformed feature space drops significantly if too little adaptation data is available. Due to the complex model architecture and due to the modeling of large parts of proteins requiring large numbers of states, complete parameter re-estimation using pure ML adaptation of the mixture components used for Profile HMMs can be problematic. The principal idea of the alternative adaptation technique discussed here is the suitable interpolation of existing parameter estimates, obtained by exploiting the set of general protein data, with the re-estimates derived when processing the adaptation data. Within the Bayesian framework, the parameter estimation is now formally performed by maximizing the posterior probability of the statistical model for the given data (compared to the maximum likelihood estimation discussed in the previous section); hence the name Maximum A-Posteriori adaptation:
\[
\hat{\theta} = \operatorname{argmax}_{\theta} P(\theta \,|\, X) = \operatorname{argmax}_{\theta} P(\theta)\, P(X \,|\, \theta).
\]
The principal advantage of MAP adaptation is the balanced incorporation of prior information extracted from the larger set of unlabeled data. The more adaptation samples available,

the stronger their influence, and vice versa: the smaller the amount of target specific data, the higher the influence of the background estimation. Note that when an infinite number of adaptation samples is available, MAP adaptation degenerates to ML estimation. According to the detailed mathematical derivation by Jean-Luc Gauvain and coworkers in [Gau92], prior parameter estimates ~µ̂k and Ĉk weighted by τ are combined with the re-estimation based on the family specific data by changing equations 5.2 and 5.3 to the following:

\[
\hat{\vec{\mu}}_k^{m+1} = \frac{\tau \hat{\vec{\mu}}_k^m + \sum_{t=1}^{T} \xi_t^m(k)\, \vec{x}_t}{\tau + \sum_{t=1}^{T} \xi_t^m(k)} \qquad (5.4)
\]
\[
\hat{C}_k^{m+1} = \frac{\tau \left( \hat{C}_k^m + \hat{\vec{\mu}}_k^m (\hat{\vec{\mu}}_k^m)^T \right) + \sum_{t=1}^{T} \xi_t^m(k)\, \vec{x}_t \vec{x}_t^T}{\tau + \sum_{t=1}^{T} \xi_t^m(k)} - \hat{\vec{\mu}}_k^{m+1} (\hat{\vec{\mu}}_k^{m+1})^T. \qquad (5.5)
\]
Note that MAP adaptation is performed iteratively, too. For the adaptation of feature based Profile HMMs using the MAP technique, τ is adjusted to the number of samples assigned to the mixture components as accumulated during the previous steps, which allows robust mixture adaptation even for small training sets. The previously mentioned degenerate case of MAP behaving like standard ML estimation corresponds to setting τ to zero, i.e. not respecting any prior knowledge.
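The interpolation behaviour of equation 5.4 can be seen in a few lines; the numbers below are illustrative only:

```python
import numpy as np

def map_mean_update(mu_prior, xs, xi, tau):
    """MAP update of one mixture mean (equation 5.4): interpolate the prior
    mean and the data-driven estimate, weighting tau against the soft count
    of adaptation vectors assigned to this component.
    xs: adaptation vectors (T x N); xi: posteriors xi_t(k), shape (T,)."""
    return (tau * mu_prior + (xi[:, None] * xs).sum(axis=0)) / (tau + xi.sum())

# Prior mean at the origin, ML mean of the adaptation data at [3, 1]; with
# tau equal to the soft count the result lies exactly halfway.
mu0 = np.zeros(2)
xs = np.array([[2.0, 2.0], [4.0, 0.0]])
xi = np.array([1.0, 1.0])
print(map_mean_update(mu0, xs, xi, tau=2.0))   # [1.5 0.5]
# tau -> 0 recovers the pure ML estimate; large tau keeps the prior mean.
```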

Maximum Likelihood Linear Regression (MLLR)

Both specialization techniques explained in the previous sections are based on more or less sophisticated re-estimations of the model parameters. Contrary to this, the third technique investigated for this thesis is based on a real transformation of the models' parameters, i.e. the particular normal densities. Originally developed for rapid speaker adaptation of automatic speech recognition systems, Chris Leggetter and Phil Woodland proposed the modification of the mixtures' parameters using affine transformations [Leg95]. As the principal idea of the original work, the parameters of a speaker independent HMM based speech recognition system are optimized towards speaker specialization by exploiting small adaptation sets. The optimization criterion is the maximization of the generation probability P(X|λ) (cf. section 3.2.2) of an HMM λ for the given set of adaptation data X. During adaptation the principal structure of the HMMs is kept fixed; only the mixture components are changed. Furthermore, the ML based re-estimates of mixture parameters belonging to HMM states which were covered by the actual adaptation data are generalized to the mixtures of related HMM states which were not observed during adaptation. The underlying mixture components are summarized in so-called regression classes and the adaptation is applied to all mixtures contained. The actual definition of regression classes is usually application dependent, but it can be performed data driven, e.g. by clustering the mixtures' mean vectors. Since the transformation

is estimated only for the mean vectors of the mixtures and generalized within a particular regression class, neglecting transformations of the appropriate covariance matrices, MLLR adaptation requires very little target specific adaptation data for robust model specialization.7 It is this small amount of required target specific adaptation data which makes MLLR attractive for the specialization of the general feature space for particular protein family HMMs. Conceptually, speech data is substituted by the feature representation of protein sequence data, and speaker models correspond to particular protein family HMMs. Contrary to the specialization techniques presented in the previous sections, here feature vectors ~xt are deterministically assigned to particular mixture components. During model evaluation for the adaptation set, every time a particular HMM state s is selected, one mixture component out of the common set of normal densities is chosen. Hence, the selection of a particular state is conceptually equivalent to the selection of a mixture component. As previously mentioned, the idea of MLLR adaptation is the modification of the mixtures' mean vectors by applying affine transformations:

\[
\vec{\mu}_k' = A_k \vec{\mu}_k + \vec{b}_k.
\]
This affine transformation, consisting of a translation vector ~b_k and a rotation matrix A_k, can be summarized in an N × (N + 1) dimensional transformation matrix
\[
W_k = \left[ \vec{b}_k \; A_k \right].
\]

By means of this transformation matrix W_k and an augmented mean vector representation
\[
\hat{\vec{\mu}}_k = (1, \mu_{k_1}, \mu_{k_2}, \ldots, \mu_{k_N})^T
\]
the adapted mean vector can be defined as
\[
\vec{\mu}_k' = W_k \hat{\vec{\mu}}_k. \qquad (5.6)
\]

Using regression classes, which are individually defined as R = \{\vec{\mu}_1^R, \ldots, \vec{\mu}_r^R\}, the transformations defined in equation 5.6 are generalized via linear regression to densities not covered by the adaptation set. The estimation of a transformation matrix W_s follows the Maximum Likelihood approach. In [Leg94] the inventors of MLLR define the auxiliary function Q, which needs to be maximized8:
\[
Q(\lambda, \lambda') = \sum_{\vec{s} \in S^T} P(X, \vec{s} \,|\, \lambda) \log P(X, \vec{s} \,|\, \lambda'), \quad \text{with } X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T\},
\]
where λ and λ' denote the model parameters of two consecutive iterations, and S^T designates the set of all state chains of length T for a particular sequence of feature vectors

7The general idea of MLLR was further developed with respect to changing the mixtures' covariance matrices, but the minor improvements generally do not justify the substantially larger adaptation sets required for robust model specialization.
8The notation used here corresponds to that introduced in section 3.2.2 for the formal definition of general HMMs.

of length T. The actual maximization of Q, which is based on the substitution of the definition of the generation probability and standard maximum determination using the first derivative, results in a non-iterative ML estimation rule:
\[
\sum_{t=1}^{T} \gamma_t(s)\, C_s^{-1} \vec{x}_t \hat{\vec{\mu}}_s^T = \sum_{t=1}^{T} \gamma_t(s)\, C_s^{-1} W_s \hat{\vec{\mu}}_s \hat{\vec{\mu}}_s^T,
\]
where γ_t(s) designates the probability of selecting state s at a given step t, and C_s and \hat{\vec{\mu}}_s denote the covariance matrix and the appropriate (augmented) mean vector of the (deterministically) assigned mixture for the feature vector ~x_t. If |R| densities (i.e. states) are grouped into a regression class R, the estimation of the corresponding transformation matrix W_R needs to be summed over the appropriate densities deterministically selected by the underlying states:
\[
\sum_{t=1}^{T} \sum_{r=1}^{|R|} \gamma_t(s_r)\, C_{s_r}^{-1} \vec{x}_t \hat{\vec{\mu}}_{s_r}^T = \sum_{t=1}^{T} \sum_{r=1}^{|R|} \gamma_t(s_r)\, C_{s_r}^{-1} W_R \hat{\vec{\mu}}_{s_r} \hat{\vec{\mu}}_{s_r}^T. \qquad (5.7)
\]
If identical covariance matrices are assumed for all mixtures assigned to a regression class R, equation 5.7 simplifies to:

\[
\sum_{t=1}^{T} \sum_{r=1}^{|R|} \gamma_t(s_r)\, \vec{x}_t \hat{\vec{\mu}}_{s_r}^T = \sum_{t=1}^{T} \sum_{r=1}^{|R|} \gamma_t(s_r)\, W_R \hat{\vec{\mu}}_{s_r} \hat{\vec{\mu}}_{s_r}^T.
\]
It can be further simplified if deterministic state selections are assumed (e.g. when using the Viterbi algorithm):

\[
\sum_{t=1}^{T} \vec{x}_t \hat{\vec{\mu}}_{s_t}^T \delta_{s_t} = W_R \sum_{t=1}^{T} \hat{\vec{\mu}}_{s_t} \hat{\vec{\mu}}_{s_t}^T \delta_{s_t} \qquad (5.8)
\]
\[
\delta_{s_t} = \begin{cases} 1, & \text{if } s_t \in \{s_1, s_2, \ldots, s_{|R|}\} \\ 0, & \text{otherwise.} \end{cases}
\]

If the matrices X and Y are defined as follows:
\[
X = [\hat{\vec{\mu}}_{s_1}, \hat{\vec{\mu}}_{s_2}, \ldots, \hat{\vec{\mu}}_{s_T}]
\]
\[
Y = [\vec{x}_1 \delta_{s_1}, \vec{x}_2 \delta_{s_2}, \ldots, \vec{x}_T \delta_{s_T}],
\]
then equation 5.8 can be written as:

\[
Y X^T = W_R X X^T. \qquad (5.9)
\]

Hence, the transformation matrix WR can then be defined as:

\[
W_R = Y X^T (X X^T)^{-1}. \qquad (5.10)
\]

The original derivation of MLLR adaptation considered multiple regression classes. However, Alexander Fischer and Volker Stahl successfully combined the simplifications given above with the definition of only a single regression class, i.e. the generalization

of the transformation rule to the complete set of normal densities [Fis99]. This implies a dramatic reduction of the number of parameters required for mixture density adaptation to N × (N + 1), where N denotes the dimensionality of the underlying feature data. The global transformation matrix, which is applied to all augmented mean vectors \hat{\vec{\mu}}_k, is defined as follows:
\[
W = \left\{ \sum_{t=1}^{T} \vec{x}_t \hat{\vec{\mu}}_t^T \right\} \left\{ \sum_{t=1}^{T} \hat{\vec{\mu}}_t \hat{\vec{\mu}}_t^T \right\}^{-1}. \qquad (5.11)
\]
By means of all the adaptation techniques described here, the general feature space representation is focused on particular target families in a completely data-driven way. For both ML and MAP adaptation all mixture parameters are re-estimated, in the latter case in combination with prior estimates of the mixture parameters. For MLLR only the single transformation matrix W needs to be estimated, which requires considerably smaller amounts of target family specific data. Therefore, MLLR is especially attractive for remote homology detection tasks as addressed in this thesis.
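Equation 5.11 amounts to a least-squares estimate over pairs of adaptation vectors and the augmented mean vectors of their assigned densities. A minimal sketch with a synthetic affine distortion follows; the data is invented for illustration:

```python
import numpy as np

def mllr_global(xs, mus):
    """Single-regression-class MLLR (equation 5.11): estimate one N x (N+1)
    transformation W from adaptation vectors x_t and the mean vectors mu_t
    of their deterministically (e.g. Viterbi-) assigned densities."""
    aug = np.hstack([np.ones((len(mus), 1)), mus])   # augmented means, T x (N+1)
    # W = (sum_t x_t mu_hat_t^T) (sum_t mu_hat_t mu_hat_t^T)^{-1}
    return (xs.T @ aug) @ np.linalg.inv(aug.T @ aug)

def adapt_mean(W, mu):
    """Apply mu' = W mu_hat with the augmented mean (equation 5.6)."""
    return W @ np.concatenate([[1.0], mu])

# If the adaptation data is an exact affine image of the means, W recovers it.
rng = np.random.default_rng(0)
mus = rng.standard_normal((6, 2))
A = np.array([[1.0, 0.2], [0.0, 1.0]])
b = np.array([0.5, -1.0])
xs = mus @ A.T + b
W = mllr_global(xs, mus)
```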

5.2.4 Explicit Background Model

When applying Profile HMMs to detection tasks, as is usual for target identification applications within the drug discovery pipeline (cf. figure 2.7 on page 18), the major difficulty is the discrimination between target hits and misses. Usually this is realized by threshold comparison of the scores and by analyzing the alignment scores with respect to statistical significance. For independence of the actual length of a query sequence and for robust separation of sequences belonging to the target model from those which do not, discrete Profile HMM evaluation is based on more or less sophisticated null models for log-odd scores. In section 3.2.2 on pages 61ff several options for background modeling were discussed. For the application of semi-continuous feature based Profile HMMs as developed in this chapter, a null model based on the prior probabilities of the mixture components estimated during model building is used. In order to further reduce the overall number of false detections, a technique which is principally known from general pattern detection tasks is investigated. Considering e.g. the problem of automatic speaker detection, usually an additional non-target model is estimated which explicitly covers all data not belonging to the target class. According to Douglas Reynolds such a model is called a Universal Background Model (UBM) [Rey97]. As an optional enhancement of the general UBM approach, the so-called structured background model developed in this thesis captures structural information using a left-right topology as outlined in figure 5.10. Alternatively, the original UBM approach proposed by Douglas Reynolds can be used as well. In order to sharpen the detection results for target identification, both the UBM and the particular target model are evaluated in a competitive manner, which is combined with the log-odd scoring method described above.
Thus, the overall remote homology detection procedure is two-stage: in the first step, the UBM and the target model are competitively evaluated, limiting the number of resulting target candidates. Following this, a threshold based decision is performed for target identification by analyzing the significance of log-odd scores (e.g. via E-values). The UBM itself, consisting of $L_U = 30$ states, is estimated on the set of general SWISSPROT data by Baum-Welch training. The actual model length

was determined heuristically in informal experiments. Note that alternative background models are imaginable and their suitability needs to be evaluated for particular applications and target model types.
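The two-stage decision procedure can be summarized schematically. In this sketch the scoring callables and the threshold are placeholders (assumptions) standing in for the actual evaluation of the target Profile HMM, the UBM, and the E-value based significance test.

```python
def two_stage_detection(seq_features, target_score, ubm_score,
                        log_odds_score, threshold):
    """Schematic two-stage remote homology detection (assumed interface).

    Stage 1: competitive evaluation -- a sequence survives only if the
             target model scores better than the universal background model.
    Stage 2: significance test of the log-odd score against a threshold
             (in practice e.g. an E-value criterion)."""
    if target_score(seq_features) <= ubm_score(seq_features):
        return False            # UBM wins: reject the candidate early
    return log_odds_score(seq_features) > threshold
```

The early UBM rejection limits the set of candidates that ever reach the (more delicate) threshold decision, which is the intended effect of the competitive evaluation.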

Figure 5.10: Explicit universal background model (structured UBM) covering all general protein data not explicitly assigned to a particular target family of interest. The UBM is of feature based semi-continuous type (lower part – mixture density representation of the general feature space) and evaluated competitively to the particular target Profile HMM.

Summary

In the previous sections a new method for probabilistic protein family modeling was developed. Based on a rich feature based protein sequence representation, semi-continuous Profile HMMs were presented. In figure 5.11 the approach for estimating semi-continuous Profile HMMs and an explicit UBM for robust remote homology detection is summarized graphically for hypothetical protein families $F_1, F_2, \ldots, F_N$. Based on the feature representation of general protein data obtained using the new extraction method, a mixture representation of the general feature space is estimated using k-means (upper-left part). For semi-continuous modeling, the separate optimization of the emission space representation using large amounts of general protein data and the family specific training of the model structure is possible. By means of standard discrete models $\lambda_D$ estimated on family specific training samples (upper-right), and the general feature space representation, semi-continuous Profile HMMs $\lambda_G$ are obtained via Viterbi training (middle-right). Then, the mixture representation is optimized for the target families by applying adaptation techniques, resulting in family specific models $\lambda_S$ (lower-right). Finally, the UBM is estimated on SWISSPROT data (lower-left part).


Figure 5.11: Overview of the complete model estimation process for obtaining robust feature based semi-continuous Profile HMMs of protein families – see text for explanation.

5.3 Protein Family HMMs with Reduced Complexity

In chapter 3 the state-of-the-art in computational protein sequence analysis was summarized. Strictly speaking, almost all methods described are based on more or less sophisticated variants of the general Dynamic Programming approach. Analyzing the currently most successful technique, i.e. sequence comparison using probabilistic models of protein families (namely Profile HMMs), the Dynamic Programming "roots" become obvious. Its principle operations, i.e. substitution or match, insertion, and deletion, are directly mapped to the probabilistic framework of Hidden Markov Models. They correspond to the three different kinds of states: Match, Insert, and Delete. By means of such stochastic models of protein families, sequence comparison is performed via traditional alignment of the sequence of interest to the particular family model, either locally or globally. Basically, the better effectiveness of Profile HMM based comparison techniques, i.e. the reason for outperforming traditional sequence to sequence approaches, is caused by their explicit consideration of family specific information in a stochastic model. Thus, sequence analysis is performed less strictly, which is advantageous especially for remote homology detection. Due to this conceptual equivalence to Dynamic Programming, current protein family HMMs require a rather complex topology. A Profile HMM

represents the stochastic analogy to a conventional multiple sequence alignment (MSA). For every column of an MSA, three HMM states (Match, Insert, Delete) are created which are almost fully connected to each other (cf. figure 3.14 on page 58 which illustrates the standard three-state architecture). Usually, even when modeling the smallest functional protein units, namely domains, the resulting Profile HMMs are very complex, consisting of a large number of states, i.e. model parameters (emission and transition probabilities).9 In order to establish robust protein family models, representative training samples are required for all parameters. Unfortunately, especially for remote homology detection in pharmaceutical applications only small training sets are available, which complicates robust model training. For alleviating this sparse data problem, usually prior expert knowledge is incorporated into the model building process (cf. section 3.2.2 on pages 59ff). However, since this expert knowledge is usually obtained more or less manually, the resulting models tend to capture facts which are already known beforehand, i.e. their generalization capabilities are limited. Addressing the alleviation of the sparse data problem for estimating robust probabilistic protein family models without losing too much generalization effectiveness, and thus improving the general classification performance, in this section two principal approaches for protein family models with reduced complexity are discussed. First, protein family modeling using architectures beyond Profile HMMs is presented. Here, the modeling base is kept fixed as for usual Profile HMMs, i.e. global, e.g. protein domain, models are established. Due to the feature based protein sequence representation the complex three-state architecture can be discarded.
Following this, an alternative idea for protein family modeling based on the concatenation of small so-called Sub-Protein Units (SPUs) is presented. Note that all alternative modeling approaches presented in the following are based on the feature extraction procedure developed in the previous sections. When directly processing symbolic sequence data as in state-of-the-art techniques, the complex three-state architecture of conventional discrete Profile HMMs is absolutely necessary. The richer sequence representation, which captures biochemical properties of residues in their local neighborhood, allows abstraction from the strict position dependency of traditional Profile HMMs since now even the emissions cover broader parts of the original protein data. For robust model estimation using small training sets, the concept of semi-continuous HMMs is used for the non-Profile HMMs with reduced model complexity, too. Thus, all robust estimation and evaluation techniques including feature space adaptation and explicit background modeling are also applied as presented in the previous sections. The general model estimation process which was illustrated in figure 5.11 remains valid, whereas in the right part of the sketch the complicated Profile HMMs are conceptually substituted by less complex model architectures.

5.3.1 Beyond Profile HMMs

Current Profile HMMs are direct derivatives of the classical Dynamic Programming approach. Based on the analysis of symbolic representations of protein data, the essentials of a protein family are captured by a stochastic model. In fact, the more or less soft position

9Note that this thesis is directed to protein family models which means that at least protein domains are covered by stochastic models. Compared to motif based Profile HMMs, the modeling base is significantly larger for protein domains for the majority of applications.

dependent modeling of amino acid distributions is the basic advantage of Profile HMMs compared to classical sequence to sequence alignment techniques. Similar to classic pairwise alignment techniques, special states (Insert and Delete) are incorporated for flexibility regarding residue insertion and deletion. This is also important for local sequence to model alignments where only parts of the model match (parts of) the sequence. Insert states also contain (more or less) position-dependent amino acid distributions.10 The classical Profile HMM architecture and its derivatives guarantee highest flexibility for sequence to family alignment. Generally, the basic principle of Profile HMM evaluation corresponds to "probabilistic Dynamic Programming". In this thesis, the usual application of Profile HMMs is considered. By means of representative sample sets of protein sequences, stochastic models of protein families are estimated. Using these models, the affiliation of new protein sequences to a particular protein family is predicted, either by local or by global alignment. In classical three-state Profile HMMs the consensus of a multiple sequence alignment is modeled via a linear chain of Match states. Since the consensus represents the conserved parts of a protein family, the Match states contain the most relevant information for a particular protein family of interest. Thus, the amino acid distributions assigned to Match states are position specific for the columns of the MSA they represent. The amino acid distributions of Insert states are usually not position specific, since that would assume conservation which is already covered by the appropriate Match state. However, Insert states' amino acid distributions are model specific.
Since current Profile HMMs are based on discrete amino acid data, and every non-silent state emits amino acid symbols, the states' distributions are usually rather specific – they are estimated for a single column of an MSA. Thus, the global decision regarding family affiliation of a particular protein sequence (or parts of it) requires the complex three-state model topology for "probabilistic Dynamic Programming". In several informal experiments based on the SCOPSUPER95 66 corpus (cf. section 4.1.1) the superior performance of the three-state Profile HMMs compared to various alternative model topologies could be proved when processing discrete amino acid data.

Complexity Reduction due to Feature Representation: The central idea of this thesis is the explicit consideration of biochemical properties of residues in their local neighborhood for protein data classification (cf. section 5.1). By means of features covering these properties, a richer sequence representation is used for protein sequence analysis. Emissions of protein family HMMs are now based on the mixture density representation of the new feature space. The resulting continuous emission probability distributions are much broader than the discrete amino acid distributions of current Profile HMMs while keeping the specificity necessary for sequence classification. If features properly match the emission probability distributions of a particular state, the resulting contribution to the overall classification score is rather high, which corresponds to the Match case of Dynamic Programming. Contrary to this, if the features do not match the state's probability distribution, the local score will be small, which corresponds to an Insertion. Thus, the explicit discrimination between Insert and Match states is not needed any longer because it is implicitly performed already

10For Insert states often more general, e.g. family specific, amino acid distributions are used instead of strict position dependent distributions.

on the emission level. Furthermore, explicit Deletes are only conceptual and can be replaced by direct jumps skipping direct neighbors. This means that at least two thirds of the states contained in Profile HMMs can be discarded. One third of the states, namely the Insert states, contain transition probabilities as well as emission probabilities. When skipping the Inserts, the model architecture becomes less complex and thus the number of parameters which need to be trained can be decreased substantially. To summarize, due to the feature based representation, the strict position specificity of Profile HMMs for global protein family models can in principle be discarded, and alternative model topologies with reduced complexity can be used for protein family modeling – beyond Profile HMMs.

Bounded Left-Right Models for Protein Families: In figure 3.10 on page 45 various standard HMM topologies which are well-known from different general pattern recognition applications were presented. The flexibility of the models depends on their complexity. Simple linear models, where every state is adjacent to itself and to its immediate neighbor to the right, are very common e.g. in speech recognition applications for modeling small acoustic units (so-called triphones) which do not significantly vary in length. Bakis models, where every state is connected to three adjacent states (including itself), are the methodology of choice for the domain of automatic handwriting recognition, where letters are modeled which can contain moderate variations in length depending on the actual writer. The most flexible non-ergodic, i.e. not fully connected, model architecture is the Left-Right topology.11 Here, every state is connected to all states adjacent to the right. Thus, arbitrary forward jumps within the model are possible (including self-transitions), which allows covering signals of arbitrary length. When modeling protein families containing related but highly diverging sequences, global models need to offer high flexibility for covering the length variance of the family members. Thus, generally (almost) arbitrary jumps within the model must be possible for directly accessing matching parts of the model and skipping parts which are irrelevant for particular sequences of interest. Thus, Left-Right topologies are principally the methodology of choice for protein family modeling beyond Profile HMMs. However, if arbitrary jumps within a protein family model are allowed, as defined for plain Left-Right topologies, especially for models covering larger protein families the number of parameters to be trained is still rather high.
Since every state is connected to all states adjacent to the right and itself, the number of transition probabilities $N_t$ for a model containing $L$ states is defined as follows:

$$N_t = \sum_{i=1}^{L} i + 1 = \frac{L(L+1)}{2} + 1.$$

The additional offset is due to the model exit transition from the last state. If, furthermore, every state contains a direct model exit transition (which can be favorable for local

11Ergodic models are of minor importance for the vast majority of applications because either backward jumps are useless (especially when signals evolving in time are analyzed), or the number of parameters is just exorbitant and thus models cannot be trained robustly.


alignments), $N_t$ needs to be increased by $L-1$. For an exemplary protein family model consisting of 100 states, the number of transition probabilities is 5 051 for the basic Left-Right architecture and 5 150 if model exit is possible from every state. Even for extremely diverging sequences belonging to a particular protein family it is rather unrealistic to assume arbitrary alignment paths through the appropriate protein family model, which are allowed when using the plain Left-Right topology. The majority of state transitions will not be observed and can, therefore, not be trained. Thus, a variant of standard Left-Right models is developed for protein family modeling – so-called Bounded Left-Right (BLR) Models. The basic idea is to restrict direct state transitions to a local context of a particular state by finding a reasonable compromise between linear and complete Left-Right models, resulting in significantly fewer transition parameters to be trained. The number of state transitions depends on the length of the underlying protein family model, which is determined as follows. When applying BLR models to global alignment tasks, where protein families are completely evaluated for the whole sequence of interest, the model needs to ensure that even the smallest relevant sample sequence can be completely assigned to it. In Profile HMMs and standard Left-Right models principally every state is connected to the model exit, either via Delete states or explicit transitions as mentioned earlier. However, for the majority of cases these are only fallback solutions since the transitions certainly cannot be trained using real sample sets. Protein family models need to cover the majority of family members. Thus, the length of the models is determined by analyzing the training samples. The length of BLR models is determined as the median of the lengths of the training data. Compared to e.g. the arithmetic mean, the median is more robust against outliers.
Given the length of the model and the rule explained above that it must be possible to align even the smallest relevant sample sequence to it, the number of direct state transitions $N_s^F$ for a particular state $s$ of a given protein family $F$ is adjusted as follows:

$$N_s^F = \min\left( \frac{\operatorname{median}(\text{length of sample sequences})}{\min(\text{length of sample sequences})},\ \#\text{states adjacent to the right} \right). \tag{5.12}$$

At the end of a model the number of successors can be smaller than the number of transitions calculated. By means of the selection in equation 5.12 ('min') it is technically guaranteed that all transitions of a particular state point to states which actually exist. For local alignments, optionally every state can serve as model entrance and exit. The corresponding transition probabilities are fixed by assuming uniform distributions, which is reasonable according to [Dur98, p. 113ff]. The BLR architecture of protein family models is illustrated in figure 5.12. Compared to the approximately 5 000 transitions for the complete Left-Right model architecture of the exemplary protein family given above, the number of parameters to be trained for the BLR topology is decreased to approximately 500 when assuming a median length of 100 and a minimum length of 20. For the three-state Profile HMM architecture the number of transition parameters for the given example is approximately 2 700. Additionally, the number of emitting states in BLR models is halved compared to standard three-state Profile HMMs. Note that due to respecting local amino acid contexts already at the level of emissions, feature based BLR models are usually significantly shorter than Profile HMMs.
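The parameter counts discussed above can be reproduced with a small sketch. The helper names are hypothetical, and the exact handling of the boundary states and the single model exit transition is an assumption based on the formulas given in the text.

```python
import statistics

def lr_transitions(L):
    # plain Left-Right: state i has successors {i, ..., L-1}, summing to
    # L(L+1)/2 transitions; plus one model exit from the last state
    return L * (L + 1) // 2 + 1

def blr_transitions(lengths):
    # Bounded Left-Right, cf. eq. 5.12: per-state fan-out limited by the
    # ratio of median to minimum training sequence length, clipped at the
    # model end so every transition points to an existing state
    L = int(statistics.median(lengths))
    fan_out = L // min(lengths)
    return sum(min(fan_out, L - i) for i in range(L)) + 1
```

For a 100-state model this reproduces the 5 051 transitions of the plain Left-Right topology, while a median length of 100 and a minimum length of 20 bound the BLR fan-out to 5 per state, i.e. roughly 500 transitions in total.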

Figure 5.12: Bounded Left-Right model architecture for robust feature based protein family modeling using small amounts of training samples. The model length (here 8) is determined by the median of the lengths of the training sequences and the number of transitions per state is automatically bounded (here 4). For local alignments optionally arbitrary state entrances and exits are allowed (dashed lines) whose probabilities are fixed according to a uniform distribution. Exit probabilities are equally set to some small value ε.

Given the BLR topology for protein family HMMs, feature based models of semi-continuous type are initialized and trained using standard algorithms as discussed in section 3.2.2 on pages 47ff. For model initialization, the feature data corresponding to a particular protein sequence is reasonably aligned to the models' states using simple interpolation. Standard Baum-Welch training as described on pages 51ff is performed for model optimization. Note that the process of BLR model estimation is significantly less complex and can thus be performed much faster than its analogy for Profile HMMs. In addition to the reduced number of parameters involved, this is the major outcome of the new approach. As already mentioned earlier, the procedures of general feature space estimation as well as model specialization by adaptation remain identical as described in the previous sections. To summarize, by means of the richer feature based protein sequence representation, less complex global protein family models become possible, namely Bounded Left-Right models as presented here. These models contain significantly fewer parameters which need to be trained for robust protein family modeling. By means of effective adaptation techniques, powerful probabilistic models of protein families can be estimated in a completely data-driven manner without incorporating explicit prior knowledge.
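The interpolation based initialization mentioned above can be illustrated by a tiny sketch. The function name and the exact rounding are assumptions; only the idea of a linear, monotone assignment of feature vectors to states is taken from the text.

```python
def interpolation_alignment(T, L):
    """Linearly stretch/compress T feature vectors over L model states:
    frame t is assigned to state floor(t * L / T)."""
    return [min(L - 1, (t * L) // T) for t in range(T)]
```

For a sequence of 8 feature vectors and a model of 4 states, every state initially receives two consecutive frames; the resulting assignment is always monotone, so it respects the left-right structure of the model.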

5.3.2 Protein Family Modeling using Sub-Protein Units (SPUs)

In addition to the enhanced protein family modeling techniques presented so far, in the following the concept of a more radical change within the overall modeling process is discussed – protein family modeling using building blocks which are obtained in a completely data-driven manner. Note that the focus is concentrated on the description of the general concept as an emerging field, including one reference implementation. Due to the

conceptual character, various modifications and especially enhancements aiming at certain concrete tasks within molecular biology research beyond protein family analysis are imaginable. However, these applications, and especially their evaluation with respect to biological (e.g. pharmaceutical) relevance, are beyond the scope of this thesis. Comparing most current protein family modeling with state-of-the-art approaches in different pattern recognition applications, another fundamental difference becomes obvious. Usually, protein families are covered by global probabilistic models capturing complete sequences. Even when estimating models for the functionally smallest units, i.e. the protein domains as treated in this thesis, very large models consisting of more than a hundred states are not the exceptional case. Especially for remote homologue sequences, rather dissimilar parts of a particular protein family are integrated into a single probabilistic model. Contrary to this, e.g. for speech recognition systems, word models are established by concatenations of significantly smaller building blocks (usually triphones). This becomes reasonable when analyzing complex languages containing large numbers of words which cannot be trained since there are usually too few training samples available. However, triphones are suitable building blocks for even the most complex words, and they can be trained "easily". According to the literature there are hardly any protein family modeling approaches following the paradigm of concatenating building blocks. One exception is the MEME system of William Grundy and colleagues, which was already discussed in section 3.2.2 on pages 63ff including its general applicability for the remote homology analysis problem. MEME heuristically combines rather primitive motif HMMs to protein family models.
Principally, the idea of motif based protein family HMMs is very promising for tackling the sparse data problem as formulated on page 92. Since the new feature representation of protein sequence data is the fundamental approach of the enhancements developed in this thesis, the definition of building blocks directly at the residue level seems counterproductive. Instead, building blocks are defined directly on feature data. In analogy to sub-word models in automatic speech recognition applications, these building blocks are called Sub-Protein Units (SPUs). The straightforward approach for modeling protein families using concatenations of some building blocks is to train models given training sets which are annotated with respect to the particular SPUs. Following this, variants of the protein family are extracted by analyzing the most frequent combinations of SPU concatenations. When classifying unknown sequences, all protein family variants obtained during training are evaluated in parallel. Unfortunately, SPU based annotations of training sequences are generally not available. The basic dilemma can be described as some kind of "chicken and egg" problem: SPU models (later serving as building blocks for whole protein family models) can only be trained when SPU based annotations are available. However, these annotations can only be generated if suitable SPU models are available. In this thesis, this principal problem is tackled using an iterative approach which allows combined SPU detection and model training in an unsupervised and data-driven manner. Once SPUs are found which cover only the "interesting" or dominant parts of a protein family relevant for successful sequence classification, they are modeled using standard HMMs with reduced model complexity. Biochemical properties of the protein data analyzed are explicitly considered, and thus the resulting building blocks do not necessarily

correspond to motifs. Since the overall protein family model is reduced to small essential parts, significantly fewer training samples are sufficient. By means of the most frequent SPU occurrences within the particular training sets, protein family models are derived automatically by concatenation of the building blocks. The overall process of modeling protein families using SPU based HMMs can be divided into three parts which are described in the following, namely SPU Candidate Selection, Estimation of SPU Models, and Building Protein Family Models from Sub-Protein Units.

SPU Candidate Selection

The feature extraction method developed in section 5.1 provides a richer sequence representation aiming at better remote homology analysis when using Profile HMMs. The selection of SPU candidates is directly based on the 99-dimensional feature vectors. In the first step, general SPU candidates need to be extracted from protein sequences, i.e. training sequences are annotated with respect to the binary decision whether the underlying frames are SPUs or General. The SPU based annotation of the sample data will be used for SPU model training and protein family creation. Various criteria for classifying parts of the overall feature representation of protein sequences as SPU or non-SPU (so-called General Parts GP) are imaginable. As already mentioned earlier, the SPU based modeling approach discussed here represents a general framework for protein family modeling based on building blocks which are conceptually below the level of direct biological functions. In the exemplary version shown here, SPUs are defined as high-energy parts of the protein sequence, which will be motivated below. Note that alternative approaches for SPU candidate selection can be used equivalently within the overall framework. All parts of the original training sample whose feature vectors' energy is below the average energy of the particular protein sequence are treated as General parts GP. The actual discrimination method based on the feature vectors' energy becomes reasonable when analyzing the feature extraction method in more detail (therefore, cf. its general description in section 5.1.2). In order to extract reasonable features, a Discrete Wavelet transformation is applied to the signal-like representation of biochemical properties of residues in their local neighborhood. Following this, the approximation and some detail coefficients are used as the base for further analysis.
One fundamental property of the wavelet transformation is the concentration of signal energy in the upper coefficients (cf. appendix A). Thus, high feature vector energy is a reasonable indicator for relevance. For robust SPU candidate selection, the energy signal of the feature vectors corresponding to a particular protein sequence is post-processed using smoothing techniques (DWT smoothing and median filtering). By means of these techniques, protein sequences are principally sub-divided into SPUs and General parts GP, which can be seen in figure 5.13 for an exemplary Immunoglobulin (d1f5wa ). SPUs are extracted from the energy signal of the protein sequence (solid line) where the average protein energy (dotted line) is below the actual feature vector energy. By means of post-processing, two SPU candidates (dashed rectangles) are selected.
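A simplified version of the energy criterion can be sketched as follows. Thresholding at the sequence average is taken from the text; the minimum run length filter is an assumption standing in for the DWT smoothing and median filtering described above, and the function name is hypothetical.

```python
def spu_candidates(energies, min_len=3):
    """Energy-based SPU candidate selection (simplified sketch).

    Frames whose feature-vector energy exceeds the sequence average are
    marked as SPU; contiguous runs of at least min_len frames form the
    candidates, returned as half-open (start, end) index intervals."""
    avg = sum(energies) / len(energies)
    runs, start = [], None
    for i, e in enumerate(energies + [float('-inf')]):  # sentinel closes last run
        if e > avg and start is None:
            start = i                                   # run begins
        elif e <= avg and start is not None:
            if i - start >= min_len:
                runs.append((start, i))                 # keep sufficiently long runs
            start = None
    return runs
```

Applied to the smoothed energy signal of a sequence, each returned interval corresponds to one dashed rectangle in figure 5.13.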

Figure 5.13: Example for SPU candidate selection via the energy criterion: All contiguous parts of the feature representation of an exemplary Immunoglobulin (d1f5wa ) whose energy (solid line) is higher than the sequence's average (dashed line) are marked as SPU candidates (rectangles).

Estimation of SPU-Models

In the first step of the new protein family modeling approach, protein sequences are annotated with respect to the SPU candidate or General decision. Following this, corresponding SPUs need to be identified in order to train HMMs for a non-redundant set of SPUs relevant for the particular protein family. The SPUs estimated for the protein family model, and the General model which is unique for every protein family, are modeled using linear, semi-continuous HMMs. Once the training set is finally annotated using the non-redundant set of SPUs, these models are trained with the standard Baum-Welch algorithm. In the approach presented here, the final set of SPUs relevant for a particular protein family is obtained by applying a variant of the EM algorithm for agglomerative clustering of the initial (unique) SPU candidates. Therefore, model evaluation and training of SPU-HMMs are alternated up to convergence. Here, convergence means a "stable" SPU based annotation of the training set, i.e. only minor differences between the annotations obtained in two succeeding iteration steps. During the iterative SPU determination, unique models for corresponding SPUs are estimated since redundant models will not be hypothesized. Thus the set of effective SPU candidates is stepwise reduced and the most frequent SPUs are used for the final annotation of the training set. This procedure, which is comparable to the k-means clustering for HMMs proposed in [Per00b], is summarized in figure 5.14.

135 5 Advanced Probabilistic Models for Protein Families

1. Initialization: Obtain an initial set S0 of SPU candidates, e.g. by energy based annotation of the training sequences.
2. Training: Perform Baum-Welch training for the SPU candidate models (linear HMMs) using the current annotation of the training set.
3. Annotation: Use the updated SPU models for obtaining a new annotation of the training set (recognition phase).
4. Termination: Terminate if two subsequent annotations of the training set do not (substantially) differ, i.e. convergence; continue with step 2 otherwise.
5. SPU Candidate List Reduction: Reduce the set of SPU candidates by discarding those elements which are not included in the final annotation of the training sequences. Perform a final annotation of the training set using the remaining list of SPU candidates.
6. Final Training: Perform steps 2-4 until convergence and train the SPU models using the final annotation of the training set.

Figure 5.14: Algorithm for obtaining a non-redundant set of SPUs which are used for final protein family modeling (comparable to [Per00b]).
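The alternating train/annotate loop of figure 5.14 can be sketched as a small driver. Here `train` and `annotate` stand in for Baum-Welch training of the SPU-HMMs and Viterbi-based re-annotation; both are caller-supplied placeholders, not the thesis' actual implementation.

```python
def estimate_spu_set(sequences, initial_annotation, train, annotate, max_iter=20):
    """Skeleton of the iterative SPU determination (steps 2-5 of figure 5.14).

    train(sequences, annotation)  -> dict mapping SPU label to its model
    annotate(sequences, models)   -> new per-sequence label lists
    """
    annotation = initial_annotation
    for _ in range(max_iter):
        models = train(sequences, annotation)          # step 2: model training
        new_annotation = annotate(sequences, models)   # step 3: re-annotation
        if new_annotation == annotation:               # step 4: convergence
            break
        annotation = new_annotation
    # step 5: discard SPU candidates absent from the final annotation
    used = {label for labels in annotation for label in labels}
    models = {name: m for name, m in train(sequences, annotation).items()
              if name in used}
    return models, annotation
```

Convergence is detected exactly as described in the text: two succeeding annotations of the training set coincide.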

Building Protein Family Models from Sub-Protein Units

Given the non-redundant set of SPUs relevant for the particular protein family, the global protein family model is finally created. The protein family model itself consists of variants of SPU concatenations observed during training. The N variants which are most frequently found within the annotation of the particular training sets are extracted for the conceptual family definition. Here, optional parts as well as looped occurrences are possible. For actual protein sequence classification, all variants are evaluated in parallel and determine the final classification decision. Comparable to e.g. speech recognition applications, the variants are mutually connected using pseudo states which gather transitions. In figure 5.15 the complete concept for estimating SPU based protein family models is graphically summarized. The three steps described above directly correspond to the three parts marked. In the first row, SPU candidates are highlighted in red. For clarity the amino acid representation is shown; however, the selection is performed using the 99-dimensional feature vectors. Based on the initial list of SPU candidates, a non-redundant set of SPUs is obtained by applying the iterative SPU estimation algorithm. In the middle part of the sketch the corresponding linear SPU HMMs are symbolized. For the global protein family model these SPUs are concatenated. The most frequently occurring SPU concatenations within the training set serve as the base for the protein family variants, which are evaluated in parallel when classifying unknown protein sequences (lower part of figure 5.15).
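Selecting the N most frequent SPU concatenations from the annotated training set is essentially a counting problem; a minimal sketch (the function name is illustrative):

```python
from collections import Counter

def family_variants(annotations, n=3):
    """Pick the n most frequent SPU concatenations from the annotated
    training set; these become the parallel variants of the family model."""
    counts = Counter(tuple(a) for a in annotations)
    return [list(v) for v, _ in counts.most_common(n)]
```

The returned variants would then be connected via pseudo states and evaluated in parallel, as described above.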


[Figure: 1) an exemplary amino acid sequence with SPU candidates highlighted; 2) linear SPU HMMs alternating with General parts (GP); 3) model variants 1..N built from optional SPUs and GP segments.]

Figure 5.15: Overview of the SPU based protein family modeling process: 1) SPU candidate selection; 2) Estimation of the non-redundant set of SPU models; 3) Protein family modeling using variants of the most frequently occurring SPU combinations within the training set. All optional parts of the model variants are marked with '?'.

5.4 Accelerating the Model Evaluation by Pruning Techniques

Protein sequence analysis techniques are the main driver of the fundamental paradigm shift in research dedicated to the first steps of the molecular biology processing pipeline: away from manually analyzing specific proteins (e.g. in wet-lab experiments) towards high-throughput automatic screening techniques. Especially for pharmaceutical purposes, target identification and verification would currently not be imaginable without computational methods as summarized in chapter 3. In the last few years the size of protein sequence databases has increased dramatically, and it is still growing steadily. As an example, figure 5.16 illustrates the exponential growth of the PDB database. Since the number of sequences is enormous, homology detection as well as homology classification has become a very challenging task in terms of the computational effort dedicated to the search and classification problem. Usually, research in molecular biology is performed in a more or less iterative manner, i.e. once new insights are obtained, new questions need to be answered. Mostly this implies new database searches in order to find sequences belonging to a particular protein family, which is defined by e.g. a certain biological function discovered in the "previous" iteration. Thus, efficient model evaluation techniques are mandatory when applying probabilistic protein family models to the task of protein sequence analysis for huge amounts of data. With respect to the computational effort necessary for model evaluation, the situation in different fields of pattern classification is quite comparable. As an example, state-of-the-art speech recognition systems consist of hundreds of HMMs, each containing substantial amounts of parameters. These models need to be evaluated for every utterance to be recognized, which is rather challenging especially when performing online, i.e. when the


Figure 5.16: Growth of the PDB, the primary protein database. Especially in the last decade the number of entries grew exponentially (courtesy of [Ber02]).

transcription of the uttered speech is required in real-time, as usual for e.g. dialogue systems used in automatic information desks. Throughout the years, sophisticated model evaluation methods were developed aiming at limiting the computational effort for general pattern recognition applications of HMMs. Contrary to the bioinformatics domain, for the majority of these alternative application fields the acceleration of the evaluation is performed algorithmically. Currently, biological sequence analysis is usually accelerated by increasing the computational power, i.e. by the deployment of more computers. Unfortunately, the general problem of extraordinary computational effort still remains. "Brute force" methods like using specialized hardware for massively parallel model evaluation only treat the symptoms, whereas the reasons for computationally expensive database searches when applying Profile HMMs are usually not addressed. The dimensionality of the problem even increases when applying new methods for improving the general detection and classification performance as discussed in this thesis. Thus, in this section model evaluation optimizations are presented which algorithmically address the third issue relevant for successful application of probabilistic protein family models (cf. page 92ff). Since all of these techniques can also be used in parallel computation environments, even current solutions requiring specialized hardware12 can benefit from them. The three approaches presented in the following, namely state-space pruning (section 5.4.1), combined model evaluation (section 5.4.2), and optimization of mixture density evaluation (section 5.4.3), originate from alternative pattern recognition fields. For this thesis they were adopted and transferred to the bioinformatics domain, resulting in efficient computational protein sequence analysis.

12As already mentioned before, one example is the GeneMatcher™ architecture of PARACEL.


5.4.1 State-Space Pruning

Analyzing the state-of-the-art in HMM based protein sequence analysis and reconsidering the basic theory of Hidden Markov Models as summarized in section 3.2.2, it becomes clear that the evaluation of probabilistic protein family models is usually rather straightforward. This means that no optimizations, either heuristic or theoretic, are applied at all. Especially for automatic speech recognition the situation is rather different. Already in the late 1970s, Bruce Lowerre proposed the so-called Beam-Search algorithm for heuristic state-space pruning during model evaluation [Low76]. By means of this technique substantial accelerations of HMM evaluation become possible. In the following, the Beam-Search approach is introduced and its general transfer to the bioinformatics domain is presented. Note that this optimization technique can be applied to all kinds of protein family HMMs, including discrete, i.e. state-of-the-art, and semi-continuous feature based models. Due to the substantial additional computational effort required for mixture density evaluation, the relevance of state-space pruning is extraordinary for the latter.

The Beam-Search Algorithm for Accelerated HMM Evaluation

In order to find the most probable path ~s∗ through the whole state space V of an HMM λk producing the observation sequence ~o, the Viterbi algorithm is widely used, as discussed in section 3.2.2. Reconsidering the main idea: basically, each step t of the incremental algorithm consists of the calculation of the maximally achievable probabilities δt(i) for partial emission sequences ~o1 . . . ~ot and state sequences s1 . . . st:

δt(i) = max_{s1...st−1} {P(~o1, . . . , ~ot, s1 . . . st | λk) | st = Si}.

Since the dependencies of the HMM states are restricted to their immediate predecessors (the so-called Markov property), the calculation of δt+1(j) is limited to the estimation of the maximum of the product of the preceding δt(i) and the appropriate transition probability. Additionally, the local contribution of the emission probability bj(~ot+1) is considered. Stepping through the state space recursively, all δt+1(j) are calculated using the following rule:

δt+1(j) = max_i {δt(i) aij} · bj(~ot+1).
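A minimal log-domain implementation of this recursion might look as follows; the dictionary-based layout is for illustration only, real Profile HMM decoders operate on the trellis far more efficiently.

```python
def viterbi(obs, states, log_start, log_trans, log_emit):
    """Plain Viterbi recursion in the log domain:
    delta_{t+1}(j) = max_i { delta_t(i) + log a_ij } + log b_j(o_{t+1})."""
    delta = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev = delta[-1]
        row, ptr = {}, {}
        for j in states:
            # recombination: only the best predecessor survives (Markov property)
            best_i = max(states, key=lambda i: prev[i] + log_trans[i][j])
            row[j] = prev[best_i] + log_trans[best_i][j] + log_emit[j][o]
            ptr[j] = best_i
        delta.append(row)
        back.append(ptr)
    # backtrack the most probable state sequence
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), delta[-1][last]
```

Note that every state is considered as a predecessor of every other state at every step; the beam pruning introduced below restricts exactly this inner maximization.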

Figure 5.17 illustrates the recursive calculation of δt(i) during the Viterbi algorithm. When analyzing the computational effort necessary for the Viterbi algorithm, it becomes clear that after only a few steps a large amount of possible paths needs to be considered. Figure 5.18 gives an idea of the number of overall explored states while stepping through the state space for two different model architectures, namely a classical linear model as used for speech recognition applications (upper part) and the standard three-state Profile HMM topology (lower part). The more states have to be explored at each step, the more continuations of all paths possible so far become reasonable. Thus, the amount of paths traced overall increases dramatically, and as a consequence the processing time necessary for model evaluation does, too. Alleviating the constraints on the model architecture implies increasing the decoding effort!


[Figure: trellis diagram showing how δt+1(j) is obtained from the predecessors δt(i) via the transition probabilities aij, the maximization, and the emission bj(~ot+1).]

Figure 5.17: Illustration of the recursive calculation scheme for estimating the path probabilities by recombination (courtesy of [Fin03]).

Ergodic HMMs (cf. figure 3.10 on page 45) in principle allow arbitrary paths through the state space. Since the number of state combinations which need to be explored reaches the extreme value of N², the computational effort is substantial. Linear model architectures moderately bound the number of possible paths by strongly restricting the successors of a given state to the state itself and its immediate successor. However, due to self-transitions the number of states to be considered increases rapidly when stepping through the model. Since discrete protein data covered by Profile HMMs usually varies in both length and constitution, the more flexible three-state Profile HMM architecture was developed. Due to the Delete states, principally every state (adjacent to the right of a particular one) of the HMM can be reached from anywhere in the model, which is comparable to the ergodic topology. Thus, a huge number of paths generally needs to be evaluated. On the right side of the sketches in figure 5.18, the substantially differing numbers of states that need to be explored are shown for both architectures. The more states are explored, the more computational effort is required for model evaluation. In order to bound the problem, Bruce Lowerre introduced the Beam-Search algorithm, establishing a dynamic criterion for search space pruning based on the relative differences of the partial path scores [Low76]. The state space is pruned following the basic idea of restricting the search to "promising" paths only. For obvious reasons the processing time for model evaluation is proportional to the amount of promising paths. Especially for large models consisting of numerous states, many of these states cover quite different pattern families. So nearly always at least parts of the state space are quasi irrelevant for one particular final solution. The remaining states are activated regarding the most probable path.
Formally all states {st} are activated whose δt(i) are more or less similar to the locally optimal solution


[Figure: left, a linear model architecture as usual in speech recognition applications and the three-state Profile HMM architecture (Match, Insert, Delete states); right, the corresponding state spaces V explored over time t = 1 . . . T.]

Figure 5.18: States that need to be explored at each step of the evaluation (right) of HMMs with different model architectures (left) – dashed lines: Viterbi paths through state spaces V for hypothetical sequences. Especially for Profile HMMs, after very few evaluation steps a large number of states can be reached and thus needs to be explored when using the standard Viterbi algorithm.

δt∗ = max_j δt(j). The threshold of just acceptable differences in the local probabilities is defined proportional to δt∗ by the parameter B. So the set of activated states At at a given time is located in a Beam around the optimal solution and determined by:

At = {i | δt(i) ≥ B · δt∗} with δt∗ = max_j δt(j) and 0 < B < 1.

The only parameter of this optimization technique is the Beam-width B. When exploring the Viterbi matrix V at the next step t + 1, only these active states are treated as possible predecessors for the estimation of the local path probabilities. Thus at every Viterbi step the following modified rule is used for the estimation of all δt+1(j):

δt+1(j) = max_{i∈At} {δt(i) aij} · bj(~ot+1).

Note that the Beam-Search algorithm represents a heuristic approximation of the standard Viterbi procedure. Consequently, it delivers a sub-optimal solution of the decoding problem which is, however, of sufficient quality.
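A single pruned Viterbi step under these definitions can be sketched as follows; since scores are kept in the log domain, the beam is passed as log B < 0, and missing transitions are treated as impossible.

```python
def beam_step(prev_delta, states, log_trans, log_emit_t, log_beam):
    """One Viterbi step with Lowerre's beam pruning: only states whose
    score lies within log_beam (= log B < 0) of the current best survive
    as predecessors for the next step."""
    best = max(prev_delta.values())
    active = [i for i, d in prev_delta.items() if d >= best + log_beam]
    new = {}
    for j in states:
        scores = [prev_delta[i] + log_trans[i].get(j, float("-inf"))
                  for i in active]
        if scores and max(scores) > float("-inf"):
            new[j] = max(scores) + log_emit_t[j]
    return new, active
```

States outside the beam simply never appear as predecessors, which is exactly how whole sub-spaces of large models drop out of the evaluation.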


Acceleration of Protein Family Model Evaluation

In alternative pattern recognition domains, classification is normally the primary application for HMMs (comparable to target validation for protein sequence analysis). Usually, small basic models are evaluated in parallel, which technically corresponds to a combination of all states involved into a global state space (see section 5.4.2). This allows global state-space pruning. However, if large patterns are modeled using HMMs, the states within a particular model are likely to cover mutually different parts. Thus, Beam-Search is expected to be effective already for single model evaluation. This thesis is directed to the enhancement of probabilistic protein family models. As previously mentioned, the smallest protein unit to which a biological function can be assigned is usually the domain. Thus, model lengths of dozens or even hundreds of conserved parts are very common. There is strong evidence that parts of the model that are further apart do not necessarily interfere, even for remote homologies. However, when using the complex three-state Profile topology, the contrary is implicitly assumed, since the chain of Delete states allows almost arbitrary state transitions within a particular model. It can be expected that most of the paths through a Profile HMM are not relevant for the final solution, and state-space pruning is very effective when concentrating the model evaluation on the most promising paths only. All modeling approaches developed in this thesis include the state-space pruning described above. The Beam-Search algorithm needs to be configured specifically for the protein sequence analysis domain. This means that a suitable Beam-width needs to be determined which enables efficient model evaluation while keeping the detection or classification accuracy as high as possible. In section 6.4 the experiments which give hints for a proper choice of the Beam-width, and their results, are presented.

5.4.2 Combined Model Evaluation

As briefly mentioned earlier, one basic difference of HMM applications within the bioinformatics domain compared to their use in different fields of general pattern classification is the serialized model evaluation. As an example, currently all protein family models which are in any way relevant for molecular biologists or for pharmaceutical purposes are generally evaluated separately when e.g. the whole genome is annotated with respect to them. Furthermore, even for pure classification tasks in terms of classical pattern recognition, e.g. when performing target verification, the scores produced for protein sequences of interest by every HMM are calculated separately. Finally, the highest scoring model determines the classification result. When proceeding like this, automatic speech recognition is basically not imaginable for reasonable numbers of words to be recognized: there are too many models to be evaluated completely for too many uttered words. Therefore, the general procedure differs from the one performed for protein sequence analysis. Instead of fully evaluating every model separately, all relevant models are treated in combination. Technically, this implies that the states of all HMMs are integrated into a global state space which is conceptually segmented into the particular models.


When arranging all relevant protein family models like this, at the beginning of a particular sequence classification process the initial states of all models are activated, i.e. they are treated as starting points for the Viterbi path search. These paths, including all their extensions which can be reached during the remaining steps of the Viterbi evaluation, need to be considered in parallel, which by itself is no advantage compared to the usual separate evaluation. However, when applying the Beam-Search algorithm as discussed in the previous section, further substantial savings of necessary computations can be obtained in addition to those implied by local state-space pruning for single models. Usually, after a certain number of Viterbi steps huge amounts of HMM states can be skipped for further evaluation. Because the exploration of devious paths within the combined state space is avoided, not all known Profile HMMs necessarily need to be evaluated completely. Contrary to this, when performing serialized evaluation of multiple models for e.g. genome mapping, at least one complete path through every particular model needs to be evaluated. In figure 5.19 the accelerated model evaluation process for protein family HMMs is illustrated. All known models of protein families (λ1 . . . λK on the left side of the sketch) are integrated into a single combined state space – the grey shaded box at the right of the diagram. As in figure 5.18, the evaluation process is shown in the diagram of the state space V on the right side. Following this approach, large amounts of HMM states do not need to be activated – they are pruned (black circles). The sequence ~o is classified using the combined state space by finding the Viterbi path (dashed line) through all models. The effect of state space pruning can be noticed via the ratio of activated states (red circles) to the overall number of states (all circles).
For every Viterbi step, only a moderate "Beam" of states around the Viterbi path is activated. The smaller the percentage of activated states, the more the model evaluation process itself is accelerated. Note that in figure 5.19 the effects of both local and global state-space pruning can be seen. After only a few Viterbi steps the lower model is no longer evaluated since all its successor states are pruned (global pruning), and for the remaining models a certain number of assigned states is pruned as well. At the end of the combined model evaluation, the index of the best fitting model is delivered, including its score for the requested sequence. Compared to the conventional approach, multiple repeats of this procedure are not necessary since a global classification is performed.
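Building the combined state space amounts to namespacing every model's states and pooling all entry states; a sketch under the simplifying assumption of a uniform model prior (the dictionary layout is illustrative, not ESMERALDA's actual data structure):

```python
import math

def combine_models(models):
    """Merge several HMMs into one global state space: each state is
    prefixed with its model's name, and every model's initial states
    become entry points of the global space. One pruned Viterbi pass
    over this space then scores all families at once; the model name of
    the winning final state is the classification result."""
    states, log_start, log_trans = [], {}, {}
    log_prior = -math.log(len(models))  # uniform prior over the K models
    for name, m in models.items():
        for s in m["states"]:
            gs = (name, s)
            states.append(gs)
            # transitions stay within the model segment of the global space
            log_trans[gs] = {(name, t): lp
                             for t, lp in m["log_trans"].get(s, {}).items()}
        for s, lp in m["log_start"].items():
            log_start[(name, s)] = lp + log_prior
    return states, log_start, log_trans
```

Because no transitions cross model boundaries, the global space remains conceptually segmented into the particular models, exactly as described above.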

5.4.3 Optimization of Mixture Density Evaluation

In section 5.1 the new feature based protein sequence representation, based on the analysis of biochemical properties of residues in their local neighborhood, was presented as the fundamental approach of the enhanced probabilistic protein family models developed in this thesis. Following the general theory of Hidden Markov Models, it is rather counterproductive to use discrete models for continuous data like the new protein features. Thus, (semi-)continuous HMMs are applied for this kind of data. The protein features span a 99-dimensional feature space which is best represented using mixture density distributions (cf. section 5.2). By means of both the protein features and the suitable feature space representation, enhanced protein family HMMs are developed. Unfortunately, the price for the enhancement of state-of-the-art protein family models in terms of computational complexity is rather high. Whereas in the discrete case only a single probability distribution needs to be evaluated for every emitting state, in the (semi-)continuous


[Figure: models λ1 . . . λK feeding a combined state space V; the best scoring model λbest delivers Sbest(λbest, ~o); active HMM states, pruned HMM states, and the Viterbi path are marked over t = 1 . . . T.]

Figure 5.19: Accelerated Profile HMM evaluation due to the use of a combined, pruned state space: all states of all known models are integrated into V and only a small fraction of states needs to be explored (red solid circles).

case a mixture density distribution consisting of 1024 Gaussians needs to be examined for every emitting state. Thus, the evaluation of the new semi-continuous feature based protein family models is systematically slower than the evaluation of current discrete Profile HMMs. The problem of computationally expensive mixture density evaluation principally exists for all applications of (semi-)continuous Hidden Markov Models. Not surprisingly, especially within the speech recognition community, several approaches were proposed aimed at reducing the computational complexity of mixture density evaluation.13 Based on techniques provided by the ESMERALDA system [Fin99], the optimization of mixture density evaluation for protein family HMMs concentrates on the following principles.
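For illustration, a semi-continuous emission with a shared diagonal-covariance codebook can be written as follows; this is a toy-sized sketch with illustrative names, whereas the thesis' models use a 1024-component codebook over 99 dimensions.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of one diagonal-covariance Gaussian component."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def semicontinuous_emission(x, codebook, weights_j):
    """Semi-continuous emission b_j(x) = sum_k c_jk N_k(x): one Gaussian
    codebook shared by all states, state j only stores its weights c_jk."""
    return sum(c * math.exp(log_gauss_diag(x, mu, var))
               for c, (mu, var) in zip(weights_j, codebook))
```

The cost of one emission is thus dominated by the codebook size, which is exactly what the optimizations below attack.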

Efficient Normal Densities Classification: Since the protein feature space is parametrically represented using a mixture density consisting of K Gaussians (cf. section 5.2.1), strictly speaking, K normal density classifiers need to be evaluated for every feature vector. The result of these evaluations provides probabilities for every component of the mixture density representation of the feature space. On the one hand, the larger K, the more accurate the feature space coverage. On the other hand, the larger K, the larger the computational effort for mixture density evaluation.

13As an example, in [Fin03, pp. 163ff] the most relevant techniques for efficient mixture density evaluation are briefly summarized.


In his standard work on pattern classification, Heinrich Niemann describes the reduction of the computationally expensive problem of evaluating K normal densities Nk to the calculation of K dot products, which is favorable for efficient normal densities classification [Nie83]. This calculation of scalar products is used as the default for the evaluation of the advanced stochastic protein family models developed in this thesis.
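One common way to realize this reduction for diagonal covariances is to expand the quadratic form once and fold it into precomputed weights, so that each log density becomes a single dot product with the augmented vector [1, x, x²]. This is a sketch of the idea with illustrative names, not Niemann's exact formulation (which covers the general case via a linear feature transform).

```python
import math

def gauss_as_dot_product(mean, var):
    """Precompute weights so that the log density of a diagonal Gaussian
    reduces to one dot product with the augmented vector [1, x, x^2]."""
    w0 = sum(-0.5 * (math.log(2 * math.pi * v) + m * m / v)
             for m, v in zip(mean, var))          # constant term
    w1 = [m / v for m, v in zip(mean, var)]       # linear term
    w2 = [-0.5 / v for v in var]                  # quadratic term
    return [w0] + w1 + w2

def log_density_dot(x, w):
    """Evaluate the precomputed Gaussian as a plain dot product."""
    aug = [1.0] + list(x) + [xi * xi for xi in x]
    return sum(wi * ai for wi, ai in zip(w, aug))
```

All K sets of weights can be precomputed once, so that at decoding time each of the K densities costs exactly one dot product.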

Beam-Search for Mixture Densities: Originally, the Beam-Search algorithm as described in the previous section was defined for state-space pruning. However, the principal idea of the pruning approach can also be generalized to the evaluation of mixture components. Instead of defining a set A of activated states, a set of activated Gaussians is used. Given the highest scoring mixture component for a particular Viterbi step, the mixture components within a Beam, which is conceptually identical to the state based Beam, are determined and the further mixture density evaluation is limited to those components only. Usually, the number of mixture components which need to be evaluated can be reduced substantially when applying the (modified) Beam-Search algorithm. The reason for its actual effectiveness lies in the general structure of the underlying feature space. If the mixture density representation suitably covers the feature space, it is very unlikely that all components are equally relevant for a single state. By means of the state specific prior probabilities of the particular Gaussians, only a small group of components will significantly contribute to the final local score. Thus, following the Beam-Search idea, the evaluation can be concentrated on these components. The probability of missing relevant mixture components when using mixture density Beam-Search is almost negligible when configuring the Beam-width wisely. For protein sequence analysis, the default values of ESMERALDA (− ln(B) = 14 for logarithmic densities) proved sufficient in informal experiments. All modeling approaches developed in this thesis which are based on a mixture density representation of the underlying protein feature space use the pruning technique described here.

Multi-stage Classification: As discussed in section 5.2.1, the feature space representation, namely the particular normal densities, can be obtained by applying variants of standard vector quantization techniques to training data. Generally, the evaluation of a mixture density of Gaussians when aligning feature data to HMMs is equivalent to soft vector quantization using a specialized distance metric, namely the Mahalanobis distance. Among others, Ernst Günter Schukat-Talamazzini and colleagues proposed methods for fast Gaussian vector quantization [Sch93]. One rather simple but very effective optimization technique is the so-called sequential pruning. Here, the basic idea is the decomposition of the vector quantization process into a sequence of "filters" VQi, each computing probability scores for every mixture component of its stage-specific input candidate list. Furthermore, ". . . the VQi's are designed as Gaussian quantizers using the initial di coefficients of the input vector for probabilistic scoring, where di increases with i, and dI = N" [Sch93]. This means that every stage of the decomposed vector quantizer operates on a subspace of the

original feature space only, whereas the number of Gaussians which need to be evaluated steadily decreases by candidate selection. As discussed in the previous section, the actual candidate selection can be realized by applying the Beam-Search algorithm to the mixture density evaluation.

In the original case, i.e. without any optimization, a single filter VQ0 is applied to all Gaussians of the particular mixture density for complete feature vectors ~x ∈ R^N. A sequential vector quantizer is completely specified by fixing the number of stages I as well as the appropriate subspace dimensions for every stage. In informal experiments within the domain of automatic speech recognition it was shown that a two-stage classifier is usually suitable when defining the particular subspaces wisely. In the first stage, Gaussians are evaluated only for a lower-dimensional version of the original feature vector. Following this, the Beam-Search algorithm severely reduces the number of remaining Gaussian candidates which need to be evaluated in the second stage for the complete feature vector, i.e. its representation in the original (high-dimensional) space. As explained in section 5.1.2, the final step of the feature calculation process developed for protein data consists of a Principal Components Analysis (PCA). According to the theory of PCA, the components of the 99-dimensional feature data are ordered with respect to the percentage of variance they cover (cf. appendix B). The first dimensions of the feature vectors represent the most relevant information necessary for describing the feature space. Thus, the multi-stage mixture density classification is expected to be very effective when processing protein feature data. The dimensionality of the initial subspace representation of the particular feature data used within the first stage of the classification process is determined in various experiments which are described in detail in section 6.4.
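A two-stage sequential quantizer with beam-based candidate selection can be sketched as follows; `score` is a caller-supplied log-density function and all names are illustrative, not the interface of [Sch93] or ESMERALDA.

```python
def two_stage_scores(x, gaussians, score, d1, log_beam):
    """Sequential pruning sketch (two stages): stage 1 scores every
    Gaussian on the first d1 dimensions only (PCA puts the most
    informative coefficients first); only candidates within the beam of
    the stage-1 optimum are re-scored on the full vector in stage 2."""
    stage1 = {k: score(x[:d1], mu[:d1], var[:d1])
              for k, (mu, var) in gaussians.items()}
    best = max(stage1.values())
    survivors = [k for k, s in stage1.items() if s >= best + log_beam]
    # stage 2: full-dimensional evaluation of the surviving candidates only
    return {k: score(x, mu, var)
            for k, (mu, var) in gaussians.items() if k in survivors}
```

Since the cheap stage-1 pass already discards most components, the expensive full-dimensional evaluation is restricted to a small candidate list.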

5.5 Summary

In this chapter, advanced stochastic models were developed addressing both improved remote homology classification and detection performance. The foundation for all developments, which can be used for tackling the sparse data problem as defined on page 92, is the paradigm shift from analyzing symbolic amino acid data towards a feature based representation of biochemical properties of residues in their local neighborhood. Therefore, a sliding window technique is applied to protein sequences for the extraction of frames which contain the residues of a local neighborhood. The particular amino acids are mapped to a signal-like multi-channel numerical representation by exploiting amino acid indices which encode various biochemical properties. Features relevant for the particular protein sequences are extracted by channel-wise signal abstraction via a Discrete Wavelet Transformation and a global Principal Components Analysis. Based on the new feature based protein data representation, semi-continuous protein family HMMs were developed. Especially in those cases where only little training data is available, the separate estimation of a mixture density based feature space representation using general protein data, and the actual model creation using family specific protein data, is favorable. For further specialization of the general feature space underlying semi-continuous protein family HMMs, various adaptation techniques, namely ML or MAP re-estimation and MLLR adaptation, are applied.


In addition to semi-continuous feature based Profile HMMs, which consist of the standard three-state model topology and semi-continuous emissions, alternative model architectures with reduced complexity were developed. Two variants were presented, namely Bounded Left-Right models, and protein family models based on concatenations of small building blocks, so-called Sub-Protein Units, which are automatically determined in an unsupervised and data-driven procedure. These new modeling techniques allow the estimation of advanced stochastic protein family models containing significantly fewer parameters which need to be trained. Thus, substantially fewer training sequences are required for robust protein family modeling. In order to decrease the number of false positive predictions during remote homology detection, target models are competitively evaluated against explicit background models covering general protein data. The reduction of false positive predictions is especially relevant for pharmaceutical applications where candidate sequences which are erroneously identified as targets increase the costs of the drug design process. For efficient model evaluation, various optimization techniques known from general pattern recognition domains were transferred and adapted to the bioinformatics domain.
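The competitive evaluation of a target model against an explicit background model amounts to a log-odds decision; the following sketch assumes precomputed log-likelihoods and a hypothetical threshold of zero.

```python
def log_odds(ll_target, ll_ubm):
    """Competitive score: target model log-likelihood versus the
    Universal Background Model (UBM) log-likelihood."""
    return ll_target - ll_ubm

def detect_members(candidates, threshold=0.0):
    """candidates: (sequence id, target log-likelihood, UBM
    log-likelihood) triples; accept only sequences for which the
    target model beats the background model by the threshold."""
    return [sid for sid, t, b in candidates if log_odds(t, b) > threshold]

# seq1 is explained better by the target model, seq2 by the background.
hits = detect_members([("seq1", -100.0, -120.0), ("seq2", -90.0, -85.0)])
```

Raising the threshold trades false positives for false negatives, which is exactly the working-point trade-off analyzed in the ROC evaluations of this chapter.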


6 Evaluation

Currently, stochastic models of protein families, namely discrete Profile Hidden Markov Models, are the methodology of choice for protein sequence analysis. They are especially relevant for remote homology classification and detection tasks, e.g. for pharmaceutical applications within the drug discovery process as described in section 2.4.1 (target identification and target verification). However, the effectiveness of state-of-the-art modeling techniques is still insufficient, as demonstrated in the assessment presented in section 4.1. The previous chapter of this thesis was directed to the development of advanced stochastic protein family models. Several new approaches were presented which aim at more robust and thus more effective stochastic protein family models. In addition to the general improvement of protein family models, the focus was on model estimation using small training sets. This is especially relevant since usually only very little representative data is available for therapeutically interesting protein families at the beginning of the drug design process. In this chapter, detailed experimental evaluations of the newly developed methods are presented. First, in section 6.1, the general evaluation methodology is discussed together with the data sets used. Following this, the three fundamental categories of enhancements are evaluated separately. Based on these results, the complete system for enhanced protein sequence analysis using Hidden Markov Models is configured and finally evaluated as a whole in section 6.5.

6.1 Methodology and Datasets

The newly developed methods described in this thesis were implemented in the GRAS2P system (cf. [Plö02]); thus, all experiments were performed using it. For comparison with state-of-the-art techniques, the SAM system was used as described in section 4.1. When assessing the capabilities of state-of-the-art methods for protein family modeling in section 4.1.1 on page 85, the general "chicken and egg" problem for the evaluation of sequence analysis methods has already been mentioned. Generally, it is not useful to refer to automatically obtained sequence annotations as baseline for the evaluation of new techniques. The only conclusion which can be drawn when comparing the capabilities of new approaches to the results obtained using alternative automatic analysis techniques is how "well" the new technique approximates the old one(s). Without applying further biological expertise it is very difficult to judge whether differences between both kinds of automatically generated annotations are erroneous or not. Thus, manually annotated sequence sets were used for the assessment of current discrete Profile HMM based protein sequence analysis. The detailed experimental evaluation of the new approaches of this thesis, including their comparison to state-of-the-art techniques, directly follows this argumentation.

The goal of this thesis is the development of protein sequence analysis methods which can generally be used, e.g., for remote homology detection. Thus, concentrating the experimental evaluation on one or two selected protein families seems counterproductive. Instead, a broader evaluation based on database screening for a substantial number of protein families is more favorable. Corpora are required containing both training and test sets of sequences for which the appropriate protein family affiliations are definitely known. One basic source for protein data is certainly the Brookhaven Protein Data Bank (PDB), which was already briefly described in section 2.3.2. The PDB contains a wealth of information for a vast amount of protein sequences. Based on this database, a structural classification of proteins was manually performed by Alexey Murzin and colleagues, resulting in the SCOP database [Mur95]. Due to the manual curation of the datasets their quality is extraordinarily good, which has been widely accepted by the scientific community.1 Except for SCOP there are hardly any alternative databases containing annotations of this superb quality. One almost comparable example is the CATH database of protein structure classification [Ore97]. In CATH principally the same data (PDB sequences) is classified, but unfortunately at least partially automatic annotation methods were used. Thus, the majority of experimental evaluations performed for this thesis are based on SCOP annotations, allowing well founded conclusions about the general applicability of the new methods. Following the descriptions of experiments which are directed to the separate evaluation of the effectiveness of methods addressing the three fundamental issues (feature based sequence representation, less complex model architectures, and acceleration of the model evaluation), the evaluation of the final complete system as a whole is presented. For both kinds of evaluations, different corpora were created, which are summarized in the following.
According to the putative application fields of the new methods, namely target verification and identification within the drug discovery process, the evaluation covers both classification accuracy and detection performance. Both measurements are equally relevant for the overall assessment, and the results obtained for target verification tasks usually give reasonable hints regarding the effectiveness of the appropriate method when it is used for target identification. In section 4.1.2, exact descriptions of the basic methodology used were given (measuring the classification error, and ROC curves for estimating the detection performance). Especially for pharmaceutical applications, concrete values within ROC curves are relevant for judging the effectiveness of methods addressing remote homology detection. Fixed working points on the ROC curves are analyzed where certain percentages of e.g. false negative predictions are allowed and the corresponding number of, in this case, false positive predictions is treated as a characteristic value which concretely measures the effectiveness of the particular method. In practical applications, the number of corresponding false predictions is analyzed when allowing five percent failures. Thus, a typical evaluation of detection methods can be formulated as follows: How many false negative predictions (in percent) are delivered by the system if five percent false positive predictions are acceptable? And vice versa: How many false positive predictions (in percent) are delivered by the system if five percent false negative predictions are acceptable? In addition to ROC curves demonstrating the general effectiveness of the methods, these characteristic values are presented in tabular form.
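The working-point question posed above can be answered mechanically from the two score populations; this sketch assumes higher scores indicate target membership and ignores tie-handling subtleties.

```python
def false_negative_rate_at(pos_scores, neg_scores, allowed_fp=0.05):
    """Fraction of target sequences missed when the decision threshold
    is set so that at most `allowed_fp` of the non-target sequences are
    accepted (higher score = more target-like)."""
    neg = sorted(neg_scores, reverse=True)
    k = int(allowed_fp * len(neg))              # tolerable false positives
    thresh = neg[k] if k < len(neg) else float("-inf")
    missed = sum(1 for s in pos_scores if s <= thresh)
    return missed / len(pos_scores)

# Toy populations: four targets, one hundred non-targets with scores 0..99.
fn_rate = false_negative_rate_at([50, 98, 99, 100], list(range(100)))
```

The mirrored question (false positives at a fixed false-negative budget) is answered analogously with the roles of the two populations exchanged and the comparison direction reversed.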

1 In fact, a large number of publications are based on experiments performed using the SCOP database.


The SCOPSUPER95_66 Corpus

The SCOPSUPER95_66 corpus consists of 16 SCOP superfamilies each containing at least 66 sequences with residue based similarity values of not more than 95 percent. They were obtained from the SUPERFAMILY hierarchy created for the SCOP database by Gough and coworkers [Gou01]. The sequences were randomly divided into disjoint sets of training (two thirds) and test data (one third). Note that this subdivision of the protein sequences is kept fixed, i.e. no further leave-N-out tests are performed. SCOPSUPER95_66 was already used for the general assessment of the capabilities of state-of-the-art Profile HMMs in section 4.1, and its detailed description can be found on pages 85ff. This corpus is mainly used for detailed evaluations of certain variants of the methods developed. In terms of general pattern recognition, the test cases originating from the use of the SCOPSUPER95_66 corpus can be understood as cross validation. Using its results, parameters can be adjusted; the final evaluation based on extended corpora (SCOPSUPER95_20 and PFAMSWISSPROT, see below) is then performed using the configuration derived by analyzing the evaluation results based on SCOPSUPER95_66. Note that contrary to SCOPSUPER95_66, where the suffix '66' designates the overall minimum of sequences per family (including both training and test), the names of the following corpora are determined with respect to the number of training samples they contain. This reflects the fact that evaluations based on these corpora are more related to the sparse data problem; thus, the suffix refers to the number of training samples. The actual amount of test sequences is adjusted individually. For the assessment of detection capabilities, the approximately 8 000 sequences of the 95% similarity based SUPERFAMILY hierarchy of SCOP are analyzed for occurrences of the 16 superfamilies.
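The fixed two-thirds/one-third subdivision described above can be reproduced schematically; the helper name and the seed are illustrative, not taken from the thesis.

```python
import random

def split_family(sequences, train_frac=2.0 / 3.0, seed=42):
    """One fixed random subdivision into disjoint training (two thirds)
    and test (one third) sets; a fixed seed keeps the split reproducible
    so that no leave-N-out resampling is needed afterwards."""
    rng = random.Random(seed)
    shuffled = list(sequences)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# A family at the corpus minimum of 66 sequences yields 44 + 22.
train_set, test_set = split_family([f"seq{i:02d}" for i in range(66)])
```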

The SCOPSUPER95_44f Corpus

It is one major goal of this thesis to develop methods for estimating protein family HMMs using small training sets (cf. the second basic issue relevant for the successful application of new powerful probabilistic models, the sparse data problem, as defined in section 4.2 on page 92). Thus, the dependency of the new techniques on the number of training samples used is explicitly evaluated. For this purpose, the SCOPSUPER95_44f corpus was designed. It is a direct derivative of the previously described SCOPSUPER95_66 dataset, i.e. it contains the same 16 SCOP superfamilies extracted from the SUPERFAMILY hierarchy of SCOP, limiting the family-wise sequence similarity to a maximum of 95 percent. Contrary to the previous corpus, here the amount of training data is severely reduced in multiple steps, resulting in 44 sub-corpora. The sub-corpus consisting of the maximum number of training samples contains 44 sequences per superfamily, hence the name SCOPSUPER95_44f. The almost uniform distribution of the similarity values across the whole range as illustrated in figure 4.1 is generally valid for all sub-corpora, too. Contrary to the former dataset, in each of the SCOPSUPER95_44f sub-corpora the number of training sequences is equal for all 16 superfamilies. Starting from 44 training sequences, this number is steadily decreased down to one sample for every superfamily analyzed while keeping the test set fixed. Generally, two test sets can be analyzed. For direct comparison to SCOPSUPER95_66, where more or less substantial amounts of training material are available, the first test set of this corpus is identical to the one defined for SCOPSUPER95_66, which contains 566 sequences (the so-called original test set). Since some superfamilies of SCOPSUPER95_66 contain more than 44 training sequences (cf. table 4.1 on page 87) and the amount of training samples for SCOPSUPER95_44f is fixed at a maximum of 44 samples, a second test set, the so-called extended test set, can be defined. For all superfamilies originally containing more than 44 sequences, the additional sequences were added to the original test set, resulting in a larger amount of test data, namely 983 sequences. For those superfamilies with larger training sets, the actual selection of the 44 sequences was performed randomly.

When reducing the number of training sequences by eliminating samples, the statistically correct method for evaluating the capabilities of models trained on the remaining sequences is based on averaging the results obtained by leave-N-out tests. This means that for a given number N of samples which are to be eliminated from the maximally 44 training sequences of every superfamily analyzed (16), all combinatorial possibilities of selecting those omitted sequences need to be addressed. The number C_N^M of possible combinations when selecting N sequences from a universe of M is defined as

C_N^M = (M choose N) = (M choose M−N) = M! / (N! (M−N)!).

For the training sets of SCOPSUPER95_44f this results in the exorbitant number of 2.8 · 10^14 possible combinations. Even with the fastest computers available it is unrealistic to perform 280 trillion experiments. Thus, the statistically correct method cannot be used for SCOPSUPER95_44f based evaluations. Reconsidering how the particular sub-corpora of SCOPSUPER95_44f were created, it becomes clear that reliable conclusions can be drawn even when performing only a single experiment for every sub-corpus. The maximally 44 training sequences were selected by chance for those families consisting of more than 66 sequences. Furthermore, the actual reduction of the resulting training sets for creating the particular sub-corpora is performed by randomly selecting N samples for omission from the rather homogeneous base set. Although artifacts are to be expected, the general assessment of the models' capabilities for target identification and verification is possible. As an example, the classification errors obtained for the particular sub-corpora can be summarized in a smoothed curve which can be compared to the smoothed curves of classification errors obtained when using alternative techniques. Although the correct statistical evaluation is (for practical reasons) not possible, the general trend can be approximated rather accurately. Note that for detection experiments, again the complete 95% similarity based SUPERFAMILY hierarchy of SCOP (approximately 8 000 sequences) is analyzed with respect to the 16 superfamilies.
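The combinatorial argument and the pragmatic alternative (a single random sub-corpus per training-set size) can be checked numerically; the function names and the seed are illustrative.

```python
import random
from math import comb

def leave_n_out_count(m, n):
    """Number of distinct ways to omit n of m training sequences."""
    return comb(m, n)

def draw_sub_corpus(sequences, n_keep, seed=0):
    """Pragmatic alternative to exhaustive leave-N-out enumeration:
    a single randomly drawn training set of the desired size."""
    return random.Random(seed).sample(list(sequences), n_keep)

sub = draw_sub_corpus(range(44), 10)
```

Already a single superfamily contributes C(44, 22) > 2 · 10^12 possible training subsets at the halfway reduction step, which illustrates why the single-draw shortcut is unavoidable.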

The SCOPSUPER95_20 Corpus

Both corpora described so far are based on 16 SCOP superfamilies. The basic purpose of experiments performed using SCOPSUPER95_66 or SCOPSUPER95_44f is to get some

reliable general idea about the effectiveness of the new techniques. Furthermore, some kind of cross validation is performed: by means of experiments related to one of these corpora (or both), certain parameters of the methods can be adjusted (e.g. which adaptation method to use, cf. section 5.2.3), which will be described on the particular pages. In order to obtain a broader overview of the effectiveness of enhanced protein family models, two additional corpora were created. The final system is evaluated using these corpora without further optimization. In terms of general pattern recognition, these corpora represent the actual test case compared to the cross validations treated before. First, the SUPERFAMILY hierarchy of the SCOP database is further exploited (again at the sequence similarity level of maximally 95 percent). Contrary to both previous corpora, the criterion for including particular superfamilies into this corpus is the existence of at least 40 family members, which results in 34 superfamilies. These sequences are divided into training and test sets in (almost) equal shares, hence the suffix '20' according to the minimum number of training samples per superfamily. Similar to the presentation of the SCOPSUPER95_66 corpus in section 4.1.1 (pages 85ff), the datasets are characterized in figure 6.1 as well as in table 6.1. It can be seen that the properties of SCOPSUPER95_20 are rather similar to those of SCOPSUPER95_66 with respect to the similarity value distributions as well as to the sequence lengths and their variances. The base for detection experiments is the same as for the previous corpora.

[Figure 6.1: histogram for the SCOPSUPER95_20 corpus; ordinate: Sequences [%], abscissa: Similarity Ranges [%] (0-95%), with separate bars for training and test data.]

Figure 6.1: Histogram of similarity ranges for the SCOPSUPER95_20 corpus, averaged over all 34 superfamilies involved, illustrating the well-balanced distribution of similarities over the whole range, tending slightly towards smaller values (15-35%).

The PFAMSWISSPROT Corpus

The second kind of experimental evaluation of advanced stochastic protein family models as a whole is directed to some kind of "real-world" scenario. For three exemplary protein families, training data is obtained from the Pfam database, namely the seed alignments of


 
SCOP Id  SCOP Superfamily Name                                      # Samples        Length (Mean (Std. Deviation))
                                                                    Training  Test   Training        Test
a.1.1    Globin-like                                                45        45     148.9 (12.8)    150.5 (12.9)
a.26.1   4-helical cytokines                                        20        19     151.8 (24.9)    154.3 (24.0)
a.3.1    Cytochrome c                                               33        33     101.9 (27.2)    111.7 (30.1)
a.39.1   EF-hand                                                    37        37     129.3 (38.1)    134.1 (52.7)
a.4.1    Homeodomain-like                                           30        29     67.0 (19.3)     70.6 (21.2)
a.4.5    "Winged helix" DNA-binding domain                          37        37     92.1 (28.4)     92.9 (22.1)
b.1.1    Immunoglobulin                                             156       155    106.7 (11.2)    107.6 (16.9)
b.1.18   E set domains                                              31        30     118.6 (37.4)    125.9 (47.2)
b.1.2    Fibronectin type III                                       26        25     98.8 (9.3)      100.7 (6.7)
b.10.1   Viral coat and capsid proteins                             48        48     268.6 (95.1)    274.8 (86.3)
b.29.1   Concanavalin A-like lectins/glucanases                     40        39     226.4 (58.8)    213.9 (60.9)
b.40.4   Nucleic acid-binding proteins                              36        35     101.3 (33.3)    121.6 (48.6)
b.47.1   Trypsin-like serine proteases                              42        41     227.1 (33.4)    230.1 (26.0)
b.6.1    Cupredoxins                                                38        38     136.2 (31.9)    146.3 (36.1)
b.60.1   Lipocalins                                                 23        22     152.3 (18.9)    149.5 (19.8)
c.1.8    (Trans)glycosidases                                        47        46     386.1 (85.8)    379.1 (73.4)
c.2.1    NAD(P)-binding Rossmann-fold domains                       77        76     197.0 (56.6)    214.3 (73.1)
c.3.1    FAD/NAD(P)-binding domain                                  34        34     221.5 (92.2)    226.8 (89.9)
c.37.1   P-loop containing nucleotide triphosphate hydrolases       96        95     262.9 (121.1)   249.8 (98.3)
c.47.1   Thioredoxin-like                                           42        42     109.4 (38.7)    107.9 (36.0)
c.55.1   Actin-like ATPase domain                                   21        20     202.6 (34.9)    209.5 (36.0)
c.66.1   S-adenosyl-L-methionine-dependent methyltransferases       21        20     255.0 (55.1)    266.2 (44.2)
c.67.1   PLP-dependent transferases                                 25        25     404.5 (36.2)    404.4 (30.6)
c.69.1   alpha/beta-Hydrolases                                      39        38     348.9 (100.1)   331.7 (96.1)
c.94.1   Periplasmic binding protein-like II                        21        20     305.0 (74.8)    341.3 (76.5)
d.144.1  Protein kinase-like (PK-like)                              23        23     315.9 (28.7)    304.9 (31.8)
d.153.1  N-terminal nucleophile aminohydrolases (Ntn hydrolases)    23        23     273.0 (131.1)   292.3 (172.2)
d.169.1  C-type lectin-like                                         24        23     124.3 (21.5)    121.8 (15.0)
d.19.1   MHC antigen-recognition domain                             26        25     141.1 (43.8)    140.9 (45.1)
d.3.1    Cysteine proteinases                                       20        19     283.0 (82.4)    269.4 (50.4)
d.92.1   Metalloproteases ("zincins"), catalytic domain             20        20     281.4 (135.2)   259.5 (161.8)
g.3.11   EGF/Laminin                                                25        24     46.8 (7.4)      44.6 (7.2)
g.3.7    Scorpion toxin-like                                        26        26     46.5 (13.8)     46.4 (13.3)
g.37.1   C2H2 and C2HC zinc fingers                                 21        20     31.4 (5.9)      31.6 (4.6)
Total:                                                              1273      1253

Table 6.1: Overview of the SCOPSUPER95_20 corpus: For every superfamily, the alpha-numerical SCOP Id as well as its full name as defined in the database are given. The last row summarizes the total numbers of samples.

Pfam-A [Son98]. Using these training samples, the protein family models are estimated and family members are sought within the approximately 90 000 SWISSPROT sequences. The reference annotations are given by the direct link between Pfam and SWISSPROT. However, the quality of this annotation is not as clear as for the SCOP related experiments. Note that evaluations performed using the PFAMSWISSPROT corpus are of a summarizing character, demonstrating real-world applications of advanced stochastic protein family models. The actual selection of the three protein domains was performed rather arbitrarily, and a systematic evaluation is not claimed in any way. Compared to the previous experiments, in the final evaluation using PFAMSWISSPROT additional properties of practical relevance are assessed:

Search for Complete Proteins: SWISSPROT contains arbitrary protein data, i.e. the sequences included are not limited to single protein domains. Proteins are annotated


with respect to Pfam-domains and modeled using stochastic models. When evaluating SWISSPROT sequences with the domain HMMs, the models need to find sequences where only parts (namely the appropriate domains) match. This corresponds to one common application for molecular biologists, who might be interested in occurrences of certain biological functions (represented by the particular domain) in larger protein environments.2

Evaluation of the Specificity: Since SWISSPROT contains substantial amounts of data (approximately 90 000 sequences), the specificity of the new models, including the effectiveness of the Universal Background Model (UBM), can be evaluated.

In table 6.2 the characteristics of PFAMSWISSPROT are summarized. Three representative protein domains were selected.

Pfam Id  Pfam Name                               # Samples Training  # Occurrences in SWISSPROT
PF00001  7tm_1 – GPCR 7 Transmembrane Receptor   64                  1078
PF00005  PKinase – Protein Kinase Domain         63                  507
PF00069  ABC_tran – ABC Transporter              54                  1202

Table 6.2: Overview of the PFAMSWISSPROT corpus for the final system evaluation. Members of the three Pfam domains listed here are searched within the approximately 90 000 sequences of the general protein database.

6.2 Effectiveness of Semi-Continuous Feature Based Profile HMMs

The first category of experimental evaluations whose results are presented in this chapter is directed to the central concept of the enhanced protein family models developed in this thesis, namely the feature based representation of biochemical properties of amino acids in their local neighborhood (cf. sections 5.1 and 5.2, respectively). For this purpose, the classical three-state Profile HMM topology is kept fixed, whereas the emissions of both Insert and Match states are based on the new representation. As described in section 5.2.1, the general feature space representation is estimated using the complete SWISSPROT database and further specialized using adaptation techniques (cf. section 5.2.3). Informal experiments showed that semi-continuous protein family models based on feature space representations which were estimated by exclusively exploiting the particular target family specific training samples require substantially larger sample sets for robust modeling. When restricting the estimation of the underlying mixture density to the family specific sequence data, no sufficient coverage of the protein feature space can be obtained, resulting in poor generalization capabilities of the corresponding HMMs, which is problematic for the analysis of remote homologues. Thus, the separate mixture density estimation using general protein data (SWISSPROT) is essential, together with the family specific adaptation. The evaluation results presented in this section are based on SCOPSUPER95_66 experiments. In the following, both classification and detection results are discussed.
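The defining property of a semi-continuous HMM, a single shared Gaussian codebook with state-specific mixture weights, can be sketched as follows; the codebook and weights here are toy values, not parameters estimated from SWISSPROT.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def sc_emission(x, codebook, weights):
    """Semi-continuous emission: every state shares the same Gaussian
    codebook and differs only in its mixture weights over it."""
    return sum(w * math.exp(log_gauss_diag(x, m, v))
               for w, (m, v) in zip(weights, codebook))

# Shared codebook of two Gaussians; two states with different weights.
codebook = [([0.0, 0.0], [1.0, 1.0]), ([3.0, 3.0], [1.0, 1.0])]
state_a = [0.9, 0.1]   # state-specific weights over the shared densities
state_b = [0.1, 0.9]
```

Because the densities are shared, adapting the codebook (e.g. via MAP or MLLR) specializes all states of all family models at once, which is what makes the separate general-purpose estimation step practical.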

2 Note that this kind of evaluation corresponds in principle to spotting in other general pattern recognition applications like automatic speech recognition.


SCOPSUPER95_66: Evaluation of Remote Homology Classification

In table 6.3 the capabilities of the semi-continuous feature based (SCFB) Profile HMMs are given for the SCOPSUPER95_66 based classification task. The results are illustrated by means of a comparison of the classification errors obtained when using variants of SCFB Profile HMMs or their discrete counterparts.

Modeling Variant (Profile HMMs)  Classification Error E [%]  Relative Change ∆E [%] (Base: Discrete Profile HMMs)
Discrete                         32.9                        −
SCFB (ML)                        37.6                        +14.3
SCFB (MAP)                       24.0                        −27.1
SCFB (MLLR)                      20.7                        −37.1

Table 6.3: Classification results for SCOPSUPER95_66 comparing discrete Profile HMMs to semi-continuous feature based Profile HMMs (SCFB) obtained by the three variants of feature space adaptation (ML/MAP/MLLR). Whereas the ML variant performs worse than the state-of-the-art, both MAP and MLLR based models significantly outperform it (the mean confidence range for this corpus is approximately ±3.5%).

Reconsidering the fact that the underlying model architectures are identical for all experiments, namely the complex three-state Profile topology, and analyzing the relative changes of the classification error ∆E, it becomes clear that the new feature representation is very effective. When applying semi-continuous Profile HMMs which were estimated using MAP or MLLR adaptation, the classification error decreases significantly. In the best case (MLLR adaptation), the classification error could be reduced by more than one third relative. The classification capabilities of semi-continuous Profile HMMs which are based on ML estimation are worse than those of standard models since the number of adaptation samples is too small.
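Why MAP adaptation degrades gracefully with few samples, while plain ML re-estimation does not, can be seen from the standard MAP mean update; the following sketch shows that update with a hypothetical relevance factor `tau` and is an illustration, not the exact update used in the thesis.

```python
def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP re-estimation of a single Gaussian mean: interpolate between
    the prior mean (general feature space) and the weighted sample mean
    of the family-specific frames; tau controls the prior's weight.
    With few adaptation frames the result stays close to the prior,
    whereas pure ML would jump to the (unreliable) sample mean."""
    n = sum(posteriors)                     # soft count of assigned frames
    if n == 0.0:
        return list(prior_mean)
    dim = len(prior_mean)
    sample_mean = [sum(g * f[d] for g, f in zip(posteriors, frames)) / n
                   for d in range(dim)]
    return [(tau * p + n * s) / (tau + n)
            for p, s in zip(prior_mean, sample_mean)]

# Ten frames at 10.0 with unit posteriors and tau=10 pull the prior
# mean 0.0 exactly halfway towards the sample mean.
adapted = map_adapt_mean([0.0], frames=[[10.0]] * 10, posteriors=[1.0] * 10)
```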

SCOPSUPER95_66: Evaluation of Remote Homology Detection

The evaluation based on the SCOPSUPER95_66 classification task gave a first clue regarding the general capabilities of the new feature based protein family modeling techniques. For the second major application field addressed by this thesis, namely target identification, the results of detection experiments are presented in figure 6.2. The complete 95% sequence identity based SUPERFAMILY hierarchy of SCOP was searched for occurrences of the 16 superfamilies. As another new concept developed in this thesis, during searching, the particular target models and a structured Universal Background Model (UBM), which captures general protein data, are competitively evaluated. Analyzing the four ROC curves, the superior performance of both MAP and MLLR based SCFB Profile HMMs can also be confirmed for the detection tasks. As suggested by the classification experiments, ML adaptation is likewise not that effective for target identification. In addition to the evaluation of the overall detection capabilities, the effectiveness of the explicit background model can be evaluated using ROC curves (figure 6.3). The ordinate of the diagram represents the number of false positive predictions, i.e. those sequences which


[Figure 6.2: ROC curves (# false negatives vs. # false positives) for discrete Profile HMMs and for semi-continuous feature based Profile HMMs with ML estimation, MAP adaptation, and MLLR adaptation; a magnified working area is shown as an inset.]

Figure 6.2: ROC curves illustrating the superior performance of feature based Profile HMMs compared to standard discrete models (red curve). The underlying experimental evaluation was performed using the SCOPSUPER95_66 corpus. It can be seen that all semi-continuous feature based Profile HMMs estimated using the particular adaptation techniques produce better detection results than their discrete counterparts: the area below the ROC curve is significantly smaller.

are actually not members of the particular target family but, given the appropriate threshold, were falsely classified as members. The smaller this maximum number, the more effective the UBM. The ideal case is that the UBM based evaluation scores are better for all non-family sequences whereas the particular target family scores are better for all actual members. Traditionally, no explicit background model is applied for remote homology detection. Instead, usually some kind of post-processing of the detection results is performed by analyzing the significance of alignment scores and filtering properly. It can be seen that the combination of UBM and MLLR based SCFB Profile HMMs is superior. The maximum number of false positive predictions can be reduced by almost 66 percent. The characteristic values which concretely specify the mutual dependencies of false positive and false negative predictions are summarized in table 6.4.

Conclusion

The experimental evaluation of the feature based representation of protein sequences presented in this section demonstrates the superior performance of semi-continuous feature based Profile HMMs. It could be shown that the richer sequence representation developed in this thesis is a basic foundation for enhanced probabilistic protein family models. Both application fields relevant for e.g. drug discovery, namely target verification and target iden-


[Figure 6.3: ROC curves (# false negatives vs. # false positives) comparing discrete Profile HMMs with SCFB Profile HMMs (MAP and MLLR adapted), each evaluated with and without the UBM; a magnified working area is shown as an inset.]

Figure 6.3: Direct comparison of the SCFB Profile HMMs' detection performance when applying target models standalone (cyan and pink ROC curves) and when competing with an explicit background model (blue and green ROC curves). The superior performance of SCFB Profile HMMs (compared to state-of-the-art discrete models, red ROC curve) can be further improved by using a UBM.

Modeling Variant (Profile HMMs)  False Negative Predictions [%]  False Positive Predictions [%]
                                 for 5% False Positives          for 5% False Negatives
discrete                         26.1                            57.6
SCFB (ML)                        17.7                            64.4
SCFB (MAP)                       7.9                             16.0
SCFB (MLLR)                      5.1                             5.5

Table 6.4: Characteristic values for the SCOPSUPER95_66 UBM based detection experiments, illustrating the relevance of the new feature based sequence processing for e.g. pharmaceutical applications: At fixed working points of the ROC curves allowing 5% false predictions, the numbers of corresponding false predictions decrease significantly for the MAP and MLLR variants of semi-continuous feature based Profile HMMs.

tification, benefit from the new approach. For the latter task, the effectiveness of an explicit background model (UBM) which covers all non-target data could be shown. The third major outcome of these experiments is that ML estimation is not a proper method for specializing protein family HMMs towards the target they represent. Thus, ML based specialization will not be considered any further in the following evaluations. Note that a detailed presentation of the experimental evaluation of this section is given in appendix D.


6.3 Advanced Stochastic Protein Family Models for Small Training Sets

In the previous section, the new approach of feature based probabilistic protein family modeling was experimentally evaluated by means of Profile HMMs consisting of the standard three-state model topology. It could be seen that semi-continuous feature based Profile HMMs significantly outperform their discrete counterparts. Whereas the experiments described in section 6.2 addressed the analysis of the general effectiveness of advanced stochastic protein family models, this section deals with the sparse data problem as defined on page 92. In addition to the techniques developed for the robust estimation of semi-continuous feature based Profile HMMs by means of small sample sets, two approaches aiming at the reduction of the models' complexities were developed in section 5.3: protein family HMMs which are based on Sub-Protein Units (SPUs), and Bounded Left-Right models. In the following sections, all advanced stochastic modeling techniques are evaluated with respect to the number of training samples exploited. First, the experimental evaluation of the new building block based protein family modeling using SPUs is given in section 6.3.1. Following this, the results for BLR models are presented in section 6.3.2. Both approaches are evaluated for classification as well as for detection tasks. The SCOPSUPER95_66 corpus is used for the direct comparison of the results to the experiments discussed in the previous section, serving as proof of concept for the general effectiveness of protein family models with reduced complexity. In order to demonstrate the actual effectiveness of all new modeling techniques for small training sets, the SCOPSUPER95_44f corpus is used.

6.3.1 Effectiveness of Sub-Protein Unit based Models

In section 5.3.2 the new idea of alternative protein family modeling using small, automatically estimated building blocks, so-called Sub-Protein Units (SPUs), was presented. The idea of SPU based protein family models is to explicitly cover only those parts of a family which are absolutely necessary for successful remote homology classification or detection.

In this section, the results of the first experimental evaluations of this new approach are summarized, serving as the general proof of concept. In addition to the prototypical implementation of the SPU approach, alternative configurations of the general framework are imaginable. However, their exhaustive evaluation is beyond the scope of this thesis. In order to prove the general applicability of the approach, the performance of SPU based protein family models is compared both to discrete Profile HMM based results and to the results obtained when applying semi-continuous feature based Profile HMMs. Motivated by the results of the preceding experimental evaluations, the specialization of the feature space representation underlying the SPU models is limited to MLLR adaptation only.

SCOPSUPER95 66: Evaluation of Remote Homology Classification

Compared to current discrete Profile HMMs, remote homology classification using SPU based protein family models performs significantly better. This can be seen in table 6.5, where the classification errors for both Profile HMMs and SPU based protein family models are compared. The classification error is decreased by almost 29% relative. Compared to semi-continuous feature based Profile HMMs, the improvement is at an almost similar level, i.e. the proof of concept for the alternative protein family modeling approach could be given.

Modeling Variant         Classification Error E [%]   Relative Change ∆E [%]
                                                      (Base: Discrete Profile HMMs)
Discrete Profile HMMs    32.9                         –
SCFB Profile HMMs        20.7                         -37.1
SCFB SPU HMMs            23.5                         -28.6

Table 6.5: Classification performance for discrete Profile HMMs, their semi-continuous feature based counterparts (MLLR adapted feature space representation), and the new modeling approach using SPUs. Target models estimated using feature based building blocks significantly outperform state-of-the-art models for SCOPSUPER95 66 while reaching performance comparable to SCFB Profile models.
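The relative changes ∆E reported in the table follow directly from the absolute error rates. A minimal sketch of the computation (the function name is illustrative, not from the thesis):

```python
def relative_change(e_base, e_new):
    """Relative change of the classification error in percent;
    negative values mean an improvement over the baseline."""
    return (e_new - e_base) / e_base * 100.0

# Values from table 6.5 (baseline: discrete Profile HMMs, E = 32.9%).
print(round(relative_change(32.9, 20.7), 1))  # SCFB Profile HMMs -> -37.1
print(round(relative_change(32.9, 23.5), 1))  # SCFB SPU HMMs -> -28.6
```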

SCOPSUPER95 66: Evaluation of Remote Homology Detection

Following the assessment of the classification capabilities of the new SPU based protein family models, their applicability to remote homology detection tasks is evaluated for the SCOPSUPER95 66 corpus. In analogy to preceding presentations of results for similar experiments, ROC curves are presented in figure 6.4, and the corresponding characteristic values for fixed working points within the curves are given in table 6.6.

Note that, according to the results of informal experiments on different data, the UBM architecture was changed from the structured UBM containing 30 states to the classical UBM of Douglas Reynolds consisting of a single state (cf. section 5.2.4). The reason for this decision is as follows: When applying the structured UBM the selectivity is perfect, i.e. no false positive predictions are obtained at all. However, compared to the single state UBM, rather large numbers of false negative predictions occur. Since the competitive model evaluation delivers hard decisions for a particular model (either target or UBM), this number cannot be reduced in any further step: the particular false negatives are rejected before the threshold based decision regarding log-odds scores is performed. In order to improve the sensitivity of the detection process, a single state UBM is applied; it delivers slightly more false positives, but the number of false negatives can be reduced drastically.

In the abovementioned figure and table, respectively, it can be seen that the improved performance of SPU based protein family models compared to discrete Profile HMMs, which was proven for the classification task, generalizes to the detection task, too. The percentage of false positive predictions can be substantially reduced, and the sensitivity is also improved in terms of reduced numbers of false negatives.
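The two-stage decision process described above can be sketched as follows. This is a simplified illustration, not the actual implementation of the thesis: all names are hypothetical, and the scores are assumed to be log-likelihoods produced by the target model and the UBM, respectively.

```python
def detect(target_score, ubm_score, threshold):
    """Two-stage detection: the sequence is first evaluated
    competitively against the UBM (a hard decision that cannot be
    undone later), and only winners are subjected to the threshold
    based decision on the log-odds score."""
    if ubm_score > target_score:   # UBM wins: fixed false rejection
        return False               # if the sequence was a true target
    log_odds = target_score - ubm_score
    return log_odds >= threshold   # soft, threshold based decision

print(detect(-100.0, -120.0, 5.0))  # target wins clearly -> True
print(detect(-100.0, -90.0, 5.0))   # UBM wins -> rejected outright
```

The hard first stage explains why the corresponding ROC curves cannot reach the y-axis: false rejections produced by the UBM are fixed and independent of the threshold.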
Note that the radically changed modeling approach still offers optimization potential, since the effectiveness of semi-continuous feature based Profile HMMs cannot yet be reached. However, since the evaluations presented here serve as the general proof of concept of the new modeling technique, and the state-of-the-art discrete Profile HMMs are significantly outperformed, SPU based protein family modeling is very promising.

[Figure 6.4: ROC plot (x-axis: # false negatives; y-axis: # false positives) with an inset magnifying the working area; curves: discrete Profile HMMs, semi-continuous feature based SPU HMMs (MLLR adaptation), and semi-continuous feature based Profile HMMs (MLLR adaptation).]

Figure 6.4: ROC curves illustrating the improved remote homology detection performance when applying SPU based protein family models (green curve) instead of current discrete Profile HMMs (red curve). The SPU based models are evaluated competitively to a single state UBM. False rejections are the reason for the curve's endpoint lying apart from the y-axis (marked with '+'). The ROC curves corresponding to semi-continuous feature based Profile modeling are given as reference (pink), illustrating their still better performance for detection tasks. Since these experiments only address the proof of concept for the SPU based modeling approach, its optimization potential is promising.

Modeling Variant          False Negative Predictions [%]   False Positive Predictions [%]
                          for 5% False Positives           for 5% False Negatives
Discrete Profile HMMs     26.1                             57.6
SCFB Profile HMMs + UBM   5.1                              5.5
SCFB SPU HMMs + UBM       18.2                             0.0 (17.4)

Table 6.6: Characteristic values for SCOPSUPER95 66 detection experiments for Profile HMMs (discrete and semi-continuous) and SPU based protein family models evaluated competitively to a single state UBM. Both the specificity and the sensitivity can be substantially improved compared to the state-of-the-art when using the SPU approach. For SCFB evaluations the corresponding limits of 5% false negative predictions were not reached; thus, the appropriate global maxima at the endpoints of the ROC curves are given in parentheses. Both feature based modeling variants include an MLLR adapted feature space representation.


SCOPSUPER95 44f: Evaluation of Remote Homology Classification

In addition to the general proof of concept for the applicability of SPU based protein family models, which was given in the previous sections using the SCOPSUPER95 66 corpus, in the following the effectiveness of the new modeling approach for reduced training sets is evaluated. Again, the evaluation concentrates on the general applicability of the new technique, providing the proof of concept for the paradigm shift in protein sequence analysis using building blocks. Since the SPU framework can be configured and thus enhanced in various ways, the results presented give hints for further developments. The SCOPSUPER95 44f corpus is used for the explicit assessment of the robustness of SPU based models depending on the number of training samples available.

The first set of experiments evaluates the effectiveness of SPU based models for remote homology classification tasks. Therefore, in figure 6.5 the classification error rates are shown depending on the amounts of training samples used for model estimation. In the upper diagram the results for the original testset are given, whereas in the lower chart the extended testset of SCOPSUPER95 44f is analyzed.

The actual training sets are obtained by randomly selecting sequences from the SCOP pool of the particular superfamily (cf. section 6.1 on page 151 for details about the corpus definition). Thus, certain "statistical noise" occurs when measuring the classification errors for the particular subsets of training samples. In order to allow easy comparison of the general effectiveness of the particular modeling methods for remote homology classification, the actual values are smoothed using Bezier interpolation. This results in continuous curves which can be analyzed by visual inspection as well as numerically for estimating the overall trend of the effectiveness.
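The smoothing step can be sketched as follows, assuming the measured (training-set size, error rate) pairs serve as control points of a Bezier curve evaluated via de Casteljau's algorithm (a plausible reading of the procedure; the sample values below are invented for illustration, not taken from the experiments):

```python
def bezier(points, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's
    algorithm, using the measured (x, y) pairs as control points."""
    pts = list(points)
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]

# Illustrative noisy (training-set size, error rate [%]) measurements.
samples = [(5, 62.0), (10, 48.0), (20, 35.0), (30, 33.0), (44, 30.0)]
curve = [bezier(samples, i / 50) for i in range(51)]
print(curve[0], curve[-1])  # endpoints coincide with the first/last point
```

The resulting curve passes through the first and last measurement and smooths out the noise of the intermediate points, which makes the overall trend visible.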
It can be seen that SPU based protein family models are generally suitable for remote homology classification using small training sets, provided a certain minimum number of sample sequences is available. If the particular training sets contain at least (approximately) 20 sequences, the classification error obtained for SPU based protein family models is comparable to, or slightly smaller than, that of current discrete Profile HMMs. Although the better performance of SCFB Profile HMMs cannot be reached yet, the results are very promising for the new concept of protein family modeling.

SCOPSUPER95 44f: Evaluation of Remote Homology Detection

The presentation of the classification performance given in the previous section provides an overview of the general effectiveness of SCFB SPU based protein family HMMs when only small amounts of training data are available. In order to prove the effectiveness for remote homology detection, 44 ROC curves would be necessary when proceeding as in the classification case. Since the presentation of such a large number of diagrams for randomly selected training sets is out of proportion to the knowledge gain obtainable from it, the following presentations are limited to three representative training sets of SCOPSUPER95 44f.

For the first set of experiments the subsets containing 20 training samples are selected, whereas the second kind of training sets contains 30 sequences each. Finally, the upper limit of the number of training samples available is used, namely 44 sequences. Note that these


[Figure 6.5: two diagrams of classification error [%] over # training samples for SCOPSUPER95_44f (upper: original testset; lower: extended testset); curves: discrete Profile HMMs, SCFB SPU HMMs (MLLR), and SCFB Profile HMMs (MLLR).]

Figure 6.5: Classification error rates obtained when applying SPU based protein family models to the task of remote homology classification (SCOPSUPER95 44f). With respect to current discrete Profile HMMs, the diagrams illustrate the comparable or slightly reduced classification error rates obtained when a minimum of approximately 20 training sequences is available. Since the training samples are randomly picked, the actual classification error rates (marked by '+') are smoothed using Bezier splines (solid lines). The results for semi-continuous feature based Profile HMMs are given here, too, illustrating their still better performance.

training sets are not identical to the ones from SCOPSUPER95 66, since the number of training samples is fixed to 44 for all 16 superfamilies (compared to the minimum number of 66 training sequences in SCOPSUPER95 66). The presentation of ROC curves is given in figures 6.6 and 6.7.

The particular ROC curves confirm the results obtained in the classification experiments for remote homology detection. If fewer than the suggested minimum of approximately 20 training sequences are available, the detection performance of SPU based target models is worse compared to state-of-the-art discrete Profile HMMs (upper diagram of figure 6.6). However, when 10 more training samples are used for model estimation, the detection performance gets better on average and is comparable to the state-of-the-art. Furthermore, when 44 sample sequences are available for model training, SPU based protein family models in combination with a single state UBM significantly outperform discrete Profile HMMs. In fact 44 sequences is still a reasonably small number; thus, the SPU based modeling approach is very promising.

For completeness, in table 6.7 the characteristic values for fixed working points of the particular ROC curves are given. Although the differences between SCFB Profile HMMs and SPU based protein family models are substantial, the evaluation of the detection performance for the new modeling approach is promising since these experiments only represent the successful proof of concept. Further research activities should be directed to this new kind of protein family modeling in order to further improve its capabilities.

Modeling Variant          False Negative Predictions [%]   False Positive Predictions [%]
                          for 5% False Positives           for 5% False Negatives
20 Training Samples
Discrete Profile HMMs     36.3                             80.2
SCFB Profile HMMs + UBM   33.1                             0.0 (13.6)
SCFB SPU HMMs + UBM       45.7                             0.0 (8.6)
30 Training Samples
Discrete Profile HMMs     31.6                             78.2
SCFB Profile HMMs + UBM   19.0                             0.0 (25.9)
SCFB SPU HMMs + UBM       34.0                             0.0 (19.3)
44 Training Samples
Discrete Profile HMMs     27.8                             68.7
SCFB Profile HMMs + UBM   12.9                             26.6
SCFB SPU HMMs + UBM       21.9                             0.0 (29.1)

Table 6.7: Comparison of characteristic values for SCOPSUPER95 44f detection experiments (20, 30, and 44 training samples) at fixed working points for SPU based protein family models (MLLR adapted feature space representation) vs. Profile HMMs (discrete and semi-continuous feature based variants). For those values where the corresponding limit of 5% false predictions was not reached, the appropriate global maxima at the endpoints of the ROC curves are given in parentheses.
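The characteristic values at such fixed working points can be read off a ROC curve as sketched below. This is a simplified illustration with invented counts; `working_point` is a hypothetical helper, not code from the thesis:

```python
def working_point(roc, total_pos, total_neg, fp_limit_pct=5.0):
    """Smallest false-negative percentage over all ROC points whose
    false-positive percentage stays within the allowed limit.  Returns
    None when the limit is never reached (the tables then report the
    endpoint value in parentheses instead).  `roc` holds
    (false_negatives, false_positives) count pairs."""
    admissible = [fn for fn, fp in roc
                  if 100.0 * fp / total_neg <= fp_limit_pct]
    if not admissible:
        return None
    return 100.0 * min(admissible) / total_pos

# Invented ROC points (counts) for 1000 positive and 20000 negative tests.
roc = [(0, 4000), (50, 1500), (120, 900), (300, 200), (800, 0)]
print(working_point(roc, 1000, 20000))  # -> 12.0
```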


[Figure 6.6: two ROC plots (x-axis: # false negatives; y-axis: # false positives) with insets magnifying the working area; curves: discrete Profile HMMs, semi-continuous feature based SPU HMMs (MLLR), and semi-continuous feature based Profile HMMs (MLLR).]

Figure 6.6: ROC curves illustrating the detection performance of SPU based target models competitively evaluated to a single state UBM for SCOPSUPER95 44f (upper diagram: 20 training sequences; lower diagram: 30 samples). A minimum of approximately 20 training sequences is required for reaching state-of-the-art (red), or outperforming it. The endpoints of the curves not crossing the y-axis caused by false rejections due to the UBM are marked with ’+’.


[Figure 6.7: ROC plot (x-axis: # false negatives; y-axis: # false positives) with an inset magnifying the working area; curves: discrete Profile HMMs, semi-continuous feature based SPU HMMs (MLLR), and semi-continuous feature based Profile HMMs (MLLR).]

Figure 6.7: SCOPSUPER95 44f based comparison of detection results for the third set of experiments based on the use of 44 training samples each for the estimation of target family models. The combination of SCFB SPU models and single state UBMs performs best while obtaining small amounts of false rejections due to the UBM. Again the endpoints of UBM based curves not crossing the y-axis caused by false rejections are marked with ’+’.

To summarize, the proof of concept for the general applicability of the new approach to protein family modeling using automatically derived building blocks was given. Both classification and detection performance of SPU based target models are improved in comparison to current discrete Profile HMMs when approximately 20 training sequences are available. If the amount of sample data is below this minimum, the resulting SPUs are expected not to be as representative of the overall protein family as necessary. Thus, more flexibility of target models is required.

6.3.2 Effectiveness of Bounded Left-Right Models

Since the new feature based representation of protein sequences respects the biochemical properties of particular residues in their local neighborhood, the complex three-state model topology can be discarded. The reason for this is the incorporation of context information already at the level of the HMM states' emissions, and the flexible modeling using the Bounded Left-Right topology as developed in section 5.3.1 on pages 130ff. In the following, the results of the experimental evaluation of BLR protein family HMMs are presented.
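The topology can be sketched as an initial transition matrix in which every state may loop or jump forward by at most a fixed bound, while all remaining transitions are structurally zero. This is only a hedged illustration of the general idea; the actual construction in section 5.3.1 may differ in detail, and the uniform initialisation is an assumption:

```python
def blr_transitions(n_states, bound):
    """Uniformly initialised transition matrix of a Bounded Left-Right
    topology: every state may loop (self-transition) or move forward by
    at most `bound` states; all other transitions are structurally
    zero.  A sketch of the topology only; the probabilities themselves
    are subject to training."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        successors = range(i, min(i + bound + 1, n_states))
        for j in successors:
            A[i][j] = 1.0 / len(successors)
    return A

A = blr_transitions(6, 2)
print(A[0])  # state 0 only reaches states 0, 1, and 2
print([round(sum(row), 10) for row in A])  # rows are stochastic
```

Compared to the three-state Profile topology, such a matrix contains far fewer free parameters, which is what makes the approach attractive for small training sets.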


Similarly to the evaluation of SPU based HMMs, the assessment of the effectiveness of the BLR approach is based on the SCOPSUPER95 66 as well as the SCOPSUPER95 44f corpus. In the following the particular evaluation results are presented separately.

SCOPSUPER95 66: Evaluation of Remote Homology Classification

Table 6.8 contains the results of the experimental evaluation of the classification task performed for the SCOPSUPER95 66 corpus. The most effective variants of feature space adaptation as discussed in the previous section were applied and appropriate semi-continuous feature based Bounded Left-Right protein family HMMs were estimated: SCFB BLR HMMs (MAP/MLLR).

Analyzing the classification error rates of both SCFB Profile HMMs and SCFB BLR HMMs, it becomes clear that the new models with reduced complexity significantly outperform the feature based Profile models containing the standard three-state topology. For the best configuration, namely BLR HMMs based on the MLLR adapted feature space representation, the classification error decreases by approximately 20 percent relative. Compared to the corresponding state-of-the-art discrete Profile HMMs, this implies almost halving the classification error.

Note that the Bounded Left-Right model topology requires the new feature representation of protein sequences. In addition to the comparison of SCFB HMMs, in table 6.8 the classification error rate for discrete BLR models is given. Discrete BLR models perform significantly worse than their feature based counterparts, and even worse than state-of-the-art discrete Profile HMMs. When processing discrete amino acid data, state-of-the-art discrete Profile HMMs are without doubt the methodology of choice.

Modeling Variant           Classification   Relative Change ∆E [%]
                           Error E [%]      Base: Discrete   Base: SCFB
Discrete Profile HMMs      32.9             −                −
SCFB Profile HMMs (MAP)    24.0             −27.1            −
SCFB Profile HMMs (MLLR)   20.7             −37.1            −
Discrete BLR HMMs          38.9             +15.4            −
SCFB BLR HMMs (MAP)        21.7             −34.0            −9.6
SCFB BLR HMMs (MLLR)       16.8             −48.9            −18.8

Table 6.8: Classification results for SCOPSUPER95 66 comparing Profile HMMs (both discrete and semi-continuous feature based) with protein family models of reduced complexity based on the Bounded Left-Right architecture (mean confidence range: approximately ±3.5%). Using the best configuration (BLR models including MLLR based feature space adaptation), the classification error can be further decreased by approximately 20 percent relative (compared to SCFB Profile HMMs), which implies halving the classification error obtained when using standard discrete Profile HMMs. The significantly worse results for discrete BLR models are given separately.

In figure 6.8 the transition probabilities of the SCOPSUPER95 66 SCFB BLR models are visualized using a greyvalue representation. According to the definition of Bounded Left-Right models (cf. page 130f.) the number of transitions varies between the models. Obviously, the dominating state transition for all models is the direct connection between adjacent states (almost white stripes within all sub-images in the second rows). Additionally, the majority of states in all models have non-vanishing transition probabilities to themselves and to farther adjacent states. It can be seen that the model topology is rather suitable since transitions to states which are not locally adjacent are very unlikely to occur (the lower rows of all sub-images are almost black).
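Such a greyvalue representation amounts to a simple linear mapping of probabilities to 8-bit intensities, lighter shades standing for higher probabilities. A minimal sketch; the exact scaling used for the figure is an assumption:

```python
def to_greyvalue(p):
    """Map a transition probability p in [0, 1] to an 8-bit greyvalue;
    lighter shades (larger values) mean higher probabilities.  The
    linear scaling is an assumption, not taken from the thesis."""
    return round(255 * p)

print(to_greyvalue(0.0), to_greyvalue(1.0))  # black (0) and white (255)
```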

[Figure 6.8: greyvalue visualization of the transition matrices; the y-axis lists the models with their numbers of transitions created: a.1.1 (4), a.3.1 (4), a.39.1 (5), a.4.5 (5), b.1.1 (5), b.10.1 (5), b.29.1 (5), b.40.4 (5), b.47.1 (5), b.6.1 (4), c.1.8 (4), c.2.1 (12), c.3.1 (5), c.37.1 (5), c.47.1 (4), c.69.1 (5); the x-axis represents the states.]

Figure 6.8: Visualization of transition probabilities for all superfamily BLR models of the SCOPSU- PER95 66 corpus: Transition probabilities are mapped to greyvalues as illustrated by the inset (upper right); the lighter the shades, the higher the corresponding transition probabilities. The model names are written on the left side of the figure including the number of transitions created (cf. page 130f.) whereas the x-axis represents the states of the particular superfamily models.


SCOPSUPER95 66: Evaluation of Remote Homology Detection

In the previous section, evaluation results were presented which illustrate the superior performance of Bounded Left-Right HMMs for protein sequence classification. The effectiveness of the new modeling alternative for remote homology detection is now evaluated by means of the same methodology and datasets as previously used for SCFB Profile HMMs. As suggested by the improved specificity of UBM based model evaluation (cf. figure 6.3), target BLR models based on either MAP or MLLR based feature space adaptation were competitively evaluated against the single state Universal Background Model (UBM).

In figure 6.9 the results of the evaluation of the detection performance are presented by means of ROC curves. The curves for SCFB BLR models (blue and green) are directly compared to the curves of the corresponding SCFB Profile HMMs (cyan and pink). Furthermore, the baseline for all SCOPSUPER95 66 experiments discussed in this thesis, namely the ROC curve for discrete Profile HMMs, is shown in red. Note that small amounts of false rejections are produced by the single state UBM; thus, the corresponding ROC curves do not cross the y-axis.

[Figure 6.9: ROC plot (x-axis: # false negatives; y-axis: # false positives) with an inset magnifying the working area; curves: discrete Profile HMMs, semi-continuous feature based BLR HMMs (MAP and MLLR), and semi-continuous feature based Profile HMMs (MAP and MLLR).]

Figure 6.9: SCOPSUPER95 66 based comparison of the detection performance for both SCFB Profile HMMs (cyan and pink ROC curves) and SCFB BLR models (blue and green ROC curves) when evaluating the particular models competitively to an UBM. Both variants of the feature based BLR models show excellent performance for target identification tasks (red ROC curve: corresponding standard discrete Profile HMMs). Due to small amounts of false rejections produced by the single state UBM, both the BLR MAP and the BLR MLLR curves do not cross the y-axis. For clarity the particular endpoints are marked with '+'.

It can be seen that Bounded Left-Right models are well suited for remote homology detection. The target models with reduced complexity, i.e. containing significantly fewer parameters which need to be trained, perform better than Profile HMMs. Thus, improved performance can be expected for smaller training sets, too. The combination of BLR target models and an UBM is superior for the SCOPSUPER95 66 based detection task.

In table 6.9 the corresponding percentages of false predictions at fixed points within the ROC curves allowing five percent false predictions are summarized. It becomes clear that when applying SCFB BLR protein family models to the practical task of remote homology detection, e.g. for pharmaceutical applications where certain percentages of false classifications are allowed, the percentages of corresponding false classifications can be decreased again (compared to the SCFB Profile HMMs).

Modeling Variant           False Negative Predictions [%]   False Positive Predictions [%]
                           for 5% False Positives           for 5% False Negatives
Discrete Profile HMMs      26.1                             57.6
SCFB Profile HMMs (MAP)    7.9                              16.0
SCFB Profile HMMs (MLLR)   5.1                              5.5
SCFB BLR HMMs (MAP)        8.9                              11.9
SCFB BLR HMMs (MLLR)       4.9                              4.7

Table 6.9: Characteristic values for SCOPSUPER95 66 UBM based detection experiments: At the working points of 5 percent allowed false predictions, only few corresponding false predictions are obtained when using the new SCFB BLR based protein family models.

The evaluation of both classification and detection performance for SCFB BLR protein family HMMs using the SCOPSUPER95 66 corpus represents the successful proof of concept for the new modeling approach. Following this, the amount of training samples is explicitly reduced in order to evaluate the effectiveness of SCFB BLR HMMs for the sparse data problem. Therefore, experimental evaluations based on the SCOPSUPER95 44f corpus are presented.

SCOPSUPER95 44f: Evaluation of Remote Homology Classification

In order to illustrate the dependency of the classification error obtained for remote homology classification on the amount of training material available for model estimation, in figure 6.10 these values are presented for both SCFB Profile HMMs and SCFB BLR models. The baseline for discrete Profile HMMs is shown for completeness as well. As previously mentioned, two test sets exist for the SCOPSUPER95 44f corpus; thus, the results are shown for both of them in separate diagrams.

Inspecting the curves in both diagrams of figure 6.10, it becomes clear that SCFB protein family HMMs show superior performance over almost the whole range of training subsets. Only for very small sample sets (fewer than five sequences) do discrete Profile HMMs relatively outperform the new techniques. However, in this area the absolute classification error is out of any reasonable range (more than 60 percent). When applying SCFB BLR protein family HMMs, the classification error can be halved for almost all subsets of training samples compared to the SCFB Profile HMMs.


[Figure 6.10: two diagrams of classification error [%] over # training samples for SCOPSUPER95_44f (top: original testset; bottom: extended testset); curves: discrete Profile HMMs, SCFB BLR HMMs (MAP and MLLR), and SCFB Profile HMMs (MAP and MLLR).]

Figure 6.10: Illustration of the SCOPSUPER95 44f based classification errors (top: original testset; bottom: extended testset) depending on the number of training samples used for model training, for Profile HMMs (MAP and MLLR based SCFB, and discrete) as well as for BLR models. The models with reduced complexity show superior classification performance, especially when less training material is available. Since the training samples are randomly picked, the actual classification error rates (marked by '+') are smoothed using Bezier splines (solid lines).


SCOPSUPER95 44f: Evaluation of Remote Homology Detection

Following the assessment of the classification capabilities of SCFB BLR protein family HMMs, in the following their effectiveness for detection tasks is considered. In analogy to the argumentation given on page 162, the presentation of ROC curves is limited to three kinds of experiments. In figures 6.11 and 6.12 the detection results are presented by means of ROC curves illustrating the mutual dependencies of false predictions. All diagrams contain ROC curves for discrete Profile HMMs (serving as baseline reference), SCFB Profile HMMs evaluated competitively to a structured UBM, SCFB BLR HMMs applied in combination with an unstructured UBM, and the results for standalone evaluation of SCFB BLR HMMs, i.e. not competing with any kind of UBM.

In the diagrams the improved performance of advanced stochastic protein family modeling techniques for remote homology detection can be seen even for small training sets. The ROC curves corresponding to semi-continuous feature based approaches lie almost everywhere below the reference curves for state-of-the-art discrete Profile HMMs. Furthermore, it can be seen that SCFB Bounded Left-Right protein family models outperform their Profile topology based counterparts when evaluating them competitively to an UBM.

Note that the sensitivity of this model combination degrades in proportion to the decreasing number of training samples used: the smaller the number of training samples, the larger the percentage of false negative predictions. In the diagrams this is illustrated by the positions of the endpoints of the particular ROC curves not crossing the y-axis (marked with '+'). The larger the horizontal distance of the endpoints from the point of origin, the larger the number of (fixed) false rejections. However, the number of false positive predictions, which is extremely relevant for e.g. pharmaceutical applications due to the enormous costs linked to further expensive analysis of erroneously selected candidates (e.g. in wet-lab experiments), can be substantially reduced. In addition to identifying the limits of the advanced stochastic modeling approaches developed in this thesis, the experiments clearly highlight their substantial benefits: Already when using 20 training samples (which is in fact a very small number), great improvements for remote homology detection tasks can be obtained. With further reductions of the amount of training samples, the capabilities of SCFB BLR protein family models tend towards the detection performance of current discrete Profile HMMs, still including slight improvements.

Similar to the presentations given in previous sections, in table 6.10 the characteristic values of the particular ROC curves, which concretely specify the mutual dependencies of false predictions at reasonable working points, are summarized. Due to the great effectiveness of the UBM the limit of five percent false positive predictions is often not reached. Keeping consistency with the definition of the characteristic values, the percentage of corresponding false negative predictions is then 0.0; however, due to the (occasionally) larger numbers of false rejections caused by the UBM, the percentages of these false rejections are additionally given in parentheses. In analogy to the argumentation given above, when the limit of five percent false negatives is not reached, the percentage of false positive predictions at the endpoint of the particular ROC curve is given in parentheses, too.


[Figure 6.11: two ROC plots (x-axis: # false negatives; y-axis: # false positives) with insets magnifying the working area; curves: discrete Profile HMMs, semi-continuous feature based BLR HMMs (MAP and MLLR, with and without UBM), and semi-continuous feature based Profile HMMs (MAP and MLLR).]

Figure 6.11: Detection results for SCOPSUPER95 44f experiments based on target family models estimated using 20 (upper diagram) and 30 training samples (lower diagram). Even for small training sets, the new modeling techniques outperform discrete Profile HMMs (red). The endpoints of UBM based curves that do not cross the y-axis because of false rejections due to the UBM are marked with '+'.


[Figure: ROC diagram (# false positives vs. # false negatives, with working-area inset) comparing discrete Profile HMMs with semi-continuous feature based BLR and Profile HMMs (MAP/MLLR estimation, with and without UBM) for 44 training samples.]

Figure 6.12: SCOPSUPER95 44f based comparison of detection results for the third set of experiments, using 44 training samples each for the estimation of target family models. The combination of SCFB BLR models and single state UBMs performs best while producing only small numbers of false rejections due to the UBM. Again, the endpoints of UBM based curves not crossing the y-axis because of false rejections are marked with '+'.

To summarize, semi-continuous feature based protein family HMMs with Bounded Left-Right topology are well suited for remote homology detection tasks where only little training data is available. Especially in combination with a single state UBM, substantial reductions of false positive prediction rates are achievable, which is relevant for, e.g., pharmaceutical applications. Generally, the advanced stochastic models of protein families developed in this thesis outperform current discrete Profile HMMs, particularly when only small sets of training samples are available.
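The competitive evaluation scheme can be summarized as follows: a sequence is reported as a target hit only if the target model explains it better than the Universal Background Model by some margin. A minimal sketch, using toy i.i.d. log-likelihoods in place of real HMM Viterbi scores; the probability table below is purely hypothetical:

```python
import math

def detect(seq, target_loglik, ubm_loglik, threshold=0.0):
    """Competitive evaluation: report a hit only if the log-odds score,
    i.e. the difference between target and UBM log-likelihoods, exceeds
    the threshold. Sequences the UBM wins are rejected, which suppresses
    false positives at the price of occasional false rejections."""
    log_odds = target_loglik(seq) - ubm_loglik(seq)
    return log_odds > threshold, log_odds

# Toy stand-ins for the two models (NOT real Viterbi scores):
# a hypothetical target family biased towards A and L residues,
# and a uniform background over the 20 standard amino acids.
target_probs = {"A": 0.5, "L": 0.4}

def target_loglik(seq):
    return sum(math.log(target_probs.get(r, 0.005)) for r in seq)

def ubm_loglik(seq):
    return len(seq) * math.log(1 / 20)
```

With these stand-ins, `detect("ALALAL", target_loglik, ubm_loglik)` reports a hit, while a sequence the background explains better, such as `"WWWWWW"`, is rejected.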

6.4 Acceleration of Protein Family HMM Evaluation

Modeling Variant                      False Negative Predictions [%]   False Positive Predictions [%]
                                      for 5% False Positives           for 5% False Negatives
20 Training Samples
  Discrete Profile HMMs                     36.3                             80.2
  SCFB  Profile HMMs (MAP) + UBM            34.5                             62.2
        Profile HMMs (MLLR) + UBM           33.1                              0.0 (13.6)
        BLR HMMs (MAP)                      33.5                             62.1
        BLR HMMs (MLLR)                     31.8                             63.2
        BLR HMMs (MAP) + UBM                 0.0 (41.4)                       0.0 (0.2)
        BLR HMMs (MLLR) + UBM                0.0 (51.3)                       0.0 (0.6)
30 Training Samples
  Discrete Profile HMMs                     31.6                             78.2
  SCFB  Profile HMMs (MAP) + UBM            20.9                             45.0
        Profile HMMs (MLLR) + UBM           19.0                              0.0 (25.9)
        BLR HMMs (MAP)                      23.7                             50.5
        BLR HMMs (MLLR)                     24.2                             51.9
        BLR HMMs (MAP) + UBM                 0.0 (19.8)                       0.0 (1.7)
        BLR HMMs (MLLR) + UBM                0.0 (19.9)                       0.0 (0.9)
44 Training Samples
  Discrete Profile HMMs                     27.8                             68.7
  SCFB  Profile HMMs (MAP) + UBM            15.1                             31.4
        Profile HMMs (MLLR) + UBM           12.9                             26.6
        BLR HMMs (MAP)                      18.0                             35.4
        BLR HMMs (MLLR)                     18.7                             38.3
        BLR HMMs (MAP) + UBM                16.6                              0.0 (11.1)
        BLR HMMs (MLLR) + UBM                9.3                              0.0 (6.1)

Table 6.10: Characteristic values for SCOPSUPER95 44f detection experiments (20, 30, and 44 training samples): when applying SCFB BLR HMMs for small training sets, negligible percentages of corresponding false predictions are obtained at the fixed working points of 5% allowed false predictions. For those values where the corresponding limit of 5% false predictions was not reached, the appropriate global maxima at the endpoints of the ROC curves are given in parentheses.

The analysis of the general procedure for evaluating current Profile HMMs when performing protein sequence classification, or homology detection, showed that state-of-the-art techniques are rather straightforward. Especially when applying huge numbers of large protein family models to current databases consisting of tens of thousands of entries, inefficient model evaluation is problematic. Currently, massive parallelization is the only optimization technique used for accelerating the model evaluation process. Furthermore, the computational effort required for evaluating the advanced stochastic techniques developed in this thesis increases due to the additional step of mixture density evaluation which is performed for semi-continuous Hidden Markov Models.

In section 5.4, different approaches for the algorithmic acceleration of protein family model evaluation were discussed. In this section, the results of the experimental evaluations performed are presented. Since the general behavior of model evaluation is independent of the datasets actually used (as long as they are relevant in terms of the goal the thesis at hand is addressing), the benchmark of the efficiency of the particular techniques is limited

to the SCOPSUPER95 66 corpus. For systematic assessments, the remote homology classification task is analyzed. Note that due to the principal similarity of the characteristics of all corpora used in this chapter, the results obtained for SCOPSUPER95 66 can be generalized. All experiments addressing the optimization of model evaluation were performed on COMPAQ Professional Workstations XP1000 with 21264 CPUs and 500 MHz clock speed running under OSF1 Tru64 Unix v4.0. However, most considerations are based on relative measurements of, e.g., the number of HMM states explored, which are in fact independent of the actual CPU speed.

6.4.1 Effectiveness of State-Space Pruning

When estimating stochastic models for complete protein families as addressed by this thesis, rather large models are usually created. Currently, these models are evaluated more or less straightforwardly since the whole state space is explored during Viterbi decoding.3 As discussed in section 5.4.1, the situation is different in alternative pattern recognition domains. Here, state-space pruning techniques are applied, resulting in substantial savings of computational effort while keeping the classification error almost constant. The most promising technique is the Beam-Search algorithm, which aims at limiting the Viterbi decoding to the most promising paths only (cf. pages 142ff).

In this section the effectiveness of Beam-Search pruning for protein sequence analysis is investigated. For the general proof-of-concept, the state-space pruning technique is initially applied to discrete, i.e. state-of-the-art, Profile HMMs. Furthermore, for current protein family models the effectiveness of the combined model evaluation approach is investigated, which is especially relevant for remote homology classification tasks. However, target detection applications can also benefit from it, since the most promising results were obtained when competitively evaluating target models with an UBM, i.e. more than one model. Following this, the effectiveness of the Beam-Search algorithm for SCFB BLR models is evaluated. Note that the sub-optimal character of Beam-Search based model evaluation implies increasing classification error rates when the state space explored is limited improperly.

In figure 6.13 the percentages of states explored are compared to the corresponding classification error achievable when limiting the exploration of the state space accordingly. It can be seen that for both separate and combined model evaluation, large parts of the state space do not need to be explored.
For the separate evaluation of all 16 target models of SCOPSUPER95 66, the state-space exploration can be reduced by 40 percent while keeping the classification error at the same level as for the baseline experiment where the whole state space is explored (as in the state-of-the-art procedure). Further savings can be obtained when combining the 16 models into a common state space. Here, more than 60 percent of the original combined state space does not need to be explored. For the advanced stochastic modeling techniques developed in this thesis, even larger savings can be obtained. In figure 6.14 the effectiveness of Beam-Search pruning for the combined evaluation of SCOPSUPER95 66 SCFB BLR HMMs is visualized as in the preceding diagram. It can be seen that limiting the state-space exploration to approximately one fourth is sufficient for reaching the minimum classification error achievable. Note that the absolute classification error of SCFB BLR HMMs is significantly

3The same remains true when performing Forward-Backward model evaluation.

176 6.4 Acceleration of Protein Family HMM Evaluation

[Figure: classification error [%] vs. states explored [%] for the SCOPSUPER95_66 corpus, comparing combined and separate model evaluation against the baseline.]

Figure 6.13: Effectiveness of state-space pruning when applying the Beam-Search algorithm to the evaluation of discrete Profile HMMs for SCOPSUPER95 66. If all models are evaluated separately (green '+'-signs), only 60 percent of the overall state space needs to be explored for achieving the same classification error as when exploring the whole state space (baseline: black solid line). Furthermore, for the combined model evaluation less than 40 percent of the whole state space needs to be explored (red '+'-signs).

lower than that of current discrete Profile HMMs (cf. table 6.8), and that the size of the overall state space is significantly smaller. Whereas discrete Profile HMMs for SCOPSUPER95 66 consist of approximately 8 800 states, about 3 000 states are sufficient for SCFB Bounded Left-Right protein family models. The reduction of the number of states to be explored for successful remote homology analysis is favorable for all kinds of modeling. Even existing state-of-the-art approaches can immediately benefit from it. Additionally, since single model evaluation can in principle be accelerated by the Beam-Search algorithm, parallelization approaches can be further accelerated, too. Furthermore, when applying advanced stochastic models, state-space pruning is of major importance since for every state explored a set of mixture components needs to be evaluated. Although efficient techniques are used for this, it is the limiting factor for high-throughput applications. Thus, the smaller the state space which actually needs to be explored, the faster the overall process of remote homology analysis.
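The Beam-Search principle applied throughout this section can be illustrated with a compact sketch. The following is a generic beam-pruned Viterbi pass over dense log-probability tables, chosen here for illustration and not the ESMERALDA implementation:

```python
import math

NEG_INF = float("-inf")

def viterbi_beam(obs_logprobs, trans_logprobs, init_logprobs, beam=14.0):
    """Viterbi decoding with Beam-Search pruning: after each frame only
    states whose partial path score lies within `beam` of the locally
    best score stay active; all others are discarded and not expanded.
      obs_logprobs[t][s]   : log P(observation t | state s)
      trans_logprobs[s][s2]: log P(s2 | s)
    Returns the best final log score (sub-optimal if the beam is too tight)."""
    n_states = len(init_logprobs)
    scores = [init_logprobs[s] + obs_logprobs[0][s] for s in range(n_states)]
    active = {s for s in range(n_states) if scores[s] > NEG_INF}
    for t in range(1, len(obs_logprobs)):
        new = [NEG_INF] * n_states
        for s in active:                       # expand only surviving states
            for s2 in range(n_states):
                cand = scores[s] + trans_logprobs[s][s2] + obs_logprobs[t][s2]
                if cand > new[s2]:
                    new[s2] = cand
        best = max(new)
        # prune: keep only states within the beam of the current best path
        active = {s for s in range(n_states) if new[s] >= best - beam}
        scores = new
    return max(scores)

# Tiny two-state demo: state 0 explains every frame far better than state 1,
# so pruning state 1 early does not change the best score.
obs = [[0.0, -5.0]] * 3
trans = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
init = [math.log(0.5), math.log(0.5)]
score = viterbi_beam(obs, trans, init, beam=5.0)
```

In the demo the pruned score equals the unpruned one; with an improperly tight beam the winning path itself can be discarded, which is exactly the source of the increased classification error mentioned above.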

[Figure: classification error [%] vs. states explored [%] for the SCOPSUPER95_66 corpus, comparing the combined evaluation of SCFB BLR HMMs against the baseline.]

Figure 6.14: Effectiveness of Beam-Search based state-space pruning when applying SCFB BLR HMMs to SCOPSUPER95 66 based remote homology classification. Only approximately 25 percent of the whole state space needs to be explored for achieving the same classification error as the reference experiment where the complete state space is evaluated (black solid line).

6.4.2 Effectiveness of Accelerated Mixture Density Evaluation

The central aspect of advanced stochastic protein sequence analysis addressed by this thesis is the feature based representation of protein data. Contrary to current Profile HMMs, semi-continuous modeling techniques were therefore applied instead of discrete approaches. For every state of an HMM an additional third stochastic stage is performed, namely the evaluation of mixture components representing the underlying feature space.

The evaluation of a mixture density is one substantial factor for the efficiency of the application of (semi-)continuous HMMs. Generally, when calculating density values for all 1024 mixture components (the number of components used for feature space representation), high-throughput database screening becomes impossible due to the enormous computational effort, even if the number of state explorations is reasonably limited by applying the Beam-Search procedure. Thus, further optimizations were discussed in section 5.4.3, whose effects are presented in the following.

Within the ESMERALDA system, mixture density evaluation is principally performed using the efficient estimation technique described on page 144f. Applying this technique optimally limits the calculations. Furthermore, the Beam-Search idea is generalized to mixture density evaluation. Informal experiments showed that the number of mixture components which need to be explored can be reduced drastically: on average, only approximately one percent of all mixture components needs to be evaluated for successful remote homology classification. This implies strong compactness and locality of the protein data feature space.

Furthermore, according to the argumentation given on page 145f., the mixture classifier can be decomposed into multiple stages. In combination with the Beam-Search algorithm for mixture density pruning, only small numbers of mixture components actually need to be explored for the whole 99-dimensional feature vectors. Informal experiments showed that more than two stages do not substantially influence the efficiency of mixture density

evaluation. However, the number of feature vector components used for the first stage of the classifier needs to be evaluated. Therefore, in figure 6.15 the number of components used for the first stage of the classifier is compared to the overall percentage of computational time required for the complete process of mixture density classification, i.e. including the second stage where complete feature vectors are used. Note that the diagram contains only those configurations where the resulting classification error does not significantly differ from the baseline result of 16.8% (the average confidence range for SCOPSUPER95 66 is approximately ±3%). Further dimension reduction for the first stage of the classifier implies a significant increase of the classification error.
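The two-stage scheme can be sketched as follows: stage one scores every component on a low-dimensional prefix of the feature vector, Beam-Search pruning discards components far from the best, and stage two rescores only the survivors on the full vector. The sketch below uses unit-variance diagonal Gaussians as stand-ins for the actual codebook densities, and toy dimensions instead of the 20-of-99 configuration used in the experiments:

```python
def partial_logdens(x, mean, dims):
    """Log density of a unit-variance diagonal Gaussian, restricted to
    the first `dims` feature components (additive constants dropped)."""
    return -0.5 * sum((x[i] - mean[i]) ** 2 for i in range(dims))

def two_stage_mixture(x, means, first_stage_dims=20, beam=14.0):
    """Two-stage mixture density evaluation with Beam-Search pruning:
    stage one ranks all components on a prefix of the feature vector,
    the beam keeps only components close to the best, and stage two
    evaluates the survivors on the complete vector. Returns the winning
    component index and the list of stage-one survivors."""
    stage1 = [partial_logdens(x, m, first_stage_dims) for m in means]
    best = max(stage1)
    survivors = [k for k, s in enumerate(stage1) if s >= best - beam]
    stage2 = {k: partial_logdens(x, means[k], len(x)) for k in survivors}
    return max(stage2, key=stage2.get), survivors
```

The saving comes from the fact that the expensive full-dimensional evaluation in stage two is only performed for the (typically very few) components surviving the cheap first stage.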

[Figure: computation time [%] vs. number of feature vector components used for the first stage of the classifier, for the SCOPSUPER95_66 corpus.]

Figure 6.15: Effectiveness of the application of two-stage mixture density classifiers and appropriate Beam-Search pruning for SCFB BLR HMM based SCOPSUPER95 66 classification experiments. The computation time for the classification can be reduced by more than 60 percent when using 20 of 99 feature vector components for the first stage of the classifier.

It can be seen that the two-stage mixture density classifier is very effective. The computation time can be reduced by more than 60 percent when using 20 feature vector components for the first classification stage and applying the Beam-Search algorithm to the mixture components reasonably.4 By means of all optimization techniques used for accelerated model evaluation, the computation times for SCOPSUPER95 66 classification experiments are as follows: compared to the complete state-space evaluation of discrete Profile HMMs, which takes approximately 23.9 minutes, the application of state-space pruning reduces the evaluation time to approximately 10.3 minutes. When applying the new advanced stochastic modeling techniques, the computation time required for decreasing the SCOPSUPER95 66 classification error by almost one half relative is approximately 27.8 minutes.

4In fact ESMERALDA’s default pruning parameter for mixture density Beam-Search is applied, namely − ln(B) = 14.


To summarize, due to the efficient model evaluation techniques evaluated in this section, advanced stochastic protein family models can be evaluated in a time comparable to that of discrete Profile HMMs, while both classification and detection performance are improved.

6.5 Combined Evaluation of Advanced Stochastic Modeling Techniques

In the previous sections, the new techniques developed in this thesis were evaluated separately. Following this, the configuration of advanced stochastic protein family models which turned out to be most suitable for remote homology analysis is used for a combined evaluation. The final configuration of the particular protein family HMMs is as follows:

• Feature based protein data processing,

• Bounded Left-Right model topology,

• Competitive evaluation of target model and single state UBM,

• Semi-continuous Hidden Markov Models for protein family modeling,

• k-means based estimation of the general protein data feature space using approximately 90 000 sequences from SWISSPROT (resulting in 1024 mixture components),

• MLLR or MAP based specialization of the protein data feature space, and

• Beam-Search based state-space and mixture density pruning (including a two-stage classifier).

For evaluation, the SCOPSUPER95 20 corpus as well as the PFAMSWISSPROT corpus are used, whereas for SCOPSUPER95 20 the baseline results of the reference evaluation using state-of-the-art discrete Profile HMMs are given, too.

Evaluation based on the SCOPSUPER95 20 corpus

The first set of experiments performed for the combined evaluation of advanced stochastic modeling techniques is generally comparable to the evaluations described so far. The SCOPSUPER95 20 corpus is used for both remote homology classification and remote homology detection tasks. In table 6.11 the results for the classification task are presented. Compared to current discrete Profile HMMs, both variants of SCFB BLR HMMs, based on either MAP or MLLR specialization of the general protein feature space, perform significantly better. The classification error can be decreased by approximately one fourth relative. In figure 6.16 and table 6.12 the results of the remote homology detection based evaluation for SCOPSUPER95 20 are presented. The ROC curves illustrate the superior performance of the new approaches for protein family modeling compared to the state-of-the-art. Furthermore, the effectiveness of explicit background models can be seen, since the UBM


Modeling Variant          Classification Error E [%]   Relative Change ∆E [%]
Discrete Profile HMMs     29.5                         –
BLR (MAP)                 26.4                         -10.5
BLR (MLLR)                22.6                         -23.4

Table 6.11: Classification results for SCOPSUPER95 20 comparing discrete Profile HMMs with the protein family models developed in this thesis in their final parametrization (semi-continuous feature based type, Bounded Left-Right model architecture, 1024 mixture components for feature space representation obtained from SWISSPROT – adapted using MLLR/MAP). The enhanced protein family models significantly outperform their discrete counterparts (average confidence range: approximately ±2.4%).

based curves are below the corresponding curves for standalone target model evaluation. Note that only very few false positive predictions are obtained while reasonably limiting the percentage of false negative predictions.

[Figure: ROC curves (# false positives vs. # false negatives, with working-area inset) for discrete Profile HMMs and SCFB BLR HMMs (MAP/MLLR adaptation, with and without UBM) on SCOPSUPER95 20.]

Figure 6.16: Detection results for the final evaluation experiments using the SCOPSUPER95 20 corpus. SCFB BLR models based on MAP (blue ROC curve) or MLLR adaptation (green curve), evaluated competitively against the single state UBM, show superior performance compared to the state-of-the-art discrete Profile HMMs (red). Additionally, the ROC curves for standalone evaluation of SCFB BLR models, i.e. without any UBM, are given, illustrating the effectiveness of the background model for remote homology detection (cyan and pink curves). Due to small numbers of false rejections produced by the single state UBM, both the BLR MAP and BLR MLLR curves do not cross the y-axis; they are marked with '+' for clarity.


Modeling Variant              False Negative Predictions [%]   False Positive Predictions [%]
                              for 5% False Positives           for 5% False Negatives
Discrete Profile HMMs         18.7                             58.9
SCFB BLR HMMs (MAP)           13.2                             32.4
SCFB BLR HMMs (MLLR)          12.8                             29.8
SCFB BLR HMMs (MAP) + UBM      0.0 (14.0)                       4.9
SCFB BLR HMMs (MLLR) + UBM     0.0 (14.4)                       3.1

Table 6.12: Characteristic values for SCOPSUPER95 20 UBM based detection experiments: at the working points of 5 percent allowed false positive/negative predictions, the percentages of corresponding false negative/positive predictions obtained are substantially smaller for SCFB BLR models compared to state-of-the-art discrete Profile HMMs.

The characteristic values of false predictions at fixed working points within the ROC curves (given in table 6.12) illustrate the relevance of advanced stochastic protein family models, especially for, e.g., pharmaceutical purposes.

Evaluation based on the PFAMSWISSPROT corpus

The final set of evaluations is directed to a “real-world” scenario. The PFAMSWISSPROT corpus consists of three exemplary protein domains which were picked from the Pfam database [Son98]. The actual selection of the particular protein domains corresponds to typical investigations performed for pharmaceutical research. Advanced stochastic models were estimated using the appropriate sets of training sequences. Once the models are estimated, remote homology detection is performed for the approximately 90 000 sequences of the SWISSPROT database. Using the techniques and the configuration which proved most effective in the numerous preceding experimental evaluations (cf. page 180), and fixed thresholds determined from previous detection experiments, hard decisions regarding target hit or miss are made. Generally, this is a reasonable procedure for, e.g., pharmaceutical research. In table 6.13 the corresponding results are summarized. The percentages of correct and false predictions are calculated using the appropriate overall number of occurrences within SWISSPROT. Since SWISSPROT contains sequences of complete proteins, of which the particular domains are actually only parts, local alignments are calculated. Analyzing the prediction results, it becomes clear that advanced stochastic protein sequence models are suitable for real-world scenarios of remote homology detection. Additionally, in figure 6.17 the annotations of two SWISSPROT sequences corresponding to Pkinase hits are given. The comparison of the reference annotation provided by the Pfam database and the one obtained when using the GRAS2P system, i.e. advanced stochastic protein family models, shows that real-world applications can greatly benefit from the developments described in this thesis.


Pfam Id   Pfam Name                                # Occurrences   # Predictions (Percentage)
                                                   in SWISSPROT    Correct          False
PF00001   7tm_1 (GPCR, 7 Transmembrane Receptor)   1078            1024 (95.0%)      54 (5.0%)
PF00069   Pkinase (Protein Kinase Domain)           507             493 (97.2%)      14 (2.8%)
PF00005   ABC_tran (ABC Transporter)               1202            1090 (90.7%)     112 (9.3%)

Table 6.13: Summary of the detection results for the PFAMSWISSPROT corpus. For all exemplary domains, the percentage of correctly predicted occurrences within the approximately 90 000 SWISSPROT sequences of complete proteins is satisfactorily high. Note that the false predictions consist almost exclusively of false negative predictions, i.e. the number of false positive predictions is almost negligible.

SGK1_HUMAN (O00141)

Pkinase 98−355

... /UBM/ ... /PF00069/[99..353](−694.601 / 2e−297) ... /UBM/ ... ;

BMRB_HUMAN (O00238)

Pkinase 204−491

... /UBM/ ... /PF00069/[205..490](−524.998 / 9e−224) ... /UBM/ ... ;

Figure 6.17: Illustration of the capabilities of advanced stochastic protein family models for detection tasks: two exemplary target hits of SWISSPROT protein sequences containing the Pkinase domain (PF00069) are shown, including the predicted localization of the target. Compared to the reference annotation provided by the Pfam database [Son98] and represented by the images given at the top of every frame, the positions of the domains within the protein sequences are predicted almost perfectly. The annotations produced by the GRAS2P system are given below the reference annotation, including the domain positions (square brackets) and the E-values of the particular models based on log-odds scores as negative log-likelihoods (given in parentheses).
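Annotation strings of the kind shown in figure 6.17 can be post-processed mechanically. The helper below is my own illustration, not part of GRAS2P, and assumes the exact /ID/[start..end](score / E-value) pattern with ASCII minus signs:

```python
import re

# One domain hit in a GRAS2P-style annotation string, assuming the pattern
# shown in figure 6.17: /<model id>/[<start>..<end>](<score> / <E-value>)
HIT = re.compile(r"/(\w+)/\[(\d+)\.\.(\d+)\]\(\s*(-?[\d.]+)\s*/\s*([\deE.+-]+)\s*\)")

def parse_hits(annotation):
    """Extract (model id, start, end, log-odds score, E-value) tuples
    from an annotation line; /UBM/ stretches carry no hit and are skipped."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)),
             float(m.group(4)), float(m.group(5)))
            for m in HIT.finditer(annotation)]
```

Applied to the first annotation of figure 6.17, this yields the model id PF00069, the predicted span 99..353, the score and the E-value as one tuple.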


6.6 Summary

The three fundamental issues formulated for the successful application of advanced stochastic protein family models (cf. page 92) address the general performance improvement for both classification and detection applications, robust model estimation even when only little training data is available, and efficient model evaluation. Following the description of the developments directed to these constraints for new modeling techniques, in this chapter the particular approaches were extensively evaluated in a large number of experiments. For reliable conclusions regarding the effectiveness of the new approaches, various data sets were analyzed, aiming at separate evaluations of the new approaches. Furthermore, a combined assessment of these methods in their final configuration for typical sequence analysis tasks within molecular biology research was performed.

The concept of feature based protein sequence analysis using HMMs is the fundamental approach developed in this thesis. Since all further developments are based on the richer representation of biological data, initially the effectiveness of the new features was evaluated. Therefore, the model topology of state-of-the-art Profile HMMs, namely the complicated three-state architecture, was kept fixed, whereas the base for emissions of the particular HMM states was changed to the features explicitly covering biochemical properties of residues in their local neighborhood. The resulting semi-continuous feature based Profile HMMs proved to be very effective for both remote homology classification and detection tasks. For the representative SCOPSUPER95 66 corpus, which consists of sequences for 16 superfamilies, the classification error rate could be decreased by more than one third relative (compared to the state-of-the-art). By means of this evaluation it could be seen that ML based feature space specialization is usually not suitable, since too many training samples are required for robust adaptation.
Instead, the application of MAP as well as MLLR adaptation greatly improves the quality of Profile HMMs. The significantly improved performance of feature based protein family models could be proved for detection tasks, too. In combination with an UBM which explicitly covers all non-target specific protein data, both specificity and sensitivity of stochastic protein family models could be improved drastically.

In order to tackle the so-called sparse data problem, protein family models with reduced complexity were developed. The experimental evaluation of both variants, namely global Bounded Left-Right models and protein family models created by concatenating automatically derived Sub-Protein Units, showed that models with fewer parameters to be trained can be estimated even when only little training data is available. Especially semi-continuous feature based BLR models are very effective: the SCOPSUPER95 66 based classification error could almost be halved compared to current discrete Profile HMMs. The experimental evaluation using the SCOPSUPER95 44f corpus, where the amount of training data is steadily decreased, showed that approximately 20 training samples are sufficient on average for both improved classification and detection performance. For the new concept of protein family modeling using small building blocks, the proof of concept for their applicability to remote homology analysis could be given.

Since enhanced modeling techniques for protein families require substantially more computational effort (caused by the additional third stochastic stage during model evaluation, namely mixture density evaluation), acceleration techniques are mandatory. Based on pruning techniques addressing the reduction of state-space exploration as well as of mixture

184 6.6 Summary density evaluation, the computational effort could be decreased significantly on average while keeping the superior classification or detection performance. The time required for current Profile HMM evaluation and pruned evaluation of SCFB BLR HMMs is almost comparable which implies a gain in classification, or detection accuracy without substan- tially increasing the processing time. Finally, the configuration which proved most effective in the separate evaluation of the new techniques was used for combined evaluation using the SCOPSUPER95 20 as well as the PFAMSWISSPROT corpus. This configuration consists of:

• Feature based protein data processing,

• Bounded Left-Right model topology,

• Competitive evaluation of target model and single state UBM,

• Semi-continuous Hidden Markov Models for protein family modeling,

• k-means based estimation of the general protein data feature space using approximately 90 000 sequences from SWISSPROT (resulting in 1024 mixture components),

• MLLR or MAP based specialization of the protein data feature space, and

• Beam-Search based state-space and mixture density pruning (including a two-stage classifier).

For both corpora the improved performance of the advanced stochastic protein family models could be proved. Additionally, when occurrences of Pfam based models were searched within SWISSPROT, the great effectiveness of the new methods could be proved for real-world scenarios, including domain spotting within large protein sequences.


7 Conclusion

In the last decade(s) the computational analysis of protein sequences has become the base for almost all fields of molecular biology research. Usually, the search for homologue sequences contained in one of the major sequence databases stands at the beginning of most investigations aiming at (further) biological insights. Especially for pharmaceutical research within the drug discovery process, broad scale protein sequence analysis is of fundamental scientific as well as commercial interest. Traditionally, pairwise alignment techniques are applied to this task, but in the last few years analysis approaches based on probabilistic protein family models, most notably Profile HMMs, have become the methodology of choice. Unfortunately, although sophisticated methods for both robust model estimation and evaluation have been developed, the general problem of remote homology detection and classification is still far from being solved. Current stochastic protein family models, estimated using small training sets to cover highly diverging sequence data, tend to capture facts, i.e. protein family members, which were more or less known before. The generalization capabilities of state-of-the-art Profile HMMs are often not satisfactory. Since “post-genome” techniques may at least partially be doomed if the fundamental sequence analysis fails, new approaches are demanded.

In order to improve the effectiveness of remote homology analysis, in this thesis new approaches for stochastic protein family modeling were developed. To this end, the problem of protein sequence processing was consistently treated as a general pattern recognition task. Based on this more abstract point of view, various new techniques were presented which were motivated by alternative pattern classification applications like automatic speech recognition. Consequently, a generally new and very effective kind of protein sequence analysis was developed.
By means of the new approaches presented here, substantial improvements for remote homology analysis could be achieved. In the following, the thesis is summarized and the practical applicability of advanced stochastic protein family models is discussed.

Summary

Current protein sequence analysis approaches are mostly based on the direct processing of amino acid data. Especially for the most successful method, namely probabilistic protein family modeling using Profile HMMs, no alternative procedures are known. The biological functions of proteins are determined by their three-dimensional structure, which is the result of the underlying, rather complex folding process. Similar structure corresponds to similar biological function, which justifies the general approach of sequence-comparison based protein analysis. The linear chain of the 20 standard amino acids, the so-called primary structure of proteins, dominates the folding process. Although the primary structure gives reasonable hints for the biological function, much information is discarded when limiting protein sequence analysis to it. Note that further protein data like secondary structure information is usually not available when processing sequences at the beginning of the molecular biology processing loop.

The biochemical properties of protein sequences are actually summarized only by the chain of amino acids. In order to explicitly capture these biochemical properties, the fundamental innovation of the thesis describes a paradigm shift towards processing protein data in a richer representation. Using a sliding window technique, frames based on 16 consecutive residues are created which consist of a multi-channel, signal-like numerical representation of certain biochemical properties of the amino acids covered by the local context. Every channel contains a mapping of the 16 frame residues to numerical values that correspond to the particular property. These mappings are obtained by exploiting certain amino acid indices which are collected in the literature. By means of this procedure a very detailed description of proteins’ biochemistry is obtained. In order to concentrate on the relevant essentials which determine the actual protein family affiliation, features are extracted from the new signal-like representation. Therefore, pattern recognition techniques are applied, namely a Discrete Wavelet Transformation as well as a Principal Components Analysis, aiming at the extraction of meaningful feature vectors which sufficiently describe the general protein signal shape. When applying this procedure, protein sequences are converted into a 99-dimensional feature vector representation which is used for remote homology analysis.

Since more or less continuous data is processed when considering the new 99-dimensional feature vectors, discrete Profile HMMs are not suitable for robust protein family modeling. Instead, continuous modeling techniques are applied. As known from the literature, semi-continuous HMMs are very effective especially when only little training data is available.
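The sliding-window encoding described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: only a single channel is shown, using the well-known Kyte-Doolittle hydropathy index as one example of an amino acid index, and the function name is hypothetical.

```python
# One channel of the multi-channel signal representation: the
# Kyte-Doolittle hydropathy index (an illustrative choice; the thesis
# uses a collection of amino acid indices from the literature).
KD_HYDROPATHY = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

FRAME_LEN = 16  # residues per frame, as in the thesis


def frames(sequence, index_tables):
    """Slide a 16-residue window over the sequence; every channel maps
    the covered residues to the numerical values of one index table."""
    out = []
    for start in range(len(sequence) - FRAME_LEN + 1):
        window = sequence[start:start + FRAME_LEN]
        out.append([[table[aa] for aa in window] for table in index_tables])
    return out


seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary toy sequence
signal = frames(seq, [KD_HYDROPATHY])
print(len(signal), len(signal[0]), len(signal[0][0]))  # -> 18 1 16
```

Each frame would then be passed through the Discrete Wavelet Transformation and Principal Components Analysis described above to obtain the final feature vectors.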
Their basic advantage is the possibility, in principle, to separate the estimation of a feature space representation (using a Gaussian mixture density) from the model training itself. Only for the actual model estimation are moderate amounts of target-specific data required. The general feature space representation can be obtained by applying mixture density estimation techniques to large amounts of general protein data. Based on the new feature representation of biological sequences, semi-continuous Profile HMMs were developed for remote homology classification. In the approach presented here, approximately 90 000 sequences obtained from SWISSPROT are exploited for robust mixture density estimation. Only small amounts of protein family specific data are required for model training. In order to specialize the mixture density based feature space representation with respect to a particular target family, adaptation techniques are applied. Using the family specific sample data for either MAP or MLLR adaptation, the focus of the general feature space can be effectively concentrated on a particular protein family, which results in robust model estimation even for small training sets.

Compared to the rather complex three-state topology of Profile HMMs required when processing discrete amino acid data, the richer protein sequence representation developed in this thesis makes protein family HMMs with reduced complexity possible. Their basic advantage is the smaller number of parameters which need to be trained using representative sample data. Thus, the amount of training sequences necessary for robust model estimation can be further reduced. In this thesis two variants of protein family models with reduced complexity were developed. First, global protein family models as currently used are created containing a Bounded Left-Right (BLR) architecture. Basically, BLR

HMMs consist of a linear model architecture with reasonably limited numbers of direct transitions to states adjacent to the right. The second variant represents a general paradigm shift in protein family modeling. Inspired by alternative pattern classification applications like automatic speech recognition, target family models are estimated using automatically derived building blocks, so-called Sub-Protein Units (SPUs). In the exemplary definition of SPU candidates described in the thesis, high-energy parts of feature vector sequences corresponding to protein data are interpreted as building blocks. In an iterative procedure a non-redundant set of SPUs relevant for describing the essentials of complete protein families is extracted. Similar to the BLR models, the number of parameters which needs to be trained is substantially smaller compared to state-of-the-art Profile HMMs.

For remote homology detection tasks, actual target hits need to be discriminated robustly from non-targets. Therefore, as usual for protein sequence analysis, log-odd scores are used for threshold-based decisions. The null model used for the feature based modeling techniques developed in this thesis consists of the prior probabilities of the particular mixture components which describe the feature space. By means of this scoring method and the significance analysis using E-values, detection tasks can actually be solved. For a further reduction of the number of false positive predictions, which is especially relevant for, e.g., pharmaceutical applications, in addition to the abovementioned log-odd scoring technique semi-continuous feature based protein family HMMs are evaluated competitively against a so-called Universal Background Model (UBM). Such a UBM explicitly covers general protein data and its application can thus be interpreted as some kind of pre-filtering stage.
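The log-odd scoring against a mixture-prior null model can be sketched as below. This is a simplified, hypothetical illustration: the per-frame component log-densities would come from the shared Gaussian mixture of the semi-continuous HMM, and all numbers here are toy values.

```python
import math


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def null_log_likelihood(frame_log_densities, log_priors):
    """Null model score: per frame, the full mixture weighted by the
    component prior probabilities; frame scores are summed over the
    sequence. frame_log_densities[t][k] = log N_k(x_t)."""
    return sum(
        logsumexp([lp + ld for lp, ld in zip(log_priors, frame)])
        for frame in frame_log_densities
    )


def log_odds(model_log_likelihood, null_log_lik):
    """Log-odd score for threshold-based target/non-target decisions."""
    return model_log_likelihood - null_log_lik


# Toy example: two mixture components, two feature frames; the family
# HMM log-likelihood (-2.5) is assumed, not computed here.
log_priors = [math.log(0.5), math.log(0.5)]
feat_frames = [[-1.0, -3.0], [-2.0, -2.0]]
score = log_odds(-2.5, null_log_likelihood(feat_frames, log_priors))
print(score > 0.0)  # -> True (accepted at threshold 0)
```

A sequence is reported as a target hit if its score exceeds a threshold derived, e.g., from the E-value analysis mentioned above.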
When analyzing the common procedure of evaluating state-of-the-art protein family HMMs, its rather straightforward character becomes obvious. The only methods which are applied to the acceleration of, e.g., database searches for remote homologies are based on more or less “brute-force” techniques, namely massive parallelization and adding more computers to the task. Currently, there are hardly any solutions available for the algorithmic acceleration of protein family model evaluation. Especially when applying advanced stochastic modeling techniques as described in the thesis, the problem of inefficient model evaluation becomes even more manifest. Thus, acceleration techniques were adopted from general pattern recognition and transferred to the protein sequence analysis domain. Most of these techniques can be applied to the current procedure, too, i.e. parallelization remains possible. Among others, the HMM state-space which actually needs to be explored can be substantially reduced by applying pruning techniques like the Beam-Search algorithm. Furthermore, since the feature space representation is rather specific for particular HMM states, the evaluation of mixture densities can also be severely pruned. Protein family model evaluation is greatly accelerated, resulting in computation times comparable to those of discrete Profile HMMs, while simultaneously substantially improving effectiveness.

The capabilities of the new modeling techniques were evaluated in numerous experiments. Therefore, the SCOP database was divided into various corpora which were used both for the separate evaluation of the new approaches as well as for the combined evaluation of the resulting advanced stochastic protein family models. It could be shown that state-of-the-art discrete Profile HMMs, which are in fact the currently most promising approach for remote homology analysis, are significantly outperformed when using the new techniques developed in this thesis.
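The Beam-Search pruning mentioned above can be sketched as a time-synchronous Viterbi evaluation in which, after every frame, only states scoring within a fixed margin of the current best partial score are expanded. This is a minimal sketch assuming generic log-probability inputs, not the actual accelerated Profile-HMM decoder of the thesis.

```python
import math

NEG_INF = float("-inf")


def viterbi_beam(obs_logp, trans_logp, init_logp, beam):
    """Time-synchronous Viterbi with Beam-Search pruning: states whose
    partial score falls more than `beam` below the current best are
    discarded from further expansion. All inputs are log-probabilities:
    obs_logp[t][s], trans_logp[s][s2], init_logp[s]."""
    n = len(init_logp)
    scores = [init_logp[s] + obs_logp[0][s] for s in range(n)]
    for t in range(1, len(obs_logp)):
        best = max(scores)
        active = [s for s in range(n) if scores[s] >= best - beam]
        new = [NEG_INF] * n
        for s in active:
            for s2 in range(n):
                cand = scores[s] + trans_logp[s][s2] + obs_logp[t][s2]
                if cand > new[s2]:
                    new[s2] = cand
        scores = new
    return max(scores)


# Toy 2-state model: a very wide beam reproduces exact Viterbi, while
# a narrow beam can only lower (never raise) the best path score.
lg = math.log
init = [lg(0.6), lg(0.4)]
trans = [[lg(0.7), lg(0.3)], [lg(0.4), lg(0.6)]]
obs = [[lg(0.9), lg(0.1)], [lg(0.2), lg(0.8)], [lg(0.7), lg(0.3)]]
exact = viterbi_beam(obs, trans, init, beam=1e9)
pruned = viterbi_beam(obs, trans, init, beam=0.5)
print(exact >= pruned)  # -> True
```

The beam width trades accuracy against speed: a narrow beam skips most state expansions but may prune the globally best path.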


Figure 7.1: Application scheme of advanced stochastic protein family models developed in this thesis: 1) Estimation of a general mixture density based feature space representation using general protein data (SWISSPROT), and estimation of Universal Background Models and feature extraction for the search database; 2) Training of semi-continuous feature based Bounded Left-Right protein family HMMs using small sets of target specific feature data, and feature space specialization by adaptation; 3) Competitive (accelerated) model evaluation of target and UBM HMMs, and threshold based remote homology detection.

The general application scheme of advanced stochastic protein family models is summarized in figure 7.1 by means of remote homology detection using semi-continuous feature based Bounded Left-Right HMMs.

In the first frame (upper block numbered with ’1’), the general preprocessing steps necessary for the application of the new models are illustrated. In particular these are the estimation of the general mixture density feature space representation, UBM training, and the feature extraction for the sequences contained in the database which will be searched for remote homologues. These rather time-consuming steps need to be performed only once. Following this, the actual target protein family model estimation procedure is shown in frame 2. Based on the features extracted from a small sample set, a general semi-continuous feature based protein family HMM is estimated. For further specialization of this particular model, either MAP or MLLR adaptation is applied. The resulting target model is competitively evaluated against the UBM estimated in the initial preprocessing step. For efficient model evaluation, various acceleration techniques are applied. The actual remote homology detection process is summarized in the third block.

Discussion

Analyzing the current situation in computational protein sequence analysis, it becomes clear that especially the detection of remote homologue protein family members is still a very challenging problem. Apparently, state-of-the-art techniques, most notably the application of Profile HMMs as probabilistic protein family models, have reached their limits with respect to the general classification and detection performance. Since successful remote homology analysis is the fundamental prerequisite for all further processing steps within the molecular biology research loop, insufficient recognition results are rather problematic. Thus, new approaches are demanded addressing at least partial solutions to this problem. The development of new methods alleviating the abovementioned problems, thereby allowing further knowledge gains in molecular biology (which is especially valuable for, e.g., pharmaceutical applications), was the basic motivation for this thesis. The conceptual idea for enhanced protein family modeling methods was the consistent treatment of the protein sequence analysis problem from a general pattern classification point of view.

By means of the advanced stochastic protein family modeling techniques developed in this thesis, substantial improvements for remote homology analysis, i.e. classification as well as detection, become possible. Due to the extensive experimental evaluation, which actually proved the new approaches’ effectiveness for various representative datasets, strong evidence could be given for the generalization of the improvements to related tasks. This is especially relevant when actually new members of certain protein families of interest are sought. When applying the methods developed in this thesis it is much more likely that the correct protein family affiliation of currently unknown sequence data is predicted reasonably accurately (compared to state-of-the-art techniques).
The overall application scheme of the new techniques (summarized e.g. in figure 7.1) allows their easy integration into already existing protein sequence analysis frameworks, e.g. within a concrete drug discovery pipeline. Thus, broad areas of molecular biology research can immediately benefit from the outcome of this thesis.


Furthermore, the new advanced stochastic protein family models can serve as the foundation for certain further developments. As an example, techniques for iterative model estimation were proposed in the literature (e.g. PSI-BLAST [Alt97] or SAM’s target98 [Kar98]), aiming at successive enhancements of the training sets by alternating recognition and training phases. Since the methods described in this thesis address the basic modeling procedure, such iterative techniques can in principle be realized as well. It can reasonably be expected that iterative approaches will also benefit from the new techniques. Additionally, especially the SPU based modeling approach, which was discussed and for which a general proof of concept for effectiveness was given, offers the opportunity for further developments. The idea of explicitly limiting the modeling base to small, automatically obtained building blocks can be enhanced in various ways. First, alternative criteria for SPU candidate selection can be applied, e.g. the incorporation of further statistical measures like entropy. Furthermore, general SPUs can be estimated on major protein data (like SWISSPROT) and the resulting building blocks can serve as some kind of “biological inventory” which is used for protein family modeling by concatenating the particular building blocks. Similar to, e.g., automatic speech recognition applications, semi-automatic annotations of unknown protein sequences can be obtained by aligning them to the well-trained SPU models. SPUs which are obtained by either the method outlined in this thesis or by the more general approach mentioned above offer high potential for obtaining new biological insights. However, substantial effort needs to be dedicated to the biological analysis and interpretation of the particular results. To conclude, advanced stochastic protein family models as developed in this thesis represent a major improvement for remote homology analysis tasks.
Furthermore, they can serve as the foundation for various further developments which is very promising for general molecular biology research.

A Wavelets

Natural or artificial signals evolving in time are usually represented in their most obvious form, i.e. as a function f of a time-dependent variable t. Besides this, especially for applications in natural sciences and engineering tasks, an alternative but completely equivalent representation is widely used – the frequency decomposition of the signal. Here, arbitrary signals f are interpreted as a superposition of basic functions ψ_k with specialized shapes and frequencies, weighted by coefficients c_k:

    f = \sum_k c_k \psi_k    (A.1)

Formally, these basic functions build up an alternative base defining a complete function space. The signal of interest is transformed into this function space. The alternative representation now consists of a function of frequency components k. This spectral representation contains exactly the same information as the standard signal view, but it offers direct access to the frequencies of the data analyzed, which is convenient for a wide range of applications. Throughout the years a large number of transformations were developed (cf. e.g. [Pou00] for an extensive overview). In the following sections a very powerful technique, which was successfully used for protein signal analysis in the thesis at hand, will be summarized – the Wavelet transformation. Since the Fourier Analysis is the base for a large variety of transformation techniques and basically the foundation for Wavelets, too, a brief overview of the principles of this fundamental technique will be given in section A.1. Throughout the whole chapter, the explanations are limited to one-dimensional signals for simplicity and clarity of argumentation. Nevertheless, they can easily be generalized to N-dimensional signals.

A.1 Fourier Analysis

The most prominent spectral representation of signals is based on the Fourier transformation (FT). Here, signals f are decomposed into basic sine and cosine oscillations, i.e. ψ = cos(ωt) and ψ = sin(ωt), respectively (cf. equation A.1). The Fourier transformation is defined as follows:

    F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i 2\pi\omega t}\, dt,    for continuous signals

    F(k) = \frac{1}{N} \sum_{n=0}^{N-1} f(n)\, e^{-i 2\pi k n / N},    for discrete signals of length N

After transforming signals using the Fourier transformation, the proportions of single frequencies of the signal of interest are easily accessible via F. The basic assumption for the discrete Fourier transformation, which is especially relevant for the computational treatment of natural signals owing to the necessary discretization, is the existence of infinite and periodic signals. Unfortunately, this does not hold for the majority of actual signals. Furthermore, information concerning the time-localization of single frequencies cannot be read off from F(k). So, for signal analysis the Fourier transformation is not always the methodology of choice.

In order to overcome the limitations of the standard Fourier transformation in time-localization, the windowed or Short-Time Fourier transformation (STFT) was developed. Generally, short windows are extracted at designated positions from the original signal by applying appropriate filters (e.g. Hamming- or Gauss-windows). The filtered signals are transformed using the standard procedure as described above (here exemplarily shown for continuous signals only):

    F(\omega, a) = \int_{-\infty}^{\infty} f(t)\, g(t-a)\, e^{-i 2\pi\omega t}\, dt,    (A.2)

with e.g. a Gaussian window g(t) = e^{-k t^2}.

Depending on the size of the windows and the distance between two adjacent signal parts analyzed, the time-localization of single frequencies is mostly satisfying. Since the size of the windows extracted from the signal is fixed, the time-frequency resolution is also fixed. Therefore, frequent changes in signal characteristics, each requiring specialized frequency resolutions for successful detection, cannot be localized. If, for example, a signal is composed of small bursts associated with long quasi-stationary components (as e.g. shown in the outermost left diagram of figure A.3), then each type of component can be analyzed with good time resolution or frequency resolution, but not both [Rio91]. Furthermore, the basic assumption of infinite and periodic signals is still implied. So, although the STFT is applicable to a wide variety of signals, it still has some limitations.
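The discrete Fourier transformation and its windowed variant can be illustrated numerically, e.g. with NumPy's FFT. The normalization by N follows the definition above; the Gaussian window width is an arbitrary illustrative choice.

```python
import numpy as np

N = 256
t = np.arange(N)
f = np.sin(2 * np.pi * 16 * t / N)  # pure oscillation with 16 cycles

# Discrete Fourier transformation, normalized by N as in the text;
# the dominant frequency component is directly readable from |F(k)|.
F = np.fft.fft(f) / N
k_peak = int(np.argmax(np.abs(F[: N // 2])))
print(k_peak)  # -> 16

# Short-Time FT: multiply with a shifted Gaussian window g(t - a)
# before transforming (window parameter k = 0.01 chosen arbitrarily).
a = 128
g = np.exp(-0.01 * (t - a) ** 2)
F_windowed = np.fft.fft(f * g) / N
```

Shifting the window position a over the signal yields the time-localized spectral picture that the plain FT cannot provide.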

A.2 Continuous Wavelet Transformation

In order to overcome the drawbacks and limitations of both the standard and the windowed Fourier transformation, the Wavelet Transformation (WT) was developed. “The Wavelet transformation is a tool that cuts up data or functions or operators into different frequency components, and then studies each component with a resolution matched to its scale.” [Dau92, p. 1] In contrast to the STFT (cf. section A.1) the time-frequency resolution of the WT is not fixed. The time resolution becomes arbitrarily good at high frequencies whereas the frequency resolution becomes arbitrarily good for low frequencies.1 Different from the STFT, here the filter g (see equation A.2) itself is scaled. Figure A.1 compares the time-frequency resolutions of the Fourier as well as of the Wavelet analysis techniques. Due to the flexible scaling of the Mother-Wavelet functions, the time-frequency localization of the WT is superior to all other techniques. The basic version of the filter function is called Mother-Wavelet

1 The Heisenberg uncertainty principle (∆t ∆f ≥ 1/(4π)) remains valid, though.


Figure A.1: Comparison of time-frequency resolutions for different signal analysis techniques (panels from left to right: time-series, Fourier transform, Short-Time Fourier transform, Wavelet transform): For the simplest kind of analysis (time-series, outer left) at every time-step a continuum of frequency components exists, corresponding to no frequency localization. Using the FT (second from left) frequency components can be determined but not located in time at all, which can be done in fixed resolution steps when using the STFT (second from right). The squares represent particular frequency components at distinct time steps. The best time-frequency resolution is given by the WT (outer right) where smaller time-windows are used for higher frequencies (adopted from [Fri01]).

ψ, since all scaled and shifted versions are derived from it and its shape is similar to small waves.2 Throughout the years a large variety of Mother-Wavelets were developed. Figure A.2 gives an overview of the most prominent filters. Formally, the Wavelet functions, which are the basic functions mentioned earlier, used for the decomposition of the original signal, build up a complete function space. The actual shape of the Mother-Wavelet is important only for signal detection tasks. For general signal analysis the differences are negligible. Thus, for clarity the Haar-Wavelets are used in the general explanations of the Wavelet transformation. The continuous formulation of the Wavelet transformation is as follows:

    W(a, b) = \frac{1}{\sqrt{|b|}} \int_{-\infty}^{\infty} f(t)\, \psi\!\left(\frac{t-a}{b}\right) dt = \langle f, \psi_{a,b} \rangle    (A.3)

with

    \int_{-\infty}^{\infty} \psi(t)\, dt = 0

In signal analysis applications using the general Fourier transformation, spectrograms, defined as the square modulus of the transformation, are a very common tool for visualization. In the two-dimensional spectrogram the abscissa represents the frequency components whereas the ordinate stands for their amplitudes. A similar distribution can be defined in the Wavelet case – Wavelet spectrograms or scalograms, likewise defined as the squared modulus of the transformation. In contrast to the Fourier spectrogram, here the energy of the signal is distributed with different resolutions according to the scaling parameter b of equation A.3. Thus, the abscissa encodes the time-localization (parameter a of equation A.3) and the ordinate stands for the frequency resolution (the scaling defined by b). The amplitudes of the frequencies are visualized in the third dimension – the color or grey-value of the entries.

2 In fact, the characteristic change of sign of the filtering function is essential for it to become a Wavelet, which is proven e.g. in [Bän02].

195 A Wavelets

Figure A.2: Most prominent Mother-Wavelet shapes: Daubechies (order 2 and order 9), Coiflet (order 5), and Biorthogonal (order 3.3).

For the STFT similar spectrograms can be defined, too. Figure A.3 illustrates the differences between both visualization techniques. On the outer left-hand side an exemplary signal is visualized including an abrupt change of its frequency characteristics. The diagram in the middle of the figure visualizes the general Fourier spectrum. Obviously the characteristic frequencies of the signal can be figured out easily but they cannot be located. This can be done only in the Wavelet scalogram shown in the diagram at the right-hand side.

Figure A.3: Spectrogram of a Fourier analyzed signal vs. its scalogram visualizing the Continuous Wavelet transformation (panels from left to right: original signal, spectrogram of the Fourier transform, scalogram of the Wavelet transform).

A.3 Discrete Wavelet Transformation

Restricting the parameters a for shifting and b for scaling the Mother-Wavelet ψ (cf. equation A.3) to discrete multiples of some initialization values a0 and b0, respectively, implies the transition from continuous to discrete Wavelet functions. A very common choice for the initialization (especially for the implementation on a computer) is a0 = 1 and b0 = 2, defining a dyadic base. Thus, the discrete Wavelet functions ψ_{a,b} are defined as follows:

    \psi_{a,b}(t) = \frac{1}{\sqrt{2^b}}\, \psi\!\left(\frac{t}{2^b} - a\right)

Figure A.4 illustrates the relationships between a Mother-Wavelet and its derivatives obtained by shifting and stretching it, by means of the Haar-Wavelet. Using dyadic bases of Wavelet functions, very efficient implementations become possible. In the following section, the most common implementation of the Discrete Wavelet Transformation (DWT) is presented.

Figure A.4: Haar Mother-Wavelet and three derivatives obtained by shifting (parameter a) and scaling (parameter b) the base function.

Multiresolution Analysis (MRA)

The popularity of Wavelets for signal analysis applications is mainly due to a very efficient technique developed by Mallat – the Multi-Resolution Analysis (MRA) (cf. e.g. [Mal98]). Generally, a sub-band analysis is performed using the Discrete Wavelet Transformation. Therefore, the pyramid algorithm known from image processing applications [Bur83] is generalized and applied to the signal of interest. The signal is analyzed iteratively using different resolutions at different time-locations. By means of the MRA, which is applicable to all quadratically integrable functions f from L^2(R), the signal-space is divided into nested sub-spaces, so-called scaling-spaces Vi:

    \{0\} \subset \cdots \subset V_{i+2} \subset V_{i+1} \subset V_i \subset V_{i-1} \subset V_{i-2} \subset \cdots \subset L^2(\mathbb{R})

197 A Wavelets

Every sub-space describes the signal in a different resolution, whereby the larger i, the coarser the resolution. For all spaces, so-called scaling-functions exist which, together with their translations, form an orthonormal basis of the respective space. Furthermore, Wavelet-spaces Wi are defined as the orthogonal complement of two adjacent scaling-spaces Vi and Vi−1, respectively. Thus, Vi can be defined as follows:

    V_i = V_{i+1} \oplus W_{i+1} \quad \text{with} \quad V_{i+1} \perp W_{i+1},

where ⊕ denotes the additive combination of the particular Wavelet- and scaling-spaces. During transition from Vi to Vi+1 detail information regarding the analyzed signal is lost. These details are projected onto the Wavelet-space Wi+1. Thus, in order to reconstruct an arbitrary signal f correctly, two signals are necessary: the signal based on the scaling-space Vi+1 and the signal based on the Wavelet-space Wi+1. Similar to the scaling-functions, Wavelet-functions are defined spanning the Wavelet-space. Due to iteration, an arbitrary scaling-space is determined by the direct sum of all Wavelet-spaces with indices larger than i. Consequently, a scaling-space of a given resolution can be described as the sum of the coarsest scaling-space – the approximation or trend VT – and all Wavelet-spaces up to the resolution level selected – the details or differences:

    V_i = V_T \oplus W_T \oplus W_{T-1} \oplus \cdots \oplus W_{i+1}

Figure A.5 illustrates the decomposition of an exemplary signal s. Below the original signal, the Wavelet decomposition using the Daubechies filter (fifth order) up to level five is shown. Here, initially the approximation of the signal is given, which corresponds to the projection of s onto the fifth scaling-space (s ∈ R → a5 ∈ V5), succeeded by the five detail signals which are obtained by continuing the MRA (from top to bottom). The practical relevance of nested sub-spaces spanned by scaling-functions as well as Wavelet-functions becomes clear by linking Discrete Wavelets using dyadic bases to the argumentation given above. Here, for signals of length T only a limited number of different resolutions exists, namely log2 T. If the signal decomposition at the finest level is selected as the original signal itself3, the whole decomposition up to level i can be defined as a recursive filtering operation using analysis filters based on the scaling- and Wavelet-functions. If the DWT is applied completely, i.e. up to the last possible resolution L = log2(T), 2^L − 1 Wavelet-coefficients and one scaling-coefficient will be determined. Independent of the resolution level i actually selected, the original signal f can be reconstructed identically by means of all coefficients. One of the most important properties of the DWT is the compactification of the energy of the signal, which is mainly concentrated on the upper coefficients. Thus, for signal compression or smoothing applications the lower coefficients are simply discarded. It can be shown that the pairs of scaling- and Wavelet-functions define corresponding highpass- and lowpass-filters (for details, which are far beyond the scope of this thesis, cf. e.g. [Bän02, Mal98, Per00a]). Additionally, due to the dyadic base of the DWT, the Wavelet

3 Strictly speaking, this initialization is an erroneous transition from the continuous towards the discrete world of Wavelets – the so-called Wavelet-crime. Since in most applications no better clue exists, it is taken for granted. In [Bän02, p. 73f.] the applicability is justified more theoretically, alleviating the Wavelet-crime.


Decomposition at level 5: s = a5 + d5 + d4 + d3 + d2 + d1.

Figure A.5: Wavelet decomposition (up to level five) of an exemplary signal s using the Daubechies filters (fifth order); a5 denotes the approximation and d1 to d5 the consecutive detail signals.

analysis can be formulated as the alternate application of a filtering as well as a downsampling process. Figure A.6 gives an overview of the resulting filtering scheme. On the left-hand side the filtering scheme of the analysis of an exemplary signal is shown, whereas on the right-hand side the synthesis is visualized, making use of all coefficients determined during analysis. This scheme can be understood as a general sender-transmitter-receiver application.
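One analysis/synthesis step of this filter-and-downsample scheme can be sketched with the Haar filter pair. This is a minimal illustration (practical implementations typically use longer filters such as the Daubechies family): the lowpass branch yields the approximation, the highpass branch the detail, each subsampled by two.

```python
import math


def haar_analysis(x):
    """One filter-and-downsample step: lowpass -> approximation,
    highpass -> detail, both subsampled by two (len(x) must be even)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (x[2 * i] + x[2 * i + 1]) for i in range(len(x) // 2)]
    detail = [s * (x[2 * i] - x[2 * i + 1]) for i in range(len(x) // 2)]
    return approx, detail


def haar_synthesis(approx, detail):
    """Inverse step: upsample and filter, reconstructing the signal
    exactly from approximation and detail coefficients."""
    s = 1.0 / math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([s * (a + d), s * (a - d)])
    return x


x = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 0.0]
a1, d1 = haar_analysis(x)
x_rec = haar_synthesis(a1, d1)
print(max(abs(u - v) for u, v in zip(x, x_rec)) < 1e-12)  # -> True
```

Iterating the analysis step on the approximation coefficients yields the full pyramid decomposition of the Multi-Resolution Analysis; because the filter pair is orthonormal, the signal energy is preserved across the coefficients.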

Further issues

The brief overview given in this chapter captures the basic idea of the Wavelet transformation, necessary to understand its application to remote homology detection using probabilistic models for protein families. Besides this, countless enhancements and specialties exist. The very interesting mathematical foundations behind the theory of Wavelets are explained in numerous monographs. Especially the design of proper filter-pairs is of extreme interest for applications in various fields of research. Furthermore, variants within the filtering scheme were developed. Exemplarily, Wavelet-packets are a very powerful technique, where in addition to the exclusive decomposition of the lowpass-filtered signals, the highpass-filtered parts are also further analyzed. Since further theoretical as well as practical aspects regarding Wavelets are beyond the scope of this thesis, the reader is referred to the specialized literature (e.g. [Rio91, Dau92, Mal98, Per00a, Pou00]) for detailed insights.

Figure A.6: Illustration of combined Wavelet-analysis and -synthesis by means of a sender-transmitter-receiver application: In the upper part the signal-decomposition at the sender is illustrated. The coefficients are transmitted via an arbitrary channel (rectangles at the right-hand side of the sender and the left-hand side of the lower part) to the receiver, which synthesizes the original signal more or less accurately, depending on the channel-based distortions of the coefficients.

B Principal Components Analysis (PCA)

Given an empiric or parametric distribution of N-dimensional data vectors, the general goal of Principal Components Analysis (PCA) is the estimation of a new coordinate system in which correlations between the individual components are minimized. The axes of the new coordinate system are aligned such that the maximum scattering of the underlying data occurs along the first axis; all following axes cover less variance of the data in descending order. The estimation of the PCA is based on an analysis of the scattering of the data, which is characterized by so-called scatter matrices; these are identical to covariance matrices if single distributions are considered. The total scatter matrix $S$ of a sample set $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T$ is defined as:

$$S = \frac{1}{T} \sum_{i=1}^{T} (\vec{x}_i - \bar{\vec{x}})(\vec{x}_i - \bar{\vec{x}})^T, \qquad \bar{\vec{x}} = \frac{1}{T} \sum_{i=1}^{T} \vec{x}_i,$$

where $\bar{\vec{x}}$ denotes the average vector of the sample set. After decorrelation by applying a transformation matrix $\Theta$, the scatter matrix of the transformed sample set is defined as:

$$\tilde{S} = \frac{1}{T} \sum_{i=1}^{T} \Theta(\vec{x}_i - \bar{\vec{x}}) \left[\Theta(\vec{x}_i - \bar{\vec{x}})\right]^T = \Theta S \Theta^T.$$

Thus, in order to decorrelate the data, a transformation $\Theta$ is required which diagonalizes $S$. Since the relative position of the data vectors to each other must not change, only orthonormal transformations can be applied (cf. e.g. [Fin03, p. 141]). It can be shown that the matrix $\Phi$, whose columns consist of the eigenvectors of $S$, fulfills the constraint of orthonormality; its transpose $\Phi^T$ can thus be used for the automatic decorrelation of the data vectors. If the transformation is applied to the data vectors with zero mean

$$\vec{y} = \Phi^T (\vec{x} - \bar{\vec{x}}),$$

then the transformed scatter matrix is obtained as

$$\tilde{S} = \Phi^T S \Phi = \Phi^T \Phi \Lambda \Phi^T \Phi = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_N),$$

where $\Lambda$ represents the eigenvalue matrix of $S$. Thus, in the transformed space the variance of the data along the coordinate axes is given by the eigenvalues.
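The decorrelation step can be sketched numerically via an eigendecomposition of the scatter matrix; the sample data and dimensions below are illustrative:

```python
import numpy as np

# Toy sample set of correlated 3-dimensional vectors
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0.5, 0.0],
                                           [0.0, 1.0, 0.3],
                                           [0.0, 0.0, 0.5]])

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)       # total scatter matrix

# Columns of Phi are orthonormal eigenvectors of the symmetric matrix S
eigvals, Phi = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]              # descending variance
eigvals, Phi = eigvals[order], Phi[:, order]

Y = (X - x_bar) @ Phi                          # y = Phi^T (x - x_bar)
S_tilde = Y.T @ Y / len(Y)                     # transformed scatter matrix

# S_tilde is (numerically) diagonal, with the eigenvalues on its diagonal
assert np.allclose(S_tilde, np.diag(eigvals), atol=1e-8)
```

Note that `np.linalg.eigh` returns the eigenvalues in ascending order, so they are reordered to obtain the variance-descending axes described above.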


Effective dimension reduction, including a reasonable estimation of the approximation error, can be performed very easily when applying PCA to the inspected sample set. This becomes possible due to the previously mentioned fact that the eigenvalues of $S$ denote the variances of the corresponding data vector components. If the transformation matrix $\Phi$ is created such that its columns consist only of the eigenvectors corresponding to the $M$ largest eigenvalues, data vectors $\vec{x} \in \mathbb{R}^N$ are projected onto an $M$-dimensional feature space ($M < N$) covering the majority of the data variance. The reconstruction error $\epsilon$ obtained when using only $M$ instead of $N$ dimensions can be estimated as follows:

$$\epsilon = E\{\|\vec{x} - \vec{x}\,'\|^2\} = E\Bigl\{\Bigl\|\sum_{i=M+1}^{N} y_i \vec{\phi}_i\Bigr\|^2\Bigr\} = \sum_{i=M+1}^{N} \lambda_i, \qquad \vec{x}\,' = \sum_{i=1}^{M} y_i \vec{\phi}_i.$$

By means of this selection of eigenvectors for creating the transformation matrix $\Phi$, the covered variance of the analyzed data is maximized and, at the same time, the reconstruction error resulting from skipping the $N - M$ remaining components is minimized.
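The relation between the skipped eigenvalues and the reconstruction error can be verified numerically; the data set and the choice of M = 2 out of N = 5 dimensions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 5, 2
# Five components with strongly differing variances
X = rng.normal(size=(2000, N)) * np.array([3.0, 2.0, 0.5, 0.2, 0.1])

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)
eigvals, Phi = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, Phi = eigvals[order], Phi[:, order]

Phi_M = Phi[:, :M]                       # keep the M leading eigenvectors
Y = (X - x_bar) @ Phi_M                  # M-dimensional feature vectors
X_rec = Y @ Phi_M.T + x_bar              # back-projection x'

empirical_error = np.mean(np.sum((X - X_rec) ** 2, axis=1))
predicted_error = eigvals[M:].sum()      # sum of the skipped eigenvalues
assert np.isclose(empirical_error, predicted_error)
```

The mean squared reconstruction error over the sample set coincides (up to floating-point error) with the sum of the N − M discarded eigenvalues, as stated by the formula above.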

C Amino Acid Indices

Table C.1 provides information about the 35 biochemical amino-acid properties selected for the signal-based protein sequence encoding. These indices were selected from the compilation of Shuichi Kawashima and Minoru Kanehisa [Kaw00].

Channel   Index Description                                        AAindex Accession Key
0         Average flexibility indices                              BHAR880101
1         Residue volume                                           BIGC670101
2         Transfer free energy to surface                          BULH740101
3         Steric parameter                                         CHAM810101
4         Polarizability parameter                                 CHAM820101
5         A parameter of charge transfer capability                CHAM830107
6         A parameter of charge transfer donor capability          CHAM830108
7         Normalized average hydrophobicity scales                 CIDH920105
8         Size                                                     DAWD720101
9         Relative mutability                                      DAYM780201
10        Solvation free energy                                    EISD860101
11        Molecular weight                                         FASG760101
12        Melting point                                            FASG760102
13        pK-N                                                     FASG760104
14        pK-C                                                     FASG760105
15        Graph shape index                                        FAUJ880101
16        Normalized van der Waals volume                          FAUJ880103
17        Positive charge                                          FAUJ880111
18        Negative charge                                          FAUJ880112
19        pK-a (RCOOH)                                             FAUJ880113
20        Hydrophilicity value                                     HOPT810101
21        Average accessible surface area                          JANJ780101
22        Average number of surrounding residues                   PONP800108
23        Mean polarity                                            RADA880108
24        Side chain hydropathy, corrected for solvation           ROSM880102
25        Bitterness                                               VENT840101
26        Bulkiness                                                ZIMJ680102
27        Isoelectric point                                        ZIMJ680104
28        Composition of amino-acids in extracellular proteins     CEDJ970101
29        Composition of amino-acids in anchored proteins          CEDJ970102
30        Composition of amino-acids in membrane proteins          CEDJ970103
31        Composition of amino-acids in intracellular proteins     CEDJ970104
32        Composition of amino-acids in nuclear proteins           CEDJ970105
33        Amphiphilicity index                                     MITS020101
34        Electron-ion interaction potential values                COSI940101

Table C.1: Biochemical properties selected for sequence representation.
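Conceptually, the signal-based encoding maps each residue of a sequence to its numeric property value, once per channel. A minimal sketch for a single channel is given below; the hydrophilicity numbers are illustrative stand-ins in the style of the HOPT810101 scale, not the exact published values:

```python
# Illustrative hydrophilicity values for a few residues (Hopp-Woods style,
# channel 20 / HOPT810101); not the complete or authoritative scale.
hydrophilicity = {'A': -0.5, 'R': 3.0, 'N': 0.2, 'D': 3.0, 'C': -1.0,
                  'G': 0.0, 'K': 3.0, 'L': -1.8, 'S': 0.3, 'V': -1.5}

def encode(sequence, index):
    """Map a residue string to a numeric signal using one amino-acid index.
    The full 35-channel encoding applies this once per index of Table C.1,
    yielding a 35-dimensional signal per sequence position."""
    return [index[aa] for aa in sequence]

signal = encode('ARNDGKLSV', hydrophilicity)
```

Stacking the outputs of all 35 channels turns a symbolic protein sequence into a multi-channel numerical signal amenable to signal processing techniques such as the Wavelet transformation of appendix A.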

At the website of the authors, most indices of the database are clustered into six coarse categories:


A. Alpha and turn propensities,

B. Beta propensity,

C. Composition,

H. Hydrophobicity,

P. Physiochemical properties, and

O. Other properties.

Figure C.1 gives an overview of the clustering of the indices. The indices used in this thesis which were assigned by the authors to one of the categories are highlighted.

[Figure C.1 – cluster map of the amino acid index database; the accession keys of Table C.1 grouped by category]

Figure C.1: Clustering of the amino acid index database in [Kaw00] as provided by the authors. The amino acid indices used for this thesis are marked explicitly. Note that for unknown reasons the cluster map does not include all indices used. These indices were manually assigned to the existing cluster map and are written in parentheses.

D Detailed Evaluation Results

In chapter 6 the effectiveness of the methods developed for this thesis was presented. Especially for the detection task, however, only summaries were given. To allow a more specific analysis, the detection results are presented in more detail in this chapter.

SCFB Profile HMMs: SCOPSUPER95 66 Detection

In the following, the results of the experimental evaluation of the effectiveness of the new feature representation are given by a detailed presentation of the SCOPSUPER95 66 based ROC curves. For every superfamily contained in the corpus, the appropriate ROC curve is given individually. Following this, tables D.1 and D.2 list, for every superfamily, the characteristic false prediction rates obtained when the corresponding opposite rates are limited to five percent.
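The characteristic values in tables D.1 and D.2 correspond to reading a single operating point off each ROC curve. The sketch below shows this computation on synthetic scores; the score distributions and the simple thresholding rule are illustrative assumptions, not the scoring actually produced by the Profile HMMs:

```python
import numpy as np

def fn_rate_at_fp(scores_pos, scores_neg, max_fp_rate=0.05):
    """False negative rate when the decision threshold is chosen such that
    at most `max_fp_rate` of the negative test sequences are accepted."""
    neg_sorted = np.sort(scores_neg)[::-1]       # descending scores
    k = int(max_fp_rate * len(neg_sorted))       # allowed false positives
    threshold = neg_sorted[k] if k < len(neg_sorted) else -np.inf
    return np.mean(scores_pos <= threshold)      # rejected positives

rng = np.random.default_rng(2)
pos = rng.normal(2.0, 1.0, 500)    # scores of true superfamily members
neg = rng.normal(0.0, 1.0, 8000)   # scores of non-members
fnr = fn_rate_at_fp(pos, neg, 0.05)
```

Swapping the roles of the two score sets yields the complementary column of the tables, the false positive rate at five percent false negatives.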

[Figure D.1 – ROC plot: # false positives vs. # false negatives for the discrete Profile HMM and the semi-continuous feature based Profile HMMs (ML estimation, MAP adaptation, MLLR adaptation), with a magnified view of the working area]

Figure D.1: ROC curve for Superfamily Globin-like (SCOP-Id: a.1.1).



Figure D.2: ROC curve for Superfamily Cytochrome C (SCOP-Id: a.3.1).


Figure D.3: ROC curve for Superfamily EF-hand (SCOP-Id: a.39.1).


Figure D.4: ROC curve for Superfamily "Winged helix" DNA-binding domain (SCOP-Id: a.4.5).


Figure D.5: ROC curve for Superfamily Viral coat and capsid proteins (SCOP-Id: b.10.1).



Figure D.6: ROC curve for Superfamily Immunoglobulin (SCOP-Id: b.1.1).


Figure D.7: ROC curve for Superfamily Concanavalin A-like (SCOP-Id: b.29.1).


Figure D.8: ROC curve for Superfamily Nucleic acid-binding proteins (SCOP-Id: b.40.4).


Figure D.9: ROC curve for Superfamily Trypsin-like serine proteases (SCOP-Id: b.47.1).



Figure D.10: ROC curve for Superfamily Cupredoxins (SCOP-Id: b.6.1).


Figure D.11: ROC curve for Superfamily (Trans)glycosidases (SCOP-Id: c.1.8).


Figure D.12: ROC curve for Superfamily NAD(P)-binding Rossmann-fold domains (SCOP-Id: c.2.1).


Figure D.13: ROC curve for Superfamily FAD/NAD(P)-binding domain (SCOP-Id: c.3.1).



Figure D.14: ROC curve for Superfamily P-loop containing nucleotide triphosphate hydrolase (SCOP-Id: c.37.1).


Figure D.15: ROC curve for Superfamily Thioredoxin-like (SCOP-Id: c.47.1).


Figure D.16: ROC curve for Superfamily Alpha/Beta-Hydrolases (SCOP-Id: c.69.1).

Superfamily       Profile HMM Variant (Adaptation)   false negative pred. [%]   false positive pred. [%]
                                                     for 5 % false positives    for 5 % false negatives
SCOPSUPER95 66    discrete                           26.1                       57.6
complete          feat. based semi-cont. (ML)        17.7                       64.4
                  feat. based semi-cont. (MAP)        7.9                       16.0
                  feat. based semi-cont. (MLLR)       5.1                        5.5
a.1.1             discrete                           16.7                       82.4
                  feat. based semi-cont. (ML)        11.1                       30.0
                  feat. based semi-cont. (MAP)        0.0                        0.0
                  feat. based semi-cont. (MLLR)       0.0                        0.0
a.3.1             discrete                            0.0                        0.0
                  feat. based semi-cont. (ML)         9.1                       44.6
                  feat. based semi-cont. (MAP)        4.5                        2.5
                  feat. based semi-cont. (MLLR)       1.5                        1.4
a.39.1            discrete                            1.4                        0.0
                  feat. based semi-cont. (ML)         2.7                        0.0
                  feat. based semi-cont. (MAP)        1.4                        0.0
                  feat. based semi-cont. (MLLR)       0.0                        0.0
a.4.5             discrete                           51.4                       79.2
                  feat. based semi-cont. (ML)        25.7                       35.3
                  feat. based semi-cont. (MAP)       14.9                       31.4
                  feat. based semi-cont. (MLLR)       4.1                        2.2
b.1.1             discrete                           19.8                       73.9
                  feat. based semi-cont. (ML)         3.9                        2.9
                  feat. based semi-cont. (MAP)        2.3                        0.1
                  feat. based semi-cont. (MLLR)       2.3                        0.7
b.10.1            discrete                           14.6                       53.8
                  feat. based semi-cont. (ML)        22.9                       73.2
                  feat. based semi-cont. (MAP)        6.3                       16.7
                  feat. based semi-cont. (MLLR)       3.1                        3.1

Table D.1: Characteristic values for SCOPSUPER95 66 Detection experiments (I).


Superfamily       Profile HMM Variant (Adaptation)   false negative pred. [%]   false positive pred. [%]
                                                     for 5 % false positives    for 5 % false negatives
b.29.1            discrete                           10.1                       64.5
                  feat. based semi-cont. (ML)        16.5                       73.0
                  feat. based semi-cont. (MAP)        7.6                       21.7
                  feat. based semi-cont. (MLLR)       5.1                        7.4
b.40.4            discrete                           54.9                       88.8
                  feat. based semi-cont. (ML)        26.8                       80.1
                  feat. based semi-cont. (MAP)       23.9                       39.4
                  feat. based semi-cont. (MLLR)       1.4                        2.5
b.47.1            discrete                            0.0                        0.0
                  feat. based semi-cont. (ML)         7.2                       14.0
                  feat. based semi-cont. (MAP)        0.0                        0.0
                  feat. based semi-cont. (MLLR)       3.6                        0.2
b.6.1             discrete                            3.9                        0.2
                  feat. based semi-cont. (ML)         7.9                       41.9
                  feat. based semi-cont. (MAP)        3.9                        0.6
                  feat. based semi-cont. (MLLR)       1.3                        0.4
c.1.8             discrete                           20.0                       71.0
                  feat. based semi-cont. (ML)        21.5                       92.3
                  feat. based semi-cont. (MAP)        1.9                        0.2
                  feat. based semi-cont. (MLLR)       2.9                        1.3
c.2.1             discrete                           43.1                       49.7
                  feat. based semi-cont. (ML)        34.0                       85.8
                  feat. based semi-cont. (MAP)        5.9                        7.2
                  feat. based semi-cont. (MLLR)       5.9                       10.6
c.3.1             discrete                            1.5                        0.9
                  feat. based semi-cont. (ML)         8.8                       50.1
                  feat. based semi-cont. (MAP)        0.0                        0.9
                  feat. based semi-cont. (MLLR)       1.5                        0.8
c.37.1            discrete                            5.8                       13.4
                  feat. based semi-cont. (ML)        19.9                       78.6
                  feat. based semi-cont. (MAP)        4.7                        4.6
                  feat. based semi-cont. (MLLR)       6.3                        6.9
c.47.1            discrete                            7.1                       12.3
                  feat. based semi-cont. (ML)         9.5                       25.8
                  feat. based semi-cont. (MAP)        8.3                       10.5
                  feat. based semi-cont. (MLLR)       4.8                        1.6
c.69.1            discrete                           29.9                       87.5
                  feat. based semi-cont. (ML)        63.6                       99.9
                  feat. based semi-cont. (MAP)       23.4                       19.2
                  feat. based semi-cont. (MLLR)       6.5                       22.0

Table D.2: Characteristic values for SCOPSUPER95 66 Detection experiments (II).

Bibliography

[Ale98] Alexandrov, N. and Go, N. On the number of structural families in the protein universe. In Proc. Int. Conf. on Bioinformatics of Genome Regulation and Structure (BGRS'98). Novosibirsk, Altai mountains, Russia, Aug. 1998.

[Alt90] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. Basic Local Alignment Search Tool. J. Molecular Biology, vol. 215(3):403–410, 1990.

[Alt91] Altschul, S. F. Amino acid substitution matrices from an information theoretic perspective. J. Molecular Biology, vol. 219:555–565, 1991.

[Alt96] Altschul, S. F. and Gish, W. Local alignment statistics. Methods in Enzymology, vol. 266:460–480, 1996.

[Alt97] Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., et al. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, vol. 25(17):3389–3402, 1997.

[Ana00] Anastassiou, D. Frequency-domain analysis of biomolecular sequences. Bioin- formatics, vol. 16(12):1073–1081, 2000.

[Ana01] Anastassiou, D. Genomic signal processing. IEEE Signal Processing Magazine, vol. 18(4), 2001.

[Anf73] Anfinsen, C. B. Principles that govern the folding of protein chains. Science, vol. 181:223–230, 1973.

[Arn96] Arneodo, A., d'Aubenton-Carafa, Y., Bacry, E., Graves, P., Muzy, J., et al. Wavelet based fractal analysis of DNA sequences. Physica D, vol. 96:291–320, 1996.

[Atk02] Atkins, J. F. and Gesteland, R. The 22nd amino acid. Science, vol. 296:1409–1410, 2002.

[Ave44] Avery, O. T., MacLeod, C. M., and McCarty, M. Studies on the chemical nature of the substance inducing transformation of Pneumococcal types. Induction of transformation by a Deoxyribonucleic acid fraction isolated from Pneumococcus type III. Journal of Experimental Medicine, vol. 79:137–158, 1944.

[Bai95] Bailey, T. L. and Elkan, C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning Journal, vol. 21:51–83, 1995.

[Bai98] Bailey, T. L. and Gribskov, M. Combining evidence using p-values: Application to sequence homology searches. Bioinformatics, vol. 14:48–54, 1998.


[Bal94] Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. A. Hidden Markov Models of biological primary sequence information. In Proc. Nat. Academy of Sciences USA – Biochemistry, vol. 91, pages 1059–1063. Feb. 1994.

[Bal95] Baldi, P., Brunak, S., Chauvin, Y., Engelbrecht, J., and Krogh, A. Periodic sequence patterns in human exons. In Proc. Int. Conf. Intelligent Systems for Molecular Biology, pages 30–38. 1995.

[Bal01] Baldi, P. and Brunak, S. Bioinformatics – The Machine Learning Approach. MIT Press Cambridge, Massachusetts; London, England, 2nd edn., 2001.

[Bän02] Bäni, W. Wavelets. Oldenbourg, Munich – Vienna, 2002.

[Bar97] Barrett, C., Hughey, R., and Karplus, K. Scoring Hidden Markov Models. Computer Applications in the Bioscience, vol. 13(2):191–199, 1997.

[Bat00] Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L., et al. The Pfam protein families database. Nucleic Acids Research, vol. 28(1):263–266, 2000.

[Bel57] Bellman, R. E. Dynamic Programming. Princeton University Press, 1957.

[Ber77] Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. J., Brice, M., et al. The Protein Data Bank: A computer–based archival file for macromolecular structures. J. Molecular Biology, vol. 112:535–542, 1977.

[Ber02] Berman, H. M., Battistuz, T., Bhat, T. N., Bluhm, W. F., Bourne, P. E., et al. The Protein Data Bank. Acta Crystallographica, vol. D(58):899–907, 2002. www.pdb.org.

[Bir95] Birge, R. R. Protein-Based Computers. Scientific American, pages 66–71, Mar. 1995.

[Bir01] Birney, E. and Copley, R. Wise2 – Documentation. European Bioinformatics Institute (EMBL-EBI), 2001. www.ebi.ac.uk/Wise2/documentation.html.

[Böc91] Böck, A., Forchhammer, K., Heider, J., Leinfelder, W., Sawers, G., et al. Selenocysteine: The 21st amino acid. Molecular Microbiology, vol. 5(3):515–520, 1991.

[Boe03] Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.-C., Estreicher, A., et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research, vol. 31(1):365–370, 2003.

[Bra98] Branden, C. and Tooze, J. Introduction to Protein Structure. Garland Publishing Inc., 2nd edn., 1998.

[Bre86] Breslauer, K. J., Frank, R., Blöcker, H., and Marky, L. A. Predicting DNA duplex stability from the base sequence. Proc. Nat. Academy of Sciences USA, vol. 83(11):3746–3750, 1986.


[Bre98] Brenner, S. E., Chothia, C., and Hubbard, T. J. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Nat. Academy of Sciences USA, vol. 95:6073–6078, 1998.

[Bro91] Bronstein, I. and Semendjajew, K. Taschenbuch der Mathematik. B.G. Teubner Verlagsgesellschaft, 25th edn., 1991.

[Bro93] Brown, M. P., Hughey, R., Krogh, A., Mian, I. S., Sjölander, K., et al. Using Dirichlet mixture priors to derive Hidden Markov Models for protein families. In Proc. Int. Conf. Intelligent Systems for Molecular Biology, pages 47–55. 1993.

[Bur83] Burt, P. J. and Adelson, E. H. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, vol. COM-31,4:532–540, 1983.

[Bur97] Burge, C. and Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Molecular Biology, pages 78–94, 1997.

[Car88] Carrillo, H. and Lipman, D. The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics, vol. 48(5):1073–1082, Oct. 1988.

[Cho92] Chothia, C. One thousand families for the molecular biologist. Nature, vol. 357:543–544, 1992.

[Chr00] Cristianini, N. and Shawe-Taylor, J. An introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, 2000.

[Chu89] Churchill, G. A. Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology, vol. 51:79–94, 1989.

[Chu92] Churchill, G. A. Hidden Markov Chains and the analysis of genome structure. Computers and Chemistry, vol. 116(2):107–115, 1992.

[Con02] Conte, L. L., Brenner, S. E., Hubbard, T. J., Chothia, C., and Murzin, A. G. SCOP database in 2002: Refinements accommodate structural genomics. Nucleic Acids Research, vol. 30(1):264–267, 2002.

[Cor95] Cortes, C. and Vapnik, V. Support-Vector Networks. Machine Learning, vol. 20(3):273–297, 1995.

[Cos97] Cosic, I. The Resonant Recognition Model of Macromolecular Bioactivity – Theory and Applications. Birkhäuser Verlag, Basel, 1997.

[Cri01] Cristea, P. D. Genetic signals: An emerging concept. In Proc. Int. Workshop on System, Signals and Image Processing IWSSIP 2001, Bucharest, Romania. Jun. 2001.

[Cri03] Cristea, P. D. Large scale features in DNA genomic signals. IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 83:871–888, 2003.


[Dau92] Daubechies, I. Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, 1992.

[Day78] Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. A model of evolutionary change in proteins. In M. O. Dayhoff, ed., Atlas of Protein Sequence and Structure, vol. 5, supplement 3, chapter 22, pages 345–352. The National Biomedical Research Foundation, 1978.

[De 00] De Trad, C. H., Fang, Q., and Cosic, I. The resonant recognition model (RRM) predicts amino acid residues in highly conserved regions of the hormone prolactin (PRL). Biophysical Chemistry, vol. 84:149–157, 2000.

[De 01] De Trad, C. H., Fang, Q., and Cosic, I. An overview of protein sequence comparison using Wavelets. Proc. 2nd Conf. of the Victorian Chapter of the IEEE EMBS, pages 115–119, 2001.

[De 02] De Trad, C. H., Fang, Q., and Cosic, I. Protein sequence comparison based on the Wavelet transform approach. Journal of Protein Engineering, vol. 15(3):193–203, 2002.

[Dem77] Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, vol. 39:1–38, 1977. Series B (methodological).

[Dre00] Drews, J. Drug discovery: A historical perspective. Science, vol. 287:1960–1964, 2000.

[Dur98] Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.

[Edd95a] Eddy, S. R. Multiple alignment using Hidden Markov Models. In Proc. Int. Conf. Intelligent Systems for Molecular Biology, pages 114–120. 1995.

[Edd95b] Eddy, S. R., Mitchison, G., and Durbin, R. Maximum discrimination Hidden Markov Models of sequence consensus. Journal of Computational Biology, vol. 2:9–23, 1995.

[Edd01] Eddy, S. R. HMMER: Profile Hidden Markov Models for biological sequence analysis. http://hmmer.wustl.edu/, 2001.

[Edd04] Eddy, S. R. What is dynamic programming? Nature Biotechnology, vol. 22:909–910, Jul. 2004.

[Fin99] Fink, G. A. Developing HMM-based recognizers with ESMERALDA. In V. Matoušek, P. Mautner, J. Ocelíková, and P. Sojka, eds., Text, Speech and Dialogue, vol. 1692 of Lecture Notes in Artificial Intelligence, pages 229–234. Springer, Berlin Heidelberg, 1999.


[Fin03] Fink, G. A. Mustererkennung mit Markov-Modellen. Teubner Verlag, 2003.

[Fis99] Fischer, A. and Stahl, V. Database and online adaptation for improved speech recognition in car environments. In Proc. Int. Conf. on Acoustics, Speech, and Signal Processing. 1999.

[Fri01] Fritzsche, M. Anwendung von Verfahren der Mustererkennung zur Detektion von Landminen mit Georadaren. Ph.D. thesis, Fakultät für Elektrotechnik der Universität Fridericiana Karlsruhe, 2001.

[Gar99] Garzon, M. H. and Deaton, R. J. Biomolecular computing and programming. IEEE Trans. on Evolutionary Computation, vol. 3(3):236–250, 1999.

[Gau92] Gauvain, J.-L. and Lee, C.-H. MAP estimation of continuous density HMM: Theory and applications. In Proc. DARPA Speech and Natural Language Workshop. 1992.

[Gei96] Geiger, D. and Heckerman, D. A characterization of the Dirichlet distribution through global and local parameter independence. Tech. Rep. MSR-TR-94-16, Microsoft Research Advanced Technology Division, Mar. 1996.

[Ger92] Gersho, A. and Gray, R. Vector Quantization and Signal Compression. Communications and Information Theory. Kluwer Academic Publishers, 1992.

[Gon01] Gonik, L. and Wheelis, M. Genetik in Cartoons. Parey Buchverlag Berlin, 5th edn., 2001.

[Got82] Gotoh, O. An improved algorithm for matching biological sequences. J. Molecu- lar Biology, vol. 162:705–708, 1982.

[Gou01] Gough, J., Karplus, K., Hughey, R., and Chothia, C. Assignment of homology to genome sequences using a library of Hidden Markov Models that represent all proteins of known structure. J. Molecular Biology, vol. 313:903–919, 2001.

[Gri28] Griffith, F. The significance of Pneumococcal types. Journal of Hygiene, vol. 64:129–175, 1928.

[Gri87] Gribskov, M., McLachlan, A. D., and Eisenberg, D. Profile analysis: Detection of distantly related proteins. In Proc. Nat. Academy of Science USA – Biochemistry, vol. 84, pages 4355–4358. Jul. 1987.

[Gru97] Grundy, W. N., Bailey, T. L., Elkan, C. P., and Baker, M. E. Meta-MEME: Motif-based Hidden Markov Models of protein families. Computer Applications in the Bioscience, vol. 13(4):397–406, 1997.

[Hal02] Halperin, I., Ma, B., Wolfson, H., and Nussinov, R. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins: Structure, Function and Genetics, vol. 47:409–443, 2002.


[Hau93] Haussler, D., Krogh, A., Mian, I. S., and Sjölander, K. Protein modeling using Hidden Markov Models: Analysis of Globins. In T. Mudge, V. Milutinovic, and L. Hunter, eds., Proc. Twenty-Sixth Ann. Hawaii Int. Conf. System Sciences, vol. 1, pages 792–802. 1993.

[Hel75] Helck, W., ed. Lexikon der Ägyptologie. Harrassowitz – Wiesbaden, 1975.

[Hen92] Henikoff, S. and Henikoff, J. G. Amino acid substitution matrices from protein blocks. Proc. Nat. Academy of Sciences USA, vol. 89:10915–10919, 1992.

[Hen96a] Henderson, J., Salzberg, S., and Fasman, K. Finding genes in human DNA with a Hidden Markov Model. In Proc. Int. Conf. Intelligent Systems for Molecular Biology. AAAI Press., Jun. 1996.

[Hen96b] Henikoff, S. Scores for sequence searches and alignments. Current Opinion in Structural Biology, vol. 6:353–360, 1996.

[Hen99] Henikoff, S., Henikoff, J. G., and Pietrokovski, S. Blocks+: A non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics, vol. 15(6):471–479, 1999.

[Hen00] Henikoff, J. G., Greene, E. A., Pietrokovski, S., and Henikoff, S. Increased coverage of protein families with the blocks database servers. Nucleic Acids Research, vol. 28:228–230, 2000.

[Her89] Hering, E., Martin, R., and Stohrer, M. Physik für Ingenieure. VDI-Verlag, Düsseldorf, 3rd edn., 1989.

[Hil03] Hillisch, A. and Hilgenfeld, R., eds. Modern Methods of Drug Discovery. Birkhäuser Verlag, Basel – Boston – Berlin, 2003.

[Hog01] Hogenesch, J. B., Ching, K. A., Batalov, S., Su, A. I., Walker, J. R., et al. A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes. Cell, vol. 106(4):413–415, Aug. 2001.

[Hua89] Huang, X. D. and Jack, M. A. Semi-Continuous Hidden Markov Models for speech signals. Computer Speech & Language, vol. 3:239–251, 1989.

[Hua01] Huang, X., Acero, A., and Hon, H.-W. Spoken Language Processing – A Guide to Theory, Algorithm, and System Development. Prentice Hall PTR, 2001.

[Hud99] Hudak, J. and McClure, M. A comparative analysis of computational motif-detection methods. In Proc. Pacific Symposium on Biocomputing, vol. 4, pages 138–149. 1999.

[Hug96] Hughey, R. and Krogh, A. Hidden Markov Models for sequence analysis: Extension and analysis of the basic method. Computer Applications in the Bioscience, vol. 12(2):95–108, 1996.


[IHG04] International Human Genome Sequencing Consortium (IHGSC). Finishing the euchromatic sequence of the human genome. Nature, vol. 431:931–945, Oct. 2004.

[Jaa98] Jaakkola, T., Diekhans, M., and Haussler, D. A discriminant framework for detecting remote protein homologies. Journal of Computational Biology, vol. 7(1,2):95–114, 1998.

[Jaa99] Jaakkola, T., Diekhans, M., and Haussler, D. Using the Fisher kernel method to detect remote protein homologies. In Proc. Int. Conf. Intelligent Systems for Molecular Biology, pages 149–158. 1999.

[Jac77] Jacob, F. Evolution and tinkering. Science, vol. 196:1161–1166, 1977.

[Joh72] Johnson, N. L. and Kotz, S. Distributions in Statistics. John Wiley & Sons, Inc., 1972.

[Jon04] Jones, N. C. and Pevzner, P. A. An Introduction To Bioinformatics Algorithms. MIT Press Inc., 2004.

[Jua90] Juang, B.-H. and Rabiner, L. The segmental K-means algorithm for estimating parameters of Hidden Markov Models. IEEE Trans. on Acoustics, Speech, and Signal Processing, pages 1639–1641, 1990.

[Kar90] Karlin, S. and Altschul, S. F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Nat. Academy of Sciences USA, vol. 87:2264–2268, 1990.

[Kar95] Karplus, K. Evaluating regularizers for estimating distributions of amino acids. In C. Rawlings, D. Clark, R. Altman, L. Hunter, T. Lengauer, and S. Wodak, eds., Proc. Int. Conf. Intelligent Systems for Molecular Biology, pages 188–196. 1995.

[Kar98] Karplus, K., Barrett, C., and Hughey, R. Hidden Markov Models for detecting remote protein homologies. Bioinformatics, vol. 14(10):846–856, 1998.

[Kau03] Kauer, G. and Blöcker, H. Applying signal theory to the analysis of biomolecules. Bioinformatics, vol. 19(16):2016–2021, 2003.

[Kaw00] Kawashima, S. and Kanehisa, M. AAindex: Amino acid index database. Nucleic Acids Research, vol. 28(1):374, 2000.

[Kri04] Krishnan, A., Li, K.-B., and Issac, P. Rapid detection of conserved regions in protein sequences using Wavelets. In Silico Biology, vol. 4, 2004.

[Kro94a] Krogh, A., Brown, M., Mian, I. S., Sjölander, K., and Haussler, D. Hidden Markov Models in computational biology: Applications to protein modeling. J. Molecular Biology, vol. 235:1501–1531, 1994.

[Kro94b] Krogh, A., Mian, I., and Haussler, D. A Hidden Markov Model that finds genes in E. coli. Nucleic Acids Research, vol. 22:4768–4778, 1994.


[Kro97] Krogh, A. Two methods for improving performance of an HMM and their application for gene finding. In Proc. Fifth Int. Conf. Intelligent Systems for Molecular Biology, pages 179–186. 1997.

[Kul96] Kulp, D., Haussler, D., Reese, M. G., and Eeckman, F. H. A generalized Hidden Markov Model for the recognition of human genes in DNA. In Proc. Int. Conf. Intelligent Systems for Molecular Biology. 1996.

[Lan01] Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature, vol. 409:860–921, 2001.

[Las02] Lassmann, T. and Sonnhammer, E. L. Quality assessment of multiple alignment programs – Minireview. Fed. of Europ. Biochem. Societies Letters, (529):126–130, 2002.

[Leg94] Leggetter, C. J. and Woodland, P. C. Speaker adaptation of HMMs using linear regression. Tech. rep., Cambridge University Engineering Department, Jun. 1994.

[Leg95] Leggetter, C. J. and Woodland, P. C. Maximum likelihood linear regression for speaker adaptation of continuous density Hidden Markov Models. Computer Speech & Language, pages 171–185, 1995.

[Lej04] Leja, D. et al. Talking glossary of genetic terms – illustration. National Human Genome Research Institute (NHGRI), 2004. www.genome.gov/Pages/Hyperion/DIR/VIP/Glossary/.

[Les02] Leslie, C., Eskin, E., and Noble, W. S. The spectrum kernel: A string kernel for SVM protein classification. In Proc. Seventh Pacific Symposium on Biocomputing, pages 566–575. 2002.

[Les04] Leslie, C., Eskin, E., Cohen, A., Weston, J., and Noble, W. S. Mismatch string kernels for discriminative protein classification. Bioinformatics, vol. 20(4):467–476, 2004.

[Lev66] Levenshtein, V. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, vol. 10:707–710, 1966.

[Lew94] Lewin, B. Genes V. Oxford University Press, 1994.

[Lia02] Liao, L. and Noble, W. S. Combining pairwise sequence similarity and support vector machines for remote homology detection. In Proc. Sixth Ann. Int. Conf. Computational Molecular Biology, pages 225–232. 2002.

[Lin80] Linde, Y., Buzo, A., and Gray, R. An algorithm for vector quantizer design. IEEE Trans. on Communications, vol. 28(1):84–95, 1980.

[Lin01] Lin, K., May, A. C., and Taylor, W. R. Amino acid substitution matrices from an artificial neural network model. J. Molecular Biology, (8):471–481, 2001.


[Liu04] Liu, X., Fan, K., and Wang, W. The number of protein folds and their distribution over families in nature. Proteins: Structure, Function, and Bioinformatics, vol. 54:491–499, 2004.

[Low76] Lowerre, B. The Harpy Speech Recognition System. Carnegie-Mellon University, 1976.

[Luk98] Lukashin, A. V. and Borodovsky, M. GeneMark.hmm: New solutions for gene finding. Nucleic Acids Research, vol. 26(4):1107–1115, 1998.

[Mac67] MacQueen, J. Some methods for classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, eds., Proc. Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pages 281–296. 1967.

[Mal98] Mallat, S. A Wavelet Tour of Signal Processing. Academic Press, 1998.

[Mam96] Mamitsuka, H. A learning method of Hidden Markov Models for sequence discrimination. Journal of Computational Biology, vol. 3(3):361–373, 1996.

[Man97] Mandell, A. J., Selz, K. A., and Shlesinger, M. F. Wavelet transformation of protein hydrophobicity sequences suggest their membership in structural families. Physica A, vol. 244:254–262, 1997.

[Mer03] Merkl, R. and Waack, S. Bioinformatik Interaktiv – Algorithmen und Praxis. Wiley-VCH, 2003.

[Mor99] Morgenstern, B. DIALIGN 2: Improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, vol. 15(3):211–218, 1999.

[Mou04] Mount, D. W. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, second edn., 2004.

[Mur95] Murzin, A. G., Brenner, S. E., Hubbard, T., and Chothia, C. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Molecular Biology, vol. 247:536–540, 1995.

[Mur02] Murray, K. B., Gorse, D., and Thornton, J. M. Wavelet transforms for the characterization and detection of repeating motifs. J. Molecular Biology, vol. 316:341–363, 2002.

[Nee70] Needleman, S. B. and Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Molecular Biology, vol. 48:443–453, 1970.

[Nie83] Niemann, H. Klassifikation von Mustern. Berlin: Springer, 1983.

[Not00] Notredame, C., Higgins, D. G., and Heringa, J. T-Coffee: A novel method for multiple sequence alignments. J. Molecular Biology, vol. 302:205–217, 2000.


[Opp89] Oppenheim, A. V. and Schafer, R. W. Discrete-Time Signal Processing. Prentice Hall, Englewood Cliffs, New Jersey, 1989.

[Ore97] Orengo, C., Michie, A., Jones, S., Jones, D., Swindells, M., et al. CATH – A hierarchic classification of protein domain structures. Structure, vol. 5(8):1093–1108, 1997.

[Pea88] Pearson, W. R. and Lipman, D. J. Improved tools for biological sequence comparison. In Proc. Nat. Academy of Sciences USA – Biochemistry, vol. 85, pages 2444–2448. Apr. 1988.

[Ped03] Pedersen, J. S. and Hein, J. Gene finding with a Hidden Markov Model of genome structure and evolution. Bioinformatics, vol. 19(2):219–227, 2003.

[Per00a] Percival, D. B. and Walden, A. T. Wavelet Methods for Time Series Analysis. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2000.

[Per00b] Perrone, M. P. and Connell, S. D. K-means clustering for Hidden Markov Models. In L. Schomaker and L. Vuurpijl, eds., Proc. Int. Workshop on Frontiers in Handwriting Recognition, pages 229–238. Sep. 2000.

[Pir01] Pirogova, E. and Cosic, I. Examination of amino acid indexes within the Resonant Recognition Model. In Proc. IEEE Engineering in Medicine and Biology Society Conf. Feb. 2001.

[Plö02] Plötz, T. Genetic Relationships Analysis based on Statistical Sequence Profiles – The GRAS2P project, 2002. www.techfak.uni-bielefeld.de/ags/ai/projects/grassp.

[Plö04] Plötz, T. and Fink, G. A. Feature extraction for improved Profile HMM based biological sequence analysis. In Proc. Int. Conf. on Pattern Recognition. 2004.

[Pou00] Poularikas, A. D., ed. The Transforms and Applications Handbook. CRC Press LLC, 2nd edn., 2000.

[Qiu03] Qiu, J., Liang, R., Zou, X., and Mo, J. Prediction of protein secondary structure based on continuous Wavelet transform. Talanta, vol. 61:285–293, 2003.

[Rab89] Rabiner, L. R. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, vol. 77(2):257–286, 1989.

[Red01] Redford, D. B., ed. The Oxford Encyclopedia of Ancient Egypt. Oxford University Press, 2001.

[Rey97] Reynolds, D. A. Comparison of background normalization methods for text-independent speaker verification. In Proc. European Conf. on Speech Communication and Technology, vol. 1, pages 963–966. Rhodes, Greece, 1997.

[Rio91] Rioul, O. and Vetterli, M. Wavelets and signal processing. IEEE Signal Processing Magazine, pages 14–38, Oct. 1991.


[Sal98] Salzberg, S. L., Searls, D. B., and Kasif, S., eds. Computational Methods in Molecular Biology. Elsevier, 1998.

[San91] Sander, C. and Schneider, R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, vol. 9:56–68, 1991.

[Sch93] Schukat-Talamazzini, E. G., Bielecki, M., Niemann, H., Kuhn, T., and Rieck, S. A non-metrical space search algorithm for fast Gaussian vector quantization. In Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, pages 688–691. Minneapolis, 1993.

[Sch95] Schukat-Talamazzini, E. G. Automatische Spracherkennung. Vieweg Verlag, 1995.

[Sch96] Schlittgen, R. Einführung in die Statistik. Oldenbourg Verlag, 6th edn., 1996.

[Sjö96] Sjölander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., et al. Dirichlet mixtures: A method for improved detection of weak but significant protein sequence homology. Computer Applications in the Biosciences, vol. 12(4):327–345, 1996.

[Smi81] Smith, T. and Waterman, M. Identification of common molecular subsequences. J. Molecular Biology, vol. 147:195–197, 1981.

[Son98] Sonnhammer, E., Eddy, S., Birney, E., Bateman, A., and Durbin, R. Pfam: Multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Research, vol. 26(1):320–322, 1998.

[Spa02] Spang, R., Rehmsmeier, M., and Stoye, J. A novel approach to remote homology detection: Jumping alignments. Journal of Computational Biology, vol. 9(5):747–760, 2002.

[Sto82] Stormo, G. D., Schneider, T. D., Gold, L., and Ehrenfeucht, A. Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Research, vol. 10(9):2997–3011, 1982.

[Str91] Stryer, L. Biochemie. Spektrum Akademischer Verlag, 1991.

[Swa86] Swannell, J., ed. The Little Oxford Dictionary. Oxford University Press, 1986.

[Tho94] Thompson, J. D., Higgins, D. G., and Gibson, T. J. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, vol. 22:4673–4680, 1994.

[Tol01] Tollman, P., Guy, P., Altshuler, J., Flanagan, A., and Steiner, M. A revolution in R&D: How genomics and genetics are transforming the biopharmaceutical industry. Boston Consulting Group, Nov. 2001.


[Vai04] Vaidyanathan, P. Signal processing problems in genomics. Slides of a presentation given at IEEE Int. Symp. on Circuits and Systems, 2004. www.iscas2004.org/VaidyanathanPlenary.pdf.

[Vel72] Veljkovic, V. and Slavic, I. Simple general-model pseudopotential. Physical Review Letters, vol. 29(2):105–108, 1972.

[Ven01] Venter, J. C. et al. The sequence of the human genome. Science, vol. 291:1304–1351, 2001.

[Vit67] Viterbi, A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. on Information Theory, vol. 13:260–269, 1967.

[Wan01] Wang, J., Ma, Q., Shasha, D., and Wu, C. H. New techniques for extracting features from protein sequences. IBM Systems Journal, vol. 40(2):426–441, 2001.

[Wat53] Watson, J. D. and Crick, F. H. C. Molecular structure of nucleic acids. Nature, vol. 171(4356):737f., Apr. 1953.

[Wie03] Wienecke, M. Videobasierte Handschrifterkennung. Ph.D. thesis, Faculty of Technology, Bielefeld University, 2003.

[Wu00] Wu, C. H. and McLarty, J. W. Neural Networks and Genome Informatics. Elsevier, 2000.

[Yak70] Yakowitz, S. J. Unsupervised learning and the identification of finite mixtures. IEEE Trans. on Information Theory, vol. 16(3):330–338, 1970.

[You99] Young, W. T. A working guide to glues. Fine Woodworking Magazine, vol. 134:60–67, 1999.

[Zel97] Zell, A. Simulation neuronaler Netze. Addison-Wesley, Bonn [u.a.], 1997.

[Zha97] Zhang, C.-T. Relations of the numbers of protein sequences, families and folds. Protein Engineering, vol. 10(7):757–761, 1997.
