Improved Computational Methods of Protein Sequence Alignment, Model Selection and Tertiary Structure Prediction

IMPROVED COMPUTATIONAL METHODS OF PROTEIN SEQUENCE ALIGNMENT, MODEL SELECTION AND TERTIARY STRUCTURE PREDICTION A Thesis presented to the Faculty of the Graduate School at the University of Missouri In Partial Fulfillment of the Requirements for the Degree Doctor of Computer Science by XIN DENG Dr. Jianlin Cheng, Thesis Supervisor DECEMBER 2013 c Copyright by Xin Deng 2013 All Rights Reserved The undersigned, appointed by the Dean of the Graduate School, have examined the dissertation entitled: IMPROVED COMPUTATIONAL METHODS OF PROTEIN ALIGNMENT, MODEL SELECTION AND TERTIARY STRUCTURE PREDICION presented by XIN DENG, a candidate for the degree of Doctor of Computer Science and hereby certify that, in their opinion, it is worthy of acceptance. Dr. Jianlin Cheng Dr. Dong Xu Dr. Ye Duan Dr. John C Walker ACKNOWLEDGMENTS I would never have been able to make some breakthroughs in my research or finish my dissertation without the guidance of my committee members, help from friends, and support from my parents. I would like to express my deepest gratitude to my advisor, Dr.Cheng, for his excellent guidance, caring, patience, and providing me with an excellent atmosphere for doing research. He not only gave me his best help to explore my potential and develop me to be a good researcher, but also offered me many valuable opinions in the life philosophy. Particularly, he told us, "we need to earn everything not gain in this world". Thanks to his words, when I look back to my past five-year's PhD study and research, I gained some excited achievements, made my own contributions to the scientific society, and I did not waste my life. I would like to thank Dr. Xu for leading the researches in the whole Computer Science Department and also having given much useful advice in my research. I would also like to thank Dr. Walker for giving me a chance in the cooperated project and leading me to a world of genes involved in abscission in Arabidopsis. Last, sincere thanks go to Dr. Duan, who patiently gave me advice in the qualify exam and was willing to participate in both my comprehensive exam and final defense. I would also like to thank my parents. They were always supporting me and encouraging me with their best wishes. They cheered me up and stood by me through the good times and bad. ii TABLE OF CONTENTS ACKNOWLEDGMENTS . ii LIST OF TABLES . vi LIST OF FIGURES . viii ABSTRACT . x CHAPTER . 1 MSACompro: protein multiple sequence alignment using predicted secondary structure, solvent accessibility, and residue-residue con- tacts . 1 1.1 Introduction . 1 1.2 Method . 3 1.2.1 Construction of pairwise posterior probability matrices based on amino acid sequence, secondary structure and solvent accessibility information . 3 1.2.2 Construction of pairwise distance matrices based on pairwise posterior probabilities and pairwise contact map scores . 7 1.2.3 Construction of guide tree and transformation of posterior probability . 9 1.2.4 Combination of progressive and iterative alignment . 10 1.3 Results and discussion . 11 1.3.1 Evaluation of MSACompro and other tools on the standard benchmarks . 11 1.3.2 A comprehensive study of the effect of predicted structural information on the alignment accuracy . 21 1.4 Conclusion . 30 2 New profile-profile pairwise protein sequence alignment by HMM- HMM comparison . 33 iii 2.1 Introduction . 33 2.2 Method . 34 2.2.1 Discretize profile columns and prefilter . 35 2.2.2 Viterbi alignment combining the structural information . 36 2.2.3 Re-align by maximum accuracy alignment combining the structural information . 38 2.2.4 Trace-back maximum accuracy alignment based on inferred residue coupling information . 41 2.3 Evaluation and Results . 44 2.3.1 Evaluation of HHpacom and other tools on CASP9 data . 44 2.3.2 A comprehensive study of the impact of the new information on the alignment accuracy . 47 2.4 Conclusion . 53 3 Predicting Protein Model Quality from Sequence Alignment by Support Vector Machines . 54 3.1 Abstract . 54 3.1.1 Background . 54 3.1.2 Results . 54 3.1.3 Conclusions . 55 3.2 Background . 55 3.3 Methods . 57 3.4 Results . 62 3.4.1 Evaluation of the pairwise alignment based SVM model quality assessment method . 62 3.4.2 Evaluation of the multiple alignment based SVM model quality assessment method . 63 3.5 Conclusions . 65 iv 4 A Protein Tertiary Structure Prediction Pipeline . 66 4.1 Introduction . 66 4.2 Methods . 67 4.2.1 Overview of the prediction pipeline . 67 4.2.2 Fold recognition using protein structural similarity network . 68 4.2.3 Query-template alignment . 77 4.2.4 Model generation . 77 4.2.5 Model quality assessment . 77 5 Summary and concluding remarks . 79 BIBLIOGRAPHY . 83 VITA...................................... 95 v LIST OF TABLES Table Page 1.1 Table 1.1 . 14 1.2 Table 1.2 . 14 1.3 Table 1.3 . 15 1.4 Table 1.4 . 16 1.5 Table 1.5 . 18 1.6 Table 1.6 . 19 1.7 Table 1.7 . 19 1.8 Table 1.8 . 20 1.9 Table 1.9 . 22 1.10 Table 1.10 . 23 1.11 Table 1.11 . 25 1.12 Table 1.12 . 26 1.13 Table 1.13 . 27 1.14 Table 1.14 . 27 1.15 Table 1.15 . 28 1.16 Table 1.16 . 29 1.17 Table 1.17 . 31 1.18 Table 1.18 . 31 2.1 Table 2.1 . 46 vi 2.2 Table 2.2 . 47 2.3 Table 2.3 . 47 2.4 Table 2.4 . 48 2.5 Table 2.5 . 50 3.1 Table 3.1 . 62 3.2 Table 3.2 . 63 3.3 Table 3.3 . 64 vii LIST OF FIGURES Figure Page 1.1 Figure 1.1 . 17 1.2 Figure 1.2 . 23 1.3 Figure 1.3 . 24 1.4 Figure 1.4 . 25 1.5 Figure 1.5 . 26 1.6 Figure 1.6 . 29 1.7 Figure 1.7 . 30 2.1 Figure 2.1 . 35 2.2 Figure 2.2 . 44 2.3 Figure 2.3 . 49 2.4 Figure 2.4 . 49 2.5 Figure 2.5 . 50 2.6 Figure 2.6 . 51 2.7 Figure 2.7 . 52 3.1 Figure 3.1 . 59 3.2 Figure 3.2 . 61 4.1 Figure 4.1 . 69 4.2 Figure 4.2 . 72 viii 4.3 Figure 4.3 . ..

Improved Computational Methods of Protein Sequence Alignment, Model Selection and Tertiary Structure Prediction

Bioinformatics 1: Lecture 3

BASS: Approximate Search on Large String Databases

Computational Biology Lecture 8: Substitution Matrices Saad Mneimneh

B.Sc. (Hons.) Biotech BIOT 3013 Unit-5 Satarudra P Singh

Parameter Advising for Multiple Sequence Alignment

Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices

Deriving Amino Acid Exchange Matrices (II) and Multiple Sequence Alignment (I) Summarysummary Dayhoff’Sdayhoff’S PAMPAM--Matricesmatrices

Pairwise Sequence Alignment Algorithm by a New Measure Based

Scoring Matrices for Sequence Comparisons

Rotamer-Specific Statistical Potentials for Protein Structure Modeling

Multiple Sequence Alignment

Practical Considerations of Working with Sequencing Data File Types