A Two-Pronged Approach to Improve Distant Homology
A TWO-PRONGED APPROACH TO IMPROVE DISTANT HOMOLOGY DETECTION

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the Graduate School of The Ohio State University

By
Marianne Lee, MBA

* * * * *

The Ohio State University
2009

Dissertation Committee:
Professor Ralf Bundschuh, Adviser
Professor Umit Catalyurek
Professor Charles Daniels

Approved by
----------------
Adviser
Biophysics Graduate Program

ABSTRACT

With the tremendous growth in biological information, bioinformatics has become a powerful approach to aid in assigning the functional role of proteins. By establishing an ancestral relationship, or homology, to a well-understood protein, the function of a previously uncharacterized protein can be inferred. The most common method to detect homology between proteins is sequence alignment, for which BLAST and PSI-BLAST are the most popular tools. The challenge is to find as many true positives as possible, and to distinguish these true positives from false positives when sequence similarity falls into the twilight zone (<25%), as is commonly observed for distantly related sequences.

A two-pronged approach is presented to address this challenge in distant homology detection. In the proposed LESTAT algorithm, conserved structural features are incorporated into an iterative profile-based sequence alignment method. This gives LESTAT the ability to find more true positives than PSI-BLAST in seven test case studies. In the proposed SimpleIsBeautiful (SIB) algorithm, a mathematical model and a novel model validation approach are used to improve PSI-BLAST's ability to discriminate between true and false positives without sacrificing its computational efficiency. These additional features result in improved performance in distinguishing true from false positives compared with the existing PSI-BLAST approach.
A web server that runs the SIB algorithm, SIB-BLAST, was launched in December 2008 at http://sib-blast.osc.edu.

One alternative application of homology prediction is to use that information to predict protein-protein interactions. As a first step toward exploring such questions, an algorithm was developed that attempts to predict the interacting partners of a hetero-oligomer from a homo-oligomer, using a structure-based sequence alignment strategy in conjunction with correlation analysis of amino acid pairs. The prediction algorithm was applied to the human Rh proteins and the SSoPCNA proteins. The results reveal that interacting residues in a homo-oligomer do undergo mutation, presumably under evolutionary pressure, when trying to complex with another protein molecule to form a hetero-oligomer.

For Michael and my family

ACKNOWLEDGMENTS

My venture into science from finance has been both exciting and eventful. At times it was frustrating, especially when I hit a roadblock. Nevertheless, the end result is rewarding, and I would not trade the past years and experience for anything.

There are many people I would like to thank for their support and encouragement: my mom and dad, my sister and brothers, my graduate committee members (Dr. Bundschuh, Dr. Clanton, Dr. Catalyurek, and Dr. Daniels), my lab mates (Xin, Patrick, Manoj, and Sridevi), and the people who have helped me in times of need. I would like to take this opportunity to thank them all.

I would like to thank specifically a few people who have had a deep influence on my PhD career. First and foremost is my mentor and advisor, Dr. Bundschuh. I would not have achieved my scientific accomplishments without his advice and guidance. His immense patience and calming manner in correcting my errors and explaining fundamental theories have kept me motivated.
I especially thank him for believing in me, first by taking a chance on a novice in science and computer programming, and then by constantly challenging me to reach my potential. I am fortunate to have an advisor who upholds high scientific standards and integrity.

I must thank Dr. Thomas Clanton, who was instrumental in my joining the OSU Biophysics program. His lectures on ethics and his constant emphasis on being an honest and responsible scientist have greatly influenced my professional and personal development.

Last but not least is my husband, Michael. He has been my constant companion and cheerleader throughout my PhD career. Even when I doubt myself, his unwavering faith in me keeps me going.

VITA

1989              B.S. Business Administration, California State University, Los Angeles, CA
1993              MBA Finance, University of Southern California, Los Angeles, CA
2002 – Present    Graduate Teaching and Research Associate, The Ohio State University

PUBLICATIONS

Research Publication

1. Lee, MM., Chan, MK., Bundschuh, R., "SIB-BLAST: a web server for improved delineation of true and false positives in PSI-BLAST searches." Nucleic Acids Research, 2009, May 8.
2. Lee, MM., Isaza, CE., White, JD., Chen, RY., Liang, G., He, H., Chan, SI., Chan, MK., "Insight into the substrate length restriction of M32 carboxypeptidases: characterization of two distinct subfamilies." Proteins: Structure, Function, and Bioinformatics, 2009, in press.
3. Fekner, T., Li, X., Lee, MM., Chan, MK., "A Pyrrolysine Analog for Protein Click Chemistry." Angew Chem Int Ed Engl, 2009, 48:1633-5.
4. Lee, MM., Jiang, R., Krzycki, JA., Chan, MK., "Structure of the Desulfitobacterium hafniense pyrrolysyl-tRNA synthetase." Biochem Biophys Res Commun., 2008, 374:470-4.
5. Lee, MM., Chan, MK., Bundschuh, R., "Simple is beautiful: a straightforward approach to improving the delineation of true and false positives from a PSI-BLAST search." Bioinformatics, 2008, Apr 10.
6. Lee, MM., Bundschuh, R., Chan, MK., "Distant Homology Detection Using a LEngth and STructure-based Sequence Alignment Tool (LESTAT)." Proteins: Structure, Function, and Bioinformatics, 2007, 71:1409-1419.

FIELDS OF STUDY

Major Field: Biophysics Graduate Program

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters

1. Introduction
   1.1. Protein Structure and Function
   1.2. Homology Detection
   1.3. Sequence Alignment
   1.4. Dynamic Programming Algorithm
   1.5. BLAST - a Heuristic Approach
2. Distant Homology Detection Using a Length and Structure-based Sequence Alignment Tool (LESTAT)
   2.1. Experimental Section
   2.2. Results and Discussion
   2.3. Conclusion
3. Simple Is Beautiful: a Straightforward Approach to Improve the Delineation of True and False Positives in PSI-BLAST Searches
   3.1. Methods
   3.2. Results
   3.3. SIB-BLAST Web Server
   3.4. Discussion
   3.5. Conclusion
4. Predicting Interacting Partners of Hetero-oligomers
   4.1. Methods
   4.2. Results and Discussion
5. Conclusion
LIST OF TABLES

Table 3.1 ROC100 values characterizing retrieval performance for a "gold standard" yeast database
Table 4.1 Top ten predictions of the possible subunit combinations forming the human Rh complex using different multiple sequence alignment models

LIST OF FIGURES

Fig. 1.1 Proteins are comprised of a specific sequence of amino acids
Fig. 1.2 The different levels of protein structure
Fig. 1.3 Venn diagram depicting the distribution of recognized folds in the three superkingdoms of life and the number of recognized folds in each
Fig. 1.4 Which individuals belong to the Simpson family?
Fig. 1.5 Protein homology based on structure
Fig. 1.6 Example of a sequence alignment
Fig. 1.7 Challenge in homology detection
Fig. 1.8 Coverage versus Error plot
Fig. 1.9 Example of a filled dynamic programming matrix
Fig. 2.1 The LESTAT algorithm
Fig. 2.2 Performance comparison of LESTAT versus PSI-BLAST for the initial test set
Fig. 2.3 Performance comparison of LESTAT versus PSI-BLAST for the independent validation set assessed at superfamily level
Fig. 3.1 Performance comparison between rounds two and five of PSI-BLAST, and our new method for combining different rounds
Fig. 3.2 Performance comparison between PSI-BLAST, SAM-T2K, and our proposed method
Fig. 3.3 Snapshot of the SIB-BLAST front page
Fig. 3.4 Snapshot of the SIB-BLAST result page
Fig. 4.1 Schematic overview of the prediction algorithm for oligomeric assemblies
Fig. 4.2 Graphical representation of the interacting subunits and the corresponding