Feng.Jiao.Thesis.Pdf (1.426Mb)
Total Page:16
File Type:pdf, Size:1020Kb
Probabilistic Graphical Models and Algorithms for Protein Problems by Feng Jiao A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science Waterloo, Ontario, Canada, 2008 c Feng Jiao, 2008 AUTHOR'S DECLARATION I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public. ii Abstract In this thesis I present research in two fields: machine learning and computational biology. First, I develop new machine learning methods for graphical models that can be applied to protein problems. Then I apply graphical model algorithms to protein problems, obtaining improvements in protein structure prediction and protein structure alignment. First, in the machine learning work, I focus on a special kind of graphical model| conditional random fields (CRFs). Here, I present a new semi-supervised training proce- dure for CRFs that can be used to train sequence segmentors and labellers from a com- bination of labeled and unlabeled training data. Such learning algorithms can be applied to protein and gene name entity recognition problems. This work provides one of the first semi-supervised discriminative training methods for structured classification. Second, in my computational biology work, I focus mainly on protein problems. In particular, I first propose a tree decomposition method for solving the protein structure prediction and protein structure alignment problems. In so doing, I reveal why tree decom- position is a good method for many protein problems. Then, I propose a computational framework for detection of similar structures of a target protein with sparse NMR data, which can help to predict protein structure using experimental data. Finally, I propose a new machine learning approach|LS Boost|to solve the protein fold recognition problem, which is one of the key steps in protein structure prediction. After a thorough comparison, the algorithm is proved to be both more accurate and more efficient than traditional z-Score method and other machine learning methods. iii Acknowledgements I would like to express my great gratitudes to those who help me to finish my degree. I am deeply indebted to my two supervisors: Dale Schuurmans and Ming Li. I feel that I was so lucky to meet two great professors and it was my honor to be their students. It was Dale who chose me as his student and led me to the door of research, especially in the fields of machine learning. I am deeply grateful to him for his guidance, bringing me energy and giving me thoughtful insight into every research work I have done. I was so impressed and benefited much for his broad knowledge in every fields in machine learning and even in the filed of bioinformatics. He is one of best researcher in the world. And he is the most kind and nice people I have ever met in the world. Also it was my great luck to have Ming, one of the best researcher in the world, as my co-supervisor in the field of bioinformatics fields. He gave me some great guidance in my research. And my great thanks go to professor Jinbo Xu in Toyota Technological Institute at Chicago and professor Shaojun Wang in Wright State university. Jinbo led me to the fields of computation biology while Shaojun led me to the fields of machine learning. They are both one of the best researcher in their fields and I obtained so much from their solid knowledge and deep insight. Meanwhile we are very good friends and they also help me a lot in my personal life. I would like to thanks for my other collaborators such as Li Cheng, Chihoon Lee, etc. It was great to work with them and published these good research papers. I would like to thanks for the committee members who spent their precious time to review my thesis and give valuable suggestions. Thanks for University of Waterloo and University of Alberta which provided me great academic research places and also provided me the funding for my research. iv Dedication This is dedicated to my Father and Mother. v Contents 1 Introduction 1 1.1 Motivation .................................... 1 1.2 Contributions .................................. 2 1.3 Organization .................................. 3 2 Background on Graphical Models 5 2.1 Representation ................................. 6 2.1.1 Directed Models ............................. 6 2.1.2 Undirected Models ........................... 8 2.2 Important Examples|Conditional Random Fields .............. 9 2.2.1 Definition of CRFs ........................... 9 2.2.2 Parameter Estimation for CRFs .................... 12 2.3 Inference ..................................... 14 2.3.1 Exact Inference ............................. 15 2.3.2 Approximate Inference ......................... 16 2.4 A Tree-Decomposition Approach to the Graph Labeling Problem . 17 2.4.1 Tree Decomposition ........................... 18 2.4.2 Algorithms for Tree Decomposition of a Graph . 18 2.4.3 Tree Decomposition-Based Graph Labeling Algorithms . 21 3 Background on Protein Problems 24 3.1 Introduction to Protein Structure ....................... 24 3.2 Representation of 3-D Protein Structure|Contact Graphs . 26 vi 3.3 Protein Structure Prediction .......................... 27 3.3.1 Main Methods for Structure Prediction . 27 3.3.2 Threading Algorithms ......................... 28 3.3.3 Fold Recognition ............................ 30 3.4 Protein Structure Alignment .......................... 31 4 Semi-Supervised Conditional Random Fields 34 4.1 Introduction ................................... 34 4.2 Semi-supervised CRFs training ........................ 36 4.3 An Efficient Training Procedure ........................ 39 4.4 Time and Space Complexity .......................... 44 4.5 Experimental Results .............................. 45 4.6 Conclusions ................................... 49 4.7 Appendix A: Deriving the gradient of the entropy . 50 5 Tree Decomposition for Protein Structure Prediction 53 5.1 Problem Formulation-Protein Threading ................... 55 5.2 Experimental Results .............................. 56 5.3 Conclusions ................................... 58 6 Tree Decomposition for Protein Structure Alignment 60 6.1 Problem Formulation .............................. 60 6.2 Tree-decomposition for Structure Alignment Algorithm . 61 6.3 Structure Alignment with Rigid-body Transformations . 63 6.4 Experimental Results .............................. 64 6.5 Conclusions ................................... 65 7 Combining Spare NMR Data to Improve Protein Structure Prediction 68 7.1 Introduction ................................... 68 7.2 Problem Formulation .............................. 71 7.3 A Knowledge-Based Scoring Function For RDC Data . 74 7.4 Algorithms For NMR-template Alignment . 76 7.5 Implementation ................................. 81 vii 7.6 Experimental Results .............................. 82 7.7 Conclusions ................................... 83 8 LS Boosting for Protein Fold Recognition 87 8.1 Introduction ................................... 88 8.2 Protein Threading and Fold Recognition ................... 89 8.2.1 The z-score Method for Fold Recognition . 89 8.2.2 Machine Learning Methods for Fold Recognition . 90 8.3 Feature Extraction ............................... 91 8.4 Least-squares Boosting Algorithm for Fold Recognition . 93 8.5 Experimental Results .............................. 95 8.6 Conclusions ................................... 102 9 Conclusions and Future work 111 9.1 Conclusions ................................... 111 9.2 Future work ................................... 112 References 115 viii List of Tables 4.1 Performance of the supervised CRF ...................... 46 4.2 Performance of the semi-supervised CRFs . 47 4.3 Performance of the semi-supervised CRFs . 49 5.1 Performance of six different tree decomposition methods. 54 5.2 Top templates chosen results .......................... 59 6.1 Structure alignments of some proteins ..................... 65 6.2 Structure alignments of some proteins ..................... 66 6.3 Structure alignments of some proteins ..................... 67 7.1 Alignment accuracy of different experimental data . 85 7.2 The best templates found for 18 target proteins. 86 8.1 Sensitivity of the LS Boost method ...................... 97 8.2 Performance comparison of seven machine learning methods. 98 8.3 Comparison of fold recognition performance with zscore and without zscore. 100 8.4 Error Analysis of seven machine learning methods. 100 8.5 Error Analysis of seven machine learning methods. 101 8.6 Error Analysis of seven machine learning methods. 101 8.7 Running time of different machine learning approaches. 103 ix List of Figures 2.1 Example of a directed graph ......................... 6 2.2 Example of an undirected graph ....................... 8 2.3 Example of a conditional random field model on a chain structure . 10 2.4 Example of a graph. .............................. 19 2.5 Example of a tree decomposition with width 3. 19 2.6 A tree decomposition (T;X) of G. ....................... 21 3.1 Four levels of protein structure ........................ 25 3.2 Example of contact graph ........................... 26 4.1 Small Trellis for computing the expectation . 45 4.2 Performance of the supervised and semi-supervised CRFs . 48 5.1 Example of protein structure prediction alignment . 53 5.2 Computational efficiency of TD and LP approach