Computational Protein Structure Prediction Using Deep Learning
COMPUTATIONAL PROTEIN STRUCTURE PREDICTION USING DEEP LEARNING

A Dissertation presented to the Faculty of the Graduate School at the University of Missouri-Columbia

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

by ZHAOYU LI
Professor Yi Shang, Dissertation Advisor
May 2020

The undersigned, appointed by the dean of the Graduate School, have examined the dissertation entitled COMPUTATIONAL PROTEIN STRUCTURE PREDICTION USING DEEP LEARNING presented by Zhaoyu Li, a candidate for the degree of Doctor of Philosophy, and hereby certify that, in their opinion, it is worthy of acceptance.

Dr. Yi Shang
Dr. Dong Xu
Dr. Jianlin Cheng
Dr. Ioan Kosztin

DEDICATION

To my wife Xian, my parents, and my son Ethan.

ACKNOWLEDGEMENTS

I would like to take the opportunity to thank the following people who have supported me through my PhD program. Without their support, this thesis would not have been possible.

First, I would like to thank my advisor, Dr. Yi Shang. He has been my advisor since my master’s study. He opened the door of research for me, showed me instructions, guided me, supported me, encouraged me, and inspired me whenever I needed help. It has been a great pleasure to work with Dr. Shang.

I would also like to thank my committee members, Dr. Dong Xu, Dr. Jianlin Cheng, and Dr. Ioan Kosztin. They have given me constructive suggestions and advice on my dissertation and helped me solve problems in my research. Their support is essential, and I appreciate their time and efforts very much.

I would also like to thank my other lab mates. I am glad to have known them, and I really enjoyed their company along the way. Special thanks to Junlin Wang, Wenbo Wang, Chao Fang, and Son Nguyen, who were always there for me.

Last but not least, I would like to thank my family, Xian Kang, Ethan Li, Fafen Li, Jinxiu Chang, and Min Li. They provided me with great support throughout the whole course of my study, and without their help I would not have been able to pursue my degrees.

This work was supported in part by National Institutes of Health grant R01GM100701. The high-performance computing infrastructure is supported by the National Science Foundation under grant number CNS-1429294.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER 1. INTRODUCTION
  1.1 Protein and Protein Structure Prediction
  1.2 Protein Loop Modeling
  1.3 Protein Contact Map Prediction
  1.4 Protein Contact Map Refinement
  1.5 Contributions
  1.6 Thesis Organization

CHAPTER 2. BACKGROUND AND RELATED WORK
  2.1 Protein Distance Map and Multidimensional Scaling
  2.2 Protein Loop Modeling
    2.2.1 Ab-Initio Methods
    2.2.2 Template-Based Methods
  2.3 Protein Contact Map Prediction
    2.3.1 Statistical Methods
    2.3.2 Co-evolution of Protein Residues
    2.3.3 Direct Coupling Analysis (DCA)
    2.3.4 Machine Learning Methods
    2.3.5 Evaluation Metrics
  2.4 Predicted Contact Map Refinement
    2.4.1 Existing Methods
    2.4.2 Evaluation Metrics
  2.5 Deep Neural Networks
    2.5.1 Convolutional Neural Network
    2.5.2 Fully Convolutional Network (FCN)
    2.5.3 Residual Neural Network (ResNet)
    2.5.4 Generative Adversarial Network (GAN)

CHAPTER 3. MUFOLD-LM: PROTEIN LOOP MODELING USING GENERATIVE ADVERSARIAL NETWORK
  3.1 Motivations
  3.2 Problem Formulation
  3.3 MUFOLD-LM System Architecture
  3.4 Deep Neural Network Structure
    3.4.1 Generator Network
    3.4.2 Adversarial Discriminator Network
    3.4.3 Implementation and Training
  3.5 Evaluation
    3.5.1 Benchmark Dataset
    3.5.2 Results
  3.6 Conclusion

CHAPTER 4. MUFOLD-CONTACT: PROTEIN RESIDUE-RESIDUE CONTACT MAP PREDICTION
  4.1 Motivations
  4.2 Problem Formulation
  4.3 Input Features
    4.3.1 Dataset
    4.3.2 Features Introduction
  4.4 Deep Neural Network Structure
    4.4.1 Predicting Distance Directly
    4.4.2 Predicting Contact Map Directly
    4.4.3 Two-Stage Multi-branch Network
  4.5 Evaluation Results
  4.6 Application: Contact Guided Modeling
  4.7 Summary

CHAPTER 5. TPCREF: PREDICTED CONTACT MAP REFINEMENT
  5.1 Motivations
  5.2 Problem Formulation
  5.3 Overall Architecture
  5.4 Refinement Process