Computational Methods for Predicting Protein-Protein Interactions and Binding Sites
Western University
Scholarship@Western
Electronic Thesis and Dissertation Repository
8-24-2020 11:30 AM

Computational Methods for Predicting Protein-protein Interactions and Binding Sites
Yiwei Li, The University of Western Ontario

Supervisor: Ilie, Lucian, The University of Western Ontario
A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Computer Science
© Yiwei Li 2020

Follow this and additional works at: https://ir.lib.uwo.ca/etd
Part of the Bioinformatics Commons, and the Other Computer Sciences Commons

Recommended Citation
Li, Yiwei, "Computational Methods for Predicting Protein-protein Interactions and Binding Sites" (2020). Electronic Thesis and Dissertation Repository. 7182.
https://ir.lib.uwo.ca/etd/7182

This Dissertation/Thesis is brought to you for free and open access by Scholarship@Western. It has been accepted for inclusion in Electronic Thesis and Dissertation Repository by an authorized administrator of Scholarship@Western. For more information, please contact [email protected].

Abstract
Proteins are essential to organisms and participate in virtually every process within cells. Quite often, they keep the cells functioning by interacting with other proteins. This process is called protein-protein interaction (PPI). The amino acid residues that bond during protein-protein interactions are called PPI binding sites. Identifying PPIs and PPI binding sites are fundamental problems in systems biology. Experimental methods for solving these two problems are slow and expensive. Therefore, great efforts are being made towards increasing the performance of computational methods. We present DELPHI, a deep-learning-based program for PPI site prediction, and SPRINT, an algorithm-based program for PPI prediction. Both programs have been compared to the state-of-the-art programs on several datasets. Both DELPHI and SPRINT are more accurate than the competing methods.
SPRINT is also orders of magnitude faster while using very little memory. The dataset and source code for both DELPHI and SPRINT are publicly available at github.com/lucian-ilie and www.csd.uwo.ca/~ilie/software.html.

Keywords: Bioinformatics, SPRINT, DELPHI, Protein-protein interaction, deep learning, Protein-protein interaction prediction, Protein-protein interaction binding sites prediction

Lay Summary
Proteins are essential to organisms and participate in virtually every process within cells. Quite often, they keep the cells functioning by interacting with other proteins. This process is called protein-protein interaction (PPI). The amino acid residues that bond during protein-protein interactions are called PPI binding sites. Identifying PPIs and PPI binding sites are fundamental problems in systems biology. Experimental methods for solving these two problems are slow and expensive. Therefore, great efforts are being made towards increasing the performance of computational methods. We present two computational methods: DELPHI, for PPI site prediction, and SPRINT, for PPI prediction. Both programs surpass the state-of-the-art programs, and they are freely available at www.csd.uwo.ca/~ilie/software.html.

Acknowledgements
First and foremost, my sincere gratitude goes to my supervisor, Dr. Lucian Ilie. He has taught me not only computer science but, more importantly, the right way of doing things in general. My growth throughout my master's and PhD studies would not have been possible without his inspiring lectures, hands-on instruction and manuscript editing, countless video discussions, and moral support.

I would also like to express my heartfelt appreciation to my parents. To me, leaving home and pursuing a new life is exciting, adventurous, and rewarding, but to them it also means not being able to spend holidays with their only kid.
I have gradually come to understand their feelings over these years, and I would like to say: thank you, and I love you, Dad and Mom.

I would also like to thank my fiancee Karen Qi and her parents. Studying and working at the same time is way harder than I initially thought, but you made it easier. Toronto feels like home because of you.

I thank my lab mates, Shaofeng Jiang, Fang Han, Qin Dong, Zhewei Liang, Nilesh Khiste, Yi Liu, Weiping Sun, Ehsan Haghshenas, and Mike Molnar, in Middlesex College 222, Computer Science Department, Western University. I will certainly remember the great hikes, BBQs, trips, and gatherings we had together as colleagues and friends.

My appreciation also goes to Huawei Toronto Research Center. I am honored to be on such a great team with great talents. The engineering skills I learned at Huawei benefit me a lot in my research projects.

To all my teammates on the Table Tennis Varsity Team, Western University: table tennis is an important part of me, and one of the reasons is the amazing people I met, like you.

Contents
Abstract
Lay Summary
Acknowledgements
List of Figures
List of Tables
1 Introduction
  1.1 DNA
  1.2 Protein
  1.3 Protein-protein Interaction Prediction
  1.4 Protein-protein Interaction Binding Sites Prediction
  1.5 Thesis Overview
2 DELPHI
  2.1 Background
    2.1.1 Deep Learning in Bioinformatics
    2.1.2 Basic Notions and Definitions
      Deep Neural Networks
      Training
      Inference
      Convolutional Neural Networks
      Recurrent Neural Networks
      Ensemble Networks
      Dropout Layers
      Training, Validation, and Testing Dataset
      Data Augmentations and Sampling
    2.1.3 Previous Methods
      PIPE-sites
      DLPred
      DeepPPISP
      SCRIBER
  2.2 Methods
    2.2.1 Training Data Preparation
      Raw Training Data
      Similarities Eliminations
      Data Split
    2.2.2 Features
    2.2.3 Model Architecture
      Architecture Overview
      Many-to-one Structure
      Architecture of the CNN Network
      Architecture of the RNN Network
      Architecture of the Ensemble Network
    2.2.4 Parameter/Hyper-parameter Tuning
    2.2.5 Implementation
      Environment Configuration
      Class Weights
      Data Shuffling
    2.2.6 The DELPHI Web Server
      The Architecture of the Web Server
      Front End Server Configuration
      Back End Server Configuration
      Communications between the Front and Back End Servers
      Pre-computing PSSMs
      Job Scheduling
  2.3 Results
    2.3.1 Testing Datasets
    2.3.2 Evaluation Scheme
    2.3.3 Performance Comparison on Dset 448 and Dset 355
    2.3.4 Performance Comparison on Dset 186, Dset 164, and Dset 72
    2.3.5 Ablation Study
      Feature Evaluation
      The Evaluation of the Model Architecture and the Novel Features
    2.3.6 Evolutionary Conservation
    2.3.7 Accuracy of PBR Prediction
    2.3.8 Human Proteome Prediction
    2.3.9 Availability
  2.4 Conclusion
3 SPRINT
  3.1 Background
    3.1.1 Similarity Search
      BLAST Seeds
      Spaced Seeds
      Multiple Spaced Seeds
      Substitution Matrix
      Interactome Prediction
    3.1.2 Previous Methods
      PIPE
      Martin's Program
      Shen's Program
      Guo's Program
      Ding's Program
  3.2 Methods
    3.2.1 Basic Idea
    3.2.2 Detecting Similarities
    3.2.3 Predicting Interactions
      Post-processing Similarities
      Scoring Function
    3.2.4 Implementations
      Tuneable Parameters
      Pseudocode
      System Configuration
  3.3 Results
    3.3.1 Datasets Classification
    3.3.2 Datasets
    3.3.3 Competing Methods
    3.3.4 Comparative Analysis on Park & Marcotte's Datasets
    3.3.5 Comparative Analysis on Seven Human Datasets
    3.3.6 Comparative Analysis on Human Interactome Prediction
    3.3.7 Availability
  3.4 Conclusion
4 Conclusion and Future Research
  4.1 Common Deep Learning Practises in Bioinformatics
    4.1.1 Data Preparation
      Comparative Analysis
      Improving Results
  4.2 Future Research
Bibliography
Curriculum Vitae

List of Figures
1.1 Illustration of transcription and translation
1.2 The primary structure of a protein
1.3 The primary, secondary, tertiary, and quaternary structure of proteins
1.4 A FASTA file example
1.5 Protein complex
1.6 A human protein-protein interaction network