Structure-Based Algorithms for Protein-Protein Interaction Prediction

Structure-based algorithms for protein-protein interaction prediction by Raghavendra Hosur Submitted to the Department of Materials Science and Engineering in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY June 2012 © Massachusetts Institute of Technology 2012. All rights reserved. Author................................................................ Department of Materials Science and Engineering May 17, 2012 Certified by. Bonnie Berger Professor of Applied Mathematics Thesis Supervisor Accepted by . Prof. Gerbrand Ceder Chairman, Department Committee on Graduate Students ii Structure-based algorithms for protein-protein interaction prediction by Raghavendra Hosur Submitted to the Department of Materials Science and Engineering on May 17, 2012, in partial fulfillment of the requirements for the degree of Doctor of Philosophy Abstract Protein-protein interactions (PPIs) play a central role in all biological processes. Akin to the complete sequencing of genomes, complete descriptions of interactomes is a fundamental step towards a deeper understanding of biological processes, and has a vast potential to impact systems biology, genomics, molecular biology and thera- peutics. PPIs are critical in maintenance of cellular integrity, metabolism, transcrip- tion/translation, and cell-cell communication. This thesis develops new methods that significantly advance our efforts at structure- based approaches to predict PPIs and boost confidence in emerging high-throughput (HTP) data. The aims of this thesis are, 1) to utilize physicochemical properties of protein interfaces to better predict the putative interacting regions and increase coverage of PPI prediction, 2) increase confidence in HTP datasets by identifying likely experimental errors, and 3) provide residue-level information that gives us in- sights into structure-function relationships in PPIs. Taken together, these methods will vastly expand our understanding of macromolecular networks. In this thesis, I introduce two computational approaches for structure-based protein- protein interaction prediction: iWRAP and Coev2Net. iWRAP is an interface threading approach that utilizes biophysical properties specific to protein interfaces to improve PPI prediction. Unlike previous structure-based approaches that use single structures to make predictions, iWRAP first builds profiles that characterize the hy- drophobic, electrostatic and structural properties specific to protein interfaces from multiple interface alignments. Compatibility with these profiles is used to predict the putative interface region between the two proteins. In addition to improved interface prediction, iWRAP provides better accuracy and close to 50% increase in coverage on genome-scale PPI prediction tasks. As an application, we effectively combine iWRAP with genomic data to identify novel cancer related genes involved in chromatin remod- eling, nucleosome organization and ribonuclear complex assembly – processes known to be critical in cancer. Coev2Net addresses some of the limitations of iWRAP, and provides techniques iii to increase coverage and accuracy even further. Unlike earlier sequence and structure profiles, Coev2Net explicitly models long-distance correlations at protein interfaces. By formulating interface co-evolution as a high-dimensional sampling problem, we enrich sequence/structure profiles with artificial interacting homologus sequences for families which do not have known multiple interacting homologs. We build a spanning-tree based graphical model induced by the simulated sequences as our interface profile. Cross-validation results indicate that this approach is as good as previous methods at PPI prediction. We show that Coev2Net’s predictions correlate with experimental observations and experimentally validate some of the high-confidence predictions. Furthermore, we demonstrate how analysis of the predicted interfaces together with human genomic variation data can help us understand the role of these mutations in disease and normal cells. Thesis Supervisor: Bonnie Berger Title: Professor of Applied Mathematics iv Publications Some ideas and figures have appeared previously in the following publications: Park, D., Singh, R., Xu, J., Hosur, R. and Berger, B. Struct2Net: a webser- vice to predict protein-protein interactions using a structure-based approach, Nucleic Acids Research, 38(W508-15), Jul 2010. Hosur, R., Xu, J., Bienkowska, J. and Berger, B. iWRAP: an interface threading approach with application to cancer-related protein-protein interactions, Journal of Molecular Biology, 405(5):1295-1310. Feb 2011. The article and image were featured on the cover Hosur, R., Peng, J., Arunachalam, V., Stelzl, U., Xu, J., Perrimon, N., Bi- enkowska, J. and Berger, B. A computational framework for boosting confidence in high-throughput protein-protein interaction datasets, submitted Hosur, R., Singh, R. and Berger, B. Sparse estimation for structural variability, Algorithms for Molecular Biology, 6:12. Jun 2011. Selected for fast-track publication from WABI 2010 Bryan, A., Starner, J., Hosur, R., Clark, P. and Berger, B. Structure-based prediction reveals capping motifs that inhibit β-helix aggregation, Proceedings of the National Academy of Sciences, 108(27):11099-11104, Jul 2011. Daniels, N., Hosur, R., Cowen, L. and Berger, B. SMURFLite: combining sim- plified Markov random fields with simulated evolution improves remote homology detection for beta-structural proteins into the twilight zone, Bioinformatics, 2012, v doi: 10.1093/bioinformatics/bts110. vi Acknowledgments I would like to take this opportunity to express my heartfelt appreciation and grat- itude to many who have supported me throughout my graduate studies at MIT. First, I am indebted to my advisor, Prof Bonnie Berger, for her support, advice and mentorship. None of this would have been possible without her constant guidance, motivation and encouragement, even during the tough times of my PhD. Her never- ending enthusiasm for research, positive attitude and belief in me have helped me evolve both academically and personally. It is something that I will always remember and cherish. I would like to especially thank, Jadwiga Bienkowska, for her support, help and encouragement during my thesis. Her knowledge, combined with her enthusiasm in helping answer all the technical challenges that we faced during the course of this thesis helped me learn a lot about the field. I also appreciate her taking time out from her busy schedule to meet us frequently to discuss about my thesis and provide invaluable suggestions. I would like to thank my thesis committee, Prof Collin Stultz, Prof Samuel Allen and Prof Polina Anikeeva for their feedback and suggestions on my thesis. I want to thank the current as well as past members of the Berger lab. My discussions and brain-storming sessions with Jian over the last year have helped me gain from his vast knowledge. Discussions with Rohit, Vinu, Luke, Jason, Michael, Jerome and Charlie have also helped me think about problems from a broader perspective. vii I also want to thank George, Po-Ru, Mark, Hadar and Patrick for comments and suggestions during group meetings that have helped me improve as a researcher. I want to thank Patrice Macaluso, who always welcomed me to the lab with a smile and some home-made treats. I want to thank all my friends here at MIT – Aniruddh, Ahmed, Andrea, Vikrant, Vaibhav, Vivek, Kashi, Harshad, Pari, Chaitanya, Oshani, Sahil, Kedia, Karthik, Murali, Varun, Neeraj who have made the stay a fun and exciting experience. I will always cherish the “latt”, “dudo” and “muddy” sessions with the gang. I want to especially thank Manas, Navin and Angad for their constant support and friendship. Finally, I want to thank my parents who have made immense sacrifices so that me and my sisters can be where we are today. Their unwavering love and belief in me makes me want to be a better person every day. This thesis is as much for them as it is for me. viii Contents 1 Introduction 1 1.1 Proteins . 2 1.1.1 Protein structure determination . 5 1.2 Protein-protein interactions . 7 1.2.1 Types of PPIs . 8 1.2.2 Structural features . 8 1.2.3 Physico-chemical features . 9 1.2.4 Evolutionary features . 13 1.3 Experimental methods for PPI detection . 14 1.3.1 Low throughput screens . 14 1.3.2 High throughput screens . 14 1.3.3 Limitations of experimental techniques . 17 1.4 Computational methods for PPI prediction . 18 1.4.1 Indirect methods . 18 1.4.2 Direct methods . 19 1.4.3 Data integration methods . 20 1.5 Medical impact . 22 1.6 Organization of the thesis . 23 2 Struct2Net: structure-based approach to PPI prediction 25 ix 2.1 Background . 25 2.2 Methods overview . 27 2.3 Evaluation . 30 2.4 Conclusion . 31 3 iWRAP: an interface threading approach for PPI prediction 35 3.1 Introduction . 35 3.2 Results . 38 3.2.1 Overview of the threading algorithm . 38 3.2.2 Interface validation . 41 3.2.3 PPI Prediction: yeast genome . 50 3.2.4 iWRAP predicts novel cancer-related interactions . 53 3.3 Materials and Methods . 57 3.3.1 Stage 1: Template construction . 57 3.3.2 Stage 2: Aligning query sequences to templates . 58 3.3.3 Stage 3: Interface scoring . 59 3.3.4 Stage 4: PPI prediction . 61 3.3.5 Training and test sets . 63 3.4 Discussion . 64 4 Coev2Net: a computational framework for boosting confidence in HTP PPI datasets 67 4.1 Background . 67 4.2 Results . 69 4.2.1 The Coev2Net framework . 69 4.2.2 Benchmarking Coev2Net . 71 4.2.3 MAPK interactome validation . 73 4.2.4 Experimental validation of predictions . 74 x 4.2.5 Abundance of missense SNPs at predicted interfaces . 76 4.2.6 Novel potential cross-talk regulatory mechanism . 78 4.3 Methods . 79 4.3.1 Simulating interface co-evolution . 81 4.4 Discussion . 84 5 Conclusions 87 A Appendix:iWRAP 93 A.1 Evaluation of alignments . 93 A.2 Methods . 95 A.2.1 Templates . 95 A.2.2 Multiple Interface Alignment . 96 A.2.3 Genomic Predictions: S.cerevisiae . 96 B Appendix:Coev2Net 101 B.1 Proof of equivalence of simulated co-evolution and high-dimensional sampling .

Structure-Based Algorithms for Protein-Protein Interaction Prediction

A University of Sussex Phd Thesis Available Online Via

Molecular and Physiological Basis for Hair Loss in Near Naked Hairless and Oak Ridge Rhino-Like Mouse Models: Tracking the Role of the Hairless Gene

Location Analysis of Estrogen Receptor Target Promoters Reveals That

Essential Genes and Their Role in Autism Spectrum Disorder

The Genome of Schmidtea Mediterranea and the Evolution Of

Table S2.Up Or Down Regulated Genes in Tcof1 Knockdown Neuroblastoma N1E-115 Cells Involved in Differentbiological Process Anal

The Genetics of Bipolar Disorder

Novel Potential ALL Low-Risk Markers Revealed by Gene

A High-Throughput Approach to Uncover Novel Roles of APOBEC2, a Functional Orphan of the AID/APOBEC Family

Disruption of the Anaphase-Promoting Complex Confers Resistance to TTK Inhibitors in Triple-Negative Breast Cancer

Mrna Editing, Processing and Quality Control in Caenorhabditis Elegans

Genome-Wide Gene Expression Profiling of the Angelman Syndrome