Structure-Based Prediction of Protein-Protein Interactions on A
Total Page:16
File Type:pdf, Size:1020Kb
LETTER doi:10.1038/nature11503 Structure-based prediction of protein–protein interactions on a genome-wide scale Qiangfeng Cliff Zhang1,2,3*, Donald Petrey1,2,3*, Lei Deng2,3,4, Li Qiang5,YuShi6, Chan Aye Thu2, Brygida Bisikirska3, Celine Lefebvre3,7, Domenico Accili5, Tony Hunter6, Tom Maniatis2, Andrea Califano2,3,7,8 & Barry Honig1,2,3 The genome-wide identification of pairs of interacting proteins is an their surface with similar arrangements of secondary structure elements important step in the elucidation of cell regulatory mechanisms1,2. to bind to other proteins16–18, suggesting the possibility of considerably Much of our present knowledge derives from high-throughput tech- expanding the number of putative PPIs that can be identified. It is niques such as the yeast two-hybrid assay and affinity purification3, likely that further expansion can be achieved if interactions involving as well as from manual curation of experiments on individual sys- unstructured regions of proteins are taken into account, but these are tems4. A variety of computational approaches based, for example, on not considered in this work. sequence homology, gene co-expression and phylogenetic profiles, Our approach to the prediction of PPIs is embodied in an algorithm have also been developed for the genome-wide inference of protein– we have named PrePPI (predicting protein–protein interactions), protein interactions (PPIs)5,6. Yet comparative studies suggest that which combines structural and non-structural interaction clues using the development of accurate and complete repertoires of PPIs is Bayesian statistics (see Fig. 1 and Methods for details). The structural still in its early stages7–9. Here we show that three-dimensional component of PrePPI involves a number of steps. Briefly, given a pair structural information can be used to predict PPIs with an accuracy of query proteins (QA and QB), we first use sequence alignment to and coverage that are superior to predictions based on non-structural identify structural representatives (MA and MB) that correspond to evidence. Moreover, an algorithm, termed PrePPI, which combines either their experimentally determined structures or to homology mod- structural information with other functional clues, is comparable in els. We then use structural alignment to find both close and remote accuracy to high-throughput experiments, yielding over 30,000 high- structural neighbours (NAi and NBj) of MA and MB (an average of confidence interactions for yeast and over 300,000 for human. ,1,500 neighbours are found for each structure). Whenever two (for Experimental tests of a number of predictions demonstrate the ability example, NA1 and NB3) of the over 2 million pairs of neighbours of MA of the PrePPI algorithm to identify unexpected PPIs of considerable and MB form a complex reported in the PDB, this defines a template for biological interest. The surprising effectiveness of three-dimensional modelling the interaction of QA and QB. Models of the complex are structural information can be attributed to the use of homology created by superimposing the representative structures on their corres- models combined with the exploitation of both close and remote ponding structural neighbours in the template (that is, MA on NA1 and geometric relationships between proteins. MB on NB3). This procedure produces about 550 million ‘interaction Until now, structural information has had relatively little impact in models’ for about 2.4 million PPIs involving about 3,900 yeast proteins, constructing protein–protein interactomes, primarily because there is and about 12 billion models for about 36 million PPIs involving about a marked difference between the number of proteins with known 13,000 human proteins. Note that an interaction model is based on sequences and those with an experimentally known structure. For structure-based sequence alignments of query proteins to their indi- example, as of early 2010, the Protein Data Bank (PDB) provided vidual templates (Supplementary Fig. 1) and that we do not construct a structures for ,600 of the total complement of ,6,500 yeast proteins three-dimensional model of each complex because the scoring of so (,10%), whereas the structural coverage of protein–protein complexes many individual complexes would be prohibitively time consuming is even more sparse, with only about 300 structures available out of the using standard energy functions (for example, as used in docking19). approximately 75,000 PPIs (,0.5%) recorded in publically available Once an interaction model has been created, it is evaluated using a databases. However, ,3,600 additional yeast proteins have homology combination of five empirical scores that measure properties derived models in either the ModBase10 or SkyBase11 databases. Moreover, there from alignments of the individual monomers to their templates were about 37,000 protein–protein complexes derived from multiple (Supplementary Fig. 1). The first score, SIM, depends on the structural organisms in the PDB and Protein Quaternary Structure12 (PQS) data- similarity between models of the two query proteins (that is, MA and bases, which might be used as ‘templates’ to model PPIs. Clearly, if MB) and those in the template complex (that is, NA1 and NB3). The structure is to be useful on a large scale, it is essential that modelling next two scores determine whether the interface in the template com- of individual proteins and of complexes be exploited. plex actually exists in the model. They are calculated as SIZ, the num- A number of studies have used structurally characterized complexes ber, and COV, the fraction, of interacting residue pairs in the template as templates to construct models of the complexes that might be (for example, NA1–NB3) that align to some pair of residues in the formed between proteins that have been classified as having sequence model (MA–MB). The final two scores reflect whether the residues that and/or structural relationships to the proteins in the template13–15.We appear in the model interface have properties consistent with those that searched more broadly for templates, using geometric relationships mediate known PPIs (for example, residue type, evolutionary conser- between groups of secondary structure elements as revealed by struc- vation, orstatistical propensity tobe inprotein–proteininterfaces). This tural alignment, independently of how they are classified. It has been information is obtained from three publically available servers that demonstrated that even distantly related proteins often use regions of predict interfacial residues based on the sequence and structure of the 1Howard Hughes Medical Institute, Columbia University, New York, New York 10032, USA. 2Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA. 3Center for Computational Biology and Bioinformatics, Columbia Initiative in Systems Biology, Columbia University, New York, New York 10032, USA. 4Department of Computer Science and Technology, Tongji University, Shanghai 201804, China. 5Naomi Berrie Diabetes Center, Department of Medicine, College of Physicians & Surgeons of Columbia University, New York, New York 10032, USA. 6Molecular and Cell Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, California 92037, USA. 7Institute of Cancer Genetics, Columbia University, New York, New York 10032, USA. 8Department of Biomedical Informatics, Columbia University, New York, New York 10032, USA. *These authors contributed equally to this work. 556 | NATURE | VOL 490 | 25 OCTOBER 2012 ©2013 Macmillan Publishers Limited. All rights reserved LETTER RESEARCH PDB structures/ Structural Template Interaction Structural-based homology models neighbours complexes models score from PDB QA MA NA1 NA2 NA3 NB MB Sequence Structural 3 Structural Model MA SM similarity similarity NA1 superposition evaluation ? MB NB Bayesian 1 NB2 NB3 PrePPI classification score QB E E E Non-structural evidence 1 2 n Co-expression Functional Evolutionary similarity similarity Figure 1 | Predicting protein–protein interactions using PrePPI. Given a MA and MB, on their corresponding structural neighbours, NA1 and NB3.We pair of query proteins that potentially interact (QA, QB), representative assign five empirical-structure-based scores to each interaction model structures for the individual subunits (MA, MB) are taken from the PDB, where (Supplementary Fig. 1) and then calculate a likelihood for each model to available, or from homology model databases. For each subunit we find both represent a true interaction by combining these scores using a Bayesian close and remote structural neighbours. A ‘template’ for the interaction exists network (Supplementary Fig. 2) trained on the HC and the N interaction whenever a PDB or PQS structure contains a pair of interacting chains (for reference sets. We finally combine the structure-derived score (SM) with non- example, NA1–NB3) that are structural neighbours of MA and MB, structural evidence associated with the query proteins (for example, co- respectively. A model is constructed by superposing the individual subunits, expression, functional similarity) using a naive Bayesian classifier. individual subunits of the model20–22. These scores are calculated as OS, low FPRs, SM by itself outperforms even the naive Bayesian classifiers which is identical to SIZ but with the additional requirement that both that combine all non-structure-based clues (NS). Looking specifically residues in an interacting