Simultaneous Identification of Specifically Interacting Paralogs and Interprotein Contacts by Direct Coupling Analysis

Thomas Gueudré (a,1), Carlo Baldassi (a,b,1), Marco Zamparo (a), Martin Weigt (c,2), and Andrea Pagnani (a,b,2)

(a) Department of Applied Science and Technology, Politecnico di Torino, 10129 Torino, Italy; (b) Human Genetics Foundation, Molecular Biotechnology Center, 10126 Torino, Italy; and (c) Sorbonne Universités, UPMC Université Paris 06, CNRS, Biologie Computationnelle et Quantitative, Institut de Biologie Paris Seine, 75005 Paris, France

Edited by Barry Honig, Howard Hughes Medical Institute, Columbia University, New York, NY, and approved September 16, 2016 (received for review May 12, 2016)

Understanding protein–protein interactions is central to our understanding of almost all complex biological processes. Computational tools exploiting rapidly growing genomic databases to characterize protein–protein interactions are urgently needed. Such methods should connect multiple scales from evolutionary conserved interactions between families of homologous proteins, over the identification of specifically interacting proteins in the case of multiple paralogs inside a species, down to the prediction of residues being in physical contact across interaction interfaces. Statistical inference methods detecting residue–residue coevolution have recently triggered considerable progress in using sequence data for quaternary protein structure prediction; they require, however, large joint alignments of homologous protein pairs known to interact. The generation of such alignments is a complex computational task on its own; application of coevolutionary modeling has, in turn, been restricted to proteins without paralogs, or to bacterial systems with the corresponding coding genes being colocalized in operons. Here we show that the direct coupling analysis of residue coevolution can be extended to connect the different scales, and simultaneously to match interacting paralogs, to identify interprotein residue–residue contacts and to discriminate interacting from noninteracting families in a multiprotein system. Our results extend the potential applications of coevolutionary analysis far beyond cases treatable so far.

coevolution | protein–protein interaction networks | paralog matching | statistical inference | direct coupling analysis

Significance

Most biological processes rely on specific interactions between proteins, but the experimental characterization of protein–protein interactions is a labor-intensive task of frequently uncertain outcome. Computational methods based on exponentially growing genomic databases are urgently needed. It has recently been shown that coevolutionary methods are able to detect correlated mutations between residues in different proteins, which are in contact across the interaction interface, thus enabling the structure prediction of protein complexes. Here we show that the applicability of coevolutionary methods is much broader, connecting multiple scales relevant in protein–protein interaction: the residue scale of interprotein contacts, the protein scale of specific interactions between paralogous proteins, and the evolutionary scale of conserved interactions between homologous protein families.

Author contributions: T.G., C.B., M.W., and A.P. designed research; T.G., C.B., M.Z., M.W., and A.P. performed research; T.G. and C.B. analyzed data; and M.W. and A.P. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Data deposition: Julia package for paralog matching is available at https://github.com/Mirmu/ParalogMatching.jl.
(1) T.G. and C.B. contributed equally to this work.
(2) To whom correspondence may be addressed. Email: [email protected] or martin.[email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1607570113/-/DCSupplemental.
PNAS | October 25, 2016 | vol. 113 | no. 43 | 12186–12191 | www.pnas.org/cgi/doi/10.1073/pnas.1607570113

Almost all biological processes depend on interacting proteins. Understanding protein–protein interactions is therefore key to our understanding of complex biological systems. In this context, at least two questions are of interest: First, the question “who with whom,” i.e., which proteins interact; this concerns the networks connecting specific proteins inside one organism, but also, in the context of this article, the evolutionary perspective of protein–protein interactions, which are conserved across different species. Their coevolution is at the basis of many modern computational techniques for characterizing protein–protein interactions. The second question is the question “how” proteins interact with each other, in particular, which residues are involved in the interaction interfaces, and which residues are in contact across the interfaces. Such knowledge may provide important mechanistic insight into questions related to interaction specificity or competitive interaction with partially shared interfaces.

The experimental identification of protein–protein interactions is an arduous task (for reviews, cf. refs. 1 and 2): High-throughput techniques that aim to identify protein–protein interactions in vivo or in vitro are well documented and include large-scale yeast two-hybrid assays and protein affinity mass spectrometry assays. Such large-scale efforts have revealed useful information but are hampered by high false positive and false negative error rates. Structural approaches based on protein cocrystallization are intrinsically low-throughput and of uncertain outcome due to the unphysiological treatment needed for protein purification, enrichment, and crystallization. It is therefore tempting to use the exponentially increasing genomic databases to design in silico techniques for identifying protein–protein interactions (cf. refs. 3 and 4). Prominent techniques, to date, include the search for colocalization of genes on the genome (e.g., operons in bacteria) (5, 6), the Rosetta stone method (domains fused to a single protein in some genome are expected to interact in other genomes) (7, 8), and also coevolutionary techniques like phylogenetic profiling (correlated presence or absence of interacting proteins in genomes) (9) or similarities between phylogenetic trees of groups of orthologous proteins (compare the mirrortree method) (10, 11). Despite the success of all these methods, their sensitivity is limited due to the use of relatively coarse global criteria (genomic location, phylogenetic distance) instead of full amino acid sequences.

The availability of thousands of sequenced genomes (12), thanks to next-generation sequencing techniques, enables the application of much finer-scale statistical modeling approaches, which take into account the full sequence (13). In this context, direct coupling analysis (DCA) (14) was developed to detect direct interprotein coevolution and, in turn, interprotein residue–residue contacts between bacterial signal transduction proteins, to help to assemble protein complexes (15, 16) and shed light on interaction specificity (17, 18). The applicability of DCA and related coevolutionary approaches (19, 20) to protein–protein interactions far beyond the signaling system has been recently established (21, 22). However, these methods require a large joint multiple sequence alignment (MSA) of at least about 1,000 amino acid sequence pairs to work accurately. Each line of this MSA concatenates a pair of interacting proteins. So far, the application of coevolutionary methods remains therefore restricted to those cases where such joint alignments could be constructed easily: (i) Each species has only a single copy of the family, i.e., no paralogs exist. Matching of interacting proteins can be achieved by uniqueness in the genome. (ii) Even if paralogs exist, genes of interacting proteins are frequently colocalized on the genome and can therefore be matched by chromosomal vicinity. This finding is true, in particular, in the case of bacteria; functionally related proteins are frequently coded in operons and consequently cotranscribed.

Colocalization is used extensively in the construction of joint MSA for covariance analysis (22–25). However, the case of multiple paralogs with noncolocalized genes has remained out of reach for coevolutionary

[Figure 1, three panels. (A) family 1, family 2, injective matching. (B) sort by increasing matching entropy; generate seed matching; calculate DCA model; match new species; output matched MSA. (C) generate 2k random matchings; refine matchings by gradient ascent; merge pairs of matchings; output matched MSA.]

Fig. 1. Paralog matching procedures. (A) The considered injective strategy to match paralogs. For each species (depicted by different colors), each paralog from the species with the lower paralog number is matched to a distinct sequence in the other species (injection). (B) The pipeline of the PPM algorithm. Species are sorted by increasing matching entropy (a measure of the computational complexity of the matching). Starting from a seed matching (generated, in our case, by restricting the MSA to all genomes having a single sequence in both families), the algorithm calculates the DCA model, uses it to add and match a new species, and iterates these two steps until all species are matched. (C) The IPM pipeline; 2k random matchings are generated, and each one is independently refined using hill climbing of the likelihood. After re-

Inter-Residue, Inter-Protein and Inter-Family Coevolution: Bridging the Scales

Codon-Level Information Improves Predictions of Inter-Residue Contacts in Proteins by Correlated Mutation Analysis

Assessing the Utility of Residue-Residue Contact Information in a Sequence and Structure Rich Era

Ensembling Multiple Raw Coevolutionary Features with Deep Residual Neural Networks for Contact‐Map Prediction in CASP13

Direct Information Reweighted by Contact Templates: Improved RNA Contact Prediction by Combining Structural Features

DIRECT: RNA Contact Predictions by Integrating Structural Patterns

Understanding and Improving Statistical Models of Protein Sequences, November 2018

Optimal Alignment of Coevolutionary Models for Protein Sequences

Assessing the Accuracy of Direct-Coupling Analysis for RNA Contact Prediction

arXiv:2105.01428v1 [q-bio.PE] 4 May 2021

And Contact-Based Protein Structure Prediction

Towards a Genome-Scale Coevolutionary Analysis Giancarlo Croce