Simultaneous identification of specifically interacting paralogs and interprotein contacts by direct coupling analysis

Thomas Gueudré^{a,1}, Carlo Baldassi^{a,b,1}, Marco Zamparo^{a}, Martin Weigt^{c,2}, and Andrea Pagnani^{a,b,2}

^{a}Department of Applied Science and Technology, Politecnico di Torino, 10129 Torino, Italy; ^{b}Human Genetics Foundation, Molecular Biotechnology Center, 10126 Torino, Italy; and ^{c}Sorbonne Universités, UPMC Université Paris 06, CNRS, Biologie Computationnelle et Quantitative, Institut de Biologie Paris Seine, 75005 Paris, France

Edited by Barry Honig, Howard Hughes Medical Institute, Columbia University, New York, NY, and approved September 16, 2016 (received for review May 12, 2016)

Understanding protein−protein interactions is central to our understanding of almost all complex biological processes. Computational tools exploiting rapidly growing genomic databases to characterize protein−protein interactions are urgently needed. Such methods should connect multiple scales from evolutionary conserved interactions between families of homologous proteins, over the identification of specifically interacting proteins in the case of multiple paralogs inside a species, down to the prediction of residues being in physical contact across interaction interfaces. Statistical inference methods detecting residue−residue coevolution have recently triggered considerable progress in using sequence data for quaternary protein structure prediction; they require, however, large joint alignments of homologous protein pairs known to interact. The generation of such alignments is a complex computational task on its own; application of coevolutionary modeling has, in turn, been restricted to proteins without paralogs, or to bacterial systems with the corresponding coding genes being colocalized in operons. Here we show that the direct coupling analysis of residue coevolution can be extended to connect the different scales, and simultaneously to match interacting paralogs, to identify interprotein residue−residue contacts and to discriminate interacting from noninteracting families in a multiprotein system. Our results extend the potential applications of coevolutionary analysis far beyond cases treatable so far.

coevolution | protein−protein interaction networks | paralog matching | statistical inference | direct coupling analysis

Significance

Most biological processes rely on specific interactions between proteins, but the experimental characterization of protein−protein interactions is a labor-intensive task of frequently uncertain outcome. Computational methods based on exponentially growing genomic databases are urgently needed. It has recently been shown that coevolutionary methods are able to detect correlated mutations between residues in different proteins, which are in contact across the interaction interface, thus enabling the structure prediction of protein complexes. Here we show that the applicability of coevolutionary methods is much broader, connecting multiple scales relevant in protein−protein interaction: the residue scale of interprotein contacts, the protein scale of specific interactions between paralogous proteins, and the evolutionary scale of conserved interactions between homologous protein families.

Author contributions: T.G., C.B., M.W., and A.P. designed research; T.G., C.B., M.Z., M.W., and A.P. performed research; T.G. and C.B. analyzed data; and M.W. and A.P. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Data deposition: Julia package for paralog matching is available at https://github.com/Mirmu/ParalogMatching.jl.
^{1}T.G. and C.B. contributed equally to this work.
^{2}To whom correspondence may be addressed. Email: [email protected] or martin. [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1607570113/-/DCSupplemental.

Almost all biological processes depend on interacting proteins. Understanding protein−protein interactions is therefore key to our understanding of complex biological systems. In this context, at least two questions are of interest: First, the question “who with whom,” i.e., which proteins interact; this concerns the networks connecting specific proteins inside one organism, but also, in the context of this article, the evolutionary perspective of protein−protein interactions, which are conserved across different species. Their coevolution is at the basis of many modern computational techniques for characterizing protein−protein interactions. The second question is the question “how” proteins interact with each other, in particular, which residues are involved in the interaction interfaces, and which residues are in contact across the interfaces. Such knowledge may provide important mechanistic insight into questions related to interaction specificity or competitive interaction with partially shared interfaces.

The experimental identification of protein−protein interactions is an arduous task (for reviews, cf. refs. 1 and 2): High-throughput techniques that aim to identify protein−protein interactions in vivo or in vitro are well documented and include large-scale yeast two-hybrid assays and protein affinity mass spectrometry assays. Such large-scale efforts have revealed useful information but are hampered by high false positive and false negative error rates. Structural approaches based on protein cocrystallization are intrinsically low-throughput and of uncertain outcome due to the unphysiological treatment needed for protein purification, enrichment, and crystallization. It is therefore tempting to use the exponentially increasing genomic databases to design in silico techniques for identifying protein−protein interactions (cf. refs. 3 and 4). Prominent techniques, to date, include the search for colocalization of genes on the genome (e.g., operons in bacteria) (5, 6), the Rosetta stone method (domains fused to a single protein in some genome are expected to interact in other genomes) (7, 8), and also coevolutionary techniques like phylogenetic profiling (correlated presence or absence of interacting proteins in genomes) (9) or similarities between phylogenetic trees of groups of orthologous proteins (compare the mirrortree method) (10, 11). Despite the success of all these methods, their sensitivity is limited due to the use of relatively coarse global criteria (genomic location, phylogenetic distance) instead of full sequences.

The availability of thousands of sequenced genomes (12), thanks to next-generation sequencing techniques, enables the application of much finer-scale statistical modeling approaches, which take into account the full sequence (13). In this context, direct coupling analysis (DCA) (14) was developed to detect direct interprotein coevolution and, in turn, interprotein residue−residue

12186–12191 | PNAS | October 25, 2016 | vol. 113 | no. 43 | www.pnas.org/cgi/doi/10.1073/pnas.1607570113
BIOPHYSICS AND COMPUTATIONAL BIOLOGY

contacts between bacterial signal transduction proteins, to help to assemble protein complexes (15, 16) and shed light on interaction specificity (17, 18). The applicability of DCA and related coevolutionary approaches (19, 20) to protein−protein interactions far beyond the signaling system has been recently established (21, 22).

However, these methods require a large joint multiple sequence alignment (MSA) of at least about 1,000 amino acid sequence pairs to work accurately. Each line of this MSA concatenates a pair of interacting proteins. So far, the application of coevolutionary methods remains therefore restricted to those cases where such joint alignments could be constructed easily: (i) Each species has only a single copy of the family, i.e., no paralogs exist. Matching of interacting proteins can be achieved by uniqueness in the genome. (ii) Even if paralogs exist, genes of interacting proteins are frequently colocalized on the genome and can therefore be matched by chromosomal vicinity. This finding is true, in particular, in the case of bacteria; functionally related proteins are frequently coded in operons and consequently cotranscribed.

Colocalization is used extensively in the construction of joint MSA for analysis (22–25). However, the case of multiple paralogs with noncolocalized genes has remained out of reach for coevolutionary analysis, despite its enormous relevance: Out of the 4,499 Pfam-29 (26) protein families with more than 500 sequences, 3,221 have, on average, more than two paralogs per species, and 1,378 families even more than five paralogs. Another observation underlines the importance of addressing generally localized genes: Out of 3,643 protein−protein interactions reported for Escherichia coli in the IntAct Molecular Interaction database (27), only 1,341 (36.8%) concern intraoperon interactions.

Here we suggest an approach, based on a simultaneous construction of the joint MSA and detection of interprotein coevolution, to solve the problem of matching paralogs. The method is based on the idea that the correct matching of interacting paralogs maximizes the interprotein coevolutionary signal. The corresponding optimization problem turns out to be extraordinarily hard to solve exactly. We therefore propose two approximate strategies: The first one is computationally very efficient and of sufficient accuracy for subsequent contact prediction. If interaction partner prediction is the central task, a slower but more accurate iterative scheme can be used. The validity of the approach is demonstrated in the cases of bacterial two-component signal transduction and the protein−protein interaction network between the proteins of the tryptophan biosynthesis pathway (Trp pathway). Our findings open the field to broad applications to protein interactions beyond single-copy or colocalized protein-coding genes, and help to bridge the multiple scales of interprotein coevolution.

Results

An Efficient Approach to Paralog Matching in Interacting Protein Families.

Paralog matching by maximizing the interfamily covariation. In this paper, we show that DCA may help to solve the aforementioned paralog problem by simultaneously matching paralogs and determining interprotein coevolutionary scores. We argue that the best matching is actually the one maximizing the interprotein covariation; empirical evidence for the correctness of this idea will be provided later in this section.

We use the injective matching strategy illustrated in Fig. 1A (see SI Appendix for mathematical details); it starts from the individual MSAs of two protein families, denoted as $X^1$ and $X^2$. Only sequences belonging to the same species are matched. For each single species, all proteins from the family of lower paralog number are matched to pairwise different proteins in the other family. Matched sequence pairs are concatenated. In this article, we consider neither the sparse case, where only part of the sequences are matched, nor cases of promiscuous interaction, where one protein should be matched to several others.

[Fig. 1 panels: (A) injective matching between family 1 and family 2; (B) PPM flowchart: sort species by increasing matching entropy, generate seed matching, calculate DCA model, match new species, output matched MSA; (C) IPM flowchart: generate $2^k$ random matchings, refine matchings by gradient ascent, merge pairs of matchings, output matched MSA.]

Fig. 1. Paralog matching procedures. (A) The considered injective strategy to match paralogs. For each species (depicted by different colors), each paralog from the family with the lower paralog number is matched to a distinct sequence in the other family (injection). (B) The pipeline of the PPM algorithm. Species are sorted by increasing matching entropy (a measure of the computational complexity of the matching). Starting from a seed matching (generated, in our case, by restricting the MSA to all genomes having a single sequence in both families), the algorithm calculates the DCA model, uses it to add and match a new species, and iterates these two steps until all species are matched. (C) The IPM pipeline; $2^k$ random matchings are generated, and each one is independently refined using hill climbing of the likelihood. After refinement, pairs of matchings are merged using average matching scores. Refinement and merging are iterated until only a single refined matching is left.

For any given matching $\pi$, we thus find a joint MSA $X_\pi$, which concatenates two subalignments $X_\pi^1$ and $X_\pi^2$ of the original single-family MSAs. Within Gaussian DCA (28), the total log-likelihood of an arbitrary MSA $X$ depends only on its regularized covariance matrix $C(X)$ and reads $L(X) = -\frac{1}{2} \log \det C(X)$ (SI Appendix). The amount of interprotein coevolution can be quantified by the interprotein log-likelihood $L^{\pi}_{\mathrm{inter}} = L(X_\pi) - L(X_\pi^1) - L(X_\pi^2)$, which results from the difference between two quantities: (i) the log-likelihood of the joint MSA $X_\pi$ and (ii) the log-likelihood of the single-family sub-MSAs $X_\pi^1$ and $X_\pi^2$, modeled separately. The best matching maximizes this interfamily log-likelihood, i.e., $\pi^* = \arg\max_\pi L^{\pi}_{\mathrm{inter}}$. Due to the huge number of possible matchings, which is exponential in the number of species and superexponential in the number of paralogs inside each species, the exact solution of this optimization task is, unfortunately, infeasible. Furthermore, we empirically observed this discrete optimization problem to be plagued by many local likelihood maxima, such that local search algorithms easily get stuck.

We therefore propose two heuristic algorithms to approximate the solution of this optimization task. A fast progressive method is applicable to large-scale data sets (e.g., many pairs of large families). Although having limited accuracy in identifying specifically interacting paralogs, the method is suitable for subsequent interfamily DCA analysis to predict residue−residue contacts between proteins, or to discriminate interacting from noninteracting families. A slow but accurate iterative method is more suitable for smaller-scale problems, where the accurate identification of individual interacting protein pairs is central.

An efficient progressive paralog-matching algorithm. A first algorithmic strategy to find the matching maximizing the interfamily covariation is inspired by progressive techniques in constructing MSAs (29): Species are matched progressively, starting with the simplest ones (species with low paralog numbers in our case) and iteratively adding more complicated species with higher paralog numbers. Each species is matched only once, on the basis of all already matched species. Our progressive paralog-matching (PPM) algorithm proceeds as follows (technical details are provided in Methods and in SI Appendix; the pipeline is depicted in Fig. 1B):

1. Species are ordered according to the entropy of their possible matchings, i.e., to the expected hardness of the matching task.
2. Species of low entropy are used to generate a seed matching. In our specific case, zero-entropy species, i.e., species with a single paralog, are used.
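Steps 1 and 2 can be illustrated with a minimal sketch. It assumes that the matching entropy of a species is the logarithm of the number of injective matchings between its two paralog sets; the exact definition used by the authors is given in the SI Appendix and may differ. The function names and the toy species counts are hypothetical.

```python
import math

def matching_entropy(n1: int, n2: int) -> float:
    """Log of the number of injective matchings between n1 and n2 paralogs.

    Assumption: all injective matchings are equally likely a priori, so the
    entropy is log(large! / (large - small)!).
    """
    small, large = sorted((n1, n2))
    return math.lgamma(large + 1) - math.lgamma(large - small + 1)

def order_species(paralog_counts):
    """Split species into a zero-entropy seed and an entropy-sorted remainder.

    paralog_counts maps a species name to (paralogs in family 1,
    paralogs in family 2). Species missing one family cannot be matched.
    """
    usable = {s: tuple(c) for s, c in paralog_counts.items() if min(c) > 0}
    # Zero entropy means a single paralog in both families.
    seed = sorted(s for s, c in usable.items() if c == (1, 1))
    rest = sorted(
        (s for s in usable if s not in seed),
        key=lambda s: (matching_entropy(*usable[s]), s),
    )
    return seed, rest

# Hypothetical toy input: species name -> paralog counts in the two families.
counts = {
    "sp_A": (1, 1),
    "sp_B": (3, 3),
    "sp_C": (1, 2),
    "sp_D": (2, 5),
    "sp_E": (1, 1),
}
seed, rest = order_species(counts)
print("seed species:", seed)
print("progressive order:", rest)
```

In this toy input, the two single-paralog species form the seed, and the remaining species are visited from the easiest (two possible matchings) to the hardest (twenty possible matchings).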

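The interprotein log-likelihood that the matching is meant to maximize can be made concrete with a small numerical sketch. It is written under strong simplifying assumptions: real-valued toy features stand in for the encoded sequences, a single group of rows stands in for one species, and the covariance is regularized by adding a constant to the diagonal, which is not necessarily the regularization used in GaussDCA. All function names are hypothetical.

```python
import math
import random

def covariance(rows, reg=0.1):
    """Empirical covariance of the columns of rows, plus reg on the diagonal."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [
        [sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / n
         for j in range(d)]
        for i in range(d)
    ]
    for i in range(d):
        cov[i][i] += reg
    return cov

def log_det(mat):
    """Log-determinant of a symmetric positive definite matrix (Cholesky)."""
    d = len(mat)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = mat[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(d))

def log_lik(rows, reg=0.1):
    """L(X) = -1/2 log det C(X) for the regularized covariance C(X)."""
    return -0.5 * log_det(covariance(rows, reg))

def inter_log_lik(x1, x2, matching, reg=0.1):
    """L_inter = L(joint) - L(family 1) - L(family 2) for a given matching.

    matching[i] is the row of x2 paired with row i of x1.
    """
    x2_matched = [x2[j] for j in matching]
    joint = [a + b for a, b in zip(x1, x2_matched)]
    return log_lik(joint, reg) - log_lik(x1, reg) - log_lik(x2_matched, reg)

rng = random.Random(0)
x1 = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
# Row i of x2 is a noisy copy of row i of x1, so the identity is the true matching.
x2 = [[v + 0.3 * rng.gauss(0, 1) for v in row] for row in x1]

true_matching = list(range(len(x1)))
shuffled = true_matching[:]
rng.shuffle(shuffled)

l_true = inter_log_lik(x1, x2, true_matching)
l_shuffled = inter_log_lik(x1, x2, shuffled)
print(l_true, l_shuffled)
```

With this construction the true matching gives a clearly larger interprotein log-likelihood than the shuffled one, which is the signal the matching procedure exploits. The single-family terms do not depend on the matching in this toy setup, so only the joint term changes between the two calls.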
3. In order of increasing entropy, species are added recursively: (a) Gaussian DCA (GaussDCA) (28) is applied to the already matched MSA. (b) The GaussDCA parameters are used to score each pair of paralogs inside the new species to be added. (c) An optimal matching for the new species is constructed using these scores.

The algorithm terminates when all species are included. The absence of iterative error correction makes this algorithm computationally efficient. However, errors fixed early on may propagate through the whole procedure and disturb later matched species.

An accurate iterative paralog-matching algorithm. The PPM algorithm matches the proteins belonging to each species only once, based on the previously matched species. Any matching error made at some stage is kept up to the end, possibly causing other matching errors. It would be possible to correct at least part of these errors when considering later included proteins. However, the likelihood landscape has many local maxima, so a simple iterative refinement remains stuck close to the PPM result.

To overcome this limitation, our slow but accurate iterative paralog-matching (IPM) algorithm follows three steps (all extensively described in SI Appendix):

1. Generate K random paralog matchings respecting species. In all practical applications, K = 256 was found to be a good compromise between computational time and accuracy.
2. Independently refine all K matchings iteratively by hill climbing (discrete analog of gradient ascent): at step t + 1, improve the matching within each single species, based on the GaussDCA model computed from the matching at the previous step t, until convergence to a local likelihood maximum.
3. Merge pairs of matchings by averaging and refinement: substitute two matchings with a new one obtained from the average GaussDCA model, and refine it by hill climbing as above. This is iterated until a single matching remains. Subsequently, perform a final refinement step: produce K′ = 32 noisy (partially scrambled) versions of the last matching and merge them again as above, thus obtaining a new matching; repeat until the score reaches a plateau.

The idea behind the merging step is simple: The consensus of two imperfect matchings should reinforce the common signal compared with the random noise. This nonlocal change of the matching is found to be able to escape local log-likelihood maxima. Details of the algorithm are given in Methods and in SI Appendix; the pipeline is depicted in Fig. 1C.

Simultaneous Identification of Interaction Partners and Interprotein Residue−Residue Contacts in Bacterial Signal Transduction. To test both algorithmic strategies, we first consider bacterial two-component systems (TCS) (14), which are the most widespread signal transduction systems in bacteria. TCS have played a prominent role in the development of DCA (14). They consist of two interacting proteins, the histidine sensor kinase (SK), as a signal receiver, and the response regulator (RR), which, under activation, typically acts as a transcription factor and triggers a transcriptional response (30). In particular, we use the dataset of Procaccini et al. (17), which collects 8,998 interacting (so-called cognate) protein pairs from 712 distinct species (Methods). A random matching between SK and RR inside species would make, on average, one correct prediction per species; that is, only a fraction of 712/8,998 = 7.9% of all matched SK/RR pairs would be correct. Earlier approaches to match SK and RR have used Bayesian residue networks (23) or aligned protein similarity networks (31); although they improve substantially over random matchings, their accuracy remains inferior to the algorithms presented here.

We first check the self-consistency of our matching idea: Is our MSA of SK/RR, which are colocalized in joint operons and therefore expected to be truly interacting, stable under the matching procedure? To answer this question, we infer a DCA model using this MSA, and we rematch all species. No changes are observed: The true MSA is actually a fixed point of the proposed algorithmic procedure. As a second step, we run PPM. Two possibilities to assess the quality of the matching are considered. First, we check which fraction of the 8,998 matched pairs actually coincides with cognate pairs (as in the dataset published in ref. 17). Second, we use the matched MSA to predict interprotein residue−residue contacts. Because this second test requires only a single run of DCA on the matched alignment, we replace GaussDCA with the more accurate but slower plmDCA (pseudo-likelihood maximization DCA) (32). This algorithm results in a plmDCA score for each residue pair; the largest interprotein residue−residue scores are used to predict interprotein residue−residue contacts; see SI Appendix for details.

[Fig. 2 panels: (A) fraction of matched cognate pairs vs. fraction of included sequences; (B) positive predictive value vs. number of predictions (logarithmic axis, 1 to 1,000), for the true MSA, the full PPM matching, 2,000 sequences, 1,014 sequences, and the seed MSA.]

Fig. 2. The progressive matching procedure matches cognate pairs and enables interprotein residue contact prediction. (A) The red line shows the fraction of the 8,998 SK/RR cognate pairs, which are correctly matched by the progressive matching algorithm, as a function of the matched pairs. A perfect matching procedure would follow the dashed diagonal. The SK/RR complex structure is overlaid with the 15 highest-scoring contact predictions at three different steps of the algorithm: for the seed alignment, after having matched 1,014 proteins, and at the end of the matching. Green bonds show correct predictions, and red bonds show incorrect predictions (contact cutoff 8 Å). The upper structure shows the prediction obtained with the full cognate MSA. (B) The positive predictive value (i.e., the fraction of true positives amongst all interprotein contact predictions) is shown, as a function of the number of predictions, for several joint MSA: the true operon-based cognate matching (solid black); the matching of the seed alignment (magenta); and after having matched 1,014 (blue), 2,000 (green), and 8,998 (red) sequences. The perfect predictor is depicted as black dashed. The prediction accuracy grows during the progressive matching and finally reaches almost the accuracy of the cognate matching.

Before running PPM, only 59 out of 8,998 sequence pairs are matched immediately because both SK and RR are unique in the genome. The extension of this seed matching by PPM is shown in Fig. 2A: Although the seed matching alone is insufficient to predict interprotein contacts between SK and RR (only one true contact out of the strongest 15 interprotein predictions), it is sufficient to guide PPM to 84.7% precision: 7,620 out of 8,998 cognate pairs are correctly identified. The red line in Fig. 2A shows the number of correctly matched pairs as a function of all matched pairs during the progression of the algorithm. This mildly sublinear curve signals a moderate decay in accuracy during the matching procedure, which results from a tradeoff between increasingly more accurate DCA models (larger sequence numbers) and increasingly harder matching tasks (species were sorted according to their matching entropy). The final matching is sufficient to provide accurate interprotein contact predictions; all of the 15 highest-scoring residue pairs are true interprotein contacts [distance < 8 Å in PDB (Protein Data Bank) file 3dge (33)]. We observe that, with increasing size of the progressive matching, the contact prediction becomes more and

more accurate: For 1,014 matched sequences, 10 out of the first 15 plmDCA predictions are interprotein contacts, and, for 2,000 matched sequences, even 13 out of 15 are contacts; see Fig. 2B for a more quantitative assessment.

PPM assigns a protein−protein matching score to each of the pairs in the final matching; compare step 3b of the PPM algorithmic description. Fig. 3A shows that the highest matching scores exclusively indicate truly interacting pairs. All of the first 1,347 pairs are cognate pairs. Although PPM is computationally very efficient, its accuracy in identifying true interaction partners is limited. In the progressive strategy, once a matching error is made, it is not corrected but influences all subsequently matched species. To address this, we have applied the computationally more involved IPM algorithm (Fig. 3C); thanks to the nonlocal merging steps, we reach a precision of 91.2% (8,206 true matches). As IPM proceeds to maximize the log-likelihood, the matching error is, up to fluctuations, monotonically decreasing. Furthermore, Fig. 3C, Inset shows that IPM slightly exceeds the log-likelihood of the true operon-based matching, but the error rate is not decreasing any more beyond that point, suggesting that the intrinsic error rate of the association between log-likelihood and matching error is close to 9%.

[Fig. 3 panels: (A) PPM and (B) IPM: histograms over the GaussDCA score of matched pairs; (C) number of matching errors vs. log-likelihood of the matching, with the merging and refinement stages and the log-likelihood of the true matching marked; inset: enlargement of the refinement stage.]

Fig. 3. PPM vs. IPM algorithm for the SK/RR system. Shown are the histograms of DCA scores of the final (A) PPM and (B) IPM matching. The fraction of true positive (TP) predictions is colored in green, and the fraction of false positive (FP) predictions in red. Although the high-scoring pairs are exclusively TP, low and intermediate scores show a mixture of TP and FP. The overall histogram is only insignificantly shifted toward higher scores when comparing IPM to PPM, but the overall weight of the FP is visibly decreased. C shows the dependence of the number of matching errors (FP) on the log-likelihood of the IPM matching. Iteration proceeds from the upper left to the lower right corner, showing the last two stages of IPM: first, the progressive merging of locally optimal matchings (blue points), and, then, the final refinement stage (red points). The overall procedure arrives at a log-likelihood that is slightly superior to the one of the true matching (dotted vertical line), at a precision of about 91.2% (8,206 TP out of 8,998 TP+FP). (Inset) An enlargement of the refinement stage; the almost-linear relation between log-likelihood and error clearly breaks down once the log-likelihood of the cognate matching is reached.

Simultaneous Identification of Interacting Families and Specifically Interacting Proteins in a Bacterial Metabolic Pathway. DCA has been used to identify interacting protein families (21, 25). Based again on the availability of large joint MSA, only pairs of families showing significant interprotein coevolution are expected to interact. Again, we argue that, even without a large known set of (potentially) interacting protein pairs, the PPM strategy simultaneously creates such an alignment, and the interprotein plmDCA scores are informative about interfamily interaction. Following ref. 25, the average of the four highest interprotein residue−residue plmDCA scores is used (SI Appendix).

As a test system, we choose the tryptophan biosynthesis pathway comprising seven different proteins, TrpA through TrpG, which catalyze subsequent reactions in the pathway. Among the 21 pairs, only two are known to interact based on experimentally resolved cocrystal structures: TrpA−TrpB [PDB 1k7f (34)] and TrpE−TrpG [PDB 1qdl (35)]. Although individual Pfam MSA sizes reach from 8,713 sequences for TrpF to 78,265 for TrpG, pairing by uniqueness in the genome only in three cases leads to joint MSAs beyond 1,000 sequences (TrpC−TrpF, 1,578; TrpA−TrpC, 1,546; TrpA−TrpF, 1,433). The actually interacting pairs have extremely small joint MSAs of 15 sequences for TrpE−TrpG and 95 sequences for TrpA−TrpB. No detection of interactions is possible with such small alignments (Fig. 4). In ref. 25, we have shown that matching by genomic colocalization leads to joint MSA sizes of 2,519 to 8,053 sequences, with a majority below 4,000 sequences. These alignments separate the two known interacting pairs (interprotein plmDCA scores 0.3, 0.38) from an almost continuous background of scores not exceeding 0.17.

To test our paralog matching, we apply PPM to each of the 21 Trp protein pairs (Fig. 4). The seed matchings, generated by uniqueness in the genome, range from 15 to 1,578 protein pairs. They do not allow for recovering the correct interacting family pairs (ranks 5 and 21 out of 21). After having matched 1,000 protein pairs in each family, and computed the interprotein plmDCA score, the three highest-scoring family pairs are TrpA−TrpB (score 0.23), TrpF−TrpG (score 0.18), and TrpE−TrpG (score 0.17), followed by almost continuous scores below 0.15. The correct interactions thus have ranks 1 and 3, but no gap exists between the scores of interacting pairs and the scores of noninteracting pairs.

[Fig. 4 panels A−F.]

Fig. 4. Detection of protein−protein interactions between enzymes of the tryptophan biosynthesis pathway. (A) The known PPI between the seven enzymes in the Trp pathway; only TrpA−TrpB and TrpE−TrpG are known to interact. (B−D) The results (B) for the seed matching (subalignment made of genomes having a single sequence in both families), (C) for matchings of 1,000 sequences per protein pair, and (D) for the full matching. Line width is proportional to the interprotein coevolution score; the first two predictions are colored (TP, green; FP, red). For the seed alignment, none of the true PPIs is recognized, whereas for 1,000 sequences, one out of the two PPIs is recognized. The second true PPI has the third score, but there is no gap between true and false PPI. For the full matching, the known PPI are found to be the two highest-scoring pairs, with scores detached from an almost continuous distribution of the remaining 19 scores. E and F show the PDB structures of the complexes (E) TrpA−TrpB and (F) TrpE−TrpG, together with the 15 highest DCA-scoring interprotein pairs, colored in green for TP interprotein contact predictions (12 for TrpA−TrpB, 11 for TrpE−TrpG) and in red for FP predictions (3 for TrpA−TrpB, 4 for TrpE−TrpG). The contact prediction is based on the fully matched PPM alignments.

Using the full progressive matchings, TrpA−TrpB (TrpE−TrpG) have an interprotein plmDCA score of 0.34 (0.25) followed by almost continuous scores below 0.15. The two correct interactions are recognized with a gap, which is almost as large as in the matching obtained using genomic colocalization, illustrating again the strong capacity of our method to recover accurately the

Gueudré et al. PNAS | October 25, 2016 | vol. 113 | no. 43 | 12189
www.pnas.org/cgi/doi/10.1073/pnas.1607570113

matching between interacting proteins. Again, we stress that our method, at variance with the one presented in ref. 25, does not need any information about the genomic location but only about the protein sequences in the MSA. Thus, it is of more general applicability.

Results obtained at the level of interaction networks can be corroborated by interprotein contact predictions obtained for the two interacting pairs (Fig. 4 E and F): For TrpA–TrpB, 9 out of the first 10 (and 12 out of the first 15) interprotein contact predictions are true positives. The situation is very similar for TrpE–TrpG: 10 out of the first 10 and 11 out of the first 15 predicted pairs are in contact across the interface.

To assess the robustness of our results, we included more (putative) negative controls to perform a larger-scale analysis extending beyond the Trp system. We considered a larger dataset of 40 protein families, which we tested exhaustively against the four proteins involved in interactions, i.e., TrpA, TrpB, TrpE, and TrpG (each of TrpA, TrpB, TrpE, and TrpG is tested against all other 39 families); see SI Appendix for the selection of these proteins and detailed results. Despite the increased number of possible protein family pairs, the scoring gap of the known interacting pairs vs. all other pairs discriminates interacting from noninteracting pairs; only one pair (Trp2/P0ABY7, score 0.226) shows an interprotein plmDCA score close to TrpA/TrpB (0.337) and TrpE/TrpG (0.245). The alignment of this pair has, however, an insufficient sequence number for reliable coevolutionary inference ($M_{\mathrm{eff}} = 34$). A large interprotein plmDCA score based on a sufficiently deep MSA seems to provide a promising predictor of conserved protein–protein interaction.

Discussion

Global methods to detect coevolution, like DCA, PSICOV (Precise Structural Contact Prediction Using Sparse Inverse Covariance Estimation), and GREMLIN, have recently enjoyed growing popularity in a very specific setting: Starting from a large multiple sequence alignment of homologous proteins, these approaches have helped to extract residue–residue contacts from residue–residue amino acid covariation. In the context of interacting proteins, the inferred interprotein contacts have, in turn, helped to structurally assemble protein complexes. However, the applicability of these methods has remained limited due to the a priori need to obtain joint multiple sequence alignments of pairs of interacting proteins, with each row containing a pair of interacting proteins out of two protein families. This MSA has to be obtained by external information like the uniqueness of the two protein families inside a species (no paralogs present) or the genomic colocalization in bacterial operons.

In this work, we show that one can turn the argument around: The coevolution between two protein families itself can be used to identify interacting partner proteins, and thereby to generate the joint MSA while simultaneously obtaining an interprotein contact prediction. We have shown that an accurate matching between protein families can be obtained, which (i) connects only proteins in the same species and (ii) maximizes the detectable interfamily coevolutionary signal. The idea is that basically any mismatch connecting two noninteracting proteins decreases the interfamily covariation. In Fig. 3, we have actually observed that there is an almost monotonically decreasing relation between the log-likelihood of a matching (which is a measure of the total interfamily coevolutionary signal) and the error rate in the matching, compared with a Gold-standard dataset of colocalized bacterial proteins from two-component signal transduction pathways. However, once the log-likelihood of the Gold-standard matching was obtained (or even slightly exceeded), the residual matching error of about 9% did not decrease any further; this may be a sign of an intrinsic limitation of the idea connecting likelihood and matching accuracy, but it may also be a biological signal. About 60% of the mismatches were pairwise switches (transpositions) between two TCS, and 18% concerned triples. It has been speculated before that 15 to 20% of all bacterial signaling systems display some tendency to crosstalk; that is, interactions are not really one-to-one. Part of the "mismatched" proteins could actually be read as predictions for inter-TCS crosstalk. However, in model species E. coli and Bacillus subtilis, where cases of crosstalk have been reported (36, 37), no matching errors were found.

The intuitive idea of maximizing the interfamily coevolutionary signal leads to a computationally extremely hard problem: The search space (i.e., all possible joint MSA) is exponentially large in the number of species and superexponential in the number of paralogs inside each species. The problem would become much simpler to solve if the global log-likelihood score could be replaced by a local correlation measure maximizing, e.g., the Frobenius norm of the interprotein covariance matrix. This is implemented as a first fast stage of the IPM algorithm, but, in the case of TCS, it gets stuck at a high error rate of almost 40% of mismatches. Global modeling is necessary to reach high accuracy in paralog matching. We have also seen that the accuracy drops only slightly (error rate ∼15%) when the slow iterative procedure is replaced by a fast PPM. The resulting joint alignments are sufficiently precise to enable accurate interprotein contact prediction, and to discriminate between interacting and noninteracting protein families.

The two strategies, progressive and IPM, both open the road to large-scale analysis for predicting currently unknown protein–protein interactions. Coevolution-based procedures to analyze PPI have extensively used colocalization (22–25). A natural question is, what fraction of the known bacterial interactome comes from colocalized genes? Given our partial knowledge of the interactome at present, we still cannot provide a precise answer to this question. However, we can give a partial estimate based on current knowledge in E. coli, i.e., in the currently best-studied model species. E. coli's proteome consists of 4,323 nonredundant proteins organized in 2,148 operons (817 of which host at least two genes). This results in 4,885 potential PPIs within the same operon, in comparison with more than 9 million protein pairs in total. IntAct (27), one of the most comprehensive databases for the PPI network, reports 3,643 PPI for E. coli, of which more than one-third (1,341 pairs) are intraoperon PPI. A domain-based database of structurally known PPI, iPfam (38), reports 4,100 interacting family pairs (∼2,000 of which are homodimers). The breakdown of the 4,885 possible intraoperon interactions in terms of distinct protein domains gives 8,068 distinct intraoperon domain pairs. Of the 2,100 heterodimeric domain pairs in iPfam, only 640 are present in E. coli, 214 of which are in the same operon. Again, about one-third of the known interactions originate from the same operon. Our methodology provides an efficient and scalable algorithmic strategy to analyze the remaining two-thirds of the known interactome, for which criteria such as genomic proximity cannot be used.

Methods

Gaussian Direct Coupling Analysis. The basis of the paralog matching procedure is the GaussDCA formulated in ref. 28. Let us assume a matched MSA A of M sequences of length L. The MSA is transformed into an M × 20L-dimensional binary array X by replacing each amino acid with a distinct 20-dimensional vector containing one entry "1" and 19 entries "0"; gaps are represented by zero vectors. The empirical covariance matrix of the transformed MSA is the 20L-dimensional square matrix C (the explicit dependence on the matching leading to the MSA is suppressed here), and the empirical mean is the 20L-dimensional vector μ; see SI Appendix for the precise definition of these quantities using standard DCA sequence weighting and pseudocounts. Given these empirical matrices, the GaussDCA model assigns a probability

\[ P_G(x \mid \mu, C) = \frac{1}{\sqrt{(2\pi)^N \det(C)}} \exp\left[ -\frac{1}{2} (x - \mu)^T C^{-1} (x - \mu) \right] \]

to any amino acid sequence x of length L in binary representation. From this expression, the log-likelihood of the original MSA X can be easily determined as $\mathcal{L}(X) = \frac{1}{M} \log\left[ P_G(X \mid \mu, C) \right] = -\frac{1}{2} \log \det C$ (SI Appendix). Our matching strategy aims at maximizing this likelihood by selecting the matching leading to the joint MSA X. In the initial phase of IPM, we will also use the squared Frobenius norm of the covariance matrix C, i.e., $\|C\|_F^2 = \sum_{i,j} C_{i,j}^2$, as a faster-to-compute objective function (SI Appendix).

Paralog Matching. The two matching strategies are described in Results, and extensive details are provided in SI Appendix. As the optimal assignment problem can be easily formulated in terms of linear programming, we used the Gurobi library (39) to efficiently solve it.

Data Extraction.

TCS. The data for the SK/RR analysis were originally published in ref. 17; here we give a short description: 769 bacterial genomes were scanned using hmmer (Hidden Markov Model biosequence analysis) (40) with the Pfam 22.0 Hidden Markov Models (41) for the following SK domains: "HisKA" (PF00512), "HWE_HK" (PF07536), "HisKA_2" (PF07568), "HisKA_3" (PF07730), "His_kinase" (PF06580), and "Hpt" (PF01627), and, for the RR domain, "Response_reg" (PF00072). Using a simple operational definition of an operon as a sequence of consecutive genes of the same coding sense, and with intergenic distances not exceeding 200 base pairs, a total of M = 8,998 SK/RR pairs were identified in operons containing a single SK (of type HisKA) and a single RR domain. As reference structure, we consider the PDB entry 3dge (32).

Trp operon. The tryptophan biosynthetic pathway consists of seven enzymes (TrpA, TrpB, TrpC, TrpD, TrpE, TrpF, and TrpG). Only two protein–protein interactions are known and resolved structurally: TrpA–TrpB [PDB 1k7f (42)] and TrpG–TrpE [PDB 1qdl (43)]. Single-protein MSA have been extracted using the pipeline proposed in ref. 25: (i) Extract sequences corresponding to names from Uniprot (Universal Protein Resource); (ii) run MAFFT (multiple-alignment program for amino acid or nucleotide sequences) (44) using mafft --anysymbol --auto; (iii) create a profile Hidden Markov Model using hmmbuild from the hmmer suite, and search Uniprot using hmmsearch (45); and (iv) remove inserts. In addition to the seven Trp enzymes, we also created, with the same procedure, an enlarged dataset of 33 negative controls. Details are provided in SI Appendix.

Note. While finalizing this manuscript, we learned that A.-F. Bitbol, R. S. Dwyer, L. J. Colwell, and N. S. Wingreen have prepared a related paper on predicting interacting paralog pairs (46).

ACKNOWLEDGMENTS. We thank Christoph Feinauer, Guido Uguzzoni, and Hendrik Szurmant for helpful discussions. M.W. was partly funded by the Agence Nationale de la Recherche Project COEVSTAT (ANR-13-BS04-0012-01). C.B. was partly funded by the European Research Council (Grant 267915).
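As an illustration of the two objective functions defined in Methods (Gaussian Direct Coupling Analysis), the following minimal Python sketch computes the log-likelihood score −1/2 log det C and the squared Frobenius norm for a toy joint alignment. It is not the authors' implementation: the function names, the toy sequences, and the diagonal `ridge` term are assumptions made for this example. The ridge stands in for the sequence weighting and pseudocounts of SI Appendix and serves only to keep C positive definite.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residues; gaps ("-") map to zero vectors


def one_hot(seq):
    """Encode a sequence of length L as a flat binary vector of length 20L."""
    vec = []
    for residue in seq:
        block = [0.0] * 20
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:  # gaps and unknown symbols stay all-zero
            block[idx] = 1.0
        vec.extend(block)
    return vec


def covariance(msa, ridge=0.0):
    """Empirical covariance of the one-hot encoded MSA.

    `ridge` is added to the diagonal; it is an assumption of this sketch,
    used in place of the paper's pseudocounts and sequence weights.
    """
    X = [one_hot(s) for s in msa]
    M, N = len(X), len(X[0])
    mu = [sum(row[i] for row in X) / M for i in range(N)]
    C = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in X) / M
          for j in range(N)] for i in range(N)]
    for i in range(N):
        C[i][i] += ridge
    return C


def log_det(C):
    """log det C of a symmetric positive-definite matrix via Cholesky."""
    n = len(C)
    chol = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = C[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][i] = math.sqrt(s)
                total += 2.0 * math.log(chol[i][i])
            else:
                chol[i][j] = s / chol[j][j]
    return total


def matching_score(msa, ridge=0.1):
    """Log-likelihood objective of a joint MSA: -1/2 log det C."""
    return -0.5 * log_det(covariance(msa, ridge))


def frobenius_sq(msa):
    """Squared Frobenius norm of C, the cheaper objective."""
    return sum(c * c for row in covariance(msa) for c in row)


# Toy joint alignments: each row concatenates one residue from family A
# with one residue from family B (hypothetical data).
matched = ["AD", "AD", "CE", "CE"]     # A always pairs with D, C with E
mismatched = ["AD", "AE", "CD", "CE"]  # pairing carries no covariation
```

Calling `matching_score` and `frobenius_sq` on `matched` and `mismatched` compares the two pairings; in this toy case both objectives come out higher for the coupled alignment, which is the behavior the matching strategy relies on. Real alignments have 20L in the hundreds or thousands, so a numerical linear algebra library would replace the pure-Python loops.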

1. Shoemaker BA, Panchenko AR (2007) Deciphering protein–protein interactions. Part I. Experimental techniques and databases. PLOS Comput Biol 3(3):e42.
2. Rao VS, Srinivas K, Sujini GN, Kumar GN (2014) Protein-protein interaction detection: Methods and analysis. Int J Proteomics 2014:147648.
3. Shoemaker BA, Panchenko AR (2007) Deciphering protein–protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLOS Comput Biol 3(4):e43.
4. Keskin O, Tuncbag N, Gursoy A (2016) Predicting protein–protein interactions from the molecular to the proteome level. Chem Rev 116(8):4884–4909.
5. Dandekar T, Snel B, Huynen M, Bork P (1998) Conservation of gene order: A fingerprint of proteins that physically interact. Trends Biochem Sci 23(9):324–328.
6. Galperin MY, Koonin EV (2000) Who's your neighbor? New computational approaches for functional genomics. Nat Biotechnol 18(6):609–613.
7. Marcotte CJV, Marcotte EM (2002) Predicting functional linkages from gene fusions with confidence. Appl Bioinformatics 1(2):93–100.
8. Marcotte EM, et al. (1999) Detecting protein function and protein-protein interactions from genome sequences. Science 285(5428):751–753.
9. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO (1999) Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc Natl Acad Sci USA 96(8):4285–4288.
10. Pazos F, Valencia A (2001) Similarity of phylogenetic trees as indicator of protein–protein interaction. Protein Eng 14(9):609–614.
11. Juan D, Pazos F, Valencia A (2008) High-confidence prediction of global interactomes based on genome-wide coevolutionary networks. Proc Natl Acad Sci USA 105(3):934–939.
12. Reddy TB, et al. (2015) The Genomes OnLine Database (GOLD) v.5: A metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res 43(Database issue):D1099–D1106.
13. de Juan D, Pazos F, Valencia A (2013) Emerging methods in protein co-evolution. Nat Rev Genet 14(4):249–261.
14. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T (2009) Identification of direct residue contacts in protein–protein interaction by message passing. Proc Natl Acad Sci USA 106(1):67–72.
15. Schug A, Weigt M, Onuchic JN, Hwa T, Szurmant H (2009) High-resolution protein complexes from integrating genomic information with molecular simulation. Proc Natl Acad Sci USA 106(52):22124–22129.
16. Dago AE, et al. (2012) Structural basis of histidine kinase autophosphorylation deduced by integrating genomics, molecular dynamics, and mutagenesis. Proc Natl Acad Sci USA 109(26):E1733–E1742.
17. Procaccini A, Lunt B, Szurmant H, Hwa T, Weigt M (2011) Dissecting the specificity of protein-protein interaction in bacterial two-component signaling: Orphans and crosstalks. PLoS One 6(5):e19729.
18. Cheng RR, Morcos F, Levine H, Onuchic JN (2014) Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information. Proc Natl Acad Sci USA 111(5):E563–E571.
19. Jones DT, Buchan DW, Cozzetto D, Pontil M (2012) PSICOV: Precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 28(2):184–190.
20. Kamisetty H, Ovchinnikov S, Baker D (2013) Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci USA 110(39):15674–15679.
21. Ovchinnikov S, Kamisetty H, Baker D (2014) Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information. eLife 3:e02030.
22. Hopf TA, et al. (2014) Sequence co-evolution gives 3D contacts and structures of protein complexes. eLife 3:e03430.
23. Burger L, van Nimwegen E (2008) Accurate prediction of protein–protein interactions from sequence alignments using a Bayesian method. Mol Syst Biol 4(1):165.
24. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T (2009) Identification of direct residue contacts in protein–protein interaction by message passing. Proc Natl Acad Sci USA 106(1):67–72.
25. Feinauer C, Szurmant H, Weigt M, Pagnani A (2016) Inter-protein sequence co-evolution predicts known physical interactions in bacterial ribosomes and the Trp operon. PLoS One 11(2):e0149166.
26. Finn RD (2012) Pfam: The Protein Families Database. Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics (Wiley, New York), Vol 3.
27. Orchard S, et al. (2014) The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 42(Database issue):D358–D363.
28. Baldassi C, et al. (2014) Fast and accurate multivariate Gaussian modeling of protein families: Predicting residue contacts and protein-interaction partners. PLoS One 9(3):e92721.
29. Feng DF, Doolittle RF (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25(4):351–360.
30. Stock AM, Robinson VL, Goudreau PN (2000) Two-component signal transduction. Annu Rev Biochem 69(1):183–215.
31. Bradde S, et al. (2010) Aligning graphs and finding substructures by a cavity approach. Europhys Lett 89(3):37009.
32. Ekeberg M, Lövkvist C, Lan Y, Weigt M, Aurell E (2013) Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Phys Rev E Stat Nonlin Soft Matter Phys 87(1):012707.
33. Casino P, Rubio V, Marina A (2009) Structural insight into partner specificity and phosphoryl transfer in two-component signal transduction. Cell 139(2):325–336.
34. Weyand M, Schlichting I, Marabotti A, Mozzarelli A (2002) Crystal structures of a new class of allosteric effectors complexed to tryptophan synthase. J Biol Chem 277(12):10647–10652.
35. Knöchel T, et al. (1999) The crystal structure of anthranilate synthase from Sulfolobus solfataricus: Functional implications. Proc Natl Acad Sci USA 96(17):9479–9484.
36. Howell A, Dubrac S, Noone D, Varughese KI, Devine K (2006) Interactions between the YycFG and PhoPR two-component systems in Bacillus subtilis: The PhoR kinase phosphorylates the non-cognate YycF response regulator upon phosphate limitation. Mol Microbiol 59(4):1199–1215.
37. Rietkötter E, Hoyer D, Mascher T (2008) Bacitracin sensing in Bacillus subtilis. Mol Microbiol 68(3):768–785.
38. Finn RD, Miller BL, Clements J, Bateman A (2014) iPfam: A database of protein family and domain interactions found in the Protein Data Bank. Nucleic Acids Res 42(D1):D364–D373.
39. Gurobi Optimization, Inc. (2015) Gurobi Optimizer Reference Manual (Gurobi Optimization, Houston).
40. Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14(9):755–763.
41. Finn RD, et al. (2016) The Pfam protein families database: Towards a more sustainable future. Nucleic Acids Res 44(D1):D279–D285.
42. Weyand M, Schlichting I, Marabotti A, Mozzarelli A (2002) Crystal structures of a new class of allosteric effectors complexed to tryptophan synthase. J Biol Chem 277(12):10647–10652.
43. Knöchel T, et al. (1999) The crystal structure of anthranilate synthase from Sulfolobus solfataricus: Functional implications. Proc Natl Acad Sci USA 96(17):9479–9484.
44. Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol Biol Evol 30(4):772–780.
45. Finn RD, et al. (2015) HMMER web server: 2015 update. Nucleic Acids Res 43(W1):W30–W38.
46. Bitbol A-F, Dwyer RS, Colwell LJ, Wingreen NS (2016) Inferring interaction partners from protein sequences. Proc Natl Acad Sci USA 113:12180–12185.
