Large-Scale Identification of Coevolution Signals Across Homo-Oligomeric Protein Interfaces by Direct Coupling Analysis

Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis Guido Uguzzonia,1, Shalini John Lovisb,1, Francesco Oteria, Alexander Schugb,2, Hendrik Szurmantc,2, and Martin Weigta,2 aSorbonne Universités, Université Pierre-et-Marie-Curie Université Paris 06, CNRS, Biologie Computationnelle et Quantitative-Institut de Biologie Paris Seine, 75005 Paris, France; bSteinbuch Centre for Computing, Karlsruhe Institute of Technology, 76344 Eggenstein-Leopoldshafen, Germany; and cDepartment of Basic Medical Sciences, College of Osteopathic Medicine of the Pacific, Western University of Health Sciences, Pomona, CA 91766 Edited by Arup K. Chakraborty, Massachusetts Institute of Technology, Cambridge, MA, and approved January 24, 2017 (received for review September 8, 2016) Proteins have evolved to perform diverse cellular functions, from To be successful, global statistical models must be trained on serving as reaction catalysts to coordinating cellular propagation and sufficiently large alignments of homologous sequences. It has development. Frequently, proteins do not exert their full potential as been argued (2) that on the order of 1,000 properly aligned and monomers but rather undergo concerted interactions as either homo- sufficiently divergent sequences are adequate for accurate model oligomers or with other proteins as hetero-oligomers. The experi- learning. Scanning the latest Pfam database release (19), we mental study of such protein complexes and interactions has been observe a fundamental extension of the families amenable to arduous. Theoretical structure prediction methods are an attractive statistical modeling (compare SI Appendix, Fig. S1). Whereas in alternative. Here, we investigate homo-oligomeric interfaces by the Pfam 5.0 release in the year 2000, only 33 families contained tracing residue coevolution via the global statistical direct coupling at least 1,000 sequences, this number has grown by a factor of analysis (DCA). DCA can accurately infer spatial adjacencies between more than 200: 6,783 in Pfam 28.0 (released in 2015) of a total of residues. These adjacencies can be included as constraints in structure 16,230 included families. The median family size of 654 se- prediction techniques to predict high-resolution models. By taking quences is close to the required sequence number. Statistical advantage of the ongoing exponential growth of sequence databases, sequence modeling will therefore strongly increase in importance we go significantly beyond anecdotal cases of a few protein families over the next years. and apply DCA to a systematic large-scale study of nearly 2,000 Pfam Statistical modeling of single-protein families is the final aim: protein families with sufficient sequence information and structurally Proteins rarely work in isolation and monomeric form. Protein– resolved homo-oligomeric interfaces. We find that large interfaces are commonly identified by DCA. We further demonstrate that DCA can protein interfaces coevolve to conserve the interaction and interaction specificity between proteins. To identify such coevolving differentiate between subfamilies with different binding modes – within one large Pfam family. Sequence-derived contact information protein protein interfaces has been the original motivation for for the subfamilies proves sufficient to assemble accurate structural DCA development (1, 20), and this methodology has recently been models of the diverse protein-oligomers. Thus, we provide an extended beyond simple anecdotal cases to dozens of protein pairs – approach to investigate oligomerization for arbitrary protein families (13, 21 25). Much more extensive datasets are needed for a solid leading to structural models complementary to often-difficult exper- statistical assessment of the quality of coevolution-based predic- imental methods. Combined with ever more abundant sequential tions, which tells us when DCA can reliably predict residue contacts data, we anticipate that this study will be instrumental to allow the structural description of many heteroprotein complexes in the future. Significance homo-oligomers | coevolution | direct coupling analysis | protein–protein Protein–protein interactions are important to all facets of life, interactions | big data analysis but their experimental and computational characterization is arduous and frequently of uncertain outcome. The current ife on the molecular level is orchestrated by the interplay of study demonstrates both the power and limitation to study Lmany different biomolecules. A crucial component for the protein interactions by utilizing sophisticated statistical in- function is its structure, ranging from small monomers to com- ference technology to derive protein contacts from available plex homo- or heteromultimers. The full structural character- sequence databases, more precisely from the coevolution be- ization of a biomolecule therefore typically precedes a detailed tween residues, that are in contact across the interaction in- explanation of its functional mechanism. However, despite the terface of two proteins. By studying homo-oligomeric protein incredible progress of structural characterization methods, many interactions, the current study expands from anecdotal evi- important biomolecules have not been structurally resolved. An dence of the performance of this technology to systematic intriguing alternative to often involved experimental measure- evidence of its value across close to 2,000 interacting protein ments of 3D structures is taking advantage of the growing wealth families. of genetic sequential information via sophisticated statistical Author contributions: A.S., H.S., and M.W. designed research; G.U., S.J.L., F.O., and A.S. methods. Direct coupling analysis (DCA) (1, 2) and related tools performed research; G.U., S.J.L., F.O., H.S., and M.W. analyzed data; and A.S., H.S., and (3, 4) develop a global model mimicking evolutionary fitness M.W. wrote the paper. landscapes of protein families (5, 6) and quantify the coevolution The authors declare no conflict of interest. – of amino acid residue positions (7 9). In the context of protein This article is a PNAS Direct Submission. – structure prediction, these models allow extraction of residue 1G.U. and S.J.L. contributed equally to this work. residue contacts from sequence information alone. Such in- 2 To whom correspondence may be addressed. Email: [email protected], alexander.schug@ formation has proven useful in the prediction of tertiary protein kit.edu, or [email protected]. – – structures (10 13); conformational transitions (14 16); and, This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. similarly, RNA structures (17, 18). 1073/pnas.1615068114/-/DCSupplemental. E2662–E2671 | PNAS | Published online March 13, 2017 www.pnas.org/cgi/doi/10.1073/pnas.1615068114 Downloaded by guest on September 23, 2021 between interacting proteins and how to overcome a possible fail- proach reduces possible artifacts coming from a subjective se- PNAS PLUS ure. The main difficulty in going to the results of much larger lection of systems and/or human data curation. The analysis of datasets stems from the necessity to create joint multiple sequence more than 750 families with large interfaces shows significantly alignments (MSAs) of interaction partners, where each line con- enriched coevolutionary couplings across the interfaces, whereas tains an interacting pair of proteins. Computationally, this prob- small interfaces are rarely detected. However, even for large in- lem becomes difficult by itself due to the existence of paralogs terfaces, the detection of strong coevolutionary couplings, and (26–28), and most work so far has concentrated on systems where therefore the accuracy of the resulting contact predictions, is limited interaction partners have colocalized genes (e.g., in common op- to roughly 50% of cases. To understand this observation better, and erons in the bacteria). to open paths for overcoming this limited performance, we analyze The present work addresses the challenge by focusing on homo- in more detail the specific case of response regulator (RR) proteins oligomerization rather than heteroprotein–heteroprotein interac- involved in bacterial two-component signal transduction, where tions, thereby increasing the number of available datasets by more several subfamilies (characterized by different domain architec- than one order of magnitude. The reason is simple: The creation of tures) show different homodimerization modes. We show that due joint MSAs of interaction partners is simplified, because the iden- to the nonconservation of the dimerization mode in the entire Pfam tical sequence of both interaction partners allows one to work with (19) domain family of RRs, a straightforward application of DCA MSAs for single proteins. On the contrary, the distinction between leads to a weakened signal as different dimerization modes are intraprotein and interprotein residue–residue contact predictions mixed. The restriction of the MSAs to proteins of presumably the from coevolutionary couplings is nontrivial when studying homo- same dimerization interface, however, strongly improves the oligomers, and the existence of experimental monomer structures strength and accuracy of the coevolutionary signal. In the case of is needed to identify putative interprotein contacts as directly RRs, we demonstrate that by subdividing the alignments based on coevolving pairs

Large-Scale Identification of Coevolution Signals Across Homo-Oligomeric Protein Interfaces by Direct Coupling Analysis

Simultaneous Identification of Specifically Interacting Paralogs and Interprotein Contacts by Direct Coupling Analysis

Inter-Residue, Inter-Protein and Inter-Family Coevolution: Bridging the Scales

1 Codon-Level Information Improves Predictions of Inter-Residue Contacts in Proteins 2 by Correlated Mutation Analysis 3

Assessing the Utility of Residue-Residue Contact Information in a Sequence and Structure Rich Era

Ensembling Multiple Raw Coevolutionary Features with Deep Residual Neural Networks for Contact‐Map Prediction in CASP13

Direct Information Reweighted by Contact Templates: Improved RNA Contact Prediction by Combining Structural Features

DIRECT: RNA Contact Predictions by Integrating Structural Patterns

Understanding and Improving Statistical Models of Protein Sequences, November 2018

Optimal Alignment of Coevolutionary Models for Protein Sequences

Assessing the Accuracy of Direct-Coupling Analysis for RNA Contact Prediction

Arxiv:2105.01428V1 [Q-Bio.PE] 4 May 2021

And Contact-Based Protein Structure Prediction