Interactome INSIDER: a Structural Interactome Browser for Genomic Studies
RESOURCE

Michael J Meyer1–3,6, Juan Felipe Beltrán1,2,6, Siqi Liang1,2,6, Robert Fragoza2,4, Aaron Rumack1,2, Jin Liang2, Xiaomu Wei1,5 & Haiyuan Yu1,2

1Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, USA. 2Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA. 3Tri-Institutional Training Program in Computational Biology and Medicine, New York, New York, USA. 4Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA. 5Department of Medicine, Weill Cornell College of Medicine, New York, New York, USA. 6These authors contributed equally to this work. Correspondence should be addressed to H.Y. ([email protected]).

RECEIVED 24 JANUARY; ACCEPTED 22 OCTOBER; PUBLISHED ONLINE 1 JANUARY 2018; DOI:10.1038/NMETH.4540

NATURE METHODS | VOL.15 NO.2 | FEBRUARY 2018 | 107

We present Interactome INSIDER, a tool to link genomic variant information with structural protein–protein interactomes. Underlying this tool is the application of machine learning to predict protein interaction interfaces for 185,957 protein interactions with previously unresolved interfaces in human and seven model organisms, including the entire experimentally determined human binary interactome. Predicted interfaces exhibit functional properties similar to those of known interfaces, including enrichment for disease mutations and recurrent cancer mutations. Through 2,164 de novo mutagenesis experiments, we show that mutations of predicted and known interface residues disrupt interactions at a similar rate and much more frequently than mutations outside of predicted interfaces. To spur functional genomic studies, Interactome INSIDER (http://interactomeinsider.yulab.org) enables users to identify whether variants or disease mutations are enriched in known and predicted interaction interfaces at various resolutions. Users may explore known population variants, disease mutations, and somatic cancer mutations, or they may upload their own set of mutations for this purpose.

Protein–protein interactions facilitate much of known cellular function. Recent efforts to experimentally determine protein interactomes in human1 and model organisms2–4, in addition to literature curation of small-scale interaction assays5, have dramatically increased the scale of known interactome networks. Studies of these interactomes have allowed researchers to elucidate how modes of evolution affect the functional fates of paralogs4 and to examine, on a genomic scale, network interconnectivities that determine cellular functions and disease states6.

While simply knowing which proteins interact with each other provides valuable information to spur functional studies, far more specific hypotheses can be tested if the spatial contacts of interacting proteins are known7. In the study of human disease, it has been demonstrated that mutations tend to localize to interaction interfaces, and mutations on the same protein may cause clinically distinct diseases by disrupting interactions with different partners6,8. However, the binding topologies of interacting proteins can only be determined at atomic resolution through X-ray crystallography, NMR, and (more recently) cryo-EM9 experiments, which limits the number of interactions with resolved interaction interfaces.

To study protein function on a genomic scale, especially as it relates to human disease, a large-scale set of protein interaction interfaces is needed. Thus far, computational methods such as docking10 and homology modeling11 have been employed to predict the atomic-level bound conformations of interactions whose experimental structures have not yet been determined. However, docked models are not yet available on a large scale; and while homology modeling has been used to produce models at scale12, it is only amenable to interactions with structural templates (<5% of known interactions). Together, cocrystal structures and homology models comprise the currently available precalculated sources of structural interactomes, covering only ~6% of all known interactions (Fig. 1a,b).

Here, we present Interactome INSIDER (integrated structural interactome and genomic data browser), a tool for functional exploration of human disease on a genomic scale (http://interactomeinsider.yulab.org). Interactome INSIDER is based on a structurally resolved, proteome-wide human interactome. We assembled this resource by building an interactome-wide set of protein interaction interfaces at the highest resolution possible for each interaction. We compiled structural interactomes by calculating interfaces in experimental cocrystal structures and homology models, when available. For the remaining ~94% of interactions, we applied a machine-learning framework to predict partner-specific interfaces by applying recent advances in coevolution- and docking-based feature construction13,14. Interactome INSIDER combines predicted interaction interfaces for 185,957 previously unresolved interactions (including the full human interactome and seven commonly studied model organisms) with disease mutations and functional annotations in an interactive toolbox designed to spur functional genomics research. It allows users to find enrichment of disease mutations at different scales: in protein interaction domains, in residues, and through atomic 3D clustering in protein interfaces.

RESULTS
To build Interactome INSIDER, we first constructed an interactome-wide set of protein interaction interfaces.
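The assembly strategy described in the introduction, taking each interface from the highest-resolution source available, can be sketched as a simple priority lookup. This is only an illustration under stated assumptions, not the authors' pipeline: the source labels, the `pick_interface_source` function, and the example protein pairs are hypothetical names introduced here.

```python
from collections import Counter

# Hypothetical labels for structural evidence, ordered from highest to
# lowest resolution. Anything without structural evidence is predicted.
STRUCTURAL_SOURCES = ("cocrystal", "homology_model")


def pick_interface_source(available):
    """Return the highest-resolution interface source for one interaction.

    `available` is the set of structural evidence labels found for the
    interaction; an empty set means no structure or template exists.
    """
    for source in STRUCTURAL_SOURCES:
        if source in available:
            return source
    # No cocrystal structure or homology model: fall back to prediction.
    return "predicted"


# Toy interactome (made-up protein pairs, not real data).
interactions = {
    ("P1", "P2"): {"cocrystal", "homology_model"},
    ("P1", "P3"): {"homology_model"},
    ("P2", "P4"): set(),
    ("P3", "P4"): set(),
}

chosen = {pair: pick_interface_source(ev) for pair, ev in interactions.items()}
counts = Counter(chosen.values())
```

In a real interactome, the share of interactions that falls through to the final branch would correspond to the ~94% of interactions for which the text says interfaces had to be predicted.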
While there are well-established methods for predicting whether two proteins interact15,16, we focused on interactions that have been experimentally determined, but whose interfaces are unknown (Supplementary Note 1). For this task, there is a rich literature exploring the potential of many structural, evolutionary, and docking-based methods to predict protein interaction interfaces. However, so far none of these methods have been used to produce a whole-interactome data set of protein interaction interfaces (Supplementary Note 2).

We used ECLAIR (ensemble classifier learning algorithm to predict interface residues), a unified machine-learning framework, to predict the interfaces of protein interactions. ECLAIR leverages several complementary and proven classification features, including sequence-based biophysical features, structural features, and recently proposed features for predicting binding partner-specific interfaces, including coevolutionary17,18 and docking-based14 metrics (Supplementary Note 3; Supplementary Figs. 1 and 2). Unfortunately, many protein–protein interactions have missing features (especially structural features). In fact, this type of nonrandom missing-feature problem is present in many biological prediction studies and cannot be adequately resolved by commonly used imputation methods. To address this issue, ECLAIR is structured as an ensemble of eight independent classifiers, each of which covers a common case of feature availability. This unique structure of ECLAIR enables it to be applied to any interaction while using the most informative subset of available features for that interaction (Supplementary Notes 4 and 5; Supplementary Figs. 3 and 4).

Figure 1 | The current size of structural interactomes. (a) The plot shows the coverage (number of protein interactions) of known high-quality binary interactomes with precomputed cocomplexed protein structures. (b) The number of interactions from the eight largest interactomes with experimentally solved structures. [Figure not reproduced. Axes in a: species versus number of protein interactions; legend: unresolved, homology modeled, cocrystallized. Species shown: H. sapiens, S. cerevisiae, A. thaliana, D. melanogaster, C. elegans, S. pombe, M. musculus, E. coli.]

We comprehensively optimized hyperparameters for ECLAIR using a recently published Bayesian method, the tree-structured Parzen estimator approach (TPE)19, which allowed us to simultaneously tune up to eight hyperparameters for each subclassifier. We trained and tested each ECLAIR subclassifier using a set of known protein interaction interfaces, and we observed that interfaces can be predicted by the single, top-performing subclassifier available for each residue. Subclassifier performance increases with the number of features used. We observe an area under the ROC curve (AUROC) of 0.64 for our top sequence-only subclassifier and AUROC of 0.80 for our top subclassifier using both sequence and structural features. In total, we used ECLAIR to predict the interfaces of 185,957 interactions with previously unknown interfaces, including for 115,576 human interactions (Supplementary Fig. 5). Specifically, residues classified by ECLAIR with

seven model organisms and human (including all 122,647 human experimentally determined binary interactions reported in major databases; see Online Methods).

Comprehensive evaluation of predicted interfaces
We established that our predictions are of high quality through both machine learning and biological evaluation. We first evaluated the trade-offs between false-positive rate and true-positive rate and between precision and recall for each of the eight independent subclassifiers that compose ECLAIR. As expected, we find that as more informative features are added to subse-