Predicting DNA, RNA, Ion, Peptide and Small Molecule Interaction Sites Within Protein Domains Anat Etzion-Fuchs 1,Davida.Todd2 and Mona Singh 1,2,*

Published online 17 May 2021 Nucleic Acids Research, 2021, Vol. 49, No. 13 e78 https://doi.org/10.1093/nar/gkab356 dSPRINT: predicting DNA, RNA, ion, peptide and small molecule interaction sites within protein domains Anat Etzion-Fuchs 1,DavidA.Todd2 and Mona Singh 1,2,* 1Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, NJ 08544, USA and 2Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ 08544, USA Received November 12, 2020; Revised March 30, 2021; Editorial Decision April 20, 2021; Accepted April 22, 2021 Downloaded from https://academic.oup.com/nar/article/49/13/e78/6276922 by guest on 30 September 2021 ABSTRACT quence alone. Domains facilitate a wide range of functions, among the most important of which are mediating inter- Domains are instrumental in facilitating protein interactions. Knowing the types of interactions each domain actions with DNA, RNA, small molecules, ions and enables––whether with DNA, RNA, ions, small molecules peptides. Identifying ligand-binding domains within or other proteins––would be a great aid in protein function sequences is a critical step in protein function an- annotation. For example, proteins with DNA- and RNA- notation, and the ligand-binding properties of pro- binding activities are routinely analyzed and categorized teins are frequently analyzed based upon whether based upon identifying domain subsequences (6,7), and the they contain one of these domains. To date, how- functions of signaling proteins can be inferred based upon ever, knowledge of whether and how protein do- their composition of interaction domains (8). Nevertheless, mains interact with ligands has been limited to do- not all domains that bind these ligands are known, and new mains that have been observed in co-crystal struc- binding properties of domains continue to be elucidated ex- perimentally (9). tures; this leaves approximately two-thirds of hu- While identifying the types of molecules with which a man protein domain families uncharacterized with re- protein interacts has already proven to be an essential com- spect to whether and how they bind DNA, RNA, small ponent of functional annotation pipelines (10), pinpointing molecules, ions and peptides. To ﬁll this gap, we in- the specific residues within proteins that mediate these in- troduce dSPRINT, a novel ensemble machine learn- teractions results in a deeper molecular understanding of ing method for predicting whether a domain binds protein function. For example, protein interaction sites can DNA, RNA, small molecules, ions or peptides, along be used to assess the impact of mutations in the context of with the positions within it that participate in these disease (11,12), to identify specificity-determining positions types of interactions. In stringent cross-validation (13) and to reason about interaction network evolution (14). testing, we demonstrate that dSPRINT has an excel- Recently, it has been shown that domains provide a frame- lent performance in uncovering ligand-binding posi- work within which to aggregate structural co-complex data and this per-domain aggregation allows accurate inference tions and domains. We also apply dSPRINT to newly of positions within domains that participate in interactions characterize the molecular functions of domains of (15). Domains annotated in this manner can then be used to unknown function. dSPRINT’s predictions can be identify interaction sites within proteins, and such domain- transferred from domains to sequences, enabling based annotation of interaction sites has been the basis of predictions about the ligand-binding properties of approaches to detect genes with perturbed functionalities in 95% of human genes. The dSPRINT framework and cancer (16) and to perform systematic cross-genomic anal- its predictions for 6503 human protein domains are ysis of regulatory network variation (17). freely available at http://protdomain.princeton.edu/ In this paper, we aim to develop methods to comprehen- dsprint. sively identify, for all domains found in the human genome, which types of ligands they bind along with the positions within them that participate in these interactions. To date, INTRODUCTION computational methods to characterize domain–ligand in- Domains are recurring protein subunits that are grouped teractions (15,18–20) have relied heavily on structural infor- into families that share sequence, structure and evolution- mation. However, structural data are not readily available ary descent (1–5). Protein domains are ubiquitous––over for all domain families, with only a third of the domain fam- 95% of human genes contain complete instances of ∼6000 ilies with instances in human protein sequences observed in Pfam domain families (1)––and can be identified from se- *To whom correspondence should be addressed. Tel: +1 609 258 2087; Fax: +1 609 258 1771; Email: [email protected] C The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. e78 Nucleic Acids Research, 2021, Vol. 49, No. 13 PAGE 2 OF 13 any co-complex crystal structure (see Supplementary Meth- MATERIALS AND METHODS ods S1.1.3). Identifying protein domains and creating our training set For the remaining domain families, current techniques fail to identify whether they are involved in mediating in- We first curate a training set of human protein domains teractions, what types of ligands they bind, and the specific with reliably annotated ligand-binding positions. We begin positions within them that participate in binding. A method by identifying all Pfam (1) domains with matches in hu- that can identify a domain’s ligands and interaction posi- man genes. This corresponds to 6029 protein domain fam- tions without relying on structural information not only ilies that collectively cover 95% of human genes (see Sup- would be a great aid in characterizing domains of unknown plementary Figure S1). Then, to construct our training set, function (DUFs) or other domains for which interaction in- we restrict to domains that (i) have >10 instances in the hu- formation is unknown but also would enable further down- man genome to leverage aggregating information across hu- stream analyses by identifying interaction sites within pro- man genes at the domain level, (ii) have per-position bind- teins that contain instances of these domains. Indeed, at ing labels (15) as described in the next section and (iii) pass Downloaded from https://academic.oup.com/nar/article/49/13/e78/6276922 by guest on 30 September 2021 present, the proteins for only ∼13% of human genes have certain domain similarity thresholds to avoid trivial cases co-complex structures with ligands and from which inter- where multiple near-identical domains are utilized. Supple- action sites can be derived (see Supplementary Figure S1). mentary Methods S1.1.1 describes the details of this pro- Here we develop a machine learning framework, cess including how we identify domain instances using hid- domain Sequence-based PRediction of INTeraction-sites den Markov Model (HMM) profile models from the Pfam- (dSPRINT), to predict DNA, RNA, ion, small molecule A (version 31) database (1) using HMMER (28), and Sup- and peptide binding positions within all protein domains plementary Methods S1.1.2 describes the domain similarity found in the human genome. While previous methods have filtering done using hhalign (v3.0.3) from the HH-suite29 ( ). predicted sites and regions within protein sequences that Ultimately, we obtain a training set of 44,872 domain posi- participate in interactions (e.g., (21–25), and see reviews tions within 327 domains. (26,27)), dSPRINT instead predicts the types of ligands a domain binds along with the specific interaction posi- Defining structure-based training labels within domains tions within it, even for domains that lack any structural characterization. dSPRINT’s focus on predicting the Domain positions are labeled as binding RNA, DNA, ion, ligand-binding properties of domains is complementary peptide and small-molecule based on InteracDome (15), a to existing approaches that annotate interaction sites collection of binding positions across 4128 protein domain within protein sequences. To train dSPRINT, we leverage families annotated via analysis of PDB co-complex crys- InteracDome (15), a large dataset of domains for which tal structures. For a position within a domain, its Interac- binding information has been annotated based on the Dome positional score corresponds to the fraction of Bi- analysis of co-complex crystal structures, while considering oLip (30) structural co-complexes in which the amino acid a comprehensive set of features extracted from a diverse found in that domain position is within 3.6A˚ of the lig- set of sources. Our framework aggregates information at and. For each InteracDome domain–ligand type pair anno- the level of protein domain positions, and features for each tated to have instances in at least three non-redundant PDB domain position are derived from sites corresponding to structures, we define positive (i.e., binding) and negative this position across instances within protein sequences. (i.e., non-binding) positions using the pair’s InteracDome- Site-based

Predicting DNA, RNA, Ion, Peptide and Small Molecule Interaction Sites Within Protein Domains Anat Etzion-Fuchs 1,Davida.Todd2 and Mona Singh 1,2,*

Antisense RNA Polymerase II Divergent Transcripts Are P-Tefb Dependent and Substrates for the RNA Exosome

Coupling of Spliceosome Complexity to Intron Diversity

A Computational Approach for Defining a Signature of Β-Cell Golgi Stress in Diabetes Mellitus

Análise Integrativa De Perfis Transcricionais De Pacientes Com

A Genome-Wide Library of MADM Mice for Single-Cell Genetic Mosaic Analysis

Mouse Fam32a Knockout Project (CRISPR/Cas9)

Gene Regulation Underlies Environmental Adaptation in House Mice

Lack of Polymorphisms in the Coding Region of the Highly Conserved Gene Encoding Transcription Elongation Factor S-II (TCEA1)

A High-Throughput Approach to Uncover Novel Roles of APOBEC2, a Functional Orphan of the AID/APOBEC Family

Mrna Editing, Processing and Quality Control in Caenorhabditis Elegans

Oral Administration of Lactobacillus Plantarum 299V

Gene Regulation and the Genomic Basis of Speciation and Adaptation in House Mice (Mus Musculus)