De Novo Prediction of RNA-Binding Proteins Based on Short

bioRxiv preprint doi: https://doi.org/10.1101/466151; this version posted November 12, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license. TriPepSVM - de novo prediction of RNA-binding proteins based on short amino acid motifs Annkatrin Bressin 1+, Roman Schulte-Sasse 1+, Davide Figini 2+, Erika C Urdaneta 2, Benedikt M Beckmann 2∗ and Annalisa Marsico 1,3∗ 1Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, and 2IRI Life Sciences, Humboldt University Berlin, Philippstrasse 13, 10115 Berlin, Germany and 3Free University of Berlin, Takustrasse 9, 14195 Berlin, Germany In recent years hundreds of novel RNA-binding proteins (RBPs) cell cycle regulators and dual specificity DNA-RNA binders, have been identified leading to the discovery of novel RNA- including transcription factor and chromatin components (3). binding domains (RBDs). Furthermore, unstructured or dis- The discovery of these unconventional RBPs without known ordered low-complexity regions of RBPs have been identified to RNA-binding motifs suggests the existence of new modes of play an important role in interactions with nucleic acids. How- RNA binding and the involvement of RBPs in previously un- ever, these advances in understanding RBPs are limited mainly explored biological processes (10). to eukaryotic species and we only have limited tools to faithfully predict RNA-binders from bacteria. Here, we describe a sup- Intrinsically disordered regions (IDRs) are widespread in the port vector machine (SVM)-based method, called TriPepSVM, proteome and have been shown to be involved in regulatory for the classification of RNA-binding proteins and non-RBPs. functions, including direct RNA binding (11). RBPs identi- TriPepSVM applies string kernels to directly handle protein se- fied by RIC are highly enriched in disordered regions com- quences using tri-peptide frequencies. Testing the method in pared to the whole human proteome and are characterized by human and bacteria, we find that several RBP-enriched tri- low complexity, repetitive amino acid sequences. In partic- peptides occur more often in structurally disordered regions ular, a low content of bulky hydrophobic amino acids and of RBPs. TriPepSVM outperforms existing applications, which a prevalence of small, polar and charged amino acids are consider classical structural features of RNA-binding or homol- found in unstructured regions of RBPs. These amino acids, ogy, in the task of RBP prediction in both human and bacteria. such as glycine (G), arginine (R) and lysine (K), as well as Finally, we predict 66 novel RBPs in Salmonella Typhimurium the aromatic residue tyrosine (Y), form shared sequence pat- and validate the bacterial proteins ClpX, DnaJ and UbiG to as- sociate with RNA in vivo. terns among RBPs. For example RGG box binding motifs and glycine/tyrosine boxes YGG are broadly used platforms Correspondence: Annalisa Marsico and Benedikt M Beckmann. Email: mar- for RNA binding and can work alone or in combination with [email protected], [email protected] classical RBDs (11). The occurrence of IDRs within RBPs appears to be con- Introduction served from yeast to humans. In earlier work, we defined Gene regulation in eukaryotes occurs at several levels and a core set of RBPs conserved from yeast to human and iden- involves the action of transcription factors, chromatin, RNA- tified [K]- and [R]-rich tripeptide repeat motifs as conserved binding proteins (RBPs) and other RNAs. RBPs and mes- across evolution, making IDRs plastic components of RBP senger RNAs (mRNAs) form ribonucleoprotein complexes co-evolution (7). Importantly, we proposed that the number (RNPs) by dynamic, transient interactions, which control of repeats in IDRs of RBPs considerably expands from yeast different steps in RNA metabolism, such as RNA stability, to human, while the number of RBDs remains the same, link- degradation, splicing and polyadenylation. Numerous dis- ing repeat motifs in IDRs of RBPs to the functional complex- eases, such as neuropathies, cancer and metabolic disorders, ity of regulation in higher eukaryotes. have also been linked to defects in RBPs expression and func- Research over the past two decades has revealed extensive tion (1–3). post-transcriptional control in bacteria as well, with regula- Technological advances such as RNA Interactome Cap- tory networks comprising RBPs and small non-coding RNAs ture (RIC) has enabled proteome-wide identifications of (sRNAs). Advances in RNA sequencing methods have en- RBPs (4). RIC utilizes UV cross-linking to induce stable abled the discovery of new bacterial RBPs and showed that RNA-protein interactions in living cells, followed by poly(A) bacterial RBPs are relatively simple, often possessing only RNA selection via magnetic oligo d(T) beads and subse- a few or even a single RBD per protein, which recognizes a quent protein identification by mass-spectrometry. RIC stud- short RNA sequence (12). This is unlike eukaryotic RBPs, ies yielded hundreds of novel RBPs in e.g. human HeLa whose modular architecture enables versatility and combina- (5), HEK293 (6), Huh-7 (7) and in K562 (8) cells but also torial RBP-RNA interactions. Classical RBDs, such as the in worm and yeast (7, 9), which do not harbor canonical S1 domain, cold-shock domain, Sm and Sm-like domains, RBDs, as well as factors which were not previously asso- the double-stranded RNA binding domain, the K-homology ciated with RNA biology. Among them we find enzymes, domain and others, are widespread among bacteria (12). One Bressin | bioRχiv | November 12, 2018 | 1–11 bioRxiv preprint doi: https://doi.org/10.1101/466151; this version posted November 12, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license. third of annotated bacterial RBPs are ribosomal proteins. Other bacterial RBPs are involved in regulatory functions of transcription termination, RNA synthesis, modification and translation, like the well-known Hfq protein. The latter as- sociates to sRNAs and leads to translational inhibition of the targeted mRNA (13). Since interactome capture relies on the use of oligo(dT) to isolate proteins bound to mRNA, the method is not applica- ble to bacterial species which generally lack polyadenylation (except for degradation purposes). Although experimental approaches have been developed recently to extend RBP cat- alogs and to include RBPs not identified by RIC techniques (14, 15), bacterial RBPs remain poorly annotated. There- fore, computational approaches able to predict new RBPs in both pro- and eukaryotes in the absence of experimental data and in a proteome-wide manner are in high demand, in or- der to narrow down the list of putative candidate RBPs for further experimental investigation. Several in silico methods have been developed to predict RBPs from primary sequence Fig. 1. TriPepSVM schematic. TriPepSVM is a support vector machine trained and/or protein structure (16–19). Approaches that character- on tri-peptide frequencies from RBPs and non-RBPs to discriminate between ize RBPs by predicting RNA binding residues from known these two protien classes. It can be applied to different species. protein-RNA structures are computationally expensive and can be trained only on a small subset of RBPs with known us to explore the possibility that RBPs might be confidently structure. Such methods do not generalize well to proteins predicted based purely on the occurrence of short amino acid whose structure is still unknown, which is the case for most k-mers. bacterial RBPs. We set up to predict whether a protein is likely to be an RBP Computational prediction tools such as SPOT-Seq (16), or not based on primary sequence using a string kernel (spec- RNApred (20), RBPPred (18), catRAPID (17) and APRICOT trum kernel) support vector machine (SVM). The spectrum (19), either derive sequence-structure features such as bio- kernel in combination with SVMs was first successfully in- chemical properties and evolutionary conservation and train troduced by Leslie et al. in the context of protein family a supervised classifier to distinguish RBPs from non-RBPs or classification (22). Support vector machine classifiers have classify proteins based on whether they harbor a known RBD. been used successfully in other biological tasks, for example Gerstberger et al. define a set of 1542 human RBPs based on the identification of specific regulatory sequences (23) and in whether those proteins harbor known RBDs or other RNA RBP prediction as well, for example in RBPPred, which cre- binding motifs via both computational analysis and manual ates a feature representation based on several properties of curation (1). However, in silico methods that identify RBPs the protein. based on known domains might have some limitations, as one In this paper we describe our newly developed RBP predic- third of RBPs from recent experimental studies do not have tion method TriPepSVM. It applies for the first time string prior RNA-binding related homology or domain annotation. kernel support vector machines for RBP prediction. The Therefore such methods might generate a high percentage of model uses exclusively k-mer counts to classify RBPs ver- false negatives, i.e. RNA-binders which lack an RBD and sus non-RBPs in potentially any species. It does not depend therefore are not predicted as such, as well as false positives, on any prior biological information such as RBD annotation RBPs with classified RBDs that perform non-RNA binding or structure, and therefore allows RBP prediction in an un- functions (3). A recently developed method, SONAR, adopts biased manner. We show that TriPepSVM performs better a completely different approach for RBP prediction and ex- than other methods in RBP prediction in human and the bac- ploits the fact that proteins that interact with many others terium Salmonella Typhimurium.

De Novo Prediction of RNA-Binding Proteins Based on Short

Chronic Exposure of Humans to High Level Natural Background Radiation Leads to Robust Expression of Protective Stress Response Proteins S

Visualization of Human Karyopherin Beta-1/Importin Beta-1 Interactions

In Vivo Gene Expression in Granulosa Cells During Pig Terminal Follicular Development. A

1 Title: Ultra-Conserved Elements in the Human Genome Authors And

Hnrnp U (HNRNPU) (NM 031844) Human Tagged ORF Clone Product Data

Wilms' Tumour 1 (WT1) in Development, Homeostasis and Disease

Genome-Wide Analysis of Organ-Preferential Metastasis of Human Small Cell Lung Cancer in Mice

Molecular Targeting and Enhancing Anticancer Efficacy of Oncolytic HSV-1 to Midkine Expressing Tumors

Coexpression Networks Based on Natural Variation in Human Gene Expression at Baseline and Under Stress

Targeted in Situ Cross-Linking Mass Spectrometry and Integrative Modeling Reveal the Architectures of Three Proteins from SARS-Cov-2

Supplementary Data

Cytogenomic Characterization of 1Q43q44 Deletion Associated with 4Q32.1Q35.2 Duplication and Phenotype Correlation A