Degsampler: Gibbs Sampling Strategy for Predicting E3-Binding Sites with Position-Specific Prior Information

DegSampler: Gibbs sampling strategy for predicting E3-binding sites with position-specic prior information Osamu Maruyama ( [email protected] ) Kyushu University https://orcid.org/0000-0002-5760-1507 Fumiko Matsuzaki Kyushu University Research article Keywords: motif, substrate protein, E3 ubiquitin ligase, prior, posterior, disorder, collapsed, Gibbs sampling, DegSampler Posted Date: November 6th, 2020 DOI: https://doi.org/10.21203/rs.3.rs-101967/v1 License: This work is licensed under a Creative Commons Attribution 4.0 International License. Read Full License Maruyama and Matsuzaki RESEARCH DegSampler: Gibbs sampling strategy for predicting E3-binding sites with position-specific prior information Osamu Maruyama1* and Fumiko Matsuzaki2 Background Eukaryotic cells have two major pathways for degrading proteins in order to control cellular processes and maintain intracellular homeostasis. One of the pathways is autophagy, a catabolic process that deliv- ers intracellular components to lysosomes or vacuoles Abstract [1]. The other pathway is the ubiquitin-proteasome Background: The ubiquitin-proteasome system is system, which degrades polyubiquitin-tagged pro- a pathway in eukaryotic cells for degrading teins through the proteasomal machinery [2]. In the polyubiquitin-tagged proteins through the ubiquitin-proteasome system, an E3 ubiquitin ligase proteasomal machinery to control various cellular (hereinafter E3) selectively recognizes and binds to processes and maintain intracellular homeostasis. specific regions of the target substrate proteins. These In this system, the E3 ubiquitin ligase (hereinafter binding sites are called degrons [3]. E3) plays an important role in selectively It is important to identify the E3s and substrate recognizing and binding to specific regions of its proteins that interact with one another to character- substrate proteins. The relationship between a ize various cellular mechanisms. Information on known substrate protein and its sites bound by E3s is not degrons has been accumulated in the eukaryotic linear well understood. Thus, it is challenging to motif (ELM) resource for functional sites of proteins computationally identify such sites in substrate [4]. In the database, degron motifs are grouped into proteins. the “DEG” class. In addition, in the E3Net database [5], the relationship between E3s and their associated Results: In this study, we proposed a collapsed substrate proteins have been compiled. To date, only Gibbs sampling algorithm called DegSampler 55 of the 837 identified substrate proteins have been (Degron Sampler) to identify the binding sites of annotated with DEG motifs. In other words, 93% of E3s. DegSampler employs a position-specific prior the 837 known substrates are not annotated with any probability distribution, based on the estimated degrons. Thus, it is necessary to identify degrons of information of the disorder-to-order region bound substrate proteins bound by specific E3s. by any protein. In literature, many sequence motif finders have Conclusions: Our computational experiments been presented, such as MEME [6], Gibbs Motif show that DegSampler achieved 5 and 3.5 times Sampler [7], and GLAM2 [8]. MEME is a popu- higher the F-measure values of MEME and lar position-specific scoring matrix (PSSM) finding GLAM2, respectively. Thus DegSampler is the first method based on expectation-maximization (EM) (see model demonstrating an effective way of using http://meme-suite.org/index.html). Gibbs Motif estimated information on disorder-to-order binding Sampler and GLAM2 are Gibbs sampling-based meth- regions in motif discovery. We expect our results to ods. One way of further improving motif identification improve further as higher quality proteome-wide is to exploit prior information on sequence positions. disorder-to-order binding region data become Narlikar and colleagues proposed a Gibbs sampling- available. based method called PRIORITY, which showed the effectiveness of the position-specific prior information Keywords: motif; substrate protein; E3 ubiquitin ligase; prior; posterior; disorder; collapsed; Gibbs *Correspondence: [email protected] 1 sampling; DegSampler Faculty of Design, Kyushu University, Shiobaru, Minami-ku, Fukuoka, Japan Full list of author information is available at the end of the article Maruyama and Matsuzaki Page 2 of 14 based on nucleosome occupancy for motif discovery species: Homo sapiens (human), S. cerevisiae (yeast), in deoxyribonucleic acid (DNA) [9]. In general, prior Arabidopsis thaliana (mouse-ear cress), Mus muscu- information can be easily integrated into the poste- lus (mouse), Rattus norvegicus (rat), and Drosophila rior probability distribution of motifs by representing melanogaster (fruit fly), and three viruses, human her- it as a prior probability distribution. Inspired by their pesvirus 1 (HHV11), human papillomavirus type 16 success, Bailey and colleagues extended the EM-based (HPV16), and human herpesvirus 8 type P (HHV8P). MEME algorithm to use position-specific priors and The average number of substrates assigned to an E3 is reported their effectiveness in identifying transcription 8.5. factor-binding sites in yeast and mouse DNA sequences ELM is a database of short eukaryotic linear motifs [10]. that are manually curated from literature; this includes However, to the best of our knowledge, there is no 3,026 experimentally validated motif occurrences on model in which position-specific prior information is 1,886 proteins. These linear motifs are grouped into integrated in Gibbs sampling algorithms to identify six functional classes: cleavage (CLV) sites, degrada- motifs in protein sequences. Thus, we propose a Gibbs tion (DEG) sites, docking (DOC) sites, ligand (LIG)- sampling-based method called DegSampler (Degron binding sites, post-translational modification (MOD) Sampler) to identify the locations of degrons in a given sites, and targeting (TRG) sites. The DEG class mo- set of substrate protein sequences, which are bound by tifs are equivalent to degron motifs. Thus, the output E3s. of motif finders was evaluated using these DEG motifs. One of the candidates of information sources for This class has 25 motifs, as shown in Table 1. Some position-specific prior distributions is the disorderness of the motifs are too short to be predicted. Further- of each residue in a protein sequence. This inference more, only 36 of the 123 E3s in the E3Net database is based on the observation that the E3-binding sites include at least one substrate that is annotated with of substrate proteins are often located within the dis- a DEG motif in ELM (human 29; yeast 3; mouse-ear ordered regions of the protein structure [11, 12]. The cress 1; mouse 1; rat 0; fruit fly 0; HHV11 1; HPV16 1; disorderness for a protein sequence can be calculated and HHV8P 0). The remaining 87 E3s have no DEG by IUPred1 [13] and IUPred2 [14]. motif annotation on the substrate proteins. Thus, we Furthermore, we also examined the information on used these 36 E3-specific sets of substrate proteins as disorder-to-order binding regions in disordered pro- input for motif finders to evaluate the performance of teins as a source of position-specific prior distribution. DegSampler. The estimated information can be obtained using pre- Although E3Net has 837 E3-specific substrates, only diction tools, including ANCHOR [15, 16], ANCHOR2 [14], MoRFpred [17], MFSPSSMpred [18], and DISO- 199 substrates are annotated with some ELM mo- PRED3 [19]. Typically, the output of the tools is the tifs. Furthermore, only 55 substrates are annotated probability of each residue being in a disorder-to-order with the DEG motifs. In other words, 93% of the 837 region bound by any protein. These probabilities are substrates were not annotated with any DEG motif. expected to work better than disorderness as position- Therefore, the motif occurrences predicted by reliable specific prior information in identifying the locations motif finders are good candidates for degrons. of degrons. We have evaluated the efficiency of position-specific Source of position-specific prior information prior information related to the disorderness and Here, we describe the source of the position-specific disorder-to-order binding affinity of amino acid residues prior information used in this study. We prepared in identifying degrons. We have also compared the per- position-specific priors derived from IUPred1 [13] and formance of DegSampler with those of other popular IUPred2 [14], which provides the probability of each tools, viz. MEME and GLAM2, on real data sets. We residue belonging to a disordered region. IUPreds have found that DegSampler significantly outperforms the three prediction types, ”short disorder”, ”long disor- others. der”, and ”structured domain.” We selected the short and long disorder types. The outputs are denoted Materials by IUPred1-short, IUPred1-long, IUPred2-short, and Sequence data sets and known motifs IUPred2-long. The E3Net database [5] identified 491 E3 ubiquitin Concerning the prior information on disorder-to- ligases (E3s) associated with some known substrate order binding regions of protein sequences, we used proteins. In this research, we selected all E3s with at three prediction tools, ANCHOR [15], which is des- least three or more known substrate proteins. Con- ignated as ANCHOR1 to distinguish from its succes- sequently, we have extracted 123 E3s that are asso- sor ANCHOR2, ANCHOR2 [14], fMoRFpred [20], and ciated with 837 different substrate proteins from six DISOPRED3 [19]. The

Degsampler: Gibbs Sampling Strategy for Predicting E3-Binding Sites with Position-Specific Prior Information

Representation for Discovery of Protein Motifs

Introduction to Sequence Motif Discovery and Search the Complete

Tools for Motif and Pattern Searching

Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D

Sequence Conservation

Alignment, Clustering and Extraction of Structured Motifs in DNA Promoter Sequences

ER-Resident Transcription Factor Nrf1 Regulates Proteasome Expression and Beyond

Lecture 7: Sequence Motif Discovery

Novel Sm-Like Proteins with Long C-Terminal Tails and Associated

Lecture 9: Protein Sequence Profiles and Motif Applications

Difflogo: a Comparative Visualization of Sequence Motifs Martin Nettling1*†, Hendrik Treutler2†, Jan Grau1, Jens Keilwagen3, Stefan Posch1 and Ivo Grosse1,4

What Are DNA Sequence Motifs?