<<

Computational prediction of RNA structural motifs involved in posttranscriptional regulatory processes

Michal Rabani*, Michael Kertesz*, and Eran Segal*†

*Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100, Israel

Edited by Sean R. Eddy, Howard Hughes Medical Institute, Janelia Farm, Ashburn, VA, and accepted by the Editorial Board August 6, 2008 (received for review April 12, 2008) Messenger RNA molecules are tightly regulated, mostly through Here we develop RNApromo (RNA prediction of motifs), a new interactions with and other RNAs, but the mechanisms that computational method to identify short structural RNA motifs in confer the specificity of such interactions are poorly understood. It is sets of long unaligned RNAs. Using our method, we identify clear, however, that this specificity is determined by both the nucle- putative motifs in sets of mRNAs with substantial experimental otide sequence and secondary structure of the mRNA. Here, we evidence for a common posttranscriptional regulation, and we develop RNApromo, an efficient computational tool for identifying support these findings with cross-validation analysis. In some cases, structural elements within mRNAs that are involved in specifying sequence conservation of the putative motifs provides strong posttranscriptional regulations. By analyzing experimental data on independent support for our findings. The identified motifs include mRNA decay rates, we identify common structural elements in fast- motifs for mRNAs with similar decay rates, mRNAs that are bound decaying and slow-decaying mRNAs and link them with binding by the same RBP, and mRNAs with a common cellular localization. preferences of several RNA binding proteins. We also predict struc- Additionally, analysis of pre-microRNA structures identified dif- tural elements in sets of mRNAs with common subcellular localization ferences between animals and plants in the sizes of stem and loop in mouse neurons and fly embryos. Finally, by analyzing pre- structures of pre-microRNAs, suggesting that animals and plants microRNA stem–loops, we identify structural differences between use different mechanisms for microRNA biogenesis. pre-microRNAs of animals and plants, which provide insights into the mechanism of microRNA biogenesis. Together, our results reveal Results and Discussion unexplored layers of posttranscriptional regulations in groups of A New Computational Scheme for RNA Motif Discovery. Several tools RNAs and are therefore an important step toward a better under- exist for finding local structural motifs in a set of long input RNAs. standing of the regulatory information conveyed within RNA mole- Many existing works use free energy minimization considerations cules. Our new RNA motif discovery tool is available online. for predicting local motifs, by applying one of three major schemes: use of a sequence-based local alignment to build a motif consensus ͉ motif prediction ͉ posttranscriptional regulation ͉ structure (12, 13); identification of common structures in the RNAs’ RNA secondary structure ͉ SCFGs predicted minimal free energy folds (14–16); or simultaneous alignment of the RNAs and prediction of their secondary structure (17–20). Recently, some graph theoretical techniques were also NA molecules undergo diverse posttranscriptional regulation proposed for this task (21, 22). Yet some of the most successful of expression, including regulation of RNA transport and R approaches for motif discovery are based on advanced probabilistic localization, mRNA translation, and RNA decay (1–3). In many models, which are highly suitable to capture the observed variation cases, such posttranscriptional regulation occurs through elements in the input set. on the mRNA molecule that interact with the hundreds of RNA Stochastic context-free grammars (SCFGs) are a class of prob- binding proteins (RBPs) that exist in the cell (4). A well-known abilistic models proposed for modeling common sequence and example is the iron-responsive element (IRE), a secondary struc- structure in a set of input RNAs (23, 24), which replaced the ture RNA motif located on UTRs of members of the iron metab- thermodynamic considerations in several existing RNA motif dis- olism and transport pathway (5). The binding of the RBPs Irp1 and covery tools (25, 26). However, currently available SCFG applica- Irp2 to IRE elements affects the translation rate of the mRNA, and tions optimize the model’s parameters by essentially considering all by that coordinates the response to changing levels of iron in the possible secondary structures of the input sequences. This approach environment. Other examples include a 118- stem–loop is successful mainly when it is possible to exploit covariation structure through which mRNAs are transported to the yeast bud between organisms to infer the secondary structure at the motif tip by the RBP She2 (6) and the RBP Sbp2 that is involved in position. However, when one wishes to identify short motifs, using mediating UGA redefinition from a stop codon to selenocysteine data from a single organism, covariation data cannot be used. CELL BIOLOGY by binding specific stem–loop structures, termed selenocystein Moreover, enumerating all possible secondary structures results in Ј insertion site (SECIS) elements, in the 3 UTR of selenoproteins a rather high time complexity of the algorithm, making it unfeasible (7). In other cases, elements on the mRNA molecule interact with to scan large RNA sets or long RNA sequences. other RNAs that direct the regulatory effect. For example, the Here, we devised a new SCFG-based method for finding local recognition and binding affinity of a microRNA to its mRNA target motifs in a set of unaligned RNAs, which restricts the search space is determined by both the sequence and structure of the target to a predefined and limited number of structures for each input mRNA (8–11). RNA. Yet which structures should be considered by the algorithm? The examples above suggest that the posttranscriptional regula- tion of mRNAs is determined not only by its linear nucleotide sequence but also by its secondary structure. Thus, a key goal is to Author contributions: M.R., M.K., and E.S. designed research, performed research, contrib- understand the involvement of mRNA secondary structures in such uted new analytic tools, analyzed data, and wrote the paper. regulation. One approach is to identify recurring patterns, termed The authors declare no conflict of interest. motifs. However, linear sequence motifs, which are commonly This article is a PNAS Direct Submission. S.R.E. is a guest editor invited by the Editorial Board. found in DNA sequences, are not suitable in this case. Instead, we †To whom correspondence should be addressed. E-mail: [email protected]. wish to identify motifs that combine primary and secondary struc- This article contains supporting information online at www.pnas.org/cgi/content/full/ tural elements and are therefore better suited to describe functional 0803169105/DCSupplemental. elements in RNA molecules. © 2008 by The National Academy of Sciences of the USA

www.pnas.org͞cgi͞doi͞10.1073͞pnas.0803169105 PNAS ͉ September 30, 2008 ͉ vol. 105 ͉ no. 39 ͉ 14885–14890 Downloaded by guest on October 1, 2021 B A real (Rfam) sets Predicted Motif Known Motif Selected Instances Conservation C G C A C A C C U C C U G C A A HIST2H2AA C G C

G C A C G A A U C G A C profile

permuted sets Set size = 65 G C A A A U C C C A U U U G

A A U G U

G C A U U C U

C U C C C C U G

C C C A G C U U C U C G A G C G C C C G G C C

AUC = 0.95 G G G A C G G C G

G A G G C C A A U A

-16 A A A HIST4H4

C G A A G C

C G A A C

A p < 4.5x10 U

HIST1H2ABC

0.3 G A A HIST3H2A C A U C

U G

C G

G A A

U A C A G A C

C C

C C U

U A G G

C G

Similarity: C A C C

U C C C

C U A

A A C U

C G G G A C G U G

C G U C U G

A A

A G G U A G U U avg=49%, C U G

U U G A A C G

A C C U A C A

U A A C A U G U A C C C A C

U G C C C U

C C U G

max=85% U G G A

C U U U U

U U

U

C

U C U C G G C

DFH G C

U C

C C

G

U C G

A C

U A A A

A A G G G C

U

G C G

C G G C A HIST2H3C C

A G G

A C U

C G

U A A U U C

A

U U U C C

C A A U

HIST1H2BG G C -16

A G U

A C

U A U U U UU A A U U U p < 4.5x10

A G A G C 0.2 C C U U A A C A U G U C U A U U U U A U G A G U 2-5 bp U C U A C Set size = 10 C C A A U C U G TFRC A C A G C C G C C G A A A G G C C C C U C G A U G G G A C A A UC G G A C G AUC=0.86 U A C GA U U A C G A C U U U U U C C G A G AU U C C G A G -4 U C C A A A A A C U G A A G G C G G C p < 6.5x10 A A C C U U U G C A G G % of gene sets C U C U U U A A U G C C G C U G C U U C U C G G G C C A A U C C C C C U A G A C FTH1 G A C U U A A C C U U U C A A C CA G C C C U C G G U A A C A G C Similarity: G A G C C G C G G A C G G U G U C C G G C U C U A A G C GG G U A C U C C A U C U G C U G C U G IREG1 C U U A G C C C avg=10%, 4-5 bp U A G G G C A U U A G A C C C U A A A C G U C U G C C U C G U C A C G G 0.1 max=51% U A U A C C G U G C G G U A U U G C C A U C G U C FTL C C C C G U C C C A U U U C U A G G C A C C A G G G A A CAG A G A G A U U U C G C U A U C A U C C C UUC C U C A U U G A C G C G C A U A C A G C G A G U G U G A G A A U U A C C G A G A U U G C C C U U A C U C U U G U U G A C G U U U G C U U G A U C G C U G U A U G G C C C A C A C G G U C U U U C U G A AA C G G A C -4 A A U A U G A U U G ER G U G p < 1.4x10 A G U U C G ACO2 C U G SLC11A2 U A G C A G I G U ALAS2 C pb C 8 A A 4- C A A G G A C A A C G G U U G U G GPX1 G C C 0 U U G A G C

C C C G

U C U U G G A U

C A A GPX2 U U G U

U G G C U G A A G G U C U U

G A G G A G G C C

A G G G C U U U G U

A C A G G A A

A C G A C A U G C G U G G 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 A A C

G A C U C U G G

U U C A C G

U U G A C A C C C C

G C

G A

G G G G U U U G G A G U A C A C

A C U G C

A C U G U C G A A A C C

AUC A A C C U A U A

C G A A G G U U

A U C A U C U A

C A G C U U C A

G C

A U G

C

U A

C

C

A A G

U C

G U A G U

A

C G U

G U

G G

A A C U C C C

A A U C G C A G A C C U G U

U G C G A C

U SELM C U C C C

A G G G G A G A U

A

G C U

A A A G G

C G

G G C C A G A

G G C

A A

G G C

pb 0pb U U A

G U

C G

G C

C G G C G U U U SPS2

C G C G

C

G

A CC U C G

G 1 U C G

- A C

C C G C 5 C C

A G C G C C U

U

A G G U

G Set size = 15 G U G A U G A A

U C C

A G C G G

C U A C A U

G G

C G

U U

G A

A G A C C G C U

A G U C C U

U A A C

U

G C

C U C

U structure AUC=0.59 C

U U

G C

G

C G U U

C

G A A

G U A G U G

A G

C

U G A G

U A A G A C

A C

U U U

C

p < 0.05 G A G G

G

A C C A SELR

G

G U

C A

G

U C

U

A A U U

U C U

A G C C

A C C A C

A C A A

G

A G U G

A

U U A A

SICES A G A

A A G G

G

Similarity: C U G U U C U U

sequence C C U A U

G C G A U U U

G U C

G A

U C G U G A t U G C U

n A G A U C C

0 G A G 1 U A A

A avg=14%, -0 G A

U U U A U U G C p < 0.05 G

U

SELP C

A

A A U A prob. 0 0.5 1 max=48% A Fig. 1. Validation of motif discovery scheme. (A) Distribution of AUC scores for motifs in the true (dark blue) vs. permuted (light blue) sets (using ViennaRNA predicted folds). AUC scores of permuted sets are distributed around 0.5, whereas AUC for motif-containing sets are usually higher. (B) Prediction of three known human motifs: HFD (Top), IRE (Middle), and SECIS (Bottom). Motifs are represented using a structural logo of the motif’s most probable structure. Specific positions are color coded according to their probability (green-to-red scale for sequence, and gray scale for structure). Shown is the predicted motif logo and the known consensus structure (Left), with several examples of correctly classified motifs (Right). The motif position is annotated in green, and the 5Ј end of the motif is circled. The FTH1 UTR (in the IRE part), which was misfolded at the IRE loop, is shown in red. Sequence conservation profile (average across all of the motif instances) is also shown (Rightmost Column).

Ideally, experimental information about RNA structure can be include randomly selected sequences from all of the validation sets. used, and only the few structures that are consistent with such Because we do not expect to identify motifs in these random sets, structural data are considered. Even though such experimental their AUC scores should be close to 0.5, indicating that in the structural information currently exists for only a small number of absence of a common motif in the input set, no signal is detected. cases, such information may soon be available. Alternatively, using The results (Fig. 1A) show that whereas the AUC scores of the existing thermodynamic-based secondary structure prediction pro- random sets are indeed distributed around 0.5, with only 5% of grams (14, 27) (with ϳ50%–90% prediction accuracy [28]) for the scores above 0.6, more than 60% of the AUC scores of the predicting a small set of thermodynamically stable folds and re- true sets are above 0.6. This indicates that in most cases, stricting the algorithm to those, allows both reduction of the search RNApromo indeed detects the biological signal when it is space and integration of thermodynamic considerations into our present, yet it rarely detects such a signal if there is no common model, which are not fully embedded into standard covariance motif in the input RNAs. Similar results were obtained using a model applications. Certainly, other structure prediction tools, such different method to predict the structure of the input RNAs (29) as those based on probabilistic models (29), can also be used to (see SI). derive a set of highly probable folds as an input to our method. Repeating the same analysis with other available tools (26) Our algorithm, called RNApromo, takes as input a set of RNA shows that RNApromo is compatible with these tools in terms of sequences assumed to share a motif, and their suggested secondary results but requires shorter running times (see SI). This result is structures. The algorithm first identifies specific and relatively short notable, given that the Rfam database includes motif instances candidate structures that appear in as many inputs as possible. from several organisms, and in such a setting, tools that exploit These candidates are then used as seeds for a probabilistic inference covariation information have an advantage over RNApromo. algorithm that refines the predicted motif using statistical estima- Next, we tested our method’s ability to identify and correctly tion (see supporting information (SI) for a full description). describe known biological motifs when covariation data is minimal. To examine the performance of RNApromo, we first tested its We analyzed three well-known motifs from human: histone-fold ability to identify known RNA structural motifs in a large collection domain (HFD), iron-responsive element (IRE), and seleno-cystein of validation sets from the Rfam database (30). We used a fivefold insertion site (SECIS), each sharing a different level of sequence cross-validation scheme, in which we partition the input set into five similarity. Although three examples cannot provide global statistics, parts, learn a model from each of the possible combinations of four they can still allow a better evaluation of our method’s abilities in sets, and use this model to assign likelihood scores to the RNAs that such a setting. were held out while learning it. We use the standard receiver Using RNApromo, we detected a motif in all three sets (Fig. 1B): operating characteristic (ROC) curve and its associated area under HFD (AUC ϭ 0.95, P Ͻ 5 ϫ 10Ϫ16), IRE (AUC ϭ 0.86, P Ͻ 7 ϫ the curve (AUC) measure to evaluate the significance of the input 10Ϫ4), and SECIS (AUC ϭ 0.59, P Ͻ 0.05). The predicted con- RNAs’ likelihood scores compared with shuffled sequences. We sensus structures are highly similar to those described in the filtered each set to include only sequences with Ͻ90% sequence literature and usually match the known motif positions (94% of the similarity, because high sequence similarity may produce high AUC HFD instances, 60% of the IRE instances, and 47% of the SECIS scores even when a functional motif is not present. The maximal instances). In the misclassified cases, the motif position was not sequence similarity in our validation sets ranges from 30% to 90% folded into the correct structure by the prediction algorithm. Note (see SI for details). Although high AUC scores indicate that the that the SECIS consensus structure includes two noncanonical G-A input RNAs share a biological signal, we still have to validate that base pairs, which standard folding algorithms cannot predict. this signal results from a common motif. To test that, we applied the Nonetheless, RNApromo identifies a motif in this set and predicts same motif discovery scheme to a collection of random sets that its structure quite accurately. The predicted SECIS motif obviously

14886 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0803169105 Rabani et al. Downloaded by guest on October 1, 2021 G U G U U U U U A U A A C G G U C C G A U A A A C C G G U G U A C U U U U G U U YBR291C C G G U G U U A C A AG C G A U A C U U G C A U A G G U C A G G A A C A U U A G A C YOR310C A A C A U A G U U G G C A A C A U G A U yaceDtsaF UU G A U U C G U U A A U A A A A A C U U A C A G C U YEL077C C C A G A G U G A U C A U A G A G C U A G C C G U G U U G A A C U U U A U G A U U U U A G U A U G C A U

U G G U A U A C G A A A A 3’UTR U G A A U U G U G A U A G U C C U A A U A A G U U U A U G A G C G A A G U Set size = 75 G A U G A A A A U A G U U A U U U U -3 A A U U C p < 4.9x10 U C G C U U U U G C A U AUC = 0.61 U A C U U U A A G A

U A p < 10 -3 YGR143W A G A G U YDR449C G A A U A U U G A A Half Life <= 6 min A U U U

similarity: avg=25%, max=90% U G

C G U C U U A

U C U A A

U 3’UTR U U A G U C

U A C U U

U A G U A

A

U U A A

B Set size = 240 U U A

YAL037W A U U

A U U U A

U

C

A A

U

A U G

G C

U

A A A G U U U U A

AUC = 0.58 U A U A

U A G

U U

G

U U U A A C

U

C A

A A G U A C

G G U

U -4 U U U U G A

U

U G A

p < 8x10 U U U

U

A

A U

C G

U U A G

A C U

U A U

U A

U A A C

yaceDwolS G

U C

U C A G A C

A U A A C

U A U

U Half Life >= 60 min U C A U C C

G

U

U U A

U U A

C

U U C

C U A U A C U A U G

A A U

G U A U G

similarity: U U U U A U U

A A U A C U C

U U C C G U C A

C A

U U U C U A U U C U U U U

avg=24%, max=88% U G U U U U U U A U A U G U

U U C A C A U

U U U G

U U U U A G U U

U U U A A G A U C C U G

U A U A U U UAUUA

G A A U U U C U C

G U U A U A U

U C C U U A U YDL222C U

YPR045C A C U A U

U U U

G C YKR091W U U A C C U

U U U A A A U U

U U

C A G

U U

U A U

U YDR273W U U C C U

U U U U

C U

U A U

U U U C

U U U

U C U C U G U U U U U U

0.4 Puf3-5 C U D P fu5 U Puf4 Puf3 1buP Sam68 U pub1 3’UTR 3’UTR 3’UTR 3’UTR 3’UTR U Set size = 191 Set size = 210 Set size = 345 Set size = 8 Set size = 192 U U U AUC = 0.58 A AUC = 0.73 AUC = 0.59 AUC = 0.57 AUC = 0.60 0.3 U -5 -3 -3 -5 p < 8x10 p < 2x10 p < 2x10 -5 p < 2x10 p < 4x10 A U similarity: avg=24%, similarity: avg=24%, similarity: avg=24%, similarity: avg=23%, C similarity: avg=27%, G max=58% A max=61% max=69% U max=48% max=82% U C 0.2 U U A % of U U U U C C

A U A U C 0.1 U U C A C U U U U U A U U U U U G U U U A U U G 0.0 -5 C U U U p < 9.9x10 U U U p < 4.3x10-10 0 5 10 15 20 25 30 U U U Half Life Fig. 2. mRNA decay motifs. (A and B) Fast- and slow-decaying mRNAs. Shown is the predicted motif (Left) and the sequence conservation profile, along with top-scoring motif examples (Right). The motif position is annotated in green, and the 5Ј end of the motif is circled. (C) Predicted motifs for RBPs mRNA targets. Shown are the motif consensus and sequence conservation profile. (D) Half-life distribution for the top 20% of targets of the Puf proteins (blue line) and Pub1 (red line). Puf targets have faster decay rates than Pub1 targets.

does not include the noncanonical base pairs, yet the sequence independent support for our findings, we evaluate the statistical specificity of these positions is detected. In two cases, the sequence significance of the average sequence conservation at the predicted conservation at the motif positions is also remarkably high: HFD motif positions. Because the learning process itself is done on (P Ͻ 5 ϫ 10Ϫ16) and IRE (P Ͻ 2 ϫ 10Ϫ4). In the third case, the mRNAs from a single organism, no evolutionary information is conservation signal is relatively weak, demonstrating that structural used during the training process. Therefore, a significant level of motifs are not always associated with high sequence conservation. conservation provides an independent biological signal that further Repeating the analysis of those three sets using CMfinder (26) supports the findings and suggests that the motif that was identified results in the identification of only two of the three motifs (the in one organism could be functional in additional organisms. SECIS motif is not identified). Moreover, the HFD consensus However, motifs with low sequence conservation may still be includes only a 5-bp (rather than 6-bp) stem (see SI). Because the functional, either because they are conserved at the structure level comparison is done only on three sets, we cannot draw definite rather than the sequence level or because they are specific to the conclusions but only suggest that in the absence of covariation data, tested organism. RNApromo may be better suited to predict RNA structural motifs. Recently, genome-wide decay rates of yeast mRNAs were mea- Taken together, these examples demonstrate the ability of sured (3), but the investigators were unable to detect any significant RNApromo to correctly identify and characterize the right relationships between the measured mRNA half-lives and codon (short) RNA motif from an input of (long) unaligned RNAs, usage, ORF lengths, or primary sequence motifs. It is therefore including both its structure and sequence elements, without any possible that mRNAs with similar decay rates exhibit a common CELL BIOLOGY prior knowledge of its length, location, or structure, and in the motif through which their costability is controlled. To test this presence of noise in the folding input. Importantly, RNApromo hypothesis, we created sets of genes with similar decay rates and also performs well with minimal covariation information and applied our motif discovery scheme to them. therefore can be used to predict motifs for a single organism. Intriguingly, we identify a motif in 3Ј UTRs of 75 mRNAs with a measured short half-life of Յ6 min (AUC ϭ 0.61, P Ͻ 10Ϫ3). The Motifs Involved in Modulating mRNA Decay Rates. Having validated predicted motif sequence is AU rich and folds into a stem–loop that our computational scheme can detect known biological motifs, structure with a relatively short loop (Fig. 2A). Furthermore, this we used it to predict novel motifs. We applied RNApromo to 3Ј and motif shows a high conservation profile (P Ͻ 5 ϫ 10Ϫ3), as an 5Ј UTR sequences of several sets of genes for which substantial independent support for its biological relevance. We also identify a evidence suggests that they share common posttranscriptional motif in the 3Ј UTRs of 240 mRNAs with a measured long half-life regulation. of Ն60 min (AUC ϭ 0.59, P Ͻ 8 ϫ 10Ϫ4). Once again, the motif We first filtered these sets to include only sequences with Ͻ90% sequence is AU rich, yet the structural context is different and includes sequence similarity. Using a similar cross-validation scheme, we a large unstructured U-rich loop followed by a short stem (Fig. 2B). then assigned an AUC score to each set and evaluated its statistical The role of AU-rich elements located on the 3Ј UTR of mRNAs significance relative to a background distribution. For sets with in modulating mRNA stability, both as stabilizing and destabilizing significant AUC scores, we also build a model of the motif. As elements, has long been known (31). Our results suggest that the

Rabani et al. PNAS ͉ September 30, 2008 ͉ vol. 105 ͉ no. 39 ͉ 14887 Downloaded by guest on October 1, 2021 5’UTR structural context within which these sequence elements are em- 5’UTR 3’UTR Set size = 22 Set size = 9 Set size = 9 bedded determines their activity: a small loop will induce destabi- AUC = 0.69 A AUC = 0.72 U -4 AUC = 0.75 p < 2x10 U p < 2x10 p < 2x10-3 Maternal similarity: avg=18%, lization, whereas a long U-rich loop will stabilize the mRNA. similarity: max=44% similarity: avg=20%, Embryonic patterns avg=16%, U A A Many studies have shown that much of the posttranscriptional -3 U max=42% max=55% Basal A C C C regulation of mRNAs in the cell occurs through their interactions C Apical C G U A with the hundreds of different cellular RBPs. It is therefore possible U A Diffuse apical C C U A that the motifs we identified in the fast- and slow-decaying mRNAs Subcellular patterns C U A U U G U A A 5’UTR are bound by specific proteins that modulate mRNA stability. In an Cell division apparatus A A A A U Set size = 13 U Spindle midzone AUC = 0.72 attempt to identify such proteins, we selected several available sets U U A A p < 10-3 Other cell division pattern A similarity: of genome-wide measurements of mRNAs bound by the same RBP G C Nuclei associated C avg=19%, U A and tried to identify common motifs in them. A A G U max=40% Perinuclear around pole cells A 5’UTR

G 5’UTR C Set size = 9 Proteins from the Puf family of RBPs have been reported to bind Cytoplasmic foci C G C Set size = 34 U A G AUC = 0.70 Ј G U -5 UGUR motifs located in the 3 UTR of their targets and thereby A AUC = 0.67 U p < 8x10 A U p < 2x10 -3 similarity: avg=20%, max=49% repress gene expression by affecting mRNA stability and translation A A G Zygotic similarity: avg=16%, max=52% rate. A recent study in yeast (32) measured the set of RNAs bound Yolk nuclei BMouse A 3’UTR A by five RBPs from the Puf family, resulting in groups of 40–220 Pole cell nuclei Set size = 95 U Dendrite AUC = 0.66 C bound mRNAs per . By applying a DNA motif discovery Blastoderm nuclei p < 2x10 U -4 A C A similarity: avg=20%, U U A tool to the 3Ј UTR primary sequence of the mRNAs bound by each All nuclei U G max=54% A A Nuclei subset U C A Puf protein, the investigators were able to identify distinct 10- U Random nuceli 5’UTR nucleotide RNA sequence motifs containing the UGUR element in G C Anterior nuclei CSet size = 9 -4 mRNAs interacting with Puf3, Puf4, and Puf5. Applying AUC = 0.66 Posterior nuclei p < 7x10 U RNApromo to the targets of each of the five Puf proteins, we Ventral nuclei U A similarity : 5’UTR G avg=15%, Set size = 97 Ј ϭ Dorsal nuclei C A -3 identify a significant motif in the 3 UTRs of Puf3 (AUC 0.57, max=51% p < 5.6x10 AUC = 0.61 Ϫ3 Ϫ5 A U -5 Ͻ ϫ ϭ Ͻ ϫ Flanking nuclei A p < 2x10 P 2 10 ), Puf4 (AUC 0.58, P 8 10 ), and Puf5 A A G 5’UTR G U Set size = 9 Ϫ U similarity : 5 Central band nuclei ϭ Ͻ ϫ C U G AUC = 0.73 (AUC 0.59, P 4 10 ) targets. The three motifs are C avg=20%, p < 6x10 -6 Mesoderm nuclei A A somewhat similar (Fig. 2C) and include an AU-rich sequence that C G similarity: avg=16%, max=58% A U

is folded into a stem–loop structure with a relatively short loop. For max=52% the Puf5 motif, we find a significant level of sequence conservation Fig. 3. Localization motifs. (A) Fly embryonic localization patterns. Consen- (P Ͻ 10Ϫ4), as an independent support for the predicted motif. sus structure logos of the statistically significant RNA motifs. The location of Moreover, the Puf5 consensus structure includes the UGU ele- each set in the localization annotation hierarchy is indicated. (B) RNA motif ment, which is part of the previously proposed Puf5 predicted for mRNAs that are localized to mouse dendrites. (in 63% of the predicted Puf5 target sites). The yeast polyU binding protein, Pub1, an embryonic lethal destabilizing elements. By binding different RBPs, AU-rich abnormal visual (ELAV)-like RBP with mammalian homologues, elements with different structure can change the mRNA stability is known to play key roles in cellular mRNA decay. Pub1 was shown and therefore affect its translation rates. to bind with high affinity different target mRNAs with AU-rich sequences in their 3Ј UTR and stabilize them (33). In a recent study, Motifs Involved in mRNA Localization. RNA transport and local a genome-wide measurement of Pub1 targets identified 368 target translation have now been documented in vertebrates, inverte- transcripts. Applying RNApromo to the 3Ј UTRs of Pub1 targets, brates, and unicellular organisms, enabling cells to control gene we predict a significant motif (AUC ϭ 0.6, P Ͻ 2 ϫ 10Ϫ5), which Ϫ expression at small localized regions. Several studies suggest that is also highly conserved (P Ͻ 5 ϫ 10 10). The predicted motif is AU cellular localization of mRNAs is specified by RNA motifs on the rich and folds into a stem–loop structure that includes a long and localized mRNA and bound by RBPs involved in trafficking (5). highly U-rich loop (Fig. 2C). Looking at the motifs we predict for these RBPs, we notice their The RBPs involved in several localization processes were identified similarity to the mRNA decay motifs: in targets of Pub1, a protein experimentally and shown to interact with elements composed of that is known to stabilize its targets, we identify a very similar motif both sequence and structure (6). We thus applied our motif to the slow-decay motif, whereas in targets of the Puf proteins, discovery scheme to several sets of mRNAs that share similar which were suggested to increase mRNA degradation rates, we localization patterns. identify a motif with a similar structure to the fast-decay motif. Recently, a large study on mRNA localization during fly embry- Indeed, fast-decaying mRNAs (half-life Ͻ6 min) are enriched with onic development was performed (35). This study provides a large Puf proteins targets (P Ͻ 8 ϫ 10Ϫ16), whereas slow-decaying set of colocalized mRNA transcripts, both to specific parts of the mRNAs (half-life Ͼ40 min) are enriched for Pub1 targets (P Ͻ 2 ϫ embryo and to distinct cellular compartments. Although some of 10Ϫ3). Moreover, looking at the decay rates of the top 20% of the mechanisms for mRNA localization during development are targets of each protein (scored by the predicted motifs), we see (Fig. known and involve morphogen gradients that induce local tran- 2C) that the half-lives of the Puf targets are lower (18 Ϯ 13 min on scription of mRNA, these are not active at the subcellular level or average) than the Pub1 targets (28 Ϯ 32 min on average). for maternal mRNAs. Therefore, it is possible that some of the Finally, Sam68 is a human RBP involved in cell growth documented localization events are a result of signals on the mRNA regulation, which was shown to bind to AU-rich motifs within a itself. Applying RNApromo to the 94 sets of colocalized mRNAs, stem–loop context (34). Analyzing its eight known targets, we we detect significant motifs in nine sets of colocalized maternal and predict a stem–loop motif (AUC ϭ 0.73, P Ͻ 2 ϫ 10Ϫ3). On the embryonic transcripts (Fig. 3A). Interestingly, whereas mRNA basis of the previous observations and the structure of this motif, stability motifs are located on 3Ј UTRs, most predicted localization we can hypothesize that Sam68 promotes the degradation of its motifs are located on the 5Ј UTR of the transcripts. mRNA targets. If true, this would indicate that this destabiliza- Of the nine predicted motifs, five (56%) are identified in sets of tion motif is conserved between human and yeast. maternal transcripts with a common cellular localization, although Overall, our results suggest a role for mRNA secondary such sets constitute only 7% of the localization patterns (P Ͻ 5 ϫ structure in controlling mRNA stability. The detailed secondary 10Ϫ5). One interesting example is the motif predicted in maternal structure of the well known AU-rich elements may play a transcripts localized to the spindle midzone, the central area of the significant role in determining their function as stabilizing or spindle where microtubules from opposite poles overlap. This

14888 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0803169105 Rabani et al. Downloaded by guest on October 1, 2021 A loop size stem size B pre-microRNA stems CControl: genomic transcripts stems

G 0510 02040 A Green algae

U Loblolly pine C ppt ppt U U U C Physcomitrella patens C Selaginella moellendorffii G G A A A Arabidopsis A

Brassica napus oosasa osa

G G

U U A Soya A

Medicago truncatula

U U

A A C Populus trichocarpa C Rice pptctc pptctc Wheat AUA

Plants Sorghum bicolor

U

U U U

Maize U

U U

Planarium U athath aathth U C. elegans U C. briggsae Honey bee dmedme Silkworm dmedme Mosquito D. melanogaster D.pseudoobscura Zebra fish mmu mmummu Puffer fish Fugu fish Frog Chicken hhsasa hsahsa

Opossum U Cattle

A

G

A Wild boar U C

U

G U C Mouse A dredre Rat ddrere Human Chimpanzee Bonobo C

Gorilla G Orangutan xxtrtr Macaque mulatta Macaque nemestrina U

A

G Wooly Monkey U

U

A Spider Monkey C btabta

Animals Tamarin Lemur

U

A

Rhesus lymphocrypto A

U

G

U

A G

G

U

U

A Epstein-Barr C

U

C

G

G G

U C C G

G

G

G U

A U

C

U

G

A

G

G Gamma -herpes A

C

U

C

G

U

U

C

C

G G A Kanopsi-sarcoma G

C

U A C

G C

A C

C U

A

U

C

U

U

C C G

C

U

A U Cytomegalo C

U

A

ebvebv C Mareks-disease 1 ebvebv Viruses Mareks-disease 2 Fig. 4. Pre-microRNA motifs. (A) Bar graphs showing the loop and stem size of the motifs predicted for pre-microRNAs of different organisms. (B) Examples of pre-microRNA–predicted consensus structure for selected species. (C) Examples of genomic transcript stem–loops predicted consensus structure for selected species.

localization pattern is evident during developmental stages 4 to 5 (at that play a significant role in regulating gene expression and mRNA the beginning of blastoderm cellularization), and it is possible that stability (37). Mature microRNAs are produced from endogenous these maternal transcripts are involved in early cellularization transcripts (pri-microRNAs) with a stem–loop structure. During processes in the embryo. Quite surprisingly, very similar motifs were microRNA biogenesis, the Drosha RNase recognizes these tran- independently identified in both 3Ј and 5Ј UTRs of these tran- scripts and cleaves them to produce the pre-microRNA stem–loops. scripts, suggesting that they both function in the same localization The pre-microRNA is further processed into a mature microRNA event. through a second cleavage event by the Dicer family of RNases. It Because only posttranscriptional mechanisms can induce em- is known that Drosha recognizes its targets by their stem–loop bryonic localization of maternal mRNAs, we can also expect to structure, and several elements of this structure were demonstrated identify common RNA motifs in maternal mRNAs with similar to be particularly important for this recognition (38), yet the exact embryonic localization patterns. Indeed, we identified a motif in features are still unclear. To produce a more accurate model of this apically localized maternal transcripts, providing a possible structural motif, we applied RNApromo to pre-microRNAs of mechanism for the localization of maternal transcripts to distinct different organisms. We expect this collection to represent the embryonic locations. preferences of the pre-microRNA recognition mechanism. Finally, we identified motifs in three sets of zygotic transcripts The identified motifs take, as expected, the shape of a stem loop that localize to unique subsets of the blastodermal nuclei. One motif with almost no bulges or internal loops, and with relatively weak also shows a significant sequence conservation profile. Because it is sequence signals (Fig. 4A). This does not mean that the pre- less likely that RNA motifs are involved in localization of transcripts microRNA structures do not contain bulges (in fact they usually to specific parts of the embryo, especially after cellularization do), but rather that there is no tendency across the different processes begin, we hypothesize that these motifs are involved in pre-microRNA structures for a bulge in a specific position. Yet the other posttranscriptional regulations, whereas the expression in pre-microRNAs consensus structures of different organisms are specific groups of blastoderm cells can be determined by another not identical and can be divided into two groups on the basis of two

mechanism. key features: the stem length and the loop length. Intriguingly, this CELL BIOLOGY We also predicted motifs for other (non-embryonic) mRNA division matches two distinct groups of organisms: animals (meta- localization events. mRNA localization is particularly important in zoa) and plants (Fig. 4A). Plant pre-microRNAs have a small loop neurons, where the plastic modulation of synaptic connections (5.6 bases on average) and a long stem (average 38.1 bp), whereas requires local changes of gene expression. Indeed, we predict a animal pre-microRNAs usually have a longer loop (average 10.9 motif in 97 dendrite-localized mRNAs that were recently identified bases) and shorter stem (average 32.8 bp). Another evident differ- experimentally in mouse hippocampal neurons (AUC ϭ 0.61, P Ͻ ence between animals and plants pre-microRNA is their overall 2 ϫ 10Ϫ5) (36), suggesting that it may be involved in the localization length, which is much higher in plants (average 160 bases) than in of these transcripts to dendrites (Fig. 3B). animals (average 88 bases). Finally, consistent with the fact that Overall, we were able to predict motifs in several sets of viruses are known to use the host cell machinery to process their mRNAs that are localized to similar cellular compartments, both microRNAs (39), the stem size of animal viruses’ pre-microRNAs in fly embryos and mouse neurons. These results demonstrate is closer to that of animals (average 9.4 bases). No data for plant the potential use for mRNA structural motifs in producing local viruses were available for comparison. mRNA concentration in the cell, which may be beneficial for the The Drosha enzyme is active inside the nucleus, where many local translation of proteins. other transcripts besides those of pri-microRNAs are produced. Assuming it recognizes and cleaves only pri-microRNA stems, we Target Recognition by the Pre-microRNA Processing Machinery. Mi- would expect pri-microRNA stems to have specific structural croRNAs are a class of small (21–24 ) noncoding RNAs features, distinguishing them from stem–loop structures that appear

Rabani et al. PNAS ͉ September 30, 2008 ͉ vol. 105 ͉ no. 39 ͉ 14889 Downloaded by guest on October 1, 2021 in other genomic transcripts. To test this hypothesis, we applied other available tools, RNApromo restricts the motif search to a RNApromo to stem–loops that appear in arbitrary genomic tran- predefined set of input structures, which allows it to perform well scripts. Strikingly, we find that in plants, the length of the stem and even on sets of mRNAs from a single organism, reduces the time loop of arbitrary stem–loop-containing transcripts is very similar to complexity, and allows the use of thermodynamic-based meth- the lengths of the plant pri-microRNAs, whereas in animals, ods for predicting RNA secondary structure. stem–loops of arbitrary transcripts and stem–loops of animal Using RNApromo, we predict two structurally different and pri-microRNAs have markedly different lengths. Thus, our results AU-rich motifs in 3Ј UTRs of fast-decaying and slow-decaying suggest that in animals, the nuclear RNase Drosha recognizes and mRNAs and identify proteins that may bind these motifs and cleaves its pri-microRNA targets, perhaps by measuring the length modulate the stability of the bound mRNAs. Although the involve- of their loop, among other features. Longer loops, which are ment of AU-rich elements in this process is known (31), our analysis characteristic of pri-microRNAs, enable the enzyme to specifically reveals the importance of the structural context in which these recognize these transcripts. These results further suggest that in elements are embedded. Next, we predict motifs in sets of colocal- plants, Drosha recognition is not part of the microRNA biogenesis ized transcripts in mouse neurons and fly embryos, which can be process, and thus a different mechanism may exist. involved in establishing local cellular concentrations of mRNAs. These intriguing results are supported by the observation that Finally, we analyze pre-microRNA sequences and reveal that in although in animals cleavage by Drosha is essential for pre- animals Drosha may recognize its pri-microRNA targets by mea- microRNA export to the cytoplasm and further processing by suring, among other features, the length of their loop. Moreover, Dicer, most Dicer homologues in plants are already localized to the our analysis suggests that plant microRNAs are processed similarly nucleus (37). Moreover, the existing literature provides no evidence to other siRNAs, in a mechanism that requires only the Dicer of a plant Drosha homologue and no evidence for accumulation of enzyme. Overall, the predicted motifs represent novel and exper- pre-microRNAs after a Dicer knockout in plants (40). Thus, it is imentally testable findings and demonstrate the potential of possible that plant pri-microRNAs, in a similar way to siRNAs RNApromo for uncovering posttranscriptional regulatory mecha- precursors, are directly recognized and cleaved by Dicer. Indeed, nisms. Our work thus represents a step toward the long-term goal plant pre-microRNAs have long stems that are similar to the long of making the genome-wide computational prediction of RNA dsRNA precursors that generate siRNAs, and the mature secondary structure elements as routine and robust as is currently microRNAs, like mature siRNAs, usually induce degradation of practiced for linear DNA elements, thereby revealing some of the their targets rather than a translational arrest. mechanisms by which RNA molecules are regulated within the cell. Conclusions Methods In summary, we have developed and applied a novel computa- See the SI for full details on our methods and on the datasets we used, as well as tional tool, RNApromo, that identifies short RNA motifs from for additional results. An online version and the implementation of the the full-length RNA regions in which they are embedded. Unlike RNApromo algorithm are available from the authors’ website or upon request.

1. Arava Y, et al. (2003) Genome-wide analysis of mRNA translation profiles in Saccha- 21. Ji Y, Xu X, Stormo GD (2004) A graph theoretical approach for predicting common RNA romyces cerevisiae. Proc Natl Acad Sci USA 100:3889–3894. secondary structure motifs including in unaligned sequences. Bioinfor- 2. Shepard KA, et al. (2003) Widespread cytoplasmic mRNA transport in yeast: Identifi- matics 20:1591–1602. cation of 22 bud-localized transcripts using DNA microarray analysis. Proc Natl Acad Sci 22. Hamada M, et al. (2006) Mining frequent stem patterns from unaligned RNA se- USA 100:11429–11434. quences. Bioinformatics 22:2480–2487. 3. Wang Y, et al. (2002) Precision and functional specificity in mRNA decay. Proc Natl Acad 23. Eddy SR, Durbin R (1994) RNA sequence analysis using covariance models. Nucleic Acids Sci USA 99:5860–5865. Res 22:2079–2088. 4. Anantharaman V, Koonin EV, Aravind L (2002) Comparative genomics and evolution 24. Sakakibara Y, et al. (1994) Stochastic context-free grammars for tRNA modeling. of proteins involved in RNA metabolism. Nucleic Acids Res 30:1427–1464. Nucleic Acids Res 22:5112–5120. 5. Hentze MW, Muckenthaler MU, Andrews NC (2004) Balancing acts: Molecular control 25. Holmes I (2005) Accelerated probabilistic inference of RNA structure evolution. BMC of mammalian iron metabolism. Cell 117:285–297. Bioinformatics 6:73. 6. Olivier C, et al. (2005) Identification of a conserved RNA motif essential for She2p 26. Yao Z, Weinberg Z, Ruzzo WL (2006) CMfinder—A covariance model based RNA motif recognition and mRNA localization to the yeast bud. Mol Cell Biol 25:4752–4766. finding algorithm. Bioinformatics 22:445–452. 7. Krol A (2002) Evolutionarily different RNA motifs and RNA-protein complexes to 27. Wuchty S, et al. (1999) Complete suboptimal folding of RNA and the stability of achieve selenoprotein synthesis. Biochimie 84:765–774. secondary structures. Biopolymers 49:145–165. 8. Kertesz M, et al. (2007) The role of site accessibility in microRNA target recognition. Nat 28. Wiese KC, Hendriks A (2006) Comparison of P-RnaPredict and mfold—Algorithms for Genet 39:1278–1284. RNA secondary structure prediction. Bioinformatics 22:934–942. 9. Robins H, Li Y, Padgett RW (2005) Incorporating structure to predict microRNA targets. 29. Do CB, Woods DA, Batzoglou S (2006) CONTRAfold: RNA secondary structure predic- Proc Natl Acad Sci USA 102:4006–4009. tion without physics-based models. Bioinformatics 22:e90–e98. 10. Long D, et al. (2007) Potent effect of target structure on microRNA function. Nat Struct 30. Griffiths-Jones S, et al. (2005) Rfam: Annotating non-coding RNAs in complete ge- Mol Biol 14:287–294. nomes. Nucleic Acids Res 33:D121–D124. 11. Zhao Y, Samal E, Srivastava D (2005) Serum response factor regulates a muscle-specific 31. Barreau C, Paillard L, Osborne HB (2005) AU-rich elements and associated factors: Are microRNA that targets Hand2 during cardiogenesis. Nature 436:214–220. there unifying principles? Nucleic Acids Res 33:7138–7150. 12. Hofacker IL, Fekete M, Stadler PF (2002) Secondary structure prediction for aligned 32. Gerber AP, Herschlag D, Brown PO (2004) Extensive association of functionally and RNA sequences. J Mol Biol 319:1059–1066. cytotopically related mRNAs with Puf family RNA-binding proteins in yeast. PLoS Biol 2:E79. 13. Gautheret D, Lambert A (2001) Direct RNA motif definition and identification from 33. Duttagupta R, et al. (2005) Global analysis of Pub1p targets reveals a coordinate control multiple sequence alignments using secondary structure profiles. J Mol Biol 313:1003–1011. of gene expression through modulation of binding and stability. Mol Cell Biol 14. Hofacker IL, et al. (1994) Fast folding and comparison of RNA secondary structures. 25:5499–5513. Monatshefte fur Chemie 125:167–188. 34. Itoh M, et al. (2002) Identification of cellular mRNA targets for RNA-binding protein 15. Hochsmann M, et al. (2003) Local similarity in RNA secondary structures. Proc IEEE Sam68. Nucleic Acids Res 30:5452–5464. Comput Soc Bioinform Conf 2:159–168. 35. Lecuyer E, et al. (2007) Global analysis of mRNA localization reveals a prominent role 16. Pavesi G, et al. (2004) RNAProfile: An algorithm for finding conserved secondary in organizing cellular architecture and function. Cell 131:174–187. structure motifs in unaligned RNA sequences. Nucleic Acids Res 32:3258–3269. 36. Zhong J, Zhang T, Bloch LM (2006) Dendritic mRNAs encode diversified functionalities 17. Sankoff D (1985) Simultaneous solution of the RNA folding, alignment and protose- in hippocampal pyramidal neurons. BMC Neurosci 7:17. quence problems. SIAM J Appl Math 45:810–825. 37. He L, Hannon GJ (2004) MicroRNAs: Small RNAs with a big role in gene regulation. Nat 18. Gorodkin J, Heyer LJ, Stormo GD (1997) Finding the most significant common sequence Rev Genet 5:522–531. and structure motifs in a set of RNA sequences. Nucleic Acids Res 25:3724–3732. 38. Han J, et al. (2006) Molecular basis for the recognition of primary microRNAs by the 19. Harmanci AO, Sharma G, Mathews DH (2007) Efficient pairwise RNA structure predic- Drosha-DGCR8 complex. Cell 125:887–901. tion using probabilistic alignment constraints in Dynalign. BMC Bioinformatics 8:130. 39. Cullen BR (2006) Viruses and microRNAs. Nat Genet 38(Suppl):S25–S30. 20. Will S, et al. (2007) Inferring noncoding RNA families and classes by means of genome- 40. Jones-Rhoades MW, Bartel DP, Bartel B (2006) MicroRNAS and their regulatory roles in scale structure-based clustering. PLoS Comput Biol 3:e65. plants. Annu Rev Plant Biol 57:19–53.

14890 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0803169105 Rabani et al. Downloaded by guest on October 1, 2021