Computational Prediction of RNA Structural Motifs Involved in Posttranscriptional Regulatory Processes
Michal Rabani*, Michael Kertesz*, and Eran Segal*†

*Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100, Israel

Edited by Sean R. Eddy, Howard Hughes Medical Institute, Janelia Farm, Ashburn, VA, and accepted by the Editorial Board August 6, 2008 (received for review April 12, 2008)

Messenger RNA molecules are tightly regulated, mostly through interactions with proteins and other RNAs, but the mechanisms that confer the specificity of such interactions are poorly understood. It is clear, however, that this specificity is determined by both the nucleotide sequence and secondary structure of the mRNA. Here, we develop RNApromo, an efficient computational tool for identifying structural elements within mRNAs that are involved in specifying posttranscriptional regulation. By analyzing experimental data on mRNA decay rates, we identify common structural elements in fast-decaying and slow-decaying mRNAs and link them with binding preferences of several RNA binding proteins. We also predict structural elements in sets of mRNAs with common subcellular localization in mouse neurons and fly embryos. Finally, by analyzing pre-microRNA stem–loops, we identify structural differences between pre-microRNAs of animals and plants, which provide insights into the mechanism of microRNA biogenesis. Together, our results reveal unexplored layers of posttranscriptional regulation in groups of RNAs and are therefore an important step toward a better understanding of the regulatory information conveyed within RNA molecules. Our new RNA motif discovery tool is available online.

bioinformatics | motif prediction | posttranscriptional regulation | RNA secondary structure | SCFGs

RNA molecules undergo diverse posttranscriptional regulation of gene expression, including regulation of RNA transport and localization, mRNA translation, and RNA decay (1–3). In many cases, such posttranscriptional regulation occurs through elements on the mRNA molecule that interact with the hundreds of RNA binding proteins (RBPs) that exist in the cell (4). A well-known example is the iron-responsive element (IRE), a secondary structure RNA motif located on UTRs of members of the iron metabolism and transport pathway (5). The binding of the RBPs Irp1 and Irp2 to IRE elements affects the translation rate of the mRNA and thereby coordinates the response to changing levels of iron in the environment. Other examples include a 118-nucleotide stem–loop structure through which mRNAs are transported to the yeast bud tip by the RBP She2 (6) and the RBP Sbp2, which is involved in mediating UGA redefinition from a stop codon to selenocysteine by binding specific stem–loop structures, termed selenocysteine insertion site (SECIS) elements, in the 3′ UTR of selenoproteins (7). In other cases, elements on the mRNA molecule interact with other RNAs that direct the regulatory effect. For example, the recognition and binding affinity of a microRNA to its mRNA target is determined by both the sequence and structure of the target mRNA (8–11).

The examples above suggest that the posttranscriptional regulation of mRNAs is determined not only by their linear nucleotide sequence but also by their secondary structure. Thus, a key goal is to understand the involvement of mRNA secondary structures in such regulation. One approach is to identify recurring patterns, termed motifs. However, linear sequence motifs, which are commonly found in DNA sequences, are not suitable in this case. Instead, we wish to identify motifs that combine primary and secondary structural elements and are therefore better suited to describe functional elements in RNA molecules.

Here we develop RNApromo (RNA prediction of motifs), a new computational method to identify short structural RNA motifs in sets of long unaligned RNAs. Using our method, we identify putative motifs in sets of mRNAs with substantial experimental evidence for a common posttranscriptional regulation, and we support these findings with cross-validation analysis. In some cases, sequence conservation of the putative motifs provides strong independent support for our findings. The identified motifs include motifs for mRNAs with similar decay rates, mRNAs that are bound by the same RBP, and mRNAs with a common cellular localization. Additionally, analysis of pre-microRNA structures identified differences between animals and plants in the sizes of stem and loop structures of pre-microRNAs, suggesting that animals and plants use different mechanisms for microRNA biogenesis.

Results and Discussion

A New Computational Scheme for RNA Motif Discovery. Several tools exist for finding local structural motifs in a set of long input RNAs. Many existing methods predict local motifs using free energy minimization, applying one of three major schemes: use of a sequence-based local alignment to build a motif consensus structure (12, 13); identification of common structures in the RNAs' predicted minimal free energy folds (14–16); or simultaneous alignment of the RNAs and prediction of their secondary structure (17–20). Recently, some graph theoretical techniques were also proposed for this task (21, 22). Yet some of the most successful approaches for motif discovery are based on advanced probabilistic models, which are highly suitable to capture the observed variation in the input set.

Stochastic context-free grammars (SCFGs) are a class of probabilistic models proposed for modeling common sequence and structure in a set of input RNAs (23, 24), which replaced the thermodynamic considerations in several existing RNA motif discovery tools (25, 26). However, currently available SCFG applications optimize the model's parameters by essentially considering all possible secondary structures of the input sequences. This approach is successful mainly when it is possible to exploit covariation between organisms to infer the secondary structure at the motif position. However, when one wishes to identify short motifs using data from a single organism, covariation data cannot be used. Moreover, enumerating all possible secondary structures results in a rather high time complexity of the algorithm, making it infeasible to scan large RNA sets or long RNA sequences.

Here, we devised a new SCFG-based method for finding local motifs in a set of unaligned RNAs, which restricts the search space to a predefined and limited number of structures for each input RNA. Yet which structures should be considered by the algorithm?

Author contributions: M.R., M.K., and E.S. designed research, performed research, contributed new analytic tools, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. S.R.E. is a guest editor invited by the Editorial Board.

†To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/cgi/content/full/0803169105/DCSupplemental.

© 2008 by The National Academy of Sciences of the USA

www.pnas.org/cgi/doi/10.1073/pnas.0803169105 PNAS | September 30, 2008 | vol. 105 | no. 39 | 14885–14890

[Figure, panels A and B; sequence and structure drawings are not recoverable from the extraction. Recoverable labels: "real (Rfam) sets" and "permuted sets" plotted against "% of gene sets"; columns "Predicted Motif", "Known Motif", "Selected Instances", and "Conservation profile". Reported statistics include set size = 65, AUC = 0.95, p < 4.5 × 10^-16, similarity avg = 49%, max = 85%; and set size = 10, AUC = 0.86, p < 6.5 × 10^-4, similarity avg = 10%, max = 51%. Instance labels include HIST2H2AA, HIST4H4, HIST3H2A, HIST2H3C, HIST1H2BG, TFRC, FTH1, IREG1, FTL, ACO2, SLC11A2, ALAS2, GPX1, and GPX2.]
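The idea of restricting an SCFG-based search to a predefined, limited set of structures per input RNA, instead of summing over all possible folds, can be illustrated with a minimal sketch. This is not the RNApromo implementation. It assumes that candidate structures are supplied as dot-bracket strings (for example, from an external folding program), and it scores each candidate using only pair and single-nucleotide emission weights, omitting the grammar's transition probabilities. The names `PAIR_PROB`, `SINGLE_PROB`, `MISMATCH_PROB`, `base_pairs`, `log_score`, and `best_candidate` are hypothetical, and the numeric weights are illustrative rather than fitted to any data.

```python
import math

# Hypothetical emission weights (illustrative values, not fitted to data).
PAIR_PROB = {
    ("G", "C"): 0.25, ("C", "G"): 0.25,
    ("A", "U"): 0.18, ("U", "A"): 0.18,
    ("G", "U"): 0.07, ("U", "G"): 0.07,
}
SINGLE_PROB = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}
MISMATCH_PROB = 1e-3  # assumed small weight for non-canonical pairs


def base_pairs(structure):
    """Return the (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def log_score(sequence, structure):
    """Log emission score of a sequence given one fixed structure."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure lengths differ")
    pairs = base_pairs(structure)
    paired = {k for pair in pairs for k in pair}
    total = 0.0
    # Paired positions are scored jointly, as in a pair-emitting rule.
    for i, j in pairs:
        weight = PAIR_PROB.get((sequence[i], sequence[j]), MISMATCH_PROB)
        total += math.log(weight)
    # Unpaired positions are scored one nucleotide at a time.
    for k, base in enumerate(sequence):
        if k not in paired:
            total += math.log(SINGLE_PROB[base])
    return total


def best_candidate(sequence, candidates):
    """Pick the highest-scoring structure from a limited candidate set."""
    return max(candidates, key=lambda s: log_score(sequence, s))


if __name__ == "__main__":
    seq = "GGGAAAACCC"
    candidates = ["(((....)))", "((......))", ".........."]
    for s in candidates:
        print(s, round(log_score(seq, s), 3))
    print("best:", best_candidate(seq, candidates))
```

Because only the listed candidates are evaluated, the cost grows with the number of candidates times the sequence length, rather than with the number of all possible secondary structures. How the candidate set itself is chosen is the question the text raises and is not addressed by this sketch.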