Fast and Reliable Prediction of Noncoding Rnas
Total Page:16
File Type:pdf, Size:1020Kb
Fast and reliable prediction of noncoding RNAs Stefan Washietl*, Ivo L. Hofacker*, and Peter F. Stadler*†‡ *Department of Theoretical Chemistry and Structural Biology, University of Vienna, Wa¨hringerstrasse 17, A-1090 Wien, Austria; and †Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Ha¨rtelstrasse 16-18, D-04107 Leipzig, Germany Communicated by Hans Frauenfelder, Los Alamos National Laboratory, Los Alamos, NM, December 14, 2004 (received for review November 2, 2004) We report an efficient method for detecting functional RNAs. The served noncoding elements in mammalian (or, more generally, approach, which combines comparative sequence analysis and vertebrate) genomes, and it must be expected that a significant structure prediction, already has yielded excellent results for a fraction of them are functional RNAs. small number of aligned sequences and is suitable for large-scale Possible candidates, however, have been identified only spo- genomic screens. It consists of two basic components: (i) a measure radically so far (19, 21), simply because there are no reliable tools for RNA secondary structure conservation based on computing a to scan multiple sequence alignments for functional RNAs. The consensus secondary structure, and (ii) a measure for thermody- most widely used program QRNA (22), which has been success- namic stability, which, in the spirit of a z score, is normalized with fully used to identify ncRNAs in bacteria (23) and yeast (24), is respect to both sequence length and base composition but can be not suitable for screens of large genomes. QRNA is limited to calculated without sampling from shuffled sequences. Functional pairwise alignments, and its reliability is low, especially if the RNA secondary structures can be identified in multiple sequence evolutionary distance of the two sequences lies outside of the alignments with high sensitivity and high specificity. We demon- optimal range. An alternative approach, DDBRNA (25), suffers strate that this approach is not only much more accurate than from similar problems and so far has not been used in a real-life previous methods but also significantly faster. The method is application. MSARI (26), on the other hand, gains its drastically implemented in the program RNAZ, which can be downloaded from enhanced accuracy from the large amount of information con- www.tbi.univie.ac.at͞ϳwash͞RNAz. We screened all alignments tained in large multiple sequence alignments of 10–15 sequences of length n > 50 in the Comparative Regulatory Genomics data- with high sequence diversity. At present, however, data sets of base, which compiles conserved noncoding elements in upstream this kind are not available at a genomewide scale, at least for regions of orthologous genes from human, mouse, rat, Fugu, and multicellular organisms. zebrafish. We recovered all of the known noncoding RNAs and In this article we address the problem by using an alternative cis-acting elements with high significance and found compelling approach: we combine a measure for thermodynamic stability evidence for many other conserved RNA secondary structures not with a measure for structure conservation. Using a combination described so far to our knowledge. of both scores we are able to efficiently detect functional RNAs in multiple sequence alignments of only a few sequences. Our comparative genomics ͉ conserved RNA secondary structure method is substantially more accurate than QRNA or DDBRNA and performs better on pairwise alignments than MSARI does on raditionally, the role of RNA in the cell was considered alignments with 15 sequences. On the large, diverse alignments Tmostly in the context of protein gene expression, limiting used for testing MSARI in ref. 26, our RNAZ program achieved RNA to its function as mRNA, tRNA, and rRNA. The discovery 100% sensitivity at 100% specificity. of a diverse array of transcripts that are not translated to proteins Methods but rather function as RNAs has changed this view profoundly (1–3). Noncoding RNAs (ncRNAs) are involved in a large Minimum Free Energy (MFE) RNA Folding. For MFE RNA folding we variety of processes, including gene regulation (4), maturation of used the C libraries of the Vienna RNA package version 1.5 (27). mRNAs, rRNAs, and tRNAs, or X-chromosome inactivation in We used RNAFOLD for folding single sequences and RNAALIFOLD mammals (5). In fact, a large fraction of the mouse transcriptome (28) for consensus folding of aligned sequences. The same consists of ncRNAs (6), and about half of the transcripts from folding parameters were used for both algorithms to ensure that human chromosomes 21 and 22 are noncoding (7, 8). Structured the obtained MFE values were comparable. For the covariation RNA motifs furthermore function as cis-acting regulatory ele- part of RNAALIFOLD we used default parameters. Gaps were ments within protein-coding genes. Also in this context, new removed for single sequence folding. intriguing mechanisms are being discovered (9). Hence, a comprehensive understanding of cellular processes is Calculation of z Scores Using Support Vector Machine (SVM) Regres- impossible without considering RNAs as key players. Efficient sion. To calculate z scores by regression analysis we used the identification of functional RNAs (ncRNAs as well as cis-acting following procedure: we generated synthetic sequences of dif- elements) in genomic sequences is, therefore, one of the major ferent length and base composition. The length of the test sequences ranged from 50 to 400 nt in steps of 50. To quantify goals of current bioinformatics. Notwithstanding its utmost ͞ ͞ ͞ biological relevance, de novo prediction is still a largely unsolved base composition, we used the GC AT, A T, and G C ratios of issue. Unlike protein-coding genes, functional RNAs lack in the sequences and chose values for all ratios ranging from 0.25 their primary sequence common statistical signals that could be to 0.75 in steps of 0.05. This process resulted in 10,648 points in exploited for reliable detection algorithms. Many functional a four-dimensional space of the independent variables. For each RNAs, however, depend on a defined secondary structure. In of the points we calculated the mean and standard deviation of particular, evolutionary conservation of secondary structures serves as compelling evidence for biologically relevant RNA Abbreviations: ncRNA, noncoding RNA; snoRNA, small nucleolar RNA; SRP, signal- function. Comparative studies therefore seem to be the most recognition particle; MFE, minimum free energy; SCI, structure conservation index; SVM, promising approach. To date, complete genomic sequences of support vector machine; CNB, conserved noncoding block; CORG, Comparative Regulatory related species have been sequenced for almost all genetic model Genomics. organisms as, for example, bacteria (10, 11), yeasts (12), nem- See Commentary on page 2269. atodes (13, 14), and even mammals (15–17). Several studies ‡To whom correspondence should be addressed. E-mail: [email protected]. (18–21) have identified a large collection of evolutionary con- © 2005 by The National Academy of Sciences of the USA 2454–2459 ͉ PNAS ͉ February 15, 2005 ͉ vol. 102 ͉ no. 7 www.pnas.org͞cgi͞doi͞10.1073͞pnas.0409169102 Downloaded by guest on October 1, 2021 the MFE of 1,000 random sequences, representing the depen- Results dent variables in our regression. The SCI. In a recent article (34) we demonstrated that the program ͞ SEE COMMENTARY We used the SVM library LIBSVM (www.csie.ntu.edu.tw RNAALIFOLD [which originally was developed for prediction of ϳ ͞ cjlin libsvm) to train two regression models for mean and secondary structure in aligned sequences (28)] also can be used standard deviation. Input data for the SVM were scaled to mean for detection of evolutionarily conserved secondary structure. of 0 and standard deviation of 1. RNAALIFOLD implements a consensus folding algorithm gener- We chose the variant of regression and a radial basis function alizing the standard dynamic programming algorithms for RNA ϭ ␥ ϭ kernel. We optimized the parameters and found 0.5, 1, secondary structure prediction algorithms (35) by adding se- ϭ and C 5 to yield the best results. Finally, we obtained two quence covariation terms to the folding energy model (36, 37). models for the mean and standard deviation we used for z-score More precisely, a consensus MFE is computed for an alignment calculation. that is composed of an energy term averaging the energy The traditional sampling of z scores depends on the random- contributions of the single sequences and a covariance term ization of the native sequence by shuffling the positions. In this rewarding compensatory and consistent mutations (28). As this context it was pointed out by Workman and Krogh (30) that a consensus MFE is difficult to interpret in absolute terms, we correct randomization procedure should conserve dinucleotide previously used a time-consuming random sampling method to content because of the energy contributions of stacked base pairs assess its significance (34). This approach would require massive in the energy model. In principle, the regression model could be computational effort even for small-sized genomes and it does extended to use dinucleotide frequencies. The good results with not seem practicable for large genomes as, for example, the the simple model, however, allow us to neglect this effect. human genome. A much more efficient normalization