Consensus Folding of Aligned Sequences as a New Measure for the Detection of Functional RNAs by Comparative Genomics
doi:10.1016/j.jmb.2004.07.018
J. Mol. Biol. (2004) 342, 19–30

Stefan Washietl and Ivo L. Hofacker*

Institut für Theoretische Chemie und Molekulare Strukturbiologie, Universität Wien, Währingerstraße 17, A-1090 Wien, Austria

Facing the ever-growing list of newly discovered classes of functional RNAs, it can be expected that further types of functional RNAs are still hidden in recently completed genomes. The computational identification of such RNA genes is, therefore, of major importance. While most known functional RNAs have characteristic secondary structures, their free energies are generally not statistically significant enough to distinguish RNA genes from the genomic background. Additional information is required. Considering the wide availability of new genomic data of closely related species, comparative studies seem to be the most promising approach. Here, we show that prediction of consensus structures of aligned sequences can be a significant measure to detect functional RNAs. We report a new method to test multiple sequence alignments for the existence of an unusually structured and conserved fold. We show for alignments of six types of well-known functional RNA that an energy score consisting of free energy and a covariation term significantly improves sensitivity compared to single sequence predictions. We further test our method on a number of non-coding RNAs from Caenorhabditis elegans/Caenorhabditis briggsae and seven Saccharomyces species. Most RNAs can be detected with high significance. We provide a Perl implementation that can be used readily to score single alignments and discuss how the methods described here can be extended to allow for efficient genome-wide screens.

© 2004 Elsevier Ltd. All rights reserved.
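The notion of an "unusually" stable fold is made quantitative in the Results as a z-score: the folding energy of the native sequence is compared with the energies of shuffled versions of the same sequence. The following is a minimal Python sketch of that idea for a single sequence, not the authors' Perl implementation. The function names, and in particular `toy_energy`, are hypothetical: `toy_energy` is a crude stand-in that only counts complementary positions in one fixed hairpin, and in practice it would be replaced by a real MFE predictor such as RNAfold. Only mononucleotide shuffling is shown.

```python
import random
import statistics
from typing import Callable, Optional

# Watson-Crick and G-U wobble pairs recognised by the toy score below.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def shuffle_mononucleotide(seq: str, rng: random.Random) -> str:
    """Return a random permutation of seq, preserving its base composition."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def toy_energy(seq: str) -> float:
    """Hypothetical stand-in for an MFE predictor (NOT a thermodynamic model).

    Scores -1 for every position i that can pair with its mirror position
    n-1-i, i.e. it only evaluates a single, fixed hairpin conformation.
    """
    n = len(seq)
    paired = sum((seq[i], seq[n - 1 - i]) in PAIRS for i in range(n // 2))
    return -float(paired)


def z_score(
    seq: str,
    energy_fn: Callable[[str], float],
    n_random: int = 100,
    seed: Optional[int] = None,
) -> float:
    """z = (E_native - mean(E_shuffled)) / stdev(E_shuffled).

    Negative values mean the native sequence scores lower (more stable)
    than the shuffled background.
    """
    rng = random.Random(seed)
    native = energy_fn(seq)
    background = [
        energy_fn(shuffle_mononucleotide(seq, rng)) for _ in range(n_random)
    ]
    mu = statistics.mean(background)
    sigma = statistics.stdev(background)
    if sigma == 0:
        raise ValueError("background energies are all identical; z is undefined")
    return (native - mu) / sigma


if __name__ == "__main__":
    hairpin = "GGGGGGGGAAAACCCCCCCC"  # designed so that the toy score is minimal
    print(z_score(hairpin, toy_energy, n_random=100, seed=1))
```

A sequence would be called significant when its z-score falls below a chosen cutoff. The sample size of 100 shuffled sequences follows the text, while the seed parameter is an addition made here so that runs are reproducible.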
Keywords: conserved secondary structure; consensus structure prediction; non-coding RNAs; comparative genomics; randomizing multiple sequence alignments

*Corresponding author

Abbreviations used: MFE, minimum free energy; RUF, RNAs of unknown function; SGD, Saccharomyces Genome Database; ncRNA, non-coding RNA.

E-mail address of the corresponding author: [email protected]

0022-2836/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.

Introduction

In the past few years, our knowledge on the molecular and cellular functions of RNA has increased dramatically. In particular, the identification of numerous RNA transcripts that function directly as RNA without ever being translated to protein (non-coding RNAs; ncRNAs) has made clear that the traditional view of RNA must be extended profoundly. To mention just one example, the discovery of micro RNAs1–3 has led to a new paradigm of RNA-directed gene expression regulation. There are many other examples of such new “RNA-genes”.4,5

Another aspect of RNA function concerns cis-acting regulatory elements within protein-coding genes. A recent example is the regulation of metabolic pathways in bacteria through “riboswitches”. These riboswitches occur in leader sequences of operons and interact directly with small metabolites in order to control protein expression.6

These findings not only force experimental biologists to reconsider their strategies and methods, but also pose new challenges to bioinformatics. In particular, the computational identification of functional RNAs in genomes is a major, yet largely unsolved, issue.

Current methods mostly are based on similarity searches and are successful in the identification of functional RNAs that are members of already known families.7–11 A more general approach that detects new classes of functional RNAs without relying on any a priori knowledge would be helpful. This, however, proved to be difficult. In contrast to protein-coding genes, which show strong statistical signals like open reading frames and codon bias, the primary sequences of functional RNAs seem to lack comparable signals completely.

Since most known functional RNAs depend on a defined secondary structure, it was suggested by Maizel and co-workers that functional RNAs have a more stable secondary structure than expected by chance.12–14 However, efforts to build a general RNA gene finder based on secondary structure prediction failed. Rivas & Eddy had to conclude in an in-depth study on the subject that secondary structure alone is generally not significant enough for the detection of ncRNAs.15 Some other statistical measures, partly derived from secondary structure predictions, have been proposed.16–18 Still, additional information seems to be required for reliable predictions on a genome-wide scale.

The most promising source of information comes from comparative studies. Already, a number of complete genomes from closely related species are available. Some of them have been sequenced solely for the purpose of genome comparisons. Readily available sets for comparison are: more than 15 enteric bacteria,19,20 seven yeast species,21,22 two nematodes23,24 and the two mammalian genomes from human25 and mouse.26 Facing the ever-growing pace of genome projects, even more can be expected in the near future.

QRNA is a program that makes use of this comparative information and scans pairwise alignments for conserved secondary structures using probabilistic models based on stochastic context-free grammars.27 This approach has been applied successfully to predict candidates for non-coding RNAs in Escherichia coli and Saccharomyces cerevisiae, some of which could be verified experimentally.28,29

Here, we propose an alternative method to assess a multiple sequence alignment for the existence of a conserved secondary structure. We compute an averaged folding energy of aligned sequences that also takes into account sequence covariations. Following the ideas of the Maizel group, we compare this to a set of random alignments in order to estimate if there is an unusually stable and conserved fold. We address the question of whether this can be a significant measure to detect functional RNAs in genome-wide screens.

Results and Discussion

MFE predictions for single sequences are of limited statistical significance

Secondary structure is a useful level on which to understand RNA function. Fairly reliable models can be predicted with computational methods. Since many known functional RNAs are tied to a defined secondary structure, such predictions appear a straightforward measure for their detection. However, prediction programs readily calculate minimum free energy (MFE) structures also for arbitrary random sequences. The question arises of whether natural RNAs are more stable (have lower MFE) than random sequences. This question has been partly addressed.15 Here, we test it again for sequences from a set of six structural RNA families (tRNA, 5 S rRNA, hammerhead ribozyme type III, group II catalytic intron, signal recognition particle RNA, U5 spliceosomal RNA). We used RNAfold for the prediction and calculated z-scores from a sample of 100 random sequences (see Methods). The results are shown in Table 1. On average, the structural RNAs all have z-scores clearly below zero, meaning they have lower folding energy than the random samples. Is this significant enough to reliably distinguish single sequences from the random background? Figure 2 illustrates this for the tRNA test set. The topmost panel shows the distribution of z-scores for 579 tRNAs together with the z-scores of 579 random sequences (one shuffled version for each tRNA). If we use a conservative limit of -4 to define a significant z-score, we can detect only 2% of the tRNAs. To detect half of all tRNAs we would have to lower the cutoff to -1.8. Then, however, we would encounter 4% of false positives. For genome-wide screens where a huge number of candidates has to be scored, this selectivity is too low (especially for a corresponding sensitivity of only 50%). Some of the tested families form more stable structures (e.g. group II catalytic intron, average z = -3.88; hammerhead ribozyme III, z = -3.08) but generally the native sequences are not efficiently separated from the bulk of random sequences.

An additional point seems noteworthy regarding these experiments. Workman & Krogh30 pointed out that dinucleotide content influences secondary structure predictions, because of the energy contributions of stacked base-pairs. A correct randomization procedure should, therefore, generate random sequences of the same dinucleotide content. It is impossible to consider this in the randomization of multiple sequence alignments (see the next section). For single sequences, however, we performed the z-score calculations with both mono- and dinucleotide shuffled random sequences. The results (Table 1) show that a systematic bias is not recognizable for our test sets. The values differ only minimally and the mononucleotide-shuffled z-scores are not necessarily below the dinucleotide-shuffled scores. Thus, while dinucleotide composition was important in the study by Workman & Krogh, where long (>500 nt) mRNAs are tested for an (obviously non-existent) subtle bias towards lower folding energies, it can be neglected in our case.

Additional information from aligned sequences shifts MFE predictions towards significant levels

The results so far show that folding energy is indeed a characteristic signal of (structural)

Table 1. The z-scores and detection sensitivities for single and aligned sequences of various functional RNAs

ncRNA type    Single sequence