C. elegans noncoding RNA genes*
Shawn L. Stricklin (1), Sam Griffiths-Jones (2), Sean R. Eddy (1,§)

(1) Howard Hughes Medical Institute and Department of Genetics, Washington University, St. Louis, MO 63108 USA
(2) Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK

Table of Contents

1. Introduction
2. Ribosomal RNAs
3. Transfer RNAs
4. Spliced leader RNAs
5. Small nucleolar RNAs (snoRNAs)
6. microRNAs (miRNAs)
7. Other known RNAs
8. ncRNA conservation
9. Prospects for novel ncRNAs
10. Acknowledgments
11. References

Abstract

The C. elegans genome contains approximately 1300 genes that produce functional noncoding RNA (ncRNA) transcripts.
Here we describe what is currently known about these ncRNA genes, from the perspective of the annotation of the finished genome sequence. We have collated a reference set of C. elegans ncRNA gene annotation relative to the WS130 version of the genome assembly, and made these data available in several formats.

*Edited by Jonathan Hodgkin and Philip Anderson. Last revised June 16, 2005. Published June 25, 2005. This chapter should be cited as: Stricklin, S.L. et al. C. elegans noncoding RNA genes (June 25, 2005), WormBook, ed. The C. elegans Research Community, WormBook, doi/10.1895/wormbook.1.1.1, http://www.wormbook.org.

Copyright: © 2005 Shawn L. Stricklin, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

§To whom correspondence should be addressed. E-mail: [email protected]

1. Introduction

The C. elegans genome contains approximately 1300 genes that are known to produce functional noncoding RNA (ncRNA) transcripts, as opposed to mRNAs that encode proteins. These known ncRNA genes include about 590 transfer RNA (tRNA) genes, 275 ribosomal RNA (rRNA) genes, 140 trans-spliced leader RNA genes, 120 microRNA (miRNA) genes, 70 spliceosomal RNA genes, and 30 snoRNA genes. Based on what is known about ncRNA-directed functions in other animals, there are additional ncRNA genes performing known biochemical functions that have not yet been identified in the worm genome. These include the telomerase RNA and on the order of 100-200 small nucleolar RNA (snoRNA) genes that direct site-specific 2'-O-ribose methylations and pseudouridylations of ribosomal RNAs and other target RNAs. It also seems likely that novel ncRNA genes remain to be discovered.
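
The class counts quoted above can be tallied with a short script. This is only an illustrative sketch: the dictionary keys are labels invented here rather than official annotation terms, and the values are the rounded figures from the text, so their sum (1225) falls somewhat short of the approximate total of 1300 rather than matching it exactly.

```python
# Rounded per-class counts of known C. elegans ncRNA genes, as quoted in the text.
# The keys are hypothetical labels chosen for this example only.
known_ncrna_counts = {
    "tRNA": 590,
    "rRNA": 275,
    "spliced_leader_RNA": 140,
    "miRNA": 120,
    "spliceosomal_RNA": 70,
    "snoRNA": 30,
}

# Sum of the itemized classes.
total_known = sum(known_ncrna_counts.values())

# Print each class with its share of the itemized total.
for rna_class, count in known_ncrna_counts.items():
    share = 100.0 * count / total_known
    print(f"{rna_class:<20}{count:>5}  {share:5.1f}%")
print(f"{'total':<20}{total_known:>5}")
```

The text does not itemize the difference between the summed class counts and the approximate overall figure, so the script makes no attempt to account for it.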
The belated realization that the lin-4 and let-7 regulatory RNA genes are not just worm-specific anecdotes, but instead are members of a huge gene family of microRNAs with important roles in posttranscriptional gene regulation in many eukaryotes (Lee et al., 1993; Lim et al., 2003; Pasquinelli et al., 2000; Reinhart et al., 2000) was a spectacular demonstration that important genes (indeed, whole gene families) can easily escape standard computational and experimental gene discovery methods. There is a tantalizing possibility that the miRNAs foreshadow the discovery of even more RNA-directed functions.

Here we describe what is currently known about the ncRNA genes of C. elegans, from the perspective of the annotation of the finished genome sequence (C. elegans Sequencing Consortium, 1998). Based on the literature, GenBank, and on computational searches for homologs of known RNAs and members of known RNA gene families (Benson et al., 2004; Griffiths-Jones, 2004; Griffiths-Jones et al., 2003; Harris et al., 2004; Lowe and Eddy, 1997), we have collated a stable, curated reference set of C. elegans noncoding RNA genes, and their chromosomal coordinates relative to the WS130 version of the genome sequence assembly in Table 1. We have made these data available as annotation tracks for WormBase, and downloadable as HTML tables, GFF coordinate files, or FASTA sequence files. We describe how the reference set has been produced, and summarize what it contains.

2. Ribosomal RNAs

The 18S, 5.8S, and 26S subunits of rRNA are transcribed by RNA polymerase I from a 7.2 kb rDNA unit that is tandemly repeated ∼55 times at the end of chromosome I (C. elegans Sequencing Consortium, 1998; Ellis et al., 1986; Sulston and Brenner, 1974). The 5S rRNAs are transcribed separately by RNA polymerase III from ∼110 copies of a ∼1 kb tandem repeat unit on chromosome V (Nelson and Honda, 1985; Sulston and Brenner, 1974).
The 5S rRNA repeat unit also includes the gene for the SL1 trans-spliced leader; see below. The rRNA genes are systematically underrepresented in the current genome sequence assembly, because tandem arrays are problematic for physical mapping and sequencing. According to WUBLAST searches using the published sequences of the 7.2 kb and 1 kb rDNA repeat units as queries, one copy of the 18S/5.8S/26S rRNA repeat unit is represented in the chromosome I sequence assembly, and fifteen copies of the 5S rRNA gene are included in the chromosome V sequence. Additionally, the mitochondrial DNA contains one 18S rRNA gene and one 23S rRNA gene.

3. Transfer RNAs

We have annotated genes for 569 nuclear tRNAs, 22 mitochondrial tRNAs, and 1072 probable tRNA pseudogenes. The mitochondrial tRNAs were curated from the literature (Okimoto et al., 1992; Wolstenholme et al., 1987). Nuclear tRNAs can be reliably identified by computational methods. We used the programs tRNAscan-SE (Lowe and Eddy, 1997) and ARAGORN (Laslett and Canback, 2004) to identify a combined candidate list of 612 putative tRNA genes and 214 candidate tRNA pseudogenes. These candidate tRNA genes were manually curated to remove an additional 40 putative pseudogenes and 3 false positives, leaving the final set of 569 annotated genes. This gene set is essentially in agreement with the independent analysis of Marck and Grosjean, who identified 529 putative tRNA genes (Marck and Grosjean, 2002). Differences appear to be due to variation in what is called a putative true gene versus a putative pseudogene, and differences in the version of the genome assembly used.

As is the case in many eukaryotes, tRNA pseudogenes are numerous in C. elegans. Current tRNA scanning programs only detect pseudogenes that are closely related to true tRNAs.
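
The curation arithmetic for the nuclear tRNA genes described above can be checked in a few lines. The variable names are hypothetical, and the sketch assumes that the 40 candidates reclassified during manual curation are counted together with the 214 pseudogene candidates reported by the scanning programs.

```python
# Candidate counts reported for tRNAscan-SE and ARAGORN combined.
candidate_genes = 612
candidate_pseudogenes = 214

# Outcome of manual curation of the candidate genes.
reclassified_as_pseudogenes = 40
false_positives = 3

# Mitochondrial tRNA genes curated from the literature.
mitochondrial_genes = 22

# Genes that survive curation.
annotated_nuclear_genes = (
    candidate_genes - reclassified_as_pseudogenes - false_positives
)

# Assumption: reclassified candidates are added to the pseudogene candidates.
pseudogenes_after_curation = candidate_pseudogenes + reclassified_as_pseudogenes

# Nuclear plus mitochondrial tRNA genes.
total_trna_genes = annotated_nuclear_genes + mitochondrial_genes

print("annotated nuclear tRNA genes:", annotated_nuclear_genes)
print("pseudogenes after curation:", pseudogenes_after_curation)
print("nuclear + mitochondrial tRNA genes:", total_trna_genes)
```

The nuclear plus mitochondrial total is consistent with the figure of about 590 tRNA genes given in the Introduction.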
Using WUBLAST, we identified an additional 818 sequences with significant similarity to one or more of the 569 tRNA genes and/or 254 tRNA pseudogenes, and added them to annotate a total of 1072 putative pseudogenes. Many of these overlay four previously identified repetitive sequences (Tc4, CEREP3, CELE45, and NDNAX3_CE) defined by RepBase and RepeatMasker searches.

4. Spliced leader RNAs

Approximately 70% of C. elegans mRNAs are covalently modified at their 5' end by the addition of 22-nt trans-spliced leader RNA sequences (Blumenthal and Gleason, 2003; Ross et al., 1995; Zorio et al., 1994). Trans-spliced leaders are donated by independently transcribed ∼100-110 nt SL RNAs, which come in two forms, SL1 RNA and SL2 RNA (see Trans-splicing and operons). The most abundant form, SL1, is predominantly trans-spliced to the 5' end of pre-mRNAs, including the first cistron in polycistronic (operon) pre-mRNAs; the rarer form, SL2, is generally trans-spliced to downstream cistrons in polycistronic operons (Blumenthal and Gleason, 2003).

The genes for SL1 RNA are part of the same tandem repeat unit that encodes 5S rRNA, occurring in ∼110 copies on chromosome V (Krause and Hirsh, 1987; Nelson and Honda, 1985). Ten SL1 RNA genes are represented in the current genome sequence assembly. The genes for the ∼110 nt SL2 RNAs are dispersed, and show more sequence variation than SL1 RNAs (indeed, some have been named SL3 RNA, SL4 RNA, etc.; we follow the WormBase convention of annotating all as SL2 variants; Huang and Hirsh, 1989; Ross et al., 1995; Zorio et al., 1994). Twenty SL2 RNA gene variants are found in the genome sequence, roughly in agreement with the copy number of ∼30 predicted from genomic Southerns (Ross et al., 1995; Zorio et al., 1994). SL RNAs are thought to be transcribed by pol II (Krause and Hirsh, 1987).

5. Small nucleolar RNAs (snoRNAs)

In Eukarya and Archaea, two classes of snoRNAs direct site-specific base modifications of ribosomal RNA and other ncRNAs.