Overlapping Antisense Transcription in the Human Genome
Total Page:16
File Type:pdf, Size:1020Kb
Comparative and Functional Genomics Comp Funct Genom 2002; 3: 244–253. Published online 2 May 2002 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.173 Short Communication Overlapping antisense transcription in the human genome M. E. Fahey, T. F. Moore and D. G. Higgins* Department of Biochemistry, University College Cork, Lee Maltings, Prospect Row, Cork, Ireland *Correspondence to: Abstract Department of Biochemistry, University College Cork, Lee Accumulating evidence indicates an important role for non-coding RNA molecules in Maltings, Prospect Row, Cork, eukaryotic cell regulation. A small number of coding and non-coding overlapping antisense Ireland. transcripts (OATs) in eukaryotes have been reported, some of which regulate expression of E-mail: [email protected] the corresponding sense transcript. The prevalence of this phenomenon is unknown, but there may be an enrichment of such transcripts at imprinted gene loci. Taking a bioinfor- matics approach, we systematically searched a human mRNA database (RefSeq) for com- plementary regions that might facilitate pairing with other transcripts. We report 56 pairs of overlapping transcripts, in which each member of the pair is transcribed from the same locus. This allows us to make an estimate of 1000 for the minimum number of such transcript pairs in the entire human genome. This is a surprisingly large number of overlapping gene pairs and, clearly, some of the overlaps may not be functionally significant. Nonetheless, this may indicate an important general role for overlapping antisense control in gene regulation. EST databases were also investigated in order to address the prevalence of cases of imprinted genes with associated non-coding overlapping, antisense transcripts. However, EST databases were found to be completely inappropriate Received: 31 January 2002 for this purpose. Copyright # 2002 John Wiley & Sons, Ltd. Accepted: 11 April 2002 Keywords: OAT; antisense transcript; imprinting; human genome; EST; cDNA Introduction 1999), and at developmentally regulated, non- imprinted loci such as a globin (Gribnau et al., There is accumulating evidence that a large number 2000). Moreover, it is unclear whether a low level of of structurally and functionally diverse non-coding transcription of certain ncRNAs is functionally RNA (ncRNA) molecules are produced in the significant, or whether it merely represents ‘illegiti- eukaryotic cell (Eddy, 2001; Mattick, 2001). The mate’ or ‘leaky’ transcription from cryptic promo- hitherto unsuspected complexity of RNA-based ters. gene regulatory mechanisms presents a considerable These difficulties notwithstanding, evidence that technical challenge to both bioinformaticists and ncRNAs make a significant contribution to eukar- molecular biologists. Most of the transcripts do not yotic cell function comes from a variety of sources. currently have well-defined structural features and Established work indicates that, in addition to may not be represented, or may be dismissed as functional intronic RNA, ribosomal RNA (rRNA), cloning artifacts, in gene expression libraries. For transfer RNA (tRNA), and the 5k and 3k untrans- example, functional ncRNA occurs in sizes ranging lated regions (UTR) of messenger RNA (mRNA), from 21 nucleotides for ‘small temporal RNA’ short ncRNAs are also integral components of (stRNA) and double-stranded silencing RNA major nuclear catalytic complexes, for example, the (siRNA) (Pasquinelli et al., 2000; Harborth et al., small nuclear RNAs (snRNAs) of the spliceosome 2001), to greater than 40 kb for overlapping anti- (Valadkhan and Manley, 2001), and telomerase sense or intergenic transcripts associated with chro- RNA (Lukowiak et al., 2001). In addition, a wide matin remodeling at imprinted gene loci such as variety of transcriptional and translational regula- IGF2R and XIST (Wutz et al., 1997; Lee et al., tory mechanisms have been either described or Copyright # 2002 John Wiley & Sons, Ltd. Overlapping antisense transcription in the human genome 245 proposed that involve the base-pairing of comple- These sequences include the coding regions as well mentary RNA molecules produced either in cis or as the 5k and 3k untranslated regions (UTRs) of in trans. These include the small nucleolar RNAs each gene. The BLASTN program (version 2.0.13, (snoRNAs), which modify rRNA and snRNAs (Kiss, Altschul et al., 1990) was then used to compare 2001), the ‘microRNAs’ (miRNAs) including stRNA each sequence in RefSeq to the complementary and siRNA (Eddy, 2001), and overlapping antisense strand of all the remaining genes. This locates pairs transcripts (OATs) produced in cis at protein-coding of genes that have, in principle, the ability to form gene loci in mammals (Kumar and Carmichael, 1998; stretches of double-stranded RNA. The threshold Vanhe´e-Brossollet and Vaquero, 1998). E-value was set to 10x8 to exclude weak matches. Imprinted genes, which are expressed from only This yielded a collection of 1221 high scoring pairs one of the parental alleles during mammalian devel- (HSPs). These included matches due to the pre- opment, comprise a functionally diverse family of sence of repeated sequences (e.g. ALU repeats in developmentally regulated genes with unusual geno- the UTRs), which were filtered manually using mic features, such as associated tandem repeats Repeatmasker (A.F.A. Smith and P. Green, (Neumann et al., 1995) and reduced intronic con- RepeatMasker at http://ftp.genome.washington.edu/ tent (McVean et al., 1996). In addition, there may RM/RepeatMasker.html Smith and Green, 27). The be an enrichment of OATs at imprinted loci (Moore, remaining pairs of sequences were then checked 2001). Such imprinted, antisense transcripts may be using the Locuslink records from RefSeq and the functionally significant because many are expressed at UCSC human genome browser (http://genome.ucsc. high levels and are associated with genomic regions edu/) to locate pairs of overlapping genes that map implicated in regulating the imprinting mechanism to the same chromosomal location. (Moore et al., 1997; Sleutels et al., 2000). However, In a second series of experiments, the sequ- there are also examples of OATs at non-imprinted ences of known imprinted genes from human and loci (Vanhe´e-Brossollet and Vaquero, 1998). More- mouse were examined for complementary matches over, the apparent enrichment of such transcripts against corresponding databases of EST sequ- at imprinted loci may reflect an ascertainment bias, ences using the Gene2est server at http://www.woody. because of the intensive study of the genomic organ- embl-heidelberg.de/gene2est (Gemund et al., 2001). ization and allele-specific expression patterns of The list of mouse and human imprinted genes was these genes relative to non-imprinted genes. taken from the Genomic Imprinting Website (http:// In order to address the question of the func- www.geneimprint.com/). The Gene2est server pro- tional significance of OATs in the human genome, duces a BLAST output, which was imported into we sought to estimate the frequency of their occur- Artemis for visualization of results (Rutherford rence and to delineate their genomic structures et al., 2000). In order to check the validity of EST through bioinformatics, by using BLASTN to search ‘hits’ to the complementary strand, mouse RefSeq for sequence complementarities between transcribed was blasted against mouse EST sequences from gene sequences in the public databases. We were able Genbank 124 (June 2001). to place a lower boundary on this estimate of approximately one thousand OATs in the human genome. Results Materials and methods The initial 1221 HSPs from the BLASTN searches were taken and reduced to 56 pairs of overlapping The RefSeq database (Pruitt et al., 2001), which genes as described in Materials and Methods contains annotated mRNA sequences for 11 015 (Table 1). As expected under an assumption of different human genes (at January 2001), was used. random distribution, a large proportion of the trans- These are high quality gene predictions that use a cripts map to the two largest chromosomes, 1 and combination of the scientific literature, expressed 2. The majority of overlaps are between the 3k sequence tag (EST) sequences and automatic pre- UTRs of the transcripts (Table 2), with a smaller dictions of the locations of introns and exons. We number located in the 5k UTRs, or between the 5k downloaded the complete processed mRNA sequ- UTR of one transcript and the 3k UTR of another. ences for all genes (ftp://ftp.ncbi.nlm.nih.gov/refseq/). The overlaps typically extend over 50 – 200 Copyright # 2002 John Wiley & Sons, Ltd. Comp Funct Genom 2002; 3: 244–253. 246 M. E. Fahey et al. Table 1. A list of overlapping transcripts reported in chromosomal order. The length of the overlap between each pair of transcripts is denoted in nucleotides. The type of overlap is denoted as: between the 3k UTRs (3k), between the 5k UTRs (5k), or the 5k UTR of one transcript and 3k UTR of the other (3k/5k). If a coding region is involved, it is marked cds. ** denotes a reviewed RefSeq sequence which has been manually pro- cessed; * indicates a provisional sequence on which some initial quality checking has been done; the remaining cases correspond to automatically generated predicted records which are validated by cDNA or EST data, and/or closely related homologous sequences Size of Map Overlap in Name Position Overlap Nucleotides Comment MUF1 1p33 3k 72 -leucine repeat *RAD54L -role in DNA recombination and repair FLJ20580 1p32.3