Identification of Gene 3′ Ends by Automated EST Cluster Analysis
Enrique M. Muro(a,b), Robert Herrington(c), Salima Janmohamed(c), Catherine Frelin(c), Miguel A. Andrade-Navarro(a,b,1), and Norman N. Iscove(c,1,2)

(a) Ottawa Health Research Institute, 501 Smyth Road, Ottawa, ON, Canada K1H 8L6; (b) Max Delbrück Center for Molecular Medicine, Robert-Rössle-Strasse 10, 13125 Berlin, Germany; and (c) Ontario Cancer Institute, Princess Margaret Hospital, University Health Network, University of Toronto, Toronto Medical Discovery Tower, 101 College Street, Toronto, ON, Canada M5G 1L7

Edited by Tak Wah Mak, University of Toronto, Toronto, ON, Canada, and approved October 14, 2008 (received for review August 11, 2008)

The properties and biology of mRNA transcripts can be affected profoundly by the choice of alternative polyadenylation sites, making definition of the 3′ ends of transcripts essential for understanding their regulation. Here we show that 22–52% of sequences in commonly used human and murine "full-length" transcript databases may not currently end at bona fide polyadenylation sites. To identify probable transcript termini over the entire murine and human genomes, we analyzed the EST databases for positional clustering of EST ends. The analysis yielded 58,282 murine and 86,410 human candidate polyadenylation sites, of which 75% mapped to 23,091 known murine transcripts and 22,891 known human transcripts. The murine dataset correctly predicted 97% of the 3′ ends in a manually curated and experimentally supported benchmark transcript set. Of currently known genes, 15% had no associated prediction and 25% had only a single predicted termination site. The remaining genes had an average of 3–4 alternative polyadenylation sites predicted for each murine or human transcript, respectively. The results are made available in the form of tables and an interactive web site that can be mined for rapid assessment of the validity of 3′ ends in existing collections, enumeration of potential alternative 3′ polyadenylation sites of known transcripts, direct retrieval of terminal sequences for design of probes, and detection of polyadenylation sites not currently mapped to known genes.

3′ UTR | gene prediction | alternative polyadenylation | transcriptome | transcript probe design

The 3′ ends of nascent mRNA transcripts are generated by a multifactorial complex that recognizes a well-defined hexamer polyadenylation signal (PAS, typically AAUAAA or AUUAAA) in a context that includes other less clearly defined motifs, cleaves the RNA 16–28 nt downstream of the PAS, and adds the characteristic polyA tail. Many transcripts are cleaved and polyadenylated at alternative sites that may depend on cellular contexts (1–5). The 3′ UTRs of transcripts frequently contain motifs that regulate their stability and ribosomal translation (6) and their translocation to the cytoplasm. Additionally, these 3′ UTRs of transcripts may contain miRNA targets and short hairpin loops of regulatory significance. Indeed, as many as 40% of miRNA targets have been estimated to be located in alternative 3′ UTR segments (7). The properties and biology of transcripts can therefore be affected profoundly by the choice of alternative polyadenylation sites, making definition of the 3′ ends of transcripts essential for understanding their regulation.

There are additional practical motivations for determining 3′ ends. For many genes, similarities in coding regions may make it necessary to find probes that target the UTRs to achieve the desired specificity. Furthermore, methods for amplification of cDNA are finding increasing use in circumstances wherein amounts of available RNA are too small, for example, for direct application to microarrays (8, 9). Because amplification protocols are typically initiated by oligo(dT) priming on transcript polyA tails, there is an inherent bias toward best preservation of abundance of terminal 3′ sequences. Amplified cDNAs are therefore ideally interrogated by probes that target sequences as close to 3′ ends as is practical.

For access to 3′ terminal sequences, biologists usually turn to highly curated collections of complete mRNA sequences, such as the RefSeq (10), Ensembl (11), UCSC KnownGene (12), FANTOM (3) and VEGA (13) collections. Probes in widely used commercial microarrays are similarly selected from sequence collections representative of full-length transcripts, from which probes near 3′ polyadenylation sites could be generated in principle. Affymetrix, for example, has published collections of murine and human transcript sequences. However, it cannot be taken for granted that transcripts in public or commercial collections do in fact include sequences near all used alternative sites or that a given transcript sequence is complete to any of its possible termini. At the present time, no survey of the extent of completeness of the sequences in popular collections is available, leaving the onus on the user to determine whether a given transcript sequence ends at a valid polyadenylation site or not and whether other more frequently used sites exist upstream or downstream of available sequence ends.

A useful methodology exists (14–19) for locating polyadenylation sites based on the large and growing databases of EST sequences. These short sequences are sampled from larger cDNA clones, which are generally reverse-transcribed from polyadenylated cellular transcripts after priming with oligo(dT). NCBI's GenBank contained over 96 million such sequences as of March 2007. The high level of EST redundancy means that the expression of each gene is described in each organism by multiple independently derived ESTs, while their preparation based on oligo(dT) priming results in a preponderance of ESTs aligning to and terminating at 3′ polyadenylation sites.

Here we report results directly addressing the needs of researchers for methodology and readily accessible databases facilitating the identification of 3′ transcript ends. First, an analysis of the main public murine and human transcript collections provides evidence that up to half of available sequences may not end at true polyadenylation sites. We further describe and validate an EST-based method with improved prediction of candidate polyadenylation sites. The method was applied to both murine and human genomes to yield sets of predicted 3′ ends mapped against the currently known genes. The results are [...]

Author contributions: E.M.M., M.A.A.-N., and N.N.I. designed research; E.M.M., R.H., S.J., and C.F. performed research; M.A.A.-N. and N.N.I. analyzed data; and M.A.A.-N. and N.N.I. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
(1) M.A.A.-N. and N.N.I. contributed equally to this work.
(2) To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/cgi/content/full/0807813105/DCSupplemental.
© 2008 by The National Academy of Sciences of the USA
20286–20290 | PNAS | December 23, 2008 | vol. 105 | no. 51 | www.pnas.org/cgi/doi/10.1073/pnas.0807813105

Table 1. Proportions of transcript sequence collections containing a 3′ PAS

Collection                                 No. of transcripts   Proportion with terminal PAS
Mouse
  RefSeq mm6 Total NM_*                    18,280               0.74
  Ensembl 36.40 Known                      28,110               0.56
  Ensembl 36.40 Novel                       3,192               0.14
  Affymetrix MOE430 Consensus              45,037               0.58
  FANTOM3 Representative                   46,315               0.58
  VEGA Known July 2007                     24,791               0.48
  UCSC KnownGenes mm9                      43,685               0.64
Human
  RefSeq Release 20 human NM_*             24,162               0.78
  Affymetrix HG-U133 Consensus/Exemplar    54,613               0.54
  VEGA Total October 2006                  98,630               0.60
  H-InvDB v3.8 Representative              35,005               0.60
  UCSC KnownGenes hg18                     54,896               0.73

The presence of a validated PAS (Table S1 and Fig. S1) within 40 nt of the transcript ends was recorded after trimming of extraneous or homopolymeric sequences from the 3′ ends as required for individual datasets. The VEGA (13) and H-InvDB (www.h-invitational.jp) (23) collections are highly curated. VEGA integrates gene information for vertebrate organisms originating from NCBI, UCSC, Ensembl, and the Sanger Institute. H-InvDB integrates and maps cDNAs to the human genome from selected high-throughput cDNA sequencing projects. Source files for the analysis were RefSeq (mm6) refMrna.fa; Ensembl Musmusculus.NCBIM36.40.cdna.known.fa

[...]tively low accuracy of EST sequencing and recent genomic duplications, some ESTs can be aligned with more than one genomic position. To avoid possible misidentifications, ambiguously aligning ESTs were excluded from our analysis. For end-cluster detection, we plotted the number of matching ESTs against position along the entire genome. This analysis revealed, as expected, that the numbers of aligned ESTs gradually increase toward transcript 3′ ends (Fig. 1) and then abruptly fall, suggesting that the plot could be used to infer the direction of transcription. We used an approach that exploited a convolution algorithm to detect and quantify the shape of such edges, allowing adjustment of the algorithm parameters. We then tuned parameters and detection thresholds on various transcript collections to optimize the relationship between the numbers of database sequence ends detected and the proportion of predicted ends corresponding to nominal ends. Fig. S3 illustrates optimization of 1 parameter, the minimum number of EST ends required to identify a potential polyadenylation site.

The automated analysis identified 58,282 and 86,410 EST clusters on the murine and human genomes, respectively, that contain at least 2 EST ends and an appropriately positioned PAS. These candidate 3′ ends were mapped in relationship to the UCSC KnownGene (12) collection and assembled in Dataset S1, Dataset S2, Dataset S3, and Dataset S4. The legends to the datasets are located at the end of the Supplementary Text. Approximately 75% of candidate ends lay within KnownGene transcription zones or up to 10 kb downstream of their nominal ends.
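The end-cluster idea described in the text (EST coverage that rises toward a 3′ end and then falls abruptly, scored by convolution, then filtered for a minimum number of EST ends and a nearby PAS) can be illustrated with a small sketch. This is not the authors' implementation. The step-shaped kernel, the window of 5 nt, the score threshold of 2.0 (a loose stand-in for "at least 2 EST ends"), the reuse of the 40-nt PAS search window from the Table 1 legend, and all function names are assumptions made for illustration, and only the forward strand is handled.

```python
"""Toy sketch of EST end-cluster detection (illustrative assumptions throughout)."""

# DNA forms of the AAUAAA / AUUAAA signals named in the text.
PAS_HEXAMERS = ("AATAAA", "ATTAAA")


def coverage(alignments, length):
    """Count how many ESTs cover each position.

    alignments: iterable of (start, end) half-open intervals, forward strand only.
    """
    cov = [0] * length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, length)):
            cov[pos] += 1
    return cov


def edge_scores(cov, half_width=5):
    """Convolve coverage with a step kernel (assumed shape).

    The score at i is the mean coverage over the half_width positions before i
    minus the mean over the half_width positions starting at i. A large positive
    value marks an abrupt fall in coverage at i.
    """
    scores = [0.0] * len(cov)
    for i in range(half_width, len(cov) - half_width + 1):
        before = sum(cov[i - half_width:i])
        after = sum(cov[i:i + half_width])
        scores[i] = (before - after) / half_width
    return scores


def has_upstream_pas(sequence, end, window=40):
    """True if a PAS hexamer lies within `window` nt upstream of `end`."""
    region = sequence[max(0, end - window):end].upper()
    return any(hexamer in region for hexamer in PAS_HEXAMERS)


def candidate_ends(alignments, sequence, min_drop=2.0, half_width=5):
    """Return (position, score) for local score maxima that pass both filters."""
    cov = coverage(alignments, len(sequence))
    scores = edge_scores(cov, half_width)
    hits = []
    for i in range(1, len(scores) - 1):
        is_peak = scores[i] > scores[i - 1] and scores[i] >= scores[i + 1]
        if is_peak and scores[i] >= min_drop and has_upstream_pas(sequence, i):
            hits.append((i, scores[i]))
    return hits


if __name__ == "__main__":
    # Hypothetical 200-nt region with one PAS 25 nt upstream of position 100.
    genome = "C" * 75 + "AATAAA" + "C" * 119
    ests = [(20, 100), (40, 100), (60, 100)]  # three ESTs sharing one 3' end
    print(candidate_ends(ests, genome))
```

In this toy setup the three ESTs share one 3′ end, so coverage steps up toward position 100 and then drops to zero. A real analysis would start from genome-wide EST alignments, treat both strands, and tune the kernel and thresholds against reference transcript collections, as the text describes.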