Identification of 3′ ends by automated EST cluster analysis

Enrique M. Muro (a,b), Robert Herrington (c), Salima Janmohamed (c), Catherine Frelin (c), Miguel A. Andrade-Navarro (a,b,1), and Norman N. Iscove (c,1,2)

(a) Ottawa Health Research Institute, 501 Smyth Road, Ottawa, ON, Canada K1H 8L6; (b) Max Delbrück Center for Molecular Medicine, Robert-Rössle-Strasse 10, 13125 Berlin, Germany; and (c) Ontario Cancer Institute, Princess Margaret Hospital, University Health Network, University of Toronto, Toronto Medical Discovery Tower, 101 College Street, Toronto, ON, Canada M5G 1L7

Edited by Tak Wah Mak, University of Toronto, Toronto, ON, Canada, and approved October 14, 2008 (received for review August 11, 2008)

Author contributions: E.M.M., M.A.A.-N., and N.N.I. designed research; E.M.M., R.H., S.J., and C.F. performed research; M.A.A.-N. and N.N.I. analyzed data; and M.A.A.-N. and N.N.I. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
(1) M.A.A.-N. and N.N.I. contributed equally to this work.
(2) To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/cgi/content/full/0807813105/DCSupplemental.
© 2008 by The National Academy of Sciences of the USA

The properties and biology of mRNA transcripts can be affected profoundly by the choice of alternative polyadenylation sites, making definition of the 3′ ends of transcripts essential for understanding their regulation. Here we show that 22–52% of sequences in commonly used human and murine "full-length" transcript databases may not currently end at bona fide polyadenylation sites. To identify probable transcript termini over the entire murine and human genomes, we analyzed the EST databases for positional clustering of EST ends. The analysis yielded 58,282 murine- and 86,410 human-candidate polyadenylation sites, of which 75% mapped to 23,091 known murine transcripts and 22,891 known human transcripts. The murine dataset correctly predicted 97% of the 3′ ends in a manually curated and experimentally supported benchmark transcript set. Of currently known genes, 15% had no associated prediction and 25% had only a single predicted termination site. The remaining genes had an average of 3–4 alternative polyadenylation sites predicted for each murine or human transcript, respectively. The results are made available in the form of tables and an interactive web site that can be mined for rapid assessment of the validity of 3′ ends in existing collections, enumeration of potential alternative 3′ polyadenylation sites of known transcripts, direct retrieval of terminal sequences for design of probes, and detection of polyadenylation sites not currently mapped to known genes.

3′ UTR | gene prediction | alternative polyadenylation | transcriptome | transcript probe design

The 3′ ends of nascent mRNA transcripts are generated by a multifactorial complex that recognizes a well-defined hexamer polyadenylation signal (PAS, typically AAUAAA or AUUAAA) in a context that includes other less clearly defined motifs, cleaves the RNA 16–28 nt downstream of the PAS, and adds the characteristic polyA tail. Many transcripts are cleaved and polyadenylated at alternative sites that may depend on cellular contexts (1–5). The 3′ UTRs of transcripts frequently contain motifs that regulate their stability and ribosomal translation (6) and their translocation to the cytoplasm. Additionally, these 3′ UTRs of transcripts may contain miRNA targets and short hairpin loops of regulatory significance. Indeed, as many as 40% of miRNA targets have been estimated to be located in alternative 3′ UTR segments (7). The properties and biology of transcripts can therefore be affected profoundly by the choice of alternative polyadenylation sites, making definition of the 3′ ends of transcripts essential for understanding their regulation.

There are additional practical motivations for determining 3′ ends. For many genes, similarities in coding regions may make it necessary to find probes that target the UTRs to achieve the desired specificity. Furthermore, methods for amplification of cDNA are finding increasing use in circumstances wherein amounts of available RNA are too small, for example, for direct application to microarrays (8, 9). Because amplification protocols are typically initiated by oligo(dT) priming on transcript polyA tails, there is an inherent bias toward best preservation of abundance of terminal 3′ sequences. Amplified cDNAs are therefore ideally interrogated by probes that target sequences as close to 3′ ends as is practical.

For access to 3′ terminal sequences, biologists usually turn to highly curated collections of complete mRNA sequences, such as the RefSeq (10), Ensembl (11), UCSC KnownGene (12), FANTOM (3) and VEGA (13) collections. Probes in widely used commercial microarrays are similarly selected from sequence collections representative of full-length transcripts, from which probes near 3′ polyadenylation sites could be generated in principle. Affymetrix, for example, has published collections of murine and human transcript sequences. However, it cannot be taken for granted that transcripts in public or commercial collections do in fact include sequences near all used alternative sites or that a given transcript sequence is complete to any of its possible termini. At the present time, no survey of the extent of completeness of the sequences in popular collections is available, leaving the onus on the user to determine whether a given transcript sequence ends at a valid polyadenylation site or not and whether other more frequently used sites exist upstream or downstream of available sequence ends.

A useful methodology exists (14–19) for locating polyadenylation sites based on the large and growing databases of EST sequences. These short sequences are sampled from larger cDNA clones, which are generally reverse-transcribed from polyadenylated cellular transcripts after priming with oligo(dT). NCBI's GenBank contained over 96 million such sequences as of March 2007. The high level of EST redundancy means that the expression of each gene is described in each organism by multiple independently derived ESTs, while their preparation based on oligo(dT) priming results in a preponderance of ESTs aligning to and terminating at 3′ polyadenylation sites.

Here we report results directly addressing the needs of researchers for methodology and readily accessible databases facilitating the identification of 3′ transcript ends. First, an analysis of the main public murine and human transcript collections provides evidence that up to half of available sequences may not end at true polyadenylation sites. We further describe and validate an EST-based method with improved prediction of candidate polyadenylation sites. The method was applied to both murine and human genomes to yield sets of predicted 3′ ends mapped against the currently known genes. The results are

presented in 4 datasets (Dataset S1, Dataset S2, Dataset S3, and Dataset S4), 2 text files (UCSCSessionmm8 and UCSCSessionhg18), and an interactive web site (www.ogic.ca/ts) that can be mined for rapid assessment of the validity of 3′ ends in existing collections, enumeration of potential alternative 3′ polyadenylation sites of known transcripts, direct retrieval of terminal sequences for the design of probes, and detection of polyadenylation sites not currently mapped to known genes. (For additional details, see also SI Text, Figs. S1–S8, and Tables S1–S5.)

PNAS | December 23, 2008 | vol. 105 | no. 51 | 20286–20290 | www.pnas.org/cgi/doi/10.1073/pnas.0807813105

Table 1. Proportions of transcript sequence collections containing a 3′ PAS

Collection | No. of transcripts | Proportion with terminal PAS
Mouse
RefSeq mm6 Total NM_* | 18,280 | 0.74
Ensembl 36.40 Known | 28,110 | 0.56
Ensembl 36.40 Novel | 3,192 | 0.14
Affymetrix MOE430 Consensus | 45,037 | 0.58
FANTOM3 Representative | 46,315 | 0.58
VEGA Known July 2007 | 24,791 | 0.48
UCSC KnownGenes mm9 | 43,685 | 0.64
Human
RefSeq Release 20 human NM_* | 24,162 | 0.78
Affymetrix HG-U133 Consensus/Exemplar | 54,613 | 0.54
VEGA Total October 2006 | 98,630 | 0.60
H-InvDB v3.8 Representative | 35,005 | 0.60
UCSC KnownGenes hg18 | 54,896 | 0.73

The presence of a validated PAS (Table S1 and Fig. S1) within 40 nt of the transcript ends was recorded after trimming of extraneous or homopolymeric sequences from the 3′ ends as required for individual datasets. The VEGA (13) and H-InvDB (www.h-invitational.jp) (23) collections are highly curated. VEGA integrates gene information for vertebrate organisms originating from NCBI, UCSC, Ensembl, and the Sanger Institute. H-InvDB integrates and maps cDNAs to the human genome from selected high-throughput cDNA sequencing projects. Source files for the analysis were RefSeq (mm6) refMrna.fa; Ensembl Mus_musculus.NCBIM36.40.cdna.known.fa and Mus_musculus.NCBIM36.40.cdna.novel.fa; Affymetrix MOE430A_consensus and MOE430B_consensus (nonredundant list excluding control probes); FANTOM3 fantom3.00.seq using the annotation file repseq.sep.xml to extract the representative sequence subset; Mus_musculus.VEGA.jul.cdna.known.fa; UCSC KnownGenes mm9 knownGeneMrna.txt; NCBI RefSeq human.rna.fna 11/27/06; Affymetrix HG-U133_Plus_2_consensus and HG-U133_Plus_2_exemplar; Homo_sapiens.VEGA.oct.cdna.tot.fa; H-InvDB nuc_rep.fa.gz; and UCSC KnownGenes hg18 knownGeneMrna.txt.gz.

Results

To obtain a comprehensive first approximation of the degree of completeness of available transcript sequence collections, we determined the proportion of 3′ termini in various murine and human collections (including RefSeq, ENSEMBL, UCSC KnownGene, FANTOM, and VEGA) that contained a hexamer PAS. The results summarized in Table 1 indicate that despite the extensive hand curation involved in their creation, 22–52% of sequences in individual collections lack a terminal PAS and are thus likely not to end at sites of polyadenylation. We will refer here to 3′ ends of database transcript sequences as "nominal ends."

For automating the process of detection of potential polyadenylation sites, we sought to identify clusters of EST ends associated with a nearby upstream PAS and ending within a few bases of each other when aligned with the genome. The method is summarized here and described in greater detail in the SI Text. To determine the coordinates of genomic EST alignments, we used the UCSC genome annotation (20). Because of the relatively low accuracy of EST sequencing and recent genomic duplications, some ESTs can be aligned with more than one genomic position. To avoid possible misidentifications, ambiguously aligning ESTs were excluded from our analysis. For end-cluster detection, we plotted the number of matching ESTs against position along the entire genome. This analysis revealed, as expected, that the numbers of aligned ESTs gradually increase toward transcript 3′ ends (Fig. 1) and then abruptly fall, suggesting that the plot could be used to infer the direction of transcription. We used an approach that exploited a convolution algorithm to detect and quantify the shape of such edges, allowing adjustment of the algorithm parameters. We then tuned parameters and detection thresholds on various transcript collections to optimize the relationship between the numbers of database sequence ends detected and the proportion of predicted ends corresponding to nominal ends. Fig. S3 illustrates optimization of 1 parameter, the minimum number of EST ends required to identify a potential polyadenylation site.

The automated analysis identified 58,282 and 86,410 EST clusters on the murine and human genomes, respectively, that contain at least 2 EST ends and an appropriately positioned PAS. These candidate 3′ ends were mapped in relationship to the UCSC KnownGene (12) collection and assembled in Dataset S1, Dataset S2, Dataset S3, and Dataset S4. The legends to the datasets are located at the end of the Supplementary Text. Approximately 75% of candidate ends lay within KnownGene transcription zones or up to 10 kb downstream of their nominal ends. Within the total KnownGene collection, 15% of genes had no associated prediction, 25% had only a single predicted termination site, and the remaining 60% had an average of 3–4 alternative polyadenylation sites predicted for each murine or human transcript, respectively. Failure to predict termination sites for 15% of the KnownGene collection could occur by failure to detect termini due to termination at rare variant PAS motifs, insufficient local EST representation, or complexity in the EST signature arising from nearby transcript ends on the opposite strand. KnownGenes may also lack 3′ termini because they are incomplete; an example discussed below is the murine Mll2 gene (Table 2), the true termini of which are likely those detected in the ensuing downstream KnownGene. Of murine RefSeq genes originally lacking terminal PAS, 57% were assigned candidate ends in our analysis.

To assess the accuracy of identification of polyadenylation sites, we needed a benchmark set of transcripts ending at validated 3′ polyadenylation sites. For this purpose, we used a collection of 113 murine genes chosen originally on the basis of their biological interest and accordingly representing an essentially random sampling relative to our objective here. For each transcript, a single likely 3′ transcript terminus was chosen by hand curation based on identification of clustered EST ends occurring closely downstream of a PAS (Table S2). The validity of each curated 3′ terminus was tested by specific secondary PCR probing of total cDNA, initially amplified globally from murine ES cells or purified hematopoietic precursor cells (21) under stringent conditions of oligo(dT) primer annealing and reverse transcription (8). The global RT-PCR amplification procedure yields a mixture of individual cDNA fragments, each confined to a 300- to 500-nt window immediately upstream of a polyA sequence (8). PCR primers (Table S3) were synthesized to target sequences contained within 300 nt upstream of the predicted polyadenylation sites and used to probe for the presence of their targets in globally amplified cDNA. Such targets would only be present in the global cDNA if they were located closely upstream of polyA sequences in the original RNA templates. Fragments of the predicted size were amplified in each instance (Fig. S5), providing experimental support for the usage of all 113 curated polyadenylation sites.
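The two elementary checks used in this section, a PAS within 40 nt of a transcript end (Table 1) and a cluster of at least 2 EST ends lying within a few bases of each other, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `max_gap` value of 3 nt, the crude polyA trimming, and the restriction to the two canonical hexamers are all assumptions made for illustration; the validated PAS list of Table S1 is not reproduced here.

```python
from collections import Counter

# Assumption: only the two canonical hexamers named in the text (DNA form);
# the validated PAS list in Table S1 is longer and is not reproduced here.
PAS_HEXAMERS = ("AATAAA", "ATTAAA")


def has_terminal_pas(transcript, window=40, hexamers=PAS_HEXAMERS):
    """Return True if a PAS hexamer lies within `window` nt of the 3' end."""
    # Crude stand-in for trimming homopolymeric sequence: drop trailing A's.
    seq = transcript.upper().rstrip("A")
    tail = seq[-window:]
    return any(h in tail for h in hexamers)


def est_end_clusters(end_positions, min_ends=2, max_gap=3):
    """Group EST end coordinates into clusters of nearby ends.

    Returns one (most frequent end position, number of ends, width) tuple
    per cluster that holds at least `min_ends` ends. `max_gap` is a
    hypothetical tolerance, not a value taken from the paper.
    """
    clusters, current = [], []
    for pos in sorted(end_positions):
        # Start a new cluster when the gap to the previous end is too large.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_ends:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_ends:
        clusters.append(current)
    return [
        (Counter(c).most_common(1)[0][0], len(c), c[-1] - c[0] + 1)
        for c in clusters
    ]


if __name__ == "__main__":
    toy = "GCGC" * 20 + "AATAAA" + "GCTGCTGCTGCTGCTGCTGC" + "A" * 15
    print(has_terminal_pas(toy))
    print(est_end_clusters([100, 100, 101, 103, 500, 900, 902]))
```

A candidate site in the sense of the text would require both conditions at once: a qualifying cluster with a hexamer suitably positioned upstream of it.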

Fig. 1. EST evidence for alternative 3′ ends for murine Pde7a transcripts. (A) The diagram, obtained from the UCSC Genome Browser (Mouse mm8, February 2006 Assembly) (22), illustrates a region of mouse chromosome 3 spanning 8 kb, including the 3′ UTR end of the Pde7a gene, transcribed from right to left. The ends of gene transcript predictions from RefSeq and Ensembl are represented as blue and brown bars, respectively. Black boxes represent the matches to mouse ESTs and mRNAs. Accumulations of EST ends at 3 particular positions are visible (indicated by a red diamond, a yellow oval, and a violet diamond). The position indicated by the red diamond suggests the existence of an alternative termination of the Pde7a gene considered neither by RefSeq nor by Ensembl. (B) Interpretation of EST and PAS information around the predicted 3′ UTR of murine Pde7a. The curve, in blue, indicates the number of EST matches at each position. Many of these ESTs end abruptly at the left side of the principal peak, whereas the right side of the peak has a softer slope, which indicates that the ESTs derive from transcripts running from right to left, in agreement with the known direction of transcription of the Pde7a gene. The vertical red lines are maxima of the convolution of the EST-match histogram (see Materials and Methods), which indicate potential terminations. The red lines below the baseline represent potential terminations in the sense of the Pde7a transcription. Further evidence using PAS and clusters of EST ends is then used to confirm transcript ends. The 2 violet vertical bars (under both diamonds) represent clusters of EST ends located near rough ends and composed of at least 2 ESTs ending in the same position with a valid local polyadenylation signal. As explained in A, the rightmost end (violet diamond) is absent from the Ensembl collection but is found in the RefSeq database; the central end (yellow oval) is represented in Ensembl; and the leftmost (red diamond) is not represented in those collections. Of note, the end marked with the yellow oval had many EST ends (see A). However, there was no corresponding PAS, and a tract of 16 consecutive A's coincided with the peak of EST ends. This end, reported by Ensembl, appears to reflect internal transcript priming during cDNA generation rather than a site of transcript termination.
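The edge detection described for Fig. 1B can be sketched as a convolution of the EST-match histogram with a step-shaped kernel. The kernel actually used is specified only in the SI Text, so the step kernel, the `half_width` and `threshold` values, and all function names below are illustrative assumptions. A positive score marks an abrupt fall in coverage to the right (a candidate end on the forward strand); a negative score marks an abrupt fall to the left (a candidate end on the reverse strand, as for Pde7a).

```python
def coverage(alignments, length):
    """Number of EST matches at each position (0-based, end-exclusive)."""
    cov = [0] * length
    for start, end in alignments:
        for i in range(max(start, 0), min(end, length)):
            cov[i] += 1
    return cov


def edge_scores(cov, half_width=5):
    """Convolve coverage with a step kernel.

    The score at i is the summed coverage in the `half_width` positions
    before i minus the summed coverage in the `half_width` positions
    starting at i.
    """
    n = len(cov)
    scores = [0] * n
    for i in range(half_width, n - half_width + 1):
        left = sum(cov[i - half_width:i])
        right = sum(cov[i:i + half_width])
        scores[i] = left - right
    return scores


def rough_ends(cov, half_width=5, threshold=10):
    """Local maxima of |score| above a hypothetical threshold.

    Returns (position, strand) pairs; "+" means coverage drops to the
    right of the position, "-" means it drops to the left.
    """
    scores = edge_scores(cov, half_width)
    ends = []
    for i in range(1, len(scores) - 1):
        s = abs(scores[i])
        if s >= threshold and s >= abs(scores[i - 1]) and s > abs(scores[i + 1]):
            ends.append((i, "+" if scores[i] > 0 else "-"))
    return ends


if __name__ == "__main__":
    # Toy ESTs with staggered starts and a shared end at position 60.
    forward = coverage([(30, 60), (40, 60), (45, 60), (50, 60)], 100)
    print(rough_ends(forward))
```

In this toy case the gradual rise in coverage stays below the threshold while the shared end produces a single sharp maximum, which is the asymmetry the caption uses to infer the direction of transcription. Rough ends found this way would still need confirmation by a cluster of EST ends and a valid PAS, as the caption notes.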

Each curated end was then mapped against the complete set of predicted ends listed in Dataset S1 and Dataset S2. Table S4 lists the genome coordinates of the curated benchmark set, the corresponding nominal database ends, and the automatically identified ends. Of the benchmark set, 110 (97%) were exactly matched by automatically identified ends. Of 31 curated benchmark 3′ termini that differed from the nominal database ends, all were correctly identified by the computational procedure within a tolerance of 50 nt. Of the 3 curated ends that were undetected by the automated algorithm, Cbx1 shares 3′ terminal sequence with other genes leading to the exclusion of relevant terminal ESTs from the automated procedure, whereas Ring1 apparently uses an anomalous PAS at its curated end (AACAAA). Despite the presence of many EST ends at the Phf1 transcript terminus, ESTs originating from a gene on the opposite strand and terminating close to the 3′ end of Phf1 interfered with edge detection by the convolution function. The high proportion of valid benchmark ends included in our automated prediction set extrapolates to a high rate of inclusion of valid ends in our murine and human prediction collections. Table S5 further describes the predictive precision of our murine dataset against various transcript collections, and Fig. S4 documents the improved performance of our dataset relative to earlier analyses and available curated transcript sets.

Dataset S1, Dataset S2, Dataset S3, and Dataset S4 can be used to devise cDNA probes for the 3′ ends of any particular transcript. For each predicted terminus, 400 genomic nucleotides are included in the table corresponding to the terminal alignment of one EST ending at the predicted polyadenylation site. Priming at A-rich tracts internal to transcripts could yield ESTs whose polyA tails originate from genomic sequence rather than polyadenylation (Fig. 1). Therefore 40 nt of sequence downstream of the predicted terminus are also supplied with a flag to indicate whether a downstream A-rich tract is present.

Four practical examples of use of the prediction tables are described here in detail. Table 2 contains the pertinent information for 4 murine transcripts as extracted from Dataset S1.

Table 2. Predicted 3′ ends for murine genes Pde7a, Rnf11, Mll2, and Zdhhc5 extracted from Dataset S1 and Dataset S2

    | 1   | 2      | 3         | 4        | 5              | 6         | 7         | 8      | 9        | 10     | 11     | 12        | 13        | 14        | 15          | 16
row | chr | strand | sourceAcc | Sym      | desc           | znStart   | znEnd     | length | nominal_ | dist_5' | dist_3' | EST_ends_ | width_end | rel_pos_m | representat | seq_up
1   | 3   | -      | NM_008802 | Pde7a    | phosphodies    | 19457108  | 19418068  | 39040  | 0        |        |        |           |           |           |             |
2   | 3   | -      |           | Pde7a    |                | 19436942  |           |        |          | 20166  |        | 2         | 11        | -11       | BX528750    | taactaggttg
3   | 3   | -      |           | Pde7a    |                | 19418206  |           |        |          |        | -138   | 33        | 36        | -13       | AI325273    | cttttgatgtga
4   | 3   | -      |           | Pde7a    |                | 19416840  |           |        |          |        | 1228   | 24        | 36        | -26       | CN459507    | tggtttcctcag
5   | 3   | -      |           | Pde7a    |                | 19415265  |           |        |          |        | 2803   | 7         | 9         | -5        | BE634954    | aaatactaact
6   | 3   | -      | NM_028840 | Armc1    | armadillo rep  | 19355216  | 19324294  | 30922  | 1        |        |        |           |           |           |             |
7   | 4   | -      | NM_013876 | Rnf11    | ring finger pr | 108974437 | 108950788 | 23649  | 0        |        |        |           |           |           |             |
8   | 4   | -      |           | Rnf11    |                | 108950564 |           |        |          |        | 224    | 88        | 86        | -11       | BF453904    | tccttatgttctc
9   | 4   | -      |           | Rnf11    |                | 108949038 |           |        |          |        | 1750   | 17        | 9         | -2        | AI790154    | atggtgtccct
10  | 4   | -      |           | Rnf11    |                | 108947497 |           |        |          |        | 3291   | 2         | 17        | -17       | R74785      | tgattgacag
11  | 4   | -      | AK148461  | AK148461 | Mus musculu    | 108904186 | 108901690 | 2496   | 0        |        |        |           |           |           |             |
12  | 15  | -      | BC058659  | Mll2     | Mll2           | 98672299  | 98661718  | 10581  | 0        |        |        |           |           |           |             |
13  | 15  | -      | AK039901  | AK039901 |                | 98660553  | 98659732  | 821    | 0        |        |        |           |           |           |             |
14  | 15  | -      |           | AK039901 |                | 98659964  |           |        |          |        | -232   | 3         | 5         | 0         | AA210592    | tctaccttcacc
15  | 15  | -      |           | AK039901 |                | 98659736  |           |        |          |        | -4     | 23        | 34        | -3        | BE847771    | cccaaccccgc
16  | 15  | -      | NM_016781 | Prkag1   | AMP-activat    | 98659542  | 98640830  | 18712  | 1        |        |        |           |           |           |             |
17  | 2   | -      | NM_144887 | Zdhhc5   | zinc finger, D | 84515874  | 84488808  | 27066  | 1        |        |        |           |           |           |             |
18  | 2   | -      |           | Zdhhc5   |                | 84488989  |           |        |          |        | -181   | 148       | 17        | -9        | AW544970    | agagagcttg
19  | 2   | -      | NM_025868 | Txndc14  | thioredoxin-r  | 84479013  | 84472155  | 6858   | 1        |        |        |           |           |           |             |

The use of these data to identify probable 3′ terminal sequences is explained in the text. KnownGene symbols and coordinates are bold. Column headings, truncated in the table, are described in the legend to Dataset S1 and Dataset S2 located at the end of Supplementary Information.

Murine Pde7a. The RefSeq for Pde7a lacks a polyadenylation signal near its nominal 3′ end (Table 2, row 1, column 9). The alignment of EST ends to the 3′ end of this transcript was shown in Fig. 1, where 3 termination maxima were clearly evident, none of which occurred at the nominal RefSeq end. Pde7a was located by a search of column 4 in Dataset S1 and Dataset S2, and the corresponding data were copied to Table 2. The Pde7a transcription interval on the genome is represented in the table by the alignment of RefSeq NM_008802 on chromosome 3 from position 19457108 (5′) to 19418068 (3′). Within the limits of −1,000 to +10,000 nt from the nominal 3′ end of NM_008802, 3 EST termination clusters are indicated (Table 2, rows 3–5, column 11), each comprising 7–33 EST ends (column 12). A possible fourth alternative ending is indicated in row 2, lying further upstream at 20,166 nt downstream of the RefSeq start. The predicted termini (Table 2, rows 3 and 4) at 138 nt upstream and 1,228 nt downstream of the nominal RefSeq end are supported by 33 and 24 ESTs and would be reasonable targets for 3′ probe construction. The values in Table 2, column 13, indicate that the EST termini are distributed over a 36-nt interval ("termination zone") for the predictions in both rows 3 and 4, descending from the start of the interval indicated in column 6. The 3′ terminal sequence of a representative EST (Table 2, column 15) at each predicted terminus is given in column 16. For the prediction in Table 2, row 3, the value in column 14 locates the maximum number of EST endings to a position 13 nt upstream of the 3′ end of the sequence in column 16. Each prediction in Dataset S1 and Dataset S2 has an associated link to the UCSC genome browser (22). The browser view links to information on the tissue origin of the individual ESTs terminating at a predicted end that could narrow the choice of particular probes among alternative endings. The view can also be used to determine whether there is a continuous pattern of overlapping ESTs aligning with the genome between a nominal end and a predicted end further downstream. EST continuity would support membership of the predicted end in the same transcriptional unit. Additional information related to position of a predicted end and the evidence used to generate it can be examined by using the link in Dataset S1 and Dataset S2 to our Transcriptome Sailor web server (www.ogic.ca/ts).

Murine Rnf11. Lookup of the gene symbol "Rnf11" in Dataset S1 and Dataset S2 identifies 3 predicted alternative polyadenylation sites (Table 2, rows 8–10), each of which lies downstream (column 11) of the nominal RefSeq end (row 7, columns 3 and 7). The absence of a nominal PAS flag in Table 2, column 9, indicates that the nominal RefSeq terminus lacks a PAS and is therefore unlikely to be a site of polyadenylation. One of the predicted alternative ends, supported by 17 EST ends, is located 1,750 nt downstream of the corresponding nominal RefSeq end on the genome. The view, accessible by using the corresponding link to the UCSC browser, is shown in Fig. S6. The UCSC browser data indicate a continuous pattern of overlapping ESTs aligning with the genome between the nominal and predicted ends, providing evidence for linkage of the predicted end to the

same transcriptional unit. This predicted site was among the set verified experimentally within this study as described above.

Murine Mll2. EST termination clusters were not detected in the Mll2 transcription interval on the genome as represented by the BC058659 mRNA sequence (Table 2, row 12). However, a distinct transcript AK039901 is shown (Table 2, row 13) beginning 1,500 nt downstream of the nominal end of BC058659, within which 2 potential alternative endings are detected a short distance upstream of its nominal end (Table 2, rows 14 and 15, column 11). The UCSC browser indicates a lack of continuity of overlapping ESTs in the interval between the 3′ end of BC058659 and the 5′ end of AK039901, suggesting these might be unrelated transcripts. However, analysis of the sequence in the unpopulated intervals by using an RNA folding algorithm reveals 2 regions of high-energy internal folding (Fig. S7), which could render RNA transcripts inaccessible to cDNA generation in this region and thus obscure their transcriptional continuity in the EST databases. The analysis suggests that the predicted end near position 98659766 could represent a 3′ polyadenylation site of Mll2. Supporting evidence was obtained by probing this target by specific RT-PCR in hematopoietic and other cells by using the strategy described for Fig. S5. The results yielded an expression pattern that was expected for Mll2.

Murine Zdhhc5. The exemplar RefSeq for Zdhhc5 is shaded gray in Table 2 (row 17) and in Dataset S1 and Dataset S2, indicating that its nominal 3′ transcript end overlaps another gene transcribed in the opposite direction on the positive genome strand. The predicted polyadenylation site (Table 2, row 18) located 181 bases upstream of the nominal RefSeq end is similarly shaded in gray. The narrow width (Table 2, column 13) of the termination zone of the corresponding EST cluster suggests that it belongs to Zdhhc5 and not to the gene ending on the opposite strand, as explained in the legend to Dataset S1 and Dataset S2. The corresponding UCSC browser view is shown in Fig. S8.

Discussion

The approach outlined here, together with tables relating predicted polyadenylation sites to known genes, significantly reduces the effort required to identify bona fide transcript ends. The prediction tables are based on methodology that improves on earlier EST-based analyses (14–19) and demonstrates a 97% level of predictive recall on a hand-curated and experimentally supported benchmark transcript set. Exact and comprehensive definition of alternative 3′ transcript ends will allow for the design of probes that discriminate between alternative forms and thereby support elucidation of the regulatory impact of alternative polyadenylation. The described methods will also facilitate measurement of global gene expression in amplified cDNAs representative only of 3′ transcript termini. Redesign of microarrays with probes uniformly positioned close to true 3′ ends will significantly reduce the sequence-based, secondary structure-mediated biases that are inherent in transcribing and labeling copies of mRNA and which may be magnified when amplification of target sequences is attempted.

Materials and Methods

Implementation and tuning of the algorithm for automated recognition of EST end clusters are described in detail in SI Text, along with measurements of recall and precision and comparison of these parameters against earlier reported implementations. Procedures used in generating Dataset S1, Dataset S2, Dataset S3, and Dataset S4 are described in the legend to Dataset S1 and Dataset S2. PCR probing of globally amplified cDNA samples used to test for the presence of curated transcript ends is described in the legend to Fig. S5. We have implemented our predictions in a web tool (Transcriptome Sailor, www.ogic.ca/ts) to allow for their examination in a genomic context. The web site also provides access to the complete datasets used for this study and to updates, as generated with future refinement of our algorithms.

ACKNOWLEDGMENTS. We thank Christopher J. Porter for assistance with database maintenance and Carl Virtanen for helpful discussions. This work was supported by funds from the Ontario Innovation Trust, the Canadian Foundation for Innovation, the Ontario Research and Development Challenge Fund, the Terry Fox Foundation, and the Stem Cell Network. M.A.A.-N. is a Canada Research Chair in Bioinformatics.

1. Tian B, Hu J, Zhang H, Lutz C (2005) A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res 33:201–212.
2. Zhang H, Lee JY, Tian B (2005) Biased alternative polyadenylation in human tissues. Genome Biol 6:R100.
3. Carninci P, Kasukawa T, Katayama S, Gough J, Frith M, et al. (2005) The transcriptional landscape of the mammalian genome. Science 309:1559–1563.
4. Kan Z, States D, Gish W (2002) Selecting for functional alternative splices in ESTs. Genome Res 12:1837–1845.
5. Kan Z, Rouchka EC, Gish WR, States DJ (2001) Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res 11:889–900.
6. Mignone F, et al. (2005) UTRdb and UTRsite: A collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res 33:D141–D146.
7. Majoros WH, Ohler U (2007) Spatial preferences of microRNA targets in 3′ untranslated regions. BMC Genomics 8:152.
8. Iscove NN, et al. (2002) Representation is faithfully preserved in global cDNA amplified exponentially from sub-picogram quantities of mRNA. Nat Biotechnol 20:940–943.
9. Kenzelmann M, et al. (2004) High-accuracy amplification of nanogram total RNA amounts for gene profiling. Genomics 83:550–558.
10. Pruitt KD, Tatusova T, Maglott DR (2007) NCBI reference sequences (RefSeq): A curated nonredundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35:D61–65.
11. Hubbard TJ, et al. (2007) Ensembl 2007. Nucleic Acids Res 35:D610–D617.
12. Hsu F, et al. (2006) The UCSC known genes. Bioinformatics 22:1036–1046.
13. Ashurst JL, et al. (2005) The vertebrate genome annotation (Vega) database. Nucleic Acids Res 33:D459–D465.
14. Brockman JM, et al. (2005) PACdb: PolyA cleavage site and 3′-UTR database. Bioinformatics 21:3691–3693.
15. Zhang H, Hu J, Recce M, Tian B (2005) PolyA_DB: a database for mammalian mRNA polyadenylation. Nucleic Acids Res 33:D116–D120.
16. Yan J, Marr TG (2005) Computational analysis of 3′-ends of ESTs shows four classes of alternative polyadenylation in human, mouse, and rat. Genome Res 15:369–375.
17. Lopez F, Granjeaud S, Ara T, Ghattas B, Gautheret D (2006) The disparate nature of "intergenic" polyadenylation sites. RNA 12:1794–1801.
18. Moucadel V, Lopez F, Ara T, Benech P, Gautheret D (2007) Beyond the 3′ end: Experimental validation of extended transcript isoforms. Nucleic Acids Res 35:1947–1957.
19. Lee JY, Yeh I, Park JY, Tian B (2007) PolyA_DB 2: mRNA polyadenylation sites in vertebrate genes. Nucleic Acids Res 35:D165–D168.
20. Hinrichs AS, Karolchik D, Baertsch R, Barber G, Bejerano G, et al. (2006) The UCSC genome browser database: Update 2006. Nucleic Acids Res 34:D590–D598.
21. Benveniste P, Cantin C, Hyam D, Iscove NN (2003) Hematopoietic stem cells engraft in mice with absolute efficiency. Nat Immunol 4:708–713.
22. Karolchik D, Hinrichs AS, Kent WJ (2007) The UCSC genome browser. Curr Protoc Bioinformatics Chapter 1:Unit 1.4.
23. Imanishi T, Itoh T, Suzuki Y, Donovan CO, Fukuchi S, et al. (2004) Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol 2:e162.
