Supporting Information
Muro et al. 10.1073/pnas.0807813105 (www.pnas.org/cgi/content/short/0807813105)

SI Text

1. Algorithm for Automated Recognition of EST End Clusters. To determine the coordinates of genomic EST alignments we used the UCSC genome annotation (1), a regularly updated database which includes mouse and human EST sequences and their positions in the genome. The UCSC Golden Path version was mm6 for mouse (equivalent to NCBI build 34, March 2005) and hg18 for human (NCBI Build 36.1, March 2006). The UCSC mapping database is based on alignments of ESTs to the corresponding genome using the BLAT program (2). Because of the relatively low accuracy of EST sequencing and recent genomic duplications (3), some ESTs can be aligned to more than 1 genomic position. To avoid possible misidentifications, ambiguously aligning ESTs were excluded from our analysis. Multiple matches to the genome were reported when alignments had a base pair identity within 0.5% of the best alignment and at least 96% base pair identity to the genomic sequence. We accepted only ESTs that aligned to a unique genomic position. For example, in the analysis of the murine genome this constraint reduced the number of aligned ESTs from 4.9 to 3.3 million. EST alignment to the genomic sequence circumvents the need to analyze EST sequences for possible sequencing errors, RNA editing, or elimination of polyA tails. Once a genomic position was identified by coalignment of multiple EST ends, further analysis was simply performed on the corresponding genomic sequence.

A count of the number of matching ESTs along the genome (Fig. 1A) results in a histogram (Fig. 1B). The simple presence of ESTs does not indicate the direction of transcription. However, our analysis revealed, as expected, that the numbers of aligned ESTs gradually increase toward transcript 3′ ends (Fig. 1B) and then abruptly fall, suggesting that the shape of the EST frequency histogram could be used to infer the direction of transcription. A histogram was generated describing the number of ESTs spanning each genomic position (in steps of 20 nt), irrespective of the EST direction or RNA splicing. Intron/exon boundaries were not considered for this computation as we were primarily interested in relating EST ends to transcript ends. The resulting EST histogram was convolved with a mathematical function that acts as an "edge detector."

$$f(x) = a \cdot \tanh(x/b) \cdot \left[1 - \tanh^{2}(x/b)\right]$$

The constants a and b modify the width of the function and are, therefore, parameters to optimize. We used a = −1/150 and b = 150. Sharp edges are converted into maxima or minima, depending on the transcript direction, whereas the absolute magnitude of a peak indicates the abruptness of the edge. The sign of the convolution maximum indicates the direction of the transcript: negative values indicate termination of transcription proceeding from right to left and positive values the opposite direction. Peaks are indicated by red bars in Fig. 1B. Peaks of magnitude 0.25 or greater were accepted as "rough ends" indicative of potential termination. Further analysis was then performed to relate these rough ends to the precise genomic locations of the ends of ESTs associated with 1 or more PAS.

Valid transcript termini are expected to contain a PAS 10–30 nt upstream of the polyadenylation site. The most common and strongest signal is AAUAAA, but 12 additional variants have been demonstrated or proposed (4–6) (Table S1). An analysis of the distribution of distances between the closest PAS to putative transcript ends in the RefSeq collection and the transcript ends (Fig. S1) showed for 10 of the PAS variants a marked tendency to lie 10–30 nt upstream of EST ends, whereas no strong trend was seen for 3 additional variants. Based on this simple analysis we chose to accept only 10 of the variants as functional PAS (Table S1), and only EST ends validated by a PAS 10–30 nt upstream. Furthermore, we considered all EST ends validated by the same PAS and ending within 20 nt of one another to represent the same transcript end and to be clustered together, as others have also proposed (6).

Next, we studied the distribution of individual EST ends around our predicted rough ends (Fig. S2). The graph indicates that in a range of −200 to +150 nt around these rough ends the frequency of EST ends is distinguishable from the surrounding background. Therefore, to relate rough ends to precise genomic locations of EST ends, first EST ends in a range of −200 to +150 nt of the rough end were collected, then EST ends within 20 nt of each other were clustered to define a potential termination zone, and finally, each cluster was tested for the presence of a PAS within 10–30 nt upstream of the start of the cluster.

The strength of a cluster is described in terms of total number of EST terminations in the cluster and of maximum number of terminations at a single nucleotide position. The width of the termination zone provides an additional descriptor used in the analysis. The results of this protocol are "candidate ends" (approximate RNA transcript end predictions), each of which is a cluster of EST ends located near a rough end.

2. Estimation of Recall and Precision of the Method and Comparison to Other Approaches. The correspondence between the automated predictions for the murine genome and the nominal ends in the RefSeq collection of protein-encoding transcripts was assessed quantitatively. Of the 69,220 predictions obtained by using a minimum definition of 2 ESTs in a cluster and absence of a nearby polyA tract, 10,693 fell within 10 nt of one of 18,280 database transcript ends. "Recall," the proportion of nominal RefSeq ends that were matched by automated prediction, was 10,693/18,280 = 0.59, a figure reflecting the incompleteness of many of the database sequences at their 3′ ends. When RefSeq sequences lacking a PAS within 50 terminal nt were removed from consideration, recall was 10,693/13,349 = 0.80. This measure is still influenced by database sequences that are incomplete but contain a nonused PAS near the nominal terminus (e.g., rows 11–18 in Table S2).

When we focused on the subset of nominal transcript ends that were matched by automated prediction (10,693) and considered the total number of "local ends" in the −200 to +150 nt range of these ends and in the direction of transcription of the gene, 15,481 were found. "Precision," the proportion of total predicted local ends that correspond to confirmed database sequence ends, is estimated as 10,693/15,481 = 0.69. Lower precision values mainly reflect greater usage of locally redundant polyadenylation sites in the transcriptome. The 4,788 additional terminal zone predictions are likely to represent local alternative ends (as illustrated in Fig. 1) typically associated with EST end clusters containing smaller numbers of ESTs than the clusters coinciding with the nominal ends.

Candidate cluster ends are characterized by variable levels of EST end evidence described as number of EST ends in the cluster, maximum number of EST ends at a single position, association or not with a PAS, and presence or absence of nearby polyA tracts. Our computational method allows tuning of results according to thresholds on these 4 properties. Fig. S3 illustrates the dependence of recall and precision on these variables when comparing the complete prediction on the mouse genome to the RefSeq collection of protein-encoding transcripts. This analysis was used to guide the optimization of the method parameters and to focus on predicted ends supported by 2 EST ends, as they offer the best tradeoff between recall and precision values given the current state of the murine EST database.

accepted all uniquely aligning ESTs regardless of whether their corresponding database entries included a flanking A-rich sequence. Our approach also differed from the PolyADB and the Yan and Marr (9) methods, which considered only ESTs that overlap a gene. The latter strategy detects only those transcript ends that are close to current gene predictions. This restriction would preclude, for example, detection of the alternative downstream polyadenylation site for the Pde7a gene described in Fig. 1. PACdb is less restrictive, but also relates all 3′-processing sites to known or currently predicted genes. This approach can result in assignment of a termination to the wrong gene due to missing gene annotations (7). Many human and mouse gene predictions are currently unstable as they depend on changing EST data and gene prediction methodologies (14). Our analysis, in contrast, is independent of gene predictions and we demonstrate experimentally that it can be used to detect new transcript ends that are not described in the databases. Further, our approach is not restricted to protein-coding RNAs and can be used to detect noncoding RNAs, a growing number of which are being recog-

Generically similar approaches to computational prediction of RNA transcript ends based on EST evidence have been described before (7–12). Collections of manually curated ends have also been published (VEGA) (13). Fig. S4 compares the levels of recall relative to our experimental benchmark and other accessible compilations of curated and predicted ends, of our algorithm (TS), the method from Lopez et al. (11, 12) and VEGA (13). With correct prediction of 110 of our 113 experimentally verified 3′ ends (108 positions predicted within an accuracy of 10 nt), TS represents a decisive improvement in predictive power over previous approaches. In general, recall was better for more highly curated collections of transcripts.
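The edge-detection step of Section 1 can be illustrated with a minimal Python sketch. It uses the edge-detector function f(x) with a = −1/150 and b = 150, 20-nt bins, and the 0.25 peak threshold stated in the text. Everything else is an assumption made for illustration: the function names, the toy coverage scaled to the range 0 to 1 (the normalization of the real histogram is not given here), the truncation of the kernel at ±900 nt, and the orientation of the sliding sum, which was chosen so that an abrupt fall on the right-hand side scores positive, in line with the sign rule described above. The original implementation may differ on any of these points.

```python
import math

A = -1.0 / 150.0  # constant a, as given in the text
B = 150.0         # constant b, as given in the text
STEP = 20         # histogram bin width in nt, as given in the text


def edge_kernel(x, a=A, b=B):
    """Edge-detector function f(x) = a * tanh(x/b) * (1 - tanh(x/b)**2)."""
    t = math.tanh(x / b)
    return a * t * (1.0 - t * t)


def edge_scores(hist, step=STEP, half_width_nt=900):
    """Slide the kernel over a binned coverage histogram.

    Assumption: score[i] = sum_k hist[i + k] * f(k * step) * step, with bins
    outside the histogram treated as zero and the kernel cut off at
    +/- half_width_nt.
    """
    k_max = half_width_nt // step
    weights = [edge_kernel(k * step) * step for k in range(-k_max, k_max + 1)]
    scores = []
    for i in range(len(hist)):
        total = 0.0
        for offset, w in enumerate(weights):
            j = i + offset - k_max
            if 0 <= j < len(hist):
                total += hist[j] * w
        scores.append(total)
    return scores


def rough_ends(scores, step=STEP, threshold=0.25):
    """Return (position_nt, score) at local extrema with |score| >= threshold."""
    found = []
    for i in range(1, len(scores) - 1):
        m = abs(scores[i])
        if m >= threshold and m >= abs(scores[i - 1]) and m > abs(scores[i + 1]):
            found.append((i * step, scores[i]))
    return found


# Toy coverage (hypothetical data): a gradual rise over 100 bins followed by
# an abrupt fall, mimicking ESTs piling up toward a 3' end on the right.
ramp = [k / 100 for k in range(1, 101)]
hist = [0.0] * 60 + ramp + [0.0] * 60

forward = rough_ends(edge_scores(hist))        # transcript ending on the right
reverse = rough_ends(edge_scores(hist[::-1]))  # mirrored: ending on the left

for label, peaks in (("left-to-right", forward), ("right-to-left", reverse)):
    for position, score in peaks:
        print(f"{label}: rough end near {position} nt, score {score:+.3f}")
```

In this toy case the gradual rise stays well below the threshold, so only the abrupt fall is reported, and mirroring the histogram flips the sign of the peak.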
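The step from a rough end to candidate ends in Section 1 (collect EST ends within −200 to +150 nt, cluster ends lying within 20 nt of each other, require a PAS 10–30 nt upstream of the cluster start) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, only the plus strand is handled, "within 20 nt of each other" is read as a gap of at most 20 nt between consecutive sorted ends, the PAS distance is measured to the first base of the hexamer, and only the canonical signal (AATAAA in DNA form) is checked because the accepted variants of Table S1 are not reproduced here.

```python
from collections import Counter


def cluster_ends(ends, max_gap=20):
    """Group EST end coordinates; a new cluster starts when the gap to the
    previous end exceeds max_gap (single-linkage reading, an assumption)."""
    clusters = []
    for pos in sorted(ends):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def describe(cluster):
    """Descriptors named in the text: total ends, maximum at one position, width."""
    counts = Counter(cluster)
    return {
        "start": cluster[0],
        "total_ends": len(cluster),
        "max_at_one_position": max(counts.values()),
        "width": cluster[-1] - cluster[0] + 1,
    }


def has_upstream_pas(genome, cluster_start, signals=("AATAAA",), lo=10, hi=30):
    """True if a signal begins lo..hi nt upstream of cluster_start (plus strand)."""
    for dist in range(lo, hi + 1):
        i = cluster_start - dist
        if i >= 0 and any(genome.startswith(s, i) for s in signals):
            return True
    return False


def candidate_ends(genome, est_ends, rough_end, window=(-200, 150), min_ends=2):
    """Collect ends near a rough end, cluster them, keep PAS-validated clusters."""
    lo, hi = rough_end + window[0], rough_end + window[1]
    nearby = [e for e in est_ends if lo <= e <= hi]
    return [
        describe(c)
        for c in cluster_ends(nearby)
        if len(c) >= min_ends and has_upstream_pas(genome, c[0])
    ]


# Hypothetical toy data: one AATAAA at position 480 in an otherwise uniform
# sequence, EST ends around 500 (PAS 20 nt upstream), around 560 (no PAS in
# range), and one far-away end at 900 that falls outside the window.
genome = "C" * 480 + "AATAAA" + "C" * 514
est_ends = [500, 500, 503, 510, 560, 562, 900]
result = candidate_ends(genome, est_ends, rough_end=520)
print(result)
```

The single-linkage reading lets a cluster grow wider than 20 nt when ends chain together. A stricter reading that caps the total width of a cluster would need a different grouping rule.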