Supporting Information

Muro et al. 10.1073/pnas.0807813105
www.pnas.org/cgi/content/short/0807813105

SI Text

1. Algorithm for Automated Recognition of EST End Clusters. To determine the coordinates of genomic EST alignments we used the UCSC genome annotation (1), a regularly updated database which includes mouse and human EST sequences and their positions in the genome. The UCSC Golden Path version was mm6 for mouse (equivalent to NCBI build 34, March 2005) and hg18 for human (NCBI Build 36.1, March 2006). The UCSC mapping database is based on alignments of ESTs to the corresponding genome using the BLAT program (2). Because of the relatively low accuracy of EST sequencing and recent genomic duplications (3), some ESTs can be aligned to more than 1 genomic position. To avoid possible misidentifications, ambiguously aligning ESTs were excluded from our analysis. Multiple matches to the genome were reported when alignments had an identity within 0.5% of the best alignment and at least 96% base pair identity to the genomic sequence. We accepted only ESTs that aligned to a unique genomic position. For example, in the analysis of the murine genome this constraint reduced the number of aligned ESTs from 4.9 to 3.3 million. EST alignment to the genomic sequence circumvents the need to analyze EST sequences for possible sequencing errors, RNA editing, or elimination of polyA tails. Once a genomic position was identified by coalignment of multiple EST ends, further analysis was simply performed on the corresponding genomic sequence.

A count of the number of matching ESTs along the genome (Fig. 1A) results in a histogram (Fig. 1B). The simple presence of ESTs does not indicate the direction of transcription. However, our analysis revealed, as expected, that the numbers of aligned ESTs gradually increase toward transcript 3′ ends (Fig. 1B) and then abruptly fall, suggesting that the shape of the EST frequency histogram could be used to infer the direction of transcription. A histogram was generated describing the number of ESTs spanning each genomic position (in steps of 20 nt), irrespective of the EST direction or RNA splicing. Intron/exon boundaries were not considered for this computation as we were primarily interested in relating EST ends to transcript ends. The resulting EST histogram was convolved with a mathematical function that acts as an "edge detector":

f(x) = a · tanh(x/b) · [1 − (tanh(x/b))^2]

The constants a and b modify the width of the function and are, therefore, parameters to optimize. We used a = −1/150 and b = 150. Sharp edges are converted into maxima or minima, depending on the transcript direction, whereas the absolute magnitude of a peak indicates the abruptness of the edge. The sign of the convolution maximum indicates the direction of the transcript: negative values indicate termination of transcription proceeding from right to left and positive values the opposite direction. Peaks are indicated by red bars in Fig. 1B. Peaks of magnitude 0.25 or greater were accepted as "rough ends" indicative of potential termination. Further analysis was then performed to relate these rough ends to the precise genomic locations of the ends of ESTs associated with 1 or more PAS.

Valid transcript termini are expected to contain a PAS 10–30 nt upstream of the polyadenylation site. The most common and strongest signal is AAUAAA, but 12 additional variants have been demonstrated or proposed (4–6) (Table S1). An analysis of the distribution of distances between the closest PAS to putative transcript ends in the RefSeq collection and the transcript ends (Fig. S1) showed for 10 of the PAS variants a marked tendency to lie 10–30 nt upstream of EST ends, whereas no strong trend was seen for 3 additional variants. Based on this simple analysis we chose to accept only 10 of the variants as functional PAS (Table S1), and only EST ends validated by a PAS 10–30 nt upstream. Furthermore, we considered all EST ends validated by the same PAS and ending within 20 nt of one another to represent the same transcript end and to be clustered together, as others have also proposed (6).

Next, we studied the distribution of individual EST ends around our predicted rough ends (Fig. S2). The graph indicates that in a range of −200 to +150 nt around these rough ends the frequency of EST ends is distinguishable from the surrounding background. Therefore, to relate rough ends to precise genomic locations of EST ends, first EST ends in a range of −200 to +150 nt of the rough end were collected, then EST ends within 20 nt of each other were clustered to define a potential termination zone, and finally, each cluster was tested for the presence of a PAS within 10–30 nt upstream of the start of the cluster.

The strength of a cluster is described in terms of total number of EST terminations in the cluster and of maximum number of terminations at a single nucleotide position. The width of the termination zone provides an additional descriptor used in the analysis. The results of this protocol are "candidate ends" (approximate RNA transcript end predictions), each of which is a cluster of EST ends located near a rough end.

2. Estimation of Recall and Precision of the Method and Comparison to Other Approaches. The correspondence between the automated predictions for the murine genome and the nominal ends in the RefSeq collection of protein-encoding transcripts was assessed quantitatively. Of the 69,220 predictions obtained by using a minimum definition of 2 ESTs in a cluster and absence of a nearby polyA tract, 10,693 fell within 10 nt of one of 18,280 database transcript ends. "Recall," the proportion of nominal RefSeq ends that were matched by automated prediction, was 10,693/18,280 = 0.59, a figure reflecting the incompleteness of many of the database sequences at their 3′ ends. When RefSeq sequences lacking a PAS within 50 terminal nt were removed from consideration, recall was 10,693/13,349 = 0.80. This measure is still influenced by database sequences that are incomplete but contain a nonused PAS near the nominal terminus (e.g., rows 11–18 in Table S2).

When we focused on the subset of nominal transcript ends that were matched by automated prediction (10,693) and considered the total number of "local ends" in the −200 to +150 nt range of these ends and in the direction of transcription of the gene, 15,481 were found. "Precision," the proportion of total predicted local ends that correspond to confirmed database sequence ends, is estimated as 10,693/15,481 = 0.69. Lower precision values mainly reflect greater usage of locally redundant polyadenylation sites in the transcriptome. The 4,788 additional terminal zone predictions are likely to represent local alternative ends (as illustrated in Fig. 1) typically associated with EST end clusters containing smaller numbers of ESTs than the clusters coinciding with the nominal ends.

Candidate cluster ends are characterized by variable levels of EST end evidence described as number of EST ends in the cluster, maximum number of EST ends at a single position, association or not to a PAS, and presence or absence of nearby polyA tracts. Our computational method allows tuning of results according to thresholds on these 4 properties. Fig. S3 illustrates the dependence of recall and precision on these variables when comparing the complete prediction on the mouse genome to the RefSeq collection of protein-encoding transcripts. This analysis was used to guide the optimization of the method parameters and to focus on predicted ends supported by 2 EST ends, as they offer the best tradeoff between recall and precision values given the current state of the murine EST database.

Generically similar approaches to computational prediction of RNA transcript ends based on EST evidence have been described before (7–12). Collections of manually curated ends have also been published (VEGA) (13). Fig. S4 compares the levels of recall relative to our experimental benchmark and other accessible compilations of curated and predicted ends, of our algorithm (TS), the method from Lopez et al. (11, 12) and VEGA (13). With correct prediction of 110 of our 113 experimentally verified 3′ ends (108 positions predicted within an accuracy of 10 nt), TS represents a decisive improvement in predictive power over previous approaches. In general, recall was better for more highly curated collections of transcripts. We take this as an indication of the quality of our algorithm. VEGA achieves too low a level of recall to be useful as a tool for 3′ end completion.

Our implementation of EST-based prediction differed in critical details from previous analyses. We excluded from consideration all ESTs aligning to more than a single unique genomic site because of the ambiguity they would introduce. Further, in contrast to PACdb (7) and PolyA_DB (8, 10), we accepted all uniquely aligning ESTs regardless of whether their corresponding database entries included a flanking A-rich sequence. Our approach also differed from the PolyA_DB and the Yan and Marr (9) methods, which considered only ESTs that overlap a gene. The latter strategy detects only those transcript ends that are close to current gene predictions. This restriction would preclude, for example, detection of the alternative downstream polyadenylation site for the Pde7a gene described in Fig. 1. PACdb is less restrictive, but also relates all 3′-processing sites to known or currently predicted genes. This approach can result in assignment of a termination to the wrong gene due to missing gene annotations (7). Many human and mouse gene predictions are currently unstable as they depend on changing EST data and gene prediction methodologies (14). Our analysis, in contrast, is independent of gene predictions and we demonstrate experimentally that it can be used to detect new transcript ends that are not described in the databases. Further, our approach is not restricted to protein-coding RNAs and can be used to detect noncoding RNAs, a growing number of which are being recognized as functionally significant (15).

Genomic positions for murine RefSeq and Affymetrix transcripts were taken from the UCSC Genome Browser database using the mm6 (March 2005) genome assembly. Ensembl transcript positions were obtained from the Ensembl web site. For other collections of transcripts we used BLAT to align to the mm6 genome (score ≥ 97%, id ≥ 99%). Human transcript data were obtained from the UCSC hg18 genome assembly.
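The edge-detection step of section 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the kernel truncation (`half_width_bins`), and the orientation of the sliding sum (`hist[i + k]`) are all hypothetical. The orientation was chosen so that a plus-strand transcript (coverage rising to the right and then dropping) gives a positive peak, which agrees with the sign convention stated in the text. The normalization of the histogram is not specified in the text, so the 0.25 threshold is not applied here.

```python
import math

A = -1.0 / 150.0  # constant a from the text
B = 150.0         # constant b from the text
STEP = 20         # histogram bin width in nt, from the text


def edge_kernel(x, a=A, b=B):
    """f(x) = a * tanh(x/b) * (1 - tanh(x/b)**2)."""
    t = math.tanh(x / b)
    return a * t * (1.0 - t * t)


def edge_response(hist, half_width_bins=30, step=STEP):
    """Slide the kernel over a per-bin EST coverage histogram.

    half_width_bins is a hypothetical truncation of the kernel, and the
    orientation of the sum (hist[i + k]) is an assumption (see lead-in).
    """
    n = len(hist)
    response = []
    for i in range(n):
        total = 0.0
        for k in range(-half_width_bins, half_width_bins + 1):
            j = i + k
            if 0 <= j < n:
                total += hist[j] * edge_kernel(k * step)
        response.append(total)
    return response


# Toy coverage: a plateau that drops abruptly at bin 50 (plus-strand 3' end).
plus_strand = [10] * 50 + [0] * 50
resp = edge_response(plus_strand)
peak = max(range(len(resp)), key=lambda i: resp[i])
print(peak, resp[peak] > 0)
```

Mirroring the toy histogram (`[0] * 50 + [10] * 50`) flips the sign of the extreme response, which is how the sign can encode transcript direction.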

1. Hinrichs AS, et al. (2006) The UCSC Genome Browser Database: Update 2006. Nucleic Acids Res 34:D590–D598.
2. Kent W (2002) BLAT–the BLAST-like alignment tool. Genome Res 12:656–664.
3. Cheung J, et al. (2003) Recent segmental and gene duplications in the mouse genome. Genome Biol 4:R47.
4. Beaudoing E, Freier S, Wyatt JR, Claverie JM, Gautheret D (2000) Patterns of variant polyadenylation signal usage in human genes. Genome Res 10:1001–1010.
5. MacDonald CC, Redondo JL (2002) Reexamining the polyadenylation signal: Were we wrong about AAUAAA? Mol Cell Endocrinol 190:1–8.
6. Tian B, Hu J, Zhang H, Lutz C (2005) A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res 33:201–212.
7. Brockman JM, et al. (2005) PACdb: PolyA Cleavage Site and 3′-UTR Database. Bioinformatics 21:3691–3693.
8. Zhang H, Hu J, Recce M, Tian B (2005) PolyA_DB: A database for mammalian mRNA polyadenylation. Nucleic Acids Res 33:D116–D120.
9. Yan J, Marr TG (2005) Computational analysis of 3′-ends of ESTs shows four classes of alternative polyadenylation in human, mouse, and rat. Genome Res 15:369–375.
10. Lee JY, Yeh I, Park JY, Tian B (2007) PolyA_DB 2: mRNA polyadenylation sites in vertebrate genes. Nucleic Acids Res 35:D165–168.
11. Lopez F, Granjeaud S, Ara T, Ghattas B, Gautheret D (2006) The disparate nature of "intergenic" polyadenylation sites. RNA 12:1794–1801.
12. Moucadel V, Lopez F, Ara T, Benech P, Gautheret D (2007) Beyond the 3′ end: Experimental validation of extended transcript isoforms. Nucleic Acids Res 35:1947–1957.
13. Ashurst JL, et al. (2005) The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res 33:D459–D465.
14. Perez-Iratxeta C, Andrade MA (2005) Inconsistencies over time in 5% of NetAffx probe-to-gene annotations. BMC Bioinformatics 6:183.
15. Mattick JS, Makunin IV (2006) Non-coding RNA. Hum Mol Genet 15 Spec No 1:R17–R29.
16. Benveniste P, Cantin C, Hyam D, Iscove NN (2003) Hematopoietic stem cells engraft in mice with absolute efficiency. Nat Immunol 4:708–713.
17. Billia F, Barbara M, McEwen J, Trevisan M, Iscove NN (2001) Resolution of pluripotential intermediates in murine hematopoietic differentiation by global complementary DNA amplification from single cells: Confirmation of assignments by expression profiling of cytokine receptor transcripts. Blood 97:2257–2268.
18. Iscove NN, et al. (2002) Representation is faithfully preserved in global cDNA amplified exponentially from sub-picogram quantities of mRNA. Nat Biotechnol 20:940–943.
19. Karolchik D, Hinrichs AS, Kent WJ (2007) The UCSC Genome Browser. Curr Protoc Bioinformatics Chapter 1:Unit 1.4.
20. Matzura O, Wennborg A (1996) RNAdraw: an integrated program for RNA secondary structure calculation and analysis under 32-bit Microsoft Windows. Comput Appl Biosci 12:247–249.

[Fig. S1 plot, panels A and B. x axis: distance from transcript ends (0–100 nt); y axis: number of ends (0–0.18). Legend motifs: AATAAA, AAGAAA, ATTAAA, AATGAA, TATAAA, TTTAAA, AGTAAA, AATATA, AATACA, CATAAA, GATAAA, ACTAAA, AATAGA.]

Fig. S1. Distribution of offsets of different PAS motifs to transcript ends. RefSeq sequences containing only 1 of 13 PAS variants in the terminal 50 nt were selected. All of the corresponding ESTs ending in the region of the transcript end were then scanned to determine the distance of the PAS from each EST end. The plot indicates the normalized number of nonredundant EST ends downstream of each PAS at positions up to 100 nt downstream. Two distinct patterns were observed. (A) Ten PAS motifs displayed a clear maximum at 20 nt as expected for a used PAS. (B) Three proposed PAS variants showed distinctly weaker maxima around 20 nt. We used this evidence to reject these variants as predictors of transcript termination.
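The PAS-offset test behind Fig. S1 and the 10–30 nt validation rule can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the sequence is assumed to be plus-strand genomic DNA (so AAUAAA appears as AATAAA), `end` is assumed to be the 0-based index of the first base after the transcript end, and the distance is assumed to be measured from the first base of the hexamer to `end`. The text does not say which hexamer base the distance is measured from.

```python
# The 10 variants marked valid = 1 in Table S1.
FUNCTIONAL_PAS = {
    "AATAAA", "ATTAAA", "TATAAA", "AGTAAA", "AATATA",
    "AATACA", "CATAAA", "GATAAA", "ACTAAA", "AATAGA",
}


def pas_offsets(seq, end, motifs=FUNCTIONAL_PAS, max_dist=100):
    """Return sorted (distance, hexamer) pairs for PAS hexamers lying
    entirely upstream of `end`, at most `max_dist` nt away."""
    seq = seq.upper()
    hits = []
    first = max(0, end - max_dist)
    for start in range(first, end - 5):  # hexamer must end before `end`
        hexamer = seq[start:start + 6]
        if hexamer in motifs:
            hits.append((end - start, hexamer))
    return sorted(hits)


def has_valid_pas(seq, end, min_dist=10, max_dist=30):
    """True if a functional PAS starts 10-30 nt upstream of `end`."""
    return any(
        min_dist <= dist <= max_dist
        for dist, _ in pas_offsets(seq, end, max_dist=max_dist)
    )


# Toy sequence with a single AATAAA starting at index 40.
toy = "C" * 40 + "AATAAA" + "C" * 14 + "G" * 40
print(pas_offsets(toy, 60))
print(has_valid_pas(toy, 60), has_valid_pas(toy, 90))
```

Collecting `pas_offsets` distances over many EST ends and tallying them per motif would give a distribution of the kind plotted in Fig. S1.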

[Fig. S2 plot. x axis ticks: −200 to 120 in steps of 20; y axis ticks: 0 to 7×10^5 in steps of 1×10^5.]

Fig. S2. Number of EST ends in the neighborhood of rough ends in 1. The x axis represents offsets of EST ends in nucleotides relative to rough ends, and the y axis the number of EST ends at each offset. Direction of transcription is to the right (+ strand). The frequency of EST ends differs from background in the range −200 nt to +150 nt.
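The window-and-cluster step that this distribution motivates can be sketched as below. The sketch is hedged: the function names are hypothetical, coordinates are assumed to be on the plus strand (upstream means smaller coordinates), and "within 20 nt of each other" is interpreted as single-linkage grouping of sorted positions, which is one plausible reading of the text rather than a confirmed detail.

```python
from collections import Counter


def cluster_est_ends(est_ends, rough_end, upstream=200, downstream=150, max_gap=20):
    """Collect EST ends in [rough_end - 200, rough_end + 150] and group
    consecutive ends separated by at most `max_gap` nt (assumed rule)."""
    window = sorted(
        e for e in est_ends
        if rough_end - upstream <= e <= rough_end + downstream
    )
    clusters = []
    for pos in window:
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def describe_cluster(cluster):
    """Descriptors named in the text: total ends, maximum ends at a single
    position, and width of the termination zone."""
    per_position = Counter(cluster)
    return {
        "start": cluster[0],
        "end": cluster[-1],
        "width": cluster[-1] - cluster[0] + 1,
        "total_ends": len(cluster),
        "max_ends_at_one_position": max(per_position.values()),
    }


ends = [1400, 1003, 1000, 1010, 1003, 1100]
for cluster in cluster_est_ends(ends, rough_end=1050):
    print(describe_cluster(cluster))
```

Each resulting cluster would then be checked for a functional PAS 10–30 nt upstream of its start before being kept as a candidate end.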

Fig. S3. Dependence of recall and precision on numbers of ESTs used to define EST clusters. Each point maps the recall and precision in the detection of 18,280 RefSeq entries by 15,481 ends predicted in the range −150, +200 nt of the RefSeq ends using a pair of thresholds (number of ESTs in cluster, maximum number of ESTs in the cluster ending at the same genome coordinate). A predicted end was considered to match a nominal end if it was located within 10 nt. Using thresholds of 2 EST ends in the cluster and 1 EST ending at the identical genome coordinate, recall and precision reach 59% and 69%, respectively.
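The recall and precision figures can be recomputed from the counts reported in the text. The snippet below is only a worked check with hypothetical variable names. The first ratio evaluates to roughly 0.585, which the text reports as 0.59.

```python
def ratio(numerator, denominator):
    """Plain proportion with a guard against an empty denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


MATCHED = 10_693          # nominal RefSeq ends matched within 10 nt
REFSEQ_ENDS = 18_280      # all nominal RefSeq ends
REFSEQ_WITH_PAS = 13_349  # RefSeq ends with a PAS in the terminal 50 nt
LOCAL_ENDS = 15_481       # predicted local ends near the matched ends

recall_all = ratio(MATCHED, REFSEQ_ENDS)
recall_pas = ratio(MATCHED, REFSEQ_WITH_PAS)
precision = ratio(MATCHED, LOCAL_ENDS)
additional = LOCAL_ENDS - MATCHED  # predictions not at nominal ends

print(f"recall (all RefSeq):      {recall_all:.3f}")
print(f"recall (PAS-containing):  {recall_pas:.3f}")
print(f"precision:                {precision:.3f}")
print(f"additional local ends:    {additional}")
```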

Fig. S4. Benchmark results of TS and others (11, 13) for transcript 3′-end prediction against several collections of nominal transcript ends. Predicted ends and nominal ends were considered to match if they were less than 10 nt apart. Recall was computed as the percentage of transcripts in the collection whose ends were detected by the method. For example, 108 of 113 of the experimentally verified ends were detected by TS applying this strict condition (compare with 110 of 113 when using a distance of 50 nt). The percentage of nominal ends with a PAS signal is indicated (horizontal bars) and possibly constitutes an upper limit to the recall as transcripts without a PAS could be considered incorrect. Comparisons against PolyA_DB (10) could not be performed because the predictions were not available for download. Collections used for benchmark were: set of 113 ends experimentally verified in this work (Exp), murine RefSeq revised set (Refrev), complete murine RefSeq (Ref), PacDB murine set (PacDB), Affymetrix MOE430 transcripts (Affy), Ensembl murine known genes (EKnown), Tigr murine (Tigr), and Ensembl murine novel genes (ENov). Source files were RefSeq mm6 from UCSC Golden Path; Ensembl SQL database mus_musculus_core_35_34c; Affymetrix as in Table 1; TIGR MGI.release_15.zip, using definitions.html to extract the TC subset and excluding singleton ESTs; PacDB mouse_12–19–2005.fa.gz.
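The matching rule used for this benchmark (a nominal end counts as detected when a predicted end lies less than a given distance away) can be sketched as follows. The function names and the toy coordinates are hypothetical, and the sketch assumes all positions are on one chromosome and strand.

```python
import bisect


def detected_ends(nominal, predicted, max_dist=10):
    """Return the nominal ends that have a predicted end strictly closer
    than `max_dist` nt (same chromosome and strand assumed)."""
    pred = sorted(predicted)
    found = []
    for pos in nominal:
        # First predicted end located after pos - max_dist.
        i = bisect.bisect_right(pred, pos - max_dist)
        if i < len(pred) and pred[i] < pos + max_dist:
            found.append(pos)
    return found


def recall_percent(nominal, predicted, max_dist=10):
    """Percentage of nominal ends detected at the given distance."""
    return 100.0 * len(detected_ends(nominal, predicted, max_dist)) / len(nominal)


nominal_ends = [100, 500, 900]      # toy nominal transcript ends
predicted_ends = [104, 530, 2000]   # toy predicted ends

print(recall_percent(nominal_ends, predicted_ends, max_dist=10))
print(recall_percent(nominal_ends, predicted_ends, max_dist=50))
```

Relaxing `max_dist` from 10 to 50 in the toy data recovers one more end, the same kind of effect as the 108 versus 110 of 113 comparison in the caption.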

[Fig. S5 gel image. Marker lane sizes (bp): 1031, 800, 700, 600, 500, 400, 300, 250, 200, 150, 100. Lane labels (gene names) are not recoverable from the extracted text.]

Fig. S5. Detection by specific PCR of 3′ transcript ends predicted by manual analysis (Table S4). Globally amplified cDNA from purified murine hematopoietic and ES cells was probed by primer pairs (Table S3) specific to sequence immediately upstream of hand-curated polyadenylation sites. PCR products were subjected to agarose gel electrophoresis in the presence of Gel Red and photographed under UV illumination. A total of 113 distinct genes was analyzed, of which 12 representative runs are shown here. The figure is a composite of separate gel runs each performed with internal molecular weight markers. One representative marker lane is included at the left, and the relative positions obtained in individual experiments for each gene fragment are accurately indicated. Murine hematopoietic stem and precursor cells (16, 17), ES cell lines, and cells from various organs were isolated and cDNA globally amplified (18) as described. Fragments of the predicted size (Table S3) were amplified for all 113 transcripts, including 4 examples in which the predicted polyadenylation sites were upstream or downstream of the existing database sequences (Ikzf1, Pcgf2, Scml2, Phc3) and would not have been detected in globally amplified cDNA by primers targeting the nominal ends.

Fig. S6. UCSC browser view (mm8) of the region downstream of the nominal 3′ end of murine Rnf11. Dataset S1 and Dataset S2 indicate an EST end cluster 224 nt downstream of the nominal end supported by 88 ESTs, and a smaller end cluster 1750 nt downstream near position 108949038.

Fig. S7. EST discontinuity between Mll2 and AK039901. UCSC mm9 genome browser view (19) of EST alignments to the interval on the minus strand of chromosome 15 between the nominal transcript for murine Mll2 on the right (ending at 98664116 in the figure, 98661719 on mm8) and the transcript represented by AK039901 at the left beginning at 98662950 above, 98660553 on mm8. Few ESTs populate the interval. The green bars represent the average energy per nucleotide of internal RNA base pairing in the zones indicated, determined using RNADraw (20). ESTs are absent from the zones having the highest folding energy (−0.45, −0.38 kcal/nt) and present in zones with energies −0.26 to −0.33 kcal/nt.

Fig. S8. UCSC browser view (mm8) of the region downstream of the nominal 3′ end of murine Zdhhc5 on the minus strand. The end of Zdhhc5 overlaps the 3′ end of Med19 located on the plus strand. Dataset S1 and Dataset S2 indicate a large EST end cluster 181 nt upstream of the nominal end of Zdhhc5 whose narrow termination zone near position 84489000 suggests it represents a 3′ end of Zdhhc5.

Table S1. Sequences of 13 proposed hexamer PAS variants (4–6)

type | seq | valid
1 | AATAAA | 1
2 | ATTAAA | 1
3 | TATAAA | 1
4 | AGTAAA | 1
5 | AAGAAA | 0
6 | AATATA | 1
7 | AATACA | 1
8 | CATAAA | 1
9 | GATAAA | 1
10 | AATGAA | 0
11 | TTTAAA | 0
12 | ACTAAA | 1
13 | AATAGA | 1

The "valid" column indicates whether our analysis of their distribution near the ends of murine RefSeq transcripts (Fig. S1) indicated a maximum at a distance of 20 nt (1) or not (0).

Table S2. Transcript sequences whose ends predicted by manual curation differed from their nominal ends by 20 or more nucleotides

Row | Accession | Gene | Nominal end: PAS* | Nominal end: ESTs† | Nominal end: A-rich* | Curated end: Offset of new end‡ | Curated end: ESTs§ | Curated end: A-rich*
1 | 82 transcripts | | 1 | 1–100 | 0 | 0 | |
2 | NM_008714.2 | Notch1 | 1 | 0 | 1 | 119 | 50 | 0
3 | NM_007664.2 | Cdh2 | 1 | 20 | 1 | 132 | 44 | 0
4 | NM_028768.1 | Armc8 | 1 | 9 | 1 | 1705 | 19 | 0
5 | NM_028399.1 | Ccnt2 | 1 | 20 | 1 | 3222 | 18 | 0
6 | XM_892747.2 | Tmcc1 | 0 | 8 | 1 | 865 | 18 | 0
7 | XM_110671.3 | Mll1 | 0 | 22 | 1 | 1302 | 43 | 0
8 | NM_008397.2 | Itga6 | 0 | 100 | 1 | 1472 | 50 | 0
9 | NM_001033324.1 | Zbtb16 | 0 | 10 | 1 | 2495 | 10 | 0
10 | NM_010459.3 | HoxB4 | 0 | 0 | 1 | 1062 | 12 | 0
11 | XM_983766.1 | Arid1b | 1 | 3 | 0 | −3133 | 37 | 0
12 | XM_980910.1 | Asxl3 | 1 | 1 | 0 | −1695 | 21 | 0
13 | NM_011277.1 | Rnf2 | 1 | 12 | 0 | −107 | 25 | 0
14 | NM_172445.2 | Wdr37 | 1 | 4 | 0 | −28 | 46 | 0
15 | NM_001033202.1 | Usp30 | 1 | 11 | 0 | 580 | 20 | 0
16 | NM_009545.1 | Pcgf2 | 1 | 5 | 0 | 1002 | 25 | 0
17 | NM_172716.2 | Pcgf3 | 1 | 17 | 0 | 1250 | 40 | 0
18 | NM_133194.2 | Scml2 | 1 | 0 | 0 | 2541 | 7 | 0
19 | NM_001025597.1 | Ikzf1 | 0 | 4 | 0 | −819 | 18 | 0
20 | NM_010789.1 | Meis1 | 0 | 0 | 0 | −247 | 47 | 0
21 | NM_010453.4 | HoxA5 | 0 | 0 | 0 | −35 | 17 | 0
22 | NM_013883.1 | Scmh1 | 0 | 2 | 0 | 228 | 12 | 0
23 | NM_009354.1 | Tert | 0 | 0 | 0 | 801 | 9 | 0
24 | NM_010949.1 | Numb | 0 | 0 | 0 | 1401 | 37 | 0
25 | NM_010456.1 | HoxA9 | 0 | 0 | 0 | 1506 | 14 | 0
26 | NM_013876.2 | Rnf11 | 0 | 0 | 0 | 1759 | 16 | 0
27 | NM_009821.1 | Runx1 | 0 | 0 | 0 | 2027 | 14 | 0
28 | NM_011929.2 | Clcn6 | 0 | 3 | 0 | 2225 | 13 | 0
29 | NM_008396.2 | Itga2 | 0 | 0 | 0 | 2833 | 6 | 0
30 | NM_008957.1 | Ptch1 | 0 | 0 | 0 | 3205 | 46 | 0
31 | NM_183355.1 | Pbx1 var a | 0 | 0 | 0 | 4456 | 60 | 0
32 | NM_153421.1 | Phc3 | 0 | 0 | 0 | 7442 | 25 | 0

For each database sequence, the nominal 3′ end was located on the genome using the NCBI Genome View facility. The region from 1,000 nt upstream to 10,000 nt downstream of the nominal end was scanned for ESTs aligning to the same region. In 82 transcripts (row 1) a PAS was located within 30 nt (excluding polyA tails) of the nominal end (PAS column), and terminal polyadenylation was supported by clusters of EST terminations within 10 nt of one another at the nominal transcript ends (ESTs column). In the remaining 31 transcripts (rows 2–32) the EST evidence did not support termination near the nominal sequence end. In 6 of these cases (rows 12–15, 19–21), ESTs indicated a terminus 28–3,133 nt upstream of the nominal end, while in 25 other examples true polyadenylation sites were suggested at locations 119–7,442 nt downstream. Of particular note, 12 of these nominal sequences (rows 2–6, 12–18) contained a terminal PAS. In 5 of these (rows 2–6), the RefSeq sequence terminus was followed in the genome by polyA or A-rich tracts (A-rich column) which may have contributed to generation of the observed terminal EST clusters. In each instance where an A-rich tract was observed in the genome following the nominal transcript terminus, a PAS-containing EST cluster was found further downstream providing stronger support for the presence of a true polyadenylation site (number of ESTs, ESTs column). Rows 7–10 illustrate cases where clusters of EST ends aligned with the end of the nominal sequence but lacked a PAS. In each case, the clusters of EST ends were again explainable on the basis of A-rich tracts downstream of the nominal ends. In all 25 examples where a downstream polyadenylation site was predicted by the EST evidence, additional nonterminal ESTs were found aligned to the genome in the space between the nominal transcript ends and the newly identified downstream termini. Their overlapping pattern supported the prediction that the identified ends were part of the same transcriptional unit as the original nominal sequence.
*0 = absent, 1 = present.
†Number of EST ends near nominal end.
‡Number of nt downstream from nominal database sequence end.
§Number of EST ends near predicted end.

Table S3. Sequences of primer pairs used to detect murine transcripts experimentally in globally amplified cDNA

Gene | Alias | HSC | Positive control | Upstream primer | Downstream primer | Expected amplicon size
1810013L24Rik | | 1 | | CTAAGCACTTGAGTATGGG | CAGTGGAGAACTCAGCCTG | 124
9430067K14Rik | | 1 | | TGTCTGGATCTGATTGCC | TGTCAACAGCACAACTTCAA | 107
A630018P17Rik | | 1 | | CGTATGACAGTTCCCCAATAG | AAGGACTAACACATTGCACC | 164
Abcb1a | Mdr1a, p-glycoprotein 3 | 1 | | AATGAGGGTGTCAGCCATGT | ATTCATCCAGGTACCCTCCT | 149
Abcb1b | Mdr1b | 1 | | GCTCATGAGCTGTGACTATC | AAAGTCTCGGAAGGCTTCTC | 132
Abcg2 | Bcrp1 | 1 | | GGGTACAAGTTGCTTAGCAT | CATGATTCTTCCACAGTCCC | 220
Angpt1 | Angiopoietin-1 | 1 | | GTTGTTCCCACAGATGTTCG | ATTGCTACACACATGTTGGC | 206
Arid1a | | 1 | | TGTCTCAGCAGCCAATCAAC | TCAGGAACTTCAACTGAACC | 111
Arid1b | | 1 | | GCTGATCCCAGTTTGCTTCA | AGCCAGAGGAAAAGGAGGAAT | 250
Armc8 | | 1 | | TTGGTCAGCAAGGAAGTGG | ACAAAGGGAGAACAGAATGC | 104
Ash1L | | 1 | | GGCAAAGCACTACATCACTG | CAAGGATTGAGATGCAAGCC | 214
Ash2L | | 1 | | CTGAGTGCCATAAATCAGCC | ATAGAACCCACTAGCTTCCC | 268
Asxl1 | | 1 | | CTCTTCCGCCATTTCATCTG | ATGGACCAGACACTGTCAAG | 225
Asxl3 | | 0 | ES | CTGGCTGCAACAAGCATTTC | AACATAGAGACGGACGTTCC | 188
Cbx1 | | 1 | | AGGCTCAATCTTGTGCCCTA | ATGCTGATGCCATCCAACTG | 213
Cbx2 | M33 | 1 | | GGTTACGTCAGTCCCAAAGT | GAGAACTTGACTCAGAGCGA | 175
Cbx4 | | 1 | | GTTGGCCTTTTCTTTCCCCT | TAGCAGCAGGTTCTAGAGAG | 178
Cbx5 | | 1 | | GGACTTTTTTGGTTTGTGGG | GGGAGAGGATGATGTCTGAA | 211
Cbx6 | | 1 | | TTGCCTCCAGTGGTTAATGG | CTACCGAATCCACGTCCAAA | 134
Cbx7 | | 1 | | TTGCCAAGTTCCTTCCACCT | GCCTACACCTGCTTACTCTT | 240
Cbx8 | | 0 | ES | CAACACGGACCAAGGATTCT | TTTCCAGCCAGAACTGAGGA | 192
CcnD1 | Cyclin D1 | 1 | | TGTATCCATGGTGATGGGGT | CCATGGTGTGTCAACCAGAA | 148
Ccnd3 | Cyclin D3 | 1 | | CTCATCTTCATCAGAGCAGG | CAGCATGATAGAAATGGGCC | 145
Ccnt2 | | 1 | | TGAGGTAGGCAGGTTGAAAC | AACACACTCAGAGCTCGTAG | 218
CD34 | | 1 | | GATTCCTTTCAGTCTGTGCC | ACCCAATCCTCTCATCTCTG | 184
Cdh2 | | 0 | brain | ACTGTCTGGAAAACACCGAG | AGTGGGTTGAAGCGTATCAC | 205
Cdkn1a | p21 | 1 | | GTAGCAGTTGTACAAGGAGC | ACAATCTGAGTGGAGACAGG | 242
Clcn6 | | 1 | | CCATGTTGACAAGGAGAG | ATCTACACAGTGGAATCAGC | 132
Clu | | 1 | | GTGCGGAATGAGATAGAAGC | TCCCGAGAGCAGCAAGTG | 186
Csf3r | G-CSFR | 0 | hematopoietic progenitors | CTTCTCAGGCTATACCCTGA | TGGATCTCACTATGTAGGCT | 259
E2F6 | | 1 | | CAAGACAAGTGCCCAGTGAA | CCAAACTAACGATGCTTCCC | 114
Eed | | 1 | | ATGATGCCAGCATTTGGCGA | AAAGTCCGAGCAGGAAGACA | 213
Eng | Endoglin | 1 | | TGCTGCTTAGAAGCCTAAGC | GTGAATATGCAGGACTTGGC | 272
Epc1 | | 1 | | GCCAGAACCACAAATATCGG | TCACAATGTTTCCAGGTGGG | 171
Epc2 | | 1 | | GGTTGTGGCGAGATTGTCTT | TTCTGAAACACCACTACACG | 170
Ezh1 | | 1 | | GGCCAATAATACTCATGCGC | TTAGATAAGAGGTCACAAGCC | 217
Ezh2 | | 1 | | TGATGCCCTGAAGTATGTGG | CAAGGTTCCTGAAGCTAAGG | 106
Fgd5 | | 1 | | AGTCGGATGGAAGACAAGTG | GCTCCACTCTCTAAAGGTTC | 110
Flt3 | Flk2 | 1 | | TGCTTCGCTGGACTTTTCTC | TACATGGCTTTCCCCTCAAG | 171
Fzd1 | | 0 | testis | CAGTCCTCCTGATTGTAGTG | GATGATCTCATGGTGGTGAG | 175
GATA2 | | 1 | | GACGATTGTGCTGAGTCAAC | TCTTATGCGGGTACTAGCAC | 99
GATA3 | | 1 | | GTCACTTTTCTTGCAGCCTA | CAGACTGTTTAAAGGCAGTG | 161
Grhpr | | 1 | | AAACTCGCAACACCATGTCC | TCTGTCTGGCAAGATGTCTC | 160
Hba-a1 | α-globin | 0 | erythroid cells | GGACAAATTCCTTGCCTCTG | CAAAGACCAAGAGGTACAGG | 125
HoxA10 | | 1 | | GTCAAACCTGTAGGTGCAGA | TTCCACGCACAGCAGCAATA | 196
HoxA4 | | 1 | | TCCCAGCTTTCTAACCTTCC | AAATGCATTTCCCTCTCCCC | 225
HoxA5 | | 1 | | CTTGTTCAACGTGTAGTGGC | GCTTAAACAGCCAGACTTGG | 183
HoxA9 | | 1 | | GAGCTATACGTGTGTGCAGA | TTTGGTCAGTAGGCCTTGAG | 230
HoxB4 | | 1 | | TCATGTGTGTCCTCTCTCCT | CTGTTGTCACTCTGTACAGG | 139
Idb1 | Id1 | 1 | | TCTCTGGGAAAGACACTACC | GAGAAGCACGAAATGTGACC | 199
Ikzf1 | | 1 | | TAAATAGTGGCTTCAGGAGC | TCATCAACTTCTGATACAGC | 220
Il7r | | 1 | | CTGCCAATTTTCCTCTTGGT | CCAGAAAATAGCGCATGCTT | 249
Ipo11 | | 1 | | ATCTCAGGTTCCTCAGTAC | CTCGCACGCTTAGTGCAATG | 141
Itga2 | Integrin α2 | 1 | | AGGAAGCCGAGACGTAAATG | ACGTTTCTGTCTGGCTCTTG | 280
Itga6 | Integrin α6 | 1 | | CCTTGACAGTGTTTGTAGACC | GCAAACAGACCAGTGACTTG | 149
Kit | | 1 | | TACTTAAGGGGCCACACCAT | CCCACATGTAACGTGACATG | 275
Lck | | 1 | | AGCCTTGGATACCTCCTAAG | GAGACATGAGATTGGATGGC | 182
Lmo2 | | 1 | | TAGAAACAATCTGTGGGGCG | CTTAAGCTCTGGAGCCAAGG | 157
Lyzs | | 1 | | CCTGTCTTTCTTAGAGCTGC | TGCTGATACAGGCTCATCTG | 198
Mll1 | | 1 | | TGTCTGAAGCCCATCAGTGT | TTCAGTCTTTCGTAGAGAGC | 102
Mll3 | | 1 | | TGCACAATTCGAGTGGACTG | TCGGCAGGACAAATACTGGT | 210
mMeis1 | | 1 | | ATCAGCTGTTGCAGGCAGTG | TGCTCCAAGGTGGGACTATG | 259
Mpo | | 0 | macrophage precursors | CCGTGTCGAAGAACAACATC | ACAAATAGCACAGGAAGGCC | 160
Mtf2 | Pcl2 | 1 | | TCAAGCTGAACTGTCTCTAG | CTGAACATCAGAGCCATGTG | 83
Myb | | 1 | | GGAAGAACATTCTCTGTAGG | GCCAGCATTCTTGCAAATGC | 131
Myc | | 1 | | GAGAACGGTTCCTTCTGACA | ATGGCTGAAGCTTACAGTCC | 106
Myo18a | | 1 | | AGCTGATCTCTTCCACCTGC | ACCACTTGTTGCTGTGTCTG | 223
Nanog | | 0 | testis/ES | GTTCAAGGCCAACCTGTACT | GCATCGGTTCATCATGGTAC | 194
Notch1 | | 0 | aorta/brain | TAAGAGCACAACCCAGGATG | ATCTTAGGATGCGTCTGGTC | 173
Ns | | 1 | | GGTCGTCTGCCTTGGATAGA | CAGCGTACAGAGAGCACGTT | 100
Numb | | 0 | testis | GTGTCGGTTTGTAATGTGGG | GGGTCTCATGTTGTTACCCA | 182
Numbl | | 0 | ES | CTGTGTTCACTGCCAATGCT | CACCGACATGAGGTAACAGA | 158
Pbx1 | | 1 | | AAAGCTCAGCTCTACTTTGG | GGAATCTGCATTACCAGTAC | 167
Pcgf1 | | 1 | | TGTTCGGTGTTCTGTGAGAG | TGGAGAAGCAAAGGAGATGG | 180
Pcgf2 | Mel-18 | 1 | | GGGTGTTGGCTTAGTGTGTG | AAATGACCTAGGACAGGAGG | 101
Pcgf3 | | 1 | | GGAGAAAGAACTAAGGTGTG | GCAGCATTTCATACACCGTG | 221
Pcgf4 | Bmi1 | 1 | | ATGAATGACCCCTCCAAGTC | GGCTACAAGCAAGACAAAGC | 194
Pcgf5 | | 1 | | ATTAGGAGCGATCCAAGTGC | AACTGCAACACCACAAGGAC | 201
Pcgf6 | Mblr | 1 | | CCTTATTGTTCCTCAGTCAC | GCAAGGTTCTCAAAGACTCT | 98
Phc1 | Rae28 | 1 | | TTCTGCATTCATGGCAGGGT | AAGGTACAGTCCTGACCTCA | 223
Phc2 | | 1 | | CTACTCCCACACTTGCTTGT | TTCATTTCCAACCCTGCCTG | 184
Phc3 | | 1 | | CTTCCTCTGTGTGTCTAGTC | TGTGCAGGTAACAAAGGTGC | 129
Phf1 | Pcl1 | 1 | | CTTCATGACTCCTAATGCCC | CATACAGTCTGGAATCCTGG | 142
Phf2 | | 1 | | GACCCTGTCCTTGTGAACAT | AGTTGCCTTAGACCTCATGG | 203
Podxl | Podocalyxin | 0 | aorta | GGACTGCATAGATGAAGGCA | CAGTGTACTTTAGACCCCTG | 212
Pou5f1 | Oct4 | 0 | ES | TGCATTCAAACTGAGGCACC | ATGATGAGTGACAGACAGGC | 200
Ptch1 | | 1 | | GTCATCAAGTCTTTCGACAG | ATCTGAGAACCGCACTAGGT | 230
Ptpn11 | SHP-2 | 1 | | GCCAGACTCACACTTAGCTT | AAACAGCAAACAGCCAACCC | 162
Rb1 | | 1 | | ACAGCTTCCCCCATTTCTTC | AATAGTGCAGTGTCTGCAGC | 188
RBM15 | | 1 | | GTGGCTTATGTGGAGTTTAC | CAACTGTGAGCCCATGTTTG | 252
Ring1 | | 1 | | GTCTATATTGGACAGCACGG | ATTTGTTGTTGGGAGGAGGG | 125
Rnf11 | | 1 | | TTGGTATTGGAAGACGTACC | AGAGATGGCGCAGCATTG | 317
Rnf2 | Ring1b | 1 | | GTCAGTTTAGACAGATTGGG | TACCCCAGAAAAGGAATTGG | 201
Runx1 | Aml1 | 1 | | CGGTATAGCCAAAGCTGATC | GTGAGTAAGAAGTAGCTTGG | 264
Scmh1 | | 1 | | CTGGTCAGAATTGCTGCAAA | ATAGTTCAGATGGCAGTGGG | 166
Scml2 | | 1 | | TGTGCTTTAGACAACAATGC | TCCAATCAACTGGAAAGAGC | 130
Sfpi1 | PU.1, Spi-1 | 1 | | CTTTGCCTCCCACCAGGACT | TAGTGACTAATGAGGGGGTG | 178
Smarca2 | Brm | 1 | | TCTCCCGTGTTACCAATGTG | TGCCTCAGGCTTATGACTTG | 103
Smarca4 | Brg1 | 1 | | ATGGGTAGCACCAGATGTAG | TACTGTCTGACTCCAGTGAC | 307
Smarcc1 | Baf155 | 1 | | GCTCTTTCCATGACTGTCAG | TCCCAAACTCTCAAAGCTCC | 228
Smarcc2 | Baf170 | 1 | | CCTACTTTTGACAGTGGGAC | GAAGAAAGGTGTGGAGATGC | 264
Suz12 | | 1 | | AAGGCTAGCATTGTTTGCACAA | CAAAATCTGTTTTCAAGGGTAC | 176
Tcta | | 1 | | GATGTCATATAGCCATCTGG | CCACACAGATGACATCTCTG | 99
Tek | Tie2 | 1 | | TGACATGGAGTTACCATCCC | AGGTGGCTACCACATCAACA | 124
Tert | | 1 | | AGGTGAAGGGTGATGAAGTC | CAGTACAAATACGCTTCCCC | 153
Tie1 | | 1 | | TTTCTTGCCAGCTGTTCCCA | GAGTATAAGCAGAGAGTCGG | 136
Tmcc1 | | 1 | | CATGTCTGACAGTGAGCATG | TGTGTGTAGTCCAAGGCTTC | 160
Trp53 | p53 | 1 | | TCTCTGAGTAGTGGTTCCTG | AGGCTTTGCAGAATGGAAGG | 210
Usp30 | | 1 | | GTCACTTGTGACCATAGAGC | ACGCACACAATTGTCCTCTC | 93
Wdr37 | | 1 | | CGTGTGGCTAAGGATATGTC | AATTCTGGACACTCCAGCAC | 102
Yy1 | | 1 | | GCCAGATGCTGATGTTCAGT | GTTGCCCTTTCTGTTACACG | 123
Zbtb16 | Plzf | 1 | | CTCCATGTGTCACCAAGTGA | GGTATACAACTGAACATCGC | 150
Zfp42 | Rex1 | 1 | | GGACTTTTGCATACGTCGGA | TGACTACTGCCAAAGTTGGC | 132

Sequences are listed from 5′ at the left. Presence of an amplicon of the expected size from hematopoietic stem cell (HSC) cDNA is indicated with a 1. Where an amplicon was not detected in HSC cDNA, the positive control source of amplified cDNA is indicated from which an amplicon of the expected size was obtained.

Muro et al. www.pnas.org/cgi/content/short/0807813105 14 of 19 Table S4. Comparison of transcript ends in the benchmark set identified manually and by automated prediction nominal: predicted: mm6 curated mm6 curated

Gene symbol Accession Chr Strand Nominal end Curated end Distance c0 c1 Distance-2 EST ends

31 curated ends differing from nominal ends
Arid1b XM_983766.1 17 + 5255032 5251899 -3133 5251890 5251905 0 22
Armc8 NM_028768.1 9 - 99381642 99379937 1705 99379918 99379948 0 28
Asxl3 XM_980910.1 18 + 22752234 22750539 -1695 22750514 22750541 0 15
Ccnt2 NM_028399.1 1 + 127647226 127650448 3222 127650437 127650450 0 14
Cdh2 NM_007664.2 18 - 16783576 16783576 0 16783575 16783583 0 14
Clcn6 NM_011929.2 4 - 146498846 146496621 2225 146496621 146496650 0 15
HoxA5 NM_010453.4 6 - 52345310 52345344 -34 52345340 52345347 0 7
HoxA9 NM_010456.1 6 - 52368213 52366707 1506 52366709 52366711 2 8
HoxB4 NM_010459.3 11 + 96141682 96142723 1041 96142719 96142726 0 6
Ikzf1 NM_001025597.1 11 + 11667719 11666900 -819 11666836 11666853 -47 11
Itga2 NM_008396.2 13 - 111250195 111247362 2833 111247362 111247363 0 4
Itga6 NM_008397.2 2 + 71554502 71555974 1472 71555953 71555974 0 52
Meis1 NM_010789.1 11 - 18775223 18775472 -249 18775458 18775480 0 31
Mll1 XM_110671.3 9 - 44792211 44792210 1 44792210 44792225 0 29
Notch1 NM_008714.2 2 - 26390198 26390067 131 26390064 26390084 0 44
Numb NM_010949.1 12 - 80659465 80658086 1379 80658062 80658106 0 70
Pbx1 NM_183355.1 1 - 168092570 168088114 4456 168088091 168088120 0 73
Pcgf2 NM_009545.1 11 - 97510912 97509910 1002 97509910 97509912 0 14
Pcgf3 NM_172716.2 5 + 107579311 107580563 1252 107580550 107580568 0 42
Phc3 NM_153421.1 3 - 30318234 30310792 7442 30310791 30310826 0 29
Ptch1 NM_008957.1 13 - 60894523 60891318 3205 60891318 60891343 0 33
Rnf11 NM_013876.2 4 - 108411763 108410004 1759 108410003 108410012 0 17
Rnf2 NM_011277.1 1 - 151325156 151325263 -107 151325256 151325312 0 103
Runx1 NM_009821.1 16 - 91765169 91763141 2028 91763140 91763166 0 14
Scmh1 NM_013883.1 4 + 119552400 119552628 228 119552609 119552644 0 50
Scml2 NM_133194.2 X + 154855068 154857619 2551 154857598 154857609 -10 5
Tert NM_009354.1 13 + 69708154 69708955 801 69708952 69708956 0 7
Tmcc1 XM_892747.2 6 - 116457019 116456153 866 116456146 116456161 -815
Usp30 NM_001033202.1 5 + 113233569 113234149 580 113234145 113234158 -413
Wdr37 NM_172445.2 13 - 8762169 8760257 1912 8760227 8760286 0 37
Zbtb16 NM_001033324.1 9 - 48682076 48679580 2496 48679575 48679587 0 5

82 curated ends coinciding with nominal ends
1810013L24Rik XM_978179.1 16 + 8530946 8530939 -7 8530926 8530946 0 13
9430067K14Rik NM_001039493.1 1 - 65083206 65083205 -1 65083204 65083218 0 15
A630018P17Rik NM_001007577.1 X - 160099222 160099221 1 160099217 160099248 0 20
Abcb1a NM_011076.1 5 + 8755263 8755256 -7 8755249 8755262 0 9
Abcb1b NM_011075.1 5 + 8873052 8873059 7 8873052 8873059 0 21
Abcg2 NM_011920.2 6 + 58854925 58854924 -1 58854875 58854926 0 17
Angpt1 NM_009640.2 15 - 42335552 42335554 -2 42335538 42335570 0 9
Arid1a XM_992390.1 4 - 132639908 132639908 0 132639907 132639940 -153
Ash1L NM_138679.5 3 + 88822742 88823239 497 88823181 88823240 0 40
Ash2L NM_011791.1 8 - 24541625 24541624 1 24541619 24541654 0 84
Asxl1 NM_001039939.1 2 + 152860944 152860944 0 152860923 152860944 0 33
Cbx1 NM_007622.3 11 + 96629728 96629728 0 96629216 96629239 -489 75
Cbx2 NM_007623.2 11 + 118852363 118852363 0 118852348 118852371 0 50
Cbx4 NM_007625.1 11 - 118901250 118901250 0 118901248 118901254 0 10
Cbx5 NM_007626.3 15 - 103258024 103253567 4457 103253564 103253597 0 109
Cbx6 NM_028763.3 15 - 79876219 79876218 1 79876216 79876228 0 36
Cbx7 NM_144811.3 15 - 79968127 79968127 0 79968121 79968136 0 66
Cbx8 NM_013926.1 11 - 118859531 118859532 -1 118859532 118859560 0 15
Ccnd1 NM_007631.1 7 + 139343623 139343624 1 139343568 139343625 0 100
Ccnd3 NM_007632.1 17 + 45118147 45118148 1 45118055 45118175 0 228
CD34 NM_133654.1 1 + 194701657 194701654 -3 194701621 194701671 0 110
Cdkn1a NM_007669.2 17 + 26909140 26909142 2 26909099 26909142 0 139
Clu NM_013492.1 14 + 60508969 60508965 -4 60508763 60508970 0 463
Csf3r NM_007782.1 4 + 125071553 125071554 1 125071548 125071554 0 30
E2F6 NM_033270.1 12 + 16193675 16193682 7 16193666 16193684 0 36
Eed NM_021876.2 7 - 83945801 83945801 0 83945773 83945802 0 14
Eng NM_007932.1 2 + 32614278 32614284 6 32614254 32614287 0 89
Epc1 NM_007935.1 18 - 6478660 6478659 1 6478658 6478718 0 25
Epc2 NM_172663.2 2 + 49483344 49483349 5 49483327 49483349 0 18


Ezh1 NM_007970.1 11 - 101012203 101012202 1 101012198 101012209 0 79
Ezh2 NM_007971.1 6 - 47731971 47731961 10 47731962 47732039 1 81
Fgd5 NM_172731 6 + 92523961 92523958 -3 92523933 92523972 0 24
Flt3 NM_010229.1 5 - 146222241 146222240 1 146222236 146222241 0 6
Fzd1 NM_021457.2 5 - 4759845 4759854 -9 4759847 4759880 0 54
Gata2 NM_008090.3 6 + 88642526 88642528 2 88642524 88642532 0 68
Gata3 NM_008091.2 2 - 9773199 9773190 9 9773188 9773206 0 11
Grhpr NM_080289.1 4 + 44906618 44906621 3 44906550 44906657 0 91
Hba-a1 NM_008218.1 11 + 32179287 32179288 1 32179117 32179285 -3 144
HoxA10 NM_008263.1 6 - 52375259 52375258 1 52375251 52375274 0 31
HoxA4 NM_008265.2 6 - 52333394 52333396 -2 52333378 52333397 0 5
Idb1 NM_010495.1 2 + 152194345 152194346 1 152194336 152194347 0 68
Il7r NM_008372.3 15 - 9319643 9319649 -6 9319640 9319669 0 54
Ipo11 NM_029665.2 13 - 103003196 103003195 1 103003190 103003254 0 117
Kit NM_021099.2 5 + 74490902 74490903 1 74490847 74490933 0 150
Lck NM_010693.1 4 - 128575471 128575471 0 128575461 128575524 0 79
Lmo2 NM_008505.3 2 + 103686626 103686627 1 103686576 103686634 0 44
Lyzs NM_017372.2 10 - 116966785 116966785 0 116966782 116967003 0 182
Mll3 NM_001081383.1 5 - 23741344 23741346 -2 23741342 23741395 -441
Mpo NM_010824.1 11 + 87532386 87532405 19 87532377 87532406 0 162
Mtf2 NM_013827.2 5 + 107179450 107179454 4 107179422 107179457 0 51
Myb NM_010848.3 10 - 21054488 21054489 -1 21054485 21054492 0 23
Myc NM_010849.4 15 + 62002460 62002534 74 62002508 62002538 0 83
Myo18a NM_011586.2 11 + 77591640 77591640 0 77591621 77591648 0 44
Nanog AY278951.1 6 + 123381606 123381606 0 123381598 123381606 0 24
Ns NM_153547.2 14 - 29144396 29144393 3 29144392 29144412 0 82
Numbl NM_010950.2 7 + 22655229 22655229 0 22655220 22655238 0 50
Pcgf1 NM_197992.1 6 + 83424537 83424544 7 83424536 83424545 0 21
Pcgf4 NM_007552.3 2 + 18728640 18728638 -2 18728622 18728647 0 52
Pcgf5 NM_029508.1 19 + 35798875 35798875 0 35798841 35798888 0 24
Pcgf6 NM_027654.1 19 - 46584919 46584919 0 46584916 46584951 0 42
Phc1 NM_007905.1 6 - 122978699 122978698 1 122978518 122978649 -49 201
Phc2 NM_018774.1 4 + 127779988 127779995 7 127779971 127779996 0 178
Phf1 NM_009343.2 17 + 24732954 24732954 0 24733035 24733213 81 159
Phf2 NM_011078.2 13 - 48400247 48400248 -1 48400247 48400283 0 45
Podxl NM_013723.2 6 - 31596477 31596476 1 31596470 31596514 0 49
Pou5f1 NM_013633.1 17 + 33228085 33228086 1 33228078 33228089 0 16
Ptpn11 NM_011202.2 5 - 120282349 120282343 6 120282340 120282396 0 68
Rb1 NM_009029.1 14 - 67546936 67546936 0 67546936 67546950 0 44
RBM15 XM_131139.4 3 - 107120566 107121776 -1210 107121770 107121777 -138
Ring1 NM_009066.2 17 - 31725139 31725141 -2 31725316 31725344 -175 4
Sfpi1 NM_011355.1 2 + 90820506 90820507 1 90820478 90820509 0 36
Smarca2 NM_011416.2 19 + 26018387 26018387 0 26018379 26018389 0 23
Smarca4 NM_011417.2 9 + 21593154 21593154 0 21593021 21593158 0 128
Smarcc1 NM_009211.2 9 + 110278594 110279542 948 110279491 110279548 0 29
Smarcc2 NM_198160.1 10 + 128227200 128227200 0 128227144 128227202 0 73
Suz12 NM_199196.1 11 + 79759782 79759782 0 79759723 79759782 0 51
Tcta NM_133986.1 9 - 108372720 108372720 0 108372716 108372802 0 46
Tek NM_013690.1 4 + 93850011 93850010 -1 93850007 93850023 0 13
Tie1 NM_011587.1 4 - 117430096 117430096 0 117430092 117430105 0 50
Trp53 NM_011640.1 11 + 69317530 69317532 2 69317427 69317534 0 41
Yy1 NM_009537.3 12 + 104287625 104287626 1 104287616 104287636 0 17
Zfp42 NM_009556.2 8 - 42236188 42236184 4 42236185 42236208 1 11

Manually curated and experimentally verified ends were mapped to the predicted ends listed in Dataset S1 and Dataset S2. The analysis was performed using mm6 genome coordinates, which are displayed here. A curated end was considered matched if the distance to a predicted end was 50 nt or less. Predicted ends matched 110 of the 113 curated ends. Columns indicate gene symbol; accession number; chromosome; strand; nominal end coordinate; curated end coordinate; and distance between nominal and curated end, negative if the curated end is upstream. Features of the automatically predicted end: c0, coordinate of the start of the predicted termination zone; c1, end of the termination zone; Distance-2, distance between curated and predicted ends, negative if the predicted end is upstream; EST ends, total number of EST ends in the zone.
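The 50-nt matching criterion can be illustrated with a minimal Python sketch. It is not the authors' code: the names `Zone`, `gap_to_zone` and `is_matched` are hypothetical, and the sketch assumes that a curated end lying inside the termination zone counts as distance 0 and that only the magnitude of the distance matters for matching. The sign convention of the Distance-2 column is deliberately left out.

```python
from typing import NamedTuple

MATCH_WINDOW_NT = 50  # from the legend: matched if within 50 nt


class Zone(NamedTuple):
    """Predicted termination zone (hypothetical helper type)."""
    c0: int  # coordinate of the start of the zone
    c1: int  # coordinate of the end of the zone


def gap_to_zone(curated_end: int, zone: Zone) -> int:
    """Unsigned distance in nt from a curated end to the nearest zone edge.

    Assumption: a curated end inside the zone has distance 0.
    """
    lo, hi = min(zone), max(zone)
    if lo <= curated_end <= hi:
        return 0
    return min(abs(curated_end - lo), abs(curated_end - hi))


def is_matched(curated_end: int, zone: Zone) -> bool:
    """True if the curated end lies within 50 nt of the predicted zone."""
    return gap_to_zone(curated_end, zone) <= MATCH_WINDOW_NT


# Coordinates taken from the Ikzf1 and Cbx1 rows of the table (mm6).
print(gap_to_zone(11666900, Zone(11666836, 11666853)))
print(is_matched(96629728, Zone(96629216, 96629239)))
```

Under these assumptions the Ikzf1 end lies 47 nt from its zone and is matched, whereas the Cbx1 end lies 489 nt away and is not.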

Table S5. Recall and precision of automated predictions against various transcript collections

Collection                   N      N PAS  M      P      Recall %  Precision %
RefSeq mm6 reviewed          122    98     88     131    72.1      67.1
RefSeq mm6 total             18280  13349  11188  16099  61.2      69.4
Ensembl 35.34c known         26315  16210  13175  19747  50.0      66.7
Ensembl 35.34c novel         11847  2114   795    1723   6.7       46.1
Affymetrix MOE430 consensus  42756  28242  22761  31868  53.2      71.4
TIGR release 15              87421  30759  19311  28770  22.0      67.1
PacDB 12–19–2005             8074   5812   5093   7836   63.0      64.9

N is the number of transcripts examined in each set. The number of sequences containing a valid terminal PAS (N PAS) was approximated roughly, using untrimmed sequences and allowing for possible terminal polyA tracts by increasing the scan zone to 50 nt. A predicted termination was considered to match a nominal end if it was located within 10 nt. For each collection, we determined the number of transcripts matched by a predicted end (M) and the number of predicted ends in the −150 nt to +200 nt range of the transcript end (P). Recall is the fraction of nominal ends matched by the method (M/N) and precision is the fraction of predicted ends that match a nominal end (M/P). Source files were RefSeq mm6 from UCSC Golden Path; Ensembl SQL database mus_musculus_core_35_34c; Affymetrix as in Table 2; TIGR MGI.release_15.zip, using definitions.html to extract the TC subset and excluding singleton ESTs; PacDB mouse_12–19–2005.fa.gz.
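As a worked check of the two definitions, the following Python sketch recomputes recall (M/N) and precision (M/P) for two rows of Table S5. The function names are hypothetical. Truncating to one decimal place is an assumption made here because it reproduces the tabulated percentages; the rounding convention is not stated in the legend.

```python
from typing import Tuple


def percent_truncated(numerator: int, denominator: int) -> float:
    """Percentage truncated (not rounded) to one decimal place."""
    return (1000 * numerator // denominator) / 10


def recall_precision(n: int, m: int, p: int) -> Tuple[float, float]:
    """Recall = M/N and precision = M/P, both expressed in percent."""
    return percent_truncated(m, n), percent_truncated(m, p)


# N, M and P copied from two RefSeq rows of Table S5.
rows = {
    "RefSeq mm6 reviewed": (122, 88, 131),
    "RefSeq mm6 total": (18280, 11188, 16099),
}
for name, (n, m, p) in rows.items():
    recall, precision = recall_precision(n, m, p)
    print(f"{name}: recall {recall}%, precision {precision}%")
```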

Other Supporting Information Files

Dataset S1 (XLS) Dataset S2 (XLS) UCSCSessionmm8 (TXT)

Datasets S1 and S2. Mapping of 58,282 candidate murine 3′ ends against the UCSC mm9 collection of KnownGenes. Plus and minus strand mappings are indicated on separate worksheets. KnownGene symbols and coordinates are bolded. Genome extents associated with KnownGenes, up to a distance of 10,000 nt downstream of the KnownGene 3′ end, are filled in blue. Coordinates indicate positions on the mm8 genome draft. Positions that overlap with transcripts on the opposite strand are filled in gray.

Columns in the table indicate:
(1 chr) chromosome;
(2 strand);
(3 sourceAcc) Entrez accession number of the KnownGene exemplar sequence;
(4 Sym) gene symbol;
(5 desc) descriptive information;
(6 znStart) 5′ start of KnownGene or of predicted termination zone;
(7 znEnd) nominal 3′ end of KnownGene exemplar;
(8 length) of KnownGene exemplar;
(9 nominal_PAS) presence (=1) or absence (=0) of a PAS within 40 bases of the KnownGene exemplar nominal 3′ end after trimming poly(A) sequence;
(10 dist_5′) distance in nt of the start of the predicted termination zone from the 5′ end of the KnownGene;
(11 dist_3′) distance of the predicted end from the 3′ end of the KnownGene;
(12 EST_ends_total) total number of ESTs ending in the termination zone;
(13 width_end_zone) width of the termination zone;
(14 rel_pos_max) distance of the position where most ESTs terminate from the end of the termination zone;
(15 representative_EST) accession number of an EST terminating at the 3′ extreme of the termination zone;
(16 seq_up) up to 400 nt of genomic sequence aligned to the representative EST extending from the end of the termination zone;
(17 seq_down) 40 nt of genome sequence downstream of the termination zone;
(18 polyA_tract) presence (=1) or absence in the downstream sequence of a polyA tract of at least 9 A in a window of 10 nt;
(19 UCSC) a link to the corresponding region of the mm8 genome draft on the UCSC Genome Browser. The file UCSCSessionmm8.txt can be uploaded to the UCSC Session Manager (link on blue bar at top of Genome Browser) to configure a view similar to Fig. 1;
(20 TS) a link to the corresponding region of the mm6 genome on our Transcriptome Sailor browser.

Nonredundant candidate 3′ end coordinates mapped originally on the mm6 genome were converted to mm8 coordinates using the online UCSC LiftOver tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver). The mm9 KnownGene collection coordinates (http://genome.ucsc.edu/cgi-bin/hgTables) were similarly converted to mm8 coordinates. Extensive redundancy among the original 49,409 items was eliminated by selection of a single item to represent each set of overlapping KnownGenes on each strand, with preference given to items spanning the greatest genome extent, those with an assigned gene symbol, and those with a corresponding RefSeq. The resulting 23,091 positive and negative strand exemplars were entirely nonoverlapping on the respective genome strands, and served to mark and identify known transcribed regions of the genome.

The 3′ ends of KnownGene exemplars frequently overlap with the 3′ ends of KnownGene exemplars on the opposite genomic strand. In these situations, it may not be clear to which KnownGene a given EST cluster belongs. A set of 737 3′ end clusters occurring close to the nominal ends of RefSeqs and having a classical AATAAA PAS within 30 nt of the cluster end, meant to be representative of "true" 3′ ends, was analyzed. A dominant feature of this set was the narrow width of their termination zones. In contrast, the dominant feature of the leading 5′ edges of these clusters was their extended width. The metric 10*[termination_zone_width / (number of ESTs + 12)] was determined for the true 3′ end set to distribute around a mode of 4, with 97% of clusters having values below 15. This zone width metric was accordingly used, together with the presence or absence of an AATAAA or ATTAAA PAS motif near the end of the cluster, to accept or reject clusters occurring at zones of overlap with termini on the opposite strand. The same metric was used to reject any clusters with values >25, regardless of whether they overlapped opposite strand transcripts or not. Present and future updated versions of this table can be accessed from the Transcriptome Sailor web server (www.ogic.ca/ts).
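The zone width metric can be written out as a short Python sketch. The function and constant names are hypothetical, the two example clusters are invented for illustration, and the rejection rule assumes the cutoff is meant as "greater than 25". How the metric was combined with the PAS motif test at opposite-strand overlaps is not specified in the legend and is not modeled here.

```python
EST_OFFSET = 12        # the "+ 12" term in the legend's formula
REJECT_ABOVE = 25.0    # assumed reading of the rejection cutoff


def zone_width_metric(zone_width: int, n_ests: int) -> float:
    """10 * termination_zone_width / (number of ESTs + 12)."""
    return 10 * zone_width / (n_ests + EST_OFFSET)


def rejected_as_too_wide(zone_width: int, n_ests: int) -> bool:
    """True if the metric exceeds the assumed cutoff of 25."""
    return zone_width_metric(zone_width, n_ests) > REJECT_ABOVE


# Two invented clusters: a narrow, well-supported zone and a wide, sparse one.
print(zone_width_metric(20, 28), rejected_as_too_wide(20, 28))
print(zone_width_metric(150, 8), rejected_as_too_wide(150, 8))
```

The added constant in the denominator keeps the metric from growing very large for clusters supported by only a few ESTs, so a cluster must be both wide and sparsely supported to be rejected.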

Dataset S3 (XLS) Dataset S4 (XLS) UCSCSessionhg18 (TXT)

Datasets S3 and S4. Mapping of 86,410 candidate human 3′ ends against the UCSC collection of KnownGenes. The hg18 KnownGene collection originally contained 56,722 entries. Redundancy was reduced to 22,891 nonoverlapping exemplars as described in the legend to Datasets S1 and S2. Table layout, exclusion of overlaps with transcripts on opposite strands, and exclusion of clusters with wide zone widths are as described for Datasets S1 and S2. The file UCSCSessionhg18.txt can be uploaded to the UCSC Session Manager to configure a view similar to Fig. 1. Present and future updated versions of this table can be accessed from the Transcriptome Sailor web server (www.ogic.ca/ts).
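The redundancy reduction applied to the KnownGene collections (one exemplar per set of overlapping KnownGenes on a strand) might be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the `KnownGene` record and function names are hypothetical, the three stated preferences are assumed to apply in lexicographic order (extent, then gene symbol, then RefSeq), and overlapping sets are taken to be chains of intervals that overlap on a single strand.

```python
from typing import List, NamedTuple, Tuple


class KnownGene(NamedTuple):
    """Hypothetical record for one KnownGene on a single strand."""
    name: str
    start: int
    end: int
    has_symbol: bool
    has_refseq: bool


def preference(gene: KnownGene) -> Tuple[int, bool, bool]:
    # Assumed order of preference: extent, gene symbol, RefSeq.
    return (gene.end - gene.start, gene.has_symbol, gene.has_refseq)


def pick_exemplars(genes: List[KnownGene]) -> List[KnownGene]:
    """Return one exemplar per set of overlapping genes on one strand."""
    exemplars: List[KnownGene] = []
    group: List[KnownGene] = []
    group_end = -1
    for gene in sorted(genes, key=lambda g: g.start):
        if group and gene.start > group_end:
            # Gap reached: close the current set of overlapping genes.
            exemplars.append(max(group, key=preference))
            group = []
        group.append(gene)
        group_end = gene.end if len(group) == 1 else max(group_end, gene.end)
    if group:
        exemplars.append(max(group, key=preference))
    return exemplars


# Invented example: A and B overlap, C stands alone.
genes = [
    KnownGene("A", 100, 500, True, True),
    KnownGene("B", 300, 900, False, False),
    KnownGene("C", 2000, 2500, True, False),
]
print([g.name for g in pick_exemplars(genes)])
```

Because exemplars are drawn from disjoint overlapping sets, the selected items cannot overlap one another on the strand, which is consistent with the nonoverlapping exemplars described in the legends.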
