DoTS: integrated indices for human and mouse built from transcribed sequences

Running Title: DoTS gene indices

Y Thomas Gan 1,2, Brian Brunk 1, Jonathan Crabtree 1,2, Deborah Pinney 1,2, Steve Fischer 1,2, Joan Mazzarelli 1,2, Otto Valladares 2, Maja Bucan 2, Christian J. Stoeckert, Jr. 1,2

1Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104, USA

2Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA

Y Thomas Gan: 215-746-7013 (tel), 215-573-3111 (fax), yg [email protected] (email)

Brian Brunk: 215-573-3118 (tel), 215-573-3111 (fax), [email protected] (email)

Jonathan Crabtree: 215-573-3115 (tel), 215-573-3111 (fax), [email protected] (email)

Deborah Pinney: 215-573-3116 (tel), 215-573-3111 (fax), [email protected] (email)

Steve Fischer: 215-573-2280 (tel), 215-573-3111 (fax), [email protected] (email)

Joan Mazzarelli: 215-573-4413 (tel), 215-573-3111 (fax), [email protected] (email)

Otto Valladares: 215-898-0021 (tel), 215-573-2041 (fax), [email protected] (email)

Maja Bucan: 215-898-0020 (tel), 215-573-2041 (fax), [email protected] (email)

Corresponding author:

Christian J. Stoeckert, Jr.

215-573-4409 (tel), 215-573-3111 (fax), [email protected] (email) Genome Biology

Abbreviations used in this paper:

EST : expressed sequence tag

DoTS : database of transcribed sequences

DT : DoTS Transcript

DG : DoTS Gene

sDG : similarity-based DoTS Gene

gDG : genome-based DoTS Gene

TC : tentative consensus

BLAST : basic local alignment search tool

BLAT : BLAST-like alignment tool

UTR : untranslated region

ORF : open reading frame

CDS : (protein) coding sequence

Abstract

Background

Although sequences for large eukaryotic genomes are being completed, it remains a challenge to identify all genes encoded by them and to determine or predict their functions. To help address this challenge, we have built a Database of Transcribed Sequences (DoTS). We cluster and assemble ESTs and mRNAs into DoTS Transcripts (DTs). We further group DTs representing transcripts from the same genes into DoTS Genes (DGs). We describe human and mouse DoTS here, although DoTS is generic and applicable to other species such as apicomplexa [1].

Results

We have built an integrated transcriptome resource, DoTS, for human and mouse. In DoTS we catalogue, categorize, and annotate known and predicted transcripts and genes. We have identified 48,994 human and 37,984 mouse high confidence DGs, of which 25,326 human and 22,024 mouse DGs are predicted to be protein-coding genes. Using these data, we can predict novel genes, as demonstrated using a 75Mb proximal region on mouse chromosome 5. We have found that DGs can significantly enrich the models of known genes by predicting extended UTRs, novel exons, and alternative transcription starts. DoTS also enables the study of non-coding genes and singleton transcripts (DTs with only one input EST or mRNA), in addition to other studies such as the investigation of alternative splicing. A powerful query interface for human and mouse DoTS is available at http://www.allgenes.org [2].

Conclusion

DoTS Transcripts and DoTS Genes, which are extensively annotated and significantly curated, present a unique, integrated, non-redundant, and genome-mapped view of the millions of ESTs and mRNAs in the public domain. They are categorized into various subsets such as high confidence genes, protein-coding genes, and non-coding genes. They predict many putative novel genes, enrich gene models of known genes, and enable data mining in novel directions.

Background and significance

In a post-genomic era, identifying all genes and studying their functions and relationships are among the ongoing challenges in the field of functional genomics. Transcribed sequences (mRNAs and ESTs) may be used to build integrated transcriptome data resources to help address such challenges.

Genomic data integration

Much progress has been made recently in sequencing large eukaryotic genomes. We now have an essentially complete sequence for the human genome [3-5] and a draft for mouse [6].

Coincident with the explosion of genomic sequence data is the rapidly growing availability of vast amounts of functional genomics data such as expressed sequence tags (ESTs), proteomes, protein domains, and microarray gene expression data. For example, as of October 2003, there were 5.4 million human and 3.9 million mouse ESTs in the public EST repository dbEST [7]. It is necessary to integrate these diverse types of data to facilitate gene identification and functional annotation.

Transcribed sequences for data integration

Transcribed sequences are a good integration point. First, they are the products of gene transcription, and they are abundant as a result of the large scale EST sequencing efforts. Therefore, they can be used for gene discovery and for analysis of gene structure (e.g. exon-intron structures, alternative splicing) in genomic sequences via alignments. Second, expression information is usually available for ESTs, based on the libraries from which they originate. In addition, ESTs are commonly used to generate features on microarrays. Therefore, transcribed sequences allow easy integration of expression information with genes, providing the basis for expression analyses. Third, transcribed sequences may be translated to allow protein sequence analyses (e.g. domain-based functional annotation, ortholog identification). Fourth, they may be aligned with genomic sequences to identify regulatory regions. Finally, they may originate from genes that do not encode proteins, and therefore allow the identification of non-coding genes.

Existing transcriptome data resources

Human and mouse genome and transcriptome data are available from several sites [8]. Although there is overlap in the information presented, the sites generally provide unique views or emphases. This is expected as we are far from a complete understanding of the wealth of information provided by genome sequencing, EST sequencing, and microarray experiments.

Groups such as Ensembl [9, 10] or the UCSC Genome Browser team [11] use the genome as their reference point. Another approach is to use shared identifiers (accessions) from different resources to organize and integrate information, as is done by GeneCards [12] and MGI [13], which focus on known genes and emphasize phenotypes. These approaches are complementary, and they provide different views and different interpretations of the data. For example, transcribed sequences that cannot be properly aligned to the genome would fail to be seen as primary entities on genome-based views.

Unigene [14] and the TIGR gene indices [15] represent multiple species transcriptome data resources organized around transcribed sequences. Other efforts in this class include MGC [16], RefSeq [17], STACK [18], and MIPS [19]. Unigene uses sequence similarity to cluster all ESTs and mRNAs but does not generate consensus sequences. Essentially, the Unigene clusters represent ESTs associated with the same gene. The great strength of Unigene is its currency, but one of its weaknesses is the lack of persistent identifiers. TIGR gene indices provide consensus sequences and persistent identifiers, and they also have data on orthologs for species other than human and mouse, which enables comparative genomics studies using more than two species. TIGR assemblies (TCs) represent transcripts rather than genes, so they are a transcript-centric, not gene-centric, resource. MGC focuses on full length cDNAs, and RefSeq emphasizes known and curated genes; both are therefore limited in scope.

DoTS as a transcriptome resource

DoTS, short for Database of Transcribed Sequences, is a collective name to describe DoTS Transcripts (DTs) and DoTS Genes (DGs). A DT is an assembly of transcribed sequences representing transcripts of the same splice form, and a DG is a group of DTs representing transcripts from the same gene. The goal of DoTS is to generate relationships among genes, RNAs, proteins, and their sequences to assist in discovering new genes, functions, genomic relationships (e.g. clusters by location), and regulation of gene expression. Allgenes.org is the website for public access to DoTS.

As a human and mouse transcriptome resource, DoTS organizes its data around transcribed sequences, as do Unigene and the TIGR TCs. DoTS and TIGR TCs provide consensus sequences and persistent identifiers, both of which Unigene lacks. Although DoTS and TIGR TCs are very similar in the degree of annotation performed and, as recently reported, in the assemblies generated [20], the two are not identical because of differences in the details of their clustering and assembly processes. For example, DoTS has more consensus transcripts but a smaller number of sequences per transcript than TIGR TCs. This may be due to less trimming of low quality sequences from the ends, a choice made for DoTS to better preserve representation of differentially processed transcripts. The DoTS transcript indices also differ from TIGR TCs in some of the annotations performed on the consensus sequences (e.g. gene trap associations, signal peptide prediction, transmembrane predictions), in significant manual curation by expert annotators (Mazzarelli J. et al., manuscript in preparation), and in the availability of a powerful query interface through the Allgenes website [2]. DTs are taken a step further than TCs to generate genes. Therefore DoTS is also a gene index.

Gene finding and transcribed sequences

The difficulty in identifying all the genes in a mammalian genome is illustrated by the range of predictions over recent years. The estimate for the total number of human genes ranges from 28,000-34,000 based on homology [21], 35,000 based on ESTs [22], and 41,000-45,000 based on validation of computational predictions [23], to 56,960-81,273 based on cDNAs [24]. The initial genome annotations by the public and private human genome projects, using similar approaches, both suggested that there are about 30,000 human protein-coding genes [4, 5], but the actual genes predicted differed significantly [25].

Approaches used to date include ab initio gene prediction (e.g. GenScan [26]), cross-species conservation-based annotation (e.g. TwinScan [27]), and protein similarity-based annotation in combination with these methods [4-6]. Ab initio methods exploit statistical differences in sequence content and/or signals between gene and non-gene sequences and usually suffer from high rates of false positives. Similarity-based methods use similarity to known protein sequences as evidence that a piece of genomic sequence is part of a gene. Novel protein-coding genes will not be identified by such methods. Composite approaches such as those taken by the public and private human genome projects to annotate protein-coding genes in the human genome [4, 5] are purported to reduce both false positives of ab initio methods and false negatives of similarity-based methods. However, Hogenesch et al. reported that novel human genes predicted by the two groups had little (~20%) overlap as revealed by BLAST comparison, although most predicted genes from both sets appear to be real as verified by expression analyses (microarray and Northern blot) [25].

As a complementary approach, transcribed sequences may be used for gene finding in eukaryotic genomes. As discussed above, different approaches often complement each other in identifying protein-coding genes. More importantly, none of the above approaches are directed toward identifying non-coding genes, which can be biologically relevant, as in the cases of tRNA, rRNA, and enzymatic RNA genes. Since transcribed sequences are (mostly partial) transcripts of genes, protein-coding or not, they are a good resource for the identification of non-coding genes. A recent effort in this regard is Gene Bounds (unpublished, see UCSC Genome Browser [11]), where individual ESTs are directly aligned to the genome, and the starts and ends of genes deduced. However, no effort is made there to infer the exon-intron gene structure, alternative splicing, or protein-coding status.

DoTS as a resource for gene finding

DoTS offers a variant of the direct transcribed sequence alignment approach for genome-wide gene finding. We align DTs, instead of individual ESTs and mRNAs, against genomic sequences to identify genes. Since DTs explicitly model transcripts, our approach makes it straightforward to identify alternative splicing. Furthermore, the number of input sequences in a DT may be used to separate certain sequences such as singletons from non-singletons for different analyses. A singleton is a transcribed sequence that does not share sufficient similarity with any other transcribed sequences to form an assembly (i.e. non-singleton) with more than one input sequence. Although singletons may result from artifactual sequences, they may also be biologically relevant but rarely transcribed sequences.

In this paper, we describe the process we follow to build the DoTS transcriptome data resource for human and mouse, and the resulting transcript (DT) and gene (DG) components of DoTS. We discuss the validation of DTs by genomic sequence, and the validation of DGs using independent approaches. We also demonstrate the usefulness of DoTS with example applications.

Results and discussion

We have made five public releases of DoTS with release 6.0 being current as of this writing, and the basis for this report. Figure 1 shows an overview of the current DoTS build process. The details are described in the Materials and Methods section, but, in summary, we download ESTs and mRNAs from GenBank and pass the input sequences through a parallelized pipeline with over 120 stages (a few key steps are shown in Figure 1b). The pipeline cleans the input sequences by trimming low quality (>20% N's in a 20bp window) sequence and eliminating contaminants (e.g. vectors, repeats, E. coli, mitochondria, genomic), clusters the sequences by similarity, and uses CAP4 to assemble each cluster into one or more DTs. It then clusters DTs into DGs with two approaches. In one, it clusters DTs by a more stringent similarity to form similarity-based DoTS Genes (sDGs). In the other, it uses genomic alignments to cluster DTs into genome-based DoTS Genes (gDGs) and construct exon-intron gene structures. In addition, DTs are subjected to extensive automated annotation such as protein sequence prediction and GO prediction, as well as manual review by expert curators (Figure 1b, Mazzarelli J. et al., manuscript in preparation). The output of the DoTS build process consists of highly annotated DTs grouped into DGs (sDGs and gDGs).
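The low-quality trimming criterion mentioned above (>20% N's in a 20bp window) can be illustrated with a minimal sketch. The text does not specify how the pipeline applies the window, so the end-trimming strategy, the function name trim_low_quality_ends, and the handling of sequences shorter than one window are all assumptions made for illustration, not the DoTS implementation.

```python
def trim_low_quality_ends(seq, window=20, max_n_fraction=0.20):
    """Trim both ends of seq while the terminal window is too rich in N's.

    Assumption: trimming proceeds one base at a time from each end until
    the terminal `window` bases contain at most `max_n_fraction` N's.
    """
    def n_fraction(s):
        return s.upper().count("N") / len(s)

    start, end = 0, len(seq)
    # Advance the 5' boundary while the leading window exceeds the threshold.
    while end - start >= window and n_fraction(seq[start:start + window]) > max_n_fraction:
        start += 1
    # Retract the 3' boundary while the trailing window exceeds the threshold.
    while end - start >= window and n_fraction(seq[end - window:end]) > max_n_fraction:
        end -= 1
    # Assumption: anything shorter than one window cannot be assessed and is dropped.
    if end - start < window:
        return ""
    return seq[start:end]
```

A sequence with no ambiguous bases passes through unchanged, while a read consisting only of N's is discarded entirely.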

DoTS sequence content

Table 1 shows the compaction of the number of sequences achieved by DoTS. We start with roughly 4.6 million human and 3.2 million mouse transcribed sequences after discarding sequences of low quality or suspected contaminants. We cluster and assemble them into about 0.9 million human DTs, of which 0.26 million are non-singletons, and 0.6 million mouse DTs, of which 0.16 million are non-singletons. Although DTs are clustered into rather large numbers of sDGs and gDGs, we identify 48,994 human and 37,984 mouse high confidence DGs based on cross-validation between sDGs and gDGs and on splicing. We further predict 25,326 human and 22,024 mouse high confidence DGs to be protein-coding genes.

Genomic alignments of DTs and alignment quality classes

We have aligned DTs to their corresponding genomic sequences using BLAT [28], classified each alignment into one of four quality categories, and calculated an alignment score S (see Materials and Methods). Figure 1c shows a diagram of the quality categories: “very good” (1), “very good but with genomic gaps” (2), “good” (3), and “not so good” (4).

Table 2 shows the alignment statistics. Many non-singletons (>40%) can be uniquely mapped to the genome with very high confidence (quality 1, average alignments per DT ratio ≤1.1, and median S score ≥99.5). The statistics for all DTs are similar. Conversely, few non-singletons (0.1% of human and 3.6% of mouse) have quality 2 alignments. This is consistent with the fact that the human genome is essentially complete with few gaps [3] and the mouse genome sequence is in a draft form [6]. These alignments have relatively low S scores due to the presence of genomic gaps, but comparable percent identities with respect to quality 1 alignments. For the quality 3 alignments, there is an increase of alignments per DT ratio of up to 1.25, while percent identity and alignment score values remain similar to those for quality 1 alignments. The criteria for quality 3 alignments tolerate small sequence or assembly errors, and they enable the identification of paralogs and closely related family members in addition to the primary gene. In contrast to alignments in other categories, quality 4 alignments have significantly increased alignments per DT ratios and decreased alignment scores.

Instead of ignoring the many (>20%) DTs with only quality 4 alignment(s), we identify “top alignments”. An alignment is a “top alignment” if its S score is above a cutoff of 85 and is within 1% of the highest S value for all alignments by a given DT (similar to the approach used for Gene Bounds [11]). We include all top alignments since the best alignment is not always clearly identifiable. A DT with only quality 4 alignment(s) may indicate that: (a) some spurious sequence or a chimeric EST extends the end of a DT to >50 non-alignable base pairs (the most allowed for quality 3); (b) there is a mis-assembly or deletion in the genomic sequence; or (c) the DT sequence is not a transcribed sequence from that genome.
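The top-alignment rule can be written down as a short sketch. The cutoff of 85 and the 1% margin come from the text; interpreting "within 1%" as a relative margin (S at least 99% of the best S for that DT), as well as the function name and the (alignment_id, score) input format, are assumptions for illustration.

```python
def top_alignments(alignments, cutoff=85.0, tolerance=0.01):
    """Return the alignments of one DT that qualify as "top alignments".

    alignments: list of (alignment_id, s_score) tuples for a single DT.
    Assumption: "within 1%" means s_score >= (1 - tolerance) * best score.
    """
    if not alignments:
        return []
    best = max(score for _, score in alignments)
    return [
        (aln_id, score)
        for aln_id, score in alignments
        if score > cutoff and score >= best * (1.0 - tolerance)
    ]
```

Under this reading, a DT whose best alignment scores below the cutoff has no top alignments at all, and a DT with several near-equal high-scoring alignments keeps all of them.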

Some DTs (<5% non-singleton) do not have significant (>10% of the query sequence aligned at >90% identity) alignments. They may originate from the portions of the genome yet to be sequenced, or they may be artifacts. These DTs are not analyzed further.

Validation of DTs by genomic sequences

Assuming that the quality of the genomic sequence is good, a DT is validated by the genomic sequence if most of the bases of the DT align to it with high percent identity. As shown in Table 2, the genomic alignments validate DTs to the extent that 74% of human non-singletons have acceptable genomic alignments (quality 1-3). The median percent identity is 100 and the median alignment score 99.0, translating to an estimated average percent aligned metric of 99.8. The results for mouse are similar.

An alternative approach to assess the consistency of DTs using genomic sequences is to examine DTs aligned only to one chromosome and look at their partition among the four alignment quality classes. More DTs in the quality 1 class indicates better consistency. Examining top alignments per chromosome, we find that on average 78% of human non-singletons are of quality 1-3, with 53% of quality 1. This is consistent among all chromosomes (supplementary data).

Cross-validation of similarity-based and genome-based DoTS Genes

Ideally sDGs and gDGs, the DGs generated by two independent approaches, would be equivalent, but in reality this is not the case. The similarity-based approach suffers from problems such as chimeric sequences; thus, it may incorrectly cluster DTs representing different genes into the same sDG. The genome-based approach may be able to separate these DTs based on the genomic locations (both orientation and coordinates) of their alignments, but it may mistakenly merge DTs from different genes because they happen to share genomic proximity. Furthermore, unlike sDGs, a gDG may consist of non-overlapping DTs, and more than one gDG may share a DT (due to alignments to multiple locations in the genome, see below).

For cross-validation, we have mapped the gDGs to sDGs according to shared DTs and found that they are highly consistent with each other. The mapping is a complex many-to-many relationship for reasons stated above. Table 4 summarizes the mapping and assignment results achieved by following the procedure described in the Materials and Methods section. As expected, all of the 297,709 human gDGs share DTs with some sDGs. We are able to uniquely assign 86% of the gDGs to sDGs. This sDG-assigned subset of gDGs still contains all the shared DTs, which cover more than 78% of all sDGs constructed using non-singletons. Roughly 75% of gDGs with unique sDG assignments are one-to-one, meaning that they share DTs with only one sDG and no other gDG shares DTs with the same sDG. Checking examples from the 41,511 gDGs without unique sDG assignments, we find that they tend to align to multiple locations (up to a few hundred in some cases) in the genome (data not shown). These un-assignable gDGs likely represent paralogs, closely related gene family members, or repetitive sequences. Consistent with this, DTs in these gDGs correspond to only half the number of sDGs. The gDG-sDG mapping and assignment results for mouse are comparable to those for human.
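The notion of mapping gDGs to sDGs through shared DTs, and of a one-to-one relationship, can be made concrete with a small sketch. It covers only the mapping and the one-to-one test as defined in the text; the unique-assignment procedure from the Materials and Methods section is not reproduced here, and the function names and dictionary-of-sets input format are assumptions.

```python
from collections import defaultdict


def map_gdgs_to_sdgs(gdg_to_dts, sdg_to_dts):
    """Map each gDG to the set of sDGs with which it shares at least one DT."""
    dt_to_sdgs = defaultdict(set)
    for sdg, dts in sdg_to_dts.items():
        for dt in dts:
            dt_to_sdgs[dt].add(sdg)
    mapping = {}
    for gdg, dts in gdg_to_dts.items():
        shared = set()
        for dt in dts:
            shared |= dt_to_sdgs.get(dt, set())
        mapping[gdg] = shared
    return mapping


def one_to_one_pairs(mapping):
    """Return (gDG, sDG) pairs where each maps only to the other."""
    sdg_to_gdgs = defaultdict(set)
    for gdg, sdgs in mapping.items():
        for sdg in sdgs:
            sdg_to_gdgs[sdg].add(gdg)
    pairs = []
    for gdg, sdgs in mapping.items():
        if len(sdgs) == 1:
            (sdg,) = sdgs
            # One-to-one: no other gDG shares DTs with this sDG.
            if sdg_to_gdgs[sdg] == {gdg}:
                pairs.append((gdg, sdg))
    return sorted(pairs)
```

A gDG whose DTs fall into several sDGs, or an sDG whose DTs are split across several gDGs, shows up as a many-to-many entry in the mapping and is excluded from the one-to-one pairs.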

High confidence DoTS Genes

To identify DGs most likely representing valid genes, we filter gDGs by cross-validation with sDGs in combination with the evidence of splicing. It is more likely for spliced DGs to represent valid genes because artifactual sequences such as genomic contaminants or unprocessed mRNA would not have intron(s) when aligned to the genome. The presence of an apparent intron of at least 15bp is taken as evidence of splicing. This is supported by the presence of clear turning points around 15bp when we plot the frequencies versus length of short introns of gDGs on human and mouse chromosome 5 (supplementary data). We assume that “introns” less than 15bp are likely noise such as EST sequencing errors, small DoTS assembly inaccuracies, or genomic sequence errors.
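The splicing criterion can be sketched as a check on the gaps between consecutive aligned blocks. The 15bp threshold is taken from the text; the representation of an alignment as a list of (start, end) genomic blocks with half-open coordinates, and the function names, are assumptions for illustration.

```python
def intron_lengths(exon_blocks):
    """Return the genomic gap lengths between consecutive aligned blocks.

    exon_blocks: list of (start, end) genomic coordinates, half-open,
    assumed non-overlapping and on a single strand.
    """
    ordered = sorted(exon_blocks)
    return [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def is_spliced(exon_blocks, min_intron=15):
    """True if at least one apparent intron is min_intron bp or longer."""
    return any(length >= min_intron for length in intron_lengths(exon_blocks))
```

Gaps shorter than the threshold are treated as alignment noise rather than introns, so an alignment with only such gaps counts as unspliced.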

We have identified 48,994 human and 37,984 mouse high confidence DGs (Table 3) when we apply the combined filter with two adjustments. First, we exempt mRNA-containing gDGs from the splice requirement because mRNA sequences are generally of higher quality and their biological functions have often been verified. Second, sometimes a gene is identified by two gDGs, one on the forward strand and another on the reverse strand. This may be caused by EST tracking errors. When we observe significant exon overlap between two gDGs on opposite strands, we selectively deprecate one of them (usually the shorter one). These gDGs are not eliminated since we cannot always distinguish them from legitimate anti-sense transcripts [29].

We have relatively high confidence that these filtered gDGs (with corresponding sDGs) represent real genes or gene fragments. The larger number for human than for mouse may possibly result from several factors: more expressed human genes may have been sampled since there are significantly more human ESTs; there may be more gene fragments in human gDGs due to the higher number of short human ESTs (trimmed down from early low quality ESTs); and more paralogs or closely related family members may have been identified by human gDGs.

Without restriction, we have 297,709 gDGs in the human genome and 157,355 in mouse (Table 3). These are rather large numbers and likely include non-overlapping gene fragments, given that ESTs are partial sequences of genes [30], and artifacts due to genomic and other contamination in EST sequencing [31]. Although unspliced DTs may represent contaminants, eliminating them all would prevent the identification of single-exon genes or genes without transcribed sequences spanning an intron. We include them in gDG construction, and find that they tend to result in single exon gDGs, i.e. they remain in a distinct class of genes. Some of the single exon gDGs may be real genes since they are conserved between human and mouse by reciprocal best hit analysis (Table 5). Although singletons are also routinely ignored in previous studies, their inclusion here results in interesting observations as described below.

DoTS Genes correlate well with other gene annotations/predictions

Assessing the quality of genomic sequence annotation by DGs, we observe that the exon density distribution of high confidence DGs correlates well with that of Ensembl Genes and of RefGenes [11, 17]. This is true for both fully sequenced chromosomes such as human chromosomes 21 and 22 [32, 33] and the draft sequence of mouse chromosome 5 (shown to have many gaps [6]). We employ the approach described by Kapranov et al. [34] and used by Xuan et al. [35] to compare DGs with RefGenes, which are highly curated, and Ensembl Genes [4, 6, 10], which employ a highly conservative annotation methodology. Specifically, we pool, count, and plot exon bases within a 5.7Mb window along the chromosome at 57kb intervals. Figure 2 shows such plots of exon density for DGs, Ensembl Genes, and RefGenes on human chromosome 22. The results for human chromosome 21 and a 75Mb proximal region on mouse chromosome 5 are similar (data not shown). When comparing DGs with Ensembl Genes (and RefGenes) in a representative small genomic region in the UCSC Genome Browser [11], we find that the gene structures of DGs largely match Ensembl Genes and RefGenes (data not shown, and occasional differences discussed below).
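The exon density calculation (pooling exon bases, then counting them in a 5.7Mb window stepped at 57kb intervals) might look like the following sketch. The window and step sizes come from the text; the interval merging used to pool overlapping exons, the half-open coordinates, and the function names are assumptions, and the simple nested loop is chosen for readability rather than speed.

```python
def merge_intervals(intervals):
    """Pool overlapping or adjacent (start, end) intervals so bases count once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def exon_density(exons, chrom_length, window=5_700_000, step=57_000):
    """Return (window_start, exon_bases) for windows placed every `step` bp."""
    pooled = merge_intervals(exons)
    profile = []
    for win_start in range(0, chrom_length, step):
        win_end = min(win_start + window, chrom_length)
        # Sum the overlap between each pooled exon and the current window.
        covered = sum(
            max(0, min(end, win_end) - max(start, win_start))
            for start, end in pooled
        )
        profile.append((win_start, covered))
    return profile
```

Computing one such profile per annotation set (DGs, Ensembl Genes, RefGenes) on the same chromosome gives curves that can be plotted and compared directly.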

DoTS Gene confidence score

We assign confidence scores to gDGs (and cross-validated sDGs) in order to rank DGs by confidence levels. The confidence score determinants include the presence of splice signals (e.g. AG...GT) at exon-intron junctions, the presence of poly-adenylation signals (e.g. AATAAA) around 30bp upstream of 3’ ends, and expression information based on EST sources (e.g. number of libraries in which ESTs have been detected). The presence of 5’ and 3’ EST pairs from the same clone is taken as evidence that a gDG is unlikely to comprise purely artifactual sequences, given the required genomic proximity for the matched EST pair to be in the same gDG. In addition, we have identified 22,459 pairs of reciprocal best BLAST hits between human and mouse gDGs. We have more confidence that these gDGs are valid due to their cross-species sequence conservation. As demonstrated in Table 3, all of these attributes can be considered separately or in combination to categorize subsets of gDGs. An overall confidence score is calculated using a formula detailed in the Materials and Methods section that incorporates the attributes included in Table 3 as well as the number of DTs and number of exons in each gDG. Such a confidence score provides another means for us to partition DGs into subsets for further studies.
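The reciprocal best hit criterion used as one of the confidence attributes can be illustrated with a short sketch. The input format (query, subject, score triples, as might be parsed from tabular BLAST output), the use of a single score to rank hits, and the function names are assumptions; the sketch does not reproduce the confidence score formula itself.

```python
def best_hits(hits):
    """Return {query: best_subject}, keeping the highest-scoring hit per query.

    hits: iterable of (query, subject, score) triples. Ties keep the first hit seen.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)
```

For example, with human gDGs searched against mouse gDGs and vice versa, only pairs that select each other as top hit are retained.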

Applications of DoTS

DoTS is an integrated transcriptome data resource for human and mouse, cataloguing known and predicted transcripts and genes. With transcribed sequences at the core, it integrates other diverse types of data such as microarray expression data, genomic sequence, and many types of sequence annotation data (Figure 1b). The interface provided by the Allgenes website enables powerful and structured queries not easily done elsewhere. The applications of DoTS in this respect will be described in another paper. Examples of other applications are discussed next.

DoTS Genes identify protein-coding genes

One of the automated annotations (Mazzarelli J. et al., manuscript in preparation) of DTs predicts open reading frames (ORFs) using DIANA [36] and FrameFinder [37]. We find that 25,326 human and 22,024 mouse high confidence DGs are predicted to be protein-coding at a 95% confidence level (Table 3). This is achieved with a p-value calculated based on the length of the ORF using a Poisson distribution (V. Babenko, unpublished).

Among high confidence DGs, there are 26,051 human and 22,405 mouse DGs with at least three exons (Table 3). They are similar to predicted protein-coding subsets in terms of population sizes, DTs per DG ratios, and exons per DG ratios. They also have average and median exons per gene ratios strikingly similar to the statistics for 22,808 human and 22,011 mouse protein-coding genes reported in the recent paper on the mouse genome sequence [6].

Interestingly, 75% of the 3+ exon high confidence human DGs overlap 87% of the predicted protein-coding DGs. The result is similar for mouse. The predicted protein-coding high confidence DGs with 2 exons and (to a lesser extent) 1 exon may represent genes with ESTs from the 5’ and/or 3’ ends but lacking EST representation of other exons. Alternatively, they may be 2-exon and single exon genes predicted by DoTS but excluded by other more restrictive genome annotation pipelines [4, 5].

DoTS Genes predict many non-coding genes

In addition to protein-coding genes, DGs also predict many non-coding genes. DoTS consists of 48,994 human and 37,984 mouse high confidence DGs. This is considerably higher than the ~30,000 human genes initially predicted by the public and private human genome sequencing projects [4, 5], and the ~22,000 for mouse initially predicted by the mouse genome sequence consortium [6]. The predicted number of genes for human has been further revised down to 24,500 (of which 3,000 are likely pseudogenes), as the human genome sequence reaches its essentially finished state [3, 38]. However, only protein-coding genes are counted in these genome annotation efforts, while DGs include other genes such as non-coding RNA genes. In fact, as discussed before, only 25,326 human and 22,024 mouse high confidence DGs are predicted to code for proteins, while the others are likely non-coding genes (some of them may be UTRs of coding genes). Pseudogenes are an important class of genes [39]; however, few of them may be expressed. Since DGs represent expressed genes, we expect only a small number of pseudogenes in DoTS.

Without filtering, there are even more non-coding gDGs. Although many of them may represent contaminants, those conserved between human and mouse are more likely to be biologically meaningful. With an ORF prediction p-value of >0.5 to select for non-coding genes, we find that 2,722 pairs of human and mouse gDGs are reciprocal best BLAST hits in terms of cDNA. One example is the mouse DG.5302365 and human DG.35716765 (Figure 5). DG.5302365 contains one transcript DT.55125300, which in turn contains a 1,611bp mRNA (GenBank accession AK015294) and a 262bp EST (GenBank accession AV262900) from mouse adult testis. This DG has 6 exons, suggesting that it represents a non-coding gene rather than the UTR of a coding gene. It is located at mouse chr5:75517829-75528477. DG.35716765 contains one transcript DT.91714204, which in turn contains a 526bp EST (GenBank accession BE504451) from human lung carcinoid cells (dbEST library NCI_CGAP_Lu24). This DG is located at human chr4:56321913-56322439, which is toward the 3’ end of the human syntenic block for mouse chr5:75517829-75528477. Note that DG.35716765 is much shorter than DG.5302365 and has only 1 exon. It is likely an incomplete non-coding gene given that it is a 3’ EST, and the conservation is in the 3’ end of the multi-exon DG.5302365.

Although the number of protein-coding genes is surprisingly low for human and mouse, several studies have already indicated a much higher degree of transcription in the genomes. Kapranov et al., using arrays of oligonucleotide probes to human chromosomes 21 and 22 at 35bp resolution, found that as much as an order of magnitude more of the genomic sequence is transcribed than is accounted for by annotated exons [34]. In another study, the Snyder group found that twice as many bases of human chromosome 22 as had been previously reported are expressed as poly-adenylated RNA in placenta alone [40]. Using full length enriched cDNAs from many libraries, the RIKEN group has constructed the most complete mouse transcriptome data, which are annotated in the international FANTOM project [41-44]. As a result, 70,000 transcription units have been identified, and many do not encode proteins. Reproducible expression of a significant fraction of non-coding RNAs has been experimentally verified [43].

DoTS facilitates the study of singletons

A singleton is a DT comprised of only one EST or mRNA, which does not share sufficient similarity with any other ESTs or mRNAs to form an assembly with more than one input sequence. Singletons may include biologically relevant but rarely transcribed sequences, although many of them are probably artifacts given the low frequency of their detection.

Singletons have been routinely ignored in previous studies. However, we find that there are large quantities of them (0.67 million for human and 0.42 million for mouse) and that many are in introns of spliced DGs. For both human and mouse, ~56% of singletons have acceptable alignments to be included in gDGs, and about half of these singletons are constituents of gDGs that are spliced (Table 5). Furthermore, most spliced gDGs, 86% for human and 82% for mouse, contain singletons as constituent DTs. We find singletons in the introns of many spliced gDGs. This is consistent with the reported observation for human chromosome 22 that much transcriptional activity occurs in introns of other genes [40].

We also find that singletons, even if not part of other genes, align close to them on the genome and are thus associated with regions of active transcription. When plotting the distribution of all singletons on the genome, we observe good correlation of singleton density with the density of spliced gDGs (supplementary data). Some singletons with genomic alignments (~60% for human and 70% for mouse) associate by genomic proximity with other DTs to form gDGs with at least two transcripts, and most of these gDGs are spliced (Table 5). The other singletons are lone singletons and mostly unspliced; however, they also tend to be near other spliced gDGs. It is tempting to hypothesize, therefore, that some of these lone singletons indicate, although indirectly, the existence of nearby novel genes.

Do singletons represent rarely expressed novel exons, regulatory sequences, transcriptional noise, or other unknown biological phenomena? More studies are needed in this respect, but it is likely that some singletons represent real transcripts. One line of evidence is that 2,452 mouse singletons are mRNAs of lengths up to 8,740bp. A comparable number of singleton mRNAs is also observed in human. The majority of these singletons contain CDS. For example, the mouse epithelial ankyrin gene (Ank3) is represented by the singleton DT.99848204, which consists solely of the 6,552bp NM_170688. Ank3 has multiple isoforms [45] and the variant represented by NM_170688 has more restricted expression than others [46]. As another example, the human X102 protein is represented by the singleton DT.99959939 (NM_030879, 417bp), which may be a transcript restrictively expressed in the brains of individuals with neuropsychiatric disorders [47]. Based on CDS evidence, even the short (51bp) singleton mRNAs contained in DoTS appear to be real transcripts rather than artifacts (they are primarily immunoglobulin heavy chain V-D-J variable region partial CDSs). Additionally, for both human and mouse, many singletons (~28%) are predicted (p-value <0.05) to encode proteins (average length ~120). Another line of evidence is that a detectable, albeit low, percentage (6.7%) of mouse singletons (~99,000 with quality 1 alignments, amounting to ~50Mb of cDNA sequence) are conserved with human singletons based on BLAST reciprocal best hit analysis. This is above the level expected for a comparable amount of random (<<0.1%) or genomic (<1%) sequence (data not shown).

DoTS Genes predict many putative novel genes

DoTS may be used for the annotation of a genomic region of interest and may allow the identification of many novel genes. Of special interest to us is the identification and study of all genes in a roughly 75Mb region in the proximal portion of mouse chromosome 5, which was targeted in a region-specific mutagenesis study [48]. Furthermore, results from experimental studies guided by DoTS annotations in this region may provide important feedback for fine-tuning the DoTS build process. Before the draft mouse genome became available [6], we had used DoTS to integrate radiation hybrid map and fingerprint map data for the construction of a high-resolution BAC-based map of a 5Mb fragment of mChr5 [49]. In the same study, annotation of selected BACs of this region demonstrated that DoTS could allow identification of novel genes even in a historically well-characterized region.

As discussed above, DGs preserve the landscape of gene-richness in the genomes. On the other hand, the overall gene density predicted by DGs is elevated above that of RefGenes or Ensembl Genes. In mChr5, we use all gDGs, even if not cross-validated with sDGs, to maximize the chance of also identifying paralogs and closely related gene family members. We compare gDGs with 453 Celera and 434 Ensembl genes (longest transcripts) in this region using a conservative approach in which we consider gene boundaries instead of just exons. Of the 3,943 gDGs, 1,358 do not overlap any Celera or Ensembl genes on either strand, and 223 are high confidence DGs. About 25% of these 223 DGs are predicted (p-value <0.05) to code for proteins (average length 154), while 50% do not appear to have coding potential (p-value >0.5). This analysis provides sequences of putative novel genes (ranked by confidence scores) that will be experimentally validated in expression studies (RT-PCR, in situ hybridization, microarray).

Figure 3a shows the 72-74Mb segment of mChr5 in the UCSC Genome Browser. Note the examples of gDGs (top track) not covered by Ensembl Genes (second to last track). Two specific examples are shown in Figures 3b and 3c. One example (Figure 3b), DG.36043521, has multiple DTs that were manually reviewed and deemed correct by our curators. Several other gene predictors appear to predict various subsets of the exons predicted by this putative DG. The other example (Figure 3c), DG.36213308, contains only a singleton DT (RIKEN cDNA BB636598). This putative gene is not predicted by the gene predictors available from the Genome Browser. However, our curators judged it to be correct based on gene characteristics such as splice signals at intron-exon junctions. It is predicted to encode a protein of 155 amino acids at a p-value of 0.008.

DoTS Genes enrich gene models of known genes

To evaluate coverage of known genes by DGs, we compare RefSeq-containing gDGs on human chromosome 22 with the detailed gene models of UCSC RefGenes [11]. We find that 396 out of 450 (88%) RefGenes are represented by gDGs, i.e. they have “OK coverage” as defined in the Materials and Methods section. Of the OK coverage cases, 63% also have “good coverage” (i.e. without extensive exon extensions or additions, see Materials and Methods), while the rest extend the ends of RefGenes significantly. Among the good coverage cases, the median coverage is 100% of exon bases, the median end extension is 1.2% of exon bases (0.2% of genomic range), and the median internal extension (see Materials and Methods) is 40%. Although some of the internal extensions might be due to unprocessed RNAs, in many cases they arise because novel exons are predicted (see below). There are several reasons why some RefGenes lack OK coverage. First, there are slight differences in the BLAT alignment options used by the UCSC site. Second, there are also slight differences in the criteria used by the UCSC site to filter BLAT alignment results. Third, occasional inaccuracies in the cluster and assembly steps of the DoTS build may cause a RefSeq-containing DT to align to the genome with inferior quality and be eliminated by our relatively stringent filters.

Through manual inspection of the comparison results, we have observed that in many cases gDGs can significantly enrich the gene models of RefGenes by predicting extended UTRs, novel exons, and alternative transcription starts. As shown in Figure 4a, the 5’ UTR of the RefGene LZTR1 (NM_006767) has been extended by DG.35883194. Figure 4b shows an example where DG.36041378 predicts novel internal exons. Here four additional exons are predicted for RefGene COMT (NM_000754), in addition to the extension of the 3’ UTR. As discussed in Zhu et al. [20], genomic alignments of DTs (and TIGR TCs) predicted that the Dtna gene on chromosome 18 has alternative transcription starts, as suggested by published experimental results. In this comparison, we have also observed similar cases, and the example shown in Figure 4c (DG.36380767) suggests a putative alternative transcription start for RefGene TTLL1 (NM_012263).

Conclusions

DoTS Transcripts and DoTS Genes, extensively annotated and significantly curated, present a unique and integrated view of the millions of ESTs and mRNAs in the public domain.

DoTS is useful in several ways. First, it is a gene-centric, genome-mapped, and non-redundant catalogue of all known and predicted human and mouse genes and transcripts. The data for human and mouse are readily accessible through a powerful query interface at the Allgenes website. DoTS is a generic system that can be and has been applied to other species [1]. Second, DoTS may be integrated with other resources, as has been done with MGI [20], GeneCards [12], and the Ensembl Genome Browser [9] (and also the UCSC Genome Browser [11], as custom tracks). Third, DoTS allows us to provide biologists with large data sets of interest, such as protein-coding genes, genes with certain expression patterns (e.g. pancreas specific), or genes predicted to encode transmembrane proteins. Fourth, DoTS Genes predict a plethora of putative novel genes, ranked with confidence scores, as candidates for further laboratory studies. This includes large numbers of putative human and mouse non-coding genes, a class of genes whose biological prevalence, if not importance, is just starting to be recognized. Fifth, in many cases, DoTS Genes appear to enrich gene models of known genes by suggesting novel exons, alternative transcription starts, or extended UTRs. Finally, DoTS may enable data mining in novel directions, such as the study of rarely expressed genes and the investigation of frequent (e.g. alternative splicing) or infrequent (e.g. anti-sense transcription) gene structures.

Materials and Methods

Build DoTS Transcript indices

In the current DoTS release (6.0), DTs are created by an initial clustering of ESTs and mRNAs using a self-BLAST followed by assembly using CAP4 (Paracel). Sequences to be assembled are identified in the GUS relational database [50, 51] based on the following criteria. They must be from the correct taxon and be either in dbEST (in which case we consider the type to be EST even though some full-length cDNA sequences are present in dbEST), in GenBank with sequence type equal to mRNA, or in GenBank with sequence type equal to RNA and with an annotated CDS that has a simple location (single start and end). These sequences (we use the quality sequence if this is defined in dbEST) are then "cleaned" by detecting and removing vector sequences using cross_match from the phrap package [52] and the GenBank vector database, removing ribosomal and mitochondrial sequences, removing trailing polyA and leading polyT sequences, and removing low quality ends where the percentage of N's in a 20bp window exceeds 20%. Sequences shorter than 50bp following this process are marked as 'low_quality' and ignored. Sequences are then blocked for repeats using RepeatMasker (Smit, AFA & Green P., unpublished) and the relevant libraries of repeats depending on organism. Again, if fewer than 50bp of informative sequence remain, sequences are marked as 'repeat' and ignored.

The blocked sequences are clustered by running an all-against-all BLASTN matrix with parameters N=10 M=5 to limit extension of matches into low quality regions. The BLAST results are subjected to extensive Perl postprocessing to identify and remove from consideration repeats or domains that were not blocked by RepeatMasker. In the incremental update steps, these sequences are also compared to the existing DoTS consensus sequences in order to assign new sequences to existing assemblies. Clusters are formed by a connected components analysis of all the BLASTN matches with minimum cutoff values of 92% identity and 40bp length, and with the two ends matching in a way consistent with the sequences being assembled. Very large clusters (>10,000 members) are separated by increasing the cutoff thresholds to 95% identity and 50bp overlap, and then to 98% identity and 100bp overlap if necessary.

The clusters are assembled to form consensus sequences using the CAP4 algorithm. The CAP4 alignments are decomposed into constituent parts and stored in GUS. During incremental updates, a Perl module is used to build the complete assembly from the existing assembly and the new assembly (of new input sequences and the existing consensus sequences), avoiding expensive re-assembly with CAP4. This complete assembly is then used to calculate a new consensus and sequence alignment to update the database. The resulting consensus sequences are then blocked with RepeatMasker, clustered with BLASTN (95% identity, 75bp overlap) and incrementally assembled with CAP4 to complete a build cycle for DoTS. Assemblies are reverse complemented if assembly orientation is inconsistent with mRNA orientation and EST clone end assignment of contained sequences.

In the case of re-assembly, identifiers (DT.s) are maintained for the assemblies by tracking the source_ids (accessions) of the sequences contained in the new assembly as compared to the updated ones.
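The connected components clustering and the threshold escalation for very large clusters can be sketched as follows. This is a minimal illustration under stated assumptions: the match tuple layout and function names are hypothetical, the check that the two ends match consistently is omitted, and the actual DoTS build uses Perl postprocessing of BLASTN output rather than this code.

```python
from typing import Dict, List, Sequence, Tuple

# (sequence a, sequence b, percent identity, match length in bp); layout is an assumption
Match = Tuple[str, str, float, int]

# Cutoffs from the text: initial, then two stricter levels for very large clusters
THRESHOLDS = [(92.0, 40), (95.0, 50), (98.0, 100)]


def cluster_sequences(seq_ids: Sequence[str], matches: Sequence[Match],
                      min_identity: float, min_length: int) -> List[List[str]]:
    """Connected components over matches passing the cutoffs (union-find)."""
    parent: Dict[str, str] = {s: s for s in seq_ids}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, identity, length in matches:
        if a in parent and b in parent and identity >= min_identity and length >= min_length:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a
    clusters: Dict[str, List[str]] = {}
    for s in seq_ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(c) for c in clusters.values())


def cluster_with_escalation(seq_ids: Sequence[str], matches: Sequence[Match],
                            max_size: int = 10_000) -> List[List[str]]:
    """Re-cluster any cluster larger than max_size at the next stricter cutoff."""
    pending = [(list(seq_ids), 0)]
    done: List[List[str]] = []
    while pending:
        members, level = pending.pop()
        member_set = set(members)
        local = [m for m in matches if m[0] in member_set and m[1] in member_set]
        identity, length = THRESHOLDS[level]
        for cluster in cluster_sequences(members, local, identity, length):
            if len(cluster) > max_size and level + 1 < len(THRESHOLDS):
                pending.append((cluster, level + 1))
            else:
                done.append(cluster)
    return sorted(done)


# Hypothetical sequences and matches, for illustration only
seqs = ["a", "b", "c", "d"]
hits = [("a", "b", 96.0, 60), ("b", "c", 93.0, 45), ("c", "d", 90.0, 200)]
initial = cluster_sequences(seqs, hits, 92.0, 40)
split = cluster_with_escalation(seqs, hits, max_size=2)
```

With the initial cutoffs, a, b and c form one cluster because clustering is transitive; lowering `max_size` to 2 in the toy example forces that cluster to be re-clustered at 95% identity and 50bp, which separates c.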

Generate similarity-based DoTS Genes

The final DoTS consensus sequences are blocked with RepeatMasker, and BLASTN is run using cutoffs of 97% identity over at least 150bp. sDGs are generated using a graph algorithm that avoids joining large clusters connected by a single edge (or a small number of edges), which happens due to noise in the assemblies caused by artifacts such as chimeric input sequences.

Genomic alignment and alignment quality classification

Human and mouse genomic sequences and BLAT [28] software are downloaded from UCSC [11]. For the results described in this paper, the Golden Path April 2003 release of the (essentially complete) human genome sequence and the February 2003 release of the mouse draft sequence were used. DTs are exported from GUS. BLAT alignment of human and mouse DTs against the respective genome is then performed on a compute cluster with 128 dual-processor nodes. The default settings of BLAT are used, except that the “-mask=lower” option is turned on so that blocks of alignment cannot be initiated in repeat-masked regions of the genome (however, alignments initiated elsewhere are allowed to extend into repeat-masked regions). Alignments over at least 10% of query sequence length with at least 90% identity are loaded into GUS.

While being loaded, each BLAT alignment is assigned one of four quality classes: “very good” (1), “very good but with genome gaps” (2), “good” (3), and “not so good” (4). An alignment of quality 1 is one in which almost all of the RNA matches the genome very well, with only small continuous mismatches. The criteria are: i) percent identity >= 95; ii) internal mismatch <= 5bp; iii) length of each end mismatch <= 10bp unless it is a polyA tail; and iv) only one end may be a polyA tail. Alignments of quality 2 are alignments with large internal/end mismatches that might be caused by the presence of genomic sequence gaps nearby. Quality 3 alignments relax the criteria to allow internal continuous mismatches of up to 15bp and end mismatches of up to 50bp. An alignment of quality 3 may indicate sequence errors in the RNA and/or the genomic sequence. For possible future updates to alignment quality classification, refer to the page “blatAlignExplain.html” on the Allgenes website [2]. We also define a simple and convenient alignment score to quantify BLAT alignment quality: S = (I × A)^(1/2), where I is the percent identity and A is the percent of the query sequence aligned.
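The alignment score and the quality classes can be sketched in code. The score follows the definition given in the text. The classifier is only an approximation: the field names are hypothetical, and the sketch assumes that quality 3 keeps the 95% identity and polyA rules of quality 1, and that quality 2 is assigned when a high-identity alignment fails the mismatch limits but lies near a genomic sequence gap. Neither assumption is stated in the text.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BlatAlignment:
    percent_identity: float              # I, in percent
    percent_aligned: float               # A, percent of the query aligned
    max_internal_mismatch: int           # longest continuous internal mismatch (bp)
    end_mismatches: Tuple[int, int]      # unaligned length at each end (bp)
    polya_ends: Tuple[bool, bool]        # whether each end mismatch is a polyA tail
    near_genome_gap: bool = False        # hypothetical flag for a nearby genomic gap


def alignment_score(aln: BlatAlignment) -> float:
    """S = sqrt(I * A), with I and A both expressed in percent."""
    return math.sqrt(aln.percent_identity * aln.percent_aligned)


def _ends_ok(aln: BlatAlignment, max_end: int) -> bool:
    """At most one polyA end; every non-polyA end mismatch within max_end."""
    if sum(aln.polya_ends) > 1:
        return False
    return all(is_polya or length <= max_end
               for length, is_polya in zip(aln.end_mismatches, aln.polya_ends))


def quality_class(aln: BlatAlignment) -> int:
    if aln.percent_identity >= 95:
        if aln.max_internal_mismatch <= 5 and _ends_ok(aln, 10):
            return 1
        if aln.max_internal_mismatch <= 15 and _ends_ok(aln, 50):
            return 3  # assumption: identity and polyA rules carried over from class 1
        if aln.near_genome_gap:
            return 2  # assumption: large mismatches explained by a nearby gap
    return 4


# Hypothetical alignments, for illustration only
q1 = BlatAlignment(99.5, 99.0, 2, (3, 40), (False, True))
q3 = BlatAlignment(98.0, 95.0, 12, (30, 0), (False, False))
q2 = BlatAlignment(99.0, 60.0, 400, (0, 0), (False, False), near_genome_gap=True)
q4 = BlatAlignment(92.0, 50.0, 0, (0, 0), (False, False))
```

For example, an alignment with 100% identity over 81% of the query has S = 90, so S penalizes partial alignments even when the aligned part is perfect.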

Generate genome-based DoTS Genes

First, we select BLAT alignments of quality 1-3 and “top alignments” of quality 4. An alignment is a “top alignment” with respect to the whole genome if its S score is above 85 and within 1% of the highest S value for a given DT (similar to the approach used for Gene Bounds [11]). We include all top alignments since the best alignment is not always clearly identifiable.
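The selection rule can be sketched as follows. The record layout and names are hypothetical, and "within 1% of the highest S value" is interpreted here as a relative tolerance (S >= 0.99 × best S for that DT); an absolute difference of one score point would be an equally plausible reading.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Aln(NamedTuple):
    dt: str         # DoTS Transcript identifier
    quality: int    # alignment quality class, 1-4
    score: float    # alignment score S


def select_alignments(alns: List[Aln], min_score: float = 85.0,
                      tolerance: float = 0.01) -> List[Aln]:
    """Keep all quality 1-3 alignments plus the top alignments of quality 4."""
    best: Dict[str, float] = defaultdict(float)
    for a in alns:
        best[a.dt] = max(best[a.dt], a.score)
    selected = []
    for a in alns:
        # "top alignment": S above min_score and close to the best S for this DT
        is_top = a.score > min_score and a.score >= (1.0 - tolerance) * best[a.dt]
        if a.quality in (1, 2, 3) or is_top:
            selected.append(a)
    return selected


# Hypothetical alignments, for illustration only
alns = [
    Aln("DT.x", 4, 90.0), Aln("DT.x", 4, 89.5), Aln("DT.x", 4, 86.0),
    Aln("DT.y", 1, 99.0), Aln("DT.y", 4, 80.0),
]
kept = select_alignments(alns)
```

In the example, two of the three quality 4 alignments of DT.x are retained because their scores are nearly tied, which reflects the point that the best alignment is not always clearly identifiable.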

We order selected alignments along the same strands of the genome by their start and end coordinates. Then we transitively merge alignments into consensus gDGs if adjacent alignments have “exon” (alignment block) overlap, and we refer to this step as “merge-by-overlap”. We also carry out the merge in the absence of “exon” overlap if two adjacent alignments are less than a certain distance apart and there are ESTs in both DTs from the same clone, and we call this “merge-by-clone”. We use a default merge-by-clone distance parameter of 500kb because we rarely see genes with genomic sizes >500kb when we examine UCSC RefGenes on human chromosomes 21 and 22. In addition, a merge is performed if the end of one alignment is less than a certain number of bases from the start of the other; this is “merge-by-proximity”. We set the maximum merge-by-proximity distance parameter at 75bp. Although we rarely see genes <220bp apart on the same strand when we examine UCSC RefGenes on human chromosomes 21 and 22, we conservatively choose 75bp to account for the fact that RefGenes might not represent all expressed genes.
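The three merge rules can be sketched as a single pass over alignments sorted along one strand. This is a simplified illustration, not the DoTS implementation: the class and function names are hypothetical, each alignment is compared only with the immediately preceding one in sort order, and the proximity rule is assumed to apply only to non-overlapping genomic ranges (gap >= 0).

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Alignment:
    dt: str
    blocks: List[Tuple[int, int]]                  # "exon" blocks, half-open [start, end)
    clones: Set[str] = field(default_factory=set)  # clone ids of the ESTs in this DT

    @property
    def start(self) -> int:
        return min(s for s, _ in self.blocks)

    @property
    def end(self) -> int:
        return max(e for _, e in self.blocks)


def blocks_overlap(a: List[Tuple[int, int]], b: List[Tuple[int, int]]) -> bool:
    """True if any block of a shares at least one base with any block of b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def build_gdgs(alignments: List[Alignment], clone_dist: int = 500_000,
               proximity: int = 75) -> List[List[str]]:
    """Group same-strand alignments into gDGs; returns DT ids per gDG."""
    groups: List[List[Alignment]] = []
    for aln in sorted(alignments, key=lambda a: (a.start, a.end)):
        if groups:
            prev = groups[-1][-1]
            gap = aln.start - prev.end
            by_overlap = blocks_overlap(prev.blocks, aln.blocks)
            by_clone = gap < clone_dist and bool(prev.clones & aln.clones)
            by_proximity = 0 <= gap < proximity
            if by_overlap or by_clone or by_proximity:
                groups[-1].append(aln)  # merging is transitive along the sort order
                continue
        groups.append([aln])
    return [[a.dt for a in g] for g in groups]


# Hypothetical alignments on one strand, for illustration only
example = [
    Alignment("DT.a", [(100, 200), (500, 600)], {"c1"}),
    Alignment("DT.b", [(550, 700)]),                 # exon overlap with DT.a
    Alignment("DT.c", [(750, 800)]),                 # 50bp after DT.b
    Alignment("DT.d", [(5000, 5100)], {"c9"}),
    Alignment("DT.e", [(300000, 300200)], {"c9"}),   # same clone as DT.d, <500kb away
    Alignment("DT.f", [(900000, 900100)]),
]
gdgs = build_gdgs(example)
```

DT.a, DT.b and DT.c end up in one gDG even though DT.a and DT.c neither overlap nor share a clone, which is what transitive merging implies.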

Mapping genome-based DoTS Gene Models to similarity-based DoTS Genes

gDGs are sorted first by whether they are spliced, and then by gene size based on total exon length. The resulting gDG list is ordered with spliced members first in descending size, followed by unspliced members in descending size. Similarly, sDGs are sorted first by whether they contain mRNA and then by size based on the number of contained ESTs/mRNAs. The resulting sDG list is ordered with mRNA-containing (mRNA+) members in descending size, followed by mRNA-lacking (mRNA-) members in descending size. Initially, the many-to-many relationship between gDGs and sDGs via shared DTs is established without regard to list positions. Subsequently, each gDG in descending order is assigned uniquely to the highest ordered sDG with which it shares one or more DTs, and the assigned gDG and sDG are removed from their respective lists. If a gDG and an sDG, in the original lists, reciprocally share DT(s) only with each other, the assignment is called 1:1. A gDG will remain unassigned to an sDG if all the sDGs with which it shares DTs have already been assigned to other gDGs.
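The greedy assignment described above can be sketched as follows. The record types, field names and example values are hypothetical; the sketch only illustrates the ordering and the 1:1 test, and is quadratic in the number of DGs.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Optional, Tuple


class GDG(NamedTuple):
    id: str
    spliced: bool
    exon_length: int        # total exon length, used as gene size
    dts: FrozenSet[str]


class SDG(NamedTuple):
    id: str
    has_mrna: bool
    n_inputs: int           # number of contained ESTs/mRNAs
    dts: FrozenSet[str]


def assign(gdgs: List[GDG], sdgs: List[SDG]) -> Dict[str, Tuple[Optional[str], bool]]:
    """Map each gDG id to (assigned sDG id or None, whether the assignment is 1:1)."""
    g_order = sorted(gdgs, key=lambda g: (not g.spliced, -g.exon_length))
    s_order = sorted(sdgs, key=lambda s: (not s.has_mrna, -s.n_inputs))
    taken = set()
    result: Dict[str, Tuple[Optional[str], bool]] = {}
    for g in g_order:
        result[g.id] = (None, False)
        for s in s_order:
            if s.id in taken or not (g.dts & s.dts):
                continue
            taken.add(s.id)
            # 1:1 is judged on the original lists, not on what is still unassigned
            g_partners = [x for x in sdgs if x.dts & g.dts]
            s_partners = [x for x in gdgs if x.dts & s.dts]
            result[g.id] = (s.id, len(g_partners) == 1 and len(s_partners) == 1)
            break
    return result


# Hypothetical DGs, for illustration only
gdgs = [
    GDG("G1", True, 3000, frozenset({"T1", "T2"})),
    GDG("G2", True, 1000, frozenset({"T3"})),
    GDG("G3", False, 500, frozenset({"T4"})),
]
sdgs = [
    SDG("S1", True, 40, frozenset({"T1", "T3"})),
    SDG("S2", False, 5, frozenset({"T2"})),
    SDG("S3", True, 2, frozenset({"T4"})),
]
mapping = assign(gdgs, sdgs)
```

In the example, G2 remains unassigned because the only sDG it shares a DT with (S1) has already been taken by the larger G1.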

High confidence DoTS Genes

We define high confidence DGs to be gDGs that are spliced (with an intron of at least 15bp) or contain an mRNA, are cross-validated with sDGs, and are not deprecated. See “Cross-validation of similarity-based and genome-based DoTS Genes” under the Results section.

Confidence scoring

We also use a heuristic scheme to score our gDGs based on the following criteria: 1) the number of exons and the presence of splice signals (e.g. AG..GT) at the intron/exon junctions on the genome; 2) the composition of input sequences (e.g. whether an mRNA is present, how many DoTS Transcripts contribute to the gene model); 3) expression evidence in terms of ESTs (e.g. how many EST libraries and EST clones the ESTs originate from, how many 5’-3’ pairs of ESTs there are); 4) whether a polyA signal or polyA track is present; 5) whether a human-mouse BLAST reciprocal best hit exists; 6) the distribution of 5’ ends of 5’ ESTs and 3’ ends of 3’ ESTs along the genome for a given DoTS Gene Model. Specifically, the score for a gDG is calculated as follows: +1 if it is spliced, +1 if it has 3 or more exons, +1 if it has at least one splice signal (e.g. AG..GT), +1 if it has at least two splice signals, +3 if any constituent DT contains an mRNA, +1 if it has at least two constituent DTs, +1 if it has ESTs from at least two libraries, +1 if it has ESTs from at least two clones, +1 if it has at least one pair of 5’-3’ ESTs from the same clone, +2 if it has a canonical polyA signal (5’ AATAAA 3’) or +1 if it has an alternative polyA signal (5’ ATTAAA 3’), -1 if there is a downstream polyA track or upstream polyT track, and +2 if it has a human-mouse reciprocal BLAST best hit. Furthermore, the 5’ ends of 5’ ESTs and the 3’ ends of 3’ ESTs in the gDGs are plotted. If the plot has the expected shape of a 3’ end peak at the 3’ end and a 5’ end peak at the 5’ end (or several peaks scattered around the 5’ end) of the gene, 1 is added to the confidence score.
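The additive scoring rules translate directly into a small function. The evidence record and its field names are hypothetical; in particular, the shape of the EST end distribution is reduced here to a single boolean that is assumed to have been determined elsewhere.

```python
from dataclasses import dataclass


@dataclass
class GdgEvidence:
    spliced: bool
    n_exons: int
    n_splice_signals: int
    has_mrna: bool
    n_dts: int
    n_est_libraries: int
    n_est_clones: int
    n_5p3p_pairs: int                 # 5'-3' EST pairs from the same clone
    polya_signal: str                 # "canonical", "alternative" or "none"
    polya_or_polyt_track: bool        # downstream polyA or upstream polyT track
    reciprocal_best_hit: bool         # human-mouse reciprocal BLAST best hit
    expected_end_distribution: bool   # EST end plot has the expected shape


def confidence_score(e: GdgEvidence) -> int:
    score = 0
    score += 1 if e.spliced else 0
    score += 1 if e.n_exons >= 3 else 0
    score += 1 if e.n_splice_signals >= 1 else 0
    score += 1 if e.n_splice_signals >= 2 else 0
    score += 3 if e.has_mrna else 0
    score += 1 if e.n_dts >= 2 else 0
    score += 1 if e.n_est_libraries >= 2 else 0
    score += 1 if e.n_est_clones >= 2 else 0
    score += 1 if e.n_5p3p_pairs >= 1 else 0
    score += {"canonical": 2, "alternative": 1}.get(e.polya_signal, 0)
    score -= 1 if e.polya_or_polyt_track else 0
    score += 2 if e.reciprocal_best_hit else 0
    score += 1 if e.expected_end_distribution else 0
    return score


# Hypothetical gDGs, for illustration only
strong = GdgEvidence(True, 4, 2, True, 3, 2, 3, 0, "alternative", False, True, False)
weak = GdgEvidence(False, 1, 0, False, 1, 1, 1, 0, "none", False, False, False)
```

Under these rules the maximum attainable score is 16, and a lone unspliced singleton with no supporting signals scores 0.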

DoTS Gene vs RefGene and Ensembl exon density comparison

As described by Kapranov et al. [34] and Xuan et al. [35], exon bases within a 5.7Mb window at 57kb intervals are pooled, counted, and plotted along the chromosome.
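A sliding-window exon density profile of this kind can be sketched as follows. The function names are hypothetical, and two details are assumptions: each window is taken to start at its interval position rather than being centered on it, and overlapping exons are merged so that each genomic base is counted once.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genomic coordinates


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Union of possibly overlapping intervals, sorted by start."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def exon_density(exons: List[Interval], chrom_length: int,
                 window: int = 5_700_000, step: int = 57_000) -> List[Tuple[int, int]]:
    """Return (window start, pooled exon bases in window) at each step."""
    pooled = merge_intervals(exons)
    profile = []
    for w_start in range(0, chrom_length, step):
        w_end = w_start + window
        bases = sum(max(0, min(e, w_end) - max(s, w_start)) for s, e in pooled)
        profile.append((w_start, bases))
    return profile


# Toy coordinates with a small window, for illustration only
profile = exon_density([(10, 20), (15, 30), (100, 110)], chrom_length=120,
                       window=50, step=25)
```

The loop rescans all exons for every window, which is adequate for a sketch; a production version would more likely use prefix sums or a sweep.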

DoTS Gene vs RefGene gene model comparison

RefGenes are downloaded from UCSC, and their exon coordinates are compared to all gDGs whose genomic ranges overlap those of the RefGenes, regardless of orientation. A best overlapping gDG is chosen to assess the degree of overlap in terms of exonic or genomic (i.e. both intronic and exonic) sequences. We define “OK” and “good” coverage of a RefGene by a gDG to indicate how closely RefGenes are matched by DGs. An “OK” coverage covers at least 85% (90%) of the exon bases (genomic range) of a RefGene, while a “good” coverage also requires that neither the 5’ end nor the 3’ end of a RefGene is extended beyond 50% (100%) of its exon bases (genomic size). We also define internal extension as the percentage of extra exon bases (w.r.t. the RefGene) in a gDG within the genomic range of the RefGene.
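The coverage measures can be sketched as follows. Several points are assumptions rather than statements from the text: the exon-base and genomic-range thresholds are both required (the text could also be read as either one sufficing), ends are treated as left and right rather than 5’ and 3’, exons within one gene model are assumed not to overlap each other, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genomic coordinates


def total(intervals: List[Interval]) -> int:
    return sum(e - s for s, e in intervals)


def shared(a: List[Interval], b: List[Interval]) -> int:
    """Bases common to two sets of internally non-overlapping intervals."""
    return sum(max(0, min(e1, e2) - max(s1, s2)) for s1, e1 in a for s2, e2 in b)


def clip(intervals: List[Interval], lo: int, hi: int) -> List[Interval]:
    """Parts of the intervals that fall inside [lo, hi)."""
    return [(max(s, lo), min(e, hi)) for s, e in intervals if min(e, hi) > max(s, lo)]


def compare(ref_exons: List[Interval], gdg_exons: List[Interval]) -> Dict[str, object]:
    r_lo, r_hi = min(s for s, _ in ref_exons), max(e for _, e in ref_exons)
    g_lo, g_hi = min(s for s, _ in gdg_exons), max(e for _, e in gdg_exons)
    ref_bases, ref_span = total(ref_exons), r_hi - r_lo

    exon_cov = 100.0 * shared(ref_exons, gdg_exons) / ref_bases
    span_cov = 100.0 * max(0, min(r_hi, g_hi) - max(r_lo, g_lo)) / ref_span
    # End extensions: gDG exon bases (or genomic bases) outside the RefGene range
    end_exon_ext = [100.0 * total(clip(gdg_exons, g_lo, r_lo)) / ref_bases,
                    100.0 * total(clip(gdg_exons, r_hi, g_hi)) / ref_bases]
    end_span_ext = [100.0 * max(0, r_lo - g_lo) / ref_span,
                    100.0 * max(0, g_hi - r_hi) / ref_span]
    # Internal extension: extra gDG exon bases inside the RefGene genomic range
    inside = clip(gdg_exons, r_lo, r_hi)
    internal_ext = 100.0 * (total(inside) - shared(ref_exons, inside)) / ref_bases

    ok = exon_cov >= 85.0 and span_cov >= 90.0  # assumption: both thresholds required
    good = ok and max(end_exon_ext) <= 50.0 and max(end_span_ext) <= 100.0
    return {"exon_cov": exon_cov, "span_cov": span_cov, "end_exon_ext": end_exon_ext,
            "internal_ext": internal_ext, "ok": ok, "good": good}


# Toy gene models, for illustration only
ref = [(1000, 1100), (2000, 2200), (3000, 3100)]
gdg = [(950, 1100), (1500, 1560), (2000, 2200), (3000, 3150)]
report = compare(ref, gdg)
```

In the toy example the gDG covers every RefGene exon base, extends each end by 50bp (12.5% of exon bases) and adds one 60bp internal exon, giving an internal extension of 15%.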

Figure Legends

Figure 1 – Overview of DoTS build process

Figure 1a. An overview of the DoTS build process and the major DoTS entities. ESTs and RNAs are downloaded from GenBank and fed to the DoTS build process for cleanup (e.g. low quality sequence trimming, repeat and contaminant removal), clustering, assembly, annotation, and genomic alignment. The output is an index of annotated DoTS Transcripts (DTs) and DoTS Genes (DGs). DGs are groups of DTs representing transcripts from the same genes and brought together via BLAST similarities (sDGs) or genomic alignments (gDGs).

Figure 1b. The major workflow of the DoTS build process. After download and pre-processing (see 1a), ESTs and mRNAs are combined with DTs from prior builds, if any. The combined sequence set is repeat-masked before being used to build all-against-all BLAST similarity matrices. The matrices are used to cluster sequences sharing sufficient similarities (>92% identity over 40bp, and with consistent ends). CAP4 is used to generate one or more consensus sequences (DTs) from each cluster. DTs are clustered into sDGs using similarities (with more stringent criteria), and into gDGs using genomic alignments. DTs are subjected to a series of automated annotations such as protein sequence prediction, GO function prediction, and gene trap line association, and many DTs are also being manually reviewed.

Figure 1c. Genomic alignment quality classes. In quality 1 alignments, all (or nearly all) bases of DTs align to the genome with high percent identity. For quality 2, there may be large numbers of continuous bases in the DT that do not align, but they might be explained by the presence of sufficiently large gaps in the genomic sequence. Quality 3 tolerates continuous internal mismatches of up to 15bp and end mismatches of up to 50bp in the DT to account for genomic and EST sequence errors and assembly inaccuracies. The quality 4 class includes any alignment of at least 90% identity over at least 10% of the DT that cannot be classified as quality 1, 2 or 3.

Figure 2 - Genome locations of DoTS Genes correlate well with other gene annotations/predictions

This figure shows the distributions of exon density of DGs, Ensembl genes, and RefGenes along human chromosome 22. Exon bases within a 5.7Mb window are pooled, counted, and plotted along the chromosome at 57kb intervals. As seen here, the exon density distribution of DGs correlates well with that of Ensembl genes and that of RefGenes. Similar results are obtained for other finished chromosomes such as human chromosome 21 and for draft chromosomes such as mouse chromosome 5. Detailed comparison in representative small regions, such as the well-studied DiGeorge Critical Region, confirms the observed correlation.

Figure 3 - DoTS Genes predict many putative novel genes in the mouse chromosome 5 proximal region

Figure 3a. When compared to Ensembl and Celera gene predictions in a ~75Mb proximal region on mouse chromosome 5, DGs predict many additional putative genes. This figure shows the 72-74Mb segment of the region (UCSC February 2003 freeze). Note the examples of DGs (the top track, highlighted in red boxes) not covered by Ensembl Genes (second to last track).

Figure 3b. An example of a putative novel gene predicted by a DG (top) in this region at chr5:72574660-72654736 (DG.36043521, with multiple DTs that have been manually reviewed). Note that various other gene predictors predict some but not all of the exons predicted by this putative DG.

Figure 3c. Another example of a putative novel gene predicted by a DG (top) in this region at chr5:73842944-73858830 (DG.36213308, with only a singleton DT containing RIKEN cDNA BB636598). This gene has been manually reviewed and found to have gene characteristics such as splice signals at exon-intron junctions. This putative gene is not predicted by any other gene predictors available from the UCSC Genome Browser.

Figure 4 - DoTS Genes enrich gene models of known genes

Figure 4a. An example of a DG extending the 5’ UTR of RefGene LZTR1 (NM_006767). Region shown: UCSC April 2003 freeze, chr22:19660500-19678000.

Figure 4b. An example of a putative alternative transcription start of TTLL1 (NM_012263). For an example of DoTS identifying an alternative transcription start with supporting experimental evidence, see [20]. Region shown: UCSC April 2003 freeze, chr22:41664000-41741300.

Figure 4c. An example of a DG adding several novel internal exons to RefGene COMT (NM_000754), in addition to extending the 3’ UTR. Region shown: UCSC April 2003 freeze, chr22:18301500-18333600.

Figure 5 - DoTS Genes predict non-coding genes

This is an example of a pair of conserved non-coding human and mouse DGs as revealed by BLAST reciprocal best hit analysis. DG.5302365 (middle) aligns to mouse chr5:75517829-75528477 and DG.35716765 (bottom) aligns to human chr4:56321913-56322439, which is within a human-mouse synteny block (top, from UCSC Genome Browser). Note that the coordinates for the rightmost “human cons” block are chr4:56321290-56322379. DG.35716765 likely represents an incomplete gene since it contains only a 3’ EST (see text).

Tables

Table 1 – Summary of DoTS sequence content

                                             Human        Mouse
TSs: input transcribed sequences
  • total                                5,452,944    3,883,616
  • filtered (1)                         4,631,703    3,181,217
  • mRNA                                   111,768       63,432
  • EST                                  4,519,935    3,117,785
DTs: clustered & assembled TSs
  • total                                  931,935      586,593
  • non-singleton (DT w/ >1 input TS)      257,532      163,379
DGs: clustered DTs
  • total similarity-based (sDGs)          808,565      518,976
  • sDGs constructed w/ non-singletons     134,162       95,762
  • total genome-based (gDGs)              297,709      157,355
  • high confidence (2)                     48,994       37,984
  • predicted protein-coding                25,326       22,024

This table summarizes the statistics for the major sequence entities of DoTS.

(1) TS filtering includes low quality sequence trimming and contaminant removal.

(2) High confidence DGs are gDGs that are spliced and cross-validated with sDGs, with two adjustments described in the text.

Table 2a – Statistics for genomic alignments of human DoTS Transcripts

quality   %DTs (*)      alignments     %identity      %identity     alignment score #   alignment score #
                        per DT (*)     average (*)    median (*)    average (*)         median (*)
1         37.8 (47.9)   1.13 (1.10)    99.4 (99.6)    100 (100)     98.8 (99.2)         99.3 (99.5)
2         0.06 (0.1)    1.11 (1.12)    98.9 (99.2)    100 (100)     63.3 (62.1)         62.6 (61.2)
3         23.1 (26.6)   1.30 (1.25)    98.7 (99.1)    99 (100)      93.5 (95.9)         95.4 (97.2)
4         29.0 (32.3)   2.91 (3.68)    95.9 (95.6)    96 (95)       71.6 (72.0)         76.3 (77.8)
1-3       60.2 (73.8)   1.21 (1.17)    99.1 (99.4)    100 (100)     96.6 (97.9)         98.4 (99.0)
1-4       82.0 (95.5)   1.92 (2.14)    97.4 (97.2)    99 (99)       83.2 (82.8)         91.1 (91.6)

Statistics summarizing the genomic alignments of 931,935 human DTs (270,040 of which are non-singletons or contain mRNA) in various alignment quality categories. Numbers in parentheses (*) are for non-singletons and singleton mRNAs. Alignment score (#) is defined as the square root of percent identity times percent query aligned.

Table 2b – Statistics for genomic alignments of mouse DoTS Transcripts

quality   %DTs (*)      alignments     %identity      %identity     alignment score #   alignment score #
                        per DT (*)     average (*)    median (*)    average (*)         median (*)
1         25.9 (42.9)   1.04 (1.04)    99.5 (99.6)    100 (100)     99.1 (99.3)         99.6 (99.7)
2         2.2 (3.6)     1.73 (1.93)    98.1 (98.3)    98 (99)       65.8 (64.3)         66.4 (62.1)
3         17.6 (25.8)   1.27 (1.18)    98.2 (99.0)    99 (100)      92.7 (95.8)         94.3 (97.1)
4         47.3 (36.5)   2.45 (3.65)    95.1 (95.6)    95 (95)       72.0 (69.1)         76.6 (73.5)
1-3       44.9 (70.8)   1.18 (1.15)    98.8 (99.3)    100 (100)     94.0 (95.0)         98.1 (98.8)
1-4       86.3 (97.2)   1.96 (2.21)    96.3 (97.0)    96 (98)       78.9 (79.0)         82.8 (85.9)

Same as Table 2a, except that this table is for the genomic alignments of 586,593 mouse DoTS Transcripts (165,831 of which are non-singletons or contain mRNA).

Table 3 – Statistics for genome-based DoTS Genes (gDGs or Gs)

                            human                                 mouse
                            count     DTs/G *      exons/G *      count     DTs/G *     exons/G *
All                         297,709   4.0 (1.0)    2.0 (1.0)      157,355   2.7 (1.0)   2.5 (1.0)
spliced (1)                 58,038    6.4 (2.0)    5.9 (3.0)      44,839    5.1 (2.0)   6.0 (3.0)
longest ORF >= 120          41,748    8.2 (3.0)    6.5 (3.0)      37,865    6.0 (3.0)   6.1 (3.0)
high confidence (HC) (2)    48,994    7.3 (3.0)    6.0 (3.0)      37,984    5.6 (3.0)   6.2 (3.0)
HC, w/ 3+ exons             26,051    11.1 (7.0)   9.8 (6.0)      22,405    7.4 (5.0)   9.2 (6.0)
HC, ORF pval<.05            25,326    12.1 (8.0)   9.4 (6.0)      22,024    8.1 (6.0)   8.9 (6.0)
w/ mRNA                     24,496    12.0 (8.0)   9.2 (6.0)      22,179    7.8 (5.0)   8.5 (6.0)
w/ splice signals           35,606    8.8 (4.0)    7.8 (5.0)      28,166    6.4 (4.0)   7.5 (4.0)
w/ 2+ splice signals        19,123    13.1 (10.)   11.6 (9.0)     16,767    8.1 (6.0)   10.7 (8.0)
w/ polyA signals            46,072    9.0 (4.0)    4.6 (1.0)      26,028    6.0 (3.0)   5.6 (2.0)
w/ 2+ EST libraries         36,881    9.5 (5.0)    7.2 (4.0)      33,088    7.1 (5.0)   6.4 (3.0)
w/ 5’-3’ EST pairs          16,803    15.8 (12.)   10.8 (8.0)     21,227    8.1 (6.0)   7.8 (4.0)
w/ hm reciprocal hit        22,459    12.6 (9.0)   8.8 (6.0)      22,459    6.9 (4.0)   8.0 (5.0)
confidence score >= 10      16,775    15.7 (12.)   12.5 (10.)     17,232    9.5 (7.0)   10.8 (8.0)
confidence score 5-10       20,959    4.5 (3.0)    3.5 (3.0)      21,197    3.5 (2.0)   2.8 (2.0)
confidence score < 5        259,975   1.3 (1.0)    1.2 (1.0)      118,926   1.5 (1.0)   1.2 (1.0)

This table summarizes statistics for human and mouse gDGs. These include the counts, the average (* median in parentheses) number of DTs per DG, and the average (* median in parentheses) number of exons per DG. Various attributes of gDGs (as detailed in the text), including an overall confidence score, are used to partition gDGs into subsets.

(1) spliced: gDGs with an intron of at least 15bp

(2) high confidence (see text)

Table 4 – Cross-validation of similarity-based and genome-based DoTS Genes

                      human                            mouse
                      gDG       sDG (*)                gDG       sDG (*)
share DTs             297,709   351,355 (105,205)      157,355   235,287 (80,085)
assigned (unique)     256,198   351,355 (105,205)      137,448   235,287 (80,085)
assigned & 1:1        192,278   192,278 (50,335)       83,440    83,440 (29,142)
not assigned (NA)     41,511    21,892 (14,111)        19,906    10,216 (7,831)
NA & spliced          9,037     n/a                    4,337     n/a

For cross-validation of sDGs and gDGs, which are DGs generated by two independent approaches, we make assignments between them via shared DTs using the rules described in the Materials and Methods section. Since the mapping between sDGs and gDGs is a many-to-many relationship (see text), a unique assignment for a gDG or sDG is not always possible. This table summarizes the statistics of the assignment.

* sDGs constructed using non-singleton DTs

Table 5 – Singleton DoTS Transcripts and single-exon DoTS Genes

                              human                mouse
                              DTs       gDGs       DTs       gDGs
Singleton
• total                       674,403   n/a        423,214   n/a
• predicted protein-coding    200,659   n/a        108,715   n/a
• w/ alignment(s) 1           368,252   239,226    216,574   119,637
• in spliced gDGs             184,647   50,050     113,521   36,936
• in multi-DT gDGs            209,460   63,788     156,608   53,360
• in multi-DT & spliced       174,446   38,638     107,377   30,021
• lone: in gDGs by itself     162,467   175,436    61,973    66,277
• lone & spliced              10,638    11,408     6,485     6,915
single-exon
• total                       257,698   240,104    153,145   112,897
• w/ reciprocal best hit      15,203    6,414      13,220    6,431

This table summarizes statistics about singletons and single-exon (i.e. unspliced) gDGs.

1 with acceptable genomic alignments for gDG construction

Acknowledgements

This work is supported by NIH grants HG-01539 (C.S.) and HD028410 (M.B.). We thank the many individuals and groups who have contributed to dbEST and GenBank, and the teams that have generated human and mouse assembled genomes. We also thank members of the CBIL group, particularly J. Schug, for their helpful suggestions and discussions.

References

1. Li, L., et al., Gene discovery in the apicomplexa as revealed by EST sequencing and assembly of a comparative gene database. Genome Res, 2003. 13(3): p. 443-54.

2. AllGenes: a web site providing access to an integrated database of known and predicted human and mouse genes. (version 6.0, 2003). [http://www.allgenes.org], 2003.

3. Pennisi, E., HUMAN GENOME: Reaching Their Goal Early, Sequencing Labs Celebrate. Science, 2003. 300(5618): p. 409-.

4. Lander, E.S., et al., Initial sequencing and analysis of the human genome. Nature, 2001. 409(6822): p. 860-921.

5. Venter, J.C., et al., The sequence of the human genome. Science, 2001. 291(5507): p. 1304-51.

6. Waterston, R.H., et al., Initial sequencing and comparative analysis of the mouse genome. Nature, 2002. 420(6915): p. 520-62.

7. Boguski, M.S., T.M. Lowe, and C.M. Tolstoshev, dbEST--database for "expressed sequence tags". Nat Genet, 1993. 4(4): p. 332-3.

8. Baxevanis, A.D., The Molecular Biology Database Collection: 2003 update. Nucleic Acids Res, 2003. 31(1): p. 1-12.

9. ENSEMBL Genome Browser. [http://www.ensembl.org].

10. Clamp, M., et al., Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res, 2003. 31(1): p. 38-42.

11. UCSC Genome Browser. [http://genome.ucsc.edu].

12. GeneCards. [http://bioinformatics.weizmann.ac.il/cards/].

13. Mouse Genome Informatics. [http://www.informatics.jax.org/].

14. Wheeler, D.L., et al., Database resources of the National Center for Biotechnology. Nucleic Acids Res, 2003. 31(1): p. 28-33.

15. Quackenbush, J., et al., The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res, 2001. 29(1): p. 159-164.

16. Mammalian Gene Collection Program Team*, et al., Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. PNAS, 2002. 99(26): p. 16899-16903.

17. Pruitt, K.D. and D.R. Maglott, RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res, 2001. 29(1): p. 137-140.

18. Christoffels, A., et al., STACK: Sequence Tag Alignment and Consensus Knowledgebase. Nucleic Acids Res, 2001. 29(1): p. 234-8.

19. Geier, B., et al., The HIB database of annotated UniGene clusters. Bioinformatics, 2001. 17(6): p. 571-2.

20. Zhu, Y., et al., Integrating computationally assembled mouse transcript sequences with the Mouse Genome Informatics (MGI) database. Genome Biol, 2003. 4(2): p. R16.

21. Roest Crollius, H., et al., Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat Genet, 2000. 25(2): p. 235-8.

22. Ewing, B. and P. Green, Analysis of expressed sequence tags indicates 35,000 human genes. Nat Genet, 2000. 25(2): p. 232-4.

23. Das, M., et al., Assessment of the total number of human transcription units. Genomics, 2001. 77(1-2): p. 71-8.

24. Liang, F., et al., Gene index analysis of the human genome estimates approximately 120,000 genes. Nat Genet, 2000. 25(2): p. 239-40.

25. Hogenesch, J.B., et al., A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes. Cell, 2001. 106(4): p. 413-5.

26. Burge, C. and S. Karlin, Prediction of complete gene structures in human genomic DNA. J Mol Biol, 1997. 268(1): p. 78-94.

27. Korf, I., et al., Integrating genomic homology into gene structure prediction. Bioinformatics, 2001. 17 Suppl 1: p. S140-8.

28. Kent, W.J., BLAT -- the BLAST-like alignment tool. Genome Res, 2002. 12(4): p. 656-64.

29. Yelin, R., et al., Widespread occurrence of antisense transcription in the human genome. Nat Biotechnol, 2003. 21(4): p. 379-86.

30. Adams, M.D., et al., Complementary DNA sequencing: expressed sequence tags and human genome project. Science, 1991. 252(5013): p. 1651-6.

31. Wolfsberg, T.G. and D. Landsman, A comparison of expressed sequence tags (ESTs) to human genomic sequences. Nucleic Acids Res, 1997. 25(8): p. 1626-32.

32. Dunham, I., et al., The DNA sequence of human chromosome 22. Nature, 1999. 402(6761): p. 489-95.

33. Hattori, M., et al., The DNA sequence of human chromosome 21. Nature, 2000. 405(6784): p. 311-9.

34. Kapranov, P., et al., Large-scale transcriptional activity in chromosomes 21 and 22. Science, 2002. 296(5569): p. 916-9.

35. Xuan, Z., J. Wang, and M.Q. Zhang, Computational comparison of two mouse draft genomes and the human golden path. Genome Biol, 2003. 4(1): p. R1.

36. Hatzigeorgiou, A.G., P. Fiziev, and M. Reczko, DIANA-EST: a statistical analysis. Bioinformatics, 2001. 17(10): p. 913-9.

37. Frame Finder. [http://www.hgmp.mrc.ac.uk/~gslater/estateman/framefinder.html].

38. Pennisi, E., HUMAN GENOME: A Low Number Wins the GeneSweep Pool. Science, 2003. 300(5625): p. 1484b-.

39. Harrison, P.M. and M. Gerstein, Studying genomes through the aeons: protein families, pseudogenes and proteome evolution. J Mol Biol, 2002. 318(5): p. 1155-74.

40. Rinn, J.L., et al., The transcriptional activity of human Chromosome 22. Genes Dev, 2003. 17(4): p. 529-540.

41. Kawai, J., et al., Functional annotation of a full-length mouse cDNA collection. Nature, 2001. 409(6821): p. 685-90.

42. Numata, K., et al., Identification of putative noncoding RNAs among the RIKEN mouse full-length cDNA collection. Genome Res, 2003. 13(6B): p. 1301-6.

43. Bono, H., et al., Systematic expression profiling of the mouse transcriptome using RIKEN cDNA microarrays. Genome Res, 2003. 13(6B): p. 1318-23.

44. Carninci, P., et al., Targeting a complex transcriptome: the construction of the mouse full-length cDNA encyclopedia. Genome Res, 2003. 13(6B): p. 1273-89.

45. Peters LL, J.K., Lu FM, Eicher EM, Higgins A, Yialamas M, Turtzo LC, Otsuka AJ, Lux SE., Ank3 (epithelial ankyrin), a widely distributed new member of the ankyrin gene family and the major ankyrin in kidney, is expressed in alternatively spliced forms, including forms that lack the repeat domain. J Cell Biol, 1995. 130(2): p. 313-30.

46. Peters, B., H.W. Kaiser, and T.M. Magin, Skin-Specific Expression of ank-393, a Novel Ankyrin-3 Splice Variant. J Invest Dermatol, 2001. 116(2): p. 216-223.

47. Editorial. Biological Psychiatry, 1997. 41(7): p. 759-761.

48. Schimenti, J.C., et al., Interdigitated Deletion Complexes on Mouse Chromosome 5 Induced by Irradiation of Embryonic Stem Cells. Genome Res, 2000. 10(7): p. 1043-1050.

49. Crabtree, J., et al., High-resolution BAC-based map of the central portion of mouse chromosome 5. Genome Res, 2001. 11(10): p. 1746-57.

50. The GUS Platform for functional genomics. [http://www.gusdb.org].

51. Davidson SB, C.J., Brunk B, Schug J, Tannen V, Overton GC, Stoeckert CJ Jr., Data integration and warehousing in genomics: Two case studies. IBM Systems Journal, 2001. 40: p. 512-531.

52. Green, P., phrap. [http://www.phrap.org/].

Additional files

Additional file 1 – human high confidence DoTS Genes for download

File: humDoTSGene_dots6hg15.gff, format: GFF, description: spliced (or mRNA-containing) and cross-validated DoTS Genes, URL: http://www.cbil.upenn.edu/downloads/DoTS/release_6/genomeAlignments/hum_dots6.0_hg15_1/humDoTSGene_dots6hg15.gff

Additional file 2 – mouse high confidence DoTS Genes for download

File: musDoTSGene.dots6mm3.gff, format: GFF, description: spliced (or mRNA-containing) and cross-validated DoTS Genes, URL: http://www.cbil.upenn.edu/downloads/DoTS/release_6/genomeAlignments/mus_dots6.0_mm3_1/musDoTSGene.dots6mm3.gff
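
For readers who want to inspect the GFF downloads above programmatically, the following is a minimal sketch of a reader for the general tab-delimited GFF layout (seqname, source, feature, start, end, score, strand, frame, and an optional attributes/group column). The example lines are hypothetical and are not taken from the DoTS files; the actual source, feature, and attribute values used in those files should be checked against the downloads themselves.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: str


def parse_gff(lines: Iterable[str]) -> Iterator[GffRecord]:
    """Yield one record per feature line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a well-formed feature line
        attributes = cols[8] if len(cols) > 8 else ""
        yield GffRecord(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                        cols[5], cols[6], cols[7], attributes)


# Hypothetical example lines (assumed values, for illustration only).
example = [
    "##gff-version 2",
    "chr22\tDoTS\texon\t1000\t1200\t.\t+\t.\tDG.example1",
    "chr22\tDoTS\texon\t1500\t1700\t.\t+\t.\tDG.example1",
    "chr5\tDoTS\texon\t300\t450\t.\t-\t.\tDG.example2",
]

records = list(parse_gff(example))
features_per_seq = Counter(r.seqname for r in records)
total_bp = sum(r.end - r.start + 1 for r in records)  # GFF coordinates are 1-based, inclusive
print(features_per_seq, total_bp)
```

To read a downloaded file instead of the inline example, the same function can be given an open file handle, since it only iterates over lines.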

Additional file 3 – sequence of human protein-coding DoTS Genes for download

File: humDoTSGene_dots6hg15.coding.fa.gz, format: zipped FASTA, description: sequence of human DoTS Genes that are predicted protein-coding genes, URL: http://www.cbil.upenn.edu/downloads/DoTS/release_6/genomeAlignments/hum_dots6.0_hg15_1/humDoTSGene_dots6hg15.coding.fa.gz

Additional file 4 – sequence of mouse protein-coding DoTS Genes for download

File: musDoTSGene_dots6mm3.coding.fa.gz, format: zipped FASTA, description: sequence of mouse DoTS Genes that are predicted protein-coding genes, URL: http://www.cbil.upenn.edu/downloads/DoTS/release_6/genomeAlignments/mus_dots6.0_mm3_1/musDoTSGene_dots6mm3.coding.fa.gz

Additional file 5 – sequence of human non-coding DoTS Genes for download

File: humDoTSGene_dots6hg15.noncoding.fa.gz, format: zipped FASTA, description: sequence of human DoTS Genes that are predicted non-coding genes, URL: http://www.cbil.upenn.edu/downloads/DoTS/release_6/genomeAlignments/hum_dots6.0_hg15_1/humDoTSGene_dots6hg15.noncoding.fa.gz

Additional file 6 – sequence of mouse non-coding DoTS Genes for download

File: musDoTSGene_dots6mm3.noncoding.fa.gz, format: zipped FASTA, description: sequence of mouse DoTS Genes that are predicted non-coding genes, URL: http://www.cbil.upenn.edu/downloads/DoTS/release_6/genomeAlignments/mus_dots6.0_mm3_1/musDoTSGene_dots6mm3.noncoding.fa.gz

Supplementary data

Suppl. Fig. 1 - Distribution of size of small gDG introns (human chr 22 and mouse chr 5)

The frequencies of the sizes of maximum introns of gDGs on human chromosome 22 (1a) and mouse chromosome 5 (1b) are plotted against intron sizes. Note the sharp turning point around 15 bp in both cases.

Suppl. Fig. 2 - Singletons co-localize with spliced gDGs (density correlation)

The same approach as described for Figure 2 is used to generate this plot. The exon densities of non-singleton DTs, singleton DTs not associated with other DTs (lone singletons), unspliced lone singletons, and all singletons (scaled down by a factor of 2) are plotted along mouse chromosome 5. Similar results were obtained for human chromosome 22 (data not shown).

Supp. Table 1 - Quality distribution of DT alignments specific to a chromosome

       top alignments, all DTs                top alignments, nsDTs only
chr    count    %quality 1   %quality 1-3     count    %quality 1   %quality 1-3
1      70001    44.9         72.1             21878    51.1         77.7
2      56320    46.6         73.3             16973    53.4         79
3      43842    48           74.4             13318    54.5         80
4      31071    48.8         73.9             9041     54.9         79.4
5      38986    46.1         72.8             11775    52.4         78.1
6      38563    47.6         73.7             11485    53.1         78.4
7      44047    45.5         73.2             13713    50.6         77.6
8      28100    47.9         74.6             8533     54           79.5
9      31099    47.4         73.5             9665     52.6         78.4
10     32674    48.3         75.6             9917     55           80.7
11     41830    46.9         73.6             12327    49.8         76.7
12     40344    43           70               12000    49.8         76.3
13     15612    49.8         75.1             4570     54.9         79.3
14     29372    37.9         64               8497     44.5         70.1
15     26330    46.2         72.7             8178     53.7         79.9
16     34287    44.6         72.5             10601    51.2         78.1
17     41260    43.5         71.8             12898    50           76.9
18     12238    51.1         76               3641     57.7         82.2
19     35178    41           69.6             11314    47.7         75.3
20     18933    45           72.4             5851     51.1         77.3
21     8702     49.3         75.4             2630     54.8         79.5
22     17682    43.3         70.7             5498     50.2         76.4
X      20726    43.4         70.7             6381     49.3         75.9
Y      1339     50.1         68.7             345      64.3         78

This table shows the percentages of quality 1 and quality 1-3 alignments for all DTs (left) and non-singleton DTs (right) that align to a specific human chromosome. Assuming good quality of the genomic sequences, a high percentage of quality 1 alignments indicates a high degree of agreement between the DTs and the genomic sequences they align to.