DoTS: Integrated Gene Indices for Human and Mouse Built from Transcribed Sequences
Running title: DoTS gene indices

Y Thomas Gan 1,2, Brian Brunk 1, Jonathan Crabtree 1,2, Deborah Pinney 1,2, Steve Fischer 1,2, Joan Mazzarelli 1,2, Otto Valladares 2, Maja Bucan 2, Christian J. Stoeckert, Jr. 1,2

1 Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104, USA
2 Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA

Y Thomas Gan: 215-746-7013 (tel), 215-573-3111 (fax), yg [email protected] (email)
Brian Brunk: 215-573-3118 (tel), 215-573-3111 (fax), [email protected] (email)
Jonathan Crabtree: 215-573-3115 (tel), 215-573-3111 (fax), [email protected] (email)
Deborah Pinney: 215-573-3116 (tel), 215-573-3111 (fax), [email protected] (email)
Steve Fischer: 215-573-2280 (tel), 215-573-3111 (fax), [email protected] (email)
Joan Mazzarelli: 215-573-4413 (tel), 215-573-3111 (fax), [email protected] (email)
Otto Valladares: 215-898-0021 (tel), 215-573-2041 (fax), [email protected] (email)
Maja Bucan: 215-898-0020 (tel), 215-573-2041 (fax), [email protected] (email)

Corresponding author: Christian J. Stoeckert, Jr.: 215-573-4409 (tel), 215-573-3111 (fax), [email protected] (email)

Genome Biology

Abbreviations used in this paper:
EST: expressed sequence tag
DoTS: database of transcribed sequences
DT: DoTS Transcript
DG: DoTS Gene
sDG: similarity-based DoTS Gene
gDG: genome-based DoTS Gene
TC: tentative consensus
BLAST: basic local alignment search tool
BLAT: BLAST-like alignment tool
UTR: untranslated region
ORF: open reading frame
CDS: (protein) coding sequence

Abstract

Background
Although sequences for large eukaryotic genomes are being completed, it remains a challenge to identify all genes encoded by them and determine or predict their functions. To help address this challenge, we have built a Database of Transcribed Sequences (DoTS).
We cluster and assemble ESTs and mRNAs into DoTS Transcripts (DTs). We further group DTs representing transcripts from the same genes into DoTS Genes (DGs). We describe human and mouse DoTS here, although DoTS is generic and applicable to other species such as Apicomplexa [1].

Results
We have built an integrated transcriptome resource, DoTS, for human and mouse. In DoTS we catalogue, categorize, and annotate known and predicted transcripts and genes. We have identified 48,994 human and 37,984 mouse high confidence DGs, of which 25,326 human and 22,024 mouse DGs are predicted to be protein-coding genes. Using these data, we can predict novel genes, as demonstrated using a 75 Mb proximal region on mouse chromosome 5. We have found that DGs can significantly enrich the models of known genes by predicting extended UTRs, novel exons, and alternative transcription starts. DoTS also enables the study of non-coding genes and singleton transcripts (DTs with only one input EST or mRNA), in addition to other studies such as the investigation of alternative splicing. A powerful query interface for human and mouse DoTS is available at http://www.allgenes.org [2].

Conclusion
DoTS Transcripts and DoTS Genes, which are extensively annotated and significantly curated, present a unique, integrated, non-redundant, and genome-mapped view of the millions of ESTs and mRNAs in the public domain. They are categorized into various subsets such as high confidence genes, protein-coding genes, and non-coding genes. They predict many putative novel genes, enrich gene models of known genes, and enable data mining in novel directions.

Background and significance
In a post-genomic era, identifying all genes and studying their functions and relationships are among the ongoing challenges in the field of functional genomics. Transcribed sequences (mRNAs and ESTs) may be used to build integrated transcriptome data resources to help address such challenges.

Genomic data integration
Much progress has been made recently in sequencing large eukaryotic genomes. We now have an essentially complete sequence for the human genome [3-5] and a draft for mouse [6]. Coincident with the explosion of genomic sequence data is the rapidly growing availability of vast amounts of functional genomics data such as expressed sequence tags (ESTs), proteomes, protein domains, and microarray gene expression data. For example, as of October 2003, there are 5.4 million human and 3.9 million mouse ESTs in the public EST repository dbEST [7]. It is necessary to integrate these diverse types of data to facilitate gene identification and functional annotation.

Transcribed sequences for data integration
Transcribed sequences are a good integration point. First, they are the products of gene transcription, and they are abundant as a result of the large-scale EST sequencing efforts. Therefore, they can be used for gene discovery and analysis of gene structure (e.g. exon-intron structures, alternative splicing) in genomic sequences via alignments. Second, expression information is usually available for ESTs, based on the libraries from which they originate. In addition, ESTs are commonly used to generate features on microarrays. Therefore, transcribed sequences allow easy integration of expression information with genes, providing the basis for expression analyses. Third, transcribed sequences may be translated to allow protein sequence analyses (e.g. domain-based functional annotation, ortholog identification). Fourth, they may be aligned with genomic sequences to identify regulatory regions. Finally, they may originate from genes that do not encode proteins; therefore, they allow the identification of non-coding genes.

Existing transcriptome data resources
Human and mouse genome and transcriptome data are available from several sites [8].
Although there is overlap in the information presented, the sites generally provide unique views or emphases. This is expected, as we are far from a complete understanding of the wealth of information provided by genome sequencing, EST sequencing, and microarray experiments. Groups such as Ensembl [9, 10] or the UCSC Genome Browser team [11] use the genome as their reference point. Another approach is to use shared identifiers (accessions) from different resources to organize and integrate information, as is done by GeneCards [12] and MGI [13], which focus on known genes and emphasize phenotypes. These approaches are complementary, and they provide different views and different interpretations of the data. For example, transcribed sequences that cannot be properly aligned to the genome would fail to be seen as primary entities in genome-based views.

Unigene [14] and the TIGR gene indices [15] represent multiple-species transcriptome data resources organized around transcribed sequences. Other efforts in this class include MGC [16], RefSeq [17], STACK [18], and MIPS [19]. Unigene uses sequence similarity to cluster all ESTs and mRNAs but does not generate consensus sequences. Essentially, the Unigene clusters represent ESTs associated with the same gene. The great strength of Unigene is its currency, but one of its weaknesses is the lack of persistent identifiers. TIGR gene indices provide consensus sequences and persistent identifiers, and they also have data on orthologs for species other than human and mouse, which enables comparative genomics studies using more than two species. TIGR assemblies (TCs) represent transcripts rather than genes; therefore, they are a transcript-centric rather than a gene-centric resource. MGC focuses on full-length cDNAs, and RefSeq emphasizes known and curated genes; therefore, both are limited in scope.

DoTS as a transcriptome resource
DoTS, short for Database of Transcribed Sequences, is a collective name to describe DoTS Transcripts (DTs) and DoTS Genes (DGs). A DT is an assembly of transcribed sequences representing transcripts of the same splice form, and a DG is a group of DTs representing transcripts from the same gene. The goal of DoTS is to generate relationships among genes, RNAs, proteins, and their sequences to assist in discovering new genes, functions, genomic relationships (e.g. clusters by location), and regulation of gene expression. Allgenes.org is the website for public access to DoTS.

As a human and mouse transcriptome resource, DoTS organizes its data around transcribed sequences, as Unigene and TIGR TCs do. DoTS and TIGR TCs provide consensus sequences and persistent identifiers, both of which Unigene lacks. Although DoTS and TIGR TCs are very similar in the degree of annotation performed and, as recently reported, in the assemblies generated [20], the two are not identical because of differences in the details of their clustering and assembly processes. For example, DoTS has more consensus transcripts but a smaller number of sequences per transcript than TIGR TCs. This may be due to less trimming of low quality sequences from the ends, a choice made for DoTS to better preserve representation of differentially processed transcripts. The DoTS transcript indices also differ from TIGR TCs in some of the annotations performed on the consensus sequences (e.g. gene trap associations, signal peptide prediction, transmembrane predictions), significant manual curation by expert annotators (Mazzarelli J. et al., manuscript in preparation), and the availability of a powerful query interface through the Allgenes website [2]. DTs are taken a step further than TCs to generate genes. Therefore, DoTS is also a gene index.
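
The relationship among input sequences, DTs, and DGs described above can be pictured as a simple containment hierarchy. The following Python sketch is an illustration only and is not the DoTS schema: the class names, field names, identifier strings, and accessions are all hypothetical. It assumes only what the text states, namely that a DT assembles one or more input ESTs or mRNAs, that a DG groups DTs from the same gene, and that a singleton is a DT with exactly one input sequence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DoTSTranscript:
    """A DT: an assembly of input ESTs/mRNAs for one splice form (hypothetical model)."""
    dt_id: str                    # hypothetical identifier format
    input_accessions: List[str]   # accessions of the assembled ESTs and mRNAs
    consensus: str = ""           # consensus sequence of the assembly

    @property
    def is_singleton(self) -> bool:
        # A singleton DT has exactly one input EST or mRNA.
        return len(self.input_accessions) == 1


@dataclass
class DoTSGene:
    """A DG: a group of DTs representing transcripts from the same gene (hypothetical model)."""
    dg_id: str                    # hypothetical identifier format
    transcripts: List[DoTSTranscript] = field(default_factory=list)

    def input_count(self) -> int:
        # Total number of input sequences across all DTs in this DG.
        return sum(len(dt.input_accessions) for dt in self.transcripts)


# Made-up example: one DG with a three-sequence DT and a singleton DT.
dt1 = DoTSTranscript("DT.1", ["EST_A", "EST_B", "mRNA_C"], "ACGT")
dt2 = DoTSTranscript("DT.2", ["EST_D"])
gene = DoTSGene("DG.1", [dt1, dt2])
```

In this sketch a transcript-level index corresponds to the list of DoTSTranscript objects alone, whereas the gene index adds the DoTSGene grouping on top of it.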

Gene finding and transcribed sequences
The difficulty in identifying all the genes in a mammalian genome is illustrated by the range of predictions over recent years. Estimates of the total number of human genes range from 28,000-34,000 based on homology [21], through 35,000 based on ESTs [22] and 41,000-45,000 based on validation of computational predictions [23], to 56,960-81,273 based on cDNAs [24]. The initial genome annotations by the public and private human genome projects, using similar approaches, both suggested that there are about 30,000 human protein-coding genes [4, 5], but the actual genes predicted differed significantly [25].