Comparison of the Current Refseq, Ensembl and EST Databases for Counting Genes and Gene Discovery
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Elsevier - Publisher Connector FEBS Letters 579 (2005) 690–698 FEBS 29200 Comparison of the current RefSeq, Ensembl and EST databases for counting genes and gene discovery Thomas P. Larsson, Christian G. Murray, Tobias Hill, Robert Fredriksson, Helgi B. Schio¨th* Department of Neuroscience, Uppsala University, BMC Box 593, 751 24 Uppsala, Sweden Received 15 November 2004; revised 13 December 2004; accepted 13 December 2004 Available online 29 December 2004 Edited by Gianni Cesareni mated that if clustered at high stringency, the resulting EST- Abstract Large amounts of refined sequence material in the form of predicted, curated and annotated genes and expressed se- clusters would represent about 50% of all genes, which from quences tags (ESTs) have recently been added to the NCBI dat- their sim35000 clusters would correspond to about 60000– abases. We matched the transcript-sequences of RefSeq, 70000 genes. Davidson and Burke [4] used a similar method Ensembl and dbEST in an attempt to provide an updated over- based on clustering human EST sequences into transcription view of how many unique human genes can be found. The results units to reach a number of about 70000 genes, with between indicate that there are about 25000 unique genes in the union of 1.2 and 1.5 different transcripts per gene. Both of these meth- RefSeq and Ensembl with 12–18% and 8–13% of the genes in ods suffered from the fact that normalization of cDNA li- each set unique to the other set, respectively. About 20% of all braries may enrich for contaminants and aberrant clones. genes had splice variants. There are a considerable number of The problem of genomic contamination has been estimated ESTs (2200000) that do not match the identified genes and we to affect 5–8% of all ESTs [9,10]. Moreover, the former method used an in-house pipeline to identify 22 novel genes from Gen- scan predictions that have considerable EST coverage. The study included singleton clusters, which may also cause overestima- provides an insight into the current status of human gene cata- tion of the total number of genes. logues and shows that considerable refinement of methods and Another high number was reached by extrapolating from datasets is needed to come to a conclusive gene count. correlation between human genes and genomic CpG islands. Ó 2004 Federation of European Biochemical Societies. Published It was estimated that half of the human genes were accompa- by Elsevier B.V. All rights reserved. nied by a neighboring region of above average CpG dinucleo- tide content. The number of CpG islands was calculated to Keywords: Eexpressed sequences tag; RefSeq; Ensembl; around 40000, which led to an estimate of 80000 human genes Databases; Genscan [5]. This early estimate assumed that each CpG-island corre- sponds to a unique gene, while a later analysis of chromosome 22 showed that this is only true for about 60% of the islands. The public sequencing project reported the existence of only about 29000 islands in non-repeatmasked areas in the analysis 1. Introduction of the finished genomic sequence. With these data, the estimate of this method would be lowered to about 35000 human genes, One of the most intriguing questions in human biology is the a number not far from the currently popular quote derived number and identity of the human genes. Three years after the from both sequencing projects [1,2]. release of the genomic sequence [1,2], there are still large uncer- Predictions made by ab initio programs such as Genscan [11] tainties about the exact number of human genes. The number have proven to be rich source of novel genes. In 2001, the false of known and hypothetical genes in current databases such as positive rate for Genscan was estimated on chromosome 22 RefSeq and Ensembl is continuously growing but is still less using microarray analysis. This study concluded a false posi- than the predicted number of genes [1–8], indicating that these tive fraction of 17% compared to the original estimate by the datasets are incomplete. Moreover, there are significant differ- chromosome 22 sequencing group of 27.5% [12,13]. Using this ences between these datasets. Identification and annotation of assumption and predicting that the rate is constant over human genes is likely to continue to be an important issue the other chromosomes, this would mean that 7368 within the field of genetic research in several years to come. (0.17 · 43000) genes from the whole Genscan set of 43000 Historically, the highest estimates for the human gene count genes are false predictions. The remaining set of about 35600 have been based on clustering of expressed sequences tags human predictions is significantly larger than any known gene (ESTs) and many uncertainties with such methods have been set (RefSeq, Ensembl, SWISS-PROT, TrEMBL), of which the pointed out. Fields and colleagues used a method based on Ensembl set is the largest, currently containing 29802 human similarity between known cDNA sequences and ESTs to calcu- transcripts. Evidence of actual transcription of predicted genes late an approximate number of human genes [3]. They esti- can be obtained by identifying mRNA and EST sequences matching the predictions. The growth of dbEST from 2 million sequences in 2001 to over 5 million today means that the like- *Corresponding author. Fax: +46 18 51 15 40. lihood of finding transcript evidence for novel predictions is E-mail addresses: [email protected] (T.P. Larsson), better than ever. Since 2002, the University of California Santa [email protected] (H.B. Schio¨th). Cruz (UCSC) has provided data on genomic alignments for all 0014-5793/$30.00 Ó 2004 Federation of European Biochemical Societies. Published by Elsevier B.V. All rights reserved. doi:10.1016/j.febslet.2004.12.046 T.P. Larsson et al. / FEBS Letters 579 (2005) 690–698 691 human ESTs, providing means to more specifically match novel genes from the Genscan set and we present 22 novel ESTs with gene predictions based on genomic location rather genes with extensive EST support identified by filtering Gen- than sequence similarity. Cross matching ab initio predictions scan predictions through a special in-house pipeline. such as Genscan predictions with ESTs remains thus as an important method to verify gene predictions and identify new genes. 2. Materials and methods Creation of a complete set of known and hypothetical genes 2.1. Datasources requires merging of existing datasets into a non-redundant set 2.1.1. NCBI. The 2004-04-05 version of the RefSeq nucleotide of genes. Such merging is often based on a sequence similarity transcript-sequences of Homo sapiens was downloaded from NCBI threshold [14,15] that is prone to misjudgements, since it is dif- (ftp://ftp.ncbi.nih.gov/RefSeq/) as files in FASTA format. ficult to find a threshold that always gives the correct result. Time of download was used as an identifier due to the continuous Another method of merging sets of transcripts is to align all se- update of the RefSeq dataset. For simplicity, instead of sorting Ref- Seq transcripts according to all six official categories: Model, Inferred, quences to the genome and determine which sequences from Predicted, Provisional, Validated and Reviewed, transcripts with each set that are not overlapped by any sequence form the name prefixes N (NM,NR) and X (XM/XR) were defined as reviewed other set. However, for transcript-sequences that are not de- and predicted, respectively. At the time of download, this file con- fined directly from the genomic sequence, it can be a problem tained 27602 sequences. The human ESTs were downloaded as a to find good quality genomic alignments. For example, the FASTA file from NCBI (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/ est_human.gz). At the time of download, the file contained 5484645 UCSC table refGene, which contains positions for RefSeq ESTs. transcripts as acquired by BLAT, only includes about 22000 2.1.2. Ensembl. The 2004-04-05 Ensembl human transcript set of the 27000 transcripts in the set. The inability to find good based on the NCBI genome assembly 34 was downloaded in cDNA quality alignments can be problematic when using this method. FASTA format from (ftp://ftp.ensembl.org/pub). The Ensembl tran- Using genomic sequence to define transcript-sequences, as is scripts fall into three categories: known, novel and pseudogene, and in each comparison with RefSeq, we let known correspond to re- done by Ensembl, should alleviate this problem. The increased viewed and novel to predicted. When referred to as Ensembl genes quality of the genome sequences should also make this prob- or Ensembl transcripts, only sequences of statuses known and novel lem smaller. The continuous changes of the databases make are intended unless otherwise specified. At the time of download, this it a difficult and time-consuming issue to practically ‘‘count’’ file contained 31609 sequences divided into 26309, 3493 and 1807 known, novel and pseudogenes, respectively. The ESTgene dataset known unique genes in an unbiased fashion. Since the publica- version 25.34 for the NCBI34 assembly was downloaded using the tion of the first analysis of the human genome, the NCBI Ref- EnsMart retrieval system. At the time of download, this set consisted Seq catalogue of human transcripts [16] has grown from just of 24980 sequences. above 11000 entries [1] to over 27000 [RefSeq release 4], while 2.1.3. UCSC. The human genome assembly based on NCBI 34 the EMBL/Sanger Ensembl database [17] currently contains was downloaded as unmasked FASTA files from (ftp://hgdown- load.cse.ucsc.edu/goldenPath/hg16/).